---
title: "Rotating Proxies"
description: "webscraping with python  using rotating proxies"
author: "Aakash Basnet"
date: "2024/02/03"
categories:
  - webscraping
  - code
  - ETL
  - python
format:
  html:
    code-fold: true
jupyter: python3
---

## Rotating Proxies
The proxies needs to be rotated to not be detected by anti scrapping tools used by the servers. For this we will scrape the list of free available ip address and test them using multithreading. This will filter the working proxies. Later on, we will use working proxies to make the request

### Extracting Proxies

In [3]:
import pandas as pd
import requests


def extract_proxies():
    print("Extracting proxies...")
    proxy_url  = "https://www.us-proxy.org/"
    r = requests.get(proxy_url)
    dfs  = pd.read_html(r.text)
    df = dfs[0]
    print(df.shape)
    return df
proxies_df = extract_proxies()
proxies_df.head(20)
   


Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd


Extracting proxies...
(200, 8)


  dfs  = pd.read_html(r.text)


Unnamed: 0,IP Address,Port,Code,Country,Anonymity,Google,Https,Last Checked
0,35.209.198.222,80,US,United States,elite proxy,no,no,1 min ago
1,209.97.150.167,3128,US,United States,anonymous,no,no,1 min ago
2,198.199.86.11,8080,US,United States,anonymous,no,no,1 min ago
3,162.223.94.164,80,US,United States,anonymous,no,no,1 min ago
4,34.23.45.223,80,US,United States,elite proxy,yes,no,1 min ago
5,64.225.4.81,10002,US,United States,anonymous,yes,yes,1 min ago
6,12.186.205.121,80,US,United States,anonymous,yes,no,1 min ago
7,50.172.75.126,80,US,United States,anonymous,no,no,1 min ago
8,50.170.90.31,80,US,United States,anonymous,no,no,1 min ago
9,50.172.75.121,80,US,United States,anonymous,no,no,1 min ago


### Testing Proxies

In [9]:
def test_proxy(proxy):
    url= "https://www.google.com"
    print(f"testing {proxy}")
    try:
        r = requests.get(url, proxies={"http": proxy , "https": proxy}, timeout=5)
        print(r.status_code)
        if r.status_code == 200: 
            return proxy  
    except Exception as e:
        print(e)

In [10]:
import concurrent.futures

proxies = proxies_df["IP Address"].to_list()

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
   test_results = executor.map(test_proxy, proxies)
        

testing 35.209.198.222
testing 209.97.150.167
testing 198.199.86.11
testing 162.223.94.164
testing 34.23.45.223
testing 64.225.4.81
testing 12.186.205.121
testing 50.172.75.126
testing 50.170.90.31
testing 50.172.75.121
('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
testing 50.223.38.6
HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: / (Caused by ProxyError('Unable to connect to proxy', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x10f5ea440>: Failed to establish a new connection: [Errno 61] Connection refused')))
testing 50.222.245.43
HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: / (Caused by ProxyError('Unable to connect to proxy', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x10f5e9540>: Failed to establish a new connection: [Errno 61] Connection refused')))
testing 32.223.6.94
HTTPSConnectionPool(host='www.google.com', 

In [11]:
print(list(test_results))

[None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, Non