---
title: "Webscraping Indeed Job Portal"
description: "webscraping with python and using proxies"
author: "Aakash Basnet"
date: "2024/02/03"
categories:
  - webscraping
  - code
  - ETL
  - python
format:
  html:
    code-fold: true
jupyter: python3
---

## Building URL
After inspecting Indeed's job listing pages with the browser's developer tools, I found that the job title and the location appear as the `q` and `l` query parameters of the search URL. We can use this pattern to build the URL ourselves. The link printed by the code below opens the Indeed listings for Python developers in Dallas, TX.

In [60]:
def url_builder(job_title, location, page_number=10):
    # page_number is currently unused
    # Words in the query values are separated with "+"
    job_title = "+".join(job_title.split(" "))
    location = "+".join(location.split(" "))
    base_url = "https://www.indeed.com/jobs"
    query_str = f"?q={job_title}&l={location}"
    url = f"{base_url}{query_str}"

    return url

print(url_builder(job_title="python developer", location="Dallas, TX"))

https://www.indeed.com/jobs?q=python+developer&l=Dallas,+TX
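
Joining words with `+` works for simple inputs, but characters such as the comma in `Dallas, TX` are left unescaped. A minimal sketch of the same idea with the standard library's `urllib.parse.urlencode`, which escapes the values for us, could look like this. The function name `url_builder_encoded` is hypothetical, and the sketch assumes Indeed treats the percent-encoded query the same way as the hand-built one.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"


def url_builder_encoded(job_title, location):
    # urlencode turns spaces into "+" and percent-encodes characters such as ","
    query_str = urlencode({"q": job_title, "l": location})
    return f"{BASE_URL}?{query_str}"


print(url_builder_encoded("python developer", "Dallas, TX"))
```

With this version the comma is sent as `%2C`, so job titles or locations containing `&`, `#` or other reserved characters should not break the query string.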


## Rotating Proxies
Proxies need to be rotated so that the anti-scraping tools used by servers are less likely to detect the scraper. To do this, we scrape a list of freely available proxy IP addresses and test them in parallel using multithreading, which filters out the proxies that do not work. Later on, we will use the working proxies to make the requests.

In [70]:
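from io import StringIO

import pandas as pd
import requests


def extract_proxies():
    print("Extracting proxies...")
    proxy_url = "https://free-proxy-list.net/"
    r = requests.get(proxy_url)
    # Wrap the HTML in StringIO; passing a raw HTML string to read_html is deprecated in recent pandas
    dfs = pd.read_html(StringIO(r.text))
    # The first table on the page holds the proxy list
    df = dfs[0]
    print(df.shape)
    return df


def test_proxy(proxy):
    url = "https://www.google.com"
    print(f"testing {proxy}")
    try:
        # Route both HTTP and HTTPS traffic through the proxy; give up after 5 seconds
        r = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=5)
    except Exception as e:
        print(e)
        return None

    print(r.status_code)
    # Return the proxy only if the request succeeded
    if r.status_code == 200:
        return proxy
    return None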


In [71]:
proxies_df = extract_proxies()
proxies_df.head(20)

Extracting proxies...
(300, 8)


|   | IP Address | Port | Code | Country | Anonymity | Google | Https | Last Checked |
|---|---|---|---|---|---|---|---|---|
| 0 | 140.238.18.180 | 21000 | KR | South Korea | elite proxy |  | yes | 0 secs ago |
| 1 | 114.129.2.82 | 8081 | JP | Japan | elite proxy | no | yes | 0 secs ago |
| 2 | 113.161.131.43 | 80 | VN | Vietnam | anonymous | no | no | 8 secs ago |
| 3 | 116.203.28.43 | 80 | DE | Germany | anonymous | yes | no | 8 secs ago |
| 4 | 209.121.164.50 | 31147 | CA | Canada | anonymous |  | no | 8 secs ago |
| 5 | 198.44.255.3 | 80 | HK | Hong Kong | anonymous | no | no | 8 secs ago |
| 6 | 139.162.78.109 | 3128 | JP | Japan | anonymous | no | no | 8 secs ago |
| 7 | 89.36.114.38 | 80 | GB | United Kingdom | anonymous | yes | no | 8 secs ago |
| 8 | 41.207.187.178 | 80 | TG | Togo | anonymous | no | no | 8 secs ago |
| 9 | 198.176.56.43 | 80 | US | United States | anonymous | yes | no | 8 secs ago |
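
The table lists the IP address and the port in separate columns, while a proxy URL for `requests` usually needs both. A small sketch of how the rows could be turned into proxy URLs follows. The names `build_proxy_urls` and `https_only` are hypothetical, and the sketch assumes that every entry is a plain HTTP proxy (hence the `http://` scheme) and that `yes` in the `Https` column means the proxy can tunnel HTTPS traffic.

```python
def build_proxy_urls(rows, https_only=True):
    """Turn table rows (dicts keyed by column name) into proxy URLs."""
    urls = []
    for row in rows:
        # Optionally skip proxies that are not marked as HTTPS-capable
        if https_only and str(row.get("Https", "")).lower() != "yes":
            continue
        urls.append(f"http://{row['IP Address']}:{int(row['Port'])}")
    return urls


# First three rows of the table above, written out by hand
sample_rows = [
    {"IP Address": "140.238.18.180", "Port": 21000, "Https": "yes"},
    {"IP Address": "114.129.2.82", "Port": 8081, "Https": "yes"},
    {"IP Address": "113.161.131.43", "Port": 80, "Https": "no"},
]
print(build_proxy_urls(sample_rows))
```

With the real data, the rows could come from `proxies_df.to_dict("records")`. Filtering on `Https` is a design choice here, since both the test URL and Indeed are served over HTTPS.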


In [66]:
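import concurrent.futures

proxies = proxies_df["IP Address"].to_list()

# Test the proxies in parallel, 10 at a time
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    test_results = executor.map(test_proxy, proxies)

# test_proxy returns None for failures, so keep only the proxies that responded
working_proxies = [proxy for proxy in test_results if proxy is not None]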
        

testing 140.238.18.180
testing 114.129.2.82
testing 113.161.131.43
testing 116.203.28.43
testing 209.121.164.50
testing 198.44.255.3
testing 139.162.78.109
testing 89.36.114.38
testing 41.207.187.178
testing 198.176.56.43
HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: / (Caused by ProxyError('Unable to connect to proxy', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x133820ee0>: Failed to establish a new connection: [Errno 61] Connection refused')))
testing 154.118.228.212
HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: / (Caused by ProxyError('Unable to connect to proxy', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x133821ff0>: Failed to establish a new connection: [Errno 61] Connection refused')))
testing 195.181.172.230
HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: / (Caused by ProxyError('Unable to connect to proxy', 
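
Once some proxies pass the test, the rotation itself can be sketched as cycling through them and moving on to the next one whenever a request fails. This is a minimal sketch under stated assumptions, not the notebook's own code: `fetch_with_rotation`, `requests_fetch` and `max_attempts` are hypothetical names, and the fetch function is passed in as an argument so the rotation logic can be tried without network access.

```python
from itertools import cycle


def fetch_with_rotation(url, proxies, fetch, max_attempts=5):
    """Try proxies in round-robin order until fetch succeeds or attempts run out."""
    if not proxies:
        return None, None
    pool = cycle(proxies)
    for _ in range(max_attempts):
        proxy = next(pool)
        try:
            return proxy, fetch(url, proxy)
        except Exception as exc:
            # A failed proxy is skipped and the next one in the cycle is tried
            print(f"{proxy} failed: {exc}")
    return None, None


def requests_fetch(url, proxy, timeout=5):
    # Imported lazily so the rotation logic above does not depend on requests
    import requests

    r = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=timeout)
    r.raise_for_status()
    return r.text
```

A call such as `fetch_with_rotation(url, working_proxies, requests_fetch)` would then return the proxy that worked together with the page text, or `(None, None)` if every attempt failed.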

In [None]:
from selenium import webdriver

url = url_builder(job_title='python developer', location='Fort Worth,TX')

# Open the search results page in a Chrome browser controlled by Selenium
driver = webdriver.Chrome()
print(url)
driver.get(url)
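
The browser above connects directly, without a proxy. One way to send Selenium's traffic through a working proxy is Chrome's `--proxy-server` command-line switch. The sketch below assumes a proxy URL in `scheme://ip:port` form, and the helper names `chrome_proxy_arguments` and `make_proxied_driver` are hypothetical.

```python
def chrome_proxy_arguments(proxy):
    """Return the Chrome command-line arguments that route traffic through proxy."""
    return [f"--proxy-server={proxy}"]


def make_proxied_driver(proxy):
    # Imported lazily so the helper above can be used without Selenium installed
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in chrome_proxy_arguments(proxy):
        options.add_argument(argument)
    return webdriver.Chrome(options=options)
```

Calling `make_proxied_driver("http://140.238.18.180:21000").get(url)` would then load the page through that proxy, provided the proxy is still alive.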
