<h1 style='color:blue'>Real-World Example of Web Scraping</h1>

Suppose you want to scrape job listings from a site such as Indeed or LinkedIn (assuming its terms of service allow it). The goal is to extract job titles, company names, locations, and posting dates.

Because LinkedIn and Indeed enforce strict anti-scraping measures, this example uses TimesJobs instead, which is more open to scraping.
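
Whether a site permits automated access is usually declared in its robots.txt file. The sketch below uses Python's standard `urllib.robotparser` to check a URL against a set of rules. The rules and the example.com URLs are made up for illustration and are not the actual policy of any site mentioned here; `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only. For a real site, call
# parser.set_url("https://<site>/robots.txt") and parser.read() instead.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/jobs/python"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/data"))
```

A robots.txt check does not replace reading the site's terms of service, but it is a quick first signal before writing any scraping code.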

<h1 style='color:red'>Steps to Scrape Job Listings Using Python</h1>

We'll use:

    requests – to fetch the webpage
    BeautifulSoup – to parse HTML
    pandas – to save results in a structured format

<h1 style='color:red'>Step 1: Install Required Libraries</h1>

pip install requests beautifulsoup4 pandas

<h1 style='color:red'>Step 2: Import Necessary Modules</h1>

In [12]:
from bs4 import BeautifulSoup
import requests
import pandas as pd



<h1 style='color:red'>Step 3: Define the URL to Scrape</h1>

This URL searches for Python jobs on TimesJobs.

In [22]:
url = "https://www.timesjobs.com/candidate/job-search.html?searchType=personalizedSearch&from=submit&txtKeywords=python&txtLocation="

# page = requests.get(url)

# soup = BeautifulSoup(page.text, 'html')

# print(soup)
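
The search URL above is a base path followed by a query string. As a minimal sketch, the same URL can be assembled from its parts with the standard library's `urlencode`, which makes it easier to change the keyword. `BASE_URL` and `build_search_url` are hypothetical names, and it is an assumption that the site treats other keyword or location values the same way it treats `python`.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.timesjobs.com/candidate/job-search.html"


def build_search_url(keywords, location=""):
    """Assemble the search URL from the query parameters seen in the original URL."""
    params = {
        "searchType": "personalizedSearch",
        "from": "submit",
        "txtKeywords": keywords,
        "txtLocation": location,
    }
    # urlencode escapes special characters (spaces become "+", for example).
    return f"{BASE_URL}?{urlencode(params)}"


url = build_search_url("python")
print(url)
```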

<h1 style='color:red'>Step 4: Send a Request to the Website</h1>

In [14]:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
}

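# A timeout keeps the request from hanging indefinitely if the server does not respond
response = requests.get(url, headers=headers, timeout=10)

# Check whether the request succeeded (HTTP 200)
if response.status_code == 200:
    print("Successfully fetched the webpage")
else:
    print(f"Failed to fetch the webpage (status code {response.status_code})")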


Successfully fetched the webpage
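
Printing a message on failure lets the later steps run anyway on a bad response. One alternative, sketched here under the assumption that stopping early is preferable, is to wrap the request in a small function that raises on HTTP errors. `fetch_html` is a hypothetical helper, not part of requests; `raise_for_status()` and the `timeout` argument are standard requests features.

```python
import requests


def fetch_html(url, headers=None, timeout=10, session=None):
    """Return the page body as text, raising on network or HTTP errors.

    `session` is optional: pass a requests.Session to reuse connections;
    otherwise the module-level requests.get is used.
    """
    http = session if session is not None else requests
    response = http.get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.text
```

With this helper, the parsing step could start from `fetch_html(url, headers=headers)` instead of `response.text`.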


<h1 style='color:red'>Step 5: Parse the HTML Content</h1>

In [28]:
soup = BeautifulSoup(response.text, "html.parser")

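# Find job listings: each one is an <li> element with this exact class string
jobs = soup.find_all("li", class_="clearfix job-bx wht-shd-bx")

# Sanity check: the page has no <table> elements, so the data must come from the <li> items
print(soup.find_all('table'))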

[]


<h1 style='color:red'>Step 6: Extract Job Information</h1>

In [17]:
job_list = []

for job in jobs:
    title_tag = job.find("h2")
    title = title_tag.text.strip() if title_tag else "N/A"

    company_tag = job.find("h3", class_="joblist-comp-name")
    company = company_tag.text.strip() if company_tag else "N/A"

    # Handle missing location data (in the run shown here this selector
    # matched nothing, so every Location came back as "N/A")
    location_tag = job.find("ul", class_="top-jd-dtl clearfix")
    location = location_tag.li.text.strip() if location_tag and location_tag.li else "N/A"

    # Handle missing posted date
    posted_tag = job.find("span", class_="sim-posted")
    posted = posted_tag.text.strip() if posted_tag else "N/A"

    job_list.append({
        "Title": title,
        "Company": company,
        "Location": location,
        "Posted": posted
    })

# Display extracted data
print(job_list[:5])  # Print first 5 jobs


[{'Title': 'Python Developer , Python , Python Full stack', 'Company': 'Peopleplease Consulting', 'Location': 'N/A', 'Posted': 'a month ago'}, {'Title': 'Python Developer', 'Company': 'LAKSH HUMAN RESOURCE', 'Location': 'N/A', 'Posted': 'few days ago'}, {'Title': 'Python Developer', 'Company': 'Analytics Vidhya', 'Location': 'N/A', 'Posted': '6 days ago'}, {'Title': 'Python Developer', 'Company': 'zenga tv', 'Location': 'N/A', 'Posted': 'few days ago'}, {'Title': 'Python Developer', 'Company': 'SEVEN CONSULTANCY', 'Location': 'N/A', 'Posted': 'few days ago'}]
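
The extraction loop repeats the same "find the tag, else fall back to N/A" pattern for each field. A minimal sketch of that pattern as a reusable helper is shown below. The HTML is an invented sample that only borrows the class names used above; it is not copied from TimesJobs, and `text_or_default` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Invented sample markup: the second listing is missing company and date.
SAMPLE_HTML = """
<ul>
  <li class="clearfix job-bx wht-shd-bx">
    <h2>Python Developer</h2>
    <h3 class="joblist-comp-name">Example Corp</h3>
    <span class="sim-posted">few days ago</span>
  </li>
  <li class="clearfix job-bx wht-shd-bx">
    <h2>Data Engineer</h2>
  </li>
</ul>
"""


def text_or_default(parent, name, class_=None, default="N/A"):
    """Return the stripped text of the first matching tag, or default if absent."""
    # Only pass class_ when one is given; class_=None would change the match.
    tag = parent.find(name, class_=class_) if class_ else parent.find(name)
    return tag.get_text(strip=True) if tag else default


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = []
for item in sample_soup.find_all("li", class_="clearfix job-bx wht-shd-bx"):
    rows.append({
        "Title": text_or_default(item, "h2"),
        "Company": text_or_default(item, "h3", "joblist-comp-name"),
        "Posted": text_or_default(item, "span", "sim-posted"),
    })

print(rows)
```

Each lookup runs `find` once instead of twice, and a missing tag yields the default rather than an `AttributeError`.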


<h1 style='color:red'>Step 7: Save Data to CSV</h1>

In [18]:
df = pd.DataFrame(job_list)
df.to_csv("python_jobs.csv", index=False)
print("Data saved to python_jobs.csv")


Data saved to python_jobs.csv


<h1 style='color:red'>Step 8: Read Data from CSV</h1>

In [20]:
df = pd.read_csv("python_jobs.csv")
df.head()  # Display the first 5 rows


                                           Title                  Company  Location        Posted
0  Python Developer , Python , Python Full stack  Peopleplease Consulting       NaN   a month ago
1                               Python Developer     LAKSH HUMAN RESOURCE       NaN  few days ago
2                               Python Developer         Analytics Vidhya       NaN    6 days ago
3                               Python Developer                 zenga tv       NaN  few days ago
4                               Python Developer        SEVEN CONSULTANCY       NaN  few days ago


In [21]:
len(df)

25
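
After the round trip through CSV, the Location column comes back as missing values even though the string "N/A" was written. This is because `pd.read_csv` treats "N/A" as a missing-value marker by default. The sketch below reproduces the effect in memory with a made-up row and shows that `keep_default_na=False` preserves the literal string.

```python
import io

import pandas as pd

# Made-up row for illustration, using the same column names as job_list.
rows = [{
    "Title": "Python Developer",
    "Company": "Example Corp",
    "Location": "N/A",
    "Posted": "few days ago",
}]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
pd.DataFrame(rows).to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Default behaviour: "N/A" is parsed as a missing value (NaN).
default_df = pd.read_csv(io.StringIO(csv_text))
# With keep_default_na=False, the string is kept as written.
literal_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_df["Location"].isna().tolist())
print(literal_df["Location"].tolist())
```

Which behaviour is preferable depends on the analysis: NaN works well with pandas' missing-data tools, while the literal string keeps the CSV and the DataFrame identical.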