Wikipedia Web Scraping Project

A Python web scraping project that extracts tabular data from Wikipedia and converts it into CSV format.

Project Overview

This project demonstrates web scraping techniques using BeautifulSoup to extract the table from the Wikipedia article "List of largest companies in the United States by revenue" and save it as a structured CSV file.

What I Learned

Technical Skills

  • Web Scraping

    • Using requests library to fetch web pages
    • Parsing HTML with BeautifulSoup
    • Extracting data from HTML tables
  • Data Cleaning

    • Removing whitespace with .strip()
    • List comprehensions for efficient data processing
    • Handling nested HTML structures
  • Data Export

    • Converting scraped data to Pandas DataFrame
    • Exporting data to CSV format
    • Column header extraction and formatting
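The extraction and cleaning steps above can be sketched as follows; the HTML fragment is illustrative, standing in for one scraped table row:

```python
from bs4 import BeautifulSoup

# Illustrative HTML fragment standing in for one row of the scraped table
html = "<tr><td> 1 </td><td>Walmart\n</td><td> Retail </td></tr>"
row = BeautifulSoup(html, "html.parser").find("tr")

# A list comprehension plus .strip() removes newlines and surrounding whitespace
cells = [td.text.strip() for td in row.find_all("td")]
print(cells)  # ['1', 'Walmart', 'Retail']
```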

Best Practices

  • Always add a User-Agent header to avoid being blocked
  • Check the website's robots.txt before scraping
  • Clean extracted data (remove newlines, extra spaces)
  • Validate data before exporting
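One way to follow the first two practices is with Python's standard-library robots.txt parser; the rules, URLs, and User-Agent string below are illustrative (in practice the rules are fetched from the site itself):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt rules (normally fetched from https://<site>/robots.txt)
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

user_agent = "MyScraper/1.0"  # hypothetical User-Agent string
allowed = parser.can_fetch(user_agent, "https://example.org/wiki/Page")
blocked = not parser.can_fetch(user_agent, "https://example.org/private/data")
print(allowed, blocked)  # True True

# The same User-Agent string then goes into the request headers:
headers = {"User-Agent": user_agent}
```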

Technologies Used

  • Python
  • Requests
  • BeautifulSoup4
  • Pandas

How It Works

  1. Sends HTTP request to Wikipedia page with proper headers
  2. Parses HTML content using BeautifulSoup
  3. Locates and extracts table data
  4. Cleans the extracted text data
  5. Creates Pandas DataFrame with proper headers
  6. Exports to CSV file
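The six steps above might look like the sketch below. The inline HTML table stands in for the fetched Wikipedia page, and the column names and figures are illustrative; a real run would start with `requests.get(url, headers=...)` in step 1:

```python
import io
from bs4 import BeautifulSoup
import pandas as pd

# Inline HTML standing in for the fetched page (step 1 would fetch it over HTTP)
html = """
<table>
  <tr><th>Rank</th><th>Name</th><th>Revenue (USD millions)</th></tr>
  <tr><td>1</td><td>Walmart\n</td><td> 611,289 </td></tr>
  <tr><td>2</td><td>Amazon</td><td>513,983</td></tr>
</table>
"""

# Step 2: parse the HTML; step 3: locate the table
table = BeautifulSoup(html, "html.parser").find("table")

# Steps 4-5: clean the cell text and build a DataFrame with proper headers
columns = [th.text.strip() for th in table.find_all("th")]
rows = [[td.text.strip() for td in tr.find_all("td")]
        for tr in table.find_all("tr")[1:]]  # skip the header row
df = pd.DataFrame(rows, columns=columns)

# Step 6: export to CSV (written to a buffer here; the project writes a file)
buf = io.StringIO()
df.to_csv(buf, index=False)
print(df)
```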

Output

File: largest_public_csv.csv

Contents: Top 100 largest US companies with:

  • Company rank
  • Company name
  • Industry sector
  • Revenue (USD millions)
  • Revenue growth percentage
  • Number of employees
  • Headquarters location

Installation

pip install requests beautifulsoup4 pandas

Usage

jupyter notebook webscraping.ipynb

Running the notebook will scrape the data and create largest_public_csv.csv in the project directory.

Important Notes

  • Always respect the website's robots.txt
  • Use an appropriate User-Agent header
  • Data accuracy depends on source website
  • For educational purposes only
