A Python web scraping project that extracts tabular data from Wikipedia and converts it into CSV format.
This project demonstrates web scraping techniques using BeautifulSoup to extract the "List of Largest Companies in the United States by Revenue" from Wikipedia and save it as a structured CSV file.
- **Web Scraping**
  - Fetching web pages with the `requests` library
  - Parsing HTML with BeautifulSoup
  - Extracting data from HTML tables
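The table-extraction step can be sketched as follows. This is a minimal example that parses an inline HTML snippet (modeled on Wikipedia's `wikitable` markup) rather than a live page; in the real script the HTML would come from `requests.get(url, headers=...).text`.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the Wikipedia page HTML; a real run would fetch it:
#   html = requests.get(url, headers={"User-Agent": "..."}).text
html = """
<table class="wikitable">
  <tr><th>Rank</th><th>Name</th><th>Revenue</th></tr>
  <tr><td>1</td><td>Walmart</td><td>611,289</td></tr>
  <tr><td>2</td><td>Amazon</td><td>513,983</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable")

# Header cells become column names; each remaining <tr> becomes one data row
headers = [th.text.strip() for th in table.find_all("th")]
rows = [[td.text.strip() for td in tr.find_all("td")]
        for tr in table.find_all("tr")[1:]]
```

The same `find` / `find_all` pattern works on the full Wikipedia page once fetched.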
- **Data Cleaning**
  - Removing whitespace with `.strip()`
  - List comprehensions for efficient data processing
  - Handling nested HTML structures
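The cleaning step is a one-line list comprehension. The sample row below is illustrative of the trailing newlines and padding that table cells typically carry after extraction:

```python
# Raw cell text as extracted from HTML often carries newlines and padding
raw_row = ["1\n", " Walmart ", "Retail\n", "$611,289 ", "6.7%\n"]

# Strip leading/trailing whitespace from every cell in one pass
clean_row = [cell.strip() for cell in raw_row]
```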
- **Data Export**
  - Converting scraped data to a Pandas DataFrame
  - Exporting data to CSV format
  - Extracting and formatting column headers
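The export step can be sketched as below. The headers and rows are placeholder data standing in for the scraped values; `index=False` keeps Pandas' row index out of the CSV:

```python
import pandas as pd

# Placeholder data standing in for the cleaned, scraped values
headers = ["Rank", "Name", "Industry", "Revenue (USD millions)"]
rows = [
    ["1", "Walmart", "Retail", "611,289"],
    ["2", "Amazon", "Retail", "513,983"],
]

df = pd.DataFrame(rows, columns=headers)
df.to_csv("largest_public_csv.csv", index=False)  # index=False drops the row-index column
```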
- Always add a User-Agent header to avoid being blocked
- Check the website's `robots.txt` before scraping
- Clean extracted data (remove newlines and extra spaces)
- Validate data before exporting
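The `robots.txt` check can be automated with the standard library's `urllib.robotparser`. The rules below are a simplified illustration, not Wikipedia's actual policy; a real check would call `rp.set_url("https://en.wikipedia.org/robots.txt")` followed by `rp.read()`:

```python
from urllib.robotparser import RobotFileParser

# Simplified rules for illustration; a real check would load the live file
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /w/",
])

print(rp.can_fetch("MyScraper/1.0", "https://en.wikipedia.org/wiki/Main_Page"))  # True
print(rp.can_fetch("MyScraper/1.0", "https://en.wikipedia.org/w/index.php"))     # False
```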
- Python
- Requests
- BeautifulSoup4
- Pandas
- Sends an HTTP request to the Wikipedia page with proper headers
- Parses the HTML content using BeautifulSoup
- Locates and extracts the table data
- Cleans the extracted text
- Creates a Pandas DataFrame with proper headers
- Exports the result to a CSV file
File: `largest_public_csv.csv`
Contents: the 100 largest US companies, with:
- Company rank
- Company name
- Industry sector
- Revenue (USD millions)
- Revenue growth percentage
- Number of employees
- Headquarters location
Install the dependencies:

```
pip install requests beautifulsoup4 pandas
```

Run the notebook:

```
jupyter notebook webscraping.ipynb
```

The script will automatically scrape the data and create `largest_public_csv.csv` in the project directory.
- Always respect the website's `robots.txt`
- Use an appropriate User-Agent header
- Data accuracy depends on the source website
- For educational purposes only