A Python web scraping project that extracts tabular data from Wikipedia and converts it into CSV format.
This project demonstrates web scraping techniques using BeautifulSoup to extract the "List of Largest Companies in the United States by Revenue" from Wikipedia and save it as a structured CSV file.
- **Web Scraping**
  - Fetching web pages with the `requests` library
  - Parsing HTML with BeautifulSoup
  - Extracting data from HTML tables
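The table-extraction step can be sketched as follows. This is a minimal example that parses an inline HTML snippet (modeled on Wikipedia's `wikitable` markup) rather than a live page; in the real script the HTML would come from `requests.get(url, headers=...).text`.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the Wikipedia page HTML; a real run would fetch it:
#   html = requests.get(url, headers={"User-Agent": "..."}).text
html = """
<table class="wikitable">
  <tr><th>Rank</th><th>Name</th><th>Revenue</th></tr>
  <tr><td>1</td><td>Walmart</td><td>611,289</td></tr>
  <tr><td>2</td><td>Amazon</td><td>513,983</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable")

# Header cells become column names; each remaining <tr> becomes one data row
headers = [th.text.strip() for th in table.find_all("th")]
rows = [[td.text.strip() for td in tr.find_all("td")]
        for tr in table.find_all("tr")[1:]]
```

The same `find` / `find_all` pattern works on the full Wikipedia page once fetched.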
- **Data Cleaning**
  - Removing whitespace with `.strip()`
  - List comprehensions for efficient data processing
  - Handling nested HTML structures
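The cleaning step is a one-line list comprehension. The sample row below is illustrative of the trailing newlines and padding that table cells typically carry after extraction:

```python
# Raw cell text as extracted from HTML often carries newlines and padding
raw_row = ["1\n", " Walmart ", "Retail\n", "$611,289 ", "6.7%\n"]

# Strip leading/trailing whitespace from every cell in one pass
clean_row = [cell.strip() for cell in raw_row]
```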
- **Data Export**
  - Converting scraped data to a Pandas DataFrame
  - Exporting data to CSV format
  - Extracting and formatting column headers
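The export step can be sketched as below. The headers and rows are placeholder data standing in for the scraped values; `index=False` keeps Pandas' row index out of the CSV:

```python
import pandas as pd

# Placeholder data standing in for the cleaned, scraped values
headers = ["Rank", "Name", "Industry", "Revenue (USD millions)"]
rows = [
    ["1", "Walmart", "Retail", "611,289"],
    ["2", "Amazon", "Retail", "513,983"],
]

df = pd.DataFrame(rows, columns=headers)
df.to_csv("largest_public_csv.csv", index=False)  # index=False drops the row-index column
```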
- Always add a User-Agent header to avoid being blocked
- Check the website's `robots.txt` before scraping
- Clean extracted data (remove newlines and extra spaces)
- Validate data before exporting
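The `robots.txt` check can be automated with the standard library's `urllib.robotparser`. The rules below are a simplified illustration, not Wikipedia's actual policy; a real check would call `rp.set_url("https://en.wikipedia.org/robots.txt")` followed by `rp.read()`:

```python
from urllib.robotparser import RobotFileParser

# Simplified rules for illustration; a real check would load the live file
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /w/",
])

print(rp.can_fetch("MyScraper/1.0", "https://en.wikipedia.org/wiki/Main_Page"))  # True
print(rp.can_fetch("MyScraper/1.0", "https://en.wikipedia.org/w/index.php"))     # False
```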
- Python
- Requests
- BeautifulSoup4
- Pandas
- Sends an HTTP request to the Wikipedia page with proper headers
- Parses the HTML content using BeautifulSoup
- Locates and extracts the table data
- Cleans the extracted text
- Creates a Pandas DataFrame with proper headers
- Exports the result to a CSV file
File: `largest_public_csv.csv`
Contents: the 100 largest US companies, with:
- Company rank
- Company name
- Industry sector
- Revenue (USD millions)
- Revenue growth percentage
- Number of employees
- Headquarters location
Install the dependencies:

```
pip install requests beautifulsoup4 pandas
```

Run the notebook:

```
jupyter notebook webscraping.ipynb
```

The script will automatically scrape the data and create `largest_public_csv.csv` in the project directory.
- Always respect the website's `robots.txt`
- Use an appropriate User-Agent header
- Data accuracy depends on the source website
- For educational purposes only