This project demonstrates a beginner-friendly web scraping pipeline built entirely in Python. The target is a Wikipedia table listing the largest companies in the United States by revenue — a clean, publicly accessible dataset that makes an ideal scraping exercise.
The workflow walks through every stage: sending an HTTP request, parsing raw HTML, locating the right table, extracting column headers and row data, and finally exporting the result as a clean CSV file using Pandas. No database, no API key, no complex setup — just Python doing what it does best.
By the end of this pipeline, raw webpage content becomes a structured, analysis-ready spreadsheet in under 30 lines of code.
```
pip install requests beautifulsoup4 pandas
```

**Pipeline overview**

1. Import Libraries
2. Set Headers (User-Agent)
3. Fetch & Parse HTML
4. Locate Table (find table index)
5. Confirm Table Content
6. Inspect Column Headers
7. Store Column Names
8. Clean & Print Headers
9. Import Pandas
10. Create Empty DataFrame
11. Find All Table Rows
12. Iterate & Fill DataFrame
13. Verify & Export CSV
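Before touching the network, the whole pipeline above can be dry-run against a tiny embedded HTML snippet. This is a sketch using hypothetical sample data, not the real Wikipedia table, but it exercises every stage end to end:

```python
import pandas as pd
from bs4 import BeautifulSoup

# A tiny stand-in for the Wikipedia page (hypothetical data),
# so the pipeline can be tested without a network request.
html = """
<table>
  <tr><th>Rank</th><th>Name</th><th>Revenue</th></tr>
  <tr><td>1</td><td>Acme Corp</td><td>500</td></tr>
  <tr><td>2</td><td>Globex</td><td>300</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find_all('table')[0]

# Headers -> DataFrame columns
titles = [th.text.strip() for th in table.find_all('th')]
df = pd.DataFrame(columns=titles)

# Data rows -> DataFrame rows (skip the header row at [0])
for row in table.find_all('tr')[1:]:
    cells = [td.text.strip() for td in row.find_all('td')]
    df.loc[len(df)] = cells

print(df)
```

Swapping the embedded string for the fetched `page.text` turns this dry run into the real scraper.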
```python
from bs4 import BeautifulSoup
import requests
```

Pull in `requests` to make HTTP calls to the web, and `BeautifulSoup` from the `bs4` library to parse and navigate the raw HTML response. These two are the core tools of any Python scraping project.
```python
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/145.0.0.0 Safari/537.36 Edg/145.0.0.0'
}
```

Websites can block requests that don't look like they come from a real browser. Setting a `User-Agent` header disguises our script as a normal Chrome/Edge browser visit, preventing the server from rejecting our request.
```python
url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_in_the_United_States_by_revenue'
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')
```

Send a GET request to the Wikipedia URL using our spoofed headers. The response (`page.text`) is the raw HTML string of the entire page. We pass it into BeautifulSoup, which turns it into a navigable tree we can search and query.
```python
soup.find_all('table')[1]
```

Wikipedia pages often contain multiple tables (navigation bars, infoboxes, data tables). `find_all('table')` returns a list of every table on the page. Try different indices (`[0]`, `[1]`, ...) to inspect the candidates and confirm which one holds the company data we want; the exact index can shift whenever the page layout is edited.
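As a more robust alternative to a positional index, the table can be matched by its CSS class instead. This sketch assumes Wikipedia's current markup, where data tables carry the `wikitable` class; the HTML snippet here is a hypothetical stand-in:

```python
from bs4 import BeautifulSoup

html = """
<table class="infobox"><tr><th>Sidebar</th></tr></table>
<table class="wikitable sortable"><tr><th>Rank</th><th>Name</th></tr></table>
"""
soup = BeautifulSoup(html, 'html.parser')

# Matching by class survives layout changes that would shift
# a hard-coded index like [1].
table = soup.find('table', class_='wikitable')
headers = [th.text for th in table.find_all('th')]
print(headers)  # ['Rank', 'Name']
```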
```python
table = soup.find_all('table')[0]
print(table)
```

Once the right table is identified, store it in the variable `table` and print it to verify the content looks correct before extracting anything. (Make sure the index here matches the one you confirmed in the previous step; it may differ from what's shown if Wikipedia has reorganized the page.) This is a good sanity-check habit before building the full pipeline.
```python
soup.find_all('th')
```

`<th>` tags in HTML mark table header cells. Running `find_all('th')` lets us see what column names are present in the table, so we know exactly what we're working with before building the DataFrame.
```python
world_titles = table.find_all('th')
```

Scope the header search specifically to our target `table` (rather than the whole page) to avoid picking up unrelated headers from other parts of the Wikipedia page.
```python
world_table_titles = [title.text.strip() for title in world_titles]
print(world_table_titles)
```

Loop through each `<th>` element, extract its inner text with `.text`, and use `.strip()` to remove any leading/trailing whitespace or newline characters. The result is a clean Python list of column-name strings, ready to use as DataFrame headers.
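To see why `.strip()` matters, here is a minimal before/after on a hypothetical header snippet; Wikipedia's header cells typically end with a newline:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<th> Rank\n</th><th>Name </th>', 'html.parser')

raw = [th.text for th in soup.find_all('th')]
clean = [th.text.strip() for th in soup.find_all('th')]

print(raw)    # [' Rank\n', 'Name ']
print(clean)  # ['Rank', 'Name']
```

Without the cleanup, column names like `'Rank\n'` would silently break later lookups such as `df['Rank']`.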
```python
import pandas as pd
```

Bring in Pandas, the go-to Python library for tabular data. We'll use it to create a structured DataFrame from the scraped rows, and later to export the result as a CSV file.
```python
df = pd.DataFrame(columns=world_table_titles)
df
```

Initialize an empty Pandas DataFrame using the column names we extracted in Step 8. This gives us the correct structure: all the right columns with no rows yet. We just need to fill it in.
```python
column_data = table.find_all('tr')
```

`<tr>` tags represent table rows in HTML. `find_all('tr')` returns every row in the table, including the header row at index `[0]`. We'll skip that in the next step since we already captured the column names.
```python
for row in column_data[1:]:
    row_data = row.find_all('td')
    individual_row_data = [data.text.strip() for data in row_data]
    length = len(df)
    df.loc[length] = individual_row_data
```

Loop through every data row (skipping the header at `[0]`). For each row, find all `<td>` (data cell) tags, strip their text, and collect them into a list. Then append that list as a new row in the DataFrame using `df.loc[length]`, where `length` always points to the next available index position.
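One sharp edge worth knowing: assigning a list to `df.loc[length]` raises a `ValueError` if the list's length doesn't match the number of columns, which can happen when a table row has merged or missing cells. A minimal defensive sketch, using hypothetical rows:

```python
import pandas as pd

df = pd.DataFrame(columns=['Rank', 'Name', 'Revenue'])
rows = [['1', 'Acme Corp', '500'],
        ['2', 'Globex'],            # short row, e.g. a merged cell
        ['3', 'Initech', '200']]

for cells in rows:
    # Skip malformed rows instead of letting the assignment raise.
    if len(cells) == len(df.columns):
        df.loc[len(df)] = cells

print(df)
```

For the Wikipedia revenue table every data row is well-formed, so the guard never fires there, but it keeps the loop from crashing on messier pages.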
```python
df
df.to_csv(r'D:\Finished project\python-webscraping-wikipedia\Companies.csv', index=False)
```

Display the filled DataFrame to confirm everything looks right, then export it to a CSV file using `.to_csv()`. Setting `index=False` prevents Pandas from writing the auto-generated row numbers as an extra column in the output file.
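The effect of `index=False` is easy to see by letting `.to_csv()` return a string (which it does when no path is given) and comparing the header lines, here on a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'Rank': [1, 2], 'Name': ['Acme Corp', 'Globex']})

with_index = df.to_csv()              # default: index written as first column
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])     # ,Rank,Name
print(without_index.splitlines()[0])  # Rank,Name
```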
| File | Description |
|---|---|
| `Companies.csv` | Scraped table of the largest US companies by revenue |
| Concept | Tool / Method |
|---|---|
| HTTP request with browser spoofing | `requests.get()` + `User-Agent` header |
| HTML parsing | `BeautifulSoup(page.text, 'html.parser')` |
| Locating elements | `.find_all('table')`, `.find_all('th')`, `.find_all('tr')`, `.find_all('td')` |
| Text cleaning | `.text.strip()` |
| DataFrame creation | `pd.DataFrame(columns=[...])` |
| Row insertion | `df.loc[len(df)] = row` |
| CSV export | `df.to_csv(..., index=False)` |
💬 Always check a website's `robots.txt` and Terms of Service before scraping. Wikipedia's content is freely licensed and the site tolerates polite, low-volume scraping, which makes it a safe target for learning exercises like this one.