izzdataflow/python-webscraping-wikipedia

🕸️ Python Web Scraping — Wikipedia Company Data

Preface

This project demonstrates a beginner-friendly web scraping pipeline built entirely in Python. The target is a Wikipedia table listing the largest companies in the United States by revenue — a clean, publicly accessible dataset that makes an ideal scraping exercise.

The workflow walks through every stage: sending an HTTP request, parsing raw HTML, locating the right table, extracting column headers and row data, and finally exporting the result as a clean CSV file using Pandas. No database, no API key, no complex setup — just Python doing what it does best.

By the end of this pipeline, raw webpage content becomes a structured, analysis-ready spreadsheet in under 30 lines of code.


🔧 Requirements

pip install requests beautifulsoup4 pandas

🔁 Process Flow

Step 1            Step 2            Step 3            Step 4
Import         →  Set Headers    →  Fetch & Parse  →  Locate Table
Libraries         (User-Agent)      HTML               (find table index)

Step 5            Step 6            Step 7            Step 8
Confirm        →  Inspect        →  Store Column   →  Clean &
Table Content     Column Headers    Names              Print Headers

Step 9            Step 10           Step 11           Step 12
Import         →  Create Empty   →  Find All       →  Iterate &
Pandas            DataFrame         Table Rows         Fill DataFrame

Step 13
Verify &
Export CSV

📋 Step-by-Step Breakdown


Step 1 — Import Libraries

from bs4 import BeautifulSoup
import requests

Pull in requests to make HTTP calls to the web, and BeautifulSoup from the bs4 library to parse and navigate the raw HTML response. These two are the core tools of any Python scraping project.


Step 2 — Set a User-Agent Header

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/145.0.0.0 Safari/537.36 Edg/145.0.0.0'
}

Some websites reject requests whose User-Agent looks automated (requests identifies itself as python-requests/x.y by default). Setting a browser-style User-Agent header makes the script's request look like a normal Chrome/Edge visit, so the server is less likely to refuse it.


Step 3 — Fetch the Page & Parse HTML

url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_in_the_United_States_by_revenue'
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')

Send a GET request to the Wikipedia URL using our spoofed headers. The response (page.text) is the raw HTML string of the entire page. We pass it into BeautifulSoup, which turns it into a navigable tree we can search and query.
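To see what "a navigable tree" means without hitting the live page, here is a minimal offline sketch — the HTML string below is a made-up stand-in for page.text:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for page.text, so the parsing step can be run offline.
sample_html = """
<html><body>
  <h1>List of largest companies</h1>
  <table><tr><th>Rank</th><th>Name</th></tr></table>
</body></html>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# Once parsed, the tree can be searched by tag name.
print(soup.find('h1').text)          # List of largest companies
print(len(soup.find_all('table')))   # 1
```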


Step 4 — Locate the Target Table

soup.find_all('table')[1]

Wikipedia pages often contain multiple tables (navigation bars, infoboxes, data tables). find_all('table') returns a list of every table on the page. We use the index [1] to inspect and confirm which table holds the company data we want.
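A quick way to pick the right index is to print an identifying detail of each table and eyeball the output. This sketch uses a hypothetical two-table page that mimics a Wikipedia article with a sidebar table before the data table:

```python
from bs4 import BeautifulSoup

# Hypothetical page: a sidebar table comes before the data table we want.
sample_html = """
<table class="sidebar"><tr><th>Quick facts</th></tr></table>
<table class="wikitable">
  <tr><th>Rank</th><th>Name</th><th>Revenue</th></tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# Print the first header cell of each table to decide which index to use.
for i, t in enumerate(soup.find_all('table')):
    print(i, t.find('th').text.strip())

# Here the company data sits at index 1.
target = soup.find_all('table')[1]
```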


Step 5 — Confirm Table Contents

table = soup.find_all('table')[1]
print(table)

Once the right table is identified, store it in the variable table and print it to verify the content looks correct before extracting anything. This is a good sanity-check habit before building the full pipeline.


Step 6 — Inspect Column Headers

soup.find_all('th')

<th> tags in HTML mark table header cells. Running find_all('th') on the full soup lists every header cell on the page — a quick preview of the available column names before we scope the search to our target table in the next step.


Step 7 — Store Column Names in a Variable

world_titles = table.find_all('th')

Scope the header search specifically to our target table (rather than the whole page) to avoid picking up unrelated headers from other parts of the Wikipedia page.


Step 8 — Clean & Print Column Names

world_table_titles = [title.text.strip() for title in world_titles]
print(world_table_titles)

Loop through each <th> element, extract its inner text with .text, and use .strip() to remove any leading/trailing whitespace or newline characters. The result is a clean Python list of column name strings, ready to use as DataFrame headers.
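Real header cells often carry stray whitespace and newlines, which is why .strip() matters. A small self-contained example (the table below is invented for illustration):

```python
from bs4 import BeautifulSoup

# Header cells with the kind of stray whitespace real pages produce.
sample_table = BeautifulSoup(
    "<table><tr><th> Rank\n</th><th> Name </th><th>Revenue (USD)\n</th></tr></table>",
    'html.parser'
).find('table')

# Extract the inner text of each <th> and strip the whitespace away.
titles = [th.text.strip() for th in sample_table.find_all('th')]
print(titles)   # ['Rank', 'Name', 'Revenue (USD)']
```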


Step 9 — Import Pandas

import pandas as pd

Bring in Pandas — the go-to Python library for tabular data. We'll use it to create a structured DataFrame from the scraped rows, and later to export the result as a CSV file.


Step 10 — Create an Empty DataFrame

df = pd.DataFrame(columns=world_table_titles)
df

Initialize an empty Pandas DataFrame using the column names we extracted in Step 8. This gives us the correct structure — all the right columns with no rows yet. We just need to fill it in.
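A sketch of what this produces, using placeholder column names in place of world_table_titles:

```python
import pandas as pd

# Stand-in for the scraped header list from Step 8.
columns = ['Rank', 'Name', 'Revenue (USD)']
df = pd.DataFrame(columns=columns)

# Correct structure: all the right columns, zero rows.
print(df.shape)          # (0, 3)
print(list(df.columns))  # ['Rank', 'Name', 'Revenue (USD)']
```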


Step 11 — Find All Table Rows

column_data = table.find_all('tr')

<tr> tags represent table rows in HTML. find_all('tr') returns every row in the table, including the header row at index [0]. We'll skip that in the next step since we already captured the column names.


Step 12 — Iterate & Populate the DataFrame

for row in column_data[1:]:
    row_data = row.find_all('td')
    individual_row_data = [data.text.strip() for data in row_data]

    length = len(df)
    df.loc[length] = individual_row_data

Loop through every data row (skipping the header at [0]). For each row, find all <td> (data cell) tags, strip their text, and collect them into a list. Then append that list as a new row in the DataFrame using df.loc[length] — where length always points to the next available index position.
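The whole loop can be run end to end on a miniature version of the table — the HTML and company values below are made up for the sketch:

```python
from bs4 import BeautifulSoup
import pandas as pd

# A miniature stand-in for the Wikipedia table.
sample_html = """
<table>
  <tr><th>Rank</th><th>Name</th></tr>
  <tr><td>1</td><td>Walmart</td></tr>
  <tr><td>2</td><td>Amazon</td></tr>
</table>
"""
table = BeautifulSoup(sample_html, 'html.parser').find('table')

titles = [th.text.strip() for th in table.find_all('th')]
df = pd.DataFrame(columns=titles)

for row in table.find_all('tr')[1:]:          # skip the header row
    cells = [td.text.strip() for td in row.find_all('td')]
    df.loc[len(df)] = cells                   # append at the next free index

print(df)
```

Note that df.loc[len(df)] = cells requires each row to yield exactly as many cells as there are columns; a row with a different count raises a ValueError.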


Step 13 — Verify & Export to CSV

df
df.to_csv(r'D:\Finished project\python-webscraping-wikipedia\Companies.csv', index=False)

Display the filled DataFrame to confirm everything looks right, then export it to a CSV file using .to_csv(). Setting index=False prevents Pandas from writing the auto-generated row numbers as an extra column in the output file.
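A portable variant of the export, writing to the system temp directory instead of a hard-coded Windows path and re-reading the file to confirm the round trip (the data is invented for the sketch):

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'Rank': ['1', '2'], 'Name': ['Walmart', 'Amazon']})

# Write to a temp location that exists on any machine.
out_path = os.path.join(tempfile.gettempdir(), 'Companies.csv')
df.to_csv(out_path, index=False)

# Re-read the CSV to confirm nothing was lost in the round trip.
check = pd.read_csv(out_path)
print(check.shape)   # (2, 2)
```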


📂 Output

| File | Description |
| --- | --- |
| Companies.csv | Scraped table of the largest US companies by revenue |

💡 Key Concepts

| Concept | Tool / Method |
| --- | --- |
| HTTP request with browser spoofing | requests.get() + User-Agent header |
| HTML parsing | BeautifulSoup(page.text, 'html.parser') |
| Locating elements | .find_all('table'), .find_all('th'), .find_all('tr'), .find_all('td') |
| Text cleaning | .text.strip() |
| DataFrame creation | pd.DataFrame(columns=[...]) |
| Row insertion | df.loc[len(df)] = row |
| CSV export | df.to_csv(..., index=False) |

💬 Always check a website's robots.txt and Terms of Service before scraping. Wikipedia's article content is freely licensed (CC BY-SA) and its robots.txt allows crawling of article pages, which makes it a safe practice target.
