## Scraping a Table from Wikipedia: Top 100 Largest Companies in Africa

### 1. Project Overview
This project demonstrates how to scrape structured tabular data from Wikipedia using Python. The goal is to extract the ranked list of the largest companies in Africa by revenue and convert the table into a clean, reusable CSV dataset suitable for analysis.

The project focuses on:
- Making compliant HTTP requests to Wikipedia
- Parsing HTML tables with BeautifulSoup
- Structuring scraped data using pandas
- Exporting the final dataset for downstream use

---

### 2. Data Source
- **Website:** Wikipedia
- **Page:** *List of largest companies in Africa by revenue*
- **URL:** https://en.wikipedia.org/wiki/List_of_largest_companies_in_Africa_by_revenue


---

### 3. Tools & Libraries
- **requests** – for sending HTTP requests
- **BeautifulSoup (bs4)** – for parsing and navigating HTML
- **pandas** – for tabular data handling and CSV export


In [1]:
# Importing the necessary tools and libraries

from bs4 import BeautifulSoup
import requests
import pandas as pd

In [None]:
# Identify the client with a descriptive User-Agent, as Wikimedia's User-Agent policy requests.
# Named request_headers so the table headers extracted later do not overwrite it.

request_headers = {
    "User-Agent": "Table Scraping Project/1.0 (Educational Purpose)"
}

# Source page containing the ranked companies table
url = "https://en.wikipedia.org/wiki/List_of_largest_companies_in_Africa_by_revenue"

# Fetch the page HTML, identifying the client via the User-Agent header
page = requests.get(url, headers=request_headers)
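
The request above does not check whether the fetch succeeded. A minimal sketch of a more defensive fetch follows, assuming a timeout of 10 seconds is acceptable. The helper names `build_headers` and `fetch_html` and the contact address are hypothetical and not part of the original notebook.

```python
import requests


def build_headers(project, version, contact):
    """Return request headers with a descriptive User-Agent string."""
    return {"User-Agent": f"{project}/{version} ({contact})"}


def fetch_html(url, headers, timeout=10):
    """Fetch a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text


# Hypothetical contact address; replace it with a real one before use.
request_headers = build_headers("Table Scraping Project", "1.0", "contact@example.org")

# Usage (performs a network request, so it is left commented out here):
# html = fetch_html(
#     "https://en.wikipedia.org/wiki/List_of_largest_companies_in_Africa_by_revenue",
#     request_headers,
# )
```

Without a timeout, `requests.get` can wait indefinitely on a stalled connection, and without `raise_for_status()` an error page would be parsed as if it were the article.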

In [3]:
# Parse raw HTML into a navigable DOM structure
soup = BeautifulSoup(page.text, 'html.parser')
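
A lighter sanity check than printing the whole document is to look at the page title and count the matching tables. The sketch below uses a trimmed stand-in for `page.text` (an assumption, so the example runs offline); only the title text is taken from the real page.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for page.text so the example runs without a network request.
sample_html = """<!DOCTYPE html>
<html><head><meta charset="utf-8"/>
<title>List of largest companies in Africa by revenue - Wikipedia</title>
</head><body>
<table class="wikitable sortable"><tr><th>Rank</th></tr></table>
</body></html>"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# The <title> text confirms that the expected page was fetched.
page_title = sample_soup.title.string

# Counting matches shows how many tables the 'wikitable' selector will return.
table_count = len(sample_soup.select("table.wikitable"))

print(page_title)
print(table_count)
```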

In [4]:
print(soup)

<!DOCTYPE html>

<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of largest companies in Africa by revenue - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited

In [5]:

# Select every table with Wikipedia's standard 'wikitable' class (select() returns a list)
table = soup.select("table.wikitable")
table

[<table class="wikitable sortable">
 <tbody><tr>
 <th>Rank</th>
 <th>Company</th>
 <th>Industry</th>
 <th>Revenue<br/>(US$ billions)</th>
 <th width="150">Headquarters
 </th></tr>
 <tr>
 <td>1</td>
 <td><a href="/wiki/Sonatrach" title="Sonatrach">Sonatrach</a></td>
 <td>Oil and gas</td>
 <td>77.013</td>
 <td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><span><img alt="" class="mw-file-element" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/7/77/Flag_of_Algeria.svg/40px-Flag_of_Algeria.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/7/77/Flag_of_Algeria.svg/60px-Flag_of_Algeria.svg.png 2x" width="23"/></span></span> </span><a href="/wiki/Algeria" title="Algeria">Algeria</a>
 </td></tr>
 <tr>
 <td>2</td>
 <td><a href="/wiki/Eskom" title="Eskom">Eskom</a></td>
 <td>Electric utility</td>
 <td>13.941</td>
 <td><span class="flagicon"><span class="mw-image-border" typeof="mw

In [6]:
# Extract table column names from header cells
headers = soup.find_all("th")
headers


[<th>Rank</th>,
 <th>Company</th>,
 <th>Industry</th>,
 <th>Revenue<br/>(US$ billions)</th>,
 <th width="150">Headquarters
 </th>]

In [7]:
table_headers = [header.text.strip() for header in headers]
table_headers

['Rank', 'Company', 'Industry', 'Revenue(US$ billions)', 'Headquarters']
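
Two details of the header extraction are worth noting: `soup.find_all("th")` searches the whole document rather than one table, and `.text` drops the `<br/>` without leaving a space, which is why the column reads `Revenue(US$ billions)`. A possible alternative, sketched on the header markup shown above, scopes the search to the first `wikitable` and passes a separator to `get_text`. The variable names here are illustrative only.

```python
from bs4 import BeautifulSoup

# Header row copied from the table output above, used as offline sample input.
header_html = """<table class="wikitable sortable">
<tbody><tr>
<th>Rank</th>
<th>Company</th>
<th>Industry</th>
<th>Revenue<br/>(US$ billions)</th>
<th width="150">Headquarters
</th></tr></tbody></table>"""

header_soup = BeautifulSoup(header_html, "html.parser")

# Restrict the search to the first table carrying the 'wikitable' class.
first_table = header_soup.select_one("table.wikitable")

# A separator of " " puts a space where the <br/> splits the header text.
clean_headers = [th.get_text(" ", strip=True) for th in first_table.find_all("th")]
print(clean_headers)
```

Note that this changes the column name to `Revenue (US$ billions)`, so any later code that refers to the column by name would need the same spelling.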

In [8]:
# Prepare an empty DataFrame to enforce column structure
df = pd.DataFrame(columns = table_headers)
df

| | Rank | Company | Industry | Revenue(US$ billions) | Headquarters |
|---|---|---|---|---|---|


In [9]:
# Locate all table rows in the document
rows = soup.find_all("tr")
rows

[<tr><td class="mbox-image"><div class="mbox-image-div"><span typeof="mw:File"><a class="mw-file-description" href="/wiki/File:Question_book-new.svg"><img alt="icon" class="mw-file-element" data-file-height="399" data-file-width="512" decoding="async" height="39" src="//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/60px-Question_book-new.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/9/99/Question_book-new.svg/120px-Question_book-new.svg.png 1.5x" width="50"/></a></span></div></td><td class="mbox-text"><div class="mbox-text-span">This article <b>relies largely or entirely on a <a href="/wiki/Wikipedia:Articles_with_a_single_source" title="Wikipedia:Articles with a single source">single source</a></b>.<span class="hide-when-compact"> Relevant discussion may be found on the <a href="/wiki/Talk:List_of_largest_companies_in_Africa_by_revenue" title="Talk:List of largest companies in Africa by revenue">talk page</a>. Please help <a class="external text" href

In [10]:
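# Skip the first two rows, which precede the table data (the first is the maintenance notice shown above),
# then collect cell values; rows without <td> cells are dropped by the `if cells` check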
table_data = []
for row in rows[2:]:
    cells = row.find_all("td")
    if cells:
        row_data = [cell.text.strip() for cell in cells]
        table_data.append(row_data)
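
The `rows[2:]` slice depends on how many rows happen to precede the table on the page. A sketch of a less position-dependent approach is to collect rows from inside the `wikitable` element only. The sample HTML is a simplified stand-in (flag images removed, and the `ambox` class name on the notice table is an assumption), and `extract_rows` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page: a notice table followed by the data table.
SAMPLE_HTML = """
<table class="ambox"><tr><td class="mbox-text">This article relies on a single source.</td></tr></table>
<table class="wikitable sortable">
<tr><th>Rank</th><th>Company</th><th>Industry</th><th>Revenue<br/>(US$ billions)</th><th>Headquarters</th></tr>
<tr><td>1</td><td><a href="/wiki/Sonatrach">Sonatrach</a></td><td>Oil and gas</td><td>77.013</td><td><a href="/wiki/Algeria">Algeria</a>
</td></tr>
<tr><td>2</td><td><a href="/wiki/Eskom">Eskom</a></td><td>Electric utility</td><td>13.941</td><td><a href="/wiki/South_Africa">South Africa</a></td></tr>
</table>
"""


def extract_rows(html):
    """Return the text of every data row inside the first 'wikitable'."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one("table.wikitable")
    if table is None:
        raise ValueError("no table with class 'wikitable' found")
    data = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if cells:  # the header row has only <th> cells, so it is skipped
            data.append([cell.get_text(" ", strip=True) for cell in cells])
    return data


sample_rows = extract_rows(SAMPLE_HTML)
print(sample_rows)
```

Because the search starts at the table element, the notice row never appears in the result and no slice offset is needed.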


In [11]:
# Assemble structured dataset from extracted rows
df = pd.DataFrame(table_data, columns=table_headers)
df

| | Rank | Company | Industry | Revenue(US$ billions) | Headquarters |
|---|---|---|---|---|---|
| 0 | 1 | Sonatrach | Oil and gas | 77.013 | Algeria |
| 1 | 2 | Eskom | Electric utility | 13.941 | South Africa |
| 2 | 3 | Sasol | Chemistry | 12.989 | South Africa |
| 3 | 4 | MTN Group | Telecommunications | 12.238 | South Africa |
| 4 | 5 | Shoprite Holdings | Retail | 10.802 | South Africa |
| ... | ... | ... | ... | ... | ... |
| 95 | 96 | Blue Label Telecoms | Telecommunications | 1.442 | South Africa |
| 96 | 97 | Kibali Gold Mine | Mining | 1.440 | DR Congo |
| 97 | 98 | Aveng | Conglomerate | 1.425 | South Africa |
| 98 | 99 | Murray and Roberts Holdings | Construction | 1.422 | South Africa |
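
Every value scraped this way is a string, including the rank and revenue. If the dataset is meant for analysis, one option is to convert those columns to numeric types before exporting. The sketch below uses the first three rows from the output above as sample data and is not part of the original notebook.

```python
import pandas as pd

# First three rows of the scraped table, kept as strings as the scraper produces them.
sample = pd.DataFrame(
    [
        ["1", "Sonatrach", "Oil and gas", "77.013", "Algeria"],
        ["2", "Eskom", "Electric utility", "13.941", "South Africa"],
        ["3", "Sasol", "Chemistry", "12.989", "South Africa"],
    ],
    columns=["Rank", "Company", "Industry", "Revenue(US$ billions)", "Headquarters"],
)

# errors="coerce" turns unparseable values into missing values instead of raising.
sample["Rank"] = pd.to_numeric(sample["Rank"], errors="coerce").astype("Int64")
sample["Revenue(US$ billions)"] = pd.to_numeric(
    sample["Revenue(US$ billions)"], errors="coerce"
)

total_revenue = sample["Revenue(US$ billions)"].sum()
print(sample.dtypes)
```

With numeric columns, operations such as sorting by revenue or summing by country behave as expected rather than comparing text.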


In [None]:
# Persist dataset for reuse outside the notebook
df.to_csv("C:/Users/Vanessa/Documents/top_100_african_companies.csv", index=False)
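
The absolute path above only exists on one machine. A portable variant, sketched here under the assumption that a path relative to the working directory is acceptable, builds the path with `pathlib` and reads the file back to confirm the export. `save_dataset` is a hypothetical helper, and the example writes to a temporary directory so it leaves no files behind.

```python
from pathlib import Path
import tempfile

import pandas as pd


def save_dataset(df, out_dir, filename="top_100_african_companies.csv"):
    """Write df as CSV inside out_dir (created if missing) and return the path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / filename
    df.to_csv(out_path, index=False, encoding="utf-8")
    return out_path


# Small stand-in for the scraped DataFrame.
sample = pd.DataFrame({"Rank": [1, 2], "Company": ["Sonatrach", "Eskom"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_dataset(sample, tmp_dir)
    round_trip = pd.read_csv(saved_path)  # read back to verify the export
    round_trip_shape = round_trip.shape
    round_trip_companies = round_trip["Company"].tolist()

print(saved_path.name)
```

In the notebook itself, a call such as `save_dataset(df, "data")` would write the file to a `data` folder next to the notebook.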