# Project: GDP Data extraction and processing

## Project Scenario:


You have been hired as a junior Data Engineer by an international firm that is looking to expand its business into countries across the world. Your task is to write a script that extracts the list of the top 10 largest economies of the world, in descending order of GDP, as reported by the International Monetary Fund (IMF). GDP values must be expressed in billion USD, rounded to 2 decimal places.

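The required data is available at the URL below:

URL: https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29

Note that the code in Exercise 1 requests the live Wikipedia page rather than this archived snapshot, so the figures in the outputs reflect the page as it was when the notebook was run.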

## Setup

In [1]:
# Install required packages (lxml is the HTML parser used by pandas.read_html;
# requests is used in Exercise 1 to fetch the page)
!pip install pandas numpy lxml requests






In [2]:
# Import required libraries
import warnings

import numpy as np
import pandas as pd

# Optional: suppress warnings generated by the code below
def warn(*args, **kwargs):
    pass

warnings.warn = warn
warnings.filterwarnings('ignore')

## Exercises

### Exercise 1

Extract the required GDP data from the given URL using Web Scraping.

In [3]:
URL = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"

In [4]:
# After inspecting the page, we find that the desired table is the third one (index 2).
# pd.read_html() parses every HTML table on the page into a DataFrame.
# Calling pd.read_html(URL) directly fails with a 403 Forbidden error, so we fetch
# the page with the requests library first, sending a browser-like User-Agent header.

from io import StringIO

import requests

headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(URL, headers=headers)   # HTTP GET request
response.raise_for_status()                     # stop early on an HTTP error
# Wrap the HTML in StringIO: recent pandas versions deprecate passing literal HTML strings
df = pd.read_html(StringIO(response.text))[2]   # get the third table
df.head()

| | Country/Territory | IMF (2025)[1][6] | World Bank (2022–24)[7] | United Nations (2023)[8] |
|---|---|---|---|---|
| 0 | World | 117165394 | 111326370 | 100834796 |
| 1 | United States | 30615743 | 29184890 | 27720700 |
| 2 | China[n 1] | 19398577 | 18743803 | 17794782 |
| 3 | Germany | 5013574 | 4659929 | 4525704 |
| 4 | Japan | 4279828 | 4026211 | 4204495 |
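
Selecting the table by position (`[2]`) depends on the page layout staying the same. As a hedged alternative, the sketch below picks the table by looking for a piece of text in its column labels. The helper name `pick_table` and the sample frames are hypothetical; the frames stand in for the list that `pd.read_html()` returns.

```python
import pandas as pd


def pick_table(tables, required_text):
    """Return the first DataFrame whose column labels contain required_text."""
    for table in tables:
        # Convert labels to strings so tuple labels from multi-level headers also work
        labels = [str(col) for col in table.columns]
        if any(required_text in label for label in labels):
            return table
    raise ValueError(f"No table with a column containing {required_text!r}")


# Hypothetical stand-ins for the list of tables returned by pd.read_html()
tables = [
    pd.DataFrame({"Note": ["placeholder"]}),
    pd.DataFrame({"Country/Territory": ["United States"], "IMF (2025)[1][6]": [30615743]}),
]

gdp_table = pick_table(tables, "IMF")
print(gdp_table.columns.tolist())
```

This trades a positional assumption for a naming assumption: it still breaks if the column is renamed, but it fails loudly with a `ValueError` instead of silently returning the wrong table.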


In [5]:
# these next lines are requirements from the exercise instructions

# replace column headers with column numbers
df.columns = range(df.shape[1])

# retain columns with name of the country and value of GDP by IMF
df = df[[0, 1]]

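# retain the top 10 economies of the world (countries with highest GDP);
# row 0 is the 'World' aggregate, so it is skipped.
# copy() makes df an independent DataFrame, avoiding SettingWithCopyWarning in later assignments
df = df.iloc[1:11, :].copy()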

# assign column names as 'Country' and 'GDP (Million USD)'
df.columns = ['Country', 'GDP (Million USD)']

df.head()

| | Country | GDP (Million USD) |
|---|---|---|
| 1 | United States | 30615743 |
| 2 | China[n 1] | 19398577 |
| 3 | Germany | 5013574 |
| 4 | Japan | 4279828 |
| 5 | India | 4125213 |


This output exposes a limitation of read_html from Pandas: it scrapes cell text verbatim, so the China entry (row 2) still carries the footnote link marker [n 1] from the Wikipedia page. Such markers have to be removed in a separate data-cleaning step.
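
One possible cleaning step, shown here as a minimal sketch rather than part of the exercise requirements, is to strip any bracketed marker with a regular expression. The `sample` DataFrame is a hypothetical stand-in built from rows in the output above.

```python
import pandas as pd

# Sample rows copied from the output above
sample = pd.DataFrame({
    "Country": ["United States", "China[n 1]", "Germany"],
    "GDP (Million USD)": [30615743, 19398577, 5013574],
})

# Remove any bracketed marker such as "[n 1]" or "[6]", then trim whitespace
sample["Country"] = (
    sample["Country"]
    .str.replace(r"\[[^\]]*\]", "", regex=True)
    .str.strip()
)
print(sample["Country"].tolist())
```

The pattern assumes that square brackets never appear in a legitimate country name, which holds for the rows shown here.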

### Exercise 2

Convert the GDP column of the DataFrame from million USD to billion USD. Use the round() function of the NumPy library to round the values to 2 decimal places, and rename the column header to GDP (Billion USD).

In [6]:
# requirements:

# Change the data type of the 'GDP (Million USD)' column to integer. Use astype() method.
df['GDP (Million USD)'] = df['GDP (Million USD)'].astype(int)

# Convert the GDP value in Million USD to Billion USD
df['GDP (Million USD)'] = df['GDP (Million USD)'] / 1000   # short-scale (US) billion = 1,000 million

# Use numpy.round() method to round the value to 2 decimal places.
df['GDP (Million USD)'] = np.round(df['GDP (Million USD)'], 2)

# Rename the column header from 'GDP (Million USD)' to 'GDP (Billion USD)'
df = df.rename(columns={'GDP (Million USD)': 'GDP (Billion USD)'})

df.head()

| | Country | GDP (Billion USD) |
|---|---|---|
| 1 | United States | 30615.74 |
| 2 | China[n 1] | 19398.58 |
| 3 | Germany | 5013.57 |
| 4 | Japan | 4279.83 |
| 5 | India | 4125.21 |


### Exercise 3

Save the DataFrame to a CSV file named "Largest_economies.csv".

In [8]:
df.to_csv('./Largest_economies.csv')
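
By default, `to_csv` also writes the DataFrame index as an unnamed first column, which is why a column such as `Unnamed: 0` can appear when the file is read back. Whether the index should be kept is not stated in the exercise, so the following is only a sketch of the alternative: passing `index=False` and reading the file back to check it. The `df_demo` frame and the temporary directory are assumptions made for the illustration.

```python
import os
import tempfile

import pandas as pd

df_demo = pd.DataFrame(
    {"Country": ["United States", "China"], "GDP (Billion USD)": [30615.74, 19398.58]},
    index=[1, 2],  # mimic the index left over after iloc[1:11]
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "Largest_economies.csv")
    df_demo.to_csv(path, index=False)  # omit the index column from the file
    restored = pd.read_csv(path)       # read it back to verify the layout

print(restored.columns.tolist())
```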