# A Small Web Scraping Project

In [None]:
## Made by Pablo Herrador 10/09/2025 (d/m/y)

## Project Scenario:
An international firm that is looking to expand its business in different countries across the world has recruited you. You have been hired as a junior Data Engineer and are tasked with creating a script that can extract the list of the top 10 largest economies of the world in descending order of their GDPs in Billion USD (rounded to 2 decimal places), as logged by the International Monetary Fund (IMF).

The required data appears to be available at the URL below:

URL: https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29

In [10]:
import pandas as pd
import requests
from io import StringIO  # Needed for wrapping the HTML

### Pandas

In [24]:
url = "https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29"
response = requests.get(url)

# Get the tables from the URL using pandas
if response.status_code == 200:
    # Wrap the HTML text in StringIO to avoid a warning
    html_data = StringIO(response.text)
    tables = pd.read_html(html_data)
    print(f"There are {len(tables)} tables.\n")
    # Now that we know how many tables there are, inspect each one and pick the GDP table
    gdp_table = None
    for i, table in enumerate(tables):
        print(f"Table {i}")
        print(table.head(3))  # show the first 3 rows
        print("-" * 40)

        # To pick the correct table:
        for c in table.columns:
            cols = [str(c).lower()]
            if 'country' in cols and any('gdp' in c for c in cols):
                gdp_table = table
    if gdp_table is not None:
        print("\n Selected GDP table:")
        print(gdp_table.head())
        gdp_table.to_csv("gdp_table.csv", index=False)
        print("GDP table saved as gdp_table.csv")
    else:
        print("Could not automatically find a GDP table")
else:
    print("Something went wrong :( :", response.status_code)
    

There are 7 tables.

Table 0
                                                   0
0  Largest economies in the world by GDP (nominal...
----------------------------------------
Table 1
                                                   0  \
0  > $20 trillion $10–20 trillion $5–10 trillion ...   

                                                   1  \
0  $750 billion – $1 trillion $500–750 billion $2...   

                                                   2  
0  $50–100 billion $25–50 billion $5–25 billion <...  
----------------------------------------
Table 2
  Country/Territory UN region IMF[1][13]            World Bank[14]             \
  Country/Territory UN region   Estimate       Year       Estimate       Year   
0             World         —  105568776       2023      100562011       2022   
1     United States  Americas   26854599       2023       25462700       2022   
2             China      Asia   19373586  [n 1]2023       17963171  [n 3]2022   

  United Nations[15]     

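The automatic search did not find the table, but the printed output shows that table 2 is the one we are interested in. The check looks for a column label that is exactly `country`, and none of the headers in the output match that.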
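
One way to make the automatic selection more forgiving is to flatten each column label into a single lowercase string and search it for keywords. The sketch below is an assumption, not the notebook's method: the helper names `flatten_label` and `looks_like_gdp_table` are hypothetical, and the keyword `imf` is included only because the printed headers of table 2 contain `IMF` but not the word `GDP`.

```python
def flatten_label(label):
    """Turn a column label (str, int, or a tuple from a two-level header) into one lowercase string."""
    parts = label if isinstance(label, tuple) else (label,)
    return " ".join(str(part) for part in parts).lower()


def looks_like_gdp_table(columns, keywords=("gdp", "imf")):
    """Return True if some label mentions 'country' and some label mentions one of the keywords."""
    labels = [flatten_label(col) for col in columns]
    has_country = any("country" in text for text in labels)
    has_keyword = any(key in text for text in labels for key in keywords)
    return has_country and has_keyword


# Labels shaped like the two-level header printed for table 2
sample_columns = [
    ("Country/Territory", "Country/Territory"),
    ("UN region", "UN region"),
    ("IMF[1][13]", "Estimate"),
]
print(looks_like_gdp_table(sample_columns))  # True
print(looks_like_gdp_table([0, 1, 2]))       # False

# Possible use with the `tables` list returned by pd.read_html (not executed here):
# gdp_table = next((t for t in tables if looks_like_gdp_table(t.columns)), None)
```

Because the test is a substring search, it could also match other tables on a page that mention both words, so printing the selected table is still a sensible check.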

In [25]:
html_data = StringIO(response.text)
tables = pd.read_html(html_data)
gdp_table = tables[2]  # the table identified in the output above
print(gdp_table)
gdp_table.to_csv("gdp_table.csv", index=False)
print("GDP table saved as gdp_table.csv")


    Country/Territory UN region IMF[1][13]            World Bank[14]  \
    Country/Territory UN region   Estimate       Year       Estimate   
0               World         —  105568776       2023      100562011   
1       United States  Americas   26854599       2023       25462700   
2               China      Asia   19373586  [n 1]2023       17963171   
3               Japan      Asia    4409738       2023        4231141   
4             Germany    Europe    4308854       2023        4072192   
..                ...       ...        ...        ...            ...   
209          Anguilla  Americas          —          —              —   
210          Kiribati   Oceania        248       2023            223   
211             Nauru   Oceania        151       2023            151   
212        Montserrat  Americas          —          —              —   
213            Tuvalu   Oceania         65       2023             60   

               United Nations[15]             
          Year  

### Beautiful Soup (has to be fixed)

In [59]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text,"html.parser")

# Find all the tables on the page
tables = soup.find_all("table")
gdp_table_bs = tables[2] # we already know this

# extract rows
rows = gdp_table_bs.find_all("tr") ## <tr> -> table row; as well as: <th> -> header cell; <td> -> data cell

# Now  we can construct a list with the content

table_data = []
for row in rows:
    cells = row.find_all(["th", "td"])
    row_data = [cell.get_text(strip=True) for cell in cells]
    table_data.append(row_data)
gdp_table = pd.DataFrame(table_data[1:], columns=table_data[0])  # first row as a header

# Save as a CSV
gdp_table.to_csv("gdp_table_bs.csv", index=False)
print("GDP table saved as gdp_table_bs.csv")


ValueError: 5 columns passed, passed data had 8 columns
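
The `ValueError` above says that the first row, used as the header, has 5 cells while the data rows have 8, which is what merged header cells would produce. A possible workaround, sketched here with hypothetical names, is to keep only the rows that have the most common width and then supply the column names by hand. The helper works on the plain `table_data` list of lists, so it does not depend on Beautiful Soup. The sample rows are placeholders rather than scraped values, and the 6-cell second row is an assumption.

```python
from collections import Counter


def split_rows_by_width(table_data):
    """Split rows into full-width data rows and other (header-like) rows.

    The 'full' width is assumed to be the most common row length.
    """
    non_empty = [row for row in table_data if row]
    width = Counter(len(row) for row in non_empty).most_common(1)[0][0]
    body = [row for row in non_empty if len(row) == width]
    others = [row for row in non_empty if len(row) != width]
    return width, body, others


# Placeholder rows that mimic the shapes reported in the error message
sample = [
    ["h1", "h2", "h3", "h4", "h5"],        # 5-cell header row
    ["s1", "s2", "s3", "s4", "s5", "s6"],  # assumed shorter sub-header row
    ["a"] * 8,
    ["b"] * 8,
    ["c"] * 8,
]
width, body, others = split_rows_by_width(sample)
print(width, len(body), len(others))  # prints: 8 3 2
```

With the real `table_data`, `pd.DataFrame(body, columns=names)` should then work as long as `names` is a list of `width` column names that you choose yourself.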

### Pandas again

In [48]:
import pandas as pd
import requests
import numpy as np
url = "https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29"

# Suppress warnings here instead of wrapping the HTML in StringIO
import warnings
warnings.filterwarnings('ignore')

In [56]:
# Extract tables from webpage using Pandas. Retain table number 3 as the required dataframe.
tables = pd.read_html(url)
df = tables[3]
# Replace the column headers with column numbers
df.columns = range(df.shape[1])
# Retain columns with index 0 and 2 (name of country and value of GDP quoted by IMF)
df = df[[0,2]]
# Retain the Rows with index 1 to 10, indicating the top 10 economies of the world.
df = df.iloc[1:11, :].copy()  # copy so later column assignments act on an independent DataFrame
# Assign column names as "Country" and "GDP (Million USD)"
df.columns = ["Country", "GDP (Million USD)"]
df

Unnamed: 0,Country,GDP (Million USD)
1,United States,26854599
2,China,19373586
3,Japan,4409738
4,Germany,4308854
5,India,3736882
6,United Kingdom,3158938
7,France,2923489
8,Italy,2169745
9,Canada,2089672
10,Brazil,2081235
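
The slice `iloc[1:11]` relies on the source table already being sorted by GDP, with the `World` row first. If that ordering cannot be assumed, the selection can be made explicit. The following is a minimal plain-Python sketch, not the notebook's method: `top_n_by_gdp` is a hypothetical helper, and the sample pairs are taken from the outputs shown earlier.

```python
def top_n_by_gdp(rows, n=10, skip=("World",)):
    """Return the n largest (country, gdp) pairs in descending order of GDP.

    rows: iterable of (country, gdp_in_millions); gdp may be an int, a numeric
    string, or a placeholder such as '—' for missing data.
    """
    cleaned = []
    for country, gdp in rows:
        if country in skip:
            continue  # aggregate rows are not countries
        try:
            cleaned.append((country, int(gdp)))
        except (TypeError, ValueError):
            continue  # skip missing or non-numeric values
    return sorted(cleaned, key=lambda pair: pair[1], reverse=True)[:n]


sample = [
    ("World", "105568776"),
    ("Japan", "4409738"),
    ("United States", "26854599"),
    ("Anguilla", "—"),
    ("China", "19373586"),
]
print(top_n_by_gdp(sample, n=3))
# [('United States', 26854599), ('China', 19373586), ('Japan', 4409738)]
```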


In [57]:
# Change the data type of the 'GDP (Million USD)' column to integer. Use astype() method.
df['GDP (Million USD)'] = df['GDP (Million USD)'].astype(int)
# Convert the GDP value from Million USD to Billion USD
df['GDP (Million USD)'] = df['GDP (Million USD)'] / 1000
# Use numpy.round() method to round the value to 2 decimal places.
df[['GDP (Million USD)']] = np.round(df[['GDP (Million USD)']], 2)

# The last two steps can be combined into one line for cleaner code:
# df['GDP (Million USD)'] = np.round(df['GDP (Million USD)'].astype(int) / 1000, 2)

# Rename the column header from 'GDP (Million USD)' to 'GDP (Billion USD)'
df = df.rename(columns = {'GDP (Million USD)' : 'GDP (Billion USD)'})

df

Unnamed: 0,Country,GDP (Billion USD)
1,United States,26854.6
2,China,19373.59
3,Japan,4409.74
4,Germany,4308.85
5,India,3736.88
6,United Kingdom,3158.94
7,France,2923.49
8,Italy,2169.74
9,Canada,2089.67
10,Brazil,2081.24
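
As a quick check of the arithmetic, converting from million to billion USD is a division by 1000 followed by rounding to 2 decimal places. The plain-Python sketch below uses a hypothetical helper `million_to_billion` and the built-in `round` instead of `numpy.round`, with input values taken from the first table above.

```python
def million_to_billion(value_in_millions):
    """Convert a GDP figure in million USD (int or numeric string) to billion USD."""
    return round(int(value_in_millions) / 1000, 2)


for country, gdp in [
    ("United States", "26854599"),
    ("China", "19373586"),
    ("Japan", "4409738"),
]:
    print(country, million_to_billion(gdp))
# United States 26854.6
# China 19373.59
# Japan 4409.74
```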


In [58]:
# Load the DataFrame to the CSV file named "Largest_economies.csv"
df.to_csv('./Largest_economies.csv')

In [70]:
def add(x):
    return(x+x)
add('1')

'11'