# Project: Web Scraping and API Process

**Project Scenario**:
An international firm looking to expand its business into countries across the world has recruited you as a junior Data Engineer. You are tasked with creating a script that extracts the list of the world's 10 largest economies, in descending order of GDP in billion USD (rounded to 2 decimal places), as logged by the International Monetary Fund (IMF).

**Project Results**:
Please see the enclosed CSV file, which contains the following table. The values in this table are expressed in USD trillions; the script that follows converts the IMF estimates from USD millions to USD billions.

| Place | Country | GDP (USD Trillions) |
| --- | --- | --- |
| 1 | United States | 26.85 |
| 2 | China | 19.37 |
| 3 | Japan | 4.41 |
| 4 | Germany | 4.31 |
| 5 | India | 3.74 |
| 6 | United Kingdom | 3.16 |
| 7 | France | 2.92 |
| 8 | Italy | 2.17 |
| 9 | Canada | 2.09 |
| 10 | Brazil | 2.08 |


In [37]:
import numpy as np
import pandas as pd

In [38]:
# Data Source on Wikipedia
URL="https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29"

In [39]:
# Extract all tables from the page; the GDP table is at index 3 of the returned list.
tables_all = pd.read_html(URL)
df = tables_all[3]
print(df)
df.info()

    Country/Territory UN region IMF[1][13]            World Bank[14]  \
    Country/Territory UN region   Estimate       Year       Estimate   
0               World         —  105568776       2023      100562011   
1       United States  Americas   26854599       2023       25462700   
2               China      Asia   19373586  [n 1]2023       17963171   
3               Japan      Asia    4409738       2023        4231141   
4             Germany    Europe    4308854       2023        4072192   
..                ...       ...        ...        ...            ...   
209          Anguilla  Americas          —          —              —   
210          Kiribati   Oceania        248       2023            223   
211             Nauru   Oceania        151       2023            151   
212        Montserrat  Americas          —          —              —   
213            Tuvalu   Oceania         65       2023             60   

               United Nations[15]             
          Year  

In [40]:
# Replace the column headers with column numbers
df.columns = range(df.shape[1])
print(df)
df.info()

                 0         1          2          3          4          5  \
0            World         —  105568776       2023  100562011       2022   
1    United States  Americas   26854599       2023   25462700       2022   
2            China      Asia   19373586  [n 1]2023   17963171  [n 3]2022   
3            Japan      Asia    4409738       2023    4231141       2022   
4          Germany    Europe    4308854       2023    4072192       2022   
..             ...       ...        ...        ...        ...        ...   
209       Anguilla  Americas          —          —          —          —   
210       Kiribati   Oceania        248       2023        223       2022   
211          Nauru   Oceania        151       2023        151       2022   
212     Montserrat  Americas          —          —          —          —   
213         Tuvalu   Oceania         65       2023         60       2022   

            6          7  
0    96698005       2021  
1    23315081       2021  
2    1
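
The scraped table comes back with a two-level header (for example `IMF[1][13]` over `Estimate`), which makes column selection awkward. The cell above sidesteps this by replacing the header with integer positions. A minimal sketch of that step on a small hand-built frame follows; `demo` and its two rows are illustrative stand-ins copied from the output above, not a fresh download.

```python
import pandas as pd

# Hypothetical stand-in for the scraped table: a two-level column header
columns = pd.MultiIndex.from_tuples([
    ("Country/Territory", "Country/Territory"),
    ("UN region", "UN region"),
    ("IMF[1][13]", "Estimate"),
    ("IMF[1][13]", "Year"),
])
demo = pd.DataFrame(
    [
        ["World", "—", "105568776", "2023"],
        ["United States", "Americas", "26854599", "2023"],
    ],
    columns=columns,
)

# Replace the two-level header with positional labels 0..n-1
demo.columns = range(demo.shape[1])

print(list(demo.columns))
# Country and IMF estimate can now be selected by position
print(demo[[0, 2]])
```

Positional labels are quick to write but depend on the column order of the page snapshot, so pinning the URL to an archived copy, as done above, keeps them stable.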

In [47]:
# Columns [0] and [2] are the needed information. NOTE: GDP is in USD Millions. 
df_info = df[[0, 2]]
df_info.head(16)

                0          2
0           World  105568776
1   United States   26854599
2           China   19373586
3           Japan    4409738
4         Germany    4308854
5           India    3736882
6  United Kingdom    3158938
7          France    2923489
8           Italy    2169745
9          Canada    2089672


In [42]:
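# Extract the top 10 economies (rows [1:11]), excluding row 0, which is the world total.
# .copy() makes df_info an independent frame, so later column assignments do not raise SettingWithCopyWarning.
df_info = df_info.iloc[1:11, :].copy()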
print(df_info)

                 0         2
1    United States  26854599
2            China  19373586
3            Japan   4409738
4          Germany   4308854
5            India   3736882
6   United Kingdom   3158938
7           France   2923489
8            Italy   2169745
9           Canada   2089672
10          Brazil   2081235
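
Later cells assign new values into this slice. Taking an explicit `.copy()` of the slice is one way to make sure those assignments never touch, or warn about, the parent frame. A minimal sketch, using a hypothetical `source` frame filled with a few values from the output above:

```python
import pandas as pd

# Hypothetical stand-in for the two selected columns before slicing
source = pd.DataFrame({
    0: ["World", "United States", "China", "Japan"],
    2: ["105568776", "26854599", "19373586", "4409738"],
})

# Skip row 0 (World) and take an independent copy of the remaining rows
top = source.iloc[1:4, :].copy()

# Modifying the copy leaves the original frame unchanged
top[2] = top[2].astype(int)

print(top)
print(repr(source.loc[1, 2]))  # the original value is still a string
```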


In [43]:
# Rename the columns to "Country" and "GDP (USD Billions)". The values are still in USD millions until they are converted in a later cell.
df_info.columns = ["Country", "GDP (USD Billions)"]
df_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 1 to 10
Data columns (total 2 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Country             10 non-null     object
 1   GDP (USD Billions)  10 non-null     object
dtypes: object(2)
memory usage: 292.0+ bytes


In [44]:
# Convert "GDP (USD Billions)" column from object to integer. 
df_info["GDP (USD Billions)"] = df_info["GDP (USD Billions)"].astype(int)
df_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 1 to 10
Data columns (total 2 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Country             10 non-null     object
 1   GDP (USD Billions)  10 non-null     int32 
dtypes: int32(1), object(1)
memory usage: 252.0+ bytes
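
`astype(int)` works here because all ten selected rows contain digit strings. Rows further down the scraped table use "—" for missing estimates (Anguilla and Montserrat in the output above), and `astype(int)` would raise on those. If more rows were ever needed, one hedged alternative is `pd.to_numeric` with `errors="coerce"`, sketched below on a hypothetical three-row series:

```python
import pandas as pd

# Hypothetical sample mixing valid estimates with the "—" placeholder
raw = pd.Series(
    ["26854599", "—", "248"],
    index=["United States", "Anguilla", "Kiribati"],
)

# Unparseable entries become NaN instead of raising an error
gdp = pd.to_numeric(raw, errors="coerce")
print(gdp)

# Drop the missing rows before casting back to integers
gdp_clean = gdp.dropna().astype(int)
print(gdp_clean)
```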


In [45]:
# Convert USD millions to USD billions and round to 2 decimal places
df_info["GDP (USD Billions)"] = (df_info["GDP (USD Billions)"] / 1000).round(2)
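
As a worked check of the unit conversion: the IMF estimates are in USD millions, so dividing by 1000 gives USD billions. The sketch below applies the same arithmetic to three values taken from the output above; the series name `gdp_millions` is illustrative.

```python
import pandas as pd

# IMF estimates in USD millions, copied from the scraped output
gdp_millions = pd.Series(
    [26854599, 19373586, 4409738],
    index=["United States", "China", "Japan"],
)

# 1 billion = 1000 million, so divide by 1000, then round to 2 decimals
gdp_billions = (gdp_millions / 1000).round(2)

print(gdp_billions)
```

For the United States this gives 26854599 / 1000 = 26854.599, which rounds to 26854.6 billion USD.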

In [46]:
# Write the final results to a CSV file
df_info.to_csv("./Countries_Top10_GDP.csv")
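
`to_csv` also writes the DataFrame index, which here holds the ranks 1 to 10. One optional refinement, shown as a sketch rather than as part of the original script, is to name that column with `index_label` and read the file back to confirm its layout. The example writes to an in-memory buffer instead of a file, and the two-row `result` frame is a hypothetical stand-in for `df_info`.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for the final df_info frame
result = pd.DataFrame(
    {
        "Country": ["United States", "China"],
        "GDP (USD Billions)": [26854.6, 19373.59],
    },
    index=[1, 2],
)

# Write to an in-memory buffer, labelling the index column "Place"
buffer = io.StringIO()
result.to_csv(buffer, index_label="Place")
print(buffer.getvalue())

# Read the CSV back to verify the header and the values
buffer.seek(0)
check = pd.read_csv(buffer, index_col="Place")
print(check)
```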