# Practice Project: GDP Data extraction and processing

Goal: create a script that extracts the world's 10 largest economies, in descending order of GDP in billions of USD (rounded to 2 decimals), as reported by the International Monetary Fund (IMF).
- Use web scraping to extract the required information from the website
- Use pandas to load and process the tabular data as a DataFrame
- Use numpy to manipulate the DataFrame values
- Save the processed DataFrame to a CSV file

[Data URL](https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29)

Import libraries

In [3]:
import pandas as pd
import numpy as np

In [None]:
"""exercise suggested suppressing warnings with the following code

def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')"""
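
The snippet quoted above silences every warning for the rest of the session. A narrower option, sketched here as an assumption rather than as part of the exercise, is to suppress warnings only around a single call with the standard library's `warnings.catch_warnings()`. The helper name `run_quietly` is hypothetical.

```python
import warnings


def run_quietly(func, *args, **kwargs):
    """Call func with all warnings ignored, then restore the previous filters."""
    with warnings.catch_warnings():
        # The 'ignore' filter only applies inside this with-block.
        warnings.simplefilter("ignore")
        return func(*args, **kwargs)
```

It could be used as `dfs = run_quietly(pd.read_html, url)`, which leaves warnings from every other cell visible.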


In [12]:
#save the website URL in a variable
url='https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29'

Pull the third table from the page.
Only the first three columns are needed (possibly four, because of the table's two-level header).

In [13]:
#extract the tables from the website
dfs = pd.read_html(url)




In [15]:

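#dfs[2] returned a different table than expected:
#the position in the list does not have to match the table's position on the rendered page
#df = dfs[2]

#get the GDP table (index 3 in the list)
df = dfs[3]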

#view the first 5 rows of the table
print(df.head())

  Country/Territory UN region IMF[1][13]            World Bank[14]             \
  Country/Territory UN region   Estimate       Year       Estimate       Year   
0             World         —  105568776       2023      100562011       2022   
1     United States  Americas   26854599       2023       25462700       2022   
2             China      Asia   19373586  [n 1]2023       17963171  [n 3]2022   
3             Japan      Asia    4409738       2023        4231141       2022   
4           Germany    Europe    4308854       2023        4072192       2022   

  United Nations[15]             
            Estimate       Year  
0           96698005       2021  
1           23315081       2021  
2           17734131  [n 1]2021  
3            4940878       2021  
4            4259935       2021  
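
Because the right list index had to be found by trial and error, it may help to look at every table that `pd.read_html` returned before picking one. The sketch below assumes only that `dfs` is a list of DataFrames. The helper name `summarize_tables` and its `n_cols` parameter are hypothetical.

```python
import pandas as pd


def summarize_tables(tables, n_cols=3):
    """Return one row per table: list index, shape, and first few column labels."""
    rows = []
    for i, table in enumerate(tables):
        rows.append({
            "index": i,
            "n_rows": table.shape[0],
            "n_cols": table.shape[1],
            # MultiIndex labels are tuples, so convert each label to a string
            "first_columns": [str(c) for c in table.columns[:n_cols]],
        })
    return pd.DataFrame(rows)
```

Calling `print(summarize_tables(dfs))` lists the candidates side by side. `pd.read_html` also accepts a `match` argument that keeps only tables containing the given text, which may narrow the list further.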


In [23]:
# extract the first and third columns
imf_df = df.iloc[:, [0, 2]]

#view the first 5 rows of the new dataframe
print(imf_df.head())

  Country/Territory IMF[1][13]
  Country/Territory   Estimate
0             World  105568776
1     United States   26854599
2             China   19373586
3             Japan    4409738
4           Germany    4308854
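
The two header rows in the output mean the columns are a pandas `MultiIndex`, which is why positional selection with `iloc` is convenient here. As an alternative sketch, the same columns can be selected by their `(top level, second level)` labels. The small `sample` frame below is a stand-in built from two rows of the output above, not the scraped table itself.

```python
import pandas as pd

# Stand-in for the scraped table, with the same two-level header.
columns = pd.MultiIndex.from_tuples([
    ("Country/Territory", "Country/Territory"),
    ("UN region", "UN region"),
    ("IMF[1][13]", "Estimate"),
    ("IMF[1][13]", "Year"),
])
sample = pd.DataFrame(
    [["United States", "Americas", "26854599", "2023"],
     ["China", "Asia", "19373586", "[n 1]2023"]],
    columns=columns,
)

# Select by label tuples; for this layout it picks the same columns as iloc[:, [0, 2]].
wanted = [("Country/Territory", "Country/Territory"), ("IMF[1][13]", "Estimate")]
by_label = sample[wanted].copy()

# Replace the two-level header with flat names.
by_label.columns = ["Country", "IMF GDP Millions"]
```

Label-based selection is more verbose, but it would fail loudly if the column order changed, whereas `iloc` would silently pick a different column.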


In [24]:
#change column names
imf_df.columns = ['Country', 'IMF GDP Millions']


In [25]:
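#retain the first 10 rows of country data excluding 'world' data (1st row)
#.copy() makes this an independent DataFrame, so the column assignments below do not raise a SettingWithCopyWarning
imf_df = imf_df.iloc[1:11].copy()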

#view the new dataframe
print(imf_df)

           Country IMF GDP Millions
1    United States         26854599
2            China         19373586
3            Japan          4409738
4          Germany          4308854
5            India          3736882
6   United Kingdom          3158938
7           France          2923489
8            Italy          2169745
9           Canada          2089672
10          Brazil          2081235


In [28]:
#what type of data is in the 'IMF GDP Millions' column?
print(imf_df.dtypes)

Country             object
IMF GDP Millions    object
dtype: object


In [29]:
# convert data in 'IMF GDP Millions' column to integer
imf_df['IMF GDP Millions'] = imf_df['IMF GDP Millions'].astype(int)
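
`astype(int)` works for these ten rows because every value is a clean digit string. If a cell held a non-numeric placeholder, the conversion would raise a `ValueError`. A more tolerant sketch uses `pd.to_numeric` with `errors="coerce"`. The `raw` series below is hypothetical sample data, with an invented `"N/A"` entry to show the behaviour.

```python
import pandas as pd

# Hypothetical column: two values from the table plus an invented placeholder.
raw = pd.Series(["26854599", "19373586", "N/A"], name="IMF GDP Millions")

# errors="coerce" turns anything that cannot be parsed as a number into NaN
# instead of raising an exception.
clean = pd.to_numeric(raw, errors="coerce")
```

The result is a float column because `NaN` cannot be stored in a plain integer column. Rows with missing values can then be dropped with `dropna()` before further processing.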


In [30]:
#convert currency to billions (divide GDP Millions by 1,000)
imf_df['IMF GDP Billions'] = imf_df['IMF GDP Millions']/1000



In [31]:
print(imf_df)

           Country  IMF GDP Millions  IMF GDP Billions
1    United States          26854599         26854.599
2            China          19373586         19373.586
3            Japan           4409738          4409.738
4          Germany           4308854          4308.854
5            India           3736882          3736.882
6   United Kingdom           3158938          3158.938
7           France           2923489          2923.489
8            Italy           2169745          2169.745
9           Canada           2089672          2089.672
10          Brazil           2081235          2081.235


In [32]:
#keep only the 'Country' and 'IMF GDP Billions' columns (drops 'IMF GDP Millions')
imf_billions_df = imf_df[['Country', 'IMF GDP Billions']]

print(imf_billions_df)

           Country  IMF GDP Billions
1    United States         26854.599
2            China         19373.586
3            Japan          4409.738
4          Germany          4308.854
5            India          3736.882
6   United Kingdom          3158.938
7           France          2923.489
8            Italy          2169.745
9           Canada          2089.672
10          Brazil          2081.235


Note: the billions values are still not rounded to 2 decimals.
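
One way to close that gap, sketched with numpy as the project goal suggests, is to round while converting from millions to billions. The `gdp` frame below is a small stand-in that uses three values from the output above, deliberately out of order to show the explicit sort.

```python
import numpy as np
import pandas as pd

# Three of the values printed above, in millions of USD.
gdp = pd.DataFrame({
    "Country": ["Japan", "United States", "China"],
    "IMF GDP Millions": [4409738, 26854599, 19373586],
})

# Convert to billions and round to 2 decimals in one step.
gdp["IMF GDP Billions"] = np.round(gdp["IMF GDP Millions"] / 1000, 2)

# Sort explicitly so the descending order does not depend on the source table.
gdp = gdp.sort_values("IMF GDP Billions", ascending=False).reset_index(drop=True)
```

In the notebook itself the equivalent change would be wrapping the division in cell `In [30]` with `np.round(..., 2)`.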

In [33]:
# save df as a csv file
imf_billions_df.to_csv('imf_gdp.csv', index=False)

Make it a function

In [19]:

#TODO: update with the changes made to the code above (top 10 rows, billions)

#url is the website URL with the data of interest
#tbl is the index, in the list returned by pd.read_html, of the table that contains the data of interest

def imf_gdp(url, tbl):
    dfs = pd.read_html(url)
    df = dfs[tbl]
    #keep the country name and IMF estimate columns
    imf_df = df.iloc[:, [0, 2]].copy()
    imf_df.columns = ['Country', 'GDP (IMF)']
    imf_df.to_csv('imf_gdp.csv', index=False)
    print('your file has been saved as imf_gdp.csv in the current directory')
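
The function above still saves the whole table in millions. A possible updated version, sketched under the same assumptions as the cells above (country name in column 0, IMF estimate in column 2, the 'World' total in row 0), could separate the processing from the scraping so the processing can be checked without a network call. The names `top_economies` and `imf_gdp_top`, and the `path` and `n` parameters, are hypothetical.

```python
import numpy as np
import pandas as pd


def top_economies(table, n=10):
    """Return the n largest economies with IMF GDP in billions of USD.

    Assumes `table` has the country name in column 0, the IMF estimate
    (millions of USD) in column 2, and the 'World' total in row 0.
    """
    # Skip the 'World' row and copy so later assignments are safe.
    out = table.iloc[1:, [0, 2]].copy()
    out.columns = ["Country", "IMF GDP Millions"]
    # Non-numeric cells become NaN and are dropped below.
    millions = pd.to_numeric(out["IMF GDP Millions"], errors="coerce")
    out["IMF GDP Billions"] = np.round(millions / 1000, 2)
    out = out.dropna(subset=["IMF GDP Billions"])
    out = out.sort_values("IMF GDP Billions", ascending=False).head(n)
    return out[["Country", "IMF GDP Billions"]].reset_index(drop=True)


def imf_gdp_top(url, tbl, path="imf_gdp.csv", n=10):
    """Scrape table number `tbl` from `url` and save the top n economies to `path`."""
    result = top_economies(pd.read_html(url)[tbl], n=n)
    result.to_csv(path, index=False)
    return result
```

With the variables defined earlier, `imf_gdp_top(url, 3)` would write the top 10 to `imf_gdp.csv` and return the same DataFrame for inspection.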
    
    