## Analysis of the GDP of Various Countries ##

## Abstract ##

***Introduction***
> In this lab, I use the World Bank's 2017 GDP data to answer a few questions. I want to find the average GDP across all countries, as well as the minimum and maximum GDPs. I'm specifically curious whether the U.S. has the highest GDP of any country, and whether the lowest GDP is shockingly low. Finally, I'm curious how the average GDP compares to individual countries' GDPs (such as the minimum and maximum).

## Dataset Preparation ##

***Data Location***

> This data was retrieved from the World Bank's public database: [Data Link](https://data.worldbank.org/indicator/NY.GDP.MKTP.CD)

> The data was not modified from the form stored on the server before being imported into Jupyter Notebook.


***Accessing the Data in Python***

> To access the data, I first had to open the file in Python. To do that, I used Pandas to read the CSV into a new dataframe. I then printed out the first five rows with the "head" method.

***Storing the Data for Analysis***
> The data is stored in a Pandas dataframe. It makes no sense to store it as anything else, and this "should" (you'll see in the next section why the quotes are there) make it easier to analyze data than with vanilla Python. 

In [4]:
import pandas as pd

df = pd.read_csv('./API_NY.GDP.MKTP.CD_DS2_en_csv_v2_10080925.csv', skiprows=3, index_col='Country Code')
df.head()

| Country Code | Country Name | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | ... | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | Unnamed: 62 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| XKX | Kosovo | GDP (current US$) | NY.GDP.MKTP.CD | | | | | | | | ... | 5653793000.0 | 5829934000.0 | 6686683000.0 | 6500193000.0 | 7073420000.0 | 7386891000.0 | 6439947000.0 | 6715487000.0 | 7128691000.0 | |
| YEM | Yemen, Rep. | GDP (current US$) | NY.GDP.MKTP.CD | | | | | | | | ... | 25130270000.0 | 30906750000.0 | 32726420000.0 | 35401330000.0 | 40415240000.0 | 43228580000.0 | 34602480000.0 | 18213330000.0 | | |
| ZAF | South Africa | GDP (current US$) | NY.GDP.MKTP.CD | 7575248000.0 | 7972841000.0 | 8497830000.0 | 9423212000.0 | 10373790000.0 | 11334170000.0 | 12354750000.0 | ... | 297216700000.0 | 375298100000.0 | 416878200000.0 | 396332700000.0 | 366829400000.0 | 350904600000.0 | 317741000000.0 | 295762700000.0 | 349419300000.0 | |
| ZMB | Zambia | GDP (current US$) | NY.GDP.MKTP.CD | 713000000.0 | 696285700.0 | 693142900.0 | 718714300.0 | 839428600.0 | 1082857000.0 | 1264286000.0 | ... | 15328340000.0 | 20265560000.0 | 23460100000.0 | 25503370000.0 | 28045460000.0 | 27150630000.0 | 21154390000.0 | 20954750000.0 | 25808670000.0 | |
| ZWE | Zimbabwe | GDP (current US$) | NY.GDP.MKTP.CD | 1052990000.0 | 1096647000.0 | 1117602000.0 | 1159512000.0 | 1217138000.0 | 1311436000.0 | 1281750000.0 | ... | 8621574000.0 | 10141860000.0 | 12098450000.0 | 14242490000.0 | 15451770000.0 | 15891050000.0 | 16304670000.0 | 16619960000.0 | 17845820000.0 | |


***Cleaning the Data***

> Cleaning the data should be fairly straightforward, as it was when I did this lab without Pandas. I have currently spent over an hour trying to figure out how to drop the various groups of countries, because for whatever godforsaken reason Pandas just does not want to drop rows. Even though dropping rows is instrumental to this lab's validity, I have to call it quits. It's absolutely insane that neither of the two lines listed below actually drops the rows, as that is what everyone says should work (https://chrisalbon.com/python/data_wrangling/pandas_dropping_column_and_rows/). I genuinely cannot take more time on this lab, especially when the actual data has already been analyzed and this is a rewrite.

> I am aware that the data could be cleaned in Excel beforehand, but I strongly think that all cleaning should be done in code so that this lab can be easily reproduced. Furthermore, cleaning the data in Excel would be a step backwards, and I want to expose the apparent issue with Pandas. If you want to see data cleaned with code, check out Lab 2.

In [17]:
#### I want to be able to drop the following "countries"
# World
# High income
# OECD members
# Post-demographic dividend
# IDA & IBRD total
# Low & middle income
# Middle income
# IBRD only
# East Asia & Pacific
# Europe & Central Asia
# Upper middle income
# North America
# Late-demographic dividend
# East Asia & Pacific (excluding high income)
# Euro area
# European Union
# East Asia & Pacific (IDA & IBRD countries)

# These are the country codes that I want to drop
country_groups = ['WLD', 'HIC', 'OED', 'PST', 'IBT', 'LMY', 'MIC', 'IBD', 'EAS', 'ECS', 'UMC', 'NAC', 'LTE', 'EAP', 'EMU', 'EUU', 'TEA']

# Either of these should drop Aruba, however they just don't work. 
# I have spent over an hour trying to get dropping to work.
df.drop(['ABW'])
df.drop(df.index[0])
df.head()

| Country Code | Country Name | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | ... | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | Unnamed: 62 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ABW | Aruba | GDP (current US$) | NY.GDP.MKTP.CD | | | | | | | | ... | 2498933000.0 | 2467704000.0 | 2584464000.0 | | | | | | | |
| AFG | Afghanistan | GDP (current US$) | NY.GDP.MKTP.CD | 537777800.0 | 548888900.0 | 546666700.0 | 751111200.0 | 800000000.0 | 1006667000.0 | 1400000000.0 | ... | 12486940000.0 | 15936800000.0 | 17930240000.0 | 20536540000.0 | 20264250000.0 | 20616100000.0 | 19215560000.0 | 19469020000.0 | 20815300000.0 | |
| AGO | Angola | GDP (current US$) | NY.GDP.MKTP.CD | | | | | | | | ... | 75492390000.0 | 82526140000.0 | 104115800000.0 | 113923200000.0 | 124912500000.0 | 126730200000.0 | 102621200000.0 | 95337200000.0 | 124209400000.0 | |
| ALB | Albania | GDP (current US$) | NY.GDP.MKTP.CD | | | | | | | | ... | 12044210000.0 | 11926950000.0 | 12890870000.0 | 12319780000.0 | 12776280000.0 | 13228240000.0 | 11386930000.0 | 11883680000.0 | 13039350000.0 | |
| AND | Andorra | GDP (current US$) | NY.GDP.MKTP.CD | | | | | | | | ... | 3660531000.0 | 3355695000.0 | 3442063000.0 | 3164615000.0 | 3281585000.0 | 3350736000.0 | 2811489000.0 | 2877312000.0 | 3012914000.0 | |
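
> A likely explanation for the behaviour in the cell above, offered as a sketch rather than a confirmed diagnosis of this notebook: by default `DataFrame.drop` returns a new dataframe and leaves the original untouched, so the result has to be assigned to a name before `head` is called. The toy dataframe and its values below are made up for illustration; only the country codes mirror the real data.

```python
import pandas as pd

# Toy stand-in for the World Bank table; the GDP values are invented.
toy = pd.DataFrame(
    {"Country Name": ["Aruba", "Afghanistan", "World"],
     "2017": [1.0, 2.0, 100.0]},
    index=pd.Index(["ABW", "AFG", "WLD"], name="Country Code"),
)

# drop() hands back a new DataFrame; `toy` itself is not modified.
toy.drop(["ABW"])
unchanged_codes = list(toy.index)

# Assigning the result keeps the change. errors="ignore" skips any code
# in the list that is not present in the index ("HIC" is absent here).
country_groups = ["WLD", "HIC"]
cleaned = toy.drop(country_groups, errors="ignore")
cleaned_codes = list(cleaned.index)

print(unchanged_codes)
print(cleaned_codes)
```

> Under that assumption, something like `df = df.drop(country_groups, errors="ignore")` would remove the aggregate rows from the real dataframe.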


## Data Modelling ##

***Finding the Highest GDP***
> To find the maximum GDP, I used the Pandas method "idxmax", which returns the index label of the row holding the maximum value of a given column. I assume the maximum will be in 2017, as there has been an overall growth in GDP across the period. I can pass the returned label to "loc" to retrieve the rest of that row, such as the country name and the GDP value.

> Notice how it returns "World." This is because of the issue mentioned above. It should return USA, if the data could be cleaned with Pandas.

In [25]:
# Grabs row with maximum value. I assume it will be in 2017, as there has been an overall growth in GDP
# Named max_row to avoid shadowing Python's built-in max()
max_row = df.loc[df['2017'].idxmax()]

print("Highest GDP: $%s -- %s %s" % (int(max_row['2017']), max_row['Country Name'], "2017"))

Highest GDP: $80683787437857 -- World 2017
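
> As a minimal sketch of how the aggregate rows affect this result, the example below runs "idxmax" on a toy dataframe before and after filtering out an aggregate code. The GDP values are invented placeholders, not World Bank figures, and the boolean filter with `isin` is just one possible way to exclude rows.

```python
import pandas as pd

# Toy data; the numbers are placeholders, not real GDP figures.
toy = pd.DataFrame(
    {"Country Name": ["United States", "Seychelles", "World"],
     "2017": [19.0, 0.001, 80.0]},
    index=pd.Index(["USA", "SYC", "WLD"], name="Country Code"),
)

# idxmax() returns the index label of the largest value, not the value itself.
label_with_aggregates = toy["2017"].idxmax()

# Keep only rows whose code is NOT in the aggregate list, then repeat.
countries_only = toy[~toy.index.isin(["WLD"])]
label_countries_only = countries_only["2017"].idxmax()
top_name = countries_only.loc[label_countries_only, "Country Name"]

print(label_with_aggregates, label_countries_only, top_name)
```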


***Finding the Lowest GDP***
> To find the minimum GDP, I used the Pandas method "idxmin", which returns the index label of the row holding the minimum value of a given column. I assume the minimum will be in 1961, one of the earliest years in the dataset (the columns begin at 1960), as there has been an overall growth in GDP across the period. I can pass the returned label to "loc" to retrieve the rest of that row, such as the country name and the GDP value.

> This time, it actually returns the correct value, since the groups of countries that would have been cleaned only affect maximum values.

In [29]:
# Grabs row with minimum value. I assume it will be in 1961, as there has been an overall growth in GDP
# Named min_row to avoid shadowing Python's built-in min()
min_row = df.loc[df['1961'].idxmin()]

print("Lowest GDP: $%s -- %s %s" % (int(min_row['1961']), min_row['Country Name'], "1961"))

Lowest GDP: $11592024 -- Seychelles 1961
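
> The choice of 1961 rests on the assumption that the smallest value sits in an early year. One way to check that assumption, sketched here on a made-up dataframe with hypothetical country names, is to take the minimum of every year column first and only then ask which year and which row hold the smallest value.

```python
import numpy as np
import pandas as pd

# Made-up data: two hypothetical countries, three year columns, one gap.
toy = pd.DataFrame(
    {"Country Name": ["A-land", "B-land"],
     "1960": [np.nan, 5.0],
     "1961": [3.0, 4.0],
     "1962": [2.0, 6.0]},
    index=pd.Index(["AAA", "BBB"], name="Country Code"),
)

# Year columns are the ones whose names are all digits.
year_cols = [c for c in toy.columns if c.isdigit()]

# Per-year minimum (missing values are skipped), then the year holding
# the smallest of those minima.
col_mins = toy[year_cols].min()
min_year = col_mins.idxmin()

# Within that year, the row label holding the minimum.
min_code = toy[min_year].idxmin()
min_value = toy.loc[min_code, min_year]

print(min_code, min_year, min_value)
```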


***Finding the Average GDP***
> To find the average GDP, I used the Pandas method "mean," which, when run on the entire dataframe, returns the mean of every numeric column. Since I want the overall mean, I just added another "mean," which took the mean of those means (that's a lot of "means!"). This mean of means weights every year equally, so it approximates the average of the whole dataframe, though it can differ from the mean of every individual value when years have different numbers of missing entries.

> This was also affected by the lack of cleaning by Pandas. It is higher than it should be when there are still groups of countries present in the dataframe.

In [32]:
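# Grabs mean of every numeric column and finds mean of that
# numeric_only=True skips the text columns (country name, indicator name/code)
mean = int(df.mean(numeric_only=True).mean())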

print("Average GDP: $%s" %mean)

Average GDP: $903428582591
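
> To illustrate how a mean of column means can differ from the mean of all individual values, here is a small sketch on invented numbers with hypothetical country names. The pooled mean uses NumPy's `nanmean` over the raw values, which is an alternative I am introducing for comparison, not something the lab itself uses.

```python
import numpy as np
import pandas as pd

# Invented values: the 1961 column is mostly missing, the 2017 column is full.
toy = pd.DataFrame(
    {"Country Name": ["A-land", "B-land", "C-land"],
     "1961": [1.0, np.nan, np.nan],
     "2017": [10.0, 20.0, 30.0]},
)

# Mean of each numeric column (NaN skipped), then the mean of those means.
col_means = toy.mean(numeric_only=True)
mean_of_means = col_means.mean()

# Mean over every individual non-missing value instead.
pooled_mean = np.nanmean(toy[["1961", "2017"]].to_numpy())

print(mean_of_means, pooled_mean)
```

> The two figures differ because the single 1961 value carries as much weight as all three 2017 values in the mean of means.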


***Finding the Average GDP in 1961***
> To find the average GDP in 1961, I used the Pandas method "mean", which returns the mean of a given column. I passed in '1961', so it found the mean of the 1961 column.

> This was also affected by the lack of cleaning by Pandas. It is higher than it should be when there are still groups of countries present in the dataframe.

In [36]:
# Grabs mean of the 1961 column
oldMean = int(df['1961'].mean())

print("1961 Average GDP: $%s" %oldMean)

1961 Average GDP: $76576679586


***Finding the Average GDP in 2017***
> To find the average GDP in 2017, I used the Pandas method "mean", which returns the mean of a given column. I passed in '2017', so it found the mean of the 2017 column.

> This was also affected by the lack of cleaning by Pandas. It is higher than it should be when there are still groups of countries present in the dataframe.

In [35]:
# Grabs mean of the 2017 column
currentMean = int(df['2017'].mean())

print("2017 Average GDP: $%s" %currentMean)

2017 Average GDP: $2847927579477
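
> To put the two yearly averages side by side, the short sketch below divides the 2017 figure by the 1961 figure, using the values printed in the cells above. Both inputs come from the uncleaned data, so the ratio is only a rough indication; the variable names are my own.

```python
# Averages as printed by the two cells above (uncleaned data).
mean_1961 = 76576679586
mean_2017 = 2847927579477

# How many times larger the 2017 average is than the 1961 average.
growth_factor = mean_2017 / mean_1961

print("2017 average is about %.1f times the 1961 average" % growth_factor)
```

> On these numbers the 2017 average comes out at roughly 37 times the 1961 average.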


## Data Analysis & Conclusion ##

***Conclusion***
> After analyzing the data, I found that the lowest GDP belonged to Seychelles, which certainly surprised me. I would have expected a country thought of as more "third-world," but Seychelles makes sense due solely to its size.

> Setting aside the aggregate "World" row that the uncleaned data returned, I was unsurprised that the U.S. has the highest GDP of any single country. Frankly, I didn't know what to make of the average GDP, as I have no way of judging it. It is significantly higher than Seychelles's GDP, while far below the U.S.'s. It just gives me a sense of the shocking amount of money the U.S. generates.

> I was unsurprised that the 1961 mean was significantly lower than the 2017 mean. I know that the GDP of the whole world has grown overall across this period, so I would expect modern GDPs to be higher than they were half a century ago.

> My only regret for this lab is that I couldn't figure out how to drop rows in Pandas. I do not want to clean data in Excel, as I know there is a way to do it in Pandas. I can update this lab when our class meets (if I update it, this paragraph won't be present anymore). I need to understand how dropping works in Pandas, or why it's not working for me, in order to have accurate results, and I want to learn how to clean data the right, reproducible way.

## Acknowledgements ##

***No acknowledgements applicable for this lab.***