## An Exploration of UN data
This project is an exploratory analysis of two country-level metrics: gross domestic product (GDP) per capita and overall life expectancy.
### Data Source:
[http://data.un.org/Data.aspx?d=WDI&f=Indicator_Code%3aNY.GDP.PCAP.PP.KD](http://data.un.org/Data.aspx?d=WDI&f=Indicator_Code%3aNY.GDP.PCAP.PP.KD)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Read in Data

In [None]:
gdp = pd.read_csv('../Data/gdp_per_capita.csv')

### Dropping and Renaming Columns

In [None]:
gdp = gdp.drop(['Value Footnotes'], axis = 1)
gdp = gdp.rename(columns={'Country or Area':'Country','Value':'GDP'})

### Data frame dimensions and data types
* Find the number of rows and columns
* Find the data types of its columns
* Fix data types if necessary

In [None]:
row_count = gdp.shape[0]
col_count = gdp.shape[1]
print(f'There are {row_count:,} rows and {col_count:,} columns in the "gdp" data frame')

In [None]:
gdp['Year'] = gdp['Year'].astype(str)

In [None]:
gdp.dtypes

### Exploring the year 2021
* Create a new data frame by subsetting `gdp` to just the year 2021.
* Count the number of countries represented in 2021.
* Find the countries that are least represented across all years.
* Consider why those countries have so few observations.
* Use `.describe()` to find the summary statistics for GDP per capita in 2021.
* Create a histogram of GDP per capita for 2021 (you may wish to adjust the number of bins for your histogram).
* Find the top 5 countries and bottom 5 countries by GDP per capita in 2021.

In [None]:
gdp_2021 = gdp[gdp['Year'] == '2021']
unique_countries = gdp_2021['Country'].nunique()
print(f'In 2021 there were {unique_countries} countries represented')

In [None]:
gdp_least_represented = gdp['Country'].value_counts().reset_index()
gdp_least_represented = gdp_least_represented.nsmallest(5, 'count')

In [None]:
least_represented_list = gdp_least_represented['Country']
years_in_lrl = gdp[gdp['Country'].isin(least_represented_list)]
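
The cell that builds `gdp_least_represented` relies on `value_counts().reset_index()` producing a column named `count`, which is how pandas 2.x names it; older releases label the columns differently. A minimal sketch of a version-independent alternative, using a made-up stand-in for `gdp` (the toy values and the `Observations` column name are assumptions for illustration only):

```python
import pandas as pd

# Toy stand-in for the `gdp` data frame; the values are invented for illustration.
toy_gdp = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'C'],
    'Year': ['2019', '2020', '2021', '2020', '2021', '2021'],
    'GDP': [100.0, 110.0, 120.0, 50.0, 55.0, 10.0],
})

# Name the count column explicitly so the code does not depend on how
# value_counts().reset_index() labels its columns in a given pandas version.
obs_per_country = (
    toy_gdp.groupby('Country')
           .size()
           .reset_index(name='Observations')
)

# Countries with the fewest rows across all years.
least_represented = obs_per_country.nsmallest(2, 'Observations')
print(least_represented)
```

Looking at which years survive for those countries (as `years_in_lrl` does) is then a reasonable next step toward explaining why they have so few observations.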

In [None]:
describe_2021 = gdp_2021['GDP'].describe().reset_index()
describe_2021 = describe_2021.rename(columns={'index':'Index'})
describe_2021

The shape of the distribution below is skewed to the right.

In [None]:
gdp_2021.hist(column='GDP',bins = 20)
print('The shape of the distribution below is skewed to the right.')

In [None]:
top_five_gdp_2021 = gdp_2021.nlargest(columns ='GDP',n= 5)
top_five_gdp_2021.head()

In [None]:
bottom_five_gdp_2021 = gdp_2021.nsmallest(columns ='GDP',n= 5)
bottom_five_gdp_2021.head()

### Comparing 2021 with other years
* Pivot the data for 1990 and 2021
* Drop any rows that are missing values for either 1990 or 2021.
* Create a new column in `gdp_pivoted` named `Percent_Change`.
* Show the top two countries in terms of growth when comparing GDP in 1990 and 2021

In [None]:
gdp_1990_2021 = gdp[(gdp['Year'] == '1990') | (gdp['Year'] == '2021')]
gdp_pivoted = pd.pivot_table(gdp_1990_2021, values='GDP', index=['Country'],
                       columns=['Year'], aggfunc="sum")

# Keep only the countries that have a value for both 1990 and 2021
filtered_gdp_pivoted = gdp_pivoted.dropna(subset=['1990', '2021']).copy()
filtered_gdp_pivoted['Percent_Change'] = (
    (filtered_gdp_pivoted['2021'] - filtered_gdp_pivoted['1990']) / filtered_gdp_pivoted['1990']) * 100
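
The percent change used here is `(GDP_2021 - GDP_1990) / GDP_1990 * 100`. The following is a small worked sketch of the pivot, the missing-value filter, and the formula on invented numbers; the toy countries and values are assumptions, not figures from the UN data.

```python
import pandas as pd

# Invented long-format data: country C has no 1990 value.
toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B', 'C'],
    'Year': ['1990', '2021', '1990', '2021', '2021'],
    'GDP': [1000.0, 1500.0, 2000.0, 1800.0, 900.0],
})

# One row per country, one column per year.
pivoted = toy.pivot_table(values='GDP', index='Country', columns='Year', aggfunc='sum')

# Drop countries missing either endpoint, then apply the formula.
complete = pivoted.dropna(subset=['1990', '2021']).copy()
complete['Percent_Change'] = (complete['2021'] - complete['1990']) / complete['1990'] * 100
print(complete)
```

In this toy example, A grows from 1000 to 1500 (a 50% increase), B falls from 2000 to 1800 (a 10% decrease), and C is excluded because it has no 1990 value.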

In [None]:
negative_growth = filtered_gdp_pivoted[filtered_gdp_pivoted['Percent_Change'] < 0]
# sort_values returns a new data frame, so keep the sorted result
negative_growth = negative_growth.sort_values(by = 'Percent_Change')
change_count = len(negative_growth.index)
print(f'A total of {change_count} countries experienced negative GDP growth from 1990 to 2021')

In [None]:
# nlargest(2) already returns the two countries with the highest growth, largest first
top_two_percent_change = filtered_gdp_pivoted.nlargest(2, 'Percent_Change')
print(top_two_percent_change)
eqg_china = gdp[(gdp['Country'] == 'Equatorial Guinea') | (gdp['Country'] == 'China')]
eqg_china = eqg_china.reset_index()
eqg_china['Year'] = eqg_china['Year'].astype(float)
sns.lineplot(x='Year', y='GDP', data=eqg_china, hue='Country')

### Comparing continent trends
* Read in continents data and merge with the `gdp` data frame
* Count the number of unique countries in each continent
* Plot the country counts as a bar chart

In [None]:
continents = pd.read_csv('../data/continents.csv')
gdp_continents = pd.merge(gdp, continents, on = 'Country', how = 'inner')
countries_per_continent = gdp_continents.groupby('Continent')['Country'].nunique().reset_index(name = 'Country_Count')

plt.bar(countries_per_continent.Continent,
        countries_per_continent.Country_Count)
plt.ylabel('Count')
plt.xlabel('Continent')
plt.title('Countries by Continent')
plt.xticks(rotation = 90)
plt.show()
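
An inner merge silently drops any row whose `Country` value does not appear in both tables, for example aggregates or names spelled differently. One way to see what is lost, sketched here on invented stand-ins for the two data frames (the names and values are assumptions), is an outer merge with `indicator=True`:

```python
import pandas as pd

# Invented stand-ins for the `gdp` and `continents` data frames.
toy_gdp = pd.DataFrame({'Country': ['A', 'B', 'Region X'], 'GDP': [1.0, 2.0, 3.0]})
toy_continents = pd.DataFrame({
    'Country': ['A', 'B', 'D'],
    'Continent': ['Asia', 'Europe', 'Africa'],
})

# indicator=True adds a `_merge` column recording where each row came from.
checked = pd.merge(toy_gdp, toy_continents, on='Country', how='outer', indicator=True)

only_in_gdp = checked.loc[checked['_merge'] == 'left_only', 'Country'].tolist()
only_in_continents = checked.loc[checked['_merge'] == 'right_only', 'Country'].tolist()
print('Dropped from gdp by an inner merge:', only_in_gdp)
print('Dropped from continents by an inner merge:', only_in_continents)
```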

### Exploring the relationship between GDP and Life Expectancy
* Read in life expectancy data: [https://data.worldbank.org/indicator/SP.DYN.LE00.IN?name_desc=false](https://data.worldbank.org/indicator/SP.DYN.LE00.IN?name_desc=false). 
* Use [`.melt()`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) to convert your data from wide to long.
* Find the first country to exceed a life expectancy of 80 years.
* Find the countries that had the top 3 largest GDP per capita figures for 2021.
* Create a facet grid showing the change in life expectancy over time for these three countries.
* Create a scatter plot of Life Expectancy vs GDP per Capita for the year 2021.
* Find the correlation between Life Expectancy and GDP per Capita for the year 2021.
* Add a column to `gdp_le_2021` and calculate the logarithm of GDP per capita. Find the correlation between the log of GDP per capita and life expectancy.

In [None]:
le = pd.read_csv('../data/life_exp.csv')
le = le.drop(columns=['Country Code', 'Indicator Name', 'Indicator Code'])
le_melt = pd.melt(le, id_vars = 'Country Name', var_name = 'Year', value_name = 'Life_exp')
le_melt = le_melt.rename(columns = {'Country Name' : 'Country'})
gdp_le = pd.merge(gdp, le_melt, on = ['Country','Year'], how = 'inner')
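
The wide-to-long step can be seen on a tiny invented table (the values are assumptions for illustration). Because the years start out as column labels, `melt` returns them as strings, which is why the merge above lines up with the string `Year` column in `gdp`.

```python
import pandas as pd

# Invented wide table: one row per country, one column per year.
wide = pd.DataFrame({
    'Country Name': ['A', 'B'],
    '2020': [70.0, 80.5],
    '2021': [71.0, 81.0],
})

# Each (country, year) pair becomes its own row.
long = pd.melt(wide, id_vars='Country Name', var_name='Year', value_name='Life_exp')
long = long.rename(columns={'Country Name': 'Country'})
print(long)
```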

In [None]:
first_to_eighty = le_melt[le_melt['Life_exp'] >= 80]
first_to_eighty_filtered = first_to_eighty[first_to_eighty['Year'] == first_to_eighty['Year'].min()]
first_country = first_to_eighty_filtered.iloc[0,0]
print(f'{first_country} was the first country to exceed a life expectancy of 80 years old')
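
The cell above reports only the first matching row, so a tie in the earliest year would go unnoticed. A small sketch, on invented data, that compares years numerically and returns every country reaching 80 in that first year (`Year_num` is a hypothetical helper column):

```python
import pandas as pd

# Invented long-format life expectancy data; A and B both reach 80 in 1996.
toy_le = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Year': ['1995', '1996', '1995', '1996', '1995', '1996'],
    'Life_exp': [79.0, 80.2, 78.0, 80.1, 70.0, None],
})

# Rows at or above 80; missing values compare as False and drop out.
above = toy_le[toy_le['Life_exp'] >= 80].copy()

# Compare years as numbers rather than strings.
above['Year_num'] = above['Year'].astype(int)
first_year = above['Year_num'].min()

# Every country that reached 80 in that first year.
first_countries = sorted(above.loc[above['Year_num'] == first_year, 'Country'].unique())
print(first_year, first_countries)
```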

In [None]:
gdp_le_2021 = gdp_le[gdp_le['Year'] == '2021']

In [None]:
count_of_eighty = gdp_le_2021[gdp_le_2021['Life_exp']>=80]
count_of_eighty= count_of_eighty.shape[0]
print(f'In 2021 there were {count_of_eighty} countries that had a life expectancy of at least 80 years old')

In [None]:
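top_three = gdp_le_2021.nlargest(3,'GDP')['Country'].to_list()
gdp_top_three = gdp_le[gdp_le['Country'].isin(top_three)].copy()
# Year is stored as a string, so convert it to a number for the time axis
gdp_top_three['Year'] = gdp_top_three['Year'].astype(int)
a = sns.FacetGrid(gdp_top_three, col = 'Country')
# Plot life expectancy over time, one panel per country
a.map(sns.lineplot, 'Year', 'Life_exp');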

In [None]:
gdp_le_2021.plot(kind = 'scatter', x = 'GDP', y = 'Life_exp', figsize = (6,4))
plt.title('GDP v Life Exp');

In [None]:
corr_coef = gdp_le_2021[['Life_exp','GDP']].corr().iloc[0,1]
corr_coef

print(f'A correlation coefficient of {corr_coef:.2f} indicates a strong relationship between the two variables')

How does this compare to the calculation in the previous part? Look at a scatter plot to see if the result of this calculation makes sense.

In [None]:
gdp_le_2021 = gdp_le_2021.copy()
gdp_le_2021['GDP_log'] = np.log(gdp_le_2021['GDP'])
corr_coef_log = gdp_le_2021[['Life_exp','GDP_log']].corr().iloc[0,1]
corr_coef_log

print(f'Converting the GDP to a log scale increased the correlation coefficient to {corr_coef_log:.2f}')
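
Pearson correlation measures how close a relationship is to a straight line, so a curved relationship scores lower even when it is perfectly predictable. The sketch below uses synthetic data deliberately constructed so that life expectancy is exactly linear in log GDP; the numbers are assumptions chosen to show the mechanism and say nothing about the real 2021 figures.

```python
import numpy as np
import pandas as pd

# Synthetic GDP values spread evenly on a log scale (invented for illustration).
gdp_values = np.geomspace(500, 100_000, 50)

# Life expectancy constructed to be exactly linear in log(GDP).
synthetic = pd.DataFrame({
    'GDP': gdp_values,
    'Life_exp': 40 + 4 * np.log(gdp_values),
})
synthetic['GDP_log'] = np.log(synthetic['GDP'])

corr_raw = synthetic['Life_exp'].corr(synthetic['GDP'])
corr_log = synthetic['Life_exp'].corr(synthetic['GDP_log'])
print(f'raw: {corr_raw:.2f}, log: {corr_log:.2f}')
```

Here the raw correlation understates a relationship that is perfectly linear after the transform, which is the same pattern the real data may be showing to a lesser degree.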

In [None]:
gdp_le_2021.plot(kind = 'scatter', x = 'GDP_log', y = 'Life_exp', figsize = (6,4))
print('Using the log scale of GDP seems to make the relationship stronger')
plt.title('Log GDP v Life Exp');

### Bonus: Solo Exploration:
* This section explores the relationship between GDP and the percentage of women in the labor force (`Wmn_Lbr_Pct`)
* Read in women's labor participation data
* Merge with gdp and continents data
* Find the correlation coefficients for each country and each continent

In [None]:
wmn_lbr = pd.read_csv('../Data/wmn_lbr.csv')
wmn_lbr = wmn_lbr.drop(columns = ['Subgroup', 'Source', 'Unit', 'Value Footnotes'])
wmn_lbr = wmn_lbr.rename(columns = {'Country or Area' : 'Country','Value' : 'Wmn_Lbr_Pct'})
wmn_lbr['Year'] = wmn_lbr['Year'].astype(str)
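# Strip the literal '.0' suffix left by the float-to-string cast; replacing it
# with a space would leave values like '2021 ' that never match gdp's 'Year'
wmn_lbr['Year'] = wmn_lbr['Year'].str.replace('.0', '', regex=False)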

wmn_gdp = pd.merge(wmn_lbr,gdp_le, on = ['Country','Year'], how = 'inner')
wmn_gdp = pd.merge(wmn_gdp, continents, on = 'Country', how = 'inner')
wmn_gdp[['Wmn_Lbr_Pct','GDP']].corr()
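
String replacement is one way to clean the year values before merging. Another option, sketched here on an invented column, is to drop missing years and cast through an integer so the `.0` never appears. This assumes the years arrive as floats because of missing rows, which is a guess about the raw file rather than something confirmed here.

```python
import pandas as pd

# Invented column: years read as floats because one value is missing.
raw = pd.DataFrame({'Year': [2010.0, 2011.0, None, 2020.0]})

# Drop rows without a year, then cast float -> int -> str.
clean = raw.dropna(subset=['Year']).copy()
clean['Year'] = clean['Year'].astype(int).astype(str)
print(clean['Year'].tolist())
```

The resulting strings match the format produced by `gdp['Year'].astype(str)` on an integer column, so the two tables can be merged on `Year`.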

In [None]:
# Finding the correlation coefficients for each country
results = []

for country in wmn_gdp['Country'].unique():
    subset = wmn_gdp[wmn_gdp['Country'] == country]
    correlation = subset[['Wmn_Lbr_Pct', 'GDP']].corr().iloc[0, 1]  
    results.append({'Country': country, 'Correlation': correlation})  

correlation_country = pd.DataFrame(results)

correlation_country = pd.merge(correlation_country, continents, on = ['Country'], how = 'inner')
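
The loop above can also be written with `groupby`, which avoids building the list of dictionaries by hand. A minimal sketch on invented data (the toy values are assumptions), intended to give the same per-country coefficient as the loop:

```python
import pandas as pd

# Invented data: A's two series rise together, B's move in opposite directions.
toy = pd.DataFrame({
    'Country': ['A'] * 3 + ['B'] * 3,
    'Wmn_Lbr_Pct': [40.0, 41.0, 42.0, 50.0, 49.0, 48.0],
    'GDP': [100.0, 110.0, 120.0, 200.0, 210.0, 220.0],
})

# One Pearson correlation per country.
corr_by_country = (
    toy.groupby('Country')[['Wmn_Lbr_Pct', 'GDP']]
       .apply(lambda g: g['Wmn_Lbr_Pct'].corr(g['GDP']))
       .reset_index(name='Correlation')
)
print(corr_by_country)
```

Grouping by `Continent` instead of `Country` would give the continent-level version in the same way.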

In [None]:
correlation_country.hist(column='Correlation',bins =10)
print('It looks like a large number of countries show a moderate to strong relationship')

In [None]:
bins = pd.cut(correlation_country['Correlation'], bins=10)
value_counts = bins.value_counts(normalize = True).sort_index().reset_index()
max_corr_bin = value_counts.iloc[9,0]
max_corr_bin_pct = value_counts.iloc[9,1]

In [None]:
print(f'{max_corr_bin_pct * 100:.1f}% of countries fell in the highest correlation bin, {max_corr_bin}')

In [None]:
correlation_country['Correlation'].describe()

In [None]:
sns.boxplot(y='Continent', x='Correlation', data=correlation_country, hue='Continent', palette='Set1')


plt.title('Box Plot of Correlation by Continent')
plt.xlabel('Correlation Coefficient')
plt.ylabel('Continent')


plt.show()

In [None]:
results_continent = []

for continent in wmn_gdp['Continent'].unique():
    subset_continent = wmn_gdp[wmn_gdp['Continent'] == continent]
    correlation_continent = subset_continent[['Wmn_Lbr_Pct', 'GDP']].corr().iloc[0, 1]
    results_continent.append({'Continent': continent, 'Correlation': correlation_continent})

correlation_continent_df = pd.DataFrame(results_continent)
correlation_continent_df.head(10)