## An Exploration of UN data
In this project, you'll be doing some exploratory analysis on two country-level metrics, gross domestic product (GDP) per capita and overall life expectancy. After completing the guided practice section, you will have a chance to find some additional data and do some more exploring of your own.

### Guided Practice:
 1.	Download the Gross Domestic Product (GDP) per capita dataset from [http://data.un.org/Data.aspx?d=WDI&f=Indicator_Code%3aNY.GDP.PCAP.PP.KD](http://data.un.org/Data.aspx?d=WDI&f=Indicator_Code%3aNY.GDP.PCAP.PP.KD). Rename it to gdp_per_capita.csv and place it in the `data` folder of your project repository.

2. Create a Jupyter Notebook in the `notebooks` folder and name it `UN_Data_Exploration`.
    *  You are likely to get errors along the way. When you do, read the errors to try to understand what is happening and how to correct it.
    * Use markdown cells to record your answers to any questions asked in this exercise. On the menu bar, you can toggle the cell type from 'Code' to 'Markdown'. [Here](https://www.markdownguide.org/cheat-sheet/) is a link to a cheat sheet showing the basics of styling text using Markdown.

3.	In the first cell of your notebook, import the required packages with their customary aliases as follows:

    `import pandas as pd`   
    `import numpy as np`  
    `import matplotlib.pyplot as plt`  
    `import seaborn as sns`
    
    Keep all imports in this cell at the top of your notebook.

Imported UN data sets on 10/03/2024 from [http://data.un.org/Data.aspx?d=WDI&f=Indicator_Code%3aNY.GDP.PCAP.PP.KD](http://data.un.org/Data.aspx?d=WDI&f=Indicator_Code%3aNY.GDP.PCAP.PP.KD)

In [7]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns

4.	Using the pandas `read_csv()` function, read the GDP dataset into your notebook as a DataFrame called `gdp_df`. After reading it in, inspect the first 10 rows and then inspect the last 10 rows. 


5. Drop the 'Value Footnotes' column, and rename the remaining columns to 'Country', 'Year', and 'GDP_Per_Capita'.


- Dropped the `Value Footnotes` column because all of its values are NaN.
- Renamed `Country or Area` to `Country` and `Value` to `GDP_Per_Capita`.

In [8]:
gdp_df = (
    pd.read_csv('../data/gdp_per_capita.csv')
    .drop(columns = ['Value Footnotes'])
    .rename(columns = {'Country or Area': 'Country', 'Value': 'GDP_Per_Capita'})
)
continents = pd.read_csv('../data/continents.csv')
gdp_df.head(10)

Unnamed: 0,Country,Year,GDP_Per_Capita
0,Afghanistan,2021,1517.016266
1,Afghanistan,2020,1968.341002
2,Afghanistan,2019,2079.921861
3,Afghanistan,2018,2060.698973
4,Afghanistan,2017,2096.093111
5,Afghanistan,2016,2101.422187
6,Afghanistan,2015,2108.714173
7,Afghanistan,2014,2144.449634
8,Afghanistan,2013,2165.340915
9,Afghanistan,2012,2122.830759


In [9]:
gdp_df.tail(10)

Unnamed: 0,Country,Year,GDP_Per_Capita
7652,Zimbabwe,1999,2866.032886
7653,Zimbabwe,1998,2931.725144
7654,Zimbabwe,1997,2896.147308
7655,Zimbabwe,1996,2867.026043
7656,Zimbabwe,1995,2641.378271
7657,Zimbabwe,1994,2670.106615
7658,Zimbabwe,1993,2458.783255
7659,Zimbabwe,1992,2468.278257
7660,Zimbabwe,1991,2781.787843
7661,Zimbabwe,1990,2704.757299


üêïüêïüêïüêïüêïüêïüêïüêïüêïüêïüêïüêïüêïüêïüêïüêïüêïüêïüêïüêïüêïüêïüêïüêïüêïüêïüêïüêïüêïüêïüêï

6. How many rows and columns does gdp_df have? What are the data types of its columns? If any of the columns are not the expected types, figure out why and fix it.

- rows: 7662
- columns: 3 (the index is not counted as a column)
- data types: Country - object, Year - int64, GDP_Per_Capita - float64
- None of the data types strictly needed to be changed. Year worked fine as an int in Anaconda, but it caused major issues in VS Code, so I changed it to a string (object).


In [None]:
gdp_df.shape
# trying different ways to get columns and rows
# this gives rows, columns

In [None]:
gdp_df.shape[0]
# gives rows

In [None]:
gdp_df.shape[1]
# gives columns

In [None]:
gdp_df.info()
# gives more information than just using gdp_df.dtypes

In [None]:
gdp_df.count()
# number of non-null values in each column

In [None]:
len(gdp_df.index)
# number of rows

In [None]:
len(gdp_df.axes[0])
# gives the rows

In [None]:
# gives the columns 
len(gdp_df.axes[1])

In [None]:
len(gdp_df)
# rows

In [None]:
len(gdp_df.columns)
# columns

In [None]:
gdp_df.size
# total number of cells (rows * columns), not what I am looking for

In [23]:
# changing year to a string
gdp_df['Year'] = gdp_df['Year'].astype(str)
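
The dtype change above affects later filters: once `Year` is a string, comparisons have to use a string such as `'2021'`. Here is a minimal sketch on a made-up three-row frame. The `toy` name and its values are hypothetical and are not taken from the UN data.

```python
import pandas as pd

# Hypothetical stand-in for gdp_df; the values are invented for illustration.
toy = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "Year": [2020, 2021, 2021],
    "GDP_Per_Capita": [1000.0, 1100.0, 2500.0],
})

n_rows, n_cols = toy.shape   # shape is a (rows, columns) tuple; the index is not counted
print(n_rows, n_cols)
print(toy.dtypes)            # dtypes is an attribute, not a method

# After converting Year to text, filters must compare against a string.
toy["Year"] = toy["Year"].astype(str)
rows_2021 = toy[toy["Year"] == "2021"]
print(len(rows_2021))
```

If `Year` were left as an integer instead, the same filter would be written with `2021` rather than `'2021'`.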

7. Which years are represented in this dataset? Take a look at the number of observations per year. What do you notice?

   - The number of observations per year increased until 2013, stayed the same through 2020, then decreased to 241 in 2021 and again to 232 in 2022.


In [None]:
gdp_df['Year'].unique()

In [None]:
gdp_df['Year'].min()

In [None]:
gdp_df['Year'].max()

In [None]:
gdp_df['Year'].value_counts(ascending = True) 
# value_counts sorts by the count, not by year; to sort by year, use sort_index()

In [None]:
gdp_df['Year'].value_counts().sort_index()
# this isn't the same as the result above because the dataframe changes in cells further down
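
To see the difference between the two orderings, here is a small sketch on an invented series of years. The `years` values are made up and only illustrate how `value_counts()` and `sort_index()` interact.

```python
import pandas as pd

# Invented year labels; they are not from the GDP dataset.
years = pd.Series(["1991", "1990", "1991", "1992", "1991", "1990"])

by_count = years.value_counts()               # ordered by frequency, most common first
by_year = years.value_counts().sort_index()   # ordered by the year label

print(by_count)
print(by_year)
```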

8. How many countries are represented in this dataset? Which countries are least represented in the dataset? Why do you think these countries have so few observations?

 - Do you have to be in the UN to report this? Were these countries in the UN at the time of reporting?
 - Some of these are grouped by area rather than by country. Were they too small to be reported by themselves in some years, and so grouped into an area?
 - Some of the places listed are cities and geographical areas, not countries.


Note: in older versions of pandas, text columns such as Country are stored with the object dtype. You could convert the column to a string dtype, but it is not needed.

In [None]:
gdp_df['Country'].unique()

In [None]:
gdp_df['Country'].value_counts()
# from the review: a second value_counts() can be chained onto the first
gdp_df['Country'].value_counts().value_counts() 
# this counts how many countries share each number of observations

In [None]:
gdp_df['Country'].nunique()

In [None]:
gdp_df['Country'].value_counts(ascending = True).head(10)
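
The chained `value_counts()` call can be hard to read at first. This is a rough sketch on invented country labels, assuming nothing about the real data, that shows what each step returns.

```python
import pandas as pd

# Invented labels: A and B appear three times, C appears once.
countries = pd.Series(["A", "A", "A", "B", "B", "B", "C"])

obs_per_country = countries.value_counts()            # observations per country
countries_per_count = obs_per_country.value_counts()  # how many countries share each count
least_represented = obs_per_country.nsmallest(1).index[0]

print(obs_per_country)
print(countries_per_count)
print(least_represented)
```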

9. Create a new dataframe by subsetting `gdp_df` to just the year 2021. Call this new dataframe `gdp_2021`.

In [33]:
gdp_2021 = gdp_df.loc[gdp_df['Year'] == '2021']

In [None]:
gdp_2021.head()
# checking to see if it worked

10. Use `.describe()` to find the summary statistics for GDP per capita in 2021. 

In [None]:
gdp_2021.describe()

In [None]:
# What I should have done: describe only the GDP column.
# There is a big gap between the mean and the median, the standard deviation is large compared to the mean,
# the first quartile is small compared to the mean, and the max is about 5x the mean.
gdp_2021['GDP_Per_Capita'].describe()
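
A mean well above the median is a quick hint of right skew. The sketch below uses five invented numbers (not GDP values) to show how to pull both figures out of `.describe()`.

```python
import pandas as pd

# Invented values with one large observation on the right.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

summary = values.describe()
# describe() labels the median as "50%".
mean_above_median = summary["mean"] > summary["50%"]

print(summary["mean"], summary["50%"], mean_above_median)
```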

11. Create a histogram of GDP Per Capita numbers for 2021 (you may wish to adjust the number of bins for your histogram). How would you describe the shape of the distribution?
- shape is right skewed

- `.plot(kind = 'hist')` vs `.hist()`: `.hist()` gives a title and grid lines, while `.plot(kind = 'hist')` gives a legend and a label on the y axis (not sure whether that holds every time). `.plot(kind = ...)` is more standardized, since you only need to change `kind` to change the chart type.

In [None]:
gdp_2021.hist()

In [None]:
gdp_2021.plot(kind = 'hist')

In [None]:
gdp_2021.hist(bins = 20)
# changing bins

In [None]:
gdp_2021.hist(rwidth = 0.9)
# puts spaces in between the bars

In [None]:
gdp_2021.hist(bins = 5)
# fewer bins

12. Find the top 5 countries and bottom 5 countries by GDP per capita in 2021.

In [None]:
gdp_2021.sort_values('GDP_Per_Capita').head()

In [None]:
gdp_2021.sort_values('GDP_Per_Capita').tail()

In [None]:
# using a different method (the question asks for 5 countries)
gdp_2021.nlargest(5, 'GDP_Per_Capita')

In [None]:
gdp_2021.nsmallest(5, 'GDP_Per_Capita')

13. Now, return to the full dataset, `gdp_df`. Pivot the data for 1990 and 2021 (using the pandas `.pivot_table()` method or another method) so that each row corresponds to a country, each column corresponds to a year, and the values in the table give the GDP_Per_Capita amount. Drop any rows that are missing values for either 1990 or 2021. Save the result to a dataframe named `gdp_pivoted`.

In [46]:
gdp_pivoted = gdp_df.pivot_table(index = 'Country', columns = 'Year', values = 'GDP_Per_Capita')
# drop countries that are missing a value for either year
gdp_pivoted = gdp_pivoted.dropna(subset = ['1990', '2021'])

In [None]:
gdp_pivoted.head()

14. Create a new column in `gdp_pivoted` named `Percent_Change`. This column should contain the percent change in GDP_Per_Capita from 1990 to 2021. Hint: Percent change is calculated as 100*(New Value - Old Value) / Old Value.

In [48]:
gdp_pivoted['Percent_Change'] = 100 * ((gdp_pivoted['2021'] - gdp_pivoted['1990'] )/ gdp_pivoted['1990'])

In [None]:
gdp_pivoted.head()
# checking to see if it worked
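
The pivot, drop, and percent-change steps can be checked by hand on a tiny example. This sketch uses an invented long-format frame (countries A, B, C with made-up values), where C has no 1990 value and should be dropped.

```python
import pandas as pd

# Invented long-format data; C is missing its 1990 value.
long_df = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "C"],
    "Year": ["1990", "2021", "1990", "2021", "2021"],
    "GDP_Per_Capita": [100.0, 150.0, 200.0, 100.0, 300.0],
})

wide = long_df.pivot_table(index="Country", columns="Year", values="GDP_Per_Capita")
wide = wide.dropna(subset=["1990", "2021"])

# Percent change = 100 * (new - old) / old
wide["Percent_Change"] = 100 * (wide["2021"] - wide["1990"]) / wide["1990"]
n_negative = (wide["Percent_Change"] < 0).sum()

print(wide)
print(n_negative)
```

Counting the `True` values with `.sum()` gives the number of countries with a negative change directly, without reading it off the displayed table.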

15. How many countries experienced a negative percent change in GDP per capita from 1990 to 2021?

- 19 countries

In [None]:
gdp_pivoted[gdp_pivoted['Percent_Change'] < 0]

16. Which country had the highest % change in GDP per capita? Create a line plot showing this country's GDP per capita for all years from 1990 to 2022. Create another showing the country with the second highest % change in GDP. How do the trends in these countries compare?  
**Bonus:** Put both line charts on the same plot.

- highest - Equatorial Guinea 
- 2nd highest - China
- China had a gradual increase over the years, while Equatorial Guinea had a pretty significant increase followed by a decrease.

In [None]:
gdp_pivoted['Percent_Change'].nlargest(2)

In [52]:
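# drop the Percent_Change entry so that only the yearly GDP values are plotted
gdp_eg = gdp_pivoted.loc['Equatorial Guinea'].drop('Percent_Change')
gdp_ch = gdp_pivoted.loc['China'].drop('Percent_Change')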

In [None]:
gdp_eg.plot(kind = 'line')
gdp_ch.plot(kind = 'line')
plt.title('GDP per Capita, 1990 - 2022, for the Two Countries with the Highest Percent Change')
plt.legend()
plt.show()

17. Read in continents.csv contained in the `data` folder into a new dataframe called `continents`. We will be using this dataframe to add a new column to our dataset.

In [None]:
continents = pd.read_csv('../data/continents.csv')
continents.head()

18. Merge gdp_df and continents. Keep only the countries that appear in both data frames. Save the result back to gdp_df.

In [None]:
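# merge on the shared Country column; an inner merge keeps only countries found in both dataframes
gdp_df = pd.merge(gdp_df, continents, on = 'Country', how = 'inner')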
gdp_df.head()
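
An inner merge silently drops rows that have no match. This is a sketch with two invented frames (`gdp_toy` and `continents_toy` are hypothetical names) showing the inner merge next to an outer merge with `indicator=True`, which is one way to list what was left out.

```python
import pandas as pd

# Invented frames: C has no continent, D has no GDP rows.
gdp_toy = pd.DataFrame({"Country": ["A", "B", "C"], "GDP_Per_Capita": [1.0, 2.0, 3.0]})
continents_toy = pd.DataFrame({"Country": ["A", "B", "D"], "Continent": ["X", "X", "Y"]})

inner = pd.merge(gdp_toy, continents_toy, on="Country", how="inner")

# indicator=True adds a _merge column saying where each row came from.
outer = pd.merge(gdp_toy, continents_toy, on="Country", how="outer", indicator=True)
unmatched = outer.loc[outer["_merge"] != "both", "Country"].tolist()

print(inner)
print(unmatched)
```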

19. Determine the number of countries per continent. Create a bar chart showing this.

In [None]:
gdp_df.groupby('Continent')['Country'].nunique()
# testing to see how this works

In [None]:
gdp_df.groupby('Continent')['Country'].nunique().plot(kind = 'bar')
plt.title('Number of Countries per Continent')
plt.show()
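
Because each country has one row per year, counting rows would overstate the number of countries. The sketch below, on an invented frame, compares `nunique()` with `count()` for the same grouping.

```python
import pandas as pd

# Invented data: country A has two yearly rows.
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "C"],
    "Year": ["2020", "2021", "2021", "2021"],
    "Continent": ["X", "X", "X", "Y"],
})

countries_per_continent = toy.groupby("Continent")["Country"].nunique()  # distinct countries
rows_per_continent = toy.groupby("Continent")["Country"].count()         # all rows

print(countries_per_continent)
print(rows_per_continent)
```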

20. Create a seaborn boxplot showing GDP per capita in 2021 split out by continent. What do you notice?

- South America has no outliers, which seems odd, while Europe has the largest outlier.
- For Europe and North America, the median is near the middle of the box, but there is still some skew to the right.
- Asia, Africa, South America, and Oceania are all right-skewed.

In [None]:
gdp_df.head()

In [None]:
# need to get gdp per capita for each continent for 2021 only 
# gdp_df.loc[gdp_df['Year'] == '2021'].groupby('Continent')['GDP_Per_Capita'].sum() # this really didn't work
gdp_df[gdp_df['Year'] == '2021']    


In [None]:
plt.figure(figsize = (10, 8))
sns.boxplot(data = gdp_df[gdp_df['Year'] == '2021'], x= 'Continent', y = 'GDP_Per_Capita')
plt.title('GDP per Capita by Continent in 2021')
plt.show()

In [None]:
plt.figure(figsize = (10, 8))
sns.boxplot(data = gdp_df[gdp_df['Year'] == '2021'], x = 'GDP_Per_Capita', y = 'Continent')
plt.title('GDP per Capita by Continent in 2021')
plt.show()
# swapping the axes to see a different view

21. Download the full csv containing Life expectancy at birth, total (years) from [https://data.worldbank.org/indicator/SP.DYN.LE00.IN?name_desc=false](https://data.worldbank.org/indicator/SP.DYN.LE00.IN?name_desc=false). Read this data into a DataFrame named `life_expectancy`. Note: When reading this dataset in, you may encounter an error. Modify your `read_csv` call to correct this **without modifying the original csv file**.

Life expectancy data was downloaded on 10/12/24 from [https://data.worldbank.org/indicator/SP.DYN.LE00.IN?name_desc=false](https://data.worldbank.org/indicator/SP.DYN.LE00.IN?name_desc=false). The first few rows of the file are not needed, so the `read_csv` call starts reading at the header row a few rows down.

In [None]:
life_expectancy = pd.read_csv('../data/life_exp.csv', header = 2)

22. Drop the Country Code, Indicator Name, and Indicator Code columns. Then use [`.melt()`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) to convert your data from wide to long. That is, instead of having one row per country and multiple columns per year, we want to have multiple rows per country and a single column for year. After melting, rename the columns to `Country`, `Year`, and `Life_Expectancy`.

In [63]:
life_expectancy = life_expectancy.drop(columns = ['Country Code', 'Indicator Name', 'Indicator Code'])

In [None]:
life_expectancy.head()

In [None]:
life_expectancy = pd.melt(life_expectancy, id_vars = ['Country Name'], var_name = 'Year', value_name = 'Life_Expectancy')
life_expectancy.head()
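
The wide-to-long reshape is easier to follow on a two-country example. This sketch uses invented numbers and also applies the rename to `Country` right after melting, which is one possible place to do it.

```python
import pandas as pd

# Invented wide-format data: one row per country, one column per year.
wide = pd.DataFrame({
    "Country Name": ["A", "B"],
    "1990": [60.0, 70.0],
    "1991": [61.0, 71.0],
})

long_df = (
    wide.melt(id_vars=["Country Name"], var_name="Year", value_name="Life_Expectancy")
    .rename(columns={"Country Name": "Country"})
)

print(long_df)
```

Note that the melted `Year` values are strings, because they come from the column names.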

23. What was the first country with a life expectancy to exceed 80?

Japan

In [None]:
life_expectancy.loc[life_expectancy['Life_Expectancy'] >= 80]
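
Instead of scanning the filtered rows by eye, the earliest year can be picked out by sorting. This is a sketch on invented values with placeholder country names X and Y. It assumes the year labels can be converted to integers and uses a strict `>` to match "exceed".

```python
import pandas as pd

# Invented values; X and Y are placeholder countries.
le_toy = pd.DataFrame({
    "Country Name": ["X", "X", "Y", "Y"],
    "Year": ["1995", "1996", "1996", "1997"],
    "Life_Expectancy": [79.5, 80.2, 79.0, 80.1],
})

le_toy["Year"] = le_toy["Year"].astype(int)   # sort years numerically, not as text
above_80 = le_toy[le_toy["Life_Expectancy"] > 80]
first = above_80.sort_values("Year").iloc[0]

print(first["Country Name"], first["Year"])
```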


24. Merge `gdp_df` and `life_expectancy`, keeping all countries and years that appear in both DataFrames. Save the result to a new DataFrame named `gdp_le`. If you get any errors in doing this, read them carefully and correct them. Look at the first five rows of your new data frame to confirm it merged correctly. Also, check the last five rows to make sure the data is clean and as expected.

In [67]:
gdp_le = pd.merge(
    left = gdp_df, 
    right = life_expectancy[['Country Name', 'Year', 'Life_Expectancy']].rename(columns =  {'Country Name': 'Country'}), 
    left_on = ['Country', 'Year'], 
    right_on = ['Country', 'Year'], 
    how = 'inner')


In [None]:
gdp_le.head()

25. Create a new DataFrame, named `gdp_le_2021` by extracting data for the year 2021 from `gdp_le`. How many countries have a life expectancy of at least 80 in 2021?


In [69]:
gdp_le_2021 = gdp_le.loc[gdp_le['Year'] == '2021']

In [None]:
gdp_le_2021.head()

26. Find the countries that had the top 3 largest GDP per capita figures for 2021. Create a [seaborn FacetGrid](https://seaborn.pydata.org/generated/seaborn.FacetGrid.html) showing the change in life expectancy over time for these three countries. Each individual figure in the facet grid will represent a single country.

In [None]:
gdp_le_2021.nlargest(3, 'GDP_Per_Capita')
# to find the top three countries by gdp per capita

In [77]:
# put the three countries into a variable; had to re-sort them because the most recent year came first, which changed the chart
gdp_le_21_fil = gdp_le.loc[gdp_le['Country'].isin(['Luxembourg', 'Singapore', 'Ireland'])].sort_values('Year')

In [None]:
gdp_le_21_fil.head()    

In [None]:
g = sns.FacetGrid(gdp_le_21_fil, col = 'Country')
g.map(plt.scatter, 'Year', 'Life_Expectancy')
plt.xticks(np.arange(0, 32, step = 5))
g.tick_params(axis='x', rotation = 45)

In [None]:
g = sns.FacetGrid(gdp_le_21_fil, col = 'Country')
g.map(sns.lineplot, 'Year', 'Life_Expectancy')
plt.xticks(np.arange(0, 32, step = 5))
g.tick_params(axis='x', rotation = 45)

27. Create a scatter plot of Life Expectancy vs GDP per Capita for the year 2021. What do you notice?

There is a positive trend between GDP per capita and life expectancy.

In [None]:
gdp_le_2021.plot(
    kind = 'scatter', 
    x = 'Life_Expectancy', 
    y = 'GDP_Per_Capita'
)
plt.show()

28. Find the correlation between Life Expectancy and GDP per Capita for the year 2021. What is the meaning of this number?

- The correlation measures the linear relationship between two variables. The closer the number is to 1, the stronger the positive linear relationship; the closer it is to -1, the stronger the negative linear relationship. A value close to zero indicates there is no linear relationship.
- The correlation between life expectancy and GDP per capita is .745, which is a strong positive relationship between the two.

In [None]:
gdp_le_2021[['Life_Expectancy', 'GDP_Per_Capita']].corr()

29. Add a column to `gdp_le_2021` and calculate the logarithm of GDP per capita. Find the correlation between the log of GDP per capita and life expectancy. How does this compare to the calculation in the previous part? Look at a scatter plot to see if the result of this calculation makes sense.

- The correlation between life expectancy and the log of GDP per capita is .846, which is a strong positive relationship between the two. Using the log value increased the correlation by about .1. The log normalizes the data somewhat by compressing the scale, so the differences between values are reduced. Comparing the two scatterplots, the points in the log plot sit closer to the center of the graph and show less of a curve.

In [134]:
# work on an explicit copy so that adding a column does not trigger a SettingWithCopyWarning
gdp_le_2021 = gdp_le_2021.copy()
# gdp_le_2021.reset_index()

In [135]:
gdp_le_2021['log_GDP_Per_Capita'] = np.log(gdp_le_2021['GDP_Per_Capita'])

In [None]:
gdp_le_2021.head()

In [None]:
gdp_le_2021[['Life_Expectancy', 'log_GDP_Per_Capita']].corr()

In [None]:
gdp_le_2021.plot(
    kind = 'scatter', 
    x = 'Life_Expectancy', 
    y = 'log_GDP_Per_Capita'
)
plt.show()
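
To see why the log can raise the correlation, here is a sketch on synthetic data where GDP grows exponentially with life expectancy. The numbers are constructed for illustration (they are not real observations), so the log relationship is exactly linear by design.

```python
import numpy as np
import pandas as pd

# Synthetic data: GDP is an exponential function of life expectancy.
life = np.array([55.0, 60.0, 65.0, 70.0, 75.0, 80.0])
toy = pd.DataFrame({
    "Life_Expectancy": life,
    "GDP_Per_Capita": np.exp(0.1 * life),
})
toy["log_GDP_Per_Capita"] = np.log(toy["GDP_Per_Capita"])

raw_corr = toy["Life_Expectancy"].corr(toy["GDP_Per_Capita"])
log_corr = toy["Life_Expectancy"].corr(toy["log_GDP_Per_Capita"])

print(raw_corr, log_corr)
```

Pearson correlation only measures linear association, so straightening a curved relationship with a log moves the value closer to 1.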

### Bonus: Solo Exploration:
1. Choose and download another data set from the UN data [http://data.un.org/Explorer.aspx](http://data.un.org/Explorer.aspx) to explore. You may want to combine your new dataset with one or both of the datasets that you already worked with. Report any interesting correlations or trends that you find. 

2.    If time allows, check out the plotly library to add additional interactivity to your plots. [https://plotly.com/python/plotly-express/](https://plotly.com/python/plotly-express/).
