## UN Data Exploration

#### Q1. Download the Gross Domestic Product (GDP) per capita dataset from http://data.un.org/Data.aspx?d=WDI&f=Indicator_Code%3aNY.GDP.PCAP.PP.KD. Rename it to gdp_per_capita.csv and place it in the data folder of your project repository.

#### Q2. Create a Jupyter Notebook in the notebooks folder and name it UN_Data_Exploration.

#### Q3. In the first cell of your notebook, import the required packages with their customary aliases as follows:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#### Q4. Using the pandas read_csv() function, read the GDP dataset into your notebook as a DataFrame called gdp_df. After reading it in, inspect the first 10 rows and then inspect the last 10 rows.

In [3]:
gdp_df = pd.read_csv('../data/gdp_per_capita.csv')
gdp_df

Unnamed: 0,Country or Area,Year,Value,Value Footnotes
0,Afghanistan,2021,1517.016266,
1,Afghanistan,2020,1968.341002,
2,Afghanistan,2019,2079.921861,
3,Afghanistan,2018,2060.698973,
4,Afghanistan,2017,2096.093111,
...,...,...,...,...
7657,Zimbabwe,1994,2670.106615,
7658,Zimbabwe,1993,2458.783255,
7659,Zimbabwe,1992,2468.278257,
7660,Zimbabwe,1991,2781.787843,


In [None]:
gdp_df.head(10)

In [None]:
gdp_df.tail(10)

#### Q5. Drop the 'Value Footnotes' column, and rename the remaining columns to 'Country', 'Year', and 'GDP_Per_Capita'.

In [5]:
#drop column: df.drop(columns=['col_name'])
#rename column: df.rename(columns={"oldname": "newname"})
gdp_df = (
    gdp_df
    .drop(columns=['Value Footnotes'])
    .rename(columns={"Country or Area": "Country", "Value": "GDP_Per_Capita"})
)
gdp_df

Unnamed: 0,Country,Year,GDP_Per_Capita
0,Afghanistan,2021,1517.016266
1,Afghanistan,2020,1968.341002
2,Afghanistan,2019,2079.921861
3,Afghanistan,2018,2060.698973
4,Afghanistan,2017,2096.093111
...,...,...,...
7657,Zimbabwe,1994,2670.106615
7658,Zimbabwe,1993,2458.783255
7659,Zimbabwe,1992,2468.278257
7660,Zimbabwe,1991,2781.787843


#### Q6. How many rows and columns does gdp_df have? What are the data types of its columns? If any of the columns are not the expected types, figure out why and fix it.

7662 rows and 3 columns:
1. Country - object
2. Year - integer
3. GDP_Per_Capita - float

In [None]:
gdp_df.shape

In [None]:
gdp_df.info()
#gdp_df.dtypes
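
Here the column types came out as expected. Had a column been read as `object`, one possible way to find and fix the cause is sketched below. The frame `toy` and its stray `"n/a"` entry are made up for illustration and are not taken from the UN file.

```python
import pandas as pd

# Made-up frame: 'Year' is text because of one non-numeric entry
toy = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "Year": ["2020", "2021", "n/a"],
    "GDP_Per_Capita": [1.0, 2.0, 3.0],
})
print(toy.dtypes)

# errors="coerce" turns unparseable values into NaN, which exposes the culprits
parsed = pd.to_numeric(toy["Year"], errors="coerce")
bad_rows = toy[parsed.isna()]
print(bad_rows)

# Keep only the parseable rows and store Year as an integer
keep = parsed.notna()
clean = toy[keep].assign(Year=parsed[keep].astype(int))
print(clean.dtypes)
```

Inspecting `bad_rows` before dropping anything helps show whether the problem is a footnote row, a placeholder string, or something else.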

#### Q7. Which years are represented in this dataset? Take a look at the number of observations per year. What do you notice?

1990 - 2022 (2022 is hidden between 2001 and 2002!)  
The number of observations per year increases over time.

In [None]:
gdp_df['Year'].unique()

In [11]:
#gdp_df['Year'].value_counts(sort=False)
gdp_df['Year'].value_counts().sort_index(ascending=False)

Year
2022    232
2021    241
2020    242
2019    242
2018    242
2017    242
2016    242
2015    242
2014    242
2013    242
2012    240
2011    240
2010    239
2009    239
2008    238
2007    237
2006    237
2005    236
2004    236
2003    235
2002    235
2001    234
2000    233
1999    227
1998    226
1997    226
1996    223
1995    223
1994    213
1993    211
1992    210
1991    208
1990    207
Name: count, dtype: int64
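
`unique()` returns values in order of first appearance, which is probably why 2022 shows up in an unexpected position. A minimal sketch on made-up years shows how sorting makes the range easier to read:

```python
import numpy as np
import pandas as pd

# Made-up years in the kind of shuffled order a long-format file can produce
toy = pd.DataFrame({"Year": [2021, 2020, 2001, 2022, 2002, 2001]})

# Order of first appearance, not numeric order
appearance_order = toy["Year"].unique()

# Sorted view of the same values
sorted_years = np.sort(appearance_order)

# Observations per year, ordered by year rather than by frequency
counts = toy["Year"].value_counts().sort_index()
print(sorted_years)
print(counts)
```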

#### Q8. How many countries are represented in this dataset? Which countries are least represented in the dataset? Why do you think these countries have so few observations?

242 Countries (and Regions)  
Djibouti and Somalia have the fewest entries (10 each); they didn't start reporting or tracking until 2013 (maybe reporting is voluntary?).

In [None]:
gdp_df['Country'].nunique()

In [None]:
gdp_df['Country'].unique()

In [13]:
gdp_df['Country'].value_counts()

Country
Least developed countries: UN classification    33
Middle East & North Africa                      33
Middle East & North Africa (IDA & IBRD)         33
Middle income                                   33
Mongolia                                        33
                                                ..
Kosovo                                          15
Sint Maarten (Dutch part)                       14
Turks and Caicos Islands                        12
Somalia                                         10
Djibouti                                        10
Name: count, Length: 242, dtype: int64

In [17]:
gdp_df['Country'].value_counts().value_counts(sort=False)

count
33    202
32      5
31      3
30      1
29      2
28     10
26      2
25      1
24      1
23      4
22      3
20      1
19      1
16      1
15      1
14      1
12      1
10      2
Name: count, dtype: int64

In [None]:
gdp_df.loc[gdp_df['Country'] == 'Djibouti']

In [None]:
gdp_df.loc[gdp_df['Country'] == 'Somalia']

In [None]:
gdp_df.loc[gdp_df['Country'] == 'Afghanistan']
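
Rather than inspecting countries one at a time, the first and last reported year can be summarized for every country at once. The sketch below uses a made-up frame; the output column names `first_year`, `last_year`, and `n_obs` are hypothetical labels chosen for this example.

```python
import pandas as pd

# Made-up long-format data with uneven coverage per country
toy = pd.DataFrame({
    "Country": ["X", "X", "X", "Y", "Y"],
    "Year": [2022, 2021, 2020, 2022, 2013],
    "GDP_Per_Capita": [1.0, 1.1, 1.2, 2.0, 2.1],
})

# One row per country: earliest year, latest year, and number of observations
coverage = (
    toy.groupby("Country")["Year"]
    .agg(first_year="min", last_year="max", n_obs="count")
    .sort_values("n_obs")
)
print(coverage)
```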

#### Q9. Create a new dataframe by subsetting gdp_df to just the year 2021. Call this new dataframe gdp_2021.

In [21]:
gdp_2021 = gdp_df.loc[gdp_df['Year'] == 2021]
#gdp_2021 = gdp_df[gdp_df['Year'] == 2021]
gdp_2021

Unnamed: 0,Country,Year,GDP_Per_Capita
0,Afghanistan,2021,1517.016266
21,Africa Eastern and Southern,2021,3519.174840
54,Africa Western and Central,2021,4014.607965
87,Albania,2021,14595.944386
120,Algeria,2021,11029.138782
...,...,...,...
7502,Viet Nam,2021,10628.219166
7535,West Bank and Gaza,2021,5641.044400
7564,World,2021,17055.357429
7597,Zambia,2021,3236.788981


#### Q10. Use .describe() to find the summary statistics for GDP per capita in 2021.

In [None]:
gdp_2021['GDP_Per_Capita'].describe()

#### Q11. Create a histogram of GDP Per Capita numbers for 2021 (you may wish to adjust the number of bins for your histogram). How would you describe the shape of the distribution?

The distribution is heavily right-skewed, with most countries/regions at or below a GDP per Capita of 20K.

In [None]:
fig,ax = plt.subplots(figsize = (10,6))   

plt.hist(
    data=gdp_2021,
    x='GDP_Per_Capita',
    edgecolor='Black',
    linewidth=2,
    bins = 15
    );
plt.xlabel('GDP Per Capita')                            
plt.ylabel('Number of Countries')
plt.title('Distribution of Global GDP Per Capita for 2021');
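
The visual impression of right skew can be backed up numerically. A minimal sketch on made-up values: a positive sample skewness and a mean well above the median both point to a long right tail.

```python
import pandas as pd

# Made-up values: most are small, one is very large
values = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 50.0], name="GDP_Per_Capita")

# Sample skewness; positive values indicate a longer right tail
skewness = values.skew()

# In a right-skewed distribution the mean is pulled above the median
mean_above_median = values.mean() > values.median()
print(skewness, mean_above_median)
```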

#### Q12. Find the top 5 countries and bottom 5 countries by GDP per capita in 2021.

In [None]:
gdp_2021.sort_values('GDP_Per_Capita', ascending = False)
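
Sorting the whole frame works, but `nlargest` and `nsmallest` return the two ends directly. The sketch below uses a made-up frame, `toy_2021`, in place of `gdp_2021`.

```python
import pandas as pd

# Made-up stand-in for gdp_2021
toy_2021 = pd.DataFrame({
    "Country": list("ABCDEFGHIJKL"),
    "GDP_Per_Capita": [5, 12, 7, 1, 9, 3, 11, 2, 8, 4, 10, 6],
})

# Five highest and five lowest values, each already ordered
top5 = toy_2021.nlargest(5, "GDP_Per_Capita")
bottom5 = toy_2021.nsmallest(5, "GDP_Per_Capita")
print(top5)
print(bottom5)
```

On the real data the result may include regional aggregates such as "World", since `gdp_2021` has not yet been restricted to countries at this point.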

#### Q13. Now, return to the full dataset, gdp_df. Pivot the data for 1990 and 2021 (using the pandas .pivot_table() method or another method) so that each row corresponds to a country, each column corresponds to a year, and the values in the table give the GDP_Per_Capita amount. Drop any rows that are missing values for either 1990 or 2021. Save the result to a dataframe named gdp_pivoted.

In [97]:
gdp_pivotedtotal = gdp_df.pivot_table(values = "GDP_Per_Capita", index = "Country", columns = 'Year')
gdp_pivotedtotal

Country,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
Afghanistan,,,,,,,,,,,...,2165.340915,2144.449634,2108.714173,2101.422187,2096.093111,2060.698973,2079.921861,1968.341002,1517.016266,
Africa Eastern and Southern,3037.297466,2955.642238,2823.940366,2737.731240,2715.131116,2764.305017,2838.692029,2886.566235,2867.960243,2873.553735,...,3593.299065,3642.875373,3658.533588,3654.578815,3659.059097,3661.360566,3648.220302,3455.023119,3519.174840,3553.913370
Africa Western and Central,2788.301039,2750.790764,2743.855561,2644.709683,2575.064177,2561.665446,2612.194795,2654.384927,2676.529845,2649.555854,...,4026.231916,4146.994622,4148.547272,4055.943254,4051.271199,4064.079894,4093.442853,3957.933804,4014.607965,4063.857691
Albania,4827.027705,3496.369626,3264.820757,3598.810267,3921.614970,4471.601702,4908.932392,4400.312754,4819.067832,5474.849914,...,11361.252492,11586.817446,11878.437602,12291.842060,12770.991863,13317.119264,13653.182207,13278.369769,14595.944386,15501.662931
Algeria,8828.874473,8517.376962,8471.527605,8109.883559,7869.270272,8013.123442,8195.860480,8147.878198,8435.035658,8584.071496,...,11360.637612,11561.259795,11751.634119,11888.322967,11809.483033,11725.877741,11627.279918,10844.770764,11029.138782,11187.382303
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Viet Nam,2099.394649,2177.473744,2317.266668,2455.508783,2623.720022,2825.016443,3039.938041,3239.681646,3378.904834,3495.097968,...,7257.729273,7641.909252,8091.090101,8545.702594,9050.688534,9636.012495,10252.004622,10450.622382,10628.219166,11396.531469
West Bank and Gaza,,,,,3951.205493,4047.128488,3916.925775,4294.746098,4786.480236,5052.064072,...,6118.257181,5967.073437,6048.976597,6438.933640,6401.740891,6318.210068,6245.448697,5402.538773,5641.044400,5722.409175
World,9705.981267,9669.677060,9665.890260,9675.232260,9799.764965,9957.172695,10179.565344,10424.112458,10532.457767,10754.895302,...,14801.332173,15120.730322,15442.986012,15762.038311,16170.193777,16573.992656,16864.894576,16204.169107,17055.357429,17485.934316
Zambia,2290.039226,2232.837441,2141.504615,2232.710379,1991.185925,1999.356842,2071.708828,2096.294593,2034.897183,2074.453663,...,3330.876903,3375.941270,3365.379259,3384.268144,3395.479686,3425.948936,3372.358980,3183.650773,3236.788981,3298.142890


In [101]:
gdp_subset = gdp_df[gdp_df["Year"].isin([1990, 2001])]
gdp_subset

Unnamed: 0,Country,Year,GDP_Per_Capita
41,Africa Eastern and Southern,2001,2928.062946
52,Africa Eastern and Southern,1990,3037.297466
74,Africa Western and Central,2001,2734.257633
85,Africa Western and Central,1990,2788.301039
107,Albania,2001,6441.440698
...,...,...,...
7595,World,1990,9705.981267
7617,Zambia,2001,2142.787524
7628,Zambia,1990,2290.039226
7650,Zimbabwe,2001,2772.325234


In [103]:
gdp_pivoted = gdp_subset.pivot_table(values = "GDP_Per_Capita", index = "Country", columns = 'Year')
gdp_pivoted

Country,1990,2001
Africa Eastern and Southern,3037.297466,2928.062946
Africa Western and Central,2788.301039,2734.257633
Albania,4827.027705,6441.440698
Algeria,8828.874473,8926.110134
Angola,5793.084512,4768.008894
...,...,...
Viet Nam,2099.394649,3879.338958
West Bank and Gaza,,3980.933349
World,9705.981267,11221.662910
Zambia,2290.039226,2142.787524


In [105]:
gdp_pivoted = gdp_pivoted.dropna()
gdp_pivoted

Country,1990,2001
Africa Eastern and Southern,3037.297466,2928.062946
Africa Western and Central,2788.301039,2734.257633
Albania,4827.027705,6441.440698
Algeria,8828.874473,8926.110134
Angola,5793.084512,4768.008894
...,...,...
Vanuatu,2774.138350,2782.053642
Viet Nam,2099.394649,3879.338958
World,9705.981267,11221.662910
Zambia,2290.039226,2142.787524
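
The question names 1990 and 2021, whereas the cells above filter on 1990 and 2001. A minimal sketch of the same filter, pivot, and drop sequence with the two years pulled out as parameters, run on made-up data (`START_YEAR` and `END_YEAR` are hypothetical names):

```python
import pandas as pd

# Made-up long-format data; country C has no 1990 value
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "C"],
    "Year": [1990, 2001, 2021, 1990, 2021, 2021],
    "GDP_Per_Capita": [100.0, 150.0, 300.0, 200.0, 180.0, 50.0],
})

START_YEAR, END_YEAR = 1990, 2021  # the years named in the question

pivoted = (
    toy[toy["Year"].isin([START_YEAR, END_YEAR])]
    .pivot_table(values="GDP_Per_Capita", index="Country", columns="Year")
    .dropna()  # drops countries missing either year
)
print(pivoted)
```

Keeping the years in named constants makes it harder for the filter and later column lookups to drift apart.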


#### Q14. Create a new column in gdp_pivoted named Percent_Change. This column should contain the percent change in GDP_Per_Capita from 1990 to 2021. Hint: Percent change is calculated as 100*(New Value - Old Value) / Old Value.

In [None]:
gdp_pivoted['Percent_Change'] = 100*(gdp_pivoted[2001] - gdp_pivoted[1990]) / gdp_pivoted[1990]
gdp_pivoted

#### Q15. How many countries experienced a negative percent change in GDP per capita from 1990 to 2021?

In [None]:
#count the rows where the percent change is negative
(gdp_pivoted['Percent_Change'] < 0).sum()
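
As a worked check of the Q14 formula and the Q15 count, here is a small sketch on made-up numbers where the expected results are easy to verify by hand. The 2021 column label follows the question text.

```python
import pandas as pd

# Made-up pivoted frame: one growing, one shrinking, one flat country
pivoted = pd.DataFrame(
    {1990: [100.0, 200.0, 50.0], 2021: [300.0, 180.0, 50.0]},
    index=pd.Index(["A", "B", "C"], name="Country"),
)

# 100 * (new - old) / old
pivoted["Percent_Change"] = 100 * (pivoted[2021] - pivoted[1990]) / pivoted[1990]

# A boolean mask sums to the number of True entries
n_negative = int((pivoted["Percent_Change"] < 0).sum())
print(pivoted)
print(n_negative)
```

Country A goes from 100 to 300 (a 200% increase), B from 200 to 180 (a 10% decrease), and C is unchanged, so exactly one country has a negative change.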

#### Q16. Which country had the highest % change in GDP per capita? Create a line plot showing this country's GDP per capita for all years from 1990 to 2018. Create another showing the country with the second highest % change in GDP. How do the trends in these countries compare?

Equatorial Guinea: 1543.403611%  
China: 160.716880%  
Equatorial Guinea's GDP per capita rises meteorically until about 2009, at which point it begins to trend downward, whereas China's trends upward more gradually.

In [None]:
gdp_pivoted.sort_values('Percent_Change', ascending = False)

#### Q16 Bonus: Put both line charts on the same plot.

In [None]:
gdp_top2 = gdp_df[gdp_df['Country'].isin(['Equatorial Guinea', 'China'])]
gdp_top2 = gdp_top2[gdp_top2['Year'] < 2019]
gdp_top2

In [None]:
gdp_pivoted_top2 = gdp_top2.pivot_table(values = "GDP_Per_Capita", index = 'Year', columns = 'Country')
gdp_pivoted_top2

In [None]:
gdp_pivoted_top2.plot()
plt.title('GDP over Time')
plt.ylabel('GDP');

In [None]:
#alternative: one separate plot per country
gdp_top2.groupby('Country').plot(x='Year')

#### Q17. Read in continents.csv contained in the data folder into a new dataframe called continents. We will be using this dataframe to add a new column to our dataset.

In [107]:
continents = pd.read_csv('../data/continents.csv')
continents

Unnamed: 0,Continent,Country
0,Asia,Afghanistan
1,Europe,Albania
2,Africa,Algeria
3,Europe,Andorra
4,Africa,Angola
...,...,...
211,Asia,Vietnam
212,Asia,West Bank and Gaza
213,Asia,Yemen
214,Africa,Zambia


#### Q18. Merge gdp_df and continents. Keep only the countries that appear in both data frames. Save the result back to gdp_df.

In [109]:
gdp_df = pd.merge(gdp_df, continents, on = 'Country', how = 'inner', validate='many_to_one')
gdp_df

Unnamed: 0,Country,Year,GDP_Per_Capita,Continent
0,Afghanistan,2021,1517.016266,Asia
1,Afghanistan,2020,1968.341002,Asia
2,Afghanistan,2019,2079.921861,Asia
3,Afghanistan,2018,2060.698973,Asia
4,Afghanistan,2017,2096.093111,Asia
...,...,...,...,...
5888,Zimbabwe,1994,2670.106615,Africa
5889,Zimbabwe,1993,2458.783255,Africa
5890,Zimbabwe,1992,2468.278257,Africa
5891,Zimbabwe,1991,2781.787843,Africa
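
An inner merge silently drops names that are spelled differently in the two files. The outputs above show "Viet Nam" in the GDP data and "Vietnam" in continents.csv, for example. One way to see what fails to match is an outer merge with `indicator=True`. The frames and the `name_fixes` mapping below are made up for illustration.

```python
import pandas as pd

# Made-up stand-ins for gdp_df and continents
gdp_toy = pd.DataFrame({
    "Country": ["Albania", "Viet Nam", "World"],
    "Year": [2021] * 3,
    "GDP_Per_Capita": [1.0, 2.0, 3.0],
})
continents_toy = pd.DataFrame({
    "Continent": ["Europe", "Asia"],
    "Country": ["Albania", "Vietnam"],
})

# The _merge column records whether each row matched on both sides
check = gdp_toy.merge(continents_toy, on="Country", how="outer", indicator=True)
unmatched = check.loc[check["_merge"] != "both", ["Country", "_merge"]]
print(unmatched)

# Hypothetical mapping to align spellings before the real inner merge
name_fixes = {"Viet Nam": "Vietnam"}
merged = gdp_toy.replace({"Country": name_fixes}).merge(
    continents_toy, on="Country", how="inner", validate="many_to_one"
)
print(merged)
```

Rows marked `left_only` that are regional aggregates (such as "World") are expected to drop out; rows that are real countries are candidates for a spelling fix.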


#### Q19. Determine the number of countries per continent. Create a bar chart showing this.

In [None]:
gdp_df['Country'].nunique()

In [None]:
co_subset = gdp_df[['Country', 'Continent']]
co_subset

In [None]:
co_subset = co_subset.drop_duplicates(subset=['Country'])
co_subset

In [None]:
co_subset['Continent'].value_counts().plot(kind='bar')
plt.xticks(rotation=0)
plt.title('Number of Countries by Continent');
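
The same counts can be computed without building an intermediate de-duplicated frame. A minimal sketch on made-up data uses `groupby` with `nunique`:

```python
import pandas as pd

# Made-up long-format data: several rows per country
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "C", "C"],
    "Continent": ["Asia", "Asia", "Asia", "Africa", "Africa"],
    "Year": [2020, 2021, 2021, 2020, 2021],
})

# Distinct countries within each continent, largest first
per_continent = (
    toy.groupby("Continent")["Country"]
    .nunique()
    .sort_values(ascending=False)
)
print(per_continent)
```

The resulting Series can be passed to `.plot(kind='bar')` in the same way as the `value_counts()` result above.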

#### Q20. Create a seaborn boxplot showing GDP per capita in 2021 split out by continent. What do you notice?

Europe and Asia have a lot more variability and a couple of pretty extreme outliers. Africa, despite having more countries than any other continent, has very little variability and sits at the lower end for GDP per capita.

In [None]:
gdp_subset2021 = gdp_df.loc[gdp_df['Year'] == 2021]
gdp_subset2021

In [None]:
plt.figure(figsize = (10,8))
sns.boxplot(data = gdp_subset2021.sort_values('Continent'), x = 'GDP_Per_Capita', y ='Continent');

#### Q21. Download the full csv containing Life expectancy at birth, total (years) from https://data.worldbank.org/indicator/SP.DYN.LE00.IN?name_desc=false. Read this data into a DataFrame named life_expectancy. Note: When reading this dataset in, you may encounter an error. Modify your read_csv call to correct this without modifying the original csv file.


In [151]:
#skiprows argument added to skip over 'header' rows that were messing up the parsing of the dataset
life_expectancy = pd.read_csv('../data/API_SP.DYN.LE00.IN_DS2_en_csv_v2_31632/API_SP.DYN.LE00.IN_DS2_en_csv_v2_31632.csv', skiprows=4)
life_expectancy

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2015,2016,2017,2018,2019,2020,2021,2022,2023,Unnamed: 68
0,Aruba,ABW,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,64.152000,64.537000,64.752000,65.132000,65.294000,65.502000,...,75.683000,75.617000,75.903000,76.072000,76.248000,75.723000,74.626000,74.992000,,
1,Africa Eastern and Southern,AFE,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,44.085552,44.386697,44.752182,44.913159,45.479043,45.498338,...,61.856458,62.444050,62.922390,63.365863,63.755678,63.313860,62.454590,62.899031,,
2,Afghanistan,AFG,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,32.535000,33.068000,33.547000,34.016000,34.494000,34.953000,...,62.659000,63.136000,63.016000,63.081000,63.565000,62.575000,61.982000,62.879000,,
3,Africa Western and Central,AFW,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,37.845152,38.164950,38.735102,39.063715,39.335360,39.618038,...,56.195872,56.581678,56.888446,57.189139,57.555796,57.226373,56.988657,57.626176,,
4,Angola,AGO,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,38.211000,37.267000,37.539000,37.824000,38.131000,38.495000,...,60.655000,61.092000,61.680000,62.144000,62.448000,62.261000,61.643000,61.929000,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
261,Kosovo,XKX,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,61.485000,61.836000,62.134000,62.440000,62.734000,63.041000,...,78.922000,78.981000,78.783000,78.696000,79.022000,76.567000,76.806000,79.524000,,
262,"Yemen, Rep.",YEM,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,33.678000,34.098000,33.615000,33.247000,34.738000,35.373000,...,65.873000,66.064000,65.957000,64.575000,65.092000,64.650000,63.753000,63.720000,,
263,South Africa,ZAF,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,52.669000,53.085000,53.376000,53.633000,53.906000,54.192000,...,63.950000,64.747000,65.402000,65.674000,66.175000,65.252000,62.341000,61.480000,,
264,Zambia,ZMB,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,49.042000,49.452000,49.794000,50.133000,49.849000,50.563000,...,61.208000,61.794000,62.120000,62.342000,62.793000,62.380000,61.223000,61.803000,,
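
To see what `skiprows` does without the real file, the sketch below reads a small in-memory text. Its four preamble lines are placeholders, not the actual contents of the World Bank download.

```python
import io
import pandas as pd

# Placeholder preamble (4 lines) followed by the real header and two data rows
raw = (
    "Data Source,Example\n"
    "Last Updated,2024-01-01\n"
    "Note,placeholder line\n"
    "Note,another placeholder line\n"
    "Country Name,Country Code,1960,1961\n"
    "Aruba,ABW,64.152,64.537\n"
    "Zambia,ZMB,49.042,49.452\n"
)

# Skip the first 4 lines so the 5th line is used as the header
df = pd.read_csv(io.StringIO(raw), skiprows=4)
print(df)
```

Opening the file in a text editor first is a quick way to confirm how many lines precede the header.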


#### Q22. Drop the Country Code, Indicator Name, and Indicator Code columns. Then use .melt() to convert your data from wide to long. That is, instead of having one row per country and multiple columns per year, we want to have multiple rows per country and a single column for year. After melting, rename the columns to Country, Year, and Life_Expectancy.

In [153]:
life_expectancy = life_expectancy.drop(columns=['Country Code', 'Indicator Name', 'Indicator Code'])
life_expectancy

Unnamed: 0,Country Name,1960,1961,1962,1963,1964,1965,1966,1967,1968,...,2015,2016,2017,2018,2019,2020,2021,2022,2023,Unnamed: 68
0,Aruba,64.152000,64.537000,64.752000,65.132000,65.294000,65.502000,66.063000,66.439000,66.757000,...,75.683000,75.617000,75.903000,76.072000,76.248000,75.723000,74.626000,74.992000,,
1,Africa Eastern and Southern,44.085552,44.386697,44.752182,44.913159,45.479043,45.498338,45.249105,45.924905,46.223097,...,61.856458,62.444050,62.922390,63.365863,63.755678,63.313860,62.454590,62.899031,,
2,Afghanistan,32.535000,33.068000,33.547000,34.016000,34.494000,34.953000,35.453000,35.924000,36.418000,...,62.659000,63.136000,63.016000,63.081000,63.565000,62.575000,61.982000,62.879000,,
3,Africa Western and Central,37.845152,38.164950,38.735102,39.063715,39.335360,39.618038,39.837827,39.471500,40.085679,...,56.195872,56.581678,56.888446,57.189139,57.555796,57.226373,56.988657,57.626176,,
4,Angola,38.211000,37.267000,37.539000,37.824000,38.131000,38.495000,38.757000,39.092000,39.484000,...,60.655000,61.092000,61.680000,62.144000,62.448000,62.261000,61.643000,61.929000,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
261,Kosovo,61.485000,61.836000,62.134000,62.440000,62.734000,63.041000,63.323000,63.653000,63.935000,...,78.922000,78.981000,78.783000,78.696000,79.022000,76.567000,76.806000,79.524000,,
262,"Yemen, Rep.",33.678000,34.098000,33.615000,33.247000,34.738000,35.373000,36.097000,36.866000,37.796000,...,65.873000,66.064000,65.957000,64.575000,65.092000,64.650000,63.753000,63.720000,,
263,South Africa,52.669000,53.085000,53.376000,53.633000,53.906000,54.192000,54.391000,54.626000,54.876000,...,63.950000,64.747000,65.402000,65.674000,66.175000,65.252000,62.341000,61.480000,,
264,Zambia,49.042000,49.452000,49.794000,50.133000,49.849000,50.563000,50.679000,50.802000,50.856000,...,61.208000,61.794000,62.120000,62.342000,62.793000,62.380000,61.223000,61.803000,,


In [155]:
life_expectancy = (
    life_expectancy
    .melt(id_vars = 'Country Name', var_name = 'Year', value_name = 'Life Expectancy')
    .rename(columns={'Country Name': 'Country'})
)
life_expectancy

Unnamed: 0,Country,Year,Life Expectancy
0,Aruba,1960,64.152000
1,Africa Eastern and Southern,1960,44.085552
2,Afghanistan,1960,32.535000
3,Africa Western and Central,1960,37.845152
4,Angola,1960,38.211000
...,...,...,...
17285,Kosovo,Unnamed: 68,
17286,"Yemen, Rep.",Unnamed: 68,
17287,South Africa,Unnamed: 68,
17288,Zambia,Unnamed: 68,


In [157]:
life_expectancy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17290 entries, 0 to 17289
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Country          17290 non-null  object 
 1   Year             17290 non-null  object 
 2   Life Expectancy  16124 non-null  float64
dtypes: float64(1), object(2)
memory usage: 405.4+ KB
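
The `info()` output shows `Year` stored as `object`, because the melted column still contains the label of the empty trailing column. One possible variant, sketched on a made-up wide frame, drops all-empty columns before melting so that `Year` can be cast to an integer right away. It uses the `Life_Expectancy` name from the question.

```python
import pandas as pd

# Made-up wide frame with an empty trailing column, as a trailing comma in a CSV would produce
wide = pd.DataFrame({
    "Country Name": ["Aruba", "Zambia"],
    "1960": [64.152, 49.042],
    "1961": [64.537, 49.452],
    "Unnamed: 3": [float("nan"), float("nan")],
})

# Remove columns that contain no data at all
wide = wide.dropna(axis="columns", how="all")

long = (
    wide.melt(id_vars="Country Name", var_name="Year", value_name="Life_Expectancy")
    .rename(columns={"Country Name": "Country"})
    .astype({"Year": int})  # safe now that only year labels remain
)
print(long)
```

Note that `how="all"` would also remove a year column that has no values yet, which may or may not be desirable.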


#### Q23. What was the first country with a life expectancy to exceed 80?

Japan, 1996

In [159]:
life_expectancy = life_expectancy.dropna()
life_expectancy

Unnamed: 0,Country,Year,Life Expectancy
0,Aruba,1960,64.152000
1,Africa Eastern and Southern,1960,44.085552
2,Afghanistan,1960,32.535000
3,Africa Western and Central,1960,37.845152
4,Angola,1960,38.211000
...,...,...,...
16753,Kosovo,2022,79.524000
16754,"Yemen, Rep.",2022,63.720000
16755,South Africa,2022,61.480000
16756,Zambia,2022,61.803000


In [173]:
#"exceed 80" means strictly greater than 80
life_expectancy[life_expectancy['Life Expectancy'] > 80]

Unnamed: 0,Country,Year,Life Expectancy
9695,Japan,1996,80.219756
9926,Gibraltar,1997,80.343000
9938,"Hong Kong SAR, China",1997,80.112195
9961,Japan,1997,80.424146
9988,"Macao SAR, China",1997,80.162000
...,...,...,...
16692,Qatar,2022,81.559000
16700,Singapore,2022,82.895122
16714,Slovenia,2022,81.282927
16715,Sweden,2022,83.109756


#### Q24. Merge gdp_df and life_expectancy, keeping all countries and years that appear in both DataFrames. Save the result to a new DataFrame named gdp_le. If you get any errors in doing this, read them carefully and correct them. Look at the first five rows of your new data frame to confirm it merged correctly. Also, check the last five rows to make sure the data is clean and as expected.

In [181]:
life_expectancy['Year'] = life_expectancy['Year'].astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  life_expectancy['Year'] = life_expectancy['Year'].astype(int)
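
The warning above appears to come from assigning into the frame returned by `dropna()`. One common way to avoid it, sketched here on made-up data, is to take an explicit copy before assigning:

```python
import pandas as pd

# Made-up frame with one missing value and Year stored as text
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Year": ["1960", "1961", "1962"],
    "Life Expectancy": [50.0, None, 60.0],
})

# .copy() makes the result an independent frame, so assignment is unambiguous
cleaned = toy.dropna().copy()
cleaned["Year"] = cleaned["Year"].astype(int)
print(cleaned.dtypes)
```

The original `toy` frame is left untouched, which also makes the cell safe to re-run.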


In [183]:
life_expectancy.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16124 entries, 0 to 16757
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Country          16124 non-null  object 
 1   Year             16124 non-null  int32  
 2   Life Expectancy  16124 non-null  float64
dtypes: float64(1), int32(1), object(1)
memory usage: 956.9+ KB


In [185]:
life_expectancy

Unnamed: 0,Country,Year,Life Expectancy
0,Aruba,1960,64.152000
1,Africa Eastern and Southern,1960,44.085552
2,Afghanistan,1960,32.535000
3,Africa Western and Central,1960,37.845152
4,Angola,1960,38.211000
...,...,...,...
16753,Kosovo,2022,79.524000
16754,"Yemen, Rep.",2022,63.720000
16755,South Africa,2022,61.480000
16756,Zambia,2022,61.803000


In [197]:
gdp_le = pd.merge(gdp_df, life_expectancy, on = ['Country', 'Year'] , how = 'inner', validate='one_to_one')
gdp_le

Unnamed: 0,Country,Year,GDP_Per_Capita,Continent,Life Expectancy
0,Afghanistan,2021,1517.016266,Asia,61.982
1,Afghanistan,2020,1968.341002,Asia,62.575
2,Afghanistan,2019,2079.921861,Asia,63.565
3,Afghanistan,2018,2060.698973,Asia,63.081
4,Afghanistan,2017,2096.093111,Asia,63.016
...,...,...,...,...,...
5499,Zimbabwe,1994,2670.106615,Africa,52.588
5500,Zimbabwe,1993,2458.783255,Africa,54.426
5501,Zimbabwe,1992,2468.278257,Africa,56.435
5502,Zimbabwe,1991,2781.787843,Africa,58.091


#### Q25. Create a new DataFrame, named gdp_le_2021 by extracting data for the year 2021 from gdp_le. How many countries have a life expectancy of at least 80 in 2021?
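
One possible approach, sketched on made-up data rather than the real `gdp_le`, is to filter to 2021 and count the distinct countries at or above 80. The column name `Life Expectancy` follows the merged frame above.

```python
import pandas as pd

# Made-up stand-in for gdp_le
toy_gdp_le = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "C"],
    "Year": [2020, 2021, 2020, 2021, 2021],
    "Life Expectancy": [80.5, 79.9, 81.0, 82.3, 80.0],
})

# Keep only the 2021 rows
toy_2021 = toy_gdp_le[toy_gdp_le["Year"] == 2021]

# "At least 80" includes exactly 80
n_at_least_80 = toy_2021.loc[toy_2021["Life Expectancy"] >= 80, "Country"].nunique()
print(n_at_least_80)
```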