## UN Data Exploration

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
continents = pd.read_csv("../data/continents.csv")

In [5]:
continents

Unnamed: 0,Continent,Country
0,Asia,Afghanistan
1,Europe,Albania
2,Africa,Algeria
3,Europe,Andorra
4,Africa,Angola
...,...,...
211,Asia,Vietnam
212,Asia,West Bank and Gaza
213,Asia,Yemen
214,Africa,Zambia


In [7]:
gdp_df = pd.read_csv('../data/gdp_per_capita.csv')

In [9]:
gdp_df

Unnamed: 0,Country or Area,Year,Value,Value Footnotes
0,Afghanistan,2021,1517.016266,
1,Afghanistan,2020,1968.341002,
2,Afghanistan,2019,2079.921861,
3,Afghanistan,2018,2060.698973,
4,Afghanistan,2017,2096.093111,
...,...,...,...,...
7657,Zimbabwe,1994,2670.106615,
7658,Zimbabwe,1993,2458.783255,
7659,Zimbabwe,1992,2468.278257,
7660,Zimbabwe,1991,2781.787843,


Inspect the first 10 rows:

In [12]:
gdp_df.head(10)

Unnamed: 0,Country or Area,Year,Value,Value Footnotes
0,Afghanistan,2021,1517.016266,
1,Afghanistan,2020,1968.341002,
2,Afghanistan,2019,2079.921861,
3,Afghanistan,2018,2060.698973,
4,Afghanistan,2017,2096.093111,
5,Afghanistan,2016,2101.422187,
6,Afghanistan,2015,2108.714173,
7,Afghanistan,2014,2144.449634,
8,Afghanistan,2013,2165.340915,
9,Afghanistan,2012,2122.830759,


Inspect the last 10 rows:

In [15]:
gdp_df.tail(10)

Unnamed: 0,Country or Area,Year,Value,Value Footnotes
7652,Zimbabwe,1999,2866.032886,
7653,Zimbabwe,1998,2931.725144,
7654,Zimbabwe,1997,2896.147308,
7655,Zimbabwe,1996,2867.026043,
7656,Zimbabwe,1995,2641.378271,
7657,Zimbabwe,1994,2670.106615,
7658,Zimbabwe,1993,2458.783255,
7659,Zimbabwe,1992,2468.278257,
7660,Zimbabwe,1991,2781.787843,
7661,Zimbabwe,1990,2704.757299,


Drop the unneeded 'Value Footnotes' column:

In [18]:
gdp_df = gdp_df.drop(columns=['Value Footnotes'])

In [20]:
gdp_df

Unnamed: 0,Country or Area,Year,Value
0,Afghanistan,2021,1517.016266
1,Afghanistan,2020,1968.341002
2,Afghanistan,2019,2079.921861
3,Afghanistan,2018,2060.698973
4,Afghanistan,2017,2096.093111
...,...,...,...
7657,Zimbabwe,1994,2670.106615
7658,Zimbabwe,1993,2458.783255
7659,Zimbabwe,1992,2468.278257
7660,Zimbabwe,1991,2781.787843


Rename 'Country or Area' to 'Country' and 'Value' to 'GDP_Per_Capita' ('Year' keeps its name):

In [23]:
gdp_df = gdp_df.rename(columns={'Country or Area': 'Country'})

In [25]:
gdp_df = gdp_df.rename(columns={'Value': 'GDP_Per_Capita'})

In [27]:
gdp_df

Unnamed: 0,Country,Year,GDP_Per_Capita
0,Afghanistan,2021,1517.016266
1,Afghanistan,2020,1968.341002
2,Afghanistan,2019,2079.921861
3,Afghanistan,2018,2060.698973
4,Afghanistan,2017,2096.093111
...,...,...,...
7657,Zimbabwe,1994,2670.106615
7658,Zimbabwe,1993,2458.783255
7659,Zimbabwe,1992,2468.278257
7660,Zimbabwe,1991,2781.787843
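The drop and the two renames above can also be written as a single chained expression. The following is a minimal sketch on a tiny in-memory stand-in for the raw file (the frame `raw` and the name `tidy` are illustrative assumptions; the three rows reuse values shown in the outputs above):

```python
import pandas as pd

# Tiny stand-in for the raw CSV; not the full dataset.
raw = pd.DataFrame({
    "Country or Area": ["Afghanistan", "Afghanistan", "Zimbabwe"],
    "Year": [2021, 2020, 1990],
    "Value": [1517.016266, 1968.341002, 2704.757299],
    "Value Footnotes": [None, None, None],
})

# Drop the footnote column, then rename both columns in one rename() call.
tidy = (
    raw
    .drop(columns=["Value Footnotes"])
    .rename(columns={"Country or Area": "Country", "Value": "GDP_Per_Capita"})
)

print(tidy.columns.tolist())
```

Passing both mappings to one `rename()` call gives the same result as the two separate cells, with one fewer intermediate assignment.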


Determine the number of rows and columns in gdp_df and the data type of each column:

In [30]:
gdp_df.shape

(7662, 3)

In [32]:
gdp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7662 entries, 0 to 7661
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Country         7662 non-null   object 
 1   Year            7662 non-null   int64  
 2   GDP_Per_Capita  7662 non-null   float64
dtypes: float64(1), int64(1), object(1)
memory usage: 179.7+ KB


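Year is a label here rather than a quantity to do arithmetic on, so convert it from integer to string. Note that astype(str) produces an object column, as the info() output below shows, not a pandas category dtype: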

In [35]:
gdp_df['Year'] = gdp_df['Year'].astype(str)

In [37]:
gdp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7662 entries, 0 to 7661
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Country         7662 non-null   object 
 1   Year            7662 non-null   object 
 2   GDP_Per_Capita  7662 non-null   float64
dtypes: float64(1), object(2)
memory usage: 179.7+ KB
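For comparison, here is a minimal sketch of the difference between converting to string and converting to pandas' dedicated `category` dtype. The series `years` is a made-up example, not the notebook's data, and the exact dtype name reported for the string version may vary with the pandas version.

```python
import pandas as pd

# Made-up example values, not the real Year column.
years = pd.Series([2021, 2020, 2021, 1990], name="Year")

as_text = years.astype(str)              # each value becomes a string such as "2021"
as_category = years.astype("category")   # pandas' dedicated categorical dtype

print(as_text.tolist())
print(as_category.dtype)
print(as_category.cat.categories.tolist())  # the distinct values, sorted
```

The categorical version keeps the underlying integer values as its categories, which can matter later if the years need to be sorted or compared numerically.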


7. Which years are represented in this dataset? Take a look at the number of observations per year. What do you notice?


In [40]:
gdp_df['Year'].unique()

array(['2021', '2020', '2019', '2018', '2017', '2016', '2015', '2014',
       '2013', '2012', '2011', '2010', '2009', '2008', '2007', '2006',
       '2005', '2004', '2003', '2002', '2022', '2001', '2000', '1999',
       '1998', '1997', '1996', '1995', '1994', '1993', '1992', '1991',
       '1990'], dtype=object)

In [42]:
gdp_df['Year'].value_counts()


Year
2013    242
2016    242
2014    242
2015    242
2020    242
2017    242
2018    242
2019    242
2021    241
2012    240
2011    240
2010    239
2009    239
2008    238
2007    237
2006    237
2004    236
2005    236
2003    235
2002    235
2001    234
2000    233
2022    232
1999    227
1998    226
1997    226
1996    223
1995    223
1994    213
1993    211
1992    210
1991    208
1990    207
Name: count, dtype: int64

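According to the data above, I notice that the number of observations generally increases over time, from 207 in 1990 to 242 in each year from 2013 through 2020. The most recent years break the pattern: 2021 has 241 observations and 2022 has only 232. 2022 also appears out of sequence in the unique() output, because unique() lists values in order of first appearance and the first country in the file, Afghanistan, has no 2022 row.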
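Because `value_counts()` orders its result by frequency, the trend over time is easier to read after sorting by the year label. A minimal sketch, using a short made-up series (`year` is illustrative only, not the real column):

```python
import pandas as pd

# Illustrative labels only, not the real dataset.
year = pd.Series(["2021", "2020", "2020", "2022", "1990", "2020", "2021"], name="Year")

counts = year.value_counts()     # ordered by frequency, largest first
by_year = counts.sort_index()    # ordered by the year label instead

print(year.unique().tolist())    # order of first appearance
print(by_year)
```

Four-digit year strings sort the same way lexicographically as they do chronologically, so `sort_index()` works here even though the labels are strings.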

8. How many countries are represented in this dataset? Which countries are least represented in the dataset? Why do you think these countries have so few observations?


In [46]:
gdp_df['Country'].value_counts()

Country
Least developed countries: UN classification    33
Middle East & North Africa                      33
Middle East & North Africa (IDA & IBRD)         33
Middle income                                   33
Mongolia                                        33
                                                ..
Kosovo                                          15
Sint Maarten (Dutch part)                       14
Turks and Caicos Islands                        12
Somalia                                         10
Djibouti                                        10
Name: count, Length: 242, dtype: int64

In [52]:
gdp_df['Country'].value_counts().sum()

7662
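Note that summing the value counts returns the total number of rows (7662, matching `gdp_df.shape`), not the number of distinct entries. The number of distinct entries is the `Length: 242` reported in the `value_counts()` output, and some of those are aggregate groupings such as "Middle income" rather than individual countries. A minimal sketch of how the two quantities differ, on a small made-up frame (`df` and its row counts are assumptions for illustration):

```python
import pandas as pd

# Made-up row counts per entry; only the names come from the output above.
df = pd.DataFrame({
    "Country": ["Mongolia"] * 3 + ["Middle income"] * 3
               + ["Somalia"] + ["Djibouti"] + ["Kosovo"] * 2
})

counts = df["Country"].value_counts()

n_entries = df["Country"].nunique()   # number of distinct entries
total_rows = counts.sum()             # total rows, same as len(df) when nothing is missing
fewest = counts.nsmallest(2)          # the least represented entries

print(n_entries, total_rows)
print(fewest)
```

On the real data, `gdp_df['Country'].nunique()` would be the direct way to get the 242 figure, and `nsmallest()` on the counts lists the least represented entries.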