In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

# 0. Initial Data Loading and Exploration

## HDI

In [2]:
hdi_df = pd.read_csv('data/HDR21-22_Composite_indices_complete_time_series.csv')
hdi_df.sample(3)

Unnamed: 0,iso3,country,hdicode,region,hdi_rank_2021,hdi_1990,hdi_1991,hdi_1992,hdi_1993,hdi_1994,...,mf_2012,mf_2013,mf_2014,mf_2015,mf_2016,mf_2017,mf_2018,mf_2019,mf_2020,mf_2021
102,LKA,Sri Lanka,High,SA,73.0,0.636,0.641,0.65,0.658,0.663,...,5.68,3.52,5.73,6.2,4.55,7.29,4.33,4.36,4.36,4.36
20,BLR,Belarus,Very High,ECA,60.0,,,,,,...,6.07,7.55,7.98,6.53,5.63,6.18,6.75,5.59,5.59,5.59
162,SUR,Suriname,High,LAC,99.0,,,,,,...,,,,,,,,,,


In [26]:
# Exploring the columns

# hdi_df.columns.tolist()

In [4]:
# Selecting the columns of interest (copied, so the in-place rename below acts on an independent frame)
hdi_df = hdi_df[['country', 'hdicode', 'hdi_2021', 'region', 'hdi_rank_2021']].copy()
hdi_df.head()

Unnamed: 0,country,hdicode,hdi_2021,region,hdi_rank_2021
0,Afghanistan,Low,0.478,SA,180.0
1,Angola,Medium,0.586,SSA,148.0
2,Albania,High,0.796,ECA,67.0
3,Andorra,Very High,0.858,,40.0
4,United Arab Emirates,Very High,0.911,AS,26.0


In [5]:
# Renaming the columns so that they are easier to understand 
hdi_df.rename(columns={'hdi_2021': 'HDI',
                       'country': 'Country',
                       'hdicode': 'HDI Group',
                       'region': 'Region',
                       'hdi_rank_2021': 'HDI_Rank'}, inplace=True)    

In [6]:
hdi_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206 entries, 0 to 205
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Country    206 non-null    object 
 1   HDI Group  191 non-null    object 
 2   HDI        202 non-null    float64
 3   Region     151 non-null    object 
 4   HDI_Rank   191 non-null    float64
dtypes: float64(2), object(3)
memory usage: 8.2+ KB


#### Let's look at the null values

In [7]:
# Looking at missing HDI values
hdi_df[hdi_df['HDI'].isnull()]

Unnamed: 0,Country,HDI Group,HDI,Region,HDI_Rank
108,Monaco,,,,
132,Nauru,,,EAP,
142,Korea (Democratic People's Rep. of),,,EAP,
158,Somalia,,,AS,


In [27]:
# Looking at the missing HDI Group values

# hdi_df[hdi_df['HDI Group'].isnull()]

In [9]:
# Getting the regions (abbreviations)
hdi_df['Region'].value_counts()  

Region
SSA    46
LAC    33
EAP    26
AS     20
ECA    17
SA      9
Name: count, dtype: int64

As we can see, four countries have missing HDI scores. These rows will have to be dropped, since the values cannot be filled in from this dataset (one could look for other datasets). We can also observe that the last rows of the table are not countries but summaries of the HDI scores by development group and by region. We save these in a separate dataframe, so that the country data and the aggregate data are kept apart.

In [10]:
hdi_df_regions = hdi_df.tail(11)
hdi_df_regions

Unnamed: 0,Country,HDI Group,HDI,Region,HDI_Rank
195,Very high human development,,0.896,,
196,High human development,,0.754,,
197,Medium human development,,0.636,,
198,Low human development,,0.518,,
199,Arab States,,0.708,,
200,East Asia and the Pacific,,0.749,,
201,Europe and Central Asia,,0.796,,
202,Latin America and the Caribbean,,0.754,,
203,South Asia,,0.632,,
204,Sub-Saharan Africa,,0.547,,


In [11]:
hdi_df = hdi_df.drop(hdi_df.tail(11).index)
hdi_df.tail()

Unnamed: 0,Country,HDI Group,HDI,Region,HDI_Rank
190,Samoa,High,0.707,EAP,111.0
191,Yemen,Low,0.455,AS,183.0
192,South Africa,High,0.713,SSA,109.0
193,Zambia,Medium,0.565,SSA,154.0
194,Zimbabwe,Medium,0.593,SSA,146.0


In [12]:
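# Let's drop the countries with NaN HDI values
# Note: dropna is called on the selected column only, so the rows of hdi_df stay in place (see the row count below)
hdi_df['HDI'].dropna(inplace=True)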
hdi_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Country    195 non-null    object 
 1   HDI Group  191 non-null    object 
 2   HDI        191 non-null    float64
 3   Region     151 non-null    object 
 4   HDI_Rank   191 non-null    float64
dtypes: float64(2), object(3)
memory usage: 7.7+ KB
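The `info()` output above still lists 195 rows and only 191 non-null HDI values, so the four countries without an HDI score appear to remain in `hdi_df`. A minimal sketch of dropping them at the DataFrame level with `dropna(subset=...)` follows. The frame `toy` is a small stand-in built from rows shown earlier, not the full dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for hdi_df, using HDI values from the rows shown earlier
toy = pd.DataFrame({
    "Country": ["Afghanistan", "Monaco", "Albania", "Nauru"],
    "HDI": [0.478, np.nan, 0.796, np.nan],
})

# Drop every row whose HDI is missing and keep the returned frame
toy_clean = toy.dropna(subset=["HDI"])

print(len(toy), "rows before,", len(toy_clean), "rows after")
```

Assigning the result back (for example `hdi_df = hdi_df.dropna(subset=['HDI'])`) would apply the same idea to the real frame. The outputs recorded in the rest of this notebook were produced without that step.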


In [13]:
# Let's set the countries as the index
hdi_df.set_index('Country', inplace=True)

We won't drop the rows with null values in the 'Region' column, as we can still use the HDI values of these countries for non-regional analysis.

## IQ

In [14]:
iq_df = pd.read_csv('data/National_IQ.csv')
iq_df.sample(3)

Unnamed: 0,Rank,Country,Measured IQ,IQ data quality,SchAch,SA direct,SA scaled,SA data quality,Final IQ,Final IQ.1
182,,Comoros,,,,,,,(77),77.0
193,,Benin,,,,,,,(71),71.0
98,99.5,Algeria,,,403.6,81.5,84.2,2.0,84.2,84.2


In [15]:
iq_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204 entries, 0 to 203
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Rank             162 non-null    float64
 1   Country          204 non-null    object 
 2   Measured IQ      136 non-null    float64
 3   IQ data quality  137 non-null    float64
 4    SchAch          113 non-null    float64
 5   SA direct        111 non-null    float64
 6   SA scaled        112 non-null    float64
 7   SA data quality  112 non-null    float64
 8    Final IQ        204 non-null    object 
 9    Final IQ.1      204 non-null    float64
dtypes: float64(8), object(2)
memory usage: 16.1+ KB


The values we are interested in are those that are easy to interpret and relevant to the research question, which is to compare the HDI scores with the IQ scores. For this we need the IQ, the rank of the country and the country name itself.

In [16]:
iq_df.columns

Index(['Rank', 'Country', 'Measured IQ', 'IQ data quality', ' SchAch',
       'SA direct', 'SA scaled', 'SA data quality', ' Final IQ',
       ' Final IQ.1'],
      dtype='object')

In [17]:
columns_of_interest = ['Rank','Country', ' Final IQ']
iq_df = iq_df[columns_of_interest]


In [18]:
iq_df = iq_df.rename(columns=
{' Final IQ': 'IQ',
'Rank': 'IQ_Rank'
})

Since we are looking for potential biases and other correlations between the variables, it is crucial that all IQ values used are the actually reported ones. For this reason, and to stay consistent with the methodology applied to the HDI dataset, we drop the rows with NaN values in the IQ column.

In [19]:
iq_df['IQ'].dropna(inplace=True)

In [20]:
iq_df.set_index('Country',inplace=True)

In [21]:
iq_df

Country,IQ_Rank,IQ
Singapore,1.0,107.1
China,2.0,105.8
Hong Kong,3.0,105.7
Korea: South,4.5,104.6
Taiwan,4.5,104.6
...,...,...
GuineaBissau,,(69)
Liberia,,(68)
Haiti,,(67)
Sao Tome & Principe,,(67)
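The `IQ` column has dtype `object`, and some entries are wrapped in parentheses, e.g. `(69)`. In the sample shown earlier those rows have no measured IQ, so the parentheses presumably mark values that were not directly measured. That reading is an assumption, since the section does not state it. A hedged sketch of turning the column into numbers while keeping that information in a separate flag follows. The column name `IQ_bracketed` is hypothetical.

```python
import pandas as pd

# Toy series mirroring the string formats visible in the 'IQ' column
iq_raw = pd.Series(
    ["107.1", "84.2", "(77)", "(69)"],
    index=["Singapore", "Algeria", "Comoros", "GuineaBissau"],
)

# Hypothetical flag: True where the value is wrapped in parentheses
is_bracketed = iq_raw.str.strip().str.startswith("(")

# Remove surrounding parentheses and convert; unparseable entries become NaN
iq_numeric = pd.to_numeric(iq_raw.str.strip().str.strip("()"), errors="coerce")

parsed = pd.DataFrame({"IQ": iq_numeric, "IQ_bracketed": is_bracketed})
print(parsed)
```

With a numeric column, `dropna` and later statistics behave as expected, and the flag makes it possible to run the analysis with or without the bracketed values.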


# 1. Cleaning the Data


## 1.1 Merge Data


In [22]:
outer_join_df = hdi_df.merge(iq_df, how='outer',on='Country')
outer_join_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 248 entries, Afghanistan to Sao Tome & Principe
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   HDI Group  191 non-null    object 
 1   HDI        191 non-null    float64
 2   Region     151 non-null    object 
 3   HDI_Rank   191 non-null    float64
 4   IQ_Rank    162 non-null    float64
 5   IQ         204 non-null    object 
dtypes: float64(3), object(3)
memory usage: 13.6+ KB


In [23]:
missing_in_hdi = outer_join_df[outer_join_df['HDI'].isnull()].index
missing_in_hdi = sorted(missing_in_hdi.tolist())
print("In IQ but missing in HDI: \n", missing_in_hdi)

missing_in_IQ = outer_join_df[outer_join_df['IQ'].isnull()].index
missing_in_IQ = sorted(missing_in_IQ.tolist())
print('In HDI but missing in IQ: \n', missing_in_IQ)
# A few countries appear in both datasets under different names, which inflates the number of rows in the outer join (248)
iq_renaming = {
'(Serbia &) Montenegro' : 'Montenegro',
'Central African Rep.' : 'Central African Republic',
'CostaRica' : 'Costa Rica',
'GuineaBissau' : 'Guinea-Bissau',
'Myanmar/Burma' : 'Myanmar',
'Papua N.G.' : 'Papua New Guinea',
'CzechRep.' : 'Czechia',
'Macedonia' : 'North Macedonia'
}
hdi_renaming = {
    'Bolivia (Plurinational State of)' : 'Bolivia',
    'Bosnia and Herzegovina': 'Bosnia',
    'Brunei Darussalam' : 'Brunei',
    'Cabo Verde': 'Cape Verde',
    'Congo' : 'Congo (Brazzaville)',
    'Congo (Democratic Republic of the)' : 'Congo (Zaire)',
    'Hong Kong, China (SAR)' : 'Hong Kong',
    'Iran (Islamic Republic of)' : 'Iran', 
    "Lao People's Democratic Republic" : 'Laos',
    'Moldova (Republic of)' : 'Moldova',
    'Palestine, State of' : 'Palestine',
    'Russian Federation': 'Russia',
    "Côte d'Ivoire": "Cote d'Ivoire",

}
iq_df = iq_df.rename(index = iq_renaming)
hdi_df = hdi_df.rename(index = hdi_renaming)


In IQ but missing in HDI: 
 ['(Serbia &) Montenegro', 'Antigua/Barbuda', 'Bermuda', 'Bolivia', 'Bosnia', 'Brunei', 'Cape Verde', 'Central African Rep.', 'Congo (Brazzaville)', 'Congo (Zaire)', 'Cook Islands', 'CostaRica', "Cote d'Ivoire", 'CzechRep.', 'EastTimor', 'England', 'Greenland', 'GuineaBissau', 'Hong Kong', 'Iran', "Korea (Democratic People's Rep. of)", 'Korea: North', 'Korea: South', 'Laos', 'Macao', 'Macedonia', 'Mariana Islands', 'Micronesia', 'Moldova', 'Monaco', 'Myanmar/Burma', 'Nauru', 'Netherlands Antilles', 'New Caledonia', 'Palestine', 'Papua N.G.', 'Puerto Rico', 'Russia', 'Samoa (Western)', 'Sao Tome & Principe', 'Scotland', 'Serbia & (Montenegro)', 'Somalia', 'St  Helena', 'St Kitts & Nevis', 'St Lucia', 'St Vincent', 'Swaziland', 'Syria', 'Taiwan', 'Tanzania', 'Tibet', 'Trinidad & Tobago', 'USA', 'Venezuela', 'Vietnam', 'Zanzibar']
In HDI but missing in IQ: 
 ['Antigua and Barbuda', 'Bolivia (Plurinational State of)', 'Bosnia and Herzegovina', 'Brunei Darussalam'

In [24]:

outer_join_df = hdi_df.merge(iq_df, how='outer',on='Country')
missing_in_hdi = outer_join_df[outer_join_df['HDI'].isnull()].index
missing_in_hdi = sorted(missing_in_hdi.tolist())
print("In IQ but missing in HDI: \n", missing_in_hdi)

missing_in_IQ = outer_join_df[outer_join_df['IQ'].isnull()].index
missing_in_IQ = sorted(missing_in_IQ.tolist())
print('In HDI but missing in IQ: \n', missing_in_IQ)

In IQ but missing in HDI: 
 ['Antigua/Barbuda', 'Bermuda', 'Cook Islands', 'EastTimor', 'England', 'Greenland', "Korea (Democratic People's Rep. of)", 'Korea: North', 'Korea: South', 'Macao', 'Mariana Islands', 'Micronesia', 'Monaco', 'Nauru', 'Netherlands Antilles', 'New Caledonia', 'Puerto Rico', 'Samoa (Western)', 'Sao Tome & Principe', 'Scotland', 'Serbia & (Montenegro)', 'Somalia', 'St  Helena', 'St Kitts & Nevis', 'St Lucia', 'St Vincent', 'Swaziland', 'Syria', 'Taiwan', 'Tanzania', 'Tibet', 'Trinidad & Tobago', 'USA', 'Venezuela', 'Vietnam', 'Zanzibar']
In HDI but missing in IQ: 
 ['Antigua and Barbuda', 'Eswatini (Kingdom of)', "Korea (Democratic People's Rep. of)", 'Korea (Republic of)', 'Micronesia (Federated States of)', 'Monaco', 'Nauru', 'Palau', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Vincent and the Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Serbia', 'South Sudan', 'Syrian Arab Republic', 'Tanzania (United Republic of)', 'Timor-Leste', 'Trinidad
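Names such as 'Monaco' and 'Nauru' show up in both lists above. The check `outer_join_df['HDI'].isnull()` cannot tell a country that is absent from the HDI table apart from one that is present with a NaN score. One possible alternative, sketched here on toy frames, is the `indicator=True` option of `merge`, which records where each row came from in a `_merge` column. The frames `hdi_toy` and `iq_toy` are illustrative stand-ins. IQ values not shown in this section are left as NaN.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for hdi_df and iq_df, indexed by country name
hdi_toy = pd.DataFrame(
    {"HDI": [np.nan, 0.707, 0.565]},
    index=pd.Index(["Monaco", "Samoa", "Zambia"], name="Country"),
)
iq_toy = pd.DataFrame(
    {"IQ": [np.nan, np.nan, 104.6]},  # NaN = value not shown in this section
    index=pd.Index(["Samoa (Western)", "Zambia", "Taiwan"], name="Country"),
)

# Outer join on the index; '_merge' is 'left_only', 'right_only' or 'both'
merged = hdi_toy.merge(
    iq_toy, how="outer", left_index=True, right_index=True, indicator=True
)

only_hdi = sorted(merged.index[merged["_merge"] == "left_only"])
only_iq = sorted(merged.index[merged["_merge"] == "right_only"])

print("Only in HDI:", only_hdi)
print("Only in IQ: ", only_iq)
```

Because the comparison uses row origin rather than cell values, 'Monaco' is reported only on the HDI side even though its HDI score is NaN.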

In [32]:
merged_df = outer_join_df.copy()

In [33]:
# Countries with NO IQ values
nan_indices = merged_df[merged_df['IQ'].isnull()].index
print(nan_indices)


Index(['Antigua and Barbuda', 'Micronesia (Federated States of)',
       'Saint Kitts and Nevis', 'Korea (Republic of)', 'Saint Lucia', 'Monaco',
       'Nauru', 'Palau', 'Korea (Democratic People's Rep. of)', 'San Marino',
       'Serbia', 'South Sudan', 'Sao Tome and Principe',
       'Eswatini (Kingdom of)', 'Syrian Arab Republic', 'Timor-Leste',
       'Trinidad and Tobago', 'Tuvalu', 'Tanzania (United Republic of)',
       'United States', 'Saint Vincent and the Grenadines',
       'Venezuela (Bolivarian Republic of)', 'Viet Nam', 'Samoa'],
      dtype='object', name='Country')


# 2. INTRODUCTION

*In the introduction, provide the description of the problem addressed (the context of your data) and the project objectives.
Very briefly describe the analysis design and how it accomplishes the stated objectives. 
State your research hypotheses in a human-understandable language.
What  can the results be used for?*

**Data**  
The data includes all countries (as the index), their global rank and value on the Human Development Index (HDI), and their average IQ, from the year 2021. It was merged from two distinct sources: the HDI data comes from the public databases of the United Nations Development Programme (UNDP), and the IQ data from __ INSERT SOURCE __.

**Problem**
This project aims to investigate the correlation between IQ and HDI across different countries and regions of the world.

**Hypotheses**
Average country IQ and the Human Development Index are positively correlated.
Data is mainly missing for less developed countries and regions.
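
As a rough sketch of how these two hypotheses could be checked once the merged frame is cleaned, the example below computes a Pearson correlation, a rank-based (Spearman-style) correlation, and the share of missing IQ values per HDI group. The numbers in `toy` are made up for illustration and are not taken from either dataset.

```python
import numpy as np
import pandas as pd

# Made-up illustrative values, NOT taken from the HDI or IQ datasets
toy = pd.DataFrame({
    "HDI Group": ["Low", "Low", "Medium", "Medium", "High", "High"],
    "HDI": [0.45, 0.50, 0.60, 0.65, 0.80, 0.90],
    "IQ": [70.0, np.nan, 78.0, 75.0, 90.0, 98.0],
})

# Hypothesis 1: Pearson correlation (pairs with a missing value are skipped)
pearson_r = toy["HDI"].corr(toy["IQ"])

# Rank-based correlation: Pearson correlation of the ranks of complete rows
complete = toy.dropna(subset=["HDI", "IQ"])
spearman_r = complete["HDI"].rank().corr(complete["IQ"].rank())

# Hypothesis 2: share of missing IQ values within each HDI group
missing_share = toy["IQ"].isna().groupby(toy["HDI Group"]).mean()

print("Pearson r:", round(pearson_r, 3))
print("Spearman rho:", round(spearman_r, 3))
print(missing_share)
```

A positive coefficient would be consistent with the first hypothesis. A higher missing share in the lower HDI groups would be consistent with the second.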

**Analysis Design**
The design is to first merge the two sources into one comprehensive pandas DataFrame. After further preparation and cleaning of the data, we plan to do a detailed univariate analysis
such as ...
....
followed by bivariate analysis from
...

**What the results can be used for**
The results can be used by legislators to assess their country's position, that is, whether it is under- or overperforming in terms of IQ. If a country's IQ is lower than that of other countries with a similar HDI, it could set goals in education, mental fitness, etc. to improve its average IQ in the long run.
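
One possible way to make "under- or overperforming" concrete is to fit a simple trend of IQ against HDI and look at each country's residual. The sketch below assumes a straight-line fit, which is only one of several reasonable choices. The values and country labels in `toy` are made up, and the column name `IQ_residual` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up illustrative values, NOT taken from the HDI or IQ datasets
toy = pd.DataFrame(
    {"HDI": [0.45, 0.60, 0.65, 0.80, 0.90],
     "IQ": [70.0, 78.0, 75.0, 90.0, 98.0]},
    index=["A", "B", "C", "D", "E"],  # hypothetical country labels
)

# Least-squares line: IQ is approximated by slope * HDI + intercept
slope, intercept = np.polyfit(toy["HDI"], toy["IQ"], deg=1)

# Residual > 0: IQ above the trend for that HDI; residual < 0: below it
toy["IQ_residual"] = toy["IQ"] - (slope * toy["HDI"] + intercept)

print(toy.sort_values("IQ_residual"))
```

Countries with the most negative residuals would be the ones with a lower IQ than the trend suggests for their HDI.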

# 3. DATA CLEANING AND PREPARATION

What did you need to do to clean and prepare your dataset?
Missing values, duplicates, inconsistent data types…


# 4.  DESCRIPTIVE STATISTICS

## 4.1  Univariate analysis
Histogram and metrics introduced in class. Outliers identification. Interpret and discuss your results.

## 4.2  Bivariate analysis
Scatter plots and correlation for pairs of variables of interest. Interpret and discuss your results.



# 5.  DISCUSSION AND PRELIMINARY CONCLUSIONS 

Discuss the initial insights and how they align with the objectives set in the Introduction. Briefly address any limitations or challenges encountered in the data or analysis. Reflect on the implications of these findings and how they might guide future research directions or applications.
