<a href="https://colab.research.google.com/github/nishbh01/nishbh01/blob/main/data_301_FinalProjectCode_GroupTwo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## World Happiness Report - Final Project

In [25]:
import pandas as pd
import numpy as np
from plotnine import *

%matplotlib inline

import seaborn as sb
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px

from pandas.api.types import CategoricalDtype

**Reading and Writing Files**

This project uses a single data file, WorldHappinessReport.csv, which we import with the pd.read_csv function.

In [111]:
WHRdf = pd.read_csv('WorldHappinessReport.csv')
WHRdf.head()

Unnamed: 0,Country Name,Regional Indicator,Year,Life Ladder,Log GDP Per Capita,Social Support,Healthy Life Expectancy At Birth,Freedom To Make Life Choices,Generosity,Perceptions Of Corruption,Positive Affect,Negative Affect,Confidence In National Government
0,Afghanistan,South Asia,2008,3.72359,7.350416,0.450662,50.5,0.718114,0.167652,0.881686,0.414297,0.258195,0.612072
1,Afghanistan,South Asia,2009,4.401778,7.508646,0.552308,50.799999,0.678896,0.190809,0.850035,0.481421,0.237092,0.611545
2,Afghanistan,South Asia,2010,4.758381,7.6139,0.539075,51.099998,0.600127,0.121316,0.706766,0.516907,0.275324,0.299357
3,Afghanistan,South Asia,2011,3.831719,7.581259,0.521104,51.400002,0.495901,0.163571,0.731109,0.479835,0.267175,0.307386
4,Afghanistan,South Asia,2012,3.782938,7.660506,0.520637,51.700001,0.530935,0.237588,0.77562,0.613513,0.267919,0.43544


The dataframe above holds the data imported from WorldHappinessReport.csv. After reading what each column represents, we decided to rename some columns so that the names describe the data more accurately and support our analysis. We also convert all column names to camelCase, since the original names contain spaces, which makes the columns harder to reference.

In [112]:
# Renaming the columns
dfModified = WHRdf.rename(columns={
    'Country Name': 'country',
    'Regional Indicator': 'region',
    'Year': 'year',
    'Life Ladder': 'happinessScore',
    'Log GDP Per Capita': 'logGdpPerCapita',
    'Social Support': 'socialSupport',
    'Healthy Life Expectancy At Birth': 'LifeExpAtBirth',
    'Freedom To Make Life Choices': 'lifeChoicesFreedom',
    'Generosity': 'generosity',
    'Perceptions Of Corruption': 'corruptionPreception',
    'Positive Affect': 'positiveAffect',
    'Negative Affect': 'negativeAffect',
    'Confidence In National Government': 'confInNatGov'
})
dfModified.head()

Unnamed: 0,country,region,year,happinessScore,logGdpPerCapita,socialSupport,LifeExpAtBirth,lifeChoicesFreedom,generosity,corruptionPreception,positiveAffect,negativeAffect,confInNatGov
0,Afghanistan,South Asia,2008,3.72359,7.350416,0.450662,50.5,0.718114,0.167652,0.881686,0.414297,0.258195,0.612072
1,Afghanistan,South Asia,2009,4.401778,7.508646,0.552308,50.799999,0.678896,0.190809,0.850035,0.481421,0.237092,0.611545
2,Afghanistan,South Asia,2010,4.758381,7.6139,0.539075,51.099998,0.600127,0.121316,0.706766,0.516907,0.275324,0.299357
3,Afghanistan,South Asia,2011,3.831719,7.581259,0.521104,51.400002,0.495901,0.163571,0.731109,0.479835,0.267175,0.307386
4,Afghanistan,South Asia,2012,3.782938,7.660506,0.520637,51.700001,0.530935,0.237588,0.77562,0.613513,0.267919,0.43544
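
For columns that only need their spaces removed, the new names could also be derived programmatically instead of typed by hand. The following is a minimal sketch, not the approach used in this notebook; `to_camel` is a hypothetical helper and the column list is a small sample.

```python
import re

def to_camel(name: str) -> str:
    """Convert a space-separated column name to camelCase (hypothetical helper)."""
    words = re.split(r"\s+", name.strip())
    # First word fully lowercase, every later word capitalized ("GDP" -> "Gdp")
    return words[0].lower() + "".join(w.capitalize() for w in words[1:])

# A few of the original column names, used here only as sample input
original = ["Country Name", "Life Ladder", "Log GDP Per Capita", "Generosity"]
renamed = {col: to_camel(col) for col in original}
print(renamed)
```

A dictionary like `renamed` can be passed to `rename(columns=...)`. Names that change meaning, such as 'Life Ladder' becoming 'happinessScore', still need a manual entry.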


**Data Description**

In [113]:
dfModified.ndim

2

In [114]:
dfModified.shape

(2199, 13)

Using the .ndim and .shape attributes, we can see that the dataframe is two-dimensional and consists of 2199 rows (observations) and 13 columns (variables).

In [115]:
dfModified.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2199 entries, 0 to 2198
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   country               2199 non-null   object 
 1   region                2087 non-null   object 
 2   year                  2199 non-null   int64  
 3   happinessScore        2199 non-null   float64
 4   logGdpPerCapita       2179 non-null   float64
 5   socialSupport         2186 non-null   float64
 6   LifeExpAtBirth        2145 non-null   float64
 7   lifeChoicesFreedom    2166 non-null   float64
 8   generosity            2126 non-null   float64
 9   corruptionPreception  2083 non-null   float64
 10  positiveAffect        2175 non-null   float64
 11  negativeAffect        2183 non-null   float64
 12  confInNatGov          1838 non-null   float64
dtypes: float64(10), int64(1), object(2)
memory usage: 223.5+ KB


The World Happiness Report dataset consists of both categorical and quantitative variables. The categorical variables are 'country' and 'region'; the remaining 11 columns are numerical. The dataset has 2199 rows, so ideally every column would show 2199 non-null values. That is not the case: every column except 'country', 'year', and 'happinessScore' has missing values.

In [116]:
dfModified.describe()

Unnamed: 0,year,happinessScore,logGdpPerCapita,socialSupport,LifeExpAtBirth,lifeChoicesFreedom,generosity,corruptionPreception,positiveAffect,negativeAffect,confInNatGov
count,2199.0,2199.0,2179.0,2186.0,2145.0,2166.0,2126.0,2083.0,2175.0,2183.0,1838.0
mean,2014.161437,5.479226,9.389766,0.810679,63.294583,0.747858,9.6e-05,0.745195,0.652143,0.271501,0.483999
std,4.718736,1.125529,1.153387,0.120952,6.901104,0.14015,0.161083,0.185837,0.105922,0.086875,0.193071
min,2005.0,1.281271,5.526723,0.228217,6.72,0.257534,-0.337527,0.035198,0.178886,0.082737,0.068769
25%,2010.0,4.64675,8.499764,0.746609,59.119999,0.656528,-0.112116,0.688139,0.571684,0.20766,0.332549
50%,2014.0,5.432437,9.498955,0.835535,65.050003,0.769821,-0.022671,0.799654,0.663063,0.260671,0.46714
75%,2018.0,6.30946,10.373216,0.904792,68.5,0.859382,0.09207,0.868827,0.737936,0.322894,0.618846
max,2022.0,8.018934,11.663788,0.987343,74.474998,0.985178,0.702708,0.983276,0.883586,0.70459,0.993604


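The observations span 2005 to 2022. The happiness score ranges from about 1.28 to 8.02, with a mean of about 5.48 and a median of about 5.43. The minimum of 'LifeExpAtBirth' (6.72) lies far below its first quartile (59.12), so it may be an outlier worth checking. 'generosity' is centered near zero and takes negative values.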

**Identifying and Handling Missing Values**

In [117]:
# getting the no. of missing values for each variable in the dataset
dfModified.isna().sum()

country                   0
region                  112
year                      0
happinessScore            0
logGdpPerCapita          20
socialSupport            13
LifeExpAtBirth           54
lifeChoicesFreedom       33
generosity               73
corruptionPreception    116
positiveAffect           24
negativeAffect           16
confInNatGov            361
dtype: int64

In [118]:
# getting the total no. of missing values in the dataset
dfModified.isna().sum().sum()

822

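In total, 822 values are missing across 10 of the 13 columns. 'confInNatGov' has the most missing values (361), followed by 'corruptionPreception' (116) and 'region' (112).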
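
To put the missing-value counts in proportion, each column's count can be reported alongside its share of all rows. The sketch below runs on a made-up toy frame; `missing_summary` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy data standing in for dfModified (values are made up for illustration)
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "region": ["X", None, "Y", "Y"],
    "confInNatGov": [0.5, np.nan, np.nan, 0.7],
})

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, largest first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only columns that actually have missing values
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(dfModified)` would give the same kind of table for the real data.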

*Dealing With Categorical/Qualitative Missing Values*

The only categorical variable with missing values is 'region', with 112 missing values in total. However, filtering the rows where the region is missing and listing the unique values of the 'country' column shows that only 21 countries are affected. Therefore, instead of imputing with the mode, which could assign an incorrect region, we impute these values manually using the continent listing from worldpopulationreview.com (https://worldpopulationreview.com/country-rankings/list-of-countries-by-continent).

In [119]:
# Filtering the missing values in region column
missing_regions = dfModified.loc[dfModified['region'].isna()]

# printing the unique values in the country column
unique_countries = missing_regions['country'].unique()
unique_countries

array(['Angola', 'Belize', 'Bhutan', 'Central African Republic',
       'Congo (Kinshasa)', 'Cuba', 'Czechia', 'Djibouti', 'Eswatini',
       'Guyana', 'Oman', 'Qatar', 'Somalia', 'Somaliland region',
       'South Sudan', 'State of Palestine', 'Sudan', 'Suriname', 'Syria',
       'Trinidad and Tobago', 'Turkiye'], dtype=object)

The total number of countries with missing region values is:

In [60]:
len(unique_countries)

21

In [120]:
# Manually imputing missing values
region_by_country = {
    'Angola': 'Africa',
    'Belize': 'North America',
    'Bhutan': 'Asia',
    'Central African Republic': 'Africa',
    'Congo (Kinshasa)': 'Africa',
    'Cuba': 'North America',
    'Czechia': 'Europe',
    'Djibouti': 'Africa',
    'Eswatini': 'Africa',
    'Guyana': 'South America',
    'Oman': 'Asia',
    'Qatar': 'Asia',
    'Somalia': 'Africa',
    'Somaliland region': 'Africa',
    'South Sudan': 'Africa',
    'State of Palestine': 'Asia',
    'Sudan': 'Africa',
    'Suriname': 'South America',
    'Syria': 'Asia',
    'Trinidad and Tobago': 'North America',
    'Turkiye': 'Europe',
}
# Set the region for every row of the listed countries
mask = dfModified['country'].isin(list(region_by_country))
dfModified.loc[mask, 'region'] = dfModified.loc[mask, 'country'].map(region_by_country)

dfModified.head()


Unnamed: 0,country,region,year,happinessScore,logGdpPerCapita,socialSupport,LifeExpAtBirth,lifeChoicesFreedom,generosity,corruptionPreception,positiveAffect,negativeAffect,confInNatGov
0,Afghanistan,South Asia,2008,3.72359,7.350416,0.450662,50.5,0.718114,0.167652,0.881686,0.414297,0.258195,0.612072
1,Afghanistan,South Asia,2009,4.401778,7.508646,0.552308,50.799999,0.678896,0.190809,0.850035,0.481421,0.237092,0.611545
2,Afghanistan,South Asia,2010,4.758381,7.6139,0.539075,51.099998,0.600127,0.121316,0.706766,0.516907,0.275324,0.299357
3,Afghanistan,South Asia,2011,3.831719,7.581259,0.521104,51.400002,0.495901,0.163571,0.731109,0.479835,0.267175,0.307386
4,Afghanistan,South Asia,2012,3.782938,7.660506,0.520637,51.700001,0.530935,0.237588,0.77562,0.613513,0.267919,0.43544


In [121]:
dfModified.isna().sum()

country                   0
region                    0
year                      0
happinessScore            0
logGdpPerCapita          20
socialSupport            13
LifeExpAtBirth           54
lifeChoicesFreedom       33
generosity               73
corruptionPreception    116
positiveAffect           24
negativeAffect           16
confInNatGov            361
dtype: int64

After imputing the missing values manually, we applied isna().sum() again to confirm that all missing values in the region column had been handled. As shown above, the region column now has zero missing values.

In [122]:
dfModified.shape

(2199, 13)

In [123]:
# looking for unique values in Region Column
dfModified['region'].unique()

array(['South Asia', 'Central and Eastern Europe',
       'Middle East and North Africa', 'Africa',
       'Latin America and Caribbean',
       'Commonwealth of Independent States', 'North America and ANZ',
       'Western Europe', 'North America', 'Sub-Saharan Africa', 'Asia',
       'Southeast Asia', 'East Asia', 'Europe', 'South America'],
      dtype=object)

In [124]:
country_continent = pd.read_csv('countryContinent.csv')

In [47]:
country_continent.head()

Unnamed: 0,country,code_2,code_3,country_code,iso_3166_2,continent,sub_region,region_code,sub_region_code
0,Afghanistan,AF,AFG,4,ISO 3166-2:AF,Asia,Southern Asia,142.0,34.0
1,Åland Islands,AX,ALA,248,ISO 3166-2:AX,Europe,Northern Europe,150.0,154.0
2,Albania,AL,ALB,8,ISO 3166-2:AL,Europe,Southern Europe,150.0,39.0
3,Algeria,DZ,DZA,12,ISO 3166-2:DZ,Africa,Northern Africa,2.0,15.0
4,American Samoa,AS,ASM,16,ISO 3166-2:AS,Oceania,Polynesia,9.0,61.0


In [125]:
country_continent = country_continent[['country', 'continent']]
country_continent.head()

Unnamed: 0,country,continent
0,Afghanistan,Asia
1,Åland Islands,Europe
2,Albania,Europe
3,Algeria,Africa
4,American Samoa,Oceania


In [126]:
# merging the two datasets on the 'country' column
WHRdf1 = pd.merge(dfModified, country_continent, on = 'country', how = 'left')


In [127]:
# rearranging columns
WHRdf1 = WHRdf1[['country', 'continent', 'region', 'year', 'happinessScore', 'logGdpPerCapita', 'socialSupport', 'LifeExpAtBirth', 'lifeChoicesFreedom', 'generosity', 'corruptionPreception', 'positiveAffect', 'negativeAffect', 'confInNatGov']]

In [128]:
# Listing the countries whose continent is still missing after the merge
print(WHRdf1.loc[WHRdf1['continent'].isna()]['country'].unique())


['Bolivia' 'Congo (Brazzaville)' 'Congo (Kinshasa)' 'Czechia' 'Eswatini'
 'Hong Kong S.A.R. of China' 'Iran' 'Ivory Coast' 'Kosovo' 'Laos'
 'Moldova' 'North Macedonia' 'Russia' 'Somaliland region' 'South Korea'
 'State of Palestine' 'Syria' 'Taiwan Province of China' 'Tanzania'
 'Turkiye' 'United Kingdom' 'United States' 'Venezuela' 'Vietnam']
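
Unmatched keys after a left merge can also be found directly with the `indicator` parameter of `merge`, which adds a `_merge` column marking rows that exist only in the left frame. The sketch below uses two made-up toy frames; the alternate spellings in `right` are illustrative assumptions, not the actual contents of countryContinent.csv.

```python
import pandas as pd

# Toy frames standing in for dfModified and country_continent (made-up rows)
left = pd.DataFrame({
    "country": ["Albania", "Czechia", "Vietnam"],
    "year": [2010, 2010, 2010],
})
right = pd.DataFrame({
    "country": ["Albania", "Czech Republic", "Viet Nam"],
    "continent": ["Europe", "Europe", "Asia"],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left join
merged = left.merge(right, on="country", how="left", indicator=True)
unmatched = sorted(merged.loc[merged["_merge"] == "left_only", "country"].unique())
print(unmatched)
```

Countries listed in `unmatched` are the ones whose names are spelled differently, or are absent, in the lookup table.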


In [129]:
# creating replacements
country_to_continent = {'Bolivia': 'South America',
                        'Congo (Brazzaville)': 'Africa',
                        'Congo (Kinshasa)': 'Africa',
                        'Czechia': 'Europe',
                        'Eswatini': 'Africa',
                        'Hong Kong S.A.R. of China': 'Asia',
                        'Iran': 'Middle East and North Africa',
                        'Ivory Coast': 'Africa',
                        'Kosovo': 'Europe',
                        'Laos': 'Asia',
                        'Moldova': 'Europe',
                        'North Macedonia': 'Europe',
                        'Russia': 'Europe',
                        'Somaliland region': 'Africa',
                        'South Korea': 'Asia',
                        'State of Palestine': 'Middle East and North Africa',
                        'Syria': 'Middle East and North Africa',
                        'Taiwan Province of China': 'Asia',
                        'Tanzania': 'Africa',
                        'Turkiye': 'Europe',
                        'United Kingdom': 'Europe',
                        'United States': 'North America',
                        'Venezuela': 'South America',
                        'Vietnam': 'Asia'}
# replacing
WHRdf1['continent'] = WHRdf1['continent'].fillna(WHRdf1['country'].map(country_to_continent))

In [130]:
# let's see if we have any missing values in our continent column
WHRdf1['continent'].isna().sum()
# still keeping region column, just in case. 

0
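
The continent column now mixes two vocabularies: values taken from countryContinent.csv (for example 'Americas', which appears in the grouped counts later in this notebook) and the manually assigned labels ('South America', 'North America', 'Middle East and North Africa'). If a single vocabulary is wanted, the labels could be harmonized with `replace`. This is only a sketch on a toy Series; the mapping is an assumption, and sending 'Middle East and North Africa' to 'Asia' is only reasonable here because the three countries given that label above (Iran, State of Palestine, Syria) are in Asia.

```python
import pandas as pd

# Toy column with mixed labels like those present after the manual fill
continent = pd.Series([
    "Americas", "South America", "North America",
    "Middle East and North Africa", "Asia", "Europe",
])

# Hypothetical mapping onto the lookup file's vocabulary (an assumption)
to_standard = {
    "South America": "Americas",
    "North America": "Americas",
    "Middle East and North Africa": "Asia",
}

# Labels not listed in the mapping are left unchanged
harmonized = continent.replace(to_standard)
print(harmonized.value_counts())
```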

In [131]:
WHRdf1.head()

Unnamed: 0,country,continent,region,year,happinessScore,logGdpPerCapita,socialSupport,LifeExpAtBirth,lifeChoicesFreedom,generosity,corruptionPreception,positiveAffect,negativeAffect,confInNatGov
0,Afghanistan,Asia,South Asia,2008,3.72359,7.350416,0.450662,50.5,0.718114,0.167652,0.881686,0.414297,0.258195,0.612072
1,Afghanistan,Asia,South Asia,2009,4.401778,7.508646,0.552308,50.799999,0.678896,0.190809,0.850035,0.481421,0.237092,0.611545
2,Afghanistan,Asia,South Asia,2010,4.758381,7.6139,0.539075,51.099998,0.600127,0.121316,0.706766,0.516907,0.275324,0.299357
3,Afghanistan,Asia,South Asia,2011,3.831719,7.581259,0.521104,51.400002,0.495901,0.163571,0.731109,0.479835,0.267175,0.307386
4,Afghanistan,Asia,South Asia,2012,3.782938,7.660506,0.520637,51.700001,0.530935,0.237588,0.77562,0.613513,0.267919,0.43544


In [132]:
WHRdf1.isna().sum()

country                   0
continent                 0
region                    0
year                      0
happinessScore            0
logGdpPerCapita          20
socialSupport            13
LifeExpAtBirth           54
lifeChoicesFreedom       33
generosity               73
corruptionPreception    116
positiveAffect           24
negativeAffect           16
confInNatGov            361
dtype: int64

In [133]:
# most missing values are in 'confInNatGov' and 'corruptionPreception'; count the rows with a missing 'confInNatGov' per continent
WHRdf1.loc[WHRdf1['confInNatGov'].isna()].groupby('continent').size()

continent
Africa                           80
Americas                         18
Asia                            159
Europe                           75
Middle East and North Africa     24
North America                     1
Oceania                           2
South America                     2
dtype: int64

Most rows with a missing 'confInNatGov' value come from Asia (159) and Africa (80). One possible explanation, given the frequent dissatisfaction with government in these continents, is that respondents chose not to answer survey questions about their government; the data alone cannot confirm this.

In [134]:
WHRdf1.loc[WHRdf1['corruptionPreception'].isna()].groupby('continent').size()

continent
Africa      15
Americas     1
Asia        94
Europe       6
dtype: int64

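Missing 'corruptionPreception' values are concentrated in Asia, which accounts for 94 of the 116 missing rows. As with confidence in government, this may reflect reluctance to answer questions about the government, although missingness alone does not tell us how respondents feel.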
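
Raw counts of missing rows depend on how many rows each continent contributes, so comparing the share of missing rows per continent can be more informative. The following is a minimal sketch on a made-up toy frame, not a result from the real data.

```python
import numpy as np
import pandas as pd

# Toy data: Asia has more rows overall, so it also has more missing rows
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Asia", "Asia", "Europe", "Europe"],
    "confInNatGov": [np.nan, np.nan, 0.4, 0.6, np.nan, 0.5],
})

# 1 where the value is missing, 0 otherwise
is_missing = toy["confInNatGov"].isna().astype(int)

# Per continent: number of missing rows, total rows, and fraction missing
missing_stats = is_missing.groupby(toy["continent"]).agg(
    missing="sum", rows="count", share="mean"
)
print(missing_stats)
```

In this toy example Asia has twice as many missing rows as Europe, yet both continents are missing the same share of their rows.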

In [135]:
# imputing missing numeric values with the mean of the same country
num_cols = WHRdf1.select_dtypes(include='number').columns
WHRdf1[num_cols] = WHRdf1[num_cols].fillna(WHRdf1.groupby('country')[num_cols].transform('mean'))




In [137]:
WHRdf1.isna().sum()

country                   0
continent                 0
region                    0
year                      0
happinessScore            0
logGdpPerCapita           9
socialSupport             1
LifeExpAtBirth           32
lifeChoicesFreedom        0
generosity                9
corruptionPreception     29
positiveAffect            2
negativeAffect            1
confInNatGov            130
dtype: int64

In [139]:
still_missing = WHRdf1.loc[WHRdf1.isna().any(axis =1)]
still_missing

Unnamed: 0,country,continent,region,year,happinessScore,logGdpPerCapita,socialSupport,LifeExpAtBirth,lifeChoicesFreedom,generosity,corruptionPreception,positiveAffect,negativeAffect,confInNatGov
29,Algeria,Africa,Middle East and North Africa,2010,5.463567,9.306355,0.814826,65.500000,0.592696,-0.209753,0.618038,0.535673,0.267095,
30,Algeria,Africa,Middle East and North Africa,2011,5.317194,9.315958,0.810234,65.599998,0.529561,-0.185084,0.637982,0.502736,0.254897,
31,Algeria,Africa,Middle East and North Africa,2012,5.604596,9.329962,0.839397,65.699997,0.586663,-0.176571,0.690116,0.540059,0.229716,
32,Algeria,Africa,Middle East and North Africa,2014,6.354898,9.355415,0.818189,65.900002,0.530804,-0.140975,0.697673,0.558359,0.176866,
33,Algeria,Africa,Middle East and North Africa,2016,5.340854,9.383312,0.748588,66.099998,0.530804,-0.140975,0.697673,0.565026,0.377112,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2049,United Arab Emirates,Asia,Middle East and North Africa,2018,6.603744,11.178160,0.851041,65.849998,0.943664,0.044617,0.299117,0.722823,0.302042,
2050,United Arab Emirates,Asia,Middle East and North Africa,2019,6.710783,11.181391,0.861533,66.000000,0.911420,0.119298,0.299117,0.730052,0.283763,
2051,United Arab Emirates,Asia,Middle East and North Africa,2020,6.458392,11.122373,0.826756,66.150002,0.942161,0.050477,0.299117,0.702395,0.298480,
2052,United Arab Emirates,Asia,Middle East and North Africa,2021,6.733068,11.152440,0.826061,66.300003,0.951328,0.151219,0.299117,0.696670,0.217110,


It looks like some countries have no entries at all for certain variables, such as 'confInNatGov', in any year since 2005, so a country mean cannot be computed for them. We will therefore impute the remaining missing values with the mean of the continent. Perhaps people in these countries did not have the opportunity, or the interest, to comment on their government.

In [140]:
num_cols = WHRdf1.select_dtypes(include='number').columns
WHRdf1[num_cols] = WHRdf1[num_cols].fillna(WHRdf1.groupby('continent')[num_cols].transform('mean'))



In [141]:
WHRdf1.isna().any()

country                 False
continent               False
region                  False
year                    False
happinessScore          False
logGdpPerCapita         False
socialSupport           False
LifeExpAtBirth          False
lifeChoicesFreedom      False
generosity              False
corruptionPreception    False
positiveAffect          False
negativeAffect          False
confInNatGov            False
dtype: bool
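
The two-stage imputation used above (country mean first, then continent mean for whatever is still missing) can be illustrated on a small example. This is a sketch under stated assumptions: the toy frame and its values are made up, and only numeric columns are imputed.

```python
import numpy as np
import pandas as pd

# Toy data: country B has no observed score at all
toy = pd.DataFrame({
    "country":   ["A", "A", "B", "B", "C"],
    "continent": ["X", "X", "X", "X", "X"],
    "score":     [1.0, np.nan, np.nan, np.nan, 3.0],
})
num_cols = list(toy.select_dtypes(include="number").columns)

# Stage 1: fill with the mean of the same country
stage1 = toy.copy()
stage1[num_cols] = toy[num_cols].fillna(
    toy.groupby("country")[num_cols].transform("mean")
)

# Stage 2: fill what is still missing with the continent mean,
# computed on the stage-1 result (so it includes stage-1 imputed values)
stage2 = stage1.copy()
stage2[num_cols] = stage1[num_cols].fillna(
    stage1.groupby("continent")[num_cols].transform("mean")
)
print(stage2)
```

Country A's gap is filled from its own mean, while country B, which has no observations, stays missing after stage 1 and only receives a value from the continent mean in stage 2.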

WHRdf1 is now our final dataframe (we could also rename it). For the following steps, we will use only `WHRdf1`.