
## World Happiness Report - Final Project

In [1]:
import pandas as pd
import numpy as np
from plotnine import *

%matplotlib inline

import seaborn as sb
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px

from pandas.api.types import CategoricalDtype

**Reading and Writing Files**

This project uses a single data file, WorldHappinessReport.csv, which we load with the pd.read_csv function.

In [2]:
WHRdf = pd.read_csv('WorldHappinessReport.csv')
WHRdf.head()

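pd.read_csv raises FileNotFoundError when the CSV is not in the working directory. The sketch below is one way to fail early with a clearer message. The helper name `load_csv` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_csv(path):
    """Read a CSV into a DataFrame, failing early with a clear message."""
    csv_path = Path(path)
    if not csv_path.is_file():
        # Report the resolved location so a wrong working directory is obvious.
        raise FileNotFoundError(f"CSV not found at {csv_path.resolve()}")
    return pd.read_csv(csv_path)
```

With the file present in the working directory, `WHRdf = load_csv('WorldHappinessReport.csv')` would behave like the direct pd.read_csv call above.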

The dataframe above holds the data imported from WorldHappinessReport.csv. After reviewing what each column represents, we decided to rename some columns so that the names describe the data more accurately and support our analysis. We also convert all column names to camelCase, because the original names contain spaces, which makes the columns harder to reference.

In [None]:
# Renaming the columns
dfModified = WHRdf.rename(columns={
    'Country Name': 'countryName',
    'Regional Indicator': 'region',
    'Year': 'year',
    'Life Ladder': 'happinessScore',
    'Log GDP Per Capita': 'logGdpPerCapita',
    'Social Support': 'socialSupport',
    'Healthy Life Expectancy At Birth': 'LifeExpAtBirth',
    'Freedom To Make Life Choices': 'lifeChoicesFreedom',
    'Generosity': 'generosity',
    'Perceptions Of Corruption': 'corruptionPreception',
    'Positive Affect': 'positiveAffect',
    'Negative Affect': 'negativeAffect',
    'Confidence In National Government': 'confInNatGov'
})
dfModified.head()

| | countryName | region | year | happinessScore | logGdpPerCapita | socialSupport | LifeExpAtBirth | lifeChoicesFreedom | generosity | corruptionPreception | positiveAffect | negativeAffect | confInNatGov |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | South Asia | 2008 | 3.72359 | 7.350416 | 0.450662 | 50.5 | 0.718114 | 0.167652 | 0.881686 | 0.414297 | 0.258195 | 0.612072 |
| 1 | Afghanistan | South Asia | 2009 | 4.401778 | 7.508646 | 0.552308 | 50.799999 | 0.678896 | 0.190809 | 0.850035 | 0.481421 | 0.237092 | 0.611545 |
| 2 | Afghanistan | South Asia | 2010 | 4.758381 | 7.6139 | 0.539075 | 51.099998 | 0.600127 | 0.121316 | 0.706766 | 0.516907 | 0.275324 | 0.299357 |
| 3 | Afghanistan | South Asia | 2011 | 3.831719 | 7.581259 | 0.521104 | 51.400002 | 0.495901 | 0.163571 | 0.731109 | 0.479835 | 0.267175 | 0.307386 |
| 4 | Afghanistan | South Asia | 2012 | 3.782938 | 7.660506 | 0.520637 | 51.700001 | 0.530935 | 0.237588 | 0.77562 | 0.613513 | 0.267919 | 0.43544 |
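The renaming above uses a hand-written mapping because several names were also shortened or changed in meaning. Where only the casing needs to change, the conversion could be automated. This is a minimal sketch, assuming a hypothetical helper `to_camel_case` and a toy frame with placeholder values. It would not reproduce custom names such as `happinessScore`.

```python
import pandas as pd


def to_camel_case(name):
    """Convert a space-separated column name to camelCase."""
    words = name.split()
    if not words:
        return name
    # Lowercase the first word, capitalize the first letter of the rest.
    return words[0].lower() + "".join(w.capitalize() for w in words[1:])


# Toy frame with a few of the original column names (values are placeholders).
demo = pd.DataFrame(
    {"Country Name": ["A"], "Log GDP Per Capita": [1.0], "Year": [2008]}
)

# rename() accepts a function and applies it to every column label.
demo = demo.rename(columns=to_camel_case)
print(list(demo.columns))
```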


**Data Description**

In [None]:
dfModified.ndim

2

In [None]:
dfModified.shape

(2199, 13)

The .ndim and .shape attributes show that the dataframe is two-dimensional and consists of 2199 rows (observations) and 13 columns (variables).
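Because .ndim and .shape are attributes rather than methods, they are read without parentheses, and .shape can be unpacked directly into row and column counts. A small sketch on a toy frame (the values are placeholders, not taken from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({"year": [2008, 2009, 2010], "score": [3.7, 4.4, 4.8]})

# Attributes are accessed without calling them.
n_dims = toy.ndim
n_rows, n_cols = toy.shape

print(f"{n_dims} dimensions, {n_rows} rows, {n_cols} columns")
```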

In [None]:
dfModified.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2199 entries, 0 to 2198
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   countryName           2199 non-null   object 
 1   region                2087 non-null   object 
 2   year                  2199 non-null   int64  
 3   happinessScore        2199 non-null   float64
 4   logGdpPerCapita       2179 non-null   float64
 5   socialSupport         2186 non-null   float64
 6   LifeExpAtBirth        2145 non-null   float64
 7   lifeChoicesFreedom    2166 non-null   float64
 8   generosity            2126 non-null   float64
 9   corruptionPreception  2083 non-null   float64
 10  positiveAffect        2175 non-null   float64
 11  negativeAffect        2183 non-null   float64
 12  confInNatGov          1838 non-null   float64
dtypes: float64(10), int64(1), object(2)
memory usage: 223.5+ KB


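The .info output shows 2199 entries and 13 columns: ten float64 columns, one int64 column (year), and two object columns (countryName and region). Only countryName, year, and happinessScore are complete. Every other column has fewer than 2199 non-null values, with confInNatGov the sparsest at 1838.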

In [None]:
dfModified.describe()

| | year | happinessScore | logGdpPerCapita | socialSupport | LifeExpAtBirth | lifeChoicesFreedom | generosity | corruptionPreception | positiveAffect | negativeAffect | confInNatGov |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2199.0 | 2199.0 | 2179.0 | 2186.0 | 2145.0 | 2166.0 | 2126.0 | 2083.0 | 2175.0 | 2183.0 | 1838.0 |
| mean | 2014.161437 | 5.479226 | 9.389766 | 0.810679 | 63.294583 | 0.747858 | 9.6e-05 | 0.745195 | 0.652143 | 0.271501 | 0.483999 |
| std | 4.718736 | 1.125529 | 1.153387 | 0.120952 | 6.901104 | 0.14015 | 0.161083 | 0.185837 | 0.105922 | 0.086875 | 0.193071 |
| min | 2005.0 | 1.281271 | 5.526723 | 0.228217 | 6.72 | 0.257534 | -0.337527 | 0.035198 | 0.178886 | 0.082737 | 0.068769 |
| 25% | 2010.0 | 4.64675 | 8.499764 | 0.746609 | 59.119999 | 0.656528 | -0.112116 | 0.688139 | 0.571684 | 0.20766 | 0.332549 |
| 50% | 2014.0 | 5.432437 | 9.498955 | 0.835535 | 65.050003 | 0.769821 | -0.022671 | 0.799654 | 0.663063 | 0.260671 | 0.46714 |
| 75% | 2018.0 | 6.30946 | 10.373216 | 0.904792 | 68.5 | 0.859382 | 0.09207 | 0.868827 | 0.737936 | 0.322894 | 0.618846 |
| max | 2022.0 | 8.018934 | 11.663788 | 0.987343 | 74.474998 | 0.985178 | 0.702708 | 0.983276 | 0.883586 | 0.70459 | 0.993604 |


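The .describe output shows that the data spans 2005 to 2022. happinessScore has a mean of about 5.48 and ranges from about 1.28 to 8.02. The minimum LifeExpAtBirth of 6.72 lies far below the first quartile of about 59.12, so it may be a data-entry error or an outlier worth checking. generosity is centred near zero (mean 9.6e-05) and takes negative values.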
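Summary statistics can also be screened for implausible values with the common 1.5 × IQR rule. The sketch below is illustrative only: it uses the rounded minimum, quartiles, and maximum reported above for LifeExpAtBirth as a five-value stand-in for the full column, so the bounds would differ on the real data.

```python
import pandas as pd

# Stand-in values: rounded min, quartiles, and max from the summary table.
life_exp = pd.Series([6.72, 59.12, 65.05, 68.5, 74.47], name="LifeExpAtBirth")

q1 = life_exp.quantile(0.25)
q3 = life_exp.quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged for review.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
flagged = life_exp[(life_exp < lower) | (life_exp > upper)]

print(flagged)
```

On the real dataframe the same steps would be applied to `dfModified['LifeExpAtBirth']`. A flagged value is a prompt to inspect the row, not proof of an error.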

**Identifying and Handling Missing Values**

In [None]:
# getting the no. of missing values for each variable in the dataset
dfModified.isna().sum()

countryName               0
region                  112
year                      0
happinessScore            0
logGdpPerCapita          20
socialSupport            13
LifeExpAtBirth           54
lifeChoicesFreedom       33
generosity               73
corruptionPreception    116
positiveAffect           24
negativeAffect           16
confInNatGov            361
dtype: int64

In [None]:
# getting the total no. of missing values in the dataset
dfModified.isna().sum().sum()

822

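In total, the dataset has 822 missing values. confInNatGov has the most (361), followed by corruptionPreception (116) and region (112), while countryName, year, and happinessScore have none.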
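Raw counts are easier to compare when expressed as a share of all rows. A minimal sketch on a toy frame (placeholder values), using `isna().mean()` to get the fraction of missing entries per column:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "region": ["Asia", None, "Europe", None],
    "score": [5.1, 4.2, np.nan, 6.3],
    "year": [2008, 2009, 2010, 2011],
})

# isna() yields booleans, so sum() counts them and mean() gives the fraction.
missing_count = toy.isna().sum()
missing_pct = (toy.isna().mean() * 100).round(1)

summary = pd.DataFrame(
    {"missing": missing_count, "percent": missing_pct}
).sort_values("missing", ascending=False)

print(summary)
```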

*Dealing With Categorical/Qualitative Missing Values*

The only categorical variable with missing values is the 'region' column, which has 112 missing values in total. However, after filtering the rows where region is missing and listing the unique values of the countryName column, we can see that only 21 countries are affected. Therefore, instead of imputing these missing values with the mode, which might assign an incorrect region, we impute them manually using the continent listing from World Population Review (https://worldpopulationreview.com/country-rankings/list-of-countries-by-continent).

In [None]:
# Filtering the missing values in region column
missing_regions = dfModified.loc[dfModified['region'].isna()]

# printing the unique values in the countryName column
unique_countries = missing_regions['countryName'].unique()
unique_countries

array(['Angola', 'Belize', 'Bhutan', 'Central African Republic',
       'Congo (Kinshasa)', 'Cuba', 'Czechia', 'Djibouti', 'Eswatini',
       'Guyana', 'Oman', 'Qatar', 'Somalia', 'Somaliland region',
       'South Sudan', 'State of Palestine', 'Sudan', 'Suriname', 'Syria',
       'Trinidad and Tobago', 'Turkiye'], dtype=object)

The number of countries with missing region values is:

In [None]:
len(unique_countries)

21
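The gap between missing rows and affected countries arises because each country appears once per survey year. Counting the missing rows per country makes this visible. A sketch on a toy frame (the rows are placeholders, not taken from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "countryName": ["Angola", "Angola", "Cuba", "Chile"],
    "region": [None, None, None, "South America"],
})

# Keep only rows with a missing region, then count rows per country.
rows_missing = toy.loc[toy["region"].isna(), "countryName"].value_counts()

print(rows_missing)
print("countries affected:", len(rows_missing))
```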

In [None]:
# Manually imputing missing values
dfModified.loc[dfModified['countryName'] == 'Angola', 'region'] = 'Africa'
dfModified.loc[dfModified['countryName'] == 'Belize', 'region'] = 'North America'
dfModified.loc[dfModified['countryName'] == 'Bhutan', 'region'] = 'Asia'
dfModified.loc[dfModified['countryName'] == 'Central African Republic', 'region'] = 'Africa'
dfModified.loc[dfModified['countryName'] == 'Congo (Kinshasa)', 'region'] = 'Africa'
dfModified.loc[dfModified['countryName'] == 'Cuba', 'region'] = 'North America'
dfModified.loc[dfModified['countryName'] == 'Czechia', 'region'] = 'Europe'
dfModified.loc[dfModified['countryName'] == 'Djibouti', 'region'] = 'Africa'
dfModified.loc[dfModified['countryName'] == 'Eswatini', 'region'] = 'Africa'
dfModified.loc[dfModified['countryName'] == 'Guyana', 'region'] = 'South America'
dfModified.loc[dfModified['countryName'] == 'Oman', 'region'] = 'Asia'
dfModified.loc[dfModified['countryName'] == 'Qatar', 'region'] = 'Asia'
dfModified.loc[dfModified['countryName'] == 'Somalia', 'region'] = 'Africa'
dfModified.loc[dfModified['countryName'] == 'Somaliland region', 'region'] = 'Africa'
dfModified.loc[dfModified['countryName'] == 'South Sudan', 'region'] = 'Africa'
dfModified.loc[dfModified['countryName'] == 'State of Palestine', 'region'] = 'Asia'
dfModified.loc[dfModified['countryName'] == 'Sudan', 'region'] = 'Africa'
dfModified.loc[dfModified['countryName'] == 'Suriname', 'region'] = 'South America'
dfModified.loc[dfModified['countryName'] == 'Syria', 'region'] = 'Asia'
dfModified.loc[dfModified['countryName'] == 'Trinidad and Tobago', 'region'] = 'North America'
dfModified.loc[dfModified['countryName'] == 'Turkiye', 'region'] = 'Europe'

dfModified.head()


| | countryName | region | year | happinessScore | logGdpPerCapita | socialSupport | LifeExpAtBirth | lifeChoicesFreedom | generosity | corruptionPreception | positiveAffect | negativeAffect | confInNatGov |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | South Asia | 2008 | 3.72359 | 7.350416 | 0.450662 | 50.5 | 0.718114 | 0.167652 | 0.881686 | 0.414297 | 0.258195 | 0.612072 |
| 1 | Afghanistan | South Asia | 2009 | 4.401778 | 7.508646 | 0.552308 | 50.799999 | 0.678896 | 0.190809 | 0.850035 | 0.481421 | 0.237092 | 0.611545 |
| 2 | Afghanistan | South Asia | 2010 | 4.758381 | 7.6139 | 0.539075 | 51.099998 | 0.600127 | 0.121316 | 0.706766 | 0.516907 | 0.275324 | 0.299357 |
| 3 | Afghanistan | South Asia | 2011 | 3.831719 | 7.581259 | 0.521104 | 51.400002 | 0.495901 | 0.163571 | 0.731109 | 0.479835 | 0.267175 | 0.307386 |
| 4 | Afghanistan | South Asia | 2012 | 3.782938 | 7.660506 | 0.520637 | 51.700001 | 0.530935 | 0.237588 | 0.77562 | 0.613513 | 0.267919 | 0.43544 |
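The manual imputation above can also be expressed as a lookup table applied in one step. This is a sketch of an alternative, not the notebook's method: it uses three of the country-to-continent pairs from the cell above on a toy frame with placeholder rows. Unlike the `.loc` assignments, `fillna` only fills rows where region is missing and leaves existing labels untouched.

```python
import pandas as pd

# Three of the pairs used above; extend the dict to cover all 21 countries.
region_by_country = {"Angola": "Africa", "Bhutan": "Asia", "Czechia": "Europe"}

toy = pd.DataFrame({
    "countryName": ["Angola", "Bhutan", "Czechia", "Afghanistan"],
    "region": [None, None, None, "South Asia"],
})

# map() looks up each country in the dict; fillna() uses the result
# only where region is currently missing.
toy["region"] = toy["region"].fillna(toy["countryName"].map(region_by_country))

# Turn the visual check into an explicit one.
assert toy["region"].isna().sum() == 0
print(toy)
```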


In [None]:
dfModified.isna().sum()

countryName               0
region                    0
year                      0
happinessScore            0
logGdpPerCapita          20
socialSupport            13
LifeExpAtBirth           54
lifeChoicesFreedom       33
generosity               73
corruptionPreception    116
positiveAffect           24
negativeAffect           16
confInNatGov            361
dtype: int64

After imputing the missing values manually, we applied isna().sum() again to confirm that all the missing values in the region column had been dealt with. As shown above, there are now zero missing values in the region column.