Exploratory data analysis

Let's start with the 2019 data and see what kind of data we are dealing with:

- Do we have 1,000 or 1 million entries in our data?
- Are we dealing with text or numbers?
- Do we have dates? What format do these dates have?
- Do we have outliers? (Data points that are extremely different from all the other ones.)
- Do we have missing data? That is, are any of the cells in our dataset empty?
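
Most of these questions map onto a single pandas call each. Here is a minimal sketch, assuming a tiny hand-made frame instead of the real file: the three rows reuse the rounded values displayed later in this notebook, and one Generosity value is deliberately blanked out so that the missing-data check has something to find. The variable names are illustrative only.

```python
import pandas as pd

# Illustrative frame (not the real file): three rows with rounded values,
# plus one value blanked out on purpose to demonstrate the missing-data check.
sample = pd.DataFrame({
    "Country or region": ["Finland", "Denmark", "Norway"],
    "Score": [7.77, 7.60, 7.55],
    "Generosity": [0.15, 0.25, None],
})

# How many entries do we have?
n_rows, n_cols = sample.shape

# Text or numbers? dtypes lists the type of every column.
column_types = sample.dtypes

# Missing data? Count the empty cells in each column.
missing_per_column = sample.isnull().sum()

print(f"{n_rows} rows, {n_cols} columns")
print(column_types)
print(missing_per_column)
```

On the real data, the same three calls on df answer the size, type and missing-data questions in one go.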

In [1]:
import pandas as pd
df = pd.read_csv("/Users/lalitha/Downloads/happiness/2019.csv")

In [2]:
#let's set the precision to 2 decimal places
pd.set_option("display.precision", 2)

#the first 3 rows of our pandas DataFrame object
#if we run df.head(), it will display the first 5 rows by default
df.head(3)

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
0,1,Finland,7.77,1.34,1.59,0.99,0.6,0.15,0.39
1,2,Denmark,7.6,1.38,1.57,1.0,0.59,0.25,0.41
2,3,Norway,7.55,1.49,1.58,1.03,0.6,0.27,0.34


Let's find the maximum of all happiness scores.

In [20]:
#Let's import the numpy library
import numpy as np

#and use a numpy function to see the maximum value of our Score feature
np.max(df["Score"])

7.769

In [21]:
df['Score'].idxmax()

0

In [23]:
df.iloc[df['Score'].idxmax()]

Overall rank                          1
Country or region               Finland
Score                               7.8
GDP per capita                      1.3
Social support                      1.6
Healthy life expectancy            0.99
Freedom to make life choices        0.6
Generosity                         0.15
Perceptions of corruption          0.39
Name: 0, dtype: object
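
A note on the cell above: idxmax() returns an index label, not a position. Passing it to iloc works here only because this DataFrame has the default RangeIndex, where labels and positions coincide. A minimal sketch of the difference, assuming a small hand-made frame with made-up index labels (10, 20, 30) and the rounded scores shown earlier:

```python
import pandas as pd

# Illustrative frame with a non-default index; the labels 10, 20, 30 are made up.
scores = pd.DataFrame(
    {
        "Country or region": ["Denmark", "Finland", "Norway"],
        "Score": [7.60, 7.769, 7.55],
    },
    index=[10, 20, 30],
)

# idxmax() gives the index LABEL of the highest score (20 here, not position 1)
best_label = scores["Score"].idxmax()

# loc looks rows up by label, so it is the matching accessor for idxmax()
best_row = scores.loc[best_label]
print(best_row["Country or region"])
```

With this index, scores.iloc[best_label] would raise an IndexError, because there is no row at position 20.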

In [24]:
#DataFrame has this very handy method: it lists the number of rows, each column's data type and its non-null count.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 9 columns):
Overall rank                    156 non-null int64
Country or region               156 non-null object
Score                           156 non-null float64
GDP per capita                  156 non-null float64
Social support                  156 non-null float64
Healthy life expectancy         156 non-null float64
Freedom to make life choices    156 non-null float64
Generosity                      156 non-null float64
Perceptions of corruption       156 non-null float64
dtypes: float64(7), int64(1), object(1)
memory usage: 11.1+ KB


In [26]:
print(df['Country or region'][0])

Finland


Exploring categorical features

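The 2019 file has a single feature which contains text, Country or region (it is the only object column in the df.info() output above). The 2020 file has 2 features which contain text:
- Country name
- Regional indicator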

By country

Our intuition is that each country is unique in our dataset (one country per row). This is what we would expect from a study of happiness levels in different countries across the world. We can verify this assumption to make sure we don't have errors in our data.

In [29]:
#how many entries we have for each country
#shown in descending order (highest value first)
df["Country or region"].value_counts().sort_values(ascending = False)

Bolivia                1
Saudi Arabia           1
Denmark                1
Slovenia               1
Chad                   1
                      ..
Congo (Brazzaville)    1
Ghana                  1
Morocco                1
Tajikistan             1
Gabon                  1
Name: Country or region, Length: 156, dtype: int64
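
Scanning a long list of counts by eye is error-prone, so the same assumption can also be checked programmatically. A minimal sketch, assuming a small hand-made Series in which Denmark is repeated on purpose:

```python
import pandas as pd

# Illustrative data: "Denmark" appears twice on purpose.
countries = pd.Series(
    ["Finland", "Denmark", "Norway", "Denmark"],
    name="Country or region",
)

# is_unique is True only when every value appears exactly once
has_duplicates = not countries.is_unique

# duplicated(keep=False) marks every occurrence of a repeated value
repeated = countries[countries.duplicated(keep=False)]

print(has_duplicates)
print(repeated)
```

On the real data, df["Country or region"].is_unique should be True if each country appears in exactly one row.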

In [None]:
#Uncomment the line below to see what data type value_counts() returns. This is a nice way to explore how pandas works.
#print("\nThe code above returns data of type: ", type(df['Country or region'].value_counts()))

In [None]:
#here's each individual region and its corresponding frequency (the statistical term 
#for the number of times this region appears in our dataset)
#note: 'Regional indicator' is a column of the 2020 file, not of the 2019 df,
#so this cell needs twenty20 (loaded in a cell further down) to have been run first
twenty20['Regional indicator'].value_counts()

In [None]:
#we have 10 regions and pandas has a method to find this out
print(f"The number of regions in our dataset is: {twenty20['Regional indicator'].nunique()}")

I just used Python's fancy formatting in the line of code above. If you like it and want to read more, know that it's called Literal String Interpolation (but the popular name is f-string). You can read more [here](https://www.programiz.com/python-programming/string-interpolation).
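
As a small illustration of what f-strings can do beyond inserting a variable, here is a sketch with made-up values (n_regions matches the count mentioned above, mean_score is an illustrative number). Anything inside the curly braces is evaluated, and a format specifier after a colon controls how it is printed:

```python
# Illustrative values only
n_regions = 10
mean_score = 5.4071

# {mean_score:.2f} rounds the number to 2 decimal places when printing
message = f"Regions: {n_regions}, mean score: {mean_score:.2f}"
print(message)  # Regions: 10, mean score: 5.41
```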

In [30]:
df.describe()

Unnamed: 0,Overall rank,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
count,156.0,156.0,156.0,156.0,156.0,156.0,156.0,156.0
mean,78.5,5.41,0.91,1.21,0.73,0.39,0.18,0.11
std,45.18,1.11,0.4,0.3,0.24,0.14,0.1,0.09
min,1.0,2.85,0.0,0.0,0.0,0.0,0.0,0.0
25%,39.75,4.54,0.6,1.06,0.55,0.31,0.11,0.05
50%,78.5,5.38,0.96,1.27,0.79,0.42,0.18,0.09
75%,117.25,6.18,1.23,1.45,0.88,0.51,0.25,0.14
max,156.0,7.77,1.68,1.62,1.14,0.63,0.57,0.45


Insights from the descriptive statistics above:
- Score actually goes from 2.85 to 7.77. There's no 0 or 10.
- Healthy life expectancy goes from 0 to 1.14, so in the 2019 file it is clearly not expressed in years.
- Every feature other than Overall rank and Score has a minimum of 0. None of the features has negative values.
- Other features are more difficult to interpret from the descriptive stats above.
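
The outlier question from the start of this section can be approached with the same quartiles. One common rule of thumb (not the only one) is the 1.5 x IQR rule: flag values lying more than 1.5 interquartile ranges below the 25% quartile or above the 75% quartile. Below is a minimal sketch; the helper name iqr_outliers and the toy Series are assumptions made for illustration, with 12.0 planted as an obvious outlier.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Illustrative data: 12.0 is planted as an outlier.
toy = pd.Series([4.5, 5.0, 5.4, 5.9, 6.2, 12.0], name="Score")
flagged = iqr_outliers(toy)
print(flagged)
```

Applied by hand to the Score column in the table above (25% = 4.54, 75% = 6.18), the bounds are roughly 2.08 and 8.64, so the minimum of 2.85 and the maximum of 7.77 both fall inside them.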

In [15]:
twenty17 = pd.read_csv('/Users/lalitha/Downloads/input/2017.csv')
twenty18 = pd.read_csv('/Users/lalitha/Downloads/input/2018.csv')
twenty19 = pd.read_csv('/Users/lalitha/Downloads/input/2019.csv')
twenty20 = pd.read_csv('/Users/lalitha/Downloads/input/2020.csv')

In [16]:
twenty20.shape

(153, 20)

In [None]:
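#note: a notebook cell only displays the result of its last expression,
#so run these one at a time (or wrap each one in print) to see every result
twenty20.head(5)              #first 5 rows
twenty20.tail(5)              #last 5 rows
twenty20.duplicated().sum()   #number of fully duplicated rows
twenty20.describe()           #descriptive statistics of the numeric columns
twenty20.isnull().sum()       #number of missing values per column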

In [None]:
col_names_dict = {'Country name':'Country', 'Regional indicator':'Region',
                  'Standard error of ladder score':'Standard Error', 'Logged GDP per capita':'Logged GDPPC',
                  'Social support':'Social Support', 'Healthy life expectancy':'Life Expectancy',
                  'Freedom to make life choices':'Freedom', 'Perceptions of corruption': 'Corruption'}

#these are the column names of the 2020 file, so we rename the columns of twenty20
twenty20.rename(columns = col_names_dict, inplace = True)
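
One thing to keep in mind with rename: by default it silently ignores keys that don't match any column, so a typo in the dictionary goes unnoticed. A minimal sketch of a stricter variant, assuming a small hand-made frame (the values and the misspelled key are made up); errors="raise" asks pandas to raise a KeyError instead:

```python
import pandas as pd

# Illustrative frame with two of the 2020 column names; the values are made up.
frame = pd.DataFrame({"Country name": ["Finland"], "Ladder score": [7.8]})

# Without inplace=True, rename returns a new DataFrame and leaves frame unchanged
renamed = frame.rename(columns={"Country name": "Country"})

# errors="raise" turns a misspelled key into a KeyError instead of a silent no-op
try:
    frame.rename(columns={"Country nmae": "Country"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True

print(list(renamed.columns), typo_detected)
```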

In [None]:

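# UNUSED: A CORRELATION MATRIX
import matplotlib.pyplot as plt

#correlations are only defined for numeric columns, so we leave out the text column
corr = df.select_dtypes(include = 'number').corr()

#this style was called 'seaborn-white' before matplotlib 3.6
plt.style.use('seaborn-v0_8-white')
fig = plt.figure(figsize = (13, 10))

plt.matshow(corr, fignum = fig.number, cmap = 'viridis')
#the tick labels must come from corr, which has fewer columns than df
plt.xticks(range(corr.shape[1]), corr.columns, fontsize=14, rotation=45)
plt.yticks(range(corr.shape[0]), corr.columns, fontsize=14)

cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)

plt.title('Correlation Matrix', fontsize = 24, y = 1.2);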