## Introduction 

In this course, we'll work with the World Happiness Report, an annual report created by the UN Sustainable Development Solutions Network with the intent of guiding policy. The report assigns each country a happiness score based on the answers to a poll question that asks respondents to rate their life on a scale of 0 to 10.
<br>
<br>
It also includes estimates of factors that may contribute to each country's happiness, including economic production, social support, life expectancy, freedom, absence of corruption, and generosity, to provide context for the score. Although these factors aren't actually used in the calculation of the happiness score, they can help illustrate why a country received a certain score.
<br>
<br>
We'll use the report to answer a few questions:
- How can aggregating the data give us more insight into happiness scores?
- How did world happiness change from 2015 to 2017?
- Which factors contribute the most to the happiness score?

To complete the data cleaning process, we will go through the following topics:
- Data aggregation 
- How to combine data 
- How to transform data 
- How to clean strings with pandas 
- How to handle missing and duplicate data

## Introduction to data 

In [1]:
import pandas as pd 

happiness2015 = pd.read_csv('World_Happiness_2015.csv')
first_5 = happiness2015.head(5)
happiness2015.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158 entries, 0 to 157
Data columns (total 12 columns):
Country                          158 non-null object
Region                           158 non-null object
Happiness Rank                   158 non-null int64
Happiness Score                  158 non-null float64
Standard Error                   158 non-null float64
Economy (GDP per Capita)         158 non-null float64
Family                           158 non-null float64
Health (Life Expectancy)         158 non-null float64
Freedom                          158 non-null float64
Trust (Government Corruption)    158 non-null float64
Generosity                       158 non-null float64
Dystopia Residual                158 non-null float64
dtypes: float64(9), int64(1), object(2)
memory usage: 14.9+ KB


## Using loops to aggregate data 

In [2]:
mean_happiness = {}
regions = happiness2015['Region'].unique()

for region in regions:
    # 1. Split the dataframe into groups.
    region_group = happiness2015[happiness2015['Region'] == region]
    
    # 2. Apply a function to each group.
    region_mean = region_group['Happiness Score'].mean()
    
    # 3. Combine the results into one data structure.
    mean_happiness[region] = region_mean

print(mean_happiness)    

{'Western Europe': 6.689619047619048, 'North America': 7.273, 'Australia and New Zealand': 7.285, 'Middle East and Northern Africa': 5.406899999999999, 'Latin America and Caribbean': 6.144681818181818, 'Southeastern Asia': 5.317444444444445, 'Central and Eastern Europe': 5.332931034482758, 'Eastern Asia': 5.626166666666666, 'Sub-Saharan Africa': 4.202800000000001, 'Southern Asia': 4.580857142857143}


## Group by functions 

Alternatively, pandas has a built-in operation for this process. The groupby operation performs the "split-apply-combine" process on a dataframe, but condenses it into two steps:
- Create a GroupBy object.
- Call a function on the GroupBy object
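
To make the two steps concrete, here is a minimal sketch on a small invented dataframe (the `toy` data and its `Region` and `Score` columns are assumptions for illustration, not values from the report). It compares the manual loop with the groupby version:

```python
import pandas as pd

# Toy data invented for illustration; not taken from the World Happiness Report.
toy = pd.DataFrame({
    'Region': ['A', 'A', 'B', 'B', 'B'],
    'Score': [7.0, 5.0, 4.0, 6.0, 2.0],
})

# Manual split-apply-combine with a loop.
loop_means = {}
for region in toy['Region'].unique():
    loop_means[region] = toy.loc[toy['Region'] == region, 'Score'].mean()

# The same result in two steps:
# 1. Create the GroupBy object.
grouped = toy.groupby('Region')
# 2. Call a function on the GroupBy object.
groupby_means = grouped['Score'].mean()

print(loop_means)
print(groupby_means.to_dict())
```

Both approaches should produce the same mean for each region; the groupby version simply hides the loop.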

### Creating GroupBy Objects 

Next, let's create a GroupBy object and group the dataframe by the Region column:

In [3]:
test_groupby = happiness2015.groupby('Region')
print(test_groupby)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001BA2C85FB88>


This is telling us that an object of type GroupBy was returned, just like we expected.
<br>
<br>
Before we start aggregating data, we'll build some intuition around GroupBy objects. We can use the **GroupBy.get_group()** method to select data for a certain group.

In [4]:
grouped = happiness2015.groupby('Region')
aus_nz = grouped.get_group('Australia and New Zealand')
print(aus_nz)

       Country                     Region  Happiness Rank  Happiness Score  \
8  New Zealand  Australia and New Zealand               9            7.286   
9    Australia  Australia and New Zealand              10            7.284   

   Standard Error  Economy (GDP per Capita)   Family  \
8         0.03371                   1.25018  1.31967   
9         0.04083                   1.33358  1.30923   

   Health (Life Expectancy)  Freedom  Trust (Government Corruption)  \
8                   0.90837  0.63938                        0.42922   
9                   0.93156  0.65124                        0.35637   

   Generosity  Dystopia Residual  
8     0.47501            2.26425  
9     0.43562            2.26646  
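
As a further intuition builder, the sketch below uses a small invented dataframe (the `toy` data and its column names are assumptions) to show that the `groups` attribute maps each group name to the index labels of its rows, and that `get_group()` selects the same rows as a boolean filter:

```python
import pandas as pd

# Toy data invented for illustration; not taken from the World Happiness Report.
toy = pd.DataFrame({
    'Country': ['W', 'X', 'Y', 'Z'],
    'Region': ['North', 'South', 'North', 'South'],
    'Score': [7.1, 5.2, 6.3, 4.4],
})
grouped = toy.groupby('Region')

# .groups maps each group name to the index labels of the rows in that group.
group_labels = {name: list(idx) for name, idx in grouped.groups.items()}
print(group_labels)

# get_group() returns the rows that belong to one group ...
north = grouped.get_group('North')
# ... which should be the same rows a boolean filter selects.
north_filtered = toy[toy['Region'] == 'North']
print(north)
print(north_filtered)
```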


### Common Aggregation Methods with GroupBy 

A basic example of aggregation is computing the number of rows for each of the groups. We can use the GroupBy.size() method to confirm the size of each region group.

In [5]:
grouped = happiness2015.groupby('Region')
grouped.size()

Region
Australia and New Zealand           2
Central and Eastern Europe         29
Eastern Asia                        6
Latin America and Caribbean        22
Middle East and Northern Africa    20
North America                       2
Southeastern Asia                   9
Southern Asia                       7
Sub-Saharan Africa                 40
Western Europe                     21
dtype: int64

And if we want to get the mean of each numeric column for each region:

In [6]:
grouped = happiness2015.groupby('Region')
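# numeric_only=True skips text columns such as Country, which newer pandas versions no longer drop automatically.
means = grouped.mean(numeric_only=True)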
print(means)

                                 Happiness Rank  Happiness Score  \
Region                                                             
Australia and New Zealand              9.500000         7.285000   
Central and Eastern Europe            79.000000         5.332931   
Eastern Asia                          64.500000         5.626167   
Latin America and Caribbean           46.909091         6.144682   
Middle East and Northern Africa       77.600000         5.406900   
North America                         10.000000         7.273000   
Southeastern Asia                     81.222222         5.317444   
Southern Asia                        113.142857         4.580857   
Sub-Saharan Africa                   127.900000         4.202800   
Western Europe                        29.523810         6.689619   

                                 Standard Error  Economy (GDP per Capita)  \
Region                                                                      
Australia and New Zealand    

### Aggregate particular column 

In some cases, we may only wish to aggregate one particular column in the original dataframe. GroupBy objects actually support column indexing, just like dataframes. You can select specific columns for a GroupBy object the same way you would for a dataframe.

In [7]:
grouped = happiness2015.groupby('Region')
happy_grouped = grouped['Happiness Score']
happy_mean = happy_grouped.mean()
happy_mean

Region
Australia and New Zealand          7.285000
Central and Eastern Europe         5.332931
Eastern Asia                       5.626167
Latin America and Caribbean        6.144682
Middle East and Northern Africa    5.406900
North America                      7.273000
Southeastern Asia                  5.317444
Southern Asia                      4.580857
Sub-Saharan Africa                 4.202800
Western Europe                     6.689619
Name: Happiness Score, dtype: float64

### Aggregate multiple columns 

For example, suppose we wanted to calculate both the mean and maximum happiness score for each region. Using what we learned so far, we'd have to first calculate the mean, like we did above, and then calculate the maximum separately.
<br>
<br>
The GroupBy.agg() method can perform both aggregations at once. We can use the following syntax:

**GroupBy.agg([func_name1, func_name2, func_name3])**

This method supports the following:
- Passing a list of functions, which returns a dataframe with one column per aggregation
- Passing a single custom function, which returns a series aggregated with that function

In [8]:
import numpy as np
grouped = happiness2015.groupby('Region')
happy_grouped = grouped['Happiness Score']

happy_mean_max = happy_grouped.agg([np.mean, np.max])
print(happy_mean_max)

                                     mean   amax
Region                                          
Australia and New Zealand        7.285000  7.286
Central and Eastern Europe       5.332931  6.505
Eastern Asia                     5.626167  6.298
Latin America and Caribbean      6.144682  7.226
Middle East and Northern Africa  5.406900  7.278
North America                    7.273000  7.427
Southeastern Asia                5.317444  6.798
Southern Asia                    4.580857  5.253
Sub-Saharan Africa               4.202800  5.477
Western Europe                   6.689619  7.587
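
The `amax` label in the output above comes from the name of the NumPy function. As a hedged alternative, agg() also accepts function names as strings, and recent pandas releases may emit a warning when NumPy functions such as np.mean are passed instead. A minimal sketch on invented data (the `toy` dataframe is an assumption for illustration):

```python
import pandas as pd

# Toy data invented for illustration; not taken from the World Happiness Report.
toy = pd.DataFrame({
    'Region': ['A', 'A', 'B', 'B'],
    'Score': [7.0, 5.0, 4.0, 6.0],
})

# Pass the aggregation names as strings; the resulting columns are
# labeled with those same strings.
stats = toy.groupby('Region')['Score'].agg(['mean', 'max'])
print(stats)
```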


Note that when we pass functions into the agg() method as arguments, we don't use parentheses after the function names. For example, np.mean refers to the function object itself and treats it like a variable, whereas np.mean() would call the function and return its value.
<br>
<br>
We can also pass a custom function to agg(). The function receives the values of each group and should return a single aggregated value:

In [9]:
def dif(group):
    return (group.max() - group.mean())

mean_max_dif = happy_grouped.agg(dif)
print(mean_max_dif)

Region
Australia and New Zealand          0.001000
Central and Eastern Europe         1.172069
Eastern Asia                       0.671833
Latin America and Caribbean        1.081318
Middle East and Northern Africa    1.871100
North America                      0.154000
Southeastern Asia                  1.480556
Southern Asia                      0.672143
Sub-Saharan Africa                 1.274200
Western Europe                     0.897381
Name: Happiness Score, dtype: float64


The same aggregation can also be written as a single chained expression:

In [10]:
happiness2015.groupby('Region')['Happiness Score'].agg(dif)

Region
Australia and New Zealand          0.001000
Central and Eastern Europe         1.172069
Eastern Asia                       0.671833
Latin America and Caribbean        1.081318
Middle East and Northern Africa    1.871100
North America                      0.154000
Southeastern Asia                  1.480556
Southern Asia                      0.672143
Sub-Saharan Africa                 1.274200
Western Europe                     0.897381
Name: Happiness Score, dtype: float64

In [11]:
happiness_means = happiness2015.groupby('Region')['Happiness Score'].mean()
print(happiness_means)

Region
Australia and New Zealand          7.285000
Central and Eastern Europe         5.332931
Eastern Asia                       5.626167
Latin America and Caribbean        6.144682
Middle East and Northern Africa    5.406900
North America                      7.273000
Southeastern Asia                  5.317444
Southern Asia                      4.580857
Sub-Saharan Africa                 4.202800
Western Europe                     6.689619
Name: Happiness Score, dtype: float64


## Aggregation with pivot tables

When you printed happiness_means, you should've seen that the values in the **Region column are the index** of the resulting series and the **Happiness Score column contained the values** that would be aggregated.
<br>
<br>
Index and values are actually arguments used in another method used to aggregate data - the DataFrame.pivot_table() method. This df.pivot_table() method can perform the same kinds of aggregations as the df.groupby method and make the code for complex aggregations easier to read.

This concept is very similar to a pivot table in Excel. The method **returns a dataframe**, so normal dataframe filtering and methods can be applied to the result. For example, let's use the DataFrame.plot() method to create a visualization. Note that passing aggfunc=np.mean below is optional because the mean is the default aggregation function of df.pivot_table(). Setting margins=True adds an "All" row that aggregates over the whole column.

In [12]:
pv_happiness = happiness2015.pivot_table(values='Happiness Score', index='Region', aggfunc=np.mean, margins=True)
pv_happiness.plot(kind = 'barh', title = 'Mean Happiness Scores by Region', xlim=(0,10), legend=False);
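
To see how pivot_table() relates to groupby, and what margins=True adds, here is a minimal sketch on invented data (the `toy` dataframe and its columns are assumptions for illustration):

```python
import pandas as pd

# Toy data invented for illustration; not taken from the World Happiness Report.
toy = pd.DataFrame({
    'Region': ['A', 'A', 'B', 'B', 'B'],
    'Score': [7.0, 5.0, 4.0, 6.0, 2.0],
})

pv = toy.pivot_table(values='Score', index='Region', aggfunc='mean', margins=True)
gb = toy.groupby('Region')['Score'].mean()

# The per-region rows agree with groupby. The extra 'All' row holds the
# mean over every row of the dataframe, not the mean of the group means.
print(pv)
print(gb)
print(toy['Score'].mean())
```

Because region B has more rows than region A, the "All" value differs from the simple average of the two region means.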

### Pivot table to aggregate multiple columns 

In [13]:
happiness2015.pivot_table(['Happiness Score', 'Family'], 'Region')

                                   Family  Happiness Score
Region                                                    
Australia and New Zealand        1.314450         7.285000
Central and Eastern Europe       1.053042         5.332931
Eastern Asia                     1.099427         5.626167
Latin America and Caribbean      1.104720         6.144682
Middle East and Northern Africa  0.920490         5.406900
North America                    1.284860         7.273000
Southeastern Asia                0.940468         5.317444
Southern Asia                    0.645321         4.580857
Sub-Saharan Africa               0.809085         4.202800
Western Europe                   1.247302         6.689619


### Pivot table to apply multiple functions

In [14]:
happiness2015.pivot_table('Happiness Score', 'Region', aggfunc=[np.mean, np.min , np.max], margins=True)

                                            mean             amin             amax
                                 Happiness Score  Happiness Score  Happiness Score
Region
Australia and New Zealand               7.285000            7.284            7.286
Central and Eastern Europe              5.332931            4.218            6.505
Eastern Asia                            5.626167            4.874            6.298
Latin America and Caribbean             6.144682            4.518            7.226
Middle East and Northern Africa         5.406900            3.006            7.278
North America                           7.273000            7.119            7.427
Southeastern Asia                       5.317444            3.819            6.798
Southern Asia                           4.580857            3.575            5.253
Sub-Saharan Africa                      4.202800            2.839            5.477
Western Europe                          6.689619            4.857            7.587


### Comparison between groupby and pivot table 

In [15]:
grouped = happiness2015.groupby('Region')[['Happiness Score','Family']]
happy_family_stats = grouped.agg([np.min, np.max, np.mean])
pv_happy_family_stats = happiness2015.pivot_table(['Happiness Score', 'Family'], 'Region', aggfunc=[np.min, np.max, np.mean], margins=True)

In [16]:
happy_family_stats

                                 Happiness Score                        Family
                                            amin      amax      mean      amin      amax      mean
Region
Australia and New Zealand                  7.284     7.286  7.285000   1.30923   1.31967  1.314450
Central and Eastern Europe                 4.218     6.505  5.332931   0.38562   1.34043  1.053042
Eastern Asia                               4.874     6.298  5.626167   0.94675   1.30060  1.099427
Latin America and Caribbean                4.518     7.226  6.144682   0.74302   1.30477  1.104720
Middle East and Northern Africa            3.006     7.278  5.406900   0.47489   1.22393  0.920490
North America                              7.119     7.427  7.273000   1.24711   1.32261  1.284860
Southeastern Asia                          3.819     6.798  5.317444   0.62736   1.26504  0.940468
Southern Asia                              3.575     5.253  4.580857   0.30285   1.10395  0.645321
Sub-Saharan Africa                         2.839     5.477  4.202800   0.00000   1.18468  0.809085
Western Europe                             4.857     7.587  6.689619   0.89318   1.40223  1.247302


In [17]:
pv_happy_family_stats

                                     amin                       amax                       mean
                                   Family  Happiness Score    Family  Happiness Score    Family  Happiness Score
Region
Australia and New Zealand         1.30923            7.284   1.31967            7.286  1.314450         7.285000
Central and Eastern Europe        0.38562            4.218   1.34043            6.505  1.053042         5.332931
Eastern Asia                      0.94675            4.874   1.30060            6.298  1.099427         5.626167
Latin America and Caribbean       0.74302            4.518   1.30477            7.226  1.104720         6.144682
Middle East and Northern Africa   0.47489            3.006   1.22393            7.278  0.920490         5.406900
North America                     1.24711            7.119   1.32261            7.427  1.284860         7.273000
Southeastern Asia                 0.62736            3.819   1.26504            6.798  0.940468         5.317444
Southern Asia                     0.30285            3.575   1.10395            5.253  0.645321         4.580857
Sub-Saharan Africa                0.00000            2.839   1.18468            5.477  0.809085         4.202800
Western Europe                    0.89318            4.857   1.40223            7.587  1.247302         6.689619


We can flatten the two-level column names of the pivot table into single strings as follows:

In [18]:
pv_happy_family_stats.columns = ['_'.join(col) for col in pv_happy_family_stats.columns]
pv_happy_family_stats

                                 amin_Family  amin_Happiness Score  amax_Family  amax_Happiness Score  mean_Family  mean_Happiness Score
Region
Australia and New Zealand            1.30923                 7.284      1.31967                 7.286     1.314450              7.285000
Central and Eastern Europe           0.38562                 4.218      1.34043                 6.505     1.053042              5.332931
Eastern Asia                         0.94675                 4.874      1.30060                 6.298     1.099427              5.626167
Latin America and Caribbean          0.74302                 4.518      1.30477                 7.226     1.104720              6.144682
Middle East and Northern Africa      0.47489                 3.006      1.22393                 7.278     0.920490              5.406900
North America                        1.24711                 7.119      1.32261                 7.427     1.284860              7.273000
Southeastern Asia                    0.62736                 3.819      1.26504                 6.798     0.940468              5.317444
Southern Asia                        0.30285                 3.575      1.10395                 5.253     0.645321              4.580857
Sub-Saharan Africa                   0.00000                 2.839      1.18468                 5.477     0.809085              4.202800
Western Europe                       0.89318                 4.857      1.40223                 7.587     1.247302              6.689619
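
The join works because each label of a multi-level column index is a tuple of strings. A minimal sketch on invented data (the `toy` dataframe and its columns are assumptions for illustration) shows the labels before and after flattening:

```python
import pandas as pd

# Toy data invented for illustration; not taken from the World Happiness Report.
toy = pd.DataFrame({
    'Region': ['A', 'A', 'B', 'B'],
    'Score': [7.0, 5.0, 4.0, 6.0],
    'Family': [1.2, 1.0, 0.8, 0.6],
})

pv = toy.pivot_table(values=['Score', 'Family'], index='Region', aggfunc=['min', 'max'])

# Before flattening, each column label is a tuple such as ('min', 'Family').
print(pv.columns.tolist())

# Join the parts of every tuple into one string per column.
pv.columns = ['_'.join(col) for col in pv.columns]
print(pv.columns.tolist())
```

Flat names like these are easier to select with ordinary column indexing, for example `pv['max_Score']`.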
