### Summary functions

- Pandas provides many simple "summary functions" (not an official name) which restructure the data in some useful way.
- describe() method: generates a high-level summary of the given column.
- It is type-aware: its output changes based on the data type of the input. Numeric columns report count, mean, std, min, quartiles, and max; string columns report count, unique, top, and freq.
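
As a minimal sketch of that type-awareness, the snippet below calls describe() on a numeric column and on a string column. The `toy` frame and its values are made up for illustration and are not taken from the happiness report.

```python
import pandas as pd

# Hypothetical toy data, invented for this example only.
toy = pd.DataFrame({
    "score": [7.8, 7.6, 3.1, 2.5],
    "region": ["West", "West", "South", "East"],
})

numeric_summary = toy["score"].describe()  # statistics for numeric data
text_summary = toy["region"].describe()    # frequency information for strings

# The index of each result shows which statistics were computed.
print(list(numeric_summary.index))
print(list(text_summary.index))
```

Printing the full index is handy here because the notebook's display.max_rows setting truncates longer outputs.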

In [1]:
import pandas as pd
#reviews = pd.read_csv('world-happiness-report-2021/world-happiness-report-2021.csv', index_col=0)
reviews = pd.read_csv('world-happiness-report-2021/world-happiness-report-2021.csv')
pd.set_option('display.max_rows',5)
reviews

Unnamed: 0,Country name,Regional indicator,Ladder score,Standard error of ladder score,upperwhisker,lowerwhisker,Logged GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption,Ladder score in Dystopia,Explained by: Log GDP per capita,Explained by: Social support,Explained by: Healthy life expectancy,Explained by: Freedom to make life choices,Explained by: Generosity,Explained by: Perceptions of corruption,Dystopia + residual
0,Finland,Western Europe,7.842,0.032,7.904,7.780,10.775,0.954,72.000,0.949,-0.098,0.186,2.43,1.446,1.106,0.741,0.691,0.124,0.481,3.253
1,Denmark,Western Europe,7.620,0.035,7.687,7.552,10.933,0.954,72.700,0.946,0.030,0.179,2.43,1.502,1.108,0.763,0.686,0.208,0.485,2.868
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
147,Zimbabwe,Sub-Saharan Africa,3.145,0.058,3.259,3.030,7.943,0.750,56.201,0.677,-0.047,0.821,2.43,0.457,0.649,0.243,0.359,0.157,0.075,1.205
148,Afghanistan,South Asia,2.523,0.038,2.596,2.449,7.695,0.463,52.493,0.382,-0.102,0.924,2.43,0.370,0.000,0.126,0.000,0.122,0.010,1.895


In [2]:
# The output for numerical data
reviews['Ladder score'].describe()

count    149.000000
mean       5.532839
            ...    
75%        6.255000
max        7.842000
Name: Ladder score, Length: 8, dtype: float64

In [3]:
# The output for string data
reviews['Regional indicator'].describe()

count                    149
unique                    10
top       Sub-Saharan Africa
freq                      36
Name: Regional indicator, dtype: object

In [4]:
reviews.loc[reviews['Regional indicator']=='Sub-Saharan Africa',['Country name']]

Unnamed: 0,Country name
49,Mauritius
82,Congo (Brazzaville)
...,...
146,Rwanda
147,Zimbabwe


In [5]:
reviews['Ladder score'].mean()

5.532838926174497

In [6]:
# the unique() method: returns an array of the distinct values in the column
reviews['Regional indicator'].unique()

array(['Western Europe', 'North America and ANZ',
       'Middle East and North Africa', 'Latin America and Caribbean',
       'Central and Eastern Europe', 'East Asia', 'Southeast Asia',
       'Commonwealth of Independent States', 'Sub-Saharan Africa',
       'South Asia'], dtype=object)

In [7]:
# the value_counts() method: to see a list of unique values and how often they occur in the dataset
reviews['Regional indicator'].value_counts()

Sub-Saharan Africa       36
Western Europe           21
                         ..
East Asia                 6
North America and ANZ     4
Name: Regional indicator, Length: 10, dtype: int64
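
If shares are more useful than raw counts, value_counts() also accepts normalize=True. A small sketch on made-up labels (the `regions` Series is hypothetical):

```python
import pandas as pd

# Hypothetical labels, invented for this example only.
regions = pd.Series(["X", "X", "Y", "X"])

counts = regions.value_counts()                # occurrences of each value
shares = regions.value_counts(normalize=True)  # fraction of rows for each value

print(counts)
print(shares)
```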

In [8]:
# idxmax() returns the row label (index value) of the maximum value.
id_max = reviews['Healthy life expectancy'].idxmax()
reviews.loc[id_max,'Country name']

'Singapore'
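
Note that idxmax() returns an index label rather than a positional offset. With the default integer index the two coincide, so the difference is easier to see on a labelled Series. The sketch below uses made-up labels and values:

```python
import pandas as pd

# Hypothetical values with string labels as the index.
life = pd.Series([72.0, 76.9, 52.5], index=["A", "B", "C"])

top_label = life.idxmax()     # the index label of the largest value
top_position = life.argmax()  # the integer position of the largest value

print(top_label, top_position)
print(life.loc[top_label])    # look the value up again by label
```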

### Map
- A map is a function that takes one set of values and "maps" them to another set of values.
- There are two mapping methods that you will use often.
   - map() 
      - The function (often a lambda) you pass to map() should expect a single value from the Series and return a transformed version of that value.
      - map() returns a new Series in which every value has been transformed by your function.
   - apply()
      - Transforms a whole DataFrame by calling a custom function on each row or each column.
      - axis parameter: {0 or 'index', 1 or 'columns'}, default 0
      - axis='index' (or 0): the function receives each column in turn.
      - axis='columns' (or 1): the function receives each row in turn.

- Note that map() and apply() return new, transformed Series and DataFrames, respectively. 
    - They don't modify the original data they're called on. 
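
The points above can be sketched on a tiny made-up frame (`toy` and its columns `a` and `b` are hypothetical): map() works value by value on a Series, apply() works column by column or row by row depending on axis, and neither changes the original data.

```python
import pandas as pd

# Hypothetical toy data, invented for this example only.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# map(): the lambda receives one value at a time.
doubled = toy["a"].map(lambda x: x * 2)

# apply() with axis='index': the lambda receives each column as a Series.
col_ranges = toy.apply(lambda col: col.max() - col.min(), axis="index")

# apply() with axis='columns': the lambda receives each row as a Series.
row_sums = toy.apply(lambda row: row["a"] + row["b"], axis="columns")

print(doubled.tolist())
print(col_ranges.to_dict())
print(row_sums.tolist())
print(toy["a"].tolist())  # the original column is untouched
```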

In [9]:
# rebase with the mean
ladder_score_mean = reviews['Ladder score'].mean()
reviews['Ladder score'].map(lambda x : x - ladder_score_mean)

0      2.309161
1      2.087161
         ...   
147   -2.387839
148   -3.009839
Name: Ladder score, Length: 149, dtype: float64

In [None]:
reviews['Ladder score'] # doesn't change
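
For simple arithmetic like this, pandas operators give the same result without a lambda: subtracting a scalar from a Series applies it to every element. A minimal sketch on made-up scores (the `scores` Series is hypothetical):

```python
import pandas as pd

# Hypothetical scores, invented for this example only.
scores = pd.Series([7.0, 5.0, 3.0], name="Ladder score")
scores_mean = scores.mean()

via_map = scores.map(lambda x: x - scores_mean)  # element by element
via_operator = scores - scores_mean              # whole Series at once

print(via_operator.tolist())
print(via_map.equals(via_operator))
```

The operator form is usually shorter, while map() remains useful when the transformation needs logic that operators cannot express.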

In [10]:
# count the rows whose 'Regional indicator' contains 'Asia'
temp = reviews['Regional indicator'].map(lambda x : 'Asia' in x)
asia_count = temp.sum() # sum() counts the True values (each True is treated as 1)
non_asia_count = (~temp).sum() # ~temp flips the mask, so this counts the original False values
print(asia_count)
print(non_asia_count)

22
127


In [11]:
reviews['Regional indicator'].str.contains('Asia')

0      False
1      False
       ...  
147    False
148     True
Name: Regional indicator, Length: 149, dtype: bool

In [12]:
# same count as above, using Series.str.contains() to build the boolean mask
# count() returns the number of non-NA/null observations in the selected Series
asia_count=reviews['Regional indicator'].loc[reviews['Regional indicator'].str.contains('Asia')].count()
asia_count

22
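
Because the mask from str.contains() is boolean, it can also be summed directly. The sketch below uses a made-up `regions` Series that includes a missing value, and passes na=False so that missing entries count as non-matches; a lambda such as `'Asia' in x` would raise a TypeError on that entry.

```python
import pandas as pd

# Hypothetical region labels, including one missing value.
regions = pd.Series(["South Asia", "Western Europe", "East Asia", None])

# na=False treats missing values as "does not contain".
mask = regions.str.contains("Asia", na=False)

asia_count = int(mask.sum())         # number of True values
non_asia_count = int((~mask).sum())  # number of False values

print(asia_count, non_asia_count)
```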

In [17]:
ladder_score_mean = reviews['Ladder score'].mean()
def remean_ladder_score(row):
    row['Ladder score'] = row['Ladder score'] - ladder_score_mean
    return row
    
reviews.apply(remean_ladder_score, axis = 1) # axis = 1 is the same as axis = 'columns': the function receives each row

Unnamed: 0,Country name,Regional indicator,Ladder score,Standard error of ladder score,upperwhisker,lowerwhisker,Logged GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption,Ladder score in Dystopia,Explained by: Log GDP per capita,Explained by: Social support,Explained by: Healthy life expectancy,Explained by: Freedom to make life choices,Explained by: Generosity,Explained by: Perceptions of corruption,Dystopia + residual
0,Finland,Western Europe,2.309161,0.032,7.904,7.780,10.775,0.954,72.000,0.949,-0.098,0.186,2.43,1.446,1.106,0.741,0.691,0.124,0.481,3.253
1,Denmark,Western Europe,2.087161,0.035,7.687,7.552,10.933,0.954,72.700,0.946,0.030,0.179,2.43,1.502,1.108,0.763,0.686,0.208,0.485,2.868
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
147,Zimbabwe,Sub-Saharan Africa,-2.387839,0.058,3.259,3.030,7.943,0.750,56.201,0.677,-0.047,0.821,2.43,0.457,0.649,0.243,0.359,0.157,0.075,1.205
148,Afghanistan,South Asia,-3.009839,0.038,2.596,2.449,7.695,0.463,52.493,0.382,-0.102,0.924,2.43,0.370,0.000,0.126,0.000,0.122,0.010,1.895


In [None]:
reviews['Ladder score'] # doesn't change

In [None]:
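# bucket the ladder score into a rating: 3 (score >= 7), 2 (score >= 5), 1 (otherwise)
def new_ladder_score(row):
    if row['Ladder score'] >= 7:
        row['Ladder score'] = 3
    elif row['Ladder score'] >= 5:
        row['Ladder score'] = 2
    else:
        row['Ladder score'] = 1
    return row
# apply the function to each row, then keep only the rated column
ladder_ratings = reviews.apply(new_ladder_score, axis = 1).loc[:,'Ladder score']
ladder_ratings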
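
Since the rating depends on a single column, the same buckets can also be produced without a row-wise apply(). The sketch below shows two alternatives, not the notebook's own method: map() on the one column, and pd.cut() with explicit bin edges. The `scores` Series and the helper name `rate` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical scores chosen to sit on and around the thresholds.
scores = pd.Series([7.842, 7.0, 5.0, 4.999, 2.523], name="Ladder score")

def rate(score):
    """Return 3 for scores >= 7, 2 for scores >= 5, and 1 otherwise."""
    if score >= 7:
        return 3
    if score >= 5:
        return 2
    return 1

# Alternative 1: map() over the single column.
via_map = scores.map(rate)

# Alternative 2: pd.cut() with left-closed bins [-inf, 5), [5, 7), [7, inf).
via_cut = pd.cut(
    scores,
    bins=[-np.inf, 5, 7, np.inf],
    right=False,
    labels=[1, 2, 3],
).astype(int)

print(via_map.tolist())
print(via_cut.tolist())
```

Setting right=False makes each bin include its left edge, which matches the >= comparisons in the function.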