### Summary functions

- Pandas provides many simple "summary functions" (not an official name) which restructure the data in some useful way.
- The describe() method generates a high-level summary of the values in the given column.
- It is type-aware - its output changes based on the data type of the input.
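
As a minimal sketch of that type-awareness, the snippet below calls describe() on a numeric column and on a string column. The `toy` DataFrame and its column names are made up for illustration and are not part of the happiness dataset.

```python
import pandas as pd

# Hypothetical toy data (not the happiness report itself)
toy = pd.DataFrame({
    'score': [7.8, 7.6, 5.1, 4.9],
    'region': ['Western Europe', 'Western Europe', 'East Asia', 'South Asia'],
})

# Numeric column: count, mean, std, min, quartiles, max
numeric_summary = toy['score'].describe()

# String column: count, unique, top (most frequent value), freq (its frequency)
string_summary = toy['region'].describe()

print(list(numeric_summary.index))
print(list(string_summary.index))
```

The two results share only the `count` entry; the remaining statistics depend on the column's data type.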

In [None]:
import pandas as pd
#reviews = pd.read_csv('world-happiness-report-2021/world-happiness-report-2021.csv', index_col=0)
reviews = pd.read_csv('world-happiness-report-2021/world-happiness-report-2021.csv')
pd.set_option('display.max_rows',5)
reviews

In [None]:
# The output for numerical data
reviews['Ladder score'].describe()

In [None]:
# The output for string data
reviews['Regional indicator'].describe()

In [None]:
# Country names of the rows whose region is 'Sub-Saharan Africa'
reviews.loc[reviews['Regional indicator']=='Sub-Saharan Africa',['Country name']]

In [None]:
# A single summary statistic: the mean of the 'Ladder score' column
reviews['Ladder score'].mean()

In [None]:
# the unique() method: to see an array of the unique values
reviews['Regional indicator'].unique()

In [None]:
# the value_counts() method: to see a list of unique values and how often they occur in the dataset
reviews['Regional indicator'].value_counts()

In [None]:
# idxmax() returns the row label (index value) of the maximum value.
id_max = reviews['Healthy life expectancy'].idxmax()
reviews.loc[id_max,'Country name']

In [None]:
reviews.loc[reviews['Generosity'].idxmax(),['Country name','Generosity']]
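
The sketch below illustrates why idxmax() pairs with .loc: it returns an index label, not a positional offset. The `toy` frame and its non-default index are hypothetical, chosen so that the label and the position differ.

```python
import pandas as pd

# Hypothetical toy data with a non-default index
toy = pd.DataFrame(
    {'country': ['A', 'B', 'C'], 'generosity': [0.1, 0.5, 0.3]},
    index=[10, 20, 30],
)

# The maximum sits at position 1, but idxmax() returns its label, 20
label = toy['generosity'].idxmax()

# .loc looks rows up by label, so the two fit together
top_country = toy.loc[label, 'country']
print(label, top_country)
```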

### Map
- A map is a function that takes one set of values and "maps" them to another set of values.
- There are two mapping methods that you will use often.
   - map() 
      - The function (often a lambda) you pass to map() should expect a single value from the Series and return a transformed version of that value. 
      - map() returns a new Series where all the values have been transformed by your function.
   - apply()
      - Transforms a whole DataFrame by calling a custom function on each row or each column.
      - axis parameter: {0 or 'index', 1 or 'columns'}, default 0
      - axis='index' - the function receives and transforms each column.
      - axis='columns' - the function receives and transforms each row.

- Note that map() and apply() return new, transformed Series and DataFrames, respectively. 
    - They don't modify the original data they're called on. 
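
The points above can be sketched on a small, made-up DataFrame (the `toy` frame and its column names `a` and `b` are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# map(): the lambda receives one value at a time and returns a new Series
doubled = toy['a'].map(lambda x: x * 2)

# apply() with axis='index': the lambda receives each column as a Series
col_ranges = toy.apply(lambda col: col.max() - col.min(), axis='index')

# apply() with axis='columns': the lambda receives each row as a Series
row_sums = toy.apply(lambda row: row['a'] + row['b'], axis='columns')

print(doubled.tolist())
print(col_ranges.to_dict())
print(row_sums.tolist())
print(toy['a'].tolist())  # the original column is untouched
```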

In [None]:
# Re-center the scores by subtracting the mean from each value
ladder_score_mean = reviews['Ladder score'].mean()
reviews['Ladder score'].map(lambda x : x - ladder_score_mean)

In [None]:
reviews['Ladder score'] # doesn't change

In [None]:
# Count the rows whose 'Regional indicator' contains 'Asia'
temp = reviews['Regional indicator'].map(lambda x : 'Asia' in x)
asia_count = temp.sum() # sum() treats each True as 1, so this counts the matches
non_asia_count = (~temp).sum() # ~temp inverts the mask, so this counts the non-matches
print(asia_count)
print(non_asia_count)

In [None]:
reviews['Regional indicator'].str.contains('Asia')

In [None]:
# Same as above, but using Series.str.contains().
# count() returns the number of non-NA/null observations in the Series.
asia_count = reviews['Regional indicator'].loc[reviews['Regional indicator'].str.contains('Asia')].count()
asia_count
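
One practical difference between the two approaches shows up with missing values. This is a sketch under the assumption that a column contains a missing entry (the dataset used here may have none); the `regions` Series is made up. A lambda such as `lambda x: 'Asia' in x` would raise a TypeError on the missing entry, while str.contains() accepts an `na` argument that decides how such entries are treated.

```python
import pandas as pd

# Hypothetical Series with one missing entry
regions = pd.Series(['South Asia', 'Western Europe', None, 'East Asia'])

# na=False treats missing entries as "no match" instead of propagating NaN
mask = regions.str.contains('Asia', na=False)

asia_count = int(mask.sum())         # True values count as 1
non_asia_count = int((~mask).sum())  # includes the missing entry
print(asia_count, non_asia_count)
```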

In [None]:
ladder_score_mean = reviews['Ladder score'].mean()
def remean_ladder_score(row):
    row['Ladder score'] = row['Ladder score'] - ladder_score_mean
    #return row['Country name': 'Ladder score']
    return row
    
reviews.apply(remean_ladder_score, axis = 1) # axis = 1 is the same as axis = 'columns'

In [None]:
reviews['Ladder score'] # doesn't change

In [None]:
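# Bin each ladder score into a rating: 3 for >= 7, 2 for >= 5, otherwise 1
def new_ladder_score(row):
    if row['Ladder score'] >= 7:
        row['Ladder score'] = 3
    elif row['Ladder score'] >= 5:
        row['Ladder score'] = 2
    else:
        row['Ladder score'] = 1
    return row

# Apply the function row by row and keep only the rated column
ladder_ratings = reviews.apply(new_ladder_score, axis = 1).loc[:,'Ladder score']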
ladder_ratings 
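
Because the rating depends on a single column, the same binning can also be sketched with Series.map() instead of a row-wise apply(). The `scores` Series and the helper name `rate` below are hypothetical, and the thresholds are copied from the cell above.

```python
import pandas as pd

# Hypothetical scores standing in for the 'Ladder score' column
scores = pd.Series([7.8, 6.2, 5.0, 3.4])

def rate(score):
    """Return 3 for scores >= 7, 2 for scores >= 5, otherwise 1."""
    if score >= 7:
        return 3
    elif score >= 5:
        return 2
    return 1

# map() passes one value at a time, so no full row is built for each call
ratings = scores.map(rate)
print(ratings.tolist())
```

This version only touches the one column it needs, which tends to be simpler to read when no other columns are involved.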