
# Pandas - Summary Functions and Maps
https://www.kaggle.com/code/residentmario/summary-functions-and-maps

In [2]:
import pandas as pd

## Summary functions

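Pandas provides many 'summary functions' that restructure or condense data into a more useful form.

Consider the describe() method, which generates a high-level summary of the attributes of a given column. It is type-aware, so its output changes with the data type of the input: numeric columns get statistics such as the mean and quartiles, while string columns get a count, the number of unique values, and the most frequent value.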

In [25]:
df = pd.DataFrame({
    'dose': [1, 2, 3, 4, 5],
    'dose2': [2, 4, 6, 8, 10],
    'unit': ['mg', 'g', 'mg', 'mg', 'mmol'],
    'drug name': ['meformin', 'paracetamol', 'ibuprofen', 'drugs', 'drugs']
})
df

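|   | dose | dose2 | unit | drug name |
|---|------|-------|------|-------------|
| 0 | 1 | 2 | mg | meformin |
| 1 | 2 | 4 | g | paracetamol |
| 2 | 3 | 6 | mg | ibuprofen |
| 3 | 4 | 8 | mg | drugs |
| 4 | 5 | 10 | mmol | drugs |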


In [26]:
df.dose.describe()

count    5.000000
mean     3.000000
std      1.581139
min      1.000000
25%      2.000000
50%      3.000000
75%      4.000000
max      5.000000
Name: dose, dtype: float64

In [27]:
df.unit.describe()

count      5
unique     3
top       mg
freq       3
Name: unit, dtype: object
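When only one of these statistics is needed, it can be requested directly instead of reading it off the describe() output. The sketch below rebuilds a cut-down copy of the toy data so it runs on its own; the variable names are illustrative only.

```python
import pandas as pd

# Cut-down copy of the toy data, rebuilt so this cell is self-contained.
df = pd.DataFrame({
    'dose': [1, 2, 3, 4, 5],
    'unit': ['mg', 'g', 'mg', 'mg', 'mmol'],
})

dose_mean = df.dose.mean()      # arithmetic mean of a numeric column
dose_median = df.dose.median()  # middle value of a numeric column
unit_mode = df.unit.mode()[0]   # most frequent value of a text column

print(dose_mean, dose_median, unit_mode)  # 3.0 3.0 mg
```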

We can use the unique() method to get all unique values in a column.

In [28]:
df.unit.unique()

array(['mg', 'g', 'mmol'], dtype=object)

We can use value_counts() to see how often each unique value occurs.

In [29]:
df.unit.value_counts()

mg      3
g       1
mmol    1
Name: unit, dtype: int64
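If proportions are more useful than raw counts, value_counts() also accepts normalize=True. A minimal sketch on a standalone copy of the unit column (the names units and shares are just for illustration):

```python
import pandas as pd

# Standalone copy of the 'unit' column from the toy data.
units = pd.Series(['mg', 'g', 'mg', 'mg', 'mmol'], name='unit')

# normalize=True returns each value's share of the total instead of its count.
shares = units.value_counts(normalize=True)

print(shares['mg'])  # share of rows whose unit is 'mg' (3 out of 5)
```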

## Maps

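A map is a function that takes one set of values and maps them to another. The Series method map() applies such a function to every value in a column.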

In [30]:
# Center the dose at 0 by subtracting the column mean
dose_mean = df.dose.mean()
df.dose.map(lambda p: p - dose_mean)

0   -2.0
1   -1.0
2    0.0
3    1.0
4    2.0
Name: dose, dtype: float64
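map() also accepts a dictionary in place of a function, which is handy for lookups. The sketch below uses a hypothetical conversion table (to_mg is not part of the original notebook); values missing from the dictionary come back as NaN, and the original Series is left unchanged.

```python
import pandas as pd

# Standalone copy of the 'unit' column from the toy data.
unit = pd.Series(['mg', 'g', 'mg', 'mg', 'mmol'], name='unit')

# Hypothetical lookup table: factor that converts a dose to milligrams.
# 'mmol' is deliberately left out, since converting it needs a molar mass.
to_mg = {'mg': 1, 'g': 1000}

# Each unit is replaced by its factor; units absent from the dict become NaN.
factor = unit.map(to_mg)

print(factor)
```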

### .apply()

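The function passed to map() should expect a single value from the Series and return a transformed version of that value. map() returns a new Series in which every value has been transformed.

apply() is the equivalent method for transforming a whole DataFrame: when called with axis='columns', it calls a custom function on each row.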

In [31]:
def concat_dose(row):
    row['dose concat'] = f"{row.dose} {row.unit}"
    return row

df.apply(concat_dose, axis='columns')

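|   | dose | dose2 | unit | drug name | dose concat |
|---|------|-------|------|-------------|-------------|
| 0 | 1 | 2 | mg | meformin | 1 mg |
| 1 | 2 | 4 | g | paracetamol | 2 g |
| 2 | 3 | 6 | mg | ibuprofen | 3 mg |
| 3 | 4 | 8 | mg | drugs | 4 mg |
| 4 | 5 | 10 | mmol | drugs | 5 mmol |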
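The call above passes axis='columns', so the function receives one row at a time. As a contrast, here is a small sketch with axis='index', where the function receives one whole column at a time. The helper name center is made up for this example. Like map(), apply() returns a new object and does not modify the DataFrame it is called on.

```python
import pandas as pd

# Numeric columns of the toy data, rebuilt so this cell is self-contained.
df = pd.DataFrame({
    'dose': [1, 2, 3, 4, 5],
    'dose2': [2, 4, 6, 8, 10],
})

def center(col):
    # With axis='index', col is an entire column (a Series), not a row.
    return col - col.mean()

centered = df.apply(center, axis='index')

print(centered)
print(df)  # the original DataFrame is unchanged
```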


Operators such as - are faster than map() or apply() because they use speed-ups built into pandas. The same holds for all standard Python operators (>, <, ==, etc.). However, operators are less flexible than map() or apply(), which can do more advanced things such as applying conditional logic.

In [32]:
# total dose
df.dose + df.dose2

0     3
1     6
2     9
3    12
4    15
dtype: int64

In [38]:
df['drug name'] + ': ' + df.dose.astype(str) + ' ' + df.unit

0      meformin: 1 mg
1    paracetamol: 2 g
2     ibuprofen: 3 mg
3         drugs: 4 mg
4       drugs: 5 mmol
dtype: object