# about
* created by: Piramol Krishnan
* created on : 10 Aug 2021
* goal: cover the most frequently used Pandas `transform()` features

In [2]:
import pandas as pd
import numpy as np

# Intro

* Pandas provides an extensive set of built-in functions for manipulating data.
* Among them, `transform()` is useful when you want to manipulate values row by row or column by column.


## 1. Transforming Values
* let's examine `DataFrame.transform(func, axis=0)`
    * first argument: func
        * specifies the function to be used for manipulating data
        * it can be a function, a function name given as a string, a list of functions, or a dict of axis labels -> functions
    * second argument : axis
        * specifies which axis to apply the `func` to
            * `0`: applies func to each column
            * `1`: applies func to each row
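To make the `axis` argument concrete, here is a minimal sketch on a small throwaway frame. The names `demo` and `minus_min` are illustrative only and are not part of the original notebook. The function receives one column at a time when `axis=0` and one row at a time when `axis=1`.

```python
import pandas as pd

# Small illustrative frame; `demo` is a hypothetical name used only in this sketch.
demo = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

def minus_min(s):
    # `s` is a Series: one column when axis=0, one row when axis=1.
    return s - s.min()

by_column = demo.transform(minus_min, axis=0)  # subtracts each column's minimum
by_row = demo.transform(minus_min, axis=1)     # subtracts each row's minimum

print(by_column)
print(by_row)
```

In both cases the result keeps the same index and columns as `demo`; only the direction in which `minus_min` sees the data changes.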
            
            
   ### Example
   * pass a function to func

In [3]:
df = pd.DataFrame({'A': [1,2,3], 'B': [10,20,30] })

def plus_10(x):
    return x+10

df.transform(plus_10)

Unnamed: 0,A,B
0,11,20
1,12,30
2,13,40


* you can also use a lambda expression; here is the equivalent:
    ```
    df.transform(lambda x: x + 10)
    ```

1. 
    ### A string function
    * you can pass a function name as a string to func, e.g. `'sqrt'`
 

In [4]:
df.transform('sqrt')

Unnamed: 0,A,B
0,1.0,3.162278
1,1.414214,4.472136
2,1.732051,5.477226


1. 
   ### A list of functions
   * func can be a list of functions e.g. sqrt and exp from numpy

In [5]:
df.transform([np.sqrt, np.exp])


Unnamed: 0_level_0,A,A,B,B
Unnamed: 0_level_1,sqrt,exp,sqrt,exp
0,1.0,2.718282,3.162278,22026.47
1,1.414214,7.389056,4.472136,485165200.0
2,1.732051,20.085537,5.477226,10686470000000.0


1. 
    ### A dict of axis labels -> function
    * func can be a dict that maps axis labels to functions
    * it specifies which function to apply to which column
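The notebook gives no cell for the dict form, so here is a minimal sketch of what it could look like. The frame `demo` and the choice of `np.sqrt` for column `A` and `np.exp` for column `B` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Same shape of data as the earlier examples; `demo` is a hypothetical name.
demo = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# Keys are column labels, values are the function applied to that column.
result = demo.transform({'A': np.sqrt, 'B': np.exp})
print(result)
```

Unlike the list form, each column gets exactly one function, so the result keeps plain column labels `A` and `B` rather than two-level headers.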

---
## 2. Combining `groupby()` results
* One of the most compelling uses of Pandas `transform()` is combining `groupby()` results.


In [3]:
df = pd.DataFrame({
  'restaurant_id': [101,102,103,104,105,106,107],
  'address': ['A','B','C','D', 'E', 'F', 'G'],
  'city': ['London','London','London','Oxford','Oxford', 'Durham', 'Durham'],
  'sales': [10,500,48,12,21,22,14]
})
df

Unnamed: 0,restaurant_id,address,city,sales
0,101,A,London,10
1,102,B,London,500
2,103,C,London,48
3,104,D,Oxford,12
4,105,E,Oxford,21
5,106,F,Durham,22
6,107,G,Durham,14


* each city has multiple restaurants with sales
* what we want to know is: what percentage of its city's sales does each restaurant represent?
* we can do this in a few ways

## Approach 1
### Step 1: Use groupby() and apply() to calculate the city_total_sales


In [23]:
df.groupby('city').apply(sum)

Unnamed: 0_level_0,restaurant_id,address,city,sales
city,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Durham,213,FG,DurhamDurham,36
London,306,ABC,LondonLondonLondon,558
Oxford,209,DE,OxfordOxford,33


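* as you can see, applying `sum` to the whole group sums every column: `restaurant_id` is added up and the string columns (`address`, `city`) are concatenated, which is not meaningful. Selecting the `sales` column first avoids this.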


In [13]:
city_sales = df.groupby('city')['sales'].apply(sum).rename('city_total_sales').reset_index()
city_sales

Unnamed: 0,city,city_total_sales
0,Durham,36
1,London,558
2,Oxford,33


In [19]:
# using pandas built-in `sum()` function, functionally the same
city_sales = df.groupby('city')['sales'].sum().rename('city_total_sales').reset_index()
city_sales

Unnamed: 0,city,city_total_sales
0,Durham,36
1,London,558
2,Oxford,33


### Step 2: Use merge() function to combine the results



In [24]:
df_new = pd.merge(df, city_sales, on='city', how='left')
df_new

Unnamed: 0,restaurant_id,address,city,sales,city_total_sales
0,101,A,London,10,558
1,102,B,London,500,558
2,103,C,London,48,558
3,104,D,Oxford,12,33
4,105,E,Oxford,21,33
5,106,F,Durham,22,36
6,107,G,Durham,14,36


### Step 3: Calculate the percentage


In [25]:
df_new['pct'] = df_new['sales'] / df_new['city_total_sales']
df_new['pct'] = df_new['pct'].apply(lambda x: format(x, '.2%'))

In [26]:
df_new

Unnamed: 0,restaurant_id,address,city,sales,city_total_sales,pct
0,101,A,London,10,558,1.79%
1,102,B,London,500,558,89.61%
2,103,C,London,48,558,8.60%
3,104,D,Oxford,12,33,36.36%
4,105,E,Oxford,21,33,63.64%
5,106,F,Durham,22,36,61.11%
6,107,G,Durham,14,36,38.89%


* this approach works, but it takes several steps and an intermediate merge

## Approach 2: `groupby()` and `transform()`
### Step 1: Use groupby() and transform() to calculate the city_total_sales
* `transform()` returns a result with the same number of items as the original dataset.
* Therefore, a single `groupby()` followed by `transform('sum')` produces the same `city_total_sales` column as the aggregate-and-merge steps of Approach 1.
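The difference in output length can be checked directly. The sketch below rebuilds only the `city` and `sales` columns of the restaurant data; the names `sales_df`, `aggregated` and `broadcast` are illustrative and not from the notebook.

```python
import pandas as pd

# Only the two columns needed for the comparison.
sales_df = pd.DataFrame({
    'city': ['London', 'London', 'London', 'Oxford', 'Oxford', 'Durham', 'Durham'],
    'sales': [10, 500, 48, 12, 21, 22, 14],
})

grouped = sales_df.groupby('city')['sales']

aggregated = grouped.sum()            # one value per city, indexed by city
broadcast = grouped.transform('sum')  # one value per original row, indexed like sales_df

print(len(aggregated), len(broadcast))
```

Because `broadcast` shares the index of `sales_df`, it can be assigned straight back as a new column without a merge.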


In [29]:
df['city_total_sales'] = df.groupby('city')['sales'].transform('sum')
df

Unnamed: 0,restaurant_id,address,city,sales,city_total_sales
0,101,A,London,10,558
1,102,B,London,500,558
2,103,C,London,48,558
3,104,D,Oxford,12,33
4,105,E,Oxford,21,33
5,106,F,Durham,22,36
6,107,G,Durham,14,36


### Step 2: Calculate the percentage

In [30]:
df['pct'] = df['sales'] / df['city_total_sales']
df['pct'] = df['pct'].apply(lambda x: format(x, '.2%'))
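If the intermediate `city_total_sales` column is not needed, the two steps can be folded into one expression. This is a sketch under that assumption, using a rebuilt copy of the data named `sales_df` (a hypothetical name) and `str.format` for the percentage formatting.

```python
import pandas as pd

sales_df = pd.DataFrame({
    'restaurant_id': [101, 102, 103, 104, 105, 106, 107],
    'city': ['London', 'London', 'London', 'Oxford', 'Oxford', 'Durham', 'Durham'],
    'sales': [10, 500, 48, 12, 21, 22, 14],
})

# Each restaurant's sales divided by its city's total, computed without a helper column.
share = sales_df['sales'] / sales_df.groupby('city')['sales'].transform('sum')

# Format as a percentage string with two decimals, as in the cells above.
sales_df['pct'] = share.map('{:.2%}'.format)
print(sales_df)
```

Keeping `share` as a float Series until the last step makes it easier to sort or filter on the numeric value before formatting.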

## 3. Filtering data
* `transform()` can also be used to filter data. Here we keep only the records whose city’s total sales are greater than 40

In [31]:
df[df.groupby('city')['sales'].transform('sum')>40]

Unnamed: 0,restaurant_id,address,city,sales,city_total_sales,pct
0,101,A,London,10,558,1.79%
1,102,B,London,500,558,89.61%
2,103,C,London,48,558,8.60%
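The filter works because the comparison yields a boolean Series aligned with the original rows. The sketch below splits the one-liner into two steps to show that mask; `sales_df`, `mask` and `filtered` are illustrative names.

```python
import pandas as pd

sales_df = pd.DataFrame({
    'city': ['London', 'London', 'London', 'Oxford', 'Oxford', 'Durham', 'Durham'],
    'sales': [10, 500, 48, 12, 21, 22, 14],
})

# True for every row whose city total exceeds 40, False otherwise.
mask = sales_df.groupby('city')['sales'].transform('sum') > 40
print(mask.tolist())

# Boolean indexing keeps only the rows where the mask is True.
filtered = sales_df[mask]
print(filtered)
```

Only London (total 558) passes the threshold; Oxford (33) and Durham (36) are dropped.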


## 4. Handling missing values at the group level
* `transform()` can also be used to handle missing values at the group level

In [32]:
df = pd.DataFrame({
    'name': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'value': [1, np.nan, np.nan, 2, 8, 2, np.nan, 3]
})
df

Unnamed: 0,name,value
0,A,1.0
1,A,
2,B,
3,B,2.0
4,B,8.0
5,C,2.0
6,C,
7,C,3.0
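This excerpt does not show which fill strategy is used, so the following is only one possible sketch: it assumes each missing `value` should be replaced by the mean of its own `name` group. The name `filled` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'value': [1, np.nan, np.nan, 2, 8, 2, np.nan, 3]
})

# Assumption: fill each NaN with the mean of the non-missing values in the same group.
filled = df.groupby('name')['value'].transform(lambda x: x.fillna(x.mean()))
print(filled.tolist())
```

Since `transform()` returns one value per original row, `filled` can be assigned back with `df['value'] = filled`. Other group-level statistics, such as the median, would follow the same pattern.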
