# Introduction
Pandas is one of the signature libraries of the Python data science ecosystem. However, because Python does not natively support tabular data the way R (with data frames) or SQL (with tables) does, part of the Pandas syntax is closer to a workaround and is not very user-friendly. This topic introduces some techniques and libraries that can make writing Pandas code easier.

In [16]:
%%html
<style>
.dataframe th {
    font-size: 11px;
}
.dataframe td {
    font-size: 11px;
}
</style>

# 1. Pandas
What tool could improve Pandas better than Pandas itself? It may seem odd for Pandas to appear here, but this section covers an almost entirely different syntax. In 2014, Pandas added two new methods,
<code style='font-size:13px'><a href='https://pandas.pydata.org/docs/reference/api/pandas.eval.html'>eval()</a></code> and
<code style='font-size:13px'><a href='https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html'>query()</a></code>
using the [Numexpr] backend to reduce intermediate computation and improve readability. These methods are dedicated to specific tasks, selection and filtering, which are a nightmare in plain Pandas because we have to repeat the dataframe variable so many times.

[Numexpr]: https://github.com/pydata/numexpr

## 1.1. Eval
The best use of <code style='font-size:13px'>eval()</code> is to [improve column addition]. It has two backends, selected with the <code style='font-size:13px'>engine</code> argument; each has a slightly different syntax, but both are easy to learn. The default backend, *numexpr*, supports a wide range of mathematical transformations with better performance (the difference only shows on large datasets with at least 10,000 rows). The other backend, *python*, does not improve performance but gives a clearer syntax by removing the repetition of the dataframe name.

[improve column addition]: https://pandas.pydata.org/docs/user_guide/enhancingperf.html#expression-evaluation-via-eval
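
To make the saving concrete, here is a minimal sketch that compares the usual `assign` style with `eval()`. The three-row frame is made up for illustration (it is not the iris file), and the check at the end only confirms that both styles give the same columns.

```python
import pandas as pd

# Toy data standing in for two iris columns (values are invented).
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3],
    "sepal_width": [3.5, 3.0, 2.9],
})
coef = 5

# Plain Pandas: the dataframe name is repeated for every column.
plain = df.assign(
    sepal_ratio=df["sepal_length"] / df["sepal_width"],
    spl_multiple=df["sepal_length"] * coef,
)

# eval(): columns are referenced by name, local variables with @.
evaluated = (
    df
    .eval("sepal_ratio = sepal_length / sepal_width")
    .eval("spl_multiple = sepal_length * @coef")
)

# Both approaches should produce the same frame.
pd.testing.assert_frame_equal(plain, evaluated)
```

Neither call modifies `df`; each `eval()` returns a new dataframe, which is what makes it chainable.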

In [1]:
import numpy as np
import pandas as pd

In [52]:
dfIris = pd.read_csv('data/iris.csv')

coef = 5

(
    dfIris
    .eval('sepal_ratio = sepal_length/sepal_width')
    .eval('spl_multiple = sepal_length * @coef')
    .eval('upper_species = species.str.upper()', engine='python')
    .head()
)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,sepal_ratio,spl_multiple,upper_species
0,5.1,3.5,1.4,0.2,setosa,1.457143,25.5,SETOSA
1,4.9,3.0,1.4,0.2,setosa,1.633333,24.5,SETOSA
2,4.7,3.2,1.3,0.2,setosa,1.46875,23.5,SETOSA
3,4.6,3.1,1.5,0.2,setosa,1.483871,23.0,SETOSA
4,5.0,3.6,1.4,0.2,setosa,1.388889,25.0,SETOSA


## 1.2. Query
If <code style='font-size:13px'>eval()</code> works on columns, then <code style='font-size:13px'>query()</code> works on rows to [improve filtering][improves filtering].

[improves filtering]: https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-query
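
For comparison, the sketch below filters a small invented frame twice, once with a boolean mask and once with `query()`. The variable names `spl_min` and `spw_max` are ours, and `engine="python"` is passed for the string method only as a precaution, since support for `.str` inside `query()` may depend on the Pandas version.

```python
import pandas as pd

# Invented rows that mimic the iris columns used in this section.
df = pd.DataFrame({
    "sepal_length": [4.8, 5.6, 6.1, 5.9],
    "sepal_width": [3.0, 2.8, 4.2, 3.0],
    "species": ["setosa", "Virginica", "virginica", "virginica"],
})
spl_min, spw_max = 5, 4

# Boolean mask: the dataframe name appears in every condition.
mask = (
    (df["sepal_length"] > spl_min)
    & (df["sepal_width"] < spw_max)
    & (df["species"].str.lower() == "virginica")
)
by_mask = df[mask]

# query(): the same conditions written as strings.
by_query = (
    df
    .query("sepal_length > @spl_min and sepal_width < @spw_max")
    .query("species.str.lower() == 'virginica'", engine="python")
)

pd.testing.assert_frame_equal(by_mask, by_query)
```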

In [1]:
import numpy as np
import pandas as pd

In [80]:
dfIris = pd.read_csv('data/iris.csv')

splMin = 5
spwMax = 4

(
    dfIris
    .query('sepal_length > @splMin & sepal_width < @spwMax')
    .query('species.str.lower() == "virginica"')
    .query('sepal_length in sepal_length.sort_values().head(5)')
    .head()
)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
101,5.8,2.7,5.1,1.9,virginica
113,5.7,2.5,5.0,2.0,virginica
114,5.8,2.8,5.1,2.4,virginica
121,5.6,2.8,4.9,2.0,virginica
142,5.8,2.7,5.1,1.9,virginica


# 2. Janitor
This section is the heart of the entire topic. It is about [Pyjanitor], a Python version of R's convenient Janitor package. The API improves Pandas in three ways: (1) rewriting some existing Pandas methods in a more convenient form, such as [pivot], (2) adding new high-level methods such as
[clean names], and (3) making use of *method chaining* with methods such as [add columns].

Method chaining is a killer feature in Python and can be considered the equivalent of the *pipe operator* in R. It allows writing a sequence of method calls on a single object, which suits data processing well. A distinctive feature of Pyjanitor is that it is native to Pandas: the only thing you need to do is import the library, and Pyjanitor adds its powerful [methods] to the existing <code style='font-size:13px'>DataFrame</code> class.

[Pyjanitor]: https://pyjanitor-devs.github.io/pyjanitor/api/functions/
[pivot]: https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.pivot
[clean names]: https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.clean_names
[add columns]: https://pyjanitor-devs.github.io/pyjanitor/api/functions/#janitor.functions.add_columns
[methods]: https://pyjanitor-devs.github.io/pyjanitor/api/functions/
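
To give an idea of how an import can add chainable methods to `DataFrame`, here is a simplified sketch. The helper `register_method` and the two methods `add_total` and `upper_names` are hypothetical names invented for this example; Pyjanitor does its own registration through the pandas_flavor package, so this is only an illustration of the idea, not its actual code.

```python
import pandas as pd

def register_method(func):
    """Hypothetical helper: attach func to DataFrame under its own name."""
    setattr(pd.DataFrame, func.__name__, func)
    return func

@register_method
def add_total(df, columns, new_column_name):
    """Illustrative method: add a column holding the row-wise sum of columns."""
    return df.assign(**{new_column_name: df[columns].sum(axis=1)})

@register_method
def upper_names(df):
    """Illustrative method: upper-case every column name."""
    return df.rename(columns=str.upper)

# Once registered, the functions chain like any built-in method.
result = (
    pd.DataFrame({"sales": [1, 2], "profit": [3, 4]})
    .add_total(["sales", "profit"], "total")
    .upper_names()
)
```

Because every method returns a new dataframe, the calls can be stacked in any order without intermediate variables.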

## 2.1. Data cleaning

In [1]:
import numpy as np
import pandas as pd
import janitor

### Case when

In [2]:
dfSale = pd.DataFrame({
    'salesMonth': ['Jan', 'Feb', 'Mar','Jan', 'Feb', 'Mar','Jan', 'Feb', 'Mar'],
    'company': ['A1','A1','A1','A10','A10','A10','A2','A2','A2'],
    'sales': [400, 600, np.nan, 500, 550, 400, np.nan, 100, 300],
    'profit': [200, 350, np.nan, 300, 400, 220, np.nan, 50, 280]
})
dfSale.head()

Unnamed: 0,salesMonth,company,sales,profit
0,Jan,A1,400.0,200.0
1,Feb,A1,600.0,350.0
2,Mar,A1,,
3,Jan,A10,500.0,300.0
4,Feb,A10,550.0,400.0


In [7]:
dfSale.case_when(
    dfSale.salesMonth == 'Jan', dfSale.sales*1.5,
    "salesMonth == 'Feb'", dfSale.sales*1.2,
    400, column_name='newSale'
).head()

Unnamed: 0,salesMonth,company,sales,profit,newSale
0,Jan,A1,400.0,200.0,600.0
1,Feb,A1,600.0,350.0,720.0
2,Mar,A1,,,400.0
3,Jan,A10,500.0,300.0,750.0
4,Feb,A10,550.0,400.0,660.0
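
For readers who want to see what this replaces, the same logic can be sketched in plain Pandas with `np.select`. The three-row frame is a cut-down copy of `dfSale`; this is our own equivalent, not how Pyjanitor implements `case_when`.

```python
import numpy as np
import pandas as pd

# Cut-down copy of dfSale with one row per month.
df = pd.DataFrame({
    "salesMonth": ["Jan", "Feb", "Mar"],
    "sales": [400.0, 600.0, np.nan],
})

# One condition per branch, evaluated in order; the first match wins.
conditions = [df["salesMonth"] == "Jan", df["salesMonth"] == "Feb"]
choices = [df["sales"] * 1.5, df["sales"] * 1.2]

# Rows matching no condition receive the default value.
result = df.assign(newSale=np.select(conditions, choices, default=400))
```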


In [10]:
# update cols based on conditions
dfSale.update_where(
    conditions="salesMonth=='Jan'",
    target_column_name='profit',
    target_val=0
).head()

Unnamed: 0,salesMonth,company,sales,profit
0,Jan,A1,400.0,0.0
1,Feb,A1,600.0,350.0
2,Mar,A1,,
3,Jan,A10,500.0,0.0
4,Feb,A10,550.0,400.0


In [5]:
# apply function in cols
(
    dfSale
    .transform_column('salesMonth', function=lambda x : x.upper())
    .transform_columns(['sales','profit'], function=np.log, suffix='_log')
    .head()
)

Unnamed: 0,salesMonth,company,sales,profit,sales_log,profit_log
0,JAN,A1,400.0,200.0,5.991465,5.298317
1,FEB,A1,600.0,350.0,6.39693,5.857933
2,MAR,A1,,,,
3,JAN,A10,500.0,300.0,6.214608,5.703782
4,FEB,A10,550.0,400.0,6.309918,5.991465


In [64]:
# the also method allows printing intermediate steps
(
    dfSale
    .also(lambda df: print(f'DataFrame shape is: {df.shape}'))
    .sort_values(by='sales')
    .also(lambda df: print(f'Columns: {df.columns}'))
    .also(lambda df: df.dropna()) # the value returned by the function is ignored
    .head(3)
)

DataFrame shape is: (9, 4)
Columns: Index(['salesMonth', 'company', 'sales', 'profit'], dtype='object')


Unnamed: 0,salesMonth,company,sales,profit
7,Feb,A2,100.0,50.0
8,Mar,A2,300.0,280.0
0,Jan,A1,400.0,200.0
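
The behaviour of `also` can be imitated in plain Pandas with `pipe`. In the sketch below, `tap` is a hypothetical helper written for this example: it calls a function for its side effect and hands the original dataframe back, so whatever the function returns is thrown away.

```python
import pandas as pd

def tap(df, func):
    """Hypothetical stand-in for also(): run func for its side effect only."""
    func(df)   # the return value is deliberately discarded
    return df  # the chain continues with the unchanged dataframe

log = []
df = pd.DataFrame({"sales": [400.0, None, 100.0]})

result = (
    df
    .pipe(tap, lambda d: log.append(d.shape))  # side effect: record the shape
    .sort_values("sales")
    .pipe(tap, lambda d: d.dropna())           # has no effect on the chain
    .head(3)
)
```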


### Data types

In [63]:
# change data type
dfSale.change_type(['sales', 'profit'], dtype=float).head(3)

Unnamed: 0,salesMonth,company,sales,profit
0,Jan,A1,400.0,200.0
1,Feb,A1,600.0,350.0
2,Mar,A1,,


In [66]:
# change data type to categorical
dfSale = dfSale.encode_categorical(salesMonth=['Jan','Feb','Mar'])
dfSale.dtypes

salesMonth    category
company         object
sales          float64
profit         float64
dtype: object

In [74]:
# convert date
df = pd.DataFrame({
    'excel_date': [39690, 39690, 37118],
    'unix_date': [1651510462, 53394822, 1126233195],
    'date': ['04/12/2021', '5/1/2022', '15/3/2022']
})

(
    df
    .convert_excel_date('excel_date')
    .convert_unix_date('unix_date')
    .to_datetime('date', format='%d/%m/%Y')
    .truncate_datetime_dataframe('month')
)

Unnamed: 0,excel_date,unix_date,date
0,2008-08-01,2022-05-01,2021-12-01
1,2008-08-01,1971-09-01,2022-01-01
2,2001-08-01,2005-09-01,2022-03-01
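
The conversions above can also be sketched with `pd.to_datetime` alone. We assume here that the Excel serial numbers follow the 1900 date system, for which `1899-12-30` is the usual origin; the month truncation through `to_period` is our own choice of technique.

```python
import pandas as pd

df = pd.DataFrame({
    "excel_date": [39690, 37118],
    "unix_date": [1651510462, 1126233195],
    "date": ["04/12/2021", "15/3/2022"],
})

converted = df.assign(
    # Excel serial number: days counted from 1899-12-30 (assumed 1900 system).
    excel_date=pd.to_datetime(df["excel_date"], unit="D", origin="1899-12-30"),
    # Unix timestamp: seconds counted from 1970-01-01.
    unix_date=pd.to_datetime(df["unix_date"], unit="s"),
    # Day-first strings need an explicit format.
    date=pd.to_datetime(df["date"], format="%d/%m/%Y"),
)

# Truncate every column to the first day of its month.
truncated = converted.apply(lambda col: col.dt.to_period("M").dt.to_timestamp())
```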


### Imputing

In [75]:
# fillna
dfSale.fill_direction(sales='up', profit='down')

Unnamed: 0,salesMonth,company,sales,profit
0,Jan,A1,400.0,200.0
1,Feb,A1,600.0,350.0
2,Mar,A1,500.0,350.0
3,Jan,A10,500.0,300.0
4,Feb,A10,550.0,400.0
5,Mar,A10,400.0,220.0
6,Jan,A2,100.0,220.0
7,Feb,A2,100.0,50.0
8,Mar,A2,300.0,280.0
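
Judging from the output above, `'up'` takes the next valid value and `'down'` carries the previous one forward. Under that reading, a plain Pandas sketch of the same call uses `bfill` and `ffill` on a shortened copy of the data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sales": [400, 600, np.nan, 500],
    "profit": [200, 350, np.nan, 300],
})

filled = df.assign(
    sales=df["sales"].bfill(),    # 'up': fill with the next valid value
    profit=df["profit"].ffill(),  # 'down': fill with the previous valid value
)
```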


In [15]:
dfSale.impute(column_name=["sales",'profit'], statistic_column_name='mean')

Unnamed: 0,salesMonth,company,sales,profit
0,Jan,A1,400.0,200.0
1,Feb,A1,600.0,350.0
2,Mar,A1,332.142857,332.142857
3,Jan,A10,500.0,300.0
4,Feb,A10,550.0,400.0
5,Mar,A10,400.0,220.0
6,Jan,A2,332.142857,332.142857
7,Feb,A2,100.0,50.0
8,Mar,A2,300.0,280.0


### Sorting

In [77]:
# sort by custom order
mapMonthOrder = {
    'Jan': 1,
    'Feb': 2,
    'Mar': 3
}
dfSale.sort_column_value_order('salesMonth', mapMonthOrder)

Unnamed: 0,salesMonth,company,sales,profit
0,Jan,A1,400.0,200.0
3,Jan,A10,500.0,300.0
6,Jan,A2,,
1,Feb,A1,600.0,350.0
4,Feb,A10,550.0,400.0
7,Feb,A2,100.0,50.0
2,Mar,A1,,
5,Mar,A10,400.0,220.0
8,Mar,A2,300.0,280.0


In [17]:
dfSale.sort_naturally('company')

Unnamed: 0,salesMonth,company,sales,profit
0,Jan,A1,400.0,200.0
1,Feb,A1,600.0,350.0
2,Mar,A1,332.142857,332.142857
6,Jan,A2,332.142857,332.142857
7,Feb,A2,100.0,50.0
8,Mar,A2,300.0,280.0
3,Jan,A10,500.0,300.0
4,Feb,A10,550.0,400.0
5,Mar,A10,400.0,220.0
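
To see why natural sorting matters, the sketch below contrasts it with the default lexical order on a tiny invented frame. It assumes labels made of a text prefix followed by digits, which is a much narrower case than what `sort_naturally` handles.

```python
import pandas as pd

df = pd.DataFrame({
    "company": ["A1", "A10", "A2", "B1"],
    "sales": [1, 2, 3, 4],
})

# Lexical order compares character by character, so "A10" sorts before "A2".
lexical = df.sort_values("company")

# Natural order: split the label into a text prefix and an integer suffix.
parts = df["company"].str.extract(r"^(?P<prefix>\D*)(?P<number>\d+)$")
order = (
    parts
    .assign(number=parts["number"].astype(int))
    .sort_values(["prefix", "number"])
    .index
)
natural = df.loc[order]
```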


In [78]:
# get the first value of subset
dfSale.take_first(subset='company', by='salesMonth')

Unnamed: 0,salesMonth,company,sales,profit
0,Jan,A1,400.0,200.0
3,Jan,A10,500.0,300.0
6,Jan,A2,,


### Rename

In [85]:
# clean cols name
dfName = pd.DataFrame({
    '__date of birth':[20220101,19980502,20191003],
    'FIRST name':['Ana', 'Lisa','Bella'],
    'last_name_': ['Jonh','Blackpink', 'Mone'],
    'Address@#$%':['HN-VN','Seoul-KOR','HCM-VN']
})

dfName = dfName.clean_names(strip_underscores=True, case_type='snake', remove_special=True)
dfName

Unnamed: 0,date_of_birth,first_name,last_name,address
0,20220101,Ana,Jonh,HN-VN
1,19980502,Lisa,Blackpink,Seoul-KOR
2,20191003,Bella,Mone,HCM-VN


In [86]:
# rename cols with a function
dfName.rename_columns(function=str.upper)

Unnamed: 0,DATE_OF_BIRTH,FIRST_NAME,LAST_NAME,ADDRESS
0,20220101,Ana,Jonh,HN-VN
1,19980502,Lisa,Blackpink,Seoul-KOR
2,20191003,Bella,Mone,HCM-VN


### Miscellaneous

In [87]:
# concat cols
dfName.concatenate_columns(['first_name', 'last_name'], new_column_name='full_name', sep=' ')

Unnamed: 0,date_of_birth,first_name,last_name,address,full_name
0,20220101,Ana,Jonh,HN-VN,Ana Jonh
1,19980502,Lisa,Blackpink,Seoul-KOR,Lisa Blackpink
2,20191003,Bella,Mone,HCM-VN,Bella Mone


In [88]:
# split cols
dfName.deconcatenate_column('address', sep='-', new_column_names=['city', 'country'])

Unnamed: 0,date_of_birth,first_name,last_name,address,full_name,city,country
0,20220101,Ana,Jonh,HN-VN,Ana Jonh,HN,VN
1,19980502,Lisa,Blackpink,Seoul-KOR,Lisa Blackpink,Seoul,KOR
2,20191003,Bella,Mone,HCM-VN,Bella Mone,HCM,VN


In [91]:
# filter date
(
    dfName
    .to_datetime('date_of_birth', format='%Y%m%d')
    .filter_date("date_of_birth", years=[2022])
    .filter_date("date_of_birth", start_date='2022-01-01', end_date='2022-12-31')
)

Unnamed: 0,date_of_birth,first_name,last_name,address,full_name
0,2022-01-01,Ana,Jonh,HN-VN,Ana Jonh


In [93]:
# filter contains string
dfName.filter_string(column_name="address", search_string="VN")

Unnamed: 0,date_of_birth,first_name,last_name,address,full_name
0,2022-01-01,Ana,Jonh,HN-VN,Ana Jonh
2,2019-10-03,Bella,Mone,HCM-VN,Bella Mone


In [94]:
# perform string func
import re
dfName.process_text(
    column_name='first_name',
    string_function='extract',
    pat=r'(an)',
    expand=False,
    flags=re.IGNORECASE
)

Unnamed: 0,date_of_birth,first_name,last_name,address,full_name
0,2022-01-01,An,Jonh,HN-VN,Ana Jonh
1,1998-05-02,,Blackpink,Seoul-KOR,Lisa Blackpink
2,2019-10-03,,Mone,HCM-VN,Bella Mone


In [97]:
dfName.find_replace(
    match="regex",
    address={"VN": "Việt Nam"}
)

Unnamed: 0,date_of_birth,first_name,last_name,address,full_name
0,2022-01-01,Ana,Jonh,Việt Nam,Ana Jonh
1,1998-05-02,Lisa,Blackpink,Seoul-KOR,Lisa Blackpink
2,2019-10-03,Bella,Mone,Việt Nam,Bella Mone


## 2.2. Data transformation 

In [1]:
import numpy as np
import pandas as pd
import janitor

In [13]:
dfSale = pd.DataFrame({
    'salesMonth': ['Jan', 'Feb', 'Mar','Jan', 'Feb', 'Mar','Jan', 'Feb', 'Mar'],
    'company': ['A1','A1','A1','A10','A10','A10','A2','A2','A2'],
    'sales': [400, 600, np.nan, 500, 550, 400, np.nan, 100, 300],
    'profit': [200, 350, np.nan, 300, 400, 220, np.nan, 50, 280]
})
dfCus = pd.DataFrame({
    'salesMonth': ['Jan', 'Jan', 'Jan','Jan', 'Jan', 'Jan'],
    'company': ['A1','A1','A1','A10','A10','A10'],
    'customer':['B1','B2','B3','C1','C2','C3'],
    'sales':[100.0,200.0,100.0,250.0,100,150]
})

### Aggregating

In [17]:
# flatten multi index
(
    dfSale
    .select_columns('company', 'sales', 'profit')
    .groupby('company')
    .agg(['sum', 'mean'])
    .collapse_levels(sep='_')
    .reset_index()
)

Unnamed: 0,company,sales_sum,sales_mean,profit_sum,profit_mean
0,A1,1000.0,500.0,550.0,275.0
1,A10,1450.0,483.333333,920.0,306.666667
2,A2,400.0,200.0,330.0,165.0


In [18]:
(
    dfSale
    .groupby_agg(
        by='company',
        agg='mean',
        new_column_name='mean_profit',
        agg_column_name='profit'
    )
    .head()
)

Unnamed: 0,salesMonth,company,sales,profit,mean_profit
0,Jan,A1,400.0,200.0,275.0
1,Feb,A1,600.0,350.0,275.0
2,Mar,A1,,,275.0
3,Jan,A10,500.0,300.0,306.666667
4,Feb,A10,550.0,400.0,306.666667
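
In plain Pandas, the usual way to attach a group statistic to every row is `groupby(...).transform(...)`. The sketch below reproduces the `mean_profit` column on a shortened copy of `dfSale`; it is our own equivalent rather than Pyjanitor's implementation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "company": ["A1", "A1", "A1", "A10", "A10", "A10"],
    "profit": [200, 350, np.nan, 300, 400, 220],
})

# transform() returns one value per original row, so the row count is kept.
result = df.assign(
    mean_profit=df.groupby("company")["profit"].transform("mean")
)
```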


### Non-equi join

In [34]:
# non-equi join
dfSale.conditional_join(
    dfCus,
    ('profit', 'sales', '<'),
    ('company', 'company', '=='),
    how='inner'
)

Unnamed: 0_level_0,left,left,left,left,right,right,right,right
Unnamed: 0_level_1,salesMonth,company,sales,profit,salesMonth,company,customer,sales
0,Mar,A10,400.0,220.0,Jan,A10,C1,250.0
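
Without `conditional_join`, a non-equi join is usually written as an equi join followed by a filter. The sketch below does this on trimmed copies of the two frames (only the columns needed for the conditions are kept). It can be much slower on large data because the merge builds every company pair first.

```python
import numpy as np
import pandas as pd

sale = pd.DataFrame({
    "salesMonth": ["Jan", "Feb", "Mar", "Jan", "Feb", "Mar"],
    "company": ["A1", "A1", "A1", "A10", "A10", "A10"],
    "profit": [200, 350, np.nan, 300, 400, 220],
})
cus = pd.DataFrame({
    "company": ["A1", "A1", "A1", "A10", "A10", "A10"],
    "customer": ["B1", "B2", "B3", "C1", "C2", "C3"],
    "sales": [100.0, 200.0, 100.0, 250.0, 100.0, 150.0],
})

joined = (
    sale
    .merge(cus, on="company", how="inner")  # equality condition
    .query("profit < sales")                # inequality condition
    .reset_index(drop=True)
)
```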


### Pivot

In [35]:
dfWide = dfSale.pivot_wider(
    index='salesMonth',
    names_from='company'
)
dfWide

Unnamed: 0,salesMonth,sales_A1,sales_A10,sales_A2,profit_A1,profit_A10,profit_A2
0,Feb,600.0,550.0,100.0,350.0,400.0,50.0
1,Jan,400.0,500.0,,200.0,300.0,
2,Mar,,400.0,300.0,,220.0,280.0


In [36]:
dfWide.pivot_longer(
    index='salesMonth',
    names_to=('type','company'),
    names_sep='_'
).head()

Unnamed: 0,salesMonth,type,company,value
0,Feb,sales,A1,600.0
1,Jan,sales,A1,400.0
2,Mar,sales,A1,
3,Feb,sales,A10,550.0
4,Jan,sales,A10,500.0


In [37]:
dfWide.pivot_longer(
    index='salesMonth',
    names_to=('.value','company'),
    names_sep='_'
).head()

Unnamed: 0,salesMonth,company,sales,profit
0,Feb,A1,600.0,350.0
1,Jan,A1,400.0,200.0
2,Mar,A1,,
3,Feb,A10,550.0,400.0
4,Jan,A10,500.0,300.0
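
To show what `.value` saves, here is a plain Pandas sketch of the same reshaping on a smaller wide table. It takes three steps (melt, split the column names, pivot back), and the resulting column order may differ from the Pyjanitor output.

```python
import pandas as pd

wide = pd.DataFrame({
    "salesMonth": ["Feb", "Jan"],
    "sales_A1": [600.0, 400.0],
    "sales_A10": [550.0, 500.0],
    "profit_A1": [350.0, 200.0],
    "profit_A10": [400.0, 300.0],
})

# Step 1: one row per (month, original column name).
long = wide.melt(id_vars="salesMonth", var_name="name", value_name="value")

# Step 2: split names such as "sales_A1" into a measure and a company.
long[["type", "company"]] = long["name"].str.split("_", expand=True)
long = long.drop(columns="name")

# Step 3: spread the measures back into their own columns.
tidy = (
    long
    .pivot(index=["salesMonth", "company"], columns="type", values="value")
    .reset_index()
    .rename_axis(columns=None)
)
```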


In [38]:
# pivot longer with regex
df = pd.DataFrame({"id": [1], "new_sp_m5564": [2], "newrel_f65": [3]})
df.pivot_longer(
    index = 'id',
    names_to = ('diagnosis', 'gender', 'age'),
    names_pattern = r"new_?(.+)_(.)(\d+)",
)

Unnamed: 0,id,diagnosis,gender,age,value
0,1,sp,m,5564,2
1,1,rel,f,65,3


### Tidying

In [42]:
# split a delimited column into dummy (0/1) columns
dfMovie = pd.DataFrame({
    'name': ['titanic', 'howl', 'stranger things'],
    'genre': ['roman,action', 'anime,roman', 'action,horor']
})
dfMovie.expand_column(column_name='genre', sep=',')

Unnamed: 0,name,genre,action,anime,horor,roman
0,titanic,"roman,action",1,0,0,1
1,howl,"anime,roman",0,1,0,1
2,stranger things,"action,horor",1,0,1,0
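
Plain Pandas offers a close equivalent through `str.get_dummies`. The sketch below applies it to a two-row version of the movie table and joins the indicator columns back to the original frame.

```python
import pandas as pd

movies = pd.DataFrame({
    "name": ["titanic", "howl"],
    "genre": ["roman,action", "anime,roman"],
})

# One 0/1 column per distinct genre found in the delimited strings.
dummies = movies["genre"].str.get_dummies(sep=",")

# Keep the original columns next to the indicator columns.
expanded = movies.join(dummies)
```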


In [44]:
# cross join: pair every row of the dataframe with every value in others
dfMovie.expand_grid(df_key='dfMovie', others={'rating': [1,2]})

Unnamed: 0_level_0,dfMovie,dfMovie,rating
Unnamed: 0_level_1,name,genre,0
0,titanic,"roman,action",1
1,titanic,"roman,action",2
2,howl,"anime,roman",1
3,howl,"anime,roman",2
4,stranger things,"action,horor",1
5,stranger things,"action,horor",2


### Time series

In [54]:
np.random.seed(7)
dfTime = pd.DataFrame({
    'date': pd.date_range('2020-01-01', '2020-01-10', freq='2d'),
    'value': np.random.random(size=5)
}).set_index('date')
dfTime

Unnamed: 0_level_0,value
date,Unnamed: 1_level_1
2020-01-01,0.076308
2020-01-03,0.779919
2020-01-05,0.438409
2020-01-07,0.723465
2020-01-09,0.97799


In [51]:
dfTime.fill_missing_timestamps(frequency='d')

Unnamed: 0,value
2020-01-01,0.076308
2020-01-02,
2020-01-03,0.779919
2020-01-04,
2020-01-05,0.438409
2020-01-06,
2020-01-07,0.723465
2020-01-08,
2020-01-09,0.97799


In [59]:
dfTime.flag_jumps(scale='absolute', direction='any', threshold=0.5)

Unnamed: 0_level_0,value,value_jump_flag
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2020-01-01,0.076308,0
2020-01-03,0.779919,1
2020-01-05,0.438409,0
2020-01-07,0.723465,0
2020-01-09,0.97799,0
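
Both time series helpers have simple plain Pandas counterparts for this case. In the sketch below the values are copied from the `dfTime` output above; `asfreq` inserts the missing days, and the jump flag is rebuilt as an absolute difference compared with the threshold, which is our reading of `scale='absolute'` and `direction='any'`.

```python
import pandas as pd

# Values copied from the dfTime output, indexed every two days.
ts = pd.DataFrame(
    {"value": [0.076308, 0.779919, 0.438409, 0.723465, 0.977990]},
    index=pd.date_range("2020-01-01", periods=5, freq="2D", name="date"),
)

# Insert the missing days; the new rows hold NaN.
daily = ts.asfreq("D")

# Flag rows whose absolute change from the previous row exceeds 0.5.
jump_flag = (ts["value"].diff().abs() > 0.5).astype(int)
flagged = ts.assign(value_jump_flag=jump_flag)
```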


# 3. DuckDB
[DuckDB] allows writing SQL queries against Pandas dataframes and provides a relational interface with some useful methods. It improves code readability in tasks such as non-equi joins, sorting and selecting.

[DuckDB]: https://duckdb.org/docs/

In [2]:
import numpy as np
import pandas as pd
import duckdb
import janitor
from sspipe import p, px

In [3]:
dfMembership = pd.DataFrame({
    'customer_id': ['M777', 'F123'],
    'join_date': pd.to_datetime(['2018-10-01', '2020-04-01'])
})

dfTransaction = pd.DataFrame({
    'customer_id': ['M777', 'M777', 'F123', 'F123', 'F123'],
    'purchase_date': pd.to_datetime(['2018-09-12', '2018-12-07', '2020-03-16', '2020-05-16', '2020-08-01']),
    'value': [105, 112, 17, 20, 29],
})

In [87]:
query = '''
SELECT m.*, t.purchase_date, t.value
FROM dfMembership m
LEFT JOIN dfTransaction t
ON m.customer_id = t.customer_id AND m.join_date < t.purchase_date
'''

(
    duckdb.query(query)
    .project('*, 0.1*value AS tax')
    .filter('tax > 1')
    .order('join_date DESC, value ASC')
    .df()
)

Unnamed: 0,customer_id,join_date,purchase_date,value,tax
0,F123,2020-04-01,2020-05-16,20,2.0
1,F123,2020-04-01,2020-08-01,29,2.9
2,M777,2018-10-01,2018-12-07,112,11.2
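
As a cross-check, the same pipeline can be written in plain Pandas. The sketch uses an inner merge followed by a filter instead of a `LEFT JOIN`; the two agree here because the later `tax > 1` filter would drop any unmatched member anyway. The lowercase variable names are ours.

```python
import pandas as pd

membership = pd.DataFrame({
    "customer_id": ["M777", "F123"],
    "join_date": pd.to_datetime(["2018-10-01", "2020-04-01"]),
})
transaction = pd.DataFrame({
    "customer_id": ["M777", "M777", "F123", "F123", "F123"],
    "purchase_date": pd.to_datetime(
        ["2018-09-12", "2018-12-07", "2020-03-16", "2020-05-16", "2020-08-01"]
    ),
    "value": [105, 112, 17, 20, 29],
})

result = (
    membership
    .merge(transaction, on="customer_id", how="inner")  # equality part of ON
    .query("join_date < purchase_date")                 # inequality part of ON
    .assign(tax=lambda d: 0.1 * d["value"])             # project
    .query("tax > 1")                                   # filter
    .sort_values(["join_date", "value"], ascending=[False, True])  # order
    .reset_index(drop=True)
)
```

The SQL version states the join condition in one place, while the Pandas version spreads it over a merge and a filter, which is the readability gain this section is about.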


# 4. Pandas profiling
[Pandas-profiling] generates an HTML [report] from a Pandas dataframe, which makes EDA easier and quicker. This library is wonderful because it gets everything done in a single line of code, but it still has some drawbacks.

Sometimes, especially for large datasets with hundreds or thousands of features, generating the HTML report is very time-consuming, and the output file can be heavy enough to crash the browser. So in this section, we discuss three ways to optimize the performance of Pandas-profiling:
- *Minimal mode*. Pandas-profiling has a built-in configuration [file] for this mode. However, this mode lacks many useful features.
- *Sampling*. For large datasets it is fine to examine only a sample. However, this still cannot deal with a large number of features.
- *Customized configuration*. In this method we disable expensive computations such as correlations and interactions. A configuration file is prepared instead of changing settings through the Python interface.

[Pandas-profiling]: https://pandas-profiling.ydata.ai/docs/master/pages/getting_started/overview.html
[report]: https://pandas-profiling.ydata.ai/docs/master/pages/reference/api/_autosummary/pandas_profiling.profile_report.ProfileReport.html
[file]: https://github.com/ydataai/pandas-profiling/blob/master/src/pandas_profiling/config_minimal.yaml
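
For the sampling option, a small guard can keep the call from failing on frames that are already small and makes the sample reproducible. The helper `sample_for_profiling` and its default values are hypothetical, written only for this sketch; it uses plain Pandas and does not touch Pandas-profiling itself.

```python
import pandas as pd

def sample_for_profiling(df, max_rows=100, seed=0):
    """Hypothetical helper: return at most max_rows rows, reproducibly."""
    if len(df) <= max_rows:
        # Small frames are passed through unchanged.
        return df
    # A fixed seed gives the same sample, and thus the same report, each run.
    return df.sample(n=max_rows, random_state=seed)

big = pd.DataFrame({"x": range(1000)})
sample = sample_for_profiling(big)
```

The returned frame can then be profiled in place of the full one.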

In [1]:
import pandas as pd
import pandas_profiling

dfIris = pd.read_csv("data/iris.csv")

In [9]:
report = dfIris.profile_report(progress_bar=False)
report.to_file('output/iris_profile.html')

In [12]:
report = dfIris.sample(n=100).profile_report(progress_bar=False, minimal=True)

In [17]:
report = dfIris.profile_report(config_file='util/config_pandas_profiling.yaml')

In [19]:
report.description_set['correlations']['pearson']

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
sepal_length,1.0,-0.11757,0.871754,0.817941
sepal_width,-0.11757,1.0,-0.42844,-0.366126
petal_length,0.871754,-0.42844,1.0,0.962865
petal_width,0.817941,-0.366126,0.962865,1.0


---
*&#9829; By Quang Hung x Thuy Linh &#9829;*