<h2>Pandas optimizations</h2>

<h3>read fines.csv</h3>

In [40]:
import pandas as pd
import gc
fines = pd.read_csv('fines.csv')
fines

,CarNumber,Refund,Fines,Make,Model,Year
0,Y163O8161RUS,2,3200.000000,Ford,Focus,2017
1,E432XX77RUS,1,6500.000000,Toyota,Camry,1990
2,7184TT36RUS,1,2100.000000,Ford,Focus,1995
3,X582HE161RUS,2,2000.000000,Ford,Focus,1989
4,92918M178RUS,1,5700.000000,Ford,Focus,1982
...,...,...,...,...,...,...
925,704887163RUS,2,163151.049126,Ford,Focus,2014
926,705087163RUS,2,163151.049126,Ford,Focus,2014
927,705187163RUS,2,163151.049126,Ford,Focus,2014
928,Y970O8197RUS,2,163151.049126,Ford,Focus,2014


<h3>iterations</h3>

<h4>loop</h4>

In [41]:
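def loop(df):
    # Baseline: fetch each row by position with iloc and compute one value at a time.
    result = []
    for i in range(len(df)):
        result.append(float(df.iloc[i]['Fines']) / float(df.iloc[i]['Refund']) * float(df.iloc[i]['Year']))
    # Writes the result back as a new column (mutates the caller's DataFrame).
    df['Calculate_data'] = result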

In [42]:
%%timeit
loop(fines)

378 ms ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


<h4>iterrows()</h4>

In [43]:
def iterrows_time(df):
    # iterrows() yields (index, row) pairs; each row is a Series.
    result = []
    for _, row in df.iterrows():
        result.append(row['Fines'] / row['Refund'] * row['Year'])
    df['Calculate_data'] = result

In [44]:
%%timeit
iterrows_time(fines)

91.3 ms ± 6.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


<h4>apply() with a lambda function</h4>

In [45]:
def apply_time(df):
    df['Calculate_data'] = df.apply(lambda row: row['Fines'] / row['Refund'] * row['Year'], axis=1)

In [46]:
%%timeit
apply_time(fines)

14.7 ms ± 641 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


<h4>Series objects</h4>

In [47]:
def series_time(df):
    df['Calculate_data'] = df['Fines'] / df['Refund'] * df['Year']

In [48]:
%%timeit
series_time(fines)

426 µs ± 3.86 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


<h4>Series objects with the .values attribute</h4>

In [49]:
def series_values_time(df):
    df['Calculate_data'] = df.Fines.values / df.Refund.values * df.Year.values

In [50]:
%%timeit
series_values_time(fines)

162 µs ± 1.44 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
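
The five variants timed above differ only in how they walk the data, so they should all produce the same column. The sketch below checks that on a small stand-in frame. The frame `sample` and the `by_*` helper names are assumptions made for illustration (they are not part of the notebook), and `to_numpy()` is used in place of `.values`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for fines.csv: same column names, a handful of rows.
sample = pd.DataFrame({
    "Refund": [2, 1, 1, 2],
    "Fines": [3200.0, 6500.0, 2100.0, 2000.0],
    "Year": [2017, 1990, 1995, 1989],
})

def by_loop(df):
    # Positional access, one row per iteration.
    return [df.iloc[i]["Fines"] / df.iloc[i]["Refund"] * df.iloc[i]["Year"]
            for i in range(len(df))]

def by_iterrows(df):
    # iterrows() yields (index, row) pairs.
    return [row["Fines"] / row["Refund"] * row["Year"] for _, row in df.iterrows()]

def by_apply(df):
    # Row-wise apply with a lambda.
    return df.apply(lambda row: row["Fines"] / row["Refund"] * row["Year"], axis=1)

def by_series(df):
    # Vectorized arithmetic on whole columns.
    return df["Fines"] / df["Refund"] * df["Year"]

def by_values(df):
    # Same arithmetic on the underlying NumPy arrays.
    return df["Fines"].to_numpy() / df["Refund"].to_numpy() * df["Year"].to_numpy()

variants = (by_loop, by_iterrows, by_apply, by_series, by_values)
results = [np.asarray(f(sample), dtype="float64") for f in variants]

# True when every variant agrees with the plain loop.
all_equal = all(np.allclose(results[0], r) for r in results[1:])
```

Since the results agree, the choice between the variants comes down to speed, which is what the `%%timeit` cells measure.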


<h3>indexing: get a row for a specific CarNumber</h3>

In [51]:
fines

,CarNumber,Refund,Fines,Make,Model,Year,Calculate_data
0,Y163O8161RUS,2,3200.000000,Ford,Focus,2017,3.227200e+06
1,E432XX77RUS,1,6500.000000,Toyota,Camry,1990,1.293500e+07
2,7184TT36RUS,1,2100.000000,Ford,Focus,1995,4.189500e+06
3,X582HE161RUS,2,2000.000000,Ford,Focus,1989,1.989000e+06
4,92918M178RUS,1,5700.000000,Ford,Focus,1982,1.129740e+07
...,...,...,...,...,...,...,...
925,704887163RUS,2,163151.049126,Ford,Focus,2014,1.642931e+08
926,705087163RUS,2,163151.049126,Ford,Focus,2014,1.642931e+08
927,705187163RUS,2,163151.049126,Ford,Focus,2014,1.642931e+08
928,Y970O8197RUS,2,163151.049126,Ford,Focus,2014,1.642931e+08


In [53]:
%%timeit
fines[fines['CarNumber'] == 'O136HO197RUS']

420 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [54]:
fines.set_index('CarNumber', inplace=True)

In [57]:
%%timeit
fines.loc['O136HO197RUS']

245 µs ± 1.92 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
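
The two lookups above do not return the same kind of object, which matters when swapping one for the other. A minimal sketch on a three-row frame follows. The frame `small` is an assumption for illustration, with plate numbers borrowed from the rows displayed earlier.

```python
import pandas as pd

# Hypothetical miniature of the fines table.
small = pd.DataFrame({
    "CarNumber": ["Y163O8161RUS", "E432XX77RUS", "7184TT36RUS"],
    "Fines": [3200.0, 6500.0, 2100.0],
})

# Boolean mask: compares every row and always returns a DataFrame.
by_mask = small[small["CarNumber"] == "E432XX77RUS"]

# Label lookup: set the index once, then select rows by label.
indexed = small.set_index("CarNumber")
by_label = indexed.loc["E432XX77RUS"]        # a Series when the label is unique
by_label_df = indexed.loc[["E432XX77RUS"]]   # a list of labels keeps a DataFrame
```

Passing a list of labels to `.loc` is one way to keep the DataFrame shape that the boolean mask returns.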


<h3>downcasting</h3>

In [58]:
df = fines  # alias, not a copy: df and fines refer to the same DataFrame
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Index: 930 entries, Y163O8161RUS to Y971O8197RUS
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Refund          930 non-null    int64  
 1   Fines           930 non-null    float64
 2   Make            930 non-null    object 
 3   Model           919 non-null    object 
 4   Year            930 non-null    int64  
 5   Calculate_data  930 non-null    float64
dtypes: float64(2), int64(2), object(2)
memory usage: 223.7 KB


In [59]:
fl_col = fines.select_dtypes('float').columns
int_col = fines.select_dtypes('integer').columns

fines[fl_col] = fines[fl_col].apply(pd.to_numeric, downcast='float')
fines[int_col] = fines[int_col].apply(pd.to_numeric, downcast='integer')

In [60]:
fines.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Index: 930 entries, Y163O8161RUS to Y971O8197RUS
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Refund          930 non-null    int8   
 1   Fines           930 non-null    float32
 2   Make            930 non-null    object 
 3   Model           919 non-null    object 
 4   Year            930 non-null    int16  
 5   Calculate_data  930 non-null    float32
dtypes: float32(2), int16(1), int8(1), object(2)
memory usage: 204.6 KB
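
The downcasting step can be tried in isolation on a small frame. The sketch below is an illustration under assumptions: the frame `frame` and its values are made up, and columns are converted one at a time in a loop rather than through `apply`. Note that `float32` carries roughly seven significant decimal digits, so downcasting floats trades precision for memory.

```python
import pandas as pd

# Hypothetical frame with the same numeric columns as fines.
frame = pd.DataFrame({
    "Refund": [2, 1, 1, 2],
    "Fines": [3200.0, 6500.0, 2100.0, 2000.0],
    "Year": [2017, 1990, 1995, 1989],
})
before = frame.memory_usage(deep=True).sum()

# Pick the smallest dtype that can hold each column's values.
for col in frame.select_dtypes("float").columns:
    frame[col] = pd.to_numeric(frame[col], downcast="float")
for col in frame.select_dtypes("integer").columns:
    frame[col] = pd.to_numeric(frame[col], downcast="integer")

after = frame.memory_usage(deep=True).sum()
dtypes = frame.dtypes.astype(str).to_dict()
```

Comparing `before` with `after` shows the saving in bytes, and `dtypes` shows which width each column ended up with.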


<h3>categories</h3>

In [61]:
fines["Make"] = fines["Make"].astype("category")
fines["Model"] = fines["Model"].astype("category")
fines.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Index: 930 entries, Y163O8161RUS to Y971O8197RUS
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   Refund          930 non-null    int8    
 1   Fines           930 non-null    float32 
 2   Make            930 non-null    category
 3   Model           919 non-null    category
 4   Year            930 non-null    int16   
 5   Calculate_data  930 non-null    float32 
dtypes: category(2), float32(2), int16(1), int8(1)
memory usage: 95.9 KB
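
The drop in memory comes from how a categorical column is stored: each distinct string is kept once, and every row holds a small integer code pointing to it. A minimal sketch follows, assuming a made-up `makes` column with two distinct values repeated many times.

```python
import pandas as pd

# Hypothetical column: two distinct makes repeated over 1000 rows.
makes = pd.Series(["Ford", "Toyota", "Ford", "Ford"] * 250, name="Make", dtype="object")
as_object = makes.memory_usage(deep=True)

# Convert to category: unique values stored once, rows stored as integer codes.
makes_cat = makes.astype("category")
as_category = makes_cat.memory_usage(deep=True)

categories = list(makes_cat.cat.categories)
codes_dtype = str(makes_cat.cat.codes.dtype)
```

The saving depends on the ratio of distinct values to rows, so a column where nearly every value is unique (such as `CarNumber`) would likely gain little from this conversion.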


<h3>memory cleanup</h3>

In [63]:
# Removes only the name df; the data stays alive while fines still refers to it.
%reset_selective df
gc.collect()

24