In [1]:
import pandas as pd
import gc

Read the fines.csv file that you saved in the previous exercise.

In [67]:
df = pd.read_csv('../data/fines.csv')
df['Year'].fillna('1989.0', inplace=True)
df.head(3)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['Year'].fillna('1989.0', inplace=True)


      CarNumber  Refund   Fines    Make  Model    Year
0  Y163O8161RUS     2.0  3200.0    Ford  Focus  1989.0
1   E432XX77RUS     1.0  6500.0  Toyota  Camry  1995.0
2   7184TT36RUS     1.0  2100.0    Ford  Focus  1984.0
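
The warning above recommends assigning the result back to the column instead of calling fillna with inplace=True on a selected column. A minimal sketch of that form on a small made-up frame (the values are illustrative, not taken from fines.csv). Filling with the number 1989.0 rather than the string '1989.0' is an assumption about the intent; a string fill value can turn a float column into object dtype.

```python
import numpy as np
import pandas as pd

# Illustrative data only; not taken from fines.csv.
demo = pd.DataFrame({'Year': [1995.0, np.nan, 1984.0]})

# Assign the filled Series back instead of using a chained inplace call.
# Assumption: a numeric fill value is wanted, so the column stays float64.
demo['Year'] = demo['Year'].fillna(1989.0)

print(demo['Year'].dtype)
```

The same assignment pattern works for other methods that return a new Series, such as replace() or astype().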


Iterations: in all the following subtasks, calculate fines / refund * year for
each row, create a new column with the calculated data, and measure the time
using the magic command %%timeit in the cell.

- loop: write a function that iterates through the dataframe using for i in
  range(0, len(df)), iloc, and append() to a list; assign the result of the
  function to a new column in the dataframe
- do it using iterrows()
- do it using apply() and a lambda function
- do it using Series objects from the dataframe
- do it as in the previous subtask but with the .values attribute

In [5]:
%%timeit
def calculate_value_1(df):
    result = []
    for i in range(0, len(df)):
        row = df.iloc[i]
        result.append(row['Fines'] / row['Refund'] * row['Year'])
    return result

df['NewColumn_1'] = calculate_value_1(df)

The slowest run took 4.08 times longer than the fastest. This could mean that an intermediate result is being cached.
118 ms ± 78.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [6]:
%%timeit
def calculate_value_2(df):
    result = []
    for _, row in df.iterrows():
        result.append(row['Fines'] / row['Refund'] * row['Year'])
    return result

df['NewColumn_2'] = calculate_value_2(df)

74.8 ms ± 2.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [7]:
%%timeit
df['NewColumn_3'] = df.apply(lambda row: row['Fines'] / row['Refund'] * row['Year'], axis=1)

15.8 ms ± 3.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [8]:
%%timeit
df['NewColumn_4'] = df['Fines'].div(df['Refund']).mul(df['Year'])

423 µs ± 43.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [10]:
%%timeit
df['NewColumn_5'] = df['Fines'].values / df['Refund'].values * df['Year'].values

224 µs ± 42.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
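
The five timed variants are meant to compute the same column, so it can be worth checking that they agree before comparing their speed. A small sketch of such a check, assuming numeric Fines, Refund, and Year columns; the rows are made up and the helper name formula is hypothetical:

```python
import numpy as np
import pandas as pd

# Illustrative rows only; the column names follow the notebook.
demo = pd.DataFrame({
    'Fines': [3200.0, 6500.0, 2100.0],
    'Refund': [2.0, 1.0, 1.0],
    'Year': [1989.0, 1995.0, 1984.0],
})

def formula(row):
    # Hypothetical helper: fines / refund * year for a single row.
    return row['Fines'] / row['Refund'] * row['Year']

# Row-by-row variants.
by_iloc = [formula(demo.iloc[i]) for i in range(len(demo))]
by_iterrows = [formula(row) for _, row in demo.iterrows()]
by_apply = demo.apply(formula, axis=1)

# Vectorized variants.
by_series = demo['Fines'].div(demo['Refund']).mul(demo['Year'])
by_values = demo['Fines'].values / demo['Refund'].values * demo['Year'].values

# Compare every variant against the NumPy result.
all_equal = all(
    np.allclose(np.asarray(candidate, dtype=float), by_values)
    for candidate in (by_iloc, by_iterrows, by_apply, by_series)
)
print(all_equal)
```

In the timings above, the two vectorized forms run in microseconds while the row-by-row forms take milliseconds, because the vectorized forms operate on whole columns at once instead of building a Series object for every row.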


Indexing: measure the time using the magic command %%timeit in the cell.

- get a row for a specific CarNumber, for example, 'O136HO197RUS'
- set the index in your dataframe with CarNumber
- again, get a row for the same CarNumber

In [46]:
# df.drop_duplicates(subset=['CarNumber'], keep='last', inplace=True)

In [51]:
%%timeit
df[df['CarNumber'] == '7184TT36RUS']

1.03 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [56]:
df.set_index('CarNumber', inplace=True)

In [57]:
%%timeit
df[df.index == '7184TT36RUS']

597 µs ± 199 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
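
Comparing df.index with a string still builds a boolean mask over every label, much like the column comparison does. Once CarNumber is the index, a label-based lookup with .loc is the more usual way to fetch a row. A minimal sketch on made-up rows; it was not timed against the cells above, so any speed difference on fines.csv is an assumption to verify with %%timeit:

```python
import pandas as pd

# Illustrative rows only; the column names follow the notebook.
demo = pd.DataFrame({
    'CarNumber': ['Y163O8161RUS', 'E432XX77RUS', '7184TT36RUS'],
    'Fines': [3200.0, 6500.0, 2100.0],
}).set_index('CarNumber')

# Passing the label in a list keeps the result a DataFrame,
# matching the shape returned by the boolean-mask version.
row = demo.loc[['7184TT36RUS']]
print(row)
```

If CarNumber values repeat in the index, .loc returns every matching row, just as the mask does.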


Downcasting:

- run df.info(memory_usage='deep'), pay attention to the Dtype and the memory
  usage
- make a copy() of your initial dataframe into another dataframe, optimized

In [68]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 930 entries, 0 to 929
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   CarNumber  930 non-null    object 
 1   Refund     930 non-null    float64
 2   Fines      930 non-null    float64
 3   Make       930 non-null    object 
 4   Model      919 non-null    object 
 5   Year       930 non-null    object 
dtypes: float64(2), object(4)
memory usage: 218.5 KB


In [78]:
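# Work on a copy so the original dataframe stays unchanged.
df_optimized = df.copy()

# Ask pandas for the smallest integer dtype that can hold each column's values.
df_optimized[['Refund', 'Year']] = df_optimized[['Refund', 'Year']].apply(pd.to_numeric, downcast='integer')

# Store the text columns as categories (integer codes plus a lookup table).
df_optimized[['CarNumber', 'Make', 'Model']] = df_optimized[['CarNumber', 'Make', 'Model']].astype('category')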

In [80]:
df_optimized.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 930 entries, 0 to 929
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   CarNumber  930 non-null    category
 1   Refund     930 non-null    float64 
 2   Fines      930 non-null    float64 
 3   Make       930 non-null    category
 4   Model      919 non-null    category
 5   Year       930 non-null    int16   
dtypes: category(3), float64(2), int16(1)
memory usage: 74.1 KB
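
df.info() reports only the total, so a per-column comparison can help show where the savings come from. A sketch using DataFrame.memory_usage(deep=True) on a made-up frame; the data and the names demo and optimized are illustrative and not taken from fines.csv:

```python
import pandas as pd

# Illustrative data only: 1000 rows with a few repeated values.
demo = pd.DataFrame({
    'Make': ['Ford', 'Toyota', 'Ford', 'Ford'] * 250,
    'Year': [1989.0, 1995.0, 1984.0, 1989.0] * 250,
})

optimized = demo.copy()
# Repeated strings are stored once as categories plus small integer codes.
optimized['Make'] = optimized['Make'].astype('category')
# Whole-number floats can be stored in a narrower integer dtype.
optimized['Year'] = pd.to_numeric(optimized['Year'], downcast='integer')

# Bytes used per column (and by the index) before and after.
before = demo.memory_usage(deep=True)
after = optimized.memory_usage(deep=True)
print(pd.DataFrame({'before': before, 'after': after}))
```

Category dtype pays off when a column has few distinct values relative to its length; for a column of mostly unique strings the saving is smaller or absent.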
