## **Exercise 02: preprocessing**

Prepare the project:

In [1]:
%pip install --upgrade pip
%pip install pandas

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


Import allowed modules:

In [2]:
import warnings  # For ignoring warnings

import pandas as pd

In [3]:
warnings.filterwarnings("ignore", )  # Ignore warnings

### download and read the `.csv` file and make `ID` the index column:

Collect the parameters for the `read_csv()` call in a dictionary:

In [4]:
read_csv_params: dict = {
    "file": "auto.csv",
    "file_path": "../../datasets/",
    "idx_col_name": "ID",
}

Load the data from the file into a *Pandas* dataframe:

In [5]:
df: pd.DataFrame = pd.read_csv(
    read_csv_params["file_path"] + read_csv_params["file"],
    index_col=read_csv_params["idx_col_name"],
)
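To see what `index_col` does without the real file, here is a minimal sketch that reads a tiny in-memory CSV. The two sample rows and the name `csv_text` are made up for illustration; only the column names are taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for auto.csv.
# The second row has an empty Fines field, which pandas reads as NaN.
csv_text = "ID,CarNumber,Fines\n0,Y163O8161RUS,3200.0\n1,E432XX77RUS,\n"

sample = pd.read_csv(io.StringIO(csv_text), index_col="ID")

print(sample.index.name)     # the ID column became the index
print(list(sample.columns))  # ID is no longer a regular column
print(sample["Fines"].isna().sum())  # number of missing Fines values
```

Because `ID` is used as the index, it does not appear among the regular columns in later `count()` or `isnull()` outputs.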

Check the dataframe:

In [6]:
df.head()

| ID | CarNumber | Make_n_model | Refund | Fines | History |
|---|---|---|---|---|---|
| 0 | Y163O8161RUS | Ford Focus | 2.0 | 3200.0 | NaN |
| 1 | E432XX77RUS | Toyota Camry | 1.0 | 6500.0 | NaN |
| 2 | 7184TT36RUS | Ford Focus | 1.0 | 2100.0 | NaN |
| 3 | X582HE161RUS | Ford Focus | 2.0 | 2000.0 | NaN |
| 4 | E34877152RUS | Ford Focus | 2.0 | 6100.0 | NaN |


### count the number of observations using the method `count()`:

In [7]:
df.count()

CarNumber       931
Make_n_model    931
Refund          914
Fines           869
History          82
dtype: int64
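Note that `count()` reports the number of non-missing values per column, not the number of rows, which is why the columns above show different totals. A small sketch on a made-up frame (the name `toy` and its values are illustrative only):

```python
import pandas as pd

nan = float("nan")

# Hypothetical three-row frame with one gap in each column.
toy = pd.DataFrame(
    {
        "Refund": [1.0, nan, 2.0],
        "Fines": [100.0, 200.0, nan],
    }
)

non_null = toy.count()  # non-missing values per column
total_rows = len(toy)   # number of rows, gaps included

print(non_null)
print(total_rows)
```

Comparing the two values gives a quick first estimate of how many cells are missing in each column.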

### drop the duplicates, taking into account only the following columns: `CarNumber`, `Make_n_model`, `Fines`:

* between two equal observations, keep the last one (`keep="last"`):

Create a list of the columns to consider when dropping duplicates:

In [8]:
attention_cols: list[str] = [
    "CarNumber",
    "Make_n_model",
    "Fines",
]

Drop duplicates:

In [9]:
df.drop_duplicates(
    subset=attention_cols,
    keep="last",
    inplace=True,
)

* check the number of observations again:

In [10]:
df.count()

CarNumber       725
Make_n_model    725
Refund          713
Fines           665
History          65
dtype: int64
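The effect of `subset` and `keep="last"` in the deduplication step is easier to see on a toy frame. This is a sketch with invented values; `toy` and `deduped` are illustrative names.

```python
import pandas as pd

# Rows 0 and 1 agree on CarNumber and Fines but differ in Refund.
toy = pd.DataFrame(
    {
        "CarNumber": ["A1", "A1", "B2"],
        "Fines": [100.0, 100.0, 300.0],
        "Refund": [1.0, 2.0, 1.0],
    }
)

# Only the subset columns decide what counts as a duplicate;
# keep="last" retains the later of the matching rows.
deduped = toy.drop_duplicates(subset=["CarNumber", "Fines"], keep="last")

print(deduped)
```

Row 0 is dropped and row 1 survives, so the `Refund` value that remains is the one from the later observation. The original index labels are preserved, which is why gaps can appear in the index afterwards.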

### work with missing values:

* check how many values are missing from each column:

In [11]:
df.isnull().sum()

CarNumber         0
Make_n_model      0
Refund           12
Fines            60
History         660
dtype: int64

* drop all the columns with over `500` missing values using the argument `thresh=`:

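Compute the `thresh` value for `dropna()`. `thresh` is the minimum number of non-missing values a column must have to be kept. A column with more than `500` missing values has fewer than `df.shape[0] - 500` non-missing values, so that difference is the threshold: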

In [12]:
dropna_thresh: int = df.shape[0] - 500

Check `dropna_thresh`:

In [13]:
dropna_thresh

225

Drop the columns with more than `500` missing values:

In [14]:
df.dropna(
    axis=1,
    thresh=dropna_thresh,
    inplace=True,
)

* check how many missing values are in each column:

In [15]:
df.isnull().sum()

CarNumber        0
Make_n_model     0
Refund          12
Fines           60
dtype: int64
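Since `thresh` counts non-missing values rather than missing ones, the conversion used above is worth checking on a small example. The sketch below uses a made-up four-row frame and a hypothetical limit `max_missing = 2` in place of `500`.

```python
import pandas as pd

nan = float("nan")

toy = pd.DataFrame(
    {
        "mostly_full": [1.0, 2.0, nan, 4.0],   # 1 missing
        "borderline": [1.0, 2.0, nan, nan],    # exactly 2 missing
        "mostly_empty": [nan, nan, nan, 4.0],  # 3 missing
    }
)

max_missing = 2  # drop columns with MORE than this many gaps

# thresh is the minimum count of non-missing values needed to keep a column.
kept = toy.dropna(axis=1, thresh=len(toy) - max_missing)

print(list(kept.columns))
```

The column with exactly `max_missing` gaps is kept; only columns that exceed the limit are removed.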

* replace each missing value in the `Refund` column with the previous value in that column, using the `method` argument:

In [16]:
df["Refund"] = df["Refund"].fillna(method="ffill", )

* check how many values are missing from each column:

In [17]:
df.isnull().sum()

CarNumber        0
Make_n_model     0
Refund           0
Fines           60
dtype: int64
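Recent pandas releases deprecate the `method` argument of `fillna()`, so depending on the installed version the forward fill may need to be written with `Series.ffill()` instead. A minimal sketch on invented values:

```python
import math

import pandas as pd

nan = float("nan")

refund = pd.Series([2.0, nan, nan, 1.0])

# Forward fill: each gap takes the most recent non-missing value above it.
filled = refund.ffill()
print(list(filled))

# A gap at the very start has nothing to copy from and stays missing.
leading_gap = pd.Series([nan, 1.0, nan]).ffill()
print(math.isnan(leading_gap.iloc[0]))
```

In the exercise data no gap remains after the fill, which suggests the first `Refund` value in the frame was not missing.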

* replace all the missing values in the `Fines` column with the mean value of this column:

Calculate `Fines` mean value:

In [18]:
fines_mean_val: float = df["Fines"].mean()

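Fill the `NaN` cells with the mean value:

In [19]:
# Assign the result back instead of calling fillna(inplace=True) on a
# column selection, which may not update `df` in newer pandas versions.
df["Fines"] = df["Fines"].fillna(value=fines_mean_val, )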

* check how many values are missing from each column:

In [20]:
df.isnull().sum()

CarNumber       0
Make_n_model    0
Refund          0
Fines           0
dtype: int64
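The mean used for imputation is computed over the non-missing values only, because `mean()` skips `NaN` by default. A short sketch with made-up numbers:

```python
import pandas as pd

fines = pd.Series([1000.0, float("nan"), 3000.0])

# NaN is skipped, so the mean is taken over the two known values.
mean_val = fines.mean()

filled = fines.fillna(mean_val)

print(mean_val)
print(list(filled))
```

Filling with the mean leaves the column mean unchanged but reduces its variance, which is worth keeping in mind when the column is analysed later.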

### split and parse the `Make` and `Model`:

Define a function that splits the `Make_n_model` field:

In [21]:
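def split_string(sample_str: str) -> pd.Series:
    """
    Split a "Make Model" string on the first space.

    Returns a two-element Series holding the make and the model;
    the model is None when the string contains no space.
    """

    str_parts: list[str] = sample_str.split(" ", 1)

    if len(str_parts) == 2:
        return pd.Series(str_parts)

    # No space found: only the make is present.
    return pd.Series([str_parts[0], None])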

* use the method `apply()` both for splitting and for extracting the values to the new columns `Make` and `Model`:

In [22]:
df[["Make", "Model"]] = df["Make_n_model"].apply(split_string, )

Check whether the split produced any `NaN` values:

In [23]:
df.isnull().sum()

CarNumber       0
Make_n_model    0
Refund          0
Fines           0
Make            0
Model           9
dtype: int64

Nine rows contain a make without a model, so their `Model` value is missing.
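For comparison, the same split can be done without a Python-level function, using the vectorised `str.split()` accessor. This is an alternative to the `apply()` approach the exercise asks for, shown on invented sample strings:

```python
import pandas as pd

names = pd.Series(["Ford Focus", "Toyota Land Cruiser", "Skoda"])

# n=1 splits on the first space only, so multi-word models stay together;
# expand=True returns the parts as DataFrame columns.
parts = names.str.split(" ", n=1, expand=True)
parts.columns = ["Make", "Model"]

print(parts)
```

A string with no space gets a missing `Model`, matching the behaviour of `split_string()`. The column assignment assumes at least one string contains a space; otherwise `expand=True` yields a single column.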

* drop the column `Make_n_model`:

In [24]:
df.drop(columns=["Make_n_model", ], inplace=True, )

Check the updated *Pandas* dataframe:

In [25]:
df.head()

| ID | CarNumber | Refund | Fines | Make | Model |
|---|---|---|---|---|---|
| 0 | Y163O8161RUS | 2.0 | 3200.0 | Ford | Focus |
| 1 | E432XX77RUS | 1.0 | 6500.0 | Toyota | Camry |
| 2 | 7184TT36RUS | 1.0 | 2100.0 | Ford | Focus |
| 3 | X582HE161RUS | 2.0 | 2000.0 | Ford | Focus |
| 5 | 92918M178RUS | 1.0 | 5700.0 | Ford | Focus |


* save the dataframe in the *JSON* file `auto.json`:

Collect the parameters for the `to_json()` call in a dictionary:

In [26]:
to_json_params: dict = {
    "file": "auto.json",
    "file_path": "../../datasets/",
    "orient": "records",
}

In [27]:
df.to_json(
    to_json_params["file_path"] + to_json_params["file"],
    orient=to_json_params["orient"],
)
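One consequence of `orient="records"` is that the index is not written, so the `ID` values do not appear in the JSON. The sketch below shows this on a made-up two-row frame and one possible way to keep the index, by calling `reset_index()` first. Whether the exercise expects `ID` in the output is an assumption to check against the task.

```python
import json

import pandas as pd

toy = pd.DataFrame(
    {"CarNumber": ["Y163O8161RUS", "E432XX77RUS"], "Fines": [3200.0, 6500.0]},
    index=pd.Index([0, 1], name="ID"),
)

# Without a path, to_json() returns the JSON text as a string.
records = json.loads(toy.to_json(orient="records"))
print(records[0])  # no "ID" key

# Turning the index into a regular column keeps it in each record.
with_id = json.loads(toy.reset_index().to_json(orient="records"))
print(with_id[0])
```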