
# Data cleaning transformations

This notebook provides methods to aid you in data cleaning.

**Table of contents**
    
<ul class="toc-item"><li><span><a href="#Data-cleaning-transformations" data-toc-modified-id="Data-cleaning-transformations-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data cleaning transformations</a></span></li><li><span><a href="#Quick-Dataset-Overview" data-toc-modified-id="Quick-Dataset-Overview-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Quick Dataset Overview</a></span></li><li><span><a href="#Missing-Value-Imputation" data-toc-modified-id="Missing-Value-Imputation-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Missing Value Imputation</a></span><ul class="toc-item"><li><span><a href="#Imputation-for-numerical-variables" data-toc-modified-id="Imputation-for-numerical-variables-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Imputation for numerical variables</a></span><ul class="toc-item"><li><span><a href="#Imputation-based-on-known-formula/relationship" data-toc-modified-id="Imputation-based-on-known-formula/relationship-3.1.1"><span class="toc-item-num">3.1.1&nbsp;&nbsp;</span>Imputation based on known formula/relationship</a></span></li><li><span><a href="#Imputation-with-mean/median" data-toc-modified-id="Imputation-with-mean/median-3.1.2"><span class="toc-item-num">3.1.2&nbsp;&nbsp;</span>Imputation with mean/median</a></span></li><li><span><a href="#Imputation-based-on-grouping" data-toc-modified-id="Imputation-based-on-grouping-3.1.3"><span class="toc-item-num">3.1.3&nbsp;&nbsp;</span>Imputation based on grouping</a></span></li><li><span><a href="#Imputation-by-regression" data-toc-modified-id="Imputation-by-regression-3.1.4"><span class="toc-item-num">3.1.4&nbsp;&nbsp;</span>Imputation by regression</a></span></li></ul></li><li><span><a href="#Imputation-for-categorical-variables" data-toc-modified-id="Imputation-for-categorical-variables-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Imputation for categorical variables</a></span><ul class="toc-item"><li><span><a href="#Imputation-with-the-mode" 
data-toc-modified-id="Imputation-with-the-mode-3.2.1"><span class="toc-item-num">3.2.1&nbsp;&nbsp;</span>Imputation with the mode</a></span></li><li><span><a href="#Imputation-with-a-specific-value" data-toc-modified-id="Imputation-with-a-specific-value-3.2.2"><span class="toc-item-num">3.2.2&nbsp;&nbsp;</span>Imputation with a specific value</a></span></li><li><span><a href="#Imputation-by-backfill-or-forward-fill" data-toc-modified-id="Imputation-by-backfill-or-forward-fill-3.2.3"><span class="toc-item-num">3.2.3&nbsp;&nbsp;</span>Imputation by backfill or forward fill</a></span></li><li><span><a href="#Imputation-by-grouping" data-toc-modified-id="Imputation-by-grouping-3.2.4"><span class="toc-item-num">3.2.4&nbsp;&nbsp;</span>Imputation by grouping</a></span></li></ul></li></ul></li><li><span><a href="#Standardizing-capitalization" data-toc-modified-id="Standardizing-capitalization-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Standardizing capitalization</a></span></li></ul>

This notebook primarily uses numpy and pandas, with scikit-learn's `LinearRegression` for regression-based imputation.

**We begin by importing key libraries**

In [1]:
# Import key libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

**Optional import of OW color scheme**

In [2]:
# Load in OW color scheme and plot style
plt.style.use('../../utilities/resources/ow_style.mplstyle')

# Add the 'utilities' folder to the module search path so its modules can be imported
import sys
sys.path.append('../../utilities')
from resources.ow_colormap import ow_colormap 

**Load in data from CSV**

We read in a CSV file containing data about used car auction sales.

In [3]:
dataset = pd.read_csv("sample_input/transformations_used_cars.csv", low_memory=False)

# Quick Dataset Overview

We use the following pandas methods to obtain basic information about the contents of the data:
* <b>.info()</b>: Column names, number of non-nulls, and column data type
* <b>.head()</b>: The first few rows of the dataset, showing example values for each column

In [4]:
dataset.head(2)

Unnamed: 0,IsBadBuy,PurchDate,Auction,VehYear,VehicleAge,Make,Model,Trim,SubModel,Color,Transmission,WheelTypeID,WheelType,VehOdo,Nationality,Size,MMRAcquisitionAuctionAveragePrice,VehBCost,WarrantyCost
0,0,6/17/2009,MANHEIM,2001,8.0,NISSAN,ALTIMA 2.4L I4 EFI,GXE,4D SEDAN GXE,WHITE,AUTO,2.0,Covers,80702.0,TOP LINE ASIAN,MEDIUM,2942.0,4160.0,1023
1,0,10/5/2010,OTHER,2008,Five,FORD,TAURUS,SEL,4D SEDAN SEL,SILVER,AUTO,1.0,Alloy,88245.0,AMERICAN,MEDIUM,9817.0,7850.0,1633


In [5]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 19 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   IsBadBuy                           5000 non-null   int64  
 1   PurchDate                          5000 non-null   object 
 2   Auction                            5000 non-null   object 
 3   VehYear                            5000 non-null   int64  
 4   VehicleAge                         4783 non-null   object 
 5   Make                               5000 non-null   object 
 6   Model                              5000 non-null   object 
 7   Trim                               4846 non-null   object 
 8   SubModel                           5000 non-null   object 
 9   Color                              5000 non-null   object 
 10  Transmission                       5000 non-null   object 
 11  WheelTypeID                        3603 non-null   float

<a id="dataset_overview"></a>
# Missing Value Imputation

First check how many missing values are present:

In [6]:
dataset.isnull().sum().sort_values(ascending=False)

WheelTypeID                          1397
WheelType                            1397
VehOdo                                330
VehicleAge                            217
Trim                                  154
MMRAcquisitionAuctionAveragePrice      21
Size                                    1
Nationality                             1
IsBadBuy                                0
VehBCost                                0
Color                                   0
Transmission                            0
PurchDate                               0
SubModel                                0
Model                                   0
Make                                    0
VehYear                                 0
Auction                                 0
WarrantyCost                            0
dtype: int64
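
Raw counts are easier to judge next to the size of the dataset. As a minimal sketch on a made-up toy frame (the values below are illustrative and not taken from the auction data), the share of missing values per column can be obtained by averaging the boolean mask:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the auction data (values are made up)
toy = pd.DataFrame({
    "VehOdo": [80702.0, np.nan, 83441.0, np.nan],
    "Trim": ["GXE", "SEL", None, "SE"],
    "Make": ["NISSAN", "FORD", "FORD", "DODGE"],
})

# isnull() yields booleans; the column mean of booleans is the fraction missing
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

A column with a very high share of missing values may be a better candidate for dropping or for a "missing" category than for imputation.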

There are several methods for imputation, and the appropriate one depends on the data type of the variable being imputed.

## Imputation for numerical variables

### Imputation based on known formula/relationship



**Impute missing vehicle age using a known formula (purchase year minus vehicle year), saving the result to a new column**

In [7]:
# Entries that cannot be parsed as numbers (such as 'Five') become NaN
dataset['VehicleAge'] = pd.to_numeric(dataset['VehicleAge'], errors='coerce')
# Vehicle age is the purchase year minus the vehicle year
dataset['PurchaseYear'] = pd.to_datetime(dataset['PurchDate']).dt.year
dataset['ImputedAge'] = dataset['VehicleAge'].fillna(dataset['PurchaseYear'] - dataset['VehYear'])
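
The same steps can be traced on a small toy frame. This is a sketch with made-up rows (the column names mirror the dataset, but the values are assumptions chosen so the result is easy to check by hand):

```python
import pandas as pd

# Made-up rows: one clean age, one text entry, one missing entry
toy = pd.DataFrame({
    "PurchDate": ["6/17/2009", "10/5/2010", "3/2/2010"],
    "VehYear": [2001, 2005, 2006],
    "VehicleAge": ["8.0", "Five", None],
})

# errors="coerce" turns unparseable text into NaN instead of raising
age = pd.to_numeric(toy["VehicleAge"], errors="coerce")

# Known relationship: age = purchase year - vehicle year
purchase_year = pd.to_datetime(toy["PurchDate"], format="%m/%d/%Y").dt.year
toy["ImputedAge"] = age.fillna(purchase_year - toy["VehYear"])
print(toy)
```

Note that coercion silently discards the text entry, so the imputed value comes from the formula and not from the word that was originally recorded.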

### Imputation with mean/median

In [8]:
# Treat a recorded price of 0 as missing rather than as a genuine value
dataset.loc[dataset['MMRAcquisitionAuctionAveragePrice'] == 0, 'MMRAcquisitionAuctionAveragePrice'] = np.nan
# Fill missing prices with the median and keep a flag marking the imputed rows
median_value = dataset['MMRAcquisitionAuctionAveragePrice'].median()
dataset['ImputedAveragePrice'] = dataset['MMRAcquisitionAuctionAveragePrice'].fillna(median_value)
dataset['ImputedAveragePrice_flag'] = dataset['MMRAcquisitionAuctionAveragePrice'].isnull().astype(int)
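
The choice between mean and median matters most when the variable is skewed. A minimal sketch with made-up prices, where a single expensive vehicle pulls the mean well above the typical value:

```python
import numpy as np
import pandas as pd

# Made-up prices with one missing value and one outlier
prices = pd.Series([4000.0, 5000.0, 6000.0, np.nan, 45000.0])

mean_fill = prices.fillna(prices.mean())      # mean of known values: 15000.0
median_fill = prices.fillna(prices.median())  # median of known values: 5500.0

print(mean_fill[3], median_fill[3])
```

Here the median gives a fill value close to most observed prices, while the mean is dominated by the outlier.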

### Imputation based on grouping

In [9]:
# Impute vehicle mileage based on average mileage by age
mileage_by_age = dataset.groupby('ImputedAge')['VehOdo'].transform('mean')

The `fillna` method also accepts a Series. Values are aligned on the index, so only the entries at the positions of missing values are used for imputation.

In [10]:
dataset['ImputedVehOdo'] = dataset['VehOdo'].fillna(mileage_by_age)
dataset['ImputedVehOdo_flag'] = dataset['VehOdo'].isnull().astype(int)
dataset[['VehOdo', 'ImputedVehOdo', 'ImputedVehOdo_flag']].head()

Unnamed: 0,VehOdo,ImputedVehOdo,ImputedVehOdo_flag
0,80702.0,80702.0,0
1,88245.0,88245.0,0
2,83441.0,83441.0,0
3,,71622.271889,1
4,76989.0,76989.0,0
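
The index alignment behind this step can be seen on a toy frame. The sketch below uses made-up mileages; `transform("mean")` returns one value per row (the mean of that row's group), and `fillna` picks from it only where the original is missing:

```python
import numpy as np
import pandas as pd

# Made-up ages and mileages; two mileages are missing
toy = pd.DataFrame({
    "ImputedAge": [3, 3, 3, 5, 5],
    "VehOdo": [60000.0, np.nan, 70000.0, 90000.0, np.nan],
})

# One group mean per row, aligned with the original index
group_mean = toy.groupby("ImputedAge")["VehOdo"].transform("mean")

toy["ImputedVehOdo"] = toy["VehOdo"].fillna(group_mean)
toy["ImputedVehOdo_flag"] = toy["VehOdo"].isnull().astype(int)
print(toy)
```

If every value in a group is missing, the group mean is itself NaN and those rows stay unfilled, so it is worth re-checking `isnull().sum()` afterwards.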


### Imputation by regression

In [11]:
# Fit only on rows where the target price and both predictors are known
linreg = LinearRegression()
model_dataset = dataset[['MMRAcquisitionAuctionAveragePrice', 'VehBCost', 'ImputedAge']].dropna()
X_train = model_dataset[['VehBCost', 'ImputedAge']]
y_train = model_dataset['MMRAcquisitionAuctionAveragePrice']

linreg_trained = linreg.fit(X=X_train, y=y_train)
print(f"In-sample R^2 is: {linreg_trained.score(X=X_train, y=y_train)}")
# Predict a price for every row; fillna later uses it only where the price is missing
predicted_price = pd.Series(linreg_trained.predict(X=dataset[['VehBCost', 'ImputedAge']]), index=dataset.index)

In-sample R^2 is: 0.7882271328156077


In [12]:
dataset['ImputedAveragePrice_regression'] = dataset['MMRAcquisitionAuctionAveragePrice'].fillna(predicted_price)

In [13]:
dataset[['MMRAcquisitionAuctionAveragePrice', 'ImputedAveragePrice', 
         'ImputedAveragePrice_regression', 'ImputedAveragePrice_flag']].tail()

Unnamed: 0,MMRAcquisitionAuctionAveragePrice,ImputedAveragePrice,ImputedAveragePrice_regression,ImputedAveragePrice_flag
4995,,6209.0,8787.220535,1
4996,8043.0,8043.0,8043.0,0
4997,,6209.0,7201.755069,1
4998,7080.0,7080.0,7080.0,0
4999,4246.0,4246.0,4246.0,0
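
To make the mechanics easy to verify by hand, the sketch below repeats the idea on made-up data with an exact linear relationship. It deliberately swaps `LinearRegression` for numpy's `np.linalg.lstsq` to keep the example small; the column names `cost` and `price` are hypothetical:

```python
import numpy as np
import pandas as pd

# Made-up data where price = 2 * cost + 1 for every known row
toy = pd.DataFrame({
    "cost": [1.0, 2.0, 3.0, 4.0],
    "price": [3.0, 5.0, 7.0, np.nan],
})

# Fit on rows with a known target, using an intercept column plus the predictor
known = toy.dropna(subset=["price"])
A_known = np.column_stack([np.ones(len(known)), known["cost"]])
coef, *_ = np.linalg.lstsq(A_known, known["price"].to_numpy(), rcond=None)

# Predict for all rows, then fill only the missing targets
A_all = np.column_stack([np.ones(len(toy)), toy["cost"]])
predicted = pd.Series(A_all @ coef, index=toy.index)
toy["price_imputed"] = toy["price"].fillna(predicted)
print(toy)
```

Regression imputation assumes the predictors themselves are complete, which is why the notebook uses the already imputed age column as an input.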


## Imputation for categorical variables

### Imputation with the mode

A common approach to dealing with missing categorical values is to replace with the mode:

In [14]:
# First get the mode for the variable you want to impute
wheeltype_mode = dataset['WheelType'].mode()[0]

print("Mode: ", wheeltype_mode)

dataset = dataset.assign(Imputed_WheelType = dataset['WheelType'].fillna(wheeltype_mode))

dataset[['WheelType','Imputed_WheelType']].head()

Mode:  Alloy


Unnamed: 0,WheelType,Imputed_WheelType
0,Covers,Covers
1,Alloy,Alloy
2,Alloy,Alloy
3,Covers,Covers
4,,Alloy
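
One detail worth knowing is that `mode()` returns a Series, because several values can tie for most frequent. A minimal sketch with made-up wheel types shows that ties come back sorted, so `[0]` picks the first one in sort order:

```python
import pandas as pd

# Made-up values where 'Alloy' and 'Covers' are equally frequent
wheel = pd.Series(["Covers", "Alloy", "Alloy", "Covers", None])

modes = wheel.mode()            # missing values are ignored by default
filled = wheel.fillna(modes[0])

print(modes.tolist())
print(filled.tolist())
```

When ties are possible, it may be safer to inspect `value_counts()` before deciding which value to impute.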


### Imputation with a specific value

Another approach is to replace the missing values with a specified value:

In [15]:
imputed_value = 'WheelType Missing'

dataset = dataset.assign(Imputed_WheelType = dataset['WheelType'].fillna(value=imputed_value))
dataset[['WheelType','Imputed_WheelType']].head()

Unnamed: 0,WheelType,Imputed_WheelType
0,Covers,Covers
1,Alloy,Alloy
2,Alloy,Alloy
3,Covers,Covers
4,,WheelType Missing


### Imputation by backfill or forward fill

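Alternatively, you can forward fill (copy the last valid value downwards) or backfill (copy the next valid value upwards). This is mainly appropriate when the row order is meaningful: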

In [16]:
# ffill() replaces the deprecated fillna(method='ffill'); use bfill() to backfill instead
dataset = dataset.assign(Imputed_WheelType = dataset['WheelType'].ffill())
dataset[['WheelType','Imputed_WheelType']].head()

Unnamed: 0,WheelType,Imputed_WheelType
0,Covers,Covers
1,Alloy,Alloy
2,Alloy,Alloy
3,Covers,Covers
4,,Covers
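
The difference between the two directions, and the effect of the `limit` parameter, can be seen on a short made-up Series:

```python
import pandas as pd

# Made-up values with a leading gap and a two-row gap in the middle
wheel = pd.Series([None, "Covers", None, None, "Alloy"])

forward = wheel.ffill()                # leading gap has nothing before it
backward = wheel.bfill()               # every gap has a later value here
forward_limited = wheel.ffill(limit=1) # fill at most one row per gap

print(forward.tolist())
print(backward.tolist())
print(forward_limited.tolist())
```

A forward fill cannot fill a gap at the very start of the data (and a backfill cannot fill one at the very end), so some missing values may remain.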


### Imputation by grouping
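
One possible approach, by analogy with the grouped imputation for numerical variables, is to fill each missing value with the mode of its group. The sketch below is an assumption about how this could look: it uses a made-up toy frame grouped by `Make`, and `first_mode` is a hypothetical helper, not part of pandas:

```python
import pandas as pd

# Made-up rows: each make has one missing wheel type
toy = pd.DataFrame({
    "Make": ["FORD", "FORD", "FORD", "NISSAN", "NISSAN", "NISSAN"],
    "WheelType": ["Alloy", "Alloy", None, "Covers", None, "Covers"],
})

def first_mode(values):
    """Return the most frequent non-missing value, or None if there is none."""
    modes = values.mode()
    return modes.iloc[0] if not modes.empty else None

# Most frequent wheel type per make, mapped back onto every row
mode_by_make = toy.groupby("Make")["WheelType"].agg(first_mode)
group_mode = toy["Make"].map(mode_by_make)

toy["Imputed_WheelType"] = toy["WheelType"].fillna(group_mode)
print(toy)
```

The grouping column should itself be complete and should plausibly relate to the variable being imputed; otherwise the overall mode is likely to be just as good.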

# Standardizing capitalization 

In pandas you can use the <code>str.upper</code> or <code>str.lower</code> methods to convert string values to upper or lower case:

In [17]:
dataset['WheelType_Upper'] = dataset['Imputed_WheelType'].str.upper()
dataset['WheelType_Lower'] = dataset['Imputed_WheelType'].str.lower()
dataset[['Imputed_WheelType', 'WheelType_Lower', 'WheelType_Upper']]

Unnamed: 0,Imputed_WheelType,WheelType_Lower,WheelType_Upper
0,Covers,covers,COVERS
1,Alloy,alloy,ALLOY
2,Alloy,alloy,ALLOY
3,Covers,covers,COVERS
4,Covers,covers,COVERS
...,...,...,...
4995,Alloy,alloy,ALLOY
4996,Alloy,alloy,ALLOY
4997,Alloy,alloy,ALLOY
4998,Covers,covers,COVERS
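
Standardizing capitalization matters because values that differ only in case are otherwise counted as separate categories. A minimal sketch with made-up spellings (the dataset shown above may not contain such variants):

```python
import pandas as pd

# Made-up values: the same two categories written in different cases
wheel = pd.Series(["Alloy", "ALLOY", "alloy", "Covers", "COVERS"])

n_before = wheel.nunique()             # each spelling counts separately
n_after = wheel.str.lower().nunique()  # spellings collapse after lowercasing
title_case = wheel.str.title()         # capitalize the first letter of each word

print(n_before, n_after)
print(title_case.tolist())
```

`str.title` is a third option alongside `str.upper` and `str.lower` when a capitalized display form is preferred.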


[Table of contents](#Data-cleaning-transformations)