# Data Wrangling

## 2.1 Contents
    2.2 Introduction
    2.3 Imports
    2.4 Objectives
    2.5 Load Historical Sales Data
    2.6 Data Exploration
        2.6.1 Pandas Profiling Report
        2.6.2 Handling Null Values
            2.6.2.1 Comments
            2.6.2.2 Trade-in Vehicles
            2.6.2.3 SalesPersonIDs
            2.6.2.4 BuyerBirthDate
            2.6.2.5 Customer Payment Features
            2.6.2.6 Dealership Profit Features
            2.6.2.7 Buyer Location Features
            2.6.2.8 VehicleSalePrice
            2.6.2.9 InventoryType
        2.6.3 Removing Unnecessary Features
            2.6.3.1 DealNumber - decided to keep as unique identifier
            2.6.3.2 Removal of 9 incomplete or redundant features
        2.6.4 Categorical Features
            2.6.4.1 VIN Duplicates
            2.6.4.2 VehicleMake
        2.6.5 Potential Target Categorical Features
            2.6.5.1 Lexus
            2.6.5.2 Toyota
    2.7 Saving File
    2.8 Summary

## 2.2 Introduction
*Hypothesis:*
How can the historical sales data from 2004 - 2017 be analysed and deployed into a machine learning model forecasting consumer demand and vehicle production?

*Criteria for Success:*
Success for this project would be the training and deployment of a machine learning model that can forecast which Lexus, Toyota, and non-Toyota models are necessary to have in the dealership inventory over the 12 to 24 months starting April 2017. This forecast will improve dealer order and inventory management, optimize plant production scheduling, and increase understanding of consumer demand in the market.

## 2.3 Imports

In [1]:
from pandas_profiling import ProfileReport
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import os

## 2.4 Objectives
- Do I have the data I need to tackle the desired question?
- Have I identified the required target value?
- Do I have potentially useful features?
- Do I have any fundamental issues with the data?

## 2.5 Load Historical Sales Data

In [2]:
#csv file in subdirectory 'raw'
sales_hist = pd.read_csv('../data/raw/HistoricalSalesData.csv')

In [3]:
sales_hist.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8208 entries, 0 to 8207
Data columns (total 34 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   DealNumber                  8208 non-null   int64  
 1   ContractDate                8208 non-null   object 
 2   DeliveryDate                8208 non-null   object 
 3   DealStatus                  8208 non-null   object 
 4   Comments                    22 non-null     object 
 5   InventoryType               8199 non-null   object 
 6   StockNumber                 8208 non-null   object 
 7   VIN                         8208 non-null   object 
 8   VehicleMake                 8208 non-null   object 
 9   VehicleModel                8208 non-null   object 
 10  VehicleModelYear            8208 non-null   int64  
 11  VehicleSalePrice            8207 non-null   float64
 12  TotalGrossProfit            8208 non-null   float64
 13  BackEndGrossProfit          8202 

There are 34 features, 23 of which are categorical and 11 of which are numerical, with 9 of the numerical features in float format.
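
The split between categorical and numerical features can be read from the dtypes rather than counted by hand. The following is a minimal sketch on a small made-up frame; `demo` and its values are illustrative only and are not rows from the sales data.

```python
import pandas as pd

# Hypothetical miniature frame standing in for sales_hist
demo = pd.DataFrame({
    "DealNumber": [10029, 10035],            # integer
    "VehicleMake": ["Lexus", "Lexus"],       # text (categorical)
    "VehicleSalePrice": [30500.0, 41250.5],  # float (made-up prices)
})

# Count columns per dtype
dtype_counts = demo.dtypes.astype(str).value_counts()

# Treat every non-numeric column as categorical
numerical_cols = demo.select_dtypes(include="number").columns.tolist()
categorical_cols = demo.select_dtypes(exclude="number").columns.tolist()

print(dtype_counts)
print(len(categorical_cols), "categorical,", len(numerical_cols), "numerical")
```

Applied to the real frame, the same two `select_dtypes` calls should reproduce the 23 and 11 counts quoted above, assuming every non-numeric column is meant to count as categorical.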

In [4]:
sales_hist.head(20)

Unnamed: 0,DealNumber,ContractDate,DeliveryDate,DealStatus,Comments,InventoryType,StockNumber,VIN,VehicleMake,VehicleModel,...,SalesPerson2ID,Trade1_StockNumber,Trade1_VIN,Trade1_Year,Trade1_Make,Trade1_Model,Trade2_VIN,Trade2_Year,Trade2_Make,Trade2_Model
0,10029,1/12/11 0:00,1/12/11 0:00,F,,U,K175A,2T2HK31U49C118454,Lexus,RX 350,...,,K175B,YV1CM59H331013308,2003.0,Volvo,XC90,,,,
1,10035,1/10/11 0:00,1/10/11 0:00,F,,U,K190A,JTHCE96S580017706,Lexus,GS 350,...,,K190B,1FTWW31P95EB23344,2005.0,Ford,F-350,,,,
2,10036,1/11/11 0:00,1/11/11 0:00,F,,N,K205,JTHDL5EF0B5003231,Lexus,LS 460,...,,K205A,JTHBL46F385052674,2008.0,Lexus,LS 460,,,,
3,10037,1/14/11 0:00,1/14/11 0:00,F,,N,K210,JTJBK1BA2B2013626,Lexus,RX 350,...,,K210A,1HGCD5666SA119678,1995.0,Honda,Accord,,,,
4,10057,1/14/11 0:00,1/14/11 0:00,F,,U,L1112,JTJHK31U082048420,Lexus,RX 350,...,,L1112A,1FMDU34X1VUC98892,1997.0,Ford,Explorer,,,,
5,10059,1/14/11 0:00,1/14/11 0:00,F,,U,L1090A,1NXBR32E55Z545986,Toyota,Corolla,...,,,,,,,,,,
6,10060,1/14/11 0:00,1/14/11 0:00,F,,N,K130,2T2BK1BA7BC087463,Lexus,RX 350,...,,K130A,2T2HK31U08C049602,2008.0,Lexus,RX 350,,,,
7,10067,1/17/11 0:00,1/17/11 0:00,F,,U,K205A,JTHBL46F385052674,Lexus,LS 460,...,,K205B,JTHCE96S570011550,2007.0,Lexus,GS 350,,,,
8,10068,1/17/11 0:00,1/17/11 0:00,F,,U,L1116,JTJBT20X780158790,Lexus,GX 470,...,,L1116A,WA1EY74L57D059909,2007.0,Audi,Q7,,,,
9,10072,1/18/11 0:00,1/18/11 0:00,F,,U,J635C,2T2HA31U36C108786,Lexus,RX 330,...,,J635D,3N1CB51D94L836617,2004.0,Nissan,Sentra,,,,


There are 34 features, so feature reduction will be necessary to identify the useful features and the target feature(s). Features without nulls have 8208 entries each. However, features such as Comments, SalesPerson2ID, Trade2_VIN, Trade2_Year, and Trade2_Model have significantly fewer entries. These features will be explored later during the data cleaning portion of this notebook.

Additionally, the dataset appears to have three different data types. Further exploration will be necessary to understand why ContractTerm, Trade1_VIN, and Trade2_VIN are float64 data types and not int64.
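
One common reason a whole-number feature is stored as float64 is missing values: a NumPy-backed integer column cannot hold NaN, so pandas upcasts the column to float. Whether that is the full explanation for the features named above is an assumption to verify. A small sketch with illustrative values:

```python
import numpy as np
import pandas as pd

# Illustrative values only
complete = pd.Series([36, 48, 60])
with_gap = pd.Series([36, np.nan, 60])

print(complete.dtype)  # integer dtype, no missing values
print(with_gap.dtype)  # float64, because NaN is a float value

# pandas' nullable integer dtype keeps whole numbers alongside missing values
nullable = with_gap.astype("Int64")
print(nullable.dtype)
```

The nullable `Int64` dtype is one option if the whole-number look should be preserved without filling the gaps first.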

## 2.6 Data Exploration

#### 2.6.1 Pandas Profiling Report

In [5]:
profile = ProfileReport(sales_hist, title="Pandas Profiling Report")
profile.to_notebook_iframe()

The DealNumber feature needs to be confirmed as unique for each entry. If so, it may be a good way to identify each vehicle observation. The formatting of the ContractDate feature will need to be cleaned up. If DealStatus has no entries other than the letter 'F', this column should be removed. Comments has many null entries, as mentioned before, and may be removed if the 22 recorded entries show no significance to the objective of this project.

InventoryType does have null values, as seen prior, but whether a vehicle is new (N) or used (U) is important. Therefore, another feature in the dataset, potentially VehicleModelYear, may assist in filling in the null values. StockNumber could be another way to identify each vehicle observation; however, from my experience in the industry, stock numbers are typically recycled once the vehicle they are assigned to is sold. If this is the case, this column will be removed.

The VIN feature has no null observations but will have duplicates, since it is a priority for a dealership to receive and resell vehicles previously sold or leased through that dealership to ensure high market share. The VINs for all entries in this column have been decoded to provide a breakout of each vehicle's detailed makeup. That decoded dataset will be merged with this dataset within this notebook.

Will need to decide if the VINs of the traded vehicles should be decoded as well. If so, it appears there are 3235 Trade1_VIN entries and 84 additional Trade2_VIN entries to decode.
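
The two checks raised above, whether DealNumber is unique and which VINs repeat, can be sketched as follows. The frame is hypothetical and the VIN values are shortened placeholders, not real VINs.

```python
import pandas as pd

# Hypothetical rows; VIN values are placeholders
demo = pd.DataFrame({
    "DealNumber": [10036, 10067, 10068],
    "VIN": ["VIN_A", "VIN_A", "VIN_B"],
})

# True only when no DealNumber repeats
deal_number_is_unique = demo["DealNumber"].is_unique

# keep=False marks every row that shares a VIN with another row
repeat_vins = demo[demo["VIN"].duplicated(keep=False)]

print(deal_number_is_unique)
print(repeat_vins)
```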

#### 2.6.2 Handling Null Values

In [6]:
#examine # of missing values by column and sort them high to low
missing = pd.concat([sales_hist.isnull().sum(), 100 * sales_hist.isnull().mean()], axis=1)
missing.columns = ['count','%']
missing.sort_values(by = 'count', ascending = False)

Unnamed: 0,count,%
Comments,8186,99.731969
Trade2_Model,8124,98.976608
Trade2_Make,8124,98.976608
Trade2_Year,8124,98.976608
Trade2_VIN,8124,98.976608
SalesPerson2ID,7760,94.54191
BuyerBirthDate,5943,72.404971
Trade1_StockNumber,5595,68.165205
Trade1_VIN,4973,60.587232
Trade1_Model,4971,60.562865


Since Comments has the highest number of null values, let's examine what is available in its 22 non-null entries and decide if this feature should be removed.

##### 2.6.2.1 Comments

In [7]:
#exploring the Comments feature entries that are non-null
sales_hist[sales_hist.Comments.notnull()]

Unnamed: 0,DealNumber,ContractDate,DeliveryDate,DealStatus,Comments,InventoryType,StockNumber,VIN,VehicleMake,VehicleModel,...,SalesPerson2ID,Trade1_StockNumber,Trade1_VIN,Trade1_Year,Trade1_Make,Trade1_Model,Trade2_VIN,Trade2_Year,Trade2_Make,Trade2_Model
4287,20113,10/20/15 0:00,10/20/15 0:00,F,<div id='DMSmatchingComment'>Excluded from mat...,U,L1568,1FBSS3BL1EDA99461,Ford,E-350 Super Duty,...,,,,,,,,,,
4406,20557,1/31/16 0:00,1/31/16 0:00,F,<div id='DMSmatchingComment'>Excluded from mat...,N,S257,JTJJM7FX4G5132520,Lexus,GX 460,...,,,,,,,,,,
4448,20672,1/31/16 0:00,1/31/16 0:00,F,<div id='DMSmatchingComment'>Excluded from mat...,N,S255,JTJBM7FX7G5132886,Lexus,GX 460,...,,,,,,,,,,
4451,20677,1/23/16 0:00,1/23/16 0:00,F,<div id='DMSmatchingComment'>Excluded from mat...,N,R871,2T2BK1BAXFC337073,Lexus,RX 350,...,,,5UXFA53503LV87734,2003.0,BMW,X5,,,,
4453,20679,1/25/16 0:00,1/25/16 0:00,F,<div id='DMSmatchingComment'>Excluded from mat...,N,S146,JTJBARBZ0G2051273,Lexus,NX 200t,...,,,,,,,,,,
4454,20681,1/29/16 0:00,1/29/16 0:00,F,<div id='DMSmatchingComment'>Excluded from mat...,N,S246,2T2BZMCAXGC013733,Lexus,RX 350,...,,,JF2GPBKC7EH297629,2014.0,Subaru,XV Crosstrek Hybrid,,,,
4457,20692,1/30/16 0:00,1/30/16 0:00,F,<div id='DMSmatchingComment'>Excluded from mat...,U,L1561,JTHBK1GG3E2144678,Lexus,ES 350,...,,,3GTU2WEC7FG116897,2015.0,GMC,Sierra 1500,,,,
4458,20695,1/28/16 0:00,1/28/16 0:00,F,<div id='DMSmatchingComment'>Excluded from mat...,U,L1514,2T2BK1BA3EC231160,Lexus,RX 350,...,,,JNKCV51E63M328394,2003.0,Infiniti,G35,,,,
4459,20712,1/30/16 0:00,1/30/16 0:00,F,<div id='DMSmatchingComment'>Excluded from mat...,N,S253,2T2BZMCA7GC012345,Lexus,RX 350,...,,,2T2HK31U99C101620,2009.0,Lexus,RX 350,,,,
4462,20724,1/30/16 0:00,1/30/16 0:00,F,<div id='DMSmatchingComment'>Excluded from mat...,N,S249,JTJBZMCA8G2003167,Lexus,RX 350,...,,,,,,,,,,


The 22 entries that are non-null for *Comments* appear to have no importance for this project and therefore this feature will be removed from the dataset in the next subsection.
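
Because the table display truncates the comment text, it can help to list the distinct non-null comments in full before dropping the feature. A minimal sketch, with made-up comment text standing in for the real entries:

```python
import pandas as pd

# Hypothetical comment text; the real entries are truncated in the display above
demo = pd.DataFrame({
    "Comments": [None, "<div id='x'>example note</div>", "<div id='x'>example note</div>"]
})

# Distinct non-null comments with their frequencies
comment_counts = demo["Comments"].dropna().value_counts()

# Temporarily lift the column-width limit so the full text is printed
with pd.option_context("display.max_colwidth", None):
    print(comment_counts)
```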

##### 2.6.2.2 Trade-in Vehicles

Next are the 'Trade...' features. These are most likely key to understanding the broader marketplace's preferences, since they describe the vehicles that were traded in. So, their null values will be converted to "Not Applicable" entries.

In [8]:
#locate which columns correlate to these Trade features
sales_hist.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8208 entries, 0 to 8207
Data columns (total 34 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   DealNumber                  8208 non-null   int64  
 1   ContractDate                8208 non-null   object 
 2   DeliveryDate                8208 non-null   object 
 3   DealStatus                  8208 non-null   object 
 4   Comments                    22 non-null     object 
 5   InventoryType               8199 non-null   object 
 6   StockNumber                 8208 non-null   object 
 7   VIN                         8208 non-null   object 
 8   VehicleMake                 8208 non-null   object 
 9   VehicleModel                8208 non-null   object 
 10  VehicleModelYear            8208 non-null   int64  
 11  VehicleSalePrice            8207 non-null   float64
 12  TotalGrossProfit            8208 non-null   float64
 13  BackEndGrossProfit          8202 

In [9]:
#convert null values for the Trade1_... and Trade2_... features to "Not Applicable"
trade_cols = ['Trade1_StockNumber','Trade1_VIN','Trade1_Year','Trade1_Make','Trade1_Model',
              'Trade2_VIN','Trade2_Year','Trade2_Make','Trade2_Model']
sales_hist.update(sales_hist[trade_cols].fillna("Not Applicable"))
#confirm feature updated
sales_hist.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8208 entries, 0 to 8207
Data columns (total 34 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   DealNumber                  8208 non-null   int64  
 1   ContractDate                8208 non-null   object 
 2   DeliveryDate                8208 non-null   object 
 3   DealStatus                  8208 non-null   object 
 4   Comments                    22 non-null     object 
 5   InventoryType               8199 non-null   object 
 6   StockNumber                 8208 non-null   object 
 7   VIN                         8208 non-null   object 
 8   VehicleMake                 8208 non-null   object 
 9   VehicleModel                8208 non-null   object 
 10  VehicleModelYear            8208 non-null   int64  
 11  VehicleSalePrice            8207 non-null   float64
 12  TotalGrossProfit            8208 non-null   float64
 13  BackEndGrossProfit          8202 
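
One side effect of this fill is worth noting: writing a text placeholder into a numeric column such as Trade1_Year turns that column into a mixed object column. The sketch below shows the effect on illustrative values and records the presence of a trade-in in a separate flag first. `HasTrade1` is a hypothetical helper name, not a column in the dataset.

```python
import numpy as np
import pandas as pd

# Illustrative values only
demo = pd.DataFrame({
    "Trade1_Make": ["Volvo", np.nan, "Lexus"],
    "Trade1_Year": [2003.0, np.nan, 2008.0],
})

# Hypothetical helper column: record whether a trade-in existed before filling
demo["HasTrade1"] = demo["Trade1_Make"].notna()

# Cast to object first so the numeric column can hold the text placeholder
trade_cols = ["Trade1_Make", "Trade1_Year"]
demo[trade_cols] = demo[trade_cols].astype(object).fillna("Not Applicable")

print(demo.dtypes)
print(demo)
```

Keeping a boolean flag means the trade-in information remains usable even if the mixed-type columns are later dropped or re-encoded.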

##### 2.6.2.3 SalesPersonIDs

*SalesPerson2ID* and *SalesPersonID* both refer to the sales team and hold no significance to the purpose of the project. Therefore, these two columns will be removed in the next subsection as well.

##### 2.6.2.4 BuyerBirthDate

Now, the *BuyerBirthDate* feature is missing over 70% of its entries. Therefore, though it could have given potential insight into the demographic of buyers, it will be removed in the next section since there is no way to accurately fill in the missing values.

##### 2.6.2.5 Customer Payment Features

APR, ContractTerm, and MonthlyPayment all refer to the customer's financing. A customer's financing type could prove important to understanding how vehicle pricing affects customers and whether customers have a finance preference. Out of these three features, ContractTerm provides the most relevant information. Additionally, this feature could be used to verify if a duplicate entry in the VIN column is a leased vehicle returning. Therefore, the APR and MonthlyPayment features will be removed in the next subsection and the ContractTerm feature will be examined more closely to see how to handle its null values.

ContractTerm:

In [10]:
#examine the ContractTerm feature
sales_hist.ContractTerm.value_counts()

1.0     2874
60.0    1453
36.0     986
48.0     758
72.0     130
24.0      89
39.0      47
66.0      38
42.0      26
75.0      19
63.0      17
51.0      16
27.0      12
84.0       9
54.0       8
77.0       8
33.0       6
30.0       3
45.0       3
69.0       2
Name: ContractTerm, dtype: int64

There are 1704 null values and 2874 unusual '1.0' entries for this feature. However, after speaking to the source of this dataset, I was informed that blanks and 1.0 entries were both how cash buyers were recorded. Therefore, all null values will be converted to 1.0 for consistency. Additionally, it was confirmed that all entries for this feature are recorded in months.

In [11]:
#replace all null values in ContractTerm with 1.0
sales_hist['ContractTerm'] = sales_hist['ContractTerm'].fillna(value = 1.0)
#confirm feature updated
sales_hist['ContractTerm'].value_counts()

1.0     4578
60.0    1453
36.0     986
48.0     758
72.0     130
24.0      89
39.0      47
66.0      38
42.0      26
75.0      19
63.0      17
51.0      16
27.0      12
84.0       9
54.0       8
77.0       8
33.0       6
30.0       3
45.0       3
69.0       2
Name: ContractTerm, dtype: int64
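
Since 1.0 is a code for cash deals rather than a real one-month term, a separate flag can keep that meaning explicit for later modeling. A minimal sketch on illustrative values; `IsCashDeal` is a hypothetical helper column, not part of the original dataset.

```python
import numpy as np
import pandas as pd

# Illustrative terms in months; NaN and 1.0 both denote a cash deal per the data source
demo = pd.DataFrame({"ContractTerm": [1.0, np.nan, 60.0, 36.0]})

demo["ContractTerm"] = demo["ContractTerm"].fillna(1.0)

# Hypothetical helper column marking cash deals
demo["IsCashDeal"] = demo["ContractTerm"].eq(1.0)

print(demo)
```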

##### 2.6.2.6 Dealership Profit Features

In [12]:
#examine the FrontEndGrossProfit feature - know there are 700 null values
sales_hist.FrontEndGrossProfit.value_counts()

 0.00       761
 2821.00     11
 685.00       9
 3080.00      7
-315.00       7
           ... 
 2291.27      1
 1847.81      1
 2800.38      1
 3161.11      1
 1708.31      1
Name: FrontEndGrossProfit, Length: 5883, dtype: int64

In [13]:
#examine the BackEndGrossProfit feature - know there are 6 null values
sales_hist.BackEndGrossProfit.value_counts()

 0.00       2513
-200.00      371
 200.00      320
 250.00      225
 150.00      221
            ... 
 6228.00       1
 2683.00       1
-203.40        1
-305.85        1
-997.36        1
Name: BackEndGrossProfit, Length: 3671, dtype: int64

In [14]:
#examine TotalGrossProfit for total number of '0.00' values, since the feature has no null values
sales_hist.TotalGrossProfit.value_counts()

0.00       695
500.00      53
1000.00     13
300.00       9
2541.00      6
          ... 
3229.60      1
4216.40      1
3006.18      1
3114.25      1
2502.00      1
Name: TotalGrossProfit, Length: 6972, dtype: int64

In [15]:
#confirm which rows for BackEndGrossProfit have a null value
sales_hist.loc[sales_hist['BackEndGrossProfit'].isnull()].head(10)

Unnamed: 0,DealNumber,ContractDate,DeliveryDate,DealStatus,Comments,InventoryType,StockNumber,VIN,VehicleMake,VehicleModel,...,SalesPerson2ID,Trade1_StockNumber,Trade1_VIN,Trade1_Year,Trade1_Make,Trade1_Model,Trade2_VIN,Trade2_Year,Trade2_Make,Trade2_Model
60,10220,2/28/11 0:00,2/28/11 0:00,F,,N,J567,JTHBL5EF3A5099541,Lexus,LS 460,...,,J567A,Not Applicable,2007,LEXU,GS350,Not Applicable,Not Applicable,Not Applicable,Not Applicable
380,11171,8/31/11 0:00,8/31/11 0:00,F,,N,K450,2T2BK1BA8BC118820,Lexus,RX 350,...,,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable
470,11472,10/31/11 0:00,10/31/11 0:00,F,,N,K491,JTHCF5C23B5052149,Lexus,IS 250,...,,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable
528,11714,12/10/11 0:00,12/10/11 0:00,F,,N,M186,JTJBM7FX3C5039700,Lexus,GX 460,...,,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable
1007,13638,9/24/12 0:00,9/24/12 0:00,F,,U,N278A,JTHBJ46G382228753,Lexus,ES 350,...,,N278B,Not Applicable,2008,NISS,ALTIMA,Not Applicable,Not Applicable,Not Applicable,Not Applicable
1009,13645,9/24/12 0:00,9/24/12 0:00,F,,U,L1279,1D7HU18R67U593078,DODG,RAM150,...,,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable


In [16]:
#examine all Profit information around the 6 null values within feature 'BackEndGrossProfit' for possible insight
sales_hist.iloc[[60,380,470,528,1007,1009],[5,8,12,13,14,16]]

Unnamed: 0,InventoryType,VehicleMake,TotalGrossProfit,BackEndGrossProfit,FrontEndGrossProfit,ContractTerm
60,N,Lexus,1396.0,,,1.0
380,N,Lexus,603.99,,,48.0
470,N,Lexus,2262.0,,,1.0
528,N,Lexus,3647.0,,,1.0
1007,U,Lexus,4539.5,,,72.0
1009,U,DODG,1889.21,,,48.0


These 6 vehicles are questionable, since it is unlikely a dealership had multiple vehicles in its inventory where no front-end or back-end profit was recorded.

Since the vehicles in rows 60, 470, and 528 are new (N) Lexus vehicles sold as cash deals (1.0), it is fair to assume the profit was made on the front end of the sale, since there is no financing. As a result, the FrontEndGrossProfit for these three vehicles will reflect the same value as their respective TotalGrossProfit, and their BackEndGrossProfit observations will be converted to '0.00' values.

The vehicle in row 380 is similarly a new Lexus, but it was financed. Also, the profit is low, less than $1000. Therefore, the TotalGrossProfit identified for this vehicle most likely occurred on the back end. As a result, the FrontEndGrossProfit for this vehicle will be converted to a '0.00' value and the BackEndGrossProfit will reflect the same value as its TotalGrossProfit.

The vehicles in rows 1007 and 1009 are used (U) and both financed. Although their VehicleMake is different, the high value of their TotalGrossProfit leads me to assume that the profit for these two units should be split in half between back end and front end.

In [17]:
#rows 60, 470, and 528 (new, cash): all profit assigned to FrontEndGrossProfit (column 14)
sales_hist.iloc[60,14]= 1396.00
sales_hist.iloc[470,14]= 2262.00
sales_hist.iloc[528,14]= 3647.00
sales_hist.iloc[[60,470,528],13]= 0.00
#row 380 (new, financed): all profit assigned to BackEndGrossProfit (column 13)
sales_hist.iloc[380,13]= 603.99
sales_hist.iloc[380,14]= 0.00
#rows 1007 and 1009 (used, financed): split TotalGrossProfit evenly between back end and front end
sales_hist.iloc[1007,[13,14]]= 4539.50/2
sales_hist.iloc[1009,[13,14]]= 1889.21/2
#confirm change to entries
sales_hist.iloc[[60,380,470,528,1007,1009],[12,13,14]]

Unnamed: 0,TotalGrossProfit,BackEndGrossProfit,FrontEndGrossProfit
60,1396.0,0.0,1396.0
380,603.99,603.99,0.0
470,2262.0,0.0,2262.0
528,3647.0,0.0,3647.0
1007,4539.5,2269.75,2269.75
1009,1889.21,944.605,944.605
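
The same three assumptions can be written as rules driven by the data instead of hard-coded positions and amounts. This is a sketch only: it rebuilds the six rows from the table above and generalizes the reasoning (cash deal: all front end; financed new vehicle: all back end; financed used vehicle: even split), which is a simplification of the case-by-case judgment described in the text.

```python
import numpy as np
import pandas as pd

# The six rows discussed above, with the values shown in the earlier table
demo = pd.DataFrame(
    {
        "InventoryType": ["N", "N", "N", "N", "U", "U"],
        "TotalGrossProfit": [1396.0, 603.99, 2262.0, 3647.0, 4539.5, 1889.21],
        "BackEndGrossProfit": [np.nan] * 6,
        "FrontEndGrossProfit": [np.nan] * 6,
        "ContractTerm": [1.0, 48.0, 1.0, 1.0, 72.0, 48.0],
    },
    index=[60, 380, 470, 528, 1007, 1009],
)

missing = demo["BackEndGrossProfit"].isna() & demo["FrontEndGrossProfit"].isna()
cash = demo["ContractTerm"].eq(1.0)
new = demo["InventoryType"].eq("N")
total = demo["TotalGrossProfit"]

# Share of the total assigned to the front end under each assumption:
# cash deal -> 100%, financed new vehicle -> 0%, financed used vehicle -> 50%
front_share = np.select([cash, new], [1.0, 0.0], default=0.5)

demo.loc[missing, "FrontEndGrossProfit"] = (total * front_share)[missing]
demo.loc[missing, "BackEndGrossProfit"] = (total * (1 - front_share))[missing]
print(demo)
```

Using column names with `.loc` also avoids depending on the column order, which the positional `iloc` calls above rely on.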


In [18]:
#exploring FrontEndGrossProfit null values in comparison to the TotalGrossProfit observations
columns_of_interest = ["FrontEndGrossProfit","TotalGrossProfit"]
rows_of_interest = sales_hist["FrontEndGrossProfit"].isnull()
sales_hist.loc[rows_of_interest,columns_of_interest].sort_values(by='TotalGrossProfit')

Unnamed: 0,FrontEndGrossProfit,TotalGrossProfit
3473,,0.0
6754,,0.0
6755,,0.0
6757,,0.0
6758,,0.0
...,...,...
6055,,0.0
6058,,0.0
6063,,0.0
6043,,0.0


In [19]:
#TotalGrossProfit is 0.0 for every row where FrontEndGrossProfit is null.
#Therefore the remaining 694 null values seen above will be converted to 0.0
sales_hist['FrontEndGrossProfit'] = sales_hist['FrontEndGrossProfit'].fillna(value = 0.0)
#confirm feature updated from 762 observations of '0.0' to 1456 observations
sales_hist['FrontEndGrossProfit'].value_counts()

0.00       1456
2821.00      11
685.00        9
3080.00       7
185.00        7
           ... 
4289.00       1
2291.27       1
1847.81       1
2800.38       1
2500.27       1
Name: FrontEndGrossProfit, Length: 5886, dtype: int64
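
After these fills, a quick reconciliation can confirm that the front-end and back-end values add up to the total. This assumes TotalGrossProfit is meant to equal the sum of the two components, which the imputation above implies but the source data does not state. The sketch uses three of the rows from the confirmation table.

```python
import numpy as np
import pandas as pd

# Values taken from the confirmation table for rows 60, 380, and 1009 above
demo = pd.DataFrame(
    {
        "TotalGrossProfit": [1396.0, 603.99, 1889.21],
        "BackEndGrossProfit": [0.0, 603.99, 944.605],
        "FrontEndGrossProfit": [1396.0, 0.0, 944.605],
    },
    index=[60, 380, 1009],
)

component_sum = demo["FrontEndGrossProfit"] + demo["BackEndGrossProfit"]

# Compare with a tolerance because the values are floats
demo["ProfitReconciles"] = np.isclose(component_sum, demo["TotalGrossProfit"])

# Rows that do not reconcile, if any
print(demo[~demo["ProfitReconciles"]])
```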

##### 2.6.2.7 Buyer Location Features

In [20]:
#examine the 'StockNumber', 'BuyerHomeAddressPostalCode', 'BuyerHomeAddressState', and 'BuyerHomeAddressCity' features for null values
buyer_info = sales_hist.iloc[:,[6,20,21,22]]
buyer_info

Unnamed: 0,StockNumber,BuyerHomeAddressCity,BuyerHomeAddressState,BuyerHomeAddressPostalCode
0,K175A,Ocean View,DE,199704516
1,K190A,South Bend,IN,466149383
2,K205,Bremen,IN,465061850
3,K210,Granger,IN,465308309
4,L1112,Elkhart,IN,465146138
...,...,...,...,...
8203,CCJ462A,Osceola,IN,465618879
8204,CCK193A,South Bend,IN,466151140
8205,CCK127,Granger,IN,465307865
8206,CCL626,Granger,IN,465307078


In [21]:
#examining the 16 null values within the 'BuyerHomeAddressPostalCode' feature
buyer_info.loc[buyer_info['BuyerHomeAddressPostalCode'].isnull()].head(20)

Unnamed: 0,StockNumber,BuyerHomeAddressCity,BuyerHomeAddressState,BuyerHomeAddressPostalCode
1496,N759A,Mishawaka,IN,
3021,CCG286,,,
3723,CCH405A,,,
4085,CCI251AA,,,
5209,CCH129,,,
5408,CCH435,,,
6129,CCE345,,,
6298,CCE416,,,
6397,CCL483,,,
6559,CCF157,,,


In [22]:
buyer_info.loc[buyer_info['BuyerHomeAddressState'].isnull()].head(20)

Unnamed: 0,StockNumber,BuyerHomeAddressCity,BuyerHomeAddressState,BuyerHomeAddressPostalCode
530,K396A,Granger,,46430.0
2512,R426A,New Carlisle,,465529621.0
3021,CCG286,,,
3723,CCH405A,,,
4085,CCI251AA,,,
5209,CCH129,,,
5408,CCH435,,,
6129,CCE345,,,
6298,CCE416,,,
6397,CCL483,,,


In [23]:
buyer_info.loc[buyer_info['BuyerHomeAddressCity'].isnull()].head(20)

Unnamed: 0,StockNumber,BuyerHomeAddressCity,BuyerHomeAddressState,BuyerHomeAddressPostalCode
3021,CCG286,,,
3723,CCH405A,,,
4085,CCI251AA,,,
5209,CCH129,,,
5408,CCH435,,,
6129,CCE345,,,
6298,CCE416,,,
6397,CCL483,,,
6559,CCF157,,,
7242,CCF523,,,


There are 16 null BuyerHomeAddressPostalCode observations, 16 null BuyerHomeAddressState observations, and 13 null BuyerHomeAddressCity observations.

For BuyerHomeAddressState, rows 530 and 2512 also contain null values, although a BuyerHomeAddressCity is recorded for both.

For these null values, the dataset containing the phone numbers for each sale will be examined externally and the findings applied here for accuracy.

External research was completed using a reverse phone lookup (https://www.allareacodes.com/reverse-phone-lookup/), and the postal code was then located with the USPS lookup (https://tools.usps.com/zip-code-lookup.htm?bycitystate). 12 of the 16 had a home phone number available.

Row 1496 was a dealership phone number. Possibly this was a purchase by a salesperson or his or her spouse. How often this happens in the dataset will need to be examined.

External research using the original dataset containing customers' private information shows this occurs only once: row 1496 is the only observation where the address and phone number provided are the dealership's contact information.

- Row 3021: City: Dunlap, State: IN, Zip: 46517
- Row 3723: City: South Bend, State: IN, Zip: unknown
- Row 4085: City: Taylorville, State: IL, Zip: 62568
- Row 5209: City: Elkhart, State: IN, Zip: unknown
- Row 5408: City: South Bend, State: IN, Zip: 46617
- Row 6129: City: Lafayette, State: IN, Zip: unknown
- Row 6298: City: South Bend, State: IN, Zip: unknown
- Row 6397: City: South Bend, State: IN, Zip: unknown
- Row 6559: City: Plymouth, State: IN, Zip: 46563
- Row 6873: No home or business phone number available
- Row 7242: City: Hammond, State: IN, Zip: unknown
- Row 7246: City: South Bend, State: IN, Zip: unknown
- Row 8078: No home or business phone number available
- Row 8114: No home or business phone number available
- Row 8162: No home or business phone number available

Now, we need to update the dataset with the external data we have found and decide how to handle the remaining null values.
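
One way to apply lookup results like these is to keep them in a single mapping and write city, state, and postal code together, which reduces the chance of a row receiving a mismatched value. The sketch below uses three of the rows listed above on a hypothetical stand-in frame (`demo`); unknown postal codes are left untouched so a placeholder step can follow.

```python
import numpy as np
import pandas as pd

# Lookup results for three of the rows listed above: (city, state, postal code or None)
lookups = {
    3021: ("Dunlap", "IN", "46517"),
    4085: ("Taylorville", "IL", "62568"),
    5209: ("Elkhart", "IN", None),
}

# Hypothetical stand-in for the three address columns of sales_hist
cols = ["BuyerHomeAddressCity", "BuyerHomeAddressState", "BuyerHomeAddressPostalCode"]
demo = pd.DataFrame(np.nan, index=list(lookups), columns=cols, dtype=object)

for row, (city, state, postal) in lookups.items():
    demo.loc[row, cols[:2]] = [city, state]
    if postal is not None:
        demo.loc[row, cols[2]] = postal

print(demo)
```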

In [24]:
#replace the remaining null BuyerHomeAddressPostalCode values with the placeholder 11111
#row 1496 (StockNumber N759A) is included as well, since its contact details are the dealership's, not a customer's
sales_hist.iloc[[1496,3723,5209,6129,6298,6397,6873,7242,7246,8078,8114,8162],22]= 11111
#confirm entry changed
sales_hist.iloc[[1496,3723,5209,6129,6298,6397,6873,7242,7246,8078,8114,8162],22]

1496    11111
3723    11111
5209    11111
6129    11111
6298    11111
6397    11111
6873    11111
7242    11111
7246    11111
8078    11111
8114    11111
8162    11111
Name: BuyerHomeAddressPostalCode, dtype: object

In [25]:
#set BuyerHomeAddressCity and BuyerHomeAddressState to 'Unknown' where no external information was found
sales_hist.iloc[[1496,6873,8078,8114,8162],[20,21]]= 'Unknown'
sales_hist.iloc[[530,2512],21]='Unknown'
#confirm entry changed
sales_hist.iloc[[530,1496,2512,6873,8078,8114,8162],[20,21]]

Unnamed: 0,BuyerHomeAddressCity,BuyerHomeAddressState
530,Granger,Unknown
1496,Unknown,Unknown
2512,New Carlisle,Unknown
6873,Unknown,Unknown
8078,Unknown,Unknown
8114,Unknown,Unknown
8162,Unknown,Unknown


In [26]:
#update data with external information = City = 'South Bend'
sales_hist.iloc[[3723,5408,6298,6397,7246],20]= 'South Bend'
#update data with external information = City = 'Dunlap'
sales_hist.iloc[3021,20]= 'Dunlap'
#update data with external information = City = 'Taylorville'
sales_hist.iloc[4085,20]= 'Taylorville'
#update data with external information = City = 'Elkhart'
sales_hist.iloc[5209,20]= 'Elkhart'
#update data with external information = City = 'Lafayette'
sales_hist.iloc[6129,20]= 'Lafayette'
#update data with external information = City = 'Plymouth'
sales_hist.iloc[6559,20]= 'Plymouth'
#update data with external information = City = 'Hammond'
sales_hist.iloc[7242,20]= 'Hammond'
#confirm entries changed
sales_hist.iloc[[3021,3723,4085,5209,5408,6129,6298,6397,6559,7246,7242],20]

3021         Dunlap
3723     South Bend
4085    Taylorville
5209        Elkhart
5408     South Bend
6129      Lafayette
6298     South Bend
6397     South Bend
6559       Plymouth
7246     South Bend
7242        Hammond
Name: BuyerHomeAddressCity, dtype: object

In [27]:
#update data with external information = State = 'IN'
sales_hist.iloc[[530,2512,3021,3723,5209,5408,6129,6298,6397,6559,7242,7246],21]='IN'
#row 4085 (Taylorville) is in Illinois per the external research above
sales_hist.iloc[4085,21]='IL'
#confirm entries changed
sales_hist.iloc[[530,2512,3021,3723,4085,5209,5408,6129,6298,6397,6559,7242,7246],21]

530     IN
2512    IN
3021    IN
3723    IN
4085    IL
5209    IN
5408    IN
6129    IN
6298    IN
6397    IN
6559    IN
7242    IN
7246    IN
Name: BuyerHomeAddressState, dtype: object

In [28]:
buyer_info.loc[buyer_info['BuyerHomeAddressCity'].isnull()].head()

Unnamed: 0,StockNumber,BuyerHomeAddressCity,BuyerHomeAddressState,BuyerHomeAddressPostalCode
3021,CCG286,,,
3723,CCH405A,,,
4085,CCI251AA,,,
5209,CCH129,,,
5408,CCH435,,,
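
The check above still lists null cities because `buyer_info` was created in an earlier cell as a separate copy of four columns, so the later updates to `sales_hist` never reach it. Querying `sales_hist` itself would reflect the updates. A minimal sketch of this behavior, using made-up variable names (`source`, `snapshot`):

```python
import numpy as np
import pandas as pd

# Illustrative frame; the stock numbers match rows shown above
source = pd.DataFrame({
    "StockNumber": ["CCG286", "K175A"],
    "BuyerHomeAddressCity": [np.nan, "Ocean View"],
})

# Selecting columns with a list of positions returns a new object
snapshot = source.iloc[:, [0, 1]]

# Update the source afterwards
source.iloc[0, 1] = "Dunlap"

print(snapshot["BuyerHomeAddressCity"].isna().sum())  # the snapshot still has the null
print(source["BuyerHomeAddressCity"].isna().sum())    # the source does not
```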


In [29]:
#convert postal codes(with external information)
sales_hist.iloc[3021,22]='46517'
sales_hist.iloc[4085,22]='62568'
sales_hist.iloc[5408,22]='46617'
sales_hist.iloc[6559,22]='46563'
#confirm entries changed
sales_hist.iloc[[3021,4085,5408,6559],22]

3021    46517
4085    62568
5408    46617
6559    46563
Name: BuyerHomeAddressPostalCode, dtype: object
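
The postal code column now mixes several formats: nine-digit ZIP+4 numbers, floats, five-digit strings, and the integer placeholder. A sketch of one way to normalize them to five-digit strings follows. `to_zip5` is a hypothetical helper, and it assumes that any value longer than five digits is a ZIP+4 code stored without its hyphen and that leading zeros may have been lost when the value was stored as a number.

```python
import pandas as pd

def to_zip5(value):
    """Return a 5-digit ZIP string, or None when the value is missing."""
    if pd.isna(value):
        return None
    digits = str(value).strip()
    if digits.endswith(".0"):   # float artifacts such as 46430.0
        digits = digits[:-2]
    if len(digits) > 5:         # ZIP+4 stored without a hyphen
        digits = digits.zfill(9)[:5]
    return digits.zfill(5)

# Formats that appear in the outputs above
samples = [199704516, 46430.0, "46517", 11111, None]
print([to_zip5(v) for v in samples])
```

The helper could be applied to the whole feature with `Series.map`, for example `sales_hist['BuyerHomeAddressPostalCode'].map(to_zip5)`, if the five-digit form is what later steps need.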

##### 2.6.2.8 VehicleSalePrice

In [30]:
#locate the 1 null value in 'VehicleSalePrice'
sales_hist.loc[sales_hist['VehicleSalePrice'].isnull()].head()
#examine vehicle information about null value in row 6203
sales_hist.iloc[6203,[1,8,9,10,11,12,13,14]]

ContractDate           3/18/05 0:00
VehicleMake                   Lexus
VehicleModel                 RX 330
VehicleModelYear               2005
VehicleSalePrice                NaN
TotalGrossProfit                  0
BackEndGrossProfit                0
FrontEndGrossProfit               0
Name: 6203, dtype: object

In [31]:
#Find average for similar vehicles sold in the same year - 2005.
## first clean up ContractDate column
sales_hist['ContractDate'] = pd.to_datetime(sales_hist['ContractDate'], errors = 'coerce').dt.floor('d')

## filter for features of interest
feats_of_ints = sales_hist.loc[(sales_hist['ContractDate'].dt.year == 2005) & (sales_hist['VehicleMake']=='Lexus') & (sales_hist['VehicleModel']=='RX 330') & (sales_hist['VehicleModelYear']== 2005)]

## find avg VehicleSalePrice
feats_of_ints['VehicleSalePrice'].mean()

40013.201525423734

In [32]:
#fill the 1 missing sale price with the average found above
sales_hist['VehicleSalePrice'] = sales_hist['VehicleSalePrice'].fillna(value = 40013.20)
#confirm feature updated
sales_hist.loc[sales_hist['VehicleSalePrice'].isnull()]

Unnamed: 0,DealNumber,ContractDate,DeliveryDate,DealStatus,Comments,InventoryType,StockNumber,VIN,VehicleMake,VehicleModel,...,SalesPerson2ID,Trade1_StockNumber,Trade1_VIN,Trade1_Year,Trade1_Make,Trade1_Model,Trade2_VIN,Trade2_Year,Trade2_Make,Trade2_Model
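
The average above was copied into the fill step by hand. An alternative that avoids retyping the number is to compute a group mean and use it only where the price is missing. The sketch below groups by model and model year only, which is a simplification of the filter used above (it leaves out the contract year and make), and the prices are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative prices only; not taken from the sales data
demo = pd.DataFrame({
    "VehicleModel": ["RX 330", "RX 330", "RX 330", "ES 330"],
    "VehicleModelYear": [2005, 2005, 2005, 2005],
    "VehicleSalePrice": [39000.0, 41000.0, np.nan, 33000.0],
})

# Mean price of each model / model-year group, aligned to the original rows
group_mean = demo.groupby(["VehicleModel", "VehicleModelYear"])["VehicleSalePrice"].transform("mean")

# Fill only the missing prices, rounded to cents
demo["VehicleSalePrice"] = demo["VehicleSalePrice"].fillna(group_mean.round(2))
print(demo)
```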


##### 2.6.2.9 InventoryType

In [33]:
#examine the 'InventoryType' feature - know there are 9 null values
sales_hist.loc[sales_hist['InventoryType'].isnull()].head(10)

Unnamed: 0,DealNumber,ContractDate,DeliveryDate,DealStatus,Comments,InventoryType,StockNumber,VIN,VehicleMake,VehicleModel,...,SalesPerson2ID,Trade1_StockNumber,Trade1_VIN,Trade1_Year,Trade1_Make,Trade1_Model,Trade2_VIN,Trade2_Year,Trade2_Make,Trade2_Model
243,10695,2011-06-10,6/10/11 0:00,F,,,K114,JTJBM7FX4B5018854,Lexus,GX 460,...,,K114A,1NXBU40E59Z112648,2009,Toyota,Corolla,JN8AZ08W43W223831,2003,Nissan,Murano
1469,15281,2013-07-13,7/13/13 0:00,F,,,N309,JTJBC1BA0D2443588,Lexus,RX 450h,...,,N309A,JM3TB38A380128581,2008,Mazda,CX-9,Not Applicable,Not Applicable,Not Applicable,Not Applicable
1731,16178,2013-12-03,12/3/13 0:00,F,,,N640,2T2BK1BA4DC197633,Lexus,RX 350,...,,N640A,JTEBU17R970123695,2007,Toyota,4Runner,Not Applicable,Not Applicable,Not Applicable,Not Applicable
2466,18632,2015-01-30,1/30/15 0:00,F,,,R451,JTJBARBZ4F2005394,Lexus,NX 200t,...,,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable
2623,19201,2015-05-13,5/13/15 0:00,F,,,P539,JTHCE1D2XE5005196,Lexus,IS 350,...,,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable
2635,19231,2015-05-27,5/27/15 0:00,F,,,R628,JTJBM7FX1F5109831,Lexus,GX 460,...,2995.0,Not Applicable,2T2BK1BA9DC161601,2013,Lexus,RX 350,Not Applicable,Not Applicable,Not Applicable,Not Applicable
4321,20209,2015-11-05,11/5/15 0:00,F,,,R841,JTHCE1BL8FA009400,Lexus,GS 350,...,,Not Applicable,JTHCE96S270004071,2007,Lexus,GS 350,Not Applicable,Not Applicable,Not Applicable,Not Applicable
4967,22416,2016-11-16,11/16/16 0:00,F,,,S373,2T2BGMCA1GC004165,Lexus,RX 450h,...,6908.0,Not Applicable,JTJBC1BA9C2433849,2012,Lexus,RX 450h,Not Applicable,Not Applicable,Not Applicable,Not Applicable
5150,23056,2017-03-25,3/25/17 0:00,F,,,S553,2T2BZMCAXGC045386,Lexus,RX 350,...,,Not Applicable,1C4RJFBG6EC237986,2014,Jeep,Grand Cherokee,5TBDV58157S457710,2007,Toyota,Tundra


In [34]:
#rows 243, 1469, 1731, 2466, 2623, 2635, 4321, 4967, 5150 all have null values for the 'InventoryType' feature
#explore the 'DeliveryDate' (column 2) and 'VehicleModelYear' (column 10) of these rows to determine their 'InventoryType'
sales_hist.iloc[[243, 1469, 1731, 2466, 2623, 2635, 4321, 4967, 5150],[2,10]]

Unnamed: 0,DeliveryDate,VehicleModelYear
243,6/10/11 0:00,2011
1469,7/13/13 0:00,2013
1731,12/3/13 0:00,2013
2466,1/30/15 0:00,2015
2623,5/13/15 0:00,2014
2635,5/27/15 0:00,2015
4321,11/5/15 0:00,2015
4967,11/16/16 0:00,2016
5150,3/25/17 0:00,2016


Since the 'DeliveryDate' for each of these 9 vehicles is within a year of its 'VehicleModelYear', it is fair to assume these vehicles are all 'InventoryType' 'N' = new.
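The "within a year" reasoning can also be checked in code rather than by eye. The sketch below rebuilds the nine rows from the output above in a small stand-in frame (`sample` is a hypothetical name, not part of the notebook) and assumes the delivery strings follow a month/day/two-digit-year layout.

```python
import pandas as pd

# Stand-in for the nine rows shown above; `sample` is a hypothetical name.
sample = pd.DataFrame(
    {
        "DeliveryDate": ["6/10/11 0:00", "7/13/13 0:00", "12/3/13 0:00",
                         "1/30/15 0:00", "5/13/15 0:00", "5/27/15 0:00",
                         "11/5/15 0:00", "11/16/16 0:00", "3/25/17 0:00"],
        "VehicleModelYear": [2011, 2013, 2013, 2015, 2014, 2015, 2015, 2016, 2016],
    },
    index=[243, 1469, 1731, 2466, 2623, 2635, 4321, 4967, 5150],
)

# Parse the delivery strings with an explicit format (assumed: month/day/2-digit year).
delivered = pd.to_datetime(sample["DeliveryDate"], format="%m/%d/%y %H:%M")

# Gap in calendar years between the delivery date and the model year.
year_gap = delivered.dt.year - sample["VehicleModelYear"]

# True when every vehicle was delivered in its model year or the year after.
all_within_one_year = bool(year_gap.between(0, 1).all())
print(all_within_one_year)
```

A gap of 0 or 1 for every row supports treating these vehicles as new, although it remains an assumption and not a recorded value.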

In [35]:
#replace all null values in 'InventoryType' with 'N'
sales_hist['InventoryType'] = sales_hist['InventoryType'].fillna(value = 'N')
#confirm feature updated
sales_hist['InventoryType'].value_counts()

N    4628
U    3580
Name: InventoryType, dtype: int64

#### 2.6.3 Removing Unnecessary Features

Not every dataset needs a feature that serves as a unique identifier, but this one appears to have several candidates: *BuyerID*, *DealNumber*, *StockNumber*, and *Trade1_StockNumber*. *Trade1_StockNumber* is already known to be missing 68% of its entries and serves no other purpose, so it should be removed. Only one key feature is needed, so two of the remaining three can be removed as well. Let's examine which of these features has the most unique entries.

In [36]:
#exploring unique entries of BuyerID
sales_hist.BuyerID.nunique()

5656

In [37]:
#exploring unique entries of StockNumber
sales_hist.StockNumber.nunique()

8183

*BuyerID* has 5656 unique entries out of 8208, and *StockNumber* has 8183. *DealNumber* has 8207 (confirmed in the next subsection), meaning a single value appears twice. Therefore, the best feature to use as a unique key identifier is *DealNumber*. One of the two duplicate entries needs to be changed to a new unique value, and the *BuyerID* and *StockNumber* features can then be removed.
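The comparison above can be done for all candidate keys in one pass. This is a minimal sketch on a made-up frame (`toy` and its values are illustrative only), showing how many distinct values each candidate has and how many rows stand between it and being a true key.

```python
import pandas as pd

# Toy frame standing in for sales_hist; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "BuyerID": [1, 1, 2, 3],
        "DealNumber": [10, 11, 12, 12],
        "StockNumber": ["A1", "A2", "A3", "A4"],
    }
)

candidate_keys = ["BuyerID", "DealNumber", "StockNumber"]
n_unique = toy[candidate_keys].nunique()

# Distinct values per candidate, and how many rows short of unique each one is.
summary = pd.DataFrame(
    {
        "n_unique": n_unique,
        "n_rows_short": len(toy) - n_unique,
    }
)
print(summary)
```

A candidate with `n_rows_short` equal to 0 is already unique; a small non-zero value, as with *DealNumber* here, suggests a key that can be repaired by hand.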

##### 2.6.3.1 DealNumber

In [38]:
#exploring unique entries of DealNumber
sales_hist.DealNumber.nunique()

8207

In [39]:
#identifying the duplicate entries in DealNumber
sales_hist.DealNumber.value_counts()

20113      2
6141       1
7321       1
3386       1
2004128    1
          ..
19060      1
21111      1
8825       1
12923      1
16384      1
Name: DealNumber, Length: 8207, dtype: int64

In [40]:
#find loc of these duplicate entries
sales_hist[sales_hist['DealNumber']==20113]

Unnamed: 0,DealNumber,ContractDate,DeliveryDate,DealStatus,Comments,InventoryType,StockNumber,VIN,VehicleMake,VehicleModel,...,SalesPerson2ID,Trade1_StockNumber,Trade1_VIN,Trade1_Year,Trade1_Make,Trade1_Model,Trade2_VIN,Trade2_Year,Trade2_Make,Trade2_Model
4287,20113,2015-10-20,10/20/15 0:00,F,<div id='DMSmatchingComment'>Excluded from mat...,U,L1568,1FBSS3BL1EDA99461,Ford,E-350 Super Duty,...,,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable
4288,20113,2015-10-20,10/20/15 0:00,F,,U,L1567,1FBSS3BLXEDA85669,Ford,E-350 Super Duty,...,,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable


In [41]:
sales_hist.DealNumber.sort_values(ascending = False)

4272    2007906
4271    2007905
4270    2007902
4269    2007896
4268    2007890
         ...   
2660        193
1357        149
995         136
602         120
5248          3
Name: DealNumber, Length: 8208, dtype: int64

DealNumber 20113 is the duplicated entry, and the sorted output shows that 2007904 is skipped. Therefore, one of the duplicate 20113 entries will be changed to this unused number.
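Before assigning the new number, it may be worth confirming programmatically that it is unused and that the column becomes unique afterwards. The sketch below works on a small illustrative Series (`deals` is a hypothetical name; only 20113 and 2007904 come from the notebook).

```python
import pandas as pd

# Toy DealNumber column; values are illustrative apart from 20113 and 2007904.
deals = pd.Series([2007902, 2007905, 20113, 20113, 2007906], name="DealNumber")

# keep=False flags every row that shares its value with another row.
dup_rows = deals[deals.duplicated(keep=False)]

# Confirm the proposed replacement is not already in use before assigning it.
replacement = 2007904
replacement_is_free = not bool(deals.eq(replacement).any())

# Reassign only the first of the duplicated rows, addressed by index label.
if replacement_is_free:
    deals.loc[dup_rows.index[0]] = replacement

print(deals.is_unique)
```

Addressing the row by label with `.loc` avoids depending on the column's position, which positional `.iloc` indexing does.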

In [42]:
#replace one entry with 2007904
sales_hist.iloc[4287,0]=2007904
#confirm entry changed
sales_hist.iloc[4287,0]

2007904

In [43]:
#examining DealStatus
sales_hist.DealStatus.value_counts()

F    8208
Name: DealStatus, dtype: int64

The entries are all the same, so this feature can be removed as well. 
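Features with a single repeated value can also be found without inspecting each one by hand. A minimal sketch, assuming a toy frame (`toy` and `constant_cols` are hypothetical names):

```python
import pandas as pd

# Toy frame: one constant column, one column with two distinct values.
toy = pd.DataFrame(
    {
        "DealStatus": ["F", "F", "F"],
        "InventoryType": ["N", "U", "N"],
    }
)

# A column carries no information if it has at most one distinct value.
# dropna=False counts NaN as a value, so a mix of 'F' and NaN is not treated as constant.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) <= 1]
print(constant_cols)
```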

##### 2.6.3.2 Removal of 2.6.2 Features

In [44]:
sales_hist.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8208 entries, 0 to 8207
Data columns (total 34 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   DealNumber                  8208 non-null   int64         
 1   ContractDate                8208 non-null   datetime64[ns]
 2   DeliveryDate                8208 non-null   object        
 3   DealStatus                  8208 non-null   object        
 4   Comments                    22 non-null     object        
 5   InventoryType               8208 non-null   object        
 6   StockNumber                 8208 non-null   object        
 7   VIN                         8208 non-null   object        
 8   VehicleMake                 8208 non-null   object        
 9   VehicleModel                8208 non-null   object        
 10  VehicleModelYear            8208 non-null   int64         
 11  VehicleSalePrice            8208 non-null   float64     

In [45]:
#multiple features removed from dataset
cols_to_drop = ['DeliveryDate','DealStatus','Comments','StockNumber','APR','MonthlyPayment',
                'BuyerID','BuyerBirthDate','Trade1_StockNumber','SalesPersonID','SalesPerson2ID']
sales_hist.drop(columns=cols_to_drop, inplace=True)
#confirm features removed
sales_hist.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8208 entries, 0 to 8207
Data columns (total 23 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   DealNumber                  8208 non-null   int64         
 1   ContractDate                8208 non-null   datetime64[ns]
 2   InventoryType               8208 non-null   object        
 3   VIN                         8208 non-null   object        
 4   VehicleMake                 8208 non-null   object        
 5   VehicleModel                8208 non-null   object        
 6   VehicleModelYear            8208 non-null   int64         
 7   VehicleSalePrice            8208 non-null   float64       
 8   TotalGrossProfit            8208 non-null   float64       
 9   BackEndGrossProfit          8208 non-null   float64       
 10  FrontEndGrossProfit         8208 non-null   float64       
 11  ContractTerm                8208 non-null   float64     

Total number of features reduced from 34 to 23.

#### 2.6.4 Categorical Features

There are 15 features that are type object or "categorical" in nature. Let's explore these.

##### 2.6.4.1 VIN Duplicates

In [46]:
#examining VIN
sales_hist.VIN.value_counts()

2T2HA31U35C060138    4
JTHBJ46G272128612    3
2T2HK31U47C015306    3
JTHCE96S170010086    3
JTJBK1BA8A2412314    3
                    ..
2T2BK1BA2FC315097    1
JTHBJ46G472154791    1
2T2BK1BAXAC074835    1
JTHBW1GG1G2109628    1
JTHBA30G265147130    1
Name: VIN, Length: 7044, dtype: int64

In [47]:
#examining exact count of duplicates in VIN
sh = sales_hist.VIN.value_counts()
sh.value_counts()

1    5979
2     967
3      97
4       1
Name: VIN, dtype: int64

Therefore, we know there are no null values, but only 5979 of the 7044 distinct VINs appear exactly once. Of the rest, 967 appear twice, 97 appear three times, and one appears four times. The first output shows that the VIN occurring four times is 2T2HA31U35C060138. It is suspected that these repeated VINs belong to vehicles that are part of lease programs.
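As a quick arithmetic check, the frequency table above should account for all 8208 rows. The sketch below copies the four counts from the output into a Series (`vin_freq` is a hypothetical name) and reconciles them.

```python
import pandas as pd

# Copied from the output above: times a VIN appears -> number of VINs with that count.
vin_freq = pd.Series({1: 5979, 2: 967, 3: 97, 4: 1})

# Number of distinct VINs.
distinct_vins = int(vin_freq.sum())

# Rows covered: each VIN contributes as many rows as it has appearances.
total_rows = int((vin_freq.index.to_numpy() * vin_freq.to_numpy()).sum())

# Rows that are a second or later appearance of some VIN.
repeat_rows = total_rows - distinct_vins

print(distinct_vins, total_rows, repeat_rows)
```

The totals match the 7044 distinct VINs and 8208 rows reported earlier, so 1164 rows are repeat appearances of a VIN already seen.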

##### 2.6.4.2 VehicleMake

In [48]:
#examining VehicleMake
sales_hist.VehicleMake.value_counts()

Lexus            6825
Toyota            209
Honda              96
Mercedes-Benz      87
Ford               84
Chevrolet          78
BMW                76
Cadillac           72
Jeep               71
Nissan             63
Chrysler           60
Dodge              46
Buick              42
LINCOLN            42
GMC                35
Volkswagen         28
Audi               28
Pontiac            27
Infiniti           27
Acura              24
Volvo              24
Mazda              22
Hyundai            21
Jaguar             18
Oldsmobile         18
Subaru             15
Mercury            10
Saturn             10
Land Rover          8
Porsche             7
Kia                 6
Saab                5
MINI                5
HUMMER              3
Mitsubishi          3
Isuzu               2
Plymouth            2
DODG                1
LEXUS               1
TOYOTA              1
BUICK               1
Maserati            1
PORSCHE             1
VOLKSWAGEN          1
FERRARI             1
CADILLAC  

Lexus has by far the most observations, with Toyota a distant second; Honda, Mercedes-Benz, and Ford, surprisingly, compete for the next places. An initial assumption is that specific vehicle features lead to this occurrence. Categorizing the VehicleMake feature into broader categories such as luxury, economy, and potentially one more may prove insightful as well. For now, it is clear that Lexus, Toyota, and "other" are the groups to explore for this business problem.
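One way the Lexus / Toyota / other split could be sketched is shown below. The output above also lists upper-case variants such as LEXUS and TOYOTA as separate makes, so the sketch compares labels case-insensitively first. The names `makes` and `make_group` and the three-way mapping are assumptions for illustration, not steps taken in this notebook.

```python
import pandas as pd

# A few make labels taken from the output above, including upper-case variants.
makes = pd.Series(["Lexus", "LEXUS", "Toyota", "TOYOTA", "Honda", "BUICK", "Buick"])

# Compare case-insensitively so 'LEXUS' and 'Lexus' land in the same bucket.
lowered = makes.str.strip().str.lower()

# Hypothetical three-way grouping: Lexus, Toyota, and everything else as Other.
make_group = lowered.map({"lexus": "Lexus", "toyota": "Toyota"}).fillna("Other")
print(make_group.value_counts())
```

Lower-casing does not resolve abbreviations such as DODG; those would need an explicit mapping.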

#### 2.6.5 Examining Potential Target Categorical Features

##### 2.6.5.1 Lexus

In [49]:
#grouping all VehicleMakes by VehicleModel
veh_info = sales_hist.groupby('VehicleMake')['VehicleModel'].value_counts()

In [50]:
#examining Lexus by VehicleModel exclusively
veh_info.Lexus

VehicleModel
RX 350     2033
ES 350     1016
RX 330      551
IS 250      388
ES 330      323
GX 470      302
LS 460      250
GX 460      209
RX 400h     204
LS 430      196
GS 350      185
RX 450h     151
RX 300      140
GS 300      139
ES 300      100
NX 200t      95
CT 200h      85
LX 470       68
SC 430       65
ES 300h      51
LX 570       51
IS 350       41
HS 250h      31
IS 250C      29
IS 300       26
RC 350       20
LS 400       16
GS 430       10
NX 300h      10
IS 350C       9
LS 600h       6
IS-F          5
GS 450h       4
RC-F          4
GS 400        3
RC 300        3
SC 300        2
GS 460        1
GS F          1
RC F          1
SC 400        1
Name: VehicleModel, dtype: int64

Lexus' top three selling models are the RX 350, ES 350, and RX 330.

##### 2.6.5.2 Toyota

In [51]:
#examining Toyota by VehicleModel exclusively
veh_info.Toyota

VehicleModel
Avalon               36
Highlander           34
Camry                30
4Runner              20
RAV4                 11
Land Cruiser         10
Sienna               10
Camry Solara          8
Corolla               8
Venza                 7
Highlander Hybrid     6
Sequoia               6
Camry Hybrid          4
Tacoma                4
Matrix                3
Prius                 3
Tundra                3
Celica                2
Echo                  1
FJ Cruiser            1
Prius v               1
Yaris                 1
Name: VehicleModel, dtype: int64

Toyota's top selling cars are the Avalon, Highlander, and Camry.

## 2.7 Saving File

In [52]:
sales_hist.shape

(8208, 23)

The original dataset has been reduced from 34 features to 23.

In [53]:
#save sales_hist dataset as a transformed dataset named "Sales_Hist_Clean" in CSV format
sales_hist.to_csv('../data/interim/Sales_Hist_Clean.csv', index = False)
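CSV does not store dtypes, so a datetime column such as ContractDate comes back as text unless it is parsed on load. The sketch below demonstrates this round trip on a toy frame written to a temporary directory (`toy`, `plain`, `parsed`, and the file name are hypothetical).

```python
import os
import tempfile

import pandas as pd

# Toy frame with a datetime column, standing in for the cleaned dataset.
toy = pd.DataFrame(
    {
        "DealNumber": [1, 2],
        "ContractDate": pd.to_datetime(["2015-10-20", "2017-03-25"]),
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_clean.csv")
    toy.to_csv(path, index=False)

    # Default read: the date column is loaded as text.
    plain = pd.read_csv(path)

    # parse_dates restores the datetime dtype on load.
    parsed = pd.read_csv(path, parse_dates=["ContractDate"])

print(plain["ContractDate"].dtype, parsed["ContractDate"].dtype)
```

Passing `parse_dates` when reloading the file in a later notebook keeps date arithmetic working without an extra conversion step.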

## 2.8 Summary

The sales_hist dataset began with 34 columns and 8208 observations. After completing all the steps within the Data Wrangling portion of this project, the dataset has 23 columns and 8208 rows. This transformed dataset is now saved as 'Sales_Hist_Clean.csv'. No null values exist in the dataset, and the VehicleMake feature has helped identify Lexus and Toyota as potential target features for this project.