# Data Wrangling

## 2.1 Contents
    2.2 Introduction
    2.3 Imports
    2.4 Objectives
    2.5 Load Historical Sales Data
    2.6 Data Exploration
        2.6.1 Handling Null Values
            2.6.1.1 Intro to Features: Comments, SalesPersonID/2, Trade1/2, BuyerBirthDate, APR, MonthlyPayment  
            2.6.1.2 ContractTerm
            2.6.1.3 Profit Features
            2.6.1.4 Buyer Features
            2.6.1.5 VehicleSalePrice
            2.6.1.6 InventoryType
        2.6.2 Removing Unnecessary Features
            2.6.2.1 DealNumber - decided to keep as unique identifier
            2.6.2.2 Removal of 8 incomplete features
        2.6.3 Categorical Features
            2.6.3.1 Time Series Formatting
            2.6.3.2 VIN Duplicates
            2.6.3.3 VehicleMake
        2.6.4 Potential Target Categorical Features
            2.6.4.1 Lexus
            2.6.4.2 Toyota
    2.7 Load Decoded VINs Data
    2.8 Data Exploration
        2.8.1 Exploring and Cleaning Entry Types
        2.8.2 Handling Null Values
    Saving File
    2.8 Summary

## 2.2 Introduction
*Hypothesis:*
How can the historical sales data from 2004 - 2017 be analysed and deployed into a machine learning model forecasting consumer demand and vehicle production?

*Criteria for Success:*
Success for this project would be the training and deployment of a machine learning model that can forecast which Lexus, Toyota, and non-Toyota models the dealership needs in inventory for the 12 to 24 months starting April 2017. This forecast will improve dealer order and inventory management, optimize plant production scheduling, and increase understanding of consumer demand in the market.

## 2.3 Imports

In [1]:
from pandas_profiling import ProfileReport
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import os

## 2.4 Objectives
- Do I have the data I need to tackle the desired question?
- Have I identified the required target value?
- Do I have potentially useful features?
- Do I have any fundamental issues with the data?

## 2.5 Load Historical Sales Data

In [None]:
#csv file in subdirectory 'raw'
sales_hist = pd.read_csv('../data/raw/HistoricalSalesData.csv')

In [None]:
sales_hist.info()

There are 34 features, so feature reduction will be necessary to identify the useful features and the target feature(s). The features without nulls each have 8208 entries. However, features such as Comments, SalesPerson2ID, Trade2_VIN, Trade2_Year, and Trade2_Model have far fewer entries. These features will be explored later, during the data cleaning portion of this notebook.

Additionally, the dataset appears to have three different data types. Further exploration will be necessary to understand why ContractTerm, Trade1_VIN, and Trade2_VIN are float64 data types and not int64.
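
One possible explanation, offered as an assumption rather than a confirmed finding, is that pandas stores missing values as NaN, which forces an otherwise whole-number column to float64. The toy frame below (made-up values, not the real sales data) illustrates the effect, along with the nullable `Int64` dtype as one alternative.

```python
import numpy as np
import pandas as pd

# Toy data standing in for ContractTerm; values are illustrative only
toy = pd.DataFrame({"ContractTerm": [36, 60, np.nan, 1]})

# A single missing value is enough to make pandas choose float64
float_dtype = str(toy["ContractTerm"].dtype)

# The nullable integer dtype keeps whole numbers and still allows missing values
as_nullable = toy["ContractTerm"].astype("Int64")
nullable_dtype = str(as_nullable.dtype)

print(float_dtype, nullable_dtype)
```

If this is the cause, the float64 dtype should disappear once the nulls in a column are handled and the column is cast back to an integer type.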

In [None]:
sales_hist.head(20)

I will need to confirm whether the *DealNumber* feature is unique for each entry. If so, it may be a good way to identify each vehicle observation. The formatting of the *ContractDate* feature will need to be cleaned up. If *DealStatus* has no entries other than the letter 'F', this column should be removed. As mentioned before, *Comments* has many null entries; the column may be removed if its few recorded entries show no significance to the objective of this project. *InventoryType* has null values, as seen earlier, but whether a vehicle is new (N) or used (U) is important. Another feature in the dataset, potentially *VehicleModelYear*, may therefore help fill in the null values. *StockNumber* could be another way to identify each vehicle observation. However, from my experience in the industry, stock numbers are typically recycled once the vehicle they are assigned to is sold. If this is the case, this column will be removed. The *VIN* feature has no null observations but will have duplicates, since a dealership prioritizes receiving and reselling vehicles it previously sold or leased in order to maintain high market share. The VINs for all entries in this column have been decoded to provide a breakdown of each vehicle's detailed makeup. That decoded dataset will be merged with this sales dataset within this notebook.
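
Two of the checks described above (a column with only one value, and recycled stock numbers) can be sketched on a toy frame. The column names follow this notebook, but every value below is made up for illustration.

```python
import pandas as pd

# Toy rows; values are invented, only the column names come from the notebook
toy = pd.DataFrame({
    "DealStatus": ["F", "F", "F", "F"],
    "StockNumber": ["A1", "A2", "A1", "A3"],
    "VehicleModelYear": [2005, 2006, 2009, 2010],
})

# Columns holding a single distinct value carry no information
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]

# keep=False flags every row whose StockNumber appears more than once
recycled = toy[toy["StockNumber"].duplicated(keep=False)]

print(constant_cols)
print(recycled)
```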

Will need to decide if the VINs of the traded vehicles should be decoded as well. If so, it appears there are 3235 *Trade1_VIN* entries and 84 additional *Trade2_VIN* entries to decode.

## 2.6 Data Exploration

#### 2.6.1 Handling Null Values

In [None]:
#examine # of missing values by column and sort them high to low
missing = pd.concat([sales_hist.isnull().sum(), 100 * sales_hist.isnull().mean()], axis=1)
missing.columns = ['count','%']
missing.sort_values(by = 'count', ascending = False)

Since *Comments* has the highest number of null values, let's examine its 22 non-null entries and decide if this feature should be removed.

##### 2.6.1.1 Introduction to features with null values

In [None]:
#exploring the Comments feature entries that are non-null
sales_hist[sales_hist.Comments.notnull()]

The 22 entries that are non-null for *Comments* appear to have no importance for this project and therefore this feature will be removed from the dataset in the next subsection.

Next are all the 'Trade...' features. These are most likely key to understanding the broader marketplace's preferences, since these vehicles were traded in, so they will be left alone for now.

*SalesPerson2ID* and *SalesPersonID* both refer to the sales team and hold no significance to the purpose of the project. Therefore, these two columns will be removed in the next subsection as well.

Now, *BuyerBirthDate* is missing over 70% of its entries. However, this feature could be converted to the customer's age, providing potential insight into the type of customer that buys within each vehicle segment. For now, it will be left alone.

Then come *APR*, *ContractTerm*, and *MonthlyPayment*, which all refer to the customer's financing. A customer's financing type could prove important to understanding how vehicle pricing affects customers and/or whether customers have a finance preference. Of these three features, *ContractTerm* provides the most relevant information. This feature could also be used to verify whether a duplicate entry in the VIN column is a leased vehicle returning.
Therefore, the *APR* and *MonthlyPayment* features will be removed in the next subsection, and the *ContractTerm* feature will be examined more closely to see how to handle its null values.

##### 2.6.1.2 ContractTerm Null Values

In [None]:
#examine the ContractTerm feature
sales_hist.ContractTerm.value_counts()

There are 1704 null values and 2874 unusual '1.0' entries for this feature. After speaking with the source of this dataset, I was informed that blanks and 1.0 entries were both how cash buyers were recorded. Therefore, all null values will be converted to 1.0 for consistency. It was also confirmed that all entries for this feature are recorded in months.
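
As a small sketch of this rule on toy values: once the blanks are filled with 1.0, a flag for cash deals can be derived from the same column. The `IsCashDeal` column is a hypothetical helper and is not part of the original dataset.

```python
import numpy as np
import pandas as pd

# Toy ContractTerm values in months; NaN stands for a blank entry
toy = pd.DataFrame({"ContractTerm": [36.0, np.nan, 1.0, 60.0, np.nan]})

# Blanks and 1.0 both mean "cash buyer", so record them the same way
toy["ContractTerm"] = toy["ContractTerm"].fillna(1.0)

# Hypothetical helper column: True where the deal was a cash deal
toy["IsCashDeal"] = toy["ContractTerm"].eq(1.0)

print(toy)
```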

In [None]:
#replace all null values in ContractTerm with 1.0
sales_hist['ContractTerm'] = sales_hist['ContractTerm'].fillna(value = 1.0)
#confirm feature updated
sales_hist['ContractTerm'].value_counts()

##### 2.6.1.3 Profit Features Null Values

In [None]:
#examine the FrontEndGrossProfit feature - know there are 700 null values
sales_hist.FrontEndGrossProfit.value_counts()

In [None]:
#examine the BackEndGrossProfit feature - know there are 6 null values
sales_hist.BackEndGrossProfit.value_counts()

In [None]:
#examine TotalGrossProfit for total number of '0.00' values, since the feature has no null values
sales_hist.TotalGrossProfit.value_counts()

In [None]:
#confirm which rows for BackEndGrossProfit have a null value
sales_hist.loc[sales_hist['BackEndGrossProfit'].isnull()].head(10)

In [None]:
#examine all Profit information around the 6 null values within feature 'BackEndGrossProfit' for possible insight
sales_hist.iloc[[60,380,470,528,1007,1009],[5,8,12,13,14,16]]

These 6 vehicles are questionable since it is unlikely a dealership had multiple vehicles in their inventory where no front or back end profit was made.

Since the vehicles in rows 60, 470, and 528 are new (N) Lexus vehicles sold as cash deals (1.0), it is fair to assume the profit was made on the front end of the sale, since there is no financing. As a result, the *FrontEndGrossProfit* for these three vehicles will take the value in their respective *TotalGrossProfit* column, and their *BackEndGrossProfit* observations will be converted to '0.00'.

The vehicle in row 380 is similarly a new Lexus, but it was financed. Its profit is also low, less than $1000. Most likely, then, the *TotalGrossProfit* for this vehicle was earned on the back end. As a result, the *FrontEndGrossProfit* for this vehicle will be converted to '0.00' and the *BackEndGrossProfit* set to the value in its *TotalGrossProfit* column.

The vehicles in rows 1007 and 1009 are used (U) and both financed. Although their *VehicleMake* differs, the high value of their *TotalGrossProfit* leads me to assume that the profit for these two units should be split in half between the back end and the front end.
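
The three assumptions above could also be written as rules on boolean masks instead of hard-coded row positions. This is only a sketch on a toy frame: the rules are simplified (the "low profit" condition for financed new vehicles is left out), and the rows are invented apart from a few profit figures borrowed from the cells in this section.

```python
import numpy as np
import pandas as pd

# Toy frame; the last row already has both profit values and must not change
toy = pd.DataFrame({
    "InventoryType": ["N", "N", "U", "N"],
    "ContractTerm": [1.0, 60.0, 48.0, 36.0],
    "TotalGrossProfit": [1396.00, 603.99, 4539.50, 2000.00],
    "BackEndGrossProfit": [np.nan, np.nan, np.nan, 500.00],
    "FrontEndGrossProfit": [np.nan, np.nan, np.nan, 1500.00],
})

missing = toy["BackEndGrossProfit"].isna()   # computed once, before any fill
is_new = toy["InventoryType"].eq("N")
is_cash = toy["ContractTerm"].eq(1.0)
total = toy["TotalGrossProfit"]

# New cash deals: all profit assumed on the front end
rule = missing & is_new & is_cash
toy.loc[rule, "FrontEndGrossProfit"] = total[rule]
toy.loc[rule, "BackEndGrossProfit"] = 0.0

# New financed deals: all profit assumed on the back end
rule = missing & is_new & ~is_cash
toy.loc[rule, "BackEndGrossProfit"] = total[rule]
toy.loc[rule, "FrontEndGrossProfit"] = 0.0

# Used financed deals: profit split evenly
rule = missing & ~is_new & ~is_cash
toy.loc[rule, "FrontEndGrossProfit"] = total[rule] / 2
toy.loc[rule, "BackEndGrossProfit"] = total[rule] / 2

print(toy)
```

Working from masks keeps the reasoning visible in the code and avoids retyping profit values by hand.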

In [None]:
#column 13 = BackEndGrossProfit, column 14 = FrontEndGrossProfit
#rows 60, 470, and 528: set FrontEndGrossProfit to the TotalGrossProfit value
sales_hist.iloc[60,14]= 1396.00
sales_hist.iloc[470,14]= 2262.00
sales_hist.iloc[528,14]= 3647.00
#row 380: set BackEndGrossProfit to the TotalGrossProfit value
sales_hist.iloc[380,13]= 603.99
#set BackEndGrossProfit for rows 60, 470, 528 and FrontEndGrossProfit for row 380 to '0.00'
sales_hist.iloc[[60,470,528],13]= 0.00
sales_hist.iloc[380,14]= 0.00
#rows 1007 and 1009: split TotalGrossProfit evenly between back end and front end
sales_hist.iloc[1007,[13,14]]= 4539.50/2
sales_hist.iloc[1009,[13,14]]= 1889.21/2
#confirm change to entries
sales_hist.iloc[[60,380,470,528,1007,1009],[12,13,14]]

In [None]:
#exploring FrontEndGrossProfit null values in comparison to the TotalGrossProfit observations
columns_of_interest = ["FrontEndGrossProfit","TotalGrossProfit"]
rows_of_interest = sales_hist["FrontEndGrossProfit"].isnull()
sales_hist.loc[rows_of_interest,columns_of_interest].sort_values(by='TotalGrossProfit')

In [None]:
#There appears to be no TotalGrossProfit values available for the null values in column FrontEndGrossProfit.
#Therefore the remaining 694 null values seen above will be converted to 0.0
sales_hist['FrontEndGrossProfit'] = sales_hist['FrontEndGrossProfit'].fillna(value = 0.0)
#confirm feature updated from 762 observations of '0.0' to 1456 observations
sales_hist['FrontEndGrossProfit'].value_counts()

##### 2.6.1.4 Buyer Location Features

In [None]:
#examine the 'StockNumber', 'BuyerHomeAddressPostalCode', 'BuyerHomeAddressState', 'BuyerHomeAddressCity' features null values
buyer_info = sales_hist.iloc[:,[6,20,21,22]]
buyer_info

In [None]:
#examining the 16 null values within the 'BuyerHomeAddressPostalCode' feature
buyer_info.loc[buyer_info['BuyerHomeAddressPostalCode'].isnull()].head(20)

In [None]:
buyer_info.loc[buyer_info['BuyerHomeAddressState'].isnull()].head(20)

In [None]:
buyer_info.loc[buyer_info['BuyerHomeAddressCity'].isnull()].head(20)

There are 16 null *BuyerHomeAddressPostalCode* observations, 16 null *BuyerHomeAddressState* observations, and 13 null *BuyerHomeAddressCity* observations.

For *BuyerHomeAddressState*, rows 530 and 2512 also contain null values, and for *BuyerHomeAddressCity* ...

For these null values, the dataset containing the phone numbers for each sale will be researched externally and the findings applied here for accuracy.

.... external research completed using a reverse phone lookup (https://www.allareacodes.com/reverse-phone-lookup/), after which the postal code was located (https://tools.usps.com/zip-code-lookup.htm?bycitystate). 12 of the 16 had a home phone number available.

StockNumber N759A  - dealership phone number. Possibly this was a buy for a salesperson or his or her spouse.
*Will need to examine how often this happens in the dataset*

...external research completed using original dataset containing customer private information and observation occurs only this one time where the address and number provided are the dealership's contact information. 

- StockNumber CCG286 - City: Dunlap, State: IN, Zip: 46517
- StockNumber CCH405A - City: South Bend, State: IN, Zip: unknown
- StockNumber CCI251AA - City: Taylorville, State: IL, Zip: 62568
- StockNumber CCH129 - City: Elkhart, State: IN, Zip: unknown
- StockNumber CCH435 - City: South Bend, State: IN, Zip: 46617
- StockNumber CCE345 - City: Lafayette, State: IN, Zip: unknown
- StockNumber CCE416 - City: South Bend, State: IN, Zip: unknown
- StockNumber CCL483 - City: South Bend, State: IN, Zip: unknown
- StockNumber CCF157 - City: Plymouth, State: IN, Zip: 46563
- StockNumber CCF304 - No home or business phone number available
- StockNumber CCF523 - City: Hammond, State: IN, Zip: unknown
- StockNumber CCL537 - City: South Bend, State: IN, Zip: unknown
- StockNumber CCJ537A - No home or business phone number available
- StockNumber CCJ369A - No home or business phone number available
- StockNumber CCJ654 - No home or business phone number available

Now, we need to update the dataset with the external data we have found and decide how to handle the remaining null values.
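
One way to apply lookup results like these is to key them by *StockNumber* rather than by row position, so the update does not depend on the current row order. The sketch below uses a toy frame; the two lookups come from the list above, while stock number `X1` and its address are made up. Because stock numbers can be reused, a real run would need to confirm each key matches only the intended row.

```python
import numpy as np
import pandas as pd

# Toy subset; 'X1' is a hypothetical row that already has an address
toy = pd.DataFrame({
    "StockNumber": ["CCG286", "CCI251AA", "CCF304", "X1"],
    "BuyerHomeAddressCity": [np.nan, np.nan, np.nan, "Elkhart"],
    "BuyerHomeAddressState": [np.nan, np.nan, np.nan, "IN"],
})

# External lookup results keyed by StockNumber: (city, state)
found = {
    "CCG286": ("Dunlap", "IN"),
    "CCI251AA": ("Taylorville", "IL"),
}

for stock, (city, state) in found.items():
    rows = toy["StockNumber"].eq(stock)
    toy.loc[rows, "BuyerHomeAddressCity"] = city
    toy.loc[rows, "BuyerHomeAddressState"] = state

# Anything still missing is marked 'unknown', as in the cells of this section
cols = ["BuyerHomeAddressCity", "BuyerHomeAddressState"]
toy[cols] = toy[cols].fillna("unknown")

print(toy)
```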

In [None]:
#replace all null values with '11111' value for the BuyerHomeAddressPostalCode
#StockNumber N759A will be converted as well, since its entry isn't customer info
sales_hist.iloc[[1496,3723,5209,6129,6298,6397,6873,7242,7246,8078,8114,8162],22]= 11111
#confirm entry changed
sales_hist.iloc[[1496,3723,5209,6129,6298,6397,6873,7242,7246,8078,8114,8162],22]

In [None]:
#replace all null values with 'unknown' value for the BuyerHomeAddressCity and BuyerHomeAddressState
sales_hist.iloc[[1496,6873,8078,8114,8162],[20,21]]= 'unknown'
sales_hist.iloc[[530,2512],21]='unknown'
#confirm entry changed
sales_hist.iloc[[530,1496,2512,6873,8078,8114,8162],[20,21]]

In [None]:
#update data with external information = City = 'South Bend'
sales_hist.iloc[[3723,5408,6298,6397,7246],20]= 'South Bend'
#update data with external information = City = 'Dunlap'
sales_hist.iloc[3021,20]= 'Dunlap'
#update data with external information = City = 'Taylorville'
sales_hist.iloc[4085,20]= 'Taylorville'
#update data with external information = City = 'Elkhart'
sales_hist.iloc[5209,20]= 'Elkhart'
#update data with external information = City = 'Lafayette'
sales_hist.iloc[6129,20]= 'Lafayette'
#update data with external information = City = 'Plymouth'
sales_hist.iloc[6559,20]= 'Plymouth'
#update data with external information = City = 'Hammond'
sales_hist.iloc[7242,20]= 'Hammond'
#confirm entries changed
sales_hist.iloc[[3021,3723,4085,5209,5408,6129,6298,6397,6559,7246,7242],20]

In [None]:
#update data with external information = State = 'IN' (row 4085, Taylorville, is in 'IL')
sales_hist.iloc[[530,2512,3021,3723,5209,6129,6298,6397,6559,7242,7246],21]='IN'
sales_hist.iloc[4085,21]='IL'
#confirm entries changed
sales_hist.iloc[[530,2512,3021,3723,4085,5209,6129,6298,6397,6559,7242,7246],21]

In [None]:
#convert postal codes(with external information)
sales_hist.iloc[3021,22]='46517'
sales_hist.iloc[4085,22]='62568'
sales_hist.iloc[5408,22]='46617'
sales_hist.iloc[6559,22]='46563'
#confirm entries changed
sales_hist.iloc[[3021,4085,5408,6559],22]

##### 2.6.1.5 VehicleSalePrice

In [None]:
#examine the 1 null value in 'VehicleSalePrice'
sales_hist.loc[sales_hist['VehicleSalePrice'].isnull()].head()
#examine vehicle information about null value in row 6203
sales_hist.iloc[6203,[1,8,9,10,11,12,13,14]]

In [None]:
#Find average for similar vehicles sold in the same year - 2005.
## first clean up ContractDate column
sales_hist['ContractDate'] = pd.to_datetime(sales_hist['ContractDate'], errors = 'coerce').dt.floor('d')

## filter for features of interest
feats_of_ints = sales_hist.loc[(sales_hist['ContractDate'].dt.year == 2005) & (sales_hist['VehicleMake']=='Lexus') & (sales_hist['VehicleModel']=='RX 330') & (sales_hist['VehicleModelYear']== 2005)]

## find avg VehicleSalePrice
feats_of_ints['VehicleSalePrice'].mean()

In [None]:
#convert that 1 vehicle without a saleprice to the average
sales_hist['VehicleSalePrice'] = sales_hist['VehicleSalePrice'].fillna(value = 40013.20)
#confirm feature updated
sales_hist.loc[sales_hist['VehicleSalePrice'].isnull()]
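
The same idea can be sketched without typing the average by hand: compute the mean of comparable sales and pass it straight to `fillna`. The frame below is a toy with invented prices and dates; only the column names and the "RX 330" label come from this notebook.

```python
import numpy as np
import pandas as pd

# Toy sales; prices and dates are invented for illustration
toy = pd.DataFrame({
    "ContractDate": ["2005-03-01 10:15", "2005-07-19 09:00",
                     "2005-11-02 16:30", "2006-01-05 12:00"],
    "VehicleModel": ["RX 330", "RX 330", "RX 330", "RX 330"],
    "VehicleSalePrice": [40000.0, 42000.0, np.nan, 30000.0],
})
toy["ContractDate"] = pd.to_datetime(toy["ContractDate"], errors="coerce").dt.floor("D")

# Comparable sales: same model, contract signed in 2005
similar = (toy["ContractDate"].dt.year == 2005) & toy["VehicleModel"].eq("RX 330")
avg_price = toy.loc[similar, "VehicleSalePrice"].mean()  # NaN is skipped by default

# Fill the missing price with the computed average
toy["VehicleSalePrice"] = toy["VehicleSalePrice"].fillna(avg_price)

print(avg_price)
```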

##### 2.6.1.6 InventoryType

In [None]:
#examine the 'InventoryType' feature - know there are 9 null values
sales_hist.loc[sales_hist['InventoryType'].isnull()].head(10)

In [None]:
#know rows 243, 1469, 1731, 2466, 2623, 2635, 4321, 4967, 5150 all have null values for 'InventoryType' feature
#explore the 'DeliveryDate' (column:2) and 'VehicleModelYear' (column:10) of these rows to determine their 'InventoryType'
sales_hist.iloc[[243, 1469, 1731, 2466, 2623, 2635, 4321, 4967, 5150],[2,10]]

Since the 'DeliveryDate' for each of these 9 vehicles is within a year of its 'VehicleModelYear', it is fair to assume these vehicles are all 'InventoryType' 'N' (new).
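
The "within a year" check could be made explicit so that only rows passing it are filled. A minimal sketch on toy rows (all values invented), assuming *DeliveryDate* has already been parsed to a datetime:

```python
import numpy as np
import pandas as pd

# Toy rows; the third already has a type and must stay untouched
toy = pd.DataFrame({
    "DeliveryDate": pd.to_datetime(["2006-02-10", "2008-09-01", "2010-05-20"]),
    "VehicleModelYear": [2006, 2009, 2004],
    "InventoryType": [np.nan, np.nan, "U"],
})

# Gap in years between the delivery year and the model year
year_gap = (toy["DeliveryDate"].dt.year - toy["VehicleModelYear"]).abs()

# Fill only missing rows, and only when the gap is at most one year
fill_new = toy["InventoryType"].isna() & (year_gap <= 1)
toy.loc[fill_new, "InventoryType"] = "N"

print(toy)
```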

In [None]:
#replace all null values in 'InventoryType' with 'N'
sales_hist['InventoryType'] = sales_hist['InventoryType'].fillna(value = 'N')
#confirm feature updated
sales_hist['InventoryType'].value_counts()

#### 2.6.2 Removing Unnecessary Features

Although not all datasets need a feature that serves as a unique identifier, this dataset appears to have several. *BuyerID*, *DealNumber*, *StockNumber*, and *Trade1_StockNumber* can all be considered unique identifiers. It is already known that *Trade1_StockNumber* is missing 68% of its entries and serves no other purpose, so this feature should be removed. Additionally, we do not need multiple key features, so two of the remaining three can be removed as well. Let's examine which of these features has the most unique entries.

In [None]:
#exploring unique entries of BuyerID
sales_hist.BuyerID.nunique()

In [None]:
#exploring unique entries of StockNumber
sales_hist.StockNumber.nunique()

*BuyerID* has 5656 unique entries out of 8208. *DealNumber* only has two duplicate numbers. *StockNumber* has 8183 unique entries out of 8208. Therefore, the best feature to use as a unique key identifier is the *DealNumber*. Now, one of the two duplicate entries needs to be changed to another unique entry and the *BuyerID* and *StockNumber* features can be removed.
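
The comparison above can also be produced in one table. The sketch below uses a four-row toy frame with invented values, so the counts differ from the real dataset; it only shows the shape of the check.

```python
import pandas as pd

# Toy frame; values are made up, column names follow the notebook
toy = pd.DataFrame({
    "BuyerID": [1, 1, 1, 2],
    "DealNumber": [10, 11, 12, 12],
    "StockNumber": ["A", "B", "A", "A"],
})

candidates = ["BuyerID", "DealNumber", "StockNumber"]
uniq = toy[candidates].nunique()

# Distinct values per candidate key and how many rows repeat an earlier value
summary = pd.DataFrame({
    "unique": uniq,
    "repeats": len(toy) - uniq,
}).sort_values("repeats")

print(summary)
```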

##### 2.6.2.1 DealNumber

In [None]:
#exploring unique entries of DealNumber
sales_hist.DealNumber.nunique()

In [None]:
#identifying the duplicate entries in DealNumber
sales_hist.DealNumber.value_counts()

In [None]:
#find loc of these duplicate entries
sales_hist[sales_hist['DealNumber']==20113]

In [None]:
sales_hist.DealNumber.sort_values(ascending = False)

DealNumber 20113 is the duplicated entry, and we see that 2007904 is skipped. Therefore, one of the duplicate 20113 entries will be converted to this unused number.
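
A sketch of locating the repeated entry and reassigning it by index label follows. The deal numbers are toy values, and picking the smallest unused number in the observed range is a hypothetical rule chosen for illustration; it is not how 2007904 was selected in this notebook.

```python
import pandas as pd

# Toy deal numbers with one repeat and one gap
toy = pd.DataFrame({"DealNumber": [20111, 20112, 20113, 20113, 20115]})

# keep='first' flags only the second and later copies of a value
dupes = toy["DealNumber"].duplicated(keep="first")
dupe_rows = toy.index[dupes].tolist()

# Hypothetical rule: smallest unused number inside the observed range
used = set(toy["DealNumber"].tolist())
low, high = int(toy["DealNumber"].min()), int(toy["DealNumber"].max())
unused = next(n for n in range(low, high + 1) if n not in used)

# Reassign the flagged copy by index label
toy.loc[dupe_rows[0], "DealNumber"] = unused

print(toy)
```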

In [None]:
#replace one entry with 2007904
sales_hist.iloc[4287,0]=2007904
#confirm entry changed
sales_hist.iloc[4287,0]

In [None]:
#examining DealStatus
sales_hist.DealStatus.value_counts()

The entries are all the same, so this feature can be removed as well. 

##### 2.6.2.2 Removal of several features

In [None]:
#multiple features removed from dataset
sales_hist.drop(['SalesPersonID', 'SalesPerson2ID','BuyerID','StockNumber','DealStatus','Comments','APR','MonthlyPayment'], axis =1, inplace=True)
#confirm features removed
sales_hist.info()

#### 2.6.3 Categorical Features

In [None]:
#select columns with dtypes 'object'
sales_hist.select_dtypes(object)

There are 17 columns with dtype object. Let's explore each.

##### 2.6.3.1 Time Series Formatting

In [None]:
#examine and clean time series format for Contract and DeliveryDate
sales_hist['DeliveryDate'] = pd.to_datetime(sales_hist['DeliveryDate'], errors = 'coerce').dt.floor('d')
#confirm changes to formatting
sales_hist.iloc[:,1:3]

##### 2.6.3.2 VIN Duplicates

In [None]:
#examining VIN
sales_hist.VIN.value_counts()

In [None]:
#examining exact count of duplicates in VIN
sh = sales_hist.VIN.value_counts()
sh.value_counts()

We now know there are no null values, but only 5979 of the entries are unique and the remainder are duplicates. 967 VINs are repeated once, 97 twice, and one three times. We can see from the first output that the one VIN occurring four times is 2T2HA31U35C060138. It is suspected that these duplicated VINs belong to vehicles that are part of lease programs.
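
To make the "counts of counts" step easier to read, here is the same pattern on a toy Series with placeholder VIN labels.

```python
import pandas as pd

# Placeholder VIN labels; V2 is sold twice and V3 three times
toy = pd.Series(["V1", "V2", "V2", "V3", "V3", "V3", "V4"], name="VIN")

per_vin = toy.value_counts()            # number of sales for each VIN
distribution = per_vin.value_counts()   # number of VINs sharing each sale count

# VINs that were sold more than once
repeat_vins = per_vin[per_vin > 1].index.tolist()

print(distribution)
```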

##### 2.6.3.3 VehicleMake

In [None]:
#examining VehicleMake
sales_hist.VehicleMake.value_counts()

Lexus has the highest number of observations, with Toyota a distant second and Honda, Mercedes-Benz, and Ford surprisingly competing for the next places. My initial assumption is that specific vehicle features lead to this occurrence. Grouping the VehicleMake feature into broader categories of luxury, economy, and potentially one more may prove insightful as well. For now, it is clear that Lexus, Toyota, and other makes will need to be explored for this business problem.
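
A minimal sketch of the Lexus / Toyota / other grouping on toy rows. `MakeGroup` is a hypothetical column name, and the luxury/economy split is not attempted here.

```python
import pandas as pd

# Toy makes; the real column has many more values
toy = pd.DataFrame({"VehicleMake": ["Lexus", "Toyota", "Honda", "Ford", "Lexus"]})

# Keep the two makes of interest and fold every other make into 'Other'
keep = ["Lexus", "Toyota"]
toy["MakeGroup"] = toy["VehicleMake"].where(toy["VehicleMake"].isin(keep), "Other")

print(toy["MakeGroup"].value_counts())
```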

#### 2.6.4 Examining Potential Target Categorical Features

##### 2.6.4.1 Lexus

In [None]:
#grouping all VehicleMakes by VehicleModel
veh_info = sales_hist.groupby('VehicleMake')['VehicleModel'].value_counts()

In [None]:
#examining Lexus by VehicleModel exclusively
veh_info.Lexus

Lexus' top three selling cars are RX350, ES350 and RX330.

##### 2.6.4.2 Toyota

In [None]:
#examining Toyota by VehicleModel exclusively
veh_info.Toyota

Toyota's top selling cars are the Avalon, Highlander, and Camry.

## 2.7 Load Decoded VINs Data

In [2]:
#csv file in subdirectory 'raw'
dec_VINs = pd.read_csv('../data/raw/Decoded VINs.csv')


In [3]:
dec_VINs

Unnamed: 0,makeid,modelid,vin,batteryinfo,batterytype,bedtype,bodycabtype,bodyclass,enginecylinders,destinationmarket,...,saeautomationlevel_to,rearcrosstrafficalert,gcwr,gcwr_to,ncsanote,ncsamappingexception,ncsamapexcapprovedon,ncsamapexcapprovedby,gvwr_to,errortext
0,475.0,2147.0,2HNYD2H62AH501801,,,,,Sport Utility Vehicle (SUV)/Multi-Purpose Vehi...,6.0,,...,,,,,,,,,,0 - VIN decoded clean. Check Digit (9th positi...
1,475.0,1871.0,5J8TB18518A010556,,,,,Sport Utility Vehicle (SUV)/Multi-Purpose Vehi...,4.0,,...,,,,,,,,,,0 - VIN decoded clean. Check Digit (9th positi...
2,475.0,1873.0,19UUA56663A000835,,,Not Applicable,Not Applicable,Sedan/Saloon,6.0,,...,,,,,,,,,,0 - VIN decoded clean. Check Digit (9th positi...
3,475.0,1873.0,19UUA66284A040323,,,Not Applicable,Not Applicable,Sedan/Saloon,6.0,,...,,,,,,,,,,0 - VIN decoded clean. Check Digit (9th positi...
4,475.0,1872.0,JH4KB16586C000927,,,Not Applicable,Not Applicable,Sedan/Saloon,6.0,Continental US (excluding Hawaii & Alaska),...,,,,,,,,,,0 - VIN decoded clean. Check Digit (9th positi...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8205,485.0,3132.0,YV1CZ91H141084304,,,Not Applicable,Not Applicable,Sport Utility Vehicle (SUV)/Multi-Purpose Vehi...,,,...,,,,,,,,,,0 - VIN decoded clean. Check Digit (9th positi...
8206,485.0,3132.0,YV1CZ91H031001914,,,Not Applicable,Not Applicable,Sport Utility Vehicle (SUV)/Multi-Purpose Vehi...,,,...,,,,,,,,,,0 - VIN decoded clean. Check Digit (9th positi...
8207,485.0,3132.0,YV4CM982471399718,,,,,Sport Utility Vehicle (SUV)/Multi-Purpose Vehi...,6.0,,...,,,,,,,,,,0 - VIN decoded clean. Check Digit (9th positi...
8208,485.0,3132.0,YV1CM59H331013308,,,Not Applicable,Not Applicable,Sport Utility Vehicle (SUV)/Multi-Purpose Vehi...,,,...,,,,,,,,,,0 - VIN decoded clean. Check Digit (9th positi...


There are 145 features, so feature reduction will be necessary to identify the useful features and the target feature(s). There appear to be 8209 observations, and there appears to be significant variance in entry type throughout the dataset. This will be handled first.

#### 2.8.1 Exploring and Cleaning Entry Types

In [None]:
#examine # of missing values by column and sort them high to low - 60 at a time
missing = pd.concat([dec_VINs.isnull().sum(), 100 * dec_VINs.isnull().mean()], axis=1)
missing.columns = ['count','%']
missing.sort_values(by = 'count', ascending = False).head(60)

These first 60 features will be removed from the dec_VINs dataset since they have few or no observations available.
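
The removal can be driven by the share of missing values per column. A sketch on a toy frame, where the 0.8 cutoff is the threshold used in this notebook and the columns and values are illustrative:

```python
import numpy as np
import pandas as pd

# Toy frame: 'batteryinfo' is entirely missing, 'doors' is 20% missing
toy = pd.DataFrame({
    "vin": ["a", "b", "c", "d", "e"],
    "batteryinfo": [np.nan] * 5,
    "doors": [4, 4, np.nan, 2, 4],
})

# Fraction of missing values in each column
missing_share = toy.isnull().mean()

# Keep columns below the cutoff and record which ones were dropped
kept = toy.loc[:, missing_share < 0.8]
dropped = missing_share[missing_share >= 0.8].index.tolist()

print(dropped)
```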

In [None]:
missing.sort_values(by = 'count').head(60)

In [3]:
#remove features with more than 80% missing values
dec_VINs = dec_VINs[dec_VINs.columns[dec_VINs.isnull().mean() < 0.8]]
dec_VINs

Unnamed: 0,makeid,modelid,vin,bedtype,bodycabtype,bodyclass,enginecylinders,displacementcc,displacementci,displacementl,...,errorcode,enginemanufacturer,busfloorconfigtype,bustype,custommotorcycletype,motorcyclesuspensiontype,motorcyclechassistype,manufacturerid,tpms,errortext
0,475.0,2147.0,2HNYD2H62AH501801,,,Sport Utility Vehicle (SUV)/Multi-Purpose Vehi...,6.0,3670.702336,224.000000,3.670702,...,0,,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,990.0,Direct,0 - VIN decoded clean. Check Digit (9th positi...
1,475.0,1871.0,5J8TB18518A010556,,,Sport Utility Vehicle (SUV)/Multi-Purpose Vehi...,4.0,2294.188960,140.000000,2.294189,...,0,,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,988.0,Direct,0 - VIN decoded clean. Check Digit (9th positi...
2,475.0,1873.0,19UUA56663A000835,Not Applicable,Not Applicable,Sedan/Saloon,6.0,3211.864544,196.000000,3.211865,...,0,,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,988.0,,0 - VIN decoded clean. Check Digit (9th positi...
3,475.0,1873.0,19UUA66284A040323,Not Applicable,Not Applicable,Sedan/Saloon,6.0,3211.864544,196.000000,3.211865,...,0,,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,988.0,,0 - VIN decoded clean. Check Digit (9th positi...
4,475.0,1872.0,JH4KB16586C000927,Not Applicable,Not Applicable,Sedan/Saloon,6.0,3471.000000,212.000000,3.474058,...,0,Honda,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,987.0,,0 - VIN decoded clean. Check Digit (9th positi...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8205,485.0,3132.0,YV1CZ91H141084304,Not Applicable,Not Applicable,Sport Utility Vehicle (SUV)/Multi-Purpose Vehi...,,,,,...,0,,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,1006.0,,0 - VIN decoded clean. Check Digit (9th positi...
8206,485.0,3132.0,YV1CZ91H031001914,Not Applicable,Not Applicable,Sport Utility Vehicle (SUV)/Multi-Purpose Vehi...,,,,,...,0,,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,1006.0,,0 - VIN decoded clean. Check Digit (9th positi...
8207,485.0,3132.0,YV4CM982471399718,,,Sport Utility Vehicle (SUV)/Multi-Purpose Vehi...,6.0,3192.000000,195.275981,3.200000,...,0,Volvo,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,1006.0,,0 - VIN decoded clean. Check Digit (9th positi...
8208,485.0,3132.0,YV1CM59H331013308,Not Applicable,Not Applicable,Sport Utility Vehicle (SUV)/Multi-Purpose Vehi...,,,,,...,0,,Not Applicable,Not Applicable,Not Applicable,Not Applicable,Not Applicable,1006.0,,0 - VIN decoded clean. Check Digit (9th positi...


Now there are only 49 columns left. Next, the remaining features need to be examined to see if their topic is relevant to the project.

In [5]:
dec_VINs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8210 entries, 0 to 8209
Data columns (total 49 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   makeid                    8207 non-null   float64
 1   modelid                   8204 non-null   float64
 2   vin                       8210 non-null   object 
 3   bedtype                   3766 non-null   object 
 4   bodycabtype               3796 non-null   object 
 5   bodyclass                 8204 non-null   object 
 6   enginecylinders           7999 non-null   float64
 7   displacementcc            8159 non-null   float64
 8   displacementci            8159 non-null   float64
 9   displacementl             8159 non-null   float64
 10  doors                     8115 non-null   float64
 11  drivetype                 7240 non-null   object 
 12  enginemodel               7316 non-null   object 
 13  enginekw                  6475 non-null   float64
 14  fueltype

The first three features are identifiers, so not all three are needed. Since we know *vin* has duplicates, as seen in the sales_hist dataset, an additional identifier is necessary to confirm accuracy when the two datasets are joined. Since *make* is available in column 16, *makeid* and *modelid* will be removed.

That said, the features listed after *make* are redundant with features in the sales_hist dataset. Therefore, columns 17 - 19 will be removed. Features 20, 30-33, 38, 39, 40, 46, and 48 are irrelevant and will be removed as well.
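
Dropping by column name instead of position is one way to keep this step readable and stable if the column order ever changes. A sketch on a one-row toy frame: only `makeid` and `modelid` are named because the other column names are not listed here, and the values are placeholders.

```python
import pandas as pd

# One-row toy frame with placeholder values
toy = pd.DataFrame({
    "makeid": [1.0],
    "modelid": [2.0],
    "vin": ["VIN0001"],
    "make": ["example"],
})

to_drop = ["makeid", "modelid"]

# Guard against typos: names that are not actually in the frame
not_found = [c for c in to_drop if c not in toy.columns]

trimmed = toy.drop(columns=to_drop)

print(list(trimmed.columns))
```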

In [4]:
#remove remaining unnecessary features - 15 total
dec_VINs = dec_VINs.drop(dec_VINs.columns[[0,1,17,18,19,20,30,31,32,33,38,39,40,46,48]], axis=1)
dec_VINs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8210 entries, 0 to 8209
Data columns (total 34 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   vin                       8210 non-null   object 
 1   bedtype                   3766 non-null   object 
 2   bodycabtype               3796 non-null   object 
 3   bodyclass                 8204 non-null   object 
 4   enginecylinders           7999 non-null   float64
 5   displacementcc            8159 non-null   float64
 6   displacementci            8159 non-null   float64
 7   displacementl             8159 non-null   float64
 8   doors                     8115 non-null   float64
 9   drivetype                 7240 non-null   object 
 10  enginemodel               7316 non-null   object 
 11  enginekw                  6475 non-null   float64
 12  fueltypeprimary           7994 non-null   object 
 13  gvwr                      4534 non-null   object 
 14  make    

Now there are 34 features remaining in the dec_VINs dataset. It's time to examine them closely to see what kind of observations each provides.

In [21]:
profile = ProfileReport(dec_VINs, title="Pandas Profiling Report")
profile.to_notebook_iframe()


From the HTML report above, we now know that 15 features (enginemodel, enginekw, fueltypeprimary, series, valvetraindesign, engineconfiguration, trailertype, trailerbodytype, coolingtype, busfloorconfigtype, bustype, custommotorcycletype, motorcyclesuspensiontype, motorcyclechassistype, and tpms) either contain only the value "Not Applicable", contain only one categorical observation, or are irrelevant after reviewing the report. Combined with their number of null values, this makes these features useless at this time, so they will be removed. Additionally, there are duplicate features in three different measurement units: *displacementcc*, *displacementci*, and *displacementl*. I will keep only the *displacementl* feature. There are also duplicate features regarding the type of bed/body of a pickup truck: *bodycabtype* vs. *bedtype*. Since *bodycabtype* has more observation categories, it will remain and *bedtype* will be removed. The same applies to *vehicletype* vs. *bodyclass*; *bodyclass* will be kept since it is more informative. This means a total of 19 features will be removed.
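
Part of this review could be automated by flagging columns that hold at most one real value once nulls and the "Not Applicable" placeholder are ignored. The helper name `informative_values` and the toy values below are assumptions for illustration; relevance to the project still needs a manual read of the report.

```python
import numpy as np
import pandas as pd

# Toy frame; column names follow dec_VINs, values are illustrative
toy = pd.DataFrame({
    "bustype": ["Not Applicable", "Not Applicable", np.nan],
    "bodyclass": ["Sedan/Saloon", "Sport Utility Vehicle", "Sedan/Saloon"],
    "displacementl": [3.2, 3.5, np.nan],
})

def informative_values(col):
    # Distinct values left after ignoring nulls and the placeholder string
    return col[~col.isin(["Not Applicable"])].nunique(dropna=True)

# Columns with zero or one remaining value add nothing to a model
uninformative = [c for c in toy.columns if informative_values(toy[c]) <= 1]

print(uninformative)
```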

In [5]:
#removing the 19 features with insufficient observations or purpose
dec_VINs = dec_VINs.drop(dec_VINs.columns[[1,5,6,10,11,12,15,17,19,20,25,26,27,28,29,30,31,32,33]],axis=1)
dec_VINs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8210 entries, 0 to 8209
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   vin               8210 non-null   object 
 1   bodycabtype       3796 non-null   object 
 2   bodyclass         8204 non-null   object 
 3   enginecylinders   7999 non-null   float64
 4   displacementl     8159 non-null   float64
 5   doors             8115 non-null   float64
 6   drivetype         7240 non-null   object 
 7   gvwr              4534 non-null   object 
 8   make              8208 non-null   object 
 9   trim              5623 non-null   object 
 10  airbagloccurtain  5065 non-null   object 
 11  airbaglocfront    5894 non-null   object 
 12  airbaglocknee     4440 non-null   object 
 13  enginehp          6477 non-null   float64
 14  airbaglocside     5489 non-null   object 
dtypes: float64(4), object(11)
memory usage: 962.2+ KB


Now, there are 15 features left. The null values for these features will now be handled.

#### 2.8.2 Handling Null Values

Since there are no null values for the feature *vin*, we will move on to the next feature, *bodycabtype*.

###### 2.8.2.1 bodycabtype

In [10]:
dec_VINs['bodycabtype'].value_counts()

Not Applicable                            3766
Extra/Super/ Quad/Double/King/Extended      18
Crew/ Super Crew/ Crew Max                   9
Regular                                      3
Name: bodycabtype, dtype: int64

In [11]:
#examine bodycabtype feature observations that are given a category
dec_VINs.loc[dec_VINs['bodycabtype'].isin(['Regular','Crew/ Super Crew/ Crew Max','Extra/Super/ Quad/Double/King/Extended'])].sort_values(by="bodycabtype").head(50)

Unnamed: 0,vin,bodycabtype,bodyclass,enginecylinders,displacementl,doors,drivetype,gvwr,make,trim,airbagloccurtain,airbaglocfront,airbaglocknee,enginehp,airbaglocside
246,2GCEK13T151341866,Crew/ Super Crew/ Crew Max,Pickup,8.0,5.3,4.0,4WD/4-Wheel Drive/4x4,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",CHEVROLET,,,,,,
526,2GTEK13T651246364,Crew/ Super Crew/ Crew Max,Pickup,8.0,5.3,4.0,4WD/4-Wheel Drive/4x4,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",GMC,1500 (1/2 ton),,,,,
491,1FTWW31P95EB23344,Crew/ Super Crew/ Crew Max,Pickup,8.0,6.0,,4WD/4-Wheel Drive/4x4,"Class 3: 10,001 - 14,000 lb (4,536 - 6,350 kg)",FORD,,,1st Row (Driver & Passenger),,325.0,
486,1FTRW08L63KD45644,Crew/ Super Crew/ Crew Max,Pickup,8.0,5.4,,4WD/4-Wheel Drive/4x4,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",FORD,F-Series,,1st Row (Driver & Passenger),,260.0,
485,1FTRW08L23KD45656,Crew/ Super Crew/ Crew Max,Pickup,8.0,5.4,,4WD/4-Wheel Drive/4x4,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",FORD,F-Series,,1st Row (Driver & Passenger),,260.0,
483,1FTRW08L53KD35154,Crew/ Super Crew/ Crew Max,Pickup,8.0,5.4,,4WD/4-Wheel Drive/4x4,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",FORD,F-Series,,1st Row (Driver & Passenger),,260.0,
488,1FTEX14N3SKA72573,Crew/ Super Crew/ Crew Max,Pickup,8.0,5.0,,4WD/4-Wheel Drive/4x4,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",FORD,,,,,195.0,
387,1D7HW58P67S182078,Crew/ Super Crew/ Crew Max,Pickup,8.0,4.7,,,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",DODGE,Laramie Club Cab / Quad Cab,,,,,
315,1GCHK23101F208259,Crew/ Super Crew/ Crew Max,Pickup,8.0,6.6,4.0,4WD/4-Wheel Drive/4x4,"Class 2H: 9,001 - 10,000 lb (4,082 - 4,536 kg)",CHEVROLET,3/4 Ton,,,,,
502,1FTZR45E83PA34496,Extra/Super/ Quad/Double/King/Extended,Pickup,6.0,4.0,4.0,4WD/4-Wheel Drive/4x4,"Class 1D: 5,001 - 6,000 lb (2,268 - 2,722 kg)",FORD,,,1st Row (Driver & Passenger),,207.0,


The three categories labeled within *bodycabtype* all belong to *bodyclass* 'Pickup'. This would imply that the vehicles labeled "Not Applicable" in this column are not *bodyclass* 'Pickup' and are therefore other body types. Let's examine the category "Not Applicable" to confirm this assumption.

In [12]:
#confirming 'Not Applicable' belongs to only non 'Pickup' vehicles
dec_VINs.loc[(dec_VINs['bodycabtype']=='Not Applicable') & (dec_VINs['bodyclass']=='Pickup')]

Unnamed: 0,vin,bodycabtype,bodyclass,enginecylinders,displacementl,doors,drivetype,gvwr,make,trim,airbagloccurtain,airbaglocfront,airbaglocknee,enginehp,airbaglocside


In [13]:
#view whether any of the null values within bodycabtype are bodyclass 'Pickup'
dec_VINs.loc[(dec_VINs['bodycabtype'].isnull()) & (dec_VINs['bodyclass']=='Pickup')]

Unnamed: 0,vin,bodycabtype,bodyclass,enginecylinders,displacementl,doors,drivetype,gvwr,make,trim,airbagloccurtain,airbaglocfront,airbaglocknee,enginehp,airbaglocside
185,3GYEK63N22G345632,,Pickup,8.0,6.0,4.0,4WD/4-Wheel Drive/4x4,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",CADILLAC,EXT,,,,,
249,1GCEC14W92Z330560,,Pickup,6.0,4.3,2.0,4x2,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",CHEVROLET,1/2 Ton,,,,,
254,3GCEK13348G273932,,Pickup,8.0,5.3,4.0,4WD/4-Wheel Drive/4x4,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",CHEVROLET,LS,,,,,
382,1D7HU18R67U593078,,Pickup,6.0,3.3,,4WD/4-Wheel Drive/4x4,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",DODGE,1500,,,,,
389,1D7HU18P67J593078,,Pickup,8.0,4.7,,4WD/4-Wheel Drive/4x4,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",DODGE,1500,,,,,
395,1D7HG48N73S202337,,Pickup,8.0,4.7,4.0,4WD/4-Wheel Drive/4x4,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",DODGE,SLT,,,,,
413,1B7HF16Y7XS284634,,Pickup,8.0,5.2,,4WD/4-Wheel Drive/4x4,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",DODGE,1500,,,,,
414,3D7HU18NX2G138730,,Pickup,8.0,4.7,4.0,4WD/4-Wheel Drive/4x4,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",DODGE,1500,,,,,
424,1B7GG22N4YS508812,,Pickup,8.0,4.7,2.0,,"Class 1D: 5,001 - 6,000 lb (2,268 - 2,722 kg)",DODGE,Base,,1st Row (Driver & Passenger),,,
427,1B7GG23X9VS130949,,Pickup,6.0,3.9,,,"Class 1D: 5,001 - 6,000 lb (2,268 - 2,722 kg)",DODGE,Sport/SLT,,,,,


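No 'Pickup' vehicle is labeled "Not Applicable", which confirms the assumption, but some pickups do have a null *bodycabtype*. Therefore, these 11 null values can be converted to "pickup_cab_size_unknown", and the remaining null values will be converted to 'Not Applicable' for the *bodycabtype* feature.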

In [6]:
#replace the 11 null values in 'bodycabtype'
dec_VINs.iloc[[185,249,254,382,389,395,413,414,424,427,8143],1]="pickup_cab_size_unknown"
#confirm entries changed
dec_VINs.iloc[[185,249,254,382,389,395,413,414,424,427,8143],1]

185     pickup_cab_size_unknown
249     pickup_cab_size_unknown
254     pickup_cab_size_unknown
382     pickup_cab_size_unknown
389     pickup_cab_size_unknown
395     pickup_cab_size_unknown
413     pickup_cab_size_unknown
414     pickup_cab_size_unknown
424     pickup_cab_size_unknown
427     pickup_cab_size_unknown
8143    pickup_cab_size_unknown
Name: bodycabtype, dtype: object

In [9]:
#replace remaining null values with 'Not Applicable'
dec_VINs['bodycabtype'] = dec_VINs['bodycabtype'].fillna(value = 'Not Applicable')
#confirm feature updated
dec_VINs['bodycabtype'].value_counts()

Not Applicable                            8180
Extra/Super/ Quad/Double/King/Extended      18
Crew/ Super Crew/ Crew Max                   9
Regular                                      3
Name: bodycabtype, dtype: int64
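
The same two-step fill can also be written with a boolean mask instead of hard-coded row positions, which avoids depending on the row order. This is a minimal sketch on toy data; `toy` and `pickup_missing` are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for dec_VINs with only the two relevant columns.
toy = pd.DataFrame({
    "bodyclass": ["Pickup", "Pickup", "Wagon", "Sedan/Saloon"],
    "bodycabtype": ["Regular", np.nan, np.nan, "Not Applicable"],
})

# Step 1: pickups with a missing cab type get an explicit "unknown" label.
pickup_missing = toy["bodycabtype"].isna() & toy["bodyclass"].eq("Pickup")
toy.loc[pickup_missing, "bodycabtype"] = "pickup_cab_size_unknown"

# Step 2: every other missing value is a non-pickup, so "Not Applicable" fits.
toy["bodycabtype"] = toy["bodycabtype"].fillna("Not Applicable")
print(toy["bodycabtype"].tolist())
```

The order of the two steps matters: filling everything with 'Not Applicable' first would leave no nulls for the pickup mask to find.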

###### 2.8.2.2 bodyclass

In [7]:
dec_VINs['bodyclass'].value_counts()

Wagon                                                      3730
Sedan/Saloon                                               3360
Sport Utility Vehicle (SUV)/Multi-Purpose Vehicle (MPV)     646
Convertible/Cabriolet                                       174
Hatchback/Liftback/Notchback                                111
Coupe                                                        88
Pickup                                                       41
Minivan                                                      30
Van                                                          17
Roadster                                                      3
Sport Utility Truck (SUT)                                     2
Hardtop or Coupe                                              1
Cargo Van                                                     1
Name: bodyclass, dtype: int64

'Hardtop or Coupe' appears to duplicate 'Coupe', and 'Cargo Van' appears to duplicate 'Van'. As a result, these two instances will be merged into their respective categories before moving on to this feature's null values.

In [11]:
#locate these two rows
dec_VINs.loc[(dec_VINs['bodyclass']=='Hardtop or Coupe') | (dec_VINs['bodyclass']=='Cargo Van')]

Unnamed: 0,vin,bodycabtype,bodyclass,enginecylinders,displacementl,doors,drivetype,gvwr,make,trim,airbagloccurtain,airbaglocfront,airbaglocknee,enginehp,airbaglocside
154,4H57H4H192363,,Hardtop or Coupe,8.0,,2.0,,,BUICK,,,,,150.0,
430,1FDEE1460VHC10713,,Cargo Van,8.0,4.6,,4x2,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",FORD,,,,,210.0,


In [12]:
#convert two redundant observations to correct name
dec_VINs.iloc[154,2]='Coupe'
dec_VINs.iloc[430,2]='Van'
#confirm entries changed
dec_VINs.iloc[[154,430],2]

154    Coupe
430      Van
Name: bodyclass, dtype: object
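
Merging redundant labels can also be done by value rather than by row position. The sketch below uses `Series.replace` with a mapping on a toy series; `category_map` is a hypothetical name introduced for illustration.

```python
import pandas as pd

# Toy stand-in for the bodyclass column.
bodyclass = pd.Series(
    ["Coupe", "Hardtop or Coupe", "Van", "Cargo Van", "Wagon"],
    name="bodyclass",
)

# Map each redundant label to the category it should be merged into.
category_map = {"Hardtop or Coupe": "Coupe", "Cargo Van": "Van"}
merged = bodyclass.replace(category_map)

print(merged.value_counts())
```

This catches every occurrence of a redundant label, so it still works if more than one row carries it.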

In [13]:
dec_VINs.loc[(dec_VINs['bodyclass'].isnull())]

Unnamed: 0,vin,bodycabtype,bodyclass,enginecylinders,displacementl,doors,drivetype,gvwr,make,trim,airbagloccurtain,airbaglocfront,airbaglocknee,enginehp,airbaglocside
240,1G6KFE796YU289403,Not Applicable,,8.0,4.6,,,,CADILLAC,,,1st Row (Driver & Passenger),,,1st Row (Driver & Passenger)
429,20971,,,8.0,,2.0,,,FERRARI,,,,,255.0,
7671,ZAMBC38A240011677,Not Applicable,,,,,,,MASERATI,,,,,,
7911,WPOAA299XYS621390,,,,,,,,,,,,,,
8117,JTM2F33V79D006162,,,4.0,2.5,,,"Class 1C: 4,001 - 5,000 lb (1,814 - 2,268 kg)",TOYOTA,Standard,,,,179.0,
8173,3NWRF31Y77M422458,,,,,,,,,,,,,,


Using an external site, [faxvin](faxvin.com), I tried to decode these 6 VINs to recover the missing information.
I will now update as many of the missing values as possible for each of these 6 VINs.

In [None]:
#examining VIN - 1G6KFE796YU289403
#this VIN does not pull up on any VIN decoder site. Will need to explore further why not.
dec_VINs.iloc[240,]

In [30]:
#replace bodyclass with 'unknown' for now
dec_VINs.iloc[240,2]='Unknown'
#confirm change to feature
dec_VINs.iloc[240,]

vin                            1G6KFE796YU289403
bodycabtype                       Not Applicable
bodyclass                                Unknown
enginecylinders                                8
displacementl                                4.6
doors                                        NaN
drivetype                                    NaN
gvwr                                         NaN
make                                    CADILLAC
trim                                         NaN
airbagloccurtain                             NaN
airbaglocfront      1st Row (Driver & Passenger)
airbaglocknee                                NaN
enginehp                                     NaN
airbaglocside       1st Row (Driver & Passenger)
Name: 240, dtype: object

In [None]:
#examining VIN - 20971
#this VIN does not pull up on any VIN decoder site. Will need to explore further why not.
dec_VINs.iloc[429,]

In [31]:
#replace bodyclass with 'unknown' for now
dec_VINs.iloc[429,2]='Unknown'
#confirm change to feature
dec_VINs.iloc[429,]

vin                   20971
bodycabtype             NaN
bodyclass           Unknown
enginecylinders           8
displacementl           NaN
doors                     2
drivetype               NaN
gvwr                    NaN
make                FERRARI
trim                    NaN
airbagloccurtain        NaN
airbaglocfront          NaN
airbaglocknee           NaN
enginehp                255
airbaglocside           NaN
Name: 429, dtype: object

In [22]:
#examining VIN - ZAMBC38A240011677
dec_VINs.iloc[7671,]

vin                 ZAMBC38A240011677
bodycabtype            Not Applicable
bodyclass                       Coupe
enginecylinders                     8
displacementl                     4.2
doors                               2
drivetype                         RWD
gvwr                             DOHC
make                              287
trim                         Gasoline
airbagloccurtain                  NaN
airbaglocfront               Maserati
airbaglocknee                     NaN
enginehp                  Cambiocorsa
airbaglocside                     NaN
Name: 7671, dtype: object

In [32]:
#updated vehicle info from site info
#VIN - ZAMBC38A240011677
dec_VINs.iloc[7671,2]='Coupe'
dec_VINs.iloc[7671,3]= 8
dec_VINs.iloc[7671,4]= 4.2
dec_VINs.iloc[7671,5]= 2
dec_VINs.iloc[7671,6]='RWD'
#had to google the car to find its exact gvwr
dec_VINs.iloc[7671,7]= 4537
dec_VINs.iloc[7671,8]='Maserati'
dec_VINs.iloc[7671,9]='Cambiocorsa'
dec_VINs.iloc[7671,10]='Unknown'
dec_VINs.iloc[7671,11]='1st Row (Driver & Passenger)'
dec_VINs.iloc[7671,12]= 'Unknown'
#had to google the car to find its exact hp
dec_VINs.iloc[7671,13]= 390.0 
dec_VINs.iloc[7671,14]='1st Row (Driver & Passenger)'
#confirm changes to feature
dec_VINs.iloc[7671,]

vin                            ZAMBC38A240011677
bodycabtype                       Not Applicable
bodyclass                                  Coupe
enginecylinders                                8
displacementl                                4.2
doors                                          2
drivetype                                    RWD
gvwr                                        4537
make                                    Maserati
trim                                 Cambiocorsa
airbagloccurtain                         Unknown
airbaglocfront      1st Row (Driver & Passenger)
airbaglocknee                            Unknown
enginehp                                     390
airbaglocside       1st Row (Driver & Passenger)
Name: 7671, dtype: object
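
Because a run of positional assignments is easy to misalign by one column, a label-based variant may be easier to check. The sketch below selects the row by VIN and sets each field by column name; the toy frame, the second VIN, and the `updates` dictionary are hypothetical, while the three Maserati values are the ones used above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for dec_VINs with a few of the real columns.
toy = pd.DataFrame({
    "vin": ["ZAMBC38A240011677", "HYPOTHETICALVIN01"],
    "bodyclass": [None, "Wagon"],
    "doors": [np.nan, 5.0],
    "drivetype": [None, "4x2"],
})

# Values looked up manually, keyed by column name.
updates = {"bodyclass": "Coupe", "doors": 2.0, "drivetype": "RWD"}

# Select the row by its VIN rather than by its position in the frame.
row = toy["vin"] == "ZAMBC38A240011677"
for column, value in updates.items():
    toy.loc[row, column] = value

print(toy.loc[row].squeeze())
```

Keying the values by column name makes each assignment self-describing, so a value cannot land in the wrong feature.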

In [None]:
#examining VIN - WPOAA299XYS621390
dec_VINs.iloc[7911,]

In [33]:
#updated vehicle info
#VIN - WPOAA299XYS621390 - need to change 'O' to '0'
dec_VINs.iloc[7911,0]='WP0AA299XYS621390'
dec_VINs.iloc[7911,2]='Coupe'
dec_VINs.iloc[7911,3]= 6
dec_VINs.iloc[7911,4]= 3.4
dec_VINs.iloc[7911,5]= 2
dec_VINs.iloc[7911,6]='AWD'
#had to google the car to find its exact gvwr
dec_VINs.iloc[7911,7]= 3032
dec_VINs.iloc[7911,8]='Porsche'
dec_VINs.iloc[7911,9]='Carrera 4 Coupe'
dec_VINs.iloc[7911,10]='Unknown'
dec_VINs.iloc[7911,11]='1st Row (Driver & Passenger)'
dec_VINs.iloc[7911,12]='Unknown'
#had to google the car to find its exact hp
dec_VINs.iloc[7911,13]= 300
dec_VINs.iloc[7911,14]= '1st Row (Driver & Passenger)'
#confirm changes to feature
dec_VINs.iloc[7911,]

vin                            WP0AA299XYS621390
bodycabtype                                  NaN
bodyclass                                  Coupe
enginecylinders                                6
displacementl                                3.4
doors                                          2
drivetype                                    AWD
gvwr                                        3032
make                                     Porsche
trim                             Carrera 4 Coupe
airbagloccurtain                         Unknown
airbaglocfront      1st Row (Driver & Passenger)
airbaglocknee                            Unknown
enginehp                                     300
airbaglocside       1st Row (Driver & Passenger)
Name: 7911, dtype: object
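
The 'O' versus '0' mix-up above suggests a cheap sanity check. Standard 17-character VINs do not use the letters I, O, or Q, so a regular expression can flag entries that deserve a manual look. This is a minimal sketch; `looks_like_standard_vin` is a hypothetical helper, and shorter identifiers such as `20971` will fail it by design, whether or not they are legitimate older serial numbers.

```python
import re

# 17 characters drawn from digits and capital letters, excluding I, O and Q.
VIN_PATTERN = re.compile(r"^[A-HJ-NPR-Z0-9]{17}$")

def looks_like_standard_vin(vin: str) -> bool:
    """Return True if the string has the shape of a 17-character VIN."""
    return bool(VIN_PATTERN.match(vin))

candidates = ["WPOAA299XYS621390", "WP0AA299XYS621390", "20971"]
flags = {vin: looks_like_standard_vin(vin) for vin in candidates}
print(flags)
```

The check only covers length and alphabet. It does not validate the check digit, so passing it does not prove a VIN is real.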

In [None]:
#examining VIN - JTM2F33V79D006162
#this VIN does not pull up on any VIN decoder site. Will need to explore further why not.
dec_VINs.iloc[8117,]

In [34]:
#replace bodyclass with 'unknown' for now
dec_VINs.iloc[8117,2]='Unknown'
#confirm change to feature
dec_VINs.iloc[8117,]

vin                                             JTM2F33V79D006162
bodycabtype                                                   NaN
bodyclass                                                 Unknown
enginecylinders                                                 4
displacementl                                                 2.5
doors                                                         NaN
drivetype                                                     NaN
gvwr                Class 1C: 4,001 - 5,000 lb (1,814 - 2,268 kg)
make                                                       TOYOTA
trim                                                     Standard
airbagloccurtain                                              NaN
airbaglocfront                                                NaN
airbaglocknee                                                 NaN
enginehp                                                      179
airbaglocside                                                 NaN
Name: 8117, dtype: object

In [27]:
#examining VIN - 3NWRF31Y77M422458
#this VIN does not pull up on any VIN decoder site. Will need to explore further why not.
dec_VINs.iloc[8173,]

vin                 3NWRF31Y77M422458
bodycabtype            Not Applicable
bodyclass                         NaN
enginecylinders                   NaN
displacementl                     NaN
doors                             NaN
drivetype                         NaN
gvwr                              NaN
make                              NaN
trim                              NaN
airbagloccurtain                  NaN
airbaglocfront                    NaN
airbaglocknee                     NaN
enginehp                          NaN
airbaglocside                     NaN
Name: 8173, dtype: object

In [35]:
#replace bodyclass with 'unknown' for now
dec_VINs.iloc[8173,2]='Unknown'
#confirm change to feature
dec_VINs.iloc[8173,]

vin                 3NWRF31Y77M422458
bodycabtype                       NaN
bodyclass                     Unknown
enginecylinders                     6
displacementl                     NaN
doors                             NaN
drivetype                         NaN
gvwr                              NaN
make                              NaN
trim                              NaN
airbagloccurtain                  NaN
airbaglocfront                    NaN
airbaglocknee                     NaN
enginehp                          NaN
airbaglocside                     NaN
Name: 8173, dtype: object

In [20]:
#confirm all null values in bodyclass feature handled
dec_VINs.loc[dec_VINs['bodyclass'].isnull()]

Unnamed: 0,vin,bodycabtype,bodyclass,enginecylinders,displacementl,doors,drivetype,gvwr,make,trim,airbagloccurtain,airbaglocfront,airbaglocknee,enginehp,airbaglocside


###### 2.8.2.3 enginecylinders

In [30]:
dec_VINs['enginecylinders'].value_counts()

6.0     6101
8.0     1397
4.0      493
5.0        9
10.0       1
Name: enginecylinders, dtype: int64

This feature has 209 null values to examine, since 2 were handled in the prior section.

Let's examine whether there is any correlation between the most frequent *enginecylinders* observation, '6.0', and the most frequent *bodyclass* observations, 'Wagon' and 'Sedan/Saloon', as seen in the pandas profile report.

In [98]:
dec_VINs.loc[(dec_VINs['enginecylinders']==6.0) & (dec_VINs['bodyclass']=='Wagon')]

Unnamed: 0,vin,bodycabtype,bodyclass,enginecylinders,displacementl,doors,drivetype,gvwr,make,trim,airbagloccurtain,airbaglocfront,airbaglocknee,enginehp,airbaglocside
30,WAUDH74FX6N108331,Not Applicable,Wagon,6.0,3.123,4.0,,,AUDI,,1st & 2nd Rows,1st Row (Driver & Passenger),,255,1st Row (Driver & Passenger)
44,WAUYP64B61N037052,Not Applicable,Wagon,6.0,2.671,4.0,,,AUDI,,1st Row (Driver & Passenger),1st Row (Driver & Passenger),,250,1st Row (Driver & Passenger)
405,2D4FV48T85H512810,Not Applicable,Wagon,6.0,2.700,,RWD/ Rear Wheel Drive,"Class 1C: 4,001 - 5,000 lb (1,814 - 2,268 kg)",DODGE,Base/SXT,,,,,
426,2D4FV48V55H130087,Not Applicable,Wagon,6.0,3.500,,RWD/ Rear Wheel Drive,"Class 1C: 4,001 - 5,000 lb (1,814 - 2,268 kg)",DODGE,Base/SXT,,,,,
492,2FMZA576X4BA60315,Not Applicable,Wagon,6.0,3.900,4.0,4x2,"Class 1D: 5,001 - 6,000 lb (2,268 - 2,722 kg)",FORD,,,1st Row (Driver & Passenger),,193,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8140,5TDBA22C05S044948,Not Applicable,Wagon,6.0,3.300,5.0,4WD/4-Wheel Drive/4x4,"Class 1D: 5,001 - 6,000 lb (2,268 - 2,722 kg)",TOYOTA,XLE,,,,215,
8141,4T3ZF13C8XU117672,Not Applicable,Wagon,6.0,3.000,5.0,4x2,"Class 1D: 5,001 - 6,000 lb (2,268 - 2,722 kg)",TOYOTA,LE/XLE,,,,194,
8149,4T3BK3BB5BU055335,Not Applicable,Wagon,6.0,3.500,5.0,4x2,"Class 1: 6,000 lb or less (2,722 kg or less)",TOYOTA,,All Rows,1st Row (Driver & Passenger),Driver Seat Only,268,1st Row (Driver & Passenger)
8151,4T3BK3BB4AU043806,Not Applicable,Wagon,6.0,3.500,5.0,4x2,"Class 1: 6,000 lb or less (2,722 kg or less)",TOYOTA,,All Rows,1st Row (Driver & Passenger),Driver Seat Only,268,1st Row (Driver & Passenger)


In [92]:
dec_VINs.loc[(dec_VINs['enginecylinders']==6.0) & (dec_VINs['bodyclass']=='Sedan/Saloon')]

Unnamed: 0,vin,bodycabtype,bodyclass,enginecylinders,displacementl,doors,drivetype,gvwr,make,trim,airbagloccurtain,airbaglocfront,airbaglocknee,enginehp,airbaglocside
2,19UUA56663A000835,Not Applicable,Sedan/Saloon,6.0,3.211865,4.0,,"Class 1C: 4,001 - 5,000 lb (1,814 - 2,268 kg)",ACURA,3.2,,1st Row (Driver & Passenger),,225,
3,19UUA66284A040323,Not Applicable,Sedan/Saloon,6.0,3.211865,4.0,,"Class 1C: 4,001 - 5,000 lb (1,814 - 2,268 kg)",ACURA,BASE,,1st Row (Driver & Passenger),,270,
4,JH4KB16586C000927,Not Applicable,Sedan/Saloon,6.0,3.474058,4.0,,"Class 1C: 4,001 - 5,000 lb (1,814 - 2,268 kg)",ACURA,,1st & 2nd Rows,1st Row (Driver & Passenger),,300,1st Row (Driver & Passenger)
5,JH4KA965XWC010621,Not Applicable,Sedan/Saloon,6.0,3.474058,4.0,,"Class 1C: 4,001 - 5,000 lb (1,814 - 2,268 kg)",ACURA,3.5,,1st Row (Driver & Passenger),,210,1st Row (Driver & Passenger)
7,19UUA9F50CA003440,Not Applicable,Sedan/Saloon,6.0,3.474058,4.0,,"Class 1C: 4,001 - 5,000 lb (1,814 - 2,268 kg)",ACURA,,1st & 2nd Rows,1st Row (Driver & Passenger),,280,1st Row (Driver & Passenger)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8168,3VWVH69MX3M170047,Not Applicable,Sedan/Saloon,6.0,2.792000,4.0,,,VOLKSWAGEN,,1st Row (Driver & Passenger),1st Row (Driver & Passenger),,200,Driver Seat Only
8178,WVWRH63B83P383066,Not Applicable,Sedan/Saloon,6.0,2.771000,4.0,,,VOLKSWAGEN,,1st Row (Driver & Passenger),1st Row (Driver & Passenger),,190,Driver Seat Only
8182,WVWRH63B12P239115,Not Applicable,Sedan/Saloon,6.0,2.771000,4.0,,,VOLKSWAGEN,,1st Row (Driver & Passenger),1st Row (Driver & Passenger),,190,Driver Seat Only
8184,WVWPD63B3XE453626,Not Applicable,Sedan/Saloon,6.0,2.771000,4.0,FWD/Front Wheel Drive,,VOLKSWAGEN,,,1st Row (Driver & Passenger),,190,1st Row (Driver & Passenger)


98% of the '6.0' observations within *enginecylinders* are associated with these two body classes, accounting for 82% of 'Wagon' and 76% of 'Sedan/Saloon' observations within *bodyclass*. Now let's determine what portion of the null values within *enginecylinders* belongs to these two body classes.
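
Percentages like the ones quoted above can be read off a normalized cross-tabulation instead of being worked out by hand from filtered frames. The sketch below is an illustration on toy data, not a reproduction of the notebook's figures; `share_within_class` and `share_within_cyl` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for dec_VINs with only the two relevant columns.
toy = pd.DataFrame({
    "bodyclass": ["Wagon", "Wagon", "Wagon", "Sedan/Saloon", "Sedan/Saloon", "Pickup"],
    "enginecylinders": [6.0, 6.0, 4.0, 6.0, 8.0, 8.0],
})

# Each row sums to 1: the share of each cylinder count within a body class.
share_within_class = pd.crosstab(
    toy["bodyclass"], toy["enginecylinders"], normalize="index"
)

# Each column sums to 1: the share of each body class within a cylinder count.
share_within_cyl = pd.crosstab(
    toy["bodyclass"], toy["enginecylinders"], normalize="columns"
)

print(share_within_class.round(2))
print(share_within_cyl.round(2))
```

The two tables answer different questions, so it helps to state which direction a quoted percentage refers to.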

In [103]:
dec_VINs.loc[(dec_VINs['enginecylinders'].isnull()) & ((dec_VINs['bodyclass']=='Wagon') | (dec_VINs['bodyclass']=='Sedan/Saloon'))]

Unnamed: 0,vin,bodycabtype,bodyclass,enginecylinders,displacementl,doors,drivetype,gvwr,make,trim,airbagloccurtain,airbaglocfront,airbaglocknee,enginehp,airbaglocside
78,WBADD6326VBW00536,Not Applicable,Sedan/Saloon,,,4.0,,,BMW,,,1st Row (Driver & Passenger),,,
83,WBACD4325VAV53697,Not Applicable,Sedan/Saloon,,,4.0,,,BMW,,,1st Row (Driver & Passenger),,,
143,2G4GV5GV9D9198082,Not Applicable,Sedan/Saloon,,2.0,4.0,,,BUICK,,All Rows,1st Row (Driver & Passenger),,,1st & 2nd Rows
144,2G4GT5GVXD9192684,Not Applicable,Sedan/Saloon,,2.0,4.0,,,BUICK,,All Rows,1st Row (Driver & Passenger),,,1st & 2nd Rows
513,2FMDA5147SBA75739,Not Applicable,Wagon,,3.8,4.0,4x2,"Class 1D: 5,001 - 6,000 lb (2,268 - 2,722 kg)",FORD,,,1st Row (Driver & Passenger),,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8199,YV1TS94D8Y1121772,Not Applicable,Sedan/Saloon,,,4.0,,,VOLVO,,,1st Row (Driver & Passenger),,,
8200,YV1SZ58D921049465,Not Applicable,Wagon,,,5.0,AWD/All Wheel Drive,,VOLVO,,,1st Row (Driver & Passenger),,,
8201,YV1SZ58D711015040,Not Applicable,Wagon,,,5.0,AWD/All Wheel Drive,,VOLVO,,,1st Row (Driver & Passenger),,,
8202,YV1SZ58D511037554,Not Applicable,Wagon,,,5.0,AWD/All Wheel Drive,,VOLVO,,,1st Row (Driver & Passenger),,,


91% of the null values for the feature *enginecylinders* belong to one of these two observations. Therefore, the 209 remaining null values for the feature *enginecylinders* will be handled by converting them all to the value '6.0'.

In [24]:
#converting the null values within the feature enginecylinders to 6.0
dec_VINs['enginecylinders'] = dec_VINs['enginecylinders'].fillna(value = 6.0)
#confirm feature updated
dec_VINs['enginecylinders'].value_counts()

6.0     6310
8.0     1397
4.0      493
5.0        9
10.0       1
Name: enginecylinders, dtype: int64

###### 2.8.2.4 doors

In [105]:
dec_VINs['doors'].value_counts()

4.0    4056
5.0    3787
2.0     268
3.0       6
Name: doors, dtype: int64

In [25]:
dec_VINs.loc[dec_VINs['doors'].isnull()]

Unnamed: 0,vin,bodycabtype,bodyclass,enginecylinders,displacementl,doors,drivetype,gvwr,make,trim,airbagloccurtain,airbaglocfront,airbaglocknee,enginehp,airbaglocside
86,4USBT53433LT23693,Not Applicable,Convertible/Cabriolet,6.0,3.0,,,"Class 1B: 3,001 - 4,000 lb (1,360 - 1,814 kg)",BMW,Roadster 3.0si,,,,225.0,
240,1G6KFE796YU289403,Not Applicable,unknown,8.0,4.6,,,,CADILLAC,,,1st Row (Driver & Passenger),,,1st Row (Driver & Passenger)
252,3GNFK16Z56G127483,,Sport Utility Vehicle (SUV)/Multi-Purpose Vehi...,4.0,2.4,,4WD/4-Wheel Drive/4x4,"Class 2F: 7,001 - 8,000 lb (3,175 - 3,629 kg)",CHEVROLET,,,,,,
256,2GCEK19TX11370722,Extra/Super/ Quad/Double/King/Extended,Pickup,8.0,5.3,,4WD/4-Wheel Drive/4x4,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",CHEVROLET,1/2 Ton,,,,,
257,1GCEK19T34E293176,Extra/Super/ Quad/Double/King/Extended,Pickup,8.0,5.3,,4WD/4-Wheel Drive/4x4,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",CHEVROLET,1/2 Ton,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8117,JTM2F33V79D006162,,unknown,4.0,2.5,,,"Class 1C: 4,001 - 5,000 lb (1,814 - 2,268 kg)",TOYOTA,Standard,,,,179.0,
8142,5TEWN72N73Z195029,Extra/Super/ Quad/Double/King/Extended,Pickup,6.0,3.4,,4WD/4-Wheel Drive/4x4,"Class 1D: 5,001 - 6,000 lb (2,268 - 2,722 kg)",TOYOTA,DELUXE,,,,190.0,
8143,3TMLU4EN4CM094849,pickup_cab_size_unknown,Pickup,6.0,4.0,,4WD/4-Wheel Drive/4x4,"Class 1: 6,000 lb or less (2,722 kg or less)",TOYOTA,,All Rows,1st Row (Driver & Passenger),,236.0,1st Row (Driver & Passenger)
8145,4TAPM62N0WZ156375,Regular,Pickup,4.0,2.7,,4WD/4-Wheel Drive/4x4,"Class 1D: 5,001 - 6,000 lb (2,268 - 2,722 kg)",TOYOTA,DELUXE,,,,150.0,


There are 93 null values for the feature *doors*. Let's examine whether there is any correlation between the most frequent *doors* observation, '4.0', and the most frequent *bodyclass* observations, 'Wagon' and 'Sedan/Saloon', as seen in the pandas profile report.

In [36]:
#how many doors does each body class have?
bodyclass_doors_count = dec_VINs.groupby('bodyclass')['doors'].value_counts()
bodyclass_doors_count

bodyclass                                                doors
Convertible/Cabriolet                                    2.0       154
Coupe                                                    2.0        90
                                                         4.0         1
Hatchback/Liftback/Notchback                             5.0        88
                                                         2.0         9
                                                         3.0         6
                                                         4.0         6
Minivan                                                  4.0        24
                                                         5.0         6
Pickup                                                   4.0        13
                                                         2.0         2
Roadster                                                 2.0         3
Sedan/Saloon                                             4.0      3355
              

In [37]:
#comparison with bodyclass total for reference
dec_VINs['bodyclass'].value_counts()

Wagon                                                      3730
Sedan/Saloon                                               3360
Sport Utility Vehicle (SUV)/Multi-Purpose Vehicle (MPV)     646
Convertible/Cabriolet                                       174
Hatchback/Liftback/Notchback                                111
Coupe                                                        91
Pickup                                                       41
Minivan                                                      30
Van                                                          18
Unknown                                                       4
Roadster                                                      3
Sport Utility Truck (SUT)                                     2
Name: bodyclass, dtype: int64
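
The number of missing *doors* values per body class can also be counted directly, rather than by subtracting the grouped counts from the class totals. A minimal sketch on toy data, with `null_per_class` as a hypothetical name:

```python
import numpy as np
import pandas as pd

# Toy stand-in for dec_VINs with only the two relevant columns.
toy = pd.DataFrame({
    "bodyclass": ["Van", "Van", "Wagon", "Wagon", "Coupe"],
    "doors": [4.0, np.nan, 5.0, np.nan, 2.0],
})

# isna() gives True/False per row; summing within each body class counts the nulls.
null_per_class = toy["doors"].isna().groupby(toy["bodyclass"]).sum()
print(null_per_class)
```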

Therefore, we can gather the following for each body class:

- 'Convertible/Cabriolet' has a total of 174 observations, and '2.0' is the only value recorded for *doors*, so its 20 null values will be converted to '2.0' as well.
- 'Coupe' has no null values. However, it does have 1 observation of '4.0' *doors* out of 91, which could be a mistake or an outlier and needs to be examined more closely.
- 'Hatchback/Liftback/Notchback' has 2 null values for *doors*. Since its most frequent observation is '5.0', these two will be converted to '5.0'.
- 'Minivan' has no null values.
- 'Pickup' has 26 null values and two distinct observations for *doors*. Looking at the cab type associated with these vehicles will help determine whether the null values should be converted to '4.0' or '2.0'.
- 'Roadster' has no null values.
- 'Sedan/Saloon' is primarily observed with '4.0' doors. However, the '5.0' and '2.0' observations need to be examined more closely, given their low counts, to see if they are mistakes.
- 'Sport Utility Truck (SUT)' has no null values.
- 'Sport Utility Vehicle (SUV)/Multi-Purpose Vehicle (MPV)' has 9 null values. Its recorded observations are mostly '4.0' at 52% and '5.0' at 45%. To keep a similar distribution, I will convert 4 null values to '4.0', another 4 to '5.0', and the last one to '2.0'.
- 'Unknown' has 3 null values. Since its only recorded observation for *doors* is '2.0', these null values will be converted to this value.
- 'Van' has 17 null values. These will all be converted to '4.0', since this is the only observation recorded for *doors*.
- 'Wagon' has 14 null values. Since 91% of the observations for 'Wagon' are '5.0', these 14 null values will be converted to '5.0'.
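
For the body classes where the plan is simply "use the most frequent value", the fill can be sketched as a group-wise mode. This is only an illustration on toy data with hypothetical names (`most_common`, `class_mode`, `doors_filled`); it does not cover the cases handled differently above, such as 'Pickup' or the proportional split for SUVs.

```python
import numpy as np
import pandas as pd

# Toy stand-in for dec_VINs with only the two relevant columns.
toy = pd.DataFrame({
    "bodyclass": ["Wagon", "Wagon", "Wagon",
                  "Convertible/Cabriolet", "Convertible/Cabriolet", "Van"],
    "doors": [5.0, 5.0, np.nan, 2.0, np.nan, np.nan],
})

def most_common(values: pd.Series) -> float:
    """Most frequent non-null value, or NaN if the group has none."""
    modes = values.mode()
    return modes.iloc[0] if not modes.empty else np.nan

# One representative door count per body class.
class_mode = toy.groupby("bodyclass")["doors"].agg(most_common)

# Fill each missing value with the mode of its own body class.
toy["doors_filled"] = toy["doors"].fillna(toy["bodyclass"].map(class_mode))
print(toy)
```

A class with no recorded values at all, like 'Van' in this toy frame, stays null and still needs a manual decision.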

Now, let's explore and handle those observations within *bodyclass* with potential mistakes or outliers: 'Coupe' and 'Sedan/Saloon'.

In [38]:
#examining 'Coupe' 
dec_VINs.loc[(dec_VINs['doors']== 4.0) & (dec_VINs['bodyclass']== 'Coupe')]

Unnamed: 0,vin,bodycabtype,bodyclass,enginecylinders,displacementl,doors,drivetype,gvwr,make,trim,airbagloccurtain,airbaglocfront,airbaglocknee,enginehp,airbaglocside
7725,WDDDJ72X99A144631,Not Applicable,Coupe,8.0,5.5,4.0,,,MERCEDES-BENZ,,,1st Row (Driver & Passenger),,,


Using the same VIN decoder site mentioned earlier in the project, I was able to confirm that this is a 4-door coupe. I was also able to pull up some of the missing information on this vehicle, which I will update now.

In [39]:
#updated vehicle info from site info
#VIN - WDDDJ72X99A144631
dec_VINs.iloc[7725,6]='RWD'
#had to google the car to find its exact gvwr
dec_VINs.iloc[7725,7]= '4,020 lbs'
dec_VINs.iloc[7725,9]= 'CLS550'
dec_VINs.iloc[7725,10]= '1st Row (Driver & Passenger)'
dec_VINs.iloc[7725,12]= 'Unknown'
#had to google the car to find its exact hp
dec_VINs.iloc[7725,13]= 382.0
dec_VINs.iloc[7725,14]= '1st & 2nd Rows'
#confirm changes to feature
dec_VINs.iloc[7725,]

vin                            WDDDJ72X99A144631
bodycabtype                       Not Applicable
bodyclass                                  Coupe
enginecylinders                                8
displacementl                                5.5
doors                                          4
drivetype                                    RWD
gvwr                                   4,020 lbs
make                               MERCEDES-BENZ
trim                                      CLS550
airbagloccurtain    1st Row (Driver & Passenger)
airbaglocfront      1st Row (Driver & Passenger)
airbaglocknee                            Unknown
enginehp                                     382
airbaglocside                     1st & 2nd Rows
Name: 7725, dtype: object
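
The positional `iloc` updates above work as long as the row and column order of `dec_VINs` never changes. A possible alternative, sketched here on a two-row toy frame (an assumption for illustration), selects the row by its VIN and names each column explicitly.

```python
import pandas as pd
import numpy as np

# Toy stand-in for dec_VINs with a subset of its columns.
toy = pd.DataFrame({
    'vin':       ['WDDDJ72X99A144631', 'JTDAT123110163508'],
    'drivetype': [None, '4x2'],
    'trim':      [None, None],
    'enginehp':  [np.nan, 108.0],
})

# Select the row by VIN instead of by integer position.
is_target = toy['vin'] == 'WDDDJ72X99A144631'

# Column name -> new value, so each change is readable on its own.
updates = {'drivetype': 'RWD', 'trim': 'CLS550', 'enginehp': 382.0}
for column, value in updates.items():
    toy.loc[is_target, column] = value

# Confirm the changes for that VIN only.
print(toy.loc[is_target].T)
```

Because the selection is by label, the update still targets the right cell if columns are reordered or rows are dropped earlier in the notebook.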

In [40]:
#examining 'Sedan/Saloon'
dec_VINs.loc[(dec_VINs['bodyclass']=='Sedan/Saloon') & ((dec_VINs['doors']== 5.0) | (dec_VINs['doors']== 2.0))]

Unnamed: 0,vin,bodycabtype,bodyclass,enginecylinders,displacementl,doors,drivetype,gvwr,make,trim,airbagloccurtain,airbaglocfront,airbaglocknee,enginehp,airbaglocside
8055,JTDAT123110163508,Not Applicable,Sedan/Saloon,6.0,1.5,2.0,4x2,,TOYOTA,,,1st Row (Driver & Passenger),,108.0,
8107,2T1KE40E99C012244,Not Applicable,Sedan/Saloon,4.0,2.4,5.0,4x2,,TOYOTA,,1st Row (Driver & Passenger),1st Row (Driver & Passenger),,158.0,1st Row (Driver & Passenger)
8112,JTDKN3DU1C1570726,Not Applicable,Sedan/Saloon,4.0,1.8,5.0,4x2,,TOYOTA,,All Rows,1st Row (Driver & Passenger),Driver Seat Only,98.0,1st Row (Driver & Passenger)


Using the VIN decoder site again, I examined each VIN and handled it accordingly. Row 8055 is a 2-door sedan, so there is no error; only the null values in its other features need to be updated. Row 8107 is an error: the vehicle has only 4 doors, but its trim is "S 5-Speed AT", which I believe is why the confusion occurred. This will be updated along with any other null information available on the website. Row 8112 is also an error, since it is not only a 4-door but also a hatchback. These changes will be made alongside any other information that can be updated from the website.

In [41]:
#updated vehicle info from site info
#VIN - JTDAT123110163508
#had to google the car to find its exact gvwr
dec_VINs.iloc[8055,7]= '2,020 to 2,080 lbs'
dec_VINs.iloc[8055,9]= '2-Door'
dec_VINs.iloc[8055,10]= 'Unknown'
dec_VINs.iloc[8055,12]= 'Unknown'
dec_VINs.iloc[8055,14]= 'Unknown'
#confirm changes to feature
dec_VINs.iloc[8055,]

vin                            JTDAT123110163508
bodycabtype                       Not Applicable
bodyclass                           Sedan/Saloon
enginecylinders                                6
displacementl                                1.5
doors                                          2
drivetype                                    4x2
gvwr                          2,020 to 2,080 lbs
make                                      TOYOTA
trim                                      2-Door
airbagloccurtain                         Unknown
airbaglocfront      1st Row (Driver & Passenger)
airbaglocknee                            Unknown
enginehp                                     108
airbaglocside                            Unknown
Name: 8055, dtype: object

In [42]:
#updated vehicle info from site info
#VIN - 2T1KE40E99C012244
#had to google the car to find its exact gvwr
dec_VINs.iloc[8107,7]= '3,140 lbs'
dec_VINs.iloc[8107,9]= 'S 5-Speed AT'
dec_VINs.iloc[8107,12]= 'Unknown'
#confirm changes to feature
dec_VINs.iloc[8107,]

vin                            2T1KE40E99C012244
bodycabtype                       Not Applicable
bodyclass                           Sedan/Saloon
enginecylinders                                4
displacementl                                2.4
doors                                          5
drivetype                                    4x2
gvwr                                   3,140 lbs
make                                      TOYOTA
trim                                S 5-Speed AT
airbagloccurtain    1st Row (Driver & Passenger)
airbaglocfront      1st Row (Driver & Passenger)
airbaglocknee                            Unknown
enginehp                                     158
airbaglocside       1st Row (Driver & Passenger)
Name: 8107, dtype: object

In [44]:
#updated vehicle info from site info
#VIN - JTDKN3DU1C1570726
dec_VINs.iloc[8112,2]= 'Hatchback/Liftback/Notchback'
dec_VINs.iloc[8112,3]= 'Hybrid'
dec_VINs.iloc[8112,5]= 4.0
dec_VINs.iloc[8112,6]= 'FWD'
#had to google the car to find its exact gvwr
dec_VINs.iloc[8112,7]= '3,042 lbs'
dec_VINs.iloc[8112,9]= 'Prius II'
#had to google the car to find its exact hp
dec_VINs.iloc[8112,13]= 134.0
#confirm changes to feature
dec_VINs.iloc[8112,]

vin                            JTDKN3DU1C1570726
bodycabtype                       Not Applicable
bodyclass           Hatchback/Liftback/Notchback
enginecylinders                           Hybrid
displacementl                                1.8
doors                                          4
drivetype                                    FWD
gvwr                                   3,042 lbs
make                                      TOYOTA
trim                                    Prius II
airbagloccurtain                        All Rows
airbaglocfront      1st Row (Driver & Passenger)
airbaglocknee                   Driver Seat Only
enginehp                                     134
airbaglocside       1st Row (Driver & Passenger)
Name: 8112, dtype: object

Now it is finally time to handle the null values for the *doors* feature. I will begin with 'Pickup', since examining its relationship to the *bodycabtype* feature may show whether the null values of *doors* should be converted to '4.0' or '2.0'.

In [51]:
bodycabtype_count = dec_VINs.groupby('doors')['bodycabtype'].value_counts()
bodycabtype_count

doors  bodycabtype                           
2.0    Not Applicable                             255
       pickup_cab_size_unknown                      2
3.0    Not Applicable                               6
4.0    Not Applicable                            3380
       Extra/Super/ Quad/Double/King/Extended       6
       pickup_cab_size_unknown                      4
       Crew/ Super Crew/ Crew Max                   3
5.0    Not Applicable                              99
Name: bodycabtype, dtype: int64

Since 'Extra/Super/ Quad/Double/King/Extended' and 'Crew/ Super Crew/ Crew Max' appear exclusively under '4.0', the null values of *doors* for these cab types will be converted to '4.0'. There is also a *bodycabtype* named 'Regular' that is not seen here, because none of its rows has a recorded value for *doors*. A regular-cab truck has only 2 doors, so these null observations will be converted to '2.0'. The remaining null values will be handled by decoding their VINs.
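
The rule described above can be sketched as a lookup from cab type to door count, applied only to 'Pickup' rows whose *doors* value is null. The frame `toy` and the dictionary name `cab_to_doors` are assumptions for illustration; the cab-type labels and door counts are the ones discussed above.

```python
import pandas as pd
import numpy as np

# Toy stand-in for dec_VINs; one row per cab type plus a non-pickup row.
toy = pd.DataFrame({
    'bodyclass':   ['Pickup', 'Pickup', 'Pickup', 'Pickup', 'Van'],
    'bodycabtype': ['Crew/ Super Crew/ Crew Max',
                    'Extra/Super/ Quad/Double/King/Extended',
                    'Regular',
                    'pickup_cab_size_unknown',
                    'Not Applicable'],
    'doors':       [np.nan, np.nan, np.nan, np.nan, np.nan],
})

# Cab type -> door count; cab types missing from the dict stay null.
cab_to_doors = {
    'Crew/ Super Crew/ Crew Max': 4.0,
    'Extra/Super/ Quad/Double/King/Extended': 4.0,
    'Regular': 2.0,
}

# Only touch pickups that still have no door count.
needs_fill = toy['doors'].isnull() & (toy['bodyclass'] == 'Pickup')
toy.loc[needs_fill, 'doors'] = toy.loc[needs_fill, 'bodycabtype'].map(cab_to_doors)
print(toy)
```

Rows with 'pickup_cab_size_unknown' are left null on purpose, which matches the plan of decoding those VINs individually.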

In [53]:
dec_VINs.loc[(dec_VINs['doors'].isnull()) & (dec_VINs['bodyclass']== 'Pickup') & ((dec_VINs['bodycabtype']== 'Extra/Super/ Quad/Double/King/Extended') | (dec_VINs['bodycabtype']== 'Crew/ Super Crew/ Crew Max'))]

Unnamed: 0,vin,bodycabtype,bodyclass,enginecylinders,displacementl,doors,drivetype,gvwr,make,trim,airbagloccurtain,airbaglocfront,airbaglocknee,enginehp,airbaglocside
256,2GCEK19TX11370722,Extra/Super/ Quad/Double/King/Extended,Pickup,8,5.3,,4WD/4-Wheel Drive/4x4,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",CHEVROLET,1/2 Ton,,,,,
257,1GCEK19T34E293176,Extra/Super/ Quad/Double/King/Extended,Pickup,8,5.3,,4WD/4-Wheel Drive/4x4,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",CHEVROLET,1/2 Ton,,,,,
387,1D7HW58P67S182078,Crew/ Super Crew/ Crew Max,Pickup,8,4.7,,,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",DODGE,Laramie Club Cab / Quad Cab,,,,,
392,3B7HF13Z51M551671,Extra/Super/ Quad/Double/King/Extended,Pickup,8,5.9,,4WD/4-Wheel Drive/4x4,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",DODGE,1500,,,,,
480,1FTPX14516NA44400,Extra/Super/ Quad/Double/King/Extended,Pickup,8,5.4,,4WD/4-Wheel Drive/4x4,"Class 2F: 7,001 - 8,000 lb (3,175 - 3,629 kg)",FORD,,,1st Row (Driver & Passenger),,300.0,
483,1FTRW08L53KD35154,Crew/ Super Crew/ Crew Max,Pickup,8,5.4,,4WD/4-Wheel Drive/4x4,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",FORD,F-Series,,1st Row (Driver & Passenger),,260.0,
484,1FTDX0866VKB07250,Extra/Super/ Quad/Double/King/Extended,Pickup,8,4.6,,4WD/4-Wheel Drive/4x4,"Class 1D: 5,001 - 6,000 lb (2,268 - 2,722 kg)",FORD,,,,,210.0,
485,1FTRW08L23KD45656,Crew/ Super Crew/ Crew Max,Pickup,8,5.4,,4WD/4-Wheel Drive/4x4,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",FORD,F-Series,,1st Row (Driver & Passenger),,260.0,
486,1FTRW08L63KD45644,Crew/ Super Crew/ Crew Max,Pickup,8,5.4,,4WD/4-Wheel Drive/4x4,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",FORD,F-Series,,1st Row (Driver & Passenger),,260.0,
487,1FTZX17W5WNC14044,Extra/Super/ Quad/Double/King/Extended,Pickup,8,4.6,,4x2,"Class 1D: 5,001 - 6,000 lb (2,268 - 2,722 kg)",FORD,,,1st Row (Driver & Passenger),,215.0,


18 null values for *doors* fit these criteria and will all be converted to '4.0' now.

In [54]:
dec_VINs.iloc[[256,257,387,392,480,483,484,485,486,487,488,489,490,491,527,643,644,8142],5]= 4.0
#confirm change
dec_VINs.iloc[[256,257,387,392,480,483,484,485,486,487,488,489,490,491,527,643,644,8142],5]

256     4.0
257     4.0
387     4.0
392     4.0
480     4.0
483     4.0
484     4.0
485     4.0
486     4.0
487     4.0
488     4.0
489     4.0
490     4.0
491     4.0
527     4.0
643     4.0
644     4.0
8142    4.0
Name: doors, dtype: float64

In [56]:
#locate where bodycabtype is Regular
dec_VINs.loc[(dec_VINs['doors'].isnull()) & (dec_VINs['bodyclass']== 'Pickup') & (dec_VINs['bodycabtype']== 'Regular')]

Unnamed: 0,vin,bodycabtype,bodyclass,enginecylinders,displacementl,doors,drivetype,gvwr,make,trim,airbagloccurtain,airbaglocfront,airbaglocknee,enginehp,airbaglocside
481,2FTRF18224CA56943,Regular,Pickup,6,4.2,,4WD/4-Wheel Drive/4x4,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",FORD,Classic - Styleside,,1st Row (Driver & Passenger),,202.0,
482,1FTZF1721WNB69488,Regular,Pickup,6,4.2,,4x2,"Class 1D: 5,001 - 6,000 lb (2,268 - 2,722 kg)",FORD,,,1st Row (Driver & Passenger),,200.0,
8145,4TAPM62N0WZ156375,Regular,Pickup,4,2.7,,4WD/4-Wheel Drive/4x4,"Class 1D: 5,001 - 6,000 lb (2,268 - 2,722 kg)",TOYOTA,DELUXE,,,,150.0,


In [57]:
#convert these three null values for doors to 2.0
dec_VINs.iloc[[481,482,8145],5]= 2.0
#confirm change
dec_VINs.iloc[[481,482,8145],5]

481     2.0
482     2.0
8145    2.0
Name: doors, dtype: float64

In [58]:
#confirm how many null values left corresponding with 'Pickup'
dec_VINs.loc[(dec_VINs['doors'].isnull()) & (dec_VINs['bodyclass']== 'Pickup')]

Unnamed: 0,vin,bodycabtype,bodyclass,enginecylinders,displacementl,doors,drivetype,gvwr,make,trim,airbagloccurtain,airbaglocfront,airbaglocknee,enginehp,airbaglocside
382,1D7HU18R67U593078,pickup_cab_size_unknown,Pickup,6,3.3,,4WD/4-Wheel Drive/4x4,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",DODGE,1500,,,,,
389,1D7HU18P67J593078,pickup_cab_size_unknown,Pickup,8,4.7,,4WD/4-Wheel Drive/4x4,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",DODGE,1500,,,,,
413,1B7HF16Y7XS284634,pickup_cab_size_unknown,Pickup,8,5.2,,4WD/4-Wheel Drive/4x4,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",DODGE,1500,,,,,
427,1B7GG23X9VS130949,pickup_cab_size_unknown,Pickup,6,3.9,,,"Class 1D: 5,001 - 6,000 lb (2,268 - 2,722 kg)",DODGE,Sport/SLT,,,,,
8143,3TMLU4EN4CM094849,pickup_cab_size_unknown,Pickup,6,4.0,,4WD/4-Wheel Drive/4x4,"Class 1: 6,000 lb or less (2,722 kg or less)",TOYOTA,,All Rows,1st Row (Driver & Passenger),,236.0,1st Row (Driver & Passenger)


In [72]:
#use VIN decoder to handle remaining 5 null values & update where possible
#1D7HU18R67U593078 - not a valid VIN; however, since the VIN and the available details are similar
#to the VIN below it, I will make their information the same.
#1D7HU18P67J593078 & (copy for 1D7HU18R67U593078 what's null)
dec_VINs.iloc[[382,389],1]= 'Crew/ Super Crew/ Crew Max'
dec_VINs.iloc[[382,389],5]= 4.0
dec_VINs.iloc[389,9]= '1500 Laramie Quad Cab 4WD'
dec_VINs.iloc[[382,389],10]= 'Unknown'
dec_VINs.iloc[[382,389],11]= 'Unknown'
dec_VINs.iloc[[382,389],12]= 'Unknown'
#had to research exact hp
dec_VINs.iloc[382,13]= 0.0
dec_VINs.iloc[389,13]= 345.0
dec_VINs.iloc[[382,389],14]= 'Unknown'
#confirm changes
dec_VINs.iloc[[382,389],]

Unnamed: 0,vin,bodycabtype,bodyclass,enginecylinders,displacementl,doors,drivetype,gvwr,make,trim,airbagloccurtain,airbaglocfront,airbaglocknee,enginehp,airbaglocside
382,1D7HU18R67U593078,Crew/ Super Crew/ Crew Max,Pickup,6,3.3,4.0,4WD/4-Wheel Drive/4x4,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",DODGE,1500,Unknown,Unknown,Unknown,0.0,Unknown
389,1D7HU18P67J593078,Crew/ Super Crew/ Crew Max,Pickup,8,4.7,4.0,4WD/4-Wheel Drive/4x4,"Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",DODGE,1500 Laramie Quad Cab 4WD,Unknown,Unknown,Unknown,345.0,Unknown


In [61]:
#1B7HF16Y7XS284634
dec_VINs.iloc[413,1]= 'Regular'
dec_VINs.iloc[413,5]= 2.0
dec_VINs.iloc[413,9]= '1500 Reg. Cab Long Bed 4WD'
dec_VINs.iloc[413,10]= 'Unknown'
dec_VINs.iloc[413,11]= '1st Row (Driver & Passenger)'
dec_VINs.iloc[413,12]= 'Unknown'
#had to research exact hp
dec_VINs.iloc[413,13]= 345.0
dec_VINs.iloc[413,14]= 'Unknown'
#confirm changes
dec_VINs.iloc[413,]

vin                                             1B7HF16Y7XS284634
bodycabtype                                               Regular
bodyclass                                                  Pickup
enginecylinders                                                 8
displacementl                                                 5.2
doors                                                           2
drivetype                                   4WD/4-Wheel Drive/4x4
gvwr                Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)
make                                                        DODGE
trim                                   1500 Reg. Cab Long Bed 4WD
airbagloccurtain                                          Unknown
airbaglocfront                       1st Row (Driver & Passenger)
airbaglocknee                                             Unknown
enginehp                                                      345
airbaglocside                                             Unknown
Name: 413, dtype: object

In [62]:
#1B7GG23X9VS130949: extended cab but only has two doors
dec_VINs.iloc[427,1]= 'Extra/Super/ Quad/Double/King/Extended'
dec_VINs.iloc[427,5]= 2.0
dec_VINs.iloc[427,9]= 'Club Cab 4WD'
dec_VINs.iloc[427,10]= 'Unknown'
dec_VINs.iloc[427,11]= '1st Row (Driver & Passenger)'
dec_VINs.iloc[427,12]= 'Unknown'
#had to research exact hp
dec_VINs.iloc[427,13]= 175.0
dec_VINs.iloc[427,14]= 'Unknown'
#confirm changes
dec_VINs.iloc[427,]

vin                                             1B7GG23X9VS130949
bodycabtype                Extra/Super/ Quad/Double/King/Extended
bodyclass                                                  Pickup
enginecylinders                                                 6
displacementl                                                 3.9
doors                                                           2
drivetype                                                     NaN
gvwr                Class 1D: 5,001 - 6,000 lb (2,268 - 2,722 kg)
make                                                        DODGE
trim                                                 Club Cab 4WD
airbagloccurtain                                          Unknown
airbaglocfront                       1st Row (Driver & Passenger)
airbaglocknee                                             Unknown
enginehp                                                      175
airbaglocside                                             Unknown
Name: 427, dtype: object

In [63]:
#3TMLU4EN4CM094849
dec_VINs.iloc[8143,1]= 'Crew/ Super Crew/ Crew Max'
dec_VINs.iloc[8143,5]= 4.0
dec_VINs.iloc[8143,9]= 'Double Cab V6 4WD'
dec_VINs.iloc[8143,12]= 'Unknown'
#confirm changes
dec_VINs.iloc[8143,]

vin                                            3TMLU4EN4CM094849
bodycabtype                           Crew/ Super Crew/ Crew Max
bodyclass                                                 Pickup
enginecylinders                                                6
displacementl                                                  4
doors                                                          4
drivetype                                  4WD/4-Wheel Drive/4x4
gvwr                Class 1: 6,000 lb or less (2,722 kg or less)
make                                                      TOYOTA
trim                                           Double Cab V6 4WD
airbagloccurtain                                        All Rows
airbaglocfront                      1st Row (Driver & Passenger)
airbaglocknee                                            Unknown
enginehp                                                     236
airbaglocside                       1st Row (Driver & Passenger)
Name: 8143, dtype: object

Now, let's handle the remaining null values for *doors* as discussed.
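
The per-class conversions can also be written as a single mapping from *bodyclass* to a default door count, which avoids listing row positions by hand. This is a sketch on made-up rows; `toy` and `class_to_doors` are hypothetical names, while the class labels and target values follow the decisions already described.

```python
import pandas as pd
import numpy as np

# Toy stand-in for dec_VINs; the last row already has a value and must not change.
toy = pd.DataFrame({
    'bodyclass': ['Convertible/Cabriolet', 'Unknown', 'Van', 'Sedan/Saloon',
                  'Hatchback/Liftback/Notchback', 'Wagon', 'Wagon'],
    'doors':     [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 4.0],
})

# Default door count chosen for each body class.
class_to_doors = {
    'Convertible/Cabriolet': 2.0,
    'Unknown': 2.0,
    'Van': 4.0,
    'Sedan/Saloon': 4.0,
    'Hatchback/Liftback/Notchback': 5.0,
    'Wagon': 5.0,
}

# fillna only replaces nulls, so recorded values are preserved.
toy['doors'] = toy['doors'].fillna(toy['bodyclass'].map(class_to_doors))

# Count what is still missing after the fill.
remaining = toy['doors'].isnull().sum()
print(toy, remaining)
```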

In [64]:
#locate the null values to convert to 2.0
dec_VINs.loc[(dec_VINs['doors'].isnull()) & ((dec_VINs['bodyclass']=='Convertible/Cabriolet') | (dec_VINs['bodyclass']== 'Unknown'))]

Unnamed: 0,vin,bodycabtype,bodyclass,enginecylinders,displacementl,doors,drivetype,gvwr,make,trim,airbagloccurtain,airbaglocfront,airbaglocknee,enginehp,airbaglocside
86,4USBT53433LT23693,Not Applicable,Convertible/Cabriolet,6,3.0,,,"Class 1B: 3,001 - 4,000 lb (1,360 - 1,814 kg)",BMW,Roadster 3.0si,,,,225.0,
240,1G6KFE796YU289403,Not Applicable,Unknown,8,4.6,,,,CADILLAC,,,1st Row (Driver & Passenger),,,1st Row (Driver & Passenger)
7584,JTHFN48Y550062982,Not Applicable,Convertible/Cabriolet,8,4.3,,4x2,,LEXUS,430,,1st Row (Driver & Passenger),,300.0,1st Row (Driver & Passenger)
7587,JTHFN48Y550065512,Not Applicable,Convertible/Cabriolet,8,4.3,,4x2,,LEXUS,430,,1st Row (Driver & Passenger),,300.0,1st Row (Driver & Passenger)
7606,JTHFN48Y450070796,Not Applicable,Convertible/Cabriolet,8,4.3,,4x2,,LEXUS,430,,1st Row (Driver & Passenger),,300.0,1st Row (Driver & Passenger)
7607,JTHFN48Y550070113,Not Applicable,Convertible/Cabriolet,8,4.3,,4x2,,LEXUS,430,,1st Row (Driver & Passenger),,300.0,1st Row (Driver & Passenger)
7608,JTHFN48Y050069824,Not Applicable,Convertible/Cabriolet,8,4.3,,4x2,,LEXUS,430,,1st Row (Driver & Passenger),,300.0,1st Row (Driver & Passenger)
7610,JTHFN48Y250069369,Not Applicable,Convertible/Cabriolet,8,4.3,,4x2,,LEXUS,430,,1st Row (Driver & Passenger),,300.0,1st Row (Driver & Passenger)
7611,JTHFN48Y850069151,Not Applicable,Convertible/Cabriolet,8,4.3,,4x2,,LEXUS,430,,1st Row (Driver & Passenger),,300.0,1st Row (Driver & Passenger)
7612,JTHFN48Y650069326,Not Applicable,Convertible/Cabriolet,8,4.3,,4x2,,LEXUS,430,,1st Row (Driver & Passenger),,300.0,1st Row (Driver & Passenger)


In [65]:
#convert the null values for *doors* to 2.0
dec_VINs.iloc[[86,240,7584,7587,7606,7607,7608,7610,7611,7612,7613,7614,7615,7616,7617,7619,7620,7621,7622,7623,7624,8117,8173],5]= 2.0
#confirm changes
dec_VINs.loc[(dec_VINs['doors'].isnull()) & ((dec_VINs['bodyclass']=='Convertible/Cabriolet') | (dec_VINs['bodyclass']== 'Unknown'))]

Unnamed: 0,vin,bodycabtype,bodyclass,enginecylinders,displacementl,doors,drivetype,gvwr,make,trim,airbagloccurtain,airbaglocfront,airbaglocknee,enginehp,airbaglocside


In [66]:
#locate the null values to convert to 4.0
dec_VINs.loc[(dec_VINs['doors'].isnull()) & ((dec_VINs['bodyclass']=='Sedan/Saloon') | (dec_VINs['bodyclass']== 'Van'))]

Unnamed: 0,vin,bodycabtype,bodyclass,enginecylinders,displacementl,doors,drivetype,gvwr,make,trim,airbagloccurtain,airbaglocfront,airbaglocknee,enginehp,airbaglocside
383,2B4GP44302R706364,,Van,6,3.3,,FWD/Front Wheel Drive,"Class 1D: 5,001 - 6,000 lb (2,268 - 2,722 kg)",DODGE,Sport,,,,,
394,2B4GP44G7YR745612,,Van,6,3.3,,FWD/Front Wheel Drive,"Class 1D: 5,001 - 6,000 lb (2,268 - 2,722 kg)",DODGE,SE/Sport,,1st Row (Driver & Passenger),,,
397,2D4GP44L37R338596,,Van,6,3.8,,FWD/Front Wheel Drive,"Class 1D: 5,001 - 6,000 lb (2,268 - 2,722 kg)",DODGE,SXT,,,,,
398,1D4GP24R15B192140,,Van,6,3.3,,FWD/Front Wheel Drive,"Class 1D: 5,001 - 6,000 lb (2,268 - 2,722 kg)",DODGE,"Base, C/V",,,,,
406,2B4GP44G7YR818476,,Van,6,3.3,,FWD/Front Wheel Drive,"Class 1D: 5,001 - 6,000 lb (2,268 - 2,722 kg)",DODGE,SE/Sport,,1st Row (Driver & Passenger),,,
410,2B4GP44G1XR365177,,Van,6,3.3,,FWD/Front Wheel Drive,"Class 1D: 5,001 - 6,000 lb (2,268 - 2,722 kg)",DODGE,SE/Sport,,1st Row (Driver & Passenger),,,
411,2B4GP44R7WR768869,,Van,6,3.3,,FWD/Front Wheel Drive,"Class 1D: 5,001 - 6,000 lb (2,268 - 2,722 kg)",DODGE,SE/Sport,,1st Row (Driver & Passenger),,,
412,2B4GP44R5WR783290,,Van,6,3.3,,FWD/Front Wheel Drive,"Class 1D: 5,001 - 6,000 lb (2,268 - 2,722 kg)",DODGE,SE/Sport,,1st Row (Driver & Passenger),,,
416,2B8GP54L72R793846,,Van,6,3.8,,FWD/Front Wheel Drive,"Class 1D: 5,001 - 6,000 lb (2,268 - 2,722 kg)",DODGE,ES,,1st Row (Driver & Passenger),,,1st Row (Driver & Passenger)
421,2B8GP54L21R275707,,Van,6,3.8,,FWD/Front Wheel Drive,"Class 1D: 5,001 - 6,000 lb (2,268 - 2,722 kg)",DODGE,ES,,1st Row (Driver & Passenger),,,1st Row (Driver & Passenger)


In [67]:
#convert the null values for *doors* to 4.0
dec_VINs.iloc[[383,394,397,398,406,410,411,412,416,421,422,423,425,428,430,648,7850,7880,8111],5]=4.0
#confirm changes
dec_VINs.loc[(dec_VINs['doors'].isnull()) & ((dec_VINs['bodyclass']=='Sedan/Saloon') | (dec_VINs['bodyclass']== 'Van'))]

Unnamed: 0,vin,bodycabtype,bodyclass,enginecylinders,displacementl,doors,drivetype,gvwr,make,trim,airbagloccurtain,airbaglocfront,airbaglocknee,enginehp,airbaglocside


In [68]:
#locate the null values to convert to 5.0
dec_VINs.loc[(dec_VINs['doors'].isnull()) & ((dec_VINs['bodyclass']=='Hatchback/Liftback/Notchback') | (dec_VINs['bodyclass']== 'Wagon'))]

Unnamed: 0,vin,bodycabtype,bodyclass,enginecylinders,displacementl,doors,drivetype,gvwr,make,trim,airbagloccurtain,airbaglocfront,airbaglocknee,enginehp,airbaglocside
294,1GNFK16Z83R170923,,Wagon,8,5.3,,4WD/4-Wheel Drive/4x4,"Class 2F: 7,001 - 8,000 lb (3,175 - 3,629 kg)",CHEVROLET,1/2 Ton,,,,,
327,2A4GM68446R677557,,Hatchback/Liftback/Notchback,6,3.5,,FWD/Front Wheel Drive,"Class 1D: 5,001 - 6,000 lb (2,268 - 2,722 kg)",CHRYSLER,Touring,,,,,
390,3D4GH57V09T173817,,Hatchback/Liftback/Notchback,6,3.5,,,"Class 1D: 5,001 - 6,000 lb (2,268 - 2,722 kg)",DODGE,SXT,,1st Row (Driver & Passenger),,,1st Row (Driver & Passenger)
405,2D4FV48T85H512810,,Wagon,6,2.7,,RWD/ Rear Wheel Drive,"Class 1C: 4,001 - 5,000 lb (1,814 - 2,268 kg)",DODGE,Base/SXT,,,,,
426,2D4FV48V55H130087,,Wagon,6,3.5,,RWD/ Rear Wheel Drive,"Class 1C: 4,001 - 5,000 lb (1,814 - 2,268 kg)",DODGE,Base/SXT,,,,,
431,1FBSS3BL1EDA99461,Not Applicable,Wagon,8,5.4,,4x2,"Class 2H: 9,001 - 10,000 lb (4,082 - 4,536 kg)",FORD,,,1st Row (Driver & Passenger),,255.0,
432,1FBSS3BLXEDA85669,Not Applicable,Wagon,8,5.4,,4x2,"Class 2H: 9,001 - 10,000 lb (4,082 - 4,536 kg)",FORD,,,1st Row (Driver & Passenger),,255.0,
548,1GKFK16T0YJ173667,,Wagon,8,5.3,,4WD/4-Wheel Drive/4x4,"Class 2F: 7,001 - 8,000 lb (3,175 - 3,629 kg)",GMC,1500 (1/2 ton),,,,,
655,5NMSH13EX7H071781,,Wagon,6,3.3,,FWD/Front Wheel Drive,"Class 1D: 5,001 - 6,000 lb (2,268 - 2,722 kg)",HYUNDAI,,,,,,
666,KM8NU13C27U021324,,Wagon,6,3.8,,FWD/Front Wheel Drive,"Class 1D: 5,001 - 6,000 lb (2,268 - 2,722 kg)",HYUNDAI,,,,,,


In [69]:
#convert the null values for *doors* to 5.0
dec_VINs.iloc[[294,327,390,405,426,431,432,548,655,666,7853,7942,7943,7944,8102,8113],5]=5.0
#confirm changes
dec_VINs.loc[(dec_VINs['doors'].isnull()) & ((dec_VINs['bodyclass']=='Hatchback/Liftback/Notchback') | (dec_VINs['bodyclass']== 'Wagon'))]

Unnamed: 0,vin,bodycabtype,bodyclass,enginecylinders,displacementl,doors,drivetype,gvwr,make,trim,airbagloccurtain,airbaglocfront,airbaglocknee,enginehp,airbaglocside


In [73]:
#locate the remaining null values, all with bodyclass 'Sport Utility Vehicle (SUV)/Multi-Purpose Vehicle (MPV)'
dec_VINs.loc[(dec_VINs['doors'].isnull())]

Unnamed: 0,vin,bodycabtype,bodyclass,enginecylinders,displacementl,doors,drivetype,gvwr,make,trim,airbagloccurtain,airbaglocfront,airbaglocknee,enginehp,airbaglocside
252,3GNFK16Z56G127483,,Sport Utility Vehicle (SUV)/Multi-Purpose Vehi...,4,2.4,,4WD/4-Wheel Drive/4x4,"Class 2F: 7,001 - 8,000 lb (3,175 - 3,629 kg)",CHEVROLET,,,,,,
266,1GNFK16Z92J337321,,Sport Utility Vehicle (SUV)/Multi-Purpose Vehi...,8,5.3,,4WD/4-Wheel Drive/4x4,"Class 2F: 7,001 - 8,000 lb (3,175 - 3,629 kg)",CHEVROLET,1/2 Ton,,,,,
279,1GNFK16Z44J159384,,Sport Utility Vehicle (SUV)/Multi-Purpose Vehi...,8,5.3,,4WD/4-Wheel Drive/4x4,"Class 2F: 7,001 - 8,000 lb (3,175 - 3,629 kg)",CHEVROLET,1/2 Ton,,,,,
542,1GKFK66848J104196,,Sport Utility Vehicle (SUV)/Multi-Purpose Vehi...,8,6.2,,4WD/4-Wheel Drive/4x4,"Class 2F: 7,001 - 8,000 lb (3,175 - 3,629 kg)",GMC,,,,,,
543,1GKFK16Z23J160173,,Sport Utility Vehicle (SUV)/Multi-Purpose Vehi...,8,5.3,,4WD/4-Wheel Drive/4x4,"Class 2F: 7,001 - 8,000 lb (3,175 - 3,629 kg)",GMC,1500 (1/2 ton),,,,,
544,1GKFK16T21J255275,,Sport Utility Vehicle (SUV)/Multi-Purpose Vehi...,8,5.3,,4WD/4-Wheel Drive/4x4,"Class 2F: 7,001 - 8,000 lb (3,175 - 3,629 kg)",GMC,1500 (1/2 ton),,,,,
545,1GKFK66U34J115388,,Sport Utility Vehicle (SUV)/Multi-Purpose Vehi...,8,6.0,,4WD/4-Wheel Drive/4x4,"Class 2F: 7,001 - 8,000 lb (3,175 - 3,629 kg)",GMC,Luxury,,,,,
546,1GKFK66U82J105047,,Sport Utility Vehicle (SUV)/Multi-Purpose Vehi...,8,6.0,,4WD/4-Wheel Drive/4x4,"Class 2F: 7,001 - 8,000 lb (3,175 - 3,629 kg)",GMC,1500 (1/2 ton) Luxury,,,,,
547,1GKFK668X7J293578,,Sport Utility Vehicle (SUV)/Multi-Purpose Vehi...,8,6.2,,4WD/4-Wheel Drive/4x4,"Class 2F: 7,001 - 8,000 lb (3,175 - 3,629 kg)",GMC,,,,,,


In [74]:
#convert these null values above
dec_VINs.iloc[252,5]= 2.0
dec_VINs.iloc[[266,279,542,543],5]=4.0
dec_VINs.iloc[[544,545,546,547],5]=5.0
#confirm changes
dec_VINs.loc[(dec_VINs['doors'].isnull())]

Unnamed: 0,vin,bodycabtype,bodyclass,enginecylinders,displacementl,doors,drivetype,gvwr,make,trim,airbagloccurtain,airbaglocfront,airbaglocknee,enginehp,airbaglocside
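
The split above assigns the SUV/MPV null values by hand to approximate the observed distribution. A different technique with the same goal, shown here only as a sketch, is to draw the fill values at random with probabilities equal to the observed shares. The toy counts and the seed are assumptions, and a random draw will not necessarily reproduce the exact 4/4/1 split used in this notebook.

```python
import pandas as pd
import numpy as np

# Toy column: 21 observed door values followed by 9 nulls (illustrative counts).
toy = pd.DataFrame({
    'doors': [4.0] * 11 + [5.0] * 9 + [2.0] * 1 + [np.nan] * 9,
})

# Observed share of each door count among the non-null rows.
weights = toy['doors'].dropna().value_counts(normalize=True)

# Fixed seed so the draw is repeatable.
rng = np.random.default_rng(seed=0)
n_missing = toy['doors'].isnull().sum()

# Sample one value per null row, weighted by the observed shares.
draws = rng.choice(weights.index.to_numpy(), size=n_missing, p=weights.to_numpy())

toy.loc[toy['doors'].isnull(), 'doors'] = draws
print(toy['doors'].value_counts())
```

With only 9 values to fill, the manual split is easier to audit; sampling becomes more attractive when the number of nulls is large.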


##### 2.8.2.5 drivetype

In [75]:
dec_VINs['drivetype'].value_counts()

4WD/4-Wheel Drive/4x4    4245
4x2                      2392
AWD/All Wheel Drive       522
FWD/Front Wheel Drive      57
RWD/ Rear Wheel Drive      23
RWD                         2
FWD                         1
AWD                         1
Name: drivetype, dtype: int64
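
The counts above show the same drive type under two spellings (for example 'RWD' and 'RWD/ Rear Wheel Drive'). One way to consolidate them, assuming the longer labels are the ones to keep, is an exact-match replacement. The Series below is a toy stand-in and `short_to_long` is a hypothetical name.

```python
import pandas as pd

# One example of each label that appears in the value counts.
drivetype = pd.Series([
    '4WD/4-Wheel Drive/4x4', '4x2',
    'AWD/All Wheel Drive', 'AWD',
    'FWD/Front Wheel Drive', 'FWD',
    'RWD/ Rear Wheel Drive', 'RWD',
])

# Short label -> existing long label; other values are left untouched.
short_to_long = {
    'RWD': 'RWD/ Rear Wheel Drive',
    'FWD': 'FWD/Front Wheel Drive',
    'AWD': 'AWD/All Wheel Drive',
}

# A dict passed to replace() matches whole values, not substrings.
drivetype_clean = drivetype.replace(short_to_long)
print(drivetype_clean.value_counts())
```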

##### 2.8.2.6 gvwr

##### 2.8.2.7 make

##### 2.8.2.8 trim

##### 2.8.2.9 airbagloccurtain

##### 2.8.2.10 airbaglocfront

##### 2.8.2.11 airbaglocknee

##### 2.8.2.12 enginehp

##### 2.8.2.13 airbaglocside

## 2.9 Organize Layout of Dataset

## 2.10 Merge Datasets

## 2.11 Save File

In [None]:
sales_hist.shape

In [None]:
#save sales_hist dataset as a dataset named "sales_hist_clean" in CSV format
#replace the placeholder folder below with the location where the file should be stored
sales_hist.to_csv(r'Path where you want to store the exported CSV file\sales_hist_clean.csv', index = False)

## 2.12 Summary

The sales_hist dataset began with 34 columns and 8208 observations. After completing all the steps within the Data Wrangling portion of this project, the shape of the sales_hist dataset is now ### columns and #### rows. No null values exist in the dataset, and the VehicleMake feature has helped identify *Lexus* and *Toyota* as potential target features for this project.
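
A claim such as "no null values exist" can be checked in code just before saving. The sketch below uses a two-row toy frame and a temporary folder, both assumptions for illustration; in the notebook the same check would presumably run on `sales_hist`.

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for the cleaned sales_hist dataset.
toy = pd.DataFrame({'DealNumber': [1, 2], 'VehicleMake': ['Lexus', 'Toyota']})

# Fail loudly, listing the offending columns, if any nulls remain.
null_counts = toy.isnull().sum()
assert null_counts.sum() == 0, null_counts[null_counts > 0]

# Write to a temporary folder and read the file back to confirm its shape.
out_path = os.path.join(tempfile.mkdtemp(), 'sales_hist_clean.csv')
toy.to_csv(out_path, index=False)

reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```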