# Data Wrangling

## 2.1 Contents
    2.2 Introduction
    2.3 Imports
    2.4 Objectives
    2.5 Load Historical Sales Data
    2.6 Data Exploration
        2.6.1 Handling Null Values
        2.6.2 Removing Unnecessary Features
        2.6.3 Categorical Features

## 2.2 Introduction
This project is .....

## 2.3 Imports

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

## 2.4 Objectives
- Do I have the data I need to tackle the desired question?
- Have I identified the required target value?
- Do I have potentially useful features?
- Do I have any fundamental issues with the data?

## 2.5 Load Historical Sales Data

In [3]:
#csv file in subdirectory 'raw'
sales_hist = pd.read_csv('../data/raw/HistoricalSalesData.csv')

In [4]:
sales_hist.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8208 entries, 0 to 8207
Data columns (total 34 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   DealNumber                  8208 non-null   int64  
 1   ContractDate                8208 non-null   object 
 2   DeliveryDate                8208 non-null   object 
 3   DealStatus                  8208 non-null   object 
 4   Comments                    22 non-null     object 
 5   InventoryType               8199 non-null   object 
 6   StockNumber                 8208 non-null   object 
 7   VIN                         8208 non-null   object 
 8   VehicleMake                 8208 non-null   object 
 9   VehicleModel                8208 non-null   object 
 10  VehicleModelYear            8208 non-null   int64  
 11  VehicleSalePrice            8207 non-null   float64
 12  TotalGrossProfit            8208 non-null   float64
 13  BackEndGrossProfit          8202 

The dataset has 34 features, so feature reduction will be needed to identify the useful features and the target feature(s). Features with no nulls have 8208 entries. However, features such as Comments, SalesPerson2ID, Trade2_VIN, Trade2_Year, and Trade2_Model have far fewer non-null entries. These features will be explored later in the data cleaning portion of this notebook.

Additionally, the dataset appears to have three different data types. Further exploration will be necessary to understand why features holding whole numbers, such as ContractTerm, Trade1_Year, and Trade2_Year, are float64 data types and not int64.
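
One common cause of float64 in a column of whole numbers is missing data: pandas stores missing values as NaN, which is a floating-point value, so an integer column containing nulls is read as float64. The sketch below illustrates this on made-up values; it is an assumption that this explains the columns in this dataset.

```python
import numpy as np
import pandas as pd

# Toy values only; not taken from the sales data.
term = pd.Series([60, 36, np.nan, 48], name="ContractTerm")
print(term.dtype)  # float64, because NaN is a floating-point value

# Once the missing entries are filled, an integer cast becomes possible.
term_int = term.fillna(1).astype("int64")
print(term_int.dtype)  # int64
```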

In [5]:
sales_hist.head(20)

Unnamed: 0,DealNumber,ContractDate,DeliveryDate,DealStatus,Comments,InventoryType,StockNumber,VIN,VehicleMake,VehicleModel,...,SalesPerson2ID,Trade1_StockNumber,Trade1_VIN,Trade1_Year,Trade1_Make,Trade1_Model,Trade2_VIN,Trade2_Year,Trade2_Make,Trade2_Model
0,10029,1/12/11 0:00,1/12/11 0:00,F,,U,K175A,2T2HK31U49C118454,Lexus,RX 350,...,,K175B,YV1CM59H331013308,2003.0,Volvo,XC90,,,,
1,10035,1/10/11 0:00,1/10/11 0:00,F,,U,K190A,JTHCE96S580017706,Lexus,GS 350,...,,K190B,1FTWW31P95EB23344,2005.0,Ford,F-350,,,,
2,10036,1/11/11 0:00,1/11/11 0:00,F,,N,K205,JTHDL5EF0B5003231,Lexus,LS 460,...,,K205A,JTHBL46F385052674,2008.0,Lexus,LS 460,,,,
3,10037,1/14/11 0:00,1/14/11 0:00,F,,N,K210,JTJBK1BA2B2013626,Lexus,RX 350,...,,K210A,1HGCD5666SA119678,1995.0,Honda,Accord,,,,
4,10057,1/14/11 0:00,1/14/11 0:00,F,,U,L1112,JTJHK31U082048420,Lexus,RX 350,...,,L1112A,1FMDU34X1VUC98892,1997.0,Ford,Explorer,,,,
5,10059,1/14/11 0:00,1/14/11 0:00,F,,U,L1090A,1NXBR32E55Z545986,Toyota,Corolla,...,,,,,,,,,,
6,10060,1/14/11 0:00,1/14/11 0:00,F,,N,K130,2T2BK1BA7BC087463,Lexus,RX 350,...,,K130A,2T2HK31U08C049602,2008.0,Lexus,RX 350,,,,
7,10067,1/17/11 0:00,1/17/11 0:00,F,,U,K205A,JTHBL46F385052674,Lexus,LS 460,...,,K205B,JTHCE96S570011550,2007.0,Lexus,GS 350,,,,
8,10068,1/17/11 0:00,1/17/11 0:00,F,,U,L1116,JTJBT20X780158790,Lexus,GX 470,...,,L1116A,WA1EY74L57D059909,2007.0,Audi,Q7,,,,
9,10072,1/18/11 0:00,1/18/11 0:00,F,,U,J635C,2T2HA31U36C108786,Lexus,RX 330,...,,J635D,3N1CB51D94L836617,2004.0,Nissan,Sentra,,,,


A few observations from the first rows:

- *DealNumber*: need to confirm that it is unique for each entry. If so, it may be a good way to identify each vehicle observation.
- *ContractDate*: the formatting will need to be cleaned up.
- *DealStatus*: if there are no entries other than the letter 'F', this column should be removed.
- *Comments*: as mentioned before, most entries are null. This column may be removed if the 22 recorded entries show no significance to the objective of this project.
- *InventoryType*: it has null values, as seen prior, but whether a vehicle is new (N) or used (U) is important. Another feature in the dataset, potentially *VehicleModelYear*, may help fill in the null values.
- *StockNumber*: this could be another way to identify each vehicle observation. However, from my experience in the industry, stock numbers are typically recycled once the vehicle they are assigned to is sold. If this is the case, this column will be removed.
- *VIN*: it has no null observations but will have duplicates, since it is a priority for a dealership to receive and resell vehicles it previously sold or leased in order to maintain high market share. The VINs for all entries in this column have been decoded into a separate dataset that breaks out each vehicle's detailed makeup. That decoded dataset will be merged with this one within this notebook.
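
As a possible starting point for cleaning the date columns, the sketch below parses strings laid out like the *ContractDate* values in the `head()` output. The format string is an assumption based on those visible rows (month/day/two-digit year, then hour:minute) and would need to be checked against the full column.

```python
import pandas as pd

# Sample strings in the same layout as the head() output above.
dates = pd.Series(["1/12/11 0:00", "1/10/11 0:00", "10/20/15 0:00"], name="ContractDate")

# Assumed format: month/day/two-digit year followed by hour:minute.
parsed = pd.to_datetime(dates, format="%m/%d/%y %H:%M")
print(parsed.dt.date.tolist())
```

Passing an explicit `format` makes the parse fail loudly on rows that do not match, which can help surface inconsistently formatted entries.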

Will need to decide if the VINs of the traded vehicles should be decoded as well. If so, it appears there are 3235 *Trade1_VIN* entries and 84 additional *Trade2_VIN* entries to decode.

## 2.6 Data Exploration

#### 2.6.1 Handling Null Values

In [6]:
#examine # of missing values by column and sort them high to low
missing = pd.concat([sales_hist.isnull().sum(), 100 * sales_hist.isnull().mean()], axis=1)
missing.columns = ['count','%']
missing.sort_values(by = 'count', ascending = False)

Unnamed: 0,count,%
Comments,8186,99.731969
Trade2_Model,8124,98.976608
Trade2_Make,8124,98.976608
Trade2_Year,8124,98.976608
Trade2_VIN,8124,98.976608
SalesPerson2ID,7760,94.54191
BuyerBirthDate,5943,72.404971
Trade1_StockNumber,5595,68.165205
Trade1_VIN,4973,60.587232
Trade1_Model,4971,60.562865


Since *Comments* has the highest number of null values, let's examine the 22 non-null entries and decide whether this feature should be removed.

In [7]:
#exploring the Comments feature entries that are non-null
sales_hist[sales_hist.Comments.notnull()]

Unnamed: 0,DealNumber,ContractDate,DeliveryDate,DealStatus,Comments,InventoryType,StockNumber,VIN,VehicleMake,VehicleModel,...,SalesPerson2ID,Trade1_StockNumber,Trade1_VIN,Trade1_Year,Trade1_Make,Trade1_Model,Trade2_VIN,Trade2_Year,Trade2_Make,Trade2_Model
4287,20113,10/20/15 0:00,10/20/15 0:00,F,<div id='DMSmatchingComment'>Excluded from mat...,U,L1568,1FBSS3BL1EDA99461,Ford,E-350 Super Duty,...,,,,,,,,,,
4406,20557,1/31/16 0:00,1/31/16 0:00,F,<div id='DMSmatchingComment'>Excluded from mat...,N,S257,JTJJM7FX4G5132520,Lexus,GX 460,...,,,,,,,,,,
4448,20672,1/31/16 0:00,1/31/16 0:00,F,<div id='DMSmatchingComment'>Excluded from mat...,N,S255,JTJBM7FX7G5132886,Lexus,GX 460,...,,,,,,,,,,
4451,20677,1/23/16 0:00,1/23/16 0:00,F,<div id='DMSmatchingComment'>Excluded from mat...,N,R871,2T2BK1BAXFC337073,Lexus,RX 350,...,,,5UXFA53503LV87734,2003.0,BMW,X5,,,,
4453,20679,1/25/16 0:00,1/25/16 0:00,F,<div id='DMSmatchingComment'>Excluded from mat...,N,S146,JTJBARBZ0G2051273,Lexus,NX 200t,...,,,,,,,,,,
4454,20681,1/29/16 0:00,1/29/16 0:00,F,<div id='DMSmatchingComment'>Excluded from mat...,N,S246,2T2BZMCAXGC013733,Lexus,RX 350,...,,,JF2GPBKC7EH297629,2014.0,Subaru,XV Crosstrek Hybrid,,,,
4457,20692,1/30/16 0:00,1/30/16 0:00,F,<div id='DMSmatchingComment'>Excluded from mat...,U,L1561,JTHBK1GG3E2144678,Lexus,ES 350,...,,,3GTU2WEC7FG116897,2015.0,GMC,Sierra 1500,,,,
4458,20695,1/28/16 0:00,1/28/16 0:00,F,<div id='DMSmatchingComment'>Excluded from mat...,U,L1514,2T2BK1BA3EC231160,Lexus,RX 350,...,,,JNKCV51E63M328394,2003.0,Infiniti,G35,,,,
4459,20712,1/30/16 0:00,1/30/16 0:00,F,<div id='DMSmatchingComment'>Excluded from mat...,N,S253,2T2BZMCA7GC012345,Lexus,RX 350,...,,,2T2HK31U99C101620,2009.0,Lexus,RX 350,,,,
4462,20724,1/30/16 0:00,1/30/16 0:00,F,<div id='DMSmatchingComment'>Excluded from mat...,N,S249,JTJBZMCA8G2003167,Lexus,RX 350,...,,,,,,,,,,


The 22 entries that are non-null for *Comments* appear to have no importance for this project and therefore this feature will be removed from the dataset in the next subsection.

Next are the 'Trade2_' features. These are most likely key features for understanding the broader marketplace's preferences, since they describe vehicles that were traded in. So, they will be left alone at this time.

*SalesPerson2ID* and *SalesPersonID* both refer to the sales team and hold no significance to the purpose of the project. Therefore, these two columns will be removed in the next subsection as well.

Now, *BuyerBirthDate* is missing over 70% of its entries. However, this feature could be converted to the customer's age, providing potential insight into the type of customer that buys within each vehicle segment. For now, it will be left alone.
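
If the age conversion is pursued later, it could look roughly like the sketch below. The rows are made up, the column `BuyerAge` is a hypothetical name, and the date format of *BuyerBirthDate* in the real data is not shown in this notebook, so the parsing step here is an assumption.

```python
import numpy as np
import pandas as pd

# Toy rows; the date formats here are assumptions, not the dataset's actual format.
buyers = pd.DataFrame({
    "ContractDate": pd.to_datetime(["2011-01-12", "2016-01-30"]),
    "BuyerBirthDate": pd.to_datetime(["1980-06-15", None]),
})

# Approximate age in whole years at the contract date; missing birth dates stay NaN.
elapsed_days = (buyers["ContractDate"] - buyers["BuyerBirthDate"]).dt.days
buyers["BuyerAge"] = np.floor(elapsed_days / 365.25)
print(buyers)
```

Dividing by 365.25 is an approximation that can be off by a day around birthdays; that is likely acceptable for grouping customers into age bands.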

Next are the 'Trade1_' features. As with the 'Trade2_' features, these are most likely key features for understanding the broader marketplace's preferences, since they describe vehicles that were traded in. So, they will be left alone at this time.

Then there are *APR*, *ContractTerm*, and *MonthlyPayment*, which all refer to the customer's financing. A customer's financing type could prove important to understanding how vehicle pricing affects customers and/or whether customers have a finance preference. Out of these three features, *ContractTerm* provides the most relevant information. Additionally, this feature could be used to verify whether a duplicate entry in the *VIN* column is a leased vehicle returning.
Therefore, the *APR* and *MonthlyPayment* features will be removed in the next subsection, and the *ContractTerm* feature will be examined more closely to see how to handle its null values.

In [8]:
#examine the ContractTerm feature
sales_hist.ContractTerm.value_counts()

1.0     2874
60.0    1453
36.0     986
48.0     758
72.0     130
24.0      89
39.0      47
66.0      38
42.0      26
75.0      19
63.0      17
51.0      16
27.0      12
84.0       9
54.0       8
77.0       8
33.0       6
30.0       3
45.0       3
69.0       2
Name: ContractTerm, dtype: int64

In [9]:
#reference the count of null values in feature
sales_hist.ContractTerm.isnull().sum()

1704

There are 1704 null values and 2874 unusual '1.0' entries for this feature. After speaking to the source of this dataset, I was informed that both blanks and 1.0 entries were how cash buyers were recorded. Therefore, all null values will be converted to 1.0 for consistency. It was also confirmed that all entries for this feature are recorded in months.

In [10]:
#replace all null values in ContractTerm with 1.0
sales_hist['ContractTerm'] = sales_hist['ContractTerm'].fillna(value = 1.0)
#confirm feature updated
sales_hist['ContractTerm'].value_counts()

1.0     4578
60.0    1453
36.0     986
48.0     758
72.0     130
24.0      89
39.0      47
66.0      38
42.0      26
75.0      19
63.0      17
51.0      16
27.0      12
84.0       9
54.0       8
77.0       8
33.0       6
30.0       3
45.0       3
69.0       2
Name: ContractTerm, dtype: int64

#### 2.6.2 Removing Unnecessary Features

Although not all datasets need a feature that serves as a unique identifier, this dataset appears to have multiple candidates. *BuyerID*, *DealNumber*, *StockNumber*, and *Trade1_StockNumber* can all be considered unique identifiers. It is already known that *Trade1_StockNumber* is missing 68% of its entries and serves no other purpose, so this feature should be removed. Additionally, we do not need multiple key features, so two of the remaining three can be removed as well. Let's examine which of these features has the most unique entries.

In [11]:
#exploring unique entries of BuyerID
sales_hist.BuyerID.nunique()

5656

In [12]:
#exploring unique entries of DealNumber
sales_hist.DealNumber.nunique()

8207

In [13]:
#exploring unique entries of StockNumber
sales_hist.StockNumber.nunique()

8183

*BuyerID* has 5656 unique entries out of 8208. *DealNumber* has 8207, so exactly one value is duplicated. *StockNumber* has 8183 unique entries out of 8208. Therefore, the best feature to use as a unique key identifier is *DealNumber*. One of the two rows sharing the duplicated *DealNumber* needs to be given another unique value, and the *BuyerID* and *StockNumber* features can be removed.
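
The three separate `nunique()` calls could also be combined into a single comparison. The sketch below does this on a toy frame with made-up values; only the column names come from the dataset.

```python
import pandas as pd

# Toy frame with the three candidate key columns; values are made up.
toy = pd.DataFrame({
    "BuyerID": [1, 1, 2, 3],
    "DealNumber": [100, 101, 102, 102],
    "StockNumber": ["K1", "K2", "K2", "K2"],
})

candidates = ["BuyerID", "DealNumber", "StockNumber"]

# Number of distinct values per candidate column, highest first.
unique_counts = toy[candidates].nunique().sort_values(ascending=False)
print(unique_counts)

# A column can only serve as a key if every value is distinct.
has_duplicates = {col: not toy[col].is_unique for col in candidates}
print(has_duplicates)
```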

In [14]:
#identifying the duplicate entries in DealNumber
sales_hist.DealNumber.value_counts()

20113      2
6141       1
7321       1
3386       1
2004128    1
          ..
19060      1
21111      1
8825       1
12923      1
16384      1
Name: DealNumber, Length: 8207, dtype: int64

In [15]:
sales_hist.DealNumber.sort_values(ascending = False)

4272    2007906
4271    2007905
4270    2007902
4269    2007896
4268    2007890
         ...   
2660        193
1357        149
995         136
602         120
5248          3
Name: DealNumber, Length: 8208, dtype: int64

DealNumber 20113 is the duplicated entry, and the sorted output shows that 2007904 is skipped. Therefore, one of the duplicate 20113 entries will be converted to this unused number.

#not sure how to change only one of the duplicate values.
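
One possible way to change only one of the duplicated values is `Series.duplicated(keep='first')`, which marks every occurrence after the first. The sketch below applies it to a toy frame (the `VIN` values are placeholders); it relies on there being exactly one duplicated row, since every marked row receives the same replacement number.

```python
import pandas as pd

# Toy frame reproducing the situation: 20113 appears twice, 2007904 is unused.
deals = pd.DataFrame({
    "DealNumber": [20112, 20113, 20113, 2007905],
    "VIN": ["A", "B", "C", "D"],  # placeholder values
})

# True only for the second (and any later) occurrence of a DealNumber.
dup_mask = deals["DealNumber"].duplicated(keep="first")

# Overwrite just the marked row with the unused number.
deals.loc[dup_mask, "DealNumber"] = 2007904
print(deals["DealNumber"].is_unique)
```

Using `keep='last'` instead would reassign the first occurrence and leave the second untouched.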

In [16]:
#examining DealStatus
sales_hist.DealStatus.value_counts()

F    8208
Name: DealStatus, dtype: int64

The entries are all the same, so this feature can be removed as well. 

In [17]:
#DealStatus feature removed from dataset
del sales_hist['DealStatus']
#confirm feature removed
sales_hist.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8208 entries, 0 to 8207
Data columns (total 33 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   DealNumber                  8208 non-null   int64  
 1   ContractDate                8208 non-null   object 
 2   DeliveryDate                8208 non-null   object 
 3   Comments                    22 non-null     object 
 4   InventoryType               8199 non-null   object 
 5   StockNumber                 8208 non-null   object 
 6   VIN                         8208 non-null   object 
 7   VehicleMake                 8208 non-null   object 
 8   VehicleModel                8208 non-null   object 
 9   VehicleModelYear            8208 non-null   int64  
 10  VehicleSalePrice            8207 non-null   float64
 11  TotalGrossProfit            8208 non-null   float64
 12  BackEndGrossProfit          8202 non-null   float64
 13  FrontEndGrossProfit         7508 

In [18]:
#Comments feature removed from dataset
del sales_hist['Comments']
#confirm feature removed
sales_hist.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8208 entries, 0 to 8207
Data columns (total 32 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   DealNumber                  8208 non-null   int64  
 1   ContractDate                8208 non-null   object 
 2   DeliveryDate                8208 non-null   object 
 3   InventoryType               8199 non-null   object 
 4   StockNumber                 8208 non-null   object 
 5   VIN                         8208 non-null   object 
 6   VehicleMake                 8208 non-null   object 
 7   VehicleModel                8208 non-null   object 
 8   VehicleModelYear            8208 non-null   int64  
 9   VehicleSalePrice            8207 non-null   float64
 10  TotalGrossProfit            8208 non-null   float64
 11  BackEndGrossProfit          8202 non-null   float64
 12  FrontEndGrossProfit         7508 non-null   float64
 13  APR                         3509 

In [19]:
#SalesPersonID and SalesPerson2ID features removed from dataset
sales_hist.drop(['SalesPersonID', 'SalesPerson2ID'], axis=1, inplace=True)
#confirm features removed
sales_hist.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8208 entries, 0 to 8207
Data columns (total 30 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   DealNumber                  8208 non-null   int64  
 1   ContractDate                8208 non-null   object 
 2   DeliveryDate                8208 non-null   object 
 3   InventoryType               8199 non-null   object 
 4   StockNumber                 8208 non-null   object 
 5   VIN                         8208 non-null   object 
 6   VehicleMake                 8208 non-null   object 
 7   VehicleModel                8208 non-null   object 
 8   VehicleModelYear            8208 non-null   int64  
 9   VehicleSalePrice            8207 non-null   float64
 10  TotalGrossProfit            8208 non-null   float64
 11  BackEndGrossProfit          8202 non-null   float64
 12  FrontEndGrossProfit         7508 non-null   float64
 13  APR                         3509 

In [20]:
#APR and MonthlyPayment features removed from dataset
sales_hist.drop(['APR', 'MonthlyPayment'], axis=1, inplace=True)
#confirm features removed
sales_hist.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8208 entries, 0 to 8207
Data columns (total 28 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   DealNumber                  8208 non-null   int64  
 1   ContractDate                8208 non-null   object 
 2   DeliveryDate                8208 non-null   object 
 3   InventoryType               8199 non-null   object 
 4   StockNumber                 8208 non-null   object 
 5   VIN                         8208 non-null   object 
 6   VehicleMake                 8208 non-null   object 
 7   VehicleModel                8208 non-null   object 
 8   VehicleModelYear            8208 non-null   int64  
 9   VehicleSalePrice            8207 non-null   float64
 10  TotalGrossProfit            8208 non-null   float64
 11  BackEndGrossProfit          8202 non-null   float64
 12  FrontEndGrossProfit         7508 non-null   float64
 13  ContractTerm                8208 
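
The six removals in this subsection could also be expressed as a single `drop` call. The sketch below shows the idea on a toy frame with made-up values that carries only the columns involved; it returns a new frame rather than modifying in place.

```python
import pandas as pd

# Toy frame carrying only the columns named in this subsection; values are made up.
toy = pd.DataFrame({
    "DealNumber": [1, 2],
    "DealStatus": ["F", "F"],
    "Comments": [None, None],
    "SalesPersonID": [7, 8],
    "SalesPerson2ID": [None, None],
    "APR": [2.9, None],
    "MonthlyPayment": [450.0, None],
    "ContractTerm": [60, 1],
})

to_drop = ["DealStatus", "Comments", "SalesPersonID", "SalesPerson2ID", "APR", "MonthlyPayment"]

# drop(columns=...) removes all listed columns at once.
trimmed = toy.drop(columns=to_drop)
print(trimmed.columns.tolist())
```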

#### 2.6.3 Categorical Features

In [21]:
#examining InventoryType
sales_hist.InventoryType.value_counts()

N    4619
U    3580
Name: InventoryType, dtype: int64

In [64]:
#examining VIN
sales_hist.VIN.value_counts()

2T2HA31U35C060138    4
JTHBK1EG4A2356230    3
JTJBK1BA3A2400796    3
2T2BK1BA1BC094862    3
JTHBJ46G872024450    3
                    ..
2T2BK1BA5EC240216    1
JTJBT20X360098584    1
JTJHT00W554002046    1
JTHBK1GG1D2038860    1
2T2BK1BA2FC330568    1
Name: VIN, Length: 7044, dtype: int64

In [74]:
#examining exact count of duplicates in VIN
sh = sales_hist.VIN.value_counts()
sh.value_counts()

1    5979
2     967
3      97
4       1
Name: VIN, dtype: int64

There are no null values, but the 8208 entries contain only 7044 distinct VINs. Of these, 5979 appear once, 967 appear twice, 97 appear three times, and one appears four times. We can see from the first output that the VIN occurring four times is 2T2HA31U35C060138. It is suspected that these duplicated VINs belong to vehicles that are a part of lease programs.
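
To check the lease suspicion, the repeated VINs could be pulled out and ordered by date so that each vehicle's sales read as a timeline. The sketch below uses a toy frame with placeholder VINs and made-up dates, and it assumes *ContractDate* has already been converted to a datetime type.

```python
import pandas as pd

# Toy data: placeholder VINs and made-up contract dates.
toy = pd.DataFrame({
    "VIN": ["A", "B", "A", "C", "A", "B"],
    "ContractDate": pd.to_datetime([
        "2011-01-12", "2011-02-01", "2014-01-15",
        "2012-05-05", "2017-01-20", "2014-02-03",
    ]),
})

# How many times each VIN was sold, then how many VINs fall in each bucket.
sales_per_vin = toy["VIN"].value_counts()
distribution = sales_per_vin.value_counts().sort_index()
print(distribution)

# keep=False marks every row of a repeated VIN, not just the later ones.
repeat_sales = toy[toy["VIN"].duplicated(keep=False)].sort_values(["VIN", "ContractDate"])
print(repeat_sales)
```

Gaps between consecutive dates for the same VIN that match common lease terms in *ContractTerm* would support the lease explanation.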

In [83]:
#examining VehicleMake
sales_hist.VehicleMake.value_counts()

Lexus            6825
Toyota            209
Honda              96
Mercedes-Benz      87
Ford               84
Chevrolet          78
BMW                76
Cadillac           72
Jeep               71
Nissan             63
Chrysler           60
Dodge              46
LINCOLN            42
Buick              42
GMC                35
Volkswagen         28
Audi               28
Pontiac            27
Infiniti           27
Volvo              24
Acura              24
Mazda              22
Hyundai            21
Oldsmobile         18
Jaguar             18
Subaru             15
Mercury            10
Saturn             10
Land Rover          8
Porsche             7
Kia                 6
Saab                5
MINI                5
HUMMER              3
Mitsubishi          3
Plymouth            2
Isuzu               2
LEXUS               1
DODG                1
Maserati            1
CADILLAC            1
FERRARI             1
VOLKSWAGEN          1
TOYOTA              1
PORSCHE             1
BUICK     

Lexus has by far the highest number of observations, with Toyota a distant second and Honda, Mercedes-Benz, and Ford, surprisingly, competing for the next places. An initial assumption is that specific vehicle features lead to this pattern. Grouping the VehicleMake feature into broader categories of luxury, economy, and potentially one more may prove insightful as well. Initially, however, it is clear that Lexus, Toyota, and an 'other' group will be necessary to explore for this business problem.
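
The output above also lists some makes under more than one spelling (for example `Lexus` and `LEXUS`), so any grouping would likely need the labels normalized first. The sketch below shows one way to do both steps on a few sample labels. The `segment_by_make` mapping is hypothetical and for illustration only; the actual segment assignments are not defined anywhere in the data.

```python
import pandas as pd

# A few labels as they appear in the value_counts output.
makes = pd.Series(
    ["Lexus", "LEXUS", "Toyota", "TOYOTA", "Honda", "FERRARI", "Saab"],
    name="VehicleMake",
)

# Hypothetical grouping for illustration only; not defined by the dataset.
segment_by_make = {
    "lexus": "luxury",
    "ferrari": "luxury",
    "toyota": "economy",
    "honda": "economy",
}

# Lower-casing collapses spellings that differ only by case.
normalized = makes.str.strip().str.lower()

# Makes missing from the mapping fall into a catch-all group.
segments = normalized.map(segment_by_make).fillna("other")
print(pd.concat([makes, segments.rename("Segment")], axis=1))
```

Case normalization alone would not repair abbreviations such as `DODG`; those would need an explicit correction before mapping.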