## Extraction of datasets and data synthesis required for Car Price Analysis

#### The following datasets were scraped from the web because they may carry useful signals for predicting price and price category. This notebook describes how they are joined to the main car listing data to produce one combined dataset.
    

<div class="alert alert-block alert-info">



| Data Set Type | Data Source | Fields of Importance | Keys for join | Data Frame Name
| --- | --- | --- | --- | ---
| Car listing data (main data set) | www.truecar.com | All | All | df_cardata
| Car Categories based on Make and Model | https://www.back4app.com | Category | Year, Make, Model  | df_category
| Car Reliability Rank by Make | https://www.usatoday.com | ReliabilityRank | Make | df_reliability
| Cost of living rank for each state | https://meric.mo.gov | CostOfLivingRank | State | df_cost
| % of market sales in US by make | https://www.goodcarbadcar.net | PercentSales | Make | df_sales 
| "Days to turn" for used cars by make | https://www.edmunds.com | AvgDaysToTurn | Make | df_turn

</div>

In [1]:
import pandas as pd
import json
import requests
import numpy as np
from fuzzywuzzy import fuzz
requests.packages.urllib3.disable_warnings() 
pd.set_option('display.max_colwidth',None)
pd.set_option('display.float_format',lambda x: '%.2f' %x)
pd.set_option('display.max_rows',300)
pd.set_option('display.max_columns',None)


# Extraction of raw data

### Data Set 1 :  Main car data listing.

Please refer to the file _WebScrapeCarData.py_ for the scraping code. Here, the CSV file created by that extraction is read in and cleaned before augmentation.

<span style="color:red"> #### Dataframe = df_cardata </span>

In [2]:
# Get main car details data

df_cardata = pd.read_csv("cardata.csv",delimiter = '|',index_col = False, encoding='cp1252')

# Upper-case the text columns so that later joins on make, model and state match reliably
for col in ['make','model','trim','pricecategory','city','state','colorexterior','colorinterior','usage']:
    df_cardata[col] = df_cardata[col].astype(str).apply(lambda x:x.upper())

# Flag listings whose price variance text mentions a discount ("... off")
df_cardata['discount'] = df_cardata['pricevariance'].astype(str).apply(lambda x:'Y' if 'off' in x.lower() else 'N')
df_cardata = df_cardata.drop(columns = ['pricevariance'])

# Remove hyphens from model names (the category data is cleaned the same way)
df_cardata['model'] = df_cardata['model'].astype(str).apply(lambda x:x.replace('-',''))

# A missing owner count is treated as 0 owners
df_cardata['owner'] = df_cardata['owner'].fillna(0).astype(int)

# "No" accident history becomes 0; str() guards against values that are not text
df_cardata['accidenthist'] = df_cardata['accidenthist'].apply(lambda x:0 if str(x).lower().strip() == 'no' else x)
df_cardata['accidenthist'] = df_cardata['accidenthist'].astype(int)

# astype(str) above turned missing price categories into the string 'NAN'
df_cardata['pricecategory'] = df_cardata['pricecategory'].replace('NAN','NOT LISTED')

# Keep one row per VIN
df_cardata = df_cardata.drop_duplicates(subset = ['vin'])

df_cardata.head()



In [34]:
# Check for nulls
df_cardata.isnull().sum().sum()

0

In [35]:
len(df_cardata)

9337
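
The two derived fields in the cleaning cell can be illustrated on a toy frame. This is a minimal sketch: the sample strings are invented for illustration and are not taken from `cardata.csv`, and the vectorised `.str` calls are an alternative to the `apply` calls used above, not the notebook's own code.

```python
import pandas as pd

# Toy listing data; the values are invented for illustration only.
toy = pd.DataFrame({
    "pricevariance": ["$500 off avg. list price", "$200 above avg. list price", None],
    "accidenthist": ["No", "1", " no "],
})

# 'Y' when the variance text mentions "off", otherwise 'N' (missing text becomes 'N').
mentions_off = toy["pricevariance"].fillna("").str.lower().str.contains("off", regex=False)
toy["discount"] = mentions_off.map({True: "Y", False: "N"})

# Treat any spelling of "no" as zero accidents, then convert to integers.
cleaned = toy["accidenthist"].astype(str).str.strip().str.lower().replace("no", "0")
toy["accidenthist"] = pd.to_numeric(cleaned).astype(int)

print(toy[["discount", "accidenthist"]])
```
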

### Data Set 2 :  Car Categories based on Make and Model

<span style="color:red"> #### Dataframe = df_category </span>

In [36]:
# get car category data

# url = 'https://parseapi.back4app.com/classes/Car_Model_List?limit=5000'
# headers = {
#     'X-Parse-Application-Id': 'hlhoNKjOvEhqzcVAJ1lxjicJLZNVv36GdbboZj3Z', # This is the fake app's application id
#     'X-Parse-Master-Key': 'SNMJJF0CZZhTPhLDIqGhTlUNV9r60M2Z5spyWfXW' # This is the fake app's readonly master key
# }

# data = json.loads(requests.get(url, headers=headers,verify=False).content.decode('utf-8')) # Here you have the data that you need
# dfcat = pd.DataFrame(data['results'])
# dfcat.drop(columns = ['objectId','createdAt','updatedAt'], inplace = True)
# for col in ['Make','Model','Category']:
#     dfcat[col] = dfcat[col].apply(lambda x:x.upper())

# dfcat['Category'] = dfcat['Category'].apply(lambda x:'SEDAN' if 'SEDAN' in x else x)
# dfcat['Category'] = dfcat['Category'].apply(lambda x:'COUPE' if 'COUPE' in x else x)
# dfcat['Category'] = dfcat['Category'].apply(lambda x:'HATCHBACK' if 'HATCHBACK' in x else x)
# dfcat['Category'] = dfcat['Category'].apply(lambda x:'WAGON' if 'WAGON' in x else x)
# dfcat['Model'] = dfcat['Model'].apply(lambda x:x.replace('-',''))
# df_category = dfcat[['Year','Make','Model','Category']]
# df_category.drop_duplicates(inplace = True)
# df_category.to_csv("car_category.csv",index = False)

df_category = pd.read_csv('car_category.csv', index_col = False)
df_category.head()

Unnamed: 0,Year,Make,Model,Category
0,2020,AUDI,Q3,SUV
1,2020,CHEVROLET,MALIBU,SEDAN
2,2020,CADILLAC,ESCALADE ESV,SUV
3,2020,CHEVROLET,CORVETTE,COUPE
4,2020,ACURA,RLX,SEDAN


In [37]:
len(df_category)

5032
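
The commented-out cleaning step collapses compound category labels into a single coarse body type. The sketch below shows the same idea as one helper; `collapse_category`, `COARSE_LABELS` and the sample labels are hypothetical names and values chosen for illustration.

```python
import pandas as pd

# Coarse labels, checked in the same order as the sequential apply calls above.
COARSE_LABELS = ["SEDAN", "COUPE", "HATCHBACK", "WAGON"]

def collapse_category(raw: str) -> str:
    """Map a compound label such as 'Sedan, Wagon' onto one coarse body type."""
    label = raw.upper()
    for coarse in COARSE_LABELS:
        if coarse in label:
            return coarse
    # Labels that contain none of the coarse names are kept unchanged.
    return label

# Invented sample labels.
raw_labels = pd.Series(["Sedan, Wagon", "Coupe, Convertible", "SUV", "Hatchback, Sedan"])
collapsed = raw_labels.apply(collapse_category)
print(collapsed.tolist())
```

Because `SEDAN` is checked first, a label that mentions both a sedan and another body type ends up as `SEDAN`, which matches the outcome of applying the four replacements one after another.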

### Data Set 3 :  Car Reliability Index

<span style="color:red"> #### Dataframe = df_reliability </span>

Read:

    Brand reliability rankings from the Consumer Reports study, as reported in https://www.usatoday.com/story/money/cars/2019/11/14/consumer-reports-auto-reliability-study-2020-vehicles/2578463001/


In [38]:
# Get reliability rankings data

df_reliability = pd.read_csv('car_reliability_rankings.csv')
df_reliability = df_reliability[['Make','ReliabilityRank']]
df_reliability.head()

Unnamed: 0,Make,ReliabilityRank
0,GENESIS,1
1,LEXUS,2
2,BUICK,3
3,PORSCHE,4
4,TOYOTA,5


### Data Set 4 :  Cost of Living by State in the US

<span style="color:red"> #### Dataframe = df_cost </span>

Read:

    Cost Of Living Index(CLI) and Local Purchasing Power Index (LPPI) are socio-economic indicators of a region 
    Data from https://meric.mo.gov/data/cost-living-data-series


In [39]:
# Get economic data

# url = 'https://meric.mo.gov/data/cost-living-data-series'
# response = requests.get(url,verify = False).content.decode('utf-8')
# df_cost = pd.read_html(response)[0] 
# df_cost['State'] = df_cost['State'].str.upper()
# df_cost = df_cost.drop(df_cost.index[-1],axis = 0)  # drop the last row of the scraped table
# states = pd.read_csv("US_States.csv")
# states['StateName'] = states['StateName'].str.upper()
# df_cost = pd.merge(df_cost,states,how = 'inner',left_on='State', right_on='StateName')
# df_cost = df_cost[['StateCode','Rank','Index','Transportation']]
# df_cost.columns = ['State',  'CostOfLivingRank','CostOfLivingIndex','TransportationIndex']
# df_cost.to_csv('statewise_economic_indicators.csv',index = False)

df_cost = pd.read_csv('statewise_economic_indicators.csv',index_col = False)
df_cost = df_cost[['State','CostOfLivingRank']]
df_cost.head()

Unnamed: 0,State,CostOfLivingRank
0,MS,1
1,KS,2
2,OK,3
3,NM,4
4,AR,5


### Data Set 5 : % of market sales in 2019/2020 by make

<span style="color:red"> #### Dataframe = df_sales </span>

    Obtained from 
        https://www.goodcarbadcar.net/2020-us-vehicle-sales-figures-by-model 
        https://www.goodcarbadcar.net/2019-us-vehicle-sales-figures-by-model


In [40]:
# Get Car Sales data

# url = 'https://www.goodcarbadcar.net/2020-us-vehicle-sales-figures-by-model/'
# response = requests.get(url,verify = False).content
# dfs = pd.read_html(response) 
# df2020 = dfs[0]
# df2020.dropna(inplace = True)
# df2020 = df2020.apply(pd.to_numeric, errors='ignore') 
# df2020['TotalSales'] = df2020.iloc[:,:7].sum(axis=1)

# url = 'https://www.goodcarbadcar.net/2019-us-vehicle-sales-figures-by-model/'
# response = requests.get(url,verify = False).content
# dfs = pd.read_html(response) 
# df2019 = dfs[1]
# df2019.dropna(inplace = True)
# df2019 = df2019.apply(pd.to_numeric, errors='ignore') 
# df2019['TotalSales'] = df2019.mean(axis=1)

# df_sales = pd.concat([df2020[['Model','TotalSales']], df2019[['Model','TotalSales']]], axis=0)

# df_sales ['Model'] = df_sales ['Model'].str.upper()
# df_sales ['Model'] = df_sales ['Model'].apply(lambda x:x.replace('ALFA ROMEO','ALFAROMEO').replace('LAND ROVER','LANDROVER').replace('ASTON MARTIN','ASTONMARTIN'))
# df_sales ['Make'] = df_sales ['Model'].apply(lambda x:x.split()[0])
# df_sales ['Make'] = df_sales ['Make'].apply(lambda x:x.replace('ALFAROMEO','ALFA ROMEO').replace('LANDROVER','LAND ROVER').replace('ASTONMARTIN','ASTON MARTIN'))
# df_sales = df_sales.groupby('Make').sum().reset_index()


# df_sales['PercentSales'] = (df_sales['TotalSales'] / df_sales['TotalSales'].sum()) * 100
# df_sales.to_csv('car_sales.csv',index = False)

df_sales = pd.read_csv('car_sales.csv',index_col = False)
df_sales = df_sales.drop(columns = ['TotalSales'])
df_sales.head()

Unnamed: 0,Make,PercentSales
0,ACURA,0.88
1,ALFA ROMEO,0.11
2,AUDI,1.2
3,BMW,1.78
4,BUICK,1.1
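
The share calculation in the commented-out scraping code reduces to: derive the make from the model name, sum sales per make, and divide by the grand total. A minimal sketch on invented figures follows; `split_make` is a hypothetical helper that uses a prefix check instead of the replace-and-split trick above, and the sales numbers are not real data.

```python
import pandas as pd

# Makes whose names contain a space and would be cut short by a plain split().
MULTI_WORD_MAKES = ["ALFA ROMEO", "LAND ROVER", "ASTON MARTIN"]

def split_make(model_name: str) -> str:
    """Return the make portion of a 'Make Model' string."""
    name = model_name.upper()
    for make in MULTI_WORD_MAKES:
        if name.startswith(make):
            return make
    return name.split()[0]

# Invented sales figures, for illustration only.
toy_sales = pd.DataFrame({
    "Model": ["Honda Civic", "Honda Accord", "Land Rover Discovery", "Alfa Romeo Giulia"],
    "TotalSales": [300, 200, 400, 100],
})
toy_sales["Make"] = toy_sales["Model"].apply(split_make)

# Sum per make, then express each make as a percentage of all sales.
by_make = toy_sales.groupby("Make", as_index=False)["TotalSales"].sum()
by_make["PercentSales"] = by_make["TotalSales"] / by_make["TotalSales"].sum() * 100
print(by_make)
```
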


### Data Set 6 : Used car - days to turn (2016 data)


<span style="color:red"> #### Dataframe = df_turn </span>

Read: 
        
    Days to Turn by Make is the average number of days vehicles were in dealer inventory before being sold during the months indicated.
    
     Obtained from : https://www.edmunds.com/industry-center/data/days-to-turn-by-make.html

In [41]:
df_turn = pd.read_csv("used_car_time_to_turn.csv")
# Average the days-to-turn columns for each make; numeric_only skips the text 'Make' column
df_turn['AvgDaysToTurn'] = df_turn.mean(axis=1, numeric_only=True)
df_turn['Make'] = df_turn['Make'].str.upper()
df_turn = df_turn[['Make','AvgDaysToTurn']]
df_turn.head()

Unnamed: 0,Make,AvgDaysToTurn
0,ACURA,65.54
1,AUDI,55.38
2,BMW,74.0
3,BUICK,82.92
4,CADILLAC,80.23
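
The row-wise average can be sketched on a small frame. The month column names (`Jan`, `Feb`, `Mar`) and the values are assumptions for illustration; the layout of `used_car_time_to_turn.csv` is not shown in this notebook.

```python
import pandas as pd

# Invented days-to-turn figures with one column per month.
toy_turn = pd.DataFrame({
    "Make": ["Acura", "Audi"],
    "Jan": [60, 50],
    "Feb": [70, 58],
    "Mar": [65, 57],
})

# Select the period columns explicitly so the text column is never averaged.
month_cols = toy_turn.columns.drop("Make")
toy_turn["AvgDaysToTurn"] = toy_turn[month_cols].mean(axis=1)
toy_turn["Make"] = toy_turn["Make"].str.upper()

print(toy_turn[["Make", "AvgDaysToTurn"]])
```

Selecting the numeric columns by name is an alternative to `numeric_only=True` and makes it explicit which columns feed the average.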


<div class="alert alert-block alert-success">
    
# Data accumulation into single dataframe

    df_cardata =  Main Car data 
    df_category = Category by make and model
    df_reliability = Brand reliability
    df_cost = Cost of living rank by state
    df_sales = Percent Sales by Make
    df_turn = Days to turn by Make
    
    
The objective is to append potentially important fields (for EDA) from the data sets to the main car data set. 
</div>

### <span style="color:Blue"> A : Car Data with Category Data </span>

In [42]:
df2 = df_cardata.copy()

In [43]:
# merge cardata with category to get car bodytype

df_cardata = df_cardata.merge(df_category,how = 'left', left_on = ['year','make','model'], right_on = ['Year','Make','Model'])
df_cardata = df_cardata.rename({'Category' : 'bodytype'},axis = 1)
df_cardata = df_cardata.drop(columns = ['Year', 'Make', 'Model'])
df_cardata.sample(10)

Unnamed: 0,vin,year,make,model,trim,pricecategory,price,mileage,city,state,colorexterior,colorinterior,accidenthist,owner,usage,discount,bodytype
876,3N1CN7AP8KL813646,2019,NISSAN,VERSA,SV SEDAN CVT,EXCELLENT PRICE,10350,38329,JACKSONVILLE,NC,SILVER,BLACK,0,1,PERSONAL,Y,SEDAN
6972,3GCUKSEC0EG129854,2014,CHEVROLET,SILVERADO 1500,LTZ WITH 2LZ CREW CAB SHORT BOX 4WD,GREAT PRICE,29667,45563,LUBBOCK,TX,WHITE,BROWN,0,1,PERSONAL,N,
8258,1FAHP2F89KG101285,2019,FORD,TAURUS,LIMITED FWD,EXCELLENT PRICE,13700,29193,JACKSONVILLE,FL,BLUE,GRAY,1,1,FLEET,Y,SEDAN
3323,1HGCR2F39HA007052,2017,HONDA,ACCORD,LX SEDAN CVT,GREAT PRICE,15296,29627,FREDERICKSBURG,VA,WHITE,BEIGE,0,2,PERSONAL,Y,SEDAN
5560,2C4RDGCG5KR694559,2019,DODGE,GRAND CARAVAN,SXT,GREAT PRICE,14984,37671,ATHENS,GA,BLACK,BLACK,0,1,PERSONAL,Y,
3449,1J8FF28WX8D547416,2008,JEEP,PATRIOT,SPORT 4WD,EXCELLENT PRICE,1888,217435,HAGERSTOWN,MD,BLACK,GRAY,1,3,PERSONAL,Y,SUV
2288,3FADP4EJ3KM118720,2019,FORD,FIESTA,SE HATCHBACK,GREAT PRICE,11998,27826,MOBILE,AL,GRAY,BLACK,0,1,PERSONAL,Y,SEDAN
6998,KM8J23A45KU881836,2019,HYUNDAI,TUCSON,SE FWD,GREAT PRICE,15366,37387,AMES,IA,BLUE,BLACK,0,1,PERSONAL,Y,SUV
6664,1D7RV1CT9BS643075,2011,RAM,1500,"SLT CREW CAB 5'7"" BOX 4WD",EXCELLENT PRICE,11900,132148,LAKE WORTH,TX,WHITE,GRAY,0,3,PERSONAL,N,
2038,KM8J3CA40HU591753,2017,HYUNDAI,TUCSON,SE AWD,EXCELLENT PRICE,15995,16558,ENGLEWOOD,CO,BLACK,BLACK,0,1,PERSONAL,Y,SUV


In [44]:
# check if all body types are populated
len(df_cardata[df_cardata['bodytype'].isnull()]) 

1573

The direct join on year, make and model leaves 1573 listings without a category.
<br>
**These are filled in by fuzzy matching on the model name within the same make.**
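
Before falling back to fuzzy matching, it can help to see which keys failed the exact join. The sketch below uses the `indicator=True` option of `DataFrame.merge`, which adds a `_merge` column; the category model names here are invented to force two mismatches and are not taken from `car_category.csv`.

```python
import pandas as pd

listings = pd.DataFrame({
    "year": [2019, 2014, 2019],
    "make": ["NISSAN", "CHEVROLET", "DODGE"],
    "model": ["VERSA", "SILVERADO 1500", "GRAND CARAVAN"],
})

# Invented category rows: two model names differ slightly from the listings.
categories = pd.DataFrame({
    "Year": [2019, 2014, 2019],
    "Make": ["NISSAN", "CHEVROLET", "DODGE"],
    "Model": ["VERSA", "SILVERADO 1500 CREW CAB", "GRAND CARAVAN PASSENGER"],
    "Category": ["SEDAN", "PICKUP", "VAN/MINIVAN"],
})

merged = listings.merge(
    categories,
    how="left",
    left_on=["year", "make", "model"],
    right_on=["Year", "Make", "Model"],
    indicator=True,  # adds a '_merge' column: 'both' or 'left_only' for a left join
)

# Rows present only on the left side found no exact category match.
unmatched = merged.loc[merged["_merge"] == "left_only", ["year", "make", "model"]]
print(unmatched)
```
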

In [45]:
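# Return the category of the closest model name (by fuzzy ratio) among models of the same make.
# There is no minimum score, so the best candidate is used even when the match is weak.
def getcat(make,model):

    candidates = df_category[df_category['Make'] == make]
    if candidates.empty:
        # no model of this make in the category data; leave the body type missing
        return None
    matches = candidates['Model'].apply(lambda x:fuzz.ratio(x,model))
    return df_category.loc[matches.idxmax(),'Category']
#getcat('FORD','RANGER')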
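
The lookup can be tried without the category file. The sketch below is a stand-in, not the notebook's method: it swaps `fuzz.ratio` for the standard library's `difflib.SequenceMatcher` so that it runs without `fuzzywuzzy`, and `similarity`, `get_category` and the three category rows are hypothetical.

```python
from difflib import SequenceMatcher

import pandas as pd

def similarity(a: str, b: str) -> int:
    """Similarity score from 0 to 100, used here in place of fuzz.ratio."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

# Invented category rows for illustration.
toy_categories = pd.DataFrame({
    "Make": ["FORD", "FORD", "RAM"],
    "Model": ["RANGER SUPERCAB", "FIESTA", "1500 CREW CAB"],
    "Category": ["PICKUP", "HATCHBACK", "PICKUP"],
})

def get_category(make: str, model: str):
    """Category of the most similar model name among rows of the same make."""
    candidates = toy_categories[toy_categories["Make"] == make]
    if candidates.empty:
        return None  # make not present in the category data
    scores = candidates["Model"].apply(lambda name: similarity(name, model))
    return toy_categories.loc[scores.idxmax(), "Category"]

print(get_category("FORD", "RANGER"))
print(get_category("TESLA", "MODEL 3"))
```

Restricting the candidates to the same make keeps a model name from being matched to a similarly named model of another brand.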

In [46]:
mask = df_cardata['bodytype'].isnull()
df_cardata.loc[mask,'bodytype'] = df_cardata.loc[mask,['make','model']].apply(lambda x:getcat(x['make'],x['model']), axis = 1)

In [47]:
# check nulls in body  type
df_cardata['bodytype'].isnull().sum()

0

In [48]:
len(df_cardata)

9337

### <span style="color:Blue"> B : Car Data with Brand Reliability Data </span>

In [49]:
# merge cardata with Brand reliability
df_cardata = df_cardata.merge(df_reliability,how = 'left',left_on='make', right_on='Make')
df_cardata.drop(columns = ['Make'],inplace = True)
df_cardata.head()

Unnamed: 0,vin,year,make,model,trim,pricecategory,price,mileage,city,state,colorexterior,colorinterior,accidenthist,owner,usage,discount,bodytype,ReliabilityRank
0,WBY1Z2C51FV286674,2015,BMW,I3,60 AH,FAIR PRICE,15991,21493,BELLEVUE,WA,SILVER,GRAY,1,2,PERSONAL,N,HATCHBACK,8.0
1,2GNAXHEV4J6220616,2018,CHEVROLET,EQUINOX,LS WITH 1LS FWD,FAIR PRICE,14899,37071,NORCO,CA,BLACK,GRAY,0,1,PERSONAL,Y,SUV,9.0
2,4S3GTAD6XK3741106,2019,SUBARU,IMPREZA,2.0I PREMIUM 5-DOOR CVT,FAIR PRICE,19220,15914,STAFFORD,TX,RED,BEIGE,0,1,PERSONAL,N,SEDAN,23.0
3,2C4RDGCG8KR551301,2019,DODGE,GRAND CARAVAN,SXT,EXCELLENT PRICE,12993,42070,OCALA,FL,GRAY,BLACK,0,1,PERSONAL,Y,VAN/MINIVAN,25.0
4,5YFEPRAEXLP047434,2020,TOYOTA,COROLLA,LE CVT,EXCELLENT PRICE,13800,18725,BOERNE,TX,WHITE,BLACK,0,1,PERSONAL,N,SEDAN,5.0


In [50]:
# check how many listings have no "ReliabilityRank" after the merge
len(df_cardata[df_cardata['ReliabilityRank'].isnull()])

73
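
Each make-level merge should leave the row count unchanged and only add columns. One way to guard that, sketched here on invented rows, is the `validate="many_to_one"` option of `DataFrame.merge`, which raises an error if a make appears more than once in the lookup table. The same frame also shows how to list the makes behind the missing values; the VINs and the unranked make are made up.

```python
import pandas as pd

cars = pd.DataFrame({
    "vin": ["A", "B", "C"],
    "make": ["BMW", "TOYOTA", "TESLA"],
})
# Toy lookup table in which one make is deliberately absent.
ranks = pd.DataFrame({
    "Make": ["BMW", "TOYOTA"],
    "ReliabilityRank": [8, 5],
})

enriched = cars.merge(
    ranks,
    how="left",
    left_on="make",
    right_on="Make",
    validate="many_to_one",  # fails loudly if 'Make' is duplicated in ranks
).drop(columns="Make")

# Makes that found no match in the lookup table.
no_rank = enriched["ReliabilityRank"].isna()
missing_makes = sorted(enriched.loc[no_rank, "make"].unique())
print(len(enriched), missing_makes)
```
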

### <span style="color:Blue"> C : Car Data with Socio Economic indicators for the area (state) where car is listed </span>

In [51]:
df_cardata = df_cardata.merge(df_cost,how = 'left',left_on=['state'], right_on=['State'])
df_cardata.drop(columns = ['State'],inplace = True)
df_cardata.head()

Unnamed: 0,vin,year,make,model,trim,pricecategory,price,mileage,city,state,colorexterior,colorinterior,accidenthist,owner,usage,discount,bodytype,ReliabilityRank,CostOfLivingRank
0,WBY1Z2C51FV286674,2015,BMW,I3,60 AH,FAIR PRICE,15991,21493,BELLEVUE,WA,SILVER,GRAY,1,2,PERSONAL,N,HATCHBACK,8.0,39
1,2GNAXHEV4J6220616,2018,CHEVROLET,EQUINOX,LS WITH 1LS FWD,FAIR PRICE,14899,37071,NORCO,CA,BLACK,GRAY,0,1,PERSONAL,Y,SUV,9.0,49
2,4S3GTAD6XK3741106,2019,SUBARU,IMPREZA,2.0I PREMIUM 5-DOOR CVT,FAIR PRICE,19220,15914,STAFFORD,TX,RED,BEIGE,0,1,PERSONAL,N,SEDAN,23.0,14
3,2C4RDGCG8KR551301,2019,DODGE,GRAND CARAVAN,SXT,EXCELLENT PRICE,12993,42070,OCALA,FL,GRAY,BLACK,0,1,PERSONAL,Y,VAN/MINIVAN,25.0,27
4,5YFEPRAEXLP047434,2020,TOYOTA,COROLLA,LE CVT,EXCELLENT PRICE,13800,18725,BOERNE,TX,WHITE,BLACK,0,1,PERSONAL,N,SEDAN,5.0,14


In [53]:
# check if there is any missing data in the new appended field "CostOfLivingRank"
len(df_cardata[df_cardata['CostOfLivingRank'].isnull()])

0


### <span style="color:Blue"> D : Car Data with Percent Sales by Make in the US </span>

In [54]:
df_cardata = df_cardata.merge(df_sales,how = 'left',left_on=['make'], right_on=['Make'])
df_cardata.drop(columns = ['Make'],inplace = True)
df_cardata.head()

Unnamed: 0,vin,year,make,model,trim,pricecategory,price,mileage,city,state,colorexterior,colorinterior,accidenthist,owner,usage,discount,bodytype,ReliabilityRank,CostOfLivingRank,PercentSales
0,WBY1Z2C51FV286674,2015,BMW,I3,60 AH,FAIR PRICE,15991,21493,BELLEVUE,WA,SILVER,GRAY,1,2,PERSONAL,N,HATCHBACK,8.0,39,1.78
1,2GNAXHEV4J6220616,2018,CHEVROLET,EQUINOX,LS WITH 1LS FWD,FAIR PRICE,14899,37071,NORCO,CA,BLACK,GRAY,0,1,PERSONAL,Y,SUV,9.0,49,11.69
2,4S3GTAD6XK3741106,2019,SUBARU,IMPREZA,2.0I PREMIUM 5-DOOR CVT,FAIR PRICE,19220,15914,STAFFORD,TX,RED,BEIGE,0,1,PERSONAL,N,SEDAN,23.0,14,4.13
3,2C4RDGCG8KR551301,2019,DODGE,GRAND CARAVAN,SXT,EXCELLENT PRICE,12993,42070,OCALA,FL,GRAY,BLACK,0,1,PERSONAL,Y,VAN/MINIVAN,25.0,27,2.13
4,5YFEPRAEXLP047434,2020,TOYOTA,COROLLA,LE CVT,EXCELLENT PRICE,13800,18725,BOERNE,TX,WHITE,BLACK,0,1,PERSONAL,N,SEDAN,5.0,14,12.19


In [55]:
# check if there is any missing data in the new appended field "PercentSales"
len(df_cardata[df_cardata['PercentSales'].isnull()])

68


### <span style="color:Blue"> E : Car Data with typical "Days to Turn" value for used cars (2016 data) </span>

In [56]:
df_cardata = df_cardata.merge(df_turn,how = 'left',left_on=['make'], right_on=['Make'])
df_cardata.drop(columns = ['Make'],inplace = True)
df_cardata.head()

Unnamed: 0,vin,year,make,model,trim,pricecategory,price,mileage,city,state,colorexterior,colorinterior,accidenthist,owner,usage,discount,bodytype,ReliabilityRank,CostOfLivingRank,PercentSales,AvgDaysToTurn
0,WBY1Z2C51FV286674,2015,BMW,I3,60 AH,FAIR PRICE,15991,21493,BELLEVUE,WA,SILVER,GRAY,1,2,PERSONAL,N,HATCHBACK,8.0,39,1.78,74.0
1,2GNAXHEV4J6220616,2018,CHEVROLET,EQUINOX,LS WITH 1LS FWD,FAIR PRICE,14899,37071,NORCO,CA,BLACK,GRAY,0,1,PERSONAL,Y,SUV,9.0,49,11.69,78.23
2,4S3GTAD6XK3741106,2019,SUBARU,IMPREZA,2.0I PREMIUM 5-DOOR CVT,FAIR PRICE,19220,15914,STAFFORD,TX,RED,BEIGE,0,1,PERSONAL,N,SEDAN,23.0,14,4.13,23.23
3,2C4RDGCG8KR551301,2019,DODGE,GRAND CARAVAN,SXT,EXCELLENT PRICE,12993,42070,OCALA,FL,GRAY,BLACK,0,1,PERSONAL,Y,VAN/MINIVAN,25.0,27,2.13,98.54
4,5YFEPRAEXLP047434,2020,TOYOTA,COROLLA,LE CVT,EXCELLENT PRICE,13800,18725,BOERNE,TX,WHITE,BLACK,0,1,PERSONAL,N,SEDAN,5.0,14,12.19,43.23


In [57]:
#check if there is any missing data in the new appended field "AvgDaysToTurn"
len(df_cardata[df_cardata['AvgDaysToTurn'].isnull()])

87

<div class="alert alert-block alert-success">
    
# COMPLETED Data Set Creation
    
</div>

### <span style="color:Blue"> Bonus data addition - Used Car Ratings </span>

Data Source : https://cars.usnews.com/


In [58]:
df_ratings = pd.read_csv('car_ratings.csv', index_col = False)
df_ratings = df_ratings.drop_duplicates(subset = ['MakeModel'])
df_ratings['AvgMPG'] = (df_ratings['MpgCity']  + df_ratings['MpgHwy']) / 2
# Flag luxury, sports and hybrid classes; na=False treats a missing CarClass as 'N'
is_special = df_ratings['CarClass'].str.contains(r'LUXURY|SPORTS|HYBRID', na=False)
df_ratings['LuxurySportsOrHybrid'] = np.where(is_special, 'Y', 'N')
df_ratings = df_ratings[['MakeModel','ReviewScore','AvgMPG','LuxurySportsOrHybrid']]
df_ratings.sample(10)

Unnamed: 0,MakeModel,ReviewScore,AvgMPG,LuxurySportsOrHybrid
238,DODGE DURANGO,8.0,22.5,N
190,JEEP WRANGLER,7.1,21.0,N
240,SUBARU ASCENT,8.0,24.0,N
146,JAGUAR F-TYPE,7.9,24.0,Y
232,HYUNDAI SANTA FE,8.5,25.5,N
225,JAGUAR E-PACE,7.7,24.5,Y
297,LEXUS NX HYBRID,7.9,31.5,Y
20,SUBARU IMPREZA,7.7,33.0,N
24,TOYOTA PRIUS,7.6,52.0,N
252,DODGE JOURNEY,5.5,22.0,N


In [59]:
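# Function to do fuzzy matching of make and model combination.
# Returns the index label of the best match in df_ratings,
# or -1 when no candidate scores above 80 or the comparison fails.
def getclass(makemodel):

    try:
        matches = df_ratings['MakeModel'].apply(lambda x:fuzz.ratio(x,makemodel))
        if matches.max() > 80:
            return matches.idxmax()
        else:
            return -1
    except Exception:
        return -1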
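
The thresholded match and the index-based merge that follows it can be sketched end to end. As an assumption for portability, `difflib.SequenceMatcher` stands in for `fuzz.ratio`; `best_index`, the index labels and the sample rows are hypothetical.

```python
from difflib import SequenceMatcher

import pandas as pd

# Toy ratings table with non-contiguous index labels, as left by drop_duplicates.
ratings = pd.DataFrame(
    {"MakeModel": ["JAGUAR F-TYPE", "SUBARU IMPREZA"], "ReviewScore": [7.9, 7.7]},
    index=[146, 20],
)

def best_index(makemodel: str, threshold: float = 80):
    """Index label of the closest MakeModel, or -1 if nothing beats the threshold."""
    scores = ratings["MakeModel"].apply(
        lambda candidate: 100 * SequenceMatcher(None, candidate, makemodel).ratio()
    )
    return scores.idxmax() if scores.max() > threshold else -1

# Listing names have hyphens removed, so 'F-TYPE' appears as 'FTYPE'.
cars = pd.DataFrame({"makemodel": ["JAGUAR FTYPE", "SUBARU IMPREZA", "RAM 1500"]})
cars["matchindex"] = cars["makemodel"].apply(best_index)

# -1 is not an index label in ratings, so unmatched rows get missing values.
joined = cars.merge(ratings, how="left", left_on="matchindex", right_index=True)
print(joined[["makemodel", "matchindex", "ReviewScore"]])
```

The threshold trades coverage for precision: a higher value leaves more listings without a rating, while a lower one risks attaching the rating of a different model.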

In [60]:
df_cardata['makemodel'] = df_cardata['make'].astype(str) + ' ' + df_cardata['model'].astype(str)
df_cardata['matchindex'] = df_cardata['makemodel'].apply(getclass)
df_cardata = df_cardata.merge(df_ratings, how = 'left', left_on = 'matchindex', right_index = True)
df_cardata = df_cardata.drop(columns = ['makemodel','matchindex','MakeModel'])

In [61]:
len(df_cardata)

9337

In [62]:
df_cardata.isnull().sum()

vin                       0
year                      0
make                      0
model                     0
trim                      0
pricecategory             0
price                     0
mileage                   0
city                      0
state                     0
colorexterior             0
colorinterior             0
accidenthist              0
owner                     0
usage                     0
discount                  0
bodytype                  0
ReliabilityRank          73
CostOfLivingRank          0
PercentSales             68
AvgDaysToTurn            87
ReviewScore             668
AvgMPG                  674
LuxurySportsOrHybrid    668
dtype: int64
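
Raw null counts are easier to judge as a share of rows; for example, 668 missing review scores out of 9337 listings is roughly 7.2%. The sketch below builds such a report on an invented four-row frame; the column values are not real data.

```python
import pandas as pd

# Invented frame with a few gaps, for illustration only.
toy = pd.DataFrame({
    "price": [10350, 29667, 13700, 15296],
    "ReliabilityRank": [8.0, None, 5.0, None],
    "ReviewScore": [7.9, 7.7, None, 8.0],
})

missing = toy.isnull().sum()
report = pd.DataFrame({
    "missing": missing,
    "percent": missing / len(toy) * 100,
})

# Keep only columns that actually have gaps, largest first.
report = report[report["missing"] > 0].sort_values("missing", ascending=False)
print(report)
```
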

In [63]:
df_cardata.to_csv('cardata_final.csv',index = False)