# **Used Car Price Predictor**

**Intro:**

 This summer my college roommate and I decided it would be fun to buy a cheap, reliable car on Craigslist and road trip from Madison, WI to Seattle, WA. We knew very little about the used-car market, but after some research on Google and a few test drives we settled on a 1999 Toyota Camry for $2,700. 

 Our trip was a great success, but after we bought the car I realized it would have been a prime opportunity to use my data analytics and machine learning skills to check that we were getting a good deal. The goal of this project is to build a model that can accurately predict the price of a car given only the details one could find in a Craigslist ad.

**Data Source:**

 The data used in this project comes from Kaggle user Austin Reese, who aggregated every Craigslist 'cars & trucks' listing in the United States from 2021. It can be found here: https://www.kaggle.com/datasets/austinreese/craigslist-carstrucks-data  

**Objectives:**

1) Clean Data

2) Create Model to predict price of cars based on available data

3) Use model to find cars with large discrepancies between actual and predicted price


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import (BaggingRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, RandomForestRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

In [None]:
# Connect to drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# load data
cars = pd.read_csv("/content/drive/MyDrive/Kaggle/vehicles.csv")

## Data Cleaning

In [None]:
cars.head(4)

Unnamed: 0,id,url,region,region_url,price,year,manufacturer,model,condition,cylinders,...,size,type,paint_color,image_url,description,county,state,lat,long,posting_date
0,7222695916,https://prescott.craigslist.org/cto/d/prescott...,prescott,https://prescott.craigslist.org,6000,,,,,,...,,,,,,,az,,,
1,7218891961,https://fayar.craigslist.org/ctd/d/bentonville...,fayetteville,https://fayar.craigslist.org,11900,,,,,,...,,,,,,,ar,,,
2,7221797935,https://keys.craigslist.org/cto/d/summerland-k...,florida keys,https://keys.craigslist.org,21000,,,,,,...,,,,,,,fl,,,
3,7222270760,https://worcester.craigslist.org/cto/d/west-br...,worcester / central MA,https://worcester.craigslist.org,1500,,,,,,...,,,,,,,ma,,,


In [None]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 26 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   url           426880 non-null  object 
 2   region        426880 non-null  object 
 3   region_url    426880 non-null  object 
 4   price         426880 non-null  int64  
 5   year          425675 non-null  float64
 6   manufacturer  409234 non-null  object 
 7   model         421603 non-null  object 
 8   condition     252776 non-null  object 
 9   cylinders     249202 non-null  object 
 10  fuel          423867 non-null  object 
 11  odometer      422480 non-null  float64
 12  title_status  418638 non-null  object 
 13  transmission  424324 non-null  object 
 14  VIN           265838 non-null  object 
 15  drive         296313 non-null  object 
 16  size          120519 non-null  object 
 17  type          334022 non-null  object 
 18  pain

In [None]:
# goodbye, useless columns!
colsToDrop = ['url','region_url','VIN','image_url','county','lat','long','posting_date',
              'paint_color','fuel','cylinders','size','type','id']
cars = cars.drop(columns = colsToDrop)

# goodbye, entries without enough data!
cars = cars.dropna(subset=['model', 'odometer','price','year','manufacturer','title_status','transmission'])
cars.isna().sum()

region               0
price                0
year                 0
manufacturer         0
model                0
condition       158828
odometer             0
title_status         0
transmission         0
drive           115377
description          2
state                0
dtype: int64

In [None]:
# fill na values with 'unknown'
cars['condition'] = cars['condition'].fillna('unknown')
cars['drive'] = cars['drive'].fillna('unknown')

# all text lowercase
cars['model'] = cars['model'].str.lower()
cars['description'] = cars['description'].str.lower()
cars['model'].value_counts()

f-150                    7802
silverado 1500           4987
1500                     4148
camry                    3052
silverado                2946
                         ... 
terrain slt - 2 awd         1
vandura/rally               1
capitiva ltz                1
liberty sport limited       1
gand wagoneer               1
Name: model, Length: 22427, dtype: int64

In [None]:
# searchfor = ['down payment', 'downpayment','0 down','9 down']
# cars['contains_down'] = cars['desc'].str.contains('|'.join(searchfor))
# cars = cars[~cars['contains_down']]
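
The commented-out filter above reads from a `desc` column, but at this point the column in `cars` is still called `description`. A minimal sketch of the same idea follows, assuming the intent is to drop listings whose price is really a down payment. The `listings` frame is a made-up stand-in for `cars`.

```python
import re
import pandas as pd

# Toy stand-in for `cars`; these rows are invented for illustration only.
listings = pd.DataFrame({
    "price": [1500, 4000, 999, 2700],
    "description": [
        "clean title, runs great",
        "$0 down! drive home today",
        None,  # a missing description
        "low down payment, easy financing",
    ],
})

# Phrases taken from the commented-out cell above.
search_for = ["down payment", "downpayment", "0 down", "9 down"]

# Escape each phrase so it is matched literally, then OR them together.
pattern = "|".join(re.escape(phrase) for phrase in search_for)

# na=False treats a missing description as "no match" instead of NaN.
contains_down = listings["description"].str.contains(pattern, na=False)

filtered = listings[~contains_down]
print(filtered["price"].tolist())
```

Passing `na=False` matters here because a couple of descriptions are missing, and a NaN in the mask would break the `~` negation.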

In [None]:
# create region column from state
cars['region'] = cars['state']

cars['region']=cars['region'].replace(['ca','or','wa','hi','ak','nv','id','mt','wy','ut','co','az','nm'],'west')
# note: 'hi' and 'ak' were already mapped to 'west' above, so this replacement has no effect
cars['region']=cars['region'].replace(['hi','ak'],'wayout')
cars['region']=cars['region'].replace(['nd','sd','ne','ks','mn','ia','mo','wi','il','mi','in','oh'],'midwest')
cars['region']=cars['region'].replace(['pa','ny','vt','me','nh','ma','ct','ri','nj'],'northeast')
cars['region']=cars['region'].replace(['tx','ok','ar','la','ky','tn','ms','al','de','md','dc','wv','va','nc','sc','ga','fl'],'south')
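
In the cell above, `'hi'` and `'ak'` are included in the `'west'` list, so the later `'wayout'` replacement never matches anything. One way to avoid this kind of order dependence is a single state-to-region dictionary applied with `Series.map`. The sketch below uses shortened state lists; `REGION_STATES` and `STATE_TO_REGION` are hypothetical names, and treating Hawaii and Alaska as their own group is an assumption about the intent.

```python
import pandas as pd

# Abbreviated lists for illustration; the full state lists are in the cell above.
REGION_STATES = {
    "wayout": ["hi", "ak"],
    "west": ["ca", "or", "wa"],
    "midwest": ["wi", "mn", "il"],
    "northeast": ["ny", "ma"],
    "south": ["tx", "fl", "al"],
}

# Invert to one flat lookup: each state appears exactly once,
# so the result does not depend on the order of the groups.
STATE_TO_REGION = {
    state: region
    for region, states in REGION_STATES.items()
    for state in states
}

states = pd.Series(["wi", "hi", "wa", "al", "ak"])
regions = states.map(STATE_TO_REGION)
print(regions.tolist())
```

Separating Hawaii and Alaska this way would add a `region_wayout` dummy column, so the feature counts shown later in the notebook would change.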

In [None]:
# no use in looking for cars that are unrealistically priced for a used daily driver vehicle
cars.drop(cars[cars['price'] > 35000].index, inplace = True)
cars.drop(cars[cars['price'] < 950].index, inplace = True)
# or cars being parted out
cars.drop(cars[cars['title_status'] == 'parts only'].index, inplace = True)
cars.head(5)

Unnamed: 0,region,price,year,manufacturer,model,condition,odometer,title_status,transmission,drive,description,state
27,south,33590,2014.0,gmc,sierra 1500 crew cab slt,good,57923.0,clean,other,unknown,carvana is the safer way to buy a car during t...,al
28,south,22590,2010.0,chevrolet,silverado 1500,good,71229.0,clean,other,unknown,carvana is the safer way to buy a car during t...,al
30,south,30990,2017.0,toyota,tundra double cab sr,good,41124.0,clean,other,unknown,carvana is the safer way to buy a car during t...,al
31,south,15000,2013.0,ford,f-150 xlt,excellent,128000.0,clean,automatic,rwd,2013 f-150 xlt v6 4 door. good condition. leve...,al
32,south,27990,2012.0,gmc,sierra 2500 hd extended cab,good,68696.0,clean,other,4wd,carvana is the safer way to buy a car during t...,al


In [None]:
# filter by model
# going with Camrys because that is the car I recently bought
camrys = cars[cars['model'].str.contains('camry')]
camrys

Unnamed: 0,region,price,year,manufacturer,model,condition,odometer,title_status,transmission,drive,description,state
141,south,18590,2017.0,toyota,camry le sedan 4d,good,30223.0,clean,other,fwd,carvana is the safer way to buy a car during t...,al
209,south,3500,2003.0,toyota,camry,good,237000.0,clean,automatic,unknown,nice vehicle with sunroof and 4 cylinder engin...,al
273,south,2675,1996.0,toyota,camry,unknown,164719.0,clean,automatic,fwd,1996 *toyota* *camry* 4dr sedan dx automatic -...,al
310,south,4499,2005.0,toyota,camry,unknown,170687.0,clean,automatic,fwd,2005 *toyota* *camry* 4dr sedan le automatic -...,al
379,south,4000,2002.0,toyota,camry,like new,160000.0,clean,automatic,fwd,must see! super clean 2002 toyota camry le! 16...,al
...,...,...,...,...,...,...,...,...,...,...,...,...
426241,midwest,14995,2015.0,toyota,camry,excellent,69246.0,clean,automatic,fwd,2015 toyota camry xse leather interior navigat...,wi
426379,west,17590,2015.0,toyota,camry se sedan 4d,good,25889.0,clean,other,fwd,carvana is the safer way to buy a car during t...,wy
426700,west,3977,2000.0,toyota,camry,unknown,162829.0,clean,automatic,unknown,2000 toyota camry xle v6 ☎ 406-283-3311 call o...,wy
426720,west,19200,2013.0,toyota,camry hybrid xle,good,90000.0,clean,automatic,fwd,"for sale: 2013 camry xle hybrid, 5 passenger ...",wy


## Model Fitting & Testing

In [None]:
# splitting into features and label
x=camrys.drop(columns=['price','model','state','description']) 
y=camrys[['price']] 

x.head(3)

Unnamed: 0,region,year,manufacturer,condition,odometer,title_status,transmission,drive
141,south,2017.0,toyota,good,30223.0,clean,other,fwd
209,south,2003.0,toyota,good,237000.0,clean,automatic,unknown
273,south,1996.0,toyota,unknown,164719.0,clean,automatic,fwd


In [None]:
#categorical data encoding
x=pd.get_dummies(x)
x.shape

(4423, 27)

In [None]:
#splitting the data into train and test
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25)

#Creating and fitting RandomForestRegressor
random_forest = RandomForestRegressor(n_estimators = 200, max_features = 'sqrt', n_jobs = 20)
random_forest.fit(x_train, y_train.values.ravel())
print(random_forest.score(x_train, y_train), random_forest.score(x_test, y_test))

0.990101861313236 0.9399223135883428


In [None]:
# cross_val_score uses the estimator's default scorer, which is R^2 for regressors
random_forest_values = cross_val_score(random_forest, x, y.values.ravel(), cv=4)

print(f"RandomForestRegressor model has {random_forest_values} accuracy scores")

RandomForestRegressor model has [0.91530441 0.84335566 0.89444042 0.89830429] accuracy scores


In [None]:
#Model stats
print('RandomForestRegressor Accuracy Evaluation')
print(f'r2 score: {r2_score(y_test, random_forest.predict(x_test))}')
print(f'Mean absolute error: {mean_absolute_error(y_test, random_forest.predict(x_test))}')
print(f'Mean squared error: {mean_squared_error(y_test, random_forest.predict(x_test))}')

RandomForestRegressor Accuracy Evaluation
r2 score: 0.9399223135883428
Mean absolute error: 986.2278068551274
Mean squared error: 2842512.081720776
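
The mean squared error above is in squared dollars, which makes it hard to compare with the mean absolute error. Taking the square root gives a root-mean-squared error (RMSE) back in dollars. A small sketch using the value printed above:

```python
import math

# Mean squared error reported for the Camry test split above (squared dollars).
mse = 2842512.081720776

# RMSE is the square root of MSE, so it is back in dollars.
rmse = math.sqrt(mse)
print(f"RMSE: ${rmse:,.0f}")
```

This comes to roughly $1,686, compared with a mean absolute error of about $986. A gap of that size usually means a few listings have large errors.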


In [None]:
x_test.columns

Index(['year', 'odometer', 'region_midwest', 'region_northeast',
       'region_south', 'region_west', 'manufacturer_nissan',
       'manufacturer_toyota', 'condition_excellent', 'condition_fair',
       'condition_good', 'condition_like new', 'condition_new',
       'condition_salvage', 'condition_unknown', 'title_status_clean',
       'title_status_lien', 'title_status_missing', 'title_status_rebuilt',
       'title_status_salvage', 'transmission_automatic', 'transmission_manual',
       'transmission_other', 'drive_4wd', 'drive_fwd', 'drive_rwd',
       'drive_unknown'],
      dtype='object')

In [None]:
# insert data for car I bought
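data = [[1999, 220000,         # year, odometer
         1, 0, 0, 0,           # region: midwest, northeast, south, west
         0, 1,                 # manufacturer: nissan, toyota
         0, 0, 1, 0, 0, 0, 0,  # condition: excellent, fair, good, like new, new, salvage, unknown
         1, 0, 0, 0, 0,        # title_status: clean, lien, missing, rebuilt, salvage
         1, 0, 0,              # transmission: automatic, manual, other
         0, 1, 0, 0]]          # drive: 4wd, fwd, rwd, unknown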
  
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=x_test.columns)
  
# print dataframe
df

Unnamed: 0,year,odometer,region_midwest,region_northeast,region_south,region_west,manufacturer_nissan,manufacturer_toyota,condition_excellent,condition_fair,...,title_status_missing,title_status_rebuilt,title_status_salvage,transmission_automatic,transmission_manual,transmission_other,drive_4wd,drive_fwd,drive_rwd,drive_unknown
0,1999,220000,1,0,0,0,0,1,0,0,...,0,0,0,1,0,0,0,1,0,0


In [None]:
# I bought it for $2700, let's see what this model predicts
random_forest.predict(df)

array([2706.74666667])

## Using Model to find good deals

In [None]:
# type the model of car you wish to test inside the quotes in str.contains()
# for this example we will look for Ford Rangers
model = cars[cars['model'].str.contains('ranger')]

x=model.drop(columns=['price','model','state','region','description']) # splitting into features and label
y=model[['price']] 

x=pd.get_dummies(x)
x.shape

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25)

random_forest = RandomForestRegressor(n_estimators = 250, max_features = 'sqrt', n_jobs = 25)
random_forest.fit(x_train, y_train.values.ravel())
print(random_forest.score(x_train, y_train), random_forest.score(x_test, y_test))

print('RandomForestRegressor Accuracy Evaluation')
print(f'r2 score: {r2_score(y_test, random_forest.predict(x_test))}')
print(f'Mean absolute error: {mean_absolute_error(y_test, random_forest.predict(x_test))}')
print(f'Mean squared error: {mean_squared_error(y_test, random_forest.predict(x_test))}')

0.9969429451989178 0.96493899824979
RandomForestRegressor Accuracy Evaluation
r2 score: 0.96493899824979
Mean absolute error: 909.7387858448944
Mean squared error: 4732270.268490126


In [None]:
x_test.columns

Index(['year', 'odometer', 'manufacturer_ford', 'manufacturer_jeep',
       'condition_excellent', 'condition_fair', 'condition_good',
       'condition_like new', 'condition_new', 'condition_salvage',
       'condition_unknown', 'title_status_clean', 'title_status_lien',
       'title_status_rebuilt', 'title_status_salvage',
       'transmission_automatic', 'transmission_manual', 'transmission_other',
       'drive_4wd', 'drive_fwd', 'drive_rwd', 'drive_unknown'],
      dtype='object')

In [None]:
# run this cell to test a car of your choice
a = x_test.columns
b = x_test.columns.tolist()
print('data =[[')
for i in b:
  print('0, #' + i)
print(']]')
print('df = pd.DataFrame(data, columns=a)')
print('random_forest.predict(df)')

data =[[
0, #year
0, #odometer
0, #manufacturer_ford
0, #manufacturer_jeep
0, #condition_excellent
0, #condition_fair
0, #condition_good
0, #condition_like new
0, #condition_new
0, #condition_salvage
0, #condition_unknown
0, #title_status_clean
0, #title_status_lien
0, #title_status_rebuilt
0, #title_status_salvage
0, #transmission_automatic
0, #transmission_manual
0, #transmission_other
0, #drive_4wd
0, #drive_fwd
0, #drive_rwd
0, #drive_unknown
]]
df = pd.DataFrame(data, columns=a)
random_forest.predict(df)


In [None]:
# copy the previous cell's output here and fill in the car's details to test any car you like; use 1/0 flags for the non-numeric fields

data =[[
2005, #year
85000, #odometer
1, #manufacturer_ford
0, #manufacturer_jeep
0, #condition_excellent
1, #condition_fair
0, #condition_good
0, #condition_like new
0, #condition_new
0, #condition_salvage
0, #condition_unknown
1, #title_status_clean
0, #title_status_lien
0, #title_status_rebuilt
0, #title_status_salvage
1, #transmission_automatic
0, #transmission_manual
0, #transmission_other
0, #drive_4wd
1, #drive_fwd
0, #drive_rwd
0, #drive_unknown
]]
df = pd.DataFrame(data, columns=a)
random_forest.predict(df)

array([5062.004])
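
Filling in the 0/1 template by hand is easy to get wrong, especially when the column order changes between models. A sketch of an alternative, assuming the same `pd.get_dummies` encoding used above: describe the car as a plain dictionary, one-hot encode it, and `reindex` to the training columns so that missing dummies become 0. `make_feature_row` is a hypothetical helper, and the short `train_columns` list stands in for `x_train.columns`.

```python
import pandas as pd

def make_feature_row(car, train_columns):
    """Encode one car (a dict of raw fields) to match the training columns."""
    row = pd.get_dummies(pd.DataFrame([car]))
    # Columns the model saw but this car lacks are filled with 0;
    # dummy columns the model never saw are dropped.
    return row.reindex(columns=train_columns, fill_value=0).astype(float)

# Shortened stand-in for x_train.columns.
train_columns = [
    "year", "odometer",
    "manufacturer_ford", "manufacturer_jeep",
    "condition_fair", "condition_good",
    "drive_4wd", "drive_fwd",
]

car = {
    "year": 2005,
    "odometer": 85000,
    "manufacturer": "ford",
    "condition": "fair",
    "drive": "fwd",
}

row = make_feature_row(car, train_columns)
print(row.iloc[0].to_dict())
```

With the fitted model from the cells above, `random_forest.predict(make_feature_row(car, x_train.columns))` could then take the place of the hand-built `data` list.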

In [None]:
# predicted price for every listing of this model
w = random_forest.predict(x)

x['estimate'] = w.tolist()
x['actual'] = model['price']
x['ratio'] = x['actual']/x['estimate']
x['savings'] = x['estimate'] - x['actual']
x['year'] = model['year']
x['model'] = model['model']
x['make'] = model['manufacturer']
x['desc'] = model['description']
x['state'] = model['state']
x['miles'] = model['odometer']
x['region'] = model['region']
# x.sort_values(by=['swag'])


pd.set_option('display.max_columns', None)  
pd.set_option('display.max_colwidth', 500)

# searchfor = ['down payment', 'downpayment','0 down','9 down']
# x['contains_down'] = x['desc'].str.contains('|'.join(searchfor))


# showme = x[~x['contains_down']]
x[['estimate','actual','ratio','savings','year','make','model','miles','state','desc']].sort_values(by=['savings'], ascending=False)[0:10]


Unnamed: 0,estimate,actual,ratio,savings,year,make,model,miles,state,desc
234450,29700.443914,955,0.032154,28745.443914,2018.0,ford,ranger,3000.0,nc,"this top came off of a 2000 ford ranger. great condition. this top has interior light and every thing works, locks have 2 keys, door stays open. clamps included and mat comes with top. the price for this top is fair please text or call $900 obo 980-241-023 ford dodge chevrolet camper camp cover"
333540,8344.814667,1200,0.143802,7144.814667,1999.0,ford,ranger,85000.0,pa,clean truck bad motor/tmu tax & tags extra/sold as is/tow out/trades welcome call mark 609-347-8888
255092,8209.448,1500,0.182716,6709.448,2003.0,ford,ranger,162016.0,nj,"2003 ford ranger xlt fx4 with super cab and cap. 2 owners, being sold by second owner. runs great, but taken off the road because where rear suspension shackles mount to frame is rusted out. 4.0 liter engine, burns no oil and runs great. new plugs and wires two years ago. dunlop tires only have around 10k miles on them. bilstein shocks installed on front end about 5 years ago. brakes in good shape. includes 6' leer bed cap."
312651,13069.433333,7000,0.535601,6069.433333,2011.0,ford,ranger,92682.0,or,"sngl cab, auto, heat, a/c, 2.3l, 4cyl, pwr windows, tow hitch 92,682 miles **please call jeff @ 707-246-1416 for more details **please also visit our website: www.americmachinery.com atlas copco, bobcat, bomag, case, cat, caterpillar, deere, dynapac, gehl, genie, hamm, hitachi, hyster, ihi, ingersoll rand, jcb, jlg, john deere, kobelco, komatsu, kubota, laymor, lee boy, linkbelt, mustang, morooka, new holland, skyjack, takeuchi, terex, volvo, wacker neuson, yanmar, air compressor, bac..."
252737,10393.692,4500,0.432955,5893.692,2003.0,ford,ranger,60000.0,nj,"2003 ford ranger 4x4. in good condition. new jasper engine with only 60k miles on it. body has over 200k. clean inside and out, frame in good condition. 4.0l v6. no issues runs great. clean nj title. email me through craigslist."
107691,9770.604,3999,0.409289,5771.604,2001.0,jeep,wranger,63914.0,fl,"please call text andy at auto depot of navarre 850-375-0491 website: www.autodepotofnavarre.com address: 1809 natures way, gulf breeze, fl 32563 hours: mon - sat 9-5 buy-sell-trade-consign no in house finance. we accept cash trades credit cards and outside finance from your bank or credit union."
321137,9329.090846,4100,0.439485,5229.090846,1993.0,ford,ranger super cab,57180.0,or,"desirable, very low mileage, very clean 1993 ford ranger family vehicle, would be a great 2nd family vehicle recently serviced for minor repairs light grey, clear title. 57,180 miles, runs excellently, dependable $4,100.00 obo contact cathy show contact info"
289923,7149.152,1999,0.279614,5150.152,2001.0,ford,ranger,159982.0,oh,"2001 ford ranger -- $1,999 ☎ call: (513) 712-0936 ext 10184 📱 text 10184 to (513) 712-0936 vehicle information: 2001 ford rangerprice: $1,999 year: 2001* make: *ford** model: *ranger** series: ** body style: truck* stock number: r4954 vin: 1ftzr15e11pa29376 mileage: 159,000 engine: 4.0l v6 cylinder transmission: exterior color: woodland green* interior color: woodland green* for more details, pictures and informa..."
33386,6087.702667,1550,0.254612,4537.702667,1998.0,ford,ranger,1.0,ca,runs like new! but crunch in corner ok to drive as-is or bodywork on corner only (frame is ok) or easy to swap on a better bed shell is good call better than texting valley area code is 8one8 then dial 818-3586
62088,6598.590667,2500,0.378869,4098.590667,1993.0,ford,ranger 4x4,400.0,ca,here i have my 1993 ford ranger automatic 4 cylinder. runs drives/great. full tune up done. doesn’t burn oil/ no oil leaks. smog registerd. clean title. brand new tires got recipes. recontructed motor. dash shows 400x miles serious people only (2500$)*hablo español*
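
The estimates in the table above appear to come from `random_forest.predict(x)`, which scores every listing, including the roughly 75% the forest was trained on. For those rows the estimate tends to sit close to the listed price, which can hide real bargains. One alternative, sketched here on synthetic data, is `cross_val_predict`, where each row is scored by a model fitted on the other folds only. The data, column choices, and forest settings below are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the Ranger listings; values are made up.
rng = np.random.default_rng(0)
n = 200
features = pd.DataFrame({
    "year": rng.integers(1995, 2020, size=n),
    "odometer": rng.integers(20_000, 250_000, size=n),
})
price = (
    500 * (features["year"] - 1995)
    - 0.03 * features["odometer"]
    + 9_000
    + rng.normal(0, 800, size=n)
)

forest = RandomForestRegressor(n_estimators=50, random_state=0)

# Each row's estimate comes from a model fitted on the other folds only.
estimate = cross_val_predict(forest, features, price, cv=4)

deals = features.assign(estimate=estimate, actual=price)
deals["savings"] = deals["estimate"] - deals["actual"]
print(deals.sort_values("savings", ascending=False).head(3))
```

The same pattern should carry over to the notebook by passing `x` and `y.values.ravel()` in place of the synthetic `features` and `price`.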
