# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing. Your goal is to understand what factors make a car more or less expensive. As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

In [None]:
# The car dealership wants to know which features of a used car are associated with a higher price. Knowing this would
# help it stock inventory with those features and earn a better return on its investment.

In [None]:
# To identify the potentially important features, we need to analyze the available data and build a regression model
# that predicts used car prices and shows which features contribute most to a higher predicted price.

In [None]:
# The assumption is that sufficient data is available, that it reflects the current market, and that modelling it
# will allow prices to be predicted with reasonable accuracy.

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

In [None]:
# We load the dataset, which contains used car prices and their corresponding features. The next step is to explore the
# data and examine its quality, its key features and its components.

In [1]:
import pandas as pd
car = pd.read_csv('C:\\Certification\\Module 11\\vehicles.csv')

In [2]:
#looking at the sample of the data
car.tail()

Unnamed: 0,id,region,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,VIN,drive,size,type,paint_color,state
426875,7301591192,wyoming,23590,2019.0,nissan,maxima s sedan 4d,good,6 cylinders,gas,32226.0,clean,other,1N4AA6AV6KC367801,fwd,,sedan,,wy
426876,7301591187,wyoming,30590,2020.0,volvo,s60 t5 momentum sedan 4d,good,,gas,12029.0,clean,other,7JR102FKXLG042696,fwd,,sedan,red,wy
426877,7301591147,wyoming,34990,2020.0,cadillac,xt4 sport suv 4d,good,,diesel,4174.0,clean,other,1GYFZFR46LF088296,,,hatchback,white,wy
426878,7301591140,wyoming,28990,2018.0,lexus,es 350 sedan 4d,good,6 cylinders,gas,30112.0,clean,other,58ABK1GG4JU103853,fwd,,sedan,silver,wy
426879,7301591129,wyoming,30590,2019.0,bmw,4 series 430i gran coupe,good,,gas,22716.0,clean,other,WBA4J1C58KBM14708,rwd,,coupe,,wy


In [3]:
# Looking at the columns, id and VIN are unique identifiers, which might not be needed
car.columns

Index(['id', 'region', 'price', 'year', 'manufacturer', 'model', 'condition',
       'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'VIN',
       'drive', 'size', 'type', 'paint_color', 'state'],
      dtype='object')

In [4]:
# Looking at the info, there are many object data types which need to be converted for operations. There are 426,880
# entries and 18 columns
car.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   region        426880 non-null  object 
 2   price         426880 non-null  int64  
 3   year          425675 non-null  float64
 4   manufacturer  409234 non-null  object 
 5   model         421603 non-null  object 
 6   condition     252776 non-null  object 
 7   cylinders     249202 non-null  object 
 8   fuel          423867 non-null  object 
 9   odometer      422480 non-null  float64
 10  title_status  418638 non-null  object 
 11  transmission  424324 non-null  object 
 12  VIN           265838 non-null  object 
 13  drive         296313 non-null  object 
 14  size          120519 non-null  object 
 15  type          334022 non-null  object 
 16  paint_color   296677 non-null  object 
 17  state         426880 non-null  object 
dtypes: f

In [5]:
# There are quite a high number of nulls in both categorical and numeric columns
car.isnull().sum()

id                   0
region               0
price                0
year              1205
manufacturer     17646
model             5277
condition       174104
cylinders       177678
fuel              3013
odometer          4400
title_status      8242
transmission      2556
VIN             161042
drive           130567
size            306361
type             92858
paint_color     130203
state                0
dtype: int64

In [6]:
#checking if any of the values in the columns is equal to zero. Price in ~33k rows has zero value and odometer in ~2k rows
car.eq(0).sum()

id                  0
region              0
price           32895
year                0
manufacturer        0
model               0
condition           0
cylinders           0
fuel                0
odometer         1965
title_status        0
transmission        0
VIN                 0
drive               0
size                0
type                0
paint_color         0
state               0
dtype: int64
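
The null and zero counts above can also be gathered into a single table, which makes it easier to compare columns side by side. The following is a minimal sketch, not part of the original analysis: the helper name `quality_report` and the tiny `toy` frame are assumptions for illustration, and in the notebook the function would be called on `car`.

```python
import pandas as pd


def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, missing values, zeros and cardinality per column."""
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(1),
        "zeros": df.eq(0).sum(),
        "unique": df.nunique(dropna=True),
    })
    # Columns with the largest share of missing values come first
    return report.sort_values("missing_pct", ascending=False)


# Hypothetical stand-in for `car`, only to show the shape of the result
toy = pd.DataFrame({
    "price": [0, 23590, 30590, 0],
    "odometer": [32226.0, 0.0, None, 4174.0],
    "size": [None, None, "full-size", None],
})
report = quality_report(toy)
print(report)
```

Sorting by the share of missing values puts columns such as `size` at the top, which is where an imputation decision matters most.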

### Data Preparation

After our initial exploration and fine tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 
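
One transformation worth considering at this stage is trimming extreme prices and taking a logarithm of the target. The standardized `price` values shown later in this notebook are all very close to zero for ordinary listings, which suggests that a few very large prices dominate the scale. The sketch below is one possible approach under stated assumptions: the function name `trim_and_log_price`, the quantile cut-offs and the `toy` prices are hypothetical, not values taken from the dataset.

```python
import numpy as np
import pandas as pd


def trim_and_log_price(df, price_col="price", lower_q=0.01, upper_q=0.99):
    """Keep rows whose price lies inside a quantile band, then add log1p(price)."""
    lo, hi = df[price_col].quantile([lower_q, upper_q])
    kept = df[df[price_col].between(lo, hi)].copy()
    kept["log_price"] = np.log1p(kept[price_col].astype(float))
    return kept


# Hypothetical prices with one implausibly small and one implausibly large value
toy = pd.DataFrame(
    {"price": [1, 4500, 6000, 11900, 21000, 23590, 30590, 3_000_000_000]}
)
trimmed = trim_and_log_price(toy, lower_q=0.10, upper_q=0.90)
print(trimmed)
```

The cut-offs are a judgment call; whichever band is chosen, it should be decided from the training data and documented for the client.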

In [7]:
# Dropping the id and VIN columns: they are unique identifiers and will not have any impact on the model.
# year is also dropped because no time series analysis is planned (note that this also discards the vehicle's age, which may itself relate to price)
car = car.drop(columns = ["id", "VIN","year"])

In [8]:
# Converting the data types with pandas into explicit string and integer types for the operations that follow
car1 = car.convert_dtypes()

In [9]:
# Now all the columns are either string or Int64 which will help in further data operations
car1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 15 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   region        426880 non-null  string
 1   price         426880 non-null  Int64 
 2   manufacturer  409234 non-null  string
 3   model         421603 non-null  string
 4   condition     252776 non-null  string
 5   cylinders     249202 non-null  string
 6   fuel          423867 non-null  string
 7   odometer      422480 non-null  Int64 
 8   title_status  418638 non-null  string
 9   transmission  424324 non-null  string
 10  drive         296313 non-null  string
 11  size          120519 non-null  string
 12  type          334022 non-null  string
 13  paint_color   296677 non-null  string
 14  state         426880 non-null  string
dtypes: Int64(2), string(13)
memory usage: 49.7 MB


In [10]:
#looking at the rows where price is equal to zero 
import warnings
warnings.filterwarnings("ignore")
car1.query("price==0")

Unnamed: 0,region,price,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,drive,size,type,paint_color,state
10,el paso,0,,,,,,,,,,,,,tx
11,el paso,0,,,,,,,,,,,,,tx
12,el paso,0,,,,,,,,,,,,,tx
13,el paso,0,,,,,,,,,,,,,tx
14,el paso,0,,,,,,,,,,,,,tx
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
426764,wyoming,0,,peterbilt 579,,,diesel,1,clean,automatic,,,,,wy
426812,wyoming,0,toyota,scion tc,excellent,4 cylinders,gas,195000,clean,automatic,fwd,,,silver,wy
426832,wyoming,0,toyota,prius,excellent,4 cylinders,hybrid,239000,clean,automatic,fwd,,,blue,wy
426836,wyoming,0,ram,2500,excellent,6 cylinders,diesel,20492,clean,automatic,4wd,full-size,truck,white,wy


In [11]:
# Price is the dependent variable in the model and a price of zero does not make sense, so these
# rows are dropped from the data (.copy() makes car2 an independent frame that can be modified safely later)
car2 = car1.query("price!=0").copy()
print(car2.eq(0).sum())
print(car2.info())

region             0
price              0
manufacturer       0
model              0
condition          0
cylinders          0
fuel               0
odometer        1114
title_status       0
transmission       0
drive              0
size               0
type               0
paint_color        0
state              0
dtype: int32
<class 'pandas.core.frame.DataFrame'>
Index: 393985 entries, 0 to 426879
Data columns (total 15 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   region        393985 non-null  string
 1   price         393985 non-null  Int64 
 2   manufacturer  377800 non-null  string
 3   model         389284 non-null  string
 4   condition     242596 non-null  string
 5   cylinders     233575 non-null  string
 6   fuel          391391 non-null  string
 7   odometer      391695 non-null  Int64 
 8   title_status  386251 non-null  string
 9   transmission  392162 non-null  string
 10  drive         273731 non-null  string
 11  si

In [12]:
# An odometer reading of zero is implausible for a used car, so these rows need to be treated
car2.query("odometer==0")

Unnamed: 0,region,price,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,drive,size,type,paint_color,state
694,birmingham,3980,kia,rio,,4 cylinders,gas,0,,manual,fwd,,sedan,white,al
1720,birmingham,10477,chevrolet,colorado,,,gas,0,clean,automatic,,,,black,al
1804,birmingham,11980,infiniti,g37 sedan,,6 cylinders,gas,0,,automatic,rwd,,sedan,silver,al
3959,mobile,4500,gmc,sierra,good,6 cylinders,gas,0,clean,automatic,rwd,mid-size,truck,blue,al
4267,mobile,4250,ford,f150,good,8 cylinders,gas,0,clean,automatic,rwd,full-size,pickup,white,al
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
417977,green bay,16900,ford,e-350,,,gas,0,clean,other,,,other,white,wi
418235,janesville,4900,,IC CE PB105,,,diesel,0,clean,other,,,bus,yellow,wi
421403,madison,999,ford,transit,,,gas,0,clean,automatic,,,,,wi
423311,milwaukee,4900,,IC CE PB105,,,diesel,0,clean,other,,,bus,yellow,wi


In [13]:
#Converted all zero values in odometer to missing values
import numpy as np
car2["odometer"] = car2['odometer'].replace(0, np.nan)
print(car2.eq(0).sum())
print(car2.info())

region          0
price           0
manufacturer    0
model           0
condition       0
cylinders       0
fuel            0
odometer        0
title_status    0
transmission    0
drive           0
size            0
type            0
paint_color     0
state           0
dtype: int32
<class 'pandas.core.frame.DataFrame'>
Index: 393985 entries, 0 to 426879
Data columns (total 15 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   region        393985 non-null  string
 1   price         393985 non-null  Int64 
 2   manufacturer  377800 non-null  string
 3   model         389284 non-null  string
 4   condition     242596 non-null  string
 5   cylinders     233575 non-null  string
 6   fuel          391391 non-null  string
 7   odometer      390581 non-null  Int64 
 8   title_status  386251 non-null  string
 9   transmission  392162 non-null  string
 10  drive         273731 non-null  string
 11  size          111052 non-null  string
 12  type

In [14]:
#We have a large number of null values and hence we have to impute values, else we will lose a lot of data
car2.isnull().sum()

region               0
price                0
manufacturer     16185
model             4701
condition       151389
cylinders       160410
fuel              2594
odometer          3404
title_status      7734
transmission      1823
drive           120254
size            282933
type             85932
paint_color     117149
state                0
dtype: int64

In [15]:
# Working on a copy so the imputation does not silently modify car2 as well
car22 = car2.copy()
# filling the odometer column with the median
car22["odometer"] = car22["odometer"].fillna(car22["odometer"].median(skipna=True))
# filling the categorical columns with the most frequent value
categorical_cols = ["manufacturer", "model", "condition", "cylinders", "fuel", "title_status",
                    "transmission", "drive", "size", "type", "paint_color"]
for col in categorical_cols:
    car22[col] = car22[col].fillna(car22[col].value_counts().idxmax())
car22.isnull().sum()


region          0
price           0
manufacturer    0
model           0
condition       0
cylinders       0
fuel            0
odometer        0
title_status    0
transmission    0
drive           0
size            0
type            0
paint_color     0
state           0
dtype: int64
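
Filling with the most frequent value works well when few values are missing, but `size` is missing in 282,933 of 393,985 rows, so mode imputation turns most of that column into a single label. A possible alternative, sketched here under assumptions, is to keep an explicit placeholder category for heavily missing columns. The function name `fill_categoricals`, the 30% threshold and the `"unknown"` label are all hypothetical choices.

```python
import pandas as pd


def fill_categoricals(df, cols, max_missing_share=0.3, placeholder="unknown"):
    """Fill sparse gaps with the mode, but label heavily missing columns explicitly."""
    out = df.copy()
    for col in cols:
        share = out[col].isna().mean()
        if share > max_missing_share:
            # Too many gaps: the mode would dominate the column, so keep
            # "missing" visible as its own category instead
            out[col] = out[col].fillna(placeholder)
        else:
            out[col] = out[col].fillna(out[col].mode(dropna=True).iloc[0])
    return out


# Hypothetical stand-in for a slice of `car2`
toy = pd.DataFrame({
    "size": [None, None, "full-size", None, None],
    "fuel": ["gas", "gas", None, "diesel", "gas"],
})
filled = fill_categoricals(toy, ["size", "fuel"])
print(filled)
```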

In [17]:
# The categorical data has to be encoded for modelling, so the category_encoders package is used (it is installed separately)
import category_encoders as ce

In [18]:
# Build the James Stein encoder with the target categorical variables
encoder = ce.JamesSteinEncoder(cols = ['region', 'manufacturer', 'model', 'condition',
       'cylinders', 'fuel', 'title_status', 'transmission',
       'drive', 'size', 'type', 'paint_color', 'state'])


In [19]:
# Encode the frame with Price as the target variable for encoding and view it
car3 = encoder.fit_transform(car22, car22["price"])
car3

Unnamed: 0,region,price,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,drive,size,type,paint_color,state
0,25667.544221,6000,42777.874157,144975.347408,78456.92406,83627.43637,79800.545699,87236,83102.449244,81768.192373,116567.082876,80897.337983,100926.505732,92467.667071,20548.117622
1,20214.348639,11900,42777.874157,144975.347408,78456.92406,83627.43637,79800.545699,87236,83102.449244,81768.192373,116567.082876,80897.337983,100926.505732,92467.667071,22364.850187
2,25285.936943,21000,42777.874157,144975.347408,78456.92406,83627.43637,79800.545699,87236,83102.449244,81768.192373,116567.082876,80897.337983,100926.505732,92467.667071,18634.050306
3,15433.738804,1500,42777.874157,144975.347408,78456.92406,83627.43637,79800.545699,87236,83102.449244,81768.192373,116567.082876,80897.337983,100926.505732,92467.667071,15935.329963
4,22796.328637,4900,42777.874157,144975.347408,78456.92406,83627.43637,79800.545699,87236,83102.449244,81768.192373,116567.082876,80897.337983,100926.505732,92467.667071,40100.848447
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
426875,22466.900378,23590,21011.660925,23979.685588,78456.92406,83627.43637,79800.545699,32226,83102.449244,30584.425015,16528.495588,80897.337983,100926.505732,92467.667071,22466.897372
426876,22466.900378,30590,173405.11281,29800.909878,78456.92406,83627.43637,79800.545699,12029,83102.449244,30584.425015,16528.495588,80897.337983,100926.505732,23980.130789,22466.897372
426877,22466.900378,34990,20507.293039,34566.34504,78456.92406,83627.43637,123372.645311,4174,83102.449244,30584.425015,116567.082876,80897.337983,14999.977858,92467.667071,22466.897372
426878,22466.900378,28990,20345.027834,24360.338961,78456.92406,83627.43637,79800.545699,30112,83102.449244,30584.425015,16528.495588,80897.337983,100926.505732,94053.835781,22466.897372
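
Because a target encoder derives each category's value from `price`, fitting it on all rows lets information from the eventual test rows reach the features. One way to avoid that is to learn the mapping from training rows only and then apply it to held-out rows. The sketch below uses a simplified smoothed mean encoding written in plain pandas; it is not the James-Stein estimator and not the `category_encoders` API, and the function names, the `smoothing` parameter and the toy data are assumptions for illustration.

```python
import pandas as pd


def fit_mean_encoding(train, col, target, smoothing=10.0):
    """Learn a smoothed per-category mean of the target from training rows only."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    # Categories with few rows are pulled toward the global mean
    weight = stats["count"] / (stats["count"] + smoothing)
    mapping = weight * stats["mean"] + (1 - weight) * global_mean
    return mapping, global_mean


def apply_mean_encoding(df, col, mapping, global_mean):
    """Encode a column; categories unseen during fitting get the global mean."""
    return df[col].map(mapping).fillna(global_mean)


# Hypothetical training and held-out rows
train = pd.DataFrame({"fuel": ["gas", "gas", "diesel"],
                      "price": [10000.0, 20000.0, 30000.0]})
test = pd.DataFrame({"fuel": ["gas", "diesel", "hybrid"]})

mapping, global_mean = fit_mean_encoding(train, "fuel", "price", smoothing=1.0)
test_encoded = apply_mean_encoding(test, "fuel", mapping, global_mean)
print(test_encoded)
```

The same pattern applies to any fitted encoder: split first, call `fit` on the training part, and only `transform` the test part.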


In [None]:
#Now the data is prepared for the modelling. 

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

In [20]:
from sklearn.metrics import mean_squared_error as ms
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SequentialFeatureSelector as sf
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import r2_score as score
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer, TransformedTargetRegressor
import matplotlib.pyplot as plt

In [21]:
# Before starting to model, we should check for multicollinearity using the Variance Inflation Factor (VIF)
# Creating the dependent variable and the feature variables
XT=car3.drop(columns = "price")
YT= car3["price"]

In [22]:
def sklearn_vif(exogs, data):

    # initialize dictionaries
    vif_dict = {}

    # form input data for each exogenous variable
    for exog in exogs:
        not_exog = [i for i in exogs if i != exog]
        XT, YT = data[not_exog], data[exog]

        # extract r-squared from the fit
        r_squared = LinearRegression().fit(XT, YT).score(XT, YT)

        # calculate VIF
        vif = 1/(1 - r_squared)
        vif_dict[exog] = vif

    # return VIF DataFrame
    df_vif = pd.DataFrame({'VIF': vif_dict})

    return df_vif

In [23]:
# None of the features has a VIF > 5, so multicollinearity among the encoded features is not a concern here
sklearn_vif(XT.columns, XT).sort_values("VIF", ascending = False)

Unnamed: 0,VIF
state,1.171575
region,1.169347
cylinders,1.136907
drive,1.122056
transmission,1.054716
fuel,1.051279
paint_color,1.021886
odometer,1.019912
manufacturer,1.01948
type,1.017673
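
For reference, the quantity computed by `sklearn_vif` for feature $j$ is

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
$$

where $R_j^2$ is the coefficient of determination obtained by regressing feature $j$ on all the other features. As a worked example, the largest value in the table, $\mathrm{VIF}_{\text{state}} = 1.171575$, corresponds to $R_{\text{state}}^2 = 1 - 1/1.171575 \approx 0.146$, meaning the other features explain only about 15% of the variance in the encoded `state` column.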


In [24]:
#Scaling the data for modelling separately as column names get lost in a Pipeline with StandardScaler
ss= StandardScaler()
rescaled_car3 = pd.DataFrame(ss.fit_transform(car3),columns = ss.get_feature_names_out())
rescaled_car3

Unnamed: 0,region,price,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,drive,size,type,paint_color,state
0,-0.078660,-0.005952,-0.226583,0.292592,0.022561,0.258344,-0.107261,-0.056873,0.187206,0.007801,0.784080,0.015274,0.867765,0.629015,-0.608247
1,-0.235628,-0.005487,-0.226583,0.292592,0.022561,0.258344,-0.107261,-0.056873,0.187206,0.007801,0.784080,0.015274,0.867765,0.629015,-0.567389
2,-0.089644,-0.004769,-0.226583,0.292592,0.022561,0.258344,-0.107261,-0.056873,0.187206,0.007801,0.784080,0.015274,0.867765,0.629015,-0.651294
3,-0.373235,-0.006307,-0.226583,0.292592,0.022561,0.258344,-0.107261,-0.056873,0.187206,0.007801,0.784080,0.015274,0.867765,0.629015,-0.711987
4,-0.161306,-0.006039,-0.226583,0.292592,0.022561,0.258344,-0.107261,-0.056873,0.187206,0.007801,0.784080,0.015274,0.867765,0.629015,-0.168513
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
393980,-0.170789,-0.004565,-0.743295,-0.006775,0.022561,0.258344,-0.107261,-0.324886,0.187206,-1.366358,-1.451299,0.015274,0.867765,0.629015,-0.565094
393981,-0.170789,-0.004013,2.874400,0.007627,0.022561,0.258344,-0.107261,-0.423287,0.187206,-1.366358,-1.451299,0.015274,0.867765,-1.307269,-0.565094
393982,-0.170789,-0.003666,-0.755268,0.019418,0.022561,0.258344,3.050049,-0.461557,0.187206,-1.366358,0.784080,0.015274,-1.467126,0.629015,-0.565094
393983,-0.170789,-0.004139,-0.759120,-0.005834,0.022561,0.258344,-0.107261,-0.335185,0.187206,-1.366358,-1.451299,0.015274,0.867765,0.673859,-0.565094


In [25]:
rescaled_XT=rescaled_car3.drop(columns = "price")
rescaled_YT= rescaled_car3["price"]

In [26]:
# Created the train and test sets for hold-out validation
X_train, X_test, y_train, y_test = train_test_split(rescaled_XT, rescaled_YT, test_size = 0.2, random_state = 123)

In [27]:
# Building a starting model: degree-2 polynomial features, sequential selection of 5 features, Ridge regression with alpha = 100
ridge_poly_pipe = Pipeline([
    ("Poly_features", PolynomialFeatures(degree = 2, include_bias = False)),
    ("seq_sel", sf(Lasso(), n_features_to_select = 5)),
    ("ridge", Ridge(alpha = 100)),
])
ridge_poly_pipe.fit(X_train,y_train)

In [72]:
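# This is hold-out validation (a single train/test split), not k-fold cross-validation
# Getting the MSE for train and test data. Price was standardized to unit variance over the full dataset, so an MSE near 1
# is roughly what predicting the mean price would give; about 0.97 (train) and 1.10 (test) therefore indicate a weak fit.
print(ms(y_train, ridge_poly_pipe.predict(X_train)))
print(ms(y_test, ridge_poly_pipe.predict(X_test)))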


0.9723707810743195
1.1020715676931958
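
An MSE is easier to judge next to a naive baseline that always predicts the training mean. The sketch below shows that comparison on made-up numbers; the helper names `mse` and `compare_to_mean_baseline` are hypothetical, and in the notebook the inputs would be `y_train`, `y_test` and `ridge_poly_pipe.predict(X_test)`.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def compare_to_mean_baseline(y_train, y_test, y_pred_test):
    """Return model MSE, baseline MSE and the relative improvement over the baseline."""
    baseline_pred = np.full(len(y_test), np.mean(y_train))
    model_mse = mse(y_test, y_pred_test)
    baseline_mse = mse(y_test, baseline_pred)
    # 1.0 means a perfect model, 0.0 means no better than the mean,
    # negative means worse than the mean
    skill = 1.0 - model_mse / baseline_mse
    return model_mse, baseline_mse, skill


# Made-up values, only to show the calculation
model_mse, baseline_mse, skill = compare_to_mean_baseline(
    y_train=[1.0, 2.0, 3.0], y_test=[1.0, 3.0], y_pred_test=[1.5, 2.5]
)
print(model_mse, baseline_mse, skill)
```

A skill value close to zero or below it would confirm that the model adds little beyond the mean price.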


In [None]:
# GridSearchCV was not used, for 2 reasons:
# 1. Hyperparameter tuning was not expected to change the MSEs significantly
# 2. It takes a very long time to process this data; no results were returned after running for hours
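
When a full grid search is too slow, one option is to tune on a random subsample with a small grid and fewer folds, then refit the chosen settings on the full training set. The sketch below illustrates that idea on synthetic data, so the features, sample sizes and grid values are assumptions rather than settings validated on the car data. It also omits the sequential feature selection step to keep the run time short.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the training data
rng = np.random.default_rng(123)
X = rng.normal(size=(2000, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] ** 2 + rng.normal(scale=0.1, size=2000)

# Tune on a random subsample to keep the search fast
sample_idx = rng.choice(len(X), size=500, replace=False)
X_small, y_small = X[sample_idx], y[sample_idx]

pipe = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("scale", StandardScaler()),
    ("ridge", Ridge()),
])
param_grid = {"poly__degree": [1, 2], "ridge__alpha": [0.1, 10.0, 100.0]}

# cv=3 runs 3-fold cross-validation for each of the 6 parameter combinations
search = GridSearchCV(pipe, param_grid, cv=3, scoring="neg_mean_squared_error")
search.fit(X_small, y_small)
print(search.best_params_, -search.best_score_)
```

Placing the scaler inside the pipeline means it is refit on each training fold, so the validation folds do not influence the scaling.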

In [29]:
#Looked at the top features from the model and it can be concluded that model, region, condition and manufacturer
#are the most important features
coeff_df = pd.DataFrame([ridge_poly_pipe.named_steps["ridge"].coef_], 
                      columns = ridge_poly_pipe.named_steps["seq_sel"].get_feature_names_out())
mmm=pd.melt(coeff_df,var_name="features",value_name="Coeff")
mmm["abs_coeff"]= abs(mmm["Coeff"])
mmm.sort_values("abs_coeff",ascending=False).head(5)

Unnamed: 0,features,Coeff,abs_coeff
2,x2,0.183195,0.183195
0,x0,0.019265,0.019265
3,x3,0.00903,0.00903
1,x1,0.006812,0.006812
4,x41,-0.000482,0.000482


In [30]:
nn = pd.DataFrame(ridge_poly_pipe.named_steps["Poly_features"].get_feature_names_out())
nn.iloc[[0,1,2,3,41]]

Unnamed: 0,0
0,region
1,manufacturer
2,model
3,condition
41,model^2


In [31]:
# Interpreting the coefficients: the top features in descending order of importance are model, region, condition and manufacturer
# All of them are positively related to price except for the 5th one, model^2, which is negatively related


In [None]:
# So the higher the model, region, condition and manufacturer scores (their target-encoded values), the higher the predicted price
# Since these are all categorical variables, the corresponding top categories in each feature are:


In [76]:
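# Resetting the index so both frames share the same 0..n-1 row index for retrieving the names of the categorical features
# Caveat: the lookups below take the position of each value among the unique encoded values and use it as a row position,
# so the printed names are not guaranteed to match the encoded values (e.g. six different condition scores all print "good")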
car22a= car22.reset_index()
car3a = rescaled_car3.reset_index()

In [77]:
#for model, the top 10 ones are: 
m=pd.DataFrame(car3a["model"].unique()).sort_values(0,ascending= False).head(10)
m1= m.reset_index()
print(m1)
model_names = car22a["model"].iloc[m1["index"]]
print(model_names)

   index           0
0  11113  333.972462
1  10798  305.390268
2  11674    4.057553
3   7612    2.408091
4   7018    2.386421
5   8061    2.383349
6    986    1.833276
7  10816    1.306826
8   1934    1.241577
9   8400    1.213051
11113    mazda3 i touring hatchback
10798           leaf s hatchback 4d
11674            gladiator overland
7612                 silverado 1500
7018                          tahoe
8061                        sorento
986         xf 20d premium sedan 4d
10816       pt cruiser limited edit
1934     romeo stelvio ti sport suv
8400               traverse premier
Name: model, dtype: string


In [78]:
#for region, the top 10 ones are: 
r=pd.DataFrame(car3a["region"].unique()).sort_values(0,ascending= False).head(10)
r1= r.reset_index()
print(r1)
region_names = car22a["region"].iloc[r1["index"]]
print(region_names)

   index          0
0    297  14.228794
1     13   7.715483
2    195   6.651240
3     40   5.874803
4    178   5.507055
5     12   4.974667
6    131   4.296980
7     99   4.260571
8     77   3.472119
9     46   3.422832
297    birmingham
13     bellingham
195    birmingham
40         auburn
178    birmingham
12     bellingham
131        auburn
99         auburn
77         auburn
46         auburn
Name: region, dtype: string


In [79]:
#for condition, the top 10 ones are: 
c=pd.DataFrame(car3a["condition"].unique()).sort_values(0,ascending= False).head(10)
c1= c.reset_index()
print(c1)
condition_names = car22a["condition"].iloc[c1["index"]]
print(condition_names)

   index         0
0      2  7.320138
1      0  0.022561
2      1 -0.360970
3      3 -0.930759
4      4 -1.208327
5      5 -1.776611
2    good
0    good
1    good
3    good
4    good
5    good
Name: condition, dtype: string


In [80]:
#for manufacturer, the top 10 ones are: 
f=pd.DataFrame(car3a["manufacturer"].unique()).sort_values(0,ascending= False).head(10)
f1= f.reset_index()
print(f1)
manufacturer_names = car22a["manufacturer"].iloc[f1["index"]]
print(manufacturer_names)

   index         0
0     15  2.874400
1     23  2.280152
2     30  1.749200
3      4  1.656737
4      3  1.489671
5      2  1.168030
6     39  0.209245
7      0 -0.226583
8     13 -0.298245
9     34 -0.309016
15         ford
23    chevrolet
30    chevrolet
4          ford
3          ford
2          ford
39         ford
0          ford
13         ford
34    chevrolet
Name: manufacturer, dtype: string
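
To pair each category name with its encoded value, the original labels and the encoded column can be combined row by row and then grouped by label. The sketch below shows one way to do this; the function name `top_categories` and the toy values are hypothetical. In the notebook the inputs would be, for example, `car22["manufacturer"]` and `rescaled_car3["manufacturer"]`, which hold the same rows in the same order.

```python
import pandas as pd


def top_categories(original, encoded, n=10):
    """Rank category labels by their encoded value, pairing rows by position."""
    pairs = pd.DataFrame({
        "category": original.to_numpy(),
        "encoded": encoded.to_numpy(),
    })
    # Each label maps to one encoded value, so the mean simply recovers it
    ranked = pairs.groupby("category")["encoded"].mean()
    return ranked.sort_values(ascending=False).head(n)


# Hypothetical labels and encoded scores for five rows
labels = pd.Series(["ford", "volvo", "ford", "nissan", "volvo"])
scores = pd.Series([-0.23, 2.87, -0.23, -0.74, 2.87])

top = top_categories(labels, scores, n=3)
print(top)
```

Pairing by position through `to_numpy()` sidesteps the mismatch between the filtered frame's index and the rescaled frame's fresh `RangeIndex`.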


In [None]:
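# Based on the lookups above, the car dealership would focus on the top 10 models listed, in the 3 regions (Bellingham, Birmingham and Auburn),
# in good condition, manufactured by Ford and Chevrolet. These names come from a positional lookup and from a model with a
# test MSE of about 1.10 on a standardized target, so the recommendation should be treated as tentative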

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

In [71]:
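# Getting the MSE for test data. On the standardized price scale a test MSE of about 1.10 is close to what predicting the mean
# would give, so the fit is weak and the earlier phases (price outliers, encoding, feature selection) likely need revisiting.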


1.1020715676931958


### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.

In [None]:
# Deployment is outside the scope of this course