# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src = images/crisp.png width = 50%/>
</center>


To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

Using historical used car sales, determine what factors make a car more or less expensive. As a used car dealership, we want to identify used car 'characteristics' (factors) that can increase revenue. The dealership wants to understand its present inventory to appeal to more customers as well as to determine what cars to have available for its customers to increase sales.

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

The problem statement 'identifying key drivers for used car prices' (or 'what consumers value in a used car'), requires historical data from past used car sales that can be used to extrapolate these features. An adequate amount of data is required (eg. data from 100 used car sales would be insufficient) as well as data from the sale of new cars.

Furthermore, the data set needs to contain robust information about used cars that were sold to better determine the reason they appealed to a customer. Some of these might include: make, model, year, safety rating, number of miles on the odometer, engine (v6, v8), fuel type (gasoline, diesel, electric, hybrid), 'creature comforts' (AC, heated/cooled seats), number of doors, airbags (front, back, side), color. Since the client is 'a used car dealership', we can assume that its geographi context is also important; the data set would require, at minimum, city and state.

The data set provided (vehicles.csv) contains 426,880 entries. At first inspection we can see:
1. 'model' information is textual (eg. 'sierra 1500 crew cab slt' & 'silverado 1500') & it would be challenging 'normalizing' these
2. the 'transmission' column contains 'other' which has no meaning
3. there is quite a bit of missing data

In [108]:
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from warnings import filterwarnings 
filterwarnings('ignore')
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle
from random import shuffle, seed
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

from scipy.linalg import svd

In [57]:
df = pd.read_csv('data/vehicles.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   region        426880 non-null  object 
 2   price         426880 non-null  int64  
 3   year          425675 non-null  float64
 4   manufacturer  409234 non-null  object 
 5   model         421603 non-null  object 
 6   condition     252776 non-null  object 
 7   cylinders     249202 non-null  object 
 8   fuel          423867 non-null  object 
 9   odometer      422480 non-null  float64
 10  title_status  418638 non-null  object 
 11  transmission  424324 non-null  object 
 12  VIN           265838 non-null  object 
 13  drive         296313 non-null  object 
 14  size          120519 non-null  object 
 15  type          334022 non-null  object 
 16  paint_color   296677 non-null  object 
 17  state         426880 non-null  object 
dtypes: f

In [32]:
df.isna().sum().sort_values(ascending=True)

id                   0
price                0
state                0
region               0
year              1205
transmission      2556
fuel              3013
odometer          4400
model             5277
title_status      8242
manufacturer     17646
type             92858
paint_color     130203
drive           130567
VIN             161042
condition       174104
cylinders       177678
size            306361
dtype: int64

Since we only have a limited amount of columns, the feature shown above that has the lowest amount of data is 'size'. Were we to remove all null entries, this would reduce our data set to 120,519 entries. Even though, I believe, this is an important feature, ignorign/removing it would allow us to retain a larger data set.

### Data Preparation

After our initial exploration and fine tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 

In [58]:
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 34868 entries, 126 to 426836
Data columns (total 18 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            34868 non-null  int64  
 1   region        34868 non-null  object 
 2   price         34868 non-null  int64  
 3   year          34868 non-null  float64
 4   manufacturer  34868 non-null  object 
 5   model         34868 non-null  object 
 6   condition     34868 non-null  object 
 7   cylinders     34868 non-null  object 
 8   fuel          34868 non-null  object 
 9   odometer      34868 non-null  float64
 10  title_status  34868 non-null  object 
 11  transmission  34868 non-null  object 
 12  VIN           34868 non-null  object 
 13  drive         34868 non-null  object 
 14  size          34868 non-null  object 
 15  type          34868 non-null  object 
 16  paint_color   34868 non-null  object 
 17  state         34868 non-null  object 
dtypes: float64(2), int64(2)

In [53]:
df.shape

(34868, 18)

In [54]:
df.head()

Unnamed: 0,id,region,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,VIN,drive,size,type,paint_color,state
126,7305672709,auburn,0,2018.0,chevrolet,express cargo van,like new,6 cylinders,gas,68472.0,clean,automatic,1GCWGAFP8J1309579,rwd,full-size,van,white,al
127,7305672266,auburn,0,2019.0,chevrolet,express cargo van,like new,6 cylinders,gas,69125.0,clean,automatic,1GCWGAFP4K1214373,rwd,full-size,van,white,al
128,7305672252,auburn,0,2018.0,chevrolet,express cargo van,like new,6 cylinders,gas,66555.0,clean,automatic,1GCWGAFPXJ1337903,rwd,full-size,van,white,al
215,7316482063,birmingham,4000,2002.0,toyota,echo,excellent,4 cylinders,gas,155000.0,clean,automatic,JTDBT123520243495,fwd,compact,sedan,blue,al
219,7316429417,birmingham,2500,1995.0,bmw,525i,fair,6 cylinders,gas,110661.0,clean,automatic,WBAHD6322SGK86772,rwd,mid-size,sedan,white,al


In [80]:
# remove rows where
# price = 0
# df.drop(df.loc[df['price']==0].index, inplace=True)
# # fuel is 'other'
# df.drop(df.loc[df['fuel']=='other'].index, inplace=True)
# # transmission is 'other'
# df.drop(df.loc[df['transmission']=='other'].index, inplace=True)
# # type is 'other'

# # remove ' cylinders' from cylinders column
# df['cylinders'] = df['cylinders'].str.replace(' cylinders','')

# drop columns: id, model, VIN
# df2 = df.drop(columns=['id','model','VIN'])
df2.shape
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31677 entries, 215 to 426833
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   region        31677 non-null  object 
 1   price         31677 non-null  int64  
 2   year          31677 non-null  float64
 3   manufacturer  31677 non-null  object 
 4   condition     31677 non-null  object 
 5   cylinders     31677 non-null  object 
 6   fuel          31677 non-null  object 
 7   odometer      31677 non-null  float64
 8   title_status  31677 non-null  object 
 9   transmission  31677 non-null  object 
 10  drive         31677 non-null  object 
 11  size          31677 non-null  object 
 12  type          31677 non-null  object 
 13  paint_color   31677 non-null  object 
 14  state         31677 non-null  object 
dtypes: float64(2), int64(1), object(12)
memory usage: 3.9+ MB


Convert categorical variables into dummy variables & convert float64 to int

In [86]:
# drop_prefixes = ['region','manufacturer','condition','cylinders','fuel','title_status','transmission','drive','size','type','paint_color','state']
# df2 = pd.get_dummies(df2, prefix=drop_prefixes).astype(int)
# df2['odometer'] = df2['odometer'].astype('int')
# df2['year'] = df2['year'].astype('int')
df2.shape

(31677, 540)

As mentioned above, the column 'size' contains the most null data. The column 'vin' does not add value to our dataset. The same is true for the 'id' column. 
Encoding will have to be performed on most of the columns since they are textual.
Also, adding in 'region' and 'state' will generate 12080 (vs 5225) columns, and since the question asked is what car features impact sale price, and not 'and what region', this is a reasonable implication. In short, columns that contain significant NaN data and those that contain predominantly 'other' are dropped to reduce the number of columns.

In [87]:
df2.dtypes

price                 int64
year                  int64
odometer              int64
region_SF bay area    int64
region_abilene        int64
                      ...  
state_vt              int64
state_wa              int64
state_wi              int64
state_wv              int64
state_wy              int64
Length: 540, dtype: object

In [88]:
df2.describe()

Unnamed: 0,price,year,odometer,region_SF bay area,region_abilene,region_akron / canton,region_albany,region_albuquerque,region_altoona-johnstown,region_amarillo,...,state_sd,state_tn,state_tx,state_ut,state_va,state_vt,state_wa,state_wi,state_wv,state_wy
count,31677.0,31677.0,31677.0,31677.0,31677.0,31677.0,31677.0,31677.0,31677.0,31677.0,...,31677.0,31677.0,31677.0,31677.0,31677.0,31677.0,31677.0,31677.0,31677.0,31677.0
mean,17209.790605,2010.934116,109790.0,0.006093,0.000284,0.003915,0.01026,0.004925,0.000347,0.000915,...,0.004704,0.027117,0.040345,0.004704,0.030748,0.02093,0.009628,0.034946,0.002746,0.002746
std,13154.104236,7.143911,102959.8,0.077819,0.016854,0.062444,0.100771,0.070004,0.018632,0.030244,...,0.068423,0.162428,0.19677,0.068423,0.172637,0.143152,0.097653,0.183647,0.052336,0.052336
min,1.0,1905.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7880.0,2008.0,68655.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,13500.0,2012.0,106715.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,23700.0,2015.0,143429.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,155000.0,2022.0,9999999.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [89]:
ss = StandardScaler() # regularization scores will be applied more equitably
df2_rescaled = pd.DataFrame(ss.fit_transform(df2), columns=ss.get_feature_names_out())
df2_rescaled

Unnamed: 0,price,year,odometer,region_SF bay area,region_abilene,region_akron / canton,region_albany,region_albuquerque,region_altoona-johnstown,region_amarillo,...,state_sd,state_tn,state_tx,state_ut,state_va,state_vt,state_wa,state_wi,state_wv,state_wy
0,-1.004249,-1.250611,0.439110,-0.078295,-0.016858,-0.062689,-0.101814,-0.07035,-0.018638,-0.030271,...,-0.068746,-0.166953,-0.205039,-0.068746,-0.17811,-0.14621,-0.0986,-0.190294,-0.052479,-0.052479
1,-1.118284,-2.230482,0.008460,-0.078295,-0.016858,-0.062689,-0.101814,-0.07035,-0.018638,-0.030271,...,-0.068746,-0.166953,-0.205039,-0.068746,-0.17811,-0.14621,-0.0986,-0.190294,-0.052479,-0.052479
2,-0.624134,-0.410722,-0.515646,-0.078295,-0.016858,-0.062689,-0.101814,-0.07035,-0.018638,-0.030271,...,-0.068746,-0.166953,-0.205039,-0.068746,-0.17811,-0.14621,-0.0986,-0.190294,-0.052479,-0.052479
3,-0.627935,0.009223,0.526524,-0.078295,-0.016858,-0.062689,-0.101814,-0.07035,-0.018638,-0.030271,...,-0.068746,-0.166953,-0.205039,-0.068746,-0.17811,-0.14621,-0.0986,-0.190294,-0.052479,-0.052479
4,-1.004249,-5.450058,-0.210668,-0.078295,-0.016858,-0.062689,-0.101814,-0.07035,-0.018638,-0.030271,...,-0.068746,-0.166953,-0.205039,-0.068746,-0.17811,-0.14621,-0.0986,-0.190294,-0.052479,-0.052479
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31672,0.477821,0.569149,0.359418,-0.078295,-0.016858,-0.062689,-0.101814,-0.07035,-0.018638,-0.030271,...,-0.068746,-0.166953,-0.205039,-0.068746,-0.17811,-0.14621,-0.0986,-0.190294,-0.052479,19.055274
31673,-0.320421,0.709130,-0.472648,-0.078295,-0.016858,-0.062689,-0.101814,-0.07035,-0.018638,-0.030271,...,-0.068746,-0.166953,-0.205039,-0.068746,-0.17811,-0.14621,-0.0986,-0.190294,-0.052479,19.055274
31674,1.200345,0.429167,0.435633,-0.078295,-0.016858,-0.062689,-0.101814,-0.07035,-0.018638,-0.030271,...,-0.068746,-0.166953,-0.205039,-0.068746,-0.17811,-0.14621,-0.0986,-0.190294,-0.052479,19.055274
31675,-0.092048,0.989093,-0.712183,-0.078295,-0.016858,-0.062689,-0.101814,-0.07035,-0.018638,-0.030271,...,-0.068746,-0.166953,-0.205039,-0.068746,-0.17811,-0.14621,-0.0986,-0.190294,-0.052479,19.055274


Sequential Feature Selection

In [125]:
# np.linspace(1,540).astype(int)
np.arange(1,540)

array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
        27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
        40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
        79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
        92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103, 104,
       105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117,
       118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130,
       131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143,
       144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156,
       157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169,
       170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 18

In [126]:
# df2_rescaled_shuffled = shuffle(df2_rescaled)
# train, validation, test = np.split(df2_rescaled_shuffled,[19006, 9503])


# all_indices = list(range(0, len(df2_rescaled)))
# seed(42)
# shuffle(all_indices)

# training_idx, test_idx = np.split(all_indices, [25341])

X = df2_rescaled.drop(columns=['price'])
y = df2_rescaled['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

all_indicies = list(range(0, len(df2_rescaled)))
seed(42)
shuffle(all_indicies)
train_idx, test_idx = np.split(all_indicies, [350])

# features_to_try = {'selector__n_features_to_select':np.linspace(1,len(df2_rescaled.columns)-1).astype(int)}
features_to_try = {'selector__n_features_to_select':np.arange(1,5)}
seq_features_selector_pipe = Pipeline([
    ('selector', 
     SequentialFeatureSelector(estimator=LinearRegression(),
                                    n_features_to_select=4,
                                    cv = [[train_idx, test_idx]],
                                    scoring = 'neg_mean_squared_error')
    )
])

features_finder = GridSearchCV(estimator=seq_features_selector_pipe,
                              param_grid = features_to_try,
                              scoring = 'neg_mean_squared_error')

features_finder.fit(X, y)
features_finder.best_params_



ValueError: 
All the 20 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
8 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/sklearn/pipeline.py", line 405, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/sklearn/feature_selection/_sequential.py", line 268, in fit
    new_feature_idx, new_score = self._get_best_new_feature_score(
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/sklearn/feature_selection/_sequential.py", line 299, in _get_best_new_feature_score
    scores[feature_idx] = cross_val_score(
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 515, in cross_val_score
    cv_results = cross_validate(
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 266, in cross_validate
    results = parallel(
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/sklearn/utils/parallel.py", line 63, in __call__
    return super().__call__(iterable_with_config)
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/joblib/parallel.py", line 1048, in __call__
    if self.dispatch_one_batch(iterator):
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/joblib/parallel.py", line 864, in dispatch_one_batch
    self._dispatch(tasks)
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/joblib/parallel.py", line 782, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/joblib/_parallel_backends.py", line 572, in __init__
    self.results = batch()
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/joblib/parallel.py", line 263, in __call__
    return [func(*args, **kwargs)
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/joblib/parallel.py", line 263, in <listcomp>
    return [func(*args, **kwargs)
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/sklearn/utils/parallel.py", line 123, in __call__
    return self.function(*args, **kwargs)
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 678, in _fit_and_score
    X_train, y_train = _safe_split(estimator, X, y, train)
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/sklearn/utils/metaestimators.py", line 233, in _safe_split
    X_subset = _safe_indexing(X, indices)
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/sklearn/utils/__init__.py", line 356, in _safe_indexing
    return _array_indexing(X, indices, indices_dtype, axis=axis)
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/sklearn/utils/__init__.py", line 185, in _array_indexing
    return array[key] if axis == 0 else array[:, key]
IndexError: index 25499 is out of bounds for axis 0 with size 25341

--------------------------------------------------------------------------------
12 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/sklearn/pipeline.py", line 405, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/sklearn/feature_selection/_sequential.py", line 268, in fit
    new_feature_idx, new_score = self._get_best_new_feature_score(
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/sklearn/feature_selection/_sequential.py", line 299, in _get_best_new_feature_score
    scores[feature_idx] = cross_val_score(
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 515, in cross_val_score
    cv_results = cross_validate(
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 266, in cross_validate
    results = parallel(
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/sklearn/utils/parallel.py", line 63, in __call__
    return super().__call__(iterable_with_config)
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/joblib/parallel.py", line 1048, in __call__
    if self.dispatch_one_batch(iterator):
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/joblib/parallel.py", line 864, in dispatch_one_batch
    self._dispatch(tasks)
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/joblib/parallel.py", line 782, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/joblib/_parallel_backends.py", line 572, in __init__
    self.results = batch()
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/joblib/parallel.py", line 263, in __call__
    return [func(*args, **kwargs)
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/joblib/parallel.py", line 263, in <listcomp>
    return [func(*args, **kwargs)
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/sklearn/utils/parallel.py", line 123, in __call__
    return self.function(*args, **kwargs)
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 678, in _fit_and_score
    X_train, y_train = _safe_split(estimator, X, y, train)
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/sklearn/utils/metaestimators.py", line 233, in _safe_split
    X_subset = _safe_indexing(X, indices)
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/sklearn/utils/__init__.py", line 356, in _safe_indexing
    return _array_indexing(X, indices, indices_dtype, axis=axis)
  File "/Users/maspen/anaconda3/lib/python3.10/site-packages/sklearn/utils/__init__.py", line 185, in _array_indexing
    return array[key] if axis == 0 else array[:, key]
IndexError: index 25499 is out of bounds for axis 0 with size 25342


In [19]:
# Normalize data
mu = df.mean()
sigma = df.std()

df_norm = (df-mu)/sigma
df_norm

Unnamed: 0,price,year,odometer,manufacturer_acura,manufacturer_alfa-romeo,manufacturer_aston-martin,manufacturer_audi,manufacturer_bmw,manufacturer_buick,manufacturer_cadillac,...,fuel_electric,fuel_gas,fuel_hybrid,fuel_other,title_status_clean,title_status_lien,title_status_missing,title_status_parts only,title_status_rebuilt,title_status_salvage
126,-1.176626,0.952641,-0.389626,-0.086169,-0.020745,-0.009276,-0.114342,-0.165444,-0.106078,-0.122919,...,-0.047653,0.336939,-0.11108,-0.030308,0.187153,-0.062111,-0.020042,-0.007574,-0.152255,-0.08325
127,-1.176626,1.092261,-0.383090,-0.086169,-0.020745,-0.009276,-0.114342,-0.165444,-0.106078,-0.122919,...,-0.047653,0.336939,-0.11108,-0.030308,0.187153,-0.062111,-0.020042,-0.007574,-0.152255,-0.08325
128,-1.176626,0.952641,-0.408814,-0.086169,-0.020745,-0.009276,-0.114342,-0.165444,-0.106078,-0.122919,...,-0.047653,0.336939,-0.11108,-0.030308,0.187153,-0.062111,-0.020042,-0.007574,-0.152255,-0.08325
215,-0.879450,-1.281268,0.476444,-0.086169,-0.020745,-0.009276,-0.114342,-0.165444,-0.106078,-0.122919,...,-0.047653,0.336939,-0.11108,-0.030308,0.187153,-0.062111,-0.020042,-0.007574,-0.152255,-0.08325
219,-0.990891,-2.258603,0.032649,-0.086169,-0.020745,-0.009276,-0.114342,6.044153,-0.106078,-0.122919,...,-0.047653,0.336939,-0.11108,-0.030308,0.187153,-0.062111,-0.020042,-0.007574,-0.152255,-0.08325
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
426788,-0.211176,0.673403,-0.463143,-0.086169,-0.020745,-0.009276,-0.114342,-0.165444,-0.106078,-0.122919,...,-0.047653,0.336939,-0.11108,-0.030308,0.187153,-0.062111,-0.020042,-0.007574,-0.152255,-0.08325
426792,1.274999,0.394164,0.472860,-0.086169,-0.020745,-0.009276,-0.114342,-0.165444,-0.106078,-0.122919,...,-0.047653,-2.967811,-0.11108,-0.030308,0.187153,-0.062111,-0.020042,-0.007574,-0.152255,-0.08325
426793,0.012002,0.952641,-0.709989,-0.086169,-0.020745,-0.009276,-0.114342,-0.165444,-0.106078,-0.122919,...,-0.047653,0.336939,-0.11108,-0.030308,0.187153,-0.062111,-0.020042,-0.007574,-0.152255,-0.08325
426833,-0.671427,-1.979365,-0.378836,-0.086169,-0.020745,-0.009276,-0.114342,-0.165444,-0.106078,-0.122919,...,-0.047653,0.336939,-0.11108,-0.030308,0.187153,-0.062111,-0.020042,-0.007574,-0.152255,-0.08325


In [22]:
# perform SVD on normalized data

U, s, Vt = svd(df_norm, full_matrices=False)

Sigma = np.diag(s)
V = Vt.T

In [24]:
# verify that every entry in df_norm is close to the product of the matrices

np.allclose(df_norm, U @ Sigma @ V.T )

True

In [None]:
# The importance of principal components


In [25]:
r = 4
Ur = U[:,:r]
Sigmar = Sigma[:r,:r]
Xrr = pd.DataFrame(Ur @ Sigmar)
Xrr

Unnamed: 0,0,1,2,3
0,0.259105,0.279504,-0.921861,1.158898
1,0.247062,0.252207,-0.888746,1.142999
2,0.258693,0.276753,-0.919431,1.158157
3,1.506000,0.439709,0.766447,-0.296714
4,1.882864,1.127807,-1.571845,1.070129
...,...,...,...,...
34863,0.380104,-0.299036,-0.245546,0.796387
34864,-5.387637,0.369836,0.902033,-1.152830
34865,0.550263,-0.287631,1.029626,-0.250558
34866,1.339457,0.876828,-2.249269,0.913276


### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.