# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing. Your goal is to understand what factors make a car more or less expensive. As a result of your analysis, you should provide clear recommendations to your client, a used car dealership, as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src="images/crisp.png" width="50%"/>
</center>


To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

### Background:
A used car dealership wants to know what drives the price of a car so that it can price cars appropriately when buying or selling.
### Business Objectives:
Identify the features or attributes that drive the price of a car, and set sale prices based on the findings.
### Business Success Criteria:
Set the car price appropriately, whether buying or selling, to maximize profitability.
### Assess Situation:
We need a dataset that contains several features of any given car, such as make, model, odometer reading, condition, color, and location.
### Risks:
The data may be inaccurate, insufficient, or not useful.
### Costs/Benefits:
The cost is that of running the analysis; the benefit is a price range for any given car, so that the dealership can price it appropriately.
### Data Mining/Success goals:
Identify the data and collect data elements and samples from various sources. Here the dataset is given.
### Project Plan:
Prepare a project plan covering data collection, resources, data analysis, modeling, deployment, and monitoring.
### Tools:
Python, Jupyter Notebook, linear regression and other algorithms, and validation techniques.


### Data Understanding

### Collect Initial Data:
Data would normally be collected through several campaigns. Here the dataset is provided, so no collection is needed.
### Describe Data:
The data has 426,880 rows and 18 features of several data types.
### Explore Data:
Explore the data to identify which features are important: plot distributions and relationships, detect outliers, check for nulls, treat missing values, fix data types, and clean the data.
### Verify Data Quality: 
Inspect the data for quality, consistency, range, and length etc.
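
The quality checks described above can be sketched as a small report of missing and out-of-range values. This is a minimal illustration, not the notebook's actual procedure: the `sample` rows, the `valid_ranges` bounds, and the `quality_report` helper are all hypothetical, and the bounds are assumptions rather than values derived from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the column names mirror ones used later in this notebook.
sample = pd.DataFrame({
    "price": [0, 15000, 27500, 3_000_000_000, 8900],
    "year": [2012.0, 1900.0, 2018.0, 2015.0, np.nan],
    "odometer": [120000.0, np.nan, 35000.0, 10_000_000.0, 98000.0],
})

# Plausible ranges are assumptions for illustration, not taken from the dataset.
valid_ranges = {
    "price": (500, 250_000),
    "year": (1950, 2025),
    "odometer": (0, 500_000),
}

def quality_report(df, ranges):
    """Count missing and out-of-range values for each column in `ranges`."""
    rows = []
    for col, (low, high) in ranges.items():
        values = df[col]
        missing = int(values.isna().sum())
        # between() is False for NaN, so missing values are excluded from the range count
        out_of_range = int((~values.between(low, high) & values.notna()).sum())
        rows.append({"column": col, "missing": missing, "out_of_range": out_of_range})
    return pd.DataFrame(rows).set_index("column")

report = quality_report(sample, valid_ranges)
print(report)
```

Running the same helper on `data` with ranges agreed with the dealership would show how many rows carry implausible prices, years, or odometer readings before any cleaning decisions are made.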

In [None]:
from warnings import filterwarnings
filterwarnings('ignore')

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
from scipy.stats import zscore
from sklearn.compose import make_column_transformer
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, cross_val_predict, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

In [None]:
data = pd.read_csv('data/vehicles.csv')
data.info()

In [None]:
data.describe()

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
data.head()

In [None]:
data.isna().sum().sort_values()

In [None]:
data.eq(0).sum().sort_values()

In [None]:
unique_values = data.nunique().sort_values()
unique_values

In [None]:
pd.Series({col:data[col].unique() for col in data})

In [None]:
fig, axes = plt.subplots(ncols=2)
for ax, col in zip(axes, ['odometer', 'year']):
    # Price is on the x-axis, the compared feature on the y-axis
    ax.scatter(data['price'], data[col])
    ax.set_xlabel('price')
    ax.set_ylabel(col)
    

In [None]:
g = sns.pairplot(data,x_vars=['year','odometer'], y_vars=['price'],diag_kind="kde")

In [None]:
sns.displot(data, x="year")

In [None]:
sns.displot(data, x="fuel")

In [None]:
sns.displot(data, x="paint_color")

In [None]:
sns.displot(data, x="title_status")

### Data Preparation

After our initial exploration and fine tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 
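
As a minimal sketch of two of the transformations mentioned above, the example below applies a logarithm to a skewed price column and standardizes a numeric feature. The `cars` rows are hypothetical values chosen for illustration, and whether a log transform of `price` helps on the real data is an assumption to verify, not a finding of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values for illustration only
cars = pd.DataFrame({
    "price": [2500.0, 8900.0, 15000.0, 27500.0, 120000.0],
    "odometer": [180000.0, 98000.0, 60000.0, 35000.0, 5000.0],
})

# log1p compresses the long right tail of price; expm1 inverts the transform
cars["log_price"] = np.log1p(cars["price"])
recovered = np.expm1(cars["log_price"])

# Standardize odometer to zero mean and unit variance
scaler = StandardScaler()
cars["odometer_scaled"] = scaler.fit_transform(cars[["odometer"]]).ravel()

print(cars.round(3))
print("price skew:", round(cars["price"].skew(), 3))
print("log_price skew:", round(cars["log_price"].skew(), 3))
```

If a model is trained on the log of the price, its predictions need `np.expm1` applied before they are reported to the client in dollars.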

In [None]:
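# Remove duplicate listings: a VIN identifies one vehicle, so keep the row with the
# highest odometer reading as the most recent data point. na_position='first' stops
# rows with a missing odometer from being the ones selected by keep='last'.
# Note: drop_duplicates treats missing VINs as equal, so rows without a VIN collapse to one.
data = data.sort_values('odometer', ascending=True, na_position='first').drop_duplicates('VIN', keep='last').sort_index()
# Drop the id column: it is a listing identifier and carries no price information
data = data.drop(['id'], axis=1)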

data.info()

In [None]:
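# One-hot encode manufacturer into separate indicator columns
onehotencoded_columns = ['manufacturer']
data = pd.get_dummies(data, columns=onehotencoded_columns, dtype='int', drop_first=False)

# Target-encode the remaining categorical features: replace each category with its mean price.
# The means are computed on the full dataset, before the train/test split.
target_encoded_columns = ['region', 'condition', 'state', 'model', 'cylinders', 'fuel', 'title_status', 'transmission', 'drive', 'size', 'type', 'paint_color']
for col in target_encoded_columns:
    # Assigning the mapped Series makes the column numeric; missing categories stay NaN
    data[col] = data[col].map(data.groupby(col)['price'].mean())

# VIN is an identifier, not a predictor
finaldata = data.drop(['VIN'], axis=1)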



In [None]:
# Count missing values per column, then fill them forward and backward
print(finaldata.isna().sum())
finaldata = finaldata.ffill().bfill()

finaldata

In [None]:
#remove outliers with Z score
finaldata = finaldata[(np.abs(zscore(finaldata)) <= 3).all(axis=1)]

In [None]:
corr = finaldata.corr().abs()

values = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool)).stack().sort_values(ascending=False)
# Show the 25 most strongly correlated feature pairs
print(values.head(25))
    


In [None]:
# dfCorr = finaldata.corr()
# filteredDf = dfCorr[((dfCorr >= .5) | (dfCorr <= -.5)) & (dfCorr !=1.000)]
# plt.figure(figsize=(15,10))
# sns.heatmap(finaldata, annot=True, cmap="coolwarm")
# plt.show()

# Split data

X = finaldata.drop('price', axis=1)
y = finaldata['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
#RFE selection


model = LinearRegression()
# Create an RFE selector 
recursive_feature_elimination = RFE(model, n_features_to_select=5)

# Fit the selector to the data
recursive_feature_elimination.fit(X_train, y_train)

# selected features
selected_features = X_train.columns[recursive_feature_elimination.support_]

print(selected_features)

In [None]:
#PCA model
# Scale numerical features
pca_columns = finaldata.columns
scaler = StandardScaler()
scaled_numerical = scaler.fit_transform(X_train)

# Apply PCA on scaled numerical features
pca = PCA()
transformed_numerical = pca.fit_transform(scaled_numerical)
# Bar graph for explained variance ratios
explained_variance_ratio = pca.explained_variance_ratio_
plt.bar(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio)
plt.xlabel('PCA Components')
plt.ylabel('Explained Variance Ratio')
plt.title('Explained Variance Ratio by PCA Components')
plt.show()

In [None]:
plt.scatter(transformed_numerical[:, 0], transformed_numerical[:, 1])
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.title("PCA Analysis")
plt.show()


In [None]:
# Loadings: pca.components_ has one row per principal component and one column per feature
loadings = pca.components_
num_pc = pca.n_components_
pc_list = ["PC" + str(i) for i in range(1, num_pc + 1)]
loadings_df = pd.DataFrame.from_dict(dict(zip(pc_list, loadings)))
loadings_df['variable'] = X_train.columns.values
loadings_df = loadings_df.set_index('variable')
loadings_df


### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.
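
One way to explore parameters with cross-validation is a grid search over a regularized linear model. The sketch below uses `Ridge`, which is not used elsewhere in this notebook and is introduced here only as an assumption; the data is synthetic (`X_demo`, `y_demo`) and stands in for the prepared features and price.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the prepared features and price (assumption)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Scaling lives inside the pipeline so each CV fold is scaled using only its own training part
ridge_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("ridge", Ridge()),
])

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
search = GridSearchCV(
    ridge_pipe,
    param_grid={"ridge__alpha": alphas},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_tr, y_tr)

# best_score_ is a negated MSE, so flip the sign to report it
cv_mse = -search.best_score_
# The pipeline's own score() reports R-squared on the held-out split
test_r2 = search.best_estimator_.score(X_te, y_te)

print("best alpha:", search.best_params_["ridge__alpha"])
print("cross-validated MSE:", cv_mse)
print("held-out R-squared:", test_r2)
```

Keeping preprocessing inside the `Pipeline` means the grid search never sees statistics from the validation folds, which keeps the cross-validated error an honest estimate.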

In [None]:


# Create a column transformer to apply one-hot encoding to the selected variables
col_transformer = make_column_transformer(
        (OneHotEncoder(drop='if_binary', handle_unknown='ignore'), X_train.columns.tolist()),
        remainder='passthrough'
)

# Fit the transformer on the training data and transform it
X_train_transformed = col_transformer.fit_transform(X_train)

# Reuse the fitted transformer on the test data so both sets share the same encoding
X_test_transformed = col_transformer.transform(X_test)

# Create a Pipeline for data processing with linear regression
pipe = Pipeline([
        ('linreg', LinearRegression())
])

# Train the pipeline on the transformed training data
pipe.fit(X_train_transformed, y_train)

# Predict y from the transformed test dataset
y_test_pred = pipe.predict(X_test_transformed)

# Predict y from the transformed training dataset
y_train_pred = pipe.predict(X_train_transformed)

# Calculate mean squared error on the test data
mse = mean_squared_error(y_test, y_test_pred)
print("Y Test Mean Squared Error:", mse)
# Calculate mean squared error on the training data
mse = mean_squared_error(y_train, y_train_pred)
print("Y Train Mean Squared Error:", mse)

# Calculate R-squared (Coefficient of Determination)
r2 = r2_score(y_test, y_test_pred)
print(f"R-squared (Coefficient of Determination): {r2}")

In [None]:
# Calculate residuals
residuals = y_test - y_test_pred

# Create a DataFrame with residuals
residuals_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_test_pred, 'Residuals': residuals})

# Plot residuals
plt.figure(figsize=(10, 6))
plt.scatter(residuals_df['Predicted'], residuals_df['Residuals'], alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.title('Residuals Plot')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.show()


In [None]:


# Create a Pipeline for data processing with linear regression
pipe = Pipeline([
    ('poly', PolynomialFeatures(degree=1)),
    ('linreg', LinearRegression())
])

# Train the pipeline on the transformed training data
pipe.fit(X_train_transformed, y_train)

# Predict y from the transformed training dataset
y_train_pred = pipe.predict(X_train_transformed)

# Predict y from the transformed test dataset
y_test_pred = pipe.predict(X_test_transformed)


# Calculate mean squared error on the test data
mse = mean_squared_error(y_test, y_test_pred)
print("Y Test Mean Squared Error:", mse)
# Calculate mean squared error on the training data
mse = mean_squared_error(y_train, y_train_pred)
print("Y Train Mean Squared Error:", mse)

# Calculate R-squared (Coefficient of Determination)
r2 = r2_score(y_test, y_test_pred)
print(f"R-squared (Coefficient of Determination): {r2}")

In [None]:

# Create a LinearRegression model
linreg = LinearRegression()

# Train the model on the training data
linreg.fit(X_train, y_train)

# Predict y from the test dataset
y_test_pred = linreg.predict(X_test)

# Predict y from the training dataset
y_train_pred = linreg.predict(X_train)

# Calculate mean squared error for test data
mse_test = mean_squared_error(y_test, y_test_pred)
print("Y Test Mean Squared Error:", mse_test)

# Calculate mean squared error for train data
mse_train = mean_squared_error(y_train, y_train_pred)
print("Y Train Mean Squared Error:", mse_train)

# Calculate R-squared (Coefficient of Determination)
r2 = r2_score(y_test, y_test_pred)
print(f"R-squared (Coefficient of Determination): {r2}")

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.
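
To connect a fitted model back to the business question, one option is to rank features by the size of their standardized coefficients. The sketch below is an illustration on synthetic data: the `features` frame, the price formula, and the `noise_feature` column are all hypothetical, and coefficient magnitude is only a rough guide to importance when features are correlated.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared features (assumption)
rng = np.random.default_rng(0)
n = 500
features = pd.DataFrame({
    "year": rng.integers(2000, 2022, size=n).astype(float),
    "odometer": rng.uniform(5_000, 200_000, size=n),
    "noise_feature": rng.normal(size=n),
})
# Hypothetical price formula: newer cars cost more, higher mileage costs less
price = (
    10_000
    + 800.0 * (features["year"] - 2000)
    - 0.05 * features["odometer"]
    + rng.normal(scale=1_000, size=n)
)

# Standardizing puts every coefficient on a per-standard-deviation scale
scaled = StandardScaler().fit_transform(features)
model = LinearRegression().fit(scaled, price)

# Rank features by absolute standardized coefficient
drivers = pd.Series(model.coef_, index=features.columns).sort_values(key=np.abs, ascending=False)
print(drivers)
```

The sign of each coefficient gives the direction of the effect and the magnitude gives the price change per standard deviation of the feature, which is a form the client can act on.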

In [None]:
# Select the categorical (object-dtype) columns to one-hot encode


x_cat_columns = X_train.select_dtypes(include=['object']).columns.tolist()

# Create a column transformer to apply one-hot to selected variables
col_transformer = make_column_transformer(
    (OneHotEncoder(drop='if_binary', handle_unknown='ignore'), x_cat_columns),
    remainder='passthrough'
)

# Fit and transform the column transformer with training data
X_train_transformed = col_transformer.fit_transform(X_train)

# Transform X_test using the col_transformer
X_test_transformed = col_transformer.transform(X_test)

# Define the parameter grid for grid search
param_grid = {
    'linreg__fit_intercept': [True, False]
}

# Create a Pipeline for data processing with linear regression
pipe = Pipeline([
    ('linreg', LinearRegression())
])

# Initialize GridSearchCV
grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)

# Train models using grid search
grid_search.fit(X_train_transformed, y_train)

# Get the best estimator from grid search
best_pipe = grid_search.best_estimator_

# Predict y from the transformed test dataset using the best model
y_test_pred = best_pipe.predict(X_test_transformed)

# Predict y from the training dataset using the best model
y_train_pred = best_pipe.predict(X_train_transformed)

# Calculate mean squared error for test data
mse_test = mean_squared_error(y_test, y_test_pred)
print("Y Test Mean Squared Error:", mse_test)

# Calculate mean squared error for train data
mse_train = mean_squared_error(y_train, y_train_pred)
print("Y Train Mean Squared Error:", mse_train)

# Calculate R-squared (Coefficient of Determination)
r2 = r2_score(y_test, y_test_pred)
print(f"R-squared (Coefficient of Determination): {r2}")


In [None]:

# Add polynomial features
degree = 2  # Change the degree as needed
poly = PolynomialFeatures(degree=degree)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Create a LinearRegression model
linreg = LinearRegression()

# Hyperparameter tuning using GridSearch
param_grid = {
    'fit_intercept': [True, False],
}

grid_search = GridSearchCV(estimator=linreg, param_grid=param_grid, cv=5)
grid_search.fit(X_train_poly, y_train)

# GridSearchCV refits the best estimator on the full training data (refit=True by default),
# so best_estimator_ is ready to use without another fit
best_linreg = grid_search.best_estimator_

# Predict y from the test dataset using the best model
y_test_pred_best = best_linreg.predict(X_test_poly)

# Predict y from the training dataset using the best model
y_train_pred_best = best_linreg.predict(X_train_poly)

# Calculate mean squared error for test data using the best model
mse_test_best = mean_squared_error(y_test, y_test_pred_best)
print("Best Model Y Test Mean Squared Error:", mse_test_best)

# Calculate mean squared error for train data using the best model
mse_train_best = mean_squared_error(y_train, y_train_pred_best)
print("Best Model Y Train Mean Squared Error:", mse_train_best)

# Calculate R-squared (Coefficient of Determination) for test data using the best model
r2_best = r2_score(y_test, y_test_pred_best)
print(f"Best Model R-squared (Coefficient of Determination): {r2_best}")

In [None]:
# Define the number of folds
n_folds = 5

# Create regression models
linear_reg_model = LinearRegression()

# Define the mean squared error as the scoring metric
scorer = make_scorer(mean_squared_error)

# Perform K-Fold Cross-Validation for Linear Regression
linear_reg_scores = cross_val_score(linear_reg_model, X, y, cv=n_folds, scoring=scorer)
linear_reg_mse = linear_reg_scores.mean()



# Print the mean squared error for each model
print("Linear Regression Cross validation X1-Mean Squared Error:", linear_reg_mse)

In [None]:
# Define the number of folds
n_folds = 5

# Create a linear regression model
lg = LinearRegression()

# Fit the model on the full dataset (used below for the in-sample R-squared)
lg.fit(X, y)

# Perform cross-validation to get mean squared error scores
mse_scores = cross_val_score(lg, X, y, cv=n_folds, scoring='neg_mean_squared_error')

# Convert negative scores to positive (mean squared error)
mse_scores = -mse_scores

# Calculate the average mean squared error
average_mse = mse_scores.mean()

# Print the mean squared error for each fold
for i, mse in enumerate(mse_scores):
    print(f"Fold {i+1} Mean Squared Error: {mse}")

# Print the average mean squared error
print("Average Mean Squared Error:", average_mse)

# Perform cross-validation to get out-of-fold predictions
predicted_y = cross_val_predict(lg, X, y, cv=n_folds)

# Calculate in-sample R-squared: lg was fit on all of X, so this is not a cross-validated score
r_squared = lg.score(X, y)

# Print R-squared
print("In-sample R-squared (Coefficient of Determination):", r_squared)

# Calculate Mean Squared Error between out-of-fold predictions and actual target values
mse = mean_squared_error(y, predicted_y)
print("Overall Mean Squared Error:", mse)


### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.