<a href="https://colab.research.google.com/github/michellechen202212/ucb/blob/main/prompt_II.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src="images/crisp.png" width="50%"/>
</center>


To frame the task, throughout our practical applications, we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary.
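
One possible way to write the data task down formally is sketched below. It is an illustration, not the expected answer. The symbols are assumptions introduced here: $x_i$ is the feature vector of listing $i$ (for example year, odometer, condition), $y_i$ is its price, $\hat{y}_i = \hat{f}(x_i)$ is the model's prediction, and $\mathcal{F}$ is the family of candidate regression models. The task is a supervised regression problem, evaluated on held-out data with RMSE and $R^2$:

```latex
\[
\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - f(x_i)\bigr)^2,
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2}{\sum_{i=1}^{n}\bigl(y_i - \bar{y}\bigr)^2}.
\]
```

Under this framing, the "key drivers" are the features whose removal or perturbation most degrades these metrics, or whose fitted coefficients are largest on a common scale.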

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score

# Load the dataset
file_path = 'sample_data/vehicles_cleaned_outliers_removed.csv'
data = pd.read_csv(file_path)

# Drop rows with missing values
data_cleaned = data.dropna()

# Define features (X) and target variable (y)
X = data_cleaned.drop(columns=['price', 'model', 'manufacturer'])
y = data_cleaned['price']

# Select categorical and numerical features
categorical_features = ['condition', 'cylinders', 'fuel', 'title_status', 'transmission', 'drive', 'paint_color']
numerical_features = ['year', 'odometer']

# Preprocess categorical and numerical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

# Define the pipeline with Linear Regression
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Linear Regression model
pipeline.fit(X_train, y_train)

# Predict and calculate R^2
y_pred = pipeline.predict(X_test)
initial_r2 = r2_score(y_test, y_pred)
print("Initial R^2 with Linear Regression:", initial_r2)

# Define the pipeline with Ridge Regression for regularization
ridge_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', Ridge(alpha=1.0))  # Regularization parameter alpha set to 1.0
])

# Train the Ridge regression model
ridge_pipeline.fit(X_train, y_train)

# Predict and calculate R^2 with Ridge Regression
y_ridge_pred = ridge_pipeline.predict(X_test)
ridge_r2 = r2_score(y_test, y_ridge_pred)
print("R^2 with Ridge Regression:", ridge_r2)

Initial R^2 with Linear Regression: 0.6338518845873631
R^2 with Ridge Regression: 0.6338437540412545


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, make_scorer

# Load the dataset
file_path = 'sample_data/vehicles_cleaned_outliers_removed.csv'
data = pd.read_csv(file_path)

# Drop rows with missing values
data_cleaned = data.dropna()

# Define features (X) and target variable (y)
X = data_cleaned.drop(columns=['price', 'model', 'manufacturer'])
y = data_cleaned['price']

# Select categorical and numerical features
categorical_features = ['condition', 'cylinders', 'fuel', 'title_status', 'transmission', 'drive', 'paint_color']
numerical_features = ['year', 'odometer']

# Preprocess categorical and numerical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

# Define the pipeline with Linear Regression
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

# Define the pipeline with Ridge Regression for regularization
ridge_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', Ridge(alpha=1.0))  # Regularization parameter alpha set to 1.0
])

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Linear Regression model
pipeline.fit(X_train, y_train)

# Predict and calculate R^2
y_pred = pipeline.predict(X_test)
initial_r2 = r2_score(y_test, y_pred)
print("Initial R^2 with Linear Regression:", initial_r2)

# Train the Ridge regression model
ridge_pipeline.fit(X_train, y_train)

# Predict and calculate R^2 with Ridge Regression
y_ridge_pred = ridge_pipeline.predict(X_test)
ridge_r2 = r2_score(y_test, y_ridge_pred)
print("R^2 with Ridge Regression:", ridge_r2)

# Perform cross-validation
# Define a custom scorer for R^2
r2_scorer = make_scorer(r2_score)

# Set up K-Fold Cross-Validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Reduce dataset size by taking a random sample (20% of data)
sampled_data = data_cleaned.sample(frac=0.2, random_state=42)
X_sampled = sampled_data.drop(columns=['price', 'model', 'manufacturer'])
y_sampled = sampled_data['price']

# Perform cross-validation with both pipelines (Ridge and Linear) on the smaller dataset
cv_ridge_scores_sampled = cross_val_score(ridge_pipeline, X_sampled, y_sampled, cv=kf, scoring=r2_scorer)
cv_linear_scores_sampled = cross_val_score(pipeline, X_sampled, y_sampled, cv=kf, scoring=r2_scorer)

# Calculate the mean R^2 score for each model on the sampled dataset
ridge_cv_mean_r2_sampled = cv_ridge_scores_sampled.mean()
linear_cv_mean_r2_sampled = cv_linear_scores_sampled.mean()

# Display cross-validation results
print("Cross-Validation Results:")
print(f"Mean R^2 (Linear Regression): {linear_cv_mean_r2_sampled:.4f}")
print(f"Mean R^2 (Ridge Regression): {ridge_cv_mean_r2_sampled:.4f}")


Initial R^2 with Linear Regression: 0.6338518845873631
R^2 with Ridge Regression: 0.6338437540412545
Cross-Validation Results:
Mean R^2 (Linear Regression): 0.6337
Mean R^2 (Ridge Regression): 0.6337


In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_squared_error, make_scorer
import numpy as np

# Load the dataset
file_path = 'sample_data/vehicles_cleaned_outliers_removed.csv'
data = pd.read_csv(file_path)

# Drop rows with missing values
data_cleaned = data.dropna()

# Define features (X) and target variable (y)
X = data_cleaned.drop(columns=['price', 'model', 'manufacturer'])
y = data_cleaned['price']

# Select categorical and numerical features
categorical_features = ['condition', 'cylinders', 'fuel', 'title_status', 'transmission', 'drive', 'paint_color']
numerical_features = ['year', 'odometer']

# Preprocess categorical and numerical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

# Define the pipeline with Linear Regression
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

# Define the pipeline with Ridge Regression for regularization
ridge_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', Ridge(alpha=1.0))  # Regularization parameter alpha set to 1.0
])

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Linear Regression model
pipeline.fit(X_train, y_train)

# Predict and calculate R^2 for Linear Regression
y_pred = pipeline.predict(X_test)
initial_r2 = r2_score(y_test, y_pred)
print("Initial R^2 with Linear Regression:", initial_r2)

# Train the Ridge regression model
ridge_pipeline.fit(X_train, y_train)

# Predict and calculate R^2 with Ridge Regression
y_ridge_pred = ridge_pipeline.predict(X_test)
ridge_r2 = r2_score(y_test, y_ridge_pred)
print("R^2 with Ridge Regression:", ridge_r2)

# Calculate MSE and RMSE for Ridge Regression
ridge_mse = mean_squared_error(y_test, y_ridge_pred)
ridge_rmse = np.sqrt(ridge_mse)
print(f"MSE (Ridge Regression): {ridge_mse:.2f}")
print(f"RMSE (Ridge Regression): {ridge_rmse:.2f}")

# Perform cross-validation
# Define a custom scorer for R^2
r2_scorer = make_scorer(r2_score)

# Set up K-Fold Cross-Validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Reduce dataset size by taking a random sample (20% of data)
sampled_data = data_cleaned.sample(frac=0.2, random_state=42)
X_sampled = sampled_data.drop(columns=['price', 'model', 'manufacturer'])
y_sampled = sampled_data['price']

# Perform cross-validation with both pipelines (Ridge and Linear) on the smaller dataset
cv_ridge_scores_sampled = cross_val_score(ridge_pipeline, X_sampled, y_sampled, cv=kf, scoring=r2_scorer)
cv_linear_scores_sampled = cross_val_score(pipeline, X_sampled, y_sampled, cv=kf, scoring=r2_scorer)

# Calculate the mean R^2 score for each model on the sampled dataset
ridge_cv_mean_r2_sampled = cv_ridge_scores_sampled.mean()
linear_cv_mean_r2_sampled = cv_linear_scores_sampled.mean()

# Display cross-validation results
print("Cross-Validation Results:")
print(f"Mean R^2 (Linear Regression): {linear_cv_mean_r2_sampled:.4f}")
print(f"Mean R^2 (Ridge Regression): {ridge_cv_mean_r2_sampled:.4f}")


Initial R^2 with Linear Regression: 0.6338518845873631
R^2 with Ridge Regression: 0.6338437540412545
MSE (Ridge Regression): 57821228.75
RMSE (Ridge Regression): 7604.03
Cross-Validation Results:
Mean R^2 (Linear Regression): 0.6337
Mean R^2 (Ridge Regression): 0.6337


In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
import numpy as np

# Load the dataset
file_path = 'sample_data/vehicles_cleaned_outliers_removed.csv'
data = pd.read_csv(file_path)

# Drop rows with missing values, then drop exact duplicate rows
data_cleaned = data.dropna().drop_duplicates()


# Define features (X) and target variable (y)
X = data_cleaned.drop(columns=['price', 'model', 'manufacturer'])
y = data_cleaned['price']

# Select categorical and numerical features
categorical_features = ['condition', 'cylinders', 'fuel', 'title_status', 'transmission', 'drive', 'paint_color']
numerical_features = ['year', 'odometer']

# Preprocess categorical and numerical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

# Define the pipeline with Ridge Regression
ridge_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', Ridge())
])

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define a grid of hyperparameters to search
param_grid = {
    'regressor__alpha': [0.1, 1.0, 10.0, 100.0]  # Different values for the regularization strength
}

# Set up Grid Search
grid_search = GridSearchCV(
    estimator=ridge_pipeline,
    param_grid=param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring='neg_mean_squared_error',  # Use negative MSE for optimization
    verbose=2
)

# Perform Grid Search
grid_search.fit(X_train, y_train)

# Best parameters and model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Predict using the best model
y_pred_best = best_model.predict(X_test)

# Calculate MSE and RMSE on the held-out test set
best_mse = mean_squared_error(y_test, y_pred_best)
best_rmse = np.sqrt(best_mse)

# Output results (test-set metrics for the model selected by cross-validation)
print("Best Parameters:", best_params)
print(f"Test MSE: {best_mse:.2f}")
print(f"Test RMSE: {best_rmse:.2f}")


Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] END ...............................regressor__alpha=0.1; total time=   1.1s
[CV] END ...............................regressor__alpha=0.1; total time=   1.0s
[CV] END ...............................regressor__alpha=0.1; total time=   1.0s
[CV] END ...............................regressor__alpha=0.1; total time=   1.0s
[CV] END ...............................regressor__alpha=0.1; total time=   1.0s
[CV] END ...............................regressor__alpha=1.0; total time=   1.7s
[CV] END ...............................regressor__alpha=1.0; total time=   1.8s
[CV] END ...............................regressor__alpha=1.0; total time=   1.3s
[CV] END ...............................regressor__alpha=1.0; total time=   1.0s
[CV] END ...............................regressor__alpha=1.0; total time=   1.0s
[CV] END ..............................regressor__alpha=10.0; total time=   1.0s
[CV] END ..............................regressor_

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, make_scorer
import numpy as np

# Load the dataset
file_path = 'sample_data/vehicles_cleaned_outliers_removed.csv'
data = pd.read_csv(file_path)

# Drop rows with missing values
data_cleaned = data.dropna()

# Define features (X) and target variable (y)
X = data_cleaned.drop(columns=['price', 'model', 'manufacturer'])
y = data_cleaned['price']

# Select categorical and numerical features
categorical_features = ['condition', 'cylinders', 'fuel', 'title_status', 'transmission', 'drive', 'paint_color']
numerical_features = ['year', 'odometer']

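# Define RMSE as a scoring metric
def rmse_score(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

# greater_is_better=False makes the scorer return negative RMSE, so that
# GridSearchCV (which maximizes the score) selects the lowest RMSE
rmse_scorer = make_scorer(rmse_score, greater_is_better=False)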

# Preprocess categorical and numerical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

# Ridge Regression Pipeline
ridge_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', Ridge())
])

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Ridge Regression: Hyperparameter tuning
ridge_param_grid = {
    'regressor__alpha': [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
}

ridge_grid_search = GridSearchCV(
    estimator=ridge_pipeline,
    param_grid=ridge_param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring=rmse_scorer,
    verbose=2
)

ridge_grid_search.fit(X_train, y_train)

# Best Ridge Regression model
best_ridge_params = ridge_grid_search.best_params_
best_ridge_model = ridge_grid_search.best_estimator_

# Test set predictions for Ridge Regression
y_pred_ridge = best_ridge_model.predict(X_test)

# RMSE for Ridge Regression
ridge_rmse = np.sqrt(mean_squared_error(y_test, y_pred_ridge))
print("Ridge Regression:")
print("Best Parameters:", best_ridge_params)
print(f"Test RMSE: {ridge_rmse:.2f}")

# Gradient Boosting Pipeline
gbr_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', GradientBoostingRegressor(random_state=42))
])

# Gradient Boosting: Hyperparameter tuning
gbr_param_grid = {
    'regressor__learning_rate': [0.01, 0.1, 0.2],
    'regressor__n_estimators': [100, 200, 300],
    'regressor__max_depth': [3]
}

gbr_grid_search = GridSearchCV(
    estimator=gbr_pipeline,
    param_grid=gbr_param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring=rmse_scorer,
    verbose=2
)

gbr_grid_search.fit(X_train, y_train)

# Best Gradient Boosting model
best_gbr_params = gbr_grid_search.best_params_
best_gbr_model = gbr_grid_search.best_estimator_

# Test set predictions for Gradient Boosting
y_pred_gbr = best_gbr_model.predict(X_test)

# RMSE for Gradient Boosting
gbr_rmse = np.sqrt(mean_squared_error(y_test, y_pred_gbr))
print("Gradient Boosting:")
print("Best Parameters:", best_gbr_params)
print(f"Test RMSE: {gbr_rmse:.2f}")



Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV] END ..............................regressor__alpha=0.01; total time=   1.1s
[CV] END ..............................regressor__alpha=0.01; total time=   1.2s
[CV] END ..............................regressor__alpha=0.01; total time=   1.7s
[CV] END ..............................regressor__alpha=0.01; total time=   1.7s
[CV] END ..............................regressor__alpha=0.01; total time=   1.0s
[CV] END ...............................regressor__alpha=0.1; total time=   1.0s
[CV] END ...............................regressor__alpha=0.1; total time=   1.0s
[CV] END ...............................regressor__alpha=0.1; total time=   1.0s
[CV] END ...............................regressor__alpha=0.1; total time=   1.0s
[CV] END ...............................regressor__alpha=0.1; total time=   1.0s
[CV] END ...............................regressor__alpha=1.0; total time=   1.0s
[CV] END ...............................regressor

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.
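
A first pass at these steps might look like the sketch below. It runs on a tiny made-up DataFrame (`toy`) rather than the real file, and the helper name `quality_report` is hypothetical. The column names mirror those used elsewhere in this notebook, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the vehicles data (all values are made up)
toy = pd.DataFrame({
    "price": [4500, 0, 18900, 18900, 7200, 1234567],
    "year": [2008, 2012, 2018, 2018, np.nan, 2015],
    "odometer": [150000, 98000, 30000, 30000, 120000, np.nan],
    "fuel": ["gas", "gas", "diesel", "diesel", None, "gas"],
})


def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, share of missing values, and number of distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_frac": df.isna().mean(),
        "n_unique": df.nunique(dropna=True),
    })


report = quality_report(toy)

# Exact duplicate rows and implausible prices are common quality issues in listing data
n_duplicates = int(toy.duplicated().sum())
n_nonpositive_price = int((toy["price"] <= 0).sum())

print(report)
print("Exact duplicate rows:", n_duplicates)
print("Rows with non-positive price:", n_nonpositive_price)
print(toy["price"].describe())
```

On the real data, the same checks could be run on `data` after `pd.read_csv`. The `describe()` output is a quick way to spot extreme prices that may need capping or removal.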

### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`.
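
As one illustration of the transformations mentioned above, the sketch below engineers an `age` feature and fits the model on a log-transformed price using `TransformedTargetRegressor`. The data are synthetic, and the price formula and `REFERENCE_YEAR` are assumptions made up for this example. Whether a log target helps on the real data would need to be checked by cross-validation.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic listings (made-up values, not the real dataset)
toy = pd.DataFrame({
    "year": rng.integers(2000, 2021, size=n),
    "odometer": rng.uniform(5_000, 250_000, size=n),
    "fuel": rng.choice(["gas", "diesel"], size=n),
})
# Assumed multiplicative price process, chosen only so a log target is appropriate
toy["price"] = np.exp(
    9.0
    + 0.08 * (toy["year"] - 2010)
    - 4e-6 * toy["odometer"]
    + 0.2 * (toy["fuel"] == "diesel")
    + rng.normal(0, 0.1, size=n)
)

# Engineered feature: vehicle age relative to a hypothetical reference year
REFERENCE_YEAR = 2021
toy["age"] = REFERENCE_YEAR - toy["year"]

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age", "odometer"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["fuel"]),
])

# Fit on log1p(price); predictions are mapped back to dollars with expm1
model = TransformedTargetRegressor(
    regressor=Pipeline([
        ("preprocessor", preprocessor),
        ("regressor", Ridge(alpha=1.0)),
    ]),
    func=np.log1p,
    inverse_func=np.expm1,
)

X = toy[["age", "odometer", "fuel"]]
y = toy["price"]
model.fit(X, y)
preds = model.predict(X)

print("In-sample R^2 on the original price scale:", model.score(X, y))
```

Wrapping the whole pipeline in `TransformedTargetRegressor` keeps the inverse transform attached to the model, so downstream metrics such as RMSE stay in dollars.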

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.
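
When exploring parameters, it can help to look at the full cross-validation table rather than only the best setting. The sketch below does this for a Ridge alpha grid on synthetic data (the features and coefficients are made up). It uses scikit-learn's built-in `neg_root_mean_squared_error` scoring string, which is an alternative to a custom RMSE scorer.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic regression problem (made up for illustration)
X = rng.normal(size=(300, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.5, 1.0]) + rng.normal(scale=0.5, size=300)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("regressor", Ridge()),
])

search = GridSearchCV(
    estimator=pipe,
    param_grid={"regressor__alpha": [0.1, 1.0, 10.0, 100.0]},
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="neg_root_mean_squared_error",  # scores are negated RMSE values
)
search.fit(X, y)

# One row per candidate alpha, with the mean and spread of the fold scores
results = pd.DataFrame(search.cv_results_)[
    ["param_regressor__alpha", "mean_test_score", "std_test_score", "rank_test_score"]
].copy()
results["mean_rmse"] = -results["mean_test_score"]  # flip the sign back to RMSE

best_cv_rmse = -search.best_score_

print(results.sort_values("rank_test_score"))
print("Best alpha:", search.best_params_["regressor__alpha"])
print(f"Best cross-validated RMSE: {best_cv_rmse:.3f}")
```

If the mean scores of neighboring alphas differ by less than their fold-to-fold standard deviation, the choice of alpha is probably not driving model quality, which matches the near-identical Linear and Ridge scores seen earlier.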

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.
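
One hedged way to turn a fitted model into statements about price drivers is permutation importance: shuffle one column at a time on held-out data and measure how much the score drops. The sketch below uses synthetic data in which, by construction, `year` and `odometer` affect price and `paint_color` does not. The price formula is an assumption made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(1)
n = 600

# Synthetic listings (made-up values, not the real dataset)
toy = pd.DataFrame({
    "year": rng.integers(2000, 2021, size=n),
    "odometer": rng.uniform(5_000, 250_000, size=n),
    "paint_color": rng.choice(["white", "black", "red"], size=n),
})
# Assumed price process: depends on year and odometer, not on paint_color
toy["price"] = (
    5000
    + 900 * (toy["year"] - 2000)
    - 0.03 * toy["odometer"]
    + rng.normal(0, 500, size=n)
)

X = toy[["year", "odometer", "paint_color"]]
y = toy["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("num", StandardScaler(), ["year", "odometer"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["paint_color"]),
    ])),
    ("regressor", Ridge(alpha=1.0)),
])
model.fit(X_train, y_train)

# Mean drop in held-out R^2 when each original column is shuffled
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0, scoring="r2"
)
importance = pd.Series(result.importances_mean, index=X.columns).sort_values(
    ascending=False
)
print(importance)
```

Because the importances are reported per original column rather than per one-hot level, the ranking can be read directly as "which listing attributes matter most" when reporting back to the client.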

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.