Data preparation

For the rest of the homework, you'll need to use only these columns:

Make,
Model,
Year,
Engine HP,
Engine Cylinders,
Transmission Type,
Vehicle Style,
highway MPG,
city mpg

Data preparation

Select only the features from above and transform their names using next line:
data.columns = data.columns.str.replace(' ', '_').str.lower()
Fill in the missing values of the selected features with 0.
Rename MSRP variable to price.

In [29]:
import pandas as pd

df = pd.read_csv("data.csv")
selected_columns = df[['Make', 'Model', 'Year', 'Engine HP', 'Engine Cylinders', 'Transmission Type', 'Vehicle Style', 'highway MPG', 'city mpg', 'MSRP']]

selected_columns.columns = selected_columns.columns.str.replace(' ', '_').str.lower()

selected_columns = selected_columns.fillna(0)

selected_columns = selected_columns.rename(columns={'msrp': 'price'})
selected_columns.head(5)

Unnamed: 0,make,model,year,engine_hp,engine_cylinders,transmission_type,vehicle_style,highway_mpg,city_mpg,price
0,BMW,1 Series M,2011,335.0,6.0,MANUAL,Coupe,26,19,46135
1,BMW,1 Series,2011,300.0,6.0,MANUAL,Convertible,28,19,40650
2,BMW,1 Series,2011,300.0,6.0,MANUAL,Coupe,28,20,36350
3,BMW,1 Series,2011,230.0,6.0,MANUAL,Coupe,28,18,29450
4,BMW,1 Series,2011,230.0,6.0,MANUAL,Convertible,28,18,34500


Question 1
What is the most frequent observation (mode) for the column transmission_type?

AUTOMATIC
MANUAL
AUTOMATED_MANUAL
DIRECT_DRIVE


In [30]:
most_frequent_transmission = selected_columns['transmission_type'].mode().iloc[0]
print(most_frequent_transmission)

AUTOMATIC


Question 2
Create the correlation matrix for the numerical features of your dataset. In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.

What are the two features that have the biggest correlation in this dataset?

engine_hp and year
engine_hp and engine_cylinders
highway_mpg and engine_cylinders
highway_mpg and city_mpg

In [31]:
# Select only the numerical columns
numerical_columns = selected_columns.select_dtypes("number")

correlation_matrix = numerical_columns.corr()

# Zero out the diagonal for ease of reading.
for i in range(correlation_matrix.shape[0]):
    correlation_matrix.iloc[i, i] = 0

# Find the two features with the highest correlation
max_corr_value = correlation_matrix.abs().max().max()
result = correlation_matrix.where(correlation_matrix == max_corr_value).stack().index.tolist()[0]

print(result)


('highway_mpg', 'city_mpg')


Make price binary

Now we need to turn the price variable from numeric into a binary format.
Let's create a variable above_average which is 1 if the price is above its mean value and 0 otherwise.

Split the data

Split your data in train/val/test sets with 60%/20%/20% distribution.
Use Scikit-Learn for that (the train_test_split function) and set the seed to 42.
Make sure that the target value (price) is not in your dataframe.

In [32]:
from sklearn.model_selection import train_test_split

selected_columns['above_average'] = (selected_columns['price'] > selected_columns['price'].mean()).astype(int)

selected_columns = selected_columns.drop(columns=['price'])

categorical_vars = list(selected_columns.dtypes[selected_columns.dtypes == 'object'].index)

for c in categorical_vars:
    selected_columns[c] = selected_columns[c].str.lower().str.replace(' ', '_')


train_val, test = train_test_split(selected_columns, test_size=0.2, random_state=42)
train, val = train_test_split(train_val, test_size=0.25, random_state=42)  # Ensures train is 60% and val is 20%

print("Train shape:", train.shape)
print("Validation shape:", val.shape)
print("Test shape:", test.shape)

y_train = train.above_average.values
y_val = val.above_average.values

train.drop(['above_average'], axis=1, inplace=True)
val.drop(['above_average'], axis=1, inplace=True)


Train shape: (7148, 10)
Validation shape: (2383, 10)
Test shape: (2383, 10)


Question 3

Calculate the mutual information score between above_average and other categorical variables in our dataset. Use the training set only.
Round the scores to 2 decimals using round(score, 2).
Which of these variables has the lowest mutual information score?

make
model
transmission_type
vehicle_style


In [33]:
from sklearn.feature_selection import mutual_info_classif

mi_scores = {}
for var in categorical_vars:
    mi = mutual_info_classif(train[var].astype('category').cat.codes.values.reshape(-1, 1), y_train)
    mi_scores[var] = round(mi[0], 2)

lowest_mi_var = min(mi_scores, key=mi_scores.get)

print(mi_scores)
print("Variable with the lowest mutual information score:", lowest_mi_var)


{'make': 0.23, 'model': 0.41, 'transmission_type': 0.02, 'vehicle_style': 0.09}
Variable with the lowest mutual information score: transmission_type


Question 4

Now let's train a logistic regression.
Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
Fit the model on the training dataset.
To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
Calculate the accuracy on the validation dataset and round it to 2 decimal digits.
What accuracy did you get?

0.60
0.72
0.84
0.95

In [34]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction import DictVectorizer


train_dict = train.to_dict(orient='records')
val_dict = val.to_dict(orient='records')

dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dict)
X_val = dv.transform(val_dict)

model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

val_predictions = model.predict(X_val)
accuracy = accuracy_score(y_val, val_predictions)
rounded_accuracy = round(accuracy, 2)

print(rounded_accuracy)


0.95


Question 5

Let's find the least useful feature using the feature elimination technique.
Train a model with all these features (using the same parameters as in Q4).
Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
For each feature, calculate the difference between the original accuracy and the accuracy without the feature.
Which of following feature has the smallest difference?

year
engine_hp
transmission_type
city_mpg
Note: the difference doesn't have to be positive

In [43]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction import DictVectorizer

# All features (excluding the target 'above_average')
features = ['year','engine_hp','transmission_type','city_mpg']
train = train[['year','engine_hp','transmission_type','city_mpg']]
val = val[['year','engine_hp','transmission_type','city_mpg']]
differences = {}

train_dict = train.to_dict(orient='records')
val_dict = val.to_dict(orient='records')

dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dict)
X_val = dv.transform(val_dict)

# Train a model using all features
model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
model.fit(X_train, y_train)
original_accuracy = accuracy_score(y_val, model.predict(X_val))
print("originalaccuracy:", original_accuracy)

for feature in features:
    # Exclude the feature
    cols_to_use = [col for col in train.columns if col != feature]
    
    train_dict = train[cols_to_use].to_dict(orient='records')
    val_dict = val[cols_to_use].to_dict(orient='records')
    
    dv = DictVectorizer(sparse=False)
    X_train_sub = dv.fit_transform(train_dict)
    X_val_sub = dv.transform(val_dict)
    
    # Train a model without the feature
    model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
    model.fit(X_train_sub, y_train)
    
    # Calculate accuracy difference
    accuracy_without_feature = accuracy_score(y_val, model.predict(X_val_sub))
    print("accuracy_without_feature", feature, accuracy_without_feature)
    differences[feature] = abs(original_accuracy - accuracy_without_feature)

# Find the feature with the smallest difference
least_impactful_feature = min(differences, key=differences.get)
print(differences)
print("\nFeature with the smallest difference:", least_impactful_feature)


originalaccuracy: 0.8850188837599664
accuracy_without_feature year 0.8854385228703315
accuracy_without_feature engine_hp 0.7444397817876626
accuracy_without_feature transmission_type 0.8820814099874108
accuracy_without_feature city_mpg 0.8766261015526647
{'year': 0.00041963911036513313, 'engine_hp': 0.14057910197230383, 'transmission_type': 0.002937473772555599, 'city_mpg': 0.008392782207301663}

Feature with the smallest difference: year


Question 6

For this question, we'll see how to use a linear regression model from Scikit-Learn.
We'll need to use the original column price. Apply the logarithmic transformation to this column.
Fit the Ridge regression model on the training data with a solver 'sag'. Set the seed to 42.
This model also has a parameter alpha. Let's try the following values: [0, 0.01, 0.1, 1, 10].
Round your RMSE scores to 3 decimal digits.
Which of these alphas leads to the best RMSE on the validation set?

0
0.01
0.1
1
10


In [53]:
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

import pandas as pd

df = pd.read_csv("data.csv")
selected_columns = df[['Make', 'Model', 'Year', 'Engine HP', 'Engine Cylinders', 'Transmission Type', 'Vehicle Style', 'highway MPG', 'city mpg', 'MSRP']]

selected_columns.columns = selected_columns.columns.str.replace(' ', '_').str.lower()

selected_columns = selected_columns.fillna(0)

selected_columns = selected_columns.rename(columns={'msrp': 'price'})

selected_columns['price'] = np.log1p(selected_columns['price'])

categorical_vars = list(selected_columns.dtypes[selected_columns.dtypes == 'object'].index)

for c in categorical_vars:
    selected_columns[c] = selected_columns[c].str.lower().str.replace(' ', '_')

train_val, test = train_test_split(selected_columns, test_size=0.2, random_state=42)
train, val = train_test_split(train_val, test_size=0.25, random_state=42)  # Ensures train is 60% and val is 20%

print("Train shape:", train.shape)
print("Validation shape:", val.shape)
print("Test shape:", test.shape)

# Prepare datasets
train_dict = train.drop(columns=['price']).to_dict(orient='records')
val_dict = val.drop(columns=['price']).to_dict(orient='records')

dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dict)
X_val = dv.transform(val_dict)

# Try various alpha values and record RMSE
alphas = [0, 0.01, 0.1, 1, 10]
rmse_scores = {}

for alpha in alphas:
    model = Ridge(alpha=alpha, solver='sag', random_state=42)
    model.fit(X_train, train['price'])
    
    val_predictions = model.predict(X_val)
    rmse = mean_squared_error(val['price'], val_predictions, squared=False)
    rmse_scores[alpha] = round(rmse, 3)

# Find the alpha leading to the best RMSE
best_alpha = min(rmse_scores, key=rmse_scores.get)
print(rmse_scores)
print("\nBest alpha:", best_alpha)

Train shape: (7148, 10)
Validation shape: (2383, 10)
Test shape: (2383, 10)




{0: 0.487, 0.01: 0.487, 0.1: 0.487, 1: 0.487, 10: 0.487}

Best alpha: 0




In [60]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction import DictVectorizer

df = pd.read_csv("data.csv")
selected_columns = df[['Year', 'Engine HP', 'Transmission Type', 'city mpg', 'MSRP']]

selected_columns.columns = selected_columns.columns.str.replace(' ', '_').str.lower()

selected_columns = selected_columns.fillna(0)

selected_columns = selected_columns.rename(columns={'msrp': 'price'})
selected_columns.head(5)

selected_columns['above_average'] = (selected_columns['price'] > selected_columns['price'].mean()).astype(int)

selected_columns = selected_columns.drop(columns=['price'])

categorical_vars = list(selected_columns.dtypes[selected_columns.dtypes == 'object'].index)

for c in categorical_vars:
    selected_columns[c] = selected_columns[c].str.lower().str.replace(' ', '_')

train_val, test = train_test_split(selected_columns, test_size=0.2, random_state=42)
train, val = train_test_split(train_val, test_size=0.25, random_state=42)  # Ensures train is 60% and val is 20%

dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dict)
X_val = dv.transform(val_dict)

model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
model.fit(X_train, train['above_average'])

val_predictions = model.predict(X_val)
accuracy = accuracy_score(val['above_average'], val_predictions)

print(accuracy)

for feature in ["make", "model", "transmission_type", "vehicle_style"]:
    # Exclude the feature
    cols_to_use = [col for col in train.columns if col != feature]

    print(cols_to_use)
    
    
    train_dict = train[cols_to_use].to_dict(orient='records')
    val_dict = val[cols_to_use].to_dict(orient='records')

    print(train_dict)
    
    dv = DictVectorizer(sparse=False)
    X_train_sub = dv.fit_transform(train_dict)
    X_val_sub = dv.transform(val_dict)
    
    # Train a model without the feature
    model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
    model.fit(X_train_sub, train['above_average'])
    
    # Calculate accuracy difference
    accuracy_without_feature = accuracy_score(val['above_average'], model.predict(X_val_sub))
    print("accuracy_without_feature", feature, accuracy_without_feature)
    differences[feature] = abs(original_accuracy - accuracy_without_feature)

1.0
['year', 'engine_hp', 'transmission_type', 'city_mpg', 'above_average']
[{'year': 2011, 'engine_hp': 225.0, 'transmission_type': 'automatic', 'city_mpg': 15, 'above_average': 0}, {'year': 2009, 'engine_hp': 276.0, 'transmission_type': 'automatic', 'city_mpg': 17, 'above_average': 0}, {'year': 2012, 'engine_hp': 570.0, 'transmission_type': 'manual', 'city_mpg': 12, 'above_average': 1}, {'year': 2016, 'engine_hp': 200.0, 'transmission_type': 'automatic', 'city_mpg': 20, 'above_average': 0}, {'year': 2009, 'engine_hp': 158.0, 'transmission_type': 'automatic', 'city_mpg': 20, 'above_average': 0}, {'year': 2011, 'engine_hp': 160.0, 'transmission_type': 'automatic', 'city_mpg': 21, 'above_average': 0}, {'year': 2016, 'engine_hp': 240.0, 'transmission_type': 'automatic', 'city_mpg': 23, 'above_average': 1}, {'year': 2016, 'engine_hp': 420.0, 'transmission_type': 'automatic', 'city_mpg': 15, 'above_average': 1}, {'year': 2016, 'engine_hp': 305.0, 'transmission_type': 'automatic', 'city_mpg