### Jonathan Bunch

6 November 2021

Bellevue University

DSC550-T301

---

# Final Project Milestone Four

This week I wanted to focus on feature selection, as this was a weakness in my analysis last week.  I had essentially
chosen an arbitrary number of features based on those features' correlation with the target feature.  This week I used
the "SelectKBest" and "chi2" methods from sklearn to quantitatively determine the best features for my regression model.
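
To make the selection step concrete, here is a minimal sketch of "SelectKBest" on made-up data. The column names and values are hypothetical and do not come from my data set. The sketch also swaps in "f_regression" as the score function, since sklearn documents "chi2" as a score function for classification tasks and "f_regression" as its counterpart for continuous targets. That swap is an assumption for illustration only, not what the analysis in this milestone uses.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical toy data: two informative columns and two noise columns.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "signal_a": rng.integers(0, 10, size=200),
    "signal_b": rng.integers(0, 10, size=200),
    "noise_a": rng.integers(0, 10, size=200),
    "noise_b": rng.integers(0, 10, size=200),
})
toy_y = 3.0 * toy_X["signal_a"] + 2.0 * toy_X["signal_b"] + rng.normal(0, 0.5, size=200)

# SelectKBest scores every column with the score function and keeps the k highest.
selector = SelectKBest(f_regression, k=2).fit(toy_X, toy_y)

# get_support() returns a boolean mask over the original columns.
selected = list(toy_X.columns[selector.get_support()])
print("Selected columns:", selected)
print("Scores:", dict(zip(toy_X.columns, selector.scores_.round(1))))
```

The fitted selector exposes the per-column scores through `scores_`, which makes it possible to see how close the cut-off between kept and dropped features is.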

In [16]:
# Import libraries.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, explained_variance_score
from sklearn.feature_selection import SelectKBest, chi2

# Import my finished data set from the previous milestone.
df8 = pd.read_csv('week-8-data.csv')

In [17]:
# Assign the predictive and target features to variables for convenience.
y = df8.inflation_adjusted_gross
X = df8.drop(columns="inflation_adjusted_gross")

# Create a list of descending k values (24 down to 2) to try for the SelectKBest method.
k_vals = list(range(24, 1, -1))

# Fit and evaluate a model for each number of best features.
for k in k_vals:
    X_k = SelectKBest(chi2, k=k).fit_transform(X, y)
    # Split the data into training and testing sets.
    X_train, X_test, y_train, y_test = train_test_split(X_k, y, test_size=0.2, random_state=4)
    # Create and fit a linear regression model.
    model = LinearRegression()
    model.fit(X_train, y_train)
    # Use the model to make some predictions.
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    evs = explained_variance_score(y_test, y_pred)
    # Print the results.
    print('Regression Model Results for ', k, ' Best Features:')
    print("R-squared score:            ", r2)
    print("Explained Variance Score:   ", evs)
    print("Mean Absolute Error:        ", mae)
    print("Mean Squared Error:         ", mse)
    print()

Regression Model Results for  24  Best Features:
R-squared score:             0.7762115773224278
Explained Variance Score:    0.7777247233392993
Mean Absolute Error:         287079309.9112983
Mean Squared Error:          1.6995626644758458e+17

Regression Model Results for  23  Best Features:
R-squared score:             0.7785333832297617
Explained Variance Score:    0.7794399941720883
Mean Absolute Error:         268014779.46989715
Mean Squared Error:          1.6819296940699123e+17

Regression Model Results for  22  Best Features:
R-squared score:             0.7786227476521608
Explained Variance Score:    0.783922581549676
Mean Absolute Error:         276602681.96570164
Mean Squared Error:          1.681251015369627e+17

Regression Model Results for  21  Best Features:
R-squared score:             0.7751079941605327
Explained Variance Score:    0.778147493290234
Mean Absolute Error:         272610231.92357
Mean Squared Error:          1.7079438341389594e+17

Regression Model Result

It looks like the best-performing models are those built from the 19, 6, and 5 best features, as ranked by the chi2
test.  Next, I will see which features are included in each of these top-performing models, and create a new dataframe
for each selection of features.
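
The same select, split, fit, and score steps are repeated for each value of k. One possible way to condense them is a small helper function. This is only a sketch: `evaluate_k_best` and the stand-in data are hypothetical names introduced for illustration, and this is not the code that produced the results reported in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_k_best(X, y, k, score_func=chi2, random_state=4):
    """Select the k best columns of X, fit a linear model, and report test metrics."""
    selector = SelectKBest(score_func, k=k).fit(X, y)
    # Index with the boolean mask so the result stays a DataFrame with column names.
    X_best = X.loc[:, selector.get_support()]
    X_train, X_test, y_train, y_test = train_test_split(
        X_best, y, test_size=0.2, random_state=random_state
    )
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "k": k,
        "features": list(X_best.columns),
        "r2": r2_score(y_test, y_pred),
        "mae": mean_absolute_error(y_test, y_pred),
    }


# Hypothetical stand-in data: non-negative features and an integer target.
rng = np.random.default_rng(1)
demo_X = pd.DataFrame(
    rng.integers(0, 5, size=(120, 6)), columns=[f"f{i}" for i in range(6)]
)
demo_y = 10 * demo_X["f0"] + 5 * demo_X["f3"] + rng.integers(0, 3, size=120)

results = [evaluate_k_best(demo_X, demo_y, k) for k in (2, 3)]
for row in results:
    print(row)
```

Returning a dictionary per k means the results can be collected into a list and compared side by side instead of being read off printed output.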

In [18]:
# Create and fit the selector for the 5 best features.
selector_5 = SelectKBest(chi2, k=5)
selector_5.fit(X, y)
# Create a new dataframe with the k best columns.
cols = selector_5.get_support(indices=True)
X_best_5 = X.iloc[:, cols]

# Check the results of the model again using this selection of features.
# The selector is already fitted, so transform X directly instead of refitting on the reduced frame.
X_5 = selector_5.transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_5, y, test_size=0.2, random_state=4)
# Create and fit a linear regression model.
model = LinearRegression()
model.fit(X_train, y_train)
# Use the model to make some predictions.
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
evs = explained_variance_score(y_test, y_pred)
print('Best 5 Features:')
print(X_best_5.columns.values)
print('5 Feature Model Results:')
print("R-squared score:            ", r2)
print("Explained Variance Score:   ", evs)
print("Mean Absolute Error:        ", mae)
print("Mean Squared Error:         ", mse)
print()

Best 5 Features:
['genre_Adventure' 'genre_Comedy' 'genre_Drama' 'mortality'
 'income_per_cap']
5 Feature Model Results:
R-squared score:             0.7827199593702088
Explained Variance Score:    0.8417171884466026
Mean Absolute Error:         278282460.236372
Mean Squared Error:          1.6501347137257293e+17



In [19]:
# Create and fit the selector for the 6 best features.
selector_6 = SelectKBest(chi2, k=6)
selector_6.fit(X, y)
# Create a new dataframe with the k best columns.
cols = selector_6.get_support(indices=True)
X_best_6 = X.iloc[:, cols]

# Check the results of the model again using this selection of features.
# The selector is already fitted, so transform X directly instead of refitting on the reduced frame.
X_6 = selector_6.transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_6, y, test_size=0.2, random_state=4)
# Create and fit a linear regression model.
model = LinearRegression()
model.fit(X_train, y_train)
# Use the model to make some predictions.
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
evs = explained_variance_score(y_test, y_pred)
print('Best 6 Features:')
print(X_best_6.columns.values)
print('6 Feature Model Results:')
print("R-squared score:            ", r2)
print("Explained Variance Score:   ", evs)
print("Mean Absolute Error:        ", mae)
print("Mean Squared Error:         ", mse)
print()

Best 6 Features:
['month_1' 'genre_Adventure' 'genre_Comedy' 'genre_Drama' 'mortality'
 'income_per_cap']
6 Feature Model Results:
R-squared score:             0.7820687426993217
Explained Variance Score:    0.8352684537176487
Mean Absolute Error:         282287310.4297555
Mean Squared Error:          1.6550803830641226e+17



In [20]:
# Create and fit the selector for the 19 best features.
selector_19 = SelectKBest(chi2, k=19)
selector_19.fit(X, y)
# Create a new dataframe with the k best columns.
cols = selector_19.get_support(indices=True)
X_best_19 = X.iloc[:, cols]

# Check the results of the model again using this selection of features.
# The selector is already fitted, so transform X directly instead of refitting on the reduced frame.
X_19 = selector_19.transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_19, y, test_size=0.2, random_state=4)
# Create and fit a linear regression model.
model = LinearRegression()
model.fit(X_train, y_train)
# Use the model to make some predictions.
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
evs = explained_variance_score(y_test, y_pred)
print('Best 19 Features:')
print(X_best_19.columns.values)
print('19 Feature Model Results:')
print("R-squared score:            ", r2)
print("Explained Variance Score:   ", evs)
print("Mean Absolute Error:        ", mae)
print("Mean Squared Error:         ", mse)
print()

Best 19 Features:
['month_1' 'month_2' 'month_3' 'month_4' 'month_5' 'month_9' 'month_10'
 'month_12' 'genre_Adventure' 'genre_Black Comedy' 'genre_Comedy'
 'genre_Concert/Performance' 'genre_Documentary' 'genre_Drama'
 'genre_Musical' 'genre_Romantic Comedy' 'genre_Western' 'mortality'
 'income_per_cap']
19 Feature Model Results:
R-squared score:             0.8046644001711996
Explained Variance Score:    0.8047170953685409
Mean Absolute Error:         284826513.62975913
Mean Squared Error:          1.483477512106773e+17



For the final project submission, I will continue to refine the feature selection as I also expand on model selection.
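
One direction that refinement could take is to move the selection step inside a cross-validated pipeline, so that k is chosen on training folds only. The sketch below is an assumption about a possible next step, not a result: the data is hypothetical, and it uses "f_regression" rather than "chi2" as the score function.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 8 features, only the first two drive the target.
rng = np.random.default_rng(2)
toy_X = rng.normal(size=(150, 8))
toy_y = 4.0 * toy_X[:, 0] - 2.0 * toy_X[:, 1] + rng.normal(scale=0.5, size=150)

# The selector is refit on each training fold, so held-out folds never influence it.
pipe = Pipeline([
    ("select", SelectKBest(f_regression)),
    ("model", LinearRegression()),
])

# Try every k from 1 to 8 and score each candidate with 5-fold cross-validation.
search = GridSearchCV(pipe, {"select__k": list(range(1, 9))}, cv=5, scoring="r2")
search.fit(toy_X, toy_y)

best_k = search.best_params_["select__k"]
print("Best k:", best_k, "mean CV R-squared:", round(search.best_score_, 3))
```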