<h1><u>Capstone 2 - Coffee Shop - Modeling</u></h1>

[Rubric](https://docs.google.com/document/d/1rbG66SRqRj73Y-KtI_0qlkMX2CSGrvWddvi1V-WYXcY/edit)

In previous notebooks I defined the problem, cleaned the data set, created dummy variables for the categorical data, standardized the numeric columns, and created the train/test split. The data set is from Kaggle and can be found [here](https://www.kaggle.com/datasets/patkle/coffeereviewcom-over-7000-ratings-and-reviews). The completed data cleaning notebook can be found [here](https://github.com/lindseyc735/Springboard/blob/main/Capstone%202/Capstone_2_data_wrangling.ipynb). A short review of the project follows as context for the modeling.

<u>**Problem Statement:**</u>
<br>What features most affect the coffee rating?

<u>**Context:**</u>
<br>A start-up coffee company is creating a signature blend to sell alongside its more generic blends. The start-up needs to know which three features to prioritize in the signature blend to maximize its popularity and distinguish the company from other coffee companies.

<u>**Criteria for Success:**</u>
<br>Determine the three coffee features that will create a popular, signature blend of coffee.

<u>**Scope of Solution Space:**</u>
<br>Rating
<br>Acidity
<br>Aftertaste
<br>Aroma
<br>Body
<br>Flavor
<br>Review description
<br>Country of origin
<br>Roast level
<br>Roaster
<br>Roaster location

In [1]:
import warnings
warnings.filterwarnings('ignore') # Suppresses all warnings, including deprecation warnings
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns # For all our visualization needs.
from pandas_profiling import ProfileReport # Creates data description, visuals, and missing value statistics for the data frame
from IPython.display import display
import os
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Import the preprocessed data and preview the first rows
df = pd.read_csv('reordered_preprocessed_coffee4.csv')
df.head()

Unnamed: 0,aftertaste,aroma,body,flavor,coffee_origin_20% Kona; other blend components not disclosed,coffee_origin_40% Colombia; 40% Brazil; 20% Rwanda,"coffee_origin_50% Colombia, 35% Ethiopia, 15% Sumatra",coffee_origin_50% Colombia; 50% Ethiopia,coffee_origin_50% Yirgacheffe Ethiopia; 25% Papua New Guinea; 25% Brazil,coffee_origin_A blend of coffees from southern India,...,"roaster_location_Youngstown, Ohio","roaster_location_Yuanlin, Taiwan","roaster_location_Yun-Lin County, Taiwan","roaster_location_Zhongli, Taiwan","roaster_location_Zhubei City, Taiwan","roaster_location_Zhubei, Taiwan","roaster_location_Zhuwei, Taiwan",roaster_location_Zimbabwe,"roaster_location_Zurich, Switzerland",rating
0,0.040738,0.700223,-0.111574,0.554627,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0.517301
1,0.040738,0.700223,-0.111574,0.554627,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0.27465
2,0.040738,0.700223,1.057494,0.554627,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0.759951
3,0.040738,0.700223,1.057494,0.554627,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0.759951
4,0.040738,0.700223,-0.111574,0.554627,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0.517301


In [2]:
# Import the train/test split data
X = df.iloc[:, :-1]  # Features (all columns except the last one)
y = df.iloc[:, -1]   # Target (last column)
X_train = pd.read_csv('X_train.csv')
X_test = pd.read_csv('X_test.csv')
y_train = pd.read_csv('y_train.csv')
y_test = pd.read_csv('y_test.csv')

In [3]:
df.shape

(7037, 4174)

In [4]:
X_train.shape, y_train.shape

((5629, 4173), (5629, 1))

In [5]:
X_test.shape, y_test.shape

((1408, 4173), (1408, 1))
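The shapes above show that `y_train` and `y_test` are read back as one-column DataFrames, `(n, 1)`, rather than 1D arrays. Several scikit-learn estimators expect a 1D target, which is why a later cell calls `.values.ravel()`. A minimal sketch of doing the conversion once at load time is below; the in-memory CSV text is a hypothetical stand-in for `y_train.csv`, assumed to hold a single `rating` column.

```python
import io
import pandas as pd

# Hypothetical stand-in for y_train.csv: one header row, one column
csv_text = "rating\n0.517301\n0.27465\n0.759951\n"

y_frame = pd.read_csv(io.StringIO(csv_text))   # DataFrame with shape (3, 1)
y_series = y_frame.squeeze("columns")          # Series with shape (3,)

print(y_frame.shape, y_series.shape)
```

With the target stored as a Series, the same object can be passed to every model's `fit` call without reshaping.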

# <u>Modeling</u>  
Goal: Build 3 to 5 different models and identify the best one.  

# Fit models with a training dataset  

# Model 1: Linear Regression Model 

In [6]:
# Create model specific variables for the train/test set components
XLR = X
yLR = y
X_trainLR = X_train
X_testLR = X_test
y_trainLR = y_train
y_testLR = y_test

In [7]:
# Import LinearRegression
from sklearn.linear_model import LinearRegression

# Create the model
lr = LinearRegression()

# Fit the model to the data
lr.fit(X_trainLR, y_trainLR)

# Make predictions
y_predLR = lr.predict(X_testLR)

In [8]:
# Evaluate the model
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
# Calculate MAE
maeLR = mean_absolute_error(y_testLR, y_predLR)

# Print the MAE
print(f"Mean Absolute Error for the Linear Regression Model: {maeLR}")

# Calculate Mean Squared Error (MSE) for evaluation
mseLR = mean_squared_error(y_testLR, y_predLR)
print(f"Mean Squared Error for the Linear Regression Model: {mseLR}")

# Calculate the RMSE
rmseLR = np.sqrt(mseLR)
print("RMSE for the Linear Regression Model: ", rmseLR)

Mean Absolute Error for the Linear Regression Model: 446544337.98906857
Mean Squared Error for the Linear Regression Model: 9.467784905330545e+19
RMSE for the Linear Regression Model:  9730254315.962427
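These errors are many orders of magnitude larger than the scale of a standardized target. With roughly 4,000 sparse dummy columns and about 5,600 training rows, one plausible cause (an assumption, not something verified in this notebook) is numerically unstable ordinary least squares coefficients. A common way to check is to compare against a regularized linear model such as Ridge. The sketch below shows only the comparison pattern on small synthetic data; the data, sizes, and `alpha=1.0` are illustrative choices and are not expected to reproduce the blow-up seen above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic stand-in: one numeric feature plus 150 one-hot columns, 200 rows
n_rows, n_categories = 200, 150
numeric = rng.normal(size=(n_rows, 1))
labels = rng.integers(0, n_categories, size=n_rows)
one_hot = np.eye(n_categories)[labels]
X_demo = np.hstack([numeric, one_hot])
y_demo = 0.8 * numeric[:, 0] + rng.normal(scale=0.3, size=n_rows)

# Simple ordered split: first 160 rows for training, last 40 for testing
X_tr, X_te = X_demo[:160], X_demo[160:]
y_tr, y_te = y_demo[:160], y_demo[160:]

# Fit an unregularized and an L2-regularized linear model on the same data
ols = LinearRegression().fit(X_tr, y_tr)
ridge = Ridge(alpha=1.0).fit(X_tr, y_tr)

mse_ols = mean_squared_error(y_te, ols.predict(X_te))
mse_ridge = mean_squared_error(y_te, ridge.predict(X_te))
print(f"OLS test MSE:   {mse_ols:.4g}")
print(f"Ridge test MSE: {mse_ridge:.4g}")
```

If a Ridge fit on the real training data gave errors on the same scale as the other models, that would support the instability explanation.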


# Model 2: Random Forest Regressor Model

In [9]:
# Create model specific variables for the train/test set components
XRF = X
yRF = y
X_trainRF = X_train
X_testRF = X_test
y_trainRF = y_train
y_testRF = y_test

In [10]:
from sklearn.ensemble import RandomForestRegressor

# Initialize the Random Forest model
rf = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model on the training data
rf.fit(X_trainRF, y_trainRF.values.ravel())  # Using .values.ravel() to convert y_train DataFrame to a 1D array

# Predict on the test data
y_predRF = rf.predict(X_testRF)

In [11]:
# Calculate MAE
maeRF = mean_absolute_error(y_testRF, y_predRF)

# Print the MAE
print(f"Mean Absolute Error for the Random Forest Model: {maeRF}")

# Calculate Mean Squared Error (MSE) for evaluation
mseRF = mean_squared_error(y_testRF, y_predRF)
print(f"Mean Squared Error for the Random Forest Regressor Model: {mseRF}")

# Calculate the RMSE
rmseRF = np.sqrt(mseRF)
print("RMSE for the Random Forest Regressor Model: ", rmseRF)

Mean Absolute Error for the Random Forest Model: 0.1794049411585514
Mean Squared Error for the Random Forest Regressor Model: 0.0832775065420507
RMSE for the Random Forest Regressor Model:  0.28857842355597324


# Model 3: Extreme Gradient Boosting (XGBoost) Regressor Model

In [12]:
# Create model specific variables for the train/test set components
XGB = X
yGB = y
X_trainGB = X_train
X_testGB = X_test
y_trainGB = y_train
y_testGB = y_test

# Install XGBoost
#! pip install xgboost

In [13]:
import xgboost as xgb

# Initialize the XGBoost regressor
gb = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=10, seed=42)

# Fit the model on the training data
gb.fit(X_trainGB, y_trainGB)

# Make predictions on the test set
y_predGB = gb.predict(X_testGB)

In [14]:
# Calculate MAE
maeGB = mean_absolute_error(y_testGB, y_predGB)

# Print the MAE
print(f"Mean Absolute Error for the Extreme Gradient Boosting Model: {maeGB}")

# Calculate Mean Squared Error for evaluation
mseGB = mean_squared_error(y_testGB, y_predGB)
print(f"Mean Squared Error for the Extreme Gradient Boosting Model: {mseGB}")

# Calculate the RMSE
rmseGB = np.sqrt(mseGB)
print("RMSE for the Extreme Gradient Boosting Model: ", rmseGB)

Mean Absolute Error for the Extreme Gradient Boosting Model: 0.19203128125179927
Mean Squared Error for the Extreme Gradient Boosting Model: 0.08294503229995759
RMSE for the Extreme Gradient Boosting Model:  0.2880017921818501


# Model 4: Support Vector Regression (SVR) Model

In [15]:
# Create model specific variables for the train/test set components
XSVR = X
ySVR = y
X_trainSVR = X_train
X_testSVR = X_test
y_trainSVR = y_train
y_testSVR = y_test

In [16]:
from sklearn.svm import SVR

# Initialize the SVR model
svr = SVR(kernel='rbf', C=1.0, epsilon=0.1)

# Fit the model on the training data
svr.fit(X_trainSVR, y_trainSVR.values.ravel())  # SVR expects a 1D target, so flatten the one-column DataFrame

# Make predictions on the test set
y_predSVR = svr.predict(X_testSVR)

In [17]:
# Calculate MAE
maeSVR = mean_absolute_error(y_testSVR, y_predSVR)

# Print the MAE
print(f"Mean Absolute Error for the Support Vector Regression Model: {maeSVR}")

# Calculate Mean Squared Error for evaluation
mseSVR = mean_squared_error(y_testSVR, y_predSVR)
print(f"Mean Squared Error for the Support Vector Regression Model: {mseSVR}")

# Calculate the RMSE
rmseSVR = np.sqrt(mseSVR)
print("RMSE for the Support Vector Regression Model: ", rmseSVR)

Mean Absolute Error for the Support Vector Regression Model: 0.18286007431960155
Mean Squared Error for the Support Vector Regression Model: 0.08128681850700242
RMSE for the Support Vector Regression Model:  0.28510843289352633


# Model 5: Elastic Net Model

In [18]:
# Create model specific variables for the train/test set components
XEN = X
yEN = y
X_trainEN = X_train
X_testEN = X_test
y_trainEN = y_train
y_testEN = y_test

In [19]:
from sklearn.linear_model import ElasticNet

# Initialize the Elastic Net model
en = ElasticNet(alpha=1.0, l1_ratio=0.5)  

# Fit the model on the training data
en.fit(X_trainEN, y_trainEN)

# Make predictions on the test set
y_predEN = en.predict(X_testEN)

In [20]:
# Calculate MAE
maeEN = mean_absolute_error(y_testEN, y_predEN)

# Print the MAE
print(f"Mean Absolute Error for the Elastic Net Model: {maeEN}")

# Calculate Mean Squared Error for evaluation
mseEN = mean_squared_error(y_testEN, y_predEN)
print(f"Mean Squared Error for the Elastic Net Model: {mseEN}")

# Calculate the RMSE
rmseEN = np.sqrt(mseEN)
print("RMSE for the Elastic Net Model: ", rmseEN)

Mean Absolute Error for the Elastic Net Model: 0.4807004765238523
Mean Squared Error for the Elastic Net Model: 0.4763833374938104
RMSE for the Elastic Net Model:  0.6902052864864268
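The Elastic Net above uses fixed values, `alpha=1.0` and `l1_ratio=0.5`, and its errors are noticeably higher than those of the tree-based and SVR models. One option, not tried in this notebook, is to let cross-validation choose both settings with scikit-learn's `ElasticNetCV`. The sketch below runs on small synthetic data; the data and the `l1_ratio` grid are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)

# Synthetic stand-in for the training data: 200 rows, 20 features, 3 informative
X_demo = rng.normal(size=(200, 20))
y_demo = (X_demo[:, 0] - 0.5 * X_demo[:, 1] + 0.25 * X_demo[:, 2]
          + rng.normal(scale=0.1, size=200))

# Try a few l1_ratio values; candidate alphas are generated automatically
en_cv = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5, random_state=0)
en_cv.fit(X_demo, y_demo)

print("chosen alpha:", en_cv.alpha_)
print("chosen l1_ratio:", en_cv.l1_ratio_)
print("non-zero coefficients:", int(np.sum(en_cv.coef_ != 0)))
```

On the real data, the chosen `alpha_` would indicate whether `alpha=1.0` penalizes the coefficients too heavily for a standardized target.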


# Review model outcomes — Iterate over additional models as needed  

In [22]:
# Collect each model's metrics in a dictionary (all five models, including SVR)
table = {
    'Model': ['Linear Regression', 'Random Forest', 'XGBoost', 'SVR', 'Elastic Net'],
    'MAE': [maeLR, maeRF, maeGB, maeSVR, maeEN], 
    'MSE': [mseLR, mseRF, mseGB, mseSVR, mseEN],
    'RMSE': [rmseLR, rmseRF, rmseGB, rmseSVR, rmseEN]
}

# Create a DataFrame from the dictionary
data_table = pd.DataFrame(table)

# Display the data table
print(data_table)

               Model           MAE           MSE          RMSE
0  Linear Regression  4.465443e+08  9.467785e+19  9.730254e+09
1      Random Forest  1.794049e-01  8.327751e-02  2.885784e-01
2            XGBoost  1.920313e-01  8.294503e-02  2.880018e-01
3                SVR  1.828601e-01  8.128682e-02  2.851084e-01
4        Elastic Net  4.807005e-01  4.763833e-01  6.902053e-01
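Scientific notation makes the close scores hard to compare by eye. A small sketch of ranking the models is below; it is self-contained, so the metric values are copied from the printed output of each model cell (including the SVR cell), and the `MAE rank` column is an illustrative addition.

```python
import pandas as pd

# Metric values copied from the printed outputs of the model cells above
results = pd.DataFrame(
    {
        "Model": ["Linear Regression", "Random Forest", "XGBoost", "SVR", "Elastic Net"],
        "MAE": [446544337.98906857, 0.1794049411585514, 0.19203128125179927,
                0.18286007431960155, 0.4807004765238523],
        "RMSE": [9730254315.962427, 0.28857842355597324, 0.2880017921818501,
                 0.28510843289352633, 0.6902052864864268],
    }
)

# Sort by RMSE (lowest first) and record each model's rank on MAE
ranked = results.sort_values("RMSE").reset_index(drop=True)
ranked["MAE rank"] = ranked["MAE"].rank().astype(int)
print(ranked.to_string(float_format=lambda v: f"{v:.4g}"))
```

Sorted this way, the order by RMSE is SVR, XGBoost, Random Forest, Elastic Net, Linear Regression, with Random Forest ranked first on MAE.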


# Identify the final model that you think is the best model for this project  
Hint: the most powerful model isn’t always the best one to use. Other considerations
include computational complexity, scalability, and maintenance costs. 

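Across the five models fitted above, Support Vector Regression has the lowest MSE (0.0813) and RMSE (0.2851), Extreme Gradient Boosting is a close second on both (0.0829 and 0.2880), and Random Forest has the lowest MAE (0.1794). I will select Extreme Gradient Boosting as the final model, noting that its MSE and RMSE are within 0.003 of the SVR's.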
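Since the problem statement asks which features most affect the rating, a natural next step is to read feature importances from the selected tree-based model. The sketch below uses scikit-learn's `GradientBoostingRegressor` as a stand-in so it runs without XGBoost installed; `XGBRegressor` exposes a `feature_importances_` attribute as well. The data is synthetic and the `noise` column is hypothetical, so the ranking shown says nothing about the real coffee data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in: the target is built so that flavor > aroma > body in influence
X_demo = pd.DataFrame(
    rng.normal(size=(300, 5)),
    columns=["aftertaste", "aroma", "body", "flavor", "noise"],
)
y_demo = (3.0 * X_demo["flavor"] + 1.5 * X_demo["aroma"]
          + 0.5 * X_demo["body"] + rng.normal(scale=0.1, size=300))

model = GradientBoostingRegressor(random_state=0).fit(X_demo, y_demo)

# Pair each importance with its column name and keep the three largest
importances = pd.Series(model.feature_importances_, index=X_demo.columns)
top_three = importances.sort_values(ascending=False).head(3)
print(top_three)
```

Applied to the fitted model from this notebook, with `X_train.columns` as the index, the same pattern would list the three candidate features for the signature blend.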