## Predictive Modeling and Comparison of Results

This notebook builds several models to predict Weekly Sales, measures the accuracy of each one, and compares the results.

**Import** necessary **libraries**

In [15]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, mean_absolute_percentage_error
import warnings
warnings.filterwarnings("ignore")
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
import xgboost as xgb
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam


**Load In** the **Machine Learning** dataset 

In [4]:
df = pd.read_csv('ml_df.csv')
df

Unnamed: 0,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Store_2,Store_3,Store_4,Store_5,...,Store_45,year_sin,year_cos,month_sin,month_cos,day_sin,day_cos,Season_Spring,Season_Summer,Season_Winter
0,1643690.90,0,42.31,2.572,211.096358,8.106,0,0,0,0,...,0,-6.245670e-03,0.99998,5.000000e-01,-8.660254e-01,0.394356,0.918958,1,0,0
1,1641957.44,1,38.51,2.548,211.242170,8.106,0,0,0,0,...,0,-6.245670e-03,0.99998,8.660254e-01,5.000000e-01,0.433884,-0.900969,0,0,0
2,1611968.17,0,39.93,2.514,211.289143,8.106,0,0,0,0,...,0,-6.245670e-03,0.99998,8.660254e-01,5.000000e-01,-0.900969,-0.433884,0,0,1
3,1409727.59,0,46.63,2.561,211.319643,8.106,0,0,0,0,...,0,-6.245670e-03,0.99998,8.660254e-01,5.000000e-01,-0.433884,0.900969,0,0,1
4,1554806.68,0,46.50,2.625,211.350143,8.106,0,0,0,0,...,0,-6.245670e-03,0.99998,5.000000e-01,-8.660254e-01,0.571268,0.820763,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6430,713173.95,0,64.88,3.997,192.013558,8.684,0,0,0,0,...,1,-2.449294e-16,1.00000,-1.000000e+00,-1.836970e-16,-0.406737,0.913545,0,0,0
6431,733455.07,0,64.89,3.985,192.170412,8.667,0,0,0,0,...,1,-2.449294e-16,1.00000,5.000000e-01,-8.660254e-01,0.897805,-0.440394,1,0,0
6432,734464.36,0,54.47,4.000,192.327265,8.667,0,0,0,0,...,1,-2.449294e-16,1.00000,-2.449294e-16,1.000000e+00,0.897805,-0.440394,0,0,0
6433,718125.53,0,56.47,3.969,192.330854,8.667,0,0,0,0,...,1,-2.449294e-16,1.00000,-8.660254e-01,5.000000e-01,-0.651372,-0.758758,0,0,0


Split the df into **Features** (X) and **Target Variable** (y)

In [5]:
X = df.iloc[:, 1:].values
X

array([[ 0.   , 42.31 ,  2.572, ...,  1.   ,  0.   ,  0.   ],
       [ 1.   , 38.51 ,  2.548, ...,  0.   ,  0.   ,  0.   ],
       [ 0.   , 39.93 ,  2.514, ...,  0.   ,  0.   ,  1.   ],
       ...,
       [ 0.   , 54.47 ,  4.   , ...,  0.   ,  0.   ,  0.   ],
       [ 0.   , 56.47 ,  3.969, ...,  0.   ,  0.   ,  0.   ],
       [ 0.   , 58.85 ,  3.882, ...,  0.   ,  0.   ,  0.   ]])

In [6]:
y = df.iloc[:, 0].values
y

array([1643690.9 , 1641957.44, 1611968.17, ...,  734464.36,  718125.53,
        760281.43])

### Multiple Linear Regression

Splitting the Dataset Into **Training** and **Test** Sets

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

**Training** the model

In [6]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression()

**Predicting** the **Test** Set Results

In [7]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)

output = np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1)
formatted_output = [[f'{pred:.2f}' for pred in row] for row in output]

for row in formatted_output:
    print(row)


['2029926.52', '1870619.23']
['494411.69', '448391.99']
['1351701.85', '1272948.27']
['710760.78', '744969.42']
['395779.55', '325345.41']
['2159672.29', '2080529.06']
['526660.84', '528832.54']
['486167.02', '457504.35']
['989823.71', '921612.53']
['2117189.34', '2135982.79']
['2081049.11', '2052246.40']
['588890.20', '472450.81']
['522461.17', '478773.05']
['317858.04', '289667.55']
['631629.67', '639651.24']
['2057594.82', '2066219.30']
['853607.70', '917317.15']
['1335839.71', '1239466.97']
['1200276.95', '1219979.29']
['566619.49', '545840.05']
['796716.29', '720946.99']
['2064152.08', '1639585.61']
['1837491.36', '1811562.88']
['816188.71', '841224.74']
['499473.12', '502456.04']
['1800368.25', '1880691.64']
['1589528.38', '1582083.40']
['1221548.20', '1504545.94']
['393761.22', '437893.76']
['966968.03', '919503.40']
['1351436.49', '1230250.25']
['554563.89', '529515.66']
['1294715.54', '1303233.15']
['263926.04', '338400.82']
['1993573.52', '1853657.60']
['1407110.69', '1264117

Calculating Different **Accuracy** Measures

In [8]:
# Assuming y_test contains the actual values and y_pred contains the predicted values
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE; the squared=False argument is deprecated in newer scikit-learn releases
r2 = r2_score(y_test, y_pred)
mpe = mean_absolute_percentage_error(y_test, y_pred) * 100  # mean absolute percentage error (MAPE), in percent

print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("The r-squared score (R^2):", r2)
print("The Mean Percent Error (MPE):", mpe)

Mean Absolute Error (MAE): 93456.23921900992
Mean Squared Error (MSE): 26504870719.634964
Root Mean Squared Error (RMSE): 162803.16557006797
The r-squared score (R^2): 0.9197706842183846
The Mean Percent Error (MPE): 9.236949826826544
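
The same five metrics are recomputed after every model in this notebook. As a minimal sketch, they can be written out with NumPy alone, which also makes explicit that the value printed as "MPE" is a mean absolute percentage error. The helper name `regression_report` is hypothetical and not part of the original notebook, and the two-row input is made-up illustration data.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return MAE, MSE, RMSE, R^2 and MAPE (in percent) as a dict.

    Hypothetical helper; it mirrors the scikit-learn calls used in the notebook.
    """
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))           # average absolute miss
    mse = np.mean(errors ** 2)              # average squared miss
    rmse = np.sqrt(mse)                     # back in the units of the target
    ss_res = np.sum(errors ** 2)            # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot              # 1 is perfect, 0 matches a constant mean predictor
    mape = np.mean(np.abs(errors / y_true)) * 100  # assumes no target value is zero

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "MAPE_pct": mape}

# Tiny made-up example, not data from the notebook
report = regression_report([100.0, 200.0], [110.0, 180.0])
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

Calling one helper per model would also make it easier to collect the results into a single comparison table at the end.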


### Multiple Linear Regression with **Lasso Regularization**

**Instantiating**, **Training** and **Predicting** results

In [9]:
regressor = Lasso(alpha=1)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)

output = np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1)
formatted_output = [[f'{pred:.2f}' for pred in row] for row in output]

for row in formatted_output:
    print(row)

['2030677.33', '1870619.23']
['512459.06', '448391.99']
['1354140.33', '1272948.27']
['717866.48', '744969.42']
['368540.65', '325345.41']
['2145738.01', '2080529.06']
['532473.48', '528832.54']
['504110.05', '457504.35']
['994216.91', '921612.53']
['2108139.43', '2135982.79']
['2090727.78', '2052246.40']
['585046.82', '472450.81']
['532075.20', '478773.05']
['316171.55', '289667.55']
['627458.34', '639651.24']
['2054080.55', '2066219.30']
['875014.03', '917317.15']
['1336617.63', '1239466.97']
['1187017.82', '1219979.29']
['577011.03', '545840.05']
['798786.44', '720946.99']
['2046779.72', '1639585.61']
['1851557.23', '1811562.88']
['825573.20', '841224.74']
['510034.45', '502456.04']
['1793763.54', '1880691.64']
['1572302.62', '1582083.40']
['1227884.21', '1504545.94']
['398192.74', '437893.76']
['969484.33', '919503.40']
['1362908.80', '1230250.25']
['557205.96', '529515.66']
['1298127.88', '1303233.15']
['278337.59', '338400.82']
['1981927.20', '1853657.60']
['1398669.10', '1264117

Calculating Different **Accuracy** Measures

In [10]:
# Assuming y_test contains the actual values and y_pred contains the predicted values
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE; the squared=False argument is deprecated in newer scikit-learn releases
r2 = r2_score(y_test, y_pred)
mpe = mean_absolute_percentage_error(y_test, y_pred) * 100  # mean absolute percentage error (MAPE), in percent

print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("The r-squared score (R^2):", r2)
print("The Mean Percent Error (MPE):", mpe)

Mean Absolute Error (MAE): 92321.77491376526
Mean Squared Error (MSE): 26671304343.38257
Root Mean Squared Error (RMSE): 163313.51549514377
The r-squared score (R^2): 0.9192668954658353
The Mean Percent Error (MPE): 8.923841116914613


### Multiple Linear Regression with **Ridge Regression**

**Instantiating**, **Training** and **Predicting** results

In [11]:
regressor = Ridge(alpha = 1)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)

output = np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1)
formatted_output = [[f'{pred:.2f}' for pred in row] for row in output]

for row in formatted_output:
    print(row)

['2075360.09', '1870619.23']
['542106.19', '448391.99']
['1370530.00', '1272948.27']
['710021.85', '744969.42']
['373985.90', '325345.41']
['2164796.12', '2080529.06']
['535482.63', '528832.54']
['522760.48', '457504.35']
['1020071.40', '921612.53']
['2062883.82', '2135982.79']
['2075273.50', '2052246.40']
['587988.69', '472450.81']
['543904.12', '478773.05']
['303460.57', '289667.55']
['628766.79', '639651.24']
['2026394.60', '2066219.30']
['900221.40', '917317.15']
['1307013.34', '1239466.97']
['1186847.96', '1219979.29']
['613522.81', '545840.05']
['803418.75', '720946.99']
['2057777.32', '1639585.61']
['1869594.91', '1811562.88']
['893077.31', '841224.74']
['531856.24', '502456.04']
['1771154.43', '1880691.64']
['1475993.08', '1582083.40']
['1282843.28', '1504545.94']
['403461.23', '437893.76']
['988113.97', '919503.40']
['1380832.14', '1230250.25']
['568623.65', '529515.66']
['1334176.37', '1303233.15']
['311445.68', '338400.82']
['1967720.26', '1853657.60']
['1367053.06', '126411

Calculating Different **Accuracy** Measures

In [12]:
# Assuming y_test contains the actual values and y_pred contains the predicted values
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE; the squared=False argument is deprecated in newer scikit-learn releases
r2 = r2_score(y_test, y_pred)
mpe = mean_absolute_percentage_error(y_test, y_pred) * 100  # mean absolute percentage error (MAPE), in percent

print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("The r-squared score (R^2):", r2)
print("The Mean Percent Error (MPE):", mpe)

Mean Absolute Error (MAE): 93986.90813135507
Mean Squared Error (MSE): 28115563705.775635
Root Mean Squared Error (RMSE): 167676.9623585054
The r-squared score (R^2): 0.9148951729367324
The Mean Percent Error (MPE): 9.050746028078454


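Next steps. Both Lasso and Ridge above use `alpha=1`, which is the scikit-learn default. To find a better alpha, I need to run cross-validation over a range of values and pick the one that best balances underfitting and overfitting. Ranked by mean absolute percentage error (lowest first): Lasso (8.92%), Ridge (9.05%), plain linear regression (9.24%).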
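
One possible way to carry out that alpha search is with scikit-learn's `LassoCV` and `RidgeCV`, which cross-validate over a grid of alpha values internally. The sketch below is an illustration only: it runs on synthetic data from `make_regression` rather than on `ml_df.csv`, and the alpha grid, the 5 folds, and the added `StandardScaler` step are assumptions, not choices made in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: the real X and y would be used instead)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=20, n_informative=8, noise=15.0, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# Candidate alphas on a log scale (assumed grid)
alphas = np.logspace(-3, 3, 13)

# Scaling sits inside the pipeline so the penalty treats all features alike
lasso_cv = make_pipeline(StandardScaler(), LassoCV(alphas=alphas, cv=5, max_iter=10000))
ridge_cv = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5))

lasso_cv.fit(X_tr, y_tr)
ridge_cv.fit(X_tr, y_tr)

# The last pipeline step holds the alpha selected by cross-validation
best_lasso_alpha = lasso_cv[-1].alpha_
best_ridge_alpha = ridge_cv[-1].alpha_

print("Best Lasso alpha:", best_lasso_alpha, " test R^2:", lasso_cv.score(X_te, y_te))
print("Best Ridge alpha:", best_ridge_alpha, " test R^2:", ridge_cv.score(X_te, y_te))
```

The test set is used only once, after alpha has been chosen on the training folds, so the reported score is not biased by the search.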

### Polynomial Regression **(Degree = 2)**

**Reshaping** the **Target Variable** Column

In [57]:
y = y.reshape(len(y), 1)
y

array([[1643690.9 ],
       [1641957.44],
       [1611968.17],
       ...,
       [ 734464.36],
       [ 718125.53],
       [ 760281.43]])

Splitting the Dataset Into **Training** and **Test** Sets

In [58]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

**Scaling** the Data

In [59]:
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

**Training** the Model

In [60]:
poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(X_train)
regressor = LinearRegression()
regressor.fit(X_poly, y_train)

LinearRegression()

Making **Predictions**

In [63]:
X_poly_test = poly_reg.transform(X_test)  # Apply the same transformation to test data
y_pred = regressor.predict(X_poly_test)

**Measuring Accuracy**

In [64]:
# Assuming y_test contains the actual values and y_pred contains the predicted values
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE; the squared=False argument is deprecated in newer scikit-learn releases
r2 = r2_score(y_test, y_pred)
mpe = mean_absolute_percentage_error(y_test, y_pred) * 100  # mean absolute percentage error (MAPE), in percent

print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("The r-squared score (R^2):", r2)
print("The Mean Percent Error (MPE):", mpe)

Mean Absolute Error (MAE): 70210.19652680654
Mean Squared Error (MSE): 11797974088.0625
Root Mean Squared Error (RMSE): 108618.47949618196
The r-squared score (R^2): 0.9642879454607837
The Mean Percent Error (MPE): 8.505344225017039


### Polynomial Regression **(Degree = 3)**

In [19]:
X

array([[ 0.  , 42.31,  2.57, ...,  1.  ,  0.  ,  0.  ],
       [ 1.  , 38.51,  2.55, ...,  0.  ,  0.  ,  0.  ],
       [ 0.  , 39.93,  2.51, ...,  0.  ,  0.  ,  1.  ],
       ...,
       [ 0.  , 54.47,  4.  , ...,  0.  ,  0.  ,  0.  ],
       [ 0.  , 56.47,  3.97, ...,  0.  ,  0.  ,  0.  ],
       [ 0.  , 58.85,  3.88, ...,  0.  ,  0.  ,  0.  ]])

In [20]:
y = y.reshape(len(y), 1)
y

array([[1643690.9 ],
       [1641957.44],
       [1611968.17],
       ...,
       [ 734464.36],
       [ 718125.53],
       [ 760281.43]])

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [65]:
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

In [66]:
poly_reg = PolynomialFeatures(degree=3)
X_poly = poly_reg.fit_transform(X_train)
regressor = LinearRegression()
regressor.fit(X_poly, y_train)

LinearRegression()

In [67]:
X_poly_test = poly_reg.transform(X_test)  # Apply the same transformation to test data
y_pred = regressor.predict(X_poly_test)

In [68]:
# Assuming y_test contains the actual values and y_pred contains the predicted values
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE; the squared=False argument is deprecated in newer scikit-learn releases
r2 = r2_score(y_test, y_pred)
mpe = mean_absolute_percentage_error(y_test, y_pred) * 100  # mean absolute percentage error (MAPE), in percent

print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("The r-squared score (R^2):", r2)
print("The Mean Percent Error (MPE):", mpe)

Mean Absolute Error (MAE): 418418109.9586359
Mean Squared Error (MSE): 6.443587613610712e+18
Root Mean Squared Error (RMSE): 2538422268.5776124
The r-squared score (R^2): -19504513.128262054
The Mean Percent Error (MPE): 69125.77232994547


### Due to memory issues, I was not able to run a Polynomial Regression with **Degree >= 4** on the full feature set. Dropping all the *Store* columns reduced the feature count enough to run higher-degree Polynomial Regressions
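
To see why the degree matters so much for memory, note that `PolynomialFeatures(degree=d)` applied to `n` input features produces C(n + d, d) columns, bias column included. The sketch below only does that arithmetic. The counts of 58 features with the *Store* columns and 14 without are read off the column headers shown in this notebook (some columns are elided in the printout, so 58 is an assumption), the 5148 training rows assume an 80% split of 6435 rows, and the helper names are hypothetical.

```python
from math import comb

def poly_feature_count(n_features, degree):
    """Columns produced by a full polynomial expansion, bias term included."""
    return comb(n_features + degree, degree)

def dense_size_gb(n_rows, n_columns, bytes_per_value=8):
    """Approximate size of a dense float64 matrix in gigabytes."""
    return n_rows * n_columns * bytes_per_value / 1e9

N_WITH_STORES = 58      # assumed from the printed header (Store_2..Store_45 included)
N_WITHOUT_STORES = 14   # columns of df_new minus the target
N_TRAIN_ROWS = 5148     # assumed: 80% of 6435 rows

for degree in (2, 3, 4):
    for label, n in (("with stores", N_WITH_STORES), ("without stores", N_WITHOUT_STORES)):
        cols = poly_feature_count(n, degree)
        size = dense_size_gb(N_TRAIN_ROWS, cols)
        print(f"degree={degree} {label}: {cols} columns, about {size:.2f} GB")
```

Under these assumptions the degree-4 expansion of the full feature set needs several hundred thousand columns, while the reduced set stays in the low thousands.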

## Polynomial Regression with new DF (Degree = 4)

**Create the new dataset**

In [71]:
df_new = df.drop(df.columns[df.columns.str.startswith('Store')], axis=1)

In [72]:
df_new

Unnamed: 0,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,year_sin,year_cos,month_sin,month_cos,day_sin,day_cos,Season_Spring,Season_Summer,Season_Winter
0,1643690.90,0,42.31,2.572,211.096358,8.106,-6.245670e-03,0.99998,5.000000e-01,-8.660254e-01,0.394356,0.918958,1,0,0
1,1641957.44,1,38.51,2.548,211.242170,8.106,-6.245670e-03,0.99998,8.660254e-01,5.000000e-01,0.433884,-0.900969,0,0,0
2,1611968.17,0,39.93,2.514,211.289143,8.106,-6.245670e-03,0.99998,8.660254e-01,5.000000e-01,-0.900969,-0.433884,0,0,1
3,1409727.59,0,46.63,2.561,211.319643,8.106,-6.245670e-03,0.99998,8.660254e-01,5.000000e-01,-0.433884,0.900969,0,0,1
4,1554806.68,0,46.50,2.625,211.350143,8.106,-6.245670e-03,0.99998,5.000000e-01,-8.660254e-01,0.571268,0.820763,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6430,713173.95,0,64.88,3.997,192.013558,8.684,-2.449294e-16,1.00000,-1.000000e+00,-1.836970e-16,-0.406737,0.913545,0,0,0
6431,733455.07,0,64.89,3.985,192.170412,8.667,-2.449294e-16,1.00000,5.000000e-01,-8.660254e-01,0.897805,-0.440394,1,0,0
6432,734464.36,0,54.47,4.000,192.327265,8.667,-2.449294e-16,1.00000,-2.449294e-16,1.000000e+00,0.897805,-0.440394,0,0,0
6433,718125.53,0,56.47,3.969,192.330854,8.667,-2.449294e-16,1.00000,-8.660254e-01,5.000000e-01,-0.651372,-0.758758,0,0,0


Create the **New X-Values**

In [73]:
X_new = df_new.iloc[:, 1:].values
X_new

array([[ 0.  , 42.31,  2.57, ...,  1.  ,  0.  ,  0.  ],
       [ 1.  , 38.51,  2.55, ...,  0.  ,  0.  ,  0.  ],
       [ 0.  , 39.93,  2.51, ...,  0.  ,  0.  ,  1.  ],
       ...,
       [ 0.  , 54.47,  4.  , ...,  0.  ,  0.  ,  0.  ],
       [ 0.  , 56.47,  3.97, ...,  0.  ,  0.  ,  0.  ],
       [ 0.  , 58.85,  3.88, ...,  0.  ,  0.  ,  0.  ]])

Create the **New y-Values**

In [74]:
y_new = df_new.iloc[:, 0].values
y_new

array([1643690.9 , 1641957.44, 1611968.17, ...,  734464.36,  718125.53,
        760281.43])

In [75]:
y_new = y_new.reshape(len(y_new), 1)
y_new

array([[1643690.9 ],
       [1641957.44],
       [1611968.17],
       ...,
       [ 734464.36],
       [ 718125.53],
       [ 760281.43]])

In [76]:
X_train, X_test, y_train, y_test = train_test_split(X_new, y_new, test_size=0.2, random_state=1)


In [77]:
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test= sc_X.transform(X_test)

In [78]:
poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X_train)
regressor = LinearRegression()
regressor.fit(X_poly, y_train)

LinearRegression()

In [83]:
y_pred = regressor.predict(poly_reg.transform(X_test))

In [84]:
# Assuming y_test contains the actual values and y_pred contains the predicted values
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE; the squared=False argument is deprecated in newer scikit-learn releases
r2 = r2_score(y_test, y_pred)
mpe = mean_absolute_percentage_error(y_test, y_pred) * 100  # mean absolute percentage error (MAPE), in percent

print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("The r-squared score (R^2):", r2)
print("The Mean Percent Error (MPE):", mpe)

Mean Absolute Error (MAE): 518767.88031857036
Mean Squared Error (MSE): 420690238833.4288
Root Mean Squared Error (RMSE): 648606.3820480251
The r-squared score (R^2): -0.2734146253581575
The Mean Percent Error (MPE): 72.31253932994359


## Polynomial Regression with new DF (Degree = 5)

In [85]:
X_new = df_new.iloc[:, 1:].values
X_new

array([[ 0.  , 42.31,  2.57, ...,  1.  ,  0.  ,  0.  ],
       [ 1.  , 38.51,  2.55, ...,  0.  ,  0.  ,  0.  ],
       [ 0.  , 39.93,  2.51, ...,  0.  ,  0.  ,  1.  ],
       ...,
       [ 0.  , 54.47,  4.  , ...,  0.  ,  0.  ,  0.  ],
       [ 0.  , 56.47,  3.97, ...,  0.  ,  0.  ,  0.  ],
       [ 0.  , 58.85,  3.88, ...,  0.  ,  0.  ,  0.  ]])

In [86]:
y_new = df_new.iloc[:, 0].values
y_new

array([1643690.9 , 1641957.44, 1611968.17, ...,  734464.36,  718125.53,
        760281.43])

In [87]:
y_new = y_new.reshape(len(y_new), 1)
y_new

array([[1643690.9 ],
       [1641957.44],
       [1611968.17],
       ...,
       [ 734464.36],
       [ 718125.53],
       [ 760281.43]])

In [88]:
X_train, X_test, y_train, y_test = train_test_split(X_new, y_new, test_size=0.2, random_state=1)
sc_X = StandardScaler()

X_train = sc_X.fit_transform(X_train)
X_test= sc_X.transform(X_test)

poly_reg = PolynomialFeatures(degree=5)
X_poly = poly_reg.fit_transform(X_train)

regressor = LinearRegression()
regressor.fit(X_poly, y_train)

y_pred = regressor.predict(poly_reg.transform(X_test))


In [89]:
# Assuming y_test contains the actual values and y_pred contains the predicted values
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE; the squared=False argument is deprecated in newer scikit-learn releases
r2 = r2_score(y_test, y_pred)
mpe = mean_absolute_percentage_error(y_test, y_pred) * 100  # mean absolute percentage error (MAPE), in percent

print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("The r-squared score (R^2):", r2)
print("The Mean Percent Error (MPE):", mpe)

Mean Absolute Error (MAE): 210574933.1796839
Mean Squared Error (MSE): 4.107206078619035e+18
Root Mean Squared Error (RMSE): 2026624306.2341464
The r-squared score (R^2): -12432368.014257735
The Mean Percent Error (MPE): 28005.153083665056


## Polynomial Regression with new DF (Degree = 6)

In [41]:
X_new

array([[ 0.  , 42.31,  2.57, ...,  1.  ,  0.  ,  0.  ],
       [ 1.  , 38.51,  2.55, ...,  0.  ,  0.  ,  0.  ],
       [ 0.  , 39.93,  2.51, ...,  0.  ,  0.  ,  1.  ],
       ...,
       [ 0.  , 54.47,  4.  , ...,  0.  ,  0.  ,  0.  ],
       [ 0.  , 56.47,  3.97, ...,  0.  ,  0.  ,  0.  ],
       [ 0.  , 58.85,  3.88, ...,  0.  ,  0.  ,  0.  ]])

In [42]:
y_new

array([[1643690.9 ],
       [1641957.44],
       [1611968.17],
       ...,
       [ 734464.36],
       [ 718125.53],
       [ 760281.43]])

In [43]:
y_new = y_new.reshape(len(y_new), 1)
y_new

array([[1643690.9 ],
       [1641957.44],
       [1611968.17],
       ...,
       [ 734464.36],
       [ 718125.53],
       [ 760281.43]])

In [54]:
X_train, X_test, y_train, y_test = train_test_split(X_new, y_new, test_size=0.2, random_state=1)

sc_X = StandardScaler()

X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)  # Apply the same scaling transformation to test data

poly_reg = PolynomialFeatures(degree=6)
X_poly_train = poly_reg.fit_transform(X_train)

regressor = LinearRegression()
regressor.fit(X_poly_train, y_train)

X_poly_test = poly_reg.transform(X_test)  # Apply the same transformation to test data
y_pred = regressor.predict(X_poly_test)



In [55]:
# Assuming y_test contains the actual values and y_pred contains the predicted values
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE; the squared=False argument is deprecated in newer scikit-learn releases
r2 = r2_score(y_test, y_pred)
mpe = mean_absolute_percentage_error(y_test, y_pred) * 100  # mean absolute percentage error (MAPE), in percent

print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("The r-squared score (R^2):", r2)
print("The Mean Percent Error (MPE):", mpe)

Mean Absolute Error (MAE): 722135706.7603116
Mean Squared Error (MSE): 1.9353911508419895e+19
Root Mean Squared Error (RMSE): 4399308071.551695
The r-squared score (R^2): -58583611.591182075
The Mean Percent Error (MPE): 98850.94908413697


## Support Vector Machine 

In [91]:
X

array([[ 0.  , 42.31,  2.57, ...,  1.  ,  0.  ,  0.  ],
       [ 1.  , 38.51,  2.55, ...,  0.  ,  0.  ,  0.  ],
       [ 0.  , 39.93,  2.51, ...,  0.  ,  0.  ,  1.  ],
       ...,
       [ 0.  , 54.47,  4.  , ...,  0.  ,  0.  ,  0.  ],
       [ 0.  , 56.47,  3.97, ...,  0.  ,  0.  ,  0.  ],
       [ 0.  , 58.85,  3.88, ...,  0.  ,  0.  ,  0.  ]])

In [93]:
y = y.reshape(len(y), 1)
y

array([[1643690.9 ],
       [1641957.44],
       [1611968.17],
       ...,
       [ 734464.36],
       [ 718125.53],
       [ 760281.43]])

Splitting the data into **Training** and **Test** sets

In [154]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)

**Scaling** the Data

In [155]:
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Applying **PCA**

In [159]:
# Create PCA object with desired number of components or explained variance ratio
pca = PCA(n_components=0.95)  # Retain 95% of the variance

# Apply PCA to the scaled training data
X_train = pca.fit_transform(X_train)

# Transform the test data using the same PCA object
X_test = pca.transform(X_test)


In [181]:
regressor = SVR(kernel='poly', C=182)
regressor.fit(X_train, y_train.ravel())  # SVR expects a 1-D target, so flatten the column vector

SVR(C=182, kernel='poly')

Getting the **Predicted** Sales Values

In [182]:
y_pred = regressor.predict(X_test)

# Evaluate the model performance using appropriate metrics or techniques


**Calculating Accuracies**

In [183]:
# Assuming y_test contains the actual values and y_pred contains the predicted values
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE; the squared=False argument is deprecated in newer scikit-learn releases
r2 = r2_score(y_test, y_pred)
mpe = mean_absolute_percentage_error(y_test, y_pred) * 100  # mean absolute percentage error (MAPE), in percent

print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("The r-squared score (R^2):", r2)
print("The Mean Percent Error (MPE):", mpe)

Mean Absolute Error (MAE): 449381.2924838185
Mean Squared Error (MSE): 313525776091.80927
Root Mean Squared Error (RMSE): 559933.7247316054
The r-squared score (R^2): 0.008025822046951347
The Mean Percent Error (MPE): 58.400793968718176


**Cross Validating** the **C** Hyperparameter to find the optimal value. 

In [180]:
# Define a list of C values to try
c_values = np.arange(100, 200, 1)

# Create an empty list to store the cross-validation scores
cv_scores = []

# Iterate over each C value
for c in c_values:
    # Initialize the SVR regressor with the current C value
    regressor = SVR(kernel='poly', C=c)
    
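    # Perform 15-fold cross-validation and compute the mean score.
    # The default score for a regressor is R^2. Note that X here is the raw feature
    # matrix, not the scaled, PCA-transformed X_train used to fit the model above.
    scores = cross_val_score(regressor, X, y.ravel(), cv=15)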
    mean_score = scores.mean()
    
    # Append the mean score to the cv_scores list
    cv_scores.append(mean_score)

# Print the C values and their corresponding cross-validation scores
for c, score in sorted(zip(c_values, cv_scores), key=lambda x: x[1], reverse=True):
    print("C =", c, "  Mean Score =", score)



C = 182   Mean Score = -0.8752646445042088
C = 181   Mean Score = -0.8753125887377998
C = 183   Mean Score = -0.8753313812632117
C = 180   Mean Score = -0.8753361349919314
C = 184   Mean Score = -0.8753914656460637
C = 179   Mean Score = -0.875461222161352
C = 178   Mean Score = -0.8755365740582394
C = 185   Mean Score = -0.8756462392558976
C = 187   Mean Score = -0.8756606476779157
C = 186   Mean Score = -0.8756727789487141
C = 177   Mean Score = -0.8756798841292288
C = 148   Mean Score = -0.8756848044498454
C = 195   Mean Score = -0.8757456765657281
C = 196   Mean Score = -0.875749232195993
C = 176   Mean Score = -0.8757818730779101
C = 197   Mean Score = -0.8757891418077846
C = 149   Mean Score = -0.8757901576390673
C = 173   Mean Score = -0.8758030235465087
C = 164   Mean Score = -0.8758199120875886
C = 194   Mean Score = -0.875832398201464
C = 147   Mean Score = -0.8758367198850167
C = 175   Mean Score = -0.8758555033684814
C = 198   Mean Score = -0.8758589286874626
C = 188   Mean
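
Every mean score above is negative. Since the default score is R^2, that means each candidate did worse than predicting a constant, so the ranking of C values carries little information. A likely contributor is that the search runs on unscaled features and an unscaled target. One way to keep scaling and PCA inside each fold is to wrap them in a `Pipeline` and scale the target with `TransformedTargetRegressor`, then search with `GridSearchCV`. This is a sketch under stated assumptions: the data is synthetic, the C grid and 5 folds are arbitrary, and the step names are made up for the example.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data with a large-valued target, loosely like weekly sales
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
y_demo = y_demo * 1e4 + 1e6

# Feature scaling and PCA are refit on the training part of every fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("svr", SVR(kernel="poly")),
])

# Scale the target too; predictions are mapped back to the original units
model = TransformedTargetRegressor(regressor=pipe, transformer=StandardScaler())

c_grid = [0.1, 1, 10, 100]  # assumed grid, not the notebook's 100..199 range
search = GridSearchCV(model, {"regressor__svr__C": c_grid}, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

best_c = search.best_params_["regressor__svr__C"]
print("Best C:", best_c, " mean CV R^2:", search.best_score_)
```

Because the preprocessing lives inside the estimator, no information from a validation fold leaks into the scaler or the PCA fit.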

## Decision Tree

In [185]:
X

array([[ 0.  , 42.31,  2.57, ...,  1.  ,  0.  ,  0.  ],
       [ 1.  , 38.51,  2.55, ...,  0.  ,  0.  ,  0.  ],
       [ 0.  , 39.93,  2.51, ...,  0.  ,  0.  ,  1.  ],
       ...,
       [ 0.  , 54.47,  4.  , ...,  0.  ,  0.  ,  0.  ],
       [ 0.  , 56.47,  3.97, ...,  0.  ,  0.  ,  0.  ],
       [ 0.  , 58.85,  3.88, ...,  0.  ,  0.  ,  0.  ]])

In [187]:
y
# if y prints as a flat 1-D array rather than a column, reshape it with the code below:
# y = y.reshape(len(y), 1)
# y

array([[1643690.9 ],
       [1641957.44],
       [1611968.17],
       ...,
       [ 734464.36],
       [ 718125.53],
       [ 760281.43]])

Splitting the dataset into a **Training** and **Test** Set

In [188]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

**Training** the Model 

In [190]:
regressor = DecisionTreeRegressor(random_state=1)
regressor.fit(X_train, y_train)

DecisionTreeRegressor(random_state=1)

**Predicting** Results

In [191]:
y_pred = regressor.predict(X_test)

**Calculating** Accuracy Measures

In [192]:
# Assuming y_test contains the actual values and y_pred contains the predicted values
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE; the squared=False argument is deprecated in newer scikit-learn releases
r2 = r2_score(y_test, y_pred)
mpe = mean_absolute_percentage_error(y_test, y_pred) * 100  # mean absolute percentage error (MAPE), in percent

print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("The r-squared score (R^2):", r2)
print("The Mean Percent Error (MPE):", mpe)

Mean Absolute Error (MAE): 81600.31427350428
Mean Squared Error (MSE): 29159237064.1139
Root Mean Squared Error (RMSE): 170760.75973160198
The r-squared score (R^2): 0.911736010218125
The Mean Percent Error (MPE): 7.0629711461135045


## Random Forest

In [193]:
X

array([[ 0.  , 42.31,  2.57, ...,  1.  ,  0.  ,  0.  ],
       [ 1.  , 38.51,  2.55, ...,  0.  ,  0.  ,  0.  ],
       [ 0.  , 39.93,  2.51, ...,  0.  ,  0.  ,  1.  ],
       ...,
       [ 0.  , 54.47,  4.  , ...,  0.  ,  0.  ,  0.  ],
       [ 0.  , 56.47,  3.97, ...,  0.  ,  0.  ,  0.  ],
       [ 0.  , 58.85,  3.88, ...,  0.  ,  0.  ,  0.  ]])

In [195]:
y
# if the y column shape looks more horizontal than vertical, use the code below:
# y = y.reshape(len(y), 1)
# y

array([[1643690.9 ],
       [1641957.44],
       [1611968.17],
       ...,
       [ 734464.36],
       [ 718125.53],
       [ 760281.43]])

Splitting the dataset into a **Training** and **Test** Set

In [196]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

**Training** the Model

In [199]:
regressor = RandomForestRegressor(n_estimators=10, random_state=1)
regressor.fit(X_train, y_train.ravel())  # flatten the column vector to the 1-D target the forest expects

RandomForestRegressor(n_estimators=10, random_state=1)

Making **Predictions** based on our **model**

In [200]:
y_pred = regressor.predict(X_test)

Calculating **Accuracies** based on **y_pred**

In [201]:
# Assuming y_test contains the actual values and y_pred contains the predicted values
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE; the squared=False argument is deprecated in newer scikit-learn releases
r2 = r2_score(y_test, y_pred)
mpe = mean_absolute_percentage_error(y_test, y_pred) * 100  # mean absolute percentage error (MAPE), in percent

print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("The r-squared score (R^2):", r2)
print("The Mean Percent Error (MPE):", mpe)

Mean Absolute Error (MAE): 66086.93394327896
Mean Squared Error (MSE): 19130331246.069675
Root Mean Squared Error (RMSE): 138312.44067714832
The r-squared score (R^2): 0.9420931570358188
The Mean Percent Error (MPE): 5.677558722964053


## K-Nearest Neighbors Regression

In [228]:
y
# if the y column shape looks more horizontal than vertical, use the code below:
# y = y.reshape(len(y), 1)
# y

array([[1643690.9 ],
       [1641957.44],
       [1611968.17],
       ...,
       [ 734464.36],
       [ 718125.53],
       [ 760281.43]])

**Scaling** our Data

In [229]:
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Applying **PCA** to deal with **Multicollinearity**

In [233]:
pca = PCA(n_components=0.95)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
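
As a small illustration of what `PCA(n_components=0.95)` does here, the sketch below builds a deliberately collinear synthetic matrix (three independent signals plus three noisy near-copies, all made up for the example), then checks how many components are kept and that the resulting components are uncorrelated with each other. It does not use the notebook's data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
base = rng.normal(size=(500, 3))
# Six columns: three independent signals plus three noisy near-copies (collinear)
X_demo = np.hstack([base, base + 0.05 * rng.normal(size=(500, 3))])

# PCA is sensitive to scale, so standardize first (as the notebook does)
X_scaled = StandardScaler().fit_transform(X_demo)

# A float in (0, 1) keeps the fewest components that explain at least that
# share of the total variance
pca_demo = PCA(n_components=0.95)
X_reduced = pca_demo.fit_transform(X_scaled)

kept = pca_demo.n_components_
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# Largest absolute correlation between two different components
corr = np.corrcoef(X_reduced, rowvar=False)
max_offdiag = np.abs(corr - np.eye(kept)).max()

print("Components kept:", kept, "out of", X_demo.shape[1])
print("Cumulative explained variance:", cumulative)
print("Largest off-diagonal correlation:", max_offdiag)
```

The near-duplicate columns collapse into shared components, which is the property that helps distance-based models such as KNN when the inputs are strongly correlated.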

**Training** the Model

In [234]:
regressor = KNeighborsRegressor(n_neighbors=5) 
regressor.fit(X_train, y_train)

KNeighborsRegressor()

Making **Predictions** with the Model

In [235]:
y_pred = regressor.predict(X_test)

Calculating **Accuracies**

In [236]:
# Assuming y_test contains the actual values and y_pred contains the predicted values
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE; the squared=False argument is deprecated in newer scikit-learn releases
r2 = r2_score(y_test, y_pred)
mpe = mean_absolute_percentage_error(y_test, y_pred) * 100  # mean absolute percentage error (MAPE), in percent

print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("The r-squared score (R^2):", r2)
print("The Mean Percent Error (MPE):", mpe)

Mean Absolute Error (MAE): 90256.29191142191
Mean Squared Error (MSE): 32989499586.639812
Root Mean Squared Error (RMSE): 181630.1175098442
The r-squared score (R^2): 0.9001419396528773
The Mean Percent Error (MPE): 7.965267150603703


## XGBoost

In [7]:
X

array([[ 0.   , 42.31 ,  2.572, ...,  1.   ,  0.   ,  0.   ],
       [ 1.   , 38.51 ,  2.548, ...,  0.   ,  0.   ,  0.   ],
       [ 0.   , 39.93 ,  2.514, ...,  0.   ,  0.   ,  1.   ],
       ...,
       [ 0.   , 54.47 ,  4.   , ...,  0.   ,  0.   ,  0.   ],
       [ 0.   , 56.47 ,  3.969, ...,  0.   ,  0.   ,  0.   ],
       [ 0.   , 58.85 ,  3.882, ...,  0.   ,  0.   ,  0.   ]])

In [10]:
# y
# If y is a 1-D array of shape (n,) rather than a column vector of shape (n, 1), reshape it with the code below:
y = y.reshape(len(y), 1)
y

array([[1643690.9 ],
       [1641957.44],
       [1611968.17],
       ...,
       [ 734464.36],
       [ 718125.53],
       [ 760281.43]])

Creating **Training** and **Test** Sets

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

**Instantiating** and **Fitting** the Model

In [12]:
regressor = xgb.XGBRegressor()
regressor.fit(X_train, y_train)

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=None, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             n_estimators=100, n_jobs=None, num_parallel_tree=None,
             predictor=None, random_state=None, ...)

Making **Predictions**

In [13]:
y_pred = regressor.predict(X_test)

Calculating **Accuracies**

In [14]:
# Assuming y_test contains the actual values and y_pred contains the predicted values
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE; avoids the deprecated squared=False argument
r2 = r2_score(y_test, y_pred)
mpe = mean_absolute_percentage_error(y_test, y_pred) * 100  # mean absolute percentage error, as a percentage

print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("The r-squared score (R^2):", r2)
print("The Mean Percent Error (MPE):", mpe)

Mean Absolute Error (MAE): 54668.28489996115
Mean Squared Error (MSE): 9601350997.846226
Root Mean Squared Error (RMSE): 97986.48375080222
The r-squared score (R^2): 0.9709370466551386
The Mean Percent Error (MPE): 5.932926537429619


## Neural Network

In [16]:
X

array([[ 0.   , 42.31 ,  2.572, ...,  1.   ,  0.   ,  0.   ],
       [ 1.   , 38.51 ,  2.548, ...,  0.   ,  0.   ,  0.   ],
       [ 0.   , 39.93 ,  2.514, ...,  0.   ,  0.   ,  1.   ],
       ...,
       [ 0.   , 54.47 ,  4.   , ...,  0.   ,  0.   ,  0.   ],
       [ 0.   , 56.47 ,  3.969, ...,  0.   ,  0.   ,  0.   ],
       [ 0.   , 58.85 ,  3.882, ...,  0.   ,  0.   ,  0.   ]])

In [17]:
y
# If y is a 1-D array of shape (n,) rather than a column vector of shape (n, 1), reshape it with the code below:
# y = y.reshape(len(y), 1)
# y

array([[1643690.9 ],
       [1641957.44],
       [1611968.17],
       ...,
       [ 734464.36],
       [ 718125.53],
       [ 760281.43]])

Creating **Training** and **Test** Sets

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

**Scaling** X_train and X_test

In [19]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

**Define** the **Neural Network Architecture**

In [21]:
model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(X_train.shape[1],)))
model.add(Dense(64, activation='relu'))
model.add(Dense(1))
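
To get a feel for the size of this architecture, the sketch below counts the trainable parameters of a stack of fully connected layers: each `Dense` layer has one weight per input-output pair plus one bias per unit. The helper `dense_param_count` is hypothetical, and the input width of 10 is a placeholder. In the notebook the real value would be `X_train.shape[1]`.

```python
def dense_param_count(n_inputs, layer_units):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    for units in layer_units:
        total += n_inputs * units + units  # weight matrix plus one bias per unit
        n_inputs = units                   # this layer's outputs feed the next layer
    return total

# Layer widths match the model above: 64, 64, then a single output unit.
# The input width of 10 is a placeholder, not the notebook's actual feature count.
print(dense_param_count(10, [64, 64, 1]))
```

Once the model has been built, `model.summary()` reports the same kind of per-layer counts for the actual input width.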

**Compiling** the Model

In [23]:
model.compile(loss='mean_squared_error', optimizer=Adam())

**Train** the Model

In [24]:
model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=1)

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x24bcbf4ce20>

**Predicting** results

In [25]:
y_pred = model.predict(X_test)



Calculating **Accuracies**

In [26]:
# Assuming y_test contains the actual values and y_pred contains the predicted values
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE; avoids the deprecated squared=False argument
r2 = r2_score(y_test, y_pred)
mpe = mean_absolute_percentage_error(y_test, y_pred) * 100  # mean absolute percentage error, as a percentage

print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("The r-squared score (R^2):", r2)
print("The Mean Percent Error (MPE):", mpe)

Mean Absolute Error (MAE): 115812.98732416185
Mean Squared Error (MSE): 33199767923.24145
Root Mean Squared Error (RMSE): 182208.03473843148
The r-squared score (R^2): 0.8995054647590913
The Mean Percent Error (MPE): 15.694885441840118


## A Comparison of R2 Scores

| Model | R2 Score |
| :----: | :---: |
| XGBoost | 0.9709370466551386 |
| Polynomial Regression (Degree=2) | 0.9642879454607837 |
| Random Forest | 0.9420931570358188 |
| Multiple Linear Regression | 0.9197706842183846 |
| Multiple Linear Regression with Lasso | 0.9192668954658353 |
| Multiple Linear Regression with Ridge | 0.9148951729367324 |
| Decision Tree | 0.911736010218125 |
| K-Nearest Neighbors | 0.9001419396528773 |
| Neural Network | 0.8995054647590913 |
| Support Vector Machine | 0.008025822046951347 |
| Polynomial Regression (Degree=4) | -0.2734146253581575 |
| Polynomial Regression (Degree=5) | -12,432,368.014257735 |
| Polynomial Regression (Degree=3) | -19,504,513.128262054 |
| Polynomial Regression (Degree=6) | -58,583,611.591182075 |
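
The large negative scores at the bottom of the table are possible because R2 has no lower bound: it drops below zero whenever a model's squared error exceeds that of always predicting the mean of the test targets. The sketch below demonstrates this with made-up numbers. The helper `r2` and both prediction arrays are hypothetical, and it does not diagnose why the higher-degree polynomial fits in this notebook diverged.

```python
import numpy as np

def r2(y_true, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])          # hypothetical targets
mean_only = np.full_like(y_true, y_true.mean())  # always predict the mean
wild = np.array([100.0, -50.0, 80.0, -90.0])     # hypothetical exploding predictions

print(r2(y_true, mean_only))  # zero: no better than predicting the mean
print(r2(y_true, wild))       # far below zero
```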


## A Comparison of Mean Percent Error Scores

| Model | MPE Score (%) |
| :----: | :---: |
| Random Forest | 5.677558722964053 |
| XGBoost | 5.932926537429619 |
| Decision Tree | 7.0629711461135045 |
| K-Nearest Neighbors | 7.965267150603703 |
| Polynomial Regression (Degree=2) | 8.505344225017039 |
| Multiple Linear Regression with Lasso | 8.923841116914613 |
| Multiple Linear Regression with Ridge | 9.050746028078454 |
| Multiple Linear Regression | 9.236949826826544 |
| Neural Network | 15.694885441840118 |
| Support Vector Machine | 58.400793968718176 |
| Polynomial Regression (Degree=4) | 72.31253932994359 |
| Polynomial Regression (Degree=5) | 28,005.153083665056 |
| Polynomial Regression (Degree=3) | 69,125.77232994547 |
| Polynomial Regression (Degree=6) | 98,850.94908413697 |
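
The two tables rank the models slightly differently, since a higher R2 is better while a lower percentage error is better. As a small sketch of how the rankings can be compared side by side, the snippet below sorts a handful of models on both metrics. The scores are copied from the tables above and rounded to four decimals. The dictionary layout and key names are illustrative choices.

```python
# Scores copied from the comparison tables, rounded to four decimals
scores = {
    "XGBoost": {"r2": 0.9709, "mpe": 5.9329},
    "Random Forest": {"r2": 0.9421, "mpe": 5.6776},
    "Polynomial Regression (Degree=2)": {"r2": 0.9643, "mpe": 8.5053},
    "K-Nearest Neighbors": {"r2": 0.9001, "mpe": 7.9653},
    "Neural Network": {"r2": 0.8995, "mpe": 15.6949},
}

by_r2 = sorted(scores, key=lambda m: scores[m]["r2"], reverse=True)  # higher is better
by_mpe = sorted(scores, key=lambda m: scores[m]["mpe"])              # lower is better

for name in scores:
    print(f"{name}: R2 rank {by_r2.index(name) + 1}, MPE rank {by_mpe.index(name) + 1}")
```

In this subset, XGBoost leads on R2 and Random Forest leads on percentage error, so the preferred model depends on which metric matters more for the use case.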
