## A6.4 Supervised Machine Learning - Regression
### This script contains the following:

#### 1. Import libraries, additional requirements and data
#### 2. Data consistency checks
#### 3. Data prep for regression analysis
#### 4. Regression Analysis
--------------------------------------------------
### 1. Import libraries, additional requirements and data


In [1]:
# Import data libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import os
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [2]:
# Display visualizations in the notebook without the need to "call" them specifically
%matplotlib inline

# Turning off warning feature
import warnings
warnings.filterwarnings('ignore')

In [3]:
#  Create/save project folder path
path = r"C:\Users\keanu\OneDrive\Desktop\Career Foundry\Boat Sales Analysis"

# Read data (csv file)
df = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'boat_sales_final.csv'), encoding='ISO-8859-1', index_col=False)

---------------------------------------------------------------------------------------------------------------------------
### 2. Data consistency checks

In [4]:
df.columns

In [5]:
# Drop redundant column
df = df.drop(columns = ['Unnamed: 0'])

In [6]:
df.shape

In [7]:
df.head()

In [8]:
# Check for missing values
df.isnull().sum()

Notes: Missing values appear as expected. They are kept rather than dropped, since removing those rows would discard valuable data points.
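
To make the decision to keep nulls easier to audit, the share of missing values per column can be listed next to the raw counts. The sketch below uses a small made-up frame; the `demo` data and the `missing_pct` name are illustrative and not taken from the boat sales file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the boat sales data (values are made up)
demo = pd.DataFrame({
    'Year Built': [2005, np.nan, 2018, 1999],
    'Price': [35000, 42000, np.nan, np.nan],
})

# isnull().mean() gives the fraction of missing rows per column;
# multiplying by 100 turns it into a percentage
missing_pct = demo.isnull().mean().mul(100).round(2)
print(missing_pct)
```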

In [9]:
# Check for duplicates
df_dups = df[df.duplicated()] #no duplicates

In [10]:
# Suppress scientific notation for easier profiling
pd.set_option('display.float_format', '{:.2f}'.format)

df.describe()

In [11]:
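# Impute the rounded mean year for missing values in the 'Year Built' column.
# Only 5.57% of the values are NaN, so imputing avoids losing valuable
# information by dropping those rows.
mean_value = np.round(df['Year Built'].mean())

# Fill missing values with the rounded mean. Assign the result back to the column:
# calling fillna(..., inplace=True) on a column selection triggers a warning in
# recent pandas versions and may not modify df.
df['Year Built'] = df['Year Built'].fillna(mean_value)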
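
Before settling on the mean as the fill value, it can help to compare it with the median, since a large gap hints at skew. A minimal sketch on made-up years follows; `years`, `mean_fill`, `median_fill` and `imputed` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up 'Year Built' values with two gaps
years = pd.Series([1990, 2005, 2010, np.nan, 2020, np.nan], name='Year Built')

mean_fill = np.round(years.mean())   # rounded mean of the non-missing years
median_fill = years.median()         # median of the non-missing years
print('mean fill:', mean_fill, '| median fill:', median_fill)

# Impute with the rounded mean and confirm no NaNs remain
imputed = years.fillna(mean_fill)
print(imputed.isnull().sum())
```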

In [12]:
df['Year Built'].value_counts(dropna=False)

---------------------------------------------------------------------------------------------------------------------------
### 3. Data prep for regression analysis

In [13]:
# Create a scatterplot to revisit how the chosen variables for your hypothesis plot against each other
df.plot(x = 'Year Built', y='# of views last 7 days',style='o') # The style option creates a scatterplot; without it, we only have lines
plt.title('Year Built vs # of Views Last 7 Days')  
plt.xlabel('Year Built')  
plt.ylabel('# of Views')

In [14]:
# Reshape the variables into NumPy arrays and put them into separate objects
X = df['Year Built'].values.reshape(-1,1)
y = df['# of views last 7 days'].values.reshape(-1,1)
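
The `reshape(-1,1)` step is there because scikit-learn estimators expect the feature array to be two-dimensional, with one row per observation and one column per feature. A small sketch with made-up values (`years_demo` and `X_demo` are illustrative names):

```python
import numpy as np

# A 1D array of three made-up years
years_demo = np.array([1999, 2005, 2018])
print(years_demo.shape)

# -1 lets NumPy infer the number of rows; 1 fixes a single feature column
X_demo = years_demo.reshape(-1, 1)
print(X_demo.shape)
```

The target does not strictly need the same treatment for a single-output regression; reshaping `y` as well mainly changes the shape of the fitted coefficients and predictions.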

In [15]:
X

In [16]:
y

In [17]:
# Split data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
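
With `test_size=0.3`, 30% of the rows are held out for testing, and fixing `random_state` makes the shuffle reproducible. The sketch below checks both points on ten made-up rows; all `*_demo` and `*_tr`/`*_te` names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up observations with one feature each
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.arange(10) * 2

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
print(X_tr.shape, X_te.shape)

# Re-running with the same random_state gives the same held-out rows
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
print(np.array_equal(X_te, X_te2))
```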

---------------------------------------------------------------------------------------------------------------------------
### 4. Regression Analysis

In [18]:
# Create a regression object
regression = LinearRegression()  # This is the regression object, which will be fit onto the training set

# Alternative (commented out, not used here): a model that supports missing values
# natively, which would preserve NaN values, e.g. for multiple regression.
#from sklearn.ensemble import HistGradientBoostingRegressor

#regression = HistGradientBoostingRegressor()

In [19]:
# Fit the regression object onto the training set
regression.fit(X_train, y_train)

In [20]:
# Predict the values of y for the test set using X_test
y_predicted = regression.predict(X_test)

In [21]:
# Create a plot that shows the regression line from the model on the test set
plt.scatter(X_test, y_test, color='skyblue', s=15)
plt.plot(X_test, y_predicted, color='orange', linewidth=3)
plt.title('Year Built vs # of Views Last 7 Days (Test set)')
plt.xlabel('Year Built')
plt.ylabel('# of Views Last 7 Days')

In [22]:
# Create objects that contain the model summary statistics
mse = mean_squared_error(y_test, y_predicted) # This is the mean squared error (not its root)
r2 = r2_score(y_test, y_predicted) # This is the R2 score

In [23]:
# Print the model summary statistics. This is where you evaluate the performance of the model
print('Slope:', regression.coef_)
print('Mean squared error: ', mse)
print('R2 score: ', r2)

In [24]:
y_predicted

In [25]:
# Create a dataframe comparing the actual and predicted values of y
data = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_predicted.flatten()})
data.head(30)

##### Compare how the regression fits the training set

In [26]:
# Predict
y_predicted_train = regression.predict(X_train) # This is predicting X_train

In [27]:
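# Summary statistics for the training set (separate names so the test-set results are not overwritten)
mse_train = mean_squared_error(y_train, y_predicted_train)
r2_train = r2_score(y_train, y_predicted_train)

In [28]:
print('Slope:', regression.coef_)
print('Mean squared error: ', mse_train)
print('R2 score: ', r2_train)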

In [29]:
# Visualizing the training set results
plt.scatter(X_train, y_train, color='seagreen', s=15)
plt.plot(X_train, y_predicted_train, color='orange', linewidth=3)
plt.title('Year Built vs # of Views Last 7 Days (Train set)')
plt.xlabel('Year Built')
plt.ylabel('# of Views Last 7 Days')

### Model Summary:
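- Slope: -1.0446
- Indicates a negative relationship between the independent and dependent variables: each additional year in 'Year Built' (a newer boat) is associated with roughly one fewer view in the last 7 days.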

#### Mean Squared Error (MSE):
- MSE1 (test set): 20086.42
- MSE2 (training set): 23898.96
- High MSE on both sets suggests large prediction errors relative to the actual values.

#### R-squared (R2) Score:
- R21 (test set): 0.0167
- R22 (training set): 0.0111
- Low R2 scores indicate limited explanatory power: the model accounts for less than 2% of the variance in views.
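
MSE is expressed in squared views, which makes it hard to read directly. Taking the square root gives the root mean squared error (RMSE) in the same unit as the target. The sketch below only does that arithmetic on the two values reported above; the dictionary names are illustrative.

```python
import math

# MSE values as reported in the model summary
mse_values = {'MSE1': 20086.42, 'MSE2': 23898.96}

# RMSE is the square root of MSE, expressed in views
rmse_values = {name.replace('MSE', 'RMSE'): math.sqrt(value)
               for name, value in mse_values.items()}

for name, value in rmse_values.items():
    print(f'{name}: {value:.1f} views')
```

This puts the typical prediction error at roughly 142 and 155 views respectively.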
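
R2 compares the model's squared errors with those of a baseline that always predicts the mean of the target, so a score near 0 means the model is barely better than that baseline. A minimal sketch of the calculation on made-up numbers, checked against `r2_score` (the arrays are illustrative, not notebook data):

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted values
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # squared error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error of the mean baseline
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```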

#### Interpretation:
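- The linear regression model struggles to predict the number of views from the year built.
- The negative slope suggests that newer boats receive slightly fewer views, but the relationship is very weak.
- Low R2 scores and high MSE on both the test and training sets indicate that year built alone explains very little of the variation in views, leaving room for improvement in model performance.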

#### Recommendations:
- Consider refining the model (for example, by adding further predictors such as price) or exploring alternative models for improved accuracy.


### Limitations and Potential Data Bias:
1. Factors in this dataset, such as price and year built, may influence the number of views, but only indirectly. Views are likely also driven by factors outside the dataset, such as how the website displays its yacht and boat listings to consumers, the time of day users are online, and the season of the year.

2. Potential sources of bias in the dataset include human error, missing variables, the imputation of missing values, and the handling of extreme values. Handling NaN values without introducing additional bias into the analysis was a particular challenge.

3. Extreme values were not removed, so outliers may have a disproportionate influence on the fitted regression line.
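
Keeping extreme values does not rule out inspecting them. One common rule of thumb, which is an assumption here and not something the notebook applies, flags points more than 1.5 times the interquartile range (IQR) beyond the quartiles. A minimal sketch on made-up view counts:

```python
import pandas as pd

# Made-up weekly view counts with one obvious extreme value
views_demo = pd.Series([40, 55, 60, 62, 70, 75, 80, 3000], name='# of views last 7 days')

q1, q3 = views_demo.quantile([0.25, 0.75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag, rather than drop, the rows outside the IQR fences
outliers = views_demo[(views_demo < lower) | (views_demo > upper)]
print(outliers)
```

Flagged rows can then be reviewed, or the regression re-run with and without them to see how much the slope depends on a handful of listings.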