<font color='white' size=6>**<span style='background:midnightblue'>Machine Learning Sales Predictions Based on Marketing Budgets
        by Nathaniel Cekay    </span>**</font>

# Table of Contents
- [Introduction](#Introduction)
- [Data Loading](#Data-Loading)
- [Machine Learning](#Machine-Learning)

# Introduction

In this project, I use a linear regression model to predict total sales from TV, radio, and newspaper advertising budgets. The main goal is accurate sales predictions, measured by RMSE and R-squared.

In [1]:
import numpy as np 
import pandas as pd 
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from joblib import dump
from joblib import load


In [2]:
np.random.seed(42)

# Data Loading

In [3]:
budget_and_sales = pd.read_csv('/kaggle/input/advertising-sales-dataset/Advertising Budget and Sales.csv')

In [4]:
budget_and_sales.head()

,Unnamed: 0,TV Ad Budget ($),Radio Ad Budget ($),Newspaper Ad Budget ($),Sales ($)
0,1,230.1,37.8,69.2,22.1
1,2,44.5,39.3,45.1,10.4
2,3,17.2,45.9,69.3,9.3
3,4,151.5,41.3,58.5,18.5
4,5,180.8,10.8,58.4,12.9
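
The `Unnamed: 0` column in the preview is the row ID stored in the CSV. As a minimal sketch, the snippet below shows how `index_col=0` would turn that column into the row index instead of a feature. It assumes the first header cell of the file is empty (which is what makes pandas name the column `Unnamed: 0`), and it uses an in-memory buffer with three rows copied from the preview as a stand-in for the Kaggle file.

```python
import io

import pandas as pd

# In-memory stand-in for the Kaggle CSV (assumption: the first header cell is empty).
# The three data rows are copied from the head() preview above.
csv_text = (
    ",TV Ad Budget ($),Radio Ad Budget ($),Newspaper Ad Budget ($),Sales ($)\n"
    "1,230.1,37.8,69.2,22.1\n"
    "2,44.5,39.3,45.1,10.4\n"
    "3,17.2,45.9,69.3,9.3\n"
)

# Default read: the unnamed first column becomes a regular column "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the same column becomes the row index and is no longer a feature.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(indexed.columns))
```

The rest of this notebook keeps the default read, so `Unnamed: 0` remains a column in every later cell.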


In [5]:
budget_and_sales.isna().sum()

Unnamed: 0                 0
TV Ad Budget ($)           0
Radio Ad Budget ($)        0
Newspaper Ad Budget ($)    0
Sales ($)                  0
dtype: int64

In [6]:
budget_and_sales.info()
budget_and_sales.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Unnamed: 0               200 non-null    int64  
 1   TV Ad Budget ($)         200 non-null    float64
 2   Radio Ad Budget ($)      200 non-null    float64
 3   Newspaper Ad Budget ($)  200 non-null    float64
 4   Sales ($)                200 non-null    float64
dtypes: float64(4), int64(1)
memory usage: 7.9 KB


,Unnamed: 0,TV Ad Budget ($),Radio Ad Budget ($),Newspaper Ad Budget ($),Sales ($)
count,200.0,200.0,200.0,200.0,200.0
mean,100.5,147.0425,23.264,30.554,14.0225
std,57.879185,85.854236,14.846809,21.778621,5.217457
min,1.0,0.7,0.0,0.3,1.6
25%,50.75,74.375,9.975,12.75,10.375
50%,100.5,149.75,22.9,25.75,12.9
75%,150.25,218.825,36.525,45.1,17.4
max,200.0,296.4,49.6,114.0,27.0


# Machine Learning

In [7]:
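# Features: every column except the target. Note that this keeps "Unnamed: 0",
# the CSV row ID, as a feature alongside the three ad budgets.
X = budget_and_sales.drop(columns=["Sales ($)"])
y = budget_and_sales["Sales ($)"]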

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LinearRegression()
model.fit(X_train_scaled, y_train)

train_predictions = model.predict(X_train_scaled)
test_predictions = model.predict(X_test_scaled)

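# RMSE is the square root of MSE. np.sqrt works across scikit-learn versions,
# whereas the `squared` argument is deprecated in newer releases.
train_rmse = np.sqrt(mean_squared_error(y_train, train_predictions))
test_rmse = np.sqrt(mean_squared_error(y_test, test_predictions))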

print("Train RMSE:", train_rmse)
print("Test RMSE:", test_rmse)

Train RMSE: 1.6442982086509947
Test RMSE: 1.788576100865966
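
The scale, fit, and predict steps in the cell above can also be bundled into one object. The following is a minimal sketch using scikit-learn's `make_pipeline`, not the code that produced the results above. The data is synthetic (an assumption made so the snippet runs on its own), so the printed numbers will not match the RMSE values in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the advertising dataset): three budget-like features.
rng = np.random.default_rng(42)
X = rng.uniform(0, 300, size=(200, 3))
y = 3.0 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 1.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training rows only and reuses it in predict().
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_train, y_train)

# RMSE computed as the square root of MSE.
train_rmse = np.sqrt(mean_squared_error(y_train, pipeline.predict(X_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, pipeline.predict(X_test)))

print("Train RMSE:", train_rmse)
print("Test RMSE:", test_rmse)
```

Because the scaler lives inside the pipeline, any later call to `pipeline.predict` applies the training-set scaling automatically.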


In [8]:
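# Note: this is a reference to budget_and_sales, not a copy, so columns added
# to full_data later also appear in budget_and_sales.
full_data = budget_and_sales

X_full = full_data.drop(columns=["Sales ($)"])

# Note: this fits a new scaler on the full dataset, so the scaling differs
# slightly from the training-set scaler the model was fitted with.
scaler = StandardScaler()
X_full_scaled = scaler.fit_transform(X_full)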

In [9]:
full_predictions = model.predict(X_full_scaled)

full_data["Predicted_Sales"] = full_predictions
full_data.to_csv("full_dataset_with_predictions.csv", index=False)
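
Cell In [8] fits a new `StandardScaler` on the full dataset before predicting with a model that was trained on data scaled by the training-set scaler. The sketch below uses synthetic data (an assumption, not the advertising dataset) to illustrate that the two choices give different predictions, which is why reusing the original scaler is usually the safer option.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the advertising dataset).
rng = np.random.default_rng(0)
X_full = rng.uniform(0, 300, size=(200, 3))
y_full = 3.0 + 0.05 * X_full[:, 0] + 0.2 * X_full[:, 1] + rng.normal(0, 1.5, size=200)

# Train on the first 160 rows only, with a scaler fitted on those rows.
X_train, y_train = X_full[:160], y_full[:160]
train_scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(train_scaler.transform(X_train), y_train)

# Option 1: reuse the scaler the model was trained with.
pred_same_scaler = model.predict(train_scaler.transform(X_full))

# Option 2: fit a fresh scaler on the full data, as cell In [8] does.
refit_scaler = StandardScaler().fit(X_full)
pred_refit_scaler = model.predict(refit_scaler.transform(X_full))

# The predictions differ because the two scalers learned different means and
# standard deviations.
max_gap = np.max(np.abs(pred_same_scaler - pred_refit_scaler))
print("Largest prediction gap:", max_gap)
```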

In [10]:
# Compare the full-dataset predictions with the actual sales values.
mse_full = mean_squared_error(budget_and_sales["Sales ($)"], full_predictions)
rmse_full = np.sqrt(mse_full)

print("Root Mean Squared Error (RMSE) of LinearRegression model:", rmse_full)

Root Mean Squared Error (RMSE) of LinearRegression model: 1.6801319503228778


In [11]:
train_r2 = model.score(X_train_scaled, y_train)
test_r2 = model.score(X_test_scaled, y_test)

print("Train R-squared:", train_r2)
print("Test R-squared:", test_r2)

Train R-squared: 0.8957553000540606
Test R-squared: 0.8986489151417081


<font color='white' size=5>**<span style='background:midnightblue'>The features explain 89.58% of the variance in Sales on the training data and 89.86% on the test data. The two scores are nearly identical, which suggests the model is not overfitting.
    </span>**</font>
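
For reference, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below works through that formula by hand and compares it with `r2_score`. The `y_true` values are the five Sales values from the `head()` preview, while the `y_pred` values are made up for illustration and are not model output.

```python
import numpy as np
from sklearn.metrics import r2_score

# y_true: the five Sales values from the head() preview.
# y_pred: made-up predictions for illustration only (NOT model output).
y_true = np.array([22.1, 10.4, 9.3, 18.5, 12.9])
y_pred = np.array([20.5, 12.3, 12.3, 17.6, 13.2])

# Residual sum of squares: variation the predictions fail to capture.
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: variation of the targets around their mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot

print("Manual R-squared: ", r2_manual)
print("sklearn R-squared:", r2_score(y_true, y_pred))
```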

In [12]:
dump(model, 'saved_model.joblib')

print("Model saved successfully!")

Model saved successfully!


In [13]:
loaded_model = load('saved_model.joblib')

print("Model loaded successfully!")

Model loaded successfully!
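
The saved file above contains only the regression model, so whoever loads it still needs the matching scaler. One possible alternative, sketched here on synthetic data, is to persist a pipeline that holds both steps. The file name `saved_pipeline.joblib` and the temporary directory are hypothetical choices for this example.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the advertising dataset).
rng = np.random.default_rng(1)
X = rng.uniform(0, 300, size=(100, 3))
y = 3.0 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 1.5, size=100)

# Scaler and model are fitted together and travel together.
pipeline = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "saved_pipeline.joblib")  # hypothetical file name
    dump(pipeline, path)
    loaded_pipeline = load(path)

# The reloaded pipeline takes raw (unscaled) features and scales them itself.
same_predictions = np.allclose(pipeline.predict(X), loaded_pipeline.predict(X))
print("Predictions match after reload:", same_predictions)
```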


In [14]:
loaded_model_predictions = loaded_model.predict(X_full_scaled)

full_data["Predicted_Sales"] = loaded_model_predictions
full_data.to_csv("full_dataset_with_predictions.csv", index=False)

In [15]:
mse_final = mean_squared_error(budget_and_sales["Sales ($)"], loaded_model_predictions)
rmse_final = np.sqrt(mse_final)

print("Root Mean Squared Error (RMSE) of saved LinearRegression model:", rmse_final)

Root Mean Squared Error (RMSE) of saved LinearRegression model: 1.6801319503228778


In [16]:
loaded_model_train_r2 = loaded_model.score(X_train_scaled, y_train)
loaded_model_test_r2 = loaded_model.score(X_test_scaled, y_test)

print("Train R-squared:", loaded_model_train_r2)
print("Test R-squared:", loaded_model_test_r2)

Train R-squared: 0.8957553000540606
Test R-squared: 0.8986489151417081


<font color='white' size=5>**<span style='background:midnightblue'>The final model achieves an RMSE of 1.68 on the full dataset and an R-squared of about 0.90 on the test data. The typical prediction error is therefore about 1.68 sales units, roughly 12% of the mean sales value of 14.02, and the features explain about 90% of the variance in sales on the test data. These results support the model's reliability and accuracy for predicting sales in practical applications.
    </span>**</font>
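
To put the RMSE in context, the arithmetic below relates it to the summary statistics printed by `describe()`. It is a rough check that uses only numbers already shown in this notebook. The conversion from sample to population variance is my addition, needed because `describe()` reports the sample standard deviation while R-squared uses the population variance.

```python
# Values copied from the outputs above.
rmse_full = 1.6801319503228778  # RMSE on the full dataset (cell In [15])
sales_mean = 14.0225            # mean of Sales ($) from describe()
sales_std = 5.217457            # sample std of Sales ($) from describe()
n_rows = 200                    # number of rows from info()

# RMSE relative to the average sales value.
relative_rmse = rmse_full / sales_mean

# describe() reports the sample std (ddof=1); convert it to the population
# variance, which is what R-squared divides by.
population_var = sales_std ** 2 * (n_rows - 1) / n_rows

# R-squared implied by the full-data RMSE: 1 - MSE / Var(y).
implied_r2 = 1 - rmse_full ** 2 / population_var

print(f"Relative RMSE: {relative_rmse:.3f}")
print(f"Implied full-data R-squared: {implied_r2:.3f}")
```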