# Data Analysis on Sales by Advertising Platform

We are given a table of the amount spent on each advertising platform and the sales generated. Marketing wants to understand the relationship between TV advertising and sales, and to see whether we can build a model that predicts sales for a given amount of TV advertising spend.

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt

## Import file

In [None]:
file_path = r'../resources/data/advertising.csv'

In [None]:
df = pd.read_csv(file_path)
df.head()

## Inspect data source

In [None]:
df.shape

In [None]:
df.dtypes

In [None]:
df.describe()

## Data cleanup

In [None]:
# drop null values
df = df.dropna()

In [None]:
# check for outliers
labels = df.columns[0:3]
plt.boxplot(df[labels], labels=labels)
plt.show()

## Data Analysis

In [None]:
# Review target variable we are trying to predict
plt.boxplot(df['Sales'], labels=['Sales'])
plt.show()

In [None]:
plt.figure(figsize=(15,5))

# Create scatter plots
for idx, label in enumerate(labels):

    plt.subplot(1, 3, idx + 1)
    plt.scatter(df[label], df['Sales'])

    # Label plots
    plt.title(label)
    plt.xlabel(f'{label} Advertising Spend')
    plt.ylabel('Sales')

plt.show()

In [None]:
# Create dataframe for correlation heatmap
corr = df.corr()

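# Zero out the diagonal: every variable correlates perfectly (1.00) with itself,
# which carries no information and would dominate the heatmap color scale.
# Index by label instead of testing == 1.00, so a genuine 1.00 between two
# different columns is left untouched.
for col in corr.columns:
    corr.loc[col, col] = 0.00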

corr

In [None]:
# Build heatmap
plt.xticks(np.arange(len(list(df.columns))), labels=list(df.columns))
plt.yticks(np.arange(len(list(df.columns))), labels=list(df.columns))

plt.imshow(corr, cmap='YlGn')
plt.show()

## Regression Model

Our goal is to fit a regression line in slope-intercept form, $y = mx + b$. The variable $x$ is the amount spent on TV advertising, and $y$ is the predicted Sales for that value of $x$. $m$ is the slope of the line and $b$ is its y-intercept.
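For a single predictor, the least-squares values of $m$ and $b$ have a simple closed form. The sketch below illustrates it on made-up numbers rather than the advertising data; `fit_line`, `x_demo`, and `y_demo` are hypothetical names introduced only for this example.

```python
import numpy as np

def fit_line(x, y):
    """Return (m, b) for the least-squares line y = m*x + b."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()

    # Slope: co-variation of x and y divided by the variation of x
    m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

    # Intercept: the fitted line passes through the point of means
    b = y_mean - m * x_mean
    return m, b

# Illustrative values only: these points lie exactly on y = 2x + 1
x_demo = [1.0, 2.0, 3.0, 4.0]
y_demo = [3.0, 5.0, 7.0, 9.0]

m_demo, b_demo = fit_line(x_demo, y_demo)
print(f'm: {m_demo}, b: {b_demo}')
```

On real, noisy data the points will not sit exactly on the line, and the same formulas give the line that minimizes the sum of squared vertical distances to the points.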

One useful value that can be calculated from this model is the coefficient of determination, or $R^2$. It represents the proportion of the variation in Sales that the model explains. The closer $R^2$ is to 1.00, the better the observed data lines up with the model.
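To make the definition concrete, $R^2$ can be computed by hand as $1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the observations from their mean. The sketch below uses made-up numbers; `r_squared` and the demo values are hypothetical names for illustration only.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    # Variation left unexplained by the predictions
    ss_res = np.sum((y_true - y_pred) ** 2)

    # Total variation of the observations around their mean
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Illustrative values only
y_obs = [2.0, 4.0, 6.0, 8.0]

# Predictions that match the observations exactly give R^2 = 1
perfect = r_squared(y_obs, y_obs)

# Always predicting the mean of the observations gives R^2 = 0
baseline = r_squared(y_obs, [5.0, 5.0, 5.0, 5.0])

print(perfect, baseline)
```

When $R^2$ is measured on data the model was not trained on, it can also be negative, which means the model predicts worse than simply using the mean.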

In [None]:
# Split data into training and testing sets; we'll do an 80/20 split
from sklearn.model_selection import train_test_split

# Convert dataframe data to numpy 2D arrays
x = np.array(df['TV']).reshape(-1, 1)
y = np.array(df['Sales']).reshape(-1, 1)

# Split data
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, test_size=0.2, random_state=100)

In [None]:
# Build linear regression model
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(x_train, y_train)

In [None]:
# Store values in variables for later use; R^2 is computed on the held-out test set
rsquared = model.score(x_test, y_test)
yint = model.intercept_.tolist()[0]
slope = model.coef_.tolist()[0][0]

print(f'R2: {rsquared}\nm : {slope}\nb : {yint}')

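The $R^2$ value is high, which means the model explains most of the variation in Sales on the held-out test data. We can therefore expect reasonably accurate predictions for TV spend amounts within the range covered by the data.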
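$R^2$ is unitless, so it can help to pair it with an error measure expressed in the same units as Sales. The following is a minimal sketch of the root mean squared error (RMSE) on made-up numbers; `rmse` and the demo values are hypothetical names and are not part of the original analysis.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as y."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative values only, not taken from the advertising data
y_obs_demo = [10.0, 12.0, 14.0]
y_hat_demo = [11.0, 12.0, 13.0]

error_demo = rmse(y_obs_demo, y_hat_demo)
print(f'RMSE: {round(error_demo, 4)}')
```

In this notebook, the equivalent call would be `rmse(y_test, model.predict(x_test))`, which gives the typical size of a prediction error on the test set.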

## Predicting Values

In [None]:
# Get user input and print out results
user_input = float(input('Enter TV advertising spend amount: '))

x_value = np.array(user_input).reshape(-1, 1)

y_value = model.predict(x_value).tolist()[0][0]

print(f'Predicted Sales for {user_input}: {round(y_value, 2)}')

## Plotting the Model

In [None]:
# Build plot with y=mx+b line and add scatter plot with observed data
plt.title('Plot of TV Advertising by Sales with Regression Line')
plt.xlabel('TV Advertising Spend')
plt.ylabel('Sales')
plt.plot(x_train, (slope * x_train) + yint, 'r', label=f'$y=${round(slope, 4)}x + {round(yint, 4)}\n$R^2 = {round(rsquared, 4)}$')
plt.scatter(x_train, y_train)
plt.legend()
plt.show()