# Predict sales revenue

Feature Descriptions
1. TV - Spend on TV advertisements
2. Radio - Spend on radio advertisements
3. Newspaper - Spend on newspaper advertisements
4. Sales - Sales revenue generated

Sales is the dependent (target) variable.

TV, Radio and Newspaper are the independent variables.

Import numpy and pandas

In [2]:
import numpy as np
import pandas as pd

Load the Advertising data set

In [3]:
data = pd.read_csv("Advertising.csv")

In [4]:
type(data)

pandas.core.frame.DataFrame

In [5]:
data.head(5)

,Unnamed: 0,TV,Radio,Newspaper,Sales
0,1,230.1,37.8,69.2,22.1
1,2,44.5,39.3,45.1,10.4
2,3,17.2,45.9,69.3,9.3
3,4,151.5,41.3,58.5,18.5
4,5,180.8,10.8,58.4,12.9


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  200 non-null    int64  
 1   TV          200 non-null    float64
 2   Radio       200 non-null    float64
 3   Newspaper   200 non-null    float64
 4   Sales       200 non-null    float64
dtypes: float64(4), int64(1)
memory usage: 7.9 KB


The first column ("Unnamed: 0") is just a row index that we do not need, so keep only the remaining columns.

In [12]:
data = data[["TV", "Radio", "Newspaper", "Sales"]]

In [13]:
#data = data.drop("Unnamed: 0", axis = 1)

In [14]:
data.head(2)

,TV,Radio,Newspaper,Sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4


# Explore the data set

Is there a relationship between sales and the spend on the various advertising channels?

Load matplotlib and seaborn libraries for visual analytics

In [15]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Visualize pairwise correlations

In [16]:
sns.pairplot( data , diag_kind='kde')

<seaborn.axisgrid.PairGrid at 0x1675ff470>

Any observations from the above graph?

Calculate correlations

In [17]:
data.TV.corr( data.Sales )

0.7822244248616061

In [18]:
data.corr()

,TV,Radio,Newspaper,Sales
TV,1.0,0.054809,0.056648,0.782224
Radio,0.054809,1.0,0.354104,0.576223
Newspaper,0.056648,0.354104,1.0,0.228299
Sales,0.782224,0.576223,0.228299,1.0


Visualize the correlations

In [20]:
sns.heatmap( data.corr(), annot=True)

<Axes: >

Observations: 
1. The diagonal of the above matrix shows the correlation of each variable with itself, which is always 1. The correlation between TV and Sales is the highest (0.78), followed by the correlation between Radio and Sales (0.576).
2. Correlations range from -1 to +1. Values close to +1 indicate a strong positive correlation, values close to -1 indicate a strong negative correlation, and values close to 0 indicate a weak linear relationship. Variables that correlate strongly with the target are the most likely candidates for model building.
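
To make the two observations concrete, here is a minimal sketch of how a Pearson correlation coefficient is computed. The helper name `pearson` and the small lists are made up for illustration and are not taken from Advertising.csv; `data.corr()` computes the same quantity for every pair of columns.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sums of cross-deviations and squared deviations from the means
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)

# Made-up illustrative values (assumption, not real advertising data)
spend = [10.0, 20.0, 30.0, 40.0, 50.0]
rising = [12.0, 19.0, 33.0, 38.0, 52.0]   # tends to increase with spend
falling = [50.0, 41.0, 29.0, 22.0, 9.0]   # tends to decrease with spend

r_self = pearson(spend, spend)    # a variable against itself
r_pos = pearson(spend, rising)    # close to +1
r_neg = pearson(spend, falling)   # close to -1
print(r_self, r_pos, r_neg)
```

A variable compared with itself gives exactly 1, which is why the diagonal of the correlation matrix is all ones.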

# Predict Sales revenue using TV advertisement expenditure

Sales = beta_0 + beta_1*TV

The sklearn library provides a comprehensive set of APIs to split data sets, build models, test models and calculate accuracy metrics.

Import the linear regression API from sklearn

In [21]:
from sklearn.linear_model import LinearRegression

Instantiate the imported linear regression model

In [22]:
linreg = LinearRegression()

Prepare input data set

In [23]:
x = data[['TV']]

In [24]:
type(x)

pandas.core.frame.DataFrame

In [25]:
y = data[['Sales']]

In [26]:
linreg.fit(x, y)

Find the intercept beta_0 and slope beta_1

In [27]:
linreg.intercept_

array([7.03259355])

In [28]:
linreg.coef_

array([[0.04753664]])

Now the Sales prediction equation can be written as: 
 
Sales = beta_0 + beta_1 * TV

Sales = 7.03 + 0.0475 * TV

Sales = linreg.intercept_ + linreg.coef_ * TV
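
For a single predictor, the least-squares intercept and slope have a simple closed form: beta_1 = S_xy / S_xx and beta_0 = mean(y) - beta_1 * mean(x). The sketch below illustrates that formula in plain Python on made-up data; `fit_simple_ols` is a hypothetical helper, not part of sklearn, and the data is generated from a known line so the result is easy to check.

```python
def fit_simple_ols(x, y):
    """Closed-form least-squares fit of y = beta_0 + beta_1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # S_xy: sum of cross-deviations, S_xx: sum of squared deviations of x
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    beta_1 = s_xy / s_xx
    beta_0 = mean_y - beta_1 * mean_x
    return beta_0, beta_1

# Made-up data generated exactly from y = 7 + 0.05 * x (assumption for illustration)
tv = [0.0, 50.0, 100.0, 150.0, 200.0]
sales = [7.0 + 0.05 * t for t in tv]

beta_0, beta_1 = fit_simple_ols(tv, sales)
print(beta_0, beta_1)
```

Because the toy data lies exactly on a line, the fit recovers the intercept 7 and slope 0.05 up to floating-point error.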

## Interpretation of slope or beta_1 or linreg.coef_
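
The slope beta_1 is the change in predicted Sales for each additional unit of TV spend, and the intercept beta_0 is the predicted Sales when TV spend is zero. A small sketch, using rounded copies of the fitted values printed above, makes this concrete; `predict_sales` is a hypothetical helper name.

```python
# Rounded values copied from the linreg.intercept_ and linreg.coef_ outputs above
beta_0 = 7.03259355
beta_1 = 0.04753664

def predict_sales(tv):
    """Predicted Sales for a given TV spend under the fitted line."""
    return beta_0 + beta_1 * tv

delta_1 = predict_sales(101) - predict_sales(100)    # effect of 1 extra unit of TV
delta_100 = predict_sales(200) - predict_sales(100)  # effect of 100 extra units of TV
at_zero = predict_sales(0)                           # prediction with no TV spend
print(delta_1, delta_100, at_zero)
```

The change for one extra unit equals beta_1 regardless of the starting TV value, which is what makes the model linear.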

# Make predictions for new TV advertisement expenditure data

Suppose TV = 150. What is the predicted Sales?

In [29]:
TV = 150
Sales = linreg.intercept_ + linreg.coef_*TV

In [30]:
Sales

array([[14.16308961]])

We can also use the model's built-in predict method

In [31]:
newData = {'TV': [150, 200, 250]}
newData = pd.DataFrame(data=newData)  ## Like 'x', the new data is a pandas DataFrame with a 'TV' column

In [32]:
linreg.predict(newData)

array([[14.16308961],
       [16.53992164],
       [18.91675366]])

Calculate the accuracy of the fitted model

In [33]:
model = linreg.fit(x, y)

In [34]:
SalesPredictions = model.predict(x)

In [35]:
SalesPredictions[0:5]

array([[17.97077451],
       [ 9.14797405],
       [ 7.85022376],
       [14.23439457],
       [15.62721814]])

In [36]:
y[0:5]

,Sales
0,22.1
1,10.4
2,9.3
3,18.5
4,12.9


In [37]:
model.coef_

array([[0.04753664]])

In [38]:
model.intercept_

array([7.03259355])

Calculate the root mean squared error (RMSE)

In [39]:
from sklearn import metrics

mean_squared_error = (1/200) * sum_{i=1}^{200} (y_i - SalesPredictions_i)^2
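
As a hedged illustration of the formula, the sketch below applies it by hand to only the five rows of actual and predicted Sales printed earlier. The result therefore differs from the full 200-row value; the variable names `mse_5` and `rmse_5` are made up to stress that.

```python
import math

# First five actual Sales values and predictions, copied from the outputs above
actual = [22.1, 10.4, 9.3, 18.5, 12.9]
predicted = [17.97077451, 9.14797405, 7.85022376, 14.23439457, 15.62721814]

# Squared difference between each actual value and its prediction
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]

mse_5 = sum(squared_errors) / len(squared_errors)  # mean of the squared errors
rmse_5 = math.sqrt(mse_5)                          # back in the units of Sales
print(mse_5, rmse_5)
```

Taking the square root puts the error back on the same scale as Sales, which is why RMSE is usually easier to interpret than MSE.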

In [40]:
metrics.mean_squared_error(y, SalesPredictions)  # arguments are (y_true, y_pred)

10.512652915656757

In [41]:
rmse = np.sqrt(metrics.mean_squared_error(y, SalesPredictions))

In [42]:
rmse

3.2423221486546887

In [43]:
avgSales = y["Sales"].mean()  # scalar mean of the Sales column

In [44]:
avgSales

14.0225

In [45]:
## Error percentage
## RMSE percentage

In [46]:
rmse/avgSales*100

23.122283106826092
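
The chain from MSE to RMSE to RMSE percentage can be re-checked with plain arithmetic. This sketch only reuses the numbers printed above; it does not recompute them from the data.

```python
import math

mse = 10.512652915656757   # metrics.mean_squared_error output above
avg_sales = 14.0225        # mean of Sales printed above

rmse = math.sqrt(mse)               # root mean squared error
rmse_pct = rmse / avg_sales * 100   # RMSE as a percentage of average Sales
print(rmse, rmse_pct)
```

Read this way, the typical prediction error of the TV-only model is roughly 23% of the average Sales value.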

In [47]:
# R^2
model.score(x,y)

0.611875050850071

In [48]:
0.782224*0.782224  ### Corr(TV, Sales) * Corr(TV, Sales)

0.6118743861760001
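
The near match between the squared correlation and `model.score(x, y)` is not a coincidence: for a least-squares line with one predictor, R^2 equals the square of the Pearson correlation between x and y. The sketch below checks this on a small made-up data set in plain Python (all values are assumptions for illustration).

```python
import math

# Made-up illustrative data (assumption, not from Advertising.csv)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.3, 6.6]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_yy = sum((yi - mean_y) ** 2 for yi in y)   # total sum of squares

# Pearson correlation between x and y
r = s_xy / math.sqrt(s_xx * s_yy)

# Least-squares line and its residual sum of squares
beta_1 = s_xy / s_xx
beta_0 = mean_y - beta_1 * mean_x
ss_res = sum((yi - (beta_0 + beta_1 * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = 1 - ss_res / s_yy
print(r ** 2, r_squared)
```

The small gap in the notebook output (0.6118743... versus 0.6118750...) comes only from squaring a correlation that was rounded to six decimals.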

# Model 2

Build Sales = beta_0 + beta_1 * Radio

In [49]:
x = data[["Radio"]]
y = data[["Sales"]]

In [50]:
from sklearn.model_selection import train_test_split

In [51]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)
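
`train_test_split` holds out 30% of the rows for testing, and `random_state` fixes the shuffle so the split is repeatable. The sketch below illustrates that idea in plain Python. `simple_train_test_split` is a hypothetical helper, and it does not reproduce the exact rows that sklearn selects.

```python
import random

def simple_train_test_split(rows, test_size=0.3, seed=1):
    """Shuffle row indices with a fixed seed, then hold out a test_size share."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)        # seeded, so the result is repeatable
    n_test = int(round(len(rows) * test_size))  # number of held-out rows
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(200))  # stand-in for the 200 rows of the data set
train_rows, test_rows = simple_train_test_split(rows)
again_train, again_test = simple_train_test_split(rows)

print(len(train_rows), len(test_rows))
```

Every row lands in exactly one of the two sets, and repeating the call with the same seed returns the same split.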

In [52]:
model2 = LinearRegression()
model2.fit(x_train, y_train)

In [53]:
x = data[["Radio"]]
y = data[["Sales"]]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)
model2 = LinearRegression()
model2.fit(x_train, y_train)

In [54]:
# RMSE on the training data
predictedSales = model2.predict(x_train)
mse = metrics.mean_squared_error(y_train, predictedSales)
trainRmse = np.sqrt(mse)
# RMSE on the held-out test data
predictedSales = model2.predict(x_test)
mse = metrics.mean_squared_error(y_test, predictedSales)
testRmse = np.sqrt(mse)
print(testRmse)

3.8215351050686674


In [55]:
testRmse/np.mean(y_train)*100  # note: the test RMSE is normalized by the training-set mean of Sales

27.70949423604793

In [56]:
model2.coef_

array([[0.1874808]])

Analyze beta_0 and beta_1

Analyze R^2

In [57]:
model2.score(x_train, y_train)

0.2917784045936669

In [58]:
model2.score(x_test, y_test)

0.41293932917162346

Analyze RMSE

In [59]:
predictedSales = model2.predict(x_train)

In [60]:
mse = metrics.mean_squared_error(y_train, predictedSales)

In [61]:
rmse = np.sqrt(mse)
rmse 

4.4415264249912445

# Model 3

Build Sales = beta_0 + beta_1 * Newspaper

Analyze beta_0 and beta_1

Analyze R^2

Analyze RMSE

# Model 4: Multiple Linear Regression

Build Sales = beta_0 + beta_1 * TV + beta_2 * Radio

In [62]:
data.head(2)

,TV,Radio,Newspaper,Sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4


In [63]:
x = data[["TV", "Radio"]]

In [64]:
x.head(2)

,TV,Radio
0,230.1,37.8
1,44.5,39.3


In [65]:
y = data[["Sales"]]

In [66]:
y.head(2)

,Sales
0,22.1
1,10.4


Compare R^2 of Models 1 to 4

Any observations?

# Model 5: Multiple Linear Regression

Build Sales = beta_0 + beta_1 * TV + beta_2 * Newspaper

Analyze beta_0, beta_1 and beta_2

Analyze R^2

Analyze RMSE

# Model 6

Build Sales = beta_0 + beta_1 * Radio + beta_2 * Newspaper

Analyze beta_0, beta_1 and beta_2

Analyze R^2

Analyze RMSE

# Model 7

Build Sales = beta_0 + beta_1 * TV + beta_2 * Radio + beta_3 * Newspaper

Analyze beta_0, beta_1, beta_2 and beta_3

Analyze R^2

Analyze RMSE

# Model 8: Interaction terms

Build Sales = beta_0 + beta_1 * TV + beta_2 * Radio + beta_3 * TV * Radio
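
With an interaction term, the effect of one extra unit of TV is no longer a constant: it becomes beta_1 + beta_3 * Radio, so it depends on the level of Radio spend. The sketch below shows this with hypothetical coefficients chosen only for illustration; they are not fitted values from this data set.

```python
# Hypothetical coefficients (assumption for illustration, not fitted values)
beta_0, beta_1, beta_2, beta_3 = 6.0, 0.02, 0.03, 0.001

def predict_sales(tv, radio):
    """Prediction from a model with a TV * Radio interaction term."""
    return beta_0 + beta_1 * tv + beta_2 * radio + beta_3 * tv * radio

def tv_effect(radio):
    """Change in predicted Sales for one extra unit of TV at a given Radio level."""
    return predict_sales(101, radio) - predict_sales(100, radio)

effect_low = tv_effect(0)    # equals beta_1 when Radio is 0
effect_high = tv_effect(40)  # equals beta_1 + beta_3 * 40
print(effect_low, effect_high)
```

A positive beta_3 would mean that TV advertising is more effective when Radio spend is also high.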

In [67]:
x = data[["TV", "Radio"]]

In [68]:
x.head(2)

,TV,Radio
0,230.1,37.8
1,44.5,39.3


In [69]:
x = x.assign(TvRadio=x.TV * x.Radio)  # assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning


In [70]:
x.head(2)

,TV,Radio,TvRadio
0,230.1,37.8,8697.78
1,44.5,39.3,1748.85


In [71]:
y  = data[["Sales"]]

In [72]:
model8 = LinearRegression()

Analyze RMSE

Compare R^2 of Models 1 to 8

Any observations?

In [73]:
## Adjusted R^2
def AdjRsquare(modelToBeTested, indData, target):
    Rsquare = modelToBeTested.score(indData, target)
    NoData = len(target)
    p = indData.shape[1]
    tempRsquare = 1 - (1-Rsquare)*(NoData-1)/(NoData - p - 1)
    return tempRsquare

In [74]:
AdjRsquare(model2, x_test, y_test)

0.40281759346768586

In [75]:
model2.score(x_test, y_test)

0.41293932917162346

In [76]:
x_test.shape

(60, 1)
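
The adjusted R^2 reported by `AdjRsquare` can be re-derived by hand from the numbers printed above, using 1 - (1 - R^2) * (n - 1) / (n - p - 1) with n rows and p predictors. This is only an arithmetic check on the printed outputs.

```python
r2_test = 0.41293932917162346   # model2.score(x_test, y_test) from the output above
n_rows, n_predictors = 60, 1    # x_test.shape from the output above

# Same expression as in AdjRsquare
adj_r2 = 1 - (1 - r2_test) * (n_rows - 1) / (n_rows - n_predictors - 1)
print(adj_r2)
```

With a single predictor and 60 rows the penalty factor is 59/58, so the adjusted value is only slightly below the plain R^2.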

In [77]:
seed = 7

In [78]:
## Combine all the steps to test the model performance
def linRegcheckModelPerformance(x, y):
    model = LinearRegression()
    # Split data into train and test sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, 
                                                        random_state = seed)
    # Build model with train data set
    model.fit(x_train, y_train)
    # Train accuracies
    trainR2 = model.score(x_train, y_train)
    predictedSales = model.predict(x_train)
    mse = metrics.mean_squared_error(y_train, predictedSales)
    trainRmse = np.sqrt(mse)
    trainRmsePct = trainRmse/np.mean(np.array(y_train))*100
    trainAdjR2 = AdjRsquare(model, x_train, y_train)
    trainAccuracies = [len(y_train), trainRmse, trainRmsePct, trainR2, trainAdjR2]
    # Test accuracies
    testR2 = model.score(x_test, y_test)
    predictedSales = model.predict(x_test)
    mse = metrics.mean_squared_error(y_test, predictedSales)
    testRmse = np.sqrt(mse)
    testRmsePct = testRmse/np.mean(np.array(y_test))*100
    testAdjR2 = AdjRsquare(model, x_test, y_test)
    testAccuracies = [len(y_test), testRmse, testRmsePct, testR2, testAdjR2]
    # Create dataframe for results
    resultsDf = pd.DataFrame(index = ["dataSize", "rmse", "rmsePct", "r2", "adjR2"])
    resultsDf['trainData'] = trainAccuracies
    resultsDf['testData'] = testAccuracies
    return ( round(resultsDf, 4))

In [79]:
# Model 4
x = data[["TV", "Radio"]]
y = data[["Sales"]]
linRegcheckModelPerformance(x, y)

,trainData,testData
dataSize,140.0,60.0
rmse,1.686,1.6471
rmsePct,11.6805,12.6103
r2,0.897,0.8895
adjR2,0.8955,0.8856


In [80]:
# Model 6
x = data[["Radio", "Newspaper"]]
y = data[["Sales"]]
linRegcheckModelPerformance(x, y)

,trainData,testData
dataSize,140.0,60.0
rmse,4.341,4.0477
rmsePct,30.0739,30.9892
r2,0.3175,0.3324
adjR2,0.3076,0.309


In [81]:
# Model 7
x = data[["TV", "Radio", "Newspaper"]]
y = data[["Sales"]]
linRegcheckModelPerformance(x, y)

,trainData,testData
dataSize,140.0,60.0
rmse,1.686,1.6471
rmsePct,11.6805,12.6101
r2,0.897,0.8895
adjR2,0.8948,0.8835
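
Adding Newspaper leaves r2 unchanged at the printed precision, yet adjR2 drops. The sketch below shows why, using the rounded train r2 from the tables above: the adjustment penalizes each extra predictor. Because the input is rounded to three decimals, the last digit can differ slightly from the table; `adjusted_r2` is a hypothetical helper that mirrors `AdjRsquare`.

```python
def adjusted_r2(r2, n_rows, n_predictors):
    """Adjusted R^2 for a model with n_predictors fitted on n_rows rows."""
    return 1 - (1 - r2) * (n_rows - 1) / (n_rows - n_predictors - 1)

train_r2 = 0.897   # rounded train r2 printed for both feature sets above
n_train = 140      # dataSize of the training split

adj_two = adjusted_r2(train_r2, n_train, 2)    # TV + Radio
adj_three = adjusted_r2(train_r2, n_train, 3)  # TV + Radio + Newspaper
print(round(adj_two, 4), round(adj_three, 4))
```

An extra predictor that does not raise R^2 lowers the adjusted value, which suggests Newspaper adds little once TV and Radio are in the model.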


In [82]:
data["TvRadio"] = data["TV"]*data["Radio"]

In [83]:
data.head(2)

,TV,Radio,Newspaper,Sales,TvRadio
0,230.1,37.8,69.2,22.1,8697.78
1,44.5,39.3,45.1,10.4,1748.85


In [84]:
# Model 8
x = data[["TV", "Radio", "TvRadio"]]
y = data[["Sales"]]
linRegcheckModelPerformance(x, y)

,trainData,testData
dataSize,140.0,60.0
rmse,0.9248,0.9601
rmsePct,6.4068,7.3507
r2,0.969,0.9624
adjR2,0.9683,0.9604
