
In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
!cp /kaggle/input/advertising-dataset/'Advertising Dataset'/Images/* .
!ls

<h1>Agenda: Fitting Linear Regression & verifying how well its assumptions are satisfied</h1>

<figure>
          <img src= 'Regression1.png' style="width:75%">
</figure>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
import warnings
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from scipy.stats import kurtosis, skew
warnings.filterwarnings("ignore")
%matplotlib inline

<h3><b>Problem Statement:</b></h3>

<a href = 'https://en.wikipedia.org/wiki/Advertising'>Advertising</a> is a marketing communication that promotes the sale of a product, service or idea.
The dataset consists of the money spent on TV, Radio and Newspaper ads and the corresponding sales achieved in that period. The task is to analyse the dataset and build a predictive model that forecasts expected sales from the advertising spend. <br>
Dataset Link: <a href = 'https://www.kaggle.com/ashydv/advertising-dataset'>https://www.kaggle.com/ashydv/advertising-dataset</a>

<figure>
  <img src="Adverstisement_.png" style="width:75%">
</figure>

<h2>Let's have a look at the Advertisement Dataset</h2>

<figure>
  <img src="look.jpg" style="width:50%">
    <center><figcaption><b>Image Source:</b> <a href='https://image.freepik.com/free-vector/computer-concept_1308-35665.jpg'>https://image.freepik.com/free-vector/computer-concept_1308-35665.jpg</a></figcaption></center>
</figure>



In [None]:
df = pd.read_csv("/kaggle/input/advertising-dataset/Advertising Dataset/Advertising Dataset.csv")

In [None]:
df.shape

In [None]:
df.head()

<font color='red'><i><b>Realisation: Advertising spend and Sales are clearly on different scales</b></i></font>


<h3>EDA: Next we will do Exploratory Data Analysis (EDA) to understand our dataset better.</h3>

<figure>
  <img src="eda.jpg" style="width:75%">
    <center><figcaption><b>Image Source:</b> <a href='https://www.freepik.com/free-vector/data-inform-illustration-concept_6195525.htm'>https://www.freepik.com/free-vector/data-inform-illustration-concept_6195525.htm</a></figcaption></center>
</figure>


In [None]:
# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(df.Sales, color='blue', kde=True)

In [None]:
skew(df.Sales),kurtosis(df.Sales)

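<font color='blue'><i>
<ul>
<li>
Skewness and excess kurtosis close to 0 (this notebook uses an absolute value <= 0.05) denote an almost Normal distribution. Note that <code>scipy.stats.kurtosis</code> returns excess kurtosis by default, so a Normal distribution scores 0, not 3.
</li>
<li>
Even so, Sales doesn't deviate a lot from a normal distribution. Its kurtosis is negative and comparatively large in magnitude, indicating a curve that is flatter than the Normal curve.
</li>
</ul>
</i></font>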
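To make the two statistics concrete, here is a minimal sketch that computes skewness and excess kurtosis from the central moments. It uses synthetic samples only (not the Advertising data), and the helper name `skew_and_excess_kurtosis` is an illustrative assumption. The formulas are the biased (population) versions, which is what `scipy.stats.skew` and `scipy.stats.kurtosis` compute with their default arguments.

```python
import numpy as np

def skew_and_excess_kurtosis(x):
    """Return (skewness, excess kurtosis) from biased central moments."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    m2 = np.mean(z ** 2)   # variance (population form)
    m3 = np.mean(z ** 3)   # third central moment
    m4 = np.mean(z ** 4)   # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# Synthetic samples with a fixed seed, for illustration only
rng = np.random.default_rng(0)
normal_stats = skew_and_excess_kurtosis(rng.normal(size=100_000))
uniform_stats = skew_and_excess_kurtosis(rng.uniform(size=100_000))

print("normal :", normal_stats)    # both values should be near 0
print("uniform:", uniform_stats)   # flat curve: excess kurtosis near -1.2
```

A flat (uniform) sample gives a clearly negative excess kurtosis, which is the pattern described above for a curve flatter than the Normal one.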

In [None]:
sns.scatterplot(x='TV', y='Sales', data=df)

<b>Regression Plot: </b><font color='red'><i>Draws a scatterplot of two variables, x and y, and then fits a regression model to visualise the linear relationship.</i></font>

In [None]:
# Predictor columns: everything except the target
var = df.columns.drop('Sales')

sns.set_style('whitegrid')
fig, axes = plt.subplots(len(var), 1, figsize=(8, 10))

# One regression plot per predictor, each drawn on its own axes
for ax, feature in zip(axes, var):
    sns.regplot(x=feature, y='Sales', data=df, ax=ax)
    ax.set_xlabel(feature, fontsize=12)
    ax.tick_params(axis='both', which='major', labelsize=12)

fig.tight_layout(pad=3.0)
plt.show()

In [None]:
np.std(df.TV),np.std(df.Radio),np.std(df.Newspaper)

<b>EDA Insights: </b><font color='green'><i>
<br>1. TV ads seem to have the strongest linear relationship with Sales, followed by Radio and Newspaper. The slopes themselves are not directly comparable across the three plots, because each x-axis has its own scale.
<br>2. Spending on TV ads has the highest spread (standard deviation), followed by Newspaper and Radio. One reason could be that TV ads cost more.
</i></font>

<h2><font color='brown'><i>Assumptions of Linear Regression: Linear Relationship satisfied
    </i></font></h2>

<h3> Check for Multicollinearity </h3>


<b>Correlation matrix: </b><font color='red'><i>Calculates Pearson’s bivariate correlation between every pair of independent variables. Predictors that are highly correlated with each other are candidates for removal.</i></font>
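As a rough sketch of how such a check could be automated, the snippet below lists predictor pairs whose absolute Pearson correlation exceeds a cut-off. The data are synthetic, and both the helper `high_corr_pairs` and the 0.8 threshold are illustrative assumptions, not values taken from this notebook.

```python
import numpy as np

def high_corr_pairs(X, names, threshold=0.8):
    """List (name_i, name_j, r) for column pairs with |Pearson r| > threshold."""
    corr = np.corrcoef(X, rowvar=False)   # columns are variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Synthetic predictors: c is almost a copy of a, b is independent
rng = np.random.default_rng(7)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

pairs = high_corr_pairs(np.column_stack([a, b, c]), ["a", "b", "c"])
print(pairs)   # only the (a, c) pair should be flagged
```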


In [None]:
sns.heatmap(df.loc[:, df.columns != 'Sales'].corr(),annot=True)

<b>Variance Inflation Factor (VIF): </b><font color='red'><i>Each feature is regressed against all the other features.<br>Variance Inflation Factor is defined as VIF = 1/T, where the tolerance T = 1 - R<sup>2</sup> and R<sup>2</sup> comes from that regression. With VIF > 5 there is an indication that multicollinearity may be present; with VIF > 10 there is almost certainly multicollinearity among the variables.</i></font>
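The definition can be checked by hand. The sketch below regresses each column on the others (with an intercept), takes the R<sup>2</sup> of that auxiliary regression, and returns 1 / (1 - R<sup>2</sup>). It runs on synthetic data, and the helper name `vif` is an assumption made for illustration.

```python
import numpy as np

def vif(X):
    """VIF per column: 1 / (1 - R^2) of that column regressed on the rest."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        # Auxiliary design matrix: intercept plus all other columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic predictors: c is nearly a copy of a, b is independent
rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

vif_independent = vif(np.column_stack([a, b]))
vif_collinear = vif(np.column_stack([a, b, c]))
print(vif_independent)   # close to 1 for unrelated predictors
print(vif_collinear)     # large for a and c, close to 1 for b
```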


In [None]:
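# Add a constant so each auxiliary regression includes an intercept;
# without it, variance_inflation_factor returns uncentred (inflated) values
X = sm.add_constant(df.loc[:, df.columns != 'Sales'])
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns[1:]

# calculating VIF for each feature (column 0 is the constant, so skip it)
vif_data["VIF"] = [variance_inflation_factor(X.values, i)
                          for i in range(1, X.shape[1])]

vif_data.head()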

<font color='red'><i>
We don't see high correlation between independent variables.<br>
    VIF score is also < 5 for all the variables.</i></font>

<h2><font color='brown'><i>Assumptions of Linear Regression: No or Little Multicollinearity satisfied
    </i></font></h2>

In [None]:
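# Explicit column list keeps the predictor order fixed
X = df[['TV', 'Radio', 'Newspaper']]
# Note: sm.OLS does not add an intercept on its own, so this model is fit
# through the origin; wrap X in sm.add_constant(X) to include an intercept
model = sm.OLS(df.Sales, X)
results = model.fit()
print(results.summary())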

<b>F-Statistic</b><br>
    
H0: <font color='red'>The null hypothesis states that the model with no independent variables fits the data as well as your model.</font><br>
Alternate Hypothesis: <font color='red'>The alternative hypothesis says that your model fits the data better than the intercept-only model.
    </font>
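The comparison behind the F-statistic can be sketched directly: fit an intercept-only model and the full model, then compare their residual sums of squares. The example below uses synthetic data and assumes the full model includes an intercept; the helper `rss` and all variable names are illustrative.

```python
import numpy as np

def rss(design, y):
    """Residual sum of squares of a least-squares fit of y on design."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    r = y - design @ beta
    return r @ r

# Synthetic data: n observations, k predictors (the third has no effect)
rng = np.random.default_rng(2)
n, k = 200, 3
X = rng.normal(size=(n, k))
y = 1.0 + X @ np.array([2.0, 0.5, 0.0]) + rng.normal(size=n)

rss_null = rss(np.ones((n, 1)), y)                    # intercept-only model
rss_full = rss(np.column_stack([np.ones(n), X]), y)   # intercept + predictors

# F = (drop in RSS per added predictor) / (residual variance of full model)
f_stat = ((rss_null - rss_full) / k) / (rss_full / (n - k - 1))

# The same statistic written in terms of R^2
r2 = 1.0 - rss_full / rss_null
f_from_r2 = (r2 / k) / ((1.0 - r2) / (n - k - 1))
print(f_stat, f_from_r2)
```

A large F means the predictors reduce the residual error far more than chance alone would, which is evidence against H0.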

<h3>Test for Autocorrelation:  Durbin-Watson test</h3><br>
<font color='red'><i>
The null hypothesis of the test is that there is no serial correlation. <br>
A value close to 2 indicates no autocorrelation, a value near 0 indicates positive autocorrelation, and a value near 4 indicates negative autocorrelation.

The Durbin-Watson test returns a value of 1.949 (close to 2), indicating no autocorrelation.
    </i></font>
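The statistic itself is simple: the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below applies that formula to three synthetic residual series (white noise and two AR(1) series); the helper names and the 0.9 coefficients are illustrative assumptions. `statsmodels.stats.stattools.durbin_watson` computes the same quantity.

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic: sum of squared successive differences / sum of squares."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

def ar1(rho, noise):
    """Build an AR(1) series e[t] = rho * e[t-1] + noise[t]."""
    e = np.zeros_like(noise)
    for t in range(1, len(noise)):
        e[t] = rho * e[t - 1] + noise[t]
    return e

rng = np.random.default_rng(3)
noise = rng.normal(size=5000)

dw_white = durbin_watson(noise)               # no autocorrelation
dw_positive = durbin_watson(ar1(0.9, noise))  # positive autocorrelation
dw_negative = durbin_watson(ar1(-0.9, noise)) # negative autocorrelation
print(dw_white, dw_positive, dw_negative)     # roughly 2, near 0, near 4
```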


<h2><font color='brown'><i>Assumptions of Linear Regression: No or Little Autocorrelation satisfied
    </i></font></h2>

<h3>Check for Homoscedasticity</h3><br>
<font color='red'><i>
Homoscedasticity (meaning “same variance”) means the residuals are randomly scattered around the regression line/plane/hyperplane with a constant spread and follow no pattern.<br>
    
In contrast, heteroscedasticity is a systematic change in the spread of the residuals.
    </i></font>
<br><br>
<figure>
  <img src="Homoscedasticity.png" style="width:75%">
    <center><figcaption><b>Image Source:</b> <a href='https://miro.medium.com/max/1400/1*Jan9oVOzNqQyhA4bSg_zwA.png'>https://miro.medium.com/max/1400/1*Jan9oVOzNqQyhA4bSg_zwA.png</a></figcaption></center>
</figure>

In [None]:
residual = results.resid

sns.residplot(x = results.predict(), y = residual ,
              lowess=True, color="olivedrab")
plt.show()

Problems due to Heteroscedasticity:<br>
<font color='red'><i>
If there are patterns in the residuals, the model has a problem and cannot explain the data completely. Hence, the coefficient estimates and their standard errors are unreliable.<br>
Heteroscedasticity tends to produce p-values that are smaller than they should be, which overstates statistical significance.
    </i></font>
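Beyond the visual check, the spread of the residuals can be tested numerically. The sketch below implements the studentized (Koenker) form of the Breusch-Pagan test by hand: regress the squared residuals on the predictor and use LM = n * R<sup>2</sup>. The data are synthetic, the helper names are assumptions, and 3.841 is the 5% critical value of a chi-square distribution with 1 degree of freedom (one predictor). In practice, `statsmodels.stats.diagnostic.het_breuschpagan` offers a ready-made version.

```python
import numpy as np

def ols_resid(design, y):
    """Residuals from a least-squares fit of y on design."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta

def breusch_pagan_lm(x, resid):
    """LM = n * R^2 from regressing squared residuals on [1, x]."""
    n = len(resid)
    design = np.column_stack([np.ones(n), x])
    e2 = resid ** 2
    aux = ols_resid(design, e2)
    r2 = 1.0 - (aux @ aux) / np.sum((e2 - e2.mean()) ** 2)
    return n * r2

rng = np.random.default_rng(4)
n = 1000
x = rng.uniform(1, 10, size=n)
design = np.column_stack([np.ones(n), x])

y_homo = 3 + 2 * x + rng.normal(size=n)        # constant noise spread
y_hetero = 3 + 2 * x + rng.normal(size=n) * x  # noise spread grows with x

lm_homo = breusch_pagan_lm(x, ols_resid(design, y_homo))
lm_hetero = breusch_pagan_lm(x, ols_resid(design, y_hetero))

CHI2_CRIT_1DF_5PCT = 3.841  # reject homoscedasticity when LM exceeds this
print(lm_homo, lm_hetero)
```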

<h2><font color='brown'><i>Assumptions of Linear Regression: Homoscedasticity Satisfied</i></font></h2>

<h3>Check for Multivariate Normality</h3>

<font color='red'><i>
Linear regression analysis is often said to require all variables to be multivariate normal (following a Gaussian distribution, i.e. the 68–95–99.7 rule).<br>
In practice, the check is made on the residuals, which should be approximately normally distributed.
    </i></font>

In [None]:
# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(residual, color='blue', label='Residuals', kde=True)

In [None]:
skew(residual),kurtosis(residual)
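Skewness and kurtosis can be combined into a single normality check, the Jarque-Bera statistic JB = n/6 * (S<sup>2</sup> + K<sup>2</sup>/4), where S is the skewness and K the excess kurtosis. The sketch below computes it by hand on synthetic samples; the helper name is an assumption, and 5.991 is the 5% critical value of a chi-square distribution with 2 degrees of freedom. `scipy.stats.jarque_bera` provides the same statistic together with a p-value.

```python
import numpy as np

def jarque_bera_stat(x):
    """Jarque-Bera statistic from biased skewness and excess kurtosis."""
    x = np.asarray(x, dtype=float)
    n = x.size
    z = x - x.mean()
    m2 = np.mean(z ** 2)
    s = np.mean(z ** 3) / m2 ** 1.5       # skewness
    k = np.mean(z ** 4) / m2 ** 2 - 3.0   # excess kurtosis
    return n / 6.0 * (s ** 2 + k ** 2 / 4.0)

rng = np.random.default_rng(5)
jb_normal = jarque_bera_stat(rng.normal(size=2000))       # Gaussian sample
jb_skewed = jarque_bera_stat(rng.exponential(size=2000))  # strongly skewed sample

CHI2_CRIT_2DF_5PCT = 5.991  # reject normality when JB exceeds this
print(jb_normal, jb_skewed)
```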

<h2><font color='brown'><i>Assumptions of Linear Regression: Multivariate Normality satisfied</i></font></h2>

<h2>Feature Importance</h2>


<font color='red'><i>
The same quantity can be recorded in different units: weight in kilograms or grams, height in centimetres or metres. To compare coefficients, it's necessary to bring the variables onto a scale where their deviations are similar.
    </i></font>
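Why scaling matters for comparing coefficients can be seen in a small sketch. For a model without an intercept, dividing each column by its standard deviation multiplies the matching coefficient by that standard deviation, so the scaled coefficients measure the effect of a one-standard-deviation change. The data and names below are synthetic and illustrative; `X.std(axis=0)` is the population standard deviation, the same one `StandardScaler` uses.

```python
import numpy as np

def fit(design, y):
    """Least-squares coefficients (no intercept added)."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

# Two synthetic features on very different scales
rng = np.random.default_rng(6)
n = 300
X = np.column_stack([rng.normal(100, 50, n), rng.normal(10, 2, n)])
y = X @ np.array([0.05, 1.0]) + rng.normal(size=n)

sd = X.std(axis=0)               # per-column standard deviation
beta_raw = fit(X, y)             # coefficients per raw unit
beta_scaled = fit(X / sd, y)     # coefficients per standard deviation

print(beta_raw)
print(beta_scaled)               # equals beta_raw * sd
```

The raw coefficients suggest the second feature matters far more, but that partly reflects its smaller units; the scaled coefficients put both features on the same footing.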

In [None]:
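# Explicit column list keeps the order aligned with the feature names used below
X = df[['TV', 'Radio', 'Newspaper']]
# Bring features to a comparable scale: divide each column by its standard
# deviation (with_mean=False leaves the data uncentred, as the model has no intercept)
scaler = StandardScaler(with_mean=False, with_std=True)
X_t = scaler.fit_transform(X)
model = sm.OLS(df.Sales, X_t)
results = model.fit()
print(results.summary())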

In [None]:
# Convert to a plain list by position; integer indexing into a label-indexed Series is deprecated in pandas
value = np.asarray(results.params).tolist()
features = ['TV','Radio','Newspaper']

In [None]:
tmp = pd.DataFrame({'Feature': features, 'Feature importance': value})
tmp = tmp.sort_values(by='Feature importance',ascending=False)
plt.figure(figsize = (12,6))
plt.title('Features importance',fontsize=14)
s = sns.barplot(x='Feature',y='Feature importance',data=tmp)
s.set_xticklabels(s.get_xticklabels(),rotation=90)
plt.show()


<h2> Summary</h2>
<font color='Blue'><i><br>
1. TV ads come out as the most important feature.<br>
2. All the Linear Regression assumptions checked above are satisfied, so the model's coefficients and p-values can be interpreted with confidence.
</i></font>
