# Multiple regression using Advertising dataset - student version


Dataset: advertising.csv

## Multiple Linear Regression

Multiple Linear Regression extends Simple Linear Regression by using more than one predictor variable to predict the response variable. It models the linear relationship between a single continuous dependent variable and two or more independent variables by fitting the linear combination of predictors that best matches the observed data (typically by least squares).

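Equation

$$y=\beta_0+\beta_1 x_1+\beta_2 x_2+\dots+\beta_n x_n+e$$

where
- $y$ = Dependent variable / Target variable
- $\beta_0$ = Intercept of the regression line (the predicted value of $y$ when all predictors are 0)
- $\beta_1, \beta_2, \dots, \beta_n$ = Slope coefficients; $\beta_i$ is the change in $y$ for a one-unit increase in $x_i$ with the other predictors held fixed, and its sign tells whether $y$ increases or decreases with $x_i$
- $x_1, x_2, \dots, x_n$ = Independent variables / Predictor variables
- $e$ = Error term (the part of $y$ not explained by the predictors)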
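
The equation can be evaluated directly with NumPy. The sketch below uses made-up coefficients and predictor values (they are assumptions for illustration, not estimates from advertising.csv) to show how a prediction is assembled from the intercept and the slopes.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration
beta_0 = 3.0                          # intercept
beta = np.array([0.05, 0.10, 0.01])   # slopes for x_1, x_2, x_3

# Two made-up observations with three predictors each
X_demo = np.array([
    [100.0, 20.0, 10.0],
    [200.0, 40.0, 30.0],
])

# y_hat = beta_0 + beta_1*x_1 + beta_2*x_2 + beta_3*x_3 (error term omitted)
y_hat = beta_0 + X_demo @ beta
print(y_hat)
```

The matrix product `X_demo @ beta` computes the weighted sum of predictors for every row at once, which is the same calculation a fitted linear model performs when predicting.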


The task in this notebook is to predict sales from the money spent on TV, Radio, and Newspaper advertising. There are three independent variables (the money spent on TV, Radio, and Newspaper) and one dependent variable, sales, which is the value to be predicted.

In [13]:
# Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
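import sklearn.model_selection  # explicit submodule import so sklearn.model_selection.train_test_split is available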

In [14]:
# Reading the dataset
# TODO: read the dataset (ds) and find out the properties
df = pd.read_csv("advertising.csv")
print(df); print()

print(df.describe()); print()

# Count missing values per column
print(f"Number of NaNs in TV: {df['TV'].isna().sum()}")
print(f"Number of NaNs in Newspaper: {df['Newspaper'].isna().sum()}")
print(f"Number of NaNs in Radio: {df['Radio'].isna().sum()}")
print(f"Number of NaNs in Sales: {df['Sales'].isna().sum()}")


        TV  Radio  Newspaper  Sales
0    230.1   37.8       69.2   22.1
1     44.5   39.3       45.1   10.4
2     17.2   45.9       69.3   12.0
3    151.5   41.3       58.5   16.5
4    180.8   10.8       58.4   17.9
..     ...    ...        ...    ...
195   38.2    3.7       13.8    7.6
196   94.2    4.9        8.1   14.0
197  177.0    9.3        6.4   14.8
198  283.6   42.0       66.2   25.5
199  232.1    8.6        8.7   18.4

[200 rows x 4 columns]

               TV       Radio   Newspaper       Sales
count  200.000000  200.000000  200.000000  200.000000
mean   147.042500   23.264000   30.554000   15.130500
std     85.854236   14.846809   21.778621    5.283892
min      0.700000    0.000000    0.300000    1.600000
25%     74.375000    9.975000   12.750000   11.000000
50%    149.750000   22.900000   25.750000   16.000000
75%    218.825000   36.525000   45.100000   19.050000
max    296.400000   49.600000  114.000000   27.000000

Number of NaNs in TV: 0
Number of NaNs in Newspaper: 0
Number of NaNs in Radio: 0
Number of NaNs in Sales: 0

In [15]:
# Setting the value for X (=features) and y (=sales)
# TODO: select the features (X) and target (y)

X = df[["TV", "Radio", "Newspaper"]]
y = df[["Sales"]]

print(X); print()
print(y)


        TV  Radio  Newspaper
0    230.1   37.8       69.2
1     44.5   39.3       45.1
2     17.2   45.9       69.3
3    151.5   41.3       58.5
4    180.8   10.8       58.4
..     ...    ...        ...
195   38.2    3.7       13.8
196   94.2    4.9        8.1
197  177.0    9.3        6.4
198  283.6   42.0       66.2
199  232.1    8.6        8.7

[200 rows x 3 columns]

     Sales
0     22.1
1     10.4
2     12.0
3     16.5
4     17.9
..     ...
195    7.6
196   14.0
197   14.8
198   25.5
199   18.4

[200 rows x 1 columns]


In [None]:
# Splitting the dataset
# random_state controls the shuffling applied to the data before the split.
# (pass an int for reproducible output across multiple function calls)
# TODO: Split the dataset into train (X_train, y_train) and test sets (X_test, y_test) use 70/30 ratio

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, test_size = 0.3, random_state = 0
)

print(f"len of X_train: {len(X_train)}")



In [None]:
# Note: the test set is in random order (see the index values)
X_test.iloc[:10]

In [None]:
# Fitting the Linear Regression model
# TODO: build LinearRegression model using train data


In [None]:
# Intercept and Coefficient
# TODO: print out the coefficients and intercept


\# TODO: write the conclusions  

From these results one can conclude the following:
- if no money is spent on advertising, the predicted sales are ...
- a one-unit increase in TV advertising spend, with the other budgets held fixed, changes the predicted sales by ...
- a one-unit increase in Radio advertising spend, with the other budgets held fixed, changes the predicted sales by ...
- a one-unit increase in Newspaper advertising spend, with the other budgets held fixed, changes the predicted sales by ...
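
How such conclusions are read off a fitted model can be sketched on synthetic data with known parameters. Everything below (`X_syn`, `y_syn`, `model_syn`, and the chosen values 2.0, 0.5, 0.2, 0.0) is an assumption made for illustration and is unrelated to the advertising data; the point is only that `intercept_` corresponds to $\beta_0$ and `coef_` holds one slope per feature column, in column order.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic predictors: 50 rows, 3 feature columns (illustrative only)
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 100, size=(50, 3))

# Noise-free target built from known parameters:
# intercept 2.0, slopes 0.5, 0.2 and 0.0 (third feature has no effect)
y_syn = 2.0 + 0.5 * X_syn[:, 0] + 0.2 * X_syn[:, 1] + 0.0 * X_syn[:, 2]

model_syn = LinearRegression().fit(X_syn, y_syn)

# intercept_: predicted target when every feature is 0
print("Intercept:", model_syn.intercept_)
# coef_[i]: change in the prediction per one-unit increase in feature i,
# with the other features held fixed
print("Coefficients:", model_syn.coef_)
```

Because the synthetic target contains no noise, the fitted values recover the parameters used to build it (up to floating-point error). On real data the estimates also reflect noise and correlation between predictors.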

In [None]:
# Prediction of test set
# TODO: use test set to predict the sales, print the predicted values

# print predicted values


In [None]:
# Actual value and the predicted value
# TODO: Calculate the difference (prediction error)


In [None]:
# visualize the results
# TODO: Use a scatter plot to visualize the test and predicted values


In [None]:
# TODO: visualize the differences using bar plot


## Evaluating the Model

### R Squared: 
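R squared ($R^2$) is the coefficient of determination. It measures the proportion of the variance in the dependent variable that the model explains; it does not count how many points fall on the regression line. The R squared obtained for this model is 0.9011, which indicates that the model explains 90.11% of the variance in sales.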
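
As a rough illustration of how the coefficient of determination is computed, the sketch below evaluates $R^2 = 1 - SS_{res}/SS_{tot}$ on four made-up values (assumptions for illustration, not the advertising data) and compares the result with `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted values (illustrative only)
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 9.5])

# Residual sum of squares: variation the predictions fail to explain
ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)
# Total sum of squares: variation of the actual values around their mean
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
r2_sklearn = r2_score(y_true_demo, y_pred_demo)
print(r2_manual, r2_sklearn)
```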

### Mean Absolute Error: 
Mean Absolute Error is the average of the absolute differences between the actual (true) values and the predicted values. The lower the value, the better the model's performance; a mean absolute error of 0 means that the model predicts every output perfectly. The mean absolute error obtained for this particular model is 1.227, which is small relative to the mean sales value of about 15.13 in this dataset.

The mean_absolute_error function computes mean absolute error, a risk metric corresponding to the expected value of the absolute error loss or $l1$-norm loss.

$$MAE(y,\hat{y}) = \frac{1}{n_{samples}} \sum_{i=0}^{n_{samples}-1} \lvert y_i-\hat{y}_i \rvert$$
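
The formula can be checked by hand on a small example. The values below are made up for illustration (they are not taken from the advertising data); the manual result is compared with `sklearn.metrics.mean_absolute_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up actual and predicted values (illustrative only)
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 9.5])

# Average of the absolute prediction errors
mae_manual = np.mean(np.abs(y_true_demo - y_pred_demo))
mae_sklearn = mean_absolute_error(y_true_demo, y_pred_demo)
print(mae_manual, mae_sklearn)
```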

### Mean Square Error: 
Mean Square Error is calculated by taking the average of the squared differences between the actual and predicted values. Because the errors are squared, large errors are penalized more heavily than small ones, and the result is expressed in squared units of the target. The lower the value, the better the model's performance. The mean square error obtained for this particular model is 2.636.

$$MSE(y,\hat{y}) = \frac{1}{n_{samples}} \sum_{i=0}^{n_{samples}-1} (y_i-\hat{y}_i)^2$$

### Root Mean Square Error: 
Root Mean Square Error is the square root of the Mean Square Error. It can be read as the standard deviation of the prediction errors (when the errors average to zero) and, unlike the Mean Square Error, it is expressed in the same units as the target. The lower the value, the better the model's performance. The root mean square error obtained for this particular model is 1.623, which is small relative to the mean sales value of about 15.13.

$$RMSE(y,\hat{y}) = \sqrt{MSE(y,\hat{y})}$$
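
The relationship between the two squared-error metrics can be sketched on a small made-up example (the values are assumptions for illustration, not the advertising data). The root is taken with `np.sqrt` on top of `sklearn.metrics.mean_squared_error`, which is one of several ways to obtain the RMSE.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted values (illustrative only)
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 9.5])

# MSE: average of the squared prediction errors
mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2)
mse_sklearn = mean_squared_error(y_true_demo, y_pred_demo)

# RMSE: square root of the MSE, back in the units of the target
rmse = np.sqrt(mse_sklearn)
print(mse_manual, mse_sklearn, rmse)
```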

In [None]:
# Model Evaluation
# TODO: Calculate and print the indicated metrics (use sklearn.metrics package)
from sklearn import metrics

print('R squared: {:.2f}'.format(...))
print('R squared test: {:.2f}'.format(...))
print('Mean Absolute Error:', ...)
print('Mean Square Error:', ...)
print('Root Mean Square Error:', ...)