---
# **Multiple Linear Regression**
---



### **Introduction**
This notebook provides a comprehensive guide to understanding and implementing multiple linear regression. We'll cover the underlying concepts, step-by-step implementation, and practical examples to help you master this powerful statistical technique.

### **Objectives**
By the end of this notebook, you will be able to:
* Understand the concept of multiple linear regression.
* Implement multiple linear regression using Python and scikit-learn.
* Interpret the results of a multiple linear regression model.
* Evaluate the performance of a multiple linear regression model.
* Apply multiple linear regression to real-world problems.

## **Conceptual Overview**


### What is Multiple Linear Regression?
Like simple linear regression, multiple linear regression is a supervised regression algorithm. The difference is that several independent variables, rather than one, are used to predict the dependent variable. We train the model on labeled data and then use it to predict a numerical value.
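Written out, the model takes the form below. The notation is an assumption made for illustration: $a_0$ is the intercept (matching the `a0` label printed later in this notebook), $a_1, \dots, a_p$ are the coefficients of the $p$ independent variables $x_1, \dots, x_p$, and $\varepsilon$ is the error term.

$$
y = a_0 + a_1 x_1 + a_2 x_2 + \dots + a_p x_p + \varepsilon
$$

For the startup data used below, $p = 3$ (R&D Spend, Administration, Marketing Spend) and $y$ is Profit.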

### Key Concepts
*   **Supervised Learning:** Training a model using labeled data.
*   **Regression:** Predicting a numerical value.
*   **Independent Variables:** Variables that influence the dependent variable.
*   **Dependent Variable:** The variable we are trying to predict.
*   **Least Squares:** A method that finds the best-fitting plane (or hyperplane) by minimizing the sum of squared differences between the observed and predicted values.
*   **Parsimony:** Striking a balance between the quality of fit and the number of variables.
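*   **R^2 Score:** A metric that tells us what proportion of the variation in the dependent variable is explained by the model. A value of 1 means the predictions match the data exactly.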
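As a minimal sketch of what least squares does, the following fits a plane to synthetic, noise-free data with NumPy's `np.linalg.lstsq`. The data and the "true" parameters (2, 3, -1) are made up for illustration; scikit-learn's `LinearRegression` solves the same kind of problem for us later in the notebook.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y depends linearly on two features.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(30, 2))
y_demo = 2.0 + 3.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]

# Prepend a column of ones so the first fitted parameter acts as the intercept.
design = np.column_stack([np.ones(len(X_demo)), X_demo])

# Least squares: choose the parameters that minimize the sum of squared residuals.
params, *_ = np.linalg.lstsq(design, y_demo, rcond=None)
intercept_demo, coefs_demo = params[0], params[1:]

print("Intercept:", intercept_demo)
print("Coefficients:", coefs_demo)
```

Because the synthetic data contains no noise, the fitted parameters recover the values used to generate it, up to floating-point error.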


### Assumptions of Multiple Linear Regression
Multiple linear regression relies on several key assumptions:

1.  **Linearity:** The relationship between the independent and dependent variables is linear.
2.  **Independence:** The errors (residuals) are independent of each other.
3.  **Homoscedasticity:** The variance of the errors is constant across all levels of the independent variables.
4.  **Normality:** The errors are normally distributed.
5.  **No Multicollinearity:** The independent variables are not highly correlated with each other.

It's crucial to test these assumptions to ensure the validity of the regression results. Violations of these assumptions can lead to biased or inefficient estimates.
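One common diagnostic for the multicollinearity assumption is the variance inflation factor (VIF): each independent variable is regressed on the others, and VIF = 1 / (1 - R^2) of that auxiliary regression. The sketch below is a NumPy-only illustration on synthetic data; the helper name `variance_inflation_factors` is hypothetical and not part of any library used in this notebook.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return one VIF per column of X (hypothetical helper, NumPy only)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    vifs = np.empty(n_features)
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Include an intercept so the auxiliary regression is not forced through zero.
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r2 = 1.0 - (residuals @ residuals) / np.sum((target - target.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Synthetic features (an assumption for illustration).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)  # nearly a copy of x1
x3 = rng.normal(size=200)                  # unrelated to x1 and x2

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(vifs)
```

The two nearly identical columns get large VIFs, while the unrelated column stays close to 1. Thresholds for "too high" vary by source, so treat any cutoff as a rule of thumb.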


> #### **Check for Understanding:**
> Why is it important to test the assumptions of multiple linear regression?

---

---
# **Setup**
---
In this section, we will set up the tools needed to implement multiple linear regression in Python. We will use the following libraries:

*   **NumPy:** For numerical computations.
*   **Pandas:** For data manipulation and analysis.
*   **Matplotlib and seaborn:** For data visualization.
*   **Scikit-learn:** For building and evaluating the linear regression model.

> Let's start by importing these libraries.

In [14]:
import numpy as np              # For numerical computations
import pandas as pd             # For data manipulation and analysis
import matplotlib.pyplot as plt # For data visualization
import seaborn as sns           # For enhanced data visualization

from sklearn.model_selection import train_test_split        # For splitting the data into training and testing sets
from sklearn.linear_model import LinearRegression           # For building the linear regression model
from sklearn.metrics import r2_score                        # For evaluating the model

---
# **Reading and Preparing the Data for Modeling**
---

## 1. Load Dataset to workspace
We will use the attributes of 50 startups to predict their profit. 

In [4]:
# read dataset from csv
dataset = pd.read_csv('50_Startups.csv')
print(dataset.head())


   R&D Spend  Administration  Marketing Spend     Profit
0  165349.20       136897.80        471784.10  192261.83
1  162597.70       151377.59        443898.53  191792.06
2  153441.51       101145.55        407934.54  191050.39
3  144372.41       118671.85        383199.62  182901.99
4  142107.34        91391.77        366168.42  166187.94


> Now that we have loaded the data, let's separate the independent variables (X) from the dependent variable (y).

## 2. Separating the independent and dependent variables

*   **X (Independent Variables):** R&D Spend, Administration, and Marketing Spend.
*   **y (Dependent Variable):** Profit

In [5]:
# Set the independent variables using all rows and all columns except the last one.
X = dataset.iloc[:, :-1].values

# Set the dependent variable using all rows, but only the last column.
y = dataset.iloc[:, -1].values
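Positional slicing works as long as the target is the last column. An alternative, sketched below, is to select columns by name, which keeps working if the column order changes. To stay runnable without the CSV, the example retypes the five rows shown by `dataset.head()` above; the variable names `sample`, `feature_columns`, `X_named`, and `y_named` are introduced here for illustration.

```python
import pandas as pd

# The five rows printed by dataset.head() above, retyped so this runs without the CSV.
sample = pd.DataFrame({
    "R&D Spend": [165349.20, 162597.70, 153441.51, 144372.41, 142107.34],
    "Administration": [136897.80, 151377.59, 101145.55, 118671.85, 91391.77],
    "Marketing Spend": [471784.10, 443898.53, 407934.54, 383199.62, 366168.42],
    "Profit": [192261.83, 191792.06, 191050.39, 182901.99, 166187.94],
})

# Select features and target by column name.
feature_columns = ["R&D Spend", "Administration", "Marketing Spend"]
X_named = sample[feature_columns].to_numpy()
y_named = sample["Profit"].to_numpy()

# The positional slicing used in the notebook yields the same arrays here.
X_positional = sample.iloc[:, :-1].values
y_positional = sample.iloc[:, -1].values

print(X_named.shape, y_named.shape)
print((X_named == X_positional).all(), (y_named == y_positional).all())
```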

> - Next, we need to split the data into training and testing sets. This will allow us to train our model on a portion of the data and then evaluate its performance on the remaining portion.
> - We will use the `train_test_split` function from Scikit-learn to split the data. We will use 25% of the data for testing and 75% for training.

## 3. Splitting the data into training and testing sets

In [6]:
# Splitting the data into training and testing sets
# test_size = 0.25 means that 25% of the data will be used for testing
# random_state = 0 fixes the random seed so the split is reproducible
# X_train and y_train will be the training sets
# X_test and y_test will be the test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

print("Shape of the training Features", X_train.shape)
print("Shape of the testing Features", X_test.shape)
print("Shape of the training Labels", y_train.shape)
print("Shape of the testing Labels", y_test.shape)

Shape of the training Features (34, 3)
Shape of the testing Features (12, 3)
Shape of the training Labels (34,)
Shape of the testing Labels (12,)
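For a fractional `test_size`, scikit-learn rounds the test set size up, so the split is not always an exact 25/75. The sketch below reproduces the sizes printed above, assuming 46 rows in total (34 + 12). It uses a stand-in array of row indices rather than the real dataset.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 46  # assumption: 34 + 12, the row counts printed above
indices = np.arange(n_samples)  # stand-in for the rows of the dataset

train_idx, test_idx = train_test_split(indices, test_size=0.25, random_state=0)

# With a float test_size and no train_size, the test count is rounded up
# and the training set receives the remaining rows.
expected_test = math.ceil(0.25 * n_samples)
expected_train = n_samples - expected_test

print("train:", len(train_idx), "test:", len(test_idx))
print("expected:", expected_train, expected_test)
```

By the same rule, a dataset with exactly 50 rows would be split into 37 training rows and 13 test rows.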


> - Now that we have split the data into training and testing sets, we can build and train our linear regression model.
> - We will use the `LinearRegression` class from Scikit-learn to build our model. Then, we will use the `fit` method to train the model on the training data.

---
# **Modeling**
---

## 1. Building the Model

In [7]:
lr = LinearRegression() # Creates a linear regression object
lr

## 2. Training the Model


In [8]:
lr.fit(X_train, y_train) # Trains the model on the training data

In [9]:
print("Intercept (a0):", lr.intercept_)  # Intercept of the model
print("Coefficients (a1..a3):", lr.coef_)  # One coefficient per independent variable

Intercept (a0): 59177.89865443355
Coefficients (a1..a3): [ 0.76968099 -0.04685646  0.0176137 ]


In [10]:
lr.score(X_train, y_train) # R^2 score of the model on the training data

0.9622339164880572

> - With our model trained, we can now make predictions on the test set.
> - We will use the `predict` method to make predictions on the test data.

---
# **Evaluation (Check Performance)**
---

With the model trained, we can generate predictions for the test set and compare them with the actual profits to see how well the model generalizes to data it has not seen.

In [11]:
# Making predictions on the testing sets
y_pred = lr.predict(X_test) 

In [12]:
# Evaluating the model
score = r2_score(y_test, y_pred) # Calculates the R^2 score
# Equivalently, the model's own score method returns the same R^2 score:
# score = lr.score(X_test, y_test)

print('R^2 score:', score) 

# Print out our score properly formatted as a percent.
print("R^2 score:", "{:.0%}".format(score))

R^2 score: 0.9444425023599581
R^2 score: 94%
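To make the metric less of a black box, here is a minimal sketch that computes R^2 from its definition, 1 - SS_res / SS_tot, and compares it with `r2_score`. The four values are made up for illustration and are not taken from the startup dataset.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

# Residual sum of squares: the variation the model fails to explain.
ss_res = np.sum((y_true - y_hat) ** 2)
# Total sum of squares: the variation around the mean of the true values.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_hat)

print(r2_manual, r2_sklearn)
```

Both computations give the same value, which confirms that `r2_score` measures the share of variation in the true values that the predictions account for.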


> Because we have three independent variables, the fitted model is a hyperplane in four-dimensional space and cannot be drawn on a single 2D plot. We therefore skip the visualization step and rely on the R^2 score, which tells us how much of the variation in the dependent variable is explained by the independent variables.

---
# **Inference**
---

We can now apply this trained model to novel examples to predict their profit.

In [13]:
# Prediction for a business with R&D of 160,000, Admin of 130,000 and Marketing of 300,000.
print(lr.predict([[160000, 130000, 300000]])[0])

181519.62842299516
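The prediction is simply the intercept plus the weighted sum of the inputs. The sketch below recomputes it by hand from the intercept and coefficients printed in the Modeling section. Those printed coefficients are rounded to eight decimal places, so the result agrees with `lr.predict` only approximately.

```python
import numpy as np

# Values copied from the output printed earlier in this notebook.
intercept = 59177.89865443355
coefficients = np.array([0.76968099, -0.04685646, 0.0176137])  # rounded as printed

# R&D Spend, Administration, Marketing Spend for the new business.
new_startup = np.array([160000, 130000, 300000])

# Prediction = intercept + sum of (coefficient * feature value).
manual_prediction = intercept + coefficients @ new_startup

print(manual_prediction)
```

The result matches the model's output of 181519.63 to within a fraction of a unit. The coefficients also show that, in this fit, R&D Spend has by far the largest effect on predicted profit per unit spent.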
