# Multiple Linear Regression
---

1.   **[Introduction to Multiple Linear Regression](#1.-Introduction-to-Multiple-Linear-Regression)**
1.   **[Foundations of Linear Regression](#2.-Foundations-of-Linear-Regression)**
1.   **[Model Assumptions](#3.-Model-Assumptions)**
1.   **[Variable Selection](#4.-Variable-Selection)**
1.   **[Exploratory Data Analysis](#5.-Exploratory-Data-Analysis)**
1.   **[Model Construction](#6.-Model-Construction)**
1.   **[Model Evaluation](#7.-Model-Evaluation)**
1.   **[Model Results](#8.-Model-Results)**

---
<a name="1.-Introduction-to-Multiple-Linear-Regression"></a>
### 1. Introduction to Multiple Linear Regression

#### 1.1 Definitions

**Multiple Linear Regression |** A technique that estimates the linear relationship between one `continuous` dependent variable and `two or more independent` variables.

**Dependent variable (y) |** The variable a given model estimates, also referred to as a response or outcome variable.

**Independent variable (x) |** A variable that explains trends in the dependent variable, also referred to as an explanatory or predictor variable.

**Simple Linear Regression Formula |** $y = \text{intercept} + \text{slope} \cdot x$

**Slope |** The amount that `y` increases or decreases per one-unit increase of `x`

**Intercept |** The value of `y`, the dependent variable, when `x`, the independent variable, equals 0

**Regression Coefficients |** The estimated betas in a regression model. Represented as $\hat{\beta}_i$

**Ordinary Least Squares Estimation (OLS) |** A common way to calculate the linear regression coefficients $\hat{\beta}_i$

**Loss Function |** A function that measures the distance between the observed values and the model's estimated values 



#### 1.2 Mathematical Multiple Linear Regression

**Multiple Linear Regression Equation |** $y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n$

**Interaction Term |** A term that represents how the relationship between two independent variables is associated with changes in the mean of the dependent variable
- Equation without interaction: $y = \beta_0 + \beta_1 X_1 + \beta_2 X_2$
- Equation with interaction: $y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_{\text{interaction}} (X_1 X_2)$
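As a minimal sketch of the interaction equation, the cell below fits a model with a product term using the `statsmodels` formula interface. The data is synthetic and noise-free, and the names `toy_interaction`, `x1`, `x2`, and `y` are hypothetical, chosen only so that the fitted coefficients can be compared with the known values.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic, noise-free data (an assumption made purely for illustration).
rng = np.random.default_rng(0)
toy_interaction = pd.DataFrame({
    "x1": rng.normal(size=50),
    "x2": rng.normal(size=50),
})

# True relationship: intercept 1, slopes 2 and 3, interaction coefficient 0.5.
toy_interaction["y"] = (
    1.0
    + 2.0 * toy_interaction["x1"]
    + 3.0 * toy_interaction["x2"]
    + 0.5 * toy_interaction["x1"] * toy_interaction["x2"]
)

# "x1:x2" adds only the product term to the formula.
interaction_model = ols("y ~ x1 + x2 + x1:x2", data=toy_interaction).fit()
interaction_params = interaction_model.params
print(interaction_params)
```

In the formula syntax, `x1 * x2` is shorthand for `x1 + x2 + x1:x2`, so either spelling should give the same model here.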

---
<a name="2.-Foundations-of-Linear-Regression"></a>
### 2. Foundations of Linear Regression

#### 2.1 Ordinary Least Squares Estimation


Ordinary least squares (OLS) is a method used in linear regression analysis to estimate the unknown parameters of the linear regression model. The goal of OLS estimation is to find the values of the regression coefficients that minimize the sum of the squared errors between the predicted values and the actual values of the dependent variable.

**Best Fit Line |** The line that fits the data best by minimizing some loss function or error

**Predicted values |** The estimated (y) values for each (x) calculated by a model

**Residual |** The difference between observed or actual values and the predicted values of the regression line 
- Residual = Observed - Predicted ---> $\epsilon_i = y_i - \hat{y_i}$

**Sum of Squared Residuals (SSR) |** The sum of the squared differences between each observed value and its associated predicted value 
- $SSR = \sum\limits_{i=1}^{n}(Observed - Predicted)^2$
- $SSR = \sum\limits_{i=1}^{n}(y_i - \hat{y_i})^2$

**Ordinary Least Squares (OLS) |** A method that minimizes the sum of the squared residuals to estimate parameters in a linear regression model
- Used to calculate: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
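To make the definitions above concrete, here is a small worked sketch for the simple (one-predictor) case. It uses the standard closed-form OLS estimates, then computes the predicted values, residuals, and SSR. The five data points are made up for illustration.

```python
import numpy as np

# Made-up observations for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form OLS estimates for simple linear regression.
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Predicted values, residuals (observed - predicted), and sum of squared residuals.
y_hat = intercept + slope * x
toy_residuals = y - y_hat
ssr = np.sum(toy_residuals ** 2)

print(f"slope={slope:.2f}, intercept={intercept:.2f}, SSR={ssr:.3f}")
# prints: slope=1.99, intercept=0.05, SSR=0.107
```

Any other choice of slope and intercept would give a larger SSR on these five points, which is what "best fit" means under this loss function.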

---
<a name="3.-Model-Assumptions"></a>
### 3. Model Assumptions

Model assumptions are statements about the data that must be true in order to justify the use of a particular modeling technique



#### 3.1 Multiple Linear Regression Assumptions
- **Linearity**
- **Normality**
- **Independent Observations**
- **Homoscedasticity**
- **No Multicollinearity**


##### 3.1.1 Linearity

**Each predictor variable $(X_i)$ is linearly related to the outcome variable $(Y)$**

##### 3.1.2 (Multivariate) Normality

**The residuals (errors) are normally distributed.**
- Can only be checked after the model is built, because the residuals must be known for the calculation
- Checked using a quantile-quantile plot (Q-Q plot)
    - If the points on the plot form a straight diagonal line, normality can be assumed

##### 3.1.3 Independent Observation 

**Each observation in the dataset is independent**

##### 3.1.4 Homoscedasticity

**The variation of the residuals (errors) is constant or similar across the model**
- Homoscedasticity means having the same scatter

##### 3.1.5 No Multicollinearity

**No two independent variables ($X_i$ and $X_j$) can be highly correlated with each other.**
- The variance inflation factor (VIF) quantifies how correlated each independent variable is with all of the other independent variables


#### 3.2 Assumption Violations



##### 3.2.1 Linearity
**Transform one or both of the variables**, such as taking the logarithm.
- For example, if measuring the relationship between years of education and income, take the logarithm of the income variable and check if that helps the linear relationship.

##### 3.2.2 Normality
**Transform one or both variables.** Most commonly, this would involve taking the logarithm of the outcome variable.
- When the outcome variable is right skewed, the normality of the residuals can be affected. Taking the logarithm of the outcome variable can sometimes help with this assumption.
- When transforming a variable, reconstruct the model and recheck the normality assumption. If the assumption is still not satisfied, continue troubleshooting the issue.
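The effect of a log transform on a right-skewed variable can be sketched as follows. The `income` series is synthetic (log-normal by construction, which is an assumption chosen so the transform works well), and skewness is used here only as a rough summary. The residuals of the rebuilt model would still need to be rechecked.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed outcome variable (log-normal by construction).
rng = np.random.default_rng(42)
income = pd.Series(rng.lognormal(mean=10, sigma=1.0, size=1000), name="income")

# Take the natural logarithm of the outcome variable.
log_income = np.log(income)

# Compare sample skewness before and after the transformation.
raw_skew = income.skew()
log_skew = log_income.skew()
print(f"skew before: {raw_skew:.2f}, skew after: {log_skew:.2f}")
```

Note that a model fitted on the transformed outcome predicts log-income, so its coefficients are interpreted on the log scale.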

##### 3.2.3 Independent Observation 
**Take just a subset of the available data.**
- If, for example, data is a survey including responses from people in the same household, responses may be correlated. Correct for this by just keeping the data of one person in each household.
- Another example is data on bike rentals over a time period. If the data is collected every 15 minutes, the number of bikes rented out at 8:00 a.m. might correlate with the number of bikes rented out at 8:15 a.m. The observations may be closer to independent if the data is taken once every 2 hours instead of once every 15 minutes.

##### 3.2.4 Homoscedasticity
**Define a different outcome variable.**
- For example, when studying how a city’s population correlates with the number of restaurants in the city, some cities are far more populous than others. It is therefore possible to redefine the outcome variable as the ratio of population to restaurants instead.

**Transform the Y variable.**
- As with the above assumptions, sometimes taking the logarithm or transforming the Y variable in another way can potentially fix inconsistencies with the homoscedasticity assumption.

##### 3.2.5 No Multicollinearity
**Drop Variables**
- Drop one or more variables that have high multicollinearity
- Strategic variable selection:
    - Forward Selection
    - Backward elimination 
- Advanced variable selection:
    - Ridge regression
    - Lasso regression
    - Elastic-net regression
    - Principal component analysis (PCA)

**Create new Variables**
- Use existing data to create new variables

---
<a name="4.-Variable-Selection"></a>
### 4. Variable Selection

#### 4.1 Definitions

**$R^2$ |** The proportion of variance of the dependent variable, Y, explained by the independent variables, X

**Overfitting |** When a model fits the observed or training data too specifically, and is unable to generate suitable estimates for the general population

**Adjusted $R^2$ |** A variation of the $R^2$ regression evaluation metric that penalizes unnecessary explanatory variables

**$R^2$ vs. Adjusted $R^2$ |**
- Adjusted $R^2$ is used to compare models of varying complexity
    - Determine if you should add another variable or not
- $R^2$ is more easily interpretable
    - Determine how much variation in the dependent variable is explained by the model

**When to use adjusted R-squared**
- Adjusted R-squared is used to compare between multiple regression models with varying numbers of independent variables. To avoid selecting an overfitted model purely based on inflated R-squared, adjusted R-squared is used to select the optimal model. 
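The penalty can be seen in the usual formula, $R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of observations and $p$ the number of independent variables. The sketch below applies it to two hypothetical models. The $R^2$ values and sample size are invented for illustration.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared for a model with an intercept and n_predictors predictors."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical comparison: seven extra predictors raise R-squared only slightly.
small_model = adjusted_r_squared(0.900, n_obs=50, n_predictors=3)
large_model = adjusted_r_squared(0.902, n_obs=50, n_predictors=10)

print(f"3 predictors:  {small_model:.4f}")
print(f"10 predictors: {large_model:.4f}")
```

Here the larger model has the higher $R^2$ but the lower adjusted $R^2$, so the smaller model would be preferred. A fitted `statsmodels` OLS result exposes the same quantity as `rsquared_adj`.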

#### 4.2 Selection Methods

**Variable / Feature selection |** The process of determining which variables or features to include in a given model

**Forward Selection |** A stepwise variable selection process that begins with the null model, with 0 independent variables, and considers all possible variables to add. At each step, it incorporates the independent variable that contributes the most explanatory power to the model.

**Backward Elimination |** A stepwise variable selection process that begins with the full model, with all possible independent variables, and removes the independent variable that adds the least explanatory power to the model.  
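Forward selection can be sketched as a greedy loop. The version below uses adjusted $R^2$ as the measure of explanatory power, which is one possible criterion and an assumption of this sketch (p-values, AIC, or an F-test are common alternatives). The function name `forward_select` and the synthetic columns are hypothetical.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols


def forward_select(df, response, candidates):
    """Greedy forward selection: repeatedly add the predictor that gives the
    largest adjusted R-squared, and stop when no candidate improves it."""
    selected = []
    remaining = list(candidates)
    # Start from the null model (intercept only).
    best_score = ols(f"{response} ~ 1", data=df).fit().rsquared_adj
    while remaining:
        scores = {}
        for name in remaining:
            formula = f"{response} ~ " + " + ".join(selected + [name])
            scores[name] = ols(formula, data=df).fit().rsquared_adj
        best_name = max(scores, key=scores.get)
        if scores[best_name] <= best_score:
            break  # No remaining variable adds explanatory power.
        selected.append(best_name)
        remaining.remove(best_name)
        best_score = scores[best_name]
    return selected


# Synthetic data: y depends on x1 and x2 only; x3 is pure noise.
rng = np.random.default_rng(3)
toy_fs = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
toy_fs["y"] = 3.0 * toy_fs["x1"] + 2.0 * toy_fs["x2"] + rng.normal(size=200)

selected_vars = forward_select(toy_fs, "y", ["x1", "x2", "x3"])
print(selected_vars)
```

Backward elimination follows the same pattern in reverse: start from the full formula and drop the variable whose removal hurts the criterion least.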

**Extra-sum-of-squares F-test |** Quantifies the amount of variance that is left unexplained by a reduced model but is explained by the full model

**Bias-variance tradeoff |** Balance between two model qualities, bias and variance, to minimize overall error for unobserved data

**Regularization |** A set of regression techniques that shrinks regression coefficient estimates toward zero, adding in bias, to reduce variance

**Regularized regression |**
- Lasso Regression
- Ridge Regression
- Elastic-net Regression
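The shrinkage effect can be illustrated with ridge regression, which has the closed-form solution $\hat{\beta} = (X^TX + \lambda I)^{-1}X^Ty$. The sketch below is a bare-bones NumPy version on synthetic, nearly collinear data. It centers the variables so that no intercept is needed, and the penalty value of 10 is arbitrary. In practice a library implementation with a tuned penalty would normally be used.

```python
import numpy as np

# Synthetic data with two nearly collinear predictors.
rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
X_toy = np.column_stack([x1, x2])
y_toy = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

# Center predictors and outcome so the intercept drops out.
X_centered = X_toy - X_toy.mean(axis=0)
y_centered = y_toy - y_toy.mean()


def ridge_coefficients(X, y, penalty):
    """Solve (X'X + penalty * I) beta = X'y; penalty = 0 gives plain OLS."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n_features), X.T @ y)


beta_ols = ridge_coefficients(X_centered, y_centered, penalty=0.0)
beta_ridge = ridge_coefficients(X_centered, y_centered, penalty=10.0)

print("OLS coefficients:  ", beta_ols)
print("Ridge coefficients:", beta_ridge)
```

The ridge coefficients have a smaller overall magnitude than the OLS ones: a little bias is accepted in exchange for estimates that vary less from sample to sample.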

---
<a name="5.-Exploratory-Data-Analysis"></a>
### 5. Exploratory Data Analysis

#### 5.1 Imports

In [None]:
# Import relevant Python libraries and modules

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load the dataset into a DataFrame and save in a variable

data = pd.read_csv("example_file.csv")

#### 5.2 Data Exploration

In [None]:
# Display the first 10 rows of the data
data.head(10)

In [None]:
# Display number of rows, number of columns
data.shape

In [None]:
# Create a pairplot to visualize the relationship between the continuous variables in the data
sns.pairplot(data);

# Important to consider which variables have a linear relationship 

#### 5.3 Missing Data

##### 5.3.1 Check for missing data

In [None]:
# Step 1. Start with .isna() to get booleans indicating whether each value in the data is missing
data.isna()

In [None]:
# Step 2. Use .any(axis=1) to get booleans indicating whether there are any missing values along the columns in each row
data.isna().any(axis=1)

In [None]:
# Step 3. Use .sum() to get the number of rows that contain missing values
data.isna().any(axis=1).sum()
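The three steps can be traced on a tiny, made-up DataFrame where the answer is easy to verify by eye. The name `toy_missing` and its columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Four rows; rows 1 and 2 each contain one missing value.
toy_missing = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [10.0, 20.0, np.nan, 40.0],
    "c": ["x", "y", "z", "w"],
})

# Boolean Series: True for every row that has at least one missing value.
rows_with_missing = toy_missing.isna().any(axis=1)

# Summing booleans counts the True values.
n_missing_rows = rows_with_missing.sum()

print(rows_with_missing)
print(n_missing_rows)  # 2
```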

##### 5.3.2 Drop missing data

In [None]:
# Step 1. Use .dropna(axis=0) to indicate that you want rows which contain missing values to be dropped
# Step 2. To update the DataFrame, reassign it to the result
data = data.dropna(axis=0)

In [None]:
# Check to make sure that the data does not contain any rows with missing values now

# Step 1. Start with .isna() to get booleans indicating whether each value in the data is missing
# Step 2. Use .any(axis=1) to get booleans indicating whether there are any missing values along the columns in each row
# Step 3. Use .sum() to get the number of rows that contain missing values
data.isna().any(axis=1).sum()

---
<a name="6.-Model-Construction"></a>
### 6. Model Construction

In [None]:
# Select relevant columns
# Save resulting DataFrame in a separate variable to prepare for regression
# "Y", "X_1", "X_2" and "X_3" are placeholders: replace them with the column names of the dependent variable and the independent variables
ols_data = data[["Y", "X_1", "X_2", "X_3"]]

# Display first 10 rows of the new DataFrame
ols_data.head(10)

In [None]:
# Write the linear regression formula, replacing the placeholders Y, X_1, X_2 and X_3 with the corresponding column names, e.g. Sales and Ad_Spend
# Save it in a variable
# C(...) marks a variable as categorical (here the placeholder X_3)
ols_formula = "Y ~ X_1 + X_2 + C(X_3)"

# Implement OLS approach for linear regression
OLS = ols(formula=ols_formula, data=ols_data)

# Fit the model to the data
# Save the fitted model in a variable
model = OLS.fit()

# Save the results summary.
model_results = model.summary()

# Display the model results.
model_results

---
<a name="7.-Model-Evaluation"></a>
### 7. Model Evaluation

#### 7.1 Model Assumptions Check

##### 7.1.1 Linearity Check

In [None]:
# Create a scatterplot for each independent variable and the dependent variable.

# Create a 1x2 plot figure.
fig, axes = plt.subplots(1, 2, figsize = (8,4))

# Create a scatterplot between X_1 and Y.
sns.scatterplot(x = ols_data['X_1'], y = ols_data['Y'],ax=axes[0])

# Set the title of the first plot.
axes[0].set_title("X_1 and Y")

# Create a scatterplot between X_2 and Y.
sns.scatterplot(x = ols_data['X_2'], y = ols_data['Y'],ax=axes[1])

# Set the title of the second plot.
axes[1].set_title("X_2 and Y")

# Set the xlabel of the second plot.
axes[1].set_xlabel("X_2")

# Use matplotlib's tight_layout() function to add space between plots for a cleaner appearance.
plt.tight_layout()

**Question to answer |** Is there a clear linear relationship in the scatterplots between Y and each X variable? If yes, then the assumption is met.

##### 7.1.2 Independence Check

The independent observation assumption states that each observation in the dataset is independent. For example, if each row represents a separate marketing promotion that is unrelated to the others, the independence assumption is not violated.
- Consider whether each row of data is independent of the others. If so, the independence assumption is not violated.

##### 7.1.3 Normality Check

Create the following plots to check the **normality assumption**:

* **Plot 1**: Histogram of the residuals
* **Plot 2**: Q-Q plot of the residuals

In [None]:
# Calculate the residuals.
residuals = model.resid

# Create a 1x2 plot figure.
fig, axes = plt.subplots(1, 2, figsize = (8,4))

# Create a histogram with the residuals. 
sns.histplot(residuals, ax=axes[0])

# Set the x label of the residual plot.
axes[0].set_xlabel("Residual Value")

# Set the title of the residual plot.
axes[0].set_title("Histogram of Residuals")

# Create a Q-Q plot of the residuals.
sm.qqplot(residuals, line='s',ax = axes[1])

# Set the title of the Q-Q plot.
axes[1].set_title("Normal QQ Plot")

# Use matplotlib's tight_layout() function to add space between plots for a cleaner appearance.
plt.tight_layout()

# Show the plot.
plt.show()

**Question to answer |** Based on the visualizations above, is the distribution of the residuals normal?
- Is the histogram of the residuals approximately normally distributed?
- Are the residuals in the Q-Q plot forming a straight line?

##### 7.1.4 Homoscedasticity (Constant Variance) Check

In [None]:
# Create a scatterplot with the fitted values from the model and the residuals.
# sns.scatterplot returns a matplotlib Axes object, so store it as ax.
ax = sns.scatterplot(x = model.fittedvalues, y = model.resid)

# Set the x axis label.
ax.set_xlabel("Fitted Values")

# Set the y axis label.
ax.set_ylabel("Residuals")

# Set the title.
ax.set_title("Fitted Values v. Residuals")

# Add a line at y = 0 to visualize the variance of residuals above and below 0.
ax.axhline(0)

# Show the plot.
plt.show()

**Question to answer:**

1. Do the data points resemble a random cloud, without following an explicit pattern?
    - If yes, then the homoscedasticity assumption is met

##### 7.1.5 Multicollinearity Check

**Two common ways to check for multicollinearity are to:**

* Create scatterplots to show the relationship between pairs of independent variables
* Use the variance inflation factor to detect multicollinearity

In [None]:
# Create a pairplot of the data.
sns.pairplot(data)

**Question to answer:**

1. Are the independent variables visibly linearly correlated?
    - If no then assumption met

In [None]:
# Calculate the variance inflation factor (optional).

# Import variance_inflation_factor from statsmodels.
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Create a subset of the data with the continuous independent variables. 
X = ols_data[['X_1','X_2']]

# Calculate the variance inflation factor for each variable.
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

# Create a DataFrame with the VIF results for the column names in X.
df_vif = pd.DataFrame(vif, index=X.columns, columns = ['VIF'])

# Display the VIF results.
df_vif

**Question to answer:**

1. Are any of the VIF values high enough to suggest multicollinearity?
    - A VIF of 1 indicates that a predictor is uncorrelated with the other predictors, while a value greater than 1 indicates some correlation. Generally, a VIF of 5 or above is considered to indicate a high degree of multicollinearity.



---
<a name="8.-Model-Results"></a>
### 8. Model Results

In [None]:
# Display the model results summary.
model_results

#### 8.1 Drawing Conclusions

1. Interpret the R-squared value
2. Interpret the coefficients: look at the coefficient estimates and the uncertainty of these estimates
    - What are the coefficients? For example:
        * $\beta_{0} =  218.5261 $
        * $\beta_{TVLow}= -154.2971$
        * $\beta_{TVMedium} = -75.3120 $
        * $\beta_{Radio} =  2.9669$
3. Express the relationship that has been modelled as a linear equation:
    - eg:
        - $\text{Sales} = \beta_{0} + \beta_{1}*X_{1}+ \beta_{2}*X_{2}+ \beta_{3}*X_{3}$
        - $\text{Sales} = \beta_{0} + \beta_{TVLow}*X_{TVLow}+ \beta_{TVMedium}*X_{TVMedium}+ \beta_{Radio}*X_{Radio}$
        - $\text{Sales} = 218.5261 - 154.2971*X_{TVLow} - 75.3120*X_{TVMedium}+ 2.9669 *X_{Radio}$
4. What is your interpretation of the coefficient estimates? Are the coefficients statistically significant?
5. Beta coefficients allow an estimation of the magnitude and direction (positive or negative) of the effect of each independent variable on the dependent variable. 
    - The coefficient estimates can be converted to explainable insights, such as the connection between an increase in $X_1$ and $Y$
6. What are you interested in exploring further based on the current model?
7. Do you think your model could be improved? Why or why not? How?
8. What findings would be important to share with stakeholders, and how should these be framed for the most effective communication?
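As a worked example of turning the equation in step 3 into an estimate, the sketch below plugs values into the example coefficients. It assumes that `TVLow` and `TVMedium` are 0/1 indicator variables with a high TV budget as the baseline category, and the radio value of 20 is invented for illustration.

```python
# Coefficient estimates copied from the example above.
coefficients = {
    "Intercept": 218.5261,
    "TVLow": -154.2971,
    "TVMedium": -75.3120,
    "Radio": 2.9669,
}


def predict_sales(tv_level: str, radio: float) -> float:
    """Estimate Sales from the example equation.

    tv_level is "Low", "Medium", or "High"; "High" is assumed to be the
    baseline category, so both indicator variables are 0 for it.
    """
    tv_low = 1 if tv_level == "Low" else 0
    tv_medium = 1 if tv_level == "Medium" else 0
    return (
        coefficients["Intercept"]
        + coefficients["TVLow"] * tv_low
        + coefficients["TVMedium"] * tv_medium
        + coefficients["Radio"] * radio
    )


# Hypothetical promotions that differ only in TV level.
sales_high = predict_sales("High", radio=20)      # 277.8641
sales_medium = predict_sales("Medium", radio=20)  # 202.5521
sales_low = predict_sales("Low", radio=20)        # 123.5670
print(sales_high, sales_medium, sales_low)
```

Under these assumptions, the gap between the high and medium estimates equals the `TVMedium` coefficient, which is how a categorical coefficient is read: as a difference from the baseline category with the other variables held fixed.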