# Regression Models

In the previous course, we introduced the fundamental concepts of Machine Learning. In this course, we will go deeper: we will build linear regression models and use them to make predictions.

## What you will learn in the course 🧐🧐

- Understanding simple and multiple regression models
- Knowing the assumptions required to apply regression models
- Understanding what dummy variables are
- Building linear regression models with Python
- Understanding $R^2$ and adjusted $R^2$
- Interpreting coefficients

## Simple linear regressions 🍼

### Definition

Simple linear regression models are based on the following linear equation:

$$
y = ax + b
$$

Given your individuals (the point cloud), your model will find the line that comes closest to all of them at once. This is what it looks like visually:

![](https://essentials-assets.s3.eu-west-3.amazonaws.com/M04-Machine-learning/regressions.png)

### Dependent variables

In Machine Learning, we always distinguish between **dependent variables** and **independent variables.** Dependent variables are the things you are trying to predict. In the equation above, this corresponds to $y$. In other words, $y$ depends on $a$, $x$, and $b$ to have a value.

### Independent Variables

The independent variables, represented by $x$, are your predictors or factors that will determine the value of $y$. For example, if we are trying to predict someone's salary based on their number of years of experience, the independent variable $x$ is the number of years of experience.

### Coefficient

The coefficient $a$ represents the _slope_ or weight your independent variable will have in your equation.

### Constant

Finally, the constant $b$ represents the value of $y$ when $x=0$, that is, where your line crosses the vertical axis. In the case of predicting wages based on years of experience, even if you have 0 years of experience ( $x=0$ ), you will still have a starting minimum wage, not 0.
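To make the four pieces concrete, here is a minimal sketch of the equation as a Python function. The values of `a` and `b` and the function name `predict_salary` are hypothetical, chosen only for illustration.

```python
# Hypothetical values, chosen only for illustration
a = 2.0   # coefficient: each extra year of experience adds 2 k€
b = 30.0  # constant: salary in k€ with 0 years of experience


def predict_salary(years_of_experience):
    """Apply y = ax + b to one value of the independent variable."""
    return a * years_of_experience + b


for years in (0, 1, 5):
    print(f"{years} years -> {predict_salary(years):.1f} k€")
```

With 0 years of experience the prediction is simply the constant `b`, which matches the idea of a starting wage.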

## Ordinary Least squares method 🟩

### Definition

You're probably wondering how we know that the line in our model is the one that is "closest" to each point in our dataset. Well, it's thanks to the **Ordinary Least Squares** method. We won't go too far in demonstrating the formula. What you have to understand is that the algorithm looks for the line that minimizes the sum of the squared distances between each point in your graph and the line itself, using this formula:

$$
\min_{a,b} \sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^2
$$

In this equation, $y_{i}$ represents the observed value for each individual (or point) in your dataset while $\hat{y}_i$ represents the prediction of your model for that individual.

The algorithm finds the values of $a$ and $b$ that make this sum as small as possible, which gives the best possible line to describe your dataset. For linear regression this minimum can be computed directly with a formula, so no trial and error is needed.
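As a rough illustration, the sketch below computes the least squares line for a simple regression using the standard closed-form formulas, then checks that nudging the line in any direction increases the sum of squares. The data and the helper name `sum_of_squares` are made up for the example.

```python
# Hypothetical toy data (not from any course dataset)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Closed-form least squares solution for y = ax + b
a = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
b = mean_y - a * mean_x


def sum_of_squares(slope, intercept):
    """Sum of squared gaps between each observed y and the line's prediction."""
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))


best = sum_of_squares(a, b)
print(f"a = {a:.2f}, b = {b:.2f}, sum of squares = {best:.4f}")

# Any other line gives a larger sum of squares
print(sum_of_squares(a + 0.1, b) > best)
print(sum_of_squares(a, b - 0.1) > best)
```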

## The assumptions behind linear regression 💡

When building an ML model, you should be aware of the assumptions you need to respect in order for your model to work well. Otherwise, you will have poor performance. Here are the assumptions of a simple linear regression model:

### Linearity

The first assumption is simple. You need your points to follow roughly a line. In other words, you need to make sure that your dependent variable follows a linear growth path as your independent variables increase.

### Homoscedasticity

Beyond the complexity of the word itself, it means that the spread (variance) of your points around the line must stay roughly the same across the whole range of your independent variable. If the points fan out, for example staying close to the line for small values of $x$ and scattering widely for large ones, it will be difficult to have a line that is representative of your whole dataset.

### Normality of the variables

The errors of your model, that is the gaps between your points and the line, should have a normal (or at least approximately normal) distribution. However, you will rarely have a perfectly normal distribution. A quick sanity check is to have a mean, median and mode that are not too far apart.

## Multiple linear regression 🧨

### Definition

### Plurality of independent variables

Most of the time, you will not have just one factor that will allow you to predict your dependent variable. For example, you can predict someone's salary with the number of years of experience, but you can also use the type of degree, the sector in which the person works, the gender, the country, and so on.

This is the only difference between simple and multiple linear regression. You add independent variables into the equation.

$$
y = b + \sum_{i=1}^{n}a_{i}x_{i}
$$
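Here is a minimal sketch of a multiple linear regression with two independent variables, solved by least squares with NumPy. The data are made up so that the true relationship is known, and the constant term is handled by adding a column of ones, as in the simple case. The variable names are hypothetical.

```python
import numpy as np

# Hypothetical predictors: years of experience, years of study
X = np.array([
    [1, 3],
    [2, 5],
    [3, 3],
    [5, 5],
    [8, 2],
    [10, 5],
], dtype=float)

# Made-up target built from a known rule: 20 + 2 * experience + 5 * study
y = 20 + 2 * X[:, 0] + 5 * X[:, 1]

# Add a column of ones so the model can learn the constant term
X_design = np.column_stack([np.ones(len(X)), X])

# Least squares solution: one constant plus one coefficient per variable
coefs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
constant, a_experience, a_study = coefs

print(f"constant = {constant:.2f}")
print(f"coefficient for experience = {a_experience:.2f}")
print(f"coefficient for study = {a_study:.2f}")
```

Because the target was generated without noise, the fit recovers the rule used to build it. Real data will never be this clean.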

### Hypotheses for multiple linear regression

#### Everything in simple linear regression + NO collinearity

As you can already imagine, multiple linear regressions follow the same assumptions as simple linear regressions because, after all, you just add a little complexity. The only thing you need to add to the assumptions is **non-collinearity** between independent variables.

For example, if you are trying to predict someone's salary based on their age and years of experience, you will run into a problem. Age and years of experience are very likely to be strongly related, since logically, the older you are, the more years of experience you have.

If you have collinearity in your model, the estimated coefficients become unstable and hard to interpret, because we will not be able to know which variable really influences your dependent variable.

### Dummy Variables

#### Reminder: Categorical variables

To understand what dummy variables are, let's first recall what categorical variables are. They are simply qualitative data. For example, think about countries, shoe sizes, etc. Each distinct value of such a variable forms its own category.

#### Encoding categorical variables

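In regressions, you cannot have text data as variables; only numbers are accepted. This is why categorical variables are encoded: each category gets its own column, called a dummy variable, which is 1 when the individual belongs to that category and 0 otherwise.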

#### The trap of dummy variables 🪤

Once you have encoded your dummy variables, you will not add them all to your equation, because that would create a collinearity problem: the value of any one dummy variable can be deduced from all the others (if every other dummy is 0, the remaining one must be 1). So what you will do is create all the dummy variables and then **remove 1 from your equation**.
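The encoding and the "remove one" rule can be sketched in plain Python as follows. The country values and the `country_` prefix are made up for the example. In practice, pandas offers `pandas.get_dummies(..., drop_first=True)` for the same purpose.

```python
# Hypothetical categorical variable
countries = ["France", "Spain", "Italy", "Spain", "France"]

# One category per distinct value, in a stable order
categories = sorted(set(countries))  # ['France', 'Italy', 'Spain']

# Drop the first category to avoid the dummy variable trap:
# a row with all zeros then means "France"
kept = categories[1:]

# One 0/1 column per remaining category
dummies = [
    {f"country_{c}": int(value == c) for c in kept}
    for value in countries
]

for value, row in zip(countries, dummies):
    print(value, row)
```

No information is lost by dropping a category: the dropped one is recognizable as the row where every remaining dummy is 0.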

## Evaluating a regression model

### R Squared

#### What's that?

R squared is an indicator that evaluates the performance of your linear regression model. It compares the squared errors of your model with those of a baseline that always predicts the mean of your dataset.

$$
R^{2} = 1 - \frac{\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}}{\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}
$$

Let's take a simple example. You've built your regression model that predicts someone's salary based on their years of experience. If you take this dataset and calculate the average income of all individuals ($\bar{y}$), you can technically say that you are making a prediction based on the average wage of your sample. This "model" is not as accurate as you might think, but that's what we're going to compare our true linear regression model to and see how much better it is.

The closer $R^{2}$ is to 1, the better your model. However, there is a problem with this indicator. The more independent variables you add, the closer you will get to 1, even if those variables are not at all predictive of your dependent variable. Taken at face value, this would mean that the more independent variables you add to your model, the better it is, regardless of their quality. But this is not the case. That's why we use the adjusted R squared.
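The formula translates directly into a few lines of Python. This is a minimal sketch on made-up salaries and predictions. The function name `r_squared` is our own, not a library call.

```python
def r_squared(y_true, y_pred):
    """R squared: 1 minus (model squared errors / mean-baseline squared errors)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical observed salaries (k€) and model predictions
salaries = [30, 35, 40, 45, 50]
model_predictions = [31, 34, 41, 44, 50]

# Baseline "model": always predict the average salary
average = sum(salaries) / len(salaries)
baseline_predictions = [average] * len(salaries)

print(f"model R squared:    {r_squared(salaries, model_predictions):.3f}")
print(f"baseline R squared: {r_squared(salaries, baseline_predictions):.3f}")
```

The baseline that always predicts the mean scores exactly 0, which is why $R^{2}$ can be read as "how much better than the mean" your model is.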

### Adjusted R Squared

#### Why is this indicator better?

With adjusted R squared, we add a penalty to our equation so that if a new independent variable is not a good predictor, adjusted R squared will decrease.

$$
adj R^{2} = 1-(1-R^{2})\frac{n-1}{n-p-1}
$$

- $p$ is the number of regressors (or independent variables).
- $n$ is the size of your sample.

In the equation, each time you add a regressor, the multiplier $\frac{n-1}{n-p-1}$ increases. So if $R^{2}$ stays at the same value, $adj R^{2}$ will decrease. Therefore, if you add a new regressor that doesn't make $R^{2}$ grow enough to counterbalance the increase of $\frac{n-1}{n-p-1}$, it means that the predictor is not good and doesn't improve your model.
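A short numerical sketch shows the penalty at work. The sample size and $R^{2}$ values below are invented for illustration: a fourth regressor raises $R^{2}$ only slightly, and adjusted $R^{2}$ goes down.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R squared for a sample of size n and p regressors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


n = 50  # hypothetical sample size

# Hypothetical model with 3 regressors
adj_3 = adjusted_r_squared(r2=0.800, n=n, p=3)

# Adding a 4th, weak regressor: R squared barely moves
adj_4 = adjusted_r_squared(r2=0.801, n=n, p=4)

print(f"3 regressors: adjusted R squared = {adj_3:.4f}")
print(f"4 regressors: adjusted R squared = {adj_4:.4f}")
print("the extra regressor helped" if adj_4 > adj_3 else "the extra regressor did not help")
```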

### Coefficients

#### Definition

$$
y = ax+b
$$

In this equation, $a$ is the coefficient (the slope) of your line. It tells you how much $y$ changes when $x$ increases by one unit: the larger its absolute value, the stronger the effect. In a regression, this represents the impact of your independent variable on your dependent variable.

#### Beware of the coefficient trap!!!

Careful with the coefficients. Comparing them only makes sense if your predictors are on the same scale. Suppose you have a coefficient of 1000 for a predictor $X_1$ and 10 000 for a predictor $X_2$, but the first is expressed in € and the second in k€. Reading the raw numbers, you would conclude that $X_2$ has the greater impact on your dependent variable. Once both are converted to €, however, the coefficient of $X_2$ becomes 10, so it is actually $X_1$ that has the greater impact.

## Resources 📚📚

- Introduction to Simple Linear Regression - [http://bit.ly/2DpD0XQ](http://bit.ly/2DpD0XQ)
- Statistics How to Simple Linear Regression - [http://bit.ly/2Dlh0JE](http://bit.ly/2Dlh0JE)
- Multiple Linear Regression Yale - [http://bit.ly/1QKpZGo](http://bit.ly/1QKpZGo)
- What is Multiple Linear Regression - [http://bit.ly/2DpYdAJ](http://bit.ly/2DpYdAJ)
- How do I interpret R Squared - [http://bit.ly/2pP83Eb](http://bit.ly/2pP83Eb)
- Adjusted R Squared - [http://bit.ly/2qqz55b](http://bit.ly/2qqz55b)