# Introduction
In this lesson, you will:

* Identify Regression Applications
* Learn How Regression Works
* Apply Regression to Problems Using Python

# Introduction to Machine Learning
**Machine Learning** is frequently split into **supervised** and **unsupervised** learning. Regression, which you will be learning about in this lesson (and its extensions in later lessons), is an example of supervised machine learning.

In supervised machine learning, you are interested in predicting a label for your data. Commonly, you might want to predict fraud, customers that will buy a product, or home values in an area.

In unsupervised machine learning, you are interested in clustering data together that isn't already labeled. This is covered in more detail in the [Machine Learning Engineer Nanodegree](https://www.udacity.com/course/machine-learning-engineer-nanodegree--nd009). However, we will not be going into the details of these algorithms in this course.

# Introduction to Linear Regression
In simple linear regression, we compare two quantitative variables to one another.

The **response** variable is what you want to predict, while the **explanatory** variable is the variable you use to predict the response. A common way to visualize the relationship between two variables in linear regression is using a scatterplot. You will see more on this in the concepts ahead.

# Scatter plots
Scatter plots are a common visual for comparing two quantitative variables. A common summary statistic that relates to a scatter plot is the **correlation coefficient**, commonly denoted by $r$.

Though there are a [few different](http://www.statisticssolutions.com/correlation-pearson-kendall-spearman/) ways to measure correlation between two variables, the most common way is with Pearson's correlation coefficient. [Pearson's correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) describes two properties of a **linear relationship**:

1. Strength
1. Direction

[Spearman's Correlation Coefficient](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient) does not measure linear relationships specifically. It is based on the ranks of the values, so it might be more appropriate when two variables are associated in a monotonic but non-linear way.
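As a rough illustration of the difference, the sketch below computes both coefficients in plain Python for a relationship that is monotonic but not linear. The helper names (`pearson_r`, `ranks`, `spearman_rho`) and the data are made up for this example, and the ranking helper assumes there are no tied values.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: strength and direction of a linear relationship."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return cross / (math.sqrt(ss_x) * math.sqrt(ss_y))

def ranks(values):
    """Rank values from 1 (smallest) upward. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

# Made-up data: y always increases with x, but along a curve (y = x^3)
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

pearson = pearson_r(x, y)
spearman = spearman_rho(x, y)
print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Because the ranks of `x` and `y` line up perfectly, the Spearman coefficient is 1, while the Pearson coefficient is lower because the points do not fall on a straight line.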

> Supporting Materials
>* [Different ways to measure correlation](https://video.udacity-data.com/topher/2019/November/5dcb372a_statistics-solutions/statistics-solutions.pdf)
>* [Pearson's correlation coefficient](https://video.udacity-data.com/topher/2019/November/5dcb373d_pearson-correlation-/pearson-correlation-.pdf)
>* [Spearman's Correlation Coefficient](https://video.udacity-data.com/topher/2019/November/5dcb374d_spearmans-coefficient-/spearmans-coefficient-.pdf)

# Correlation Coefficients
Correlation coefficients provide a measure of the **strength** and **direction** of a **linear** relationship.

We can tell the direction based on whether the correlation is positive or negative.

A rule of thumb for judging the strength:

![eqns](./assets/l14_8.png)

## Calculation of the Correlation Coefficient
$$
r = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}\sqrt{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}}
$$
It can also be calculated in Excel and other spreadsheet applications using **CORREL(col1, col2)**, where **col1** and **col2** are the two columns you are looking to compare to one another.
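The same calculation can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function name `pearson_r` and the data are made up, and the denominator uses the sums of squared deviations from each mean.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed term by term."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of the products of deviations from the means
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: square roots of the summed squared deviations
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return cross / (math.sqrt(ss_x) * math.sqrt(ss_y))

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(round(r, 4))  # 0.7746
```

The sign of `r` gives the direction of the relationship, and its distance from zero gives the strength.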

# What Defines A Line
A line is commonly identified by an **intercept** and a **slope**.

The **intercept** is defined as **the predicted value of the response when the x-variable is zero**.

The **slope** is defined as **the predicted change in the response for every one unit increase in the x-variable**.

We notate the line in linear regression in the following way:

$\hat{y} = b_0 + b_1 x_1$

where

$\hat{y}$ is the predicted value of the response from the line.

$b_0$ is the intercept.

$b_1$ is the slope.

$x_1$ is the explanatory variable.

$y$ is an actual response value for a data point in our dataset (not a prediction from our line).
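To make the notation concrete, here is a small sketch of a line as a Python function. The coefficient values and the data point are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def predict(b0, b1, x1):
    """Predicted response y-hat for explanatory value x1."""
    return b0 + b1 * x1

# Hypothetical intercept and slope
b0, b1 = 2.0, 0.5

# The intercept is the prediction when the x-variable is zero
at_zero = predict(b0, b1, 0)

# The slope is the change in the prediction for a one unit increase in x
change_per_unit = predict(b0, b1, 5) - predict(b0, b1, 4)

# An actual data point (x1 = 4, y = 4.5) usually does not sit exactly on the line
y = 4.5
y_hat = predict(b0, b1, 4)
residual = y - y_hat  # actual minus predicted

print(at_zero, change_per_unit, y_hat, residual)
```

The difference between $y$ and $\hat{y}$ for a data point is what the fitting procedure tries to keep small across the whole dataset.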

# Fitting A Regression Line
The main algorithm used to find the best fit line is called the **least-squares** algorithm, which finds the line that minimizes the sum of squared differences between the actual and predicted responses, $\sum_{i=1}^{n}\left(y_i - \hat{y_i}\right)^2$.

There are other ways we might choose a "best" line, but this algorithm tends to do a good job in many scenarios.
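As a sketch of what "minimizes" means here, the code below scores a few candidate lines by their sum of squared errors. The data and the candidate lines are made up; the pair `(2.2, 0.6)` is the least-squares intercept and slope for this particular toy dataset.

```python
def sse(b0, b1, xs, ys):
    """Sum of squared differences between actual and predicted responses."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Candidate (intercept, slope) pairs
candidates = {
    "flat line at the mean of y": (4.0, 0.0),
    "steeper guess": (1.0, 1.0),
    "least-squares line": (2.2, 0.6),
}

scores = {name: sse(b0, b1, x, y) for name, (b0, b1) in candidates.items()}
for name, score in scores.items():
    print(f"{name}: SSE = {score:.2f}")

best = min(scores, key=scores.get)
```

Any other intercept and slope you try on this data should give a sum of squared errors at least as large as the least-squares line's.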

# The Regression Closed Form Solution
## How Do We Determine The Line of Best Fit?
You saw in the last video that, in regression, we are interested in minimizing the following function:

$$
\sum_{i=1}^{n}\left(y_i - \hat{y_i}\right)^2
$$
It turns out that there are closed-form equations that give the intercept and slope that minimize this function.

Suppose you have a set of points like the values in the image here:
![table](./assets/l14_14.png)

To find the slope and intercept, we need to compute the following:

$\bar{x} = \frac{1}{n}\sum{x_i}$

$\bar{y} = \frac{1}{n}\sum{y_i}$

$s_y = \sqrt{\frac{1}{n-1}\sum{\left(y_i - \bar{y}\right)^2}}$

$s_x = \sqrt{\frac{1}{n-1}\sum{\left(x_i - \bar{x}\right)^2}}$

$r = \frac{\sum_{i=1}^n{\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}}{\sqrt{\sum{\left(x_i - \bar{x}\right)^2}}\sqrt{\sum{\left(y_i - \bar{y}\right)^2}}}$

$b_1 = r \frac{s_y}{s_x}$

$b_0 = \bar{y} - b_1 \bar{x}$
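These steps can be sketched in plain Python as follows. The function name `fit_line` and the data are made up for illustration (they are not the values from the image), and the standard deviations use squared deviations with an $n-1$ divisor.

```python
import math

def fit_line(xs, ys):
    """Closed-form simple linear regression. Returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # Sums of squared deviations and of cross products
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    # Sample standard deviations
    s_x = math.sqrt(ss_x / (n - 1))
    s_y = math.sqrt(ss_y / (n - 1))

    # Correlation coefficient
    r = cross / (math.sqrt(ss_x) * math.sqrt(ss_y))

    # Slope, then intercept
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")  # intercept = 2.20, slope = 0.60
```

Note that the fitted line always passes through the point $(\bar{x}, \bar{y})$, which is what the intercept formula enforces.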

## But Before You Get Carried Away...
Though you are now totally capable of carrying out these steps, in the age of computers it doesn't really make sense to do this all by hand. Instead, letting the computer do the arithmetic allows us to focus on interpreting and acting on the output. If you want to see a step-by-step walkthrough of this in Excel, you can find that [here](https://www.youtube.com/watch?v=zPG4NjIkCjc). In the rest of this lesson, you will get some practice with this in Python.

# Fitting A Regression Line in Python
[Here](https://stats.stackexchange.com/questions/7948/when-is-it-ok-to-remove-the-intercept-in-a-linear-regression-model) is a post on the need for an intercept in nearly all cases of regression. Again, there are very few cases where you do not need to include an intercept in a linear model.
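One way to see why the intercept matters is to compare a line that is forced through the origin with a line that is allowed an intercept. The sketch below does this in plain Python on made-up data; the function names are hypothetical, and this is only an illustration of the idea rather than the library workflow used in the lesson.

```python
def fit_with_intercept(xs, ys):
    """Least-squares intercept and slope."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b1 * x_bar, b1

def fit_through_origin(xs, ys):
    """Least-squares slope when the intercept is forced to be zero."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def sse(b0, b1, xs, ys):
    """Sum of squared errors for the line b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_with_intercept(x, y)
slope_only = fit_through_origin(x, y)

sse_with = sse(b0, b1, x, y)
sse_without = sse(0.0, slope_only, x, y)
print(f"with intercept:    slope = {b1:.2f}, SSE = {sse_with:.2f}")
print(f"without intercept: slope = {slope_only:.2f}, SSE = {sse_without:.2f}")
```

On this data, dropping the intercept both changes the slope noticeably and leaves a larger sum of squared errors, because the line is no longer free to pass through the center of the points.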

# How to Interpret the Results
We can perform hypothesis tests for the coefficients in our linear models using Python (and other software). These tests help us determine if there is a statistically significant linear relationship between a particular variable and the response. The hypothesis test for the intercept isn't useful in most cases.

However, the hypothesis test for each x-variable tests whether the population slope is equal to zero against the alternative that it differs from zero. Therefore, if we reject the null hypothesis (usually because the p-value is small), we have evidence that the x-variable attached to that coefficient has a statistically significant linear relationship with the response. This in turn suggests that the x-variable should help us in predicting the response (or at least be better than not having it in the model).
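For a sense of what the software computes, the sketch below builds the test statistic for the slope by hand: the estimated slope divided by its standard error. The function name and data are made up, and the final step of turning the statistic into a p-value (using a t-distribution with $n-2$ degrees of freedom) is left to statistical software.

```python
import math

def slope_t_statistic(xs, ys):
    """Return (b1, standard error of b1, t statistic) for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)

    # Least-squares slope and intercept
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ss_x
    b0 = y_bar - b1 * x_bar

    # Residual variance estimate uses n - 2 degrees of freedom
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    se_b1 = math.sqrt(sse / (n - 2) / ss_x)

    # Test statistic for the null hypothesis that the population slope is zero
    return b1, se_b1, b1 / se_b1

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b1, se_b1, t_stat = slope_t_statistic(x, y)
print(f"slope = {b1:.2f}, std. error = {se_b1:.4f}, t = {t_stat:.4f}")
```

The larger the statistic is in absolute value, the stronger the evidence against a population slope of zero.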

# Does the Line Fit the Data Well
In simple linear regression, the **Rsquared** value is the square of the correlation coefficient.

A common definition for the Rsquared value is that it is the proportion of variability in the response variable that can be explained by the x-variable in our model. In general, the closer this value is to 1, the better our model fits the data.
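Both descriptions can be checked numerically. The sketch below, on made-up data, computes Rsquared as one minus the ratio of unexplained variability to total variability, and compares it with the squared correlation coefficient. The variable names are chosen for this example only.

```python
import math

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_y = sum((yi - y_bar) ** 2 for yi in y)  # total variability in the response
cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least-squares line
b1 = cross / ss_x
b0 = y_bar - b1 * x_bar

# Variability left unexplained by the line
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = 1 - sse / ss_y
r = cross / (math.sqrt(ss_x) * math.sqrt(ss_y))

print(f"R-squared from variability: {r_squared:.3f}")
print(f"Correlation squared:        {r ** 2:.3f}")
```

For a simple linear regression fit by least squares, the two printed values agree.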

Many feel that Rsquared isn't a great measure (which is possibly true), but I would argue that using cross-validation can assist us with validating any measure that helps us understand the fit of a model to our data. [Here](http://data.library.virginia.edu/is-r-squared-useless/), you can find one such argument explaining why one individual doesn't care for Rsquared.

# Recap + Next Steps
## Recap
In this lesson, you learned about simple linear regression. The topics in this lesson included:

1. Simple linear regression is about building a line that models the relationship between two quantitative variables.


2. Correlation coefficients are a measure that can inform you about the **strength** and **direction** of a linear relationship.


3. The most common way to visualize simple linear regression is using a scatterplot.


4. A line is defined by an intercept and slope, which you found using the **statsmodels** library in Python.


5. You learned the interpretations for the slope, intercept, and R-squared values.


## Up Next
In the next lesson, you will extend your knowledge from simple linear regression to multiple linear regression.