# Linear Regression

### 1. Import the needed libraries

__`Step 1`__ - Always start by importing the libraries you need. In this case, we are going to import:
- pandas as pd
- numpy as np
- LinearRegression from sklearn.linear_model
- train_test_split from sklearn.model_selection
- matplotlib.pyplot as plt

In [None]:
import pandas as pd 
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

### 2. Import the dataset

Since we are dealing with linear regression, we are going to work with a dataset where the target is continuous. <br>
__`Step 2`__ - Import the dataset __Boston.csv__ using pandas and assign it to an object named __data__

In [None]:
data = pd.read_csv('Boston.csv')
data.head()

### 3. Explore the dataset

The next step is to explore our data. While this is not the focus of this class, we are going to check whether there are missing values and which data types we have.

__`Step 3`__ - Call the method __info()__ in your data. <br>
This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage. <br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html?highlight=info#pandas.DataFrame.info

In [None]:
data.info()

By calling the __info()__ method, we can verify that we don't have missing values and all data is numerical, so there is no need to deal with missing data or create dummies. <br>
We are ready to apply linear regression to our dataset! But first, since we want to evaluate the performance of our model, we need to split our dataset into training and validation sets. Since we only have 505 observations, we are not going to create a test dataset.
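The two checks read from __info()__ can also be made explicitly. The following is a minimal sketch on a small made-up DataFrame: the name `toy` and its values are hypothetical and only stand in for `data`, with one missing value added on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for `data`; one value is missing on purpose.
toy = pd.DataFrame({
    "rm": [6.5, 6.4, np.nan, 7.0],
    "age": [65.2, 78.9, 61.1, 45.8],
    "medv": [24.0, 21.6, 34.7, 33.4],
})

# Number of missing values per column (all zeros would mean nothing to impute).
missing_per_column = toy.isna().sum()
print(missing_per_column)

# True only if every column has a numeric dtype (so no dummies are needed).
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in toy.dtypes)
print(all_numeric)
```

Applied to the real dataset, the same two expressions would use `data` in place of `toy`.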

### 4. Data partition

__`Step 4`__ - Using the function __train_test_split()__, split your dataset into train (70%) and validation (30%). Remember to define first which are your independent variables and which is your target/dependent variable. <br>

- Define as __X__ the independent variables and __y__ the dependent variable (last column - 'medv')
- Divide the __X__ into __X_train__ and __X_val__, the __y__ into __y_train__ and __y_val__, and define the following arguments: __test_size = 0.3__, __random_state = 15__ 

In [None]:
X = data.drop(columns=['medv'])
y = data['medv']

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=15)
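A quick way to confirm that a split behaves as expected is to look at the resulting shapes. This is a minimal sketch on randomly generated data; the names `X_demo`, `y_demo`, `X_tr`, `X_va`, `y_tr` and `y_va` are hypothetical and chosen so that they do not overwrite the objects created above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 observations with 3 features each.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# Same arguments as in Step 4: 30% for validation, fixed random_state.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=15
)

# 70 rows go to training and 30 rows to validation.
print(X_tr.shape, X_va.shape)

# Fixing random_state makes the split reproducible.
X_tr_again, _, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=15
)
print(np.array_equal(X_tr, X_tr_again))
```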

### 5. Apply linear regression

__`Step 5`__ - Create an instance of LinearRegression named lin_model with the default parameters.

In [None]:
lin_model = LinearRegression()

<div class="alert alert-block alert-success">
    <b><h3>Methods in LinearRegression()</h3></b><br>
</div>

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html'>sklearn.linear_model.LinearRegression().fit(X,y,...)</a>

__Definition:__ <br>
Fit the linear model to the training data.

__Parameters:__ <br>
X : The regressors in my training dataset; <br>
y : The target in my training dataset; <br>
...
</div>

__`Step 6`__ - Fit your model to the training data by passing __X = X_train__ and __y = y_train__

In [None]:
lin_model.fit(X_train, y_train)

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html'>sklearn.linear_model.LinearRegression().predict(X)</a>

__Definition:__ <br>
Predict using the linear model.

__Parameters:__ <br>
X : Samples to predict; <br>
...

</div>

__`Step 7`__ - Predict the values for __X_val__ by applying the method __predict()__ to your model and check your result

In [None]:
predictions = lin_model.predict(X_val)
predictions

These are the values predicted for your validation dataset by the model fitted previously on the train data.

<div class="alert alert-block alert-success">
    <b><h3>Attributes in LinearRegression()</h3></b><br>
</div>

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html'>sklearn.linear_model.LinearRegression().coef_</a>

__Definition:__ <br>
Estimated coefficients of the features in the linear model.

</div>

__`Step 8`__ - To check the coefficients calculated by applying the linear regression, call the attribute __coef___ associated to your model

In [None]:
lin_model.coef_

The result is an array with all the coefficients. To see which variable each coefficient belongs to, let's convert the result to a DataFrame and use the variable names as its index.

__`Step 9`__ - Create a dataframe that will contain the values of the coefficients

In [None]:
df = pd.DataFrame(lin_model.coef_)
df

__`Step 10`__ - By using the method __set_index()__, set the index of the DataFrame to the names of the variables

In [None]:
df = df.set_index(X_train.columns)
df
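Steps 9 and 10 can also be combined by passing the labels when the DataFrame is created. The sketch below uses made-up coefficient values and feature names (`coefs` and `names` are hypothetical, not results from the model above) and also sorts the table by absolute size.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients and feature names, for illustration only.
coefs = np.array([0.5, -1.2, 3.0])
names = ["crim", "rm", "lstat"]

# Build the labelled table in one step: one row per feature.
coef_table = pd.DataFrame({"coefficient": coefs}, index=names)

# Order the rows by the absolute size of the coefficient, largest first.
order = coef_table["coefficient"].abs().sort_values(ascending=False).index
coef_table = coef_table.loc[order]
print(coef_table)
```

Keep in mind that the size of a coefficient depends on the scale of its feature, so this ordering is only a rough guide unless the features are on comparable scales.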

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html'>sklearn.linear_model.LinearRegression().intercept_</a>

__Definition:__ <br>
Independent term in the linear model.

</div>

__`Step 11`__ - To obtain the intercept of the linear regression, call the attribute __intercept___ associated to your model

In [None]:
# The intercept (often labeled the constant) is the expected mean value of Y when all X=0.
lin_model.intercept_
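Together, the two attributes fully describe the fitted model: a prediction is the intercept plus the sum of each feature multiplied by its coefficient. The sketch below checks this on randomly generated data; `X_demo`, `y_demo` and `demo_model` are hypothetical names used only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from y = 3 + 2*x1 - 1*x2 plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=50)

demo_model = LinearRegression().fit(X_demo, y_demo)

# Rebuild the predictions by hand: intercept + X · coefficients.
manual = demo_model.intercept_ + X_demo @ demo_model.coef_

# The manual result should match what predict() returns.
same = np.allclose(manual, demo_model.predict(X_demo))
print(same)
print(demo_model.intercept_, demo_model.coef_)
```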

<div class="alert alert-block alert-warning">
<h1><center>Calculate the p-values</center></h1>
</div>

https://regressors.readthedocs.io/en/latest/_modules/regressors/stats.html

The library sklearn does not offer a built-in way to calculate the standard error, the t-value and the p-value associated with each coefficient. <br>

One alternative is to use the library regressors.

__`Step 12`__ - Install the library regressors

In [None]:
import sys
!{sys.executable} -m pip install regressors

Now we are able to use the library __regressors__ <br>
__`Step 13`__ - Import __stats__ from regressors

In [None]:
from regressors import stats

__`Step 14`__ - Create a new object named __xlabels__ that will contain the name of the columns in __X_train__

In [None]:
xlabels = X_train.columns

__`Step 15`__ - From stats, call the function __summary()__ with the following arguments:
- __clf = lin_model__ : The linear model created previously <br>
- __X = X_train__ :  The training data used to fit the model <br>
- __y = y_train__ : The target training values <br>
- __xlabels = xlabels__ :  The labels for the predictors <br>


In [None]:
stats.summary(clf = lin_model, X = X_train, y = y_train, xlabels = xlabels)

The summary table presents, in a readable format, all the values needed to interpret our model: the distribution of the residuals, the coefficients with the t-value and the p-value for each of them, and an evaluation of the model using the metrics R-Squared, Adjusted R-Squared and F-statistic. That evaluation, however, is based on the performance of the model on the training dataset. <br>

But in the last class, we saw how to calculate the R-Squared and the Adjusted R-Squared for our validation dataset by using __sklearn__.
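As a reminder, the sketch below computes both metrics on a validation set. It assumes the usual adjusted R-Squared formula, $1 - (1 - R^{2})\frac{n - 1}{n - p - 1}$, with $n$ validation observations and $p$ features; the data and the names `X_demo`, `y_demo`, `demo_model`, `r2_val` and `adj_r2_val` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical data: 200 observations, 3 features, known linear relationship.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 1.0 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=200)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=15
)
demo_model = LinearRegression().fit(X_tr, y_tr)

# R-Squared on the validation data.
r2_val = r2_score(y_va, demo_model.predict(X_va))

# Adjusted R-Squared penalises the number of features p.
n_val, p = X_va.shape
adj_r2_val = 1 - (1 - r2_val) * (n_val - 1) / (n_val - p - 1)

print(r2_val, adj_r2_val)
```

The method `score()` of a fitted LinearRegression returns the same R-Squared as `r2_score`, so `demo_model.score(X_va, y_va)` is an equivalent shortcut for the first metric.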

__The p-value__ <br>
For each estimated regression coefficient, the p-value is the probability of obtaining an estimate at least as far from zero as the one observed if the true coefficient were zero. A small p-value means the observed estimate would be very unlikely under that assumption, which is evidence that the feature has a relationship with the dependent variable. <br> In this way, we can use the p-values as one guide to select the most relevant variables for our final model, keeping in mind that a p-value measures statistical significance, not the size of the effect.
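To make the definition concrete, the standard errors, t-values and p-values can be computed by hand with the textbook ordinary least squares formulas. This is a sketch under those assumptions and not the implementation used by __regressors__, so its numbers may differ slightly from the summary table. The data and all names (`X_demo`, `demo_model`, `design`, `sp_stats`, ...) are hypothetical.

```python
import numpy as np
from scipy import stats as sp_stats
from sklearn.linear_model import LinearRegression

# Hypothetical data: only the first feature influences the target.
rng = np.random.default_rng(0)
n = 200
X_demo = rng.normal(size=(n, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] + rng.normal(size=n)

demo_model = LinearRegression().fit(X_demo, y_demo)

# Design matrix with a leading column of ones for the intercept.
design = np.column_stack([np.ones(n), X_demo])
beta = np.concatenate([[demo_model.intercept_], demo_model.coef_])

# Residual variance, using n minus the number of estimated parameters.
residuals = y_demo - design @ beta
dof = n - design.shape[1]
sigma2 = residuals @ residuals / dof

# Standard errors from the diagonal of sigma^2 * (X'X)^-1.
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(design.T @ design)))

# t-values and two-sided p-values from the Student t distribution.
t_values = beta / std_err
p_values = 2 * sp_stats.t.sf(np.abs(t_values), dof)

print(p_values)  # order: intercept, feature 1, feature 2
```

The p-value of the first feature should be extremely small, since the target was generated from it, while the second feature has no true relationship with the target.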




>><font color='Orange'> __Practice__ </font>

It's now time to build a simple linear regression step by step. To calculate the coefficient we are going to use the formula:

$$\beta _{1} = \frac{\sum ( x_{i}-\bar{x})( y_{i}-\bar{y})}{\sum ( x_{i}-\bar{x})^{2}}$$

And the intercept is going to be calculated using the formula:
$$\beta _{0} = \bar{y} - \beta _{1}\bar{x} $$

You are going to work with the following dataset:

In [None]:
houses = pd.DataFrame({'m^2':[16,15,28,14,22,13],'Price':[360,340,664,330,560,380]})
houses

__`Step 16`__: Calculate the regression equation for this dataset step by step and predict the price of a house with $19m^{2}$

__`16.1.`__ Calculate the mean of the values in your X and assign it to the object __mean_m2__. In the same way, calculate the mean of your target and assign it to the object __mean_price__

In [None]:
mean_m2 = houses["m^2"].mean()
mean_m2

In [None]:
mean_price = houses["Price"].mean()
mean_price

__`16.2`__ Create a new column in your dataset 'houses' named as __xi-x_mean__ that will contain $( x_{i}-\bar{x})$

In [None]:
houses['xi-x_mean'] = houses["m^2"] - mean_m2
houses

__`16.3`__ Create a new column in your dataset 'houses' named as __yi-y_mean__ that will contain $( y_{i}-\bar{y})$

In [None]:
houses['yi-y_mean'] = houses["Price"] - mean_price
houses

__`16.4`__ Create a new column in your dataset 'houses' named as __square(xi-x_mean)__ that will be equal to $(x_{i}-\bar{x})^{2}$

In [None]:
houses['square(xi-x_mean)'] = houses['xi-x_mean']**2
houses

__`16.5`__ Create a new column in your dataset 'houses' named as __(xi-x_mean)(yi-y_mean)__ that will be equal to $( x_{i}-\bar{x})( y_{i}-\bar{y})$

In [None]:
houses['(xi-x_mean)(yi-y_mean)'] = houses['xi-x_mean']*houses['yi-y_mean']
houses

__`16.6`__ Calculate the coefficient of 'm^2' by using the formula below and assign it to the object __beta1__ <br> <br>
$$\beta _{1} = \frac{\sum ( x_{i}-\bar{x})( y_{i}-\bar{y})}{\sum ( x_{i}-\bar{x})^{2}}$$

In [None]:
beta1 = houses['(xi-x_mean)(yi-y_mean)'].sum()/houses['square(xi-x_mean)'].sum()
beta1

__`16.7`__ Calculate the intercept and name it as __beta0__ by using the formula <br><br>

$$\beta _{0} = \bar{y} - \beta _{1}\bar{x} $$

In [None]:
beta0 = mean_price - beta1*mean_m2
beta0

__`16.8`__ Predict the price of a house with $19m^{2}$

In [None]:
prediction = beta0 + beta1*19
prediction
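A hand calculation like this one can be cross-checked against a library fit. The sketch below repeats the closed-form formulas with NumPy on the same six houses and compares the result with `np.polyfit`; the names `m2`, `price`, `slope`, `intercept`, `fit_slope` and `fit_intercept` are hypothetical and used only here.

```python
import numpy as np

# Same values as in the `houses` dataset.
m2 = np.array([16, 15, 28, 14, 22, 13], dtype=float)
price = np.array([360, 340, 664, 330, 560, 380], dtype=float)

# Closed-form slope and intercept, following the two formulas above.
slope = np.sum((m2 - m2.mean()) * (price - price.mean())) / np.sum((m2 - m2.mean()) ** 2)
intercept = price.mean() - slope * m2.mean()

# With deg=1, np.polyfit returns the coefficients as [slope, intercept].
fit_slope, fit_intercept = np.polyfit(m2, price, deg=1)

print(slope, intercept)
print(np.isclose(slope, fit_slope), np.isclose(intercept, fit_intercept))

# Predicted price for a house with 19 square metres.
print(intercept + slope * 19)
```

The slope is 3920/170, about 23.06, the intercept is about 23.94, and the predicted price for 19 square metres is about 462.06.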

### Plot your regression!

In [None]:
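# Use separate names so the X and y defined for the Boston data are not overwritten
x_houses = houses['m^2']
y_houses = houses['Price']

# Fitted values from the regression equation, computed for all observations at once
fitted = beta0 + beta1 * x_houses

fig = plt.figure()
plt.plot(x_houses, y_houses, 'r.', markersize=12, label='Observed')  # data points
plt.plot(x_houses, fitted, 'b-', label='Fitted line')  # regression line
plt.xlabel('m^2')
plt.ylabel('Price')
plt.legend()
plt.show()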