# Intro To Data Science With Linear Regression

## What Is Linear Regression? 

In this exercise, you'll use the Linear Regression model from Scikit-Learn to predict housing prices in Boston.

Linear regression is the fundamental building block of data science and analytics. If you ever venture into data science, this will most likely be the first model you're taught.


Linear regression models are very simple, interpretable, and somewhat flexible. The goal is to predict a continuous output variable (e.g. MPG, prices, etc.) from a set of predictor variables, known as features.
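To make the idea of predicting a continuous output from features concrete, here is a minimal sketch on made-up data. The numbers, the variable names, and the use of NumPy's least-squares solver are assumptions chosen for illustration; the rest of this notebook uses Scikit-Learn instead.

```python
import numpy as np

# Synthetic data (invented for illustration): 50 rows, 2 features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
true_intercept = 4.0
true_weights = np.array([3.0, -1.5])
y = true_intercept + X @ true_weights + rng.normal(0, 0.1, size=50)

# Add a column of ones so the intercept is estimated alongside the weights.
X_design = np.column_stack([np.ones(len(X)), X])
params, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, weights = params[0], params[1:]

# A prediction is the intercept plus a weighted sum of the feature values.
new_row = np.array([5.0, 2.0])
prediction = intercept + new_row @ weights
print(intercept, weights, prediction)
```

The fitted intercept and weights should land close to the values used to generate the data, which is what makes the model easy to interpret.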


In business, you'll almost always try linear regression before moving to more advanced models, such as gradient boosting machines (GBM), random forests, or neural networks.

## Getting Started & Preprocessing

First, import the necessary libraries to run the notebook. Press `Shift + Enter` to run the cell below.

In [None]:
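# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import requires an older scikit-learn version.
from sklearn.datasets import load_boston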
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from matplotlib import cm as cm
import seaborn as sns
from matplotlib.colors import ListedColormap

The Boston housing dataset ships with Scikit-Learn, so there is nothing extra to download.

The goal of this exercise is to predict the housing price from the other columns (features) in the dataset.

Load the Boston housing data with the line below.

`boston = load_boston()`

Next, separate the data into the features and target using the following code:

`y = boston.target`

`boston = pd.DataFrame(boston.data)`

Display the Boston dataset using the following code. The `head` method returns the first 5 rows of your data.

`boston.head()`

The columns don't have any labels! This happens with some datasets. Assuming you have a data dictionary, you can label the columns. For the time being, add this line into the cell below, and call the `head` method on the DataFrame again.

Refer to the `data_dictionary.pdf` document to see what each column name refers to.


`boston.columns = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'black', 'lstat']`

`boston.head()`
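Column names can also be supplied when the DataFrame is created, rather than assigned afterwards. The sketch below uses a tiny made-up array and only two of the column names from above, purely for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for two feature columns.
raw = np.array([[0.1, 6.5], [0.2, 5.9], [0.3, 7.1]])

# Without names, pandas labels the columns 0, 1, ...
unlabeled = pd.DataFrame(raw)

# Passing `columns` labels them at construction time.
labeled = pd.DataFrame(raw, columns=['crim', 'rm'])

print(list(unlabeled.columns))
print(list(labeled.columns))
```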

Now that the data is labeled, we have a better sense of what each column means.

To reiterate, we'll be predicting the housing prices using all of these columns (features). 

## Plotting Correlations

Now that the data is in the right format, we can plot a correlation matrix. This shows us which features are correlated with each other.

For reference, a value of -1 is a perfect negative correlation, 0 is uncorrelated, and 1 is a perfect positive correlation. Run the line below to look at the numbers.



`boston.corr()`
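As a quick reference for reading these numbers, Pearson correlation (the default for `DataFrame.corr`) ranges from -1 to 1. The sketch below uses small made-up arrays, not the Boston data, to show the two extremes and the middle of that range.

```python
import numpy as np

x = np.arange(10, dtype=float)
same_direction = 2 * x + 1        # rises exactly in step with x
opposite_direction = -3 * x + 5   # falls exactly as x rises
symmetric = (x - 4.5) ** 2        # related to x, but with no linear trend

r_pos = np.corrcoef(x, same_direction)[0, 1]       # perfect positive
r_neg = np.corrcoef(x, opposite_direction)[0, 1]   # perfect negative
r_zero = np.corrcoef(x, symmetric)[0, 1]           # linearly uncorrelated

print(r_pos, r_neg, r_zero)
```

Note that the last pair is clearly related, yet its correlation is essentially zero: the coefficient only measures linear association.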

We have the numbers from the correlation matrix, but it's not as easy to view or interpret as a plot.

To see correlations plotted by color, run the `correlation_matrix_plot` function below.

Examine the correlations in the lower triangle, then answer the questions below.

In [None]:
def correlation_matrix_plot(n_top_features, df):
    feats = n_top_features
    corr = df[list(feats)].corr()
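    # Mask the upper triangle (and diagonal) so only the lower triangle is drawn.
    # Use the built-in bool: the np.bool alias was removed in NumPy 1.24.
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True
    labels = corr.where(np.triu(np.ones(corr.shape)).astype(bool))
    labels = labels.round(2)
    labels = labels.replace(np.nan,' ', regex=True)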

    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(9,9))

    # Use the 'jet' colormap, resampled to 30 discrete colors
    cmap = plt.get_cmap('jet', 30)
    # Draw the heatmap with the mask and correct aspect ratio
    ax = sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, vmin=-1, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5})
    plt.tight_layout()
    plt.show()
    
    
correlation_matrix_plot(boston.columns,boston)

## Correlation Matrix Questions

Looking at the plot above: 

1. Which features are strongly correlated (positively or negatively)?
2. Which features show little or no correlation?



## Building the Linear Regression Model 

Now that the data is in the right format, we can begin to build the linear regression model.

First, we're going to split the data. In data science, your data is typically split into two datasets.

The first dataset is the *training* set. Building a model is referred to as "training", hence the name "training" set. The second dataset is the *test* set. The model never sees it during training, so we use it to make predictions and evaluate whether the model performs well on new data.

To split the data into training and test data sets, type the following line.

`X_train, X_test, y_train, y_test = train_test_split(boston, y, test_size=0.20, random_state=42)`
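To see what this call does, the sketch below runs `train_test_split` on a small made-up array rather than the Boston data. The names `X_demo` and `y_demo` are placeholders invented for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 made-up rows with 2 features each; row i has target i.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

# 20% of 10 rows go to the test set; the rest are used for training.
print(X_tr.shape, X_te.shape)

# Features and targets are shuffled together, so rows stay aligned.
print(X_te[:, 0] // 2, y_te)

# Reusing the same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)
```

Fixing `random_state` makes the split repeatable, which helps when comparing results between runs.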

With the data split, we'll now create the LinearRegression model. Write the line in the cell below:

`model = LinearRegression()`

You're now ready to train the model. Write and run the following line:

`model.fit(X_train, y_train)`

## Predict and Score Model 

Now that the model is trained, we can predict new values using the test set. Write the following code to predict the housing prices.

`predictions = model.predict(X_test)`

Next, we'll look at the coefficients for our model. Coefficients describe the mathematical relationship between each independent variable (feature) and the target variable.

The sign of a regression coefficient tells you whether there is a positive or negative relationship between each independent variable and the dependent variable. A positive coefficient indicates that as the value of the independent variable increases, the mean of the dependent variable also tends to increase. A negative coefficient suggests that as the independent variable increases, the dependent variable tends to decrease.

The coefficient value signifies how much the mean of the dependent variable changes given a one-unit shift in the independent variable while holding other variables in the model constant. This property of holding the other variables constant is crucial because it allows you to assess the effect of each variable in isolation from the others.

`coefficients = pd.DataFrame(model.coef_, boston.columns).sort_values(by=0, ascending=False)`

`print(coefficients)`
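The "one-unit shift while holding other variables constant" interpretation can be checked on a toy model. The sketch below uses made-up, noise-free data with two hypothetical features (`rooms` and `pollution`), not the Boston data, so the fitted coefficients should match the values used to generate the target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up, noise-free data: price = 10 + 5 * rooms - 8 * pollution.
rng = np.random.default_rng(1)
rooms = rng.uniform(3, 9, size=200)
pollution = rng.uniform(0, 1, size=200)
price = 10 + 5 * rooms - 8 * pollution

X_toy = np.column_stack([rooms, pollution])
toy_model = LinearRegression().fit(X_toy, price)

# Increase `rooms` by one unit while holding `pollution` fixed.
base = np.array([[6.0, 0.5]])
one_more_room = np.array([[7.0, 0.5]])
change = toy_model.predict(one_more_room)[0] - toy_model.predict(base)[0]

# The change in the prediction equals the coefficient for `rooms`.
print(toy_model.coef_, change)
```

The positive coefficient on `rooms` and the negative one on `pollution` mirror the sign interpretation described above.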

Finally, to gain an understanding of how our model is performing, we'll score the model against three metrics: R squared, mean squared error, and mean absolute error. Write the following lines of code to get your output.

`print("R Squared Score: ", r2_score(y_test, predictions))`

`print("Mean Squared Error: ", mean_squared_error(y_test, predictions))`

`print("Mean Absolute Error: ", mean_absolute_error(y_test, predictions))`
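As a sanity check on what these functions compute, the sketch below calculates the three metrics by hand with NumPy on four made-up values and compares them with Scikit-Learn's results. The arrays `y_true` and `y_pred` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Made-up actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])

errors = y_true - y_pred

# Mean squared error: average of the squared errors.
mse = np.mean(errors ** 2)

# Mean absolute error: average of the absolute errors.
mae = np.mean(np.abs(errors))

# R squared: 1 minus (squared error of the model) divided by
# (squared error of always predicting the mean of y_true).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, mean_squared_error(y_true, y_pred))
print(mae, mean_absolute_error(y_true, y_pred))
print(r2, r2_score(y_true, y_pred))
```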

## Questions

1. Google R Squared, Mean Squared Error, and Mean Absolute Error. What do these metrics mean? What are the numbers telling you?
2. What do you think could improve the model?
3. What features do you think are not useful to the model?

## Sources

Statistics By Jim - http://statisticsbyjim.com/regression/interpret-coefficients-p-values-regression/