# Linear regression in scikit-learn

Scikit-learn is a machine learning library for Python. It includes implementations of many machine learning algorithms, such as clustering, classification, and regression algorithms. The scikit-learn documentation (http://scikit-learn.org/stable/index.html) gives an overview of all the algorithms available in the library.

Before we implement our first linear regression model, we will introduce a new
dataset, the Housing Dataset, which contains information about houses in the
suburbs of Boston collected by D. Harrison and D.L. Rubinfeld in 1978. The Housing
Dataset has been made freely available and can be downloaded from the UCI machine
learning repository at https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

Specifically, your goal will be to use this data to make predictions of the output given the input features.

You will predict the median value of owner-occupied houses ('MEDV') from features such as the average number of rooms per dwelling. Since the target variable here is quantitative, this is a regression problem.

* Import numpy and pandas under their standard aliases (np and pd).
* Read the file 'housing.csv' into a DataFrame df using the read_csv() function.
* Using the metadata file 'housing.names', attach the corresponding name to each column.
* Check the result using the method df.head().
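
The steps above can be sketched as follows. This is a minimal sketch, not a definitive loader: it assumes that 'housing.csv' has no header row and is whitespace-delimited like the UCI file, and the helper name `load_housing` is hypothetical. The two sample rows are only there so that the snippet runs without the file.

```python
import io

import numpy as np
import pandas as pd

# Column names as listed in the 'housing.names' metadata file.
COLUMN_NAMES = [
    "CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
    "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV",
]


def load_housing(source, sep=r"\s+"):
    """Read the housing data and attach the column names.

    Assumption: the file has no header row and is whitespace-delimited.
    Pass sep="," if your copy is comma-separated.
    """
    return pd.read_csv(source, header=None, sep=sep, names=COLUMN_NAMES)


# Two illustrative rows in the assumed format, used here so the snippet runs
# without the file; in the notebook, call load_housing("housing.csv") instead.
sample = io.StringIO(
    "0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0900 1 296.0 15.30 396.90 4.98 24.00\n"
    "0.02731 0.00 7.070 0 0.4690 6.4210 78.90 4.9671 2 242.0 17.80 396.90 9.14 21.60\n"
)
df = load_housing(sample)
print(df.head())
```

If the column names end up shifted or everything lands in a single column, the delimiter assumption is probably wrong for your copy of the file.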


## Exploring the housing data

As always, it is important to explore your data before building models.
* Use pandas DataFrame methods to get information about the size and structure of the dataset.
* Get a statistical summary of the dataset.

* Create a scatterplot matrix that lets you visualize, in one place, the pairwise relationships between a subset of features ('INDUS', 'NOX', 'RM', 'LSTAT', 'MEDV') in this dataset.
* To plot the scatterplot matrix, use the pairplot function from the seaborn library (http://stanford.edu/~mwaskom/software/seaborn/), a Python library for drawing statistical plots based on matplotlib.
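
A possible way to carry out the exploration steps is sketched below. The DataFrame here is a synthetic stand-in with made-up values (an assumption, so that the snippet runs on its own); in the notebook you would use the df loaded from 'housing.csv'.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (NOT the real housing data) so the snippet runs on its own.
rng = np.random.default_rng(0)
n = 200
rm = rng.normal(6.3, 0.7, size=n)
df = pd.DataFrame({
    "INDUS": rng.uniform(0.5, 28.0, size=n),
    "NOX": rng.uniform(0.4, 0.9, size=n),
    "RM": rm,
    "LSTAT": rng.uniform(2.0, 35.0, size=n),
    "MEDV": 9.0 * rm - 34.0 + rng.normal(0.0, 5.0, size=n),
})

print(df.shape)        # (number of rows, number of columns)
df.info()              # column dtypes and non-null counts
print(df.describe())   # count, mean, std, min, quartiles, max per column

# Pairwise Pearson correlations for the subset of columns of interest.
cols = ["INDUS", "NOX", "RM", "LSTAT", "MEDV"]
corr = df[cols].corr()
print(corr.round(2))
```

With seaborn imported as sns, `sns.pairplot(df[cols])` followed by `plt.show()` draws the scatterplot matrix for the same columns.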

Before you can fit a model with scikit-learn, you need to bring the data into the form that scikit-learn expects. This involves creating feature and target variable arrays. Furthermore, since you are going to use only one feature to begin with, you need to reshape the arrays using NumPy's .reshape() method.

* Create array X for the 'RM' feature and array y for the 'MEDV' target variable.
* Reshape the arrays by using the .reshape() method and passing in (-1, 1).
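
The reshaping step might look like the sketch below. The DataFrame is again a synthetic stand-in (an assumption for illustration); only the column names 'RM' and 'MEDV' come from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (NOT the real housing data) so the snippet runs on its own.
rng = np.random.default_rng(0)
rm = rng.normal(6.3, 0.7, size=200)
df = pd.DataFrame({"RM": rm, "MEDV": 9.0 * rm - 34.0 + rng.normal(0.0, 5.0, size=200)})

X = df["RM"].values     # 1-D array of shape (200,)
y = df["MEDV"].values   # 1-D array of shape (200,)

# scikit-learn expects a 2-D feature array of shape (n_samples, n_features);
# -1 tells NumPy to infer the number of rows from the array length.
X = X.reshape(-1, 1)
y = y.reshape(-1, 1)

print(X.shape, y.shape)
```

Only the feature array strictly needs the second dimension; reshaping y as well makes the predictions come back as a column of the same shape.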

## Fit & predict for regression

To fit a linear regression model, we are interested in those features that have a high
correlation with our target variable MEDV. Looking at the preceding correlation
matrix, we see that our target variable MEDV shows the largest correlation with
the LSTAT variable (-0.74). However, as you might remember from the scatterplot
matrix, there is a clear nonlinear relationship between LSTAT and MEDV. On the
other hand, the correlation between RM and MEDV is also relatively high (0.70) and
given the linear relationship between those two variables that we observed in the
scatterplot, RM seems to be a good choice for an explanatory variable to introduce
the concepts of a simple linear regression model in the following section.

Now, you will fit a linear regression and predict the value of a house using just one feature. In this exercise, you will use the 'RM' feature of *the housing* dataset. Since the goal is to predict houses' value, the target variable here is 'MEDV'.

* Generate a scatter plot with 'RM' on the x-axis and 'MEDV' on the y-axis. As you can see, there is a strong positive correlation, so a linear regression should be able to capture this trend.
* Your task is to fit a linear regression and then predict the houses' prices, overlaying these predicted values on the plot to generate a regression line.
* You will also compute and print the R2 score using scikit-learn's .score() method.
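
One way to put these three steps together is sketched below. The data is a synthetic stand-in for 'RM' and 'MEDV' (an assumption so that the snippet runs without the file), and the name `prediction_space` is chosen here for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in (NOT the real housing data).
rng = np.random.default_rng(0)
X = rng.normal(6.3, 0.7, size=(200, 1))                 # plays the role of 'RM'
y = 9.0 * X[:, 0] - 34.0 + rng.normal(0.0, 5.0, 200)    # plays the role of 'MEDV'

reg = LinearRegression()
reg.fit(X, y)

# Predict over an evenly spaced grid spanning the observed feature range;
# these predictions trace out the regression line.
prediction_space = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
y_line = reg.predict(prediction_space)

# R2 computed on the same data the model was fitted to.
r2 = reg.score(X, y)
print(f"R2: {r2:.3f}")

fig, ax = plt.subplots()
ax.scatter(X[:, 0], y, s=10, alpha=0.6)
ax.plot(prediction_space[:, 0], y_line, color="black", linewidth=2)
ax.set_xlabel("RM")
ax.set_ylabel("MEDV")
plt.show()
```

Because the score here is computed on the training data itself, it says little about how the model behaves on unseen houses.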


## Train/test split for regression

Train and test sets are vital to ensure that the supervised learning model is able to generalize well to new data. This is as true for linear regression models as it is for classification models.

In this task, you will split the housing dataset into training and testing sets, and then fit and predict a linear regression over **all features**. In addition to computing the **R2** score, you will also compute the **Root Mean Squared Error (RMSE)**, which is another commonly used metric to evaluate regression models. 

* Import mean_squared_error from sklearn.metrics, and train_test_split from sklearn.model_selection.
* Create feature and target arrays X (all features except 'MEDV') and y ('MEDV').
* Using X and y, create training and test sets such that 30% is used for testing and 70% for training. Use a random state of 42.
* Create a linear regression regressor called *reg_all*, fit it to the training set, and evaluate it on the test set.
* Compute and print the R2 score using the *.score()* method on the test set.
* Compute and print the *RMSE*. To do this, first compute the Mean Squared Error using the *mean_squared_error()* function with the arguments *y_test* and *y_pred*, and then take its square root using *np.sqrt()*.
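
The workflow above could be sketched like this. The arrays come from make_regression as a synthetic stand-in with 13 features (an assumption, so the snippet runs without the file); the comments show how X and y would be built from the real DataFrame.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in (NOT the real housing data). In the notebook:
#   X = df.drop("MEDV", axis=1).values
#   y = df["MEDV"].values
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

# Hold out 30% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

reg_all = LinearRegression()
reg_all.fit(X_train, y_train)
y_pred = reg_all.predict(X_test)

r2 = reg_all.score(X_test, y_test)                    # R2 on the held-out data
rmse = np.sqrt(mean_squared_error(y_test, y_pred))    # same units as the target

print(f"R2: {r2:.3f}")
print(f"RMSE: {rmse:.3f}")
```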

## Cross-validation

Cross-validation is a vital step in evaluating a model. It makes full use of the available data: over the course of the procedure, every observation is used for training in some folds and for testing in exactly one fold, so the model is both trained and tested on all of the available data.

In this exercise, you will practice 5-fold cross-validation on the *housing* data. By default, scikit-learn's cross_val_score() function uses R2 as the metric of choice for regression. Since you are performing 5-fold cross-validation, the function will return 5 scores. Your job is to compute these 5 scores and then take their average.


* Import LinearRegression from *sklearn.linear_model* and *cross_val_score* from *sklearn.model_selection*.
* Create a linear regression regressor called *reg*.
* Use the *cross_val_score()* function to perform 5-fold cross-validation on X and y.
* Compute and print the average cross-validation score. You can use NumPy's *mean()* function to compute the average.
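
A minimal sketch of these steps follows, again with a synthetic stand-in from make_regression in place of the housing arrays (an assumption for illustration).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in (NOT the real housing data).
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

reg = LinearRegression()

# cv=5 splits the data into 5 folds; each fold serves once as the test set.
# For a regressor, the default score is R2.
cv_scores = cross_val_score(reg, X, y, cv=5)

print(cv_scores)
print(f"Average 5-fold CV score: {np.mean(cv_scores):.3f}")
```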


## Regularization I: Lasso

Recall that lasso performs regularization by adding to the loss function a penalty term equal to the sum of the absolute values of the coefficients, multiplied by some alpha.

As a result, L1 regularization (penalizing the sum of the absolute values of the weights) has a sparsity effect: it shrinks the coefficients of certain features to 0 while preserving the most relevant features, and thereby performs feature selection.
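
Written out, the penalized objective can be stated as below. This follows the form given in scikit-learn's Lasso documentation; the symbols are notation introduced here for illustration, with $n$ the number of samples, $X$ the feature matrix, $y$ the target vector and $w$ the coefficient vector.

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1,
\qquad \lVert w \rVert_1 = \sum_{j} \lvert w_j \rvert
$$

The larger alpha is, the more the second term dominates and the more coefficients are driven to exactly 0.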

In this exercise, you will fit a lasso regression to the housing data you have been working with.

* Import Lasso from sklearn.linear_model.
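* Instantiate a Lasso regressor with an alpha of 0.1 and specify normalize=True. Note that the normalize parameter was removed in scikit-learn 1.2; on newer versions, scale the features before fitting instead, for example with StandardScaler.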
* Fit the regressor to the data and compute the coefficients using the coef_ attribute.
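
The sketch below shows one way to do this on a current scikit-learn release. It swaps normalize=True for a StandardScaler step in a pipeline, which is a substitute and not an exact equivalent, so coefficients for a given alpha will differ from those obtained with normalize=True on older versions. The data is a synthetic stand-in (an assumption for illustration).

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (NOT the real housing data): 13 features, of which
# only 4 actually influence the target.
X, y = make_regression(
    n_samples=506, n_features=13, n_informative=4, noise=10.0, random_state=0
)

# Assumption: StandardScaler replaces normalize=True, which only exists
# in scikit-learn versions before 1.2.
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
lasso.fit(X, y)

# make_pipeline names each step after its lowercased class name.
lasso_coef = lasso.named_steps["lasso"].coef_
print(lasso_coef)
```

Plotting `lasso_coef` against the column names is a quick way to see which features the model keeps.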



## Regularization II: Ridge

Lasso is great for feature selection, but when building regression models, Ridge regression should be your first choice.

If instead you took the sum of the squared values of the coefficients multiplied by some alpha, as in Ridge regression, you would be computing an L2 penalty (the squared L2 norm of the coefficients). In this task, you will practice fitting ridge regression models over a range of different alphas and plot the cross-validated R2 scores for each, using the following function, which has been defined for you and plots the R2 score as well as the standard error for each alpha:

In [None]:
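import matplotlib.pyplot as plt
import numpy as np

def display_plot(cv_scores, cv_scores_std):
    """Plot the mean CV score against alpha with a +/- one standard error band.

    Relies on a global array alpha_space holding the alphas that were tried.
    """
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.plot(alpha_space, cv_scores)

    # Standard error of the mean over the 10 cross-validation folds
    std_error = cv_scores_std / np.sqrt(10)

    ax.fill_between(alpha_space, cv_scores + std_error, cv_scores - std_error, alpha=0.2)
    ax.set_ylabel('CV Score +/- Std Error')
    ax.set_xlabel('Alpha')
    # Dashed horizontal line at the best mean score
    ax.axhline(np.max(cv_scores), linestyle='--', color='.5')
    ax.set_xlim([alpha_space[0], alpha_space[-1]])
    ax.set_xscale('log')
    plt.show()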

Don't worry about the specifics of how the above function works. The motivation behind this exercise is for you to see how the R2 score varies with different alphas, and to understand the importance of selecting the right value for alpha. You'll learn how to tune alpha in the next chapter.

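* Instantiate a Ridge regressor and specify normalize=True. (The normalize parameter was removed in scikit-learn 1.2; on newer versions, scale the features before fitting instead, for example with StandardScaler.)
* Loop over the alpha values in alpha_space. Inside the for loop:
    * Specify the alpha value for the regressor to use.
    * Perform 10-fold cross-validation on the regressor with the specified alpha. The data is available in the arrays X and y.
    * Append the average and the standard deviation of the computed cross-validated scores to their respective lists. NumPy has been pre-imported for you as np.
* After the loop, use the display_plot() function to visualize the scores and standard deviations.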
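
The loop could be sketched as follows. Several things here are assumptions for illustration: the alpha grid `np.logspace(-4, 0, 50)`, the list names, the synthetic stand-in data, and the use of a StandardScaler pipeline in place of normalize=True (a substitute, not an exact equivalent).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (NOT the real housing data).
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

# Assumed grid: 50 alphas spaced evenly on a log scale from 1e-4 to 1.
alpha_space = np.logspace(-4, 0, 50)
ridge_scores = []
ridge_scores_std = []

# Assumption: StandardScaler replaces normalize=True (removed in scikit-learn 1.2).
ridge = make_pipeline(StandardScaler(), Ridge())

for alpha in alpha_space:
    # Set alpha on the Ridge step of the pipeline.
    ridge.set_params(ridge__alpha=alpha)

    # 10-fold cross-validation; the default score for a regressor is R2.
    ridge_cv_scores = cross_val_score(ridge, X, y, cv=10)

    ridge_scores.append(np.mean(ridge_cv_scores))
    ridge_scores_std.append(np.std(ridge_cv_scores))

# Convert to arrays so they can be used in element-wise arithmetic.
ridge_scores = np.array(ridge_scores)
ridge_scores_std = np.array(ridge_scores_std)

best_alpha = alpha_space[np.argmax(ridge_scores)]
print(f"Best mean CV R2 {ridge_scores.max():.3f} at alpha={best_alpha:.4g}")
```

Calling `display_plot(ridge_scores, ridge_scores_std)` afterwards draws the score curve with its standard error band, provided `alpha_space` is defined at module level as it is here.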