## The Boston Housing Dataset
We will use the Boston Housing dataset, which contains information about houses in Boston and can be loaded from the scikit-learn library. It has 506 samples and 13 feature variables. The objective is to predict house prices from the given features. First, we import the required libraries and load the dataset.

In [None]:
import numpy as np
import matplotlib.pyplot as plt 

import pandas as pd  
import seaborn as sns 

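# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires scikit-learn < 1.2.
from sklearn.datasets import load_boston
boston_dataset = load_boston()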

We print the keys of ```boston_dataset``` to see what it contains.

In [None]:
print(boston_dataset.keys())

* data: contains the information for various houses

* target: the house prices

* feature_names: names of the features

* DESCR: describes the dataset

Look at the description of the dataset (```boston_dataset.DESCR```) to see what each feature means.

The house price, given by the variable **MEDV**, is our target variable; the remaining variables are the features from which we will predict the value of a house.

Now load the data into a pandas dataframe and print the first five rows.

Note that the target value **MEDV** is missing from the data. Create a new column of target values and add it to the dataframe.
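The two steps above can be sketched as follows. This is only an illustration: the small ```boston_dataset``` stand-in (random values, three of the feature names) is an assumption made so that the snippet runs on its own; in the notebook you would use the object loaded earlier.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd

# Stand-in for `boston_dataset` (random values, not real data) so the
# snippet runs on its own; in the notebook, use the loaded object instead.
rng = np.random.default_rng(0)
boston_dataset = SimpleNamespace(
    data=rng.normal(size=(6, 3)),
    feature_names=np.array(["RM", "LSTAT", "PTRATIO"]),
    target=rng.normal(size=6),
)

# The features go into the dataframe, one column per feature name.
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
print(boston.head())

# The target is stored separately, so add it as a new column.
boston["MEDV"] = boston_dataset.target
print(boston.head())
```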

### Data preprocessing
After loading the data, it is good practice to check whether there are any missing values. Count the number of missing values for each feature using ```isnull()```.
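A minimal sketch of the missing-value count, using a toy dataframe with two deliberately inserted gaps in place of the real data:

```python
import numpy as np
import pandas as pd

# Toy dataframe with two gaps, standing in for the Boston dataframe.
boston = pd.DataFrame({
    "RM": [6.5, np.nan, 7.2, 5.9],
    "LSTAT": [5.0, 9.0, 4.0, np.nan],
    "MEDV": [24.0, 22.0, 35.0, 33.0],
})

# isnull() marks each missing cell as True; summing counts them per column.
missing_per_column = boston.isnull().sum()
print(missing_per_column)
```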

### Exploratory Data Analysis
Exploratory Data Analysis is a very important step before training the model. In this section, we will use some visualizations to understand the relationship of the target variable with other features.

Let’s first plot the distribution of the target variable **MEDV**. You may plot it with any library; try seaborn for a nicer plot. What distribution does the data follow? Can you detect any outliers?
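One possible way to approach this is sketched below with a plain matplotlib histogram and the common 1.5 × IQR rule of thumb for flagging outliers. The ```medv``` array is synthetic (an assumption for illustration), and the IQR rule is just one convention, not the only way to define an outlier.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the MEDV column (not the real prices).
rng = np.random.default_rng(0)
medv = np.concatenate([rng.normal(22.0, 4.0, size=200), [48.0, 50.0]])

# Histogram of the target; sns.histplot(medv, kde=True) is a seaborn alternative.
# In a notebook the figure renders automatically; in a script call plt.show().
fig, ax = plt.subplots()
ax.hist(medv, bins=30)
ax.set_xlabel("MEDV")
ax.set_ylabel("count")

# Rule of thumb: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(medv, [25, 75])
iqr = q3 - q1
outliers = medv[(medv < q1 - 1.5 * iqr) | (medv > q3 + 1.5 * iqr)]
print(outliers)
```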

Next, we create a correlation matrix that measures the linear relationships between the variables. You may compute it with the ```corr``` method of a pandas ```DataFrame```. Use the ```heatmap``` function from the ```seaborn``` library to plot the correlation matrix, and find a way to display the correlation value in each cell. What does a correlation of 1 mean? What about -1 and 0? What properties does the correlation matrix have? Explain the diagonal.
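As a hedged sketch of this step, the snippet below computes and plots a correlation matrix for a small synthetic dataframe (the column values are made up for illustration). The ```annot=True``` argument of ```sns.heatmap``` is what writes the number into each cell.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in dataframe (random values, not the Boston data).
rng = np.random.default_rng(0)
rm = rng.normal(6.0, 0.7, size=100)
boston = pd.DataFrame({
    "RM": rm,
    "LSTAT": 40.0 - 5.0 * rm + rng.normal(0.0, 2.0, size=100),
    "MEDV": 9.0 * rm - 30.0 + rng.normal(0.0, 3.0, size=100),
})

# Pairwise Pearson correlations, rounded for readability.
correlation_matrix = boston.corr().round(2)

# annot=True writes the correlation value inside each cell.
ax = sns.heatmap(data=correlation_matrix, annot=True)
```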

### Manual feature selection

To fit a linear regression model, we select those features that have a strong correlation (positive or negative) with our target variable MEDV. Looking at the correlation matrix, which variables have the strongest correlation with MEDV?
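Reading the ranking off the heatmap by eye works, but it can also be computed. The sketch below sorts features by the absolute value of their correlation with the target; the data is synthetic and the ```NOISE``` column is a hypothetical feature added only to show an uncorrelated case.

```python
import numpy as np
import pandas as pd

# Synthetic data: MEDV depends on RM and LSTAT but not on NOISE.
rng = np.random.default_rng(0)
n = 200
rm = rng.normal(size=n)
lstat = rng.normal(size=n)
boston = pd.DataFrame({
    "RM": rm,
    "LSTAT": lstat,
    "NOISE": rng.normal(size=n),  # hypothetical, unrelated to the target
    "MEDV": 2.0 * rm - 3.0 * lstat + rng.normal(size=n),
})

# Take the MEDV column of the correlation matrix, drop MEDV itself,
# and sort by absolute value so strong negative correlations rank high too.
ranking = boston.corr()["MEDV"].drop("MEDV").abs().sort_values(ascending=False)
print(ranking)
```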

### Multicollinearity

An important point in selecting features for a linear regression model is to check for multicollinearity. That is, we should not select pairs of features that are strongly correlated with each other. Let us look at a simple case with two (independent) variables to see why:

Consider the simplest case where $Y$ is regressed against $X$ and $Z$ and where $X$ and $Z$ are highly positively correlated. Then the effect of $X$ on $Y$ is hard to distinguish from the effect of $Z$ on $Y$ because any increase in $X$ tends to be associated with an increase in $Z$.

Another way to look at this is to consider the equation. If we write 

$$Y=b_0+b_1X+b_2Z+\epsilon,$$

then the coefficient $b_1$ is the increase in $Y$ for every unit increase in $X$ while holding $Z$ constant. But in practice, it is often impossible to hold $Z$ constant, and the positive correlation between $X$ and $Z$ means that a unit increase in $X$ is usually accompanied by some increase in $Z$ at the same time.
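A small simulation can make this concrete. The sketch below is an illustration under assumed values (true coefficients $b_0=1$, $b_1=2$, $b_2=3$, unit-variance noise, and a helper named ```b1_spread``` introduced here): it refits the model on many fresh samples and measures how much the estimate of $b_1$ varies when $X$ and $Z$ are uncorrelated versus strongly correlated.

```python
import numpy as np


def b1_spread(rho, n_samples=100, n_trials=300, seed=0):
    """Std of the fitted b_1 over repeated samples when corr(X, Z) = rho."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    estimates = []
    for _ in range(n_trials):
        xz = rng.multivariate_normal([0.0, 0.0], cov, size=n_samples)
        # Assumed true model: Y = 1 + 2 X + 3 Z + noise
        y = 1.0 + 2.0 * xz[:, 0] + 3.0 * xz[:, 1] + rng.normal(size=n_samples)
        # Ordinary least squares with an intercept column.
        design = np.column_stack([np.ones(n_samples), xz])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        estimates.append(coef[1])
    return float(np.std(estimates))


spread_independent = b1_spread(rho=0.0)
spread_collinear = b1_spread(rho=0.99)
print(spread_independent, spread_collinear)
```

The estimate of $b_1$ fluctuates far more from sample to sample in the correlated case, which is why its value becomes hard to interpret.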

Which features are strongly correlated with each other?

Finally, suggest two features to use for the regression. Use a scatter plot to see how these features vary with MEDV. Plot them using ```plt.subplot()``` and label the axes of each subplot.

What is the relationship between MEDV and LSTAT or RM? Does that make sense given their descriptions? Can you detect any peculiarities in the data?
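A possible layout for the scatter plots is sketched below, using the LSTAT and RM columns that the following sections work with. The dataframe is synthetic (an assumption so the snippet runs on its own), so the shapes of the point clouds are not those of the real data.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in dataframe (not the real Boston values).
rng = np.random.default_rng(0)
rm = rng.normal(6.0, 0.7, size=100)
lstat = rng.uniform(2.0, 35.0, size=100)
boston = pd.DataFrame({
    "RM": rm,
    "LSTAT": lstat,
    "MEDV": 9.0 * rm - 0.5 * lstat - 25.0 + rng.normal(0.0, 3.0, size=100),
})

features = ["LSTAT", "RM"]
plt.figure(figsize=(12, 5))
for i, col in enumerate(features):
    # One row, one subplot per feature; subplot indices start at 1.
    plt.subplot(1, len(features), i + 1)
    plt.scatter(boston[col], boston["MEDV"], marker="o")
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel("MEDV")
```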

### Preparing the data for training
Concatenate the LSTAT and RM columns using ```np.c_``` provided by the numpy library.
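For instance, ```np.c_``` stacks 1-D arrays side by side as columns. The toy values below are placeholders for the real columns:

```python
import numpy as np
import pandas as pd

# Toy values standing in for the real dataframe.
boston = pd.DataFrame({
    "LSTAT": [5.0, 9.0, 4.0],
    "RM": [6.5, 6.4, 7.2],
    "MEDV": [24.0, 22.0, 35.0],
})

# np.c_ places the two columns side by side, giving shape (n_samples, 2).
X = pd.DataFrame(np.c_[boston["LSTAT"], boston["RM"]], columns=["LSTAT", "RM"])
Y = boston["MEDV"]
print(X.shape, Y.shape)
```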

### Training and test splits
Next, we split the data into training and test sets. We train the model on 80% of the samples and test it on the remaining 20%, so that we can assess the model’s performance on unseen data. To split the data you may use the ```scikit-learn``` library. Finally, print the sizes of the training and test sets to verify that the split worked as intended.
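A sketch of the split with ```train_test_split``` from ```sklearn.model_selection```, using random arrays with the same number of rows as the dataset (506). The ```random_state``` value is an arbitrary choice that only makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-ins with the same number of samples as the dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(506, 2))
Y = rng.normal(size=506)

# test_size=0.2 keeps 20% of the rows for testing.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=5
)
print(X_train.shape, X_test.shape)
print(Y_train.shape, Y_test.shape)
```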

### Training and testing the model
Use ```scikit-learn``` again to train a linear regression model on the training set.
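A minimal sketch with ```LinearRegression``` from ```sklearn.linear_model```. The training data here is synthetic, generated from assumed coefficients so that the fitted values can be checked against them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data from an assumed model: y = 20 - x0 + 5 x1 + noise.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 2))
Y_train = (
    20.0 - 1.0 * X_train[:, 0] + 5.0 * X_train[:, 1]
    + rng.normal(0.0, 0.5, size=400)
)

lin_model = LinearRegression()
lin_model.fit(X_train, Y_train)

# The fitted intercept and coefficients should be close to 20, -1 and 5.
print(lin_model.intercept_, lin_model.coef_)
```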

### Model evaluation
Now evaluate the model using the RMSE and the $R^2$ score.
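Both metrics can be computed with ```sklearn.metrics```. The sketch below uses four hand-picked values so the arithmetic is easy to follow; in the notebook you would pass the true test targets and the model's predictions instead.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hand-picked toy values; replace with Y_test and the model's predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# RMSE is the square root of the mean squared error.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
# R^2 is 1 minus the ratio of residual to total sum of squares.
r2 = r2_score(y_true, y_pred)
print(f"RMSE: {rmse:.3f}, R2: {r2:.3f}")  # RMSE: 0.707, R2: 0.900
```

Here the squared errors are 1, 0, 1, 0, so the MSE is 0.5; the total sum of squares around the mean of 6 is 20 and the residual sum of squares is 2, giving $R^2 = 1 - 2/20 = 0.9$.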

Is this an expected result considering the scatter plots we saw before? How do you think we may improve the results?