# Predictive Models - Data Preparation

In this section we cover linear regression: how to use the scikit-learn regression model, how to train the regressor with the fit() method, and how to predict new labels with the predict() method. We'll analyze a data set of house prices in Boston, starting with a single-variable linear regression in NumPy and then moving on to scikit-learn. We'll give an overview of the mathematics behind the method, but mostly we'll focus on practical, hands-on coding lessons.

If you're interested in the deeper mathematics of linear regression methods, see the [Wikipedia page](http://en.wikipedia.org/wiki/Linear_regression) and Andrew Ng's wonderful lectures, available for free on [YouTube](https://www.youtube.com/watch?v=5u4G23_OohI).

In the next two sections we will work through univariate and multivariate linear regression in the following steps:

    Step 1: Getting and setting up the data.
    Step 2: Visualizing current data.
    Step 3: The mathematics behind the Least Squares Method.
    Step 4: Using NumPy for a univariate linear regression.
    Step 5: Getting the error.
    Step 6: Using scikit-learn to implement a multivariate regression.
    Step 7: Using training and validation.
    Step 8: Predicting prices.
    Step 9: Residual plots.
    

### Step 1: Getting and setting up the data.

We'll start by looking at an example data set from scikit-learn. First we'll import our usual data analysis libraries, then scikit-learn's built-in Boston data set.

In [None]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

Next, the imports for plotting:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline

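Now import the data set loader from scikit-learn; the linear_model module will be needed once we start fitting models. Note: you may have to download the data sets first. If they are missing, scikit-learn will raise an error and prompt you to get them. Also note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell requires an older release.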

In [None]:
from sklearn.datasets import load_boston

Next we'll load the data set:

In [None]:
# Load the housing dataset
boston = load_boston()
type(boston.data)

Let's see what the data set contains

In [None]:
print(boston.DESCR)

### Step 2: Visualizing current data

You should always try to do a quick visualization of the data you have. Let's go ahead and make a histogram of the prices.

In [None]:
# Histogram of prices (this is the target of our data set)

# What is a target? It is the attribute we are trying to estimate.
# Here, it is the price of the houses.

# TRY IT: The data set has a built-in attribute, boston.target.
# Try printing boston.target, then plot it as a histogram.
plt.hist(boston.target)
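
To make concrete what a histogram computes, here is a minimal sketch using `np.histogram` on a handful of made-up values (these are not the Boston targets). By default `plt.hist` groups the data into 10 equal-width bins, which is mirrored here with `bins=10`; the variable names are only illustrative.

```python
import numpy as np

# Made-up illustrative values, NOT taken from the Boston data set
values = np.array([12.0, 15.5, 17.0, 21.0, 21.5, 22.0, 24.5, 30.0, 36.0, 50.0])

# counts[i] is the number of values falling between edges[i] and edges[i + 1]
counts, edges = np.histogram(values, bins=10)

print(counts)  # one count per bin
print(edges)   # 11 bin edges spanning min(values) to max(values)
```

The bar heights drawn by `plt.hist` correspond to `counts`, and the bar boundaries to `edges`.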

Interesting. Now let's look at a scatter plot of one feature versus the target. In this case we'll plot the housing price against the number of rooms in the dwelling.

In [None]:

# TRY IT: Scatter plot the number of rooms (column index 5) against the price (the target)
plt.scatter(boston.data[:, 5], boston.target)

# Label the axes
plt.ylabel('Price in $1000s')
plt.xlabel('Number of rooms')

Great! We can make out a slight trend: the price increases with the number of rooms in the house, which intuitively makes sense. Now let's see if we can fit the data linearly.
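
A trend like this can also be checked numerically. The sketch below is only an illustration on made-up numbers (not the Boston data): it computes the Pearson correlation with `np.corrcoef` and a straight-line least-squares fit with `np.polyfit`. The arrays `rooms` and `price` are hypothetical stand-ins for `boston.data[:, 5]` and `boston.target`.

```python
import numpy as np

# Made-up illustrative values, NOT taken from the Boston data set
rooms = np.array([4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5])
price = np.array([12.0, 15.5, 17.0, 21.0, 24.5, 30.0, 36.0])

# Pearson correlation coefficient between the two variables
r = np.corrcoef(rooms, price)[0, 1]

# Degree-1 least-squares fit: price is approximated by slope * rooms + intercept
slope, intercept = np.polyfit(rooms, price, 1)

print(f"correlation: {r:.3f}")
print(f"fit: price = {slope:.2f} * rooms + {intercept:.2f}")
```

A correlation close to 1 together with a positive slope is what an upward trend in the scatter plot looks like in numbers.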

Let's try to do the following:

    1. Use pandas to transform the Boston data set into a DataFrame.

    2. Then use seaborn to perform an lmplot on that DataFrame to reproduce the scatter plot with a linear fit line.

In [None]:
# reset data as pandas DataFrame
boston_df = DataFrame(boston.data)

# label columns
boston_df.columns = boston.feature_names

#show
boston_df.head()

Now let's add the target of the boston data set, the price. We'll create a new column in our DataFrame.

In [None]:
# Set price column for target
boston_df['Price'] = boston.target

Now let's see the resulting DataFrame!

In [None]:
# TRY IT: Print the first 10 rows of the DataFrame. Hint: use head()
boston_df.head(10)
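
The same construction can be tried without the Boston data. The following is a minimal sketch assuming a small made-up array: `data`, `feature_names` and `target` are hypothetical stand-ins for `boston.data`, `boston.feature_names` and `boston.target`, and only two feature columns are used to keep it short.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for boston.data, boston.feature_names and boston.target
# (made-up numbers, NOT taken from the Boston data set)
data = np.array([[6.0, 10.0],
                 [5.5, 14.0],
                 [7.0, 5.0]])
feature_names = ['RM', 'LSTAT']
target = np.array([22.0, 18.0, 33.0])

# Build the DataFrame and label the columns in one step
df = pd.DataFrame(data, columns=feature_names)

# Add the target as a new column; its length must match the number of rows
df['Price'] = target

print(df.shape)
print(df.head())
```

Passing `columns=` to the constructor is equivalent to assigning `df.columns` afterwards, as done in the cells above.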

Now, you might be reminded of the seaborn lmplot function we used during the visualization lectures. You could use it here to do a linear fit automatically!

In [None]:
# Use seaborn to create a linear fit
# (x and y are passed as keywords; recent seaborn versions no longer accept them positionally)
sns.lmplot(x='RM', y='Price', data=boston_df)

However, we won't be able to do this when we move on to more complicated regression models, so we'll stay focused on using the scikit-learn library.
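
As a preview of the fit() and predict() workflow mentioned at the start of this section, here is a minimal sketch using scikit-learn's `LinearRegression` on toy data. The data is generated from y = 2x + 1 purely for illustration and has nothing to do with house prices; it is not necessarily the exact code used in the later steps.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 2x + 1 (illustrative only, not house prices)
X = np.array([[1.0], [2.0], [3.0], [4.0]])  # shape (n_samples, n_features)
y = 2.0 * X[:, 0] + 1.0                     # shape (n_samples,)

# Train the regressor
model = LinearRegression()
model.fit(X, y)

# Predict the label for a new, unseen sample
X_new = np.array([[5.0]])
y_pred = model.predict(X_new)

print(model.coef_, model.intercept_, y_pred)
```

Note that scikit-learn expects the features as a 2D array, even when there is only one feature, which is why `X` has shape `(4, 1)` rather than `(4,)`.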