# Linear regression example

In this example, we'll read BMI and life expectancy data for a number of countries from a CSV file. We'll then use that data to predict the life expectancy for somebody from Laos (which is missing from the data) based on the country's average BMI.

## 0. Imports

In [2]:
from sklearn.linear_model import LinearRegression
import pandas as pd

- Pandas [10 minutes to Pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html#min)
- Scikit-learn [official tutorial](http://scikit-learn.org/stable/tutorial/basic/tutorial.html)
- Matplotlib [official tutorial](http://matplotlib.org/users/pyplot_tutorial.html)

## 1. Load the data

In [3]:
bmi_life_data = pd.read_csv('bmi_and_life_expectancy.csv')

## 2. Build a linear regression model

This takes the BMI data points (x, the independent variable) and fits a line of best fit to the life expectancy data points (y, the dependent variable).

In [4]:
bmi_life_model = LinearRegression()
bmi_life_model.fit(bmi_life_data[['BMI']], bmi_life_data[['Life expectancy']])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
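To make "line of best fit" concrete: a fitted `LinearRegression` exposes its slope and intercept as `coef_` and `intercept_`. The sketch below uses a small made-up dataset (the numbers are illustrative and not taken from the CSV) that lies exactly on the line y = 2x + 10, so the recovered values are easy to check by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on y = 2x + 10 (not from the CSV).
x_demo = np.array([[18.0], [20.0], [22.0], [24.0], [26.0]])
y_demo = 2.0 * x_demo[:, 0] + 10.0

demo_model = LinearRegression()
demo_model.fit(x_demo, y_demo)

slope = demo_model.coef_[0]        # slope of the fitted line
intercept = demo_model.intercept_  # value of the line at x = 0

# A prediction is just the fitted line evaluated at the new x value.
manual = slope * 21.0 + intercept
predicted = demo_model.predict([[21.0]])[0]
print(slope, intercept, manual, predicted)
```

The same two attributes can be read from `bmi_life_model` after fitting, which is a quick way to see how much the predicted life expectancy changes per unit of BMI.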

## 3. Predict using the model

In [6]:
laos_life_exp = bmi_life_model.predict([ [21.07931] ])
laos_life_exp

array([[ 60.31564716]])

## Outliers

What about BMIs that are far from normal? How well does this model fit?

In [10]:
outliers = bmi_life_model.predict([ [19], [40] ])
outliers

array([[  55.07894768],
       [ 107.9670159 ]])

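Seemingly, the key to a long life is to be extremely obese... The model simply extends its straight line to whatever BMI it is given, so predictions for values far from typical ones should not be trusted.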
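One simple guard against this kind of result is to check whether a query value falls inside the range of values the model was trained on. The sketch below is only an illustration: `demo_bmi` is a made-up stand-in for `bmi_life_data['BMI']`, and `within_training_range` is a hypothetical helper, not part of pandas or scikit-learn.

```python
import pandas as pd

# Hypothetical BMI column standing in for bmi_life_data['BMI'].
demo_bmi = pd.Series([20.6, 22.3, 24.1, 26.4, 27.9], name='BMI')


def within_training_range(value, training_values):
    """Return True if value lies between the smallest and largest training value."""
    return training_values.min() <= value <= training_values.max()


print(within_training_range(21.0, demo_bmi))  # inside the made-up range
print(within_training_range(40.0, demo_bmi))  # far outside the made-up range
```

A value that fails this check is an extrapolation, and the linear model has no evidence that the trend continues that far.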

# Linear Regression Warnings

1. Linear regression works best when the data is linear
2. Linear regression is sensitive to outliers
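The second warning can be illustrated with a small experiment. The sketch below uses made-up points that lie exactly on y = 2x + 1, then replaces a single y value with an extreme one and refits. All of the data here is hypothetical and chosen only to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical points lying exactly on y = 2x + 1.
x_clean = np.arange(10, dtype=float).reshape(-1, 1)
y_clean = 2.0 * x_clean[:, 0] + 1.0

clean_slope = LinearRegression().fit(x_clean, y_clean).coef_[0]

# Same data, but the last y value is replaced by an extreme outlier.
y_outlier = y_clean.copy()
y_outlier[-1] = 100.0

outlier_slope = LinearRegression().fit(x_clean, y_outlier).coef_[0]

print(clean_slope, outlier_slope)
```

A single bad point out of ten is enough to make the fitted line considerably steeper, because least squares penalises large errors heavily.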

# Multiple Linear Regression

When you have a single predictor, such as BMI in the last example, the equation of the line is simply: 

$y = mx + b$

For two predictors, the equation becomes:

$y = m_1x_1 + m_2x_2 + b$

More generally, the equation for $n$ predictor variables is:

$y = m_1x_1 + m_2x_2 + \dots + m_nx_n + b$
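The general equation maps directly onto a fitted scikit-learn model: `coef_` holds the values $m_1, \dots, m_n$ and `intercept_` holds $b$. The sketch below checks this on synthetic data with three predictors. The coefficients `true_m` and `true_b` are arbitrary values picked for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 50 samples, 3 predictors, no noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
true_m = np.array([1.5, -2.0, 0.5])  # arbitrary illustrative coefficients
true_b = 4.0                         # arbitrary illustrative intercept
y_demo = X_demo @ true_m + true_b

model = LinearRegression().fit(X_demo, y_demo)

# Evaluate y = m_1*x_1 + m_2*x_2 + m_3*x_3 + b by hand for a new point.
new_point = np.array([[1.0, 2.0, 3.0]])
manual = new_point @ model.coef_ + model.intercept_
predicted = model.predict(new_point)

print(model.coef_, model.intercept_)
print(manual, predicted)
```

Because the synthetic data has no noise, the fitted coefficients should match `true_m` and `true_b` up to floating-point error, and the manual calculation should agree with `predict`.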

## Boston dataset

The Boston dataset has 13 features for each of 506 houses, along with their median value in \$1000s.

### 0. Import

In [12]:
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older version.
from sklearn.datasets import load_boston

### 1. Load the data

In [13]:
boston_data = load_boston()
x = boston_data['data']
y = boston_data['target']

### 2. Build the model

In [15]:
boston_model = LinearRegression()
boston_model.fit(x, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

### 3. Predict using the model

In [17]:
sample_house = [[2.29690000e-01, 0.00000000e+00, 1.05900000e+01, 0.00000000e+00, 4.89000000e-01,
                6.32600000e+00, 5.25000000e+01, 4.35490000e+00, 4.00000000e+00, 2.77000000e+02,
                1.86000000e+01, 3.94870000e+02, 1.09700000e+01]]
prediction = boston_model.predict(sample_house)
prediction

array([ 23.68420569])