# Regression

A regression is a predictive model that looks for a functional relationship between a set of variables (X) and a continuous outcome variable (y).

In other words, given an input array, we try to predict a numerical value.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Weight - Height dataset

In [None]:
df = pd.read_csv('../data/weight-height.csv')

In [None]:
df.head()

### Visualize the dataset

In [None]:
plt.figure(figsize=(15,10))
plt.scatter(df['Height'], df['Weight'], alpha = 0.2)
plt.title('Humans', size=20)
plt.xlabel('Height (in)', size=20)
plt.ylabel('Weight (lbs)', size=20)

### Visualize male and female populations

This can be done in many ways; below are two examples.

In [None]:
# males = df[df['Gender'] == 'Male']
# females = df[df['Gender'] == 'Female']

males = df.query('Gender == "Male"')
females = df.query('Gender == "Female"')

plt.figure(figsize=(15,10))
plt.scatter(males['Height'], males['Weight'], alpha = 0.3, label = 'males', c = 'c')
plt.scatter(females['Height'], females['Weight'], alpha = 0.3, label = 'females', c = 'pink')
plt.title('Humans', size = 20)
plt.xlabel('Height (in)', size = 20)
plt.ylabel('Weight (lbs)', size = 20)
plt.legend()
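
As a small sketch of the selection step alone, the snippet below checks that the boolean mask and the `query` string pick the same rows, and shows `groupby` as a third option. The `toy` table and its values are made up for illustration; they are not taken from the real dataset.

```python
import pandas as pd

# Tiny made-up table, only for illustration (not the real dataset)
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Height': [70.0, 64.0, 72.0, 62.0],
    'Weight': [180.0, 130.0, 200.0, 120.0],
})

# 1) boolean mask
males_mask = toy[toy['Gender'] == 'Male']

# 2) query string
males_query = toy.query('Gender == "Male"')

# 3) groupby: one sub-table per distinct value of Gender
groups = {name: group for name, group in toy.groupby('Gender')}

# the first two approaches select exactly the same rows
print(males_mask.equals(males_query))
print(sorted(groups))
```

The `groupby` form is convenient when there are more than two categories, since it avoids writing one filter per category.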

## Linear regression

Linear regression uses the simplest functional form one can imagine: the outcome is a straight-line function of the input, with intercept $\alpha$ and slope $\beta$.

$$
y = \alpha + \beta x
$$
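
For a single input variable, the least-squares values of $\alpha$ and $\beta$ have a closed form. The sketch below computes them with NumPy on synthetic data; the "true" parameters, the noise level and the variable names are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic data generated from known parameters (assumed values)
rng = np.random.RandomState(0)
x = rng.uniform(55, 80, size=200)
true_alpha, true_beta = -350.0, 7.5
y = true_alpha + true_beta * x + rng.normal(0, 10, size=200)

# Closed-form least-squares estimates for one input variable:
#   beta  = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   alpha = mean(y) - beta * mean(x)
x_centered = x - x.mean()
y_centered = y - y.mean()
beta_hat = np.sum(x_centered * y_centered) / np.sum(x_centered ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()

print("Slope: %.2f" % beta_hat)
print("Intercept: %.2f" % alpha_hat)
```

With enough data the estimates should land close to the parameters used to generate the points, which is a quick sanity check for any fitting routine.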

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
# create instance of linear regression class
regr = LinearRegression()

# what's the purpose of the next line?
# try to print out df['Height'].values and x
# to figure it out
x = df['Height'].values[:,np.newaxis]

y = df['Weight']

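# split data in 2 parts (20% test / 80% train)
n = len(y)
ind = np.random.permutation(n)  # shuffled row positions
n_test = n // 5                 # integer division: 20% of the rows
test_ind = ind[:n_test]
train_ind = ind[n_test:]

x_train = x[train_ind]
x_test  = x[test_ind]
# .iloc selects by position, matching the positional indexing of x
y_train = y.iloc[train_ind]
y_test  = y.iloc[test_ind]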


regr.fit(x_train, y_train)
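
The manual shuffle-and-slice split can also be done with scikit-learn's `train_test_split` helper. The following is a minimal sketch on synthetic height and weight values (the numbers are made up and only stand in for the real dataset).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for heights (in) and weights (lbs); values are made up
rng = np.random.RandomState(42)
height = rng.normal(66, 4, size=500)
weight = -350 + 7.7 * height + rng.normal(0, 12, size=500)

x = height[:, np.newaxis]  # 2-D feature array, one column
y = weight

# 20% test / 80% train; random_state makes the shuffle reproducible
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(x_train, y_train)
print(x_train.shape, x_test.shape)
```

Fixing `random_state` is a design choice: it keeps the split identical between runs, which makes results easier to compare.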

In [None]:
# The coefficients
print("Slope: %.2f" % regr.coef_[0])
print("Intercept: %.2f" % regr.intercept_)

In [None]:
# The mean squared error on the test set
print("Mean squared error: %.2f"
      % np.mean((regr.predict(x_test) - y_test) ** 2))

In [None]:
# R^2 (coefficient of determination): 1 is perfect prediction
print('R^2 score: %.2f' % regr.score(x_test, y_test))
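
The `score` method of `LinearRegression` returns the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. As a sketch of what that number means, the snippet below recomputes it by hand, together with the mean squared error, on synthetic data (the data and parameter values are assumptions for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known linear trend plus noise (assumed values)
rng = np.random.RandomState(1)
x = rng.uniform(0, 10, size=(100, 1))
y = 3.0 + 2.0 * x[:, 0] + rng.normal(0, 1, size=100)

model = LinearRegression().fit(x, y)
pred = model.predict(x)

# mean squared error: average squared residual
mse = np.mean((pred - y) ** 2)

# R^2: fraction of the variance of y captured by the model
ss_res = np.sum((y - pred) ** 2)        # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot

print("MSE: %.3f" % mse)
print("R^2 by hand: %.3f, via score(): %.3f" % (r2, model.score(x, y)))
```

Note the difference between the two quantities: the residual sum of squares grows with the number of points, while the mean squared error and $R^2$ do not.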

In [None]:
plt.scatter(x_test, y_test)
plt.plot(x_test, regr.predict(x_test), color = 'red')
plt.title('Humans')
plt.xlabel('Height (in)')
plt.ylabel('Weight (lbs)')


## Housing prices dataset

This dataset contains multiple columns:
- sqft
- bdrms
- age
- price

Our goal is to build a model of price as a function of the other house attributes.

In [None]:
df = pd.read_csv('../data/housing-data.csv')

In [None]:
df.head()

### Scatter matrix
The scatter matrix gives us an intuitive idea of how each variable is distributed and how it correlates with the other variables.

In [None]:
from pandas.plotting import scatter_matrix

In [None]:
_  = scatter_matrix(df, alpha=0.2, figsize=(8, 8))

Question: Is any trend apparent from the figure above?
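
A numerical companion to the scatter matrix is the correlation matrix returned by `DataFrame.corr()`. The sketch below uses a synthetic table that borrows the column names of the housing data; all values are randomly generated, so it says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic table with the same column names as the housing data (made-up values)
rng = np.random.RandomState(0)
n = 200
sqft = rng.uniform(800, 4000, size=n)
toy = pd.DataFrame({
    'sqft': sqft,
    'bdrms': rng.randint(1, 6, size=n),
    'age': rng.uniform(0, 80, size=n),
    # price is built to depend on sqft only, plus noise
    'price': 50000 + 120 * sqft + rng.normal(0, 30000, size=n),
})

# Pearson correlation between every pair of columns
corr = toy.corr()
print(corr.round(2))
```

Values close to 1 or -1 correspond to the tight diagonal clouds in a scatter matrix, while values near 0 correspond to shapeless ones.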

## Linear regression (multiple variables)

$$
y = \alpha + \beta_0 x_0 + \beta_1 x_1 + \dots
$$
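
With several inputs, the same least-squares problem can be written with a design matrix whose first column is all ones (for the intercept $\alpha$). The sketch below solves it with `np.linalg.lstsq` and compares the result to `LinearRegression`; the data and the "true" coefficients are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 3 features, known coefficients (assumed values)
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))
true_beta = np.array([1.5, -2.0, 0.5])
y = 4.0 + X @ true_beta + rng.normal(0, 0.1, size=50)

# Design matrix: a column of ones for the intercept, then the features
A = np.column_stack([np.ones(len(X)), X])
params, *_ = np.linalg.lstsq(A, y, rcond=None)
alpha_hat, beta_hat = params[0], params[1:]

# scikit-learn solves the same least-squares problem
model = LinearRegression().fit(X, y)

print("lstsq:  ", alpha_hat, beta_hat)
print("sklearn:", model.intercept_, model.coef_)
```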

In [None]:
regr = LinearRegression()

In [None]:
# why do I not need to add an axis in this case?
X = df[['sqft', 'bdrms', 'age']]
y = df['price']
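
scikit-learn estimators expect the features as a 2-D array of shape `(n_samples, n_features)`. The sketch below prints the shapes produced by the different ways of selecting columns; the `toy` table is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up table, only used to inspect shapes
toy = pd.DataFrame({
    'sqft': [1000, 1500, 2000],
    'bdrms': [2, 3, 4],
    'price': [200, 290, 410],
})

one_col = toy['sqft'].values        # single brackets: 1-D array
as_column = one_col[:, np.newaxis]  # np.newaxis adds a second axis
two_cols = toy[['sqft', 'bdrms']]   # double brackets: 2-D DataFrame
single_2d = toy[['sqft']]           # double brackets with one column: still 2-D

print(one_col.shape, as_column.shape, two_cols.shape, single_2d.shape)
# (3,) (3, 1) (3, 2) (3, 1)
```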

In [None]:
regr.fit(X, y)

In [None]:
# the coef_ attribute is now an array of coefficients
regr.coef_

In [None]:
regr.intercept_

In [None]:
regr.score(X, y)

In [None]:
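# predict the price of a 2000 sqft, 3-bedroom, 20-year-old house
regr.predict(pd.DataFrame([[2000, 3, 20]], columns=['sqft', 'bdrms', 'age']))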

## Nonlinear regression

Nonlinear regression is used when the functional relationship between input and output is more complex than a straight line.

In this case we create polynomial features, i.e. higher powers of the input, and fit a linear combination of them. The model is nonlinear in $x$ but still linear in the coefficients, so the same `LinearRegression` class can fit it.

$$
y = \alpha + \beta_1 x + \beta_2 x^2 + \dots + \beta_n x^n
$$
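
To make the idea of polynomial features concrete, the sketch below expands two sample inputs with scikit-learn's `PolynomialFeatures`. Each power of the input becomes its own column (including a constant column of ones), and a linear model then learns one coefficient per column. The sample values 2 and 3 are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two arbitrary sample inputs, as a column vector
x = np.array([2.0, 3.0])[:, np.newaxis]

# degree=3 produces the columns 1, x, x^2, x^3
poly = PolynomialFeatures(degree=3)
expanded = poly.fit_transform(x)

print(expanded)
# [[ 1.  2.  4.  8.]
#  [ 1.  3.  9. 27.]]
```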

In [None]:
df = pd.read_json('../data/xy-regression.json')
line = pd.read_csv('../data/xy-function.csv')

In [None]:
plt.figure(figsize=(11,7))
plt.scatter(df.x, df.y, label="training points")
plt.xlabel('x', size = 20)
plt.ylabel('y', size = 20)

In [None]:
plt.scatter(df.x, df.y, label="training points")
plt.plot(line.x, line.y, label = "ground truth")

plt.legend(loc = 'best')

In [None]:
from sklearn.preprocessing import PolynomialFeatures


# convenience function that lets us specify
# the maximum degree of the polynomial features
# we intend to use
def poly_fit(degree = 3):
    poly = PolynomialFeatures(degree=degree)
    # expand the training inputs into the columns 1, x, x^2, ...
    X_ = poly.fit_transform(df.x.values[:, np.newaxis])
    # reuse the already fitted transformer for the plotting grid
    line_ = poly.transform(line.x.values[:, np.newaxis])
    clf = LinearRegression()
    clf.fit(X_, df.y)
    return clf, line_

In [None]:
plt.figure(figsize=(11,7))

plt.scatter(df.x, df.y, label="training points")
plt.plot(line.x, line.y, label = "ground truth")


poly4, line_ = poly_fit(4)
plt.plot(line.x, poly4.predict(line_), label = "4th deg poly")

poly5, line_ = poly_fit(5)
plt.plot(line.x, poly5.predict(line_), label = "5th deg poly")

plt.xlabel('x', size = 20)
plt.ylabel('y', size = 20)
plt.legend(loc = 'best')

## Exercises

1)
- repeat the regression tasks with Ridge or Lasso regression (http://scikit-learn.org/stable/modules/linear_model.html)
- what changes?

2)
- load a different dataset and explore linear relations among features

*Copyright &copy; 2015 Dataweekends.  All rights reserved.*