# Regression in scikit-learn

There are three main obstacles to working with regression in scikit-learn:

1. Selecting the proper model
2. Understanding the proper input data types and scikit-learn syntax
3. Testing and validation

Today we will touch on all three, but focus on number 2.

For demonstration, we will use the Boston housing dataset, which comes with scikit-learn:

In [None]:
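# load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older release.
from sklearn.datasets import load_boston

boston = load_boston()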

If you are going to follow along with other tutorials in the scikit-learn documentation, you will need to know the data structures used as inputs to the models. Let's see what's in the Boston dataset:

In [None]:
boston.keys()

The description will tell us more about the dataset:

In [None]:
# print() renders the line breaks instead of showing the raw string
print(boston.DESCR)

So we are predicting the median value of a home from 506 observations and 13 covariates: crime rate, lot size, industrial/commercial proportion, presence of the Charles River, nitric oxide concentration, rooms per dwelling, units built before 1940, distance to employment centers, access to highways, tax rate, school proxy, Black population, and lower-status share of the population. To get the variable names, we can ask the dictionary for them:

In [None]:
print(boston.feature_names)
print()
print(type(boston.feature_names))
print()
print(len(boston.feature_names))

We see that the variable labels are stored as a NumPy array of strings. To get the variable data, we ask the dictionary for the data:

In [None]:
print(boston.data)
print()
print(type(boston.data))
print()
print(len(boston.data))

The data is a two-dimensional NumPy array that holds a separate inner array for each observation (506 inner arrays, one for each house, *not* 13 inner arrays, one for each variable). Each inner array *must* line up with the order of the variables *and* with every other inner array. **ORDER MATTERS.**
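
As a minimal sketch of that layout, the snippet below builds a tiny made-up array with one row per observation and one column per variable. The numbers and the `toy_names` labels are invented for illustration and are not taken from the Boston data.

```python
import numpy as np

# Hypothetical toy data: 4 observations (rows) x 3 variables (columns).
toy_names = np.array(['rooms', 'age', 'tax'])
toy_data = np.array([[6.0, 65.0, 296.0],
                     [5.5, 80.0, 242.0],
                     [7.1, 45.0, 222.0],
                     [6.4, 52.0, 311.0]])

n_observations, n_variables = toy_data.shape
print(n_observations, n_variables)

# One observation is one row; its values follow the order of toy_names.
first_observation = toy_data[0]
print(dict(zip(toy_names.tolist(), first_observation.tolist())))

# One variable across all observations is one column.
age_column = toy_data[:, toy_names.tolist().index('age')]
print(age_column)
```

Indexing with `toy_data[i]` returns a whole observation, while `toy_data[:, j]` returns a single variable for every observation.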

The target, or *y*, is accessed through the dictionary as well:

In [None]:
print(boston.target)
print()
print(type(boston.target))
print()
print(len(boston.target))

The target array has only one dimension, lined up in order with the observations in the data array.
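
To illustrate why that alignment matters, here is a small sketch on made-up numbers (the arrays `toy_X` and `toy_y` are hypothetical). If the observations are reordered, the same index has to be applied to both arrays so that each row keeps its own target.

```python
import numpy as np

# Hypothetical toy data: the i-th row of toy_X belongs to the i-th entry of toy_y.
toy_X = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0],
                  [4.0, 40.0]])
toy_y = np.array([100.0, 200.0, 300.0, 400.0])

# X has two dimensions (observations x variables), y has one.
print(toy_X.shape, toy_y.shape)

# Reorder both arrays with the same index so the pairs stay together.
rng = np.random.RandomState(0)
order = rng.permutation(len(toy_y))
shuffled_X, shuffled_y = toy_X[order], toy_y[order]

# In this toy data y is 100 times the first column, so the pairing is easy to check.
print(np.allclose(shuffled_y, 100 * shuffled_X[:, 0]))
```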

Now that we're familiar with the input data, we need to split it up for training and testing:

In [None]:
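# train_test_split lives in sklearn.model_selection (scikit-learn 0.18+);
# the old sklearn.cross_validation module was removed in 0.20.
from sklearn.model_selection import train_test_split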

X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target,
                                                    train_size=0.75, test_size=0.25)

Now we have 75% of the data as training data, and 25% of the data as testing data:

In [None]:
print(len(X_train), len(y_train))
print()
print(len(X_test), len(y_test))
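
Because `train_test_split` shuffles the rows before splitting, each run produces a different split unless a seed is fixed. The sketch below uses synthetic data from `make_regression` (an assumption, so that the example stands alone) and passes `random_state` to make the split repeatable. The `_demo` and `a`/`b` names are only illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 506 observations and 13 variables.
X_demo, y_demo = make_regression(n_samples=506, n_features=13,
                                 noise=10.0, random_state=0)

# test_size=0.25 leaves the remaining 75% for training.
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

Xa_train, Xa_test, ya_train, ya_test = split_a
print(Xa_train.shape, Xa_test.shape)

# The same seed gives the same rows in the same order.
print((split_a[0] == split_b[0]).all())
```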

In scikit-learn, as soon as you have `X_train`, `X_test`, `y_train`, and `y_test`, everything else is just a matter of choosing parameters for whichever model you choose. But this should not be trivialized: selecting a model and that model's parameters is *very* important. While we will not cover it here, you should always select the model and parameters best suited to your data.
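
That shared interface can be sketched as a loop: every regressor exposes the same `fit` and `score` methods, so swapping models only changes the constructor call. The data here is synthetic (`make_regression`), and the three models and their parameters are arbitrary picks for illustration, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data so the sketch runs on its own.
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=5.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# Different models, identical fit/score calls.
candidates = {
    'linear': LinearRegression(),
    'ridge': Ridge(alpha=1.0),
    'knn': KNeighborsRegressor(n_neighbors=5),
}

scores = {}
for name, estimator in candidates.items():
    estimator.fit(Xd_train, yd_train)
    scores[name] = estimator.score(Xd_test, yd_test)
    print(name, round(scores[name], 3))
```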

## Linear Regression

We'll start with a basic linear regression model:

In [None]:
from sklearn import linear_model
lin_reg = linear_model.LinearRegression(fit_intercept=True,
                                        normalize=False,
                                        copy_X=True,
                                        n_jobs=1)

model = lin_reg.fit(X_train, y_train)
print(model.score(X_test, y_test))
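
For regressors, `score` returns the coefficient of determination, R², on the data it is given, so higher is better and 1.0 is a perfect fit. The sketch below checks this on synthetic data (an assumption, so that the example runs on its own) by comparing `score` with `sklearn.metrics.r2_score` and with the formula 1 - SS_res / SS_tot computed by hand.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the names below are illustrative only.
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=20.0, random_state=1)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1)

demo_model = LinearRegression().fit(Xd_train, yd_train)
predictions = demo_model.predict(Xd_test)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((yd_test - predictions) ** 2)
ss_tot = np.sum((yd_test - yd_test.mean()) ** 2)
manual_r2 = 1.0 - ss_res / ss_tot

# All three values should agree.
print(demo_model.score(Xd_test, yd_test))
print(r2_score(yd_test, predictions))
print(manual_r2)
```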

## Ridge Regression

In [None]:
from sklearn import linear_model
ridge_reg = linear_model.Ridge(alpha=1.0,
                               fit_intercept=True,
                               normalize=False,
                               copy_X=True,
                               max_iter=None,
                               tol=0.001, solver='auto',
                               random_state=None)

model = ridge_reg.fit(X_train, y_train)
print(model.score(X_test, y_test))

## Elastic Net Regression

In [None]:
elastic_reg = linear_model.ElasticNet(alpha=1.0,
                                      fit_intercept=True,
                                      normalize=False,
                                      precompute=False,
                                      copy_X=True,
                                      max_iter=1000,
                                      tol=0.0001,
                                      warm_start=False,
                                      positive=False,
                                      random_state=None,
                                      selection='cyclic')

model = elastic_reg.fit(X_train, y_train)
print(model.score(X_test, y_test))

## Support Vector Regression

In [None]:
from sklearn import svm

sv_reg = svm.SVR(kernel='rbf',
                 degree=3,
                 gamma='auto',
                 coef0=0.0,
                 tol=0.001,
                 C=1.0,
                 epsilon=0.1,
                 shrinking=True,
                 cache_size=200,
                 verbose=False, 
                 max_iter=-1)

model = sv_reg.fit(X_train, y_train)
print(model.score(X_test, y_test))

## K-Nearest Neighbors Regression

In [None]:
from sklearn import neighbors

knn_reg = neighbors.KNeighborsRegressor(5, weights='distance')

model = knn_reg.fit(X_train, y_train)
print(model.score(X_test, y_test))

## AdaBoost Regression

In [None]:
from sklearn import ensemble

ab_reg = ensemble.AdaBoostRegressor(base_estimator=None,
                                    n_estimators=50,
                                    learning_rate=1.0,
                                    loss='linear',
                                    random_state=None)


model = ab_reg.fit(X_train, y_train)
print(model.score(X_test, y_test))

## Random Forest Regression

In [None]:
from sklearn import ensemble

rf_reg = ensemble.RandomForestRegressor(n_estimators=10,
                                        criterion='mse',
                                        max_depth=None, 
                                        min_samples_split=2, 
                                        min_samples_leaf=1,
                                        min_weight_fraction_leaf=0.0,
                                        max_features='auto',
                                        max_leaf_nodes=None,
                                        bootstrap=True,
                                        oob_score=False,
                                        n_jobs=1,
                                        random_state=None,
                                        verbose=0,
                                        warm_start=False)

model = rf_reg.fit(X_train, y_train)
print(model.score(X_test, y_test))

## Neural Network Regression

In [None]:
from sklearn import neural_network
nn_reg = neural_network.MLPRegressor(hidden_layer_sizes=(100,),
                                     activation='relu',
                                     solver='adam',
                                     alpha=0.0001,
                                     batch_size='auto',
                                     learning_rate='constant',
                                     learning_rate_init=0.001,
                                     power_t=0.5,
                                     max_iter=200,
                                     shuffle=True,
                                     random_state=40,  # reproducibility
                                     tol=0.0001,
                                     verbose=False,
                                     warm_start=False,
                                     momentum=0.9,
                                     nesterovs_momentum=True,
                                     early_stopping=False,
                                     validation_fraction=0.1,
                                     beta_1=0.9,
                                     beta_2=0.999,
                                     epsilon=1e-08)

model = nn_reg.fit(X_train, y_train)
print(model.score(X_test, y_test))

# TPOT

Selecting a model and parameters can be a difficult task. While you should at least conceptually understand the process and reason for your selection, a new package called `TPOT` uses genetic algorithms to help you select a model and its parameters by just feeding it the `X` and `y`. It will even write the scikit-learn code for you. 

In [None]:
from tpot import TPOTRegressor

tpot = TPOTRegressor(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_boston_pipeline.py')