# Intro to Data Science With scikit-learn

### Overview

Here we're going to walk through training a model and looking at the results. Before we get started, let's go over some terminology...

### Terminology

1.) `features` - Another word for the X variables, independent variables, predictors, regressors...   
2.) `target` - Another word for the Y variable, dependent variable, outcome variable, response...  
3.) `model` - What we use to relate one set of variables to another.      
4.) `training data set` - Refers to the observations from your data set that are used to train/learn the statistical model.   
5.) `testing data set` - Refers to the observations from your data that are **not** used to train/learn the statistical model. They are held out, and not seen by the model during training.  
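
As a minimal sketch of the training/testing distinction, the snippet below shuffles a toy set of row indices and holds some of them out. The helper name `split_indices`, the 20% test fraction, and the seed are illustrative assumptions, not something this walkthrough prescribes.

```python
import random

def split_indices(n_observations, test_fraction=0.2, seed=0):
    """Shuffle row indices and split them into training and testing sets."""
    indices = list(range(n_observations))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = int(round(n_observations * test_fraction))
    test_idx = sorted(indices[:n_test])   # held out, never seen during training
    train_idx = sorted(indices[n_test:])  # used to train/learn the model
    return train_idx, test_idx

train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```

Every observation ends up in exactly one of the two sets, which is the property that makes the testing set an honest check.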

### Scikit-learn import

```python
import sklearn 
```

Typically we're actually going to be importing something from one of the submodules of `sklearn`. The [sklearn main page](http://scikit-learn.org/stable/) can help you determine where you might find something that you are looking for, and the [API reference](http://scikit-learn.org/stable/modules/classes) is also pretty helpful. The large majority of the machine learning algorithms you might run can be found somewhere within `sklearn`. Today we're going to talk through using a `RandomForestRegressor`.

### General workflow

Here are the steps by which we train a model... 

1.) Import whatever model you'll be fitting.  
2.) Instantiate the model (i.e. create a variable that holds your model object). Set any hyperparameters as you see fit (we'll discuss what these are shortly).   
3.) Feed in the X and Y variables (features and target) to the `.fit()` method.   
4.) Call the `.score()` or `.predict()` method to see how well the model does on the training data (or new data). 
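
The four steps above can be sketched end to end as follows. This is only a sketch on synthetic data: the helper `make_data`, the data it generates, and the hyperparameter values are illustrative assumptions and have nothing to do with the forest-fire data used later.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor  # step 1: import the model

rng = np.random.RandomState(0)

def make_data(n):
    """Hypothetical data: the target depends on the first feature plus noise."""
    X = rng.uniform(0, 10, size=(n, 2))
    y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=n)
    return X, y

X, y = make_data(200)          # data the model is fit on
X_new, y_new = make_data(50)   # data the model has not seen

model = RandomForestRegressor(n_estimators=50, random_state=0)  # step 2: instantiate
model.fit(X, y)                                                 # step 3: fit
new_predictions = model.predict(X_new)                          # step 4: predict ...
new_r2 = model.score(X_new, y_new)                              # ... or score
print(new_predictions.shape, round(new_r2, 3))
```

Note that for regressors `.score()` returns R^2 (the coefficient of determination) rather than an error, so higher is better.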

##### What would be another word/term we might use to describe the `new data` from step (4) above?

We'll be working with a `RandomForestRegressor` today, which you can see the documentation for [here](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html). I don't have time today to discuss in detail how the algorithm works, but I will if you take the data science course or ask me after class. It's well explained in *The Elements of Statistical Learning*, which you can download [here](http://statweb.stanford.edu/~tibs/ElemStatLearn/).

In [8]:
import pandas as pd
df = pd.read_csv('data/forestfires.csv') # Get the data.
print(df.shape)
df[100:120]

(517, 13)


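|  | X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 3 | 4 | aug | sun | 91.4 | 142.4 | 601.4 | 10.6 | 19.8 | 39 | 5.4 | 0.0 | 0.0 |
| 101 | 3 | 4 | aug | tue | 88.8 | 147.3 | 614.5 | 9.0 | 14.4 | 66 | 5.4 | 0.0 | 0.0 |
| 102 | 2 | 4 | aug | tue | 94.8 | 108.3 | 647.1 | 17.0 | 20.1 | 40 | 4.0 | 0.0 | 0.0 |
| 103 | 2 | 4 | sep | sat | 92.5 | 121.1 | 674.4 | 8.6 | 24.1 | 29 | 4.5 | 0.0 | 0.0 |
| 104 | 2 | 4 | jan | sat | 82.1 | 3.7 | 9.3 | 2.9 | 5.3 | 78 | 3.1 | 0.0 | 0.0 |
| 105 | 4 | 5 | mar | fri | 85.9 | 19.5 | 57.3 | 2.8 | 12.7 | 52 | 6.3 | 0.0 | 0.0 |
| 106 | 4 | 5 | mar | thu | 91.4 | 30.7 | 74.3 | 7.5 | 18.2 | 29 | 3.1 | 0.0 | 0.0 |
| 107 | 4 | 5 | aug | sun | 90.2 | 99.6 | 631.2 | 6.3 | 21.4 | 33 | 3.1 | 0.0 | 0.0 |
| 108 | 4 | 5 | sep | sat | 92.5 | 88.0 | 698.6 | 7.1 | 20.3 | 45 | 3.1 | 0.0 | 0.0 |
| 109 | 4 | 5 | sep | mon | 88.6 | 91.8 | 709.9 | 7.1 | 17.4 | 56 | 5.4 | 0.0 | 0.0 |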
109,4,5,sep,mon,88.6,91.8,709.9,7.1,17.4,56,5.4,0.0,0.0


In [2]:
from sklearn.ensemble import RandomForestRegressor # Import our model. 
random_forest = RandomForestRegressor(n_estimators=100) # Instantiate it with 100 trees 

Let's create our features (X variables) and target (Y variable). I'm using the forest-fire data, and for now am only going to use the `X` and `Y` columns (which are the spatial coordinates of the fires) for the features, and the `area` column for the target (this is defined as the dependent variable on the UCI website where I got this data). A link to the data and its description can be found [here](https://archive.ics.uci.edu/ml/datasets/Forest+Fires). As you can see, I'm building a model that predicts the burned area of forest fires in northeastern Portugal based on their spatial coordinates.

##### How do I pull the X and Y columns from our df to use as the features? How about the area?

In [3]:
features = df[['X','Y']]
target = df['area']

In [4]:
# Fit/train the model (i.e. build the model based off the training data)
random_forest.fit(features, target) 

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

In [9]:
predictions = random_forest.predict(features) # This gives us back a vector of predictions
                                              # (one for each observation). 

In terms of metrics, the [sklearn.metrics](http://scikit-learn.org/stable/modules/classes#sklearn-metrics-metrics) documentation will give you an idea of the metrics you can use to judge a model. The majority of these take the form of a function call where you pass in `(y_observations, y_predictions)`, in that order, and they return the calculated metric. We'll look at mean squared error below.

In [10]:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(target, predictions) # Observed values first, then predictions.
print(mse)

3911.57239199
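
To make the metric concrete: mean squared error is the average of the squared differences between observed and predicted values. The hand-rolled version below is a sketch for illustration only; the function name and the toy numbers are made up, and in practice you would keep using `sklearn.metrics.mean_squared_error`.

```python
def mean_squared_error_by_hand(y_true, y_pred):
    """Average of squared differences between observations and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)

toy_observed = [0.0, 2.0, 10.0]   # hypothetical burned areas
toy_predicted = [1.0, 2.0, 7.0]   # hypothetical model predictions
toy_mse = mean_squared_error_by_hand(toy_observed, toy_predicted)
print(toy_mse)  # (1 + 0 + 9) / 3
```

Because the errors are squared, a single badly missed observation (the 10.0 predicted as 7.0 here) contributes most of the total.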


Let's see if adding another feature gets us a better fit.

In [11]:
features = df[['X','Y', 'wind']]
target = df['area']
random_forest.fit(features, target)
predictions = random_forest.predict(features)
print(mean_squared_error(target, predictions))

3236.49155097


So, we have now run two models and seen that the second one performed better. We could keep adding variables to our model, checking the MSE after each addition. But we're not actually running our model on any data that we didn't train it on, so how do we know that what we're putting into the model would actually help on data we've never seen? In other words, how do we tell if our model will generalize well? The answer is **cross validation**.

The way that **cross validation** works is that we break our data into `k` folds (typically 5 or 10). We train our model on `k-1` of those folds, and then predict on the remaining fold. We take those predictions, and then compute our scoring metric (`mean squared error`, in our case) from them. We then do this again, and again, and again, until each of the `k` folds has been used for predictions (so with 5 folds, we do this 5 times, and with 10 folds, 10 times, etc.).

![cross-val-image](http://i.stack.imgur.com/1fXzJ.png)

Using **cross validation**, we can get an idea of how our model would perform on data it hasn't seen before, and then when we add variables to our model (or change model hyperparameters), we can be more sure that they were actually worth adding.
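
The fold-by-fold procedure described above can be sketched in plain Python. Everything here is an illustrative assumption: the helper `k_fold_indices`, the toy target values, and the stand-in "model" that simply predicts the mean of its training targets. Real implementations often shuffle the rows first.

```python
def k_fold_indices(n_observations, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n_observations // k + (1 if i < n_observations % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_observations) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop

toy_target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
fold_mses = []
for train_idx, test_idx in k_fold_indices(len(toy_target), k=3):
    # "Train": the stand-in model memorizes the mean of the training targets.
    train_mean = sum(toy_target[i] for i in train_idx) / len(train_idx)
    # "Predict" on the held-out fold and score with mean squared error.
    errors = [(toy_target[i] - train_mean) ** 2 for i in test_idx]
    fold_mses.append(sum(errors) / len(errors))

cv_mse = sum(fold_mses) / len(fold_mses)
print(fold_mses, cv_mse)
```

Each observation is used for prediction exactly once, and the per-fold scores are averaged into a single estimate.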

Best of all, it turns out that sklearn has a library we can use for this! Check out the [cross validation library](http://scikit-learn.org/stable/modules/classes#module-sklearn.cross_validation) (named `sklearn.model_selection` since scikit-learn 0.18) for all the details. Today we'll be looking at the `cross_val_score` function, which allows you to pass in a model, a feature set (X), a target (Y), a number of folds (5 or 10, for example), and a scoring metric (we'll use mean squared error).

In [12]:
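from sklearn.model_selection import cross_val_score # sklearn.cross_validation in releases before 0.18
features = df[['X','Y']]
target = df['area']
# The scorer returns the negated MSE (so that higher is better), so I take the negative below.
# (Releases before 0.18 called this scorer 'mean_squared_error'.)
results = -cross_val_score(random_forest, features, target, cv=20, scoring='neg_mean_squared_error')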
results

array([   423.0957952 ,    158.79785535,    493.16636869,    510.14849252,
           65.56792417,    269.31256367,    330.10878029,    182.42779078,
         1928.7019805 ,  50507.00150498,    115.29716993,    551.15148581,
          412.34427826,    136.63077743,   1215.61687166,  21388.19063418,
         1554.61814158,    478.27897526,   3364.08126135,    359.08456373])

In [13]:
results.mean()

4222.1811607676964
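
One thing worth noticing in the per-fold scores printed above is how uneven they are. The sketch below, using only the standard library, compares the mean with the median; the values are copied from the output above and rounded to two decimals, and the variable names are illustrative.

```python
import statistics

# Per-fold MSE values from the cross_val_score output above (rounded).
fold_mse = [423.10, 158.80, 493.17, 510.15, 65.57, 269.31, 330.11, 182.43,
            1928.70, 50507.00, 115.30, 551.15, 412.34, 136.63, 1215.62,
            21388.19, 1554.62, 478.28, 3364.08, 359.08]

mean_mse = statistics.mean(fold_mse)
median_mse = statistics.median(fold_mse)
# Share of the total error contributed by the two worst folds.
top_two_share = sum(sorted(fold_mse)[-2:]) / sum(fold_mse)
print(round(mean_mse, 2), round(median_mse, 2), round(top_two_share, 2))
```

Two of the twenty folds account for most of the total, so the mean reported by `results.mean()` is sensitive to which observations land in which fold.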

In [14]:
features = df[['X', 'Y', 'wind']]
target = df['area']
results = -cross_val_score(random_forest, features, target, cv=20, scoring='neg_mean_squared_error') # 'mean_squared_error' before 0.18
results.mean()

5146.6889662686499

So it looks like `wind` might not have been as helpful as we thought. Good thing we used cross-validation!

Cross-validation is a crucial part of a data scientist's workflow. We have to make sure that our model will generalize well, and cross-validation is a way to check that we are putting the right variables into our model. It can also be used to tune our model hyperparameters (for a random forest, this might be the number of trees, the depth of each tree, etc.). `sklearn` also has a built-in tool to perform cross-validation over hyperparameters: `GridSearchCV`, located in the `sklearn.model_selection` module (`sklearn.grid_search` in releases before 0.18). As arguments, it takes an estimator/model (such as our Random Forest) and a parameter grid (dictionary). We instantiate it with these, and then we call the `.fit()` method on it, passing it our features and target. Once fitted, it exposes the best parameters for our model (along with the best model and its score) as attributes.
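
To see what a parameter grid expands to, the sketch below enumerates every combination in a grid with `itertools.product`. It is an illustration of the idea only, not how `GridSearchCV` is implemented internally; the grid values match the ones used in this walkthrough, and the variable names are illustrative.

```python
from itertools import product

param_grid = {'n_estimators': [10, 100, 500], 'max_depth': [1, 3, 5]}

# Build one dict of hyperparameters per combination in the grid.
names = sorted(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[name] for name in names))]

print(len(candidates))
for candidate in candidates:
    print(candidate)
```

A grid search then cross-validates a model for each candidate and keeps the one with the best score, so the cost grows with the product of the list lengths (3 x 3 = 9 candidates here, each fit once per fold).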

In [15]:
from sklearn.model_selection import GridSearchCV # sklearn.grid_search in releases before 0.18
from sklearn.ensemble import RandomForestRegressor
random_forest = RandomForestRegressor()
param_grid = {'n_estimators': [10, 100, 500], 'max_depth': [1, 3, 5]}
# 'neg_mean_squared_error' was called 'mean_squared_error' in releases before 0.18.
grid_search_cv = GridSearchCV(random_forest, param_grid, scoring='neg_mean_squared_error')

In [16]:
features = df[['X', 'Y', 'wind']]
target = df['area']
grid_search_cv.fit(features, target)

GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_estimators': [10, 100, 500], 'max_depth': [1, 3, 5]},
       pre_dispatch='2*n_jobs', refit=True, scoring='mean_squared_error',
       verbose=0)

In [17]:
best_model = grid_search_cv.best_estimator_ # Get a copy of the best model. 
best_params = grid_search_cv.best_params_ # Get a dictionary of the best parameters. 
best_score = grid_search_cv.best_score_ # Get the best score of scoring function we passed in.

In [18]:
best_params

{'max_depth': 1, 'n_estimators': 10}

In [19]:
-best_score

4154.9916400182738

In [20]:
features = df[['X', 'Y']]
target = df['area']
grid_search_cv.fit(features, target)

GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_estimators': [10, 100, 500], 'max_depth': [1, 3, 5]},
       pre_dispatch='2*n_jobs', refit=True, scoring='mean_squared_error',
       verbose=0)

In [21]:
best_model = grid_search_cv.best_estimator_ # Get a copy of the best model. 
best_params = grid_search_cv.best_params_ # Get a dictionary of the best parameters. 
best_score = grid_search_cv.best_score_ # Get the best score of scoring function we passed in.

In [22]:
best_params

{'max_depth': 1, 'n_estimators': 10}

In [23]:
-best_score

4159.6908088064329

So there's actually not much of a difference between the performance of the best model, whether or not we include "wind" as a feature. In general (to avoid the risk of overfitting), simpler models should be preferred unless the MSE improves significantly by adding additional features. In this case, adding "wind" lowers the cross-validated MSE only marginally (from about 4159.7 to about 4155.0), so we should stick with the model that only uses the spatial coordinates.
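
Plugging the two best scores reported above into a quick calculation shows how small the gap is. The variable names are illustrative, and what counts as a "significant" improvement is a judgment call rather than a fixed threshold.

```python
mse_without_wind = 4159.6908088064329  # best score with features X, Y
mse_with_wind = 4154.9916400182738     # best score with features X, Y, wind

absolute_change = mse_without_wind - mse_with_wind
relative_change = absolute_change / mse_without_wind
print("absolute change: {:.2f}".format(absolute_change))
print("relative change: {:.2%}".format(relative_change))
```

A change this small is well within the fold-to-fold variation seen earlier, which is why the extra feature is hard to justify.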

With all this being said, we can rework the steps in our general workflow...

1.) Import whatever model you'll be fitting.  
2.) Instantiate the model (i.e. create a variable that holds your model object). Set any hyperparameters as you see fit.   
3.) Feed in the X and Y variables (features and target) to the `.fit()` method.   
4.) Call the `.score()` or `.predict()` method to see how well the model does on the training data (or new data).   
5.) Repeat steps (2) - (4), comparing candidates with cross-validation, to find the best model given your chosen scoring metric.

**Note**: This assumes that all of your feature engineering/variable manipulation is done. Also note that random forests are unusual among machine learning models in that you don't strictly have to cross-validate to estimate how well they generalize (you can use the out-of-bag score instead), which is nice because cross-validation is computationally expensive and requires you to hold out data that could have been used to train the model. This is a discussion for another day though.
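
As a brief sketch of the out-of-bag idea: each tree in the forest is trained on a bootstrap sample, so the rows a tree did not see can be used to score it. The example below uses synthetic data (an assumption, not the forest-fire data) and the `oob_score=True` option of `RandomForestRegressor`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: the target depends on the first feature plus noise.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(300, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=300)

# oob_score=True asks the forest to score itself on the rows each tree left out.
forest = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)

oob_r2 = forest.oob_score_  # R^2 computed from out-of-bag predictions
print(round(oob_r2, 3))
```

For a regressor the out-of-bag score is an R^2 value rather than an MSE, so it is not directly comparable to the cross-validated numbers above.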

# Want some practice?

Sign up for the data science bootcamp to get a more in-depth understanding of various stats/ML models and how and when to use them.