From our previous explorations we saw that, as expected, prediction quality drops as the anticipation period gets longer. Nevertheless, at 20 and 30 minutes the results are still decent, and these lead times are long enough to be useful (for instance, allowing commuters to consider alternative routes to and from work).

# 30 minutes of anticipation

Here we try to improve, as much as possible, the quality of predictions with 30 minutes of anticipation for Q2, possibly by adding some additional datasets. Later we will see how much of this carries over to other quarters.

In [4]:
%matplotlib inline

from sklearn.cross_validation import StratifiedKFold
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import pickle   

n470_hects = range(53,70)

ngbr_hects = range(33,90)

rf = pd.read_csv('a13/Rechts_Flow_2011.csv', header=None)
rs = pd.read_csv('a13/Rechts_Speed_2011.csv', header=None)

rs_intersection = rs.iloc[:,n470_hects]

inters_means = rs_intersection.mean(axis=1)

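jam_threshold = 95  # a mean speed below this value counts as congestion

mins_diff = 30  # anticipation period, used as a row offset
# label for row t: is the intersection congested at row t + mins_diff?
y = (inters_means < jam_threshold)[mins_diff:]

# features: flow and speed of the neighbouring columns, observed at row t
df = pd.concat([rf.iloc[:,ngbr_hects], rs.iloc[:,ngbr_hects]],axis=1)
df = df.iloc[:-mins_diff,:]  # drop the last rows, which have no label

quarter_size = df.shape[0]//4

# keep only the second quarter of the data (Q2)
df = df.iloc[quarter_size:2*quarter_size,:]
y = y[quarter_size:2*quarter_size]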


def train_test_split(df, y, test_size=0.3):
    cut = int(len(y) * test_size)

    X_train = df.iloc[:-cut,:]
    X_test = df.iloc[-cut:,:]

    y_train = y[:-cut]
    y_test = y[-cut:]

    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test =  train_test_split(df, y)
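
The pairing of features and labels in the cell above is easy to get wrong by one slice, so here is a minimal sketch of the same alignment on a toy series. The speed values and the 3-row lead are made up for illustration; only the slicing pattern mirrors the notebook.

```python
import pandas as pd

# Toy mean speeds, one row per time step (values invented for illustration).
speeds = pd.Series([120, 118, 110, 90, 80, 85, 100, 115])
jam_threshold = 95
lead = 3  # hypothetical anticipation period, in rows

is_jam = speeds < jam_threshold

# Features observed at row t ...
X_toy = speeds.iloc[:-lead].reset_index(drop=True)
# ... are paired with the congestion label at row t + lead.
y_toy = is_jam.iloc[lead:].reset_index(drop=True)

pairs = pd.DataFrame({"speed_now": X_toy, "jam_after_lead": y_toy})
print(pairs)
```

The first three rows still show free-flowing speeds but are labelled as congested, because the drop below the threshold happens three rows later. The custom `train_test_split` then cuts this paired data chronologically, so the test set is always the most recent part of the quarter.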

We now use the GridSearchCV class from sklearn to find the best combination of parameters for our problem, using 3-fold cross-validation for each candidate combination.

In [10]:
params = {"max_depth": [5, 10, None],
          "max_features": [50, 100],
          "min_samples_split": [10, 20],
          "min_samples_leaf": [3],
          "criterion": ["gini", "entropy"],
          "n_estimators": [10, 20]
         }

clf = GridSearchCV(
    RandomForestClassifier(),  
    param_grid=params,  # parameters to tune via cross validation
    refit=True,  # fit using all data, on the best detected classifier
    n_jobs=-1,  # number of cores to use for parallelization; -1 for "all cores"
    scoring='f1',  # what score are we optimizing?
    cv=StratifiedKFold(y_train, n_folds=3),  # what type of cross validation to use
)

clf.fit(X_train, y_train)

GridSearchCV(cv=sklearn.cross_validation.StratifiedKFold(labels=[False False ..., False False], n_folds=3, shuffle=False, random_state=None),
       error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'min_samples_split': [10, 20], 'max_depth': [5, 10, None], 'max_features': [50, 100], 'n_estimators': [10, 20], 'criterion': ['gini', 'entropy'], 'min_samples_leaf': [3]},
       pre_dispatch='2*n_jobs', refit=True, scoring='f1', verbose=0)
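
The imports used in this notebook (`sklearn.cross_validation`, `sklearn.grid_search`) come from an older scikit-learn release. Later releases provide the same tools in `sklearn.model_selection`, where `StratifiedKFold` takes `n_splits` and no longer receives the labels in its constructor. A minimal sketch of the equivalent search follows, run on synthetic data with a reduced, hypothetical grid (`X_demo`, `y_demo` and `demo_params` are not part of the original analysis).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced binary problem standing in for the traffic data.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(300, 8))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=300)) > 0.8

# Reduced grid so the sketch runs quickly; the notebook's grid is larger.
demo_params = {
    "max_depth": [5, None],
    "min_samples_leaf": [3],
    "n_estimators": [10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=demo_params,
    scoring="f1",                    # same metric as in the notebook
    cv=StratifiedKFold(n_splits=3),  # labels are passed to fit(), not here
    refit=True,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```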

Let's take a look at the parameters of the best-performing model:

In [11]:
clf.best_params_

{'criterion': 'entropy',
 'max_depth': 5,
 'max_features': 50,
 'min_samples_leaf': 3,
 'min_samples_split': 10,
 'n_estimators': 10}
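
To put this result in context, it helps to know how many candidates the search compared. The sketch below enumerates the grid defined above with the standard library only; the name `grid` is used instead of `params` to avoid overwriting the notebook's variable.

```python
from itertools import product

# Same values as the `params` grid defined earlier.
grid = {
    "max_depth": [5, 10, None],
    "max_features": [50, 100],
    "min_samples_split": [10, 20],
    "min_samples_leaf": [3],
    "criterion": ["gini", "entropy"],
    "n_estimators": [10, 20],
}
n_folds = 3

# One candidate per combination of parameter values.
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)  # 48 144
```

With `refit=True`, one more fit on the full training set follows these 144 cross-validation fits.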

In [13]:
y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))

             precision    recall  f1-score   support

      False       0.93      0.96      0.95     31370
       True       0.82      0.73      0.77      8047

avg / total       0.91      0.91      0.91     39417



Relative to the model we had before, this is a 3-percentage-point improvement in congestion recall (0.70 to 0.73), at the cost of only 1 point of precision (0.83 to 0.82):

```
             precision    recall  f1-score   support

      False       0.93      0.96      0.95     31372
       True       0.83      0.70      0.76      8047

avg / total       0.91      0.91      0.91     39419
```
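
Since recall went up and precision went down, the F1 score is the natural single number to compare. As a rough check, it can be recomputed from the precision and recall of the congested class in the two reports. The inputs are already rounded to two decimals, so the results are approximate.

```python
def f1_from(precision, recall):
    """F1 score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Congested class (True), 30 minutes of anticipation.
previous_f1 = f1_from(0.83, 0.70)
tuned_f1 = f1_from(0.82, 0.73)

print(round(previous_f1, 2), round(tuned_f1, 2))  # 0.76 0.77
```

Both values agree with the f1-score column of the reports.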

We can improve these results further by adding other datasets and doing some feature engineering, but that will be the subject of future explorations.

# 20 minutes of anticipation

Now let us see whether we can improve the results for 20 minutes of anticipation.

These were the results for the previous model:

```
             precision    recall  f1-score   support

      False       0.94      0.97      0.96     31372
       True       0.87      0.77      0.82      8047

avg / total       0.93      0.93      0.93     39419
```

In [14]:
mins_diff = 20
y = (inters_means < jam_threshold)[mins_diff:]

df = pd.concat([rf.iloc[:,ngbr_hects], rs.iloc[:,ngbr_hects]],axis=1)
df = df.iloc[:-mins_diff,:]

quarter_size = df.shape[0]//4

df = df.iloc[quarter_size:2*quarter_size,:]
y = y[quarter_size:2*quarter_size]

X_train, X_test, y_train, y_test =  train_test_split(df, y)
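
The cell above repeats the 30-minute preparation with a different `mins_diff`. One possible way to avoid the duplication is a small helper; `build_dataset` and its parameters are hypothetical names, and the sketch is exercised on toy frames rather than on the real A13 files.

```python
import pandas as pd

def build_dataset(flow, speed, target_cols, feature_cols, mins_diff,
                  jam_threshold=95, quarter=1):
    """Return (X, y) for one quarter, labels taken mins_diff rows ahead."""
    target_cols, feature_cols = list(target_cols), list(feature_cols)
    jam = speed.iloc[:, target_cols].mean(axis=1) < jam_threshold
    y = jam.iloc[mins_diff:].reset_index(drop=True)
    X = pd.concat([flow.iloc[:, feature_cols],
                   speed.iloc[:, feature_cols]], axis=1)
    X = X.iloc[:-mins_diff, :].reset_index(drop=True)
    quarter_size = X.shape[0] // 4
    rows = slice(quarter * quarter_size, (quarter + 1) * quarter_size)
    return X.iloc[rows, :], y.iloc[rows]

# Toy data: 40 rows, 4 columns; speed drops from 100 to 80 at row 20.
toy_flow = pd.DataFrame({c: range(40) for c in range(4)})
toy_speed = pd.DataFrame({c: [100] * 20 + [80] * 20 for c in range(4)})

X_q2, y_q2 = build_dataset(toy_flow, toy_speed, target_cols=range(1, 3),
                           feature_cols=range(1, 3), mins_diff=4)
print(X_q2.shape, int(y_q2.sum()))
```

With the real data, the call would take `rf`, `rs`, `n470_hects`, `ngbr_hects` and the desired `mins_diff`, and `quarter=1` selects Q2 as in the cells above.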

In [15]:
clf = GridSearchCV(
    RandomForestClassifier(),  
    param_grid=params,  # parameters to tune via cross validation
    refit=True,  # fit using all data, on the best detected classifier
    n_jobs=-1,  # number of cores to use for parallelization; -1 for "all cores"
    scoring='f1',  # what score are we optimizing?
    cv=StratifiedKFold(y_train, n_folds=3),  # what type of cross validation to use
)

clf.fit(X_train, y_train)

GridSearchCV(cv=sklearn.cross_validation.StratifiedKFold(labels=[False False ..., False False], n_folds=3, shuffle=False, random_state=None),
       error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'min_samples_split': [10, 20], 'max_depth': [5, 10, None], 'max_features': [50, 100], 'n_estimators': [10, 20], 'criterion': ['gini', 'entropy'], 'min_samples_leaf': [3]},
       pre_dispatch='2*n_jobs', refit=True, scoring='f1', verbose=0)

In [17]:
y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))

             precision    recall  f1-score   support

      False       0.95      0.97      0.96     31371
       True       0.87      0.78      0.82      8047

avg / total       0.93      0.93      0.93     39418



This time we got a 1-percentage-point improvement in congestion recall (0.77 to 0.78) without sacrificing any other metric.
Not a huge improvement here either.
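
To see both horizons side by side, the changes can be expressed in percentage points. This is a small sketch using only the congested-class figures copied from the reports above; the dictionary layout and the helper name are illustrative choices.

```python
# (precision, recall, f1) for the congested class, copied from the reports.
reports = {
    30: {"previous": (0.83, 0.70, 0.76), "tuned": (0.82, 0.73, 0.77)},
    20: {"previous": (0.87, 0.77, 0.82), "tuned": (0.87, 0.78, 0.82)},
}
metrics = ("precision", "recall", "f1")

def point_deltas(previous, tuned):
    """Change per metric, in percentage points (tuned minus previous)."""
    return {m: round((t - p) * 100)
            for m, p, t in zip(metrics, previous, tuned)}

deltas = {mins: point_deltas(r["previous"], r["tuned"])
          for mins, r in reports.items()}

for mins, d in deltas.items():
    print(mins, d)
```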