# Predict Economic Regimes

In this notebook, we use the macro indicators fetched by the code in the 'FRED_01_data_preparation' notebook. Remember that we saved that data as a CSV file. We then use scikit-learn to see how machine learning can be used to predict recessions.

In [23]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import getdata as gd

import warnings
warnings.filterwarnings("ignore")

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Train data

We first load the FRED data and set the independent and dependent variables.

Then we use SVC, DecisionTreeClassifier, RandomForestClassifier, GradientBoostingClassifier and LogisticRegression as classifiers to find out which method has the highest cross-validated score.

In [16]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

In [22]:
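# Load the prepared FRED data, using the 'Date' column as a datetime index.
df = pd.read_csv('data/fred_data.csv', index_col='Date', parse_dates=['Date'])

# Every column except the last is a feature; this assumes 'Regime' is the last column.
independent_variables = df.columns[:-1]
X = df[independent_variables]
y = df['Regime']

# Random split with the default test_size (25% of the rows are held out).
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)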
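
The split above shuffles the rows at random. Because the rows are dated observations and recessions are the minority class, it may be worth comparing it with a stratified split or a chronological split. The following is a minimal sketch on synthetic data; the frame `toy`, the feature names `x1` and `x2`, and the 75% cutoff are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the FRED frame: monthly dates, two made-up
# features, and a binary 'Regime' label with roughly 15% positives.
rng = np.random.default_rng(0)
dates = pd.date_range("1960-01-01", periods=400, freq="MS")
toy = pd.DataFrame(
    {"x1": rng.normal(size=400), "x2": rng.normal(size=400)},
    index=dates,
)
toy["Regime"] = (rng.random(400) < 0.15).astype(float)

X_toy, y_toy = toy.drop(columns="Regime"), toy["Regime"]

# Option 1: random split that preserves the class ratio in both parts.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=42
)

# Option 2: chronological split, training on the earliest 75% of rows
# so the test period lies strictly after the training period.
cutoff = int(len(toy) * 0.75)
Xc_train, Xc_test = X_toy.iloc[:cutoff], X_toy.iloc[cutoff:]
yc_train, yc_test = y_toy.iloc[:cutoff], y_toy.iloc[cutoff:]

print(ys_train.mean(), ys_test.mean())
print(Xc_train.index.max(), Xc_test.index.min())
```

A chronological split usually gives a more conservative estimate for forecasting tasks, since the model never sees observations dated after the test period begins.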

In [24]:
%%time
pipe = Pipeline(steps=[('clf', SVC())])

params_grid = [{
                'clf':[SVC()],
                'clf__C': [1, 10],
                'clf__gamma': ['scale', 'auto'],
                },
                {
                'clf': [DecisionTreeClassifier()],
                'clf__max_depth': [3, 5, 10, 20],
                'clf__splitter': ['best', 'random'],
                'clf__min_samples_split': [2, 3, 5],
                },
                {
                'clf': [RandomForestClassifier()],
                'clf__max_depth': [3, 5, 10, 20],
                'clf__n_estimators': [100,200,400],
                },
                {
                'clf': [GradientBoostingClassifier()],
                'clf__max_depth': [3, 5, 10, 20],
                'clf__min_samples_split': [2, 3, 5],
                'clf__n_estimators': [100,200,400],
                },
                {
                'clf': [LogisticRegression()],
                'clf__solver': ['saga'],
                'clf__max_iter': [1000],                    
                }                
              ]

grid = GridSearchCV(pipe, params_grid, cv=5)
clf = grid.fit(X_train, y_train)

CPU times: user 19min 30s, sys: 5.45 s, total: 19min 35s
Wall time: 35min 55s
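
The pipeline above passes the raw indicators straight to each classifier. SVC and LogisticRegression with the `saga` solver are sensitive to feature scale, so one possible variation is to add a `StandardScaler` step inside the pipeline. The sketch below uses synthetic data from `make_classification` and a deliberately small grid; the step name `scale`, the variable names and the grid values are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the macro indicators.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.85, 0.15], random_state=0
)

# Scaling lives inside the pipeline, so each CV fold fits the scaler
# on its own training portion only.
scaled_pipe = Pipeline(steps=[("scale", StandardScaler()), ("clf", SVC())])

# Deliberately small grid so the sketch runs in seconds.
scaled_grid = [
    {"clf": [SVC()], "clf__C": [1, 10], "clf__gamma": ["scale", "auto"]},
    {"clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.1, 1]},
]

search = GridSearchCV(scaled_pipe, scaled_grid, cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_["clf"], round(search.best_score_, 3))
```

If the search is slow, `GridSearchCV` also accepts `n_jobs=-1` to run the fits in parallel on all available cores.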


Let's see how each classifier did on the training data in cross-validation.

In [72]:
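# Sort by rank and drop the first four (timing) columns of cv_results_.
res = pd.DataFrame(clf.cv_results_).sort_values(by=['rank_test_score']).iloc[:,4:]
# Convert the estimator objects to their string repr so they can be grouped by name.
res['param_clf'] = res['param_clf'].astype(str)
# Mean CV score per classifier, averaged across all of its parameter settings.
res.groupby('param_clf')['mean_test_score'].mean().sort_values(ascending=False)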

param_clf
RandomForestClassifier(max_depth=10)    0.965111
GradientBoostingClassifier()            0.964424
LogisticRegression()                    0.953872
DecisionTreeClassifier()                0.948205
SVC()                                   0.908217
Name: mean_test_score, dtype: float64
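
The summary above averages `mean_test_score` over every parameter setting tried for a classifier, so a classifier with a few poor settings can rank below one whose settings all score about the same. A minimal sketch of reporting the mean, the best score and the number of settings side by side follows; the frame `toy_res` and its values are made up for illustration and are not results from this notebook.

```python
import pandas as pd

# Toy stand-in for `res`: two classifiers, a few parameter settings each.
toy_res = pd.DataFrame(
    {
        "param_clf": ["A()", "A()", "A()", "B()", "B()"],
        "mean_test_score": [0.97, 0.90, 0.89, 0.95, 0.94],
    }
)

# Mean, best score and number of settings per classifier.
summary = (
    toy_res.groupby("param_clf")["mean_test_score"]
    .agg(["mean", "max", "count"])
    .sort_values("max", ascending=False)
)
print(summary)
```

In this toy case `A()` has the best single setting while `B()` has the higher average, which is why the two views can rank classifiers differently.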

All five classifiers did well, each with a mean cross-validated accuracy above 0.9. Note that each figure is averaged over every parameter setting tried for that classifier.
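
To put an accuracy above 0.9 in context, it helps to compare it with a model that always predicts the majority class. The arithmetic below uses the class counts from the `support` column of the test-set reports in this notebook (164 and 24). Applying the test-set ratio to the cross-validation folds is an assumption, since the fold composition is not shown.

```python
# Class counts taken from the `support` column of the test-set reports
# in this notebook: 164 rows of class 0.0 and 24 rows of class 1.0.
n_negative = 164
n_positive = 24

# Accuracy of a model that always predicts the majority class (0.0).
baseline_accuracy = n_negative / (n_negative + n_positive)
print(round(baseline_accuracy, 4))  # 0.8723
```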

Let's check the top 5 parameter settings.

In [73]:
res.head()

| | param_clf | param_clf__C | param_clf__gamma | param_clf__max_depth | param_clf__min_samples_split | param_clf__splitter | param_clf__n_estimators | param_clf__max_iter | param_clf__solver | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 41 | GradientBoostingClassifier() | | | 3 | 2.0 | | 200 | | | {'clf': GradientBoostingClassifier(), 'clf__ma... | 0.99115 | 0.946903 | 0.982301 | 0.973451 | 0.964286 | 0.971618 | 0.015255 | 1 |
| 34 | RandomForestClassifier(max_depth=10) | | | 10 | | | 100 | | | {'clf': RandomForestClassifier(max_depth=10), ... | 0.973451 | 0.973451 | 0.964602 | 0.982301 | 0.964286 | 0.971618 | 0.006691 | 1 |
| 47 | GradientBoostingClassifier() | | | 3 | 5.0 | | 200 | | | {'clf': GradientBoostingClassifier(), 'clf__ma... | 0.99115 | 0.946903 | 0.982301 | 0.982301 | 0.955357 | 0.971602 | 0.017234 | 3 |
| 46 | GradientBoostingClassifier() | | | 3 | 5.0 | | 100 | | | {'clf': GradientBoostingClassifier(), 'clf__ma... | 0.99115 | 0.946903 | 0.982301 | 0.982301 | 0.955357 | 0.971602 | 0.017234 | 3 |
| 42 | GradientBoostingClassifier() | | | 3 | 2.0 | | 400 | | | {'clf': GradientBoostingClassifier(), 'clf__ma... | 0.99115 | 0.955752 | 0.982301 | 0.973451 | 0.955357 | 0.971602 | 0.014249 | 3 |


## Test data

Let's take these 5 classifier and parameter combinations, refit them on the training data, and evaluate them on the test data.

In [82]:
models = [
    GradientBoostingClassifier(min_samples_split=2, max_depth=3, n_estimators=200),
    RandomForestClassifier(max_depth=10, n_estimators=100),
    GradientBoostingClassifier(min_samples_split=5, max_depth=3, n_estimators=200),
    GradientBoostingClassifier(min_samples_split=5, max_depth=3, n_estimators=100),
    GradientBoostingClassifier(min_samples_split=2, max_depth=3, n_estimators=400),    
]

for model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print('Test dataset scores using {}'.format(model))
    print(metrics.classification_report(y_test, y_pred))

Test dataset scores using GradientBoostingClassifier(n_estimators=200)
              precision    recall  f1-score   support

         0.0       0.97      0.98      0.98       164
         1.0       0.86      0.79      0.83        24

    accuracy                           0.96       188
   macro avg       0.92      0.89      0.90       188
weighted avg       0.96      0.96      0.96       188

Test dataset scores using RandomForestClassifier(max_depth=10)
              precision    recall  f1-score   support

         0.0       0.96      0.99      0.97       164
         1.0       0.89      0.71      0.79        24

    accuracy                           0.95       188
   macro avg       0.93      0.85      0.88       188
weighted avg       0.95      0.95      0.95       188

Test dataset scores using GradientBoostingClassifier(min_samples_split=5, n_estimators=200)
              precision    recall  f1-score   support

         0.0       0.97      0.98      0.98       164
         1.

The reports shown above give an accuracy of about 0.95 to 0.96 on the test dataset. For the recession class (1.0), recall is lower, at 0.71 to 0.79.
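
As a worked check of how the report's columns relate, the sketch below recomputes precision, recall, F1 and accuracy from confusion-matrix counts. The counts are a reconstruction inferred from the rounded figures of the first GradientBoostingClassifier report and its supports (164 and 24); the notebook does not print them.

```python
# Counts inferred from the rounded GradientBoostingClassifier report
# (support 164 / 24); they are a reconstruction, not notebook output.
tp, fn = 19, 5    # recessions caught / missed (19 + 5 = 24)
fp, tn = 3, 161   # false alarms / correct non-recessions (3 + 161 = 164)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
# 0.86 0.79 0.83 0.96
```

Under this reconstruction, 5 of the 24 recession rows are missed even though overall accuracy is 0.96, which is why the per-class rows of the report are worth reading alongside the accuracy line.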

Let's repeat the test-set evaluation, this time with default settings for each classifier.

In [76]:
d_models = [
    RandomForestClassifier(),    
    GradientBoostingClassifier(),            
    LogisticRegression(),                    
    DecisionTreeClassifier(),                
    SVC()                                   
]

for d_model in d_models:
    d_model.fit(X_train, y_train)
    y_pred_d = d_model.predict(X_test)
    print('Test dataset scores using {}'.format(d_model))
    print(metrics.classification_report(y_test, y_pred_d))

Test dataset scores using RandomForestClassifier()
              precision    recall  f1-score   support

         0.0       0.95      0.99      0.97       164
         1.0       0.89      0.67      0.76        24

    accuracy                           0.95       188
   macro avg       0.92      0.83      0.87       188
weighted avg       0.94      0.95      0.94       188

Test dataset scores using GradientBoostingClassifier()
              precision    recall  f1-score   support

         0.0       0.97      0.98      0.98       164
         1.0       0.86      0.79      0.83        24

    accuracy                           0.96       188
   macro avg       0.92      0.89      0.90       188
weighted avg       0.96      0.96      0.96       188

Test dataset scores using LogisticRegression()
              precision    recall  f1-score   support

         0.0       0.95      0.99      0.97       164
         1.0       0.94      0.62      0.75        24

    accuracy                 

The classifiers with default settings yield accuracy scores similar to those of the tuned classifiers, with the exception of SVC. The tuned classifiers do better in terms of F1.
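
Since the comparison here rests on F1 for the recession class, one option is to score models on F1 directly during cross-validation instead of the default accuracy. This is a minimal sketch on synthetic data; the data, the two candidate models and their settings are assumptions for illustration, not the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the macro indicators (about 15% positives).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.85, 0.15], random_state=0
)

# Stratified folds keep the positive-class share similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# scoring="f1" reports F1 for the positive class (label 1).
f1_by_model = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=cv, scoring="f1")
    f1_by_model[name] = scores.mean()
    print(name, round(scores.mean(), 3))
```

The same `scoring="f1"` argument can be passed to `GridSearchCV`, so that parameter settings are ranked by F1 as well.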
