# Data Research

## Intro to Machine Learning

### Exercise 2
Use the dataset to perform a classification prediction using scikit-learn, as demonstrated in the lectures. You should split the data into train/test sets, train the model (output and comment on the scores), cross-validate the model (output and comment on the scores), and predict using the test set (output and comment on the scores and the actual accuracy).  

In [1]:
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

In [11]:
# Load data
%store -r data
aaplData = data

# If the return is positive
aaplData['upside'] = aaplData['1-week MA Daily Return'] > 0
aaplData

Unnamed: 0,Date,High,Low,Open,Close,Volume,Adj Close,Daily Return,1-week MA Volume,1-week MA Daily Return,Ticker,upside
5,2015-11-02,30.340000,29.902500,30.200001,30.295000,128813200.0,27.996040,0.014059,230585360.0,0.010190,AAPL,True
6,2015-11-03,30.872499,30.174999,30.197500,30.642500,182076000.0,28.317165,0.011471,211093040.0,0.013750,AAPL,True
7,2015-11-04,30.955000,30.405001,30.782499,30.500000,179544400.0,28.185480,-0.004650,178560800.0,0.004579,AAPL,True
8,2015-11-05,30.672501,30.045000,30.462500,30.230000,158210800.0,28.055548,-0.008852,169221120.0,0.000696,AAPL,True
9,2015-11-06,30.452499,30.155001,30.277500,30.264999,132169200.0,28.088028,0.001158,156162720.0,0.002637,AAPL,True
...,...,...,...,...,...,...,...,...,...,...,...,...
1254,2020-10-19,98.761624,96.489998,119.959999,115.980003,120639300.0,115.980003,-0.025542,152397020.0,-0.013857,AAPL,False
1255,2020-10-20,98.761624,96.489998,116.199997,117.510002,124423700.0,117.510002,0.013192,124815660.0,-0.005914,AAPL,False
1256,2020-10-21,98.761624,96.489998,116.669998,116.870003,89946000.0,116.870003,-0.005446,112592400.0,-0.007152,AAPL,False
1257,2020-10-22,98.761624,96.489998,117.449997,115.750000,101709700.0,115.750000,-0.009583,110422500.0,-0.008276,AAPL,False


#### Classification

In [12]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator,TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_score, recall_score, roc_curve, roc_auc_score, f1_score

num_cols = ['Daily Return']
cat_cols = []

num_pipeline = Pipeline([
        ('std_scaler', StandardScaler())
    ])

pipeline = ColumnTransformer([
        ('num', num_pipeline, num_cols), 
        ('cat', OneHotEncoder(), cat_cols)
    ])

all_x_cols = num_cols + cat_cols

y_col = ['upside']

#### Supervised Learning

In [13]:
# Split Train set, Test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(aaplData[all_x_cols], aaplData[y_col], test_size=0.33)

In [14]:
# Train!
X_train_xformed = pipeline.fit_transform(X_train)
X_test_xformed = pipeline.transform(X_test)

# Ensemble method: a random forest combines the votes of many decision trees
forest_clf = RandomForestClassifier(random_state=42, n_estimators=50, max_depth=16)

forest_clf = forest_clf.fit(X_train_xformed, y_train.values.ravel())

In [15]:
# Predict!

print('Train Scores\n')
train_pred = forest_clf.predict(X_train_xformed)

print(f'Precision Score: {precision_score(y_train.values.ravel(), train_pred):.2f}')
print(f'Recall Score: {recall_score(y_train.values.ravel(), train_pred):.2f}')
print(f'F1 Score: {f1_score(y_train.values.ravel(), train_pred):.2f}')

print('\nTest Scores\n')
test_pred = forest_clf.predict(X_test_xformed)

print(f'Precision Score: {precision_score(y_test.values.ravel(), test_pred):.2f}')
print(f'Recall Score: {recall_score(y_test.values.ravel(), test_pred):.2f}')
print(f'F1 Score: {f1_score(y_test.values.ravel(), test_pred):.2f}')

Train Scores

Precision Score: 0.98
Recall Score: 0.99
F1 Score: 0.99

Test Scores

Precision Score: 0.63
Recall Score: 0.62
F1 Score: 0.63


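**Remarks:**  
On the test set, 63% of the days predicted as upside were actually upside (precision).  
62% of the actual upside days were predicted as upside (recall).  
The train scores (0.98 to 0.99) are far above the test scores, so the forest is overfitting the training data.  
The Daily Return variable is a passable predictor of whether the 1-week MA Daily Return will be positive or negative.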
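
To make the two definitions concrete, here is a minimal sketch that computes precision, recall and F1 directly from true and predicted labels. The helper name `precision_recall_f1` and the toy labels are made up for illustration; they are not part of the notebook or of scikit-learn.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for boolean labels (True = upside)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # predicted upside, was upside
    fp = sum(1 for t, p in pairs if not t and p)    # predicted upside, was not
    fn = sum(1 for t, p in pairs if t and not p)    # missed an actual upside
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels, invented for illustration (not taken from the AAPL data)
y_true = [True, True, True, False, False, True]
y_pred = [True, False, True, True, False, True]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f'Precision: {precision:.2f}  Recall: {recall:.2f}  F1: {f1:.2f}')
```

Precision divides by the number of predicted positives, recall by the number of actual positives, which is why the two can differ on the same predictions.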

In [24]:
# Cross validation on the full dataset
# Note: the scaler inside `pipeline` was fitted on the training split only
X_all = pipeline.transform(aaplData[all_x_cols])
y_all = aaplData[y_col].values.ravel()

# Out-of-fold class predictions from 10-fold cross validation
cv_pred = cross_val_predict(forest_clf, X_all, y_all, cv=10)

print(f'Precision Score: {precision_score(y_all, cv_pred):.2f}')
print(f'Recall Score: {recall_score(y_all, cv_pred):.2f}')

# Out-of-fold class probabilities (note: 5 folds here, 10 folds above)
y_scores = cross_val_predict(forest_clf, X_all, y_all, cv=5, method='predict_proba')
# Area under the ROC curve, using the probability of the positive class
print(f'ROC AUC Score: {roc_auc_score(y_all, y_scores[:, 1]):.2f}')


Precision Score: 0.67
Recall Score: 0.71
ROC AUC Score: 0.64


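**Remarks:**  
Under cross validation, 67% of the days predicted as upside were actually upside (precision).  
71% of the actual upside days were predicted as upside (recall).  
A ROC AUC of 0.64 is better than the 0.5 of random guessing, but well short of a strong classifier.  
The Daily Return variable is a passable predictor of whether the 1-week MA Daily Return will be positive or negative.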
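
One way to read the ROC AUC figure: it is the probability that a randomly chosen upside day receives a higher predicted score than a randomly chosen non-upside day, with ties counted as half. The sketch below computes it that way by comparing every positive/negative pair. The function name `roc_auc_pairwise` and the toy scores are assumptions for illustration only; the notebook itself uses scikit-learn's `roc_auc_score`.

```python
def roc_auc_pairwise(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t]
    neg = [s for t, s in zip(y_true, scores) if not t]
    if not pos or not neg:
        raise ValueError('need at least one positive and one negative label')
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Toy labels and scores, invented for illustration
y_true = [True, False, True, False, True]
scores = [0.9, 0.4, 0.35, 0.2, 0.8]

auc = roc_auc_pairwise(y_true, scores)
print(f'ROC AUC: {auc:.2f}')
```

A model that scores every day identically gets 0.5 under this definition, which is why 0.5 is the usual reference point for "no better than chance".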

In [27]:
import pandas as pd

predictData = pd.DataFrame({'Daily Return':[-0.025542]})  # Daily Return of 2020-10-19 (row 1254), a value already in the dataset
display(predictData)

print(f'Upside: {bool(forest_clf.predict(pipeline.transform(predictData))[0])}')

Unnamed: 0,Daily Return
0,-0.025542


Upside: False


**Remarks:**  
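The prediction matches the actual `upside` value (False) for 2020-10-19, the row this Daily Return comes from. That row is part of the dataset, so this single match is a sanity check rather than a measure of out-of-sample accuracy.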
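
The exercise also asks for the actual accuracy, which a single prediction cannot show. A minimal sketch of how accuracy could be measured over a whole held-out set and compared with a majority-class baseline follows. The helper names and the toy labels are hypothetical and not taken from the notebook.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most common label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Toy held-out labels and predictions, invented for illustration
y_true = [True, True, False, True, False, True, True, False]
y_pred = [True, False, False, True, True, True, True, False]

model_acc = accuracy(y_true, y_pred)
baseline_acc = majority_baseline_accuracy(y_true)
print(f'Model accuracy: {model_acc:.3f}')
print(f'Majority-class baseline: {baseline_acc:.3f}')
```

In the notebook, `accuracy_score(y_test.values.ravel(), test_pred)` from `sklearn.metrics` would give the model figure; the baseline shows how much of that figure comes from class imbalance alone.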

In [25]:
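importances = forest_clf.feature_importances_
# Label the importances with the columns the model was trained on, not every DataFrame column
feature_names = all_x_cols
print(dict(zip(feature_names, importances)))

{'Daily Return': 1.0}


**Remarks:**  
Labelling the importances with `aaplData.columns` would pair the single importance value with the first column, `Date`, because `zip` stops at the shorter input. The model uses one feature, Daily Return, so its importance is trivially 1.0.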
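
The `Date` label can be reproduced without the model, because `zip` stops at its shortest argument. A minimal sketch, using an abbreviated column list as a stand-in for `aaplData.columns`:

```python
# Abbreviated stand-in for aaplData.columns (illustrative, not the full list)
columns = ['Date', 'High', 'Low', 'Daily Return']
# One importance value, because the model was trained on a single feature
importances = [1.0]

# zip truncates to the shorter input, so the value lands on the first column
mislabeled = dict(zip(columns, importances))
print(mislabeled)   # {'Date': 1.0}

# Using the model's own feature list gives the intended label
model_features = ['Daily Return']
labeled = dict(zip(model_features, importances))
print(labeled)      # {'Daily Return': 1.0}
```

The number of importance values always equals the number of features the model was fitted on, so the names should come from the same list that selected the training columns (`all_x_cols` in this notebook).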