# Stock Trend Predictor
__Juan Javier Arosemena__
## Model Training
The modeling phase of the machine learning workflow consists of defining the models to be trained, training them, and then testing each model's accuracy.

I am using Amazon SageMaker for the training, testing, and deployment of the defined models.

In [2]:
import os
from os.path import isfile, join
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import statistics
from statistics import mean, mode, median, stdev
import sagemaker
import boto3
from sagemaker.pytorch import PyTorch

In [7]:
time_range = [1, 5, 10, 20, 90, 270]
time_words = {1:'day', 5:'week', 10:'two_weeks', 20:'month', 90:'four_months', 270:'year'}
periods = [5, 10, 20, 90, 270]
data_dir = 'data/'
final_data_dir = join(data_dir, 'final/')

prefix = 'stock-trend-predictor'
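
The training and testing cells assume one sub-directory per prediction horizon and one folder per `(n1, n2)` pair under `data/final/`. As a minimal sketch of that layout, the hypothetical helper `dataset_path` (not part of the original notebook) builds the relative path the same way those cells do:

```python
# Mapping copied from the configuration cell above.
time_words = {1: 'day', 5: 'week', 10: 'two_weeks', 20: 'month', 90: 'four_months', 270: 'year'}

def dataset_path(t, n1, n2, split='train'):
    """Hypothetical helper: relative path of one dataset file for horizon t and periods n1, n2."""
    return f'{time_words[t]}/{n1}_{n2}/{split}.csv'

print(dataset_path(5, 5, 10))             # week/5_10/train.csv
print(dataset_path(270, 90, 20, 'test'))  # year/90_20/test.csv
```

The same relative path can be appended to the S3 URI returned by the upload step or joined to the local `final_data_dir`.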

In [4]:
# session and role
boto_session = boto3.Session(
    aws_access_key_id=os.environ['AWS_ACCESS_KEY'],
    aws_secret_access_key=os.environ['AWS_SECRET_KEY'],
    region_name='us-east-2'
)
sagemaker_session = sagemaker.Session(boto_session=boto_session)
role = os.environ['AWS_SAGEMAKER_ROLE']

# use the session's default S3 bucket (created if it does not exist yet)
bucket = sagemaker_session.default_bucket()

In [4]:
# upload the final datasets to S3; `data` holds the S3 URI of the uploaded directory
data = sagemaker_session.upload_data(final_data_dir, bucket=bucket, key_prefix=prefix)

In [5]:
estimators = []
for t in time_range:
    for n1 in periods:
        for n2 in periods:
            estimator = PyTorch(entry_point='train.py',
                                source_dir='source',
                                framework_version='1.0',
                                role=role,
                                train_instance_count=1,
                                train_instance_type='ml.p2.xlarge',
                                sagemaker_session=sagemaker_session)
            estimator.fit({'training': data+'/'+time_words[t]+f'/{n1}_{n2}/train.csv'}, wait=False)
            estimators.append(estimator)

In [13]:
for estimator in estimators:
    sagemaker_session.wait_for_job(estimator.latest_training_job.job_name)


2019-10-18 22:13:36 Starting - Preparing the instances for training
2019-10-18 22:13:36 Downloading - Downloading input data
2019-10-18 22:13:36 Training - Training image download completed. Training in progress.
2019-10-18 22:13:36 Uploading - Uploading generated training model
2019-10-18 22:13:36 Completed - Training job completed

2019-10-18 22:13:35 Starting - Preparing the instances for training
2019-10-18 22:13:35 Downloading - Downloading input data
2019-10-18 22:13:35 Training - Training image download completed. Training in progress.
2019-10-18 22:13:35 Uploading - Uploading generated training model
2019-10-18 22:13:35 Completed - Training job completed

2019-10-18 22:13:29 Starting - Preparing the instances for training
2019-10-18 22:13:29 Downloading - Downloading input data
2019-10-18 22:13:29 Training - Training image download completed. Training in progress.
2019-10-18 22:13:29 Uploading - Uploading generated training model
2019-10-18 22:13:29 Completed - Training job co

In [14]:
predictors = []
for estimator in estimators:
    predictor = estimator.deploy(1, 'ml.m4.xlarge', wait=False)
    predictors.append(predictor)

In [17]:
for predictor in predictors:
    sagemaker_session.wait_for_endpoint(predictor.endpoint)

---------------------------------------------------------------------!!!-!!!

In [9]:
period_combinations = []
for t in time_range:
    for n1 in periods:
        for n2 in periods:
            period_combinations.append((t, n1, n2))
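
The nested loops above can also be written with `itertools.product`. It yields the tuples in the same order as the loops that created the estimators, so pairing `period_combinations` with `predictors` through `zip` still matches each model with its parameters. A minimal equivalent sketch:

```python
from itertools import product

# Values copied from the configuration cell.
time_range = [1, 5, 10, 20, 90, 270]
periods = [5, 10, 20, 90, 270]

# Same order as the three nested loops: t varies slowest, n2 fastest.
period_combinations = list(product(time_range, periods, periods))

print(len(period_combinations))  # 150
print(period_combinations[:3])   # [(1, 5, 5), (1, 5, 10), (1, 5, 20)]
```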

In [18]:
for (t, n1, n2), predictor in zip(period_combinations, predictors):
    # read in test data, assuming it is stored locally
    test_data = pd.read_csv(join(final_data_dir, time_words[t], f'{n1}_{n2}', 'test.csv'), header=None)
    # labels are in the first column
    test_y = test_data.iloc[:,0]
    test_x = test_data.iloc[:,1:]

    # First: generate predicted class labels by rounding the model's output
    test_y_preds_ = predictor.predict(test_x)
    test_y_preds = test_y_preds_.round()

    # Second: calculate the test accuracy and count false positives and false negatives
    accuracy = 0
    fp = 0
    fn = 0
    for y, y_ in zip(test_y, test_y_preds):
        accuracy += 1 if y==y_ else 0
        fp += 1 if not y and y_ else 0
        fn += 1 if y and not y_ else 0

    accuracy /= len(test_y)
    print(f'\n\n{t} DAYS PREDICTION MODEL, n1 = {n1}, n2 = {n2}:')
    print('Accuracy: '+str(accuracy))
    print('False positives: '+str(fp))
    print('False negatives: '+str(fn))

    # delete the endpoint once it has been evaluated so it stops incurring charges
    predictor.delete_endpoint()



1 DAYS PREDICTION MODEL:
Accuracy: 0.5105636010281487
False positives: 7807
False negatives: 0


5 DAYS PREDICTION MODEL:
Accuracy: 0.5227358848048503
False positives: 7550
False negatives: 7


10 DAYS PREDICTION MODEL:
Accuracy: 0.5216644397785837
False positives: 7518
False negatives: 0


20 DAYS PREDICTION MODEL:
Accuracy: 0.5146982646982647
False positives: 7495
False negatives: 0


90 DAYS PREDICTION MODEL:
Accuracy: 0.5511355311355312
False positives: 6127
False negatives: 0


270 DAYS PREDICTION MODEL:
Accuracy: 0.6059085841694537
False positives: 3535
False negatives: 0
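
The accuracy and error counts computed in the evaluation loop can also be obtained without an explicit Python loop. The following is a minimal sketch using NumPy. `binary_counts` is a hypothetical helper, not part of the original notebook, and it assumes that labels and rounded predictions are 0/1 values.

```python
import numpy as np

def binary_counts(y_true, y_pred):
    """Hypothetical helper: return (accuracy, false positives, false negatives) for 0/1 labels."""
    # Flatten both inputs so that (N,) labels and (N, 1) predictions line up.
    y_true = np.asarray(y_true).ravel().astype(bool)
    y_pred = np.asarray(y_pred).ravel().astype(bool)
    accuracy = float(np.mean(y_true == y_pred))
    fp = int(np.sum(~y_true & y_pred))  # predicted 1, actual 0
    fn = int(np.sum(y_true & ~y_pred))  # predicted 0, actual 1
    return accuracy, fp, fn

# Toy data, not taken from the stock dataset.
acc, fp, fn = binary_counts([1, 0, 1, 0], [[1.0], [1.0], [1.0], [0.0]])
print(acc, fp, fn)  # 0.75 1 0
```

With false negatives at or near zero and accuracy close to 0.5, the printed results are consistent with models that predict the positive class for almost every sample. Comparing each accuracy against the share of positive labels in the corresponding test set would be a useful sanity check.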
