# Stock Trend Predictor
__Juan Javier Arosemena__
## Model Training
The modeling phase of the machine learning workflow consists of defining the models to be trained, training them, and then testing each model's accuracy.

I am using Amazon SageMaker for the training, testing, and deployment of the defined models.

In [4]:
import os
from os.path import isfile, join
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import statistics
from statistics import mean, mode, median, stdev
import sagemaker
import boto3
from sagemaker.pytorch import PyTorch

In [5]:
# Constants to use for this notebook

# The time ranges that the trained models will predict
time_range = [1, 5, 10, 20, 90, 270]
time_words = {1:'day', 5:'week', 10:'two_weeks', 20:'month', 90:'four_months', 270:'year'}

# The time ranges used for feature calculations
periods = [5, 10, 20, 90, 270]

# directory containing training and testing datasets
data_dir = 'data/'
final_data_dir = join(data_dir, 'final/')

prefix = 'stock-trend-predictor'

To run the following cell, you need environment variables containing your AWS access key and secret key, and you must specify your account's region. An IAM role must also be specified, and its name must be stored in an environment variable as well.
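
A minimal sketch of a pre-flight check, assuming the three variable names used in the next cell (``AWS_ACCESS_KEY``, ``AWS_SECRET_KEY``, ``AWS_SAGEMAKER_ROLE``). The helper name ``missing_env_vars`` is hypothetical and not part of the project:

```python
import os

# Variable names taken from the session cell below
REQUIRED_VARS = ('AWS_ACCESS_KEY', 'AWS_SECRET_KEY', 'AWS_SAGEMAKER_ROLE')

def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the names in `required` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]

missing = missing_env_vars()
if missing:
    print('Set these environment variables first: ' + ', '.join(missing))
```

Running this before creating the session turns a bare ``KeyError`` into a message that names every missing variable at once.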

In [6]:
# AWS session and role
boto_session = boto3.Session(
    aws_access_key_id=os.environ['AWS_ACCESS_KEY'],
    aws_secret_access_key=os.environ['AWS_SECRET_KEY'],
    region_name='us-east-2' # AWS account region
)
sagemaker_session = sagemaker.Session(boto_session=boto_session)
role = os.environ['AWS_SAGEMAKER_ROLE'] # IAM role

# Default S3 bucket for this session (created if it does not already exist)
bucket = sagemaker_session.default_bucket()

In [7]:
# Upload the directory containing training data

data = sagemaker_session.upload_data(final_data_dir, bucket=bucket, key_prefix=prefix)

### Model Definition
Under the ``source/`` directory is the file ``model.py``, which contains the definition of a class named ``BinaryClassifier``. This class defines the base ANN model for this project, which has the following structure:
1. Three parameters need to be passed to the model:
    * ``input_features``: the number of neurons to create for input (11 in this case)
    * ``hidden_dim``: a parameter used to define the ANN hidden layers.
    * ``output_dim``: the number of neurons in the final layer of the ANN. For a binary classifier this is 1, and the result lies in the range [0, 1].
2. The number of neurons in each of the 4 hidden layers of the model is defined as:
    * ``hidden_dim``
    * ``2 * hidden_dim``
    * ``3 * hidden_dim``
    * ``hidden_dim``
3. The forward pass of the model:
    * Input layer -> Linear transform to the first hidden layer
    * Passed through the Rectified Linear Unit (ReLU) function
    * Dropout layer (for training only)
    * Repeat the above steps until the final hidden layer
    * Last hidden layer -> Linear transform to the output layer
    * Sigmoid activation function -> Result
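
The structure above can be sketched in plain Python to make the layer sizes and the order of operations concrete. This is not the contents of ``source/model.py``, which is not shown here. The function names, the example ``hidden_dim`` of 10, and the random weights are illustrative assumptions, and dropout is omitted because it only applies during training.

```python
import math
import random

def layer_shapes(input_features, hidden_dim, output_dim):
    """(inputs, outputs) of each linear transform, per the structure above."""
    widths = [input_features, hidden_dim, 2 * hidden_dim,
              3 * hidden_dim, hidden_dim, output_dim]
    return list(zip(widths[:-1], widths[1:]))

def linear(x, weights, bias):
    """y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def forward(x, params):
    """Inference-time pass: ReLU after each hidden layer, sigmoid at the end."""
    for weights, bias in params[:-1]:
        x = [max(0.0, v) for v in linear(x, weights, bias)]  # ReLU
    weights, bias = params[-1]
    return [1.0 / (1.0 + math.exp(-v)) for v in linear(x, weights, bias)]  # sigmoid

# Random (untrained) parameters, used only to exercise the shapes
rng = random.Random(0)
shapes = layer_shapes(input_features=11, hidden_dim=10, output_dim=1)
params = [([[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
           [0.0] * n_out)
          for n_in, n_out in shapes]

print(shapes)  # [(11, 10), (10, 20), (20, 30), (30, 10), (10, 1)]
print(forward([0.5] * 11, params))  # a single value strictly between 0 and 1
```

In a PyTorch implementation each ``linear`` call would typically correspond to an ``nn.Linear`` layer, with dropout applied after each ReLU while training.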
    

### Model Training
The model training will be performed by Amazon SageMaker. A training job is created for each training dataset in the ``final/`` directory. Under ``source/`` there is a file named ``train.py``, which serves as the PyTorch entry point. This is necessary for creating estimators through SageMaker.
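
As a rough sketch of what "one training job per dataset" means here, the snippet below enumerates the combinations of prediction range and feature periods and builds the S3 location of each ``train.csv``. The constants are repeated from the start of the notebook so the snippet runs on its own. The bucket URI is a placeholder, the helper ``training_inputs`` is hypothetical, and the ``<time word>/<n1>_<n2>/train.csv`` layout is assumed from how the test data is read locally.

```python
from itertools import product

time_range = [1, 5, 10, 20, 90, 270]
time_words = {1: 'day', 5: 'week', 10: 'two_weeks', 20: 'month',
              90: 'four_months', 270: 'year'}
periods = [5, 10, 20, 90, 270]

def training_inputs(data_uri):
    """Map each '<t>-<n1>-<n2>' key to the S3 URI of its training CSV."""
    return {
        f'{t}-{n1}-{n2}': f'{data_uri}/{time_words[t]}/{n1}_{n2}/train.csv'
        for t, n1, n2 in product(time_range, periods, periods)
    }

inputs = training_inputs('s3://example-bucket/stock-trend-predictor')  # placeholder URI
print(len(inputs))        # 6 * 5 * 5 = 150 training jobs
print(inputs['20-5-90'])  # s3://example-bucket/stock-trend-predictor/month/5_90/train.csv
```

With 150 jobs and a limited instance quota, submitting them in small batches matters, which is why the cell below waits after every tenth job.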

In [17]:
# Create PyTorch estimators to train

estimators = {}
j = 0

for t in time_range:
    for n1 in periods:
        for n2 in periods:
            j += 1
            estimator = PyTorch(entry_point='train.py',
                                source_dir='source',
                                framework_version='1.0', 
                                role=role,
                                train_instance_count=1,
                                train_instance_type='ml.p2.xlarge',
                                sagemaker_session=sagemaker_session)
            estimator.fit({'training': data+'/'+time_words[t]+f'/{n1}_{n2}/train.csv'}, wait=False)
            estimators[f'{t}-{n1}-{n2}'] = estimator
            
            # Since my account has a 20-instance limit, create training jobs in groups of 10 to avoid throttling errors
            if j % 10 == 0:
                sagemaker_session.wait_for_job(estimator.latest_training_job.job_name)


2019-10-23 15:23:21 Starting - Starting the training job
2019-10-23 15:23:22 Starting - Launching requested ML instances.........
2019-10-23 15:24:18 Starting - Preparing the instances for training...............
2019-10-23 15:25:40 Downloading - Downloading input data....
2019-10-23 15:26:04 Training - Downloading the training image.......
2019-10-23 15:26:48 Training - Training image download completed. Training in progress...................
2019-10-23 15:28:25 Uploading - Uploading generated training model
2019-10-23 15:28:31 Completed - Training job completed

2019-10-23 15:28:52 Starting - Starting the training job
2019-10-23 15:28:54 Starting - Launching requested ML instances.........
2019-10-23 15:29:49 Starting - Preparing the instances for training.............
2019-10-23 15:31:17 Downloading - Downloading input data..
2019-10-23 15:31:35 Training - Downloading the training image........
2019-10-23 15:32:23 Training - Training image download completed. Training in progress.

ClientError: An error occurred (ThrottlingException) when calling the CreateTrainingJob operation (reached max retries: 4): Rate exceeded

### Model Evaluation
For evaluation, I deploy each trained model. For each estimator, a predictor endpoint is created briefly, the corresponding test dataset is sent to it for predictions, and the endpoint is then deleted. Accuracy is then calculated against the labeled test dataset, and the results are printed and stored in ``.csv`` files under the ``results/`` directory.
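
The accuracy calculation described above reduces to counting matches, false positives, and false negatives. A small sketch, assuming labels are 0/1 and predictions are sigmoid outputs rounded to the nearest class. The helper name ``evaluate`` is hypothetical:

```python
def evaluate(labels, predictions):
    """Return (accuracy, false_positives, false_negatives) for 0/1 labels."""
    correct = fp = fn = 0
    for y, y_hat in zip(labels, predictions):
        y, y_hat = int(y), int(round(y_hat))
        correct += (y == y_hat)
        fp += (y == 0 and y_hat == 1)  # predicted up, actually down
        fn += (y == 1 and y_hat == 0)  # predicted down, actually up
    return correct / len(labels), fp, fn

accuracy, fp, fn = evaluate([1, 0, 1, 0], [0.9, 0.8, 0.2, 0.1])
print(accuracy, fp, fn)  # 0.5 1 1
```

Reporting false positives and false negatives alongside accuracy helps reveal a model that predicts the same class for almost every input.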


In [53]:
# Deploy each trained model and evaluate them

predictors = {}
j=0
for key in estimators:
    j+=1
    predictor = estimators[key].deploy(1, 'ml.m4.2xlarge', wait=False)
    predictors[key] = predictor
    
    if j%10 == 0:
        for predictor in predictors.values():
            sagemaker_session.wait_for_endpoint(predictor.endpoint)
        results = []

        for key in predictors:
            predictor = predictors[key]
            t = key.split('-')[0]
            n1 = key.split('-')[1]
            n2 = key.split('-')[2]

            # read in test data, assuming it is stored locally
            test_data = pd.read_csv(join(final_data_dir, time_words[int(t)]+f"/{n1}_{n2}/test.csv"), header=None, names=None)
            # labels are in the first column
            test_y = test_data.iloc[:,0]
            test_x = test_data.iloc[:,1:]

            # First: generate predicted class labels

            test_y_preds_ = predictor.predict(test_x)
            test_y_preds = test_y_preds_.round()


            # Second: calculate the test accuracy

            accuracy = 0
            fp = 0
            fn = 0
            for y, y_ in zip(test_y, test_y_preds):
                accuracy += 1 if y==y_ else 0
                fp += 1 if not y and y_ else 0
                fn += 1 if y and not y_ else 0

            accuracy /= len(test_y)

            results.append([key, accuracy, fp, fn])

            print(f'\n\n{t} DAYS PREDICTION MODEL, n1 = {n1}, n2 = {n2}:')
            print('Accuracy: '+str(accuracy))
            print('False positives: '+str(fp))
            print('False negatives: '+str(fn))

            predictor.delete_endpoint()
            

        pd.DataFrame(results).to_csv(f'results/results_{j}.csv')
        predictors.clear()



------------------------------------------------------------------!!-!!!!!-!!!

1 DAYS PREDICTION MODEL, n1 = 5, n2 = 5:
Accuracy: 0.5107864891518737
False positives: 7893
False negatives: 44


1 DAYS PREDICTION MODEL, n1 = 5, n2 = 10:
Accuracy: 0.510891212347523
False positives: 7859
False negatives: 0


1 DAYS PREDICTION MODEL, n1 = 5, n2 = 20:
Accuracy: 0.5142730832386004
False positives: 7691
False negatives: 0


1 DAYS PREDICTION MODEL, n1 = 5, n2 = 90:
Accuracy: 0.5249625026783801
False positives: 6651
False negatives: 0


1 DAYS PREDICTION MODEL, n1 = 5, n2 = 270:
Accuracy: 0.5576654865357794
False positives: 3822
False negatives: 301


1 DAYS PREDICTION MODEL, n1 = 10, n2 = 5:
Accuracy: 0.510891212347523
False positives: 7859
False negatives: 0


1 DAYS PREDICTION MODEL, n1 = 10, n2 = 10:
Accuracy: 0.510891212347523
False positives: 7859
False negatives: 0


1 DAYS PREDICTION MODEL, n1 = 10, n2 = 20:
Accuracy: 0.5142730832386004
False positives: 7691
False negatives: 0


1 DAYS



-----------------------------------------------------------!---------------!!!------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------!!!!!!

1 DAYS PREDICTION MODEL, n1 = 20, n2 = 5:
Accuracy: 0.5142730832386004
False positives: 7691
False negatives: 0


1 DAYS PREDICTION MODEL, n1 = 20, n2 = 10:
Accuracy: 0.5142730832386004
False positives: 7691
False negatives: 0


1 DAYS PREDICTION MODEL, n1 = 20, n2 = 20:
Accuracy: 0.5142730832386004
False positives: 7691
False negatives: 0


1 DAYS PREDICTION MODEL, n1 = 20, n2 = 90:
Accuracy: 0.5249625026783801
False positives: 6651
False negatives: 0


1 DAYS PREDICTION MODEL, n1 = 20, n2 = 270:
Accuracy: 0.5621714408325287
False positives: 4075
False negatives: 6


1 DAYS PREDICTION MODEL, n1 = 90, n2 = 5:
Accuracy: 0.554746089565031
False positives: 2961
False negatives: 3273


1 DAYS PREDICTION MODEL, n1 = 90, n2 



--------------------------------------------------------------------!!-!-!!!-!!!!

1 DAYS PREDICTION MODEL, n1 = 270, n2 = 5:
Accuracy: 0.5628151485892071
False positives: 4075
False negatives: 0


1 DAYS PREDICTION MODEL, n1 = 270, n2 = 10:
Accuracy: 0.5628151485892071
False positives: 4075
False negatives: 0


1 DAYS PREDICTION MODEL, n1 = 270, n2 = 20:
Accuracy: 0.5628151485892071
False positives: 4075
False negatives: 0


1 DAYS PREDICTION MODEL, n1 = 270, n2 = 90:
Accuracy: 0.5675356721381826
False positives: 3993
False negatives: 38


1 DAYS PREDICTION MODEL, n1 = 270, n2 = 270:
Accuracy: 0.5628151485892071
False positives: 4075
False negatives: 0


5 DAYS PREDICTION MODEL, n1 = 5, n2 = 5:
Accuracy: 0.5227300879474792
False positives: 7706
False negatives: 0


5 DAYS PREDICTION MODEL, n1 = 5, n2 = 10:
Accuracy: 0.5269823445005927
False positives: 7582
False negatives: 0


5 DAYS PREDICTION MODEL, n1 = 5, n2 = 20:
Accuracy: 0.531035795887281
False positives: 7389
False negatives: 



20 DAYS PREDICTION MODEL, n1 = 10, n2 = 270:
Accuracy: 0.7160466651992076
False positives: 2472
False negatives: 108


20 DAYS PREDICTION MODEL, n1 = 20, n2 = 5:
Accuracy: 0.5347985347985348
False positives: 7239
False negatives: 0


20 DAYS PREDICTION MODEL, n1 = 20, n2 = 10:
Accuracy: 0.5347985347985348
False positives: 7239
False negatives: 0


20 DAYS PREDICTION MODEL, n1 = 20, n2 = 20:
Accuracy: 0.5347985347985348
False positives: 7239
False negatives: 0


20 DAYS PREDICTION MODEL, n1 = 20, n2 = 90:
Accuracy: 0.5683881746204692
False positives: 5942
False negatives: 0


20 DAYS PREDICTION MODEL, n1 = 20, n2 = 270:
Accuracy: 0.7151975349400242
False positives: 2465
False negatives: 123
--------------------------------------------------------------------------------!!!!-!!-!!-!!

20 DAYS PREDICTION MODEL, n1 = 90, n2 = 5:
Accuracy: 0.5683881746204692
False positives: 5942
False negatives: 0


20 DAYS PREDICTION MODEL, n1 = 90, n2 = 10:
Accuracy: 0.5683881746204692
False positives:

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from model with message "Amazon SageMaker could not get a response from the sagemaker-pytorch-2019-10-23-16-20-55-347 endpoint. This can occur when CPU or memory utilization is high. To check your utilization, see Amazon CloudWatch. To fix this problem, use an instance type with more CPU capacity or memory.". See https://us-east-2.console.aws.amazon.com/cloudwatch/home?region=us-east-2#logEventViewer:group=/aws/sagemaker/Endpoints/sagemaker-pytorch-2019-10-23-16-20-55-347 in account 748183850282 for more information.

The following cell resumes the processing halted by the error in the previous cell. It should not be run again, since the error was caused by a low-performance instance type, which has already been corrected in the previous cell.
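
One way to express "continue from where the previous run stopped" without counting inside the loop is ``itertools.islice``. This is a sketch under the assumption that the estimator keys keep their insertion order and that the first 100 were already handled, as the cell below assumes. The key list is rebuilt here only so the snippet runs on its own:

```python
from itertools import islice, product

time_range = [1, 5, 10, 20, 90, 270]
periods = [5, 10, 20, 90, 270]
keys = [f'{t}-{n1}-{n2}' for t, n1, n2 in product(time_range, periods, periods)]

already_done = 100  # number of keys handled before the error (assumption)
remaining = list(islice(keys, already_done, None))

print(len(remaining))  # 50
print(remaining[0])    # 90-5-5
```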

In [56]:
predictors = {}
j=0
for key in estimators:
    j+=1
    if j <= 100: continue # the previous cell stopped at this point
    predictor = estimators[key].deploy(1, 'ml.m4.2xlarge', wait=False)
    predictors[key] = predictor
    
    if j%10 == 0:
        for predictor in predictors.values():
            sagemaker_session.wait_for_endpoint(predictor.endpoint)
        results = []

        for key in predictors:
            predictor = predictors[key]
            t = key.split('-')[0]
            n1 = key.split('-')[1]
            n2 = key.split('-')[2]

            # read in test data, assuming it is stored locally
            test_data = pd.read_csv(join(final_data_dir, time_words[int(t)]+f"/{n1}_{n2}/test.csv"), header=None, names=None)
            # labels are in the first column
            test_y = test_data.iloc[:,0]
            test_x = test_data.iloc[:,1:]

            # First: generate predicted class labels

            test_y_preds_ = predictor.predict(test_x)
            test_y_preds = test_y_preds_.round()


            # Second: calculate the test accuracy

            accuracy = 0
            fp = 0
            fn = 0
            for y, y_ in zip(test_y, test_y_preds):
                accuracy += 1 if y==y_ else 0
                fp += 1 if not y and y_ else 0
                fn += 1 if y and not y_ else 0

            accuracy /= len(test_y)

            results.append([key, accuracy, fp, fn])

            print(f'\n\n{t} DAYS PREDICTION MODEL, n1 = {n1}, n2 = {n2}:')
            print('Accuracy: '+str(accuracy))
            print('False positives: '+str(fp))
            print('False negatives: '+str(fn))

            predictor.delete_endpoint()
            

        pd.DataFrame(results).to_csv(f'results/results_{j}.csv')
        predictors.clear()



-------------------------------------------------------------------!------------!-!!!-!-!!-!!

90 DAYS PREDICTION MODEL, n1 = 5, n2 = 5:
Accuracy: 0.5724060050484922
False positives: 6437
False negatives: 0


90 DAYS PREDICTION MODEL, n1 = 5, n2 = 10:
Accuracy: 0.5764209680658767
False positives: 6327
False negatives: 0


90 DAYS PREDICTION MODEL, n1 = 5, n2 = 20:
Accuracy: 0.5847654118930715
False positives: 6089
False negatives: 0


90 DAYS PREDICTION MODEL, n1 = 5, n2 = 90:
Accuracy: 0.7160003117449926
False positives: 3644
False negatives: 0


90 DAYS PREDICTION MODEL, n1 = 5, n2 = 270:
Accuracy: 0.8433321064900012
False positives: 1277
False negatives: 0


90 DAYS PREDICTION MODEL, n1 = 10, n2 = 5:
Accuracy: 0.5764209680658767
False positives: 6327
False negatives: 0


90 DAYS PREDICTION MODEL, n1 = 10, n2 = 10:
Accuracy: 0.5764209680658767
False positives: 6327
False negatives: 0


90 DAYS PREDICTION MODEL, n1 = 10, n2 = 20:
Accuracy: 0.5847654118930715
False positives: 6089
Fals