## Using sentiment analysis to predict stock movement
---
The following notebook will walk you through using news articles to predict movement of the Dow Jones Industrial Average.  

**Notebook outline:**
1. Read and examine the Dow Jones end of day data
1. Apply logistic regression to evaluate prediction performance
1. Read and examine the article data
1. Use Amazon Comprehend to tokenize the article data
1. Use the tokens to train a SageMaker model to predict stock movement

This notebook relies heavily on the Pandas and scikit-learn Python libraries.  Both libraries have excellent documentation, which you are encouraged to search in support of this notebook's activities.

- [Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/)
- [scikit-learn Documentation](http://scikit-learn.org/stable/modules/)

By working through this notebook you will learn to engineer your data in preparation for training machine learning models.  You will also learn to apply ML algorithms from both the scikit-learn library and the Amazon SageMaker service.  You will then independently test the SageMaker endpoint from Python to evaluate whether the model's performance is fit for purpose.

In addition to the aforementioned libraries you will also leverage the Boto3 Python library to programmatically interact with the AWS API.  For Boto3 documentation please visit https://boto3.readthedocs.io/.

### Note: How to get help
As you go through this notebook you may need assistance to complete a step or resolve an error message.  In addition to the documentation above you can use this notebook to obtain details about different functions.  The Jupyter notebook responds to the `Tab` key by performing tab-completion of things like variable names, function names, and package names.  If you press Shift-Tab while the cursor is inside the parentheses of a function call, the notebook displays detailed information about the function, its parameters, and its return values.  Finally, please don't forget to ask your friendly neighborhood AWS SA for help if needed.

---

To get started import the Pandas library and reference as `pd`.

In [1]:
import pandas as pd

---
## Read Dow Jones Industrial Average data

Read in, evaluate, and format the data from the DJIA end of day data.  Use the Pandas library to format the data types and manipulate the data so that every row contains the end of day data as well as the next day's up / down movement.

Use the 'read_csv' function from the Pandas library to read the data in the CSV file into a Pandas DataFrame.

In [2]:
djia_df = pd.read_csv('data/DJIA_table.csv')

Use the ``shape`` attribute of the Pandas DataFrame object to see how many rows and columns are contained in the data set.

In [None]:
djia_df.shape

Use the ``describe`` function of the DataFrame to obtain a table describing the mean, standard deviation, and min / max values for each feature.

In [None]:
djia_df.describe ()

Use the `sample` function of the DataFrame to obtain a number of example values from the DataFrame.

In [None]:
djia_df.sample (5)

The `dtypes` attribute of a DataFrame will list the individual data types (as auto-detected) of the columns read in from the CSV.

In [None]:
djia_df.dtypes

The `Date` column has been detected as an object (string).  Reformat it as a datetime data type so that we can more easily work with it.

In [7]:
djia_df.Date = pd.to_datetime (djia_df.Date, format='%Y-%m-%d')

Use the `min` and `max` functions of the `Date` column to see the range of dates held within the DJIA EOD data set.

In [None]:
print ("{} DJIA EOD data points from {} to {}".format (djia_df.shape[0], djia_df.Date.min (), djia_df.Date.max ()))

The DJIA EOD data holds traditional open, close, high, and low data - we are going to predict whether the DJIA has moved up or down on a given day.  To create a label (named `UpDown`) for the data set subtract the opening value from the closing value.  Then test for positive / negative values and use 1 to indicate a rise in DJIA value and a 0 to indicate a falling or unchanged value.

In [9]:
# keep the raw difference as a float; casting to int64 first would truncate gains smaller than one point to 0
djia_df['UpDown'] = djia_df.Close - djia_df.Open

In [10]:
djia_df.loc[djia_df.UpDown > 0, 'UpDown'] = 1

In [11]:
djia_df.loc[djia_df.UpDown <= 0, 'UpDown'] = 0

Lastly, we want to predict the performance of the DJIA tomorrow based on today's news.  To do this create another new column named `NextUpDown`, which is the `UpDown` column shifted by one row so that each row also holds the label of the row that follows it.

In [12]:
djia_df = pd.concat ([djia_df, djia_df.UpDown.rename('NextUpDown').shift (-1)], axis=1)
# by shifting the UpDown values by 1 we will have a NaN value in one row, remove it
djia_df.dropna(axis=0, how='any', inplace=True)
djia_df.NextUpDown = djia_df.NextUpDown.astype('int64')
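
To see what the shift does, here is a minimal sketch on a made-up five-row frame.  The name `toy_df` and its values are illustrative and are not taken from the DJIA data.  The sketch assumes rows are sorted oldest first; `shift (-1)` always pulls the label from the following row, so it only means "next day" when the rows are in ascending date order.

```python
import pandas as pd

# Hypothetical toy data: one UpDown label per trading day, oldest first
toy_df = pd.DataFrame({'UpDown': [1, 0, 0, 1, 1]})

# shift(-1) moves every value up one row, so each row sees the following row's label
toy_df['NextUpDown'] = toy_df.UpDown.shift(-1)

# the last row has no following row, so it holds NaN; drop it and restore the integer type
toy_df = toy_df.dropna(axis=0, how='any').astype({'NextUpDown': 'int64'})

print(toy_df)
```

It is worth checking the sort order of your own data with `head` before relying on the shifted column.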

Use the `head` function to check the first 5 rows of data in the DataFrame and ensure we have the result desired.

In [None]:
djia_df.head (5)

To further confirm that the `NextUpDown` column contains a mix of 1s and 0s, and in what proportion, use the `describe` function to summarize the data in the column.  We can see below that it is nearly a 50 / 50 mix of up and down values.

In [None]:
djia_df.describe ()

---
## Train and test a simple logistic regression model

Let's quickly test how well the DJIA EOD data can predict the next day's performance.  Use the scikit-learn library to train and evaluate a Logistic Regression model on your data.  Use the first 1500 observations for training and the next 200 observations for testing the model's performance.

In [15]:
from sklearn.linear_model import LogisticRegression
lreg = LogisticRegression ()

In [None]:
lreg.fit (X=pd.DataFrame (djia_df.Open[:1500]), y=djia_df.NextUpDown[:1500])
lreg.score (X=pd.DataFrame(djia_df.Open[1500:1700]), y=djia_df.NextUpDown[1500:1700])

The score is the share of test days for which the model correctly predicts the following day's movement using only the opening value from the preceding day.  If we used both the opening and closing value would we gain greater accuracy?

In [None]:
lreg.fit (X=pd.concat ([djia_df.Open[:1500], djia_df.Close[:1500]], axis=1), y=djia_df.NextUpDown[:1500])
lreg.score (X=pd.concat([djia_df.Open[1500:1700], djia_df.Close[1500:1700]], axis=1), y=djia_df.NextUpDown[1500:1700])

Using only the opening and closing values of the previous day we are able to predict with 59% accuracy the performance of the DJIA the next day.  How can we explain this?

---
# Begin workbook section
---
## Let's add article data

We are able to achieve 59% accuracy in predicting the performance of the DJIA tomorrow, based upon today's open and close values.  Let's see what sort of accuracy we can achieve using today's news to predict tomorrow's performance.

---
## Read article data

Read in and evaluate the data contained in the Article CSV.

In [18]:
article_df = pd.read_csv ('data/Articles.csv', encoding='ISO-8859-1')

Use the DataFrame `shape` attribute to determine how many rows and columns are in the DataFrame

Use the DataFrame `dtypes` attribute to determine how each column is formatted by data type.

Use the DataFrame `sample` function to peek at some example rows.

Format the `Date` column to be of datetime data type rather than object.

The `NewsType` column is interesting.  Let's use the `unique` function of the `NewsType` series to determine what range of possible values this column holds.
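
The inspection and formatting steps above can be sketched on a small made-up frame.  `toy_articles` and its values are hypothetical; only the column names `Date` and `NewsType` follow the article data, and the date strings are assumed to be in a format that `pd.to_datetime` can infer.

```python
import pandas as pd

# Hypothetical stand-in for the article data; dates arrive as plain strings
toy_articles = pd.DataFrame({
    'Date': ['2015-01-01', '2015-01-02', '2017-03-27'],
    'NewsType': ['business', 'sports', 'business'],
})

print(toy_articles.shape)    # (rows, columns)
print(toy_articles.dtypes)   # Date is still a string type at this point

# convert the string dates to datetime values
toy_articles['Date'] = pd.to_datetime(toy_articles.Date)

# list the distinct values held in the NewsType column
news_types = toy_articles.NewsType.unique()
print(news_types)
```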

Use the `min` and `max` functions to determine the range of dates held in the articles data set.

In [24]:
print ("{} articles from {} to {}".format (article_df.shape[0], article_df.Date.min (), article_df.Date.max ()))

2692 articles from 2015-01-01 00:00:00 to 2017-03-27 00:00:00


We now have a collection of over 2500 articles stored in memory.  We could parse these documents manually or with a library, but let's use Comprehend to identify the key phrases from each article.

---
## Parse the articles using Comprehend

Using the Amazon Comprehend Boto3 client we will now parse the first 4900 characters from every news article and capture the key phrases identified by Comprehend.

In [25]:
import boto3

region = boto3.Session().region_name

comprehend = boto3.client('comprehend', region_name=region)

For every article in the DataFrame pass the article text to Comprehend in order to extract the key phrases for every article.  For every set of key phrases track the phrases retrieved globally as well as at a per-article level.  This will allow you to determine which key phrases are most popular across all articles as well as how many times each key phrase appears in every article.

Use the `iterrows` function of the article DataFrame to pass the article text to Comprehend for processing.  When using this API, Comprehend requires that you not exceed 5000 bytes per API call.  For ease let's only pass the first 4900 characters of any one article.

For every key phrase returned append the phrase to a list, then `join` that list into a single space-separated string and append the string to a larger list which will hold every article's key phrases.  So for example the larger list should contain something such as:

```python
[
    'key phrase for article 1',
    'key phrases for article 2',
    'article 3 key phrases'
]
```

In [26]:
key_phrase_list = []
for index, row in article_df.iterrows ():
    key_phrases = comprehend.detect_key_phrases (Text = row['Article'][:4900], LanguageCode='en')
    
    phrase_strings = []
    for phrase in key_phrases['KeyPhrases']:
        text = phrase['Text']
        phrase_strings.append (text)
    key_phrase_list.append (' '.join (phrase_strings))
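
Because the limit mentioned above is stated in bytes while the slice counts characters, an article with many non-ASCII characters could still exceed it.  The sketch below shows one possible way to truncate by encoded size instead; `truncate_to_bytes` is a hypothetical helper, not part of Boto3 or Comprehend, and UTF-8 is an assumed encoding.

```python
def truncate_to_bytes(text, max_bytes=4900, encoding='utf-8'):
    # Encode, cut at the byte budget, then drop any partial trailing character
    return text.encode(encoding)[:max_bytes].decode(encoding, errors='ignore')

# Hypothetical sample: each 'é' takes 2 bytes in UTF-8, so this is 6000 bytes long
sample = 'é' * 3000
short = truncate_to_bytes(sample)

print(len(short), len(short.encode('utf-8')))
```

The result could then be passed as the `Text` argument in place of the character slice.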

Using the `Series` constructor convert the large array to a Pandas Series and append it to the article DataFrame using the column name `phrase_string`.
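
As a minimal sketch of this step, the snippet below attaches a list of strings to a small made-up frame.  `toy_articles` and `toy_phrases` are hypothetical stand-ins for the article DataFrame and the list of joined key phrases.

```python
import pandas as pd

# Hypothetical stand-ins for the article DataFrame and the joined key phrases
toy_articles = pd.DataFrame({'Heading': ['first headline', 'second headline']})
toy_phrases = ['oil prices OPEC deal', 'State Bank policy rate']

# Passing the DataFrame's index keeps each phrase string aligned with its article
toy_articles['phrase_string'] = pd.Series(toy_phrases, index=toy_articles.index)

print(toy_articles)
```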

When you're done your article DataFrame should look similar to the below.

In [27]:
article_df.sample (5)

| | Article | Date | Heading | NewsType | phrase_string |
|---|---|---|---|---|---|
| 420 | LONDON: The collapse in relations between Saud... | 2016-01-06 | Saudi Iran split dashes chance of OPEC deal to... | business | LONDON The collapse relations Saudi Arabia and... |
| 839 | strong>ISLAMABAD: The Jewellary exports from t... | 2016-08-01 | Pakistan Jewellary exports up 29 FY16 | business | strong>ISLAMABAD The Jewellary exports the cou... |
| 439 | ISLAMABAD: A five-day Pakistan-China Business ... | 2016-01-18 | Pak China Business Conference opens in Islamabad | business | ISLAMABAD A five-day Pakistan China Business F... |
| 2521 | strong>KARACHI: The State Bank of Pakistan ann... | 2016-11-26 | State Bank maintains main policy interest rate... | business | strong>KARACHI The State Bank Pakistan its mon... |
| 11 | HONG KONG: Hong Kong stocks edged up 0.24 perc... | 2015-01-15 | hong kong stocks open 0.24 percent higher | business | HONG KONG Hong Kong stocks 0.24 percent early ... |


---
## Prepare the data and train a baseline model

You may have noticed that you have DJIA EOD data ranging from 2008 to 2016 and article data ranging from 2015 to 2017.  We need to focus on the intersection of these two data sets.  Use the `merge` function of the article DataFrame to join the article DataFrame with the DJIA DataFrame.  Use an `inner` join based upon the `Date` field in both DataFrames.

When complete you should have a DataFrame with observations ranging from 2015 to 2016-07-01.
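
The join can be sketched on two tiny made-up frames.  `toy_news`, `toy_djia`, and `toy_joined` are hypothetical names and values; the real DataFrames hold many more columns.

```python
import pandas as pd

# Hypothetical article rows: two articles on one trading day, one outside the DJIA range
toy_news = pd.DataFrame({
    'Date': pd.to_datetime(['2015-01-02', '2015-01-02', '2017-03-27']),
    'phrase_string': ['oil prices', 'rate cut', 'exports'],
})

# Hypothetical DJIA rows: one day with no articles, one day with articles
toy_djia = pd.DataFrame({
    'Date': pd.to_datetime(['2014-12-31', '2015-01-02']),
    'NextUpDown': [0, 1],
})

# An inner join keeps only the dates present in both frames
toy_joined = toy_news.merge(toy_djia, how='inner', on='Date')

print(toy_joined)
```

Note that every article published on a matching date receives that date's label, so one trading day can contribute several rows.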

In [29]:
print ("{} joined records from {} to {}".format (joined_df.shape[0], joined_df.Date.min (), joined_df.Date.max ()))

1448 joined records from 2015-01-02 00:00:00 to 2016-07-01 00:00:00


It is common in ML development to split a data set into 3 portions.  The training data set will be the largest data set, with approximately 20% of your data held back for validation during training and the last 20% retained for testing purposes post-training.  Create indexes to split your Pandas DataFrame into 60%, 20% and 20% portions.

In [31]:
train_start = 0
validate_start = 
test_start = 
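
One possible way to compute the split points is sketched below.  `n_rows` is a hypothetical variable set to the 1448 joined records reported above; in the notebook you would more likely read the row count from the `shape` attribute of your joined DataFrame.

```python
n_rows = 1448  # assumed row count, taken from the joined record count printed earlier

train_start = 0
validate_start = int(n_rows * 0.6)   # first row of the validation portion
test_start = int(n_rows * 0.8)       # first row of the test portion

print(train_start, validate_start, test_start)
```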

It is also common practice to shuffle your training set in order to avoid overfitting, especially when performing multiple epochs over the training data.  Use the `shuffle` function of the `sklearn.utils` module to help shuffle your training set.  Also use the indexes from the last step to create 3 data sets: `train_df`, `validate_df`, and `test_df`.  For each data set also create a reference to its labels: `train_labels`, `validate_labels`, and `test_labels`.

In [32]:
from sklearn.utils import shuffle

train_df = 
train_labels = 
validate_df =
validate_labels = 
test_df = 
test_labels = 
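
A minimal sketch of the shuffle and split on a ten-row made-up frame follows.  All `toy_` names and the split points 6 and 8 are hypothetical; `random_state` is set only so the shuffled order is repeatable.

```python
import pandas as pd
from sklearn.utils import shuffle

# Hypothetical joined data: ten rows with a phrase string and a label
toy_joined = pd.DataFrame({
    'phrase_string': ['p{}'.format(i) for i in range(10)],
    'NextUpDown': [0, 1] * 5,
})
validate_start, test_start = 6, 8   # 60% / 20% / 20% of ten rows

# Shuffle only the training slice; validation and test keep their original order
toy_train = shuffle(toy_joined[:validate_start], random_state=0)
toy_train_labels = toy_train.NextUpDown

toy_validate = toy_joined[validate_start:test_start]
toy_validate_labels = toy_validate.NextUpDown

toy_test = toy_joined[test_start:]
toy_test_labels = toy_test.NextUpDown

print(len(toy_train), len(toy_validate), len(toy_test))
```

Taking the labels from the already shuffled frame keeps every label on the same row as its features.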

With the data now split into 3 separate sets use the `CountVectorizer` from the `sklearn.feature_extraction.text` module to create a count of each of the words in each row of `phrase_string`.  Use the `fit_transform` function of the CountVectorizer to convert the `phrase_string` series of the training set all at once.  After you have fit the CountVectorizer you can then use the `transform` function to convert the `phrase_string` series of the validation and test data sets.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

wordvec = CountVectorizer ()
train_vec = 

validate_vec =
test_vec = 
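
The difference between `fit_transform` and `transform` can be seen in the sketch below, which uses made-up phrase strings.  The `toy_` names are hypothetical and the default `CountVectorizer` settings are assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical phrase strings standing in for the phrase_string column
toy_train_phrases = ['oil prices rise', 'oil exports fall']
toy_test_phrases = ['gold prices rise']

toy_vectorizer = CountVectorizer()

# fit_transform learns the vocabulary from the training text and counts its words
toy_train_vec = toy_vectorizer.fit_transform(toy_train_phrases)

# transform reuses that vocabulary; words never seen in training ('gold') are ignored
toy_test_vec = toy_vectorizer.transform(toy_test_phrases)

print(sorted(toy_vectorizer.vocabulary_))
print(toy_train_vec.toarray())
print(toy_test_vec.toarray())
```

Fitting only on the training set keeps the column layout identical across all three data sets, which the model relies on.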

As a sanity check create a `LogisticRegression` model from the sklearn.linear_model library to fit to the vectorized training phrases and the training labels.

In [None]:
from sklearn.linear_model import LogisticRegression
lreg = LogisticRegression ()

After fitting the model score it against the validation set.  You should obtain a score of roughly 48%.  Anything less than 50% and you're better off just flipping a coin.

Finally score the model against your test data set and also use the `predict` function of your model to produce a series of predictions for every observation in the test data set.  Use the `crosstab` function from Pandas to display a table of false positives and false negatives.
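
The fit, score, predict, and `crosstab` sequence might look like the sketch below.  The feature rows and labels are invented two-word counts, and every `toy_` name is hypothetical; your own code would use the vectorized phrases and labels created earlier.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical word-count features (two words) and up / down labels
toy_X_train = [[3, 0], [2, 0], [0, 3], [0, 2]]
toy_y_train = [1, 1, 0, 0]
toy_X_test = [[4, 0], [0, 4], [1, 0]]
toy_y_test = pd.Series([1, 0, 0], name='Actual')

toy_model = LogisticRegression()
toy_model.fit(toy_X_train, toy_y_train)

# score returns the mean accuracy on the given data
print(toy_model.score(toy_X_test, toy_y_test))

# rows are the actual labels, columns the predicted labels;
# the off-diagonal cells count the false positives and false negatives
toy_pred = pd.Series(toy_model.predict(toy_X_test), name='Predicted')
toy_table = pd.crosstab(toy_y_test, toy_pred)
print(toy_table)
```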

Out of curiosity peek into the coefficients of your model compared with the words in your vectorizer to see which words have the greatest positive and negative influence over the model's outcome.

In [37]:
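# get_feature_names_out replaces get_feature_names, which newer scikit-learn releases no longer provide
vec_words = wordvec.get_feature_names_out()
lreg_coeffs = lreg.coef_.tolist()[0]
coeff_df = pd.DataFrame({'Word' : vec_words, 
                        'Coefficient' : lreg_coeffs})
# sort by coefficient (descending), breaking ties alphabetically by word
coeff_df = coeff_df.sort_values(['Coefficient', 'Word'], ascending=[False, True])
coeff_df.head(10)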

| | Coefficient | Word |
|---|---|---|
| 506 | 0.748372 | 22 |
| 864 | 0.691797 | 64 |
| 444 | 0.681982 | 20 |
| 6453 | 0.66079 | may |
| 8961 | 0.553141 | senior |
| 5471 | 0.53179 | investment |
| 4138 | 0.522309 | fares |
| 8448 | 0.511151 | review |
| 969 | 0.502739 | 80 |
| 2805 | 0.50216 | company |


In [38]:
coeff_df.tail (10)

| | Coefficient | Word |
|---|---|---|
| 2004 | -0.511915 | biggest |
| 6107 | -0.522798 | limited |
| 4416 | -0.529109 | foreign |
| 5729 | -0.531351 | karachi |
| 5434 | -0.572182 | interior |
| 7824 | -0.576874 | pressure |
| 7328 | -0.603428 | oversupply |
| 5051 | -0.64743 | hobc |
| 9074 | -0.664025 | sharjah |
| 10064 | -0.761891 | tariff |


---
## Create a model hosted on SageMaker
Now in this final step you will train the XGBoost algorithm on SageMaker and host the trained model using a SageMaker endpoint.  To start you will need to write your data sets used earlier out to CSV and upload the CSV documents to S3.

The XGBoost algorithm requires that your data be formatted using comma-separated values and that the first value of any observation (the first column) be the observation's label.  So we will want to format our data with the first column being `NextUpDown` and all subsequent values being the word counts.

In [None]:
prefix = 'sentiment-xgboost'
bucket = 'jasbarto-ml-demo'  # replace with the name of your own S3 bucket
bucket_path = 'https://s3-{}.amazonaws.com/{}'.format(region, bucket)

def write_to_s3(fobj, bucket, key):
    return boto3.Session().resource('s3').Bucket(bucket).Object(key).upload_fileobj(fobj)

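def upload_to_s3(bucket, channel, filename):
    # include the file name in the key so the object lands at the URL printed below
    key = prefix + '/' + channel + '/' + filename
    url = 's3://{}/{}'.format(bucket, key)
    print('Writing to {}'.format(url))
    # the context manager closes the file once the upload has finished
    with open(filename, 'rb') as fobj:
        write_to_s3(fobj, bucket, key)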

train_csv_df = pd.DataFrame (train_vec.todense ())
columns = train_csv_df.columns.tolist ()
columns.insert (0, 'NextUpDown')
train_csv_df['NextUpDown'] = train_df.NextUpDown.values
train_csv_df = train_csv_df[columns]

"""
Convert the validation data set in a fashion similar to the training set above.
"""

"""
Convert the test data set in a fashion similar to the training set above.
"""

train_csv_df.to_csv ('train_data.csv', encoding='utf-8', header=False, index=False)
"""
Write out the validation and test data sets in a fashion similar to the training set above.
"""

upload_to_s3 (bucket, 'train', 'train_data.csv')
upload_to_s3 (bucket, 'validation', 'validate_data.csv')

### Train a SageMaker algorithm
The code below will programmatically create a training job for the SageMaker XGBoost algorithm.  Alternatively you can create the training job manually via the AWS web console.

**To train via the console:**

Submit your data to the XGBoost algorithm via the Amazon SageMaker web console.  Select `Training Jobs` and click the `Create training job` button.

Give the training job a name such as 'xgboost-binary-UpDown-classifier' or similar.

For `Algorithm` select `XGBoost` from the drop down.

Accept the defaults for the remainder of the fields and move down to `Hyperparameters`.  Set the following values for the Hyperparameters accepting the default values for all other parameters.
- `num_round` -> 50
- `objective` -> 'binary:logistic'

For Input data configure two channels.  The first should be given the name 'train' and have the following settings:
- `Content-type` -> 'csv'
- `Compression type` -> None
- `Record wrapper` -> None
- `S3 data type` -> S3Prefix
- `S3 data distribution type` -> Fully replicated
- `S3 location` -> 's3://<your-bucket-name>/<your-model-prefix>/train'

For the second channel give it a name of 'validation' and set its parameters the same as the 'train' channel.  Give the 'validation' channel a different `S3 location` however, setting it to 's3://<your-bucket-name>/<your-model-prefix>/validation'.

For the Output data configuration set the `S3 output path` to 's3://<your-bucket-name>/<your-model-prefix>/output'.




In [None]:
%%time

from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri
import time
from time import gmtime, strftime

role = get_execution_role()

container = get_image_uri(boto3.Session().region_name, 'xgboost')

job_name = 'DEMO-xgboost-classification-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Training job", job_name)

#Ensure that the training and validation data folders generated above are reflected in the "InputDataConfig" parameter below.

create_training_params = \
{
    "AlgorithmSpecification": {
        "TrainingImage": container,
        "TrainingInputMode": "File"
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": bucket_path + "/" + prefix + "/single-boost"
    },
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.m4.4xlarge",
        "VolumeSizeInGB": 5
    },
    "TrainingJobName": job_name,
    "HyperParameters": {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "silent":"0",
        "objective":"binary:logistic",
        "num_round":"50"
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 3600
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": bucket_path + "/" + prefix + '/train',
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "csv",
            "CompressionType": "None"
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": bucket_path + "/" + prefix + '/validation',
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "csv",
            "CompressionType": "None"
        }
    ]
}


client = boto3.client('sagemaker')
client.create_training_job(**create_training_params)

status = client.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
print(status)
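# poll until the job reaches a terminal state; 'Stopped' is included so a stopped job does not loop forever
while status not in ('Completed', 'Failed', 'Stopped'):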
    time.sleep(60)
    status = client.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
    print(status)

### Create a model from training output

The preceding training job, when complete, will export a model into the configured `S3 output path`.  We will now use that output to create a model which can then be hosted by SageMaker.  The creation of a model can be done programmatically using the code below.  Alternatively you can create a model manually via the SageMaker web console.

From the SageMaker web console select `Inference` -> `Models`.  Then click `Create model`.

Give the model a name such as **'xgboost-model'** and set the `Location of model artifacts` to your output directory.  For example: **`s3://<your-bucket-name>/<your-model-prefix>/output/model.tar.gz`**.  This same value will also be listed in the summary output of the preceding training job.  For the `Location of inference code image` set the value of **'685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:1'**.  This image URI is for the eu-west-1 region; if you are working in another region, use the XGBoost image URI for that region, which is the value held in the `container` variable in the training code.

Click `Create model`.

In [None]:
%%time

model_name=job_name + '-model'
print(model_name)

info = client.describe_training_job(TrainingJobName=job_name)
model_data = info['ModelArtifacts']['S3ModelArtifacts']
print(model_data)

primary_container = {
    'Image': container,
    'ModelDataUrl': model_data
}

create_model_response = client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = primary_container)

print(create_model_response['ModelArn'])

### Create endpoint configuration

SageMaker hosts your model as an endpoint.  The endpoint must be configured with the models you would like it to host.  To create an endpoint configuration you can use the code below or create one manually via the web console.

From the SageMaker web console click `Endpoint configurations` under `Inference`.  Click `Create endpoint configuration`.

Give your endpoint configuration a name such as 'xgboost-model-endpoint-config' and under 'Production variants' click `Add model`.  Select the model you created in the previous step and click `Create endpoint configuration`.

In [None]:
endpoint_config_name = 'DEMO-XGBoostEndpointConfig-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(endpoint_config_name)
create_endpoint_config_response = client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType':'ml.m4.xlarge',
        'InitialVariantWeight':1,
        'InitialInstanceCount':1,
        'ModelName':model_name,
        'VariantName':'AllTraffic'}])

print("Endpoint Config Arn: " + create_endpoint_config_response['EndpointConfigArn'])

### Deploy endpoint

Finally deploy the configured endpoint so that it can be invoked as a secure, RESTful endpoint.  To deploy the endpoint you can use the code below or create a deployment manually via the web console.

From the SageMaker web console click `Endpoints` under `Inference`.  Click `Create endpoint`.  Give your endpoint a name such as `xgboost-endpoint` and select the endpoint configuration created in the previous step.  Click `Deploy endpoint`.

In [None]:
%%time

endpoint_name = 'DEMO-XGBoostEndpoint-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(endpoint_name)
create_endpoint_response = client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name)
print(create_endpoint_response['EndpointArn'])

resp = client.describe_endpoint(EndpointName=endpoint_name)
status = resp['EndpointStatus']
print("Status: " + status)

while status=='Creating':
    time.sleep(60)
    resp = client.describe_endpoint(EndpointName=endpoint_name)
    status = resp['EndpointStatus']
    print("Status: " + status)

print("Arn: " + resp['EndpointArn'])
print("Status: " + status)

### Invoke the endpoint

Once the SageMaker endpoint has been deployed your trained model is running and ready to perform inference.  Let's test it now using the Boto3 client.

In [44]:
sm_client = boto3.client('sagemaker-runtime')

In [45]:
!head -n1 test_data.csv > /tmp/test_data_single.csv

In [None]:
%%time
import json
from itertools import islice
import math
import struct
    
def predict (features):
    response = sm_client.invoke_endpoint(EndpointName=endpoint_name, 
                                   ContentType='text/csv', 
                                   Body=features)
    result = response['Body'].read()
    result = 1 if float(result) > 0.5 else 0
    return result

file_name = '/tmp/test_data_single.csv' #customize to your test file

with open(file_name, 'r') as f:
    payload = f.read().strip()

print ("Label: {}\nPrediction: {}".format (payload[0:1], predict (payload[2:])))

Using the convenience function defined above iterate over the contents of the `test_data.csv` file and record the prediction for each observation.  The first number of every observation is the label for that observation.  Store the label along with the prediction so that you can compare them as a measure of the model's performance.

In [None]:
test_labels = []
test_predictions = []
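
The loop could be structured along the lines of the sketch below.  To stay runnable without an endpoint it uses `fake_predict`, a hypothetical stand-in for the `predict` function, and an in-memory `toy_file` in place of `test_data.csv`; both are assumptions made for illustration.

```python
import io

def fake_predict(features):
    # Hypothetical stand-in for the endpoint call: predicts 1 when the first count is non-zero
    return 1 if features.split(',')[0] != '0' else 0

# Hypothetical stand-in for test_data.csv: the label comes first, then the word counts
toy_file = io.StringIO("1,2,0,1\n0,0,3,0\n1,0,0,4\n")

toy_labels = []
toy_predictions = []
for line in toy_file:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    # split once: everything before the first comma is the label, the rest are features
    label, features = line.split(',', 1)
    toy_labels.append(int(label))
    toy_predictions.append(fake_predict(features))

print(toy_labels, toy_predictions)
```

Splitting on the first comma avoids assuming the label is a single character wide.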

    

Import the `accuracy_score` function from the `sklearn.metrics` module.  Use the `accuracy_score` function to calculate the accuracy of the model against the labels and predictions collected in the previous step.

Next use the `crosstab` function of the Pandas library to produce a formatted table highlighting the number of false positives and false negatives.

In [None]:
from sklearn.metrics import accuracy_score
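
A minimal sketch of the two evaluation steps follows, using invented labels and predictions.  The `toy_` names and values are hypothetical; substitute the lists you collected from the endpoint.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions for five observations
toy_labels = [1, 0, 1, 0, 1]
toy_predictions = [1, 0, 0, 1, 1]

# accuracy_score is the fraction of predictions that match the labels
toy_accuracy = accuracy_score(toy_labels, toy_predictions)
print("Accuracy: {:.2f}".format(toy_accuracy))

# rows are actual labels, columns are predicted labels
toy_table = pd.crosstab(pd.Series(toy_labels, name='Actual'),
                        pd.Series(toy_predictions, name='Predicted'))
print(toy_table)
```

In the resulting table the cell at row 0, column 1 counts false positives and the cell at row 1, column 0 counts false negatives.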


---
## Complete

That completes this lab.  You should have trained and created a hosted SageMaker model using the XGBoost algorithm for classifying stock movements using sentiment analysis.  The accuracy of your model should have been approximately 52%.  Now think about how you could improve the accuracy of the model.  What changes could you make to the data, the algorithm, or the algorithm's parameters to tune the performance of your trained model?

### Note: Remove created resources
As part of this lab you will have created an S3 bucket, a SageMaker notebook server, and at least one training job, model, and endpoint.  Please be sure to delete each of these to avoid incurring any further charges to your AWS account.