# Machine Learning for Stock Price Prediction
## Overview

### What You'll Learn
In this section, you'll learn:
1. How to format data so that it can be used for machine learning
2. How to create, train, and test a model that predicts stock prices
3. How to improve the model's predictions

### Prerequisites
Before starting this section, you should have an understanding of
1. [Basic Python (functions, loops, lists)](https://github.com/HackBinghamton/PythonWorkshop)
2. [scikit-learn](https://colab.research.google.com/github/HackBinghamton/MachineLearningWorkshopWeek1/blob/master/intro_ml_scikit.ipynb) ([Boston Housing Price example](https://colab.research.google.com/github/HackBinghamton/MachineLearningWorkshopWeek1/blob/master/housing_price_prediction.ipynb) if you'd like extra practice)

### Introduction
Stock price prediction has been a Holy Grail of machine learning for years. If one can predict changes in stock prices, they can buy and sell at just the right times to make tons of money. In this workshop section, we'll discuss how to make data about a given stock fit into a `sklearn` machine learning model as well as how to train and test it.

---

## Loading the Data
Usually, you have to use one of `sklearn`'s datasets, find a third-party dataset online, or build your own.

In this case, we've prepared datasets on different individual stocks for everyone! These datasets hold three types of data over multiple rows. In each row, you will find:

 1. The date that the row describes
 2. The stock price change on that day
 3. The average sentiment of news related to the stock on that day

By "sentiment" we mean whether the news was largely positive or negative. For example, a sentence like "I hate licorice" has a generally negative sentiment, while "HackBU teaches amazing students like you how to create awesome things with code" has a generally positive sentiment. We computed these sentiments by running a Natural Language Processing library called NLTK on a set of news articles gathered from an API. The links to these resources are at the very end of this page.
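
To make the "average sentiment on that day" idea concrete, here is a minimal sketch of how per-article scores might be collapsed into one value per day. The article scores, the `Article Score` column name, and the values are all made up for illustration; this is not the script we used to build the datasets.

```python
import pandas as pd

# Hypothetical per-article sentiment scores (made-up values for illustration)
articles = pd.DataFrame({
    "Date": ["2019-09-19", "2019-09-19", "2019-09-20", "2019-09-20", "2019-09-20"],
    "Article Score": [0.5, -0.1, 0.8, 0.2, 0.5],
})

# Average all article scores that share the same date
daily_sentiment = (
    articles.groupby("Date")["Article Score"]
    .mean()
    .rename("Sentiment")
    .reset_index()
)

print(daily_sentiment)
```

Each date ends up with a single `Sentiment` value, which is the shape of data the rest of this section works with.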

**Feel free to tweak the `dataset` variable below to select which dataset you'd like to work with. Once you've selected which dataset you'd like, run the code block and your dataset will be loaded!**

```python
import requests
import pandas as pd

# TWEAK THIS VALUE TO USE WHATEVER DATASET YOU'D LIKE
# Options: Facebook, Amazon, Microsoft, Nvidia, Apple
dataset = "Nvidia"


#########################
## DO NOT MODIFY BELOW ##
#########################

##### LOAD DATA #####
# Fetch the dataset contents, and stop with an error if the download failed
r = requests.get("https://raw.githubusercontent.com/HackBinghamton/MachineLearningWorkshopWeek2/master/stock_price_prediction/" + dataset + ".csv")
r.raise_for_status()

# Write to a local file
with open(dataset + ".csv", "w") as datafile:
    datafile.write(r.text)

# Open the dataset in a pandas dataframe
df = pd.read_csv(dataset + ".csv")

print(df)
```

## Preparing the Data
Great! So we've now loaded up a dataset into a pandas DataFrame that looks something like this:

```python
            Date    Stock Change     Sentiment
0     2019-09-19            0.73      8.553792
1     2019-09-22            2.52      7.389561
 ...
12    2019-10-08           -0.69     11.110601
```

However, we can't yet feed this into our machine learning model. Here are a few problems with it:

1. The "Date" column is largely irrelevant. Unless we have multiple years' worth of data points, this column is likely to confuse our machine learning model.
2. We don't know what data should be used to train the model and what data should be used to test it

To fix this and make our data compatible with the machine learning model, we'll have to do the following:

1. Remove the date column
2. Break our data into 4 different sets: training and testing sets of both X (input/news sentiment) and Y (output/stock price change) values

Let's do it!

### 1. Chopping the Date column off
The Date column is more likely to confuse the machine learning model than help it. It's stored as text, and machine learning models like the ones we're using only accept numerical data. Even if we converted the dates to numbers, a model might find a seasonal pattern given years' worth of data, but with only a few data points it can't learn anything meaningful from them.

```python
# drop() removes the 'Date' column (axis=1 means "columns") and returns a new DataFrame
df = df.drop('Date', axis=1)

# Display the data
print(df)
```
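
If you'd like to convince yourself that the Date column really isn't numeric, here is a small sketch using a stand-in DataFrame with the same column names as our datasets. The `sample` and `numeric_only` names are just for this illustration.

```python
import pandas as pd

# Small stand-in for our dataset, using rows from the example table above
sample = pd.DataFrame({
    "Date": ["2019-09-19", "2019-09-22", "2019-10-08"],
    "Stock Change": [0.73, 2.52, -0.69],
    "Sentiment": [8.553792, 7.389561, 11.110601],
})

# Check each column: dates read from a CSV are stored as text, not numbers
for column in sample.columns:
    print(column, "is numeric:", pd.api.types.is_numeric_dtype(sample[column]))

# drop() returns a new DataFrame and leaves the original untouched
numeric_only = sample.drop("Date", axis=1)
print(list(numeric_only.columns))
```

Because `drop` returns a new DataFrame, the result has to be assigned back to a variable, which is why the cell above writes `df = df.drop(...)`.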

### 2. Splitting into training and testing data
For our last step in data processing, we must split our data into training and testing groups, as well as organize our data so that it fits into our model.

Giving a model a large training set and a small testing set is like teaching students for a whole semester and only having one exam -- you risk not testing students on everything they've learned, so your measurement of their performance may not be accurate. 

On the other hand, small training sets and large testing sets lead to models that haven't learned much but keep getting tested on things that they may or may not have learned yet. This can result in inaccurate models.

**We've set `test_proportion` to 0.3, which uses 70% of the data for training and the remaining 30% for testing. Feel free to tweak this and see how it affects your model's score later on!**

```python
from sklearn.model_selection import train_test_split

# Designate how much of our dataset we'd like to dedicate to testing (the rest goes to training)
test_proportion = 0.3

# Store the sentiment data in X and the stock change data in Y
X = df[['Sentiment']]
Y = df[['Stock Change']]

# Split X and Y into training and testing sets
# (random_state=42 makes the shuffle repeatable, so you get the same split every run)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_proportion, random_state=42)

# Display our newly-formatted data
print("X Training:\n", X_train, "\n")
print("Y Training:\n", Y_train, "\n")
print("X Testing:\n", X_test, "\n")
print("Y Testing:\n", Y_test, "\n")
```
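
To get a feel for what the test proportion does, the sketch below splits 13 made-up samples (the example table above is indexed 0 through 12) at a few different proportions and reports how many rows land in each set. The `X_demo`, `y_demo`, and `split_sizes` names are only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 13 made-up samples standing in for a small dataset
X_demo = np.arange(13).reshape(-1, 1)
y_demo = np.arange(13)

# Record (training rows, testing rows) for each proportion we try
split_sizes = {}
for proportion in (0.1, 0.3, 0.5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=proportion, random_state=42
    )
    split_sizes[proportion] = (len(X_tr), len(X_te))
    print(f"test_size={proportion}: {len(X_tr)} training rows, {len(X_te)} testing rows")
```

With a dataset this small, a change of one or two rows in the testing set can swing the score noticeably, so expect results to vary as you tweak the proportion.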

## Creating, Training, and Testing Our Model

Now that we have our data properly formatted, we can finally create our model and run the data through it. Here's generally what this looks like, for any data set:

```python
# Load up the model to use
from sklearn.some_model_variety import ModelOfChoice

# Load your data as shown above...

# Create your model
model = ModelOfChoice()

# Train your model
model.fit(X_train, Y_train)

# Check its score (for sklearn regressors, score() returns R^2, where 1.0 is a perfect fit)
print("ModelOfChoice R^2 score:", model.score(X_test, Y_test))
```
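
One note on `score`: for scikit-learn regressors such as `LinearRegression`, it returns the coefficient of determination (R^2) rather than a percentage of correct answers. A value of 1.0 is a perfect fit, 0 is no better than always guessing the average, and negative values are worse than that. The sketch below, which uses made-up data and illustrative variable names, computes R^2 by hand to show that it matches what `score` reports.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Made-up data: y is roughly 2 * x plus some noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 1))
y_demo = 2 * X_demo[:, 0] + rng.normal(0, 1, size=50)

demo_model = LinearRegression().fit(X_demo, y_demo)
predictions = demo_model.predict(X_demo)

# R^2 = 1 - (sum of squared errors) / (total variation around the mean)
ss_res = np.sum((y_demo - predictions) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print("score():  ", demo_model.score(X_demo, y_demo))
print("r2_score: ", r2_score(y_demo, predictions))
print("by hand:  ", manual_r2)
```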

We've decided to use a `LinearRegression`, which tries to draw a line of best fit through the X and Y data to predict values between data points. Poke around and see what other regressions you can use and how they affect the model's score. `Ridge` and `Lasso` are two other linear regressions; both add a penalty that shrinks the model's coefficients to reduce overfitting, and `Lasso` can shrink some coefficients all the way to zero.

```python
# Grab our model from sklearn
from sklearn.linear_model import LinearRegression

# Create our model
model = LinearRegression()

# Train the model
model.fit(X_train, Y_train)

# Test the model, and report its R^2 score (1.0 is a perfect fit)
print("R^2 score:", model.score(X_test, Y_test))
```
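
If you want to try other regressions, one way to compare them is to train each on the same split and print its score. The sketch below does this with made-up data; the `alpha` values are arbitrary starting points rather than recommendations, and the variable names are only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split

# Made-up data: y is roughly 3 * x plus some noise
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 1))
y_demo = 3 * X_demo[:, 0] + rng.normal(0, 0.5, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# alpha controls how strongly Ridge and Lasso penalize large coefficients
candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
}

# Train every candidate on the same data and record its test score
scores = {}
for name, candidate in candidates.items():
    candidate.fit(X_tr, y_tr)
    scores[name] = candidate.score(X_te, y_te)
    print(f"{name}: R^2 = {scores[name]:.3f}")
```

To run the same comparison on the workshop data, swap in `X_train`, `X_test`, `Y_train`, and `Y_test` from the split above.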


### What You Could Do to Improve This System

**You probably noticed that our model's score is *very* low (it may even be negative, which means the model did worse than always guessing the average). Don't worry! This is normal -- let's talk about how you could fix that.**

#### 1. Include more data
Machine learning models need as much data as they can get in order to make the most educated estimates. Our datasets contain roughly 10 days' worth of stock data -- imagine how much better it would be if we had access to 10 years' worth.

#### 2. Include more variables
Trying to predict stock prices based on news sentiments is like trying to predict the weather based on the average humidity. Both stock prices and the weather are very chaotic systems -- drastic changes can occur suddenly and unpredictably. In order to get better at predicting stock prices, we need not only more data, but more *types* of data.

In this workshop, we used news sentiment as our only input. We could also gather data on the daily market average, the time of year, the time-proximity to nearby holidays, and so much more. In general, the more relevant data a model has to work with, the better its predictions can be.
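
Adding a variable mostly comes down to adding a column to X. Here is a minimal sketch, assuming a hypothetical `Market Change` column that does not exist in our datasets; all of the values are randomly generated for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data: "Market Change" is a hypothetical extra column, not part of our datasets
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "Sentiment": rng.uniform(5, 12, size=60),
    "Market Change": rng.normal(0, 1, size=60),
})
demo["Stock Change"] = (
    0.1 * demo["Sentiment"]
    + 1.5 * demo["Market Change"]
    + rng.normal(0, 0.3, size=60)
)

# Each column listed here becomes one input variable for the model
X_multi = demo[["Sentiment", "Market Change"]]
y_multi = demo["Stock Change"]

multi_model = LinearRegression().fit(X_multi, y_multi)

# One learned coefficient per input column
print(dict(zip(X_multi.columns, multi_model.coef_)))
```

The rest of the workflow (splitting, training, scoring) stays the same; only the list of columns in X changes.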

## Appendix: How We Collected Our Datasets
We used the [Alpha Vantage API](https://www.alphavantage.co/documentation/) to collect stock data on a daily basis, and the [News API](https://newsapi.org/) to gather news articles from the past month. To create average news sentiments, we used the VADER analyzer from the [Natural Language Toolkit](https://www.nltk.org/) (NLTK).
