- This project uses sentiment analysis of tweets from Twitter to help make predictions about the stock market. Each tweet is associated with a particular stock symbol when it contains #(stock symbol) or $(stock symbol). For example, a tweet containing #SP500 or $SP500 is assumed to be related to the S&P 500 index.
Tweets were gathered using the [Tweepy](http://www.tweepy.org/) Python library. Tweets were streamed in real time and saved to a MongoDB database. Between 4 and 6 million tweets were gathered per day.
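As a rough illustration of the tagging rule above, here is a minimal sketch of pulling symbols out of tweet text with a regular expression. The function name `extract_symbols` and the symbol pattern (a letter followed by up to five letters or digits) are assumptions made for this example, not the code in save_stock_tweets.py.

```python
import re

# Matches "#SYMBOL" or "$SYMBOL". The shape of a symbol (one letter followed
# by up to five letters or digits) is an assumption made for this sketch.
SYMBOL_PATTERN = re.compile(r"[#$]([A-Za-z][A-Za-z0-9]{0,5})\b")


def extract_symbols(tweet_text):
    """Return the set of upper-cased symbols tagged in a tweet."""
    return {symbol.upper() for symbol in SYMBOL_PATTERN.findall(tweet_text)}


if __name__ == "__main__":
    print(extract_symbols("Watching $SP500 and #tsla today"))
```

Note that a pattern like this also matches short ordinary hashtags such as #happy, so in practice the result would likely need to be checked against a list of known symbols.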
See save_stock_tweets.py for the code.
Both historical and current stock quotes were gathered via the [Yahoo Finance](https://pypi.python.org/pypi/yahoo-finance) Python library.
See yahoo_quotes.py for the code. This includes some data cleaning and preliminary modeling.
First Attempt
My first attempt at getting stock data involved scraping the NASDAQ website in real time for current and historic stock quotes. See scrape_nasdaq.py for the code. I ended up not using this method because getting quotes this way was very time consuming, which made it impractical given that I wanted to live stream quotes in a web app.
An easy way to get an idea of what your data is doing is to visualize it. For this project I used TF-IDF and Nonnegative Matrix Factorization (NMF) to get an easily interpretable result to graph and model.
So what does this tell me? The blue line represents the closing price for a stock symbol on a given day, and the red lines represent the NMF values for that stock symbol on the same day. What I can see from this is that when the red lines go up, the stock price also goes up the next day. The same may be true when the market goes down.
See clustering.py for the code.
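For readers unfamiliar with the TF-IDF plus NMF combination, the following is a minimal sketch using scikit-learn. The use of scikit-learn, the toy documents, and the choice of two components are all assumptions made for illustration; clustering.py may do this differently.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the tweet text gathered for a symbol on a given day.
# These strings are hypothetical and only exist to make the sketch runnable.
documents = [
    "tesla stock rally strong earnings",
    "tesla earnings beat expectations rally",
    "market drop fear selloff",
    "selloff fear market crash drop",
]

# Turn each document into a row of TF-IDF weights (documents x vocabulary).
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(documents)

# Factor the TF-IDF matrix into two nonnegative matrices:
# doc_topic (documents x components) and topic_term (components x vocabulary).
nmf = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
doc_topic = nmf.fit_transform(tfidf)
topic_term = nmf.components_

if __name__ == "__main__":
    print(doc_topic.shape, topic_term.shape)
```

Each row of `doc_topic` is a small set of nonnegative values per document, which is the kind of quantity that can be plotted next to a closing price.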
I can also get an idea of what people are saying about a particular stock symbol by looking at the most used words that relate to it. Enter the word cloud:
To start, I used a Random Forest Classifier to see if I could simply identify whether a particular stock symbol would increase or decrease in value the following day. This approach was getting close to 70% accuracy, so I decided to move on to a Random Forest Regression model to get an idea of where a stock price would close the next day. To evaluate this approach I used the MSE (Mean Squared Error) and the RMSE (Root Mean Squared Error).
This image shows the closing prices for a week's worth of data for the TSLA (Tesla) stock symbol. The red box to the right of the graph shows where my model is predicting the market will close for that day. (You will probably notice that two points are missing here. Those dates fell on a Saturday and a Sunday, so there are no closing prices for those days.)
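To make the two targets concrete, here is a small sketch of how next-day up/down labels and the MSE and RMSE error measures can be computed in plain Python. The helper names are hypothetical and the prices in the usage lines are made-up numbers, not project results.

```python
import math


def next_day_direction(closes):
    """Label each day 1 if the following close is higher, otherwise 0.

    The last day has no following close, so the result is one item shorter.
    """
    return [1 if nxt > cur else 0 for cur, nxt in zip(closes, closes[1:])]


def mse(actual, predicted):
    """Mean Squared Error: the average of the squared prediction errors."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Squared Error: the square root of the MSE."""
    return math.sqrt(mse(actual, predicted))


if __name__ == "__main__":
    closes = [10.0, 11.0, 10.5, 10.5]  # made-up closing prices
    print(next_day_direction(closes))
    print(rmse([10.0, 11.0], [10.5, 10.0]))
```

The RMSE is in the same units as the closing price, which makes it easier to read as "how far off a predicted close typically is" than the MSE.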
NMF and Regression
When working with Nonnegative Matrix Factorization, or NMF, you need a way to figure out the best number of features to use. For this I gauged how the number of features changed the MSE of the Regression model. That code can be found in model_validation.py. It is basically my version of grid searching over different numbers of NMF features and different Random Forest metrics.
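The idea of choosing the number of NMF features by watching the regression MSE can be sketched as follows. This is not the code in model_validation.py: it assumes scikit-learn and NumPy, uses random synthetic data in place of the real TF-IDF matrix and closing prices, and the candidate values (2, 3, 4) are arbitrary.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins: X plays the role of a nonnegative TF-IDF matrix and
# y the role of next-day closing prices. Both are random, for illustration.
rng = np.random.default_rng(0)
X = rng.random((40, 12))
y = rng.random(40)


def mse_for_n_features(n_components):
    """Cross-validated MSE of a random forest trained on NMF features."""
    features = NMF(
        n_components=n_components, init="nndsvd", random_state=0, max_iter=500
    ).fit_transform(X)
    model = RandomForestRegressor(n_estimators=20, random_state=0)
    # scikit-learn reports negated MSE so that higher is better; flip the sign.
    scores = cross_val_score(
        model, features, y, cv=3, scoring="neg_mean_squared_error"
    )
    return -scores.mean()


# Try each candidate number of features and keep the one with the lowest MSE.
results = {n: mse_for_n_features(n) for n in (2, 3, 4)}
best_n = min(results, key=results.get)

if __name__ == "__main__":
    print(results, best_n)
```

For simplicity the sketch fits NMF on all rows before cross-validating. A stricter setup would fit NMF inside each fold and, for time-ordered stock data, use a split that respects date order.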
Finally I wanted to turn this project into a usable application. To do this I used Flask to create a web application that could allow a user to search different stock symbols, live stream stock quotes, give historical stock data, and display the predictions my model was making for the different stock symbols.
In the end I believe that using unsupervised learning techniques, like Nonnegative Matrix Factorization, is a great way to fuel supervised learning techniques like Random Forest Regression. I used a lot of new technologies in this project and learned a lot in the process. I hope that this project has shown that I am a capable Data Scientist, Application Developer, and Interface Designer. These are three areas that I greatly enjoy working in.