# CSS688 Term Project Option 3
# Twitter Stocker
## Author: Mike Zhong
## Using the Twitter API to implement sentiment analysis on different sets of stocks
## https://github.com/myz540/twitter_stocker

## Setup

* Create a Twitter account (out of scope here) and obtain a set of `OAuth` credentials through the developer section
* `import` dependencies (see requirements.txt for specific version numbers)
* `import twitter_utils`, my custom library containing classes to handle the work

## 1) Import the necessary libraries 

In [64]:
# comes with python 3.6
import datetime
import re
import configparser

In [65]:
# comes with anaconda
import pandas as pd
import tweepy
from tweepy import OAuthHandler, Stream, StreamListener
import pandas_datareader.data as pdr
import pytz
import nltk

In [66]:
# This is my custom module and contains classes which implement all the heavy lifting. It merits a good read and implements a
# (hopefully) easy-to-use interface for developers to build off of.
from twitter_utils import *

In [67]:
config = configparser.ConfigParser()

In [68]:
config.read('config/keys.txt')

['config/keys.txt']

## 2) Authenticate User: ideally, you should use your own login credentials

In [69]:
consumer_key = config['DEFAULT']['consumer_key']
consumer_secret = config['DEFAULT']['consumer_secret']
access_token = config['DEFAULT']['access_token']
access_secret = config['DEFAULT']['access_secret']
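The lookups above imply that `config/keys.txt` is an INI-style file with the four credentials under the `DEFAULT` section. The following is a minimal sketch of that layout, assuming placeholder values; `SAMPLE_KEYS`, `sample_config`, and `REQUIRED_KEYS` are names introduced here for illustration and are not part of `twitter_utils`.

```python
import configparser

# Hypothetical file contents: the values are placeholders, not real credentials.
SAMPLE_KEYS = """
[DEFAULT]
consumer_key = YOUR_CONSUMER_KEY
consumer_secret = YOUR_CONSUMER_SECRET
access_token = YOUR_ACCESS_TOKEN
access_secret = YOUR_ACCESS_SECRET
"""

# Parse the sample text the same way config.read() parses the file on disk.
sample_config = configparser.ConfigParser()
sample_config.read_string(SAMPLE_KEYS)

# Check up front that every credential the notebook looks up is present.
REQUIRED_KEYS = ("consumer_key", "consumer_secret", "access_token", "access_secret")
missing = [key for key in REQUIRED_KEYS if key not in sample_config["DEFAULT"]]
if missing:
    raise KeyError(f"keys file is missing: {missing}")
```

Checking for missing keys before authenticating gives a clearer error than a failed API call later on.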

In [70]:
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

# plug into the matrix
api = tweepy.API(auth)
api

<tweepy.api.API at 0x2581789ba90>

## 3) Load the pre-populated list of NASDAQ companies
This CSV file was downloaded from the internet after a simple Google search.

In [71]:
ticker_df = pd.read_csv('files/companylist.csv')

In [72]:
ticker_df.head()

Unnamed: 0,Symbol,Name,LastSale,MarketCap,IPOyear,Sector,industry,Summary Quote,Unnamed: 8
0,YI,"111, Inc.",6.51,$530.85M,2018.0,Health Care,Medical/Nursing Services,https://www.nasdaq.com/symbol/yi,
1,PIH,"1347 Property Insurance Holdings, Inc.",5.2899,$31.81M,2014.0,Finance,Property-Casualty Insurers,https://www.nasdaq.com/symbol/pih,
2,PIHPP,"1347 Property Insurance Holdings, Inc.",24.5,$17.15M,,Finance,Property-Casualty Insurers,https://www.nasdaq.com/symbol/pihpp,
3,TURN,180 Degree Capital Corp.,1.86,$57.89M,,Finance,Finance/Investors Services,https://www.nasdaq.com/symbol/turn,
4,FLWS,"1-800 FLOWERS.COM, Inc.",18.34,$1.18B,1999.0,Consumer Services,Other Specialty Stores,https://www.nasdaq.com/symbol/flws,


In [73]:
ticker_df.shape

(3428, 9)

## 4) Determine the Gainers and Losers for a given day
The `StockHandler` object from `twitter_utils` is a custom object implemented to find the gainers and losers from a given day. The object wraps the `pandas_datareader` and makes calls to the `iex` financial database in order to get stock price information. The `StockHandler` also implements methods for computing the gain/loss or `diff`, as well as finding the three winning and losing stocks for a given day.

There are several matters to address here:

1) How do we define a gain or a loss?

* A gain or loss is calculated as `(price(close) - price(open)) / price(open)`
* Implemented in twitter_utils.StockHandler

2) What day should we use when finding the gainers and losers?

* Technically, any day can be passed to the `StockHandler` methods; I will use yesterday

3) How can we ensure the tweets fetched are relevant to the given day?

* The implementing methods check each tweet's timestamp and keep only tweets posted before the day in question. Tweets are returned most recent first, so the ones we keep are those closest to that day
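The timestamp check in point 3 can be sketched as follows. This is an illustration only, not the code in `twitter_utils`: `tweets_before` is a hypothetical helper, the tweets are plain `(created_at, text)` tuples standing in for API results, and the cutoff is assumed to be midnight UTC at the start of the day in question.

```python
import datetime

UTC = datetime.timezone.utc

def tweets_before(tweets, day):
    """Keep (created_at, text) pairs posted strictly before midnight UTC of `day`."""
    cutoff = datetime.datetime(day.year, day.month, day.day, tzinfo=UTC)
    return [(created_at, text) for created_at, text in tweets if created_at < cutoff]

# Made-up tweets, ordered most recent first as a search would return them.
day = datetime.date(2018, 12, 3)
sample_tweets = [
    (datetime.datetime(2018, 12, 4, 9, 30, tzinfo=UTC), "posted after the day"),
    (datetime.datetime(2018, 12, 3, 0, 0, tzinfo=UTC), "posted exactly at the cutoff"),
    (datetime.datetime(2018, 12, 2, 23, 59, tzinfo=UTC), "posted just before the day"),
]
kept = tweets_before(sample_tweets, day)
```

Only the last sample tweet passes the filter, since the comparison is strict.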

To find the gainers and losers, we create a dictionary with the company's ticker as the key and the `diff` as the value, where the `diff` is computed as stated above.
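As a rough sketch of the two steps described above (computing the `diff` and picking the three biggest gainers and losers), assuming open and close prices are already available: `relative_diff` and `top_gainers_and_losers` are hypothetical helpers written for this example, not the `StockHandler` implementation, and the tickers and prices are made up.

```python
def relative_diff(open_price, close_price):
    """Relative change over the day: (close - open) / open."""
    return (close_price - open_price) / open_price

def top_gainers_and_losers(diff_dict, n=3):
    """Return two dicts: the n largest diffs and the n smallest diffs."""
    ranked = sorted(diff_dict.items(), key=lambda item: item[1], reverse=True)
    gainers = dict(ranked[:n])
    # Reverse the tail so the biggest loser comes first.
    losers = dict(ranked[-n:][::-1])
    return gainers, losers

# Made-up (open, close) prices keyed by made-up tickers.
prices = {
    "AAA": (10.0, 11.0),
    "BBB": (20.0, 19.0),
    "CCC": (5.0, 5.0),
    "DDD": (8.0, 10.0),
    "EEE": (4.0, 3.0),
    "FFF": (50.0, 51.0),
    "GGG": (100.0, 90.0),
}
example_diffs = {ticker: relative_diff(o, c) for ticker, (o, c) in prices.items()}
gainers, losers = top_gainers_and_losers(example_diffs)
```

With fewer than `2 * n` tickers the two dictionaries would overlap, so a real implementation may want to guard against that case.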

In [74]:
tickers = ticker_df['Symbol']
print(tickers.head())
print(len(tickers))

0       YI
1      PIH
2    PIHPP
3     TURN
4     FLWS
Name: Symbol, dtype: object
3428


In [75]:
# the date of the lookup will default to yesterday
diff_dict = StockHandler.get_all_diffs(tickers, limit=len(tickers))

TPNL
JFKKU
ACET
AKAO
ADILW


KeyboardInterrupt: 

In [None]:
winner_dict, loser_dict = StockHandler.find_gainers_and_losers(diff_dict)
print(winner_dict)
print(loser_dict)

## 5) Instantiate a CorpusHandler and begin querying twitter
Now that we know which companies we are interested in, we can start querying Twitter. In my first pass, I wrote a search function that queried only the ticker symbol, but it often could not find 100 tweets for small, little-known companies, which sometimes show up as winners or losers. To guard against this, I created an `extended_search` method that also queries the company name and, finally, the sector if the symbol alone does not return enough tweets.
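The fallback order described above (symbol, then company name, then sector) might look roughly like this. It is a sketch under stated assumptions rather than the actual `extended_search` method: `extended_search_sketch` and `search_fn` are hypothetical, `search_fn` stands in for whatever call fetches tweet texts for a query, and the de-duplication step is an assumption of this example.

```python
def extended_search_sketch(search_fn, symbol, name, sector, target=100):
    """Query symbol, then name, then sector until `target` tweets are collected."""
    collected = []
    seen = set()
    for query in (symbol, name, sector):
        if len(collected) >= target:
            break  # an earlier query already returned enough tweets
        for text in search_fn(query):
            if text in seen:
                continue  # skip tweets already returned by an earlier query
            seen.add(text)
            collected.append(text)
            if len(collected) >= target:
                break
    return collected

# Stand-in search backed by a dict, so the sketch runs without network access.
fake_results = {
    "FLWS": ["t1", "t2"],
    "1-800 FLOWERS.COM, Inc.": ["t2", "t3"],
    "Consumer Services": ["t4", "t5", "t6"],
}

def fake_search(query):
    return fake_results.get(query, [])

found = extended_search_sketch(
    fake_search, "FLWS", "1-800 FLOWERS.COM, Inc.", "Consumer Services", target=4
)
```

Passing the search function in as an argument keeps the fallback logic testable without live API credentials.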

The `CorpusHandler` object from `twitter_utils` is a custom class that handles much of the heavy lifting. It provides methods for treating a corpus as a string of tokens separated by whitespace or a delimiter of your choice, for converting a corpus into a list of tokens, and for saving and loading corpora. Its two attributes, `gainer_corpus` and `loser_corpus`, are populated when a corpus is read from disk and are also what gets written to disk when saving. Each is a `dict` mapping a company to its 100 tweets, stored as a `list` of strings. These strings are pre-processed in section 6 before ultimately populating the two attributes.
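To make the data shape concrete, here is a small sketch of a corpus `dict`, its conversion to tokens, and a save/load round trip. The helper names are hypothetical and the JSON on-disk format is an assumption of this example; the format `CorpusHandler` actually uses may differ.

```python
import json
import os
import tempfile

def corpus_to_tokens(tweets, delimiter=None):
    """Split each tweet on `delimiter` (whitespace when None) into one flat token list."""
    tokens = []
    for tweet in tweets:
        tokens.extend(tweet.split(delimiter))
    return tokens

def save_corpus(corpus, path):
    """Write a {company: [tweet, ...]} dict to disk as JSON (assumed format)."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(corpus, handle)

def load_corpus(path):
    """Read a corpus dict back from a JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)

# Made-up corpus: one company mapping to a short list of tweet strings.
example_corpus = {"FLWS": ["flowers stock up today", "buying more FLWS"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    corpus_path = os.path.join(tmp_dir, "gainer_corpus.json")
    save_corpus(example_corpus, corpus_path)
    restored = load_corpus(corpus_path)

tokens = corpus_to_tokens(restored["FLWS"])
```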

In [None]:
# create a CorpusHandler, custom object 
corpus_handler = CorpusHandler(api, ticker_df)