OK. Let's do this thing. 

# Basic Info
We want to create an LDA model for the Opioid Tweets.
The primary resource used in creating this notebook is [the following page](https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html). Most of the meat is from that site, but I'm sure there will be plenty of things we'll need to add for our own purposes. This is just to get the ball rolling.

The current skeleton for what needs to be done, along with associated questions:
* First, we need to establish some basic ideas for the tweets we will be analyzing:
    * We need to make sure that the only part of the data we are using is the actual text of the tweet. 
    * Our tokenization should be wary of contractions; i.e., we need to decide whether it's important that "don't" does not separate into "don" and "t".
    * We need to decide how the model will interpret emojis; i.e., can it treat them as their own words, or would it be better to remove them altogether?
    * Will we remove links completely, or will we replace each link with a common, unique word (like "tco" to represent a `t.co` link) to give the model a chance to group tweets with links together?
    * A similar question arises with regard to hashtags.
* Next, we create the model. 
    * Should we consider trying/comparing different token/stop-words/stem packages?
    * How many iterations? How many topics?
* And finally evaluate, improve, and compare to the other model types. 
    * What areas can be improved and how? (hyperparameters)
    * How do we evaluate the success of the model?
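
The contraction and hashtag questions above can be prototyped with a small regular expression before committing to a tokenizer package. The sketch below is only an illustration: `TOKEN_PATTERN` and `tokenize_tweet` are hypothetical names, and the pattern assumes straight apostrophes (a curly apostrophe would still split the word). Emoji are generally not word characters, so this pattern drops them rather than treating them as words.

```python
import re

# Hypothetical pattern: an optional leading '#' or '@', a run of word
# characters, then an optional apostrophe suffix so "don't" stays whole.
TOKEN_PATTERN = re.compile(r"[#@]?\w+(?:'\w+)?")


def tokenize_tweet(text):
    """Lowercase a tweet and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())


print(tokenize_tweet("Don't mix #opioids and alcohol"))
```

On this input the contraction `don't` and the hashtag `#opioids` each survive as a single token. Whether that is desirable depends on how the stop-word list and the stemmer treat such tokens, which would need checking.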



# The Model
We import the necessary libraries and then load the Twitter data.

In [16]:
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
import gensim
from gensim import corpora, models

import pandas as pd

In [17]:
df = pd.read_csv("data/opioid_tweets.csv")

# Keep only the tweet text column, as a plain Python list.
tweets = df['content'].tolist()

Here, I test and see there are a few non-string values in `tweets`:

In [18]:
counter = 0
print(len(tweets))
for i in tweets:
    if type(i) != str:
        print('error. index: ' + str(tweets.index(i)))
        counter+=1
        print(type(i))
        print(i)
print(counter)

42954
error. index: 14828
<class 'float'>
nan
error. index: 14828
<class 'float'>
nan
error. index: 14828
<class 'float'>
nan
error. index: 14828
<class 'float'>
nan
error. index: 14828
<class 'float'>
nan
error. index: 14828
<class 'float'>
nan
error. index: 14828
<class 'float'>
nan
error. index: 14828
<class 'float'>
nan
error. index: 14828
<class 'float'>
nan
error. index: 14828
<class 'float'>
nan
error. index: 14828
<class 'float'>
nan
error. index: 14828
<class 'float'>
nan
error. index: 14828
<class 'float'>
nan
error. index: 14828
<class 'float'>
nan
14


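The `tweets` list contained 14 `nan` values of type `float` up to this point. Every one of them reports index 14828 because `tweets.index(i)` returns the position of the first match rather than the position of the element currently being visited. This line removes them: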

In [19]:
tweets = [i for i in tweets if isinstance(i, str)]
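
To see where each non-string entry actually sits, `enumerate` can report the position of every element as it is visited. The sketch below uses toy data standing in for the real `tweets` list; `bad_positions` is a name introduced here for illustration.

```python
# Toy data standing in for the real `tweets` list.
nan = float("nan")
tweets = ["first tweet", nan, "second tweet", nan]

# enumerate pairs every element with its own position in the list.
bad_positions = [idx for idx, t in enumerate(tweets) if not isinstance(t, str)]
print(bad_positions)

# Keep only the entries that are real strings.
tweets = [t for t in tweets if isinstance(t, str)]
print(len(tweets))
```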

Below we set up the tokenizer, the stop-word list, and the stemmer needed for the model.

In [20]:
tokenizer = RegexpTokenizer(r'\w+')
en_stop = get_stop_words('en')
p_stemmer = PorterStemmer()

Here, I ensure common elements in links do not get included in the corpus (e.g., 'http', or the 't' and 'co' in "t.co"). The `for` loop with its membership check is there so that re-running the cell does not append duplicates, which means I don't have to restart the kernel every time I test something new. Otherwise this cell would simply be `en_stop += ['co', 't', 'http', 's']`.

In [21]:
added_stop_words = ['co', 't', 'http', 's']
for i in added_stop_words:
    if i not in en_stop:
        en_stop.append(i)
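
An alternative that is safe to re-run without the explicit check is to hold the stop words in a set. This is a minimal sketch under assumptions: `base_stop_words` is a short stand-in for the list returned by `get_stop_words('en')`, and `link_fragments` is a name introduced here.

```python
# Stand-in for get_stop_words('en'); the real list is much longer.
base_stop_words = ["a", "the", "and"]
link_fragments = ["co", "t", "http", "s"]

en_stop = set(base_stop_words)
en_stop.update(link_fragments)
en_stop.update(link_fragments)  # running the update again changes nothing

print(sorted(en_stop))
```

A set also makes the `in` test during stop-word filtering a constant-time lookup instead of a scan through a list.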

In [22]:
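texts = []
for tweet in tweets:
    # Lowercase, then split into word tokens.
    raw = tweet.lower()
    tokens = tokenizer.tokenize(raw)

    # Drop stop words, then reduce each remaining token to its stem.
    stopped_tokens = [t for t in tokens if t not in en_stop]
    stemmed_tokens = [p_stemmer.stem(t) for t in stopped_tokens]

    texts.append(stemmed_tokens)

# Map each stem to an integer id, then convert every tweet to bag-of-words.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]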

Here we run the model, where `num_topics=3` means that we are looking for 3 general topics to interpret and `passes=20` sets the number of passes over the corpus during training.

In [23]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=3, id2word = dictionary, passes=20)

In [24]:
print(ldamodel.print_topics(num_topics=5, num_words=5))

[(0, '0.053*"http" + 0.027*"morphin" + 0.021*"codein" + 0.015*"crazi" + 0.013*"pain"'), (1, '0.076*"http" + 0.050*"fentanyl" + 0.014*"oxycontin" + 0.014*"drug" + 0.010*"heroin"'), (2, '0.030*"http" + 0.025*"percocet" + 0.019*"codein" + 0.018*"like" + 0.015*"vicodin"')]


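As is evident, `http` still leads every topic even though it is in the stop-word list. A likely cause is that `https` passes the stop-word filter and is then stemmed to `http`. Either way, we need to account for links, both those that start with `http` and the Twitter version, `t.co`.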
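
One way to handle links is to deal with them before tokenization, so that no fragment of a URL ever reaches the stop-word filter or the stemmer. The sketch below is an assumption about how that could look, not a tested part of the pipeline: `URL_PATTERN` and `replace_links` are hypothetical names, and the pattern only covers URLs that begin with `http://` or `https://`.

```python
import re

# Matches an http or https URL up to the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")


def replace_links(text, placeholder="tco"):
    """Replace every URL with a placeholder word; pass "" to drop links."""
    return URL_PATTERN.sub(placeholder, text)


print(replace_links("Fentanyl seizures up https://t.co/abc123 #opioids"))
print(replace_links("Fentanyl seizures up https://t.co/abc123 #opioids", ""))
```

With the default placeholder, each link collapses to the single token `tco`, which gives the model a chance to group tweets that contain links. Passing an empty string removes links entirely.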

# Things to work on
* As of now, the main priority is accounting for the links. 
* Another thing to address after the links is the prominent theme of Future's "Codeine Crazy" and Apple Music coming up. 
* Once the above issues are taken care of, the main focus shifts to tweaking the model to give the most helpful image of the data and how to compare it to the other models the team has tried. 
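
For the tweaking and comparison step, one commonly used yardstick is topic coherence. The sketch below hand-computes the UMass variant on toy data so the idea is visible without any library; `umass_coherence` and the toy documents are illustrative assumptions, and gensim also ships a `CoherenceModel` class that would be the more practical choice on the real corpus.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass coherence for one topic; top_words is ordered most probable first."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    # Each pair is (higher-ranked word, lower-ranked word).
    for earlier, later in combinations(top_words, 2):
        denom = doc_freq(earlier)
        if denom == 0:
            continue  # skip words that never occur in the documents
        score += math.log((doc_freq(earlier, later) + 1) / denom)
    return score


# Toy stemmed documents, not real tweets.
docs = [
    ["fentanyl", "heroin", "drug"],
    ["fentanyl", "heroin"],
    ["codein", "crazi"],
    ["codein", "crazi", "pain"],
]
coherent = umass_coherence(["fentanyl", "heroin"], docs)
mixed = umass_coherence(["fentanyl", "crazi"], docs)
print(coherent > mixed)
```

Scores closer to zero (or above it) indicate that a topic's top words tend to appear in the same documents, so the same function could be applied to the top words of each candidate model for a rough comparison.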