# Twitter Tweets Topic Modeling

### Background
This project uses the Sentiment140 dataset of 1.6 million tweets (https://www.kaggle.com/kazanova/sentiment140). It continues the Twitter Sentiment Analysis that my group and I completed as the final project for the Applied Machine Learning course at Wright State University. (direct to Github page for Twitter Sentiment Analysis report)

### Plan
Using the preprocessed tweets dataset, I want to find the most common topics people tweeted about, using topic modeling with LDA (Latent Dirichlet Allocation).
* Generate 5 to 20 topics from the dataset
* Create a word cloud based on word frequency
* Create an LDA visualization
* Generate a graph and table of perplexity vs. number of topics
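
The word cloud step in the plan above comes down to counting how often each token occurs across the preprocessed tweets. Below is a minimal sketch of that counting using only the standard library. The names `sample_docs` and `token_frequencies` are hypothetical and the sample tokens are made up for illustration; they are not results from the dataset.

```python
from collections import Counter

# Hypothetical sample shaped like the preprocessed tweets: one token list per tweet
sample_docs = [
    ['sick', 'stress', 'need', 'sleep', 'fault'],
    ['need', 'sleep', 'work'],
    ['work', 'tomorrow', 'need'],
]

def token_frequencies(docs):
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for tokens in docs:
        counts.update(tokens)
    return counts

frequencies = token_frequencies(sample_docs)
# Show the three most frequent tokens with their counts
print(frequencies.most_common(3))
```

A mapping like `frequencies` is the kind of input a word cloud library can draw from, for example through `WordCloud.generate_from_frequencies` in the `wordcloud` package, assuming that package is the one used for the plot.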


In [33]:
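import pandas as pd

# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
data = pd.read_csv('tweets-clean.csv', on_bad_lines='skip')
# .copy() avoids a SettingWithCopyWarning when the Index column is added
data_text = data[['Tweets']].copy()
data_text['Index'] = data_text.index
documents = data_text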

print(len(documents))
print(documents[:5])

1046516
                                              Tweets  Index
0  awww thats a bummer  you shoulda got david car...      0
1  is upset that he cant update his facebook by t...      1
2   i dived many times for the ball managed to sa...      2
3    my whole body feels itchy and like its on fire       3
4   no its not behaving at all im mad why am i he...      4


In [34]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import PorterStemmer

import numpy as np
np.random.seed(2018)

import nltk
nltk.download('wordnet')


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\arcit\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [35]:
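from nltk import PorterStemmer

# Build the stemmer and lemmatizer once instead of once per token
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(text):
    # Lemmatize as a verb first, then reduce the result to its Porter stem
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

def preprocess(text):
    # Tokenize, drop stopwords and tokens of 3 characters or fewer, then stem
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result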

In [36]:
doc_sample = documents[documents['Index'] == 4310].values[0][0]

print('original document: ')
words = doc_sample.split(' ')
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))

original document: 
['i', 'am', 'so', 'sick', 'from', 'stress', 'gah', 'and', 'i', 'need', 'sleep', 'and', 'it', 'is', 'all', 'my', 'fault', '']


 tokenized and lemmatized document: 
['sick', 'stress', 'need', 'sleep', 'fault']


In [37]:
processed_docs = documents['Tweets'].map(preprocess)
processed_docs[:10]

0           [awww, that, bummer, shoulda, david, carr]
1    [upset, updat, facebook, text, result, school,...
2         [dive, time, ball, manag, save, rest, bound]
3                            [bodi, feel, itchi, like]
4                                              [behav]
5                                               [crew]
6                                               [need]
7                 [long, time, rain, fine, thank, how]
8                                        [nope, didnt]
9                                              [muera]
Name: Tweets, dtype: object

In [38]:
dictionary = gensim.corpora.Dictionary(processed_docs)
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
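
The `filter_extremes` call above prunes the vocabulary by document frequency: with these settings, tokens that appear in fewer than 15 tweets or in more than 50% of tweets are dropped, and at most 100000 of the remaining tokens are kept. The pure-Python sketch below illustrates that rule on a toy corpus. It is an approximation for explanation only, not gensim's implementation, and `filter_extremes_sketch` and `toy_docs` are hypothetical names.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below=15, no_above=0.5, keep_n=100000):
    """Return the tokens that survive document-frequency filtering."""
    # Document frequency: number of documents containing each token at least once
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    # Convert the no_above fraction into an absolute document count
    max_docs = int(no_above * len(docs))
    kept = [tok for tok, df in doc_freq.items() if no_below <= df <= max_docs]

    # Keep only the keep_n tokens with the highest document frequency
    kept.sort(key=lambda tok: doc_freq[tok], reverse=True)
    return kept[:keep_n]

# Toy corpus: 'need' is in every document, 'rain' is in only one
toy_docs = [
    ['need', 'sleep'],
    ['need', 'work'],
    ['need', 'sleep', 'work'],
    ['need', 'rain'],
]
# Thresholds are scaled down so the toy corpus shows both cut-offs
print(filter_extremes_sketch(toy_docs, no_below=2, no_above=0.75))
```

In the toy run, `need` is removed for being too common and `rain` for being too rare, which is the same trade-off the notebook's thresholds make on the full tweet corpus.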

In [39]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
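
Each entry of `bow_corpus` is a bag-of-words vector: a list of `(token_id, count)` pairs for the tokens of one tweet that are still in the dictionary. The sketch below mimics that conversion in plain Python to show the shape of the data. The function `doc2bow_sketch` and the ids in `toy_token2id` are hypothetical; gensim assigns its own ids.

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Convert a token list into sorted (token_id, count) pairs.

    Tokens missing from token2id are ignored, as happens to words
    that were filtered out of the dictionary.
    """
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

# Hypothetical vocabulary with made-up ids
toy_token2id = {'need': 0, 'sleep': 1, 'work': 2}

# 'coffee' is not in the vocabulary, so it does not appear in the result
toy_bow = doc2bow_sketch(['need', 'sleep', 'need', 'coffee'], toy_token2id)
print(toy_bow)
```

Word order is discarded in this representation, so LDA only sees which words occur in a tweet and how often.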

In [40]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)

In [41]:
# How are the topics derived? LDA treats each tweet as one document and learns the topics across the whole corpus.
# Each topic is a weighted list of words, so its top words can come from many different tweets.
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.026*"like" + 0.023*"littl" + 0.018*"girl" + 0.018*"think" + 0.016*"sweet" + 0.016*"didnt" + 0.015*"funni" + 0.014*"cute" + 0.014*"guess" + 0.014*"quot"
Topic: 1 
Words: 0.154*"http" + 0.066*"quot" + 0.037*"twitpiccom" + 0.030*"bitli" + 0.029*"say" + 0.016*"tinyurlcom" + 0.015*"check" + 0.013*"head" + 0.013*"look" + 0.012*"like"
Topic: 2 
Words: 0.115*"love" + 0.052*"know" + 0.034*"haha" + 0.029*"your" + 0.029*"that" + 0.024*"miss" + 0.022*"cool" + 0.020*"thank" + 0.020*"song" + 0.018*"dont"
Topic: 3 
Words: 0.055*"work" + 0.044*"go" + 0.034*"today" + 0.031*"tomorrow" + 0.031*"weekend" + 0.029*"week" + 0.029*"home" + 0.024*"time" + 0.024*"wait" + 0.020*"friday"
Topic: 4 
Words: 0.076*"watch" + 0.061*"want" + 0.040*"dont" + 0.030*"wish" + 0.024*"movi" + 0.017*"tonight" + 0.017*"know" + 0.015*"like" + 0.011*"think" + 0.011*"drink"
Topic: 5 
Words: 0.054*"need" + 0.024*"help" + 0.023*"talk" + 0.017*"read" + 0.016*"final" + 0.015*"book" + 0.015*"write" + 0.014*"let" + 0.0