# Twitter Topic Modeling

### Background
This project is based on the Sentiment140 dataset of 1.6 million tweets (https://www.kaggle.com/kazanova/sentiment140). It is a continuation of the Twitter Sentiment Analysis that my group and I did as a final project for the Applied Machine Learning course at Wright State University. (direct to Github page for Twitter Sentiment Analysis report)

### Plan
Using the preprocessed tweets dataset, I want to find the topics that people tweeted about most, with the help of topic modeling using LDA (Latent Dirichlet Allocation).
* Generate 5 to 20 topics from the dataset
* Create a word cloud based on word frequency
* Create an LDA visualization
* Generate a graph and table of perplexity vs. number of topics
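
The word cloud in the plan is driven by how often each token occurs. A minimal sketch of that counting step, using only the standard library; the `sample_docs` list and the `word_frequencies` helper are hypothetical stand-ins for the processed tweets, not part of this notebook:

```python
from collections import Counter

def word_frequencies(docs):
    """Count how often each token appears across a list of token lists."""
    counts = Counter()
    for tokens in docs:
        counts.update(tokens)
    return counts

# Hypothetical sample: each inner list is one tokenized tweet
sample_docs = [
    ['feel', 'like', 'puke'],
    ['feel', 'great', 'thank'],
    ['like', 'rain'],
]

freqs = word_frequencies(sample_docs)
# The most frequent tokens are the ones a word cloud would draw largest
print(freqs.most_common(2))
```

A `Counter` is a plain mapping from token to count, which is the kind of frequency input a word-cloud library typically works from.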


## Importing Data

In [93]:
# package used for importing csv
import pandas as pd

In [94]:
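# on_bad_lines='skip' needs pandas 1.3 or newer; older versions used error_bad_lines=False
data = pd.read_csv('tweets-clean.csv', on_bad_lines='skip')
# Copy the column so that adding 'index' does not trigger SettingWithCopyWarning
data_text = data[['Tweets']].copy()
data_text['index'] = data_text.index
documents = data_text
print(documents)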

                                                    Tweets    index
0        awww thats a bummer  you shoulda got david car...        0
1        is upset that he cant update his facebook by t...        1
2         i dived many times for the ball managed to sa...        2
3          my whole body feels itchy and like its on fire         3
4         no its not behaving at all im mad why am i he...        4
...                                                    ...      ...
1046511  back home thought id done for the week but jus...  1046511
1046512           my grandma is making dinenr with my mum   1046512
1046513  mid morning snack time a bowl of cheese noodle...  1046513
1046514   same here  say it like from the terminiator m...  1046514
1046515                             im great thaanks  wbuu  1046515

[1046516 rows x 2 columns]


In [95]:
#Clean this up - do import on the cell that uses it

import gensim
from gensim.utils import simple_preprocess

import nltk


import numpy as np
np.random.seed(2018)

nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\arcit\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Text Processing

### Removing Special Characters
Special characters add noise to the tweets and have to be removed.

In [106]:
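import re

def removeSpecialCharacter(text, remove_digits=False):
    # Keep letters, whitespace and (optionally) digits; drop everything else.
    # The range must be A-Z: A-z would also match [ \ ] ^ _ and `
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text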

#this is the tweet that we will remove special characters from
doc_sample = documents[documents['index'] == 4312].values[0][0]

print('Before removing special characters: ')
print(doc_sample)

print('\nAfter special characters removed: ')
doc_sample = removeSpecialCharacter(doc_sample, remove_digits=True)
print(doc_sample)

Before removing special characters: 
blehhhhhhhhh way to early  *feels like puking* 

After special characters removed: 
blehhhhhhhhh way to early  feels like puking 
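
One subtlety with character classes like the one used here: the range `A-z` spans ASCII codes 65 to 122, which includes the punctuation characters between `Z` and `a`, while `A-Z` covers only the uppercase letters. A small sketch of the difference, using a made-up input string:

```python
import re

text = 'so_tired [again] ^_^'

# A-z spans ASCII 65-122, which also covers [ \ ] ^ _ and `
loose = re.sub(r'[^a-zA-z\s]', '', text)
# A-Z covers only the uppercase letters
strict = re.sub(r'[^a-zA-Z\s]', '', text)

print(repr(loose))
print(repr(strict))
```

With the `A-z` range the brackets, carets, and underscores survive; with `A-Z` only letters and whitespace remain.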


### Removing Stop Words
Stop words are words that are filtered out before or after natural language processing. They are the most common words in a language, such as "a", "an", and "the", and carry little meaning on their own.

In [107]:
#download stopwords
nltk.download('stopwords')
from nltk.corpus import stopwords

#Print the stop words - these are the words that will be removed
stop_words = set(stopwords.words('english'))
print(stop_words)

{'his', "needn't", 'only', 'on', 'to', "you'd", 'are', "haven't", 'do', 'can', "don't", 'which', 'a', 'all', 'than', 'you', 'why', 'from', 'themselves', 'been', 'what', 'so', 'himself', 'this', 'other', 'that', 'not', 'those', 'i', 'them', 'y', 'does', 'again', 'her', 'myself', "won't", 'just', 'nor', 'they', 'between', "couldn't", "it's", 'ourselves', 'through', 'should', 'haven', 'but', 'these', 'more', 'with', 'hasn', 'shouldn', "should've", 'itself', 'further', 'wouldn', 'isn', 'have', "shouldn't", 'were', 'below', 'some', 'yourself', 'after', 'aren', "didn't", 'over', 'by', 've', 'm', 'above', 'it', 'then', 'very', 'she', 'no', 'hadn', 'how', 'here', "weren't", 'mightn', 'wasn', 'under', 'ain', 'own', 'needn', 'has', 'while', 'our', 'at', "you'll", 'mustn', 'because', 'when', "hasn't", 'and', 'yours', 'there', "hadn't", 'had', 'where', 'such', 'their', 'we', 'against', 'each', 'few', 'weren', 'he', 'into', "mustn't", 'during', 'ours', 'up', 'didn', 'before', 'until', 'most', "aren

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\arcit\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Tokenization
Tokenization splits a sentence into smaller pieces, such as words, called tokens.

In [None]:
doc_sample = documents[documents['index'] == 4310].values[0][0]

In [108]:
# Tokenize
from nltk.tokenize import word_tokenize

tokens = word_tokenize(doc_sample)

print('Before Tokenization: ')
print(doc_sample)

print('\nAfter Tokenization: ')
print(tokens)

Before Tokenization: 
blehhhhhhhhh way to early  feels like puking 

After Tokenization: 
['blehhhhhhhhh', 'way', 'to', 'early', 'feels', 'like', 'puking']
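
For text that has already been lowercased and stripped of punctuation, as in this sample, tokenization is close to splitting on whitespace. A minimal sketch without NLTK; `whitespace_tokenize` is a hypothetical helper, not a replacement for `word_tokenize` on raw text:

```python
def whitespace_tokenize(text):
    """Split text on any run of whitespace and drop empty pieces."""
    return text.split()

sample = 'blehhhhhhhhh way to early  feels like puking '

tokens = whitespace_tokenize(sample)
print(tokens)

# Splitting on a single space instead leaves empty strings behind
# wherever the text has double or trailing spaces
print(sample.split(' '))
```

Calling `split()` with no argument collapses repeated spaces, whereas `split(' ')` yields empty tokens that would need to be filtered out later.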


### Tokenization and Removing Stop Words
Using the tokenized sentence, we now remove the stop words. It is important to tokenize the sentence or document first, so that stop words are matched against whole tokens rather than parts of words.

In [109]:
# Keep only the tokens that are not stop words
processed_sample = [w for w in tokens if w not in stop_words]

print('Before removing stopwords: ')
print(tokens)

print('\nAfter removing stopwords: ')
print(processed_sample)

Before removing stopwords: 
['blehhhhhhhhh', 'way', 'to', 'early', 'feels', 'like', 'puking']

After removing stopwords: 
['blehhhhhhhhh', 'way', 'early', 'feels', 'like', 'puking']
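
To see why tokenizing comes first, compare deleting stop words as substrings of the raw string with filtering whole tokens. This is a minimal sketch; the two-word `stop_words` set and the sample sentence are made up for illustration:

```python
stop_words = {'to', 'the'}  # hypothetical, tiny stop-word set

text = 'way to early today'

# Naive approach: delete stop words as substrings of the raw string
naive = text
for sw in stop_words:
    naive = naive.replace(sw, '')

# Token-based approach: compare whole tokens against the set
filtered = [w for w in text.split() if w not in stop_words]

print(repr(naive))
print(filtered)
```

The substring approach turns "today" into "day" because it contains "to", while the token-based filter removes only the standalone word.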


### Stemming
Stemming reduces inflected words to a root form, mapping a group of related words to the same stem even if the stem itself is not a valid word in the language (here, English).

Inflection is the modification of a word to express grammatical categories such as tense, case, voice, aspect, person, number, gender, and mood. An inflection expresses one or more of these categories with a prefix, suffix, or infix, or with an internal modification such as a vowel change.

* Stems are created by removing suffixes or prefixes from a word.
* Stemming a word or sentence may produce stems that are not actual words or that have no meaning.


In [110]:
from nltk.stem import PorterStemmer

porter = PorterStemmer()

def stemDocument(text):
    # Stem every token in the list passed in (not the global sample)
    stem_document = []
    for word in text:
        stem_document.append(porter.stem(word))
    return stem_document
    
print('Before stemming: ')
print(processed_sample)

print('\nAfter stemming: ')
stemmed_doc = stemDocument(processed_sample)
print(stemmed_doc)
    

Before stemming: 
['blehhhhhhhhh', 'way', 'early', 'feels', 'like', 'puking']

After stemming: 
['blehhhhhhhhh', 'way', 'earli', 'feel', 'like', 'puke']
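
The idea of rule-based suffix stripping can be illustrated with a toy function. This is not the Porter algorithm, which has many more rules and conditions; `toy_stem` and its three suffix rules are hypothetical and only show how several inflected forms collapse to one stem and how a stem can end up as a non-word:

```python
def toy_stem(word):
    """Toy suffix stripper (NOT the Porter algorithm): a few hand-picked rules."""
    for suffix in ('ing', 'ed', 's'):
        # Only strip when a stem of at least three letters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    # Rewrite a trailing 'y' as 'i', so 'early' becomes the non-word 'earli'
    if word.endswith('y'):
        word = word[:-1] + 'i'
    return word

for w in ['walks', 'walked', 'walking', 'early']:
    print(w, '->', toy_stem(w))
```

The three forms of "walk" map to the same stem, while "early" maps to "earli", which is not an English word.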


### Lemmatization
Unlike stemming, lemmatization reduces inflected words to a root form (the lemma) that is a valid word in the language.

In [111]:
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

def lemmaDocument(text):
    # Lemmatize every token in the list passed in (not the global sample)
    lemma_document = []
    for word in text:
        lemma_document.append(wordnet_lemmatizer.lemmatize(word))
    return lemma_document

print('Before lemmatization: ')
print(processed_sample)

print('\nAfter lemmatization: ')
lemmatized_doc = lemmaDocument(processed_sample)
print(lemmatized_doc)

Before lemmatization: 
['blehhhhhhhhh', 'way', 'early', 'feels', 'like', 'puking']

After lemmatization: 
['blehhhhhhhhh', 'way', 'early', 'feel', 'like', 'puking']
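
A lemmatizer looks words up in a dictionary, and the result depends on the part of speech it is told to assume. `WordNetLemmatizer.lemmatize` treats words as nouns unless a `pos` argument is passed, which is a likely reason "puking" is left unchanged in the output above. The sketch below mimics that behavior with a hypothetical three-entry lexicon; `TOY_LEMMAS` and `toy_lemmatize` are illustrative names, not NLTK API:

```python
# Hypothetical miniature lexicon mapping (word, part of speech) to a lemma
TOY_LEMMAS = {
    ('feels', 'n'): 'feel',
    ('feels', 'v'): 'feel',
    ('puking', 'v'): 'puke',
}

def toy_lemmatize(word, pos='n'):
    """Look the word up under the given part of speech; return it unchanged if unknown."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize('puking'))           # looked up as a noun
print(toy_lemmatize('puking', pos='v'))  # looked up as a verb
```

Passing the verb tag is what lets the lookup find "puke"; the same reasoning explains why the full-dataset code later calls the lemmatizer with `pos='v'`.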


### Tokenizing and Lemmatizing all tweets in dataset
This is where we tokenize, lemmatize, and stem all tweets in the dataset.

In [80]:
# OLD STEMMING AND LEMMATIZATION

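# Create the stemmer and lemmatizer once instead of once per token
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(text):
    # Lemmatize as a verb first, then stem the result
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        result.append(lemmatize_stemming(token))
    return result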

#doc_sample = documents[documents['index'] == 4312].values[0][0]
print('original document: ')
words = doc_sample.split(' ')
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))

# Preprocessing each tweet and save them into processed_docs variable
processed_docs = documents['Tweets'].map(preprocess)
processed_docs[:10]

0           [awww, that, bummer, shoulda, david, carr]
1    [upset, updat, facebook, text, result, school,...
2         [dive, time, ball, manag, save, rest, bound]
3                            [bodi, feel, itchi, like]
4                                              [behav]
5                                               [crew]
6                                               [need]
7                 [long, time, rain, fine, thank, how]
8                                        [nope, didnt]
9                                              [muera]
Name: Tweets, dtype: object
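
The sample output above contains almost no short tokens or common function words, even though the `preprocess` function shown only tokenizes, lemmatizes, and stems. That suggests an additional filter on token length and stop words was applied when the output was produced. Assuming so, such a filter could be sketched as follows; the `STOP_WORDS` set, the `MIN_LEN` threshold, and the `keep_token` helper are hypothetical and chosen for illustration only:

```python
# Hypothetical stop-word set and length threshold, chosen for illustration only
STOP_WORDS = {'that', 'this', 'with', 'have', 'from'}
MIN_LEN = 4

def keep_token(token, stop_words=STOP_WORDS, min_len=MIN_LEN):
    """Keep a token only if it is long enough and not a stop word."""
    return len(token) >= min_len and token not in stop_words

tokens = ['awww', 'thats', 'a', 'bummer', 'you', 'shoulda', 'got', 'with', 'david']
kept = [t for t in tokens if keep_token(t)]
print(kept)
```

In a pipeline like the one above, a check of this kind would sit inside the loop in `preprocess`, before each token is lemmatized and stemmed.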