# Twitter Topic Modeling

### Background
This project is based on the Sentiment140 dataset of 1.6 million tweets (https://www.kaggle.com/kazanova/sentiment140). It continues the Twitter Sentiment Analysis that my group and I completed as the final project for the Applied Machine Learning course at Wright State University. (direct to Github page for Twitter Sentiment Analysis report)

### Plan
Using the preprocessed tweets dataset, I want to find the topics people tweeted about most, using topic modeling with LDA (Latent Dirichlet Allocation).
* Generate 5 to 20 topics from the dataset
* Create a word cloud based on word frequency
* Create LDA Visualization
* Generate a graph and table of perplexity vs. number of topics
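
For the last item, perplexity is commonly defined as the exponential of the negative average log-likelihood per token; lower values mean the model predicts the text better. The sketch below only illustrates that definition and is not the notebook's evaluation code. The function name `perplexity` and its input format (one list of per-token natural-log probabilities per document) are assumptions made for the example.

```python
import math

def perplexity(doc_log_probs):
    """Perplexity from per-token natural-log probabilities.

    doc_log_probs: list of documents, each a list of log p(token).
    (Hypothetical input format, chosen for illustration.)
    """
    total_log_prob = sum(sum(doc) for doc in doc_log_probs)
    total_tokens = sum(len(doc) for doc in doc_log_probs)
    if total_tokens == 0:
        raise ValueError("need at least one token")
    return math.exp(-total_log_prob / total_tokens)

# A model that gives every token probability 1/4 is as uncertain as a
# uniform choice among 4 words, so its perplexity is about 4.
uniform_docs = [[math.log(0.25)] * 3, [math.log(0.25)] * 5]
print(perplexity(uniform_docs))
```

Computing this value for models trained with different numbers of topics gives the points for a perplexity vs. number of topics graph.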


## Importing Data

In [2]:
# package used for importing csv
import pandas as pd

In [3]:
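# on_bad_lines='skip' needs pandas >= 1.3; older versions used error_bad_lines=False
data = pd.read_csv('tweets-clean.csv', on_bad_lines='skip')
# copy the column so that adding 'index' does not write into a view of `data`
data_text = data[['Tweets']].copy()
data_text['index'] = data_text.index
documents = data_text
print(documents)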

                                                    Tweets    index
0        awww thats a bummer  you shoulda got david car...        0
1        is upset that he cant update his facebook by t...        1
2         i dived many times for the ball managed to sa...        2
3          my whole body feels itchy and like its on fire         3
4         no its not behaving at all im mad why am i he...        4
...                                                    ...      ...
1046511  back home thought id done for the week but jus...  1046511
1046512           my grandma is making dinenr with my mum   1046512
1046513  mid morning snack time a bowl of cheese noodle...  1046513
1046514   same here  say it like from the terminiator m...  1046514
1046515                             im great thaanks  wbuu  1046515

[1046516 rows x 2 columns]


In [4]:
#Clean this up - do import on the cell that uses it

import gensim
from gensim.utils import simple_preprocess

import nltk


import numpy as np
np.random.seed(2018)

nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\arcit\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Text Processing

### Removing Special Characters
Special characters add noise to the tweets and have to be removed.

In [5]:
import re
def remove_special_character(text, remove_digits=False):
    # use A-Z (not A-z): the A-z range also matches [ \ ] ^ _ and `
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text

#this is the tweet that we will remove special characters from
doc_sample = documents[documents['index'] == 2314].values[0][0]

print('Before removing special characters: ')
print(doc_sample)

print('\nAfter special characters removed: ')
doc_sample = remove_special_character(doc_sample, remove_digits=True)
print(doc_sample)


Before removing special characters: 
goodnight nobody since i have no followers nobody can see me say this 

After special characters removed: 
goodnight nobody since i have no followers nobody can see me say this 
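
The sample tweet contains no special characters, so the text is the same before and after. A minimal sketch on a made-up string that does contain punctuation shows the effect more clearly. `strip_special` and the test strings are hypothetical names and data introduced only for this illustration.

```python
import re

def strip_special(text, remove_digits=False):
    # Keep letters, whitespace and (optionally) digits; drop everything else.
    pattern = r"[^a-zA-Z\s]" if remove_digits else r"[^a-zA-Z0-9\s]"
    return re.sub(pattern, "", text)

tweet = "hello @world! #2193_"
print(strip_special(tweet))                             # hello world 2193
print(repr(strip_special(tweet, remove_digits=True)))   # 'hello world '

# Range pitfall: in ASCII, 'Z' and 'a' are not adjacent, so the range A-z
# also covers [ \ ] ^ _ and `, and those characters would be kept.
print(re.sub(r"[^a-zA-z0-9\s]", "", "snake_case^"))     # snake_case^
print(re.sub(r"[^a-zA-Z0-9\s]", "", "snake_case^"))     # snakecase
```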


### Removing Stop Words
Stop words are words that are filtered out before or after natural language processing. They are the most common words in a language, such as "a", "an", "the", etc.

In [6]:
#download stopwords
nltk.download('stopwords')
from nltk.corpus import stopwords

#Print the stop words - these are the words that will be removed
stop_words = set(stopwords.words('english'))
print(stop_words)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\arcit\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

{'herself', 'been', 'under', 'if', 'he', 'it', 'once', 'why', 'has', 'an', 'his', "mightn't", 'hers', 'have', 't', 'needn', 'should', 'during', 'on', 'very', 'had', 's', 'will', 'be', 'after', 'some', 'ours', 'not', 'at', 'again', 'such', 'their', 'because', 'out', 'mightn', 'am', 'to', "it's", 'who', 'me', 'our', 'a', 'theirs', 'down', 'both', 'weren', "wasn't", 'yourself', 'you', 'but', 'into', 'against', 'isn', 'before', 'doesn', 'were', 'hadn', 'off', 'shan', 'further', "hasn't", 'did', 'by', 'above', 'from', 'few', 'while', 'in', 'own', 'd', 'her', 'doing', 'm', "needn't", 'most', "hadn't", 'of', 'more', 'can', 'aren', 'those', 'when', 'do', 'ourselves', "won't", 'my', 'now', 'its', 'which', "isn't", 'we', "you'd", 'through', 'll', 'o', 'won', 'them', 'myself', "you're", 'here', 'there', 'up', 'each', 'where', "mustn't", "couldn't", 'is', 'nor', 'yourselves', 'and', "weren't", 'this', 'wasn', 'does', 'how', 'that', 'so', 'then', 'over', 'mustn', 'only', 'for', 'as', 'hasn', "would




### Tokenization
Tokenization splits a sentence into smaller pieces, usually words, called tokens.

In [7]:
# Tokenize - only applied to a single tweet for demonstration
from nltk.tokenize import word_tokenize

tokens = word_tokenize(doc_sample)

print('Before Tokenization: ')
print(doc_sample)

print('\nAfter Tokenization: ')
print(tokens)

Before Tokenization: 
goodnight nobody since i have no followers nobody can see me say this 

After Tokenization: 
['goodnight', 'nobody', 'since', 'i', 'have', 'no', 'followers', 'nobody', 'can', 'see', 'me', 'say', 'this']
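
To make the idea concrete without any library, here is a minimal regex-based sketch. It is a simplification and not the algorithm behind `word_tokenize`, which handles punctuation and contractions differently. `simple_tokenize` is a hypothetical helper name.

```python
import re

def simple_tokenize(text):
    # Lowercase, then pull out runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

print(simple_tokenize("Goodnight, nobody! I can't see you."))
# ['goodnight', 'nobody', 'i', "can't", 'see', 'you']
```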


### Tokenization and Removing Stop Words
Using the tokenized sentence, we now remove the stop words. It is important to tokenize the sentence/document before removing stop words, because the filter compares one token at a time against the stop word list.

In [8]:
# Alternative tokenizer and stop word list (remove_stopwords below uses word_tokenize and stop_words instead)
from nltk.tokenize.toktok import ToktokTokenizer
tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')

In [9]:
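# Tokenize and remove stop words
def remove_stopwords(text):
    tokens = word_tokenize(text)
    tokens = [token.strip() for token in tokens]
    # the comparison is case-sensitive and the NLTK stop words are lowercase
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # the tokens are joined back into one comma-separated string
    filtered_text = ', '.join(filtered_tokens)
    return filtered_text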

print('Before removing stopwords: ')
print(tokens)

print('\nAfter removing stopwords: ')
processed_sample = remove_stopwords(doc_sample)
print(processed_sample)

Before removing stopwords: 
['goodnight', 'nobody', 'since', 'i', 'have', 'no', 'followers', 'nobody', 'can', 'see', 'me', 'say', 'this']

After removing stopwords: 
goodnight, nobody, since, followers, nobody, see, say
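
A variation worth considering is to keep the result as a list of tokens and to compare in lowercase, so that capitalized words are filtered too. The sketch below assumes a small hand-picked stop word set instead of the NLTK list. `STOP_WORDS` and `filter_stop_words` are hypothetical names used only here.

```python
STOP_WORDS = {"i", "have", "no", "can", "me", "this", "the", "is"}  # illustrative subset

def filter_stop_words(tokens, stop_words=STOP_WORDS):
    # Compare in lowercase so capitalized words such as "The" are caught too.
    return [tok for tok in tokens if tok.lower() not in stop_words]

sample_tokens = ["Goodnight", "nobody", "since", "I", "have", "no", "followers"]
print(filter_stop_words(sample_tokens))
# ['Goodnight', 'nobody', 'since', 'followers']
```

Returning a list avoids re-tokenizing the text in later steps.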


### Stemming
Stemming is the process of reducing inflected words to their root form, mapping a group of related words to the same stem even if the stem itself is not a valid word in the language (English, in this case).

Inflection - modification of a word to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, and mood. An inflection expresses one or more grammatical categories with a prefix, suffix or infix, or another internal modification such as a vowel change.

##### Stems are created by removing suffixes or prefixes from a word.
##### Stemming a word or sentence may produce words that are not actual words or that have no meaning.


In [31]:
from nltk.stem import PorterStemmer

porter = PorterStemmer()

def text_stemmer(text):
    # stem each whitespace-separated word, reusing the stemmer created above
    text = ' '.join([porter.stem(word) for word in text.split()])
    return text
    
print('Before stemming: ')
print(processed_sample)

print('\nAfter stemming: ')
stemmed_doc = text_stemmer(processed_sample)
print(stemmed_doc)

Before stemming: 
goodnight, nobody, since, followers, nobody, see, say

After stemming: 
goodnight, nobody, since, followers, nobody, see, say
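
The stemmed output above is identical to its input. A plausible reason is that `text.split()` splits on whitespace only, so each word still ends with the comma added when the stop words were removed, and suffix rules have nothing to match. The sketch below illustrates this with a toy suffix stripper. `toy_stem` and its rule list are hypothetical stand-ins and far simpler than the Porter algorithm.

```python
SUFFIXES = ("ing", "ers", "er", "ed", "s")  # illustrative rules, not Porter's

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the first matching suffix, keeping at least 3 characters.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(toy_stem("followers"))   # follow
print(toy_stem("followers,"))  # followers,  (the trailing comma blocks the rule)
print(toy_stem("this"))        # thi  (a stem need not be a real word)
```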


### Lemmatization
Unlike stemming, lemmatization reduces inflected words to a root form (the lemma) that is a valid word in the language.

In [11]:
!pip install -U nltk

Requirement already up-to-date: nltk in c:\users\arcit\anaconda3\lib\site-packages (3.5)


In [12]:
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [13]:
def text_lemmatizer(text):
    word_list = nltk.word_tokenize(text)
    lemmatized_text = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
    return lemmatized_text

print('Before lemmatization: ')
print(processed_sample)

print('\nAfter lemmatization: ')
lemmatized_doc = text_lemmatizer(processed_sample)
print(lemmatized_doc)

Before lemmatization: 
goodnight, nobody, since, followers, nobody, see, say

After lemmatization: 
goodnight , nobody , since , follower , nobody , see , say
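
A lemmatizer maps a word to its dictionary form rather than cutting off suffixes. `WordNetLemmatizer` does this with the WordNet database; the sketch below stands in for that with a tiny hand-written lookup table, so `LEMMA_TABLE` and `toy_lemmatize` are assumptions made for illustration only.

```python
LEMMA_TABLE = {"feet": "foot", "followers": "follower", "mice": "mouse"}  # tiny illustrative table

def toy_lemmatize(tokens, table=LEMMA_TABLE):
    # Words missing from the table are returned unchanged.
    return [table.get(tok, tok) for tok in tokens]

print(toy_lemmatize(["followers", "see", "feet"]))
# ['follower', 'see', 'foot']
```

Note that `WordNetLemmatizer.lemmatize` treats each word as a noun unless a `pos` argument says otherwise, so verb forms may come back unchanged with the default call.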


### Apply Text Preprocessing to the tweets
This is where we tokenize and lemmatize all tweets in the dataset.

Note:
* In the single-tweet examples above, we stemmed and lemmatized the tweet separately to see the changes. Comparing the stemmed and lemmatized output, some words changed and some did not.
* I will combine all the text preprocessing steps we've done into one function to keep things simple.

In [15]:
def prepare_text(text):
    # run every preprocessing step in order: clean, filter, stem, lemmatize
    cleaned_text = remove_special_character(text)
    filtered_text = remove_stopwords(cleaned_text)
    stemmed_text = text_stemmer(filtered_text)
    lemmatized_text = text_lemmatizer(stemmed_text)
    return lemmatized_text

sample = 'The bat is hanging its feet for #2193@'

print(prepare_text(sample))

the , bat , hanging , foot , 2193
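
Since LDA and the word cloud both work from word counts, it can help to see the chain end in a list of tokens and a frequency table. The following is a minimal stdlib-only sketch of that idea, not the notebook's pipeline: it skips stemming and lemmatization, and `STOP_WORDS`, `prepare_tokens` and `bag_of_words` are hypothetical names.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "is", "its", "for", "a", "an"}  # small illustrative subset

def prepare_tokens(text):
    # Clean first, then lowercase, split on whitespace and drop stop words.
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text).lower()
    return [tok for tok in text.split() if tok not in STOP_WORDS]

def bag_of_words(token_lists):
    # Count how often each token occurs across all documents.
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts

docs = ["The bat is hanging its feet for #2193@", "A bat!"]
token_lists = [prepare_tokens(doc) for doc in docs]
print(token_lists)   # [['bat', 'hanging', 'feet', '2193'], ['bat']]
print(bag_of_words(token_lists).most_common(1))   # [('bat', 2)]
```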


In [23]:
import random

text_data = []
with open('tweets-clean.csv', errors='ignore') as f:
    for i, line in enumerate(f):
        tokens = prepare_text(line)
        # print the first 10 processed tweets as a sanity check
        if i < 10:
            print(tokens)
        text_data.append(tokens)

awww , thats , bummer , shoulda , got , david , carr , third , day
upset , cant , update , facebook , texting , might , cry , result , school , today , also , blah
dived , many , time , ball , managed , save , 50 , rest , go , bound
whole , body , feel , itchy , like , fire
behaving , im , mad , cant , see
whole , crew
need , hug
hey , long , time , see , yes , rain , bit , bit , lol , im , fine , thanks , how
nope , didnt
que , muera
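
The loop above passes each raw line of the file to `prepare_text`, which would include the header row and any other columns if the file has them. A minimal alternative sketch, assuming the file has a header row with a `Tweets` column as in the DataFrame shown earlier, reads only that column with the standard `csv` module. `read_tweets` and the in-memory sample are hypothetical and used only for illustration.

```python
import csv
import io

def read_tweets(file_obj, column="Tweets"):
    # DictReader takes field names from the header row, so only the chosen
    # column is returned and the header itself is never treated as a tweet.
    reader = csv.DictReader(file_obj)
    return [row[column] for row in reader if row.get(column)]

# In-memory stand-in for the CSV file (hypothetical contents).
sample_csv = io.StringIO('Tweets\nneed a hug\n"whole crew, again"\n')
print(read_tweets(sample_csv))
# ['need a hug', 'whole crew, again']
```

With a real file, the same function could be called on the object returned by `open('tweets-clean.csv', newline='', errors='ignore')`.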
