In [6]:
# importing the data
import pymysql

import numpy as np
import pandas as pd

import nltk
import re

In [2]:
connection = pymysql.connect(host = "localhost", user = "pytester", password = "monty", db = "tweetsdb",
                            cursorclass = pymysql.cursors.DictCursor)

In [3]:
# getting the data
tweets = pd.read_sql("select * from tweets", connection)

# Exploration of the data

The main topic of this notebook is going to be topic modelling. I'm going to start of by looking at the tweets. Since that's what I'll be modelling.

In [6]:
tweets["tweet_text"]

0       RT @kryptonprobett: 🚨FREE NYE VIP PICKS🚨\n\nRE...
1       #NBA "Andre Iguodala fined for throwing ball i...
2       Daniel Hamilton will start for Hawks vs. IND -...
3       RT @marca: Espectacular. El mapa de la constel...
4       @KingJames saying he winning against Steph, Kl...
5       RT @RoyalRanksDFS: End Of The Year Special 🎉🥂 ...
6       RT @goodformgroup: Follow the link below and t...
7       #NFAC Closing Out 2018 w/ #BOMBS !!!\nMonday's...
8       RT @ToretoApuestas: Últimas 6/6 ✅\n\n✅ 2.00\n✅...
9       RT @nbastats: 📈🏀 STAT LEADERS THREAD 🏀📈\n\nThe...
10      RT @NBALatam: Los @Raptors siguen teniendo el ...
11      #MYNYEMONDAY \n\n#NBA: \n\nGrizzlies +4.5.\nPe...
12      RT @nbastats: The POINTS PER GAME leaders thro...
13      Let’s close out 2018 on a high note. I discuss...
14      RT @nbastats: The total REBOUNDS leaders throu...
15      RT @nbastats: The REBOUNDS PER GAME leaders th...
16      RT @nbastats: The total ASSISTS leaders throug...
17      👳🏻‍♂️L

There's a lot of "@" mentions here. These are all people. But, this is not going to add any information to us. Then, the first thing that pops out is the sheer quantity of text. So, I'll need to do something about it.

That's why I'm going to start off by pre-processing the text. Here's the plan for pre-processing the text:
    1. Remove punctuations.
    2. Remove @ and # symbols.
    3. Remove all non-alpha numeric characters.
    4. Remove all non-English tweets.
    5. Remove all stopwords
    6. Tokenize the data.
    7. Lemmatize the data.

# Data Pre-Processing

## Removing all punctuations

Before even looking at the other items in the list, there's one thing that can be done. There's this ugly "RT" that's present at the head of each tweet. I can do some feature engineering by using that to create a new variable is_retweet.

But, I won't be doing since the focus of this notebook is totally different.

In [17]:
from nltk import RegexpTokenizer
from nltk.corpus import stopwords
from nltk import WordNetLemmatizer

In [4]:
# creating a new object to operate on
tweet_text = tweets["tweet_text"].copy()

In [7]:
# removing the RT at the head of each tweet
tweet_text = tweet_text.apply(lambda x: re.sub("^RT", "", x))

In [8]:
# removing punctuations from the data
tweet_text = tweet_text.apply(lambda x: re.sub("\d+", "", x))

In [10]:
# normalizing case
tweet_text = tweet_text.apply(lambda x: x.lower())

In [15]:
# tokenizing words
tokenizer = RegexpTokenizer(r"\w+")
tweet_text = tweet_text.apply(lambda x: tokenizer.tokenize(x))

In [19]:
# lemmatizer to lemmatize words
lemma = WordNetLemmatizer()

# defining dictionary of stopwords
stopWords = stopwords.words()

In [20]:

# function to lemmatize text, remove stopwords and remove words with only one or two characters
def normalizeText(text):
    normalized_text = [word for word in text if word not in stopWords and re.match('[a-zA-Z\-][a-zA-Z\-]{2,}', word)]
    normalized_text = [lemma.lemmatize(word) for word in text]
    normalized_text = [word for word in normalized_text if len(word) > 3]
    return normalized_text

In [22]:
tweet_text = tweet_text.apply(lambda x: normalizeText(x))

There's not much use in retaining words that are too short. So, I'll get rid of them.

In [24]:
# removing words that have less than three characters in them
drop_words = {word for desc in tweet_text for word in desc if len(word) <= 3}

In [None]:
len(drop_words) #0

There seem to be not words that have less than four characters. The next step is to move onto words that are not English. Removing them is essential to making the topics more human understandable and better.

In [39]:
from nltk.corpus import words

In [40]:
engWords = words.words()
drop_words_lang = {word for tweet in tweet_text for word in tweet if word not in engWords}

In [41]:
len(drop_words_lang) # 2676

2676

So, there are 2676 words that are not part of the standard corpus of words defined in nltk. It's safe to assume that these words might not add any information and remove them. There are about 3800 words in all the tweets combined. So, it might not be the best to remove 2.6k words. 

Because I don't know which method is better, I'll simply build two models to know which is the better method.

However, since I'm already losing out on so much data, I won't filter the words by frequency. If I feel that word frequencies are screwing up the result, then I'll carry out the filter operation again. 

In [68]:
def removeDropWords(text):
    return [word for word in text if word not in drop_words_lang]

In [70]:
tweet_text = tweet_text.apply(lambda x: removeDropWords(x))

In [72]:
words = {word for tweet in tweet_text for word in tweet}
len(words)

Now, there are about 1120 words in the corpus. This is pretty good. It's time to train our LDA model. For this, I'll be using the gensim library in Python.

## Training the LDA Model

In [75]:
from gensim import models, corpora

In [78]:
# building the vocabulary of our model
dictionary = corpora.Dictionary(tweet_text)

In [83]:
# getting the bag of words model
corpus = [dictionary.doc2bow(text) for text in tweet_text]

Now, for the model, I'll be looking at 5 topics to start off with. I'll extract 10 keywords from each topic, look at the top two topics for each tweet. 

In [109]:
# training the model
lda_model = models.LdaModel(corpus = corpus, num_topics = 10, id2word = dictionary, random_state = 100)

In [110]:
topic_terms = {"Topic " + str(i):[] for i in range(10)}

for idx, topic in enumerate(topic_terms):
    terms = lda_model.get_topic_terms(idx)
    for term in terms[:10]:
        topic_terms[topic].append(dictionary[term[0]])

In [111]:
pd.DataFrame(topic_terms)

Unnamed: 0,Topic 0,Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,Topic 6,Topic 7,Topic 8,Topic 9
0,game,pacer,left,rocket,more,spur,hornet,ultimo,this,game
1,that,game,pacer,with,year,with,pour,parlay,year,with
2,their,tonight,drop,grizzly,game,year,para,thunder,with,tonight
3,will,hawk,handed,pick,sport,pick,sport,basketball,season,play
4,night,pick,watch,free,pour,pacer,rocket,spur,basketball,week
5,forward,player,young,today,thunder,tonight,grizzly,pick,sport,that
6,recognize,year,dish,over,maverick,week,check,match,harden,warrior
7,tomorrow,some,hammer,tonight,news,play,encore,book,latest,season
8,clipper,this,league,week,match,game,with,from,total,last
9,pride,today,into,season,have,slate,this,game,check,road


Now that I have the terms in each topic, I'll look into tagging each user based on the topics. 
After I do this rather simple process, I'll look into getting more data from the distribution of the topics over the users.

### Topics for every user

In [115]:
tweet_topics = {idx: [] for idx in tweet_text.index}
for idx, tweet in enumerate(tweet_text):
    bow_tweet = dictionary.doc2bow(tweet)
    topics = sorted(lda_model.get_document_topics(bow_tweet), key = lambda x: x[1], reverse = True)[:2]
    for topic in topics:
        tweet_topics[idx].append("Topic " + str(topic[0]))

In [118]:
tweet_topics = pd.Series(tweet_topics)

In [122]:
tweets["tweet_topics"] = tweet_topics

# Clustering documents by their topic distributions

The next step is to cluster those tweets that are similar to each other. To do this, I'll use the Jensen-Shannon divergence as a measure of how different two probability distributions are. 

This means that instead of looking at just the top two or three tags, I'll need to look at the entire distribution of the topics to determine the difference.

In [125]:
# getting the complete distributions for each tweet
tweets_distribution = {idx: [] for idx in tweet_text.index}

for idx, tweet in enumerate(tweet_text):
    bow_tweet = dictionary.doc2bow(tweet)
    topics = lda_model.get_document_topics(bow_tweet)
    for topic in topics:
        tweets_distribution[idx].append(topic[1])

Now that I have the tweet distributions, I need calculate the JSD for each and every pair of tweets. Let's get to that.

But before I can do that, I have to deal with 97 indexes that do not have a distribution value of 10. This is going to be a pain to fix this. So, I'll do it tomorrow.