## Topic modeling of Reddit comments using spaCy and Gensim

In this project, I use topic modeling to uncover the abstract topics in a corpus of collected Reddit comments. The workflow has the following steps:

-  Named-entity recognition using spaCy (exploration)
-  Tokenization (lemmatizing tokens and writing each sentence of the corpus to a text file)
-  Phrase modeling (unigram, bigram, and trigram models)
-  Fit Gensim LDA model (number of topics = 30, trained on 954403 comments, 9676 tokenized comments in test set) 
-  Visualization of results using pyLDAvis
-  Analysis is provided in the accompanying report 

I closely follow the approach outlined in the notebook linked below, modifying its source code where necessary to fit this use case: <br>
https://github.com/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb

### Initialize modules

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import spacy 
import base64
import re
import os
import codecs

In [None]:
df = pd.read_csv('reddit_comments_processed.csv', encoding='utf-8')

In [None]:
df.body_text.dtype

In [None]:
df

In [None]:
# read in Stanford stopword list (not used for this project). Instead I use spaCy's default English stopword list.

def read_stopwords(stopwords_path):
    """ This function reads the stopwords file line by line, returning a dictionary of stopwords
    inputs:
    stopwords_path - Path of stopwords file as a string
    outputs:
    output - dictionary of stopwords (Key = word: value = None)
    """
    # Read all lines, dropping the trailing newline from each;
    # the context manager closes the file even if reading fails
    with open(stopwords_path, "r") as f:
        content = [line.strip('\n') for line in f]
    # Create dictionary of stopwords
    output = {}
    for word in content:
        output[word] = None
    return output
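
Since the dictionary values are never used, a set expresses the same membership lookup more directly. The following is a minimal sketch, assuming one stopword per line; `read_stopword_set` and the temporary demo file are hypothetical and not part of the original notebook.

```python
import os
import tempfile


def read_stopword_set(stopwords_path):
    """Read one stopword per line and return them as a set (hypothetical helper)."""
    with open(stopwords_path, "r", encoding="utf-8") as f:
        # strip() drops the trailing newline; blank lines are skipped
        return {line.strip() for line in f if line.strip()}


# Demo on a throwaway file so the sketch runs without stopwords.txt
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("the\nand\n\nof\n")
    demo_path = tmp.name

demo_stopwords = read_stopword_set(demo_path)
os.remove(demo_path)

print("and" in demo_stopwords, "robot" in demo_stopwords)
```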

In [None]:
stopwords=read_stopwords('stopwords.txt')
stopwords

In [None]:
df.head(n=10)

### Drop rows with NaN

These are rows with no remaining text, because the comment contained only an emoji or a URL that was extracted.

In [None]:
df.isnull().sum()

In [None]:
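# positional indices of rows containing at least one NaN
# (Series.nonzero() was removed in recent pandas versions, so use NumPy instead)
inds = np.flatnonzero(df.isnull().any(axis=1))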

In [None]:
inds

In [None]:
df.dropna(how ='any',axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)

In [None]:
df.isnull().sum()

In [None]:
df.shape

Convert "created_utc" to datetime format

In [None]:
df['created_utc'] = pd.to_datetime(df['created_utc'],unit='s')
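
The `unit='s'` argument tells pandas to read the values as seconds since the Unix epoch (1970-01-01 UTC). A minimal standard-library sketch of the same conversion for a single value follows; the timestamp is an arbitrary example, not a value taken from the dataset, and `utc_seconds_to_datetime` is a hypothetical helper.

```python
from datetime import datetime, timezone


def utc_seconds_to_datetime(seconds):
    """Interpret `seconds` as seconds since the Unix epoch, in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Arbitrary example timestamp (an assumption for illustration only)
example = utc_seconds_to_datetime(1500000000)
print(example.isoformat())
```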

In [None]:
df.head(n=70)

## Tokenization 

In [None]:
from __future__ import unicode_literals
nlp = spacy.load('en')

In [None]:
text = df['body_text']

In [None]:
test = "Hello! Unfortunately, since [your account has less than 10 combined karma](/u/me) and new account spam makes up a significant portion of all spam, your post was automatically removed. However, you may still contribute by commenting on existing posts in /r/technology! Additionally, you may make meaningful contributions to [other subreddits](/subreddits) to increase your karma count. If you believe this is a legitimate submission, please [message the moderators](/message/compose?to=/r/technology&amp;subject=Request for post review - account karma) to have them manually review your post, or wait a few days and try again. Thank you!  *I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/technology) if you have any questions or concerns.*"

In [None]:
sample = test
sample

In [None]:
%%time
parsed_review = nlp(sample)

In [None]:
parsed_review

In [None]:
token_attributes = [(token.orth_,
                     token.prob,
                     token.is_stop,
                     token.is_punct,
                     token.is_space,
                     token.like_num,
                     token.is_oov)
                    for token in parsed_review]

# use a separate name so the comment DataFrame `df` is not overwritten
token_df = pd.DataFrame(token_attributes,
                        columns=['text',
                                 'log_probability',
                                 'stop?',
                                 'punctuation?',
                                 'whitespace?',
                                 'number?',
                                 'out of vocab.?'])

token_df.loc[:, 'stop?':'out of vocab.?'] = (token_df.loc[:, 'stop?':'out of vocab.?']
                                             .applymap(lambda x: u'Yes' if x else u''))
token_df

In [None]:
for num, sentence in enumerate(parsed_review.sents):
    print('Sentence {}:'.format(num + 1))
    print(sentence)
    print('')

In [None]:
for num, entity in enumerate(parsed_review.ents):
    print('Entity {}:'.format(num + 1), entity, '-', entity.label_)
    print('')

In [None]:
token_text = [token.text for token in parsed_review]
token_pos = [token.pos_ for token in parsed_review]
token_lemma = [token.lemma_ for token in parsed_review]

test = pd.DataFrame(list(zip(token_text, token_pos, token_lemma)), columns=['token_text', 'part_of_speech','column_lemma'])

In [None]:
test

In [None]:
nlp = spacy.load('en', disable=['ner'])

In [None]:
def punct_space(token):
    """
    helper function to eliminate tokens
    that are pure punctuation or whitespace
    """
    
    return token.is_punct or token.is_space

            
def lemmatized_sentence_corpus(texts):
    """
    generator function to use spaCy to parse comments,
    lemmatize the text, and yield sentences
    """
    for parsed_review in nlp.pipe(texts, n_threads=4):
        for sent in parsed_review.sents:
            yield u' '.join([token.lemma_ for token in sent if not punct_space(token)])

In [None]:
%%time
if 1 == 1:
    with codecs.open('unigram_sentences_all.txt', 'w', encoding='utf_8') as f:
        for sentence in lemmatized_sentence_corpus(text):
            f.write(sentence + '\n')

## Phrase models (unigram, bigram, trigram)

In [None]:
import gensim 
from gensim.models import Phrases
from gensim.models.word2vec import LineSentence
import itertools as it

In [None]:
# Load unigram sentences from disk (each sentence is a list of tokens)
unigram_sentences = LineSentence('unigram_sentences_all.txt')

In [None]:
# Print a sample of unigram sentences (each sentence is a list of tokens)
for unigram_sentence in it.islice(unigram_sentences, 500, 550):
        print(u' '.join(unigram_sentence))
        print("\n")

### Test bigram model on sample

In [None]:
%%time

bigram_model = Phrases(unigram_sentences)
bigram_model.save('bigram_model_all')

bigram_model = Phrases.load('bigram_model_all')
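
To make the phrase-modeling step more concrete, here is a simplified pure-Python sketch of collocation scoring in the style of `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))`. It is an illustration under stated assumptions, not Gensim's implementation: details such as vocabulary-size accounting, pruning, and default thresholds differ. The toy sentences, `score_bigrams`, and `merge_bigrams` are hypothetical.

```python
from collections import Counter


def score_bigrams(sentences, min_count=1, threshold=1.8):
    """Return {(a, b): score} for adjacent pairs scoring above `threshold`."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        unigram_counts.update(sentence)
        bigram_counts.update(zip(sentence, sentence[1:]))

    vocab_size = len(unigram_counts)
    scores = {}
    for (a, b), pair_count in bigram_counts.items():
        score = ((pair_count - min_count) * vocab_size
                 / (unigram_counts[a] * unigram_counts[b]))
        if score > threshold:
            scores[(a, b)] = score
    return scores


def merge_bigrams(sentence, phrases, delimiter="_"):
    """Join adjacent tokens that form a detected phrase, scanning left to right."""
    merged, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in phrases:
            merged.append(sentence[i] + delimiter + sentence[i + 1])
            i += 2
        else:
            merged.append(sentence[i])
            i += 1
    return merged


# Toy corpus (an assumption for illustration only)
toy_sentences = [
    ["new", "york", "is", "big"],
    ["new", "york", "is", "busy"],
    ["i", "like", "new", "york"],
    ["the", "city", "is", "big"],
    ["i", "like", "the", "city"],
]
phrases = score_bigrams(toy_sentences)
print(merge_bigrams(["i", "like", "new", "york"], phrases))
```

Running the same idea a second time over the merged output is what allows three-word phrases to form, which mirrors how the trigram model is trained on the bigram sentences.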

In [None]:
%%time

with codecs.open("bigram_sentences_all.txt", 'w', encoding='utf_8') as f:
        for unigram_sentence in unigram_sentences:
            bigram_sentence = u' '.join(bigram_model[unigram_sentence])
            f.write(bigram_sentence + '\n')

In [None]:
bigram_sentences = LineSentence('bigram_sentences_all.txt')

In [None]:
%%time

trigram_model = Phrases(bigram_sentences)
trigram_model.save('trigram_model_all')
trigram_model = Phrases.load('trigram_model_all')

In [None]:
%%time

with codecs.open("trigram_sentences_all.txt", 'w', encoding='utf_8') as f:
        for bigram_sentence in bigram_sentences:
            trigram_sentence = u' '.join(trigram_model[bigram_sentence])
            f.write(trigram_sentence + '\n')

In [None]:
nlp = spacy.load('en', disable=['ner'])

In [None]:
# apply phrase models to the full text

from spacy.lang.en.stop_words import STOP_WORDS

with codecs.open("trigram_reviews_all.txt", 'w', encoding='utf_8') as f:
    
    for parsed_review in nlp.pipe(text,batch_size=10000, n_threads=4):
            
        # lemmatize the text, removing punctuation and whitespace
        unigram_review = [token.lemma_ for token in parsed_review if not punct_space(token)]
            
        # apply the first-order and second-order phrase models
        bigram_review = bigram_model[unigram_review]
        trigram_review = trigram_model[bigram_review]
            
        # remove any remaining stopwords
        trigram_review = [term for term in trigram_review if term not in STOP_WORDS]
            
        # write the transformed review as a line in the new file
        trigram_review = u' '.join(trigram_review)
        f.write(trigram_review + '\n')

## Topic Modeling 

In [7]:
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore

import pyLDAvis
import pyLDAvis.gensim
import warnings
import _pickle as pickle
import pandas as pd
import sys
import re
import numpy as np
import codecs
from gensim.models.word2vec import LineSentence

In [8]:
with open('trigram_reviews_all.txt') as f:
    content = f.readlines()
content = [x.strip('\n') for x in content] 

In [9]:
# print the first 10 Reddit comments
content[0:10]

["simply industry group publish sure publish result reproducible -PRON- wavenet paper -PRON- would_rather at_least algorithmic view 's work share data wholly infeasible time data collect user agreement preclude release datum third_party",
 "-PRON- propose datum protection regulation work approach particularly health sphere institution -PRON- work strict ethic_board state specific consent require use individual 's datum particularly sharing anonymisation easily achievable case e.g. genetic datum geolocation etc",
 'gt ’ holdup iclr close -PRON- -PRON- need require code github',
 'nature industry competitive -PRON- make_sense corporation share -PRON- research -PRON- help competition -PRON- effectively incentivis opposite',
 '-PRON- reproducible science faulty -PRON- mean -PRON- -PRON- science overstretch especially machine_learn field -PRON- reproduce experiment confirm result -PRON- -PRON- wrong especially machine_learning field',
 'exactly -PRON- know science fault -PRON- model right d

In [10]:
df_main = pd.DataFrame(data=content,columns=['text'])

In [11]:
df_main.shape

(972304, 1)

In [12]:
# strip "-PRON-" and "urltag" from all strings
pron_match = re.compile(r'-PRON-')
url_match = re.compile(r'urltag')
# regex=True is required for compiled patterns in recent pandas versions
df_main['text'] = df_main['text'].str.replace(pron_match, '', regex=True)
df_main['text'] = df_main['text'].str.replace(url_match, '', regex=True)
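
Removing the placeholders leaves runs of spaces behind, and a comment made up only of placeholders becomes a whitespace-only string rather than an empty one. A minimal sketch of one way to normalize this with the standard `re` module follows; `strip_placeholders` is a hypothetical helper and not part of the original notebook.

```python
import re

placeholder_match = re.compile(r"-PRON-|urltag")
whitespace_match = re.compile(r"\s+")


def strip_placeholders(line):
    """Remove '-PRON-' and 'urltag', then collapse leftover whitespace."""
    without_tags = placeholder_match.sub("", line)
    return whitespace_match.sub(" ", without_tags).strip()


cleaned = strip_placeholders("-PRON- know science fault -PRON- model right urltag")
only_placeholders = strip_placeholders("-PRON- urltag -PRON-")
print(repr(cleaned), repr(only_placeholders))
```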

In [13]:
# assign the result back instead of relying on in-place modification of a column
df_main['text'] = df_main['text'].replace('', np.nan)
df_main.dropna(inplace=True)
df_main = df_main.reset_index(drop=True)

In [14]:
df_main.shape

(964079, 1)

In [15]:
msk = np.random.rand(len(df_main)) < 0.99
df_train = df_main[msk]
df_test = df_main[~msk]
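
The mask assigns each row to the training set independently with probability 0.99, so the exact split sizes change from run to run unless the random seed is fixed (with NumPy, for example, by calling `np.random.seed` first). A minimal standard-library sketch of the same idea follows; `bernoulli_split`, the seed value, and the toy documents are assumptions for illustration.

```python
import random


def bernoulli_split(items, train_fraction=0.99, seed=0):
    """Assign each item to train with probability `train_fraction`, else to test."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train, test = [], []
    for item in items:
        if rng.random() < train_fraction:
            train.append(item)
        else:
            test.append(item)
    return train, test


docs = ["comment {}".format(i) for i in range(1000)]
train_docs, test_docs = bernoulli_split(docs, seed=42)
print(len(train_docs), len(test_docs))
```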

In [16]:
len(df_train)

954454

In [17]:
len(df_test)

9625

In [18]:
df_train_list = df_train.text.tolist()

In [19]:
df_test_list = df_test.text.tolist()

In [20]:
# print first 10 entries 
df_train_list[0:10]

["simply industry group publish sure publish result reproducible  wavenet paper  would_rather at_least algorithmic view 's work share data wholly infeasible time data collect user agreement preclude release datum third_party",
 " propose datum protection regulation work approach particularly health sphere institution  work strict ethic_board state specific consent require use individual 's datum particularly sharing anonymisation easily achievable case e.g. genetic datum geolocation etc",
 'gt ’ holdup iclr close   need require code github',
 'nature industry competitive  make_sense corporation share  research  help competition  effectively incentivis opposite',
 ' reproducible science faulty  mean   science overstretch especially machine_learn field  reproduce experiment confirm result   wrong especially machine_learning field',
 'exactly  know science fault  model right datum',
 'work new conference spring contemporary practice accept',
 'good important question  know answer   assume

In [21]:
# print first 10 entries 
df_test_list[0:10]

['actually work   request w youtube block   guess  compliance low tenth digit',
 ' think belong probably more_suited /r futurology/',
 ' hard blind',
 'agree  curious people think purpose peer_review quote op gt  want  try understand question line  negate purpose sound_like exactly purpose purpose peer_review  ideal world think  happen reality',
 'great',
 'example end google blog_post actual team use book restaurant  eat  photo  eat restaurant  flat lie  like   ',
 ' mean  idea fun post pointless video ml forum gangsta',
 'delete',
 'shit post',
 'obvious way  fail global method more_importantly issue border case signifcant problem object size  refer  actual advice  appreciate ']

In [81]:
# Save final text strings to file

with codecs.open("trigram_sentences_train_final.txt", 'w', encoding='utf_8') as f:
    for text in df_train_list:
        f.write(text + '\n')

In [82]:
# Save final text strings to file

with codecs.open("trigram_sentences_test_final.txt", 'w', encoding='utf_8') as f:
    for text in df_test_list:
        f.write(text + '\n')

In [83]:
%%time

# Load final trigram comments as a list of tokens
trigram_comments = LineSentence('trigram_sentences_train_final.txt')

# learn the dictionary by iterating over comments
trigram_dictionary = Dictionary(trigram_comments)
    
# filter tokens that are very rare or too common from
# the dictionary (filter_extremes) and reassign integer ids (compactify)
trigram_dictionary.filter_extremes(no_below=10, no_above=0.4)
trigram_dictionary.compactify()
trigram_dictionary.save('trigram_dict_all.dict')
    
# load the finished dictionary from disk
trigram_dictionary = Dictionary.load('trigram_dict_all.dict')

CPU times: user 37.8 s, sys: 328 ms, total: 38.1 s
Wall time: 38.4 s
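
In `filter_extremes`, `no_below` is an absolute number of documents while `no_above` is a fraction of the corpus. The sketch below illustrates that document-frequency filter in plain Python on a toy corpus. It is a simplified approximation: Gensim additionally caps the vocabulary size with `keep_n`, and `filter_by_document_frequency` is a hypothetical name.

```python
from collections import Counter


def filter_by_document_frequency(documents, no_below=10, no_above=0.4):
    """Keep tokens appearing in at least `no_below` documents and in at most
    a `no_above` fraction of all documents."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = int(no_above * len(documents))
    return {tok for tok, freq in doc_freq.items() if no_below <= freq <= max_docs}


# Toy corpus (an assumption for illustration only)
toy_docs = [
    ["ai", "robot", "the"],
    ["ai", "the", "brain"],
    ["the", "nuclear", "ai"],
    ["the", "robot"],
    ["the", "datum"],
    ["the", "facebook"],
]
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept))
```

Here "the" is dropped for being too common and the single-document tokens for being too rare.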


In [84]:
def trigram_bow_generator(filepath):
    """
    generator function to read reviews from a file
    and yield a bag-of-words representation
    """
    
    for review in LineSentence(filepath):
        yield trigram_dictionary.doc2bow(review)
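
A bag-of-words vector is a sparse list of `(token_id, count)` pairs, and tokens missing from the dictionary are ignored. The following pure-Python sketch mimics that behavior on toy data; `build_token_ids` and `to_bag_of_words` are hypothetical helpers, not Gensim functions.

```python
from collections import Counter


def build_token_ids(documents):
    """Assign consecutive integer ids to tokens in order of first appearance."""
    token2id = {}
    for doc in documents:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id


def to_bag_of_words(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Toy data (an assumption for illustration only)
toy_token2id = build_token_ids([["ai", "robot", "ai"], ["brain", "ai"]])
bow = to_bag_of_words(["ai", "brain", "ai", "unknown"], toy_token2id)
print(bow)
```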

In [85]:
%%time

# generate bag-of-words representations for
# all reviews and save them as a matrix
MmCorpus.serialize('trigram_bow_corpus_all.mm',trigram_bow_generator("trigram_sentences_train_final.txt"))
    
# load the finished bag-of-words corpus from disk
trigram_bow_corpus = MmCorpus('trigram_bow_corpus_all.mm')

CPU times: user 59.9 s, sys: 1.89 s, total: 1min 1s
Wall time: 1min 3s


In [86]:
%%time

# workers => sets the parallelism, and should be
# set to your number of physical cores minus one

with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        lda = LdaMulticore(trigram_bow_corpus,num_topics=30,id2word=trigram_dictionary,workers=3)
lda.save('lda_model_all')
    
# load the finished LDA model from disk
lda = LdaMulticore.load('lda_model_all')

CPU times: user 2min 32s, sys: 40.6 s, total: 3min 13s
Wall time: 3min 15s


## Visualization

In [124]:
def explore_topic(topic_number, topn=20):
    """
    accept a user-supplied topic number and
    print out a formatted list of the top terms
    """
        
    print(u'{:20} {}'.format(u'term', u'frequency') + u'\n')

    for term, frequency in lda.show_topic(topic_number, topn=topn):
        print(u'{:20} {:.3f}'.format(term, round(frequency, 3)))

In [127]:
explore_topic(topic_number=19)

term                 frequency

think                0.017
human                0.016
like                 0.013
thing                0.013
people               0.013
ai                   0.012
use                  0.011
way                  0.009
nuclear              0.008
mean                 0.007
robot                0.006
need                 0.006
datum                0.006
brain                0.006
facebook             0.006
know                 0.005
's                   0.005
idea                 0.004
good                 0.004
lot                  0.004


In [128]:
%%time

if 1 == 1:

    LDAvis_prepared = pyLDAvis.gensim.prepare(lda, trigram_bow_corpus, trigram_dictionary)

    # pickle writes bytes, so the file must be opened in binary mode
    with open('ldavisual_prepared', 'wb') as f:
        pickle.dump(LDAvis_prepared, f)
        
# load the pre-prepared pyLDAvis data from disk (binary mode again)
with open('ldavisual_prepared', 'rb') as f:
    LDAvis_prepared = pickle.load(f)
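
In Python 3, `pickle.dump` passes bytes to the file's `write()` method, so a file opened in text mode raises a `TypeError`, while binary mode (`'wb'` / `'rb'`) round-trips the object. A minimal standard-library sketch follows; the `payload` dictionary is a stand-in object, not the real pyLDAvis data, and the file path is a temporary one.

```python
import os
import pickle
import tempfile

payload = {"num_topics": 30, "top_terms": ["ai", "robot", "brain"]}  # stand-in object
path = os.path.join(tempfile.mkdtemp(), "prepared.pkl")

# Text mode fails because pickle hands bytes to f.write()
text_mode_error = None
try:
    with open(path, "w") as f:
        pickle.dump(payload, f)
except TypeError as err:
    text_mode_error = err

# Binary mode writes and reads the object back unchanged
with open(path, "wb") as f:
    pickle.dump(payload, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

print(type(text_mode_error).__name__, restored == payload)
```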

In [129]:
pyLDAvis.display(LDAvis_prepared)