# Import and Explore Data

## Import from CSV
The data_exploration notebook retrieved data from Reddit using the Pushshift API via the PSAW and PRAW wrappers and exported it to several CSV files, which we load here.

In [None]:
import pandas as pd
load_subreddit = "hololive"
load_num_posts = 20000
load_num_days = 180
submissions_df = pd.read_csv(f'./data/{load_subreddit}_submissions_{load_num_posts}_{load_num_days}.csv', delimiter=';', header=0)
comments_df = pd.read_csv(f'./data/{load_subreddit}_comments_{load_num_posts}_{load_num_days}.csv', delimiter=';', header=0)

## Explore Data
We take a look at the data we have imported.

In [None]:
submissions_df.shape

In [None]:
submissions_df.head()

In [None]:
comments_df.shape

In [None]:
comments_df.head()

In [None]:
post_ids = comments_df['link_id'].unique()

In [None]:
len(post_ids)

In [None]:
post_ids[:10]

In [None]:
comments_df[comments_df['link_id'] == post_ids[1]]['body']

# Clean and Organize Data

In [None]:
import re
import nltk
from nltk import word_tokenize

In [None]:
def clean_comment(comment, URL_token='-URL-'):
    """
    Description:
        We clean the comment by removing quoted lines, replacing URLs with a token,
        collapsing punctuation into full stops, and discarding tokens that are not
        alphabetic, a period, or n't.
        
    Input:
        comment: a raw comment text string
        URL_token: optional parameter to set URL token, defaults to -URL-
    
    Output:
        clean_tokens: a list of tokens representing the clean comment, ready to be used in a corpus
    """
    if isinstance(comment, str):
        # remove quoted lines (lines starting with '>')
        comment = re.sub(r'^>.*$', '', comment, flags=re.M)

        # replace http/https URLs with token
        comment = re.sub(r'https?://\S*', URL_token, comment)

        # collapse runs of punctuation into a single full stop
        comment = re.sub(r'[.,!?;]+', '.', comment)
    else:
        # non-string values (e.g. NaN for a missing body) yield no tokens
        return []
    
    comment_tokens = word_tokenize(comment)
    clean_tokens = [tok.lower() for tok in comment_tokens 
                    if tok.isalpha() # keep alphabetic words
                    or tok == '.' # keep periods
                    or tok == 'n\'t' # keep n't = not
                   ]
    return clean_tokens

In [None]:
comment_arr = []
comments_raw = comments_df[comments_df['link_id'] == post_ids[1]]['body']
for comment in comments_raw:
    comment_arr.extend(clean_comment(comment))
print(comment_arr)
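To see what the three regex substitutions do on their own, here is a minimal sketch that applies them to a made-up comment without tokenizing. The helper name `preview_regex_cleaning` and the sample text are hypothetical, and the quote pattern is a slightly simplified variant that drops any line starting with `>`.

```python
import re

def preview_regex_cleaning(comment, url_token="-URL-"):
    """Apply only the regex substitutions (no tokenizing), for inspection."""
    # Drop quoted lines, i.e. lines that start with '>'
    comment = re.sub(r'^>.*$', '', comment, flags=re.M)
    # Replace http/https URLs with a placeholder token
    comment = re.sub(r'https?://\S+', url_token, comment)
    # Collapse runs of sentence punctuation into a single full stop
    comment = re.sub(r'[.,!?;]+', '.', comment)
    return comment

# Made-up sample comment: a quoted line, a blank line, then the reply
sample = "> quoted line\n\nGreat stream!!! See https://example.com/clip for more, thanks"
print(preview_regex_cleaning(sample).strip())
```

The order matters: URLs are replaced before punctuation is collapsed, otherwise the dots inside a URL would be rewritten first.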

We organize the comments for each post into a dictionary, which gives a clear mapping from each post (document) to its comment tokens (words).

In [None]:
post_to_comments_dict = {}
all_comments = []

for post_id in post_ids:
    comments = []
    # clean comments
    comments_raw = comments_df[comments_df['link_id'] == post_id]['body']
    for comment in comments_raw:
        comments.extend(clean_comment(comment))
    all_comments.extend(comments)
    post_to_comments_dict[post_id] = comments

In [None]:
len(all_comments)
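The loop above filters the whole comments DataFrame once per post. As an alternative, the same mapping can be built in a single pass over the rows. This is only a sketch: `group_tokens_by_post` and `simple_tokenize` are hypothetical names, and `simple_tokenize` is a stand-in for `clean_comment` so the example runs on its own.

```python
from collections import defaultdict

def group_tokens_by_post(rows, tokenize):
    """Collect tokens per post in one pass.

    rows: iterable of (link_id, body) pairs
    tokenize: callable mapping a raw comment to a list of tokens
    """
    grouped = defaultdict(list)
    for link_id, body in rows:
        grouped[link_id].extend(tokenize(body))
    return dict(grouped)

def simple_tokenize(body):
    # Hypothetical stand-in for clean_comment: lowercase and split on whitespace
    return body.lower().split() if isinstance(body, str) else []

# Made-up rows; the None body mimics a missing comment
rows = [("t3_a", "Nice stream"), ("t3_b", "Cute"), ("t3_a", "So good"), ("t3_b", None)]
grouped = group_tokens_by_post(rows, simple_tokenize)
print(grouped)
```

With the notebook's data, `rows` could be `zip(comments_df['link_id'], comments_df['body'])`, and `tokenize` a small wrapper around `clean_comment`.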

## Explore the Cleaned Data
We examine the most frequent words, which can also help us determine whether more preprocessing is necessary before training our model.

In [None]:
from nltk.probability import FreqDist
fdist = FreqDist(all_comments)
common_words = fdist.most_common(10)
print("\nCommon Words: ", common_words)
fdist.plot(10)

Looking at the most common words, we can see that stop words (the, i, to, a, it, and, is, of, that) and the full stop dominate the top 10. These provide very little value to the topic model, so we will update our clean_comment function and generate a new corpus. We also remove the URL token, as it provides no value with regard to the topics. For now, we avoid stemming, but may implement it in a future model to compare the results.
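A toy illustration of why this matters, using only the standard library. The token list and the tiny stop word set are made up for the example and are not taken from the dataset.

```python
from collections import Counter

# Made-up tokens, as they might look after the first cleaning pass
tokens = "the stream was great . i like the stream . the chat is fun".split()
stop_words = {"the", "was", "i", "is", "like"}  # tiny hypothetical stop word set

# Before filtering: stop words and full stops crowd the top of the list
before = Counter(tokens).most_common(3)

# After filtering: keep alphabetic tokens that are not stop words
filtered = [tok for tok in tokens if tok.isalpha() and tok not in stop_words]
after = Counter(filtered).most_common(2)

print("before:", before)
print("after: ", after)
```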

In [None]:

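# Quick check of the stop word filter on a few sample tokens.
# NOTE: en_stop is defined in the "Lemmatizer and Stop Words" cells below; run those first.
test_tokens = ['would', 'like', 'one', 'korone']
test_clean = [tok.lower() for tok in test_tokens 
              if tok.isalpha()                # keep alphabetic words
              and tok.lower() not in en_stop  # drop stop words
             ]
test_clean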

In [None]:
def clean_comment(comment, stop_words, lemmatizer):
    """
    Description:
        We clean the comment by removing quoted lines, URLs, and punctuation,
        and by discarding tokens that are not alphabetic, are English stop words,
        or are a single character. The remaining tokens are lemmatized.
        
    Input:
        comment: a raw comment text string
        stop_words: a set of stop words to be eliminated
        lemmatizer: an object with a lemmatize(word) method
    
    Output:
        clean_tokens: a list of tokens representing the clean comment, ready to be used in a corpus
    """
    
    if isinstance(comment, str):
        # remove quoted lines (lines starting with '>')
        comment = re.sub(r'^>.*$', '', comment, flags=re.M)

        # remove http/https URLs
        comment = re.sub(r'https?://\S*', '', comment)

        # remove punctuation
        comment = re.sub(r'[.,!?;]+', '', comment)
    else:
        # non-string values (e.g. NaN for a missing body) yield no tokens
        return []
    
    comment_tokens = word_tokenize(comment)
    clean_tokens = [tok.lower() for tok in comment_tokens 
                    if tok.isalpha() and
                    tok.lower() not in stop_words 
                    and len(tok) > 1
                   ]
    clean_tokens = [lemmatizer.lemmatize(word) for word in clean_tokens]
    return clean_tokens

#### Lemmatizer and Stop Words
The NLTK stop word list appeared to be insufficient because it did not include some of the most common terms, so we use the gensim list instead.

In [None]:
# Define lemmatizer and stop words
from nltk.stem.wordnet import WordNetLemmatizer
wnl = WordNetLemmatizer()

In [None]:
from gensim.parsing.preprocessing import STOPWORDS
en_stop = STOPWORDS.union(set(['like', 'savevideo']))
en_stop

We test the new clean_comment function to see if it works.

In [None]:
comment_arr = []
comments_raw = comments_df[comments_df['link_id'] == post_ids[1]]['body']
for comment in comments_raw:
    comment_arr.extend(clean_comment(comment, en_stop, wnl))
print(comment_arr)

In [None]:
post_to_comments_dict = {}
all_comments = []

for post_id in post_ids:
    comments = []
    # clean comments
    comments_raw = comments_df[comments_df['link_id'] == post_id]['body']
    for comment in comments_raw:
        comments.extend(clean_comment(comment, en_stop, wnl))
    all_comments.extend(comments)
    post_to_comments_dict[post_id] = comments
    
len(all_comments)

After removing the stop words, full stops, and URL tokens, we now have a corpus of 1,704,363 tokens, compared to the previous size of 3,614,762.

Adding lemmatization and using the slightly extended set of stop words from gensim results in a corpus of 1,468,682 tokens.
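To put these sizes in relative terms, a small worked calculation using the three token counts quoted above (the helper name `percent_reduction` is ours):

```python
def percent_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before

original_size = 3_614_762        # tokens after the first cleaning pass
after_stop_words = 1_704_363     # stop words, full stops, and URL tokens removed
after_lemmatization = 1_468_682  # plus lemmatization and the gensim stop words

print(f"{percent_reduction(original_size, after_stop_words):.1f}% smaller")
print(f"{percent_reduction(original_size, after_lemmatization):.1f}% smaller")
```

The corpus shrinks by roughly 53% in the first step and by roughly 59% overall.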

In [None]:
fdist = FreqDist(all_comments)
common_words = fdist.most_common(10)
print("\nCommon Words: ", common_words)
fdist.plot(10)

In [None]:
len(post_to_comments_dict)

In [None]:
from wordcloud import WordCloud
wordcloud = WordCloud(background_color="white", width=800, height=400, colormap="Spectral", max_words=500)
wordcloud.generate_from_frequencies(fdist)
wordcloud.to_image()

We can see that the most frequent words are now like, one, would, really, stream, ... Perhaps this is more representative of the comments. Next, we convert the documents into a simple vector representation using the count vectorizer: each post's comments become a vector of word counts, and every vector has a length equal to the size of the vocabulary.

After updating the stop words, some of these most frequent terms are gone, and the most frequent words are now: stream, time, know, think, people. These are not very indicative of any topics. We will evaluate this model, and perhaps filter words by their POS tag to gain more valuable insights.
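As a rough picture of what the count vectorizer produces, here is a minimal standard-library sketch of a bag-of-words encoding. It is not scikit-learn's implementation, and `bag_of_words` plus the toy documents are made up for illustration.

```python
def bag_of_words(docs):
    """Return (vocabulary, vectors) where each vector holds per-word counts."""
    # Fixed, sorted vocabulary shared by all documents
    vocabulary = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocabulary)}

    vectors = []
    for doc in docs:
        vec = [0] * len(vocabulary)  # one slot per vocabulary word
        for tok in doc:
            vec[index[tok]] += 1
        vectors.append(vec)
    return vocabulary, vectors

# Two toy "posts", already cleaned and tokenized
docs = [["stream", "time", "stream"], ["time", "people"]]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
print(vectors)
```

Every vector has the same length as the vocabulary, so most entries are zero for a realistic corpus, which is why the real vectorizer returns a sparse matrix.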

In [None]:
# Create a list of post comment lists
post_to_comments_list = list(post_to_comments_dict.values())
len(post_to_comments_list)

In [None]:
len(post_to_comments_dict.keys())

In [None]:
' '.join(post_to_comments_list[0])

## Model 1: sklearn LDA with BoW

In [None]:
# Load the library with the CountVectorizer method
from sklearn.feature_extraction.text import CountVectorizer
# Load the LDA model from sk-learn
from sklearn.decomposition import LatentDirichletAllocation as LDA
#Convert each array of word tokens in post_to_comments_list into a string
post_to_comments_list_str = []
for arr in post_to_comments_list:
    post_to_comments_list_str.append(' '.join(arr))

# Initialise the count vectorizer
count_vectorizer = CountVectorizer()

In [None]:
# Fit and transform the processed comments
count_data = count_vectorizer.fit_transform(post_to_comments_list_str)
count_data

In [None]:
# Show the non-zero (row, column index) count entries for the first document
print(count_data[0])

In [None]:
# Helper function
def print_topics(model, count_vectorizer, n_top_words):
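    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
    words = count_vectorizer.get_feature_names_out()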
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([words[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
        

In [None]:
# Tweak the two parameters below
number_topics = 10
number_words = 10
# Create and fit the LDA model
lda_model1 = LDA(n_components=number_topics, n_jobs=-1)
lda_model1.fit(count_data)
# Print the topics found by the LDA model
print("Topics found via LDA:")
print_topics(lda_model1, count_vectorizer, number_words)

# Analyzing LDA model results
Now that we have a trained model, let’s visualize the topics for interpretability. To do so, we’ll use pyLDAvis, a popular visualization package designed to help interactively with:
1. Better understanding and interpreting individual topics, and
2. Better understanding the relationships between the topics.
    - Intertopic Distance Plot

In [None]:
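# NOTE: newer pyLDAvis releases rename this module to pyLDAvis.lda_model;
# adjust the import if pyLDAvis.sklearn is not found
from pyLDAvis import sklearn as sklearn_lda
import pyLDAvis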

In [None]:
LDAvis_prepared = sklearn_lda.prepare(lda_model1, count_data, count_vectorizer)
pyLDAvis.save_html(LDAvis_prepared, './output/topic_modeling/ldavis_prepared_model1.html')

# Gensim LDA

We have already processed our data for the first model, which used sklearn. We will quickly re-examine it and then begin working with the gensim library, encoding the corpus in two different ways: first with the BoW method, then with TF-IDF.

In [None]:
post_to_comments_list[:10]

In [None]:
# ipykernel deprecation warning was being triggered in every cell: https://github.com/ipython/ipykernel/issues/540
import warnings
warnings.filterwarnings('ignore')

In [None]:
import gensim
dictionary = gensim.corpora.Dictionary(post_to_comments_list)
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

Let's filter out terms that appear in fewer than 15 documents or in more than 40% of documents, and keep only the 100,000 most frequent of the remaining tokens.

In [None]:
dictionary.filter_extremes(no_below=15, no_above=0.4, keep_n=100000)
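To make the three thresholds concrete, here is a simplified standard-library sketch of the same kind of filtering. It is an approximation written for illustration and not gensim's code. The function name `filter_by_document_frequency` is hypothetical, and it ranks the surviving tokens by document frequency.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below=15, no_above=0.4, keep_n=100000):
    """Return the tokens that survive the document-frequency thresholds."""
    num_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document

    kept = [tok for tok, df in doc_freq.items()
            if df >= no_below and df <= no_above * num_docs]
    # Most frequent first; ties broken alphabetically for a stable result
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    return kept[:keep_n]

# Toy corpus of 5 documents: 'a' is in all of them, 'd' in only one
docs = [["a", "b"], ["a", "c"], ["a", "b"], ["a", "d"], ["a", "b", "c"]]
print(filter_by_document_frequency(docs, no_below=2, no_above=0.7))
```

In the toy corpus, `a` is dropped for being too common and `d` for being too rare, which is the effect we want on words that cannot help separate topics.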

## Model 2: Gensim LDA with BoW

### BoW Corpus

In [None]:
bow_corpus = [dictionary.doc2bow(post) for post in post_to_comments_list]

In [None]:
len(bow_corpus)

In [None]:
bow_corpus[1]

In [None]:
test_bow = bow_corpus[1220]
for word_num, word_count in test_bow:
    print(f'Word {word_num}, {dictionary[word_num]}, appears {word_count} time(s).')

In [None]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=10, chunksize=100, random_state=100, workers=7)

In [None]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

In [None]:
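# NOTE: newer pyLDAvis releases rename this module to pyLDAvis.gensim_models;
# adjust the import if pyLDAvis.gensim is not found
from pyLDAvis import gensim as gensim_lda
LDAvis_prepared = gensim_lda.prepare(lda_model, bow_corpus, dictionary)
pyLDAvis.save_html(LDAvis_prepared, './output/topic_modeling/ldavis_prepared_model2.html')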

In [None]:
pyLDAvis.display(LDAvis_prepared)

## Model 3: Gensim LDA with TF-IDF

### TF-IDF Corpus

We create the TF-IDF model from the bag-of-words corpus.

In [None]:
from gensim import corpora, models
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]
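For intuition about what the TF-IDF weights look like, here is a minimal standard-library sketch. It assumes a raw term count multiplied by log2(N / document frequency), followed by scaling each document to unit length. That weighting is our assumption for the illustration and may differ in detail from what `TfidfModel` computes. The function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {token: weight} dict per document (assumed weighting, see text)."""
    num_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # number of documents containing each token

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        weights = {tok: count * math.log2(num_docs / doc_freq[tok])
                   for tok, count in counts.items()}
        # Tokens present in every document get weight 0 and are dropped
        weights = {tok: w for tok, w in weights.items() if w > 0}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({tok: w / norm for tok, w in weights.items()} if norm else {})
    return vectors

# Toy corpus: 'stream' appears in every document, so it carries no weight
docs = [["stream", "korone", "korone"], ["stream", "chat"]]
vectors = tfidf_vectors(docs)
print(vectors)
```

The point of the example is that a word appearing in every document contributes nothing, while words concentrated in a few documents dominate.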

In [None]:
count = 0
for doc in corpus_tfidf:
    print(doc)
    count+=1
    if count > 1:
        break

In [None]:
lda_model_3 = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=10, chunksize=100, random_state=100, workers=7)
for idx, topic in lda_model_3.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

## Model 4: Using Bigrams and POS Filtering

In [None]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(post_to_comments_list, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[post_to_comments_list], threshold=100)

In [None]:
# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

In [None]:
# def functions for bigrams, trigrams
def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]
def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

In [None]:
import spacy
# Form Bigrams
data_words_bigrams = make_bigrams(post_to_comments_list)

# Initialize the spaCy English model, disabling the parser and NER components (for efficiency)
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])
# Do lemmatization keeping only noun, adj, vb, adv
# NOTE: requires a lemmatization() helper to be defined earlier in the notebook
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
print(data_lemmatized[:1])

In [None]:
# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)
# Create Corpus
texts = data_lemmatized
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
# View
print(corpus[:1])

In [None]:
# Build LDA model
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=10, 
                                       random_state=100,
                                       chunksize=100,
                                       passes=10,
                                       per_word_topics=True)

In [None]:
from pprint import pprint
# Print the keywords for the 10 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

# Model Assessment
We assess the models using model perplexity and coherence. 

### Perplexity
Perplexity is an intrinsic evaluation metric that is widely used for language model evaluation. It captures how surprised a model is by new data it has not seen before, and is measured as the normalized log-likelihood of a held-out test set.

However, recent studies have shown that predictive likelihood (or equivalently, perplexity) and human judgment are often not correlated, and are sometimes even slightly anti-correlated. This limitation of the perplexity measure motivated further work on modeling human judgment, which led to topic coherence.
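As a worked illustration of the definition, perplexity can be computed from per-token log-likelihoods on held-out text as the exponential of the negative mean log-likelihood. The sketch below is generic and not tied to any of the models above. The function name `perplexity` and the probabilities are made up.

```python
import math

def perplexity(token_log_probs):
    """exp(-mean natural-log likelihood) over held-out tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that assigns probability 0.25 to every held-out token is as
# "surprised" as a uniform guess among 4 words
uniform_guess = [math.log(0.25)] * 8
# A more confident model assigns probability 0.5 to every held-out token
more_confident = [math.log(0.5)] * 8

print(perplexity(uniform_guess), perplexity(more_confident))
```

Lower perplexity means the model is less surprised by the held-out data.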

### What is topic coherence?
Topic coherence measures score a single topic by measuring the degree of semantic similarity between the high-scoring words in the topic. These measurements help distinguish topics that are semantically interpretable from topics that are artifacts of statistical inference. But …

### What is coherence?
A set of statements or facts is said to be coherent if they support each other. Thus, a coherent fact set can be interpreted in a context that covers all or most of the facts. An example of a coherent fact set is “the game is a team sport”, “the game is played with a ball”, “the game demands great physical efforts”.
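Applied to topics, the same idea can be approximated by checking how often a topic's top words occur together in the same documents. The sketch below implements one of several coherence measures, a UMass-style score based on document co-occurrence counts. It is an illustration only and not necessarily the measure used later in this notebook. The function name and the toy documents are made up.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """Sum of log((co-document count + 1) / document count of the higher-ranked word)."""
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents containing all of the given words
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    # combinations() keeps input order, so `earlier` is the higher-ranked word
    for earlier, later in combinations(top_words, 2):
        df_earlier = doc_freq(earlier)
        if df_earlier == 0:
            continue  # word never occurs in the corpus; skip the pair
        score += math.log((doc_freq(earlier, later) + 1) / df_earlier)
    return score

docs = [["game", "ball", "team"], ["ball", "team"], ["game", "stock"]]
coherent = umass_coherence(["ball", "team"], docs)   # words that co-occur
mixed = umass_coherence(["ball", "stock"], docs)     # words that never co-occur
print(coherent, mixed)
```

Words that support each other, in the sense of appearing in the same documents, score higher than words that never appear together.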

In [None]:
# calculate coherence