# Toxic Comment: Topic Modelling with pyLDAvis

The following material is inspired by jagangupta's post on Kaggle, found [here](https://www.kaggle.com/jagangupta/understanding-the-topic-of-toxicity), and [this tutorial](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/).


Here we will look at a collection of comments from Wikipedia talk pages. Some comments have been labelled as toxic and some as clean. Since Latent Dirichlet Allocation (LDA) is a statistical model based on word co-occurrence, we hope to see that some topics in our data consist of nothing but cuss words. This would suggest that toxic comments cluster together cleanly in our data.

In [1]:
import numpy as np
import pandas as pd
import string

#Text manipulation
from string import punctuation
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer 
import gensim
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

#Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import pyLDAvis.gensim  # renamed to pyLDAvis.gensim_models in pyLDAvis 3.3+
%matplotlib inline

#Setting NLTK constants
stop_words = set(stopwords.words("english"))  # a set makes membership tests fast

#settings
color = sns.color_palette()

sns.set_style("dark")
import warnings
warnings.filterwarnings("ignore")



In [2]:
train = pd.read_csv('C:\\Users\\harri\\.kaggle\\competitions\\jigsaw-toxic-comment-classification-challenge\\train.csv\\train.csv').fillna(' ')
test = pd.read_csv('C:\\Users\\harri\\.kaggle\\competitions\\jigsaw-toxic-comment-classification-challenge\\test.csv\\test.csv').fillna(' ')

df = pd.concat([train.iloc[:,0:2], test.iloc[:,0:2]])
df = df.reset_index(drop=True)

In [3]:
def clean(comment):
    """ Basic cleaner that removes end-of-line characters, Wikipedia identifying info (IPs, usernames), and URLs.
        It also serves as a basic preprocessor: it converts to lowercase and tokenizes."""
    #conv to lowercase
    comment = comment.lower()
    #replace new line with a space so that adjacent words are not fused together
    comment = re.sub(r'\n', ' ', comment)
    #remove ip
    comment = re.sub(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", "", comment)
    #remove username (non-greedy, so text after the closing brackets is kept)
    comment = re.sub(r"\[\[.*?\]\]", "", comment)
    #remove urls (any run of non-whitespace after http:// or https://)
    comment = re.sub(r"https?://\S+", '', comment)
    #article ids
    comment = re.sub(r"\d:\d\d\s{0,5}$", '', comment)
    #tokenizer
    comment = gensim.utils.simple_preprocess(comment, deacc=True, min_len=3)
    return comment

df['comment_text'] = df['comment_text'].apply(clean)
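
To see what the regex steps in `clean` do before tokenization, here is a minimal standard-library sketch. The helper name `strip_wiki_noise` and the sample comment are hypothetical, and the patterns are assumed to be slightly stricter variants (non-greedy wiki links, whitespace-delimited URLs) of the ones in the cell above.

```python
import re

def strip_wiki_noise(comment):
    """Hypothetical helper: the regex steps of clean(), without tokenization."""
    comment = comment.lower()
    comment = re.sub(r"\n", " ", comment)                                 # newlines -> spaces
    comment = re.sub(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", "", comment)  # IPv4-like strings
    comment = re.sub(r"\[\[.*?\]\]", "", comment)                         # [[wiki links]]
    comment = re.sub(r"https?://\S+", "", comment)                        # URLs
    return re.sub(r"\s+", " ", comment).strip()                           # collapse whitespace

# Made-up sample comment, not taken from the dataset
sample = "Stop it!\nSee http://example.com/page [[User:Foo]] 192.168.0.1 thanks"
cleaned = strip_wiki_noise(sample)
print(cleaned)  # stop it! see thanks
```

Running a few real comments through a helper like this is a quick way to check that a pattern is not deleting more text than intended.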

In [4]:
#Bigrams are pairs of words that frequently appear next to each other in the data
bigram = gensim.models.Phrases(df.comment_text, threshold = 15)
bigram_mod = gensim.models.phrases.Phraser(bigram)
lem = WordNetLemmatizer()
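
For intuition about what the `threshold` argument is compared against, the sketch below scores adjacent word pairs with a simplified collocation formula, `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))`. This is an assumption about the general shape of gensim's default scorer, not a copy of it (gensim may count the vocabulary size differently), and the function name `bigram_scores` and the toy sentences are made up for illustration.

```python
from collections import Counter

def bigram_scores(sentences, min_count=1):
    """Score each adjacent word pair; higher means the pair co-occurs more than chance."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)
    return {
        (a, b): (n - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n in bigrams.items()
    }

# Toy tokenized corpus (hypothetical)
sentences = [
    ["talk", "page", "edit"],
    ["talk", "page", "vandal"],
    ["edit", "war", "talk", "page"],
]
scores = bigram_scores(sentences)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))  # ('talk', 'page') 1.111
```

Pairs whose score exceeds the threshold are merged into a single token, so raising the threshold yields fewer, more strongly associated bigrams.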

In [6]:
def cleanv2(word_list, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """
    Function to further clean the pre-processed word lists 
    
    Following transformations will be done
    1) Stop words removal from the nltk stopword list
    2) Bigram collation (Finding common bigrams and grouping them together using gensim.models.phrases)
    3) Lemmatization, treating every token as a verb (running --> run ; went --> go)

    Note: allowed_postags is currently unused.
    """
    #remove stop words
    clean_words = [w for w in word_list if w not in stop_words]
    #collect bigrams
    clean_words = bigram_mod[clean_words]
    #Lemmatize as verbs (tokens with no verb reading, e.g. children, are left as-is)
    clean_words = [lem.lemmatize(word, "v") for word in clean_words]
    return clean_words



df['comment_text'] = df['comment_text'].apply(cleanv2)


In [None]:
dictionary = Dictionary(df.comment_text)
corpus = [dictionary.doc2bow(text) for text in df.comment_text]
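
The two lines above turn each tokenized comment into a bag-of-words: a list of `(token_id, count)` pairs. A rough standard-library sketch of the same idea follows. The helper names `build_vocab` and `doc_to_bow` are hypothetical, and the ids gensim's `Dictionary` assigns may differ from the ones produced here.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def doc_to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

# Toy tokenized documents (hypothetical)
docs = [["edit", "page", "edit"], ["page", "vandal"]]
vocab = build_vocab(docs)
bows = [doc_to_bow(d, vocab) for d in docs]
print(vocab)  # {'edit': 0, 'page': 1, 'vandal': 2}
print(bows)   # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded in this representation, which is why LDA depends only on which words occur together in a comment.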

# Visualization via LDAvis

- The size of each circle represents how prevalent the topic is across our collection of comments.
    - Bigger circle = the topic accounts for a larger share of the words in the corpus.
- The bar chart on the right shows the top 30 terms: the most salient terms overall by default, or the most relevant terms for a topic once one is selected.
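
The term ranking for a selected topic is controlled by the λ slider in the visualization. As a hedged sketch, assuming the relevance definition from the LDAvis paper, `λ * log p(w|t) + (1 - λ) * log(p(w|t) / p(w))`, the toy example below shows how λ changes the ordering. The probabilities and the function name `relevance` are made up for illustration.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """Relevance of a term to a topic: blends in-topic probability with lift."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# Hypothetical probabilities: p(w | topic) and overall p(w)
topic_probs = {"page": 0.30, "idiot": 0.10}
corpus_probs = {"page": 0.20, "idiot": 0.01}

def rank_terms(lam):
    """Order the terms from most to least relevant for the given lambda."""
    return sorted(
        topic_probs,
        key=lambda w: relevance(topic_probs[w], corpus_probs[w], lam),
        reverse=True,
    )

print(rank_terms(1.0))  # ['page', 'idiot']  ranks by in-topic probability
print(rank_terms(0.0))  # ['idiot', 'page']  ranks by lift, favouring topic-specific terms
```

Lower values of λ push terms that are rare overall but concentrated in one topic to the top, which can help when looking for a topic dominated by cuss words.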


In [None]:
ldamodel = LdaModel(corpus=corpus, num_topics=10, id2word=dictionary, random_state = 100)
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)