**In the previous notebook we cleaned the comments and performed EDA on them. This notebook extracts topics from the comments using topic modelling.**

In [None]:
%pip install pyLDAvis



## Import Packages

In [None]:
import pandas as pd
import numpy as np

from pprint import pprint
import re
import nltk

import pyLDAvis

from gensim import corpora
from gensim.models import LdaMulticore

import pyLDAvis.gensim_models as gensimvis
import matplotlib.pyplot as plt

from google.colab import drive

import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

## Import Data

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd /content/drive/My Drive/Capstone/Data/

/content/drive/.shortcut-targets-by-id/1oPJof-sZbxMW4yf3cLG3l2yFEepNLfxR/Capstone/Data


In [None]:
train = pd.read_csv("train_clean_lemmatize.csv")

In [None]:
train.head()

Unnamed: 0,id,target,comment_text,severe_toxicity,obscene,identity_attack,insult,threat,asian,atheist,bisexual,black,buddhist,christian,female,heterosexual,hindu,homosexual_gay_or_lesbian,intellectual_or_learning_disability,jewish,latino,male,muslim,other_disability,other_gender,other_race_or_ethnicity,other_religion,other_sexual_orientation,physical_disability,psychiatric_or_mental_illness,transgender,white,comment_cleaned,comment_lemmatized,target_label
0,59848,0.0,"This is so cool. It's like, 'would you want yo...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,this is so cool it like would you want your mo...,cool like would want mother read realli great ...,0
1,59849,0.0,Thank you!! This would make my life a lot less...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,thank you this would make my life lot less anx...,thank would make life lot less anxieti induc k...,0
2,59852,0.0,This is such an urgent design problem; kudos t...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,this is such an urgent design problem kudos to...,urgent design problem kudo take impress,0
3,59855,0.0,Is this something I'll be able to install on m...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,is this something ll be able to install on my ...,someth abl instal site releas,0
4,59856,0.893617,haha you guys are a bunch of losers.,0.021277,0.0,0.021277,0.87234,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,haha you guys are bunch of losers,haha guy bunch loser,1


In [None]:
train.shape

(1778628, 35)

In [None]:
train['target_label'].value_counts()

0    1636193
1     142435
Name: target_label, dtype: int64

## Latent Dirichlet Allocation (LDA)

Having cleaned and explored the comments in the previous notebook, we now extract and analyse the topics that attract negative responses in the comments. We expect some of these topics to target particular identities.

Latent Dirichlet Allocation is one of the most popular topic modelling methods. Here we use gensim's LdaMulticore to extract topics from the toxic comments.

### Preprocess

To use LDA we need to preprocess the text: **clean** it, **remove special characters**, **remove stopwords** and **lemmatize**. We already performed these steps in the EDA notebook, so we use the lemmatized comments from the train dataset directly.

Because we are analysing only toxic topics, we keep only the comments whose target probability is greater than 0.5.

In [None]:
toxic_comments = train[train['target'] > 0.5]['comment_lemmatized']

In [None]:
len(toxic_comments)

104866

In [None]:
toxic_comments[:5]

4                                  haha guy bunch loser
5                                     ur sh tti comment
12    ridicul guy call protest arm violenc make terr...
30    yet call muslim act get pillori okay smear ent...
33                      bitch nut would read book woman
Name: comment_lemmatized, dtype: object

In [None]:
# Tokenize: split each lemmatized comment into a list of words
comments_tokens = [doc.split() for doc in toxic_comments]

### Dictionary & Corpus

We first create a dictionary object that maps each word to a unique id. This dictionary is then used to create a bag-of-words corpus.

In [None]:
# Create dictionary
dictionary = corpora.Dictionary(comments_tokens)

In [None]:
print(dictionary)

Dictionary(43100 unique tokens: ['bunch', 'guy', 'haha', 'loser', 'comment']...)


In [None]:
print(dictionary.token2id)

{'bunch': 0, 'guy': 1, 'haha': 2, 'loser': 3, 'comment': 4, 'sh': 5, 'tti': 6, 'ur': 7, 'arm': 8, 'call': 9, 'make': 10, 'protest': 11, 'ridicul': 12, 'terrorist': 13, 'violenc': 14, 'act': 15, 'bash': 16, 'christian': 17, 'entir': 18, 'get': 19, 'idiot': 20, 'muslim': 21, 'okay': 22, 'pillori': 23, 'religion': 24, 'sect': 25, 'smear': 26, 'yet': 27, 'bitch': 28, 'book': 29, 'nut': 30, 'read': 31, 'woman': 32, 'would': 33, 'also': 34, 'farmer': 35, 'gluten': 36, 'laughabl': 37, 'love': 38, 'market': 39, 'murphi': 40, 'papa': 41, 'particip': 42, 'portland': 43, 'prop': 44, 'psu': 45, 'sentenc': 46, 'shame': 47, 'tastebud': 48, 'tri': 49, 'atroci': 50, 'back': 51, 'check': 52, 'com': 53, 'dine': 54, 'door': 55, 'dumb': 56, 'final': 57, 'floorpan': 58, 'ft': 59, 'go': 60, 'hallway': 61, 'kitchenett': 62, 'leas': 63, 'live': 64, 'long': 65, 'modern': 66, 'next': 67, 'oh': 68, 'one': 69, 'opportun': 70, 'pick': 71, 'room': 72, 'roommat': 73, 'set': 74, 'share': 75, 'spend': 76, 'spot': 77, 

Gensim uses this dictionary to create a bag-of-words corpus in which each word of a document is replaced by its id from the dictionary, paired with the number of times it occurs in that document.

In [None]:
# The dictionary was built from these same tokens, so it does not need to be updated again here
mycorpus = [dictionary.doc2bow(doc) for doc in comments_tokens]

In [None]:
pprint(mycorpus[:4])

[[(0, 1), (1, 1), (2, 1), (3, 1)],
 [(4, 1), (5, 1), (6, 1), (7, 1)],
 [(1, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)],
 [(9, 1),
  (15, 1),
  (16, 1),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 1),
  (21, 1),
  (22, 2),
  (23, 1),
  (24, 1),
  (25, 1),
  (26, 1),
  (27, 1)]]


The (id, count) pairs above are hard to read. Replacing the ids with the actual words makes the counts easier to understand.

In [None]:
word_counts = [[(dictionary[token_id], count) for token_id, count in line] for line in mycorpus]

In [None]:
pprint(word_counts[:5])

[[('bunch', 1), ('guy', 1), ('haha', 1), ('loser', 1)],
 [('comment', 1), ('sh', 1), ('tti', 1), ('ur', 1)],
 [('guy', 1),
  ('arm', 1),
  ('call', 1),
  ('make', 1),
  ('protest', 1),
  ('ridicul', 1),
  ('terrorist', 1),
  ('violenc', 1)],
 [('call', 1),
  ('act', 1),
  ('bash', 1),
  ('christian', 1),
  ('entir', 1),
  ('get', 1),
  ('idiot', 1),
  ('muslim', 1),
  ('okay', 2),
  ('pillori', 1),
  ('religion', 1),
  ('sect', 1),
  ('smear', 1),
  ('yet', 1)],
 [('bitch', 1),
  ('book', 1),
  ('nut', 1),
  ('read', 1),
  ('woman', 1),
  ('would', 1)]]


The output above shows how many times each word occurs in each comment.
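To make the dictionary and bag-of-words steps concrete, here is a minimal pure-Python sketch of the same idea. The helper names `build_token2id` and `doc_to_bow` are hypothetical and are not part of gensim; the ids are assigned in order of first appearance, which may differ from the ids gensim assigns.

```python
from collections import Counter

def build_token2id(tokenized_docs):
    """Map every distinct token to an integer id (order of first appearance)."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc_to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

# Toy documents, loosely modelled on the comments above
docs = [["haha", "guy", "bunch", "loser"], ["okay", "guy", "okay"]]
token2id = build_token2id(docs)
bows = [doc_to_bow(doc, token2id) for doc in docs]
print(token2id)
print(bows)
```

Each document becomes a sparse list of pairs, so word order is discarded and only the counts remain. That is the input LDA works with.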

### LDA Model

We now fit an LdaMulticore model using the dictionary and corpus created above. The model below extracts 10 topics.

The number of topics and the number of passes were chosen after a few trial runs.

In [None]:
lda_model = LdaMulticore(corpus = mycorpus,
                         id2word = dictionary,
                         random_state = 42,
                         num_topics = 10,
                         passes = 10)
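The number of topics above was chosen by trial runs. One common way to compare candidate models more systematically is a topic coherence score, which rewards topics whose top words tend to occur in the same documents. The sketch below is a simplified, pure-Python version of the UMass coherence measure, averaged over word pairs. The function name `umass_coherence` and the toy documents are assumptions for illustration, and this is not the procedure used in this notebook.

```python
import math
from itertools import combinations

def umass_coherence(topic_words, documents):
    """Simplified UMass coherence: mean of log((D(w_i, w_j) + 1) / D(w_j))
    over pairs where w_j is ranked above w_i in the topic."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all the given words
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for j, i in combinations(range(len(topic_words)), 2):
        w_j, w_i = topic_words[j], topic_words[i]
        d_j = doc_freq(w_j)
        if d_j == 0:
            continue  # skip words that never occur in the documents
        scores.append(math.log((doc_freq(w_i, w_j) + 1) / d_j))
    return sum(scores) / len(scores) if scores else float("nan")

# Toy tokenized documents (hypothetical)
toy_docs = [
    ["tax", "money", "pay"],
    ["tax", "pay", "state"],
    ["money", "pay", "work"],
    ["god", "church", "jesu"],
]
coherent = umass_coherence(["tax", "pay", "money"], toy_docs)
mixed = umass_coherence(["tax", "god", "work"], toy_docs)
print(coherent, mixed)
```

A topic whose words co-occur scores higher than one that mixes unrelated words. In practice the score would be averaged over all topics of a model and compared across different values of `num_topics`.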

In [None]:
lda_model.print_topics()

[(0,
  '0.012*"peopl" + 0.012*"tax" + 0.011*"money" + 0.010*"get" + 0.010*"stupid" + 0.009*"pay" + 0.007*"state" + 0.007*"work" + 0.007*"need" + 0.007*"go"'),
 (1,
  '0.030*"loser" + 0.019*"troll" + 0.016*"trash" + 0.012*"garbag" + 0.011*"like" + 0.008*"anoth" + 0.008*"piec" + 0.008*"brain" + 0.008*"get" + 0.007*"pathet"'),
 (2,
  '0.018*"god" + 0.013*"gun" + 0.013*"church" + 0.009*"cathol" + 0.009*"use" + 0.009*"homosexu" + 0.009*"denver" + 0.008*"war" + 0.007*"jesu" + 0.007*"weapon"'),
 (3,
  '0.020*"liar" + 0.018*"lie" + 0.018*"liber" + 0.016*"clown" + 0.014*"nfl" + 0.009*"idiot" + 0.009*"hypocrit" + 0.008*"putin" + 0.008*"justin" + 0.007*"trudeau"'),
 (4,
  '0.031*"women" + 0.021*"sexual" + 0.018*"sex" + 0.016*"men" + 0.013*"woman" + 0.012*"rape" + 0.011*"abus" + 0.009*"child" + 0.008*"man" + 0.008*"mental"'),
 (5,
  '0.045*"white" + 0.030*"black" + 0.027*"racist" + 0.023*"peopl" + 0.013*"muslim" + 0.012*"hate" + 0.011*"right" + 0.011*"kill" + 0.009*"countri" + 0.009*"american"'),


Above are the topics extracted from the toxic comments, each shown with its top 10 words and their weights. This format is hard to read, so let us put the topic words into a dataframe.

### Topics

In [None]:
lda_topics = lda_model.print_topics()

In [None]:
def format_topics(topics_list):
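  '''
  Extract the top words of each topic from the formatted topic strings
  and return them as a dataframe with one row per topic.
  The row labels are hard-coded and were chosen by inspecting each topic's words.
  '''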
  topics_words = []

  for topic in topics_list:

    # Split the topic string through +
    word_prob = topic[1].split('+')

    # Extract only topic words
    word = re.findall(r'[a-zA-Z]+', str(word_prob))

    # Add to the topic words list
    topics_words.append(word)

  topics_df = pd.DataFrame(data = topics_words,
                           columns = ['Word1', 'Word2', 'Word3', 'Word4', 'Word5', 'Word6', 'Word7','Word8', 'Word9', 'Word10'],
                           index = ['Monetary', 'Trolling', 'Religious Conflicts', 'Dishonesty', 'Sexual Abuse', 'Identities', 
                                    'Abstract', 'Stupidity', 'Canada', 'US Politics'])

  return topics_df

In [None]:
topics_df = format_topics(lda_topics)
topics_df

Unnamed: 0,Word1,Word2,Word3,Word4,Word5,Word6,Word7,Word8,Word9,Word10
Monetary,peopl,tax,money,get,stupid,pay,state,work,need,go
Trolling,loser,troll,trash,garbag,like,anoth,piec,brain,get,pathet
Religious Conflicts,god,gun,church,cathol,use,homosexu,denver,war,jesu,weapon
Dishonesty,liar,lie,liber,clown,nfl,idiot,hypocrit,putin,justin,trudeau
Sexual Abuse,women,sexual,sex,men,woman,rape,abus,child,man,mental
Identities,white,black,racist,peopl,muslim,hate,right,kill,countri,american
Abstract,like,get,go,one,would,guy,peopl,time,think,good
Stupidity,stupid,peopl,like,one,think,say,comment,ignor,make,would
Canada,canada,countri,canadian,us,world,govern,liber,trudeau,fool,north
US Politics,trump,presid,republican,vote,democrat,obama,elect,lie,parti,clinton
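The regular expression in `format_topics` keeps only alphabetic runs, so it drops the weights and would split any token containing digits or underscores. As a hedged alternative, the sketch below parses one topic string of the form printed by `print_topics()` into (word, weight) pairs by matching the quoted tokens. The variable names are illustrative, and the sample string is shortened from the first topic above.

```python
import re

# Shortened sample of a formatted topic string, as printed by print_topics()
topic_string = '0.012*"peopl" + 0.012*"tax" + 0.011*"money"'

# Capture the weight before '*' and the token between double quotes
pattern = re.compile(r'([\d.]+)\*"([^"]+)"')
word_weights = [(word, float(weight)) for weight, word in pattern.findall(topic_string)]
print(word_weights)
# -> [('peopl', 0.012), ('tax', 0.012), ('money', 0.011)]
```

gensim can also return the pairs without any string parsing via `lda_model.show_topics(formatted=False)`, which may be simpler when the weights are needed.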


### PyLDAvis Panel

Below is the interactive visualisation of the toxic-comment topics using pyLDAvis.

In the left panel each circle represents a topic, and the distance between circles reflects how different the topics are. The horizontal bar chart on the right lists the top 30 terms.

Hovering over a topic in the left panel updates the bar chart to show the terms most relevant to that topic.

In [None]:
pyLDAvis.enable_notebook()
panel = gensimvis.prepare(lda_model, mycorpus, dictionary, mds='tsne')
panel
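The bar chart in the panel orders terms by a relevance score controlled by the λ slider: relevance(w, t) = λ · log p(w | t) + (1 − λ) · log(p(w | t) / p(w)). With λ = 1 terms are ranked purely by their probability within the topic, and with λ = 0 purely by how much more frequent they are in the topic than in the corpus overall. The sketch below illustrates this with made-up probabilities; the numbers and the helper names are hypothetical and are not taken from the model above.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """Relevance of a term to a topic for a weight lam in [0, 1]."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# Hypothetical probabilities: (p(w | topic), p(w) over the whole corpus)
terms = {"peopl": (0.030, 0.025), "racist": (0.027, 0.004)}

def rank(lam):
    """Order the terms from most to least relevant for the given lam."""
    return sorted(terms, key=lambda w: relevance(*terms[w], lam), reverse=True)

rank_by_probability = rank(1.0)  # lam = 1: within-topic probability only
rank_by_lift = rank(0.0)         # lam = 0: ratio to corpus-wide probability only
print(rank_by_probability, rank_by_lift)
```

Lowering λ pushes generic words such as "peopl", which appear in several topics, down the list and brings topic-specific words to the top.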