# Topic Modelling with Latent Dirichlet Allocation

This is a mini-project in support of my larger Toxic Comment Project. We will do a light topic-modelling exercise with Latent Dirichlet Allocation (LDA) to get some insight into the comment data.


http://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation

http://brooksandrew.github.io/simpleblog/articles/latent-dirichlet-allocation-under-the-hood/

### The Data

In [5]:
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.decomposition import LatentDirichletAllocation

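# Load the training set and drop the id column.
tc = pd.read_csv('C:\\Users\\harri\\.kaggle\\competitions\\jigsaw-toxic-comment-classification-challenge\\train.csv\\train.csv')
tc = tc.drop('id', axis = 1)
# After the drop, column 0 is comment_text and the remaining columns are the six labels.
X = tc.iloc[:,0]
y = tc.iloc[:, 1:]
# This split is not used in this section; the topic model below is fit on all of X.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size =0.2)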

tc.head()

| | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|
| 0 | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | "\nMore\nI can't make any real suggestions on ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | You, sir, are my hero. Any chance you remember... | 0 | 0 | 0 | 0 | 0 | 0 |


In [5]:
X.iloc[0]

"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27"

### The Topics

In [6]:
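n_samples = 159571    # rows in the training set (informational; not used below)
n_features = 4000     # vocabulary size kept by the vectorizer
n_components = 5      # number of topics
n_top_words = 20      # words printed per topic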


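def print_top_words(model, feature_names, n_top_words):
    """Print the n_top_words highest-weighted words of each topic."""
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        # argsort() is ascending, so the reversed slice takes the largest weights first.
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()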

    
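# Ignore words that appear in more than 90% of comments or in fewer than 2 comments,
# drop English stop words, and keep the n_features most frequent remaining terms.
count = CountVectorizer(max_df = 0.90, min_df = 2, max_features = n_features, stop_words = 'english')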


tf = count.fit_transform(X)
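# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
# (on scikit-learn < 1.0, call count.get_feature_names() instead).
count_feature_names = count.get_feature_names_out()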
print('Done with vectorizer.')
print()

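# Online variational Bayes with 5 passes over the data. No random_state is set,
# so the topics will differ somewhat from run to run.
lda = LatentDirichletAllocation(n_components = n_components, max_iter = 5, learning_method = 'online')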
lda.fit(tf)
print('Done with LDA.')
print()


print_top_words(lda, count_feature_names, n_top_words)

Done with vectorizer.

Done with LDA.

Topic #0: fuck blocked utc continue editing block did edit stop fucking suck ip gay 2005 faggot request ass 2006 color 20
Topic #1: article just don know think like page time people ve talk good make way did really wp section say ll
Topic #2: wikipedia page talk help like thank hi pages welcome user edit editing thanks ask hope vandalism use hello want place
Topic #3: people world english used does lol says state years book language term new history fact source states american old redirect
Topic #4: article wikipedia page deletion image use deleted articles link sources source information http talk links subject add com tag content
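
The slice `topic.argsort()[:-n_top_words - 1:-1]` inside `print_top_words` is compact but easy to misread. Here is a minimal sketch on a made-up weight vector (the words and weights are hypothetical, chosen only for illustration, and are not taken from the fitted model) showing that it picks the indices of the largest weights, largest first:

```python
import numpy as np

# Hypothetical vocabulary and topic weights, for illustration only.
feature_names = np.array(["page", "talk", "edit", "article", "user"])
topic = np.array([0.5, 3.0, 1.5, 4.0, 0.1])
n_top_words = 3

# argsort() orders indices by ascending weight; the slice walks that order
# backwards from the end and stops after n_top_words entries.
top_idx = topic.argsort()[:-n_top_words - 1:-1]
top_words = [str(feature_names[i]) for i in top_idx]

# An equivalent, arguably more readable spelling of the same selection.
alt_idx = np.argsort(topic)[::-1][:n_top_words]

print(top_idx.tolist())
print(top_words)
```

Either spelling gives the same ranking when the weights are distinct; the two-step form may be easier to read when revisiting the notebook later.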

