# Topic Modeling 

This notebook applies topic modeling to Reddit posts from a variety of subreddits in order to surface the differences and similarities between groups on Reddit.

Section Breakdown:
1. Importing Libraries and Data
2. Preliminary Text Cleaning
3. Tokenize the Posts
4. LDA Topic Modeling Using Word Counts
5. Analysis

# 1. Importing Libraries and Data
The data were obtained from the [BigQuery Reddit post dataset](https://bigquery.cloud.google.com/table/fh-bigquery:reddit_posts.2017_09?pli=1). The subreddits chosen are all text-based and deal with similar themes, so that comparisons can be made more clearly. These subreddits include:
* r/incels
* r/legaladvice
* r/redpill
* r/relationships
* r/twoxchromosomes
* r/confessions

r/incels and r/redpill are included to give a benchmark for toxicity, as these communities contain much more hate speech than the majority of subreddits. The ultimate goal is a measure that can identify similar hate speech in other environments.

In [1]:
# Libraries

# Pandas and Numpy to format data
import pandas as pd
import numpy as np

# Text cleaning
import re
import string

# Tokenize the corpora
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction import text 

# Models
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF, PCA
from gensim import corpora, models, similarities, matutils

In [2]:
# r/incels post data
incels1 = pd.read_csv("../posts/incels.csv")
incels2 = pd.read_csv("../posts/incels2.csv")
incels3 = pd.read_csv("../posts/incels3.csv")
incels = pd.concat([incels1, incels2, incels3], axis=0).reset_index(drop=True)

In [3]:
# r/legaladvice post data
legal1 = pd.read_csv("../posts/legaladvice.csv")
legal2 = pd.read_csv("../posts/legaladvice2.csv")
legal3 = pd.read_csv("../posts/legaladvice3.csv")
legal = pd.concat([legal1, legal2, legal3], axis=0).reset_index(drop=True)

In [4]:
# r/redpill post data
redpill1 = pd.read_csv("../posts/redpill.csv")
redpill2 = pd.read_csv("../posts/redpill2.csv")
redpill3 = pd.read_csv("../posts/redpill3.csv")
redpill = pd.concat([redpill1, redpill2, redpill3], axis=0).reset_index(drop=True)

In [5]:
# r/relationships post data
relationships1 = pd.read_csv("../posts/relationships.csv")
relationships2 = pd.read_csv("../posts/relationships2.csv")
relationships3 = pd.read_csv("../posts/relationships3.csv")
relationships = pd.concat([relationships1, relationships2, relationships3], axis=0).reset_index(drop=True)

In [6]:
# r/twoxchromosomes post data
twox1 = pd.read_csv("../posts/twox.csv")
twox2 = pd.read_csv("../posts/twox2.csv")
twox3 = pd.read_csv("../posts/twox3.csv")
twox = pd.concat([twox1, twox2, twox3], axis=0).reset_index(drop=True)

In [7]:
# r/confessions post data
confessions1 = pd.read_csv("../posts/confessions.csv")
confessions2 = pd.read_csv("../posts/confessions2.csv")
confessions3 = pd.read_csv("../posts/confessions3.csv")
confessions = pd.concat([confessions1, confessions2, confessions3], axis=0).reset_index(drop=True)

In [8]:
# Combine all data into one dataframe
posts = pd.concat([incels, legal, redpill, relationships, twox, confessions], axis=0).reset_index(drop=True)
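
The loading cells above repeat the same read-and-concatenate pattern for each subreddit. A minimal sketch of how that could be folded into a loop, assuming the file naming used above (`name.csv`, `name2.csv`, `name3.csv` in one directory); `load_subreddit` and `load_all` are hypothetical helper names, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_subreddit(posts_dir, stem, n_files=3):
    """Read `stem.csv`, `stem2.csv`, ... up to `stem{n_files}.csv` and stack them.

    Hypothetical helper: assumes the file naming used in the cells above.
    """
    suffixes = [""] + [str(i) for i in range(2, n_files + 1)]
    frames = [pd.read_csv(Path(posts_dir) / f"{stem}{s}.csv") for s in suffixes]
    return pd.concat(frames, axis=0).reset_index(drop=True)


def load_all(posts_dir, stems):
    """Stack every subreddit's files into a single DataFrame."""
    frames = [load_subreddit(posts_dir, stem) for stem in stems]
    return pd.concat(frames, axis=0).reset_index(drop=True)
```

With the files used in this notebook, the call would look like `posts = load_all("../posts", ["incels", "legaladvice", "redpill", "relationships", "twox", "confessions"])`.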

# 2. Preliminary Text Cleaning

In [9]:
# Function to remove urls from the corpus
def remove_url(item):
    # Drop anything starting with "http" or "www", up to the next whitespace
    item = re.sub(r"http\S+", "", item)
    item = re.sub(r"www\S+", "", item)
    return item

In [10]:
# Function to remove posts that are too short, where topics would be more difficult to identify
def remove_short_posts(item):
    item = str(item)
    if len(item) > 300:
        return item
    else:
        return None

In [11]:
# Remove posts that were removed or deleted
posts = posts[~posts['selftext'].isin(['[removed]','[deleted]'])].reset_index(drop=True)
# Apply the earlier function to remove short posts
posts['selftext'] = posts['selftext'].apply(remove_short_posts)
# Remove all the None type posts
posts.dropna(subset=['selftext'], inplace = True)
# Reset the index
posts = posts.reset_index(drop=True)
# Apply the earlier function to remove urls
posts['selftext'] = posts['selftext'].apply(remove_url)
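
In the cell above, the 300-character filter runs on the raw text, before URLs are stripped, so a post that is mostly a link can pass the filter and end up much shorter afterwards. The sketch below illustrates this on made-up strings using only the standard library. `clean_post` is a hypothetical variant that strips URLs first; it is an assumption for illustration, not the notebook's method.

```python
import re

MIN_CHARS = 300  # same threshold as remove_short_posts above


def remove_url(item):
    """Drop anything that starts with http or www, up to the next whitespace."""
    item = re.sub(r"http\S+", "", item)
    return re.sub(r"www\S+", "", item)


def clean_post(item, min_chars=MIN_CHARS):
    """Hypothetical variant: strip URLs first, then apply the length filter."""
    cleaned = remove_url(str(item))
    return cleaned if len(cleaned) > min_chars else None


# Made-up examples: one post that is mostly a link, one of ordinary text
link_heavy = "Read this: http://example.com/" + "a" * 300
long_text = "word " * 80

print(len(link_heavy) > MIN_CHARS)        # raw length passes the original filter
print(clean_post(link_heavy))             # rejected once the URL is stripped
print(clean_post(long_text) is not None)  # ordinary long text is kept
```

Whether this ordering matters depends on how many link-heavy posts the data contains, which this sketch does not measure.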

# 3. Tokenize the Posts

In [12]:
# Custom stop words including Reddit specific slang
other_stop_words = ['incel','chad', 'alpha', 'beta',
                    've','gt','amp','nbsp','rp','pm', 'ampnbsp','th','aa',
                    'adam','john',
                    'really', 'just',
                    'b','c','d','e','f','g','h','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z',
                    '’','”',"“"
                   ]

stop_words = text.ENGLISH_STOP_WORDS.union(other_stop_words)

In [13]:
# Custom tokenizer in order to lemmatize words
def custom_tokenizer(doc):

    # remove punctuation
    remove_punct = str.maketrans('', '', string.punctuation)
    doc = doc.translate(remove_punct)

    # remove digits and convert to lower case
    remove_digits = str.maketrans('', '', string.digits)
    doc = doc.lower().translate(remove_digits)

    # tokenize
    tokens = word_tokenize(doc)

    # remove stop words
    tokens_stop = [y for y in tokens if y not in stop_words]

    # lemmatize
    lem = WordNetLemmatizer()
    tokens_lem = [lem.lemmatize(y) for y in tokens_stop]

    return tokens_lem
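
Topic 0 in the LDA output in section 4 is dominated by tokens such as "im", "ive" and "dont". The sketch below shows where they can come from: the punctuation step deletes the apostrophe before stop words are filtered, so "I'm" becomes "im", which is not in the stop-word list. It uses only the standard library, and `strip_and_lower` is a hypothetical name that mirrors the first two steps of `custom_tokenizer`.

```python
import string

# Same translation tables as in custom_tokenizer
REMOVE_PUNCT = str.maketrans("", "", string.punctuation)
REMOVE_DIGITS = str.maketrans("", "", string.digits)


def strip_and_lower(doc):
    """Mirror the punctuation, lower-casing, and digit steps of custom_tokenizer."""
    return doc.translate(REMOVE_PUNCT).lower().translate(REMOVE_DIGITS)


sample = "I'm 25 and I don't know what I've done."
print(strip_and_lower(sample).split())

# string.punctuation is ASCII only, so a curly apostrophe survives this step
print(strip_and_lower("don’t"))
```

One option, offered as an assumption rather than the notebook's approach, is to add the apostrophe-free forms ("im", "ive", "dont") to `other_stop_words`.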

In [16]:
# Tokenize with CountVectorizer
count_vect = CountVectorizer(tokenizer=custom_tokenizer, max_df=0.8, min_df=0.025)
count_vect.fit(posts['selftext'])
# Combine some common terms: point 'man' at the column of 'men' and 'kid' at the column of 'child'
count_vect.vocabulary_['man'] = count_vect.vocabulary_['men']
count_vect.vocabulary_['kid'] = count_vect.vocabulary_['child']
# Get token counts per post
counts = count_vect.transform(posts['selftext']).transpose()
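
The `max_df=0.8` and `min_df=0.025` arguments are document-frequency cutoffs: given as floats, they keep a token only if it appears in at least 2.5% and at most 80% of the posts. A minimal plain-Python sketch of that cutoff on a toy tokenized corpus follows. `filter_by_doc_freq` is a hypothetical function that only imitates the filtering rule, not `CountVectorizer` itself, and the toy thresholds are chosen to suit five documents.

```python
from collections import Counter


def filter_by_doc_freq(tokenized_docs, min_df=0.025, max_df=0.8):
    """Return the tokens whose document frequency lies within [min_df, max_df].

    Both bounds are proportions of the number of documents.
    """
    n_docs = len(tokenized_docs)
    # Count each token once per document, however often it repeats there
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    low, high = min_df * n_docs, max_df * n_docs
    return sorted(tok for tok, df in doc_freq.items() if low <= df <= high)


# Made-up tokenized posts
toy_docs = [
    ["like", "job", "work"],
    ["like", "friend"],
    ["like", "job"],
    ["like", "house"],
    ["like", "friend", "job"],
]
print(filter_by_doc_freq(toy_docs, min_df=0.4, max_df=0.8))
```

In this toy corpus "like" occurs in every document and is dropped as too common, while "work" and "house" occur once each and are dropped as too rare.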

In [17]:
# Tokenize with TfidfVectorizer
tfidf_vect = TfidfVectorizer(tokenizer=custom_tokenizer, max_df=0.8, min_df=0.025)
tfidf_vect.fit(posts['selftext'])
# Combine some common terms: point 'man' at the column of 'men' and 'kid' at the column of 'child'
tfidf_vect.vocabulary_['man'] = tfidf_vect.vocabulary_['men']
tfidf_vect.vocabulary_['kid'] = tfidf_vect.vocabulary_['child']
# Get token tfidf per post
tfidf = tfidf_vect.transform(posts['selftext']).transpose()

# 4. LDA Topic Modeling Using Word Counts

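Note: The LDA model is trained on the word counts only. The TF-IDF matrix is not used here, because LDA models each document as a collection of word counts rather than weighted scores.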

In [18]:
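# Run the model
# Sparse2Corpus treats columns as documents by default, hence the transpose above
corpus = matutils.Sparse2Corpus(counts)
# Map column index -> token
id2word = {v: k for k, v in count_vect.vocabulary_.items()}
lda = models.LdaModel(corpus=corpus, num_topics=20, id2word=id2word, passes=50)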

In [19]:
# View the topics
lda.print_topics()

[(0,
  '0.218*"im" + 0.086*"ive" + 0.053*"dont" + 0.049*"he" + 0.034*"know" + 0.031*"like" + 0.029*"shes" + 0.020*"id" + 0.016*"thats" + 0.015*"sure"'),
 (1,
  '0.105*"work" + 0.069*"job" + 0.059*"company" + 0.033*"hour" + 0.029*"business" + 0.029*"working" + 0.027*"time" + 0.025*"day" + 0.022*"week" + 0.022*"employee"'),
 (2,
  '0.208*"house" + 0.105*"home" + 0.062*"room" + 0.046*"door" + 0.035*"property" + 0.033*"water" + 0.025*"living" + 0.022*"leave" + 0.022*"come" + 0.021*"clean"'),
 (3,
  '0.042*"want" + 0.042*"dont" + 0.039*"like" + 0.031*"know" + 0.028*"feel" + 0.026*"thing" + 0.023*"make" + 0.020*"people" + 0.019*"think" + 0.016*"say"'),
 (4,
  '0.341*"friend" + 0.062*"people" + 0.050*"group" + 0.037*"social" + 0.031*"video" + 0.031*"game" + 0.029*"picture" + 0.023*"best" + 0.023*"facebook" + 0.021*"close"'),
 (5,
  '0.033*"relationship" + 0.027*"time" + 0.027*"feel" + 0.025*"like" + 0.021*"want" + 0.020*"friend" + 0.019*"year" + 0.018*"love" + 0.016*"thing" + 0.016*"boyfriend

These topics seem to be fairly coherent. Topic labels:
0. Other
1. Work
2. House Ownership
3. Insecurity
4. Friendship
5. Romantic Relationships
6. Seeking Advice
7. Marriage
8. Gender Issues
9. Time Periods
10. Illegal Activity
11. Crude Sex
12. College
13. Lawyer
14. Thankful for Advice
15. Respectful Sex
16. Car Legal Issues
17. Texting
18. Apartments
19. Family

In [20]:
# Predict topics for each post
lda_corpus = lda[corpus]
lda_docs = [doc for doc in lda_corpus]
# View example posts to get a sense of the topic distributions and their accuracy
lda_docs[0:5]

[[(0, 0.11682687),
  (3, 0.34772292),
  (8, 0.122964025),
  (9, 0.07831175),
  (12, 0.12128804),
  (13, 0.10906896),
  (15, 0.09241391)],
 [(6, 0.49987704), (11, 0.455123)],
 [(3, 0.38820687),
  (8, 0.25135508),
  (11, 0.19065842),
  (15, 0.10549299),
  (19, 0.04009305)],
 [(0, 0.21043836),
  (3, 0.35669893),
  (9, 0.17327507),
  (12, 0.10357035),
  (14, 0.048274882),
  (15, 0.066565976)],
 [(0, 0.086266175),
  (8, 0.07941179),
  (9, 0.08448296),
  (11, 0.5931004),
  (13, 0.03766228),
  (14, 0.06181989),
  (15, 0.046423126)]]

In [22]:
# Find the predicted (highest-weight) topic for each post
lda_topics = [max(doc, key=lambda item: item[1])[0] for doc in lda_docs]
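
Assigning each post its single highest-weight topic discards the rest of the distribution. The sketch below reuses the five example distributions printed above and reports, for each post, the dominant topic together with its margin over the runner-up. `top_two` is a hypothetical helper added for illustration.

```python
# First five per-post topic distributions, copied from the output above
lda_docs_sample = [
    [(0, 0.11682687), (3, 0.34772292), (8, 0.122964025), (9, 0.07831175),
     (12, 0.12128804), (13, 0.10906896), (15, 0.09241391)],
    [(6, 0.49987704), (11, 0.455123)],
    [(3, 0.38820687), (8, 0.25135508), (11, 0.19065842), (15, 0.10549299),
     (19, 0.04009305)],
    [(0, 0.21043836), (3, 0.35669893), (9, 0.17327507), (12, 0.10357035),
     (14, 0.048274882), (15, 0.066565976)],
    [(0, 0.086266175), (8, 0.07941179), (9, 0.08448296), (11, 0.5931004),
     (13, 0.03766228), (14, 0.06181989), (15, 0.046423126)],
]


def top_two(doc):
    """Return the dominant topic and its margin over the runner-up."""
    ranked = sorted(doc, key=lambda pair: pair[1], reverse=True)
    best_topic, best_weight = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_topic, best_weight - runner_up


for doc in lda_docs_sample:
    topic, margin = top_two(doc)
    print(f"topic {topic:2d}  margin {margin:.3f}")
```

The second post, for example, is assigned to topic 6 even though topic 11 carries nearly as much weight (about 0.500 against 0.455), so small margins may be worth keeping in mind when reading the per-topic results.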

In [26]:
# Create a copy of the data frame with only relevant info
# (.copy() makes this an independent frame, so adding a column does not raise SettingWithCopyWarning)
new_posts = posts[['subreddit','score','selftext']].copy()
# Add predicted topic for each post
new_posts['lda_topic'] = lda_topics


# 5. Analysis
Discoveries from the topics found through LDA.

#### Which topics score the highest and lowest on average?

In [28]:
new_posts.groupby('lda_topic')['score'].mean().sort_values()

lda_topic
16     16.755669
18     17.338235
9      17.922013
0      20.177503
13     20.320028
17     20.583333
2      23.339080
5      29.741906
15     32.690141
1      34.226852
10     35.845679
14     41.523262
19     54.977716
8      62.199422
3      66.156520
11     89.198892
4     108.353448
6     111.297777
7     115.892202
12    159.488789
Name: score, dtype: float64

Topic 16 (Car Legal Issues) has the lowest mean score, about 16.8, while topic 12 (College) has the highest, about 159.5.
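
A mean can be pulled up by a handful of very high-scoring posts, so it may be worth checking the median and the number of posts per topic alongside it. A minimal sketch on hypothetical toy data follows. `toy_posts` stands in for `new_posts`, and its numbers are invented for illustration rather than taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for new_posts
toy_posts = pd.DataFrame({
    "lda_topic": [12, 12, 12, 12, 16, 16, 16, 16],
    "score":     [1, 2, 3, 630, 10, 15, 20, 22],
})

# Mean, median, and post count per topic, ordered by mean as in the cell above
summary = (
    toy_posts.groupby("lda_topic")["score"]
    .agg(["mean", "median", "count"])
    .sort_values("mean")
)
print(summary)
```

In this toy example one outlier gives topic 12 the higher mean even though its median is lower than topic 16's. Whether the real scores behave this way would need to be checked on `new_posts`.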

#### How do subreddits compare with each other? 

<img src="topic_distributions.png" style="width: 500px;" align="left"/>

After graphing the distribution of the topics (where each color in the graph above represents a different topic), it is clear that r/incels and r/theredpill share a very similar breakdown, which could help identify other toxic communities in the future. r/confessions and r/twoxchromosomes also appear to be similar, while r/legaladvice covers many more topics with no predominant one, and r/relationships largely focuses on a single topic (5: Romantic Relationships).

In addition, the pink topic (11: Crude Sex) and the purple topic (3: Insecurity) are the predominant ones in the toxic subreddits, while they are minor occurrences in the other subreddits. This suggests that the presence of these topics can indicate where hate speech is occurring.
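
The visual comparison of subreddits can also be expressed as a number. The sketch below builds each subreddit's topic shares with `pd.crosstab` and compares pairs with cosine similarity. The toy data and the choice of cosine similarity are assumptions for illustration. The original analysis relies on the graph, and `toy_posts` stands in for `new_posts`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for new_posts
toy_posts = pd.DataFrame({
    "subreddit": ["incels"] * 4 + ["redpill"] * 4 + ["legaladvice"] * 4,
    "lda_topic": [3, 3, 11, 11, 3, 11, 11, 11, 13, 13, 16, 1],
})

# Rows: subreddits, columns: topics, values: share of that subreddit's posts
topic_share = pd.crosstab(
    toy_posts["subreddit"], toy_posts["lda_topic"], normalize="index"
)


def cosine_similarity(a, b):
    """Cosine similarity between two topic-share vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


names = list(topic_share.index)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        sim = cosine_similarity(topic_share.loc[first], topic_share.loc[second])
        print(f"{first} vs {second}: {sim:.2f}")
```

Values near 1 indicate similar topic breakdowns and values near 0 indicate little overlap. Other measures, such as the Jensen-Shannon distance, could be swapped in.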