
## Topic modeling and flair prediction from the banned r/proED/ subreddit (Part 2)

Part 2 of this project performs topic modeling on the dataset scraped and saved in Part 1.

### Import libraries and dataset
I will start by downloading an English language model (`en_core_web_md`) from spaCy.

In [None]:
!python -m spacy download en_core_web_md | grep -v 'already satisfied'

Restart the runtime after the model has been downloaded, then import all the libraries I will use.

In [None]:
import numpy as np
import pandas as pd
import re

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import string
import spacy
nlp = spacy.load("en_core_web_md")

# suppress sklearn deprecation warnings by replacing warnings.warn with a no-op
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Mount my Google Drive (where the dataset is stored) onto this notebook.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Import the dataset

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Colab_Notebooks/proED_full_dataset.csv')
df = df.iloc[:, 1:]
df.head()

Unnamed: 0,Flair,Title,User,Date,Text
0,none,Will be trying my first 2-day fast and I plan ...,/u/CharChar12 [5' 9.5 |160lbs|23.5| Male],2017-03-14 00:34:55,[removed]
1,Rant/Rave,Body issues and my mom,/u/PutinsThirdLover,2017-03-14 00:31:14,"Edit: Can't flair, as on mobile.I'm just frust..."
2,Discussion,What's the most amount of weight you've lost i...,/u/[deleted],2017-03-13 22:50:08,[deleted]
3,Rant/Rave,Day 8 of restriction. Want encouragement please ♡,/u/AdloraOfSolitude [5'2 | 105.8 | 20 | -12 lb...,2017-03-13 22:15:27,[removed]
4,Discussion,Why We Eat Too Much,"/u/skin_ny [5'9.5"" | 113.6 | 16.19 | -44 | F]",2017-03-13 21:46:32,http://www.thebookoflife.org/why-we-eat-too-much/


### Pre-processing
I'll merge flairs with similar meanings in order to reduce the number of categories, which should make flair prediction easier. I will also remove stopwords and convert the text to lowercase.

In [None]:
df['Flair'] = df['Flair'].apply(lambda x: x.lower())
df['Flair'] = df['Flair'].apply(lambda x: "rant/rave" if re.search(r"(?:rant|rave)", x) else x)
df['Flair'] = df['Flair'].apply(lambda x: "discussion" if re.search(r"discussion", x) else x)
df['Flair'] = df['Flair'].apply(lambda x: "humor" if re.search(r"humor", x) else x)
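
As a quick sanity check of the merging rules above, here is a minimal sketch that applies the same three patterns to a handful of labels. The raw labels and the helper name `normalize_flair` are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical raw flair labels, chosen only for illustration.
raw_flairs = ["Rant/Rave", "Rant", "Discussion (TW)", "Humor/Meme", "Help"]

def normalize_flair(flair):
    """Lowercase a flair and collapse known variants into one canonical label."""
    flair = flair.lower()
    if re.search(r"rant|rave", flair):
        return "rant/rave"
    if re.search(r"discussion", flair):
        return "discussion"
    if re.search(r"humor", flair):
        return "humor"
    return flair  # anything else is kept as-is, only lowercased

normalized = [normalize_flair(f) for f in raw_flairs]
print(normalized)
```

Labels that match none of the patterns pass through unchanged apart from lowercasing, which is also what the chained `apply` calls do.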

In [None]:
# select the 10 largest categories (value_counts sorts in descending order)
top10 = df['Flair'].value_counts().index[:10].tolist()

df = df[df['Flair'].isin(top10)]
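
The same "keep only the most frequent categories" step can be sketched without pandas. This is a minimal illustration on a hypothetical list of labels, with `top_n` set to 2 instead of 10 to keep it short.

```python
from collections import Counter

# Hypothetical flair labels, one per post.
flairs = ["help", "goal", "help", "rant/rave", "help", "goal", "meta"]
top_n = 2  # the notebook keeps 10; 2 is enough for this toy list

counts = Counter(flairs)
# most_common returns (label, count) pairs sorted by descending count
top_labels = [label for label, _ in counts.most_common(top_n)]

# keep only the posts whose flair is one of the top labels
kept = [f for f in flairs if f in top_labels]
print(top_labels, kept)
```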

I will make a smaller dataframe consisting of `Text` (titles and main contents combined) and `Flair`.

In [None]:
data = {'Text': df['Title'] + ' ' + df['Text'], 'Flair': df['Flair']}
df_sm = pd.DataFrame(data=data)
df_sm.head()

Unnamed: 0,Text,Flair
0,Will be trying my first 2-day fast and I plan ...,none
1,"Body issues and my mom Edit: Can't flair, as o...",rant/rave
2,What's the most amount of weight you've lost i...,discussion
3,Day 8 of restriction. Want encouragement pleas...,rant/rave
4,Why We Eat Too Much http://www.thebookoflife.o...,discussion


Remove rows with an empty value or an implausibly long one (probably the result of a glitch during scraping).

In [None]:
df_sm = df_sm[df_sm['Text'].notna()]
df_sm = df_sm[df_sm['Text'].map(len) < 1000000]

Create a list of stopwords and add \[deleted\] and \[removed\] to it.

In [None]:
stpwrd = stopwords.words('english')
stpwrd.extend(['[deleted]', '[removed]'])

Remove punctuation and stopwords, and convert the text to lowercase. This will take some time...

In [None]:
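# Build the stopword set once; calling stopwords.words('english') for every word is very slow.
# Note: this uses the stock NLTK list rather than `stpwrd`, and punctuation is stripped before
# filtering, so "[removed]" becomes "removed" and "don't" becomes "dont". Both therefore
# survive, which is why such tokens show up in the topics later on.
english_stopwords = set(stopwords.words('english'))

def process(text):
    # drop punctuation characters
    nopunc = ''.join(char for char in text if char not in string.punctuation)
    # lowercase and drop stopwords
    return ' '.join(word.lower() for word in nopunc.split() if word.lower() not in english_stopwords)

df_sm['Text'] = df_sm['Text'].map(lambda x: x if str(x) == 'nan' else process(x))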
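
The order of the two cleaning steps matters. The sketch below compares stripping punctuation before filtering stopwords (as `process` does) with filtering first. It uses a tiny hypothetical stopword set rather than the NLTK list, and both helper names are made up for illustration.

```python
import string

# Tiny hypothetical stopword set. Like NLTK's English list, it stores
# contractions with their apostrophes (e.g. "don't").
toy_stopwords = {"i", "to", "don't", "[removed]"}

def strip_then_filter(text):
    """Remove punctuation first, then drop stopwords (the order used above)."""
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    return [w.lower() for w in no_punct.split() if w.lower() not in toy_stopwords]

def filter_then_strip(text):
    """Drop stopwords first, then remove punctuation from what is left."""
    kept = [w.lower() for w in text.split() if w.lower() not in toy_stopwords]
    table = str.maketrans("", "", string.punctuation)
    stripped = [w.translate(table) for w in kept]
    return [w for w in stripped if w]  # discard tokens that were pure punctuation

sample = "I don't want to eat [removed]"
a = strip_then_filter(sample)
b = filter_then_strip(sample)
print(a)
print(b)
```

With punctuation removed first, `dont` and `removed` no longer match their stopword entries and stay in the text; filtering first removes them.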

Save the cleaned dataset

In [None]:
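# save to the same Drive folder that the next cell reads from
df_sm.to_csv('/content/drive/MyDrive/Colab_Notebooks/proED_full_dataset_clean_sm.csv')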

Re-import the dataset

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Colab_Notebooks/proED_full_dataset_clean_sm.csv')
df = df.iloc[:, 1:]
df.head()

Unnamed: 0,Text,Flair
0,trying first 2day fast plan use post motivatio...,none
1,body issues mom edit cant flair mobileim frust...,rant/rave
2,whats amount weight youve lost shortest period...,discussion
3,day 8 restriction want encouragement please ♡ ...,rant/rave
4,eat much httpwwwthebookoflifeorgwhyweeattoomuch,discussion


I will focus on 5 of the 10 flairs. The code below keeps only the rows with the selected flairs.

In [None]:
flairs_selected = ['rant/rave', 'discussion', 'help', 'goal', 'thinspo']
df = df[df['Flair'].isin(flairs_selected)]

#check the size of dataset
len(df)

47468

### Topic modeling

I will first process the dataset with TF-IDF vectorization to create a document-term matrix. I will then use Non-Negative Matrix Factorization (NMF) to generate 5 components (topics), each with a list of its most characteristic words. \
I adapted the code from an assignment in the Udemy course "NLP - Natural Language Processing with Python" ([link](https://www.udemy.com/course/nlp-natural-language-processing-with-python/) to course).

In [None]:
tfidf = TfidfVectorizer()
dtm = tfidf.fit_transform(df['Text'])
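
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a hypothetical three-document corpus. It assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, and the L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults. The toy documents and the helper name `tfidf_vector` are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized documents.
docs = [
    ["binge", "fast", "help"],
    ["fast", "fast", "goal"],
    ["help", "goal", "goal"],
]

n_docs = len(docs)
# document frequency: in how many documents each term occurs
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf_vector(doc):
    """Return an L2-normalized {term: weight} mapping for one document."""
    counts = Counter(doc)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf_vector(doc) for doc in docs]
print(vectors[0])
```

In the first document, `binge` occurs in only one document overall and so receives a larger weight than `fast`, which occurs in two.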

In [None]:
nmf_model = NMF(n_components=5,random_state=42)
nmf_model.fit(dtm)

NMF(n_components=5, random_state=42)
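
As a rough guide to what `NMF` fits: writing $V$ for the document-term matrix `dtm`, the model looks for two non-negative matrices whose product approximates it. Assuming scikit-learn's default Frobenius loss with no regularization, the objective can be written as

$$
\min_{W \ge 0,\; H \ge 0} \; \frac{1}{2} \lVert V - W H \rVert_F^2,
\qquad V \in \mathbb{R}^{n \times m},\; W \in \mathbb{R}^{n \times k},\; H \in \mathbb{R}^{k \times m}
$$

Here $n$ is the number of posts, $m$ the vocabulary size, and $k = 5$ the number of topics. The rows of $H$ correspond to `nmf_model.components_`, so the largest entries in each row point to the words that characterize a topic. The symbols $V$, $W$, $H$, $n$, $m$ and $k$ are notation introduced here, not names from the code.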

In [None]:
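nmf_topics = []
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
feature_names = tfidf.get_feature_names_out().tolist()
for index, topic in enumerate(nmf_model.components_):
    print(f'THE TOP 10 WORDS FOR TOPIC #{index}')
    # argsort is ascending, so the last 10 indices are the highest-weighted words
    topics = [feature_names[i] for i in topic.argsort()[-10:]]
    nmf_topics.append(topics)
    print(topics)
    print('\n')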

THE TOP 10 WORDS FOR TOPIC #0
['going', 'lose', 'like', 'cant', 'want', 'know', 'dont', 'ive', 'weight', 'im']


THE TOP 10 WORDS FOR TOPIC #1
['thinspo', 'else', 'dae', 'binge', 'fast', 'anyone', 'ed', 'rant', 'help', 'deleted']


THE TOP 10 WORDS FOR TOPIC #2
['anyone', 'purge', 'diet', 'binge', 'fast', 'weight', 'need', 'fasting', 'help', 'removed']


THE TOP 10 WORDS FOR TOPIC #3
['format', 'thread', 'estimatecalculation', 'total', 'calorie', 'food', '2016', 'post', 'daily', 'diary']


THE TOP 10 WORDS FOR TOPIC #4
['get', 'want', 'binge', 'calories', 'day', 'food', 'eating', 'feel', 'like', 'eat']




### Most characteristic words for each flair
Now I will find out which words are most strongly associated with each flair, and then examine how similar these word lists are to the NMF topics generated above.

In [None]:
# map each flair name to an integer label
mapping_dict = {f: i for i, f in enumerate(flairs_selected)}

X = df['Text']
y = df['Flair']

y = y.map(mapping_dict).values

In [None]:
tfidf = TfidfVectorizer()
feat = tfidf.fit_transform(X)

In [None]:
# chi-squared test of each word against each flair (one-vs-rest)
chisq_topics = {}
N = 10    # number of words to list per flair
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
feature_names = np.array(tfidf.get_feature_names_out())
for f, i in sorted(mapping_dict.items()):
    chi2_scores, _ = chi2(feat, y == i)
    indices = np.argsort(chi2_scores)    # ascending, so the top words come last
    feat_names = feature_names[indices]
    unigrams = [str(w) for w in feat_names if len(w.split(' ')) == 1]
    chisq_topics[f] = unigrams[-N:]
    print("\nFlair '{}':".format(f))
    print("Most correlated words: {}".format(unigrams[-N:]))


Flair 'discussion':
Most correlated words: ['necessarily', 'include', 'estimatecalculation', 'format', 'else', 'anyone', '2016', 'thread', 'diary', 'dae']

Flair 'goal':
Most correlated words: ['luck', 'nsv', 'proud', 'reached', 'accountability', 'shaping', 'finally', 'checkin', 'goals', 'goal']

Flair 'help':
Most correlated words: ['ephedrine', 'suggestions', 'question', 'please', 'thinspo', 'ec', 'removed', 'advice', 'tips', 'help']

Flair 'rant/rave':
Most correlated words: ['thread', 'dae', 'daily', 'diary', 'fuck', 'hate', 'anyone', 'thinspo', 'rant', 'fucking']

Flair 'thinspo':
Most correlated words: ['antithinspo', 'bw', 'kpop', 'nsfw', 'reverse', 'daily', 'male', 'thinspiration', 'album', 'thinspo']
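
To show what a one-vs-rest chi-squared score measures, here is a minimal pure-Python sketch. It compares each term's observed total inside and outside the target flair with the totals expected if the term were independent of the flair. The matrix, the labels and the helper name `chi2_scores` are hypothetical, and this is an illustration of the statistic rather than a reproduction of scikit-learn's implementation.

```python
# Hypothetical term weights: rows are documents, columns are terms.
terms = ["goal", "help", "anyone"]
X = [
    [2.0, 0.0, 1.0],
    [3.0, 0.0, 1.0],
    [0.0, 2.0, 1.0],
    [0.0, 3.0, 1.0],
]
is_target = [True, True, False, False]  # one-vs-rest labels, like `y == i`

def chi2_scores(X, is_target):
    """Chi-squared score per term; assumes every term occurs at least once."""
    n_docs = len(X)
    p_target = sum(is_target) / n_docs
    scores = []
    for j in range(len(X[0])):
        total = sum(row[j] for row in X)
        observed_in = sum(row[j] for row, t in zip(X, is_target) if t)
        observed_out = total - observed_in
        # totals expected if the term were spread evenly across documents
        expected_in = total * p_target
        expected_out = total * (1 - p_target)
        scores.append(
            (observed_in - expected_in) ** 2 / expected_in
            + (observed_out - expected_out) ** 2 / expected_out
        )
    return scores

scores = chi2_scores(X, is_target)
ranked = sorted(zip(terms, scores), key=lambda pair: pair[1], reverse=True)
print(ranked)
```

Note that `help` scores as high as `goal` even though it never occurs in the target documents: the statistic flags terms that are unusually rare in a flair as well as terms that are unusually common.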


Now let's compare the two lists of keywords and see how much they overlap.

In [None]:
flair_names = list(chisq_topics.keys())
for i, nmf in enumerate(nmf_topics):
  best_overlap = 0   # renamed from `max` to avoid shadowing the built-in
  best_idx = 0
  for j, chq in enumerate(chisq_topics.values()):
    overlaps = len(set(nmf) & set(chq))
    if overlaps > best_overlap:
      best_overlap = overlaps
      best_idx = j
  if best_overlap == 0:
    print(f'There is no good match for Topic {i}')
  else:
    common = set(nmf) & set(chisq_topics[flair_names[best_idx]])
    print(f'The most similar category of Topic {i} is {flair_names[best_idx]}, which has these keywords in common : {common}')


There is no good match for Topic 0
The most similar category of Topic 1 is rant/rave, which has these keywords in common : {'thinspo', 'rant', 'anyone', 'dae'}
The most similar category of Topic 2 is help, which has these keywords in common : {'removed', 'help'}
The most similar category of Topic 3 is discussion, which has these keywords in common : {'diary', '2016', 'estimatecalculation', 'thread', 'format'}
There is no good match for Topic 4
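
A raw overlap count can also be expressed as a proportion. The sketch below computes the Jaccard similarity (shared words divided by all distinct words) for Topic 3 and the 'discussion' flair, using the two keyword lists printed earlier. The choice of Jaccard is an illustrative alternative, not something the notebook itself uses.

```python
# Keyword lists copied from the outputs above.
topic_3 = ['format', 'thread', 'estimatecalculation', 'total', 'calorie',
           'food', '2016', 'post', 'daily', 'diary']
discussion = ['necessarily', 'include', 'estimatecalculation', 'format', 'else',
              'anyone', '2016', 'thread', 'diary', 'dae']

shared = set(topic_3) & set(discussion)
all_words = set(topic_3) | set(discussion)

# Jaccard similarity: 0 means no overlap, 1 means identical sets
jaccard = len(shared) / len(all_words)
print(sorted(shared), round(jaccard, 3))
```

Because both lists have exactly 10 words, ranking by Jaccard gives the same best match as ranking by the raw count; the ratio becomes more useful when the lists differ in length.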


Based on these results, some of the topics generated by NMF overlap with the actual flairs (Topic 1 with 'rant/rave', Topic 2 with 'help', Topic 3 with 'discussion'), while no topic corresponds to 'thinspo' or 'goal'. This makes sense given that far fewer posts carry these two flairs in the dataset.

In [None]:
df['Flair'].value_counts()

rant/rave     19770
discussion    13445
help           9687
goal           2375
thinspo        2191
Name: Flair, dtype: int64

This is the end of Part 2. Thank you for tagging along!