**Reddit scraping project** \\
In this project, I will scrape the subreddit r/EDAnonymous and try different ML models to see whether they can predict the "flair" (tag) on posts. 

#required packages

In [None]:
!pip install praw #official python reddit API wrapper
!python -m spacy download en_core_web_md

In [21]:
import praw
import numpy as np
import pandas as pd
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import string
import spacy
import warnings
warnings.filterwarnings('ignore')

nlp = spacy.load("en_core_web_md")

import torch
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_selection import chi2

#scraping

In [2]:
# Read-only instance
reddit_read_only = praw.Reddit(client_id="your client id",
                               client_secret="your client secret",
                               user_agent="your user agent")

In [3]:
subreddit = reddit_read_only.subreddit("EDAnonymous")

The code below scrapes the top posts of the past month (up to 1,000) and stores the relevant fields of each post in a dictionary.

In [None]:
sub_limit = 1000
posts = subreddit.top("month", limit=sub_limit)
 
posts_dict = {"Title": [], "Post Text": [],
              "Flair": [], "Score": [],
              "Total Comments": [], "Post URL": []
              }
 
for post in posts:
    # Title of each post
    posts_dict["Title"].append(post.title)
     
    # Text inside a post
    posts_dict["Post Text"].append(post.selftext)
     
    # Flair (tag) of each post; None when the post has no flair
    posts_dict["Flair"].append(post.link_flair_text)
     
    # The score of a post
    posts_dict["Score"].append(post.score)
     
    # Total number of comments inside the post
    posts_dict["Total Comments"].append(post.num_comments)
     
    # URL of each post
    posts_dict["Post URL"].append(post.url)
 
# Saving the data in a pandas dataframe
top_posts = pd.DataFrame(posts_dict)
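The same collection step can be written so that each post becomes one row, which keeps the six fields of a post together. This is only a sketch: it uses stand-in objects instead of real PRAW submissions so that it runs offline, and `FIELDS`, `posts_to_rows` and `fake_posts` are hypothetical names that are not part of PRAW.

```python
from types import SimpleNamespace

# Column name -> submission attribute, mirroring the dictionary built above
FIELDS = {
    "Title": "title",
    "Post Text": "selftext",
    "Flair": "link_flair_text",
    "Score": "score",
    "Total Comments": "num_comments",
    "Post URL": "url",
}

def posts_to_rows(posts):
    """Turn an iterable of submission-like objects into a list of row dicts."""
    return [
        {column: getattr(post, attr) for column, attr in FIELDS.items()}
        for post in posts
    ]

# Stand-in posts (assumption): real code would pass subreddit.top(...) instead
fake_posts = [
    SimpleNamespace(title="first", selftext="body text", link_flair_text=None,
                    score=3, num_comments=1, url="https://example.com/1"),
    SimpleNamespace(title="second", selftext="", link_flair_text="Discussion",
                    score=7, num_comments=0, url="https://example.com/2"),
]

rows = posts_to_rows(fake_posts)
print(rows[0]["Flair"])   # None: this post has no flair
print(rows[1]["Flair"])   # Discussion
```

Passing `rows` to `pd.DataFrame` would give a frame with the same six columns as `top_posts`.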

In [8]:
data = {'Text': top_posts['Title'] + ' ' + top_posts['Post Text'], 'Flair': top_posts['Flair']}
df = pd.DataFrame(data=data)
df.head()

Unnamed: 0,Text,Flair
0,hot take: we need to stop acting like asking s...,Rant / Rave
1,Is it weird my safe foods tend to be things li...,Discussion
2,is sweating like crazy related to ed? i ate a ...,Discussion
3,"Does anyone else have random ""fuck my ed"" meal...",Discussion
4,Fighting for my life on the toilet I am MORTIF...,oh no


In [6]:
df['Text'].tolist()[:5]

['hot take: we need to stop acting like asking someone if they have an ed is A Terrible Thing let me explain a little here\n\nthere is an obviously disordered youtuber. it honestly doesnt matter who she is and her videos are boring as fuck u wouldn\'t like them. the main point is that she is extremely uw and has very strict food rules and anyone who wouldn\'t think "hmmm this does not seem like a normal or healthy diet" when watching her videos literally does not have their brain turned on. \n\nbut if u suggest it in her comments, or even on the comments of a video specifically pointing out how fucked up her eating is, there is always someone who will respond "u dont know her, how dare u accuse her of something like that" \n\naccuse her of something like what? an eating disorder? that she exhibits many symptoms of? and that affects nearly 10% or 30 million americans in their lifetime? this is the terrible, unspeakable thing we cant possibly accuse her of? what the fuck? \n\nlike can u 

#pre-processing

I'll merge flairs that cover similar topics (every flair starting with "TW" or "Recovery") to reduce the number of categories.

In [None]:
#drop rows (posts) with no flairs
df = df.dropna(subset=['Flair'])

In [15]:
df['Flair'] = df['Flair'].apply(lambda x: re.sub(r'TW.*', 'TW', x))
df['Flair'] = df['Flair'].apply(lambda x: re.sub(r'Recovery.*', 'Recovery', x))
df['Flair'].value_counts(dropna=False)

Shitpost            258
Rant / Rave         189
Discussion          104
TW                   90
Recovery             67
oh no                63
Story Time           43
Victories            20
Shitpost / Memes     10
Image                 5
Meta                  5
Fatphobia             3
Food                  2
Harm Reduction        2
Family Vent           2
Educational           1
Mod Post              1
Substance abuse       1
Name: Flair, dtype: int64
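To see what the two `re.sub` calls do to individual flairs, here is a small sketch. The example flair strings are made up for illustration (the real variants on the subreddit may differ), and `merge_flair` is a hypothetical helper name.

```python
import re

def merge_flair(flair):
    """Collapse every 'TW...' and 'Recovery...' variant into one label each."""
    flair = re.sub(r'TW.*', 'TW', flair)
    flair = re.sub(r'Recovery.*', 'Recovery', flair)
    return flair

# Hypothetical flair strings, not taken from the scraped data
examples = ["TW: numbers", "Recovery Support", "Rant / Rave", "ED Recovery talk"]
merged = [merge_flair(f) for f in examples]
print(merged)   # ['TW', 'Recovery', 'Rant / Rave', 'ED Recovery']
```

The patterns are not anchored to the start of the string, so any text before "TW" or "Recovery" is kept, as the last example shows. Using `r'^TW.*'` would restrict the merge to flairs that begin with the keyword.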

In [None]:
#select 5 largest categories and save the dataset to csv file
categories = ['Discussion', 'Rant / Rave', 'TW', 'Shitpost', 'Recovery']
df = df[df['Flair'].isin(categories)]
df.to_csv('raw_data.csv')

In [12]:
df = pd.read_csv('raw_data.csv')

In [13]:
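# Build the stopword set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

def process(text):
  # Strip ASCII punctuation character by character
  nopunc = ''.join(char for char in text if char not in string.punctuation)
  # Lowercase each word and drop English stopwords
  clean = ' '.join(word.lower() for word in nopunc.split() if word.lower() not in stop_words)
  return clean

df['Text'] = df['Text'].map(process)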

df.head()

Unnamed: 0.1,Unnamed: 0,Text,Flair
0,0,raise hand personally victimized •the biggest ...,Shitpost
1,1,y’all need better boyfriends that’s that’s post,Rant / Rave
2,2,ancestors ate high calorie meals survive harsh...,Shitpost
3,4,skinny feeling pooping 3,Shitpost
4,5,people fasting community mfs going week withou...,Shitpost
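Two side effects of `process` are worth seeing on a small input. Punctuation is removed before the stopword check, so a contraction such as "don't" becomes "dont" and no longer matches a stopword entry. Also, `string.punctuation` only contains ASCII characters, so a typographic apostrophe is left in place. The sketch below uses a tiny hard-coded stopword set as a stand-in for the NLTK list (an assumption, so that it runs without downloads); `process_sketch` is a hypothetical name.

```python
import string

# Tiny stand-in stopword list (assumption); the notebook uses NLTK's English list
STOP_WORDS = {"i", "that", "is", "the", "don't"}

def process_sketch(text, stop_words=STOP_WORDS):
    """Same steps as process(): strip ASCII punctuation, lowercase, drop stopwords."""
    no_punct = ''.join(ch for ch in text if ch not in string.punctuation)
    return ' '.join(w.lower() for w in no_punct.split() if w.lower() not in stop_words)

# The straight apostrophe is stripped first, so "don't" turns into "dont" and is kept
print(process_sketch("I don't think that is fair!"))   # dont think fair

# The curly apostrophe (U+2019) is not in string.punctuation, so it survives
print(process_sketch("that\u2019s the post"))          # that’s post
```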


Split the data into training and test sets (80/20, stratified by flair)

In [14]:
X = df['Text']
y = df['Flair']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
print(len(X_train), len(X_test))

652 163


#single-layer neural network

In [None]:
def vectorize(text):
  # Average the spaCy token vectors into a single document vector
  doc = nlp(text)
  vec = [word.vector for word in doc]
  return torch.tensor(sum(vec) / len(vec))
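`vectorize` is mean pooling: the document vector is the element-wise average of its token vectors. Because it divides by the number of tokens, a post that ends up empty after pre-processing would raise a `ZeroDivisionError`. Below is a minimal sketch of the same averaging with plain lists standing in for spaCy vectors. The zero-vector fallback for empty input is an assumption for illustration, not something the notebook does, and `mean_pool` is a hypothetical name.

```python
def mean_pool(vectors, dim):
    """Element-wise mean of equal-length vectors; zeros if there are none."""
    if not vectors:
        # Assumed fallback: represent an empty document by the zero vector
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Two toy 3-dimensional "token vectors" (real en_core_web_md vectors have 300 dims)
token_vectors = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
print(mean_pool(token_vectors, dim=3))   # [2.0, 3.0, 4.0]
print(mean_pool([], dim=3))              # [0.0, 0.0, 0.0]
```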

In [None]:
X_train = torch.stack([vectorize(text) for text in X_train])
X_test = torch.stack([vectorize(text) for text in X_test])

mapping_dict = {'Discussion':0, 'Rant / Rave':1, 'TW':2, 'Shitpost':3, 'Recovery':4}

y_train = torch.LongTensor(y_train.map(mapping_dict).values)
y_test = torch.LongTensor(y_test.map(mapping_dict).values)

In [None]:
import torch.nn as nn

class SingleNN(nn.Module):
  def __init__(self, input_size, output_size):
    super().__init__()
    self.fc = nn.Linear(input_size, output_size, bias=False)
    nn.init.normal_(self.fc.weight, 0.0, 1.0)
  
  def forward(self,x):
    x = self.fc(x)
    return x

In [None]:
model = SingleNN(input_size=X_train.size()[1], output_size=5)
criterion = nn.CrossEntropyLoss()
model

SingleNN(
  (fc): Linear(in_features=300, out_features=5, bias=False)
)
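The model returns raw scores (logits) with no softmax layer because `nn.CrossEntropyLoss` applies log-softmax itself before taking the negative log-likelihood of the true class. The sketch below computes that loss for a single example in plain Python, as an illustration of the formula rather than PyTorch's implementation; `cross_entropy` is a hypothetical helper name.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax of the target class, computed stably."""
    m = max(logits)  # subtract the max before exponentiating to avoid overflow
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# A model that scores all 5 flairs equally has loss ln(5), whatever the true class
uniform_loss = cross_entropy([0.0] * 5, target=2)
print(round(uniform_loss, 4))   # 1.6094

# A confident, correct prediction gives a loss close to zero
confident_loss = cross_entropy([10.0, 0.0, 0.0, 0.0, 0.0], target=0)
print(confident_loss < 0.001)   # True
```

The value ln(5) ≈ 1.609 is a useful reference when reading the printed `train_loss`: a five-class model only beats uninformed guessing once its loss drops below it.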

In [None]:
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

In [None]:
epochs = 300
losses = []

for i in range(1, epochs + 1):
    y_pred = model(X_train)
    loss = criterion(y_pred, y_train)
    # Store a plain float so the computation graph is not kept alive
    losses.append(loss.item())

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # a neat trick to save screen space:
    if i%10 == 1:
        print(f'epoch: {i:3}  train_loss: {loss.item():10.8f}')

In [None]:
rows = len(y_test)
correct = 0

with torch.no_grad():
    y_val = model(X_test)

for i in range(rows):
    if y_val[i].argmax().item() == y_test[i]:
        correct += 1

print(f'\n{correct} out of {rows} = {100*correct/rows:.2f}% correct')


72 out of 163 = 44.17% correct


#linearSVC with count vectorizer

In [None]:
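# Note: X_train and y_train must be the raw text and flair labels from train_test_split.
# If the neural-network cells above were run, re-run the split cell first, because
# those cells overwrite these names with tensors.
cv = CountVectorizer()

X_train_cv = cv.fit_transform(X_train)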
X_train_cv.shape

(652, 5695)
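The shape `(652, 5695)` means 652 training posts by 5,695 distinct tokens: each row holds the number of times each vocabulary word occurs in one post. The sketch below is a hand-rolled stand-in for that counting step, not scikit-learn's implementation. It mimics the default behaviour of `CountVectorizer` (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); `bag_of_words` and the toy documents are hypothetical.

```python
import re
from collections import Counter

# Same pattern as CountVectorizer's default token_pattern: 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def bag_of_words(docs):
    """Return (vocabulary, dense count matrix) for a list of strings."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: j for j, tok in enumerate(vocabulary)}
    matrix = []
    for toks in tokenized:
        row = [0] * len(vocabulary)
        for tok, count in Counter(toks).items():
            row[index[tok]] = count
        matrix.append(row)
    return vocabulary, matrix

# Toy documents (assumption); the single-character token "a" is dropped
docs = ["safe foods safe meals", "a meals rant"]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)   # ['foods', 'meals', 'rant', 'safe']
print(matrix)       # [[1, 1, 0, 2], [0, 1, 1, 0]]
```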

In [None]:
clf = LinearSVC()
clf.fit(X_train_cv, y_train)

LinearSVC()

In [None]:
X_test_cv = cv.transform(X_test)

In [None]:
predictions = clf.predict(X_test_cv)

In [None]:
# Report the confusion matrix
print(metrics.confusion_matrix(y_test,predictions))
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

[[27  6  2  5  6]
 [11 17  0  8  6]
 [ 5  3  5  1  2]
 [ 6  9  0 10  4]
 [ 8  8  2  4  8]]
0.4110429447852761
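In scikit-learn's confusion matrix, rows are true classes and columns are predicted classes, so the diagonal holds the correct predictions. The short sketch below re-derives the accuracy from the matrix printed above and adds two figures that help put it in context: per-class recall and the accuracy of always predicting the largest test class. The classes are referred to by row index only, since the label order is not shown in the output.

```python
# Confusion matrix copied from the output above (rows: true, columns: predicted)
confusion = [
    [27, 6, 2, 5, 6],
    [11, 17, 0, 8, 6],
    [5, 3, 5, 1, 2],
    [6, 9, 0, 10, 4],
    [8, 8, 2, 4, 8],
]

total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(len(confusion))) / total

# Recall per class: correct predictions divided by the row total
recalls = [row[i] / sum(row) for i, row in enumerate(confusion)]

# Accuracy of a classifier that always predicts the most frequent true class
majority_baseline = max(sum(row) for row in confusion) / total

print(total)                                # 163
print(f"accuracy: {accuracy:.4f}")          # accuracy: 0.4110
print(f"baseline: {majority_baseline:.4f}") # baseline: 0.2822
print([round(r, 3) for r in recalls])
```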


#linearSVC with tf-idf vectorizer

In [16]:
clf_tfidf_lsvc = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC())])

# Feed the training data through the pipeline
clf_tfidf_lsvc.fit(X_train, y_train)

# Form a prediction set
predictions = clf_tfidf_lsvc.predict(X_test)
# Print the overall accuracy
print('{}% correct'.format(round(metrics.accuracy_score(y_test,predictions)*100, 2)))

49.08% correct
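Compared with raw counts, tf-idf down-weights words that appear in many posts. With its default settings, `TfidfVectorizer` multiplies each count by idf = ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df the number of documents containing the word, and then scales each row to unit length. The sketch below applies that formula to a toy count matrix in plain Python. It is meant to illustrate the weighting, not to reproduce the library code; `tfidf_rows` and the counts are hypothetical.

```python
import math

def tfidf_rows(count_rows):
    """Smoothed idf weighting followed by L2 row normalisation."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many rows does each term occur at least once?
    doc_freq = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
    weighted_rows = []
    for row in count_rows:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0  # guard all-zero rows
        weighted_rows.append([v / norm for v in weighted])
    return weighted_rows

# Toy counts (assumption); columns: foods, meals, rant, safe
counts = [[1, 1, 0, 2], [0, 1, 1, 0]]
weights = tfidf_rows(counts)

# "meals" occurs in both documents, so in the second row it weighs less than "rant"
print(weights[1][1] < weights[1][2])   # True
```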


Descriptive analysis: extract the 5 words most strongly associated with each flair, using tf-idf features and a chi-squared test

In [18]:
mapping_dict = {'Discussion':0, 'Rant / Rave':1, 'TW':2, 'Shitpost':3, 'Recovery':4}

y = y.map(mapping_dict).values

In [19]:
tfidf = TfidfVectorizer()
feat = tfidf.fit_transform(X).toarray()

In [22]:
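# Chi-squared test of each tf-idf feature against each flair (one-vs-rest)
N = 5    # Number of top words to list per flair
for f, i in sorted(mapping_dict.items()):
    chi2_feat = chi2(feat, y == i)
    indices = np.argsort(chi2_feat[0])
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available from 1.0)
    feat_names = np.array(tfidf.get_feature_names_out())[indices]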
    unigrams = [w for w in feat_names if len(w.split(' ')) == 1]
    print("\nFlair '{}':".format(f))
    print("Most correlated unigrams:\n\t. {}".format('\n\t. '.join(unigrams[-N:])))


Flair 'Discussion':
Most correlated unigrams:
	. mine
	. dae
	. subs
	. else
	. anyone

Flair 'Rant / Rave':
Most correlated unigrams:
	. nothing
	. unfair
	. thigh
	. gift
	. people

Flair 'Recovery':
Most correlated unigrams:
	. modeling
	. hope
	. scare
	. proud
	. merry

Flair 'Shitpost':
Most correlated unigrams:
	. increases
	. lmao
	. dump
	. toilet
	. lies

Flair 'TW':
Most correlated unigrams:
	. left
	. low
	. bmi
	. 600
	. thighs
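The ranking above comes from `chi2`, which returns a pair of arrays (statistics and p-values); the loop uses only the statistics. For each feature, the statistic compares the feature's observed total within each class against the total expected if the feature were spread across classes in proportion to class size. The sketch below is a plain-Python illustration of that computation on toy data, not scikit-learn's implementation; `chi2_scores`, `X_toy` and `y_toy` are hypothetical.

```python
def chi2_scores(X, y):
    """Chi-squared statistic per feature for non-negative features X and labels y."""
    n_samples = len(X)
    n_features = len(X[0])
    feature_totals = [sum(row[j] for row in X) for j in range(n_features)]
    scores = [0.0] * n_features
    for label in set(y):
        class_rows = [row for row, target in zip(X, y) if target == label]
        class_prob = len(class_rows) / n_samples
        for j in range(n_features):
            observed = sum(row[j] for row in class_rows)
            expected = class_prob * feature_totals[j]
            if expected > 0:  # skip features that never occur
                scores[j] += (observed - expected) ** 2 / expected
    return scores

# Toy data (assumption): feature 0 occurs only in the True class,
# feature 1 occurs equally often in both classes
X_toy = [[1, 1], [1, 1], [0, 1], [0, 1]]
y_toy = [True, True, False, False]
print(chi2_scores(X_toy, y_toy))   # [2.0, 0.0]
```

A high score only says that a word is unevenly distributed between a flair and the rest; it does not say in which direction, so a word can rank highly because it is unusually rare in that flair.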
