# Reddit scraping project
In this project, I scrape the subreddit [r/EDAnonymous](https://www.reddit.com/r/EDAnonymous/) and try different machine learning models to see whether they can predict the "flair" (tag) on each post.

### required packages

In [1]:
!pip install praw -q #official python reddit API wrapper
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.1.0/en_core_web_md-3.1.0-py3-none-any.whl (45.4 MB)
     |████████████████████████████████| 45.4 MB 3.0 MB/s            
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.1.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [2]:
import praw
import numpy as np
import pandas as pd
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import string
import spacy
import warnings
warnings.filterwarnings('ignore')

nlp = spacy.load("en_core_web_md")

import torch
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_selection import chi2

from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("reddit_client_id")
secret_value_1 = user_secrets.get_secret("reddit_client_secret")

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### scraping
You can download the scraped dataset [here](https://www.kaggle.com/matakahas/reddit-redanomymous-dataset).

In [3]:
# Read-only instance
reddit_read_only = praw.Reddit(client_id=secret_value_0, #your client id
                               client_secret=secret_value_1, #your client secret
                               user_agent="personal Reddit project by u/Comfortable-Watch-92")

In [4]:
subreddit = reddit_read_only.subreddit("EDAnonymous")

The code below scrapes up to 1000 of the top posts from the past month and stores the relevant fields in a dictionary.

In [5]:
sub_limit = 1000
posts = subreddit.top("month", limit=sub_limit)
 
posts_dict = {"Title": [], "Post Text": [],
              "Flair": [], "Score": [],
              "Total Comments": [], "Post URL": []
              }
 
for post in posts:
    # Title of each post
    posts_dict["Title"].append(post.title)
     
    # Text inside a post
    posts_dict["Post Text"].append(post.selftext)
     
    # Flair (tag) of each post
    posts_dict["Flair"].append(post.link_flair_text)
     
    # The score of a post
    posts_dict["Score"].append(post.score)
     
    # Total number of comments inside the post
    posts_dict["Total Comments"].append(post.num_comments)
     
    # URL of each post
    posts_dict["Post URL"].append(post.url)
 
# Saving the data in a pandas dataframe
top_posts = pd.DataFrame(posts_dict)

In [6]:
data = {'Text': top_posts['Title'] + ' ' + top_posts['Post Text'], 'Flair': top_posts['Flair']}
df = pd.DataFrame(data=data)
df.head()

| | Text | Flair |
|---|---|---|
| 0 | Raise your hand if you have been personally vi... | Shitpost |
| 1 | y’all need better boyfriends. that’s it. that’... | Rant / Rave |
| 2 | My ancestors ate high calorie meals to survive... | Shitpost |
| 3 | Can people stop commenting competitive stuff o... | Meta |
| 4 | Showing your pain by losing weight ? Does anyo... | TW |


In [7]:
df['Text'].tolist()[:5]

["Raise your hand if you have been personally victimized by: •The biggest loser\n•Red band society \n•Skins\n•2013 Tumblr\n•the late 90's/early 00's\n•peanut butter\n•the doctor's office\n•middle school health class\n•that one period that your mum did Atkins\n•that one time everyone was calling Britney spears fat\n•kpop\n•Asian sizes\n•One size stores\n•your younger sister\n•low rise jeans\n•that Random aunt pinching your chicks\nAnd/or •freelee the banana girl",
 'y’all need better boyfriends. that’s it. that’s the post.',
 'My ancestors ate high calorie meals to survive through harsh winters and pass their genes onto me and here I am, eating veggies and hovering on the radiator They would NOT be impressed. Or proud.',
 'Can people stop commenting competitive stuff on recovery wins and struggle posts? just saw someone celebrating their favourite pizza place reopening and they spoke about eating pizza with essentially no fucks given. thats great! why the hell is there someone in the co

### pre-processing
I'll merge flairs that cover similar concepts in order to reduce the number of categories. I will also remove punctuation and stopwords and convert the text to lower case.

In [8]:
#drop rows (posts) with no flairs
df = df.dropna(subset=['Flair'])

In [9]:
df['Flair'] = df['Flair'].apply(lambda x: re.sub(r'TW.*', 'TW', x))
df['Flair'] = df['Flair'].apply(lambda x: re.sub(r'Recovery.*', 'Recovery', x))
df['Flair'].value_counts(dropna=False)

Discussion         191
Rant / Rave        188
TW                 170
Shitpost            85
oh no               83
Recovery            75
Food                39
Story Time          24
Family Vent         16
Harm Reduction      15
Educational          6
Fatphobia            3
Substance abuse      3
Meta                 1
question             1
Story Time - TW      1
Name: Flair, dtype: int64
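The two substitutions above are unanchored, so they replace everything from the first `TW` (or `Recovery`) to the end of the string and leave any text before it untouched. That is why `Story Time - TW` survives as its own category in the counts. A minimal sketch of this behaviour follows; `merge_flair` is a helper name introduced here for illustration, and `TW: numbers` and `Recovery Win` are made-up flair strings.

```python
import re

def merge_flair(flair):
    """Collapse 'TW...' and 'Recovery...' flairs the same way the cell above does."""
    flair = re.sub(r'TW.*', 'TW', flair)
    flair = re.sub(r'Recovery.*', 'Recovery', flair)
    return flair

# "TW: numbers" and "Recovery Win" are hypothetical examples;
# "Story Time - TW" is taken from the value counts above.
for raw in ["TW: numbers", "Recovery Win", "Story Time - TW"]:
    print(f"{raw!r} -> {merge_flair(raw)!r}")
```

If flairs that merely end in `TW` should also be merged, a pattern such as `r'.*TW.*'` would do that, at the cost of losing the "Story Time" part.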

In [10]:
#keep five of the largest categories ('oh no' is left out) and save the dataset to a csv file
categories = ['Discussion', 'Rant / Rave', 'TW', 'Shitpost', 'Recovery']
df = df.loc[df['Flair'].isin(categories)]
df.to_csv('raw_data.csv')

In [11]:
df = pd.read_csv('raw_data.csv') #reload the dataset from the csv file

In [12]:
stop_words = set(stopwords.words('english')) #build the stopword set once, not once per word

def process(text):
  #strip ASCII punctuation, then lowercase and drop English stopwords
  nopunc = ''.join(char for char in text if char not in string.punctuation)
  clean = ' '.join(word.lower() for word in nopunc.split() if word.lower() not in stop_words)
  return clean

df['Text'] = df['Text'].map(process)

df.head()

| | Unnamed: 0 | Text | Flair |
|---|---|---|---|
| 0 | 0 | raise hand personally victimized •the biggest ... | Shitpost |
| 1 | 1 | y’all need better boyfriends that’s that’s post | Rant / Rave |
| 2 | 2 | ancestors ate high calorie meals survive harsh... | Shitpost |
| 3 | 4 | showing pain losing weight anyone else feel li... | TW |
| 4 | 5 | guy i’m seeing told likes i’m “not skinny fat”... | Rant / Rave |
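The cleaned text above still contains characters such as `•`, `’` and `“`, because `string.punctuation` only covers ASCII punctuation. One possible way to remove them as well is to check each character's Unicode category. This is only a sketch: `strip_all_punct` is a name introduced here and is not part of the notebook.

```python
import string
import unicodedata

def strip_all_punct(text):
    """Drop ASCII punctuation plus any character in a Unicode 'P*' (punctuation) category."""
    return ''.join(
        ch for ch in text
        if ch not in string.punctuation
        and not unicodedata.category(ch).startswith('P')
    )

print(strip_all_punct("y’all need •the “not skinny fat” post."))
```

Note that stripping `’` turns `i’m` into `im`, which the NLTK stopword list may not contain, so the stopword step might need to run before the punctuation step.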


Split the data into training and test sets (80/20, stratified by flair).

In [13]:
X = df['Text']
y = df['Flair']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
print(len(X_train), len(X_test))

567 142


## single-layer neural network

In [14]:
def vectorize(text):
  #doc.vector is the average of the token vectors and is all zeros for an empty text,
  #which avoids the division by zero in sum(vec) / len(vec)
  doc = nlp(text)
  return torch.tensor(doc.vector)
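The averaging step in `vectorize` turns a post of any length into one fixed-size vector by taking the element-wise mean of its word vectors. A toy sketch in plain Python, using made-up 3-dimensional vectors instead of the 300-dimensional spaCy ones (`mean_pool` and `toy_vectors` are illustrative names):

```python
def mean_pool(vectors):
    """Average equal-length vectors element-wise; return [] when there are none."""
    if not vectors:
        return []
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]

# Hypothetical 3-dimensional "word vectors" (the real ones have 300 dimensions).
toy_vectors = {"pizza": [1.0, 0.0, 2.0], "recovery": [3.0, 2.0, 0.0]}
doc_vector = mean_pool([toy_vectors[w] for w in ["pizza", "recovery"]])
print(doc_vector)  # [2.0, 1.0, 1.0]
```

Mean pooling discards word order, so two posts with the same words in a different order get the same vector.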

In [15]:
X_train_tensor = torch.stack([vectorize(text) for text in X_train])
X_test_tensor = torch.stack([vectorize(text) for text in X_test])

mapping_dict = {'Discussion':0, 'Rant / Rave':1, 'TW':2, 'Shitpost':3, 'Recovery':4}

y_train_tensor = torch.LongTensor(y_train.map(mapping_dict).values)
y_test_tensor = torch.LongTensor(y_test.map(mapping_dict).values)

In [16]:
import torch.nn as nn

class SingleNN(nn.Module):
  def __init__(self, input_size, output_size):
    super().__init__()
    self.fc = nn.Linear(input_size, output_size, bias=False)
    nn.init.normal_(self.fc.weight, 0.0, 1.0)
  
  def forward(self,x):
    x = self.fc(x)
    return x

In [17]:
model = SingleNN(input_size=X_train_tensor.size()[1], output_size=5)
criterion = nn.CrossEntropyLoss()
model

SingleNN(
  (fc): Linear(in_features=300, out_features=5, bias=False)
)

In [18]:
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

In [19]:
epochs = 300
losses = []

for i in range(1, epochs + 1):
    y_pred = model(X_train_tensor)
    loss = criterion(y_pred, y_train_tensor)
    losses.append(loss.item())  # store the float, not the tensor and its graph

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # print only every 10th epoch to save screen space
    if i%10 == 1:
        print(f'epoch: {i:3}  train_loss: {loss.item():10.8f}')

epoch:   1  train_loss: 3.61970258
epoch:  11  train_loss: 2.26477766
epoch:  21  train_loss: 1.80710912
epoch:  31  train_loss: 1.70539975
epoch:  41  train_loss: 1.59294772
epoch:  51  train_loss: 1.50916052
epoch:  61  train_loss: 1.44941473
epoch:  71  train_loss: 1.39656925
epoch:  81  train_loss: 1.35267305
epoch:  91  train_loss: 1.31429005
epoch: 101  train_loss: 1.27972722
epoch: 111  train_loss: 1.24827802
epoch: 121  train_loss: 1.21936977
epoch: 131  train_loss: 1.19258642
epoch: 141  train_loss: 1.16764581
epoch: 151  train_loss: 1.14434087
epoch: 161  train_loss: 1.12250710
epoch: 171  train_loss: 1.10200500
epoch: 181  train_loss: 1.08271110
epoch: 191  train_loss: 1.06451404
epoch: 201  train_loss: 1.04731274
epoch: 211  train_loss: 1.03101528
epoch: 221  train_loss: 1.01553893
epoch: 231  train_loss: 1.00080991
epoch: 241  train_loss: 0.98676229
epoch: 251  train_loss: 0.97333741
epoch: 261  train_loss: 0.96048290
epoch: 271  train_loss: 0.94815242
epoch: 281  train_lo
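A useful reference point for these numbers is the loss of a model that assigns equal probability to all five flairs, which is ln(5) ≈ 1.609. The first epoch starts well above that because the weights are drawn from N(0, 1), and the training loss only drops below it between epochs 31 and 41. The sketch below computes the cross-entropy for a single example in plain Python; `cross_entropy` is a helper written for illustration, not the PyTorch implementation.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one example: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Equal logits mean a uniform prediction over the 5 flairs.
chance_loss = cross_entropy([0.0] * 5, target=2)
print(f"{chance_loss:.4f}")  # 1.6094, i.e. ln(5)
```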

In [20]:
rows = len(y_test_tensor)
correct = 0

with torch.no_grad():
    y_val = model(X_test_tensor)

for i in range(rows):
    if y_val[i].argmax().item() == y_test_tensor[i]:
        correct += 1

print(f'\n{correct} out of {rows} = {100*correct/rows:.2f}% correct')


63 out of 142 = 44.37% correct


## LinearSVC with count vectorizer

In [21]:
cv = CountVectorizer()

X_train_cv = cv.fit_transform(X_train)
X_train_cv.shape

(567, 5418)
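The shape above means 567 training posts by 5418 distinct vocabulary terms, with each cell holding how often a term occurs in a post. A simplified bag-of-words sketch in plain Python shows the idea; the two example sentences are made up, and unlike `CountVectorizer` this version does no lowercasing or token filtering.

```python
from collections import Counter

# Two hypothetical, already-cleaned posts.
docs = ["need better boyfriends", "better recovery better days"]

# The vocabulary is the sorted set of all words; each one becomes a column.
vocab = sorted({word for doc in docs for word in doc.split()})

# Each row counts how often every vocabulary word occurs in one post.
counts = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

print(vocab)
for row in counts:
    print(row)
```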

In [22]:
clf = LinearSVC()
clf.fit(X_train_cv, y_train)

LinearSVC()

In [23]:
X_test_cv = cv.transform(X_test)

In [24]:
predictions = clf.predict(X_test_cv)

In [25]:
# Report the confusion matrix
print(metrics.confusion_matrix(y_test,predictions))
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

[[18 10  4  1  5]
 [13 11  4  2  8]
 [ 5  3  4  1  2]
 [ 5  3  2  4  3]
 [ 8 10  4  3  9]]
0.323943661971831
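In the confusion matrix, rows are the true flairs and columns are the predicted ones. The sketch below assumes the rows follow alphabetical label order, which is how scikit-learn orders them when no `labels` argument is passed and which is consistent with the row sums of the stratified test set. It recomputes the accuracy and adds the per-flair recall.

```python
# Assumed row/column order: labels sorted alphabetically.
labels = ["Discussion", "Rant / Rave", "Recovery", "Shitpost", "TW"]
matrix = [
    [18, 10, 4, 1, 5],
    [13, 11, 4, 2, 8],
    [5, 3, 4, 1, 2],
    [5, 3, 2, 4, 3],
    [8, 10, 4, 3, 9],
]

# Recall per flair: correct predictions (diagonal) divided by the row total.
recall = {label: row[i] / sum(row) for i, (label, row) in enumerate(zip(labels, matrix))}

# Accuracy: sum of the diagonal divided by the number of test posts.
accuracy = sum(matrix[i][i] for i in range(len(matrix))) / sum(map(sum, matrix))

for label, value in recall.items():
    print(f"{label:12} recall: {value:.2f}")
print(f"accuracy: {accuracy:.4f}")
```

Under that assumption, "Discussion" is the only flair recovered close to half of the time, and most errors are confusions between "Discussion" and "Rant / Rave".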


## LinearSVC with tf-idf vectorizer

In [26]:
clf_tfidf_lsvc = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC())])

# Feed the training data through the pipeline
clf_tfidf_lsvc.fit(X_train, y_train)

# Form a prediction set
predictions = clf_tfidf_lsvc.predict(X_test)
# Print the overall accuracy
print('{}% correct'.format(round(metrics.accuracy_score(y_test,predictions)*100, 2)))

40.85% correct
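To put the three accuracies in context, it helps to compare them with two trivial baselines: always predicting the most frequent flair, and guessing uniformly at random. The per-flair test counts below are the row sums of the confusion matrix printed earlier, under the assumption that its rows follow alphabetical label order.

```python
# Test-set size per flair (row sums of the confusion matrix, assumed alphabetical order).
test_counts = {"Discussion": 38, "Rant / Rave": 38, "Recovery": 15, "Shitpost": 17, "TW": 34}
total = sum(test_counts.values())

majority_baseline = max(test_counts.values()) / total  # always predict the largest class
uniform_baseline = 1 / len(test_counts)                # guess one of 5 flairs at random

# Accuracies reported in the sections above, in percent.
reported = {
    "single-layer NN": 44.37,
    "LinearSVC + counts": 32.39,
    "LinearSVC + tf-idf": 40.85,
}

print(f"majority baseline: {100 * majority_baseline:.2f}%")
print(f"uniform baseline:  {100 * uniform_baseline:.2f}%")
for name, acc in reported.items():
    print(f"{name}: {acc - 100 * majority_baseline:+.2f} points vs. majority baseline")
```

All three models beat both baselines, but the margin for the count-vectorizer model is small.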


**Descriptive**: extract the 5 words most correlated with each flair, using a tf-idf vectorizer and a chi-squared test (based on [this post](https://towardsdatascience.com/predicting-reddit-flairs-using-machine-learning-and-deploying-the-model-using-heroku-part-2-d681e397f258))

In [27]:
mapping_dict = {'Discussion':0, 'Rant / Rave':1, 'TW':2, 'Shitpost':3, 'Recovery':4}

y = y.map(mapping_dict).values

In [28]:
tfidf = TfidfVectorizer()
feat = tfidf.fit_transform(X).toarray()

In [29]:
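# chi-squared statistical test
N = 5    # Number of words to list per flair
# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0, use get_feature_names() instead
all_feat_names = np.array(tfidf.get_feature_names_out())
for f, i in sorted(mapping_dict.items()):
    chi2_feat = chi2(feat, y == i)      # one-vs-rest: this flair against all others
    indices = np.argsort(chi2_feat[0])  # ascending, so the strongest words come last
    feat_names = all_feat_names[indices]
    unigrams = [w for w in feat_names if len(w.split(' ')) == 1]
    print("\nFlair '{}':".format(f))
    print("Most correlated unigrams:\n\t. {}".format('\n\t. '.join(unigrams[-N:])))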


Flair 'Discussion':
Most correlated unigrams:
	. oatmeal
	. dae
	. hair
	. else
	. anyone

Flair 'Rant / Rave':
Most correlated unigrams:
	. waste
	. videos
	. fucking
	. finals
	. sake

Flair 'Recovery':
Most correlated unigrams:
	. imma
	. existing
	. step
	. scales
	. recovery

Flair 'Shitpost':
Most correlated unigrams:
	. hits
	. rip
	. cats
	. lady
	. licorice

Flair 'TW':
Most correlated unigrams:
	. lbs
	. pregnant
	. tw
	. pounds
	. purge
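The ranking above comes from a chi-squared score per word. For each class, the score compares the observed total weight of the word in that class with the weight expected if the word were spread evenly across classes. The plain-Python sketch below follows that formula on a toy count matrix. It mirrors, to my understanding, what `sklearn.feature_selection.chi2` computes, but it is an illustration rather than the library code, and `chi2_scores` is a name introduced here.

```python
def chi2_scores(X, y):
    """Chi-squared score per feature for non-negative features X and class labels y."""
    n_samples, n_features = len(X), len(X[0])
    feature_total = [sum(row[j] for row in X) for j in range(n_features)]
    scores = [0.0] * n_features
    for c in sorted(set(y)):
        class_prob = sum(1 for label in y if label == c) / n_samples
        for j in range(n_features):
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = class_prob * feature_total[j]
            if expected > 0:  # skip features that never occur
                scores[j] += (observed - expected) ** 2 / expected
    return scores

# Toy data: word 0 appears only in the target flair, word 1 appears everywhere.
X_toy = [[1, 1], [1, 1], [0, 1], [0, 1]]
y_toy = [True, True, False, False]  # True = post has the target flair
print(chi2_scores(X_toy, y_toy))  # [2.0, 0.0]
```

A high score only says the word is unevenly distributed. It does not say in which direction, so a word can rank highly for a flair because it is unusually rare there.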


## to-dos
Next, I'd like to collect more data (comments as well as post titles and text) and explore other algorithms to improve the predictions.