### Read Raw Data from CSVs

The data is split by subreddit: each CSV file holds roughly 1,000 posts from
one subreddit. We load each file into a DataFrame and store it under the name
of the subreddit it came from.

In [79]:
import pandas as pd

subreddits = {}
# The list file has one subreddit name per line; utf-8-sig strips a leading BOM
with open('archive/50_subreddits_list.csv', mode='r', encoding='utf-8-sig') as subreddit_file:
  for line in subreddit_file:
    subreddit = line.strip().lower()
    if not subreddit:
      continue # Skip blank lines
    df = pd.read_csv(f'archive/{subreddit}.csv')
    subreddits[subreddit] = df

### Get Title and Text from DataFrames

Our classifier uses only the title and main body of Reddit posts, so we extract
those two fields here. Subreddits have different rules about what goes in the
title versus the body, so the model does not distinguish between them: we
concatenate both into a single string per post.

In [80]:
subreddit_bodies = {}
for subreddit in subreddits:
  df = subreddits[subreddit]
  df['title'] = df['title'].fillna('')
  df['body'] = df['body'].fillna('')
  subreddit_bodies[subreddit] = df['title'] + " " + df['body'] 

subreddit_bodies['anime']

0      'Dragon Ball' Creator Akira Toryiyama Has Pass...
1        Kaguya-sama: Love Is War - Season 3 announced! 
2                         Aqua in yoga pants | Konosuba 
3                     This is not a Cigarette [Gintama] 
4         The Devil is a Part-Timer Season 2 Announced! 
                             ...                        
991    Mob Psycho 100 Season 2 - Episode 5 discussion...
992    Never thought in a million years I’d come acro...
993    I seriously hate Sundays [Engaged to the Unide...
994                "Spy Classroom" New Character Visual 
995    Hayasaka Ai in Spy Suit from "Kaguya: Love is ...
Length: 996, dtype: object

### Remove Links
Posts in our dataset often contain links, which are unlikely to be informative
when classifying posts. We remove them here using a regular expression.

In [93]:
import re # Regex Library

def removeLinks(txt):
  # Raw string so that \s reaches the regex engine as written
  return re.sub(r"(www|http:|https:)+[^\s]+", "", txt)

no_links = {}
for subreddit in subreddit_bodies:
  data = subreddit_bodies[subreddit]
  no_links[subreddit] = data.map(lambda txt: removeLinks(txt))

no_links['travel']

0      I visited North Korea recently, these are some...
1      Taken with a phone out of my hotel window in V...
2      Taking a ride on the Bernina Express through t...
3      Wife and I hate big social events and love tra...
4      The exact moment I took a step too close to th...
                             ...                        
992    Sisteron- France. Beautiful place we had a cof...
993    Croatia, probably the most beautiful country i...
994    Michelangelo's David is great, but pieta is on...
995    If you don't mind a little dust and grit and y...
996    I’d never realized how beautiful Montenegro wa...
Length: 997, dtype: object
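
As a quick check of what the pattern removes, the sketch below applies the same substitution to a made-up post. The sample string is hypothetical (not taken from the dataset), and the function is restated under a different name so the block runs on its own.

```python
import re

def remove_links(txt):
    # Same pattern as removeLinks above: a "www", "http:" or "https:" prefix
    # followed by every non-whitespace character up to the next space.
    return re.sub(r"(www|http:|https:)+[^\s]+", "", txt)

# Hypothetical post text, not taken from the dataset
sample = "Photos here: https://example.com/album and www.example.org/map enjoy"
cleaned = remove_links(sample)
print(repr(cleaned))  # 'Photos here:  and  enjoy'
```

The doubled spaces left behind are harmless here, since the tokenization step that follows discards whitespace anyway.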

### Extracting Words

Since Reddit posts are prose with plenty of punctuation, we split each post
into word tokens so the model focuses on the words rather than the
punctuation (e.g. 'wow!' should be treated the same as 'wow').

We also normalize words to limit variation that carries no meaning:
- Turning each word to lowercase (e.g. 'WOW' should be treated the same as 'wow')
- Replacing the special Unicode character ’ with ' since a number of posts include it (e.g. "I’d" vs "I'd")

In [94]:
import nltk

tokenizer = nltk.RegexpTokenizer(
  pattern=r"[\w']+", # Only match words as tokens (coarsely, \w + apostrophes)
  gaps=False,
  discard_empty=True # Remove empty tokens caused by markdown content
)

def tokenize(txt):
  return tokenizer.tokenize(txt.replace("’", "'").lower())

tokenized_subreddits = {}
for subreddit in no_links:
  data = no_links[subreddit]
  tokenized_subreddits[subreddit] = data.map(lambda txt: tokenize(txt)) 

tokenized_subreddits['history']

0      [new, discovery, mode, turns, video, game, ass...
1      [we, are, not, here, to, help, you, with, your...
2      [a, 1776, excerpt, from, john, adam's, diary, ...
3      [famous, viking, warrior, burial, revealed, to...
4      [3, 000, year, old, underwater, castle, discov...
                             ...                        
987    [dna, study, has, now, provided, support, for,...
988    [stonehenge, megalith, came, from, scotland, n...
989    [french, resistance, man, breaks, silence, ove...
990    [holy, grail, of, shipwrecks', to, be, raised,...
991    [emily, wilson's, new, translation, of, the, i...
Length: 992, dtype: object
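
To illustrate the normalization rules on a single string, here is a minimal sketch that uses `re.findall` with the same pattern as a stand-in for `nltk.RegexpTokenizer`. This assumes that with `gaps=False` the tokenizer simply returns the pattern's matches; the stand-in avoids the NLTK dependency. The sample sentence and the name `tokenize_sketch` are hypothetical.

```python
import re

TOKEN_PATTERN = re.compile(r"[\w']+")

def tokenize_sketch(txt):
    # Normalize the curly apostrophe, lowercase the text, then keep runs of
    # word characters and apostrophes as tokens.
    return TOKEN_PATTERN.findall(txt.replace("’", "'").lower())

tokens = tokenize_sketch("WOW! I’d never seen a 3,000-year-old castle.")
print(tokens)
# ['wow', "i'd", 'never', 'seen', 'a', '3', '000', 'year', 'old', 'castle']
```

Note that numbers with thousands separators are split at the comma, which matches the `3, 000` tokens visible in the history output above.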

In [122]:
# Bring tokens back together as data
def joinTokens(tokens):
  return ' '.join(tokens)

subreddit_classes = {}
subreddit_df = pd.DataFrame(columns=['text', 'subreddit'])

# We label classes incrementing from 0 so class information can later be
# determined by indexing into arrays
next_class = 0
for subreddit in tokenized_subreddits:
  subreddit_classes[next_class] = subreddit
  data = pd.DataFrame()
  data['text'] = tokenized_subreddits[subreddit].map(lambda tokens: joinTokens(tokens))
  data['subreddit'] = next_class
  subreddit_df = pd.concat([subreddit_df, data], ignore_index=True)
  next_class += 1

subreddit_df

{0: 'funny', 1: 'askreddit', 2: 'gaming', 3: 'aww', 4: 'music', 5: 'pics', 6: 'science', 7: 'worldnews', 8: 'movies', 9: 'todayilearned', 10: 'videos', 11: 'news', 12: 'showerthoughts', 13: 'earthporn', 14: 'gifs', 15: 'jokes', 16: 'mildlyinteresting', 17: 'iama', 18: 'books', 19: 'lifeprotips', 20: 'diy', 21: 'sports', 22: 'nottheonion', 23: 'food', 24: 'explainlikeimfive', 25: 'space', 26: 'history', 27: 'art', 28: 'internetisbeautiful', 29: 'documentaries', 30: 'tifu', 31: 'askscience', 32: 'dataisbeautiful', 33: 'upliftingnews', 34: 'futurology', 35: 'writingprompts', 36: 'getmotivated', 37: 'oldschoolcool', 38: 'creepy', 39: 'listentothis', 40: 'technology', 41: 'personalfinance', 42: 'interestingasfuck', 43: 'wholesomememes', 44: 'relationship_advice', 45: 'memes', 46: 'travel', 47: 'television', 48: 'anime', 49: 'nostupidquestions'}


| | text | subreddit |
|---|---|---|
| 0 | my cab driver tonight was so excited to share ... | 0 |
| 1 | guardians of the front page | 0 |
| 2 | gas station worker takes precautionary measure... | 0 |
| 3 | the conversation my son and i will have on chr... | 0 |
| 4 | the denver broncos have the entire town of sou... | 0 |
| ... | ... | ... |
| 49261 | how come no one has invented a foot pedal for ... | 49 |
| 49262 | why are trans people talked about so much desp... | 49 |
| 49263 | did your penis ever fall asleep like your legs... | 49 |
| 49264 | someone stole my bike i tracked it to its loca... | 49 |


### Create Train-Test Split

In [None]:
from sklearn.model_selection import train_test_split

X = subreddit_df['text']
y = subreddit_df['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True)
X_train = X_train.astype('str')
X_test = X_test.astype('str')
y_train = y_train.astype('int')
y_test = y_test.astype('int')

### Define Pipeline

scikit-learn lets you chain preprocessing steps and a model into a single
pipeline, as is done here. The steps our model takes are:

- CountVectorizer: Takes the preprocessed post strings and counts word occurrences
- TfidfTransformer: Reweights those counts so that words appearing in many posts contribute less than words concentrated in a few
- SGDClassifier: A linear classifier trained with Stochastic Gradient Descent to predict the subreddit

This pipeline is inspired by a scikit-learn tutorial on text classification at [https://scikit-learn.org/1.4/tutorial/text_analytics/working_with_text_data.html](https://scikit-learn.org/1.4/tutorial/text_analytics/working_with_text_data.html)
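
To make the first two steps concrete, here is a pure-Python worked example on a toy corpus. It is a sketch, not the library's implementation: it assumes whitespace splitting in place of CountVectorizer's tokenizer, and uses the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is what scikit-learn documents as TfidfTransformer's default. The three posts are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, already-preprocessed posts
posts = ["cute cat photo", "cute dog photo", "rocket launch photo"]

# Step 1 (CountVectorizer-like): per-post word counts
counts = [Counter(post.split()) for post in posts]
vocab = sorted({word for count in counts for word in count})

# Step 2 (TfidfTransformer-like): smoothed inverse document frequency
n_posts = len(posts)
doc_freq = {w: sum(1 for count in counts if w in count) for w in vocab}
idf = {w: math.log((1 + n_posts) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf(count):
    # Weight each count by idf, then scale the vector to unit L2 length
    raw = {w: count[w] * idf[w] for w in count}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

weights = tfidf(counts[2])
for word, weight in sorted(weights.items()):
    print(f"{word}: {weight:.3f}")
```

In the third post, "photo" appears in every post and so receives a lower weight than "rocket" or "launch", which appear in only one. That is the effect the TF-IDF step is meant to have before the classifier sees the features.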

In [97]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

reddit_classifer = Pipeline([
  ('cv', CountVectorizer()),
  ('tfidf', TfidfTransformer()),
  ('sgd', SGDClassifier()),
])

### Grid Search

Each step in the pipeline has hyperparameters that control how it behaves.
scikit-learn's GridSearchCV tries every combination of the candidate values,
scores each with cross-validation, and reports the combination with the highest
mean accuracy. We chose to search over:

- The size of "word" features to consider (single words only, or single words plus two-word phrases)
- The minimum number of posts a word must appear in to be kept as a feature (since particularly uncommon words may not be insightful)
- The maximum number of features to keep, ranked by frequency (since the rarest features may not be insightful)
- SGD loss function (either Log Loss or Modified Huber, since these are the only two SGDClassifier losses that provide the class probabilities we show)

Note: As discussed in class, accuracy is not always the best measure of model
efficacy. Here we felt the default was sufficient: with 50 roughly equally
represented classes, a dummy classifier cannot do much better than ~2% accuracy
(1 in 50).
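
To get a feel for the cost of the search, the sketch below enumerates the grid with `itertools.product`. The candidate values mirror the ones passed to GridSearchCV in this notebook; the fold count of 5 is an assumption based on scikit-learn's default when no `cv` argument is given.

```python
from itertools import product

param_grid = {
    'cv__ngram_range': [(1, 1), (1, 2)],
    'cv__min_df': [1, 3, 5],
    'cv__max_features': [10000, 100000, None],
    'sgd__loss': ['log_loss', 'modified_huber'],
}

# Every combination of one value per parameter
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 5  # Assumption: GridSearchCV's default when cv is not specified
print(len(combinations))            # 36 candidate settings
print(len(combinations) * n_folds)  # 180 cross-validation fits
```

With this many fits, searching on a subset of the training data is a reasonable way to keep the run time manageable.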

In [104]:
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(reddit_classifer, {
  'cv__ngram_range': [(1, 1), (1, 2)],
  'cv__min_df': [1, 3, 5],
  'cv__max_features': [10000, 100000, None],
  'sgd__loss': ['log_loss', 'modified_huber']
}, n_jobs=8)
grid_search.fit(X_train[:10000], y_train[:10000]) # Subset of Data for efficiency

print(grid_search.best_score_)
grid_search.best_params_

0.5862


{'cv__max_df': 0.5,
 'cv__min_df': 1,
 'cv__ngram_range': (1, 2),
 'sgd__loss': 'modified_huber'}

### Train Model

Using the best parameters found by the grid search above, we retrain the pipeline on the full training set.

In [107]:
reddit_classifer = Pipeline([
  ('cv', CountVectorizer(ngram_range=(1,2))),
  ('tfidf', TfidfTransformer()),
  ('sgd', SGDClassifier(loss='modified_huber')),
])
reddit_classifer.fit(X_train, y_train)

### Model Metrics

In [108]:
import numpy as np

y_pred = reddit_classifer.predict(X_test)
"Accuracy: " + str(np.mean(y_pred == y_test))

'Accuracy: 0.6611187789234392'
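
A single accuracy figure can hide subreddits that the model rarely gets right. As one possible follow-up, the sketch below computes per-class recall (the fraction of each class's posts that were predicted correctly). The helper name `per_class_recall` and the toy labels are hypothetical; in this notebook the inputs would be `y_test` and `y_pred`.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    # Count posts per true class, and how many of those were predicted correctly
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in sorted(totals)}

# Hypothetical labels for three classes
y_true_toy = [0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred_toy = [0, 0, 1, 1, 1, 2, 0, 2, 2]

recalls = per_class_recall(y_true_toy, y_pred_toy)
for label, recall in recalls.items():
    print(f"class {label}: {recall:.2f}")
```

Mapping each label through `subreddit_classes` would show which subreddits account for most of the errors.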

### Prediction Helper Functions

Since we preprocessed our post data before classification, we need to apply the
same preprocessing to any incoming post.

In [130]:
def preproccessPost(title, text):
  noLinks = removeLinks(title + " " + text)
  tokens = tokenize(noLinks)
  return joinTokens(tokens)

def predictPrompt(title, text=""):
  post = preproccessPost(title, text)
  predicted_class = reddit_classifer.predict([post])[0]
  # Class labels start at 0 and are the keys of subreddit_classes, so index directly
  return subreddit_classes[predicted_class]

def predictPromptProbabilities(title, text="", threshold=0.0):
  post = preproccessPost(title, text)
  prediction = {}
  predictions = reddit_classifer.predict_proba([post])[0]
  for subreddit_class in subreddit_classes:
    if predictions[subreddit_class] > threshold:
      subreddit = subreddit_classes[subreddit_class]
      prediction[subreddit] = predictions[subreddit_class]
  return prediction

print(predictPrompt("My gf broke up with me, what should I do?"))
predictPromptProbabilities("My gf broke up with me, what should I do?", "", 0.05)

relationship_advice


{'funny': 0.11195594792986989,
 'askreddit': 0.17723183877180526,
 'art': 0.06273888876347852,
 'personalfinance': 0.14012097604544976,
 'relationship_advice': 0.30201504562243203}
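
The thresholding step can be illustrated in isolation. The sketch below is a hypothetical variation, not the helper used above: it filters a probability vector by threshold and also sorts the result from most to least likely. It assumes that position `i` in the vector corresponds to class label `i`, as is the case when labels are the integers 0 to n-1. The names and probabilities are made up.

```python
def filter_by_threshold(probabilities, class_names, threshold=0.0):
    # Keep only classes whose probability exceeds the threshold
    kept = {class_names[i]: p for i, p in enumerate(probabilities) if p > threshold}
    # Order from most to least likely
    return dict(sorted(kept.items(), key=lambda item: item[1], reverse=True))

# Hypothetical class names and probabilities
toy_names = {0: 'funny', 1: 'askreddit', 2: 'travel'}
toy_probs = [0.04, 0.26, 0.70]

result = filter_by_threshold(toy_probs, toy_names, threshold=0.05)
print(result)  # {'travel': 0.7, 'askreddit': 0.26}
```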

### Flask Server

We serve the model and transparency results on a simple Flask server for our frontend.

In [131]:
from flask import Flask, request, jsonify
from flask_cors import CORS

app = Flask(__name__)
CORS(app)

@app.route("/predict")
def predictBackend():
  # Default to empty strings so a missing query parameter doesn't raise a TypeError
  title = request.args.get('title', '')
  text = request.args.get('text', '')
  prediction = predictPromptProbabilities(title, text)
  return jsonify({'prediction': prediction})

app.run(host='0.0.0.0', port=3000)

 * Serving Flask app '__main__'
 * Debug mode: off
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:3000
 * Running on http://192.168.1.86:3000
Press CTRL+C to quit
127.0.0.1 - - [29/Nov/2024 16:13:26] "GET /predict?title=Soccer!&text= HTTP/1.1" 200 -
127.0.0.1 - - [29/Nov/2024 16:13:33] "GET /predict?title=Soccer2&text= HTTP/1.1" 200 -
127.0.0.1 - - [29/Nov/2024 16:13:37] "GET /predict?title=Soccer%202&text= HTTP/1.1" 200 -
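
For reference, a client can build the request URL as sketched below. The helper name `build_predict_url` is hypothetical, and the default base URL is taken from the server log above. Passing `quote_via=quote` encodes spaces as `%20`, matching the last logged request.

```python
from urllib.parse import urlencode, quote

def build_predict_url(title, text="", base_url="http://127.0.0.1:3000"):
    # Percent-encode the query parameters so spaces and punctuation are safe
    query = urlencode({'title': title, 'text': text}, quote_via=quote)
    return f"{base_url}/predict?{query}"

url = build_predict_url("Soccer 2")
print(url)  # http://127.0.0.1:3000/predict?title=Soccer%202&text=

# With the server running, the response could then be fetched with, e.g.:
#   import json, urllib.request
#   with urllib.request.urlopen(url) as response:
#       prediction = json.load(response)['prediction']
```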
