# Am I The A-hole Predictor Project

This project is based on one of Reddit's most popular communities: <a href="https://www.reddit.com/r/AmItheAsshole/">r/AmITheAsshole</a>. The subreddit (AITA for short) has a simple premise: people post about social situations in which they are unsure whether they were the one at fault. Redditors then comment on the situation, briefly summarize what they think is going on, and argue for one of the valid judgements:

- **YTA** (You're the A-hole)
- **NTA** (Not the A-hole, but the other person is)
- **ESH** (Everyone Sucks Here)
- **NAH** (No A-holes Here)
- **INFO** (Not enough information)

People vote in their comments, and after 18 hours each post receives a flair with the public's verdict. The original posters then usually react to their judgement, for better or worse (and this is where the subreddit gets some of the juiciest comment sections on the entire website!). However, this being both Reddit and the internet, some biases are involved. The crowd allegedly tends to pick certain verdicts when a story contains buzzwords or characters that are not well liked on the internet, such as mothers-in-law, pregnant people, and loud children, among many others. There are also certain situations that the subreddit is quite vocal about defending, especially when they involve people asking for help that nobody is legally obliged to give.

In this project, I took this perceived notion that the outcome can be predicted when a story is told in a certain way or with certain participants. I obtained a dataset of 100,000 submissions and processed it to compute descriptive statistics that characterize the posts based on Reddit activity (comments and up/downvotes). I then built a machine learning model to predict the verdict for a story entered by a user.

**Import statements**

In [81]:
import pandas as pd
import datetime
import numpy as np
import nltk
import re
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

**Loading the dataset**

The dataset was obtained with a script that uses two Reddit APIs to collect posts and then pull information about each of them.

In [3]:
df = pd.read_csv("/Users/pilarvasquez/Documents/Repos/AITA-Predictor/reddit_scraper/reddit_posts.csv")

**Cleaning up rows**

Remove all rows without a body from the dataset, as well as rows whose tags are not verdicts (mod posts).

In [4]:
nan = np.nan
clean_df = df.query("body not in ['[deleted]', '[removed]'] & verdict not in ['TL;DR', 'UPDATE', @nan, 'Talk ENDED', 'Open Forum', 'Mods Needed!', 'META']")
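
The same filter can also be written with boolean masks, which can be easier to audit than a long `query` string. This is only a sketch on a toy frame: the column names `body` and `verdict` come from the dataset, while the sample rows and the `NON_VERDICT_FLAIRS` name are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; only the column names come from the dataset.
toy = pd.DataFrame({
    "body": ["a story", "[deleted]", "another story", "mod text", "[removed]", "third story"],
    "verdict": ["Not the A-hole", "Asshole", np.nan, "META", "Asshole", "Everyone Sucks"],
})

# Flairs used by moderators rather than verdicts (same values as in the query above).
NON_VERDICT_FLAIRS = ["TL;DR", "UPDATE", "Talk ENDED", "Open Forum", "Mods Needed!", "META"]

mask = (
    ~toy["body"].isin(["[deleted]", "[removed]"])   # drop posts without a body
    & toy["verdict"].notna()                        # drop posts without a flair
    & ~toy["verdict"].isin(NON_VERDICT_FLAIRS)      # drop mod/meta flairs
)
toy_clean = toy[mask]
print(toy_clean)
```

Only the first and last toy rows survive, since every other row hits one of the three conditions.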

In [5]:
clean_df.sort_values(by="score", ascending=False)

| | title | id | score | upvote_ratio | url | num_comments | body | created | edited | verdict | over_18 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2637 | AITA for giving my MIL a fake copy of my house... | rrjhmz | 41057 | 0.97 | https://www.reddit.com/r/AmItheAsshole/comment... | 4983 | \nI wanna preface this by saying that I f34 ma... | 2021-12-29 18:38:13 | 1640821313.0 | Not the A-hole | False |
| 65632 | AITA for kicking out my girlfriend | qkntpa | 35463 | 0.97 | https://www.reddit.com/r/AmItheAsshole/comment... | 1946 | So I have a cat named Raven who's 3 years old.... | 2021-11-01 18:06:27 | False | Not the A-hole | False |
| 920 | AITA for setting a glitter trap to catch my mo... | rsnfcq | 34346 | 0.98 | https://www.reddit.com/r/AmItheAsshole/comment... | 2607 | For some weird reason my MIL really wants to g... | 2021-12-31 03:27:13 | 1640981534.0 | Not the A-hole | False |
| 16904 | AITA for telling my fiancé if I see her friend... | rijyk1 | 33318 | 0.97 | https://www.reddit.com/r/AmItheAsshole/comment... | 2796 | Edit: Update. I want to thank you all for the ... | 2021-12-17 12:33:20 | 1639778419.0 | Not the A-hole | False |
| 70815 | AITA for losing weight before my sister's wedd... | qgxbzw | 33096 | 0.94 | https://www.reddit.com/r/AmItheAsshole/comment... | 3692 | I 28F used to be quite overweight, over the la... | 2021-10-27 11:00:03 | False | Not the A-hole | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 71299 | AITA for making a classmate leave the class be... | qgk973 | 0 | 0.40 | https://www.reddit.com/r/AmItheAsshole/comment... | 57 | I was in class today, in the university caree... | 2021-10-26 21:27:42 | 1635298606.0 | Asshole | False |
| 33707 | AITA Friends ghost me after one claims I said ... | r6xr83 | 0 | 0.45 | https://www.reddit.com/r/AmItheAsshole/comment... | 9 | So for reference I used to be a member of this... | 2021-12-02 00:37:21 | False | Everyone Sucks | False |
| 7722 | AITA for demanding quarantine because of a sto... | rogo42 | 0 | 0.50 | https://www.reddit.com/r/AmItheAsshole/comment... | 15 | So, my BIL is in therapy and assisted living (... | 2021-12-25 17:20:21 | False | Not the A-hole | False |
| 79987 | AITA for deciding to cut a friend off after ge... | qbfrox | 0 | 0.38 | https://www.reddit.com/r/AmItheAsshole/comment... | 8 | I (16F) have been friends with “Zee” (16F) for... | 2021-10-19 13:59:35 | False | Not the A-hole | False |


In [10]:
clean_df.id.count()

21852

In [None]:
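# Assumption: `edited` holds "False" for unedited posts and an edit timestamp otherwise
is_edited = clean_df.edited.astype(str) != "False"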

In [34]:
clean_df.body.str.contains("mil", case=False).sum()

9089

In [35]:
clean_df.body.str.contains("mother-in-law", case=False).sum()

29

In [None]:
# Word boundaries keep "mil" from matching inside words such as "family" or "similar"
clean_df.body.str.contains(r"\b(?:mil|mother-in-law|mother in law)\b", case=False).sum()
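
The gap between the 9089 hits for "mil" and the 29 hits for "mother-in-law" suggests that the short pattern is probably also matching inside longer words. A small sketch with the standard `re` module shows the effect of word boundaries; the sample sentences and the `MIL_PATTERN` name are invented for illustration, and the same pattern string should be usable with `str.contains`.

```python
import re

# Word boundaries (\b) make "mil" match only as a standalone token.
MIL_PATTERN = re.compile(r"\b(?:mil|mother-in-law|mother in law)\b", re.IGNORECASE)

# Hypothetical sentences, not taken from the dataset.
samples = [
    "My MIL visited last week.",
    "My family is similar to yours.",
    "My mother-in-law called.",
    "He walked a mile and smiled.",
]

# Plain substring search, as in the cells above.
naive_hits = [bool(re.search("mil", s, re.IGNORECASE)) for s in samples]
# Search restricted to whole words and phrases.
bounded_hits = [bool(MIL_PATTERN.search(s)) for s in samples]

print(naive_hits)    # [True, True, False, True]
print(bounded_hits)  # [True, False, True, False]
```

The substring search flags "family" and "mile" while missing the spelled-out form, whereas the bounded pattern only flags the two sentences that mention a mother-in-law.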

In [18]:
clean_df.created.tail(10)

99965    2021-09-30 14:32:00
99968    2021-09-30 14:23:56
99970    2021-09-30 14:23:32
99973    2021-09-30 14:21:24
99982    2021-09-30 14:13:03
99983    2021-09-30 14:12:06
99989    2021-09-30 14:07:20
99990    2021-09-30 14:07:12
99992    2021-09-30 14:05:17
99997    2021-09-30 14:03:03
Name: created, dtype: object
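
The `created` column is stored as `object` (plain strings), so date-based analysis would first need a conversion. A minimal sketch, assuming every value follows the `YYYY-MM-DD HH:MM:SS` layout seen in the output above:

```python
import pandas as pd

# Sample values copied from the output above.
created = pd.Series(["2021-09-30 14:32:00", "2021-09-30 14:23:56", "2021-09-30 14:03:03"])

# Parse the strings into real timestamps using the layout seen in the data.
created_dt = pd.to_datetime(created, format="%Y-%m-%d %H:%M:%S")

print(created_dt.dtype)
print(created_dt.dt.hour.tolist())
```

Once converted, the `.dt` accessor exposes components such as the hour or weekday, which could be used to group posts by posting time.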

In [19]:
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21852 entries, 1 to 99997
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   title         21852 non-null  object 
 1   id            21852 non-null  object 
 2   score         21852 non-null  int64  
 3   upvote_ratio  21852 non-null  float64
 4   url           21852 non-null  object 
 5   num_comments  21852 non-null  int64  
 6   body          21852 non-null  object 
 7   created       21852 non-null  object 
 8   edited        21852 non-null  object 
 9   verdict       21852 non-null  object 
 10  over_18       21852 non-null  bool   
dtypes: bool(1), float64(1), int64(2), object(7)
memory usage: 1.9+ MB


In [20]:
clean_df.describe()

| | score | upvote_ratio | num_comments |
|---|---|---|---|
| count | 21852.0 | 21852.0 | 21852.0 |
| mean | 608.442889 | 0.799142 | 117.842577 |
| std | 2274.651992 | 0.141629 | 370.333993 |
| min | 0.0 | 0.11 | 3.0 |
| 25% | 6.0 | 0.71 | 16.0 |
| 50% | 16.0 | 0.82 | 28.0 |
| 75% | 96.0 | 0.92 | 61.0 |
| max | 41057.0 | 1.0 | 8053.0 |


In [21]:
clean_df.verdict.value_counts()

Not the A-hole     16211
Asshole             3068
No A-holes here     1222
Everyone Sucks       876
Not enough info      475
Name: verdict, dtype: int64
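
The classes are heavily imbalanced, which matters when judging a classifier later on. The sketch below turns the counts above into shares and derives the accuracy of a trivial baseline that always predicts the most common verdict; the variable names are illustrative.

```python
# Verdict counts copied from the value_counts() output above.
verdict_counts = {
    "Not the A-hole": 16211,
    "Asshole": 3068,
    "No A-holes here": 1222,
    "Everyone Sucks": 876,
    "Not enough info": 475,
}

total = sum(verdict_counts.values())
shares = {verdict: count / total for verdict, count in verdict_counts.items()}

# A classifier that always answers with the most frequent verdict
# is correct exactly as often as that verdict occurs.
majority_verdict = max(verdict_counts, key=verdict_counts.get)
baseline_accuracy = shares[majority_verdict]

for verdict, share in shares.items():
    print(f"{verdict:<16} {share:.1%}")
print(f"Always predicting '{majority_verdict}' is right about {baseline_accuracy:.1%} of the time")
```

Roughly three out of four posts are "Not the A-hole", so any model should be compared against that baseline rather than against chance.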

In [37]:
clean_df.groupby("verdict").upvote_ratio.mean()

verdict
Asshole            0.699498
Everyone Sucks     0.745308
No A-holes here    0.759542
Not enough info    0.723242
Not the A-hole     0.826118
Name: upvote_ratio, dtype: float64

In [38]:
clean_df.groupby("verdict").num_comments.mean()

verdict
Asshole            193.902868
Everyone Sucks     110.381279
No A-holes here     59.662848
Not enough info     82.488421
Not the A-hole     109.272593
Name: num_comments, dtype: float64

In [39]:
clean_df.groupby("verdict").score.mean()

verdict
Asshole            505.537810
Everyone Sucks     321.625571
No A-holes here    172.189853
Not enough info    283.547368
Not the A-hole     685.821911
Name: score, dtype: float64
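
The three group means above can be computed in a single call. Since the `describe()` output shows a mean score of about 608 against a median of 16, the distribution is strongly skewed, so it may be worth looking at medians as well. A sketch on made-up rows (only the column names come from the dataset):

```python
import pandas as pd

# Hypothetical rows; only the column names come from the dataset.
toy = pd.DataFrame({
    "verdict": ["Asshole", "Asshole", "Not the A-hole", "Not the A-hole"],
    "score": [10, 30, 100, 300],
    "upvote_ratio": [0.6, 0.8, 0.9, 0.7],
    "num_comments": [20, 40, 5, 15],
})

# One groupby, several columns, several statistics per column.
summary = toy.groupby("verdict")[["score", "upvote_ratio", "num_comments"]].agg(["mean", "median"])
print(summary)
```

The result has a two-level column index, so a single value is read with, for example, `summary.loc["Asshole", ("score", "mean")]`.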

## ML

In [44]:
X = clean_df.body.values

In [41]:
y = clean_df.verdict.values

array(['Not the A-hole', 'Everyone Sucks', 'Not the A-hole', ...,
       'Asshole', 'Not the A-hole', 'Not the A-hole'], dtype=object)

In [58]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
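
Because the verdicts are imbalanced, a plain random split can give the train and test sets slightly different class proportions, and without a seed the split changes on every run. A hedged sketch of a stratified, reproducible split on toy data; the data, the variable names, and the seed value are arbitrary choices for illustration, not the split used for the results in this notebook.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data standing in for the post bodies and verdicts.
texts = [f"story {i}" for i in range(100)]
labels = ["Not the A-hole"] * 80 + ["Asshole"] * 20

X_tr, X_te, y_tr, y_te = train_test_split(
    texts,
    labels,
    test_size=0.2,
    stratify=labels,   # keep the class proportions equal in both splits
    random_state=42,   # arbitrary seed so the split is reproducible
)

print(Counter(y_tr))
print(Counter(y_te))
```

With `stratify`, the 80/20 class ratio of the toy labels is preserved in both the training and the test portion.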

**Tokenize**

In [51]:
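def tokenize(text):
    """Replace URLs, tokenize, drop English stop words, and lemmatize."""
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))

    # Raw string so the regex escapes are not treated as string escapes
    urls = r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
    detected_urls = re.findall(urls, text)

    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    # Lowercase before filtering: the NLTK stop-word list is all lowercase,
    # so capitalized stop words such as "The" would otherwise slip through
    words = [x.lower() for x in word_tokenize(text)]
    words = [x for x in words if x not in stop_words]

    clean_tokens = []
    for word in words:
        clean_token = lemmatizer.lemmatize(word).strip()
        clean_tokens.append(clean_token)

    return clean_tokens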

**Model**

In [77]:
pipeline = Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize)),
                ('tfidf', TfidfTransformer()),
                ('clf', DecisionTreeClassifier())
    ])

parameters = {
        "tfidf__use_idf": [True, False],
        "clf__splitter": ["best", "random"],
        "clf__min_samples_split": [2, 10, 20]
    }

model = GridSearchCV(pipeline, parameters, verbose = 10)
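
Before launching the search, it can help to count how many fits the grid implies, because every candidate is trained once per cross-validation fold. The sketch below enumerates the grid with the standard library. It assumes the default of 5 folds that `GridSearchCV` uses when `cv` is not set; `candidates` and `n_folds` are illustrative names.

```python
from itertools import product

# Same grid as in the cell above.
parameters = {
    "tfidf__use_idf": [True, False],
    "clf__splitter": ["best", "random"],
    "clf__min_samples_split": [2, 10, 20],
}

# Every combination of one value per parameter is one candidate.
candidates = [dict(zip(parameters, values)) for values in product(*parameters.values())]

n_folds = 5  # GridSearchCV's default number of folds when cv is not given
print(len(candidates), "candidates,", len(candidates) * n_folds, "fits")
```

This grid yields 12 candidates and 60 fits. The log further down reports 2 candidates and 10 fits, which suggests it was captured from a run where only `clf__splitter` was in the grid.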

In [78]:
model.fit(X_train, y_train)

Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV 1/5; 1/2] START clf__splitter=best..........................................
[CV 1/5; 1/2] END ........................clf__splitter=best; total time= 1.9min
[CV 2/5; 1/2] START clf__splitter=best..........................................
[CV 2/5; 1/2] END ........................clf__splitter=best; total time= 1.7min
[CV 3/5; 1/2] START clf__splitter=best..........................................
[CV 3/5; 1/2] END ........................clf__splitter=best; total time= 1.9min
[CV 4/5; 1/2] START clf__splitter=best..........................................
[CV 4/5; 1/2] END ........................clf__splitter=best; total time= 1.9min
[CV 5/5; 1/2] START clf__splitter=best..........................................
[CV 5/5; 1/2] END ........................clf__splitter=best; total time= 1.8min
[CV 1/5; 2/2] START clf__splitter=random........................................
[CV 1/5; 2/2] END ......................clf__spli

GridSearchCV(estimator=Pipeline(steps=[('vect',
                                        CountVectorizer(tokenizer=<function tokenize at 0x7f80688698b0>)),
                                       ('tfidf', TfidfTransformer()),
                                       ('clf', DecisionTreeClassifier())]),
             param_grid={'clf__splitter': ['best', 'random']}, verbose=10)

In [79]:
y_pred = model.predict(X_test)

In [83]:
print(classification_report(y_test, y_pred))

                 precision    recall  f1-score   support

        Asshole       0.18      0.15      0.16       619
 Everyone Sucks       0.10      0.08      0.09       170
No A-holes here       0.08      0.06      0.07       249
Not enough info       0.01      0.01      0.01       103
 Not the A-hole       0.75      0.80      0.77      3230

       accuracy                           0.62      4371
      macro avg       0.22      0.22      0.22      4371
   weighted avg       0.59      0.62      0.60      4371
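
The two averages in the report tell different stories: the macro average treats all five verdicts equally, while the weighted average is dominated by "Not the A-hole". The sketch below recomputes both from the per-class rows and compares the test-set share of the largest class with the reported accuracy of 0.62. The numbers are copied from the report; the variable names are illustrative.

```python
# Per-class (F1, support) pairs copied from the classification report above.
report = {
    "Asshole": (0.16, 619),
    "Everyone Sucks": (0.09, 170),
    "No A-holes here": (0.07, 249),
    "Not enough info": (0.01, 103),
    "Not the A-hole": (0.77, 3230),
}

total_support = sum(support for _, support in report.values())

# Macro: plain mean over classes. Weighted: mean weighted by class support.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

# Accuracy of always predicting the most common class in the test set.
majority_share = max(support for _, support in report.values()) / total_support

print(f"macro F1:    {macro_f1:.2f}")
print(f"weighted F1: {weighted_f1:.2f}")
print(f"share of the largest class in the test set: {majority_share:.2f}")
```

Always answering "Not the A-hole" would be right on about 74% of the test posts, so the decision tree's 0.62 accuracy falls below that trivial baseline, and the low macro average shows that the minority verdicts are barely recognized.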



In [112]:
test3="I (29F) work in a tech company that delivers supermarket goods. I want to become Minister of Health in the future, so I'm working on a Data Science Nanodegree to learn more stuff, as well as working on my current job to get more experience. I have a project to deliver before April ends, and I've been struggling to finish it as I didn't pay much attention in my machine learning classes, so I asked a friend for help interpreting my results, and he absolutely refused to help me. I know I asked him at almost 11PM but that's what friends do, right? So Reddit, AITA?"

In [111]:
print(test3)

I (29F) work in a tech company that delivers supermarket goods. I want to become Minister of Health in the future, so I'm working on a Data Science Nanodegree to learn more stuff, as well as working on my current job to get more experience. I have a project to deliver before April ends, and I've been struggling to finish it as I didn't pay much attention in my machine learning classes, so I asked a friend for help interpreting my results, and he absolutely refused to help me. I know I asked him at almost 11PM but that's what friends do, right? So Reddit, AITA?
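
To get a verdict for a story like this one, the text has to be passed to `predict` inside a list, because scikit-learn vectorizers expect an iterable of documents rather than a bare string. With the fitted search object from the earlier cells, the call would presumably be `model.predict([test3])`. The sketch below demonstrates the pattern on a tiny stand-in pipeline trained on made-up stories, so its prediction says nothing about the real model.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the fitted `model`, trained on made-up stories.
toy_texts = [
    "I yelled at my friend for no reason",
    "I refused to pay for my own meal",
    "My roommate stole my food so I locked the fridge",
    "My brother broke my laptop and I asked him to replace it",
]
toy_verdicts = ["Asshole", "Asshole", "Not the A-hole", "Not the A-hole"]

toy_model = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", DecisionTreeClassifier(random_state=0)),
])
toy_model.fit(toy_texts, toy_verdicts)

story = "I asked a friend for help at 11PM and he refused"
prediction = toy_model.predict([story])  # note the list: one entry per story
print(prediction[0])
```

The result is an array with one verdict per input story, so `prediction[0]` is the verdict for the single text.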
