# **Are You the Asshole? Judging r/AITA Drama with NLP and ML**
*Final Project for CMSC176: Natural Language Processing* <br> <br>
**Author**: Daenielle Rai Peladas <br>
**Last Modified**: December 8, 2024 

## **Data Acquisition**

In [None]:
import os
import re
import sys
import json
import pandas as pd

In [22]:
# Retrieve comment directory
comments_dir = os.path.join('/mnt/c/Users/daeni/Desktop/LAB/CMSC176NLP/NLP Final Project/AITAH-FAFO', 'scrapes', 'comments')

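This function builds a DataFrame with one row per scraped post. It reads every JSON file under `scrapes/comments` in the given root directory and records the post's title, its body text, and a verdict label derived from its comments.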

In [23]:
def create_dataframe_from_comments(dir_path):
    # Directory path where the JSON files are stored
    comments_dir = os.path.join(dir_path, 'scrapes', 'comments')
    
    # Initialize a list to hold the data
    data = []

    # Go through all files in the directory
    for filename in os.listdir(comments_dir):
        if filename.endswith('.json'):
            file_path = os.path.join(comments_dir, filename)
            with open(file_path, 'r', encoding='utf-8') as file:
                # Load the JSON file
                content = json.load(file)
                
                metadata = content['data']['submission_metadata']
                comments = content['data']['comments']
                
                # Extract the required fields
                title = metadata['title']
                selftext = metadata['selftext']
                aitah_tag = count_YTA_NTA(comments)
                
                # Append to the data list
                data.append({'title': title, 'selftext': selftext, 'aitah_tag': aitah_tag})
    # Create a DataFrame from the data
    df = pd.DataFrame(data)
    return df
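
The loader above depends on a specific JSON layout: a top-level `data` object holding `submission_metadata` (with `title` and `selftext`) and a `comments` list. As a minimal sketch of that layout, the snippet below writes one made-up post to a temporary directory and reads it back. The helper name `read_post_fields`, the sample values, and the `n_comments` field are assumptions for illustration only; they are not part of the project code.

```python
import json
import os
import tempfile

# Hypothetical scraped post: the values are made up, the keys are the ones the loader reads
sample_post = {
    "data": {
        "submission_metadata": {
            "title": "AITA for an example post?",
            "selftext": "Example body text.",
        },
        "comments": [{"body": "NTA"}, {"body": "YTA, sorry."}],
    }
}

def read_post_fields(root_dir):
    """Collect title, selftext, and comment count from each JSON file under scrapes/comments."""
    comments_dir = os.path.join(root_dir, "scrapes", "comments")
    rows = []
    for filename in sorted(os.listdir(comments_dir)):
        if not filename.endswith(".json"):
            continue
        with open(os.path.join(comments_dir, filename), "r", encoding="utf-8") as file:
            content = json.load(file)
        metadata = content["data"]["submission_metadata"]
        rows.append({
            "title": metadata["title"],
            "selftext": metadata["selftext"],
            "n_comments": len(content["data"]["comments"]),
        })
    return rows

# Build the expected directory structure in a throwaway location
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "scrapes", "comments"))
    with open(os.path.join(root, "scrapes", "comments", "example.json"), "w", encoding="utf-8") as file:
        json.dump(sample_post, file)
    rows = read_post_fields(root)

print(rows)
```

A file missing any of these keys would raise a `KeyError` in the loader, so checking one scraped file against this shape is a quick sanity test before processing the whole directory.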

This function scans the comment section for the verdict keywords YTA ('You're the Asshole') and NTA ('Not the Asshole'), including the variants YTAH and NTAH. Matching is whole-word and case-sensitive. The function returns 1 when YTA mentions outnumber NTA mentions and 0 otherwise, so ties and posts with no verdict keywords are labeled 0.

In [24]:
def count_YTA_NTA(comments_list):
    # Extract 'body' content from each comment
    bodies = [comment['body'] for comment in comments_list if 'body' in comment]

    # Initialize counters for YTA and NTA
    YTA = 0
    NTA = 0
    
    # Count whole-word, case-sensitive occurrences of YTA/YTAH and NTA/NTAH in each comment body
    for body in bodies:
        YTA += len(re.findall(r'\bYTAH?\b', body))
        NTA += len(re.findall(r'\bNTAH?\b', body))
    
    # Label 1 (YTA) only when YTA strictly outnumbers NTA; ties default to 0
    if YTA > NTA:
        return 1
    return 0
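
To make the counting rule concrete, here is a minimal, self-contained sketch that applies the same whole-word, case-sensitive patterns to a few made-up comment bodies. The helper name `tally_verdicts` and the sample comments are hypothetical; only the regular expressions and the strict `YTA > NTA` rule come from the function above.

```python
import re

# Hypothetical comment bodies, made up for illustration (not from the scraped data)
sample_bodies = [
    "NTA. Your house, your rules.",
    "YTA, and you know it.",
    "Definitely NTAH here, nta all the way.",  # lowercase 'nta' is not counted
    "INFO: how old is your brother?",          # no verdict keyword at all
]

def tally_verdicts(bodies):
    """Return (yta_count, nta_count) using whole-word, case-sensitive matches."""
    yta = sum(len(re.findall(r'\bYTAH?\b', body)) for body in bodies)
    nta = sum(len(re.findall(r'\bNTAH?\b', body)) for body in bodies)
    return yta, nta

yta_count, nta_count = tally_verdicts(sample_bodies)

# Same decision rule as the notebook: 1 only when YTA strictly outnumbers NTA
label = 1 if yta_count > nta_count else 0
print(yta_count, nta_count, label)
```

Two consequences are worth keeping in mind when reading the labels: verdicts written in lowercase are ignored, and a post whose comments contain no verdict keywords at all receives the same label (0) as a post judged NTA.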

In [25]:
# Specify the root directory
root_directory = '/mnt/c/Users/daeni/Desktop/LAB/CMSC176NLP/NLP Final Project/AITAH-FAFO'
df = create_dataframe_from_comments(root_directory)

In [33]:
df.head()

|  | title | selftext | aitah_tag |
|---|---|---|---|
| 2 | AITA for asking my parents to wait a year unti... | I (16M) am in my junior year of high school an... | 0 |
| 3 | AITA for being weirded out after my girlfriend... | This is a weird and uncomfortable situation, b... | 0 |
| 4 | AITA for cutting off my mom | I come from a small family. While my extended ... | 0 |
| 5 | AITA for Deciding to Skip Thanksgiving With My... | Hi everyone, I need some outside perspective o... | 0 |
| 6 | AITA for exposing my best friend for cheating ... | Throw away since it is fresh \n\nThis happened... | 0 |


## **Basic Insights**

In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99 entries, 0 to 98
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title      99 non-null     object
 1   selftext   99 non-null     object
 2   aitah_tag  99 non-null     int64 
dtypes: int64(1), object(2)
memory usage: 2.4+ KB


## **Data Wrangling**

In [28]:
df.isnull().sum()

title        0
selftext     0
aitah_tag    0
dtype: int64

No null values are present. However, some posts were deleted or removed, which leaves the placeholder string `[deleted]` or `[removed]` in `selftext` instead of a null. These rows must be dropped because the post body is required for the analysis.

In [29]:
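# Drop posts whose body is a placeholder; copy() avoids chained-assignment warnings when columns are added later
df = df[~df['selftext'].isin(['[deleted]', '[removed]'])].copy()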
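
The reason `isnull()` reports nothing here is that the placeholders are ordinary strings. The sketch below shows this on a toy DataFrame. The rows are made up for illustration; only the two placeholder strings come from the notebook.

```python
import pandas as pd

# Toy frame with made-up rows; only the placeholder strings mirror the notebook
toy = pd.DataFrame({
    "title": ["post a", "post b", "post c", "post d"],
    "selftext": ["A real story", "[deleted]", "[removed]", "Another story"],
    "aitah_tag": [0, 0, 1, 0],
})

# Placeholders are non-null strings, so a null check does not flag them
print(toy["selftext"].isnull().sum())

# Flag the placeholder rows explicitly, then keep everything else
placeholders = ["[deleted]", "[removed]"]
is_placeholder = toy["selftext"].isin(placeholders)
kept = toy[~is_placeholder].copy()

# The surviving rows keep their original index labels, leaving gaps
print(kept.index.tolist())
```

Because the filter preserves the original index labels, the cleaned DataFrame no longer has a contiguous index. Calling `reset_index(drop=True)` afterwards is one option if positional and label-based indexing need to agree.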

In [34]:
pd.set_option('display.max_rows', None)
df.head(100)

|  | title | selftext | aitah_tag |
|---|---|---|---|
| 2 | AITA for asking my parents to wait a year unti... | I (16M) am in my junior year of high school an... | 0 |
| 3 | AITA for being weirded out after my girlfriend... | This is a weird and uncomfortable situation, b... | 0 |
| 4 | AITA for cutting off my mom | I come from a small family. While my extended ... | 0 |
| 5 | AITA for Deciding to Skip Thanksgiving With My... | Hi everyone, I need some outside perspective o... | 0 |
| 6 | AITA for exposing my best friend for cheating ... | Throw away since it is fresh \n\nThis happened... | 0 |
| 7 | AITA for going no contact with my best friend ... | Okay, so I (27F) feel sick even writing this b... | 0 |
| 9 | AITA for Kicking My Brother Out of My House Af... | I (38F) have a 15-year-old daughter, “Emma.” F... | 0 |
| 10 | AITA For Masturbating On Call? | I (26M) was on a call with a friend of mine (2... | 1 |
| 12 | AITA for not telling my in-laws about the gift... | \nFor context, The baby's 100th day is celeb... | 0 |
| 14 | AITA for not tolerating my brother’s behaviour... | I, 24F have two brothers - 27M and 19M. My old... | 0 |


In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 76 entries, 2 to 98
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title      76 non-null     object
 1   selftext   76 non-null     object
 2   aitah_tag  76 non-null     int64 
dtypes: int64(1), object(2)
memory usage: 2.4+ KB


## **Text Processing**

In [42]:
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [48]:
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/rx/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/rx/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /home/rx/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [49]:
# Get the list of English stop words
stop_words = stopwords.words('english')

In [50]:
def preprocess_data(reddit_post):
    # Lowercase the text and strip everything except letters, digits, and whitespace
    reddit_post = str(reddit_post).lower()
    reddit_post = re.sub(r"[^a-zA-Z0-9\s]", '', reddit_post)
    
    # Drop stop words; split() already discards empty strings and line breaks
    kept_words = [word for word in reddit_post.split() if word not in stop_words]
            
    # Tokenize the cleaned text
    return word_tokenize(' '.join(kept_words))
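
The order of operations in this function matters: punctuation is stripped before stop words are filtered. The sketch below reproduces the lowercase, strip, and filter steps without NLTK so the effect is easy to see. The stop-word set is a small stand-in chosen for illustration (the notebook uses NLTK's English list), the helper name `clean_words` is hypothetical, and the sample sentence is made up.

```python
import re

# Stand-in stop-word set (an assumption for illustration; the notebook uses NLTK's English list)
STOP_WORDS = {"i", "am", "in", "my", "of", "and", "to", "the", "a", "don't"}

def clean_words(text):
    """Lowercase, strip non-alphanumeric characters, then drop stop words."""
    text = str(text).lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return [word for word in text.split() if word not in STOP_WORDS]

# Made-up sentence in the style of a post
print(clean_words("I (16M) am in my junior year of high school, and I don't plan to move."))
```

With this stand-in list, "don't" loses its apostrophe before the stop-word check and survives as "dont". If contractions should be filtered, one option is to remove stop words before stripping punctuation, or to add the apostrophe-free forms to the stop-word list.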

In [51]:
df['processed_selftext'] = df['selftext'].apply(preprocess_data)

In [52]:
df.head()

|  | title | selftext | aitah_tag | processed_selftext |
|---|---|---|---|---|
| 2 | AITA for asking my parents to wait a year unti... | I (16M) am in my junior year of high school an... | 0 | [16m, junior, year, high, school, plan, move, ... |
| 3 | AITA for being weirded out after my girlfriend... | This is a weird and uncomfortable situation, b... | 0 | [weird, uncomfortable, situation, really, need... |
| 4 | AITA for cutting off my mom | I come from a small family. While my extended ... | 0 | [come, small, family, extended, family, large,... |
| 5 | AITA for Deciding to Skip Thanksgiving With My... | Hi everyone, I need some outside perspective o... | 0 | [hi, everyone, need, outside, perspective, sit... |
| 6 | AITA for exposing my best friend for cheating ... | Throw away since it is fresh \n\nThis happened... | 0 | [throw, away, since, fresh, happened, recently... |


In [53]:
df['aitah_tag'].value_counts()

aitah_tag
0    72
1     4
Name: count, dtype: int64