# Jakub Bandurski, Michał Bryzik, Kacper Gruca
## Text Mining and Social Media Mining Project
## Can sentiment analysis predict the subreddit r/AITA top comment verdict?
## What are the most common topics on this subreddit?

# Text cleaning

## Extraction of verdicts

In [39]:
import pandas as pd
df = pd.read_pickle("./data/raw_data.pkl")
df.head()

Unnamed: 0,title_body,C1
0,(non specific conflict) Telemarketers vs. Phon...,"<|response|> Yes, you are the asshole, with ca..."
1,"AIAA, Disabled CSGO player who found another ""...",<|response|> Why would you be the asshole? You...
2,"AITA ""...You left this marriage for your own s...",<|response|> YTA. It's only been three months ...
3,"AITA ""Friend"" called me narcissistic a**hole f...",<|response|> Judging from the situation and yo...
4,"AITA ""Friends"" argue with GF, now are mad at m...",<|response|> NTA Your “friends” are a bunch of...


In [40]:
positive = ["YTA","NTA","ESH","NAH"]
df["yes"] = (df["C1"].str.contains("YTA")) | (df["C1"].str.contains("ESH"))
df["no"] = (df["C1"].str.contains("NTA")) | (df["C1"].str.contains("NAH"))
df = df[(df["yes"]==True)|(df["no"]==True)]
df["AITA"] = df["yes"]
df = df.drop(labels=["yes","no"], axis=1).reset_index(drop=True)
df.head()


Unnamed: 0,title_body,C1,AITA
0,"AITA ""...You left this marriage for your own s...",<|response|> YTA. It's only been three months ...,True
1,"AITA ""Friends"" argue with GF, now are mad at m...",<|response|> NTA Your “friends” are a bunch of...,False
2,"AITA ""I gave you photo credit on the last coup...",<|response|> NTA - It sounds like he won't be ...,False
3,"AITA ""Selling out a coworker"" So, I'm gonna be...","<|response|> NTA. From what I understand, he u...",False
4,"AITA ""UNWELCOME IN MY OWN HOME"" This happened ...",<|response|> NTA - You should never feel unwel...,False


## Text cleaning

In [41]:
# Import the regex module
import re

# Define a function to clean text using regex
def regex_clean(text):
    # Remove all characters that are not alphanumeric, spaces, newlines, or periods
    text = re.sub("[^a-zA-Z0-9' \n\.]", '', text)
    
    # Replace multiple spaces with a single space
    text = re.sub(' +', ' ', text)
    
    # Remove all characters that are not alphanumeric or spaces
    text = re.sub(r"[^\w\s']",'', text)
    
    # Remove all digits
    text = re.sub('\d', '', text)
    
    # Convert the text to lowercase
    text = text.lower()
    
    # Return the cleaned text
    return text

In [42]:
# clean columns
df["C1_clean"] = df["C1"].apply(lambda x: regex_clean(x))
df["title_body_clean"] = df["title_body"].apply(lambda x: regex_clean(x))

In [43]:
df.head()

Unnamed: 0,title_body,C1,AITA,C1_clean,title_body_clean
0,"AITA ""...You left this marriage for your own s...",<|response|> YTA. It's only been three months ...,True,response yta it's only been three months since...,aita you left this marriage for your own selfi...
1,"AITA ""Friends"" argue with GF, now are mad at m...",<|response|> NTA Your “friends” are a bunch of...,False,response nta your friends are a bunch of assho...,aita friends argue with gf now are mad at me f...
2,"AITA ""I gave you photo credit on the last coup...",<|response|> NTA - It sounds like he won't be ...,False,response nta it sounds like he won't be gettin...,aita i gave you photo credit on the last coupl...
3,"AITA ""Selling out a coworker"" So, I'm gonna be...","<|response|> NTA. From what I understand, he u...",False,response nta from what i understand he used a ...,aita selling out a coworker so i'm gonna be to...
4,"AITA ""UNWELCOME IN MY OWN HOME"" This happened ...",<|response|> NTA - You should never feel unwel...,False,response nta you should never feel unwelcomed ...,aita unwelcome in my own home this happened to...


## Stop words removal

In [44]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 

In [45]:
stop_words = set(stopwords.words("english"))
# remove additional subject-specific stopwards
for el in ["aita","response","yta","nta","esh","nah","asshole","a**hole",
            "'m","'s","n't","'d","'t","'ve","'ll","'re","'em","'nt","'d'nt","'t'nt",
            "'ve'nt","'ll'nt","'re'nt","'em'nt"]:
    stop_words.add(el)
for sw in stop_words:
    if "'" in sw:
        sw.replace("'","")

In [46]:
# removing stopwords
def remove_stopwords(text, word_tokenizer=word_tokenize, stop_words=stop_words):
    word_tokens = word_tokenize(text)
    return " ".join([w for w in word_tokens if not w in stop_words])


In [47]:
# remove stopwords from columns
df["C1_no_stopwords"] = df["C1_clean"].apply(lambda x: remove_stopwords(text=x,word_tokenizer=word_tokenize, stop_words=stop_words))
df["title_body_no_stopwords"] = df["title_body_clean"].apply(lambda x: remove_stopwords(text=x,word_tokenizer=word_tokenize, stop_words=stop_words))

In [48]:
df.head()

Unnamed: 0,title_body,C1,AITA,C1_clean,title_body_clean,C1_no_stopwords,title_body_no_stopwords
0,"AITA ""...You left this marriage for your own s...",<|response|> YTA. It's only been three months ...,True,response yta it's only been three months since...,aita you left this marriage for your own selfi...,three months since bf split wife divorce two a...,left marriage selfish right happy think thats ...
1,"AITA ""Friends"" argue with GF, now are mad at m...",<|response|> NTA Your “friends” are a bunch of...,False,response nta your friends are a bunch of assho...,aita friends argue with gf now are mad at me f...,friends bunch assholes obviously endoftext,friends argue gf mad siding hey first time pos...
2,"AITA ""I gave you photo credit on the last coup...",<|response|> NTA - It sounds like he won't be ...,False,response nta it sounds like he won't be gettin...,aita i gave you photo credit on the last coupl...,sounds like wo getting photos anytime soon end...,gave photo credit last couple since whole flar...
3,"AITA ""Selling out a coworker"" So, I'm gonna be...","<|response|> NTA. From what I understand, he u...",False,response nta from what i understand he used a ...,aita selling out a coworker so i'm gonna be to...,understand used racial slur attempted avoid co...,selling coworker gon na totally honest bc want...
4,"AITA ""UNWELCOME IN MY OWN HOME"" This happened ...",<|response|> NTA - You should never feel unwel...,False,response nta you should never feel unwelcomed ...,aita unwelcome in my own home this happened to...,never feel unwelcomed home even partner feels ...,unwelcome home happened today still working he...


## Stemming

In [49]:
from nltk.stem import PorterStemmer 

In [50]:
# stemming
def stemming(text,ps=PorterStemmer):
    output = []
    for word in text.split():
        # if the word is not found in the stemmer don't stem it
        try:
            output.append(ps.stem(word))
        except:
            output.append(word)
    return " ".join(output)

In [51]:
# remove stopwords from columns
df["C1_stemmed"] = df["C1_no_stopwords"].apply(lambda x: stemming(text=x,ps=PorterStemmer))
df["title_body_stemmed"] = df["title_body_no_stopwords"].apply(lambda x: stemming(text=x,ps=PorterStemmer))

## Save results
Containing cleaned versions of posts and comments without stopwords as well as the stemmed versions of said posts and comments.

In [52]:
df[["title_body_no_stopwords","C1_no_stopwords","title_body_stemmed","C1_stemmed","AITA"]].to_pickle("./data/cleaned_data.pkl")

In [53]:
df[["title_body_no_stopwords","C1_no_stopwords","title_body_stemmed","C1_stemmed","AITA"]].head()

Unnamed: 0,title_body_no_stopwords,C1_no_stopwords,title_body_stemmed,C1_stemmed,AITA
0,left marriage selfish right happy think thats ...,three months since bf split wife divorce two a...,left marriage selfish right happy think thats ...,three months since bf split wife divorce two a...,True
1,friends argue gf mad siding hey first time pos...,friends bunch assholes obviously endoftext,friends argue gf mad siding hey first time pos...,friends bunch assholes obviously endoftext,False
2,gave photo credit last couple since whole flar...,sounds like wo getting photos anytime soon end...,gave photo credit last couple since whole flar...,sounds like wo getting photos anytime soon end...,False
3,selling coworker gon na totally honest bc want...,understand used racial slur attempted avoid co...,selling coworker gon na totally honest bc want...,understand used racial slur attempted avoid co...,False
4,unwelcome home happened today still working he...,never feel unwelcomed home even partner feels ...,unwelcome home happened today still working he...,never feel unwelcomed home even partner feels ...,False
