# AITAH analysis using spaCy  
In this notebook, we attempt to classify posts from the r/AITAH subreddit using English spaCy models. A balanced sample of roughly 20,000 posts is used for training and evaluation.

## DO NOT RUN THIS NOTEBOOK UNLESS YOU KNOW WHAT YOU'RE DOING: IT WILL BAKE YOUR COMPUTER

## Data preparation
First, we load the data. We work with the balanced, cleaned dataset that was preprocessed elsewhere. Although the YTA tag (you are the asshole) is less frequent in the full data, we sampled enough posts with this tag to balance the classes. Otherwise, the model would in most cases do best by simply predicting the majority class.
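
The balancing itself happened in a separate preprocessing step that is not part of this notebook. As a rough sketch of what such downsampling could look like, assuming a `target` column with 0/1 labels (the toy frame and the `balance_classes` helper are made up for illustration and are not the code that produced the sample):

```python
import pandas as pd

def balance_classes(frame, label_col="target", random_state=42):
    """Downsample every class to the size of the rarest one (hypothetical helper)."""
    smallest = frame[label_col].value_counts().min()
    return frame.groupby(label_col).sample(n=smallest, random_state=random_state)

# Toy data: 6 NTA posts (0) and 2 YTA posts (1)
toy = pd.DataFrame({
    "selftext": [f"post {i}" for i in range(8)],
    "target": [0, 0, 0, 0, 0, 0, 1, 1],
})
balanced = balance_classes(toy)
print(balanced["target"].value_counts().to_dict())
```

Downsampling throws away majority-class posts, which is acceptable here only because the full dataset is large.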

In [3]:
#basic libraries import for working with dataframes
import pandas as pd
import numpy as np

In [7]:
df=pd.read_csv('../data/samples/balanced_sample_20000.csv')

In [None]:
df.head()

Unnamed: 0,selftext,link_flair_text,target
0,This is such a stupid argument but it's stress...,Not the A-hole,0
1,"For Cinco de Mayo, my girlfriend told me she w...",Not the A-hole,0
2,The first thing you need to know is that my gr...,Not the A-hole,0
3,Today at work me(M) and 3 other people were lo...,Not the A-hole,0
4,"Yesterday, I started a new job as a Senior Lec...",Not the A-hole,0


In [41]:
#changing the datatype of the target in order to work with the data more easily
#using the built-in bool, since the np.bool alias is not available in every NumPy version
df["target"]=df['target'].astype(bool)
df.head()

Unnamed: 0,selftext,link_flair_text,target
0,This is such a stupid argument but it's stress...,Not the A-hole,False
1,"For Cinco de Mayo, my girlfriend told me she w...",Not the A-hole,False
2,The first thing you need to know is that my gr...,Not the A-hole,False
3,Today at work me(M) and 3 other people were lo...,Not the A-hole,False
4,"Yesterday, I started a new job as a Senior Lec...",Not the A-hole,False


In [42]:
#preparing the data for spacy
#create a dictionary for categories
df['cats'] = [{'YTA': y==True, 'NTA':y==False} for y in df['target']]
df.head()

Unnamed: 0,selftext,link_flair_text,target,cats
0,This is such a stupid argument but it's stress...,Not the A-hole,False,"{'YTA': False, 'NTA': True}"
1,"For Cinco de Mayo, my girlfriend told me she w...",Not the A-hole,False,"{'YTA': False, 'NTA': True}"
2,The first thing you need to know is that my gr...,Not the A-hole,False,"{'YTA': False, 'NTA': True}"
3,Today at work me(M) and 3 other people were lo...,Not the A-hole,False,"{'YTA': False, 'NTA': True}"
4,"Yesterday, I started a new job as a Senior Lec...",Not the A-hole,False,"{'YTA': False, 'NTA': True}"


In [None]:
#there are some floats in places where text should be -> we have to fix that
dtypeCount =[df.iloc[:,i].apply(type).value_counts() for i in range(df.shape[1])]
dtypeCount

[selftext
 <class 'str'>      19984
 <class 'float'>        6
 Name: count, dtype: int64,
 link_flair_text
 <class 'str'>    19990
 Name: count, dtype: int64,
 target
 <class 'bool'>    19990
 Name: count, dtype: int64,
 cats
 <class 'dict'>    19990
 Name: count, dtype: int64]

In [None]:
#the floats are NaNs, i.e. posts with missing text
#these rows will be dropped, since there is nothing sensible to impute for a missing post
df[df.isna().any(axis=1)]

Unnamed: 0,selftext,link_flair_text,target,cats
7396,,Not the A-hole,False,"{'YTA': False, 'NTA': True}"
9003,,Asshole,True,"{'YTA': True, 'NTA': False}"
9610,,Not the A-hole,False,"{'YTA': False, 'NTA': True}"
9871,,Asshole,True,"{'YTA': True, 'NTA': False}"
10012,,asshole,True,"{'YTA': True, 'NTA': False}"
12677,,Asshole,True,"{'YTA': True, 'NTA': False}"


In [43]:
#dropping the NA containing rows and checking the datatypes again just in case
df=df.dropna()
dtypeCount =[df.iloc[:,i].apply(type).value_counts() for i in range(df.shape[1])]
dtypeCount #only string type remains in selftext, now everything should work

[selftext
 <class 'str'>    19984
 Name: count, dtype: int64,
 link_flair_text
 <class 'str'>    19984
 Name: count, dtype: int64,
 target
 <class 'bool'>    19984
 Name: count, dtype: int64,
 cats
 <class 'dict'>    19984
 Name: count, dtype: int64]

In [44]:
#data split for train and test set
#our dataset is quite big so perhaps a bigger train set could still work, but we'll use the usual 80/20 split
aitah_data = {}
aitah_data['train'] = df.sample(frac=0.8, random_state=42)
aitah_data['test'] = df.drop(aitah_data['train'].index, inplace=False)
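
A plain random 80/20 split keeps the classes only approximately balanced. If exact proportions matter, the split can be stratified by `target`. A minimal sketch on toy data, assuming a unique index (the `stratified_split` helper is hypothetical and not used elsewhere in this notebook):

```python
import pandas as pd

def stratified_split(frame, label_col="target", frac=0.8, random_state=42):
    """Sample `frac` of every class for training; the remaining rows form the test set."""
    train_part = frame.groupby(label_col).sample(frac=frac, random_state=random_state)
    test_part = frame.drop(train_part.index)
    return train_part, test_part

toy = pd.DataFrame({
    "selftext": [f"post {i}" for i in range(20)],
    "target": [True] * 10 + [False] * 10,
})
toy_train, toy_test = stratified_split(toy)
print(toy_train["target"].value_counts().to_dict())
print(toy_test["target"].value_counts().to_dict())
```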

## Loading model 1
Loading the first language model from spaCy. The chosen model is `en_core_web_lg`, which has more unique vectors than the smaller spaCy models, possibly making it more accurate. It also takes up more space, but we expect to need a large model to capture the textual nuances of the posts.

### warning
After experimenting with this model, I found that it is practically impossible to train with it. It might be possible with a smaller sample, but a small sample would likely not give good results anyway, so while this section has been kept, we do not actually work with its output later. The problem is an *out of memory error*, likely caused by the combination of the data size and the size of the model with its large number of unique vectors. Because of this, memory runs out on both CPU and GPU (on Google Colab; I did not try this locally because I think it would make my laptop explode).

In [None]:
#downloading the model
model = "en_core_web_lg"
import spacy
spacy.cli.download(model)
nlp = spacy.load(model)

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
#parsing the documents we already have as examples (this takes about 18 minutes)
from tqdm import tqdm
docs = {}
docs['train'] = list(tqdm(nlp.pipe(aitah_data['train']['selftext']), total=len(aitah_data['train'])))
docs['test'] = list(tqdm(nlp.pipe(aitah_data['test']['selftext']), total=len(aitah_data['test'])))
docs

100%|██████████| 15987/15987 [15:55<00:00, 16.73it/s]
100%|██████████| 3997/3997 [03:37<00:00, 18.38it/s]


{'train': [I (19f) used to be friends with Olivia (17f) but we still have loads of mutual friends. Earlier this year we were supposed to go to a concert together but she sold my ticket and went without me after we had a fight about her being toxic.
  
  We also had another concert lined up for later this month but I have the tickets for that and she obviously won’t be going. 
  
  To clarify she did pay me back for the first concert in full ($100 roughly) and the ticket that I haven’t payed her back for is worth $190. 
  
  I blocked her number in hopes that she would forget about it but now our mutual friends are asking me to pay her back even though I have other responsibilities which I need to think about.
  
  What everyone needs to understand is that I am utterly heartbroken by what she did. We had one fight and she ruined everything. 
  
  It also hurts that our mutual friends still talk to her as I’ve been trying to tell them and everyone else that she is a bad person by showing

In [None]:
#creating examples for train and test set and saving them in binary format for training
from spacy.training import Example
from spacy.tokens import DocBin
bin = {"train": DocBin(), "test": DocBin()}
with nlp.select_pipes(enable="tok2vec"):
  for i in range(len(docs['train'])):
    example = Example.from_dict(docs['train'][i], {'cats': aitah_data['train'].iloc[i]['cats']})
    bin['train'].add(example.reference)
  for i in range(len(docs['test'])):
    example = Example.from_dict(docs['test'][i], {'cats': aitah_data['test'].iloc[i]['cats']})
    bin['test'].add(example.reference)

bin['train'].to_disk("train.spacy")
bin['test'].to_disk("test.spacy")
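
The index-based loops above can also be written by pairing each text with its label dictionary directly. The sketch below uses a blank English pipeline (tokenizer only) instead of `en_core_web_lg` so that it runs without downloading a model; whether tokenizer-only docs are sufficient depends on the training config, so treat that as an assumption. The two example texts and the names `nlp_blank` and `doc_bin` are made up.

```python
import spacy
from spacy.tokens import DocBin

nlp_blank = spacy.blank("en")  # tokenizer only, no model download needed

texts = ["I sold my friend's concert ticket.", "I asked my roommate to do the dishes."]
cats = [{"YTA": True, "NTA": False}, {"YTA": False, "NTA": True}]

doc_bin = DocBin()
for text, label in zip(texts, cats):
    doc = nlp_blank.make_doc(text)  # tokenize without running other components
    doc.cats = label                # attach the gold categories
    doc_bin.add(doc)

# Round-trip through bytes to check that the labels survive serialization
restored = list(DocBin().from_bytes(doc_bin.to_bytes()).get_docs(nlp_blank.vocab))
print(len(restored), restored[0].cats)
```

Naming the container `doc_bin` rather than `bin` also avoids shadowing Python's built-in `bin` function.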

## Training the model
Now that our data is preprocessed and ready to go, we can start training the model. We will use Weights & Biases to track the performance of the model during training.

### warning  
Once again, this model did NOT work: it ran out of memory and crashed before producing even one model.

In [28]:
#downloading the model again because I restarted the runtime to enable the GPU
model = "en_core_web_lg"
import spacy
spacy.cli.download(model)

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
# installation of the Weights & Biases package via `pip`
#!pip install wandb -qU

In [None]:
# Weights & Biases logger (lets us watch how model performance changes and how heavily training taxes the available hardware)
# log in by pasting the API key into the prompt
import wandb
wandb.login()

wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from C:\Users\emado\_netrc.
wandb: Currently logged in as: nanakoshiroi (nanakoshiroi-prague-university-of-economics-and-business) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


True

In [None]:
#config file added (with extensive cutdowns on memory use, but it still wasn't enough for the big model to actually run)
config_path = "./base_config_ensamble.cfg"
!python -m spacy init fill-config $config_path config.cfg

✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [None]:
#training the model (this crashed HARD like 10000 times)
!python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./test.spacy --verbose --gpu-id 0

[2026-01-14 17:32:12,260] [DEBUG] Config overrides from CLI: ['paths.train', 'paths.dev']
ℹ Saving to output directory: output
ℹ Using GPU: 0

[2026-01-14 17:32:12,895] [INFO] Set up nlp object from config
[2026-01-14 17:32:13,571] [DEBUG] Loading corpus from path: test.spacy
[2026-01-14 17:32:13,573] [DEBUG] Loading corpus from path: train.spacy
[2026-01-14 17:32:13,573] [INFO] Pipeline: ['tok2vec', 'textcat']
[2026-01-14 17:32:13,576] [INFO] Created vocabulary
[2026-01-14 17:32:13,577] [INFO] Finished initializing nlp object
[2026-01-14 17:34:05,691] [INFO] Initialized pipeline components: ['tok2vec', 'textcat']
✔ Initialized pipeline

[2026-01-14 17:34:05,715] [DEBUG] Loading corpus from path: test.spacy
[2026-01-14 17:34:05,719] [DEBUG] Loading corpus from path: train.spacy
[2026-01-14 17:34:05,726] [DEBUG] Removed existing output directory: output/model-last
ℹ Pipeline: ['tok2vec', 'textcat']
ℹ Initial learn rate

Okay, this is failing hard, possibly because of the data size. I tried changing the batch size in the config file.  
I also tried setting static vectors to false in the config.  
Neither helped, so I guess the model is just too big; we are doing the preprocessing again with a smaller one :)
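
Long posts are one plausible contributor to memory pressure, since all tokens of the documents in a batch have to be held in memory at once. This was not tested in this notebook; as a sketch, post lengths could be inspected and very long posts truncated before parsing. The 2,000-character cap, the `selftext_trunc` column, and the toy texts are arbitrary assumptions.

```python
import pandas as pd

MAX_CHARS = 2000  # arbitrary cap chosen for illustration

toy = pd.DataFrame({"selftext": ["short post", "x" * 5000, "medium " * 100]})

# Character length of every post, useful for spotting extreme outliers
lengths = toy["selftext"].str.len()
print(lengths.describe())

# Keep only the first MAX_CHARS characters of each post
toy["selftext_trunc"] = toy["selftext"].str.slice(0, MAX_CHARS)
print(toy["selftext_trunc"].str.len().max())
```

Truncation discards the end of a post, which on r/AITAH may contain relevant details, so any cap would need to be checked against model quality.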


## Smaller model attempt
Here, we use the `en_core_web_md` model instead and hope it will not overload the memory as much. This actually works, even though it takes about 30 minutes to train one model and it still crashes in Colab. I was able to get it running locally, although I could probably fry eggs on my laptop.

In [None]:
#loading the new model, this one is significantly smaller
model2 = "en_core_web_md"
import spacy
spacy.cli.download(model2)
nlp2 = spacy.load(model2)

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
#parsing the documents we already have as examples for the new train and test set
from tqdm import tqdm
docs2 = {}
docs2['train'] = list(tqdm(nlp2.pipe(aitah_data['train']['selftext']), total=len(aitah_data['train'])))
docs2['test'] = list(tqdm(nlp2.pipe(aitah_data['test']['selftext']), total=len(aitah_data['test'])))
docs2

100%|██████████| 15987/15987 [12:31<00:00, 21.27it/s]
100%|██████████| 3997/3997 [02:53<00:00, 23.01it/s]


{'train': [I (19f) used to be friends with Olivia (17f) but we still have loads of mutual friends. Earlier this year we were supposed to go to a concert together but she sold my ticket and went without me after we had a fight about her being toxic.
  
  We also had another concert lined up for later this month but I have the tickets for that and she obviously won’t be going. 
  
  To clarify she did pay me back for the first concert in full ($100 roughly) and the ticket that I haven’t payed her back for is worth $190. 
  
  I blocked her number in hopes that she would forget about it but now our mutual friends are asking me to pay her back even though I have other responsibilities which I need to think about.
  
  What everyone needs to understand is that I am utterly heartbroken by what she did. We had one fight and she ruined everything. 
  
  It also hurts that our mutual friends still talk to her as I’ve been trying to tell them and everyone else that she is a bad person by showing

In [47]:
#creating examples for train and test set and saving them in binary format
from spacy.training import Example
from spacy.tokens import DocBin
bin = {"train": DocBin(), "test": DocBin()}
with nlp2.select_pipes(enable="tok2vec"):
  for i in range(len(docs2['train'])):
    example = Example.from_dict(docs2['train'][i], {'cats': aitah_data['train'].iloc[i]['cats']})
    bin['train'].add(example.reference)
  for i in range(len(docs2['test'])):
    example = Example.from_dict(docs2['test'][i], {'cats': aitah_data['test'].iloc[i]['cats']})
    bin['test'].add(example.reference)

bin['train'].to_disk("train2.spacy")
bin['test'].to_disk("test2.spacy")

In [8]:
#config file added (the same cut-down config we used for the previous model but this time it is actually enough to get it working)
config_path = "./base_config_ensamble.cfg"
!python -m spacy init fill-config $config_path config.cfg

✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [None]:
#we are already logged into the logger so all that remains is training it and seeing how it goes
!python -m spacy train config.cfg --output ./output --paths.train ./train2.spacy --paths.dev ./test2.spacy --verbose

^C


Wandb run: wandering-morning-13 (to look it up in Weights & Biases).  
I ran this for about 3 hours, and while the scores did improve over the course of training (it went through about 7 models in that time, I think), the result is still only marginally better than random guessing. The AUC per type is 0.56 (0.5 would be chance level), which is not useful in practice, and I don't think my machine can handle a longer session. Google Colab won't even run it, so that's out of the question.
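
To put the 0.56 in context: AUC is the probability that a randomly chosen YTA post receives a higher score than a randomly chosen NTA post, so 0.5 corresponds to guessing and 1.0 to a perfect ranking. A small pure-Python sketch of that definition on made-up scores (the `pairwise_auc` function is illustrative and not necessarily how spaCy computes the metric internally):

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]          # 1 = YTA, 0 = NTA (toy labels)
scores = [0.62, 0.48, 0.55, 0.51, 0.40, 0.58]  # toy YTA scores
print(pairwise_auc(labels, scores))
```

The pairwise loop is quadratic in the number of posts, which is fine for a few hundred examples but not for the full dataset.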

### checking how the small model works with unseen data

In [None]:
#load the best model
import spacy
nlp = spacy.load("output/model-best")

In [None]:
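import pandas as pd
#note: the first rows match the 20000-post sample, so some of these posts may have been seen during training
test=pd.read_csv('../data/samples/balanced_sample_200.csv')
test.head()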

Unnamed: 0,selftext,link_flair_text,target
0,This is such a stupid argument but it's stress...,Not the A-hole,0
1,"For Cinco de Mayo, my girlfriend told me she w...",Not the A-hole,0
2,The first thing you need to know is that my gr...,Not the A-hole,0
3,Today at work me(M) and 3 other people were lo...,Not the A-hole,0
4,"Yesterday, I started a new job as a Senior Lec...",Not the A-hole,0


In [14]:
def evaluate(row, nlp=nlp):
    row["ASSHOLE_score"]=nlp(row["selftext"]).cats["YTA"]
    return row

In [20]:
test = test.apply(evaluate, nlp=nlp, axis=1)

test["AITA"] = pd.cut(test["ASSHOLE_score"],
                   bins=[0,0.4,0.6,1],
                   labels=["NTA", "NEUTRAL", "YTA"],
                   include_lowest=True) #so that a score of exactly 0 lands in NTA instead of NaN
test.sort_values(by="ASSHOLE_score")

Unnamed: 0,selftext,link_flair_text,target,ASSHOLE_score,AITA
115,I(28M) am having work despite of covid-19 and...,Not the A-hole,0,0.228077,NTA
102,So the title is basically self-explanatory. It...,Not the A-hole,0,0.312130,NTA
18,My twin sister Carly and her children have rec...,Asshole,1,0.344083,NTA
184,I (32f) have managed to upset and anger my mum...,Asshole,1,0.344645,NTA
64,"A year ago, I invited my family from another s...",Not the A-hole,0,0.351762,NTA
...,...,...,...,...,...
85,I was walking back home today when someone wit...,Asshole,1,0.601993,YTA
23,My younger brother is getting married. Because...,Asshole,1,0.606071,YTA
188,I was walking through the town I live near whe...,Asshole,1,0.615787,YTA
111,There was this streamer I used to watch and I ...,Not the A-hole,0,0.643233,YTA


In [22]:
test[test["AITA"]=="NTA"]

Unnamed: 0,selftext,link_flair_text,target,ASSHOLE_score,AITA
18,My twin sister Carly and her children have rec...,Asshole,1,0.344083,NTA
44,I (30 f) and my fiancée (29m) are set to get m...,Asshole,1,0.364529,NTA
64,"A year ago, I invited my family from another s...",Not the A-hole,0,0.351762,NTA
95,AITA Recently went to a big name concert in TX...,Asshole,1,0.375612,NTA
102,So the title is basically self-explanatory. It...,Not the A-hole,0,0.31213,NTA
115,I(28M) am having work despite of covid-19 and...,Not the A-hole,0,0.228077,NTA
129,"Hello, first time reddit poster here. I (19F) ...",Asshole,1,0.378532,NTA
134,I (28F) have been with my boyfriend (28M) for ...,Asshole,1,0.39163,NTA
149,"English isn't my 1st language, execuse me for ...",Asshole,1,0.392349,NTA
153,So I (13f) have a confusing relationship with ...,Asshole,1,0.397256,NTA


Mostly, the model is just not sure, and when it does commit, it is often incorrect: 7 of the 10 posts it labels NTA are actually tagged Asshole. So spaCy probably isn't the way to go here.
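
The table above can be summarized more systematically by comparing the thresholded score with the true label. A sketch on a toy frame that reuses the column names from this notebook (the scores are made up, and the 0.5 cut-off and the `predicted` column are assumptions):

```python
import pandas as pd

toy = pd.DataFrame({
    "target": [0, 0, 1, 1, 1, 0],
    "ASSHOLE_score": [0.23, 0.64, 0.34, 0.60, 0.62, 0.45],
})

# Predict YTA whenever the score exceeds 0.5
toy["predicted"] = (toy["ASSHOLE_score"] > 0.5).astype(int)

accuracy = (toy["predicted"] == toy["target"]).mean()
confusion = pd.crosstab(toy["target"], toy["predicted"],
                        rownames=["actual"], colnames=["predicted"])
print(f"accuracy: {accuracy:.2f}")
print(confusion)
```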

## Next steps  
I will try to get a spaCy model running on less data so that training can complete more passes and perhaps improve. That said, I think a smaller sample will reduce accuracy too, so it may very well lead absolutely nowhere. At least we tried, right?

## Small sample and small model  
Trying to get better results with a smaller sample that lets the model complete more training passes. I still think that 200 posts is not enough, so I will take a random balanced sample of 600 posts from the big dataset and try it with that.

In [None]:
#creating a smaller dataframe from the one made at the beginning with all the edits already in it
#300 posts per class -> 600 posts in total; groupby().sample() keeps the 'target' column in every pandas version
small_df = df.groupby('target').sample(n=300, random_state=42)

In [None]:
# splitting to train and test set
small_data = {}
small_data['train'] = small_df.sample(frac=0.8, random_state=42)
small_data['test'] = small_df.drop(small_data['train'].index, inplace=False)

In [None]:
#parsing the documents we already have as examples for the new train and test set
#using the nlp2 created previously - the smaller language model
from tqdm import tqdm
small_docs = {}
small_docs['train'] = list(tqdm(nlp2.pipe(small_data['train']['selftext']), total=len(small_data['train'])))
small_docs['test'] = list(tqdm(nlp2.pipe(small_data['test']['selftext']), total=len(small_data['test'])))
small_docs

In [None]:
#serializing to binary format again, just like the last two times
from spacy.training import Example
from spacy.tokens import DocBin
bin = {"train": DocBin(), "test": DocBin()}
with nlp2.select_pipes(enable="tok2vec"):
  for i in range(len(small_docs['train'])):
    example = Example.from_dict(small_docs['train'][i], {'cats': small_data['train'].iloc[i]['cats']})
    bin['train'].add(example.reference)
  for i in range(len(small_docs['test'])):
    example = Example.from_dict(small_docs['test'][i], {'cats': small_data['test'].iloc[i]['cats']})
    bin['test'].add(example.reference)

bin['train'].to_disk("train_small.spacy")
bin['test'].to_disk("test_small.spacy")

In [None]:
#config 
config_path = "./base_config_ensamble.cfg"
!python -m spacy init fill-config $config_path config.cfg

In [None]:
#trying the model on smaller data, see how it goes
!python -m spacy train config.cfg --output ./output_small --paths.train ./train_small.spacy --paths.dev ./test_small.spacy