##### Trying to build a model with DistilBERT

I first tried to build a classification model with BERT, but my computer isn't powerful enough to train it. This version uses DistilBERT instead, which is smaller and hopefully more practical.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import re
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string

In [2]:
# pip install torch

In [3]:
# pip install transformers

In [4]:
df1=pd.read_csv('../data/sample_posts_manual_coding_2.csv')

In [5]:
import numpy as np

In [6]:
posts_analyzed = 600

# Work on an explicit copy so the assignments below modify df itself
# rather than a slice of df1 (this avoids pandas' SettingWithCopyWarning).
df = df1.head(posts_analyzed).copy()

# Remove the marker of where things were left off.
# A single .loc call replaces the chained assignment df.highly_relevant[598] = np.nan.
df.loc[598, 'highly_relevant'] = np.nan

df['highly_relevant'] = df['highly_relevant'].fillna(0)

In [14]:
df['labels'] = df['highly_relevant'].astype(int)



In [17]:
df['combined_text'] = df['title'] + ' ' + df['selftext']

# Build the NLTK resources once instead of on every call.
# stop_words is only needed if the stopword filter below is re-enabled.
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Punctuation removal (ASCII punctuation only, see string.punctuation)
def remove_punctuation(text):
    return ''.join(char for char in text if char not in string.punctuation)

# Text preprocessing function
def preprocess_text(text):
    # Tokenization: strip punctuation, then split on whitespace
    tokens = remove_punctuation(text).split()

    # Lowercase (stopword removal is currently disabled)
    tokens = [word.lower() for word in tokens]  # if word.lower() not in stop_words]

    # Lemmatization
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return ' '.join(tokens)

# Apply preprocessing to the combined text column
df['text'] = df['combined_text'].apply(preprocess_text)
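One side effect of deleting punctuation outright is that words joined by a slash or dash get glued together, which is why "partner/FP" shows up as "partnerfp" in the processed text. A minimal sketch of an alternative, assuming we would rather keep the word boundary: map each punctuation character to a space instead. The function name `replace_punctuation_with_space` is hypothetical and is not used anywhere else in this notebook.

```python
import string

# Translation table that maps every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})

def replace_punctuation_with_space(text: str) -> str:
    """Hypothetical variant of remove_punctuation that keeps word boundaries."""
    # Replace punctuation with spaces, then collapse runs of whitespace.
    return " ".join(text.translate(_PUNCT_TO_SPACE).split())

print(replace_punctuation_with_space("Had a fight with partner/FP last night..."))
```

The trade-off is that contractions get split too ("I'm" becomes "I m"), so this is not obviously better. Either way, typographic apostrophes such as the one in "don’t" are untouched, because `string.punctuation` only covers ASCII characters.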


In [55]:
df.tail()

| Unnamed: 0 | title | selftext | created_utc | over_18 | subreddit | date_created | self | is_relevant | highly_relevant | combined_text | processed_text | labels | text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 595 | I'm depressed and got intense fear of abandonm... | last few days i'm just sad and extremely anxio... | 1643579271 | False | BPD | 2022-01-30 21:47:51 | 1 | 1 | 1 | I'm depressed and got intense fear of abandonm... | im depressed and got intense fear of abandonme... | 1 | im depressed and got intense fear of abandonme... |
| 596 | My BF Always Criticizes(?) Me When I Let Him K... | so i'm going to do my best to explain this? it... | 1643578689 | False | BPD | 2022-01-30 21:38:09 | 1 | 0 | 0 | My BF Always Criticizes(?) Me When I Let Him K... | my bf always criticizes me when i let him know... | 0 | my bf always criticizes me when i let him know... |
| 597 | Husband unintentionally stopped taking his med... | my husband has bpd, which, until recently, has... | 1643577150 | False | BPD | 2022-01-30 21:12:30 | 0 | 0 | 0 | Husband unintentionally stopped taking his med... | husband unintentionally stopped taking his med... | 0 | husband unintentionally stopped taking his med... |
| 598 | Splitting | i don’t even know if i’m borderline, i just kn... | 1643575704 | False | BPD | 2022-01-30 20:48:24 | 1 | 0 | 0 | Splitting i don’t even know if i’m borderline,... | splitting i don’t even know if i’m borderline ... | 0 | splitting i don’t even know if i’m borderline ... |
| 599 | Had a fight with partner/FP last night... Stil... | so yeah we can an argument last night related ... | 1643573845 | False | BPD | 2022-01-30 20:17:25 | 1 | 0 | 0 | Had a fight with partner/FP last night... Stil... | had a fight with partnerfp last night still an... | 0 | had a fight with partnerfp last night still an... |


In [19]:
train_df, test_df = train_test_split(df, test_size=0.2, stratify = df['labels'])

In [11]:
#pip install datasets

In [20]:
from datasets import Dataset
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Convert datasets to tokenized format
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

def tokenize_data(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_train = train_dataset.map(tokenize_data, batched=True)
tokenized_test = test_dataset.map(tokenize_data, batched=True)

Map: 100%|██████████| 480/480 [00:00<00:00, 907.02 examples/s]
Map: 100%|██████████| 120/120 [00:00<00:00, 860.88 examples/s]


In [21]:
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments, DataCollatorWithPadding

# Load pre-trained DistilBERT model for sequence classification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

# Prepare data collator for padding sequences
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    logging_strategy="epoch"
)

# Define Trainer object for training the model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Train the model
trainer.train()

# Save the trained model
trainer.save_model('model')

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
                                       
{'loss': 0.2025, 'grad_norm': 2.57312273979187, 'learning_rate': 0.00016, 'epoch': 1.0}
{'eval_loss': 0.17083123326301575, 'eval_runtime': 66.982, 'eval_samples_per_second': 1.792, 'eval_steps_per_second': 0.224, 'epoch': 1.0}
{'loss': 0.1677, 'grad_norm': 2.6016523838043213, 'learning_rate': 0.00012, 'epoch': 2.0}
{'eval_loss': 0.16156624257564545, 'eval_runtime': 62.0109, 'eval_samples_per_second': 1.935, 'eval_steps_per_second': 0.242, 'epoch': 2.0}
{'loss': 0.1637, 'grad_norm': 0.15527881681919098, 'learning_rate': 8e-05, 'epoch': 3.0}
{'eval_loss': 0.1697501391172409, 'eval_runtime': 48.9892, 'eval_samples_per_second': 2.45, 'eval_steps_per_second': 0.306, 'epoch': 3.0}
{'loss': 0.1653, 'grad_norm': 0.299267441034317, 'learning_rate': 4e-05, 'epoch': 4.0}
{'eval_loss': 0.1586875170469284, 'eval_runtime': 46.1768, 'eval_samples_per_second': 2.599, 'eval_steps_per_second': 0.325, 'epoch': 4.0}
{'loss': 0.17, 'grad_norm': 0.25765475630760193, 'learning_rate': 0.0, 'epoch': 5.0}
100%|██████████| 300/300 [1:17:19<00:00, 15.47s/it]
{'eval_loss': 0.15938633680343628, 'eval_runtime': 44.9434, 'eval_samples_per_second': 2.67, 'eval_steps_per_second': 0.334, 'epoch': 5.0}
{'train_runtime': 4639.6637, 'train_samples_per_second': 0.517, 'train_steps_per_second': 0.065, 'train_loss': 0.17384321848551432, 'epoch': 5.0}
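The `learning_rate` values in the log (0.00016, 0.00012, 8e-05, 4e-05, 0.0) are what a linear decay to zero with no warmup would give. A small sketch that reproduces them, assuming 480 training rows, batch size 8, a single device, and no gradient accumulation, so 60 optimizer steps per epoch and 300 in total (which matches the progress bar). The helper `linear_decay_lr` is written here for illustration and is not part of `transformers`.

```python
# Assumed from the TrainingArguments above and the 300-step progress bar.
BASE_LR = 2e-4
STEPS_PER_EPOCH = 480 // 8          # 480 training rows, batch size 8
TOTAL_STEPS = STEPS_PER_EPOCH * 5   # 5 epochs

def linear_decay_lr(step: int, base_lr: float = BASE_LR,
                    total_steps: int = TOTAL_STEPS) -> float:
    """Learning rate after `step` optimizer steps under linear decay to zero."""
    return base_lr * max(0.0, 1.0 - step / total_steps)

# Learning rate at the end of each epoch, where the Trainer logged it.
for epoch in range(1, 6):
    step = epoch * STEPS_PER_EPOCH
    print(epoch, f"{linear_decay_lr(step):.5f}")
```

So by the last epoch the model is barely being updated at all, which is worth keeping in mind when reading the nearly flat loss values.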


In [22]:
#pip install accelerate -U

In [24]:
trainer.evaluate()

100%|██████████| 15/15 [00:47<00:00,  3.16s/it]


{'eval_loss': 0.15938633680343628,
 'eval_runtime': 50.1621,
 'eval_samples_per_second': 2.392,
 'eval_steps_per_second': 0.299,
 'epoch': 5.0}
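`trainer.evaluate()` only reports the loss here, which makes it hard to compare against the keyword model. A minimal sketch of a metrics function that could be passed to the `Trainer` through its `compute_metrics` argument; it assumes label 1 means "highly relevant" and uses made-up toy logits for the check at the bottom, not results from this model.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    logits, labels = eval_pred  # the Trainer passes (predictions, label_ids)
    preds = np.argmax(logits, axis=-1)
    labels = np.asarray(labels)

    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": float(np.mean(preds == labels)),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy check with invented logits and labels (illustration only).
toy_logits = np.array([[2.0, -1.0], [0.1, 0.9], [1.5, 0.2], [-0.3, 0.4]])
toy_labels = np.array([0, 1, 1, 1])
toy_metrics = compute_metrics((toy_logits, toy_labels))
print(toy_metrics)
```

With `compute_metrics=compute_metrics` added to the `Trainer(...)` call, these values should appear next to `eval_loss` after each evaluation. If highly relevant posts are rare, precision, recall and F1 are more informative than accuracy or loss alone.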

In [25]:
pretrained_model = AutoModelForSequenceClassification.from_pretrained('model')

In [126]:
text_relevant = "I have found dbt to be very helpful, as has lamictal. Taking medication is crucial for me."
text_irrelevant = "My fp went away and now I am sad. What should I do? Here is a random mention of therapy to throw the model off"
text_random = "The quick brown fox jumped over the lazy dog"



encoding = tokenizer(text_irrelevant, return_tensors="pt")
encoding = {k: v.to(trainer.model.device) for k,v in encoding.items()}

outputs = trainer.model(**encoding)

In [129]:
print(outputs)

SequenceClassifierOutput(loss=None, logits=tensor([[ 2.1355, -2.2658]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)


In [127]:
logits = outputs.logits
logits.shape

torch.Size([1, 2])

In [68]:
import torch

In [128]:
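# Note: this applies an independent sigmoid to each of the two class logits
# and thresholds at 0.1 (multi-label style), so `predictions` is a per-class
# indicator vector rather than a single predicted label.
sigmoid = torch.nn.Sigmoid()
probs = sigmoid(logits.squeeze().cpu())
predictions = np.zeros(probs.shape)
predictions[np.where(probs >= 0.1)] = 1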

In [130]:
print(predictions)

[1. 0.]
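Since the model was loaded with `num_labels=2` as an ordinary single-label classifier, the more usual way to read the two logits is a softmax across classes followed by an argmax. A plain-Python sketch using the logits from the `SequenceClassifierOutput` printed for `text_irrelevant`; in the notebook itself the equivalent would be `torch.softmax(logits, dim=-1)` and `logits.argmax(dim=-1)`. The `softmax` helper is written here for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    shift = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the printed SequenceClassifierOutput for text_irrelevant.
example_logits = [2.1355, -2.2658]
probs = softmax(example_logits)

# Index of the largest probability, i.e. the predicted class.
predicted_label = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_label)
```

Read this way, the model assigns almost all of the probability to class 0 (not highly relevant) for this text, which agrees with the `[1. 0.]` indicator vector above.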


Strangely enough, this model performs worse than the keyword model. I believe the reason is that we are looking for something very specific (posts in which the author discusses their history of treatment for BPD), and the training set (480 posts) was simply too small for this purpose. As a result, the keyword approach did a bit better.