## 4 Build and Train the Classifier 3.0 (Bert) ❌

In this notebook, I attempted another approach to troubleshoot imbalance data problem. I utilized a pre-trained language model [BERT](https://huggingface.co/docs/transformers/en/model_doc/bert) to leverage its broader understanding. However, the predicted results still remained skewed due to imbalanced data. 

### **1) Collection 1**
- **Training Data - combined_bc_tagged.csv** - This file contains all the manually tagged data from before and after the overturn, covering 5 apps focused on birth control.
- **Unlabeled Data - combined_bc_unlabeled.csv** - This file contains all the unlabeled data from before and after the overturn, covering 5 Birth-Control-Oriented Apps (Collection 1).


### **2) Collection 2**
- **Training Data - combined_pt_unlabeled.csv** - This file includes all the unlabeled data from before and after the overturn, covering 2 Period-and-Fertility-Tracking Apps (Collection 2).

- **Unlabeled Data - combined_pt_tagged.csv** - This file contains all the manually tagged data from before and after the overturn, covering 2 Period-and-Fertility-Tracking Apps (Collection 2).


## Collection 1 - Birth-Control-Oriented Apps (x5)

In [1]:
#!pip install transformers pandas scikit-learn torch

### Step 1: Load the necessary libraries and data

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer
import torch
from transformers import BertForSequenceClassification, Trainer, TrainingArguments
from sklearn.preprocessing import MultiLabelBinarizer
from torch.utils.data import DataLoader, Dataset
from sklearn.metrics import f1_score

# Load the training and unlabeled data
train_data = pd.read_csv('combined_bc_tagged.csv')
unlabeled_data = pd.read_csv('combined_bc_unlabeled.csv')


### Step 2: Split the training data into training and validation sets

In [3]:
# Split the training data into training and validation sets
train_df, val_df = train_test_split(train_data, test_size=0.2, random_state=42)

### Step 3: Initialize the tokenizer

In [4]:
# Initialize the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

### Step 4: Tokenize the reviews

In [5]:
def tokenize_data(data, max_length=256):
    return tokenizer(
        data['review'].tolist(),
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors='pt'
    )

train_tokens = tokenize_data(train_df)
val_tokens = tokenize_data(val_df)
unlabeled_tokens = tokenize_data(unlabeled_data)


### Step 5: Define category definitions

In [6]:
category_definitions = {
    'l1_inaccurate_cycle_prediction': 'The cycle prediction algorithm of the app is inaccurate, sometimes leading to unplanned pregnancies.',
    'l2_delayed_customer_service': 'Difficulty in contacting customer service and long wait times, which oftentimes result in late or inaccurate deliveries of prescriptions and medications.',
    'l3_poor_prescription_management': 'Users experience issues such as missing or incorrect prescriptions, incorrect birth control medications, inaccurate refill frequencies, late deliveries, and canceled medications.',
    'l4_problematic_billing_practices': 'Users encounter unexpected charges including but not limited to auto-renewals without notification, and charges on old credit cards without refunds, or they fail to use the current insurance plan for insurance billing.'
}


### Step 6: Prepare labels for training and validation

In [7]:
mlb = MultiLabelBinarizer(classes=list(category_definitions.keys()))
train_labels = mlb.fit_transform(train_df[list(category_definitions.keys())].values)
val_labels = mlb.fit_transform(val_df[list(category_definitions.keys())].values)




### Step 7: Print the classes to ensure they match the keys in category_definitions

In [8]:
print("mlb.classes_:", mlb.classes_)
for cls in mlb.classes_:
    if cls not in category_definitions:
        raise KeyError(f"Definition for class {cls} not found in category_definitions.")


mlb.classes_: ['l1_inaccurate_cycle_prediction' 'l2_delayed_customer_service'
 'l3_poor_prescription_management' 'l4_problematic_billing_practices']


### Step 8: Create a combined input for BERT

In [9]:
def create_combined_input(reviews, definitions, labels):
    combined_inputs = []
    for review, label_set in zip(reviews, labels):
        combined_review = review
        for idx, label in enumerate(label_set):
            if label == 1:
                combined_review += ' ' + definitions[mlb.classes_[idx]]
        combined_inputs.append(combined_review)
    return combined_inputs

combined_train_inputs = create_combined_input(train_df['review'], category_definitions, train_labels)
combined_val_inputs = create_combined_input(val_df['review'], category_definitions, val_labels)


### Step 9: Tokenize the combined inputs

In [10]:
train_tokens = tokenizer(
    combined_train_inputs,
    padding=True,
    truncation=True,
    max_length=256,
    return_tensors='pt'
)
val_tokens = tokenizer(
    combined_val_inputs,
    padding=True,
    truncation=True,
    max_length=256,
    return_tensors='pt'
)


### Step 10: Create a dataset class

In [11]:
class ReviewDataset(Dataset):
    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        if self.labels is not None:
            item['labels'] = torch.tensor(self.labels[idx]).float()
        return item

    def __len__(self):
        return len(self.encodings['input_ids'])


In [None]:
#Step 11: Check if the lengths of inputs and targets match:

In [12]:
assert len(train_tokens['input_ids']) == len(train_labels), "Mismatch between input and target lengths"
assert len(val_tokens['input_ids']) == len(val_labels), "Mismatch between validation input and target lengths"

In [13]:
#Step 12: Prepare datasets

In [14]:
train_dataset = ReviewDataset(train_tokens, train_labels)
val_dataset = ReviewDataset(val_tokens, val_labels)

In [None]:
#Step 13: Create DataLoaders to batch the data

In [15]:
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=8, shuffle=False)

In [None]:
#Step 14: Set up training arguments

In [16]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy="epoch"
)




In [17]:
#Step 15: Create the model

In [18]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(mlb.classes_))


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
#Step 16: Create the trainer

In [21]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer
)


In [22]:
#Step 17: Train the model

In [23]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,0.795628
2,0.804700,0.743747
3,0.759800,0.637527


TrainOutput(global_step=27, training_loss=0.7588075002034506, metrics={'train_runtime': 187.7616, 'train_samples_per_second': 1.134, 'train_steps_per_second': 0.144, 'total_flos': 28021830580224.0, 'train_loss': 0.7588075002034506, 'epoch': 3.0})

In [None]:
#Step 18: Evaluate the model on the validation set

In [24]:
val_predictions = trainer.predict(val_dataset)
val_predicted_labels = torch.sigmoid(torch.tensor(val_predictions.predictions)).numpy()
val_predicted_labels = (val_predicted_labels > 0.5).astype(int)
f1 = f1_score(val_labels, val_predicted_labels, average='micro')
print(f"F1 Score: {f1}")

F1 Score: 0.0


In [None]:
#Step 19: Make predictions on the unlabeled data and save the predictions to a new CSV file

In [26]:
# Tokenize unlabeled data
unlabeled_combined_inputs = create_combined_input(unlabeled_data['review'], category_definitions, [[0]*len(category_definitions)]*len(unlabeled_data))
unlabeled_tokens = tokenizer(
    unlabeled_combined_inputs,
    padding=True,
    truncation=True,
    max_length=256,
    return_tensors='pt'
)

# Create dataset and dataloader for unlabeled data
unlabeled_dataset = ReviewDataset(unlabeled_tokens)
unlabeled_dataloader = DataLoader(unlabeled_dataset, batch_size=8, shuffle=False)

# Make predictions
unlabeled_predictions = trainer.predict(unlabeled_dataset)
unlabeled_predicted_labels = torch.sigmoid(torch.tensor(unlabeled_predictions.predictions)).numpy()
unlabeled_predicted_labels = (unlabeled_predicted_labels > 0.5).astype(int)

# Convert predicted labels to dataframe
unlabeled_pred_df = pd.DataFrame(unlabeled_predicted_labels, columns=mlb.classes_)

# Concatenate predictions with the original unlabeled data
unlabeled_data_with_predictions = pd.concat([unlabeled_data.reset_index(drop=True), unlabeled_pred_df], axis=1)

# Save predictions to a new CSV file
#unlabeled_data_with_predictions.to_csv('unlabeled_predictions.csv', index=False)


In [27]:
unlabeled_data_with_predictions

Unnamed: 0.1,Unnamed: 0,date,developerResponse,review,rating,isEdited,userName,title,app_name,app_id,l1_inaccurate_cycle_prediction,l2_delayed_customer_service,l3_poor_prescription_management,l4_problematic_billing_practices
0,16,2021-04-08 12:35:25,"{'id': 22156213, 'body': ""Hi Lynn, we are disa...",I first used Nurx a few years ago and it was a...,1,False,ALynnJ42,Used to be good,nurx-birth-control-delivered,1213141301,1,0,0,0
1,19,2021-01-05 06:10:19,,"I am not one to usually write reviews, but my ...",1,False,cp_2015,Avoid at all costs,nurx-birth-control-delivered,1213141301,1,0,0,1
2,33,2020-05-29 20:59:19,"{'id': 11157202, 'body': 'We understand your r...",First the bad. I thought with this app/service...,2,True,Jen316,Good and bad,nurx-birth-control-delivered,1213141301,1,0,0,0
3,38,2021-08-12 03:31:37,"{'id': 24534904, 'body': ""Hello Anna, thank yo...","If this worked well, I would love it. Unfortun...",1,False,anna.eliza,"Good idea, bad execution",nurx-birth-control-delivered,1213141301,1,0,0,0
4,39,2020-12-29 21:47:51,"{'id': 20112386, 'body': ""We are sorry to hear...","I hesitate to write negative reviews, but this...",1,False,Sacatu,Waste of time and $$,nurx-birth-control-delivered,1213141301,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
930,263,2024-02-01 20:20:43,"{'id': 41980499, 'body': 'Oh no! We sincerely ...","In miserable pain for a UTI, showing all sympt...",1,False,dessert enjoyer,Cancelled UTI prescription request,planned-parenthood-direct,1214393415,1,0,0,1
931,268,2023-12-07 15:27:51,,The app is not working! don't waste your time,1,False,paulineczka1212,The app is not working,planned-parenthood-direct,1214393415,1,0,1,1
932,278,2023-09-15 02:26:42,"{'id': 39037722, 'body': 'Hi there - we are so...",it’s been 6 days and i have yet to get my pack...,1,False,uraqtbaeee,"never received, nobody answered",planned-parenthood-direct,1214393415,1,0,0,1
933,279,2023-06-11 13:16:06,"{'id': 37257712, 'body': ""Oh no, that doesn't ...",dumb app,1,False,Destiny Amari Robinson,doesnt work,planned-parenthood-direct,1214393415,1,0,1,1
