# Multi-Label Text Classification: Modeling with Transformers

This notebook covers the data-preparation half of the modeling workflow for multi-label text classification with transformer models. It loads the processed data, performs a train/test split, tokenizes the text, and wraps the result in PyTorch datasets ready for training.


## 1. Load the Processed Data

Load the pre-processed dataset from disk using pandas.

In [11]:
# Load processed data
import pandas as pd

# Choose one: CSV or pickle (pickle preserves dtypes)
df = pd.read_pickle('../data/processed/tourism_reviews_processed.pkl')
# df = pd.read_csv('../data/processed/tourism_reviews_processed.csv')

df.head()

Unnamed: 0,Location_Name,Located_City,Location,Location_Type,User_ID,User_Location,User_Locale,User_Contributions,Travel_Date,Published_Date,...,clean_text,clean_title,Regenerative & Eco-Tourism_keyword_count,Integrated Wellness_keyword_count,Immersive Culinary_keyword_count,Off-the-Beaten-Path Adventure_keyword_count,label_regenerative_eco_tourism,label_integrated_wellness,label_immersive_culinary,label_off_the_beaten_path_adventure
0,Arugam Bay,Arugam Bay,"Arugam Bay, Eastern Province",Beaches,User 1,"Dunsborough, Australia",en_US,8,2019-07,2019-07-31T07:53:21-04:00,...,i had a manicure here and it really was profes...,best nail spa in arugam bay on the water!,1,2,0,0,1,1,0,0
1,Arugam Bay,Arugam Bay,"Arugam Bay, Eastern Province",Beaches,User 2,"Bendigo, Australia",en_US,4,2019-06,2019-07-21T21:50:11-04:00,...,"overall, it is a wonderful experience. we visi...",best for surfing,0,0,1,0,0,0,1,0
2,Arugam Bay,Arugam Bay,"Arugam Bay, Eastern Province",Beaches,User 3,"Melbourne, Australia",en_US,13,2019-07,2019-07-15T18:52:55-04:00,...,"great place to chill, swim, surf, eat, shop, h...",we love arugam bay,1,0,0,0,1,0,0,0
3,Arugam Bay,Arugam Bay,"Arugam Bay, Eastern Province",Beaches,User 4,"Ericeira, Portugal",en_US,4,2019-06,2019-07-03T10:32:41-04:00,...,good place for surf and a few stores to going ...,sun and waves.,0,0,0,0,0,0,0,0
4,Arugam Bay,Arugam Bay,"Arugam Bay, Eastern Province",Beaches,User 5,"Pistoia, Italy",en_US,14,2019-07,2019-07-02T17:07:02-04:00,...,this place is great for surfing but even if yo...,"great swimming, surfing, great fish aznd frien...",1,0,1,0,1,0,1,0


## 2. Perform Train/Test Split

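Split the loaded data into training (80%) and testing (20%) sets using scikit-learn's `train_test_split`. Because `Y` is a 2-D multi-hot array, `stratify=Y` stratifies on whole label combinations rather than on individual labels, so every combination present in the data must occur at least twice.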

In [12]:
# Train/test split
from sklearn.model_selection import train_test_split

# Define text and label columns
text_col = 'clean_text'
label_cols = [col for col in df.columns if col.startswith('label_')]

X = df[text_col].values
Y = df[label_cols].values

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42, stratify=Y
)

print(f'Train size: {len(X_train)}, Test size: {len(X_test)}')

Train size: 12924, Test size: 3232
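
A quick sanity check after a stratified split is to compare how often each label is set in the two partitions. The following is a minimal sketch in plain Python; the helper names `label_frequencies` and `combination_counts` and the toy label rows are illustrative assumptions, not part of the notebook.

```python
from collections import Counter


def label_frequencies(rows):
    """Return, for each label column, the fraction of rows where it is set."""
    rows = [list(r) for r in rows]
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def combination_counts(rows):
    """Count how often each distinct multi-hot label combination occurs."""
    return Counter(tuple(int(v) for v in r) for r in rows)


# Toy multi-hot labels standing in for Y_train / Y_test (hypothetical values).
toy_train = [[1, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
toy_test = [[1, 0, 0, 0], [0, 0, 1, 0]]

train_freq = label_frequencies(toy_train)
test_freq = label_frequencies(toy_test)
for i, (tr, te) in enumerate(zip(train_freq, test_freq)):
    print(f"label {i}: train={tr:.2f} test={te:.2f}")

# Rare combinations are the ones most likely to break stratification.
print(combination_counts(toy_train).most_common())
```

Both helpers only iterate over rows, so they should also accept the NumPy arrays `Y_train` and `Y_test` directly. Large gaps between the train and test frequencies would suggest the split did not preserve the label distribution.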


## 3. Tokenize for Transformers

Tokenize the text with Hugging Face's `BertTokenizer` (`bert-base-uncased`). Every review is padded or truncated to 128 tokens, so reviews longer than that lose their tail.

In [13]:
# Tokenization for transformers
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Pad or truncate each text to max_length tokens and return PyTorch tensors
def tokenize_texts(texts, tokenizer, max_length=128):
    return tokenizer(
        list(texts),
        padding='max_length',
        truncation=True,
        max_length=max_length,
        return_tensors='pt'
    )

train_encodings = tokenize_texts(X_train, tokenizer)
test_encodings = tokenize_texts(X_test, tokenizer)

print('Tokenization complete.')
print(f"Train encodings: {train_encodings['input_ids'].shape}")
print(f"Test encodings: {test_encodings['input_ids'].shape}")

Tokenization complete.
Train encodings: torch.Size([12924, 128])
Test encodings: torch.Size([3232, 128])
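
Since `max_length=128` silently cuts longer reviews, it can be worth measuring how many reviews are affected before settling on that value. The sketch below summarizes a list of token counts; the function name `truncation_report` and the toy lengths are assumptions made for illustration.

```python
import math


def truncation_report(lengths, max_length=128):
    """Summarize how many sequences exceed max_length and would be truncated."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths)
    n = len(ordered)
    truncated = sum(1 for length in ordered if length > max_length)
    # Nearest-rank 95th percentile: smallest length covering at least 95% of rows.
    rank = max(1, math.ceil(0.95 * n))
    return {
        "n": n,
        "truncated_fraction": truncated / n,
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }


# Hypothetical token counts per review (special tokens included).
toy_lengths = [12, 40, 87, 128, 129, 300, 55, 64, 20, 510]
report = truncation_report(toy_lengths)
print(report)
```

In the notebook, the real lengths could be gathered with something like `[len(ids) for ids in tokenizer(list(X_train), truncation=False)['input_ids']]`, which counts the `[CLS]` and `[SEP]` tokens as well. If the truncated fraction is high, a larger `max_length` (BERT supports up to 512) may be worth the extra memory.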


## 4. Prepare Data for Modeling

Wrap the tokenized inputs and the multi-hot labels in a PyTorch `Dataset` so that each index returns one example in the dictionary format transformer models expect.

In [14]:
# Prepare PyTorch Dataset for modeling
import torch
from torch.utils.data import Dataset, DataLoader

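class ReviewsDataset(Dataset):
    """Pairs tokenizer encodings with multi-hot label vectors."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        # Float labels: one 0.0/1.0 entry per label column
        self.labels = torch.tensor(labels, dtype=torch.float32)

    def __getitem__(self, idx):
        # One example: every tokenizer output field at idx, plus its labels
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)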

train_dataset = ReviewsDataset(train_encodings, Y_train)
test_dataset = ReviewsDataset(test_encodings, Y_test)

print(f'Train dataset size: {len(train_dataset)}')
print(f'Test dataset size: {len(test_dataset)}')

Train dataset size: 12924
Test dataset size: 3232
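
The labels are stored as `float32`, presumably because multi-label models are usually trained with a binary cross-entropy loss that scores each label independently and expects float targets (PyTorch's `BCEWithLogitsLoss` is the usual choice). The notebook does not define a loss, so the following is only a plain-Python sketch of that computation for a single example; the function name and the toy values are assumptions.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits."""
    if len(logits) != len(targets):
        raise ValueError("logits and targets must have the same length")
    total = 0.0
    for x, y in zip(logits, targets):
        # Numerically stable form of
        # -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


# Hypothetical model outputs for one review with four labels.
toy_logits = [2.0, -1.0, 0.0, -3.0]
toy_targets = [1.0, 1.0, 0.0, 0.0]  # multi-hot, stored as floats

loss = bce_with_logits(toy_logits, toy_targets)
print(f"loss = {loss:.4f}")
```

Each label contributes its own term, which is why a review can carry several labels at once, or none, without the loss forcing them to compete as a softmax would.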


In [15]:
# Save the encodings and labels rather than the Dataset objects, so a later
# notebook can rebuild the datasets without re-tokenizing
import os
os.makedirs('../data/processed', exist_ok=True)
torch.save({'encodings': train_encodings, 'labels': Y_train}, '../data/processed/train_dataset.pt')
torch.save({'encodings': test_encodings, 'labels': Y_test}, '../data/processed/test_dataset.pt')
print('Saved train and test encodings/labels for later use.')

Saved train and test encodings/labels for later use.
