# **BUSI/COMP488-001 Data Science in the Business World.**
### **Multi-Label Classification of Luxury Brand Perceptions on TikTok.**
### **Team Name:** Team D.
### **Team Members:** Carley Wiley, Eldar Utiushev, Bek Tukhtasinov, Mira Mohan, Aryonna Rice, and Tammy Duong.

This notebook explores a multi-label classification problem for luxury fashion brands based on TikTok comments and captions.

In [1]:
%%capture
%pip install datasets transformers pandas matplotlib tqdm --upgrade --quiet

In [2]:
# Automatically loads changes in other files in this project
%load_ext autoreload
%autoreload 2

### 1. Importing Necessary Libraries.

First, we need to import various Python libraries that will help us manipulate data, perform computations, and model our classifier.

In [3]:
import torch
from torch.utils.data import DataLoader
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split


In [4]:
# Change to project root
%cd ..
%pwd

/Users/aryonnarice/488FinalProject




'/Users/aryonnarice/488FinalProject'

### 2. Data Loading and Exploration.

Next, we will load the dataset containing TikTok comments and captions. This data will be used to train our model to classify brand perceptions.

In [5]:
# Load data
df = pd.read_csv('data/validated_labeled_data_cleaned.csv')

In [6]:
# View first few rows
df.head()

Unnamed: 0,Text,brand_label,emotion_label
0,"opium founding father,",['reputation & heritage'],['horrible']
1,"yall trippin fit clean,",['product quality'],['love']
2,"might destroy lonely,",['reputation & heritage'],['horrible']
3,"alr show us women,",['reputation & heritage'],['neutral']
4,"bad think jeans ripped pull bad,",['product quality'],['bad']


#### 2.1. Fixing Data Format.

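The `brand_label` and `emotion_label` values in `df` are stored as strings that only look like lists (for example, `"['product quality']"`), not as actual lists of strings. The model pipeline needs real lists, so we fix the formatting here.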


In [7]:
# Drop rows with any NaN values
df = df.dropna(subset=['brand_label', 'Text', 'emotion_label'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Text           500 non-null    object
 1   brand_label    500 non-null    object
 2   emotion_label  500 non-null    object
dtypes: object(3)
memory usage: 11.8+ KB


In [8]:
# Function that converts the values in the brand_label and emotion_label columns to lists of strings
def convert_to_list_of_strings(value):
    # Ensure that the input is actually a string
    if isinstance(value, str):
        # Remove unwanted characters and split
        value = value.strip("[]").replace("'", "").split(", ")
    return value

# Convert Text to str type
df['Text'] = df['Text'].astype(str)
df['Text'] = df['Text'].str.rstrip(",")
df.rename(columns={'Text': 'text'}, inplace=True)

# Convert brand_label and emotion_label to list of strings
df['brand_label'] = df['brand_label'].apply(convert_to_list_of_strings)
df['emotion_label'] = df['emotion_label'].apply(convert_to_list_of_strings)

# Check the types to verify
print(df.dtypes)

text             object
brand_label      object
emotion_label    object
dtype: object
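
As a point of comparison, the same conversion can be sketched with `ast.literal_eval` from the standard library, which parses the stringified list instead of stripping characters by hand. This is only an alternative sketch, not what the notebook ran; the function name `parse_label_list` is hypothetical.

```python
import ast

def parse_label_list(value):
    """Parse a stringified list such as "['product quality']" into a list of strings.

    Hypothetical helper, shown as an alternative to the strip/replace/split approach.
    """
    if isinstance(value, str):
        parsed = ast.literal_eval(value)
        # Wrap a lone literal so the result is always a list
        if not isinstance(parsed, (list, tuple)):
            parsed = [parsed]
        return [str(item) for item in parsed]
    return value

print(parse_label_list("['reputation & heritage']"))
print(parse_label_list("['product quality','customer service']"))
```

Unlike splitting on `", "`, this does not depend on the exact spacing after commas, and it keeps apostrophes inside a label intact.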


#### 2.2. Summary of the Data.


In [None]:
# Basic exploration
df.info()
df.describe()

### 3. Data Preprocessing. 

The data cleaning and preprocessing step is essential for ensuring the quality and consistency of our analysis. We handle missing values and outliers, and we ensure that the data types are correct for each column. This step sets the stage for accurate and reliable insights. We also remove URLs and special characters, convert text to lowercase, and remove stopwords.

In [9]:
# Define a mapping for the brand perception labels and emotion labels
brand_perception_labels_map_to_label = {
        0: 'product quality',
        1: 'reputation & heritage',
        2: 'customer service',
        3: 'social impact',
        4: 'ethical practices',
        5: 'sustainability'
    }

emotion_labels_map_to_emotion = {0: "admiration",
    1: "amusement",
    2: "anger",
    3: "annoyance",
    4: "approval",
    5: "caring",
    6: "confusion",
    7: "curiosity",
    8: "desire",
    9: "disappointment",
    10: "disapproval",
    11: "disgust",
    12: "embarrassment",
    13: "excitement",
    14: "fear",
    15: "gratitude",
    16: "grief",
    17: "joy",
    18: "love",
    19: "nervousness",
    20: "optimism",
    21: "pride",
    22: "realization",
    23: "relief",
    24: "remorse",
    25: "sadness",
    26: "surprise",
    27: "neutral"}

brand_perception_labels_map_to_index = {
        'product quality': 0,
        'reputation & heritage': 1,
        'customer service': 2,
        'social impact': 3,
        'ethical practices': 4,
        'sustainability': 5
    }

emotion_labels_map_to_index = {
    "admiration": 0,
    "amusement": 1,
    "anger": 2,
    "annoyance": 3,
    "approval": 4,
    "caring": 5,
    "confusion": 6,
    "curiosity": 7,
    "desire": 8,
    "disappointment": 9,
    "disapproval": 10,
    "disgust": 11,
    "embarrassment": 12,
    "excitement": 13,
    "fear": 14,
    "gratitude": 15,
    "grief": 16,
    "joy": 17,
    "love": 18,
    "nervousness": 19,
    "optimism": 20,
    "pride": 21,
    "realization": 22,
    "relief": 23,
    "remorse": 24,
    "sadness": 25,
    "surprise": 26,
    "neutral": 27
}
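
The index-to-label and label-to-index dictionaries above are inverses of each other. A minimal sketch of deriving one from the other, so the two cannot drift apart, could look like this (the variable names here are illustrative, not the notebook's):

```python
# Index -> label mapping for the six brand perception aspects (copied from above)
index_to_label = {
    0: 'product quality',
    1: 'reputation & heritage',
    2: 'customer service',
    3: 'social impact',
    4: 'ethical practices',
    5: 'sustainability',
}

# Derive the inverse mapping instead of typing it out a second time
label_to_index = {label: index for index, label in index_to_label.items()}

print(label_to_index['sustainability'])  # 5
```

The same comprehension would apply to the 28-entry emotion mapping.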

#### 3.1. Create Datasets.

In [10]:
# Turn the text column of df to a list of strings to be inserted into the model
texts = [item for item in df['text'] if isinstance(item, str) and item.strip() != '']

In [11]:
# Create a list of multi-hot encoded vectors for brand aspects (1: aspect found in text, else 0)
def hot_encode_brand_perception(row):
    result = np.zeros(6)
    for label in row['brand_label']:  # iterate through the list of labels in each row
        if label in brand_perception_labels_map_to_index:
            result[brand_perception_labels_map_to_index[label]] = 1
    return result

# Apply the function to each row
brand_labels = df.apply(hot_encode_brand_perception, axis=1).tolist()
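
To make the encoding concrete, here is a small pure-Python worked example of the same idea on two hand-written label lists. The helper name `multi_hot` and the label `'pricing'` are hypothetical and only serve the illustration.

```python
brand_to_index = {
    'product quality': 0,
    'reputation & heritage': 1,
    'customer service': 2,
    'social impact': 3,
    'ethical practices': 4,
    'sustainability': 5,
}

def multi_hot(labels, label_to_index):
    """Return a 0/1 vector with a 1 at the index of every known label."""
    vector = [0] * len(label_to_index)
    for label in labels:
        if label in label_to_index:
            vector[label_to_index[label]] = 1
    return vector

print(multi_hot(['product quality', 'sustainability'], brand_to_index))  # [1, 0, 0, 0, 0, 1]
print(multi_hot(['pricing'], brand_to_index))                            # [0, 0, 0, 0, 0, 0]
```

Note that a label missing from the mapping is skipped silently, so a row whose labels are all unknown becomes an all-zero vector.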

#### 3.2. Fixing Labeled Data.

Some of the labeled data uses emotions that are not among the 28 the pretrained model was trained to identify. These values need to be replaced with labels from the go_emotions dataset that closely match the meaning of the original word.

In [12]:
# Create a dictionary of all emotions that are in the df but that are NOT valid emotions (of the 28)
random_emotions = []
for emotion_list in df['emotion_label']:
    for emotion in emotion_list:
        if emotion not in emotion_labels_map_to_index:
            random_emotions.append(emotion)
random_emotion_dict = {}
for emotion in random_emotions:
    if emotion in random_emotion_dict:
        random_emotion_dict[emotion] += 1
    else:
        random_emotion_dict[emotion] = 1
print(random_emotion_dict)
    
                  

{'horrible': 30, 'bad': 29, 'hate': 36, 'excited': 75, 'worse': 18, 'disappointed': 98, 'great': 21, 'amazing': 19, 'impressed': 16, 'thrilled': 16, 'terrible': 16, 'amused': 3, 'curious': 1, 'worst': 1, 'good': 5, 'regret': 1, 'need': 1, 'trust': 1, 'inspired': 1, 'amazed': 2, 'confused': 1, 'happy': 1, 'better': 1}
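
The counting loop above can also be written with `collections.Counter`. The sketch below runs on a tiny made-up sample (`valid_emotions` and `sample_rows` are illustrative stand-ins, not the real data) to show the pattern.

```python
from collections import Counter

# Illustrative stand-ins for emotion_labels_map_to_index and df['emotion_label']
valid_emotions = {"joy", "neutral", "love"}
sample_rows = [["horrible"], ["joy"], ["bad", "horrible"], ["neutral"]]

# Count every label that is not one of the valid emotions
invalid_counts = Counter(
    emotion
    for emotion_list in sample_rows
    for emotion in emotion_list
    if emotion not in valid_emotions
)

print(dict(invalid_counts))  # {'horrible': 2, 'bad': 1}
```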


In [13]:
# Step 1: Create a mapping: invalid word -> close match from go_emotions dataset
incorrect_to_correct = {
    "horrible": ["disgust", "sadness"],
    "love": ["admiration", "joy"],
    "neutral": ["neutral"],
    "bad": ["annoyance", "disapproval"],
    "hate": ["anger", "disgust"],
    "excited": ["excitement"],
    "worse": ["disappointment"],
    "disappointed": ["disappointment"],
    "great": ["joy", "admiration"],
    "amazing": ["joy", "admiration"],
    "impressed": ["admiration"],
    "thrilled": ["joy", "excitement"],
    "terrible": ["disgust", "sadness"],
    "amused": ["amusement"],
    "curious": ["curiosity"],
    "worst": ["disgust", "sadness"],
    "good": ["approval", "joy"],
    "regret": ["remorse"],
    "need": ["desire"],
    "trust": ["admiration"],
    "inspired": ["admiration", "joy"],
    "amazed": ["surprise", "admiration"],
    "confused": ["confusion"],
    "happy": ["joy"],
    "better": ["approval", "optimism"]
}

# Step 2: Write a function to process the column
def map_emotions(emotion_labels):
    return [synonym for emotion in emotion_labels for synonym in incorrect_to_correct.get(emotion, [emotion])]

# Step 3: Apply the function to the DataFrame
df['emotion_label'] = df['emotion_label'].apply(map_emotions)
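
A short worked example of the remapping on hand-written inputs is sketched below, using three entries from the mapping above. The variant shown also removes duplicates, which the notebook's `map_emotions` does not do; the name `map_emotions_unique` is hypothetical.

```python
# A few entries copied from incorrect_to_correct, for illustration only
mapping = {
    "horrible": ["disgust", "sadness"],
    "great": ["joy", "admiration"],
    "amazing": ["joy", "admiration"],
}

def map_emotions_unique(emotion_labels):
    """Map each raw label to its go_emotions equivalents, keeping first-seen order without duplicates."""
    mapped = [target for label in emotion_labels for target in mapping.get(label, [label])]
    return list(dict.fromkeys(mapped))

print(map_emotions_unique(["horrible"]))          # ['disgust', 'sadness']
print(map_emotions_unique(["great", "amazing"]))  # ['joy', 'admiration']
print(map_emotions_unique(["neutral"]))           # ['neutral']
```

Duplicates are harmless for the multi-hot encoding, since setting the same index to 1 twice changes nothing, so the deduplication only matters if the label lists are inspected or counted directly.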


In [14]:
# Create a list of multi-hot encoded vectors for emotions (1: emotion found in text, else 0)
def hot_encode_emotions(row):
    result = np.zeros(28)
    for label in row['emotion_label']:  # iterate through the list of labels in each row
        if label in emotion_labels_map_to_index:
            result[emotion_labels_map_to_index[label]] = 1
    return result

# Apply the function to each row
emotion_labels = df.apply(hot_encode_emotions, axis=1).tolist()

In [None]:
# Examine a random entry in emotion_label column
print(df['emotion_label'][8])

In [15]:
# Sanity check: check if there are any invalid emotions in df... dictionary should be empty 
random_emotions = []
for emotion_list in df['emotion_label']:
    for emotion in emotion_list:
        if emotion not in emotion_labels_map_to_index:
            random_emotions.append(emotion)
random_emotion_dict = {}
for emotion in random_emotions:
    if emotion in random_emotion_dict:
        random_emotion_dict[emotion] += 1
    else:
        random_emotion_dict[emotion] = 1
print(random_emotion_dict)

{}


In [16]:
# Split into validation, test, and train splits

# Step 1: Split into train and temp (either test or validation)
texts_train, texts_temp, emotions_train, emotions_temp, brands_train, brands_temp = train_test_split(
    texts, emotion_labels, brand_labels, test_size=0.2, random_state=42)

# Step 2: Then, split the temp data into validation and test sets
texts_val, texts_test, emotions_val, emotions_test, brands_val, brands_test = train_test_split(
    texts_temp, emotions_temp, brands_temp, test_size=0.5, random_state=42)  # This splits the remaining 20% into two 10% segments
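
The arithmetic behind the 80/10/10 split can be checked by hand. The sketch below assumes all 500 rows reported by `df.info()` survive the text filter, and that a fractional `test_size` is rounded up to a whole number of samples; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test), rounding the test share up (assumed rounding rule)."""
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test

n_train, n_temp = split_sizes(500, 0.2)   # 80% train, 20% held out
n_val, n_test = split_sizes(n_temp, 0.5)  # held-out half to validation, half to test

batch_size = 16
train_batches = math.ceil(n_train / batch_size)
test_batches = math.ceil(n_test / batch_size)
last_test_batch = n_test - (test_batches - 1) * batch_size

print(n_train, n_val, n_test)                        # 400 50 50
print(train_batches, test_batches, last_test_batch)  # 25 4 2
```

These numbers line up with the rest of the notebook: `train_size` in the config is 25 batches, and the test run processes 4 batches, the last of which has 2 samples.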


The following code was originally used to create the datasets. It was run once, and the results are stored in the `datasetss` folder. It is kept here for context.

```python
from datasetss.brand_perception_dataset import BrandPerceptionDataset

train_dataset = BrandPerceptionDataset(texts_train, emotions_train, brands_train)
val_dataset = BrandPerceptionDataset(texts_val, emotions_val, brands_val)
test_dataset = BrandPerceptionDataset(texts_test, emotions_test, brands_test)

```


### 4. Training.

In [17]:
# Loading datasets 
import pickle
with open('datasetss/train_dataset.pkl', 'rb') as f:
    train_dataset = pickle.load(f)

with open('datasetss/val_dataset.pkl', 'rb') as f:
    val_dataset = pickle.load(f)

with open('datasetss/test_dataset.pkl', 'rb') as f:
    test_dataset = pickle.load(f)



In [18]:
# Creating dataloaders

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

#### 4.1. Initializing the Model.

In [19]:
from modules.BrandPerceptionModel import BrandPerceptionModel
config = {
    'model_name': 'SamLowe/roberta-base-go_emotions',
    'n_labels_bp': 6,
    'batch_size': 16,
    'lr': 1.5e-5,
    'warmup': 0.2, 
    'train_size': len(train_loader),
    'weight_decay': 0.001,
    'n_epochs': 10
}
print("Config:", config)

Config: {'model_name': 'SamLowe/roberta-base-go_emotions', 'n_labels_bp': 6, 'batch_size': 16, 'lr': 1.5e-05, 'warmup': 0.2, 'train_size': 25, 'weight_decay': 0.001, 'n_epochs': 10}


This is how the model was originally trained. The trained model has been saved so that this doesn't need to be run again.

```python
import pytorch_lightning as pl
trainer = pl.Trainer(max_epochs=config['n_epochs'], num_sanity_val_steps=5, accelerator='gpu')
# Validation took place here, using val_loader during fit:
trainer.fit(model, train_loader, val_loader)
trainer.save_checkpoint("models/brand_perception_model_checkpoint.ckpt")
```

In [21]:
# Load model
model = BrandPerceptionModel.load_from_checkpoint("models/brand_perception_model_checkpoint.ckpt", config=config)

/Users/aryonnarice/488FinalProject/final_proj_env/lib/python3.11/site-packages/pytorch_lightning/utilities/migration/utils.py:55: The loaded checkpoint was produced with Lightning v2.2.3, which is newer than your current Lightning version: v2.1.1
Some weights of RobertaModel were not initialized from the model checkpoint at SamLowe/roberta-base-go_emotions and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### 5. Evaluation.

In [22]:
# Test model
import pytorch_lightning as pl
trainer = pl.Trainer(max_epochs=config['n_epochs'], accelerator="gpu" if torch.cuda.is_available() else "cpu")
trainer.test(model, dataloaders=test_loader)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/Users/aryonnarice/488FinalProject/final_proj_env/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:67: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
Missing logger folder: /Users/aryonnarice/488FinalProject/lightning_logs
/Users/aryonnarice/488FinalProject/final_proj_env/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'test_dataloader' does not have many workers which may be a bottleneck. Co

Testing DataLoader 0:   0%|          | 0/4 [00:00<?, ?it/s]Emotion Logits Size: torch.Size([16, 28])
Brand Logits Size: torch.Size([16, 6])
Labels Emotion Size: torch.Size([16, 28])
Labels Brand Size: torch.Size([16, 6])
Testing DataLoader 0:  25%|██▌       | 1/4 [00:20<01:00,  0.05it/s]Emotion Logits Size: torch.Size([16, 28])
Brand Logits Size: torch.Size([16, 6])
Labels Emotion Size: torch.Size([16, 28])
Labels Brand Size: torch.Size([16, 6])
Testing DataLoader 0:  50%|█████     | 2/4 [00:49<00:49,  0.04it/s]Emotion Logits Size: torch.Size([16, 28])
Brand Logits Size: torch.Size([16, 6])
Labels Emotion Size: torch.Size([16, 28])
Labels Brand Size: torch.Size([16, 6])
Testing DataLoader 0:  75%|███████▌  | 3/4 [01:16<00:25,  0.04it/s]Emotion Logits Size: torch.Size([2, 28])
Brand Logits Size: torch.Size([2, 6])
Labels Emotion Size: torch.Size([2, 28])
Labels Brand Size: torch.Size([2, 6])
Testing DataLoader 0: 100%|██████████| 4/4 [01:19<00:00,  0.05it/s]

[{'test_loss_epoch': 0.8029792308807373,
  'test_accuracy_emotion': 0.9421428442001343,
  'test_f1_score_emotion': 0.04906122386455536,
  'test_accuracy_brand': 0.7866666913032532,
  'test_f1_score_brand': 0.31708115339279175}]

### 6. Predictions.

Now let's use data for a single luxury fashion brand, Amiri, to see how the model would behave in a real-world scenario.

In [23]:
# Load data (data for one specific brand: Amiri)
amiri_df = pd.read_csv('data/filtered_amiri_data.csv')

In [24]:
# Construct data set and loader
from datasetss.brand_perception_dataset import BrandPerceptionDataset
amiri_texts = [item for item in amiri_df['text'] if isinstance(item, str) and item.strip() != '']
amiri_dataset = BrandPerceptionDataset(amiri_texts)
amiri_loader = DataLoader(amiri_dataset, batch_size=4, num_workers=4)

This was the code used to predict the brand perception of Amiri. The results have been saved, so the code doesn't need to be run again.

```python
import torch
from torch.cuda.amp import autocast, GradScaler
import torch.utils.checkpoint as checkpoint

# Determine device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
scaler = GradScaler()

all_emotion_probs = []
all_brand_probs = []

# Move model to GPU
model.to(device)

def checkpointed_predict_step(batch):
    def forward_func(input_ids, attention_mask):
        return model(input_ids, attention_mask)
    return checkpoint.checkpoint(forward_func, batch['input_ids'], batch['attention_mask'])

for batch_idx, batch in enumerate(amiri_loader):
    batch = {
        "input_ids": batch['input_ids'].to(device),
        "attention_mask": batch['attention_mask'].to(device),
        "labels_emotion": batch['labels_emotion'].to(device),
        "labels_brand": batch['labels_brand'].to(device),
    }

    with autocast():
        # Use checkpointing to manage memory
        loss, emotion_probs, brand_probs = checkpointed_predict_step(batch)

    emotion_probs = emotion_probs.cpu()
    brand_probs = brand_probs.cpu()

    all_emotion_probs.append(emotion_probs)
    all_brand_probs.append(brand_probs)

    # Clear GPU memory
    torch.cuda.empty_cache()

    # Clear variables
    del batch, loss, emotion_probs, brand_probs

# Concatenate all probabilities for final results
all_emotion_probs = torch.cat(all_emotion_probs, dim=0)
all_brand_probs = torch.cat(all_brand_probs, dim=0)

# Display final results
print(f"Emotion probabilities shape: {all_emotion_probs.shape}")
print(f"Brand probabilities shape: {all_brand_probs.shape}")

# Print memory summary
print(torch.cuda.memory_summary(device=device, abbreviated=True))

```


In [25]:
# Load the results for Amiri
with open("results/probs.pkl", "rb") as f:
    all_emotion_probs, all_brand_probs = pickle.load(f)

In [26]:
# Convert logits to probabilities and aggregate them to create a summary of the brand's perception
# (despite their names, the loaded tensors hold raw logits)

# Apply sigmoid to convert logits to probabilities
# Note: run this cell only once after loading; re-running applies the sigmoid to all_emotion_probs again
all_emotion_probs = torch.sigmoid(all_emotion_probs)
all_brand_perception_probs = torch.sigmoid(all_brand_probs)

# Calculate the average probabilities for each emotion and brand aspect
avg_emotion_probs = all_emotion_probs.mean(dim=0)
avg_brand_perception_probs = all_brand_perception_probs.mean(dim=0)

print(f"Average Emotion Probabilities: {avg_emotion_probs}")
print(f"Average Brand Perception Probabilities: {avg_brand_perception_probs}")


Average Emotion Probabilities: tensor([0.2791, 0.0393, 0.0728, 0.0716, 0.0461, 0.0299, 0.0347, 0.0328, 0.0343,
        0.2739, 0.0605, 0.0908, 0.0317, 0.1637, 0.0311, 0.0374, 0.0274, 0.2805,
        0.0296, 0.0292, 0.0341, 0.0285, 0.0305, 0.0338, 0.0311, 0.0809, 0.0334,
        0.2271], dtype=torch.float16)
Average Brand Perception Probabilities: tensor([0.4209, 0.4966, 0.1868, 0.0593, 0.0525, 0.0461], dtype=torch.float16)
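
To illustrate what the sigmoid-then-average step computes, here is a pure-Python sketch on a made-up 2x2 matrix of logits (rows are texts, columns are labels). It stands in for `torch.sigmoid` followed by `mean(dim=0)`; the numbers are invented for the example.

```python
import math

def sigmoid(x):
    """Logistic function: maps a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Made-up logits: 2 texts (rows) x 2 labels (columns)
logits = [
    [0.0, -2.0],
    [2.0, -2.0],
]

# Element-wise sigmoid, then the mean of each column (one average per label)
probs = [[sigmoid(value) for value in row] for row in logits]
avg_probs = [sum(column) / len(column) for column in zip(*probs)]

print([round(p, 4) for p in avg_probs])  # [0.6904, 0.1192]
```

Because each label gets its own independent sigmoid, the averages do not need to sum to 1, which is what allows several labels to be likely at once in a multi-label setting.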


In [27]:
# Function to map dimensions of tensor to labels (index i of a tensor represents a certain emotion or brand aspect)
def map_to_labels(tensor, labels_map):
    labels = []
    for i, value in enumerate(tensor):
        label = labels_map.get(i, "Unknown")
        labels.append((label, value.item()))
    return labels

In [28]:
# Map indices to labels for brand perception tensor
amiri_brand_perception_labels = map_to_labels(avg_brand_perception_probs, brand_perception_labels_map_to_label)
print("Brand Perception:")
for label, value in amiri_brand_perception_labels:
    print(f"{label}: {value}")

# Map indices to labels for emotion tensor
amiri_emotion_labels = map_to_labels(avg_emotion_probs, emotion_labels_map_to_emotion)
print("\nEmotion:")
for label, value in amiri_emotion_labels:
    print(f"{label}: {value}")

Brand Perception:
product quality: 0.4208984375
reputation & heritage: 0.49658203125
customer service: 0.186767578125
social impact: 0.059295654296875
ethical practices: 0.052459716796875
sustainability: 0.046112060546875

Emotion:
admiration: 0.279052734375
amusement: 0.039337158203125
anger: 0.07275390625
annoyance: 0.07159423828125
approval: 0.04608154296875
caring: 0.0299224853515625
confusion: 0.03466796875
curiosity: 0.032806396484375
desire: 0.0343017578125
disappointment: 0.27392578125
disapproval: 0.060516357421875
disgust: 0.09075927734375
embarrassment: 0.03173828125
excitement: 0.1636962890625
fear: 0.031097412109375
gratitude: 0.037384033203125
grief: 0.027374267578125
joy: 0.280517578125
love: 0.0295562744140625
nervousness: 0.02923583984375
optimism: 0.034088134765625
pride: 0.0284881591796875
realization: 0.030487060546875
relief: 0.033843994140625
remorse: 0.0311279296875
sadness: 0.0809326171875
surprise: 0.033355712890625
neutral: 0.22705078125
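
The printed list is easier to read once it is ranked. A minimal sketch of sorting `(label, value)` pairs by probability follows; the pairs are a hand-copied subset of the emotion averages above, rounded to four decimals, and `rank_labels` is a hypothetical helper.

```python
# Subset of the (label, average probability) pairs printed above, rounded
amiri_emotion_sample = [
    ("admiration", 0.2791),
    ("disappointment", 0.2739),
    ("disgust", 0.0908),
    ("excitement", 0.1637),
    ("joy", 0.2805),
    ("neutral", 0.2271),
]

def rank_labels(label_values, top_k=3):
    """Return the top_k (label, value) pairs, highest value first."""
    return sorted(label_values, key=lambda pair: pair[1], reverse=True)[:top_k]

for label, value in rank_labels(amiri_emotion_sample):
    print(f"{label}: {value:.4f}")
# joy: 0.2805
# admiration: 0.2791
# disappointment: 0.2739
```

The same helper could be applied to `amiri_brand_perception_labels`, where reputation & heritage and product quality have the highest averages.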
