# **Deep Learning Spring 2024 - YouTube Title Classifier Using a Fine-Tuned BERT Model**

This project is part of the Technion's Deep Learning course.
It covers one of the two models that were fine-tuned and compared.

**Objective**

To classify YouTube trending videos into two categories by fine-tuning Google's pre-trained BERT model.

**Preparation Steps**
- Download the Google BERT model and tokenizer from HuggingFace.
- Download the dataset from Kaggle and preprocess it.
- Prepare the data for training and testing.
- Fine-tune the BERT model on the dataset.
- Evaluate the trained model on the test set.

**Hardware**

The fine-tuning process was performed on:
- NVIDIA GeForce RTX 3080 

**Source Material and Code Base**

- Google BERT model:     https://huggingface.co/google-bert/bert-base-uncased
- Data Set From Kaggle:  Youtube. (2024). YouTube Trending Video Dataset (updated daily) [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/8125862

In [2]:
# from google.colab import drive
# drive.mount('/content/drive')


Mounted at /content/drive


## Importing Packages
and checking whether a GPU is available

In [None]:
# !pip install peft

In [2]:
import os
import pandas as pd
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import BertTokenizer, BertForSequenceClassification, AdamW, get_linear_schedule_with_warmup
from sklearn.model_selection import train_test_split
import numpy as np
from tqdm import tqdm, trange

  from .autonotebook import tqdm as notebook_tqdm


In [3]:

if torch.cuda.is_available():
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")


There are 1 GPU(s) available.
We will use the GPU: NVIDIA GeForce RTX 3080


## Preparing The Data

- The original data contains several features per video.
- In this project we use only the `title` text as input to predict the label `categoryId`.

In [3]:
# Load the dataset into a pandas dataframe.
df = pd.read_csv("./Dataset/US_youtube_trending_data.csv")
df.head()

Unnamed: 0,video_id,title,publishedAt,channelId,channelTitle,categoryId,trending_date,tags,view_count,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,description
0,3C66w5Z0ixs,I ASKED HER TO BE MY GIRLFRIEND...,2020-08-11T19:20:14Z,UCvtRTOMP2TqYqu51xNrqAzg,Brawadis,22,2020-08-12T00:00:00Z,brawadis|prank|basketball|skits|ghost|funny vi...,1514614,156908,5855,35313,https://i.ytimg.com/vi/3C66w5Z0ixs/default.jpg,False,False,SUBSCRIBE to BRAWADIS ▶ http://bit.ly/Subscrib...
1,M9Pmf9AB4Mo,Apex Legends | Stories from the Outlands – “Th...,2020-08-11T17:00:10Z,UC0ZV6M2THA81QT9hrVWJG3A,Apex Legends,20,2020-08-12T00:00:00Z,Apex Legends|Apex Legends characters|new Apex ...,2381688,146739,2794,16549,https://i.ytimg.com/vi/M9Pmf9AB4Mo/default.jpg,False,False,"While running her own modding shop, Ramya Pare..."
2,J78aPJ3VyNs,I left youtube for a month and THIS is what ha...,2020-08-11T16:34:06Z,UCYzPXprvl5Y-Sf0g4vX-m6g,jacksepticeye,24,2020-08-12T00:00:00Z,jacksepticeye|funny|funny meme|memes|jacksepti...,2038853,353787,2628,40221,https://i.ytimg.com/vi/J78aPJ3VyNs/default.jpg,False,False,I left youtube for a month and this is what ha...
3,kXLn3HkpjaA,XXL 2020 Freshman Class Revealed - Official An...,2020-08-11T16:38:55Z,UCbg_UMjlHJg_19SZckaKajg,XXL,10,2020-08-12T00:00:00Z,xxl freshman|xxl freshmen|2020 xxl freshman|20...,496771,23251,1856,7647,https://i.ytimg.com/vi/kXLn3HkpjaA/default.jpg,False,False,Subscribe to XXL → http://bit.ly/subscribe-xxl...
4,VIUo6yapDbc,Ultimate DIY Home Movie Theater for The LaBran...,2020-08-11T15:10:05Z,UCDVPcEbVLQgLZX0Rt6jo34A,Mr. Kate,26,2020-08-12T00:00:00Z,The LaBrant Family|DIY|Interior Design|Makeove...,1123889,45802,964,2196,https://i.ytimg.com/vi/VIUo6yapDbc/default.jpg,False,False,Transforming The LaBrant Family's empty white ...


In [4]:
# Parsing the YouTube API CategoryID to a String
all_category_to_string = {
    1: "Film & Animation",
    2: "Autos & Vehicles",
    10: "Music",
    15: "Pets & Animals",
    17: "Sports",
    18: "Short Movies",
    19: "Travel & Events",
    20: "Gaming",
    21: "Videoblogging",
    22: "People & Blogs",
    23: "Comedy",
    24: "Entertainment",
    25: "News & Politics",
    26: "Howto & Style",
    27: "Education",
    28: "Science & Technology",
    29: "Nonprofits & Activism",
    30: "Movies",
    31: "Anime/Animation",
    32: "Action/Adventure",
    33: "Classics",
    34: "Comedy",
    35: "Documentary",
    36: "Drama",
    37: "Family",
    38: "Foreign",
    39: "Horror",
    40: "Sci-Fi/Fantasy",
    41: "Thriller",
    42: "Shorts",
    43: "Shows",
    44: "Trailers",
}
# The only categories that we classify
train_categories = {20: "Gaming", 24: "Entertainment"}


### Checking the Data

In [5]:
# Keep only the relevant columns: the title (feature) and categoryId (label)
data = df[["title", "categoryId"]]
data.head()

Unnamed: 0,title,categoryId
0,I ASKED HER TO BE MY GIRLFRIEND...,22
1,Apex Legends | Stories from the Outlands – “Th...,20
2,I left youtube for a month and THIS is what ha...,24
3,XXL 2020 Freshman Class Revealed - Official An...,10
4,Ultimate DIY Home Movie Theater for The LaBran...,26
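
The two-class subset can be built by filtering on `categoryId`. The sketch below is only an illustration: the toy DataFrame reuses (shortened) titles from the rows shown above, and the 0/1 assignment in `label_map` is an assumption, since the notebook does not show how the `tags` column of the pre-split files was produced.

```python
import pandas as pd

# Toy frame mirroring the `title` / `categoryId` columns shown above.
data = pd.DataFrame({
    "title": [
        "I ASKED HER TO BE MY GIRLFRIEND...",
        "Apex Legends | Stories from the Outlands",
        "I left youtube for a month",
        "XXL 2020 Freshman Class Revealed",
    ],
    "categoryId": [22, 20, 24, 10],
})

train_categories = {20: "Gaming", 24: "Entertainment"}

# Hypothetical label assignment: the real 0/1 convention is an assumption.
label_map = {20: 0, 24: 1}

# Keep only the two categories, then map category IDs to binary labels.
subset = data[data["categoryId"].isin(list(train_categories))].copy()
subset["label"] = subset["categoryId"].map(label_map)
subset = subset.reset_index(drop=True)

print(subset[["title", "label"]])
```

Rows from any other category (here IDs 22 and 10) are dropped before the labels are assigned.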


### Loading the Data from Pre-Split Files

In [5]:
# Training data
df = pd.read_csv("./Dataset/train_Dataset_v3.csv")
print(df.head())
X_train = df["title"]
y_train = df["tags"]

# Validation data
# NOTE: this reads the same file as the test set below, so validation and
# test metrics are computed on identical examples.
df_val = pd.read_csv("./Dataset/test_Dataset_v3.csv")
print(df_val.head())
X_val = df_val["title"]
y_val = df_val["tags"]

# Test data
df_test = pd.read_csv("./Dataset/test_Dataset_v3.csv")
print(df_test.head())
X_test = df_test["title"]
y_test = df_test["tags"]


                                               title  tags  \
0                      SMILE (2022) Ending Explained     1   
1               i Got a Pet Monkey For The AMP House     1   
2                            Rick and Morty is Real.     1   
3  Ashnikko - You Make Me Sick! (Official Music V...     0   
4  Skyrim, but if my Heart Rate goes up it spawns...     1   

                                            sentence  
0  "SMILE (2022) Ending Explained" is a title of ...  
1  "i Got a Pet Monkey For The AMP House" is a ti...  
2  "Rick and Morty is Real." is a title of a vide...  
3  "Ashnikko - You Make Me Sick! (Official Music ...  
4  "Skyrim, but if my Heart Rate goes up it spawn...  
                                               title  tags  \
0                      SMILE (2022) Ending Explained     1   
1               i Got a Pet Monkey For The AMP House     1   
2                            Rick and Morty is Real.     1   
3  Ashnikko - You Make Me Sick! (Official Music V

## BERT

The model class and all of its methods.

In [6]:
# run this cell
import torch
import os
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import BertTokenizer, BertForSequenceClassification, AdamW
from peft import LoraConfig, get_peft_model
from tqdm import tqdm
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

class BertClassifier:
    def __init__(self, model_name='bert-base-uncased', num_classes=2, max_length=128, batch_size=32, num_epochs=3, learning_rate=2e-5, r=8, alpha=16):
        self.model_name = model_name
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.model = BertForSequenceClassification.from_pretrained(model_name, num_labels=num_classes)

        # Configuring the LoRA layer
        lora_config = LoraConfig(
            r=r,
            lora_alpha=alpha,
            lora_dropout=0.1
        )
        self.model = get_peft_model(self.model, lora_config)

        self.max_length = max_length
        self.batch_size = batch_size
        self.num_epochs = num_epochs
        self.learning_rate = learning_rate
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model.to(self.device)
        self.optimizer = AdamW(self.model.parameters(), lr=self.learning_rate)

    def encode_data(self, texts, labels):
        # Tokenize all titles at once. truncation=True is required for
        # max_length to take effect; padding=True pads to the longest title.
        encoded_data = self.tokenizer.batch_encode_plus(
            list(texts),
            add_special_tokens=True,
            return_attention_mask=True,
            padding=True,
            truncation=True,
            max_length=self.max_length,
            return_tensors='pt'
        )
        input_ids = encoded_data['input_ids']
        attention_masks = encoded_data['attention_mask']
        labels = torch.tensor(labels.to_numpy())
        return TensorDataset(input_ids, attention_masks, labels)

    def create_dataloader(self, dataset, sampler_type='random'):
        sampler = RandomSampler(dataset) if sampler_type == 'random' else SequentialSampler(dataset)
        return DataLoader(dataset, sampler=sampler, batch_size=self.batch_size)

    def train(self, train_texts, train_labels, val_texts, val_labels, encoded=False, encoded_train_data=None, encoded_val_data=None):
        if encoded and encoded_train_data is None:
            print("No encoded training data provided. Encoding the data...")
            encoded = False
        if encoded and encoded_val_data is None:
            print("No encoded validation data provided. Encoding the data...")
            encoded = False

        train_dataset = self.encode_data(train_texts, train_labels) if not encoded else encoded_train_data
        val_dataset = self.encode_data(val_texts, val_labels) if not encoded else encoded_val_data

        train_dataloader = self.create_dataloader(train_dataset)
        val_dataloader = self.create_dataloader(val_dataset, sampler_type='sequential')

        best_val_loss = float('inf')
        for epoch in range(self.num_epochs):
            print(f"Epoch {epoch + 1}/{self.num_epochs}")
            self.model.train()
            train_loss = 0

            epoch_iterator = tqdm(train_dataloader, desc=f"Training (Epoch {epoch+1})", leave=False)
            for batch in epoch_iterator:
                batch = tuple(t.to(self.device) for t in batch)
                inputs = {'input_ids': batch[0], 'attention_mask': batch[1], 'labels': batch[2]}
                outputs = self.model(**inputs)
                loss = outputs.loss

                loss.backward()
                self.optimizer.step()
                self.optimizer.zero_grad()

                train_loss += loss.item()
                epoch_iterator.set_postfix(loss=loss.item())

            avg_train_loss = train_loss / len(train_dataloader)
            val_loss, val_metrics = self.evaluate(val_dataloader)
            print(f"Training Loss: {avg_train_loss:.4f}, Validation Loss: {val_loss:.4f}")
            print(f"Validation Accuracy: {val_metrics['accuracy']:.4f}")

            # Save the model if the validation loss has decreased
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                self.save_checkpoint(epoch)
                print("Validation loss improved. Model checkpoint saved.")

    # Computes loss and metrics on a dataloader; used for both validation and testing
    def evaluate(self, dataloader):
        self.model.eval()
        total_loss = 0
        all_preds = []
        all_labels = []

        for batch in dataloader:
            batch = tuple(t.to(self.device) for t in batch)
            with torch.no_grad():
                inputs = {'input_ids': batch[0], 'attention_mask': batch[1], 'labels': batch[2]}
                outputs = self.model(**inputs)
                loss = outputs.loss
                total_loss += loss.item()

                logits = outputs.logits
                preds = torch.argmax(logits, dim=1).cpu().numpy()
                labels = batch[2].cpu().numpy()

                all_preds.extend(preds)
                all_labels.extend(labels)

        avg_loss = total_loss / len(dataloader)
        accuracy = accuracy_score(all_labels, all_preds)
        report = classification_report(all_labels, all_preds, target_names=["Class 0", "Class 1"], digits=4)
        conf_matrix = confusion_matrix(all_labels, all_preds)

        print("\nClassification Report:\n", report)
        print("\nConfusion Matrix:\n", conf_matrix)

        metrics = {
            'accuracy': accuracy,
            'confusion_matrix': conf_matrix,
            'classification_report': report
        }

        return avg_loss, metrics

    # given a text of the youtube title, predict the class
    def predict(self, text):
        self.model.eval()
        encoded_data = self.tokenizer(text, add_special_tokens=True, max_length=self.max_length, padding='max_length', truncation=True, return_tensors='pt')
        input_ids = encoded_data['input_ids'].to(self.device)
        attention_masks = encoded_data['attention_mask'].to(self.device)
        with torch.no_grad():
            outputs = self.model(input_ids, attention_mask=attention_masks)
        logits = outputs.logits
        predicted_class = torch.argmax(logits, dim=1).item()
        return predicted_class

    def save_checkpoint(self, epoch):
        checkpoint_path = f'checkpoint_v3_epoch_{epoch + 1}.pt'
        torch.save({
            'model_state_dict': self.model.state_dict(),
            'optimizer_state_dict': self.optimizer.state_dict(),
            'hyperparameters': {
                'model_name': self.model_name,
                'num_classes': self.model.num_labels,
                'max_length': self.max_length,
                'batch_size': self.batch_size,
                'num_epochs': self.num_epochs,
                'learning_rate': self.learning_rate
            }
        }, checkpoint_path)
        print(f"Checkpoint saved at epoch {epoch + 1}")

# Example usage:
# bert_classifier = BertClassifier(num_classes=NUM_OF_CLASSES, num_epochs=5)
# bert_classifier.train(train_texts, train_labels, val_texts, val_labels)
# eval_loss, eval_metrics = bert_classifier.evaluate(eval_dataloader)
# print(f'Evaluation Loss: {eval_loss}')


## Using the Class to Encode our Data

In [28]:
# Check that encoding the data with the model's tokenizer works properly
model = BertClassifier(num_classes=2, num_epochs=10)

train_dataset = model.encode_data(X_train, y_train)
train_dataloader = model.create_dataloader(train_dataset)

# Save the train_dataset to a file
# torch.save(train_dataset, "./encoded_dataset/encoded_YT.pt")
# To load the dataset later, use the same path:
# loaded_train_dataset = torch.load("./encoded_dataset/encoded_YT.pt")

print(train_dataset[:5])

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


(tensor([[  101,  2868,  1006, 16798,  2475,  1007,  4566,  4541,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0],
        [  101,  1045,  2288,  1037,  9004, 10608,  2005,  1996,
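
In the tensor above, each row starts with 101 and its real content ends with 102, which are the IDs of BERT's `[CLS]` and `[SEP]` tokens in the `bert-base-uncased` vocabulary; the trailing zeros are `[PAD]`. The plain-Python sketch below illustrates what padding to the longest sequence and the matching attention mask look like. The second sequence is a shortened, hypothetical example, and `pad_batch` is an illustrative helper, not part of the tokenizer API.

```python
PAD_ID = 0  # [PAD] in bert-base-uncased

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad token-ID lists to the longest one and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

batch = [
    [101, 2868, 1006, 16798, 2475, 1007, 4566, 4541, 102],  # first row of the output above
    [101, 1045, 2288, 102],                                 # hypothetical shorter title
]
ids, mask = pad_batch(batch)
print(ids[1])
print(mask[1])
```

The attention mask is what lets the model ignore the padded positions, which is why `encode_data` stores it alongside the input IDs.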

### Training parameters

In [29]:
def count_trainable_parameters(model):
    all_param = sum(p.numel() for p in model.parameters())
    trainable_param =  sum(p.numel() for p in model.parameters() if p.requires_grad)
    ratio = float(trainable_param  / all_param) *100
    return all_param, trainable_param, ratio

 
# Print the number of trainable parameters
print(f"Number of trainable parameters: {count_trainable_parameters(model.model)}")


Number of trainable parameters: (109778690, 294912, 0.2686423020715587)
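
The trainable count reported above matches what LoRA adds with rank `r = 8`, assuming PEFT's default target modules for BERT (the `query` and `value` projections). Each adapted 768×768 projection gains two matrices, A (r × 768) and B (768 × r). A quick arithmetic check, with the hidden size and layer count taken from the model printout in the next cell:

```python
hidden_size = 768      # in/out features of each attention projection
num_layers = 12        # "(0-11): 12 x BertLayer" in the model printout
rank = 8               # r in LoraConfig
adapted_per_layer = 2  # assumption: query and value projections

# Each adapted projection adds A (rank x hidden) and B (hidden x rank).
params_per_module = rank * hidden_size + hidden_size * rank
lora_params = num_layers * adapted_per_layer * params_per_module

total_params = 109_778_690  # reported by count_trainable_parameters
print(lora_params)
print(round(100 * lora_params / total_params, 4))
```

Because the count is fully accounted for by the LoRA matrices, the newly initialized classifier head (768 × 2 weights plus 2 biases) does not appear to be among the trainable parameters in this configuration. Setting `task_type` in `LoraConfig` is the usual way to make PEFT keep a classification head trainable, though that is not what the run above used.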


In [30]:
print(model.model)

PeftModel(
  (base_model): LoraModel(
    (model): BertForSequenceClassification(
      (bert): BertModel(
        (embeddings): BertEmbeddings(
          (word_embeddings): Embedding(30522, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (token_type_embeddings): Embedding(2, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (encoder): BertEncoder(
          (layer): ModuleList(
            (0-11): 12 x BertLayer(
              (attention): BertAttention(
                (self): BertSdpaSelfAttention(
                  (query): lora.Linear(
                    (base_layer): Linear(in_features=768, out_features=768, bias=True)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.1, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=768

## Training the BERT Model



In [31]:
# Training may take about an hour, depending on hardware
model.train(X_train, y_train, X_val, y_val, encoded=False)
print("finished training, now evaluating")


Epoch 1/10


                                                                                 


Classification Report:
               precision    recall  f1-score   support

     Class 0     0.7423    0.8730    0.8024      5890
     Class 1     0.7298    0.5309    0.6146      3805

    accuracy                         0.7387      9695
   macro avg     0.7360    0.7019    0.7085      9695
weighted avg     0.7374    0.7387    0.7287      9695


Confusion Matrix:
 [[5142  748]
 [1785 2020]]
Training Loss: 0.6192, Validation Loss: 0.5208
Validation Accuracy: 0.7387
Checkpoint saved at epoch 1
Validation loss improved. Model checkpoint saved.
Epoch 2/10


                                                                                 


Classification Report:
               precision    recall  f1-score   support

     Class 0     0.7624    0.8722    0.8136      5890
     Class 1     0.7454    0.5792    0.6519      3805

    accuracy                         0.7572      9695
   macro avg     0.7539    0.7257    0.7327      9695
weighted avg     0.7557    0.7572    0.7501      9695


Confusion Matrix:
 [[5137  753]
 [1601 2204]]
Training Loss: 0.5087, Validation Loss: 0.4841
Validation Accuracy: 0.7572
Checkpoint saved at epoch 2
Validation loss improved. Model checkpoint saved.
Epoch 3/10


                                                                                 


Classification Report:
               precision    recall  f1-score   support

     Class 0     0.7791    0.8618    0.8184      5890
     Class 1     0.7440    0.6218    0.6775      3805

    accuracy                         0.7676      9695
   macro avg     0.7616    0.7418    0.7479      9695
weighted avg     0.7653    0.7676    0.7631      9695


Confusion Matrix:
 [[5076  814]
 [1439 2366]]
Training Loss: 0.4851, Validation Loss: 0.4666
Validation Accuracy: 0.7676
Checkpoint saved at epoch 3
Validation loss improved. Model checkpoint saved.
Epoch 4/10


                                                                                 

## Evaluation of the Model

In [9]:

# model, _, _ = load_checkpoint("checkpoint_best_BERT_epoch_9.pt")  # uncomment to use the best fine-tuned BERT checkpoint
test_dataset = model.encode_data(X_test, y_test)
test_dataloader = model.create_dataloader(test_dataset, sampler_type='sequential')

# evaluate() returns a tuple: (average loss, metrics dict)
eval_loss, eval_metrics = model.evaluate(test_dataloader)
print(f'Evaluation Loss: {eval_loss}')




Classification Report:
               precision    recall  f1-score   support

     Class 0     0.8072    0.8543    0.8301      5890
     Class 1     0.7521    0.6841    0.7165      3805

    accuracy                         0.7875      9695
   macro avg     0.7796    0.7692    0.7733      9695
weighted avg     0.7856    0.7875    0.7855      9695


Confusion Matrix:
 [[5032  858]
 [1202 2603]]
Evaluation Loss: 0.4348954964588971
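
The figures in the classification report can be recomputed directly from the confusion matrix (rows are true classes, columns are predicted classes, following scikit-learn's convention). The sketch below does so in plain Python and also computes the accuracy of always predicting the majority class as a reference point; the helper name `per_class_metrics` is illustrative.

```python
conf_matrix = [[5032, 858],
               [1202, 2603]]

def per_class_metrics(cm, cls):
    """Return (precision, recall, f1) for one class of a confusion matrix."""
    tp = cm[cls][cls]
    predicted = sum(row[cls] for row in cm)  # column sum: predicted as cls
    actual = sum(cm[cls])                    # row sum: truly cls
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row) for row in conf_matrix)
accuracy = sum(conf_matrix[i][i] for i in range(len(conf_matrix))) / total
majority_baseline = max(sum(row) for row in conf_matrix) / total

for cls in range(2):
    p, r, f1 = per_class_metrics(conf_matrix, cls)
    print(f"Class {cls}: precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
print(f"accuracy={accuracy:.4f} majority-class baseline={majority_baseline:.4f}")
```

With 5890 of the 9695 test titles in class 0, always predicting class 0 would score about 0.6075, so the reported accuracy of 0.7875 is roughly 18 points above that baseline.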


## Saving the model

In [21]:
import os
save_dir = "./checkpoints"
os.makedirs(save_dir, exist_ok=True)  # create the directory if it does not exist

checkpoint_path = os.path.join(save_dir, 'checkpoint_v3_new_best_model.pt')
torch.save({
    'model_state_dict': model.model.state_dict(),
    'optimizer_state_dict': model.optimizer.state_dict(),
    'hyperparameters': {
        'model_name': 'bert-base-uncased',
        'num_classes': 2,
        'max_length': model.max_length,
        'batch_size': model.batch_size,
        'num_epochs': model.num_epochs,
        'learning_rate': model.learning_rate
    }
}, checkpoint_path)
print("Model checkpoint saved successfully")



Model checkpoint saved successfully


## Testing an Example Title

In [34]:
# Define a function for prediction
def predict_title(model, tokenizer, text):
    # Encode the input text
    encoded_data = tokenizer(
        text,
        add_special_tokens=True,
        return_attention_mask=True,
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors='pt'
    )
    
    # Prepare the input dictionary
    inputs = {
        'input_ids': encoded_data['input_ids'],
        'attention_mask': encoded_data['attention_mask']
    }
    
    # Move input tensors to the appropriate device and switch to inference mode
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print("chosen device", device)
    model.model.to(device)
    model.model.eval()  # disable dropout so predictions are deterministic
    inputs = {key: value.to(device) for key, value in inputs.items()}
    
    # Perform a forward pass through the model
    with torch.no_grad():
        outputs = model.model(**inputs)
    
    # Get the logits and make predictions
    logits = outputs.logits
    predicted_class = torch.argmax(logits, dim=1).item()
    
    return predicted_class

# Example usage
text_to_predict = "intro to Gradient Descent"
prediction = predict_title(model, model.tokenizer, text_to_predict)

print(f'Predicted class: {prediction}')

chosen device cuda
Predicted class: 0
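
`predict_title` returns only the index of the largest logit. When a confidence score is useful, the logits can be passed through a softmax first. A minimal sketch in plain Python, using made-up logit values rather than real model output:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # subtract the max to avoid overflow in exp
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.2, -0.4]  # hypothetical logits for classes 0 and 1
probs = softmax(logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(predicted_class, [round(p, 3) for p in probs])
```

On a PyTorch tensor of logits, `torch.softmax(logits, dim=1)` performs the same computation.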


calculating accuracy

## Code for Loading a Model from a File
- Run this before the evaluation, instead of the training step.

In [7]:
def load_checkpoint(checkpoint_path):
    # Pick the device first so the checkpoint can be mapped onto it
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # Load the checkpoint (map_location lets a GPU-saved checkpoint load on CPU)
    checkpoint = torch.load(checkpoint_path, map_location=device)

    # Retrieve the hyperparameters from the checkpoint
    hyperparameters = checkpoint['hyperparameters']

    # Initialize the model using the hyperparameters.
    # The LoRA settings (r, alpha) are not stored in the checkpoint, so the
    # class defaults are used and must match those used for training.
    model = BertClassifier(
        model_name=hyperparameters['model_name'],
        num_classes=hyperparameters['num_classes'],
        max_length=hyperparameters['max_length'],
        batch_size=hyperparameters['batch_size'],
        num_epochs=hyperparameters['num_epochs'],
        learning_rate=hyperparameters['learning_rate']
    )

    # Restore the model weights, and the optimizer state into the
    # classifier's own optimizer so that training can be resumed
    model.model.load_state_dict(checkpoint['model_state_dict'])
    optimizer = model.optimizer
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])

    # Move the model to the appropriate device
    model.model.to(device)

    print(f"Model loaded successfully from {checkpoint_path} with hyperparameters: {hyperparameters}")

    return model, optimizer, hyperparameters

In [8]:
checkpoint_path = './checkpoint_v3_epoch_9.pt'
model, optimizer, hyperparameters = load_checkpoint(checkpoint_path)

  checkpoint = torch.load(checkpoint_path)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded successfully from ./checkpoint_v3_epoch_9.pt with hyperparameters: {'model_name': 'bert-base-uncased', 'num_classes': 2, 'max_length': 128, 'batch_size': 32, 'num_epochs': 10, 'learning_rate': 2e-05}
