<a href="https://colab.research.google.com/github/paramchhabra/AIProject-1/blob/main/AI_Project1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
!pip install transformers




In [5]:
!pip install datasets



In [6]:
!pip install sentence-transformers



I have now installed the required dependencies. In the following cell, I'll check that my CSV dataset loads correctly.


In [7]:
import pandas as pd

filepath = '/content/navigator-batch-generate-66e45a0657fb48a168f5b606-data.csv'
data = pd.read_csv(filepath)

data.head()


Unnamed: 0,Date & Time,SenderName,SenderEmail,Subject,Text,Type
0,26-07-2023 14:30,Rajesh Patel,rajesh.patel@vit.ac.in,Internship Opportunity at Google,We are excited to announce that we have partne...,"Internship/Placement Email,college"
1,27-07-2023 11:45,Aisha Ali,aisha.ali@yandex.com,Join Our Hackathon,Calling all coders! Join our hackathon and sho...,"Hackathon Email,external"
2,28-07-2023 09:00,Liam Chen,liam.chen@university.edu,Course Registration Open,Don't miss out on our new course on 'Data Scie...,"Course Advertisement,external"
3,29-07-2023 15:30,Fatima Khan,fatima.khan@vit.ac.in,Event: AI Conference,Join us for the AI Conference on August 20th a...,"Event Email,college"
4,30-07-2023 12:00,Ethan Lee,ethan.lee@outlook.com,Other Email: Library Book Issue,"You have an overdue book, 'Python for Data Sci...","Other Emails,external"
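
Beyond looking at 'data.head()', a quick programmatic check can confirm that the columns used in later cells are present. The sketch below is a minimal illustration and not part of the original notebook: the helper name 'missing_columns' is hypothetical, and the expected column list is simply read off the output above.

```python
# Columns that later cells read from each row (taken from the output above).
EXPECTED_COLUMNS = ["Date & Time", "SenderName", "SenderEmail", "Subject", "Text", "Type"]

def missing_columns(columns, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in expected if name not in present]

# Toy header standing in for data.columns, so the block runs on its own.
header = ["Unnamed: 0", "Date & Time", "SenderName", "SenderEmail", "Subject", "Text", "Type"]
print(missing_columns(header))       # nothing is missing
print(missing_columns(header[:-1]))  # 'Type' is missing
```

In the notebook itself, the same helper could be called as 'missing_columns(data.columns)'.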


This cell lays the groundwork for fine-tuning the MiniLM model: it splits the data into training and validation sets, which are later converted into texts and labels.

In [8]:
from sklearn.model_selection import train_test_split

train_data, val_data = train_test_split(data, test_size=0.15, random_state=42)

print(f"Training data: {len(train_data)}, Validation data: {len(val_data)}")

Training data: 850, Validation data: 150
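
The sizes printed above follow directly from 'test_size=0.15' applied to 1000 rows. The sketch below reproduces that arithmetic, assuming (as I understand scikit-learn's behaviour when only a fractional 'test_size' is given) that the validation count is rounded up and the remainder goes to training. The helper name 'split_sizes' is hypothetical.

```python
import math

def split_sizes(n_samples, test_size):
    """Sizes of the (train, validation) parts for a fractional test_size."""
    n_val = math.ceil(test_size * n_samples)  # validation count, rounded up
    n_train = n_samples - n_val               # everything else is training data
    return n_train, n_val

print(split_sizes(1000, 0.15))  # (850, 150)
```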


Now we need to tokenize the data. We could use a model-specific tokenizer class, but here we use 'AutoTokenizer', which selects the right tokenizer from the checkpoint name.

In [9]:
from transformers import AutoTokenizer

# Load the MiniLM tokenizer
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

label_mapping = {
    "Internship/Placement Email,college": 0,
    "Internship/Placement Email,external": 1,
    "Hackathon Email,college": 2,
    "Hackathon Email,external": 3,
    "Education Email,college": 4,
    "Education Email,external": 5,
    "Event Email,college": 6,
    "Event Email,external": 7,
    "Course Advertisement,college": 8,
    "Course Advertisement,external": 9,
    "Other Emails,college": 10,
    "Other Emails,external": 11

}

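def tokenize_function(data):
    # Join the email fields into one string, separated by "|"
    combined_text = (
        "SenderName: " + data['SenderName'] + "|" +
        "Date & Time: " + str(data['Date & Time']) + "|" +
        "Subject: " + data['Subject'] + "|" +
        "SenderEmail: " + data['SenderEmail'] + "|" +
        "Text: " + data['Text']
    )
    # Pad or truncate every example to the tokenizer's maximum length
    encoding = tokenizer(combined_text, padding='max_length', truncation=True)
    # Attach the integer class id for this row's 'Type'
    encoding['labels'] = label_mapping[data['Type']]
    return encoding

# Tokenize each row (axis=1) of both splits
train_encodings = train_data.apply(tokenize_function, axis=1)
val_encodings = val_data.apply(tokenize_function, axis=1)

# Label-based lookup: shows the row whose original index is 0
print(train_encodings[0])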



{'input_ids': [101, 4604, 11795, 14074, 1024, 11948, 9953, 20455, 1064, 3058, 1004, 2051, 1024, 2656, 1011, 5718, 1011, 16798, 2509, 2403, 1024, 2382, 1064, 3395, 1024, 22676, 4495, 2012, 8224, 1064, 4604, 7869, 21397, 1024, 11948, 9953, 1012, 20455, 1030, 6819, 2102, 1012, 9353, 1012, 1999, 1064, 3793, 1024, 2057, 2024, 7568, 2000, 14970, 2008, 2057, 2031, 12404, 2007, 8224, 2000, 3749, 1037, 1017, 1011, 3204, 22676, 2565, 2005, 2493, 1012, 1996, 2565, 2003, 2881, 2000, 3073, 2398, 1011, 2006, 3325, 1999, 4007, 2458, 1998, 2097, 2421, 1037, 2358, 15457, 4859, 1997, 1002, 13509, 1012, 2065, 2017, 2024, 4699, 1010, 3531, 7514, 2000, 2023, 10373, 2011, 2257, 3083, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
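
The long run of trailing zeros in the output above is padding: with 'padding=max_length', every sequence is filled up to a fixed length with the pad token id (0 in this output). The sketch below illustrates the idea on a toy scale. It is not the tokenizer's actual implementation (a real tokenizer also keeps the closing special token when truncating), and the helper name 'pad_or_truncate' and the short 'max_length' are hypothetical.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                   # naive truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad             # fill the tail with pad ids
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

ids, mask = pad_or_truncate([101, 4604, 11795, 102], max_length=8)
print(ids)   # [101, 4604, 11795, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

The attention mask is what lets the model ignore the padded positions.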

Next, we prepare the dataset for fine-tuning, using Hugging Face's 'datasets' library.

In [10]:
from datasets import Dataset

# Convert the Series of tokenized data to a dictionary of lists
def convert_to_dict(encodings):
    # Collect all keys from the first sample
    keys = encodings.iloc[0].keys()

    # Create a dictionary where each key corresponds to a list of values
    encoding_dict = {key: [] for key in keys}
    for encoding in encodings:
        for key, value in encoding.items():
            encoding_dict[key].append(value)
    return encoding_dict

# Convert train and validation encodings
train_dataset_dict = convert_to_dict(train_encodings)
val_dataset_dict = convert_to_dict(val_encodings)

train_dataset = Dataset.from_dict(train_dataset_dict)
val_dataset = Dataset.from_dict(val_dataset_dict)

print(train_dataset[0])

{'input_ids': [101, 4604, 11795, 14074, 1024, 22854, 2050, 5035, 1064, 3058, 1004, 2051, 1024, 6185, 1011, 5511, 1011, 16798, 2509, 2340, 1024, 2382, 1064, 3395, 1024, 9046, 20578, 8988, 2239, 1064, 4604, 7869, 21397, 1024, 22854, 2050, 1012, 5035, 1030, 2742, 1012, 4012, 1064, 3793, 1024, 6203, 2493, 1010, 2017, 2024, 4778, 2000, 2256, 9046, 20578, 8988, 2239, 1012, 1996, 2724, 2097, 2202, 2173, 2006, 2257, 4833, 2012, 1016, 7610, 1012, 3531, 2424, 1996, 4751, 4987, 2005, 2062, 2592, 1012, 2057, 2298, 2830, 2000, 3773, 2017, 2045, 999, 2190, 12362, 1010, 2115, 9450, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
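
'Dataset.from_dict' expects a column-oriented dictionary: one key per field, each mapped to a list with one value per example. The conversion from per-example encodings to that layout can be sketched on toy data as follows. The values are made up and the helper name 'rows_to_columns' is hypothetical.

```python
def rows_to_columns(rows):
    """Turn a list of per-example dicts into one dict of per-field lists."""
    columns = {key: [] for key in rows[0]}  # field names from the first example
    for row in rows:
        for key, value in row.items():
            columns[key].append(value)
    return columns

# Two made-up encodings, far shorter than real tokenizer output.
toy_rows = [
    {"input_ids": [101, 7592, 102], "attention_mask": [1, 1, 1], "labels": 3},
    {"input_ids": [101, 2088, 102], "attention_mask": [1, 1, 1], "labels": 7},
]
toy_columns = rows_to_columns(toy_rows)
print(toy_columns["labels"])  # [3, 7]
```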

Now for the most important task: loading the model to fine-tune. We will also set up the training arguments that the fine-tuning run will use.

In [11]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    'sentence-transformers/all-MiniLm-L6-v2',
    num_labels=12
)

training_args = TrainingArguments(
    output_dir = './results',
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    remove_unused_columns=False,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at sentence-transformers/all-MiniLm-L6-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
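
The value 'num_labels=12' has to match the number of entries in 'label_mapping'. As a minimal sketch, the count and a reverse mapping can be derived rather than typed by hand. The mapping is rebuilt here from the six categories and two sources so that the block is self-contained, and the names 'label2id' and 'id2label' are my choice, not the notebook's.

```python
from itertools import product

categories = [
    "Internship/Placement Email", "Hackathon Email", "Education Email",
    "Event Email", "Course Advertisement", "Other Emails",
]
sources = ["college", "external"]

# Same ids as the hand-written label_mapping: category-major, college before external.
label2id = {f"{cat},{src}": i for i, (cat, src) in enumerate(product(categories, sources))}
id2label = {i: name for name, i in label2id.items()}

num_labels = len(label2id)
print(num_labels)   # 12
print(id2label[9])  # Course Advertisement,external
```

To my knowledge, 'from_pretrained' also accepts 'id2label' and 'label2id' keyword arguments, which store the class names in the model config so that predictions can later be mapped back to readable labels. The notebook does not do this.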


We'll create a 'Trainer' object that ties the model to the training arguments. 'Trainer' is provided by Hugging Face and simplifies fine-tuning by abstracting away the training loop.

In [12]:

# Drop unnecessary columns like index columns
# train_dataset = train_dataset.remove_columns(['__index_level_0__'])
# val_dataset = val_dataset.remove_columns(['__index_level_0__'])

# Show the name of the model's main input (prints 'input_ids')
print(model.main_input_name)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

print(train_dataset[1])



input_ids
{'input_ids': [101, 4604, 11795, 14074, 1024, 13192, 18998, 1064, 3058, 1004, 2051, 1024, 2340, 1011, 5511, 1011, 16798, 2549, 2260, 1024, 4002, 1064, 3395, 1024, 20578, 8988, 2239, 8468, 1064, 4604, 7869, 21397, 1024, 13192, 1012, 18998, 1030, 20917, 4014, 1012, 4012, 1064, 3793, 1024, 2017, 2024, 4778, 2000, 5589, 1999, 2256, 9046, 20578, 8988, 2239, 1012, 1996, 2724, 2097, 2022, 2218, 2006, 1996, 10965, 1997, 2257, 1998, 2097, 9125, 13729, 3471, 1998, 27696, 2115, 4813, 1012, 3531, 4236, 2011, 1996, 3983, 1997, 2257, 2000, 5851, 2115, 3962, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

Let the fine-tuning begin.

In [13]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,2.244627
2,No log,1.946018
3,No log,1.857413


TrainOutput(global_step=162, training_loss=2.1705039695457176, metrics={'train_runtime': 3237.9775, 'train_samples_per_second': 0.788, 'train_steps_per_second': 0.05, 'total_flos': 84602974003200.0, 'train_loss': 2.1705039695457176, 'epoch': 3.0})
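
The 'global_step=162' reported above can be reproduced from the training set size, the batch size, and the number of epochs. The sketch below assumes a single device and no gradient accumulation, which is what the notebook's arguments suggest. It also offers a likely explanation for the 'No log' entries: as far as I know, the Trainer logs the training loss every 500 steps by default, and this run is shorter than that.

```python
import math

train_examples = 850  # from the split above
batch_size = 16       # per_device_train_batch_size
epochs = 3            # num_train_epochs

steps_per_epoch = math.ceil(train_examples / batch_size)  # last batch is partial
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 54 162

# Assumed default logging interval of the Trainer (logging_steps).
default_logging_steps = 500
print(total_steps < default_logging_steps)  # True, which would explain "No log"
```

Passing a smaller 'logging_steps' value to 'TrainingArguments' should make the training loss appear in the table.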

Let's evaluate the model on the validation set.

In [14]:
trainer.evaluate()

{'eval_loss': 1.8574129343032837,
 'eval_runtime': 41.9056,
 'eval_samples_per_second': 3.579,
 'eval_steps_per_second': 0.239,
 'epoch': 3.0}
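
'trainer.evaluate()' reports only the loss here, which is hard to interpret on its own. Two small aids are sketched below, both of them my additions rather than part of the notebook: the cross-entropy of a uniform guess over 12 classes as a reference point, and a hypothetical 'compute_metrics' function in the shape that the Trainer's hook of the same name expects.

```python
import math

# Cross-entropy of a model that assigns equal probability to all 12 classes.
uniform_loss = math.log(12)
print(round(uniform_loss, 4))  # 2.4849

def compute_metrics(eval_pred):
    """Accuracy from (logits, labels); hypothetical helper, not in the notebook."""
    logits, labels = eval_pred
    # argmax over each row of logits, written without NumPy so it runs anywhere
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Toy logits for three examples and three classes.
toy = ([[0.1, 2.0, 0.3], [1.5, 0.2, 0.1], [0.0, 0.1, 0.9]], [1, 0, 1])
print(compute_metrics(toy))
```

The validation loss of about 1.857 is below the uniform reference of about 2.485, so the model has moved away from pure guessing. Passing 'compute_metrics=compute_metrics' when constructing the 'Trainer' would add an accuracy figure to the evaluation output.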

We need to save the model and tokenizer for future use.

In [15]:
model.save_pretrained("/content/finetuned_minilm")
tokenizer.save_pretrained("/content/finetuned_minilm")


('/content/finetuned_minilm/tokenizer_config.json',
 '/content/finetuned_minilm/special_tokens_map.json',
 '/content/finetuned_minilm/vocab.txt',
 '/content/finetuned_minilm/added_tokens.json',
 '/content/finetuned_minilm/tokenizer.json')

We will now save our fine-tuned model to Google Drive so that we can load it directly in the future without retraining it each time.

In [16]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [17]:
model.save_pretrained('/content/drive/MyDrive/finetuned_minilm')
tokenizer.save_pretrained('/content/drive/MyDrive/finetuned_minilm')


('/content/drive/MyDrive/finetuned_minilm/tokenizer_config.json',
 '/content/drive/MyDrive/finetuned_minilm/special_tokens_map.json',
 '/content/drive/MyDrive/finetuned_minilm/vocab.txt',
 '/content/drive/MyDrive/finetuned_minilm/added_tokens.json',
 '/content/drive/MyDrive/finetuned_minilm/tokenizer.json')
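
To use the saved model later, it can be loaded back from the same directory. The sketch below is a minimal illustration under a few assumptions: the Drive path from the cell above is still mounted, PyTorch is the backend, and the function names 'load_finetuned' and 'predict_label_id' are hypothetical. The imports sit inside the functions so that defining them does not require the libraries to be installed.

```python
def load_finetuned(model_dir="/content/drive/MyDrive/finetuned_minilm"):
    """Load the tokenizer and classifier saved by save_pretrained."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()  # switch off dropout for inference
    return tokenizer, model

def predict_label_id(text, tokenizer, model):
    """Return the integer class id with the highest logit for one text."""
    import torch

    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():  # no gradients needed at inference time
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1).item())
```

The text passed to 'predict_label_id' should be built in the same "SenderName: ...|Date & Time: ...|..." layout that 'tokenize_function' used during training, and the returned id can be mapped back to a class name by inverting 'label_mapping'.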