# Model Training Notebook

After exploring the data and identifying potential challenges, I decided to use a state-of-the-art transformer language model.

## Model Architecture

"DistilBERT is a small, fast, cheap and light Transformer model based on the BERT architecture. Knowledge distillation is performed during the pre-training phase to reduce the size of a BERT model by 40%. To leverage the inductive biases learned by larger models during pre-training, the authors introduce a triple loss combining language modeling, distillation and cosine-distance losses."

__[PaperswithCode](https://paperswithcode.com/method/distillbert)__
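To make the distillation part of the quote more concrete, here is a minimal sketch of a soft-target distillation term: the student is pushed towards the teacher's temperature-softened output distribution. This is one common formulation written in plain Python for illustration, not the authors' training code. The function names, the temperature value and the toy logits are all assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher = softer distribution)."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions (illustrative)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student))

# Hypothetical logits over a three-token vocabulary
teacher = [4.0, 1.0, 0.5]
print(distillation_loss([3.5, 1.2, 0.3], teacher))  # small: student close to teacher
print(distillation_loss([0.5, 1.0, 4.0], teacher))  # larger: student disagrees
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge.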

![BertArchitecture](https://heidloff.net/assets/img/2023/02/transformers.png)

## Transfer Learning

I selected DistilBERT because it's a lightweight language model with fast training and inference. To create a mindmap category classifier, I'll take the pretrained DistilBERT model and fine-tune it on the existing data so that it can classify mindmaps written in different languages.

![TransferLearning](https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/BERT-transfer-learning.png?ssl=1)

## Setting Up Huggingface

In [1]:
from huggingface_hub import notebook_login
notebook_login()


## Importing Packages

In [2]:
import evaluate
import numpy as np
import pandas as pd

from datasets import Dataset
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import pipeline
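The training and upload cells later in this notebook print a `huggingface/tokenizers` warning about forked processes. The warning text itself suggests setting the `TOKENIZERS_PARALLELISM` environment variable. A minimal sketch, assuming it runs before the tokenizer is first used:

```python
import os

# Set before any tokenizer call; "false" disables tokenizer parallelism,
# which is what the warning falls back to anyway after a fork.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Choosing `"false"` is an assumption on my part. Setting it to `"true"` is the other option the warning lists.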

## Loading Raw Data

In [3]:
maps_df = pd.read_csv("../data/raw/public_maps.csv")

## Selecting Features 

To train the classifier we only need the text features of the data. I'll also add the index as a column so that every row has a unique ID.

In [4]:
maps_df['id'] = maps_df.index

In [5]:
maps_df_selected = maps_df[['id', 'map_title', 'idea_title', 'map_category_name']]

## Creating Input Feature

To classify the category we need to concatenate the idea title and the mindmap title so that the context is in a single sentence.

In [7]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
maps_df_selected = maps_df_selected.assign(
    text=maps_df_selected['idea_title'] + ' ' + maps_df_selected['map_title']
)
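One thing to watch with this concatenation: in pandas, adding strings with `+` yields a missing value whenever either side is missing. I did not check whether the raw data contains missing titles, so the helper below is only a sketch of how the join could be made robust. `build_text` is a hypothetical name, not part of the notebook.

```python
def build_text(idea_title, map_title):
    """Join the two titles with a space, skipping missing or blank ones."""
    parts = []
    for value in (idea_title, map_title):
        # pandas stores missing strings as NaN (a float), so keep only real strings
        if isinstance(value, str) and value.strip():
            parts.append(value.strip())
    return " ".join(parts)

print(build_text("Heat of reaction", "General Chemistry 2"))
print(build_text(None, "General Chemistry 2"))
```

It could be applied row by row, for example with `DataFrame.apply(..., axis=1)`, at the cost of being slower than the vectorised `+`.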


In [8]:
maps_df_selected = maps_df_selected[['text', 'map_category_name']]
maps_df_selected.rename(columns={'map_category_name': 'label'}, inplace=True)

## Creating Huggingface Dataset

To make the training process smoother, we can create a Huggingface dataset. It provides a variety of preprocessing features that help build a dataset suitable for the classification task.

In [9]:
dataset = Dataset.from_pandas(maps_df_selected)
dataset = dataset.class_encode_column("label")

labels = dataset.features["label"].names

# Create Label/Index mappings
id2label = {idx: label for idx, label in enumerate(labels)}
label2id = {label: idx for idx, label in enumerate(labels)}

Casting to class labels:   0%|          | 0/13560 [00:00<?, ? examples/s]
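To illustrate what the two mappings look like, here is the same pair of dictionary comprehensions on a toy label list. The category names are hypothetical stand-ins for `dataset.features["label"].names`, and sorting them alphabetically is an assumption made for this sketch.

```python
# Hypothetical category names; duplicates collapse because we start from a set
labels = sorted({"Education", "Business", "Other", "Business"})

id2label = {idx: label for idx, label in enumerate(labels)}
label2id = {label: idx for idx, label in enumerate(labels)}

print(id2label)  # {0: 'Business', 1: 'Education', 2: 'Other'}
print(label2id)  # {'Business': 0, 'Education': 1, 'Other': 2}

# The two mappings are inverses of each other
assert all(label2id[id2label[idx]] == idx for idx in id2label)
```

Passing both mappings to the model means predictions come back as readable category names rather than integers.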

We can use the train_test_split method to create our test dataset. Stratifying by label makes sure every category appears in the train and test splits in the same proportion as in the full dataset.

In [10]:
dataset = dataset.train_test_split(test_size=0.2, stratify_by_column="label")
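The idea behind stratification can be sketched in plain Python: split each class separately, then merge the pieces. This is an illustration of the concept, not the `datasets` implementation, and `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with roughly equal label proportions in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)  # shuffle within the class only
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

toy_labels = ["A"] * 50 + ["B"] * 30 + ["C"] * 20
train_idx, test_idx = stratified_split(toy_labels)
print(Counter(toy_labels[i] for i in test_idx))
```

The fixed seed makes the toy split repeatable. `train_test_split` also accepts a `seed` argument, which would make the split in this notebook reproducible across runs.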

## Preprocessing

To train the DistilBERT model I followed the Huggingface transformers documentation available here: https://huggingface.co/docs/transformers/main/tasks/sequence_classification

In [11]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [12]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

In [13]:
tokenized_dataset = dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/10848 [00:00<?, ? examples/s]

Map:   0%|          | 0/2712 [00:00<?, ? examples/s]

In [14]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
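The tokenizer above truncates but does not pad, so the collator pads each batch at training time. A rough sketch of that dynamic padding idea in plain Python follows. The token IDs are made up, `pad_batch` is a hypothetical helper, and the real collator returns tensors rather than lists.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest sequence in this batch and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token IDs for two short texts of different length
batch = pad_batch([[101, 7592, 102], [101, 2023, 2003, 1037, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding to the longest sequence in each batch, instead of to a global maximum length, keeps batches of short titles small.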

Since the categories are somewhat balanced, accuracy is a reasonable metric for evaluating the model.

In [15]:
accuracy = evaluate.load("accuracy")

In [16]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
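The metric boils down to taking the highest-scoring class per row and counting matches. The plain-Python sketch below mirrors that logic on made-up logits and adds a majority-class baseline, which is a useful sanity check on how much accuracy a trivial predictor would get. All names and values here are illustrative assumptions.

```python
from collections import Counter

def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, references):
    predictions = [argmax(row) for row in logits]
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

def majority_baseline(references):
    """Accuracy of always predicting the most frequent class."""
    return Counter(references).most_common(1)[0][1] / len(references)

# Hypothetical logits for four examples and three classes
logits = [[2.0, 0.1, -1.0], [0.3, 1.2, 0.0], [0.5, 0.4, 0.9], [1.5, 0.2, 0.1]]
references = [0, 1, 1, 0]

print(accuracy_from_logits(logits, references))  # 0.75
print(majority_baseline(references))             # 0.5
```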

Now we can fine-tune the pretrained model with the training data we have.

In [17]:
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(labels), id2label=id2label, label2id=label2id
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [18]:
training_args = TrainingArguments(
    output_dir="../models/meister-mindmap-model-pytorch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

Cloning https://huggingface.co/soymia/meister-mindmap-model-pytorch into local empty directory.


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.7075 | 0.054834 | 0.987832 |
| 2 | 0.0613 | 0.016271 | 0.99705 |



TrainOutput(global_step=1356, training_loss=0.28903380853939903, metrics={'train_runtime': 251.7162, 'train_samples_per_second': 86.192, 'train_steps_per_second': 5.387, 'total_flos': 335060623221024.0, 'train_loss': 0.28903380853939903, 'epoch': 2.0})
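The numbers in `TrainOutput` can be checked against the training arguments with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, neither of which is stated explicitly in the notebook.

```python
import math

train_examples = 10848    # size of the training split (from the Map progress bar)
batch_size = 16           # per_device_train_batch_size
epochs = 2                # num_train_epochs
train_runtime = 251.7162  # seconds, from TrainOutput

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_examples * epochs / train_runtime

print(steps_per_epoch)               # 678
print(total_steps)                   # 1356, matches global_step
print(round(samples_per_second, 3))  # 86.192, matches train_samples_per_second
```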

## Evaluation

The fine-tuned model gave great results, reaching 99.7% validation accuracy after just 2 epochs. Overfitting is still a possibility, but some hyperparameter tuning could reduce it.
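A single accuracy figure can hide a weak category, so a per-class breakdown is a natural next check. Here is a minimal sketch in plain Python. The predictions and references are made-up values, and `per_class_recall` is a hypothetical helper rather than something the notebook ran.

```python
from collections import defaultdict

def per_class_recall(predictions, references):
    """Fraction of examples of each true class that were predicted correctly."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, ref in zip(predictions, references):
        total[ref] += 1
        correct[ref] += int(pred == ref)
    return {label: correct[label] / total[label] for label in sorted(total)}

# Hypothetical label IDs for six test examples
predictions = [0, 0, 1, 2, 2, 2]
references = [0, 0, 1, 1, 2, 2]

print(per_class_recall(predictions, references))  # {0: 1.0, 1: 0.5, 2: 1.0}
```

In this notebook, the inputs could come from `trainer.predict(tokenized_dataset["test"])`, taking the argmax of its `predictions` and comparing against its `label_ids`.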

## Uploading the Model to Huggingface

For easy deployment, we can store the model on the Huggingface Hub, where it is easy to access.

In [19]:
trainer.push_to_hub()


Upload file pytorch_model.bin:   0%|          | 1.00/255M [00:00<?, ?B/s]

To https://huggingface.co/soymia/meister-mindmap-model-pytorch
   3b450e6..8698954  main -> main




To https://huggingface.co/soymia/meister-mindmap-model-pytorch
   8698954..27f0a88  main -> main





'https://huggingface.co/soymia/meister-mindmap-model-pytorch/commit/869895408cc5d70be407053e5de911b81b7ef750'

## Inference

We can test inference easily using the pipeline from Huggingface transformers. It works with multilingual input out of the box.

In [31]:
text, label = dataset['test'][788]['text'], dataset['test'][788]['label']

In [32]:
text, label

('Easense Legal ปฏิบัติการเคมีทั่วไป 2 (ความร้อนของปฏิกิริยาเคมี)', 4)

In [33]:
classifier = pipeline("text-classification", model="soymia/meister-mindmap-model-pytorch")
classifier(text)

[{'label': 'Other', 'score': 0.9953964352607727}]
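The pipeline output is a label name with a score, while the dataset stores the true label as an integer. The sketch below shows roughly how logits become that `label`/`score` pair: a softmax followed by an `id2label` lookup. The logits and the three-entry mapping are made up for illustration; the real model has its own label set.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits and label names
logits = [0.2, -1.0, 5.5]
id2label = {0: "Business", 1: "Education", 2: "Other"}

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
result = {"label": id2label[best], "score": probs[best]}
print(result)
```

To compare the prediction above with the ground truth, the stored integer can be translated the same way, for example `id2label[label]` with the mapping built earlier in this notebook.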