# Teacher and Student Model

Knowledge distillation is a deep learning technique that trains a smaller, simpler model (the "student") to mimic the behavior of a larger, more complex model (the "teacher"). The goal is to compress the teacher's knowledge into a cheaper model without sacrificing too much accuracy.

In this notebook the teacher is 'bert-base-uncased', fine-tuned on our dataset, and the student is a pruned version of 'bert-base-uncased' taken from the SparseZoo. The student is trained with the teacher's guidance so that it keeps as much of the teacher's accuracy as possible.

Training runs on PyTorch through the SparseML text classification commands. We then evaluate the teacher and the students by computing the F1 score per question on the test split and averaging it across questions.
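To make the idea of "mimicking the teacher" concrete, here is a minimal, framework-free sketch of a commonly used distillation objective: a blend of the usual cross-entropy on the true label and a KL term that pulls the student's softened distribution toward the teacher's. The function names and the `temperature` and `hardness` defaults are illustrative assumptions, not values taken from this notebook or from the SparseML recipe, which handles the loss internally.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, scaled by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, hardness=0.5):
    """Blend hard-label cross-entropy with a soft-target KL term.

    `temperature` and `hardness` are hypothetical names and defaults
    chosen for illustration only.
    """
    # Hard term: cross-entropy between the student prediction and the true label.
    student_probs = softmax(student_logits)
    hard = -math.log(student_probs[label])

    # Soft term: KL(teacher || student) on temperature-softened distributions.
    t_soft = softmax(teacher_logits, temperature)
    s_soft = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s) for t, s in zip(t_soft, s_soft) if t > 0)

    # The temperature**2 factor keeps the soft-term gradients on a comparable scale.
    return hardness * hard + (1.0 - hardness) * temperature ** 2 * soft


teacher = [2.0, -1.0]
print(distillation_loss([1.5, -0.5], teacher, label=0))  # student agrees with teacher
print(distillation_loss([-0.5, 1.5], teacher, label=0))  # student disagrees
```

A student whose logits agree with the teacher's gets a lower loss than one that disagrees, which is the signal the student is trained on.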

In [27]:
import os
import pathlib
import warnings
import zipfile
from io import BytesIO
from pathlib import Path

import numpy as np
import pandas as pd
import torch
import transformers
from datasets import Dataset, DatasetDict
from deepsparse import Pipeline
from dotenv import load_dotenv
from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score
from torch import cuda
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer

import config
from src.data.s3_communication import S3Communication

# Run on the GPU when one is available
device = 'cuda' if cuda.is_available() else 'cpu'

warnings.filterwarnings('always')  # "error", "ignore", "always", "default", "module" or "once"

In [28]:
# Load credentials
dotenv_dir = os.environ.get(
    "CREDENTIAL_DOTENV_DIR", os.environ.get("PWD", "/opt/app-root/src")
)
dotenv_path = pathlib.Path(dotenv_dir) / "credentials.env"
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path=dotenv_path, override=True)

In [30]:
# init s3 connector
s3c = S3Communication(
    s3_endpoint_url=os.getenv("S3_ENDPOINT"),
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
    s3_bucket=os.getenv("S3_BUCKET"),
)

# Process dataset for sparseml training


In [31]:
s3c.download_files_in_prefix_to_dir(
    config.BASE_TRAIN_TEST_DATASET_S3_PREFIX,
    config.BASE_PROCESSED_DATA)

In [33]:
test_data_path = str(config.BASE_PROCESSED_DATA)+'/rel_test_split.csv'
test_data = pd.read_csv(test_data_path, index_col=0)
test_data.rename(columns={'text': 'question', 'text_b':'sentence'}, inplace=True)

train_data_path = str(config.BASE_PROCESSED_DATA)+'/rel_train_split.csv'
train_data = pd.read_csv(train_data_path, index_col=0)
train_data.rename(columns={'text': 'question', 'text_b':'sentence'}, inplace=True)

train_data.to_csv(train_data_path)
test_data.to_csv(test_data_path)

In [34]:
trds = Dataset.from_pandas(train_data)
teds = Dataset.from_pandas(test_data.drop('label', axis=1))

climate_dataset = DatasetDict()

climate_dataset['train'] = trds
climate_dataset['test'] = teds

In [35]:
climate_dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'question', 'sentence', '__index_level_0__'],
        num_rows: 2033
    })
    test: Dataset({
        features: ['question', 'sentence', '__index_level_0__'],
        num_rows: 509
    })
})

# Teacher Model

In [2]:
#!sparseml.transformers.text_classification --help

In [34]:
!sparseml.transformers.text_classification \
--model_name_or_path bert-base-uncased \
--train_file '/opt/app-root/src/data/processed/rel_train_split.csv' \
--validation_file '/opt/app-root/src/data/processed/rel_test_split.csv' \
--label_column_name 'label' \
--input_column_name 'question,sentence' \
--do_train --do_eval --evaluation_strategy epoch \
--per_device_train_batch_size 32 \
--learning_rate 5e-5 \
--max_seq_length 128 \
--output_dir models/teacher \
--num_train_epochs 8 \
--metric_for_best_model 'f1' \
--overwrite_output_dir \
--seed 2021

# For deepsparse
!sparseml.transformers.export_onnx \
--model_path models/teacher/ \
--task 'text-classification'

# Sparse Student Model

In [35]:

!sparseml.transformers.train.text_classification \
--model_name_or_path 'zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/bookcorpus_wikitext/12layer_pruned80-none' \
--distill_teacher models/teacher \
--train_file '/opt/app-root/src/data/processed/rel_train_split.csv' \
--validation_file '/opt/app-root/src/data/processed/rel_test_split.csv' \
--label_column_name 'label' \
--input_column_name 'question,sentence' \
--do_train --do_eval --evaluation_strategy epoch \
--per_device_train_batch_size 16 \
--learning_rate 5e-4 \
--warmup_steps 11000 \
--output_dir models/12layer_pruned80-none \
--seed 11712 \
--num_train_epochs 50 \
--save_strategy epoch \
--save_total_limit 1 \
--metric_for_best_model 'f1' \
--overwrite_output_dir \
--recipe 'zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/bookcorpus_wikitext/12layer_pruned80-none?recipe_type=transfer-MNLI'

# For deepsparse
!sparseml.transformers.export_onnx \
--model_path models/12layer_pruned80-none/ \
--task 'text-classification'

In [11]:
def create_batches(data_df, tokenizer, batch_size=32):
    """Tokenize (question, sentence) pairs in batches of at most `batch_size` rows."""
    pairs = data_df[['question', 'sentence']].values.tolist()
    encoded_dataset = []
    for start in range(0, len(pairs), batch_size):
        encoded_dataset.append(tokenizer(pairs[start:start + batch_size],
                                         truncation=True,
                                         return_tensors='pt',
                                         padding=True))
    return encoded_dataset


def predict(encoded_dataset, model):
    """Return the predicted class index for every row, in dataset order."""
    model.eval()  # make sure dropout is disabled during inference
    outputs = []
    for batch in encoded_dataset:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        with torch.no_grad():
            outs = model(input_ids=input_ids, attention_mask=attention_mask)
            outputs.extend(outs.logits.argmax(axis=1).tolist())
    return outputs


def get_model_f1score(model_path, test_data):
    """Load the model at `model_path` and return its F1 score averaged over questions."""
    tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
    model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)

    encoded_dataset = create_batches(test_data, tokenizer)
    # Note: this adds (or overwrites) a "pred" column on the caller's DataFrame
    test_data["pred"] = predict(encoded_dataset, model)

    # kpi wise performance metrics: one column of scores per question
    scores = {}
    for group, data in test_data.groupby("question"):
        pred = data.pred
        true = data.label
        scores[group] = {
            "accuracy": accuracy_score(true, pred),
            "f1_score": f1_score(true, pred),
            "recall_score": recall_score(true, pred),
            "precision_score": precision_score(true, pred),
            "support": len(pred),
        }

    scores_df = pd.DataFrame(scores)
    return scores_df.loc['f1_score'].mean()

In [12]:
get_model_f1score('models/teacher/',test_data)

0.9124404257977322
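
The number above is not a single F1 over the whole test set: `get_model_f1score` computes a binary F1 per question and then averages those values. The pure-Python sketch below mirrors that calculation on hypothetical toy rows (not the climate dataset) and shows how a question with no positive labels or predictions contributes an F1 of 0.0 to the mean. scikit-learn's default behaviour is the same, with a warning.

```python
from collections import defaultdict


def binary_f1(true, pred):
    """F1 for the positive class (label 1); returns 0.0 when it is undefined."""
    tp = sum(1 for t, p in zip(true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true, pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mean_f1_by_group(rows):
    """Average the per-group F1 over rows of (group, true_label, predicted_label)."""
    groups = defaultdict(lambda: ([], []))
    for group, true, pred in rows:
        groups[group][0].append(true)
        groups[group][1].append(pred)
    scores = {g: binary_f1(t, p) for g, (t, p) in groups.items()}
    return sum(scores.values()) / len(scores), scores


# Hypothetical toy rows, made up for illustration.
rows = [
    ("q1", 1, 1), ("q1", 0, 0), ("q1", 1, 0),  # one hit, one miss: F1 = 2/3
    ("q2", 0, 0), ("q2", 0, 0),                # no positives at all: F1 = 0.0
]
mean, per_group = mean_f1_by_group(rows)
print(per_group, mean)
```

Because every question carries equal weight, questions with few or no positive examples can pull the average down noticeably.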

In [13]:
get_model_f1score('models/12layer_pruned80-none/',test_data)

0.8691582632415494

In [14]:
get_model_f1score('models/12layer_pruned90-none/',test_data)

0.8650101491961957

# Saving Models to S3

In [15]:
def save_model(local_path, model_name):
    """Zip the model directory in memory and upload it to S3 as `<model_name>.zip`."""
    buffer = BytesIO()
    with zipfile.ZipFile(buffer, 'w') as z:
        for dirname, _, files in os.walk(local_path):
            for f in files:
                f_path = os.path.join(dirname, f)
                # Keep the path relative to local_path so that files with the
                # same name in different subfolders do not collide in the archive
                rel_path = os.path.relpath(f_path, local_path)
                with open(f_path, 'rb') as file_content:
                    z.writestr(f"{model_name}/{rel_path}", file_content.read())
    buffer.seek(0)
    # upload model to s3
    s3c._upload_bytes(
        buffer_bytes=buffer,
        prefix=config.BASE_SAVED_MODELS_S3_PREFIX,
        key=f"{model_name}.zip"
    )
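
One way to sanity-check an in-memory archive before uploading it is to zip a small directory and list the entry names. The sketch below does this with the standard library only. The helper name `zip_dir_to_buffer` and the temporary directory layout are hypothetical, and each file keeps its path relative to the source directory so that same-named files in different subfolders stay distinct.

```python
import os
import tempfile
import zipfile
from io import BytesIO


def zip_dir_to_buffer(local_path, archive_root):
    """Zip `local_path` into memory, keeping each file's relative path."""
    buffer = BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for dirname, _, files in os.walk(local_path):
            for name in files:
                file_path = os.path.join(dirname, name)
                rel_path = os.path.relpath(file_path, local_path)
                archive.write(file_path, arcname=f"{archive_root}/{rel_path}")
    buffer.seek(0)
    return buffer


# Hypothetical layout: two files share a name but live in different folders.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "checkpoint-1"))
    for rel in ("config.json", os.path.join("checkpoint-1", "config.json")):
        with open(os.path.join(tmp, rel), "w") as handle:
            handle.write("{}")
    with zipfile.ZipFile(zip_dir_to_buffer(tmp, "teacher")) as archive:
        names = sorted(archive.namelist())

print(names)
```

Both `config.json` files appear in the listing under their own paths, so neither overwrites or duplicates the other.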

In [16]:
save_model('models/teacher','teacher')
save_model('models/12layer_pruned80-none','12layer_pruned80-none')
save_model('models/12layer_pruned90-none','12layer_pruned90-none')


# Conclusion

In [23]:
# F1 scores are rounded from the outputs of the evaluation cells above
F1_table = {'model_name': ['12layer_pruned90-none',
                           '12layer_pruned80-none',
                           'teacher'],
            'Recipe_used': ['zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/bookcorpus_wikitext/12layer_pruned90-none?recipe_type=transfer-MNLI',
                            'zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/bookcorpus_wikitext/12layer_pruned80-none?recipe_type=transfer-MNLI',
                            ''],
            'Size (MB)': [os.path.getsize('models/12layer_pruned90-none/pytorch_model.bin') / 1000000,
                          os.path.getsize('models/12layer_pruned80-none/pytorch_model.bin') / 1000000,
                          os.path.getsize('models/teacher/pytorch_model.bin') / 1000000],
            'F1-Score': [0.865, 0.869, 0.912]}

In [24]:
pd.DataFrame(F1_table)

| | model_name | Recipe_used | Size (MB) | F1-Score |
|---|---|---|---|---|
| 0 | 12layer_pruned90-none | zoo:nlp/masked_language_modeling/bert-base/pyt... | 438.011337 | 0.865 |
| 1 | 12layer_pruned80-none | zoo:nlp/masked_language_modeling/bert-base/pyt... | 438.011337 | 0.869 |
| 2 | teacher | | 438.011337 | 0.912 |


In conclusion, we compared the fine-tuned 'teacher' model with two distilled sparse students, '12layer_pruned80-none' and '12layer_pruned90-none'. The teacher reached the highest mean per-question F1 score (0.912), while the students scored 0.869 and 0.865 respectively, a drop of roughly 0.04 to 0.05. The two students were trained with different recipes (the pruned80 and pruned90 transfer recipes) yet performed almost identically, which suggests that moving from the pruned80 to the pruned90 variant cost little accuracy on this dataset. All three 'pytorch_model.bin' files have the same size (about 438 MB), so the checkpoint size on disk does not reflect the pruning.

These results suggest that knowledge distillation can transfer most of the teacher's accuracy to a heavily pruned student, even though pruning did not reduce the size of the saved checkpoints. Whether the pruned students are actually cheaper to serve still has to be measured, for example on the ONNX models exported for DeepSparse, which this notebook does not benchmark.
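
A possible explanation for the identical file sizes is that pruned weights are stored as explicit zeros in dense tensors, so the serialized byte count does not change. This is an assumption about how the checkpoints are saved, not something verified in this notebook. The standard-library sketch below illustrates the effect on a hypothetical list of random weights: zeroing 80% of them changes the sparsity but not the number of bytes written.

```python
import random
import struct


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


def serialize_dense(weights):
    """Pack every weight as a 4-byte float, zeros included."""
    return struct.pack(f"{len(weights)}f", *weights)


# Hypothetical weights, generated only for this illustration.
random.seed(0)
dense = [random.uniform(-1.0, 1.0) for _ in range(10_000)]

# Magnitude pruning: zero out the 80% of weights with the smallest magnitude.
threshold = sorted(abs(w) for w in dense)[int(0.8 * len(dense))]
pruned = [w if abs(w) >= threshold else 0.0 for w in dense]

print(sparsity(dense), sparsity(pruned))
print(len(serialize_dense(dense)), len(serialize_dense(pruned)))
```

Under this assumption, a measure such as the fraction of zero weights, or the size and latency of the exported models in a sparsity-aware runtime, would be more informative than the checkpoint size in the table.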