## Inference with Teacher-Student Models

Knowledge distillation is a popular model-compression technique in deep learning: a smaller, simpler model (the "student") is trained to mimic the behavior of a larger, more complex model (the "teacher"). The aim is to make deep learning models more efficient by compressing the knowledge of the larger model into a smaller one without sacrificing too much accuracy.
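To make the idea concrete, here is a minimal sketch of a distillation objective in plain Python. It assumes the common soft-target formulation, in which temperature-softened teacher and student distributions are compared with a KL divergence. The notebook does not state which loss was used to train these models, so the helper names (`softmax`, `distillation_loss`) and the temperature value are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by `temperature`."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Hypothetical helper: the temperature**2 scaling is the usual
    convention for soft-target distillation, assumed here.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student))
    return kl * temperature ** 2


# A student that matches the teacher incurs zero loss; one that disagrees does not.
teacher = [2.0, -1.0]
matching_loss = distillation_loss([2.0, -1.0], teacher)
disagreeing_loss = distillation_loss([-1.0, 2.0], teacher)
print(matching_loss, disagreeing_loss)
```

In practice this term is usually combined with the ordinary supervised loss on the true labels; the weighting between the two is another assumption that the notebook does not specify.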

In this notebook, we measure the inference time of custom knowledge distillation models using GPU acceleration with PyTorch. We focus on three previously trained models: '12layer_pruned80-none', '12layer_pruned90-none', and 'teacher'. Our objective is to calculate the inference time of each model on a set of PDFs and to compare the pruned student models with the teacher.

In [1]:
import os
import pathlib
from dotenv import load_dotenv
from datasets import Dataset, DatasetDict
import pandas as pd
from transformers import AutoModelForSequenceClassification
from src.data.s3_communication import S3Communication, S3FileType
from src.components.utils.kpi_mapping import get_kpi_mapping_category
import json
import time
import config
from transformers import AutoTokenizer
import random
from torch import cuda
import torch
import warnings
warnings.filterwarnings("ignore")
device = 'cuda' if cuda.is_available() else 'cpu'

In [2]:
# Load credentials
dotenv_dir = os.environ.get(
    "CREDENTIAL_DOTENV_DIR", os.environ.get("PWD", "/opt/app-root/src")
)
dotenv_path = pathlib.Path(dotenv_dir) / "credentials.env"
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path=dotenv_path, override=True)

In [3]:
# init s3 connector
s3c = S3Communication(
    s3_endpoint_url=os.getenv("S3_ENDPOINT"),
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
    s3_bucket=os.getenv("S3_BUCKET"),
)

## Helper functions

In [4]:
def create_batches(data_df, batch_size=32):
    """Tokenize (question, sentence) pairs in batches using the global `tokenizer`."""
    encoded_dataset = list()
    batch = list()
    for _, row in data_df.iterrows():
        if len(batch) < batch_size:
            batch.append([row['question'], row['sentence']])
        else:
            encoded_dataset.append(tokenizer(batch,
                                             truncation=True,
                                             return_tensors='pt',
                                             padding=True))
            batch = [[row['question'], row['sentence']]]
    if batch:
        encoded_dataset.append(tokenizer(batch,
                                         truncation=True,
                                         return_tensors='pt',
                                         padding=True))
    return encoded_dataset
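The helper above groups rows into fixed-size batches before tokenizing them. The grouping step on its own can be sketched as follows, without a tokenizer; `chunked` is a hypothetical helper and the toy pairs stand in for the DataFrame rows.

```python
def chunked(items, batch_size=32):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Toy (question, sentence) pairs standing in for the DataFrame rows.
pairs = [[f"question {i}", f"sentence {i}"] for i in range(70)]
batches = list(chunked(pairs, batch_size=32))
print([len(b) for b in batches])  # [32, 32, 6]
```

Only the last batch may be smaller than `batch_size`, which is also how `create_batches` behaves.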

In [5]:
def gather_data(pdf_name, pdf_path):
    pdf_content = read_text_from_json(pdf_path)
    text_data = []
    # Build all possible combinations of paragraphs and questions.
    # Keep track of page number which the text is extracted from and
    # the pdf it belongs to.
    for kpi_question in questions:
        text_data.extend([{
            "page": page_num,
            "pdf_name": pdf_name,
            "question": kpi_question,
            "sentence": paragraph}
            for page_num, page_content in pdf_content.items()
            for paragraph in page_content])
    return text_data
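Because `gather_data` pairs every question with every paragraph, the number of data points per PDF is the number of questions times the number of paragraphs. A small sketch on made-up data illustrates this; the page contents, the questions, and the name `example_report` are all hypothetical.

```python
from itertools import product

# Hypothetical extracted content: page number -> list of paragraphs.
pdf_content = {"0": ["para A", "para B"], "1": ["para C"]}
questions = ["What is the net zero target?", "What are the Scope 1 emissions?"]

# Flatten pages into (page, paragraph) pairs, then cross with the questions.
paragraphs = [(page, text) for page, texts in pdf_content.items() for text in texts]
rows = [
    {"page": page, "pdf_name": "example_report", "question": q, "sentence": text}
    for q, (page, text) in product(questions, paragraphs)
]
print(len(rows))  # 2 questions x 3 paragraphs = 6
```

This multiplication is why a report with a few hundred pages produces tens of thousands of data points in the benchmark.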

In [6]:
def predict(encoded_dataset):
    outputs = list()
    for batch in encoded_dataset:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        with torch.no_grad():
            outs = model(input_ids=input_ids, attention_mask=attention_mask)
            outputs.extend(outs.logits.argmax(axis=1).tolist())
    return outputs

In [7]:
def read_text_from_json(file):
    """Read text from json."""

    with open(file) as f:
        text = json.load(f)
        return text

## Retrieve the test dataset and the trained models

In [8]:
s3c.download_files_in_prefix_to_dir(
    config.BASE_TRAIN_TEST_DATASET_S3_PREFIX,
    config.BASE_PROCESSED_DATA)

In [9]:
test_data_path = str(config.BASE_PROCESSED_DATA)+'/rel_test_split.csv'
test_data = pd.read_csv(test_data_path, index_col=0)
test_data.rename(columns={'text': 'question', 'text_b':'sentence'}, inplace=True)

train_data_path = str(config.BASE_PROCESSED_DATA)+'/rel_train_split.csv'
train_data = pd.read_csv(train_data_path, index_col=0)
train_data.rename(columns={'text': 'question', 'text_b':'sentence'}, inplace=True)

In [10]:
trds = Dataset.from_pandas(train_data)
teds = Dataset.from_pandas(test_data.drop('label', axis=1))

climate_dataset = DatasetDict()

climate_dataset['train'] = trds
climate_dataset['test'] = teds

In [11]:
climate_dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'question', 'sentence', '__index_level_0__'],
        num_rows: 2033
    })
    test: Dataset({
        features: ['question', 'sentence', '__index_level_0__'],
        num_rows: 509
    })
})

## PDFs

In [12]:
config.DATA_FOLDER

PosixPath('/opt/app-root/src/data')

In [13]:
BENCHMARK_FOLDER = config.DATA_FOLDER
if not os.path.exists(BENCHMARK_FOLDER):
    BENCHMARK_FOLDER.mkdir(parents=True, exist_ok=True)

BENCHMARK_EXTRACTION_FOLDER = BENCHMARK_FOLDER / "extraction"
if not os.path.exists(BENCHMARK_EXTRACTION_FOLDER):
    pathlib.Path(BENCHMARK_EXTRACTION_FOLDER).mkdir(parents=True, exist_ok=True)

In [14]:
kpi_df = s3c.download_df_from_s3(
    "aicoe-osc-demo/kpi_mapping",
    "kpi_mapping.csv",
    filetype=S3FileType.CSV,
    header=0)

kmc = get_kpi_mapping_category(kpi_df)
questions = [q_text for q_id, (q_text, sect) in kmc["KPI_MAPPING_MODEL"].items()
             if len(set(sect).intersection({"OG", "CM", "CU"})) > 0
             and "TEXT" in kmc["KPI_CATEGORY"][q_id]]

text_paths = sorted(BENCHMARK_EXTRACTION_FOLDER.rglob("*.json"))
all_text_path_dict = {os.path.splitext(os.path.basename(file_path))[0]:
                      file_path for file_path in text_paths
                      if "table_meta" not in str(file_path)}

In [15]:
# Choosing 10 random pdfs
# (random.sample needs a sequence, so materialize the dict items as a list)
all_text_path_dict = dict(random.sample(list(all_text_path_dict.items()), 10))
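The selection above changes on every run, which makes timings hard to compare between sessions. One option, sketched here as an assumption rather than what the notebook does, is to sample with a seeded generator; the `paths` dictionary and the seed value are hypothetical.

```python
import random

# Hypothetical stand-in for all_text_path_dict.
paths = {f"report_{i}": f"/data/extraction/report_{i}.json" for i in range(25)}

rng = random.Random(42)  # fixed seed (arbitrary value) for a repeatable selection
subset = dict(rng.sample(sorted(paths.items()), 10))
print(len(subset))
```

Sorting the items first makes the result independent of the dictionary's insertion order, so the same seed always selects the same PDFs.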

In [16]:
df_list = []
metrics_df_list = []
metrics_list = []
metric_dfs = pd.DataFrame()
num_pdfs = len(all_text_path_dict)

In [17]:
all_text_path_dict

{'LUKOIL_SUSTAINABILITY_REPORT_2018': PosixPath('/opt/app-root/src/data/extraction/LUKOIL_SUSTAINABILITY_REPORT_2018.json'),
 'equinor-2019-annual-report-and-form-20f': PosixPath('/opt/app-root/src/data/extraction/equinor-2019-annual-report-and-form-20f.json'),
 '413750375_Avista Corp_2019-12-31': PosixPath('/opt/app-root/src/data/extraction/413750375_Avista Corp_2019-12-31.json'),
 '2018 Annual Report': PosixPath('/opt/app-root/src/data/extraction/2018 Annual Report.json'),
 'Ervia-Annual-Report-2018': PosixPath('/opt/app-root/src/data/extraction/Ervia-Annual-Report-2018.json'),
 'SaipemSustainability2018': PosixPath('/opt/app-root/src/data/extraction/SaipemSustainability2018.json'),
 'RN_SR_2016_EN_2_ sustainabilitz 2016': PosixPath('/opt/app-root/src/data/extraction/RN_SR_2016_EN_2_ sustainabilitz 2016.json'),
 'Eskom Holdings SOC Ltd Integrated Report 2019': PosixPath('/opt/app-root/src/data/extraction/Eskom Holdings SOC Ltd Integrated Report 2019.json'),
 'Sustainability Report 20

In [18]:
print(num_pdfs)

10


In [19]:
local_model_paths=['models/12layer_pruned80-none/',
                   'models/12layer_pruned90-none/',
                   'models/teacher/']

model_names=['12layer_pruned80-none',
             '12layer_pruned90-none',
             'teacher']

In [20]:
local_model_paths=['models/teacher/']

model_names = ['teacher']

In [21]:
metric_list = []
for local_model_path, model_name in zip(local_model_paths, model_names):
    # Load the tokenizer and model once per model instead of once per PDF.
    tokenizer = AutoTokenizer.from_pretrained(local_model_path, use_fast=True)
    model = AutoModelForSequenceClassification.from_pretrained(local_model_path).to(device)
    for i, (pdf_name, file_path) in enumerate(all_text_path_dict.items()):
        print(f"loop : {i}")

        print(f'Processing {i+1}/{len(all_text_path_dict)}, {pdf_name}')
        data = gather_data(pdf_name, file_path)
        num_data_points = len(data)
        num_pages = data[len(data)-1]['page']
        chunk_size = 1000
        chunk_idx = 0
        total_file_time = 0

        predictions = list()
        while chunk_idx * chunk_size < num_data_points:
            chunk_start = time.time()
            data_chunk = data[chunk_idx * chunk_size:(chunk_idx + 1) * chunk_size]
            temp_df = pd.DataFrame(data_chunk).drop(['pdf_name', 'page'], axis=1)
            encoded_dataset = create_batches(temp_df, batch_size=128)
            predictions.extend(predict(encoded_dataset))
            chunk_idx += 1
            chunk_end = time.time()
            total_file_time += (chunk_end - chunk_start)

        time_per_data_point = total_file_time / num_data_points
        data_points_per_sec = 1/time_per_data_point
        model_size = os.path.getsize(local_model_path + 'pytorch_model.bin')/1000000

        metric_list.append(
            {'Model Name':model_name,
             'Model Size(MB)': model_size,
             'PDF Name':pdf_name,
             'Number of Pages':int(num_pages),
             'Number of Data Points':num_data_points,
             'Total Inference Time':total_file_time,
             'Time per data point':time_per_data_point,
             'Data points per sec':data_points_per_sec})

    file_to_save = pd.DataFrame(metric_list)
    file_to_save.to_csv(f"file_to_save_{model_name}.csv")

loop : 0
Processing 1/10, LUKOIL_SUSTAINABILITY_REPORT_2018
loop : 1
Processing 2/10, equinor-2019-annual-report-and-form-20f
loop : 2
Processing 3/10, 413750375_Avista Corp_2019-12-31
loop : 3
Processing 4/10, 2018 Annual Report
loop : 4
Processing 5/10, Ervia-Annual-Report-2018
loop : 5
Processing 6/10, SaipemSustainability2018
loop : 6
Processing 7/10, RN_SR_2016_EN_2_ sustainabilitz 2016
loop : 7
Processing 8/10, Eskom Holdings SOC Ltd Integrated Report 2019
loop : 8
Processing 9/10, Sustainability Report 2016_EN
loop : 9
Processing 10/10, annual_report_2019_eng
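The benchmarking loop accumulates wall-clock time chunk by chunk and divides by the number of data points. The sketch below isolates that timing pattern, with two assumptions that differ from the notebook: it uses `time.perf_counter` (a monotonic clock intended for measuring intervals) instead of `time.time`, and `fake_predict` is a hypothetical stand-in for tokenization plus model inference.

```python
import time


def time_in_chunks(data, predict_fn, chunk_size=1000):
    """Run `predict_fn` over `data` chunk by chunk; return (predictions, seconds)."""
    predictions = []
    total_seconds = 0.0
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        t0 = time.perf_counter()
        predictions.extend(predict_fn(chunk))
        total_seconds += time.perf_counter() - t0
    return predictions, total_seconds


def fake_predict(chunk):
    """Hypothetical stand-in for tokenization plus model inference."""
    return [len(item) % 2 for item in chunk]


data = [f"sentence {i}" for i in range(2500)]
preds, seconds = time_in_chunks(data, fake_predict)
print(len(preds), f"{seconds / len(data):.2e} s per data point")
```

On a GPU, CUDA kernels are launched asynchronously, so a timed region should end with a synchronization point. In the notebook's `predict`, the `.tolist()` call copies the results to the host, which has that effect.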


In [22]:
metric_dfs = pd.DataFrame(metric_list)

In [23]:
metric_dfs.head()

| | Model Name | Model Size(MB) | PDF Name | Number of Pages | Number of Data Points | Total Inference Time | Time per data point | Data points per sec |
|---|---|---|---|---|---|---|---|---|
| 0 | teacher | 438.011337 | LUKOIL_SUSTAINABILITY_REPORT_2018 | 81 | 52392 | 201.269272 | 0.003842 | 260.307992 |
| 1 | teacher | 438.011337 | equinor-2019-annual-report-and-form-20f | 317 | 82368 | 489.522515 | 0.005943 | 168.261924 |
| 2 | teacher | 438.011337 | 413750375_Avista Corp_2019-12-31 | 79 | 81072 | 108.448538 | 0.001338 | 747.561948 |
| 3 | teacher | 438.011337 | 2018 Annual Report | 147 | 34152 | 205.439296 | 0.006015 | 166.238887 |
| 4 | teacher | 438.011337 | Ervia-Annual-Report-2018 | 155 | 41232 | 151.55162 | 0.003676 | 272.065717 |


**Model Name: 12layer_pruned80-none, Size: 438.01 MB**

In [24]:
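# metric_list accumulates across models in the benchmarking loop, so the CSV saved
# after the 12layer_pruned90-none run also contains the 12layer_pruned80-none rows.
df = pd.read_csv("file_to_save_12layer_pruned90-none.csv")
df1 = df[df['Model Name']=='12layer_pruned80-none']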

In [25]:
df1.head()

| | Unnamed: 0 | Model Name | Model Size(MB) | PDF Name | Number of Pages | Number of Data Points | Total Inference Time | Time per data point | Data points per sec |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 12layer_pruned80-none | 438.011337 | 04_NOVATEK_AR_2016_ENG_11 | 119 | 24264 | 115.039156 | 0.004741 | 210.91949 |
| 1 | 1 | 12layer_pruned80-none | 438.011337 | 04_NOVATEK_AR_2018_ENG_15 | 105 | 23112 | 106.168256 | 0.004594 | 217.692189 |
| 2 | 2 | 12layer_pruned80-none | 438.011337 | 2013_book_mol_ar_eng_fin | 135 | 63024 | 377.481622 | 0.005989 | 166.959122 |
| 3 | 3 | 12layer_pruned80-none | 438.011337 | 2015_BASF_Report | 261 | 76392 | 482.745011 | 0.006319 | 158.245033 |
| 4 | 4 | 12layer_pruned80-none | 438.011337 | 2017 Sustainability Report | 57 | 21936 | 64.46151 | 0.002939 | 340.296093 |


In [26]:
# Average time per data point

df1['Time per data point'].describe()

count    153.000000
mean       0.004110
std        0.001773
min        0.000887
25%        0.003060
50%        0.003964
75%        0.004961
max        0.009390
Name: Time per data point, dtype: float64

The average time per data point is 0.004110 seconds. An average PDF with ~151 pages and ~312 data points per page will therefore take ~3.2 minutes to process.
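The estimate above is the product of pages, data points per page, and seconds per data point, converted to minutes. A short sketch of the arithmetic, using the rounded figures quoted in the text (`estimated_minutes` is a hypothetical helper, not part of the notebook):

```python
def estimated_minutes(pages, points_per_page, seconds_per_point):
    """Estimated inference time in minutes for one PDF."""
    return pages * points_per_page * seconds_per_point / 60


# Figures quoted above for 12layer_pruned80-none.
minutes = estimated_minutes(pages=151, points_per_page=312, seconds_per_point=0.004110)
print(round(minutes, 1))  # 3.2
```

Since the page and per-page counts are themselves rounded averages, the result is only an approximation.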

**Model Name: 12layer_pruned90-none, Size: 438.01 MB**

In [29]:
df2 = df[df['Model Name']=='12layer_pruned90-none']

In [30]:
df2.head()

| | Unnamed: 0 | Model Name | Model Size(MB) | PDF Name | Number of Pages | Number of Data Points | Total Inference Time | Time per data point | Data points per sec |
|---|---|---|---|---|---|---|---|---|---|
| 153 | 153 | 12layer_pruned90-none | 438.011337 | 04_NOVATEK_AR_2016_ENG_11 | 119 | 24264 | 116.594925 | 0.004805 | 208.105112 |
| 154 | 154 | 12layer_pruned90-none | 438.011337 | 04_NOVATEK_AR_2018_ENG_15 | 105 | 23112 | 106.076298 | 0.00459 | 217.880906 |
| 155 | 155 | 12layer_pruned90-none | 438.011337 | 2013_book_mol_ar_eng_fin | 135 | 63024 | 375.350579 | 0.005956 | 167.907028 |
| 156 | 156 | 12layer_pruned90-none | 438.011337 | 2015_BASF_Report | 261 | 76392 | 479.766905 | 0.00628 | 159.227323 |
| 157 | 157 | 12layer_pruned90-none | 438.011337 | 2017 Sustainability Report | 57 | 21936 | 63.929079 | 0.002914 | 343.130235 |


In [31]:
# Average time per data point

df2['Time per data point'].describe()

count    153.000000
mean       0.004088
std        0.001764
min        0.000882
25%        0.003055
50%        0.003934
75%        0.004939
max        0.009329
Name: Time per data point, dtype: float64

The average time per data point is 0.004088 seconds. An average PDF with ~151 pages and ~312 data points per page will therefore take ~3.2 minutes to process.

**Model Name: teacher, Size: 438.01 MB**

In [32]:
df = pd.read_csv("file_to_save_teacher.csv")
df3 = df[df['Model Name']=='teacher']

In [33]:
df3.head()

| | Unnamed: 0 | Model Name | Model Size(MB) | PDF Name | Number of Pages | Number of Data Points | Total Inference Time | Time per data point | Data points per sec |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | teacher | 438.011337 | LUKOIL_SUSTAINABILITY_REPORT_2018 | 81 | 52392 | 201.269272 | 0.003842 | 260.307992 |
| 1 | 1 | teacher | 438.011337 | equinor-2019-annual-report-and-form-20f | 317 | 82368 | 489.522515 | 0.005943 | 168.261924 |
| 2 | 2 | teacher | 438.011337 | 413750375_Avista Corp_2019-12-31 | 79 | 81072 | 108.448538 | 0.001338 | 747.561948 |
| 3 | 3 | teacher | 438.011337 | 2018 Annual Report | 147 | 34152 | 205.439296 | 0.006015 | 166.238887 |
| 4 | 4 | teacher | 438.011337 | Ervia-Annual-Report-2018 | 155 | 41232 | 151.55162 | 0.003676 | 272.065717 |


In [34]:
# Average time per data point

df3['Time per data point'].describe()

count    10.000000
mean      0.004445
std       0.002242
min       0.001338
25%       0.003455
50%       0.003757
75%       0.005418
max       0.009573
Name: Time per data point, dtype: float64

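The average time per data point is 0.004445 seconds. An average PDF with ~151 pages and ~312 data points per page will therefore take ~3.4 minutes to process. Note that this average is based on only 10 PDFs, whereas the averages for the pruned models are based on 153.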

## Conclusion

Here we build a summary table with the recipe, size, estimated inference time, and F1 score of each model.

In [6]:
recipe = 'zoo:nlp/masked_language_modeling/bert-base/pytorch/' \
         'huggingface/bookcorpus_wikitext/12layer_pruned90-none?' \
         'recipe_type=transfer-MNLI'

In [35]:
final_table={'model_name':['12layer_pruned90-none',
                           '12layer_pruned80-none',
                           'teacher'],
             'Recipe_used':[recipe,
                            recipe,
                            'N/A',],
             'Size (MB)':[os.path.getsize('models/12layer_pruned90-none/pytorch_model.bin')/1000000,
                          os.path.getsize('models/12layer_pruned80-none/pytorch_model.bin')/1000000,
                          os.path.getsize('models/teacher/pytorch_model.bin')/1000000],
             'inference_time_for_avg_pdf (mins)':[3.2, 3.2, 3.4],
             'F1-Score':[86.50, 86.91, 91.24]}

In [36]:
pd.DataFrame(final_table)

| | model_name | Recipe_used | Size (MB) | inference_time_for_avg_pdf (mins) | F1-Score |
|---|---|---|---|---|---|
| 0 | 12layer_pruned90-none | zoo:nlp/masked_language_modeling/bert-base/pyt... | 438.011337 | 3.2 | 86.5 |
| 1 | 12layer_pruned80-none | zoo:nlp/masked_language_modeling/bert-base/pyt... | 438.011337 | 3.2 | 86.91 |
| 2 | teacher | | 438.011337 | 3.4 | 91.24 |


Based on our experimentation, we have evaluated three models: the two pruned students trained with custom knowledge distillation, '12layer_pruned90-none' and '12layer_pruned80-none', and their 'teacher'. These models were tested on GPU using PyTorch for calculating the inference time and F1 score.

In this benchmark, the distilled models were slightly faster than the teacher (~3.2 versus ~3.4 minutes for an average PDF), at the cost of a lower F1 score (86.50 and 86.91 versus 91.24). Their size on disk is the same as the teacher's (438.01 MB), so pruning alone did not reduce storage here. The teacher timings are based on 10 PDFs, whereas the pruned models were timed on 153, so the comparison should be read with that in mind. Running on a GPU with PyTorch keeps inference to a few minutes per PDF, which facilitates the use of these models in real-world applications.