## Inference with Teacher-Student Models

Knowledge distillation is a deep learning technique that trains a smaller, simpler model (the "student") to mimic the behavior of a larger, more complex model (the "teacher"). The goal is to make inference more efficient by compressing the knowledge of the larger model into a smaller one without sacrificing too much accuracy. In this notebook, we explore how well DeepSparse and SparseZoo optimize and execute custom knowledge distillation models on the CPU.
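
To make the idea of "mimicking the teacher" concrete, the sketch below computes one common distillation objective: the KL divergence between temperature-softened teacher and student output distributions. This is an illustrative assumption, not necessarily the loss used to train the models in this notebook; the function names `softmax` and `distillation_loss` and the temperature value are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student))
    # Scaling by T^2 is a common convention to keep gradient magnitudes comparable.
    return kl * temperature ** 2

teacher_logits = [2.0, -1.0]  # made-up logits for a two-class problem
print(distillation_loss(teacher_logits, teacher_logits))  # student matches teacher
print(distillation_loss([0.0, 0.0], teacher_logits))      # student is undecided
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge.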

Specifically, we run inference with custom knowledge distillation models using DeepSparse and SparseZoo and measure how the number of CPU cores affects inference time. The experiments cover three pre-trained models: '12layer_pruned80-none', '12layer_pruned90-none', and 'teacher'. Each model is tested on 4 and 8 CPU cores.



In [3]:
import os
import pathlib
from dotenv import load_dotenv
from datasets import Dataset, DatasetDict
import pandas as pd
from transformers import AutoModelForSequenceClassification
from src.data.s3_communication import S3Communication, S3FileType
from src.components.utils.kpi_mapping import get_kpi_mapping_category
import json
import time
import config
import random
from transformers import AutoTokenizer
from torch import cuda
import torch
import warnings
warnings.filterwarnings("ignore")
device = 'cuda' if cuda.is_available() else 'cpu'

In [2]:
# Load credentials
dotenv_dir = os.environ.get(
    "CREDENTIAL_DOTENV_DIR", os.environ.get("PWD", "/opt/app-root/src")
)
dotenv_path = pathlib.Path(dotenv_dir) / "credentials.env"
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path=dotenv_path, override=True)

In [3]:
# init s3 connector
s3c = S3Communication(
    s3_endpoint_url=os.getenv("S3_ENDPOINT"),
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
    s3_bucket=os.getenv("S3_BUCKET"),
)

## Helper functions

In [4]:
def create_batches(data_df, batch_size=32):
    """Tokenize (question, sentence) pairs in batches using the global tokenizer."""
    encoded_dataset = list()
    batch = list()
    for _, row in data_df.iterrows():
        if len(batch) < batch_size:
            batch.append([row['question'], row['sentence']])
        else:
            encoded_dataset.append(tokenizer(batch,
                                             truncation=True,
                                             return_tensors='pt',
                                             padding=True))
            batch = [[row['question'], row['sentence']]]
    if batch:
        encoded_dataset.append(tokenizer(batch,
                                         truncation=True,
                                         return_tensors='pt',
                                         padding=True))
    return encoded_dataset
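
The grouping logic in `create_batches` can be checked without a tokenizer. The following is a minimal sketch, assuming a hypothetical helper `chunk_pairs` (not part of the notebook) that only splits the pairs into consecutive batches and leaves tokenization out.

```python
def chunk_pairs(pairs, batch_size=32):
    """Split a list of (question, sentence) pairs into consecutive batches."""
    return [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]

# Made-up pairs for illustration.
pairs = [(f"question {i}", f"sentence {i}") for i in range(70)]
batches = chunk_pairs(pairs, batch_size=32)
print([len(b) for b in batches])  # [32, 32, 6]
```

As in `create_batches`, the last batch holds whatever is left over, so no row is dropped.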

In [5]:
def gather_data(pdf_name, pdf_path):
    pdf_content = read_text_from_json(pdf_path)
    text_data = []
    # Build all possible combinations of paragraphs and questions.
    # Keep track of the page number the text is extracted from and
    # the pdf it belongs to.
    for kpi_question in questions:
        text_data.extend([{
            "page": page_num,
            "pdf_name": pdf_name,
            "question": kpi_question,
            "sentence": paragraph}
            for page_num, page_content in pdf_content.items()
            for paragraph in page_content])
    return text_data
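
Because `gather_data` pairs every question with every paragraph, the number of data points per PDF is the number of questions times the number of paragraphs. A small sketch on made-up inputs illustrates this; `build_pairs`, `toy_content`, and `toy_questions` are hypothetical names, and the page keys are strings on the assumption that they come from JSON object keys.

```python
def build_pairs(pdf_name, pdf_content, questions):
    """Pair every question with every paragraph, keeping page provenance."""
    return [
        {"page": page_num, "pdf_name": pdf_name,
         "question": question, "sentence": paragraph}
        for question in questions
        for page_num, paragraphs in pdf_content.items()
        for paragraph in paragraphs
    ]

# Toy inputs (made up for illustration).
toy_content = {"0": ["para a", "para b"], "1": ["para c"]}
toy_questions = ["What is the target year?", "What are the emissions?"]
rows = build_pairs("toy_report", toy_content, toy_questions)
print(len(rows))  # 2 questions x 3 paragraphs = 6
```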

In [6]:
def predict(encoded_dataset):
    outputs = list()
    for batch in encoded_dataset:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        with torch.no_grad():
            outs = model(input_ids=input_ids, attention_mask=attention_mask)
            outputs.extend(outs.logits.argmax(axis=1).tolist())
    return outputs

In [7]:
def read_text_from_json(file):
    """Read text from json."""

    with open(file) as f:
        text = json.load(f)
        return text

## Retrieve the test dataset and the trained models

In [8]:
s3c.download_files_in_prefix_to_dir(
    config.BASE_TRAIN_TEST_DATASET_S3_PREFIX,
    config.BASE_PROCESSED_DATA)

In [9]:
test_data_path = str(config.BASE_PROCESSED_DATA)+'/rel_test_split.csv'
test_data = pd.read_csv(test_data_path, index_col=0)
test_data.rename(columns={'text': 'question', 'text_b':'sentence'}, inplace=True)

train_data_path = str(config.BASE_PROCESSED_DATA)+'/rel_train_split.csv'
train_data = pd.read_csv(train_data_path, index_col=0)
train_data.rename(columns={'text': 'question', 'text_b':'sentence'}, inplace=True)

In [10]:
trds = Dataset.from_pandas(train_data)
teds = Dataset.from_pandas(test_data.drop('label', axis=1))

climate_dataset = DatasetDict()

climate_dataset['train'] = trds
climate_dataset['test'] = teds

In [11]:
climate_dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'question', 'sentence', '__index_level_0__'],
        num_rows: 2033
    })
    test: Dataset({
        features: ['question', 'sentence', '__index_level_0__'],
        num_rows: 509
    })
})

## PDFs

In [12]:
config.DATA_FOLDER

PosixPath('/opt/app-root/src/data')

In [13]:
BENCHMARK_FOLDER = config.DATA_FOLDER
if not os.path.exists(BENCHMARK_FOLDER):
    BENCHMARK_FOLDER.mkdir(parents=True, exist_ok=True)

BENCHMARK_EXTRACTION_FOLDER = BENCHMARK_FOLDER / "extraction"
if not os.path.exists(BENCHMARK_EXTRACTION_FOLDER):
    pathlib.Path(BENCHMARK_EXTRACTION_FOLDER).mkdir(parents=True, exist_ok=True)

In [14]:
kpi_df = s3c.download_df_from_s3(
    "aicoe-osc-demo/kpi_mapping",
    "kpi_mapping.csv",
    filetype=S3FileType.CSV,
    header=0)

kmc = get_kpi_mapping_category(kpi_df)
questions = [q_text for q_id, (q_text, sect) in kmc["KPI_MAPPING_MODEL"].items()
             if len(set(sect).intersection({"OG", "CM", "CU"})) > 0
             and "TEXT" in kmc["KPI_CATEGORY"][q_id]]

text_paths = sorted(BENCHMARK_EXTRACTION_FOLDER.rglob("*.json"))
all_text_path_dict = {os.path.splitext(os.path.basename(file_path))[0]:
                      file_path for file_path in text_paths
                      if "table_meta" not in str(file_path)}

In [15]:
# Choose 10 random PDFs; random.sample needs a sequence, so wrap the items in a list
all_text_path_dict = dict(random.sample(list(all_text_path_dict.items()), 10))

In [16]:
df_list = []
metrics_df_list = []
metrics_list = []
metric_dfs = pd.DataFrame()
num_pdfs = len(all_text_path_dict)

In [17]:
all_text_path_dict

{'LUKOIL_SUSTAINABILITY_REPORT_2018': PosixPath('/opt/app-root/src/data/extraction/LUKOIL_SUSTAINABILITY_REPORT_2018.json'),
 'equinor-2019-annual-report-and-form-20f': PosixPath('/opt/app-root/src/data/extraction/equinor-2019-annual-report-and-form-20f.json'),
 '413750375_Avista Corp_2019-12-31': PosixPath('/opt/app-root/src/data/extraction/413750375_Avista Corp_2019-12-31.json'),
 '2018 Annual Report': PosixPath('/opt/app-root/src/data/extraction/2018 Annual Report.json'),
 'Ervia-Annual-Report-2018': PosixPath('/opt/app-root/src/data/extraction/Ervia-Annual-Report-2018.json'),
 'SaipemSustainability2018': PosixPath('/opt/app-root/src/data/extraction/SaipemSustainability2018.json'),
 'RN_SR_2016_EN_2_ sustainabilitz 2016': PosixPath('/opt/app-root/src/data/extraction/RN_SR_2016_EN_2_ sustainabilitz 2016.json'),
 'Eskom Holdings SOC Ltd Integrated Report 2019': PosixPath('/opt/app-root/src/data/extraction/Eskom Holdings SOC Ltd Integrated Report 2019.json'),
 'Sustainability Report 20

In [18]:
print(num_pdfs)

10


In [19]:
local_model_paths=['models/12layer_pruned80-none/',
                   'models/12layer_pruned90-none/',
                   'models/teacher/']

model_names = ['12layer_pruned80-none',
               '12layer_pruned90-none',
               'teacher']

In [20]:
local_model_paths=['models/teacher/']

model_names = ['teacher']

In [21]:
metric_list = []
for local_model_path, model_name in zip(local_model_paths, model_names):
    # Load the tokenizer and model once per model rather than once per PDF.
    tokenizer = AutoTokenizer.from_pretrained(local_model_path, use_fast=True)
    model = AutoModelForSequenceClassification.from_pretrained(local_model_path).to(device)
    model_size = os.path.getsize(os.path.join(local_model_path, 'pytorch_model.bin')) / 1e6

    for i, (pdf_name, file_path) in enumerate(all_text_path_dict.items()):
        print(f"loop : {i}")
        print(f'Processing {i+1}/{len(all_text_path_dict)}, {pdf_name}')
        data = gather_data(pdf_name, file_path)
        num_data_points = len(data)
        num_pages = data[-1]['page']  # page of the last extracted paragraph
        chunk_size = 1000
        chunk_idx = 0
        total_file_time = 0

        # Time tokenization plus prediction, one chunk of data points at a time.
        predictions = list()
        while chunk_idx * chunk_size < num_data_points:
            chunk_start = time.time()
            data_chunk = data[chunk_idx * chunk_size:(chunk_idx + 1) * chunk_size]
            temp_df = pd.DataFrame(data_chunk).drop(['pdf_name', 'page'], axis=1)
            encoded_dataset = create_batches(temp_df, batch_size=128)
            predictions.extend(predict(encoded_dataset))
            chunk_idx += 1
            chunk_end = time.time()
            total_file_time += (chunk_end - chunk_start)

        time_per_data_point = total_file_time / num_data_points
        data_points_per_sec = 1 / time_per_data_point

        metric_list.append(
            {'Model Name': model_name,
             'Model Size(MB)': model_size,
             'PDF Name': pdf_name,
             'Number of Pages': int(num_pages),
             'Number of Data Points': num_data_points,
             'Total Inference Time': total_file_time,
             'Time per data point': time_per_data_point,
             'Data points per sec': data_points_per_sec})

    # Save the metrics collected so far after each model finishes.
    file_to_save = pd.DataFrame(metric_list)
    file_to_save.to_csv(f"file_to_save_{model_name}.csv")

loop : 0
Processing 1/10, LUKOIL_SUSTAINABILITY_REPORT_2018
loop : 1
Processing 2/10, equinor-2019-annual-report-and-form-20f
loop : 2
Processing 3/10, 413750375_Avista Corp_2019-12-31
loop : 3
Processing 4/10, 2018 Annual Report
loop : 4
Processing 5/10, Ervia-Annual-Report-2018
loop : 5
Processing 6/10, SaipemSustainability2018
loop : 6
Processing 7/10, RN_SR_2016_EN_2_ sustainabilitz 2016
loop : 7
Processing 8/10, Eskom Holdings SOC Ltd Integrated Report 2019
loop : 8
Processing 9/10, Sustainability Report 2016_EN
loop : 9
Processing 10/10, annual_report_2019_eng
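
The timing pattern used in the loop (process a PDF in chunks, accumulate the elapsed time, divide by the number of data points) can be isolated as follows. This is a sketch under stated assumptions: `time_in_chunks` and `dummy_predict` are hypothetical stand-ins for the notebook's tokenization and prediction steps, and `time.perf_counter` is swapped in for `time.time` because it is a monotonic clock intended for measuring short durations.

```python
import time

def time_in_chunks(items, process, chunk_size=1000):
    """Process items chunk by chunk; return (results, total_seconds)."""
    results, total = [], 0.0
    for start in range(0, len(items), chunk_size):
        t0 = time.perf_counter()
        results.extend(process(items[start:start + chunk_size]))
        total += time.perf_counter() - t0
    return results, total

def dummy_predict(chunk):
    """Stand-in for tokenization plus model inference."""
    return [len(text) % 2 for text in chunk]

items = [f"paragraph {i}" for i in range(2500)]
preds, seconds = time_in_chunks(items, dummy_predict)
print(len(preds), seconds / len(items))  # predictions, seconds per data point
```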


In [22]:
metric_dfs = pd.DataFrame(metric_list)

In [23]:
metric_dfs.head()

| | Model Name | Model Size(MB) | PDF Name | Number of Pages | Number of Data Points | Total Inference Time | Time per data point | Data points per sec |
|---|---|---|---|---|---|---|---|---|
| 0 | teacher | 438.011337 | LUKOIL_SUSTAINABILITY_REPORT_2018 | 81 | 52392 | 201.269272 | 0.003842 | 260.307992 |
| 1 | teacher | 438.011337 | equinor-2019-annual-report-and-form-20f | 317 | 82368 | 489.522515 | 0.005943 | 168.261924 |
| 2 | teacher | 438.011337 | 413750375_Avista Corp_2019-12-31 | 79 | 81072 | 108.448538 | 0.001338 | 747.561948 |
| 3 | teacher | 438.011337 | 2018 Annual Report | 147 | 34152 | 205.439296 | 0.006015 | 166.238887 |
| 4 | teacher | 438.011337 | Ervia-Annual-Report-2018 | 155 | 41232 | 151.55162 | 0.003676 | 272.065717 |


**Model Name: 12layer_pruned80-none, Size: 438.02 MB**

In [4]:
df4 = pd.read_csv("file_to_save_cpu412layer_pruned80-none.csv")
df8 = pd.read_csv("file_to_save_cpu812layer_pruned80-none.csv")
df14 = df4[df4['Model Name']=='12layer_pruned80-none']
df18 = df8[df8['Model Name']=='12layer_pruned80-none']

In [6]:
# Average time per data point

df14['Time per data point'].describe()

count    5.000000
mean     0.033795
std      0.000038
min      0.033733
25%      0.033794
50%      0.033804
75%      0.033807
max      0.033838
Name: Time per data point, dtype: float64

The average time per data point is 0.033795 seconds. An average PDF of ~157 pages with ~387 data points per page (~60,800 data points in total) therefore takes ~34 minutes on 4 CPU cores.
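
The per-PDF estimate is the mean time per data point multiplied by the number of data points in an average PDF. A minimal sketch of that arithmetic, using the averages quoted in the text (157 pages, 387 data points per page); the function name `estimated_minutes` is hypothetical.

```python
def estimated_minutes(sec_per_point, pages=157, points_per_page=387):
    """Estimated inference time in minutes for one average-sized PDF."""
    return sec_per_point * pages * points_per_page / 60

# 12layer_pruned80-none on 4 cores: 0.033795 s per data point
print(round(estimated_minutes(0.033795), 1))  # 34.2
```

The same function applies to every estimate in this notebook; only the mean time per data point changes.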

In [9]:
# Average time per data point

df18['Time per data point'].describe()

count    5.000000
mean     0.016303
std      0.000048
min      0.016223
25%      0.016296
50%      0.016323
75%      0.016333
max      0.016338
Name: Time per data point, dtype: float64

The average time per data point is 0.016303 seconds. An average PDF of ~157 pages with ~387 data points per page (~60,800 data points in total) therefore takes ~16.5 minutes on 8 CPU cores.

**Model Name: 12layer_pruned90-none, Size: 438.02 MB**

In [13]:
df44 = pd.read_csv("file_to_save_cpu412layer_pruned90-none.csv")
df88 = pd.read_csv("file_to_save_cpu812layer_pruned90-none.csv")
df24 = df44[df44['Model Name']=='12layer_pruned90-none']
df28 = df88[df88['Model Name']=='12layer_pruned90-none']

In [15]:
# Average time per data point

df24['Time per data point'].describe()

count    5.000000
mean     0.024715
std      0.000162
min      0.024557
25%      0.024619
50%      0.024622
75%      0.024866
max      0.024914
Name: Time per data point, dtype: float64

The average time per data point is 0.024715 seconds. An average PDF of ~157 pages with ~387 data points per page (~60,800 data points in total) therefore takes ~25 minutes on 4 CPU cores.

In [31]:
# Average time per data point

df28['Time per data point'].describe()

count    5.000000
mean     0.011791
std      0.000056
min      0.011749
25%      0.011749
50%      0.011756
75%      0.011841
max      0.011862
Name: Time per data point, dtype: float64

The average time per data point is 0.011791 seconds. An average PDF of ~157 pages with ~387 data points per page (~60,800 data points in total) therefore takes ~12 minutes on 8 CPU cores.

**Model Name: teacher, Size: 438.01 MB**

In [23]:
df444 = pd.read_csv("file_to_save_cpu4teacher.csv")
df888 = pd.read_csv("file_to_save_cpu8teacher.csv")
df34 = df444[df444['Model Name']=='teacher']
df38 = df888[df888['Model Name']=='teacher']

In [30]:
# Average time per data point

df34['Time per data point'].describe()

count    5.000000
mean     0.096243
std      0.001691
min      0.094675
25%      0.094822
50%      0.095612
75%      0.097979
max      0.098129
Name: Time per data point, dtype: float64

The average time per data point is 0.096243 seconds. An average PDF of ~157 pages with ~387 data points per page (~60,800 data points in total) therefore takes ~97 minutes on 4 CPU cores.

In [28]:
# Average time per data point

df38['Time per data point'].describe()

count    5.000000
mean     0.045240
std      0.002098
min      0.041899
25%      0.044501
50%      0.046121
75%      0.046759
max      0.046919
Name: Time per data point, dtype: float64

The average time per data point is 0.045240 seconds. An average PDF of ~157 pages with ~387 data points per page (~60,800 data points in total) therefore takes ~46 minutes on 8 CPU cores.
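
The 4-core and 8-core means reported in the `describe()` outputs can be compared directly. The sketch below copies those means into a dictionary and computes two ratios: the speedup from doubling the cores and the speedup of each model over the teacher. The names `mean_seconds`, `core_speedup`, and `speedup_over_teacher` are hypothetical.

```python
# Mean seconds per data point, copied from the describe() outputs in this notebook.
mean_seconds = {
    "12layer_pruned80-none": {4: 0.033795, 8: 0.016303},
    "12layer_pruned90-none": {4: 0.024715, 8: 0.011791},
    "teacher": {4: 0.096243, 8: 0.045240},
}

def core_speedup(model):
    """How many times faster 8 cores are than 4 cores for one model."""
    return mean_seconds[model][4] / mean_seconds[model][8]

def speedup_over_teacher(model, cores):
    """How many times faster a model is than the teacher at a given core count."""
    return mean_seconds["teacher"][cores] / mean_seconds[model][cores]

for name in mean_seconds:
    print(name,
          round(core_speedup(name), 2),
          round(speedup_over_teacher(name, 4), 2),
          round(speedup_over_teacher(name, 8), 2))
```

Each mean is based on only 5 measurements, so the ratios are indicative rather than precise.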

## Conclusion

In this notebook, we measured CPU inference time for three pre-trained custom knowledge distillation models, '12layer_pruned80-none', '12layer_pruned90-none', and 'teacher', on 4 and 8 CPU cores. Based on these experiments, we conclude that DeepSparse and SparseZoo are effective in optimizing and executing custom knowledge distillation models on the CPU.

Inference time differed substantially between models. On 4 cores, the mean time per data point was 0.0962 s for 'teacher', 0.0338 s for '12layer_pruned80-none', and 0.0247 s for '12layer_pruned90-none'; on 8 cores it was 0.0452 s, 0.0163 s, and 0.0118 s, respectively. The pruned models were therefore roughly 2.8x and 3.8x to 3.9x faster than the teacher. Doubling the core count from 4 to 8 cut inference time roughly in half (a speedup of about 2.1x) for all three models, which is in line with our expectations. Each mean is based on 5 measurements.

These findings demonstrate the potential of DeepSparse and SparseZoo for optimizing custom knowledge distillation models and show that the number of CPU cores must be taken into account when evaluating their performance. This information can be useful for researchers and practitioners working with custom knowledge distillation models.