## Inference test for SparseZoo models

This notebook explores SparseZoo and DeepSparse: it loads pre-trained sparse models from SparseZoo and times inference with DeepSparse on the CPU only. Restricting the experiments to the CPU shows how efficiently these frameworks execute sparse models on general-purpose hardware.

In [None]:
import os
import pathlib
from dotenv import load_dotenv
from datasets import Dataset, DatasetDict
import pandas as pd
from src.data.s3_communication import S3Communication, S3FileType
from src.components.utils.kpi_mapping import get_kpi_mapping_category
import json
import time
import config
from torch import cuda
from deepsparse import Pipeline
import warnings
warnings.filterwarnings("ignore")
device = 'cuda' if cuda.is_available() else 'cpu'

In [3]:
# Load credentials
dotenv_dir = os.environ.get(
    "CREDENTIAL_DOTENV_DIR", os.environ.get("PWD", "/opt/app-root/src")
)
dotenv_path = pathlib.Path(dotenv_dir) / "credentials.env"
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path=dotenv_path, override=True)

In [4]:
# init s3 connector
s3c = S3Communication(
    s3_endpoint_url=os.getenv("S3_ENDPOINT"),
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
    s3_bucket=os.getenv("S3_BUCKET"),
)

## Helper functions

In [6]:
def gather_data(pdf_name, pdf_path):
    """Pair every KPI question with every paragraph of one extracted PDF."""
    # Fixed: read from the `pdf_path` argument rather than an undefined `file_path`.
    pdf_content = read_text_from_json(pdf_path)
    text_data = []
    # Build all possible combinations of paragraphs and questions.
    # Keep track of the page number the text is extracted from and
    # the pdf it belongs to. `questions` is the global list built in a later cell.
    for kpi_question in questions:
        text_data.extend([{
            "page": page_num,
            "pdf_name": pdf_name,
            "question": kpi_question,
            "sentence": paragraph}
            for page_num, page_content in pdf_content.items()
            for paragraph in page_content])
    return text_data
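
To make the pairing concrete, here is a minimal sketch of the same cross product on toy data. It assumes the extracted JSON maps a page number to a list of paragraphs, which is inferred from how `gather_data` iterates. The names `build_pairs`, `toy_questions`, and `toy_pdf_content` are hypothetical and not part of the notebook.

```python
# Toy stand-ins (hypothetical): the real questions come from the KPI mapping
# and the real content from the extracted JSON files.
toy_questions = ["What is the total CO2 emission?", "What is the net-zero target year?"]
toy_pdf_content = {
    "0": ["Paragraph A on page 0.", "Paragraph B on page 0."],
    "1": ["Paragraph C on page 1."],
}


def build_pairs(pdf_name, pdf_content, questions):
    """Pair every question with every paragraph, keeping page and PDF name."""
    return [
        {"page": page_num, "pdf_name": pdf_name,
         "question": question, "sentence": paragraph}
        for question in questions
        for page_num, page_content in pdf_content.items()
        for paragraph in page_content
    ]


toy_pairs = build_pairs("toy_report", toy_pdf_content, toy_questions)
print(len(toy_pairs))  # 2 questions x 3 paragraphs = 6
print(toy_pairs[0])
```

The number of data points therefore grows as (number of questions) x (number of paragraphs), which is why a single PDF can produce tens of thousands of inputs.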

In [8]:
def read_text_from_json(file):
    """Read text from json."""

    with open(file) as f:
        text = json.load(f)
        return text

## Retrieve the test dataset and the trained models

In [9]:
s3c.download_files_in_prefix_to_dir(
    config.BASE_TRAIN_TEST_DATASET_S3_PREFIX,
    config.BASE_PROCESSED_DATA)

In [10]:
test_data_path = str(config.BASE_PROCESSED_DATA)+'/rel_test_split.csv'
test_data = pd.read_csv(test_data_path, index_col=0)
test_data.rename(columns={'text': 'question', 'text_b':'sentence'}, inplace=True)

train_data_path = str(config.BASE_PROCESSED_DATA)+'/rel_train_split.csv'
train_data = pd.read_csv(train_data_path, index_col=0)
train_data.rename(columns={'text': 'question', 'text_b':'sentence'}, inplace=True)

In [11]:
trds = Dataset.from_pandas(train_data)
teds = Dataset.from_pandas(test_data.drop('label', axis=1))

climate_dataset = DatasetDict()

climate_dataset['train'] = trds
climate_dataset['test'] = teds

In [12]:
climate_dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'question', 'sentence', '__index_level_0__'],
        num_rows: 2033
    })
    test: Dataset({
        features: ['question', 'sentence', '__index_level_0__'],
        num_rows: 509
    })
})

## PDFs

In [13]:
config.DATA_FOLDER

PosixPath('/opt/app-root/src/data')

In [14]:
BENCHMARK_FOLDER = config.DATA_FOLDER
if not os.path.exists(BENCHMARK_FOLDER):
    BENCHMARK_FOLDER.mkdir(parents=True, exist_ok=True)

BENCHMARK_EXTRACTION_FOLDER = BENCHMARK_FOLDER / "extraction"
if not os.path.exists(BENCHMARK_EXTRACTION_FOLDER):
    pathlib.Path(BENCHMARK_EXTRACTION_FOLDER).mkdir(parents=True, exist_ok=True)

In [15]:
kpi_df = s3c.download_df_from_s3(
    "aicoe-osc-demo/kpi_mapping",
    "kpi_mapping.csv",
    filetype=S3FileType.CSV,
    header=0)

kmc = get_kpi_mapping_category(kpi_df)
questions = [q_text for q_id, (q_text, sect) in kmc["KPI_MAPPING_MODEL"].items()
             if len(set(sect).intersection({"OG", "CM", "CU"})) > 0
             and "TEXT" in kmc["KPI_CATEGORY"][q_id]]

text_paths = sorted(BENCHMARK_EXTRACTION_FOLDER.rglob("*.json"))
all_text_path_dict = {os.path.splitext(os.path.basename(file_path))[0]:
                      file_path for file_path in text_paths
                      if "table_meta" not in str(file_path)}
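
The question filter above keeps a KPI only if it applies to at least one of the sectors "OG", "CM", or "CU" and is answered from text. A minimal sketch on a hypothetical miniature mapping shows the effect. The dictionary layout mirrors how `kmc` is indexed in the cell above, but the entries (including the "UT" sector code and the "TABLE" category) are made up for illustration.

```python
# Hypothetical miniature of the structure indexed in the cell above.
toy_kmc = {
    "KPI_MAPPING_MODEL": {
        0: ("What is the company name?", ["OG", "CM"]),
        1: ("What is the installed capacity?", ["UT"]),
        2: ("What is the total revenue?", ["CU"]),
    },
    "KPI_CATEGORY": {0: ["TEXT"], 1: ["TEXT"], 2: ["TABLE"]},
}

wanted_sectors = {"OG", "CM", "CU"}
toy_selected = [
    q_text
    for q_id, (q_text, sectors) in toy_kmc["KPI_MAPPING_MODEL"].items()
    # Keep the question only if it shares a sector with the wanted set
    # and its category list contains "TEXT".
    if wanted_sectors.intersection(sectors)
    and "TEXT" in toy_kmc["KPI_CATEGORY"][q_id]
]
print(toy_selected)
```

Question 1 is dropped because of its sector and question 2 because it is not a text KPI, so only question 0 remains.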

In [16]:
# Choosing 5 pdfs
all_text_path_dict = dict(list(all_text_path_dict.items())[:5])

In [17]:
all_text_path_dict

{'04_NOVATEK_AR_2016_ENG_11': PosixPath('/opt/app-root/src/data/extraction/04_NOVATEK_AR_2016_ENG_11.json'),
 '04_NOVATEK_AR_2018_ENG_15': PosixPath('/opt/app-root/src/data/extraction/04_NOVATEK_AR_2018_ENG_15.json'),
 '2013_book_mol_ar_eng_fin': PosixPath('/opt/app-root/src/data/extraction/2013_book_mol_ar_eng_fin.json'),
 '2015_BASF_Report': PosixPath('/opt/app-root/src/data/extraction/2015_BASF_Report.json'),
 '2017 Sustainability Report': PosixPath('/opt/app-root/src/data/extraction/2017 Sustainability Report.json')}

In [18]:
df_list = []
metrics_df_list = []
metrics_list = []
metric_dfs = pd.DataFrame()
num_pdfs = len(all_text_path_dict)

In [20]:
print(num_pdfs)

5


In [21]:
local_model_paths=['/opt/app-root/src/aicoe-osc-demo/models/distilbert_mnli_pruned80/deployment/',
                   '/opt/app-root/src/aicoe-osc-demo/models/distilbert_qqp_pruned80/deployment/',
                   '/opt/app-root/src/aicoe-osc-demo/models/obert_mnli_pruned90/deployment/']

model_names=['distilbert_mnli_pruned80','distilbert_qqp_pruned80','obert_mnli_pruned90']

In [22]:
metric_list = []
for local_model_path, model_name in zip(local_model_paths, model_names):
    # Create the pipeline once per model and reuse it for every PDF.
    tc_pipeline = Pipeline.create(
        task="text-classification",
        model_path=local_model_path
    )
    for i, (pdf_name, file_path) in enumerate(all_text_path_dict.items()):
        print(f'Processing {i+1}/{len(all_text_path_dict)}, {pdf_name}')
        print(model_name)
        data = gather_data(pdf_name, file_path)
        num_data_points = len(data)
        # Page number of the last extracted paragraph, used as the page count.
        num_pages = data[-1]['page']
        # Keep only the [question, sentence] pairs as pipeline input.
        temp_df = pd.DataFrame(data).drop(['pdf_name', 'page'], axis=1)
        temp_df_list = temp_df.values.tolist()
        # Time only the inference call, not the data preparation.
        start = time.time()
        inference = tc_pipeline(temp_df_list)
        end = time.time()
        total_file_time = end - start
        print(total_file_time)
        time_per_data_point = total_file_time / num_data_points
        data_points_per_sec = 1 / time_per_data_point
        metric_list.append(
            {'Model Name': model_name,
             'PDF Name': pdf_name,
             'Number of Pages': int(num_pages),
             'Number of Data Points': num_data_points,
             'Total Inference Time': total_file_time,
             'Time per data point': time_per_data_point,
             'Data points per sec': data_points_per_sec})
    # Save the metrics collected so far (cumulative across models).
    file_to_save = pd.DataFrame(metric_list)
    file_to_save.to_csv(f"file_to_save_cpu4{model_name}.csv")



Collecting transformers==4.23.1
  Downloading https://github.com/neuralmagic/transformers/releases/download/v1.3/transformers-4.23.1-py3-none-any.whl (5.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.3/5.3 MB 18.2 MB/s eta 0:00:00
Collecting datasets<=1.18.4
  Downloading datasets-1.18.4-py3-none-any.whl (312 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 312.1/312.1 kB 55.9 MB/s eta 0:00:00
Installing collected packages: transformers, datasets
  Attempting uninstall: transformers
    Found existing installation: transformers 4.25.1
    Uninstalling transformers-4.25.1:
      Successfully uninstalled transformers-4.25.1
  Attempting uninstall: datasets
    Found existing installation: datasets 2.8.0
    Uninstalling datasets-2.8.0:
      Successfully uninstalled datasets-2.8.0


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
farm 0.5.0 requires transformers==3.3.1, but you have transformers 4.23.1 which is incompatible.
2023-01-13 07:41:21 deepsparse.transformers INFO     deepsparse-transformers and dependencies successfully installed
2023-01-13 07:41:21,805 [558] INFO     deepsparse.transformers: deepsparse-transformers and dependencies successfully installed


Successfully installed datasets-1.18.4 transformers-4.23.1


DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.3.1 COMMUNITY | (d1a12439) (release) (optimized) (system=avx512, binary=avx512)


Processing 1/5, 04_NOVATEK_AR_2016_ENG_11
loop : 0
distilbert_mnli_pruned80
363.48696660995483
Processing 2/5, 04_NOVATEK_AR_2018_ENG_15
loop : 1
distilbert_mnli_pruned80
344.98051285743713
Processing 3/5, 2013_book_mol_ar_eng_fin
loop : 2
distilbert_mnli_pruned80
939.6917035579681
Processing 4/5, 2015_BASF_Report
loop : 3
distilbert_mnli_pruned80
1141.9932181835175
Processing 5/5, 2017 Sustainability Report
loop : 4
distilbert_mnli_pruned80
325.9426770210266
Processing 1/5, 04_NOVATEK_AR_2016_ENG_11
loop : 0
distilbert_qqp_pruned80
382.77617383003235
Processing 2/5, 04_NOVATEK_AR_2018_ENG_15
loop : 1
distilbert_qqp_pruned80
364.8406844139099
Processing 3/5, 2013_book_mol_ar_eng_fin
loop : 2
distilbert_qqp_pruned80
992.151086807251
Processing 4/5, 2015_BASF_Report
loop : 3
distilbert_qqp_pruned80
1201.654770374298
Processing 5/5, 2017 Sustainability Report
loop : 4
distilbert_qqp_pruned80
345.39869475364685
Processing 1/5, 04_NOVATEK_AR_2016_ENG_11
loop : 0
obert_mnli_pruned90
438.9388
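
The timing loop above passes every question/paragraph pair of a PDF to the pipeline in a single call. For large PDFs, one option is to run fixed-size chunks and sum the elapsed time. The sketch below illustrates that idea only: `timed_chunked_inference` and `fake_pipeline` are hypothetical names, and the stand-in returns a plain list, which may differ from the output type of the real DeepSparse pipeline.

```python
import time


def timed_chunked_inference(run_batch, inputs, chunk_size=5000):
    """Run `run_batch` on fixed-size chunks and return (outputs, seconds)."""
    outputs = []
    elapsed = 0.0
    for start in range(0, len(inputs), chunk_size):
        chunk = inputs[start:start + chunk_size]
        # perf_counter is a monotonic clock intended for measuring intervals.
        t0 = time.perf_counter()
        outputs.extend(run_batch(chunk))
        elapsed += time.perf_counter() - t0
    return outputs, elapsed


def fake_pipeline(batch):
    """Stand-in for the real pipeline: one dummy label per input pair."""
    return ["LABEL_0" for _ in batch]


toy_inputs = [["question", f"sentence {i}"] for i in range(12)]
results, seconds = timed_chunked_inference(fake_pipeline, toy_inputs, chunk_size=5)
print(len(results), f"{seconds:.6f}s")
```

Only the time spent inside `run_batch` is accumulated, so slicing overhead does not inflate the measurement.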

In [25]:
metric_dfs = pd.DataFrame(metric_list)

In [28]:
metric_dfs.head()

| Unnamed: 0 | Model Name | Model Size(MB) | PDF Name | Number of Pages | Number of Data Points | Total Inference Time | Time per data point | Data points per sec |
|---|---|---|---|---|---|---|---|---|
| 0 | distilbert_mnli_pruned80 | 0.004096 | 04_NOVATEK_AR_2016_ENG_11 | 119 | 24264 | 363.486967 | 0.014981 | 66.753425 |
| 1 | distilbert_mnli_pruned80 | 0.004096 | 04_NOVATEK_AR_2018_ENG_15 | 105 | 23112 | 344.980513 | 0.014926 | 66.995089 |
| 2 | distilbert_mnli_pruned80 | 0.004096 | 2013_book_mol_ar_eng_fin | 135 | 63024 | 939.691704 | 0.01491 | 67.068805 |
| 3 | distilbert_mnli_pruned80 | 0.004096 | 2015_BASF_Report | 261 | 76392 | 1141.993218 | 0.014949 | 66.893567 |
| 4 | distilbert_mnli_pruned80 | 0.004096 | 2017 Sustainability Report | 57 | 21936 | 325.942677 | 0.014859 | 67.300177 |


In [29]:
df4 = pd.read_csv("file_to_save_cpu4obert_mnli_pruned90.csv")
df8 = pd.read_csv("file_to_save_cpu8obert_mnli_pruned90.csv")
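
Reading a CSV that was written with the default `to_csv` settings adds an `Unnamed: 0` column, because the saved index comes back as an ordinary column. The sketch below shows two ways to avoid that, using an in-memory buffer and a toy frame (hypothetical values) rather than the notebook's files.

```python
import io

import pandas as pd

toy = pd.DataFrame({"Model Name": ["m1", "m2"],
                    "Time per data point": [0.015, 0.008]})

# Default to_csv writes the index as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)

with_index.seek(0)
raw = pd.read_csv(with_index)
print(raw.columns.tolist())  # includes 'Unnamed: 0'

# Option 1: tell read_csv that the first column is the index.
with_index.seek(0)
clean_a = pd.read_csv(with_index, index_col=0)

# Option 2: do not write the index in the first place.
no_index = io.StringIO()
toy.to_csv(no_index, index=False)
no_index.seek(0)
clean_b = pd.read_csv(no_index)

print(clean_a.columns.tolist())
print(clean_b.columns.tolist())
```

Either option keeps the saved metrics free of extra index columns when they are loaded again.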

**Model Name: distilbert_mnli_pruned80, Size: 438.02 MB**

In [32]:
df14 = df4[df4['Model Name']=='distilbert_mnli_pruned80']
df18 = df8[df8['Model Name']=='distilbert_mnli_pruned80']

In [35]:
# Average time per data point

df14['Time per data point'].describe()

count    5.000000
mean     0.014925
std      0.000045
min      0.014859
25%      0.014910
50%      0.014926
75%      0.014949
max      0.014981
Name: Time per data point, dtype: float64

The average time per data point is 0.014925 seconds. An average PDF, with ~157 pages and ~387 data points per page, would take about 15 minutes to process on 4 CPU cores.
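
The estimate is simply pages x data points per page x seconds per data point, converted to minutes. A short sketch of that arithmetic, using the figures quoted above (the helper name `estimate_minutes` is hypothetical):

```python
def estimate_minutes(pages, points_per_page, seconds_per_point):
    """Estimated processing time for one PDF, in minutes."""
    total_points = pages * points_per_page
    return total_points * seconds_per_point / 60


minutes_4_cores = estimate_minutes(157, 387, 0.014925)
print(f"{minutes_4_cores:.1f} minutes")  # about 15 minutes
```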

In [40]:
# Average time per data point

df18['Time per data point'].describe()

count    5.000000
mean     0.007645
std      0.000005
min      0.007640
25%      0.007643
50%      0.007644
75%      0.007647
max      0.007653
Name: Time per data point, dtype: float64

The average time per data point is 0.007645 seconds. An average PDF, with ~157 pages and ~387 data points per page, would take about 8 minutes to process on 8 CPU cores.

**Model Name: distilbert_qqp_pruned80, Size: 438.02 MB**

In [50]:
df24 = df4[df4['Model Name']=='distilbert_qqp_pruned80']
df28 = df8[df8['Model Name']=='distilbert_qqp_pruned80']

In [53]:
# Average time per data point

df24['Time per data point'].describe()

count    5.000000
mean     0.015756
std      0.000024
min      0.015730
25%      0.015742
50%      0.015746
75%      0.015775
max      0.015786
Name: Time per data point, dtype: float64

The average time per data point is 0.015756 seconds. An average PDF, with ~157 pages and ~387 data points per page, would take about 16 minutes to process on 4 CPU cores.

In [79]:
df28

| Unnamed: 0.1 | Unnamed: 0 | Model Name | Model Size(MB) | PDF Name | Number of Pages | Number of Data Points | Total Inference Time | Time per data point | Data points per sec |
|---|---|---|---|---|---|---|---|---|---|
| 5 | 5 | distilbert_qqp_pruned80 | 0.004096 | 04_NOVATEK_AR_2016_ENG_11 | 119 | 24264 | 189.696231 | 0.007818 | 127.909764 |
| 6 | 6 | distilbert_qqp_pruned80 | 0.004096 | 04_NOVATEK_AR_2018_ENG_15 | 105 | 23112 | 180.43752 | 0.007807 | 128.088659 |
| 7 | 7 | distilbert_qqp_pruned80 | 0.004096 | 2013_book_mol_ar_eng_fin | 135 | 63024 | 491.198755 | 0.007794 | 128.306514 |
| 8 | 8 | distilbert_qqp_pruned80 | 0.004096 | 2015_BASF_Report | 261 | 76392 | 595.565105 | 0.007796 | 128.268093 |
| 9 | 9 | distilbert_qqp_pruned80 | 0.004096 | 2017 Sustainability Report | 57 | 21936 | 169.021273 | 0.007705 | 129.78248 |


In [59]:
# Average time per data point

df28['Time per data point'].describe()

count    5.000000
mean     0.007784
std      0.000045
min      0.007705
25%      0.007794
50%      0.007796
75%      0.007807
max      0.007818
Name: Time per data point, dtype: float64

The average time per data point is 0.007784 seconds. An average PDF, with ~157 pages and ~387 data points per page, would take about 8 minutes to process on 8 CPU cores.

**Model Name: obert_mnli_pruned90, Size: 438.02 MB**

In [63]:
df34 = df4[df4['Model Name']=='obert_mnli_pruned90']
df38 = df8[df8['Model Name']=='obert_mnli_pruned90']

In [65]:
# Average time per data point

df34['Time per data point'].describe()

count    5.000000
mean     0.017995
std      0.000133
min      0.017798
25%      0.017938
50%      0.018020
75%      0.018090
max      0.018131
Name: Time per data point, dtype: float64

The average time per data point is 0.017995 seconds. An average PDF, with ~157 pages and ~387 data points per page, would take about 18 minutes to process on 4 CPU cores.

In [71]:
# Average time per data point

df38['Time per data point'].describe()

count    5.000000
mean     0.008859
std      0.000017
min      0.008844
25%      0.008847
50%      0.008851
75%      0.008871
max      0.008883
Name: Time per data point, dtype: float64

The average time per data point is 0.008859 seconds. An average PDF, with ~157 pages and ~387 data points per page, would take about 9 minutes to process on 8 CPU cores.
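
The per-model filtering and `describe()` calls above can also be condensed into a single `groupby`. The sketch below is one possible way to do it: the timing values are copied from the two result tables shown earlier in this notebook, while the `Cores` column is a hypothetical addition that the saved CSVs do not contain.

```python
import pandas as pd

# Times per data point copied from the two result tables in this notebook.
mnli_4_cores = [0.014981, 0.014926, 0.01491, 0.014949, 0.014859]
qqp_8_cores = [0.007818, 0.007807, 0.007794, 0.007796, 0.007705]

# "Cores" is a hypothetical column added here to tell the runs apart.
rows = (
    [{"Model Name": "distilbert_mnli_pruned80", "Cores": 4,
      "Time per data point": t} for t in mnli_4_cores]
    + [{"Model Name": "distilbert_qqp_pruned80", "Cores": 8,
        "Time per data point": t} for t in qqp_8_cores]
)
combined = pd.DataFrame(rows)

# One mean per (model, core count) pair instead of one filtered frame each.
summary = combined.groupby(["Model Name", "Cores"])["Time per data point"].mean()
print(summary.round(6))
```

The two means reproduce the values reported by `describe()` for these runs (0.014925 and 0.007784).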

## Conclusion

Based on our experiments, DeepSparse, in combination with SparseZoo, is an effective framework for executing sparse deep learning models on the CPU. This notebook measured the inference time of three pre-trained models: "distilbert_mnli_pruned80", "distilbert_qqp_pruned80", and "obert_mnli_pruned90". Running each model on 4 and on 8 CPU cores let us assess how the number of cores affects inference time.

Inference time fell as the number of CPU cores increased, as expected: going from 4 to 8 cores roughly halved the mean time per data point for every model (for example, from 0.014925 s to 0.007645 s for distilbert_mnli_pruned80). The models differed somewhat, with obert_mnli_pruned90 the slowest at 0.017995 s per data point on 4 cores, but the results were comparable overall. These findings give a baseline for the performance of sparse models on general-purpose hardware and can help guide the choice of models and frameworks.
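
The effect of the core count can be expressed as a speedup ratio. The sketch below computes it from the mean times per data point reported by the `describe()` outputs above; the dictionary layout and variable names are illustrative only.

```python
# Mean seconds per data point, taken from the describe() outputs above,
# keyed by model name and then by number of CPU cores.
mean_seconds_per_point = {
    "distilbert_mnli_pruned80": {4: 0.014925, 8: 0.007645},
    "distilbert_qqp_pruned80": {4: 0.015756, 8: 0.007784},
    "obert_mnli_pruned90": {4: 0.017995, 8: 0.008859},
}

# Speedup = time on 4 cores divided by time on 8 cores.
speedups = {
    model: times[4] / times[8]
    for model, times in mean_seconds_per_point.items()
}
for model, speedup in speedups.items():
    print(f"{model}: {speedup:.2f}x faster on 8 cores than on 4")
```

All three ratios come out close to 2, meaning that doubling the cores roughly doubled throughput in these runs.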

Overall, our experiments demonstrated the usefulness of SparseZoo and DeepSparse for working with sparse models and gave a glimpse of their potential for optimizing deep learning workflows.