# Time analysis 
How do we make sense of the benchmarks generated by the relevance and inference steps in benchmarks.ipynb? This notebook combines those benchmarks and derives summary timings per pdf. We took our measurements on a large, memory-intensive instance with 4 CPUs, 32 GB of memory, and 1 GPU.

In [1]:
import pandas as pd
metrics_df = pd.read_pickle('../../reports/benchmarks/metrics_df.pkl')
kpi_metrics_df = pd.read_pickle('../../reports/benchmarks/kpi_metrics_df.pkl')

In [2]:
metrics_df.rename(columns={
    'PDF Name': 'pdf_name',
    'Number of Pages': 'pages',
    'Number of Data Points': 'rel_data_points',
    'Total Inference Time': 'rel_inference_time',
    'Time per data point': 'rel_time_per_point',
    'Data points per sec': 'rel_points_per_sec'}, inplace=True)
metrics_df.set_index('pdf_name', inplace=True)
metrics_df

pdf_name,pages,rel_data_points,rel_inference_time,rel_time_per_point,rel_points_per_sec
04_NOVATEK_AR_2016_ENG_11,119,24264,169.834546,0.006999,142.868460
04_NOVATEK_AR_2018_ENG_15,105,23112,162.614384,0.007036,142.127648
2013_book_mol_ar_eng_fin,135,63024,447.409486,0.007099,140.864246
2015_BASF_Report,261,76392,530.416069,0.006943,144.022786
2017 Sustainability Report,57,21936,147.753352,0.006736,148.463636
...,...,...,...,...,...
sustainability 2015,141,43920,292.777661,0.006666,150.011445
sustainability-report-2019,32,17424,118.627661,0.006808,146.879740
sustainability-report-repsol-2016-eng-april-baja_tcm14-63403,13,8088,55.069756,0.006809,146.868273
sustainable development 2017,110,54240,358.022435,0.006601,151.498886


In [3]:
kpi_metrics_df.rename(columns={
    'PDF Name': 'pdf_name',
    'Number of Data Points': 'kpi_data_points',
    'Total Inference Time': 'kpi_inference_time',
    'Time per data point': 'kpi_time_per_point',
    'Data points per sec': 'kpi_points_per_sec'}, inplace=True)
kpi_metrics_df.set_index('pdf_name', inplace=True)
kpi_metrics_df

pdf_name,kpi_data_points,kpi_inference_time,kpi_time_per_point,kpi_points_per_sec
China Resources Power Holdings Co Ltd Sustainable Development Report 2018,3619,79.930830,0.022086,45.276647
Uniper_Sustainability_Report_2019_EN,3797,83.328373,0.021946,45.566712
Galp_Integrated_Report_2018,22454,475.892238,0.021194,47.182951
2018_sustainability_report,4166,94.522472,0.022689,44.074175
Annual-report-2019,14704,405.279577,0.027563,36.281127
...,...,...,...,...
Enel SA sustainability-report-2018,14917,302.939695,0.020308,49.240823
Sustainability Report 2017_EN,3557,77.206329,0.021705,46.071353
Eskom Holdings SOC Ltd Integrated Report 2019,13347,290.541965,0.021768,45.938286
2018 Annual Report,10091,223.097630,0.022109,45.231319


In [4]:
benchmark_df = metrics_df.join(kpi_metrics_df)
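
`DataFrame.join` performs a left join on the index by default, so a pdf that appears only in the left frame keeps its row and gets NaN in the right frame's columns. The sketch below shows one way to check for such rows. It uses toy frames (`rel`, `kpi`) with made-up values, not the benchmark data.

```python
import pandas as pd

# Toy frames standing in for metrics_df and kpi_metrics_df (illustrative values only)
rel = pd.DataFrame({"pages": [10, 20]}, index=pd.Index(["a", "b"], name="pdf_name"))
kpi = pd.DataFrame({"kpi_data_points": [100]}, index=pd.Index(["a"], name="pdf_name"))

# Left join on the index, the default behaviour of DataFrame.join
joined = rel.join(kpi)

# Rows with no match in the right-hand frame have NaN in its columns
unmatched = joined.index[joined["kpi_data_points"].isna()].tolist()
print(unmatched)
```

An empty list here would mean every pdf in the left frame found a match.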

In [5]:
benchmark_df

pdf_name,pages,rel_data_points,rel_inference_time,rel_time_per_point,rel_points_per_sec,kpi_data_points,kpi_inference_time,kpi_time_per_point,kpi_points_per_sec
04_NOVATEK_AR_2016_ENG_11,119,24264,169.834546,0.006999,142.868460,5094,113.395341,0.022261,44.922481
04_NOVATEK_AR_2018_ENG_15,105,23112,162.614384,0.007036,142.127648,5174,117.463928,0.022703,44.047565
2013_book_mol_ar_eng_fin,135,63024,447.409486,0.007099,140.864246,20229,467.246483,0.023098,43.294066
2015_BASF_Report,261,76392,530.416069,0.006943,144.022786,22773,509.131503,0.022357,44.729112
2017 Sustainability Report,57,21936,147.753352,0.006736,148.463636,2603,56.340496,0.021644,46.201227
...,...,...,...,...,...,...,...,...,...
sustainability 2015,141,43920,292.777661,0.006666,150.011445,9258,195.610928,0.021129,47.328644
sustainability-report-2019,32,17424,118.627661,0.006808,146.879740,2788,59.883331,0.021479,46.557196
sustainability-report-repsol-2016-eng-april-baja_tcm14-63403,13,8088,55.069756,0.006809,146.868273,995,21.043432,0.021149,47.283161
sustainable development 2017,110,54240,358.022435,0.006601,151.498886,6595,137.592253,0.020863,47.931478


In [6]:
# Fix the one entry recorded with 0 pages: treat it as a single page
benchmark_df.loc['Maharashtra State Power Generation Co. Ltd._website', 'pages'] = 1

## Finding average number of pages in our pdfs

In [7]:
benchmark_df['pages'].describe()

count    144.000000
mean     156.965278
std      117.540733
min        1.000000
25%       73.250000
50%      127.500000
75%      224.250000
max      653.000000
Name: pages, dtype: float64

Each pdf has 157 pages on average, and 75% of the pdfs have fewer than 225 pages. The median is 127.5 pages and the maximum is 653.

## Finding average time for relevance

In [8]:
temp = benchmark_df['rel_data_points'] / benchmark_df['pages']
temp.describe()

count     144.000000
mean      359.037705
std       181.697295
min        73.066667
25%       239.955882
50%       305.102564
75%       434.983559
max      1042.036364
dtype: float64

In [9]:
benchmark_df['rel_time_per_point'].describe()

count    144.000000
mean       0.006918
std        0.000771
min        0.006433
25%        0.006697
50%        0.006809
75%        0.006922
max        0.013642
Name: rel_time_per_point, dtype: float64

The average time per data point is 0.006918s. The std (0.000771s) is small relative to the mean, which indicates that this value is consistent across pdfs.
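
One way to put a number on "small relative to the mean" is the coefficient of variation (std divided by mean). The sketch below uses the values from the `describe()` output for `rel_time_per_point`. The helper name `coefficient_of_variation` is ours and does not appear in the notebook.

```python
def coefficient_of_variation(mean: float, std: float) -> float:
    """Return std / mean, a unitless measure of relative spread."""
    if mean == 0:
        raise ValueError("mean must be non-zero")
    return std / mean

# Values copied from the describe() output for rel_time_per_point
rel_cv = coefficient_of_variation(mean=0.006918, std=0.000771)
print(f"relevance time per point CV: {rel_cv:.1%}")
```

This gives a relative spread of roughly 11%.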

A pdf with 157 pages and about 360 data points per page will take about 391 seconds (6.5 mins) to execute: 157 × 360 × 0.006918s.
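
The estimate above is the product of three numbers, and it can be reproduced as follows. This is a minimal sketch that assumes inference time scales linearly with the number of data points. The function name `estimate_inference_seconds` is hypothetical.

```python
def estimate_inference_seconds(pages: int, points_per_page: float,
                               seconds_per_point: float) -> float:
    """Estimate inference time as pages * points per page * time per point."""
    return pages * points_per_page * seconds_per_point

# Mean page count, rounded mean data points per page, mean time per point
rel_seconds = estimate_inference_seconds(157, 360, 0.006918)
print(f"{rel_seconds:.0f} s = {rel_seconds / 60:.1f} min")
```

The same function applies to the kpi step with that step's data points per page and time per point.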

In [10]:
benchmark_df['rel_inference_time'].describe()

count     144.000000
mean      409.545597
std       448.180519
min         3.240382
25%       140.197153
50%       284.229863
75%       552.409658
max      3078.194455
Name: rel_inference_time, dtype: float64

The average relevance time per pdf is 409 seconds (6.8 mins), with a std of 7.5 mins. 75% of the pdfs take between a few seconds and about 9 mins, and half of the pdfs take less than 5 mins.
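
The minute figures quoted here are the `describe()` values divided by 60. A small helper can do that conversion in one place. The sketch below is illustrative only: `describe_in_minutes` is a hypothetical name, and the example series holds toy values, not the benchmark data.

```python
import pandas as pd

def describe_in_minutes(seconds: pd.Series) -> pd.Series:
    """Summarise a series of durations given in seconds, reported in minutes."""
    summary = seconds.describe()
    # 'count' is a number of rows, not a duration, so drop it before scaling
    return (summary.drop("count") / 60).round(1)

# Toy durations in seconds (not the benchmark data)
example = pd.Series([60.0, 120.0, 180.0, 240.0, 300.0])
minutes_summary = describe_in_minutes(example)
print(minutes_summary)
```

In the notebook this could be called as `describe_in_minutes(benchmark_df['rel_inference_time'])`.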

## Finding average time for kpi

In [11]:
temp = benchmark_df['kpi_data_points'] / benchmark_df['pages']
temp.describe()

count    144.000000
mean      84.259906
std       43.973948
min       16.444444
25%       49.613165
50%       75.081418
75%      115.574053
max      238.000000
dtype: float64

In [12]:
benchmark_df['kpi_time_per_point'].describe()

count    144.000000
mean       0.022305
std        0.002467
min        0.020149
25%        0.021308
50%        0.021803
75%        0.022309
max        0.039455
Name: kpi_time_per_point, dtype: float64

The average time per data point is 0.022305s. The std (0.002467s) is again small relative to the mean, which indicates that this value is consistent across pdfs.

A pdf with 157 pages and roughly 85 data points per page will take about 298 seconds (~5 mins) to execute: 157 × 85 × 0.022305s.

In [13]:
benchmark_df['kpi_inference_time'].describe()

count     144.000000
mean      323.633887
std       339.327505
min         0.677586
25%        79.557247
50%       205.765385
75%       461.444348
max      1817.714101
Name: kpi_inference_time, dtype: float64

The average kpi time per pdf is 324 seconds (5.4 mins), with a std of 5.7 mins. 75% of the pdfs take between a few seconds and about 8 mins, and the median pdf takes about 3.4 mins.

## Total time for both the steps

We can add the estimates for each step: a pdf with 157 pages will take around 11.5 mins to execute (6.5 mins for relevance plus ~5 mins for kpi). We can also sum the two times for each pdf and then average the totals.
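
The first approach can be sketched as a sum of per-step estimates. The per-step inputs are the rounded data points per page and mean time per data point reported earlier. The names `STEPS` and `estimate_total_minutes` are ours, and the linear scaling with page count is an assumption.

```python
# Per-step inputs: (data points per page, mean seconds per data point)
STEPS = {
    "relevance": (360, 0.006918),
    "kpi": (85, 0.022305),
}

def estimate_total_minutes(pages: int) -> float:
    """Sum the per-step estimates for a pdf with the given page count."""
    total_seconds = sum(
        pages * points_per_page * seconds_per_point
        for points_per_page, seconds_per_point in STEPS.values()
    )
    return total_seconds / 60

print(f"{estimate_total_minutes(157):.1f} min")
```

For the mean page count of 157 this comes to about 11.5 mins.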

In [14]:
benchmark_df['total_inference_time'] = benchmark_df['rel_inference_time'] + benchmark_df['kpi_inference_time']

In [15]:
benchmark_df['total_inference_time'].describe()

count     144.000000
mean      733.179484
std       758.515997
min         3.917968
25%       209.536250
50%       496.674452
75%      1046.226577
max      4727.612188
Name: total_inference_time, dtype: float64

This means the average total time is 733 sec (~12 mins), with a std of about 12.6 mins. 75% of the pdfs take between a few secs and 17.4 mins. The longest pdf in our list took about 79 mins and the shortest took 4 secs.

## Total time it took for 144 pdfs

In [16]:
benchmark_df['total_inference_time'].sum()/(60*60)

29.327179345223637

### 29.3hrs was the total time taken for 144 pdfs.

# Conclusion

An average pdf with around 157 pages will take about 12 mins for the relevance and kpi models to infer. The pdfs in our dataset range from a single page to 653 pages, and they take between 4 sec and about 79 mins for inference.