# Notebook [2]: Using the PDF converter



This notebook shows how to use the PDF converter to create an input dataframe for the cdQA pipeline from a directory of PDF files.


***Note:*** *To run this notebook you will need access to a GPU. If you are using Colab, install `cdQA` by running `!pip install cdqa` in a cell.*

In [1]:
import os
import pandas as pd
from ast import literal_eval

from cdqa.utils.converters import pdf_converter
from cdqa.utils.filters import filter_paragraphs
from cdqa.pipeline import QAPipeline
from cdqa.utils.download import download_model



### Download pre-trained reader model and PDF files

In [2]:
# Download model
download_model(model='bert-squad_1.1', dir='./models')


Downloading trained model...
bert_qa.joblib already downloaded


In [3]:
# Download PDF files from BNP Paribas public news
def download_pdf():
    import wget  # third-party package: pip install wget

    directory = './data/pdf/'
    pdf_urls = [
      'https://invest.bnpparibas.com/documents/1q19-pr-12648',
      'https://invest.bnpparibas.com/documents/4q18-pr-18000',
      'https://invest.bnpparibas.com/documents/4q17-pr'
    ]

    print('\nDownloading PDF files...')

    # Create the target directory if it does not exist yet
    os.makedirs(directory, exist_ok=True)
    for url in pdf_urls:
        wget.download(url=url, out=directory)

download_pdf()


Downloading PDF files...
100% [............................................................................] 776686 / 776686
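
The download cell above fetches every file again each time it runs. A minimal sketch of a variant that skips files already on disk is shown below. It uses `urllib.request.urlretrieve` from the standard library in place of `wget`; the helper name `download_if_missing`, its `fetch` parameter, and the choice to name each file after the last URL path segment with a `.pdf` suffix are all assumptions made for illustration and are not part of cdQA.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve


def download_if_missing(urls, directory, fetch=urlretrieve):
    """Download each URL into `directory`, skipping files that already exist.

    `fetch` is any callable taking (url, destination_path). It defaults to
    urllib's urlretrieve and can be swapped out, for example in tests.
    """
    os.makedirs(directory, exist_ok=True)
    downloaded = []
    for url in urls:
        # Assumption: name the file after the last URL path segment.
        name = os.path.basename(urlparse(url).path) + '.pdf'
        target = os.path.join(directory, name)
        if not os.path.exists(target):
            fetch(url, target)
            downloaded.append(target)
    return downloaded
```

The function returns only the paths it actually fetched, so a second call with the same URLs returns an empty list.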

### Convert the PDF files into a DataFrame for cdQA pipeline

In [4]:
df = pdf_converter(directory_path='./data/pdf/')
df.head()

| | title | paragraphs |
|---|---|---|
| 0 | singtel-annual-report-2019 | [01 Singtel Annual Report 2019 , Annual Report... |
| 1 | starhub-annual-report-2018 | [STARHUB LTDAnnual Report 2018, DARET O E V O... |


In [5]:
# filter_paragraphs was already imported in the first cell
df = filter_paragraphs(df)
df.head()

| | title | paragraphs |
|---|---|---|
| 0 | singtel-annual-report-2019 | [Much has changed over Singtel’s 140-year hist... |
| 1 | starhub-annual-report-2018 | [We are pursuing a more aggressive transformat... |
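
Comparing the two previews, the filtered paragraphs begin with body text rather than cover-page fragments. The sketch below shows one simple rule that would have this kind of effect: keeping paragraphs whose word count falls within a range. It does not reproduce the logic of cdQA's `filter_paragraphs`; the function names and the `min_words` and `max_words` thresholds are hypothetical.

```python
def keep_paragraph(paragraph, min_words=10, max_words=250):
    """Return True when the paragraph's word count lies within the bounds."""
    n_words = len(paragraph.split())
    return min_words <= n_words <= max_words


def filter_by_length(paragraphs, min_words=10, max_words=250):
    """Keep only paragraphs that pass keep_paragraph, preserving order."""
    return [p for p in paragraphs if keep_paragraph(p, min_words, max_words)]
```

Applied to a list such as one row's `paragraphs` value, `filter_by_length` drops very short fragments (page headers, stray labels) and very long runs of text.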


In [5]:
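# Save the filtered paragraphs; index=False avoids writing the row index as an extra column
df.to_csv('pdf_filtered_para.csv', index=False)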
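
One thing to keep in mind when saving to CSV is that the `paragraphs` column holds Python lists, and CSV can only store them as text. The sketch below uses the standard-library `csv` module and made-up sample data to show the round trip: the list comes back as a string, and `ast.literal_eval` (imported in the first cell) turns it back into a list.

```python
import csv
import io
from ast import literal_eval

# Made-up sample row with the same two columns as the DataFrame above.
rows = [{'title': 'doc-a',
         'paragraphs': ['First paragraph.', 'Second, with a comma.']}]

# Writing: the csv module stores the list as its string representation.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['title', 'paragraphs'])
writer.writeheader()
writer.writerows(rows)

# Reading: every cell comes back as a plain string...
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw = loaded[0]['paragraphs']

# ...so literal_eval is needed to turn it back into a list.
restored = literal_eval(raw)
```

With pandas, the same conversion can be applied while loading by passing `converters={'paragraphs': literal_eval}` to `pd.read_csv`.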

### Instantiate the cdQA pipeline from a pre-trained reader model

In [6]:
cdqa_pipeline = QAPipeline(reader='./models/bert_qa.joblib', max_df=1.0)

# Fit Retriever to documents
cdqa_pipeline.fit_retriever(df=df)

QAPipeline(reader=BertQA(adam_epsilon=1e-08, bert_model='bert-base-uncased',
                         do_lower_case=True, fp16=False,
                         gradient_accumulation_steps=1, learning_rate=5e-05,
                         local_rank=-1, loss_scale=0, max_answer_length=30,
                         n_best_size=20, no_cuda=False,
                         null_score_diff_threshold=0.0, num_train_epochs=3.0,
                         output_dir=None, predict_batch_size=8, seed=42,
                         server_ip='', server_po..._size=8,
                         verbose_logging=False, version_2_with_negative=False,
                         warmup_proportion=0.1, warmup_steps=0),
           retrieve_by_doc=False,
           retriever=BM25Retriever(b=0.75, floor=None, k1=2.0, lowercase=True,
                                   max_df=1.0, min_df=2, ngram_range=(1, 2),
                                   preprocessor=None, stop_words='english',
                                   t
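
The output above shows that the pipeline's retriever is a `BM25Retriever` with `k1=2.0` and `b=0.75`. To give a feel for what those two parameters control, here is a minimal sketch of textbook Okapi BM25 scoring in plain Python. It is not cdQA's implementation: the whitespace tokenization and the smoothed IDF variant are assumptions, and it ignores the n-gram, stop-word, and `min_df` settings listed in the output.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=2.0, b=0.75):
    """Score each document against the query with the Okapi BM25 formula.

    Tokenization is a lowercase whitespace split, and the IDF uses the
    smoothed log(1 + (N - n + 0.5) / (n + 0.5)) variant; both are assumptions.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        # b controls how strongly long documents are penalized.
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            n = doc_freq[term]
            idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
            # k1 controls how quickly repeated terms stop adding score.
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores
```

Documents that contain none of the query terms score zero, and the highest-scoring paragraphs are the ones a retriever of this kind would hand to the reader model.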

### Execute a query

In [8]:
query_1 = 'Why did Singtel EBITDA decline?'
prediction_1 = cdqa_pipeline.predict(query_1)

### Explore predictions

In [None]:
print('query: {}'.format(query_1))
print('answer: {}'.format(prediction_1[0]))
print('title: {}'.format(prediction_1[1]))
print('paragraph: {}'.format(prediction_1[2]))
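
The same four `print` calls recur for every query in this notebook. A small helper can hold that formatting in one place. The sketch below assumes the prediction can be indexed as (answer, title, paragraph), which is how the cells here use it; the name `format_prediction` is hypothetical and not part of cdQA.

```python
def format_prediction(query, prediction):
    """Return the query and the first three prediction fields as text.

    Assumes `prediction` is indexable as (answer, title, paragraph),
    matching how the cells in this notebook index it.
    """
    answer, title, paragraph = prediction[0], prediction[1], prediction[2]
    return '\n'.join([
        'query: {}'.format(query),
        'answer: {}'.format(answer),
        'title: {}'.format(title),
        'paragraph: {}'.format(paragraph),
    ])
```

For example, `print(format_prediction(query_1, prediction_1))` would print the same four lines as the cell above.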

In [11]:
query_2 = "What caused the associate's pre-tax contribution to decline in 2018?"
prediction_2 = cdqa_pipeline.predict(query_2)

In [12]:
print('query: {}'.format(query_2))
print('answer: {}'.format(prediction_2[0]))
print('title: {}'.format(prediction_2[1]))
print('paragraph: {}'.format(prediction_2[2]))

query: What caused the associate's pre-tax contribution to decline in 2018?
answer: higher depreciation charges and share of equity losses
title: singtel-annual-report-2019
paragraph: Globe delivered a solid performance with double-digit growth in EBITDA and earnings. Service revenue grew 6% driven by robust data growth in mobile and broadband. EBITDA rose 22% on strong revenue growth and lower selling expenses. Despite higher depreciation charges and share of equity losses from its associates, Globe’s post-tax ordinary contribution rose strongly by 39%. The share of Globe’s one-off gain in FY 2018 arose from the increase in fair value of its retained interest in its associate. With the absence of exceptional gain this year, overall post-tax contribution grew 24%.


In [15]:
query_3 = 'Why did Singtel operating revenue grow?'
prediction_3 = cdqa_pipeline.predict(query_3)

In [16]:
print('query: {}'.format(query_3))
print('answer: {}'.format(prediction_3[0]))
print('title: {}'.format(prediction_3[1]))
print('paragraph: {}'.format(prediction_3[2]))

query: Why did Singtel operating revenue grow?
answer: increases in ICT, digital services and equipment sales
title: singtel-annual-report-2019
paragraph: In constant currency terms, operating revenue grew 3.7% driven by increases in ICT, digital services and equipment sales. However, EBITDA was down  3.9% mainly due to lower legacy carriage services especially voice, and price erosion. With 6% depreciation in the Australian Dollar, operating revenue was stable while EBITDA declined 7.1%.


In [19]:
query_4 = 'Why did Singtel operating revenue increase?'
prediction_4 = cdqa_pipeline.predict(query_4)

print('query: {}'.format(query_4))
print('answer: {}'.format(prediction_4[0]))
print('title: {}'.format(prediction_4[1]))
print('paragraph: {}'.format(prediction_4[2]))


In [21]:
query_5 = 'Why did Singtel free cash flow grow?'
prediction_5 = cdqa_pipeline.predict(query_5)

print('query: {}'.format(query_5))
print('answer: {}'.format(prediction_5[0]))
print('title: {}'.format(prediction_5[1]))
print('paragraph: {}'.format(prediction_5[2]))

query: Why did Singtel free cash flow grow?
answer: from operating activities, including dividends from associates, less cash capital expenditure
title: singtel-annual-report-2019
paragraph: Notes:(1)  Based on Singapore Financial Reporting Standards (International). (2)  FY 2018 included the gain on disposal of economic interest in NetLink Trust of S$2.03 billion.(3)  Underlying net profit is defined as net profit before exceptional items. (4)  Average A$ rate for translation of Optus’ operating revenue.(5)  Free cash flow refers to cash flow from operating activities, including dividends from associates, less cash capital expenditure. (6)  Return on invested capital is defined as EBIT (post-tax) divided by average capital.


In [22]:
query_6 = 'How is Starhub EBITDA?'
prediction_6 = cdqa_pipeline.predict(query_6)

print('query: {}'.format(query_6))
print('answer: {}'.format(prediction_6[0]))
print('title: {}'.format(prediction_6[1]))
print('paragraph: {}'.format(prediction_6[2]))

query: How is Starhub EBITDA?
answer: via (a) periodic analyst and media briefings throughout the year
title: starhub-annual-report-2018
paragraph: StarHub provides regular and timely information to the investment community regarding the Group’s performance, progress and prospects as well as major industry and corporate developments and other relevant information. StarHub solicits and considers the views of shareholders via (a) periodic analyst and media briefings throughout the year, (b) regular meetings between the CEO, the StarHub IR team and institutional investors through international road shows and conferences organised by major brokerage firms and (c) third-party perception studies on StarHub.


In [24]:
query_7 = 'What is Starhub EBITDA Yoy?'
prediction_7 = cdqa_pipeline.predict(query_7)

print('query: {}'.format(query_7))
print('answer: {}'.format(prediction_7[0]))
print('title: {}'.format(prediction_7[1]))
print('paragraph: {}'.format(prediction_7[2]))

query: What is Starhub EBITDA Yoy?
answer: Executive Resource and Compensation Committee
title: starhub-annual-report-2018
paragraph: The StarHub Share Plans 2014, the StarHub Share Plans 2004, the StarHub Share Option Plan 2004 and the StarHub Share Option Plan 2000 (collectively, “Plans”) are administered by the Company’s Executive Resource and Compensation Committee (“ERCC”) comprising five directors, namely Teo Ek Tor, Stephen Geoffrey Miller, Michelle Lee Guthrie, Lionel Yeo Hung Tong and Lim Ming Seong.

