# Question Answering System




This notebook shows how to use the PDF converter to turn a directory of PDF files into an input DataFrame for the cdQA pipeline, and then how to query that pipeline with a pre-trained reader model.


In [0]:
# Install cdQA (shell command run from the notebook)
!pip install cdqa

import os
import pandas as pd
from ast import literal_eval

from cdqa.utils.converters import pdf_converter
from cdqa.utils.filters import filter_paragraphs
from cdqa.pipeline import QAPipeline
from cdqa.utils.download import download_model



### Download pre-trained reader model and PDF files

In [0]:
# Download model
download_model(model='bert-squad_1.1', dir='./models')


Downloading trained model...
bert_qa.joblib already downloaded


In [0]:
# Download the PDF files listed in pdf_urls
import os
import wget

def download_pdf():
    directory = './data/pdf/'
    # Replace or extend this list with your own PDF URLs
    pdf_urls = [
      'https://www.actfl.org/sites/default/files/pdfs/ACTFL06handouts/Session146-KenStewart-Handout1.pdf'
    ]

    print('\nDownloading PDF files...')

    # Create the target directory if it does not exist yet
    os.makedirs(directory, exist_ok=True)
    for url in pdf_urls:
        wget.download(url=url, out=directory)

download_pdf()


Downloading PDF files...
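
The dataframe shown later in this notebook lists the same handout twice, once with a ` (1)` suffix, which suggests the download cell was run more than once. A minimal sketch of a download step that skips files already on disk is given here. It swaps `wget` for the standard library's `urllib.request`, and the helper names `local_path_for` and `download_if_missing` are hypothetical, not part of cdQA.

```python
import os
import urllib.request
from urllib.parse import urlparse


def local_path_for(url, directory):
    """Return the path a URL's file would be saved to inside `directory`."""
    filename = os.path.basename(urlparse(url).path)
    return os.path.join(directory, filename)


def download_if_missing(url, directory):
    """Download `url` into `directory` unless the file is already there.

    Returns (path, downloaded), where `downloaded` is False when the
    file existed and no request was made.
    """
    os.makedirs(directory, exist_ok=True)
    path = local_path_for(url, directory)
    if os.path.exists(path):
        return path, False  # already present, nothing downloaded
    urllib.request.urlretrieve(url, path)
    return path, True
```

Calling `download_if_missing(url, './data/pdf/')` for each URL in the list should leave one copy of each PDF in the directory, however many times the cell is run.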


### Convert the PDF files into a DataFrame for the cdQA pipeline

In [0]:
df = pdf_converter(directory_path='./data/pdf/')
df.head()

2020-03-31 16:15:25,041 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar to /tmp/tika-server.jar.
2020-03-31 16:15:33,414 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar.md5 to /tmp/tika-server.jar.md5.
2020-03-31 16:15:35,403 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...


| | title | paragraphs |
|---|---|---|
| 0 | Session146-KenStewart-Handout1 (1) | [100 ESSAY AND JOURNAL TOPICSt100 ESSAY AND JO... |
| 1 | Session146-KenStewart-Handout1 | [100 ESSAY AND JOURNAL TOPICSt100 ESSAY AND JO... |
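
The two rows above appear to hold the same document under two titles. A short sketch of dropping such duplicates before fitting the retriever follows. It assumes, as the output suggests, that each `paragraphs` cell holds a list of strings. The dataframe `df_example` and its contents are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the converter output: one row per PDF,
# with `paragraphs` holding a list of strings.
df_example = pd.DataFrame({
    "title": ["Handout1 (1)", "Handout1", "Other"],
    "paragraphs": [["para a", "para b"], ["para a", "para b"], ["para c"]],
})

# Lists are unhashable, so compare rows on a tuple version of the column.
paragraph_key = df_example["paragraphs"].apply(tuple)

# Keep the first row of each group of identical paragraph lists.
df_unique = df_example[~paragraph_key.duplicated(keep="first")].reset_index(drop=True)

print(df_unique["title"].tolist())
```

The same two lines could be applied to `df` in this notebook, so that the retriever does not index the same paragraphs twice.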


In [0]:
cdqa_pipeline = QAPipeline(reader='./models/bert_qa.joblib', max_df=1.0)

# Fit Retriever to documents
cdqa_pipeline.fit_retriever(df=df)

100%|██████████| 231508/231508 [00:00<00:00, 354975.64B/s]


QAPipeline(reader=BertQA(adam_epsilon=1e-08, bert_model='bert-base-uncased',
                         do_lower_case=True, fp16=False,
                         gradient_accumulation_steps=1, learning_rate=5e-05,
                         local_rank=-1, loss_scale=0, max_answer_length=30,
                         n_best_size=20, no_cuda=False,
                         null_score_diff_threshold=0.0, num_train_epochs=3.0,
                         output_dir=None, predict_batch_size=8, seed=42,
                         server_ip='', server_po..._size=8,
                         verbose_logging=False, version_2_with_negative=False,
                         warmup_proportion=0.1, warmup_steps=0),
           retrieve_by_doc=False,
           retriever=BM25Retriever(b=0.75, floor=None, k1=2.0, lowercase=True,
                                   max_df=1.0, min_df=2, ngram_range=(1, 2),
                                   preprocessor=None, stop_words='english',
                                   t
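
The repr above shows that the pipeline uses a `BM25Retriever` with `k1=2.0` and `b=0.75`. To give a feel for what those two parameters control, here is a simplified sketch of the general Okapi BM25 scoring formula. It is not cdQA's implementation: it splits on whitespace only and ignores the `ngram_range`, `stop_words`, and `min_df` settings, and the function name `bm25_scores` is hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=2.0, b=0.75):
    """Score each document against the query with a basic Okapi BM25.

    Assumes `documents` is a non-empty list with at least one non-empty text.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # k1 caps the effect of repeated terms; b scales the length penalty.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


paragraphs = [
    "journal writing is an informal approach to developing writing skills",
    "the assessment is based on improvement",
    "students share anecdotes in a dialogue journal",
]
scores = bm25_scores("journal writing", paragraphs)
best = max(range(len(paragraphs)), key=scores.__getitem__)
print(best, paragraphs[best])
```

A larger `k1` lets repeated query terms keep adding to the score for longer, while `b` closer to 1 penalizes long paragraphs more strongly.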

### Execute a query

In [0]:
query = input("")
prediction = cdqa_pipeline.predict(query)

What is journal writing?


### Explore predictions

In [0]:
print('query: {}'.format(query))
print('answer: {}'.format(prediction[0]))
print('title: {}'.format(prediction[1]))
print('paragraph: {}'.format(prediction[2]))

query: What is journal writing?
answer: an informal approach to developing students’ writing skills
title: Session146-KenStewart-Handout1 (1)
paragraph: Chapel Hill High School Chapel Hill, North Carolina  Journal writing is an informal approach to developing students’ writing skills. The assessment is primarily based on improvement and completing a minimum number of pages (5 pages skipping lines) by the established deadline. My feedback to students is focused on interaction with what they have written as opposed to correcting syntax or orthography. Since this is a dialogue journal, I respect the confidential nature of what students choose to write. Be prepared for students to share anecdotes that may be sensitive in content. This is a great way to get to know your students on a more personal level. I do not place a value judgment on their ideas or how compelling their argument may be. I am concerned with improvement from one journal collection to the next. Ease of expression and sophi
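
The cell above reads the prediction by position: index 0 is the answer, index 1 the document title, and index 2 the source paragraph. A small sketch of naming those fields and checking where the answer sits inside the paragraph is shown here. The helper `summarize_prediction` is hypothetical, and it only assumes the three positions used above.

```python
def summarize_prediction(query, prediction):
    """Name the first three fields of a prediction and locate the answer."""
    answer, title, paragraph = prediction[:3]
    start = paragraph.find(answer)  # -1 if the answer text is not found verbatim
    return {
        "query": query,
        "answer": answer,
        "title": title,
        "answer_start": start,
        "found_in_paragraph": start != -1,
    }


# Values copied (and shortened) from the output above.
example_prediction = (
    "an informal approach to developing students’ writing skills",
    "Session146-KenStewart-Handout1 (1)",
    "Journal writing is an informal approach to developing students’ writing skills.",
)
summary = summarize_prediction("What is journal writing?", example_prediction)
print(summary["found_in_paragraph"], summary["answer_start"])
```

With the real pipeline, `summarize_prediction(query, prediction)` could be called directly on the values from the cells above.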