<img width="10%" alt="Naas" src="https://landen.imgix.net/jtci2pxwjczr/assets/5ice39g4.png?w=160"/>

# Hugging Face - Question Answering from PDF
<a href="https://app.naas.ai/user-redirect/naas/downloader?url=https://raw.githubusercontent.com/jupyter-naas/awesome-notebooks/master/Hugging%20Face/Hugging_Face_Ask_boolean_question_to_T5.ipynb" target="_parent"><img src="https://naasai-public.s3.eu-west-3.amazonaws.com/open_in_naas.svg"/></a>

**Tags:** #huggingface #ml #question_answer #ai #text

**Author:** [Muhammad Talha Khan](https://www.linkedin.com/in/muhtalhakhan/)

**Description**: This notebook uses a Transformers question-answering pipeline to retrieve the answer to a question from a given text, which is useful for searching for an answer in a document. The example uses the Bitcoin white paper, but you can use any PDF.

The PDF serves as the context used for question answering.

## Input

### Install Packages

In [16]:
!pip install tensorflow

Collecting typing-extensions~=3.7.4
  Using cached typing_extensions-3.7.4.3-py3-none-any.whl (22 kB)
Collecting six~=1.15.0
  Using cached six-1.15.0-py2.py3-none-any.whl (10 kB)
Collecting protobuf>=3.9.2
  Using cached protobuf-3.19.6-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
Installing collected packages: typing-extensions, six, protobuf
  Attempting uninstall: typing-extensions
    Found existing installation: typing_extensions 4.1.1
    Uninstalling typing_extensions-4.1.1:
ERROR: Could not install packages due to an OSError: [Errno 13] Permission denied: 'INSTALLER'
Consider using the `--user` option or check the permissions.

In [2]:
!pip install -q transformers

If pip reports a permission error, rerun the install with the `--user` option.

In [3]:
!pip install PyPDF2

Collecting PyPDF2
  Using cached PyPDF2-2.11.1-py3-none-any.whl (220 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-2.11.1

In [4]:
!pip install urllib3

### Import Libraries


In [5]:
from transformers import pipeline
import urllib.request
import PyPDF2
import io

### Add the Document Path

In [6]:
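# Download the PDF into memory; a User-Agent header is sent with the request
URL = 'https://bitcoin.org/bitcoin.pdf'
req = urllib.request.Request(URL, headers={'User-Agent': 'Chrome'})
remote_file = urllib.request.urlopen(req).read()
remote_file_bytes = io.BytesIO(remote_file)
# PdfReader replaces the deprecated PdfFileReader in PyPDF2 2.x
pdfdoc_remote = PyPDF2.PdfReader(remote_file_bytes)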

You can change the URL to point to any other PDF.
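Because any URL can be substituted here, it may be worth checking that the downloaded bytes really are a PDF before handing them to PyPDF2. The following is a minimal sketch, assuming a hypothetical helper named `looks_like_pdf` (it is not part of this notebook or of PyPDF2). It only looks for the `%PDF-` signature near the start of the data and does not validate the rest of the file.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes carry a PDF signature near the start.

    Hypothetical helper for illustration only: it checks the leading
    '%PDF-' marker and nothing else about the file.
    """
    # Some files have a few stray bytes before the header, so search
    # the first 1024 bytes instead of requiring the marker at offset 0.
    return b"%PDF-" in data[:1024]
```

In this notebook it could be called as `looks_like_pdf(remote_file)` right after the download, for example to raise an error when a server returns an HTML page instead of the document.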

## Model

### Read Text from File

In [7]:
pdf_text = ""

# Concatenate the text of every page into a single string
for i, page in enumerate(pdfdoc_remote.pages):
    print(i)
    pdf_text += page.extract_text()

0
1
2
3
4
5
6
7
8
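Text extracted from a PDF often keeps the hard line breaks and doubled spaces of the page layout. As an optional clean-up step, the sketch below collapses every run of whitespace into a single space. The helper name `normalize_whitespace` is an assumption made for illustration and is not part of the original notebook; whether this improves the answers depends on the document.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    # \s+ matches any whitespace run, including line breaks
    return re.sub(r"\s+", " ", text).strip()
```

It could be applied as `pdf_text = normalize_whitespace(pdf_text)` before the text is passed to the pipeline as context.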


### Display the text extracted from the PDF file

In [8]:
print(pdf_text)

Bitcoin: A Peer-to-Peer Electronic Cash System
Satoshi Nakamoto
satoshin@gmx.com
www.bitcoin.org
Abstract.  A purely peer-to-peer version of electronic cash would allow online  
payments to be sent directly from one party to another without going through a  
financial institution.  Digital signatures provide part of the solution, but the main  
benefits are lost if a trusted third party is still required to prevent double-spending.  
We propose a solution to the double-spending problem using a peer-to-peer network.  
The network timestamps transactions by hashing them into an ongoing chain of  
hash-based proof-of-work, forming a record that cannot be changed without redoing  
the proof-of-work.  The longest chain not only serves as proof of the sequence of  
events witnessed, but proof that it came from the largest pool of CPU power.  As  
long as a majority of CPU power is controlled by nodes that are not cooperating to  
attack the network, they'll generate the longest chain and out

### Loading the pipeline
Create the question-answering pipeline from Transformers once the `transformers` and `tensorflow` packages are installed.

In [9]:
nlp = pipeline('question-answering', model='deepset/roberta-base-squad2', tokenizer='deepset/roberta-base-squad2')

## Output

### Ask question

In [12]:
context = pdf_text
question = input('Enter your question:\n')

# The pipeline accepts a dict with 'context' and 'question' keys
question_set = {
    'context': context,
    'question': question
}

results = nlp(question_set)

Enter your question:
 Who created Bitcoin
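To ask several questions against the same document, the pipeline call can be wrapped in a small loop. The sketch below assumes a hypothetical helper named `ask_all` and relies on the question-answering pipeline also accepting `question=` and `context=` keyword arguments. The `qa` parameter can be any callable with that interface, such as the `nlp` object created earlier.

```python
from typing import Callable, Dict, List


def ask_all(qa: Callable[..., dict], context: str, questions: List[str]) -> Dict[str, str]:
    """Run each question against the same context and collect the answers.

    Hypothetical helper for illustration; `qa` is expected to return a
    dict containing an 'answer' key, as the pipeline does.
    """
    answers = {}
    for q in questions:
        result = qa(question=q, context=context)
        answers[q] = result["answer"]
    return answers
```

A call such as `ask_all(nlp, pdf_text, ["Who created Bitcoin", "What problem does the paper solve"])` would return a dictionary mapping each question to its answer string.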




### Get answer

This prints the answer to the question you asked above.

In [13]:
print("\nAnswer: " + results['answer'])


Answer: Satoshi Nakamoto
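Besides `answer`, the dictionary returned by the question-answering pipeline also carries a `score` and the `start` and `end` character offsets of the answer within the context. The sketch below uses them to report the confidence and the surrounding text. The helper name `describe_answer`, the 40-character window, and the 0.1 score threshold are assumptions chosen only for illustration.

```python
def describe_answer(result: dict, context: str, window: int = 40,
                    min_score: float = 0.1) -> str:
    """Format a QA result with its score and the text around the answer."""
    if result["score"] < min_score:
        return f"Low-confidence answer ({result['score']:.2f}): {result['answer']!r}"
    # Slice a few characters on each side of the answer span
    lo = max(0, result["start"] - window)
    hi = min(len(context), result["end"] + window)
    snippet = context[lo:hi].replace("\n", " ")
    return (f"Answer: {result['answer']} (score {result['score']:.2f})\n"
            f"Context: ...{snippet}...")
```

In this notebook it could be used as `print(describe_answer(results, context))`, which makes it easier to judge whether an extracted answer is plausible.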
