<a href="https://colab.research.google.com/github/plaban1981/POCs/blob/main/Updated_Extract_tabular_data_from_PDF_document_using_Camelot_in_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Camelot 

* Camelot is an open-source Python library, that enables developers to extract all tables from the PDF document and convert it to Pandas Dataframe format. The extracted table can also be exported in a structured form as CSV, JSON, Excel, or other formats, and can be used for modeling.

* Camelot only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)


* For large files, the camelot library tends to outperform tabula-py. However, sometimes you'll encounter a NotImplementedError for some PDFs using the camelot library, you can use tabula-py as an alternative.

## Working - Camelot processing  under the hood. 

#### Typically, two parsing methods are used by Camelot to extract tables:


* **Stream**: looks for whitespaces between words to identify a table.

* **Lattice**: Looks for lines on a page to identify a table. 

* Lattice is used by default.

In [2]:
pip install "camelot-py[cv]"

Collecting camelot-py[cv]
  Downloading camelot_py-0.10.1-py3-none-any.whl (40 kB)
[K     |████████████████████████████████| 40 kB 27 kB/s 
Collecting PyPDF2>=1.26.0
  Downloading PyPDF2-1.26.0.tar.gz (77 kB)
[K     |████████████████████████████████| 77 kB 3.1 MB/s 
[?25hCollecting pdfminer.six>=20200726
  Downloading pdfminer.six-20211012-py3-none-any.whl (5.6 MB)
[K     |████████████████████████████████| 5.6 MB 19.2 MB/s 
Collecting pdftopng>=0.2.3
  Downloading pdftopng-0.2.3-cp37-cp37m-manylinux2010_x86_64.whl (11.7 MB)
[K     |████████████████████████████████| 11.7 MB 16.2 MB/s 
[?25hCollecting ghostscript>=0.7
  Downloading ghostscript-0.7-py2.py3-none-any.whl (25 kB)
Collecting cryptography
  Downloading cryptography-35.0.0-cp36-abi3-manylinux_2_24_x86_64.whl (3.5 MB)
[K     |████████████████████████████████| 3.5 MB 59.9 MB/s 
Building wheels for collected packages: PyPDF2
  Building wheel for PyPDF2 (setup.py) ... [?25l[?25hdone
  Created wheel for PyPDF2: filename=PyP

##PyPDF

In [3]:
!pip install install PyPDF2

Collecting install
  Downloading install-1.3.4-py3-none-any.whl (3.1 kB)
Installing collected packages: install
Successfully installed install-1.3.4


## Import Drive

In [4]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [5]:
import camelot

## Extract the number of pages in PDF file 

In [6]:
from PyPDF2 import PdfFileReader
path = '/content/drive/MyDrive/ZeoanAI_Poc/4Q19-Press-Release.pdf'
def get_pdf_page_count(path):
  with open(path, 'rb') as fl:
    reader = PdfFileReader(fl)
    return reader.getNumPages()

In [7]:
num_pages = get_pdf_page_count(path)


In [8]:
num_pages

14

In [9]:
pages = '1-'+str(num_pages)
pages

'1-14'

In [10]:
def read_pdf(path,page_num):
  with open(path, "rb") as filehandle:
    pdf = PdfFileReader(filehandle)
    page1 = pdf.getPage(page_num)
    content = page1.extractText()
    return content[1:]

In [11]:
read_pdf(path,6)

'Caution Concerning Forward-Looking StatementsThis press release contains ﬁforward-looking statementsﬂ within the meaning of the Private Securities Litigation Reform Act of 1995. Words such as ﬁmay,ﬂ ﬁshould,ﬂ ﬁexpects,ﬂ ﬁintends,ﬂ ﬁprojects,ﬂ ﬁplans,ﬂ ﬁbelieves,ﬂ ﬁestimates,ﬂ ﬁtargets,ﬂ ﬁanticipates,ﬂ and similar expressions generally identify these forward-looking statements. Examples of forward-looking statements include statements relating to our future financial condition and operating results, as well as any other statement that does not directly relate to any historical or current fact. Forward-looking statements are based on expectations and assumptions that we believe to be reasonable when made, but that may not prove to be accurate. These statements are not guarantees and are subject to risks, uncertainties, and changes in circumstances that are difficult to predict. Many factors could cause actual results to differ materially and adversely from these forward-looking statemen

## Define helper functions

1. get number of pages in the pdf file

In [12]:
def get_pdf_page_count(path):
  with open(path, 'rb') as fl:
    reader = PdfFileReader(fl)
    return reader.getNumPages()

2. Read the content of the pdf that has no tables

In [13]:
def read_pdf(path,page_num):
  with open(path, "rb") as filehandle:
    pdf = PdfFileReader(filehandle)
    page1 = pdf.getPage(page_num)
    content = page1.extractText()
    return content[1:]

3. Extract the header, position of the header and the text prior to the header as normal text

In [14]:
def extract_normal_header_text(df):
  normal_text = []
  tabular_text = []
  header = []
  row_number = 0
  for i,items in enumerate(df.values.tolist()):
    count = 0
    #print(items)
    for j in range(len(df.columns.values.tolist())):
      #print(items[j])
      if items[j] == "":
        count +=1
    if count > 2:
      text = ""
      normal_text.append("".join(items))
      #second_last = items
    else:
      row_number = i
      #print(second_last)
      header = items
      break
  return normal_text,header,row_number

4. Extract para from the tables detected in the pdf and save into respective json files

In [15]:
def extract_para(normal_text,df,header,row_number,table_num):
  #print(normal_text)
  #print(df)
  paragraphs  = []
  paragraphs.append(f"PDF Page Number : {table_num} .")
  const = " ".join(normal_text)
  if len(normal_text) > 0:# if we have headers from the start
    file_name = f'Table_{table_num}_' +normal_text[0][:20].replace("/","").strip()+".json"
  else:
    file_name = f'Table_{table_num}.json'
  df.to_json(file_name)
  df = df.iloc[row_number:]
  if len(normal_text) > 0:
    for txts in normal_text:
      paragraphs.append("".join(txts))
  if len(header) > 0:
    cols = df.columns
    for items in df.values.tolist()[1:]:
      #print(items)
      for col_item in range(1,len(cols)):
        if items[col_item] > " ":
          temp = [ "The "+ header[0]+ " "+ items[0]]
          text = " associated with the context "+const +" for "+str(header[col_item]) + " is " + items[col_item]+'.'
          temp.append(text)
          #print(temp)
          paragraphs.append(" ".join(temp))
  documents = " ".join(paragraphs)
  return paragraphs,documents

In [16]:
def extract_para(normal_text,df,header,row_number,table_num):
  #print(normal_text)
  #print(df)
  paragraphs  = []
  if len (normal_text) > 0:
    const = "The table is present in PDF page : "+ str(table_num)+ " with topic " +normal_text[0].strip()
  else:
    const = ""
  if len(normal_text) > 0:# if we have headers from the start
    file_name = f'Table_{table_num}_' +normal_text[0][:20].replace("/","").strip()+".json"
  else:
    file_name = f'Table_{table_num}.json'
  df.to_json(file_name)
  df = df.iloc[row_number:]
  if len(normal_text) > 0:
    for txts in normal_text:
      paragraphs.append("".join(txts))
  if len(header) > 0:
    cols = df.columns
    for items in df.values.tolist()[1:]:
      #print(items)
      for col_item in range(1,len(cols)):
        if items[col_item] > " " and const != "":
          temp = [const + " associated with the " + header[0]+ " "+ items[0]]
          text = "for "+str(header[col_item]) + " is " + items[col_item]+'.'
          temp.append(text)
          #print(temp)
          paragraphs.append(" ".join(temp))
        else:
          if len(items[col_item]) > 1:
            temp = [ "The "+ header[0]+ " "+ items[0]]
            text =  " is " + items[col_item]+'.'
            temp.append(text)
            paragraphs.append(" ".join(temp))
  documents = " ".join(paragraphs)
  return paragraphs,documents

## Main Function 

Function to check if table exists in a pdf, if exists check for tables.
If table found store the details into dcuments list else store it in a no_pd_list

In [17]:
def is_table(path):
  num_pages = get_pdf_page_count(path)
  page = '1-'+str(num_pages)
  document_list = []
  no_tables_list = []
  tables = camelot.read_pdf(path, flavor='stream', pages=pages)
  for i in range(tables.n):
    pdf_page = tables[i].parsing_report['page']
    if tables[i].parsing_report['whitespace'] > 0.0:
      df = tables[i].df
      print('*' * 80)
      print(f'processing Table in pdf page {pdf_page}')
      print('*' * 80)
      #
      normal_text,header,row_number = extract_normal_header_text(df)
      paragraphs,docs = extract_para(normal_text,df,header,row_number,pdf_page)
      document_list.append(docs)
    else:
      print(f'processing PDF page number {pdf_page}')
      content = read_pdf(path,i)
      content = f"PDF Page Number : {pdf_page} ." + content
      no_tables_list.append(content)


  return document_list,no_tables_list

## Invoke the main function

In [18]:
path = '/content/drive/MyDrive/ZeoanAI_Poc/4Q19-Press-Release.pdf'
import time
start = time.time()
%timeit
doc_list ,no_tables_list = is_table(path)
end  = time.time()
print(f"Time taken to complete the entire task in seconds : {end - start}")



********************************************************************************
processing Table in pdf page 1
********************************************************************************
********************************************************************************
processing Table in pdf page 2
********************************************************************************
********************************************************************************
processing Table in pdf page 3
********************************************************************************
********************************************************************************
processing Table in pdf page 4
********************************************************************************
********************************************************************************
processing Table in pdf page 5
********************************************************************************
processing PDF page number 6
proces

## Verifying the results

In [19]:
len(doc_list)

13

In [20]:
len(no_tables_list)

2

In [21]:
doc_list

['Table 1.  Summary Financial Fourth QuarterFull Year Results The table is present in PDF page : 1 with topic Table 1.  Summary Financial associated with the (Dollars in Millions, except per share data) Revenues for 2019 is $17,911. The table is present in PDF page : 1 with topic Table 1.  Summary Financial associated with the (Dollars in Millions, except per share data) Revenues for 2018 is $28,341. The table is present in PDF page : 1 with topic Table 1.  Summary Financial associated with the (Dollars in Millions, except per share data) Revenues for Change is (37)%. The table is present in PDF page : 1 with topic Table 1.  Summary Financial associated with the (Dollars in Millions, except per share data) Revenues for 2019 is $76,559. The table is present in PDF page : 1 with topic Table 1.  Summary Financial associated with the (Dollars in Millions, except per share data) Revenues for 2018 is $101,127. The table is present in PDF page : 1 with topic Table 1.  Summary Financial associ

In [None]:
doc_list[0]

'Table 1.  Summary Financial        Fourth Quarter   Full Year   Results       The table is present in PDF page : 1 with topic Table 1.  Summary Financial associated with the (Dollars in Millions, except per share data) Revenues for 2019 is $17,911. The table is present in PDF page : 1 with topic Table 1.  Summary Financial associated with the (Dollars in Millions, except per share data) Revenues for 2018 is $28,341. The table is present in PDF page : 1 with topic Table 1.  Summary Financial associated with the (Dollars in Millions, except per share data) Revenues for Change is (37)%. The table is present in PDF page : 1 with topic Table 1.  Summary Financial associated with the (Dollars in Millions, except per share data) Revenues for 2019 is $76,559. The table is present in PDF page : 1 with topic Table 1.  Summary Financial associated with the (Dollars in Millions, except per share data) Revenues for 2018 is $101,127. The table is present in PDF page : 1 with topic Table 1.  Summary

In [None]:
no_tables_list

['PDF Page Number : 6 .Non-GAAP Measures Disclosures We supplement the reporting of our financial information determined under Generally Accepted Accounting Principles in the United States of America (GAAP) with certain non-GAAP financial information.  The non-GAAP financial information presented excludes certain significant items that may not be indicative of, or are unrelated to, results from our ongoing business operations.  We believe that these non-GAAP measures provide investors with additional insight into the company™s ongoing business performance.  These non-GAAP measures should not be considered in isolation or as a substitute for the related GAAP measures, and other companies may define such measures differently.  We encourage investors to review our financial statements and publicly-filed reports in their entirety and not to rely on any single financial measure.  The following definitions are provided: Core Operating (Loss)/Earnings, Core Operating Margin and Core (Loss)/Ea

In [None]:
doc_list[2]

'Segment Results       Commercial Airplanes       Table 4. Commercial Airplanes Fourth Quarter   Full Year   The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) Commercial Airplanes Deliveries for 2019 is 79. The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) Commercial Airplanes Deliveries for 2018 is 238. The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) Commercial Airplanes Deliveries for Change is (67)%. The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) Commercial Airplanes Deliveries for 2019 is 380. The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) Commercial Airplanes Deliveries for 2018 is 806. The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) Commerc

In [None]:
doc_list[3]

'Defense, Space & Security       Table 5. Defense, Space &       Security Fourth Quarter   Full Year   The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Revenues for 2019 is $5,962. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Revenues for 2018 is $6,874. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Revenues for Change is (13)%. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Revenues for 2019 is $26,227. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Revenues for 2018 is $26,392. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Revenues for Change is (1%). The table is present in 

In [None]:
doc_list[4]

"Additional Financial Information      Table 7. Additional Financial Information  Fourth Quarter  Full Year  The table is present in PDF page : 5 with topic Additional Financial Information associated with the (Dollars in Millions) Boeing Capital for 2019 is $37. The table is present in PDF page : 5 with topic Additional Financial Information associated with the (Dollars in Millions) Boeing Capital for 2018 is $60. The table is present in PDF page : 5 with topic Additional Financial Information associated with the (Dollars in Millions) Boeing Capital for 2019 is $244. The table is present in PDF page : 5 with topic Additional Financial Information associated with the (Dollars in Millions) Boeing Capital for 2018 is $274. The table is present in PDF page : 5 with topic Additional Financial Information associated with the (Dollars in Millions) Unallocated items, eliminations and other for 2019 is ($198). The table is present in PDF page : 5 with topic Additional Financial Information ass

In [None]:
doc_list[5]

'The Sales of products Sales of services  is 10,465. The Sales of products Sales of services  is 10,898. The Sales of products Sales of services  is 2,331. The Sales of products Sales of services  is 2,960. The Sales of products Total revenues  is 76,559. The Sales of products Total revenues  is 101,127. The Sales of products Total revenues  is 17,911. The Sales of products Total revenues  is 28,341. The Sales of products Cost of products  is (62,877). The Sales of products Cost of products  is (72,922). The Sales of products Cost of products  is (16,293). The Sales of products Cost of products  is (19,788). The Sales of products Cost of services  is (9,154). The Sales of products Cost of services  is (8,499). The Sales of products Cost of services  is (2,402). The Sales of products Cost of services  is (2,284). The Sales of products Boeing Capital interest expense  is (62). The Sales of products Boeing Capital interest expense  is (69). The Sales of products Boeing Capital interest ex

In [None]:
doc_list[6]

'The The Boeing Company and Subsidiaries   is December 31. The The Boeing Company and Subsidiaries   is December 31. The The Boeing Company and Subsidiaries (Dollars in millions, except per share data)  is 2019. The The Boeing Company and Subsidiaries (Dollars in millions, except per share data)  is 2018. The The Boeing Company and Subsidiaries Cash and cash equivalents  is $9,485. The The Boeing Company and Subsidiaries Cash and cash equivalents  is $7,637. The The Boeing Company and Subsidiaries Short-term and other investments  is 545. The The Boeing Company and Subsidiaries Short-term and other investments  is 927. The The Boeing Company and Subsidiaries Accounts receivable, net  is 3,266. The The Boeing Company and Subsidiaries Accounts receivable, net  is 3,879. The The Boeing Company and Subsidiaries Unbilled receivables, net  is 9,043. The The Boeing Company and Subsidiaries Unbilled receivables, net  is 10,025. The The Boeing Company and Subsidiaries Current portion of custome

In [None]:
doc_list[7]

'The Cash flows – operating activities: Net (loss)/earnings  is ($636). The Cash flows – operating activities: Net (loss)/earnings  is $10,460. The Cash flows – operating activities: Share-based plans expense  is 212. The Cash flows – operating activities: Share-based plans expense  is 202. The Cash flows – operating activities: Depreciation and amortization  is 2,271. The Cash flows – operating activities: Depreciation and amortization  is 2,114. The Cash flows – operating activities: Investment/asset impairment charges, net  is 443. The Cash flows – operating activities: Investment/asset impairment charges, net  is 93. The Cash flows – operating activities: Customer financing valuation adjustments  is 250. The Cash flows – operating activities: Customer financing valuation adjustments  is (3). The Cash flows – operating activities: Gain on dispositions, net  is (691). The Cash flows – operating activities: Gain on dispositions, net  is (75). The Cash flows – operating activities: Oth

In [None]:
doc_list[12]

'The (loss)/earnings from operations, operating margin, and diluted (loss)/earnings per share. See page 6 of this release (Dollars in millions, except per share data)  is Full Year 2019. The (loss)/earnings from operations, operating margin, and diluted (loss)/earnings per share. See page 6 of this release (Dollars in millions, except per share data)  is Full Year 2018. The (loss)/earnings from operations, operating margin, and diluted (loss)/earnings per share. See page 6 of this release   is $ millions\nPer Share. The (loss)/earnings from operations, operating margin, and diluted (loss)/earnings per share. See page 6 of this release   is $ millions\nPer Share. The (loss)/earnings from operations, operating margin, and diluted (loss)/earnings per share. See page 6 of this release Revenues  is 76,559. The (loss)/earnings from operations, operating margin, and diluted (loss)/earnings per share. See page 6 of this release Revenues  is 101,127. The (loss)/earnings from operations, operati

In [24]:
len(doc_list)

13

In [26]:
doc_list[0]

'Table 1.  Summary Financial Fourth QuarterFull Year Results The table is present in PDF page : 1 with topic Table 1.  Summary Financial associated with the (Dollars in Millions, except per share data) Revenues for 2019 is $17,911. The table is present in PDF page : 1 with topic Table 1.  Summary Financial associated with the (Dollars in Millions, except per share data) Revenues for 2018 is $28,341. The table is present in PDF page : 1 with topic Table 1.  Summary Financial associated with the (Dollars in Millions, except per share data) Revenues for Change is (37)%. The table is present in PDF page : 1 with topic Table 1.  Summary Financial associated with the (Dollars in Millions, except per share data) Revenues for 2019 is $76,559. The table is present in PDF page : 1 with topic Table 1.  Summary Financial associated with the (Dollars in Millions, except per share data) Revenues for 2018 is $101,127. The table is present in PDF page : 1 with topic Table 1.  Summary Financial associa

## Questgen api

In [27]:
!pip install git+https://github.com/ramsrigouthamg/Questgen.ai

Collecting git+https://github.com/ramsrigouthamg/Questgen.ai
  Cloning https://github.com/ramsrigouthamg/Questgen.ai to /tmp/pip-req-build-mbw7e7tc
  Running command git clone -q https://github.com/ramsrigouthamg/Questgen.ai /tmp/pip-req-build-mbw7e7tc
Collecting transformers==3.0.2
  Downloading transformers-3.0.2-py3-none-any.whl (769 kB)
[K     |████████████████████████████████| 769 kB 4.0 MB/s 
[?25hCollecting pytorch_lightning==0.8.1
  Downloading pytorch_lightning-0.8.1-py3-none-any.whl (293 kB)
[K     |████████████████████████████████| 293 kB 32.6 MB/s 
[?25hCollecting sense2vec==1.0.3
  Downloading sense2vec-1.0.3-py2.py3-none-any.whl (35 kB)
Collecting strsim==0.0.3
  Downloading strsim-0.0.3-py3-none-any.whl (42 kB)
[K     |████████████████████████████████| 42 kB 936 kB/s 
Collecting networkx==2.4.0
  Downloading networkx-2.4-py3-none-any.whl (1.6 MB)
[K     |████████████████████████████████| 1.6 MB 41.4 MB/s 
Collecting unidecode==1.1.1
  Downloading Unidecode-1.1.1-py

In [28]:
!pip install --quiet git+https://github.com/boudinfl/pke.git

  Building wheel for pke (setup.py) ... [?25l[?25hdone


In [1]:
!python -m nltk.downloader universal_tagset
!python -m spacy download en

[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.
Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
[K     |████████████████████████████████| 12.0 MB 4.2 MB/s 
[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_sm')
[38;5;2m✔ Linking successful[0m
/usr/local/lib/python3.7/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.7/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [2]:
!wget https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2015_md.tar.gz

--2021-10-27 17:05:42--  https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2015_md.tar.gz
Resolving github.com (github.com)... 52.192.72.89
Connecting to github.com (github.com)|52.192.72.89|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-releases.githubusercontent.com/50261113/52126080-0993-11ea-8190-8f0e295df22a?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20211027%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20211027T170542Z&X-Amz-Expires=300&X-Amz-Signature=a0376367ebd9b8aae43fec92605aa9afc9559ad53e19eea74d204540bf5d25f8&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=50261113&response-content-disposition=attachment%3B%20filename%3Ds2v_reddit_2015_md.tar.gz&response-content-type=application%2Foctet-stream [following]
--2021-10-27 17:05:43--  https://github-releases.githubusercontent.com/50261113/52126080-0993-11ea-8190-8f0e295df22a?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=

In [3]:
!tar -xvf  s2v_reddit_2015_md.tar.gz

./._s2v_old
./s2v_old/
./s2v_old/._freqs.json
./s2v_old/freqs.json
./s2v_old/._vectors
./s2v_old/vectors
./s2v_old/._cfg
./s2v_old/cfg
./s2v_old/._strings.json
./s2v_old/strings.json
./s2v_old/._key2row
./s2v_old/key2row


In [4]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [5]:
from pprint import pprint
from Questgen import main

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gazetteers.zip.
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/genesis.zip.
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gutenberg.zip.
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/inaugural.zip.
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unz

## Question Answering

In [6]:
answer = main.AnswerPredictor()

Downloading:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/242M [00:00<?, ?B/s]

In [7]:
pghs = "Defense, Space & Security       Table 5. Defense, Space &       Security Fourth Quarter   Full Year   The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Revenues for 2019 is $5,962. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Revenues for 2018 is $6,874. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Revenues for Change is (13)%. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Revenues for 2019 is $26,227. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Revenues for 2018 is $26,392. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Revenues for Change is (1%). The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Earnings from Operations for 2019 is $31. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Earnings from Operations for 2018 is $771. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Earnings from Operations for Change is (96)%. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Earnings from Operations for 2019 is $2,608. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Earnings from Operations for 2018 is $1,657. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Earnings from Operations for Change is 57%. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Operating Margin for 2019 is 0.5%. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Operating Margin for 2018 is 11.2%. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Operating Margin for Change is (10.7) Pts. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Operating Margin for 2019 is 9.9%. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Operating Margin for 2018 is 6.3%. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Operating Margin for Change is 3.6 Pts."

In [8]:

payload3 = {
    "input_text" : pghs ,
    "input_question" : "What is the earnings from Operations for Defense, Space & Security in 2019? "

}

In [9]:
answer.predict_answer(payload3)

Token indices sequence length is longer than the specified maximum sequence length for this model (768 > 512). Running this sequence through the model will result in indexing errors


'$26,227'

## haystack

In [10]:
! pip install farm-haystack

Collecting farm-haystack
  Downloading farm_haystack-0.10.0-py3-none-any.whl (200 kB)
[?25l[K     |█▋                              | 10 kB 21.9 MB/s eta 0:00:01[K     |███▎                            | 20 kB 26.6 MB/s eta 0:00:01[K     |█████                           | 30 kB 14.9 MB/s eta 0:00:01[K     |██████▌                         | 40 kB 10.7 MB/s eta 0:00:01[K     |████████▏                       | 51 kB 4.5 MB/s eta 0:00:01[K     |█████████▉                      | 61 kB 4.8 MB/s eta 0:00:01[K     |███████████▌                    | 71 kB 4.3 MB/s eta 0:00:01[K     |█████████████                   | 81 kB 4.9 MB/s eta 0:00:01[K     |██████████████▊                 | 92 kB 4.9 MB/s eta 0:00:01[K     |████████████████▍               | 102 kB 4.2 MB/s eta 0:00:01[K     |██████████████████              | 112 kB 4.2 MB/s eta 0:00:01[K     |███████████████████▋            | 122 kB 4.2 MB/s eta 0:00:01[K     |█████████████████████▎          | 133 kB 4.2 MB/s e

In [1]:
from haystack.preprocessor.cleaning import clean_wiki_text
from haystack.preprocessor.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers



In [2]:
# Recommended: Start Elasticsearch using Docker via the Haystack utility function
from haystack.utils import launch_es

launch_es()

10/27/2021 17:16:12 - INFO - haystack.utils -   Starting Elasticsearch ...


In [3]:
# In Colab / No Docker environments: Start Elasticsearch from source
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.9.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30

In [12]:
# Connect to Elasticsearch

from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

10/27/2021 17:18:43 - INFO - elasticsearch -   HEAD http://localhost:9200/ [status:200 request:0.105s]
10/27/2021 17:18:44 - INFO - elasticsearch -   PUT http://localhost:9200/document [status:200 request:0.428s]
10/27/2021 17:18:44 - INFO - elasticsearch -   PUT http://localhost:9200/label [status:200 request:0.266s]


## Preprocessing of documents

Haystack provides a customizable pipeline for:
 - converting files into texts
 - cleaning texts
 - splitting texts
 - writing them to a Document Store


## Modify data in correct format for storing in Elastic Document Store

In [20]:
pghs = "Defense, Space & Security       Table 5. Defense, Space &       Security Fourth Quarter   Full Year   The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Revenues for 2019 is $5,962. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Revenues for 2018 is $6,874. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Revenues for Change is (13)%. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Revenues for 2019 is $26,227. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Revenues for 2018 is $26,392. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Revenues for Change is (1%). The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Earnings from Operations for 2019 is $31. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Earnings from Operations for 2018 is $771. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Earnings from Operations for Change is (96)%. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Earnings from Operations for 2019 is $2,608. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Earnings from Operations for 2018 is $1,657. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Earnings from Operations for Change is 57%. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Operating Margin for 2019 is 0.5%. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Operating Margin for 2018 is 11.2%. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Operating Margin for Change is (10.7) Pts. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Operating Margin for 2019 is 9.9%. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Operating Margin for 2018 is 6.3%. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Operating Margin for Change is 3.6 Pts."

In [8]:
content  ="Table 4. Commercial Airplanes Fourth Quarter   Full Year   The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) Commercial Airplanes Deliveries for 2019 is 79. The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) Commercial Airplanes Deliveries for 2018 is 238. The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) Commercial Airplanes Deliveries for Change is (67)%. The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) Commercial Airplanes Deliveries for 2019 is 380. The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) Commercial Airplanes Deliveries for 2018 is 806. The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) Commercial Airplanes Deliveries for Change is (53)%. The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) Revenues for 2019 is $7,462. The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) Revenues for 2018 is $16,531. The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) Revenues for Change is (55)%. The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) Revenues for 2019 is $32,255. The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) Revenues for 2018 is $57,499. The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) Revenues for Change is (44)%. The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) (Loss)/Earnings from Operations for 2019 is ($2,844). The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) (Loss)/Earnings from Operations for 2018 is $2,600. The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) (Loss)/Earnings from Operations for Change is NM. The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) (Loss)/Earnings from Operations for 2019 is ($6,657). The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) (Loss)/Earnings from Operations for 2018 is $7,830. The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) (Loss)/Earnings from Operations for Change is NM. The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) Operating Margin for 2019 is (38.1)%. The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) Operating Margin for 2018 is 15.7%. The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) Operating Margin for Change is NM. The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) Operating Margin for 2019 is (20.6)%. The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) Operating Margin for 2018 is 13.6%. The table is present in PDF page : 3 with topic Segment Results associated with the (Dollars in Millions) Operating Margin for Change is NM."

In [25]:
data_json = [{'text':pghs,'meta':{'source':'segment results'}}]

In [26]:
data_json[:10]

[{'meta': {'source': 'segment results'},
  'text': 'Defense, Space & Security       Table 5. Defense, Space &       Security Fourth Quarter   Full Year   The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Revenues for 2019 is $5,962. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Revenues for 2018 is $6,874. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Revenues for Change is (13)%. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Revenues for 2019 is $26,227. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Revenues for 2018 is $26,392. The table is present in PDF page : 4 with topic Defense, Space & Security associated with the (Dollars in Millions) Re

## Store data into document store

In [27]:
import requests
document_store.write_documents(data_json)
requests.get("http://localhost:9200/new/_count").json()

10/27/2021 17:23:14 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.942s]


{'error': {'index': 'new',
  'index_uuid': '_na_',
  'reason': 'no such index [new]',
  'resource.id': 'new',
  'resource.type': 'index_or_alias',
  'root_cause': [{'index': 'new',
    'index_uuid': '_na_',
    'reason': 'no such index [new]',
    'resource.id': 'new',
    'resource.type': 'index_or_alias',
    'type': 'index_not_found_exception'}],
  'type': 'index_not_found_exception'},
 'status': 404}

In [28]:
from haystack.retriever.sparse import ElasticsearchRetriever
retriever = ElasticsearchRetriever(document_store=document_store)

In [29]:
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

10/27/2021 17:23:21 - INFO - farm.utils -   Using device: CPU 
10/27/2021 17:23:21 - INFO - farm.utils -   Number of GPUs: 0
10/27/2021 17:23:21 - INFO - farm.utils -   Distributed Training: False
10/27/2021 17:23:21 - INFO - farm.utils -   Automatic Mixed Precision: None
Some weights of the model checkpoint at deepset/roberta-base-squad2 were not used when initializing RobertaModel: ['qa_outputs.bias', 'qa_outputs.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at deepset/roberta-base-squad2 and ar

In [31]:
from haystack.pipeline import ExtractiveQAPipeline
pipe = ExtractiveQAPipeline(reader, retriever)

In [30]:
# You can configure how many candidates the reader and retriever shall return
# The higher top_k_retriever, the better (but also the slower) your answers. 
prediction = pipe.run(
    query="What is Commercial Airplanes Deliveries for 2019?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
)

10/27/2021 17:24:01 - INFO - elasticsearch -   POST http://localhost:9200/document/_search [status:200 request:0.025s]
Inferencing Samples: 100%|██████████| 1/1 [00:04<00:00,  4.89s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:04<00:00,  4.00s/ Batches]


In [19]:
print_answers(prediction, details="minimal")

[   {   'answer': '380',
        'context': 'ith the (Dollars in Millions) Commercial Airplanes '
                   'Deliveries for 2019 is 380. The table is present in PDF '
                   'page : 3 with topic Segment Results associa'},
    {   'answer': '79',
        'context': 'ith the (Dollars in Millions) Commercial Airplanes '
                   'Deliveries for 2019 is 79. The table is present in PDF '
                   'page : 3 with topic Segment Results associat'},
    {   'answer': '$57,499',
        'context': ' Results associated with the (Dollars in Millions) '
                   'Revenues for 2018 is $57,499. The table is present in PDF '
                   'page : 3 with topic Segment Results assoc'}]


In [32]:
prediction = pipe.run(
    query="What is the earnings from Operations for Defense, Space & Security in 2019?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
)

10/27/2021 17:24:32 - INFO - elasticsearch -   POST http://localhost:9200/document/_search [status:200 request:0.018s]
Inferencing Samples: 100%|██████████| 1/1 [00:03<00:00,  3.99s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:04<00:00,  4.62s/ Batches]


In [33]:
print_answers(prediction, details="minimal")

[   {   'answer': '($2,844',
        'context': 'h the (Dollars in Millions) (Loss)/Earnings from '
                   'Operations for 2019 is ($2,844). The table is present in '
                   'PDF page : 3 with topic Segment Results asso'},
    {   'answer': '$31',
        'context': 'iated with the (Dollars in Millions) Earnings from '
                   'Operations for 2019 is $31. The table is present in PDF '
                   'page : 4 with topic Defense, Space & Securi'},
    {   'answer': '$2,608',
        'context': 'ted with the (Dollars in Millions) Earnings from '
                   'Operations for 2019 is $2,608. The table is present in PDF '
                   'page : 4 with topic Defense, Space & Secur'},
    {   'answer': '($6,657',
        'context': 'h the (Dollars in Millions) (Loss)/Earnings from '
                   'Operations for 2019 is ($6,657). The table is present in '
                   'PDF page : 3 with topic Segment Results asso'},
    {   'answer':

## Referrences:
* https://www.thepythoncode.com/article/extract-pdf-tables-in-python-camelot

* https://www.thepythoncode.com/article/optical-character-recognition-pytesseract-python

* https://www.thepythoncode.com/article/extract-text-from-images-or-scanned-pdf-python

* https://pythonrepo.com/repo/cseas-ocr-table-python-computer-vision#install-requirements