<h1>PDF files access and text extraction/OCR with Apache Tika<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Prerequisites" data-toc-modified-id="Prerequisites-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Prerequisites</a></span><ul class="toc-item"><li><span><a href="#Get-internal-IDs-of-e-periodica-from-DOI-text-file" data-toc-modified-id="Get-internal-IDs-of-e-periodica-from-DOI-text-file-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Get internal IDs of e-periodica from DOI text file</a></span></li></ul></li><li><span><a href="#Download-e-periodica-PDF-files-via-internal-IDs" data-toc-modified-id="Download-e-periodica-PDF-files-via-internal-IDs-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Download e-periodica PDF files via internal IDs</a></span><ul class="toc-item"><li><span><a href="#Download-PDF-files-directly-to-AWS-S3" data-toc-modified-id="Download-PDF-files-directly-to-AWS-S3-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Download PDF files directly to AWS S3</a></span></li><li><span><a href="#Shell-script-for-downloading-PDFs-from-e-periodica-to-AWS-S3" data-toc-modified-id="Shell-script-for-downloading-PDFs-from-e-periodica-to-AWS-S3-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Shell script for downloading PDFs from e-periodica to AWS S3</a></span></li></ul></li><li><span><a href="#OCRizing-e-rara-PDFs" data-toc-modified-id="OCRizing-e-rara-PDFs-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>OCRizing e-rara PDFs</a></span><ul class="toc-item"><li><span><a href="#Transform-PDFs-into-images-with-pdf2image" data-toc-modified-id="Transform-PDFs-into-images-with-pdf2image-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Transform PDFs into images with pdf2image</a></span></li><li><span><a href="#OCRizing-images-with-Tika/Tessaract" data-toc-modified-id="OCRizing-images-with-Tika/Tessaract-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>OCRizing images with Tika/Tessaract</a></span></li><li><span><a href="#OCRizing-e-rara-PDFs-directly-with-Tika/Tessaract" 
data-toc-modified-id="OCRizing-e-rara-PDFs-directly-with-Tika/Tessaract-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>OCRizing e-rara PDFs directly with Tika/Tessaract</a></span></li></ul></li><li><span><a href="#Extract-text-from-e-periodica-PDFs-with-Apache-Tika" data-toc-modified-id="Extract-text-from-e-periodica-PDFs-with-Apache-Tika-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Extract text from e-periodica PDFs with Apache Tika</a></span><ul class="toc-item"><li><span><a href="#Testing" data-toc-modified-id="Testing-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Testing</a></span></li><li><span><a href="#Shell-script-for-extracting-text-from-e-periodica-PDFs-and-save-to-AWS-S3" data-toc-modified-id="Shell-script-for-extracting-text-from-e-periodica-PDFs-and-save-to-AWS-S3-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Shell script for extracting text from e-periodica PDFs and save to AWS S3</a></span></li></ul></li></ul></div>

## Prerequisites

In [2]:
# Load the necessary libraries
import requests                                 # request URLs
from pathlib import Path
import urllib.request                           # open URLs, e.g. PDF files on URLs

!pip install pdfplumber   
import pdfplumber                               # extract the existing text layer from PDFs

from bs4 import BeautifulSoup as soup           # webscrape and parse HTML and XML
!pip install lxml
import lxml                                     # XML parser supported by bs4
                                                # call with soup(markup, 'lxml-xml' OR 'xml')
import os                                       # navigate and manipulate file directories
import time                                     # work with time stamps
import pandas as pd                             # work with dataframes (third-party library)
from IPython.display import IFrame              # embed website views in jupyter notebook
import math                                     # work with mathematical functions
import re                                       # work with regular expressions
print("Successfully imported necessary libraries")

Successfully imported necessary libraries


In [163]:
!pip install awscli
# 'configure aws' via terminal


### Get internal IDs of e-periodica from DOI text file

In [22]:
# Read DOI link source file
source = 'content/berner-zeitschrift-doi.csv'
with open(source, 'r', encoding='utf-8') as f:
    doi = pd.read_csv(f, header=None, names=['raw'])
doi['doi'] = pd.Series(0, index=doi.index, dtype=object)   # placeholder 0; object dtype so strings can be stored
for i in doi.index:
    string = doi.raw[i]
    match = re.search(r'<attr type="DOI">(\S+)</attr>', string)   # extract DOI
    if match:
        doi.loc[i, 'doi'] = match.group(1)   # .loc avoids chained assignment (SettingWithCopyWarning)
doi


Unnamed: 0,raw,doi
0,"Line 65: <attr type=""DOI"">10.5169/...",10.5169/seals-237634
1,"Line 74: <attr type=""DOI"">10.5169/...",10.5169/seals-237635
2,"Line 83: <attr type=""DOI"">10.5169/...",10.5169/seals-237636
3,"Line 92: <attr type=""DOI"">10.5169/...",10.5169/seals-237637
4,"Line 101: <attr type=""DOI"">10.5169...",10.5169/seals-237638
...,...,...
913,"Line 201: <attr type=""DOI"">10.5169...",10.5169/seals-869590
914,"Line 32: <attr type=""DOI"">10.5169/...",10.5169/seals-869568
915,"Line 38: <attr type=""DOI"">10.5169/...",10.5169/seals-869569
916,"Line 44: <attr type=""DOI"">10.5169/...",10.5169/seals-869570


In [40]:
# Get e-periodica ID from DOI link
# Example: doi.org/10.5169/seals-237634 (redirects to https://www.e-periodica.ch/digbib/view?pid=zgh-001:1939:1::293#10)

doi['id_intern'] = pd.Series(0, index=doi.index, dtype=object)   # placeholder 0 until the ID is resolved
for i in doi.index:
    single_doi = doi.doi[i]
    url = 'http://doi.org/' + str(single_doi)
    r = requests.get(url, stream=True)
    out = soup(r.content, 'html.parser')
    string = out.find_all('ul')[3].li.a['href']
    match = re.search(r'/digbib/view\?lang=de&pid=(\S+)', string)   # extract the percent-encoded identifier
    if match:
        doi.loc[i, 'id_intern'] = match.group(1).replace('%3A', ':')   # .loc avoids chained assignment
doi


Unnamed: 0,raw,doi,id_intern
0,"Line 65: <attr type=""DOI"">10.5169/...",10.5169/seals-237634,0
1,"Line 74: <attr type=""DOI"">10.5169/...",10.5169/seals-237635,0
2,"Line 83: <attr type=""DOI"">10.5169/...",10.5169/seals-237636,0
3,"Line 92: <attr type=""DOI"">10.5169/...",10.5169/seals-237637,0
4,"Line 101: <attr type=""DOI"">10.5169...",10.5169/seals-237638,0
...,...,...,...
913,"Line 201: <attr type=""DOI"">10.5169...",10.5169/seals-869590,zgh-002:2019:81::470
914,"Line 32: <attr type=""DOI"">10.5169/...",10.5169/seals-869568,zgh-002:2020:82::294
915,"Line 38: <attr type=""DOI"">10.5169/...",10.5169/seals-869569,zgh-002:2020:82::295
916,"Line 44: <attr type=""DOI"">10.5169/...",10.5169/seals-869570,zgh-002:2020:82::296
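
The identifier can also be read from the query string of the link instead of being matched with a regular expression. The following is a minimal sketch using only the standard library; the helper name `pid_from_url` is hypothetical, and it assumes the link carries the identifier in a `pid` query parameter, as in the example URLs in the cell above.

```python
from urllib.parse import urlparse, parse_qs


def pid_from_url(url):
    """Return the e-periodica internal ID from a view URL, or None if absent.

    Hypothetical helper: assumes the ID sits in the `pid` query parameter.
    parse_qs percent-decodes values, so '%3A' becomes ':' automatically.
    """
    query = parse_qs(urlparse(url).query)
    values = query.get('pid')
    return values[0] if values else None


# Works for the decoded and the percent-encoded spelling alike
print(pid_from_url('https://www.e-periodica.ch/digbib/view?pid=zgh-001:1939:1::293#10'))
print(pid_from_url('/digbib/view?lang=de&pid=zgh-001%3A1939%3A1%3A%3A293'))
```

Because the query string is parsed rather than pattern-matched, the order of the parameters and the `#10` fragment do not matter.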


In [4]:
print(out.title)
out.find_all('meta')[7].attrs # image of first page

<title>E-Periodica - Katharina von Wattenwyl : ein bernischer Spionageprozess aus der Zeit der Hugenottenverfolgungen</title>


{'property': 'og:image',
 'content': 'https://www.e-periodica.ch/iiif/2/e-periodica!zgh!1954_016!zgh-001_1954_016_0004.jpg/full/800,/0/default.jpg'}

In [142]:
fname = 'berner_zs_id_intern.csv'
with open(fname, 'r') as f:
    doi1 = pd.read_csv(f, nrows=354, usecols=['raw', 'doi', 'id_intern'])

fname2 = 'berner_zs_id_intern_2.csv'
with open(fname2, 'r') as f:
    doi2 = pd.read_csv(f, skiprows=[i for i in range(1,709)], nrows=246, usecols=['raw', 'doi', 'id_intern'])   

fname3 = 'berner_zs_id_intern_3.csv'
with open(fname3, 'r') as f:
    doi3 = pd.read_csv(f, nrows=263, skiprows=[i for i in range(1,1201)], usecols=['raw', 'doi', 'id_intern'])

fname4 = 'berner_zs_id_intern_4.csv'
with open(fname4, 'r') as f:
    doi4 = pd.read_csv(f, skiprows=[i for i in range(1,1727)], usecols=['raw', 'doi', 'id_intern'])

doi = pd.concat([doi1, doi2, doi3, doi4], ignore_index=True)

fname = 'berner_zs_id_intern_all.csv'
with open(fname, 'w') as f:
    doi.to_csv(f)

In [3]:
fname = 'berner_zs_id_intern_all.csv'
with open(fname, 'r') as f:
    ep = pd.read_csv(f)
ep

Unnamed: 0.1,Unnamed: 0,raw,doi,id_intern
0,0,"Line 65: <attr type=""DOI"">10.5169/...",10.5169/seals-237634,zgh-001:1939:1::293
1,1,"Line 74: <attr type=""DOI"">10.5169/...",10.5169/seals-237635,zgh-001:1939:1::294
2,2,"Line 83: <attr type=""DOI"">10.5169/...",10.5169/seals-237636,zgh-001:1939:1::295
3,3,"Line 92: <attr type=""DOI"">10.5169/...",10.5169/seals-237637,zgh-001:1939:1::296
4,4,"Line 101: <attr type=""DOI"">10.5169...",10.5169/seals-237638,zgh-001:1939:1::297
...,...,...,...,...
913,913,"Line 201: <attr type=""DOI"">10.5169...",10.5169/seals-869590,zgh-002:2019:81::470
914,914,"Line 32: <attr type=""DOI"">10.5169/...",10.5169/seals-869568,zgh-002:2020:82::294
915,915,"Line 38: <attr type=""DOI"">10.5169/...",10.5169/seals-869569,zgh-002:2020:82::295
916,916,"Line 44: <attr type=""DOI"">10.5169/...",10.5169/seals-869570,zgh-002:2020:82::296


## Download e-periodica PDF files via internal IDs

Example link with internal ID: 
https://www.e-periodica.ch/cntmng?type=pdf&pid=zgh-001:1954:16::255

Example link, URL-encoded (percent-encoded):
https://www.e-periodica.ch/cntmng?pid=zgh-001%3A1954%3A16%3A%3A255
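
In the second link each `:` of the identifier is written as `%3A`. As a small illustration of how the two spellings relate, the sketch below builds the download URL with the standard library. The helper name `pdf_url` is hypothetical; the `type=pdf` parameter is taken from the first example link.

```python
from urllib.parse import urlencode, unquote

BASE_URL = 'https://www.e-periodica.ch/cntmng'


def pdf_url(pid):
    """Build a percent-encoded PDF download URL for an internal ID."""
    return BASE_URL + '?' + urlencode({'type': 'pdf', 'pid': pid})


encoded = pdf_url('zgh-001:1954:16::255')
print(encoded)            # percent-encoded form
print(unquote(encoded))   # decoded form with literal colons
```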

In [157]:
# Testing some download methods

urllib.request.urlretrieve('https://www.e-periodica.ch/cntmng?type=pdf&pid=zgh-001:1954:16::255', 'test.pdf')
# OR
outfile = '{}.pdf'.format('test')
r = urllib.request.urlopen('https://www.e-periodica.ch/cntmng?type=pdf&pid=zgh-001:1954:16::255')
with open(outfile, 'wb') as f:
    f.write(r.read())   # urlopen responses expose .read(), not .content
    
# OR
filename = Path('test.pdf')
url = 'https://www.e-periodica.ch/cntmng?pid=zgh-001:1954:16::255'
response = requests.get(url)
filename.write_bytes(response.content)    # open the file of 'Path' in bytes mode, write data to it, and close the file


In [159]:
# Optimized method for downloading and writing big files

chunk_size = 20000
url = 'https://www.e-periodica.ch/cntmng?pid=zgh-001:1954:16::255'
r = requests.get(url, stream=True)

with open('test.pdf', 'wb') as fd:
    for chunk in r.iter_content(chunk_size):
        fd.write(chunk)
        

### Download PDF files directly to AWS S3

In [36]:
# Test
! curl https://www.e-periodica.ch/cntmng?pid=zgh-001:1954:16::255 | aws s3 cp - s3://bgd-test-content/pdf/test.pdf

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9495k  100 9495k    0     0  10.6M      0 --:--:-- --:--:-- --:--:-- 10.6M


### Shell script for downloading PDFs from e-periodica to AWS S3

~~~
cat downloadpdf.sh

#!/bin/bash
# Download PDF from e-Periodica and transfer to AWS S3

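download_pdf () {
    # Quote the URL and the target so that '?' and other special characters are not interpreted by the shell
    curl "https://www.e-periodica.ch/cntmng?pid=${1}" | aws s3 cp - "s3://bgd-test-content/pdf/${1}.pdf"
}

# Invoke function
download_pdf "$1"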
~~~~

In [44]:
! chmod +x downloadpdf.sh    # Make shell script executable

In [35]:
# Test
! ./downloadpdf.sh 'zgh-001:1954:16::255'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9495k  100 9495k    0     0  13.2M      0 --:--:-- --:--:-- --:--:-- 13.2M


In [37]:
# Test
! sh downloadpdf.sh 'zgh-001:1954:16::255'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9495k  100 9495k    0     0  13.2M      0 --:--:-- --:--:-- --:--:-- 13.2M


In [43]:
# Look into AWS S3
! aws s3 ls bgd-content/

                           PRE bernensia-json-fulltexts/
                           PRE bernensia-json/
                           PRE bernensia-xml/
                           PRE eperiodica-pdf/
2021-09-04 18:18:32     485384 bernensia-docs-json


In [65]:
ep.tail().style

Unnamed: 0.1,Unnamed: 0,raw,doi,id_intern
913,913,Line 201: 10.5169/seals-869590,10.5169/seals-869590,zgh-002:2019:81::470
914,914,Line 32: 10.5169/seals-869568,10.5169/seals-869568,zgh-002:2020:82::294
915,915,Line 38: 10.5169/seals-869569,10.5169/seals-869569,zgh-002:2020:82::295
916,916,Line 44: 10.5169/seals-869570,10.5169/seals-869570,zgh-002:2020:82::296
917,917,Line 73: 10.5169/seals-869571,10.5169/seals-869571,zgh-002:2020:82::303


In [63]:
# Run shell script to download e-periodica PDFs to AWS S3

for i in ep.index[914:919]:
    id_intern = ep.id_intern[i]
    ! sh downloadpdf.sh $id_intern

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
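
In the loop above the ID is interpolated into the shell command without quoting. As a hedged sketch of one way to make this explicit, the command line can be assembled in Python with `shlex.quote`. The function name `build_download_command` is hypothetical; the URL pattern and the bucket path are the ones used in `downloadpdf.sh`.

```python
import shlex


def build_download_command(pid, bucket_prefix='s3://bgd-test-content/pdf'):
    """Return a shell pipeline that streams one PDF from e-periodica to S3.

    shlex.quote protects characters such as '?' from shell interpretation.
    """
    url = 'https://www.e-periodica.ch/cntmng?pid=' + pid
    target = '{}/{}.pdf'.format(bucket_prefix, pid)
    return 'curl {} | aws s3 cp - {}'.format(shlex.quote(url), shlex.quote(target))


print(build_download_command('zgh-001:1954:16::255'))
```

The resulting string could then be passed to `subprocess.run(..., shell=True)`; that step is left out here because it needs configured AWS credentials.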


In [71]:
# Get a list of all down-/uploaded full text files in AWS S3
! aws s3 ls bgd-content/eperiodica-pdf/ --summarize --human-readable > eperiodica_files.txt

In [173]:
# Get a list of all down-/uploaded full text files in AWS S3
! aws s3 ls bgd-content/bernensia-mix-fulltext/mix_erara_bernensia/ > bernensia_files.csv
# Convert the fixed-width columns to CSV via OpenRefine
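
As an alternative to the manual conversion, the plain listing can be parsed directly in Python. This is a minimal sketch, not the method used in this notebook: the function name `parse_s3_listing` is hypothetical, and it assumes the default `aws s3 ls` line layout (date, time, size, name) as seen in the bucket listing further up, without `--human-readable`.

```python
def parse_s3_listing(text):
    """Parse plain `aws s3 ls` output into a list of (size, name) tuples.

    Assumes lines of the form 'DATE TIME SIZE NAME'; 'PRE' prefix lines
    and anything else that does not match are skipped.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 3)   # at most 4 fields; names may contain spaces
        if len(parts) == 4 and parts[2].isdigit():
            rows.append((int(parts[2]), parts[3].rstrip()))
    return rows


sample = """\
                           PRE eperiodica-pdf/
2021-09-04 18:18:32     485384 bernensia-docs-json
"""
print(parse_s3_listing(sample))
```

The returned rows can be handed to `pd.DataFrame(rows, columns=['size', 'file'])` to obtain the same two columns as the CSV route.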

In [186]:
source = 'content/bernensia_files.csv'
with open(source, 'r', encoding='utf-8') as f:
    files = pd.read_csv(f, usecols=[2, 3], names=['size', 'file'], header=0)
files

Unnamed: 0,size,file
0,7202995,10179500.txt
1,1683241,10347968.pdf
2,417835,10381638.txt
3,72208257,10722710.pdf
4,12103,10733431.txt
...,...,...
566,2196124,9888151.pdf
567,1493371,9888167.pdf
568,1352979,9888179.pdf
569,859487,9888191.pdf


## OCRizing e-rara PDFs

PDFs without a text layer -> images -> OCR with Tesseract (via Apache Tika)

### Transform PDFs into images with pdf2image

https://pypi.org/project/pdf2image/

https://pdf2image.readthedocs.io

In [207]:
#!pip install pdf2image
from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError
)

In [None]:
# Better than calling 'convert_from_path' alone: writing the pages to a temporary folder saves system memory
import tempfile
from pdf2image import convert_from_path

with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path('/home/belval/example.pdf', output_folder=path)
    # Do something here

In [228]:
from pdf2image import convert_from_path, convert_from_bytes

# pages = convert_from_bytes(open('/home/belval/example.pdf', 'rb').read())
# OR

convert_from_path(pdf_path='content/raw/fulltext/pdf_erara/12903483.pdf',   # string or a pathlib.Path object
                          dpi=300,
                          fmt="jpeg",   # jpeg, png, tiff and ppm (png = big!)
                          jpegopt={
                            "quality": 100,   # highest
                            "progressive": False,   # https://www.thewebmaster.com/dev/2016/feb/10/how-progressive-jpegs-can-speed-up-your-website/
                            "optimize": True
                            },   # or None
                          poppler_path=r"C:\poppler-0.68.0\bin",
                          first_page=2,    # skip first page (cover page)
                          grayscale=True,
                          output_file="third",   # content/raw/fulltext/pdf_erara/
                          output_folder="content/raw/fulltext/pdf_erara/",
                          paths_only=True,     # outputs path list of generated images
                          hide_annotations=False)


['content/raw/fulltext/pdf_erara/third0001-02.jpg',
 'content/raw/fulltext/pdf_erara/third0001-03.jpg',
 'content/raw/fulltext/pdf_erara/third0001-04.jpg',
 'content/raw/fulltext/pdf_erara/third0001-05.jpg',
 'content/raw/fulltext/pdf_erara/third0001-06.jpg',
 'content/raw/fulltext/pdf_erara/third0001-07.jpg',
 'content/raw/fulltext/pdf_erara/third0001-08.jpg',
 'content/raw/fulltext/pdf_erara/third0001-09.jpg',
 'content/raw/fulltext/pdf_erara/third0001-10.jpg',
 'content/raw/fulltext/pdf_erara/third0001-11.jpg',
 'content/raw/fulltext/pdf_erara/third0001-12.jpg',
 'content/raw/fulltext/pdf_erara/third0001-13.jpg',
 'content/raw/fulltext/pdf_erara/third0001-14.jpg',
 'content/raw/fulltext/pdf_erara/third0001-15.jpg',
 'content/raw/fulltext/pdf_erara/third0001-16.jpg',
 'content/raw/fulltext/pdf_erara/third0001-17.jpg']

### OCRizing images with Tika/Tessaract

Documentation: https://cwiki.apache.org/confluence/display/tika/tikaocr

Default settings of the Tika OCR parser:
~~~
    Tesseract installation path = ""
    Language dictionary = "eng"
    Page Segmentation Mode = "1"
    Minimum file size = 0
    Maximum file size = 2147483647
    Timeout = 120
~~~~

In [230]:
! curl -T content/raw/fulltext/pdf_erara/third0001-08.jpg http://beta.offenedaten.de:9998/tika \
        --header "content-type: image/jpeg" --header "X-Tika-OCRLanguage: deu"

1%)an


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  481k    0     0  100  481k      0   391k  0:00:01  0:00:01 --:--:--  391k
100  481k    0     0  100  481k      0   214k  0:00:02  0:00:02 --:--:--  214k
100  481k    0     0  100  481k      0   147k  0:00:03  0:00:03 --:--:--  147k
100  481k    0     0  100  481k      0   113k  0:00:04  0:00:04 --:--:--  113k
100  481k    0     0  100  481k      0  93747  0:00:05  0:00:05 --:--:-- 93765
100  481k    0     0  100  481k      0  78593  0:00:06  0:00:06 --:--:--     0
100  483k    0  1221  100  481k    171  69164  0:00:07  0:00:07 --:--:--   249
100  483k    0  1221  100  481k    171  69154  0:00:07  0:00:07 --:--:--   314



Qßorben es ben ßerﬁanb nett, bog aIIe auﬂ'eee
erben) a Mäusen (worunter geseblet mieb ‚» alles
maß unter bem 213m6 eines ßeenvfunbeß, aber
’7; "Basen ßernmebwng gemürbiget t'ft) meldn btet
niet): unter Die erlaubten gefest warben, obne aus;
naßm verbauen (enn rollen. ßDarinn boel) nith Igea’
griffen fenn fallen, bie äärudniüde von Denen Im:
im ßanb gangbaren 6ilber:60rtm‚ beten ßauf
bienad) Deftimmt in.

{Damit aber Die ferneee ginmerfung bee Wem:
ten widmen unb beten Snnf invunrern ßanben bez
binbmt, unb biefeö f0 lanbäfdmbliebe ueBeI in feis
nem urfnrung außgetilget werben möge, mitbinbie
{ogennnnten Rippen unb äBinpem Das i1}, alle (Seite
bänblem meIeIJe nerrufene febleebte mangelt in an:
fere Qanbe bringen, unb felbige gegen gute münsen,
ober und) gegen Qöolbmnb 6ilbee=60rten aufmedy
(ein, nnb anä Dem ßnnb sieben ; benne und) folclje
megotiantem Sinbeimnten, (Svenmißionairen, ‚Spe:
bitoren, Subtkute, ßött unb (öelbträgere, auch
anbeee 

### OCRizing e-rara PDFs directly with Tika/Tessaract

In [232]:
# To go with option 1 for OCR'ing PDFs (run OCR against inline images), you need to specify configurations for the PDFParser like so:
! curl -T content/raw/fulltext/pdf_erara/12903483.pdf \
        http://beta.offenedaten.de:9998/rmeta/text \
        --header "X-Tika-PDFextractInlineImages: true" > content/raw/fulltext/pdf_digibern/berner_enzy_bd_01.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
 22 3704k    0     0   22  832k      0  1329k  0:00:02 --:--:--  0:00:02 1326k
100 3704k    0     0  100 3704k      0  2505k  0:00:01  0:00:01 --:--:-- 2505k


In [233]:
# To go with option 2 (render each page and then run OCR on that rendered image), you need to specify the ocr strategy:
! curl -T content/raw/fulltext/pdf_erara/12903483.pdf http://beta.offenedaten.de:9998/tika \
        --header "X-Tika-PDFOcrStrategy: ocr_only" > content/raw/fulltext/pdf_erara/12903483.txt

# Note: These two options are independent.  If you set extractInlineImages to true and select an OcrStrategy that includes
# OCR on the rendered page, Tika will run OCR on the extracted inline images and  the rendered page.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  3 3704k    0     0    3  128k      0  96875  0:00:39  0:00:01  0:00:38 96803
  3 3704k    0     0    3  128k      0  55491  0:01:08  0:00:02  0:01:06 55468
  3 3704k    0     0    3  128k      0  38847  0:01:37  0:00:03  0:01:34 38847
  3 3704k    0     0    3  128k      0  29870  0:02:07  0:00:04  0:02:03 29863
  3 3704k    0     0    3  128k      0  24313  0:02:36  0:00:05  0:02:31 25965
  3 3704k    0     0    3  128k      0  20476  0:03:05  0:00:06  0:02:59     0
  3 3704k    0     0    3  128k      0  17678  0:03:34  0:00:07  0:03:27     0
  3 3704k    0     0    3  128k      0  15548  0:04:04  0:00:08  0:03:56     0
  3 3704k    0     0    3  128k      0  13878  0:04
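
The curl calls above differ only in the request headers they send. As a hedged sketch, the same requests can be prepared from Python with the standard library. The function name `build_tika_request` and the default server address `http://localhost:9998` are assumptions for illustration; the header names are the ones used in the curl commands, and `curl -T` corresponds to an HTTP PUT.

```python
import urllib.request


def build_tika_request(data, server='http://localhost:9998',
                       content_type='application/pdf', ocr_language=None,
                       ocr_strategy=None, extract_inline_images=False):
    """Build (but do not send) a PUT request for a Tika server's /tika endpoint."""
    headers = {'Content-Type': content_type, 'Accept': 'text/plain'}
    if ocr_language:
        headers['X-Tika-OCRLanguage'] = ocr_language        # e.g. 'deu'
    if ocr_strategy:
        headers['X-Tika-PDFOcrStrategy'] = ocr_strategy     # e.g. 'ocr_only'
    if extract_inline_images:
        headers['X-Tika-PDFextractInlineImages'] = 'true'
    return urllib.request.Request(server + '/tika', data=data,
                                  headers=headers, method='PUT')


request = build_tika_request(b'%PDF-1.4 ...', ocr_language='deu', ocr_strategy='ocr_only')
print(request.get_method(), request.full_url)
print(dict(request.header_items()))

# To send it (assumption: a Tika server is listening at the given address):
# with urllib.request.urlopen(request, timeout=300) as response:
#     text = response.read().decode('utf-8')
```

Keeping the request construction separate from sending makes it easy to check which options are active before starting a long OCR run.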

In [198]:
! curl -T content/raw/fulltext/pdf_erara/12903483.pdf http://beta.offenedaten.de:9998/tika | aws s3 cp - s3://bgd-test-content/txt/test2.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
 91 3704k    0     0   91 3392k      0  3957k --:--:-- --:--:-- --:--:-- 3953k
100 3707k    0  2335  100 3704k   1878  2980k  0:00:01  0:00:01 --:--:-- 2984k


In [195]:
# Test
! aws s3 cp s3://bgd-content/bernensia-mix-fulltext/mix_erara_bernensia/10722710.pdf test.pdf
! curl -T test.pdf http://beta.offenedaten.de:9998/tika | aws s3 cp - s3://bgd-test-content/txt/test2.txt


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  1 68.8M    0     0    1  896k      0  2262k  0:00:31 --:--:--  0:00:31 2256k
  3 68.8M    0     0    3 2368k      0  1723k  0:00:40  0:00:01  0:00:39 1723k
  7 68.8M    0     0    7 5312k      0  2236k  0:00:31  0:00:02  0:00:29 2235k
 13 68.8M    0     0   13 9472k      0  2810k  0:00:25  0:00:03  0:00:22 2810k
 18 68.8M    0     0   18 12.8M      0  3005k  0:00:23  0:00:04  0:00:19 3005k
 23 68.8M    0     0   23 16.3M      0  3123k  0:00:22  0:00:05  0:00:17 3191k
 28 68.8M    0     0   28 19.8M      0  3176k  0:00:22  0:00:06  0:00:16 3576k
 33 68.8M    0     0   33 23.3M      0  3250k  0:00:21  0:00:07  0:00:14 3733k
 39 68.8M    0     0   39 27.0M      0  3301k  0:00:21  0:00:08  0:00:13 3632k
 44 68.8M    0     0   44 30.5M      0  3341k  0:00

## Extract text from e-periodica PDFs with Apache Tika

Documentation: https://cwiki.apache.org/confluence/display/TIKA/Home

### Testing

In [None]:
source = 'content/berner-zeitschrift-doi.csv'
with open(source, 'r', encoding='utf-8') as f:
    doi = pd.read_csv(f, header=None, names=['raw'])

In [59]:
name = []
size = []
for f in os.scandir('s3a://bgd-content/eperiodica-pdf'):   # fails: os.scandir only reads local paths, not S3 URIs
    name.append(f.name)
    size.append(f.stat().st_size)
df = pd.DataFrame(list(zip(name, size)), columns=['name', 'size'])

FileNotFoundError: [Errno 2] No such file or directory: 's3a://bgd-content/eperiodica-pdf'

In [95]:
def read_pdf(pdf_path):
    '''
    Extracts the raw text of a PDF file and returns it as a single string.
    Omits the first page of the PDF file, which is a cover sheet and not part of the article's genuine text.
    Parameters:
    pdf_path = The path of the PDF file to be read.
    '''
    texts = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages[1:]:               # start with the second page to skip the first one = cover sheet
            text = page.extract_text()           # extract text from the page object (None if the page has no text)
            if text:
                texts.append(text)
    return '\n'.join(texts)                      # return after the loop so that all pages are included

In [86]:
! aws s3 cp - s3://bgd-test-content/pdf/test.pdf  http://35.181.7.199:28228/tika/text \
        --header "Accept: application/json"

/bin/bash: http://35.181.7.199:28228/tika/text: No such file or directory


In [93]:
! aws s3 cp - s3://bgd-test-content/pdf/test.pdf http://35.181.7.199:28228/tika/text \
        --header "Content-type: application/pdf" "Accept: text/plain" > test.txt


Unknown options: http://35.181.7.199:28228/tika/text,--header,Content-type: application/pdf,Accept: text/plain


In [84]:
! curl -T s3://bgd-test-content/pdf/test.pdf http://35.181.7.199:28228/tika \
        "Content-type: application/pdf" "Accept: text/plain"

curl: Can't open 's3://bgd-test-content/pdf/test.pdf'!
curl: try 'curl --help' or 'curl --manual' for more information
curl: (26) Failed to open/read local data from file/application


In [115]:
#! curl https://www.e-periodica.ch/cntmng?pid=zgh-001:1939:1::293 http://35.181.7.199:28228/tika/text \
        #--header "Content-type: application/pdf" "Accept: text/plain"

In [123]:
! curl -T zgh-002_2017_79__415_d.pdf http://35.181.7.199:28228/tika \
        "Content-type: application/pdf" "Accept: text/plain" > test.txt  #"Accept: application/json"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:10 --:--:--     0^C


In [121]:
! curl -T test.pdf http://beta.offenedaten.de:9998/tika \
        #"Accept: application/json" > test.json   #"Content-type: application/pdf" "Accept: text/plain"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9.9M    0     0  100  9.9M      0  23.5M --:--:-- --:--:-- --:--:-- 23.5M


### Shell script for extracting text from e-periodica PDFs and save to AWS S3
**Shell script for freely accessible Tika Server**
~~~
cat pdftotext.sh

#!/bin/bash
# Get text from PDF and write in TXT files on AWS S3

get_text () {
    aws s3 cp s3://bgd-content/eperiodica-pdf/${1}.pdf test.pdf
    curl -T test.pdf http://beta.offenedaten.de:9998/tika | aws s3 cp - s3://bgd-content/eperiodica-txt/${1}.txt
}

# Invoke function
get_text $1
~~~

**Shell script for remote / own Tika Server**
~~~
cat pdftotext-remote.sh

#!/bin/bash
# Get text from PDF and write in TXT files on AWS S3

get_text () {
    IP=$(curl ipinfo.io/ip)
    aws s3 cp s3://bgd-content/eperiodica-pdf/${1}.pdf test.pdf
    curl -T test.pdf http://${IP}:28228/tika --header "Accept: text/plain" | \
                aws s3 cp - s3://bgd-content/eperiodica-txt/${1}.txt
}

# Invoke function
get_text $1
~~~

In [116]:
# Test for pdftotext.sh
! aws s3 cp s3://bgd-content/eperiodica-pdf/zgh-001:1939:1::293.pdf test.pdf
! curl -T test.pdf http://beta.offenedaten.de:9998/tika | aws s3 cp - s3://bgd-test-content/txt/test.txt

download: s3://bgd-content/eperiodica-pdf/zgh-001:1939:1::293.pdf to ./test.pdf
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9.9M    0 46226  100  9.9M  87218  18.7M --:--:-- --:--:-- --:--:-- 18.8M


In [None]:
# Test for pdftotext-remote.sh
! aws s3 cp s3://bgd-content/eperiodica-pdf/zgh-001:1939:1::353.pdf test.pdf
! curl -T test.pdf http://13.36.174.150:28228/tika --header "Accept: text/plain" > test.txt

In [124]:
# Make shell script executable
! chmod +x pdftotext.sh    

In [125]:
# Test
! sh pdftotext.sh zgh-002:2019:81::470

download: s3://bgd-content/eperiodica-pdf/zgh-002:2019:81::470.pdf to ./test.pdf
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 5624k    0 27445  100 5598k  65815  13.1M --:--:-- --:--:-- --:--:-- 13.1M


In [127]:
# Run shell script to extract text from PDFs

for i in ep.index:
    id_intern = ep.id_intern[i]
    ! sh pdftotext.sh $id_intern
    #! sh pdftotext-remote.sh $id_intern     # remote script

download: s3://bgd-content/eperiodica-pdf/zgh-001:1939:1::293.pdf to ./test.pdf
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9.9M    0 46226  100  9.9M  93574  20.1M --:--:-- --:--:-- --:--:-- 20.1M
download: s3://bgd-content/eperiodica-pdf/zgh-001:1939:1::294.pdf to ./test.pdf
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3558k    0 13143  100 3545k  52783  13.9M --:--:-- --:--:-- --:--:-- 13.9M
download: s3://bgd-content/eperiodica-pdf/zgh-001:1939:1::295.pdf to ./test.pdf
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 11.0M    0 51866  100 11.0M  72036  15.3M --:--:-- --:--:-- --:--:-- 15.3M
download: s3://bgd-content/eperiodica-pdf/zgh-001

In [34]:
! curl -T content/pdf/zgh-002_2017_79__415_d.pdf http://localhost:9998/tika/text \
        "Content-type: application/pdf" "Accept: text/plain"

{"Author":"Adam, Tina","Content-Type":"application/pdf","Creation-Date":"2021-07-27T13:39:11Z","X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"],"X-TIKA:content":"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nArbeitskonflikte in Berner Haushalten : die Justizpraxis der Reformationskammer 1781-1797\n\n\nArbeitskonflikte in Berner Haushalten : die\nJustizpraxis der Reformationskammer 1781-\n1797\n\nAutor(en): Adam, Tina\n\nObjekttyp: Article\n\nZeitschrift: Berner Zeitschrift für Geschichte\n\nBand (Jahr): 79 (2017)\n\nHeft 4\n\nPersistenter Link: http://doi.org/10.5169/seals-738148\n\nPDF erstellt am: 27.07.2021\n\nNutzungsbedingungen\nDie ETH-Bibliothek ist Anbieterin der digitalisierten Zeitschriften. Sie besitzt keine Urheberrechte an\nden Inhalten der Zeitschriften. Die Rechte liegen in der Regel bei den Herausgebern.\nDie auf der Plattform e-periodica veröffentlichten Dokumente stehen für nicht-kommerzielle Zw

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 15.8M    0     0  100 15.8M      0  75.8M --:--:-- --:--:-- --:--:-- 75.4M
100 15.9M    0 75968  100 15.8M   289k  61.8M --:--:-- --:--:-- --:--:-- 62.1M
curl: (3) URL using bad/illegal format or missing URL
curl: (3) URL using bad/illegal format or missing URL


In [35]:
! curl -T content/pdf/zgh-002_2017_79__415_d.pdf http://localhost:9998/tika/text \
        --header "Accept: application/json"

{"Author":"Adam, Tina","Content-Type":"application/pdf","Creation-Date":"2021-07-27T13:39:11Z","X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"],"X-TIKA:content":"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nArbeitskonflikte in Berner Haushalten : die Justizpraxis der Reformationskammer 1781-1797\n\n\nArbeitskonflikte in Berner Haushalten : die\nJustizpraxis der Reformationskammer 1781-\n1797\n\nAutor(en): Adam, Tina\n\nObjekttyp: Article\n\nZeitschrift: Berner Zeitschrift für Geschichte\n\nBand (Jahr): 79 (2017)\n\nHeft 4\n\nPersistenter Link: http://doi.org/10.5169/seals-738148\n\nPDF erstellt am: 27.07.2021\n\nNutzungsbedingungen\nDie ETH-Bibliothek ist Anbieterin der digitalisierten Zeitschriften. Sie besitzt keine Urheberrechte an\nden Inhalten der Zeitschriften. Die Rechte liegen in der Regel bei den Herausgebern.\nDie auf der Plattform e-periodica veröffentlichten Dokumente stehen für nicht-kommerzielle Zw

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 15.8M    0     0  100 15.8M      0  74.7M --:--:-- --:--:-- --:--:-- 74.7M
100 15.9M    0 75968  100 15.8M   309k  66.0M --:--:-- --:--:-- --:--:-- 66.3M
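
The JSON response shown above carries the extracted text in the `X-TIKA:content` field, next to metadata such as `Author`. A minimal sketch for pulling out the text follows; the function name `content_from_tika_json` is hypothetical, and the sample payload is a shortened stand-in for the real response.

```python
import json


def content_from_tika_json(payload):
    """Return the extracted text from a Tika JSON response.

    Accepts a single metadata object (as printed above) or a list of such
    objects, and reads the 'X-TIKA:content' field of the first one.
    """
    parsed = json.loads(payload)
    record = parsed[0] if isinstance(parsed, list) else parsed
    return (record.get('X-TIKA:content') or '').strip()


# Shortened sample payload for illustration only
sample = ('{"Author": "Adam, Tina", "Content-Type": "application/pdf", '
          '"X-TIKA:content": "\\n\\n\\nArbeitskonflikte in Berner Haushalten\\n"}')
print(content_from_tika_json(sample))
```

Stripping the result removes the long run of leading newlines that Tika emits before the first line of text.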
