# Different PDF libraries
Example document: [PDF of a Springer paper](https://link.springer.com/content/pdf/10.1007/s12525-020-00445-0.pdf)

## PDFPlumber
The output is readable, but in some cases it is not properly formatted (words run together without spaces). Extracts metadata. Results for ScienceDirect PDFs are poor. [[Documentation](https://github.com/jsvine/pdfplumber)]

In [19]:
import pdfplumber

pdf_path = 'data/test_publication_springer.pdf'
with pdfplumber.open(pdf_path) as pdf_file:
    # first_page = pdf_file.pages[0]  # pages are zero-indexed
    # print(first_page.extract_text())
    meta_data = pdf_file.metadata
    for page in pdf_file.pages:
        print(page.extract_text())

Intern.J.ofResearchinMarketing27(2010)69–82
ContentslistsavailableatScienceDirect
Intern. J. of Research in Marketing
journal homepage: www.elsevier.com/locate/ijresmar
Steering sales reps through cost information: An investigation into the black box of
cognitive references and negotiation behavior
Robert Wilkena,⁎, Markus Cornelißenb,1, Klaus Backhausb,2, Christian Schmitzc,3
aESCPEurope(CampusBerlin),InternationalMarketing,Heubnerweg6,D-14059Berlin,Germany
bUniversityofMünster,InstituteofBusiness-to-BusinessMarketing,AmStadtgraben13-15,D-48143Münster,Germany
cUniversityofSt.Gallen,InstituteofMarketing,Dufourstr.40a,CH-9000St.Gallen,Switzerland
a r t i c l e i n f o a b s t r a c t
Articlehistory: Aspreviousresearchdemonstrates,fewﬁrmsprovidefullpricingauthoritytotheirsalesrepresentatives(in
Firstreceivedin29,September2008andwas thefollowing:salesreps),andthosesalesrepresentativeswhodohavefullpricingauthoritymayoffertoo
underreviewfor5months manypriceconcessionsintheirefforttoclosethe
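The missing-space problem visible in this output can be flagged automatically when comparing libraries. The following is a minimal sketch, not part of the original notebook: `long_token_ratio` and its `max_len` threshold of 20 characters are hypothetical choices, and the heuristic also counts legitimately long tokens such as URLs.

```python
def long_token_ratio(text: str, max_len: int = 20) -> float:
    """Return the share of whitespace-separated tokens longer than max_len.

    A high ratio hints that the extractor glued words together.
    The threshold is an illustrative assumption, not a tuned value.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    long_tokens = [token for token in tokens if len(token) > max_len]
    return len(long_tokens) / len(tokens)


# Two lines taken from the pdfplumber output above
glued_line = 'ContentslistsavailableatScienceDirect'
spaced_line = 'Intern. J. of Research in Marketing'

print(long_token_ratio(glued_line))
print(long_token_ratio(spaced_line))
```

Running the same function over the text returned by each library gives a rough, comparable number instead of a purely visual impression.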

## PDFMiner3
Performs better than pdfplumber; the output is well formatted and readable. [[Documentation](https://pdfminersix.readthedocs.io/en/latest/)]

In [18]:
# Source: 'https://stackoverflow.com/questions/56494070/how-to-use-pdfminer-six-with-python-3'
from pdfminer3.layout import LAParams
from pdfminer3.pdfpage import PDFPage
from pdfminer3.pdfinterp import PDFResourceManager
from pdfminer3.pdfinterp import PDFPageInterpreter
from pdfminer3.converter import TextConverter
from pdfminer3.pdfparser import PDFParser
from pdfminer3.pdfdocument import PDFDocument
import io

# Set up the converter; the extracted text is collected in an in-memory buffer.
resource_manager = PDFResourceManager()
fake_file_handle = io.StringIO()
converter = TextConverter(resource_manager, fake_file_handle, laparams=LAParams())
page_interpreter = PDFPageInterpreter(resource_manager, converter)
pdf_path = 'data/test_publication_springer.pdf'

with open(pdf_path, 'rb') as fh:
    # extract the text page by page
    for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
        page_interpreter.process_page(page)

    # parse the same file handle again to read the document metadata
    pdf_parser = PDFParser(fh)
    doc = PDFDocument(pdf_parser)
    meta_data = doc.info[0]
    text = fake_file_handle.getvalue()

# close open handles
converter.close()
fake_file_handle.close()

print(text)

Electronic Markets (2022) 32:523–545
https://doi.org/10.1007/s12525-020-00445-0

RESEARCH PAPER

Exploring customers’ likeliness to use e-service touchpoints in brick
and mortar retail

Benjamin Barann 1

& Jan H. Betzing 1 & Marco Niemann 1 & Benedikt Hoffmeister 1 & Jörg Becker 1

Received: 24 May 2019 / Accepted: 16 October 2020 /Published online: 20 November 2020
#

The Author(s) 2020

Abstract
E-commerce has embraced the digital transformation and innovated with e-service touchpoints to improve customers’ experi-
ences. Now some traditional, less-digitalized brick and mortar (BaM) retailers are starting to counteract the increasing compe-
tition by adopting digital touchpoints. However, the academic literature offers little in terms of what determines customers’
behavioral intentions toward e-service touchpoints. Therefore, drawing from the dominant design theory, this article first
conceptually adapts selected dominant touchpoints of leading e-commerce solutions to BaM retail. Th

## PyPDF2
Solid results, though the output appears slightly less well formatted than pdfminer3's (for example, missing spaces between author names and misplaced spaces around apostrophes). [[Documentation](https://pypdf2.readthedocs.io/en/latest/)]

In [19]:
from PyPDF2 import PdfReader

pdf_path = 'data/test_publication_springer.pdf'
reader = PdfReader(pdf_path)
for page in reader.pages:
    print(page.extract_text())
meta_data = reader.metadata

RESEARCH PAPER
Benjamin Barann1&Jan H. Betzing1&Marco Niemann1&Benedikt Hoffmeister1&Jörg Becker1
Received: 24 May 2019 / Accepted: 16 October 2020 /Published online: 20 November 2020
#
Abstract
E-commerce has embraced the digital transformation and innovated with e-service touchpoints to improve customers ’experi-
ences. Now some traditional, less-digitalized brick and mortar (BaM) retailers are starting to counteract the increasing compe-
tition by adopting digital touchpoints. However, the academic literature offers little in terms of what determines customers ’
behavioral intentions toward e-service touchpoints. Therefore, drawing from the dominant design theory, this article first
conceptually adapts selected dominant touchpoints of leading e-commerce solutions to BaM retail. Then 250 shoppers are
surveyed regarding the likeliness that they will use the selected touchpoints, followed by an exploratory factor analysis to
determine the touchpoints ’characteristics that lead to the s
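Each library above stores metadata in `meta_data`, but the shapes likely differ. The sketch below assumes (not verified in this notebook) that PyPDF2 keys carry a leading slash such as `/Author` and that pdfminer3 may return values as bytes. `normalize_metadata` is a hypothetical helper that maps both shapes onto plain lower-case keys with string values.

```python
def normalize_metadata(raw) -> dict:
    """Map a library-specific metadata mapping onto lower-case string keys.

    Assumptions for illustration: keys may start with '/', values may be
    bytes, and bytes starting with the UTF-16 big-endian BOM are UTF-16.
    """
    normalized = {}
    for key, value in dict(raw or {}).items():
        name = str(key).lstrip('/').lower()
        if isinstance(value, bytes):
            if value.startswith(b'\xfe\xff'):
                # PDF text strings may be UTF-16BE with a byte order mark
                value = value[2:].decode('utf-16-be', errors='replace')
            else:
                value = value.decode('utf-8', errors='replace')
        normalized[name] = value
    return normalized


# Sample inputs shaped like the assumed library outputs
print(normalize_metadata({'/Author': 'Benjamin Barann'}))
print(normalize_metadata({'Author': b'Benjamin Barann'}))
```

With a common shape, the metadata from different libraries can be compared field by field.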

## Tika
Well formatted after a little cleaning; extracts metadata in a logical structure (e.g. all authors, title, etc.). [[Documentation](https://github.com/chrismattmann/tika-python)], [[Source](https://www.geeksforgeeks.org/parsing-pdfs-in-python-with-tika/)]

In [16]:
# Source: https://www.geeksforgeeks.org/parsing-pdfs-in-python-with-tika/
from tika import parser
from Research_Scraper_Code import utils
import re

pdf_path = 'data/test_publication_springer.pdf'
parse_entire_pdf = parser.from_file(pdf_path, xmlContent=True)
meta_data = parse_entire_pdf['metadata']

text = parse_entire_pdf['content']
# strip the XML/HTML tags from the content
text = re.sub('<[^<]+?>', '', text).strip()

doi_number_regex = re.compile(r'10\.'  # DOI prefix starts with '10.'
                              r'\d{4,9}'  # followed by 4-9 digits
                              r'/[-._;()/:a-zA-Z0-9]+'  # followed by a slash and the suffix characters
                              )

# search for the first DOI in the text; re.search returns None if nothing matches
doi_match = doi_number_regex.search(text)
doi_number = doi_match.group(0) if doi_match else None
#print(f'DOI: {doi_number}')

print(text)

Exploring customers’ likeliness to use e-service touchpoints in brick and mortar retail


RESEARCH PAPER

Benjamin Barann1 &amp; Jan H. Betzing1 &amp; Marco Niemann1 &amp; Benedikt Hoffmeister1 &amp; Jörg Becker1

Received: 24 May 2019 /Accepted: 16 October 2020 /Published online: 20 November 2020
#

Abstract
E-commerce has embraced the digital transformation and innovated with e-service touchpoints to improve customers’ experi-
ences. Now some traditional, less-digitalized brick and mortar (BaM) retailers are starting to counteract the increasing compe-
tition by adopting digital touchpoints. However, the academic literature offers little in terms of what determines customers’
behavioral intentions toward e-service touchpoints. Therefore, drawing from the dominant design theory, this article first
conceptually adapts selected dominant touchpoints of leading e-commerce solutions to BaM retail. Then 250 shoppers are
surveyed regarding the likeliness that they will use the selected touch
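The "little cleaning" mentioned for Tika can be sketched as follows. This is an illustrative helper, not code from the notebook: `clean_extracted_text` is a hypothetical name, and it strips tags, decodes entities such as `&amp;` (still visible in the output above), and joins words that were hyphenated across a line break.

```python
import html
import re


def clean_extracted_text(raw: str) -> str:
    """Clean XML-flavoured extraction output into plain text."""
    # 1. remove tags first, so decoded '&lt;' is never mistaken for a tag
    text = re.sub(r'<[^<]+?>', '', raw)
    # 2. decode HTML entities, e.g. '&amp;' -> '&'
    text = html.unescape(text)
    # 3. join words split by a hyphen at a line end: 'experi-\nences'
    text = re.sub(r'(\w)-\n(\w)', r'\1\2', text)
    # 4. collapse runs of three or more newlines into one blank line
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()


sample = '<p>Barann1 &amp; Betzing1</p>\n\n\n\n<p>customers’ experi-\nences</p>'
print(clean_extracted_text(sample))
```

One known limitation of step 3: a genuine hyphen that happens to fall at a line end (for example in "e-service") is removed as well, so the rule trades a few wrong joins for more readable running text.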

# Download PDFs (from URL/DOI)

In [2]:
import cloudscraper
import requests


def download_PDF(url, filename):
    # headers = {  # todo make logic for this
    #     'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.5 Safari/605.1.15'}
    # r = requests.get(url, headers=headers)

    scraper = cloudscraper.create_scraper(
        browser={
            'custom': 'ScraperBot/1.0',
        }
    )
    try:
        print('we try')
        r = scraper.get(url, allow_redirects=True)
    except requests.exceptions.ConnectionError as e:
        # cancel the download if the connection fails
        print(f'[utils.py: download_pdf] Connection Error: {e}')
        return

    pdf_save_path = 'exports/pdf_downloads/' + filename + '.pdf'
    print(r.status_code)
    if r.status_code == 200:
        with open(pdf_save_path, 'wb') as f:
            f.write(r.content)
            # print green background black font
            print('\033[1;30;42m' + f'PDF downloaded : {filename}.pdf' + '\033[0m')
    else:
        print('Error: ', r.status_code)


url_springer = 'https://link.springer.com/content/pdf/10.1007/s12525-020-00445-0.pdf'
url_sciencedirect = 'https://www.sciencedirect.com/science/article/pii/S0167811609000901/pdfft'  # not working
url_ieee = 'https://ieeexplore.ieee.org/&arnumber=7887648'  # not working
url_bad = 'ftp://ftp.cencenelec.eu/EN/ResearchInnovation/CWA/CWA17514_2020.pdf'

download_PDF(url_springer, 'test_publication_springer')
download_PDF(url_sciencedirect, 'test_publication_sciencedirect')  # PDF corrupt
download_PDF(url_ieee, 'test_publication_ieee')  # PDF corrupt
try:
    download_PDF(url_bad, 'test_publication_bad')  # broken link
except Exception as e:
    print(type(e))

we try
200
PDF downloaded : test_publication_springer.pdf
we try
200
PDF downloaded : test_publication_sciencedirect.pdf
we try
200
PDF downloaded : test_publication_ieee.pdf
we try
<class 'requests.exceptions.InvalidSchema'>
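The run above shows two failure modes that the status code alone does not catch: the ScienceDirect and IEEE requests return 200 although the saved files are corrupt, which suggests the response body is not a PDF, and the `ftp://` link raises `InvalidSchema`. A minimal sketch of two guard checks follows; `has_http_scheme` and `looks_like_pdf` are hypothetical helpers, and the 1024-byte search window is an assumption based on how lenient PDF readers commonly locate the header.

```python
from urllib.parse import urlparse


def has_http_scheme(url: str) -> bool:
    """Return True only for http(s) URLs, which requests can fetch."""
    return urlparse(url).scheme in ('http', 'https')


def looks_like_pdf(content: bytes) -> bool:
    """Return True if the '%PDF-' header appears near the start of the data."""
    return b'%PDF-' in content[:1024]


print(has_http_scheme('ftp://ftp.cencenelec.eu/EN/ResearchInnovation/CWA/CWA17514_2020.pdf'))
print(looks_like_pdf(b'%PDF-1.7\n'))
print(looks_like_pdf(b'<!DOCTYPE html><html>'))
```

In `download_PDF`, the first check could run before the request and the second on `r.content` before the file is written, so that an HTML error or login page is not saved with a `.pdf` extension.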


In [1]:
import requests

# example
url = 'https://link.springer.com/content/pdf/10.1007/s12525-020-00445-0.pdf'
# set a header
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.5 Safari/605.1.15'}

r = requests.get(url, headers=headers)
pdf_save_path = 'exports/pdf_downloads/downloaded_publication.pdf'
if r.status_code == 200:
    with open(pdf_save_path, 'wb') as f:
        f.write(r.content)
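
The cells above download from direct URLs only. Starting from a DOI could look roughly like the sketch below. The helper names are hypothetical, the `https://doi.org/` resolver usually leads to a landing page rather than to the PDF itself, and the Springer pattern is an assumption generalized from the single example URL used in this notebook.

```python
import re
from typing import Optional

# '10.' + 4-9 digits + '/' + suffix characters
DOI_PATTERN = re.compile(r'10\.\d{4,9}/[-._;()/:a-zA-Z0-9]+')


def find_doi(text: str) -> Optional[str]:
    """Return the first DOI found in text, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # drop punctuation that only ends the surrounding sentence
    return match.group(0).rstrip('.,;')


def doi_to_resolver_url(doi: str) -> str:
    """Build the generic resolver URL for a DOI."""
    return f'https://doi.org/{doi}'


def springer_pdf_url(doi: str) -> str:
    """Build a direct PDF URL, assuming the Springer pattern seen above."""
    return f'https://link.springer.com/content/pdf/{doi}.pdf'


first_lines = 'Electronic Markets (2022) 32:523–545\nhttps://doi.org/10.1007/s12525-020-00445-0\n'
doi = find_doi(first_lines)
print(doi)
print(springer_pdf_url(doi))
```

The resulting URL can then be passed to the download code shown earlier.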