#### How to Scrape all PDF files in a Website?
https://www.geeksforgeeks.org/how-to-scrape-all-pdf-files-in-a-website/?ref=rp

#### How to extract text from pdfs in folders with python and save them in dataframe?
https://stackoverflow.com/questions/66224627/how-to-extract-text-from-pdfs-in-folders-with-python-and-save-them-in-dataframe

#### Parsing PDFs in Python with Tika
https://www.geeksforgeeks.org/parsing-pdfs-in-python-with-tika/

##### Prerequisites: Implementing Web Scraping in Python with BeautifulSoup

Web scraping is a method of extracting data from a website so that it can be used elsewhere. Python has several libraries and modules for web scraping. In this article, we’ll learn how to scrape PDF files from a website with BeautifulSoup, one of the most widely used web scraping modules in Python, and the requests module for the GET requests. To get more information about each PDF file, we use the PyPDF2 module.

#### Step 1: Import all the important modules and packages.

In [1]:
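# for fetching the web page and the pdf files
import requests

# for parsing and traversing the html of the web page
from bs4 import BeautifulSoup

# for wrapping downloaded bytes in a file-like object
import io

# for getting information about the pdfs
# (PdfFileReader is the pre-3.0 PyPDF2 name; PyPDF2 3.0 removed it in favour of PdfReader)
from PyPDF2 import PdfFileReader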

#### Step 2: Pass the URL and create an HTML parser with BeautifulSoup.

In [2]:
# website to scrape
url = "https://www.geeksforgeeks.org/how-to-extract-pdf-tables-in-python/"

# fetch the page with a GET request
read = requests.get(url)

# full html content, as bytes
html_content = read.content

# parse the html content
soup = BeautifulSoup(html_content, "html.parser")

In the above code:

- The page being scraped is https://www.geeksforgeeks.org/how-to-extract-pdf-tables-in-python/ (the url variable).
- The requests module makes the GET request.
- read.content holds the raw HTML of the response as bytes; printing it shows the source code of the web page.
- soup holds the parsed HTML and is used to search and traverse it.

#### Step 3: Traverse the links to the PDFs on the website.

In [3]:
# create an empty set to collect the pdf links
list_of_pdf = set()

# access the first p tag in the html
l = soup.find('p')

# access all the anchor tags inside that p tag
p = l.find_all('a')

# iterate through p to get all the href links
for link in p:

    # original html link
    print("links: ", link.get('href'))
    print("\n")

    # swap the extension from .html to .pdf
    # (assumes the href ends in ".html")
    pdf_link = (link.get('href')[:-5]) + ".pdf"

    # converted to .pdf
    print("converted pdf links: ", pdf_link)
    print("\n")

    # add the pdf link to the set
    list_of_pdf.add(pdf_link)

links:  http://www.uncledavesenterprise.com/file/health/Food%20Calories%20List.pdf


converted pdf links:  http://www.uncledavesenterprise.com/file/health/Food%20Calories%20Lis.pdf


links:  https://drive.google.com/file/d/1q4JEtOD0vCNtH0U4kLUHZqDMfThUCu9i/view?usp=sharing


converted pdf links:  https://drive.google.com/file/d/1q4JEtOD0vCNtH0U4kLUHZqDMfThUCu9i/view?usp=sh.pdf


links:  https://practice.geeksforgeeks.org/courses/Python-Foundation?utm_source=geeksforgeeks&utm_medium=article&utm_campaign=GFG_Article_Bottom_Python_Foundation


converted pdf links:  https://practice.geeksforgeeks.org/courses/Python-Foundation?utm_source=geeksforgeeks&utm_medium=article&utm_campaign=GFG_Article_Bottom_Python_Found.pdf


links:  https://practice.geeksforgeeks.org/courses/Data-Structures-With-Python?utm_source=geeksforgeeks&utm_medium=article&utm_campaign=GFG_Article_Bottom_Python_DS


converted pdf links:  https://practice.geeksforgeeks.org/courses/Data-Structures-With-Python?utm_source=geek

In the above code:

- list_of_pdf is an empty set that collects the PDF links from the web page. A set is used because it discards duplicates automatically; a list would also work, with append in place of add, but it would keep repeated links.
- The loop goes through every link and replaces the last five characters of the href with .pdf. This assumes that each href ends in .html and that the PDF has the same name apart from the extension. As the output above shows, the links on this page do not end in .html, so the converted links are mangled (for example, List.pdf becomes Lis.pdf).
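
The output above shows the weakness of slicing five characters off every href: links that do not end in .html come out mangled. Here is a minimal sketch of one alternative, assuming the goal is to keep only links whose URL path already ends in .pdf. The helper name select_pdf_links is hypothetical and not part of the original article; it takes the href strings that link.get('href') returns.

```python
from urllib.parse import urljoin, urlparse


def select_pdf_links(hrefs, base_url):
    """Return the set of absolute URLs whose path ends in .pdf."""
    pdf_links = set()
    for href in hrefs:
        if not href:  # skip anchors that have no href attribute
            continue
        absolute = urljoin(base_url, href)  # resolve relative links
        # Compare the path only, so query strings do not get in the way
        if urlparse(absolute).path.lower().endswith(".pdf"):
            pdf_links.add(absolute)
    return pdf_links


# Two of the hrefs printed above, plus a missing href for illustration
hrefs = [
    "http://www.uncledavesenterprise.com/file/health/Food%20Calories%20List.pdf",
    "https://drive.google.com/file/d/1q4JEtOD0vCNtH0U4kLUHZqDMfThUCu9i/view?usp=sharing",
    None,
]
base = "https://www.geeksforgeeks.org/how-to-extract-pdf-tables-in-python/"
print(select_pdf_links(hrefs, base))
```

In the notebook, the hrefs would come from the anchors, for example select_pdf_links((a.get('href') for a in p), url).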

#### Step 4: Create an info function with the PyPDF2 module to get the required information about each PDF

In [4]:
def info(pdf_path):
 
    # download the pdf file with a GET request
    response = requests.get(pdf_path)

    # response.content is raw bytes; wrap it in a
    # file-like object so PyPDF2 can read it
    with io.BytesIO(response.content) as f:

        # initialize the pdf reader
        pdf = PdfFileReader(f)

        # document metadata and page count
        information = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()
        number_of_pages = pdf.getNumPages()
 
    txt = f"""
    Information about {pdf_path}:
     
    Author: {information.author}
    Creator: {information.creator}
    Producer: {information.producer}
    Subject: {information.subject}
    Title: {information.title}
    Number of pages: {number_of_pages}
    """
    print(txt)
     
    return information

In the above code:

- The info function downloads a PDF and prints its metadata (author, creator, producer, subject, title) and its number of pages.
- io.BytesIO(response.content) wraps the downloaded bytes in an in-memory, file-like object, because PdfFileReader reads from a file-like stream rather than from raw bytes.
- PyPDF2 offers several more functions for accessing other data in a PDF.
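
To make the role of io.BytesIO concrete, here is a minimal sketch that uses made-up bytes rather than a real download. It only shows that the wrapper behaves like a file opened in binary mode, which is what a PDF parser needs.

```python
import io

# Made-up bytes standing in for response.content
payload = b"%PDF-1.4 made-up bytes for illustration"

with io.BytesIO(payload) as stream:
    header = stream.read(5)     # read like a file opened in "rb" mode
    stream.seek(0)              # rewind to the start, as a parser would
    everything = stream.read()  # read the whole buffer again

print(header)
print(everything == payload)
```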

#### Complete Code

In [9]:
import requests
from bs4 import BeautifulSoup
import io
from PyPDF2 import PdfFileReader
 
 
url = "https://www.geeksforgeeks.org/how-to-extract-pdf-tables-in-python/"
read = requests.get(url)
html_content = read.content
soup = BeautifulSoup(html_content, "html.parser")
 
list_of_pdf = set()
l = soup.find('p')
p = l.find_all('a')
 
for link in (p):
    pdf_link = (link.get('href')[:-5]) + ".pdf"
    print(pdf_link)
    list_of_pdf.add(pdf_link)
 
def info(pdf_path):
    response = requests.get(pdf_path)
     
    with io.BytesIO(response.content) as f:
        pdf = PdfFileReader(f)
        information = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()
 
    txt = f"""
    Information about {pdf_path}:
 
    Author: {information.author}
    Creator: {information.creator}
    Producer: {information.producer}
    Subject: {information.subject}
    Title: {information.title}
    Number of pages: {number_of_pages}
    """
    print(txt)
    return information
 
 
for i in list_of_pdf:
    info(i)

http://www.uncledavesenterprise.com/file/health/Food%20Calories%20Lis.pdf
https://drive.google.com/file/d/1q4JEtOD0vCNtH0U4kLUHZqDMfThUCu9i/view?usp=sh.pdf
https://practice.geeksforgeeks.org/courses/Python-Foundation?utm_source=geeksforgeeks&utm_medium=article&utm_campaign=GFG_Article_Bottom_Python_Found.pdf
https://practice.geeksforgeeks.org/courses/Data-Structures-With-Python?utm_source=geeksforgeeks&utm_medium=article&utm_campaign=GFG_Article_Bottom_Pyth.pdf
https://practice.geeksforgeeks.org/courses/machine-learning?utm_source=geeksforgeeks&utm_medium=article&utm_campaign=GFG_Article_Bottom_Pyth.pdf


PdfReadError: EOF marker not found
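
PyPDF2 raises "EOF marker not found" when the bytes it is given do not end with the %%EOF marker that closes a PDF file. That is the likely outcome when a URL returns an HTML page instead of a PDF, as the mangled links above probably do. Here is a minimal sketch of a cheap pre-check, assuming it is enough to look for the %PDF- header and the %%EOF trailer; looks_like_pdf is a hypothetical helper, not a PyPDF2 function.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check on downloaded bytes before parsing them as a PDF."""
    # A PDF starts with a "%PDF-" header; tolerate a little leading junk
    has_header = b"%PDF-" in data[:1024]
    # A PDF ends with an "%%EOF" marker; look only at the tail of the data
    has_trailer = b"%%EOF" in data[-1024:]
    return has_header and has_trailer


print(looks_like_pdf(b"%PDF-1.4\n...\n%%EOF\n"))
print(looks_like_pdf(b"<!DOCTYPE html><html>Not found</html>"))
```

Inside info, calling this on response.content and skipping the URL when it returns False would avoid the traceback for links that are not PDFs.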

#### How to extract text from pdfs in folders with python and save them in dataframe?
https://stackoverflow.com/questions/66224627/how-to-extract-text-from-pdfs-in-folders-with-python-and-save-them-in-dataframe

If you want to find all the PDFs in a directory and its subdirectories, you can use os.walk and glob; see "Recursive sub folder search and return files in a list python". I've gone for a slightly longer form so that beginners can follow what is happening more easily.
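
For comparison, here is a shorter sketch of the same recursive search, assuming pathlib is acceptable in place of os.walk plus glob. find_pdfs is a hypothetical helper, and the temporary folder exists only to make the example self-contained.

```python
import tempfile
from pathlib import Path


def find_pdfs(root):
    """Return all .pdf files under root, including subdirectories, sorted by path."""
    return sorted(Path(root).rglob("*.pdf"))


# Build a throwaway folder tree to search
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "sub").mkdir()
    (Path(root) / "a.pdf").touch()
    (Path(root) / "sub" / "b.pdf").touch()
    (Path(root) / "notes.txt").touch()
    found = [p.relative_to(root).as_posix() for p in find_pdfs(root)]

print(found)
```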

Then, for each file, call Apache Tika and save the result to the next row of the pandas DataFrame.

In [15]:
#!/usr/bin/python3
import os, glob
from tika import parser
from pandas import DataFrame

# Root folder to search (a Windows network share here)
pth = "\\\\FILESERVER\\Clients\\Prescient Administration Clients\\PIM\\Segregated Clients\\Active Clients"

# What file extension to find
ext = "*.pdf"

# Find all the files with that extension under pth
files = []
for dirpath, dirnames, filenames in os.walk(pth):
    files += glob.glob(os.path.join(dirpath, ext))

# Create a pandas DataFrame to hold the filenames and the text
df = DataFrame(columns=("filename", "text"))

# Process each file in turn, parsing with Tika and storing in the DataFrame
for idx, filename in enumerate(files):
    data = parser.from_file(filename)
    text = data["content"]
    df.loc[idx] = [filename, text]

# For debugging, print what we found
print(df)

2022-01-28 14:57:06,087 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar to C:\Users\HILTON~1.NET\AppData\Local\Temp\tika-server.jar.
2022-01-28 14:58:17,104 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar.md5 to C:\Users\HILTON~1.NET\AppData\Local\Temp\tika-server.jar.md5.
2022-01-28 14:58:20,524 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...


FileNotFoundError: [Errno 2] No such file or directory: '\\\\FILESERVER\\Clients\\Prescient Administration Clients\\PIM\\Segregated Clients\\Active Clients\\Alexander Forbes Investments Ltd - Income Provider Fund (AFIINC)\\Legal and Compliance\\FICA Documents\\FICA Pack March 2021\\FICA Pack March 2021\\1. CM1  Forbes Life Limited.pdf'
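
The run above stopped as soon as one file could not be opened, so no DataFrame was produced. Here is a minimal sketch of a more forgiving loop, assuming it is acceptable to record the error and move on. extract_all and its extract argument are hypothetical names; in the notebook, extract would be something like lambda f: parser.from_file(f)["content"]. The read_text function is only a stand-in so that the example runs without Tika.

```python
def extract_all(files, extract):
    """Apply extract(filename) to each file, recording failures instead of raising."""
    rows = []
    for filename in files:
        try:
            text, error = extract(filename), None
        except OSError as exc:  # covers FileNotFoundError and PermissionError
            text, error = None, str(exc)
        rows.append({"filename": filename, "text": text, "error": error})
    return rows


def read_text(filename):
    # Stand-in extractor for the demo; the notebook would call Tika here
    with open(filename, encoding="utf-8") as handle:
        return handle.read()


rows = extract_all(["no-such-file.pdf"], read_text)
print(rows[0]["error"])
```

The resulting list of dicts can be passed to DataFrame(rows) in one go, which also avoids growing the DataFrame one row at a time.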

In [16]:
#!/usr/bin/python3
import os, glob
from tika import parser
from pandas import DataFrame

# Root folder to search (a Windows network share here)
pth = "\\\\FILESERVER\\Clients\\Prescient Administration Clients\\PIM\\Segregated Clients\\Active Clients"

# What file extension to find
ext = "*.pdf"

# Find all the files with that extension under pth
files = []
for dirpath, dirnames, filenames in os.walk(pth):
    files += glob.glob(os.path.join(dirpath, ext))

In [17]:
files

['\\\\FILESERVER\\Clients\\Prescient Administration Clients\\PIM\\Segregated Clients\\Active Clients\\20190903 Fairtree (RISE) Amended Fee Agreement_fully executed..pdf',
 '\\\\FILESERVER\\Clients\\Prescient Administration Clients\\PIM\\Segregated Clients\\Active Clients\\FRB ITF IFM Inc Fund Pres  AMMIPF.pdf',
 '\\\\FILESERVER\\Clients\\Prescient Administration Clients\\PIM\\Segregated Clients\\Active Clients\\prescient-redemptionswitch-application-form.pdf',
 '\\\\FILESERVER\\Clients\\Prescient Administration Clients\\PIM\\Segregated Clients\\Active Clients\\Report Pack.pdf',
 '\\\\FILESERVER\\Clients\\Prescient Administration Clients\\PIM\\Segregated Clients\\Active Clients\\Signatory List - Global Admin.pdf',
 '\\\\FILESERVER\\Clients\\Prescient Administration Clients\\PIM\\Segregated Clients\\Active Clients\\TFG Medical Aid Scheme Reg 30 Lookthrough 30 June 2017.pdf',
 '\\\\FILESERVER\\Clients\\Prescient Administration Clients\\PIM\\Segregated Clients\\Active Clients\\1. DocFox\\D

In [18]:
print(files)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
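
The list is most likely too long to print in full, which is what trips the notebook's output rate limit. Here is a minimal sketch, assuming a count and a short preview are enough for debugging; preview is a hypothetical helper, and the file names are made up.

```python
def preview(items, limit=5):
    """Return a short summary: the item count plus the first few items."""
    lines = [f"{len(items)} items"]
    lines.extend(str(item) for item in items[:limit])
    if len(items) > limit:
        lines.append(f"... and {len(items) - limit} more")
    return "\n".join(lines)


# Made-up file names standing in for the real list
files = [f"file_{i}.pdf" for i in range(1000)]
print(preview(files, limit=3))
```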

