### From PDF

#### Introduction to PDF Data Extraction
- Understanding the structure of PDFs
- Overview of Python libraries for PDF extraction (PyPDF2, pdfplumber, tabula-py)

- Hands-on Activities: 
    - Identifying the structure of different PDFs
    - Setting up the Python environment for PDF data extraction

**1. Understanding PDF Structures**

PDF files can have different internal structures:

| PDF Type        | Characteristics                                        | Example Use                      |
| --------------- | ------------------------------------------------------ | -------------------------------- |
| **Text-based**  | Contains actual digital text (selectable & searchable) | Bank statements, invoices        |
| **Image-based** | Scanned images or photos                               | Handwritten notes, scanned forms |
| **Mixed PDFs**  | Contains both text and scanned images                  | Annotated or signed documents    |

Understanding the structure is crucial before deciding which tool to use (e.g., `PyMuPDF`, `pdfplumber`, `Tesseract`).
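A rough way to tell the three types apart is to check whether a page has a usable text layer and whether it embeds images. The sketch below is a heuristic, not a definitive test: `classify_page`, `classify_pdf`, and the `min_chars` threshold are illustrative names and values that do not appear in the course material, and a text-based PDF that merely contains a logo would be labelled "mixed".

```python
def classify_page(text, image_count, min_chars=20):
    """Label a page as 'text-based', 'image-based', 'mixed', or 'empty'.

    text        -- string returned by a text extractor (may be None or "")
    image_count -- number of embedded images found on the page
    min_chars   -- assumed threshold below which the text layer is ignored
    """
    has_text = len((text or "").strip()) >= min_chars
    has_images = image_count > 0
    if has_text and has_images:
        return "mixed"
    if has_text:
        return "text-based"
    if has_images:
        return "image-based"
    return "empty"


def classify_pdf(path):
    """Classify every page of a PDF (requires pdfplumber)."""
    import pdfplumber  # imported here so classify_page works without it

    with pdfplumber.open(path) as pdf:
        return [classify_page(page.extract_text(), len(page.images))
                for page in pdf.pages]
```

Calling `classify_pdf("data_pdf_1.pdf")` would return one label per page, which can then guide the choice between direct extraction and OCR.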

**2. Setting Up Python Environment**

To extract data, you'll need the right libraries installed. Use this command to set them up:

```bash
pip install pdfplumber pytesseract PyMuPDF opencv-python pillow
```

Also install **Tesseract-OCR** (needed for image-based PDFs):

* Windows: [Download installer](https://github.com/tesseract-ocr/tesseract)
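After installing, it can help to confirm that Python can actually locate the Tesseract executable. The check below is a minimal sketch using only the standard library; `find_tesseract` is an illustrative helper name, and it assumes the executable is called `tesseract` and is expected on `PATH`.

```python
import shutil


def find_tesseract(command="tesseract"):
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(command)


path = find_tesseract()
if path is None:
    # pytesseract can still be pointed at the executable explicitly
    print("Tesseract not found on PATH; set pytesseract.pytesseract.tesseract_cmd instead")
else:
    print("Tesseract found at:", path)
```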

**3. Python Tools & When to Use**

| Tool                   | Best for                                         | Usage                              |
| ---------------------- | ------------------------------------------------ | ---------------------------------- |
| `pdfplumber`           | Extracting text, tables from **text-based PDFs** | High accuracy for structured text  |
| `PyMuPDF` (fitz)       | Text & layout data; images                       | Versatile for both text and layout |
| `pytesseract + OpenCV` | **OCR on image-based PDFs**                      | Converts images to text            |

#### Using `pdfplumber`
**`pdfplumber`** is a Python library for **extracting text, tables, and metadata** from PDF files. Unlike basic text extractors, `pdfplumber` gives **fine-grained access** to PDF layout elements such as individual characters, lines, words, and table structures—making it ideal for **structured data extraction**, especially from PDFs exported from spreadsheets or forms.

**Key Features:**

* Extract full text or individual words with coordinates
* Extract structured tables from PDFs
* Access layout metadata (bounding boxes, fonts)
* Crop and visually render pages for inspection
* Works best with machine-generated PDFs, such as exports from Excel or form tools; scanned pages have no text layer and need OCR first
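To make the "words with coordinates" feature concrete, the sketch below filters word dictionaries by a bounding box. It assumes each word has the `x0`, `x1`, `top`, and `bottom` keys that `page.extract_words()` produces; the helper `words_in_bbox` and the sample coordinates are illustrative only.

```python
def words_in_bbox(words, bbox):
    """Keep the words that lie fully inside bbox = (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    return [w for w in words
            if w["x0"] >= x0 and w["x1"] <= x1
            and w["top"] >= top and w["bottom"] <= bottom]


# Synthetic words in the shape returned by page.extract_words()
sample_words = [
    {"text": "COID", "x0": 50.0, "x1": 80.0, "top": 40.0, "bottom": 52.0},
    {"text": "10688", "x0": 50.0, "x1": 85.0, "top": 60.0, "bottom": 72.0},
    {"text": "Footer", "x0": 50.0, "x1": 90.0, "top": 780.0, "bottom": 792.0},
]

# Drop everything below y = 700, e.g. a page footer
body = words_in_bbox(sample_words, (0, 0, 600, 700))
print([w["text"] for w in body])  # ['COID', '10688']
```

On a real page, `page.crop(bbox)` restricts extraction to a region in a similar way; the manual filter is mainly useful when the words have already been extracted.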

Install it with:

```bash
pip install pdfplumber
```

<table style="width: 80%; border-collapse: collapse; border: 1px solid #ccc; text-align: left;margin-left: 0;">
  <thead>
    <tr style="background-color: #050A30; color: white;">
      <th>Function / Class</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code>pdfplumber.open(path)</code></td>
      <td>Opens the PDF file at the given path and returns a <code>PDF</code> object.</td>
    </tr>
    <tr>
      <td><code>pdf.pages</code></td>
      <td>A list of <code>Page</code> objects, one for each page in the PDF.</td>
    </tr>
    <tr>
      <td><code>page.extract_text()</code></td>
      <td>Extracts all text from the page as a string.</td>
    </tr>
    <tr>
      <td><code>page.extract_words()</code></td>
      <td>Extracts a list of words with coordinates, useful for detailed parsing.</td>
    </tr>
    <tr>
      <td><code>page.extract_table()</code></td>
      <td>Extracts the largest table on the page as a list of lists; returns <code>None</code> if no table is found.</td>
    </tr>
    <tr>
      <td><code>page.extract_tables()</code></td>
      <td>Extracts all tables from the page, each as a list of lists.</td>
    </tr>
    <tr>
      <td><code>pdf.pages[i]</code></td>
      <td>Accesses the <i>i-th</i> page of the PDF as a <code>Page</code> object.</td>
    </tr>
  </tbody>
</table>
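The calls in the table can be combined as sketched below. `collect_lines` and `read_pdf_lines` are illustrative helper names, not part of pdfplumber; the sketch assumes only that each page object offers `extract_text()`, and it opens the file in a `with` block so the handle is closed afterwards.

```python
def collect_lines(pages):
    """Return all non-empty text lines from an iterable of page-like objects.

    Each page only needs an extract_text() method, so pdfplumber Page objects work.
    """
    lines = []
    for page in pages:
        text = page.extract_text() or ""  # guard against pages with no text layer
        lines.extend(line for line in text.split("\n") if line.strip())
    return lines


def read_pdf_lines(path):
    """Open the PDF with a context manager so the file handle is always closed."""
    import pdfplumber  # imported here so collect_lines can be used without it

    with pdfplumber.open(path) as pdf:
        return collect_lines(pdf.pages)
```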


In [None]:
import pdfplumber
import numpy as np   # needed later for np.nan and array reshaping
import pandas as pd
import re
import os
os.chdir(r"C:\Users\vaide\OneDrive - knowledgecorner.in\Course Material\Clients\Virtua Search\Vituare-Research\Datasets")

----

###### Ex. Extracting a well-structured table from a PDF

In [None]:
pdf = pdfplumber.open(r"data_pdf_1.pdf")
for page in pdf.pages :
    print(page.extract_table(), "\n -------------- \n")

In [None]:
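# Stack the table found on each page into a single DataFrame
df = pd.DataFrame()
for page in pdf.pages :
    df = pd.concat((df, pd.DataFrame(page.extract_table())), ignore_index=True)
# Promote the first row to column headers, then drop it from the data
df.columns = df.iloc[0]
df = df.drop(index = 0).reset_index(drop= True)
df.head()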
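The loop above assumes every page yields a table and that the header appears only once. A more defensive variant might look like the sketch below, which skips pages where `extract_table()` returns `None` and drops header rows repeated on later pages. `merge_page_tables` is an illustrative helper name, and whether the header repeats in `data_pdf_1.pdf` is an assumption, not something the material confirms.

```python
def merge_page_tables(tables):
    """Merge per-page tables (lists of rows) into (header, rows).

    tables -- one entry per page, as returned by page.extract_table();
              None entries (pages without a table) are skipped.
    The first row of the first table is taken as the header, and any later
    row identical to it is treated as a repeated header and dropped.
    """
    header, rows = None, []
    for table in tables:
        if not table:
            continue
        for row in table:
            if header is None:
                header = row
            elif row != header:
                rows.append(row)
    return header, rows


# Synthetic example: two pages with a repeated header and one page without a table
pages = [[["A", "B"], ["1", "2"]], None, [["A", "B"], ["3", "4"]]]
header, rows = merge_page_tables(pages)
print(header, rows)  # ['A', 'B'] [['1', '2'], ['3', '4']]
```

With pdfplumber this could be called as `merge_page_tables(page.extract_table() for page in pdf.pages)`, followed by `pd.DataFrame(rows, columns=header)`.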

----

###### Ex. Reading table data from a PDF as text

In [None]:
pdf = pdfplumber.open(r"data_pdf_2.pdf")
for page in pdf.pages :
    print(page.extract_text(), "\n -------------- \n")

In [None]:
# Using basic list and str handling
pdf = pdfplumber.open(r"data_pdf_2.pdf")
lines = []
for page in pdf.pages :
    lines.extend(page.extract_text().split("\n"))
header = lines[0].split()
# Keep only data rows, which start with a 5-digit COID
data = [line for line in lines[1:] if re.match(r"\d{5}", line.strip())]

def clean_data(string) :
    # Positional split: assumes a three-word company name and a two-word period name
    parts = string.split()
    return [parts[0], " ".join(parts[1:4]), parts[4], " ".join(parts[5:7]), *parts[7:]]

df1 = pd.DataFrame(map(clean_data, data), columns = header)

In [None]:
# Using regular expression
pdf = pdfplumber.open(r"data_pdf_2.pdf")
lines = []
for page in pdf.pages :
    lines.extend(page.extract_text().split("\n"))
header = lines[0].split()
data = [line for line in lines[1:] if re.match(r"\d{5}", line.strip())]

def clean_data(string) :
    pattern = r'^(\d+)\s+(.+?)\s+([A-Za-z]+)\s+(Q\d\s+\d{2}|FY\s+\d{2})\s+(\d{2}-\d{2}-\d{4})\s+(\d{2}-\d{2}-\d{4})\s+(\d{2}-\d{2}-\d{4})$'
    result = re.match(pattern, string)
    return result.groups() if result else [np.nan] * 7

df1 = pd.DataFrame(map(clean_data, data), columns = header)
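The same pattern can be written with named groups, which makes each column explicit and returns a dictionary per row. This is a sketch of an alternative formulation, not a replacement for the cell above: `ROW_PATTERN` and `parse_row` are illustrative names, and the sample line is an assumed reconstruction of one row of the table.

```python
import re

# Same structure as the pattern above, with one named group per column
ROW_PATTERN = re.compile(
    r"^(?P<COID>\d+)\s+"
    r"(?P<CoName>.+?)\s+"
    r"(?P<Ticker>[A-Za-z]+)\s+"
    r"(?P<PeriodName>Q\d\s+\d{2}|FY\s+\d{2})\s+"
    r"(?P<PeriodEndDate>\d{2}-\d{2}-\d{4})\s+"
    r"(?P<FirstFillingDate>\d{2}-\d{2}-\d{4})\s+"
    r"(?P<LatestFillingDate>\d{2}-\d{2}-\d{4})$"
)


def parse_row(line):
    """Return a dict of column values, or None if the line does not match."""
    match = ROW_PATTERN.match(line.strip())
    return match.groupdict() if match else None


# Assumed sample row; real lines come from page.extract_text()
sample = "10688 Meta Platforms, Inc. META Q1 07 31-03-2007 15-04-2007 15-04-2007"
row = parse_row(sample)
print(row["CoName"], "|", row["PeriodName"])  # Meta Platforms, Inc. | Q1 07
```

The lazy `.+?` for the company name is what lets names with any number of words match, which the positional split in the earlier cell cannot handle.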

In [None]:
df1

In [None]:
df.equals(df1)

----

### From Image

###### Ex. Image to text

In [None]:
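# Using pytesseract
import pytesseract
from pytesseract import image_to_string

# Path to the Tesseract executable (uncomment and adjust if it is not on PATH)
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# --oem 3: default OCR engine; --psm 6: treat the image as a single uniform block of text
text = image_to_string("data.png", lang="eng", config = r'--oem 3 --psm 6')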
lines = text.split("\n")

header = lines[0].split()
data = [line for line in lines[1:] if re.match(r"\d{5}", line.strip())]

def clean_data(string) :
    pattern = r'^(\d+)\s+(.+?)\s+([A-Za-z]+)\s+(Q\d\d{2}|FY\s+\d{2})\s+(\d{2}-\d{2}-\d{4})\s+(\d{2}-\d{2}-\d{4})\s+(\d{2}-\d{2}-\d{4})$'
    result = re.match(pattern, string)
    return result.groups() if result else [np.nan] * 7

df2 = pd.DataFrame(map(clean_data, data), columns = header).dropna()
df2
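OCR quality on tabular images often depends on the page segmentation mode passed in `config`. The sketch below builds the config string from named options so different modes can be compared; `tesseract_config` and `PSM_NOTES` are illustrative names, and only a few of Tesseract's modes are listed.

```python
# A few of Tesseract's page segmentation modes that are worth trying on tables
PSM_NOTES = {
    3: "fully automatic page segmentation (Tesseract's default)",
    4: "single column of text of variable sizes",
    6: "single uniform block of text",
    11: "sparse text, in no particular order",
}


def tesseract_config(psm=6, oem=3):
    """Build the config string passed to pytesseract.image_to_string."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    return f"--oem {oem} --psm {psm}"


for mode, note in PSM_NOTES.items():
    print(tesseract_config(psm=mode), "->", note)
```

Each string can be passed as `config=` to `image_to_string` to see which mode keeps the table rows intact for a given image.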

In [None]:
df.drop(columns=["PeriodName"]).iloc[:22].equals(df2.drop(columns=["PeriodName"]))

In [None]:
from easyocr import Reader

reader = Reader(['en'])
text = reader.readtext("data.png", detail=0)
print(text)

In [None]:
data = np.append(text[3:], np.ones(5))
df = pd.DataFrame(np.reshape(data, (24, 6)), columns=['A', 'Ticker',  "PeriodName", "PeriodEndDate", "FirstFillingDate", "LatestFillingDate"])
df = df.iloc[:-2]
df.loc[len(df)] = ['10688 Meta Platforms, Inc.', 'META', '03 11', '30-09-2011', '15-10-2011', '15-10-201']
# Split on the first space only; splitting on r"\d " would consume the last digit of COID
df[['COID', 'CoName']] = df["A"].str.split(" ", n=1, expand=True)
df
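Padding the token list so that it reshapes cleanly can hide missing values. An alternative sketch is to chunk the flat EasyOCR token list into fixed-width rows and report whatever is left over, so incomplete rows are visible instead of silently filled. `chunk_rows` is an illustrative helper name, and the tokens below are a short sample in the shape EasyOCR returns with `detail=0`.

```python
def chunk_rows(tokens, width):
    """Split a flat token list into rows of `width` items.

    Returns (rows, leftover): the complete rows plus any trailing tokens that
    do not fill a whole row.
    """
    full = len(tokens) - len(tokens) % width
    rows = [tokens[i:i + width] for i in range(0, full, width)]
    return rows, tokens[full:]


tokens = [
    "10688 Meta Platforms, Inc.", "META", "Q1 07", "31-03-2007", "15-04-2007", "15-04-2007",
    "10688 Meta Platforms, Inc.", "META", "Q2 07", "30-06-2007", "15-07-2007", "15-07-2007",
    "10688 Meta Platforms, Inc.",  # incomplete trailing row
]
rows, leftover = chunk_rows(tokens, 6)
print(len(rows), leftover)  # 2 ['10688 Meta Platforms, Inc.']
```

The complete rows can go straight into `pd.DataFrame(rows, columns=[...])`, while the leftover tokens show exactly which row needs manual repair.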

In [None]:
# Using Keras-OCR
# pip install tensorflow
# pip install keras-ocr 
'''
Check Compatible NumPy Version for TensorFlow

TensorFlow Version	Compatible NumPy Versions
TF 2.15+    	NumPy ≥ 1.20.0, ≤ 1.26
TF 2.11–2.14  	NumPy ≥ 1.20.0, ≤ 1.24
TF 2.10	        NumPy ≥ 1.20.0, ≤ 1.23
TF 2.6–2.9	    NumPy ≥ 1.19.0, ≤ 1.22

keras-ocr uses TensorFlow under the hood. So match NumPy accordingly.

pip install numpy==1.23.5
pip install tensorflow==2.10.0
pip install keras==2.10.0
pip install keras-ocr

'''

import keras_ocr
import numpy as np
import pandas as pd

# Pipeline
pipeline = keras_ocr.pipeline.Pipeline()


# Read image
image = keras_ocr.tools.read("data2.png")
prediction_groups = pipeline.recognize([image])


# Extract text only
text = [word for word, box in prediction_groups[0]]
print(text)
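Each keras-ocr prediction is a `(word, box)` pair, where the box holds four corner points, and the words are not guaranteed to come back in reading order. The sketch below groups words into rows by vertical position and sorts each row left to right. `reading_order` and the `row_tolerance` default are illustrative assumptions, and the boxes in the example are synthetic.

```python
def reading_order(predictions, row_tolerance=10):
    """Arrange (word, box) pairs into rows, top to bottom and left to right.

    box is a sequence of four (x, y) corner points. A word starts a new row when
    its vertical centre is more than row_tolerance pixels below the first word
    of the current row.
    """
    def centre(box):
        xs = [point[0] for point in box]
        ys = [point[1] for point in box]
        return sum(xs) / len(xs), sum(ys) / len(ys)

    # Sort all words by vertical centre first
    items = sorted(((centre(box), word) for word, box in predictions),
                   key=lambda item: item[0][1])

    rows, row_y = [], None
    for (x, y), word in items:
        if row_y is None or abs(y - row_y) > row_tolerance:
            rows.append([])
            row_y = y
        rows[-1].append((x, word))

    # Within each row, order words by horizontal centre
    return [[word for _, word in sorted(row)] for row in rows]


sample = [
    ("meta", [(200, 10), (260, 10), (260, 30), (200, 30)]),
    ("10688", [(10, 12), (60, 12), (60, 32), (10, 32)]),
    ("q1", [(10, 50), (40, 50), (40, 70), (10, 70)]),
]
print(reading_order(sample))  # [['10688', 'meta'], ['q1']]
```

With real output, `reading_order(prediction_groups[0])` would give one list of words per detected text row.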

### Scanned PDF to text 

In [33]:
import numpy as np
import pandas as pd

from pdf2image import convert_from_path

pdf_path = "data_pdf_4.pdf"
images = convert_from_path(pdf_path, dpi = 300)
images

[<PIL.PpmImagePlugin.PpmImageFile image mode=RGB size=3300x2550>]

In [None]:
import pytesseract

print(pytesseract.image_to_string(images[0]))
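The cell above reads only the first page. For multi-page scans, the same call can be applied to every rendered image, as in the sketch below. `ocr_pages` and `ocr_scanned_pdf` are illustrative helper names; the second function assumes `pdf2image` (with Poppler installed) and `pytesseract` are available.

```python
def ocr_pages(images, recognize):
    """Apply `recognize` to every page image and return one text string per page."""
    return [recognize(image) for image in images]


def ocr_scanned_pdf(pdf_path, dpi=300):
    """Render each PDF page to an image and OCR it with Tesseract."""
    # Imported here so ocr_pages can be reused with any OCR function
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path, dpi=dpi)
    return ocr_pages(images, pytesseract.image_to_string)
```

Because the OCR function is passed in as an argument, `ocr_pages` can be reused with a different engine without changing the loop.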

In [39]:
from easyocr import Reader
reader = Reader(["en"])

Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.


In [42]:
images[0].save("temp.png", "PNG")
print(reader.readtext("temp.png", detail=0))



['COID', 'CoName', 'Ticker PeriodName', 'PeriodEndDate FirstFillingDate LatestFillingDate', '10688 Meta Platforms, Inc:', 'META', 'Q1 07', '31-03-2007', '15-04-2007', '15-04-2007', '10688 Meta Platforms, Inc:', 'META', 'Q2 07', '30-06-2007', '15-07-2007', '15-07-2007', '10688 Meta Platforms, Inc:', 'META', 'Q3 07', '30-09-2007', '15-10-2007', '15-10-2007', '10688 Meta Platforms, Inc:', 'META', 'Q4 07', '31-12-2007', '15-01-2008', '15-01-2008', '10688 Meta Platforms, Inc:', 'META', 'FY 07', '31-12-2007', '15-01-2008', '15-01-2008', '10688 Meta Platforms, Inc:', 'META', 'Q1 08', '31-03-2008', '15-04-2008', '15-04-2008', '10688 Meta Platforms, Inc.', 'META', 'Q2 08', '30-06-2008', '15-07-2008', '15-07-2008', '10688 Meta Platforms, Inc', 'META', 'Q3 08', '30-09-2008', '15-10-2008', '15-10-2008', '10688 Meta Platforms, Inc:', 'META', 'Q4 08', '31-12-2008', '15-01-2009', '15-01-2009', '10688 Meta Platforms, Inc:', 'META', 'FY 08', '31-12-2008', '15-01-2009', '15-01-2009', '10688 Meta Platfor