# PDF Text Extraction Script

This notebook extracts text content and metadata from PDF files in the Nordic travel literature collection.

**Purpose:**
- Batch process multiple PDF files from the `literature pdf` folder
- Extract page content and metadata (page numbers, source file) from each PDF
- Save extracted data as CSV files for further analysis

**Workflow:**
1. Load PDF files using LangChain's PyPDFLoader
2. Extract text content and metadata from each page
3. Organize data into pandas DataFrames
4. Export to CSV files in `literature csv_raw` folder

**Input:** PDF files in `.\literature pdf\`

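**Output:** CSV files (one per PDF) in `.\literature csv_raw\` containing one row per page, with at least:
- `page`: Zero-based page index assigned by the loader
- `source`: Path of the source PDF as passed to the loader
- `page_content`: Extracted text content of the page

Any other metadata keys returned by the loader are written as additional columns.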

**Dependencies:** langchain_community, pypdf (required by PyPDFLoader), pandas, tqdm
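
The output layout described above can be illustrated with a small round trip. This is a minimal sketch with invented sample rows, not data from the collection; it only assumes pandas and shows that page text containing commas, quotes, and line breaks is quoted on export and read back unchanged.

```python
import os
import tempfile

import pandas as pd

# Invented sample rows that mimic the output layout (not real collection data)
sample = pd.DataFrame(
    {
        "page": [0, 1],
        "source": ["example.pdf", "example.pdf"],
        "page_content": ["Line one,\nline two", 'He said "hello"'],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "example.csv")
    # Same export call as used in the batch cell: no index column is written
    sample.to_csv(csv_path, index=False)
    restored = pd.read_csv(csv_path)

# Compare the text column before and after the round trip
print(restored["page_content"].tolist() == sample["page_content"].tolist())
print(restored.columns.tolist())
```

Reading the files back with `pd.read_csv` rather than a line-based parser matters here, because a single page's text can span several physical lines in the CSV file.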

In [1]:
# Import required libraries
print('Import libraries')

# LangChain document loaders for PDF and EPUB processing
from langchain_community.document_loaders import UnstructuredEPubLoader  # For EPUB files (if needed)
from langchain_community.document_loaders import PyPDFLoader  # Main PDF loader (requires the pypdf package)

# Data manipulation and file operations
import pandas as pd  # For creating and managing DataFrames
import os  # For file and directory operations

# Progress tracking
from tqdm.notebook import tqdm  # Display progress bars in Jupyter notebooks

Import libraries


In [2]:
# Function to extract text from PDF and return data in a DataFrame
def pdf_text_extract_from_file(pdf_file):
    """
    Extract text content and metadata from a PDF file.
    
    Args:
        pdf_file (str): Path to the PDF file to process
        
    Returns:
        pd.DataFrame: One row per page, with a column for each metadata key
                      returned by the loader plus the page text, including:
                      - page: Zero-based page index
                      - source: Path of the source PDF
                      - page_content: Extracted text from the page
    """
    
    # Initialize the PDF loader with page-by-page mode
    # This loads each page as a separate document object
    loader = PyPDFLoader(pdf_file, mode='page')
    
    # Load all pages from the PDF
    # Returns a list of Document objects, one per page
    docs = loader.load()
    
    # Collect one record (a plain dict) per page
    # Building a single DataFrame at the end avoids creating and
    # concatenating many one-row DataFrames
    records = []
    
    # Iterate through each page document
    for doc in docs:
        # Copy the metadata (page index, source path, etc.)
        record = dict(doc.metadata)
        
        # Add the extracted text of the page
        record['page_content'] = doc.page_content
        
        # Append this page's record to the collection
        records.append(record)
    
    # Build the DataFrame with a sequential index (0, 1, 2, ...)
    # A PDF with no pages yields an empty DataFrame instead of raising an error
    book_data = pd.DataFrame(records)
    
    # Return the complete book data
    return book_data

In [None]:
# ============================================
# BATCH PROCESSING: Extract all PDF files
# ============================================

# Define input and output directories
departure_folder = r'.\literature pdf'      # Source folder containing PDF files
arrival_folder = r'.\literature csv_raw'    # Destination folder for CSV output

# Create the output directory if it doesn't exist
# exist_ok=True prevents errors if the directory already exists
os.makedirs(arrival_folder, exist_ok=True)

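# Get a list of the PDF files in the input folder
# Other files (e.g. hidden system files) are skipped so the loader does not fail on them
files_in_folder = [f for f in os.listdir(departure_folder) if f.lower().endswith('.pdf')]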

# Process each file in the folder with a progress bar
for file in tqdm(files_in_folder, desc='Extracting data from multiple pdf files', colour='blue'):
    
    # Construct the full path to the PDF file
    file_path = os.path.join(departure_folder, file)
    
    # Extract text and metadata from the PDF using our function
    # Returns a DataFrame with all pages from this book
    book = pdf_text_extract_from_file(file_path)
    
    # Create the CSV filename by replacing the file extension with .csv
    # os.path.splitext splits off the extension regardless of its length or case
    csv_file_name = os.path.splitext(file)[0] + '.csv'
    
    # Construct the full path for the output CSV file
    csv_file_path = os.path.join(arrival_folder, csv_file_name)
    
    # Save the DataFrame to CSV
    # index=False prevents pandas from writing row numbers as a column
    book.to_csv(csv_file_path, index=False)
    

Extracting data from multiple pdf files:   0%|          | 0/29 [00:00<?, ?it/s]
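
For the further analysis mentioned in the introduction, the per-book CSV files can be stacked into one table. The following is a minimal sketch, not part of the original notebook: the function name `load_all_books` and the extra `csv_file` column are hypothetical, and the demonstration writes invented rows to a temporary folder instead of reading `.\literature csv_raw\`.

```python
import glob
import os
import tempfile

import pandas as pd


def load_all_books(csv_folder):
    """Read every CSV in csv_folder and stack them into one DataFrame."""
    csv_paths = sorted(glob.glob(os.path.join(csv_folder, "*.csv")))
    frames = []
    for path in csv_paths:
        frame = pd.read_csv(path)
        # Hypothetical helper column recording which CSV each row came from
        frame["csv_file"] = os.path.basename(path)
        frames.append(frame)
    if not frames:
        # No CSV files found: return an empty table rather than failing in concat
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Demonstration with invented data in a temporary folder
with tempfile.TemporaryDirectory() as demo_folder:
    for name, n_pages in [("book_a", 2), ("book_b", 3)]:
        pd.DataFrame(
            {
                "page": range(n_pages),
                "source": f"{name}.pdf",
                "page_content": [f"{name} page {i}" for i in range(n_pages)],
            }
        ).to_csv(os.path.join(demo_folder, f"{name}.csv"), index=False)
    all_books = load_all_books(demo_folder)

# Number of page rows contributed by each CSV file
print(all_books.groupby("csv_file")["page"].count())
```

To use it on the real output, one would call `load_all_books` with the CSV folder path used in the batch cell, assuming the extraction has finished for all files.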