<a href="https://colab.research.google.com/github/ivanmladek/Sentinel-Intelligence-Codex/blob/main/process_refactor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Library Processing Pipeline

The process of extracting, cleaning, and preparing the text from PDF files for the LLM is a multi-stage pipeline designed to ensure high-quality, structured data. This process is orchestrated by the process_refactor.ipynb notebook.

1. Environment Setup and PDF Discovery

Dependencies: The process begins by installing necessary Python libraries, including nougat-ocr for text extraction, nltk for natural language processing, and langdetect for language identification.
PDF Discovery: The script recursively scans a specified directory (e.g., a Google Drive folder) to locate all PDF files.
2. Text Extraction with Nougat

Nougat OCR: For each PDF, the nougat command-line tool is used. Nougat is a state-of-the-art OCR tool specifically designed for academic and scientific documents, capable of recognizing and transcribing complex layouts, mathematical equations, and tables into a structured Markdown format (.mmd).
Output: The raw extracted text is saved as a .mmd file, preserving the document's structure with Markdown headings.
3. Text Cleaning and Garbage Detection

This is a critical step to filter out irrelevant or low-quality text.

Cleaning: A series of regular expressions and cleaning functions are applied to the raw text to:
Remove extra newlines, spaces, and non-ASCII characters.
Eliminate academic citations, references to tables/figures, and bracketed content.
Sanitize garbled punctuation and symbols.
Garbage Detection: Each segment of text is evaluated against a set of criteria to identify and discard "garbage" content. This includes:
Language Detection: Text that is not identified as English is discarded.
Heuristics: Checks for "jammed" words (long strings of characters without spaces), an unusually high proportion of single-letter words, and repetitive patterns.
Quality Scoring: A text_quality_score is calculated based on the presence of common English words, proper part-of-speech patterns, and other linguistic features. Text falling below a certain threshold is flagged as garbage.
4. Tokenization and Chunking

Chunking Strategy: The cleaned .mmd content is chunked into smaller, manageable segments suitable for the LLM. The chunking logic is designed to respect the document's structure:
The text is split by Markdown headings (#, ##, ###).
These larger sections are then further divided into sentences using nltk.sent_tokenize.
Size Constraints: The sentences are grouped into chunks with a maximum size (e.g., 8192 characters) to ensure they fit within the model's context window, while avoiding splitting sentences in the middle.
Final Output: The cleaned, chunked text is saved to a .jsonl file, with each line containing a JSON object with a single "text" key, ready for training the LLM. Garbage text is saved to a separate file for review.

## 1. Setup and Dependencies

In [None]:
#@title Install System Dependencies
!apt-get install -y poppler-utils tesseract-ocr libmagic-dev unrar

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
libmagic-dev is already the newest version (1:5.41-3ubuntu0.1).
poppler-utils is already the newest version (22.02.0-2ubuntu0.8).
unrar is already the newest version (1:6.1.5-1ubuntu0.1).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.


In [None]:
#@title Install Python Libraries (Part 1)
!pip install numpy==1.26.4



In [None]:
#@title Install Python Libraries (Part 2)
!pip install transformers==4.38.2 pyarrow==14.0.1 timm==0.5.4 requests==2.31.0 albumentations==1.0.0 git+https://github.com/facebookresearch/nougat
!pip install textblob langdetect beautifulsoup4 huggingface_hub tqdm pandas

Collecting git+https://github.com/facebookresearch/nougat
  Cloning https://github.com/facebookresearch/nougat to /tmp/pip-req-build-z4g1vguv
  Running command git clone --filter=blob:none --quiet https://github.com/facebookresearch/nougat /tmp/pip-req-build-z4g1vguv
  Resolved https://github.com/facebookresearch/nougat to commit 5a92920d342fb6acf05fc9b594ccb4053dbe8e7a
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting transformers==4.38.2
  Downloading transformers-4.38.2-py3-none-any.whl.metadata (130 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m130.7/130.7 kB[0m [31m4.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting pyarrow==14.0.1
  Downloading pyarrow-14.0.1-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (3.0 kB)
Collecting timm==0.5.4
  Downloading timm-0.5.4-py3-none-any.whl.metadata (36 kB)
Collecting requests==2.31.0
  Downloading requests-2.31.0-py3-none-any.whl.metadata (4.6 kB)
Collecting albumentations==1.0.0
  Downloading albumentatio

## 2. Imports and Configuration

In [None]:
import os
import re
import json
import logging
import shutil
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed

import nltk
import pandas as pd
import requests
from bs4 import BeautifulSoup
from huggingface_hub import HfApi
from langdetect import detect, LangDetectException
from nltk.corpus import words, brown
from nltk.tokenize import word_tokenize, sent_tokenize
from textblob import TextBlob
from tqdm import tqdm
from google.colab import drive

# --- Configuration ---
BASE_URL = "https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/"
HUGGING_FACE_REPO = "ivanmladek/Sentinel-Intelligence-Codex"  # Replace with your Hugging Face repo
GARBAGE_THRESHOLD = 0.8
LENWORD = 50

# --- Logging Setup ---
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
logger.propagate = True # Ensure messages are propagated to the root logger

# Explicitly set the logging level and add a handler to print to stdout
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(sys.stdout)
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
# Avoid adding duplicate handlers if the cell is run multiple times
if not logger.handlers:
    logger.addHandler(handler)


# --- Mount Google Drive ---
#drive.mount('/content/drive')

# --- Download NLTK Data ---
nltk.download('punkt')
nltk.download('words')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

## 3. Helper Functions

### 3.1. File and Web Operations

In [None]:
import requests
from bs4 import BeautifulSoup
import os
import re
import subprocess
import logging
from requests.packages.urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter

logger = logging.getLogger(__name__)

def get_file_list(url, depth=0, max_depth=3):
    """Recursively get a list of files from a URL and its subdirectories up to a max depth."""
    if depth > max_depth:
        logger.debug(f"Max depth ({max_depth}) reached at URL: {url}. Stopping recursion.")
        return []

    rar_files = []
    # Configure retries
    retry_strategy = Retry(
        total=3,  # Number of retries
        backoff_factor=1, # Factor by which the delay increases
        status_forcelist=[429, 500, 502, 503, 504] # HTTP status codes to retry on
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    http = requests.Session()
    http.mount("http://", adapter)
    http.mount("https://", adapter)

    logger.info(f"Accessing URL: {url} (Depth: {depth})")
    try:
        response = http.get(url)
        response.raise_for_status()  # Raise an HTTPError for bad responses (4xx or 5xx)
        soup = BeautifulSoup(response.text, 'html.parser')
        for link in soup.find_all('a'):
            href = link.get('href')
            if href:
                # Handle relative and absolute links
                full_url = requests.compat.urljoin(url, href)
                if full_url.endswith('.rar'):
                    logger.debug(f"Found RAR file: {full_url}")
                    rar_files.append(full_url)
                elif full_url.endswith('/'):
                    # Recursively call for subdirectories, avoiding infinite loops
                    if url != full_url: # Avoid processing the same directory again
                         logger.debug(f"Found subdirectory: {full_url}. Recursing.")
                         rar_files.extend(get_file_list(full_url, depth + 1, max_depth))
    except requests.exceptions.RequestException as e:
        logger.error(f"Error accessing URL {url}: {e}")
    logger.debug(f"Finished processing URL: {url}. Found {len(rar_files)} RAR files in this branch.")
    return rar_files

def download_file(url, output_path):
    """Download a file from a URL."""
    if os.path.exists(output_path):
        logger.info(f"{output_path} already exists. Skipping download.")
        return True # Indicate success as file exists
    logger.info(f"Attempting to download {url} to {output_path}")
    try:
        response = requests.get(url, stream=True)
        response.raise_for_status() # Raise an HTTPError for bad responses (4xx or 5xx)
        with open(output_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
        logger.info(f"Successfully downloaded {url} to {output_path}")
        return True # Indicate success
    except requests.exceptions.RequestException as e:
        logger.error(f"Error downloading file from {url}: {e}")
        return False # Indicate failure


def extract_rar(file_path, output_path):
    """Extract a RAR file."""
    if not os.path.exists(file_path):
        logger.error(f"RAR file not found for extraction: {file_path}")
        return False # Indicate failure
    if not os.path.exists(output_path):
        os.makedirs(output_path)
        logger.debug(f"Created output directory for extraction: {output_path}")
    logger.info(f"Attempting to extract {file_path} to {output_path}")
    try:
        # Added -o+ to overwrite without prompting
        result = subprocess.run(['unrar', 'x', '-o+', file_path, output_path], check=True, capture_output=True, text=True)
        logger.info(f"Successfully extracted {file_path} to {output_path}")
        # Log stdout and stderr for debugging
        if result.stdout:
            logger.debug(f"Unrar stdout for {file_path}:\n{result.stdout}")
        if result.stderr:
             logger.debug(f"Unrar stderr for {file_path}:\n{result.stderr}")
        return True # Indicate success
    except subprocess.CalledProcessError as e:
        logger.error(f"Error extracting {file_path}: {e.stderr}")
        return False # Indicate failure
    except FileNotFoundError:
        logger.error("Unrar command not found. Please ensure 'unrar' is installed.")
        return False
    except Exception as e:
        logger.error(f"An unexpected error occurred during extraction of {file_path}: {e}")
        return False


def sanitize_filename(filename):
    """Sanitize a filename."""
    sanitized = re.sub(r'[^a-zA-Z0-9_.-]', '_', filename)
    logger.debug(f"Sanitized filename '{filename}' to '{sanitized}'")
    return sanitized

### 3.2. PDF Processing (Nougat)

In [None]:
import subprocess
import os
import logging
# Assuming sanitize_filename is defined in file-ops-cell and available

logger = logging.getLogger(__name__)

def process_pdf(pdf_path, output_dir):
    """Process a single PDF file with Nougat."""
    if not os.path.exists(pdf_path):
        logger.error(f"PDF file not found for processing: {pdf_path}")
        return None

    sanitized_filename = sanitize_filename(os.path.basename(pdf_path))
    mmd_path = os.path.join(output_dir, f"{os.path.splitext(sanitized_filename)[0]}.mmd")

    if os.path.exists(mmd_path):
        logger.info(f"{mmd_path} already exists. Skipping Nougat processing for {pdf_path}.")
        return mmd_path

    logger.info(f"Attempting to process PDF: {pdf_path} with Nougat. Output to {output_dir}")
    try:
        # Using --no-skipping and --recompute for thorough processing
        result = subprocess.run(
            ['nougat', pdf_path, '-o', output_dir, '--no-skipping', '--recompute'],
            check=True,
            capture_output=True,
            text=True
        )
        logger.info(f"Successfully processed {pdf_path} with Nougat.")
        logger.debug(f"Nougat stdout for {pdf_path}:\n{result.stdout}")
        if result.stderr:
            logger.debug(f"Nougat stderr for {pdf_path}:\n{result.stderr}")
    except subprocess.CalledProcessError as e:
        logger.error(f"Error processing {pdf_path} with Nougat: {e.stderr}")
        return None
    except Exception as e:
        logger.error(f"An unexpected error occurred during Nougat processing of {pdf_path}: {e}")
        return None

    # Check if the expected output file was actually created
    if not os.path.exists(mmd_path):
         logger.error(f"Nougat command finished without error, but expected output file {mmd_path} was not created.")
         return None

    return mmd_path

# Ensure sanitize_filename is available if it's not in this cell
# from .file_ops import sanitize_filename # Example if in a different file

### 3.3. Text Cleaning and Quality Control

In [None]:
import re
import logging
from langdetect import detect, LangDetectException
from nltk.corpus import words, brown # Assuming these are downloaded in config-cell
from nltk.tokenize import word_tokenize # Assuming this is downloaded in config-cell
from textblob import TextBlob # Assuming TextBlob is installed

# Assuming GARBAGE_THRESHOLD and LENWORD are defined in config-cell
# from .config import GARBAGE_THRESHOLD, LENWORD # Example if in a different file

logger = logging.getLogger(__name__)

# Load the NLTK words corpus for garbage detection
# Ensure this is done after nltk.download('words') in config-cell
try:
    ENGLISH_WORDS = set(words.words())
    logger.info("NLTK English words corpus loaded.")
except LookupError:
    logger.error("NLTK 'words' corpus not found. Please run nltk.download('words').")
    ENGLISH_WORDS = set() # Use an empty set to avoid errors

def clean_text(text):
    """Clean the extracted text."""
    logger.debug(f"Cleaning text (first 100 chars): {text[:100]}...")
    initial_len = len(text)
    text = re.sub(r'\n+', ' ', text)
    text = re.sub(r' +', ' ', text)
    text = text.strip()
    # Remove academic citations, references to tables/figures, and bracketed content.
    text = re.sub(r'\[[^\]]*\]', '', text)
    text = re.sub(r'\(\d+\)', '', text, flags=re.IGNORECASE)
    text = re.sub(r'\[[A-Za-z0-9]+\]', '', text, flags=re.IGNORECASE)
    text = re.sub(r'\([\w\s]+et\s+al\., \d{4}\)', '', text, flags=re.IGNORECASE)
    text = re.sub(r'\(\w+\s+and\s+\w+\s+\d{4}\)', '', text, flags=re.IGNORECASE)
    text = re.sub(r'\(see\s+equations\s+\(\d+\)\s+and\s+\(\d+\)\)', '', text, flags=re.IGNORECASE)
    text = re.sub(r'\(\w+\s+et\s+al\., \d{4};\s*\w+\s+et\s+al\., \d{4}\)', '', text, flags=re.IGNORECASE)
    text = re.sub(r'Table\s+\d+', '', text, flags=re.IGNORECASE)
    text = re.sub(r'\[FIGURE:[^]]+\]', '', text, flags=re.IGNORECASE)
    text = re.sub(r'\[\d+(,\s*\d+)*\]', '', text, flags=re.IGNORECASE)
    text = re.sub(r'\[.*arxiv.*\]', '', text, flags=re.IGNORECASE)
    # Remove non-ASCII characters
    text = re.sub(r'[^\x00-\x7F]+', '', text)
    # Sanitize garbled punctuation and symbols.
    text = re.sub(r'[\.,;:!?]{2,}', '', text)
    logger.debug(f"Cleaned text (first 100 chars, original len {initial_len}): {text[:100]}...")
    return text

def calculate_text_quality_score(text):
    """Calculate a quality score based on English words and sentence structure."""
    if not text:
        return 0.0

    words = word_tokenize(text)
    if not words:
        return 0.0

    english_word_count = sum(1 for word in words if word.lower() in ENGLISH_WORDS)
    english_word_ratio = english_word_count / len(words) if words else 0

    # Simple heuristic for sentence structure (check for punctuation at end of sentences)
    sentences = sent_tokenize(text)
    well_formed_sentences = sum(1 for sent in sentences if sent.strip().endswith(('.', '!', '?')))
    sentence_structure_score = well_formed_sentences / len(sentences) if sentences else 0

    # Combine ratios - adjust weights as needed
    quality_score = (english_word_ratio * 0.7) + (sentence_structure_score * 0.3)

    logger.debug(f"Text quality score calculated: {quality_score} for text (first 50 chars): {text[:50]}...")
    return quality_score


def is_garbage(text, threshold=GARBAGE_THRESHOLD, lenword=LENWORD):
    """Check if the text is garbage based on various heuristics."""
    logger.debug(f"Checking if text is garbage (first 100 chars): {text[:100]}...")

    # Check for minimal length
    if not text or len(text.split()) < 5: # Reduced minimum words
        logger.debug("Identified as garbage: text too short or empty.")
        return True

    # Language detection
    try:
        if detect(text) != 'en':
            logger.debug("Identified as garbage: language not English.")
            return True
    except LangDetectException as e:
        logger.debug(f"Language detection failed for text (first 50 chars): {text[:50]}... Error: {e}. Assuming garbage.")
        return True # Assume garbage if language detection fails

    # Check for jammed words (long strings without spaces)
    words_list = text.split()
    for word in words_list:
        if len(word) > lenword and not '-' in word: # Allow hyphens in long words
             logger.debug(f"Identified as garbage: found jammed word '{word[:50]}...'")
             return True

    # Check for unusually high proportion of single-letter words
    single_letter_words = sum(1 for word in words_list if len(word) == 1)
    if len(words_list) > 0 and single_letter_words / len(words_list) > 0.2: # More than 20% single letters
        logger.debug("Identified as garbage: high proportion of single-letter words.")
        return True

    # Check for repetitive patterns (simple heuristic)
    if re.search(r'(.)\1{4,}', text): # 5 or more of the same character in a row
        logger.debug("Identified as garbage: found repetitive character pattern.")
        return True
    if re.search(r'(\w+\s+)\1{2,}', text): # A word repeated 3 or more times
         logger.debug("Identified as garbage: found repetitive word pattern.")
         return True


    # Quality scoring
    quality_score = calculate_text_quality_score(text)
    if quality_score < threshold:
        logger.debug(f"Identified as garbage: quality score {quality_score} below threshold {threshold}.")
        return True

    logger.debug("Text passed garbage checks.")
    return False

# Ensure GARBAGE_THRESHOLD and LENWORD are available if not in this cell
# GARBAGE_THRESHOLD = 0.8 # Example default
# LENWORD = 50 # Example default

### 3.4. Text Chunking

In [None]:
import json
import logging
import os
from nltk.tokenize import sent_tokenize # Assuming this is downloaded in config-cell

# Assuming clean_text and is_garbage are defined in cleaning-cell and available
# from .cleaning import clean_text, is_garbage # Example if in a different file

logger = logging.getLogger(__name__)

def chunk_text(content, max_size=8192):
    """Chunk the text into smaller segments, respecting markdown headings."""
    logger.debug(f"Starting chunking process with max_size={max_size}.")
    segments = []
    current_segment = ""
    lines = content.split('\n')
    logger.debug(f"Splitting content into {len(lines)} lines.")

    for i, line in enumerate(lines):
        # Check for markdown headings
        if line.strip().startswith(("# ", "## ", "### ")):
            logger.debug(f"Found markdown heading at line {i}: {line.strip()}")
            # If the current segment is not empty, process it before starting a new one
            if current_segment:
                logger.debug(f"Processing previous segment before heading (length: {len(current_segment)}).")
                segments.extend(split_segment(current_segment.strip(), max_size))
            # Start a new segment with the heading line
            current_segment = line + "\n" # Keep the heading line in the new segment
            logger.debug("Starting new segment after heading.")
        else:
            # Add non-heading lines to the current segment
            current_segment += line + "\n"

    # Process any remaining content in the last segment
    if current_segment:
        logger.debug(f"Processing final segment (length: {len(current_segment)}).")
        segments.extend(split_segment(current_segment.strip(), max_size))

    logger.info(f"Chunking complete. Produced {len(segments)} initial segments based on headings.")
    return segments

def split_segment(segment, max_size):
    """Split a segment (potentially from a heading section) into smaller chunks by sentences."""
    logger.debug(f"Splitting segment by sentences (length: {len(segment)}).")
    sentences = sent_tokenize(segment)
    logger.debug(f"Segment split into {len(sentences)} sentences.")
    chunks = []
    current_chunk = ""

    for i, sentence in enumerate(sentences):
        # Add a space before adding the new sentence if the current chunk is not empty
        sentence_to_add = sentence + " " if current_chunk else sentence
        # Check if adding the current sentence exceeds the max size
        if len(current_chunk) + len(sentence_to_add) <= max_size:
            current_chunk += sentence_to_add
            logger.debug(f"Added sentence {i+1}/{len(sentences)} to current chunk (current size: {len(current_chunk)}).")
        else:
            # If adding the sentence exceeds max size, add the current chunk to chunks list
            if current_chunk: # Add the chunk only if it's not empty
                chunks.append(current_chunk.strip())
                logger.debug(f"Chunk completed (size: {len(current_chunk)}). Starting new chunk with sentence {i+1}.")
            # Start a new chunk with the current sentence
            current_chunk = sentence + " " # Start new chunk with the current sentence

    # Add the last current chunk if it's not empty
    if current_chunk:
        chunks.append(current_chunk.strip())
        logger.debug(f"Added final chunk (size: {len(current_chunk)}).")

    logger.debug(f"Segment split into {len(chunks)} smaller chunks.")
    return chunks


def process_and_chunk_mmd(mmd_path, output_dir):
    """Process, clean, chunk, and categorize text from an MMD file."""
    logger.info(f"Starting processing and chunking for MMD file: {mmd_path}")

    if not mmd_path or not os.path.exists(mmd_path):
        logger.warning(f"MMD file not found or path is invalid: {mmd_path}. Skipping processing and chunking.")
        return None, None

    sanitized_filename = sanitize_filename(os.path.basename(mmd_path))
    cleaned_jsonl_path = os.path.join(output_dir, f"{os.path.splitext(sanitized_filename)[0]}_cleaned.jsonl")
    garbage_jsonl_path = os.path.join(output_dir, f"{os.path.splitext(sanitized_filename)[0]}_garbage.jsonl")

    if os.path.exists(cleaned_jsonl_path) and os.path.exists(garbage_jsonl_path):
        logger.info(f"Output files {cleaned_jsonl_path} and {garbage_jsonl_path} already exist. Skipping processing and chunking for {mmd_path}.")
        return cleaned_jsonl_path, garbage_jsonl_path

    try:
        with open(mmd_path, 'r', encoding='utf-8') as f:
            content = f.read()
        logger.debug(f"Successfully read content from {mmd_path} (length: {len(content)}).")
    except Exception as e:
        logger.error(f"Error reading MMD file {mmd_path}: {e}")
        return None, None

    chunks = chunk_text(content)
    logger.info(f"MMD content chunked into {len(chunks)} segments.")

    cleaned_count = 0
    garbage_count = 0

    try:
        with open(cleaned_jsonl_path, 'w', encoding='utf-8') as cleaned_f, \
             open(garbage_jsonl_path, 'w', encoding='utf-8') as garbage_f:
            for i, chunk in enumerate(chunks):
                logger.debug(f"Processing chunk {i+1}/{len(chunks)} (length: {len(chunk)}).")
                cleaned_chunk = clean_text(chunk)
                if is_garbage(cleaned_chunk):
                    garbage_f.write(json.dumps({"text": cleaned_chunk}) + '\n')
                    garbage_count += 1
                    logger.debug(f"Chunk {i+1} identified as garbage.")
                else:
                    cleaned_f.write(json.dumps({"text": cleaned_chunk}) + '\n')
                    cleaned_count += 1
                    logger.debug(f"Chunk {i+1} identified as cleaned text.")

        logger.info(f"Finished processing and chunking {mmd_path}. Generated {cleaned_count} cleaned chunks and {garbage_count} garbage chunks.")
        return cleaned_jsonl_path, garbage_jsonl_path

    except Exception as e:
        logger.error(f"Error during cleaning or writing chunk files for {mmd_path}: {e}")
        # Clean up potentially incomplete files
        if os.path.exists(cleaned_jsonl_path):
            os.remove(cleaned_jsonl_path)
        if os.path.exists(garbage_jsonl_path):
            os.remove(garbage_jsonl_path)
        return None, None

# Ensure sanitize_filename, clean_text, is_garbage are available
# from .file_ops import sanitize_filename # Example if in a different file
# from .cleaning import clean_text, is_garbage # Example if in a different file

### 3.5. Hugging Face Integration

In [None]:
import logging
import os
from huggingface_hub import HfApi, Repository # Import Repository for better practice if needed for cloning/managing

logger = logging.getLogger(__name__)

def upload_to_huggingface(file_path, repo_id, repo_type="dataset"):
    """Upload a file to a Hugging Face repository."""
    if not os.path.exists(file_path):
        logger.error(f"File not found for upload to Hugging Face: {file_path}")
        return False # Indicate failure

    logger.info(f"Attempting to upload {file_path} to Hugging Face repo '{repo_id}' (type: {repo_type}).")
    api = HfApi()
    try:
        # Use create_commit for potentially better handling of multiple files or larger uploads
        # This example uses upload_file for simplicity as in the original code
        api.upload_file(
            path_or_fileobj=file_path,
            path_in_repo=os.path.basename(file_path),
            repo_id=repo_id,
            repo_type=repo_type,
            # Optional: add commit_message, token if not using environment variable
        )
        logger.info(f"Successfully uploaded {file_path} to {repo_id}")
        return True # Indicate success
    except Exception as e:
        logger.error(f"Error uploading {file_path} to Hugging Face repo '{repo_id}': {e}")
        return False # Indicate failure

## 4. Main Processing Loop

In [None]:
import os
import shutil
import logging
import json
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor, as_completed # Import ThreadPoolExecutor
# Assuming helper functions are defined in other cells and available
# from .file_ops import get_file_list, download_file, extract_rar, sanitize_filename
# from .nougat_processing import process_pdf
# from .chunking import process_and_chunk_mmd
# from .huggingface_integration import upload_to_huggingface
# from .config import BASE_URL, HUGGING_FACE_REPO # Assuming these are defined in config-cell

logger = logging.getLogger(__name__)

# Define a cache file path
RAR_LIST_CACHE = "rar_list_cache.json"

def process_single_rar(rar_file_url, HUGGING_FACE_REPO):
    """Processes a single RAR file: downloads, extracts, processes PDFs, and uploads."""
    rar_filename = rar_file_url.split('/')[-1]
    sanitized_rar_filename = sanitize_filename(rar_filename)
    rar_path = sanitized_rar_filename
    extract_path = os.path.splitext(rar_path)[0]

    logger.info(f"--- Processing {rar_filename} ---")

    # 1. Download RAR file
    # The download_file function now correctly returns True on success
    download_file(rar_file_url, rar_path)

    # 2. Extract RAR file
    extract_rar(rar_path, extract_path)
    # Clean up the downloaded RAR even if extraction failed
    if os.path.exists(rar_path):
        os.remove(rar_path)
        logger.debug(f"Removed partially downloaded/failed RAR file: {rar_path}")

    # 3. Find and Process PDF files within the extracted directory
    pdf_files = [os.path.join(root, file) for root, _, files in os.walk(extract_path) for file in files if file.lower().endswith('.pdf')]
    logger.info(f"Found {len(pdf_files)} PDF files in extracted directory: {extract_path}")

    if not pdf_files:
        logger.warning(f"No PDF files found in {extract_path}. Cleaning up.")
        # Clean up the downloaded RAR and extracted directory
        if os.path.exists(rar_path):
            os.remove(rar_path)
            logger.debug(f"Removed downloaded RAR file: {rar_path}")
        if os.path.exists(extract_path):
            shutil.rmtree(extract_path)
            logger.debug(f"Removed extracted directory: {extract_path}")
        return 0 # Indicate no PDFs processed

    successful_uploads_count = 0
    with tqdm(total=len(pdf_files), desc=f"Processing PDFs in {sanitized_rar_filename}", leave=False) as pbar_pdfs:
        for pdf_path in pdf_files:
            logger.info(f"Processing PDF: {pdf_path}")

            # 4. Process PDF with Nougat
            mmd_path = process_pdf(pdf_path, extract_path)

            if mmd_path:
                logger.info(f"Nougat processing successful for {pdf_path}. MMD file: {mmd_path}")
                # 5. Clean and Chunk MMD file
                cleaned_jsonl, garbage_jsonl = process_and_chunk_mmd(mmd_path, extract_path)

                # 6. Upload to Hugging Face
                if cleaned_jsonl and os.path.exists(cleaned_jsonl):
                    logger.info(f"Uploading cleaned data for {os.path.basename(pdf_path)} to Hugging Face.")
                    if upload_to_huggingface(cleaned_jsonl, HUGGING_FACE_REPO):
                        successful_uploads_count += 1
                    else:
                        logger.error(f"Failed to upload cleaned data for {os.path.basename(pdf_path)}.")
                    # Optionally upload garbage data
                    ##if garbage_jsonl and os.path.exists(garbage_jsonl):
                    ##    logger.info(f"Uploading garbage data for {os.path.basename(pdf_path)} to Hugging Face.")
                    ##    upload_to_huggingface(garbage_jsonl, HUGGING_FACE_REPO)
                else:
                    logger.warning(f"No cleaned data generated for {os.path.basename(pdf_path)}. Skipping upload.")
            else:
                logger.error(f"Nougat processing failed for {pdf_path}. Skipping cleaning, chunking, and upload.")

            pbar_pdfs.update(1) # Update inner progress bar for each PDF

    # 7. Clean up downloaded RAR and extracted directory after processing all PDFs in the RAR
    logger.info(f"Cleaning up downloaded RAR and extracted directory for {rar_filename}.")
    if os.path.exists(rar_path):
        os.remove(rar_path)
        logger.debug(f"Removed downloaded RAR file: {rar_path}")
    if os.path.exists(extract_path):
        shutil.rmtree(extract_path)
        logger.debug(f"Removed extracted directory: {extract_path}")

    return successful_uploads_count


def main():
    """Main function to process the library."""
    logger.info("--- Starting Library Processing Pipeline ---")

    rar_files = []
    if os.path.exists(RAR_LIST_CACHE):
        logger.info(f"Loading RAR file list from cache: {RAR_LIST_CACHE}")
        try:
            with open(RAR_LIST_CACHE, 'r') as f:
                rar_files = json.load(f)
            logger.info(f"Loaded {len(rar_files)} RAR files from cache.")
        except Exception as e:
            logger.error(f"Error loading RAR file list from cache: {e}. Rescanning.")
            # If loading fails, proceed to rescan
            rar_files = []

    if not rar_files:
        logger.info(f"Scanning for RAR files at {BASE_URL}")
        try:
            rar_files = get_file_list(BASE_URL)
            logger.info(f"Found {len(rar_files)} RAR files.")
            # Save the list to cache
            try:
                with open(RAR_LIST_CACHE, 'w') as f:
                    json.dump(rar_files, f)
                logger.info(f"Saved RAR file list to cache: {RAR_LIST_CACHE}")
            except Exception as e:
                logger.error(f"Error saving RAR file list to cache: {e}")
        except Exception as e:
            logger.error(f"Failed to get RAR file list from {BASE_URL}: {e}")
            return # Exit if initial scan fails

    total_successful_uploads = 0

    # Use ThreadPoolExecutor to process RAR files in parallel
    # Adjust max_workers based on your runtime's capabilities and the task
    max_workers = 4 # Example: process 4 RARs at a time
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit tasks to the executor
        future_to_rar = {executor.submit(process_single_rar, rar_file_url, HUGGING_FACE_REPO): rar_file_url for rar_file_url in rar_files}

        # Use tqdm to track overall progress
        with tqdm(total=len(rar_files), desc="Overall RAR Processing") as pbar_overall:
            for future in as_completed(future_to_rar):
                rar_file_url = future_to_rar[future]
                try:
                    successful_uploads_count = future.result()
                    total_successful_uploads += successful_uploads_count
                except Exception as exc:
                    logger.error(f'{rar_file_url} generated an exception: {exc}')

                pbar_overall.update(1) # Update outer progress bar for each completed RAR

    logger.info("--- Library Processing Pipeline Finished ---")
    logger.info(f"Successfully uploaded cleaned data for {total_successful_uploads} PDF files to {HUGGING_FACE_REPO}.")


if __name__ == "__main__":
    main()

2025-07-01 13:38:57,116 - INFO - --- Starting Library Processing Pipeline ---


INFO:__main__:--- Starting Library Processing Pipeline ---


2025-07-01 13:38:57,133 - INFO - Loading RAR file list from cache: rar_list_cache.json


INFO:__main__:Loading RAR file list from cache: rar_list_cache.json


2025-07-01 13:38:57,156 - INFO - Loaded 4824 RAR files from cache.


INFO:__main__:Loaded 4824 RAR files from cache.


2025-07-01 13:38:57,168 - INFO - --- Processing 1.%20Prehistory.rar ---


INFO:__main__:--- Processing 1.%20Prehistory.rar ---


2025-07-01 13:38:57,186 - INFO - --- Processing Alexander%20the%20Great.rar ---


INFO:__main__:--- Processing Alexander%20the%20Great.rar ---


2025-07-01 13:38:57,190 - INFO - --- Processing Ancient%20Africa.rar ---


INFO:__main__:--- Processing Ancient%20Africa.rar ---


2025-07-01 13:38:57,201 - INFO - --- Processing Ancient%20Britain.rar ---


INFO:__main__:--- Processing Ancient%20Britain.rar ---
Overall RAR Processing:   0%|          | 0/4824 [00:00<?, ?it/s]

2025-07-01 13:38:57,937 - INFO - Downloaded https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/2.%20Ancient%20e%20Classical/The%20Etruscans.rar to The_20Etruscans.rar


INFO:__main__:Downloaded https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/2.%20Ancient%20e%20Classical/The%20Etruscans.rar to The_20Etruscans.rar


2025-07-01 13:39:17,626 - INFO - Extracted The_20Etruscans.rar to The_20Etruscans


INFO:__main__:Extracted The_20Etruscans.rar to The_20Etruscans


2025-07-01 13:39:17,630 - ERROR - Skipping PDF processing for The%20Etruscans.rar due to extraction failure.


ERROR:__main__:Skipping PDF processing for The%20Etruscans.rar due to extraction failure.


2025-07-01 13:39:17,760 - INFO - --- Processing World%20Literature%20%26%20Myths.rar ---


INFO:__main__:--- Processing World%20Literature%20%26%20Myths.rar ---


2025-07-01 13:39:17,788 - INFO - World_20Literature_20_26_20Myths.rar already exists. Skipping download.


INFO:__main__:World_20Literature_20_26_20Myths.rar already exists. Skipping download.


2025-07-01 13:39:39,679 - INFO - Downloaded https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/2.%20Ancient%20e%20Classical/Ancient%20Africa.rar to Ancient_20Africa.rar


INFO:__main__:Downloaded https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/2.%20Ancient%20e%20Classical/Ancient%20Africa.rar to Ancient_20Africa.rar


2025-07-01 13:40:01,647 - INFO - Extracted Ancient_20Africa.rar to Ancient_20Africa


INFO:__main__:Extracted Ancient_20Africa.rar to Ancient_20Africa


2025-07-01 13:40:01,764 - INFO - Found 14 PDF files in extracted directory: Ancient_20Africa


INFO:__main__:Found 14 PDF files in extracted directory: Ancient_20Africa

Processing PDFs in Ancient_20Africa.rar:   0%|          | 0/14 [00:00<?, ?it/s][A

2025-07-01 13:40:01,830 - INFO - Processing PDF: Ancient_20Africa/Ancient Africa/Anna Leone - The End of the Pagan City. Religion, Economy, and Urbanism in Late Antique North Africa.pdf


INFO:__main__:Processing PDF: Ancient_20Africa/Ancient Africa/Anna Leone - The End of the Pagan City. Religion, Economy, and Urbanism in Late Antique North Africa.pdf


2025-07-01 13:40:16,030 - ERROR - Error extracting World_20Literature_20_26_20Myths.rar: 
Unexpected end of archive
World Literature & Myths/Jennifer R. March - Dictionary of Classical Mythology [Retail].pdf - checksum error
Unexpected end of archive


ERROR:__main__:Error extracting World_20Literature_20_26_20Myths.rar: 
Unexpected end of archive
World Literature & Myths/Jennifer R. March - Dictionary of Classical Mythology [Retail].pdf - checksum error
Unexpected end of archive


2025-07-01 13:40:16,032 - ERROR - Skipping PDF processing for World%20Literature%20%26%20Myths.rar due to extraction failure.


ERROR:__main__:Skipping PDF processing for World%20Literature%20%26%20Myths.rar due to extraction failure.


2025-07-01 13:40:16,035 - INFO - --- Processing Church%20and%20Theology%20in%20Middle%20Ages.rar ---


INFO:__main__:--- Processing Church%20and%20Theology%20in%20Middle%20Ages.rar ---


2025-07-01 13:40:16,043 - INFO - Church_20and_20Theology_20in_20Middle_20Ages.rar already exists. Skipping download.


INFO:__main__:Church_20and_20Theology_20in_20Middle_20Ages.rar already exists. Skipping download.


2025-07-01 13:41:19,667 - ERROR - Error extracting Church_20and_20Theology_20in_20Middle_20Ages.rar: 
Unexpected end of archive
Church and Theology in Middle Ages/F. Donald Logan - A History of the Church in the Middle Ages (2nd Edition) [Retail] (2).pdf - checksum error
Unexpected end of archive


ERROR:__main__:Error extracting Church_20and_20Theology_20in_20Middle_20Ages.rar: 
Unexpected end of archive
Church and Theology in Middle Ages/F. Donald Logan - A History of the Church in the Middle Ages (2nd Edition) [Retail] (2).pdf - checksum error
Unexpected end of archive


2025-07-01 13:41:19,671 - ERROR - Skipping PDF processing for Church%20and%20Theology%20in%20Middle%20Ages.rar due to extraction failure.


ERROR:__main__:Skipping PDF processing for Church%20and%20Theology%20in%20Middle%20Ages.rar due to extraction failure.


2025-07-01 13:41:19,674 - INFO - --- Processing Crusades.rar ---


INFO:__main__:--- Processing Crusades.rar ---


2025-07-01 13:41:19,676 - INFO - Crusades.rar already exists. Skipping download.


INFO:__main__:Crusades.rar already exists. Skipping download.


2025-07-01 13:41:29,475 - ERROR - Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/2.%20Ancient%20e%20Classical/Roman%20Empire%20%26%20History.rar: ('Connection broken: IncompleteRead(1719676591 bytes read, 18009109827 more expected)', IncompleteRead(1719676591 bytes read, 18009109827 more expected))


ERROR:__main__:Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/2.%20Ancient%20e%20Classical/Roman%20Empire%20%26%20History.rar: ('Connection broken: IncompleteRead(1719676591 bytes read, 18009109827 more expected)', IncompleteRead(1719676591 bytes read, 18009109827 more expected))


2025-07-01 13:41:33,418 - ERROR - Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/1.%20Prehistory/1.%20Prehistory.rar: ('Connection broken: IncompleteRead(981507759 bytes read, 19728188992 more expected)', IncompleteRead(981507759 bytes read, 19728188992 more expected))


ERROR:__main__:Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/1.%20Prehistory/1.%20Prehistory.rar: ('Connection broken: IncompleteRead(981507759 bytes read, 19728188992 more expected)', IncompleteRead(981507759 bytes read, 19728188992 more expected))


2025-07-01 13:41:33,503 - ERROR - Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/2.%20Ancient%20e%20Classical/Ancient%20Britain.rar: ('Connection broken: IncompleteRead(939564721 bytes read, 1686823080 more expected)', IncompleteRead(939564721 bytes read, 1686823080 more expected))


ERROR:__main__:Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/2.%20Ancient%20e%20Classical/Ancient%20Britain.rar: ('Connection broken: IncompleteRead(939564721 bytes read, 1686823080 more expected)', IncompleteRead(939564721 bytes read, 1686823080 more expected))


2025-07-01 13:41:33,916 - ERROR - Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/2.%20Ancient%20e%20Classical/World%20Literature%20%26%20Myths.rar: ('Connection broken: IncompleteRead(1568767665 bytes read, 1559578686 more expected)', IncompleteRead(1568767665 bytes read, 1559578686 more expected))


ERROR:__main__:Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/2.%20Ancient%20e%20Classical/World%20Literature%20%26%20Myths.rar: ('Connection broken: IncompleteRead(1568767665 bytes read, 1559578686 more expected)', IncompleteRead(1568767665 bytes read, 1559578686 more expected))


2025-07-01 13:41:33,939 - ERROR - Skipping extraction and processing for World%20Literature%20%26%20Myths.rar due to download failure.


ERROR:__main__:Skipping extraction and processing for World%20Literature%20%26%20Myths.rar due to download failure.


2025-07-01 13:41:33,962 - INFO - --- Processing Hanseatic%20League%20%26%20East%20India%20Company.rar ---


INFO:__main__:--- Processing Hanseatic%20League%20%26%20East%20India%20Company.rar ---


2025-07-01 13:41:33,994 - ERROR - Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/2.%20Ancient%20e%20Classical/Alexander%20the%20Great.rar: ('Connection broken: IncompleteRead(981507761 bytes read, 166399684 more expected)', IncompleteRead(981507761 bytes read, 166399684 more expected))


ERROR:__main__:Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/2.%20Ancient%20e%20Classical/Alexander%20the%20Great.rar: ('Connection broken: IncompleteRead(981507761 bytes read, 166399684 more expected)', IncompleteRead(981507761 bytes read, 166399684 more expected))


2025-07-01 13:41:34,730 - ERROR - Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/Church%20and%20Theology%20in%20Middle%20Ages.rar: ('Connection broken: IncompleteRead(1275109041 bytes read, 2188715028 more expected)', IncompleteRead(1275109041 bytes read, 2188715028 more expected))


ERROR:__main__:Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/Church%20and%20Theology%20in%20Middle%20Ages.rar: ('Connection broken: IncompleteRead(1275109041 bytes read, 2188715028 more expected)', IncompleteRead(1275109041 bytes read, 2188715028 more expected))


2025-07-01 13:41:34,752 - ERROR - Skipping extraction and processing for Church%20and%20Theology%20in%20Middle%20Ages.rar due to download failure.


ERROR:__main__:Skipping extraction and processing for Church%20and%20Theology%20in%20Middle%20Ages.rar due to download failure.


2025-07-01 13:41:34,764 - INFO - --- Processing History%20of%20Ships.rar ---


INFO:__main__:--- Processing History%20of%20Ships.rar ---


2025-07-01 13:41:38,430 - ERROR - Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/2.%20Ancient%20e%20Classical/Ancient%20Greece.rar: ('Connection broken: IncompleteRead(3888021167 bytes read, 13324521886 more expected)', IncompleteRead(3888021167 bytes read, 13324521886 more expected))


ERROR:__main__:Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/2.%20Ancient%20e%20Classical/Ancient%20Greece.rar: ('Connection broken: IncompleteRead(3888021167 bytes read, 13324521886 more expected)', IncompleteRead(3888021167 bytes read, 13324521886 more expected))


2025-07-01 13:41:38,437 - ERROR - Skipping extraction and processing for Ancient%20Greece.rar due to download failure.


ERROR:__main__:Skipping extraction and processing for Ancient%20Greece.rar due to download failure.


2025-07-01 13:41:38,445 - INFO - --- Processing Knights%20%26%20Chivalry.rar ---


INFO:__main__:--- Processing Knights%20%26%20Chivalry.rar ---


2025-07-01 13:41:58,229 - ERROR - Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/History%20of%20Ships.rar: ('Connection broken: IncompleteRead(376497 bytes read, 3054386406 more expected)', IncompleteRead(376497 bytes read, 3054386406 more expected))


ERROR:__main__:Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/History%20of%20Ships.rar: ('Connection broken: IncompleteRead(376497 bytes read, 3054386406 more expected)', IncompleteRead(376497 bytes read, 3054386406 more expected))


2025-07-01 13:41:58,253 - ERROR - Skipping extraction and processing for History%20of%20Ships.rar due to download failure.


ERROR:__main__:Skipping extraction and processing for History%20of%20Ships.rar due to download failure.


2025-07-01 13:41:58,330 - INFO - --- Processing Medieval%20Architecture.rar ---


INFO:__main__:--- Processing Medieval%20Architecture.rar ---


2025-07-01 13:42:43,166 - ERROR - Error extracting 1._20Prehistory.rar: 
Unexpected end of archive
1. Prehistory/Archaeology/Archaeological Studies/Brian M. Fagan, Nadia Durrani - In the Beginning. An Introduction to Archaeology (14th Edition) (Retail).epub - checksum error
Unexpected end of archive


ERROR:__main__:Error extracting 1._20Prehistory.rar: 
Unexpected end of archive
1. Prehistory/Archaeology/Archaeological Studies/Brian M. Fagan, Nadia Durrani - In the Beginning. An Introduction to Archaeology (14th Edition) (Retail).epub - checksum error
Unexpected end of archive


2025-07-01 13:42:43,400 - INFO - Found 31 PDF files in extracted directory: 1._20Prehistory


INFO:__main__:Found 31 PDF files in extracted directory: 1._20Prehistory


Processing PDFs in 1._20Prehistory.rar:   0%|          | 0/31 [00:00<?, ?it/s][A[A

2025-07-01 13:42:43,427 - INFO - Processing PDF: 1._20Prehistory/1. Prehistory/America/Michelle Hayward, Lesley-Gail Atkinson, Michael A. Cinquino - Rock Art of the Caribbean (Caribbean Archaeology and Ethnohistory) [Retail].pdf


INFO:__main__:Processing PDF: 1._20Prehistory/1. Prehistory/America/Michelle Hayward, Lesley-Gail Atkinson, Michael A. Cinquino - Rock Art of the Caribbean (Caribbean Archaeology and Ethnohistory) [Retail].pdf


2025-07-01 13:42:43,537 - ERROR - Error extracting Ancient_20Britain.rar: 
Unexpected end of archive
Ancient Britain/John Sheehan, Donnchadh Ó Corráin - The Viking Age. Ireland and the West (Retail).pdf - checksum error
Unexpected end of archive


ERROR:__main__:Error extracting Ancient_20Britain.rar: 
Unexpected end of archive
Ancient Britain/John Sheehan, Donnchadh Ó Corráin - The Viking Age. Ireland and the West (Retail).pdf - checksum error
Unexpected end of archive


2025-07-01 13:42:43,755 - INFO - Found 38 PDF files in extracted directory: Ancient_20Britain


INFO:__main__:Found 38 PDF files in extracted directory: Ancient_20Britain



Processing PDFs in Ancient_20Britain.rar:   0%|          | 0/38 [00:00<?, ?it/s][A[A[A

2025-07-01 13:42:43,773 - INFO - Processing PDF: Ancient_20Britain/Ancient Britain/Francesca Kaminski-Jones, Rhys Kaminski-Jones - Celts, Romans, Britons. Classical and Celtic Influence in the Construction of British Identities (Classical Presences) (Retail).pdf


INFO:__main__:Processing PDF: Ancient_20Britain/Ancient Britain/Francesca Kaminski-Jones, Rhys Kaminski-Jones - Celts, Romans, Britons. Classical and Celtic Influence in the Construction of British Identities (Classical Presences) (Retail).pdf


2025-07-01 13:42:53,576 - ERROR - Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/2.%20Ancient%20e%20Classical/Religion%2C%20History%20of%20Religion.rar: ('Connection broken: IncompleteRead(1896038065 bytes read, 2019845764 more expected)', IncompleteRead(1896038065 bytes read, 2019845764 more expected))


ERROR:__main__:Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/2.%20Ancient%20e%20Classical/Religion%2C%20History%20of%20Religion.rar: ('Connection broken: IncompleteRead(1896038065 bytes read, 2019845764 more expected)', IncompleteRead(1896038065 bytes read, 2019845764 more expected))
Overall RAR Processing:   0%|          | 0/4824 [03:56<?, ?it/s]

2025-07-01 13:42:54,036 - ERROR - Error extracting Alexander_20the_20Great.rar: 
Unexpected end of archive



ERROR:__main__:Error extracting Alexander_20the_20Great.rar: 
Unexpected end of archive


2025-07-01 13:42:54,047 - ERROR - Error extracting Religion_2C_20History_20of_20Religion.rar: 
Unexpected end of archive


ERROR:__main__:Error extracting Religion_2C_20History_20of_20Religion.rar: 
Unexpected end of archive


2025-07-01 13:42:54,053 - ERROR - Skipping PDF processing for Religion%2C%20History%20of%20Religion.rar due to extraction failure.


ERROR:__main__:Skipping PDF processing for Religion%2C%20History%20of%20Religion.rar due to extraction failure.


2025-07-01 13:42:54,175 - ERROR - Error extracting Roman_20Empire_20_26_20History.rar: 
Unexpected end of archive


ERROR:__main__:Error extracting Roman_20Empire_20_26_20History.rar: 
Unexpected end of archive


2025-07-01 13:42:54,187 - ERROR - Skipping PDF processing for Roman%20Empire%20%26%20History.rar due to extraction failure.


ERROR:__main__:Skipping PDF processing for Roman%20Empire%20%26%20History.rar due to extraction failure.


2025-07-01 13:42:54,316 - INFO - --- Processing Hanseatic%20League%20%26%20East%20India%20Company.rar ---


INFO:__main__:--- Processing Hanseatic%20League%20%26%20East%20India%20Company.rar ---


2025-07-01 13:42:54,323 - INFO - Hanseatic_20League_20_26_20East_20India_20Company.rar already exists. Skipping download.


INFO:__main__:Hanseatic_20League_20_26_20East_20India_20Company.rar already exists. Skipping download.


2025-07-01 13:42:54,337 - ERROR - Error extracting Crusades.rar: 
Unexpected end of archive


ERROR:__main__:Error extracting Crusades.rar: 
Unexpected end of archive


2025-07-01 13:42:54,343 - ERROR - Skipping PDF processing for Crusades.rar due to extraction failure.


ERROR:__main__:Skipping PDF processing for Crusades.rar due to extraction failure.


2025-07-01 13:42:54,349 - INFO - --- Processing History%20of%20Ships.rar ---


INFO:__main__:--- Processing History%20of%20Ships.rar ---


2025-07-01 13:42:54,361 - INFO - History_20of_20Ships.rar already exists. Skipping download.


INFO:__main__:History_20of_20Ships.rar already exists. Skipping download.


2025-07-01 13:42:54,362 - INFO - Found 44 PDF files in extracted directory: Alexander_20the_20Great


INFO:__main__:Found 44 PDF files in extracted directory: Alexander_20the_20Great
Processing PDFs in Alexander_20the_20Great.rar:   0%|          | 0/44 [00:00<?, ?it/s]

2025-07-01 13:42:54,400 - INFO - Processing PDF: Alexander_20the_20Great/Alexander the Great/Bob Bennett, Mike Roberts - The Wars of Alexander's Successors, 323 – 281 BC. Volume 2 Battles and Tactics (Retail).pdf


INFO:__main__:Processing PDF: Alexander_20the_20Great/Alexander the Great/Bob Bennett, Mike Roberts - The Wars of Alexander's Successors, 323 – 281 BC. Volume 2 Battles and Tactics (Retail).pdf


2025-07-01 13:42:54,433 - ERROR - Error extracting History_20of_20Ships.rar: 
Unexpected end of archive
History of Ships/A. J. Hoving - Nicolaes Witsen and Shipbuilding in the Dutch Golden Age.pdf - checksum error
Unexpected end of archive


ERROR:__main__:Error extracting History_20of_20Ships.rar: 
Unexpected end of archive
History of Ships/A. J. Hoving - Nicolaes Witsen and Shipbuilding in the Dutch Golden Age.pdf - checksum error
Unexpected end of archive


2025-07-01 13:42:54,443 - ERROR - Skipping PDF processing for History%20of%20Ships.rar due to extraction failure.


ERROR:__main__:Skipping PDF processing for History%20of%20Ships.rar due to extraction failure.


2025-07-01 13:42:54,453 - INFO - --- Processing Knights%20%26%20Chivalry.rar ---


INFO:__main__:--- Processing Knights%20%26%20Chivalry.rar ---


2025-07-01 13:42:54,466 - INFO - Knights_20_26_20Chivalry.rar already exists. Skipping download.


INFO:__main__:Knights_20_26_20Chivalry.rar already exists. Skipping download.


2025-07-01 13:42:54,529 - INFO - --- Processing Medieval%20Architecture.rar ---


INFO:__main__:--- Processing Medieval%20Architecture.rar ---


2025-07-01 13:42:54,537 - INFO - Medieval_20Architecture.rar already exists. Skipping download.


INFO:__main__:Medieval_20Architecture.rar already exists. Skipping download.


2025-07-01 13:42:55,319 - ERROR - Error processing 1._20Prehistory/1. Prehistory/America/Michelle Hayward, Lesley-Gail Atkinson, Michael A. Cinquino - Rock Art of the Caribbean (Caribbean Archaeology and Ethnohistory) [Retail].pdf with Nougat: Traceback (most recent call last):
  File "/usr/local/bin/nougat", line 5, in <module>
    from predict import main
  File "/usr/local/lib/python3.11/dist-packages/predict.py", line 15, in <module>
    import torch
  File "/usr/local/lib/python3.11/dist-packages/torch/__init__.py", line 2253, in <module>
    from torch import masked as masked
  File "/usr/local/lib/python3.11/dist-packages/torch/masked/__init__.py", line 1, in <module>
    from torch.masked._ops import (
  File "/usr/local/lib/python3.11/dist-packages/torch/masked/_ops.py", line 10, in <module>
    from torch.masked.maskedtensor.core import is_masked_tensor, MaskedTensor
  File "/usr/local/lib/python3.11/dist-packages/torch/masked/maskedtensor/__init__.py", line 4, in <module>
  

ERROR:__main__:Error processing 1._20Prehistory/1. Prehistory/America/Michelle Hayward, Lesley-Gail Atkinson, Michael A. Cinquino - Rock Art of the Caribbean (Caribbean Archaeology and Ethnohistory) [Retail].pdf with Nougat: Traceback (most recent call last):
  File "/usr/local/bin/nougat", line 5, in <module>
    from predict import main
  File "/usr/local/lib/python3.11/dist-packages/predict.py", line 15, in <module>
    import torch
  File "/usr/local/lib/python3.11/dist-packages/torch/__init__.py", line 2253, in <module>
    from torch import masked as masked
  File "/usr/local/lib/python3.11/dist-packages/torch/masked/__init__.py", line 1, in <module>
    from torch.masked._ops import (
  File "/usr/local/lib/python3.11/dist-packages/torch/masked/_ops.py", line 10, in <module>
    from torch.masked.maskedtensor.core import is_masked_tensor, MaskedTensor
  File "/usr/local/lib/python3.11/dist-packages/torch/masked/maskedtensor/__init__.py", line 4, in <module>
    from .binary impo

2025-07-01 13:42:55,321 - ERROR - Nougat processing failed for 1._20Prehistory/1. Prehistory/America/Michelle Hayward, Lesley-Gail Atkinson, Michael A. Cinquino - Rock Art of the Caribbean (Caribbean Archaeology and Ethnohistory) [Retail].pdf. Skipping cleaning, chunking, and upload.


ERROR:__main__:Nougat processing failed for 1._20Prehistory/1. Prehistory/America/Michelle Hayward, Lesley-Gail Atkinson, Michael A. Cinquino - Rock Art of the Caribbean (Caribbean Archaeology and Ethnohistory) [Retail].pdf. Skipping cleaning, chunking, and upload.


Processing PDFs in 1._20Prehistory.rar:   3%|▎         | 1/31 [00:11<05:57, 11.91s/it][A[A

2025-07-01 13:42:55,343 - INFO - Processing PDF: 1._20Prehistory/1. Prehistory/America/Bradley J. Vierra - The Late Archaic across the Borderlands From Foraging to Farming.pdf


INFO:__main__:Processing PDF: 1._20Prehistory/1. Prehistory/America/Bradley J. Vierra - The Late Archaic across the Borderlands From Foraging to Farming.pdf


2025-07-01 13:42:55,643 - ERROR - Error processing Ancient_20Britain/Ancient Britain/Francesca Kaminski-Jones, Rhys Kaminski-Jones - Celts, Romans, Britons. Classical and Celtic Influence in the Construction of British Identities (Classical Presences) (Retail).pdf with Nougat: Traceback (most recent call last):
  File "/usr/local/bin/nougat", line 5, in <module>
    from predict import main
  File "/usr/local/lib/python3.11/dist-packages/predict.py", line 15, in <module>
    import torch
  File "/usr/local/lib/python3.11/dist-packages/torch/__init__.py", line 2253, in <module>
    from torch import masked as masked
  File "/usr/local/lib/python3.11/dist-packages/torch/masked/__init__.py", line 1, in <module>
    from torch.masked._ops import (
  File "/usr/local/lib/python3.11/dist-packages/torch/masked/_ops.py", line 10, in <module>
    from torch.masked.maskedtensor.core import is_masked_tensor, MaskedTensor
  File "/usr/local/lib/python3.11/dist-packages/torch/masked/maskedtensor/__

ERROR:__main__:Error processing Ancient_20Britain/Ancient Britain/Francesca Kaminski-Jones, Rhys Kaminski-Jones - Celts, Romans, Britons. Classical and Celtic Influence in the Construction of British Identities (Classical Presences) (Retail).pdf with Nougat: Traceback (most recent call last):
  File "/usr/local/bin/nougat", line 5, in <module>
    from predict import main
  File "/usr/local/lib/python3.11/dist-packages/predict.py", line 15, in <module>
    import torch
  File "/usr/local/lib/python3.11/dist-packages/torch/__init__.py", line 2253, in <module>
    from torch import masked as masked
  File "/usr/local/lib/python3.11/dist-packages/torch/masked/__init__.py", line 1, in <module>
    from torch.masked._ops import (
  File "/usr/local/lib/python3.11/dist-packages/torch/masked/_ops.py", line 10, in <module>
    from torch.masked.maskedtensor.core import is_masked_tensor, MaskedTensor
  File "/usr/local/lib/python3.11/dist-packages/torch/masked/maskedtensor/__init__.py", line 4,

2025-07-01 13:42:55,645 - ERROR - Nougat processing failed for Ancient_20Britain/Ancient Britain/Francesca Kaminski-Jones, Rhys Kaminski-Jones - Celts, Romans, Britons. Classical and Celtic Influence in the Construction of British Identities (Classical Presences) (Retail).pdf. Skipping cleaning, chunking, and upload.


ERROR:__main__:Nougat processing failed for Ancient_20Britain/Ancient Britain/Francesca Kaminski-Jones, Rhys Kaminski-Jones - Celts, Romans, Britons. Classical and Celtic Influence in the Construction of British Identities (Classical Presences) (Retail).pdf. Skipping cleaning, chunking, and upload.



Processing PDFs in Ancient_20Britain.rar:   3%|▎         | 1/38 [00:11<07:19, 11.88s/it][A[A[A

2025-07-01 13:42:55,677 - INFO - Processing PDF: Ancient_20Britain/Ancient Britain/Heather O'Donoghue - English Poetry and Old Norse Myth. A History [Retail].pdf


INFO:__main__:Processing PDF: Ancient_20Britain/Ancient Britain/Heather O'Donoghue - English Poetry and Old Norse Myth. A History [Retail].pdf


2025-07-01 13:43:02,973 - ERROR - Error extracting Knights_20_26_20Chivalry.rar: 
Unexpected end of archive
Knights & Chivalry/Maurice Keen - Chivalry.pdf - checksum error
Unexpected end of archive


ERROR:__main__:Error extracting Knights_20_26_20Chivalry.rar: 
Unexpected end of archive
Knights & Chivalry/Maurice Keen - Chivalry.pdf - checksum error
Unexpected end of archive


2025-07-01 13:43:02,979 - ERROR - Skipping PDF processing for Knights%20%26%20Chivalry.rar due to extraction failure.


ERROR:__main__:Skipping PDF processing for Knights%20%26%20Chivalry.rar due to extraction failure.


2025-07-01 13:43:02,983 - INFO - --- Processing Medieval%20Literature.rar ---


INFO:__main__:--- Processing Medieval%20Literature.rar ---


2025-07-01 13:43:03,828 - ERROR - Error extracting Medieval_20Architecture.rar: 
Unexpected end of archive
Medieval Architecture/Arleen Pabón-Charneco - Architecture History, Theory and Preservation Prehistory to the Middle Ages (Retail).pdf - checksum error
Unexpected end of archive


ERROR:__main__:Error extracting Medieval_20Architecture.rar: 
Unexpected end of archive
Medieval Architecture/Arleen Pabón-Charneco - Architecture History, Theory and Preservation Prehistory to the Middle Ages (Retail).pdf - checksum error
Unexpected end of archive


2025-07-01 13:43:03,829 - ERROR - Skipping PDF processing for Medieval%20Architecture.rar due to extraction failure.


ERROR:__main__:Skipping PDF processing for Medieval%20Architecture.rar due to extraction failure.


2025-07-01 13:43:03,831 - INFO - --- Processing Medieval%20People.rar ---


INFO:__main__:--- Processing Medieval%20People.rar ---


2025-07-01 13:43:06,285 - ERROR - Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/2.%20Ancient%20e%20Classical/Miscellaneous.rar: ('Connection broken: IncompleteRead(1887477424 bytes read, 7791888393 more expected)', IncompleteRead(1887477424 bytes read, 7791888393 more expected))


ERROR:__main__:Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/2.%20Ancient%20e%20Classical/Miscellaneous.rar: ('Connection broken: IncompleteRead(1887477424 bytes read, 7791888393 more expected)', IncompleteRead(1887477424 bytes read, 7791888393 more expected))


2025-07-01 13:43:06,452 - ERROR - Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/Crusades.rar: ('Connection broken: IncompleteRead(1333800625 bytes read, 2290090166 more expected)', IncompleteRead(1333800625 bytes read, 2290090166 more expected))


ERROR:__main__:Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/Crusades.rar: ('Connection broken: IncompleteRead(1333800625 bytes read, 2290090166 more expected)', IncompleteRead(1333800625 bytes read, 2290090166 more expected))


2025-07-01 13:43:06,466 - ERROR - Skipping extraction and processing for Crusades.rar due to download failure.


ERROR:__main__:Skipping extraction and processing for Crusades.rar due to download failure.


2025-07-01 13:43:06,496 - INFO - --- Processing Medieval%20Literature.rar ---


INFO:__main__:--- Processing Medieval%20Literature.rar ---


  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]

  0%|          | 0/342 [00:00<?, ?it/s]
  0%|          | 0/342 [01:47<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/bin/nougat", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/predict.py", line 167, in main
    model_output = model.inference(
                   ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/nougat/model.py", line 592, in inference
    decoder_output = self.decoder.model.generate(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/transformers/generation/utils.py", line 1544, in generate
    return self.greedy_search(
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-

  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]

  0%|          | 0/342 [00:00<?, ?it/s]
  0%|          | 0/342 [01:47<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/bin/nougat", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/predict.py", line 167, in main
    model_output = model.inference(
                   ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/nougat/model.py", line 592, in inference
    decoder_output = self.decoder.model.generate(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/transformers/generation/utils.py", line 1544, in generate
    return self.greedy_search(
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-

2025-07-01 13:43:09,396 - ERROR - Nougat processing failed for Ancient_20Africa/Ancient Africa/Anna Leone - The End of the Pagan City. Religion, Economy, and Urbanism in Late Antique North Africa.pdf. Skipping cleaning, chunking, and upload.


ERROR:__main__:Nougat processing failed for Ancient_20Africa/Ancient Africa/Anna Leone - The End of the Pagan City. Religion, Economy, and Urbanism in Late Antique North Africa.pdf. Skipping cleaning, chunking, and upload.

Processing PDFs in Ancient_20Africa.rar:   7%|▋         | 1/14 [03:07<40:38, 187.57s/it][A

2025-07-01 13:43:09,429 - INFO - Processing PDF: Ancient_20Africa/Ancient Africa/Peter L. Shinnie - Ancient Nubia [Retail].pdf


INFO:__main__:Processing PDF: Ancient_20Africa/Ancient Africa/Peter L. Shinnie - Ancient Nubia [Retail].pdf


2025-07-01 13:43:11,678 - ERROR - Error extracting Hanseatic_20League_20_26_20East_20India_20Company.rar: 
Unexpected end of archive
Hanseatic League & East India Company/John Man - Marco Polo. The Journey that Changed the World [Retail].epub - checksum error
Unexpected end of archive


ERROR:__main__:Error extracting Hanseatic_20League_20_26_20East_20India_20Company.rar: 
Unexpected end of archive
Hanseatic League & East India Company/John Man - Marco Polo. The Journey that Changed the World [Retail].epub - checksum error
Unexpected end of archive


2025-07-01 13:43:11,681 - ERROR - Skipping PDF processing for Hanseatic%20League%20%26%20East%20India%20Company.rar due to extraction failure.


ERROR:__main__:Skipping PDF processing for Hanseatic%20League%20%26%20East%20India%20Company.rar due to extraction failure.


2025-07-01 13:43:11,689 - INFO - --- Processing Medieval%20Society%20and%20Everyday%20Life.rar ---


INFO:__main__:--- Processing Medieval%20Society%20and%20Everyday%20Life.rar ---


2025-07-01 13:44:04,480 - INFO - Downloaded https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/Hanseatic%20League%20%26%20East%20India%20Company.rar to Hanseatic_20League_20_26_20East_20India_20Company.rar


INFO:__main__:Downloaded https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/Hanseatic%20League%20%26%20East%20India%20Company.rar to Hanseatic_20League_20_26_20East_20India_20Company.rar


2025-07-01 13:44:04,502 - ERROR - Skipping extraction and processing for Hanseatic%20League%20%26%20East%20India%20Company.rar due to download failure.


ERROR:__main__:Skipping extraction and processing for Hanseatic%20League%20%26%20East%20India%20Company.rar due to download failure.


2025-07-01 13:44:04,530 - INFO - --- Processing Medieval%20People.rar ---


INFO:__main__:--- Processing Medieval%20People.rar ---


2025-07-01 13:44:04,570 - INFO - Medieval_20People.rar already exists. Skipping download.


INFO:__main__:Medieval_20People.rar already exists. Skipping download.


2025-07-01 13:44:04,587 - ERROR - Skipping extraction and processing for Medieval%20People.rar due to download failure.


ERROR:__main__:Skipping extraction and processing for Medieval%20People.rar due to download failure.


2025-07-01 13:44:04,598 - INFO - --- Processing Medieval%20Society%20and%20Everyday%20Life.rar ---


INFO:__main__:--- Processing Medieval%20Society%20and%20Everyday%20Life.rar ---


2025-07-01 13:44:04,623 - INFO - Medieval_20Society_20and_20Everyday_20Life.rar already exists. Skipping download.


INFO:__main__:Medieval_20Society_20and_20Everyday_20Life.rar already exists. Skipping download.


2025-07-01 13:44:04,638 - ERROR - Skipping extraction and processing for Medieval%20Society%20and%20Everyday%20Life.rar due to download failure.


ERROR:__main__:Skipping extraction and processing for Medieval%20Society%20and%20Everyday%20Life.rar due to download failure.


2025-07-01 13:44:04,646 - INFO - --- Processing Miscellaneous.rar ---


INFO:__main__:--- Processing Miscellaneous.rar ---


2025-07-01 13:44:04,686 - INFO - Miscellaneous.rar already exists. Skipping download.


INFO:__main__:Miscellaneous.rar already exists. Skipping download.


2025-07-01 13:44:04,692 - ERROR - Skipping extraction and processing for Miscellaneous.rar due to download failure.


ERROR:__main__:Skipping extraction and processing for Miscellaneous.rar due to download failure.


2025-07-01 13:44:04,705 - INFO - --- Processing Renaissance%20and%20Enlightenment.rar ---


INFO:__main__:--- Processing Renaissance%20and%20Enlightenment.rar ---


2025-07-01 13:44:05,411 - ERROR - Error extracting Miscellaneous.rar: 
Unexpected end of archive
Miscellaneous/DK - Civilization. A History of the World in 1000 Objects (2nd Edition) (UK Edition) [Retail].pdf - checksum error
Unexpected end of archive


ERROR:__main__:Error extracting Miscellaneous.rar: 
Unexpected end of archive
Miscellaneous/DK - Civilization. A History of the World in 1000 Objects (2nd Edition) (UK Edition) [Retail].pdf - checksum error
Unexpected end of archive


2025-07-01 13:44:05,419 - ERROR - Skipping PDF processing for Miscellaneous.rar due to extraction failure.


ERROR:__main__:Skipping PDF processing for Miscellaneous.rar due to extraction failure.


2025-07-01 13:44:05,837 - INFO - --- Processing Miscellaneous.rar ---


INFO:__main__:--- Processing Miscellaneous.rar ---


2025-07-01 13:44:19,586 - INFO - Downloaded https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/Medieval%20People.rar to Medieval_20People.rar


INFO:__main__:Downloaded https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/Medieval%20People.rar to Medieval_20People.rar


2025-07-01 13:44:19,666 - INFO - Downloaded https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/Knights%20%26%20Chivalry.rar to Knights_20_26_20Chivalry.rar


INFO:__main__:Downloaded https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/Knights%20%26%20Chivalry.rar to Knights_20_26_20Chivalry.rar


2025-07-01 13:44:19,698 - ERROR - Skipping extraction and processing for Knights%20%26%20Chivalry.rar due to download failure.


ERROR:__main__:Skipping extraction and processing for Knights%20%26%20Chivalry.rar due to download failure.


2025-07-01 13:44:19,723 - INFO - --- Processing The%20Age%20of%20Discovery.rar ---


INFO:__main__:--- Processing The%20Age%20of%20Discovery.rar ---


2025-07-01 13:44:38,524 - INFO - Extracted Medieval_20People.rar to Medieval_20People


INFO:__main__:Extracted Medieval_20People.rar to Medieval_20People


2025-07-01 13:44:38,551 - ERROR - Skipping PDF processing for Medieval%20People.rar due to extraction failure.


ERROR:__main__:Skipping PDF processing for Medieval%20People.rar due to extraction failure.


2025-07-01 13:44:38,716 - INFO - --- Processing Renaissance%20and%20Enlightenment.rar ---


INFO:__main__:--- Processing Renaissance%20and%20Enlightenment.rar ---


2025-07-01 13:44:38,754 - INFO - Renaissance_20and_20Enlightenment.rar already exists. Skipping download.


INFO:__main__:Renaissance_20and_20Enlightenment.rar already exists. Skipping download.


2025-07-01 13:44:59,455 - ERROR - Error extracting Renaissance_20and_20Enlightenment.rar: 
Unexpected end of archive
Renaissance and Enlightenment/Giuseppe Marcocci - The Globe on Paper. Writing Histories of the World in Renaissance Europe and the Americas (Retail).pdf - checksum error
Unexpected end of archive


ERROR:__main__:Error extracting Renaissance_20and_20Enlightenment.rar: 
Unexpected end of archive
Renaissance and Enlightenment/Giuseppe Marcocci - The Globe on Paper. Writing Histories of the World in Renaissance Europe and the Americas (Retail).pdf - checksum error
Unexpected end of archive


2025-07-01 13:44:59,462 - ERROR - Skipping PDF processing for Renaissance%20and%20Enlightenment.rar due to extraction failure.


ERROR:__main__:Skipping PDF processing for Renaissance%20and%20Enlightenment.rar due to extraction failure.


2025-07-01 13:44:59,469 - INFO - --- Processing The%20Age%20of%20Discovery.rar ---


INFO:__main__:--- Processing The%20Age%20of%20Discovery.rar ---


2025-07-01 13:44:59,472 - INFO - The_20Age_20of_20Discovery.rar already exists. Skipping download.


INFO:__main__:The_20Age_20of_20Discovery.rar already exists. Skipping download.


2025-07-01 13:45:03,897 - ERROR - Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/The%20Age%20of%20Discovery.rar: ('Connection broken: IncompleteRead(184590002 bytes read, 552106546 more expected)', IncompleteRead(184590002 bytes read, 552106546 more expected))


ERROR:__main__:Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/The%20Age%20of%20Discovery.rar: ('Connection broken: IncompleteRead(184590002 bytes read, 552106546 more expected)', IncompleteRead(184590002 bytes read, 552106546 more expected))


2025-07-01 13:45:03,901 - ERROR - Skipping extraction and processing for The%20Age%20of%20Discovery.rar due to download failure.


ERROR:__main__:Skipping extraction and processing for The%20Age%20of%20Discovery.rar due to download failure.


2025-07-01 13:45:03,904 - INFO - --- Processing The%20Early%20Middle%20Ages%20%28400-800%29.rar ---


INFO:__main__:--- Processing The%20Early%20Middle%20Ages%20%28400-800%29.rar ---


2025-07-01 13:45:03,999 - ERROR - Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/Miscellaneous.rar: ('Connection broken: IncompleteRead(310419121 bytes read, 2616774376 more expected)', IncompleteRead(310419121 bytes read, 2616774376 more expected))


ERROR:__main__:Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/Miscellaneous.rar: ('Connection broken: IncompleteRead(310419121 bytes read, 2616774376 more expected)', IncompleteRead(310419121 bytes read, 2616774376 more expected))


2025-07-01 13:45:04,111 - ERROR - Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/Medieval%20Architecture.rar: ('Connection broken: IncompleteRead(796958385 bytes read, 2405511168 more expected)', IncompleteRead(796958385 bytes read, 2405511168 more expected))


ERROR:__main__:Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/Medieval%20Architecture.rar: ('Connection broken: IncompleteRead(796958385 bytes read, 2405511168 more expected)', IncompleteRead(796958385 bytes read, 2405511168 more expected))


2025-07-01 13:45:04,116 - ERROR - Skipping extraction and processing for Medieval%20Architecture.rar due to download failure.


ERROR:__main__:Skipping extraction and processing for Medieval%20Architecture.rar due to download failure.


2025-07-01 13:45:04,118 - INFO - --- Processing The%20High%20Middle%20Ages%20%281000-1300%29.rar ---


INFO:__main__:--- Processing The%20High%20Middle%20Ages%20%281000-1300%29.rar ---


2025-07-01 13:45:05,874 - ERROR - Error extracting The_20Age_20of_20Discovery.rar: 
Unexpected end of archive
The Age of Discovery/Jerry H. Bentley, Renate Bridenthal, Kären Wigen - Seascapes. Maritime Histories, Littoral Cultures, and Transoceanic Exchanges (Retail).pdf - checksum error
Unexpected end of archive


ERROR:__main__:Error extracting The_20Age_20of_20Discovery.rar: 
Unexpected end of archive
The Age of Discovery/Jerry H. Bentley, Renate Bridenthal, Kären Wigen - Seascapes. Maritime Histories, Littoral Cultures, and Transoceanic Exchanges (Retail).pdf - checksum error
Unexpected end of archive


2025-07-01 13:45:05,881 - ERROR - Skipping PDF processing for The%20Age%20of%20Discovery.rar due to extraction failure.


ERROR:__main__:Skipping PDF processing for The%20Age%20of%20Discovery.rar due to extraction failure.


2025-07-01 13:45:05,912 - INFO - --- Processing The%20Early%20Middle%20Ages%20%28400-800%29.rar ---


INFO:__main__:--- Processing The%20Early%20Middle%20Ages%20%28400-800%29.rar ---


2025-07-01 13:45:10,199 - ERROR - Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/Medieval%20Literature.rar: ('Connection broken: IncompleteRead(679517872 bytes read, 5278550621 more expected)', IncompleteRead(679517872 bytes read, 5278550621 more expected))


ERROR:__main__:Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/Medieval%20Literature.rar: ('Connection broken: IncompleteRead(679517872 bytes read, 5278550621 more expected)', IncompleteRead(679517872 bytes read, 5278550621 more expected))


2025-07-01 13:45:10,382 - ERROR - Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/Renaissance%20and%20Enlightenment.rar: ('Connection broken: IncompleteRead(293813937 bytes read, 869303860 more expected)', IncompleteRead(293813937 bytes read, 869303860 more expected))


ERROR:__main__:Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/Renaissance%20and%20Enlightenment.rar: ('Connection broken: IncompleteRead(293813937 bytes read, 869303860 more expected)', IncompleteRead(293813937 bytes read, 869303860 more expected))


2025-07-01 13:45:10,393 - ERROR - Skipping extraction and processing for Renaissance%20and%20Enlightenment.rar due to download failure.


ERROR:__main__:Skipping extraction and processing for Renaissance%20and%20Enlightenment.rar due to download failure.


2025-07-01 13:45:10,404 - INFO - --- Processing The%20Late%20Middle%20Ages%20%281300-1600%29.rar ---


INFO:__main__:--- Processing The%20Late%20Middle%20Ages%20%281300-1600%29.rar ---


2025-07-01 13:45:27,204 - ERROR - Error extracting Miscellaneous.rar: 
Unexpected end of archive
Miscellaneous/Christopher de Hamel - Meetings with Remarkable Manuscripts.epub - checksum error
Unexpected end of archive


ERROR:__main__:Error extracting Miscellaneous.rar: 
Unexpected end of archive
Miscellaneous/Christopher de Hamel - Meetings with Remarkable Manuscripts.epub - checksum error
Unexpected end of archive


2025-07-01 13:45:27,214 - ERROR - Skipping PDF processing for Miscellaneous.rar due to extraction failure.


ERROR:__main__:Skipping PDF processing for Miscellaneous.rar due to extraction failure.


2025-07-01 13:45:27,295 - INFO - --- Processing The%20High%20Middle%20Ages%20%281000-1300%29.rar ---


INFO:__main__:--- Processing The%20High%20Middle%20Ages%20%281000-1300%29.rar ---


2025-07-01 13:45:27,298 - INFO - The_20High_20Middle_20Ages_20_281000-1300_29.rar already exists. Skipping download.


INFO:__main__:The_20High_20Middle_20Ages_20_281000-1300_29.rar already exists. Skipping download.


2025-07-01 13:45:48,437 - ERROR - Error extracting The_20High_20Middle_20Ages_20_281000-1300_29.rar: 
Unexpected end of archive
The High Middle Ages (1000-1300)/James Westfall Thompson, Edgar Nathaniel Johnson - An Introduction to Medieval Europe 300-1500.pdf - checksum error
Unexpected end of archive


ERROR:__main__:Error extracting The_20High_20Middle_20Ages_20_281000-1300_29.rar: 
Unexpected end of archive
The High Middle Ages (1000-1300)/James Westfall Thompson, Edgar Nathaniel Johnson - An Introduction to Medieval Europe 300-1500.pdf - checksum error
Unexpected end of archive


2025-07-01 13:45:48,447 - ERROR - Skipping PDF processing for The%20High%20Middle%20Ages%20%281000-1300%29.rar due to extraction failure.


ERROR:__main__:Skipping PDF processing for The%20High%20Middle%20Ages%20%281000-1300%29.rar due to extraction failure.


2025-07-01 13:45:48,456 - INFO - --- Processing The%20Late%20Middle%20Ages%20%281300-1600%29.rar ---


INFO:__main__:--- Processing The%20Late%20Middle%20Ages%20%281300-1600%29.rar ---


2025-07-01 13:45:48,459 - INFO - The_20Late_20Middle_20Ages_20_281300-1600_29.rar already exists. Skipping download.


INFO:__main__:The_20Late_20Middle_20Ages_20_281300-1600_29.rar already exists. Skipping download.


2025-07-01 13:46:08,261 - ERROR - Error extracting The_20Late_20Middle_20Ages_20_281300-1600_29.rar: 
Unexpected end of archive
The Late Middle Ages (1300-1600)/Barbara W. Tuchman - A Distant Mirror The Calamitous 14th Century (Updated Edition) [Retail].epub - checksum error
Unexpected end of archive


ERROR:__main__:Error extracting The_20Late_20Middle_20Ages_20_281300-1600_29.rar: 
Unexpected end of archive
The Late Middle Ages (1300-1600)/Barbara W. Tuchman - A Distant Mirror The Calamitous 14th Century (Updated Edition) [Retail].epub - checksum error
Unexpected end of archive


2025-07-01 13:46:08,264 - ERROR - Skipping PDF processing for The%20Late%20Middle%20Ages%20%281300-1600%29.rar due to extraction failure.


ERROR:__main__:Skipping PDF processing for The%20Late%20Middle%20Ages%20%281300-1600%29.rar due to extraction failure.


2025-07-01 13:46:08,265 - INFO - --- Processing Weapons%20%26%20Warfare.rar ---


INFO:__main__:--- Processing Weapons%20%26%20Warfare.rar ---


2025-07-01 13:46:12,439 - ERROR - Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/Medieval%20Society%20and%20Everyday%20Life.rar: ('Connection broken: IncompleteRead(616464049 bytes read, 2870199682 more expected)', IncompleteRead(616464049 bytes read, 2870199682 more expected))


ERROR:__main__:Error downloading file from https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/Medieval%20Society%20and%20Everyday%20Life.rar: ('Connection broken: IncompleteRead(616464049 bytes read, 2870199682 more expected)', IncompleteRead(616464049 bytes read, 2870199682 more expected))


  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]

  0%|          | 0/345 [00:00<?, ?it/s]


  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]

  0%|          | 0/345 [00:00<?, ?it/s]


2025-07-01 13:46:49,023 - ERROR - Nougat processing failed for 1._20Prehistory/1. Prehistory/America/Bradley J. Vierra - The Late Archaic across the Borderlands From Foraging to Farming.pdf. Skipping cleaning, chunking, and upload.


ERROR:__main__:Nougat processing failed for 1._20Prehistory/1. Prehistory/America/Bradley J. Vierra - The Late Archaic across the Borderlands From Foraging to Farming.pdf. Skipping cleaning, chunking, and upload.


Processing PDFs in 1._20Prehistory.rar:   6%|▋         | 2/31 [04:05<1:08:48, 142.37s/it][A[A

2025-07-01 13:46:49,061 - INFO - Processing PDF: 1._20Prehistory/1. Prehistory/America/R. G. Matson, Gary Coupland - The Prehistory of the Northwest Coast (Retail).pdf


INFO:__main__:Processing PDF: 1._20Prehistory/1. Prehistory/America/R. G. Matson, Gary Coupland - The Prehistory of the Northwest Coast (Retail).pdf


2025-07-01 13:47:32,012 - INFO - Downloaded https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/The%20Early%20Middle%20Ages%20%28400-800%29.rar to The_20Early_20Middle_20Ages_20_28400-800_29.rar


INFO:__main__:Downloaded https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/The%20Early%20Middle%20Ages%20%28400-800%29.rar to The_20Early_20Middle_20Ages_20_28400-800_29.rar


2025-07-01 13:47:32,077 - ERROR - Skipping extraction and processing for The%20Early%20Middle%20Ages%20%28400-800%29.rar due to download failure.


ERROR:__main__:Skipping extraction and processing for The%20Early%20Middle%20Ages%20%28400-800%29.rar due to download failure.


2025-07-01 13:47:32,114 - INFO - --- Processing Weapons%20%26%20Warfare.rar ---


INFO:__main__:--- Processing Weapons%20%26%20Warfare.rar ---


2025-07-01 13:47:32,151 - INFO - Weapons_20_26_20Warfare.rar already exists. Skipping download.


INFO:__main__:Weapons_20_26_20Warfare.rar already exists. Skipping download.


2025-07-01 13:47:32,186 - ERROR - Skipping extraction and processing for Weapons%20%26%20Warfare.rar due to download failure.


ERROR:__main__:Skipping extraction and processing for Weapons%20%26%20Warfare.rar due to download failure.


2025-07-01 13:47:32,215 - INFO - --- Processing 4.%20Early%20Modern.rar ---


INFO:__main__:--- Processing 4.%20Early%20Modern.rar ---


2025-07-01 13:47:32,285 - INFO - Downloaded https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/The%20Early%20Middle%20Ages%20%28400-800%29.rar to The_20Early_20Middle_20Ages_20_28400-800_29.rar


INFO:__main__:Downloaded https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/The%20Early%20Middle%20Ages%20%28400-800%29.rar to The_20Early_20Middle_20Ages_20_28400-800_29.rar


2025-07-01 13:47:37,332 - INFO - Downloaded https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/The%20Late%20Middle%20Ages%20%281300-1600%29.rar to The_20Late_20Middle_20Ages_20_281300-1600_29.rar


INFO:__main__:Downloaded https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/The%20Late%20Middle%20Ages%20%281300-1600%29.rar to The_20Late_20Middle_20Ages_20_281300-1600_29.rar


2025-07-01 13:47:37,342 - ERROR - Skipping extraction and processing for The%20Late%20Middle%20Ages%20%281300-1600%29.rar due to download failure.


ERROR:__main__:Skipping extraction and processing for The%20Late%20Middle%20Ages%20%281300-1600%29.rar due to download failure.


2025-07-01 13:47:37,348 - INFO - --- Processing 24%20Hours%20in%20Ancient%20History%20%284%20Books%29%20%5BComplete%5D.rar ---


INFO:__main__:--- Processing 24%20Hours%20in%20Ancient%20History%20%284%20Books%29%20%5BComplete%5D.rar ---


2025-07-01 13:47:56,900 - INFO - Downloaded https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/5.%20Ancient%20%26%20Classical%20Civilizations%20Series/24%20Hours%20in%20Ancient%20History%20%284%20Books%29%20%5BComplete%5D.rar to 24_20Hours_20in_20Ancient_20History_20_284_20Books_29_20_5BComplete_5D.rar


INFO:__main__:Downloaded https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/5.%20Ancient%20%26%20Classical%20Civilizations%20Series/24%20Hours%20in%20Ancient%20History%20%284%20Books%29%20%5BComplete%5D.rar to 24_20Hours_20in_20Ancient_20History_20_284_20Books_29_20_5BComplete_5D.rar


2025-07-01 13:47:56,913 - ERROR - Skipping extraction and processing for 24%20Hours%20in%20Ancient%20History%20%284%20Books%29%20%5BComplete%5D.rar due to download failure.


ERROR:__main__:Skipping extraction and processing for 24%20Hours%20in%20Ancient%20History%20%284%20Books%29%20%5BComplete%5D.rar due to download failure.


2025-07-01 13:47:56,940 - INFO - --- Processing A%20Week%20in%20the%20Life%20%287%20Books%29%20%5BComplete%5D.rar ---


INFO:__main__:--- Processing A%20Week%20in%20the%20Life%20%287%20Books%29%20%5BComplete%5D.rar ---


2025-07-01 13:48:33,157 - ERROR - Error extracting Medieval_20Society_20and_20Everyday_20Life.rar: 
Unexpected end of archive
Medieval Society and Everyday Life/Arts and Crafts/Mariah Proctor-Tiffany - Medieval Art in Motion. The Inventory and Gift Giving of Queen Clemence de Hongrie [Retail].pdf - checksum error
Unexpected end of archive


ERROR:__main__:Error extracting Medieval_20Society_20and_20Everyday_20Life.rar: 
Unexpected end of archive
Medieval Society and Everyday Life/Arts and Crafts/Mariah Proctor-Tiffany - Medieval Art in Motion. The Inventory and Gift Giving of Queen Clemence de Hongrie [Retail].pdf - checksum error
Unexpected end of archive


2025-07-01 13:48:33,164 - ERROR - Skipping PDF processing for Medieval%20Society%20and%20Everyday%20Life.rar due to extraction failure.


ERROR:__main__:Skipping PDF processing for Medieval%20Society%20and%20Everyday%20Life.rar due to extraction failure.


2025-07-01 13:48:33,207 - INFO - --- Processing 4.%20Early%20Modern.rar ---


INFO:__main__:--- Processing 4.%20Early%20Modern.rar ---


2025-07-01 13:48:33,214 - INFO - 4._20Early_20Modern.rar already exists. Skipping download.


INFO:__main__:4._20Early_20Modern.rar already exists. Skipping download.


2025-07-01 13:48:44,907 - INFO - Downloaded https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/5.%20Ancient%20%26%20Classical%20Civilizations%20Series/A%20Week%20in%20the%20Life%20%287%20Books%29%20%5BComplete%5D.rar to A_20Week_20in_20the_20Life_20_287_20Books_29_20_5BComplete_5D.rar


INFO:__main__:Downloaded https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/5.%20Ancient%20%26%20Classical%20Civilizations%20Series/A%20Week%20in%20the%20Life%20%287%20Books%29%20%5BComplete%5D.rar to A_20Week_20in_20the_20Life_20_287_20Books_29_20_5BComplete_5D.rar


2025-07-01 13:48:44,935 - ERROR - Skipping extraction and processing for A%20Week%20in%20the%20Life%20%287%20Books%29%20%5BComplete%5D.rar due to download failure.


ERROR:__main__:Skipping extraction and processing for A%20Week%20in%20the%20Life%20%287%20Books%29%20%5BComplete%5D.rar due to download failure.


2025-07-01 13:48:44,952 - INFO - --- Processing Agora%20Picture%20Books%20%2827%20Books%29%20%5BComplete%5D%20%E2%80%A0.rar ---


INFO:__main__:--- Processing Agora%20Picture%20Books%20%2827%20Books%29%20%5BComplete%5D%20%E2%80%A0.rar ---


2025-07-01 13:48:47,573 - INFO - Extracted The_20Early_20Middle_20Ages_20_28400-800_29.rar to The_20Early_20Middle_20Ages_20_28400-800_29


INFO:__main__:Extracted The_20Early_20Middle_20Ages_20_28400-800_29.rar to The_20Early_20Middle_20Ages_20_28400-800_29


2025-07-01 13:48:47,577 - ERROR - Skipping PDF processing for The%20Early%20Middle%20Ages%20%28400-800%29.rar due to extraction failure.


ERROR:__main__:Skipping PDF processing for The%20Early%20Middle%20Ages%20%28400-800%29.rar due to extraction failure.


2025-07-01 13:48:47,696 - INFO - --- Processing 24%20Hours%20in%20Ancient%20History%20%284%20Books%29%20%5BComplete%5D.rar ---


INFO:__main__:--- Processing 24%20Hours%20in%20Ancient%20History%20%284%20Books%29%20%5BComplete%5D.rar ---


2025-07-01 13:48:47,710 - INFO - 24_20Hours_20in_20Ancient_20History_20_284_20Books_29_20_5BComplete_5D.rar already exists. Skipping download.


INFO:__main__:24_20Hours_20in_20Ancient_20History_20_284_20Books_29_20_5BComplete_5D.rar already exists. Skipping download.


2025-07-01 13:48:47,966 - INFO - Downloaded https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/The%20High%20Middle%20Ages%20%281000-1300%29.rar to The_20High_20Middle_20Ages_20_281000-1300_29.rar


INFO:__main__:Downloaded https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/The%20High%20Middle%20Ages%20%281000-1300%29.rar to The_20High_20Middle_20Ages_20_281000-1300_29.rar


2025-07-01 13:48:48,007 - ERROR - Skipping extraction and processing for The%20High%20Middle%20Ages%20%281000-1300%29.rar due to download failure.


ERROR:__main__:Skipping extraction and processing for The%20High%20Middle%20Ages%20%281000-1300%29.rar due to download failure.


2025-07-01 13:48:48,013 - INFO - --- Processing Ancient%20Civilizations%20%28ABDO%20Publishing%29%20%288%20Books%29%20%5BComplete%5D%20%E2%80%A0.rar ---


INFO:__main__:--- Processing Ancient%20Civilizations%20%28ABDO%20Publishing%29%20%288%20Books%29%20%5BComplete%5D%20%E2%80%A0.rar ---


2025-07-01 13:49:01,818 - INFO - Extracted 24_20Hours_20in_20Ancient_20History_20_284_20Books_29_20_5BComplete_5D.rar to 24_20Hours_20in_20Ancient_20History_20_284_20Books_29_20_5BComplete_5D


INFO:__main__:Extracted 24_20Hours_20in_20Ancient_20History_20_284_20Books_29_20_5BComplete_5D.rar to 24_20Hours_20in_20Ancient_20History_20_284_20Books_29_20_5BComplete_5D


2025-07-01 13:49:01,827 - ERROR - Skipping PDF processing for 24%20Hours%20in%20Ancient%20History%20%284%20Books%29%20%5BComplete%5D.rar due to extraction failure.


ERROR:__main__:Skipping PDF processing for 24%20Hours%20in%20Ancient%20History%20%284%20Books%29%20%5BComplete%5D.rar due to extraction failure.


2025-07-01 13:49:01,854 - INFO - --- Processing A%20Week%20in%20the%20Life%20%287%20Books%29%20%5BComplete%5D.rar ---


INFO:__main__:--- Processing A%20Week%20in%20the%20Life%20%287%20Books%29%20%5BComplete%5D.rar ---


2025-07-01 13:49:01,869 - INFO - A_20Week_20in_20the_20Life_20_287_20Books_29_20_5BComplete_5D.rar already exists. Skipping download.


INFO:__main__:A_20Week_20in_20the_20Life_20_287_20Books_29_20_5BComplete_5D.rar already exists. Skipping download.
