<a href="https://colab.research.google.com/github/ivanmladek/Sentinel-Intelligence-Codex/blob/main/process_refactor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Library Processing Pipeline

This pipeline extracts, cleans, and prepares text from PDF files for LLM training in several stages, with the goal of producing high-quality, structured data. It is orchestrated by the process_refactor.ipynb notebook.

1. Environment Setup and PDF Discovery
   - Dependencies: The process begins by installing the required Python libraries, including nougat-ocr for text extraction, nltk for natural language processing, and langdetect for language identification.
   - PDF Discovery: The script recursively scans a specified directory (e.g., a Google Drive folder) to locate all PDF files.

2. Text Extraction with Nougat
   - Nougat OCR: For each PDF, the nougat command-line tool is used. Nougat is an OCR tool designed for academic and scientific documents; it transcribes complex layouts, mathematical equations, and tables into a structured Markdown format (.mmd).
   - Output: The raw extracted text is saved as a .mmd file, preserving the document's structure with Markdown headings.

3. Text Cleaning and Garbage Detection

   This step filters out irrelevant or low-quality text.

   - Cleaning: Regular expressions and cleaning functions are applied to the raw text to:
     - Remove extra newlines, spaces, and non-ASCII characters.
     - Eliminate academic citations, references to tables/figures, bracketed content, and LaTeX artifacts.
     - Sanitize garbled punctuation and symbols.
   - Garbage Detection: Each segment of text is evaluated against a set of criteria to identify and discard "garbage" content:
     - Language Detection: Text that is not identified as English is discarded.
     - Heuristics: Checks for "jammed" words (strings longer than 50 characters with no spaces or hyphens), an unusually high proportion of single-letter words, and repetitive patterns (repeated characters, frequent n-grams, and large repeated blocks).
     - Quality Scoring: A text_quality_score combines the share of tokens found in the NLTK English word list (weight 0.7) with the share of sentences that end in terminal punctuation (weight 0.3). Text scoring below GARBAGE_THRESHOLD (0.5) is flagged as garbage.

4. Tokenization and Chunking
   - Chunking Strategy: The .mmd content is chunked into smaller segments suitable for the LLM. The chunking logic respects the document's structure:
     - The text is split by Markdown headings (#, ##, ###).
     - These larger sections are then further divided into sentences using nltk.sent_tokenize.
   - Size Constraints: The sentences are grouped into chunks with a maximum size (e.g., 8192 characters) so that they fit within the model's context window without splitting a sentence in the middle.
   - Final Output: The cleaned, chunked text is saved to a .jsonl file, with each line containing a JSON object with a single "text" key, ready for training the LLM. Garbage text is saved to a separate file for review.
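
As a rough illustration of step 4, the sketch below packs a list of sentences into chunks of at most `max_size` characters and writes one JSON object per line. The names `group_sentences` and `write_jsonl` and the output file name are hypothetical, and the sentences are passed in as a plain list so the example does not depend on NLTK; the notebook itself relies on `nltk.sent_tokenize` and its own helper functions.

```python
import json
import os
import tempfile


def group_sentences(sentences, max_size=8192):
    """Greedily pack whole sentences into chunks of at most max_size characters."""
    chunks = []
    current = ""
    for sentence in sentences:
        # Join with a single space once the chunk already has content.
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single sentence longer than max_size becomes its own oversized chunk.
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def write_jsonl(chunks, path):
    """Write one {"text": ...} object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(json.dumps({"text": chunk}) + "\n")


sentences = ["First sentence.", "Second sentence.", "Third one."]
chunks = group_sentences(sentences, max_size=35)

out_path = os.path.join(tempfile.mkdtemp(), "example_cleaned.jsonl")
write_jsonl(chunks, out_path)

# Read the file back to confirm every line is a standalone JSON record.
with open(out_path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
```

Packing whole sentences keeps each record readable on its own, at the cost of chunks that are usually somewhat shorter than the limit.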

## 1. Setup and Dependencies

In [1]:
#@title Install System Dependencies
!apt-get install -y poppler-utils tesseract-ocr libmagic-dev unrar

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
libmagic-dev is already the newest version (1:5.41-3ubuntu0.1).
poppler-utils is already the newest version (22.02.0-2ubuntu0.8).
unrar is already the newest version (1:6.1.5-1ubuntu0.1).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.


In [2]:
#@title Install Python Libraries (Part 1)
!pip install numpy==1.26.4



In [None]:
#@title Install Python Libraries (Part 2)
!pip install git+https://github.com/facebookresearch/nougat

Collecting git+https://github.com/facebookresearch/nougat
  Cloning https://github.com/facebookresearch/nougat to /tmp/pip-req-build-rajd8_yf
  Running command git clone --filter=blob:none --quiet https://github.com/facebookresearch/nougat /tmp/pip-req-build-rajd8_yf
  Resolved https://github.com/facebookresearch/nougat to commit 5a92920d342fb6acf05fc9b594ccb4053dbe8e7a
  Preparing metadata (setup.py) ... done


## 2. Imports and Configuration

In [4]:
import os
import re
import json
import logging
import shutil
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed

import nltk
import pandas as pd
import requests
from bs4 import BeautifulSoup
from huggingface_hub import HfApi
from langdetect import detect, LangDetectException
from nltk.corpus import words, brown
from nltk.tokenize import word_tokenize, sent_tokenize
from textblob import TextBlob
from tqdm import tqdm
from google.colab import drive

# --- Configuration ---
BASE_URL = "https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/"
HUGGING_FACE_REPO = "Disperser5601/Sentinel-Intelligence-Codex"  # Replace with your Hugging Face repo
GARBAGE_THRESHOLD = 0.5
LENWORD = 50

# --- Logging Setup ---
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
logger.propagate = True # Ensure messages are propagated to the root logger

# Explicitly set the logging level and add a handler to print to stdout
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(sys.stdout)
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
# Avoid adding duplicate handlers if the cell is run multiple times
if not logger.handlers:
    logger.addHandler(handler)


# --- Mount Google Drive ---
#drive.mount('/content/drive')

# --- Download NLTK Data ---
nltk.download('punkt')
nltk.download('words')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True
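
The logging setup above attaches a stdout handler while leaving propagation on. In an environment where the root logger already has its own handler (assumed here to be the case in Colab), each message can then be printed twice in two different formats. A minimal sketch of a setup that keeps exactly one handler across re-runs of the cell is shown below; `configure_logger` and the logger name `pipeline_demo` are hypothetical names used only for illustration.

```python
import io
import logging
import sys


def configure_logger(name, stream=None, level=logging.INFO):
    """Return a logger with exactly one stream handler, even if called repeatedly."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Keep records away from root-logger handlers, which would print them a second time.
    logger.propagate = False
    if not logger.handlers:
        target = stream if stream is not None else sys.stdout
        handler = logging.StreamHandler(target)
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    return logger


# Write to an in-memory buffer so the result can be inspected.
buffer = io.StringIO()
demo_logger = configure_logger("pipeline_demo", stream=buffer)
demo_logger = configure_logger("pipeline_demo", stream=buffer)  # simulates re-running the cell
demo_logger.info("setup complete")
```

Whether to disable propagation is a design choice: it avoids duplicate lines, but handlers attached to the root logger will no longer see these records.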

## 3. Helper Functions

### 3.1. File and Web Operations

In [5]:
import requests
from bs4 import BeautifulSoup
import os
import re
import subprocess
import logging
from urllib3.util.retry import Retry  # urllib3 is installed with requests; the requests.packages alias is deprecated
from requests.adapters import HTTPAdapter

logger = logging.getLogger(__name__)

def get_file_list(url, depth=0, max_depth=3):
    """Recursively get a list of files from a URL and its subdirectories up to a max depth, avoiding backlinks."""
    if depth > max_depth:
        logger.debug(f"Max depth ({max_depth}) reached at URL: {url}. Stopping recursion.")
        return []

    rar_files = []
    # Configure retries
    retry_strategy = Retry(
        total=3,  # Number of retries
        backoff_factor=1, # Factor by which the delay increases
        status_forcelist=[429, 500, 502, 503, 504] # HTTP status codes to retry on
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    http = requests.Session()
    http.mount("http://", adapter)
    http.mount("https://", adapter)

    logger.info(f"Accessing URL: {url} (Depth: {depth})")
    try:
        response = http.get(url)
        response.raise_for_status()  # Raise an HTTPError for bad responses (4xx or 5xx)
        soup = BeautifulSoup(response.text, 'html.parser')
        for link in soup.find_all('a'):
            href = link.get('href')
            if href:
                # Handle relative and absolute links
                full_url = requests.compat.urljoin(url, href)

                # Ensure we only go deeper into subdirectories, avoiding backlinks
                if full_url.startswith(url) and len(full_url) > len(url) and full_url.endswith('/'):
                     logger.debug(f"Found subdirectory: {full_url}. Recursing.")
                     rar_files.extend(get_file_list(full_url, depth + 1, max_depth))
                elif full_url.endswith('.rar'):
                    logger.debug(f"Found RAR file: {full_url}")
                    rar_files.append(full_url)

    except requests.exceptions.RequestException as e:
        logger.error(f"Error accessing URL {url}: {e}")
    logger.debug(f"Finished processing URL: {url}. Found {len(rar_files)} RAR files in this branch.")
    return rar_files

def download_file(url, output_path):
    """Download a file from a URL."""
    if os.path.exists(output_path):
        logger.info(f"{output_path} already exists. Skipping download.")
        return True # Indicate success as file exists
    logger.info(f"Attempting to download {url} to {output_path}")
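    # Download to a temporary name first so an interrupted download is never
    # mistaken for a complete file by the "already exists" check above
    tmp_path = output_path + ".part"
    try:
        response = requests.get(url, stream=True)
        response.raise_for_status() # Raise an HTTPError for bad responses (4xx or 5xx)
        with open(tmp_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
        os.replace(tmp_path, output_path)
        logger.info(f"Successfully downloaded {url} to {output_path}")
        return True # Indicate success
    except requests.exceptions.RequestException as e:
        logger.error(f"Error downloading file from {url}: {e}")
        if os.path.exists(tmp_path):
            os.remove(tmp_path) # Discard the partial download
        return False # Indicate failure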


def extract_rar(file_path, output_path):
    """Extract a RAR file."""
    if not os.path.exists(file_path):
        logger.error(f"RAR file not found for extraction: {file_path}")
        return False # Indicate failure
    if not os.path.exists(output_path):
        os.makedirs(output_path)
        logger.debug(f"Created output directory for extraction: {output_path}")
    logger.info(f"Attempting to extract {file_path} to {output_path}")
    try:
        # Added -o+ to overwrite without prompting
        result = subprocess.run(['unrar', 'x', '-o+', file_path, output_path], check=True, capture_output=True, text=True)
        logger.info(f"Successfully extracted {file_path} to {output_path}")
        # Log stdout and stderr for debugging
        if result.stdout:
            logger.debug(f"Unrar stdout for {file_path}:\n{result.stdout}")
        if result.stderr:
             logger.debug(f"Unrar stderr for {file_path}:\n{result.stderr}")
        return True # Indicate success
    except subprocess.CalledProcessError as e:
        logger.error(f"Error extracting {file_path}: {e.stderr}")
        return False # Indicate failure
    except FileNotFoundError:
        logger.error("Unrar command not found. Please ensure 'unrar' is installed.")
        return False
    except Exception as e:
        logger.error(f"An unexpected error occurred during extraction of {file_path}: {e}")
        return False


def sanitize_filename(filename):
    """Sanitize a filename."""
    sanitized = re.sub(r'[^a-zA-Z0-9_.-]', '_', filename)
    logger.debug(f"Sanitized filename '{filename}' to '{sanitized}'")
    return sanitized
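
To make the link filtering in `get_file_list` easier to follow, here is a small standalone sketch of the same test applied to individual links. The helper name `classify_link` and the `example.org` URLs are hypothetical; `sanitize_filename` is repeated from the cell above (without logging) so the block runs on its own.

```python
import re
from urllib.parse import urljoin


def classify_link(base_url, href):
    """Return 'subdir', 'rar', or None for a link found on a directory-listing page."""
    full_url = urljoin(base_url, href)
    # A real subdirectory extends the current URL and ends with a slash;
    # parent links such as "../" resolve to a shorter URL and are rejected.
    if (full_url.startswith(base_url)
            and len(full_url) > len(base_url)
            and full_url.endswith("/")):
        return "subdir"
    if full_url.endswith(".rar"):
        return "rar"
    return None


def sanitize_filename(filename):
    """Replace every character outside [a-zA-Z0-9_.-] with an underscore."""
    return re.sub(r"[^a-zA-Z0-9_.-]", "_", filename)


base = "https://example.org/books/"
results = {
    "History/": classify_link(base, "History/"),          # child directory
    "../": classify_link(base, "../"),                    # backlink to the parent
    "?C=N;O=D": classify_link(base, "?C=N;O=D"),          # sort link on the same page
    "Ancient%20Rome.rar": classify_link(base, "Ancient%20Rome.rar"),
}
safe_name = sanitize_filename("Ancient Rome (Vol. 1).rar")
```

The prefix-and-length check is what keeps the crawl from climbing back up the directory tree, since a parent link never starts with the current URL.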




### 3.2. PDF Processing (Nougat)

In [6]:
import subprocess
import os
import logging
import sys # Import sys for streaming stdout
from urllib.parse import quote # Import quote here

logger = logging.getLogger(__name__)

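def process_pdf(pdf_path, output_dir):
    """Convert a PDF to Nougat Markdown (.mmd).

    Returns the .mmd path on success, the Hugging Face URL if a cleaned .jsonl
    for this PDF already exists in the dataset, or None on failure.
    """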
    if not os.path.exists(pdf_path):
        logger.error(f"PDF file not found for processing: {pdf_path}")
        return None

    # Ensure the output directory exists
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
        logger.debug(f"Created output directory for Nougat: {output_dir}")

    # Construct the expected output filename based on Nougat's default behavior
    # Nougat typically replaces the PDF extension with .mmd in the output directory
    pdf_filename = os.path.basename(pdf_path)
    # MODIFICATION: Construct expected_mmd_filename using the original pdf_filename
    expected_mmd_filename = f"{os.path.splitext(pdf_filename)[0]}.mmd"
    mmd_path = os.path.join(output_dir, expected_mmd_filename)

    # Construct the expected jsonl file name from the PDF name
    # (splitext also handles upper-case extensions and names that contain ".pdf" elsewhere)
    base_name = os.path.splitext(pdf_filename)[0]
    jsonl_file_name = f"{base_name}_cleaned.jsonl"
    huggingface_base_url = "https://huggingface.co/datasets/Disperser5601/Sentinel-Intelligence-Codex/blob/main/"
    encoded_jsonl_file_name = quote(jsonl_file_name)
    huggingface_jsonl_url = f"{huggingface_base_url}{encoded_jsonl_file_name}"

    # Check if the jsonl file already exists in the Hugging Face dataset
    logger.info(f"Checking for existing jsonl file at: {huggingface_jsonl_url}")
    try:
        import requests
        response = requests.head(huggingface_jsonl_url, timeout=30)
        if response.status_code == 200:
            logger.info(f"Jsonl file already exists for {os.path.basename(pdf_path)}. Skipping processing.")
            return huggingface_jsonl_url # Return the URL if the file exists
    except Exception as e:
        logger.error(f"Error checking Hugging Face dataset: {e}")
        # Continue with processing if there's an error checking the dataset


    if os.path.exists(mmd_path):
        logger.info(f"{mmd_path} already exists. Skipping Nougat processing for {pdf_path}.")
        return mmd_path

    logger.info(f"Attempting to process PDF: {pdf_path} with Nougat. Output to {output_dir}")

    try:
        # Stream output live using Popen
        # Ensure the input path to nougat is the original path, not sanitized
        process = subprocess.Popen(
            ['nougat', pdf_path, '-o', output_dir, '--no-skipping', '--recompute'],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT, # Capture stderr and merge with stdout
            bufsize=1,
            universal_newlines=True
        )

        # Stream output to stdout
        for line in process.stdout:
            sys.stdout.write(line)
            sys.stdout.flush()

        process.stdout.close()
        return_code = process.wait()

        if return_code != 0:
            logger.error(f"Nougat process failed with exit code {return_code}")
            return None

        logger.info(f"Nougat command finished with exit code {return_code}. Checking for output file.")

        # Explicitly check if the expected output file was created
        if os.path.exists(mmd_path):
            logger.info(f"Successfully processed {pdf_path} with Nougat. MMD file created at {mmd_path}.")
            return mmd_path
        else:
            logger.error(f"Nougat command finished but expected output {mmd_path} not found.")
            return None

    except Exception as e:
        logger.error(f"An error occurred during Nougat processing of {pdf_path}: {e}")
        return None
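
The skip checks in `process_pdf` depend on predicting output names from the PDF path. The sketch below shows one way to derive them, assuming Nougat writes `<stem>.mmd` into the output directory as the comments above describe. The helper name `expected_outputs`, the sample paths, and the `example.org` base URL are hypothetical.

```python
import os
from urllib.parse import quote


def expected_outputs(pdf_path, output_dir, dataset_base_url):
    """Derive the expected .mmd path, .jsonl name, and dataset URL for a PDF."""
    # splitext removes only the final extension, whatever its letter case.
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    mmd_path = os.path.join(output_dir, f"{stem}.mmd")
    jsonl_name = f"{stem}_cleaned.jsonl"
    # Percent-encode the file name so spaces and other characters are URL-safe.
    jsonl_url = dataset_base_url + quote(jsonl_name)
    return mmd_path, jsonl_name, jsonl_url


mmd_path, jsonl_name, jsonl_url = expected_outputs(
    "/data/in/Naval Tactics.PDF",
    "/data/out",
    "https://example.org/datasets/demo/blob/main/",
)

# For comparison: a plain string replace is case-sensitive and leaves ".PDF" in place.
replaced = "Naval Tactics.PDF".replace(".pdf", "")
```

Deriving every name from the same stem keeps the "already processed" checks consistent with the files that are actually written.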

### 3.3. Text Cleaning and Quality Control

In [7]:
import re
import logging
from langdetect import detect, LangDetectException
from nltk.corpus import words # Assuming these are downloaded in config-cell
from nltk.tokenize import word_tokenize, sent_tokenize # Assuming this is downloaded in config-cell
from textblob import TextBlob # Assuming TextBlob is installed
import nltk # Import nltk for FreqDist

# Download necessary NLTK data if not already present
# (nltk.data.find raises LookupError when a resource is missing)
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')
try:
    nltk.data.find('corpora/words')
except LookupError:
    nltk.download('words')
# Ensure punkt_tab is downloaded right before potential use
try:
    nltk.data.find('tokenizers/punkt_tab')
except LookupError:
    nltk.download('punkt_tab')

# Verify punkt_tab is available after download attempt
try:
    nltk.data.find('tokenizers/punkt_tab')
    logger.info("NLTK 'punkt_tab' resource found.")
except LookupError:
    logger.error("NLTK 'punkt_tab' resource still not found after download attempt. Sentence tokenization may fail.")


# Assuming GARBAGE_THRESHOLD and LENWORD are defined in config-cell
# from .config import GARBAGE_THRESHOLD, LENWORD # Example if in a different file

logger = logging.getLogger(__name__)

# Load the NLTK words corpus for garbage detection
# Ensure this is done after nltk.download('words') in config-cell
try:
    ENGLISH_WORDS = set(words.words())
    logger.info("NLTK English words corpus loaded.")
except LookupError:
    logger.error("NLTK 'words' corpus not found. Please run nltk.download('words').")
    ENGLISH_WORDS = set() # Use an empty set to avoid errors

def clean_text(text):
    """Clean the extracted text, including removing LaTeX artifacts."""
    logger.debug(f"Cleaning text (first 100 chars): {text[:100]}...")
    initial_len = len(text)
    # Remove markdown headings at the beginning of lines
    text = re.sub(r'^\s*#+\s+.*', '', text, flags=re.MULTILINE)
    # Remove markdown emphasis markers (*, _, **, __)
    text = re.sub(r'(\*\*|__)(.*?)\1', r'\2', text) # Bold (**word**, __word__)
    text = re.sub(r'(\*|_)(.*?)\1', r'\2', text)   # Italic (*word*, _word_)
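    # Remove lines containing only numbers (e.g., page numbers) while line breaks are still present
    text = re.sub(r'^\s*\d+\s*$', '', text, flags=re.MULTILINE)
    # Collapse newlines and repeated spaces
    text = re.sub(r'\n+', ' ', text)
    text = re.sub(r' +', ' ', text)
    text = text.strip()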

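    # Remove LaTeX math delimited by \[...\] and \(...\) before the general bracket rules,
    # which would otherwise consume the delimiters and leave stray backslashes behind
    text = re.sub(r'\\\[.*?\\\]', '', text, flags=re.DOTALL)
    text = re.sub(r'\\\(.*?\\\)', '', text, flags=re.DOTALL)
    # Remove content within square brackets (more general)
    text = re.sub(r'\[.*?\]', '', text)
    # Remove content within parentheses (more general)
    text = re.sub(r'\(.*?\)', '', text)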

    # Remove references to tables. Bracketed and parenthesized citations
    # (e.g., [12], [FIGURE: ...], (Smith et al., 2020)) are already removed by the general rules above.
    text = re.sub(r'Table\s+\d+', '', text, flags=re.IGNORECASE)

    # MODIFICATION: Remove references to figures in the format "Figure xx.yyy" (case-insensitive)
    text = re.sub(r'Figure\s+\d+\.\d+\s*[\.,;:]*', '', text, flags=re.IGNORECASE)

    # Add extensive cleaning of LaTeX artifacts
    # Remove LaTeX environments like \begin{...} and \end{...} and everything in between
    text = re.sub(r'\\begin\{.*?\}\s*.*?\\end\{.*?\}', '', text, flags=re.DOTALL)
    # Remove common LaTeX commands (e.g., \section, \textbf, \caption, \label, \cite, \ref)
    # This regex targets backslashes followed by letters, optionally with braces or brackets
    text = re.sub(r'\\[a-zA-Z]+\{.*?\}', '', text)
    text = re.sub(r'\\[a-zA-Z]+\s', ' ', text) # Remove commands followed by a space
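    # Remove mathematical environments like $$...$$, $...$, \[...\], \(...\)
    # ($$...$$ is handled first; otherwise the single-dollar rule matches the empty "$$" pairs and leaves the math behind)
    text = re.sub(r'\$\$.*?\$\$', '', text)
    text = re.sub(r'\$.*?\$', '', text)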
    text = re.sub(r'\\\[.*?\\\]', '', text, flags=re.DOTALL)
    text = re.sub(r'\\\(.*?\\\)', '', text, flags=re.DOTALL)
    # Remove other common symbols and commands
    text = text.replace('\\\\', ' ') # Replace double backslash (newline in LaTeX) with space
    text = text.replace('\\%', '%') # Unescape percent sign if needed
    text = text.replace('\\&', '&') # Unescape ampersand if needed
    text = text.replace('\\$', '$') # Unescape dollar sign if needed
    text = text.replace('\\#', '#') # Unescape hash if needed
    text = text.replace('\\_', '_') # Unescape underscore if needed
    text = text.replace('\\{', '{') # Unescape brace if needed
    text = text.replace('\\}', '}') # Unescape brace if needed
    text = text.replace('\\textbullet', '') # Remove bullet point command
    text = text.replace('\\par', ' ') # Remove paragraph command
    text = text.replace('\\noindent', ' ') # Remove no indent command
    text = text.replace('\\hfill', ' ') # Remove horizontal fill command
    # str.replace matches literally, so commands with arguments need re.sub
    text = re.sub(r'\\vspace\{.*?\}', ' ', text) # Remove vertical space command
    text = re.sub(r'\\hspace\{.*?\}', ' ', text) # Remove horizontal space command
    text = text.replace('\\centering', ' ') # Remove centering command
    text = text.replace('\\raggedright', ' ') # Remove ragged right command
    text = text.replace('\\hline', ' ') # Remove table horizontal line command
    text = re.sub(r'\\arraystretch\{.*?\}', ' ', text) # Remove array stretch command
    text = re.sub(r'\\documentclass\{.*?\}', ' ', text) # Remove document class
    text = re.sub(r'\\usepackage\{.*?\}', ' ', text) # Remove usepackage
    text = text.replace('\\begin{document}', ' ') # Remove begin document
    text = text.replace('\\end{document}', ' ') # Remove end document

    # Remove page numbers and stray numerical artifacts (e.g., single numbers on a line or at the end of a line)
    text = re.sub(r'^\s*\d+\s*$', '', text, flags=re.MULTILINE) # Remove lines containing only numbers
    text = re.sub(r'\s+\d+$', '', text) # Remove numbers at the end of a line
    # Remove non-ASCII characters
    text = re.sub(r'[^\x00-\x7F]+', '', text)
    # Sanitize garbled punctuation and symbols.
    text = re.sub(r'[\.,;:!?]{2,}', '', text)

    # Clean up extra spaces potentially introduced by removing LaTeX
    text = re.sub(r' +', ' ', text)
    text = text.strip()

    logger.debug(f"Cleaned text (first 100 chars, original len {initial_len}): {text[:100]}...")
    return text

def calculate_text_quality_score(text):
    """Calculate a quality score based on English words and sentence structure."""
    if not text:
        return 0.0

    # Named "tokens" to avoid shadowing the imported nltk.corpus.words module
    tokens = word_tokenize(text)
    if not tokens:
        return 0.0

    english_word_count = sum(1 for token in tokens if token.lower() in ENGLISH_WORDS)
    english_word_ratio = english_word_count / len(tokens)

    # Simple heuristic for sentence structure (check for punctuation at end of sentences)
    # Ensure punkt_tab is available before calling sent_tokenize
    try:
        nltk.data.find('tokenizers/punkt_tab')
    except LookupError:
        logger.error("NLTK 'punkt_tab' resource not found for sentence tokenization.")
        return 0.0 # Cannot calculate sentence structure score without the resource

    sentences = sent_tokenize(text)
    well_formed_sentences = sum(1 for sent in sentences if sent.strip().endswith(('.', '!', '?')))
    sentence_structure_score = well_formed_sentences / len(sentences) if sentences else 0

    # Combine ratios - adjust weights as needed
    quality_score = (english_word_ratio * 0.7) + (sentence_structure_score * 0.3)

    logger.debug(f"Text quality score calculated: {quality_score} for text (first 50 chars): {text[:50]}...")
    return quality_score

# --- Helper functions for is_garbage ---

def _check_single_letter_ratio(words_list, threshold=0.2):
    """Check for unusually high proportion of single-letter words."""
    if not words_list:
        return False
    single_letter_words = sum(1 for word in words_list if len(word) == 1)
    if single_letter_words / len(words_list) > threshold:
        logger.debug("Identified as garbage: high proportion of single-letter words.")
        return True
    return False

def _check_repetitive_chars(text, min_repeat=5):
    """Check for simple repetitive character patterns."""
    if re.search(r'(.)\1{' + str(min_repeat - 1) + r',}', text):
        logger.debug("Identified as garbage: found repetitive character pattern.")
        return True
    return False

def _check_high_frequency_ngrams(words, n_size, threshold):
    """Check for highly frequent n-grams (repetitive phrases)."""
    if len(words) < n_size:
        return False

    ngrams = list(nltk.ngrams(words, n_size))
    if not ngrams:
        return False

    fdist = nltk.FreqDist(ngrams)
    most_common_ngram, count = fdist.most_common(1)[0]

    if len(ngrams) > 0 and count / len(ngrams) > threshold:
         logger.debug(f"Identified as garbage: found highly frequent {n_size}-gram '{' '.join(most_common_ngram)}' (count: {count}/{len(ngrams)}).")
         return True
    return False

def _check_large_scale_repetition(text, min_match_len_ratio=0.1, min_match_len_abs=50, offsets_to_check_ratios=(1/2, 1/3, 1/4)):
    """Optimized check for significant large-scale repetition (comparing segments with offsets)."""
    text_len = len(text)
    # Only perform this check on reasonably long texts to avoid overhead on short chunks
    if text_len < 200:
        return False

    min_match_len = max(min_match_len_abs, int(text_len * min_match_len_ratio)) # Require a minimum match length

    for ratio in offsets_to_check_ratios:
        offset = int(text_len * ratio)
        if offset > 0 and text_len > offset + min_match_len:
            segment1 = text[:text_len - offset]
            segment2 = text[offset:text_len]
            # Find the length of the longest common prefix between the two segments
            match_len = 0
            min_segment_len = min(len(segment1), len(segment2))
            while match_len < min_segment_len and segment1[match_len] == segment2[match_len]:
                match_len += 1

            # If a significant match is found
            if match_len >= min_match_len:
                 logger.debug(f"Identified as garbage: found significant repetition with offset ratio {ratio}, match length {match_len}.")
                 return True
    return False


# --- Main is_garbage function ---

def is_garbage(text, threshold=GARBAGE_THRESHOLD, lenword=LENWORD):
    """Check if the text is garbage based on various heuristics."""
    logger.debug(f"Checking if text is garbage (first 100 chars): {text[:100]}...")

    # Check for minimal length
    if not text or len(text.split()) < 5: # Reduced minimum words
        logger.debug("Identified as garbage: text too short or empty.")
        return True

    # Language detection
    try:
        if detect(text) != 'en':
            logger.debug("Identified as garbage: language not English.")
            return True
    except LangDetectException as e:
        logger.debug(f"Language detection failed for text (first 50 chars): {text[:50]}... Error: {e}. Assuming garbage.")
        return True # Assume garbage if language detection fails

    # Check for jammed words (long strings without spaces)
    words_list = text.split()
    for word in words_list:
        if len(word) > lenword and '-' not in word: # Allow hyphens in long words
            logger.debug(f"Identified as garbage: found jammed word '{word[:50]}...'")
            return True

    # Use helper functions for modularity and generalization
    if _check_single_letter_ratio(words_list):
        return True

    if _check_repetitive_chars(text):
        return True

    words_tokenized = word_tokenize(text)
    if len(words_tokenized) > 5: # Only check for n-grams if there are enough words
        # Check for highly frequent n-grams of different sizes
        if _check_high_frequency_ngrams(words_tokenized, 3, 0.10): # Trigrams with 10% threshold
            return True
        if _check_high_frequency_ngrams(words_tokenized, 4, 0.07): # 4-grams with 7% threshold
             return True
        if _check_high_frequency_ngrams(words_tokenized, 5, 0.05): # 5-grams with 5% threshold
             return True

    # Check for significant large-scale repetition
    if _check_large_scale_repetition(text):
        return True


    # Quality scoring
    quality_score = calculate_text_quality_score(text)
    if quality_score < threshold:
        logger.debug(f"Identified as garbage: quality score {quality_score} below threshold {threshold}.")
        return True

    logger.debug("Text passed garbage checks.")
    return False

# GARBAGE_THRESHOLD and LENWORD must be defined (see the configuration cell) before this cell runs
# GARBAGE_THRESHOLD = 0.5 # Value set in the configuration cell
# LENWORD = 50 # Value set in the configuration cell

2025-07-10 11:28:19,525 - INFO - NLTK 'punkt_tab' resource found.


INFO:__main__:NLTK 'punkt_tab' resource found.


2025-07-10 11:28:19,631 - INFO - NLTK English words corpus loaded.


INFO:__main__:NLTK English words corpus loaded.
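
To make the scoring arithmetic concrete, the sketch below reproduces the 0.7 / 0.3 weighting from `calculate_text_quality_score` and the repeated-character test from `_check_repetitive_chars` without any NLTK dependency. It is only an approximation: `DEMO_VOCAB` is a tiny stand-in for the NLTK word list, the regex tokenizer and sentence splitter are simplified substitutes for `word_tokenize` and `sent_tokenize`, and the function names are hypothetical.

```python
import re

# Tiny stand-in vocabulary; the notebook uses the NLTK `words` corpus instead.
DEMO_VOCAB = {"the", "cat", "sat", "on", "mat", "it", "was", "warm"}


def demo_quality_score(text, vocab=DEMO_VOCAB):
    """Weighted mix of vocabulary coverage (0.7) and sentence endings (0.3)."""
    # Words and punctuation marks are separate tokens, so punctuation lowers the ratio.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    if not tokens:
        return 0.0
    word_ratio = sum(t.lower() in vocab for t in tokens) / len(tokens)

    # Simplified sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    well_formed = sum(s.endswith((".", "!", "?")) for s in sentences)
    structure = well_formed / len(sentences) if sentences else 0.0

    return 0.7 * word_ratio + 0.3 * structure


def has_repeated_char_run(text, min_repeat=5):
    """True if any single character repeats at least min_repeat times in a row."""
    return re.search(r"(.)\1{" + str(min_repeat - 1) + r",}", text) is not None


# 9 of 11 tokens are in the vocabulary and both sentences end in punctuation.
good_score = demo_quality_score("The cat sat on the mat. It was warm.")
# No known words and no terminal punctuation.
bad_score = demo_quality_score("xq zzv kkp")
```

With a threshold of 0.5, the first text passes and the second is flagged. Because punctuation tokens count toward the denominator, even clean prose scores below 1.0, which is worth keeping in mind when tuning the threshold.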


### 3.4. Text Chunking

In [8]:
import json
import logging
import os
from nltk.tokenize import sent_tokenize # Assuming this is downloaded in config-cell

# Assuming clean_text and is_garbage are defined in cleaning-cell and available
# from .cleaning import clean_text, is_garbage # Example if in a different file

logger = logging.getLogger(__name__)

def chunk_text(content, max_size=8192):
    """Chunk the text into smaller segments, respecting markdown headings."""
    logger.debug(f"Starting chunking process with max_size={max_size}.")
    segments = []
    current_segment = ""
    lines = content.split('\n')
    logger.debug(f"Splitting content into {len(lines)} lines.")

    for i, line in enumerate(lines):
        # Check for markdown headings
        if line.strip().startswith(("# ", "## ", "### ")):
            logger.debug(f"Found markdown heading at line {i}: {line.strip()}")
            # If the current segment is not empty, process it before starting a new one
            if current_segment:
                logger.debug(f"Processing previous segment before heading (length: {len(current_segment)}).")
                segments.extend(split_segment(current_segment.strip(), max_size))
            # Start a new segment with the heading line
            current_segment = line + "\n" # Keep the heading line in the new segment
            logger.debug("Starting new segment after heading.")
        else:
            # Add non-heading lines to the current segment
            current_segment += line + "\n"

    # Process any remaining content in the last segment
    if current_segment:
        logger.debug(f"Processing final segment (length: {len(current_segment)}).")
        segments.extend(split_segment(current_segment.strip(), max_size))

    logger.info(f"Chunking complete. Produced {len(segments)} initial segments based on headings.")
    return segments

def split_segment(segment, max_size):
    """Split a segment (potentially from a heading section) into smaller chunks by sentences."""
    logger.debug(f"Splitting segment by sentences (length: {len(segment)}).")
    sentences = sent_tokenize(segment)
    logger.debug(f"Segment split into {len(sentences)} sentences.")
    chunks = []
    current_chunk = ""

    for i, sentence in enumerate(sentences):
        # Prefix the sentence with a space when the current chunk already has content,
        # so consecutive sentences are never joined without a separator
        sentence_to_add = " " + sentence if current_chunk else sentence
        # Check if adding the current sentence exceeds the max size
        if len(current_chunk) + len(sentence_to_add) <= max_size:
            current_chunk += sentence_to_add
            logger.debug(f"Added sentence {i+1}/{len(sentences)} to current chunk (current size: {len(current_chunk)}).")
        else:
            # If adding the sentence exceeds max size, add the current chunk to chunks list
            if current_chunk: # Add the chunk only if it's not empty
                chunks.append(current_chunk.strip())
                logger.debug(f"Chunk completed (size: {len(current_chunk)}). Starting new chunk with sentence {i+1}.")
            # Start a new chunk with the current sentence
            # (a single sentence longer than max_size becomes its own oversized chunk)
            current_chunk = sentence

    # Add the last current chunk if it's not empty
    if current_chunk:
        chunks.append(current_chunk.strip())
        logger.debug(f"Added final chunk (size: {len(current_chunk)}).")

    logger.debug(f"Segment split into {len(chunks)} smaller chunks.")
    return chunks


def process_and_chunk_mmd(mmd_path, output_dir):
    """Process, clean, chunk, and categorize text from an MMD file."""
    logger.info(f"Starting processing and chunking for MMD file: {mmd_path}")

    if not mmd_path or not os.path.exists(mmd_path):
        logger.warning(f"MMD file not found or path is invalid: {mmd_path}. Skipping processing and chunking.")
        return None, None

    sanitized_filename = sanitize_filename(os.path.basename(mmd_path))
    cleaned_jsonl_path = os.path.join(output_dir, f"{os.path.splitext(sanitized_filename)[0]}_cleaned.jsonl")
    garbage_jsonl_path = os.path.join(output_dir, f"{os.path.splitext(sanitized_filename)[0]}_garbage.jsonl")

    # Always reprocess so that changes to the cleaning and chunking logic take effect;
    # the skip-if-exists check below is intentionally disabled.
    #if os.path.exists(cleaned_jsonl_path) and os.path.exists(garbage_jsonl_path):
    #    logger.info(f"Output files {cleaned_jsonl_path} and {garbage_jsonl_path} already exist. Skipping processing and chunking for {mmd_path}.")
    #    return cleaned_jsonl_path, garbage_jsonl_path

    try:
        with open(mmd_path, 'r', encoding='utf-8') as f:
            content = f.read()
        logger.debug(f"Successfully read content from {mmd_path} (length: {len(content)}).")
    except Exception as e:
        logger.error(f"Error reading MMD file {mmd_path}: {e}")
        return None, None

    chunks = chunk_text(content)
    logger.info(f"MMD content chunked into {len(chunks)} segments.")

    cleaned_count = 0
    garbage_count = 0

    try:
        with open(cleaned_jsonl_path, 'w', encoding='utf-8') as cleaned_f, \
             open(garbage_jsonl_path, 'w', encoding='utf-8') as garbage_f:
            for i, chunk in enumerate(chunks):
                logger.debug(f"Processing chunk {i+1}/{len(chunks)} (length: {len(chunk)}).")
                cleaned_chunk = clean_text(chunk)
                if is_garbage(cleaned_chunk):
                    garbage_f.write(json.dumps({"text": cleaned_chunk}) + '\n')
                    garbage_count += 1
                    logger.debug(f"Chunk {i+1} identified as garbage.")
                else:
                    cleaned_f.write(json.dumps({"text": cleaned_chunk}) + '\n')
                    cleaned_count += 1
                    logger.debug(f"Chunk {i+1} identified as cleaned text.")

        logger.info(f"Finished processing and chunking {mmd_path}. Generated {cleaned_count} cleaned chunks and {garbage_count} garbage chunks.")
        return cleaned_jsonl_path, garbage_jsonl_path

    except Exception as e:
        logger.error(f"Error during cleaning or writing chunk files for {mmd_path}: {e}")
        # Clean up potentially incomplete files
        if os.path.exists(cleaned_jsonl_path):
            os.remove(cleaned_jsonl_path)
        if os.path.exists(garbage_jsonl_path):
            os.remove(garbage_jsonl_path)
        return None, None

# Ensure sanitize_filename, clean_text, is_garbage are available
# from .file_ops import sanitize_filename # Example if in a different file
# from .cleaning import clean_text, is_garbage # Example if in a different file
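
The sentence-packing strategy in `split_segment` can be tried in isolation with the minimal sketch below. It is an illustration, not the notebook's implementation: `split_sentences` is a hypothetical regex-based stand-in for NLTK's `sent_tokenize` (so the example runs without downloading tokenizer data), and `pack_sentences` is a hypothetical name for the greedy packing loop.

```python
import re


def split_sentences(text):
    """Naive stand-in for sent_tokenize: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack_sentences(sentences, max_size):
    """Greedily pack sentences into chunks of at most max_size characters."""
    chunks = []
    current = ""
    for sentence in sentences:
        # Join with a single space only when the chunk already has content.
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A sentence longer than max_size becomes its own oversized chunk.
            current = sentence
    if current:
        chunks.append(current)
    return chunks


text = "Alpha one. Beta two. Gamma three. Delta four."
chunks = pack_sentences(split_sentences(text), max_size=22)
print(chunks)  # ['Alpha one. Beta two.', 'Gamma three.', 'Delta four.']
```

Sentences are never split, so a single sentence longer than `max_size` still yields one oversized chunk; downstream consumers with a hard length limit would need an extra fallback.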

### 3.5. Hugging Face Integration

In [9]:
import logging
import os
from huggingface_hub import HfApi, login

logger = logging.getLogger(__name__)
# Authenticate with the Hugging Face Hub; in a notebook this displays a token prompt
login()

def upload_to_huggingface(file_path, repo_id, repo_type="dataset"):
    """Upload a file to a Hugging Face repository."""
    if not os.path.exists(file_path):
        logger.error(f"File not found for upload to Hugging Face: {file_path}")
        return False # Indicate failure

    logger.info(f"Attempting to upload {file_path} to Hugging Face repo '{repo_id}' (type: {repo_type}).")
    api = HfApi()

    # Authentication is handled by the login() call at the top of this cell
    try:
        # upload_file is used here for simplicity; create_commit can batch several files into one commit
        api.upload_file(
            path_or_fileobj=file_path,
            path_in_repo=os.path.basename(file_path),
            repo_id=repo_id,
            repo_type=repo_type,
            # Optional: add commit_message, token if not using environment variable
        )
        logger.info(f"Successfully uploaded {file_path} to {repo_id}")
        return True # Indicate success
    except Exception as e:
        logger.error(f"Error uploading {file_path} to Hugging Face repo '{repo_id}': {e}")
        return False # Indicate failure
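
When both the cleaned and the garbage JSONL files should be pushed, a small batch helper can report per-file results. The sketch below is only an illustration under assumptions: `upload_files` is a hypothetical name, and `api` is assumed to be an `HfApi` instance (or any object exposing a compatible `upload_file` method), which keeps the helper easy to exercise without network access.

```python
import os


def upload_files(api, file_paths, repo_id, repo_type="dataset"):
    """Upload files one by one and return a {path: succeeded} mapping.

    `api` is assumed to behave like huggingface_hub.HfApi.
    """
    results = {}
    for path in file_paths:
        if not os.path.exists(path):
            # Missing files are reported as failures rather than raising.
            results[path] = False
            continue
        try:
            api.upload_file(
                path_or_fileobj=path,
                path_in_repo=os.path.basename(path),
                repo_id=repo_id,
                repo_type=repo_type,
                commit_message=f"Add {os.path.basename(path)}",
            )
            results[path] = True
        except Exception:
            # Any upload error marks this file as failed; the loop continues.
            results[path] = False
    return results
```

A call might look like `upload_files(HfApi(), [cleaned_jsonl, garbage_jsonl], HUGGING_FACE_REPO)`. Each file still produces its own commit; grouping them into one commit would require `create_commit` instead.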


## 4. Main Processing Loop

## Scan and process local PDFs

### Subtask:
Modify the main loop to first search for and process any existing PDF files within the anticipated output directory structure.


In [None]:
import os
import shutil
import logging
import json
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor, as_completed # Import ThreadPoolExecutor
from urllib.parse import quote # Import quote here

# Assuming helper functions are defined in other cells and available
# from .file_ops import get_file_list, download_file, extract_rar, sanitize_filename
# from .nougat_processing import process_pdf
# from .chunking import process_and_chunk_mmd
# from .huggingface_integration import upload_to_huggingface
# from .config import BASE_URL, HUGGING_FACE_REPO # Assuming these are defined in config-cell

logger = logging.getLogger(__name__)

# Define a cache file path
RAR_LIST_CACHE = "rar_list_cache.json"

def process_single_pdf_local(pdf_path, HUGGING_FACE_REPO):
    """Processes a single local PDF file: processes with Nougat, cleans, chunks, and uploads."""
    logger.info(f"--- Processing local PDF: {pdf_path} ---")

    output_dir = os.path.dirname(pdf_path) # Use the PDF's directory as the output directory

    # 1. Process PDF with Nougat
    mmd_path = process_pdf(pdf_path, output_dir)

    if mmd_path:
        logger.info(f"Nougat processing successful for {pdf_path}. MMD file: {mmd_path}")
        # 2. Clean and Chunk MMD file
        cleaned_jsonl, garbage_jsonl = process_and_chunk_mmd(mmd_path, output_dir)

        # 3. Upload to Hugging Face
        if cleaned_jsonl and os.path.exists(cleaned_jsonl):
            logger.info(f"Uploading cleaned data for {os.path.basename(pdf_path)} to Hugging Face.")
            if upload_to_huggingface(cleaned_jsonl, HUGGING_FACE_REPO):
                return 1 # Indicate success
            else:
                logger.error(f"Failed to upload cleaned data for {os.path.basename(pdf_path)}.")
                return 0 # Indicate failure
        else:
            logger.warning(f"No cleaned data generated for {os.path.basename(pdf_path)}. Skipping upload.")
            return 0 # Indicate no cleaned data

    else:
        logger.error(f"Nougat processing failed for {pdf_path}. Skipping cleaning, chunking, and upload.")
        return 0 # Indicate failure


def process_single_rar(rar_file_url, HUGGING_FACE_REPO):
    """Processes a single RAR file: downloads, extracts, processes PDFs, and uploads."""
    rar_filename = rar_file_url.split('/')[-1]
    sanitized_rar_filename = sanitize_filename(rar_filename)
    rar_path = sanitized_rar_filename
    extract_path = os.path.splitext(rar_path)[0]

    logger.info(f"--- Processing {rar_filename} ---")

    # 1. Download RAR file (download_file returns True on success)
    if not download_file(rar_file_url, rar_path):
        logger.error(f"Failed to download RAR file: {rar_file_url}. Skipping.")
        return 0

    # 2. Extract RAR file
    if not extract_rar(rar_path, extract_path):
        logger.error(f"Failed to extract RAR file: {rar_path}. Cleaning up and skipping.")
        if os.path.exists(rar_path):
            os.remove(rar_path)
            logger.debug(f"Removed failed RAR file: {rar_path}")
        return 0

    # Clean up the downloaded RAR after successful extraction
    if os.path.exists(rar_path):
        os.remove(rar_path)
        logger.debug(f"Removed downloaded RAR file: {rar_path}")


    # 3. Find and Process PDF files within the extracted directory
    pdf_files = [os.path.join(root, file) for root, _, files in os.walk(extract_path) for file in files if file.lower().endswith('.pdf')]
    logger.info(f"Found {len(pdf_files)} PDF files in extracted directory: {extract_path}")

    if not pdf_files:
        logger.warning(f"No PDF files found in {extract_path}. Cleaning up.")
        # Clean up the extracted directory
        if os.path.exists(extract_path):
            shutil.rmtree(extract_path)
            logger.debug(f"Removed extracted directory: {extract_path}")
        return 0 # Indicate no PDFs processed

    successful_uploads_count = 0
    with tqdm(total=len(pdf_files), desc=f"Processing PDFs in {sanitized_rar_filename}", leave=False) as pbar_pdfs:
        for pdf_path in pdf_files:
            logger.info(f"Processing PDF: {pdf_path}")

            # 4. Process PDF with Nougat
            mmd_path = process_pdf(pdf_path, extract_path)

            if mmd_path:
                logger.info(f"Nougat processing successful for {pdf_path}. MMD file: {mmd_path}")
                # 5. Clean and Chunk MMD file
                cleaned_jsonl, garbage_jsonl = process_and_chunk_mmd(mmd_path, extract_path)

                # 6. Upload to Hugging Face
                if cleaned_jsonl and os.path.exists(cleaned_jsonl):
                    logger.info(f"Uploading cleaned data for {os.path.basename(pdf_path)} to Hugging Face.")
                    if upload_to_huggingface(cleaned_jsonl, HUGGING_FACE_REPO):
                        successful_uploads_count += 1
                    else:
                        logger.error(f"Failed to upload cleaned data for {os.path.basename(pdf_path)}.")
                    # Optionally upload garbage data
                    ##if garbage_jsonl and os.path.exists(garbage_jsonl):
                    ##    logger.info(f"Uploading garbage data for {os.path.basename(pdf_path)} to Hugging Face.")
                    ##    upload_to_huggingface(garbage_jsonl, HUGGING_FACE_REPO)
                else:
                    logger.warning(f"No cleaned data generated for {os.path.basename(pdf_path)}. Skipping upload.")
            else:
                logger.error(f"Nougat processing failed for {pdf_path}. Skipping cleaning, chunking, and upload.")

            pbar_pdfs.update(1) # Update inner progress bar for each PDF

    # 7. Clean up extracted directory after processing all PDFs in the RAR
    logger.info(f"Cleaning up extracted directory for {rar_filename}.")
    if os.path.exists(extract_path):
        shutil.rmtree(extract_path)
        logger.debug(f"Removed extracted directory: {extract_path}")

    return successful_uploads_count


def main():
    """Main function to process the library."""
    logger.info("--- Starting Library Processing Pipeline ---")

    total_local_uploads = 0
    logger.info("--- Processing Local PDF Files ---")

    local_pdf_files = []
    for root, _, files in os.walk("."): # Scan current directory and subdirectories
        for file in files:
            if file.lower().endswith('.pdf'):
                local_pdf_files.append(os.path.join(root, file))

    logger.info(f"Found {len(local_pdf_files)} local PDF files.")

    with tqdm(total=len(local_pdf_files), desc="Processing Local PDFs") as pbar_local_pdfs:
        for pdf_path in local_pdf_files:
            total_local_uploads += process_single_pdf_local(pdf_path, HUGGING_FACE_REPO)
            pbar_local_pdfs.update(1)

    logger.info(f"Finished processing local PDF files. Successfully uploaded cleaned data for {total_local_uploads} files.")
    logger.info("--- Finished Processing Local PDF Files ---")


    rar_files = []
    if os.path.exists(RAR_LIST_CACHE):
        logger.info(f"Loading RAR file list from cache: {RAR_LIST_CACHE}")
        try:
            with open(RAR_LIST_CACHE, 'r') as f:
                rar_files = json.load(f)
            logger.info(f"Loaded {len(rar_files)} RAR files from cache.")
        except Exception as e:
            logger.error(f"Error loading RAR file list from cache: {e}. Rescanning.")
            # If loading fails, proceed to rescan
            rar_files = []

    if not rar_files:
        logger.info(f"Scanning for RAR files at {BASE_URL}")
        try:
            rar_files = get_file_list(BASE_URL)
            logger.info(f"Found {len(rar_files)} RAR files.")
            # Save the list to cache
            try:
                with open(RAR_LIST_CACHE, 'w') as f:
                    json.dump(rar_files, f)
                logger.info(f"Saved RAR file list to cache: {RAR_LIST_CACHE}")
            except Exception as e:
                logger.error(f"Error saving RAR file list to cache: {e}")
        except Exception as e:
            logger.error(f"Failed to get RAR file list from {BASE_URL}: {e}")
            # Continue even if the scan fails; local PDFs have already been processed

    total_rar_uploads = 0

    if rar_files:
        logger.info("--- Processing Downloaded RAR Files ---")
        # Use ThreadPoolExecutor to process RAR files in parallel
        # Adjust max_workers based on your runtime's capabilities and the task
        max_workers = 1 # Example: process 1 RAR at a time to avoid resource issues
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            # Submit tasks to the executor
            future_to_rar = {executor.submit(process_single_rar, rar_file_url, HUGGING_FACE_REPO): rar_file_url for rar_file_url in rar_files}

            # Use tqdm to track overall progress
            with tqdm(total=len(rar_files), desc="Overall RAR Processing") as pbar_overall:
                for future in as_completed(future_to_rar):
                    rar_file_url = future_to_rar[future]
                    try:
                        successful_uploads_count = future.result()
                        total_rar_uploads += successful_uploads_count
                    except Exception as exc:
                        logger.error(f'{rar_file_url} generated an exception: {exc}')

                    pbar_overall.update(1) # Update outer progress bar for each completed RAR
        logger.info(f"Finished processing downloaded RAR files. Successfully uploaded cleaned data for {total_rar_uploads} PDF files.")
        logger.info("--- Finished Processing Downloaded RAR Files ---")


    logger.info("--- Library Processing Pipeline Finished ---")
    logger.info(f"Total successfully uploaded cleaned data for {total_local_uploads + total_rar_uploads} PDF files to {HUGGING_FACE_REPO}.")


if __name__ == "__main__":
    main()

2025-07-10 11:28:19,782 - INFO - --- Starting Library Processing Pipeline ---
2025-07-10 11:28:19,785 - INFO - --- Processing Local PDF Files ---
2025-07-10 11:28:19,795 - INFO - Found 0 local PDF files.
Processing Local PDFs: 0it [00:00, ?it/s]
2025-07-10 11:28:19,799 - INFO - Finished processing local PDF files. Successfully uploaded cleaned data for 0 files.
2025-07-10 11:28:19,800 - INFO - --- Finished Processing Local PDF Files ---
2025-07-10 11:28:19,801 - INFO - Scanning for RAR files at https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/
2025-07-10 11:28:19,804 - INFO - Accessing URL: https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/ (Depth: 0)
2025-07-10 11:28:20,265 - INFO - Accessing URL: https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/0.%20Info/ (Depth: 1)
2025-07-10 11:28:20,602 - INFO - Accessing URL: https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/1.%20Prehistory/ (Depth: 1)
2025-07-10 11:28:20,962 - INFO - Accessing URL: https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/2.%20Ancient%20e%20Classical/ (Depth: 1)
2025-07-10 11:28:21,390 - INFO - Accessing URL: https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/ (Depth: 1)
2025-07-10 11:28:21,973 - INFO - Accessing URL: https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/3.%20Middle%20Ages/Medieval%20Kingdoms/ (Depth: 2)
2025-07-10 11:28:22,591 - INFO - Accessing URL: https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/4.%20Early%20Modern/ (Depth: 1)
2025-07-10 11:28:23,045 - INFO - Accessing URL: https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/5.%20Ancient%20%26%20Classical%20Civilizations%20Series/ (Depth: 1)
2025-07-10 11:28:23,693 - INFO - Accessing URL: https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/6.%20Middle%20Ages%20Series/ (Depth: 1)
2025-07-10 11:28:24,772 - INFO - Accessing URL: https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/7.%20Early%20Modern%20Series/ (Depth: 1)
2025-07-10 11:28:25,349 - INFO - Found 497 RAR files.
2025-07-10 11:28:25,358 - INFO - Saved RAR file list to cache: rar_list_cache.json
2025-07-10 11:28:25,360 - INFO - --- Processing Downloaded RAR Files ---
2025-07-10 11:28:25,365 - INFO - --- Processing 1.%20Prehistory.rar ---
2025-07-10 11:28:25,393 - INFO - Attempting to download https://the-eye.eu/public/Books/Bibliotheca%20Alexandrina/1.%20Prehistory/1.%20Prehistory.rar to 1._20Prehistory.rar
Overall RAR Processing:   0%|          | 0/497 [00:00<?, ?it/s]
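
The log above reflects two patterns from `main()`: the RAR list is scanned once and cached as JSON, and each archive is then processed through a thread pool whose per-archive results are summed. The sketch below shows both patterns in isolation, using only the standard library. The names `load_or_scan` and `run_all` are hypothetical, `scan` stands in for `get_file_list(BASE_URL)`, and `worker` stands in for `process_single_rar`; unlike the notebook, this version does not catch errors while writing the cache.

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def load_or_scan(cache_path, scan):
    """Return the cached list if it is readable and non-empty, else scan and cache."""
    if os.path.exists(cache_path):
        try:
            with open(cache_path, "r", encoding="utf-8") as f:
                cached = json.load(f)
            if cached:
                return cached
        except (OSError, json.JSONDecodeError):
            pass  # Unreadable cache: fall through to a fresh scan.
    items = scan()
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(items, f)
    return items


def run_all(items, worker, max_workers=1):
    """Run worker(item) for each item, summing integer results and collecting failures."""
    total = 0
    failed = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(worker, item): item for item in items}
        for future in as_completed(futures):
            try:
                total += future.result()
            except Exception:
                # One failing item must not abort the remaining work.
                failed.append(futures[future])
    return total, failed
```

An empty cached list is treated the same as a missing cache, which mirrors the `if not rar_files` check in `main()`. With `max_workers=1` the pool processes one archive at a time, which limits peak disk and GPU usage while keeping the same control flow.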

**Reasoning**:
Search online for Nougat's Python API documentation or examples to determine if it can be used without subprocess.

