# SKOL IV: All the Data

In [1]:
bahir_package = 'org.apache.bahir:spark-sql-cloudant_2.12:2.4.0'
!spark-shell --packages $bahir_package < /dev/null

25/11/16 23:48:58 WARN Utils: Your hostname, puchpuchobs resolves to a loopback address: 127.0.1.1; using 172.16.227.68 instead (on interface wlp130s0f0)
25/11/16 23:48:58 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
:: loading settings :: url = jar:file:/data/piggy/miniconda3/envs/skol/lib/python3.13/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/piggy/.ivy2/cache
The jars for the packages stored in: /home/piggy/.ivy2/jars
org.apache.bahir#spark-sql-cloudant_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-8fb8350f-3fcb-45b7-9a56-19eab6e17792;1.0
	confs: [default]
	found org.apache.bahir#spark-sql-cloudant_2.12;2.4.0 in central
	found org.apache.bahir#bahir-common_2.12;2.4.0 in central
	found org.spark-project.spark#unused;1.0.0 in central
	found com.cloudant#cloudant-client;2.17.0 in central
	found com.google.code.gson#gson;2.8.2 in central
	fou

In [2]:
from io import BytesIO
import json
import os
from pathlib import Path, PurePath
import requests
import shutil
import sys
import tempfile
from typing import Any, Dict, Iterator, List, Optional
from urllib.robotparser import RobotFileParser

# Be sure to get version 2: https://simple-repository.app.cern.ch/project/bibtexparser/2.0.0b8/description
import bibtexparser
import couchdb
import feedparser
import fitz # PyMuPDF
from pyspark.ml import PipelineModel  # needed by SkolClassifier.load_from_redis
from pyspark.sql import SparkSession, DataFrame, Row
from pyspark.sql.functions import col, lit, udf
from pyspark.sql.types import (
    BooleanType, IntegerType, NullType, StringType, StructType, StructField 
)
import redis
from uuid import uuid4

# Import the SKOL classifier jupyter/ist769_skol.ipynb
from skol_classifier import SkolClassifier as SC, get_file_list
from skol_classifier.preprocessing import SuffixTransformer, ParagraphExtractor
from skol_classifier.utils import calculate_stats


  __version__ = __import__('pkg_resources').get_distribution('CouchDB').version


In [3]:
couchdb_host = "127.0.0.1:5984" # e.g., "ACCOUNT.cloudant.com" or "localhost"
couchdb_username = "admin"
couchdb_password = "SU2orange!"
ingest_db_name = "skol_dev"
taxon_db_name = "skol_taxa_dev"

spark = SparkSession \
    .builder \
    .appName("CouchDB Spark SQL Example in Python using dataframes") \
    .master("local[2]") \
    .config("cloudant.protocol", "http") \
    .config("cloudant.host", couchdb_host) \
    .config("cloudant.username", couchdb_username) \
    .config("cloudant.password", couchdb_password) \
    .config("spark.jars.packages", bahir_package) \
    .config("spark.driver.memory", "16g") \
    .config("spark.executor.memory", "20g") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR") # Keeps the noise down!!!

couch = couchdb.Server(f'http://{couchdb_username}:{couchdb_password}@{couchdb_host}')
if ingest_db_name not in couch:
    db = couch.create(ingest_db_name)
else:
    db = couch[ingest_db_name]

user_agent = "synoptickeyof.life"

ingenta_rp = RobotFileParser()
ingenta_rp.set_url("https://www.ingentaconnect.com/robots.txt")
ingenta_rp.read() # Reads and parses the robots.txt file from the URL

25/11/16 23:49:04 WARN Utils: Your hostname, puchpuchobs resolves to a loopback address: 127.0.1.1; using 172.16.227.68 instead (on interface wlp130s0f0)
25/11/16 23:49:04 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


:: loading settings :: url = jar:file:/data/piggy/miniconda3/envs/skol/lib/python3.13/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/piggy/.ivy2/cache
The jars for the packages stored in: /home/piggy/.ivy2/jars
org.apache.bahir#spark-sql-cloudant_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-83fbe488-4259-4aa8-b01f-5929c91ed69d;1.0
	confs: [default]
	found org.apache.bahir#spark-sql-cloudant_2.12;2.4.0 in central
	found org.apache.bahir#bahir-common_2.12;2.4.0 in central
	found org.spark-project.spark#unused;1.0.0 in central
	found com.cloudant#cloudant-client;2.17.0 in central
	found com.google.code.gson#gson;2.8.2 in central
	found commons-codec#commons-codec;1.6 in central
	found com.cloudant#cloudant-http;2.17.0 in central
	found commons-io#commons-io;2.4 in central
	found com.squareup.okhttp3#okhttp;3.12.2 in central
	found com.squareup.okio#okio;1.15.0 in central
	found com.typesafe#config;1.3.1 in central
	found org.scalaj#scalaj-http_2.12;2.3.0 in central
:: resolution report :: resolve 212ms :: artifacts dl 8ms
	:: modules in use

## The Data Sources

The goal is to collect all the open access taxonomic literature in mycology. The sources below cover mainly macro-fungi and slime molds.

### Ingested Data Sources

* [Mycotaxon at Ingenta Connect](https://www.ingentaconnect.com/content/mtax/mt)
* [Studies in Mycology at Ingenta Connect](https://www.studiesinmycology.org/)

### Source of many older public domain and open access works

MykoWeb includes scans of many older works in mycology. I have local copies but need to ingest them.

* [MykoWeb](https://mykoweb.com/)

### Journals in hand

These are journals we've collected over the years. The initial annotated issues are from early years of Mycotaxon. We still need to ingest all of these.

* Mycologia (back issues)
* [Mycologia at Taylor and Francis](https://www.tandfonline.com/journals/umyc20)
  Mycologia is the main journal of the Mycological Society of America. It is a mix of open access and traditional access articles. The connector for this journal will need to identify the open access articles.
* Persoonia (all issues)
  Persoonia is no longer published.
* Mycotaxon (back issues)
  Mycotaxon is no longer published.

### Journals that need connectors

These are journals that we know include open access articles.

* [Amanitaceae.org](http://www.tullabs.com/amanita/?home)
* [Mycosphere](https://mycosphere.org/)
* [Mycoscience](https://mycoscience.org/)
* [Journal of Fungi](https://www.mdpi.com/journal/jof)
* [Mycology](https://www.tandfonline.com/journals/tmyc20)
* [Open Access Journal of Mycology & Mycological Sciences](https://www.medwinpublishers.com/OAJMMS/)
* [MycoKeys](https://mycokeys.pensoft.net/)


## Ingestion

Each journal or other data source gets an ingester that puts PDFs into our document store along with any metadata we can collect. The metadata should be sufficient to create a citation for each issue, book, or article. If BibTeX citations are available, we prefer to store them verbatim.
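
As a rough illustration of the record an ingester produces, the sketch below builds a document shaped like the ones written later in this notebook: a generated `_id`, the source metadata under `meta`, the `pdf_url`, and the BibTeX fields copied to the top level. The helper name `make_record` and all sample values are hypothetical; the PDF itself is attached in a separate step.

```python
import json
from typing import Any, Dict
from uuid import uuid4


def make_record(bib_fields: Dict[str, str], meta: Dict[str, Any]) -> Dict[str, Any]:
    """Build a document-store record from BibTeX fields (hypothetical helper)."""
    record: Dict[str, Any] = {
        "_id": uuid4().hex,  # random 32-character hex identifier
        "meta": meta,        # where the entry came from
        "pdf_url": f"{bib_fields['url']}?crawler=true",
    }
    # Copy the BibTeX fields verbatim so a citation can be rebuilt later.
    record.update(bib_fields)
    return record


# Sample values are invented for illustration only.
sample = make_record(
    {
        "title": "An example article",
        "year": "1999",
        "url": "https://example.org/article/1",
    },
    meta={"feed": {"title": "Example feed"}},
)
print(json.dumps(sample, indent=2))
```

Keeping the BibTeX fields at the top level means a query on, say, `pdf_url` or `year` does not need to know which ingester wrote the record.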

### Ingenta RSS ingestion

Ingenta Connect is an electronic publisher that holds two Mycology journals. New articles are available via RSS (Really Simple Syndication).
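
To make the RSS step concrete, here is a minimal sketch that reads item links from a feed and derives the matching BibTeX links by appending `?format=bib`, as the ingester below does. It uses only the standard library instead of `feedparser`, and the inline feed is a simplified, hypothetical sample; the real Ingenta feed may use a different RSS dialect with XML namespaces, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical, simplified feed used only for illustration.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example journal</title>
    <link>https://example.org/journal</link>
    <description>Hypothetical feed</description>
    <item>
      <title>Volume 1, Number 1</title>
      <link>https://example.org/journal/v1/n1</link>
    </item>
    <item>
      <title>Volume 1, Number 2</title>
      <link>https://example.org/journal/v1/n2</link>
    </item>
  </channel>
</rss>
"""


def bibtex_links(rss_text: str) -> List[str]:
    """Return one BibTeX link per feed item."""
    root = ET.fromstring(rss_text)
    return [f"{item.findtext('link')}?format=bib" for item in root.iter("item")]


links = bibtex_links(SAMPLE_RSS)
for link in links:
    print(link)
```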

In [4]:
def ingest_from_bibtex(
        db: couchdb.Database,
        content: bytes,
        bibtex_link: str,
        meta: Dict[str, Any],
        rp
        ) -> None:
    """Load documents referenced in an Ingenta BibTeX database."""
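    # bibtexparser 2.x parses text, so decode raw HTTP bytes first (assumes UTF-8).
    if isinstance(content, bytes):
        content = content.decode('utf-8')
    bib_database = bibtexparser.parse_string(content)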

    bibtex_data = {
        'link': bibtex_link,
        'bibtex': bibtexparser.write_string(bib_database),
    }
    
    for bib_entry in bib_database.entries:
        doc = {
            '_id': uuid4().hex,
            'meta': meta,
            'pdf_url': f"{bib_entry['url']}?crawler=true",
        }

        # Do not fetch if we already have an entry.
        selector = {'selector': {'pdf_url': doc['pdf_url']}}
        found = False
        for e in db.find(selector):
            found = True
        if found:
            print(f"Skipping {doc['pdf_url']}")
            continue

        if not rp.can_fetch(user_agent, doc['pdf_url']):
            # TODO(piggy): We should probably record blocked URLs.
            print(f"Robot permission denied {doc['pdf_url']}")
            continue

        print(f"Adding {doc['pdf_url']}")
        for k in bib_entry.fields_dict.keys():
            doc[k] = bib_entry[k]
        
        doc_id, doc_rev = db.save(doc)
        with requests.get(doc['pdf_url'], stream=False) as pdf_f:
            pdf_f.raise_for_status()
            pdf_doc = pdf_f.content
        
        attachment_filename = 'article.pdf'
        attachment_content_type = 'application/pdf'
        attachment_file = BytesIO(pdf_doc)

        db.put_attachment(doc, attachment_file, attachment_filename, attachment_content_type)

        print("-" * 10)

In [5]:
def ingest_ingenta(
        db: couchdb.Database,
        rss_url: str,
        rp
) -> None:
    """Ingest documents from an Ingenta RSS feed."""

    feed = feedparser.parse(rss_url)
    
    feed_meta = {
        'url': rss_url,
        'title': feed.feed.title,
        'link': feed.feed.link,
        'description': feed.feed.description,
    }

    for entry in feed.entries:
        entry_meta = {
            'title': entry.title,
            'link': entry.link,
        }
        if hasattr(entry, 'summary'):
            entry_meta['summary'] = entry.summary
        if hasattr(entry, 'description'):
            entry_meta['description'] = entry.description

        bibtex_link = f'{entry.link}?format=bib'
        print(f"bibtex_link: {bibtex_link}")

        if not rp.can_fetch(user_agent, bibtex_link):
            print(f"Robot permission denied {bibtex_link}")
            continue

        with requests.get(bibtex_link, stream=False) as bibtex_f:
            bibtex_f.raise_for_status()  # Raise an HTTPError for bad responses (4xx or 5xx)

            ingest_from_bibtex(
                db=db,
                content=bibtex_f.content\
                    .replace(b"\"\nparent", b"\",\nparent")\
                    .replace(b"\n", b""),
                bibtex_link=bibtex_link,
                meta={
                    'feed': feed_meta,
                    'entry': entry_meta,
                },
                rp=rp
            )
        print("=" * 20)

In [6]:
def ingest_from_local_bibtex(
    db: couchdb.Database,
    root: Path,
    rp
) -> None:
    """Ingest from a local directory containing Ingenta BibTeX files."""
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith('format=bib'):
                continue
            full_filepath = os.path.join(dirpath, filename)
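            # Path relative to the mirror root; avoids a double slash in the URL.
            rel_path = Path(full_filepath).relative_to(root).as_posix()
            bibtex_link = f"https://www.ingentaconnect.com/{rel_path}"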
            with open(full_filepath) as f:
                content = f.read()\
                    .replace("\"\nparent", "\",\nparent")\
                    .replace("\n", "")
                ingest_from_bibtex(db, content, bibtex_link, meta={}, rp=rp)


In [7]:
# Mycotaxon
# ingest_ingenta(db=db, rss_url='https://api.ingentaconnect.com/content/mtax/mt?format=rss', rp=ingenta_rp)

In [8]:
# Studies in Mycology
# ingest_ingenta(db=db, rss_url='https://api.ingentaconnect.com/content/wfbi/sim?format=rss', rp=ingenta_rp)

In [9]:
# ingest_from_local_bibtex(
#     db=db,
#     root=Path("/data/skol/www/www.ingentaconnect.com"),
#     rp=ingenta_rp
# )

* Download the RSS feed.
* Read the BibTeX files and create a record for each article.
* Download the PDFs at the URLs in the BibTeX entries.
* Create a JSON record with the PDF as an attachment.
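
The ingesters above repair the downloaded BibTeX before parsing it: they insert a comma that is missing before a field starting with `parent`, then remove newlines. The sketch below shows that repair on a hypothetical entry and pulls out the `url` field with a regular expression. The sample entry, the helper names `repair_bibtex` and `extract_url`, and the regex are assumptions for illustration; the notebook itself relies on `bibtexparser` for parsing.

```python
import re
from typing import Optional

# Hypothetical entry with the missing comma after the "year" field.
RAW = (
    '@article{example:1999,\n'
    'title = "An example article",\n'
    'year = "1999"\n'
    'parent_itemid = "infobike://example/1",\n'
    'url = "https://example.org/article/1"\n'
    '}'
)


def repair_bibtex(text: str) -> str:
    """Insert the comma missing before a 'parent...' field, then drop newlines."""
    return text.replace('"\nparent', '",\nparent').replace("\n", "")


def extract_url(text: str) -> Optional[str]:
    """Return the value of the first url field, or None if absent."""
    match = re.search(r'url\s*=\s*"([^"]*)"', text)
    return match.group(1) if match else None


repaired = repair_bibtex(RAW)
pdf_url = f"{extract_url(repaired)}?crawler=true"
print(repaired)
print(pdf_url)
```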

### Text extraction

We extract the text from each PDF, optionally with OCR, and add it as an additional attachment on the source record.
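
The output format of the extraction step can be sketched without PyMuPDF: each page's text is preceded by a `--- PDF Page N ---` marker, and the text attachment takes the PDF's name with a `.txt` extension. The helpers `join_pages` and `text_attachment_name` are hypothetical names that mirror what `pdf_to_text` and `add_text_to_partition` do in the cell below; the page strings are made up.

```python
from pathlib import PurePath
from typing import List


def join_pages(pages: List[str]) -> bytes:
    """Concatenate per-page text with page markers and encode as UTF-8."""
    full_text = ""
    for page_num, text in enumerate(pages, start=1):
        full_text += f"\n--- PDF Page {page_num} ---\n"
        full_text += text
    return full_text.encode("utf-8")


def text_attachment_name(pdf_filename: str) -> str:
    """Map an attachment such as 'article.pdf' to 'article.txt'."""
    return PurePath(pdf_filename).stem + ".txt"


combined = join_pages(["first page", "second page"])
print(combined.decode("utf-8"))
print(text_attachment_name("article.pdf"))
```

The page markers make it possible to map a paragraph in the extracted text back to a page in the original PDF.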

In [10]:
# df = spark.read.load(
#     format="org.apache.bahir.cloudant",
#     database=ingest_db_name
# )

In [11]:
# df.describe()

In [12]:
# Content-Type: text/html; charset=UTF-8

def pdf_to_text(pdf_contents: bytes) -> bytes:
    doc = fitz.open(stream=BytesIO(pdf_contents), filetype="pdf")

    full_text = ''
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        # Possibly perform OCR on the page
        text = page.get_text("text", flags=fitz.TEXT_PRESERVE_WHITESPACE | fitz.TEXT_DEHYPHENATE)
        full_text += f"\n--- PDF Page {page_num+1} ---\n"
        full_text += text

    return full_text.encode("utf-8")

def add_text_to_partition(iterator) -> None:
    couch = couchdb.Server(f'http://{couchdb_username}:{couchdb_password}@{couchdb_host}')
    local_db = couch[ingest_db_name]
    for row in iterator:
        if not row:
            continue
        if not row._attachments:
            continue
        row_dict = row.asDict()
        attachment_dict = row._attachments.asDict()
        for pdf_filename in attachment_dict:
            pdf_path = PurePath(pdf_filename)
            if pdf_path.suffix != '.pdf':
                continue
            txt_path_str = pdf_path.stem + '.txt'
            if txt_path_str in attachment_dict:
                # TODO(piggy): Recalculate text if text is terrible. Too much noise vocabulary?
                print(f"Already have text for {row.pdf_url}")
                continue
            print(f"{row._id}, {row.pdf_url}")
            pdf_file = local_db.get_attachment(row._id, str(pdf_path)).read()
            txt_file = pdf_to_text(pdf_file)
            attachment_content_type = 'text/plain; charset=UTF-8'
            attachment_file = BytesIO(txt_file)
            local_db.put_attachment(row_dict, attachment_file, txt_path_str, attachment_content_type)
    


In [13]:
# df.select("*").foreachPartition(add_text_to_partition)
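
`add_text_to_partition` opens one CouchDB connection per partition and reuses it for every row. That pattern can be illustrated without Spark or CouchDB. In the sketch below, `FakeConnection`, `STORE`, and the lists standing in for partitions are all hypothetical; the point is only that the connection count tracks the number of partitions, not the number of rows.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


class FakeConnection:
    """Hypothetical stand-in for a database connection that counts opens."""

    opened = 0

    def __init__(self, store: Dict[str, str]) -> None:
        FakeConnection.opened += 1
        self.store = store

    def get(self, key: str) -> str:
        return self.store[key]


STORE = {"doc1": "alpha", "doc2": "beta", "doc3": "gamma"}


def fetch_partition(partition: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (doc_id, content) pairs using a single connection."""
    conn = FakeConnection(STORE)  # opened once per partition
    for doc_id in partition:
        try:
            yield doc_id, conn.get(doc_id)
        except KeyError as e:
            # Report the failure and keep processing the remaining rows.
            print(f"Error fetching {doc_id}: {e}")


# Two lists stand in for two Spark partitions.
partitions: List[List[str]] = [["doc1", "doc2"], ["doc3", "missing"]]
results = [row for part in partitions for row in fetch_partition(part)]
print(results)
print(f"connections opened: {FakeConnection.opened}")
```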

In [14]:
# Identical to skol_classifier.CouchDBConnection.
from skol_classifier import CouchDBConnection as CDBC

class CouchDBConnection(CDBC):
    """
    Manages CouchDB connection and provides I/O operations.

    This class encapsulates connection parameters and provides an idempotent
    connection method that can be safely called multiple times.
    """

    def __init__(
        self,
        couchdb_url: str,
        database: str,
        username: Optional[str] = None,
        password: Optional[str] = None
    ):
        """
        Initialize CouchDB connection parameters.

        Args:
            couchdb_url: CouchDB server URL (e.g., "http://localhost:5984")
            database: Database name
            username: Optional username for authentication
            password: Optional password for authentication
        """
        self.couchdb_url = couchdb_url
        self.database = database
        self.username = username
        self.password = password
        self._server = None
        self._db = None

    def _connect(self):
        """
        Idempotent connection method that returns a CouchDB server object.

        This method can be called multiple times safely - it will only create
        a connection if one doesn't already exist.

        Returns:
            couchdb.Server: Connected CouchDB server object
        """
        if self._server is None:
            self._server = couchdb.Server(self.couchdb_url)
            if self.username and self.password:
                self._server.resource.credentials = (self.username, self.password)

        if self._db is None:
            self._db = self._server[self.database]

        return self._server

    @property
    def db(self):
        """Get the database object, connecting if necessary."""
        if self._db is None:
            self._connect()
        return self._db

    def get_document_list(
        self,
        spark: SparkSession,
        pattern: str = "*.txt"
    ) -> DataFrame:
        """
        Get a list of documents with text attachments from CouchDB.

        This only fetches document metadata (not content) to create a DataFrame
        that can be processed in parallel. Creates ONE ROW per attachment, so if
        a document has multiple attachments matching the pattern, it will have
        multiple rows in the resulting DataFrame.

        Args:
            spark: SparkSession
            pattern: Pattern for attachment names (e.g., "*.txt")

        Returns:
            DataFrame with columns: doc_id, attachment_name
            One row per (doc_id, attachment_name) pair
        """
        # Connect to CouchDB (driver only)
        db = self.db

        # Get all documents with attachments matching pattern
        doc_list = []
        for doc_id in db:
            try:
                doc = db[doc_id]
                attachments = doc.get('_attachments', {})

                # Loop through ALL attachments in the document
                for att_name in attachments.keys():
                    # Check if attachment matches pattern
                    # Pattern matching: "*.txt" matches files ending with .txt
                    if pattern == "*.txt" and att_name.endswith('.txt'):
                        doc_list.append((doc_id, att_name))
                    elif pattern == "*.*" or pattern == "*":
                        # Match all attachments
                        doc_list.append((doc_id, att_name))
                    elif pattern.startswith("*.") and att_name.endswith(pattern[1:]):
                        # Generic pattern matching for *.ext
                        doc_list.append((doc_id, att_name))
            except Exception:
                # Skip documents we can't read
                continue

        # Create DataFrame with document IDs and attachment names
        schema = StructType([
            StructField("doc_id", StringType(), False),
            StructField("attachment_name", StringType(), False)
        ])

        return spark.createDataFrame(doc_list, schema)

    def fetch_partition(
        self,
        partition: Iterator[Row]
    ) -> Iterator[Row]:
        """
        Fetch CouchDB attachments for an entire partition.

        This function is designed to be used with foreachPartition or mapPartitions.
        It creates a single CouchDB connection per partition and reuses it for all rows.

        Args:
            partition: Iterator of Rows with doc_id and attachment_name

        Yields:
            Rows with doc_id, attachment_name, and value (content)
        """
        # Connect to CouchDB once per partition
        try:
            db = self.db

            # Process all rows in partition with same connection
            # Note: Each row represents one (doc_id, attachment_name) pair
            # If a document has multiple .txt attachments, there will be multiple rows
            for row in partition:
                try:
                    doc = db[row.doc_id]

                    # Get the specific attachment for this row
                    if row.attachment_name in doc.get('_attachments', {}):
                        attachment = db.get_attachment(doc, row.attachment_name)
                        if attachment:
                            content = attachment.read().decode('utf-8', errors='ignore')
                            yield Row(
                                doc_id=row.doc_id,
                                attachment_name=row.attachment_name,
                                value=content
                            )
                except Exception as e:
                    # Log error but continue processing
                    print(f"Error fetching {row.doc_id}/{row.attachment_name}: {e}")
                    continue

        except Exception as e:
            print(f"Error connecting to CouchDB: {e}")
            return

    def save_partition(
        self,
        partition: Iterator[Row],
        suffix: str = ".ann"
    ) -> Iterator[Row]:
        """
        Save annotated content to CouchDB for an entire partition.

        This function is designed to be used with foreachPartition or mapPartitions.
        It creates a single CouchDB connection per partition and reuses it for all rows.

        Args:
            partition: Iterator of Rows with doc_id, attachment_name, final_aggregated_pg
            suffix: Suffix to append to attachment names

        Yields:
            Rows with doc_id, attachment_name, and success status
        """
        # Connect to CouchDB once per partition
        try:
            db = self.db

            # Process all rows in partition with same connection
            # Note: Each row represents one (doc_id, attachment_name) pair
            # If a document had multiple .txt files, we save multiple .ann files
            for row in partition:
                success = False
                try:
                    doc = db[row.doc_id]

                    # Create new attachment name by appending suffix
                    # e.g., "article.txt" becomes "article.txt.ann"
                    new_attachment_name = f"{row.attachment_name}{suffix}"

                    # Save the annotated content as a new attachment
                    db.put_attachment(
                        doc,
                        row.final_aggregated_pg.encode('utf-8'),
                        filename=new_attachment_name,
                        content_type='text/plain'
                    )

                    success = True

                except Exception as e:
                    print(f"Error saving {row.doc_id}/{row.attachment_name}: {e}")

                yield Row(
                    doc_id=row.doc_id,
                    attachment_name=row.attachment_name,
                    success=success
                )

        except Exception as e:
            print(f"Error connecting to CouchDB: {e}")
            # Yield failures for all rows
            for row in partition:
                yield Row(
                    doc_id=row.doc_id,
                    attachment_name=row.attachment_name,
                    success=False
                )

    def load_distributed(
        self,
        spark: SparkSession,
        pattern: str = "*.txt"
    ) -> DataFrame:
        """
        Load text attachments from CouchDB using mapPartitions.

        This function:
        1. Gets list of documents (on driver)
        2. Creates a DataFrame with doc IDs
        3. Uses mapPartitions to fetch content efficiently (one connection per partition)

        Args:
            spark: SparkSession
            pattern: Pattern for attachment names

        Returns:
            DataFrame with columns: doc_id, attachment_name, value
        """
        # Get document list
        doc_df = self.get_document_list(spark, pattern)

        # Use mapPartitions for efficient batch fetching
        # Create new connection instance with same params for workers
        conn_params = (self.couchdb_url, self.database, self.username, self.password)

        def fetch_partition(partition):
            # Each worker creates its own connection
            conn = CouchDBConnection(*conn_params)
            return conn.fetch_partition(partition)

        # Define output schema
        schema = StructType([
            StructField("doc_id", StringType(), False),
            StructField("attachment_name", StringType(), False),
            StructField("value", StringType(), False)
        ])

        # Apply mapPartitions
        result_df = doc_df.rdd.mapPartitions(fetch_partition).toDF(schema)

        return result_df

    def save_distributed(
        self,
        df: DataFrame,
        suffix: str = ".ann"
    ) -> DataFrame:
        """
        Save annotated predictions to CouchDB using mapPartitions.

        This function uses mapPartitions where each partition creates a single
        CouchDB connection and reuses it for all rows.

        Args:
            df: DataFrame with columns: doc_id, attachment_name, final_aggregated_pg
            suffix: Suffix to append to attachment names

        Returns:
            DataFrame with doc_id, attachment_name, and success columns
        """
        # Use mapPartitions for efficient batch saving
        # Create new connection instance with same params for workers
        conn_params = (self.couchdb_url, self.database, self.username, self.password)

        def save_partition(partition):
            # Each worker creates its own connection
            conn = CouchDBConnection(*conn_params)
            return conn.save_partition(partition, suffix)

        # Define output schema
        schema = StructType([
            StructField("doc_id", StringType(), False),
            StructField("attachment_name", StringType(), False),
            StructField("success", BooleanType(), False)
        ])

        # Apply mapPartitions
        result_df = df.rdd.mapPartitions(save_partition).toDF(schema)

        return result_df

    def process_partition_with_func(
        self,
        partition: Iterator[Row],
        processor_func,
        suffix: str = ".ann"
    ) -> Iterator[Row]:
        """
        Generic function to read, process, and save in one partition operation.

        This allows custom processing logic while maintaining single connection per partition.

        Args:
            partition: Iterator of Rows
            processor_func: Function to process content (takes content string, returns processed string)
            suffix: Suffix for output attachment

        Yields:
            Rows with processing results
        """
        try:
            db = self.db

            for row in partition:
                try:
                    doc = db[row.doc_id]

                    # Fetch
                    if row.attachment_name in doc.get('_attachments', {}):
                        attachment = db.get_attachment(doc, row.attachment_name)
                        if attachment:
                            content = attachment.read().decode('utf-8', errors='ignore')

                            # Process
                            processed = processor_func(content)

                            # Save
                            new_attachment_name = f"{row.attachment_name}{suffix}"
                            db.put_attachment(
                                doc,
                                processed.encode('utf-8'),
                                filename=new_attachment_name,
                                content_type='text/plain'
                            )

                            yield Row(
                                doc_id=row.doc_id,
                                attachment_name=row.attachment_name,
                                success=True
                            )
                            continue

                except Exception as e:
                    print(f"Error processing {row.doc_id}/{row.attachment_name}: {e}")

                yield Row(
                    doc_id=row.doc_id,
                    attachment_name=row.attachment_name,
                    success=False
                )

        except Exception as e:
            print(f"Error connecting to CouchDB: {e}")
            for row in partition:
                yield Row(
                    doc_id=row.doc_id,
                    attachment_name=row.attachment_name,
                    success=False
                )



In [15]:
"""
Main classifier module for SKOL text classification
"""
class SkolClassifier(SC):
    """
    Text classifier for taxonomic literature.

    Supports multiple classification models (Logistic Regression, Random Forest)
    and feature types (word TF-IDF, suffix TF-IDF, combined).
    """
    def save_to_redis(
        self,
        redis_client: Optional[Any] = None,
        redis_key: Optional[str] = None
    ) -> bool:
        """
        Save the trained models to Redis.

        The models are saved to a temporary directory, then packaged and stored in Redis
        as a compressed binary blob along with metadata.

        Args:
            redis_client: Redis client (uses self.redis_client if not provided)
            redis_key: Redis key name (uses self.redis_key if not provided)

        Returns:
            True if successful, False otherwise

        Raises:
            ValueError: If no models are trained or Redis client is not available
        """
        if self.pipeline_model is None or self.classifier_model is None:
            raise ValueError(
                "No models to save. Train models using fit() or train_classifier() first."
            )

        client = redis_client or self.redis_client
        key = redis_key or self.redis_key

        if client is None:
            raise ValueError(
                "No Redis client available. Provide redis_client argument or "
                "initialize classifier with redis_client."
            )

        temp_dir = None
        try:
            # Create temporary directory for model files
            temp_dir = tempfile.mkdtemp(prefix="skol_model_")
            temp_path = Path(temp_dir)

            # Save pipeline model
            pipeline_path = temp_path / "pipeline_model"
            self.pipeline_model.save(str(pipeline_path))

            # Save classifier model
            classifier_path = temp_path / "classifier_model"
            self.classifier_model.save(str(classifier_path))

            # Save metadata (labels and model info)
            metadata = {
                "labels": self.labels,
                "version": "0.0.1"
            }
            metadata_path = temp_path / "metadata.json"
            with open(metadata_path, 'w') as f:
                json.dump(metadata, f)

            # Create archive in memory
            import io
            import tarfile

            archive_buffer = io.BytesIO()
            with tarfile.open(fileobj=archive_buffer, mode='w:gz') as tar:
                tar.add(temp_path, arcname='.')

            # Get compressed data
            archive_data = archive_buffer.getvalue()

            # Save to Redis
            client.set(key, archive_data)

            return True

        except Exception as e:
            print(f"Error saving to Redis: {e}")
            return False

        finally:
            # Clean up temporary directory
            if temp_dir and Path(temp_dir).exists():
                shutil.rmtree(temp_dir)

    def load_from_redis(
        self,
        redis_client: Optional[Any] = None,
        redis_key: Optional[str] = None
    ) -> bool:
        """
        Load trained models from Redis.

        Args:
            redis_client: Redis client (uses self.redis_client if not provided)
            redis_key: Redis key name (uses self.redis_key if not provided)

        Returns:
            True if successful, False otherwise

        Raises:
            ValueError: If Redis client is not available or key doesn't exist
        """
        client = redis_client or self.redis_client
        key = redis_key or self.redis_key

        if client is None:
            raise ValueError(
                "No Redis client available. Provide redis_client argument or "
                "initialize classifier with redis_client."
            )

        temp_dir = None
        try:
            # Retrieve from Redis
            archive_data = client.get(key)
            if archive_data is None:
                raise ValueError(f"No model found in Redis with key: {key}")

            # Create temporary directory for extraction
            temp_dir = tempfile.mkdtemp(prefix="skol_model_load_")
            temp_path = Path(temp_dir)

            # Extract archive
            import io
            import tarfile

            archive_buffer = io.BytesIO(archive_data)
            with tarfile.open(fileobj=archive_buffer, mode='r:gz') as tar:
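                # The 'data' filter rejects unsafe members (Python 3.12+ and recent patch releases).
                tar.extractall(temp_path, filter='data')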

            # Load pipeline model
            pipeline_path = temp_path / "pipeline_model"
            self.pipeline_model = PipelineModel.load(str(pipeline_path))

            # Load classifier model
            classifier_path = temp_path / "classifier_model"
            self.classifier_model = PipelineModel.load(str(classifier_path))

            # Load metadata
            metadata_path = temp_path / "metadata.json"
            with open(metadata_path, 'r') as f:
                metadata = json.load(f)
                self.labels = metadata.get("labels")

            return True

        except Exception as e:
            print(f"Error loading from Redis: {e}")
            return False

        finally:
            # Clean up temporary directory
            if temp_dir and Path(temp_dir).exists():
                shutil.rmtree(temp_dir)

    def load_from_couchdb(
        self,
        couchdb_url: str,
        database: str,
        username: Optional[str] = None,
        password: Optional[str] = None,
        pattern: str = "*.txt"
    ) -> DataFrame:
        """
        Load raw text from CouchDB attachments using distributed UDFs.

        This method uses Spark UDFs to fetch attachments in parallel across workers,
        rather than loading all data on the driver.

        Args:
            couchdb_url: CouchDB server URL (e.g., "http://localhost:5984")
            database: Database name
            username: Optional username for authentication
            password: Optional password for authentication
            pattern: Pattern for attachment names (default: "*.txt")

        Returns:
            DataFrame with columns: doc_id, attachment_name, value
        """
        conn = CouchDBConnection(couchdb_url, database, username, password)
        return conn.fetch_partition(self.spark, pattern)
    

    def predict_from_couchdb(
        self,
        couchdb_url: str,
        database: str,
        username: Optional[str] = None,
        password: Optional[str] = None,
        pattern: str = "*.txt",
        output_format: str = "annotated"
    ) -> DataFrame:
        """
        Load text from CouchDB, predict labels, and return predictions.

        Args:
            couchdb_url: CouchDB server URL
            database: Database name
            username: Optional username
            password: Optional password
            pattern: Pattern for attachment names
            output_format: Output format ('annotated' or 'simple')

        Returns:
            DataFrame with predictions, including doc_id and attachment_name
        """
        if self.pipeline_model is None or self.classifier_model is None:
            raise ValueError(
                "Models not trained. Call fit_features() and train_classifier() first."
            )

        # Load data from CouchDB
        df = self.load_from_couchdb(
            couchdb_url, database, username, password, pattern
        )

        # Process paragraphs
        from .preprocessing import ParagraphExtractor
        from pyspark.sql.types import ArrayType, StringType
        from pyspark.sql.window import Window

        heuristic_udf = udf(
            ParagraphExtractor.extract_heuristic_paragraphs,
            ArrayType(StringType())
        )

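        # Window specification for ordering paragraphs within each attachment
        window_spec = Window.partitionBy("doc_id", "attachment_name").orderBy("start_idx")

        # Group and extract paragraphs. posexplode records each paragraph's
        # position as start_idx, so row_number follows document order
        # (a constant start_idx would leave the ordering undefined).
        from pyspark.sql.functions import posexplode

        grouped_df = (
            df.groupBy("doc_id", "attachment_name")
            .agg(collect_list("value").alias("lines"))
            .select(
                "doc_id",
                "attachment_name",
                posexplode(heuristic_udf(col("lines"))).alias("start_idx", "value")
            )
            .filter(trim(col("value")) != "")
            .withColumn("row_number", row_number().over(window_spec))
        )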

        # Extract features
        features = self.pipeline_model.transform(grouped_df)

        # Predict
        predictions = self.classifier_model.transform(features)

        # Convert label indices to strings
        from pyspark.ml.feature import IndexToString

        converter = IndexToString(
            inputCol="prediction",
            outputCol="predicted_label",
            labels=self.labels
        )
        labeled_predictions = converter.transform(predictions)

        # Format output
        if output_format == "annotated":
            labeled_predictions = labeled_predictions.withColumn(
                "annotated_pg",
                concat(
                    lit("[@ "),
                    col("value"),
                    lit("#"),
                    col("predicted_label"),
                    lit("]")
                )
            )

        return labeled_predictions

    def save_to_couchdb(
        self,
        predictions: DataFrame,
        couchdb_url: str,
        database: str,
        username: Optional[str] = None,
        password: Optional[str] = None,
        suffix: str = ".ann"
    ) -> List[Dict[str, Any]]:
        """
        Save annotated predictions back to CouchDB using distributed UDFs.

        This method uses Spark UDFs to save attachments in parallel across workers,
        distributing the write operations.

        Args:
            predictions: DataFrame with predictions (must include annotated_pg column)
            couchdb_url: CouchDB server URL
            database: Database name
            username: Optional username
            password: Optional password
            suffix: Suffix to append to attachment names (default: ".ann")

        Returns:
            List of results from CouchDB operations
        """
        conn = CouchDBConnection(couchdb_url, database, username, password)

        # Aggregate paragraphs by document and attachment
        aggregated_df = (
            predictions.groupBy("doc_id", "attachment_name")
            .agg(
                expr("sort_array(collect_list(struct(row_number, annotated_pg))) AS sorted_list")
            )
            .withColumn("annotated_pg_ordered", expr("transform(sorted_list, x -> x.annotated_pg)"))
            .withColumn("final_aggregated_pg", expr("array_join(annotated_pg_ordered, '\n')"))
            .select("doc_id", "attachment_name", "final_aggregated_pg")
        )

        # Save to CouchDB using distributed UDF
        result_df = conn.save_distributed(aggregated_df, suffix)

        # Collect results on the driver
        return [
            {
                'doc_id': row.doc_id,
                'attachment_name': f"{row.attachment_name}{suffix}",
                'success': row.success
            }
            for row in result_df.collect()
        ]
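
The Redis persistence used by `load_from_redis` above comes down to round-tripping a gzipped tar archive through a single byte string. The following is a minimal sketch of that idea using only the standard library. A plain dictionary stands in for the Redis client, and the helper names (`pack_directory`, `unpack_archive`, `round_trip_labels`) and the key `skol:demo` are hypothetical, not part of the SKOL classifier.

```python
import io
import json
import tarfile
import tempfile
from pathlib import Path
from typing import Dict, List


def pack_directory(src: Path) -> bytes:
    """Pack the contents of a directory into an in-memory .tar.gz archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for item in sorted(src.iterdir()):
            # arcname keeps archive paths relative to the directory root.
            tar.add(str(item), arcname=item.name)
    return buffer.getvalue()


def unpack_archive(data: bytes, dest: Path) -> None:
    """Extract an in-memory .tar.gz archive into dest."""
    # Use the safer "data" extraction filter where this Python provides it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
        tar.extractall(dest, **kwargs)


def round_trip_labels(labels: List[str]) -> List[str]:
    """Write labels to metadata.json, archive it, restore it, and read it back."""
    store: Dict[str, bytes] = {}  # stand-in for a Redis client (assumption)
    with tempfile.TemporaryDirectory() as src_dir, \
            tempfile.TemporaryDirectory() as dst_dir:
        (Path(src_dir) / "metadata.json").write_text(json.dumps({"labels": labels}))
        store["skol:demo"] = pack_directory(Path(src_dir))
        unpack_archive(store["skol:demo"], Path(dst_dir))
        restored = json.loads((Path(dst_dir) / "metadata.json").read_text())
    return restored["labels"]
```

With a real client, `store[key] = data` and `store[key]` would become `client.set(key, data)` and `client.get(key)`. The client must be created with `decode_responses=False`, as in the notebook, so that the archive comes back as raw bytes.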

In [16]:
# Train classifier on annotated data and save to Redis
# Connect to Redis
redis_client = redis.Redis(
    host='localhost',
    port=6379,
    db=0,
    decode_responses=False
)
classifier_model_name = "skol:classifier:model:v1.0"

# Initialize classifier with Redis connection
classifier = SkolClassifier(
    spark=spark,
    redis_client=redis_client,
    redis_key=classifier_model_name,
    auto_load=False  # Don't auto-load, we want to train fresh
)

# Get annotated training files
annotated_path = Path.cwd().parent / "data" / "annotated"
print(f"Loading annotated files from: {annotated_path}")

if annotated_path.exists():
    annotated_files = get_file_list(str(annotated_path), pattern="**/*.ann")
    
    if len(annotated_files) > 0:
        print(f"Found {len(annotated_files)} annotated files")
        
        # Train the classifier
        print("Training classifier...")
        results = classifier.fit(annotated_files)
        
        print("Training complete!")
        print(f"  Accuracy: {results['accuracy']:.4f}")
        print(f"  F1 Score: {results['f1_score']:.4f}")
        print(f"  Labels: {classifier.labels}")
        
        # Save model to Redis
        print("\nSaving model to Redis...")
        if classifier.save_to_redis():
            print(f"✓ Model successfully saved to Redis with key: {classifier_model_name}.")
        else:
            print("✗ Failed to save model to Redis")
    else:
        print(f"No annotated files found in {annotated_path}")
else:
    print(f"Directory does not exist: {annotated_path}")
    print("Please ensure annotated training data is available.")

Loading annotated files from: /data/piggy/src/github.com/piggyatbaqaqi/skol/data/annotated
Found 190 annotated files
Training classifier...


  __version__ = __import__('pkg_resources').get_distribution('CouchDB').version

Test Accuracy: 0.9410
Test Precision: 0.9614
Test Recall: 0.9650
Test F1 Score: 0.9407



Training complete!
  Accuracy: 0.9410
  F1 Score: 0.9407
  Labels: ['Misc-exposition', 'Description', 'Nomenclature']

Saving model to Redis...
Error saving to Redis: ('Pipeline write will fail on this pipeline because stage %s of type %s is not MLWritable', 'SuffixTransformer_6aa87b223488', <class 'skol_classifier.preprocessing.SuffixTransformer'>)
✗ Failed to save model to Redis


                                                                                

## Extract the taxa names and descriptions

We use a classifier to extract taxon names and descriptions from articles, issues, and books. Earlier versions of the project added YEDDA annotations to the text. New to this project is saving the metadata, taxon names, and their descriptions directly to a database. Also new is saving the trained model to Redis.
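
As a small illustration of the annotation step, the sketch below writes and parses strings in the `[@ text#Label]` shape that `predict_from_couchdb` builds with `concat`. The function names and sample paragraphs are hypothetical. The optional trailing `*` in the pattern is an assumption about other YEDDA-style files and is not something this notebook emits.

```python
import re
from typing import List, Tuple

# Matches "[@ text#Label]"; the optional "*" before "]" is an assumption.
ANNOTATION_RE = re.compile(
    r"\[@\s?(?P<text>.*?)#(?P<label>[A-Za-z-]+)\*?\]",
    re.DOTALL,
)


def annotate(text: str, label: str) -> str:
    """Wrap a paragraph in the annotation shape used by the notebook."""
    return f"[@ {text}#{label}]"


def parse_annotations(blob: str) -> List[Tuple[str, str]]:
    """Return (paragraph, label) pairs found in an annotated document."""
    return [
        (match.group("text"), match.group("label"))
        for match in ANNOTATION_RE.finditer(blob)
    ]


# Illustrative input, not taken from the project's data.
annotated = "\n".join([
    annotate("Agaricus campestris L.", "Nomenclature"),
    annotate("Pileus 5-10 cm broad, convex.", "Description"),
])
pairs = parse_annotations(annotated)
```

Here `pairs` holds one `(paragraph, label)` tuple per annotated block, in document order.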

We use CouchDB to store a full record for each taxon. We copy all metadata to the taxon records.
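
A minimal sketch of what copying the metadata into a taxon record could look like follows. The field names `taxon`, `description`, and `source_doc_id` and the helper `build_taxon_record` are assumptions made for illustration; only `_id` and `_rev` are CouchDB's own reserved fields.

```python
import copy
from typing import Any, Dict
from uuid import uuid4


def build_taxon_record(
    source_metadata: Dict[str, Any],
    nomenclature: str,
    description: str,
) -> Dict[str, Any]:
    """Build a self-contained taxon document carrying the source metadata."""
    # Deep copy so nested metadata is not shared with the source document.
    record = copy.deepcopy(source_metadata)
    # _id and _rev belong to the source document, not to the new record.
    source_id = record.pop("_id", None)
    record.pop("_rev", None)
    record.update({
        "_id": uuid4().hex,          # fresh identifier for the taxon record
        "source_doc_id": source_id,  # hypothetical back-reference field
        "taxon": nomenclature,       # hypothetical field name
        "description": description,  # hypothetical field name
    })
    return record
```

The resulting dictionary could then be written with the `couchdb` library, for example `db.save(record)`, given an open database handle.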

## Bibliography

* DOI Foundation, "DOI Citation Formatter HTTP API", https://citation.doi.org/api-docs.html, accessed 2025-11-12.
* YEDDA annotation format (look up citation)


## Appendix: On the use of an AI Coder

Portions of this work were completed with the aid of Claude Code Pro. I wish to give a clarifying example of how I've used this very powerful tool, and reveal why I am comfortable with claiming authorship of the resulting code.

For this project I needed results from an earlier class project in which a trio of students built and evaluated models for classifying paragraphs. The earlier work was built as an IPython notebook, with many examples and inline code. Simply copying the earlier notebook would have introduced many irrelevant details and would not have furthered the overall project.

I asked Claude Code to translate the notebook into a module that I could import. It did a pretty good job. Without being told, it made a submodule, extracted the illustrative code as examples, wrote reasonable documentation, and created packaging for the module.

The skill level of the coding was roughly that of a highly disciplined average junior programmer. The architecture was simplistic and violated several design principles such as DRY. I requested specific refactorings, such as asking for a group of functions to be converted into an object that shared duplicated parameters.

The initial code used REST interfaces directly and read all the data onto a single machine, so it did not use PySpark correctly. Through a series of refactorings, I asked that the code use appropriate libraries, which I named, and create correct UDFs to execute transformations in parallel.

I walked the AI through creating an object that I could use to illustrate my use of the Redis and CouchDB interfaces, while leaving the irrelevant details in a separate library.

In short, I still had to understand good design principles. I had to be able to recognize where appropriate libraries were applicable. I still had to understand the frameworks I was working with.

I now have a strong understanding of the difference between "vibe coding" and AI-assisted software engineering. In my first 4 hours with Claude Code, I was able to produce roughly 4 days' worth of professional-grade working code.