# SKOL IV: All the Data

In [1]:
bahir_package = 'org.apache.bahir:spark-sql-cloudant_2.12:2.4.0'
!spark-shell --packages $bahir_package < /dev/null

25/12/05 21:43:58 WARN Utils: Your hostname, puchpuchobs resolves to a loopback address: 127.0.1.1; using 172.16.227.68 instead (on interface wlp130s0f0)
25/12/05 21:43:58 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
:: loading settings :: url = jar:file:/data/piggy/miniconda3/envs/skol/lib/python3.13/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/piggy/.ivy2/cache
The jars for the packages stored in: /home/piggy/.ivy2/jars
org.apache.bahir#spark-sql-cloudant_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-372124e1-6c86-41d0-a672-f0912fef7a53;1.0
	confs: [default]
	found org.apache.bahir#spark-sql-cloudant_2.12;2.4.0 in central
	found org.apache.bahir#bahir-common_2.12;2.4.0 in central
	found org.spark-project.spark#unused;1.0.0 in central
	found com.cloudant#cloudant-client;2.17.0 in central
	found com.google.code.gson#gson;2.8.2 in central
	fou

In [2]:
from io import BytesIO
import json
import hashlib
import os
from pathlib import Path, PurePath
import pickle
import requests
import shutil
import sys
import tempfile
from typing import Any, Dict, Iterator, List, Optional
from urllib.robotparser import RobotFileParser

# Be sure to get version 2: https://simple-repository.app.cern.ch/project/bibtexparser/2.0.0b8/description
import bibtexparser
import couchdb
import feedparser
import fitz # PyMuPDF

import pandas as pd  # TODO(piggy): Remove this dependency in favor of pure pyspark DataFrames.

from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import (
    Tokenizer, CountVectorizer, IDF, StringIndexer, VectorAssembler, IndexToString
)
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier

from pyspark.sql import SparkSession, DataFrame, Row
from pyspark.sql.functions import (
    input_file_name, collect_list, regexp_extract, col, udf,
    explode, trim, row_number, min, expr, concat, lit
)
from pyspark.sql.types import (
    ArrayType, BooleanType, IntegerType, MapType, NullType,
    StringType, StructType, StructField
)
from pyspark.sql.window import Window

import redis
import torch
from uuid import uuid4

# Local modules
current_dir = os.getcwd()
parent_dir = os.path.abspath(os.path.join(current_dir, os.pardir))
parent_path = Path(parent_dir)
if parent_dir not in sys.path:
    sys.path.append(parent_dir)

from couchdb_file import CouchDBFile as CDBF
from fileobj import FileObject
from finder import parse_annotated, remove_interstitials
import line
from line import Line

# Import SKOL classifiers
from skol_classifier.classifier_v2 import SkolClassifierV2 as SC
from skol_classifier.preprocessing import SuffixTransformer, ParagraphExtractor
from skol_classifier.model import SkolModel
from skol_classifier.output_formatters import CouchDBOutputWriter as CDBOW, YeddaFormatter
from skol_classifier.utils import calculate_stats, get_file_list

from taxon import group_paragraphs, Taxon

from taxa_json_translator import TaxaJSONTranslator as TJT



## Important constants

In [3]:
couchdb_host = "127.0.0.1:5984" # e.g., "ACCOUNT.cloudant.com" or "localhost"
couchdb_username = "admin"
couchdb_password = "SU2orange!"
ingest_db_name = "skol_dev"  # Development ingestion database
taxon_db_name = "skol_taxa_dev"  # Development Taxa database
json_taxon_db_name = "skol_taxa_full_dev"  # Development Taxa database with JSON translations

redis_host = 'localhost'
redis_port = 6379

embedding_name = 'skol:embedding:v1.3'
embedding_expire = 60 * 60 * 24  # Expire after 24 hours
classifier_model_name = "skol:classifier:model:v1.5"
classifier_model_expire = 60 * 60 * 24  # Expire after 1 day.

neo4j_uri = "bolt://localhost:7687"

couchdb_url = f'http://{couchdb_host}'

## robots.txt

We want to be a well-behaved web scraper. Respect `robots.txt`.
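
As a minimal sketch of how the permission check behaves, the example below parses a small set of rules from memory instead of fetching them. The rules and the `example.org` URLs are hypothetical and are not Ingenta's actual `robots.txt`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, parsed from a string so the example needs no network access.
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

demo_rp = RobotFileParser()
demo_rp.parse(rules)

demo_agent = "synoptickeyof.life"
allowed = demo_rp.can_fetch(demo_agent, "https://example.org/content/article.pdf")
blocked = demo_rp.can_fetch(demo_agent, "https://example.org/private/article.pdf")
print(allowed, blocked)
```

The ingesters below make the same `can_fetch` call before every request and skip any URL that is not permitted.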

In [4]:
user_agent = "synoptickeyof.life"

ingenta_rp = RobotFileParser()
ingenta_rp.set_url("https://www.ingentaconnect.com/robots.txt")
ingenta_rp.read() # Reads and parses the robots.txt file from the URL

In [5]:
spark = SparkSession \
    .builder \
    .appName("CouchDB Spark SQL Example in Python using dataframes") \
    .master("local[2]") \
    .config("cloudant.protocol", "http") \
    .config("cloudant.host", couchdb_host) \
    .config("cloudant.username", couchdb_username) \
    .config("cloudant.password", couchdb_password) \
    .config("spark.jars.packages", bahir_package) \
    .config("spark.driver.memory", "16g") \
    .config("spark.executor.memory", "20g") \
    .config("spark.submit.pyFiles",
            f'{parent_path / "line.py"},{parent_path / "fileobj.py"},'
            f'{parent_path / "couchdb_file.py"},{parent_path / "finder.py"},'
            f'{parent_path / "taxon.py"},{parent_path / "paragraph.py"},'
            f'{parent_path / "label.py"},{parent_path / "file.py"},'
            f'{parent_path / "extract_taxa_to_couchdb.py"}'
           ) \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR") # Keeps the noise down!!!

couch = couchdb.Server(couchdb_url)
couch.resource.credentials = (couchdb_username, couchdb_password)

if ingest_db_name not in couch:
    db = couch.create(ingest_db_name)
else:
    db = couch[ingest_db_name]

25/12/05 21:44:07 WARN Utils: Your hostname, puchpuchobs resolves to a loopback address: 127.0.1.1; using 172.16.227.68 instead (on interface wlp130s0f0)
25/12/05 21:44:07 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


:: loading settings :: url = jar:file:/data/piggy/miniconda3/envs/skol/lib/python3.13/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/piggy/.ivy2/cache
The jars for the packages stored in: /home/piggy/.ivy2/jars
org.apache.bahir#spark-sql-cloudant_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-b483d9ba-283f-4f28-8ca0-aaa6cae5e09e;1.0
	confs: [default]
	found org.apache.bahir#spark-sql-cloudant_2.12;2.4.0 in central
	found org.apache.bahir#bahir-common_2.12;2.4.0 in central
	found org.spark-project.spark#unused;1.0.0 in central
	found com.cloudant#cloudant-client;2.17.0 in central
	found com.google.code.gson#gson;2.8.2 in central
	found commons-codec#commons-codec;1.6 in central
	found com.cloudant#cloudant-http;2.17.0 in central
	found commons-io#commons-io;2.4 in central
	found com.squareup.okhttp3#okhttp;3.12.2 in central
	found com.squareup.okio#okio;1.15.0 in central
	found com.typesafe#config;1.3.1 in central
	found org.scalaj#scalaj-http_2.12;2.3.0 in central
:: resolution report :: resolve 248ms :: artifacts dl 8ms
	:: modules in use

In [6]:
# Connect to Redis
redis_client = redis.Redis(
    host=redis_host,
    port=redis_port,
    db=0,
    decode_responses=False
)

## The Data Sources

The goal is to collect all the open access taxonomic literature in Mycology. Most of the sources below mainly cover macro-fungi and slime molds.

### Ingested Data Sources

* [Mycotaxon at Ingenta Connect](https://www.ingentaconnect.com/content/mtax/mt)
* [Studies in Mycology at Ingenta Connect](https://www.studiesinmycology.org/)

### Source of many older public domain and open access works

MykoWeb includes scans of many older works in mycology. I have local copies but need to write ingesters for them.

* [MykoWeb](https://mykoweb.com/)

### Journals in hand

These are journals we've collected over the years. The initial annotated issues are from early years of Mycotaxon. We still need to write ingesters for all of these.

* Mycologia (back issues)
* [Mycologia at Taylor and Francis](https://www.tandfonline.com/journals/umyc20)
  Mycologia is the main journal of the Mycological Society of America. It is a mix of open access and traditional access articles. The connector for this journal will need to identify the open access articles.
* Persoonia (all issues)
  Persoonia is no longer published.
* Mycotaxon (back issues)
  Mycotaxon is no longer published.

### Journals that need connectors

These are journals that we know include open access articles.

* [Amanitaceae.org](http://www.tullabs.com/amanita/?home)
* [Mycosphere](https://mycosphere.org/)
* [Mycoscience](https://mycoscience.org/)
* [Journal of Fungi](https://www.mdpi.com/journal/jof)
* [Mycology](https://www.tandfonline.com/journals/tmyc20)
* [Open Access Journal of Mycology & Mycological Sciences](https://www.medwinpublishers.com/OAJMMS/)
* [MycoKeys](https://mycokeys.pensoft.net/)


## Ingestion

Each journal or other data source gets an ingester that puts PDFs into our document store along with any metadata we can collect. The metadata must be sufficient to create a citation for each issue, book, or article. If BibTeX citations are available, we prefer to store them verbatim.

### Ingenta RSS ingestion

Ingenta Connect is an electronic publisher that holds two Mycology journals. New articles are available via RSS (Really Simple Syndication).

In [6]:
def ingest_from_bibtex(
        db: couchdb.Database,
        content: bytes,
        bibtex_link: str,
        meta: Dict[str, Any],
        rp
        ) -> None:
    """Load documents referenced in an Ingenta BibTeX database."""
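    # bibtexparser.parse_string expects str, but callers may pass raw bytes.
    if isinstance(content, bytes):
        content = content.decode('utf-8')
    bib_database = bibtexparser.parse_string(content)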

    bibtex_data = {
        'link': bibtex_link,
        'bibtex': bibtexparser.write_string(bib_database),
    }

    for bib_entry in bib_database.entries:
        doc = {
            '_id': uuid4().hex,
            'meta': meta,
            'pdf_url': f"{bib_entry['url']}?crawler=true",
        }

        # Do not fetch if we already have an entry.
        selector = {'selector': {'pdf_url': doc['pdf_url']}}
        found = False
        for e in db.find(selector):
            found = True
        if found:
            print(f"Skipping {doc['pdf_url']}")
            continue

        if not rp.can_fetch(user_agent, doc['pdf_url']):
            # TODO(piggy): We should probably log blocked URLs.
            print(f"Robot permission denied {doc['pdf_url']}")
            continue

        print(f"Adding {doc['pdf_url']}")
        for k in bib_entry.fields_dict.keys():
            doc[k] = bib_entry[k]

        doc_id, doc_rev = db.save(doc)
        with requests.get(doc['pdf_url'], stream=False) as pdf_f:
            pdf_f.raise_for_status()
            pdf_doc = pdf_f.content

        attachment_filename = 'article.pdf'
        attachment_content_type = 'application/pdf'
        attachment_file = BytesIO(pdf_doc)

        db.put_attachment(doc, attachment_file, attachment_filename, attachment_content_type)

        print("-" * 10)
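
The duplicate check in `ingest_from_bibtex` can be isolated into a small helper. This is only a sketch: `already_ingested` and `FakeDatabase` are hypothetical names, and `FakeDatabase` is an in-memory stand-in that handles only simple equality selectors so the example runs without a CouchDB server.

```python
from itertools import islice
from typing import Any, Dict, Iterator, List


def already_ingested(db: Any, pdf_url: str) -> bool:
    """Return True if any stored document already has this pdf_url."""
    query = {'selector': {'pdf_url': pdf_url}, 'limit': 1}
    return any(True for _ in db.find(query))


class FakeDatabase:
    """Hypothetical stand-in for couchdb.Database (equality selectors only)."""

    def __init__(self, docs: List[Dict[str, Any]]) -> None:
        self.docs = docs

    def find(self, mango_query: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
        selector = mango_query['selector']
        limit = mango_query.get('limit', len(self.docs))
        matches = (
            doc for doc in self.docs
            if all(doc.get(key) == value for key, value in selector.items())
        )
        return islice(matches, limit)


fake_db = FakeDatabase([{'_id': 'a1', 'pdf_url': 'https://example.org/a.pdf?crawler=true'}])
seen = already_ingested(fake_db, 'https://example.org/a.pdf?crawler=true')
unseen = already_ingested(fake_db, 'https://example.org/b.pdf?crawler=true')
print(seen, unseen)
```

Asking for at most one match keeps the check cheap, since only the existence of a document matters.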

In [7]:
def ingest_ingenta(
        db: couchdb.Database,
        rss_url: str,
        rp
) -> None:
    """Ingest documents from an Ingenta RSS feed."""

    feed = feedparser.parse(rss_url)

    feed_meta = {
        'url': rss_url,
        'title': feed.feed.title,
        'link': feed.feed.link,
        'description': feed.feed.description,
    }

    for entry in feed.entries:
        entry_meta = {
            'title': entry.title,
            'link': entry.link,
        }
        if hasattr(entry, 'summary'):
            entry_meta['summary'] = entry.summary
        if hasattr(entry, 'description'):
            entry_meta['description'] = entry.description

        bibtex_link = f'{entry.link}?format=bib'
        print(f"bibtex_link: {bibtex_link}")

        if not rp.can_fetch(user_agent, bibtex_link):
            print(f"Robot permission denied {bibtex_link}")
            continue

        with requests.get(bibtex_link, stream=False) as bibtex_f:
            bibtex_f.raise_for_status()  # Raise an HTTPError for bad responses (4xx or 5xx)

            ingest_from_bibtex(
                db=db,
                content=bibtex_f.content\
                    .replace(b"\"\nparent", b"\",\nparent")\
                    .replace(b"\n", b""),
                bibtex_link=bibtex_link,
                meta={
                    'feed': feed_meta,
                    'entry': entry_meta,
                },
                rp=rp
            )
        print("=" * 20)

In [8]:
def ingest_from_local_bibtex(
    db: couchdb.Database,
    root: Path,
    rp
) -> None:
    """Ingest from a local directory containing Ingenta BibTeX files."""
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith('format=bib'):
                continue
            full_filepath = os.path.join(dirpath, filename)
            bibtex_link = f"https://www.ingentaconnect.com/{full_filepath[len(str(root)):]}"
            with open(full_filepath) as f:
                # Paper over a syntax problem in Ingenta Connect Bibtex files.
                content = f.read()\
                    .replace("\"\nparent", "\",\nparent")\
                    .replace("\n", "")
                ingest_from_bibtex(db, content, bibtex_link, meta={}, rp=rp)
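
Both ingesters apply the same two-step repair before parsing. The sketch below shows its effect on a made-up entry; `repair_ingenta_bibtex` is a hypothetical helper name and the sample record is invented for illustration.

```python
def repair_ingenta_bibtex(raw: str) -> str:
    """Insert the comma missing before the `parent` field, then drop newlines."""
    return raw.replace('"\nparent', '",\nparent').replace("\n", "")


# Made-up entry with the missing comma after the title field.
sample = (
    '@article{key1,\n'
    'title = "An example title"\n'
    'parent_itemid = "hypothetical-parent",\n'
    '}\n'
)

repaired = repair_ingenta_bibtex(sample)
print(repaired)
```

The first replacement must run before the newlines are removed, because it relies on the line break to locate the field boundary.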


### Text extraction

We extract the text from each PDF, optionally with OCR, and store it as an additional attachment on the source record.

In [9]:
df = spark.read.load(
    format="org.apache.bahir.cloudant",
    database=ingest_db_name
)

                                                                                

In [10]:
df.describe()

DataFrame[summary: string, _id: string, _rev: string, abstract: string, author: string, doi: string, eissn: string, issn: string, itemtype: string, journal: string, number: string, pages: string, parent_itemid: string, pdf_url: string, publication date: string, publishercode: string, title: string, url: string, volume: string, year: string]

In [11]:
# Content-Type: text/html; charset=UTF-8

def pdf_to_text(pdf_contents: bytes) -> bytes:
    doc = fitz.open(stream=BytesIO(pdf_contents), filetype="pdf")

    full_text = ''
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        # Possibly perform OCR on the page
        text = page.get_text("text", flags=fitz.TEXT_PRESERVE_WHITESPACE | fitz.TEXT_DEHYPHENATE)
        # full_text += f"\n--- PDF Page {page_num+1} ---\n"  # TODO(piggy): Introduce PDF page tracking in line-by-line and paragraph parsers.
        full_text += text

    return full_text.encode("utf-8")

def add_text_to_partition(iterator) -> None:
    couch = couchdb.Server(couchdb_url)
    couch.resource.credentials = (couchdb_username, couchdb_password)
    local_db = couch[ingest_db_name]
    for row in iterator:
        if not row:
            continue
        if not row._attachments:
            continue
        row_dict = row.asDict()
        attachment_dict = row._attachments.asDict()
        for pdf_filename in attachment_dict:
            pdf_path = PurePath(pdf_filename)
            if pdf_path.suffix != '.pdf':
                continue
            txt_path_str = pdf_path.stem + '.txt'
            # if txt_path_str in attachment_dict:
            #     # TODO(piggy): Recalculate text if text is terrible. Too much noise vocabulary?
            #     print(f"Already have text for {row.pdf_url}")
            #     continue
            print(f"{row._id}, {row.pdf_url}")
            pdf_file = local_db.get_attachment(row._id, str(pdf_path)).read()
            txt_file = pdf_to_text(pdf_file)
            attachment_content_type = 'text/plain; charset=UTF-8'
            attachment_file = BytesIO(txt_file)
            local_db.put_attachment(row_dict, attachment_file, txt_path_str, attachment_content_type)
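
The naming rule used above (only `.pdf` attachments are converted, and the text attachment reuses the PDF's stem) can be checked in isolation. `text_attachment_name` is a hypothetical helper written for this sketch, not part of the notebook's modules.

```python
from pathlib import PurePath
from typing import Optional


def text_attachment_name(attachment_name: str) -> Optional[str]:
    """Return the .txt attachment name for a PDF attachment, or None otherwise."""
    path = PurePath(attachment_name)
    if path.suffix != '.pdf':
        return None
    return path.stem + '.txt'


pdf_result = text_attachment_name('article.pdf')
non_pdf_result = text_attachment_name('article.txt')
print(pdf_result, non_pdf_result)
```

On the Spark side, the partition function would presumably be applied with something like `df.foreachPartition(add_text_to_partition)`, since it writes to CouchDB and returns nothing.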


In [12]:
# Identical to skol_classifier.CouchDBConnection.
from skol_classifier.couchdb_io import CouchDBConnection as CDBC

class CouchDBConnection(CDBC):
    """
    Manages CouchDB connection and provides I/O operations.

    This class encapsulates connection parameters and provides an idempotent
    connection method that can be safely called multiple times.
    """

    # Shared schema definitions (DRY principle)
    LOAD_SCHEMA = StructType([
        StructField("doc_id", StringType(), False),
        StructField("human_url", StringType(), False),
        StructField("attachment_name", StringType(), False),
        StructField("value", StringType(), False),
    ])

    SAVE_SCHEMA = StructType([
        StructField("doc_id", StringType(), False),
        StructField("attachment_name", StringType(), False),
        StructField("success", BooleanType(), False),
    ])

    def __init__(
        self,
        couchdb_url: str,
        database: str,
        username: Optional[str] = None,
        password: Optional[str] = None
    ):
        """
        Initialize CouchDB connection parameters.

        Args:
            couchdb_url: CouchDB server URL (e.g., "http://localhost:5984")
            database: Database name
            username: Optional username for authentication
            password: Optional password for authentication
        """
        self.couchdb_url = couchdb_url
        self.database = database
        self.username = username
        self.password = password
        self._server = None
        self._db = None

    def _connect(self):
        """
        Idempotent connection method that returns a CouchDB server object.

        This method can be called multiple times safely - it will only create
        a connection if one doesn't already exist.

        Returns:
            couchdb.Server: Connected CouchDB server object
        """
        if self._server is None:
            self._server = couchdb.Server(self.couchdb_url)
            if self.username and self.password:
                self._server.resource.credentials = (self.username, self.password)

        if self._db is None:
            self._db = self._server[self.database]

        return self._server

    @property
    def db(self):
        """Get the database object, connecting if necessary."""
        if self._db is None:
            self._connect()
        return self._db

    def get_all_doc_ids(self, pattern: str = "*") -> List[str]:
        """
        Get list of document IDs matching the pattern from CouchDB.

        Args:
            pattern: Pattern for document IDs (e.g., "taxon_*", "*")
                    - "*" matches all non-design documents
                    - "prefix*" matches documents starting with prefix
                    - "exact" matches exactly

        Returns:
            List of matching document IDs
        """
        db = self.db
        
        # Get all document IDs (excluding design documents)
        all_doc_ids = [doc_id for doc_id in list(db) if not doc_id.startswith('_design/')]

        # Filter by pattern
        if pattern == "*":
            # Return all non-design documents
            return all_doc_ids
        elif pattern.endswith('*'):
            # Prefix matching
            prefix = pattern[:-1]
            return [doc_id for doc_id in all_doc_ids if doc_id.startswith(prefix)]
        else:
            # Exact match
            return [doc_id for doc_id in all_doc_ids if doc_id == pattern]

    def get_document_list(
        self,
        spark: SparkSession,
        pattern: str = "*.txt"
    ) -> DataFrame:
        """
        Get a list of documents with text attachments from CouchDB.

        This only fetches document metadata (not content) to create a DataFrame
        that can be processed in parallel. Creates ONE ROW per attachment, so if
        a document has multiple attachments matching the pattern, it will have
        multiple rows in the resulting DataFrame.

        Args:
            spark: SparkSession
            pattern: Pattern for attachment names (e.g., "*.txt")

        Returns:
            DataFrame with columns: doc_id, attachment_name
            One row per (doc_id, attachment_name) pair
        """
        # Connect to CouchDB (driver only)
        db = self.db

        # Get all documents with attachments matching pattern
        doc_list = []
        for doc_id in db:
            try:
                doc = db[doc_id]
                attachments = doc.get('_attachments', {})

                # Loop through ALL attachments in the document
                for att_name in attachments.keys():
                    # Check if attachment matches pattern
                    # Pattern matching: "*.txt" matches files ending with .txt
                    if pattern == "*.txt" and att_name.endswith('.txt'):
                        doc_list.append((doc_id, att_name))
                    elif pattern == "*.*" or pattern == "*":
                        # Match all attachments
                        doc_list.append((doc_id, att_name))
                    elif pattern.startswith("*.") and att_name.endswith(pattern[1:]):
                        # Generic pattern matching for *.ext
                        doc_list.append((doc_id, att_name))
            except Exception:
                # Skip documents we can't read
                continue

        # Create DataFrame with document IDs and attachment names
        schema = StructType([
            StructField("doc_id", StringType(), False),
            StructField("attachment_name", StringType(), False)
        ])

        return spark.createDataFrame(doc_list, schema)

    def fetch_partition(
        self,
        partition: Iterator[Row]
    ) -> Iterator[Row]:
        """
        Fetch CouchDB attachments for an entire partition.

        This function is designed to be used with foreachPartition or mapPartitions.
        It creates a single CouchDB connection per partition and reuses it for all rows.

        Args:
            partition: Iterator of Rows with doc_id and attachment_name

        Yields:
            Rows with doc_id, human_url, attachment_name, and value (content).
        """
        # Connect to CouchDB once per partition
        try:
            db = self.db

            # Process all rows in partition with same connection
            # Note: Each row represents one (doc_id, attachment_name) pair
            # If a document has multiple .txt attachments, there will be multiple rows
            for row in partition:
                try:
                    doc = db[row.doc_id]

                    # Get the specific attachment for this row
                    if row.attachment_name in doc.get('_attachments', {}):
                        attachment = db.get_attachment(doc, row.attachment_name)
                        if attachment:
                            content = attachment.read().decode('utf-8', errors='ignore')

                            yield Row(
                                doc_id=row.doc_id,
                                human_url=doc.get('url', 'missing_human_url'),
                                attachment_name=row.attachment_name,
                                value=content
                            )
                except Exception as e:
                    # Log error but continue processing
                    print(f"Error fetching {row.doc_id}/{row.attachment_name}: {e}")
                    continue

        except Exception as e:
            print(f"Error connecting to CouchDB: {e}")
            return

    def save_partition(
        self,
        partition: Iterator[Row],
        suffix: str = ".ann"
    ) -> Iterator[Row]:
        """
        Save annotated content to CouchDB for an entire partition.

        This function is designed to be used with foreachPartition or mapPartitions.
        It creates a single CouchDB connection per partition and reuses it for all rows.

        Args:
            partition: Iterator of Rows with doc_id, attachment_name, final_aggregated_pg
                       and optionally human_url
            suffix: Suffix to append to attachment names

        Yields:
            Rows with doc_id, attachment_name, and success status.
        """
        # Connect to CouchDB once per partition
        try:
            db = self.db

            # Process all rows in partition with same connection
            # Note: Each row represents one (doc_id, attachment_name) pair
            # If a document had multiple .txt files, we save multiple .ann files
            for row in partition:
                success = False
                try:
                    doc = db[row.doc_id]

                    # Update human_url field if provided
                    if hasattr(row, 'human_url') and row.human_url:
                        doc['url'] = row.human_url
                        db.save(doc)
                        # Reload doc to get updated _rev
                        doc = db[row.doc_id]

                    # Create new attachment name by appending suffix
                    # e.g., "article.txt" becomes "article.txt.ann"
                    new_attachment_name = f"{row.attachment_name}{suffix}"

                    # Save the annotated content as a new attachment
                    db.put_attachment(
                        doc,
                        row.final_aggregated_pg.encode('utf-8'),
                        filename=new_attachment_name,
                        content_type='text/plain'
                    )

                    success = True

                except Exception as e:
                    print(f"Error saving {row.doc_id}/{row.attachment_name}: {e}")

                yield Row(
                    doc_id=row.doc_id,
                    attachment_name=row.attachment_name,
                    success=success
                )

        except Exception as e:
            print(f"Error connecting to CouchDB: {e}")
            # Yield failures for all rows
            for row in partition:
                yield Row(
                    doc_id=row.doc_id,
                    attachment_name=row.attachment_name,
                    success=False
                )

    def load_distributed(
        self,
        spark: SparkSession,
        pattern: str = "*.txt"
    ) -> DataFrame:
        """
        Load text attachments from CouchDB using mapPartitions.

        This function:
        1. Gets list of documents (on driver)
        2. Creates a DataFrame with doc IDs
        3. Uses mapPartitions to fetch content efficiently (one connection per partition)

        Args:
            spark: SparkSession
            pattern: Pattern for attachment names

        Returns:
            DataFrame with columns: doc_id, human_url, attachment_name, and value.
        """
        # Get document list
        doc_df = self.get_document_list(spark, pattern)

        # Use mapPartitions for efficient batch fetching
        # Create new connection instance with same params for workers
        conn_params = (self.couchdb_url, self.database, self.username, self.password)

        def fetch_partition(partition):
            # Each worker creates its own connection
            conn = CouchDBConnection(*conn_params)
            return conn.fetch_partition(partition)

        # Apply mapPartitions using shared schema
        result_df = doc_df.rdd.mapPartitions(fetch_partition).toDF(self.LOAD_SCHEMA)

        return result_df

    def save_distributed(
        self,
        df: DataFrame,
        suffix: str = ".ann"
    ) -> DataFrame:
        """
        Save annotated predictions to CouchDB using mapPartitions.

        This function uses mapPartitions where each partition creates a single
        CouchDB connection and reuses it for all rows.

        Args:
            df: DataFrame with columns: doc_id, attachment_name, final_aggregated_pg
            suffix: Suffix to append to attachment names

        Returns:
            DataFrame with doc_id, attachment_name, and success columns
        """
        # Use mapPartitions for efficient batch saving
        # Create new connection instance with same params for workers
        conn_params = (self.couchdb_url, self.database, self.username, self.password)

        def save_partition(partition):
            # Each worker creates its own connection
            conn = CouchDBConnection(*conn_params)
            return conn.save_partition(partition, suffix)

        # Apply mapPartitions using shared schema
        result_df = df.rdd.mapPartitions(save_partition).toDF(self.SAVE_SCHEMA)

        return result_df
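
The one-connection-per-partition idea behind `fetch_partition` and `save_partition` can be illustrated without Spark or CouchDB. In this sketch, `FakeConnection` and `fetch_partition_demo` are hypothetical, and plain lists stand in for the partition iterators that `mapPartitions` would supply.

```python
from typing import Iterable, Iterator, List, Tuple


class FakeConnection:
    """Hypothetical stand-in for a database connection; counts how many are opened."""

    opened = 0

    def __init__(self) -> None:
        FakeConnection.opened += 1

    def fetch(self, doc_id: str) -> str:
        return f"content of {doc_id}"


def fetch_partition_demo(partition: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Open one connection, then reuse it for every row in the partition."""
    conn = FakeConnection()
    for doc_id in partition:
        yield doc_id, conn.fetch(doc_id)


# Two "partitions" holding three document IDs in total.
partitions: List[List[str]] = [["a", "b"], ["c"]]
results = [row for part in partitions for row in fetch_partition_demo(part)]
print(results, FakeConnection.opened)
```

Three rows are fetched but only two connections are opened, which is the saving the partition-level functions aim for.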

In [13]:
class CouchDBOutputWriter(CDBOW):
    """
    Writes predictions back to CouchDB as attachments.
    """

    def __init__(
        self,
        couchdb_url: str,
        database: str,
        username: str,
        password: str
    ):
        """
        Initialize the writer.

        Args:
            couchdb_url: CouchDB server URL
            database: Database name
            username: CouchDB username
            password: CouchDB password
        """
        self.conn = CouchDBConnection(
            couchdb_url=couchdb_url,
            database=database,
            username=username,
            password=password
        )

    def save_annotated(
        self,
        predictions: DataFrame,
        suffix: str = ".ann",
        coalesce_labels: bool = False,
        line_level: bool = False
    ) -> None:
        """
        Save predictions to CouchDB as attachments.

        Args:
            predictions: DataFrame with predictions
            suffix: Suffix for attachment names
            coalesce_labels: Whether to coalesce consecutive labels
            line_level: Whether data is line-level
        """
        # Format predictions
        if "annotated_value" not in predictions.columns:
            predictions = YeddaFormatter.format_predictions(predictions)

        # Coalesce if requested
        if coalesce_labels and line_level:
            predictions = YeddaFormatter.coalesce_consecutive_labels(
                predictions, line_level=True
            )
            # For coalesced output, we have coalesced_annotations column
            # We need to join them into final_aggregated_pg
            from pyspark.sql.functions import expr
            predictions = predictions.withColumn(
                "final_aggregated_pg",
                expr("array_join(coalesced_annotations, '\n')")
            )
        else:
            # Aggregate annotated values by document
            groupby_col = "doc_id" if "doc_id" in predictions.columns else "filename"
            attachment_col = "attachment_name" if "attachment_name" in predictions.columns else "filename"

            # Check if we have line_number for ordering
            if "line_number" in predictions.columns:
                from pyspark.sql.functions import expr
                predictions = (
                    predictions.groupBy(groupby_col, attachment_col)
                    .agg(
                        expr("sort_array(collect_list(struct(line_number, annotated_value))) AS sorted_list")
                    )
                    .withColumn("annotated_value_ordered", expr("transform(sorted_list, x -> x.annotated_value)"))
                    .withColumn("final_aggregated_pg", expr("array_join(annotated_value_ordered, '\n')"))
                    .select(groupby_col, attachment_col, "final_aggregated_pg")
                )
            else:
                from pyspark.sql.functions import collect_list, expr
                predictions = (
                    predictions.groupBy(groupby_col, attachment_col)
                    .agg(
                        collect_list("annotated_value").alias("annotations")
                    )
                    .withColumn("final_aggregated_pg", expr("array_join(annotations, '\n')"))
                    .select(groupby_col, attachment_col, "final_aggregated_pg")
                )

            # Rename columns for CouchDB save
            if groupby_col != "doc_id":
                predictions = predictions.withColumnRenamed(groupby_col, "doc_id")
            if attachment_col != "attachment_name":
                predictions = predictions.withColumnRenamed(attachment_col, "attachment_name")

        # Use CouchDB connection to save
        self.conn.save_distributed(predictions, suffix=suffix)


In [14]:
"""
Main classifier module for SKOL text classification
"""
class SkolClassifierV2(SC):
    """
    Text classifier for taxonomic literature.

    This version only includes the Redis and CouchDB I/O methods.
    All other methods are in SC.

    Supports multiple classification models (Logistic Regression, Random Forest)
    and feature types (word TF-IDF, suffix TF-IDF, combined).
    """
    def _load_raw_from_couchdb(self) -> DataFrame:
        """Load raw text from CouchDB."""
        conn = CouchDBConnection(
            self.couchdb_url,
            self.couchdb_database,
            self.couchdb_username,
            self.couchdb_password
        )

        df = conn.load_distributed(self.spark, self.couchdb_pattern)

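        # Split into lines if line_level mode
        if self.line_level:
            from pyspark.sql.functions import posexplode, split, col, trim, row_number
            from pyspark.sql.window import Window

            # posexplode keeps each line's position in the source text, so the
            # numbering below is deterministic (ordering by a constant is not).
            other_cols = [c for c in df.columns if c != "value"]
            df = df.select(
                *other_cols,
                posexplode(split(col("value"), "\\n")).alias("pos", "value"),
            )
            df = df.filter(trim(col("value")) != "")

            # Number the non-blank lines of each attachment, starting at 0
            window_spec = Window.partitionBy("doc_id", "attachment_name").orderBy("pos")
            df = df.withColumn("line_number", row_number().over(window_spec) - 1).drop("pos")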

        return df

    def _load_annotated_from_couchdb(self) -> DataFrame:
        """Load annotated text from CouchDB."""
        # Load raw annotations from CouchDB
        conn = CouchDBConnection(
            self.couchdb_url,
            self.couchdb_database,
            self.couchdb_username,
            self.couchdb_password
        )

        # Look for .ann files for training
        pattern = self.couchdb_pattern
        if not pattern.endswith('.ann'):
            pattern = pattern.replace('.txt', '.txt.ann')

        df = conn.load_distributed(self.spark, pattern)

        # Parse annotations
        from .preprocessing import AnnotatedTextParser

        parser = AnnotatedTextParser(line_level=self.line_level)
        return parser.parse(df)
        
    def _save_to_couchdb(self, predictions: DataFrame) -> None:
        """Save predictions to CouchDB."""

        writer = CouchDBOutputWriter(
            couchdb_url=self.couchdb_url,
            database=self.couchdb_database,
            username=self.couchdb_username,
            password=self.couchdb_password
        )

        writer.save_annotated(
            predictions,
            suffix=self.output_couchdb_suffix,
            coalesce_labels=self.coalesce_labels,
            line_level=self.line_level
        )

    def _save_model_to_redis(self) -> None:
        """Save model to Redis using tar archive."""
        import json
        import tempfile
        import shutil
        import tarfile
        import io

        if self._model is None or self._feature_pipeline is None:
            raise ValueError("No model to save. Train a model first.")

        temp_dir = None
        try:
            # Create temporary directory
            temp_dir = tempfile.mkdtemp(prefix="skol_model_v2_")
            temp_path = Path(temp_dir)

            # Save feature pipeline
            pipeline_path = temp_path / "feature_pipeline"
            self._feature_pipeline.save(str(pipeline_path))

            # Save classifier model
            classifier_model = self._model.get_model()
            if classifier_model is None:
                raise ValueError("Classifier model not trained")
            classifier_path = temp_path / "classifier_model"
            classifier_model.save(str(classifier_path))

            # Save metadata
            metadata = {
                'label_mapping': self._label_mapping,
                'config': {
                    'line_level': self.line_level,
                    'use_suffixes': self.use_suffixes,
                    'min_doc_freq': self.min_doc_freq,
                    'model_type': self.model_type,
                    'model_params': self.model_params
                },
                'version': '2.0'
            }
            metadata_path = temp_path / "metadata.json"
            with open(metadata_path, 'w') as f:
                json.dump(metadata, f, indent=2)

            # Create tar archive
            archive_buffer = io.BytesIO()
            with tarfile.open(fileobj=archive_buffer, mode='w:gz') as tar:
                tar.add(temp_path, arcname='.')

            # Save to Redis
            archive_data = archive_buffer.getvalue()
            self.redis_client.set(self.redis_key, archive_data)
            if self.redis_expire is not None:
                self.redis_client.expire(self.redis_key, self.redis_expire)

        finally:
            # Clean up temporary directory
            if temp_dir and Path(temp_dir).exists():
                shutil.rmtree(temp_dir)

    def _load_model_from_redis(self) -> None:
        """Load model from Redis tar archive."""
        import json
        import tempfile
        import shutil
        import tarfile
        import io
        from pyspark.ml import PipelineModel

        serialized = self.redis_client.get(self.redis_key)
        if not serialized:
            raise ValueError(f"No model found in Redis with key: {self.redis_key}")

        temp_dir = None
        try:
            # Create temporary directory
            temp_dir = tempfile.mkdtemp(prefix="skol_model_load_v2_")
            temp_path = Path(temp_dir)

            # Extract tar archive
            archive_buffer = io.BytesIO(serialized)
            with tarfile.open(fileobj=archive_buffer, mode='r:gz') as tar:
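                # filter='data' (Python 3.12+) rejects unsafe archive members
                # and avoids the extractall() DeprecationWarning.
                tar.extractall(temp_path, filter='data')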

            # Load feature pipeline
            pipeline_path = temp_path / "feature_pipeline"
            self._feature_pipeline = PipelineModel.load(str(pipeline_path))

            # Load classifier model
            classifier_path = temp_path / "classifier_model"
            classifier_model = PipelineModel.load(str(classifier_path))

            # Load metadata
            metadata_path = temp_path / "metadata.json"
            with open(metadata_path, 'r') as f:
                metadata = json.load(f)

            self._label_mapping = metadata['label_mapping']
            self._reverse_label_mapping = {v: k for k, v in self._label_mapping.items()}

            # Recreate the SkolModel wrapper
            features_col = self._feature_extractor.get_features_col() if self._feature_extractor else "combined_idf"
            self._model = SkolModel(
                model_type=metadata['config']['model_type'],
                features_col=features_col,
                **metadata['config'].get('model_params', {})
            )
            self._model.set_model(classifier_model)
            self._model.set_labels(list(self._label_mapping.keys()))

        finally:
            # Clean up temporary directory
            if temp_dir and Path(temp_dir).exists():
                shutil.rmtree(temp_dir)


## Build a classifier to identify paragraph types.

We save this to Redis so that we don't need to retrain the model on every run.
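
The save and load methods above pack the model directory into a gzip-compressed tar archive held in memory, so the whole model fits in a single Redis value. Here is a minimal sketch of that round trip, with a plain dict standing in for the Redis client. The helper names `pack_dir` and `unpack_dir` and the key `skol:demo` are illustrative and not part of SKOL.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def pack_dir(directory: Path) -> bytes:
    """Return a gzip-compressed tar archive of `directory` as bytes."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        tar.add(directory, arcname=".")
    return buffer.getvalue()


def unpack_dir(archive: bytes, destination: Path) -> None:
    """Extract an archive produced by pack_dir into `destination`."""
    # Use the safer 'data' extraction filter where this Python provides it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        tar.extractall(destination, **kwargs)


# A dict stands in for Redis here: set() becomes item assignment.
fake_redis = {}
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    (Path(src) / "metadata.json").write_text('{"version": "2.0"}')
    fake_redis["skol:demo"] = pack_dir(Path(src))
    unpack_dir(fake_redis["skol:demo"], Path(dst))
    restored = (Path(dst) / "metadata.json").read_text()
```

Storing one opaque blob keeps the pipeline, classifier, and metadata in step with each other, at the cost of rewriting the whole archive on every save.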

In [9]:
# Train classifier on annotated data and save to Redis using SkolClassifierV2
if not redis_client.exists(classifier_model_name):
    # Get annotated training files
    annotated_path = Path.cwd().parent / "data" / "annotated"
    print(f"Loading annotated files from: {annotated_path}")

    if annotated_path.exists():
        annotated_files = get_file_list(str(annotated_path), pattern="**/*.ann")

        if len(annotated_files) > 0:
            print(f"Found {len(annotated_files)} annotated files")

            # Train using SkolClassifierV2 with unified API
            print("Training classifier with SkolClassifierV2...")
            classifier = SkolClassifierV2(
                spark=spark,

                # Input
                input_source='files',
                file_paths=annotated_files,

                # Model I/O
                auto_load_model=False,  # Fit a new model.
                model_storage='redis',
                redis_client=redis_client,
                redis_key=classifier_model_name,
                redis_expire=classifier_model_expire,

                # Model and preprocessing options
                line_level=True,
                use_suffixes=False,
                model_type='logistic',

                # Output options
                output_dest='couchdb',
                couchdb_url=couchdb_url,
                couchdb_database=ingest_db_name,
                output_couchdb_suffix='.ann'
            )

            # Train the model
            results = classifier.fit()

            print("Training complete!")
            print(f"  Accuracy: {results.get('accuracy', 0):.4f}")
            print(f"  F1 Score: {results.get('f1_score', 0):.4f}")
            print(f"✓ Model automatically saved to Redis with key: {classifier_model_name}")
        else:
            print(f"No annotated files found in {annotated_path}")
    else:
        print(f"Directory does not exist: {annotated_path}")
        print("Please ensure annotated training data is available.")
else:
    print(f"Skipping generation of model {classifier_model_name}.")

Loading annotated files from: /data/piggy/src/github.com/piggyatbaqaqi/skol/data/annotated
Found 190 annotated files
Training classifier with SkolClassifierV2...


NameError: name 'SkolClassifierV2' is not defined

## Extract the taxa names and descriptions

We use the classifier to extract taxon names and descriptions from articles, issues, and books. The YEDDA-annotated texts are written back to CouchDB.
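
With `coalesce_labels=True`, consecutive lines that share a predicted label are merged into a single annotated block. The actual merging code lives in the SKOL library and is not shown in this notebook. This is a rough sketch of the idea using `itertools.groupby`. The block layout `[@ text #Label*]` is inferred from the `annotated_value` samples this notebook prints, and the function name `coalesce_to_yedda` and the sample lines are made up for illustration.

```python
from itertools import groupby
from typing import Iterable, List, Tuple


def coalesce_to_yedda(lines: Iterable[Tuple[str, str]]) -> List[str]:
    """Merge runs of (text, label) pairs with equal labels into YEDDA blocks."""
    blocks = []
    for label, run in groupby(lines, key=lambda pair: pair[1]):
        # Join the lines of one run with newlines, preserving their order.
        text = "\n".join(line for line, _ in run)
        blocks.append(f"[@ {text} #{label}*]")
    return blocks


# Toy input: one Nomenclature line followed by two Description lines.
sample = [
    ("Aus bus sp. nov.", "Nomenclature"),
    ("Conidiophores erect.", "Description"),
    ("Conidia hyaline.", "Description"),
]
blocks = coalesce_to_yedda(sample)
```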

In [16]:
# Predict from CouchDB and save back to CouchDB using SkolClassifierV2
print("Initializing classifier with unified V2 API...")
classifier = SkolClassifierV2(
    spark=spark,
    input_source='couchdb',
    couchdb_url=couchdb_url,
    couchdb_database=ingest_db_name,
    couchdb_username=couchdb_username,
    couchdb_password=couchdb_password,
    couchdb_pattern='*.txt',
    output_dest='couchdb',
    output_couchdb_suffix='.ann',
    model_storage='redis',
    redis_client=redis_client,
    redis_key=classifier_model_name,
    auto_load_model=True,
    line_level=True,
    coalesce_labels=True,
    output_format='annotated'
)

print(f"Model loaded from Redis: {classifier_model_name}")

# Load, predict, and save in a streamlined workflow
print("\nLoading and classifying documents from CouchDB...")
raw_df = classifier.load_raw()
print(f"Loaded {raw_df.count()} text documents")
raw_df.show(10)

print("\nMaking predictions...")
predictions = classifier.predict(raw_df)

# Show sample predictions
print("\nSample predictions:")
predictions.select(
    "doc_id", "attachment_name", "predicted_label"
).show(5, truncate=50)

# Save results back to CouchDB
print("\nSaving predictions back to CouchDB...")
classifier.save_annotated(predictions)

print("\n✓ Predictions saved to CouchDB as .ann attachments")

Initializing classifier with unified V2 API...



Model loaded from Redis: skol:classifier:model:v1.5

Loading and classifying documents from CouchDB...



Loaded 1160796 text documents



+--------------------+--------------------+---------------+--------------------+-----------+
|              doc_id|           human_url|attachment_name|               value|line_number|
+--------------------+--------------------+---------------+--------------------+-----------+
|006b331e284e4dc8b...|https://www.ingen...|    article.txt|           MYCOTAXON|          0|
|006b331e284e4dc8b...|https://www.ingen...|    article.txt|Volume 108, pp. 2...|          1|
|006b331e284e4dc8b...|https://www.ingen...|    article.txt|     April–June 2009|          2|
|006b331e284e4dc8b...|https://www.ingen...|    article.txt|Rattania setulife...|          3|
|006b331e284e4dc8b...|https://www.ingen...|    article.txt|an undescribed en...|          4|
|006b331e284e4dc8b...|https://www.ingen...|    article.txt|on rattans from W...|          5|
|006b331e284e4dc8b...|https://www.ingen...|    article.txt|Ashish Prabhugaon...|          6|
|006b331e284e4dc8b...|https://www.ingen...|    article.txt|*ashishprab



✓ Predictions saved to CouchDB as .ann attachments


In [17]:
predictions.select("predicted_label", "annotated_value").where('predicted_label = "Nomenclature"').show()
predictions.groupBy("predicted_label").count().orderBy("count").show()



+---------------+--------------------+
|predicted_label|     annotated_value|
+---------------+--------------------+
|   Nomenclature|[@ Forssell, C. s...|
|   Nomenclature|[@ nom. nud. #Nom...|
|   Nomenclature|[@ 11. Caloplaca ...|
|   Nomenclature|[@ Szatala Ö. 194...|
|   Nomenclature|[@ Iran, 1937. An...|
|   Nomenclature|[@ Aegeischen Mee...|
|   Nomenclature|[@ Szatala Ö. 195...|
|   Nomenclature|[@ Marasmius pseu...|
|   Nomenclature|[@ Rattania Prabh...|
|   Nomenclature|[@ Rattania setul...|
|   Nomenclature|[@ Masseeella flu...|
|   Nomenclature|[@ Pulvinula cons...|
|   Nomenclature|[@ 70. 1907.\t #N...|
|   Nomenclature|[@ Phellodon tome...|
|   Nomenclature|[@ Arachnopeziza ...|
|   Nomenclature|[@ Sociedade Brot...|
|   Nomenclature|[@ Iturriaga T, K...|
|   Nomenclature|[@ Serendipita sa...|
|   Nomenclature|[@ dominated by S...|
|   Nomenclature|[@ Christiaan Hen...|
+---------------+--------------------+
only showing top 20 rows




+---------------+-------+
|predicted_label|  count|
+---------------+-------+
|   Nomenclature|  14609|
|    Description| 102158|
|Misc-exposition|1044029|
+---------------+-------+



                                                                                

Here we estimate how many Taxon structures we expect to find. The abbreviation "nov." ("novum") marks a taxon that is newly described in the current article. The count should be a lower bound, because it is not unusual to redescribe an existing species, e.g. in a survey article or a monograph on a genus.

In [18]:
from pyspark.sql.functions import col

predictions.filter(
    (col("predicted_label") == "Nomenclature")
    & col("annotated_value").contains("nov.")
).count()

                                                                                

3790
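
The same estimate can be sketched without Spark for a single annotated attachment. The regular expression below assumes the `[@ text #Label*]` block layout seen in the sample output. The helper names and the toy document are illustrative only.

```python
import re
from typing import List, Tuple

# Assumed YEDDA block layout: "[@ text #Label*]"; text may span lines.
YEDDA_BLOCK = re.compile(r"\[@ (.*?) #([\w-]+)\*\]", re.DOTALL)


def parse_blocks(annotated: str) -> List[Tuple[str, str]]:
    """Return (text, label) pairs for every annotated block."""
    return [(m.group(1), m.group(2)) for m in YEDDA_BLOCK.finditer(annotated)]


def count_new_taxa(annotated: str) -> int:
    """Count Nomenclature blocks whose text contains 'nov.'."""
    return sum(
        1
        for text, label in parse_blocks(annotated)
        if label == "Nomenclature" and "nov." in text
    )


doc = (
    "[@ Aus bus sp. nov. #Nomenclature*]\n"
    "[@ Pileus convex, 2 cm wide. #Description*]\n"
    "[@ Aus cus (Smith) Jones #Nomenclature*]\n"
    "[@ The name was proposed as gen. nov. by Smith. #Misc-exposition*]\n"
)
new_taxa = count_new_taxa(doc)
```

Only the first block counts: the third is Nomenclature without "nov.", and the fourth mentions "nov." outside a Nomenclature block.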

## Build the Taxon objects and store them in CouchDB
We use CouchDB to store a full record for each taxon. We copy all metadata to the taxon records.

In [19]:
class CouchDBFile(CDBF):
    """
    File-like object that reads from CouchDB attachment content.

    This class extends FileObject to support reading text from CouchDB
    attachments while preserving database metadata (doc_id, attachment_name,
    and database name).
    """


In [20]:
from extract_taxa_to_couchdb import (
    TaxonExtractor,
    generate_taxon_doc_id,
    extract_taxa_from_partition,
    convert_taxa_to_rows
)
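
A deterministic `_id` derived from a taxon's source location lets a re-run of the extraction address the same record instead of creating a duplicate. The body of `generate_taxon_doc_id` imported above is not shown in this notebook, so the hashed fields and the helper name `make_taxon_doc_id` below are assumptions for illustration, not the library's implementation.

```python
import hashlib
from typing import Optional


def make_taxon_doc_id(doc_id: str, human_url: Optional[str], line_number: int) -> str:
    """Build a deterministic CouchDB _id from a taxon's source location."""
    # Hypothetical key layout: source document, URL, and line number.
    key = f"{doc_id}:{human_url or ''}:{line_number}"
    return "taxon_" + hashlib.sha256(key.encode("utf-8")).hexdigest()


first = make_taxon_doc_id("0020c88329ed456a95a18e0c219269f4", None, 12)
second = make_taxon_doc_id("0020c88329ed456a95a18e0c219269f4", None, 12)
other = make_taxon_doc_id("0020c88329ed456a95a18e0c219269f4", None, 13)
```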

## Build Taxon objects

Here we extract the Taxon objects from the annotated attachments.

In [21]:
ingest_couchdb_url = couchdb_url
ingest_username = couchdb_username
ingest_password = couchdb_password
taxon_couchdb_url = couchdb_url
taxon_username = couchdb_username
taxon_password = couchdb_password
pattern = '*.txt.ann'

In [22]:
# Create TaxonExtractor instance with database configuration
extractor = TaxonExtractor(
    spark=spark,
    ingest_couchdb_url=ingest_couchdb_url,
    ingest_db_name=ingest_db_name,
    taxon_db_name=taxon_db_name,
    ingest_username=ingest_username,
    ingest_password=ingest_password,
    taxon_username=taxon_username,
    taxon_password=taxon_password
)

print("TaxonExtractor initialized")
print(f"  Ingest DB: {ingest_db_name}")
print(f"  Taxon DB:  {taxon_db_name}")

TaxonExtractor initialized
  Ingest DB: skol_dev
  Taxon DB:  skol_taxa_dev


In [23]:
# Step 1: Load annotated documents
print("\nStep 1: Loading annotated documents from CouchDB...")
annotated_df = extractor.load_annotated_documents(pattern=pattern)
print(f"Loaded {annotated_df.count()} annotated documents")
annotated_df.show(5, truncate=False)


Step 1: Loading annotated documents from CouchDB...



Loaded 2099 annotated documents


In [24]:
# Step 2: Extract taxa to DataFrame
print("\nStep 2: Extracting taxa from annotated documents...")
taxa_df = extractor.extract_taxa(annotated_df)
print(f"Extracted {taxa_df.count()} taxa")
taxa_df.printSchema()
taxa_df.show(10, truncate=False)


Step 2: Extracting taxa from annotated documents...



=== parse_annotated Label Summary ===
Total labels counted: 46616

Label distribution:
  Misc-exposition            23690 ( 50.8%)
  Description                17481 ( 37.5%)
  Nomenclature                5444 ( 11.7%)
  None                           1 (  0.0%)

Extracted 5239 taxa
root
 |-- taxon: string (nullable = false)
 |-- description: string (nullable = false)
 |-- source: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- line_number: integer (nullable = true)
 |-- paragraph_number: integer (nullable = true)
 |-- page_number: integer (nullable = true)
 |-- empirical_page_number: string (nullable = true)




In [25]:
# Step 3: Inspect actual Taxon objects from the RDD (optional debugging)
print("\n=== Sample Taxon Objects ===")
taxa_rdd = annotated_df.rdd.mapPartitions(
    lambda partition: extract_taxa_from_partition(iter(partition), ingest_db_name)  # type: ignore[reportUnknownArgumentType]
)
for i, taxon in enumerate(taxa_rdd.take(3)):
    print(f"\nTaxon {i+1}:")
    print(f"  Type: {type(taxon)}")
    print(f"  Has nomenclature: {taxon.has_nomenclature()}")
    taxon_row = taxon.as_row()
    print(f"  Taxon name: {taxon_row['taxon'][:80]}...")
    print(f"  Source: {taxon_row['source']}")


=== Sample Taxon Objects ===





Taxon 1:
  Type: <class 'taxon.Taxon'>
  Has nomenclature: True
  Taxon name:  2. Caloplaca brachyspora Mereschk., Lich. Ross. Exs., fasc. 22, no. 276 (1913)
...
  Source: {'doc_id': '0020c88329ed456a95a18e0c219269f4', 'human_url': 'https://www.ingentaconnect.com/content/mtax/mt/2010/00000111/00000001/art00033', 'db_name': 'skol_dev'}

Taxon 2:
  Type: <class 'taxon.Taxon'>
  Has nomenclature: True
  Taxon name:  5. Caloplaca gyalolechiiformis Szatala, Ann. Hist.-Nat. Mus. Natl. Hungarici, s...
  Source: {'doc_id': '0020c88329ed456a95a18e0c219269f4', 'human_url': 'https://www.ingentaconnect.com/content/mtax/mt/2010/00000111/00000001/art00033', 'db_name': 'skol_dev'}

Taxon 3:
  Type: <class 'taxon.Taxon'>
  Has nomenclature: True
  Taxon name:  7. Caloplaca lactea var. subimmersa Szatala, Ann. Hist.-Nat. Mus. Natl. Hungari...
  Source: {'doc_id': '0020c88329ed456a95a18e0c219269f4', 'human_url': 'https://www.ingentaconnect.com/content/mtax/mt/2010/00000111/00000001/art00033', 'db_name'



In [26]:
# Step 4: Save taxa to CouchDB
print("\nStep 4: Saving taxa to CouchDB...")
results_df = extractor.save_taxa(taxa_df)

# Show detailed results
results_df.groupBy("success").count().show(truncate=False)

# If there are failures, show error messages
print("\nError messages:")
results_df.filter("success = false").select("error_message").distinct().show(truncate=False)


Step 4: Saving taxa to CouchDB...



+-------+-----+
|success|count|
+-------+-----+
|true   |5239 |
+-------+-----+


Error messages:



+-------------+
|error_message|
+-------------+
+-------------+





In [None]:
# Alternative: Run the complete pipeline in one step
# Uncomment to use the simplified one-step approach:

# print("\nRunning complete pipeline...")
# results = extractor.run_pipeline(pattern='*.txt.ann')
#
# successful = results.filter("success = true").count()
# failed = results.filter("success = false").count()
#
# print(f"\nPipeline Results:")
# print(f"  Successful: {successful}")
# print(f"  Failed:     {failed}")
#
# results.groupBy("success").count().show(truncate=False)

### Observations on the classification models

The line-by-line classification model is classifying many Description lines as Misc-exposition. It works reasonably well for Nomenclature.

The problem with the paragraph classification model is that the heuristic paragraph parser does not generalize well to the more modern journals.

One possible approach to investigate is adding heuristics to the label-merging code to convert some Misc-exposition lines to Description if they are surrounded by Description paragraphs.
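
One way to sketch the label-merging heuristic is as a pass over the per-line labels of a single document. The function name `smooth_labels` and the `max_gap` cutoff are hypothetical. The sketch relabels only short Misc-exposition runs that have a Description line on both sides.

```python
from typing import List


def smooth_labels(labels: List[str], max_gap: int = 2) -> List[str]:
    """Relabel short Misc-exposition runs that sit between Description lines."""
    smoothed = list(labels)
    i = 0
    while i < len(labels):
        if labels[i] != "Misc-exposition":
            i += 1
            continue
        # Find the end of this run of Misc-exposition lines.
        j = i
        while j < len(labels) and labels[j] == "Misc-exposition":
            j += 1
        before = labels[i - 1] if i > 0 else None
        after = labels[j] if j < len(labels) else None
        if before == after == "Description" and (j - i) <= max_gap:
            smoothed[i:j] = ["Description"] * (j - i)
        i = j
    return smoothed


labels = [
    "Nomenclature", "Description", "Misc-exposition", "Description",
    "Misc-exposition", "Misc-exposition", "Misc-exposition", "Description",
    "Misc-exposition",
]
smoothed = smooth_labels(labels, max_gap=2)
```

The single-line gap is filled, while the three-line run and the trailing line are left alone. Any real cutoff would need tuning against hand-annotated text.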

A more sophisticated approach is to use a completely new model that has some memory, such as an RNN, or a two-pass model that adds the labels of the previous line(s) as extra features for each line.
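
The two-pass idea can be sketched as a feature-building step: the first pass predicts a label per line, and the second pass sees the labels of the preceding lines as extra categorical features. The function name, the feature names, and the `<START>` padding token are assumptions for illustration.

```python
from typing import Dict, List


def add_previous_label_features(
    first_pass_labels: List[str], window: int = 2, pad: str = "<START>"
) -> List[Dict[str, str]]:
    """For each line, collect the first-pass labels of the `window` lines before it."""
    features = []
    for index in range(len(first_pass_labels)):
        row = {}
        for offset in range(1, window + 1):
            previous = index - offset
            # Lines before the start of the document get the padding token.
            row[f"prev_label_{offset}"] = (
                first_pass_labels[previous] if previous >= 0 else pad
            )
        features.append(row)
    return features


lagged = add_previous_label_features(
    ["Nomenclature", "Description", "Description"], window=2
)
```

In a Spark pipeline, the same features could be built with a window function over `line_number` and then indexed like any other categorical column.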

It may become necessary to hand-annotate some of the more modern journals.

## Dr. Drafts document embedding

Dr. Drafts loads documents from CouchDB and saves the embeddings to Redis.


In [6]:
from dr_drafts_mycosearch.data import SKOL_TAXA as STX
from dr_drafts_mycosearch.compute_embeddings import EmbeddingsComputer as EC

class SKOL_TAXA(STX):
    """Data interface for Synoptic Key of Life Taxa in CouchDB."""
    

In [7]:
class EmbeddingsComputer(EC):
    """Class for computing and storing embeddings from narrative data."""


## Compute Embeddings

We use SBERT to embed the taxa into a search space.

In [8]:
skol_taxa = SKOL_TAXA(
    couchdb_url="http://localhost:5984",
    username=couchdb_username,
    password=couchdb_password,
    db_name=taxon_db_name
)
descriptions = skol_taxa.get_descriptions()

DEBUG: doc: <Document 'taxon_000aaabf9fb6bcb8ff1c66e1ddeb8c30599c1283be703156d50e45bc8779df25'@'2-1e185639e462e67cab525449ce2d1e6d' {'taxon': ' added Valsaria Ces. & De Not. and Valsonectria Speg., both with uniseptate\n\n\n Mattirolia Berl. & Bres., Annuario Soc. Alpin. Trident. 14: 351 (1889)\n\n\n = Thyronectroidea Seaver, Mycologia 1: 206 (1909)\n\n\n = Balzania Speg., Anales Mus. Nac. Buenos Aires 6: 286 (1898)\n\n', 'description': ' Stroma variable, usually present, erumpent, covered with loosely interwoven\nyellowish or brownish hyphae, KOH negative. Perithecia globose, semiimmersed or isolated in the stroma. Paraphyses abundant. Asci unitunicate,\ncylindrical or clavate, non amyloid. Ascospores smooth, muriform, hyaline to\n\n\n Stroma subcortical, pulvinate, 0.5–6 mm diam. Perithecia aggregated,\nimmersed in the stroma or more rarely isolated, globose, 400–450 µm diam.,\nblack, surrounded by a yellowish tomentum, 30–50 µm thick, formed by\nhyphae 4–5 µm diam. Peridium pseudopa

IOPub data rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_data_rate_limit`.

Current values:
ServerApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
ServerApp.rate_limit_window=3.0 (secs)



DEBUG: doc: <Document 'taxon_b3df48cf6fe62d9f364166abbeca2ad16b23523d98a8483c60ce318ea0b7ffa4'@'2-7b3441c6b2795078a562e746eaf676c6' {'taxon': ' Arcyodes incarnata (Alb. & Schwein.) O.F. Cook, Science, N.Y. 15: 651, 1902.\n\n', 'description': ' Sporocarps aggregated, sessile, crowded and heaped, 0.5–0.7 mm in diam.,\n\n\n single, membranous, somewhat opalescent, persistent, irregularly dehiscent\nabove, yellow to colourless by transmitted light, inner surface with many\n\n\n protuberances. Columella absent. Capillitium tubular, elastic, branched\nand anastomosed, pale yellow by transmitted light, mostly 4 μm in diam., with\n\n\n with a faint reticulation. Spores free, pale pink in mass, yellowish pale to\ncolourless by transmitted light, 7–9.5 μm in diam., globose to subglobose,\ndensely warted, with scattered groups of more prominent warts.\n\n\n Fructifications plasmodiocarps, gregarious, reticulation 0.6–3 mm in\nextension, pale yellowish brown, fading to pale brown. Plasmodiocarps\n


In [9]:
# Compute embeddings only if they are not already cached in Redis under this key.
if not redis_client.exists(embedding_name):
    embedder = EmbeddingsComputer(
        idir='/dev/null',
        redis_url='redis://localhost:6379',
        redis_expire=embedding_expire,
        embedding_name=embedding_name,
    )

    embedding_result = embedder.run(descriptions)

## Compute JSON versions of all descriptions

### Load the output of TaxonExtractor.save_taxa as a PySpark DataFrame

In [None]:
class TaxaJSONTranslator(TJT):
    """
    Translates taxa descriptions to structured JSON using a fine-tuned Mistral model.

    This class is optimized for processing PySpark DataFrames created by
    TaxonExtractor.load_taxa(), adding a new column with JSON-formatted features.
    """


In [None]:
translator = TaxaJSONTranslator(
    spark=spark,
    base_model_id="mistralai/Mistral-7B-Instruct-v0.3",
    max_length=2048,
    max_new_tokens=1024,
    device="cuda",
    load_in_4bit=True,
    use_auth_token=True,
    couchdb_url=couchdb_url,
    username=couchdb_username,
    password=couchdb_password
)

### Run the Mistral model to generate JSON from each taxon description.

### Store the generated JSON as a field on the objects produced by save_taxa.

## Hierarchical clustering

We use agglomerative clustering to group the taxa into "clades" based on the cosine similarity of their SBERT embeddings. We then load the resulting tree into Neo4j.
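To make the technique concrete, here is a minimal pure-Python sketch of average-linkage agglomerative clustering over cosine distance. It is not the `TaxonClusterer` implementation, whose internals are not shown in this notebook; the names `cosine_distance` and `average_linkage` and the toy vectors are illustrative assumptions. The sketch is far too slow for thousands of 768-dimensional embeddings, where a library routine would be the practical choice.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def average_linkage(vectors):
    """Cluster vectors bottom-up; return merges as (left, right, distance, size).

    Leaves are numbered 0..n-1; each merge creates a new cluster id n, n+1, ...
    """
    n = len(vectors)
    clusters = {i: [i] for i in range(n)}  # cluster id -> member leaf indices
    leaf_dist = {
        (i, j): cosine_distance(vectors[i], vectors[j])
        for i, j in combinations(range(n), 2)
    }

    def cluster_dist(a, b):
        # Average linkage: mean distance over all cross-cluster leaf pairs.
        pairs = [(min(i, j), max(i, j)) for i in clusters[a] for j in clusters[b]]
        return sum(leaf_dist[p] for p in pairs) / len(pairs)

    merges = []
    next_id = n
    while len(clusters) > 1:
        # Merge the closest pair of current clusters.
        a, b = min(combinations(sorted(clusters), 2), key=lambda p: cluster_dist(*p))
        distance = cluster_dist(a, b)
        members = clusters.pop(a) + clusters.pop(b)
        clusters[next_id] = members
        merges.append((a, b, distance, len(members)))
        next_id += 1
    return merges


# Toy "embeddings": two vectors near the x-axis, two near the y-axis.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
merges = average_linkage(toy_vectors)
for left, right, distance, size in merges:
    print(f"merge {left} + {right} at distance {distance:.4f} (size {size})")
```

Each merge corresponds to one internal node of the tree, so the leaves play the role of taxa and the merged clusters play the role of the "clades".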

In [7]:
from taxon_clusterer import TaxonClusterer as TC

class TaxonClusterer(TC):
    pass

In [8]:
clusterer = TaxonClusterer(
    redis_host="localhost",
    redis_port=6379,
    redis_db=0,
    neo4j_uri=neo4j_uri,
)

# Load embeddings from Redis
(embeddings, taxon_names, metadata) = clusterer.load_embeddings(embedding_name)

TaxonClusterer initialized
  Redis: localhost:6379/0
  Neo4j: bolt://localhost:7687
Loading embeddings from Redis key: skol:embedding:v1.3
✓ Loaded 5239 taxa with 768-dimensional embeddings


In [9]:
metadata[0]

{'source_db_name': 'skol_dev',
 'source_doc_id': '6a99ebd4995242a4bf5a3465acf2bbdf',
 'source_human_url': 'https://www.ingentaconnect.com/content/mtax/mt/2013/00000125/00000001/art00021',
 'line_number': 85,
 'paragraph_number': 37287,
 'page_number': 1,
 'empirical_page_number': '4666',
 'description': ' Stroma variable, usually present, erumpent, covered with loosely interwoven\nyellowish or brownish hyphae, KOH negative. Perithecia globose, semiimmersed or isolated in the stroma. Paraphyses abundant. Asci unitunicate,\ncylindrical or clavate, non amyloid. Ascospores smooth, muriform, hyaline to\n\n\n Stroma subcortical, pulvinate, 0.5–6 mm diam. Perithecia aggregated,\nimmersed in the stroma or more rarely isolated, globose, 400–450 µm diam.,\nblack, surrounded by a yellowish tomentum, 30–50 µm thick, formed by\nhyphae 4–5 µm diam. Peridium pseudoparenchymatous, orange, composed\nof isodiametric cells, 5–15 µm diam. Paraphyses 5 µm diam, regularly septate,\n\n\n µm, with 8 uniseriat

In [10]:
# Perform clustering
linkage_matrix = clusterer.cluster(method="average", metric="cosine")

# Store in Neo4j with root named "Fungi"
clusterer.store_in_neo4j(root_name="Fungi", clear_existing=True)

print("✓ Clustering complete!")

Performing agglomerative clustering...
  Method: average
  Metric: cosine
✓ Clustering complete
  Tree depth: 249
  Total nodes: 10477
Storing tree in Neo4j...
  Root name: Fungi
  Clearing existing Taxon and Pseudoclade nodes...
✓ Tree stored in Neo4j
  Taxon nodes: 5239
  Pseudoclade nodes: 5238
  PARENT_OF relationships: 10476
✓ Clustering complete!
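
The counts reported above are what a binary merge tree should produce: every merge joins exactly two clusters, so n leaves yield n - 1 internal nodes and one edge per non-root node. A small sanity check, with the hypothetical helper name `binary_merge_tree_counts`:

```python
def binary_merge_tree_counts(n_leaves):
    """Return (internal_nodes, total_nodes, edges) for a full binary merge tree."""
    internal_nodes = n_leaves - 1               # one new node per pairwise merge
    total_nodes = n_leaves + internal_nodes
    edges = total_nodes - 1                     # every node except the root has a parent
    return internal_nodes, total_nodes, edges


internal_nodes, total_nodes, edges = binary_merge_tree_counts(5239)
print(internal_nodes, total_nodes, edges)  # 5238 10477 10476
```

These match the 5238 Pseudoclade nodes, 10477 total nodes, and 10476 PARENT_OF relationships in the log.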


## Bibliography

* DOI Foundation, "DOI Citation Formatter HTTP API", https://citation.doi.org/api-docs.html, accessed 2025-11-12.
* Yang, Jie; Zhang, Yue; Li, Linwei; Li, Xingxuan, 2018, "YEDDA: A Lightweight Collaborative Text Span Annotation Tool", Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, http://aclweb.org/anthology/P18-4006


## Appendix: On the use of an AI Coder

Portions of this work were completed with the aid of Claude Code Pro. I wish to give a clarifying example of how I've used this very powerful tool, and to explain why I am comfortable claiming authorship of the resulting code.

For this project I needed results from an earlier class project in which a trio of students built and evaluated models for classifying paragraphs. The earlier work was built as an IPython notebook, with many examples and inline code. Just copying the earlier notebook would have introduced many irrelevant details and would not have furthered the overall project.

I asked Claude Code to translate the notebook into a module that I could import. It did a pretty good job. Without being told, it made a submodule, extracted the illustrative code as examples, wrote reasonable documentation, and created packaging for the module.

The skill level of the coding was roughly that of a highly disciplined but average junior programmer. The architecture was simplistic and violated several design principles, such as DRY (Don't Repeat Yourself). I requested specific refactorings, such as converting a group of functions that repeated the same parameters into an object that holds those parameters once.
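
As an illustration of that kind of refactoring, here is a hedged sketch. The names `fetch_request`, `save_request`, and `DocumentStore` are hypothetical and are not taken from the project's code. The functions only build request descriptions so that the example runs without a database.

```python
from dataclasses import dataclass


# Before: each function repeats the same connection parameters.
def fetch_request(url, username, db_name, doc_id):
    return f"GET {url}/{db_name}/{doc_id} as {username}"


def save_request(url, username, db_name, doc_id):
    return f"PUT {url}/{db_name}/{doc_id} as {username}"


# After: the shared parameters live on one object and are stated once.
@dataclass(frozen=True)
class DocumentStore:
    url: str
    username: str
    db_name: str

    def fetch_request(self, doc_id):
        return f"GET {self.url}/{self.db_name}/{doc_id} as {self.username}"

    def save_request(self, doc_id):
        return f"PUT {self.url}/{self.db_name}/{doc_id} as {self.username}"


store = DocumentStore("http://localhost:5984", "admin", "skol_dev")
print(store.fetch_request("abc123"))
print(store.save_request("abc123"))
```

Callers now pass only what varies per call (the document id), and a change to the connection details touches one place.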

The initial code called REST interfaces directly and read all the data onto a single machine, which defeats the purpose of PySpark. Through a series of refactorings, I asked that the code use the appropriate libraries, which I named, and create correct UDFs (user-defined functions) to execute transformations in parallel.

I walked the AI through creating an object that I could use to illustrate my use of the Redis and CouchDB interfaces, while leaving the irrelevant details in a separate library.

In short, I still have to understand good design principles. I still have to recognize where appropriate libraries apply. I still have to understand the frameworks I am working with.

I now have a strong understanding of the difference between "vibe coding" and AI-assisted software engineering. In my first 4 hours with Claude Code, I was able to produce roughly 4 days' worth of professional-grade working code.