# SKOL IV: All the Data

Synoptic Key of Life (SKOL) is a web site and application that aims to provide easy access to all of the open taxonomic literature in Mycology. A synoptic key is a tool that helps you identify an organism by making successive observations, building up a detailed description of the organism in front of you. There are many fine synoptic keys available for particular taxa, but they are all hand-built. SKOL uses AI to build the synoptic key automatically.
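
To make the idea of "successive observations" concrete, here is a minimal sketch of how a synoptic key narrows a candidate list. The taxa, features, and the `narrow` helper are hypothetical illustrations, not SKOL data or SKOL code.

```python
from typing import Dict, List

# Hypothetical toy data: each taxon maps a feature name to a recorded value.
TAXA: Dict[str, Dict[str, str]] = {
    "Taxon A": {"cap color": "red", "spore print": "white"},
    "Taxon B": {"cap color": "red", "spore print": "brown"},
    "Taxon C": {"cap color": "yellow", "spore print": "white"},
}


def narrow(candidates: List[str], feature: str, value: str) -> List[str]:
    """Keep only the candidates whose recorded value for `feature` matches."""
    return [name for name in candidates if TAXA[name].get(feature) == value]


# Each observation removes the taxa that do not match it.
candidates = sorted(TAXA)
candidates = narrow(candidates, "cap color", "red")
candidates = narrow(candidates, "spore print", "white")
print(candidates)
```

Unlike a dichotomous key, the observations can be applied in any order; the result is the same set of remaining taxa.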

The goal is to make it easier for advanced amateur mycologists to build technical descriptions of fungi.

![image.png](attachment:a97e1e8d-1024-426b-95f3-b8282ac91076.png) Credit: Garth Smith Photography 2018.

## Storage needs

SKOL uses a diverse set of databases to hold different artifacts.

The original literature is ingested into the document database CouchDB (citation needed) along with available publication metadata. The originals are typically PDF files which are stored as attachments on the CouchDB ingestion records.

Text is extracted from the ingested files, using OCR if necessary. This text is a second attachment on the ingestion record.
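
As a rough picture of what an ingestion record looks like, the sketch below builds a CouchDB-style document with the PDF and its extracted text as inline attachments (base64 data under `_attachments`). The helper name `make_ingestion_record` and the example URL are assumptions for illustration; the notebook itself uploads attachments through `put_attachment` rather than inline.

```python
import base64
from typing import Any, Dict


def make_ingestion_record(pdf_url: str, pdf_bytes: bytes, text: str) -> Dict[str, Any]:
    """Build a CouchDB-style document carrying two inline attachments."""

    def inline(content: bytes, content_type: str) -> Dict[str, str]:
        # CouchDB accepts inline attachments as base64-encoded strings.
        return {
            "content_type": content_type,
            "data": base64.b64encode(content).decode("ascii"),
        }

    return {
        "pdf_url": pdf_url,
        "_attachments": {
            "article.pdf": inline(pdf_bytes, "application/pdf"),
            "article.txt": inline(text.encode("utf-8"), "text/plain; charset=UTF-8"),
        },
    }


record = make_ingestion_record(
    "https://example.org/article.pdf",  # hypothetical URL
    b"%PDF-1.4 placeholder bytes",
    "Extracted text goes here.",
)
print(sorted(record["_attachments"]))
```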

A classifier is trained from hand-annotated articles and stored in Redis. The model is given an expiration period (`classifier_model_expire`, two days in the constants below). The classifier then annotates each text document with labels for Nomenclature, Description, and Misc-exposition. It stores these annotated articles as attachments on the CouchDB ingestion records.

Taxon names (typically species names with literature annotation) are combined with matching descriptions into Taxon records and stored in another CouchDB database. These records are the core data for SKOL.

The Taxon records are processed in a number of ways: sentence embedding, JSON encoding, and artificial cladograms.

The sentence embedding is done with an SBERT model (citation needed) and stored as a blob in Redis. The embedding is given an expiration period (`embedding_expire`, two days in the constants below).
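
Both the classifier and the embeddings are cached in Redis with an expiry. A minimal sketch of that pattern, assuming a redis-py style client with `set(..., ex=...)` and `get(...)`; the helper names and the key are hypothetical, and the real SKOL classes may serialize differently.

```python
import pickle
from typing import Any, Optional


def save_with_expiry(client: Any, key: str, obj: Any, ttl_seconds: int) -> None:
    """Pickle `obj` and store it under `key`; Redis drops it after the TTL."""
    client.set(key, pickle.dumps(obj), ex=ttl_seconds)


def load_or_none(client: Any, key: str) -> Optional[Any]:
    """Return the cached object, or None if the key is missing or expired."""
    blob = client.get(key)
    return None if blob is None else pickle.loads(blob)
```

With a real client this would be called as `save_with_expiry(redis_client, "skol:example:key", obj, 60 * 60 * 24 * 2)`. When `load_or_none` returns `None`, the caller recomputes the artifact and stores it again.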

A Mistral model (citation needed) converts each Taxon record description into a hierarchy of features, subfeatures, and values. The expectation is that these data structures will eventually form the basis of pull-down menus in the SKOL user interface. These JSON structures are stored in another CouchDB database.
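
To illustrate how such a hierarchy could feed pull-down menus, here is a small sketch that flattens a nested feature structure into menu paths. The example description and the `menu_paths` helper are hypothetical; the actual JSON produced by the model may be shaped differently.

```python
from typing import Any, Iterator, Tuple

# Hypothetical example of a feature -> subfeature -> values hierarchy.
description = {
    "pileus": {"color": ["brown", "tan"], "shape": ["convex"]},
    "spores": {"shape": ["ellipsoid"]},
}


def menu_paths(node: Any, prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Yield one (feature, subfeature, ..., value) path per leaf value."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from menu_paths(child, prefix + (key,))
    elif isinstance(node, list):
        for value in node:
            yield prefix + (str(value),)
    else:
        yield prefix + (str(node),)


paths = list(menu_paths(description))
for path in paths:
    print(" > ".join(path))
```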

The sentence embeddings are further processed into a single tree of Taxon records based on their distance from each other in the sentence embedding space. This tree is stored in a Neo4j database.
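
One way to turn pairwise embedding distances into a single tree is agglomerative clustering. The sketch below uses single linkage with cosine distance in pure Python; this is an illustrative stand-in, not necessarily the algorithm SKOL uses, and the three toy vectors are made up.

```python
import math
from typing import Dict, List, Sequence, Tuple, Union

Tree = Union[str, Tuple["Tree", "Tree"]]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def leaves(tree: Tree) -> List[str]:
    """List the taxon names under a (sub)tree."""
    return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])


def build_tree(vectors: Dict[str, Sequence[float]]) -> Tree:
    """Repeatedly merge the two closest clusters (needs at least one vector)."""
    clusters: List[Tree] = sorted(vectors)

    def dist(c1: Tree, c2: Tree) -> float:
        # Single linkage: distance between the closest pair of members.
        return min(
            cosine_distance(vectors[a], vectors[b])
            for a in leaves(c1)
            for b in leaves(c2)
        )

    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        merged = (clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]


toy = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0]}
tree = build_tree(toy)
print(tree)
```

This naive version recomputes distances on every merge, so it only suits small examples; a full corpus would call for a library implementation.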

## A note on source code

In order to include only the most relevant parts of the source code, the heavy lifting of the new classes is not shown. Only the storage-related methods are exposed here.


In [1]:
bahir_package = 'org.apache.bahir:spark-sql-cloudant_2.12:2.4.0'
!spark-shell --packages $bahir_package < /dev/null

25/12/25 15:31:16 WARN Utils: Your hostname, puchpuchobs resolves to a loopback address: 127.0.1.1; using 10.1.10.58 instead (on interface wlp130s0f0)
25/12/25 15:31:16 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
:: loading settings :: url = jar:file:/data/piggy/miniconda3/envs/skol/lib/python3.13/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/piggy/.ivy2/cache
The jars for the packages stored in: /home/piggy/.ivy2/jars
org.apache.bahir#spark-sql-cloudant_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-6e81f36c-21f0-4167-be59-c3db43dc21f3;1.0
	confs: [default]
	found org.apache.bahir#spark-sql-cloudant_2.12;2.4.0 in central
	found org.apache.bahir#bahir-common_2.12;2.4.0 in central
	found org.spark-project.spark#unused;1.0.0 in central
	found com.cloudant#cloudant-client;2.17.0 in central
	found com.google.code.gson#gson;2.8.2 in central
	found 

In [2]:
import os
# Forces synchronous execution, making it easier to track GPU operations.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1' 

# Enables basic CUDA debug logging.
os.environ['CUDA_DEBUG'] = '1' 

# Other potentially useful variables for more detailed logging:
# os.environ['CUDA_API_CALLS'] = '1' # Logs CUDA API calls
os.environ['CUDA_LOG_LEVEL'] = 'DEBUG' # Or 'INFO', 'WARNING', etc.


In [3]:
from io import BytesIO
import json
import hashlib
import os
from pathlib import Path, PurePath
import pickle
import requests
import shutil
import sys
import tempfile
import time
from typing import Any, Dict, Iterator, List, Optional, Tuple
from urllib.robotparser import RobotFileParser
import warnings

warnings.filterwarnings('error', category=UserWarning)

# os.environ['LD_LIBRARY_PATH'] = '/data/piggy/miniconda3/envs/skol/lib/python3.13/site-packages/nvidia/cusparselt/lib'

# Be sure to get version 2: https://simple-repository.app.cern.ch/project/bibtexparser/2.0.0b8/description
import bibtexparser
import couchdb
import feedparser
import fitz # PyMuPDF

import pandas as pd  # TODO(piggy): Remove this dependency in favor of pure pyspark DataFrames.

from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import (
    Tokenizer, CountVectorizer, IDF, StringIndexer, VectorAssembler, IndexToString
)
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier

from pyspark.sql import SparkSession, DataFrame, Row
from pyspark.sql.functions import (
    input_file_name, collect_list, regexp_extract, col, udf,
    explode, trim, row_number, min, expr, concat, lit
)
from pyspark.sql.types import (
    ArrayType, BooleanType, IntegerType, MapType, NullType,
    StringType, StructType, StructField
)
from pyspark.sql.window import Window

import redis
from uuid import uuid4

# Local modules
current_dir = os.getcwd()
parent_dir = os.path.abspath(os.path.join(current_dir, os.pardir))
parent_path = Path(parent_dir)
if parent_dir not in sys.path:
    sys.path.append(parent_dir)

# TODO: Make a TaxonExtractor in this notebook with the needed i/o functions.
from fileobj import FileObject
from finder import parse_annotated, remove_interstitials
import line
from line import Line

import numpy as np

# Import SKOL classifiers
from skol_classifier.model import SkolModel
from skol_classifier.output_formatters import YeddaFormatter
from skol_classifier.preprocessing import SuffixTransformer, ParagraphExtractor
from skol_classifier.utils import get_file_list

from taxon import group_paragraphs, Taxon


## Important constants

In [4]:
should_ingest = True  # Should we ingest from the real web sites?
run_ocr = True  # Should we extract text from PDFs?
model_to_use = "rnn"
create_classifier = True  # Should we recalculate the classifier?
add_annotations = True # Should we add *.ann files? Very expensive!
build_taxon = True
generate_json = True  # Run mistral to generate JSON descriptions?
compute_embeddings = True  # Compute embeddings and save them.
run_clustering = True  # Produce the neo4j database.

couchdb_host = "127.0.0.1:5984" # e.g., "ACCOUNT.cloudant.com" or "localhost"
couchdb_username = "admin"
couchdb_password = "SU2orange!"
ingest_db_name = "skol_dev"  # Development ingestion database
training_db_name = "skol_training"
taxon_db_name = "skol_taxa_dev"  # Development Taxa database
json_taxon_db_name = "skol_taxa_full_dev"  # Development Taxa database with JSON translations

redis_host = 'localhost'
redis_port = 6379

embedding_name = 'skol:embedding:v1.1'
embedding_expire = 60 * 60 * 24 * 2  # Expire after 2 days.
classifier_model_expire = 60 * 60 * 24  * 2 # Expire after 2 days.
model_version = "v2.0"
neo4j_uri = "bolt://localhost:7687"

couchdb_url = f'http://{couchdb_host}'

cores = 2

## robots.txt

We want to be a well-behaved web scraper. Respect `robots.txt`, a standardized file that tells us what parts of a web site a scraper is allowed to access.
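
A minimal offline sketch of how `RobotFileParser` answers "may I fetch this URL?". The rules and URLs here are hypothetical and parsed from a string rather than fetched, so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline instead of downloaded.
example_rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

example_rp = RobotFileParser()
example_rp.parse(example_rules)


def allowed(url: str, agent: str = "example-bot") -> bool:
    """True if the parsed rules let `agent` fetch `url`."""
    return example_rp.can_fetch(agent, url)


print(allowed("https://example.org/content/article.pdf"))
print(allowed("https://example.org/private/draft.pdf"))
```

The ingestion code applies the same check before every request and skips any URL the site disallows.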

In [5]:
user_agent = "synoptickeyof.life"

ingenta_rp = RobotFileParser()
ingenta_rp.set_url("https://www.ingentaconnect.com/robots.txt")
ingenta_rp.read() # Reads and parses the robots.txt file from the URL

## Spark, CouchDB, and Redis connections

In [6]:
def make_spark_session():
    time.sleep(2)

    retval = SparkSession \
        .builder \
        .appName("CouchDB Spark SQL Example in Python using dataframes") \
        .master(f"local[{cores}]") \
        .config("cloudant.protocol", "http") \
        .config("cloudant.host", couchdb_host) \
        .config("cloudant.username", couchdb_username) \
        .config("cloudant.password", couchdb_password) \
        .config("spark.jars.packages", bahir_package) \
        .config("spark.driver.memory", "16g") \
        .config("spark.executor.memory", "20g") \
        .config("spark.driver.extraJavaOptions",
                "--add-opens=java.base/java.nio=ALL-UNNAMED "
                "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED "
                "--add-opens=java.base/sun.security.action=ALL-UNNAMED "
                "--add-opens=java.base/sun.util.calendar=ALL-UNNAMED") \
        .config("spark.executor.extraJavaOptions",
                "--add-opens=java.base/java.nio=ALL-UNNAMED "
                "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED "
                "--add-opens=java.base/sun.security.action=ALL-UNNAMED "
                "--add-opens=java.base/sun.util.calendar=ALL-UNNAMED") \
        .config("spark.submit.pyFiles",
                f'{parent_path / "line.py"},{parent_path / "fileobj.py"},'
                f'{parent_path / "couchdb_file.py"},{parent_path / "finder.py"},'
                f'{parent_path / "taxon.py"},{parent_path / "paragraph.py"},'
                f'{parent_path / "label.py"},{parent_path / "file.py"},'
                f'{parent_path / "extract_taxa_to_couchdb.py"}'
               ) \
        .getOrCreate()

    sc = retval.sparkContext
    sc.setLogLevel("ERROR") # Keeps the noise down!!!
    return retval

couch = couchdb.Server(couchdb_url)
couch.resource.credentials = (couchdb_username, couchdb_password)

if ingest_db_name not in couch:
    db = couch.create(ingest_db_name)
else:
    db = couch[ingest_db_name]

# Connect to Redis
redis_client = redis.Redis(
    host=redis_host,
    port=redis_port,
    db=0,
    decode_responses=False
)

## The Data Sources

The goal is to collect all the open access taxonomic literature in Mycology. The sources below mainly cover macro-fungi and slime molds.

### Ingested Data Sources

* [Mycotaxon at Ingenta Connect](https://www.ingentaconnect.com/content/mtax/mt)
* [Studies in Mycology at Ingenta Connect](https://www.studiesinmycology.org/)
* [Fungal Systematics and Evolution at Ingenta Connect](https://api.ingentaconnect.com/content/wfbi/fuse)
* [Persoonia at Ingenta Connect](https://api.ingentaconnect.com/content/wfbi/pimj)
* [MykoWeb](https://mykoweb.com/) includes scans of many older works in mycology.
* [Mycosphere](https://mycosphere.org/)


### Sources of older public domain and open access works

The Online Books Page at upenn.edu holds copies of many older materials. There is some overlap with MykoWeb, but enough difference to make it worthwhile.

* [Fungi at The Online Books Page](https://onlinebooks.library.upenn.edu/webbin/book/browse?type=lcsubc&key=Fungi%20%2D%2D%20Periodicals&c=x)
* [Reuel M. Bennett's list of mycology journals](https://medium.com/@reuelmbennett/mycology-journals-ddeec8676440)

### Journals in hand

These are journals I've collected over the years. The initial annotated issues are from early years of Mycotaxon. I still need to write ingesters for all of these.

* Mycologia (back issues)
* [Mycologia at Taylor and Francis](https://www.tandfonline.com/journals/umyc20) [RSS](https://www.tandfonline.com/feed/rss/umyc20)
  Mycologia is the main journal of the Mycological Society of America. It is a mix of open access and traditional access articles. The connector for this journal will need to identify the open access articles.
* Persoonia (all issues)
  Persoonia is no longer published.
* Mycotaxon (back issues)
  Mycotaxon is no longer published.

### Journals that need connectors

These are journals I'm aware of that include open access articles.

* [Amanitaceae.org](http://www.tullabs.com/amanita/?home)
* [Mycoscience](https://mycoscience.org/)
* [Journal of Fungi](https://www.mdpi.com/journal/jof)
* [Mycology](https://www.tandfonline.com/journals/tmyc20)
* [Open Access Journal of Mycology & Mycological Sciences](https://www.medwinpublishers.com/OAJMMS/)
* [MycoKeys](https://mycokeys.pensoft.net/)
* [Sydowia current](https://www.sydowia.at/)
* [Sydowia 1903-1968 at IA](https://archive.org/details/pub_sydowia)


## Ingestion

Each journal or other data source gets an ingester that puts PDFs into our document store along with any metadata we can collect. The metadata is sufficient to create citations for each issue, book, or article. If BibTeX citations are available, we prefer to store these verbatim.

### Ingenta RSS ingestion

Ingenta Connect is an electronic publisher that holds two Mycology journals. New articles are available via RSS (Really Simple Syndication).

In [7]:
def ingest_from_bibtex(
        db: couchdb.Database,
        content: bytes | str,
        bibtex_link: str,
        meta: Dict[str, Any],
        rp
        ) -> None:
    """Load documents referenced in an Ingenta BibTeX database."""
    # bibtexparser parses str; the RSS path hands us bytes, so decode first.
    if isinstance(content, bytes):
        content = content.decode('utf-8')
    bib_database = bibtexparser.parse_string(content)

    # NOTE: bibtex_data is assembled here but not yet stored on the documents.
    bibtex_data = {
        'link': bibtex_link,
        'bibtex': bibtexparser.write_string(bib_database),
    }

    for bib_entry in bib_database.entries:
        doc = {
            '_id': uuid4().hex,
            'meta': meta,
            'pdf_url': f"{bib_entry['url']}?crawler=true",
        }

        # Do not fetch if we already have an entry for this PDF.
        selector = {'selector': {'pdf_url': doc['pdf_url']}, 'limit': 1}
        if any(True for _ in db.find(selector)):
            print(f"Skipping {doc['pdf_url']}")
            continue

        if not rp.can_fetch(user_agent, doc['pdf_url']):
            # TODO(piggy): We should probably log blocked URLs.
            print(f"Robot permission denied {doc['pdf_url']}")
            continue

        print(f"Adding {doc['pdf_url']}")
        for k in bib_entry.fields_dict.keys():
            doc[k] = bib_entry[k]

        # Fetch the PDF before saving, so a failed download does not leave a
        # record without its attachment that later runs would skip.
        with requests.get(doc['pdf_url'], stream=False) as pdf_f:
            pdf_f.raise_for_status()
            pdf_doc = pdf_f.content

        db.save(doc)  # Fills in doc['_id'] and doc['_rev'] in place.

        attachment_filename = 'article.pdf'
        attachment_content_type = 'application/pdf'
        attachment_file = BytesIO(pdf_doc)

        db.put_attachment(doc, attachment_file, attachment_filename, attachment_content_type)

        print("-" * 10)

In [8]:
def ingest_ingenta(
        db: couchdb.Database,
        rss_url: str,
        rp
) -> None:
    """Ingest documents from an Ingenta RSS feed."""

    feed = feedparser.parse(rss_url)

    feed_meta = {
        'url': rss_url,
        'title': feed.feed.title,
        'link': feed.feed.link,
        'description': feed.feed.description,
    }

    for entry in feed.entries:
        entry_meta = {
            'title': entry.title,
            'link': entry.link,
        }
        if hasattr(entry, 'summary'):
            entry_meta['summary'] = entry.summary
        if hasattr(entry, 'description'):
            entry_meta['description'] = entry.description

        bibtex_link = f'{entry.link}?format=bib'
        print(f"bibtex_link: {bibtex_link}")

        if not rp.can_fetch(user_agent, bibtex_link):
            print(f"Robot permission denied {bibtex_link}")
            continue

        with requests.get(bibtex_link, stream=False) as bibtex_f:
            bibtex_f.raise_for_status()  # Raise an HTTPError for bad responses (4xx or 5xx)

            ingest_from_bibtex(
                db=db,
                content=bibtex_f.content\
                    .replace(b"\"\nparent", b"\",\nparent")\
                    .replace(b"\n", b""),
                bibtex_link=bibtex_link,
                meta={
                    'feed': feed_meta,
                    'entry': entry_meta,
                },
                rp=rp
            )
        print("=" * 20)

In [9]:
def ingest_from_local_bibtex(
    db: couchdb.Database,
    root: Path,
    rp
) -> None:
    """Ingest from a local directory with Ingenta BibTeX files in it."""
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith('format=bib'):
                continue
            full_filepath = os.path.join(dirpath, filename)
            # Use the path relative to root so the URL has a single slash after the host.
            relative_path = os.path.relpath(full_filepath, root)
            bibtex_link = f"https://www.ingentaconnect.com/{relative_path}"
            with open(full_filepath) as f:
                # Paper over a syntax problem in Ingenta Connect BibTeX files.
                content = f.read()\
                    .replace("\"\nparent", "\",\nparent")\
                    .replace("\n", "")
            ingest_from_bibtex(db, content, bibtex_link, meta={}, rp=rp)
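
The two `.replace` calls used by both ingesters are easier to follow on a tiny input. This sketch isolates them in a hypothetical helper, `repair_ingenta_bibtex`, and runs it on a made-up entry in which the comma before the `parent` field is missing.

```python
def repair_ingenta_bibtex(raw: str) -> str:
    """Insert the comma missing before a `parent...` field, then drop newlines."""
    return raw.replace('"\nparent', '",\nparent').replace("\n", "")


# Hypothetical entry with the missing comma after the title field.
broken = '@article{key,\ntitle = "T"\nparent_itemid = "x"\n}'
fixed = repair_ingenta_bibtex(broken)
print(fixed)
```

The repair is deliberately narrow: it only touches a closing quote that is immediately followed by a line starting with `parent`.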


In [10]:
# Mycotaxon
if should_ingest:
    ingest_ingenta(db=db, rss_url='https://api.ingentaconnect.com/content/mtax/mt?format=rss', rp=ingenta_rp)

In [11]:
# Studies in Mycology
if should_ingest:
    ingest_ingenta(db=db, rss_url='https://api.ingentaconnect.com/content/wfbi/sim?format=rss', rp=ingenta_rp)

In [12]:
if should_ingest:
    ingest_from_local_bibtex(
        db=db,
        root=Path("/data/skol/www/www.ingentaconnect.com"),
        rp=ingenta_rp
    )

### Text extraction

If the PDF has embedded text, we extract it. Otherwise, we run OCR to get text.
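
The "embedded text first, OCR otherwise" decision can be sketched as a small function. The `min_chars` threshold and the `ocr_page` callable are assumptions for illustration; the notebook's own extraction code may decide differently.

```python
from typing import Callable


def choose_text(
    embedded_text: str,
    ocr_page: Callable[[], str],
    min_chars: int = 20,  # hypothetical threshold for "has embedded text"
) -> str:
    """Prefer embedded text; fall back to OCR when almost nothing is embedded."""
    if len(embedded_text.strip()) >= min_chars:
        return embedded_text
    # OCR is only invoked when needed, since it is the expensive path.
    return ocr_page()


print(choose_text("   ", lambda: "text recovered by OCR"))
```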

In [13]:
spark = make_spark_session()

df = spark.read.load(
    format="org.apache.bahir.cloudant",
    database=ingest_db_name
)

25/12/25 15:31:26 WARN Utils: Your hostname, puchpuchobs resolves to a loopback address: 127.0.1.1; using 10.1.10.58 instead (on interface wlp130s0f0)
25/12/25 15:31:26 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Ivy Default Cache set to: /home/piggy/.ivy2/cache
The jars for the packages stored in: /home/piggy/.ivy2/jars
org.apache.bahir#spark-sql-cloudant_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-1567f150-0921-4722-9939-77e891f869e9;1.0
	confs: [default]


:: loading settings :: url = jar:file:/data/piggy/miniconda3/envs/skol/lib/python3.13/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


	found org.apache.bahir#spark-sql-cloudant_2.12;2.4.0 in central
	found org.apache.bahir#bahir-common_2.12;2.4.0 in central
	found org.spark-project.spark#unused;1.0.0 in central
	found com.cloudant#cloudant-client;2.17.0 in central
	found com.google.code.gson#gson;2.8.2 in central
	found commons-codec#commons-codec;1.6 in central
	found com.cloudant#cloudant-http;2.17.0 in central
	found commons-io#commons-io;2.4 in central
	found com.squareup.okhttp3#okhttp;3.12.2 in central
	found com.squareup.okio#okio;1.15.0 in central
	found com.typesafe#config;1.3.1 in central
	found org.scalaj#scalaj-http_2.12;2.3.0 in central
:: resolution report :: resolve 216ms :: artifacts dl 6ms
	:: modules in use:
	com.cloudant#cloudant-client;2.17.0 from central in [default]
	com.cloudant#cloudant-http;2.17.0 from central in [default]
	com.google.code.gson#gson;2.8.2 from central in [default]
	com.squareup.okhttp3#okhttp;3.12.2 from central in [default]
	com.squareup.okio#okio;1.15.0 from central in [def

In [14]:
df.describe()

DataFrame[summary: string, _id: string, _rev: string, abstract: string, author: string, doi: string, eissn: string, issn: string, itemtype: string, journal: string, number: string, pages: string, parent_itemid: string, pdf_url: string, publication date: string, publishercode: string, title: string, url: string, volume: string, year: string]

In [15]:
# Content-Type: text/html; charset=UTF-8

def pdf_to_text(pdf_contents: bytes) -> bytes:
    doc = fitz.open(stream=BytesIO(pdf_contents), filetype="pdf")

    full_text = ''
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        # Possibly perform OCR on the page
        text = page.get_text("text", flags=fitz.TEXT_PRESERVE_WHITESPACE | fitz.TEXT_DEHYPHENATE)
        full_text += f"\n--- PDF Page {page_num+1} ---\n"
        full_text += text

    return full_text.encode("utf-8")

def add_text_to_partition(iterator) -> None:
    couch = couchdb.Server(couchdb_url)
    couch.resource.credentials = (couchdb_username, couchdb_password)
    local_db = couch[ingest_db_name]
    for row in iterator:
        if not row:
            continue
        if not row._attachments:
            continue
        row_dict = row.asDict()
        attachment_dict = row._attachments.asDict()
        for pdf_filename in attachment_dict:
            pdf_path = PurePath(pdf_filename)
            if pdf_path.suffix != '.pdf':
                continue
            txt_path_str = pdf_path.stem + '.txt'
            # if txt_path_str in attachment_dict:
            #     # TODO(piggy): Recalculate text if text is terrible. Too much noise vocabulary?
            #     print(f"Already have text for {row.pdf_url}")
            #     continue
            print(f"{row._id}, {row.pdf_url}")
            pdf_file = local_db.get_attachment(row._id, str(pdf_path)).read()
            txt_file = pdf_to_text(pdf_file)
            # 'text/plain' is the registered MIME type for plain text.
            attachment_content_type = 'text/plain; charset=UTF-8'
            attachment_file = BytesIO(txt_file)
            local_db.put_attachment(row_dict, attachment_file, txt_path_str, attachment_content_type)


In [16]:
# Identical to skol_classifier.CouchDBConnection.
from skol_classifier.couchdb_io import CouchDBConnection as CDBC

class CouchDBConnection(CDBC):
    """
    Manages CouchDB connection and provides I/O operations.

    This class encapsulates connection parameters and provides an idempotent
    connection method that can be safely called multiple times.
    """


In [17]:
from skol_classifier.output_formatters import CouchDBOutputWriter as CDBOW
class CouchDBOutputWriter(CDBOW):
    """
    Writes predictions back to CouchDB as attachments.
    """


In [18]:
from skol_classifier.classifier_v2 import SkolClassifierV2 as SC
from skol_classifier.model import create_model
"""
Main classifier module for SKOL text classification
"""
class SkolClassifierV2(SC):
    """
    Text classifier for taxonomic literature.

    This version only includes the redis and couchdb I/O methods.
    All other methods are in SC.

    Supports multiple classification models (Logistic Regression, Random Forest, RNN)
    and feature types (word TF-IDF, suffix TF-IDF, combined).
    """


## Build a classifier to identify paragraph types.

We save this to Redis so that we don't need to train the model every time.

The heuristic paragraph detector I wrote for earlier phases of the project did not generalize well to the two journals currently used as data sources. Instead, I introduced a line-by-line classification scheme.
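
With line-level labels, paragraph-like blocks can be recovered afterwards by merging runs of consecutive lines that share a label. The sketch below shows that idea with a hypothetical helper and made-up lines; it is not the SKOL classifier's own post-processing.

```python
from itertools import groupby
from typing import List, Tuple


def merge_labeled_lines(lines: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Collapse consecutive (label, text) lines that share a label into one block."""
    blocks = []
    for label, group in groupby(lines, key=lambda pair: pair[0]):
        blocks.append((label, "\n".join(text for _, text in group)))
    return blocks


# Hypothetical classifier output, one (label, text) pair per line.
labeled = [
    ("Nomenclature", "Examplea fictitia sp. nov."),
    ("Description", "Pileus 3 cm wide,"),
    ("Description", "convex, brown."),
    ("Misc-exposition", "Notes on the collection."),
]
blocks = merge_labeled_lines(labeled)
for label, text in blocks:
    print(label, repr(text))
```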

In [19]:
# Get annotated training files
annotated_path = Path.cwd().parent / "data" / "annotated"
print(f"Loading annotated files from: {annotated_path}")
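if annotated_path.exists():
    annotated_files = get_file_list(str(annotated_path), pattern="**/*.ann")
else:
    # Fail early with a clear message instead of a later NameError.
    raise FileNotFoundError(f"No annotated training data at {annotated_path}")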


Loading annotated files from: /data/piggy/src/github.com/piggyatbaqaqi/skol/data/annotated


In [20]:
# Train classifier on annotated data and save to Redis using SkolClassifierV2

model_configs = {
    # "rnn": {
    #     "name": "RNN BiLSTM (line-level, advanced config)",
    #     "model_type": "rnn",
    #     "use_suffixes": True,
    #     "line_level": True,
    #     "num_workers": cores,
    #     "verbosity": 2,
    #     "num_classes": 3,
     
    #     # "input_size": 1000,
    #     # "hidden_size": 256,
    #     # "num_layers": 3,
    #     # "dropout": 0.3,
    #     # "window_size": 20,
    #     # "prediction_stride": 15,  # 25% overlap
    #     # "prediction_stride": 20,  # 0 overlap
    #     # "prediction_batch_size": 32,
    #     # "batch_size": 16,  # 442MiB footprint
    #     # "batch_size": 128,  # 570MiB footprint
    #     # "batch_size": 512,  # 1370MiB footprint
    #     # "batch_size": 1024,  # 2394MiB footprint
    #     # "batch_size": 2048,  # 4442MiB footprint, 5s per step
    #     # "batch_size": 3276,  # 8570MiB
    #     # "batch_size": 4096,  #  4442MiB footprint, 8s-11s per step
    #     # "batch_size": 8192,  # 8538MiB footprint, 36s per step
    #     # "batch_size": 16384,  # 16730MiB footprint, (3s) 38s-40s per step

    #     # "epochs": 4,
    #     # ==================================================
    #     # RNN Model Evaluation Statistics (Line-Level)
    #     # ==================================================
    #     # Accuracy:  0.8717
    #     # Precision: 0.8717
    #     # Recall:    1.0000
    #     # F1 Score:  0.8120
    #     # Frequencies: [Row(label_indexed=0.0, count=6933), Row(label_indexed=1.0, count=854), Row(label_indexed=2.0, count=133)]
    #     # "epochs": 10,
    #     #  Accuracy is still going up and loss is going down, so we are not yet overfitting.
    #     # Epoch 1/10
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 6s 6s/step - accuracy: 0.3325 - loss: 1.1041
    #     # Epoch 2/10
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 3s 3s/step - accuracy: 0.7891 - loss: 0.8316
    #     # Epoch 3/10
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 141s 141s/step - accuracy: 0.8153 - loss: 0.6313
    #     # Epoch 4/10
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 141s 141s/step - accuracy: 0.8248 - loss: 0.5051
    #     # Epoch 5/10
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 141s 141s/step - accuracy: 0.8482 - loss: 0.4304
    #     # Epoch 6/10
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 141s 141s/step - accuracy: 0.8862 - loss: 0.3632
    #     # Epoch 7/10
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 140s 140s/step - accuracy: 0.9134 - loss: 0.3145
    #     # Epoch 8/10
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 147s 147s/step - accuracy: 0.9190 - loss: 0.2956
    #     # Epoch 9/10
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 155s 155s/step - accuracy: 0.9146 - loss: 0.2887
    #     # Epoch 10/10
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 147s 147s/step - accuracy: 0.9179 - loss: 0.2639
    #     # ======================================================================
    #     # RNN Model Evaluation Statistics (Line-Level)
    #     # ======================================================================
        
    #     # Overall Metrics:
    #     #   Accuracy:  0.8363
    #     #   Precision: 0.8363
    #     #   Recall:    1.0000
    #     #   F1 Score:  0.7618
    #     #   Loss:      0.6439
    #     #   Total Predictions: 3409
        
    #     # Per-Class Metrics:
    #     # Class                Accuracy   Precision  Recall     F1         Loss       Support   
    #     # --------------------------------------------------------------------------------
    #     # Misc-exposition      1.0000     0.8363     1.0000     0.9109     0.1161     2851      
    #     # Description          0.0000     0.0000     0.0000     0.0000     2.8847     475       
    #     # Nomenclature         0.0000     0.0000     0.0000     0.0000     5.9489     83        
        
    #     # Confusion Matrix:
    #     # True \ Pred      Misc-exposition  Description  Nomenclature
    #     # -----------------------------------------------------------
    #     # Misc-exposition  2851             0            0
    #     # Description      475              0            0
    #     # Nomenclature     83               0            0
    #     # ======================================================================


    #     # "epochs": 20,
    #     # Epoch 1/20
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 6s 6s/step - accuracy: 0.2846 - loss: 1.1448
    #     # Epoch 2/20
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 3s 3s/step - accuracy: 0.7889 - loss: 0.8223
    #     # Epoch 3/20
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 146s 146s/step - accuracy: 0.8071 - loss: 0.6259
    #     # Epoch 4/20
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 147s 147s/step - accuracy: 0.8099 - loss: 0.5302
    #     # Epoch 5/20
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 150s 150s/step - accuracy: 0.8190 - loss: 0.4738
    #     # Epoch 6/20
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 148s 148s/step - accuracy: 0.8522 - loss: 0.4068
    #     # Epoch 7/20
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 149s 149s/step - accuracy: 0.8984 - loss: 0.3431
    #     # Epoch 8/20
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 159s 159s/step - accuracy: 0.9193 - loss: 0.3069
    #     # Epoch 9/20
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 157s 157s/step - accuracy: 0.9178 - loss: 0.2937
    #     # Epoch 10/20
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 158s 158s/step - accuracy: 0.9185 - loss: 0.2651
    #     # Epoch 11/20
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 154s 154s/step - accuracy: 0.9234 - loss: 0.2341
    #     # Epoch 12/20
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 157s 157s/step - accuracy: 0.9276 - loss: 0.2154
    #     # Epoch 13/20
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 154s 154s/step - accuracy: 0.9325 - loss: 0.2015
    #     # Epoch 14/20
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 148s 148s/step - accuracy: 0.9381 - loss: 0.1924
    #     # Epoch 15/20
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 146s 146s/step - accuracy: 0.9448 - loss: 0.1804
    #     # Epoch 16/20
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 149s 149s/step - accuracy: 0.9489 - loss: 0.1741
    #     # Epoch 17/20
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 147s 147s/step - accuracy: 0.9502 - loss: 0.1711
    #     # Epoch 18/20
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 146s 146s/step - accuracy: 0.9502 - loss: 0.1683
    #     # Epoch 19/20
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 145s 145s/step - accuracy: 0.9516 - loss: 0.1617
    #     # Epoch 20/20
    #     # 1/1 ━━━━━━━━━━━━━━━━━━━━ 146s 146s/step - accuracy: 0.9524 - loss: 0.1533
    #     # ======================================================================
    #     # RNN Model Evaluation Statistics (Line-Level)
    #     # ======================================================================
        
    #     # Overall Metrics:
    #     #   Accuracy:  0.8363
    #     #   Precision: 0.8363
    #     #   Recall:    1.0000
    #     #   F1 Score:  0.7618
    #     #   Loss:      0.9835
    #     #   Total Predictions: 3409
        
    #     # Per-Class Metrics:
    #     # Class                Accuracy   Precision  Recall     F1         Loss       Support   
    #     # --------------------------------------------------------------------------------
    #     # Misc-exposition      1.0000     0.8363     1.0000     0.9109     0.0080     2851      
    #     # Description          0.0000     0.0000     0.0000     0.0000     5.7361     475       
    #     # Nomenclature         0.0000     0.0000     0.0000     0.0000     7.2958     83        
        
    #     # Confusion Matrix:
    #     # True \ Pred      Misc-exposition  Description  Nomenclature
    #     # -----------------------------------------------------------
    #     # Misc-exposition  2851             0            0
    #     # Description      475              0            0
    #     # Nomenclature     83               0            0
    #     # ======================================================================

    #     # "weight_strategy": "aggressive",
    #     # "focal_labels": ["Nomenclature", "Description"],

    #     # Trial 2  22224MiB GPU RAM
    #     # # Architecture - leverage GPU for depth and context
    #     # "hidden_size": 256,
    #     # "num_layers": 4,           # NEW: Deeper network
    #     # "dropout": 0.4,            # NEW: More regularization
    #     # "window_size": 35,         # NEW: More context
        
    #     # # Training - more epochs for minority classes
    #     # "epochs": 18,              # NEW: More training
    #     # "batch_size": 3276,        # 5000 OOMs.
    #     # "prediction_stride": 1,    # NEW: Maximum density
        
    #     # # Weights - ultra-aggressive for minorities
    #     # "class_weights": {
    #     #     "Nomenclature": 250.0,
    #     #     "Description": 20.0,
    #     #     "Misc-exposition": 0.05
    #     # },

    #     # # Trial 3
    #     # # Architecture - reduce depth, keep context
    #     # "hidden_size": 256,
    #     # "num_layers": 3,           # REDUCED from 4
    #     # "dropout": 0.4,
    #     # "window_size": 35,
        
    #     # # Training
    #     # "epochs": 15,              # Slightly reduced
    #     # "batch_size": 4000,        # Increased slightly
    #     # "prediction_stride": 1,
        
    #     # # Weights - MUCH more balanced
    #     # "class_weights": {
    #     #     "Nomenclature": 80.0,   # Was 250
    #     #     "Description": 8.0,      # Was 20
    #     #     "Misc-exposition": 0.2   # Was 0.05
    #     # },
    #     # # ======================================================================
    #     # # RNN Model Evaluation Statistics (Line-Level)
    #     # # ======================================================================
        
    #     # # Overall Metrics:
    #     # #   Accuracy:  0.8750
    #     # #   Precision: 0.8753
    #     # #   Recall:    0.9996
    #     # #   F1 Score:  0.8170
    #     # #   Loss:      0.4758
    #     # #   Total Predictions: 7920
        
    #     # # Per-Class Metrics:
    #     # # Class                Accuracy   Precision  Recall     F1         Loss       Support   
    #     # # --------------------------------------------------------------------------------
    #     # # Misc-exposition      0.9996     0.8753     0.9996     0.9333     0.0441     6933      
    #     # # Description          0.0000     0.0000     0.0000     0.0000     3.2564     854       
    #     # # Nomenclature         0.0000     0.0000     0.0000     0.0000     5.1297     133       
        
    #     # # Confusion Matrix:
    #     # # True \ Pred      Misc-exposition  Description  Nomenclature
    #     # # -----------------------------------------------------------
    #     # # Misc-exposition  6930             3            0
    #     # # Description      854              0            0
    #     # # Nomenclature     133              0            0
    #     # # ======================================================================
        
    #     # # [Classifier Fit] Statistics calculated, adding metadata

    #     # # Trial 4
    #     # # Architecture - back to 4 layers, keep context
    #     # "hidden_size": 256,
    #     # "num_layers": 4,           # Was 3 in Trial 3
    #     # "dropout": 0.4,
    #     # "window_size": 35,
        
    #     # # Training
    #     # "epochs": 15,
    #     # "batch_size": 3276,        # Was 4000 in Trial 3
    #     # "prediction_stride": 1,
        
    #     # # Weights
    #     # "class_weights": {
    #     #     "Nomenclature": 0.1,     # Was 80 in Trial 3
    #     #     "Description": 20.0,     # Was 8 in Trial 3
    #     #     "Misc-exposition": 0.1   # Was 0.2 in Trial 3
    #     # },
    #     # ======================================================================
    #     # RNN Model Evaluation Statistics (Line-Level)
    #     # ======================================================================
        
    #     # Overall Metrics:
    #     #   Accuracy:  0.3650
    #     #   Precision: 0.9273
    #     #   Recall:    0.3147
    #     #   F1 Score:  0.4352
    #     #   Loss:      1.1199
    #     #   Total Predictions: 7920
        
    #     # Per-Class Metrics:
    #     # Class                Accuracy   Precision  Recall     F1         Loss       Support   
    #     # --------------------------------------------------------------------------------
    #     # Misc-exposition      0.3147     0.9273     0.3147     0.4700     1.1105     6933      
    #     # Description          0.8302     0.1274     0.8302     0.2208     0.3793     854       
    #     # Nomenclature         0.0000     0.0000     0.0000     0.0000     6.3663     133       
        
    #     # Confusion Matrix:
    #     # True \ Pred      Misc-exposition  Description  Nomenclature
    #     # -----------------------------------------------------------
    #     # Misc-exposition  2182             4751         0
    #     # Description      145              709          0
    #     # Nomenclature     26               107          0
    #     # ======================================================================
        
    #     # # Trial 5
    #     # # Architecture - unchanged from Trial 4
    #     # "hidden_size": 256,
    #     # "num_layers": 4,
    #     # "dropout": 0.4,
    #     # "window_size": 35,
        
    #     # # Training
    #     # "epochs": 15,
    #     # "batch_size": 3276,
    #     # "prediction_stride": 1,
        
    #     # # Weights
    #     # "class_weights": {
    #     #     "Nomenclature": 0.1,
    #     #     "Description": 20.0,
    #     #     "Misc-exposition": 1.0,
    #     # },
    #     # ======================================================================
    #     # RNN Model Evaluation Statistics (Line-Level)
    #     # ======================================================================
        
    #     # Overall Metrics:
    #     #   Accuracy:  0.8754
    #     #   Precision: 0.8754
    #     #   Recall:    1.0000
    #     #   F1 Score:  0.8172
    #     #   Loss:      0.4787
    #     #   Total Predictions: 7920
        
    #     # Per-Class Metrics:
    #     # Class                Accuracy   Precision  Recall     F1         Loss       Support   
    #     # --------------------------------------------------------------------------------
    #     # Misc-exposition      1.0000     0.8754     1.0000     0.9335     0.0539     6933      
    #     # Description          0.0000     0.0000     0.0000     0.0000     3.1055     854       
    #     # Nomenclature         0.0000     0.0000     0.0000     0.0000     5.7583     133       
        
    #     # Confusion Matrix:
    #     # True \ Pred      Misc-exposition  Description  Nomenclature
    #     # -----------------------------------------------------------
    #     # Misc-exposition  6933             0            0
    #     # Description      854              0            0
    #     # Nomenclature     133              0            0
    #     # ======================================================================


    #     # # Trial 6
    #     # # Architecture - same depth, slightly less context
    #     # "hidden_size": 256,
    #     # "num_layers": 4,
    #     # "dropout": 0.4,
    #     # "window_size": 30,         # Was 35

    #     # # Features
    #     # "word_vocab_size": 1800,
    #     # "suffix_vocab_size": 200,
   
    #     # # Training
    #     # "epochs": 15,
    #     # "batch_size": 3276,
    #     # "prediction_stride": 1,
        
    #     # # Weights
    #     # "weight_strategy": "inverse",
    #     # ======================================================================
    #     # Model Evaluation Statistics (Line-Level)
    #     # ======================================================================
        
    #     # Overall Metrics:
    #     #   Accuracy:  0.8634
    #     #   Precision: 0.8774
    #     #   Recall:    0.9810
    #     #   F1 Score:  0.8187
    #     #   Loss:      0.4244
    #     #   Total Predictions: 7920
        
    #     # Per-Class Metrics:
    #     # Class                Accuracy   Precision  Recall     F1         Loss       Support   
    #     # --------------------------------------------------------------------------------
    #     # Misc-exposition      0.9810     0.8774     0.9810     0.9263     0.1908     6933      
    #     # Description          0.0433     0.2256     0.0433     0.0727     1.7196     854       
    #     # Nomenclature         0.0000     0.0000     0.0000     0.0000     4.2815     133       
        
    #     # Confusion Matrix:
    #     # True \ Pred      Misc-exposition  Description  Nomenclature
    #     # -----------------------------------------------------------
    #     # Misc-exposition  6801             127          5
    #     # Description      817              37           0
    #     # Nomenclature     133              0            0
    #     # ======================================================================
        
    #     # # Trial 7
    #     # # Architecture - deeper network (5 layers), keep context
    #     # "hidden_size": 256,
    #     # "num_layers": 5,
    #     # "dropout": 0.4,
    #     # "window_size": 35,
        
    #     # # Training
    #     # "epochs": 15,
    #     # "batch_size": 3276,
    #     # "prediction_stride": 1,
        
    #     # # Weights
    #     # "class_weights": {
    #     #     "Nomenclature": 250.0,
    #     #     "Description": 20.0,
    #     #     "Misc-exposition": 0.1
    #     # },
    #     # ======================================================================
    #     # Model Evaluation Statistics (Line-Level)
    #     # ======================================================================
        
    #     # Overall Metrics:
    #     #   Accuracy:  0.8403
    #     #   Precision: 0.8813
    #     #   Recall:    0.9449
    #     #   F1 Score:  0.8152
    #     #   Loss:      0.4622
    #     #   Total Predictions: 7920
        
    #     # Per-Class Metrics:
    #     # Class                Accuracy   Precision  Recall     F1         Loss       Support   
    #     # --------------------------------------------------------------------------------
    #     # Misc-exposition      0.9449     0.8813     0.9449     0.9120     0.3141     6933      
    #     # Description          0.1218     0.2176     0.1218     0.1562     1.2057     854       
    #     # Nomenclature         0.0000     0.0000     0.0000     0.0000     3.4048     133       
        
    #     # Confusion Matrix:
    #     # True \ Pred      Misc-exposition  Description  Nomenclature
    #     # -----------------------------------------------------------
    #     # Misc-exposition  6551             373          9
    #     # Description      750              104          0
    #     # Nomenclature     132              1            0
    #     # ======================================================================

    #     # # Trial 8
    #     # # Architecture - same depth as Trial 7, window reduced to 30
    #     # "hidden_size": 256,
    #     # "num_layers": 5,
    #     # "dropout": 0.4,
    #     # "window_size": 30,
        
    #     # # Features
    #     # "word_vocab_size": 3600,
    #     # "suffix_vocab_size": 400,
       
    #     # # Training
    #     # "epochs": 15,
    #     # "batch_size": 3000,
    #     # "prediction_stride": 1,
        
    #     # # Weights
    #     # "class_weights": {
    #     #     "Nomenclature": 250.0,
    #     #     "Description": 20.0,
    #     #     "Misc-exposition": 1.0,
    #     # },
    #     # ======================================================================
    #     # Model Evaluation Statistics (Line-Level)
    #     # ======================================================================
        
    #     # Overall Metrics:
    #     #   Accuracy:  0.8754
    #     #   Precision: 0.8754
    #     #   Recall:    1.0000
    #     #   F1 Score:  0.8172
    #     #   Loss:      0.6708
    #     #   Total Predictions: 7920
        
    #     # Per-Class Metrics:
    #     # Class                Accuracy   Precision  Recall     F1         Loss       Support   
    #     # --------------------------------------------------------------------------------
    #     # Misc-exposition      1.0000     0.8754     1.0000     0.9335     0.0091     6933      
    #     # Description          0.0000     0.0000     0.0000     0.0000     5.0615     854       
    #     # Nomenclature         0.0000     0.0000     0.0000     0.0000     6.9721     133       
        
    #     # Confusion Matrix:
    #     # True \ Pred      Misc-exposition  Description  Nomenclature
    #     # -----------------------------------------------------------
    #     # Misc-exposition  6933             0            0
    #     # Description      854              0            0
    #     # Nomenclature     133              0            0
    #     # ======================================================================

    #     # Trial 9
    #     # Architecture - unchanged from Trial 8, plus use_gpu_in_udf
    #     "hidden_size": 256,
    #     "num_layers": 5,
    #     "dropout": 0.4,
    #     "window_size": 30,
    #     "use_gpu_in_udf": True,
        
    #     # Features
    #     "word_vocab_size": 3600,
    #     "suffix_vocab_size": 400,
       
    #     # Training
    #     "epochs": 15,
    #     "batch_size": 3000,
    #     "prediction_stride": 1,
        
    #     # Weights
    #     "class_weights": {
    #         "Nomenclature": 250.0,
    #         "Description": 20.0,
    #         "Misc-exposition": 1.0,
    #     },
    # },
    # "logistic": {
    #     "name": "Logistic Regression (line-level, words + suffixes)",
    #     "model_type": "logistic",
    #     "verbosity": 2,
    #     # Input
    #     "input_source": 'files',
    #     "file_paths": annotated_files,
    
    #     # "use_suffixes": True,
    #     # "maxIter": 10,
    #     # "regParam": 0.01,
    #     # "line_level": True,
    #     # "weight_strategy": "inverse",
    #     #
    #     # Training complete!
    #     #   Accuracy: 0.2905
    #     #   F1 Score: 0.3295
    #     # ✓ Model saved to Redis with key: skol:classifier:model:logistic_v2.0

        
    #     # "use_suffixes": True,
    #     # "maxIter": 10,
    #     # "regParam": 0.01,
    #     # "line_level": True,
    #     # "weight_strategy": "inverse",
    #     # "word_vocab_size": 1800,
    #     # "suffix_vocab_size": 200,
    #     # ======================================================================
    #     # Model Evaluation Statistics (Line-Level)
    #     # ======================================================================
        
    #     # Overall Metrics:
    #     #   Accuracy:  0.3812
    #     #   Precision: 0.9995
    #     #   Recall:    0.2968
    #     #   F1 Score:  0.4498
    #     #   Total Predictions: 7920
        
    #     # Per-Class Metrics:
    #     # Class                Accuracy   Precision  Recall     F1         Support   
    #     # ----------------------------------------------------------------------
    #     # Misc-exposition      0.2968     0.9995     0.2968     0.4577     6933      
    #     # Description          0.9696     0.2864     0.9696     0.4422     854       
    #     # Nomenclature         1.0000     0.0448     1.0000     0.0857     133       
        
    #     # Confusion Matrix:
    #     # True \ Pred      Misc-exposition  Description  Nomenclature
    #     # -----------------------------------------------------------
    #     # Misc-exposition  2058             2063         2812
    #     # Description      1                828          25
    #     # Nomenclature     0                0            133
    #     # ======================================================================
        
    #     # ======================================================================
    #     # Model Evaluation Statistics (Line-Level)
    #     # ======================================================================
        
    #     # Overall Metrics:
    #     #   Accuracy:  0.3631
    #     #   Precision: 0.9995
    #     #   Recall:    0.2771
    #     #   F1 Score:  0.4301
    #     #   Total Predictions: 7920
        
    #     # Per-Class Metrics:
    #     # Class                Accuracy   Precision  Recall     F1         Support   
    #     # ----------------------------------------------------------------------
    #     # Misc-exposition      0.2771     0.9995     0.2771     0.4339     6933      
    #     # Description          0.9625     0.2973     0.9625     0.4543     854       
    #     # Nomenclature         1.0000     0.0411     1.0000     0.0790     133       
        
    #     # Confusion Matrix:
    #     # True \ Pred      Misc-exposition  Description  Nomenclature
    #     # -----------------------------------------------------------
    #     # Misc-exposition  1921             1943         3069
    #     # Description      1                822          31
    #     # Nomenclature     0                0            133
    #     # ======================================================================
        
    #     # "use_suffixes": True,
    #     # "maxIter": 100,
    #     # "regParam": 0.01,
    #     # "line_level": True,
    #     # "class_weights": {
    #     #     "Nomenclature": 250.0,
    #     #     "Description": 20.0,
    #     #     "Misc-exposition": 0.1
    #     # },
    #     # "word_vocab_size": 1800,
    #     # "suffix_vocab_size": 200,

    #     # ======================================================================
    #     # Model Evaluation Statistics (Line-Level)
    #     # ======================================================================
        
    #     # Overall Metrics:
    #     #   Accuracy:  0.3631
    #     #   Precision: 0.9995
    #     #   Recall:    0.2771
    #     #   F1 Score:  0.4301
    #     #   Total Predictions: 7920
        
    #     # Per-Class Metrics:
    #     # Class                Accuracy   Precision  Recall     F1         Support   
    #     # ----------------------------------------------------------------------
    #     # Misc-exposition      0.2771     0.9995     0.2771     0.4339     6933      
    #     # Description          0.9625     0.2973     0.9625     0.4543     854       
    #     # Nomenclature         1.0000     0.0411     1.0000     0.0790     133       
        
    #     # Confusion Matrix:
    #     # True \ Pred      Misc-exposition  Description  Nomenclature
    #     # -----------------------------------------------------------
    #     # Misc-exposition  1921             1943         3069
    #     # Description      1                822          31
    #     # Nomenclature     0                0            133
    #     # ======================================================================

    #     # "use_suffixes": True,
    #     # "maxIter": 100,
    #     # "regParam": 0.01,
    #     # "line_level": True,
    #     # "class_weights": {
    #     #     "Nomenclature": 250.0,
    #     #     "Description": 20.0,
    #     #     "Misc-exposition": 1.0
    #     # },
    #     # "word_vocab_size": 3600,
    #     # "suffix_vocab_size": 400,
    #     # ======================================================================
    #     # Model Evaluation Statistics (Line-Level)
    #     # ======================================================================
        
    #     # Overall Metrics:
    #     #   Accuracy:  0.6104
    #     #   Precision: 0.9964
    #     #   Recall:    0.5605
    #     #   F1 Score:  0.7007
    #     #   Loss:      0.8589
    #     #   Total Predictions: 7920
        
    #     # Per-Class Metrics:
    #     # Class                Accuracy   Precision  Recall     F1         Loss       Support   
    #     # --------------------------------------------------------------------------------
    #     # Misc-exposition      0.5605     0.9964     0.5605     0.7174     0.9655     6933      
    #     # Description          0.9543     0.5012     0.9543     0.6573     0.1223     854       
    #     # Nomenclature         1.0000     0.0556     1.0000     0.1053     0.0306     133       
        
    #     # Confusion Matrix:
    #     # True \ Pred      Misc-exposition  Description  Nomenclature
    #     # -----------------------------------------------------------
    #     # Misc-exposition  3886             811          2236
    #     # Description      14               815          25
    #     # Nomenclature     0                0            133
    #     # ======================================================================
    #     # "use_suffixes": True,
    #     # "maxIter": 100,
    #     # "regParam": 0.01,
    #     # "line_level": False,
    #     # "class_weights": {
    #     #     "Nomenclature": 250.0,
    #     #     "Description": 20.0,
    #     #     "Misc-exposition": 2.0
    #     # },
    #     # "word_vocab_size": 3600,
    #     # "suffix_vocab_size": 400,
    #     # ======================================================================
    #     # Model Evaluation Statistics (Line-Level)
    #     # ======================================================================
        
    #     # Overall Metrics:
    #     #   Accuracy:  0.9458
    #     #   Precision: 1.0000
    #     #   Recall:    0.9399
    #     #   F1 Score:  0.9588
    #     #   Loss:      0.1624
    #     #   Total Predictions: 1217
        
    #     # Per-Class Metrics:
    #     # Class                Accuracy   Precision  Recall     F1         Loss       Support   
    #     # --------------------------------------------------------------------------------
    #     # Misc-exposition      0.9399     1.0000     0.9399     0.9690     0.1739     1082      
    #     # Description          0.9915     0.9070     0.9915     0.9474     0.0761     118       
    #     # Nomenclature         1.0000     0.2394     1.0000     0.3864     0.0312     17        
        
    #     # Confusion Matrix:
    #     # True \ Pred      Misc-exposition  Description  Nomenclature
    #     # -----------------------------------------------------------
    #     # Misc-exposition  1017             12           53
    #     # Description      0                117          1
    #     # Nomenclature     0                0            17
    #     # ======================================================================
    #     # "use_suffixes": True,
    #     # "maxIter": 100,
    #     # "regParam": 0.01,
    #     # "line_level": False,
    #     # "class_weights": {
    #     #     "Nomenclature": 250.0,
    #     #     "Description": 20.0,
    #     #     "Misc-exposition": 4.0
    #     # },
    #     # "word_vocab_size": 3600,
    #     # "suffix_vocab_size": 400,
    #     # ======================================================================
    #     # Model Evaluation Statistics (Line-Level)
    #     # ======================================================================
        
    #     # Overall Metrics:
    #     #   Accuracy:  0.9589
    #     #   Precision: 0.9981
    #     #   Recall:    0.9566
    #     #   F1 Score:  0.9678
    #     #   Loss:      0.1206
    #     #   Total Predictions: 1217
        
    #     # Per-Class Metrics:
    #     # Class                Accuracy   Precision  Recall     F1         Loss       Support   
    #     # --------------------------------------------------------------------------------
    #     # Misc-exposition      0.9566     0.9981     0.9566     0.9769     0.1241     1082      
    #     # Description          0.9746     0.9426     0.9746     0.9583     0.0983     118       
    #     # Nomenclature         1.0000     0.2931     1.0000     0.4533     0.0532     17        
        
    #     # Confusion Matrix:
    #     # True \ Pred      Misc-exposition  Description  Nomenclature
    #     # -----------------------------------------------------------
    #     # Misc-exposition  1035             7            40
    #     # Description      2                115          1
    #     # Nomenclature     0                0            17
    #     # ======================================================================
    #     "use_suffixes": True,
    #     "maxIter": 100,
    #     "regParam": 0.01,
    #     "line_level": False,
    #     "class_weights": {
    #         "Nomenclature": 250.0,
    #         "Description": 20.0,
    #         "Misc-exposition": 8.0
    #     },
    #     "word_vocab_size": 3600,
    #     "suffix_vocab_size": 400,
    #     # ======================================================================
    #     # Model Evaluation Statistics (Line-Level)
    #     # ======================================================================
        
    #     # Overall Metrics:
    #     #   Accuracy:  0.9622
    #     #   Precision: 0.9933
    #     #   Recall:    0.9649
    #     #   F1 Score:  0.9688
    #     #   Loss:      0.0932
    #     #   Total Predictions: 1217
        
    #     # Per-Class Metrics:
    #     # Class                Accuracy   Precision  Recall     F1         Loss       Support   
    #     # --------------------------------------------------------------------------------
    #     # Misc-exposition      0.9649     0.9933     0.9649     0.9789     0.0886     1082      
    #     # Description          0.9322     0.9565     0.9322     0.9442     0.1359     118       
    #     # Nomenclature         1.0000     0.3333     1.0000     0.5000     0.0914     17        
        
    #     # Confusion Matrix:
    #     # True \ Pred      Misc-exposition  Description  Nomenclature
    #     # -----------------------------------------------------------
    #     # Misc-exposition  1044             5            33
    #     # Description      7                110          1
    #     # Nomenclature     0                0            17
    #     # ======================================================================

    # },
    
    "logistic_sections": {
        "name": "Logistic Regression (sections, words + suffixes + sections)",
        "model_type": "logistic",
        "verbosity": 2,

        # "input_source": "couchdb",
        # "couchdb_training_database": training_db_name,
        # "use_suffixes": True,
        # "maxIter": 100,
        # "regParam": 0.01,
        # "extraction_mode": "section",
        # "class_weights": {
        #     "Nomenclature": 250.0,
        #     "Description": 20.0,
        #     "Misc-exposition": 10.0
        # },
        # "word_vocab_size": 3600,
        # "suffix_vocab_size": 400,
        # "section_name_vocab_size": 50,
        # ======================================================================
        # Model Evaluation Statistics (Line-Level)
        # ======================================================================
        
        # Overall Metrics:
        #   Accuracy:  0.9652
        #   Precision: 0.9921
        #   Recall:    0.9671
        #   F1 Score:  0.9667
        #   Loss:      0.1122
        #   Total Predictions: 2273
        
        # Per-Class Metrics:
        # Class                Accuracy   Precision  Recall     F1         Loss       Support   
        # --------------------------------------------------------------------------------
        # Misc-exposition      0.9671     0.9921     0.9671     0.9794     0.1132     1943      
        # Nomenclature         1.0000     0.7551     1.0000     0.8605     0.0325     185       
        # Description          0.8966     0.9701     0.8966     0.9319     0.2014     145       
        
        # Confusion Matrix:
        # True \ Pred      Misc-exposition  Nomenclature  Description
        # -----------------------------------------------------------
        # Misc-exposition  1879             60            4
        # Nomenclature     0                185           0
        # Description      15               0             130
        # ======================================================================

        "input_source": "couchdb",
        "couchdb_training_database": training_db_name,
        "use_suffixes": True,
        "maxIter": 100,
        "regParam": 0.01,
        "extraction_mode": "section",
        "class_weights": {
            "Nomenclature": 250.0,
            "Description": 20.0,
            "Misc-exposition": 20.0
        },
        "word_vocab_size": 3600,
        "suffix_vocab_size": 400,
        "section_name_vocab_size": 50,

    }

}

for model_name, model_config in model_configs.items():

    classifier_model_name = f"skol:classifier:model:{model_name}_{model_version}"

    if create_classifier or not redis_client.exists(classifier_model_name):
        if annotated_path.exists():
            annotated_files = get_file_list(str(annotated_path), pattern="**/*.ann")
    
            if len(annotated_files) > 0:
                print(f"Found {len(annotated_files)} annotated files")
                spark = make_spark_session()
    
                # Train using SkolClassifierV2 with unified API
                print("Training classifier with SkolClassifierV2...")
                classifier = SkolClassifierV2(
                    spark=spark,
    
                    # Model I/O
                    auto_load_model=False,  # Fit a new model.
                    model_storage='redis',
                    redis_client=redis_client,
                    redis_key=classifier_model_name,
                    redis_expire=classifier_model_expire,
    
    
                    # Output options
                    output_dest='couchdb',
                    couchdb_url=couchdb_url,
                    couchdb_database=ingest_db_name,
                    couchdb_username=couchdb_username,
                    couchdb_password=couchdb_password,
                    output_couchdb_suffix='.ann',
                    
                    # Model and preprocessing options
                    **model_config
                )
    
                # Train the model
                results = classifier.fit()
    
                print("Training complete!")
                print(f"  Accuracy: {results.get('accuracy', 0):.4f}")
                print(f"  F1 Score: {results.get('f1_score', 0):.4f}")
    
                classifier.save_model()
                print(f"✓ Model saved to Redis with key: {classifier_model_name}")
            else:
                print(f"No annotated files found in {annotated_path}")
        else:
            print(f"Directory does not exist: {annotated_path}")
            print("Please ensure annotated training data is available.")
    else:
        print(f"Skipping generation of model {classifier_model_name}.")

Found 190 annotated files
Training classifier with SkolClassifierV2...
[Classifier] Loading training data from database: skol_training


  __version__ = __import__('pkg_resources').get_distribution('CouchDB').version
                                                                                


[LogisticRegressionSkolModel] Using class weights:
  Nomenclature         250.00
  Description           20.00
  Misc-exposition       20.00



                                                                                

[Classifier Fit] Model training completed, starting evaluation
[Classifier Fit] Stored label mappings for 3 labels
[Classifier Fit] Splitting data for evaluation (80/20)
[Classifier Fit] Grouping by document column: doc_id
[Classifier Fit] Total unique documents: 190
[Classifier Fit] Counting split data...
[Classifier Fit]   Train data count: 75606
[Classifier Fit]   Test data count: 2273
[Classifier Fit] Making predictions on test set
[Classifier Fit] Predictions completed, validating output
[Classifier Fit] Predictions count: 2273
[Classifier Fit] Predictions columns: ['doc_id', 'human_url', 'attachment_name', 'label', 'value', 'section_name', 'words', 'word_tf', 'word_idf', 'suffixes', 'suffix_tf', 'suffix_idf', 'section_name_filled', 'section_tokens', 'section_tf', 'section_idf', 'combined_idf', 'label_indexed', 'rawPrediction', 'probability', 'prediction', 'probabilities']
[Classifier Fit] First prediction:
  Row(doc_id='069c6e9255f2516d4a7eb9f38123242c', human_url='https://mykowe

In [21]:
# INSERT COMBINER HERE

## Extract the taxa names and descriptions

The classifier labels taxon names and descriptions in articles, issues, and books. The YEDDA-annotated texts are written back to CouchDB as `.ann` attachments.
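As a rough illustration of what the annotated attachments contain, here is a minimal sketch of a parser for YEDDA-style blocks. It assumes each block has the shape `[@ text #Label*]`: the `[@ ` prefix is visible in the notebook output, while the closing `#Label*]` follows the usual YEDDA convention and is an assumption here. The names `YEDDA_BLOCK` and `parse_yedda` are hypothetical, and the Description sentence in the sample is made up.

```python
import re

# Assumed block shape: "[@ <text> #<Label>*]".
YEDDA_BLOCK = re.compile(
    r"\[@\s*(?P<text>.*?)\s*#(?P<label>[A-Za-z-]+)\*\]", re.DOTALL
)


def parse_yedda(annotated: str) -> list[tuple[str, str]]:
    """Return (label, text) pairs for every annotated block, in document order."""
    return [(m.group("label"), m.group("text")) for m in YEDDA_BLOCK.finditer(annotated)]


sample = (
    "[@ Caloplaca brachyspora Mereschk., Lich. Ross. Exs. (1913) #Nomenclature*]\n"
    "[@ Thallus crustose, thin, grey. #Description*]\n"  # invented sample text
    "[@ MYCOTAXON #Misc-exposition*]\n"
)

for label, text in parse_yedda(sample):
    print(f"{label:16s} {text}")
```

The non-greedy match lets a block contain a literal `#` (for example a herbarium number) as long as it is not followed by a label and `*]`.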

In [22]:
import time

# Predict from CouchDB and save back to CouchDB using SkolClassifierV2
if add_annotations:
    print("Initializing classifier with unified V2 API...")

    model_type = "logistic_sections"
    classifier_model_name = f"skol:classifier:model:{model_type}_{model_version}"
    
    model_config2 = model_configs['logistic_sections'].copy()
    model_config2.update({
        "num_workers": cores,
        "prediction_batch_size": 96,
        "verbosity": 1,
    })
    
    spark.stop()
    spark = make_spark_session()
    
    classifier = SkolClassifierV2(
        spark=spark,
        couchdb_url=couchdb_url,
        couchdb_database=ingest_db_name,
        couchdb_username=couchdb_username,
        couchdb_password=couchdb_password,
        couchdb_pattern='*.txt',
        output_dest='couchdb',
        output_couchdb_suffix='.ann',
        model_storage='redis',
        redis_client=redis_client,
        redis_key=classifier_model_name,
        auto_load_model=True,
        coalesce_labels=True,
        output_format='annotated',
        **model_config2
    )
    
    print(f"Model loaded from Redis: {classifier_model_name}")
    
    # Load, predict, and save in a streamlined workflow
    print("\nLoading and classifying documents from CouchDB...")
    raw_df = classifier.load_raw()
    print(f"Loaded {raw_df.count()} text documents")
    raw_df.show(10)
    
    print("\nMaking predictions...")
    predictions = classifier.predict(raw_df)
    
    # Show sample predictions
    print("\nSample predictions:")
    predictions.select(
        "doc_id", "line_number", "attachment_name", "predicted_label", "value"
    ).show(5, truncate=50)
    
    # Save results back to CouchDB
    print("\nSaving predictions back to CouchDB...")
    classifier.save_annotated(predictions)
    
    print("\n✓ Predictions saved to CouchDB as .ann attachments")
else:
    print("\n Skipping annotations.")

Initializing classifier with unified V2 API...
Model loaded from Redis: skol:classifier:model:logistic_sections_v2.0

Loading and classifying documents from CouchDB...
[Classifier] Auto-discovering PDF documents in database: skol_dev
[Classifier] Found 2099 documents with PDF attachments
[Classifier] Database: skol_dev
MuPDF error: format error: non-page object in page tree

MuPDF error: format error: object is not a stream

MuPDF error: format error: object is not a stream

MuPDF error: syntax error: no XObject subtype specified


                                                                                

[Classifier] Total sections: 273377


                                                                                

Loaded 273377 text documents
+--------------------+--------------------+---------------+----------------+-----------+-----------+---------------------+------------+
|               value|              doc_id|attachment_name|paragraph_number|line_number|page_number|empirical_page_number|section_name|
+--------------------+--------------------+---------------+----------------+-----------+-----------+---------------------+------------+
|           MYCOTAXON|0020c88329ed456a9...|    article.pdf|               1|          3|          1|                  111|        NULL|
|Volume 111, pp. 2...|0020c88329ed456a9...|    article.pdf|               2|          4|          1|                  111|        NULL|
|László Lőkös3, Ol...|0020c88329ed456a9...|    article.pdf|               3|          9|          1|                  111|        NULL|
|j.vondrak@seznam....|0020c88329ed456a9...|    article.pdf|               4|         10|          1|                  111|        NULL|
|Branišovská 31, C.



+--------------------------------+-----------+---------------+---------------+--------------------------------------------------+
|                          doc_id|line_number|attachment_name|predicted_label|                                             value|
+--------------------------------+-----------+---------------+---------------+--------------------------------------------------+
|0020c88329ed456a95a18e0c219269f4|          3|    article.pdf|Misc-exposition|                                         MYCOTAXON|
|0020c88329ed456a95a18e0c219269f4|          4|    article.pdf|Misc-exposition|Volume 111, pp. 241–250 January–March 2010 The ...|
|0020c88329ed456a95a18e0c219269f4|          9|    article.pdf|Misc-exposition|                    László Lőkös3, Olga Merkulova4|
|0020c88329ed456a95a18e0c219269f4|         10|    article.pdf|Misc-exposition|j.vondrak@seznam.cz 1Department of Botany, Facu...|
|0020c88329ed456a95a18e0c219269f4|         12|    article.pdf|Misc-exposition|Branišovská 



✓ Predictions saved to CouchDB as .ann attachments


In [23]:
if add_annotations:
    predictions.select("predicted_label", "annotated_value").where('predicted_label = "Nomenclature"').show()
    predictions.groupBy("predicted_label").count().orderBy("count").show()



+---------------+--------------------+
|predicted_label|     annotated_value|
+---------------+--------------------+
|   Nomenclature|[@ s.l., Candelar...|
|   Nomenclature|[@ Pseudosperma a...|
|   Nomenclature|[@ Serendipita sa...|
|   Nomenclature|[@ Fig. 7. Charle...|
|   Nomenclature|[@ Fig. 19. Morde...|
|   Nomenclature|[@ Fig. 22. Lucie...|
|   Nomenclature|[@ Todzia CA. 198...|
|   Nomenclature|[@ Hymenochaete c...|
|   Nomenclature|[@ Kretzschmaria ...|
|   Nomenclature|[@ Holotype: Keny...|
|   Nomenclature|[@ Abstract — Ell...|
|   Nomenclature|[@ Aspicilia pycn...|
|   Nomenclature|[@ Cylindrochytri...|
|   Nomenclature|[@ CI = 0.6183, R...|
|   Nomenclature|[@ (Oehl) Colombi...|
|   Nomenclature|[@ Holotype OSC #...|
|   Nomenclature|[@ Holotype OSC #...|
|   Nomenclature|[@ Holotype Z+ZT,...|
|   Nomenclature|[@ 94 ... Oehl & ...|
|   Nomenclature|[@ Glomoid specie...|
+---------------+--------------------+
only showing top 20 rows




+---------------+------+
|predicted_label| count|
+---------------+------+
|   Nomenclature|  4755|
|    Description|  7416|
|Misc-exposition|261206|
+---------------+------+



                                                                                

Here we estimate how many Taxon structures we should expect to find. The abbreviation "nov." ("novum", as in "gen. nov.") marks a taxon that is new in the current article. The number of Nomenclature lines containing "nov." should be a lower bound, because it is not unusual to redescribe an existing species, e.g. in a survey article or a monograph on a genus.

In [24]:
if add_annotations:
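    # A bare expression inside an `if` block is not displayed by Jupyter, so print the count.
    nov_count = predictions.filter(col("annotated_value").contains("nov.")).where("predicted_label = 'Nomenclature'").count()
    print(f"Nomenclature lines containing 'nov.': {nov_count}")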

                                                                                

## Build the Taxon objects and store them in CouchDB
We use CouchDB to store a full record for each taxon. We copy all metadata to the taxon records.

In [25]:
from couchdb_file import CouchDBFile as CDBF, read_couchdb_partition, read_couchdb_rows
class CouchDBFile(CDBF):
    """
    File-like object that reads from CouchDB attachment content.

    This class extends FileObject to support reading text from CouchDB
    attachments while preserving database metadata (doc_id, attachment_name,
    and database name).
    """

# Module-level functions for reading CouchDB data


## Build Taxon objects

Here we extract the Taxon objects from the annotated attachments. We look for Nomenclature lines and the Description lines that follow them, and merge each such group into a single Taxon object.
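The merging step can be sketched in plain Python. This is not the `TaxonExtractor` implementation; it is a minimal sketch under assumed rules: consecutive Nomenclature blocks belong to the same taxon, a Nomenclature block that arrives after a Description starts a new taxon, Misc-exposition blocks are skipped, and a taxon is kept only if it has both parts. `TaxonSketch` and `group_taxa` are hypothetical names, and the Description sentences are shortened.

```python
from dataclasses import dataclass, field


@dataclass
class TaxonSketch:
    """Hypothetical stand-in for the notebook's Taxon class."""
    nomenclature: list[str] = field(default_factory=list)
    description: list[str] = field(default_factory=list)


def group_taxa(blocks: list[tuple[str, str]]) -> list[TaxonSketch]:
    """Pair each run of Nomenclature blocks with the Description blocks that follow."""
    taxa: list[TaxonSketch] = []
    current = TaxonSketch()
    for label, text in blocks:
        if label == "Nomenclature":
            if current.description:  # the previous taxon is complete
                taxa.append(current)
                current = TaxonSketch()
            current.nomenclature.append(text)
        elif label == "Description" and current.nomenclature:
            current.description.append(text)
        # Misc-exposition and orphaned Description blocks are ignored.
    if current.nomenclature and current.description:
        taxa.append(current)
    return taxa


blocks = [
    ("Misc-exposition", "MYCOTAXON"),
    ("Nomenclature", "Mattirolia Berl. & Bres."),
    ("Nomenclature", "= Thyronectroidea Seaver"),
    ("Description", "Stroma variable, usually present."),
    ("Misc-exposition", "Specimens examined."),
    ("Description", "Perithecia globose."),
    ("Nomenclature", "Caloplaca brachyspora Mereschk."),  # no description follows
]
taxa = group_taxa(blocks)
print(len(taxa), taxa[0].nomenclature)
```

Under these rules a name with no following description is dropped, which is one reason the number of extracted taxa can be lower than the number of Nomenclature labels.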

In [26]:
ingest_couchdb_url = couchdb_url
ingest_username = couchdb_username
ingest_password = couchdb_password
taxon_couchdb_url = couchdb_url
taxon_username = couchdb_username
taxon_password = couchdb_password
pattern = '*.txt.ann'

In [27]:
from extract_taxa_to_couchdb import (
    TaxonExtractor as TE,
    generate_taxon_doc_id,
    extract_taxa_from_partition,
    convert_taxa_to_rows
)

class TaxonExtractor(TE):
    pass

In [28]:
# Create TaxonExtractor instance with database configuration
spark.stop()
spark = make_spark_session()

extractor = TaxonExtractor(
    spark=spark,
    ingest_couchdb_url=ingest_couchdb_url,
    ingest_db_name=ingest_db_name,
    taxon_db_name=taxon_db_name,
    ingest_username=ingest_username,
    ingest_password=ingest_password,
    taxon_username=taxon_username,
    taxon_password=taxon_password
)

print("TaxonExtractor initialized")
print(f"  Ingest DB: {ingest_db_name}")
print(f"  Taxon DB:  {taxon_db_name}")

TaxonExtractor initialized
  Ingest DB: skol_dev
  Taxon DB:  skol_taxa_dev


In [29]:
# Step 1: Load annotated documents
if build_taxon:
    print("\nStep 1: Loading annotated documents from CouchDB...")
    annotated_df = extractor.load_annotated_documents(pattern=pattern)
    print(f"Loaded {annotated_df.count()} annotated documents")
    annotated_df.show(5, truncate=False)


Step 1: Loading annotated documents from CouchDB...


                                                                                

Loaded 2099 annotated documents
+--------------------------------+------------------------------------------------------------------------------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------



In [30]:
if build_taxon:
    # Step 2: Extract taxa to DataFrame
    print("\nStep 2: Extracting taxa from annotated documents...")
    taxa_df = extractor.extract_taxa(annotated_df)
    print(f"Extracted {taxa_df.count()} taxa")
    taxa_df.printSchema()
    taxa_df.show(10, truncate=False)

Exception ignored in: <_io.BufferedWriter name=5>
Traceback (most recent call last):
  File "/data/piggy/miniconda3/envs/skol/lib/python3.13/site-packages/pyspark/python/lib/pyspark.zip/pyspark/daemon.py", line 193, in manager
BrokenPipeError: [Errno 32] Broken pipe



Step 2: Extracting taxa from annotated documents...
[TaxonExtractor] Input DataFrame columns: ['doc_id', 'human_url', 'attachment_name', 'value']
[TaxonExtractor] Filtered DataFrame columns: ['doc_id', 'value', 'attachment_name']
[TaxonExtractor] Filtered DataFrame schema:
root
 |-- doc_id: string (nullable = false)
 |-- value: string (nullable = false)
 |-- attachment_name: string (nullable = false)



[Stage 4:>                                                          (0 + 2) / 2]
=== parse_annotated Label Summary ===
Total labels counted: 46616

Label distribution:
  Misc-exposition            23690 ( 50.8%)
  Description                17481 ( 37.5%)
  Nomenclature                5444 ( 11.7%)
  None                           1 (  0.0%)
=== parse_annotated Label Summary ===
Total labels counted: 46347

Label distribution:
  Misc-exposition            23583 ( 50.9%)
  Description                17476 ( 37.7%)
  Nomenclature                5287 ( 11.4%)
  None                           1 (  0.0%)
                                                                                

Extracted 5239 taxa
root
 |-- taxon: string (nullable = false)
 |-- description: string (nullable = false)
 |-- source: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- line_number: integer (nullable = true)
 |-- paragraph_number: integer (nullable = true)
 |-- page_number: integer (nullable = true)
 |-- empirical_page_number: string (nullable = true)
 |-- _id: string (nullable = true)
 |-- json_annotated: string (nullable = true)



[Stage 7:>                                                          (0 + 1) / 1]

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------+----------------+-----------+----------------


                                                                                

In [31]:
if build_taxon:
    # Step 3: Inspect actual Taxon objects from the RDD (optional debugging)
    print("\n=== Sample Taxon Objects ===")
    taxa_rdd = annotated_df.rdd.mapPartitions(
        lambda partition: extract_taxa_from_partition(iter(partition), ingest_db_name)  # type: ignore[reportUnknownArgumentType]
    )
    for i, taxon in enumerate(taxa_rdd.take(3)):
        print(f"\nTaxon {i+1}:")
        print(f"  Type: {type(taxon)}")
        print(f"  Has nomenclature: {taxon.has_nomenclature()}")
        taxon_row = taxon.as_row()
        print(f"  Taxon name: {taxon_row['taxon'][:80]}...")
        print(f"  Source: {taxon_row['source']}")

Exception ignored in: <_io.BufferedWriter name=5>
Traceback (most recent call last):
  File "/data/piggy/miniconda3/envs/skol/lib/python3.13/site-packages/pyspark/python/lib/pyspark.zip/pyspark/daemon.py", line 193, in manager
BrokenPipeError: [Errno 32] Broken pipe



=== Sample Taxon Objects ===


[Stage 8:>                                                          (0 + 1) / 1]


Taxon 1:
  Type: <class 'taxon.Taxon'>
  Has nomenclature: True
  Taxon name:  2. Caloplaca brachyspora Mereschk., Lich. Ross. Exs., fasc. 22, no. 276 (1913)
...
  Source: {'doc_id': '0020c88329ed456a95a18e0c219269f4', 'human_url': 'https://www.ingentaconnect.com/content/mtax/mt/2010/00000111/00000001/art00033', 'db_name': 'skol_dev'}

Taxon 2:
  Type: <class 'taxon.Taxon'>
  Has nomenclature: True
  Taxon name:  5. Caloplaca gyalolechiiformis Szatala, Ann. Hist.-Nat. Mus. Natl. Hungarici, s...
  Source: {'doc_id': '0020c88329ed456a95a18e0c219269f4', 'human_url': 'https://www.ingentaconnect.com/content/mtax/mt/2010/00000111/00000001/art00033', 'db_name': 'skol_dev'}

Taxon 3:
  Type: <class 'taxon.Taxon'>
  Has nomenclature: True
  Taxon name:  7. Caloplaca lactea var. subimmersa Szatala, Ann. Hist.-Nat. Mus. Natl. Hungari...
  Source: {'doc_id': '0020c88329ed456a95a18e0c219269f4', 'human_url': 'https://www.ingentaconnect.com/content/mtax/mt/2010/00000111/00000001/art00033', 'db_name'


                                                                                

In [32]:
if build_taxon:
    # Step 4: Save taxa to CouchDB
    print("\nStep 4: Saving taxa to CouchDB...")
    results_df = extractor.save_taxa(taxa_df)
    
    # Show detailed results
    results_df.groupBy("success").count().show(truncate=False)
    
    # If there are failures, show error messages
    print("\nError messages:")
    results_df.filter("success = false").select("error_message").distinct().show(truncate=False)


Step 4: Saving taxa to CouchDB...


[Stage 9:>                                                          (0 + 2) / 2]
                                                                        

+-------+-----+
|success|count|
+-------+-----+
|true   |5239 |
+-------+-----+


Error messages:


[Stage 12:>                                                         (0 + 2) / 2]


+-------------+
|error_message|
+-------------+
+-------------+





In [33]:
# Alternative: Run the complete pipeline in one step
# Uncomment to use the simplified one-step approach:

# print("\nRunning complete pipeline...")
# results = extractor.run_pipeline(pattern='*.txt.ann')
#
# successful = results.filter("success = true").count()
# failed = results.filter("success = false").count()
#
# print(f"\nPipeline Results:")
# print(f"  Successful: {successful}")
# print(f"  Failed:     {failed}")
#
# results.groupBy("success").count().show(truncate=False)

### Observations on the classification models

The line-by-line classification model labels many Description lines as Misc-exposition, although it works reasonably well for Nomenclature.

The problem with the paragraph classification model is that the heuristic paragraph parser does not generalize well to more modern journals.

One possible approach to investigate is adding heuristics to the label-merging code to convert some Misc-exposition lines to Description if they are surrounded by Description paragraphs.
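A minimal sketch of such a heuristic, operating on a plain list of labels rather than on the notebook's DataFrames. The function name `smooth_labels` and the `max_gap` parameter are illustrative assumptions; the existing label-merging code may be organized quite differently.

```python
def smooth_labels(labels: list[str], max_gap: int = 2) -> list[str]:
    """Relabel short Misc-exposition runs that sit between Description lines.

    `max_gap` is the longest run (in lines) that will be relabeled.
    """
    smoothed = list(labels)
    i = 0
    while i < len(labels):
        if labels[i] != "Misc-exposition":
            i += 1
            continue
        # Find the end of this Misc-exposition run: labels[i:j].
        j = i
        while j < len(labels) and labels[j] == "Misc-exposition":
            j += 1
        # The run qualifies only if both neighbors exist and are Description.
        bounded = i > 0 and j < len(labels) and labels[i - 1] == labels[j] == "Description"
        if bounded and (j - i) <= max_gap:
            smoothed[i:j] = ["Description"] * (j - i)
        i = j
    return smoothed


labels = [
    "Nomenclature", "Description", "Misc-exposition", "Description",
    "Misc-exposition", "Misc-exposition", "Misc-exposition", "Description",
    "Misc-exposition",
]
print(smooth_labels(labels))
```

Keeping `max_gap` small limits the risk of absorbing genuine exposition, such as a specimen list, into a description.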

We put substantial effort into a model with some memory. Specifically, we use a bidirectional LSTM (a recurrent neural network) with a sliding window, which should be better at detecting context. Its computational demands tested our available compute platform. As of this writing the model performs very poorly: out of a million lines of text it labels only 73 lines as Description, 175223 as UNKNOWN_None, none as Nomenclature, and the rest as Misc-exposition. As expected, increasing the number of epochs increases training accuracy and decreases the loss, but, confusingly, the model performs identically on the test set no matter how many epochs are used.
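For readers unfamiliar with the windowing idea, here is a minimal sketch of how a sequence of lines can be cut into overlapping windows before being fed to a sequence model. It assumes the window is taken over consecutive lines and centered on the line being classified; the notebook's actual window size, stride, and padding are not stated, so `window`, `stride`, and `pad` are illustrative parameters.

```python
def sliding_windows(
    lines: list[str], window: int = 5, stride: int = 1, pad: str = ""
) -> list[list[str]]:
    """Return one window per `stride` lines, centered on the target line.

    The sequence is padded at both ends so that the first and last lines
    also get a full-width window.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a center line")
    half = window // 2
    padded = [pad] * half + lines + [pad] * half
    return [padded[i : i + window] for i in range(0, len(lines), stride)]


for w in sliding_windows(["a", "b", "c", "d"], window=3):
    print(w)
```

Each window gives the model the lines on both sides of the target, which is the context that a line-by-line classifier lacks.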

It may become necessary to hand-annotate some of the more modern journals.

## Dr. Drafts document embedding

Dr. Drafts is the framework we use to embed all the descriptions into a searchable space. SBERT is a model that can embed sentences into a semantic space such that sentences with similar meaning are near each other. The data structure that we build here is central to the eventual function of the SKOL web site.

Dr. Drafts loads taxon documents from CouchDB and builds an embedding, which it saves to Redis.
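To make "near each other" concrete, the sketch below ranks stored vectors by cosine similarity to a query vector. It is not the Dr. Drafts search code: the three-dimensional vectors and the `taxon_a`-style keys are invented for illustration (real SBERT vectors are much longer), and `cosine` and `nearest` are hypothetical helper names.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(query: list[float], table: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the keys of the k vectors most similar to the query, best first."""
    scored = [(cosine(query, vec), name) for name, vec in table.items()]
    return [name for _, name in sorted(scored, reverse=True)[:k]]


# Toy embeddings, invented for illustration only.
embeddings = {
    "taxon_a": [0.9, 0.1, 0.0],
    "taxon_b": [0.8, 0.2, 0.1],
    "taxon_c": [0.0, 0.1, 0.9],
}
print(nearest([1.0, 0.0, 0.0], embeddings))
```

In the real pipeline the query vector would come from embedding a user's free-text description with the same model that embedded the taxa.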

In [34]:
from dr_drafts_mycosearch.data import SKOL_TAXA as STX

class SKOL_TAXA(STX):
    """Data interface for Synoptic Key of Life Taxa in CouchDB."""




In [35]:
import pickle

import redis
from dr_drafts_mycosearch.compute_embeddings import EmbeddingsComputer as EC
class EmbeddingsComputer(EC):
    """Class for computing and storing embeddings from narrative data."""
    
    def write_embeddings_to_redis(self):
        """Write embeddings to Redis using instance configuration."""
        if self.redis_username and self.redis_password:
            r = redis.from_url(self.redis_url, username=self.redis_username, password=self.redis_password, db=self.redis_db)
        else:
            r = redis.from_url(self.redis_url, db=self.redis_db)

        pickled_data = pickle.dumps(self.result)
        r.set(self.embedding_name, pickled_data)
        if self.redist_expire is not None:
            r.expire(self.embedding_name, self.redist_expire)
        print(f'Embeddings written to Redis (db={self.redis_db}) with key: {self.embedding_name}')


## Compute Embeddings

We use SBERT to embed the taxa into a search space.

In [36]:
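# Use the same condition as the embedding cell so that `descriptions` is defined whenever it is needed.
if compute_embeddings or not redis_client.exists(embedding_name):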
    skol_taxa = SKOL_TAXA(
        couchdb_url="http://localhost:5984",
        username=couchdb_username,
        password=couchdb_password,
        db_name=taxon_db_name
    )
    descriptions = skol_taxa.get_descriptions()

DEBUG: doc: <Document 'taxon_000aaabf9fb6bcb8ff1c66e1ddeb8c30599c1283be703156d50e45bc8779df25'@'4-3f2799f3935df8c3bdd0789804d3ee19' {'taxon': ' added Valsaria Ces. & De Not. and Valsonectria Speg., both with uniseptate\n\n\n Mattirolia Berl. & Bres., Annuario Soc. Alpin. Trident. 14: 351 (1889)\n\n\n = Thyronectroidea Seaver, Mycologia 1: 206 (1909)\n\n\n = Balzania Speg., Anales Mus. Nac. Buenos Aires 6: 286 (1898)\n\n', 'description': ' Stroma variable, usually present, erumpent, covered with loosely interwoven\nyellowish or brownish hyphae, KOH negative. Perithecia globose, semiimmersed or isolated in the stroma. Paraphyses abundant. Asci unitunicate,\ncylindrical or clavate, non amyloid. Ascospores smooth, muriform, hyaline to\n\n\n Stroma subcortical, pulvinate, 0.5–6 mm diam. Perithecia aggregated,\nimmersed in the stroma or more rarely isolated, globose, 400–450 µm diam.,\nblack, surrounded by a yellowish tomentum, 30–50 µm thick, formed by\nhyphae 4–5 µm diam. Peridium pseudopa



DEBUG: doc: <Document 'taxon_d36e80539de4b2c8543f7cd3076ed05d7db42fd92c0a59c54e2e0f17a23dc1be'@'4-de699dbee6533d7a972536951596005b' {'taxon': ' Góry. Monogr. Bot. 83: 1–131.\n\n', 'description': ' Agaricus of North America is a sturdy hefty tome with stiff hard covers\n\n\n Thick, dense, but compact, the book is held easily in one hand. Its sturdy\n\n', 'source': {'db_name': 'skol_dev', 'doc_id': 'ee7f2de0964f4ea7af4d513dfa866356', 'human_url': 'skol_dev/ee7f2de0964f4ea7af4d513dfa866356/article.txt.ann'}, 'line_number': 518, 'paragraph_number': 41598, 'page_number': 1, 'empirical_page_number': '4666', 'json_annotated': None}>
DEBUG: doc: <Document 'taxon_d38875bb734920512d624acd2cbc72ad6683a7543bb6521cc3282b479abeb049'@'4-698e15b0e5e43b442b137c08958f4d5a' {'taxon': ' Stud. Mycol. 13: 46. 1976.\nSynonyms: Chaetopsis cylindrospora (W. Gams & Hol.-Jech.)\n\n\n 45. 1976. (Nom. illeg. Art. 53.1.) [non Chaetosphaeria fusispora\n\n\n Gongromeriza Preuss, Linnaea 24: 106. 1851.\nSynonym: Ejner

In [37]:
if compute_embeddings or not redis_client.exists(embedding_name):

    embedder = EmbeddingsComputer(
        idir='/dev/null',
        redis_url='redis://localhost:6379',
        redis_expire=embedding_expire,
        embedding_name=embedding_name,
    )

    embedding_result = embedder.run(descriptions)

Using GPU: NVIDIA GeForce RTX 5090 Laptop GPU
GPU Memory: 25.15 GB


Batches:   0%|          | 0/82 [00:00<?, ?it/s]

DEBUG: Embeddings.run Resulting DataFrame columns: ['description', 'taxon', 'source', 'line_number', 'paragraph_number', 'page_number', 'empirical_page_number', 'F0', 'F1', 'F2', 'F3', 'F4', 'F5', 'F6', 'F7', 'F8', 'F9', 'F10', 'F11', 'F12', 'F13', 'F14', 'F15', 'F16', 'F17', 'F18', 'F19', 'F20', 'F21', 'F22', 'F23', 'F24', 'F25', 'F26', 'F27', 'F28', 'F29', 'F30', 'F31', 'F32', 'F33', 'F34', 'F35', 'F36', 'F37', 'F38', 'F39', 'F40', 'F41', 'F42', 'F43', 'F44', 'F45', 'F46', 'F47', 'F48', 'F49', 'F50', 'F51', 'F52', 'F53', 'F54', 'F55', 'F56', 'F57', 'F58', 'F59', 'F60', 'F61', 'F62', 'F63', 'F64', 'F65', 'F66', 'F67', 'F68', 'F69', 'F70', 'F71', 'F72', 'F73', 'F74', 'F75', 'F76', 'F77', 'F78', 'F79', 'F80', 'F81', 'F82', 'F83', 'F84', 'F85', 'F86', 'F87', 'F88', 'F89', 'F90', 'F91', 'F92', 'F93', 'F94', 'F95', 'F96', 'F97', 'F98', 'F99', 'F100', 'F101', 'F102', 'F103', 'F104', 'F105', 'F106', 'F107', 'F108', 'F109', 'F110', 'F111', 'F112', 'F113', 'F114', 'F115', 'F116', 'F117', 'F118

### Command-line tool

There is a command-line tool for dr-drafts-mycosearch, which reads the embeddings from a local file. Future work will add the ability to get the embeddings from Redis. Here is a sample interaction with the search tool:

![image.png](attachment:9525d4a5-0dcc-480a-b259-1077742d82ce.png)

## Compute JSON versions of all descriptions

There is an anticipated need for the details of each description to be available as a nested JSON structure, which can be used to build menus with features, subfeatures, and values.

Here is an example JSON representation of a technical description as generated by ChatGPT 4.9.

```JSON
{
    "mycelium": {
        "location": "on substrate",
        "color": "medium orange-brown"
    },
    "hyphae": {
        "septate": true,
        "diameter_um": {
            "min": 3,
            "max": 4 
        },
        "surface": "distinctly and coarsely rough",
        "orientation": "parallel to long axis of substrate tissue cells",
        "branching": {
            "location": "on exterior",
            "pattern": "non-patterned network"
        }
    },
    "sporulation_units": {
        "origin": "enlarging tips of short branches",
        "color_progression": [
            "concolorous with surface mycelium",
            "dark orange-brown",
            "black-brown and opaque"
        ],
        "attachment": {
            "occasional": "attached to hypha",
            "typical": "broken loose or obscured by crowding and opacity"
        }
    }
}
```
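One way such a structure could feed menus is to flatten it into (feature path, value) pairs and collect the values seen for each path. The sketch below does this for a shortened copy of the example above; `flatten` and `menu` are hypothetical names, and the eventual SKOL user interface may organize its menus differently.

```python
import json


def flatten(node, path=()):
    """Yield (path, value) pairs; path is a tuple of feature and subfeature names."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, path + (key,))
    elif isinstance(node, list):
        # List items share the path of the list itself.
        for item in node:
            yield from flatten(item, path)
    else:
        yield path, node


doc = json.loads("""
{
    "mycelium": {"location": "on substrate", "color": "medium orange-brown"},
    "hyphae": {"septate": true, "diameter_um": {"min": 3, "max": 4}}
}
""")

# Collect the values observed for each feature path.
menu: dict[tuple, list] = {}
for path, value in flatten(doc):
    menu.setdefault(path, []).append(value)

for path, values in menu.items():
    print(" > ".join(path), "=", values)
```

Running the same collection over many taxa would give, for each feature path, the set of values a pull-down menu could offer.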

The TaxaJSONTranslator reads taxa from CouchDB and writes the annotated taxa back to CouchDB.

In [38]:
from taxa_json_translator import TaxaJSONTranslator as TJT

class TaxaJSONTranslator(TJT):
    """
    Translates taxa descriptions to structured JSON using a fine-tuned Mistral model.

    This class is optimized for processing PySpark DataFrames created by
    TaxonExtractor.load_taxa(), adding a new column with JSON-formatted features.
    """


In [39]:
spark.stop()
spark = make_spark_session()

translator = TaxaJSONTranslator(
    spark=spark,
    base_model_id="mistralai/Mistral-7B-Instruct-v0.3",
    max_length=2048,
    max_new_tokens=1024,
    device="cuda",
    load_in_4bit=True,
    use_auth_token=True,
    couchdb_url=couchdb_url,
    username=couchdb_username,
    password=couchdb_password
)

TaxaJSONTranslator initialized
  CouchDB URL: http://127.0.0.1:5984
  Base model: mistralai/Mistral-7B-Instruct-v0.3
  Checkpoint: None (using base model)
  Device: cuda
  4-bit quantization: True


### Run the Mistral model to generate JSON from each Taxon description

In [40]:
if generate_json:
    descriptions_df = translator.load_taxa(db_name=taxon_db_name).limit(100)

Loading 5239 taxa from skol_taxa_dev...



✓ Loaded 5239 taxa


                                                                                

In [41]:
if generate_json:
    descriptions_df.show()

+--------------------+--------------------+--------------------+--------------------+-----------+----------------+-----------+---------------------+
|                 _id|               taxon|         description|              source|line_number|paragraph_number|page_number|empirical_page_number|
+--------------------+--------------------+--------------------+--------------------+-----------+----------------+-----------+---------------------+
|taxon_000aaabf9fb...| added Valsaria C...| Stroma variable,...|{human_url -> sko...|         85|           37287|          1|                 4666|
|taxon_001de8a5aa8...| Geastrum corolli...| subglobose, ovoi...|{human_url -> sko...|        188|           20293|          1|                 4666|
|taxon_0024cb7ea83...|    Rick (1959).\n\n| Basidiomata whit...|{human_url -> sko...|        246|           43960|          1|                 4666|
|taxon_002d9147281...| Graphis caesioca...| Thallus corticol...|{human_url -> sko...|         83|         



In [42]:
if generate_json:
    json_annotated_df = translator.translate_descriptions_batch(
        taxa_df=descriptions_df,
        batch_size=10,
        description_col="description",
        output_col="json_annotated"
    )

Translating descriptions to JSON (batch mode)...
  Input column: description
  Output column: json_annotated
  Batch size: 10




Processing 100 descriptions...
  Batch 1/10
Loading tokenizer from mistralai/Mistral-7B-Instruct-v0.3...
✓ Tokenizer loaded
Loading base model from mistralai/Mistral-7B-Instruct-v0.3...





✓ Base model loaded
⚠ Using base model (no checkpoint provided)
  Batch 2/10
  Batch 3/10


Token indices sequence length is longer than the specified maximum sequence length for this model (2266 > 2048). Running this sequence through the model will result in indexing errors


  Batch 4/10
  Batch 5/10
  Batch 6/10
  Batch 7/10
  Batch 8/10
  Batch 9/10
  Batch 10/10
✓ Translated 100 descriptions


### Add the generated JSON as a field on the objects written by save_taxa

In [43]:
if generate_json:
    results_df = translator.save_taxa(json_annotated_df, db_name=json_taxon_db_name)
    
    results_df.groupBy("success").count().show(truncate=False)
    
    print("\nError messages:")
    results_df.filter("success = false").select("error_message").distinct().show(truncate=False)

Saving taxa to skol_taxa_full_dev...



✓ Save complete:
  Total: 100
  Successful: 100
  Failed: 0
+-------+-----+
|success|count|
+-------+-----+
|true   |100  |
+-------+-----+


Error messages:
+-------------+
|error_message|
+-------------+
+-------------+




Because the classifier is producing poor description fields, the Mistral model is producing empty JSON descriptions. This is an area for further debugging.
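
One way to start that debugging is to count how many generated annotations are actually empty, since every save is reported as successful. The sketch below is a hedged illustration: it assumes the `json_annotated` values are JSON strings (they may be stored differently), and the helper `is_empty_json` and the sample values are hypothetical.

```python
import json

def is_empty_json(raw):
    """True when raw is missing, unparseable, or parses to an empty container."""
    if raw is None:
        return True
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        # Not a string, or not valid JSON.
        return True
    if parsed is None:
        return True
    return isinstance(parsed, (dict, list, str)) and len(parsed) == 0

# Hypothetical sample of generated annotations.
samples = ['{}', '{"hyphae": {"septate": true}}', 'not json', None]
flags = [is_empty_json(s) for s in samples]
empty_count = sum(flags)
print(f"{empty_count} of {len(samples)} annotations are empty or invalid")
```

A check like this could be applied to a collected sample of rows to separate "model returned nothing" from "model returned invalid JSON" before looking at the classifier output itself.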

## Hierarchical clustering

We use agglomerative clustering to group the taxa into "clades" based on the cosine similarity of their SBERT embeddings. We then load them into Neo4j for exploration.
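
As a small illustration of the technique, the sketch below runs average-linkage agglomerative clustering with a cosine metric on four toy vectors using SciPy. It is not the `TaxonClusterer` implementation; the 2-dimensional vectors are hypothetical stand-ins for the real embeddings, and the variable names are chosen only for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

# Toy embeddings (hypothetical): rows 0 and 1 point one way, rows 2 and 3 another.
toy = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.2, 0.8],
])

# Each row of the linkage matrix records one merge:
# (cluster a, cluster b, distance, number of leaves in the new cluster).
toy_linkage = linkage(toy, method="average", metric="cosine")

# Convert the merge list into a tree of nodes for traversal.
toy_root = to_tree(toy_linkage)

print(toy_linkage.shape)         # one merge per row
print(toy_root.get_count())      # number of leaves under the root
print(sorted(toy_root.pre_order()))  # leaf ids reachable from the root
```

The two most similar vectors are merged first, and merging continues until a single root remains, which is what gives the tree its clade-like shape.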

In [44]:
from taxon_clusterer import TaxonClusterer as TC, ClusterNode
from scipy.cluster.hierarchy import linkage, to_tree

class TaxonClusterer(TC):
    pass


In [45]:
if run_clustering:
    clusterer = TaxonClusterer(
        redis_host="localhost",
        redis_port=6379,
        redis_db=0,
        neo4j_uri=neo4j_uri,
    )
    
    # Load embeddings from Redis
    (embeddings, taxon_names, metadata) = clusterer.load_embeddings(embedding_name)

TaxonClusterer initialized
  Redis: localhost:6379/0
  Neo4j: bolt://localhost:7687
Loading embeddings from Redis key: skol:embedding:v1.1
✓ Loaded 5239 taxa with 768-dimensional embeddings


Here is an excellent description. The taxon name is kind of terrible, but I can imagine how these lines qualified as a species name.

In [46]:
(taxon_names[0], metadata[0])

(' added Valsaria Ces. & De Not. and Valsonectria Speg., both with uniseptate\n\n\n Mattirolia Berl. & Bres., Annuario Soc. Alpin. Trident. 14: 351 (1889)\n\n\n = Thyronectroidea Seaver, Mycologia 1: 206 (1909)\n\n\n = Balzania Speg., Anales Mus. Nac. Buenos Aires 6: 286 (1898)\n\n',
 {'source_db_name': 'skol_dev',
  'source_doc_id': '6a99ebd4995242a4bf5a3465acf2bbdf',
  'source_human_url': 'skol_dev/6a99ebd4995242a4bf5a3465acf2bbdf/article.txt.ann',
  'line_number': 85,
  'paragraph_number': 37287,
  'page_number': 1,
  'empirical_page_number': '4666',
  'description': ' Stroma variable, usually present, erumpent, covered with loosely interwoven\nyellowish or brownish hyphae, KOH negative. Perithecia globose, semiimmersed or isolated in the stroma. Paraphyses abundant. Asci unitunicate,\ncylindrical or clavate, non amyloid. Ascospores smooth, muriform, hyaline to\n\n\n Stroma subcortical, pulvinate, 0.5–6 mm diam. Perithecia aggregated,\nimmersed in the stroma or more rarely isolated,

In [47]:
if run_clustering:
    # Perform clustering
    linkage_matrix = clusterer.cluster(method="average", metric="cosine")
    
    # Store in Neo4j with root named "Fungi"
    clusterer.store_in_neo4j(root_name="Fungi", clear_existing=True)
    
    print("✓ Clustering complete!")

Performing agglomerative clustering...
  Method: average
  Metric: cosine
✓ Clustering complete
  Tree depth: 249
  Total nodes: 10477
Storing tree in Neo4j...
  Root name: Fungi
  Clearing existing Taxon and Pseudoclade nodes...
✓ Tree stored in Neo4j
  Taxon nodes: 5239
  Pseudoclade nodes: 5238
  PARENT_OF relationships: 10476
✓ Clustering complete!
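
The node and relationship counts above are consistent with a strictly binary merge tree. The sketch below checks the arithmetic, assuming each merge creates exactly one Pseudoclade with two children and each non-root node has one PARENT_OF relationship pointing to it; the function name `binary_tree_counts` is hypothetical.

```python
def binary_tree_counts(n_leaves):
    """Counts for a binary tree built by merging n_leaves clusters pairwise."""
    internal = n_leaves - 1        # one new node per merge
    total = n_leaves + internal    # leaves plus internal nodes
    edges = total - 1              # every node except the root has a parent
    return internal, total, edges

internal, total, edges = binary_tree_counts(5239)
print(internal, total, edges)  # 5238 10477 10476
```

With 5239 taxa this gives 5238 pseudoclades, 10477 nodes, and 10476 relationships, matching the reported output.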


### 3 related leaves

To demonstrate the effectiveness of clustering, here is a 3-leaf subtree (pseudoclade) with identical descriptions. If the classifier were working properly, there'd be a lot more description. The species names turned out well. The fragment of description they share shows that these organisms are characterized by features present when cultured on agar.

![Teunia_quercus.png](attachment:f22975f5-1e39-4c8a-ab58-5899888d507c.png)
![Vankyiozyma_motuoensis.png](attachment:511cc441-f96c-4729-88fe-af72bf77f87b.png)
![Exobasidium_lijiangense.png](attachment:dcb04dac-77d5-460a-ba0a-adece691a69e.png)

## Bibliography

* Anderson, J. Chris, Jan Lehnardt, and Noah Slater, 2010, "CouchDB: The Definitive Guide", O'Reilly Media.
* Balasi, Jen, Chris Murphy, Shintaro Osuga, La Monte Yarroll: “Synoptic Key of Life” PowerPoint presentation for IST 664, 2024, https://github.com/piggyatbaqaqi/skol/blob/master/docs/SKOL_presentation3.pptx, retrieved Feb 4, 2025.
* Caspers, Dave, Chris Murphy, La Monte H.P. Yarroll, "Synoptic Key of Life II", https://docs.google.com/document/d/1SiuWlLmD_R4i8SNVr7WkTtjBwOdKDJqB/edit?usp=sharing&ouid=112079383051265085039&rtpof=true&sd=true, retrieved December 14, 2025.
* DOI Foundation, "DOI Citation Formatter HTTP API", https://citation.doi.org/api-docs.html, accessed 2025-11-12.
* Eddelbuettel, D. (2022). A Brief Introduction to Redis (arXiv:2203.06559). arXiv. https://arxiv.org/abs/2203.06559
* Terrill, Gavin, (June 5, 2008). "Neo4j - an Embedded, Network Database". InfoQ. C4Media Inc. http://www.infoq.com/news/2008/06/neo4j Retrieved 2010-02-17.
* Gisolfi, Dr. Nick, "Dr Draft's state-of-the-art (SOTA) Literature Search", https://github.com/autonlab/dr-drafts-sota-literature-search, retrieved December 14, 2025.
* Hennebert, G.L., Richard P. Korf eds., Mycotaxon: A New Journal on Taxonomy and Nomenclature of Fungi and Lichens, Ithaca, NY., 1974-2010.
* Le, Patrick, Padmaja Kurumaddali, La Monte H.P. Yarroll. "Synoptic Key of Life: Feature Extraction for Fungal Taxonomy", https://docs.google.com/document/d/1m_Ja1YnyoA8zcAwI2XnKJOWBzgp8DCIX/edit?usp=sharing&ouid=112079383051265085039&rtpof=true&sd=true, retrieved December 14, 2025.
* Murrill, William A., Mycologia, New York Botanical Garden, Mycological Society of America, 1909-1961.
* Nauta, M.M., M.E. Noordeloos, eds., Persoonia: A Mycological Journal, Riksherbarium, Leiden, The Netherlands, 1959-1998.
* Turner et al., (2025). PyPaperRetriever: A Python Tool for Finding and Downloading Scientific Literature. Journal of Open Source Software, 10(113), 8135, https://doi.org/10.21105/joss.08135
* Yang, Jie and Zhang, Yue and Li, Linwei and Li, Xingxuan, 2018, "YEDDA: A Lightweight Collaborative Text Span Annotation Tool", Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, http://aclweb.org/anthology/P18-4006


## Source Code

The source code is in two projects on GitHub: [SKOL](https://github.com/piggyatbaqaqi/skol) and [Dr Drafts Mycosearch](https://github.com/piggyatbaqaqi/dr-drafts-mycosearch).

The eventual home for SKOL is https://synoptickeyof.life.

## Appendix: On the use of an AI Coder

Portions of this work were completed with the aid of Claude Code Pro. I wish to give a clarifying example of how I've used this very powerful tool, and reveal why I am comfortable with claiming authorship of the resulting code.

For this project I needed results from an earlier class project in which a trio of students built and evaluated models for classifying paragraphs. The earlier work was built as an IPython Notebook, with many examples and inline code. Just copying the earlier notebook would have introduced many irrelevant details and would not have furthered the overall project.

I asked Claude Code to translate the notebook into a module that I could import. It did a pretty good job. Without being told, it made a submodule, extracted the illustrative code as examples, wrote reasonable documentation, and created packaging for the module.

The skill level of the coding was roughly that of a highly disciplined average junior programmer. The architecture was simplistic and violated several design constraints such as DRY. I requested specific refactorings, such as asking for a group of functions to be converted into an object that shared duplicated parameters.

The initial code used REST interfaces directly and read all the data onto a single machine, so it was not using PySpark correctly. Through a series of refactorings, I asked that the code use appropriate libraries that I named, and create correct UDFs to execute transformations in parallel.

I walked the AI through creating an object that I could use to illustrate my use of the Redis and CouchDB interfaces, while leaving the irrelevant details in a separate library.

In short, I still have to understand good design principles. I have to be able to recognize where appropriate libraries are applicable. I still have to understand the frameworks I am working with.

I now have a strong understanding of the difference between "vibe coding" and AI-assisted software engineering. In my first 4 hours with Claude Code, I was able to produce roughly 4 days' worth of professional-grade working code.

I'm still learning how to use Claude Code effectively for debugging. Feeding it a series of error messages leads to increasingly convoluted code. Using it to help produce a small test program which I can inspect by hand seems to work better. I've had moderate success with the prompt "Run the test program and correct any errors", especially where I'm willing to review each edit as it is produced.