# SKOL IV: All the Data

Synoptic Key of Life (SKOL) is a web site and application that aims to provide easy access to all of the open taxonomic literature in Mycology. A synoptic key is a tool that helps you identify an organism by making successive observations, building up a detailed description of the organism in front of you. There are many fine synoptic keys available for particular taxa, but they are all hand-built. SKOL uses AI to build the synoptic key automatically.

The goal is to make it easier for advanced amateur mycologists to build technical descriptions of fungi.
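
To make the idea of a synoptic key concrete, here is a minimal sketch of narrowing a candidate list by successive observations. The taxa, feature names, and values are hand-written illustrations rather than SKOL data, and `narrow` is a hypothetical helper, not part of the SKOL code base.

```python
from typing import Dict, List

# Illustrative, hand-written descriptions: feature name -> observed value.
TAXA: Dict[str, Dict[str, str]] = {
    "Amanita muscaria": {"cap color": "red", "spore print": "white", "volva": "present"},
    "Russula emetica": {"cap color": "red", "spore print": "white", "volva": "absent"},
    "Agaricus campestris": {"cap color": "white", "spore print": "brown", "volva": "absent"},
}

def narrow(candidates: List[str], feature: str, value: str) -> List[str]:
    """Keep only the candidates whose description matches one observation."""
    return [name for name in candidates if TAXA[name].get(feature) == value]

# Each observation shrinks the candidate list; the order of observations is free.
candidates = sorted(TAXA)
for feature, value in [("cap color", "red"), ("volva", "present")]:
    candidates = narrow(candidates, feature, value)
    print(f"{feature} = {value}: {candidates}")
```

Unlike a dichotomous key, nothing here forces a fixed order of questions, which is the property a synoptic key is meant to provide.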

## Storage needs

SKOL uses a diverse set of databases to hold different artifacts.

The original literature is ingested into the document database CouchDB (citation needed) along with available publication metadata. The originals are typically PDF files which are stored as attachments on the CouchDB ingestion records.

Text is extracted from the ingested files, using OCR if necessary. This text is a second attachment on the ingestion record.

A classifier is trained from hand-annotated articles and stored in Redis. The model has an expiration period of several weeks, although the development constants below set it to two days. The classifier then annotates each text document with labels for Nomenclature, Description, and Misc-exposition. The annotated articles are stored as attachments on the CouchDB ingestion records.

Taxon names (typically species names with literature annotation) are combined with matching descriptions into Taxon records and stored in another CouchDB database. These records are the core data for SKOL.

The Taxon records are processed in a number of ways: sentence embedding, JSON encoding, and artificial cladograms.

The sentence embedding is computed with an SBERT model (citation needed) and stored as a blob in Redis. The embedding has an expiration period of 24 hours, although the development constants below set it to two days.
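
Storing a blob in Redis with an expiration period can be sketched as follows. The helpers `save_blob` and `load_blob` are hypothetical names, and `InMemoryStore` is only a stand-in that mimics the `set(name, value, ex=...)` and `get(name)` calls of a Redis client so the sketch runs without a server. With a real client, the same helpers should work unchanged, assuming the client was created with `decode_responses=False`.

```python
import pickle
import time
from typing import Any, Dict, Optional, Tuple

class InMemoryStore:
    """Stand-in for a Redis client; mimics only set(name, value, ex=...) and get(name)."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[bytes, Optional[float]]] = {}

    def set(self, name: str, value: bytes, ex: Optional[int] = None) -> bool:
        # Record the moment after which the key counts as expired.
        deadline = time.monotonic() + ex if ex is not None else None
        self._data[name] = (value, deadline)
        return True

    def get(self, name: str) -> Optional[bytes]:
        value, deadline = self._data.get(name, (None, None))
        if deadline is not None and time.monotonic() >= deadline:
            del self._data[name]
            return None
        return value

def save_blob(client: Any, name: str, obj: Any, expire_seconds: int) -> None:
    """Pickle an object and store it under `name` with a time-to-live."""
    client.set(name, pickle.dumps(obj), ex=expire_seconds)

def load_blob(client: Any, name: str) -> Optional[Any]:
    """Return the stored object, or None if the key is missing or expired."""
    raw = client.get(name)
    return None if raw is None else pickle.loads(raw)

store = InMemoryStore()
save_blob(store, "skol:embedding:v1.0", {"taxon-1": [0.1, 0.2]}, expire_seconds=60 * 60 * 24 * 2)
restored = load_blob(store, "skol:embedding:v1.0")
missing = load_blob(store, "skol:embedding:missing")
```

A `None` result is the signal to recompute the artifact and store it again, which is why the expiration period doubles as a refresh interval.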

A Mistral model (citation needed) converts each Taxon record description into a hierarchy of features, subfeatures, and values. The expectation is that these data structures will eventually form the basis of pull-down menus in the SKOL user interface. These JSON structures are stored in another CouchDB database.
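
As a rough illustration of such a hierarchy, the sketch below flattens a nested feature structure into menu paths. The JSON shown is a hypothetical example written by hand, not actual model output, and `menu_paths` is an assumed helper name.

```python
import json
from typing import Any, Iterator, List, Tuple

# Hypothetical structure for one description: features, subfeatures, then values.
description_json = """
{
  "pileus": {
    "shape": ["convex", "plane"],
    "color": ["red"]
  },
  "stipe": {
    "surface": {"texture": ["fibrillose"]}
  }
}
"""

def menu_paths(
    node: Any, prefix: Tuple[str, ...] = ()
) -> Iterator[Tuple[Tuple[str, ...], List[str]]]:
    """Yield (feature path, values) pairs by walking the nested dictionaries."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from menu_paths(child, prefix + (key,))
    else:
        # A non-dict node is treated as the list of values at the end of a path.
        yield prefix, list(node)

features = json.loads(description_json)
paths = list(menu_paths(features))
for path, values in paths:
    print(" > ".join(path), "=", ", ".join(values))
```

Each path corresponds to one chain of nested menus, and the values at the end of the path are the choices offered in the last menu.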

The sentence embeddings are further processed into a single tree of Taxon records based on their distance from each other in the sentence embedding space. This tree is stored in a neo4j database.
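
One way to turn pairwise distances into a single tree is agglomerative clustering. The sketch below uses single linkage with cosine distance on toy vectors. Both choices are assumptions made for illustration, since the text above does not say which algorithm or metric SKOL uses, and writing the tree to neo4j is not shown.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple, Union

# A tree is either a leaf (taxon name) or a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]

def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def leaves(tree: Tree) -> List[str]:
    """List the taxon names under a subtree."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])

def build_tree(embeddings: Dict[str, Sequence[float]]) -> Tree:
    """Repeatedly merge the two closest clusters until one tree remains.

    Assumes at least one embedding is supplied.
    """
    clusters: List[Tree] = sorted(embeddings)

    def linkage(pair: Tuple[Tree, Tree]) -> float:
        # Single linkage: distance between the closest members of two clusters.
        left, right = pair
        return min(
            cosine_distance(embeddings[a], embeddings[b])
            for a in leaves(left)
            for b in leaves(right)
        )

    while len(clusters) > 1:
        left, right = min(combinations(clusters, 2), key=linkage)
        clusters = [c for c in clusters if c is not left and c is not right]
        clusters.append((left, right))
    return clusters[0]

# Toy two-dimensional "embeddings"; real sentence embeddings have many more dimensions.
toy = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0]}
tree = build_tree(toy)
print(tree)
```

This naive version recomputes distances on every merge, so it only suits small inputs. A production pipeline would more likely rely on a library implementation of hierarchical clustering.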


In [1]:
bahir_package = 'org.apache.bahir:spark-sql-cloudant_2.12:2.4.0'
!spark-shell --packages $bahir_package < /dev/null

25/12/14 17:15:29 WARN Utils: Your hostname, puchpuchobs resolves to a loopback address: 127.0.1.1; using 10.1.10.58 instead (on interface wlp130s0f0)
25/12/14 17:15:29 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
:: loading settings :: url = jar:file:/data/piggy/miniconda3/envs/skol3/lib/python3.13/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/piggy/.ivy2/cache
The jars for the packages stored in: /home/piggy/.ivy2/jars
org.apache.bahir#spark-sql-cloudant_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-500fbb65-4b67-43b0-828a-eef475bf0380;1.0
	confs: [default]
	found org.apache.bahir#spark-sql-cloudant_2.12;2.4.0 in central
	found org.apache.bahir#bahir-common_2.12;2.4.0 in central
	found org.spark-project.spark#unused;1.0.0 in central
	found com.cloudant#cloudant-client;2.17.0 in central
	found com.google.code.gson#gson;2.8.2 in central
	found

In [2]:
import os
# Forces synchronous execution, making it easier to track GPU operations.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1' 

# Enables basic CUDA debug logging.
os.environ['CUDA_DEBUG'] = '1' 

# Other potentially useful variables for more detailed logging:
# os.environ['CUDA_API_CALLS'] = '1' # Logs CUDA API calls
os.environ['CUDA_LOG_LEVEL'] = 'DEBUG' # Or 'INFO', 'WARNING', etc.


In [3]:
from io import BytesIO
import json
import hashlib
import os
from pathlib import Path, PurePath
import pickle
import requests
import shutil
import sys
import tempfile
from typing import Any, Dict, Iterator, List, Optional, Tuple
from urllib.robotparser import RobotFileParser
import warnings

warnings.filterwarnings('error', category=UserWarning)

# os.environ['LD_LIBRARY_PATH'] = '/data/piggy/miniconda3/envs/skol/lib/python3.13/site-packages/nvidia/cusparselt/lib'

# Be sure to get version 2: https://simple-repository.app.cern.ch/project/bibtexparser/2.0.0b8/description
import bibtexparser
import couchdb
import feedparser
import fitz # PyMuPDF

import pandas as pd  # TODO(piggy): Remove this dependency in favor of pure pyspark DataFrames.

from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import (
    Tokenizer, CountVectorizer, IDF, StringIndexer, VectorAssembler, IndexToString
)
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier

from pyspark.sql import SparkSession, DataFrame, Row
from pyspark.sql.functions import (
    input_file_name, collect_list, regexp_extract, col, udf,
    explode, trim, row_number, min, expr, concat, lit
)
from pyspark.sql.types import (
    ArrayType, BooleanType, IntegerType, MapType, NullType,
    StringType, StructType, StructField
)
from pyspark.sql.window import Window

import redis
from uuid import uuid4

# Local modules
current_dir = os.getcwd()
parent_dir = os.path.abspath(os.path.join(current_dir, os.pardir))
parent_path = Path(parent_dir)
if parent_dir not in sys.path:
    sys.path.append(parent_dir)

# TODO: Make a TaxonExtractor in this notebook with the needed i/o functions.
from fileobj import FileObject
from finder import parse_annotated, remove_interstitials
import line
from line import Line

import numpy as np

# Import SKOL classifiers
from skol_classifier.model import SkolModel
from skol_classifier.output_formatters import YeddaFormatter
from skol_classifier.preprocessing import SuffixTransformer, ParagraphExtractor
from skol_classifier.utils import get_file_list

from taxon import group_paragraphs, Taxon


## Important constants

In [4]:
should_ingest = True  # Should we ingest from the real web sites?
model_to_use = "logistic"
create_classifier = True  # Should we recalculate the classifier?
add_annotations = True # Should we add *.ann files? Very expensive!
build_taxon = True
generate_json = True  # Run mistral to generate JSON descriptions?

couchdb_host = "127.0.0.1:5984" # e.g., "ACCOUNT.cloudant.com" or "localhost"
couchdb_username = "admin"
couchdb_password = "SU2orange!"
ingest_db_name = "skol_dev"  # Development ingestion database
taxon_db_name = "skol_taxa_dev"  # Development Taxa database
json_taxon_db_name = "skol_taxa_full_dev"  # Development Taxa database with JSON translations

redis_host = 'localhost'
redis_port = 6379

embedding_name = 'skol:embedding:v1.0'
embedding_expire = 60 * 60 * 24 * 2  # Expire after 2 days.
classifier_model_name = "skol:classifier:model:rnn-v1.0"
classifier_model_expire = 60 * 60 * 24  * 2 # Expire after 2 days.

neo4j_uri = "bolt://localhost:7687"

couchdb_url = f'http://{couchdb_host}'

cores = 2

## robots.txt

We want to be a well-behaved web scraper, so we respect `robots.txt`, a standardized file that tells scrapers which parts of a web site they are allowed to access.
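
As a small offline illustration of how these rules are evaluated, the sketch below feeds a hypothetical rule set to `RobotFileParser.parse()` instead of fetching a real file. The rules and the `example.org` URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, parsed from a string rather than fetched from a site.
example_rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

example_rp = RobotFileParser()
example_rp.parse(example_rules)

# can_fetch() answers whether the given user agent may request the URL.
allowed = example_rp.can_fetch("synoptickeyof.life", "https://example.org/content/article.pdf")
blocked = example_rp.can_fetch("synoptickeyof.life", "https://example.org/private/article.pdf")
print(allowed, blocked)
```

The ingesters in this notebook make the same `can_fetch` check before every download and skip any URL the site disallows.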

In [5]:
user_agent = "synoptickeyof.life"

ingenta_rp = RobotFileParser()
ingenta_rp.set_url("https://www.ingentaconnect.com/robots.txt")
ingenta_rp.read() # Reads and parses the robots.txt file from the URL

def make_spark_session():
    retval = SparkSession \
        .builder \
        .appName("CouchDB Spark SQL Example in Python using dataframes") \
        .master(f"local[{cores}]") \
        .config("cloudant.protocol", "http") \
        .config("cloudant.host", couchdb_host) \
        .config("cloudant.username", couchdb_username) \
        .config("cloudant.password", couchdb_password) \
        .config("spark.jars.packages", bahir_package) \
        .config("spark.driver.memory", "16g") \
        .config("spark.executor.memory", "20g") \
        .config("spark.driver.extraJavaOptions",
                "--add-opens=java.base/java.nio=ALL-UNNAMED "
                "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED "
                "--add-opens=java.base/sun.security.action=ALL-UNNAMED "
                "--add-opens=java.base/sun.util.calendar=ALL-UNNAMED") \
        .config("spark.executor.extraJavaOptions",
                "--add-opens=java.base/java.nio=ALL-UNNAMED "
                "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED "
                "--add-opens=java.base/sun.security.action=ALL-UNNAMED "
                "--add-opens=java.base/sun.util.calendar=ALL-UNNAMED") \
        .config("spark.submit.pyFiles",
                f'{parent_path / "line.py"},{parent_path / "fileobj.py"},'
                f'{parent_path / "couchdb_file.py"},{parent_path / "finder.py"},'
                f'{parent_path / "taxon.py"},{parent_path / "paragraph.py"},'
                f'{parent_path / "label.py"},{parent_path / "file.py"},'
                f'{parent_path / "extract_taxa_to_couchdb.py"}'
               ) \
        .getOrCreate()

    sc = retval.sparkContext
    sc.setLogLevel("ERROR") # Keeps the noise down!!!
    return retval

couch = couchdb.Server(couchdb_url)
couch.resource.credentials = (couchdb_username, couchdb_password)

if ingest_db_name not in couch:
    db = couch.create(ingest_db_name)
else:
    db = couch[ingest_db_name]

# Connect to Redis
redis_client = redis.Redis(
    host=redis_host,
    port=redis_port,
    db=0,
    decode_responses=False
)

## The Data Sources

The goal is to collect all the open access taxonomic literature in Mycology. Most of the sources below mainly cover macro-fungi and slime molds.

### Ingested Data Sources

* [Mycotaxon at Ingenta Connect](https://www.ingentaconnect.com/content/mtax/mt)
* [Studies in Mycology at Ingenta Connect](https://www.studiesinmycology.org/)

### Source of many older public domain and open access works

MykoWeb includes scans of many older works in mycology. I have local copies but need to write ingesters for them.

* [MykoWeb](https://mykoweb.com/)

### Journals in hand

These are journals we've collected over the years. The initial annotated issues are from early years of Mycotaxon. We still need to write ingesters for all of these.

* Mycologia (back issues)
* [Mycologia at Taylor and Francis](https://www.tandfonline.com/journals/umyc20)
  Mycologia is the main journal of the Mycological Society of America. It is a mix of open access and traditional access articles. The connector for this journal will need to identify the open access articles.
* Persoonia (all issues)
  Persoonia is no longer published.
* Mycotaxon (back issues)
  Mycotaxon is no longer published.

### Journals that need connectors

These are journals that we know include open access articles.

* [Amanitaceae.org](http://www.tullabs.com/amanita/?home)
* [Mycosphere](https://mycosphere.org/)
* [Mycoscience](https://mycoscience.org/)
* [Journal of Fungi](https://www.mdpi.com/journal/jof)
* [Mycology](https://www.tandfonline.com/journals/tmyc20)
* [Open Access Journal of Mycology & Mycological Sciences](https://www.medwinpublishers.com/OAJMMS/)
* [MycoKeys](https://mycokeys.pensoft.net/)


## Ingestion

Each journal or other data source gets an ingester that puts PDFs into our document store along with any metadata we can collect. The metadata is sufficient to create citations for each issue, book, or article. If BibTeX citations are available, we prefer to store them verbatim.

### Ingenta RSS ingestion

Ingenta Connect is an electronic publisher that holds two Mycology journals. New articles are available via RSS (Really Simple Syndication).

In [6]:
def ingest_from_bibtex(
        db: couchdb.Database,
        content: bytes,
        bibtex_link: str,
        meta: Dict[str, Any],
        rp
        ) -> None:
    """Load documents referenced in an Ingenta BibTeX database."""
    bib_database = bibtexparser.parse_string(content)

    bibtex_data = {
        'link': bibtex_link,
        'bibtex': bibtexparser.write_string(bib_database),
    }

    for bib_entry in bib_database.entries:
        doc = {
            '_id': uuid4().hex,
            'meta': meta,
            'pdf_url': f"{bib_entry['url']}?crawler=true",
        }

        # Do not fetch if we already have an entry.
        selector = {'selector': {'pdf_url': doc['pdf_url']}}
        if any(True for _ in db.find(selector)):
            print(f"Skipping {doc['pdf_url']}")
            continue

        if not rp.can_fetch(user_agent, doc['pdf_url']):
            # TODO(piggy): We should probably log blocked URLs.
            print(f"Robot permission denied {doc['pdf_url']}")
            continue

        print(f"Adding {doc['pdf_url']}")
        for k in bib_entry.fields_dict.keys():
            doc[k] = bib_entry[k]

        doc_id, doc_rev = db.save(doc)
        with requests.get(doc['pdf_url'], stream=False) as pdf_f:
            pdf_f.raise_for_status()
            pdf_doc = pdf_f.content

        attachment_filename = 'article.pdf'
        attachment_content_type = 'application/pdf'
        attachment_file = BytesIO(pdf_doc)

        db.put_attachment(doc, attachment_file, attachment_filename, attachment_content_type)

        print("-" * 10)

In [7]:
def ingest_ingenta(
        db: couchdb.Database,
        rss_url: str,
        rp
) -> None:
    """Ingest documents from an Ingenta RSS feed."""

    feed = feedparser.parse(rss_url)

    feed_meta = {
        'url': rss_url,
        'title': feed.feed.title,
        'link': feed.feed.link,
        'description': feed.feed.description,
    }

    for entry in feed.entries:
        entry_meta = {
            'title': entry.title,
            'link': entry.link,
        }
        if hasattr(entry, 'summary'):
            entry_meta['summary'] = entry.summary
        if hasattr(entry, 'description'):
            entry_meta['description'] = entry.description

        bibtex_link = f'{entry.link}?format=bib'
        print(f"bibtex_link: {bibtex_link}")

        if not rp.can_fetch(user_agent, bibtex_link):
            print(f"Robot permission denied {bibtex_link}")
            continue

        with requests.get(bibtex_link, stream=False) as bibtex_f:
            bibtex_f.raise_for_status()  # Raise an HTTPError for bad responses (4xx or 5xx)

            ingest_from_bibtex(
                db=db,
                content=bibtex_f.content\
                    .replace(b"\"\nparent", b"\",\nparent")\
                    .replace(b"\n", b""),
                bibtex_link=bibtex_link,
                meta={
                    'feed': feed_meta,
                    'entry': entry_meta,
                },
                rp=rp
            )
        print("=" * 20)

In [8]:
def ingest_from_local_bibtex(
    db: couchdb.Database,
    root: Path,
    rp
) -> None:
    """Ingest from a local directory with Ingenta BibTeX files in it."""
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith('format=bib'):
                continue
            full_filepath = os.path.join(dirpath, filename)
            # Use the path relative to root so the URL does not contain a double slash.
            bibtex_link = f"https://www.ingentaconnect.com/{os.path.relpath(full_filepath, root)}"
            with open(full_filepath) as f:
                # Paper over a syntax problem in Ingenta Connect Bibtex files.
                content = f.read()\
                    .replace("\"\nparent", "\",\nparent")\
                    .replace("\n", "")
                ingest_from_bibtex(db, content, bibtex_link, meta={}, rp=rp)
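
The string replacements used by the loaders above can be tried in isolation. The record below is a hypothetical imitation of the defect (a missing comma before the `parent_itemid` field), and `repair_ingenta_bibtex` is an assumed helper name that applies the same two replacements.

```python
# Hypothetical record imitating the defect seen in Ingenta Connect BibTeX files.
broken = (
    '@article{example:2013:1,\n'
    'title = "A hypothetical title",\n'
    'journal = "Example Journal"\n'  # Missing comma before the next field.
    'parent_itemid = "example-parent",\n'
    'year = "2013"\n'
    '}\n'
)

def repair_ingenta_bibtex(text: str) -> str:
    """Insert the missing comma before `parent...` fields, then drop newlines."""
    return text.replace('"\nparent', '",\nparent').replace("\n", "")

repaired = repair_ingenta_bibtex(broken)
print(repaired)
```

The comma has to be inserted before the newlines are removed, because the first replacement relies on the line break to find the field boundary.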


In [9]:
# Mycotaxon
if should_ingest:
    ingest_ingenta(db=db, rss_url='https://api.ingentaconnect.com/content/mtax/mt?format=rss', rp=ingenta_rp)

bibtex_link: https://www.ingentaconnect.com/content/mtax/mt/2023/00000137/00000004?format=bib
Skipping https://www.ingentaconnect.com/content/mtax/mt/2023/00000137/00000004/art00001?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2023/00000137/00000004/art00002?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2023/00000137/00000004/art00003?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2023/00000137/00000004/art00004?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2023/00000137/00000004/art00005?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2023/00000137/00000004/art00006?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2023/00000137/00000004/art00007?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2023/00000137/00000004/art00008?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2023/00000137/00000004/art00009?crawler=tr

In [10]:
# Studies in Mycology
if should_ingest:
    ingest_ingenta(db=db, rss_url='https://api.ingentaconnect.com/content/wfbi/sim?format=rss', rp=ingenta_rp)

bibtex_link: https://www.ingentaconnect.com/content/wfbi/sim/2025/00000111/00000001?format=bib
Skipping https://www.ingentaconnect.com/content/wfbi/sim/2025/00000111/00000001/art00002?crawler=true
Skipping https://www.ingentaconnect.com/content/wfbi/sim/2025/00000111/00000001/art00001?crawler=true
Skipping https://www.ingentaconnect.com/content/wfbi/sim/2025/00000111/00000001/art00003?crawler=true
Skipping https://www.ingentaconnect.com/content/wfbi/sim/2025/00000111/00000001/art00004?crawler=true
Skipping https://www.ingentaconnect.com/content/wfbi/sim/2025/00000111/00000001/art00005?crawler=true
bibtex_link: https://www.ingentaconnect.com/content/wfbi/sim/2025/00000110/00000001?format=bib
Skipping https://www.ingentaconnect.com/content/wfbi/sim/2025/00000110/00000001/art00001?crawler=true
Skipping https://www.ingentaconnect.com/content/wfbi/sim/2025/00000110/00000001/art00002?crawler=true
Skipping https://www.ingentaconnect.com/content/wfbi/sim/2025/00000110/00000001/art00003?crawler

In [11]:
if should_ingest:
    ingest_from_local_bibtex(
        db=db,
        root=Path("/data/skol/www/www.ingentaconnect.com"),
        rp=ingenta_rp
    )

Skipping https://www.ingentaconnect.com/content/mtax/mt/2013/00000125/00000001/art00001?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2013/00000125/00000001/art00002?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2013/00000125/00000001/art00003?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2013/00000125/00000001/art00004?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2013/00000125/00000001/art00005?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2013/00000125/00000001/art00006?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2013/00000125/00000001/art00007?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2013/00000125/00000001/art00008?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2013/00000125/00000001/art00009?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2013/00000125/00000001/art00010?cra

Unknown block type <class 'bibtexparser.model.DuplicateBlockKeyBlock'>
Unknown block type <class 'bibtexparser.model.DuplicateBlockKeyBlock'>


Skipping https://www.ingentaconnect.com/content/mtax/mt/2017/00000132/00000003/art00029?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2017/00000132/00000002/art00001?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2017/00000132/00000002/art00002?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2017/00000132/00000002/art00003?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2017/00000132/00000002/art00004?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2017/00000132/00000002/art00005?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2017/00000132/00000002/art00006?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2017/00000132/00000002/art00007?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2017/00000132/00000002/art00008?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2017/00000132/00000002/art00009?cra

Parsing of `@article ` block (line 0) aborted on line 0 due to syntactical error in bibtex:
 Expected a `=` after entry key, but found `"`.
Unknown block type <class 'bibtexparser.model.ParsingFailedBlock'>
Unknown block type <class 'bibtexparser.model.ParsingFailedBlock'>


Skipping https://www.ingentaconnect.com/content/mtax/mt/2023/00000137/00000004/art00035?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2023/00000137/00000004/art00024?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2009/00000109/00000001/art00001?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2009/00000109/00000001/art00002?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2009/00000109/00000001/art00003?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2009/00000109/00000001/art00004?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2009/00000109/00000001/art00005?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2009/00000109/00000001/art00006?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2009/00000109/00000001/art00007?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2009/00000109/00000001/art00008?cra

Unknown block type <class 'bibtexparser.model.DuplicateBlockKeyBlock'>
Unknown block type <class 'bibtexparser.model.DuplicateBlockKeyBlock'>
Unknown block type <class 'bibtexparser.model.DuplicateBlockKeyBlock'>
Unknown block type <class 'bibtexparser.model.DuplicateBlockKeyBlock'>


Skipping https://www.ingentaconnect.com/content/mtax/mt/2020/00000134/00000004/art00018?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2011/00000116/00000001/art00001?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2011/00000116/00000001/art00002?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2011/00000116/00000001/art00003?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2011/00000116/00000001/art00004?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2011/00000116/00000001/art00005?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2011/00000116/00000001/art00006?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2011/00000116/00000001/art00007?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2011/00000116/00000001/art00008?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2011/00000116/00000001/art00009?cra

Unknown block type <class 'bibtexparser.model.DuplicateBlockKeyBlock'>
Unknown block type <class 'bibtexparser.model.DuplicateBlockKeyBlock'>


Skipping https://www.ingentaconnect.com/content/mtax/mt/2011/00000115/00000001/art00061?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2011/00000115/00000001/art00062?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2018/00000132/00000004/art00001?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2018/00000132/00000004/art00002?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2018/00000132/00000004/art00003?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2018/00000132/00000004/art00004?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2018/00000132/00000004/art00005?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2018/00000132/00000004/art00006?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2018/00000132/00000004/art00007?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2018/00000132/00000004/art00008?cra

Unknown block type <class 'bibtexparser.model.DuplicateBlockKeyBlock'>
Unknown block type <class 'bibtexparser.model.DuplicateBlockKeyBlock'>


Skipping https://www.ingentaconnect.com/content/mtax/mt/2018/00000133/00000001/art00026?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2018/00000133/00000001/art00027?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2018/00000133/00000003/art00001?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2018/00000133/00000003/art00002?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2018/00000133/00000003/art00003?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2018/00000133/00000003/art00004?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2018/00000133/00000003/art00005?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2018/00000133/00000003/art00006?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2018/00000133/00000003/art00007?crawler=true
Skipping https://www.ingentaconnect.com/content/mtax/mt/2018/00000133/00000003/art00008?cra

### Text extraction

We extract the text from each PDF, optionally with OCR, and add it as an additional attachment on the source record.

In [12]:
spark = make_spark_session()

df = spark.read.load(
    format="org.apache.bahir.cloudant",
    database=ingest_db_name
)

25/12/14 17:21:04 WARN Utils: Your hostname, puchpuchobs resolves to a loopback address: 127.0.1.1; using 10.1.10.58 instead (on interface wlp130s0f0)
25/12/14 17:21:04 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


:: loading settings :: url = jar:file:/data/piggy/miniconda3/envs/skol3/lib/python3.13/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/piggy/.ivy2/cache
The jars for the packages stored in: /home/piggy/.ivy2/jars
org.apache.bahir#spark-sql-cloudant_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-d90b490f-e753-4cac-a488-a946e976e2ee;1.0
	confs: [default]
	found org.apache.bahir#spark-sql-cloudant_2.12;2.4.0 in central
	found org.apache.bahir#bahir-common_2.12;2.4.0 in central
	found org.spark-project.spark#unused;1.0.0 in central
	found com.cloudant#cloudant-client;2.17.0 in central
	found com.google.code.gson#gson;2.8.2 in central
	found commons-codec#commons-codec;1.6 in central
	found com.cloudant#cloudant-http;2.17.0 in central
	found commons-io#commons-io;2.4 in central
	found com.squareup.okhttp3#okhttp;3.12.2 in central
	found com.squareup.okio#okio;1.15.0 in central
	found com.typesafe#config;1.3.1 in central
	found org.scalaj#scalaj-http_2.12;2.3.0 in central
:: resolution report :: resolve 208ms :: artifacts dl 5ms
	:: modules in use

In [13]:
df.describe()

DataFrame[summary: string, _id: string, _rev: string, abstract: string, author: string, doi: string, eissn: string, issn: string, itemtype: string, journal: string, number: string, pages: string, parent_itemid: string, pdf_url: string, publication date: string, publishercode: string, title: string, url: string, volume: string, year: string]

In [14]:
# Content-Type: text/html; charset=UTF-8

def pdf_to_text(pdf_contents: bytes) -> bytes:
    doc = fitz.open(stream=BytesIO(pdf_contents), filetype="pdf")

    full_text = ''
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        # Possibly perform OCR on the page
        text = page.get_text("text", flags=fitz.TEXT_PRESERVE_WHITESPACE | fitz.TEXT_DEHYPHENATE)
        # full_text += f"\n--- PDF Page {page_num+1} ---\n"  # TODO(piggy): Introduce PDF page tracking in line-by-line and paragraph parsers.
        full_text += text

    return full_text.encode("utf-8")

def add_text_to_partition(iterator) -> None:
    couch = couchdb.Server(couchdb_url)
    couch.resource.credentials = (couchdb_username, couchdb_password)
    local_db = couch[ingest_db_name]
    for row in iterator:
        if not row:
            continue
        if not row._attachments:
            continue
        row_dict = row.asDict()
        attachment_dict = row._attachments.asDict()
        for pdf_filename in attachment_dict:
            pdf_path = PurePath(pdf_filename)
            if pdf_path.suffix != '.pdf':
                continue
            txt_path_str = pdf_path.stem + '.txt'
            # if txt_path_str in attachment_dict:
            #     # TODO(piggy): Recalculate text if text is terrible. Too much noise vocabulary?
            #     print(f"Already have text for {row.pdf_url}")
            #     continue
            print(f"{row._id}, {row.pdf_url}")
            pdf_file = local_db.get_attachment(row._id, str(pdf_path)).read()
            txt_file = pdf_to_text(pdf_file)
            attachment_content_type = 'text/plain; charset=UTF-8'
            attachment_file = BytesIO(txt_file)
            local_db.put_attachment(row_dict, attachment_file, txt_path_str, attachment_content_type)


In [15]:
# Identical to skol_classifier.CouchDBConnection.
from skol_classifier.couchdb_io import CouchDBConnection as CDBC

class CouchDBConnection(CDBC):
    """
    Manages CouchDB connection and provides I/O operations.

    This class encapsulates connection parameters and provides an idempotent
    connection method that can be safely called multiple times.
    """
    # Shared schema definitions (DRY principle)
    LOAD_SCHEMA = StructType([
        StructField("doc_id", StringType(), False),
        StructField("human_url", StringType(), False),
        StructField("attachment_name", StringType(), False),
        StructField("value", StringType(), False),
    ])

    SAVE_SCHEMA = StructType([
        StructField("doc_id", StringType(), False),
        StructField("attachment_name", StringType(), False),
        StructField("success", BooleanType(), False),
    ])

    def __init__(
        self,
        couchdb_url: str,
        database: str,
        username: Optional[str] = None,
        password: Optional[str] = None
    ):
        """
        Initialize CouchDB connection parameters.

        Args:
            couchdb_url: CouchDB server URL (e.g., "http://localhost:5984")
            database: Database name
            username: Optional username for authentication
            password: Optional password for authentication
        """
        self.couchdb_url = couchdb_url
        self.database = database
        self.username = username
        self.password = password
        self._server = None
        self._db = None

    def _connect(self):
        """
        Idempotent connection method that returns a CouchDB server object.

        This method can be called multiple times safely - it will only create
        a connection if one doesn't already exist.

        Returns:
            couchdb.Server: Connected CouchDB server object
        """
        if self._server is None:
            self._server = couchdb.Server(self.couchdb_url)
            if self.username and self.password:
                self._server.resource.credentials = (self.username, self.password)

        if self._db is None:
            self._db = self._server[self.database]

        return self._server

    @property
    def db(self):
        """Get the database object, connecting if necessary."""
        if self._db is None:
            self._connect()
        return self._db

    def get_all_doc_ids(self, pattern: str = "*") -> List[str]:
        """
        Get list of document IDs matching the pattern from CouchDB.

        Args:
            pattern: Pattern for document IDs (e.g., "taxon_*", "*")
                    - "*" matches all non-design documents
                    - "prefix*" matches documents starting with prefix
                    - "exact" matches exactly

        Returns:
            List of matching document IDs
        """
        db = self.db

        # Get all document IDs (excluding design documents)
        all_doc_ids = [doc_id for doc_id in list(db) if not doc_id.startswith('_design/')]

        # Filter by pattern
        if pattern == "*":
            # Return all non-design documents
            return all_doc_ids
        elif pattern.endswith('*'):
            # Prefix matching
            prefix = pattern[:-1]
            return [doc_id for doc_id in all_doc_ids if doc_id.startswith(prefix)]
        else:
            # Exact match
            return [doc_id for doc_id in all_doc_ids if doc_id == pattern]

    def get_document_list(
        self,
        spark: SparkSession,
        pattern: str = "*.txt"
    ) -> DataFrame:
        """
        Get a list of documents with text attachments from CouchDB.

        This only fetches document metadata (not content) to create a DataFrame
        that can be processed in parallel. Creates ONE ROW per attachment, so if
        a document has multiple attachments matching the pattern, it will have
        multiple rows in the resulting DataFrame.

        Args:
            spark: SparkSession
            pattern: Pattern for attachment names (e.g., "*.txt")

        Returns:
            DataFrame with columns: doc_id, attachment_name
            One row per (doc_id, attachment_name) pair
        """
        # Connect to CouchDB (driver only)
        db = self.db

        # Get all documents with attachments matching pattern
        doc_list = []
        for doc_id in db:
            try:
                doc = db[doc_id]
                attachments = doc.get('_attachments', {})

                # Loop through ALL attachments in the document
                for att_name in attachments.keys():
                    # Check if attachment matches pattern
                    # Pattern matching: "*.txt" matches files ending with .txt
                    if pattern == "*.txt" and att_name.endswith('.txt'):
                        doc_list.append((doc_id, att_name))
                    elif pattern == "*.*" or pattern == "*":
                        # Match all attachments
                        doc_list.append((doc_id, att_name))
                    elif pattern.startswith("*.") and att_name.endswith(pattern[1:]):
                        # Generic pattern matching for *.ext
                        doc_list.append((doc_id, att_name))
            except Exception:
                # Skip documents we can't read
                continue

        # Create DataFrame with document IDs and attachment names
        schema = StructType([
            StructField("doc_id", StringType(), False),
            StructField("attachment_name", StringType(), False)
        ])

        return spark.createDataFrame(doc_list, schema)

    def fetch_partition(
        self,
        partition: Iterator[Row]
    ) -> Iterator[Row]:
        """
        Fetch CouchDB attachments for an entire partition.

        This function is designed to be used with foreachPartition or mapPartitions.
        It creates a single CouchDB connection per partition and reuses it for all rows.

        Args:
            partition: Iterator of Rows with doc_id and attachment_name

        Yields:
            Rows with doc_id, human_url, attachment_name, and value (content).
        """
        # Connect to CouchDB once per partition
        try:
            db = self.db

            # Process all rows in partition with same connection
            # Note: Each row represents one (doc_id, attachment_name) pair
            # If a document has multiple .txt attachments, there will be multiple rows
            for row in partition:
                try:
                    doc = db[row.doc_id]

                    # Get the specific attachment for this row
                    if row.attachment_name in doc.get('_attachments', {}):
                        attachment = db.get_attachment(doc, row.attachment_name)
                        if attachment:
                            content = attachment.read().decode('utf-8', errors='ignore')

                            yield Row(
                                doc_id=row.doc_id,
                                human_url=doc.get('url', 'missing_human_url'),
                                attachment_name=row.attachment_name,
                                value=content
                            )
                except Exception as e:
                    # Log error but continue processing
                    print(f"Error fetching {row.doc_id}/{row.attachment_name}: {e}")
                    continue

        except Exception as e:
            print(f"Error connecting to CouchDB: {e}")
            return

    def save_partition(
        self,
        partition: Iterator[Row],
        suffix: str = ".ann"
    ) -> Iterator[Row]:
        """
        Save annotated content to CouchDB for an entire partition.

        This function is designed to be used with foreachPartition or mapPartitions.
        It creates a single CouchDB connection per partition and reuses it for all rows.

        Args:
            partition: Iterator of Rows with doc_id, attachment_name, final_aggregated_pg
                       and optionally human_url
            suffix: Suffix to append to attachment names

        Yields:
            Rows with doc_id, attachment_name, and success status.
        """
        # Connect to CouchDB once per partition
        try:
            db = self.db

            # Process all rows in partition with same connection
            # Note: Each row represents one (doc_id, attachment_name) pair
            # If a document had multiple .txt files, we save multiple .ann files
            for row in partition:
                success = False
                try:
                    doc = db[row.doc_id]

                    # Update human_url field if provided
                    if hasattr(row, 'human_url') and row.human_url:
                        doc['url'] = row.human_url
                        db.save(doc)
                        # Reload doc to get updated _rev
                        doc = db[row.doc_id]

                    # Create new attachment name by appending suffix
                    # e.g., "article.txt" becomes "article.txt.ann"
                    new_attachment_name = f"{row.attachment_name}{suffix}"

                    # Save the annotated content as a new attachment
                    db.put_attachment(
                        doc,
                        row.final_aggregated_pg.encode('utf-8'),
                        filename=new_attachment_name,
                        content_type='text/plain'
                    )

                    success = True

                except Exception as e:
                    print(f"Error saving {row.doc_id}/{row.attachment_name}: {e}")

                yield Row(
                    doc_id=row.doc_id,
                    attachment_name=row.attachment_name,
                    success=success
                )

        except Exception as e:
            print(f"Error connecting to CouchDB: {e}")
            # Yield failures for all rows
            for row in partition:
                yield Row(
                    doc_id=row.doc_id,
                    attachment_name=row.attachment_name,
                    success=False
                )

    def load_distributed(
        self,
        spark: SparkSession,
        pattern: str = "*.txt"
    ) -> DataFrame:
        """
        Load text attachments from CouchDB using mapPartitions.

        This function:
        1. Gets list of documents (on driver)
        2. Creates a DataFrame with doc IDs
        3. Uses mapPartitions to fetch content efficiently (one connection per partition)

        Args:
            spark: SparkSession
            pattern: Pattern for attachment names

        Returns:
            DataFrame with columns: doc_id, human_url, attachment_name, and value.
        """
        # Get document list
        doc_df = self.get_document_list(spark, pattern)

        # Use mapPartitions for efficient batch fetching
        # Create new connection instance with same params for workers
        conn_params = (self.couchdb_url, self.database, self.username, self.password)

        def fetch_partition(partition):
            # Each worker creates its own connection
            conn = CouchDBConnection(*conn_params)
            return conn.fetch_partition(partition)

        # Apply mapPartitions using shared schema
        result_df = doc_df.rdd.mapPartitions(fetch_partition).toDF(self.LOAD_SCHEMA)

        return result_df

    def save_distributed(
        self,
        df: DataFrame,
        suffix: str = ".ann"
    ) -> DataFrame:
        """
        Save annotated predictions to CouchDB using mapPartitions.

        This function uses mapPartitions where each partition creates a single
        CouchDB connection and reuses it for all rows.

        Args:
            df: DataFrame with columns: doc_id, attachment_name, final_aggregated_pg
            suffix: Suffix to append to attachment names

        Returns:
            DataFrame with doc_id, attachment_name, and success columns
        """
        # Use mapPartitions for efficient batch saving
        # Create new connection instance with same params for workers
        conn_params = (self.couchdb_url, self.database, self.username, self.password)

        def save_partition(partition):
            # Each worker creates its own connection
            conn = CouchDBConnection(*conn_params)
            return conn.save_partition(partition, suffix)

        # Apply mapPartitions using shared schema
        result_df = df.rdd.mapPartitions(save_partition).toDF(self.SAVE_SCHEMA)

        return result_df

    def process_partition_with_func(
        self,
        partition: Iterator[Row],
        processor_func,
        suffix: str = ".ann"
    ) -> Iterator[Row]:
        """
        Generic function to read, process, and save in one partition operation.

        This allows custom processing logic while maintaining single connection per partition.

        Args:
            partition: Iterator of Rows
            processor_func: Function to process content (takes content string, returns processed string)
            suffix: Suffix for output attachment

        Yields:
            Rows with processing results, including success status for logging.
        """
        try:
            db = self.db

            for row in partition:
                try:
                    doc = db[row.doc_id]

                    # Fetch
                    if row.attachment_name in doc.get('_attachments', {}):
                        attachment = db.get_attachment(doc, row.attachment_name)
                        if attachment:
                            content = attachment.read().decode('utf-8', errors='ignore')

                            # Process
                            processed = processor_func(content)

                            # Save
                            new_attachment_name = f"{row.attachment_name}{suffix}"
                            db.put_attachment(
                                doc,
                                processed.encode('utf-8'),
                                filename=new_attachment_name,
                                content_type='text/plain'
                            )

                            yield Row(
                                doc_id=row.doc_id,
                                attachment_name=row.attachment_name,
                                success=True
                            )
                            continue

                except Exception as e:
                    print(f"Error processing {row.doc_id}/{row.attachment_name}: {e}")

                yield Row(
                    doc_id=row.doc_id,
                    attachment_name=row.attachment_name,
                    success=False
                )

        except Exception as e:
            print(f"Error connecting to CouchDB: {e}")
            for row in partition:
                yield Row(
                    doc_id=row.doc_id,
                    attachment_name=row.attachment_name,
                    success=False
                )
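
The partition methods above all follow the same shape: open one connection, reuse it for every row, and report per-row failures instead of aborting. The sketch below illustrates that shape in plain Python, without Spark or CouchDB. `FakeConnection`, the in-memory `store`, and the hand-made `partitions` list are hypothetical stand-ins introduced only for this illustration.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


class FakeConnection:
    """Hypothetical stand-in for a database connection; counts how often it is opened."""

    opened = 0  # class-level counter, for illustration only

    def __init__(self, store: Dict[str, str]):
        FakeConnection.opened += 1
        self.store = store

    def get(self, key: str) -> str:
        return self.store[key]


def fetch_partition(
    partition: Iterable[str], store: Dict[str, str]
) -> Iterator[Tuple[str, str, bool]]:
    """Open one connection, then reuse it for every key in the partition."""
    conn = FakeConnection(store)  # once per partition, not once per row
    for key in partition:
        try:
            yield (key, conn.get(key), True)
        except KeyError:
            # Per-row error handling: report the failure and keep going.
            yield (key, "", False)


store = {"doc1": "alpha", "doc2": "beta", "doc3": "gamma"}
partitions: List[List[str]] = [["doc1", "doc2"], ["doc3", "missing"]]

results = [row for part in partitions for row in fetch_partition(part, store)]
print(results)
print(f"connections opened: {FakeConnection.opened}")
```

Four rows are processed with two connections, one per partition. With a real database the saving grows with partition size, since connection setup is usually far more expensive than a single fetch.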


In [16]:
from skol_classifier.output_formatters import CouchDBOutputWriter as CDBOW
class CouchDBOutputWriter(CDBOW):
    """
    Writes predictions back to CouchDB as attachments.
    """

    def __init__(
        self,
        couchdb_url: str,
        database: str,
        username: str,
        password: str
    ):
        """
        Initialize the writer.

        Args:
            couchdb_url: CouchDB server URL
            database: Database name
            username: CouchDB username
            password: CouchDB password
        """
        self.conn = CouchDBConnection(
            couchdb_url=couchdb_url,
            database=database,
            username=username,
            password=password
        )

    def save_annotated(
        self,
        predictions: DataFrame,
        suffix: str = ".ann",
        coalesce_labels: bool = False,
        line_level: bool = False
    ) -> None:
        """
        Save predictions to CouchDB as attachments.

        Args:
            predictions: DataFrame with predictions
            suffix: Suffix for attachment names
            coalesce_labels: Whether to coalesce consecutive labels
            line_level: Whether data is line-level
        """
        # Format predictions
        if "annotated_value" not in predictions.columns:
            predictions = YeddaFormatter.format_predictions(predictions)

        # Coalesce if requested
        if coalesce_labels and line_level:
            predictions = YeddaFormatter.coalesce_consecutive_labels(
                predictions, line_level=True
            )
            # For coalesced output, we have coalesced_annotations column
            # We need to join them into final_aggregated_pg
            from pyspark.sql.functions import expr
            predictions = predictions.withColumn(
                "final_aggregated_pg",
                expr("array_join(coalesced_annotations, '\n')")
            )
        else:
            # Aggregate annotated values by document
            groupby_col = "doc_id" if "doc_id" in predictions.columns else "filename"
            attachment_col = "attachment_name" if "attachment_name" in predictions.columns else "filename"

            # Check if we have line_number for ordering
            if "line_number" in predictions.columns:
                from pyspark.sql.functions import expr, first
                # Preserve human_url if it exists
                if "human_url" in predictions.columns:
                    predictions = (
                        predictions.groupBy(groupby_col, attachment_col)
                        .agg(
                            expr("sort_array(collect_list(struct(line_number, annotated_value)))").alias("sorted_list"),
                            first("human_url").alias("human_url")
                        )
                        .withColumn("annotated_value_ordered", expr("transform(sorted_list, x -> x.annotated_value)"))
                        .withColumn("final_aggregated_pg", expr("array_join(annotated_value_ordered, '\n')"))
                        .select(groupby_col, "human_url", attachment_col, "final_aggregated_pg")
                    )
                else:
                    predictions = (
                        predictions.groupBy(groupby_col, attachment_col)
                        .agg(
                            expr("sort_array(collect_list(struct(line_number, annotated_value)))").alias("sorted_list")
                        )
                        .withColumn("annotated_value_ordered", expr("transform(sorted_list, x -> x.annotated_value)"))
                        .withColumn("final_aggregated_pg", expr("array_join(annotated_value_ordered, '\n')"))
                        .select(groupby_col, attachment_col, "final_aggregated_pg")
                    )
            else:
                from pyspark.sql.functions import collect_list, expr, first
                # Preserve human_url if it exists
                if "human_url" in predictions.columns:
                    predictions = (
                        predictions.groupBy(groupby_col, attachment_col)
                        .agg(
                            collect_list("annotated_value").alias("annotations"),
                            first("human_url").alias("human_url")
                        )
                        .withColumn("final_aggregated_pg", expr("array_join(annotations, '\n')"))
                        .select(groupby_col, "human_url", attachment_col, "final_aggregated_pg")
                    )
                else:
                    predictions = (
                        predictions.groupBy(groupby_col, attachment_col)
                        .agg(
                            collect_list("annotated_value").alias("annotations")
                        )
                        .withColumn("final_aggregated_pg", expr("array_join(annotations, '\n')"))
                        .select(groupby_col, attachment_col, "final_aggregated_pg")
                    )

            # Rename columns for CouchDB save
            if groupby_col != "doc_id":
                predictions = predictions.withColumnRenamed(groupby_col, "doc_id")
            if attachment_col != "attachment_name":
                predictions = predictions.withColumnRenamed(attachment_col, "attachment_name")

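        # Use CouchDB connection to save. save_distributed() is built on the
        # lazy mapPartitions transformation, so trigger an action here;
        # otherwise the writes are never executed.
        self.conn.save_distributed(predictions, suffix=suffix).count()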
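
The line-level branch of `save_annotated` rebuilds each document by collecting `(line_number, annotated_value)` pairs per attachment, sorting them, and joining the values with newlines. As a rough illustration of that logic only, here is the same idea in plain Python. The sample rows and the `aggregate` helper are hypothetical and do not use Spark.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical line-level predictions:
# (doc_id, attachment_name, line_number, annotated_value).
# Rows arrive in arbitrary order, as they may after a distributed shuffle.
rows: List[Tuple[str, str, int, str]] = [
    ("doc1", "a.txt", 2, "third"),
    ("doc1", "a.txt", 0, "first"),
    ("doc2", "b.txt", 0, "only"),
    ("doc1", "a.txt", 1, "second"),
]


def aggregate(
    rows: List[Tuple[str, str, int, str]]
) -> Dict[Tuple[str, str], str]:
    """Group by (doc_id, attachment_name), order by line number, join with newlines."""
    grouped: Dict[Tuple[str, str], List[Tuple[int, str]]] = defaultdict(list)
    for doc_id, attachment_name, line_number, value in rows:
        grouped[(doc_id, attachment_name)].append((line_number, value))
    # Sorting the (line_number, value) pairs plays the role of sort_array
    # over a collected list of structs.
    return {
        key: "\n".join(value for _, value in sorted(pairs))
        for key, pairs in grouped.items()
    }


aggregated = aggregate(rows)
print(aggregated[("doc1", "a.txt")])
```

Carrying the line number inside the collected pair is what makes the result independent of arrival order; collecting the values alone would not guarantee it.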



In [20]:
from skol_classifier.classifier_v2 import SkolClassifierV2 as SC
from skol_classifier.model import create_model

# Main classifier module for SKOL text classification
class SkolClassifierV2(SC):
    """
    Text classifier for taxonomic literature.

    This version only includes the redis and couchdb I/O methods.
    All other methods are in SC.

    Supports multiple classification models (Logistic Regression, Random Forest, RNN)
    and feature types (word TF-IDF, suffix TF-IDF, combined).
    """

    def _load_raw_from_couchdb(self) -> DataFrame:
        """Load raw text from CouchDB."""
        conn = CouchDBConnection(
            self.couchdb_url,
            self.couchdb_database,
            self.couchdb_username,
            self.couchdb_password
        )

        df = conn.load_distributed(self.spark, self.couchdb_pattern)

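        # Split into lines if line_level mode
        if self.line_level:
            from pyspark.sql.functions import posexplode, split, col, trim, row_number
            from pyspark.sql.window import Window

            # posexplode keeps each line's position, so the numbering below is
            # deterministic (ordering by a constant leaves the order undefined).
            other_cols = [c for c in df.columns if c != "value"]
            df = df.select(
                *other_cols,
                posexplode(split(col("value"), "\\n")).alias("line_pos", "value")
            )
            df = df.filter(trim(col("value")) != "")

            # Add consecutive, 0-based line numbers over the non-blank lines
            window_spec = Window.partitionBy("doc_id", "attachment_name").orderBy("line_pos")
            df = df.withColumn("line_number", row_number().over(window_spec) - 1).drop("line_pos")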

        return df

    def _load_annotated_from_couchdb(self) -> DataFrame:
        """Load annotated data from CouchDB."""
        # Load raw annotations from CouchDB
        conn = CouchDBConnection(
            self.couchdb_url,
            self.couchdb_database,
            self.couchdb_username,
            self.couchdb_password
        )

        # Look for .ann files for training
        pattern = self.couchdb_pattern
        if not pattern.endswith('.ann'):
            pattern = pattern.replace('.txt', '.txt.ann')

        df = conn.load_distributed(self.spark, pattern)

        # Parse annotations
        # Absolute import: relative imports do not work from a notebook cell
        from skol_classifier.preprocessing import AnnotatedTextParser

        parser = AnnotatedTextParser(line_level=self.line_level)
        return parser.parse(df)

    def _save_to_couchdb(self, predictions: DataFrame) -> None:
        """Save predictions to CouchDB."""
        # Uses the CouchDBOutputWriter subclass defined in an earlier notebook cell

        writer = CouchDBOutputWriter(
            couchdb_url=self.couchdb_url,
            database=self.couchdb_database,
            username=self.couchdb_username,
            password=self.couchdb_password
        )

        writer.save_annotated(
            predictions,
            suffix=self.output_couchdb_suffix,
            coalesce_labels=self.coalesce_labels,
            line_level=self.line_level
        )

    def load_raw(self) -> DataFrame:
        """
        Load raw (unannotated) data from configured input source.

        Returns:
            DataFrame with raw text data

        Raises:
            ValueError: If input_source is not properly configured
        """
        if self.input_source == 'files':
            return self._load_raw_from_files()
        elif self.input_source == 'couchdb':
            return self._load_raw_from_couchdb()
        else:
            raise ValueError(f"load_raw() not supported for input_source='{self.input_source}'")

    def _save_model_to_redis(self) -> None:
        """Save model to Redis using tar archive."""
        import json
        import tempfile
        import shutil
        import tarfile
        import io

        if self._model is None or self._feature_pipeline is None:
            raise ValueError("No model to save. Train a model first.")

        temp_dir = None
        try:
            # Create temporary directory
            temp_dir = tempfile.mkdtemp(prefix="skol_model_v2_")
            temp_path = Path(temp_dir)

            # Save feature pipeline
            pipeline_path = temp_path / "feature_pipeline"
            self._feature_pipeline.save(str(pipeline_path))

            # Save classifier model
            classifier_model = self._model.get_model()
            if classifier_model is None:
                raise ValueError("Classifier model not trained")
            classifier_path = temp_path / "classifier_model.h5"
            classifier_model.save(str(classifier_path))

            # Save metadata
            # For RNN models, save the actual model parameters (not the original params)
            if self.model_type == 'rnn':
                actual_model_params = {
                    'input_size': self._model.input_size,
                    'hidden_size': self._model.hidden_size,
                    'num_layers': self._model.num_layers,
                    'num_classes': self._model.num_classes,
                    'dropout': self._model.dropout,
                    'window_size': self._model.window_size,
                    'batch_size': self._model.batch_size,
                    'epochs': self._model.epochs,
                    'num_workers': self._model.num_workers,
                    'verbosity': self._model.verbosity,
                }
                if hasattr(self._model, 'prediction_stride'):
                    actual_model_params['prediction_stride'] = self._model.prediction_stride
                if hasattr(self._model, 'prediction_batch_size'):
                    actual_model_params['prediction_batch_size'] = self._model.prediction_batch_size
                if hasattr(self._model, 'name'):
                    actual_model_params['name'] = self._model.name
            else:
                actual_model_params = self.model_params

            metadata = {
                'label_mapping': self._label_mapping,
                'config': {
                    'line_level': self.line_level,
                    'use_suffixes': self.use_suffixes,
                    'min_doc_freq': self.min_doc_freq,
                    'model_type': self.model_type,
                    'model_params': actual_model_params
                },
                'version': '2.0'
            }
            metadata_path = temp_path / "metadata.json"
            with open(metadata_path, 'w') as f:
                json.dump(metadata, f, indent=2)

            # Create tar archive
            archive_buffer = io.BytesIO()
            with tarfile.open(fileobj=archive_buffer, mode='w:gz') as tar:
                tar.add(temp_path, arcname='.')

            # Save to Redis
            archive_data = archive_buffer.getvalue()
            self.redis_client.set(self.redis_key, archive_data)
            if self.redis_expire is not None:
                self.redis_client.expire(self.redis_key, self.redis_expire)

        finally:
            # Clean up temporary directory
            if temp_dir and Path(temp_dir).exists():
                shutil.rmtree(temp_dir)

    def _load_model_from_redis(self) -> None:
        """Load model from Redis tar archive."""
        import json
        import tempfile
        import shutil
        import tarfile
        import io
        from pyspark.ml import PipelineModel

        serialized = self.redis_client.get(self.redis_key)
        if not serialized:
            raise ValueError(f"No model found in Redis with key: {self.redis_key}")

        temp_dir = None
        try:
            # Create temporary directory
            temp_dir = tempfile.mkdtemp(prefix="skol_model_load_v2_")
            temp_path = Path(temp_dir)

            # Extract tar archive
            archive_buffer = io.BytesIO(serialized)
            with tarfile.open(fileobj=archive_buffer, mode='r:gz') as tar:
                tar.extractall(temp_path)

            # Load metadata first to know model type
            metadata_path = temp_path / "metadata.json"
            with open(metadata_path, 'r') as f:
                metadata = json.load(f)

            self._label_mapping = metadata['label_mapping']
            self._reverse_label_mapping = {v: k for k, v in self._label_mapping.items()}
            model_type = metadata['config']['model_type']

            # Load feature pipeline
            pipeline_path = temp_path / "feature_pipeline"
            self._feature_pipeline = PipelineModel.load(str(pipeline_path))

            # Load classifier model (different approach for RNN vs traditional ML)
            classifier_path = temp_path / "classifier_model.h5"

            if model_type == 'rnn':
                # For RNN models, load the Keras .h5 file directly
                from tensorflow import keras
                keras_model = keras.models.load_model(str(classifier_path))
                classifier_model = keras_model  # This is the Keras model itself
            else:
                # For traditional ML models, load as PipelineModel
                classifier_model = PipelineModel.load(str(classifier_path))

            # Recreate the SkolModel wrapper using factory
            features_col = self._feature_extractor.get_features_col() if self._feature_extractor else "combined_idf"

            # Merge saved model params with any new params provided in constructor
            # New params override saved params for runtime-tunable parameters
            saved_params = metadata['config'].get('model_params', {})
            merged_params = saved_params.copy()

            # Override runtime-tunable parameters if provided
            if self.model_params:
                # These parameters can be changed without retraining
                runtime_tunable = {
                    'prediction_batch_size',
                    'prediction_stride',
                    'num_workers',
                    'verbosity',
                    'batch_size'  # Training batch size, can be changed for future fine-tuning
                }
                for param, value in self.model_params.items():
                    if param in runtime_tunable:
                        merged_params[param] = value
                        if self.verbosity >= 2:
                            print(f"[Load Model] Overriding {param}: {saved_params.get(param)} -> {value}")

            self._model = create_model(
                model_type=model_type,
                features_col=features_col,
                label_col="label_indexed",
                **merged_params
            )
            self._model.set_model(classifier_model)
            self._model.set_labels(list(self._label_mapping.keys()))

        finally:
            # Clean up temporary directory
            if temp_dir and Path(temp_dir).exists():
                shutil.rmtree(temp_dir)
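
The two Redis methods above pack a directory of model files into a gzip-compressed tar archive held in memory, store the resulting bytes under one key, and reverse the process on load. The following is a minimal sketch of that round trip using only the standard library. The `fake_redis` dictionary, the `skol:demo-model` key, and the metadata contents are assumptions made for the illustration, and files are read straight out of the archive rather than extracted to disk.

```python
import io
import json
import posixpath
import tarfile
import tempfile
from pathlib import Path
from typing import Dict


def pack_directory(directory: Path) -> bytes:
    """Pack a directory into an in-memory .tar.gz and return the raw bytes."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        tar.add(str(directory), arcname=".")
    return buffer.getvalue()


def unpack_archive(data: bytes) -> Dict[str, bytes]:
    """Read every regular file in the archive into a {relative_path: bytes} dict."""
    files: Dict[str, bytes] = {}
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile():
                name = posixpath.normpath(member.name)  # "./x.json" -> "x.json"
                files[name] = tar.extractfile(member).read()
    return files


# A plain dict stands in for the Redis client (hypothetical, illustration only).
fake_redis: Dict[str, bytes] = {}

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)
    metadata = {"version": "2.0", "label_mapping": {"Description": 0}}
    (model_dir / "metadata.json").write_text(json.dumps(metadata))
    fake_redis["skol:demo-model"] = pack_directory(model_dir)

restored = unpack_archive(fake_redis["skol:demo-model"])
restored_metadata = json.loads(restored["metadata.json"])
print(restored_metadata)
```

Storing a single archive per key keeps the feature pipeline, the classifier, and the metadata together, so they cannot expire or be overwritten independently of one another.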



## Build a classifier to identify paragraph types.

We save the trained model to Redis so that we don't need to retrain it on every run.
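
The reuse-or-retrain decision can be sketched independently of Redis. In the example below, `ExpiringStore` is a hypothetical in-memory stand-in for a key-value store with per-key expiry, and `ensure_model`, the `demo-model` key, and the 60-second lifetime are likewise assumptions chosen for illustration. A fake clock makes the expiry visible without waiting.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class ExpiringStore:
    """Hypothetical in-memory key-value store with per-key expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._data: Dict[str, Tuple[bytes, Optional[float]]] = {}

    def set(self, key: str, value: bytes, ttl_seconds: Optional[float] = None) -> None:
        deadline = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._data[key] = (value, deadline)

    def exists(self, key: str) -> bool:
        entry = self._data.get(key)
        if entry is None:
            return False
        _, deadline = entry
        if deadline is not None and self._clock() >= deadline:
            del self._data[key]  # expired entries behave as if absent
            return False
        return True


def ensure_model(
    store: ExpiringStore,
    key: str,
    train: Callable[[], bytes],
    ttl_seconds: float,
    force: bool = False,
) -> bool:
    """Train and store only when forced or when no unexpired model exists.

    Returns True if training ran.
    """
    if force or not store.exists(key):
        store.set(key, train(), ttl_seconds)
        return True
    return False


now = [0.0]  # fake clock value, advanced by hand
store = ExpiringStore(clock=lambda: now[0])
train_calls: List[int] = []


def train() -> bytes:
    train_calls.append(1)
    return b"model-bytes"


first = ensure_model(store, "demo-model", train, ttl_seconds=60)   # trains
second = ensure_model(store, "demo-model", train, ttl_seconds=60)  # reuses
now[0] = 61.0                                                      # entry has expired
third = ensure_model(store, "demo-model", train, ttl_seconds=60)   # trains again
print(first, second, third, len(train_calls))
```

The expiry bounds how stale a stored model can become: once the key lapses, the next run retrains against whatever annotated data is current.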

In [21]:
# Train classifier on annotated data and save to Redis using SkolClassifierV2

if model_to_use == "rnn":
    model_config = {
        "name": "RNN BiLSTM (line-level, advanced config)",
        "model_type": "rnn",
        "use_suffixes": True,
        "line_level": True,
        "input_size": 1000,
        "hidden_size": 128,
        "num_layers": 2,
        "num_classes": 3,
        "dropout": 0.3,
        "window_size": 20,
        # "prediction_stride": 15,  # 25% overlap
        "prediction_stride": 20,  # 0 overlap
        "prediction_batch_size": 32,
        # "batch_size": 16,  # 442MiB footprint
        # "batch_size": 128,  # 570MiB footprint
        # "batch_size": 512,  # 1370MiB footprint
        # "batch_size": 1024,  # 2394MiB footprint
        # "batch_size": 2048,  # 4442MiB footprint, 5s per step
        # "batch_size": 4096,  #  4442MiB footprint, 8s-11s per step
        # "batch_size": 8192,  # 8538MiB footprint, 36s per step
        "batch_size": 16384,  # 16730MiB footprint, (3s) 38s-40s per step
    
        # "epochs": 4,
        # ==================================================
        # RNN Model Evaluation Statistics (Line-Level)
        # ==================================================
        # Test Accuracy:  0.7990
        # Test Precision: 0.7990
        # Test Recall:    1.0000
        # Test F1 Score:  0.7098
        # ==================================================    
        "epochs": 10,
        # ==================================================
        # RNN Model Evaluation Statistics (Line-Level)
        # ==================================================
        # Test Accuracy:  0.7990
        # Test Precision: 0.7990
        # Test Recall:    1.0000
        # Test F1 Score:  0.7098
        # ==================================================
    
        "num_workers": cores,
        "verbosity": 2,
    }
elif model_to_use == "logistic":
    model_config =  {
        "name": "Logistic Regression (line-level, words + suffixes)",
        "model_type": "logistic",
        "use_suffixes": True,
        "maxIter": 10,
        "regParam": 0.01,
        "line_level": True
    }
else:
    # Fail loudly on an unknown model name instead of exiting the interpreter
    raise ValueError(f"Unrecognized model: {model_to_use}")

if create_classifier or not redis_client.exists(classifier_model_name):
    # Get annotated training files
    annotated_path = Path.cwd().parent / "data" / "annotated"
    print(f"Loading annotated files from: {annotated_path}")

    if annotated_path.exists():
        annotated_files = get_file_list(str(annotated_path), pattern="**/*.ann")

        if len(annotated_files) > 0:
            print(f"Found {len(annotated_files)} annotated files")
            spark = make_spark_session()

            # Train using SkolClassifierV2 with unified API
            print("Training classifier with SkolClassifierV2...")
            classifier = SkolClassifierV2(
                spark=spark,

                # Input
                input_source='files',
                file_paths=annotated_files,

                # Model I/O
                auto_load_model=False,  # Fit a new model.
                model_storage='redis',
                redis_client=redis_client,
                redis_key=classifier_model_name,
                redis_expire=classifier_model_expire,


                # Output options
                output_dest='couchdb',
                couchdb_url=couchdb_url,
                couchdb_database=ingest_db_name,
                output_couchdb_suffix='.ann',
                
                # Model and preprocessing options
                **model_config
            )

            # Train the model
            results = classifier.fit()

            print("Training complete!")
            print(f"  Accuracy: {results.get('accuracy', 0):.4f}")
            print(f"  F1 Score: {results.get('f1_score', 0):.4f}")

            classifier.save_model()
            print(f"✓ Model saved to Redis with key: {classifier_model_name}")
        else:
            print(f"No annotated files found in {annotated_path}")
    else:
        print(f"Directory does not exist: {annotated_path}")
        print("Please ensure annotated training data is available.")
else:
    print(f"Skipping generation of model {classifier_model_name}.")

Loading annotated files from: /data/piggy/src/github.com/piggyatbaqaqi/skol/data/annotated
Found 190 annotated files
Training classifier with SkolClassifierV2...


  __version__ = __import__('pkg_resources').get_distribution('CouchDB').version
                                                                                

Training complete!
  Accuracy: 0.9083
  F1 Score: 0.9024
✓ Model saved to Redis with key: skol:classifier:model:rnn-v1.0


## Extract the taxa names and descriptions

We use the trained classifier to extract taxon names and descriptions from articles, issues, and books. The YEDDA-annotated texts are written back to CouchDB as `.ann` attachments.
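
For readers unfamiliar with YEDDA, the sketch below shows how a labeled block of the form `[@<text>#<Label>*]` can be pulled apart with a regular expression. It is a minimal illustration, not the parser SKOL uses: the `parse_yedda` helper and the sample text (including the taxon name) are invented for this example, and only the three label names come from this notebook.

```python
import re
from typing import List, Tuple

# YEDDA wraps each labeled span as "[@<text>#<Label>*]".
YEDDA_BLOCK = re.compile(r"\[@(.*?)#([A-Za-z_-]+)\*\]", re.DOTALL)


def parse_yedda(annotated: str) -> List[Tuple[str, str]]:
    """Return (label, text) pairs for every YEDDA block in the string.

    Hypothetical helper for illustration; not part of the SKOL code base.
    """
    return [
        (match.group(2), match.group(1).strip())
        for match in YEDDA_BLOCK.finditer(annotated)
    ]


# Invented sample text; the taxon name is fictional.
sample = (
    "[@Examplea fictiva sp. nov.#Nomenclature*]\n"
    "[@Pileus 20 mm wide, convex, red.#Description*]\n"
    "[@We thank the reviewers.#Misc-exposition*]"
)

blocks = parse_yedda(sample)
for label, text in blocks:
    print(f"{label:16s} {text}")
```

Each block yields one `(label, text)` pair, which is the shape of data the later Taxon-building steps consume.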

In [22]:
# Predict from CouchDB and save back to CouchDB using SkolClassifierV2
if add_annotations:
    print("Initializing classifier with unified V2 API...")
    
    model_config2 = model_config.copy()
    model_config2.update({
        "num_workers": cores,
        "prediction_batch_size": 96,
        "verbosity": 1,
    })
    
    spark.stop()
    spark = make_spark_session()
    
    classifier = SkolClassifierV2(
        spark=spark,
        input_source='couchdb',
        couchdb_url=couchdb_url,
        couchdb_database=ingest_db_name,
        couchdb_username=couchdb_username,
        couchdb_password=couchdb_password,
        couchdb_pattern='*.txt',
        output_dest='couchdb',
        output_couchdb_suffix='.ann',
        model_storage='redis',
        redis_client=redis_client,
        redis_key=classifier_model_name,
        auto_load_model=True,
        coalesce_labels=True,
        output_format='annotated',
        **model_config2
    )
    
    print(f"Model loaded from Redis: {classifier_model_name}")
    
    # Load, predict, and save in a streamlined workflow
    print("\nLoading and classifying documents from CouchDB...")
    raw_df = classifier.load_raw()
    print(f"Loaded {raw_df.count()} text documents")
    raw_df.show(10)
    
    print("\nMaking predictions...")
    predictions = classifier.predict(raw_df)
    
    # Show sample predictions
    print("\nSample predictions:")
    predictions.select(
        "doc_id", "attachment_name", "predicted_label"
    ).show(5, truncate=50)
    
    # Save results back to CouchDB
    print("\nSaving predictions back to CouchDB...")
    classifier.save_annotated(predictions)
    
    print(f"\n✓ Predictions saved to CouchDB as .ann attachments")
else:
    print("\n Skipping annotations.")

Initializing classifier with unified V2 API...


  tar.extractall(temp_path)


Model loaded from Redis: skol:classifier:model:rnn-v1.0

Loading and classifying documents from CouchDB...


  __version__ = __import__('pkg_resources').get_distribution('CouchDB').version
  __version__ = __import__('pkg_resources').get_distribution('CouchDB').version
                                                                                

Loaded 1160796 text documents


                                                                                

+--------------------+--------------------+---------------+--------------------+-----------+
|              doc_id|           human_url|attachment_name|               value|line_number|
+--------------------+--------------------+---------------+--------------------+-----------+
|006b331e284e4dc8b...|https://www.ingen...|    article.txt|           MYCOTAXON|          0|
|006b331e284e4dc8b...|https://www.ingen...|    article.txt|Volume 108, pp. 2...|          1|
|006b331e284e4dc8b...|https://www.ingen...|    article.txt|     April–June 2009|          2|
|006b331e284e4dc8b...|https://www.ingen...|    article.txt|Rattania setulife...|          3|
|006b331e284e4dc8b...|https://www.ingen...|    article.txt|an undescribed en...|          4|
|006b331e284e4dc8b...|https://www.ingen...|    article.txt|on rattans from W...|          5|
|006b331e284e4dc8b...|https://www.ingen...|    article.txt|Ashish Prabhugaon...|          6|
|006b331e284e4dc8b...|https://www.ingen...|    article.txt|*ashishprab

  __version__ = __import__('pkg_resources').get_distribution('CouchDB').version
  __version__ = __import__('pkg_resources').get_distribution('CouchDB').version
Exception ignored in: <_io.BufferedWriter name=5>                               
Traceback (most recent call last):
  File "/data/piggy/miniconda3/envs/skol3/lib/python3.13/site-packages/pyspark/python/lib/pyspark.zip/pyspark/daemon.py", line 193, in manager
BrokenPipeError: [Errno 32] Broken pipe


+--------------------------------+---------------+---------------+
|                          doc_id|attachment_name|predicted_label|
+--------------------------------+---------------+---------------+
|0020c88329ed456a95a18e0c219269f4|    article.txt|Misc-exposition|
|0020c88329ed456a95a18e0c219269f4|    article.txt|Misc-exposition|
|0020c88329ed456a95a18e0c219269f4|    article.txt|Misc-exposition|
|0020c88329ed456a95a18e0c219269f4|    article.txt|Misc-exposition|
|0020c88329ed456a95a18e0c219269f4|    article.txt|Misc-exposition|
+--------------------------------+---------------+---------------+
only showing top 5 rows


Saving predictions back to CouchDB...


  __version__ = __import__('pkg_resources').get_distribution('CouchDB').version
[Stage 42:>                                                         (0 + 2) / 2]


✓ Predictions saved to CouchDB as .ann attachments




In [23]:
if add_annotations:
    predictions.select("predicted_label", "annotated_value").where('predicted_label = "Nomenclature"').show()
    predictions.groupBy("predicted_label").count().orderBy("count").show()

Exception ignored in: <_io.BufferedWriter name=5>
Traceback (most recent call last):
  File "/data/piggy/miniconda3/envs/skol3/lib/python3.13/site-packages/pyspark/python/lib/pyspark.zip/pyspark/daemon.py", line 193, in manager
BrokenPipeError: [Errno 32] Broken pipe


+---------------+--------------------+
|predicted_label|     annotated_value|
+---------------+--------------------+
|   Nomenclature|[@ s.l., Candelar...|
|   Nomenclature|[@ Szatala Ö. 195...|
|   Nomenclature|[@ Verseghy K. 19...|
|   Nomenclature|[@ Marasmius pseu...|
|   Nomenclature|[@ Rattania Prabh...|
|   Nomenclature|[@ Rattania setul...|
|   Nomenclature|[@ Masseeella flu...|
|   Nomenclature|[@ Sydow H, Petra...|
|   Nomenclature|[@ Pseudosperma a...|
|   Nomenclature|[@ Pseudosperma a...|
|   Nomenclature|[@ Pseudosperma b...|
|   Nomenclature|[@ Korf RP. 1959....|
|   Nomenclature|[@ Serendipita sa...|
|   Nomenclature|[@ Trichiès G. 20...|
|   Nomenclature|[@ Anonymous. 194...|
|   Nomenclature|[@ Carolina. J. B...|
|   Nomenclature|[@ Soc. Nat. Sci....|
|   Nomenclature|[@ Lamson-Scribne...|
|   Nomenclature|[@ Peck CH. 1899....|
|   Nomenclature|[@ Setchell WA. 1...|
+---------------+--------------------+
only showing top 20 rows



  __version__ = __import__('pkg_resources').get_distribution('CouchDB').version

+---------------+-------+
|predicted_label|  count|
+---------------+-------+
|   Nomenclature|   9269|
|    Description|  92598|
|Misc-exposition|1058929|
+---------------+-------+



                                                                                

Here we estimate how many Taxon structures we should expect to find. The abbreviation "nov." (Latin "novum" or "nova", as in "gen. nov." or "sp. nov.") marks a taxon that is new in the current article. The number of Nomenclature blocks containing "nov." should be a lower bound on the number of taxa, because it is not unusual to redescribe a species, e.g. in a survey article or a monograph on a genus.

In [24]:
if add_annotations:
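    # Count Nomenclature blocks that mention "nov." and print the result;
    # a bare expression inside an `if` block is not displayed by the notebook.
    nov_count = (
        predictions
        .filter(col("annotated_value").contains("nov."))
        .where("predicted_label = 'Nomenclature'")
        .count()
    )
    print(f"Nomenclature blocks containing 'nov.': {nov_count}")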

  __version__ = __import__('pkg_resources').get_distribution('CouchDB').version
                                                                                

## Build the Taxon objects and store them in CouchDB
We use CouchDB to store a full record for each taxon. We copy all metadata to the taxon records.

In [25]:
from couchdb_file import CouchDBFile as CDBF
class CouchDBFile(CDBF):
    """
    File-like object that reads from CouchDB attachment content.

    This class extends FileObject to support reading text from CouchDB
    attachments while preserving database metadata (doc_id, attachment_name,
    and database name).
    """

    _doc_id: str
    _attachment_name: str
    _db_name: str
    _human_url: Optional[str]
    _content_lines: List[str]

    def __init__(
        self,
        content: str,
        doc_id: str,
        attachment_name: str,
        db_name: str,
        human_url: Optional[str] = None
    ) -> None:
        """
        Initialize CouchDBFile from attachment content.

        Args:
            content: Text content from CouchDB attachment
            doc_id: CouchDB document ID
            attachment_name: Name of the attachment (e.g., "article.txt.ann")
            db_name: Database name where document is stored (ingest_db_name)
            human_url: Optional URL from the CouchDB row
        """
        self._doc_id = doc_id
        self._attachment_name = attachment_name
        self._db_name = db_name
        self._human_url = human_url
        self._line_number = 0
        self._page_number = 1
        self._empirical_page_number = None

        # Split content into lines
        self._content_lines = content.split('\n')

    def _get_content_iterator(self) -> Iterator[str]:
        """Get iterator over content lines."""
        return iter(self._content_lines)

    @property
    def filename(self) -> str:
        """
        Return a composite identifier for CouchDB documents.

        Format: db_name/doc_id/attachment_name
        This allows tracking the source of each line.
        """
        return f"{self._db_name}/{self._doc_id}/{self._attachment_name}"

    @property
    def doc_id(self) -> str:
        """CouchDB document ID."""
        return self._doc_id

    @property
    def attachment_name(self) -> str:
        """Attachment filename."""
        return self._attachment_name

    @property
    def db_name(self) -> str:
        """Database name (ingest_db_name)."""
        return self._db_name

    @property
    def human_url(self) -> Optional[str]:
        """URL from the CouchDB row."""
        return self._human_url


# Module-level functions for reading CouchDB data

def read_couchdb_partition(
    partition: Iterator[Row],
    db_name: str
) -> Iterator[Line]:
    """
    Read annotated files from CouchDB rows in a PySpark partition.

    This is the UDF alternative to read_files() for CouchDB-backed data.
    It processes rows containing CouchDB attachment content and yields
    Line objects that preserve database metadata.

    Args:
        partition: Iterator of PySpark Rows with columns:
            - doc_id: CouchDB document ID
            - attachment_name: Attachment filename
            - value: Text content from attachment
        db_name: Database name to store in metadata (ingest_db_name)

    Yields:
        Line objects with content and CouchDB metadata (doc_id, attachment_name, db_name)

    Example:
        >>> # In a PySpark context
        >>> from pyspark.sql.functions import col
        >>> from couchdb_file import read_couchdb_partition
        >>>
        >>> # Assume df has columns: doc_id, attachment_name, value
        >>> def process_partition(partition):
        ...     lines = read_couchdb_partition(partition, "mycobank")
        ...     # Process lines with finder.parse_annotated()
        ...     return lines
        >>>
        >>> result = df.rdd.mapPartitions(process_partition)
    """
    for row in partition:
        # Extract url from row if available
        human_url = getattr(row, 'human_url', None)

        # Create CouchDBFile object from row data
        file_obj = CouchDBFile(
            content=row.value,
            doc_id=row.doc_id,
            attachment_name=row.attachment_name,
            db_name=db_name,
            human_url=human_url
        )

        # Yield all lines from this file
        yield from file_obj.read_line()


def read_couchdb_rows(
    rows: List[Row],
    db_name: str
) -> Iterator[Line]:
    """
    Read annotated files from a list of CouchDB rows.

    This is a convenience function for non-distributed processing or testing.
    For production use with PySpark, use read_couchdb_partition().

    Args:
        rows: List of Rows with columns:
            - doc_id: CouchDB document ID
            - attachment_name: Attachment filename
            - value: Text content from attachment
        db_name: Database name to store in metadata

    Yields:
        Line objects with content and CouchDB metadata

    Example:
        >>> from couchdb_file import read_couchdb_rows
        >>>
        >>> # Collect rows from DataFrame
        >>> rows = df.collect()
        >>>
        >>> # Process all lines
        >>> lines = read_couchdb_rows(rows, "mycobank")
        >>> paragraphs = parse_annotated(lines)
        >>> taxa = group_paragraphs(paragraphs)
    """
    return read_couchdb_partition(iter(rows), db_name)


def read_couchdb_files_from_connection(
    conn,  # CouchDBConnection
    spark,  # SparkSession
    db_name: str,
    pattern: str = "*.txt.ann"
) -> Iterator[Line]:
    """
    Load and read annotated files from CouchDB using CouchDBConnection.

    This function integrates CouchDBConnection.load_distributed() with
    read_couchdb_rows() to provide a complete pipeline from database to lines.

    Args:
        conn: CouchDBConnection instance
        spark: SparkSession
        db_name: Database name for metadata (ingest_db_name)
        pattern: Pattern for attachment names (default: "*.txt.ann")

    Returns:
        Iterator of Line objects with CouchDB metadata

    Example:
        >>> from skol_classifier.couchdb_io import CouchDBConnection
        >>> from couchdb_file import read_couchdb_files_from_connection
        >>> from finder import parse_annotated
        >>> from taxon import group_paragraphs
        >>>
        >>> # Connect to CouchDB
        >>> conn = CouchDBConnection(
        ...     "http://localhost:5984",
        ...     "mycobank",
        ...     "user",
        ...     "pass"
        ... )
        >>>
        >>> # Load files
        >>> lines = read_couchdb_files_from_connection(
        ...     conn, spark, "mycobank", "*.txt.ann"
        ... )
        >>>
        >>> # Parse and extract taxa
        >>> paragraphs = parse_annotated(lines)
        >>> taxa = group_paragraphs(paragraphs)
    """
    # Load data from CouchDB
    df = conn.load_distributed(spark, pattern)

    # Collect all rows to the driver. This is only suitable for small datasets;
    # for distributed processing, use read_couchdb_partition() with mapPartitions().
    rows = df.collect()

    # Read lines with metadata
    return read_couchdb_rows(rows, db_name)



## Build Taxon objects

Here we extract the Taxon objects from the annotated attachments.
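
To make the idea concrete, here is a minimal sketch of how labeled paragraphs might be grouped into taxon records. It assumes a simple rule (a run of Nomenclature paragraphs followed by the Description paragraphs that come after it); the real `group_paragraphs` logic may differ, and `group_into_taxa` is a hypothetical helper written only for this illustration. The `taxon` and `description` keys mirror the field names in the schema printed later in this notebook.

```python
from typing import Dict, Iterable, List, Tuple


def group_into_taxa(paragraphs: Iterable[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Pair each run of Nomenclature paragraphs with the Descriptions that follow.

    Hypothetical sketch; not the SKOL implementation.
    """
    taxa: List[Dict[str, str]] = []
    names: List[str] = []
    descriptions: List[str] = []

    def flush() -> None:
        # Emit a record only when both a name and a description were seen.
        if names and descriptions:
            taxa.append({
                "taxon": "\n".join(names),
                "description": "\n".join(descriptions),
            })
        names.clear()
        descriptions.clear()

    for label, text in paragraphs:
        if label == "Nomenclature":
            if descriptions:
                flush()  # a name after descriptions starts a new taxon
            names.append(text)
        elif label == "Description":
            if names:
                descriptions.append(text)  # skip descriptions with no name
        # Misc-exposition paragraphs are ignored in this sketch.
    flush()
    return taxa


# Invented paragraphs for illustration.
paragraphs = [
    ("Misc-exposition", "Introduction."),
    ("Nomenclature", "Examplea fictiva sp. nov."),
    ("Description", "Pileus red."),
    ("Description", "Spores globose."),
    ("Misc-exposition", "Notes."),
    ("Nomenclature", "Examplea altera sp. nov."),
    ("Description", "Pileus blue."),
]
taxa = group_into_taxa(paragraphs)
for record in taxa:
    print(record)
```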

In [26]:
ingest_couchdb_url = couchdb_url
ingest_username = couchdb_username
ingest_password = couchdb_password
taxon_couchdb_url = couchdb_url
taxon_username = couchdb_username
taxon_password = couchdb_password
pattern = '*.txt.ann'

In [27]:
from extract_taxa_to_couchdb import (
    TaxonExtractor as TE,
    generate_taxon_doc_id,
    extract_taxa_from_partition,
    convert_taxa_to_rows
)

class TaxonExtractor(TE):
    pass

In [28]:
# Create TaxonExtractor instance with database configuration
spark.stop()
spark = make_spark_session()

extractor = TaxonExtractor(
    spark=spark,
    ingest_couchdb_url=ingest_couchdb_url,
    ingest_db_name=ingest_db_name,
    taxon_db_name=taxon_db_name,
    ingest_username=ingest_username,
    ingest_password=ingest_password,
    taxon_username=taxon_username,
    taxon_password=taxon_password
)

print("TaxonExtractor initialized")
print(f"  Ingest DB: {ingest_db_name}")
print(f"  Taxon DB:  {taxon_db_name}")

TaxonExtractor initialized
  Ingest DB: skol_dev
  Taxon DB:  skol_taxa_dev


In [29]:
# Step 1: Load annotated documents
if build_taxon:
    print("\nStep 1: Loading annotated documents from CouchDB...")
    annotated_df = extractor.load_annotated_documents(pattern='*.txt.ann')
    print(f"Loaded {annotated_df.count()} annotated documents")
    annotated_df.show(5, truncate=False)


Step 1: Loading annotated documents from CouchDB...


  __version__ = __import__('pkg_resources').get_distribution('CouchDB').version
  __version__ = __import__('pkg_resources').get_distribution('CouchDB').version
[Stage 0:>                                                          (0 + 2) / 2]

Loaded 2099 annotated documents


                                                                                

+--------------------------------+------------------------------------------------------------------------------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

  __version__ = __import__('pkg_resources').get_distribution('CouchDB').version
Exception ignored in: <_io.BufferedWriter name=5>
Traceback (most recent call last):
  File "/data/piggy/miniconda3/envs/skol3/lib/python3.13/site-packages/pyspark/python/lib/pyspark.zip/pyspark/daemon.py", line 193, in manager
BrokenPipeError: [Errno 32] Broken pipe


In [30]:
if build_taxon:
    # Step 2: Extract taxa to DataFrame
    print("\nStep 2: Extracting taxa from annotated documents...")
    taxa_df = extractor.extract_taxa(annotated_df)
    print(f"Extracted {taxa_df.count()} taxa")
    taxa_df.printSchema()
    taxa_df.show(10, truncate=False)


Step 2: Extracting taxa from annotated documents...
[TaxonExtractor] Input DataFrame columns: ['doc_id', 'human_url', 'attachment_name', 'value']
[TaxonExtractor] Filtered DataFrame columns: ['doc_id', 'value', 'attachment_name']
[TaxonExtractor] Filtered DataFrame schema:
root
 |-- doc_id: string (nullable = false)
 |-- value: string (nullable = false)
 |-- attachment_name: string (nullable = false)



  __version__ = __import__('pkg_resources').get_distribution('CouchDB').version
  __version__ = __import__('pkg_resources').get_distribution('CouchDB').version
[Stage 4:>                                                          (0 + 2) / 2]
=== parse_annotated Label Summary ===
Total labels counted: 46347

Label distribution:
  Misc-exposition            23583 ( 50.9%)
  Description                17476 ( 37.7%)
  Nomenclature                5287 ( 11.4%)
  None                           1 (  0.0%)

=== parse_annotated Label Summary ===
Total labels counted: 46616

Label distribution:
  Misc-exposition            23690 ( 50.8%)
  Description                17481 ( 37.5%)
  Nomenclature                5444 ( 11.7%)
  None                           1 (  0.0%)
                                                                                

Extracted 5239 taxa
root
 |-- taxon: string (nullable = false)
 |-- description: string (nullable = false)
 |-- source: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- line_number: integer (nullable = true)
 |-- paragraph_number: integer (nullable = true)
 |-- page_number: integer (nullable = true)
 |-- empirical_page_number: string (nullable = true)
 |-- _id: string (nullable = true)
 |-- json_annotated: string (nullable = true)



[Stage 7:>                                                          (0 + 1) / 1]

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+-----------+----------------+-----------+----------------


Exception ignored in: <_io.BufferedWriter name=5>                               
Traceback (most recent call last):
  File "/data/piggy/miniconda3/envs/skol3/lib/python3.13/site-packages/pyspark/python/lib/pyspark.zip/pyspark/daemon.py", line 193, in manager
BrokenPipeError: [Errno 32] Broken pipe


In [31]:
if build_taxon:
    # Step 3: Inspect actual Taxon objects from the RDD (optional debugging)
    print("\n=== Sample Taxon Objects ===")
    taxa_rdd = annotated_df.rdd.mapPartitions(
        lambda partition: extract_taxa_from_partition(iter(partition), ingest_db_name)  # type: ignore[reportUnknownArgumentType]
    )
    for i, taxon in enumerate(taxa_rdd.take(3)):
        print(f"\nTaxon {i+1}:")
        print(f"  Type: {type(taxon)}")
        print(f"  Has nomenclature: {taxon.has_nomenclature()}")
        taxon_row = taxon.as_row()
        print(f"  Taxon name: {taxon_row['taxon'][:80]}...")
        print(f"  Source: {taxon_row['source']}")


=== Sample Taxon Objects ===


  __version__ = __import__('pkg_resources').get_distribution('CouchDB').version
[Stage 8:>                                                          (0 + 1) / 1]


Taxon 1:
  Type: <class 'taxon.Taxon'>
  Has nomenclature: True
  Taxon name:  2. Caloplaca brachyspora Mereschk., Lich. Ross. Exs., fasc. 22, no. 276 (1913)
...
  Source: {'doc_id': '0020c88329ed456a95a18e0c219269f4', 'human_url': 'https://www.ingentaconnect.com/content/mtax/mt/2010/00000111/00000001/art00033', 'db_name': 'skol_dev'}

Taxon 2:
  Type: <class 'taxon.Taxon'>
  Has nomenclature: True
  Taxon name:  5. Caloplaca gyalolechiiformis Szatala, Ann. Hist.-Nat. Mus. Natl. Hungarici, s...
  Source: {'doc_id': '0020c88329ed456a95a18e0c219269f4', 'human_url': 'https://www.ingentaconnect.com/content/mtax/mt/2010/00000111/00000001/art00033', 'db_name': 'skol_dev'}

Taxon 3:
  Type: <class 'taxon.Taxon'>
  Has nomenclature: True
  Taxon name:  7. Caloplaca lactea var. subimmersa Szatala, Ann. Hist.-Nat. Mus. Natl. Hungari...
  Source: {'doc_id': '0020c88329ed456a95a18e0c219269f4', 'human_url': 'https://www.ingentaconnect.com/content/mtax/mt/2010/00000111/00000001/art00033', 'db_name'


=== parse_annotated Label Summary ===
Total labels counted: 46616

Label distribution:
  Misc-exposition            23690 ( 50.8%)
  Description                17481 ( 37.5%)
  Nomenclature                5444 ( 11.7%)
  None                           1 (  0.0%)
                                                                                

In [32]:
if build_taxon:
    # Step 4: Save taxa to CouchDB
    print("\nStep 4: Saving taxa to CouchDB...")
    results_df = extractor.save_taxa(taxa_df)
    
    # Show detailed results
    results_df.groupBy("success").count().show(truncate=False)
    
    # If there are failures, show error messages
    print("\nError messages:")
    results_df.filter("success = false").select("error_message").distinct().show(truncate=False)


Step 4: Saving taxa to CouchDB...


  __version__ = __import__('pkg_resources').get_distribution('CouchDB').version
[Stage 9:>                                                          (0 + 2) / 2]
=== parse_annotated Label Summary ===
Total labels counted: 46347

Label distribution:
  Misc-exposition            23583 ( 50.9%)
  Description                17476 ( 37.7%)
  Nomenclature                5287 ( 11.4%)
  None                           1 (  0.0%)
  __version__ = __import__('pkg_resources').get_distribution('CouchDB').version

=== parse_annotated Label Summary ===
Total labels counted: 46616

Label distribution:
  Misc-exposition            23690 ( 50.8%)
  Description                17481 ( 37.5%)
  Nomenclature                5444 ( 11.7%)
  None                           1 (  0.0%)
  __version__ = __import__('pkg_resources').get_distribution('CouchDB').version
                                                                                

+-------+-----+
|success|count|
+-------+-----+
|true   |5239 |
+-------+-----+


Error messages:


[Stage 12:>                                                         (0 + 2) / 2]

+-------------+
|error_message|
+-------------+
+-------------+



In [33]:
# Alternative: Run the complete pipeline in one step
# Uncomment to use the simplified one-step approach:

# print("\nRunning complete pipeline...")
# results = extractor.run_pipeline(pattern='*.txt.ann')
#
# successful = results.filter("success = true").count()
# failed = results.filter("success = false").count()
#
# print(f"\nPipeline Results:")
# print(f"  Successful: {successful}")
# print(f"  Failed:     {failed}")
#
# results.groupBy("success").count().show(truncate=False)

### Observations on the classification models

The line-by-line classification model is classifying many Description lines as Misc-exposition. It works reasonably well for Nomenclature.

The problem with the paragraph classification model is that the heuristic paragraph parser does not generalize well to the more modern journals.

One possible approach to investigate is adding heuristics to the label-merging code to convert some Misc-exposition lines to Description if they are surrounded by Description paragraphs.
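
A minimal sketch of such a heuristic follows. It works on a list of per-line labels rather than on paragraphs, and it relabels a run of Misc-exposition lines only when the run is short and has Description on both sides. The function name and the `max_gap` threshold are assumptions made for this example; neither exists in the SKOL code, and the right threshold would have to be tuned against annotated data.

```python
from typing import List


def absorb_misc_into_description(labels: List[str], max_gap: int = 2) -> List[str]:
    """Relabel short Misc-exposition runs that sit between Description lines.

    Hypothetical post-processing step; `max_gap` is an untuned assumption.
    """
    smoothed = list(labels)
    n = len(labels)
    i = 0
    while i < n:
        if labels[i] != "Misc-exposition":
            i += 1
            continue
        # Find the end of this run of Misc-exposition labels.
        j = i
        while j < n and labels[j] == "Misc-exposition":
            j += 1
        run_length = j - i
        bounded = (
            i > 0
            and j < n
            and labels[i - 1] == "Description"
            and labels[j] == "Description"
        )
        if bounded and run_length <= max_gap:
            smoothed[i:j] = ["Description"] * run_length
        i = j
    return smoothed


labels = [
    "Nomenclature",
    "Description",
    "Misc-exposition",   # short gap between Descriptions: relabeled
    "Description",
    "Misc-exposition",   # run of three exceeds max_gap: left alone
    "Misc-exposition",
    "Misc-exposition",
    "Description",
    "Misc-exposition",   # not followed by Description: left alone
]
smoothed = absorb_misc_into_description(labels, max_gap=2)
print(smoothed)
```

Keeping the gap limit small is a deliberate choice: it avoids swallowing genuine commentary that happens to fall between two descriptions.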

We put substantial effort into a model with some memory. Specifically, we use a bidirectional LSTM (RNN) with a sliding window, which should be better at detecting context. The computational demands tested our available compute platform. As of this writing, the model performs terribly: out of a million lines of text, it labels only 73 lines as Description, 175223 as UNKNOWN_None, none as Nomenclature, and the rest as Misc-exposition. As expected, increasing the number of epochs increases training accuracy and decreases the loss, but, confusingly, the model performs identically on the test set no matter how many epochs are used.
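
One possible explanation for the epoch-independent test results is that the model has collapsed to predicting the majority class almost everywhere. The arithmetic below checks whether the evaluation statistics recorded with the RNN configuration earlier in this notebook (accuracy 0.7990, precision 0.7990, recall 1.0000, F1 0.7098) are consistent with that hypothesis. It assumes that the reported F1 is a support-weighted average over classes and that the majority class makes up 79.9% of the test lines; both are assumptions, since the evaluator's averaging scheme is not shown here.

```python
def constant_predictor_metrics(majority_share: float) -> dict:
    """Metrics for a model that always predicts the majority class.

    Assumes a support-weighted F1; this is an assumption about the evaluator.
    """
    p = majority_share
    accuracy = p              # only majority-class lines are labeled correctly
    majority_precision = p    # every prediction is the majority class
    majority_recall = 1.0     # every majority-class line is recovered
    majority_f1 = (
        2 * majority_precision * majority_recall
        / (majority_precision + majority_recall)
    )
    # Other classes are never predicted, so their F1 is 0 and only the
    # majority class contributes to the support-weighted average.
    weighted_f1 = p * majority_f1
    return {
        "accuracy": accuracy,
        "majority_precision": majority_precision,
        "majority_recall": majority_recall,
        "weighted_f1": weighted_f1,
    }


metrics = constant_predictor_metrics(0.7990)
for name, value in metrics.items():
    print(f"{name:20s} {value:.4f}")
```

Under these assumptions the weighted F1 comes out within rounding error of the reported 0.7098, and the precision and recall match as well. That agreement does not prove a collapse, but it suggests comparing the distribution of predicted labels on the test set against the distribution of true labels before investing in further training.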

It may become necessary to hand annotate some of the more modern journals.

## Dr. Drafts document embedding

Dr. Drafts is the framework we use to embed all the descriptions into a searchable space. SBERT is a model that can embed sentences into a semantic space such that sentences with similar meaning are near each other. The data structure that we build here is central to the eventual function of the SKOL web site.

Dr. Drafts loads taxon documents from CouchDB and builds an embedding, which it saves to Redis.
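
The sketch below illustrates what "searchable space" means in practice: once every description is a vector, a query is answered by ranking the stored vectors by cosine similarity to the query vector. The three-dimensional vectors and the names `taxon_a`, `taxon_b`, and `taxon_c` are toy values invented for this example; real SBERT embeddings have hundreds of dimensions, and Dr. Drafts may use a different similarity measure or index.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_taxa(
    query: Sequence[float],
    embeddings: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k (name, similarity) pairs, most similar first."""
    scored = [
        (name, cosine_similarity(query, vector))
        for name, vector in embeddings.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy embeddings, invented for illustration only.
toy_embeddings = {
    "taxon_a": [0.9, 0.1, 0.0],
    "taxon_b": [0.1, 0.9, 0.0],
    "taxon_c": [0.7, 0.3, 0.1],
}
query_vector = [1.0, 0.0, 0.0]

ranked = nearest_taxa(query_vector, toy_embeddings, top_k=2)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```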

In [34]:
from dr_drafts_mycosearch.data import SKOL_TAXA as STX

class SKOL_TAXA(STX):
    """Data interface for Synoptic Key of Life Taxa in CouchDB."""
    def load_data(self):
        """Load taxon data from CouchDB into a pandas DataFrame."""
        # Connect to CouchDB
        server = couchdb.Server(self.couchdb_url)
        if self.username and self.password:
            server.resource.credentials = (self.username, self.password)

        # Access the database
        if self.db_name not in server:
            raise ValueError(f"Database '{self.db_name}' not found in CouchDB server")

        db = server[self.db_name]

        # Fetch all documents from the database
        records = []
        for doc_id in db:
            # Skip design documents
            if doc_id.startswith('_design/'):
                continue

            doc = db[doc_id]
            print(f"DEBUG: doc: {doc}")  # Debugging line to inspect document structure
            records.append(doc)

        if not records:
            # Create empty DataFrame if no records found
            self.df = pd.DataFrame()
            print(f"Warning: No taxon records found in database '{self.db_name}'")
            return

        # Convert to DataFrame
        self.df = pd.DataFrame(records)
        # assert self.df.iloc[0]['source']['human_url'].startswith('http'), "Expected 'source.url' to start with 'http'"





In [35]:
import pickle

import redis

from dr_drafts_mycosearch.compute_embeddings import EmbeddingsComputer as EC
class EmbeddingsComputer(EC):
    """Class for computing and storing embeddings from narrative data."""
    
    def write_embeddings_to_redis(self):
        """Write embeddings to Redis using instance configuration."""
        if self.redis_username and self.redis_password:
            r = redis.from_url(self.redis_url, username=self.redis_username, password=self.redis_password, db=self.redis_db)
        else:
            r = redis.from_url(self.redis_url, db=self.redis_db)

        pickled_data = pickle.dumps(self.result)
        r.set(self.embedding_name, pickled_data)
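        # The constructor takes `redis_expire`; also accept the `redist_expire`
        # spelling in case the base class stores the value under that name.
        expire = getattr(self, "redis_expire", getattr(self, "redist_expire", None))
        if expire is not None:
            r.expire(self.embedding_name, expire)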
        print(f'Embeddings written to Redis (db={self.redis_db}) with key: {self.embedding_name}')
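
The store-and-expire pattern used above can be sketched in isolation. This is a minimal illustration under stated assumptions: `InMemoryClient` is a stand-in we invented so the example runs without a Redis server, and the key name and helper functions are hypothetical. The only client calls it mimics are `set(name, value, ex=...)` and `get(name)`, which redis-py provides.

```python
import pickle
from typing import Any, Dict, Optional


class InMemoryClient:
    """Stand-in for a Redis client so the sketch runs without a server.

    It mimics only the two calls used below; the expiry is recorded
    but not enforced.
    """

    def __init__(self) -> None:
        self.values: Dict[str, bytes] = {}
        self.ttl: Dict[str, Optional[int]] = {}

    def set(self, name: str, value: bytes, ex: Optional[int] = None) -> bool:
        self.values[name] = value
        self.ttl[name] = ex
        return True

    def get(self, name: str) -> Optional[bytes]:
        return self.values.get(name)


def store_pickled(client: Any, key: str, obj: Any, expire_seconds: Optional[int] = None) -> None:
    """Pickle `obj` and store it under `key`, setting value and TTL in one call."""
    client.set(key, pickle.dumps(obj), ex=expire_seconds)


def load_pickled(client: Any, key: str) -> Any:
    """Return the unpickled object, or None if the key is missing."""
    raw = client.get(key)
    return None if raw is None else pickle.loads(raw)


client = InMemoryClient()
one_day = 60 * 60 * 24  # the embedding is meant to expire after 24 hours
store_pickled(client, "skol:embedding:demo", {"F0": [0.1, 0.2]}, expire_seconds=one_day)
restored = load_pickled(client, "skol:embedding:demo")
```

Passing the expiry together with the value avoids a window in which the key exists without a time-to-live. Because unpickling can execute arbitrary code, this pattern is only appropriate when the Redis instance holds data we wrote ourselves.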


## Compute Embeddings

We use SBERT to embed the taxa into a search space.

In [36]:
skol_taxa = SKOL_TAXA(
    couchdb_url="http://localhost:5984",
    username=couchdb_username,
    password=couchdb_password,
    db_name=taxon_db_name
)
descriptions = skol_taxa.get_descriptions()

DEBUG: doc: <Document 'taxon_000aaabf9fb6bcb8ff1c66e1ddeb8c30599c1283be703156d50e45bc8779df25'@'10-bc082e2669082253774b61af8ec6d71d' {'taxon': ' added Valsaria Ces. & De Not. and Valsonectria Speg., both with uniseptate\n\n\n Mattirolia Berl. & Bres., Annuario Soc. Alpin. Trident. 14: 351 (1889)\n\n\n = Thyronectroidea Seaver, Mycologia 1: 206 (1909)\n\n\n = Balzania Speg., Anales Mus. Nac. Buenos Aires 6: 286 (1898)\n\n', 'description': ' Stroma variable, usually present, erumpent, covered with loosely interwoven\nyellowish or brownish hyphae, KOH negative. Perithecia globose, semiimmersed or isolated in the stroma. Paraphyses abundant. Asci unitunicate,\ncylindrical or clavate, non amyloid. Ascospores smooth, muriform, hyaline to\n\n\n Stroma subcortical, pulvinate, 0.5–6 mm diam. Perithecia aggregated,\nimmersed in the stroma or more rarely isolated, globose, 400–450 µm diam.,\nblack, surrounded by a yellowish tomentum, 30–50 µm thick, formed by\nhyphae 4–5 µm diam. Peridium pseudop


DEBUG: doc: <Document 'taxon_c1c0d716bfd2f1092e5e6a3a68987b033964a304abd73d947f45a72a64c729ff'@'10-e43f01a6ad1df771d0d0aa0192e94ff0' {'taxon': ' 3. Leucocoprinus cepistipes (Sowerby: Fr.) Pat., J. Bot., Paris 3: 336 (1889)\n\n', 'description': ' Pileus: 2.5–5.5 cm broad, ovoid to conic when young, becoming obtusely\ncampanulate, broadly umbonate to somewhat truncate at times; margin at first\nincurved, becoming straight to more often decurved, sulcate-striate; glabrous\nto finely appressed tomentose at disc, toward margin becoming diffracted\ninto sometimes slightly recurved to appressed squamulae, widely spaced near\nmargin, rest of surface radially fibrillose to farinose; disc and scales “mummy\nbrown” to “hazel” to “sayal brown” to “snuff brown” to “cinnamon brown” to\n“cinnamon” to “clay color” to “buffy brown” to “warm buff” to “honey yellow,”\nbackground white to “pale pinkish buff” to “pale cream buff;” context soft\n(but rather sturdy for this genus), more or less white. Odor: 

In [37]:
if not redis_client.exists(embedding_name):

    embedder = EmbeddingsComputer(
        idir='/dev/null',
        redis_url='redis://localhost:6379',
        redis_expire=embedding_expire,
        embedding_name=embedding_name,
    )

    embedding_result = embedder.run(descriptions)

## Compute JSON versions of all descriptions

We anticipate needing the details of each description as a nested JSON structure, which can be used to build menus of features, subfeatures, and values.

The TaxaJSONTranslator reads taxa from CouchDB and writes the annotated taxa back to CouchDB.
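
To show how such a nested structure could feed a menu, here is a small sketch that walks a feature hierarchy and yields one (path, value) pair per leaf. The JSON shown is a hand-written, hypothetical example; the actual output of the model may be shaped differently, and `iter_menu_paths` is a name introduced only for this illustration.

```python
import json
from typing import Any, Iterator, Tuple

# Hypothetical example of the nested structure; the real model output may differ.
features_json = """
{
  "pileus": {
    "shape": ["ovoid", "campanulate"],
    "margin": {"surface": ["sulcate-striate"]}
  },
  "context": ["soft"]
}
"""


def iter_menu_paths(
    node: Any, prefix: Tuple[str, ...] = ()
) -> Iterator[Tuple[Tuple[str, ...], str]]:
    """Yield (feature path, value) pairs from a nested feature structure."""
    if isinstance(node, dict):
        # Each key is a feature or subfeature: descend one menu level.
        for key, child in node.items():
            yield from iter_menu_paths(child, prefix + (key,))
    elif isinstance(node, list):
        # A list holds alternative values for the same feature path.
        for value in node:
            yield from iter_menu_paths(value, prefix)
    else:
        yield prefix, str(node)


features = json.loads(features_json)
menu_entries = list(iter_menu_paths(features))
```

Each path corresponds to a chain of pull-down menus (feature, then subfeature), and the values that share a path are the choices offered at the end of that chain.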

In [38]:
import json
from typing import Optional

from pyspark.sql import DataFrame

from taxa_json_translator import TaxaJSONTranslator as TJT

class TaxaJSONTranslator(TJT):
    """
    Translates taxa descriptions to structured JSON using a fine-tuned Mistral model.

    This class is optimized for processing PySpark DataFrames created by
    TaxonExtractor.load_taxa(), adding a new column with JSON-formatted features.
    """
    def load_taxa(
        self,
        db_name: str,
        pattern: str = "*"
    ) -> DataFrame:
        """
        Load taxa from CouchDB taxon database.

        This method loads taxa documents saved by TaxonExtractor.save_taxa()
        and returns them as a DataFrame compatible with translate_descriptions().

        Args:
            db_name: Name of taxon database
            pattern: Pattern for document IDs to load (default: "*")
                    Use "*" to load all documents
                    Use "taxon_abc*" to load specific subset

        Returns:
            DataFrame with columns:
                - _id: CouchDB document ID (for joining results)
                - taxon: String of concatenated nomenclature paragraphs
                - description: String of concatenated description paragraphs
                - source: Dict with keys doc_id, url, db_name
                - line_number: Line number of first nomenclature paragraph
                - paragraph_number: Paragraph number of first nomenclature paragraph
                - page_number: Page number of first nomenclature paragraph
                - empirical_page_number: Empirical page number of first nomenclature paragraph

        Example:
            >>> translator = TaxaJSONTranslator(
            ...     spark=spark,
            ...     couchdb_url="http://localhost:5984",
            ...     username="admin",
            ...     password="secret",
            ...     checkpoint_path="..."
            ... )
            >>> taxa_df = translator.load_taxa(db_name="mycobank_taxa")
            >>> print(f"Loaded {taxa_df.count()} taxa")
        """
        from skol_classifier.couchdb_io import CouchDBConnection
        from pyspark.sql.types import StructType, StructField, StringType, MapType, IntegerType

        # Define schema with _id for joining results
        schema = StructType([
            StructField("_id", StringType(), False),
            StructField("taxon", StringType(), False),
            StructField("description", StringType(), False),
            StructField("source", MapType(StringType(), StringType(), valueContainsNull=True), False),
            StructField("line_number", IntegerType(), True),
            StructField("paragraph_number", IntegerType(), True),
            StructField("page_number", IntegerType(), True),
            StructField("empirical_page_number", StringType(), True)
        ])

        # Use CouchDBConnection to load data
        conn = CouchDBConnection(self.couchdb_url, db_name, username=self.username, password=self.password)

        # Get matching document IDs
        doc_ids = conn.get_all_doc_ids(pattern)

        if not doc_ids:
            print(f"No documents found matching pattern '{pattern}'")
            return self.spark.createDataFrame([], schema)

        print(f"Loading {len(doc_ids)} taxa from {db_name}...")

        # Create DataFrame with doc_ids for parallel processing
        doc_ids_rdd = self.spark.sparkContext.parallelize(doc_ids)
        doc_ids_df = doc_ids_rdd.map(lambda x: (x,)).toDF(["doc_id"])

        # Prepare connection parameters for workers
        couchdb_url = self.couchdb_url
        username = self.username
        password = self.password

        # Load taxa using mapPartitions
        def load_partition(partition):
            """Load taxa from CouchDB for an entire partition."""
            from skol_classifier.couchdb_io import CouchDBConnection
            from pyspark.sql import Row

            # Create connection using CouchDBConnection API
            conn = CouchDBConnection(couchdb_url, db_name, username, password)

            try:
                db = conn.db

                # Process each row (which contains doc_id)
                for row in partition:
                    try:
                        doc_id = row.doc_id if hasattr(row, 'doc_id') else str(row[0])

                        # Load document from CouchDB
                        if doc_id in db:
                            doc = db[doc_id]

                            # Convert CouchDB document to Row (include _id for joining)
                            taxon_data = {
                                '_id': doc.get('_id', doc_id),
                                'taxon': doc.get('taxon', ''),
                                'description': doc.get('description', ''),
                                'source': doc.get('source', {}),
                                'line_number': doc.get('line_number'),
                                'paragraph_number': doc.get('paragraph_number'),
                                'page_number': doc.get('page_number'),
                                'empirical_page_number': doc.get('empirical_page_number')
                            }

                            yield Row(**taxon_data)
                        else:
                            print(f"Document {doc_id} not found in database")

                    except Exception as e:
                        print(f"Error loading taxon {doc_id}: {e}")

            except Exception as e:
                print(f"Error connecting to CouchDB: {e}")

        taxa_rdd = doc_ids_df.rdd.mapPartitions(load_partition)
        taxa_df = self.spark.createDataFrame(taxa_rdd, schema)

        count = taxa_df.count()
        print(f"✓ Loaded {count} taxa")

        return taxa_df

    def save_taxa(
        self,
        taxa_df: DataFrame,
        db_name: str,
        json_annotated_col: str = "features_json"
    ) -> DataFrame:
        """
        Save taxa DataFrame to CouchDB, including the json_annotated field.

        This method saves taxa with the translated JSON features back to CouchDB.
        It handles arbitrary JSON in the json_annotated_col by parsing it before storage.
        The save operation is idempotent - documents with the same composite key
        (source.doc_id, source.url, line_number) will be updated rather than duplicated.

        Uses credentials from self.username and self.password.

        Args:
            taxa_df: DataFrame with taxa and translations (must include json_annotated_col)
            db_name: Name of taxon database
            json_annotated_col: Name of column containing JSON features (default: "features_json")

        Returns:
            DataFrame with save results (doc_id, success, error_message)

        Example:
            >>> # Load taxa and translate
            >>> taxa_df = translator.load_taxa(db_name="mycobank_taxa")
            >>> enriched_df = translator.translate_descriptions(taxa_df)
            >>>
            >>> # Save back to CouchDB
            >>> results = translator.save_taxa(enriched_df, db_name="mycobank_taxa")
            >>> print(f"Saved: {results.filter('success = true').count()}")
        """
        from pyspark.sql import Row
        from pyspark.sql.types import StructType, StructField, StringType, BooleanType

        # Get credentials from self
        couchdb_url = self.couchdb_url
        username = self.username
        password = self.password

        # Schema for save results
        save_schema = StructType([
            StructField("doc_id", StringType(), False),
            StructField("success", BooleanType(), False),
            StructField("error_message", StringType(), False),
        ])

        def save_partition(partition):
            """Save taxa to CouchDB for an entire partition (idempotent)."""
            from skol_classifier.couchdb_io import CouchDBConnection
            import hashlib

            def generate_taxon_doc_id(doc_id: str, url: Optional[str], line_number: int) -> str:
                """Generate deterministic document ID for idempotent saves."""
                key_parts = [
                    doc_id,
                    url if url else "no_url",
                    str(line_number)
                ]
                composite_key = ":".join(key_parts)
                hash_obj = hashlib.sha256(composite_key.encode('utf-8'))
                doc_hash = hash_obj.hexdigest()
                return f"taxon_{doc_hash}"

            # Create connection using CouchDBConnection API
            conn = CouchDBConnection(couchdb_url, db_name, username, password)

            # Connect to CouchDB once per partition
            try:
                # Try to get database, create if it doesn't exist
                import couchdb
                server = couchdb.Server(couchdb_url)
                if username and password:
                    server.resource.credentials = (username, password)

                if db_name not in server:
                    server.create(db_name)

                db = conn.db

                # Process each taxon in the partition
                for row in partition:
                    success = False
                    error_msg = ""
                    doc_id = "unknown"

                    try:
                        # Extract source metadata from row
                        source_dict = row.source if hasattr(row, 'source') else {}
                        source = dict(source_dict) if isinstance(source_dict, dict) else {}
                        source_doc_id = str(source.get('doc_id', 'unknown'))
                        source_url = source.get('url')
                        line_number = row.line_number if hasattr(row, 'line_number') else 0

                        # Generate deterministic document ID
                        doc_id = generate_taxon_doc_id(
                            source_doc_id,
                            source_url if isinstance(source_url, str) else None,
                            int(line_number) if line_number else 0
                        )

                        # Convert row to dict for CouchDB storage
                        taxon_doc = row.asDict()

                        # Handle json_annotated field: parse JSON string to dict
                        if json_annotated_col in taxon_doc and taxon_doc[json_annotated_col]:
                            json_str = taxon_doc[json_annotated_col]
                            if isinstance(json_str, str):
                                try:
                                    # Parse JSON string to dict for storage
                                    taxon_doc['json_annotated'] = json.loads(json_str)
                                except json.JSONDecodeError:
                                    print(f"Warning: Invalid JSON in {json_annotated_col} for doc {doc_id}")
                                    taxon_doc['json_annotated'] = {}
                            else:
                                # Already a dict, just rename the field
                                taxon_doc['json_annotated'] = json_str
                            # Remove the original column if it has a different name
                            if json_annotated_col != 'json_annotated':
                                del taxon_doc[json_annotated_col]

                        # Check if document already exists (idempotent)
                        if doc_id in db:
                            # Document exists - update it
                            existing_doc = db[doc_id]
                            taxon_doc['_id'] = doc_id
                            taxon_doc['_rev'] = existing_doc['_rev']
                        else:
                            # New document - create it
                            taxon_doc['_id'] = doc_id

                        db.save(taxon_doc)
                        success = True

                    except Exception as e:
                        error_msg = str(e)
                        print(f"Error saving taxon {doc_id}: {e}")

                    yield Row(
                        doc_id=doc_id,
                        success=success,
                        error_message=error_msg
                    )

            except Exception as e:
                print(f"Error connecting to CouchDB: {e}")
                # Yield failures for all rows
                for row in partition:
                    yield Row(
                        doc_id="unknown_connection_error",
                        success=False,
                        error_message=str(e)
                    )

        print(f"Saving taxa to {db_name}...")
        results_df = taxa_df.rdd.mapPartitions(save_partition).toDF(save_schema)

        total = results_df.count()
        successes = results_df.filter("success = true").count()
        failures = total - successes

        print(f"✓ Save complete:")
        print(f"  Total: {total}")
        print(f"  Successful: {successes}")
        print(f"  Failed: {failures}")

        return results_df
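
The idempotent save relies on the document ID being a pure function of the composite key. The sketch below isolates that idea, with a plain dict standing in for the database; `taxon_doc_id` mirrors the hashing in the cell above, while `upsert` and the sample record are assumptions made for illustration.

```python
import hashlib
from typing import Dict, Optional


def taxon_doc_id(source_doc_id: str, url: Optional[str], line_number: int) -> str:
    """Derive a deterministic document ID from the composite key."""
    composite_key = ":".join([source_doc_id, url if url else "no_url", str(line_number)])
    return "taxon_" + hashlib.sha256(composite_key.encode("utf-8")).hexdigest()


def upsert(store: Dict[str, dict], doc: dict) -> str:
    """Insert or overwrite `doc` in a dict that stands in for a database."""
    doc_id = taxon_doc_id(doc["source_doc_id"], doc.get("url"), doc["line_number"])
    store[doc_id] = doc
    return doc_id


store: Dict[str, dict] = {}
record = {"source_doc_id": "abc123", "url": None, "line_number": 85, "taxon": "Mattirolia"}
first_id = upsert(store, record)
# Saving a revised version of the same taxon lands on the same ID.
second_id = upsert(store, dict(record, taxon="Mattirolia Berl. & Bres."))
```

Running the save twice therefore leaves one document rather than two. In CouchDB itself an overwrite also requires the current `_rev`, which is why the cell above copies it from the existing document before saving.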



In [39]:
spark.stop()
spark = make_spark_session()

translator = TaxaJSONTranslator(
    spark=spark,
    base_model_id="mistralai/Mistral-7B-Instruct-v0.3",
    max_length=2048,
    max_new_tokens=1024,
    device="cuda",
    load_in_4bit=True,
    use_auth_token=True,
    couchdb_url=couchdb_url,
    username=couchdb_username,
    password=couchdb_password
)

TaxaJSONTranslator initialized
  CouchDB URL: http://127.0.0.1:5984
  Base model: mistralai/Mistral-7B-Instruct-v0.3
  Checkpoint: None (using base model)
  Device: cuda
  4-bit quantization: True


### Run the Mistral model to generate JSON from each Taxon description

In [40]:
if generate_json:
    descriptions_df = translator.load_taxa(db_name=taxon_db_name).limit(10)

Loading 5239 taxa from skol_taxa_dev...



✓ Loaded 5239 taxa


                                                                                

In [41]:
if generate_json:
    descriptions_df.show()

+--------------------+--------------------+--------------------+--------------------+-----------+----------------+-----------+---------------------+
|                 _id|               taxon|         description|              source|line_number|paragraph_number|page_number|empirical_page_number|
+--------------------+--------------------+--------------------+--------------------+-----------+----------------+-----------+---------------------+
|taxon_000aaabf9fb...| added Valsaria C...| Stroma variable,...|{human_url -> sko...|         85|           37287|          1|                 4666|
|taxon_001de8a5aa8...| Geastrum corolli...| subglobose, ovoi...|{human_url -> sko...|        188|           20293|          1|                 4666|
|taxon_0024cb7ea83...|    Rick (1959).\n\n| Basidiomata whit...|{human_url -> sko...|        246|           43960|          1|                 4666|
|taxon_002d9147281...| Graphis caesioca...| Thallus corticol...|{human_url -> sko...|         83|         



In [None]:
if generate_json:
    json_annotated_df = translator.translate_descriptions_batch(
        taxa_df=descriptions_df,
        batch_size=10,
        description_col="description",
        output_col="json_annotated"
    )

### Add the generated JSON as a field on the taxon documents written by save_taxa

## Hierarchical clustering

We use agglomerative clustering to group the taxa into "clades" based on the cosine similarity of their SBERT embeddings. We then load the resulting tree into Neo4j.
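
For readers unfamiliar with the technique, here is a deliberately naive, pure-Python sketch of average-linkage agglomerative clustering over cosine distance. It is not the code the notebook runs (the cell below imports SciPy's `linkage` for that); the toy vectors and function names are assumptions. The row layout of the result follows the linkage-matrix convention: the two clusters merged, the distance between them, and the size of the new cluster.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def average_linkage(vectors: Sequence[Sequence[float]]) -> List[Tuple[int, int, float, int]]:
    """Naive average-linkage clustering.

    Returns one (cluster_a, cluster_b, distance, size) row per merge.
    Original points are clusters 0..n-1; the i-th merge creates cluster n+i.
    """
    n = len(vectors)
    members: Dict[int, List[int]] = {i: [i] for i in range(n)}
    merges = []
    next_id = n
    while len(members) > 1:
        best = None
        for a, b in combinations(sorted(members), 2):
            # Average pairwise distance between the points of the two clusters.
            pairs = [(i, j) for i in members[a] for j in members[b]]
            dist = sum(cosine_distance(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)
            if best is None or dist < best[2]:
                best = (a, b, dist)
        a, b, dist = best
        members[next_id] = members.pop(a) + members.pop(b)
        merges.append((a, b, dist, len(members[next_id])))
        next_id += 1
    return merges


# Toy vectors standing in for SBERT embeddings of four descriptions.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
merges = average_linkage(toy)
```

The two pairs of nearly parallel vectors merge first at a small distance, and the final merge joins the two pairs at a much larger one. Read from the last row backwards, the merges form the binary tree that becomes the pseudoclade hierarchy.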

In [None]:
import pickle
from typing import List, Optional, Tuple

import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, to_tree

from taxon_clusterer import TaxonClusterer as TC, ClusterNode

class TaxonClusterer(TC):

    def load_embeddings(self, embedding_key: str) -> Tuple[np.ndarray, List[str], List[dict]]:
        """
        Load embeddings from Redis.

        Args:
            embedding_key: Redis key containing pickled embeddings

        Returns:
            Tuple of (embeddings array, taxon names list, taxon metadata list)

        Raises:
            ValueError: If key doesn't exist or data is invalid
        """
        print(f"Loading embeddings from Redis key: {embedding_key}")

        if not self.redis_client.exists(embedding_key):
            raise ValueError(f"Redis key '{embedding_key}' does not exist")

        # Load pickled data from Redis
        pickled_data = self.redis_client.get(embedding_key)
        data = pickle.loads(pickled_data)

        # Assume it's a pandas DataFrame from EmbeddingsComputer
        try:
            assert isinstance(data, pd.DataFrame)
            # Extract embedding columns (F0, F1, F2, ...)
            embedding_cols = [col for col in data.columns if col.startswith('F')]
            self.embeddings = data[embedding_cols].values

            # Extract taxon names from 'taxon' field (nomenclature)
            # If 'taxon' column doesn't exist, fall back to 'description'
            if 'taxon' in data.columns:
                self.taxon_names = data['taxon'].tolist()
            else:
                self.taxon_names = data['description'].tolist()

            # Extract metadata from other columns
            self.taxon_metadata = []
            for _, row in data.iterrows():
                metadata = {}

                # Flatten source dict for neo4j storage.
                if 'source' in data.columns:
                    source = row['source']
                    assert isinstance(source, dict), "Source field must be dict"
                    for key in source.keys():
                        metadata[f'source_{key}'] = source[key]

                # Add other metadata fields
                if 'filename' in data.columns:
                    metadata['filename'] = row.get('filename')
                if 'row' in data.columns:
                    metadata['row'] = row.get('row')
                if 'line_number' in data.columns:
                    metadata['line_number'] = row.get('line_number')
                if 'paragraph_number' in data.columns:
                    metadata['paragraph_number'] = row.get('paragraph_number')
                if 'page_number' in data.columns:
                    metadata['page_number'] = row.get('page_number')
                if 'empirical_page_number' in data.columns:
                    metadata['empirical_page_number'] = row.get('empirical_page_number')

                # Always include description
                metadata['description'] = row.get('description', '')

                self.taxon_metadata.append(metadata)
        except Exception as e:
            raise ValueError(f"Failed to parse data from Redis: {e}")

        print(f"✓ Loaded {len(self.taxon_names)} taxa with {self.embeddings.shape[1]}-dimensional embeddings")
        
        return self.embeddings, self.taxon_names, self.taxon_metadata

    def store_in_neo4j(
        self,
        root_name: str = "Fungi",
        clear_existing: bool = True
    ):
        """
        Store the clustering tree in Neo4j.

        Creates:
        - Taxon nodes (leaf nodes) with properties: name, node_id
        - Pseudoclade nodes (internal nodes) with properties: name, node_id, count
        - PARENT_OF relationships with property: distance (cosine distance)

        Args:
            root_name: Name for the root pseudoclade
            clear_existing: Whether to clear existing Taxon and Pseudoclade nodes
        """
        if self.root_node is None:
            raise ValueError("No clustering tree available. Call cluster() first.")

        print(f"Storing tree in Neo4j...")
        print(f"  Root name: {root_name}")

        with self.neo4j_driver.session() as session:
            # Optionally clear existing data
            if clear_existing:
                print("  Clearing existing Taxon and Pseudoclade nodes...")
                session.run("""
                    MATCH (n)
                    WHERE n:Taxon OR n:Pseudoclade
                    DETACH DELETE n
                """)

            # Create indexes for performance
            session.run("CREATE INDEX taxon_node_id IF NOT EXISTS FOR (t:Taxon) ON (t.node_id)")
            session.run("CREATE INDEX pseudoclade_node_id IF NOT EXISTS FOR (p:Pseudoclade) ON (p.node_id)")

            # Store tree recursively
            pseudoclade_counter = [0]  # Use list for mutability in nested function

            def store_node(node: ClusterNode, parent_id: Optional[int] = None, is_root: bool = False):
                """Recursively store nodes in Neo4j."""
                if node.is_leaf:
                    # Create Taxon node with metadata
                    taxon_props = {
                        'name': node.taxon_name,
                        'node_id': node.node_id
                    }

                    # Add metadata fields if available
                    if node.metadata:
                        for key, value in node.metadata.items():
                            # Convert values to Neo4j-compatible types
                            if value is not None and not isinstance(value, (bool, int, float, str)):
                                taxon_props[key] = str(value)
                            else:
                                taxon_props[key] = value

                    session.run("""
                        CREATE (t:Taxon $props)
                    """, props=taxon_props)

                    # Create relationship to parent if exists
                    if parent_id is not None:
                        session.run("""
                            MATCH (parent:Pseudoclade {node_id: $parent_id})
                            MATCH (child:Taxon {node_id: $child_id})
                            CREATE (parent)-[:PARENT_OF {distance: $distance}]->(child)
                        """, parent_id=parent_id, child_id=node.node_id, distance=node.distance)
                else:
                    # Create Pseudoclade node
                    if is_root:
                        pseudoclade_name = root_name
                    else:
                        pseudoclade_counter[0] += 1
                        pseudoclade_name = f"Pseudoclade_{pseudoclade_counter[0]}"

                    session.run("""
                        CREATE (p:Pseudoclade {
                            name: $name,
                            node_id: $node_id,
                            count: $count
                        })
                    """, name=pseudoclade_name, node_id=node.node_id, count=node.count)

                    # Create relationship to parent if exists
                    if parent_id is not None:
                        session.run("""
                            MATCH (parent:Pseudoclade {node_id: $parent_id})
                            MATCH (child:Pseudoclade {node_id: $child_id})
                            CREATE (parent)-[:PARENT_OF {distance: $distance}]->(child)
                        """, parent_id=parent_id, child_id=node.node_id, distance=node.distance)

                    # Recursively store children
                    if node.left_child:
                        store_node(node.left_child, node.node_id, False)
                    if node.right_child:
                        store_node(node.right_child, node.node_id, False)

            # Start from root
            store_node(self.root_node, None, True)

        print(f"✓ Tree stored in Neo4j")

        # Print summary statistics
        self._print_neo4j_stats()

    def _print_neo4j_stats(self):
        """Print statistics about stored data."""
        with self.neo4j_driver.session() as session:
            # Count taxa
            result = session.run("MATCH (t:Taxon) RETURN count(t) as count")
            taxon_count = result.single()['count']

            # Count pseudoclades
            result = session.run("MATCH (p:Pseudoclade) RETURN count(p) as count")
            pseudoclade_count = result.single()['count']

            # Count relationships
            result = session.run("MATCH ()-[r:PARENT_OF]->() RETURN count(r) as count")
            relationship_count = result.single()['count']

            print(f"  Taxon nodes: {taxon_count}")
            print(f"  Pseudoclade nodes: {pseudoclade_count}")
            print(f"  PARENT_OF relationships: {relationship_count}")

    def get_subtree(self, clade_name: str) -> List[str]:
        """
        Get all taxa descendant from a given clade.

        Args:
            clade_name: Name of the clade (Pseudoclade or Taxon)

        Returns:
            List of taxon names in the subtree
        """
        with self.neo4j_driver.session() as session:
            result = session.run("""
                MATCH (root {name: $clade_name})-[:PARENT_OF*]->(t:Taxon)
                RETURN t.name as taxon_name
                ORDER BY taxon_name
            """, clade_name=clade_name)

            return [record['taxon_name'] for record in result]

    def get_tree_path(self, taxon_name: str) -> List[Tuple[str, float]]:
        """
        Get the path from root to a specific taxon.

        Args:
            taxon_name: Name of the taxon

        Returns:
            List of (node_name, distance) tuples from root to taxon
        """
        with self.neo4j_driver.session() as session:
            result = session.run("""
                MATCH path = (root:Pseudoclade)-[:PARENT_OF*]->(t:Taxon {name: $taxon_name})
                WHERE NOT (root)<-[:PARENT_OF]-()
                UNWIND nodes(path) as node
                RETURN node.name as name,
                       CASE WHEN node:Taxon THEN 0.0
                            ELSE relationships(path)[0].distance
                       END as distance
            """, taxon_name=taxon_name)

            return [(record['name'], record['distance']) for record in result]



In [None]:
clusterer = TaxonClusterer(
    redis_host="localhost",
    redis_port=6379,
    redis_db=0,
    neo4j_uri=neo4j_uri,
)

# Load embeddings from Redis
(embeddings, taxon_names, metadata) = clusterer.load_embeddings(embedding_name)

In [None]:
metadata[0]

In [None]:
# Perform clustering
linkage_matrix = clusterer.cluster(method="average", metric="cosine")

# Store in Neo4j with root named "Fungi"
clusterer.store_in_neo4j(root_name="Fungi", clear_existing=True)

print("✓ Clustering complete!")
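
The body of `store_in_neo4j` is not shown in this section, so the following is only a sketch of how a linkage matrix could be turned into `PARENT_OF` edges. It assumes the SciPy linkage layout, in which row `i` merges clusters `row[0]` and `row[1]` at height `row[2]` and creates cluster number `n + i`. The function name `linkage_to_edges`, the `Pseudoclade_<id>` naming scheme, and the choice of edge length (parent height minus child height) are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def linkage_to_edges(
    linkage_rows: Sequence[Sequence[float]],
    leaf_names: Sequence[str],
    root_name: str = "Fungi",
) -> List[Tuple[str, str, float]]:
    """Turn a SciPy-style linkage matrix into (parent, child, distance) edges.

    Internal node names are hypothetical; only the last merge gets root_name.
    """
    n = len(leaf_names)
    names: Dict[int, str] = dict(enumerate(leaf_names))
    heights: Dict[int, float] = {i: 0.0 for i in range(n)}  # leaves sit at height 0
    edges: List[Tuple[str, str, float]] = []

    for i, row in enumerate(linkage_rows):
        left, right, height = int(row[0]), int(row[1]), float(row[2])
        node_id = n + i  # SciPy numbers merged clusters after the leaves
        is_root = i == len(linkage_rows) - 1
        names[node_id] = root_name if is_root else f"Pseudoclade_{node_id}"
        heights[node_id] = height
        for child in (left, right):
            # Assumed edge length: parent merge height minus child merge height.
            edges.append((names[node_id], names[child], height - heights[child]))
    return edges


# Three leaves: leaves 0 and 1 merge first, then leaf 2 joins that cluster.
rows = [[0, 1, 0.5, 2], [2, 3, 1.0, 3]]
leaves = ["Amanita muscaria", "Amanita pantherina", "Boletus edulis"]
edges = linkage_to_edges(rows, leaves)
for parent, child, distance in edges:
    print(f"{parent} -[PARENT_OF {distance:.2f}]-> {child}")
```

Each tuple corresponds to one relationship that a Neo4j writer could create. Average linkage produces non-decreasing merge heights, so the assumed edge lengths are never negative.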

## Bibliography

* DOI Foundation, "DOI Citation Formatter HTTP API", https://citation.doi.org/api-docs.html, accessed 2025-11-12.
* Yang, Jie and Zhang, Yue and Li, Linwei and Li, Xingxuan, 2018, "YEDDA: A Lightweight Collaborative Text Span Annotation Tool", Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, http://aclweb.org/anthology/P18-4006
* Neo4j
* CouchDB
* Redis


## Appendix: On the use of an AI Coder

Portions of this work were completed with the aid of Claude Code Pro. I want to give a concrete example of how I have used this powerful tool, and to explain why I am comfortable claiming authorship of the resulting code.

For this project I needed results from an earlier class project in which a trio of students built and evaluated models for classifying paragraphs. The earlier work was built as an IPython notebook, with many examples and inline code. Simply copying the earlier notebook would have introduced many irrelevant details and would not have furthered the overall project.

I asked Claude Code to translate the notebook into a module that I could import. It did a pretty good job. Without being told, it made a submodule, extracted the illustrative code as examples, wrote reasonable documentation, and created packaging for the module.

The skill level of the coding was roughly that of a highly disciplined but otherwise average junior programmer. The architecture was simplistic and violated several design principles, such as DRY (Don't Repeat Yourself). I requested specific refactorings, such as converting a group of functions into an object that holds the parameters they had each duplicated.
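
As an illustration of that kind of refactoring, here is a minimal sketch. All names in it (`build_doc_url`, `build_attachment_url`, `DocumentStore`, the `skol_dev` database) are hypothetical and are not taken from the SKOL code base. Only the `/{db}/{doc_id}/{attachment}` URL layout and the default port 5984 reflect CouchDB itself.

```python
from dataclasses import dataclass


# Before: every function repeats the same connection parameters.
def build_doc_url(base_url: str, db_name: str, doc_id: str) -> str:
    return f"{base_url}/{db_name}/{doc_id}"


def build_attachment_url(base_url: str, db_name: str, doc_id: str, filename: str) -> str:
    return f"{base_url}/{db_name}/{doc_id}/{filename}"


# After: the shared parameters are stored once on an object.
@dataclass(frozen=True)
class DocumentStore:
    base_url: str
    db_name: str

    def doc_url(self, doc_id: str) -> str:
        return f"{self.base_url}/{self.db_name}/{doc_id}"

    def attachment_url(self, doc_id: str, filename: str) -> str:
        # Reuses doc_url, so the URL layout is defined in one place only.
        return f"{self.doc_url(doc_id)}/{filename}"


store = DocumentStore(base_url="http://localhost:5984", db_name="skol_dev")
print(store.attachment_url("doc-1", "article.pdf"))
```

Callers now pass the connection details once, when the object is constructed, and a change to the URL layout touches a single method.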

The initial code used REST interfaces directly and read all the data onto a single machine, so it did not use PySpark correctly. Through a series of refactorings, I asked that the code use appropriate libraries that I named, and that it define proper UDFs so that transformations execute in parallel.

I walked the AI through creating an object that I could use to illustrate my use of the Redis and CouchDB interfaces, while leaving the irrelevant details in a separate library.

In short, I still have to understand good design principles. I still have to recognize where appropriate libraries are applicable. I still have to understand the frameworks I am working with.

I now have a strong understanding of the difference between "vibe coding" and AI-assisted software engineering. In my first 4 hours with Claude Code, I was able to produce roughly 4 days' worth of professional-grade working code.

I'm still learning how to use Claude Code effectively for debugging. Feeding it a series of error messages leads to increasingly convoluted code. Using it to help produce a small test program that I can inspect by hand seems to work better. I've had moderate success with the prompt "Run the test program and correct any errors", especially when I'm willing to review each edit as it is produced.