# SKOL IV: All the Data

In [1]:
bahir_package = 'org.apache.bahir:spark-sql-cloudant_2.12:2.4.0'
!spark-shell --packages $bahir_package < /dev/null

25/11/26 09:55:04 WARN Utils: Your hostname, puchpuchobs resolves to a loopback address: 127.0.1.1; using 10.1.10.58 instead (on interface wlp130s0f0)
25/11/26 09:55:04 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
:: loading settings :: url = jar:file:/data/piggy/miniconda3/envs/skol/lib/python3.13/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/piggy/.ivy2/cache
The jars for the packages stored in: /home/piggy/.ivy2/jars
org.apache.bahir#spark-sql-cloudant_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-86518068-3597-45c9-b4f7-2b944db6b6ce;1.0
	confs: [default]
	found org.apache.bahir#spark-sql-cloudant_2.12;2.4.0 in central
	found org.apache.bahir#bahir-common_2.12;2.4.0 in central
	found org.spark-project.spark#unused;1.0.0 in central
	found com.cloudant#cloudant-client;2.17.0 in central
	found com.google.code.gson#gson;2.8.2 in central
	found 

In [2]:
from io import BytesIO
import json
import hashlib
import os
from pathlib import Path, PurePath
import pickle
import requests
import shutil
import sys
import tempfile
from typing import Any, Dict, Iterator, List, Optional
from urllib.robotparser import RobotFileParser

# Be sure to get version 2: https://simple-repository.app.cern.ch/project/bibtexparser/2.0.0b8/description
import bibtexparser
import couchdb
import feedparser
import fitz # PyMuPDF

import pandas as pd  # TODO(piggy): Remove this dependency in favor of pure pyspark DataFrames.

from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import (
    Tokenizer, CountVectorizer, IDF, StringIndexer, VectorAssembler, IndexToString
)
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier

from pyspark.sql import SparkSession, DataFrame, Row
from pyspark.sql.functions import (
    input_file_name, collect_list, regexp_extract, col, udf,
    explode, trim, row_number, min, expr, concat, lit
)
from pyspark.sql.types import (
    ArrayType, BooleanType, IntegerType, MapType, NullType,
    StringType, StructType, StructField 
)
from pyspark.sql.window import Window

import redis
import torch
from uuid import uuid4

# Local modules
current_dir = os.getcwd()
parent_dir = os.path.abspath(os.path.join(current_dir, os.pardir))
parent_path = Path(parent_dir)
if parent_dir not in sys.path:
    sys.path.append(parent_dir)

from couchdb_file import CouchDBFile as CDBF
from fileobj import FileObject
from finder import parse_annotated, remove_interstitials
import line
from line import Line

# Import the SKOL classifier jupyter/ist769_skol.ipynb
from skol_classifier import SkolClassifier as SC, get_file_list
from skol_classifier.preprocessing import SuffixTransformer, ParagraphExtractor
from skol_classifier.utils import calculate_stats

from taxon import group_paragraphs, Taxon




In [3]:
couchdb_host = "127.0.0.1:5984" # e.g., "ACCOUNT.cloudant.com" or "localhost"
couchdb_username = "admin"
couchdb_password = "SU2orange!"
ingest_db_name = "skol_dev"
taxon_db_name = "skol_taxa_dev"

couchdb_url = f'http://{couchdb_host}'

spark = SparkSession \
    .builder \
    .appName("CouchDB Spark SQL Example in Python using dataframes") \
    .master("local[2]") \
    .config("cloudant.protocol", "http") \
    .config("cloudant.host", couchdb_host) \
    .config("cloudant.username", couchdb_username) \
    .config("cloudant.password", couchdb_password) \
    .config("spark.jars.packages", bahir_package) \
    .config("spark.driver.memory", "16g") \
    .config("spark.executor.memory", "20g") \
    .config("spark.submit.pyFiles",
            f'{parent_path / "line.py"},{parent_path / "fileobj.py"},'
            f'{parent_path / "couchdb_file.py"},{parent_path / "finder.py"},'
            f'{parent_path / "taxon.py"},{parent_path / "paragraph.py"},'
            f'{parent_path / "label.py"},{parent_path / "file.py"},'
            f'{parent_path / "extract_taxa_to_couchdb.py"}'
           ) \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR") # Keeps the noise down!!!

couch = couchdb.Server(couchdb_url)
couch.resource.credentials = (couchdb_username, couchdb_password)

if ingest_db_name not in couch:
    db = couch.create(ingest_db_name)
else:
    db = couch[ingest_db_name]

user_agent = "synoptickeyof.life"

ingenta_rp = RobotFileParser()
ingenta_rp.set_url("https://www.ingentaconnect.com/robots.txt")
ingenta_rp.read() # Reads and parses the robots.txt file from the URL





## The Data Sources

The goal is to collect all the open access taxonomic literature in mycology. Most of the sources below focus on macro-fungi and slime molds.

### Ingested Data Sources

* [Mycotaxon at Ingenta Connect](https://www.ingentaconnect.com/content/mtax/mt)
* [Studies in Mycology at Ingenta Connect](https://www.studiesinmycology.org/)

### Source of many older public domain and open access works

MykoWeb includes scans of many older works in mycology. I have local copies but still need to ingest them.

* [MykoWeb](https://mykoweb.com/)

### Journals in hand

These are journals we've collected over the years. The initial annotated issues are from early years of Mycotaxon. We still need to ingest all of these.

* Mycologia (back issues)
* [Mycologia at Taylor and Francis](https://www.tandfonline.com/journals/umyc20)
  Mycologia is the main journal of the Mycological Society of America. It is a mix of open access and traditional access articles. The connector for this journal will need to identify the open access articles.
* Persoonia (all issues)
  Persoonia is no longer published.
* Mycotaxon (back issues)
  Mycotaxon is no longer published.

### Journals that need connectors

These are journals that we know include open access articles.

* [Amanitaceae.org](http://www.tullabs.com/amanita/?home)
* [Mycosphere](https://mycosphere.org/)
* [Mycoscience](https://mycoscience.org/)
* [Journal of Fungi](https://www.mdpi.com/journal/jof)
* [Mycology](https://www.tandfonline.com/journals/tmyc20)
* [Open Access Journal of Mycology & Mycological Sciences](https://www.medwinpublishers.com/OAJMMS/)
* [MycoKeys](https://mycokeys.pensoft.net/)


## Ingestion

Each journal or other data source gets an ingester that puts PDFs into our document store along with any metadata we can collect. The metadata is sufficient to create citations for each issue, book, or article. If BibTeX citations are available, we prefer to store them verbatim.

### Ingenta RSS ingestion

Ingenta Connect is an electronic publisher that holds two Mycology journals. New articles are available via RSS (Really Simple Syndication).
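
Before fetching anything, the ingesters below ask a `RobotFileParser` whether our user agent may fetch the URL. The following is a minimal offline sketch of that check. The rules in `example_rules` and the `example.org` URLs are hypothetical and are not Ingenta's actual robots.txt; the notebook itself reads the live file with `read()`.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "synoptickeyof.life"  # the user agent string used in this notebook

# Hypothetical rules, parsed offline for illustration only.
example_rules = [
    "User-agent: *",
    "Disallow: /search/",
    "Allow: /content/",
]

rp = RobotFileParser()
rp.parse(example_rules)  # parse() accepts an iterable of robots.txt lines


def allowed(url: str) -> bool:
    """Return True if the example rules permit USER_AGENT to fetch url."""
    return rp.can_fetch(USER_AGENT, url)


print(allowed("https://example.org/content/mtax/mt/article?format=bib"))
print(allowed("https://example.org/search/results"))
```

Checking permission per URL, rather than once per site, lets the ingester skip only the blocked links and keep going.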

In [4]:
def ingest_from_bibtex(
        db: couchdb.Database,
        content: bytes,
        bibtex_link: str,
        meta: Dict[str, Any],
        rp
        ) -> None:
    """Load documents referenced in an Ingenta BibTeX database."""
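    # bibtexparser v2 parses text, so decode raw HTTP bytes first.
    if isinstance(content, bytes):
        content = content.decode("utf-8")
    bib_database = bibtexparser.parse_string(content)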

    bibtex_data = {
        'link': bibtex_link,
        'bibtex': bibtexparser.write_string(bib_database),
    }
    
    for bib_entry in bib_database.entries:
        doc = {
            '_id': uuid4().hex,
            'meta': meta,
            'pdf_url': f"{bib_entry['url']}?crawler=true",
        }

        # Do not fetch if we already have an entry.
        selector = {'selector': {'pdf_url': doc['pdf_url']}}
        if any(True for _ in db.find(selector)):
            print(f"Skipping {doc['pdf_url']}")
            continue

        if not rp.can_fetch(user_agent, doc['pdf_url']):
            # TODO(piggy): We should probably record blocked URLs.
            print(f"Robot permission denied {doc['pdf_url']}")
            continue

        print(f"Adding {doc['pdf_url']}")
        for k in bib_entry.fields_dict.keys():
            doc[k] = bib_entry[k]
        
        doc_id, doc_rev = db.save(doc)
        with requests.get(doc['pdf_url'], stream=False) as pdf_f:
            pdf_f.raise_for_status()
            pdf_doc = pdf_f.content
        
        attachment_filename = 'article.pdf'
        attachment_content_type = 'application/pdf'
        attachment_file = BytesIO(pdf_doc)

        db.put_attachment(doc, attachment_file, attachment_filename, attachment_content_type)

        print("-" * 10)

In [5]:
def ingest_ingenta(
        db: couchdb.Database,
        rss_url: str,
        rp
) -> None:
    """Ingest documents from an Ingenta RSS feed."""

    feed = feedparser.parse(rss_url)
    
    feed_meta = {
        'url': rss_url,
        'title': feed.feed.title,
        'link': feed.feed.link,
        'description': feed.feed.description,
    }

    for entry in feed.entries:
        entry_meta = {
            'title': entry.title,
            'link': entry.link,
        }
        if hasattr(entry, 'summary'):
            entry_meta['summary'] = entry.summary
        if hasattr(entry, 'description'):
            entry_meta['description'] = entry.description

        bibtex_link = f'{entry.link}?format=bib'
        print(f"bibtex_link: {bibtex_link}")

        if not rp.can_fetch(user_agent, bibtex_link):
            print(f"Robot permission denied {bibtex_link}")
            continue

        with requests.get(bibtex_link, stream=False) as bibtex_f:
            bibtex_f.raise_for_status()  # Raise an HTTPError for bad responses (4xx or 5xx)

            ingest_from_bibtex(
                db=db,
                content=bibtex_f.content\
                    .replace(b"\"\nparent", b"\",\nparent")\
                    .replace(b"\n", b""),
                bibtex_link=bibtex_link,
                meta={
                    'feed': feed_meta,
                    'entry': entry_meta,
                },
                rp=rp
            )
        print("=" * 20)

In [6]:
def ingest_from_local_bibtex(
    db: couchdb.Database,
    root: Path,
    rp
) -> None:
    """Ingest from a local directory with Ingenta BibTeX files in it."""
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith('format=bib'):
                continue
            full_filepath = os.path.join(dirpath, filename)
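            # Build the link from the path relative to root to avoid a doubled slash.
            relative_path = Path(full_filepath).relative_to(root).as_posix()
            bibtex_link = f"https://www.ingentaconnect.com/{relative_path}"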
            with open(full_filepath) as f:
                content = f.read()\
                    .replace("\"\nparent", "\",\nparent")\
                    .replace("\n", "")
                ingest_from_bibtex(db, content, bibtex_link, meta={}, rp=rp)


The Ingenta ingestion steps are:

1. Download the RSS feed.
2. Read the BibTeX files and create a record for each article.
3. Download the PDFs at the URLs in the BibTeX entries.
4. Save each JSON record with the PDF as an attachment.
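
Both ingesters above patch the BibTeX text before parsing it: they insert a comma that is missing before a `parent...` field and then drop all newlines. Here is a small sketch of that repair in isolation. The function name `repair_ingenta_bibtex` and the sample entry are hypothetical and written only to show the effect of the two `replace` calls.

```python
def repair_ingenta_bibtex(raw: str) -> str:
    """Insert the comma missing before a `parent...` field, then drop newlines."""
    patched = raw.replace('"\nparent', '",\nparent')
    return patched.replace("\n", "")


# Hypothetical entry shaped like the malformed records the ingesters patch:
# the title line lacks its trailing comma.
sample = (
    '@article{example:2020,\n'
    'title = "An example title"\n'
    'parent_itemid = "infobike://example/ex",\n'
    'year = "2020"\n'
    '}\n'
)

repaired = repair_ingenta_bibtex(sample)
print(repaired)
```

The order matters: the comma fix relies on the newline between the two fields, so it has to run before the newlines are removed.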

### Text extraction

We extract the text, optionally with OCR, and add it as an additional attachment on the source record.

In [7]:
df = spark.read.load(
    format="org.apache.bahir.cloudant",
    database=ingest_db_name
)

In [8]:
df.describe()

DataFrame[summary: string, _id: string, _rev: string, abstract: string, author: string, doi: string, eissn: string, issn: string, itemtype: string, journal: string, number: string, pages: string, parent_itemid: string, pdf_url: string, publication date: string, publishercode: string, title: string, url: string, volume: string, year: string]

In [9]:
# Content-Type: text/html; charset=UTF-8

def pdf_to_text(pdf_contents: bytes) -> bytes:
    doc = fitz.open(stream=BytesIO(pdf_contents), filetype="pdf")

    full_text = ''
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        # Possibly perform OCR on the page
        text = page.get_text("text", flags=fitz.TEXT_PRESERVE_WHITESPACE | fitz.TEXT_DEHYPHENATE)
        # full_text += f"\n--- PDF Page {page_num+1} ---\n"
        full_text += text

    return full_text.encode("utf-8")

def add_text_to_partition(iterator) -> None:
    couch = couchdb.Server(couchdb_url)
    couch.resource.credentials = (couchdb_username, couchdb_password)
    local_db = couch[ingest_db_name]
    for row in iterator:
        if not row:
            continue
        if not row._attachments:
            continue
        row_dict = row.asDict()
        attachment_dict = row._attachments.asDict()
        for pdf_filename in attachment_dict:
            pdf_path = PurePath(pdf_filename)
            if pdf_path.suffix != '.pdf':
                continue
            txt_path_str = pdf_path.stem + '.txt'
            # if txt_path_str in attachment_dict:
            #     # TODO(piggy): Recalculate text if text is terrible. Too much noise vocabulary?
            #     print(f"Already have text for {row.pdf_url}")
            #     continue
            print(f"{row._id}, {row.pdf_url}")
            pdf_file = local_db.get_attachment(row._id, str(pdf_path)).read()
            txt_file = pdf_to_text(pdf_file)
            attachment_content_type = 'text/plain; charset=UTF-8'
            attachment_file = BytesIO(txt_file)
            local_db.put_attachment(row_dict, attachment_file, txt_path_str, attachment_content_type)
    


In [10]:
# Identical to skol_classifier.CouchDBConnection.
from skol_classifier import CouchDBConnection as CDBC

class CouchDBConnection(CDBC):
    """
    Manages CouchDB connection and provides I/O operations.

    This class encapsulates connection parameters and provides an idempotent
    connection method that can be safely called multiple times.
    """


In [11]:
"""
Main classifier module for SKOL text classification
"""
class SkolClassifier(SC):
    """
    Text classifier for taxonomic literature.

    This version only includes the redis and couchdb I/O methods.
    All other methods are in SC.

    Supports multiple classification models (Logistic Regression, Random Forest)
    and feature types (word TF-IDF, suffix TF-IDF, combined).
    """

    # def save_to_redis(self) -> bool:

    # def load_from_redis(self) -> bool:
    # def load_from_couchdb(self, pattern: str = "*.txt") -> DataFrame:

    # def predict_from_couchdb(

    # def save_to_couchdb(


## Build a classifier to identify paragraph types.

We save the trained model to Redis so that we don't need to retrain it every time.
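
The cell below follows a "load if cached, otherwise train and store" pattern. As a rough sketch of that pattern only, the example here uses a tiny in-memory stand-in instead of a real Redis client, and `pickle` as the serializer. `InMemoryStore`, `load_or_train`, and `fake_train` are hypothetical names, and how `SkolClassifier.save_to_redis` actually serializes the model is not shown in this notebook.

```python
import pickle
from typing import Any, Callable, Dict, List


class InMemoryStore:
    """Hypothetical stand-in for a Redis client (exists/get/set only)."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def exists(self, key: str) -> bool:
        return key in self._data

    def get(self, key: str) -> bytes:
        return self._data[key]

    def set(self, key: str, value: bytes) -> None:
        self._data[key] = value


def load_or_train(store: InMemoryStore, key: str, train: Callable[[], Any]) -> Any:
    """Return the cached model under key, training and caching it if absent."""
    if store.exists(key):
        return pickle.loads(store.get(key))
    model = train()
    store.set(key, pickle.dumps(model))
    return model


calls: List[int] = []


def fake_train() -> dict:
    """Pretend training step that records how often it runs."""
    calls.append(1)
    return {"labels": ["Misc-exposition", "Description", "Nomenclature"]}


store = InMemoryStore()
first = load_or_train(store, "skol:classifier:model:v1.2", fake_train)
second = load_or_train(store, "skol:classifier:model:v1.2", fake_train)
print(len(calls))
```

Putting a version in the key, as the notebook does, means a new model can be trained under a new key without overwriting the old one.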

In [12]:
# Train classifier on annotated data and save to Redis
# Connect to Redis
redis_client = redis.Redis(
    host='localhost',
    port=6379,
    db=0,
    decode_responses=False
)
classifier_model_name = "skol:classifier:model:v1.2"

if not redis_client.exists(classifier_model_name):

    # Initialize classifier with Redis connection
    classifier = SkolClassifier(
        spark=spark,
        redis_client=redis_client,
        redis_key=classifier_model_name,
        auto_load=False,  # Don't auto-load, we want to train fresh
        couchdb_url=couchdb_url,
        database=ingest_db_name,
        username=couchdb_username,
        password=couchdb_password
    )
    
    # Get annotated training files
    annotated_path = Path.cwd().parent / "data" / "annotated"
    print(f"Loading annotated files from: {annotated_path}")
    
    if annotated_path.exists():
        annotated_files = get_file_list(str(annotated_path), pattern="**/*.ann")
        
        if len(annotated_files) > 0:
            print(f"Found {len(annotated_files)} annotated files")
            
            # Train the classifier
            print("Training classifier...")
            results = classifier.fit(
                annotated_files,
                model_type='logistic',
                use_suffixes=False,
                line_level=True
            )
            
            print(f"Training complete!")
            print(f"  Accuracy: {results['accuracy']:.4f}")
            print(f"  F1 Score: {results['f1_score']:.4f}")
            print(f"  Labels: {classifier.labels}")
            
            # Save model to Redis
            print("\nSaving model to Redis...")
            if classifier.save_to_redis():
                print(f"✓ Model successfully saved to Redis with key: {classifier_model_name}.")
            else:
                print("✗ Failed to save model to Redis")
        else:
            print(f"No annotated files found in {annotated_path}")
    else:
        print(f"Directory does not exist: {annotated_path}")
        print("Please ensure annotated training data is available.")
else:
    print(f"Skipping generation of model {classifier_model_name}.")

Skipping generation of model skol:classifier:model:v1.2.


## Extract the taxa names and descriptions

We use the classifier to extract taxon names and descriptions from articles, issues, and books. The YEDDA-annotated texts are written back to CouchDB.
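
For readers unfamiliar with the annotation format, here is a minimal sketch of reading YEDDA-style spans, assuming each span looks like `[@ text#Label*]` as in the `annotated_pg` output further down. The regular expression, the function `parse_yedda`, and the sample text are illustrative assumptions; the project's own parser lives in the imported modules and may differ.

```python
import re
from typing import List, Tuple

# One span: "[@", the text, "#", a label, then "*]".
YEDDA_SPAN = re.compile(r"\[@(.*?)#([A-Za-z][\w-]*)\*\]", re.DOTALL)


def parse_yedda(text: str) -> List[Tuple[str, str]]:
    """Return (label, span_text) pairs for each annotated span in text."""
    return [(label, body.strip()) for body, label in YEDDA_SPAN.findall(text)]


# Hypothetical annotated text; the taxon name is made up for illustration.
sample = (
    "[@ Examplea hypothetica sp. nov.#Nomenclature*]\n"
    "[@ Thallus yellow, areolate;\napothecia orange.#Description*]\n"
    "[@ Acknowledgements follow.#Misc-exposition*]\n"
)

spans = parse_yedda(sample)
for label, body in spans:
    print(label, "->", body)
```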

In [13]:
classifier = SkolClassifier(
    spark=spark,
    redis_client=redis_client,
    redis_key=classifier_model_name,
    auto_load=True,
    couchdb_url=couchdb_url,
    database=ingest_db_name,
    username=couchdb_username,
    password=couchdb_password
)

if classifier.labels is None:
    raise ValueError("No model found in Redis. Please train a model first.")

print(f"Model loaded with labels: {classifier.labels}")

print("\nLoading and classifying documents from CouchDB...")
predictions = classifier.predict_from_couchdb(pattern="*.txt", line_level=True)

# Show sample predictions
print("\nSample predictions:")
predictions.select(
    "doc_id", "attachment_name", "predicted_label"
).show(5, truncate=50)

# Save results back to CouchDB using distributed writes
print("\nSaving predictions back to CouchDB...")
print("(Each partition writes its documents using a single connection)")
results = classifier.save_to_couchdb(predictions=predictions, suffix=".ann", coalesce_labels=True)

# Report results
successful = sum(1 for r in results if r['success'])
failed = len(results) - successful

print(f"\nSaved {successful} annotated files to CouchDB")
if failed > 0:
    print(f"Failed to save {failed} files")
    for r in results:
        if not r['success']:
            print(f"  {r['doc_id']}/{r['attachment_name']}")


Model loaded with labels: ['Misc-exposition', 'Description', 'Nomenclature']

Loading and classifying documents from CouchDB...

Sample predictions:
+--------------------------------+---------------+---------------+
|                          doc_id|attachment_name|predicted_label|
+--------------------------------+---------------+---------------+
|0020c88329ed456a95a18e0c219269f4|    article.txt|Misc-exposition|
|0020c88329ed456a95a18e0c219269f4|    article.txt|Misc-exposition|
|0020c88329ed456a95a18e0c219269f4|    article.txt|Misc-exposition|
|0020c88329ed456a95a18e0c219269f4|    article.txt|Misc-exposition|
|0020c88329ed456a95a18e0c219269f4|    article.txt|Misc-exposition|
+--------------------------------+---------------+---------------+
only showing top 5 rows


Saving predictions back to CouchDB...
(Each partition writes its documents using a single connection)

Saved 2099 annotated files to CouchDB


In [14]:
predictions.select("predicted_label", "annotated_pg").where('predicted_label = "Nomenclature"').show()
predictions.groupBy("predicted_label").count().orderBy("count").show()

+---------------+--------------------+
|predicted_label|        annotated_pg|
+---------------+--------------------+
|   Nomenclature|[@ 2. Caloplaca b...|
|   Nomenclature|[@ 5. Caloplaca g...|
|   Nomenclature|[@ Handlingar, Sj...|
|   Nomenclature|[@ 7. Caloplaca l...|
|   Nomenclature|[@ nom. nud.#Nome...|
|   Nomenclature|[@ wiss. Kl. Wien...|
|   Nomenclature|[@ Boom & Etayo (...|
|   Nomenclature|[@ 13. Caloplaca ...|
|   Nomenclature|[@ 14. Caloplaca ...|
|   Nomenclature|[@ 16. Gasparrini...|
|   Nomenclature|[@ Iran, 1937. An...|
|   Nomenclature|[@ Aegeischen Mee...|
|   Nomenclature|[@ Szatala Ö. 195...|
|   Nomenclature|[@ Marasmius pseu...|
|   Nomenclature|[@ Rattania Prabh...|
|   Nomenclature|[@ M. acerina (R....|
|   Nomenclature|[@ Rattania setul...|
|   Nomenclature|[@ Masseeella flu...|
|   Nomenclature|[@ 70. 1907.\t#No...|
|   Nomenclature|[@ Phellodon tome...|
+---------------+--------------------+
only showing top 20 rows

+---------------+-------+
|predicted_l

Here we estimate an upper bound on the number of Taxon structures we'd like to find. The abbreviation "nov." ("novum") indicates a new taxon in the current article.

In [15]:
predictions.select("*").filter(col("annotated_pg").contains("nov.")).where("predicted_label = 'Nomenclature'").count()

4014

## Build the Taxon objects and store them in CouchDB
We use CouchDB to store a full record for each taxon. We copy all metadata to the taxon records.
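
When taxon records are rewritten on every run, it helps if the same taxon always maps to the same document ID, so that a rerun updates records instead of duplicating them. The sketch below shows one way to derive such an ID from the `source` fields. It is an assumption for illustration: the name `sketch_taxon_doc_id`, the `taxon_` prefix, and the choice of SHA-256 are hypothetical, and the real `generate_taxon_doc_id` imported below is not shown in this notebook.

```python
import hashlib


def sketch_taxon_doc_id(doc_id: str, url: str, line_number: int) -> str:
    """Derive a stable ID from where the taxon was found (hypothetical scheme)."""
    key = f"{doc_id}:{url}:{line_number}"
    return "taxon_" + hashlib.sha256(key.encode("utf-8")).hexdigest()


a = sketch_taxon_doc_id(
    "0020c88329ed456a95a18e0c219269f4",
    "skol_dev/0020c88329ed456a95a18e0c219269f4/article.txt.ann",
    68,
)
b = sketch_taxon_doc_id(
    "0020c88329ed456a95a18e0c219269f4",
    "skol_dev/0020c88329ed456a95a18e0c219269f4/article.txt.ann",
    69,
)
print(a != b)
```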

In [16]:
class CouchDBFile(CDBF):
    """
    File-like object that reads from CouchDB attachment content.

    This class extends FileObject to support reading text from CouchDB
    attachments while preserving database metadata (doc_id, attachment_name,
    and database name).
    """



In [17]:
from couchdb_file import read_couchdb_partition, read_couchdb_rows

In [18]:
from extract_taxa_to_couchdb import (
    generate_taxon_doc_id,
    extract_taxa_from_partition,
    convert_taxa_to_rows,
    extract_and_save_taxa_pipeline
)

## Build Taxon objects

Here we extract the Taxon objects from the annotated attachments.

In [19]:
ingest_couchdb_url = couchdb_url
ingest_username = couchdb_username
ingest_password = couchdb_password
taxon_couchdb_url = couchdb_url
taxon_username = couchdb_username
taxon_password = couchdb_password
pattern = '*.txt.ann'

In [20]:
from extract_taxa_to_couchdb import (
    load_annotated_documents,
    extract_taxa_dataframe,
    save_taxa_dataframe,
    extract_taxa_from_partition
)

In [21]:
# Step 1: Load annotated documents
annotated_df = load_annotated_documents(
    spark, ingest_couchdb_url, ingest_db_name,
    ingest_username, ingest_password, pattern='*.txt.ann'
)
print(f"Loaded {annotated_df.count()} annotated documents")
annotated_df.show(5, truncate=False)

Loaded 2099 annotated documents

In [22]:
# Step 2: Extract taxa to DataFrame
taxa_df = extract_taxa_dataframe(spark, annotated_df, ingest_db_name)
print(f"\nExtracted {taxa_df.count()} taxa")
taxa_df.printSchema()
taxa_df.show(10, truncate=False)


Extracted 5239 taxa
root
 |-- taxon: string (nullable = false)
 |-- description: string (nullable = false)
 |-- source: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- line_number: integer (nullable = true)
 |-- paragraph_number: integer (nullable = true)
 |-- page_number: integer (nullable = true)
 |-- empirical_page_number: string (nullable = true)


In [23]:
# Step 3: Inspect actual Taxon objects from the RDD
print("\n=== Sample Taxon Objects ===")
taxa_rdd = annotated_df.rdd.mapPartitions(
    lambda partition: extract_taxa_from_partition(partition, ingest_db_name)
)
for i, taxon in enumerate(taxa_rdd.take(3)):
    print(f"\nTaxon {i+1}:")
    print(f"  Type: {type(taxon)}")
    print(f"  Has nomenclature: {taxon.has_nomenclature()}")
    print(f"  As row: {taxon.as_row()}")


=== Sample Taxon Objects ===

Taxon 1:
  Type: <class 'taxon.Taxon'>
  Has nomenclature: True
  As row: {'taxon': ' 2. Caloplaca brachyspora Mereschk., Lich. Ross. Exs., fasc. 22, no. 276 (1913)\n\n', 'description': ' orange apothecia, c. 0.5–0.7 mm in diam., with paler margin on a very thin, ±\n\n\n (< 0.3 mm in diam.), short asci (45–55 µm long) containing 14–15 ascospores;\n\n\n The type specimen has a yellow thallus of tall, convex areoles and orange\napothecia, c. 0.3–0.5 mm in diam. Ascospores (11.25–)13.1±1.3(–15.0) × (5.25–)\n\n', 'source': {'doc_id': '0020c88329ed456a95a18e0c219269f4', 'url': 'skol_dev/0020c88329ed456a95a18e0c219269f4/article.txt.ann', 'db_name': 'skol_dev'}, 'line_number': 68, 'paragraph_number': 2, 'page_number': 1, 'empirical_page_number': None}

Taxon 2:
  Type: <class 'taxon.Taxon'>
  Has nomenclature: True
  As row: {'taxon': ' 5. Caloplaca gyalolechiiformis Szatala, Ann. Hist.-Nat. Mus. Natl. Hungarici, s.n. 7:\n\n\n Handlingar, Sjätte Följden, ser. B 

In [25]:
# Step 4: Only try saving if extraction looks good
results_df = save_taxa_dataframe(
    taxa_df, taxon_couchdb_url, taxon_db_name,
    taxon_username, taxon_password
)
results_df.groupBy("success").count().show(truncate=False)


+-------+-----+
|success|count|
+-------+-----+
|true   |5239 |
+-------+-----+



## Dr. Drafts document embedding

Dr. Drafts loads the taxon documents from CouchDB, computes embeddings for their descriptions, and saves the embeddings to Redis.


In [None]:
from dr_drafts_mycosearch.data import SKOL_TAXA as STX
from dr_drafts_mycosearch.compute_embeddings import EmbeddingsComputer as EC

class SKOL_TAXA(STX):

    def load_data(self):
        """Load taxon data from CouchDB into a pandas DataFrame."""
        # TODO(piggy): Convert whole package to pyspark DataFrame for better scaling.
        # Connect to CouchDB
        server = couchdb.Server(self.couchdb_url)
        if self.username and self.password:
            server.resource.credentials = (self.username, self.password)

        # Access the database
        if self.db_name not in server:
            raise ValueError(f"Database '{self.db_name}' not found in CouchDB server")

        db = server[self.db_name]

        # Fetch all documents from the database
        records = []
        for doc_id in db:
            # Skip design documents
            if doc_id.startswith('_design/'):
                continue

            doc = db[doc_id]
            records.append(doc)

        if not records:
            # Create empty DataFrame if no records found
            self.df = pd.DataFrame()
            print(f"Warning: No taxon records found in database '{self.db_name}'")
            return

        # Convert to DataFrame
        self.df = pd.DataFrame(records)
        print(f"Loaded {len(self.df)} taxon records from CouchDB database '{self.db_name}'")


In [None]:
class EmbeddingsComputer(EC):
    """Class for computing and storing embeddings from narrative data."""

    def write_embeddings_to_redis(self):
        """Write embeddings to Redis using instance configuration."""
        import redis

        if self.redis_username and self.redis_password:
            r = redis.from_url(self.redis_url, username=self.redis_username, password=self.redis_password, db=self.redis_db)
        else:
            r = redis.from_url(self.redis_url, db=self.redis_db)

        pickled_data = pickle.dumps(self.result)
        r.set(self.embedding_name, pickled_data)
        print(f'Embeddings written to Redis (db={self.redis_db}) with key: {self.embedding_name}')

    def run(self, descriptions):
        """Run the full embeddings computation pipeline.

        Returns:
            pd.DataFrame: The computed embeddings
        """
        df = descriptions.drop_duplicates(
            subset=['description'],
            keep='last',
            ignore_index=True
        )

        if not torch.cuda.is_available():
            print('Warning: No GPU detected. Using CPU.')

        embeddings = self.encode_narratives(df.description.astype(str))
        self.result = pd.concat([df, embeddings], axis=1)

        # Write to Redis if embedding name is specified
        if self.embedding_name:
            if not self.redis_url:
                raise ValueError("redis_url must be provided when embedding_name is specified")
            self.write_embeddings_to_redis()
        else:
            # Write to local filesystem
            self.write_embeddings_to_file()

        return self.result



## Compute Embeddings

We use SBERT to embed the taxa into a search space.
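
To make "search space" concrete: once every description is a vector, a query is answered by embedding the query text and ranking taxa by vector similarity. The sketch below ranks by cosine similarity in plain Python. The three-dimensional vectors and the names `taxon_a`, `taxon_b`, and `taxon_c` are toy values for illustration, not real SBERT output, and Dr. Drafts may rank results differently.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(
    query: Sequence[float],
    corpus: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k corpus entries most similar to query, best first."""
    scored = [(name, cosine(query, vec)) for name, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy embeddings; real ones would have hundreds of dimensions.
toy_embeddings = {
    "taxon_a": [1.0, 0.0, 0.0],
    "taxon_b": [0.0, 1.0, 0.0],
    "taxon_c": [0.9, 0.1, 0.0],
}

hits = nearest([1.0, 0.0, 0.0], toy_embeddings, k=2)
print([name for name, _ in hits])
```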

In [None]:
skol_taxa = SKOL_TAXA(
    couchdb_url="http://localhost:5984",
    username=couchdb_username,
    password=couchdb_password,
    db_name=taxon_db_name
)
descriptions = skol_taxa.get_descriptions()

In [None]:
embedder = EmbeddingsComputer(
    idir='/dev/null',
    redis_url='redis://localhost:6379',
    embedding_name='skol:embedding:v0.1',
)

embedding_result = embedder.run(descriptions)

## Bibliography

* DOI Foundation, "DOI Citation Formatter HTTP API", https://citation.doi.org/api-docs.html, accessed 2025-11-12.
* Yang, Jie, Zhang, Yue, Li, Linwei, and Li, Xingxuan, 2018, "YEDDA: A Lightweight Collaborative Text Span Annotation Tool", Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, http://aclweb.org/anthology/P18-4006


## Appendix: On the use of an AI Coder

Portions of this work were completed with the aid of Claude Code Pro. I wish to give a clarifying example of how I've used this very powerful tool, and reveal why I am comfortable with claiming authorship of the resulting code.

For this project I needed results from an earlier class project in which a trio of students built and evaluated models for classifying paragraphs. The earlier work was built as an IPython notebook, with many examples and inline code. Just copying the earlier notebook would have introduced many irrelevant details and would not have furthered the overall project.

I asked Claude Code to translate the notebook into a module that I could import. It did a pretty good job. Without being told, it made a submodule, extracted the illustrative code as examples, wrote reasonable documentation, and created packaging for the module.

The skill level of the coding was roughly that of a highly disciplined average junior programmer. The architecture was simplistic and violated several design principles such as DRY. I requested specific refactorings, such as asking for a group of functions to be converted into an object that held the parameters they had been passing around redundantly.

The initial code used REST interfaces directly and read all the data onto a single machine, so it was not using pyspark correctly. Through a series of refactorings, I asked that the code use appropriate libraries that I named, and that it create correct UDFs to execute transformations in parallel.

I walked the AI through creating an object that I could use to illustrate my use of redis and couchdb interfaces, while leaving the irrelevant details in a separate library.

In short, I still have to understand good design principles. I have to be able to recognize where appropriate libraries apply. I still have to understand the frameworks I am working with.

I now have a strong understanding of the difference between "vibe coding" and AI-assisted software engineering. In my first 4 hours with Claude Code, I was able to produce roughly 4 days' worth of professional-grade working code.