# ChromaDB Metadata Review for ZotMCP

This notebook reviews the metadata stored in ChromaDB for Zotero items and compares it with available Zotero API fields.

## Goals:
1. Initialize Buttermilk with the Zotero configuration
2. Query ChromaDB to inspect stored metadata
3. Identify which Zotero fields are currently stored
4. Compare with available Zotero API fields
5. Document missing citation fields

## 1. Initialize Buttermilk with Zotero Config

In [1]:
import sys
import json
from pathlib import Path
from pprint import pprint

# Add project root to path for imports
project_root = Path.cwd() / "projects" / "zotmcp"
sys.path.insert(0, str(project_root / "src"))
conf_dir = str(project_root / "conf")

# Import buttermilk
from buttermilk import init_async
from buttermilk.tools import (
    ChromaDBSearchTool,
)

# Initialize buttermilk with zotero config
bm = await init_async(config_dir=conf_dir, config_name="zotero")

print(f"Buttermilk initialized")
print(f"Config dir: {conf_dir}")
print(f"\nStorage config:")
pprint(bm.cfg.storage.zotero_vectors)





Buttermilk initialized
Config dir: /home/nic/src/writing/projects/zotmcp/conf

Storage config:
{'type': 'chromadb', 'collection_name': 'prosocial_zot', 'persist_directory': '/home/nic/src/writing/projects/zotmcp/.cache/zotero-prosocial-fulltext/files', 'embedding_model': 'gemini-embedding-001', 'dimensionality': 3072}


## 2. Connect to ChromaDB and Get Sample Documents

In [2]:
# Create search tool and get collection
storage_config = bm.cfg.storage.zotero_vectors
search_tool = ChromaDBSearchTool(
    type="chromadb",
    collection_name=storage_config.collection_name,
    persist_directory=storage_config.persist_directory,
    embedding_model=storage_config.embedding_model,
    dimensionality=storage_config.dimensionality,
)

await search_tool.ensure_cache_initialized()
collection = search_tool.collection

print(f"Collection: {collection.name}")
print(f"Total documents: {collection.count()}")


Collection: prosocial_zot
Total documents: 109432


## 3. Sample a Few Documents to Inspect Metadata

In [3]:
# Get a sample of documents with their metadata
sample_results = collection.get(limit=5, include=["metadatas", "documents"])

print(f"Sampled {len(sample_results['metadatas'])} documents\n")
print("=" * 80)

for i, (meta, doc) in enumerate(
    zip(sample_results["metadatas"], sample_results["documents"])
):
    print(f"\n### Document {i + 1}")
    print(f"\nMetadata keys: {list(meta.keys())}")
    print(f"\nFull metadata:")
    pprint(meta)
    print(f"\nDocument preview (first 200 chars):")
    print(doc[:200] + "..." if len(doc) > 200 else doc)
    print("\n" + "=" * 80)


Sampled 5 documents


### Document 1

Metadata keys: ['processing_run_id', 'zotero_data', 'citation', 'deduplication_strategy', 'created_timestamp', 'uri', 'chunk_index', 'content_hash', 'document_title', 'chunk_type', 'content_type', 'embedding_model', 'title', 'doi_or_url', 'document_id']

Full metadata:
{'chunk_index': 0,
 'chunk_type': 'unknown',
 'citation': 'Chipty, T. (2001). Vertical integration, market foreclosure, and '
             'consumer welfare in the cable television industry. The American '
             'Economic Review, 91(3), 428-453. '
             'https://www.jstor.org/stable/2677872',
 'content_hash': '2d6d5d6e8c8f1765fe97dd35195ec1d927257decad50edcbae941ca31302d49d',
 'content_type': 'unknown',
 'created_timestamp': '2025-08-12T07:57:11.684416',
 'deduplication_strategy': 'both',
 'document_id': 'UFEQ4F94',
 'document_title': 'Vertical Integration, Market Foreclosure, and Consumer '
                   'Welfare in the Cable Television Industry',
 'doi_or_url': '

## 4. Inspect zotero_data Field

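The `zotero_data` field holds the Zotero item record from the Zotero API, stored as a single serialized value rather than as separate metadata keys. Let's extract it, parse it if it is a JSON string, and examine its fields.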

In [4]:
# Find a document with zotero_data and parse it
for meta in sample_results["metadatas"]:
    if "zotero_data" in meta:
        print("Found zotero_data field!\n")
        print(f"Type: {type(meta['zotero_data'])}")
        print(f"\nRaw value (first 500 chars):")
        print(str(meta["zotero_data"])[:500])

        # Try to parse if it's a JSON string
        if isinstance(meta["zotero_data"], str):
            try:
                zotero_data = json.loads(meta["zotero_data"])
                print(f"\n\nParsed zotero_data fields:")
                pprint(list(zotero_data.keys()))
                print(f"\n\nFull zotero_data:")
                pprint(zotero_data)
                break
            except json.JSONDecodeError as e:
                print(f"Failed to parse as JSON: {e}")
        else:
            print(f"\n\nzotero_data fields:")
            pprint(
                list(meta["zotero_data"].keys())
                if isinstance(meta["zotero_data"], dict)
                else "Not a dict"
            )
            break
else:
    print("No parseable zotero_data field found in sample documents")


Found zotero_data field!

Type: <class 'str'>

Raw value (first 500 chars):
{"key": "UFEQ4F94", "version": 48842, "itemType": "journalArticle", "title": "Vertical Integration, Market Foreclosure, and Consumer Welfare in the Cable Television Industry", "creators": [{"creatorType": "author", "firstName": "Tasneem", "lastName": "Chipty"}], "abstractNote": "I examine the effects of vertical integration between programming and distribution in the cable television industry. I assess the effects of ownership structure on program offerings, prices, and subscriptions, and I comp


Parsed zotero_data fields:
['key',
 'version',
 'itemType',
 'title',
 'creators',
 'abstractNote',
 'publicationTitle',
 'volume',
 'issue',
 'pages',
 'date',
 'series',
 'seriesTitle',
 'seriesText',
 'journalAbbreviation',
 'language',
 'DOI',
 'ISSN',
 'shortTitle',
 'url',
 'accessDate',
 'archive',
 'archiveLocation',
 'libraryCatalog',
 'callNumber',
 'rights',
 'extra',
 'tags',
 'collections',
 'relations',
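
Since `zotero_data` comes back as a JSON string in this collection, a small helper keeps the parsing in one place. This is a minimal sketch, not part of Buttermilk or ZotMCP: the name `parse_zotero_data` is hypothetical, and it assumes the field is either a JSON string or an already-parsed dict.

```python
import json
from typing import Any, Optional


def parse_zotero_data(meta: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the Zotero item as a dict, or None if absent or unparseable."""
    raw = meta.get("zotero_data")
    if isinstance(raw, dict):
        # Already a dict: nothing to parse
        return raw
    if isinstance(raw, str):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            return None
        # Guard against valid JSON that is not an object (e.g. a list)
        return parsed if isinstance(parsed, dict) else None
    return None


# Toy record shaped like the sample above (values shortened for illustration)
example = {"zotero_data": '{"key": "UFEQ4F94", "itemType": "journalArticle"}'}
item = parse_zotero_data(example)
print(item["itemType"] if item else "no zotero_data")
```

Returning `None` instead of raising lets later cells skip bad records without a `try`/`except` around every access.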


## 5. Analyze All Metadata Fields Across Sample

In [5]:
# Get a larger sample to understand metadata distribution
larger_sample = collection.get(limit=100, include=["metadatas"])

# Collect all unique metadata keys
all_metadata_keys = set()
for meta in larger_sample["metadatas"]:
    all_metadata_keys.update(meta.keys())

print(f"Total unique metadata keys across {len(larger_sample['metadatas'])} documents:")
print(f"\n{sorted(all_metadata_keys)}")
print(f"\nTotal: {len(all_metadata_keys)} unique keys")


Total unique metadata keys across 100 documents:

['chunk_index', 'chunk_type', 'citation', 'content_hash', 'content_type', 'created_timestamp', 'deduplication_strategy', 'document_id', 'document_title', 'doi_or_url', 'embedding_model', 'processing_run_id', 'title', 'uri', 'zotero_data']

Total: 15 unique keys
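
The set of unique keys shows which keys exist, but not how often each one is populated. A hedged sketch of a coverage count follows; the helper name `key_coverage` is hypothetical, and the toy rows stand in for `larger_sample["metadatas"]`.

```python
from collections import Counter
from typing import Any, Iterable


def key_coverage(metadatas: Iterable[dict[str, Any]]) -> dict[str, float]:
    """Fraction of records in which each metadata key appears."""
    counts = Counter()
    total = 0
    for meta in metadatas:
        total += 1
        counts.update(meta.keys())
    if total == 0:
        return {}
    return {key: counts[key] / total for key in sorted(counts)}


# Toy data; in the notebook this would be key_coverage(larger_sample["metadatas"])
rows = [{"title": "a", "citation": "x"}, {"title": "b"}]
print(key_coverage(rows))  # {'citation': 0.5, 'title': 1.0}
```

A key with coverage below 1.0 would indicate that some chunks were written without it, which matters before relying on that key in a filter.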


## 6. Check What Zotero Fields Are Available

Let's parse `zotero_data` across the larger sample to see which Zotero API fields are present for each item type.

In [6]:
# Collect all zotero_data fields across the sample
zotero_fields_by_type = {}

for meta in larger_sample["metadatas"]:
    if "zotero_data" in meta:
        try:
            # Parse zotero_data
            if isinstance(meta["zotero_data"], str):
                zotero_data = json.loads(meta["zotero_data"])
            else:
                zotero_data = meta["zotero_data"]

            item_type = zotero_data.get("itemType", "unknown")

            if item_type not in zotero_fields_by_type:
                zotero_fields_by_type[item_type] = set()

            zotero_fields_by_type[item_type].update(zotero_data.keys())
        except (json.JSONDecodeError, AttributeError, TypeError):
            # Skip records whose zotero_data cannot be parsed or is not a dict
            continue

print("Zotero API fields by item type:\n")
for item_type, fields in sorted(zotero_fields_by_type.items()):
    print(f"\n{item_type.upper()}:")
    print(f"  Fields: {sorted(fields)}")
    print(f"  Total: {len(fields)} fields")


Zotero API fields by item type:


BOOKSECTION:
  Fields: ['ISBN', 'abstractNote', 'accessDate', 'archive', 'archiveLocation', 'bookTitle', 'callNumber', 'collections', 'creators', 'date', 'dateAdded', 'dateModified', 'edition', 'extra', 'itemType', 'key', 'language', 'libraryCatalog', 'numberOfVolumes', 'pages', 'place', 'publisher', 'relations', 'rights', 'series', 'seriesNumber', 'shortTitle', 'tags', 'title', 'url', 'version', 'volume']
  Total: 32 fields

JOURNALARTICLE:
  Fields: ['DOI', 'ISSN', 'abstractNote', 'accessDate', 'archive', 'archiveLocation', 'callNumber', 'collections', 'creators', 'date', 'dateAdded', 'dateModified', 'extra', 'issue', 'itemType', 'journalAbbreviation', 'key', 'language', 'libraryCatalog', 'pages', 'publicationTitle', 'relations', 'rights', 'series', 'seriesText', 'seriesTitle', 'shortTitle', 'tags', 'title', 'url', 'version', 'volume']
  Total: 32 fields

PREPRINT:
  Fields: ['DOI', 'abstractNote', 'accessDate', 'archive', 'archiveID', 'archiveLocati

## 7. Compare: What's Stored in ChromaDB vs What's Available

Let's sort the important citation fields into three groups:
1. Stored as top-level metadata in ChromaDB
2. Present in `zotero_data` but not directly accessible
3. Missing from both

In [7]:
# Top-level metadata keys (excluding zotero_data itself)
top_level_keys = all_metadata_keys - {"zotero_data"}

print("TOP-LEVEL METADATA KEYS (directly accessible):")
print(sorted(top_level_keys))
print(f"\nTotal: {len(top_level_keys)}")

# Important citation fields that should be easily accessible
important_citation_fields = {
    "creators",  # Authors
    "title",
    "publicationTitle",  # Journal name
    "publisher",
    "date",
    "DOI",
    "ISBN",
    "url",
    "abstractNote",
    "itemType",
    "tags",
    "collections",
    "volume",
    "issue",
    "pages",
    "language",
}

print("\n" + "=" * 80)
print("\nIMPORTANT CITATION FIELDS ANALYSIS:")
print("\nFields that ARE in top-level metadata:")
accessible = important_citation_fields & top_level_keys
print(sorted(accessible))

print("\nFields that are ONLY in zotero_data (not easily accessible):")
# We need to check what's in zotero_data
all_zotero_fields = set()
for fields in zotero_fields_by_type.values():
    all_zotero_fields.update(fields)

buried_fields = (important_citation_fields & all_zotero_fields) - top_level_keys
print(sorted(buried_fields))

print("\nFields that are MISSING entirely:")
missing = important_citation_fields - all_zotero_fields - top_level_keys
print(sorted(missing) if missing else "None")


TOP-LEVEL METADATA KEYS (directly accessible):
['chunk_index', 'chunk_type', 'citation', 'content_hash', 'content_type', 'created_timestamp', 'deduplication_strategy', 'document_id', 'document_title', 'doi_or_url', 'embedding_model', 'processing_run_id', 'title', 'uri']

Total: 14


IMPORTANT CITATION FIELDS ANALYSIS:

Fields that ARE in top-level metadata:
['title']

Fields that are ONLY in zotero_data (not easily accessible):
['DOI', 'ISBN', 'abstractNote', 'collections', 'creators', 'date', 'issue', 'itemType', 'language', 'pages', 'publicationTitle', 'publisher', 'tags', 'url', 'volume']

Fields that are MISSING entirely:
None
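
One way the buried fields could be promoted is sketched below. It rests on assumptions that are not confirmed by this notebook: that ChromaDB metadata values need to be scalars (str, int, float, bool), so list-valued fields such as `creators`, `tags` and `collections` are flattened to strings; that a `zot_` prefix is an acceptable naming scheme; and that `promote_fields`, `format_creators` and `PROMOTED_FIELDS` are hypothetical names, not existing ZotMCP code.

```python
import json
from typing import Any

# Hypothetical selection of scalar fields to promote (assumption, adjust as needed)
PROMOTED_FIELDS = (
    "itemType", "date", "DOI", "ISBN", "url", "publicationTitle",
    "publisher", "volume", "issue", "pages", "language", "abstractNote",
)


def format_creators(creators: list[dict[str, Any]]) -> str:
    """Join Zotero creator dicts into one 'Last, First; ...' string."""
    names = []
    for creator in creators:
        if creator.get("name"):  # single-field creator, e.g. an institution
            names.append(creator["name"])
        else:
            parts = [creator.get("lastName", ""), creator.get("firstName", "")]
            names.append(", ".join(p for p in parts if p))
    return "; ".join(n for n in names if n)


def promote_fields(meta: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of meta with selected zotero_data fields at top level."""
    out = dict(meta)
    raw = meta.get("zotero_data")
    try:
        item = json.loads(raw) if isinstance(raw, str) else raw
    except json.JSONDecodeError:
        return out
    if not isinstance(item, dict):
        return out
    for field in PROMOTED_FIELDS:
        value = item.get(field)
        # Keep only non-empty scalar values
        if isinstance(value, (str, int, float, bool)) and value != "":
            out[f"zot_{field}"] = value
    if item.get("creators"):
        out["zot_creators"] = format_creators(item["creators"])
    tags = [t.get("tag", "") for t in item.get("tags") or [] if isinstance(t, dict)]
    if any(tags):
        out["zot_tags"] = "; ".join(t for t in tags if t)
    collections = [c for c in item.get("collections") or [] if isinstance(c, str)]
    if collections:
        out["zot_collections"] = "; ".join(collections)
    return out
```

With a key such as `zot_itemType` at the top level, a metadata filter along the lines of `collection.get(where={"zot_itemType": "journalArticle"})` should become possible, which is not the case while the value sits inside a JSON string.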


## 8. Example: Full Metadata Structure for One Document

In [8]:
# Pick one document and show complete metadata structure
example_meta = sample_results["metadatas"][0]

print("EXAMPLE COMPLETE METADATA STRUCTURE:\n")
print("=" * 80)
for key, value in sorted(example_meta.items()):
    print(f"\n{key}:")
    if key == "zotero_data" and isinstance(value, str):
        try:
            parsed = json.loads(value)
            print(f"  Type: dict (parsed from JSON string)")
            print(f"  Keys: {sorted(parsed.keys())}")
            print(f"  \n  Sample values:")
            for k in list(parsed.keys())[:5]:  # Show first 5 fields
                print(f"    {k}: {parsed[k]}")
        except json.JSONDecodeError:
            print(f"  {str(value)[:200]}...")
    else:
        print(f"  {value}")


EXAMPLE COMPLETE METADATA STRUCTURE:


chunk_index:
  0

chunk_type:
  unknown

citation:
  Chipty, T. (2001). Vertical integration, market foreclosure, and consumer welfare in the cable television industry. The American Economic Review, 91(3), 428-453. https://www.jstor.org/stable/2677872

content_hash:
  2d6d5d6e8c8f1765fe97dd35195ec1d927257decad50edcbae941ca31302d49d

content_type:
  unknown

created_timestamp:
  2025-08-12T07:57:11.684416

deduplication_strategy:
  both

document_id:
  UFEQ4F94

document_title:
  Vertical Integration, Market Foreclosure, and Consumer Welfare in the Cable Television Industry

doi_or_url:
  https://www.jstor.org/stable/2677872

embedding_model:
  gemini-embedding-001

processing_run_id:
  20250812T0756Z-JWC6-nicdev-nic

title:
  Vertical Integration, Market Foreclosure, and Consumer Welfare in the Cable Television Industry

uri:
  /home/nic/.cache/zotero/items/UFEQ4F94.json

zotero_data:
  Type: dict (parsed from JSON string)
  Keys: ['DOI', 'ISSN', 

## 9. Recommendations

Based on the analysis above, document:
1. Which important citation fields are missing from top-level metadata
2. Which fields should be extracted from zotero_data and promoted to top-level
3. Any fields that are completely missing from the Zotero API data

In [9]:
# Summary of findings
print("METADATA STORAGE REVIEW SUMMARY")
print("=" * 80)
print(f"\n1. Total ChromaDB documents: {collection.count()}")
print(f"2. Unique metadata keys (top-level): {len(top_level_keys)}")
print(f"3. Important citation fields accessible: {len(accessible)}")
print(f"4. Important citation fields buried in zotero_data: {len(buried_fields)}")
print(f"5. Important citation fields missing: {len(missing)}")

print("\n" + "=" * 80)
print("\nRECOMMENDATIONS:")
if buried_fields:
    print("\n✅ Extract these fields from zotero_data to top-level metadata:")
    for field in sorted(buried_fields):
        print(f"   - {field}")

if missing:
    print("\n⚠️  These fields are not available in current data:")
    for field in sorted(missing):
        print(f"   - {field}")

print("\n✓ These fields are already accessible:")
for field in sorted(accessible):
    print(f"   - {field}")


METADATA STORAGE REVIEW SUMMARY

1. Total ChromaDB documents: 109432
2. Unique metadata keys (top-level): 14
3. Important citation fields accessible: 1
4. Important citation fields buried in zotero_data: 15
5. Important citation fields missing: 0


RECOMMENDATIONS:

✅ Extract these fields from zotero_data to top-level metadata:
   - DOI
   - ISBN
   - abstractNote
   - collections
   - creators
   - date
   - issue
   - itemType
   - language
   - pages
   - publicationTitle
   - publisher
   - tags
   - url
   - volume

✓ These fields are already accessible:
   - title
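
Applying the recommendation to the existing collection could look roughly like the sketch below. It pages through the collection with `get(limit=..., offset=...)` and writes metadata back with `update(ids=..., metadatas=...)`. The function name `rewrite_metadata` and the `transform` callable are hypothetical, and the sketch assumes that updating metadata does not change the order in which `get` returns records. It has not been run against the real collection, so try it on a copy first.

```python
from typing import Any, Callable


def rewrite_metadata(
    collection: Any,
    transform: Callable[[dict[str, Any]], dict[str, Any]],
    batch_size: int = 1000,
) -> int:
    """Page through a collection and write back transformed metadata.

    `collection` is expected to behave like a ChromaDB collection;
    `transform` maps one metadata dict to its replacement.
    Returns the number of records updated.
    """
    updated = 0
    offset = 0
    while True:
        batch = collection.get(
            limit=batch_size, offset=offset, include=["metadatas"]
        )
        ids = batch["ids"]
        if not ids:
            break
        new_metas = [transform(meta or {}) for meta in batch["metadatas"]]
        collection.update(ids=ids, metadatas=new_metas)
        updated += len(ids)
        offset += len(ids)
    return updated
```

Batching keeps memory use bounded for a collection of this size (109432 documents), and passing `transform` as a parameter keeps the field-selection logic separate from the write-back loop.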
