# 📘 Zotero PDF to Markdown Converter

This notebook connects to Zotero via its Web API, finds the PDF attachments in a specific collection, locates the matching files in your local Zotero storage, and converts them to Markdown files.


In [2]:
from pathlib import Path
import subprocess

import requests
from PyPDF2 import PdfReader  # needed by extract_text_from_pdf below

## 🧩 How to Get Your Zotero API Key, User ID, and Collection Key

To connect this notebook to your Zotero library, you'll need:

1. A **Zotero API key**  
2. Your **Zotero User ID**  
3. The **Collection Key** for the folder containing your papers  

---

### 🔑 Step 1 — Get Your Zotero API Key

1. Go to: [https://www.zotero.org/settings/keys](https://www.zotero.org/settings/keys)
2. Click **Create new private key**
3. Give it a name like `Jupyter Notebook`
4. Choose permission level: **Read Only**
5. (Optional) Limit to a specific group or collection
6. Click **Save Key**
7. Copy your **API Key** and paste it into the notebook config

---

### 🆔 Step 2 — Find Your Zotero User ID

1. Still at [https://www.zotero.org/settings/keys](https://www.zotero.org/settings/keys)
2. Scroll down to the **Applications** section
3. Your **User ID** is shown right below the blue button  
   > Example: `Your user ID for use in API calls is 12345678`
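
If you would rather confirm the key and User ID from code, the sketch below asks the Zotero Web API which user an API key belongs to. It assumes the key-information endpoint `https://api.zotero.org/keys/current` and a `userID` field in its JSON response; both are assumptions worth checking against the Zotero Web API documentation. `fetch_key_info` and `user_id_from_key_info` are hypothetical helper names, and `http_get` stands for any callable compatible with `requests.get`.

```python
def fetch_key_info(api_key, http_get):
    """Ask the Zotero API for the metadata of `api_key`.

    `http_get` is any callable with the same interface as `requests.get`.
    Assumption: the endpoint /keys/current describes the key sent in the header.
    """
    response = http_get(
        "https://api.zotero.org/keys/current",
        headers={"Zotero-API-Key": api_key},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


def user_id_from_key_info(info):
    """Return the user ID as a string (assumes a `userID` field in the response)."""
    return str(info["userID"])
```

In this notebook it could be called as `user_id_from_key_info(fetch_key_info(ZOTERO_API_KEY, requests.get))` once the configuration cell has run.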

---

### 🗂️ Step 3 — Find Your Collection Key

1. Go to your **Zotero Web Library**: `https://www.zotero.org/your_username/library`
2. Click on the desired collection (e.g. “Creativity in vitro”)
3. Look at the browser URL; it will look like this:  
   `https://www.zotero.org/your_username/collections/ABCD1234`
4. Copy the last part (`ABCD1234`); that is your **Collection Key**
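
To avoid copy-paste slips, the key can also be pulled out of the URL in code. This is a minimal sketch that assumes the URL shape shown in Step 3, with the key directly after the `collections` path segment; `collection_key_from_url` is a hypothetical helper, not part of the notebook.

```python
from urllib.parse import urlparse


def collection_key_from_url(url):
    """Return the path segment that follows `collections` in a web library URL."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if "collections" not in segments:
        raise ValueError(f"No 'collections' segment in URL: {url}")
    position = segments.index("collections")
    if position + 1 >= len(segments):
        raise ValueError(f"No collection key after 'collections' in URL: {url}")
    return segments[position + 1]


print(collection_key_from_url("https://www.zotero.org/your_username/collections/ABCD1234"))
# prints: ABCD1234
```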



In [3]:
# --- Configuration ---
ZOTERO_USER_ID = "13831668"           # Replace with your Zotero user ID
ZOTERO_API_KEY = "1RZ3XyNVJkBEoTAJDhx5jwcb"           # Replace with your Zotero API key
COLLECTION_KEY = "4Z2H2D4K"     # Replace with the key of the desired collection
ZOTERO_STORAGE_PATH = Path.home() / "Zotero" / "storage"  # Default Zotero local storage path
HEADERS = {"Zotero-API-Key": ZOTERO_API_KEY}
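
Hard-coding an API key in a notebook makes it easy to share the key by accident. One possible alternative, sketched below, reads the same three settings from environment variables. The variable names (`ZOTERO_USER_ID`, `ZOTERO_API_KEY`, `ZOTERO_COLLECTION_KEY`) and the helper `load_zotero_config` are hypothetical choices for illustration, not something Zotero defines.

```python
import os


def load_zotero_config(env=None):
    """Build the notebook's settings from environment variables.

    The variable names below are hypothetical; pick whatever suits your setup.
    """
    env = os.environ if env is None else env
    names = ("ZOTERO_USER_ID", "ZOTERO_API_KEY", "ZOTERO_COLLECTION_KEY")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "user_id": env["ZOTERO_USER_ID"],
        "collection_key": env["ZOTERO_COLLECTION_KEY"],
        # Same header shape as the HEADERS constant in the configuration cell
        "headers": {"Zotero-API-Key": env["ZOTERO_API_KEY"]},
    }
```

Passing a plain dictionary as `env` makes the function easy to try out without touching the real environment.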

## 🔧 About the Functions Below

The following functions are used to interact with the Zotero API and process the local PDFs:

- **`get_items_from_collection(...)`**: Fetches the items (e.g., papers) in a specific Zotero collection using your User ID and Collection Key. The request sets `limit=100`, so only the first 100 items are returned.
- **`get_pdfs_for_item(...)`**: For each item (paper), this function retrieves all child attachments and filters only those with a PDF content type.
- **`extract_text_from_pdf(...)`**: Reads the actual PDF file from local storage and extracts text from all pages using the PyPDF2 library.

These functions together allow you to:
1. Identify and locate the PDFs stored locally in Zotero's storage folder, where each attachment sits in a subfolder named after its unique key (like `2UH3FH6Z`).
2. Extract the content of those PDFs and prepare them for further processing or conversion into Markdown.
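
Because the collection request in this notebook uses `limit=100`, larger collections need more than one request. The sketch below shows one way to page through a collection; it assumes the endpoint accepts `start` and `limit` query parameters, and `iter_collection_items` is a hypothetical helper rather than part of the notebook. `http_get` stands for any callable compatible with `requests.get`.

```python
def iter_collection_items(user_id, collection_key, headers, http_get, page_size=100):
    """Yield every item in a collection, requesting one page at a time.

    `http_get` is any callable compatible with `requests.get`.
    Assumption: the endpoint honours `start` and `limit` query parameters.
    """
    url = f"https://api.zotero.org/users/{user_id}/collections/{collection_key}/items"
    start = 0
    while True:
        response = http_get(url, headers=headers, params={"start": start, "limit": page_size})
        response.raise_for_status()
        page = response.json()
        yield from page
        if len(page) < page_size:  # a short (or empty) page means we reached the end
            break
        start += len(page)
```

With the configuration cell in place, `list(iter_collection_items(ZOTERO_USER_ID, COLLECTION_KEY, HEADERS, requests.get))` would collect the full result.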

In [4]:
def get_items_from_collection(user_id, collection_key):
    """Return the first 100 items of a collection as parsed JSON."""
    url = f"https://api.zotero.org/users/{user_id}/collections/{collection_key}/items?limit=100"
    response = requests.get(url, headers=HEADERS)
    response.raise_for_status()  # fail loudly on HTTP errors (e.g. an invalid key)
    return response.json()

def get_pdfs_for_item(user_id, item_key):
    """Return the child attachments of an item whose content type is PDF."""
    url = f"https://api.zotero.org/users/{user_id}/items/{item_key}/children"
    response = requests.get(url, headers=HEADERS)
    response.raise_for_status()
    return [att for att in response.json() if att["data"].get("contentType") == "application/pdf"]

def extract_text_from_pdf(pdf_path):
    """Return the text of all pages in a PDF, or "" if the file cannot be read."""
    try:
        reader = PdfReader(pdf_path)
        # extract_text() can come back empty for image-only pages, hence the `or ""`
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    except Exception as e:
        print(f"Failed to read {pdf_path}: {e}")
        return ""

## 🔄 Mapping API Results to Local Zotero Storage

This section connects the Zotero metadata (retrieved using the API) with the actual PDF files saved locally in Zotero's `storage` folder. Each PDF in Zotero is saved in a subfolder with a randomly generated name (like `2UH3FH6Z`), and the Zotero API provides the `key` that matches that folder name.

Here's what happens:

1. We call `get_items_from_collection(...)` to get all items from the specified collection.
2. For each item (typically a paper), we call `get_pdfs_for_item(...)` to retrieve its child attachments and filter only the PDF files.
3. For each PDF, we extract its `key` (which matches the local Zotero folder name).
4. We check if that folder exists on disk and look for a `.pdf` file inside it.
5. If a PDF is found, we append a `(title, path)` pair to the `pdf_map` list, using the first `.pdf` file in the folder. The list is used later for conversion to Markdown.
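
Steps 3 to 5 can be isolated into a small lookup function, which makes the folder-matching logic easier to test on its own. This is a minimal sketch; `find_local_pdf` is a hypothetical helper, and sorting the matches is an added choice so that the same file is picked on every run.

```python
from pathlib import Path


def find_local_pdf(storage_path, attachment_key):
    """Return the first PDF inside storage_path/attachment_key, or None."""
    folder = Path(storage_path) / attachment_key
    if not folder.is_dir():
        return None
    pdf_files = sorted(folder.glob("*.pdf"))  # sorted for a deterministic choice
    return pdf_files[0] if pdf_files else None
```

Inside the loop it could be called as `find_local_pdf(ZOTERO_STORAGE_PATH, pdf["data"]["key"])`, skipping the attachment when the result is `None`.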


In [5]:
pdf_map = []
items = get_items_from_collection(ZOTERO_USER_ID, COLLECTION_KEY)

for item in items:
    title = item["data"].get("title", "Untitled")
    item_key = item["data"]["key"]
    pdfs = get_pdfs_for_item(ZOTERO_USER_ID, item_key)

    for pdf in pdfs:
        pdf_key = pdf["data"]["key"]
        folder_path = ZOTERO_STORAGE_PATH / pdf_key
        if folder_path.exists():
            pdf_files = list(folder_path.glob("*.pdf"))
            if pdf_files:
                pdf_map.append((title, pdf_files[0]))

In [6]:
pdf_map

[('Brain decoding in multiple languages: Can cross-language brain decoding work?',
  PosixPath('/Users/linalopes/Zotero/storage/2TSQFZCN/Xu et al. - 2021 - Brain decoding in multiple languages Can cross-language brain decoding work.pdf')),
 ('Thinking with images or thinking with language: a pilot EEG probability mapping study',
  PosixPath('/Users/linalopes/Zotero/storage/R4VEIFZ5/Petsche et al. - 1992 - Thinking with images or thinking with language a pilot EEG probability mapping study.pdf')),
 ('Analysis of EEG Signals Related to Artists and Nonartists during Visual Perception, Mental Imagery, and Rest Using Approximate Entropy',
  PosixPath('/Users/linalopes/Zotero/storage/5NQIFBDC/Shourie et al. - 2014 - Analysis of EEG Signals Related to Artists and Nonartists during Visual Perception, Mental Imagery,.pdf')),
 ('A neural speech decoding framework leveraging deep learning and speech synthesis',
  PosixPath('/Users/linalopes/Zotero/storage/FW2MCAN8/Chen et al. - 2024 - A neural sp

## 🪄 Convert PDF to Markdown with Docling (CLI Version)

In this section, we use the [Docling CLI](https://github.com/docling-project/docling) to convert each PDF listed in `pdf_map` into Markdown format.

The process involves calling `docling` from the command line using Python's `subprocess` module. The output is saved as `.md` files in the `../../data/markdown` folder, resolved relative to the notebook's working directory.

Make sure Docling is installed on your system and available in your terminal path. You can install it via pip if needed:

```
pip install docling
```
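
Before running the conversion, it can help to confirm that the `docling` executable is actually reachable from Python. The sketch below uses `shutil.which` for that check and builds the same argument list this notebook passes to `subprocess`; `build_docling_command` and `docling_available` are hypothetical helper names.

```python
import shutil


def build_docling_command(pdf_path, output_dir):
    """Return the argument list for converting one PDF with the Docling CLI."""
    return ["docling", str(pdf_path), "--output", str(output_dir)]


def docling_available():
    """Return True if a `docling` executable is found on the PATH."""
    return shutil.which("docling") is not None


if not docling_available():
    print("docling was not found on your PATH; try `pip install docling`.")
```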

In [None]:
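output_dir = Path("../../data/markdown")
output_dir.mkdir(parents=True, exist_ok=True)  # also create missing parent folders

for title, pdf_path in pdf_map:
    print(f"Converting with Docling: {pdf_path.name} → {output_dir}/")

    # Run the Docling CLI on one PDF and write the result into output_dir
    result = subprocess.run([
        "docling",
        str(pdf_path),
        "--output",
        str(output_dir)
    ])
    if result.returncode != 0:
        print(f"Docling failed for {pdf_path.name} (exit code {result.returncode})")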