# Add existing items to a collection

## What is this notebook about?

This notebook uploads the articles listed in a local CSV export into a new collection in the Impresso app.


In [11]:
%pip install -q impresso --upgrade

Note: you may need to restart the kernel to use updated packages.


### Open a working client connection to the Impresso API

Requires a valid API key and secret. You can obtain these from the Impresso website.


In [12]:
from impresso import connect

client = connect()

🎉 You are now connected to the Impresso API!  🎉
🔗 Using API: https://impresso-project.ch/public-api


### 📂 Load a local CSV file

Next, we load a **local CSV file** saved on your computer.  
This file was exported from the Impresso platform and contains your sampled articles.  

Impresso exports include disclaimers and extra lines at the top, so we tell [pandas](https://pandas.pydata.org/), a popular Python library for working with tabular data, to skip the first 4 rows and keep only the useful data. The file is semicolon-separated, hence `sep=';'`.  

- The dataset contains article metadata such as `uid`, `title`, `publicationDate`, etc.  
- We will use the `uid` column later to add these articles to a new Impresso user collection.  

In [13]:
import pandas as pd

csv_path = "samples/2025-09-18T12-09-21-f40f40f5.csv"

# Skip the disclaimer rows
df = pd.read_csv(csv_path, sep=';', skiprows=4)

print("Columns:", df.columns.tolist())
print("Shape:", df.shape)

# Peek at first few rows
df.head()


Columns: ['uid', 'access_right', 'collections', 'countryCode', 'dataProviderCode', 'excerpt', 'isOnFrontPage', 'is_content_available', 'is_olr', 'issue', 'languageCode', 'locations_mentioned', 'mediaCode', 'mediaPoliticalOrientation', 'mediaTopics', 'newsagencies_mentioned', 'pages', 'periodicity', 'persons_mentioned', 'provinceCode', 'publicationDate', 'relevance', 'title', 'topics', 'totalPages', 'transcript', 'transcriptLength', 'type', 'year']
Shape: (26, 29)


Unnamed: 0,uid,access_right,collections,countryCode,dataProviderCode,excerpt,isOnFrontPage,is_content_available,is_olr,issue,...,provinceCode,publicationDate,relevance,title,topics,totalPages,transcript,transcriptLength,type,year
0,DTT-1958-07-19-a-i0041,,,CH,Migros,[Copyright restricted],False,N,True,DTT-1958-07-19-a,...,na,1958-07-19T00:00:00Z,4.102252,Jazzecke ää ^ um—¦Mm —uaamri+laJzZtäm,tm-de-all-v2.0_tp87_de|0.619 tm-de-all-v2.0_tp...,1,[Copyright restricted],893,ar,1958
1,DTT-1959-06-27-a-i0223,,,CH,Migros,[Copyright restricted],False,N,True,DTT-1959-06-27-a,...,na,1959-06-27T00:00:00Z,1.096451,mwm®fmmm WMMMMEMW*W>M@>@IMMMM,tm-de-all-v2.0_tp62_de|0.389 tm-de-all-v2.0_tp...,1,[Copyright restricted],3826,ar,1959
2,DTT-1959-07-02-a-i0035,,,CH,Migros,[Copyright restricted],False,N,True,DTT-1959-07-02-a,...,na,1959-07-02T00:00:00Z,4.767456,"Radio Beromünster Mittwoch, 1. Juli 1959...",tm-de-all-v2.0_tp62_de|0.307 tm-de-all-v2.0_tp...,1,[Copyright restricted],536,ar,1959
3,DTT-1970-01-03-a-i0046,,,CH,Migros,[Copyright restricted],False,N,True,DTT-1970-01-03-a,...,na,1970-01-03T00:00:00Z,8.846883,Boro Drascovic: «Horoscope»,tm-de-all-v2.0_tp46_de|0.435 tm-de-all-v2.0_tp...,1,[Copyright restricted],613,ar,1970
4,DTT-1970-01-03-a-i0047,,,CH,Migros,[Copyright restricted],False,N,True,DTT-1970-01-03-a,...,na,1970-01-03T00:00:00Z,4.994109,* -. i.._J,tm-de-all-v2.0_tp87_de|0.407 tm-de-all-v2.0_tp...,1,[Copyright restricted],1419,ar,1970
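
If the number of disclaimer lines ever differs from 4, the hard-coded `skiprows=4` would pick up the wrong header. The sketch below shows one possible way to locate the header row instead, assuming the header line always starts with `uid;`. The helper `find_header_row` and the `sample_export` text are made up for illustration and do not reproduce a real Impresso export.

```python
import io

import pandas as pd


def find_header_row(lines, first_column="uid", sep=";"):
    """Return the 0-based index of the first line that starts with the header column.

    Hypothetical helper: assumes the header line begins with `uid;`.
    """
    for index, line in enumerate(lines):
        # Ignore a possible byte-order mark at the start of the file
        if line.lstrip("\ufeff").startswith(first_column + sep):
            return index
    raise ValueError(f"No header line starting with '{first_column}{sep}' found.")


# Made-up stand-in for an export: four preamble lines, then the real table
sample_export = "\n".join(
    [
        "Disclaimer line 1",
        "Disclaimer line 2",
        "Disclaimer line 3",
        "Disclaimer line 4",
        "uid;title;year",
        "DTT-1958-07-19-a-i0041;Jazzecke;1958",
        "DTT-1970-01-03-a-i0046;Horoscope;1970",
    ]
)

header_row = find_header_row(sample_export.splitlines())
sample_df = pd.read_csv(io.StringIO(sample_export), sep=";", skiprows=header_row)

print("Header found at line:", header_row)
print("Columns:", sample_df.columns.tolist())
```

With a real file, you could read its lines with `open(csv_path, encoding="utf-8")` and pass the result of `find_header_row` as `skiprows`.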


### 🔑 Extract article identifiers (UIDs)

Every article in Impresso has a unique identifier called a **UID** (e.g. `DTT-1958-07-19-a-i0041`).  
This is the key we will use to tell the Impresso API which articles to add to our collection.  

From the CSV file, we now extract the values in the `uid` column and store them in a Python list called `doc_ids`.  
We also print out how many UIDs were found, along with the first few as an example.

In [14]:
# --- Extract document IDs ---
if "uid" not in df.columns:
    raise ValueError("Expected 'uid' column with document IDs not found.")

doc_ids = df["uid"].dropna().astype(str).tolist()
print(f"Extracted {len(doc_ids)} doc_ids, example: {doc_ids[:5]}")

Extracted 26 doc_ids, example: ['DTT-1958-07-19-a-i0041', 'DTT-1959-06-27-a-i0223', 'DTT-1959-07-02-a-i0035', 'DTT-1970-01-03-a-i0046', 'DTT-1970-01-03-a-i0047']
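
A CSV that was edited by hand or merged from several exports may contain repeated or padded UIDs. As a precaution, the list can be cleaned before uploading. This is a minimal sketch: `normalize_ids` is a hypothetical helper, and whether the API itself tolerates duplicates is not something this notebook establishes.

```python
def normalize_ids(raw_ids):
    """Strip whitespace, drop empty values, and remove duplicates (first occurrence wins)."""
    cleaned = (str(value).strip() for value in raw_ids)
    # dict.fromkeys keeps insertion order, so the original ordering is preserved
    return list(dict.fromkeys(value for value in cleaned if value))


# Made-up input with one padded value, one repeat, and one empty string
sample_ids = [
    "DTT-1958-07-19-a-i0041",
    " DTT-1959-06-27-a-i0223 ",
    "DTT-1958-07-19-a-i0041",
    "",
]

unique_ids = normalize_ids(sample_ids)
print(f"{len(sample_ids)} raw values -> {len(unique_ids)} unique UIDs")
```

In this notebook you could apply it as `doc_ids = normalize_ids(doc_ids)` before creating the collection.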


### ⚙️ Functions to create and populate an Impresso user collection

In this section, we define helper functions that make it easy to work with Impresso user collections:

1. **`create_collection`**  
   Creates a new collection in your Impresso account by calling the public API.  
   You provide a name and (optionally) a description. The function returns the collection details, including its unique ID.

2. **`add_docids`**  
   Adds a list of article identifiers (`uid`s) to a collection.  
   Since the API accepts documents in batches, this function uploads them in groups of up to 200 articles at a time.

3. **`create_collection_with_docs`**  
   Combines the two steps above: it first creates a new collection, then immediately fills it with the article UIDs you provide.

These functions allow us to take the list of UIDs we extracted from the CSV and send them directly to the Impresso platform, so that we can view and analyze them inside the Impresso interface.

In [15]:
import time
import requests

def create_collection(client, name: str, description: str = "") -> dict:
    """
    Create a new collection in the Impresso public API.
    """
    token = getattr(client, "_api_bearer_token", None)
    if not token:
        raise ValueError("Client does not have a valid _api_bearer_token.")

    url = "https://impresso-project.ch/public-api/v1/collections"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
        "Accept": "application/json",
    }
    payload = {"name": name, "description": description, "accessLevel": "private"}

    # A timeout keeps the notebook from hanging if the API does not respond
    response = requests.post(url, headers=headers, json=payload, timeout=30)

    if response.ok:
        data = response.json()
        print(f"✅ Collection '{name}' created successfully (ID: {data.get('uid')})")
        return data
    else:
        raise RuntimeError(
            f"❌ Failed to create collection: {response.status_code} {response.text}"
        )


In [16]:

def add_docids(client, collection_id: str, doc_ids: list[str], delay: float = 4) -> None:
    """
    Add article IDs to a collection in batches of 200.
    """
    print(f"⏳ Adding {len(doc_ids)} documents to collection {collection_id}...")
    batch_size = 200

    for i in range(0, len(doc_ids), batch_size):
        batch = doc_ids[i : i + batch_size]
        try:
            client.collections.add_items(collection_id, batch)
            print(f"   ✅ Added {len(batch)} documents ({i+1}–{i+len(batch)})")
        except Exception as e:
            print(f"   ❌ Error adding batch: {e}")
            raise

        if i + batch_size < len(doc_ids):  # sleep only if more batches remain
            print(f"   ⏸ Waiting {delay} seconds before next batch...")
            time.sleep(delay)

    print(f"🎉 Done! Added all {len(doc_ids)} documents.")
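
To see how the batching in `add_docids` plays out for a larger list, the sketch below repeats the same slicing logic on made-up identifiers without calling the API. The helper `iter_batches` and the `fake_ids` list are hypothetical and serve only to illustrate the arithmetic.

```python
def iter_batches(items, batch_size=200):
    """Yield consecutive slices of `items`, each holding at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1.")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


# 450 made-up identifiers, not real Impresso UIDs
fake_ids = [f"doc-{n:04d}" for n in range(450)]

batch_sizes = [len(batch) for batch in iter_batches(fake_ids)]
# add_docids sleeps between batches only, so there is one pause fewer than batches
pauses = max(len(batch_sizes) - 1, 0)
total_wait = pauses * 4  # seconds, with the default delay of 4

print("Batch sizes:", batch_sizes)
print("Total waiting time (s):", total_wait)
```

With 450 identifiers this gives two full batches and one partial batch, so only two pauses are needed. The 26 UIDs from our CSV fit in a single batch and need no pause at all.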



The function `create_collection_with_docs` uses the previous two functions to create a collection and add the given document IDs to it.

In [17]:

def create_collection_with_docs(
    client, name: str, doc_ids: list[str], description: str = "", delay: float = 4
) -> str:
    """
    Create a new collection and populate it with document IDs.
    """
    print(f"🚀 Creating collection '{name}' with {len(doc_ids)} documents...")

    collection = create_collection(client, name, description or "")
    collection_id = collection.get("uid")
    if not collection_id:
        raise ValueError("❌ Collection ID not found in the response.")

    time.sleep(delay)
    add_docids(client, collection_id, doc_ids, delay)

    print(f"✅ Collection '{name}' ready (ID: {collection_id})")
    return collection_id


### 🚀 Example: Create and fill a collection from our CSV

* Give our collection a **name** and an optional **description**.  
* Call `create_collection_with_docs` with the list of `doc_ids`.  
* The function creates the collection in Impresso, then uploads all article UIDs in batches.  
* Finally, we print the **collection ID** so we can look it up later in the Impresso app.

⚠️ **Note about duplicate collection names**

If you see an error like this:
> Failed to create collection: 409 {"type":"https://impresso-project.ch/probs/unclassified-error","title":"An unclassified error occurred","status":409,"detail":"ConstraintValidationFailed: collections_name_creator_id_85ceb8cf_uniq must be unique"}

it means that you already have a collection with the same **name** in your Impresso account.  
To fix this, simply choose a different name when creating your collection (for example, by adding today’s date or a version number).  
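
One way to avoid the name clash on repeated runs is to append a timestamp to the name. This is only a sketch: `make_unique_name` is a hypothetical helper, and it assumes that collection names may contain spaces, parentheses, and colons, which this notebook does not verify.

```python
from datetime import datetime, timezone


def make_unique_name(base_name, now=None):
    """Append a UTC timestamp so that repeated runs do not reuse the same name."""
    now = now or datetime.now(timezone.utc)
    return f"{base_name} ({now.strftime('%Y-%m-%d %H:%M:%S')})"


# A fixed timestamp is passed here only to make the example reproducible
example_name = make_unique_name(
    "My Awesome Collection",
    datetime(2025, 9, 18, 12, 9, 21, tzinfo=timezone.utc),
)
print(example_name)
```

Called without the second argument, the helper uses the current time, so two runs more than a second apart produce different names.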

In [None]:
# --- Example usage ---
collection_name = "My Awesome Collection"  # Must be unique within your Impresso account
collection_description = "My Awesome Collection uploaded from CSV on 2025-09-18"

collection_id = create_collection_with_docs(
    client,
    name=collection_name,
    doc_ids=doc_ids,   # list of UIDs you extracted from CSV
    description=collection_description,
    delay=4,  # wait 4s between batches
)

print("📦 Final collection ID:", collection_id)