# Stratified Sampling of Articles into an Impresso User Collection

## What is this notebook about?

This notebook demonstrates how to perform stratified sampling on search results from the Impresso API. It allows you to systematically sample articles by year and newspaper, and add them to a collection for further analysis.

## Why is this useful?

When working with large historical newspaper archives, it is often impractical to analyze all available articles. Stratified sampling allows you to create a diverse subset of the data, ensuring that different time periods and publications are represented.
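
As a minimal sketch of the idea, the example below draws one article at random from each (year, newspaper) stratum. It uses hypothetical toy records rather than real API results, and the helper name `stratified_sample` is an assumption made for illustration only.

```python
import random
from collections import defaultdict

# Hypothetical toy records: (uid, year, newspaper). Not real Impresso data.
records = [
    ("a1", 1950, "paperA"),
    ("a2", 1950, "paperA"),
    ("b1", 1950, "paperB"),
    ("a3", 1951, "paperA"),
]


def stratified_sample(records, per_stratum=1, seed=0):
    """Draw up to `per_stratum` UIDs from each (year, newspaper) stratum."""
    rng = random.Random(seed)  # local generator, so the draw is repeatable
    strata = defaultdict(list)
    for uid, year, paper in records:
        strata[(year, paper)].append(uid)

    sample = []
    for key in sorted(strata):  # ascending by year, then newspaper
        uids = strata[key]
        sample.extend(rng.sample(uids, min(per_stratum, len(uids))))
    return sample


sample = stratified_sample(records)
print(sample)
```

Every stratum contributes at most `per_stratum` articles, so a newspaper with thousands of hits in one year cannot dominate the sample.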

## What will you learn?

In this notebook, you will learn how to:

- Connect to the Impresso API.
- Perform faceted searches to get article counts per year and newspaper.
- Implement a stratified sampling strategy to select articles.
- Create a new Impresso user collection via the Impresso API.
- Add sampled articles to an Impresso user collection.
- Double-check the collection to ensure it contains the expected articles.

## Recommended resources

- [Impresso Python Documentation](https://pypi.org/project/impresso/)
- [Video Tutorial on Impresso User Collections](https://vimeo.com/347022422)

---

## Prerequisites

Install the required package.


In [None]:
%pip install -q impresso

## Open a working client connection to the Impresso API

Requires a valid API key and secret. You can obtain these from the Impresso website.


In [None]:
from impresso import connect

client = connect()

## Set up logging for use throughout the notebook

In order to better understand how the collection was created, we store rich information
about the sampling process in a log file. This will help us trace the steps taken during the sampling and collection creation.


In [None]:
import logging
import os


def setup_logging(log_filename: str = "sampling_log.txt"):
    """
    Set up logging to a user-specified file.

    Args:
        log_filename (str): Path to the log file. Defaults to "sampling_log.txt".
    """
    logger = logging.getLogger(__name__)
    logger.setLevel(logging.INFO)

    # Clear existing handlers
    logger.handlers.clear()

    # Create formatter
    formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")

    # Create file handler
    file_handler = logging.FileHandler(log_filename, mode="w", encoding="utf-8")

    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(formatter)

    # Create console handler
    console_handler = logging.StreamHandler()
    console_handler.setLevel(logging.INFO)
    console_handler.setFormatter(formatter)

    # Add handlers to logger
    logger.addHandler(file_handler)
    logger.addHandler(console_handler)

    # Prevent duplicate logs
    logger.propagate = False

    # Get absolute path of the log file
    absolute_log_path = os.path.abspath(log_filename)
    print(f"Logging configured. Logs will be saved to: {absolute_log_path}")
    return logger


# Set up default logging
logger = setup_logging()

## Configure logging (optional)

You can specify a custom log file to trace the sampling process. If you don't run this
cell, logging will use the default output file "sampling_log.txt".


In [None]:
# Uncomment and run this cell to use a custom log file
# logger = setup_logging("my_custom_sampling_log.txt")

## Sampling function that retrieves article IDs in ascending time order

Let's define a function that retrieves article IDs in ascending order of year. For each year with hits, it queries every newspaper with hits and picks one article at random from the first `limit_per_query` results.
To be gentle on the API, the function pauses between requests so that the server is not
flooded with queries in a short period of time.


In [None]:
import random
import time
from impresso import DateRange
from impresso.client import ImpressoClient


def sample_impresso_uids(
    client: ImpressoClient,
    keyword: str,
    start_date: str | None = None,
    end_date: str | None = None,
    limit_per_query: int = 20,
    max_hits: int = 20,
    delay: float = 1.0,
) -> list[str]:
    """
    Sample article UIDs from Impresso API based on a search keyword and date range.

    Args:
        client (ImpressoClient): Impresso API client.
        keyword (str): Keyword to search for.
        start_date (str | None): Start date for filtering (YYYY-MM-DD format).
        end_date (str | None): End date for filtering (YYYY-MM-DD format).
        limit_per_query (int): Maximum number of articles per query.
        max_hits (int): Maximum number of articles to sample.
        delay (float): Delay in seconds between API requests.

    Returns:
        list[str]: List of sampled article UIDs.

    Raises:
        ValueError: If limit_per_query is not between 1 and 100.
        Exception: If API requests fail.
    """
    logger = logging.getLogger(__name__)

    logger.info(f"Starting sampling process for keyword: '{keyword}'")
    logger.debug(
        f"Parameters: limit_per_query={limit_per_query}, max_hits={max_hits},"
        f" delay={delay}"
    )

    if not 0 < limit_per_query <= 100:
        logger.error(
            f"Invalid limit_per_query: {limit_per_query}. Must be between 1 and 100."
        )
        raise ValueError(
            f"Invalid limit_per_query: {limit_per_query}. Must be between 1 and 100."
        )

    sampled_uids = []
    found = 0

    # Step 1: Get all years with mentions of the keyword in the date range
    logger.debug("Step 1: Fetching year facets for keyword")
    if start_date or end_date:
        date_range = DateRange(start_date, end_date)
        logger.info(f"Using date range: {date_range}")
    else:
        date_range = None
        logger.info("No date range specified, using all available data.")

    try:
        year_hits = client.search.facet(
            "year", term=keyword, date_range=date_range, limit=200
        ).raw
    except Exception as e:
        logger.error(f"Failed to fetch year facets: {e}")
        raise

    year_buckets = year_hits.get("data", [])
    logger.debug(f"Year facets: {year_hits}")

    if not year_buckets:
        logger.warning(f"No hits found for keyword: '{keyword}'")
        return []

    sorted_year_buckets = sorted(year_buckets, key=lambda b: b.get("value"))
    logger.info(f"Found {len(sorted_year_buckets)} years mentioning '{keyword}'")
    logger.info(f"Years found: {[b.get('value') for b in sorted_year_buckets]}")

    for year_bucket in sorted_year_buckets:
        year = year_bucket.get("value")
        if not year:
            continue

        logger.debug(f"Processing year: {year}")
        date_range = DateRange(f"{year}-01-01", f"{year}-12-31")

        # Step 2: For each year, get all newspapers with hits
        logger.info(f"Step 2: Fetching newspaper facets for year {year}")
        newspapers_raw = client.search.facet(
            "newspaper", term=keyword, date_range=date_range, limit=200
        ).raw
        newspaper_buckets = newspapers_raw.get("data", [])
        logger.info(f"Newspaper facets for {year}: {newspapers_raw}")

        if not newspaper_buckets:
            logger.warning(f"No newspapers found for year {year}")
            continue

        logger.debug(f"Found {len(newspaper_buckets)} newspapers for year {year}")

        for paper in newspaper_buckets:
            newspaper_id = paper.get("value")
            if not newspaper_id:
                logger.warning(f"Missing newspaper ID in facet bucket: {paper}")
                continue

            logger.debug(f"Processing newspaper: {newspaper_id} for year {year}")

            try:
                logger.debug(
                    f"Searching for articles in {newspaper_id} for year {year}"
                )
                results = client.search.find(
                    term=keyword,
                    newspaper_id=newspaper_id,
                    date_range=date_range,
                    with_text_contents=False,
                    limit=limit_per_query,
                ).raw
                hits = results.get("data", [])
                logger.debug(
                    f"Found {len(hits)} hits for '{newspaper_id}' in {year}. Waiting"
                    f" for {delay} seconds..."
                )
                time.sleep(delay)  # Respectful delay between requests

                if hits:
                    hit = random.choice(hits)
                    uid = hit.get("uid")
                    if uid:
                        logger.debug(
                            f"Selected UID: {uid} from {newspaper_id} in {year}"
                        )
                        sampled_uids.append(uid)
                        found += 1
                        logger.info(
                            f"Progress: {found} of at most {max_hits} articles sampled"
                        )
                        if found >= max_hits:
                            logger.info(
                                f"Reached maximum number of articles ({max_hits})"
                            )
                            return sampled_uids
                else:
                    logger.debug(f"No results for {newspaper_id} in {year}")

            except Exception as e:
                logger.error(f"Error processing '{newspaper_id}' in {year}: {e}")

    logger.info(
        f"Sampling completed. Collected {len(sampled_uids)} UIDs for keyword"
        f" '{keyword}'"
    )
    return sampled_uids

In [None]:
doc_ids = sample_impresso_uids(client, "United Press International", max_hits=10000)

In [None]:
doc_ids
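
Because `sample_impresso_uids` picks articles with `random.choice`, each run can return a different sample. One way to keep a sample reusable is to save the IDs to disk. The sketch below assumes a hypothetical file name and uses placeholder IDs instead of the real `doc_ids`; the helper names `save_doc_ids` and `load_doc_ids` are illustrative and not part of the Impresso library.

```python
import json
import os
import tempfile


def save_doc_ids(doc_ids, path):
    """Write a list of document IDs to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(doc_ids, f, indent=2)


def load_doc_ids(path):
    """Read a list of document IDs back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Placeholder IDs for illustration; in the notebook you would pass `doc_ids`.
example_ids = ["example-uid-1", "example-uid-2"]
path = os.path.join(tempfile.gettempdir(), "sampled_doc_ids.json")

save_doc_ids(example_ids, path)
restored = load_doc_ids(path)
```

Reloading the saved list lets you rebuild or verify a collection later without repeating the sampling queries.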

## Functions to create and populate an Impresso user collection

We will define functions to create a new Impresso user collection and add the sampled articles to it.
If the collection already exists, skip the creation step and call `add_docids` directly
with its ID.


In [None]:
import requests


def create_collection(client, name: str, description: str = "") -> dict:
    """
    Create a new collection in the Impresso public API using a client with a stored bearer token.

    Args:
        client: An authenticated Impresso API client with a `_api_bearer_token` attribute.
        name (str): Name of the collection.
        description (str): Optional description of the collection.

    Returns:
        dict: JSON response from the API.

    Raises:
        ValueError: If the client does not have a valid _api_bearer_token.
        RuntimeError: If the API request fails.
    """
    token = getattr(client, "_api_bearer_token", None)
    if not token:
        raise ValueError("Client does not have a valid _api_bearer_token.")

    url = "https://impresso-project.ch/public-api/v1/collections"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
        "Accept": "application/json",
    }
    payload = {"name": name, "description": description, "accessLevel": "private"}

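    # Set a timeout so that a stalled connection cannot hang the notebook
    response = requests.post(url, headers=headers, json=payload, timeout=60)

    if response.ok:
        # Use the notebook's configured logger so messages reach the log file
        logger = logging.getLogger(__name__)
        logger.info(f"Collection '{name}' created successfully.")
        logger.debug(f"Response: {response.json()}")
        return response.json()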
    else:
        raise RuntimeError(
            f"Failed to create collection: {response.status_code} {response.text}"
        )


def add_docids(
    client: ImpressoClient,
    collection_id: str,
    doc_ids: list[str],
    delay: float = 30,
) -> None:
    """
    Add article IDs to a collection in batches of 200.

    Args:
        client (ImpressoClient): Impresso API client.
        collection_id (str): ID of the user collection.
        doc_ids (list[str]): List of article IDs to add.
        delay (float): Delay in seconds between batch uploads. Defaults to 30.

    Returns:
        None
    """
    collection = client.collections.get(collection_id)
    logger = logging.getLogger(__name__)

    logger.info(
        f"Starting to add {len(doc_ids)} documents to collection '{collection_id}'"
    )
    logger.debug(f"Collection details: {collection}")

    batch_size = 200
    for i in range(0, len(doc_ids), batch_size):
        batch = doc_ids[i : i + batch_size]
        logger.debug(
            f"Processing batch {i//batch_size + 1}: documents {i+1} to"
            f" {min(i+batch_size, len(doc_ids))}"
        )

        try:
            client.collections.add_items(collection_id, batch)
            logger.info(
                f"Successfully added {len(batch)} documents to collection"
                f" '{collection_id}'"
            )
        except Exception as e:
            logger.error(f"Error adding batch to collection: {e}")
            raise

        if i + batch_size < len(doc_ids):  # Sleep only if there are more batches
            logger.info(f"Sleeping for {delay} seconds before adding the next batch...")
            time.sleep(delay)

    logger.info(
        f"Completed adding all {len(doc_ids)} documents to collection '{collection_id}'"
    )


def create_collection_with_docs(
    client,
    name: str,
    doc_ids: list[str],
    description: str = "",
    delay: float = 4,
) -> str:
    """
    Create a new collection and populate it with document IDs.

    Args:
        client (ImpressoClient): Impresso API client.
        name (str): Name of the new collection.
        doc_ids (list[str]): List of document IDs to add to the collection.
        description (str): Optional description of the collection.
        delay (float): Delay in seconds after creation and between batch uploads.
            Defaults to 4.

    Returns:
        str: The ID of the created collection.

    Raises:
        ValueError: If the collection ID is not found in the response.
        Exception: If collection creation or document addition fails.
    """
    logger = logging.getLogger(__name__)
    try:
        logger.info(
            f"Starting to create collection '{name}' with {len(doc_ids)} documents"
        )
        logger.debug(f"Collection description: {description}")

        collection = create_collection(client, name, description or "")
        collection_id = collection.get("uid")
        if not collection_id:
            raise ValueError("Collection ID not found in the response.")

        logger.info(f"Collection '{name}' created with ID: {collection_id}")
        logger.debug(f"Collection response: {collection}")

        time.sleep(delay)
        add_docids(client, collection_id, doc_ids, delay)
        logger.info(
            f"Successfully added {len(doc_ids)} documents to collection '{name}'"
        )

        return collection_id
    except Exception as e:
        logger.error(f"Failed to create collection with docs: {e}")
        raise
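
The batching step inside `add_docids` can be looked at in isolation. The sketch below shows how a list of IDs is cut into chunks of at most 200; the helper name `iter_batches` and the generated IDs are assumptions made for illustration.

```python
from typing import Iterator


def iter_batches(items: list[str], batch_size: int = 200) -> Iterator[list[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


# Hypothetical IDs, only used to show the resulting batch sizes
ids = [f"id-{n}" for n in range(450)]
sizes = [len(batch) for batch in iter_batches(ids)]
print(sizes)
```

The last batch is simply shorter than the others, so no ID is dropped when the list length is not a multiple of the batch size.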

## Let's specify a name for the Impresso user collection

Each collection must have a unique name. If you run this notebook multiple times, make sure to change the collection name to avoid conflicts.


In [None]:
collection_name = "United Press International - Sampling 2025-07-08"
collection_description = (
    "Searching for the phrase 'United Press International' in the Impresso dataset."
)

In [None]:
collection_id = create_collection_with_docs(
    client,
    collection_name,
    doc_ids,
    collection_description,
)

Occasionally, some articles get lost while the collection is being built, for example
because of network issues or API limitations.
It is safe to rerun the function that adds content items without creating a
new collection. If a content item is already in the collection, it will be skipped silently.


In [None]:
add_docids(client, collection_id, doc_ids, delay=20)
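
Transient failures of this kind can also be retried automatically. The following is a minimal sketch of retrying with exponential backoff, not part of the Impresso library: the helper name `retry_with_backoff`, its parameters, and the simulated `flaky` function are all assumptions made for illustration.

```python
import time


def retry_with_backoff(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `func` until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2**attempt)


# Simulated operation that fails twice before succeeding
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


waits = []  # record the waits instead of actually sleeping
result = retry_with_backoff(flaky, attempts=3, base_delay=1.0, sleep=waits.append)
```

In the notebook, the same helper could wrap the upload, for instance `retry_with_backoff(lambda: add_docids(client, collection_id, doc_ids))`. This relies on the fact stated above that items already in the collection are skipped.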

Let's look at the collection we just created. This will help us verify that the articles
were added successfully and that the collection is ready for further analysis.
The output also contains a link to the collection in the Impresso web interface, which you
can use to view the articles directly and explore their facets and metadata interactively.


In [None]:
r = client.collections.get(collection_id)
r

## Comparing the collection content with local document IDs

To check whether any articles were lost while building the collection, and which ones, we can compare the local document IDs with the actual content of the cloud collection.
If articles are missing, we can re-run the function that adds them to the collection.
This step is optional, but it helps with debugging and confirms that the collection is complete.


In [None]:
def compare_collection_content(
    client: ImpressoClient, collection_id: str, local_doc_ids: list[str]
) -> dict[str, list[str]]:
    """
    Compare local document IDs with the actual content of a cloud collection.

    Args:
        client (ImpressoClient): Impresso API client.
        collection_id (str): ID of the collection to compare against.
        local_doc_ids (list[str]): List of document IDs that should be in the collection.

    Returns:
        dict[str, list[str]]: Dictionary containing:
            - "stored_docids": List of document IDs that are successfully stored in the collection
            - "missing_docids": List of document IDs that are missing from the collection

    Raises:
        Exception: If unable to retrieve collection content from the API.
    """
    logger = logging.getLogger(__name__)

    logger.info(f"Comparing local document list with collection '{collection_id}'")
    logger.debug(f"Local document count: {len(local_doc_ids)}")

    try:
        # Get the collection details
        collection = client.collections.get(collection_id)
        logger.debug(f"Collection details: {collection}")

        # Get all document IDs from the collection
        # Note: We need to fetch all items, so we use a high limit
        collection_items = client.collections.items(collection_id, limit=10000).raw[
            "data"
        ]
        logger.debug(f"Raw collection items: {collection_items}")
        stored_doc_ids = [
            item.get("uid") for item in collection_items if item.get("uid")
        ]

        logger.info(f"Retrieved {len(stored_doc_ids)} documents from collection")
        logger.debug(f"First 10 stored document IDs: {stored_doc_ids[:10]}")

        # Convert to sets for efficient comparison
        local_set = set(local_doc_ids)
        stored_set = set(stored_doc_ids)

        # Find intersection (stored) and difference (missing)
        stored_docids = list(local_set.intersection(stored_set))
        missing_docids = list(local_set.difference(stored_set))

        logger.info("Comparison results:")
        logger.info(f"  - Documents successfully stored: {len(stored_docids)}")
        logger.info(f"  - Documents missing from collection: {len(missing_docids)}")

        if missing_docids:
            logger.warning(f"Missing document IDs (first 10): {missing_docids[:10]}")
        else:
            logger.info("All local documents are present in the collection!")

        return {"stored_docids": stored_docids, "missing_docids": missing_docids}

    except Exception as e:
        logger.error(f"Failed to compare collection content: {e}")
        raise

In [None]:
comparison = compare_collection_content(client, collection_id, doc_ids)
print(f"Successfully stored: {len(comparison['stored_docids'])} documents")
print(f"Missing from collection: {len(comparison['missing_docids'])} documents")

# If there are missing documents, you can add them
if comparison["missing_docids"]:
    add_docids(client, collection_id, comparison["missing_docids"])
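
`compare_collection_content` works with sets, so the returned lists do not keep the order of `doc_ids`. If the chronological order of the sample matters, a small order-preserving variant can be used instead. The helper name `find_missing` and the toy IDs below are assumptions made for illustration.

```python
def find_missing(local_ids, stored_ids):
    """Return local IDs absent from `stored_ids`, in original order, without duplicates."""
    stored = set(stored_ids)
    seen = set()
    missing = []
    for uid in local_ids:
        if uid not in stored and uid not in seen:
            missing.append(uid)
            seen.add(uid)
    return missing


# Toy IDs: "b" and "d" were never stored, and "b" appears twice locally
missing = find_missing(["a", "b", "c", "b", "d"], ["a", "c"])
print(missing)
```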

## Adding articles to an existing user collection

If you already have a collection to which you want to add articles, go to
[your Impresso collection overview](https://impresso-project.ch/app/collections) and select that collection.
Its unique collection ID is the last part of the page URL, that is,
everything after `https://impresso-project.ch/app/collections/`. Pass this ID to `add_docids` as `collection_id`.
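
The ID can also be extracted from the URL programmatically. The sketch below assumes the URL pattern described above; the helper name `collection_id_from_url` and the example ID are hypothetical.

```python
from urllib.parse import urlparse


def collection_id_from_url(url: str) -> str:
    """Return the part of the URL path that follows /app/collections/."""
    prefix = "/app/collections/"
    # Use only the path, so query strings and fragments are ignored
    path = urlparse(url).path.rstrip("/")
    if not path.startswith(prefix) or len(path) <= len(prefix):
        raise ValueError(f"No collection ID found in URL: {url}")
    return path[len(prefix) :]


# Hypothetical collection URL for illustration
existing_id = collection_id_from_url(
    "https://impresso-project.ch/app/collections/example-collection-id"
)
print(existing_id)
```

With the ID in hand, the articles can be added as before with `add_docids(client, existing_id, doc_ids)`.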


## Outlook

In this notebook, we have demonstrated how to perform stratified sampling of articles
from the Impresso API and add them to a user collection. This process allows for
efficient analysis of large newspaper archives by creating a representative subset of
articles.

You can now proceed to analyze the sampled articles in your collection.
You can use the Impresso web interface to explore the collection, visualize the data,
and perform further analyses using the Impresso API.
