# Working with Cultural Heritage APIs: The Finnish National Gallery

Welcome to this workshop on working with APIs and cultural heritage data. In this notebook, you will:

1. **Learn** what an API is and why APIs matter for digital humanities research
2. **Download** metadata from the Finnish National Gallery's open data
3. **Explore** the structure of the metadata interactively
4. **Filter** artworks by artist, keyword, or other criteria
5. **Download** images of selected artworks

---

## About the Finnish National Gallery

The [Finnish National Gallery](https://www.kansallisgalleria.fi/en) (Kansallisgalleria) is Finland's national art museum, managing:
- **Ateneum Art Museum** - Finnish art from the 1750s to the 1960s
- **Museum of Contemporary Art Kiasma** - Contemporary art from 1960 onwards
- **Sinebrychoff Art Museum** - European old masters
- **Central Art Archives** - Documentation and archives

They provide **open access** to their collection metadata under a CC0 license, making it freely available for research and creative projects.

---

## Part 1: What is an API?

**API** stands for **Application Programming Interface**. Think of it as a waiter in a restaurant:

- You (the customer) want food from the kitchen
- The waiter takes your order and brings back your food
- You don't need to know how to cook - the waiter handles the communication

Similarly, an API:
- Takes your **request** ("give me all artworks by Helene Schjerfbeck")
- Sends it to a **server** (the database)
- Returns a **response** (the data you asked for)
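
The request/response cycle above can be mimicked without any network at all. The sketch below is a toy: `handle_request` and the two-record `CATALOGUE` are hypothetical stand-ins for a real server and its database, and the titles are illustrative only. Nothing here is part of the Finnish National Gallery's actual API.

```python
import json

# Hypothetical in-memory "database"; a real server would query a collection system.
CATALOGUE = [
    {"title": "Self-Portrait", "artist": "Helene Schjerfbeck"},
    {"title": "The Wounded Angel", "artist": "Hugo Simberg"},
]


def handle_request(request):
    """Play the server: read the request, look up data, return a JSON response."""
    wanted = request["artist"].lower()
    matches = [rec for rec in CATALOGUE if wanted in rec["artist"].lower()]
    return json.dumps({"count": len(matches), "results": matches})


# The client only needs to know how to phrase the request and read the response.
response_text = handle_request({"artist": "Schjerfbeck"})
response = json.loads(response_text)
print(response["count"], "match(es):", [r["title"] for r in response["results"]])
```

The client never touches `CATALOGUE` directly, which is the point of the waiter analogy: it sends a request and reads whatever comes back.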

### Why APIs matter for Digital Humanities

- **Scale**: Download thousands of records automatically instead of clicking through web pages
- **Structure**: Data comes in machine-readable formats (JSON, XML) ready for analysis
- **Reproducibility**: Your code documents exactly how you obtained your data
- **Updates**: Re-run your code to get the latest data

### Common Data Formats

| Format | Description | Example |
|--------|-------------|---------|
| **JSON** | JavaScript Object Notation - human-readable, widely used | `{"name": "Mona Lisa", "year": 1503}` |
| **XML** | eXtensible Markup Language - similar to HTML | `<artwork><name>Mona Lisa</name></artwork>` |
| **CSV** | Comma-Separated Values - spreadsheet-like | `name,year\nMona Lisa,1503` |
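
To see how the three formats differ in practice, the sketch below parses the same record from each one using only Python's standard library. The strings mirror the table's examples, except that a `<year>` element has been added to the XML so all three carry the same two fields.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

json_text = '{"name": "Mona Lisa", "year": 1503}'
xml_text = "<artwork><name>Mona Lisa</name><year>1503</year></artwork>"
csv_text = "name,year\nMona Lisa,1503"

# JSON maps directly onto Python dicts, lists, strings, and numbers.
from_json = json.loads(json_text)

# XML is a tree of elements; here each child element becomes one dict entry.
root = ET.fromstring(xml_text)
from_xml = {child.tag: child.text for child in root}

# CSV is rows of text; DictReader uses the first row as field names.
from_csv = next(csv.DictReader(io.StringIO(csv_text)))

# JSON keeps the year as a number; XML and CSV hand back strings.
print(from_json, from_xml, from_csv)
```

The type difference matters later: values read from XML or CSV usually need an explicit `int(...)` before you can compare years.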

---

## Part 2: Setup

First, let's import the libraries we need and set up our project structure.

In [None]:
# Standard library imports
import os
import json
import time
from pathlib import Path

# External libraries (you may need to install these)
import requests
from IPython.display import display, Image, HTML

# Set up paths
PROJECT_ROOT = Path("../").resolve()
DATA_DIR = PROJECT_ROOT / "data"
IMAGES_DIR = PROJECT_ROOT / "images"

# Create directories if they don't exist
DATA_DIR.mkdir(exist_ok=True)
IMAGES_DIR.mkdir(exist_ok=True)

print(f"Project root: {PROJECT_ROOT}")
print(f"Data directory: {DATA_DIR}")
print(f"Images directory: {IMAGES_DIR}")

---

## Part 3: Understanding the Finnish National Gallery API

The Finnish National Gallery offers its data in two ways:

1. **Full data package** - A single JSON file with ALL metadata (80,000+ objects)
2. **API endpoints** - Query specific artworks (requires API key)

For this workshop, we'll download the **full data package**, which:
- Is freely available without registration
- Contains everything we need
- Works well for offline analysis

### The Data Package URL

In [None]:
# The URL to the complete data package (JSON format)
# This GitHub mirror contains a stable copy of the Finnish National Gallery data
DATA_PACKAGE_URL = "https://raw.githubusercontent.com/hugovk/finnishnationalgallery/master/fng-data-dc.json"

# Alternative: Official source (may change or require API key)
# DATA_PACKAGE_URL = "https://kokoelmat.fng.fi/app/si/fng-data-dc.json"

# Where we'll save the data
OBJECTS_FILE = DATA_DIR / "objects.json"

print(f"Data source: {DATA_PACKAGE_URL}")
print(f"Local file: {OBJECTS_FILE}")

### Making an HTTP Request

When you access a website or API, your computer sends an **HTTP request**. The most common request methods are:

| Method | Purpose | Example |
|--------|---------|--------|
| **GET** | Retrieve data | "Give me information about artwork #12345" |
| **POST** | Send data | "Create a new user account" |
| **PUT** | Update data | "Change the title of this record" |
| **DELETE** | Remove data | "Delete this comment" |

We'll use **GET** requests to download data.
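
The four methods can be inspected without sending anything. The sketch below builds request objects with the standard library's `urllib.request.Request` (the rest of this notebook uses `requests` instead) and only reads back the method each one would use. The URL is a hypothetical placeholder, not a real endpoint.

```python
from urllib.request import Request

# Hypothetical endpoint, used only to illustrate the method names.
url = "https://example.org/artworks/12345"

get_req = Request(url)  # no body, so the method defaults to GET
post_req = Request(url, data=b'{"title": "Kukkia"}')  # a body makes it POST
put_req = Request(url, data=b'{"title": "Flowers"}', method="PUT")
delete_req = Request(url, method="DELETE")

# Building a Request object does not contact the server.
methods = [r.get_method() for r in (get_req, post_req, put_req, delete_req)]
print(methods)
```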

---

## Part 4: Downloading the Complete Dataset

Let's download the complete metadata dump. This is a large file (~75 MB), so it may take a few minutes.

**Note:** If you already have `objects.json` in the `data/` folder, you can skip this step.
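
The download function in the next cell reads the response in small chunks rather than all at once. As a rough illustration of that pattern, the sketch below copies an in-memory byte string chunk by chunk and records the progress percentage after each step. `copy_in_chunks` is a hypothetical helper written for this example, and the 20,000-byte payload stands in for a real HTTP response.

```python
import io


def copy_in_chunks(source, target, total_size, chunk_size=8192):
    """Copy source to target chunk by chunk; return bytes copied and progress steps."""
    copied = 0
    progress = []
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read means the source is exhausted
            break
        target.write(chunk)
        copied += len(chunk)
        progress.append(round(copied / total_size * 100, 1))
    return copied, progress


payload = b"x" * 20000  # stand-in for the bytes a server would send
source, target = io.BytesIO(payload), io.BytesIO()
copied, progress = copy_in_chunks(source, target, total_size=len(payload))
print(f"Copied {copied} bytes in {len(progress)} chunks")
```

Only one chunk is held in memory at a time, which is why the same approach scales to a file of tens of megabytes.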

In [None]:
def download_data_package(url, output_path, force_download=False):
    """
    Download the Finnish National Gallery data package.
    
    Parameters:
        url: The URL to download from
        output_path: Where to save the file
        force_download: If True, download even if file exists
    """
    output_path = Path(output_path)
    
    # Check if file already exists
    if output_path.exists() and not force_download:
        size_mb = output_path.stat().st_size / (1024 * 1024)
        print(f"File already exists: {output_path}")
        print(f"Size: {size_mb:.1f} MB")
        print("Set force_download=True to re-download.")
        return True
    
    print(f"Downloading from: {url}")
    print("This may take a few minutes...")
    
    try:
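        # Stream the download so the whole file is never held in memory;
        # the timeout stops the cell from hanging if the server does not respond
        response = requests.get(url, stream=True, timeout=30)
        response.raise_for_status()  # Raise an error for bad status codes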
        
        # Get total size if available
        total_size = int(response.headers.get('content-length', 0))
        
        # Write to file
        downloaded = 0
        with open(output_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
                downloaded += len(chunk)
                if total_size:
                    percent = (downloaded / total_size) * 100
                    print(f"\rProgress: {percent:.1f}%", end="", flush=True)
        
        print(f"\nDownload complete!")
        print(f"Saved to: {output_path}")
        print(f"Size: {downloaded / (1024*1024):.1f} MB")
        return True
        
    except requests.exceptions.RequestException as e:
        print(f"Error downloading data: {e}")
        return False

In [None]:
# Download the data (skip if already downloaded)
download_data_package(DATA_PACKAGE_URL, OBJECTS_FILE)

---

## Part 5: Loading and Exploring the Data

Now let's load the data and explore its structure.

In [None]:
# Load the JSON data
print("Loading data (this may take a moment)...")

with open(OBJECTS_FILE, 'r', encoding='utf-8') as f:
    objects = json.load(f)

print(f"Loaded {len(objects):,} objects!")

### Inspecting a Single Object

Let's look at one artwork to understand the data structure.

In [None]:
# Find the first object that has an image, at least one person, and a title
sample_object = None
for obj in objects:
    if obj.get('multimedia') and obj.get('people') and obj.get('title'):
        sample_object = obj
        break

# Pretty print the JSON
print(json.dumps(sample_object, indent=2, ensure_ascii=False))

### Understanding the Metadata Fields

Each object contains various fields. Here are the most important ones:

| Field | Description | Example |
|-------|-------------|---------|
| `objectId` | Unique identifier | `406873` |
| `title` | Artwork title (multilingual: fi, en, sv) | `{"fi": "Kukkia", "en": "Flowers"}` |
| `people` | List of associated people (artists, etc.) | Artist name, birth/death dates |
| `yearFrom` / `yearTo` | Start and end year of the creation date | `1890` |
| `keywords` | Subject keywords (multilingual) | `[{"en": "landscape"}, {"en": "summer"}]` |
| `classifications` | Art form/medium | `"painting"`, `"graphic arts"` |
| `materials` | Materials used | `"oil paint"`, `"canvas"` |
| `multimedia` | Image files in various resolutions | URLs to images (25px to 4000px) |
| `responsibleOrganisation` | Which museum holds it | `"Ateneumin taidemuseo"` |
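
Most of these fields are nested: titles and keywords are dictionaries keyed by language code. The sketch below shows one way to read them with a language fallback. The `record` is a hypothetical object assembled from the example values in the table (with a placeholder person), and `title_in` and `keyword_terms` are illustrative helpers, not part of the dataset or any library.

```python
# Hypothetical record built from the table's example values; not a real object.
record = {
    "objectId": 406873,
    "title": {"fi": "Kukkia", "en": "Flowers"},
    "people": [{"firstName": "Example", "familyName": "Artist"}],
    "yearFrom": 1890,
    "keywords": [{"en": "landscape"}, {"en": "summer"}],
}


def title_in(record, language="en"):
    """Return the title in the requested language, falling back to Finnish."""
    title = record.get("title") or {}
    return title.get(language) or title.get("fi") or "Untitled"


def keyword_terms(record, language="en"):
    """Return the keyword strings available in the requested language."""
    keywords = record.get("keywords") or []
    return [kw[language] for kw in keywords if kw.get(language)]


english_title = title_in(record)
swedish_title = title_in(record, "sv")  # no Swedish title, so Finnish is used
terms = keyword_terms(record)
print(english_title, "|", swedish_title, "|", terms)
```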

In [None]:
def get_all_keys(objects, max_objects=1000):
    """Find the unique keys used across the first max_objects objects."""
    all_keys = set()
    for obj in objects[:max_objects]:
        all_keys.update(obj.keys())
    return sorted(all_keys)

# Only a sample is scanned, so rarely used fields may be missing from this list
print("Fields found in the first 1,000 objects:")
print("-" * 40)
for key in get_all_keys(objects):
    print(f"  - {key}")

---

## Part 6: Interactive Data Exploration

Let's create some helper functions to explore the data interactively.

In [None]:
def get_artist_counts(objects):
    """Count artworks per artist."""
    artists = {}
    for obj in objects:
        for person in obj.get('people', []):
            # `or ''` also covers keys that are present but set to None
            first = person.get('firstName') or ''
            last = person.get('familyName') or ''
            name = f"{first} {last}".strip()
            if name:
                artists[name] = artists.get(name, 0) + 1
    return dict(sorted(artists.items(), key=lambda x: -x[1]))


def get_keyword_counts(objects, language='en'):
    """Count how often each keyword appears."""
    keywords = {}
    for obj in objects:
        for kw in obj.get('keywords', []):
            term = kw.get(language) or kw.get('fi') or ''
            if term:
                keywords[term] = keywords.get(term, 0) + 1
    return dict(sorted(keywords.items(), key=lambda x: -x[1]))


def get_classification_counts(objects, language='en'):
    """Count artworks per classification."""
    classifications = {}
    for obj in objects:
        for c in obj.get('classifications', []):
            term = c.get(language) or c.get('fi') or ''
            if term:
                classifications[term] = classifications.get(term, 0) + 1
    return dict(sorted(classifications.items(), key=lambda x: -x[1]))
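
The three helpers above share one pattern: tally terms in a dictionary, then sort by count. As an alternative sketch, the standard library's `collections.Counter` does the tallying and sorting for you. The `sample` list is a hypothetical mini-dataset in the same shape as the gallery's keyword lists, and `count_keywords` is an illustrative name.

```python
from collections import Counter

# Hypothetical mini-dataset shaped like the gallery's keyword lists.
sample = [
    {"keywords": [{"en": "landscape"}, {"en": "summer"}]},
    {"keywords": [{"en": "landscape"}, {"fi": "talvi"}]},
    {"keywords": []},
]


def count_keywords(objects, language="en"):
    """Tally keywords, falling back to Finnish when a translation is missing."""
    counts = Counter()
    for obj in objects:
        for kw in obj.get("keywords") or []:
            term = kw.get(language) or kw.get("fi")
            if term:
                counts[term] += 1
    return counts


keyword_tally = count_keywords(sample)
most_frequent = keyword_tally.most_common(1)  # list of (term, count) pairs
print(most_frequent)
```

`most_common(n)` replaces the manual `sorted(..., key=lambda x: -x[1])` step and the list slicing used when printing a top 20.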

In [None]:
# Top 20 artists by number of works
artist_counts = get_artist_counts(objects)

print("Top 20 Artists by Number of Works")
print("=" * 45)
for i, (name, count) in enumerate(list(artist_counts.items())[:20], 1):
    print(f"{i:2}. {name:<30} {count:>5} works")

In [None]:
# Top 20 keywords
keyword_counts = get_keyword_counts(objects)

print("Top 20 Keywords (English)")
print("=" * 45)
for i, (keyword, count) in enumerate(list(keyword_counts.items())[:20], 1):
    print(f"{i:2}. {keyword:<30} {count:>5} works")

In [None]:
# Classifications (art types)
classification_counts = get_classification_counts(objects)

print("Art Classifications")
print("=" * 45)
for i, (classification, count) in enumerate(list(classification_counts.items())[:15], 1):
    print(f"{i:2}. {classification:<30} {count:>5} works")

---

## Part 7: Filtering the Data

Now let's learn how to filter artworks based on specific criteria. This is where you can customize the code!

In [None]:
def filter_by_artist(objects, artist_name):
    """
    Filter objects by artist name (partial match, case-insensitive).
    
    Parameters:
        objects: List of all artwork objects
        artist_name: Name to search for (e.g., "Schjerfbeck")
    
    Returns:
        List of matching objects
    """
    results = []
    search_term = artist_name.lower()
    
    for obj in objects:
        for person in obj.get('people', []):
            first = person.get('firstName', '') or ''
            last = person.get('familyName', '') or ''
            full_name = f"{first} {last}".lower()
            
            if search_term in full_name:
                results.append(obj)
                break  # Avoid duplicates if multiple people match
    
    return results


def filter_by_keyword(objects, keyword, language='en'):
    """
    Filter objects by keyword (partial match, case-insensitive).
    
    Parameters:
        objects: List of all artwork objects
        keyword: Keyword to search for (e.g., "landscape")
        language: Language code ('en', 'fi', 'sv')
    
    Returns:
        List of matching objects
    """
    results = []
    search_term = keyword.lower()
    
    for obj in objects:
        for kw in obj.get('keywords', []):
            term = kw.get(language, '') or ''
            if search_term in term.lower():
                results.append(obj)
                break
    
    return results


def filter_by_classification(objects, classification, language='en'):
    """
    Filter objects by classification (exact match, case-insensitive).
    
    Parameters:
        objects: List of all artwork objects
        classification: Classification to filter (e.g., "painting")
        language: Language code ('en', 'fi', 'sv')
    
    Returns:
        List of matching objects
    """
    results = []
    search_term = classification.lower()
    
    for obj in objects:
        for c in obj.get('classifications', []):
            term = c.get(language, '') or ''
            if search_term == term.lower():
                results.append(obj)
                break
    
    return results


def filter_with_images(objects):
    """Keep only objects that have downloadable images."""
    return [obj for obj in objects if obj.get('multimedia')]
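
The filters above each apply one criterion. One possible way to combine several criteria is to express each as a small predicate function and keep only the objects that satisfy all of them. The sketch below is self-contained: `made_by`, `tagged`, `has_image`, and `filter_all` are hypothetical helpers, and `sample` is made-up data in the same shape as the gallery objects.

```python
def made_by(name):
    """Build a predicate matching objects with `name` in any person's full name."""
    needle = name.lower()

    def check(obj):
        for person in obj.get("people") or []:
            full = f"{person.get('firstName') or ''} {person.get('familyName') or ''}"
            if needle in full.lower():
                return True
        return False

    return check


def tagged(keyword, language="en"):
    """Build a predicate matching objects whose keywords contain `keyword`."""
    needle = keyword.lower()

    def check(obj):
        keywords = obj.get("keywords") or []
        return any(needle in (kw.get(language) or "").lower() for kw in keywords)

    return check


def has_image(obj):
    return bool(obj.get("multimedia"))


def filter_all(objects, *predicates):
    """Keep the objects for which every predicate returns True."""
    return [obj for obj in objects if all(p(obj) for p in predicates)]


painter = [{"firstName": "Example", "familyName": "Painter"}]
other = [{"firstName": "Another", "familyName": "Artist"}]
sample = [
    {"objectId": 1, "people": painter, "keywords": [{"en": "landscape"}], "multimedia": [{}]},
    {"objectId": 2, "people": painter, "keywords": [{"en": "portrait"}], "multimedia": [{}]},
    {"objectId": 3, "people": painter, "keywords": [{"en": "landscape"}], "multimedia": []},
    {"objectId": 4, "people": other, "keywords": [{"en": "landscape"}], "multimedia": [{}]},
]

selected = filter_all(sample, made_by("painter"), tagged("landscape"), has_image)
print([obj["objectId"] for obj in selected])
```

The same result can be had by chaining the notebook's own functions, for example passing the output of `filter_by_artist` into `filter_by_keyword`.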

---

## Exercise 1: Filter by Artist

**Your task:** Change the `ARTIST_NAME` variable to search for a different artist.

Some suggestions:
- `"Helene Schjerfbeck"` - Famous Finnish modernist painter
- `"Albert Edelfelt"` - National romantic painter (most works in collection)
- `"Hugo Simberg"` - Symbolist painter (The Wounded Angel)
- `"Ellen Thesleff"` - Expressionist painter
- `"Akseli Gallen-Kallela"` - Kalevala painter

In [None]:
# ============================================================
# EXERCISE 1: Change the artist name below!
# ============================================================

ARTIST_NAME = "Helene Schjerfbeck"  # <-- CHANGE THIS!

# ============================================================

# Filter by artist
artist_works = filter_by_artist(objects, ARTIST_NAME)
artist_works_with_images = filter_with_images(artist_works)

print(f"Found {len(artist_works)} works by '{ARTIST_NAME}'")
print(f"Of these, {len(artist_works_with_images)} have downloadable images")

# Show a few titles
print(f"\nSample works:")
for obj in artist_works_with_images[:5]:
    title = obj.get('title', {})
    title_str = title.get('en') or title.get('fi') or 'Untitled'
    year = obj.get('yearFrom', 'n.d.')
    print(f"  - {title_str} ({year})")

---

## Exercise 2: Filter by Keyword

**Your task:** Change the `KEYWORD` variable to search for artworks with a specific subject.

Some suggestions:
- `"landscape"` - Landscape paintings
- `"portrait"` - Portraits
- `"woman"` - Depictions of women
- `"sea"` or `"water"` - Maritime scenes
- `"forest"` - Forest scenes
- `"winter"` - Winter scenes

In [None]:
# ============================================================
# EXERCISE 2: Change the keyword below!
# ============================================================

KEYWORD = "landscape"  # <-- CHANGE THIS!

# ============================================================

# Filter by keyword
keyword_works = filter_by_keyword(objects, KEYWORD)
keyword_works_with_images = filter_with_images(keyword_works)

print(f"Found {len(keyword_works)} works with keyword '{KEYWORD}'")
print(f"Of these, {len(keyword_works_with_images)} have downloadable images")

# Show a few titles with their artists
print(f"\nSample works:")
for obj in keyword_works_with_images[:5]:
    title = obj.get('title', {})
    title_str = title.get('en') or title.get('fi') or 'Untitled'
    
    people = obj.get('people', [])
    if people:
        artist = f"{people[0].get('firstName') or ''} {people[0].get('familyName') or ''}".strip()
    else:
        artist = "Unknown"
    
    print(f"  - {title_str} by {artist}")

---

## Part 8: Previewing Images

Before downloading, let's preview some images directly in the notebook.

In [None]:
def preview_artwork(obj, size='500'):
    """
    Display an artwork image in the notebook.
    
    Parameters:
        obj: An artwork object
        size: Image size ('25', '250', '500', '1000', '2000', '4000')
    """
    # Get title
    title = obj.get('title', {})
    title_str = title.get('en') or title.get('fi') or 'Untitled'
    
    # Get artist
    people = obj.get('people', [])
    if people:
        artist = f"{people[0].get('firstName') or ''} {people[0].get('familyName') or ''}".strip()
    else:
        artist = "Unknown"
    
    # Get year
    year = obj.get('yearFrom', 'n.d.')
    
    # Get image URL
    multimedia = obj.get('multimedia', [])
    if not multimedia:
        print("No image available")
        return
    
    image_url = multimedia[0].get('jpg', {}).get(size)
    if not image_url:
        print(f"No image available at size {size}")
        return
    
    # Display info
    print(f"{title_str}")
    print(f"by {artist}, {year}")
    print(f"License: {multimedia[0].get('license', 'Unknown')}")
    print()
    
    # Fetch and display the image
    try:
        response = requests.get(image_url, timeout=10)
        response.raise_for_status()
        display(Image(data=response.content, width=400))
    except requests.exceptions.RequestException as e:
        print(f"Error loading image: {e}")
        print(f"URL: {image_url}")
        # Fallback: try to display as HTML img tag
        display(HTML(f'<img src="{image_url}" width="400" onerror="this.style.display=\'none\'"/>'))

In [None]:
# Preview a few works from your filtered selection
# Using the artist works from Exercise 1

print(f"Previewing works by {ARTIST_NAME}:")
print("=" * 50)

for obj in artist_works_with_images[:3]:
    preview_artwork(obj)
    print("\n" + "-" * 50 + "\n")

---

## Part 9: Downloading Images

Now let's download images from your filtered selection. You can choose:
- **Resolution**: 25, 250, 500, 1000, 2000, or 4000 pixels
- **Which artworks**: Use your filtered list from Exercise 1 or 2
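
Not every artwork necessarily has an image at every resolution. The sketch below shows one possible fallback: pick the largest available size that does not exceed the one you asked for. It assumes, as the notebook's own code does, that each multimedia entry has a `jpg` dictionary keyed by size strings. `pick_image_url` and the example URLs are hypothetical.

```python
SIZES = ["25", "250", "500", "1000", "2000", "4000"]


def pick_image_url(jpg, wanted="1000"):
    """Return (size, url) for the largest available size not exceeding `wanted`.

    Returns None if nothing suitable exists. Raises ValueError if `wanted`
    is not one of the known sizes.
    """
    candidates = SIZES[: SIZES.index(wanted) + 1]
    for size in reversed(candidates):  # try the requested size first, then smaller
        url = jpg.get(size)
        if url:
            return size, url
    return None


# Hypothetical multimedia entry that only offers two small sizes.
jpg = {
    "250": "https://example.org/images/250.jpg",
    "500": "https://example.org/images/500.jpg",
}

best = pick_image_url(jpg, "1000")
too_small = pick_image_url(jpg, "25")
print(best, too_small)
```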

In [None]:
def sanitize_filename(name):
    """Remove problematic characters from filenames."""
    if not name:
        return "unknown"
    # Keep only alphanumeric, spaces, dots, underscores, hyphens
    safe = "".join(c for c in name if c.isalnum() or c in ' ._-')
    return safe.strip()[:100]  # Limit length


def download_images(objects_list, output_dir, resolution='1000', max_images=10, delay=0.5):
    """
    Download images from a list of artwork objects.
    
    Parameters:
        objects_list: List of artwork objects to download
        output_dir: Directory to save images
        resolution: Image resolution ('25', '250', '500', '1000', '2000', '4000')
        max_images: Maximum number of images to download
        delay: Seconds to wait between downloads (be nice to the server!)
    
    Returns:
        List of downloaded file paths
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    
    downloaded = []
    
    # Filter to only objects with images
    objects_with_images = [obj for obj in objects_list if obj.get('multimedia')]
    
    # Limit the number
    to_download = objects_with_images[:max_images]
    
    print(f"Downloading {len(to_download)} images at {resolution}px resolution...")
    print(f"Saving to: {output_dir}")
    print()
    
    for i, obj in enumerate(to_download, 1):
        # Get metadata for filename
        object_id = obj.get('objectId', 'unknown')
        
        people = obj.get('people', [])
        if people:
            artist = f"{people[0].get('firstName') or ''}_{people[0].get('familyName') or ''}"
        else:
            artist = "Unknown"
        artist = sanitize_filename(artist)
        
        title = obj.get('title', {})
        title_str = sanitize_filename(title.get('en') or title.get('fi') or '')
        
        # Get image URL
        multimedia = obj.get('multimedia', [])[0]
        image_url = multimedia.get('jpg', {}).get(resolution)
        
        if not image_url:
            print(f"  [{i}/{len(to_download)}] No image at {resolution}px, skipping...")
            continue
        
        # Create filename
        filename = f"{artist}_{object_id}_{title_str[:30]}.jpg"
        filepath = output_dir / filename
        
        # Download
        try:
            response = requests.get(image_url, stream=True, timeout=30)
            response.raise_for_status()
            
            with open(filepath, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
            
            downloaded.append(filepath)
            print(f"  [{i}/{len(to_download)}] Downloaded: {filename}")
            
        except Exception as e:
            print(f"  [{i}/{len(to_download)}] Error: {e}")
        
        # Be nice to the server
        if i < len(to_download):
            time.sleep(delay)
    
    print(f"\nDownloaded {len(downloaded)} images to {output_dir}")
    return downloaded

---

## Exercise 3: Download Images

**Your task:** Configure the download settings below:

1. Choose `RESOLUTION`: `'250'` (small), `'500'` (medium), `'1000'` (large), `'2000'` (very large), `'4000'` (largest)
2. Choose `MAX_IMAGES`: How many images to download
3. Choose which filtered set to use: `artist_works_with_images` or `keyword_works_with_images`

In [None]:
# ============================================================
# EXERCISE 3: Configure your download!
# ============================================================

# Choose resolution: '250', '500', '1000', '2000', '4000'
RESOLUTION = '1000'  # <-- CHANGE THIS!

# How many images to download 
MAX_IMAGES = 5  # <-- CHANGE THIS!

# Which dataset to use? 
# Options: artist_works_with_images, keyword_works_with_images
SELECTED_WORKS = artist_works_with_images  # <-- CHANGE THIS!

# Folder name for downloaded images
FOLDER_NAME = f"{ARTIST_NAME.replace(' ', '_')}_images"  # <-- CHANGE THIS!

# ============================================================

In [None]:
# Download the images!
download_dir = IMAGES_DIR / FOLDER_NAME

downloaded_files = download_images(
    SELECTED_WORKS,
    download_dir,
    resolution=RESOLUTION,
    max_images=MAX_IMAGES
)

In [None]:
# View one of your downloaded images (Image was imported from IPython.display in Part 2)
if downloaded_files:
    display(Image(filename=str(downloaded_files[0]), width=500))

---

## Part 10: Saving Filtered Metadata

You might want to save the metadata for your filtered selection for later analysis.

In [None]:
def save_filtered_metadata(objects_list, filename, output_dir=None):
    """
    Save a filtered list of objects to a JSON file.
    
    Parameters:
        objects_list: List of artwork objects
        filename: Name of the output file (e.g., 'schjerfbeck_works.json')
        output_dir: Directory to save to (default: DATA_DIR)
    """
    if output_dir is None:
        output_dir = DATA_DIR
    
    output_path = Path(output_dir) / filename
    
    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(objects_list, f, indent=2, ensure_ascii=False)
    
    print(f"Saved {len(objects_list)} objects to {output_path}")
    return output_path

In [None]:
# Save your filtered data
save_filtered_metadata(
    artist_works_with_images, 
    f"{ARTIST_NAME.replace(' ', '_').lower()}_metadata.json"
)
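
Finnish and Swedish titles contain characters such as `ä` and `ö`, which is why the function above writes with `encoding='utf-8'` and `ensure_ascii=False`. The sketch below checks the round trip on a small scale: it saves a hypothetical one-record list to a temporary folder, reads it back, and confirms that nothing changed. The record and file name are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical record with non-ASCII characters in the Finnish title.
works = [{"objectId": 1, "title": {"fi": "Järvimaisema", "en": "Lake Landscape"}}]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample_metadata.json"

    # Save with the same settings as save_filtered_metadata
    with open(path, "w", encoding="utf-8") as f:
        json.dump(works, f, indent=2, ensure_ascii=False)

    raw_text = path.read_text(encoding="utf-8")

    # Load it back to confirm the data survived unchanged
    with open(path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)

print("Title stored readably:", "Järvimaisema" in raw_text)
print("Round trip intact:", reloaded == works)
```

With `ensure_ascii=True` (the default), the file would still load correctly, but the title would be stored as escape sequences and be harder to read in a text editor.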

---

## Summary

In this notebook, you learned:

1. **What APIs are** and why they're useful for digital humanities research
2. **How to download** large datasets from cultural heritage institutions
3. **How to explore** JSON metadata structure
4. **How to filter** data by artist, keyword, or other criteria
5. **How to download** images at various resolutions

### Next Steps

With your downloaded data and images, you can:
- Perform **statistical analysis** of the collection
- Use **computer vision** techniques (like CLIP) to analyze images
- Create **visualizations** of artistic trends
- Build **machine learning** models
- Create **digital exhibitions** or interactive presentations

### License & Attribution

The Finnish National Gallery metadata is released under **CC0 1.0 Universal (Public Domain)**.

Individual artworks may have different copyright status. Always check the `license` field in the multimedia data before using images.
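
One way to act on that advice is to group a selection by license before downloading anything. The sketch below assumes, as `preview_artwork` does, that the license is stored on the first multimedia entry. The `sample` records and their license strings are hypothetical placeholders, not values taken from the dataset.

```python
from collections import defaultdict

# Hypothetical records; the license strings are placeholders.
sample = [
    {"objectId": 1, "multimedia": [{"license": "CC0"}]},
    {"objectId": 2, "multimedia": [{"license": "All rights reserved"}]},
    {"objectId": 3, "multimedia": [{}]},
    {"objectId": 4, "multimedia": []},
]


def group_by_license(objects):
    """Map each license string to the IDs of objects whose first image carries it."""
    groups = defaultdict(list)
    for obj in objects:
        media = obj.get("multimedia") or []
        if not media:
            continue  # nothing to download for this object
        license_name = media[0].get("license") or "Unknown"
        groups[license_name].append(obj["objectId"])
    return dict(groups)


by_license = group_by_license(sample)
for license_name, ids in by_license.items():
    print(f"{license_name}: {ids}")
```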

---

## Bonus: Quick Reference Functions

Here are some additional useful functions for exploring the data.

In [None]:
def search_titles(objects, search_term, language='en'):
    """Search for artworks by title."""
    results = []
    search_term = search_term.lower()
    
    for obj in objects:
        title = obj.get('title', {})
        title_str = (title.get(language) or title.get('fi') or '').lower()
        if search_term in title_str:
            results.append(obj)
    
    return results


def filter_by_year_range(objects, year_from, year_to):
    """Filter objects whose yearFrom falls within a year range (inclusive)."""
    results = []
    for obj in objects:
        try:
            year = int(obj.get('yearFrom'))
        except (TypeError, ValueError):
            continue  # Skip missing or non-numeric years
        if year_from <= year <= year_to:
            results.append(obj)
    return results


def get_year_distribution(objects):
    """Get the distribution of artworks by year."""
    years = {}
    for obj in objects:
        year = obj.get('yearFrom')
        if year:
            years[year] = years.get(year, 0) + 1
    return dict(sorted(years.items()))


# Example: Search titles
# results = search_titles(objects, "self-portrait")
# print(f"Found {len(results)} works with 'self-portrait' in title")

# Example: Filter by year
# golden_age = filter_by_year_range(objects, 1880, 1910)
# print(f"Found {len(golden_age)} works from 1880-1910 (Finnish Golden Age)")
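
A per-year distribution like the one from `get_year_distribution` can be noisy. As a rough sketch, the years can be grouped into decades instead. `decade_counts` is a hypothetical helper, the `sample` list is made up, and the `int(...)` conversion is a precaution in case some years are stored as strings.

```python
from collections import Counter


def decade_counts(objects):
    """Count objects per decade of yearFrom, skipping missing or non-numeric years."""
    counts = Counter()
    for obj in objects:
        try:
            year = int(obj.get("yearFrom"))
        except (TypeError, ValueError):
            continue
        counts[year // 10 * 10] += 1  # 1884 -> 1880
    return dict(sorted(counts.items()))


# Hypothetical objects covering the cases the helper has to handle.
sample = [
    {"yearFrom": 1884},
    {"yearFrom": "1889"},
    {"yearFrom": 1901},
    {"yearFrom": None},
    {"yearFrom": "n.d."},
    {},
]

decades = decade_counts(sample)
print(decades)
```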