*Disclaimer: Currently, this notebook fetches resources from Internet Archive's Wayback Machine. In a future where larger amounts of the National Library of Norway's Web Archive are accessible, we hope to enable similar functionality for our own collection.*
____

# Objective of this notebook

The objective of this notebook is to obtain archived versions of webpages from the Internet Archive (IA) and then analyse changes in their layout over time. This can be helpful for a range of problems and disciplines, including media and communication studies, computer science and, not least, media history.

This notebook uses the case of front pages from the Norwegian Broadcasting Corporation (NRK), which have been archived by IA from 1996 until today. We will fetch and capture the archived versions of `nrk.no`, but it is easy to change that to another domain, such as `bbc.co.uk`. The script is tailored to fetch and capture at intervals of roughly 90 days, but that can also be adjusted easily.

The notebook is structured into several steps:
1. Fetch URLs of different versions of a page
2. Capture screenshots
3. Compare screenshot similarity (SSIM)
4. Visualise changes in SSIM score over time
5. Assemble all screenshots in a montage

## Installing necessary packages
You need to install various packages before running this notebook.

If you are comfortable with the terminal/CLI, you can activate your desired Python/conda environment and run `pip install waybackpy selenium opencv-python scikit-image plotly pillow matplotlib`.

An alternative is to install the packages from this notebook. Remove `# ` at the start of each line in the cell below, and then run the cell with **⇧** + **↵ Enter**.

In [None]:
# !pip install waybackpy
# !pip install selenium
# !pip install opencv-python
# !pip install scikit-image
# !pip install plotly
# !pip install pillow matplotlib
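
Before running the rest of the notebook, it can be useful to check which of the required modules are actually importable. The following is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper name, not part of any of the packages above.

```python
from importlib.util import find_spec

def missing_modules(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if find_spec(name) is None]

# Import names differ from pip names for some packages:
# opencv-python is imported as cv2, scikit-image as skimage, pillow as PIL.
# PIL and matplotlib are imported by the montage cell in step 5.
required = ["waybackpy", "selenium", "cv2", "skimage", "plotly", "PIL", "matplotlib"]
print("Missing modules:", missing_modules(required))
```

An empty list means that all modules were found.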

# 1. Fetch IAWB URLs

First, we need to fetch the URLs of different versions that can be replayed in IA's Wayback Machine. To do this, we make use of the `waybackpy` package that utilises IA's CDX Server API.

The code cell below contains a function to fetch URLs for archived versions between 1996 and 2024 with a 90-day interval. To change the interval, simply change the value `90` in the function. If you want another page than `http://www.nrk.no/`, simply change that value to e.g. `http://bbc.co.uk`.
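
The interval logic can be tried out in isolation, without contacting the Wayback Machine. The sketch below applies the same rule to a handful of made-up 14-digit timestamps; `thin_by_interval` and the example values are hypothetical and only serve as illustration.

```python
from datetime import datetime

def thin_by_interval(timestamps, min_days=90):
    """Keep a timestamp only if it is at least `min_days` after the last kept one."""
    kept = []
    last_date = None
    for ts in timestamps:
        date = datetime.strptime(ts, "%Y%m%d%H%M%S")
        if last_date is None or (date - last_date).days >= min_days:
            kept.append(ts)
            last_date = date
    return kept

# Hypothetical capture timestamps in chronological order, for illustration only
example = [
    "19961223000000",  # first capture: always kept
    "19970101000000",  # 9 days after the first capture: skipped
    "19970401000000",  # 99 days after the first capture: kept
    "19970615000000",  # 75 days after the previous kept capture: skipped
    "19970701000000",  # 91 days after the previous kept capture: kept
]
print(thin_by_interval(example))
```

Note that the interval is measured from the last kept capture, so the result is "roughly quarterly" rather than aligned to calendar quarters.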

In [None]:
# Import necessary packages
from waybackpy import WaybackMachineCDXServerAPI
from datetime import datetime

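def get_quarterly_archived_urls(domain, start_year=1996, end_year=2024):
    """Fetch roughly one archived snapshot every 90 days from the Wayback Machine."""
    # Restrict the CDX query to the requested range of years
    cdx = WaybackMachineCDXServerAPI(
        domain, start_timestamp=str(start_year), end_timestamp=str(end_year)
    )
    all_snapshots = list(cdx.snapshots())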
    
    quarterly_urls = []
    last_date = None

    for snapshot in all_snapshots:
        snapshot_date = datetime.strptime(snapshot.timestamp, "%Y%m%d%H%M%S")
        
        # One snapshot per quarter
        if last_date is None or (snapshot_date - last_date).days >= 90: # "90" defines the number of days between each version
            quarterly_urls.append(snapshot.archive_url)
            last_date = snapshot_date

    return quarterly_urls

urls = get_quarterly_archived_urls("http://www.nrk.no/")
print(f"Retrieved {len(urls)} URLs")

For documentation, the fetched URLs can also be stored in a JSONL file.

In [None]:
# Export URLs to JSONL
import json
import os

os.makedirs("export", exist_ok=True)  # Make sure the export folder exists
with open("./export/urls_from_IAWB.jsonl", "w", encoding="utf-8") as f:
    for item in urls:
        json.dump({"url": item}, f)
        f.write("\n")

print("Exported to export/urls_from_IAWB.jsonl")
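
To reuse the exported list in a later session without querying the CDX API again, the file can be read back line by line. This is a minimal sketch: `load_urls` is a hypothetical helper, and the demonstration writes placeholder URLs to a temporary file rather than touching the real export.

```python
import json
import os
import tempfile

def load_urls(path):
    """Read a JSONL file with one {"url": ...} object per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["url"] for line in f if line.strip()]

# Demonstration with a temporary file and placeholder URLs
demo_urls = [
    "https://web.archive.org/web/19961223000000/http://www.nrk.no/",
    "https://web.archive.org/web/19970401000000/http://www.nrk.no/",
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "urls.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        for url in demo_urls:
            f.write(json.dumps({"url": url}) + "\n")
    restored = load_urls(demo_path)

print(restored == demo_urls)
```

In the notebook, the equivalent call would be `urls = load_urls("./export/urls_from_IAWB.jsonl")`.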

# 2. Capture screenshots

All URLs are now stored in a list with the variable name `urls`. For each of these URLs, we want to create screenshots. This can be done automatically, using what is called a [headless browser](https://en.wikipedia.org/wiki/Headless_browser).

The cell below makes use of Selenium WebDriver and a headless version of Chrome. It visits each of the archived versions we fetched earlier and produces a screenshot of how the archived version is rendered in IA's Wayback Machine.

***NOTICE:*** *For the case of* ***`nrk.no`***, *you should expect 104 images to be created in the* ***`screenshots`*** *folder. This will take quite some time (~60 min), as the code only uses one instance of the headless browser. If you want to reduce the runtime, it is possible to set up several instances, but you can also reduce the chronological scope (1996-2024) or increase the interval (90 days).*
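
If you do want several browser instances, one possible first step is to split the URL list into batches, one per instance. The sketch below only shows the splitting; `split_into_batches` is a hypothetical helper, the URLs are placeholders, and starting the browsers themselves (for example with `concurrent.futures`) is left out. Each URL keeps its original index, so the files could still be named `snapshot_{i}.png` in chronological order.

```python
def split_into_batches(urls, n_batches):
    """Distribute (index, url) pairs round-robin over `n_batches` lists."""
    indexed = list(enumerate(urls))
    return [indexed[k::n_batches] for k in range(n_batches)]

# Placeholder URLs, for illustration only
example_urls = [f"https://example.org/page{i}" for i in range(7)]
batches = split_into_batches(example_urls, 3)

for k, batch in enumerate(batches):
    # Print which original indices ended up in each batch
    print(k, [i for i, _ in batch])
```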

In [None]:
from selenium import webdriver
import os
import time

def capture_screenshots(urls):
    """Captures screenshots for a list of archived URLs."""
    os.makedirs("screenshots", exist_ok=True)  # Make sure the output folder exists
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # Run in headless mode
    driver = webdriver.Chrome(options=options)

    for i, url in enumerate(urls):
        try:
            driver.get(url)
            time.sleep(3)  # Allow some time for page elements to load
            screenshot_path = f"screenshots/snapshot_{i}.png"
            driver.save_screenshot(screenshot_path)
            print(f"Saved: {screenshot_path}")
        except Exception as e:
            print(f"Failed to capture {url}: {e}")

    driver.quit()

# Capture screenshots
capture_screenshots(urls)

# 3. Compare screenshot similarity

After all screenshot images have been created, it is time to analyse their similarity.

This script uses a simple approach: `opencv` loads the screenshots as grayscale images, and `scikit-image` calculates a [Structural Similarity Index Measure (SSIM) score](https://en.wikipedia.org/wiki/Structural_similarity_index_measure) for each pair of images. Each image is compared to the next, so that a screenshot from Dec 1996 will be compared to a screenshot from March 1997.

SSIM scores range from -1 to +1. A score of +1 indicates that the images are identical, scores near 0 indicate little or no structural similarity, and negative scores indicate that the structure of the two images is inversely correlated.
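
To get a feeling for the score, the SSIM formula can be applied by hand to a few pixels. The sketch below is a simplified, single-window version in plain Python: it applies the formula once to the whole "image", whereas `scikit-image` evaluates it over many local windows and averages the results, so the numbers are not directly comparable to those produced later in this notebook. The name `global_ssim` is hypothetical; the constants 0.01 and 0.03 are the commonly used stabilising constants.

```python
def global_ssim(x, y, data_range=255):
    """Single-window SSIM for two equally sized lists of pixel intensities."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    def cov(a, mean_a, b, mean_b):
        # Population covariance; cov(a, a) is the variance of a
        return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / n

    var_x = cov(x, mean_x, x, mean_x)
    var_y = cov(y, mean_y, y, mean_y)
    cov_xy = cov(x, mean_x, y, mean_y)

    # Small constants that stabilise the division
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2

    numerator = (2 * mean_x * mean_y + c1) * (2 * cov_xy + c2)
    denominator = (mean_x**2 + mean_y**2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

stripes = [0, 255, 0, 255]
inverted = [255, 0, 255, 0]
print(round(global_ssim(stripes, stripes), 4))   # 1.0
print(round(global_ssim(stripes, inverted), 4))  # -0.9964
```

Identical inputs give exactly the upper bound, while a pattern and its inverse come out close to the lower bound.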

In [None]:
import cv2
import numpy as np
from skimage.metrics import structural_similarity as ssim
import os
import re

def compare_all_screenshots(directory="screenshots/"):
    """Compare all screenshots sequentially to detect major layout shifts."""

    # Get all valid snapshot files matching "snapshot_X.png"
    files = [f for f in os.listdir(directory) if re.match(r"snapshot_\d+\.png", f)]
    
    # Sort numerically by extracting the snapshot number
    files = sorted(files, key=lambda x: int(re.search(r"snapshot_(\d+)\.png", x).group(1)))

    ssim_scores = []

    for i in range(len(files) - 1):
        img1_path = os.path.join(directory, files[i])
        img2_path = os.path.join(directory, files[i + 1])

        img1 = cv2.imread(img1_path, cv2.IMREAD_GRAYSCALE)
        img2 = cv2.imread(img2_path, cv2.IMREAD_GRAYSCALE)

        # Resize to match dimensions
        img2 = cv2.resize(img2, (img1.shape[1], img1.shape[0]))

        # Compute SSIM
        score, diff = ssim(img1, img2, full=True)
        diff = (diff * 255).astype("uint8")

        ssim_scores.append(score)
        print(f"SSIM between {files[i]} and {files[i+1]}: {score:.4f}")

        # Ensure the "diff" directory exists
        os.makedirs("diff", exist_ok=True)

        # Save difference images for visualization
        cv2.imwrite(f"diff/diff_{i}.png", diff)

    return ssim_scores

# Example usage
ssim_scores = compare_all_screenshots()

# 4. Visualise SSIM scores over time

When we have assigned SSIM scores to each pair of images, we can visualise the similarity of screenshots over time. High SSIM scores indicate a high similarity between one version and the next, while low SSIM scores indicate more significant changes.

Using `plotly`, we can make an interactive line plot that shows the development over time.
The plot shows values on hover and can be exported as a PNG.

In [None]:
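import re
from datetime import datetime

import plotly.graph_objects as go

def extract_timestamp(url):
    """Parse the 14-digit capture timestamp from a Wayback Machine URL.

    Assumes URLs of the form https://web.archive.org/web/<timestamp>/<original URL>.
    """
    match = re.search(r"/web/(\d{14})", url)
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")

def plot_ssim_timeline_interactive(ssim_scores, urls):
    """Plot interactive SSIM scores against archive timestamps using Plotly."""
    timestamps = [extract_timestamp(url) for url in urls]

    # Each score compares snapshot i with snapshot i + 1, so drop the first
    # timestamp and label every score with the date of the later snapshot
    timestamps = timestamps[1:len(ssim_scores)+1]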

    # Create interactive Plotly figure
    fig = go.Figure()

    # Add SSIM scores as a line plot
    fig.add_trace(go.Scatter(
        x=timestamps,
        y=ssim_scores,
        mode='lines+markers',
        marker=dict(size=6, color='blue'),
        line=dict(width=2),
        name="SSIM Score"
    ))

    # Add a threshold line at SSIM = 0.55
    fig.add_trace(go.Scatter(
        x=timestamps,
        y=[0.55] * len(ssim_scores),
        mode='lines',
        line=dict(color='red', dash='dash'),
        name="Breakage Threshold (0.55)"
    ))

    # Layout customization
    fig.update_layout(
        title="Interactive Web Layout Change Over Time (NRK.no)",
        xaxis_title="Date",
        yaxis_title="SSIM Similarity Score",
        xaxis=dict(tickangle=45),
        hovermode="x unified"
    )

    fig.show()

# Example usage
plot_ssim_timeline_interactive(ssim_scores, urls)
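
The threshold drawn in the plot (0.55) can also be applied programmatically to list the transitions that fall below it. A minimal sketch, assuming the scores and their date labels are available as two lists of equal length; `find_layout_breaks` is a hypothetical helper and the numbers are made up.

```python
def find_layout_breaks(ssim_scores, labels, threshold=0.55):
    """Return (label, score) pairs for transitions scoring below `threshold`."""
    return [
        (label, score)
        for label, score in zip(labels, ssim_scores)
        if score < threshold
    ]

# Made-up scores and dates, for illustration only
example_scores = [0.91, 0.48, 0.87, 0.52]
example_labels = ["1997-03", "1997-06", "1997-09", "1997-12"]
print(find_layout_breaks(example_scores, example_labels))
# [('1997-06', 0.48), ('1997-12', 0.52)]
```

Each flagged date is a candidate for a layout change and is worth checking against the corresponding screenshots and difference images.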


# 5. Assemble all screenshots in a montage

When we have all these images, we can also produce a montage.

Running the cell below will produce a `montage.png` in the export folder.
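
The montage is a simple row-major grid: image number `index` goes into column `index % columns` and row `index // columns`. The sketch below works through that arithmetic with the values used in the cell that follows (103 images, 10 columns, thumbnails of 192 x 180 pixels); `grid_position` is a hypothetical helper name.

```python
import math

def grid_position(index, columns, cell_width, cell_height):
    """Top-left pixel offset of image number `index` in a row-major grid."""
    x_offset = (index % columns) * cell_width
    y_offset = (index // columns) * cell_height
    return x_offset, y_offset

columns, cell_width, cell_height = 10, 192, 180
rows = math.ceil(103 / columns)  # Number of rows needed for 103 images

print(rows)                                                 # 11
print(grid_position(0, columns, cell_width, cell_height))   # (0, 0)
print(grid_position(23, columns, cell_width, cell_height))  # (576, 360)
print(columns * cell_width, rows * cell_height)             # 1920 1980
```

With these settings the finished montage is 1920 pixels wide and 1980 pixels high, and the last row is only partly filled.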

In [None]:
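import os
import re
import math
from PIL import Image
import matplotlib.pyplot as plt

# ---- CONFIGURATIONS ----
input_folder = "screenshots"  # Change this to your folder path
output_file = "export/montage.png"
os.makedirs(os.path.dirname(output_file), exist_ok=True)  # Make sure the export folder exists

# ---- MONTAGE SETTINGS ----
num_images = 103
img_width, img_height = 1200, 992  # Your image size
columns = 10  # Number of columns in the grid
rows = math.ceil(num_images / columns)  # Number of rows needed
resize_width, resize_height = 192, 180  # Thumbnail size; 10 columns x 192 px gives a 1920 px wide montage

# ---- LOAD IMAGES ----
def snapshot_number(filename):
    """Sort key: the number in "snapshot_<n>.png", so that 2 sorts before 10."""
    match = re.search(r"(\d+)", filename)
    return int(match.group(1)) if match else -1

# Sort numerically rather than alphabetically to keep the chronological order
image_files = [f for f in os.listdir(input_folder) if f.endswith(('png', 'jpg', 'jpeg'))]
image_files = [os.path.join(input_folder, f) for f in sorted(image_files, key=snapshot_number)]
image_files = image_files[:num_images]  # Limit to num_images images

# Check if we have enough images
assert len(image_files) == num_images, f"Expected {num_images} images, but found {len(image_files)}"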

# ---- CREATE A BLANK CANVAS ----
montage_width = columns * resize_width
montage_height = rows * resize_height
montage = Image.new('RGB', (montage_width, montage_height), (255, 255, 255))

# ---- PROCESS AND PLACE IMAGES ----
for index, img_path in enumerate(image_files):
    img = Image.open(img_path)
    img = img.resize((resize_width, resize_height))  # Resize image

    # Compute position in the grid
    x_offset = (index % columns) * resize_width
    y_offset = (index // columns) * resize_height

    # Paste onto the montage canvas
    montage.paste(img, (x_offset, y_offset))

# ---- SAVE AND DISPLAY ----
montage.save(output_file)
plt.figure(figsize=(10, 5))
plt.imshow(montage)
plt.axis("off")
plt.show()

print(f"Montage saved as {output_file}")
