<a href="https://colab.research.google.com/github/ohgawditsal/SWMTWA/blob/main/Seeing_with_Machines%2C_Thinking_with_Archives_Computational_Workflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Seeing with Machines, Thinking with Archives: Computational Workflow

## Thesis & Workflow Description

This script provides a complete Python pipeline for analyzing the color characteristics of a historical film, designed to be run in Google Colab.
It follows the methodology outlined in the thesis:
1. Setup: Installs necessary libraries and prepares the environment.
2. Frame Extraction: Uses FFmpeg to extract frames from a video file.
3. Color Analysis: For each frame, it calculates:
  - The 'k' dominant colors using k-means clustering.
  - The average Hue, Saturation, and Value (HSV).
4. Data Aggregation: Stores the extracted color data in a Pandas DataFrame.
5. Visualization: Generates several plots to visualize the film's color:
  - A temporal "color barcode" of the entire film.
  - An overall color palette for the film.
  - Line graphs of average Hue, Saturation, and Value over time.
  - A histogram of hue distribution for the entire film.

How to use on Google Colab:
1. Upload the video file to your Google Drive.
2. Run the first cell (Step 1) to install libraries and mount your Google Drive.
3. In the second cell (Step 2), update the 'VIDEO_PATH' variable to point to your video file in Google Drive.
4. Adjust parameters like 'K_CLUSTERS' and 'FPS_TO_EXTRACT' as needed.
5. Run the remaining cells sequentially. The script will create output folders in your specified 'OUTPUT_BASE_PATH' for frames, data (CSV), and plots.
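
Before running the remaining cells, it can help to confirm that the video file exists and that FFmpeg is available. The following is a minimal sketch of such a check; the helper name `preflight_problems` is hypothetical and not part of the notebook, and the path in the usage line is only an example.

```python
import os
import shutil


def preflight_problems(video_path):
    """Return a list of human-readable problems; an empty list means ready to run.

    Hypothetical helper for illustration, not part of the original notebook.
    """
    problems = []
    # the video must already be present (e.g. on the mounted Google Drive)
    if not os.path.isfile(video_path):
        problems.append(f"Video file not found: {video_path}")
    # the frame extraction step calls the 'ffmpeg' executable
    if shutil.which("ffmpeg") is None:
        problems.append("FFmpeg executable not found on PATH")
    return problems


# example usage with an illustrative path
for problem in preflight_problems("/content/drive/MyDrive/example_film.mp4"):
    print("WARNING:", problem)
```

An empty result means both checks passed; any other result lists what to fix before starting the frame extraction.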

## STEP 1: SETUP AND IMPORTS


Install necessary libraries.
tqdm is used for progress bars.

In [None]:
!pip install tqdm

import os
import subprocess
import cv2
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from google.colab import drive
from tqdm import tqdm
import shutil

# mount google drive to access files
try:
    drive.mount('/content/drive')
    print("Google Drive mounted successfully.")
except Exception as e:
    print(f"Error mounting Google Drive: {e}")
    print("Please ensure you are running this in a Google Colab environment.")


### STEP 1.1: DOWNLOADING FROM A PUBLIC SOURCE

In [None]:
# install yt-dlp, a tool for downloading videos from YouTube and other sites.
!pip install yt-dlp

# URL of a legitimately free-to-view archival film (e.g., from BFI's YouTube channel)
# here we have the URL of "The Cabinet of Dr. Caligari" (1920)
video_url = 'https://www.youtube.com/watch?v=nQSzEe3xqf4'

# define where to save the downloaded video
download_folder = '/content/drive/MyDrive/MASTERSTHESIS/VIDEOS'
os.makedirs(download_folder, exist_ok=True)
downloaded_video_path = os.path.join(download_folder, 'royaljourney_1951.mp4')

# use yt-dlp to download the video in mp4 format
!yt-dlp -o "{downloaded_video_path}" -f 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best' "{video_url}"

# NOW, you can set your VIDEO_PATH to this downloaded file
# VIDEO_PATH = downloaded_video_path

[youtube] Extracting URL: https://www.youtube.com/watch?v=nQSzEe3xqf4
[youtube] nQSzEe3xqf4: Downloading webpage
[youtube] nQSzEe3xqf4: Downloading tv client config
[youtube] nQSzEe3xqf4: Downloading tv player API JSON
[youtube] nQSzEe3xqf4: Downloading ios player API JSON
[youtube] nQSzEe3xqf4: Downloading m3u8 information
[info] nQSzEe3xqf4: Downloading 1 format(s): 137+140
[download] Destination: /content/drive/MyDrive/MASTERSTHESIS/VIDEOS/royaljourney_1951.f137.mp4
[download] 100% of    1.16GiB in 00:07:44 at 2.55MiB/s
[download] Destination: /content/drive/MyDrive/MASTERSTHESIS/VIDEOS/royaljourney_1951.f140.m4a
[download] 100% of   47.85MiB in 00:00:13 at 3.55MiB/s
[Merger] Merging formats into "/content/drive/MyDrive/MASTERSTHESIS/VIDEOS/royaljourney_1951.mp4"
Deleting original file /content/drive/MyDrive/MASTERSTHESIS/VIDEOS/royaljourney_1951.f140.m4a (pass -k to keep)
Deleting original file /content/drive/MyDrive/MASTERSTHESIS/V

## STEP 2: CONFIGURATION AND PARAMETERS

In [None]:
# TODO: CHANGE THIS to the path of your video file in Google Drive.
VIDEO_PATH = '/content/drive/MyDrive/MASTERSTHESIS/VIDEOS/royaljourney_1951.mp4'

# TODO: CHANGE THIS to a base path in your Google Drive where all output will be saved.
# the script will create subfolders within this path.
OUTPUT_BASE_PATH = '/content/drive/MyDrive/MASTERSTHESIS/OUTPUT'

# name for this specific film's analysis (used to create a unique output folder).
FILM_NAME = 'royaljourney_1951'

# number of dominant colors to extract from each frame.
K_CLUSTERS = 5

# number of frames to extract per second from the video.
# 1 is a good default for general analysis.
FPS_TO_EXTRACT = 1

# width to resize frames to for faster processing. Aspect ratio is maintained.
RESIZE_WIDTH = 640
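
Since 'FPS_TO_EXTRACT' fixes the sampling rate, an extracted frame number can be mapped back to an approximate position in the film. The sketch below assumes that frame numbering starts at 1 (as with the 'frame_%05d.jpg' pattern used later) and that frames are evenly spaced; the function name `frame_number_to_timestamp` is hypothetical, and the result should be read as an approximation rather than an exact timecode.

```python
def frame_number_to_timestamp(frame_number, fps):
    """Approximate HH:MM:SS position of an extracted frame.

    Assumes frame 1 corresponds to the start of the film and that
    consecutive frames are 1/fps seconds apart.
    """
    if frame_number < 1 or fps <= 0:
        raise ValueError("frame_number must be >= 1 and fps must be > 0")
    total_seconds = int((frame_number - 1) / fps)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


print(frame_number_to_timestamp(1, 1))     # 00:00:00
print(frame_number_to_timestamp(3100, 1))  # 00:51:39
```

At 1 FPS, each frame index in the later plots therefore corresponds to roughly one second of running time.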

In [None]:
# AUTOMATIC PATH CONFIGURATION
# create a main directory for this film's analysis.
FILM_OUTPUT_PATH = os.path.join(OUTPUT_BASE_PATH, FILM_NAME)
FRAMES_FOLDER = os.path.join(FILM_OUTPUT_PATH, 'frames')
PLOTS_FOLDER = os.path.join(FILM_OUTPUT_PATH, 'plots')
DATA_FOLDER = os.path.join(FILM_OUTPUT_PATH, 'data')

# create the necessary directories if they don't exist.
os.makedirs(FRAMES_FOLDER, exist_ok=True)
os.makedirs(PLOTS_FOLDER, exist_ok=True)
os.makedirs(DATA_FOLDER, exist_ok=True)

print(f"Configuration complete. Output will be saved to: {FILM_OUTPUT_PATH}")

Configuration complete. Output will be saved to: /content/drive/MyDrive/MASTERSTHESIS/OUTPUT/royaljourney_1951


## STEP 3: HELPER FUNCTIONS

In [None]:
def extract_frames(video_path, output_folder, fps):
    """
    extracts frames from a video file using FFmpeg.
    args:
        video_path (str): path to the video file.
        output_folder (str): directory to save the extracted frames.
        fps (int): number of frames to extract per second.
    """
    print(f"\n--- Starting Frame Extraction (at {fps} FPS) ---")

    # clean up old frames if they exist
    if os.path.exists(output_folder):
        shutil.rmtree(output_folder)
    os.makedirs(output_folder)

    command = [
        'ffmpeg',
        '-i', video_path,
        '-vf', f'fps={fps}',
        '-q:v', '2',  # high quality JPEGs
        os.path.join(output_folder, 'frame_%05d.jpg')
    ]
    try:
        # run FFmpeg; check=True raises CalledProcessError on a non-zero exit code
        subprocess.run(command, check=True, capture_output=True, text=True)
        print("Frames extracted successfully.")
    except FileNotFoundError:
        print("ERROR: FFmpeg not found. This script requires FFmpeg to be installed.")
    except subprocess.CalledProcessError as e:
        print(f"ERROR during frame extraction: {e}")
        print(f"FFmpeg stderr: {e.stderr}")

def rgb_to_hex(rgb_color):
    """converts an RGB color tuple to a hex string."""
    return mcolors.to_hex(np.array(rgb_color) / 255.0)

def analyze_frame_colors(frame_path, k, resize_width):
    """
    analyzes a single image frame to extract dominant colors and average HSV.
    args:
        frame_path (str): path to the image file.
        k (int): number of dominant colors to find.
        resize_width (int): width to resize the image to.
    returns:
        dict: a dictionary containing analysis results, or None if error.
    """
    try:
        # load the image with OpenCV
        img = cv2.imread(frame_path)
        if img is None:
            return None

        # resize the image to speed up processing
        aspect_ratio = img.shape[0] / img.shape[1]
        new_height = int(resize_width * aspect_ratio)
        resized_img = cv2.resize(img, (resize_width, new_height), interpolation=cv2.INTER_AREA)

        # convert from BGR (OpenCV default) to RGB for analysis and display
        img_rgb = cv2.cvtColor(resized_img, cv2.COLOR_BGR2RGB)

        # convert to HSV for calculating average values
        img_hsv = cv2.cvtColor(resized_img, cv2.COLOR_BGR2HSV)
        avg_h = np.mean(img_hsv[:, :, 0])
        avg_s = np.mean(img_hsv[:, :, 1])
        avg_v = np.mean(img_hsv[:, :, 2])

        # reshape the image to be a list of pixels for k-means
        pixels = img_rgb.reshape((-1, 3))
        pixels = np.float32(pixels)

        # perform k-means clustering
        kmeans = KMeans(n_clusters=k, random_state=42, n_init='auto')
        kmeans.fit(pixels)

        # get the cluster centers (dominant colors) and their proportions
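        dominant_colors_rgb = kmeans.cluster_centers_.astype(int)
        # bincount with minlength=k keeps one count per cluster, even when
        # k-means finds fewer than k distinct colors (e.g. a uniform frame)
        counts = np.bincount(kmeans.labels_, minlength=k)
        proportions = counts / len(pixels)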

        # sort colors by proportion (most dominant first)
        sorted_indices = np.argsort(proportions)[::-1]
        dominant_colors_rgb = dominant_colors_rgb[sorted_indices]
        proportions = proportions[sorted_indices]

        # convert dominant colors to hex codes
        dominant_colors_hex = [rgb_to_hex(color) for color in dominant_colors_rgb]

        return {
            'dominant_colors_hex': dominant_colors_hex,
            # plain Python ints keep the CSV output and plot labels readable
            'dominant_colors_rgb': [tuple(int(v) for v in c) for c in dominant_colors_rgb],
            'proportions': proportions.tolist(),
            'avg_hue': avg_h,
            'avg_saturation': avg_s,
            'avg_value': avg_v
        }

    except Exception as e:
        print(f"Could not process frame {frame_path}: {e}")
        return None
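
When a frame contains fewer distinct colors than 'k' (for example, a uniformly black frame), some clusters may receive no pixels at all. The sketch below uses hypothetical label data to illustrate, under that assumption, how per-cluster proportions can be computed so that every cluster keeps its own slot; it is not taken from the notebook's output.

```python
import numpy as np

k = 5
# hypothetical k-means labels for 8 pixels: only clusters 0 and 3 are used
labels = np.array([0, 0, 0, 3, 0, 0, 3, 0])

# np.unique only reports clusters that actually occur, so it can return
# fewer than k counts and lose the link between position and cluster index
present, counts_unique = np.unique(labels, return_counts=True)
print(present, counts_unique)

# np.bincount with minlength=k returns exactly one count per cluster index
counts = np.bincount(labels, minlength=k)
proportions = counts / labels.size
print(proportions)

# sort cluster indices from most to least dominant
order = np.argsort(proportions)[::-1]
print(order[:2])
```

Because `counts` always has length `k`, the sorted indices can safely be used to reorder an array of `k` cluster centers.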

## STEP 4: VISUALIZATION FUNCTIONS

In [None]:
def plot_color_barcode(df, output_path, title):
    """generates and saves a color barcode visualization for the film."""
    print("\n--- Generating Color Barcode ---")
    num_frames = len(df)
    barcode = np.zeros((max(100, num_frames // 10), num_frames, 3), dtype=np.uint8)

    # get the most dominant color for each frame
    dominant_colors = [colors[0] for colors in df['dominant_colors_rgb']]

    for i, color in enumerate(dominant_colors):
        barcode[:, i, :] = color

    plt.figure(figsize=(20, 3))
    plt.imshow(barcode, aspect='auto')
    plt.title(title, fontsize=16)
    plt.xlabel("Time (Frame Number / Extracted FPS)")
    plt.yticks([])
    plt.tight_layout()
    plt.savefig(output_path, dpi=300)
    plt.close()
    print(f"Color barcode saved to {output_path}")

def plot_temporal_graphs(df, output_path_base, title):
    """generates and saves plots of average HSV over time."""
    print("\n--- Generating Temporal HSV Plots ---")
    fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(15, 10), sharex=True)
    fig.suptitle(f'Color Properties Over Time: {title}', fontsize=16)

    # hue plot
    ax1.plot(df.index, df['avg_hue'], color='r', alpha=0.7)
    ax1.set_title('Average Hue')
    ax1.set_ylabel('Hue (0-179)')
    ax1.grid(True, linestyle='--', alpha=0.5)

    # saturation plot
    ax2.plot(df.index, df['avg_saturation'], color='g', alpha=0.7)
    ax2.set_title('Average Saturation')
    ax2.set_ylabel('Saturation (0-255)')
    ax2.grid(True, linestyle='--', alpha=0.5)

    # value (brightness) plot
    ax3.plot(df.index, df['avg_value'], color='b', alpha=0.7)
    ax3.set_title('Average Value (Brightness)')
    ax3.set_ylabel('Value (0-255)')
    ax3.set_xlabel('Time (Frame Number / Extracted FPS)')
    ax3.grid(True, linestyle='--', alpha=0.5)

    plt.tight_layout(rect=[0, 0, 1, 0.96])
    output_path = f"{output_path_base}_hsv_temporal.png"
    plt.savefig(output_path, dpi=300)
    plt.close()
    print(f"Temporal graphs saved to {output_path}")

def plot_overall_palette(df, output_path, title, top_n=10):
    """Generates and saves a summary color palette for the entire film."""
    print("\n--- Generating Overall Film Palette ---")
    all_colors = []
    all_proportions = []

    for _, row in df.iterrows():
        all_colors.extend(row['dominant_colors_rgb'])
        all_proportions.extend(row['proportions'])

    # create a DataFrame of all dominant colors and their frame proportions
    palette_df = pd.DataFrame({
        'color_rgb': [tuple(c) for c in all_colors],
        'proportion': all_proportions
    })

    # group by color and sum the proportions to get overall dominance
    overall_palette = palette_df.groupby('color_rgb')['proportion'].sum().sort_values(ascending=False)
    overall_palette = (overall_palette / df.shape[0]) # normalize by number of frames

    top_colors = overall_palette.head(top_n)

    plt.figure(figsize=(15, 3))
    plt.pie(top_colors, labels=[str(c) for c in top_colors.index],
            colors=[np.array(c)/255.0 for c in top_colors.index],
            autopct='%1.1f%%', startangle=90)
    plt.title(f'Overall Color Palette (Top {top_n} Dominant Colors): {title}', fontsize=14)
    plt.axis('equal') # equal aspect ratio ensures that pie is drawn as a circle.
    plt.tight_layout()
    plt.savefig(output_path, dpi=300)
    plt.close()
    print(f"Overall palette saved to {output_path}")

def plot_hue_histogram(df, output_path, title):
    """generates and saves a histogram of hue distribution for the entire film."""
    print("\n--- Generating Hue Distribution Histogram ---")
    plt.figure(figsize=(12, 6))
    plt.hist(df['avg_hue'], bins=45, range=(0, 180), color='cyan', edgecolor='black')
    plt.title(f'Overall Hue Distribution: {title}', fontsize=16)
    plt.xlabel('Hue Value (0-179, from HSV)')
    plt.ylabel('Number of Frames')
    plt.grid(axis='y', alpha=0.75, linestyle='--')
    plt.tight_layout()
    plt.savefig(output_path, dpi=300)
    plt.close()
    print(f"Hue histogram saved to {output_path}")
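
The axis labels above use OpenCV's 8-bit HSV ranges, where hue runs from 0 to 179 (degrees divided by two) and saturation and value run from 0 to 255. For readers more used to degrees and percentages, a minimal conversion sketch follows; the function name `opencv_hsv_to_conventional` is hypothetical.

```python
def opencv_hsv_to_conventional(h, s, v):
    """Convert 8-bit OpenCV HSV values to (degrees, percent, percent)."""
    hue_degrees = h * 2.0               # 0-179 -> 0-358 degrees
    saturation_pct = s / 255.0 * 100.0  # 0-255 -> 0-100 %
    value_pct = v / 255.0 * 100.0       # 0-255 -> 0-100 %
    return (hue_degrees, saturation_pct, value_pct)


# an OpenCV hue of 90 corresponds to 180 degrees on the color wheel
print(opencv_hsv_to_conventional(90, 255, 0))
```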

## STEP 5: MAIN EXECUTION SCRIPT

In [None]:
if __name__ == '__main__':
    # --- 1. extract frames ---
    extract_frames(VIDEO_PATH, FRAMES_FOLDER, FPS_TO_EXTRACT)

    # --- 2. analyze frames ---
    print("\n--- Analyzing Frames for Color Data ---")
    frame_files = sorted([os.path.join(FRAMES_FOLDER, f) for f in os.listdir(FRAMES_FOLDER) if f.endswith('.jpg')])

    all_frame_data = []

    for frame_path in tqdm(frame_files, desc="Processing Frames"):
        analysis_result = analyze_frame_colors(frame_path, K_CLUSTERS, RESIZE_WIDTH)
        if analysis_result:
            analysis_result['frame_path'] = os.path.basename(frame_path)
            all_frame_data.append(analysis_result)

    # --- 3. aggregate data ---
    if all_frame_data:
        color_df = pd.DataFrame(all_frame_data)

        # save the DataFrame to a CSV file for future analysis
        csv_path = os.path.join(DATA_FOLDER, f'{FILM_NAME}_color_analysis.csv')
        color_df.to_csv(csv_path, index=False)
        print(f"\nColor analysis data saved to: {csv_path}")

        # --- 4. generate visualizations ---
        plot_color_barcode(color_df, os.path.join(PLOTS_FOLDER, f'{FILM_NAME}_barcode.png'), FILM_NAME)
        plot_temporal_graphs(color_df, os.path.join(PLOTS_FOLDER, f'{FILM_NAME}'), FILM_NAME)
        plot_overall_palette(color_df, os.path.join(PLOTS_FOLDER, f'{FILM_NAME}_palette.png'), FILM_NAME)
        plot_hue_histogram(color_df, os.path.join(PLOTS_FOLDER, f'{FILM_NAME}_hue_hist.png'), FILM_NAME)

        print("\n--- Analysis and Visualization Complete! ---")
        print(f"All outputs can be found in your Google Drive at: {FILM_OUTPUT_PATH}")
    else:
        print("\n--- No frames were processed. Halting script. ---")



--- Starting Frame Extraction (at 1 FPS) ---
Frames extracted successfully.

--- Analyzing Frames for Color Data ---


Processing Frames: 100%|██████████| 3100/3100 [15:25<00:00,  3.35it/s]



Color analysis data saved to: /content/drive/MyDrive/MASTERSTHESIS/OUTPUT/royaljourney_1951/data/royaljourney_1951_color_analysis.csv

--- Generating Color Barcode ---
Color barcode saved to /content/drive/MyDrive/MASTERSTHESIS/OUTPUT/royaljourney_1951/plots/royaljourney_1951_barcode.png

--- Generating Temporal HSV Plots ---
Temporal graphs saved to /content/drive/MyDrive/MASTERSTHESIS/OUTPUT/royaljourney_1951/plots/royaljourney_1951_hsv_temporal.png

--- Generating Overall Film Palette ---
Overall palette saved to /content/drive/MyDrive/MASTERSTHESIS/OUTPUT/royaljourney_1951/plots/royaljourney_1951_palette.png

--- Generating Hue Distribution Histogram ---
Hue histogram saved to /content/drive/MyDrive/MASTERSTHESIS/OUTPUT/royaljourney_1951/plots/royaljourney_1951_hue_hist.png

--- Analysis and Visualization Complete! ---
All outputs can be found in your Google Drive at: /content/drive/MyDrive/MASTERSTHESIS/OUTPUT/royaljourney_1951
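
When the saved CSV is reloaded for further analysis, the list-valued columns ('dominant_colors_rgb', 'proportions', 'dominant_colors_hex') come back as plain strings. The sketch below shows one way to parse them with `ast.literal_eval`, using a hypothetical one-row CSV built in memory. It assumes the color tuples were written as plain Python integers.

```python
import ast
import csv
import io

# build a hypothetical one-row CSV that mimics the layout of the analysis file
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["frame_path", "dominant_colors_rgb", "proportions"]
)
writer.writeheader()
writer.writerow({
    "frame_path": "frame_00001.jpg",
    "dominant_colors_rgb": str([(12, 10, 8), (200, 180, 150)]),
    "proportions": str([0.7, 0.3]),
})

# read it back: every cell is a string until it is parsed
buffer.seek(0)
rows = list(csv.DictReader(buffer))
colors = ast.literal_eval(rows[0]["dominant_colors_rgb"])
proportions = ast.literal_eval(rows[0]["proportions"])

print(colors[0], proportions[0])
```

`ast.literal_eval` only accepts Python literals, so it is a safer choice than `eval` for parsing these columns.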
