# 545 M7 Project: Latent Space Cluster Analysis

- Stace Worrell
- Tochukwu "Sylvester" Nwizu
- Giuseppe Schintu

- Movie: Honey, I Shrunk The Kids!


In this project, we aim to address the following three hypotheses:
- Analyzing Narrative Impact: The Role of Key Characters and Objects in Film as Identified by CLIP
- Scene Consistency and Transition: Frames that are visually and thematically similar cluster together tightly in t-SNE and PCA visualizations, and distinct clusters correspond to different scenes/settings in the movie.
- Quantitative Analysis of Object Distribution in Images Using Deep Learning Models 


### Global Code and Functions
Run this cell first to import the modules and define the global functions and variables.

In [None]:
#Modules
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
from IPython.display import Image, display
import re
import math
import torch
import os
import cv2
from collections import Counter
import pandas as pd

# Models
from scipy.spatial import distance
from sklearn.neighbors import NearestNeighbors
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from transformers import CLIPProcessor, CLIPModel
from sklearn.metrics.pairwise import cosine_similarity




# Global Functions

def get_top50_ann(target_embedding, embeddings):
    nn = NearestNeighbors(n_neighbors=51, metric='cosine', algorithm='brute')
    nn.fit(embeddings)
    distances, indices = nn.kneighbors([target_embedding])
    return indices[0][1:]  # Skip the first index because it's the target itself

def get_top50_euclidean(target_embedding, embeddings):
    distances = [distance.euclidean(target_embedding, emb) for emb in embeddings]
    indices = np.argsort(distances)[1:51]  # Skip the first index because it's the target itself
    return indices

def display_image(index):
    display(Image(filename=f"thumbnails_folder2large/{g_movie_embeddings[index]['input']}"))

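def display_images(indices, embeddings=None):
    # `embeddings` is unused; it is optional so one-argument calls also work
    fig, axes = plt.subplots(10, 10, figsize=(20, 10))
    for ax in axes.flat:
        ax.axis('off')
    # zip stops at the shorter sequence, so at most 100 images are shown
    for ax, idx in zip(axes.flat, indices):
        ax.imshow(plt.imread(f"thumbnails_folder2large/{g_movie_embeddings[idx]['input']}"))
    plt.show()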

def display_images_first_x_last_x(indices, first_x, last_x, cluster_n=0):
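    # Select the first_x and last_x indices; small clusters are shown whole
    # so that no image is displayed twice
    if len(indices) <= first_x + last_x:
        selected_indices = list(indices)
    else:
        selected_indices = list(indices[:first_x]) + list(indices[len(indices) - last_x:])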
    
    # Calculate the number of rows and columns for the subplot
    total_images = first_x + last_x
    cols = 10
    rows = math.ceil(total_images / cols)
    
    fig, axes = plt.subplots(rows, cols, figsize=(20, 2 * rows))
    axes = axes.ravel()  # Flatten the axes array
    
    # Hide all axes
    for ax in axes:
        ax.axis('off')
    
    # Display images on the first len(selected_indices) axes
    for i, idx in enumerate(selected_indices):
        axes[i].imshow(plt.imread(f"thumbnails_folder2large/{g_movie_embeddings[idx]['input']}"))
        axes[i].axis('on')
    
    # Use a figure-level title; plt.title would attach it to the last subplot only
    fig.suptitle(f"Cluster {cluster_n} - First {first_x} and Last {last_x} Images - Total Images in Cluster: {len(indices)}")
    plt.tight_layout()
    plt.show()

def display_cluster_images(cluster_labels, cluster_number):
    # Get indices of images in the cluster
    indices = [i for i, label in enumerate(cluster_labels) if label == cluster_number]
    
    # Display images
    display_images(indices)

def display_cluster_images_first_last_x(cluster_labels, cluster_number, first_x, last_x):
    # Get indices of images in the cluster
    indices = [i for i, label in enumerate(cluster_labels) if label == cluster_number]
    
    # Display images
    display_images_first_x_last_x(indices, first_x, last_x, cluster_number)

def find_and_remove_intro_and_subtitles(g_only_embeddings, threshold=0.7):
    # Load the CLIP model and processor
    model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

    # Get the text embeddings for the intro and subtitles
    inputs = processor(text=["Image of Walt Disney Movie Intro", "Walt Disney Movie Intro", "Movie closing credits", "Movie end credits", "Image with lots of closing credits"], return_tensors="pt", padding=True)
    text_embeddings = model.get_text_features(**inputs)
    text_embeddings_np = text_embeddings.detach().numpy()

    # Calculate the cosine similarity between the text embeddings and the movie embeddings
    similarities = cosine_similarity(text_embeddings_np, g_only_embeddings)

    # Find the indices of the embeddings that are similar to the intro and subtitles
    intro_subtitle_indices = np.where(similarities.max(axis=0) > threshold)[0]

    #print("Number of images(Intro and Closing credits) to remove:", intro_subtitle_indices)

    # Create new lists that exclude the intro and closing-credit frames
    keep = np.ones(len(g_only_embeddings), dtype=bool)
    keep[intro_subtitle_indices] = False
    new_g_movie_embeddings = [emb for emb, kept in zip(g_movie_embeddings, keep) if kept]
    new_g_only_embeddings = np.asarray(g_only_embeddings)[keep]

    return new_g_movie_embeddings, new_g_only_embeddings

# Global Variables
with open("honey_i_shrunk_the_kids_movie_embeddings_1_second.json") as f:
    g_movie_embeddings = json.load(f)
g_only_embeddings = np.array([emb['embedding'] for emb in g_movie_embeddings])

g_movie_embeddings, g_only_embeddings = find_and_remove_intro_and_subtitles(g_only_embeddings, threshold=0.237)


## Hypothesis 1

In [None]:
dog_idx = 5038
mower_idx = 4733
ant_idx = 3230

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

with open("honey_i_shrunk_the_kids_movie_embeddings_1_second.json", 'r') as file:
    movie_embeddings = json.load(file)

def euclidean_distance(array1, array2):
    # Convert the arrays to NumPy arrays
    array1_np = np.array(array1)
    array2_np = np.array(array2)
    
    # Calculate the Euclidean distance
    distance = np.linalg.norm(array1_np - array2_np)
    return distance

def find_and_display_matches(text_queries, top_k=5):
    inputs = processor(text=text_queries, return_tensors="pt", padding=True)
    text_embeddings = model.get_text_features(**inputs)
    text_embeddings_np = text_embeddings.detach().numpy()
    movie_embeddings_np = np.array([movie['embedding'] for movie in movie_embeddings])
    similarities = cosine_similarity(text_embeddings_np, movie_embeddings_np)
    for index, text_query in enumerate(text_queries):
        print(f"Top matches for: {text_query}")
        top_indices = np.argsort(similarities[index])[::-1][:top_k]
        for i in top_indices:
            frame = movie_embeddings[i]['input']
            print(f"Displaying frame: {frame}")
            display(Image(filename=f'thumbnails_folder2large/{frame}'))

def plot_euclidean_distance_from(target_idx):
    target = movie_embeddings[target_idx]
    image_path = 'thumbnails_folder2large/' + target["input"]
    # Display the image
    display(Image(filename=image_path))

    index_to_distance = []

    # Iterate through the input list
    for emb in movie_embeddings:
        current_dist = euclidean_distance(emb["embedding"], target["embedding"])
        index_to_distance.append(current_dist)

    # Create a plot using Seaborn
    sns.set(style="whitegrid")  # Set the style
    plt.figure(figsize=(10, 6))  # Set the figure size
    sns.lineplot(x=range(len(index_to_distance)), y=index_to_distance)  # Plot the array with index as x-axis
    plt.xlabel("Index")  # Set the x-axis label
    plt.ylabel("Distance")  # Set the y-axis label
    plt.title("Distance from Target Over Film")  # Set the title
    plt.show()  # Show the plot

def display_surrounding_frames(target_idx, frame_range=5):
    # Clamp the start at 0 so a negative index does not wrap around to the end of the list
    idx_begin = max(0, target_idx - frame_range)
    idx_end = target_idx + frame_range + 1

    display_frames = movie_embeddings[idx_begin:idx_end]
    for emb in display_frames:
        print(f'frame {emb["input"]}')
        image_path = 'thumbnails_folder2large/' + emb["input"]
        display(Image(filename=image_path))


#### Exploring the Dog
Display the images

In [None]:
display_surrounding_frames(dog_idx)

In [None]:
plot_euclidean_distance_from(dog_idx)

In [None]:
# Using the following to explore 500, 2100 and 3400
#display_surrounding_frames(3400, frame_range=40)
# Discovered the dog in
# thumbnail_0514.jpg
# thumbnail_2078.jpg
# thumbnail_2131.jpg
# thumbnail_3427.jpg
# Indexes are 0-based, so thumbnail_0514.jpg is at index 513
display_surrounding_frames(513, frame_range=2)

The dog is watching the father's scientific experiment.

In [None]:
display_surrounding_frames(2077, frame_range=2)

The dog seems to notice something outside.

In [None]:
display_surrounding_frames(2130, frame_range=2)

The parents are distracted, and the dog wants to investigate what is going on outside, once again exhibiting a higher sense of awareness than the humans.

In [None]:
display_surrounding_frames(3426, frame_range=2)

Here the dog is spinning the father around and disrupting his search for the children, which is an example of a comedic scene.

#### Exploring the Lawnmower
Display the images

In [None]:
display_surrounding_frames(mower_idx)

In [None]:
plot_euclidean_distance_from(mower_idx)

In [None]:
# Display around 790
#display_surrounding_frames(790, frame_range=40)
display_surrounding_frames(790, frame_range=5)

#### Exploring the Ant
Display the images

In [None]:
display_surrounding_frames(ant_idx)

In [None]:
plot_euclidean_distance_from(ant_idx)

In [None]:
# Display around 2180
#display_surrounding_frames(2180, frame_range=40)
display_surrounding_frames(2180, frame_range=1)

In [None]:
# Display around 4400
#display_surrounding_frames(4400, frame_range=40)
display_surrounding_frames(4435, frame_range=1)

The scene around 4435 shows an ant fighting a scorpion and appearing to save the kids.

Use CLIP to ask about 'a photo of an ant fighting a scorpion' to see what it returns.

In [None]:
exploration_query = ['a photo of an ant fighting a scorpion'] 
find_and_display_matches(exploration_query, top_k=5)

## Hypothesis 2
### - Scene Consistency and Transition with dimensionality reduction and clustering -


### Cluster Analysis

We will compare clustering with t-SNE (t-Distributed Stochastic Neighbor Embedding) and PCA (Principal Component Analysis) dimensionality reduction algorithms.

#### t-SNE

In [None]:
# Using t-SNE to embed the vectors into 2D
tsne = TSNE(n_components=2, random_state=42)
tSNE_embedded_vectors = tsne.fit_transform(g_only_embeddings)


#### PCA

In [None]:
# Using PCA to embed the vectors into 2D
pca = PCA(n_components=2)
PCA_embedded_vectors = pca.fit_transform(g_only_embeddings)

#### Cluster t-SNE and PCA with K-Means and display Silhouette Score

**Silhouette Score**: Measures how similar an object is to its own cluster compared to other clusters. The score ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
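To make the definition concrete, here is a minimal pure-Python sketch of the per-point silhouette value s(i) = (b - a) / max(a, b), where a is the mean distance from point i to the other members of its own cluster and b is the mean distance to the nearest other cluster. The function name `silhouette_values` and the toy points are illustrative assumptions and are not part of the notebook; scikit-learn's `silhouette_score` reports the mean of this value over all samples.

```python
import math

def silhouette_values(points, labels):
    """Return s(i) = (b - a) / max(a, b) for every point (needs >= 2 clusters)."""
    values = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Collect distances from point i to every other point, grouped by cluster.
        by_cluster = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(lab, []).append(math.dist(p, q))
        if own not in by_cluster:
            # A cluster with a single member gets s(i) = 0 by convention.
            values.append(0.0)
            continue
        a = sum(by_cluster[own]) / len(by_cluster[own])  # mean intra-cluster distance
        b = min(sum(d) / len(d) for lab, d in by_cluster.items() if lab != own)
        values.append((b - a) / max(a, b))
    return values

# Toy data (hypothetical): two well-separated groups of 2D points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
scores = silhouette_values(points, labels)
mean_score = sum(scores) / len(scores)
print(round(mean_score, 3))
```

With the labels matching the two groups, the mean value is close to 1; swapping the labels so that each cluster spans both groups drives it below 0, which is the behavior the score is meant to expose.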

Let's find the best number of clusters for t-SNE and PCA.

In [None]:
X = tSNE_embedded_vectors

# Range of clusters to try
num_clusters = range(2, 20)

# List to hold silhouette scores
sil_scores = []

# Loop over number of clusters
for k in num_clusters:
    # Perform clustering
    kmeans = KMeans(n_init="auto", n_clusters=k, random_state=42).fit(X)
    
    # Get cluster labels
    labels = kmeans.labels_
    
    # Compute silhouette score and append to list
    sil_score = silhouette_score(X, labels)
    sil_scores.append(sil_score)

# Plot silhouette scores
plt.plot(num_clusters, sil_scores, 'bx-')
plt.title('t-SNE')
plt.xlabel('k (number of clusters)')
plt.ylabel('Silhouette Score')
plt.show()



X = PCA_embedded_vectors

# Range of clusters to try
num_clusters = range(2, 20)

# List to hold silhouette scores
sil_scores = []

# Loop over number of clusters
for k in num_clusters:
    # Perform clustering
    kmeans = KMeans(n_init="auto", n_clusters=k, random_state=42).fit(X)
    
    # Get cluster labels
    labels = kmeans.labels_
    
    # Compute silhouette score and append to list
    sil_score = silhouette_score(X, labels)
    sil_scores.append(sil_score)

# Plot silhouette scores
plt.plot(num_clusters, sil_scores, 'bx-')
plt.title('PCA')
plt.xlabel('k (number of clusters)')
plt.ylabel('Silhouette Score')
plt.show()

Let's cluster the t-SNE and PCA projections, using for each the k with the best silhouette score.

In [None]:
# Performing KMeans clustering with the k that gave the best silhouette score.
kmeans = KMeans(n_init="auto", n_clusters=19, random_state=42)
tSNE_clusters = kmeans.fit_predict(tSNE_embedded_vectors)

kmeans = KMeans(n_init="auto", n_clusters=4, random_state=42)
PCA_clusters = kmeans.fit_predict(PCA_embedded_vectors)


# Extracting numbers from file names for labels
labels = [re.search(r'\d+', vector['input']).group() for vector in g_movie_embeddings]

#t-SNE
# Plotting the embedded vectors with cluster coloring
sns.set_theme()
plt.figure(figsize=(12, 8))  # Adjust the figure size as needed
sns.scatterplot(x=tSNE_embedded_vectors[:, 0], y=tSNE_embedded_vectors[:, 1], hue=tSNE_clusters, palette='bright', legend='full', s=100)
for i, vec in enumerate(tSNE_embedded_vectors):
    plt.text(vec[0] + 0.02, vec[1] + 0.02, labels[i], fontsize=6)  # Adding labels
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.title('t-SNE Embedded Vectors with KMeans Clustering (k=19)')
plt.legend(title='Cluster')
plt.show()

#PCA
# Plotting the embedded vectors with cluster coloring
sns.set_theme()
plt.figure(figsize=(12, 8))  # Adjust the figure size as needed
sns.scatterplot(x=PCA_embedded_vectors[:, 0], y=PCA_embedded_vectors[:, 1], hue=PCA_clusters, palette='bright', legend='full', s=100)
for i, vec in enumerate(PCA_embedded_vectors):
    plt.text(vec[0] + 0.02, vec[1] + 0.02, labels[i], fontsize=6)  # Adding labels
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.title('PCA Embedded Vectors with KMeans Clustering (k=4)')
plt.legend(title='Cluster')
plt.show()


#### Let's sample images in t-SNE clusters

In [None]:
unique_clusters = set(tSNE_clusters)

for cluster in unique_clusters:
    display_cluster_images_first_last_x(tSNE_clusters, cluster, 10, 10)

#### Let's sample images in PCA clusters

In [None]:
unique_clusters = set(PCA_clusters)

for cluster in unique_clusters:
    display_cluster_images_first_last_x(PCA_clusters, cluster, 10, 10)

#### t-SNE and PCA with K-means clusters over a timeline
By looking at these plots, we can see if frames from the same cluster tend to occur close together in time, which might indicate that the clustering is capturing some meaningful structure in the movie. For example, all the frames from a particular scene might be grouped into the same cluster.
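The visual check described above can also be summarized with two simple numbers. The sketch below is an illustrative assumption and not part of the original analysis: `cluster_runs` collapses a time-ordered label sequence into runs, and `adjacent_agreement` is the fraction of consecutive frame pairs that share a label. The example labels are made up.

```python
from itertools import groupby

def cluster_runs(cluster_labels):
    """Collapse a time-ordered label sequence into (label, run_length) pairs."""
    return [(label, sum(1 for _ in group)) for label, group in groupby(cluster_labels)]

def adjacent_agreement(cluster_labels):
    """Fraction of consecutive frame pairs that share a cluster label."""
    pairs = list(zip(cluster_labels, cluster_labels[1:]))
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)

# Hypothetical labels for ten consecutive frames (not taken from the movie).
example_labels = [3, 3, 3, 3, 7, 7, 7, 3, 3, 1]
runs = cluster_runs(example_labels)
agreement = adjacent_agreement(example_labels)
print(runs)
print(round(agreement, 2))
```

Long runs and a high agreement value would suggest that clusters follow scenes. Because the credit frames were filtered out earlier, neighboring entries in the label list are not always exactly one thumbnail apart, so the numbers are only a rough guide.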

In [None]:
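# Convert thumbnail numbers to timestamps.
# Assumption: thumbnails were sampled once per second (as the embeddings
# file name suggests), so the number in each file name is already in seconds.
seconds_per_thumbnail = 1
timestamps = [int(label) * seconds_per_thumbnail for label in labels]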

# Create a timeline plot for the t-SNE clusters
plt.figure(figsize=(12, 6))
plt.scatter(timestamps, tSNE_clusters, c=tSNE_clusters, cmap='viridis')
plt.xlabel('Time')
plt.ylabel('Cluster')
plt.title('t-SNE Clusters Over Time')
plt.colorbar(label='Cluster')
plt.show()

# Create a timeline plot for the PCA clusters
plt.figure(figsize=(12, 6))
plt.scatter(timestamps, PCA_clusters, c=PCA_clusters, cmap='viridis')
plt.xlabel('Time')
plt.ylabel('Cluster')
plt.title('PCA Clusters Over Time')
plt.colorbar(label='Cluster')
plt.show()

## Hypothesis 3

### Find a model that can identify objects from movie frame images
#### Utilizing the YOLOv5 model for recognizing and counting objects

In [None]:
import os
from collections import Counter

import torch
# PIL's Image is not imported here: it would shadow IPython.display.Image,
# which the display helpers defined earlier rely on.


model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)


image_directory = "thumbnails_folder2large/"

# List the first 100 images in the image directory.
# os.listdir returns entries in arbitrary order, so sort by name first.
image_files = sorted(
    f
    for f in os.listdir(image_directory)
    if os.path.isfile(os.path.join(image_directory, f))
)
image_files = image_files[:100]

# Initialize a Counter to keep track of object counts
total_counts = Counter()


for image_file in image_files:

    image_path = os.path.join(image_directory, image_file)

    results = model(image_path)

    print(f"Results for {image_file}:")
    df = results.pandas().xyxy[0]  # Results as DataFrame
    print(df)

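    # Update the total counts of objects.
    # Counter.update needs a mapping of name -> count; a pandas Series is not
    # a Mapping, so passing it directly would tally the count values instead.
    object_counts = df["name"].value_counts()
    total_counts.update(object_counts.to_dict())

    results.show()
    results.save(save_dir="output/")

    print("Object counts:", object_counts)
    print("\n")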


In [None]:
print("Total counts of detected objects:")
for object_type, count in total_counts.items():
    print(f"{object_type}: {count}")

---
## SUMMARY

### Film Description
#### Honey, I Shrunk the Kids
1989 PG 1h 33m

From the IMDB website:

"The scientist father of a teenage girl and boy accidentally shrinks his and two other neighborhood teens to the size of insects. Now the teens must fight diminutive dangers as the father searches for them."

IMDB website. (n.d.). imdb.com. Retrieved April 27, 2024, from https://www.imdb.com/title/tt0097523/

In 'Honey, I Shrunk the Kids,' an eccentric inventor, Wayne Szalinski, accidentally shrinks his and his neighbor's children with his experimental shrink ray. The miniature kids must navigate a perilous journey across their now-gigantic backyard, encountering obstacles like insects and sprinklers, as they try to return home.

The film is notable for its creative visual effects that magnify ordinary environments into epic landscapes. It's a blend of adventure, humor, and family dynamics, ultimately showcasing the children's resourcefulness and the parents' determination to rescue their kids. The movie was a commercial success and spawned a franchise including sequels and a television series.

### Methods Summary

#### For hypothesis 1 we used the following methods

1. Explore CLIP through a natural-language query for objects and investigate the surrounding frames.
2. Look at the Euclidean distance of similar frames.
3. Explore similar frames for insights.
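In the notebook, step 2 was done by reading candidate indices off the distance plot by eye. A minimal sketch of how the same candidates could be located programmatically is shown below; the helper `low_distance_runs`, its `threshold` and `max_gap` parameters, and the toy values are all hypothetical and were not used in the analysis.

```python
def low_distance_runs(distances, threshold, max_gap=1):
    """Group indices whose distance is below `threshold` into (start, end) runs.

    Below-threshold indices at most `max_gap` positions apart are merged.
    """
    runs = []
    for idx, dist in enumerate(distances):
        if dist >= threshold:
            continue
        if runs and idx - runs[-1][1] <= max_gap:
            runs[-1][1] = idx        # extend the current run
        else:
            runs.append([idx, idx])  # start a new run
    return [tuple(run) for run in runs]

# Hypothetical distance curve: small values mark frames that resemble the target.
toy_distances = [9.0, 8.5, 3.0, 2.5, 9.1, 9.3, 2.8, 9.0, 9.2, 0.0, 1.5]
print(low_distance_runs(toy_distances, threshold=4.0))
```

Each returned run is a stretch of frames that could then be inspected with `display_surrounding_frames`. The target frame itself always appears in one run because its distance to itself is 0.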

### Hunches and Hypotheses

**Hypothesis 1** 
-   Analyzing Narrative Impact: The Role of Key Characters and Objects in Film as Identified by CLIP
-   **Rationale:** In movies, the appearance of certain characters or objects is significantly associated with specific narrative effects such as humor or drama. By using the CLIP model to detect these elements in selected frames and analyzing the content of approximately five frames before and after their appearance, we can identify patterns that support or refute their role in contributing to these narrative effects.

**Hypothesis 2** 
-   Scene Consistency and Transition: Frames that are visually and thematically similar cluster together tightly in t-SNE and PCA visualizations, and distinct clusters correspond to different scenes or settings in the movie.
-  **Rationale:** This hypothesis tests the ability of CLIP embeddings, which capture both visual and semantic content, to differentiate between distinct scenes based on their visual content and thematic elements.

**Hypothesis 3**
-   Quantitative Analysis of Object Distribution in Images Using Deep Learning Models 
-   **Rationale:** The use of advanced image recognition models enables accurate identification and quantification of different object types within images, facilitating detailed analysis of object distribution patterns across varied datasets.


### Results and Interpretation

#### Hypothesis 1:
- **Objective:** Identify and analyze the top images of a dog, a lawnmower, and an ant using CLIP queries, and explore the surrounding frames to determine their narrative impact.

**Top Matches from CLIP:**
1. **Dog:** 'a photo of a white and brown dog' - **thumbnail_5039.jpg**
2. **Lawnmower:** 'a photo of a lawnmower' - **thumbnail_4734.jpg**
3. **Ant:** 'a photo of an ant' - **thumbnail_3231.jpg**
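These top matches come from ranking every frame embedding by its cosine similarity to the text embedding of the query. The sketch below illustrates that ranking step in plain Python on made-up 3-dimensional vectors; the real CLIP embeddings are much higher-dimensional, and `top_k_matches` is an illustrative name, not a function from the notebook. Vectors are assumed to be non-zero.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_matches(query, frame_embeddings, k=1):
    """Return the indices of the k frames most similar to the query."""
    scored = [(cosine_similarity(query, emb), idx) for idx, emb in enumerate(frame_embeddings)]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]

# Made-up embeddings standing in for four frames and one text query.
frames = [(1.0, 0.0, 0.0), (0.7, 0.7, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 0.1)]
query = (0.0, 2.0, 0.0)
print(top_k_matches(query, frames, k=2))
```

Cosine similarity ignores vector length, which is why the query with magnitude 2 still matches the unit-scale frame pointing in nearly the same direction.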

**Analysis of Key Frames:**
- **Dog Scene Analysis:**
  From the sequence of frames, the storyline unfolds with children running, parents talking, and the dog focusing upwards, climaxing with a dramatic moment where the father nearly harms his accidentally shrunken son. The dog's central position and alert demeanor suggest an awareness surpassing the humans, a trope often used to enhance dramatic tension in films.
  
- **Lawnmower Scene Analysis:**
  A neighborhood child remotely controls the lawnmower amidst children playing on the grass, creating a direct threat in this action-packed scene. The foreshadowing of this event is noted at index 790, where the lawnmower and its controller are introduced, demonstrating the effective use of foreshadowing discovered through our frame-by-frame analysis.

- **Ant Scene Analysis:**
  Initially a threat, the ant later interacts with a scorpion, shifting its role from antagonist to protector. This transformation not only adds complexity to the character but also alters its narrative impact, from eliciting fear to evoking sympathy among the audience.

#### Hypothesis 2:
- **Objective:** Utilize t-SNE and K-means clustering to analyze and categorize movie frames by scene content and type after removing noise from non-relevant frames like opening and closing credits.

**Findings:**
  The CLIP embeddings differentiated between scene types (e.g., indoor vs. outdoor, calm vs. action-packed), which helped segment the movie by visual content. The clusters lined up with the movie's timeline and scene transitions, and t-SNE gave better-separated clusters than PCA that matched the frame content more closely.

#### Hypothesis 3:
- **Objective:** Analyze the first 100 images in the thumbnail folder using the YOLOv5 model to identify and count types of objects, examining the effectiveness and efficiency of object detection in a controlled set.

**Results:**
  The YOLOv5 model identified objects within the images, though processing was slow. The detections were visualized with bounding boxes and returned as a DataFrame per image, from which we tallied the count of each detected object type. This gave a quantitative view of the object distribution within the sampled frames.
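The tallying step can be sketched independently of the detector. The example below assumes that each image yields a list of detected class names (for instance from the `name` column of the per-image DataFrame); the function `aggregate_detections` and the sample detections are hypothetical.

```python
from collections import Counter

def aggregate_detections(per_image_labels):
    """Sum detected class names over all images."""
    totals = Counter()
    for labels in per_image_labels:
        # Counter(labels) maps name -> count for one image, so update()
        # adds those counts to the running totals.
        totals.update(Counter(labels))
    return totals

# Hypothetical detections for three frames; the second frame has none.
detections = [["person", "person", "dog"], [], ["person", "chair"]]
totals = aggregate_detections(detections)
print(totals.most_common())
```

Keeping the totals in a `Counter` makes it easy to list the most frequent object types with `most_common`.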

### Reflection

**Process Review:**
Our exploration of CLIP highlighted its ability to interpret natural-language queries and direct our analytical focus, while integrating YOLOv5 and clustering algorithms offered a multi-faceted view of the film's visual narrative.

**Challenges and Limitations:**
The slow processing speed of the image analysis with YOLOv5 was a notable limitation, which is why we restricted it to the first 100 images. Optimizing this aspect would be a primary goal for future work, alongside exploring additional AI tools for more nuanced insights.

**Future Directions:**
Given additional time, we would expand our analysis to include a larger dataset of films to validate and refine our findings further. Employing more advanced models could also uncover deeper insights into narrative structures and character roles within broader cinematic contexts.
