# Exercise 1. Normalizing Actor-Genre Matrix
This exercise builds on last week's but asks the student to apply L1 normalization to the rows of the matrix.

1. Download imdb_movies_2000to2022.prolific.json from GitHub (also available on Canvas in Exercises/data).
2. Create a data frame where each row corresponds to an actor, each column represents a genre, and each cell captures how many times that row’s actor has appeared in that column’s genre.
3. Using this data frame as your “feature matrix”, apply L1 normalization to every row. That is, calculate the sum of each row and divide every element of that row by this sum. Store this normalized matrix as a new object. If you apply `sum(axis=1)` to the normalized matrix, every actor should now have a value of 1.
4. Using this L1-normalized data frame as your new feature matrix, select an actor (called your “query”) for whom you want to find the top 10 most similar actors based on the genres in which they’ve starred.
   - As an example, select the row from your data frame associated with Chris Hemsworth, actor ID “nm1165110”, as your “query” actor.
5. Use `sklearn.metrics.DistanceMetric` to calculate the Euclidean distances between your query actor and all other actors based on this normalized matrix of genre appearances.
6. Print a list of the top ten actors most similar to your query actor using Euclidean distance.
7. Describe how this list has changed compared to the list produced with cosine similarity.
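
As a quick sanity check before running on the full dataset, the row-normalization and distance steps can be sketched on a tiny made-up matrix. The actor labels and counts below are hypothetical and chosen only for illustration; the sketch uses plain NumPy for the distance rather than `DistanceMetric`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy counts: rows are actors, columns are genres.
toy_counts = pd.DataFrame(
    {"Action": [8, 4, 1], "Comedy": [2, 1, 9]},
    index=["actor_a", "actor_b", "actor_c"],
)

# L1 normalization: divide each row by its own sum, so every row sums to 1.
toy_l1 = toy_counts.div(toy_counts.sum(axis=1), axis=0)

# Euclidean distance from the query row (actor_a) to every row.
query_row = toy_l1.loc["actor_a"].to_numpy()
toy_distances = pd.Series(
    np.linalg.norm(toy_l1.to_numpy() - query_row, axis=1),
    index=toy_l1.index,
)

print(toy_l1)
print(toy_distances.drop("actor_a").sort_values())
```

Here `actor_a` and `actor_b` have different raw counts (10 versus 5 appearances) but the same genre proportions, so their distance after L1 normalization is zero, while `actor_c`, who appears mostly in comedies, ends up far away.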

In [None]:
import json
import pandas as pd
import numpy as np
from sklearn.metrics import DistanceMetric

with open('imdb_movies_2000to2022.prolific.json', 'r') as file_connection:
    movies = [json.loads(line) for line in file_connection]

# creating actor name mapping
actor_name_map = {}
for movie in movies:
    actors = movie.get('actors', [])
    for actor_id, actor_name in actors:
        if actor_id not in actor_name_map:
            actor_name_map[actor_id] = actor_name

# feature matrix
actor_genre_data = {}
for movie in movies:
    genres = movie.get('genres', [])
    actors = [actor[0] for actor in movie.get('actors', [])]
    
    for actor_id in actors:
        if actor_id not in actor_genre_data:
            actor_genre_data[actor_id] = {}
        for genre in genres:
            actor_genre_data[actor_id][genre] = actor_genre_data[actor_id].get(genre, 0) + 1

actor_genre_df = pd.DataFrame.from_dict(actor_genre_data, orient='index').fillna(0)

# applying L1 normalization
actor_genre_l1 = actor_genre_df.div(actor_genre_df.sum(axis=1), axis=0)
print("Row sums after L1 normalization (=1):")
print(actor_genre_l1.sum(axis=1).head())

# finding similar actors to Chris Hemsworth
query_actor_id = "nm1165110"

if query_actor_id in actor_genre_l1.index:

    # Euclidean distance from the query actor's row to every actor's row
    # (avoids building the full actor-by-actor distance matrix)
    dist = DistanceMetric.get_metric('euclidean')
    query_vector = actor_genre_l1.loc[[query_actor_id]].values
    query_distances = dist.pairwise(query_vector, actor_genre_l1.values)[0]
    
    # creating results and removing query actor itself (Chris Hemsworth)
    results = pd.DataFrame({
        'actor_id': actor_genre_l1.index,
        'actor_name': [actor_name_map.get(actor_id, "name") for actor_id in actor_genre_l1.index],
        'euclidean_distance': query_distances
    })
    results = results[results['actor_id'] != query_actor_id]
    
    # top 10 most similar actors
    top_10 = results.nsmallest(10, 'euclidean_distance')
    
    print(f"\nTop 10 actors most similar to Chris Hemsworth ({query_actor_id}):")
    for _, row in top_10.iterrows():
        print(f"{row['actor_id']} - {row['actor_name']} <-> {row['euclidean_distance']:.5f}")
        
else:
    print(f"Chris Hemsworth ({query_actor_id}) not found in dataset")

# comparison explained
print("\nComparison to Cosine Similarity:")
print("With L1 normalization, Euclidean distance becomes more similar to Cosine similarity")
print("since all vectors are on the same scale. Euclidean distance is still sensitive to")
print("small differences in distribution proportions, while Cosine distance focuses only on the")
print("direction (angle) between vectors and not the magnitude.")

Row sums after L1 normalization (=1):
nm0000212    1.0
nm0413168    1.0
nm0000630    1.0
nm0005227    1.0
nm0864851    1.0
dtype: float64

Top 10 actors most similar to Chris Hemsworth (nm1165110):
nm0000129 - Tom Cruise <-> 0.09790
nm0829032 - Ray Stevenson <-> 0.12496
nm0147147 - Henry Cavill <-> 0.12797
nm5899377 - Tiger Shroff <-> 0.13449
nm0003244 - Jordi Mollà <-> 0.13473
nm1679372 - Sudeep <-> 0.13597
nm2018237 - Taylor Kitsch <-> 0.14047
nm4043618 - Tom Holland <-> 0.14238
nm2207222 - Scott Eastwood <-> 0.14953
nm5744729 - Vaani Kapoor <-> 0.15437

Comparison to Cosine Similarity:
With L1 normalization, Euclidean distance becomes more similar to Cosine similarity
since all vectors are on the same scale. Euclidean distance is still sensitive to
small differences in distribution proportions, while Cosine distance focuses only on the
direction (angle) between vectors and not the magnitude.
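
One way to make this comparison concrete is to check it numerically on random data. The sketch below uses hypothetical genre counts (not the IMDB data) and relies on two standard facts: cosine distance does not change when a row is rescaled, and for L2-normalized rows the squared Euclidean distance equals twice the cosine distance. L1-normalized rows have no such exact link to cosine distance, so their Euclidean ranking can differ from the cosine ranking.

```python
import numpy as np

# Hypothetical genre counts for 5 actors and 4 genres (all strictly positive).
rng = np.random.default_rng(0)
counts = rng.integers(1, 10, size=(5, 4)).astype(float)

def cosine_distance(u, v):
    """1 minus the cosine of the angle between u and v."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Row-wise L1 and L2 normalization.
l1_rows = counts / counts.sum(axis=1, keepdims=True)
l2_rows = counts / np.linalg.norm(counts, axis=1, keepdims=True)

# Distances from row 0 (the "query") to every other row.
cos_raw = np.array([cosine_distance(counts[0], row) for row in counts[1:]])
cos_l1 = np.array([cosine_distance(l1_rows[0], row) for row in l1_rows[1:]])
euc_l1 = np.linalg.norm(l1_rows[1:] - l1_rows[0], axis=1)
euc_l2 = np.linalg.norm(l2_rows[1:] - l2_rows[0], axis=1)

print("cosine distance (raw counts):", cos_raw)
print("cosine distance (L1 rows):   ", cos_l1)   # same values as the raw counts
print("Euclidean distance (L2 rows):", euc_l2)   # squared value equals 2 * cosine distance
print("Euclidean distance (L1 rows):", euc_l1)   # no exact link to cosine distance
```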



# Extra Practice. Adding Movie-Vote Counts to Actor-Genre Matrix
All movies in the IMDB dataset include a `rating` field. For the majority of movies it contains `votes` and `avg` subfields (e.g., `"rating": {"avg": 7.4, "votes": 63}`); for some movies the `rating` field is empty. This exercise augments the actor-genre matrix with the total votes an actor has received across all their movies and asks students to recalculate Euclidean distances including this `votes` total, with and without min/max column normalization. While genre counts are relatively close in scale, the `votes` field has a much wider range, so it can dominate distance metrics.

1. Download imdb_movies_2000to2022.prolific.json from GitHub (also available on Canvas in Exercises/data).
2. Create a data frame where each row corresponds to an actor, each column represents a genre, and each cell captures how many times that row’s actor has appeared in that column’s genre.
3. Apply L1 row-level normalization to this data, so all actors are on the same scale.
4. Select a "query" actor and identify the ten most similar actors using Euclidean distance (the procedure is the same as in Exercise 1).
5. Add a `votes` column to this normalized data frame that contains the total number of votes an actor has received across all their movies in the IMDB dataset.
6. Using that same query actor, recalculate Euclidean distance on this new data frame. How different are your results?
7. Apply min/max normalization to the `votes` column in this data frame, and recalculate Euclidean distance in this normalized matrix.
8. Describe how these metrics have changed with/without the `votes` column and with/without columnar min/max normalization.
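
To see why the scale of `votes` matters, the steps above can be sketched on a small made-up table. All labels and values here are hypothetical, and the helper `distances_from_query` is introduced only for this illustration; it is not part of the exercise solution.

```python
import numpy as np
import pandas as pd

# Hypothetical L1-normalized genre shares plus a raw vote total per actor.
toy = pd.DataFrame(
    {
        "Action": [0.8, 0.7, 0.1, 0.5],
        "Comedy": [0.2, 0.3, 0.9, 0.5],
        "votes": [1_000_000, 800_000, 990_000, 0],
    },
    index=["query", "same_genres", "same_votes", "obscure"],
)

def distances_from_query(frame):
    """Euclidean distance from the 'query' row to every other row."""
    values = frame.to_numpy(dtype=float)
    query_row = frame.loc["query"].to_numpy(dtype=float)
    dists = np.linalg.norm(values - query_row, axis=1)
    return pd.Series(dists, index=frame.index).drop("query")

# Genre columns only, then genres plus raw votes.
genre_only_dist = distances_from_query(toy[["Action", "Comedy"]])
raw_votes_dist = distances_from_query(toy)

# Min/max normalization maps the votes column onto the range [0, 1].
toy_scaled = toy.copy()
vote_col = toy["votes"]
toy_scaled["votes"] = (vote_col - vote_col.min()) / (vote_col.max() - vote_col.min())
scaled_votes_dist = distances_from_query(toy_scaled)

print("nearest (genres only):     ", genre_only_dist.idxmin())
print("nearest (raw votes):       ", raw_votes_dist.idxmin())
print("nearest (min/max votes):   ", scaled_votes_dist.idxmin())
```

With raw vote totals, the nearest neighbor is decided almost entirely by the `votes` column, because differences in the tens of thousands swamp genre differences that are at most 1. After min/max normalization, `votes` is one feature among several on a comparable scale, although it still contributes to the distance.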

In [1]:
import json
import pandas as pd
import numpy as np
from sklearn.metrics import DistanceMetric

# Load data
with open('imdb_movies_2000to2022.prolific.json', 'r') as file_connection:
    movies = [json.loads(line) for line in file_connection]

# Step 1: Create actor-genre matrix with votes
actor_genre_data = {}
actor_votes_data = {}

for movie in movies:
    genres = movie.get('genres', [])
    actors = [actor[0] for actor in movie.get('actors', [])]
    
    # Get votes (handle a missing or empty rating field, or a null vote count)
    rating = movie.get('rating')
    votes = (rating.get('votes') or 0) if isinstance(rating, dict) else 0
    
    for actor_id in actors:
        # Initialize actor in genre data if not present
        if actor_id not in actor_genre_data:
            actor_genre_data[actor_id] = {}
        
        # Initialize actor in votes data if not present
        if actor_id not in actor_votes_data:
            actor_votes_data[actor_id] = 0
        
        # Count genre appearances
        for genre in genres:
            actor_genre_data[actor_id][genre] = actor_genre_data[actor_id].get(genre, 0) + 1
        
        # Accumulate votes
        actor_votes_data[actor_id] += votes

# Create genre DataFrame
actor_genre_df = pd.DataFrame.from_dict(actor_genre_data, orient='index').fillna(0)

# Step 2: Apply L1 normalization to genres
actor_genre_l1 = actor_genre_df.div(actor_genre_df.sum(axis=1), axis=0)

# Step 3: Select query actor and find similar actors (genre-only)
query_actor_id = "nm0000136"  # Johnny Depp

print("=== GENRE-ONLY SIMILARITY ===")
if query_actor_id in actor_genre_l1.index:
    # Distance from the query actor's row to every row
    # (avoids building the full actor-by-actor distance matrix)
    dist = DistanceMetric.get_metric('euclidean')
    query_vector = actor_genre_l1.loc[[query_actor_id]].values
    query_distances = dist.pairwise(query_vector, actor_genre_l1.values)[0]
    
    results_genre_only = pd.DataFrame({
        'actor_id': actor_genre_l1.index,
        'euclidean_distance': query_distances
    })
    results_genre_only = results_genre_only[results_genre_only['actor_id'] != query_actor_id]
    
    top_10_genre_only = results_genre_only.nsmallest(10, 'euclidean_distance')
    print("Top 10 similar actors (genre-only):")
    for i, row in top_10_genre_only.iterrows():
        print(f"{row['actor_id']} - {row['euclidean_distance']:.5f}")

# Step 4: Add votes column to normalized genre data
actor_with_votes = actor_genre_l1.copy()
actor_with_votes['votes'] = pd.Series(actor_votes_data)

print(f"\nVotes statistics:")
print(f"Min votes: {actor_with_votes['votes'].min()}")
print(f"Max votes: {actor_with_votes['votes'].max()}")
print(f"Mean votes: {actor_with_votes['votes'].mean():.0f}")

# Step 5: Recalculate with votes (no normalization)
print("\n=== GENRE + VOTES (NO NORMALIZATION) ===")
if query_actor_id in actor_with_votes.index:
    # Distance from the query actor's row to every row, now including raw votes
    dist = DistanceMetric.get_metric('euclidean')
    query_vector = actor_with_votes.loc[[query_actor_id]].values
    query_distances_votes = dist.pairwise(query_vector, actor_with_votes.values)[0]
    
    results_with_votes = pd.DataFrame({
        'actor_id': actor_with_votes.index,
        'euclidean_distance': query_distances_votes
    })
    results_with_votes = results_with_votes[results_with_votes['actor_id'] != query_actor_id]
    
    top_10_with_votes = results_with_votes.nsmallest(10, 'euclidean_distance')
    print("Top 10 similar actors (genre + votes, no normalization):")
    for i, row in top_10_with_votes.iterrows():
        print(f"{row['actor_id']} - {row['euclidean_distance']:.5f}")

# Step 6: Apply min/max normalization to votes column
actor_normalized_votes = actor_genre_l1.copy()

# Min-max normalize the votes column
actor_votes_series = pd.Series(actor_votes_data)
votes_min = actor_votes_series.min()
votes_max = actor_votes_series.max()

if votes_max > votes_min:  # Avoid division by zero
    normalized_votes = (actor_votes_series - votes_min) / (votes_max - votes_min)
else:
    normalized_votes = pd.Series(0.5, index=actor_votes_series.index)

actor_normalized_votes['votes'] = normalized_votes

print(f"\n=== GENRE + VOTES (WITH MIN/MAX NORMALIZATION) ===")
if query_actor_id in actor_normalized_votes.index:
    # Distance from the query actor's row to every row, with min/max-scaled votes
    dist = DistanceMetric.get_metric('euclidean')
    query_vector = actor_normalized_votes.loc[[query_actor_id]].values
    query_distances_norm = dist.pairwise(query_vector, actor_normalized_votes.values)[0]
    
    results_normalized = pd.DataFrame({
        'actor_id': actor_normalized_votes.index,
        'euclidean_distance': query_distances_norm
    })
    results_normalized = results_normalized[results_normalized['actor_id'] != query_actor_id]
    
    top_10_normalized = results_normalized.nsmallest(10, 'euclidean_distance')
    print("Top 10 similar actors (genre + normalized votes):")
    for i, row in top_10_normalized.iterrows():
        print(f"{row['actor_id']} - {row['euclidean_distance']:.5f}")

# Step 7: Compare results
print("\n=== COMPARISON ANALYSIS ===")
print("1. GENRE-ONLY vs GENRE+VOTES (no normalization):")
print("   - Without normalization, votes dominate due to large scale differences")
print("   - Actors with similar vote counts become 'similar' regardless of genre")
print("   - This often produces meaningless results")

print("\n2. GENRE+VOTES (no normalization) vs GENRE+VOTES (with normalization):")
print("   - Min/max normalization brings votes to 0-1 scale like genres")
print("   - Restores balance between genre similarity and popularity")
print("   - Results become more meaningful and similar to genre-only approach")

print("\n3. Key Insight:")
print("   - Features with different scales need normalization for meaningful distance calculations")
print("   - Without normalization, high-magnitude features overwhelm the distance metric")
print("   - Min/max normalization preserves the relative relationships while fixing scale issues")

=== GENRE-ONLY SIMILARITY ===

