# Exercise 1. Clustering Actors in the Actor-Genre Matrix
This exercise builds on prior exercises, where we extracted a data matrix of actors and the genres in which they have starred. We will now use k-means to extract a set of actor clusters from that matrix.

- Download this CSV file from GitHub (also available on Canvas in Exercises/data).
- Use the CSV file to create a data frame, where each row corresponds to an actor, each column represents a genre, and each cell captures how many times that row's actor has appeared in that column's genre.
- Using this data frame as your data matrix, use the k-means implementation in scikit-learn, sklearn.cluster.KMeans, to extract k=8 clusters from the data.
  - Initialize the model using KMeans(n_clusters=8).
  - Fit this model to the data and get cluster predictions back using model.fit_predict().
- Print the number of actors per cluster you have found.
- For each cluster, print a random sample of 5 actors in that cluster to get a sense of who is in each cluster.
  - One easy way to do this sampling is to use Pandas to create a data frame with the predicted cluster labels as a column and the actor IDs from your original actor x genre matrix as the index. Then you can use df.groupby("cluster") to get sub-data frames for each cluster.

In [1]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans


df = pd.read_csv('imdb_movies_2000to2022.actorXgenre.csv')

# keep the actor IDs for the index and build the data matrix from the genre columns
actor_ids = df['actor_id']
genre_columns = df.columns[1:]  # all columns except actor_id
data_matrix = df[genre_columns].values


# k-means clustering on data
kmeans_raw = KMeans(n_clusters=8, random_state=50)
cluster_labels_raw = kmeans_raw.fit_predict(data_matrix)

# creating DataFrame with cluster labels
clusters_df_raw = pd.DataFrame({
    'actor_id': actor_ids,
    'cluster': cluster_labels_raw
}).set_index('actor_id')

# number of actors per cluster
print("\nNumber of actors per cluster")
print(clusters_df_raw['cluster'].value_counts().sort_index())

# random samples from each cluster
print("\nRandom samples from each cluster")

for k_clusters in range(kmeans_raw.n_clusters):
    cluster_actors = clusters_df_raw[clusters_df_raw['cluster'] == k_clusters]
    sample_actors = cluster_actors.sample(n=min(5, len(cluster_actors)), random_state=50)
    
    print(f"\nCluster {k_clusters} (actors: {len(cluster_actors)}):")
    for actor_id in sample_actors.index:
        print(f"  {actor_id}")


Number of actors per cluster
cluster
0     2261
1     5570
2      903
3      327
4    24027
5      129
6      322
7       70
Name: count, dtype: int64

Random samples from each cluster

Cluster 0 (actors: 2261):
  nm0068042
  nm0058983
  nm2118666
  nm1298052
  nm0005311

Cluster 1 (actors: 5570):
  nm0532678
  nm0819908
  nm4596840
  nm0428656
  nm0004990

Cluster 2 (actors: 903):
  nm1953187
  nm0671032
  nm0001334
  nm0914612
  nm0001497

Cluster 3 (actors: 327):
  nm0000169
  nm0913488
  nm0000194
  nm0000849
  nm0719637

Cluster 4 (actors: 24027):
  nm7565175
  nm2893329
  nm4939975
  nm0113536
  nm8400963

Cluster 5 (actors: 129):
  nm1389064
  nm0612691
  nm0629697
  nm0410622
  nm0000105

Cluster 6 (actors: 322):
  nm0000950
  nm0005098
  nm0001378
  nm0005049
  nm0000867

Cluster 7 (actors: 70):
  nm0001194
  nm0000241
  nm0000174
  nm0000134
  nm0001595
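
The exercise text suggests sampling with df.groupby("cluster"), while the cell above filters by label instead. As a minimal sketch of the groupby variant, the block below uses a small made-up data frame in place of clusters_df_raw; the helper name sample_per_cluster and the toy actor IDs are hypothetical and exist only for illustration.

```python
import pandas as pd

# Toy stand-in for clusters_df_raw: actor IDs as the index, one 'cluster' column.
# These IDs and labels are invented for illustration, not taken from the data set.
toy = pd.DataFrame(
    {"cluster": [0, 0, 0, 1, 1, 2]},
    index=pd.Index(["a1", "a2", "a3", "a4", "a5", "a6"], name="actor_id"),
)


def sample_per_cluster(clusters_df, n=5, seed=50):
    """Return {cluster label: list of up to n randomly sampled actor IDs}."""
    samples = {}
    # groupby yields one (label, sub-data-frame) pair per cluster
    for label, group in clusters_df.groupby("cluster"):
        # never ask for more rows than the cluster contains
        picked = group.sample(n=min(n, len(group)), random_state=seed)
        samples[label] = picked.index.tolist()
    return samples


samples = sample_per_cluster(toy, n=2)
for label, actors in samples.items():
    print(f"Cluster {label}: {actors}")
```

Because groupby only iterates over labels that actually occur, this version does not need to know the number of clusters in advance.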


# Exercise 2 (Optional). Clustering Actors in the Normalized Actor-Genre Matrix
- Building on your results from Exercise 1, normalize your actor-genre matrix by applying row-level L1 normalization (i.e., each row should sum to 1.0).
- Use scikit-learn's sklearn.cluster.KMeans to re-run clustering, setting n_clusters=8 to extract k=8 clusters from the normalized dataset.
- As before, for each cluster, print a random sample of 5 actors in that cluster to get a sense of who is in each cluster. Do these clusters appear to be the same as or different from the results you got in Exercise 1?
- For each cluster in this exercise, calculate the Jaccard similarity with each cluster from Exercise 1. Can you find corresponding clusters across the two exercises?
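
One possible way to approach these steps is sketched below. It is not a reference solution: it runs on a synthetic count matrix rather than the real actor x genre data, the helper jaccard_matrix is a hypothetical name, and it assumes every row has at least one non-zero count so that L1 normalization yields rows summing to 1.0. The Jaccard similarity between two clusters is taken here as the number of actors they share divided by the number of actors in either.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize


def jaccard_matrix(labels_a, labels_b, k):
    """Entry [i, j] is |A_i & B_j| / |A_i | B_j| for cluster i of A and cluster j of B."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    sims = np.zeros((k, k))
    for i in range(k):
        in_a = labels_a == i
        for j in range(k):
            in_b = labels_b == j
            union = np.logical_or(in_a, in_b).sum()
            if union > 0:  # leave 0.0 when both clusters are empty
                sims[i, j] = np.logical_and(in_a, in_b).sum() / union
    return sims


# Synthetic stand-in for the actor x genre count matrix (200 "actors", 6 "genres").
rng = np.random.default_rng(0)
counts = rng.integers(0, 20, size=(200, 6)).astype(float)
counts[:, 0] += 1  # assumption: no all-zero rows, so every row can sum to 1

# Row-level L1 normalization: each row is divided by the sum of its absolute values.
normalized = normalize(counts, norm="l1", axis=1)

k = 8
labels_raw = KMeans(n_clusters=k, n_init=10, random_state=50).fit_predict(counts)
labels_norm = KMeans(n_clusters=k, n_init=10, random_state=50).fit_predict(normalized)

# Rows: clusters from the normalized run; columns: clusters from the raw run.
sims = jaccard_matrix(labels_norm, labels_raw, k)
best_match = sims.argmax(axis=1)
for i in range(k):
    print(f"normalized cluster {i} -> raw cluster {best_match[i]} "
          f"(Jaccard {sims[i, best_match[i]]:.2f})")
```

Cluster label numbers are arbitrary in k-means, so comparing label 3 in one run with label 3 in the other is not meaningful; taking the best-scoring column per row is one simple way to look for corresponding clusters.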