# **PREDICTING RELEVANT ACTORS FOR A GENRE**

<br>

<br>

**1.1. LIBRARY IMPORTING**

In [68]:
import pandas as pd
import ast
from sklearn.preprocessing import MultiLabelBinarizer


**1.2. LOAD DATASETS**

In [69]:
# Load the datasets
movies = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/main/tmdb_5000_movies.csv")
credits = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/main/tmdb_5000_credits.csv")

# Display basic information about the datasets
print("Movies Dataset Columns:")
print(movies.columns)

print("\nCredits Dataset Columns:")
print(credits.columns)


Movies Dataset Columns:
Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count'],
      dtype='object')

Credits Dataset Columns:
Index(['movie_id', 'title', 'cast', 'crew'], dtype='object')


**1.3. MERGE THE DATASETS**

In [70]:
# Merge the datasets on the 'title' column
# (titles are not guaranteed to be unique, so repeated titles can produce extra rows)
merged_data = pd.merge(movies, credits, on="title")

# Retain only relevant columns
columns_to_keep = ['title', 'genres', 'cast']
merged_data = merged_data[columns_to_keep]

# Display merged dataset
print("Merged Dataset:")
print(merged_data.head())

Merged Dataset:
                                      title  \
0                                    Avatar   
1  Pirates of the Caribbean: At World's End   
2                                   Spectre   
3                     The Dark Knight Rises   
4                               John Carter   

                                              genres  \
0  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   
1  [{"id": 12, "name": "Adventure"}, {"id": 14, "...   
2  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   
3  [{"id": 28, "name": "Action"}, {"id": 80, "nam...   
4  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   

                                                cast  
0  [{"cast_id": 242, "character": "Jake Sully", "...  
1  [{"cast_id": 4, "character": "Captain Jack Spa...  
2  [{"cast_id": 1, "character": "James Bond", "cr...  
3  [{"cast_id": 2, "character": "Bruce Wayne / Ba...  
4  [{"cast_id": 5, "character": "John Carter", "c...  


- Merge the datasets to unify the `genres` and `cast` columns for each movie.
- Keep only the columns necessary for analyzing genres and cast, which are critical for the model.
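
The merge above joins on `title`. As a minimal sketch of why the join key matters, the toy frames below (invented rows, not taken from the real files) compare a title-based merge with a merge on the numeric key. It assumes that `id` in the movies file and `movie_id` in the credits file refer to the same identifier, which the column names suggest but the notebook does not verify.

```python
import pandas as pd

# Toy stand-ins for the two datasets; all rows are invented for illustration.
movies_demo = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Batman", "Batman", "Spectre"],
    "genres": ["[]", "[]", "[]"],
})
credits_demo = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Batman", "Batman", "Spectre"],
    "cast": ["[]", "[]", "[]"],
})

# Merging on title pairs every "Batman" row with every other "Batman" row.
by_title = pd.merge(movies_demo, credits_demo, on="title")

# Merging on the numeric key keeps one row per movie;
# validate="one_to_one" raises if either key column has duplicates.
by_id = pd.merge(
    movies_demo,
    credits_demo.drop(columns="title"),
    left_on="id",
    right_on="movie_id",
    validate="one_to_one",
)

print(len(by_title), len(by_id))
```

If the real files contain repeated titles, the title-based merge inflates the row count in the same way, and the inflated rows carry the wrong cast for some movies.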

**1.4. PROCESS GENRES**

In [71]:
# Extract genre names from the JSON-like column
def extract_genres(genres_column):
    try:
        return [genre['name'] for genre in ast.literal_eval(genres_column)]
    except (ValueError, SyntaxError):
        return []

merged_data['genres'] = merged_data['genres'].apply(extract_genres)

# Display processed genres
print("Processed Genres:")
print(merged_data[['title', 'genres']].head())

Processed Genres:
                                      title  \
0                                    Avatar   
1  Pirates of the Caribbean: At World's End   
2                                   Spectre   
3                     The Dark Knight Rises   
4                               John Carter   

                                          genres  
0  [Action, Adventure, Fantasy, Science Fiction]  
1                   [Adventure, Fantasy, Action]  
2                     [Action, Adventure, Crime]  
3               [Action, Crime, Drama, Thriller]  
4           [Action, Adventure, Science Fiction]  


Explanation:

- Parse the `genres` column to extract a list of genre names for each movie.
- This ensures the data is ready for genre-based analysis.
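
One caveat with `ast.literal_eval`: it parses Python literals, not JSON, so a value containing `null`, `true`, or `false` raises `ValueError` and the function above would silently return an empty list. A possible alternative, sketched here with a hypothetical helper name `extract_names`, is to parse with the standard `json` module instead.

```python
import json

def extract_names(raw):
    """Return the 'name' field of each object in a JSON array string."""
    try:
        return [item["name"] for item in json.loads(raw)]
    except (TypeError, ValueError):
        # TypeError: input is not a string (e.g. None or NaN)
        # ValueError: input is not valid JSON (JSONDecodeError subclasses it)
        return []

sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
print(extract_names(sample))
print(extract_names("not json"))
```

The same helper could be passed to `.apply(...)` on the `genres` column in place of `extract_genres`.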

**1.5. PROCESS CAST**

In [72]:
# Extract top 3 actor names from the JSON-like column
def extract_top_actors(cast_column):
    try:
        return [actor['name'] for actor in ast.literal_eval(cast_column)[:3]]  # Top 3 actors
    except (ValueError, SyntaxError):
        return []

merged_data['cast'] = merged_data['cast'].apply(extract_top_actors)

# Display processed cast
print("Processed Cast:")
print(merged_data[['title', 'cast']].head())

Processed Cast:
                                      title  \
0                                    Avatar   
1  Pirates of the Caribbean: At World's End   
2                                   Spectre   
3                     The Dark Knight Rises   
4                               John Carter   

                                               cast  
0  [Sam Worthington, Zoe Saldana, Sigourney Weaver]  
1     [Johnny Depp, Orlando Bloom, Keira Knightley]  
2      [Daniel Craig, Christoph Waltz, Léa Seydoux]  
3      [Christian Bale, Michael Caine, Gary Oldman]  
4    [Taylor Kitsch, Lynn Collins, Samantha Morton]  


- Parse the `cast` column to extract the names of the top 3 actors for each movie.
- These actors are the candidates for determining relevance by genre.
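
Taking the first three entries assumes the cast list is already stored in billing order. As a hedged sketch, the hypothetical helper `top_billed` below sorts by an `order` field before slicing. It assumes each cast entry carries such a field; the truncated output above does not show it, so check the raw data before relying on this.

```python
import json

def top_billed(raw_cast, n=3):
    """Return the names of the n cast members with the lowest 'order' value."""
    try:
        cast = json.loads(raw_cast)
    except (TypeError, ValueError):
        return []
    # Entries without an 'order' key (an assumed field) are sorted last.
    ranked = sorted(cast, key=lambda member: member.get("order", float("inf")))
    return [member["name"] for member in ranked[:n]]

# Invented sample, deliberately stored out of billing order.
sample_cast = json.dumps([
    {"name": "Actor C", "order": 2},
    {"name": "Actor A", "order": 0},
    {"name": "Actor D", "order": 3},
    {"name": "Actor B", "order": 1},
])
print(top_billed(sample_cast))
```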

**FINAL DATASET STRUCTURE**

After processing, the dataset has:

- `title`: Movie title (for reference).
- `genres`: List of genres for the movie.
- `cast`: Top 3 actors' names.

The data is now ready for the next steps, which analyze the relationship between genres and actors.

<br>

# **STEP 2: FEATURE ENGINEERING**

**2.1. ENCODE `genres`**

In [73]:
mlb = MultiLabelBinarizer()
genres_encoded = pd.DataFrame(mlb.fit_transform(merged_data['genres']), columns=mlb.classes_)

# Add encoded genres back to the dataset
merged_data = pd.concat([merged_data, genres_encoded], axis=1)

# Drop the original 'genres' column
merged_data.drop('genres', axis=1, inplace=True)

# Display the dataset with encoded genres
print("Dataset with Encoded Genres:")
print(merged_data.head())


Dataset with Encoded Genres:
                                      title  \
0                                    Avatar   
1  Pirates of the Caribbean: At World's End   
2                                   Spectre   
3                     The Dark Knight Rises   
4                               John Carter   

                                               cast  Action  Adventure  \
0  [Sam Worthington, Zoe Saldana, Sigourney Weaver]       1          1   
1     [Johnny Depp, Orlando Bloom, Keira Knightley]       1          1   
2      [Daniel Craig, Christoph Waltz, Léa Seydoux]       1          1   
3      [Christian Bale, Michael Caine, Gary Oldman]       1          0   
4    [Taylor Kitsch, Lynn Collins, Samantha Morton]       1          1   

   Animation  Comedy  Crime  Documentary  Drama  Family  ...  History  Horror  \
0          0       0      0            0      0       0  ...        0       0   
1          0       0      0            0      0       0  ...        0       0   


- Use `MultiLabelBinarizer` to one-hot encode the `genres` column.
- Each genre becomes a separate binary feature (1 = present, 0 = absent).
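
To make the encoding concrete, here is a small self-contained sketch on three invented genre lists. The names `genre_lists`, `mlb_demo`, and `encoded` are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Three invented movies, including one with no genres at all.
genre_lists = [["Action", "Crime"], ["Drama"], []]

mlb_demo = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb_demo.fit_transform(genre_lists),
    columns=mlb_demo.classes_,  # classes_ holds the sorted unique labels
)
print(encoded)

# inverse_transform maps each binary row back to a tuple of labels.
print(mlb_demo.inverse_transform(encoded.to_numpy()))
```

A movie with an empty genre list becomes an all-zero row, so it stays in the dataset but contributes nothing to any genre.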

**2.2. CREATE ACTOR FEATURES**


In [74]:
# Flatten the cast list into individual actor columns
for i in range(3):
    merged_data[f'actor_{i+1}'] = merged_data['cast'].apply(lambda x: x[i] if len(x) > i else None)

# Drop the original 'cast' column
merged_data.drop('cast', axis=1, inplace=True)



In [75]:
# Display dataset with actor columns
print("Dataset with Actor Features:")
merged_data.head()

Dataset with Actor Features:


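| | title | Action | Adventure | Animation | Comedy | Crime | Documentary | Drama | Family | Fantasy | ... | Mystery | Romance | Science Fiction | TV Movie | Thriller | War | Western | actor_1 | actor_2 | actor_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Avatar | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | Sam Worthington | Zoe Saldana | Sigourney Weaver |
| 1 | Pirates of the Caribbean: At World's End | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Johnny Depp | Orlando Bloom | Keira Knightley |
| 2 | Spectre | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Daniel Craig | Christoph Waltz | Léa Seydoux |
| 3 | The Dark Knight Rises | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | Christian Bale | Michael Caine | Gary Oldman |
| 4 | John Carter | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | Taylor Kitsch | Lynn Collins | Samantha Morton |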


- Split the `cast` list into separate columns (`actor_1`, `actor_2`, `actor_3`); movies with fewer than three listed actors get `None` in the missing positions.
- This makes each actor an explicit column, which simplifies the reshaping in the next step.
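
As an alternative to the per-position `apply` loop, the lists can be expanded in one step by building a DataFrame from them. This is a sketch on invented data; `cast_demo` and `actor_cols` are illustrative names.

```python
import pandas as pd

# Invented cast lists of uneven length, including an empty one.
cast_demo = pd.Series([["A", "B", "C"], ["D"], []])

# Building a DataFrame from ragged lists pads the short rows with missing values.
actor_cols = pd.DataFrame(cast_demo.tolist(), index=cast_demo.index)

# Guarantee exactly three columns even if no row has three actors.
actor_cols = actor_cols.reindex(columns=range(3))
actor_cols.columns = [f"actor_{i + 1}" for i in range(3)]

print(actor_cols)
```

The result could then be joined back with `pd.concat([df, actor_cols], axis=1)`, provided both frames share the same index.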

**2.3. ASSIGN GENRE-SPECIFIC ACTOR RELEVANCE**

In [76]:
# Create a helper DataFrame to track actor relevance by genre
actor_genre_relevance = merged_data.melt(id_vars=mlb.classes_, value_vars=['actor_1', 'actor_2', 'actor_3'], 
                                         var_name='actor_position', value_name='actor')

# Drop rows where 'actor' is NaN
actor_genre_relevance.dropna(subset=['actor'], inplace=True)



In [77]:
# Display the actor-genre mapping
print("ACTOR-GENRE RELEVANCE DataFrame:")
actor_genre_relevance.head()

ACTOR-GENRE RELEVANCE DataFrame:


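| | Action | Adventure | Animation | Comedy | Crime | Documentary | Drama | Family | Fantasy | Foreign | ... | Music | Mystery | Romance | Science Fiction | TV Movie | Thriller | War | Western | actor_position | actor |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | actor_1 | Sam Worthington |
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | actor_1 | Johnny Depp |
| 2 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | actor_1 | Daniel Craig |
| 3 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | actor_1 | Christian Bale |
| 4 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | actor_1 | Taylor Kitsch |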


- Create a mapping between genres and actors.
- Use the `melt` function to transform actor columns into rows, associating each actor with their respective genres.
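
The reshaping is easier to follow on a tiny example. The sketch below applies the same `melt` call to an invented two-movie frame with two genres and two actor slots.

```python
import pandas as pd

# Invented wide frame: one row per movie.
wide = pd.DataFrame({
    "Action": [1, 0],
    "Drama": [0, 1],
    "actor_1": ["A", "C"],
    "actor_2": ["B", None],
})

# One row per (movie, actor slot); the genre flags are repeated on each row.
long = wide.melt(
    id_vars=["Action", "Drama"],
    value_vars=["actor_1", "actor_2"],
    var_name="actor_position",
    value_name="actor",
).dropna(subset=["actor"])

print(long)
```

Two movies with two slots give four rows, and the empty second slot of the second movie is dropped, leaving three.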


**2.4. GENERATE TRAINING DATASET**

In [82]:
# Use only the columns available in actor_genre_relevance
group_columns = ['actor']  # Start with actor as the primary group
if set(mlb.classes_).issubset(actor_genre_relevance.columns):  # Check if genre columns exist
    group_columns += list(mlb.classes_)

# Count how many movies each actor has for each exact combination of genre flags
actor_genre_count = actor_genre_relevance.groupby(group_columns).size().reset_index(name='count')

# Keep only actor/genre-combination rows with at least `threshold` movies
threshold = 2
actor_genre_count = actor_genre_count[actor_genre_count['count'] >= threshold]


In [83]:
# Display the final training dataset
print("Training Dataset:")
actor_genre_count.head()

Training Dataset:


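| | actor | Action | Adventure | Animation | Comedy | Crime | Documentary | Drama | Family | Fantasy | ... | Horror | Music | Mystery | Romance | Science Fiction | TV Movie | Thriller | War | Western | count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | Aamir Khan | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 13 | Aaron Eckhart | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| 14 | Aaron Eckhart | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2 |
| 18 | Aaron Eckhart | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 31 | Aaron Taylor-Johnson | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |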
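
Because the grouping includes every genre column, each row above counts movies that share one exact genre combination, which is why the same actor appears several times. The sketch below, on invented data, contrasts that with per-genre totals per actor. The second aggregation is an alternative for comparison, not what the notebook computes.

```python
import pandas as pd

# Invented long-format data: one row per (movie, actor).
long_demo = pd.DataFrame({
    "actor": ["A", "A", "A", "B"],
    "Action": [1, 1, 0, 0],
    "Drama": [0, 1, 1, 1],
})

# Same aggregation style as above: one count per exact genre combination.
combo_counts = (
    long_demo.groupby(["actor", "Action", "Drama"]).size().reset_index(name="count")
)

# Alternative: number of movies per actor in each individual genre.
per_genre = long_demo.groupby("actor")[["Action", "Drama"]].sum()

print(combo_counts)
print(per_genre)
```

In this toy case actor A has two Action movies, yet no single combination reaches a count of 2, so a threshold on combination counts would drop A entirely.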


In [84]:
from sklearn.model_selection import train_test_split

# Define features (X) and target (y)
X = actor_genre_count.drop(['actor', 'count'], axis=1)  # Features: genres
y = actor_genre_count['count']  # Target: actor relevance count

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display shapes of the splits
print("Train Features Shape:", X_train.shape)
print("Train Target Shape:", y_train.shape)
print("Test Features Shape:", X_test.shape)
print("Test Target Shape:", y_test.shape)


Train Features Shape: (823, 20)
Train Target Shape: (823,)
Test Features Shape: (206, 20)
Test Target Shape: (206,)


In [85]:
from sklearn.linear_model import LogisticRegression

# Initialize a Logistic Regression classifier
# (each distinct value of `count` is treated as a separate class label)
model = LogisticRegression(max_iter=1000, solver='liblinear')

# Train the model on the training set
model.fit(X_train, y_train)

# Display model coefficients for insight
print("Model Coefficients:")
print(model.coef_)


Model Coefficients:
[[ 0.58438489 -0.80900833 -0.55087575  0.42337587  0.60352047 -0.05163047
   0.55907288  0.46089871  0.41754864  0.          0.59729952  0.71465288
   0.34308319  0.65658544  0.07044405 -0.08339562  0.          0.03849273
  -0.53926067 -0.22458484]
 [-0.07822388  0.09277018  0.56706162 -0.27327862 -0.36320796 -0.14657695
  -0.20896041 -0.14804835 -0.87480675  0.         -0.41275909 -0.32368846
  -0.02877646 -0.1554652  -0.29652303  0.17678848  0.         -0.3636909
   0.95136212  0.06889304]
 [-0.96608455  1.1298462   0.82964654 -0.38687289 -0.54623927 -0.25955411
  -0.45823802 -0.85785799  0.29739307  0.         -0.32009887 -0.31867341
  -0.27195277 -0.70286551  0.42967325  0.48285191  0.          0.15583802
  -0.21338147 -0.23880923]
 [-0.89844212  0.88212406 -0.65294477 -1.39474551 -0.68907198  0.4943457
  -1.3485917   0.15873468  0.02752024  0.         -0.17553596 -1.00626867
  -0.20755787 -0.48085712 -0.20036971 -0.86113771  0.         -0.10447055
  -0.16952979

In [87]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate evaluation metrics (handle zero division in precision/recall)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted', zero_division=0)
recall = recall_score(y_test, y_pred, average='weighted', zero_division=0)
f1 = f1_score(y_test, y_pred, average='weighted', zero_division=0)

# Display evaluation metrics
print("Model Evaluation Metrics:")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")


Model Evaluation Metrics:
Accuracy: 0.70
Precision: 0.50
Recall: 0.70
F1 Score: 0.58
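
Weighted recall always equals accuracy, so the two matching 0.70 values carry the same information. An accuracy of 0.70 next to a weighted precision of 0.50 is also the pattern a classifier produces when it predicts a single dominant class for most inputs. Whether that is happening here cannot be read from the output shown, so a majority-class baseline is a useful comparison. The sketch below uses invented labels (70% in one class) to show what such a baseline scores; `X_demo` and `y_demo` are placeholders, not the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score

# Invented labels: 70 samples of class 2, 20 of class 3, 10 of class 4.
y_demo = np.array([2] * 70 + [3] * 20 + [4] * 10)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this baseline

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
pred = baseline.predict(X_demo)

acc = accuracy_score(y_demo, pred)
prec = precision_score(y_demo, pred, average="weighted", zero_division=0)
print(f"Baseline accuracy: {acc:.2f}, weighted precision: {prec:.2f}")
```

Fitting the same `DummyClassifier` on `X_train`, `y_train` and scoring it on `X_test`, `y_test` would show how much the logistic regression adds over always guessing the most common count.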


In [88]:
import pickle

# Save the trained model to a file
with open('actor_relevance_model.pkl', 'wb') as file:
    pickle.dump(model, file)

print("Model saved as 'actor_relevance_model.pkl'")


Model saved as 'actor_relevance_model.pkl'
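
A saved model is only useful if it can be restored and gives the same predictions. The sketch below checks a pickle round trip on a tiny invented model, using in-memory bytes so it writes no files. With the file saved above, the equivalent is `pickle.load` on the file opened in `'rb'` mode.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny invented dataset, only used to have a fitted model to serialize.
X_demo = np.array([[0, 1], [1, 0], [1, 1], [0, 0]])
y_demo = np.array([0, 1, 1, 0])
clf = LogisticRegression().fit(X_demo, y_demo)

# Serialize to bytes and restore.
payload = pickle.dumps(clf)
restored = pickle.loads(payload)

# The restored model should reproduce the original predictions exactly.
same = bool((restored.predict(X_demo) == clf.predict(X_demo)).all())
print("Round trip OK:", same)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code, and a model is best reloaded with the same scikit-learn version that saved it.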
