# **PREDICTING RELEVANT ACTORS FOR A GENRE**

<br>

<br>

**1.1. LIBRARY IMPORTING**

In [1]:
import ast
import pickle

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

**1.2. LOAD DATASETS**

In [2]:
# Load the datasets
movies = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/main/tmdb_5000_movies.csv")
credits = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/main/tmdb_5000_credits.csv")

# Display basic information about the datasets
print("Movies Dataset Columns:")
print(movies.columns)

print("\nCredits Dataset Columns:")
print(credits.columns)


Movies Dataset Columns:
Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count'],
      dtype='object')

Credits Dataset Columns:
Index(['movie_id', 'title', 'cast', 'crew'], dtype='object')


**1.3. MERGE THE DATASETS**

In [3]:
# Merge the datasets on the 'title' column
merged_data = pd.merge(movies, credits, on="title")

# Retain only relevant columns
columns_to_keep = ['title', 'genres', 'cast']
merged_data = merged_data[columns_to_keep]

# Display merged dataset
print("Merged Dataset:")
print(merged_data.head())

Merged Dataset:
                                      title  \
0                                    Avatar   
1  Pirates of the Caribbean: At World's End   
2                                   Spectre   
3                     The Dark Knight Rises   
4                               John Carter   

                                              genres  \
0  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   
1  [{"id": 12, "name": "Adventure"}, {"id": 14, "...   
2  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   
3  [{"id": 28, "name": "Action"}, {"id": 80, "nam...   
4  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   

                                                cast  
0  [{"cast_id": 242, "character": "Jake Sully", "...  
1  [{"cast_id": 4, "character": "Captain Jack Spa...  
2  [{"cast_id": 1, "character": "James Bond", "cr...  
3  [{"cast_id": 2, "character": "Bruce Wayne / Ba...  
4  [{"cast_id": 5, "character": "John Carter", "c...  


- Merge the datasets to unify the `genres` and `cast` columns for each movie.
- Keep only the columns necessary for analyzing genres and cast, which are critical for the model.
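
One caveat with merging on `title`: titles are not guaranteed to be unique, and a title shared by two movies yields one row per matching pair. The sketch below uses small invented frames (not the TMDB data) to show the effect. It joins on the `id` and `movie_id` columns listed in section 1.2 as one possible alternative; the names `movies_demo`, `credits_demo`, `by_title` and `by_id` are illustrative only.

```python
import pandas as pd

# Toy frames with the same key columns as the real datasets; all values are invented.
movies_demo = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Heat", "Heat", "Alien"],
    "genres": ["g1", "g2", "g3"],
})
credits_demo = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Heat", "Heat", "Alien"],
    "cast": ["c1", "c2", "c3"],
})

# Joining on title pairs every "Heat" movie with every "Heat" credits row.
by_title = pd.merge(movies_demo, credits_demo, on="title")

# Joining on the numeric identifiers keeps one row per movie.
by_id = pd.merge(
    movies_demo,
    credits_demo.drop(columns="title"),
    left_on="id",
    right_on="movie_id",
)

print(len(by_title))  # 5 rows: 2 x 2 for "Heat" plus 1 for "Alien"
print(len(by_id))     # 3 rows
```

Whether this matters for the real data depends on how many titles repeat, which can be checked with `movies['title'].duplicated().sum()`.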

**1.4. PROCESS GENRES**

In [4]:
# Extract genre names from the JSON-like column
def extract_genres(genres_column):
    try:
        return [genre['name'] for genre in ast.literal_eval(genres_column)]
    except (ValueError, SyntaxError):
        return []

merged_data['genres'] = merged_data['genres'].apply(extract_genres)

# Display processed genres
print("Processed Genres:")
print(merged_data[['title', 'genres']].head())

Processed Genres:
                                      title  \
0                                    Avatar   
1  Pirates of the Caribbean: At World's End   
2                                   Spectre   
3                     The Dark Knight Rises   
4                               John Carter   

                                          genres  
0  [Action, Adventure, Fantasy, Science Fiction]  
1                   [Adventure, Fantasy, Action]  
2                     [Action, Adventure, Crime]  
3               [Action, Crime, Drama, Thriller]  
4           [Action, Adventure, Science Fiction]  


Explanation:

- Parse the `genres` column to extract a list of genre names for each movie.
- This ensures the data is ready for genre-based analysis.
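
The raw `genres` strings shown in section 1.3 are JSON. `ast.literal_eval` parses them as long as they contain no JSON-only literals such as `null`, `true` or `false`; if one appears, `extract_genres` takes its `except` branch and silently returns an empty list. The following is a sketch of a `json.loads` variant, not the notebook's method. The function name `extract_names_json` and the sample strings are hypothetical.

```python
import ast
import json

def extract_names_json(raw):
    """Return the 'name' of every entry in a JSON array string, or [] on bad input."""
    try:
        return [item["name"] for item in json.loads(raw)]
    except (TypeError, ValueError, KeyError):
        # TypeError: raw is not a string (e.g. NaN); ValueError: invalid JSON
        return []

plain = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
with_null = '[{"id": null, "name": "Drama"}]'  # invented value containing a JSON null

print(extract_names_json(plain))      # ['Action', 'Adventure']
print(extract_names_json(with_null))  # ['Drama']

# ast.literal_eval rejects the JSON literal null
try:
    ast.literal_eval(with_null)
    literal_eval_ok = True
except (ValueError, SyntaxError):
    literal_eval_ok = False
print(literal_eval_ok)  # False
```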

**1.5. PROCESS CAST**

In [5]:
# Extract top 3 actor names from the JSON-like column
def extract_top_actors(cast_column):
    try:
        return [actor['name'] for actor in ast.literal_eval(cast_column)[:3]]  # Top 3 actors
    except (ValueError, SyntaxError):
        return []

merged_data['cast'] = merged_data['cast'].apply(extract_top_actors)

# Display processed cast
print("Processed Cast:")
print(merged_data[['title', 'cast']].head())

Processed Cast:
                                      title  \
0                                    Avatar   
1  Pirates of the Caribbean: At World's End   
2                                   Spectre   
3                     The Dark Knight Rises   
4                               John Carter   

                                               cast  
0  [Sam Worthington, Zoe Saldana, Sigourney Weaver]  
1     [Johnny Depp, Orlando Bloom, Keira Knightley]  
2      [Daniel Craig, Christoph Waltz, Léa Seydoux]  
3      [Christian Bale, Michael Caine, Gary Oldman]  
4    [Taylor Kitsch, Lynn Collins, Samantha Morton]  


- Parse the `cast` column to extract the names of the top 3 actors for each movie.
- These actors are the candidates for determining relevance by genre.

**FINAL DATASET STRUCTURE**

After processing, the dataset has:

- `title`: Movie title (for reference).
- `genres`: List of genres for the movie.
- `cast`: Top 3 actors' names.

The data is now ready for the next steps, which analyze the relationship between genres and relevant actors.

<br>

# **STEP 2: FEATURE ENGINEERING**

**2.1. ENCODE `genres`**

In [6]:
mlb = MultiLabelBinarizer()
genres_encoded = pd.DataFrame(mlb.fit_transform(merged_data['genres']), columns=mlb.classes_)

# Add encoded genres back to the dataset
merged_data = pd.concat([merged_data, genres_encoded], axis=1)

# Drop the original 'genres' column
merged_data.drop('genres', axis=1, inplace=True)

# Display the dataset with encoded genres
print("Dataset with Encoded Genres:")
print(merged_data.head())


Dataset with Encoded Genres:
                                      title  \
0                                    Avatar   
1  Pirates of the Caribbean: At World's End   
2                                   Spectre   
3                     The Dark Knight Rises   
4                               John Carter   

                                               cast  Action  Adventure  \
0  [Sam Worthington, Zoe Saldana, Sigourney Weaver]       1          1   
1     [Johnny Depp, Orlando Bloom, Keira Knightley]       1          1   
2      [Daniel Craig, Christoph Waltz, Léa Seydoux]       1          1   
3      [Christian Bale, Michael Caine, Gary Oldman]       1          0   
4    [Taylor Kitsch, Lynn Collins, Samantha Morton]       1          1   

   Animation  Comedy  Crime  Documentary  Drama  Family  ...  History  Horror  \
0          0       0      0            0      0       0  ...        0       0   
1          0       0      0            0      0       0  ...        0       0   


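- Use `MultiLabelBinarizer` to encode the `genres` column as binary indicator columns (a multi-label form of one-hot encoding, since a movie can have several genres).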
- Each genre becomes a separate binary feature (1 = present, 0 = absent).
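
To make the encoding concrete, here is a minimal sketch on three invented genre lists. It shows that the output columns follow `classes_` (sorted alphabetically), that a row can contain several 1s, and that an empty list becomes an all-zero row.

```python
from sklearn.preprocessing import MultiLabelBinarizer

genre_lists = [["Action", "Drama"], ["Drama"], []]  # invented examples

mlb_demo = MultiLabelBinarizer()
encoded = mlb_demo.fit_transform(genre_lists)

print(list(mlb_demo.classes_))              # ['Action', 'Drama']
print(encoded.tolist())                     # [[1, 1], [0, 1], [0, 0]]
print(mlb_demo.inverse_transform(encoded))  # [('Action', 'Drama'), ('Drama',), ()]
```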

**2.2. CREATE ACTOR FEATURES**


In [7]:
# Flatten the cast list into individual actor columns
for i in range(3):
    merged_data[f'actor_{i+1}'] = merged_data['cast'].apply(lambda x: x[i] if len(x) > i else None)

# Drop the original 'cast' column
merged_data.drop('cast', axis=1, inplace=True)



In [8]:
# Display dataset with actor columns
print("Dataset with Actor Features:")
merged_data.head()

Dataset with Actor Features:


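|   | title | Action | Adventure | Animation | Comedy | Crime | Documentary | Drama | Family | Fantasy | ... | Mystery | Romance | Science Fiction | TV Movie | Thriller | War | Western | actor_1 | actor_2 | actor_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Avatar | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | Sam Worthington | Zoe Saldana | Sigourney Weaver |
| 1 | Pirates of the Caribbean: At World's End | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Johnny Depp | Orlando Bloom | Keira Knightley |
| 2 | Spectre | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Daniel Craig | Christoph Waltz | Léa Seydoux |
| 3 | The Dark Knight Rises | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | Christian Bale | Michael Caine | Gary Oldman |
| 4 | John Carter | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | Taylor Kitsch | Lynn Collins | Samantha Morton |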


- Extract the top 3 actors from the `cast` column into separate columns (`actor_1`, `actor_2`, `actor_3`).
- Movies with fewer than three listed actors get `None` in the missing positions.
- This simplifies analysis by making the actor data explicit.

**2.3. ASSIGN GENRE-SPECIFIC ACTOR RELEVANCE**

In [9]:
# Create a helper DataFrame to track actor relevance by genre
actor_genre_relevance = merged_data.melt(id_vars=mlb.classes_, value_vars=['actor_1', 'actor_2', 'actor_3'], 
                                         var_name='actor_position', value_name='actor')

# Drop rows where 'actor' is NaN
actor_genre_relevance.dropna(subset=['actor'], inplace=True)



In [10]:
# Display the actor-genre mapping
print("ACTOR-GENRE RELEVANCE DataFrame:")
actor_genre_relevance.head()

ACTOR-GENRE RELEVANCE DataFrame:


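|   | Action | Adventure | Animation | Comedy | Crime | Documentary | Drama | Family | Fantasy | Foreign | ... | Music | Mystery | Romance | Science Fiction | TV Movie | Thriller | War | Western | actor_position | actor |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | actor_1 | Sam Worthington |
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | actor_1 | Johnny Depp |
| 2 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | actor_1 | Daniel Craig |
| 3 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | actor_1 | Christian Bale |
| 4 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | actor_1 | Taylor Kitsch |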


- Create a mapping between genres and actors.
- Use the `melt` function to transform actor columns into rows, associating each actor with their respective genres.
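
As an illustration of the reshaping, the sketch below applies the same `melt` and `dropna` pattern to a two-row invented frame with two genre columns and two actor columns. Each movie row becomes one row per actor, with the genre flags repeated alongside.

```python
import pandas as pd

# Two invented movies: the second one has only one listed actor.
wide = pd.DataFrame({
    "Action": [1, 0],
    "Drama": [0, 1],
    "actor_1": ["Actor A", "Actor C"],
    "actor_2": ["Actor B", None],
})

long = wide.melt(
    id_vars=["Action", "Drama"],          # genre flags are copied onto every actor row
    value_vars=["actor_1", "actor_2"],    # these columns are stacked into rows
    var_name="actor_position",
    value_name="actor",
).dropna(subset=["actor"])                # drop the empty actor_2 slot

print(long)
print(len(long))  # 3
```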


**2.4. GENERATE TRAINING DATASET**

In [11]:
# Use only the columns available in actor_genre_relevance
group_columns = ['actor']  # Start with actor as the primary group
if set(mlb.classes_).issubset(actor_genre_relevance.columns):  # Check if genre columns exist
    group_columns += list(mlb.classes_)

# Aggregate actor appearances across genres
actor_genre_count = actor_genre_relevance.groupby(group_columns).size().reset_index(name='count')

# Keep only actor / genre-combination pairs that occur in at least `threshold` movies
threshold = 2
actor_genre_count = actor_genre_count[actor_genre_count['count'] >= threshold]


In [12]:
# Display the final training dataset
print("Training Dataset:")
actor_genre_count.head()

Training Dataset:


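|   | actor | Action | Adventure | Animation | Comedy | Crime | Documentary | Drama | Family | Fantasy | ... | Horror | Music | Mystery | Romance | Science Fiction | TV Movie | Thriller | War | Western | count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | Aamir Khan | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 13 | Aaron Eckhart | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| 14 | Aaron Eckhart | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2 |
| 18 | Aaron Eckhart | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 31 | Aaron Taylor-Johnson | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |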


**Final Output:**

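- **Dataset for Modeling**: each row of `actor_genre_count` pairs an actor with one exact combination of genres, and `count` records how many movies in the data match that pair. Pairs with fewer than 2 matching movies are dropped.
- The model in Step 3 is trained on this table to relate genre combinations to actor relevance.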
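
Because the grouping key is the actor plus every genre column, one actor can appear in several rows, one per distinct genre combination (as Aaron Eckhart does in the output above). A minimal sketch of the same `groupby(...).size()` and threshold pattern on invented rows:

```python
import pandas as pd

# One row per (movie, actor) pair; values are invented.
pairs = pd.DataFrame({
    "actor":  ["Actor A", "Actor A", "Actor A", "Actor B"],
    "Action": [1, 1, 0, 1],
    "Drama":  [0, 0, 1, 0],
})

# Count how often each (actor, exact genre combination) occurs.
counts = pairs.groupby(["actor", "Action", "Drama"]).size().reset_index(name="count")

# Same threshold as in the cell above.
kept = counts[counts["count"] >= 2]

print(counts)
print(kept)
```

Here "Actor A" has two Action movies and one Drama movie, so only the Action row survives the threshold; "Actor B" is dropped entirely.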

<br>

# **STEP 3: MODEL TRAINING**

**3.1. SPLIT THE DATASET**

In [13]:
# Define features (X) and target (y)
X = actor_genre_count.drop(['actor', 'count'], axis=1)  # Features: genres
y = actor_genre_count['count']  # Target: actor relevance count

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display shapes of the splits
print("Train Features Shape:", X_train.shape)
print("Train Target Shape:", y_train.shape)
print("Test Features Shape:", X_test.shape)
print("Test Target Shape:", y_test.shape)


Train Features Shape: (823, 20)
Train Target Shape: (823,)
Test Features Shape: (206, 20)
Test Target Shape: (206,)


- The dataset is split into **training** and **testing** sets.
- **X** contains the genre features, while **y** is the target (actor relevance count).
- The split ensures the model can be evaluated on unseen data.
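
The printed shapes can be checked by hand: 823 + 206 = 1029 rows in total, of which 20% is 205.8. The short sketch below reproduces the split sizes under the assumption that the test share is rounded up and the remainder goes to training, which is consistent with the output above.

```python
import math

n_samples = 823 + 206          # total rows, taken from the printed shapes
test_size = 0.2

n_test = math.ceil(test_size * n_samples)   # 205.8 rounded up
n_train = n_samples - n_test

print(n_samples, n_train, n_test)  # 1029 823 206
```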

**3.2. CHOOSE AND TRAIN A MODEL**

In [14]:
# Initialize Logistic Regression with class_weight='balanced'
model = LogisticRegression(
    max_iter=1000,           # Allow the model to converge with enough iterations
    solver='liblinear',      # Lightweight solver suitable for smaller datasets
    class_weight='balanced'  # Adjust weights inversely proportional to class frequency
)

# Train the model on the training set
model.fit(X_train, y_train)

# Display the coefficients of the first class only (model.classes_[0])
print("Model Coefficients:")
for genre, coef in zip(X_train.columns, model.coef_[0]):
    print(f"{genre}: {coef:.4f}")


Model Coefficients:
Action: 0.2792
Adventure: -0.7015
Animation: -0.4509
Comedy: 0.2233
Crime: 0.4964
Documentary: -0.0258
Drama: 0.3286
Family: 0.2306
Fantasy: 0.3317
Foreign: 0.0000
History: 0.3639
Horror: 0.4303
Music: 0.1749
Mystery: 0.4723
Romance: 0.0162
Science Fiction: -0.0329
TV Movie: 0.0000
Thriller: 0.0305
War: -0.1880
Western: -0.1586


- Logistic Regression is chosen because it is lightweight and suited to classification.
- The model treats each distinct `count` value as a class and learns to predict it from the genre indicators; the `actor` column is not used as a feature.
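
When the target has more than two classes, `coef_` holds one row of coefficients per class, so indexing `coef_[0]` shows only one of them. The sketch below illustrates this on invented data with three count classes. It uses the default solver rather than the `liblinear` configuration above, and the names `X_demo`, `y_demo` and `clf` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(60, 4))  # 4 invented binary "genre" flags
y_demo = np.tile([2, 3, 4], 20)            # three count classes, 20 samples each

clf = LogisticRegression(max_iter=1000)
clf.fit(X_demo, y_demo)

print(clf.classes_)     # [2 3 4]
print(clf.coef_.shape)  # (3, 4): one row per class, one column per feature

# Print every class's coefficients, not just the first row.
for cls, row in zip(clf.classes_, clf.coef_):
    print(cls, np.round(row, 4))
```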

**3.3. EVALUATE THE MODEL**

In [15]:
# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate evaluation metrics (handle zero division in precision/recall)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted', zero_division=0)
recall = recall_score(y_test, y_pred, average='weighted', zero_division=0)
f1 = f1_score(y_test, y_pred, average='weighted', zero_division=0)

# Display evaluation metrics
print("Model Evaluation Metrics:")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")


Model Evaluation Metrics:
Accuracy: 0.54
Precision: 0.75
Recall: 0.54
F1 Score: 0.55
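
The recall printed above equals the accuracy (both 0.54). That is expected with `average='weighted'`: weighting each class's recall by its share of the true labels reduces to overall accuracy, while weighted precision can differ. The following pure-Python sketch works through the arithmetic on invented labels; it mirrors the weighted averaging rather than calling scikit-learn.

```python
from collections import Counter

y_true = [2, 2, 2, 3, 3, 4]  # invented labels
y_pred = [2, 3, 3, 3, 3, 4]

n = len(y_true)
support = Counter(y_true)  # number of true samples per class
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

weighted_precision = 0.0
weighted_recall = 0.0
for cls, n_cls in support.items():
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    predicted = sum(p == cls for p in y_pred)
    precision_cls = tp / predicted if predicted else 0.0  # same idea as zero_division=0
    recall_cls = tp / n_cls
    weighted_precision += (n_cls / n) * precision_cls
    weighted_recall += (n_cls / n) * recall_cls

print(f"{accuracy:.2f} {weighted_precision:.2f} {weighted_recall:.2f}")  # 0.67 0.83 0.67
```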



<br>

# **STEP 4: SAVE AND SERIALIZE THE MODEL**

**4.1. SAVE THE TRAINED MODEL**

In [17]:
# Save the trained model to a .pkl file
with open('actor_relevance_model.pkl', 'wb') as file:
    pickle.dump(model, file)

print("Model saved as 'actor_relevance_model.pkl'")


Model saved as 'actor_relevance_model.pkl'


- The trained Logistic Regression model is serialized using `pickle` and saved as a `.pkl` file.
- This file will be used later in the Flask application to load the model and make predictions.
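
The pickled model expects its input columns in the same order as `X_train.columns`. One way to keep that order available to the loading side is to store it next to the model. The sketch below is an assumption about how this could be done, not part of the notebook: the dictionary layout, the helper `genres_to_row`, and the toy model and data are all hypothetical, and the round trip happens in memory instead of through a file.

```python
import pickle
from sklearn.linear_model import LogisticRegression

genre_columns = ["Action", "Comedy", "Drama"]  # stands in for list(X_train.columns)
X_demo = [[1, 0, 0], [1, 0, 1], [0, 1, 0], [0, 1, 1]]  # invented training rows
y_demo = [2, 2, 3, 3]
toy_model = LogisticRegression().fit(X_demo, y_demo)

# Store the column order next to the model so both travel together.
payload = pickle.dumps({"model": toy_model, "genre_columns": genre_columns})

def genres_to_row(genres, columns):
    """Build one 0/1 feature row in the stored column order."""
    return [[1 if column in genres else 0 for column in columns]]

artifact = pickle.loads(payload)
row = genres_to_row(["Drama", "Action"], artifact["genre_columns"])

print(row)  # [[1, 0, 1]]
print(artifact["model"].predict(row))
```

Pickle files can execute arbitrary code when loaded, so only load model files from a trusted source.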

**4.2. VERIFY THE MODEL FILE**

In [18]:
# Check the saved model file
with open('actor_relevance_model.pkl', 'rb') as file:
    loaded_model = pickle.load(file)

# Ensure the loaded model matches the original one
print("Loaded Model Coefficients:")
for genre, coef in zip(X_train.columns, loaded_model.coef_[0]):
    print(f"{genre}: {coef:.4f}")

Loaded Model Coefficients:
Action: 0.2792
Adventure: -0.7015
Animation: -0.4509
Comedy: 0.2233
Crime: 0.4964
Documentary: -0.0258
Drama: 0.3286
Family: 0.2306
Fantasy: 0.3317
Foreign: 0.0000
History: 0.3639
Horror: 0.4303
Music: 0.1749
Mystery: 0.4723
Romance: 0.0162
Science Fiction: -0.0329
TV Movie: 0.0000
Thriller: 0.0305
War: -0.1880
Western: -0.1586


This step ensures the model was saved correctly by reloading it and checking that the coefficients match the original model.
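
Comparing printed coefficients by eye works for a quick check. A programmatic comparison is a small extension, sketched here on a toy model with an in-memory round trip; the names `original` and `restored` are illustrative and stand in for `model` and `loaded_model`.

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

X_demo = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])  # invented data
y_demo = np.array([1, 1, 0, 0])
original = LogisticRegression().fit(X_demo, y_demo)

restored = pickle.loads(pickle.dumps(original))  # in-memory round trip

same_coefficients = np.array_equal(original.coef_, restored.coef_)
same_intercept = np.array_equal(original.intercept_, restored.intercept_)
same_predictions = np.array_equal(original.predict(X_demo), restored.predict(X_demo))

print(same_coefficients, same_intercept, same_predictions)  # True True True
```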

<br>

# **