# mini_project_3.ipynb

Group Members:
- Peter Bollhorn
- Tobias Thormod Birk Nielsen

This notebook presents our solution to Mini Project 3, where we work with data from The Movie Database (TMDB) https://www.themoviedb.org/

In the notebook **read_tmdb_data.ipynb** we read JSON data from TMDB's API and store it as CSV files:
- **danish_movies.csv**: Data on all Danish-language movies ever made up to and including 2024-12-31.
- **danish_actors.csv**: Data on all actors who appear in these movies (including foreign actors).

Here is what we do in the different tasks:
- Task 1: We load the two CSV files into data frames, and then we clean them and prepare them for the next tasks.
- Task 2: We use linear regression to see if there is a linear relationship between actor age and their movie count (the number of movies they have appeared in).
- Task 3: We use classification to see if we can predict an actor's gender from their age and their movie count.
- Task 4: We use Mean Shift clustering to group the movies by release year, runtime, and genre.


TMDB works with these genders:
| value  | Gender                  |
|--------|-------------------------|
| 0      | Not set / not specified |
| 1      | Female                  |
| 2      | Male                    |
| 3      | Non-binary              |

And TMDB works with these genres:
| genre_id | Genre           |
|----------|-----------------|
| 28       | Action          |
| 12       | Adventure       |
| 16       | Animation       |
| 35       | Comedy          |
| 80       | Crime           |
| 99       | Documentary     |
| 18       | Drama           |
| 10751    | Family          |
| 14       | Fantasy         |
| 36       | History         |
| 27       | Horror          |
| 10402    | Music           |
| 9648     | Mystery         |
| 10749    | Romance         |
| 878      | Science Fiction |
| 10770    | TV Movie        |
| 53       | Thriller        |
| 10752    | War             |
| 37       | Western         |

In [None]:
import sys
sys.path.append("..")

import pandas as pd
import ast  # Abstract Syntax Trees - safely parse string list to Python list
import seaborn as sns
import matplotlib.pyplot as plt
from reader import generic_reader

## Task 1: Data wrangling and exploration

First we read in danish_movies.csv and have a look at the data. We see that there is a total of 5134 movies.

In [None]:
danish_movies = generic_reader.read_csv_file_to_data_frame("movie_data/danish_movies.csv")
print(danish_movies.info())
danish_movies

Now we read in danish_actors.csv and have a look at the data. We see that a total of 15602 actors appear in these movies (including foreign actors).

In [None]:
danish_actors = generic_reader.read_csv_file_to_data_frame("movie_data/danish_actors.csv")
print(danish_actors.info())
danish_actors

For danish_movies data frame, we now drop the columns we are not interested in working with.

And we convert string representations of lists to actual Python lists.

And we convert string representation of release_date to actual Python datetime objects.

In [None]:
columns_to_keep = ['id', 'title', 'release_date', 'runtime', 'vote_average', 'vote_count', 'genre_ids', 'cast_person_ids']
danish_movies = danish_movies[columns_to_keep].copy()
danish_movies['genre_ids'] = danish_movies['genre_ids'].apply(ast.literal_eval)
danish_movies['cast_person_ids'] = danish_movies['cast_person_ids'].apply(ast.literal_eval)
danish_movies['release_date'] = pd.to_datetime(danish_movies['release_date'])
danish_movies
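
The two conversions above can be tried out on a small made-up frame. This is only a sketch: the column values are invented, and `raw` is a hypothetical name that is not used elsewhere in the notebook.

```python
import ast
import pandas as pd

# Toy rows mimicking how list and date columns look after a CSV round-trip (values are made up)
raw = pd.DataFrame({
    "genre_ids": ["[18, 35]", "[]"],
    "release_date": ["1998-06-19", None],
})

# literal_eval only evaluates Python literals, which makes it safer than eval()
raw["genre_ids"] = raw["genre_ids"].apply(ast.literal_eval)
# Missing dates become NaT instead of raising an error
raw["release_date"] = pd.to_datetime(raw["release_date"])

parsed_genres = raw["genre_ids"].tolist()
n_missing_dates = int(raw["release_date"].isna().sum())
print(parsed_genres, n_missing_dates)
```

Rows with a missing date survive the conversion as `NaT`, so they still need to be handled before any step that requires a year.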

For danish_actors data frame, we now drop the columns we are not interested in working with.

And we convert string representation of dates to actual Python datetime objects.

In [None]:
columns_to_keep = ['actor_id', 'name', 'gender', 'birthday', 'deathday']
danish_actors = danish_actors[columns_to_keep].copy()
danish_actors['birthday'] = pd.to_datetime(danish_actors['birthday'])
danish_actors['deathday'] = pd.to_datetime(danish_actors['deathday'])
danish_actors

In the danish_movies data frame we drop all movies that have runtime shorter than 40 minutes.

We also drop animation films (genre_id=16) and documentaries (genre_id=99).

This is because we only want to work with films that are:
- **Feature-length** (excluding short-films)
- **Live-action** (excluding animation films)
- **Narrative** (excluding documentaries)

After doing this we have 1623 films left.

In [None]:
danish_movies = danish_movies[danish_movies["runtime"] >= 40]
danish_movies = danish_movies[~danish_movies['genre_ids'].apply(lambda genre_id: 16 in genre_id or 99 in genre_id)]
len(danish_movies)
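
The same filter can be written with named boolean masks, which makes each criterion easier to check on its own. The sketch below uses invented toy data, and the names `toy_movies` and `EXCLUDED_GENRES` are hypothetical.

```python
import pandas as pd

# Made-up movies: only the first one is feature-length, live-action and narrative
toy_movies = pd.DataFrame({
    "title": ["Feature drama", "Short film", "Animated feature", "Documentary"],
    "runtime": [105, 12, 90, 75],
    "genre_ids": [[18], [18, 35], [16, 10751], [99]],
})

EXCLUDED_GENRES = {16, 99}  # Animation and Documentary

is_feature_length = toy_movies["runtime"] >= 40
# True when the movie's genre list shares at least one id with EXCLUDED_GENRES
has_excluded_genre = toy_movies["genre_ids"].apply(
    lambda ids: not EXCLUDED_GENRES.isdisjoint(ids)
)

kept = toy_movies[is_feature_length & ~has_excluded_genre]
kept_titles = kept["title"].tolist()
print(kept_titles)
```

Note that a missing runtime compares as `False` against `>= 40`, so movies without a runtime are dropped by this filter as well.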

Now we drop all actors that do not appear in the reduced danish_movies data frame.

After doing this we have 10651 actors left.

In [None]:
# Step 1: Get all unique actor IDs from danish_movies
all_actor_ids = set()
for cast in danish_movies['cast_person_ids']:
    all_actor_ids.update(cast)
    
# Step 2: Reduce danish_actors to only contain those who appear in the reduced danish_movies data frame
danish_actors = danish_actors[danish_actors['actor_id'].isin(all_actor_ids)].copy()
len(danish_actors)

Now we calculate movie_count for the actors

In [None]:
# Step 1: explode cast_person_ids so each row has one actor per movie
exploded = danish_movies.explode("cast_person_ids")

# Step 2: count movies per actor
actor_movie_counts = (
    exploded.groupby("cast_person_ids")["id"]
    .nunique()  # use nunique in case of duplicates
    .reset_index()
    .rename(columns={"cast_person_ids": "actor_id", "id": "movie_count"})
)
actor_movie_counts

# Step 3: merge movie_count back into danish_actors
danish_actors = danish_actors.merge(actor_movie_counts, on="actor_id")
danish_actors.sort_values(by="movie_count", ascending=False)
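
To see what `explode` and `nunique` do here, the counting step can be run on a tiny made-up cast table. The ids below are invented and `toy_movies` is a hypothetical name.

```python
import pandas as pd

# Three made-up movies; actor 12 is accidentally listed twice in movie 3
toy_movies = pd.DataFrame({
    "id": [1, 2, 3],
    "cast_person_ids": [[10, 11], [10], [10, 12, 12]],
})

# One row per (movie, actor) pair
exploded = toy_movies.explode("cast_person_ids")

# nunique counts distinct movie ids, so the duplicate listing is only counted once
counts = exploded.groupby("cast_person_ids")["id"].nunique()
movie_counts = {int(actor): int(n) for actor, n in counts.items()}
print(movie_counts)
```

Because the merge in Step 3 uses the default inner join, actors without a matching row in the counts table would be dropped from `danish_actors`.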

We drop actors whose birthday is missing

In [None]:
danish_actors = danish_actors.dropna(subset=["birthday"]).copy()
len(danish_actors)

We now calculate the age of the actors:
- For living actors: Their age as of 2024-12-31
- For deceased actors: Their age at the time of passing

And hereafter we look at actors when sorted by age.

In [None]:
def calculate_age_from_row(row):
    birthday = row['birthday']
    deathday = row['deathday']
    
    if pd.isna(deathday):
        end_date = pd.to_datetime('2024-12-31')
    else:
        end_date = deathday
    
    age = end_date.year - birthday.year - ((end_date.month, end_date.day) < (birthday.month, birthday.day))
    return age


danish_actors['age'] = danish_actors.apply(calculate_age_from_row, axis=1)
danish_actors.sort_values(by="age", ascending=False)
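
The age formula subtracts one year when the end date falls before the birthday in that year. A minimal standalone sketch with two made-up dates shows this; `age_on` is a hypothetical helper, not the function used in the notebook.

```python
import pandas as pd

def age_on(birthday, end_date):
    """Age in whole years at end_date."""
    # True when the birthday has not yet occurred in end_date's year
    before_birthday = (end_date.month, end_date.day) < (birthday.month, birthday.day)
    return end_date.year - birthday.year - int(before_birthday)

cutoff = pd.Timestamp("2024-12-31")

# Birthday already passed in 2024, so no year is subtracted
age_a = age_on(pd.Timestamp("1965-11-22"), cutoff)
# Died one day before the 50th birthday, so one year is subtracted
age_b = age_on(pd.Timestamp("1950-06-15"), pd.Timestamp("2000-06-14"))
print(age_a, age_b)
```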

We now drop actors with age < 0, because these are errors in TMDB's database.

We also drop actors with no deathday and age > 100, because their deathday is probably missing in TMDB's database.



In [None]:
danish_actors = danish_actors[danish_actors['age'] >= 0]
danish_actors = danish_actors[~(danish_actors['deathday'].isna() & (danish_actors['age'] > 100))]
danish_actors.sort_values(by="age", ascending=False)

### Final data frames

After cleaning and preparing the data frames, we arrive at these two data frames:

In [None]:
danish_movies

In [None]:
danish_actors

## Task 2: Supervised machine learning: linear regression

We will now use linear regression to see if there is a linear relationship between actor age and movie_count.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [None]:
# age is the independent variable
X = danish_actors['age'].values.reshape(-1, 1)

In [None]:
# movie_count is the dependent variable
y = danish_actors['movie_count'].values.reshape(-1, 1)

In [None]:
# Split data into train set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123, test_size=0.15) 

In [None]:
# Creating an instance of Linear Regression model, and fit to our data
myreg = LinearRegression()
myreg.fit(X_train, y_train)

In [None]:
# Get the calculated coefficients
a = myreg.coef_
b = myreg.intercept_

In [None]:
y_predicted = myreg.predict(X_test)

We now make a plot of the data along with the linear regression.

In [None]:
# Visualise the Linear Regression 
plt.title('Linear Regression')
plt.scatter(X, y, alpha=0.2)
plt.plot(X_train, a*X_train + b, color='black')
plt.plot(X_test, y_predicted, color='orange')
plt.xlabel('age')
plt.ylabel('movie_count')
plt.show()

The R-squared value on the test set is 0.029 (computed in the cell below), which is very low: age explains only about 3% of the variation in movie_count.

This means that there is at most a very weak linear relationship between actor age and movie_count.

If we had instead looked only at the 100 actors with the highest movie_count, we might have found a stronger linear relationship, but we have not tested this.

In [None]:
# R-squared: the proportion of the variation in the dependent variable that is predictable from the independent variable
from sklearn.metrics import r2_score
r2_score(y_test, y_predicted)
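
For reference, R-squared compares the model's squared errors with those of a model that always predicts the mean. The sketch below computes it by hand on made-up numbers; `r_squared` is a hypothetical helper that should agree with `r2_score` for a single target variable.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = [1, 2, 3, 4]
perfect = r_squared(y_true, y_true)                  # predictions match exactly
mean_only = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])  # predictions are just the mean
print(perfect, mean_only)
```

A value close to 0, as in our regression, therefore means the fitted line predicts movie_count hardly better than the mean does.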

## Task 3: Supervised machine learning: classification

For the supervised machine-learning task, we use decision tree and random forest models to predict the gender of a Danish actor/actress from two features: age and the number of movies they have appeared in.

In [None]:
import numpy as np
from sklearn import tree
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import os
import graphviz

In [None]:
df_classification = danish_actors

Dropping the columns that will not be used by the model.

In [None]:
columns_to_drop = ['actor_id', 'name', 'deathday', 'birthday']
df_classification = df_classification.drop(columns=columns_to_drop)

Selecting the feature input columns, age and movie_count, and the column to predict, gender.

In [None]:
feature_cols = [
    'age',          
    'movie_count' 
]
label_col = 'gender' 

Splitting the data into X (the features) and y (the target).

In [None]:
X = df_classification[feature_cols].values
y = df_classification[label_col].values

For reference, we split the feature rows by gender using TMDB's coding, where 1 = female and 2 = male. The gender values themselves are not recoded.

In [None]:
female = X[y == 1] 
male = X[y == 2]   

In [None]:
#Set test size 
set_prop = 0.20

In [None]:
seed = 10

Split the dataset into training and testing sets.

In [None]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=set_prop, random_state=seed)

Initializing the classifiers. `randomclassifier` will be used for the random forest model later.

In [None]:
params = {'max_depth': 4}
classifier = DecisionTreeClassifier(**params)
randomclassifier = RandomForestClassifier(n_estimators = 100, max_depth = 4)

Training the decision tree

In [None]:
classifier.fit(X_train, y_train)

Exporting the trained tree to Graphviz format to draw the decision tree.

In [None]:
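# Map TMDB gender codes to readable names, in the order the classifier stores its classes
gender_names = {0: 'unknown', 1: 'female', 2: 'male', 3: 'non-binary'}
dot_data = tree.export_graphviz(
    classifier,
    out_file=None,
    feature_names=feature_cols,
    class_names=[gender_names[int(c)] for c in classifier.classes_],
    filled=True, rounded=True
)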

In [None]:
graph = graphviz.Source(dot_data)

#Remove the comment below to get a pdf of the decision tree called danish_actors
#graph.render("danish_actors") 

In [None]:
graph

In [None]:
scoring = 'accuracy'

Using the trained tree to predict the gender of the test set.

In [None]:
y_testp = classifier.predict(X_test)
y_testp

In [None]:
y_test

The model achieved an accuracy of only 59.2%, which is quite low. This indicates that it cannot reliably predict the gender of an actor or actress from these two features.

In [None]:
print ("Accuracy is ", accuracy_score(y_test,y_testp))

**Random forest**

In [None]:
randomclassifier.fit(X_train, y_train)

We pick the tree in the forest with the highest accuracy on the test set and draw it. This selection is only used for the visualization.

In [None]:
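best_accuracy = 0
best_tree = None

# Map TMDB gender codes to readable names
gender_names = {0: 'unknown', 1: 'female', 2: 'male', 3: 'non-binary'}

# Loop through all trees in the random forest
for tree_estimator in randomclassifier.estimators_:
    # The individual trees predict encoded class indices, so map them back to the original labels
    y_pred = randomclassifier.classes_[tree_estimator.predict(X_test).astype(int)]
    acc = accuracy_score(y_test, y_pred)
    if acc > best_accuracy:
        best_accuracy = acc
        best_tree = tree_estimator

dot_data_best = tree.export_graphviz(
    best_tree,
    out_file=None,
    feature_names=feature_cols,
    class_names=[gender_names[int(c)] for c in randomclassifier.classes_],
    filled=True,
    rounded=True
)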


In [None]:
# Display the best tree of the forest
graph_best = graphviz.Source(dot_data_best)

#Remove the comment below to get the best random forest tree as pdf
#graph_best.render("best_random_forest_tree")

In [None]:
graph_best

In [None]:
scoring_random = 'accuracy'

In [None]:
y_testp = randomclassifier.predict(X_test)
y_testp

In [None]:
y_test

The random forest performed similarly to the single decision tree, achieving a low accuracy of about 58.5%. The precise value is printed below.

In [None]:
print ("Accuracy is ", accuracy_score(y_test,y_testp))

Both the decision tree and the random forest achieved only about 58–59% accuracy, which is low. This is unsurprising given that the models only have two weakly related features to work with, and it suggests that a Danish actor's age and number of movie appearances carry little information about their gender.
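
One way to put an accuracy figure in context is to compare it with a baseline that always predicts the most frequent class in the training set. The sketch below uses made-up labels, and `majority_class_accuracy` is a hypothetical helper; we have not computed this baseline for our data. scikit-learn's `DummyClassifier(strategy="most_frequent")` provides the same kind of baseline.

```python
import numpy as np

def majority_class_accuracy(y_train, y_test):
    """Accuracy on y_test when always predicting the most frequent label in y_train."""
    labels, counts = np.unique(y_train, return_counts=True)
    majority = labels[np.argmax(counts)]
    return float(np.mean(np.asarray(y_test) == majority))

# Made-up gender codes using TMDB's coding (0 = not set, 1 = female, 2 = male)
toy_train = np.array([2, 2, 2, 1, 1, 0])
toy_test = np.array([2, 1, 2, 2, 0])

baseline = majority_class_accuracy(toy_train, toy_test)
print(baseline)
```

If a model's accuracy is close to this baseline, the features add little beyond the class distribution itself.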

## Task 4: Unsupervised machine learning: clustering

For the unsupervised machine-learning task, we apply Mean Shift clustering to Danish movies, using release year, runtime, and genre IDs as features.

In [None]:
from sklearn.cluster import MeanShift, estimate_bandwidth

In [None]:
df_clustering = danish_movies

We prepare the movie dataset for clustering by one-hot encoding the genre IDs and standardizing the numeric features, so that the scale of release year or runtime does not skew the results.

In [None]:
# Collect the unique genre ids across all movies
all_genres = sorted({g for sublist in df_clustering['genre_ids'] for g in sublist})

# Build a one-hot data frame with one row per movie (same index as df_clustering), initialised to 0
genre_df_clustering = pd.DataFrame(0, index=df_clustering.index,
                        columns=[f'genre_{g}' for g in all_genres])

# Set the matching genre cells to 1, addressing rows by index label:
# after the filtering in Task 1 the index is no longer 0..n-1, so a positional counter would hit the wrong rows
for idx, genres in df_clustering['genre_ids'].items():
    if genres:
        genre_df_clustering.loc[idx, [f'genre_{g}' for g in genres]] = 1


df_clustering_encoded = pd.concat([df_clustering, genre_df_clustering], axis=1)

# Ensure release_date is a datetime and extract the year as a new column, release_year
df_clustering_encoded['release_date'] = pd.to_datetime(df_clustering_encoded['release_date'], errors='coerce')
df_clustering_encoded['release_year'] = df_clustering_encoded['release_date'].dt.year

# Standardize release_year and runtime so that neither dominates the clustering
numerical_features = ['release_year', 'runtime']
X_num = df_clustering_encoded[numerical_features]
X_num_scaled = (X_num - X_num.mean()) / X_num.std(ddof=0)
X = np.hstack([X_num_scaled.to_numpy(), genre_df_clustering.values])
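
When one-hot encoding list-valued genres on a filtered data frame, the encoded rows need to keep the same index labels as the movies they describe. A minimal sketch on made-up data with a deliberately non-contiguous index (the names `toy` and `one_hot` are hypothetical):

```python
import pandas as pd

# Made-up movies whose index has gaps, as happens after filtering rows
toy = pd.DataFrame({"genre_ids": [[18, 35], [], [18]]}, index=[3, 7, 42])

all_genres = sorted({g for ids in toy["genre_ids"] for g in ids})

# Build each row directly from the genre list and reuse the original index labels
one_hot = pd.DataFrame(
    [[int(g in ids) for g in all_genres] for ids in toy["genre_ids"]],
    index=toy.index,
    columns=[f"genre_{g}" for g in all_genres],
)
print(one_hot)
```

Because `one_hot` shares its index with `toy`, concatenating the two along the columns keeps one row per movie. A movie without any genres simply gets a row of zeros.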

After scaling, release_year and runtime should each have a mean close to 0 and a standard deviation close to 1, which they do.

In [None]:
X_num_scaled.mean(), X_num_scaled.std(ddof=0)

The bandwidth is estimated from a sample of 500 movies (n_samples=500). Because the dataset is relatively small, this has no noticeable impact on performance.

In [None]:
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500) 

Creating the Mean Shift model.

In [None]:
msmodel = MeanShift(bandwidth=bandwidth, bin_seeding=True)
msmodel.fit(X)

Labeling all movies with a cluster id.

In [None]:
labels = msmodel.labels_
labels

Array with the unique cluster ids.

In [None]:
labels_unique = np.unique(labels)
labels_unique

Counting the length of the array gives the number of clusters.

In [None]:
n_clusters_ = len(labels_unique)

In [None]:
print(f"Estimated number of clusters = {n_clusters_}")
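
Beyond the number of clusters, it can be useful to see how many movies fall into each one. The sketch below counts cluster sizes for a made-up label array; `toy_labels` is hypothetical and stands in for `msmodel.labels_`.

```python
import numpy as np

# Made-up cluster labels, one per movie
toy_labels = np.array([0, 0, 1, 0, 2, 1, 0])

# return_counts gives the number of occurrences of each unique label
cluster_ids, cluster_sizes = np.unique(toy_labels, return_counts=True)
size_by_cluster = {int(c): int(n) for c, n in zip(cluster_ids, cluster_sizes)}
print(size_by_cluster)
```

A result with one very large cluster and several tiny ones would suggest that the chosen bandwidth mostly separates out outliers rather than finding balanced groups.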