# Predicting gender of movie directors using SSA baby names data
*Julie Nguyen, PhD candidate in Management (Organizational Behavior), McGill University*

Hello and welcome to the third chapter of our exploration into the interplay between social networks, gender, and career success in the film industry. Specifically, we aim to discern how connections with network brokers (those pivotal individuals who bridge separate creative circles in the industry) affect the career trajectories of women and men movie directors differently.

**What are we aiming to do?**

In our previous work, we identified a group of directors who began their careers between 2003 and 2013 (`Phase_1_Tracking_Movie_Directors_Career.ipynb`) and constructed networks to measure the social influence of their collaborators (`Phase_2_Constructing_Filmmaker_Network.ipynb`). Now, we turn our focus toward another fundamental aspect of our analysis: predicting the gender of these directors.

**How do we do it?**

To accomplish this, we will utilize the baby names dataset from the US Social Security Administration (SSA), which compiles names and their associated genders from national registration data. This resource allows us to infer the most likely gender associated with a director's first name, based on the proportion of male and female registrations for that name. For example, if over 95% of the individuals registered with a certain first name are female, we will predict the director's gender as female, and similarly for males. We'll also devise strategies for handling cases with initials or compound names.

It's important to acknowledge that while this project uses binary gender classifications for simplicity, gender identity is indeed a diverse spectrum that extends beyond just male and female. This binary approach is used purely for analytical simplicity and is not intended as a statement negating the broader spectrum of gender identities.

**Looking ahead**

By the end of this notebook, we'll have established the predicted gender for each director in our sample. This will lay the groundwork for our next phase, where we will explore how the gender of a director might influence the benefits they gain from their network connections, particularly from those individuals who serve as bridges in the industry.

# Getting directors' first name

To predict the gender of the directors, we first need to get their first names. For this, we can use IMDb's data on the names of people in the film industry.

Let's load the necessary libraries and the dataset we created in a previous notebook (`Phase_1_Tracking_Movie_Directors_Career.ipynb`), which contains the full filmography of our directors, and see what the data looks like.

In [1]:
# Import essential libraries 
import pandas as pd # data manipulation
import matplotlib.pyplot as plt # data visualization
import os # interacting with the operating system, such as file paths
import requests # making HTTP requests to download files from the internet
import io # byte stream handling when dealing with binary data like files from a request
import zipfile # extracting files from ZIP archives
from unidecode import unidecode # normalizing unicode characters in strings to their ASCII counterparts

# Set the working directory to the location of project files
os.chdir('/Users/mac/Library/CloudStorage/OneDrive-McGillUniversity/Work/Projects/Gender and brokerage/WomenLeaders_SocialNetworks')

In [3]:
# Load the dataset containing movie directors' filmography from 2003 to 2023
directors_full_filmography = pd.read_csv('directors_full_filmography.csv')

# Display the first few rows of the data to understand the data structure
directors_full_filmography.head()

Unnamed: 0,tconst,startYear,genres,nconst,firstYear,averageRating,numVotes
0,tt0108549,2004.0,"Comedy,Mystery",nm1131265,2004.0,7.8,34.0
1,tt0108549,2004.0,"Comedy,Mystery",nm1130611,2004.0,7.8,34.0
2,tt0117461,2003.0,"Comedy,Romance",nm0290651,2003.0,6.3,24.0
3,tt0117743,2008.0,"Drama,Romance",nm0404033,2003.0,6.7,64.0
4,tt0118141,2005.0,Drama,nm0000417,2005.0,5.3,950.0


In the `directors_full_filmography` data, each row pairs a movie (`tconst`) with one of the 63,169 directors (`nconst`) in our sample, covering every film they made from their debut until 2023.

Next, we download IMDb's `name.basics.tsv.gz` data, which contains filmmakers' names, and merge it with our existing directors data. This creates a dataframe called `directors_name` containing the names of the 63,169 directors in our sample.

In [4]:
# Define the URL for the IMDb dataset that contains names of film industry professionals
url_name_basics = 'https://datasets.imdbws.com/name.basics.tsv.gz'
# Load selected columns ('nconst' for IMDb's unique identifier and 'primaryName' for the person's name) from the dataset
name_basics_df = pd.read_csv(url_name_basics, sep='\t', compression='gzip', encoding='utf-8', low_memory=False, usecols=['nconst', 'primaryName'])

# Merge the filmography dataset with the IMDb names dataset on the 'nconst' column to associate each director with their name
directors_name = directors_full_filmography[['nconst']].drop_duplicates().merge(name_basics_df, on='nconst', how='left')

Let's see what the data looks like. 

In [126]:
directors_name

Unnamed: 0,nconst,primaryName
0,nm1131265,Steve Ashlee
1,nm1130611,Valerie Silver
2,nm0290651,Val Franco
3,nm0404033,Halfdan Hussey
4,nm0000417,Crispin Glover
...,...,...
63164,nm4394847,Krishna Maaya
63165,nm10527671,Anwar Shahadat
63166,nm10532602,Patricia Gérimont
63167,nm4453202,Jean Claude Taburiaux


We now have the full names of our directors. The next step is to extract and standardize their first names so that we can link them with gender information later on. For this task, we use `unidecode` to convert each name to ASCII, which standardizes the data and reduces mismatches caused by accented characters or encoding differences. We then split the full name on whitespace and take the first element as the director's first name.

In [127]:
# Ensure the 'primaryName' column is treated as a string to avoid any type-related issues during processing
directors_name['primaryName'] = directors_name['primaryName'].astype(str)

# Normalize the director's first names to ensure consistency, important for non-ASCII names
# The 'unidecode' function converts unicode characters to their closest ASCII representation
# Splitting the 'primaryName' by space and selecting the first element isolates the first name
directors_name['director_first_name'] = directors_name['primaryName'].apply(unidecode).apply(lambda x: x.split()[0])

In [128]:
directors_name

Unnamed: 0,nconst,primaryName,director_first_name
0,nm1131265,Steve Ashlee,Steve
1,nm1130611,Valerie Silver,Valerie
2,nm0290651,Val Franco,Val
3,nm0404033,Halfdan Hussey,Halfdan
4,nm0000417,Crispin Glover,Crispin
...,...,...,...
63164,nm4394847,Krishna Maaya,Krishna
63165,nm10527671,Anwar Shahadat,Anwar
63166,nm10532602,Patricia Gérimont,Patricia
63167,nm4453202,Jean Claude Taburiaux,Jean
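
The extraction step can be tried on a handful of names before running it on the full table. The sketch below is illustrative only: it swaps in the standard library's `unicodedata` for `unidecode` so that it runs without extra packages, and `to_ascii` and `extract_first_name` are hypothetical helpers, not part of the notebook. Unlike `unidecode`, this approach simply drops characters that have no ASCII decomposition (for example `ø`), and it returns `None` for an empty name instead of raising an error.

```python
import unicodedata

def to_ascii(text):
    """Strip accents by decomposing characters and dropping non-ASCII code points."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def extract_first_name(full_name):
    """Return the first whitespace-separated token of the ASCII-folded name, or None."""
    parts = to_ascii(full_name).split()
    return parts[0] if parts else None

# Names taken from the tables in this notebook, plus an empty string as an edge case
for name in ["Patricia Gérimont", "Jacques-Rémy Girerd", "L. James Langlois", ""]:
    print(repr(name), "->", extract_first_name(name))
```

Note that initials (`L.`) and hyphenated names (`Jacques-Remy`) come through unchanged, which is why they need the separate handling described later in this notebook.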


# Gender classification with SSA baby names data

With the directors' first names in hand, we can now use the Social Security Administration (SSA) baby names dataset to predict the gender of movie directors. This dataset records, for each year from 1880 onward, how many Social Security card applications for US births were filed under each name and sex, offering a comprehensive overview of how names are distributed across genders.

This dataset is stored in a ZIP archive, with individual `.txt` files representing different years. We'll download this data using its URL and extract its contents using Python's `requests` and `zipfile` modules. We then combine all yearly files into a single DataFrame.

In [15]:
# URL for the Social Security Administration baby names dataset
url_SSA = "https://www.ssa.gov/oact/babynames/names.zip"  

# Download and unzip the SSA dataset using HTTP requests and zipfile handling
response = requests.get(url_SSA)
response.raise_for_status()  # Stop early with a clear error if the download failed
zip_file = zipfile.ZipFile(io.BytesIO(response.content))

# Collect one DataFrame per year, then concatenate once at the end
yearly_frames = []

# Loop through each file within the zip archive, which contains names data for individual years
for filename in zip_file.namelist():
    if filename.endswith('.txt'):
        # Open the current file and read its content into a DataFrame
        with zip_file.open(filename) as file:
            year_data = pd.read_csv(file, names=['name', 'gender', 'count'])

        # Extract the year from the filename (e.g., 'yob2023.txt' -> 2023) and add it as a new column
        year_data['year'] = int(filename[3:7])
        yearly_frames.append(year_data)

# Combine the data for all years into a single DataFrame
all_years_ssa_data = pd.concat(yearly_frames, ignore_index=True)


In [16]:
# Display the first few rows to check the combined data structure
all_years_ssa_data.head()

Unnamed: 0,name,gender,count,year
0,Mary,F,7065,1880
1,Anna,F,2604,1880
2,Emma,F,2003,1880
3,Elizabeth,F,1939,1880
4,Minnie,F,1746,1880
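
To check the file-reading logic without downloading anything, the same loop can be run against a small archive built in memory. This is a minimal sketch under stated assumptions: the archive contents and counts are made up, `make_demo_zip` and `read_yearly_files` are hypothetical helpers, and the year is parsed with a regular expression rather than the fixed slice used above, so that files not named `yobYYYY.txt` are skipped.

```python
import csv
import io
import re
import zipfile

def make_demo_zip():
    """Build a tiny in-memory archive shaped like the SSA one (made-up counts)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("yob1880.txt", "Mary,F,10\nJohn,M,12\n")
        archive.writestr("yob1881.txt", "Mary,F,9\n")
        archive.writestr("README.pdf", "not a data file")  # should be ignored
    buffer.seek(0)
    return buffer

def read_yearly_files(zip_bytes):
    """Return (name, gender, count, year) tuples from every yobYYYY.txt file."""
    rows = []
    with zipfile.ZipFile(zip_bytes) as archive:
        for filename in archive.namelist():
            match = re.fullmatch(r"yob(\d{4})\.txt", filename)
            if match is None:
                continue  # skip anything that is not a yearly data file
            year = int(match.group(1))
            with archive.open(filename) as handle:
                reader = csv.reader(io.TextIOWrapper(handle, encoding="utf-8"))
                for name, gender, count in reader:
                    rows.append((name, gender, int(count), year))
    return rows

rows = read_yearly_files(make_demo_zip())
print(rows)
```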


We now have the number of times a name was associated with a particular gender in a given year. Next, we group this data by name and gender, summing the occurrences to obtain a total count for each name-gender pair across all years. Then let's generate summary statistics for the counts to get a sense of the distribution of names in the dataset. 

In [17]:
# Aggregate the combined data by name and gender, summing the occurrences across all years to get a total count per name-gender combination
aggregated_ssa_data = all_years_ssa_data.groupby(['name', 'gender']).sum().reset_index().drop(columns=['year'])

# Set pandas display options to format float numbers in a more readable way
pd.set_option('display.float_format', '{:.2f}'.format)
# Display summary statistics for the count column to understand the distribution of name occurrences
aggregated_ssa_data[['count']].describe()

Unnamed: 0,count
count,113882.0
mean,3207.67
std,51507.72
min,5.0
25%,11.0
50%,46.0
75%,243.0
max,5214844.0


So, there are 113,882 unique name-gender pairs in the dataset. Total counts range from 5 (the SSA omits names with fewer than 5 occurrences in a given year) to more than 5 million. Three quarters of the name-gender pairs appear 243 times or fewer and the median is only 46, while the mean is about 3,208. In other words, a relatively small number of names are incredibly popular, while the majority form a long tail of much less common names.

Now, let's move on to classifying the gender of the names. We'll consider a person as a woman (man) only if more than 95% of individuals with the same first name are women (men). To do so, we calculate the male and female ratios for each name and classify the name as 'Male' or 'Female' if the ratio for that gender is above 95%; otherwise, it is 'Ambiguous'.

In [129]:
# Aggregate the data by name and gender, summing up the counts to get a total count for each name-gender combination
name_gender_counts = aggregated_ssa_data.groupby(['name', 'gender']).agg(total_count=('count', 'sum')).reset_index()

# Pivot the data so that each row represents a unique name, with separate columns for male and female counts
name_gender_counts = name_gender_counts.pivot(index='name', columns='gender', values='total_count').fillna(0)

# Calculate the total counts by adding male and female counts together, and compute the ratio of each gender to the total
# This helps in determining the predominant gender association for each name
name_gender_counts['total'] = name_gender_counts.sum(axis=1)
name_gender_counts['male_ratio'] = name_gender_counts['M'] / name_gender_counts['total']
name_gender_counts['female_ratio'] = name_gender_counts['F'] / name_gender_counts['total']

# Define a function to classify the gender based on the calculated ratios
# A name is classified as 'Male' or 'Female' if the ratio for that gender is above 95%, otherwise, it is 'Ambiguous'
def classify_gender(row):
    if row['male_ratio'] > 0.95:
        return 'Male'
    elif row['female_ratio'] > 0.95:
        return 'Female'
    else:
        return 'Ambiguous'

# Apply the gender classification function to each row in the dataset
name_gender_counts['predicted_gender'] = name_gender_counts.apply(classify_gender, axis=1)

# Reset the index to turn the names from an index into a column again, so that this data can be merged with the directors data
name_gender_counts = name_gender_counts.reset_index()

Let's take a look at the data.

In [130]:
name_gender_counts

gender,name,F,M,total,male_ratio,female_ratio,predicted_gender
0,Aaban,0.00,127.00,127.00,1.00,0.00,Male
1,Aabha,56.00,0.00,56.00,0.00,1.00,Female
2,Aabid,0.00,16.00,16.00,1.00,0.00,Male
3,Aabidah,5.00,0.00,5.00,0.00,1.00,Female
4,Aabir,0.00,19.00,19.00,1.00,0.00,Male
...,...,...,...,...,...,...,...
102444,Zyvion,0.00,5.00,5.00,1.00,0.00,Male
102445,Zyvon,0.00,7.00,7.00,1.00,0.00,Male
102446,Zyyanna,6.00,0.00,6.00,0.00,1.00,Female
102447,Zyyon,0.00,6.00,6.00,1.00,0.00,Male
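
To make the threshold concrete, the rule can be written as a small standalone function and tried on a few counts. This is only a sketch mirroring `classify_gender` above: `classify_by_ratio` is a hypothetical helper, the first two inputs come from the table above (`Aabha` and `Aaban`), and the other counts are made up to show the boundary. Because the comparison is strict, a name that is exactly 95% female stays 'Ambiguous'.

```python
def classify_by_ratio(female_count, male_count, threshold=0.95):
    """Classify a name from its female and male counts using a strict ratio threshold."""
    total = female_count + male_count
    if total == 0:
        return None  # no registrations at all, so nothing to classify
    if male_count / total > threshold:
        return "Male"
    if female_count / total > threshold:
        return "Female"
    return "Ambiguous"

examples = {
    "Aabha (56 F, 0 M)": (56, 0),
    "Aaban (0 F, 127 M)": (0, 127),
    "made-up name (96 F, 4 M)": (96, 4),
    "made-up name (95 F, 5 M)": (95, 5),
}
for label, (female, male) in examples.items():
    print(label, "->", classify_by_ratio(female, male))
```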


Everything looks great! Each name is now associated with a gender group. 

# Gender prediction

Next, let's apply the gender prediction we just created to our dataset of directors' first names to predict their genders.

In [131]:
# Merge the gender prediction results with the directors dataset based on the directors' first names
directors_name = directors_name.merge(name_gender_counts[['name', 'predicted_gender']], left_on='director_first_name', right_on='name', how='left')

Let's see what our directors data looks like now.

In [132]:
directors_name

Unnamed: 0,nconst,primaryName,director_first_name,name,predicted_gender
0,nm1131265,Steve Ashlee,Steve,Steve,Male
1,nm1130611,Valerie Silver,Valerie,Valerie,Female
2,nm0290651,Val Franco,Val,Val,Ambiguous
3,nm0404033,Halfdan Hussey,Halfdan,,
4,nm0000417,Crispin Glover,Crispin,Crispin,Male
...,...,...,...,...,...
63164,nm4394847,Krishna Maaya,Krishna,Krishna,Ambiguous
63165,nm10527671,Anwar Shahadat,Anwar,Anwar,Male
63166,nm10532602,Patricia Gérimont,Patricia,Patricia,Female
63167,nm4453202,Jean Claude Taburiaux,Jean,Jean,Ambiguous


So, some directors have a predicted gender, but others do not because their first names are not in the SSA dataset. Let's see how many directors fall into each predicted category.

In [133]:
# Count the frequency of each predicted gender category among the directors
directors_name['predicted_gender'].value_counts()

predicted_gender
Male         38705
Female       10311
Ambiguous     4860
Name: count, dtype: int64

Among the 63,169 directors, 53,876 (85%) have a name match with the SSA data and 9,293 do not. Some directors lack a direct match because they use initials in place of a first name or because they have compound first names. We can analyze the components of these names more closely to improve gender predictions for these directors. Specifically:
- Directors using initials may have a more gender-revealing middle name. For these directors, we extract and analyze their middle names.
- For directors with compound first names, often connected by hyphens, we can analyze each component of the compound name individually. When direct matches fail, we consider the concatenated form of the compound name for gender prediction.

First, let's create a separate dataset for directors without a gender prediction and work off that dataset.

In [135]:
# Separate directors with exact gender matches from those without a clear prediction
directors_exact_match = directors_name[directors_name['predicted_gender'].notna()]
directors_no_match = directors_name[directors_name['predicted_gender'].isna()]

In [148]:
directors_no_match

Unnamed: 0,nconst,primaryName,director_first_name,name,predicted_gender
3,nm0404033,Halfdan Hussey,Halfdan,,
50,nm0245315,MarieAnna Dvorak,MarieAnna,,
73,nm0046007,Aysun Bademsoy,Aysun,,
89,nm0486587,L. James Langlois,L.,,
99,nm0865379,Nickolay Todorov,Nickolay,,
...,...,...,...,...,...
63114,nm10477444,Sergej Aleksandrov,Sergej,,
63127,nm6512505,Dinakshie Priyasad,Dinakshie,,
63129,nm10464524,Sumith Galhena,Sumith,,
63133,nm10500950,Ranjith Jayasinghe,Ranjith,,
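
Before handling each case, it can help to see how the unmatched first names split across the two situations described above. The sketch below is illustrative: `unmatched_reason` is a hypothetical helper, and the sample list reuses first names from the tables in this notebook rather than the full dataset.

```python
from collections import Counter

def unmatched_reason(first_name):
    """Label an unmatched first name as an initial, a compound name, or something else."""
    if first_name.endswith("."):
        return "initial"
    if "-" in first_name:
        return "compound"
    return "other"

sample = ["Halfdan", "MarieAnna", "Aysun", "L.", "Jean-Baptiste", "Woon-hak"]
reason_counts = Counter(unmatched_reason(name) for name in sample)
print(reason_counts)
```

In the notebook itself, the same helper could be applied with `directors_no_match['director_first_name'].apply(unmatched_reason).value_counts()`. Names in the "other" group, such as `Halfdan` or `Aysun`, are simply absent from the SSA data and are not recovered by either strategy.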


Now, let's write a function to identify the middle name of directors who use an initial as their first name. The function checks whether the first part of a name ends with a period, indicating an initial, and returns the second part of the name if the full name has more than two parts. Then we apply this function to our dataset of directors with no name match and create a separate dataset for directors using initials called `directors_initials_middle_match`.

In [143]:
# Define a function to extract a middle name when the director's name includes initials
def extract_middle_name(full_name):
    parts = full_name.split()
    # Check if the first part ends with a period, indicating an initial
    if len(parts) > 2 and parts[0].endswith('.'):
        return parts[1]  # Return the middle name if an initial is detected
    return None

# Select the directors whose first names are initials and extract their middle names for gender prediction
# (.copy() makes the subset an independent DataFrame, avoiding pandas' SettingWithCopyWarning on assignment)
directors_initials = directors_no_match[directors_no_match['director_first_name'].str.endswith('.')].copy()
directors_initials['middle_name'] = directors_initials['primaryName'].apply(extract_middle_name)
directors_initials_middle_match = directors_initials[directors_initials['middle_name'].notna()]

Let's see the results of our middle name extraction.

In [152]:
directors_initials_middle_match

Unnamed: 0,nconst,primaryName,director_first_name,name,predicted_gender,middle_name
89,nm0486587,L. James Langlois,L.,,,James
127,nm0808527,J. Jesses Smith,J.,,,Jesses
603,nm1852081,J. Marshall Craig,J.,,,Marshall
652,nm1281881,A. Blaine Miller,A.,,,Blaine
724,nm0184947,C. Jay Cox,C.,,,Jay
...,...,...,...,...,...,...
61506,nm8785527,S.D. Shri Ram Vasu,S.D.,,,Shri
61896,nm9131304,R. Allsion Jr.,R.,,,Allsion
62691,nm9978225,Hari. k. Chanduri,Hari.,,,k.
62730,nm10020812,P. Vijay Varma,P.,,,Vijay
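
Note that the cell above fills in `middle_name` but leaves `predicted_gender` empty, as the output shows. One possible way to turn an extracted middle name into a prediction is sketched below. It is an assumption about how this step could look, not the notebook's own code: `predict_from_middle_name` is a hypothetical helper and `toy_lookup` holds illustrative values rather than results from the SSA data. Middle names that are themselves initials (such as `k.`) or that are missing from the lookup stay unclassified.

```python
def predict_from_middle_name(middle_name, lookup):
    """Look up a middle name in a name-to-gender mapping, skipping initials."""
    if middle_name is None:
        return None
    cleaned = middle_name.strip(".")
    if len(cleaned) < 2:
        return None  # a one-letter token is another initial, not a name
    return lookup.get(cleaned)

# Illustrative mapping only; in the notebook this would come from name_gender_counts
toy_lookup = {"James": "Male", "Jay": "Male", "Blaine": "Ambiguous"}

for middle in ["James", "Jesses", "Blaine", "k.", None]:
    print(repr(middle), "->", predict_from_middle_name(middle, toy_lookup))
```

In pandas, a comparable lookup could be written by mapping the `middle_name` column through a name-to-gender dictionary built from `name_gender_counts` with `Series.map`, which leaves unmatched names as missing values.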


That covers the middle names of directors using initials. Now let's work on directors with compound first names. First, we create a separate dataset for these directors called `directors_compound` based on whether their first name contains a hyphen (-).

In [159]:
# Identify directors with compound first names 
directors_compound = directors_no_match[directors_no_match['director_first_name'].str.contains('-')].copy()  # copy so a column can be assigned later without a SettingWithCopyWarning

In [160]:
directors_compound

Unnamed: 0,nconst,primaryName,director_first_name,name,predicted_gender
182,nm1008121,Sigur-Björn,Sigur-Bjorn,,
311,nm1125131,Jean-Baptiste Andrea,Jean-Baptiste,,
657,nm1417409,Woon-hak Baek,Woon-hak,,
686,nm9426285,Hyeon-jeong Kim,Hyeon-jeong,,
759,nm1305425,Kyeong-hyeong Kim,Kyeong-hyeong,,
...,...,...,...,...,...
62772,nm10046917,Ho-jae Lee,Ho-jae,,
62819,nm0134901,Jean-Louis Cap,Jean-Louis,,
62821,nm10091782,Kap-Jong Park,Kap-Jong,,
62946,nm4170371,Karl-Heinz Klopf,Karl-Heinz,,


For these cases, we split the compound first name at the hyphen and look up the predicted gender of each part. If every part has a match in the SSA data, the name is classified as male (female) when at least one part is predicted male (female) and no part is predicted as the opposite gender; otherwise, it is classified as ambiguous. If any part lacks a match, we instead look up the concatenated form, built by joining the first part with the lowercased second part (so `Kyeong-hyeong` would be tried as `Kyeonghyeong`). If that form is not in the SSA data either, the director remains unclassified.

In [161]:
# Use the previously aggregated gender predictions (name_gender_counts) to create a dictionary for easier lookup.
name_gender_dict = name_gender_counts.set_index('name')['predicted_gender'].to_dict()

# Define a function to handle compound names by trying to predict gender based on individual parts of the name.
def classify_compound_name(name):
    parts = name.split('-')
    genders = []
    
    # Attempt to match each part individually
    for part in parts:
        gender = name_gender_dict.get(part)
        if gender:
            genders.append(gender)
    
    # If any part does not have a match, try the compound name without the hyphen
    if len(genders) != len(parts):
        # Turn the second part of the compound name to lowercase, join without the hyphen
        joined_name = parts[0] + parts[1].lower() if len(parts) > 1 else parts[0]
        gender = name_gender_dict.get(joined_name, 'NA')  # Attempt to find a match for the joined name
        return gender  # Return the found gender or 'NA'
    
    # All parts matched: return 'Male' or 'Female' if at least one part has that gender and no part has the opposite one
    # (parts classified as 'Ambiguous' are ignored); otherwise mark the name as 'Ambiguous'
    if genders.count('Male') > 0 and genders.count('Female') == 0:
        return 'Male'
    elif genders.count('Female') > 0 and genders.count('Male') == 0:
        return 'Female'
    else:
        return 'Ambiguous'

# Apply the compound name classification logic.
directors_compound['predicted_gender'] = directors_compound['director_first_name'].apply(classify_compound_name)
directors_compound_names_match = directors_compound[directors_compound['predicted_gender'].isin(['Male', 'Female', 'Ambiguous'])]

Let's take a look at the results of our code, which gives us a dataset of directors with compound names and a gender prediction.

In [177]:
directors_compound_names_match

Unnamed: 0,nconst,primaryName,director_first_name,name,predicted_gender
311,nm1125131,Jean-Baptiste Andrea,Jean-Baptiste,,Male
774,nm1158974,Jean-Luc François,Jean-Luc,,Male
794,nm1310022,Jean-Luc Getreau,Jean-Luc,,Male
807,nm0135269,Loren-Paul Caplin,Loren-Paul,,Male
1165,nm0320868,Jacques-Rémy Girerd,Jacques-Remy,,Male
...,...,...,...,...,...
62744,nm0732199,Jean-François Robin,Jean-Francois,,Male
62819,nm0134901,Jean-Louis Cap,Jean-Louis,,Male
62821,nm10091782,Kap-Jong Park,Kap-Jong,,Male
62946,nm4170371,Karl-Heinz Klopf,Karl-Heinz,,Male
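
The decision rule applied once every part of a compound name has a match can be isolated and checked on its own. The sketch below mirrors the final branch of `classify_compound_name`; `combine_part_genders` is a hypothetical helper introduced only for illustration. It makes one consequence of the rule explicit: a part classified as 'Ambiguous' does not prevent a 'Male' or 'Female' result, whereas parts with opposite genders do.

```python
def combine_part_genders(part_genders):
    """Combine the predicted genders of the parts of a compound first name."""
    has_male = "Male" in part_genders
    has_female = "Female" in part_genders
    if has_male and not has_female:
        return "Male"
    if has_female and not has_male:
        return "Female"
    return "Ambiguous"

cases = [
    ["Male", "Male"],
    ["Male", "Ambiguous"],
    ["Female", "Ambiguous"],
    ["Male", "Female"],
    ["Ambiguous", "Ambiguous"],
]
for part_genders in cases:
    print(part_genders, "->", combine_part_genders(part_genders))
```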


Looks great! Now, let's put everything together by combining the datasets on directors with an exact name match (`directors_exact_match`), directors using initials for whom we extracted a middle name (`directors_initials_middle_match`), and directors with compound names that received a gender prediction (`directors_compound_names_match`).

In [178]:
# Combine all directors with gender predictions, including those corrected by handling initials and compound names.
directors_gender = pd.concat([
    directors_exact_match[['nconst', 'primaryName', 'predicted_gender']],
    directors_initials_middle_match[['nconst', 'primaryName', 'predicted_gender']],
    directors_compound_names_match[['nconst', 'primaryName', 'predicted_gender']]
], ignore_index=True)

Let's take a look at this combined data.

In [182]:
directors_gender

Unnamed: 0,nconst,primaryName,predicted_gender
0,nm1131265,Steve Ashlee,Male
1,nm1130611,Valerie Silver,Female
2,nm0290651,Val Franco,Ambiguous
3,nm0000417,Crispin Glover,Male
4,nm0818538,Keith Spiegel,Male
...,...,...,...
54648,nm0732199,Jean-François Robin,Male
54649,nm0134901,Jean-Louis Cap,Male
54650,nm10091782,Kap-Jong Park,Male
54651,nm4170371,Karl-Heinz Klopf,Male


Now we have data on the ID, full name, and predicted gender of 54,653 directors in our sample. Let's see how many directors are in each predicted gender group. 

In [181]:
# Count the distribution of predicted genders
directors_gender['predicted_gender'].value_counts()

predicted_gender
Male         39016
Female       10427
Ambiguous     4947
Name: count, dtype: int64

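So among the 63,169 directors in our sample, 54,653 are in the combined dataset, and 54,390 of them (39,016 + 10,427 + 4,947) have a predicted gender. The remaining 263 rows are directors using initials: their middle names were extracted, but `predicted_gender` was never filled in for them, and `value_counts()` skips missing values by default. Among directors with a prediction, 72% are predicted as men, 19% as women, and 9% are categorized as ambiguous since their first names are not predominantly associated with one gender group.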

Finally, let's save this data for future analysis.

In [183]:
# Save the final dataset with directors' gender predictions to a CSV file
directors_gender.to_csv("directors_gender.csv", index=False)  # index=False avoids writing the row index as an extra unnamed column