# Tracing directors' career journeys using time series data
*Julie Nguyen, PhD candidate in Management (Organizational Behavior), McGill University*

Welcome to another chapter of our exploration into the world of social networks and their impact on movie directors' careers. So far, we've identified our sample of directors (`Phase_1_Tracking_Movie_Directors_Career.ipynb`), mapped the collaboration networks within the film industry and calculated the annual social capital of every creative worker (`Phase_2_Constructing_Filmmaker_Network.ipynb`), and predicted the genders of our directors (`Phase_3_Predicting_Director_Gender.ipynb`).

**What are we aiming to do?**

Now, we shift our focus to constructing a detailed time series dataset. This dataset will trace the career journey of each director year by year, from their debut until the present year, 2023.

**How do we do this?**

1. Constructing time series data: We'll begin by creating a dataset where each row represents a director's career in a given year, from the year their first film was released to 2023. This approach allows us to observe the evolution of each director’s career on a year-by-year basis.

2. Incorporating outcome variables: We will integrate two crucial variables into our dataset:
- Annual directorial engagement: This variable indicates whether a director has actively directed a movie in a given year. It serves as a primary indicator of whether they are still active in the industry.
- Dropout indicator: This variable indicates whether a director has potentially ceased directing, defined as having not directed any movie in the last 10 years. This variable identifies the point at which a director may have stepped away from their directing career.

3. Incorporating survival analysis variables: To further enrich our analysis, we'll include start_time and stop_time for each director's career relative to their debut. These are key for survival analysis, which will allow us to study how long directors remain active and the timing of any career transitions. By basing these times on each director’s debut, we focus on the duration and phases of their careers without being constrained by specific calendar years.

**Looking ahead**

By the end of this notebook, we’ll have developed a comprehensive dataset that tracks each director's career progression year-by-year, along with indicators that reflect their ongoing involvement and longevity in the film industry. This preparation sets the stage for our forthcoming analyses, where we'll examine how various factors—such as network connections and gender—may influence the trajectories of these careers.

# Constructing a time series dataset 

Our first task is to create a time series dataset, which we'll call `directors_years`, where each row represents a director in a given year, from their debut through 2023.

We begin by loading the necessary libraries and a dataset created in a previous notebook (`Phase_1_Tracking_Movie_Directors_Career.ipynb`) called `directors_full_filmography`. This data includes information on all the movies made by the 63,169 directors in our sample, from their debut year to 2023. Let's load this data and see what it looks like.

In [1]:
# Importing necessary libraries for data manipulation and handling
import pandas as pd  # data manipulation
import os  # interacting with the operating system, such as file paths

# Set the working directory to where the project files are located
os.chdir('/Users/mac/Library/CloudStorage/OneDrive-McGillUniversity/Work/Projects/Gender and brokerage/WomenLeaders_SocialNetworks')

In [2]:
# Load the filmography dataset for movie directors (2003-2023)
directors_full_filmography = pd.read_csv('directors_full_filmography.csv')

# Display the initial rows of the dataset to understand its structure
directors_full_filmography.head()

Unnamed: 0,tconst,startYear,genres,nconst,firstYear,averageRating,numVotes
0,tt0108549,2004.0,"Comedy,Mystery",nm1131265,2004.0,7.8,34.0
1,tt0108549,2004.0,"Comedy,Mystery",nm1130611,2004.0,7.8,34.0
2,tt0117461,2003.0,"Comedy,Romance",nm0290651,2003.0,6.3,24.0
3,tt0117743,2008.0,"Drama,Romance",nm0404033,2003.0,6.7,64.0
4,tt0118141,2005.0,Drama,nm0000417,2005.0,5.3,950.0


In the `directors_full_filmography` data, each row is a movie (`tconst`) made by a director (`nconst`) in our sample. We'll use this data to construct a time series dataset that captures each director's career progression by year. We'll call this dataset `directors_years`.

For this task, we'll first define the temporal range of our study as 2003 to 2023, a span that captures the career history of the directors in our sample, who debuted between 2003 and 2013. Ending the observation period at 2023 ensures that we have a long-term view (at least 10 years) of each director's career after their debut. Next, we generate all possible director-year combinations. Since debut years differ across directors, we then use each director's debut year to filter these combinations so that, for each director, the data only includes years from their debut onward.
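
To make the logic concrete before running it on the full sample, here is a minimal sketch on toy data. The directors `nmA` and `nmB`, their movies, and the `toy_` variable names are hypothetical and only serve as an illustration. The sketch uses a cross join (`how='cross'`, available in pandas 1.2 and later), which is one possible way to build the director-year grid and not necessarily the approach used in this notebook.

```python
import pandas as pd

# Hypothetical toy filmography: two directors with different debut years
toy_filmography = pd.DataFrame({
    'nconst': ['nmA', 'nmA', 'nmB'],
    'tconst': ['tt1', 'tt2', 'tt3'],
    'startYear': [2003, 2010, 2012],
    'firstYear': [2003, 2003, 2012],
})

# One row per director with their debut year
debuts = toy_filmography[['nconst', 'firstYear']].drop_duplicates()

# One row per calendar year in the observation window (2003-2023)
years = pd.DataFrame({'year': range(2003, 2024)})

# The cross join yields every director-year pair
toy_directors_years = debuts.merge(years, how='cross')

# Keep only the years from each director's debut onward
keep = toy_directors_years['year'] >= toy_directors_years['firstYear']
toy_directors_years = toy_directors_years[keep]

# Number of rows per director: 2003-2023 for nmA, 2012-2023 for nmB
rows_per_director = toy_directors_years.groupby('nconst').size()
print(rows_per_director)
```

A director who debuted in 2003 ends up with 21 rows and one who debuted in 2012 with 12 rows, which is the shape we expect from the full dataset as well.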

In [3]:
# Convert the 'startYear' column to integer values 
directors_full_filmography['startYear'] = directors_full_filmography['startYear'].astype(int)

# Identify unique directors and define the range of years for our analysis (2003-2023)
# This range is chosen based on the debut years of directors being studied (2003-2013), extending to 2023 for a long-term look at their careers
directors = directors_full_filmography['nconst'].unique()
years = range(2003, 2024) 

# Generate all possible director-year combinations within the specified range
# This creates a framework to examine each director's career on a yearly basis
directors_years = pd.DataFrame([(director, year) for director in directors for year in years], columns=['nconst', 'year'])

# Merge the generated combinations with the original filmography data to include each director's debut year
# This allows us to filter combinations to only include years from each director's debut onward
directors_years = directors_years.merge(directors_full_filmography[['nconst', 'firstYear']].drop_duplicates(), on='nconst', how='left')

# Filter out rows that represent years before a director's debut, ensuring the dataset only contains relevant director-year pairs
directors_years = directors_years[directors_years['year'] >= directors_years['firstYear']].drop(columns=['firstYear'])
directors_years.rename(columns={'nconst': 'nconst_director'}, inplace=True)

# Identify the debut year for each director
debut_years = directors_years.groupby('nconst_director')['year'].min().reset_index()
debut_years.rename(columns={'year': 'debut_year'}, inplace=True)

# Merge the debut year back into the directors_years dataset to add a debut_year column
directors_years = pd.merge(directors_years, debut_years, on='nconst_director', how='left')

Let's see what the data looks like.

In [4]:
directors_years.head(30)

Unnamed: 0,nconst_director,year,debut_year
0,nm1131265,2004,2004
1,nm1131265,2005,2004
2,nm1131265,2006,2004
3,nm1131265,2007,2004
4,nm1131265,2008,2004
5,nm1131265,2009,2004
6,nm1131265,2010,2004
7,nm1131265,2011,2004
8,nm1131265,2012,2004
9,nm1131265,2013,2004


Everything is in order! Now we can move on to adding variables indicating career longevity (our outcome variables) to our time series dataset.

# Outcome variables 

In this segment of the analysis, we'll create two outcome variables and add them to the `directors_years` dataset:
- `made_movie`: indicating whether a director released at least one movie in a given year. An important note is the treatment of directors' debut years; here, `made_movie` is set to 0 to align with our study's focus on directors' activities beyond their initial entry into the film industry.
- `dropout`: indicating whether a director dropped out of directing in a given year.

Let's take a look at the `directors_full_filmography` data again to see how we can use it to create these variables. 

In [5]:
directors_full_filmography

Unnamed: 0,tconst,startYear,genres,nconst,firstYear,averageRating,numVotes
0,tt0108549,2004,"Comedy,Mystery",nm1131265,2004.0,7.8,34.0
1,tt0108549,2004,"Comedy,Mystery",nm1130611,2004.0,7.8,34.0
2,tt0117461,2003,"Comedy,Romance",nm0290651,2003.0,6.3,24.0
3,tt0117743,2008,"Drama,Romance",nm0404033,2003.0,6.7,64.0
4,tt0118141,2005,Drama,nm0000417,2005.0,5.3,950.0
...,...,...,...,...,...,...,...
128529,tt9916362,2020,"Drama,History",nm1893148,2008.0,6.4,5687.0
128530,tt9916538,2019,Drama,nm4457074,2011.0,8.6,7.0
128531,tt9916622,2015,Documentary,nm9272490,2012.0,,
128532,tt9916754,2013,Documentary,nm9272490,2012.0,,


So, we'll group the `directors_full_filmography` dataset by director ID (`nconst`) and the movie release year (`startYear`). For each group, we count the unique movie IDs (`tconst`), giving us the total number of movies released by each director in each year. Then, we'll add this information to our `directors_years` dataset and create a binary outcome variable, `made_movie`, to indicate whether a director released at least one movie in a given year. This provides a straightforward measure of a director's active engagement in directing during each year of their career.
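
As a small illustration of these steps, here is a sketch on toy data. The director `nmA`, the movie IDs, and the `toy_` names are hypothetical. It also shows the effect of resetting the debut year to 0, as described above.

```python
import pandas as pd

# Hypothetical toy inputs: one director who debuted in 2004
# and released two movies in 2006
toy_filmography = pd.DataFrame({
    'nconst': ['nmA', 'nmA', 'nmA'],
    'tconst': ['tt1', 'tt2', 'tt3'],
    'startYear': [2004, 2006, 2006],
})
toy_panel = pd.DataFrame({
    'nconst_director': ['nmA'] * 4,
    'year': [2004, 2005, 2006, 2007],
    'debut_year': [2004] * 4,
})

# Count distinct movies per director-year
counts = (
    toy_filmography.groupby(['nconst', 'startYear'])['tconst']
    .nunique()
    .reset_index()
    .rename(columns={'nconst': 'nconst_director',
                     'startYear': 'year',
                     'tconst': 'num_movies'})
)

# Left-merge onto the panel; years without a release become 0
toy_panel = toy_panel.merge(counts, on=['nconst_director', 'year'], how='left')
toy_panel['num_movies'] = toy_panel['num_movies'].fillna(0).astype(int)
toy_panel['made_movie'] = (toy_panel['num_movies'] > 0).astype(int)

# Reset the debut year so the variables only reflect post-debut activity
is_debut = toy_panel['year'] == toy_panel['debut_year']
toy_panel.loc[is_debut, ['num_movies', 'made_movie']] = 0

print(toy_panel)
```

In this toy panel, only 2006 ends up with `made_movie` equal to 1: the 2004 debut film is zeroed out, and 2005 and 2007 have no releases.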

In [6]:
# Aggregate the director's filmography data to count unique movies released by each director per year.
movies_released_per_year = directors_full_filmography.groupby(['nconst', 'startYear'])['tconst'].nunique().reset_index()

# Rename columns for clarity: director ID as 'nconst_director', release year as 'year', and number of movies as 'num_movies'.
movies_released_per_year.rename(columns={'nconst': 'nconst_director', 'startYear': 'year', 'tconst': 'num_movies'}, inplace=True)

# Merge the yearly movie release counts with the main dataset to include the number of movies each director released each year.
directors_years = directors_years.merge(movies_released_per_year, on=['nconst_director', 'year'], how='left')

# Replace missing values in 'num_movies' with 0, indicating no movies were released by the director in those years.
directors_years['num_movies'] = directors_years['num_movies'].fillna(0)

# Create a binary indicator 'made_movie' to denote whether the director released any movies in a given year (1 if yes, 0 if no).
directors_years['made_movie'] = (directors_years['num_movies'] > 0).astype(int)

# For the director's debut year, set 'num_movies' and 'made_movie' to 0 so that the variables reflect activity after debut
directors_years.loc[directors_years['year'] == directors_years['debut_year'], ['num_movies', 'made_movie']] = 0

Let's see what our time series data looks like now.

In [7]:
directors_years.head(25)

Unnamed: 0,nconst_director,year,debut_year,num_movies,made_movie
0,nm1131265,2004,2004,0.0,0
1,nm1131265,2005,2004,0.0,0
2,nm1131265,2006,2004,0.0,0
3,nm1131265,2007,2004,0.0,0
4,nm1131265,2008,2004,0.0,0
5,nm1131265,2009,2004,0.0,0
6,nm1131265,2010,2004,0.0,0
7,nm1131265,2011,2004,0.0,0
8,nm1131265,2012,2004,0.0,0
9,nm1131265,2013,2004,1.0,1


Looks great! Next, we'll move on to identifying career dropout.

For this, we'll classify directors based on their last movie release within our observation period, distinguishing between those potentially inactive (no releases in the last decade) and those still active. This classification helps us understand the points at which directors may have stepped back from directing.
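
The classification rule can be sketched compactly on toy data. The directors, the constant name `CUTOFF_YEAR`, and the `toy_` names are hypothetical. This version uses `Series.where` as an alternative to splitting the data into two groups, and it stores missing dropout years as `NaN` in a float column, which is an assumption of this sketch.

```python
import pandas as pd

# Directors whose last release is before this year are treated as
# potential dropouts (no releases from 2013 through 2023)
CUTOFF_YEAR = 2013

# Hypothetical toy data: last release year for three directors
toy_last = pd.DataFrame({
    'nconst': ['nmA', 'nmB', 'nmC'],
    'startYear': [2009, 2013, 2022],
})

# Series.where keeps the value where the condition holds and
# fills NaN elsewhere, so active directors get a missing dropout year
toy_last['dropoutYear'] = (toy_last['startYear'] + 1).where(
    toy_last['startYear'] < CUTOFF_YEAR
)

print(toy_last)
```

Here `nmA`, whose last movie came out in 2009, gets a dropout year of 2010, while `nmB` and `nmC` remain without one because they released a movie in 2013 or later.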

In [8]:
# Group by director ID and identify the latest year of movie release to determine the last active year for each director.
last_movie_year = directors_full_filmography.groupby('nconst')['startYear'].max().reset_index()

# Identify directors who haven't made any movies between 2013 and 2023, who are considered to have potentially dropped out
dropout_directors = last_movie_year[last_movie_year['startYear'] < 2013].copy()

# Set 'dropoutYear' as the year following their last movie release, marking when they might have left the industry.
dropout_directors['dropoutYear'] = dropout_directors['startYear'] + 1

# For directors whose last movie was released in 2013 or later, set 'dropoutYear' as NA, indicating they haven't dropped out within the study period.
active_directors = last_movie_year[last_movie_year['startYear'] >= 2013].copy()
active_directors['dropoutYear'] = pd.NA

# Combine the dataframes to get a complete view of all directors' potential dropout years.
all_directors_dropout = pd.concat([dropout_directors, active_directors], ignore_index=True).sort_values(by='nconst')

# Display the first and last few rows of the combined dataframe
all_directors_dropout

Unnamed: 0,nconst,startYear,dropoutYear
36527,nm0000083,2022,
0,nm0000136,2009,2010
1,nm0000147,2010,2011
36528,nm0000154,2016,
36529,nm0000155,2013,
...,...,...,...
63168,nm9986224,2021,
36523,nm9990558,2009,2010
36524,nm9990640,2007,2008
36525,nm9990734,2009,2010


Next, let's add this information to the `directors_years` dataset and create a binary indicator for director dropout events, called `dropout`, based on the dropout year we identified above. This variable marks the timing at which directors potentially cease their directing endeavors.
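
To illustrate how the indicator behaves, here is a minimal sketch on toy data. The directors and the `toy_` names are hypothetical, and the sketch assumes `dropoutYear` is stored as a float column with `NaN` for directors who have not dropped out. Under that assumption, comparisons against `NaN` evaluate to `False`, so no placeholder value is needed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy panel: nmA has a dropout year of 2010, nmB has none
toy_panel = pd.DataFrame({
    'nconst_director': ['nmA'] * 4 + ['nmB'] * 3,
    'year': [2008, 2009, 2010, 2011, 2021, 2022, 2023],
    'dropoutYear': [2010.0] * 4 + [np.nan] * 3,
})

# 1 for every year at or after the dropout year, 0 otherwise;
# rows with a missing dropout year compare as False and stay at 0
toy_panel['dropout'] = (
    toy_panel['year'] >= toy_panel['dropoutYear']
).astype(int)

print(toy_panel)
```

Note that the indicator switches to 1 in the dropout year and stays at 1 in every later year, and that a director with no dropout year keeps a 0 throughout.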

In [9]:
# Merge the 'dropoutYear' information into the main dataset to indicate the potential dropout year for each director.
directors_years = pd.merge(directors_years, all_directors_dropout[['nconst', 'dropoutYear']],
                           how='left', left_on='nconst_director', right_on='nconst')

# Remove the redundant 'nconst' column after merging.
directors_years.drop(columns='nconst', inplace=True)

# Create the 'dropout' indicator: 1 for every year at or after 'dropoutYear', 0 otherwise.
# Directors without a dropout year get the placeholder 9999, so their indicator stays 0 in all years.
directors_years['dropout'] = (directors_years['year'] >= directors_years['dropoutYear'].fillna(9999)).astype(int)


Let's see what our time series data looks like now.

In [10]:
# Display the first few rows
directors_years.head(30)

Unnamed: 0,nconst_director,year,debut_year,num_movies,made_movie,dropoutYear,dropout
0,nm1131265,2004,2004,0.0,0,,0
1,nm1131265,2005,2004,0.0,0,,0
2,nm1131265,2006,2004,0.0,0,,0
3,nm1131265,2007,2004,0.0,0,,0
4,nm1131265,2008,2004,0.0,0,,0
5,nm1131265,2009,2004,0.0,0,,0
6,nm1131265,2010,2004,0.0,0,,0
7,nm1131265,2011,2004,0.0,0,,0
8,nm1131265,2012,2004,0.0,0,,0
9,nm1131265,2013,2004,1.0,1,,0


As we gear up for survival analysis, let's also create `start_time` and `stop_time` variables for each director relative to their debut year. These variables are essential for survival analysis, as they allow us to measure the duration of directors' active phases. Each row then covers a one-year interval: `start_time` is the number of years since debut, and `stop_time` is one year later. By setting these times relative to each director's debut, we put all directors on a common clock regardless of when they debuted, so the analysis focuses on career stage rather than calendar year.
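
Here is a small sketch of what these intervals look like on toy data. The director and the `toy_` names are hypothetical. The last step, which keeps rows only up to and including the first dropout year, is an assumption about how the data might be prepared for a survival model later on; it is not a step performed in this notebook, and whether it is needed depends on the model used downstream.

```python
import pandas as pd

# Hypothetical toy panel: one director who debuts in 2004
# and is flagged as dropped out from 2006 onward
toy_panel = pd.DataFrame({
    'nconst_director': ['nmA'] * 5,
    'year': [2004, 2005, 2006, 2007, 2008],
    'debut_year': [2004] * 5,
    'dropout': [0, 0, 1, 1, 1],
})

# Each row covers the one-year interval from start_time to stop_time,
# measured in years since debut
toy_panel['start_time'] = toy_panel['year'] - toy_panel['debut_year']
toy_panel['stop_time'] = toy_panel['start_time'] + 1

# Running count of dropout rows within each director
events_so_far = toy_panel.groupby('nconst_director')['dropout'].cumsum()

# Assumption: keep rows up to and including the first dropout row,
# i.e. rows with no earlier dropout row for the same director
at_risk = toy_panel[(events_so_far - toy_panel['dropout']) == 0]

print(at_risk[['nconst_director', 'start_time', 'stop_time', 'dropout']])
```

In this toy case the director contributes three intervals, (0, 1), (1, 2), and (2, 3), with the dropout event recorded in the last one.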

In [11]:
# Calculate relative start and stop times based on the director's debut year
directors_years['start_time'] = directors_years['year'] - directors_years['debut_year']
directors_years['stop_time'] = directors_years['start_time'] + 1

In [12]:
directors_years.head(30)

Unnamed: 0,nconst_director,year,debut_year,num_movies,made_movie,dropoutYear,dropout,start_time,stop_time
0,nm1131265,2004,2004,0.0,0,,0,0,1
1,nm1131265,2005,2004,0.0,0,,0,1,2
2,nm1131265,2006,2004,0.0,0,,0,2,3
3,nm1131265,2007,2004,0.0,0,,0,3,4
4,nm1131265,2008,2004,0.0,0,,0,4,5
5,nm1131265,2009,2004,0.0,0,,0,5,6
6,nm1131265,2010,2004,0.0,0,,0,6,7
7,nm1131265,2011,2004,0.0,0,,0,7,8
8,nm1131265,2012,2004,0.0,0,,0,8,9
9,nm1131265,2013,2004,1.0,1,,0,9,10


In [13]:
directors_years.to_csv("directors_years.csv", index=False)