[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1q5YmddNAX8D7pM187k2qetHBUwkTj4ct?usp=sharing)

# StarNetworks Report 

## Project Goal: 
Explore long-standing working relationships between directors and their crew members, and identify trends in the feature-film industry, such as how regularly the same director reuses crew members.

## Introduction



Some renowned film directors have long-standing working relationships with key collaborators. For example, Wes Anderson and the composer Alexandre Desplat have collaborated on several films, including "The French Dispatch," "Isle of Dogs," and "The Grand Budapest Hotel." It is not uncommon for directors to work with the same cinematographers, production designers, or editors across multiple projects. Such recurring collaborations with a single director may be common practice in the film industry, as they allow directors to establish a consistent artistic vision. They may also foster trust and understanding between the director and the crew member, resulting in a distinct artistic style associated with the director. This practice can be observed among both male and female directors, who form long-standing partnerships with professionals that share their artistic vision.

## Methodology 

### Data Extraction

#### The main() Function

To answer our research questions, we first needed a dataset of directors and their crew members. We wrote a Python script, "data_extraction.py", that takes a file containing directors and their information ("100_film_directors.csv"), gets the crew members each director has worked with using "imdb_scraper.py", and stores the extracted data in a nested folder structure. The "main()" function, supported by the other functions in the script, reads "100_film_directors.csv" and extracts all the relevant director-crew data at once. It captures the overall logic behind the data extraction subtask. A simple explanation of "main()" is provided in the pseudocode below. For more details, please refer to the code for "main()", also provided below, and to the Python script "data_extraction.py".

In [None]:
# pseudocode of main()
  # open the file containing the director information
    # create a folder named "director_crew_info" in your current directory
    # go inside the folder "director_crew_info"
    # for each director in the file:
      # make dictionary with director information (Last Name, First Name, Sex, Ethnicity, Labels, Director ID, etc.)
      # attempt to make director folder; if director folder already exists, continue to the next director
      # go inside the director folder
        # write the dictionary containing director information into a json file
        # create folder named "movies"
        # go inside the folder "movies"
        # try:
          # get a list of feature films or works that the director has directed using imdb_scraper.py
          # iterate through each movie or work and obtain the crew using imdb_scraper.py
          # write the movie data and crew data into a jsonl file
        # if fail:
          # tell the user which director the director-crew extraction failed for

In [None]:
# create the director crew dataset for multiple directors
# filename: name of the file that includes information about the directors
# exclude_non_feature_films: True will exclude the non-feature films (non-films or films that are less than 70 min long) from the list of films;
#                            False will keep all the works of the director
# increment: how many directors to get the dataset for at a time; pass in 0 if you would like to get the dataset for all the directors at once
def main(filename, exclude_non_feature_films, increment=0):
    director_categories=[]
    with open(filename,"r", encoding = "utf-8") as data:  
        # create a folder named "director_crew_info" in the directory you're currently in
        util.create_folder(os.getcwd(), "director_crew_info")
        # go inside the folder, "director_crew_info"
        path=os.path.join(os.getcwd(), "director_crew_info")  # portable path separator
        os.chdir(path)

        counter=-1
        # iterate through all the directors
        for line in data:
            if counter==-1:
                director_categories = line.split(',')
                counter += 1
                continue

            print(line.replace("\n", ""))

            # make a dictionary containing information about the director {Last Name: val, First Name: val, Sex: val, Ethnicity: val, Labels: val, Director Id: val, ...}
            director_info=make_director_info_dict(line, director_categories)
            director_id=director_info["Director Id"]
            # create a folder named with the director id; if a folder for a director already exists, move on to the next director
            if util.create_folder(path, director_id):
                continue
            # go inside the director folder
            os.chdir(os.path.join(path, director_id))
            # write the json file for the director information
            util.write_json(director_id+".json", director_info)
            # create a folder named "movies"
            util.create_folder(os.getcwd(), "movies")
            # go inside the folder, "movies"
            os.chdir(os.path.join(os.getcwd(), "movies"))
            try:
                # get a list of feature films or works that the director has directed
                movie_info_list=get_list_of_films_for_director(director_id, exclude_non_feature_films)
                print(len(movie_info_list),"films\n")
                # iterate through each movie or work and create a jsonl file that contains information about the movie's crew
                for movie in movie_info_list:
                    movie_id=movie["movie id"]
                    movie_crew_info=get_movie_crew(movie_id)
                    write_crew_data_to_movie_file(movie, movie_crew_info, movie_id+".jsonl")
            except KeyError:
                print("failed to get director-crew relationship for "+director_id+"\n")
                continue
            counter+=1
            if counter==increment:
                break
        # the final value assigned to counter should indicate how many directors the data extraction was successful for
        print(counter)

#### Data Storage

We decided to store our data in JSON and JSONL files. Specifically, each director's information (e.g. nm0000095.json) and each movie's crew information (e.g. tt0061177.jsonl) were stored in JSON and JSONL files, respectively. JSON objects were a natural fit because our data extraction code often used dictionaries to store data.
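
As a rough illustration of why dictionaries map naturally onto these formats, the sketch below writes one dictionary to a JSON file and a list of dictionaries to a JSONL file (one JSON object per line), then reads both back. The helper names (`dump_json`, `dump_jsonl`, `load_jsonl`), the placeholder ids, and the sample fields are hypothetical and are not taken from "data_extraction.py".

```python
import json
import os
import tempfile


def dump_json(path, record):
    """Write a single dictionary to a .json file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f)


def dump_jsonl(path, records):
    """Write one JSON object per line to a .jsonl file."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def load_jsonl(path):
    """Read a .jsonl file back into a list of dictionaries."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical sample records, for illustration only.
director = {"Director Id": "nm0000000", "Sex": "F"}
crew_rows = [
    {"role": "Directed by", "crew": ["Person A"]},
    {"role": "Film Editing", "crew": ["Person B", "Person C"]},
]

with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "nm0000000.json")
    jsonl_path = os.path.join(tmp, "tt0000000.jsonl")
    dump_json(json_path, director)
    dump_jsonl(jsonl_path, crew_rows)
    with open(json_path, "r", encoding="utf-8") as f:
        director_roundtrip = json.load(f)
    crew_roundtrip = load_jsonl(jsonl_path)
```

Because every line of a JSONL file is a complete JSON object, crew rows can be appended or read one at a time without loading the whole file.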

The director-crew relationship data is stored within a nested file structure:

Folder Structure Visualization:

![Folder structure visualization](https://drive.google.com/uc?id=1If8LpJwtgV0ImFo39ET2P7l2MRGLJgix)

Inside the folder "director_crew_info":

![Contents of the director_crew_info folder](https://drive.google.com/uc?id=10bIlqN5vNpH3qJfUCGJscaVcgRTeh4eP)


Inside the folder "nm0000095":

![Contents of the nm0000095 folder](https://drive.google.com/uc?id=15GVQrZDE2RlCoEkUfoDZQxWbRWBqgAQ3)

Inside the feature_films folder of "nm0000095":

![Contents of the feature_films folder of nm0000095](https://drive.google.com/uc?id=100aqmmdmRW3jZYWwSsPnibYuP3ZHI4h8)
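
To make the nested layout concrete, here is a minimal sketch of how one director's folder could be read back. The function name `load_director` and the placeholder ids are hypothetical. The `movies_folder` parameter is an assumption added so the sketch works whether the per-director subfolder is called "movies" or "feature_films".

```python
import json
import tempfile
from pathlib import Path


def load_director(root, director_id, movies_folder="movies"):
    """Load one director's info and per-movie crew rows from the nested folders."""
    director_dir = Path(root) / director_id
    with open(director_dir / f"{director_id}.json", encoding="utf-8") as f:
        info = json.load(f)
    crew_by_movie = {}
    for movie_file in sorted((director_dir / movies_folder).glob("*.jsonl")):
        with open(movie_file, encoding="utf-8") as f:
            crew_by_movie[movie_file.stem] = [
                json.loads(line) for line in f if line.strip()
            ]
    return info, crew_by_movie


# Build a tiny throwaway tree with placeholder ids, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    movies_dir = Path(tmp) / "director_crew_info" / "nm0000000" / "movies"
    movies_dir.mkdir(parents=True)
    (movies_dir.parent / "nm0000000.json").write_text(
        json.dumps({"Director Id": "nm0000000"}), encoding="utf-8"
    )
    (movies_dir / "tt0000000.jsonl").write_text(
        json.dumps({"role": "Music", "crew": ["Person A"]}) + "\n", encoding="utf-8"
    )
    info, crew_by_movie = load_director(Path(tmp) / "director_crew_info", "nm0000000")
```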








#### Command Line Arguments


The reusability of the data extraction Python script is an essential feature of our code. We wanted to make sure that "data_extraction.py" could be used for different but similar projects and customized by the user to fit their needs. To do this, we added command line arguments with the help of the argparse module. The Python script "data_extraction.py" includes one required and three optional command line arguments.


1.   The first and only required argument that "data_extraction.py" takes is the filename of an input file containing directors and their information. For this project, our file was called "100_film_directors.csv". Our code will work with any .csv file that contains the same categories as "100_film_directors.csv" (i.e., LastName, FirstName, Sex, Ethnicity_Race, Labels, and IMDb_URI). If a director has no value for a category, leave that field as an empty string. However, for the Python script to work, every director in the file must have a valid IMDb_URI. Additional categories may be added after the required categories if desired. The following snippet of code in the "make_director_info_dict" function is written specifically to allow for additional categories:

In [None]:
if len(categories)>6:
    for i in range(6, len(categories)):
        director_info[categories[i]]=director_info_list[i]
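
As a hedged illustration of the input format, the sketch below parses a CSV with the six required categories plus one extra column. It uses the standard `csv` module rather than the script's own parsing, and the function name `read_directors`, the `Notes` column, and the sample row are hypothetical.

```python
import csv
import io

# The six required categories named in the text above.
REQUIRED = ["LastName", "FirstName", "Sex", "Ethnicity_Race", "Labels", "IMDb_URI"]


def read_directors(csv_text):
    """Parse director rows, keeping any additional columns after the required six."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [name for name in REQUIRED if name not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    directors = []
    for row in reader:
        # Empty strings are fine for most fields, but not for the IMDb URI.
        if not row["IMDb_URI"]:
            raise ValueError("every director needs an IMDb_URI")
        directors.append(dict(row))
    return directors


# Hypothetical sample: one director, an empty Labels field, one extra column.
sample = (
    "LastName,FirstName,Sex,Ethnicity_Race,Labels,IMDb_URI,Notes\n"
    "Doe,Jane,F,W,,https://www.imdb.com/name/nm0000000/,example row\n"
)
directors = read_directors(sample)
```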

2.   The second argument, "-enff", excludes non-feature films from the data set. Feature films are movies with a minimum length of 70 minutes; anything shorter, or any work that is not a film, is treated as a non-feature film.

3.   The third argument, -d, takes ONE director ID and gets the director-crew data for that specific director. This argument serves as a fail-safe for any runtime errors that occur while extracting the data for a specific director.

4.   The last argument, -i, is the number of directors the user would like to run the data extraction for at a time. When -i is omitted, the increment defaults to 0, which runs the data extraction for all directors at once.



In [None]:
if __name__ == "__main__":
    parser=argparse.ArgumentParser()
    parser.add_argument("filename", help="the filename that contains information about all the directors", type=str)
    parser.add_argument("-enff", "--exclude_non_feature_films", help="exclude works that are not feature films (movies over 70 minutes long)", action="store_true")
    parser.add_argument("-d", "--director", help="pass in the director id (from imdb) to get the director-crew relationship data (for only ONE director)", type=str)
    parser.add_argument("-i", "--increment", help="the number of directors at a time you want to run the data extraction for", type=int)
    args=parser.parse_args()
    if args.director is None:
        if (args.increment is None):
            main(args.filename, args.exclude_non_feature_films)
        else:
            main(args.filename, args.exclude_non_feature_films, args.increment)
    else:
        get_director_crew_relationship_for_one(args.director, args.filename, args.exclude_non_feature_films)
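
The sketch below rebuilds the same four arguments (without the help strings) and parses a few example argument lists, to show what values each flag produces. The `build_parser` wrapper and the variable names are only for illustration.

```python
import argparse


def build_parser():
    """Recreate the four command line arguments described above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("filename", type=str)
    parser.add_argument("-enff", "--exclude_non_feature_films", action="store_true")
    parser.add_argument("-d", "--director", type=str)
    parser.add_argument("-i", "--increment", type=int)
    return parser


parser = build_parser()
# All directors, feature films only.
all_directors = parser.parse_args(["100_film_directors.csv", "-enff"])
# Ten directors per run, all works kept.
in_batches = parser.parse_args(["100_film_directors.csv", "-i", "10"])
# A single director, identified by IMDb id.
one_director = parser.parse_args(["100_film_directors.csv", "-d", "nm0000095"])
```

Omitted optional arguments come back as `None` (or `False` for the `store_true` flag), which is why the script checks `args.director is None` and `args.increment is None` before choosing which function to call.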

### Network Generation



Before writing code, we knew we would need several modular functions. We had already decided on a bipartite network with directed, weighted edges, but the implementation was much more difficult than we had expected. We first connected directors and crew members using raw counts as weights: we iterated over the director files and created these relationships for each movie, increasing the weight each time a director worked with the same crew member again.
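
This first counting pass can be sketched roughly as follows. The sketch uses a plain `Counter` keyed by (director, crew member) pairs instead of a network object, and the input layout, the function name `count_collaborations`, and the choice to count a person once per movie are all assumptions made for illustration.

```python
from collections import Counter


def count_collaborations(movies_by_director):
    """Count how many of a director's movies each crew member is credited on.

    movies_by_director maps a director id to a list of movies, where each
    movie is a list of crew ids (hypothetical input layout).
    """
    weights = Counter()
    for director_id, movies in movies_by_director.items():
        for crew_ids in movies:
            # Assumption: a person credited twice on one movie counts once.
            for crew_id in set(crew_ids):
                weights[(director_id, crew_id)] += 1
    return weights


sample = {
    "dir_1": [["crew_a", "crew_b"], ["crew_a"], ["crew_a", "crew_c"]],
    "dir_2": [["crew_b"]],
}
edge_weights = count_collaborations(sample)
```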

The next goal was to assign attributes to directors and crew members. The attributes of crew members consist of their role (Makeup Department, Animation, etc.) and an attribute that marks them as a crew member rather than a director. Crew member attributes are set inside the function that creates the overall network, whereas director attributes such as sex or race were stored in a separate JSON file that we had to parse.

In [1]:
import json
import networkx as nx

def set_dir_attr(G, dir_id, dir_credits): # sets attributes for the director
    with open(dir_credits, 'r') as f:
        dir_credits_json = json.load(f) # load json file
        name = dir_credits_json['First Name'] + ' ' + dir_credits_json['Last Name'] 
        del dir_credits_json['First Name'] # remove first and last name keys
        del dir_credits_json['Last Name']
        dir_credits_json['name'] = name # add new name key that combines first and last name
        # mark the node as a director (this overwrites any existing 'type', such as 'crew')
        dir_credits_json['type'] = 'director'
        nx.set_node_attributes(G, {dir_id: dir_credits_json}) # set director attributes

Following Dr. Nwala's guidance, we then normalized the roles, keeping only nine of the more than 25 role categories. The normalization function keeps only the wanted roles, and the role list is passed to the script as a command-line argument:



```
python gen_networkFINAL.py -er "Additional Crew, Animation Department, Art Department, Art Direction by, Camera and Electrical 
                                 Department, Cast, Casting By, Casting Department, Costume and Wardrobe Department, Editorial 
                                 Department, Location Management, Music Department, Produced by, Production Department, 
                                 Production Management, Visual Effects by, Script and Continuity Department, 
                                 Second Unit Director or Assistant Director, Set Decoration by, Stunts, 
                                 Thanks, Transportation Department" -n

```



Finally, we created a metric to calculate the edge weights between directors and crew members. The weight was calculated as: 

$$\displaystyle\frac{\mbox{crew member's recurrence for a role}}{\mbox{number of the director's movies in which that role occurred}}$$

In [None]:
def assign_weights_to_crew(role_frequency_for_crew, role_frequency_among_movies, dir_id, G):
    for role, crew_dist in role_frequency_for_crew.items(): # iterates over frequency of roles
        for crew_id, co_feat_count in crew_dist.items(): # iterate over crewmembers
            #if crew member is credited for more than one role
            if ('role' in G.get_edge_data(dir_id,crew_id).keys()) and (G.edges[dir_id, crew_id]['role'] != role):
                if co_feat_count<=G[dir_id][crew_id]['weight']:
                    continue
            G[dir_id][crew_id]['weight'] = co_feat_count/role_frequency_among_movies[role] # update weights between director and crewmembers
            weight_numerator=co_feat_count
            weight_denominator=role_frequency_among_movies[role]
            edge_attributes={'role':role, 'weight_numerator': weight_numerator, 'weight_denominator': weight_denominator} # creates attributes
            nx.set_edge_attributes(G, {(dir_id, crew_id): edge_attributes}) # set attributes
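
To show what the two frequency tables might contain and how the ratio works out, here is a small worked sketch for one director. The input layout (a list of movies, each mapping a role to its crew ids), the helper names `role_frequencies` and `edge_weight`, and the sample data are assumptions. Only the table names mirror the parameters of the function above.

```python
from collections import Counter, defaultdict


def role_frequencies(movies):
    """Build the two frequency tables for one director.

    movies is a list of movies; each movie maps a role name to the list of
    crew ids credited in that role (hypothetical input layout).
    """
    role_frequency_for_crew = defaultdict(Counter)   # role -> {crew_id: count}
    role_frequency_among_movies = Counter()          # role -> number of movies
    for movie in movies:
        for role, crew_ids in movie.items():
            role_frequency_among_movies[role] += 1
            for crew_id in set(crew_ids):
                role_frequency_for_crew[role][crew_id] += 1
    return role_frequency_for_crew, role_frequency_among_movies


def edge_weight(role, crew_id, for_crew, among_movies):
    """Recurrence of crew_id in role divided by the number of movies with that role."""
    return for_crew[role][crew_id] / among_movies[role]


movies = [
    {"Music": ["crew_a"], "Film Editing": ["crew_b"]},
    {"Music": ["crew_a"]},
    {"Music": ["crew_c"], "Film Editing": ["crew_b"]},
]
for_crew, among_movies = role_frequencies(movies)
# crew_a scored 2 of the 3 movies that had a Music credit.
music_weight = edge_weight("Music", "crew_a", for_crew, among_movies)
# crew_b edited both movies that had a Film Editing credit.
editing_weight = edge_weight("Film Editing", "crew_b", for_crew, among_movies)
```

A weight of one therefore means the crew member filled that role on every movie of the director in which the role occurred.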

### Network Visualization

The network visualizations were created using the OpenOrd and Yifan Hu layouts in Gephi (version 0.10). The color of the nodes was set using a Gephi plug-in called "Give Colors to Nodes". 

![picture](https://drive.google.com/uc?export=view&id=1HkZuq8swxJ3Ntn4lqAjnChhlXfYKJtVT)


The full network visualization was subsequently broken down by sex and renowned-ness and visualized as subnetworks highlighting those variables. For all subnetworks, we applied a filter that removed connections with a weight numerator of one, so that only crew members who had worked with a director at least twice remained. Then, for the subnetworks highlighting sex, the networks were filtered by giant connected component to show only nodes connected to directors of the selected sex. 
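
The numerator filter was applied in Gephi, but its effect can be sketched in a few lines of Python. The edge records, the function name `keep_repeat_collaborations`, and the sample values below are hypothetical; only the `weight_numerator` and `weight_denominator` attribute names come from the weighting code.

```python
def keep_repeat_collaborations(edges):
    """Keep only edges whose weight numerator is greater than one."""
    return [edge for edge in edges if edge["weight_numerator"] > 1]


# Hypothetical edge records for illustration.
edges = [
    {"director": "dir_1", "crew": "crew_a", "weight_numerator": 3, "weight_denominator": 4},
    {"director": "dir_1", "crew": "crew_b", "weight_numerator": 1, "weight_denominator": 4},
    {"director": "dir_2", "crew": "crew_a", "weight_numerator": 2, "weight_denominator": 2},
]
repeat_edges = keep_repeat_collaborations(edges)
# Crew members that still have at least one edge after filtering.
remaining_crew = sorted({edge["crew"] for edge in repeat_edges})
```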


**Network Highlighting Female Directors**

![picture](https://drive.google.com/uc?export=view&id=1TB68Z-O7fFdu8RX6P4MMquAiOUOz1iH8)

<br>

**Network Highlighting Male Directors**
![picture](https://drive.google.com/uc?export=view&id=1_vSTxDdw5F6bif1qmiCJPzrQe1wroZjm)




An additional k-core filter of two was applied for the network highlighting renowned-ness homogeneity. This was done to highlight crew members who worked more than once with more than one director, suggesting that renowned directors not only reuse their own crew members but also share crew members with one another.

![picture](https://drive.google.com/uc?export=view&id=1GiwOGA2w0oRCXXeur7lbOJOU1BVnnSkg)



The rest of the data visualizations, those that are not networks, were created in Canva using the data obtained from the analysis portion of the project. These visualizations were created to highlight the distribution of each director's average homogeneity score and the distribution of directors who worked with crew members more than once.


![picture](https://drive.google.com/uc?export=view&id=1djdUyTVuI_fAJ7UFi0ahtr21VpvrSEEU)
![picture](https://drive.google.com/uc?export=view&id=1pv12HXBS6hHS0oLyvJ6--yme3nDKCJIB)



### Network Analysis

To calculate the homogeneity of a specific role, we used the formula: 

$$1 - \displaystyle\frac{\mbox{number of unique crew members for that role}}{\mbox{number of crew members for that role}}$$


We then wrote several functions that break the calculation of the average homogeneity into steps.

The function homogeneity_score_test was created to calculate the homogeneity for a single director and a single role. We normalized the writing credits for this as well. We stored all crew members for a particular role in a list called new_store and the unique values from new_store in a list called store_unique. If a role was not present in a director's films, the function returns -1 and we did not include that role in the homogeneity score. The equation is implemented in the line that assigns measure.

In [None]:
def homogeneity_score_test(data, role_name):
    store = []          # every credit for role_name, including the director
    director_info = []  # everyone credited as 'Directed by'
    new_store = []      # credits for role_name across the movies, excluding the director
    store_unique = []   # each unique crew member from new_store (no doubles)
    measure = 0
    for d in data:
        if 'role' in d:
            if d['role'].startswith("Writing Credits"):
                d['role'] = "Writing Credits"  # normalize all writing-credit variants
            if d['role'] == role_name:
                store += d['crew']
            if d['role'] == 'Directed by':
                director_info += d['crew']
    for item in store:
        if item not in director_info:
            new_store.append(item)
    for item in new_store:
        if item not in store_unique:
            store_unique.append(item)
    if len(new_store) < 1:
        # the role has no crew members in the director's movies, except potentially the director
        measure = -1
    else:
        measure = 1 - (len(store_unique) / len(new_store))  # implements the homogeneity formula
    return measure
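
A small worked example of the formula may help. The sketch below is a compact stand-in for the same calculation, assuming the credits are hashable values such as name strings. The function name `role_homogeneity` and the sample credits are hypothetical.

```python
def role_homogeneity(credits):
    """Return 1 - unique/total for a list of crew credits, or -1 if the list is empty."""
    if not credits:
        return -1
    return 1 - len(set(credits)) / len(credits)


# One role across four movies: "crew_a" is credited on three of them.
credits = ["crew_a", "crew_a", "crew_b", "crew_a"]
score = role_homogeneity(credits)  # 2 unique out of 4 credits

# A different person on every movie gives the lowest possible score.
all_different = role_homogeneity(["crew_a", "crew_b", "crew_c"])
```

Scores closer to one mean the director kept returning to the same people for that role, while a score of zero means no one was reused.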

The next function, score_by_role_test, was created to calculate the average homogeneity of a single director across multiple roles. This function calls the earlier homogeneity_score_test function for each role. The roles we included were Sound Department, Makeup Department, Special Effects, Writing Credits, Film Editing, Cinematography, Music, Production Design, and Costume Design.
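
The averaging step could look roughly like the sketch below. It assumes a simple mean over the nine roles, skipping any role whose score is -1 (the marker for a role that did not occur). The function name `average_homogeneity` and the sample scores are hypothetical, and this is not the code of score_by_role_test.

```python
ROLES = [
    "Sound Department", "Makeup Department", "Special Effects",
    "Writing Credits", "Film Editing", "Cinematography",
    "Music", "Production Design", "Costume Design",
]


def average_homogeneity(scores_by_role):
    """Average the per-role scores, skipping roles marked -1 or missing."""
    valid = [scores_by_role[role] for role in ROLES
             if scores_by_role.get(role, -1) != -1]
    if not valid:
        return -1  # no usable role for this director
    return sum(valid) / len(valid)


# Hypothetical per-role scores for one director.
scores = {"Music": 0.5, "Film Editing": 0.25, "Cinematography": -1}
director_average = average_homogeneity(scores)
```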

We then created a function called all_directors, which calls score_by_role_test to get the average homogeneity score of every director. The homogeneity scores for each director were appended to a data frame created from the "101_directors.csv" file. This allowed us to gather various summary statistics, including the number of directors in each category, and to generate mean homogeneity scores for those categories. 
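
The per-category means can be illustrated with a short sketch. It uses the standard library in place of a data frame, and the function name `mean_score_by`, the `homogeneity` key, and the sample rows are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by(rows, category):
    """Group director rows by a category and average their homogeneity scores."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[category]].append(row["homogeneity"])
    return {key: mean(values) for key, values in groups.items()}


# Hypothetical rows, one per director.
rows = [
    {"Sex": "F", "homogeneity": 0.25},
    {"Sex": "F", "homogeneity": 0.75},
    {"Sex": "M", "homogeneity": 0.5},
]
by_sex = mean_score_by(rows, "Sex")
```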

We also gathered summary statistics from Gephi to see the breakdown of roles across the network and to obtain measures such as average path length, average clustering coefficient, and graph density. Additionally, we counted the number of movies each director directed, using the following function.

In [None]:
def number_of_films_directed(data): # counts the films that one director has directed
    director_repetitions = []
    for d in data:
        if 'role' in d:
            if d['role'] == 'Directed by':
                director_repetitions.append(d['crew'])
    return len(director_repetitions)

We placed it in another, larger function within a loop to have it run for every director.

Ultimately, the average homogeneity function helped us address the first research question, and the information from Gephi helped us explore the second. For the third research question, we suspected that directors who are family members could be interesting nodes. We therefore took a closer look at nodes such as Lilly and Lana Wachowski and Sofia and Francis Ford Coppola.

## Results 

#### How widespread is the phenomenon of directors re-using the same crew?


Based on our homogeneity metric results and the general distribution of directors who have worked with crew members more than once, we do not believe that the phenomenon of directors re-using the same crew is widespread. However, it is more common within some subgroups. 

#### Do renowned directors (and women/minority directors) tend to work persistently with the same key collaborators and less recognized directors tend to work with shifting groups of collaborators?


Men tend to work with the same collaborators more often than women (average homogeneity of 0.292 versus 0.188).

Higher-grossing directors also tend to work with the same collaborators more often than LGBTQ directors (0.364 versus 0.237).

White directors tend to work with the same collaborators more often than directors of other racial groups (0.285, the highest of the six groups). 

Latin American directors, on the other hand, tend to work most often with shifting collaborators (0.182, the lowest of the six groups).

<br>

In terms of who made it onto our list, the largest group of directors was white men (50 of the 101 directors). The smallest groups were Native American (Sterlin Harjo) and Indigenous (Taika Waititi), with one director each.



| Sex         | Average Homogeneity |
| ----------- | -----------         |
| Female      | .188202             |
| Male        | .292182             |

| Labels      | Average Homogeneity |
| ----------- | -----------         |
| H           | .364000             |
| Q           | .237449             |

| Ethnicity/Race| Average Homogeneity |
| -----------   | -----------         |
| A             | .197885             |
| B             | .246266             |
| I             | .238936             |
| L             | .182010             |
| N             | .204630             |
| W             | .285302             |

|Sex | Ethnicity/Race| Count |
|----| -----------   | ------|
|F| A                | 3     |
| | B                | 4     |
| | W                | 20    |
|M| A                | 8     |
| | B                | 9     |
| | I                | 1     |
| | L                | 5     |
| | N                | 1     |
| | W                | 50    |

| Labels | Sex | Ethnicity/Race | Count |
| ------ | --- | -------------- | ------|
|H       | M   | A              | 1     |
|        |     | W              | 19    |
|Q       | F   | B              | 1     |
|        |     | W              | 2     |
|        | M   | A              | 1     |
|        |     | B              | 1     |
|        |     | W              | 2     |

#### How will you characterize the film-director network? What are the properties (Average shortest path length, triangles aka clustering coefficient, density/sparsity)?


Nodes: 25141

Edges: 56314

Average Path length: 1.4389336788559364

Average Clustering Coefficient: 0.02

Density/Sparsity: 0.00008909812



**What all of this information tells us about the network:**

Most people in the network have no direct relationship with each other.

The network is very sparse, which is also why the clustering coefficient is so low.
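
As a quick sanity check, the reported density appears consistent with the directed-graph definition, edges divided by n(n - 1). The sketch below recomputes it from the node and edge counts above; the function names are only for illustration.

```python
def directed_density(n_nodes, n_edges):
    """Fraction of possible directed edges that are present: m / (n * (n - 1))."""
    return n_edges / (n_nodes * (n_nodes - 1))


def undirected_density(n_nodes, n_edges):
    """Fraction of possible undirected edges that are present: 2m / (n * (n - 1))."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


nodes, edges = 25141, 56314
density = directed_density(nodes, edges)
# Average number of edges touching each node (each edge has two endpoints).
mean_degree = 2 * edges / nodes
```

With roughly 4.5 edges per node on average in a network of more than 25,000 nodes, almost all possible connections are absent, which matches the sparsity described above.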

#### Did you find any interesting nodes/links?

Crew members with an edge weight of one are particularly interesting: they worked with that director on every movie in which their role occurred. We were able to identify several such crew members, and most of them are in the Special Effects department.

Other interesting nodes are directors who are related to each other; their homogeneity scores are very similar. We examined four such director nodes in the dataset. Francis Ford Coppola's average homogeneity score is 0.309627, while his daughter Sofia Coppola's is 0.288367. Meanwhile, Lilly Wachowski's average homogeneity score is 0.376583, while Lana Wachowski's is 0.383186.


## Conclusion 



We hypothesized that directors would reuse their crew members often, but our analysis concludes otherwise. Our research indicates that crew members used at least twice occur in an average of 10% of all directors. However, it is still not uncommon for directors to work with a core group of collaborators. As our research further indicates, directors often reuse those in specific departments, such as music and film editing, while sound and makeup are the least reused. Music and film editing directly affect the audience's experience while watching a film. While the roles of sound and makeup are important, they may not offer as many opportunities for recurring collaborations. Although the director's vision guides the overall sound design, the technical and detailed nature of sound production may lead directors to work with different sound designers depending on the project. Similarly, makeup artists must cater to the specific needs of the characters and the visual aesthetics of each film, leading to a broader range of collaborators.