
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hpc325/Data-340-02-Network-Science/blob/main/Group-Project/Final-Submission/gp_NetworkProductions_final_report.ipynb)


Note: The visualizations sometimes do not render when this notebook is viewed on GitHub. If you open the file in Colab (with the button above), all of the images will appear. The images are also available in the repository.

#  **Debunking Hollywood: Final Report**
---
### Jeff Bailey, Harry Choi, Hassan Koroma, and Peter Schnizer
### DATA 340, Spring 2023
### Monday May 15th @ 2:00 p.m.


## **Introduction**
---

When our startup, Network Productions, was formed in March of 2023, we set out to better understand how Hollywood films get made. When we think of our favorite movies, we typically know our favorite actors and actresses, along with maybe the director. We’re more interested, however, in the relationships behind the screen, specifically between directors and crew members. Do certain directors like re-using the same Sound Department crew members, or do they cycle through Writing Department employees? Are they hiring the same people? Does this change by gender or ethnicity? These are the types of questions that we sought to learn more about.

More specifically, we aimed to address three primary research questions. The first is whether the phenomenon of directors re-using the same crew members is widespread, and whether it changes with a director's level of success, gender, and ethnicity. The second is understanding the composition of the director-crew member network by analyzing its properties, such as whether the network is dense or sparse and how connected it is in terms of triangles. The final question is discovering interesting nodes or links in the director-crew member network that could illuminate our conclusions.

We could not accomplish this ambitious goal without breaking it down into multiple sections. This report will detail each component of the journey in addressing the research questions, from how we obtained the data, to brainstorming and generating our network, visually depicting the relationships between directors and crew members, and statistically analyzing our network. Finally, we will deliver our conclusions on the patterns of collaboration between directors and crew members. 



## **Subtask 1: Data Extraction (Jeffrey Bailey)**
---
### **Overview:**
To begin, the data had to be scraped from IMDB and organized into a useful dataset. Before any scraping could be done, we discussed as a group how we would like the data to be organized, and what specific data we would like to collect. After a few revisions, we settled on scraping a dataset organized into a .jsonl format, similar to a previous homework assignment. We knew that we needed a dataset that contained each of the given 101 directors, their attributes, and a list of each pertinent crew member who has worked on any of their movies.
 
With this goal and the given functions and scraping libraries, the optimal data extraction pipeline became clear and we took the following steps:
First, we designed a set of functions, built on the given scraping library, that produce a "director dictionary" for a given director's IMDB ID. These director dictionaries contain entries for the name, attributes, and credits of the specified director. The credits were stored as a list of movie dictionaries, where each movie had its name, IMDB ID, and a list of credited crew members stored as their own short lists of the form ['role name', 'crewmember name'].

With this function working, we then initialized the .jsonl file, looped through our .csv of directors, generated a director dictionary with the full feature-film credits for each director, and added them to the .jsonl. Storing each set of credits by movie and by role allowed individuals with multiple roles in a movie, or who worked with a director on multiple movies, to be fully recorded and later given appropriate weight for repeat appearances in each director's dictionary. We additionally used this step to avoid self loops in our network: when gathering credits for each director, their own name was filtered out. We did not remove directors who had been credited in another director's film; keeping them was a purposeful part of our research goal.

Finally, the large .jsonl file was compressed into a gzip file and used to generate our network, as detailed in the next section.

### **Methodology and Code:**
Extraction of our dataset used four files: a provided .csv spreadsheet with each director’s name, attributes, and IMDB link, a provided script that defines simple functions for scraping IMDB, and two scripts we wrote called “dataset_generator.py” and “credit_compressor.py”. 
The dataset generation script can be run entirely from a command terminal and has no hard-coded file names or parameters. When it is run, it prompts the user for the path to the initial director data, the desired .jsonl output path, and a list of desired crew roles to scrape. If the user gives no input, the script uses default values (100_film_directors.csv, scraped_credits.jsonl, and [‘’] for all possible roles). The script imports its scraping functions from the given script, "imdb_scraper.py", which must be in the same directory. Beyond that, a user only has to run the single script "dataset_generator.py" in the terminal to generate a complete dataset.
<br>

<br>

After user inputs, the script performs the following steps: 

* If the specified output file exists, it checks which (if any) directors are already present and scraped, and prints a list of their IDs. 

* It then loops through each director in the input .csv and, if that director is not present already in the output .jsonl, runs the function “film_credits”, which generates that director’s dictionary as described above. 

* The film_credits function works by first initializing the dictionary with the director and attributes, and an empty list of movies. For each movie found using the provided function “get_full_credits_for_director” from imdb_scraper.py, if that movie is a full-length feature film, it creates a “movie dictionary” and adds it to the director’s movie list. This movie dictionary contains the title, IMDB id, and crew list of the specified film. The crew list was made using another function, “itemizecrew”.

* The itemizecrew function generates a crew list from an input IMDB film ID and a specified role list. The option to filter roles during data collection exists in the code and would decrease scraping time if the user does not want every role, but in practice we passed an empty list and manually filtered out unwanted roles at a later step in the project. The function uses the imdb_scraper.py function “get_full_crew_for_movie” to scrape a large credit dictionary. From this dictionary, we loop through and record every credit that is in the desired role list and is not the film’s director, which is output as a list of lists of the form [[‘role name’, ‘crewmember name’], …].

* When this loop is completed, we are left with a .jsonl list of json-style dictionaries, one for each director, that contains a dictionary of movies with easily-parsed credit lists. At some stages, we had an issue where two specific directors timed out the imdb_scraper.py scraping functions and broke the loop. Because our script checks to ensure we do not record data for directors who are already present in the dataset, we can simply re-run the script in the case of these errors and the loop will start where it left off. 
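
The role filter and self-credit filter described for `itemizecrew` can be sketched as follows. This is a minimal illustration rather than the function from our repository: the name `filter_crew`, the shape of `raw_credits` (a dictionary mapping a role name to a list of names), and the case-insensitive substring matching are all assumptions made for the example.

```python
def filter_crew(raw_credits, director_name, valid_roles=None):
    """Flatten scraped credits into [[role, name], ...] pairs.

    raw_credits: hypothetical dict mapping a role name to a list of names.
    valid_roles: substrings matched against role names; None or an empty
                 list keeps every role (an assumption about the default).
    """
    wanted = [r.lower() for r in (valid_roles or [])]
    crew = []
    for role, names in raw_credits.items():
        if wanted and not any(w in role.lower() for w in wanted):
            continue  # this role was not requested
        for name in names:
            if name == director_name:
                continue  # drop the director's own credit to avoid self loops
            crew.append([role, name])
    return crew
```

Under these assumptions, passing `['']` also keeps every role, because the empty string is a substring of any role name.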

After successfully generating our .jsonl dataset, we ran the script "credit_compressor.py" to compress the 20 megabyte .jsonl into a 4 megabyte .gz file. This script, similar to "dataset_generator.py", can be run in a terminal and prompts the user for inputs for the input .jsonl path and the desired output .gz path. The actual script is very simple, and uses the gzip library to compress and save the files using the inputted paths. Again, if the user gives no inputs it uses default name "scraped_credits.jsonl" and saves to "compressed_credits.jsonl.gz". 
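
A compression step like the one in "credit_compressor.py" could look roughly like this. It is a sketch that assumes a simple streamed copy with the standard `gzip` and `shutil` modules; the function name `compress_jsonl` is hypothetical, and the real script also prompts the user for the two paths.

```python
import gzip
import shutil


def compress_jsonl(src="scraped_credits.jsonl", dst="compressed_credits.jsonl.gz"):
    """Stream-copy the file at src into a gzip archive at dst and return dst."""
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)  # copies in chunks, not all at once
    return dst
```

Copying in chunks keeps memory use flat regardless of how large the .jsonl file grows.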

<br>

Here is the main scraping loop that runs after "dataset_generator.py" defines functions and prompts user input:
```python
with open(jsonl_file, 'a') as f:
    for idx, row in tqdm(directors.iterrows(), total=len(directors)): #for each row in your input dataframe
        dir_id = row['id']
        if dir_id in dir_ids_in_jsonl: #if the director is already in the data, pass this row on the loop
            continue
        movie_credits = film_credits(dir_id, validroles) #if the director is not in the data, scrape and record their credits to the jsonl based on the inputted valid role list
        f.write(json.dumps(movie_credits) + '\n')
        dir_ids_in_jsonl.add(dir_id) #finally add the id to the list of already recorded directors. 
        #If for some reason the same director is in your input csv twice, this will keep them from being double-credited
```
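
The loop above relies on `dir_ids_in_jsonl` to skip directors that were already scraped. One way such a set could be built is sketched below; the function name `load_scraped_ids` is hypothetical, and the sketch assumes each line of the output file is a JSON object with a `dir_id` key, as in our dataset.

```python
import json
import os


def load_scraped_ids(jsonl_file):
    """Return the set of director IDs already written to jsonl_file."""
    ids = set()
    if not os.path.exists(jsonl_file):
        return ids  # nothing has been scraped yet
    with open(jsonl_file) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            ids.add(json.loads(line)["dir_id"])
    return ids
```

Because the output file is opened in append mode, re-running the script after a timeout only adds the directors missing from this set.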
Other referenced files, including fully commented functions, can be found in our repository. Additionally, an example of one of the "director dictionaries" from the .jsonl dataset can be seen at the start of the next section on **Network Generation**.



At first, we generated a preliminary dataset which only filtered key crew roles, and was made as fast as possible so other team members could begin their work using a lighter placeholder dataset. The final compressed dataset contained every crew member from each director's credited feature films. 


## **Subtask 2: Network Generation (Hassan Koroma)**
---
### **Preliminary**<br>
The network generation subtask uses a streamlined approach because every decision made in this step is based on how we want to visualize the network to best answer the research questions.<br><br>
As mentioned in **Data Extraction**, we had initially scraped only a partial dataset for the director-crew credits network, restricted to credits from 7 valid roles: 
```python 
roles = ['direct', 'writ', 'produce', 'cinematography', 'casting by', 'editing', 'music by']
```
We figured these roles mattered more than the rest because they are the most popular. This served as a sample dataset, so that the rest of us could begin working on our respective tasks as early as possible. However, this dataset would not be sufficient to answer our research questions, so we decided to re-scrape all roles present in the director-crew credits network. Before proceeding with network generation, we first had to visualize the structure of the data and collectively decide what relationships to model and how to present these relationships in a director-crew network to best answer our research questions. 

Below is the structure of each line of data in our *jsonl* file:
```json
{
    "dir_id":"nm0009190",
    "name":"J.J. Abrams",
    "gender":"M",
    "ethnicity":"W",
    "otherlabel":"H",
    "movies": [
          {
          "title_id":"tt2527338",
          "title":"Star Wars: Episode IX - The Rise of Skywalker",
          "crew":[
          ["Writing Credits", "Chris Terrio"],
          ["Writing Credits", "Derek Connolly"],
          ]
      },
    ]
}
```
### **Network Models**<br>
<u>Network 1</u>: To get a feel of what our final network might look like, we generated a weighted undirected network of one director, *Wes Anderson*. Nodes are crew members (including the director node), links are collaboration between director and crew members in various movies, and edge weight is the number of times they've been employed by *Wes Anderson*.

<u>Network 2</u>: Now that we have an idea of what the network should look like, we decided to generate the full director-crew network. This is also a weighted undirected graph, with 202,028 nodes and 463,565 edges. We also added some node attributes for director nodes, namely *gender, ethnicity, renowned status, and node type*. With these attributes, we can partition the network and look for general patterns. For example, how is the ego network of a male director different from that of a female director?

<u>Network 3 (main network)</u>: For our analysis, we were instructed to exclude links to crew members whose roles do not receive awards at the Oscars. There are 22 excluded roles: `Additional Crew`,`Animation Department`,`Art Department`,`Art Direction by`,`Camera and Electrical Department`,`Cast`,`Casting By`,`Casting Department`,`Costume and Wardrobe Department`,`Editorial Department`,`Location Management`,`Music Department`,`Produced by`,`Production Department`,`Production Management`,`Script and Continuity Department`,`Second Unit Director or Assistant Director`,`Set Decoration by`,`Stunts`,`Thanks`,`Transportation Department`,`Visual Effects by`. After removing them, we generated the main network. This is a weighted undirected graph with 24,960 nodes and 55,647 links. We also added node and edge attributes to facilitate the visualization process. Directors now have *grouped labels* and their *average homogeneity metric* (covered in **Analyses**) as additional attributes. The *grouped labels* attribute combines *gender, ethnicity, and renowned status* (e.g., MWH for a renowned white male director). This allows us to compare and contrast the ego networks of directors across these intersecting groups. Also for visualization purposes, we stored the *role* behind each director-crew collaboration and the *crew homogeneity* (covered in **Analyses**) as edge attributes. We also removed self loops from the network because they are not needed for analysis.
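
Since one of our research questions asks whether the network is dense or sparse, the reported node and link counts can be turned into a density with a short calculation. The sketch below assumes the graph is treated as a simple undirected graph (no self loops or parallel edges); the function names are illustrative only.

```python
def density(n_nodes: int, n_edges: int) -> float:
    """Density of a simple undirected graph: edges present / edges possible."""
    possible = n_nodes * (n_nodes - 1) / 2
    return n_edges / possible if possible else 0.0


def average_degree(n_nodes: int, n_edges: int) -> float:
    """Each undirected edge adds one to the degree of two nodes."""
    return 2 * n_edges / n_nodes if n_nodes else 0.0


# Node and link counts reported for the main network above.
print(f"density = {density(24_960, 55_647):.2e}")
print(f"average degree = {average_degree(24_960, 55_647):.2f}")
```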

<u>Network 4</u>: Another phenomenon we wanted to explore is the frequency with which directors hire the same crew members across movies. Therefore, we decided to also generate a weighted bipartite subnetwork, where the two classes of nodes are directors and crew members, respectively. This also allows for good separation in the visualization. Links represent director-crew collaborations; the weight of a link indicates the number of movies worked on together. This minimizes the number of links, since having a distinct link between each director and crew member for each project would be redundant. To achieve this, we modified the network generation code to include an optional argument we can toggle on or off. If turned on, the algorithm skips any link from a director to a crew member who is also a director in the network, and self loops are also removed. This is because a node in one class of a bipartite graph cannot be linked to another node in the same class. There were 57 instances of directors working with crew members who are also directors in the network, and once these were removed, the bipartite graph has 24,960 nodes (101 nodes in one class and 24,859 nodes in the other) and 55,590 links. With this graph, we can project director co-occurrences to further explore this phenomenon. 
### **Network Generation Modules**
Now that we have an idea of how we want to model the director-crew relationships to answer our research questions, our next step was to write scripts to generate our network models. 
We created 4 Python modules for the network generation task: `netgen_utils`, `metric`, `netgen`, and `run`. Each module was designed in a modular format (with exception handling) that can be utilized by all members of the group.

**netgen_utils**: This is a collection of utility functions that perform different tasks related to generating a network graph modeling the director-crew relationship. Here is a brief description of the main functions in the context of the subtask:

- `get_file_size`: This function returns the size of a file given its path. This is especially useful when deleting unwanted graphml files that are not written correctly. This also comes in handy in the analysis task, where we may need to check the size of the graph file before reading it into memory and running summary network statistics.

- `path_exists`: Checks if a file or directory exists at the given path. This is useful because the main network generation function still creates an empty graph file when it fails, so we check that this empty file exists before deleting it

- `delete_file`: Deletes empty or unwanted graph files that may have been written due to exception errors

- `ErrorHandler`: Handles errors gracefully by printing the error message and any additional details that might be relevant to diagnose the error

- `load_data`: Loads data from our *jsonl* file as a generator, one line at a time. This avoids loading the entire data into memory at once, especially when the file is large
```python
def load_data(file):
    """Load data iteratively"""
    with open(file) as infile:
        for line in infile:
            yield line
```

- `json_loader`: Loads each line string of the JSON to a python dictionary for processing

- `get_directors`: Extracts the names of the directors from the data. This will be used in a few conditionals in the *netgen* script

- `get_node_type`: Determines the type of a node (either "crew" or "director") given the name of the crew member and the list of directors.
```python
def get_node_type(crew_name: str, directors: list) -> str:
    """Get the node type: crew or director"""
    return "crew" if crew_name not in directors else "director"
```

- `remove_extra_whitespaces`: There is inconsistent spacing among director and crew members in the raw data. This function removes extra whitespaces to prevent any node duplicates in the graph.

- `combine_labels`: This function combines the gender, ethnicity, and renowned label of a director into a single label to make partition easier in the visualization process. 
```python
def combine_labels(gender, ethnicity, label) -> str:
    """Group director labels into a single label"""
    return (
        gender + ethnicity if isinstance(label, float) else gender + ethnicity + label
    )
```
- `normalize_role`: There are several variants of the *Writing Credits* role in our raw data. This function normalizes the crew roles by replacing these variants with just "Writing Credits"
```python
def normalize_role(role: str) -> str:
    """Normalize the crew roles"""
    return "Writing Credits" if role.startswith("Writing") else role
```
- `show_self_loops`: Displays the self-loops in the graph.

- `remove_self_loops`: Removes the self-loops in the graph. This will come in handy when generating our bipartite network
```python
def remove_self_loops(G):
    """Remove the self-loops in the graph"""
    selfloops = list(nx.selfloop_edges(G))  # type: ignore
    G.remove_edges_from(selfloops)
    return None
```
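
Several of the helpers listed above are described without code. The sketch below shows one plausible way `json_loader`, `remove_extra_whitespaces`, and `get_directors` could be written; these are guesses for illustration, not the versions in our repository, and in particular whether `get_directors` cleans names before returning them is an assumption.

```python
import json


def json_loader(line: str) -> dict:
    """Parse one line of the jsonl file into a dictionary."""
    return json.loads(line)


def remove_extra_whitespaces(name: str) -> str:
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(name.split())


def get_directors(file: str) -> list:
    """Collect the cleaned name of every director in the jsonl file."""
    with open(file) as infile:
        return [
            remove_extra_whitespaces(json_loader(line)["name"])
            for line in infile
            if line.strip()  # skip blank lines
        ]
```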

**metric**: This is the collection of functions that compute the average director homogeneity, role homogeneity, and our custom crew weight. Learn more about these metrics in *Analyses*. 
- The `netgen` module uses this script when adding the homogeneity attributes and custom crew weight to the graph. See below!

**netgen**: The module uses the NetworkX library to create a graphical representation of the various director-crew network models as described above, which can be used for analyses and visualization. 

- The `create_graph` function takes in a file path and optional arguments to exclude certain roles or to remove links from directors to crew members who are also directors in the network (for the bipartite subnetwork). It uses the `get_directors` function from the `netgen_utils` module to get a list of directors from the raw data and the `Apply_Metric` function from the `metric` module to unpack the homogeneity scores for each director.
```python
def create_graph(file, exclude_roles=None, remove_dir_targets=False):
    """Generate a networkx graph from the full director credits: user can decide which roles to exclude"""
    directors = nu.get_directors(file)
    scores = metric.Apply_Metric(file)
```

- The `add_dir_nodes` function is a nested function within `create_graph` and adds director nodes to the graph. It uses `load_data` and `json_loader` functions from the `netgen_utils` module to load and parse data from the file. It then extracts relevant data from the parsed JSON, such as the director's name, gender, ethnicity, and other labels. The function then adds a node to the graph for each director, with the extracted data as node attributes.
```python
def add_dir_nodes(G):
        """Add director nodes to the graph"""
        dir_homogeneity = scores[1]
        for data in nu.load_data(file):
            data = nu.json_loader(data)
            dir_name = nu.remove_extra_whitespaces(data["name"])
            renowned_status = nu.renowned(data["otherlabel"])
            combined_label = nu.combine_labels(
                data["gender"], data["ethnicity"], data["otherlabel"]
            )
            G.add_node(
                dir_name,
                dir_names=dir_name,
                dir_ids=data["dir_id"],
                gender=data["gender"],
                ethnicity=data["ethnicity"],
                renowned=renowned_status,
                grouped_labels=combined_label,
                type="director",
                avg_dir_homog=float(dir_homogeneity[data["name"]]),
            )
```

- The `add_crew_nodes` function is also a nested function within `create_graph` and adds crew nodes to the graph. It loops through the parsed data and extracts crew member information, such as the crew member's name, role, and type (e.g. crew or director). It then adds a node to the graph for each crew member, with the extracted data as node attributes.
```python
def add_crew_nodes(G):
        """Add crew nodes to the graph"""
        for data in nu.load_data(file):
            data = nu.json_loader(data)
            for movie in data["movies"]:
                for crew in movie["crew"]:
                    crew_name = nu.remove_extra_whitespaces(crew[1])
                    norm_role = nu.normalize_role(crew[0])
                    if (exclude_roles is not None) and (norm_role in exclude_roles):
                        continue
                    node_type = nu.get_node_type(crew[1], directors)  # type: ignore
                    if (not G.has_node(crew_name)) and node_type == "crew":
                        G.add_node(
                            crew_name,
                            crew_names=crew_name,
                            gender="nan",
                            ethnicity="nan",
                            avg_dir_homog=0.0,
                            grouped_labels=node_type,
                            type=node_type,
                        )
```

- The `add_edges` function is another nested function within `create_graph` and adds edges between the director and crew member nodes. It loops through the parsed data and creates an edge between the director and each crew member that worked on their movies. The edge weight starts at 1 and is incremented for every additional credit the pair shares, so it reflects how often they worked together across movies (and across roles within a movie). The edge also carries attributes for the crew member's department and their homogeneity score as custom weights. 
```python
def add_edges(G):
        crew_homogeneity = scores[3]
        for data in nu.load_data(file):
            data = nu.json_loader(data)
            dir_name = nu.remove_extra_whitespaces(
                data["name"]
            )  # remove extra whitespaces
            for movie in data["movies"]:
                for crew in movie["crew"]:
                    crew_name = nu.remove_extra_whitespaces(
                        crew[1]
                    )  # remove extra whitespaces
                    norm_role = nu.normalize_role(crew[0])
                    if (exclude_roles is not None) and (norm_role in exclude_roles):
                        continue
                    node_type = nu.get_node_type(crew[1], directors)  # type: ignore
                    if remove_dir_targets and (node_type == "director"):
                        continue
                    if G.has_edge(dir_name, crew_name):
                        G.edges[dir_name, crew_name]["weight"] += 1
                    else:
                        crew_homg = float(
                            crew_homogeneity[data["name"]][norm_role][crew[1]]
                        )
                        G.add_edge(
                            dir_name,
                            crew_name,
                            weight=1,
                            departments=norm_role,
                            crew_homog=crew_homg,
                        )
```

- Finally, the `write_graph` function writes the generated graph to a GraphML file for later use in Gephi, a graph visualization and exploration software. It uses the `write_graphml` function from the NetworkX library to write the graph, and also checks if the file size is too small (which may indicate an error) and deletes the file in that case.
```python
def write_graph(G, fname="sample", ext=".graphml"):
    """Write the graph to a graphml file for later use in Gephi"""
    try:
        nx.write_graphml(G, fname + ext)  # type: ignore
        if nu.get_file_size(fname + ext) < 275:
            nu.delete_file(fname + ext)
        else:
            print(f"Graph written to {fname + ext}")
    except Exception as e:
        nu.ErrorHandler(e, input=G)
        print("Could not write graph!")
        # delete the empty file
        if nu.path_exists(fname + ext) and nu.get_file_size(fname + ext) == 0:
            nu.delete_file(fname + ext)
    return None
```
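
Because the loop in `add_edges` increments the weight once per credit entry, a crew member credited in two roles on the same film adds 2 to the edge weight. The sketch below contrasts that per-credit count with a per-movie count on a single record. The function name `edge_weights` and the toy record are made up for illustration, and the sketch leaves out role exclusion and name cleaning.

```python
from collections import Counter


def edge_weights(record, per_movie=False):
    """Count director-crew collaborations in one director record.

    per_movie=False counts every credit entry (as the loop in add_edges does);
    per_movie=True counts each crew member at most once per film.
    """
    weights = Counter()
    for movie in record["movies"]:
        names = [name for _role, name in movie["crew"]]
        if per_movie:
            names = set(names)  # one count per film, however many roles
        for name in names:
            weights[(record["name"], name)] += 1
    return weights


# Toy record in the same shape as the jsonl lines (all names are made up).
record = {
    "name": "Director X",
    "movies": [
        {"title": "Film 1", "crew": [["Writing Credits", "Crew A"],
                                     ["Film Editing by", "Crew A"]]},
        {"title": "Film 2", "crew": [["Writing Credits", "Crew A"]]},
    ],
}
print(edge_weights(record)[("Director X", "Crew A")])                  # 3
print(edge_weights(record, per_movie=True)[("Director X", "Crew A")])  # 2
```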

**run**: This script is made for the sole purpose of running functions to process, generate, and write the network graphs for further use in Gephi.
- Contains the list of roles to be excluded from the network
```python
exc_roles = [
    "Additional Crew",
    "Animation Department",
    "Art Department",
    "Art Direction by",
    "Camera and Electrical Department",
    "Cast",
    "Casting By",
    "Casting Department",
    "Costume and Wardrobe Department",
    "Editorial Department",
    "Location Management",
    "Music Department",
    "Produced by",
    "Production Department",
    "Production Management",
    "Script and Continuity Department",
    "Second Unit Director or Assistant Director",
    "Set Decoration by",
    "Stunts",
    "Thanks",
    "Transportation Department",
    "Visual Effects by",
]
```
- Each network model was generated using the *run.py* script; `python3 -m run`
```python
import netgen
file = "fullest_credits.jsonl"
netgen.write_graph(netgen.create_graph(file), fname="full_dcnet")
netgen.write_graph(
    netgen.create_graph(file, exclude_roles=exc_roles),
    fname="final_dcnet",
)
netgen.write_graph(
    netgen.create_graph(file, exclude_roles=exc_roles, remove_dir_targets=True),
    fname="for_bipartite_dcnet",
)
```


## **Subtask 3: Network Visualization (Harry Choi)**
---

### **Part I: Main Visualization Methodology**

The purpose of our main visualization is to effectively depict the relationships between directors and crew members represented by our weighted, undirected graph. Thus, we first brainstormed the best methods to accomplish our goal before generating any visualization. We decided to use Gephi as our tool due to its robust ability to process and aesthetically visualize larger networks. We also wanted to use a force-directed algorithm to spatialize the network because it could optimally separate nodes and minimize edge crossings, allowing the viewer to effectively interpret the visualization.

After importing the GraphML file of our generated network into Gephi, we applied the ForceAtlas2 layout to our graph. We initially turned on the Approximate Repulsion parameter, set the Scaling factor to 2.0, and decreased the Edge Weight Influence to 0.4 so that the edge weights do not significantly distort the spatialization of the graph. The next step was visualizing the attributes that answer the research questions. First, we ranked node size by our average director homogeneity metric: the larger the node, the higher the homogeneity score. This allows the audience to easily see which directors do or do not re-use the same crew members, directly addressing the first research question. This metric, as indicated by the name, is only defined for directors, so crew members' homogeneity scores are set to 0, which makes crew members the smallest nodes in the graph. We initially set the minimum and maximum sizes to 10 and 50, respectively. Subsequently, we partitioned the color of the nodes based on our “type” attribute (the grouped labels), an abbreviated encoding in which each letter represents the gender, ethnicity, and renowned status of a director, respectively. For example, MWH represents Male, White, and Renowned. There’s a distinct color for each label, and since crew members do not have labels, they’re represented by NaN and are all colored purple, as indicated by our legend. This feature allows us to visually compare the re-use of crew members between different groups of directors, such as male versus female or white versus minority directors, which helps to directly answer the research questions. Below is our color legend:



<img src='https://drive.google.com/uc?export=view&id=1gKKOucKO1IFdZZnnZL8fYFUeEhoEz5jw' width="100" height="300">
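
The size ranking described above can be thought of as rescaling each node's homogeneity score into the chosen size range. The sketch below assumes a plain linear interpolation between the minimum and maximum size (Gephi also offers other interpolation curves); the function name `rank_size` is hypothetical.

```python
def rank_size(value, v_min, v_max, size_min=10.0, size_max=50.0):
    """Linearly map an attribute value in [v_min, v_max] to a node size."""
    if v_max == v_min:
        return size_min  # all values equal: nothing to rank
    fraction = (value - v_min) / (v_max - v_min)
    return size_min + fraction * (size_max - size_min)
```

Under this mapping, crew nodes with a score of 0.0 receive the minimum size whenever 0.0 is the smallest value in the graph.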


After ranking the node size on average director homogeneity and partitioning node color on grouped labels, we adjusted the parameters of the ForceAtlas2 algorithm and the size rankings to better space the director nodes and minimize edge crossings. Specifically, we increased the scaling parameter to 50.0, increasing the repulsion between nodes, in order to maximize the canvas space and make the graph more legible. Since we increased the scaling, we also increased the minimum and maximum sizes of the nodes to 25 and 700, respectively. This size disparity minimizes the size of crew member nodes, but the goal is to highlight the 101 director nodes, not the 24,859 crew member nodes. We also briefly turned on Prevent Overlap, which effectively decreased node and edge overlap. 

Once the graph was spatialized and the most important attributes were visualized, we implemented more features. We changed the edge color from the color of the nodes to black and reduced the edge opacity by decreasing its scale parameter, making the graph more visually legible. We also turned on the “Hide non-selected” parameter so that the labels of nodes and edges only appear when you click on them, rather than 75,000 labels appearing all at once. Each node label included the director’s name, average homogeneity score, and “type”, while each edge label contained the weight between a director and crew member and the department a crew member worked in. The edge labels are particularly important because they allow us to visually analyze which type of crew members directors value. Below is our main Gephi visualization of our weighted undirected graph:




<img src='https://drive.google.com/uc?export=view&id=1K-K69Oc7b-sGB4JG3p7wlHPqkMf4FqLu' width="800" height="600">



### **Part II: Co-Occurrence Network Methodology**

Along with the main visualization of our weighted, undirected network, we also created a bipartite network to analyze the co-occurrences between directors. This was accomplished by regenerating our network with the links removed between directors and crew members who happen to be directors in our dataset. This preserves the bipartite property of distinct node classes, since nodes of the same class (directors or crew members) cannot be linked to one another. The new file, titled “for_bipartite_dcnet.graphml”, also changed the “type” attribute to values of either “director” or “crew member,” which was instrumental to creating the graph. After importing the file into Gephi, we projected the graph onto the director node class with the MultiMode Networks Projection plug-in. This tool lets you specify which attribute to project with (we used our “type” attribute) and then which node class to project onto with the “Left matrix” and “Right matrix” parameters. We set the Left matrix to “director - crew member” and the Right matrix to “crew member - director” so that the resulting graph contains only directors. After choosing to remove extraneous nodes and edges, we ran the plug-in and generated the director co-occurrence projection of our bipartite network. This graph allows us to delve deeper into the phenomenon of directors re-using crew members. We can analyze how connected the directors themselves are by seeing whether certain types of directors (from White to Renowned to Indigenous) hire the same crew members as other types of directors. 


<img src='https://drive.google.com/uc?export=view&id=1I5CJERzmTQHoKlNteBPtLxwKoctV3m3L' width="800" height="600">
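
The projection described above was done in Gephi. As a rough cross-check outside Gephi, the same idea can be sketched in NetworkX. This is a minimal illustration on a toy graph, not the code used for the report: the node names are made up, and `bipartite.weighted_projected_graph` is assumed to be a reasonable stand-in for the plug-in because it weights each director-director link by the number of crew members the two directors share.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Toy bipartite graph (hypothetical names): directors on one side, crew on the other.
B = nx.Graph()
directors = ["Director A", "Director B", "Director C"]
crew = ["Editor 1", "Composer 2", "Designer 3"]
B.add_nodes_from(directors, type="director")
B.add_nodes_from(crew, type="crew member")
B.add_edges_from([
    ("Director A", "Editor 1"),
    ("Director A", "Composer 2"),
    ("Director B", "Editor 1"),
    ("Director B", "Composer 2"),
    ("Director C", "Designer 3"),
])

# Project onto the director node class; each edge weight is the number of
# crew members that both directors are linked to.
co_occurrence = bipartite.weighted_projected_graph(B, directors)

for u, v, data in co_occurrence.edges(data=True):
    print(u, v, data["weight"])
```

In this toy graph, Directors A and B share two crew members, so their link has weight 2, while Director C shares none and is left without any links.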



### **Part III: Findings**

Our main visualization and other sub-visualizations illustrate interesting patterns of collaboration between Hollywood directors and crew members. It’s immediately apparent from the visualization that the majority of Hollywood directors in our network are Male and White, regardless of whether they’re renowned. As indicated by the color legend, most director nodes are either light blue or a darker green, which represent Male/White/Renowned and Male/White directors, respectively. It’s also visually clear that strictly Male and White directors have a greater tendency to re-use the same crew members, regardless of their renowned status. We say “strictly” because these directors have only two attributes in the grouped labels; they do not have a third attribute of Renowned or Q (LGBTQ+). Of the ten directors with the highest average homogeneity scores, nine were White and nine were Male; the only exceptions were Tyler Perry (Male and Black) and Lilly Wachowski (Female, White, and Queer), as shown below.

<img src='https://drive.google.com/uc?export=view&id=1LIVeOlaneXSBundLRFg_JH7ZY4M0Wfyr' width="800" height="600">

Meanwhile, the bottom ten directors with respect to average homogeneity scores have a more even distribution with respect to gender and ethnicity, as seen below:

<img src='https://drive.google.com/uc?export=view&id=1GoaenSyS5beYz1QfxFurFGS4mJvx4Zex' width="800" height="600">




Thus, we can conclude from the main visualization and sub-visualizations that Male and White directors typically re-use the same crew members. The visualizations also depict how renowned directors have a greater tendency to re-use the same crew members, while “unrenowned” directors have more variance in their homogeneity. Of the 20 renowned directors, almost 95% had a role homogeneity score greater than 0.20, with 10 of these directors having a score greater than 0.40, including the owner of the highest average director homogeneity score: Peter Jackson. The filter below, which shows only renowned directors, illustrates how their node sizes are more consistent than those of less renowned directors.

<img src='https://drive.google.com/uc?export=view&id=1UUeRmUgzAnp95IQ9PKgD6wH0MlM7unV1' width="600" height="400">



<img src='https://drive.google.com/uc?export=view&id=1IXNQ7ARk-rcLihHvHhQO1mdsZ-fWTwG5' width="800" height="600">

The average homogeneity of less recognized directors tends to vary more than that of renowned directors. Some of these directors have very high homogeneity scores, such as Tyler Perry (indicated by the largest orange node), Steven Soderbergh, and Clint Eastwood. However, the unrenowned subgraph contains more nodes that are much smaller in size, showing that their average homogeneity scores are significantly lower.



95% of the renowned directors are also Male and White, as seen by how most of the director nodes are light blue. Thus, it’s clear that directors who are Male, White, and Renowned (or just generally Male and White) consistently have the highest average homogeneity scores. Even more interesting, M/W/R directors not only tend to re-use the same crew members, but also hire many of the same people as each other. Some of the most interesting links and nodes in our analysis come from our co-occurrence network. Since the weight of each link is the number of people a pair of directors have both hired, we filtered the co-occurrence graph to show only links with a weight of at least 1,000, which isolates the director pairs with the most co-occurrences. Of the seven pairs of director nodes with a link weight of 1,000 or greater, six were pairs in which both directors are Male, White, and Renowned. The only pairing that deviated was between Martin Scorsese and Spike Lee, who are Male/White and Male/Black, respectively; they also have the greatest number of co-occurrences, with 2,598 crew members hired in common:

<img src='https://drive.google.com/uc?export=view&id=129ekdxw1o1wH8r-PWZkO0YNUh7VjzNrc' width="800" height="600">

Therefore, our visualizations not only depict that renowned directors re-use the same crew members more than unrenowned directors, but that specifically Male, White, and Renowned directors have an even greater tendency to re-use the same crew and hire from the same pool of people. This also indicates how Male, White and Renowned, and to a lesser extent Male and White, directors are well connected to each other in Hollywood and have greater stability in who they hire.


Among the 81 other directors (i.e., unrenowned) in the network, we also analyzed the average homogeneity of minority directors, meaning directors who are neither Renowned nor both Male and White. The visualizations show how minority director node sizes have more variability and suggest that their average homogeneity is lower overall, which our **Analyses** will confirm.

Here's the visualization for minority directors:

<img src='https://drive.google.com/uc?export=view&id=1_ORO9OzHfjPIQ3-wYDaGpPj-TuRoKFZ8' width="800" height="600">


Here's the visualization for "Other" directors (just Male and White):

<img src='https://drive.google.com/uc?export=view&id=1o-a80aHRWyYDVF84ebiQuU60dmDUpea4' width="800" height="600">




Our visualizations also depict how female directors typically work with more of a revolving door of crew members, while male directors have a higher tendency to re-use the same crew members. This pattern is reinforced by the fact that nine of the ten largest average director homogeneity scores belonged to men.

Male directors:

<img src='https://drive.google.com/uc?export=view&id=1O175j-y_pWWmDLCFRZrSnKORVXEhUFWG' width="800" height="600">

Female Directors:

<img src='https://drive.google.com/uc?export=view&id=1zdLF6bHudfvnWU3VyPYJsDbK2Vzl_3Wz' width="800" height="600">

Along with the set of co-occurrences, one of the more interesting sets of nodes belongs to the Wachowski sisters. From background knowledge, we expected that they have worked with many of the same crew members, so we created an ego network in Gephi, along with a subfilter that only includes nodes with a degree greater than or equal to 4 (k-core = 4). The following visualizations illustrate how the Wachowski sisters have employed many of the same crew members, with their ego network even taking the shape of a triangle. Our **Analyses** will confirm this phenomenon by examining the number of triangles of each respective director.

Here is their ego network:

<img src='https://drive.google.com/uc?export=view&id=1kwxk-rJ2AbahVVx0HdRbBQC6-Ak9MMCP' width="800" height="600">


Here is their ego network with a k-core = 4:

<img src='https://drive.google.com/uc?export=view&id=1FbkzD_F0M2Di1u6gOEEv9TGs8ZSBbB91' width="800" height="600">
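
The ego network and k-core filter above were built in Gephi. For readers who prefer code, a comparable filter can be sketched in NetworkX. This is only an illustration under stated assumptions: the graph and node names are hypothetical, `k=2` is used so the toy example stays small (the report used 4), and `nx.ego_graph` and `nx.k_core` are assumed to approximate the Gephi filters.

```python
import networkx as nx

# Toy graph (hypothetical names): two directors who share four crew members,
# plus one crew member each that the other director never hired.
G = nx.Graph()
shared_crew = ["c1", "c2", "c3", "c4"]
for member in shared_crew:
    G.add_edge("Director X", member)
    G.add_edge("Director Y", member)
G.add_edge("Director X", "c5")
G.add_edge("Director Y", "c6")

# Combined ego network: each director plus everyone they are directly linked to.
ego = nx.compose(nx.ego_graph(G, "Director X"), nx.ego_graph(G, "Director Y"))

# Keep only the k-core: nodes with at least k neighbours inside the subgraph.
core = nx.k_core(ego, k=2)
print(sorted(core.nodes()))
```

Note that a k-core is stricter than a plain degree filter: nodes are removed repeatedly until every remaining node has at least k neighbours within the remaining subgraph. Here the crew members hired by only one director drop out, leaving the two directors and the crew they share.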

## **Subtask 4: Analyses**
---
Analysis was conducted with the goal of gaining an understanding of the network so that we can effectively answer the research questions. The process consisted of three main focuses: analyzing the raw data, engineering a metric to answer our research questions, and analyzing the resulting network. The following analysis was conducted based on our main network, Network 3.

**Part 1: Raw Data**

Before answering our research questions, it was necessary to obtain various summary statistics of our raw data. Naturally, the first step was looking at the size of our network. Our data consists of 101 directors who have directed 1,216 movies and employed 24,859 crew members for a select group of roles. Ten roles were included in the analysis: cinematography, costume design, co-directors, film editing, the makeup department, music, production design, the sound department, special effects, and writing credits. Our resulting network has 24,960 nodes (101 directors plus 24,859 crew members) and 55,647 edges. Since our research questions largely focus on comparing minority groups to others, we also looked at the distribution of ethnicity, sex, and sexual orientation. Of the 101 directors, 70 are White, 13 are Black, 11 are Asian, 5 are Latin American, and 2 are Indigenous. 74 of the directors are Male and the remaining 27 are Female. There are also 7 directors who identify as Queer. Additionally, the top 20 highest-grossing directors have been designated as “renowned”.

**Part 2: Metric Engineering**

When envisioning our network, it seemed intuitive to weigh director-crew edges by the strength of their relationship. To quantify this relationship strength (S), a metric was created and applied to each edge.


$$S=
\begin{cases}
0 & \text{; } P=1\\
\frac{N-1}{P-1} & \text{; } P>1\\
\end{cases}$$

<p><center><i>N = # movies on which a crew member was employed in their role (so N - 1 is the number of re-uses)<br>P = # of the director's movies that used the role (so P - 1 is the number of opportunities to be re-used)</br></i></center></p>


The relationship strength metric is calculated by dividing the number of times a crew member was re-employed for their role (N - 1) by the number of opportunities available to be re-employed for their role (P - 1). This results in a relationship strength value between 0 and 1.

Function used to calculate weights:
```python
def Get_Weights(crew_list, weight_dict, dept_counts, exclude=True):
    """Returns a dictionary of weights between the director and each crew member, organized by role"""
    for role in crew_list:
        if role[0] in exc_roles and exclude: # exc_roles is a globally defined list of all roles to exclude
            continue
        if role[0] in weight_dict.keys():
            if role[1] in weight_dict[role[0]].keys(): # If we have seen the crew member before, that means they have been re-used
                try:
                    weight_dict[role[0]][role[1]] += 1/(dept_counts[role[0]]-1) # Add 1/number of possible re-uses to their weight
                except ZeroDivisionError:
                    weight_dict[role[0]][role[1]] += 0 # If the department was only used in one movie, the weight remains at zero
            else: # Add new crew member to existing role
                weight_dict[role[0]][role[1]] = 0 # Weight is initialized to zero since this is the first time we have seen them
        else: # Initialize the role in the weight dictionary
            weight_dict[role[0]] = {role[1]:0} # The weight is set to zero for a new crew member since this is the first time we have seen them
    return weight_dict
```
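
To make the metric concrete, here is a small worked sketch of the formula for S on its own. It is not part of the project code: `relationship_strength` is a hypothetical helper, and it reads N as the number of films on which the crew member held the role and P as the number of the director's films that used the role, which is how `Get_Weights` appears to accumulate the weight.

```python
def relationship_strength(times_employed, role_opportunities):
    """S = (N - 1) / (P - 1), or 0 when the role was only used once (P = 1)."""
    if role_opportunities <= 1:
        return 0.0
    return (times_employed - 1) / (role_opportunities - 1)

# Hypothetical director whose films used an editor on five occasions (P = 5).
print(relationship_strength(5, 5))  # 1.0: same editor on every film
print(relationship_strength(3, 5))  # 0.5: editor on three of the five films
print(relationship_strength(1, 5))  # 0.0: editor hired once, never re-used
print(relationship_strength(1, 1))  # 0.0: role only appeared in one film
```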

While planning out our network, we also realized that node size should be directly related to the research question. To accomplish this, a second metric was engineered to assign scores to each director based on how often they re-use the same crew members. 

$$H = 
\begin{cases}
1 & \text{; } u = 1\\
1-\frac{u}{n} & \text{; } u > 1\\
\end{cases}$$

<p><center><i>u = unique crew members for role across movies<br>n = total crew members for role across movies</br></i></center></p>




This was done by calculating a homogeneity score (H) for each role employed by a director and then taking the average across roles. The first step in calculating role homogeneity is obtaining the total number of individuals employed for a role across all movies for a given director (n), counting repeat appearances. The number of unique individuals in the given role (u) is then divided by this total, and the resulting fraction is subtracted from one. Again, the resulting score is between 0 and 1.

Function for calculating role homogeneity for a given director:
```python
import numpy as np

def Homog_By_Dir(homog_dict):
    """
    Returns the homogeneity score of each role and the overall homogeneity
    score for a given director (the average homogeneity across roles).
    """
    homogeneity = {}
    for key in homog_dict.keys(): # For each department...
        unique = len(homog_dict[key].keys()) # Number of unique crew members
        total = sum(homog_dict[key].values()) # Total number of crew members
        if unique == 1:
            homogeneity[key] = 1
        else:
            homogeneity[key] = 1 - (unique/total)
    total_homogeneity = np.mean(list(homogeneity.values())) # Average across roles
    return homogeneity, total_homogeneity
```
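
As a quick sanity check of the formula for H, the following sketch works through a made-up director with two roles. The crew names and counts are hypothetical, and `role_homogeneity` is an illustrative helper rather than the project function; it assumes each role is stored as a mapping from crew member to the number of films they worked on.

```python
from statistics import mean

def role_homogeneity(appearances):
    """H for one role, given {crew member: number of films worked in that role}."""
    unique = len(appearances)
    total = sum(appearances.values())
    if unique == 1:
        return 1.0
    return 1 - unique / total

# Hypothetical director with two roles across four films.
editing = {"Editor A": 4}  # one editor on all four films
music = {"Composer B": 2, "Composer C": 1, "Composer D": 1}  # three composers, four credits

scores = {"film editing": role_homogeneity(editing), "music": role_homogeneity(music)}
average = mean(scores.values())
print(scores, average)
```

Film editing scores 1 because a single editor was used throughout, music scores 1 - 3/4 = 0.25, and the director's average homogeneity is 0.625. A role in which every crew member appears exactly once would score 0.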


**Part 3: Answering the Research Questions**

**Network Characteristics:**

The first research question we decided to analyze was: How will you characterize the film-director network? To answer this question, we looked at statistics for various network elements. First, we wanted to determine whether our network was sparse or dense. We calculated a density of 0.00018 which seems to be quite low. This indicates that directors are primarily working with a small fraction of the crew members that exist in the network. A sparse network could be indicative of directors reusing the same crew members. Next, we investigated whether or not the network could be classified as a small world by analyzing triangles. Our network has 5339 triangles and the mean clustering coefficient among directors was 0.000765. Such a low clustering coefficient is indicative of a network that would not be classified as a small world. This, however, is likely because directors do not typically collaborate and our network does not include lin

Function for gathering network statistics:
```python
import networkx as nx
import numpy as np

def Analyze_Network(graph, directors):
    """Returns a dictionary of network characteristic statistics"""
    num_edges = graph.number_of_edges()
    connected = nx.is_connected(graph)
    dir_triangles = nx.triangles(graph) # Number of triangles each node takes part in
    num_triangles = int(sum(dir_triangles.values())/3) # Each triangle is counted once per vertex
    cluster_coeffs = nx.clustering(graph,nodes=directors) # Clustering coefficient of each director
    avg_clustering = np.mean(list(cluster_coeffs.values()))
    # Director with the largest clustering coefficient
    max_clustering_idx = np.argmax(list(cluster_coeffs.values()))
    max_clustering_key = list(cluster_coeffs.keys())[max_clustering_idx]
    max_clustering = {max_clustering_key:cluster_coeffs[max_clustering_key]}
    density = nx.density(graph)
    assortativity = nx.degree_assortativity_coefficient(graph, weight='avg_dir_homog', nodes=directors)
    stats = {'connected':connected, 'num_edges':num_edges, 'num_triangles':num_triangles, 
             'avg_clustering':avg_clustering, 'density':density, 'assortativity':assortativity,
             'max_clustering':max_clustering, 'cluster_coeffs':cluster_coeffs}
    return stats
```
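
The reported density can be reproduced by hand from the node and edge counts given in Part 1. The sketch below assumes the standard density formula for a simple undirected graph, 2m / (n(n - 1)); `undirected_density` is a hypothetical helper written for this check.

```python
def undirected_density(num_nodes, num_edges):
    """Density of a simple undirected graph: 2m / (n(n - 1))."""
    return 2 * num_edges / (num_nodes * (num_nodes - 1))

# Node and edge counts reported for the main network.
density = undirected_density(24_960, 55_647)
print(f"{density:.5f}")  # 0.00018
```

In other words, fewer than two out of every ten thousand possible node pairs are actually linked.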

**Important Nodes:**

The next research question of focus was: Did you find any interesting nodes/links? In search of interesting nodes, we first looked into some of the network statistics. Knowing that the Wachowski sisters tend to work on the same movies, we decided to look at the number of triangles for each of them. We found that the sisters have the exact same number of triangles, 464. Although the network as a whole might not be a small world, this indicates the existence of small worlds within the network. We also looked for hubs in our network by comparing the unweighted degree of each director and calculating betweenness scores. The five directors with the highest unweighted degree are, in order, Steven Spielberg, Tim Burton, Ridley Scott, Ron Howard, and Roland Emmerich. These are also the five directors, in the same order, with the highest betweenness centrality. Additionally, all five are considered renowned, which indicates that the hubs in the film industry are renowned directors. Other than Roland Emmerich, these directors all have a betweenness centrality of over 40,000, which is significantly higher than the average of 12,474. On the topic of betweenness centrality, it was interesting to see that Jordan Peele and Chloe Zhao were both in the bottom 12 of the betweenness ranking. Based on our domain knowledge, they are considered important directors who are on the rise in Hollywood. It would be interesting to see if they climb the betweenness ranking in the next few years.

Function for finding important/interesting nodes:
```python
from operator import itemgetter

import networkx as nx
import numpy as np

def Analyze_Hubs(graph, directors):
    """Returns a dictionary of various node statistics"""
    crew = [name for name in graph.nodes() if name not in directors]
    # Rankings ending in 1 use degree weighted by 'crew_homog'; rankings ending in 2 use unweighted degree
    dir_degree_rankings1 = sorted(graph.degree(nbunch= directors, weight ='crew_homog'), key=itemgetter(1), reverse=True)
    dir_degree_rankings2 = sorted(graph.degree(nbunch= directors), key=itemgetter(1), reverse=True)
    crew_degree_rankings1 = sorted(graph.degree(nbunch=crew,weight='crew_homog'), key=itemgetter(1), reverse=True)
    crew_degree_rankings2 = sorted(graph.degree(nbunch=crew), key=itemgetter(1), reverse=True)
    avg_dir_degree1 = np.mean([ranking[1] for ranking in dir_degree_rankings1])
    avg_dir_degree2 = np.mean([ranking[1] for ranking in dir_degree_rankings2])
    # Betweenness over shortest paths that start at a director and end at any node (unnormalized)
    all_betweenness = nx.betweenness_centrality_subset(graph, directors, graph.nodes()) 
    dir_betweenness = {}
    for name, val in all_betweenness.items(): # Keep only the directors' scores
        if name in directors:
            dir_betweenness[name] = val
    betw_ranking = sorted(dir_betweenness.items(), key=lambda x:x[1], reverse=True) 
    avg_betweenness = np.mean([ranking[1] for ranking in betw_ranking])
    stats = {'dir_degree_rankings1': dir_degree_rankings1, 'dir_degree_rankings2': dir_degree_rankings2,
             'crew_degree_rankings1':crew_degree_rankings1, 'crew_degree_rankings2': crew_degree_rankings2,
             'avg_dir_degree1':avg_dir_degree1, 'avg_dir_degree2':avg_dir_degree2, 'all_betweenness':all_betweenness, 
             'dir_betweenness':dir_betweenness, 'ranking':betw_ranking, 'avg_betweenness':avg_betweenness}
    return stats
```
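
To illustrate why the degree and betweenness rankings tend to agree in a director-crew network, here is a minimal sketch on a toy graph. The node names are hypothetical and the example is not drawn from our data; it simply calls the same NetworkX functions on a graph small enough to reason about by hand.

```python
import networkx as nx

# Toy graph (hypothetical names): two directors who share one crew member.
G = nx.Graph()
G.add_edges_from([
    ("Director 1", "crew a"), ("Director 1", "crew b"), ("Director 1", "crew c"),
    ("Director 2", "crew c"), ("Director 2", "crew d"),
])
directors = ["Director 1", "Director 2"]

# Unweighted degree: number of distinct crew members linked to each director.
degree_ranking = sorted(G.degree(directors), key=lambda pair: pair[1], reverse=True)

# Betweenness restricted to shortest paths that start at a director (sources)
# and end at any node (targets); values are left unnormalized.
betweenness = nx.betweenness_centrality_subset(G, sources=directors, targets=list(G.nodes()))
betweenness_ranking = sorted(
    ((name, betweenness[name]) for name in directors),
    key=lambda pair: pair[1], reverse=True,
)
print(degree_ranking)
print(betweenness_ranking)
```

Director 1 has more crew links, so more shortest paths from the other director pass through that node on the way to its crew. This is the same mechanism that places high-degree directors at the top of both rankings.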


**Homogeneity:**

Finally, we analyzed role homogeneity across directors and departments to answer our overarching research question: How widespread is the phenomenon of directors re-using the same crew?  To get a sense of what we might find, we first obtained a ranking of all directors based on their homogeneity scores. Since it is difficult to quantify how widespread the phenomenon is, we partitioned the directors so that we could have a relative comparison. We created three partitions: renowned, minority, and other. The minority partition is composed of directors who are not male and white along with directors who identify as queer. For each partition, the distribution of role homogeneity scores was plotted on a histogram and the mean was calculated.


<img src='https://drive.google.com/uc?export=view&id=1pfzwjXJPJoToVtkHU4p-REn8Ua9S5o2C' width="800" height="300">




We found that renowned directors had the highest average role homogeneity at 0.363 and minority directors had the lowest at 0.227. The other directors were in between, with an average role homogeneity of 0.309. Looking at the distributions, it appears that about half of the renowned directors had a score of over 0.4, while only a small fraction of minority directors and other directors had such a score. Comparing the non-renowned groups, the minority directors have a right-skewed distribution which peaks around 0.15. The other directors’ distribution is similar, but more uniform from 0.1 to 0.35. From the analysis of these distributions and mean scores, we conclude that re-using the same crew is a more widespread phenomenon among renowned and white male directors than among minority directors. Since the re-use of similar crews seems to be an indicator of success, we conclude that, over the entire film industry, the re-use of similar crews is a widespread phenomenon. It seems that the directors who have the ability to re-use similar crews will choose to do so.

Lastly, we were able to quantify how widespread the phenomenon was by partitioning the homogeneity scores by role. Again, we plotted distributions for each role and calculated the average scores. 


<h2>Homogeneity Scores By Role</h2>

<img src='https://drive.google.com/uc?export=view&id=1yVtclN1Dnt80Yx5n9_ne3acQi8SA7-rs' width="1800" height="800">

Based on the average scores, there are five roles where re-using the same crew members is significantly more widespread. From most widespread to least, these roles are film editing, music, cinematography, costume design, and production design. The average homogeneity scores for these roles range from 0.301 for production design to 0.426 for film editing. The next highest average homogeneity score was significantly lower at 0.276 for the writing department. From this analysis, we conclude that homogeneity is widespread among the mentioned top five roles, and not as widespread among the writing, sound, co-director, makeup, and special effects roles.

Functions for comparing homogeneity scores:
```python
def Group_By_Role(role_homogeneity):
    """Returns a dictionary mapping each role to the list of its homogeneity scores across directors"""
    homog_by_role = {}
    for director in role_homogeneity.keys():
        for role in role_homogeneity[director].keys():
            score = role_homogeneity[director][role]
            # Start a new list the first time a role is seen, then append the score
            homog_by_role.setdefault(role, []).append(score)
    return homog_by_role

def Group_By_Attribute(dir_homogeneity, dir_attributes):
    """Returns a dictionary of director homogeneity scores grouped into the renowned, minority, and other partitions"""
    groups = {'renowned':[], 'minority':[], 'other':[]}
    for director, homog in dir_homogeneity.items():
        other = dir_attributes[director]['other']
        gender = dir_attributes[director]['gender']
        ethn = dir_attributes[director]['ethnicity']
        renowned = other == 'H'
        minority = gender == 'F' or ethn != 'W' or other == 'Q'
        # Renowned is checked first, so each director lands in exactly one partition
        if renowned:
            groups['renowned'].append(homog)
        elif minority:
            groups['minority'].append(homog)
        else:
            groups['other'].append(homog)
    return groups
```

## **Conclusion**
---

After extracting data, generating networks, creating visualizations, and conducting an analysis, it can be easy to forget that the average person will not be able to simply look at our analysis and draw meaning from it. It is important to relate our findings back to the real world in a meaningful manner. Based on our findings, the following is our description of the film industry.

Hollywood appears to be a loosely connected community consisting of directors who, with a few exceptions, do not often collaborate. Most directors have only worked with a minuscule fraction of the community. That said, it seems as though each member of the community can be easily connected to another member through their own connections. Crew members connect directors and directors connect crew members. The members of the film industry who seem to connect the most people are the renowned directors. These are directors like Steven Spielberg, Tim Burton, and Ridley Scott, who have been in the film industry long enough to establish their prominence. As a result of their experience, they have the ability to be the bridge between many crew members. Something these directors have in common, besides being renowned, is that they are white. In fact, 19 of the 20 designated renowned directors are white and one is Asian.

These renowned directors seem to have figured out who they like to work with the most and have formed their own cliques with crew members that they like to re-use. This phenomenon seems to be present, but less pronounced, among male white directors who are not renowned. Because of this formation of cliques, minority directors appear to be forced into employing unfamiliar faces. The roles that directors seem to re-use the most are film editing, music, cinematography, costume design, and production design. These roles are among the most essential for directing a quality film. This could explain the lack of renowned minority directors, as they are not able to maintain as much stability in their employment of the most important roles due to the cliques formed by renowned directors.

This dynamic may not last much longer, however. Jordan Peele and Chloe Zhao are both young minority directors who appear to be on the rise. They do not yet have the same ability to bridge gaps between crew members, but if they continue directing movies, it is likely they will make more connections and build a network similar to those of today’s renowned directors. Jordan Peele and Chloe Zhao are two directors who could lead a change to the current dynamic of the film industry.

