In [2]:
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
from collections import Counter
from statistics import mean
import itertools
import my_functions as myf


# 1. **_Data_**


## **Data Preprocessing**


As always, in the data science area, you can find some inconsistencies in the provided data. Therefore, some modifications should be made to the data to make it consistent across all of the datasets you have. To ensure consistency in the data, keep the following in mind:

1.  Some of the heroes' names in 'hero-network.csv' are not found in 'edges.csv'. This inconsistency exists for the following reasons:

    - Some heroes' names in 'hero-network.csv' have extra spaces at the end of their names compared to their names in 'edges.csv'.
    - Some heroes' names in 'hero-network.csv' have an extra '/' at the end of their names compared to their names in 'edges.csv'.
    - The hero name 'SPIDER-MAN/PETER PARKER' in 'edges.csv' has been changed to 'SPIDER-MAN/PETER PAR' in 'hero-network.csv' due to a string length limit in 'hero-network.csv'.

2.  Some entries in the 'hero-network.csv' have the same hero in both columns. In the graph, these entries form a self-loop. Because a self-loop makes no sense in this network, you can safely remove those from the dataset.


In [3]:
hero_df = pd.read_csv("hero-network.csv")
edges_df = pd.read_csv("edges.csv")
nodes_df = pd.read_csv("nodes.csv")

We decided to normalize the names in all three DataFrames, since many names are identical except for extra trailing characters, and names in the hero-network DataFrame are truncated to 20 characters.

Here are some examples of the issues we found:


In [4]:
print("SPIDER-MAN/PETER PAR" in hero_df.values)  # truncated at 20 characters
print("SPIDER-MAN/PETER PARKERKER" in nodes_df.values)  # with an extra "KER" at the end
print("SPIDER-MAN/PETER PARKER" in edges_df.values)

True
True
True


We strip trailing spaces and "/" characters from every name, then truncate it to its first 20 characters.


In [5]:
hero_df["hero1"] = hero_df["hero1"].apply(lambda x: x.rstrip("/ ")[:20])
hero_df["hero2"] = hero_df["hero2"].apply(lambda x: x.rstrip("/ ")[:20])
edges_df["hero"] = edges_df["hero"].apply(lambda x: x.rstrip("/ ")[:20])
edges_df["comic"] = edges_df["comic"].apply(lambda x: x.rstrip("/ ")[:20])
nodes_df["node"] = nodes_df["node"].apply(lambda x: x.rstrip("/ ")[:20])
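
The normalization rule applied above can be checked on a few standalone strings. This is a minimal sketch: `normalize_name` is a hypothetical helper introduced here for illustration, and the `HAWK` variants are made-up inputs, not rows quoted from the files.

```python
def normalize_name(name, max_len=20):
    """Strip trailing '/' and spaces, then keep the first max_len characters."""
    return name.rstrip("/ ")[:max_len]


# Illustrative raw values showing the three kinds of inconsistency.
samples = [
    "SPIDER-MAN/PETER PARKER",  # longer than 20 characters
    "SPIDER-MAN/PETER PAR",     # already truncated
    "HAWK ",                    # trailing space (made-up example)
    "HAWK/",                    # trailing slash (made-up example)
]
normalized = [normalize_name(s) for s in samples]
```

Stripping before truncating matters: a name that is cut at 20 characters first could still end in a space or a slash.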

Let's check the same names again now that the data has been normalized: only the 20-character form should remain.


In [6]:
print("SPIDER-MAN/PETER PAR" in hero_df.values)  # truncated at 20 characters
print("SPIDER-MAN/PETER PARKERKER" in nodes_df.values)  # with an extra "KER" at the end
print("SPIDER-MAN/PETER PARKER" in edges_df.values)

True
False
False


Now let's remove the rows of hero_df that have the same hero in both columns.


In [7]:
hero_df.drop(hero_df[hero_df["hero1"] == hero_df["hero2"]].index, inplace=True)

## **Graphs setup**


First graph: Will be constructed using the data stored in the 'hero-network.csv' file, in which an edge between two heroes can be found if they have appeared in the same comic together. The number of edges between two heroes represents the number of times they have collaborated in different comics. The graph should be considered weighted and undirected. It is up to you to decide which metric to use to calculate the weights, but we anticipate that the cost will be lower for heroes with more collaborations. Please specify which metric you used to select the weights in the report.


We group hero_df by hero1 and hero2 to count the occurrences of each pair, which corresponds to the number of times the two heroes collaborated. The result is a new DataFrame in which this count is stored in the 'weight' column.


In [8]:
count_hero_df = hero_df.groupby(["hero1", "hero2"]).size().reset_index(name="weight")

We set the weight to 1/n, where n is the number of collaborations between the two heroes, so that the weight is inversely related to the number of collaborations. In this way, pairs of heroes who collaborated often are connected by lower-cost edges.


In [9]:
count_hero_df["weight"] = count_hero_df["weight"].apply(lambda x: 1 / x)
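
Grouping on `(hero1, hero2)` treats each pair as ordered, so if the file lists the same two heroes once as (A, B) and once as (B, A), the two rows are counted separately. The following is a minimal sketch of an order-independent count with the same 1/n weight, assuming the direction of a pair carries no meaning. The rows are made up, and `collaboration_weights` is a hypothetical helper, not part of `my_functions`.

```python
from collections import Counter


def collaboration_weights(pairs):
    """Map each unordered hero pair to 1 / (number of collaborations)."""
    # Sorting each pair makes (A, B) and (B, A) the same key.
    counts = Counter(tuple(sorted(pair)) for pair in pairs)
    return {pair: 1 / n for pair, n in counts.items()}


# Made-up rows: A and B collaborate twice (listed in both orders), A and C once.
rows = [("A", "B"), ("B", "A"), ("A", "C")]
weights = collaboration_weights(rows)
```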

In [10]:
count_hero_df


Unnamed: 0,hero1,hero2,weight
0,24-HOUR MAN/EMMANUEL,"FROST, CARMILLA",1.0
1,24-HOUR MAN/EMMANUEL,KILLRAVEN/JONATHAN R,1.0
2,24-HOUR MAN/EMMANUEL,M'SHULLA,1.0
3,3-D MAN/CHARLES CHAN,ANGEL/WARREN KENNETH,1.0
4,3-D MAN/CHARLES CHAN,ANT-MAN II/SCOTT HAR,1.0
...,...,...,...
224094,ZZZAX,"RODRIGUEZ, DEBRA",1.0
224095,ZZZAX,"ROSS, GEN. THADDEUS",0.5
224096,ZZZAX,"SUMMERS, NATHAN CHRI",1.0
224097,ZZZAX,TIGRA/GREER NELSON,1.0


Now we can create a 'weighted graph' using the weight we calculated as our edge attribute.


In [11]:
G_hero_net = nx.from_pandas_edgelist(
    count_hero_df, "hero1", "hero2", edge_attr="weight", create_using=nx.MultiGraph
)

In [12]:
G_hero_net["ZZZAX"]["RODRIGUEZ, DEBRA"]

AtlasView({0: {'weight': 1.0}})

**Second graph** : The data in 'nodes.csv' and 'edges.csv' will be used to construct the second graph. The type of node (hero/comic) can be found in 'nodes.csv', and an edge between a hero node and a comic node can be found in 'edges.csv' when the hero has appeared in that specific comic. This graph is assumed to be undirected and unweighted.


To use the 'node' column as the index of the nodes_df DataFrame, its values must be unique, so we first check for duplicates.


In [13]:
print(len(nodes_df))
print(len(nodes_df.node.unique()))


19090
19087


We see that only 3 rows repeat a value in the 'node' column, so we drop them to keep the DataFrame consistent.


In [14]:
nodes_df = nodes_df.drop_duplicates(subset=["node"])

We set the 'node' column as the index of the DataFrame and convert the result to a dictionary keyed by node name.


In [15]:
nodes_attr = nodes_df.set_index("node").to_dict(orient="index")

Now we can proceed with the creation of the second graph. First we create a new graph starting from the edges dataframe and then we select the nodes_attr dictionary to be the attributes of our nodes.


In [16]:
G_edges_net = nx.from_pandas_edgelist(edges_df, "hero", "comic")
nx.set_node_attributes(G_edges_net, nodes_attr)

In [17]:
G_edges_net.nodes["VENUS II"]

{'type': 'hero'}

# 2. **_Backend Implementation_**


The goal of this part is the implementation of a controller system that has different functionalities. The controller should take as input an identifier "i" and run the associated function_i applied to the graph you create from the downloaded data.

Definition: As the number of nodes and edges grows, we may request to work on a subset of the data to reduce computation time and improve network visualization. In this case, we will ask you only to consider the data for top N heroes. We define the top N heroes as follows:

- Top N heroes: The top N heroes who have appeared in the largest number of comics. The 'edges.csv' file, which represents the comics in which each hero has appeared, can be used to filter these N heroes.

Note: When the value of N is not set by the user, the function should consider the whole data.
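
The definition above can be sketched with a simple appearance count. This is only an illustration under stated assumptions: `top_n_heroes` is a hypothetical helper (not the `myf.TOP_N` used in this notebook), the input is a list of (hero, comic) pairs as in 'edges.csv', and the sample pairs are made up.

```python
from collections import Counter


def top_n_heroes(hero_comic_pairs, n=None):
    """Return heroes ranked by number of distinct comics, limited to n if given."""
    appearances = Counter()
    # dict.fromkeys drops repeated (hero, comic) pairs while keeping their order.
    for hero, _comic in dict.fromkeys(hero_comic_pairs):
        appearances[hero] += 1
    ranked = [hero for hero, _count in appearances.most_common()]
    # When n is not set, the whole data is considered.
    return ranked if n is None else ranked[:n]


pairs = [("A", "C1"), ("A", "C2"), ("A", "C3"),
         ("B", "C1"), ("B", "C2"), ("B", "C2"),
         ("C", "C3")]
top_two = top_n_heroes(pairs, 2)
everyone = top_n_heroes(pairs)
```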


## **_Functionality 1 - Extract the graph's features_**


**Input**:

- The graph data
- The graph type (e.g., 1 or 2)
- N: denoting the top N heroes whose data should be considered

**Output**:

- The number of nodes in the network (if type 2, report for both node types)
- The number of collaborations of each superhero with the others (only if type 1)
- The number of heroes that have appeared in each comic (only if type 2)
- The network's density
- The network's degree distribution
- The average degree of the network
- The network's hubs (hubs are nodes whose degree is greater than the 95th percentile of the degree distribution)
- Whether the Network is sparse or dense
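
Several of the outputs listed above follow directly from the degree sequence. Below is a minimal sketch for a simple undirected graph stored as an adjacency dictionary; `graph_features` is a hypothetical helper (the notebook itself relies on functions from `my_functions`), the star graph is a made-up toy example, and linear interpolation is assumed for the percentile.

```python
from statistics import quantiles


def graph_features(adjacency):
    """Density, average degree and hubs of a simple undirected graph."""
    n = len(adjacency)
    degrees = {v: len(neighbors) for v, neighbors in adjacency.items()}
    m = sum(degrees.values()) // 2  # each edge is seen from both endpoints
    density = 2 * m / (n * (n - 1))
    average_degree = 2 * m / n
    # 95th percentile of the degree sequence (linear interpolation).
    threshold = quantiles(degrees.values(), n=100, method="inclusive")[94]
    hubs = [v for v, d in degrees.items() if d > threshold]
    return density, average_degree, hubs


# Toy star graph: one center linked to 20 leaves.
leaves = [f"LEAF {i}" for i in range(20)]
star = {"CENTER": set(leaves), **{leaf: {"CENTER"} for leaf in leaves}}
density, average_degree, hubs = graph_features(star)
```

In this toy graph only the center exceeds the 95th-percentile degree, so it is the only hub, and the low density marks the graph as sparse.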


In [18]:
def Functionality_1(graph, graph_type, N=None):
    # When N is given, restrict the first graph to the top N heroes;
    # otherwise the whole graph is used.
    if N is not None and graph_type == 1:
        graph = myf.TOP_N(N, graph, edges_df)
    print("\n")
    myf.number_of_nodes(graph_type, graph)
    myf.number_of_collaborations(graph_type, graph)
    print("\n")
    myf.number_of_hero_in_each_comic(graph_type)
    print("\n")
    dens = myf.graph_density(graph)
    print("The degree distribution is", myf.degree_distribution(graph))
    print("\n")
    print("The average degree is", myf.degree_mean(graph))
    print("\n")
    print("The node hubs are", myf.nodes_hubs(graph))
    print("\n")
    print("The density is", dens)
    print("\n")
    # A density of at least 0.5 is treated as dense.
    if dens >= 0.5:
        print("The graph is dense")
    else:
        print("The graph is sparse")

In [19]:
Functionality_1(G_hero_net, 1, N=40)




The graph have 40 nodes
WOLVERINE/LOGAN has 77 collaborations.
CYCLOPS/SCOTT SUMMER has 76 collaborations.
ANGEL/WARREN KENNETH has 76 collaborations.
PARKER, MAY has 59 collaborations.
WATSON-PARKER, MARY has 66 collaborations.
COLOSSUS II/PETER RA has 75 collaborations.
THOR/DR. DONALD BLAK has 78 collaborations.
SCARLET WITCH/WANDA has 77 collaborations.
BLACK WIDOW/NATASHA has 75 collaborations.
MR. FANTASTIC/REED R has 78 collaborations.
CAPTAIN AMERICA has 78 collaborations.
HUMAN TORCH/JOHNNY S has 78 collaborations.
BEAST/HENRY &HANK& P has 78 collaborations.
SUB-MARINER/NAMOR MA has 78 collaborations.
QUICKSILVER/PIETRO M has 75 collaborations.
JAMESON, J. JONAH has 76 collaborations.
NIGHTCRAWLER/KURT WA has 77 collaborations.
SHE-HULK/JENNIFER WA has 77 collaborations.
VISION has 77 collaborations.
FURY, COL. NICHOLAS has 75 collaborations.
ANT-MAN/DR. HENRY J. has 78 collaborations.
SHADOWCAT/KATHERINE has 69 collaborations.
DAREDEVIL/MATT MURDO has 78 collaborations.
JON

## Functionality 2 - Find top superheroes!


_Input_:

- The graph data
- A node (hero or comic)
- One of the given metrics : _Betweeness, PageRank, ClosenessCentrality, DegreeCentrality_
- N: denoting the top N heroes whose data should be considered

_Output_:

- The metric's value over the considered graph
- The given node's value

Note: Give an explanation regarding the features of the user based on all of the metrics (e.g. if the betweenness metric is high, what does this mean in practice, what if the betweenness is low but has a high PageRank value, etc.).


### Metrics


- **Degree centrality**:

  The degree centrality of a node measures its importance by the number of connections it has. It is defined as the number of edges incident to the node (its number of neighbors), normalized by the maximum possible number of neighbors:

  $ \text{Degree centrality}(v) = \frac{\text{number of neighbors of } v}{\text{number of nodes} - 1} $

  Nodes with high degree centrality are often considered important or influential in the graph, as they have a large number of connections and may be able to spread information or influence other nodes more effectively.


- **Betweenness centrality**:

  The betweenness centrality of a node measures its importance by how often it lies on shortest paths between other nodes. For every pair of other nodes $s$ and $t$, we take the fraction of shortest paths from $s$ to $t$ that pass through the node, and sum these fractions over all pairs:

  $ \text{Betweenness centrality}(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}} $

  where $\sigma_{st}$ is the number of shortest paths from $s$ to $t$ and $\sigma_{st}(v)$ is the number of those paths that pass through $v$.

  Nodes with high betweenness centrality are often considered important or influential in the graph, as they are more likely to be on the shortest path between other nodes and may be able to control the flow of information or influence other nodes more effectively.


- **Closeness centrality** :

  The closeness centrality of a node measures its importance by how close it is to all other nodes in the graph. It is defined as the reciprocal of the sum of the shortest-path distances from the node to all other nodes:

  $ \text{Closeness centrality}(v) = \frac{1}{\sum_{u \neq v} d(v, u)} $

  where $d(v, u)$ is the shortest-path distance between $v$ and $u$.

  Nodes with high closeness centrality are often considered important or influential in the graph, as they are able to reach other nodes quickly and may be able to spread information or influence other nodes more effectively.


- **Page rank**:

  PageRank is a way of measuring the importance of the nodes in a network. The PageRank of a node is based on the number and quality of the edges pointing to it. Nodes with a higher PageRank are considered more important; in the original web-search setting, they are the pages more likely to appear at the top of the search results.

  To calculate the PageRank of the nodes in a network, the algorithm follows these steps:

  - First, we transform our graph into a directed graph.
  - We assign an initial PageRank score to each node.
  - We then iteratively update the PageRank score of each node based on the scores of the nodes that link to it. Specifically, the PageRank of a node is the sum, over the nodes that link to it, of each linking node's PageRank divided by its number of outbound links.
    A damping factor is also applied at each update: it models a random jump to any node, which keeps score from being trapped in parts of the graph and allows the algorithm to converge.
  - We repeat the previous step until the scores converge.
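
The PageRank steps listed above can be sketched as a short power iteration. This is a minimal illustration, not necessarily how `myf.page_rank_metric` works: the function name, the damping value of 0.85, the stopping tolerance, and the toy graph are all assumptions, and the scores here are normalized to sum to 1.

```python
def pagerank(adjacency, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a dict mapping node -> list of successors."""
    nodes = list(adjacency)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}  # uniform initial score
    for _ in range(max_iter):
        # Score held by nodes without outbound links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not adjacency[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            if adjacency[v]:
                share = damping * rank[v] / len(adjacency[v])
                for u in adjacency[v]:
                    new_rank[u] += share
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Undirected toy graph written as a directed one (each edge in both directions):
# a triangle A-B-C with an extra node D attached to C.
graph = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
ranks = pagerank(graph)
```

On this toy graph the best-connected node C receives the highest score and the pendant node D the lowest.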


### Correlation between different metrics


The correlation between different network metrics will depend on the specific characteristics of the network and the definition of the metrics. In general, different metrics may capture different aspects of the network and may not be strongly correlated.

For example, **degree centrality** measures the number of connections a node has, while **betweenness centrality** measures the number of times a node acts as a "bridge" between other nodes in the network. These two metrics may not be strongly correlated, as they capture different aspects of the network.
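
A small made-up graph makes the difference concrete. The sketch below computes unnormalized betweenness with a breadth-first accumulation (Brandes-style) on two triangles joined through a single node X; the `betweenness` function and the graph are illustrative assumptions, not the notebook's `myf.betweeness_metric`.

```python
from collections import deque


def betweenness(adjacency):
    """Unnormalized betweenness of an unweighted, undirected graph."""
    scores = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # Breadth-first search that counts shortest paths from source.
        dist = {source: 0}
        sigma = {v: 0 for v in adjacency}
        sigma[source] = 1
        preds = {v: [] for v in adjacency}
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path fractions.
        delta = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                scores[w] += delta[w]
    # Every unordered pair was counted from both of its endpoints.
    return {v: score / 2 for v, score in scores.items()}


edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "X"),
         ("X", "D"), ("D", "E"), ("E", "F"), ("D", "F")]
graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

degree = {v: len(neighbors) for v, neighbors in graph.items()}
centrality = betweenness(graph)
```

Here X and A both have degree 2, yet X lies on every shortest path between the two triangles while A lies on none, so their betweenness values differ sharply.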

In the same way, **closeness centrality** captures yet another aspect of the network: it is based on the average distance from a node to all other nodes. A node with high closeness centrality is, on average, close to the rest of the network and can be considered central to it.

Similarly, **PageRank**, which is a measure of the importance or ranking of a node based on the number and quality of the links pointing to it, may not be strongly correlated with other metrics such as degree centrality or betweenness centrality.


In [20]:
# TODO: Speed up betweenness, closeness and degree if possible.

In [21]:
def functionality_2(graph, node, metric, N=None):
    # Centralities are slow on large graphs, so prefer a small N.
    if N is not None:
        graph = myf.TOP_N(N, graph, edges_df)
    print("The " + str(metric) + " is", "\n")

    # Map each accepted metric name to the function that computes it.
    metric_functions = {
        "betweeness": myf.betweeness_metric,
        "pagerank": myf.page_rank_metric,
        "closenesscentrality": myf.closeness_metric,
        "degreecentrality": myf.degree_metric,
    }
    compute = metric_functions.get(metric.lower())
    if compute is None:
        print("This metric is not included")
        return

    # Compute the metric once and reuse it for the node lookup.
    values = compute(graph)
    print(values)
    print("\n")
    print("The value for the node " + str(node) + " is: ", values[node])

In [22]:
functionality_2(G_hero_net, "SCARLET WITCH/WANDA", "pagerank", 45)

The pagerank is 

{'WOLVERINE/LOGAN': 1.0128692903009693, 'CYCLOPS/SCOTT SUMMER': 1.0128844707869777, 'ANGEL/WARREN KENNETH': 1.012899544705697, 'PARKER, MAY': 0.9336481272329689, 'WATSON-PARKER, MARY': 0.9533666485141262, 'COLOSSUS II/PETER RA': 1.0129441754741095, 'THOR/DR. DONALD BLAK': 1.012958830264833, 'SCARLET WITCH/WANDA': 1.012973382178657, 'BLACK WIDOW/NATASHA': 1.0129878319377803, 'MR. FANTASTIC/REED R': 1.01300218025933, 'CAPTAIN AMERICA': 1.0130164278554001, "BLACK PANTHER/T'CHAL": 0.9931290628033298, 'HUMAN TORCH/JOHNNY S': 1.0130446310370567, 'BEAST/HENRY &HANK& P': 1.0130585806278907, 'SUB-MARINER/NAMOR MA': 1.0130724322923417, 'QUICKSILVER/PIETRO M': 0.9932581057844919, 'ROGUE': 0.9931267819451733, 'JAMESON, J. JONAH': 0.9931470174561849, 'NIGHTCRAWLER/KURT WA': 0.9931537195045245, 'SHE-HULK/JENNIFER WA': 1.0131401530059754, 'VISION': 1.01315343203084, 'ODIN [ASGARDIAN]': 0.9142593608819196, 'FURY, COL. NICHOLAS': 0.9932779275281086, 'ANT-MAN/DR. HENRY J.': 1.013192725

## Functionality 3 - Shortest ordered Route


**Input**:

- The graph data
- A sequence of superheroes h = [$h_{2}$ , ... , $h_{n-1}$]
- Initial node $h_{1}$ and an end node $h_{n}$
- N: denoting the top N heroes whose data should be considered

**Output**:

- The shortest walk of comics that you need to read to get from $h_{1}$ to $h_{n}$

Considerations: For this functionality, you need to implement an algorithm that returns the shortest walk that goes from node $h_{1}$ to $h_{n}$, visiting the nodes in h in order. The choice of $h_{1}$ and $h_{n}$ can be made randomly (or, if it improves the performance of the algorithm, you can define it in any other way).

**Important Notes**:

- This algorithm should be run only on the second graph.
- The algorithm needs to handle the case that the graph is not connected. Thus, only some of the nodes in h are reachable from $h_{1}$. In such a scenario, it is enough to let the program give in the output the string "There is no such path".
- Since we are dealing with walks, you can pass on the same node $h_{i}$ more than once, but you have to preserve order. E.g., if you start from Spiderman to reach Deadpool, and your path requires you to visit Iron-man and Colossus, you can go back to any comics any time you want, assuming that the order in which you visit the heroes is still the same.
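
The requirements above can be sketched without any graph library: find a shortest path between each pair of consecutive stops with breadth-first search, chain the segments, and give up as soon as one stop is unreachable. The helper names `bfs_path` and `ordered_walk` and the tiny hero/comic graph are assumptions made for illustration.

```python
from collections import deque


def bfs_path(adjacency, source, target):
    """Shortest path in an unweighted graph, or None if target is unreachable."""
    if source not in adjacency or target not in adjacency:
        return None
    parent = {source: None}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if v == target:
            path = []
            while v is not None:
                path.append(v)
                v = parent[v]
            return path[::-1]
        for w in adjacency[v]:
            if w not in parent:
                parent[w] = v
                queue.append(w)
    return None


def ordered_walk(adjacency, stops):
    """Shortest walk visiting the stops in order, or None if a stop is unreachable."""
    walk = [stops[0]]
    for source, target in zip(stops, stops[1:]):
        segment = bfs_path(adjacency, source, target)
        if segment is None:
            return None
        walk.extend(segment[1:])  # skip the node shared with the previous segment
    return walk


# Made-up bipartite graph: heroes H1..H4, comics C1..C3; H4 is disconnected.
edges = [("H1", "C1"), ("H2", "C1"), ("H2", "C2"), ("H3", "C2"), ("H4", "C3")]
graph = {}
for hero, comic in edges:
    graph.setdefault(hero, []).append(comic)
    graph.setdefault(comic, []).append(hero)

walk = ordered_walk(graph, ["H1", "H3", "H2"])
missing = ordered_walk(graph, ["H1", "H4"])
message = "There is no such path" if missing is None else missing
```

The first walk passes through C2 and H2 twice, which is allowed because only the order of the requested stops has to be preserved.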


In [23]:
start = ["HAWK"]
sequence = ["PEATOR", "SPIDER-MAN/PETER PAR", "SHAPE", "BI-BEAST II", "TANA NILE"]
end = ["RAMBO"]
complete_list = start + sequence + end

In [24]:
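complete_path = []
try:
    # Chain the shortest paths between consecutive stops, dropping each
    # segment's last node so it is not repeated at the start of the next one.
    for source, target in zip(complete_list, complete_list[1:]):
        path = nx.shortest_path(G_edges_net, source=source, target=target)
        complete_path.extend(path[:-1])
    complete_path += end
    print("The shortest path from " + start[0] + " to " + end[0] + " is:")
    print(complete_path)
except (nx.NetworkXNoPath, nx.NodeNotFound):
    # Required output when some stop cannot be reached.
    print("There is no such path")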

The shortest path from HAWK to RAMBO is:
['HAWK', 'COC 1', 'PEATOR', 'ASM 223', 'SPIDER-MAN/PETER PAR', 'Q 23', 'ARCANNA/ARCANNA JONE', 'A3 5', 'SHAPE', 'Q 14', 'BI-BEAST II', 'T 315', 'IRON MAN/TONY STARK', 'A 105', 'TANA NILE', 'A 105', 'AMPHIBIUS', 'M/FAN 1', 'SPIDER-MAN/PETER PAR', 'SLEEP 17', 'RAMBO']


## Functionality 4 - Disconnecting Graphs


**Input**:

- The graph data
- heroA: a superhero to which will relate sub-graph $G_{a}$
- heroB: a superhero to which will relate sub-graph $G_{b}$
- N: denoting the top N heroes whose data should be considered.

**Output**:

- The minimum number of links (by considering their weights) required to disconnect the original graph in two disconnected subgraphs: $G_{a}$ and $G_{b}$.
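
The output above asks for a cut that takes edge weights into account. One way to sketch this is a weighted minimum cut between the two heroes, computed as a maximum flow (Edmonds-Karp). The code below is an illustration under stated assumptions: `min_cut_value` is a hypothetical helper, each edge weight is used directly as its capacity (which weight to use is a modeling choice the assignment leaves open), and the small graph is made up.

```python
from collections import deque


def min_cut_value(edges, source, sink):
    """Weight of a minimum source-sink cut in an undirected graph.

    edges is an iterable of (u, v, capacity) triples.
    """
    residual = {}
    for u, v, capacity in edges:
        residual.setdefault(u, {})
        residual.setdefault(v, {})
        # An undirected edge can carry flow in either direction.
        residual[u][v] = residual[u].get(v, 0) + capacity
        residual[v][u] = residual[v].get(u, 0) + capacity
    flow = 0
    while True:
        # Breadth-first search for an augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, room in residual[u].items():
                if room > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow  # max flow equals the min cut weight
        # Find the bottleneck along the path, then push that much flow.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


weighted = [("A", "B", 3), ("A", "C", 2), ("B", "D", 1), ("C", "D", 3), ("B", "C", 1)]
unit = [(u, v, 1) for u, v, _ in weighted]
weighted_cut = min_cut_value(weighted, "A", "D")
link_count = min_cut_value(unit, "A", "D")
```

With unit capacities the result is the number of links to remove; with weights it is the total weight of the cheapest set of links, and the two answers can differ.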


In [25]:
list(G_hero_net.nodes())[:200]


['24-HOUR MAN/EMMANUEL',
 'FROST, CARMILLA',
 'KILLRAVEN/JONATHAN R',
 "M'SHULLA",
 '3-D MAN/CHARLES CHAN',
 'ANGEL/WARREN KENNETH',
 'ANT-MAN II/SCOTT HAR',
 'AURORA/JEANNE-MARIE',
 "BLACK PANTHER/T'CHAL",
 'BLACK WIDOW/NATASHA',
 'CAGE, LUKE/CARL LUCA',
 'CAPTAIN AMERICA',
 'COLLECTIVE MAN',
 'CRYSTAL [INHUMAN]',
 'CYCLOPS/SCOTT SUMMER',
 'DARKSTAR/LAYNIA SERG',
 'DEFENSOR',
 'DR. DRUID/ANTHONY LU',
 'DR. STRANGE/STEPHEN',
 'GORILLA-MAN',
 'HAWK',
 'HUMAN ROBOT',
 'IGOR',
 'IKARIS/IKE HARRIS [E',
 'INVISIBLE WOMAN/SUE',
 'JACK OF HEARTS/JACK',
 'JONES, RICHARD MILHO',
 'KARNAK [INHUMAN]',
 'LOBO',
 'LOCKJAW [INHUMAN]',
 'MARVEL BOY III/ROBER',
 'MIKHLO',
 'MOCKINGBIRD/DR. BARB',
 'NORRISS, SISTER BARB',
 'PHARAOH RAMA-TUT',
 'PROFESSOR X/CHARLES',
 'QUICKSILVER/PIETRO M',
 'SABRA/RUTH BAT-SERAP',
 'SASQUATCH/WALTER LAN',
 'SCARLET WITCH/WANDA',
 'SHADOWCAT/KATHERINE',
 'SHE-HULK/JENNIFER WA',
 'SHROUD/MAXIMILLIAN Q',
 'SLOAN, FRED',
 'THING/BENJAMIN J. GR',
 'TORPEDO III/BROCK JO',
 

In [26]:
min_cut = nx.minimum_edge_cut(G_hero_net, "CAPTAIN AMERICA", "SPIDER-MAN/PETER PAR")

In [27]:
len(list(min_cut))

1734

## Functionality 5 - Extracting Communities


**Input**:

- The graph data
- N: denoting the top N heroes whose data should be considered
- Hero_1: denoting the name of one of the heroes
- Hero_2: denoting the name of one of the heroes

**Output**:

- The minimum number of edges that should be removed to form communities
- A list of communities, each containing a list of heroes that belong to them.
- Whether Hero_1 and Hero_2 belong to the same community
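
The three outputs above can be illustrated with one round of the Girvan-Newman idea: repeatedly remove the edge with the highest edge betweenness until the graph splits into more components. This is a minimal sketch for a small unweighted graph, not necessarily how `myf.girvan_newman` is implemented; all function names and the toy graph are assumptions.

```python
from collections import deque


def components(adjacency):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for start in adjacency:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:
            v = queue.popleft()
            for w in adjacency[v]:
                if w not in comp:
                    comp.add(w)
                    queue.append(w)
        seen |= comp
        comps.append(comp)
    return comps


def edge_betweenness(adjacency):
    """Shortest-path edge betweenness (each pair counted from both ends)."""
    score = {}
    for source in adjacency:
        dist, sigma, preds = {source: 0}, {source: 1}, {source: []}
        order, queue = [], deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if w not in dist:
                    dist[w], sigma[w], preds[w] = dist[v] + 1, 0, []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for v in preds[w]:
                credit = sigma[v] / sigma[w] * (1 + delta[w])
                edge = frozenset((v, w))
                score[edge] = score.get(edge, 0.0) + credit
                delta[v] += credit
    return score


def girvan_newman_split(adjacency):
    """Remove highest-betweenness edges until the component count grows."""
    adjacency = {v: set(ns) for v, ns in adjacency.items()}  # work on a copy
    initial = len(components(adjacency))
    removed = 0
    while len(components(adjacency)) == initial:
        scores = edge_betweenness(adjacency)
        if not scores:
            break  # no edges left to remove
        u, v = max(scores, key=scores.get)
        adjacency[u].discard(v)
        adjacency[v].discard(u)
        removed += 1
    return removed, components(adjacency)


# Made-up graph: two triangles joined by the single edge C-D.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"),
         ("D", "E"), ("E", "F"), ("D", "F")]
graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

removed, communities = girvan_newman_split(graph)
same_community = any("A" in c and "B" in c for c in communities)
```

Recomputing the betweenness after every removal is what makes this approach expensive on large graphs, which is one reason to restrict the input to the top N heroes.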

**Important Notes**:

This functionality should only be run on the first graph.
To comprehend this functionality better, we suggest you take a good look at this article


In [None]:
def functionality_5(graph_data, N=1500, hero_1="", hero_2=""):
    # Work in progress: N, hero_1 and hero_2 are not used yet.
    communities_list = myf.label_propagation(graph_data)
    return communities_list

In [28]:
print("Nodes: ", len(G_hero_net.nodes()))
print("Edges: ", len(G_hero_net.edges()))

Nodes:  6421
Edges:  224099


In [29]:
small_graph = myf.TOP_N(1500, G_hero_net, edges_df)

In [31]:
print("Nodes: ", len(small_graph.nodes))
print("Edges: ", len(small_graph.edges))

Nodes:  1500
Edges:  118710


In [38]:
# find communities in the graph
c = myf.girvan_newman(small_graph.copy())

# find the nodes forming the communities
node_groups = [list(community) for community in c]

In [40]:
len(node_groups)

2

In [37]:
a = myf.label_propagation(G_hero_net)
print("Number of communities: ", len(a))

Number of communities:  36
