In [1]:
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

# 1. ***Data***

## **Data Preprocessing**

As always, in the data science area, you can find some inconsistencies in the provided data. Therefore, some modifications should be made to the data to make it consistent across all of the datasets you have. To ensure consistency in the data, keep the following in mind:

  1)  Some of the heroes' names in 'hero-network.csv' are not found in 'edges.csv'. This inconsistency exists for the following reasons:
        - Some heroes' names in 'hero-network.csv' have extra spaces at the end of their names compared to their names in 'edges.csv'.
        - Some heroes' names in 'hero-network.csv' have an extra '/' at the end of their names compared to their names in 'edges.csv'.
        - The hero name 'SPIDER-MAN/PETER PARKER' in 'edges.csv' has been changed to 'SPIDER-MAN/PETER PAR' in 'hero-network.csv' due to a string length limit in 'hero-network.csv'.


  2)  Some entries in the 'hero-network.csv' have the same hero in both columns. In the graph, these entries form a self-loop. Because a self-loop makes no sense in this network, you can safely remove those from the dataset.


In [2]:
hero_df = pd.read_csv('hero-network.csv')
edges_df = pd.read_csv('edges.csv')
nodes_df = pd.read_csv('nodes.csv')

We decided to homogenize the names across all the DataFrames, since many of them refer to the same hero but differ by extra characters at the end,
and names in the hero-network DataFrame are limited to 20 characters.

Here are some examples of the issues we found:

In [3]:
print('SPIDER-MAN/PETER PAR' in hero_df.values)         # truncated at 20 characters
print('SPIDER-MAN/PETER PARKERKER' in nodes_df.values)  # extra "KER" at the end
print('SPIDER-MAN/PETER PARKER' in edges_df.values)     # full name

True
True
True


We strip any trailing "/" characters from each name and then keep only its first 20 characters.

In [4]:
hero_df['hero1'] = hero_df['hero1'].apply(lambda x: x.rstrip('/')[:20])
hero_df['hero2'] = hero_df['hero2'].apply(lambda x: x.rstrip('/')[:20])
edges_df['hero'] = edges_df['hero'].apply(lambda x: x.rstrip('/')[:20])
nodes_df['node'] = nodes_df['node'].apply(lambda x: x.rstrip('/')[:20])
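
Note that `rstrip('/')` removes only trailing slashes, so a name that ends with a space is left as it is. The sketch below shows one possible stricter normalization, assuming trailing whitespace should be removed as well. The helper `normalize_name`, its `max_len` parameter and the sample strings are hypothetical and not part of the notebook, and applying it to the real data may change the counts reported later.

```python
def normalize_name(name, max_len=20):
    """Strip surrounding whitespace and trailing '/', then truncate to max_len."""
    cleaned = name.strip().rstrip('/').rstrip()
    # Truncating can expose a new trailing space or slash, so strip once more.
    return cleaned[:max_len].rstrip('/ ')

# Illustrative inputs only (the 'HERO A' strings are made up).
raw = ['SPIDER-MAN/PETER PARKER', 'SPIDER-MAN/PETER PAR', 'HERO A/', 'HERO A ']
normalized = [normalize_name(n) for n in raw]
print(normalized)
```

With pandas, such a helper could be applied to a column with `df['col'].apply(normalize_name)`.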

Let's check whether there are still any anomalies in our data now that we have formatted it.

In [5]:
print('SPIDER-MAN/PETER PAR' in hero_df.values)         # truncated at 20 characters
print('SPIDER-MAN/PETER PARKERKER' in nodes_df.values)  # extra "KER" at the end
print('SPIDER-MAN/PETER PARKER' in edges_df.values)     # full name

True
False
False


Now let's remove the rows of hero_df that have the same hero in both columns.

In [6]:
hero_df.drop(hero_df[hero_df['hero1'] == hero_df['hero2']].index, inplace = True)

## **Graphs setup**

**First graph** : It will be constructed using the data stored in the 'hero-network.csv' file, in which an edge between two heroes can be found if they have appeared in the same comic together. The number of edges between two heroes represents the number of times they have collaborated in different comics. The graph should be considered weighted and undirected. It is up to you to decide which metric to use to calculate the weights, but we anticipate that the cost will be lower for heroes with more collaborations. Please specify which metric you used to select the weights in the report.

We group hero_df by hero1 and hero2 to obtain the number of occurrences of each pair of nodes, which corresponds to the number of times a hero collaborated with another hero. This operation gives us a new dataset containing the 'weight' variable that we need.

In [7]:
count_hero_df = hero_df.groupby(["hero1", "hero2"]).size().reset_index(name='weight')

We set the weight to 1/n, where n is the number of collaborations, because it has to be inversely related to the number of times a hero collaborated with another hero. In this way, pairs of heroes with many collaborations are connected by a low-cost edge.

In [8]:
count_hero_df['weight'] = count_hero_df['weight'].apply(lambda x: 1/x)
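
Because the graph is undirected, a row (A, B) and a row (B, A) describe the same pair, while `groupby(["hero1", "hero2"])` counts them separately. The following sketch is an assumption about one way to handle this, not what the cells above do: it counts unordered pairs with the standard library and then applies the same 1/n weight. The sample rows and the names `rows`, `pair_counts` and `weights` are made up for illustration.

```python
from collections import Counter

# Hypothetical co-appearance rows (hero1, hero2); not taken from the dataset.
rows = [
    ("A", "B"),
    ("B", "A"),   # same pair as ("A", "B"), listed in the opposite order
    ("A", "C"),
    ("C", "C"),   # self-loop, skipped below
]

# Sort each pair so that (A, B) and (B, A) map to the same key.
pair_counts = Counter(
    tuple(sorted(pair)) for pair in rows if pair[0] != pair[1]
)

# Weight 1/n: the more collaborations, the lower the cost of the edge.
weights = {pair: 1 / n for pair, n in pair_counts.items()}
print(weights)
```

Merging the two orderings before counting would also give one edge per pair, so a plain `nx.Graph` could hold the result.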

In [9]:
count_hero_df

| | hero1 | hero2 | weight |
|---|---|---|---|
| 0 | 24-HOUR MAN/EMMANUEL | FROST, CARMILLA | 1.0 |
| 1 | 24-HOUR MAN/EMMANUEL | KILLRAVEN/JONATHAN R | 1.0 |
| 2 | 24-HOUR MAN/EMMANUEL | M'SHULLA | 1.0 |
| 3 | 3-D MAN/CHARLES CHAN | ANGEL/WARREN KENNETH | 1.0 |
| 4 | 3-D MAN/CHARLES CHAN | ANT-MAN II/SCOTT HAR | 1.0 |
| ... | ... | ... | ... |
| 224164 | ZZZAX | RODRIGUEZ, DEBRA | 1.0 |
| 224165 | ZZZAX | ROSS, GEN. THADDEUS | 0.5 |
| 224166 | ZZZAX | SUMMERS, NATHAN CHRI | 1.0 |
| 224167 | ZZZAX | TIGRA/GREER NELSON | 1.0 |


Now we can create a 'weighted graph' using the weight we calculated as our edge attribute.

In [10]:
G_hero_net = nx.from_pandas_edgelist(count_hero_df, 'hero1', 'hero2', edge_attr='weight', create_using=nx.MultiGraph)

In [11]:
G_hero_net['ZZZAX']['RODRIGUEZ, DEBRA']

AtlasView({0: {'weight': 1.0}})

**Second graph** : The data in 'nodes.csv' and 'edges.csv' will be used to construct the second graph. The type of node (hero/comic) can be found in 'nodes.csv', and an edge between a hero node and a comic node can be found in 'edges.csv' when the hero has appeared in that specific comic. This graph is assumed to be undirected and unweighted.

To attach the node types to the graph, we will use the 'node' column of the nodes_df dataframe as an index, so we first check that its values are unique.


In [12]:
print(len(nodes_df))
print(len(nodes_df.node.unique()))

19090
19087


We see that only 3 rows repeat a value in the 'node' column, so we decided to drop them to maintain the consistency of the dataframe.

In [13]:
nodes_df = nodes_df.drop_duplicates(subset=['node'])

We set the 'node' column as the new index of the dataframe and convert the result into a dictionary of node attributes.

In [14]:
nodes_attr = nodes_df.set_index('node').to_dict(orient = 'index')

Now we can proceed with the creation of the second graph. First we create a new graph starting from the edges dataframe and then we select the nodes_attr dictionary to be the attributes of our nodes.

In [15]:
G_edges_net = nx.from_pandas_edgelist(edges_df, 'hero', 'comic')
nx.set_node_attributes(G_edges_net, nodes_attr)

In [17]:
G_edges_net.nodes['VENUS II']

{'type': 'hero'}
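
With the 'type' attribute in place, nodes can be counted per type, and the number of heroes in each comic can be read off the edges. The sketch below shows the idea in plain Python on a tiny made-up example. The names `node_type`, `edges`, `nodes_per_type` and `heroes_per_comic` and all node labels are hypothetical, and it assumes each hero-comic pair is listed once.

```python
from collections import Counter

# Hypothetical node types and hero-comic edges; not taken from the dataset.
node_type = {
    "HERO A": "hero",
    "HERO B": "hero",
    "COMIC 1": "comic",
    "COMIC 2": "comic",
}
edges = [("HERO A", "COMIC 1"), ("HERO B", "COMIC 1"), ("HERO A", "COMIC 2")]

# Number of nodes of each type.
nodes_per_type = Counter(node_type.values())

# Number of heroes per comic: each edge links one hero to one comic.
heroes_per_comic = Counter(comic for _, comic in edges)

print(dict(nodes_per_type))
print(dict(heroes_per_comic))
```

On the real graph, the type mapping should be available through `nx.get_node_attributes(G_edges_net, 'type')`, and the degree of a comic node should equal the number of heroes that appear in it.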

# 2. ***Backend Implementation***

The goal of this part is the implementation of a controller system that has different functionalities. The controller should take as input an identifier "i" and run the associated function_i applied to the graph you create from the downloaded data.
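
One common way to build such a controller is a dispatch table that maps each identifier to its function. The sketch below illustrates the idea under stated assumptions: `function_1`, `function_2`, `FUNCTIONS`, `controller` and the toy graph are hypothetical placeholders, not the functionalities implemented in this notebook.

```python
def function_1(graph):
    # Placeholder for a first functionality; here it only counts the nodes.
    return len(graph)

def function_2(graph):
    # Placeholder for a second functionality; here it only counts the edges.
    return sum(len(neighbours) for neighbours in graph.values()) // 2

# Map each identifier i to its function_i.
FUNCTIONS = {1: function_1, 2: function_2}

def controller(i, graph):
    """Run function_i on the graph, or fail clearly for an unknown identifier."""
    if i not in FUNCTIONS:
        raise ValueError(f"Unknown functionality identifier: {i}")
    return FUNCTIONS[i](graph)

# Toy undirected graph as an adjacency dict (made-up data).
toy_graph = {"A": {"B", "C"}, "B": {"A"}, "C": {"A"}}
print(controller(1, toy_graph))
print(controller(2, toy_graph))
```

Adding a new functionality then only requires registering one more entry in the table.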

Definition: As the number of nodes and edges grows, we may request to work on a subset of the data to reduce computation time and improve network visualization. In this case, we will ask you only to consider the data for the top N heroes. We define the top N heroes as follows:

* Top N heroes: The N heroes who have appeared in the largest number of comics. The 'edges.csv' file, which represents the comics in which each hero has appeared, can be used to filter these N heroes.

Note: When the value of N is not set by the user, the function should consider the whole data.

## ***Functionality 1 - Extract the graph's features***

**Input**: 
* The graph data
* The graph type (e.g., number 1 or number 2)
* N: the number of top heroes whose data should be considered

**Output**:

* The number of nodes in the network (if type 2, report for both node types)
* The number of collaborations of each superhero with the others (only if type 1)
* The number of heroes that have appeared in each comic (only if type 2)
* The network's density
* The network's degree distribution
* The average degree of the network
* The network's hubs (hubs are nodes whose degree is higher than the 95th percentile of the degree distribution)
* Whether the Network is sparse or dense
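
Several of the outputs above follow directly from the node degrees. The sketch below computes them in plain Python for a simple undirected graph stored as an adjacency dictionary. It is an illustration under stated assumptions, not the notebook's implementation: `percentile`, `graph_features` and the star graph are hypothetical, the density is taken as 2m / (n(n-1)) for n nodes and m edges, and the percentile uses linear interpolation, which should match the default of `np.percentile`. The sparse/dense decision is left out because no threshold is given in the text.

```python
from collections import Counter

def percentile(values, q):
    """Linearly interpolated q-th percentile (0 <= q <= 100) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)

def graph_features(adjacency):
    """Features of a non-empty simple undirected graph {node: set(neighbours)}."""
    degrees = {node: len(neigh) for node, neigh in adjacency.items()}
    n = len(degrees)
    m = sum(degrees.values()) // 2          # every edge is counted twice
    threshold = percentile(list(degrees.values()), 95)
    return {
        "density": 2 * m / (n * (n - 1)) if n > 1 else 0.0,
        "average_degree": 2 * m / n,
        "degree_distribution": Counter(degrees.values()),
        # Hubs: nodes whose degree is strictly above the 95th percentile.
        "hubs": sorted(node for node, d in degrees.items() if d > threshold),
    }

# Made-up star graph: one centre linked to 20 leaves.
leaves = [f"LEAF {i}" for i in range(20)]
star = {"CENTRE": set(leaves), **{leaf: {"CENTRE"} for leaf in leaves}}
features = graph_features(star)
print(features["density"], features["average_degree"], features["hubs"])
```

In the star graph only the centre has a degree above the 95th percentile, so it is the only hub.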

In [33]:
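def TOP_N(N):
    # Use the cleaned edges_df from above so the names match the graph nodes;
    # re-reading 'edges.csv' would bring back the untrimmed hero names.
    comics_per_hero = edges_df.groupby('hero')['comic'].count()
    return comics_per_hero.sort_values(ascending=False).head(N).index.tolist()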
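
The note above says that the whole data should be considered when N is not set. A minimal, self-contained sketch of that behaviour is given below, using only the standard library. The function `top_n_heroes`, its parameters and the sample pairs are hypothetical, and it assumes every (hero, comic) pair is listed once.

```python
from collections import Counter

def top_n_heroes(hero_comic_pairs, n=None):
    """Return heroes ordered by number of comics; all of them when n is None."""
    comics_per_hero = Counter(hero for hero, _ in hero_comic_pairs)
    # most_common(None) returns every hero, which covers the "N not set" case.
    return [hero for hero, _ in comics_per_hero.most_common(n)]

# Made-up (hero, comic) pairs.
pairs = [
    ("A", "c1"), ("A", "c2"), ("A", "c3"),
    ("B", "c1"), ("B", "c2"),
    ("C", "c1"),
]
print(top_n_heroes(pairs, 2))
print(top_n_heroes(pairs))
```

The returned list could then be used to restrict either graph to the selected heroes before computing its features.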

In [63]:
def number_of_nodes(graph_type):
    # graph_type selects the graph: 1 -> G_hero_net, 2 -> G_edges_net
    if graph_type == 1:
        return f"The G_hero_net has {G_hero_net.number_of_nodes()} nodes"
    elif graph_type == 2:
        return f"The G_edges_net has {G_edges_net.number_of_nodes()} nodes"
    else:
        return 'Wrong type number'

In [82]:
def number_of_collaborations():
    # The degree of a hero in G_hero_net is the number of edges incident to it
    for node in G_hero_net:
        print(f"{node} has {G_hero_net.degree(node)} collaborations.")
        

In [19]:
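def Functionality_1(graph, graph_type, N):
    # graph and N are not used yet: the helpers read the global graphs directly
    print(number_of_nodes(graph_type))
    # The number of collaborations is reported only for the first graph
    if graph_type == 1:
        number_of_collaborations()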