In [76]:
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

# EX 1

## Data Preprocessing

As always, in the data science area, you can find some inconsistencies in the provided data. Therefore, some modifications should be made to the data to make it consistent across all of the datasets you have. To ensure consistency in the data, keep the following in mind:

  1)  Some of the heroes' names in 'hero-network.csv' are not found in 'edges.csv'. This inconsistency exists for the following reasons:
        - Some heroes' names in 'hero-network.csv' have extra spaces at the end of their names compared to their names in 'edges.csv'.
        - Some heroes' names in 'hero-network.csv' have an extra '/' at the end of their names compared to their names in 'edges.csv'.
        - The hero name 'SPIDER-MAN/PETER PARKER' in 'edges.csv' has been changed to 'SPIDER-MAN/PETER PAR' in 'hero-network.csv' due to a string length limit in 'hero-network.csv'.


  2)  Some entries in the 'hero-network.csv' have the same hero in both columns. In the graph, these entries form a self-loop. Because a self-loop makes no sense in this network, you can safely remove those from the dataset.


In [77]:
hero_df = pd.read_csv('hero-network.csv')
edges_df = pd.read_csv('edges.csv')
nodes_df = pd.read_csv('nodes.csv')

In [78]:
nodes_df

Unnamed: 0,node,type
0,2001 10,comic
1,2001 8,comic
2,2001 9,comic
3,24-HOUR MAN/EMMANUEL,hero
4,3-D MAN/CHARLES CHAN,hero
...,...,...
19085,"ZOTA, CARLO",hero
19086,ZOTA,hero
19087,ZURAS,hero
19088,ZURI,hero


We decided to normalize the names in all three DataFrames, since many names refer to the same hero but differ by extra trailing characters, and the names in the hero-network DataFrame are capped at 20 characters.

Here are some examples of the issues we found:

In [79]:
print('SPIDER-MAN/PETER PAR' in hero_df.values)         # truncated at the 20-character limit
print('SPIDER-MAN/PETER PARKERKER' in nodes_df.values)  # with an extra "KER" at the end
print('SPIDER-MAN/PETER PARKER' in edges_df.values)     # full name

True
True
True


We strip trailing blank spaces and trailing "/" characters, then truncate every name to its first 20 characters.

In [80]:
# rstrip(' /') removes trailing spaces and slashes; [:20] keeps at most 20 characters
hero_df['hero1'] = hero_df['hero1'].apply(lambda x: x.rstrip(' /')[:20])
hero_df['hero2'] = hero_df['hero2'].apply(lambda x: x.rstrip(' /')[:20])
edges_df['hero'] = edges_df['hero'].apply(lambda x: x.rstrip(' /')[:20])
nodes_df['node'] = nodes_df['node'].apply(lambda x: x.rstrip(' /')[:20])

Let's check whether any of these anomalies remain now that the names have been normalized.

In [81]:
print('SPIDER-MAN/PETER PAR' in hero_df.values)         # truncated at the 20-character limit
print('SPIDER-MAN/PETER PARKERKER' in nodes_df.values)  # with an extra "KER" at the end
print('SPIDER-MAN/PETER PARKER' in edges_df.values)     # full name

True
False
False
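
The three checks above only cover Spider-Man. A broader check, sketched below on toy data, compares the full sets of names after normalization. The frames `toy_hero_df` and `toy_edges_df`, their rows, and the helper `normalize` are hypothetical stand-ins for illustration, not the real files.

```python
import pandas as pd

def normalize(name: str) -> str:
    # Strip trailing spaces and slashes, then keep at most 20 characters.
    return name.rstrip(' /')[:20]

# Toy stand-ins for hero-network.csv and edges.csv (hypothetical rows).
toy_hero_df = pd.DataFrame({
    'hero1': ['SPIDER-MAN/PETER PAR', 'BLADE/', 'IRON MAN/TONY STARK '],
    'hero2': ['BLADE/', 'IRON MAN/TONY STARK ', 'SPIDER-MAN/PETER PAR'],
})
toy_edges_df = pd.DataFrame({
    'hero': ['SPIDER-MAN/PETER PARKER', 'BLADE', 'IRON MAN/TONY STARK'],
    'comic': ['A 1', 'A 2', 'A 3'],
})

network_names = (set(toy_hero_df['hero1'].map(normalize))
                 | set(toy_hero_df['hero2'].map(normalize)))
edge_names = set(toy_edges_df['hero'].map(normalize))

# Names that appear in the hero network but not in the hero-comic edges.
missing = network_names - edge_names
print(sorted(missing))
```

An empty result means every hero in the network also appears in the hero-comic edges; any name left over points to an inconsistency that the normalization did not cover.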


Now let's remove the rows of hero_df that have the same hero in both columns.

In [82]:
# Keep only the rows whose two heroes differ (drop(..., inplace=True) returns None, so it must not be assigned)
hero_df = hero_df[hero_df['hero1'] != hero_df['hero2']]

### Graphs setup

First graph: Will be constructed using the data stored in the 'hero-network.csv' file, in which an edge between two heroes can be found if they have appeared in the same comic together. The number of edges between two heroes represents the number of times they have collaborated in different comics. The graph should be considered weighted and undirected. It is up to you to decide which metric to use to calculate the weights, but we anticipate that the cost will be lower for heroes with more collaborations. Please specify which metric you used to select the weights in the report.

In [7]:
# One parallel edge per row, i.e. one edge per collaboration
MG_hero = nx.from_pandas_edgelist(hero_df, 'hero1', 'hero2', create_using=nx.MultiGraph)
# TODO: maybe a new DataFrame holding the weight of each pair should have been built directly with groupby?
#   hero_df.groupby(['hero1', 'hero2']).size()

In [8]:
G_hero = nx.Graph()
# No 'weight' attribute is set, so every parallel edge counts as 1:
# the degree is the number of collaborations of each hero.
degree = dict(MG_hero.degree(weight='weight'))
for n, nbrs in MG_hero.adjacency():
    for nbr in nbrs:
        # NOTE: I am not sure this is the intended metric.
        # Weight = inverse of the larger of the two endpoint degrees.
        maxvalue = max(degree[n], degree[nbr])
        G_hero.add_edge(n, nbr, weight=1 / maxvalue)
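
As an alternative to the degree-based metric, the weights could come directly from the number of collaborations of each pair, which is the groupby idea mentioned in the comment above. The following is only a sketch on toy data: the frame `toy_hero_df`, the column name `collaborations`, and the choice weight = 1 / collaborations are assumptions, not the metric used in this notebook.

```python
import pandas as pd
import networkx as nx

# Toy collaboration rows (hypothetical): one row per shared comic appearance.
toy_hero_df = pd.DataFrame({
    'hero1': ['A', 'B', 'A', 'A'],
    'hero2': ['B', 'A', 'C', 'B'],
})

# Sort each pair so that (A, B) and (B, A) count as the same undirected edge.
pairs = pd.DataFrame(
    [sorted(p) for p in zip(toy_hero_df['hero1'], toy_hero_df['hero2'])],
    columns=['hero1', 'hero2'],
)

# Count the collaborations per pair, then set weight = 1 / collaborations,
# so that pairs with more collaborations get a lower cost.
weights = pairs.groupby(['hero1', 'hero2']).size().reset_index(name='collaborations')
weights['weight'] = 1 / weights['collaborations']

G_toy = nx.from_pandas_edgelist(
    weights, 'hero1', 'hero2', edge_attr=['collaborations', 'weight']
)
print(G_toy['A']['B'])
```

This variant builds the simple weighted graph in one pass, without going through a MultiGraph first.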
 


Second graph: The data in 'nodes.csv' and 'edges.csv' will be used to construct the second graph. The type of node (hero/comic) can be found in 'nodes.csv', and an edge between a hero node and a comic node can be found in 'edges.csv' when the hero has appeared in that specific comic. This graph is assumed to be undirected and unweighted.

In [9]:
G_nodes_edges= nx.from_pandas_edgelist(edges_df, 'hero', 'comic', create_using=nx.Graph) 

In [10]:
set(nodes_df.node)==set(G_nodes_edges.nodes) 

True

The nodes are the comics and the heroes, and an edge between a hero and a comic exists if and only if they appear in the same row of edges.csv. Building the graph from edges.csv alone could therefore only leave out heroes or comics that appear in nodes.csv but not in edges.csv. The check set(nodes_df.node)==set(G_nodes_edges.nodes) returns True, which rules this out, so we can build the graph from edges.csv only.
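
Since later steps need to tell heroes from comics, the node type from nodes.csv can be stored as a node attribute. The sketch below shows one way to do this on toy data; `toy_nodes_df`, `toy_edges_df` and the attribute name `type` are illustrative assumptions.

```python
import pandas as pd
import networkx as nx

# Toy stand-ins for nodes.csv and edges.csv (hypothetical rows).
toy_nodes_df = pd.DataFrame({
    'node': ['HERO A', 'HERO B', 'COMIC 1', 'COMIC 2'],
    'type': ['hero', 'hero', 'comic', 'comic'],
})
toy_edges_df = pd.DataFrame({
    'hero': ['HERO A', 'HERO B', 'HERO A'],
    'comic': ['COMIC 1', 'COMIC 1', 'COMIC 2'],
})

G_toy = nx.from_pandas_edgelist(toy_edges_df, 'hero', 'comic', create_using=nx.Graph)

# Nodes listed in nodes.csv but absent from edges.csv would be isolated;
# adding them explicitly is a no-op when the two files agree.
G_toy.add_nodes_from(toy_nodes_df['node'])

# Attach the hero/comic label to every node.
node_types = dict(zip(toy_nodes_df['node'], toy_nodes_df['type']))
nx.set_node_attributes(G_toy, node_types, name='type')

heroes = [n for n, t in G_toy.nodes(data='type') if t == 'hero']
comics = [n for n, t in G_toy.nodes(data='type') if t == 'comic']
print(len(heroes), len(comics))
```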

# EX 2

The goal of this part is the implementation of a controller system that has different functionalities. The controller should take as input an identifier "i" and run the associated function_i applied to the graph you create from the downloaded data.

Definition: As the number of nodes and edges grows, we may request to work on a subset of the data to reduce computation time and improve network visualization. In this case, we will ask you to consider only the data for the top N heroes. We define the top N heroes as follows:

    Top N heroes: The N heroes who have appeared in the largest number of comics. The 'edges.csv' file, which represents the comics in which each hero has appeared, can be used to filter these N heroes.

Note: When the value of N is not set by the user, the function should consider the whole data.
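
To illustrate the definition, the sketch below ranks heroes on toy data and restricts a hero-comic graph to the top N heroes and their comics. The helper `top_n_heroes`, the frame `toy_edges_df`, and the choice to count distinct comics with `nunique` are assumptions made for this example.

```python
import pandas as pd
import networkx as nx

def top_n_heroes(edges, n=None):
    # Rank heroes by the number of distinct comics they appear in.
    counts = edges.groupby('hero')['comic'].nunique().sort_values(ascending=False)
    # n=None means "use the whole data".
    return list(counts.index if n is None else counts.index[:n])

# Toy stand-in for edges.csv (hypothetical rows).
toy_edges_df = pd.DataFrame({
    'hero': ['A', 'A', 'A', 'B', 'B', 'C'],
    'comic': ['c1', 'c2', 'c3', 'c1', 'c2', 'c3'],
})
G2_toy = nx.from_pandas_edgelist(toy_edges_df, 'hero', 'comic')

top = top_n_heroes(toy_edges_df, n=2)

# For the hero-comic graph, keep the top heroes plus the comics they appear in.
kept_comics = set(toy_edges_df.loc[toy_edges_df['hero'].isin(top), 'comic'])
G2_top = G2_toy.subgraph(set(top) | kept_comics).copy()
print(top, sorted(G2_top.nodes))
```

For the first graph, where every node is a hero, the same idea reduces to taking the subgraph induced by the top N names.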

### Functionality 1 - extract the graph's features

Input:

    The graph data
    The graph type (ex., number 1 or number 2)
    N: the number of top heroes whose data should be considered

Output:

    The number of nodes in the network (if type 2, report for both node types)
    The number of collaborations of each superhero with the others (only if type 1)
    The number of heroes that have appeared in each comic (only if type 2)
    The network's density
    The network's degree distribution
    The average degree of the network
    The network's hubs (hubs are nodes whose degree is greater than the 95th percentile of the degree distribution)
    Whether the Network is sparse or dense

Note: For this case, it makes sense to differentiate operations between the two graphs: for example, when computing hubs for the second graph, we likely care only about comics.
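
Most of the requested outputs can be computed with a few NetworkX and NumPy calls. The sketch below is a minimal illustration on a toy graph, not the final implementation: the helper `graph_features` is hypothetical, it ignores the graph type and the top N filter, and the 0.5 density cut-off for "dense" is an arbitrary assumption.

```python
import numpy as np
import networkx as nx

def graph_features(G):
    # Degree of every node as a plain dict.
    degrees = dict(G.degree())
    values = np.array(list(degrees.values()))
    # 95th percentile of the degree distribution (NumPy's default linear interpolation).
    threshold = np.percentile(values, 95)
    density = nx.density(G)
    uniq, counts = np.unique(values, return_counts=True)
    return {
        'n_nodes': G.number_of_nodes(),
        'density': density,
        # Maps each degree to the number of nodes having that degree.
        'degree_distribution': {int(k): int(c) for k, c in zip(uniq, counts)},
        'average_degree': float(values.mean()),
        # Hubs: nodes whose degree is strictly above the 95th percentile.
        'hubs': [n for n, d in degrees.items() if d > threshold],
        # The 0.5 cut-off is an illustrative assumption, not a standard definition.
        'is_dense': density >= 0.5,
    }

# Toy star graph: node 0 is connected to nodes 1..20.
G_toy = nx.star_graph(20)
features = graph_features(G_toy)
print(features['hubs'], features['is_dense'])
```

For the second graph, the same function could be applied after restricting the degree values to the comic nodes, in line with the note above.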

In [11]:
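def TOP_N(N=None):
    # Rank heroes by the number of comics they appear in; N=None keeps every hero.
    # Note: edges.csv is re-read here, so the names are the raw ones, not the normalized ones in edges_df.
    edges_df = pd.read_csv('edges.csv')
    ranked = edges_df.groupby('hero').count().sort_values(by=['comic'], ascending=False)
    return list(ranked.index if N is None else ranked.head(N).index)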

In [12]:
def Functionality_1(graph,graph_type,N):
    return