# First analysis of real graphs

Preamble:
In this first tutorial, we will study a real graph.
We'll apply all the basic measures of digital social network analysis, and draw our first observations and conclusions about the functioning of the networks we're studying.

- networkx library: https://networkx.github.io/documentation/stable/index.html
- graph-tool: https://graph-tool.skewed.de/
- (not recommended) python-igraph library: https://igraph.org/python/doc/tutorial/tutorial.html

For graph visualization:
- networkx library (via matplotlib)
- Gephi: https://gephi.org/
- Graphviz: https://pygraphviz.github.io

In this tutorial, we will only use **networkx**.

In [None]:
import collections
import numpy as np
import random

import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd
import plotly.express as px

## Enron communication graph

Download the graph file email-Enron.txt (https://snap.stanford.edu/data/email-Enron.html)

36,692 nodes and 183,831 edges

#### 1) Import the graph into Python using the command nx.read_edgelist

In [None]:
!head -10 '../../data/Email-Enron.txt'

In [None]:
!wc -l '../../data/Email-Enron.txt'

In [None]:
G = nx.read_edgelist('../../data/Email-Enron.txt') # if directed graph: create_using=nx.DiGraph()
G
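
As a minimal sketch of how `nx.read_edgelist` behaves on this kind of file, the example below writes a tiny, hypothetical edge list in the SNAP layout (`#` header lines, then one tab-separated pair per line, each link listed in both directions) and reads it twice. The file name and its contents are assumptions made for illustration only.

```python
import os
import tempfile

import networkx as nx

# Hypothetical toy data in the SNAP layout: '#' header lines, then one
# tab-separated "source target" pair per line, each link in both directions.
toy_edges = (
    "# Toy edge list (illustrative only)\n"
    "# FromNodeId\tToNodeId\n"
    "0\t1\n"
    "1\t0\n"
    "1\t2\n"
    "2\t1\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_edges.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(toy_edges)

    # Default: undirected graph, so (0, 1) and (1, 0) collapse into one edge.
    undirected = nx.read_edgelist(path, nodetype=int)
    # With a DiGraph, both directions are kept as separate edges.
    directed = nx.read_edgelist(path, nodetype=int, create_using=nx.DiGraph)

print(undirected.number_of_edges())
print(directed.number_of_edges())
```

Lines starting with `#` are skipped by default. Without `nodetype=int`, node labels are kept as strings, which is what the cell above does.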

**Identify its number of nodes and links:**

In [None]:
print(G)

In [None]:
len(G.nodes)

In [None]:
nodes = G.number_of_nodes()
edges = G.number_of_edges()
nodes, edges

**Calculate the density value in Python using the formula density = L / L_max, where L_max = N(N-1)/2 for an undirected graph:**

In [None]:
density = edges / (nodes * (nodes - 1) / 2)  # L / L_max
# i.e. 183831 / ((36692 * 36691) / 2)
density

**Check your calculation with the nx.density() function:**

In [None]:
nx.density(G)
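
To make the formula concrete, here is a small sketch on a hypothetical 4-node path graph, where the density can be checked by hand before comparing it with `nx.density`.

```python
import networkx as nx

toy = nx.path_graph(4)          # 4 nodes in a line: 0-1-2-3, so 3 edges
n = toy.number_of_nodes()
m = toy.number_of_edges()

max_edges = n * (n - 1) / 2     # 6 possible undirected edges
manual_density = m / max_edges  # 3 / 6

print(manual_density, nx.density(toy))
```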

**What do you think of the density of the analyzed graph?**

The density is low, but higher than in graphs of online social networks (e.g. Facebook).

#### 2) Identify the degree of the graph's nodes using the G.degree() command

In [None]:
d = dict(G.degree())

**Determine the value of the maximum, minimum and mean degrees and the standard deviation**

In [None]:
max(d.values())

In [None]:
%%timeit -n 100 -r 10
max(d.items(), key=lambda x: x[1])

In [None]:
%%timeit -n 100 -r 10
# don't do that
max_val = 0
max_node = None
for name, degree in d.items():
    if degree > max_val:
        max_val = degree
        max_node = (name, degree)
max_node

In [None]:
print(f"Minimum degree (node, degree): {min(d.items(), key=lambda x: x[1])}")
print(f"Maximum degree (node, degree): {max(d.items(), key=lambda x: x[1])}")
print("Mean degree: %0.2f" % np.mean(list(d.values())))
print("Median degree: %0.2f" % np.median(list(d.values())))
print("Standard deviation of degrees: %0.2f" % np.std(list(d.values())))

**What are the limits of simply observing the value of 〈k〉? Is it representative of all the nodes in the graph? Use the standard deviation to help you answer the question.**

The standard deviation is large compared with the mean, which shows a strong dispersion of the degrees around the mean. The mean is pulled up by outliers, which here have a very large number of links, so it does not describe a typical node. The median (=3) is a better summary.
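
The effect of a single hub on the mean can be sketched on a hypothetical star graph, used here only as an extreme illustration: one node is linked to every other node, and all remaining nodes have degree 1.

```python
import statistics

import networkx as nx

star = nx.star_graph(10)  # node 0 linked to 10 leaves, so 11 nodes in total
degrees = [deg for _, deg in star.degree()]

mean_degree = statistics.mean(degrees)      # pulled up by the hub
median_degree = statistics.median(degrees)  # degree of a typical node
std_degree = statistics.pstdev(degrees)     # population std, like np.std

print(mean_degree, median_degree, std_degree)
```

Here the standard deviation exceeds the mean, and the median is the value that matches what most nodes look like.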

#### 3) To get a better idea of the degree of the graph's nodes, calculate and display the graph's degree distribution.

In [None]:
degree_count = dict(collections.Counter(d.values()))

In [None]:
# probability
degree_count[1] / nodes

In [None]:
fig = plt.figure()
plt.bar(degree_count.keys(), degree_count.values())
plt.title("Degree Histogram")
plt.ylabel("Count")
plt.xlabel("Degree")
plt.show()

**Improve the display (for example with a logarithmic axis) to bring out the shape of the degree distribution. What can we conclude from this?**

In [None]:
fig, ax = plt.subplots()
plt.bar(degree_count.keys(), degree_count.values(), color="b")

plt.title("Degree Histogram")
plt.ylabel("Count")
plt.xlabel("Degree")
ax.set_yscale('log')

plt.show()

In [None]:
df = pd.DataFrame({
    "Degree": degree_count.keys(),
    "Count": degree_count.values()
})

fig = px.histogram(
    df,
    x="Degree",
    y="Count",
    title="Distribution of degrees",
    nbins=max(df['Degree'])
)

fig.update_yaxes(type="log")
fig.show()
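
As a complement to the plots, the sketch below computes the empirical degree distribution p(k) and its complementary cumulative version P(K >= k) with `nx.degree_histogram`. The Barabási-Albert graph is only a hypothetical stand-in with a heavy-tailed distribution; it is not the Enron graph.

```python
import networkx as nx

# Hypothetical stand-in graph with a heavy-tailed degree distribution.
toy = nx.barabasi_albert_graph(n=1000, m=2, seed=42)

# degree_histogram(G)[k] is the number of nodes whose degree is exactly k.
counts = nx.degree_histogram(toy)
n_nodes = toy.number_of_nodes()

# Empirical probability p(k) = N_k / N, keeping only degrees that occur.
p_k = {k: c / n_nodes for k, c in enumerate(counts) if c > 0}

# Complementary cumulative distribution P(K >= k).
ccdf = {}
remaining = n_nodes  # number of nodes with degree >= current k
for k, c in enumerate(counts):
    if c > 0:
        ccdf[k] = remaining / n_nodes
    remaining -= c

print(list(p_k.items())[:5])
```

The cumulative version is often easier to read on logarithmic axes because it smooths out the noise in the tail, where each degree value is held by very few nodes.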

In [None]:
# Sample random nodes to build a subgraph
num_nodes_to_sample = min(1000, len(G.nodes))  # handle the case where G has fewer than 1000 nodes
random_nodes = random.sample(list(G.nodes), num_nodes_to_sample)

# Create the subgraph induced by these random nodes
subgraph = G.subgraph(random_nodes)
print(subgraph)

In [None]:
pos = nx.layout.forceatlas2_layout(subgraph)
plt.axis("off")
nx.draw_networkx_nodes(subgraph, pos, node_size=20)
nx.draw_networkx_edges(subgraph, pos, alpha=0.4)
plt.show()

**What can you conclude from this?**

Real communication graphs are sparse and most of their nodes have low degrees. Here, a few nodes have very high degrees, so they may not correspond to individual people.

#### 4) Write a function that returns the percentage of nodes whose degree is less than or equal to a threshold

In [None]:
def percent_nodes_with_degree_less_than(graph: nx.Graph, threshold: int) -> float:
    """Calculate the percentage of nodes with degree less than or equal to a threshold.

    Parameters
    ----------
    graph : nx.Graph
        The input graph.
    threshold : int
        The degree threshold.

    Returns
    -------
    float
        The percentage of nodes with a degree less than or equal to the threshold.
    """
    degrees = list(dict(graph.degree()).values())
    count_below_threshold = sum(1 for degree in degrees if degree <= threshold)
    return (count_below_threshold / len(degrees)) * 100

#### 5) Use the function created previously with thresholds 2, 5, 10 and 25 links.

In [None]:
for i in [2, 5, 10, 25]:
    print(f"{np.round(percent_nodes_with_degree_less_than(G, i), 2)}% of nodes have at most {i} links")

70% of the graph's nodes have at most 5 links.
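
When several thresholds are needed, the degrees can be counted once and reused. The sketch below is one possible variant; `share_at_most` is a hypothetical helper name, not part of networkx, and the star graph is toy data.

```python
from collections import Counter

import networkx as nx


def share_at_most(graph, thresholds):
    """Hypothetical helper: percentage of nodes with degree <= t, for each t."""
    counts = Counter(deg for _, deg in graph.degree())  # degrees counted once
    total = graph.number_of_nodes()
    shares = {}
    for t in thresholds:
        below = sum(c for k, c in counts.items() if k <= t)
        shares[t] = 100 * below / total
    return shares


star = nx.star_graph(9)  # 1 hub of degree 9 and 9 leaves of degree 1
shares = share_at_most(star, [1, 5, 9])
print(shares)
```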

**What can we conclude from this?**

A sizeable share of the nodes (35%) have a low degree (<= 2), and well over half of the nodes have no more than 5 links. Only one node has a very high degree.

#### 6) Identify the various graph components and their sizes

In [None]:
%%time
components_nodes = sorted(nx.connected_components(G), key=len, reverse=True)
n_nodes = [len(g) for g in components_nodes]

In [None]:
# Display the sizes of the 10 largest connected components
n_nodes[:10]
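
The same pattern can be checked on a small graph whose components are known in advance. This is a sketch on hypothetical toy data: a 4-node path, a triangle and one isolated node.

```python
import networkx as nx

# Toy graph: a path on 4 nodes, a triangle, and one isolated node.
toy = nx.disjoint_union(nx.path_graph(4), nx.complete_graph(3))
toy.add_node("isolated")

# connected_components yields one set of nodes per component.
components = sorted(nx.connected_components(toy), key=len, reverse=True)
sizes = [len(c) for c in components]

# Index 0 is the largest component; .copy() gives an independent graph
# instead of a read-only view of the original.
giant = toy.subgraph(components[0]).copy()

print(sizes, giant.number_of_nodes())
```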

#### 7) Extract a component in a subgraph you call graph_component

In [None]:
G.number_of_nodes()

In [None]:
# Extract the second-largest connected component (index 0 is the largest one)
graph_component = G.subgraph(components_nodes[1])
graph_component.number_of_nodes()

In [None]:
pos = nx.spring_layout(graph_component)
plt.axis("off")
nx.draw_networkx_nodes(graph_component, pos, node_size=20)
nx.draw_networkx_edges(graph_component, pos, alpha=0.4)
plt.show()

#### 8) Identify the diameter of a component

In [None]:
nx.diameter(graph_component)

In [None]:
G2 = graph_component
# Diameter (length of the longest shortest path)
diam = nx.diameter(G2)
print("Diameter:", diam)

# Find a path that realizes the diameter
longest_shortest_path = []

for u in G2.nodes():
    for v in G2.nodes():
        if u != v:
            try:
                path = nx.shortest_path(G2, u, v)
                if len(path) - 1 == diam:
                    longest_shortest_path = path
                    break
            except nx.NetworkXNoPath:
                pass
    if longest_shortest_path:
        break

print("Diameter path:", longest_shortest_path)

# Lay out only the component being drawn, not the whole graph G
pos = nx.spring_layout(G2)

# Edges along the diameter path
diam_edges = list(zip(longest_shortest_path[:-1], longest_shortest_path[1:]))

plt.figure(figsize=(8, 6))

# Whole component in light gray
nx.draw(
    G2, pos,
    with_labels=True,
    node_color="lightgray",
    edge_color="lightgray",
    node_size=600
)

# Diameter path in red
nx.draw_networkx_nodes(
    G2, pos,
    nodelist=longest_shortest_path,
    node_color="red",
    node_size=700
)

nx.draw_networkx_edges(
    G2, pos,
    edgelist=diam_edges,
    edge_color="red",
    width=3
)

plt.title(f"Graph diameter = {diam}")
plt.show()
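
The double loop above calls `nx.shortest_path` for many pairs of nodes. One possible alternative, sketched here on a hypothetical 5-node path graph, uses node eccentricities: any node whose eccentricity equals the diameter is one end of a diameter path, and the other end is a node farthest from it.

```python
import networkx as nx

toy = nx.path_graph(5)  # 0-1-2-3-4

# Eccentricity of a node: distance to the node farthest from it.
ecc = nx.eccentricity(toy)
diam = max(ecc.values())

# Any node whose eccentricity equals the diameter is an endpoint.
source = next(node for node, e in ecc.items() if e == diam)

# The farthest node from that endpoint is the other end of the path.
lengths = nx.single_source_shortest_path_length(toy, source)
target = max(lengths, key=lengths.get)

path = nx.shortest_path(toy, source, target)
print(diam, path)
```

This assumes a connected graph, which holds for a single connected component.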

**Also calculate the average distance:**

In [None]:
%%time

nx.average_shortest_path_length(graph_component)

**What can we conclude from this?**

The diameter of the component, i.e. the length of the shortest path between its two farthest nodes, is 5, and the average number of links to traverse between two nodes is 2.64, which is quite low.
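
To see what the average distance measures, the sketch below recomputes it by hand on a hypothetical 4-node path graph and compares it with `nx.average_shortest_path_length`. It also shows why the measure is computed on a single component: on a disconnected graph, networkx raises an error.

```python
import networkx as nx

toy = nx.path_graph(4)  # 0-1-2-3
n = toy.number_of_nodes()

# Sum of shortest-path lengths over all ordered pairs of distinct nodes.
all_lengths = dict(nx.all_pairs_shortest_path_length(toy))
total = sum(d for dists in all_lengths.values() for d in dists.values())
manual_avg = total / (n * (n - 1))

avg = nx.average_shortest_path_length(toy)

# A graph with two components has no finite average distance.
disconnected = nx.disjoint_union(toy, nx.path_graph(2))
try:
    nx.average_shortest_path_length(disconnected)
    raised = False
except nx.NetworkXError:
    raised = True

print(manual_avg, avg, raised)
```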

#### 9) Calculate the average clustering coefficient of graph_component and of the full graph G

In [None]:
%%time
nx.average_clustering(graph_component)

In [None]:
%%time
nx.average_clustering(G)
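
As a small worked example of what these values mean, the sketch below uses a hypothetical toy graph: a triangle with one extra node attached. The local coefficient of a node is the fraction of pairs of its neighbors that are themselves linked.

```python
import networkx as nx

# Triangle 0-1-2 plus a pendant node 3 attached to node 2.
toy = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])

# Nodes 0 and 1: their two neighbors are linked, so the coefficient is 1.
# Node 2: 1 linked pair out of 3 neighbor pairs, so 1/3.
# Node 3: a single neighbor, so 0.
local = nx.clustering(toy)

# Mean of the local coefficients (zeros included by default).
avg = nx.average_clustering(toy)

# Global alternative: 3 * triangles / connected triples.
trans = nx.transitivity(toy)

print(local, avg, trans)
```

The two summaries differ because the average gives every node the same weight, while transitivity gives more weight to high-degree nodes.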