# Part 2: Email Behaviour Data Analysis

---

### Install Python packages (pip only)

In [1]:
#e.g., %pip install networkx

### Import Python packages

In [2]:
import networkx as nx
import numpy as np
import json
from collections import Counter

---

### Task 1 of 2

Examine the file "emails_cmt224.edgelist", which represents email behaviour at an organisation. Each line contains two numbers, 𝑢 and 𝑣, separated by a blank space. Treat each number as the identifier of an individual in the organisation; a line "𝑢 𝑣" means that individual 𝑢 sent at least one email to individual 𝑣 at some point. Model the data using an appropriate, directed network representation and answer the following questions:

##### Q1. Are the majority of connections in the entire network 'mutual' connections where emails have been exchanged at least once, or asymmetric? In comparison, how many individuals have a higher or lower ratio of mutual connections than the entire network?

In [3]:
# Load the edgelist file as a directed graph
G = nx.read_edgelist("emails_cmt224.edgelist", create_using=nx.DiGraph())

# Compute overall ratio of mutual connections in the network
overall_ratio = nx.overall_reciprocity(G)

# Compute the ratio of mutual connections for each individual node
ratios = nx.reciprocity(G, G.nodes())

# Compare the ratio of each individual node with the overall network ratio
higher_ratio_count = 0
lower_ratio_count = 0
same_ratio_count = 0
for u, ratio in ratios.items():
    if ratio > overall_ratio:
        higher_ratio_count += 1
    elif ratio < overall_ratio:
        lower_ratio_count += 1
    else:  # ratio == overall_ratio
        same_ratio_count += 1
        
# Print the result to 2 decimal places unless it is less than 0.01        
if overall_ratio >= 0.01:
    print("Overall network reciprocity: {:.2f}".format(overall_ratio))
else:
    print("Overall network reciprocity: {:.5f}".format(overall_ratio))
    
print('Number of individuals with the higher ratio: ', higher_ratio_count)
print('Number of individuals with the lower ratio: ' , lower_ratio_count)
print('Number of individuals with the same ratio: ' , same_ratio_count)

Overall network reciprocity: 0.71
Number of individuals with the higher ratio:  408
Number of individuals with the lower ratio:  578
Number of individuals with the same ratio:  0
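
The reciprocity printed above is an edge-level ratio: the share of directed edges that are answered by an edge in the opposite direction. Q1 asks about connections, which can also be read as pairs of individuals, so the sketch below derives both figures on a made-up edge list. `reciprocity_summary` and `toy_edges` are illustrative names, not part of the analysis, and the sketch assumes there are no self-loops (in which case the edge-level figure should agree with `nx.overall_reciprocity`).

```python
def reciprocity_summary(edges):
    """Return (edge-level reciprocity, share of connected pairs that are mutual)."""
    # Drop duplicates and self-loops so every ordered pair counts once (assumption)
    edge_set = {(u, v) for u, v in edges if u != v}
    # A connected pair (dyad) is an unordered pair joined by at least one edge
    dyads = {frozenset(edge) for edge in edge_set}
    # Each mutual pair contributes two reciprocated edges
    mutual_dyads = sum(1 for u, v in edge_set if (v, u) in edge_set) // 2
    edge_level = 2 * mutual_dyads / len(edge_set)
    dyad_level = mutual_dyads / len(dyads)
    return edge_level, dyad_level


# Toy data: one mutual pair (a, b) and two one-way connections
toy_edges = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d")]
edge_level, dyad_level = reciprocity_summary(toy_edges)
print(f"Edge-level reciprocity: {edge_level:.2f}")
print(f"Share of mutual pairs: {dyad_level:.2f}")
```

Without self-loops the two figures are linked by `dyad_level = edge_level / (2 - edge_level)`. Applied to the rounded 0.71 above, this gives a pair-level share of roughly 0.55, so mutual connections would be the majority under either reading.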


##### Q2. Are occurrences of induced, connected subgraphs of 3 individuals (triads) with only mutual connections more abundant in the network than those with a mixture of asymmetric and mutual edges? What does this suggest about how mutual connections are distributed in the network?

In [4]:
# Compute the triadic census: counts of each of the 16 triad types over all sets of 3 nodes
triads = nx.triadic_census(G)
total_triads = sum(triads.values())

# Triad names use M-A-N notation: the digits count mutual, asymmetric and null dyads.
# Percentages below are shares of ALL triads, including the disconnected types 003, 012 and 102.

# Triads with only mutual connections
only_mutual_triads = triads['300']
only_mutual_triad_percentage = (only_mutual_triads / total_triads) * 100

# Triads with only asymmetric connections
only_asymmetric_triads = triads['030T'] + triads['030C']
only_asymmetric_triad_percentage = (only_asymmetric_triads / total_triads) * 100

# Triads with only null dyads (no connections at all)
nulldyads_triads = triads['003']
nulldyads_triads_percentage = (nulldyads_triads / total_triads) * 100

# Triads mixing null dyads and asymmetric connections
nulldyads_asymmetric_triads = triads['012'] + triads['021D'] + triads['021U'] + triads['021C']
nulldyads_asymmetric_triads_percentage = (nulldyads_asymmetric_triads / total_triads) * 100

# Triads mixing null dyads and mutual connections
nulldyads_mutual_triads = triads['102'] + triads['201']
nulldyads_mutual_triads_percentage = (nulldyads_mutual_triads / total_triads) * 100

# Triads mixing asymmetric and mutual connections
asymmetric_mutual_triads = triads['120D'] + triads['120U'] + triads['120C'] + triads['210']
asymmetric_mutual_triads_percentage = (asymmetric_mutual_triads / total_triads) * 100

# Triads mixing all three dyad types (mutual, asymmetric and null)
amn_triads = triads['111D'] + triads['111U']
amn_triads_percentage = (amn_triads / total_triads) * 100

# Format to 2 decimal places unless the value is less than 0.01, in which case use 5
def fmt(value):
    return f"{value:.2f}" if value >= 0.01 else f"{value:.5f}"

print(f"Only mutual connections: {fmt(only_mutual_triad_percentage)}%")
print(f"Only asymmetric connections: {fmt(only_asymmetric_triad_percentage)}%")
print(f"Only null dyads connections: {fmt(nulldyads_triads_percentage)}%")
print(f"Null dyads + asymmetric connections: {fmt(nulldyads_asymmetric_triads_percentage)}%")
print(f"Null dyads + mutual connections: {fmt(nulldyads_mutual_triads_percentage)}%")
print(f"Asymmetric + mutual connections: {fmt(asymmetric_mutual_triads_percentage)}%")
print(f"All types mixed connections: {fmt(amn_triads_percentage)}%")

Only mutual connections: 0.02%
Only asymmetric connections: 0.00380%
Only null dyads connections: 90.75%
Null dyads + asymmetric connections: 4.01%
Null dyads + mutual connections: 4.91%
Asymmetric + mutual connections: 0.04%
All types mixed connections: 0.26%
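
Q2 asks about connected triads, whereas the percentages above are shares of all triads, including the disconnected types 003, 012 and 102. As a hedged sketch of one possible reading, the helper below takes a census dictionary shaped like the one `nx.triadic_census` returns, drops the disconnected types, and groups the remainder by whether their edges are only mutual, only asymmetric, or a mixture. The counts in `toy_census` are invented for illustration and `connected_triad_shares` is a hypothetical helper name. Under this reading, type 201 (two mutual dyads and one null dyad) counts as "only mutual edges".

```python
# Triad types with at most one non-null dyad: the three nodes are not all connected
DISCONNECTED_TYPES = {"003", "012", "102"}


def connected_triad_shares(census):
    """Group connected triads by the kinds of edges they contain.

    `census` maps M-A-N triad names (e.g. "120D") to counts; the first digit
    is the number of mutual dyads and the second the number of asymmetric dyads.
    """
    counts = {"only mutual edges": 0, "only asymmetric edges": 0, "mixed edges": 0}
    for triad_type, count in census.items():
        if triad_type in DISCONNECTED_TYPES:
            continue
        mutual, asymmetric = int(triad_type[0]), int(triad_type[1])
        if asymmetric == 0:
            counts["only mutual edges"] += count
        elif mutual == 0:
            counts["only asymmetric edges"] += count
        else:
            counts["mixed edges"] += count
    total = sum(counts.values())
    if total == 0:
        return {group: 0.0 for group in counts}
    return {group: count / total for group, count in counts.items()}


# Invented counts, for illustration only (not taken from the email network)
toy_census = {
    "003": 100, "012": 20, "102": 10,
    "021D": 2, "021U": 1, "021C": 3,
    "111D": 4, "111U": 2,
    "030T": 1, "030C": 0,
    "201": 5, "120D": 1, "120U": 1, "120C": 0, "210": 2, "300": 3,
}
shares = connected_triad_shares(toy_census)
for group, share in shares.items():
    print(f"{group}: {share:.2%}")
```

Passing the real `triads` dictionary instead of `toy_census` would give the shares among connected triads only, which is closer to how the question is phrased.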


##### Q3. Using the largest, strongly connected component (where at least one path exists between each individual and all others), could the connectivity be suggested to be reflective of a small world phenomenon in comparison to the typical connectivity of 10 comparative random networks?

In [5]:
# Compute the largest strongly connected component
largest_scc = max(nx.strongly_connected_components(G), key=len)
scc = G.subgraph(largest_scc)

# Compute the average shortest path length in the largest strongly connected component
avg_shortest_path_scc = nx.average_shortest_path_length(scc)
print(f"Average shortest path length in the largest strongly connected component: {avg_shortest_path_scc:.2f}" if avg_shortest_path_scc >= 0.01 else f"Average shortest path length in the largest strongly connected component: {avg_shortest_path_scc:.5f}")

# Compute the average clustering coefficient in the largest strongly connected component
avg_clustering_scc = nx.average_clustering(scc)
print(f"Average clustering coefficient for the largest strongly connected component: {avg_clustering_scc:.2f}" if avg_clustering_scc >= 0.01 else f"Average clustering coefficient for the largest strongly connected component: {avg_clustering_scc:.5f}")

# Generate 10 random networks with the same number of nodes and edges as the full network G
# (note: these sizes come from G, not from its largest strongly connected component)
num_nodes = G.number_of_nodes()
num_edges = G.number_of_edges()
random_graphs = [nx.gnm_random_graph(num_nodes, num_edges, directed=True) for _ in range(10)]

# Compute the average shortest path length and average clustering coefficient for each random network
# May take a few minutes: shortest paths between all pairs are computed for each of the 10 networks
avg_shortest_path_random = []
avg_clustering_random = []
for random_graph in random_graphs:
    # Largest strongly connected component of this random network
    # (separate names so the email network's `largest_scc` and `scc` are not overwritten)
    random_largest_scc = max(nx.strongly_connected_components(random_graph), key=len)
    random_scc = random_graph.subgraph(random_largest_scc)
    
    # Compute the average shortest path length in the largest strongly connected component
    avg_shortest_path = nx.average_shortest_path_length(random_scc)
    avg_shortest_path_random.append(avg_shortest_path)
    
    # Compute the average clustering coefficient in the largest strongly connected component
    avg_clustering = nx.average_clustering(random_scc)
    avg_clustering_random.append(avg_clustering)

# Print the mean over the 10 random networks to 2 decimal places unless it is less than 0.01
mean_shortest_path_random = np.mean(avg_shortest_path_random)
mean_clustering_random = np.mean(avg_clustering_random)
print(f"Average shortest path length in random networks: {mean_shortest_path_random:.2f}" if mean_shortest_path_random >= 0.01 else f"Average shortest path length in random networks: {mean_shortest_path_random:.5f}")
print(f"Average clustering coefficient for random networks: {mean_clustering_random:.2f}" if mean_clustering_random >= 0.01 else f"Average clustering coefficient for random networks: {mean_clustering_random:.5f}")

Average shortest path length in the largest strongly connected component: 2.55
Average clustering coefficient for the largest strongly connected component: 0.39
Average shortest path length in random networks: 2.48
Average clustering coefficient for random networks: 0.03
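
One common way to summarise a comparison like the one above is the small-world coefficient sigma = (C / C_rand) / (L / L_rand), where C is the average clustering coefficient and L the average shortest path length. Values well above 1 are usually taken as indicative of small-world structure. The sketch below plugs in the rounded figures printed above, so the result is approximate (0.03 in particular carries a single significant digit). `small_world_sigma` is an illustrative helper, not part of the original analysis. Keep in mind that the random networks above are sized from the full network `G` rather than from its largest strongly connected component, so the ratios are indicative only.

```python
def small_world_sigma(clustering, path_length, clustering_random, path_length_random):
    """Return (clustering ratio, path length ratio, sigma)."""
    clustering_ratio = clustering / clustering_random
    path_ratio = path_length / path_length_random
    return clustering_ratio, path_ratio, clustering_ratio / path_ratio


# Rounded values copied from the printed output above
clustering_ratio, path_ratio, sigma = small_world_sigma(0.39, 2.55, 0.03, 2.48)
print(f"Clustering ratio C / C_rand: {clustering_ratio:.1f}")
print(f"Path length ratio L / L_rand: {path_ratio:.2f}")
print(f"Small-world coefficient sigma: {sigma:.1f}")
```

With these inputs the clustering is roughly 13 times that of the random networks while paths are only about 3% longer, which is the pattern the small-world description refers to.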


---

### Task 2 of 2

Examine the JSON file "emails_cmt224_departments.json" (departments file). Keys in the departments file represent individuals using the same ids as in the "emails_cmt224.edgelist" file in Part 2, Task 1 and the values represent a department id that the individual can be attributed to. Using the contents of the departments file in combination with the network in Part 2, Task 1, answer the following questions:

##### Q1. Using the connections that individuals have in the network, are they more likely to mix with others in their department or those with a similar number of connections?

In [6]:
# Load department data (keys are individual ids as strings, values are department ids)
with open('emails_cmt224_departments.json') as f:
    D = json.load(f)

# Assign the data to nodes in the graph
for node in G:
    G.nodes[node]['department'] = D[node]
    
# Calculate the assortativity coefficients for department and degree
department = nx.attribute_assortativity_coefficient(G, 'department')
degree = nx.degree_assortativity_coefficient(G)
    
# Print the result to 2 decimal places unless its absolute value is less than 0.01
if abs(department) >= 0.01:
    print("Department assortativity: {:.2f}".format(department))
else:
    print("Department assortativity: {:.5f}".format(department))
    
if abs(degree) >= 0.01:
    print("Degree assortativity: {:.2f}".format(degree))
else:
    print("Degree assortativity: {:.5f}".format(degree))

Department assortativity: 0.31
Degree assortativity: -0.01
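
To make the department coefficient easier to interpret, here is a plain-Python sketch of the mixing-matrix formula for categorical assortativity, r = (sum of same-category edge shares - expected share) / (1 - expected share). To the best of my understanding this is the quantity `nx.attribute_assortativity_coefficient` reports for a directed graph, but treat that as an assumption. The edge list, the labels and the name `attribute_assortativity` are made up for illustration.

```python
from collections import Counter


def attribute_assortativity(edges, label):
    """Categorical assortativity of directed `edges` given a node -> category map."""
    pair_counts = Counter((label[u], label[v]) for u, v in edges)
    total = sum(pair_counts.values())
    # Observed share of edges whose endpoints share a category
    same = sum(c for (x, y), c in pair_counts.items() if x == y) / total
    # Share of edge sources and edge targets falling in each category
    source_share, target_share = Counter(), Counter()
    for (x, y), c in pair_counts.items():
        source_share[x] += c / total
        target_share[y] += c / total
    # Share of same-category edges expected if endpoints were paired at random
    expected = sum(source_share[k] * target_share[k] for k in source_share)
    if expected == 1:
        return float("nan")  # only one category present: coefficient undefined
    return (same - expected) / (1 - expected)


# Toy data: two hypothetical departments with a single cross-department email
toy_label = {"a": "X", "b": "X", "c": "Y", "d": "Y"}
toy_edges = [("a", "b"), ("b", "a"), ("c", "d"), ("d", "c"), ("a", "c")]
r = attribute_assortativity(toy_edges, toy_label)
print(f"Toy department assortativity: {r:.2f}")
```

A coefficient near 1 means connections stay within a category, and a value near 0 means mixing is close to what random pairing would give. Read this way, the outputs above suggest mixing by department (0.31) is noticeably stronger than mixing by number of connections (-0.01).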


##### Q2. Are all departments with 10 or more members more tightly connected amongst themselves than individuals across the overall network, irrespective of department? In this context, 'more tightly connected' means having less sparsity in the connections among members AND more clustered connections. In addition to answering the overall question as yes or no, provide a list of departments this is true for (if any) and not true for (if any).

In [7]:
# Get a dictionary of nodes and their department attribute
node_department_dict = nx.get_node_attributes(G, 'department')

# Group nodes by department
department_node_dict = {}
for node, department in node_department_dict.items():
    department_node_dict.setdefault(department, []).append(node)

# Create a dictionary of department subgraphs for departments with 10 or more members
top_departments = {}
for department, nodes in department_node_dict.items():
    if len(nodes) >= 10:
        top_departments[department] = G.subgraph(nodes)

# Calculate the overall average clustering coefficient and density
overall_clustering = nx.average_clustering(G)
overall_density = nx.density(G)

# Split departments into those that are more tightly connected than the overall network
# (higher clustering AND higher density) and those that are not
tightly_connected = []
not_tightly_connected = []
for department, subgraph in top_departments.items():
    subgraph_clustering = nx.average_clustering(subgraph)
    subgraph_density = nx.density(subgraph)
    if subgraph_clustering > overall_clustering and subgraph_density > overall_density:
        tightly_connected.append(department)
    else:
        not_tightly_connected.append(department)

# Print the results
print(f"There are {len(top_departments)} top departments with 10 or more members: {list(top_departments.keys())}")
print(f"{len(tightly_connected)} departments are more tightly connected than the overall network: {tightly_connected}")
print(f"{len(not_tightly_connected)} departments are not more tightly connected than the overall network: {not_tightly_connected}")

There are 28 top departments with 10 or more members: ['1', '15', '3', '0', '7', '14', '16', '20', '19', '36', '21', '38', '22', '34', '17', '37', '35', '10', '4', '5', '13', '6', '9', '8', '23', '11', '2', '27']
25 departments are more tightly connected than the overall network: ['1', '15', '3', '0', '7', '14', '16', '20', '19', '36', '21', '38', '22', '34', '17', '37', '35', '10', '4', '5', '13', '9', '8', '11', '2']
3 departments are not more tightly connected than the overall network: ['6', '23', '27']
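
The "less sparsity" half of the criterion above rests on density, which for a directed graph without self-loops is m / (n * (n - 1)): the number of edges divided by the number of possible ordered pairs. The sketch below illustrates that half only, comparing the density inside each group with the density of the whole toy network. It does not cover the clustering half. All data and names (`directed_density`, `toy_departments`, `toy_edges`) are made up for illustration.

```python
def directed_density(nodes, edges):
    """Density m / (n * (n - 1)) of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    # Keep only edges with both endpoints inside the group; ignore self-loops (assumption)
    internal = {(u, v) for u, v in edges if u in nodes and v in nodes and u != v}
    return len(internal) / (n * (n - 1))


# Made-up data: five individuals in two hypothetical departments
toy_departments = {"a": "X", "b": "X", "c": "X", "d": "Y", "e": "Y"}
toy_edges = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "a"), ("d", "e"), ("a", "d")]

# Density of the whole toy network
overall = directed_density(toy_departments, toy_edges)

# Group individuals by department, then compute the density within each group
members = {}
for node, department in toy_departments.items():
    members.setdefault(department, []).append(node)
per_department = {dept: directed_density(nodes, toy_edges) for dept, nodes in members.items()}

denser = sorted(dept for dept, value in per_department.items() if value > overall)
print(f"Overall density: {overall:.2f}")
print(f"Density per department: {per_department}")
print(f"Departments denser than the overall network: {denser}")
```

Small groups tend to have higher density than a large sparse network almost by construction, which is one reason the question pairs density with clustering rather than relying on density alone.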
