<span>
<b>Python version:</b>  3.7<br/>
<b>CDlib version:</b>  0.1.10<br/>
<b>Last update:</b> 25/11/2021
</span>

In [None]:
import warnings
from collections import Counter
import numpy as np
warnings.filterwarnings('ignore')

<a id='top'></a>
# *Chapter 5: Community Discovery*

This notebook introduces the main steps for the extraction and topological analysis of communities.

**Note:** this notebook is purposely not comprehensive: it only covers the basics you need to get started. For details on all the algorithms, methods, and evaluation facilities available in ``CDlib``, please refer to the official [documentation](https://cdlib.readthedocs.io) and the dedicated notebook appendix.

## Table of Contents

1. [Community Discovery Workflow](#workflow)
    1. [Graph Creation](#graph)
    2. [Community Discovery algorithm(s) selection and configuration](#model)
    3. [Clustering Evaluation (Fitness functions)](#fitness)
    4. [Clustering Evaluation (Comparison)](#comparison)
    5. [Community/Statistics Visualization](#visualization)
    6. [Qualitative evaluation](#qualitative)
    7. [Ground Truth evaluation](#gt)

In [None]:
import cdlib

<a id='workflow'></a>
## Community Discovery Workflow ([to top](#top))

The standard workflow can be summarized as:
- Network Creation
- Community Discovery algorithm(s) selection and configuration
- Clustering(s) evaluation (Fitness functions)
- Clustering(s) evaluation (Comparisons)
- Community/Statistics Visualization

In this section we will see how to implement this workflow by applying two classic network clustering algorithms: Label Propagation and Leiden.
All analyses will be performed using ``CDlib``.

<a id="graph"></a>
### Graph object creation ([to top](#top))

As a first step we need to define the network topology that will be partitioned into communities.

``CDlib`` natively supports both [``networkx``](https://networkx.github.io) and [``igraph``](https://igraph.org/python/) data structures.

In our examples, for the sake of simplicity, we will use ``networkx`` undirected graphs. 

In [None]:
import networkx as nx

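def read_net(filename):
    """Build an undirected graph from a CSV edge list (source,target,...)."""
    g = nx.Graph()
    with open(filename) as f:
        f.readline()  # skip the header row
        for l in f:
            l = l.rstrip().split(",")  # strip the newline before splitting
            g.add_edge(l[0], l[1])
    return g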

# Game of Thrones Season 
season = 1
g = read_net(f'asioaf/got-s{season}-edges.csv')
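
As a quick check of the loading step, the sketch below applies the same parsing logic to a small in-memory edge list instead of the CSV file. The sample rows, the node names, and the `Source,Target,Weight` header are assumptions made for illustration and do not reproduce the actual content of the dataset.

```python
import io
import networkx as nx

# Hypothetical sample mimicking an edge-list CSV with a header row.
sample_csv = io.StringIO(
    "Source,Target,Weight\n"
    "A,B,3\n"
    "A,C,1\n"
    "B,D,2\n"
)

toy = nx.Graph()
sample_csv.readline()  # skip the header row
for line in sample_csv:
    fields = line.rstrip().split(",")
    toy.add_edge(fields[0], fields[1])  # only source and target are used

print(toy.number_of_nodes(), toy.number_of_edges())
print(sorted(toy.neighbors("A")))
```

Printing the number of nodes and edges right after loading is a cheap way to spot parsing problems, such as a header row read as an edge.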

<a id="model"></a>
### Community Discovery algorithm(s) selection and configuration ([to top](#top))

After having defined the graph, we can select the algorithm(s) to partition it.

In [None]:
from cdlib import algorithms

In [None]:
lp_coms = algorithms.label_propagation(g)

In [None]:
lp_coms.communities

In [None]:
leiden_coms = algorithms.leiden(g) # improvement on Louvain

All Community Discovery algorithms return an object that is a concrete instance of the ``Clustering`` datatype.

In particular, both Leiden and Label Propagation return a ``NodeClustering`` object with the following properties:

In [None]:
leiden_coms.method_name # Clustering algorithm name

In [None]:
leiden_coms.method_parameters # Clustering parameters

In [None]:
leiden_coms.communities # Identified Clustering

In [None]:
leiden_coms.overlap # Whether the clustering is overlapping or not

In [None]:
leiden_coms.node_coverage # Fraction of nodes covered by the clustering
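
To make the last two properties concrete, here is a minimal sketch that recomputes them by hand from a list of communities. It assumes that a clustering is overlapping when at least one node belongs to more than one community, and that node coverage is the fraction of graph nodes appearing in at least one community. The helper names and the toy data are hypothetical and are not part of ``CDlib``.

```python
from collections import Counter

def is_overlapping(communities):
    """True if any node appears in more than one community."""
    counts = Counter(node for com in communities for node in set(com))
    return any(c > 1 for c in counts.values())

def node_coverage(communities, all_nodes):
    """Fraction of the nodes in all_nodes assigned to at least one community."""
    all_nodes = set(all_nodes)
    covered = set().union(*communities) & all_nodes
    return len(covered) / len(all_nodes)

toy_nodes = ["a", "b", "c", "d", "e"]
partial_overlapping = [["a", "b"], ["b", "c"]]   # "b" is shared, "d" and "e" are left out
full_partition = [["a", "b"], ["c", "d", "e"]]   # every node in exactly one community

print(is_overlapping(partial_overlapping), node_coverage(partial_overlapping, toy_nodes))
print(is_overlapping(full_partition), node_coverage(full_partition, toy_nodes))
```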

Moreover, ``Clustering`` objects can also generate a JSON representation of the results:

In [None]:
leiden_coms.to_json()

<a id="comparison"></a>
### Clustering Evaluation (Comparison) ([to top](#top))

When multiple clusterings have been computed on the same network, it is useful to measure how similar they are.

``CDlib`` supports this by exposing several clustering resemblance scores, each tailored to specific kinds of network clusterings (crisp/overlapping, complete/partial node coverage).

As with the fitness functions, resemblance scores can be computed either as methods of a clustering object or through library-level functions.
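
For intuition on what such a score measures, the following sketch computes a normalized mutual information between two partitions from scratch. It assumes that both clusterings are crisp partitions of the same node set and normalizes the mutual information by the arithmetic mean of the two entropies. This is one common convention, and the value returned by ``CDlib`` may rely on a different normalization. The function name and the toy partitions are hypothetical.

```python
from collections import Counter
from math import log

def nmi(partition_a, partition_b):
    """NMI between two partitions, each given as a list of node lists."""
    # Map every node to the index of the community it belongs to.
    label_a = {v: i for i, com in enumerate(partition_a) for v in com}
    label_b = {v: i for i, com in enumerate(partition_b) for v in com}
    nodes = list(label_a)
    n = len(nodes)

    count_a = Counter(label_a[v] for v in nodes)
    count_b = Counter(label_b[v] for v in nodes)
    joint = Counter((label_a[v], label_b[v]) for v in nodes)

    def entropy(counts):
        return -sum(c / n * log(c / n) for c in counts.values())

    # I(A;B) = sum_ij p_ij * log(p_ij / (p_i * p_j))
    mutual_info = sum(
        c / n * log(c * n / (count_a[i] * count_b[j]))
        for (i, j), c in joint.items()
    )
    mean_entropy = (entropy(count_a) + entropy(count_b)) / 2
    # Two single-community partitions carry no information: treat them as identical.
    return mutual_info / mean_entropy if mean_entropy > 0 else 1.0

same = nmi([[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]])
unrelated = nmi([[1, 2, 3], [4, 5, 6]], [[1, 4], [2, 5], [3, 6]])
print(same, unrelated)
```

Identical partitions score 1 (up to floating point rounding), while partitions whose community memberships are statistically independent score 0.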

In [None]:
leiden_coms.normalized_mutual_information(lp_coms)

In [None]:
cdlib.evaluation.normalized_mutual_information(leiden_coms, lp_coms)

## Exercise: visualize the network in which colors correspond to communities

- try several algorithms
- compare the results

Note: you can also do this with `cdlib` (Python 3.8).

<a id="qualitative"></a>
### Qualitative evaluation ([to top](#top))

Another way to validate a clustering is to analyse the purity of each community w.r.t. an external attribute.

In our example, let's consider the Houses of the GoT characters: which of the tested CD approaches identifies the most "homogeneous" clusters?
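
As a small worked example of the idea, purity can be read as the share of a community's labelled members that belong to its most frequent house. The sketch below computes it for two hand-made communities. The node names, house labels, and helper name are hypothetical and only serve to illustrate the arithmetic.

```python
from collections import Counter

def purity(community, node_to_label):
    """Share of labelled nodes that carry the community's most frequent label."""
    labels = [node_to_label[n] for n in community if n in node_to_label]
    if not labels:
        return None  # purity is undefined when no node has a label
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical labels: "n5" has no known house.
toy_houses = {"n1": "A", "n2": "A", "n3": "B", "n4": "B"}

mixed = purity(["n1", "n2", "n3"], toy_houses)   # two "A" out of three labelled nodes
single = purity(["n4", "n5"], toy_houses)        # one labelled node, so trivially pure
print(mixed, single)
```

Note that unlabelled nodes are ignored, so a community with a single labelled member always has purity 1.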

In [None]:
def read_houses(filename):
    node_to_house = {}
    with open(filename) as f:
        f.readline()
        for l in f:
            l = l.rstrip().split(",")
            node_to_house[l[0]] = l[2]
    return node_to_house

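def community_purity(coms, nth):
    purities = []
    for c in coms.communities:
        # collect the house of every node for which one is known
        houses = [nth[node] for node in c if node in nth]
        if not houses:
            continue  # no labelled node: purity is undefined, skip the community
        cnt = Counter(houses)
        # share of labelled nodes belonging to the most frequent house
        purity = max(cnt.values()) / len(houses)
        purities.append(purity)
    return purities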

In [None]:
# Game of Thrones Houses
season = 1
nth = read_houses(f'asioaf/got-s{season}-nodes_ext.csv')

In [None]:
leiden_purities = community_purity(leiden_coms, nth)
leiden_purities

In [None]:
np.mean(leiden_purities), np.std(leiden_purities)

In [None]:
lp_purities = community_purity(lp_coms, nth)
lp_purities

In [None]:
np.mean(lp_purities), np.std(lp_purities)