# Multi-view Clustering Demo
In this notebook we demonstrate how to use the MVMC algorithm to cluster the [CiteSeer](https://dl.acm.org/doi/10.1145/276675.276685) data set, which consists of both a citation network and processed content from the papers. We will do the following:
- read in and pre-process the data
- create a network for each view that needs one
- use MVMC to cluster the data

In [1]:
# import necessary packages
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.neighbors import kneighbors_graph

import pandas as pd, networkx as nx, numpy as np
from scipy.sparse import csr_matrix
import os, sys

In [2]:
# Now, let's import MVMC from the base file
# Add the directory containing multi_view_clustering.py to the Python path
sys.path.append(os.path.abspath('..'))

from multi_view_clustering import MVMC

## Import and Preprocess the data
Having imported all the necessary packages, we will now import and pre-process the CiteSeer data set.

In [3]:
content = pd.read_csv(os.path.join('Data','CiteSeer','citeseer_content.csv'), header=0, index_col=0, dtype={0: object})
links = pd.read_csv(os.path.join('Data','CiteSeer','citeseer_network_links.csv'), header=0, index_col=None, dtype=object) 

In [4]:
# Next, we pre-process the data to remove links whose endpoints are missing from the content view

discrepancies = list(np.setdiff1d(links['Source'], content.index)) + list(np.setdiff1d(links['Target'], content.index))
links = links[~links['Source'].isin(discrepancies)]
links = links[~links['Target'].isin(discrepancies)]
content = content[~content.index.isin(discrepancies)]
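
The filtering step above keeps only citation links whose two endpoints both have a content row. Below is a minimal sketch of the same logic on toy data, written in plain Python with sets so that it runs without the CiteSeer files or pandas. The paper IDs and the names `content_ids` and `toy_links` are made up for illustration.

```python
# Toy stand-ins for the notebook's data (all IDs are hypothetical).
content_ids = {"p1", "p2", "p3"}  # papers that have a content row
toy_links = [("p1", "p2"), ("p2", "p9"), ("p8", "p3"), ("p3", "p1")]

# Endpoints that appear in the links but have no content row,
# the analogue of np.setdiff1d(links[...], content.index).
discrepancies = {node for edge in toy_links for node in edge} - content_ids

# Drop every link that touches a discrepant node,
# the analogue of links[~links[...].isin(discrepancies)].
kept_links = [
    (source, target)
    for source, target in toy_links
    if source not in discrepancies and target not in discrepancies
]

print(sorted(discrepancies))
print(kept_links)
```

Only the links between `p1`, `p2`, and `p3` survive, because `p8` and `p9` have no content row.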

In [5]:
# Now, we'll use NetworkX to build the citation network and align the content rows to its node order

network = nx.from_pandas_edgelist(links, source='Source', target='Target', create_using=nx.DiGraph())
df = nx.to_pandas_adjacency(network)
content = content.reindex(labels=df.index, axis='index')
adjacency = df.values

print(f"network size: {adjacency.shape}")
print(f"nodal content size: {content.shape}")

network size: (3312, 3312)
nodal content size: (3312, 3704)


In [6]:
# Finally, let's take out the labels from the content view

X = content.iloc[:,:-1]
y = content.iloc[:,-1].astype('category').cat.codes

print(f"unique labels and counts for the dataset:\n{y.value_counts()}")

unique labels and counts for the dataset:
2    701
4    668
1    596
5    590
3    508
0    249
Name: count, dtype: int64


## Create the Base Networks for Clustering
With the data imported and processed, we begin the first step of MVMC: converting each view to a network. Views that are already networks, such as the CiteSeer citation network, can be used as is or transformed into a different type of network, such as a network of shared neighbors.

The network for each view can be constructed in many different ways, and the choice should respect the domain. Finding optimal view networks remains an open research question, both for this method and more generally.
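
To make the "network of shared neighbors" idea concrete, here is a small sketch of one possible reading of it: two papers are connected when they cite the same paper, and the edge weight counts how many cited papers they share. This is an illustrative assumption, not necessarily the transformation intended for MVMC, and the function name `shared_neighbor_weights` and the toy edge list are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def shared_neighbor_weights(directed_edges):
    """Weight each unordered pair of source nodes by how many targets they share."""
    cited_by = defaultdict(set)  # target -> set of sources pointing at it
    for source, target in directed_edges:
        cited_by[target].add(source)

    weights = defaultdict(int)
    for sources in cited_by.values():
        # Every pair of sources citing the same target shares one neighbor.
        for u, v in combinations(sorted(sources), 2):
            weights[(u, v)] += 1
    return dict(weights)


# Toy citation list: (citing paper, cited paper). All names are made up.
toy_citations = [("a", "c"), ("b", "c"), ("a", "d"), ("b", "d"), ("e", "d")]
shared = shared_neighbor_weights(toy_citations)
print(shared)
```

Papers `a` and `b` share two cited papers, while `e` shares one with each of them. The resulting network is undirected and weighted even though the input citations are directed.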

In [7]:
networks = []
networks.append(csr_matrix(adjacency))

In [8]:
# add in a k-NN for the content view

kNN = kneighbors_graph(X, metric='cosine', mode='connectivity', n_neighbors=40)
kNN = kNN.minimum(kNN.T)
networks.append(kNN)
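
The call `kNN.minimum(kNN.T)` takes the element-wise minimum of the directed k-NN matrix and its transpose, so an edge survives only when each node is among the other's nearest neighbors. The sketch below reproduces that "mutual k-NN" rule in plain Python on a handful of 2-D points. The helper names and the toy points are assumptions for illustration, and the notebook itself relies on scikit-learn rather than this code.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def knn_sets(points, k):
    """For each point, the set of indices of its k nearest other points."""
    neighbors = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: cosine_distance(p, points[j]))
        neighbors.append(set(others[:k]))
    return neighbors


def mutual_knn_edges(points, k):
    """Undirected edges (i, j), i < j, present in both directions of the k-NN graph."""
    nbrs = knn_sets(points, k)
    return {
        (i, j)
        for i in range(len(points))
        for j in nbrs[i]
        if i < j and i in nbrs[j]
    }


toy_points = [(1.0, 0.0), (0.9, 0.1), (0.8, 0.3), (0.0, 1.0), (0.1, 0.9)]
directed = knn_sets(toy_points, k=1)
mutual = mutual_knn_edges(toy_points, k=1)
print(directed)
print(sorted(mutual))
```

Point 2 picks point 1 as its nearest neighbor, but point 1 picks point 0, so that one-way edge is dropped. This is why a mutual k-NN graph can have fewer than `n_neighbors` edges per node.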

One of the great difficulties of multi-view data is the heterogeneity between views. This difference persists even after the views are converted to the same type of data (here, networks). Let's look at some simple properties of the networks we just created to see this.

In [9]:
def calculate_density(csr_adj_matrix):
    # Number of nodes
    num_nodes = csr_adj_matrix.shape[0]
    # Number of possible edges in an undirected graph
    possible_edges = num_nodes * (num_nodes - 1) / 2
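    # Number of edges, assuming a symmetric matrix that stores each edge twice.
    # For a directed matrix (such as the citation view) the two factors of 2 cancel,
    # so the result still equals nnz / (n * (n - 1)), the directed density.
    actual_edges = csr_adj_matrix.nnz / 2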
    # Density: actual number of edges / possible number of edges
    density = actual_edges / possible_edges
    
    return density

def calculate_average_degree(csr_adj_matrix):
    # Number of nodes
    num_nodes = csr_adj_matrix.shape[0]
    # Degree of each node (row sums; this is the out-degree for a directed matrix)
    degrees = csr_adj_matrix.sum(axis=1)
    # Average degree
    average_degree = degrees.mean()
    
    return average_degree

In [10]:
print(f"Density for the citation network: {calculate_density(networks[0])}\nDensity for the k-NN content network: {calculate_density(networks[1])}")

Density for the citation network: 0.0004299640927547904
Density for the k-NN content network: 0.0074781835398620026


In [11]:
print(f"Average degree for the citation network: {calculate_average_degree(networks[0])}\nAverage degree for the k-NN content network: {calculate_average_degree(networks[1])}")

Average degree for the citation network: 1.4236111111111112
Average degree for the k-NN content network: 24.760265700483092
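
The figures printed above can be checked by hand from the node count (3312) and the edge counts that MVMC reports in its verbose output later in this notebook (4715 directed edges for the citation view, 41003 undirected edges for the k-NN view). The sketch below assumes every stored entry equals 1, so that the row sums add up to the number of stored entries. The helper names are made up for this check.

```python
n = 3312               # nodes in each view (from the printed shapes)
citation_edges = 4715  # directed edges reported for view 0
knn_edges = 41003      # undirected edges reported for view 1

# A directed matrix stores one entry per edge; a symmetric one stores two.
citation_nnz = citation_edges
knn_nnz = 2 * knn_edges


def density(nnz, num_nodes):
    # Same value as (nnz / 2) / (num_nodes * (num_nodes - 1) / 2).
    return nnz / (num_nodes * (num_nodes - 1))


def mean_degree(nnz, num_nodes):
    # Mean row sum when every stored entry is 1.
    return nnz / num_nodes


print(density(citation_nnz, n), mean_degree(citation_nnz, n))
print(density(knn_nnz, n), mean_degree(knn_nnz, n))
```

Both views have the same number of nodes, yet the k-NN view is roughly 17 times denser than the citation view, which illustrates the heterogeneity between views.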


## Cluster by MVMC
Having created a network for each view, let's now cluster them with MVMC.

In [12]:
mvmc_clstr = MVMC(verbose=True, resolution_tol=0.1)

In [13]:
y_preds = mvmc_clstr.fit_transform(networks)

View Graph 0: num_nodes: 3312, num_edges: 4715, directed: True, num_components: 438, num_isolates: 48
View Graph 1: num_nodes: 3312, num_edges: 41003, directed: False, num_components: 4, num_isolates: 3
--------
Iteration: 1 
 Modularities: [0.599437502881645, 0.4874101643146264] 
 Resolutions: [1, 1] 
 Weights: [1, 1]
--------
Iteration: 2 
 Modularities: [0.4639523461010775, 0.4094532316111103] 
 Resolutions: [2.569541079847051, 1.6748032557227184] 
 Weights: [1.218383944721614, 0.7816160552783862]
--------
Iteration: 3 
 Modularities: [0.4091614651442202, 0.35629530570683154] 
 Resolutions: [4.229043828406161, 2.5305914818810766] 
 Weights: [1.1865432655551256, 0.8134567344448745]
--------
Iteration: 4 
 Modularities: [0.37722225837255574, 0.3161509944283575] 
 Resolutions: [5.916187318901956, 3.6002446960723975] 
 Weights: [1.161563563205802, 0.8384364367941981]
--------
Iteration: 5 
 Modularities: [0.35752134284653686, 0.28777158064024067] 
 Resolutions: [8.735916720699377, 4.772

In [14]:
print(f"Best modularity obtained: {mvmc_clstr.best_modularity}")
print(f"Best result found at iteration: {mvmc_clstr.best_iteration}")

Best modularity obtained: 1.0868476671962715
Best result found at iteration: 0
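
One observation about these numbers: the reported best modularity coincides with the weighted sum of the two per-view modularities printed for the first iteration of the log. This suggests, without confirming, that `best_modularity` is a weighted sum over views and that `best_iteration` is zero-indexed. The check below only reuses values printed in this notebook; it does not inspect MVMC's internals.

```python
# Values copied from the verbose log for "Iteration: 1".
first_iteration_modularities = [0.599437502881645, 0.4874101643146264]
first_iteration_weights = [1, 1]

# Value copied from the printed "Best modularity obtained".
reported_best = 1.0868476671962715

weighted_sum = sum(
    w * q for w, q in zip(first_iteration_weights, first_iteration_modularities)
)
matches = abs(weighted_sum - reported_best) < 1e-12
print(weighted_sum, matches)
```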


In [15]:
# compare found clusters to the actual labels

print(f"ARI between found clusters and labels: {adjusted_rand_score(y, y_preds)}")
print(f"AMI between found clusters and labels: {adjusted_mutual_info_score(y, y_preds)}")

ARI between found clusters and labels: 0.39102072611960464
AMI between found clusters and labels: 0.3772787791420943
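
Cluster IDs are arbitrary, so the comparison above needs a score that ignores how clusters are numbered. The adjusted Rand index does this by counting pairs of nodes that are grouped together in both partitions and correcting for chance. Below is a from-scratch sketch of the standard pair-counting formula on toy labels; the notebook itself uses scikit-learn's `adjusted_rand_score`, and the function name and labels here are illustrative only.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Pair-counting ARI; undefined (division by zero) in degenerate cases
    such as both partitions placing every item in a single cluster."""
    n = len(labels_a)
    # Pairs grouped together in both partitions (contingency-table cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs grouped together within each partition separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    maximum = (sum_a + sum_b) / 2          # agreement of a perfect match
    return (sum_cells - expected) / (maximum - expected)


truth = [0, 0, 1, 1, 2, 2]
relabeled = [2, 2, 0, 0, 1, 1]  # same grouping, different cluster IDs
coarse = [0, 0, 0, 1, 1, 1]     # merges and splits the true groups

ari_relabeled = adjusted_rand_index(truth, relabeled)
ari_coarse = adjusted_rand_index(truth, coarse)
print(ari_relabeled, ari_coarse)
```

The relabeled partition scores the maximum value even though no cluster ID matches, while the coarse partition scores well below it. A score near 0 indicates agreement no better than chance.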
