# Discover and Visualize the Data to Gain Insights

So far, we have accumulated some code to read in the input data:

In [None]:
import os
import urllib.request  # urlretrieve lives in the urllib.request submodule
import numpy as np


MSR_ROOT = "https://raw.githubusercontent.com/microsoft/graspologic/dev/"
DROS_PATH = os.path.join("datasets", "drosophila")
DROS_URL = MSR_ROOT + "graspologic/datasets/drosophila"

DROS_NAMES = {"Left": {"A": "left_adjacency.csv", "labels": "left_cell_labels.csv"},
              "Right": {"A": "right_adjacency.csv", "labels": "right_cell_labels.csv"}}

def fetch_drosophila_data(dros_url=DROS_URL, dros_path=DROS_PATH,
                         dros_names=DROS_NAMES):
    os.makedirs(dros_path, exist_ok=True)  # create the local folder if needed
    for name, dictobj in dros_names.items():
        for objtype, fname in dictobj.items():
            csv_path = os.path.join(dros_path, fname)
            # build the URL with "/" (os.path.join would insert backslashes on Windows)
            csv_url = dros_url + "/" + fname
            urllib.request.urlretrieve(csv_url, csv_path)

def load_drosophila_data(dros_path=DROS_PATH, dros_names=DROS_NAMES,
                        return_labels=True):
    adj_dict = {}  # make the return object
    for (name, dictobj) in dros_names.items():
        adj_dict[name] = {}  # dictionary for each adjacency matrix with labels
        
        adj_path = os.path.join(dros_path, dictobj["A"])
        with open(adj_path) as adjfile:
            adj_dict[name]["A"] = np.loadtxt(adjfile)
        
        labels_path = os.path.join(dros_path, dictobj["labels"])
        with open(labels_path) as labelfile:
            adj_dict[name]["labels"] = np.loadtxt(labelfile, dtype=str)
    return adj_dict

In [None]:
fetch_drosophila_data()
dataset = load_drosophila_data()

We've got some code to prepare the data for machine learning:

In [None]:
import graspologic as gp

def remove_isolates(A, labels):
    """
    A function which removes isolated nodes from the 
    adjacency matrix A and the labels.
    """
    in_degree = A.sum(axis=0)  # column sums: edges arriving at each node
    out_degree = A.sum(axis=1)  # row sums: edges leaving each node
    cum_degree = in_degree + out_degree
    isolated = cum_degree == 0  # nodes with no edges in or out
    A_purged = A[~isolated, :][:, ~isolated]
    labels_purged = labels[~isolated]
    print("Purging {:d} nodes...".format(isolated.sum()))
    return (A_purged, labels_purged)

def remap_labels(labels):
    labs = {1: "K", 2: "P", 3: "O", 4: "I"}
    # initialize an empty numerical vector and a dictionary
    # to keep track of the mapping from letter labels to numbers
    mapping = {}
    numerical_labels = np.empty(labels.shape[0], dtype=int)
    for i, lab in labs.items():
        numerical_labels[labels == lab] = i
        mapping[lab] = i  # record the same number that was assigned above
    return numerical_labels, mapping

from sklearn.base import TransformerMixin, BaseEstimator

class CleanData(BaseEstimator, TransformerMixin):
    def __init__(self):
        return
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        (A, labels) = X
        Acleaned, labels = remove_isolates(A, labels)
        labels_cleaned, mapping = remap_labels(labels)
        self.A_ = Acleaned
        self.labels_ = labels_cleaned
        self.mapping_ = mapping
        return (self.A_, self.labels_)
    
class FeatureScaler(BaseEstimator, TransformerMixin):
    def __init__(self):
        return
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        A, labels = X
        A_scaled = gp.utils.binarize(A)
        return (A_scaled, labels)

from sklearn.pipeline import Pipeline

num_pipeline = Pipeline([
    ('cleaner', CleanData()),
    ('scaler', FeatureScaler()),
])

In [None]:
(Aleft, labelsleft) = num_pipeline.fit_transform((dataset["Left"]["A"], dataset["Left"]["labels"]))
(Aright, labelsright) = num_pipeline.fit_transform((dataset["Right"]["A"], dataset["Right"]["labels"]))

And you are left with two adjacency matrices and node labels:

In [None]:
import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 2, figsize=(18, 6))
gp.plot.heatmap(Aleft, inner_hier_labels=labelsleft, title="Left Mushroom Body, Preprocessed", ax=axs[0])
gp.plot.heatmap(Aright, inner_hier_labels=labelsright, title="Right Mushroom Body, Preprocessed", ax=axs[1]);

What's next? Now is where the fun starts: you are ready to tackle some network machine learning algorithms.

## Choosing an appropriate network machine learning model

The most crucial step in network machine learning is figuring out an appropriate *model* for your data. A *statistical model*, which we will become more accustomed to in [Chapter 5](#link?), is a set of assumptions that describes how we think the data behaves.

In the plot above, we immediately notice that nodes with the same label tend to behave similarly. For example, nodes labeled 2 are well connected to nodes labeled 1, 2, or 3 (the top right of the heatmap) and to nodes labeled 4 and 1 (the bottom left of the heatmap). This holds for both the left and right mushroom bodies. The pattern continues as we move down the groups: nodes labeled 4 are almost completely unconnected to any nodes other than those labeled 2 (the bottom left of the heatmap). If we were being naive, we might just say that the left and right mushroom bodies look the same, end of story. But are they *actually* the same? As it turns out, there are some pretty big differences!

This pattern, in which nodes behave like the other nodes in their group, is captured by a statistical model you will learn about later on: the [stochastic block model](#link?) (SBM). You will first fit an SBM to the left and right mushroom bodies using `graspologic`'s `SBMEstimator` class. Its `fit` method takes an adjacency matrix and the node labels as its leading arguments:

In [None]:
from graspologic.models import SBMEstimator

left_sbm = SBMEstimator()
left_sbm.fit(Aleft, labelsleft);
right_sbm = SBMEstimator()
right_sbm.fit(Aright, labelsright);

For each block of the adjacency matrix (all of the possible edges from one group of nodes to another), the fitted SBM computes the fraction of those possible edges which actually exist. This fraction is an estimate of the "probability" of an edge in that block. The estimates are organized into what is called the block probability matrix, which is stored in the `block_p_` attribute of the `SBMEstimator`. We visualize these block probability matrices as heatmaps, too:

In [None]:
import seaborn as sns

def plot_block(X, blockname="Node Group", blocktix=[0.5, 1.5, 2.5, 3.5],
               blocklabs=["1", "2", "3", "4"], ax=None, title=""):
    if ax is None:
        fig, ax = plt.subplots(figsize=(8, 6))
    
    with sns.plotting_context("talk", font_scale=1):
        ax = sns.heatmap(X, cmap="Purples",
                        ax=ax, cbar_kws=dict(shrink=1), yticklabels=False,
                        xticklabels=False, vmin=0, vmax=1, annot=True)
        ax.set_title(title)
        cbar = ax.collections[0].colorbar
        ax.set(ylabel=blockname, xlabel=blockname)
        ax.set_yticks(blocktix)
        ax.set_yticklabels(blocklabs)
        ax.set_xticks(blocktix)
        ax.set_xticklabels(blocklabs)
        cbar.ax.set_frame_on(True)
    return

fig, axs = plt.subplots(1,3, figsize=(27, 6))
plot_block(left_sbm.block_p_, title="Left Block Matrix", ax=axs[0])
plot_block(right_sbm.block_p_, title="Right Block Matrix", ax=axs[1])
plot_block(left_sbm.block_p_ - right_sbm.block_p_, title="Left - Right", ax=axs[2])
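As an aside, the per-block edge fraction described before these plots can be reproduced by hand. The following is a minimal sketch on a made-up four-node network, assuming a directed network without self-loops; `block_edge_fractions`, `A_toy`, and `labels_toy` are hypothetical names used only for illustration, and this is not a description of how `SBMEstimator` is implemented internally.

```python
import numpy as np

def block_edge_fractions(A, labels):
    """Fraction of possible directed edges (no self-loops) present in each block."""
    groups = np.unique(labels)
    P = np.zeros((len(groups), len(groups)))
    for i, gi in enumerate(groups):
        rows = labels == gi
        for j, gj in enumerate(groups):
            cols = labels == gj
            block = A[np.ix_(rows, cols)]  # edges from group gi to group gj
            n_possible = rows.sum() * cols.sum()
            if i == j:
                n_possible -= rows.sum()  # a node cannot connect to itself
            # leave NaN where a block has no possible edges
            P[i, j] = block.sum() / n_possible if n_possible > 0 else np.nan
    return groups, P

# toy adjacency matrix (made up): two nodes in group 1, two nodes in group 2
A_toy = np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 0],
])
labels_toy = np.array([1, 1, 2, 2])

groups, P = block_edge_fractions(A_toy, labels_toy)
print(groups)
print(P)
```

Each entry of `P` plays the same role as an entry of `block_p_` in the heatmaps above: row `i`, column `j` holds the fraction of possible edges from group `i` to group `j` that are present.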

At first glance, the block matrices look quite similar. However, as you can see in the right-most plot, some of the blocks are pretty different, such as the block (node group 4, node group 3). Network machine learning is all about replacing ambiguous language like this with something precise. What exactly do we mean by "pretty different"?

As we will see in [Chapter 9](#link?), there are many ways in which we can be far more precise. Using some machine learning techniques, we can even quantify how surprising the observed differences would be if the two matrices were really the same. Putting some precision on the words "same" or "different" (how similar? how dissimilar?) is called hypothesis testing, which you will learn all about in the [Applications](#link?) section of the book. In the next code block, we will perform a simple test of how similar or different the left and right mushroom bodies are:

In [None]:
import numpy as np
from pkg.stats import stochastic_block_test

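nl = Aleft.shape[0]
nr = Aright.shape[0]
# network density: fraction of the n*(n-1) possible directed edges
# (excluding self-loops) that are present
density_left = Aleft.sum() / (nl * (nl - 1))
density_right = Aright.sum() / (nr * (nr - 1))

# ratio of the overall densities, passed to the test as its null value
null_odds = density_left / density_right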
stat, pvalue, misc = stochastic_block_test(
    Aleft, Aright, labels1=labelsleft, labels2=labelsright, method="fisher",
    null_odds=null_odds
)
Pval_mtx = misc["uncorrected_pvalues"]

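# rescale the right block probabilities by the density ratio so that
# they can be compared with the left ones
right_sbm.block_p_adj_ = right_sbm.block_p_ * null_odds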
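To give a feel for what a per-block comparison with `method="fisher"` might involve, here is a minimal standard-library sketch of a two-sided Fisher's exact test for a single block. It is an illustration under stated assumptions, not the implementation of `stochastic_block_test`: it tests for equal edge probabilities (it ignores the density adjustment passed as `null_odds` above), the function name `fisher_exact_two_sided` is hypothetical, and the edge counts are made up.

```python
from math import comb

def fisher_exact_two_sided(edges1, possible1, edges2, possible2):
    """Two-sided Fisher's exact test for equal edge probability in two blocks.

    edges*: number of edges present in the block of each network
    possible*: number of possible edges in the block of each network
    """
    total = possible1 + possible2
    total_edges = edges1 + edges2

    def prob(k):
        # hypergeometric probability that network 1 holds k of the total edges
        return (comb(possible1, k) * comb(possible2, total_edges - k)
                / comb(total, total_edges))

    k_min = max(0, total_edges - possible2)
    k_max = min(possible1, total_edges)
    p_obs = prob(edges1)
    # add up every outcome that is at most as likely as the observed one
    return sum(prob(k) for k in range(k_min, k_max + 1)
               if prob(k) <= p_obs * (1 + 1e-7))

# made-up counts: 8 of 10 possible edges on one side, 1 of 10 on the other
p_different = fisher_exact_two_sided(8, 10, 1, 10)
# made-up counts: 5 of 10 possible edges on both sides
p_same = fisher_exact_two_sided(5, 10, 5, 10)
print(p_different, p_same)
```

A small value, as for the first pair of counts, says that a gap this large would be unlikely if both blocks shared one edge probability; identical fractions, as in the second pair, give no evidence of a difference.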

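Next, we show the left and right blocks again, and we also include the $p$-values. For each block, the $p$-value is the probability of seeing a difference at least as large as the one we observed if the left and right blocks were really the same (after adjusting for the overall difference in density):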

In [None]:
fig, axs = plt.subplots(1,3, figsize=(27, 6))
plot_block(left_sbm.block_p_, title="Left Block Matrix", ax=axs[0])
plot_block(right_sbm.block_p_adj_, title="Right Block Matrix, Adjusted", ax=axs[1])
plot_block(Pval_mtx, title="p-values per-block", ax=axs[2])

So it turns out that four of the blocks are pretty radically different: we can see that we have $p$-values of $0.0064$, $9.1 \times 10^{-6}$, $1.6 \times 10^{-9}$, and $1.4 \times 10^{-23}$! Remember that a small $p$-value means that a difference this large would be very unlikely if the two blocks were really the same, so we have strong evidence that these four blocks differ between the left and right mushroom bodies!
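One caveat: the values plotted come from `misc["uncorrected_pvalues"]`, so they have not been adjusted for testing many blocks at once. A minimal sketch of a Bonferroni adjustment is shown here, assuming one test for each of the 16 blocks of the $4 \times 4$ block matrix (an assumption for illustration; the number of tests actually performed is not stated above).

```python
# the four p-values reported in the text above
reported_pvalues = [0.0064, 9.1e-6, 1.6e-9, 1.4e-23]

# ASSUMPTION: one test per block of the 4 x 4 block matrix, 16 tests in all
n_tests = 16

# Bonferroni: multiply each p-value by the number of tests, capping at 1
adjusted_pvalues = [min(1.0, p * n_tests) for p in reported_pvalues]

for raw, adj in zip(reported_pvalues, adjusted_pvalues):
    print(f"uncorrected: {raw:.2g}   Bonferroni-adjusted: {adj:.2g}")
```

Under that assumption, the largest of the four values ($0.0064$) rises to about $0.10$, while the other three stay far below $0.05$, so the overall conclusion that some blocks differ still holds.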

We have now uncovered a new neurobiological insight: the left and right mushroom bodies of *Drosophila* are not the same! This provides us with evidence that the fly mushroom body, and consequently its *brain*, is bilaterally *asymmetric*: the left and right sides of the brain are wired differently, which suggests that they may have unique functionality. We know that human beings have asymmetrical brains, but it is incredible to think that with a few snippets of code, we can show that fruit flies share this property too. At the time of the writing of this textbook, this result had not yet been reported!

## Try it out!

Hopefully this chapter gave you a small-scale peek at what a network machine learning project looks like, along with a brief introduction to some tools you can use to gain novel insights from your network data. While what we did in this chapter was relatively straightforward, the process from obtaining your data to choosing appropriate network machine learning problems can be extremely arduous! In fact, as a network machine learning scientist, you might find that just getting your data into a useful form (a network) and cleaning it until it is usable takes an *enormous* chunk of your time!

If you haven't already done so, now is a fantastic time to grab your laptop, select a network dataset you are interested in, and try to work through the whole process from A to Z. If you need some pointers, the `graspologic` package [makes several datasets available to you](https://microsoft.github.io/graspologic/latest/reference/reference/datasets.html). We'd recommend working through the contents of this book by first using the example data that is presented in each chapter, and then trying to apply the techniques to a network dataset of your choosing.