(ch2:prepare)=
# Prepare the Data for Network Algorithms

Next, it's time to prepare our networks for machine learning algorithms. As before, we will try to capture most of these steps with functions. This is because:
1. Functions make the data preparation code that we write reusable on new networks,
2. You will gradually build libraries of utility functions that you can bundle into packages of their own or recycle for future projects,
3. You can modularize these functions into other parts of your data pipeline before the data gets to your algorithm, keeping a lean, module-oriented design,
4. You can easily try different transformations of the data and evaluate which ones tend to work best.

First, let's re-load the data that we read in during the previous section:

In [None]:
import os
import urllib.request  # the request submodule must be imported explicitly
import numpy as np


MSR_ROOT = "https://raw.githubusercontent.com/microsoft/graspologic/dev/"
DROS_PATH = os.path.join("datasets", "drosophila")
DROS_URL = MSR_ROOT + "graspologic/datasets/drosophila"

DROS_NAMES = {"Left": {"A": "left_adjacency.csv", "labels": "left_cell_labels.csv"},
              "Right": {"A": "right_adjacency.csv", "labels": "right_cell_labels.csv"}}

def fetch_drosophila_data(dros_url=DROS_URL, dros_path=DROS_PATH,
                         dros_names=DROS_NAMES):
    if not os.path.isdir(dros_path):
        os.makedirs(dros_path)
    for (name, dictobj) in dros_names.items():
        for (objtype, fname) in dictobj.items():
            csv_path = os.path.join(dros_path, fname)
            # build the URL with "/" rather than os.path.join, which
            # would insert backslashes on Windows
            csv_url = dros_url + "/" + fname
            urllib.request.urlretrieve(csv_url, csv_path)

def load_drosophila_data(dros_path=DROS_PATH, dros_names=DROS_NAMES,
                        return_labels=True):
    adj_dict = {}  # make the return object
    for (name, dictobj) in dros_names.items():
        adj_dict[name] = {}  # dictionary for each adjacency matrix with labels
        
        adj_path = os.path.join(dros_path, dictobj["A"])
        with open(adj_path) as adjfile:
            adj_dict[name]["A"] = np.loadtxt(adjfile)
        
        labels_path = os.path.join(dros_path, dictobj["labels"])
        with open(labels_path) as labelfile:
            adj_dict[name]["labels"] = np.loadtxt(labelfile, dtype=str)
    return adj_dict

In [None]:
fetch_drosophila_data()
dataset = load_drosophila_data()
Aleft, labelsleft = (dataset["Left"]["A"], dataset["Left"]["labels"])

## Data cleaning

Most network machine learning algorithms cannot work with a node which is *isolated*, a term we will cover in [Chapter 4](#link?), meaning that the node has no edges. Let's start by fixing this. We can remove isolated nodes from the network as follows:
1. Compute the number of nodes each node connects to. This consists of summing the adjacency matrix along the rows and along the columns.
2. Identify any nodes whose row sum and column sum are both zero. These are the *isolated* nodes.
3. Remove the isolated nodes from both the adjacency matrix and the labels.

Let's see how this works in practice. We begin by taking the row sum and the column sum for each node, and then adding the two together. Next, we remove all nodes which are not connected to any other nodes (the row and column sums are both zero) from both the adjacency matrix and the labels:

In [None]:
def remove_isolates(A, labels):
    """
    A function which removes isolated nodes from the 
    adjacency matrix A and the labels.
    """
    in_degree = A.sum(axis=0)  # column sums: total weight coming into each node
    out_degree = A.sum(axis=1)  # row sums: total weight going out of each node
    cum_degree = in_degree + out_degree
    isolated = cum_degree == 0  # assumes edge weights are non-negative
    A_purged = A[~isolated, :][:, ~isolated]
    labels_purged = labels[~isolated]
    print("Purging {:d} nodes...".format(isolated.sum()))
    return (A_purged, labels_purged)

In [None]:
Aleft, labelsleft = remove_isolates(Aleft, labelsleft)

So no isolated nodes were found, and consequently no nodes were purged. Great! What else can we do?
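Since nothing was purged from the mushroom body, it can help to watch the same logic act on a network that does contain an isolate. The sketch below uses a made-up four-node network and a hypothetical helper, `remove_isolates_quiet`, which mirrors the function above without the print statement. It assumes edge weights are non-negative, so that a node's row and column sums can only total zero when it has no edges.

```python
import numpy as np

def remove_isolates_quiet(A, labels):
    """Drop nodes whose row sum and column sum are both zero (hypothetical helper)."""
    degree = A.sum(axis=0) + A.sum(axis=1)  # incoming plus outgoing weight per node
    keep = degree != 0                      # boolean mask of non-isolated nodes
    return A[np.ix_(keep, keep)], labels[keep]

# toy directed network on 4 nodes; node 2 has no edges in or out
A_toy = np.array([[0, 1, 0, 0],
                  [0, 0, 0, 2],
                  [0, 0, 0, 0],
                  [3, 0, 0, 0]])
labels_toy = np.array(["K", "P", "O", "I"])

A_small, labels_small = remove_isolates_quiet(A_toy, labels_toy)
print(A_small.shape)  # (3, 3)
print(labels_small)   # ['K' 'P' 'I']
```

The row and the column of the isolated node are removed together, so the adjacency matrix stays square and stays aligned with the labels.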

### Handling categorical variables

If you remember, the `labels` variable is an array of characters which gives the cell type of each mushroom body node. Unfortunately, many network machine learning algorithms prefer to work with numbers rather than characters. For this reason, it is advantageous to convert these node labels to a numerical vector of integers. We arbitrarily decide on a mapping for the unique labels, numbering them 1, 2, 3, and 4:

In [None]:
def remap_labels(labels):
    # arbitrary mapping from integer code to cell-type label
    labs = {1: "K", 2: "P", 3: "O", 4: "I"}
    # initialize empty numerical vector and dictionary
    # to keep track of the mapping you produce
    mapping = {}
    numerical_labels = np.empty(labels.shape[0], dtype=int)
    for i, lab in labs.items():
        numerical_labels[labels == lab] = i
        mapping[lab] = i  # record the same code that was written above
    return numerical_labels, mapping

In [None]:
num_labsleft, mapping = remap_labels(labelsleft)
print(mapping)

So, following the `labs` dictionary, K labels were replaced with the value 1, P labels with the value 2, O labels with the value 3, and I labels with the value 4.
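If you would rather not hard-code the label names, one possible alternative is to let `numpy` discover them. The sketch below is an assumption about how you might do this, not the approach used in this chapter: `np.unique` with `return_inverse=True` sorts the distinct labels and reports, for each node, the position of its label in that sorted list. The toy label vector is made up for illustration.

```python
import numpy as np

labels_toy = np.array(["K", "P", "K", "O", "I", "P"])

# np.unique sorts the distinct labels; `inverse` holds each node's
# index into that sorted list
unique_labels, inverse = np.unique(labels_toy, return_inverse=True)
numerical_toy = inverse + 1  # shift so the codes start at 1 rather than 0
mapping_toy = {str(lab): i + 1 for i, lab in enumerate(unique_labels)}

print(mapping_toy)    # {'I': 1, 'K': 2, 'O': 3, 'P': 4}
print(numerical_toy)  # [2 4 2 3 1 4]
```

The codes here follow alphabetical order, so they will generally differ from a hand-picked mapping. Either choice is fine as long as you keep the mapping around to translate back.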

To streamline the process of cleaning up the raw data, you will often need to write custom data cleaners. You will want your cleaners to work seamlessly with `sklearn`'s functionality, such as pipelines, which only requires you to implement three class methods: `fit()`, `transform()`, and `fit_transform()`. By adding `TransformerMixin` as a base class, we do not even have to implement the third one! If we use `BaseEstimator` as a base class, we also obtain `get_params()` and `set_params()`, which will be useful for hyperparameter tuning later on. For example, here is a cleaner class which purges the adjacency matrix of isolates and remaps the categorical labels to numbers. Note that a key step in implementing this as cleanly as possible is that the inputs, an adjacency matrix and a vector of node labels, are passed in as a *single* tuple object. This is because a `sklearn` pipeline passes the return value of each `transform()` call as the input to the next step, which we will see later on when we string several of these transformers together into a single pipeline:

In [None]:
from sklearn.base import TransformerMixin, BaseEstimator

class CleanData(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        # nothing to learn; y is accepted for compatibility with sklearn
        return self
    
    def transform(self, X):
        (A, labels) = X
        Acleaned, labels = remove_isolates(A, labels)
        labels_cleaned, mapping = remap_labels(labels)
        self.A_ = Acleaned
        self.labels_ = labels_cleaned
        self.mapping_ = mapping
        return (self.A_, self.labels_)
    
data_cleaner = CleanData()
(Aleft_clean, labelsleft_clean) = data_cleaner.transform((Aleft, labelsleft))

## Edge weight transformations

One of the most important transformations that we will come across in network machine learning is called an *edge-weight transformation*. Many networks you encounter, such as the drosophila mushroom body, will have edge weights which do not just take values of 1 or 0 (edge or no edge, a *binary* network). Rather, many of the networks you come across may have discrete-weighted edges (the edges take non-negative integer values, such as 0, 1, 2, 3, ...), or decimal-weighted edges (the edges take values like 0, 0.1234, 0.234, 2.4234, ...). For a number of reasons discussed later in [Chapter 4](#link?), this is often not a desirable characteristic. The edges in a network might be error prone, and it might only be desirable to capture one (or a few) properties of the edge weights, rather than leave them at their raw values. Further, a lot of the techniques we come across throughout this book might not even *work* on networks which are not binary. For this reason, we need to get accustomed to transforming the edge weights to take new sets of values.

There are two common approaches to transform edge weights: the first is called binarization (set all of the edges to take a value of 0 or 1), and the second is called an ordinal transformation. 

### Binarization of edges 

Binarization is quite simple: the edges in the raw network take non-binary values, and you need them to be binary (0s and 1s) for your algorithm. How do you solve this?

The simplest thing to do is usually to keep the edges which take a value of zero at zero, and set all of the edges which take a non-zero value to one. In effect, this converts the original non-binary network to a binary one. Let's take a look at how we can implement this using `graspologic`. We first look at the network before binarization, and then after:

In [None]:
import graspologic as gp

Aleft_bin = gp.utils.binarize(Aleft_clean)

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1,2, figsize=(18, 6))
gp.plot.heatmap(Aleft_clean, ax=ax[0], inner_hier_labels=labelsleft, title="Weighted Drosophila Mushroom Body")
gp.plot.heatmap(Aleft_bin, ax=ax[1], inner_hier_labels=labelsleft, title="Binary Drosophila Mushroom Body");

Whoa! That heatmap looks a whole lot different, particularly in the top left corner. What happened is that many edges in the upper left corner of the weighted drosophila mushroom body have very small weights, which *almost* look like they are zero. When we binarize the network, this is no longer the case: all of the edge weights which are non-zero take a value of one (and are dark red), and all of the edge weights which are zero stay at zero (and are white).
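To make the effect on very small weights concrete, here is a minimal sketch of binarization written directly in `numpy`. The helper `binarize_by_hand` and the toy matrix are hypothetical; this illustrates the rule described above and is not a claim about how `graspologic` implements it internally.

```python
import numpy as np

def binarize_by_hand(A):
    """Return a copy of A with every non-zero entry set to 1 (hypothetical helper)."""
    return (A != 0).astype(float)

# toy weighted network: one tiny weight (0.01) next to much larger ones
A_toy = np.array([[0.0, 0.01, 5.0],
                  [0.0, 0.0, 120.0],
                  [2.5, 0.0, 0.0]])

A_toy_bin = binarize_by_hand(A_toy)
print(A_toy_bin)
# [[0. 1. 1.]
#  [0. 0. 1.]
#  [1. 0. 0.]]
```

The edge with weight 0.01 and the edge with weight 120 end up indistinguishable, which is exactly why faint regions of the weighted heatmap become solid after binarization.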

Another way we could have transformed these edge weights is through something called a *pass to ranks*. In a pass to ranks, the raw edge weights are discarded, with one exception: the non-zero edges are first ranked from smallest to largest, and each edge's weight is replaced by its rank divided by the number of non-zero edges. The largest edge therefore receives a value of one, and the smallest a value of $\frac{1}{\text{number of non-zero edges}}$. This is called an *ordinal transformation*, in that it preserves the *order* of the edge weights, but discards all other information.

In [None]:
Aleft_ptr = gp.utils.pass_to_ranks(Aleft_clean)

Again, we plot the resulting connectome, before and after passing to ranks, as heatmaps:

In [None]:
fig, ax = plt.subplots(1,2, figsize=(18, 6))
gp.plot.heatmap(Aleft_clean, ax=ax[0], inner_hier_labels=labelsleft, title="Weighted Drosophila Mushroom Body")
gp.plot.heatmap(Aleft_ptr, ax=ax[1], inner_hier_labels=labelsleft, title="Drosophila Mushroom Body, Passed to Ranks");

This has shifted the histogram of edge-weights, as we can see below:

In [None]:
import seaborn as sns


fig, ax = plt.subplots(2,1, figsize=(10, 10))
sns.histplot(Aleft_clean[Aleft_clean > 0].flatten(), ax=ax[0]);
ax[0].set_xlabel("Edge weight")
ax[0].set_title("Histogram of left mushroom body non-zero edge weights");

sns.histplot(Aleft_ptr[Aleft_ptr > 0].flatten(), ax=ax[1]);
ax[1].set_xlabel("ptr(Edge weight)")
ax[1].set_title("Histogram of left mushroom body, passed-to-ranks");

This has the desirable property that it bounds the network's edge weights between $0$ and $1$, as we can see above. This is often crucial if we seek to compare two or more networks whose edge weights are only meaningful relative to one another: an edge's weight might mean something in relation to another edge's weight in the same network, but nothing in relation to an edge's weight in a separate network. Further, passing to ranks is not very susceptible to outliers, as we will see in later chapters.
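Both properties can be checked on a small example. The sketch below follows the description given in the text: rank the non-zero weights, then divide by the number of non-zero edges, with tied weights sharing their average rank. The helper `rank_nonzero` and the toy matrices are hypothetical, and `graspologic`'s `pass_to_ranks` may scale ranks or break ties differently, so treat this as an illustration of the idea rather than a reimplementation.

```python
import numpy as np

def rank_nonzero(A):
    """Replace each non-zero weight by its rank divided by the number of
    non-zero edges; tied weights share their average rank (hypothetical helper)."""
    A_ranked = np.zeros(A.shape, dtype=float)
    nonzero = A != 0
    values, inverse, counts = np.unique(
        A[nonzero], return_inverse=True, return_counts=True
    )
    # average 1-based rank of each distinct value among the sorted weights
    avg_rank = np.cumsum(counts) - (counts - 1) / 2
    A_ranked[nonzero] = avg_rank[inverse] / nonzero.sum()
    return A_ranked

A_toy = np.array([[0.0, 1.0, 2.0],
                  [3.0, 0.0, 4.0],
                  [4.0, 5.0, 0.0]])
A_outlier = A_toy.copy()
A_outlier[2, 1] = 5000.0  # inflate the largest weight by a factor of 1000

print(rank_nonzero(A_toy))
# the outlier changes nothing, because only the ordering matters
print(np.array_equal(rank_nonzero(A_toy), rank_nonzero(A_outlier)))  # True
```

Every transformed weight lies between 0 and 1, and making the largest edge a thousand times larger leaves the result untouched.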

Again, we will turn the edge-weight transformation step into its own class, much like we did above:

In [None]:
class FeatureScaler(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        A, labels = X
        A_scaled = gp.utils.binarize(A)
        return (A_scaled, labels)
    
feature_scaler = FeatureScaler()
A_cleaned_scaled, _ = feature_scaler.transform((Aleft_clean, labelsleft_clean))
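Earlier we noted that `BaseEstimator` provides `get_params()` and `set_params()`. These only become interesting once a transformer has a setting to tune. The sketch below is a hypothetical variant, `WeightTransformer`, which keeps an edge only if its weight exceeds a `threshold`. With the default threshold of zero and non-negative weights, this reduces to the binarization above. The class name and the parameter are assumptions made for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class WeightTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical scaler whose behavior is controlled by one hyperparameter."""
    def __init__(self, threshold=0.0):
        # sklearn expects __init__ to only store its arguments as attributes
        self.threshold = threshold

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        A, labels = X
        # keep an edge only if its weight exceeds the threshold
        return ((A > self.threshold).astype(float), labels)

A_toy = np.array([[0.0, 0.5],
                  [3.0, 0.0]])
labels_toy = np.array([1, 2])

scaler = WeightTransformer()
print(scaler.get_params())        # {'threshold': 0.0}
scaler.set_params(threshold=1.0)  # what a tuning routine would do between runs
A_thr, _ = scaler.fit_transform((A_toy, labels_toy))
print(A_thr)
```

Because the constructor argument is stored under the same name, `get_params()` and `set_params()` work without any extra code.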

## Transformation pipelines

As you can see, there are a number of data transformations that need to be executed to prepare network data for machine learning algorithms. One thing that might be desirable is to develop a pipeline which automates the data preparation process for you. We will perform this using the `Pipeline` class from `sklearn`. The `Pipeline` class can help us apply sequences of transformations. Here is a simple pipeline for doing all of the steps we have performed so far:

In [None]:
from sklearn.pipeline import Pipeline

num_pipeline = Pipeline([
    ('cleaner', CleanData()),
    ('scaler', FeatureScaler()),
])

left_tr = num_pipeline.fit_transform((Aleft, labelsleft))

The pipeline class takes a list of name/estimator pairs defining a sequence of steps. All but the last estimator must be transformers, which implement the `fit_transform()` method. In our case, this is handled directly by the `TransformerMixin` base class.

When you call the `fit_transform()` method of the pipeline, it calls the `fit_transform()` method on each of the transformers in turn, passing the output of each call as the input to the next call. The output of the final step is what the pipeline returns.
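The chaining described above can be sketched in a few lines of plain Python. This is a conceptual illustration only: `AddOne`, `Double`, and `run_steps` are hypothetical names, and the real `Pipeline` class does considerably more (it handles targets, parameters, and validation).

```python
class AddOne:
    """Toy transformer (hypothetical): adds 1 to every element of a list."""
    def fit_transform(self, X):
        return [x + 1 for x in X]

class Double:
    """Toy transformer (hypothetical): doubles every element of a list."""
    def fit_transform(self, X):
        return [2 * x for x in X]

def run_steps(steps, X):
    """Feed X through each (name, transformer) pair in order."""
    for name, step in steps:
        X = step.fit_transform(X)  # output of one step is the input of the next
    return X

result = run_steps([("add", AddOne()), ("double", Double())], [1, 2, 3])
print(result)  # [4, 6, 8]
```

This is also why our transformers return an `(A, labels)` tuple: each step must hand the next one exactly the kind of object it expects to receive.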

Next, we'll see the real handiness of the `Pipeline` class. The reason we went to such lengths to define a pipeline is that we wanted an easily reproducible procedure that we could efficiently apply to new datasets. We'll see how this works by applying the same pipeline to both the left and right mushroom bodies of the drosophila:

In [None]:
left_tr = num_pipeline.fit_transform((Aleft, labelsleft))
right_tr = num_pipeline.fit_transform((dataset["Right"]["A"], dataset["Right"]["labels"]))

Next, we visualize the mushroom bodies, after transformation:

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(18, 6))
gp.plot.heatmap(left_tr[0], inner_hier_labels=left_tr[1], title="Left Mushroom Body, Preprocessed", ax=axs[0])
gp.plot.heatmap(right_tr[0], inner_hier_labels=right_tr[1], title="Right Mushroom Body, Preprocessed", ax=axs[1]);