(ch2:prepare)=
# Prepare the Data for Network Algorithms

Next, it's time to prepare our networks for machine learning algorithms. As before, we will try to capture most of these steps in functions, for several reasons:
1. Functions make the data preparation code we write reusable on new networks,
2. You will gradually build libraries of utility functions that you can bundle together into packages of their own or recycle for future projects,
3. You can plug these functions into other parts of your data pipeline before the data reaches your algorithm, which keeps the design lean and modular,
4. You can easily try different transformations of the data and evaluate which ones tend to work best.

First, let's re-load the data that we read in during the previous section:

In [None]:
import os
import urllib.request  # a bare "import urllib" does not guarantee urllib.request is loaded
import graspologic as gp

AWSND_ROOT = "https://open-neurodata.s3.amazonaws.com/m2g/Diffusion/BNU1-8-27-20-m2g-native-csa-det/"
DWI_URL = os.path.join(AWSND_ROOT, 
                       "sub-0025864/ses-1/connectomes/AAL_space-MNI152NLin6_res-2x2x2/")
DWI_PATH = os.path.join("datasets", "dwi")
DWI_NAME = "sub-0025864_ses-1_dwi_AAL_space-MNI152NLin6_res-2x2x2_connectome.csv"

def fetch_dwi_data(dwi_url=DWI_URL, dwi_path=DWI_PATH, dwi_name=DWI_NAME):
    local_path = os.path.join(dwi_path, dwi_name)
    if not os.path.isdir(dwi_path):
        os.makedirs(dwi_path)
    csv_url = os.path.join(dwi_url, dwi_name)
    urllib.request.urlretrieve(csv_url, local_path)
    return local_path

In [None]:
local_path = fetch_dwi_data()
A = gp.utils.import_edgelist(local_path)

## Data cleaning

Most network machine learning algorithms cannot work with a node which is *isolated*, a term we will learn in [Chapter 4](#link?) which means that the node has no edges. Let's start by fixing this. We can remove isolated nodes from the network as follows:
1. Compute the degree of each node by summing the adjacency matrix along the rows (or columns). For a network with non-negative edge weights, a sum of zero means that the node has no edges. The network is *undirected*, a property you will learn in [properties of networks](ch4:prop-net), which means that if a node can communicate with another node, that other node can also communicate with that node, so the row sums and column sums agree.
2. Identify any nodes whose row sum and column sum are both zero. These are the *isolated* nodes.
3. Remove the isolated nodes from the adjacency matrix.

Let's see how this works in practice. We begin by taking the row and column sums for each node. Next, we remove every node which is not connected to any other node (the row sum and column sum are both zero) from the adjacency matrix:

In [None]:
def remove_isolates(A):
    """
    A function which removes isolated nodes from the 
    adjacency matrix A.
    """
    in_degree = A.sum(axis=0)   # column sums
    out_degree = A.sum(axis=1)  # row sums
    # a node is isolated only if it has no edges in either direction
    isolated = (in_degree == 0) & (out_degree == 0)
    A_purged = A[~isolated, :]
    A_purged = A_purged[:, ~isolated]
    print("Purging {:d} nodes...".format(isolated.sum()))
    return A_purged

In [None]:
A = remove_isolates(A)

So no isolated nodes were found, and consequently no nodes were purged. Great! What else can we do?
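Since the connectome has no isolated nodes, the effect of the purge is easier to see on a toy network. The following is a minimal sketch using only `numpy`; the matrix `A_toy` is made up for illustration and is not part of the dataset. It checks that the network is undirected (a symmetric adjacency matrix), finds the isolated node, and drops its row and column:

```python
import numpy as np

# Hypothetical undirected network on 4 nodes; node 2 has no edges at all.
A_toy = np.array([
    [0, 3, 0, 1],
    [3, 0, 0, 2],
    [0, 0, 0, 0],
    [1, 2, 0, 0],
], dtype=float)

# An undirected network has a symmetric adjacency matrix.
is_undirected = np.allclose(A_toy, A_toy.T)

# A node is isolated when both its row sum and its column sum are zero.
isolated = (A_toy.sum(axis=0) == 0) & (A_toy.sum(axis=1) == 0)

# Keep only the rows and columns of the non-isolated nodes.
keep = ~isolated
A_toy_purged = A_toy[np.ix_(keep, keep)]

print("Undirected:", is_undirected)
print("Isolated node indices:", np.flatnonzero(isolated))
print("Shape after purging:", A_toy_purged.shape)
```

The purged matrix keeps the original ordering of the remaining nodes, so any vector of node labels would need to be filtered with the same `keep` mask.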

To streamline the process of cleaning up the raw data, you will often need to write custom data cleaners. You will want your cleaners to work seamlessly with `sklearn`'s tools, such as pipelines, which only requires you to implement three class methods: `fit()`, `transform()`, and `fit_transform()`. By adding `TransformerMixin` as a base class, we do not even have to implement the third one! If we use `BaseEstimator` as a base class, we will also obtain `get_params()` and `set_params()`, which will be useful for hyperparameter tuning steps later on. For example, here is a cleaner class which purges the adjacency matrix of isolates. Note that a key step to implementing this as cleanly as possible is that each transformer takes a *single* input object and returns a *single* output object (here, the adjacency matrix; if you also needed to carry a vector of node labels along, you could bundle it with the matrix in a single tuple). This is because `sklearn` passes the return value of each call of `transform()` directly as the input to the next transformer, which we will see later on when we string several of these transformers together into a single pipeline:

In [None]:
from sklearn.base import TransformerMixin, BaseEstimator

class CleanData(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        # nothing to learn from the data; return self so calls can be chained
        return self
    
    def transform(self, X):
        A = X
        A_cleaned = remove_isolates(A)
        self.A_ = A_cleaned  # keep a reference to the cleaned matrix
        return self.A_
    
data_cleaner = CleanData()
A_clean = data_cleaner.transform(A)
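To make the role of the two base classes concrete, here is a small sketch with a transformer that has one hyperparameter. The `Thresholder` class and its `cutoff` parameter are hypothetical and invented purely for illustration; only `BaseEstimator`, `TransformerMixin`, and the methods they provide come from `sklearn`:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class Thresholder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: zero out edge weights below `cutoff`."""
    def __init__(self, cutoff=1.0):
        self.cutoff = cutoff

    def fit(self, X, y=None):
        # Nothing to learn from the data; return self so calls can be chained.
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.where(X >= self.cutoff, X, 0.0)

A_toy = np.array([[0.0, 3.0], [1.0, 0.0]])
thresholder = Thresholder(cutoff=2.0)

# fit_transform() is inherited from TransformerMixin.
A_thresh = thresholder.fit_transform(A_toy)

# get_params() and set_params() are inherited from BaseEstimator.
params_before = thresholder.get_params()
thresholder.set_params(cutoff=0.5)
params_after = thresholder.get_params()

print(A_thresh)
print(params_before, params_after)
```

For `get_params()` to discover hyperparameters, `__init__` should simply store each argument on `self` under the same name, as done here.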

## Edge weight transformations

One of the most important transformations that we will come across in network machine learning is called an *edge-weight transformation*. Many networks you encounter, such as the human diffusion connectome, will have edge weights which do not just take values of 1 or 0 (edge or no edge, a *binary* network); rather, they may have discrete-weighted edges (the edges take non-negative integer values, such as 0, 1, 2, 3, ...), or decimal-weighted edges (the edges take values like 0, 0.1234, 0.234, 2.4234, ...). For a number of reasons discussed later in [Regularization](ch4:regularization), this is often not a desirable characteristic. The edges in a network might be error prone, and it might only be desirable to capture one (or a few) properties of the edge weights, rather than leave them at their raw values. Further, a lot of the techniques we come across throughout this book might not even *work* on networks which are not binary. For this reason, we need to get accustomed to transforming the edge weights to take new sets of values.

There are two common approaches to transforming edge weights: the first is called binarization (set all of the edges to take a value of 0 or 1), and the second is called an ordinal transformation (keep only the ordering of the edge weights).

### Binarization of edges 

Binarization is quite simple: the edges in the raw network take non-binary values (values other than just 0s and 1s), and you need them to be 0s and 1s for your algorithm. How do you solve this? 

The simplest thing to do is usually to keep the edges which take a value of zero at zero, and set all of the edges which take a non-zero value to one. In effect, this takes the original non-binary network and converts it to a binary one. Let's take a look at how we can implement this using `graspologic`. We binarize the network, and then compare heatmaps of the network before and after binarization:

In [None]:
A_bin = gp.utils.binarize(A_clean)

In [None]:
import matplotlib.pyplot as plt
from graphbook_code import heatmap

fig, ax = plt.subplots(1,2, figsize=(18, 6))
heatmap(A_clean, ax=ax[0], title="Weighted Human Connectome")
heatmap(A_bin, ax=ax[1], title="Binary Human Connectome");

Whoa! That heatmap looks a whole lot different, particularly in the top left. What happened is that the weighted human connectome has very small edge weights in the upper left corner, which *almost* look like they are zero. When we binarize the network, this is no longer the case: all of the edge weights which are non-zero take a value of one (and are dark purple) and all of the edge weights which are zero stay at zero (and are white).
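The same effect can be reproduced on a toy matrix. This is a minimal `numpy` sketch of the binarization rule described above (non-zero becomes one, zero stays zero); it is not `graspologic`'s implementation, and `A_toy` is made up for illustration:

```python
import numpy as np

# Hypothetical weighted network: one tiny weight (0.02) and one large weight (5.0).
A_toy = np.array([
    [0.0, 0.02, 5.0],
    [0.02, 0.0, 0.0],
    [5.0, 0.0, 0.0],
])

# Every non-zero weight becomes 1; zeros stay 0.
A_toy_bin = (A_toy != 0).astype(float)

print(A_toy_bin)
```

After binarization, the edge with weight 0.02 is indistinguishable from the edge with weight 5.0, which is exactly why nearly invisible edges in the weighted heatmap become fully visible in the binary one.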

Another way we could have transformed these edge weights is through something called a *pass to ranks*. Through a pass to ranks, the magnitudes of the edge weights are discarded and only their ordering is kept: the non-zero edges are ranked from smallest to largest, and the ranks are then rescaled so that the largest edge has a rank of one and the smallest edge has a rank of $\frac{1}{\text{number of non-zero edges}}$. This is called an *ordinal transformation*, in that it preserves the *order* of the edge weights, but discards all other information.
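The ranking rule described above can be sketched by hand on a toy undirected network. This sketch makes several assumptions that should be read as illustrative only: the helper `rank_nonzero_edges` is hypothetical, each undirected edge is counted once (using the upper triangle), there are no self-loops, and ties are averaged by `scipy.stats.rankdata`. The normalization and tie handling used by `graspologic` may differ from this.

```python
import numpy as np
from scipy.stats import rankdata

def rank_nonzero_edges(A):
    """Illustrative helper (not part of graspologic): rank the non-zero
    edges of an undirected, loopless network and rescale the ranks by
    the number of non-zero edges."""
    A = np.asarray(A, dtype=float)
    iu = np.triu_indices_from(A, k=1)   # each undirected edge once
    weights = A[iu]
    nonzero = weights != 0
    scaled = np.zeros_like(weights)
    # rankdata gives rank 1 to the smallest value and averages ties
    scaled[nonzero] = rankdata(weights[nonzero]) / nonzero.sum()
    ranked = np.zeros_like(A)
    ranked[iu] = scaled
    return ranked + ranked.T            # restore symmetry

# Hypothetical network with three edges of very different magnitudes.
A_toy = np.array([
    [0.0, 0.5, 40.0],
    [0.5, 0.0, 2.0],
    [40.0, 2.0, 0.0],
])
A_toy_ranked = rank_nonzero_edges(A_toy)
print(A_toy_ranked)
```

With three non-zero edges, the smallest edge maps to 1/3, the middle edge to 2/3, and the largest edge to 1. On the full connectome, the pass to ranks provided by `graspologic` is applied in the next cell: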

In [None]:
A_ptr = gp.utils.pass_to_ranks(A_clean)

Again, we plot the resulting connectome, before and after passing to ranks, as heatmaps:

In [None]:
fig, ax = plt.subplots(1,3, figsize=(18, 6))
heatmap(A_clean, ax=ax[0], title="Weighted human connectome")
heatmap(A_bin, ax=ax[1], title="Binary human connectome")
heatmap(A_ptr, ax=ax[2], title="Ranked human connectome", vmin=0, vmax=1);

This has shifted the histogram of edge-weights, as we can see below:

In [None]:
import seaborn as sns


fig, ax = plt.subplots(2,1, figsize=(10, 10))
sns.histplot(A_clean[A_clean > 0].flatten(), ax=ax[0]);
ax[0].set_xlabel("Edge weight")
ax[0].set_title("Histogram of human connectome non-zero edge weights");

sns.histplot(A_ptr[A_ptr > 0].flatten(), ax=ax[1]);
ax[1].set_xlabel("ptr(Edge weight)")
ax[1].set_title("Histogram of human connectome, passed-to-ranks");

This has the desirable property that it bounds the network's edge weights to be between $0$ and $1$, as we can see above, which is often crucial if we seek to compare two or more networks and the edge weights are relative in magnitude (an edge's weight might mean something in relation to another edge's weight in that same network, but an edge's weight means nothing in relation to another edge's weight in a separate network). Further, passing to ranks is not very susceptible to outliers, as we will see in later chapters. 
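Both properties can be illustrated with a small sketch on vectors of edge weights. The helper `rank_scale` and the three weight vectors are hypothetical; the helper ranks the weights with `scipy.stats.rankdata` and divides by the number of weights, which is one simple way to scale ranks into $(0, 1]$ and is not necessarily the normalization `graspologic` uses:

```python
import numpy as np
from scipy.stats import rankdata

def rank_scale(weights):
    """Illustrative helper: rank the weights, then scale into (0, 1]."""
    weights = np.asarray(weights, dtype=float)
    return rankdata(weights) / weights.size

edges_a = np.array([1.0, 2.0, 3.0, 4.0])        # one network's non-zero weights
edges_b = 1000 * edges_a                        # same ordering, different units
edges_outlier = np.array([1.0, 2.0, 3.0, 4e6])  # one extreme weight

ranked_a = rank_scale(edges_a)
ranked_b = rank_scale(edges_b)
ranked_outlier = rank_scale(edges_outlier)

print(ranked_a)
print(ranked_b)
print(ranked_outlier)
```

All three vectors map to the same ranked values, because only the ordering of the weights matters: rescaling a network's weights or inflating a single edge does not change the result.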

Again, we will turn the edge-weight transformation step into its own class, much like we did above:

In [None]:
class FeatureScaler(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        # nothing to learn from the data; return self so calls can be chained
        return self
    
    def transform(self, X):
        A = X
        A_scaled = gp.utils.pass_to_ranks(A)
        return A_scaled
    
feature_scaler = FeatureScaler()
A_cleaned_scaled = feature_scaler.transform(A_clean)

### Transformation pipelines

As you can see, there are a number of data transformations that need to be executed to prepare network data for machine learning algorithms. One thing that might be desirable is to develop a pipeline which automates the data preparation process for you. We will perform this using the `Pipeline` class from `sklearn`. The `Pipeline` class can help us apply sequences of transformations. Here is a simple pipeline for doing all of the steps we have performed so far:

In [None]:
from sklearn.pipeline import Pipeline

num_pipeline = Pipeline([
    ('cleaner', CleanData()),
    ('scaler', FeatureScaler()),
])

xfm_dat = num_pipeline.fit_transform(A)

The `Pipeline` class takes a list of name/estimator pairs defining a sequence of steps. All but the last estimator must be transformers, which implement the `fit_transform()` method. In our case, this is handled directly by the `TransformerMixin` base class.

When you call the `fit_transform()` method of the pipeline, it calls the `fit_transform()` method on each of the transformers in turn, and passes the output of each call as the input to the next call, returning the output of the final step. If you instead call the pipeline's `fit()` method, the same sequence runs until it reaches the final estimator, for which it just calls the `fit()` method.
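The chaining behavior can be checked on a toy pipeline. In this sketch the `AddOne` and `Double` transformers are hypothetical and exist only to make each step's effect easy to follow; the point is that the pipeline's output matches what you get by chaining the `fit_transform()` calls by hand:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class AddOne(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: add 1 to every entry."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.asarray(X) + 1

class Double(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: multiply every entry by 2."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.asarray(X) * 2

X_toy = np.array([[0.0, 1.0], [2.0, 3.0]])

toy_pipeline = Pipeline([
    ("add_one", AddOne()),
    ("double", Double()),
])
piped = toy_pipeline.fit_transform(X_toy)

# The same computation, chained by hand in the same order.
manual = Double().fit_transform(AddOne().fit_transform(X_toy))

print(piped)
print(np.array_equal(piped, manual))
```

The order of the steps matters: adding one and then doubling gives a different result from doubling and then adding one, just as cleaning before scaling differs from scaling before cleaning.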

Next, we'll see the real handiness of the `Pipeline` class. The reason we went to such lengths to define a pipeline is that we wanted an easily reproducible procedure that we could efficiently apply to new datasets. We'll see how this works using two human connectomes: the one we have been studying so far, and an additional subject's data, which we download and transform below:

In [None]:
A_xfm_dat1 = num_pipeline.fit_transform(A)

DWI_URL2 = os.path.join(AWSND_ROOT, 
                       "sub-0025865/ses-1/connectomes/AAL_space-MNI152NLin6_res-2x2x2/")
DWI_NAME2 = "sub-0025865_ses-1_dwi_AAL_space-MNI152NLin6_res-2x2x2_connectome.csv"
local_path2 = fetch_dwi_data(dwi_url=DWI_URL2, dwi_name=DWI_NAME2)
A_sub2 = gp.utils.import_edgelist(local_path2)
A_xfm_dat2 = num_pipeline.fit_transform(A_sub2)

Next, we visualize the two connectomes, after transformation:

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(18, 6))
heatmap(A_xfm_dat1, title="Connectome 1, Preprocessed", ax=axs[0], vmin=0, vmax=1)
heatmap(A_xfm_dat2, title="Connectome 2, Preprocessed", ax=axs[1], vmin=0, vmax=1);