# Non-Iterative Subgraph Screening

This demo shows how to use the Non-Iterative Screening class. It first generates simulated data, then computes a correlation value for each node, and finally estimates the signal subgraph.

In [None]:
import numpy as np
np.random.seed(8889)
from graspy.simulations import sbm
from graspy.plot import heatmap

%matplotlib inline

Subgraph screening uses the rows of each adjacency matrix as feature vectors for the nodes and measures the correlation of those feature vectors with a covariate of interest. Non-iterative screening is the simplest form of subgraph screening.
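
As a minimal sketch of what "rows as feature vectors" means, assume the graphs are stacked in a NumPy array of shape `(m, n, n)`. The names `graphs` and `node_features` are hypothetical and exist only for this illustration; they are not part of graspy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stack of m = 5 graphs on n = 4 nodes (binary adjacency matrices)
m, n = 5, 4
graphs = rng.integers(0, 2, size=(m, n, n))


def node_features(graphs, u):
    """Return row u of every adjacency matrix, one row per graph."""
    return graphs[:, u, :]


features_u = node_features(graphs, u=2)
print(features_u.shape)  # (5, 4): one length-n feature vector per graph
```

Each node therefore contributes an `m` by `n` feature matrix, which is paired with the `m` labels when the correlation for that node is computed.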

Suppose there are $m$ graphs and therefore $m$ labels. The correlation value that screening assigns to node $u$ is

\begin{align*}
c_{u} = MGC\left(\left\{(\hat{X_{i}}[u,\cdot], Y_{i})\right\}^{m}_{i = 1}\right)
\end{align*}

for covariates of interest $\left\{ Y_{i} \in \mathbb{R}, i = 1, ..., m\right\}$ and feature vectors $\hat{X_{i}}[u,\cdot] = A_{i}[u,\cdot]$, where $A_{i}[u,\cdot]$ is row $u$ of the $i$-th adjacency matrix.

For non-iterative screening, the correlation values for all nodes are returned in an array. In addition, every node whose correlation value is higher than a set threshold $c$ is returned. These nodes form the estimated signal subgraph $\hat{S} = \left\{u \in V|c_{u} > c\right\}$, where $V = [n]$ for adjacency matrices in $\mathbb{R}^{n \ \times \ n}$.
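
The thresholding step can be sketched in a few lines of NumPy. The correlation values below are made up, and `threshold_nodes` is a hypothetical helper written for this illustration, not the library's own routine.

```python
import numpy as np


def threshold_nodes(corrs, c):
    """Return the indices u with corrs[u] > c (strict inequality)."""
    corrs = np.asarray(corrs, dtype=float)
    return np.flatnonzero(corrs > c)


# Made-up correlation values for n = 6 nodes
corrs = np.array([0.30, 0.02, 0.25, 0.00, 0.05, 0.40])
s_hat = threshold_nodes(corrs, c=0.1)
print(s_hat)  # [0 2 5]
```

Because the inequality is strict, a node whose correlation equals the threshold exactly is not selected.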

## Generate Mock Data

The first step is to write a function that generates simulation data. Here the data is a tensor of stochastic block models (SBMs), each belonging to one of a given number of classes. The SBMs are placed at random indices throughout the tensor and represent the adjacency matrices of the graphs. The function also generates an array of labels that identifies the class of each adjacency matrix in the data tensor.

The function takes the total number of graphs to generate, the dimension of each adjacency matrix, a vector with the number of nodes in each community, a tensor of probability matrices (one per class), and a vector with the fraction of the total number of graphs that belongs to each class.
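
The way the fraction vector turns into per-class graph counts can be sketched on its own. The helper `class_counts` is hypothetical; it mirrors the `int(num_graphs * a)` truncation used in the function below.

```python
def class_counts(num_graphs, percent_vec):
    """Graphs per class, truncating toward zero as int() does."""
    return [int(num_graphs * p) for p in percent_vec]


even = class_counts(100, [0.50, 0.50])
uneven = class_counts(10, [0.25, 0.75])
print(even, sum(even))      # [50, 50] 100
print(uneven, sum(uneven))  # [2, 7] 9
```

When the counts do not sum to the requested total, the remaining slots of the data tensor stay all-zero and keep label 0, so it is safest to choose fractions that divide the number of graphs evenly.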

In [None]:
import random
random.seed(8889)  # np.random.seed does not seed the standard-library random module
def data_generator(num_graphs, N, n, prob_tensor, percent_vec):
    
    # Getting the number of classes
    num_types = len(percent_vec)

    # Getting vector with the number of graphs in each class
    num = [int(num_graphs * a) for a in percent_vec]

    data = np.zeros((num_graphs, N, N))
    y_label = np.zeros((num_graphs, 1))

    # Creates vector of random indices to randomly distribute graphs in tensor
    L_ind = random.sample(range(0, num_graphs), num_graphs)

    # Loop for creating the returns
    for i in range(num_types):

        # Create tensor that will contain all of the graphs of one type
        types = np.zeros((num[i], N, N))

        # Put all the graphs of one type into "types" tensor
        for j in range(len(types)):
            types[j] = sbm(n=n, p=prob_tensor[i])

        # Assigns all of the graphs in "types" to random indices in data
        data[L_ind[: num[i]]] = types

        # Create corresponding labels
        y_label[L_ind[: num[i]]] = int(i)

        # Get rid of used indices
        L_ind = L_ind[num[i] :]

    return data, y_label

The cell below uses the data_generator function to generate a data set of 100 adjacency matrices, each 200 by 200. Each adjacency matrix has two communities, one made up of 20 nodes and the other made up of 180 nodes. The community of 20 nodes is the signal subgraph because it is the only block whose edge probability differs between the classes (0.3 in class 0 and 0.4 in class 1).

In [None]:
# Tensor of probability matrices, one per sbm class
prob_tensor = np.zeros((2, 2, 2))
prob_tensor[0] = [[0.3, 0.2], [0.2, 0.3]]
prob_tensor[1] = [[0.4, 0.2], [0.2, 0.3]]

n = [20, 180]
percent_vec = np.asarray([0.50, 0.50])

data_samp, y_label_samp = data_generator(100, 200, n, prob_tensor, percent_vec)

## Usage

Now that data has been generated, it is possible to use the Non-Iterative Screening class. The fit function will generate the array of correlation values as a class attribute and the fit_transform function will output the signal subgraph.

In [None]:
from graspy.subgraph import NonItScreen

# Screening with Multiscale Graph Correlation (MGC)
screen = NonItScreen("mgc", 0.001)

# Correlations
screen.fit(data_samp, y_label_samp)

# Estimated signal subgraph (S_hat is needed by the plotting cell below)
S_hat = screen.fit_transform(data_samp, y_label_samp)

Below, the first matrix in the data set data_samp is presented as a heatmap, followed by a heatmap of the estimated signal subgraph for the same adjacency matrix. Lastly, since the indices of the nodes that make up the signal subgraph become a class attribute after the fit_transform function is called, these indices are printed as well.
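
For intuition, the relationship between a set of node indices and a subgraph can be sketched with plain NumPy. This assumes the subgraph is the submatrix induced by the selected nodes; whether the screening class builds its output exactly this way is an assumption, and `induced_subgraph` is a hypothetical helper.

```python
import numpy as np


def induced_subgraph(adj, verts):
    """Keep only the rows and columns of adj indexed by verts."""
    verts = np.asarray(verts)
    return adj[np.ix_(verts, verts)]


# Toy 4 x 4 matrix with distinct entries so the selection is easy to follow
adj = np.arange(16).reshape(4, 4)
sub = induced_subgraph(adj, [0, 2])
print(sub)
# [[ 0  2]
#  [ 8 10]]
```
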

In [None]:
heatmap(data_samp[0], title='Entire Graph for One Data Entry')
heatmap(S_hat[0], title='Estimated Signal Subgraph for Same Entry')
print("Estimation for the Signal Subgraph Nodes:")
print(screen.subgraph_verts)