# PyGraSPI Introduction

This notebook introduces the PyGraSPI API. PyGraSPI computes a set of features, or descriptors, from sample microstructures. It is an alternative to using 2-point statistics in homogenization workflows for materials science AI applications. PyGraSPI currently provides a function, `make_descriptors`, that takes a set of microstructures and returns a set of descriptors in a Pandas dataframe.

PyGraSPI returns two main categories of descriptors. The first is based on the graph network generated from the microstructure, where the graph nodes are colored according to the material phase. This method provides descriptors such as vertex count, tortuosity and connected components. The second, based on the skeleton of the graph, provides features concerned with the internal cycles and intersections in the graph.
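To give a feel for what a graph-based descriptor such as "connected components" measures, the following minimal sketch counts the connected regions of each phase on a tiny two-phase grid. It is only an illustration and not PyGraSPI's implementation: the helper `count_components`, the toy grid, and the choice of 4-connectivity (pixels sharing an edge) are all assumptions made here.

```python
from collections import deque


def count_components(grid):
    """Count 4-connected components for each phase in a 2D grid of phase labels."""
    n_rows, n_cols = len(grid), len(grid[0])
    seen = [[False] * n_cols for _ in range(n_rows)]
    counts = {}
    for row in range(n_rows):
        for col in range(n_cols):
            if seen[row][col]:
                continue
            # Unvisited pixel: it starts a new component of its phase
            phase = grid[row][col]
            counts[phase] = counts.get(phase, 0) + 1
            seen[row][col] = True
            queue = deque([(row, col)])
            # Breadth-first search over same-phase neighbours
            while queue:
                i, j = queue.popleft()
                for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                    if (
                        0 <= ni < n_rows
                        and 0 <= nj < n_cols
                        and not seen[ni][nj]
                        and grid[ni][nj] == phase
                    ):
                        seen[ni][nj] = True
                        queue.append((ni, nj))
    return counts


# Toy two-phase microstructure (0 and 1 are the phase labels)
toy = [
    [0, 0, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
component_counts = count_components(toy)
print(component_counts)  # {0: 2, 1: 1}
```

Phase 0 forms two separate regions here because the two pixels on the right edge are cut off from the rest by phase 1.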

In [1]:
"""PyGraSPI Intro
"""

import zipfile

import dask.array as da
import numpy as np
import pandas
from pymks import solve_cahn_hilliard
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from toolz.curried import curry, pipe

from pygraspi.combined_descriptors import (_make_skeletal_descriptors,
                                           make_descriptors)

## The Data

The data used here consists of 573 artificially generated microstructures from Cahn-Hilliard simulations on grids of shape 401x101 (see [Jivani et al.](https://doi.org/10.1016/j.commatsci.2021.110409) for more details). Each data sample is generated from a random initial condition, with different initial volume fractions and interaction parameters, and run for varying durations. Unzipping `data/cahn-hilliard.zip` yields files named `data_X.XXX_Y.Y_NNNNNNN.txt`, where `X.XXX` is the volume fraction, `Y.Y` is the interaction parameter and `NNNNNNN` is the number of time steps reached for that particular sample. Note that files with matching volume fractions and interaction parameters come from the same simulation (just with varying duration).
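The naming scheme described above can be decoded with a small helper. This is a sketch only: `parse_file_name` is a hypothetical function that is not part of PyGraSPI, the regular expression assumes the three fields are plain decimal numbers, and the file name in the example is made up for illustration.

```python
import re

# Pattern for names of the form data_X.XXX_Y.Y_NNNNNNN.txt
NAME_PATTERN = re.compile(r"data_(\d+\.\d+)_(\d+\.\d+)_(\d+)\.txt$")


def parse_file_name(file_name):
    """Return (volume_fraction, interaction_parameter, n_steps) from a file name."""
    match = NAME_PATTERN.search(file_name)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name}")
    volume_fraction, interaction, n_steps = match.groups()
    return float(volume_fraction), float(interaction), int(n_steps)


# Hypothetical file name, made up for illustration only
parsed = parse_file_name("data_0.500_2.4_0001000.txt")
print(parsed)  # (0.5, 2.4, 1000)
```

Grouping samples by the first two values would collect the snapshots that belong to the same simulation.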

In [2]:
@curry
def read_data(zip_stream_, file_name):
    """Read a single space-delimited text file from the zip archive"""
    # Transpose so that each sample has shape (401, 101)
    return np.array(
        pandas.read_csv(
            zip_stream_.open(file_name, "r"), delimiter=" ", header=None
        ).transpose()
    )


with zipfile.ZipFile("data/cahn-hilliard.zip", "r") as zip_stream:
    data = np.array(
        list(
            # pylint: disable=no-value-for-parameter
            map(read_data(zip_stream), zip_stream.namelist()[:3])
        )
    )

    print(data.shape)

(3, 401, 101)


Only 3 samples are used here because the current implementation relies on NetworkX, which is extremely slow for this category of calculation. The new implementation will use Graph-tool, which is considerably more efficient for these calculations.

In [3]:
make_descriptors(data)

KeyboardInterrupt: 

The following demonstrates how to use the graph descriptors to classify microstructures. 

Here, two categories of microstructure are generated, each with 96 samples, using a Cahn-Hilliard simulation. The two categories differ in the duration of evolution (10 steps versus 100 steps). This is not a particularly useful machine learning example, but it suffices to demonstrate using the graph descriptors alongside Scikit-learn.

The `generate_data` function uses the PyMKS function `solve_cahn_hilliard` to generate the data.

In [None]:
def generate_data(n_category, n_chunks, n_domain, seed=99):
    """Generate the Cahn-Hilliard data"""
    da.random.seed(seed)
    solve_ch = curry(solve_cahn_hilliard)(delta_t=1.0, delta_x=0.5)
    x_data_ = pipe(
        da.random.random(
            (n_category * 2, n_domain, n_domain), chunks=(n_chunks, n_domain, n_domain)
        ),
        lambda x: 2 * x - 1,
        lambda x: [
            solve_ch(x[:n_category], n_steps=10),
            solve_ch(x[n_category:], n_steps=100),
        ],
        da.concatenate,
        lambda x: da.where(x > 0, 1, 0).persist(),
    )
    y_data_ = da.from_array(
        np.concatenate([np.zeros(n_category), np.ones(n_category)]).astype(int),
        chunks=(n_chunks,),
    )
    return np.array(x_data_), np.array(y_data_)

Below, `n_category` is the number of samples per category, `n_chunks` is the number of samples per chunk of data in the Dask array, and `n_domain` is the number of pixels along an edge of the domain.

In [None]:
x_data, y_data = generate_data(n_category=96, n_chunks=24, n_domain=101)
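With the arguments used above, the random initial array holds `2 * n_category` samples, split along the first axis into blocks of `n_chunks` samples. The short sketch below works out that layout with plain arithmetic. The helper `chunk_sizes` is hypothetical and only mimics how a fixed chunk size divides an axis; it does not call Dask.

```python
def chunk_sizes(n_samples, n_chunks):
    """Split n_samples into consecutive blocks of at most n_chunks samples."""
    full, remainder = divmod(n_samples, n_chunks)
    return [n_chunks] * full + ([remainder] if remainder else [])


n_category, n_chunks, n_domain = 96, 24, 101
n_samples = 2 * n_category  # two categories stacked along the first axis
sizes = chunk_sizes(n_samples, n_chunks)

print((n_samples, n_domain, n_domain))  # (192, 101, 101)
print(len(sizes), sizes)  # 8 [24, 24, 24, 24, 24, 24, 24, 24]
```

Each block covers the full `n_domain` by `n_domain` grid, so a single microstructure is never split across chunks.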

The following generates the graph descriptors from the raw microstructures. Note that only `_make_skeletal_descriptors` is used, as the graph descriptors are inefficient in the current version of PyGraSPI.

In [None]:
# replace this with make_descriptors when switched over to use Graph-tool
x_graph = _make_skeletal_descriptors(x_data)

The redundant, constant-value features need to be removed; otherwise, `LogisticRegression` fails to execute.

In [None]:
mask = ~x_graph.eq(x_graph.iloc[0]).all()
x_graph_clean = x_graph.loc[:, mask]
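The masking idiom above can be checked on a small table. In this sketch the dataframe and its column names are made up for illustration and are not PyGraSPI output; only the `eq`, `iloc`, `all` and `loc` calls mirror the cell above.

```python
import pandas

# Toy descriptor table; the column names are hypothetical
toy = pandas.DataFrame(
    {
        "feature_a": [3, 5, 4],
        "feature_b": [7, 7, 7],  # constant, so it carries no information
        "feature_c": [0.1, 0.1, 0.4],
    }
)

# A column is constant when every row equals the first row
is_constant = toy.eq(toy.iloc[0]).all()
toy_clean = toy.loc[:, ~is_constant]

print(list(toy_clean.columns))  # ['feature_a', 'feature_c']
```

One caveat of this approach: `NaN` never compares equal to itself, so a column consisting entirely of `NaN` would not be flagged as constant.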

Split the data into training and test sets.

In [None]:
x_train_graph, x_test_graph, y_train, y_test = train_test_split(
    np.array(x_graph_clean), y_data, test_size=0.2, random_state=99
)

The graph descriptors must be scaled before fitting the logistic regression. Note that the scaler is fit using only the training data (not all the data).

In [None]:
scaler = MinMaxScaler()
x_train_scaled = scaler.fit_transform(x_train_graph)
x_test_scaled = scaler.transform(x_test_graph)
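To make the fit/transform distinction concrete, here is a minimal pure-Python sketch of min-max scaling for a single feature. The helpers `fit_min_max` and `apply_min_max` and the numbers are made up for illustration; the formula `(x - min) / (max - min)` is the standard min-max rescaling to the range 0 to 1.

```python
def fit_min_max(values):
    """Record the minimum and maximum of a training feature."""
    return min(values), max(values)


def apply_min_max(values, low, high):
    """Map values so that low -> 0 and high -> 1."""
    return [(value - low) / (high - low) for value in values]


train_feature = [2.0, 4.0, 6.0]
test_feature = [3.0, 8.0]

low, high = fit_min_max(train_feature)  # statistics come from training data only
train_scaled = apply_min_max(train_feature, low, high)
test_scaled = apply_min_max(test_feature, low, high)

print(train_scaled)  # [0.0, 0.5, 1.0]
print(test_scaled)  # [0.25, 1.5]
```

The test value 8.0 maps to 1.5 because it lies outside the training range. That is expected: the test data must not influence the scaling parameters. Note that this sketch does not guard against a constant feature, for which `high - low` is zero.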

Train the regression.

In [None]:
# NBVAL_IGNORE_OUTPUT
model = LogisticRegression().fit(x_train_scaled, y_train)

In [None]:
y_predict = model.predict(x_test_scaled)

This is a very easy classification problem and so the predictions are perfect.

In [None]:
confusion_matrix(y_test, y_predict)
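As a reading aid for the matrix returned above, the sketch below builds a confusion matrix by hand using the same convention as Scikit-learn: entry `[i][j]` counts samples whose true class is `i` and whose predicted class is `j`. The helper `confusion_counts` and the labels are made up for illustration and are not results from this notebook.

```python
def confusion_counts(y_true, y_pred, n_classes=2):
    """Build a matrix whose entry [i][j] counts true class i predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


# Made-up labels for illustration only
toy_true = [0, 0, 0, 1, 1, 1]
toy_pred = [0, 0, 1, 1, 1, 1]
toy_matrix = confusion_counts(toy_true, toy_pred)

print(toy_matrix)  # [[2, 1], [0, 3]]
```

Perfect predictions put every count on the diagonal, so all off-diagonal entries are zero.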