# Introduction and Background

This notebook steps through importing and partitioning the dataset. The other notebooks perform the same steps with a utility function found in `../utilities/general.py`.

In [None]:
#!/usr/bin/env python3
import sys

# Maths things
import numpy as np

# Atomistic structure manipulation
from ase.io import read, write

sys.path.append('../')

# Librascal
from rascal.representations import SphericalInvariants as SOAP

# scikit-cosmo
from skcosmo.preprocessing import KernelNormalizer
from skcosmo.preprocessing import StandardFlexibleScaler
from skcosmo.feature_selection import FPS

# Local Utilities for Notebook
from utilities.general import load_variables
from utilities.kernels import linear_kernel, gaussian_kernel

# Import data

Structures are read with the ASE I/O library from an extended XYZ file, which also contains information on the properties of the structures or the atoms. In this case, we read properties that contain the local chemical shieldings as computed by GIPAW-DFT.

In [None]:
N=10
input_file="../datasets/CSD-1000R.xyz"
properties = ["CS_local", "CS_total"]
    
# Read the first N frames of CSD-1000R
frames = read(input_file, index=':{}'.format(N))

# Wrap atoms to unit cell
for frame in frames:
    frame.wrap()

# Extract chemical shifts
Y = np.vstack([np.concatenate([frame.arrays[property] for frame in frames]) for property in properties]).T

Within the {{ N }} frames we have {{ len(Y) }} environments.
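
As a minimal sketch of how the property matrix is assembled, the example below replaces the ASE frames with plain dictionaries (the toy values are made up; only the property names come from this notebook). Each property is concatenated over all frames, the properties are stacked as rows, and the transpose gives one row per atomic environment.

```python
import numpy as np

# Toy stand-ins for frame.arrays: two "frames" with 3 and 2 atoms.
# The property names match the notebook; the values are invented.
toy_frames = [
    {"CS_local": np.array([1.0, 2.0, 3.0]), "CS_total": np.array([10.0, 20.0, 30.0])},
    {"CS_local": np.array([4.0, 5.0]), "CS_total": np.array([40.0, 50.0])},
]
properties = ["CS_local", "CS_total"]

# vstack gives one row per property; the transpose gives one row per environment.
Y_toy = np.vstack(
    [np.concatenate([f[p] for f in toy_frames]) for p in properties]
).T

print(Y_toy.shape)  # (5, 2): five environments, two properties
```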

# Compute SOAP Vectors
We use the SOAP power spectrum vectors as atomic descriptors for the structures [(Bartók, 2013)](https://doi.org/10.1103/PhysRevB.87.184115).
Understanding SOAP vectors is not necessary for this tutorial, although they are crucial for correlating chemical environments and materials properties. For now, consider the power spectrum SOAP vectors as a three-body correlation function which includes information on each atom, its relationships with neighboring atoms, and the relationships between pairs of neighbors. The correlation function is expanded on a dense basis, and the feature vector contains more information than is necessary for these tutorials, so we use [farthest point sampling](https://en.wikipedia.org/wiki/Farthest-first_traversal) to retain only 200 components of the SOAP vectors while preserving much of their diversity.
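
To make the idea of farthest point sampling concrete, here is a minimal NumPy sketch of a greedy selection over the columns of a feature matrix. It is an illustration under simple assumptions (Euclidean distance between columns, a fixed starting column), not the scikit-cosmo implementation; the function name `fps_columns` and the demo data are hypothetical.

```python
import numpy as np

def fps_columns(X, n_select, start=0):
    """Greedy farthest point sampling over the columns of X (illustrative sketch)."""
    cols = X.T  # treat each feature (column) as a point
    selected = [start]
    # Squared distance from every column to its closest selected column
    min_d2 = np.sum((cols - cols[start]) ** 2, axis=1)
    for _ in range(1, n_select):
        nxt = int(np.argmax(min_d2))  # column farthest from the current selection
        selected.append(nxt)
        d2 = np.sum((cols - cols[nxt]) ** 2, axis=1)
        min_d2 = np.minimum(min_d2, d2)
    return np.array(selected)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 20))  # 50 environments, 20 features
idx = fps_columns(X_demo, n_select=5)
X_small = X_demo[:, idx]            # keep only the selected features
```

Each iteration picks the column that is farthest from everything already chosen, which is why the retained features tend to be diverse rather than redundant.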

SOAP vectors are computed with the librascal package [(librascal GitHub)](https://github.com/cosmo-epfl/librascal). If you do not want to (or cannot) install librascal, you can download a precomputed data file, `precomputed.npz`, and store it in the `datasets/` folder, as discussed in the [foreword](0_Foreword.ipynb) to this tutorial. You should then be able to run all tutorials without having to install librascal.

In [None]:
# Compute SOAPs (from librascal tutorial)
soap = SOAP(soap_type='PowerSpectrum',
           interaction_cutoff=3.5,
           max_radial=6,
           max_angular=6,
           gaussian_sigma_type='Constant',
           gaussian_sigma_constant=0.4,
           cutoff_smooth_width=0.5)

soap_rep = soap.transform(frames)
X_raw = soap_rep.get_features(soap)

num_features = X_raw.shape[1]

Here we prepare a file that can be used to initialize all local variables without having to read the raw data file or use `librascal` to compute the features.
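
As a rough sketch of how such a file behaves, the example below writes a small `.npz` archive with the same keys and reads it back with `np.load`. The arrays and the file name `demo_precomputed.npz` are stand-ins chosen for illustration, and the file is written to a temporary directory rather than to `datasets/`.

```python
import os
import tempfile
import numpy as np

# Small stand-in arrays; shapes and values are illustrative only.
X_demo = np.arange(12, dtype=float).reshape(4, 3)
Y_demo = np.arange(8, dtype=float).reshape(4, 2)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_precomputed.npz")  # hypothetical file name
    np.savez(path, n_atoms=X_demo.shape[0], indices=range(2), X=X_demo, Y=Y_demo)

    # np.load returns a dict-like archive; read what is needed before closing it
    with np.load(path) as data:
        keys = sorted(data.files)
        n_atoms = int(data["n_atoms"])  # scalars are stored as 0-d arrays
        X_loaded = data["X"]
        Y_loaded = data["Y"]
```

Note that every value passed to `np.savez` is converted to an array, so scalars and `range` objects come back as NumPy arrays.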

In [None]:
np.savez("../datasets/precomputed.npz", n_atoms=X_raw.shape[0], 
         indices=range(N), X=X_raw, Y=Y)

Each SOAP vector contains {{num_features}} components. We use farthest point sampling to select a subset of these components.

In [None]:
# FPS the components
n_FPS=200
col_idxs = FPS(n_to_select=n_FPS).fit(X_raw).selected_idx_
X = X_raw[:, col_idxs]

# Prepare data

## Splitting into Testing and Training
The data is split into training and testing sets, and normalized based on the training set.
This makes it easier to compare the performance of PCA and linear regression based on the intrinsic
variability, and makes the whole analysis dimensionless.

In [None]:
from sklearn.model_selection import train_test_split
# Splits in train and test sets
n_train = int(len(Y)/2)
n_test = len(Y)-n_train
r_train = np.arange(len(Y))
i_train, i_test = train_test_split(r_train, train_size=n_train, shuffle=True)

X_train = X[i_train]
Y_train = Y[i_train]
X_test = X[i_test]
Y_test = Y[i_test]

print(f'Shape of training data is: {X_train.shape}, ||X|| = {np.linalg.norm(X_train)}.')

## Centering and Normalizing Data
To simplify the algebra in what follows, and to treat features and properties on the same footing, we center and normalize the data. In other words, we calculate the mean and standard deviation of the two training arrays (`X_train` and `Y_train`) and normalize all other matrices using these values.
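
The following plain NumPy sketch illustrates the idea on random stand-in data. It assumes that a column-wise scaler divides each column by its own standard deviation, while a non-column-wise scaler divides everything by a single global scale (taken here as the square root of the summed column variances). That second point is an assumption about the scaler's behavior, not a copy of its code.

```python
import numpy as np

rng = np.random.default_rng(0)
X_tr = rng.normal(loc=3.0, scale=2.0, size=(100, 4))  # stand-in training features
X_te = rng.normal(loc=3.0, scale=2.0, size=(40, 4))   # stand-in testing features

# Statistics come from the training set only
mean = X_tr.mean(axis=0)
var = X_tr.var(axis=0)

# Column-wise: every column ends up with unit variance
scale_cols = np.sqrt(var)
# Global (assumed behavior of a non-column-wise scaler): one scalar for all
# columns, so the relative magnitudes of the features are preserved
scale_global = np.sqrt(var.sum())

X_tr_cols = (X_tr - mean) / scale_cols
X_tr_glob = (X_tr - mean) / scale_global
X_te_glob = (X_te - mean) / scale_global  # test data reuses the training statistics
```

With the global scale, the total variance of the training features sums to one, which is what makes the later analysis dimensionless.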

In [None]:
x_scaler = StandardFlexibleScaler(column_wise=False).fit(X_train)
y_scaler = StandardFlexibleScaler(column_wise=True).fit(Y_train)

# Center total dataset
X = x_scaler.transform(X)
Y = y_scaler.transform(Y)

# Center training data
X_train = x_scaler.transform(X_train)
Y_train = y_scaler.transform(Y_train)

# Center testing data
X_test = x_scaler.transform(X_test)
Y_test = y_scaler.transform(Y_test)

# Generating Kernels
In later notebooks ([Kernel Methods](3_KernelMethods.ipynb) and [Sparse Kernel Methods](4_SparseKernelMethods.ipynb)) we use kernels rather than the raw features. They can be computed as follows, using the utility functions with their default parameters.
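
For orientation, here is a minimal NumPy sketch of a Gaussian kernel and of kernel centering based on training statistics. The helper names, the `gamma` parameter, and its value are hypothetical and need not match the defaults of the notebook's utility functions; any additional scaling that a kernel normalizer may apply is left out.

```python
import numpy as np

def gaussian_kernel_sketch(XA, XB, gamma=1.0):
    """exp(-gamma * ||a - b||^2) for every pair of rows (gamma is a hypothetical default)."""
    d2 = (XA**2).sum(axis=1)[:, None] + (XB**2).sum(axis=1)[None, :] - 2.0 * XA @ XB.T
    return np.exp(-gamma * np.clip(d2, 0.0, None))  # clip guards against round-off

def center_kernel(K, K_fit):
    """Center K (n x n_train) in feature space using means of the training kernel K_fit."""
    col_means = K_fit.mean(axis=0)[None, :]  # mean over training samples, per column
    row_means = K.mean(axis=1)[:, None]      # mean of each row of K
    return K - col_means - row_means + K_fit.mean()

rng = np.random.default_rng(0)
X_tr = rng.normal(size=(30, 5))  # stand-in training features
X_te = rng.normal(size=(10, 5))  # stand-in testing features

K_tr = gaussian_kernel_sketch(X_tr, X_tr, gamma=0.1)
K_te = gaussian_kernel_sketch(X_te, X_tr, gamma=0.1)

K_tr_c = center_kernel(K_tr, K_tr)
K_te_c = center_kernel(K_te, K_tr)  # the test kernel reuses training statistics
```

The test kernel is always built between the test points (rows) and the training points (columns), and it is centered with quantities derived from the training kernel, mirroring how the feature scalers are fitted on the training set only.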

In [None]:
K_train = gaussian_kernel(X_train, X_train)
K_test = gaussian_kernel(X_test, X_train)

k_scaler = KernelNormalizer().fit(K_train)

K_train = k_scaler.transform(K_train)
K_test = k_scaler.transform(K_test)

# Loading data with the Utility Function

The data preparation protocol explained in this notebook can be automated using the utility function `load_variables`, found in `../utilities/general.py`. This call is used in all the example notebooks, and sets all of the variables locally.

In [None]:
var_dict = load_variables()
locals().update(var_dict)