### Clustering on the SOAP Descriptors for Atom Environment Identification

In this section, we cluster the SOAP descriptors to identify atomic environments in a given Nb structure. The workflow has four steps: training the MEAGraph autoencoder, running inference on a specific structure, saving the cluster labels for visualization, and comparing the results with other clustering methods. The Nb datasets are obtained from Sun, H. et al., Computational Materials Science, 230, p. 112497, 2023.

1. **MEAGraph Autoencoder Training:**
- Train the MEAGraph autoencoder on the Nb dataset.
- The trained model will be used for clustering the SOAP feature matrix.

2. **Inference:**
- Specify the structure ID for investigation (check available structures in `Nb_CONTCARs`).
- Cluster the SOAP feature matrix using the loaded MEAGraph model (`results` folder) to identify atomic environments.

3. **Save Cluster Labels for Visualization:**
- Save the cluster labels obtained from MEAGraph as the artificial `charges` in the atoms object.
- Visualize the xyz file using OVITO and check the clusters by color coding.

4. **Compare with Other Clustering Methods:**
- Apply other clustering methods to the SOAP feature matrix for comparison, e.g., Affinity Propagation and K-means.
- Compare the MEAGraph clustering results with those of the alternative methods.

#### 1. GAE Training

- Specify the configuration settings in the YAML file located in the `configs` folder. For example, set device to either `cuda` or `cpu`, depending on your system.
- Convert the structures in the `Nb_CONTCARs` dataset to ASE atoms XYZ format and save the file as `train.xyz` in the `raw/raw_Nb_soap` folder.
- Run `python main.py --cfg configs/user-defined-config.yaml` in the notebook or in a terminal, or submit the job to a cluster if one is available.
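
The conversion step in the second bullet can be sketched as follows. This is a minimal sketch, assuming the input files are named `CONTCAR<N>` (as in the later cells) and are in VASP format. The helper names `contcar_index`, `sorted_contcars`, and `convert_to_xyz` are hypothetical, and writing the structures in numeric order is an assumption rather than a documented MEAGraph requirement.

```python
import re
from pathlib import Path


def contcar_index(path):
    """Return the integer suffix of a file named like CONTCAR85."""
    match = re.fullmatch(r"CONTCAR(\d+)", Path(path).name)
    if match is None:
        raise ValueError(f"unexpected file name: {path}")
    return int(match.group(1))


def sorted_contcars(folder):
    """List CONTCAR files in numeric order (CONTCAR2 before CONTCAR10)."""
    files = [p for p in Path(folder).iterdir()
             if re.fullmatch(r"CONTCAR\d+", p.name)]
    return sorted(files, key=contcar_index)


def convert_to_xyz(folder, out_file):
    """Read every CONTCAR with ASE and write them to one extended XYZ file."""
    import ase.io  # imported here so the helpers above work without ASE

    images = [ase.io.read(str(p), format="vasp") for p in sorted_contcars(folder)]
    Path(out_file).parent.mkdir(parents=True, exist_ok=True)
    ase.io.write(str(out_file), images, format="extxyz")
    return len(images)
```

A call such as `convert_to_xyz("Nb_CONTCARs", "raw/raw_Nb_soap/train.xyz")` would then produce the training file; adjust both paths to your directory layout.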

In [1]:
import os
import subprocess

# Directory that contains this notebook. Use os.getcwd() if the notebook is
# launched from that directory; otherwise set the absolute path explicitly.
# current_dir = os.getcwd()
current_dir = "/usr/workspace/sun36/MEAGraph/run/applications/AtomEnvIdentification"  # change it to your working directory

# Move two levels up, to the directory that contains main.py
main_dir = os.path.abspath(os.path.join(current_dir, "..", ".."))
yaml_file = 'Nb_soap'
os.chdir(main_dir)

# GAE training command. If it does not run from the notebook, run it in a
# terminal or submit it as a job to the cluster.
command = f'python main.py --cfg configs/{yaml_file}.yaml'
# Uncomment to run main.py from this notebook
# subprocess.run(command, shell=True)
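
If the training command fails silently when launched from the notebook, one option is to pass it as an argument list, set `cwd` instead of changing the notebook's working directory, and let `check=True` raise on a non-zero exit code. The following is a minimal sketch under those assumptions; `build_train_command` and `run_training` are hypothetical helpers, and only the `--cfg` flag shown above is used.

```python
import subprocess
import sys


def build_train_command(yaml_file):
    """Build the argument list for `python main.py --cfg configs/<yaml_file>.yaml`."""
    # sys.executable keeps the subprocess in the same Python environment as the notebook
    return [sys.executable, "main.py", "--cfg", f"configs/{yaml_file}.yaml"]


def run_training(main_dir, yaml_file):
    """Run main.py inside main_dir; raises CalledProcessError if training fails."""
    return subprocess.run(build_train_command(yaml_file), cwd=main_dir, check=True)
```

Calling `run_training(main_dir, yaml_file)` leaves the notebook's working directory unchanged, so no `os.chdir` back to `current_dir` is needed afterwards.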

#### 2. Inference 

  - Specify the structure ID for clustering (check the available structures in the `Nb_CONTCARs` folder).
  - Run `python inference.py`.

In [2]:
import yaml
import os
import subprocess

os.chdir(main_dir)

# Define the parameters
group_name_strs = '84'  # structure ID
group_name_list = [group_name_strs]
fixed_rate_l = 0.8  # lower rate for edge selection in the build_graph function
rate = 0.3  # user-defined pooling rate for edge_reduction in the encoder layers
train_val_ratio = 1.0  # ratio of training data to test data for the force-field fitting in inference.py
yaml_file = 'Nb_soap'

# Load the YAML configuration file
with open(f'{main_dir}/configs/{yaml_file}.yaml', "r") as file:
    config = yaml.safe_load(file)

# Name of the results subdirectory, taken from the test.xyz_dir entry
test_dir = config['test']['xyz_dir']

Run `inference.py` to obtain the clustering results for the selected structure. The script is skipped if the results file already exists.

In [3]:
result_dir=f"{main_dir}/results/{yaml_file}/{test_dir}"
jsonfile_path=f"{result_dir}/clusters_{group_name_strs}_r{rate}_train{train_val_ratio}.json"

if not os.path.exists(jsonfile_path):
    command = f'python inference.py --cfg configs/{yaml_file}.yaml --group_name {group_name_strs}  --rate {rate}  --train_val_ratio {train_val_ratio}  --device cpu'
    subprocess.run(command, shell=True)

# Change the working directory back to the original directory
os.chdir(current_dir)

testing loss is: 0.0013985219411551952
total number of atoms: 1280, number of isolated atoms: 0


Load the saved clustering results and merge them with the SOAP feature matrix into a single DataFrame.

In [4]:
import pandas as pd
import json
import torch

# Load the cluster assignments: one entry per atom, keyed by atom ID
with open(jsonfile_path, 'r') as json_file:
    node_info = json.load(json_file)
new_columns = {0: 'cluster_idx', 1: 'cluster_size'}
df_info = pd.DataFrame.from_dict(node_info, orient='index')
df_info.rename(columns=new_columns, inplace=True)
df_info.reset_index(inplace=True)
df_info.rename(columns={'index': 'Atom_ID'}, inplace=True)
df_info['Atom_ID'] = df_info['Atom_ID'].astype(int)

# Load the graph data; data.x is the SOAP feature matrix (one row per atom)
data = torch.load(f"{result_dir}/graph_{group_name_strs}_train{train_val_ratio}_data.pt")
df_feats = pd.DataFrame(data.x.detach().cpu().numpy())
df_feats.reset_index(inplace=True)
df_feats.rename(columns={'index': 'Atom_ID'}, inplace=True)

# Join features and cluster labels on the atom ID
df = pd.merge(df_feats, df_info, on='Atom_ID')
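
Before visualizing, it can help to sanity-check the loaded JSON. The sketch below assumes the layout implied by the cell above, `{"atom_id": [cluster_idx, cluster_size], ...}`, and further assumes that `cluster_size` counts the atoms sharing that `cluster_idx`; neither is confirmed by the MEAGraph documentation. `check_cluster_sizes` is a hypothetical helper.

```python
from collections import Counter


def check_cluster_sizes(node_info):
    """Compare the stored cluster_size of each atom with the actual member count."""
    # Assumed layout: {"atom_id": [cluster_idx, cluster_size], ...}
    counts = Counter(cluster_idx for cluster_idx, _ in node_info.values())
    mismatches = {
        atom_id: (cluster_idx, size, counts[cluster_idx])
        for atom_id, (cluster_idx, size) in node_info.items()
        if size != counts[cluster_idx]
    }
    return counts, mismatches
```

`counts` gives the number of atoms per cluster, and an empty `mismatches` dictionary indicates that the stored sizes agree with the counted ones.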



#### 3. Save Cluster Labels for Visualization
- Save the cluster labels obtained from MEAGraph as the artificial `charges` in the atoms object.
- Visualize the xyz file using OVITO and check the clusters by color coding.

In [5]:
import numpy as np
import ase.io

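# Use the MEAGraph cluster index of each atom as an artificial per-atom charge
charges = np.array(df['cluster_idx'])
# The CONTCAR file number is the structure ID plus one
config_id = int(group_name_strs) + 1
atom = ase.io.read('Nb_CONTCARs/CONTCAR' + str(config_id))

atom.arrays['charge'] = charges
ase.io.write(f'{config_id}_{rate}.xyz', atom, format='extxyz')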

For reference, store the DFT force magnitudes (from the `Nb_Forces` folder) as the `charges` label in a second xyz file.

In [6]:
config_id = int(group_name_strs) + 1
# Per-atom force magnitude: Euclidean norm of each row of force components
forces = np.loadtxt(f'Nb_Forces/force_Cfg{config_id}')
forces = np.linalg.norm(forces, axis=1)
atom = ase.io.read('Nb_CONTCARs/CONTCAR' + str(config_id))
atom.arrays['charge'] = forces
ase.io.write(f'{config_id}_force.xyz', atom, format='extxyz')
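
The `np.linalg.norm(forces, axis=1)` call reduces each row of the force array to its Euclidean norm. Assuming each row of the force file holds the three Cartesian components of the force on atom $i$, the value stored per atom is:

$$
|\mathbf{F}_i| = \sqrt{F_{i,x}^2 + F_{i,y}^2 + F_{i,z}^2}
$$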

#### 4. Compare with Other Clustering Methods

Apply other clustering methods to the SOAP feature matrix for comparison. The cell below runs Affinity Propagation; the other methods are left commented out:
 - Affinity Propagation
 - K-means
 - Spectral Clustering
 - DBSCAN
 - Mean Shift
 - Gaussian Mixture Models (GMM)

In [8]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.cluster import (AffinityPropagation, DBSCAN, KMeans,
                             MeanShift, SpectralClustering)
from sklearn.mixture import GaussianMixture
import ase.io

atom = ase.io.read('Nb_CONTCARs/CONTCAR' + str(config_id))

# Active method: Affinity Propagation with default parameters
clusters = AffinityPropagation().fit(data.x)
labels = clusters.labels_
out_file = f'{config_id}_affinity.xyz'

# Alternatives: uncomment one block to override `labels` and `out_file`.
# `numeric_df` and `X_embedded` are not created in the cells above and must
# be defined before the blocks that use them are run.

# Affinity Propagation with explicit damping
# damping = 0.5
# labels = AffinityPropagation(damping=damping).fit_predict(X_embedded)
# out_file = f'{config_id}_affinity_{damping}.xyz'

# Mean Shift
# bandwidth = 2
# labels = MeanShift(bandwidth=bandwidth).fit_predict(data.x)
# out_file = f'{config_id}_meanshift_b{bandwidth}.xyz'

# Spectral Clustering
# n_clusters = 6
# labels = SpectralClustering(n_clusters=n_clusters, affinity='nearest_neighbors', random_state=42).fit_predict(numeric_df)
# out_file = f'{config_id}_spectral_{n_clusters}.xyz'

# Gaussian Mixture Model
# labels = GaussianMixture(n_components=15, random_state=42).fit_predict(numeric_df)

# K-means
# n_clusters = 6
# labels = KMeans(n_clusters=n_clusters, random_state=42).fit_predict(numeric_df)
# out_file = f'{config_id}_kmeans_{n_clusters}.xyz'

# DBSCAN
# labels = DBSCAN().fit(data.x).labels_
# out_file = f'{config_id}_DBSCAN.xyz'

# Store the labels as artificial charges and write the xyz file for OVITO
charges = np.array(labels)
atom.arrays['charge'] = charges
ase.io.write(out_file, atom, format='extxyz')
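
Besides the visual comparison in OVITO, two labelings of the same atoms can be compared numerically. The following is a minimal sketch using a contingency table and a simple purity score; `contingency_table` and `purity` are hypothetical helpers, not part of MEAGraph, and purity is only one of several possible agreement measures (scikit-learn's `sklearn.metrics.adjusted_rand_score` is a common alternative).

```python
from collections import Counter


def contingency_table(labels_a, labels_b):
    """Count how many atoms receive each (label_a, label_b) pair."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same atoms")
    return Counter(zip(labels_a, labels_b))


def purity(labels_a, labels_b):
    """Fraction of atoms in the dominant labels_b class of their labels_a cluster."""
    if len(labels_a) == 0:
        raise ValueError("labelings must not be empty")
    table = contingency_table(labels_a, labels_b)
    best = {}
    for (a, _), n in table.items():
        best[a] = max(best.get(a, 0), n)
    return sum(best.values()) / len(labels_a)
```

For example, `purity(df['cluster_idx'].tolist(), clusters.labels_.tolist())` would report how consistently each MEAGraph cluster maps onto a single Affinity Propagation cluster, given that both label arrays follow the same atom order.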

