# Analyzing Training Set Data

This notebook shows how to analyze the HDF5 dataset generated by the `training` workflow in the `mbe-automation` program. It covers the following steps:

1.  **Inspecting the Dataset:** View the hierarchical structure of your HDF5 file to understand its contents.
2.  **Reading Trajectories:** Use functions from `mbe_automation.storage` to read and process trajectory data from both periodic and finite systems.
3.  **Data Subsampling:** Create smaller, representative subsets of your data using techniques such as farthest point sampling.
4.  **Visualization:** Visualize your molecular trajectories with `chemiscope` to explore the sampled structures.
5.  **Saving Data:** Save specific frames from your simulations into standard file formats such as XYZ and CIF for further analysis.

In [1]:
import mbe_automation

### Set Up the Dataset Path

Before running the cells below, define the path to your HDF5 dataset file in the following code cell. All subsequent cells use this path.

In [None]:
# Define the path to your HDF5 dataset file here
dataset_path = "path/to/your/training_set.hdf5"

# Glance inside your HDF5 dataset

The `tree` function displays the hierarchical structure of your dataset file.

In [3]:
mbe_automation.storage.tree(dataset_path)
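
For readers who want to inspect the same file without `mbe_automation`, a comparable listing can be sketched directly with `h5py`. This is only an illustration of walking an HDF5 hierarchy: `format_tree` is a hypothetical helper, not the implementation of `mbe_automation.storage.tree`, and the indentation and label format are assumptions.

```python
import h5py


def format_tree(path):
    """Return one indented line per group or dataset in an HDF5 file.

    Hypothetical helper for illustration; not part of mbe_automation.
    """
    lines = []

    def visit(name, obj):
        # `name` is the path relative to the file root, e.g. "a/b/c".
        depth = name.count("/")
        label = name.rsplit("/", 1)[-1]
        if isinstance(obj, h5py.Dataset):
            # Datasets carry array data; groups only contain other objects.
            label += f"  shape={obj.shape} dtype={obj.dtype}"
        lines.append("    " * depth + label)

    with h5py.File(path, "r") as f:
        # visititems walks every group and dataset below the root.
        f.visititems(visit)
    return lines
```

Calling `print("\n".join(format_tree(dataset_path)))` would then print each group and dataset on its own line, indented by its depth in the file.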

# Read trajectory from the NPT simulation of the periodic system

In [4]:
trajectory_pbc = mbe_automation.storage.read_trajectory(
    dataset=dataset_path,
    key="training/md_sampling/crystal[dyn:T=298.15,p=0.00010]/trajectory"
)

print(f"There are {trajectory_pbc.n_frames} frames in the PBC trajectory")

In [5]:
try:
    trajectory_pbc_subsampled = trajectory_pbc.subsample(n=100)
except ValueError as e:
    print(f"Error: {e}")

# Convert trajectory to an ASE trajectory object

In [6]:
ase_trajectory_pbc = mbe_automation.storage.ASETrajectory(trajectory_pbc)

# Access a selected frame as an ASE Atoms instance

In [7]:
ase_trajectory_pbc[10]

# Visualize the PBC trajectory

In [8]:
mbe_automation.structure.display.to_chemiscope(structure=trajectory_pbc)

# Read a finite subsystem trajectory

In [9]:
finite_subsystem_npt = mbe_automation.storage.read_finite_subsystem(
    dataset=dataset_path,
    key="training/md_sampling/crystal[dyn:T=298.15,p=0.00010]/finite_subsystems/n=8"
)
print(f"Number of molecules: {finite_subsystem_npt.n_molecules}")
print(f"Number of frames: {finite_subsystem_npt.cluster_of_molecules.n_frames}")

In [10]:
mbe_automation.ml.display.pca(
    finite_subsystem_npt.cluster_of_molecules,
    plot_type="2d",
    subset_size=20,
    subsample_algorithm="farthest_point_sampling"
)
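
A PCA plot of this kind projects one feature vector per frame onto the directions of largest variance. The sketch below illustrates that projection with NumPy, assuming each frame is already described by a fixed-length feature vector. Which features `mbe_automation.ml.display.pca` actually uses is not stated in this notebook, and `pca_project` is a hypothetical helper, not part of the package.

```python
import numpy as np


def pca_project(features, n_components=2):
    """Project row vectors onto their leading principal components.

    Hypothetical helper for illustration; `features` has one row per frame.
    Returns the projected coordinates and the explained variance ratios.
    """
    features = np.asarray(features, dtype=float)
    # PCA is defined on mean-centered data.
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    # Squared singular values are proportional to the variance along each axis.
    explained = singular_values**2 / np.sum(singular_values**2)
    return projected, explained[:n_components]
```

The first two columns of `projected` are the coordinates one would scatter-plot for a `"2d"` view.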

In [11]:
mbe_automation.ml.display.to_chemiscope(finite_subsystem_npt.cluster_of_molecules)

# Create a subset of frames using farthest point sampling

In [17]:
subsampled = finite_subsystem_npt.subsample(n=10, algorithm="farthest_point_sampling")
print(f"Subsampled cluster has {subsampled.cluster_of_molecules.n_frames} frames")
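
Farthest point sampling is a greedy selection: starting from one frame, it repeatedly adds the frame that lies farthest from everything already chosen, so the subset spreads across the sampled configurations. The sketch below shows the general technique with NumPy. It is not the `mbe_automation` implementation; the per-frame feature vectors, the Euclidean metric, the `start` argument, and the `ValueError` for an invalid `n` are all assumptions made for illustration.

```python
import numpy as np


def farthest_point_sampling(features, n, start=0):
    """Greedily pick n row indices of `features` that are mutually far apart.

    Hypothetical helper for illustration; `features` has one row per frame.
    """
    features = np.asarray(features, dtype=float)
    if not 0 < n <= len(features):
        raise ValueError("n must be between 1 and the number of frames")
    selected = [start]
    # Distance from every frame to its nearest already-selected frame.
    min_dist = np.linalg.norm(features - features[start], axis=1)
    for _ in range(n - 1):
        # The next pick is the frame farthest from the current selection.
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected
```

Each iteration costs one pass over all frames, so selecting `n` frames out of `N` takes on the order of `n * N` distance evaluations.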

In [18]:
mbe_automation.ml.display.pca(subsampled.cluster_of_molecules)

# Visualize the finite subsystem

In [19]:
mbe_automation.structure.display.to_chemiscope(structure=finite_subsystem_npt.cluster_of_molecules)

# Read a finite subsystem from the phonon sampling data

In [20]:
finite_subsystem = mbe_automation.storage.read_finite_subsystem(
    dataset=dataset_path,
    key="training/phonon_sampling/finite_subsystems/n=4"
)

In [21]:
mbe_automation.structure.display.to_chemiscope(finite_subsystem.cluster_of_molecules)

# Read the PBC trajectory from the PhononSampling workflow

In [22]:
# Note: This cell originally used 'training_set_08.hdf5'.
# We've replaced it with the 'dataset_path' variable for consistency.
molecular_crystal_phonons = mbe_automation.storage.read_molecular_crystal(
    dataset=dataset_path,
    key="training/phonon_sampling/molecular_crystal"
)
print(f"Number of molecules: {molecular_crystal_phonons.n_molecules}")
print(f"Number of frames: {molecular_crystal_phonons.supercell.n_frames}")

# Visualize the PBC trajectory from PhononSampling

In [23]:
mbe_automation.structure.display.to_chemiscope(molecular_crystal_phonons.supercell)

# Read a finite subsystem from the PhononSampling trajectory

In [24]:
finite_subsystem_phonons = mbe_automation.storage.read_finite_subsystem(
    dataset=dataset_path,
    key="training/phonon_sampling/finite_subsystems/n=4"
)

# Visualize the finite subsystem

In [25]:
mbe_automation.structure.display.to_chemiscope(finite_subsystem_phonons.cluster_of_molecules)

In [None]:
mbe_automation.ml.display.pca(finite_subsystem_phonons.cluster_of_molecules)

In [26]:
mbe_automation.ml.display.to_chemiscope(finite_subsystem_phonons.cluster_of_molecules)

# Save selected frames to XYZ files

In [27]:
mbe_automation.storage.to_xyz_file(save_path="04.xyz", system=trajectory_pbc, frame_index=4)

In [28]:
mbe_automation.storage.to_xyz_file(save_path="04_pbc.cif", system=trajectory_pbc, frame_index=4)

In [29]:
mbe_automation.storage.to_xyz_file(
    system=finite_subsystem_phonons.cluster_of_molecules,
    save_path="frame_04_phonons_supercell.xyz",
    frame_index=4
)
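
For reference, a plain XYZ frame is a simple text layout: the first line holds the atom count, the second is a free-form comment, and each following line lists an element symbol and its Cartesian coordinates. The sketch below writes that layout by hand. `format_xyz_frame` is a hypothetical helper for illustration, not what `to_xyz_file` does internally, and the column widths are an arbitrary choice.

```python
def format_xyz_frame(symbols, positions, comment=""):
    """Return the text of a single XYZ frame.

    Hypothetical helper for illustration; `positions` holds one (x, y, z)
    tuple per atom, in the same order as `symbols`.
    """
    if len(symbols) != len(positions):
        raise ValueError("symbols and positions must have the same length")
    # Header: atom count, then a free-form comment line.
    lines = [str(len(symbols)), comment]
    for symbol, (x, y, z) in zip(symbols, positions):
        lines.append(f"{symbol:<2s} {x:15.8f} {y:15.8f} {z:15.8f}")
    return "\n".join(lines) + "\n"
```

A frame could then be written with `open("frame.xyz", "w").write(format_xyz_frame(symbols, positions))`, where `symbols` and `positions` come from whatever structure object holds the frame.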