# The Basics

### Dataset loading and creation

The primary use of LoadAtoms is to provide quick and simple access to a variety of chemical datasets.
This is done using the `dataset()` function, as shown below.

The function will:

- download the dataset if it is not already present in the cache
- store the dataset in the cache directory, which is located in the user's home directory
- print any copyright and licensing information related to the dataset


In [1]:
from load_atoms import dataset

# Load the dataset
# The argument can be a dataset name (string), a list of ase.Atoms objects, or a path to structure files
gap17 = dataset("C-GAP-17")

This dataset is covered by the CC BY-NC-SA 4.0 license.
Please cite this dataset if you use it in your work.
For more information, visit:
https://jla-gardner.github.io/load-atoms/datasets/C-GAP-17.html
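
The download-and-cache behaviour listed above follows a common pattern. The sketch below illustrates that pattern using only the standard library. `load_cached`, `fake_download` and the file layout are hypothetical names chosen for illustration; they are not LoadAtoms internals.

```python
from pathlib import Path
import tempfile


def load_cached(name: str, cache_dir: Path, download) -> bytes:
    """Return the raw bytes for `name`, calling `download` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / f"{name}.data"  # hypothetical file layout
    if not target.exists():
        # cache miss: fetch once and store the result for later calls
        target.write_bytes(download(name))
    return target.read_bytes()


calls = []


def fake_download(name: str) -> bytes:
    # stand-in for a real network request; records each call
    calls.append(name)
    return f"data for {name}".encode()


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "datasets"
    first = load_cached("C-GAP-17", cache, fake_download)
    second = load_cached("C-GAP-17", cache, fake_download)
```

In this sketch the second call is served from disk, so `fake_download` runs only once.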


The resulting `Dataset` object contains all information about the data:

- total number of structures
- total number of atoms
- species present (in %)
- any additional properties stored in the data files; these can be per-atom or per-structure properties

In [2]:
gap17

C-GAP-17:
    structures: 4,530
    atoms: 284,965
    species:
        C: 100.00%
    properties:
        per atom: (force)
        per structure: (energy, detailed_ct, config_type, split)

If the dataset you're looking for is not available via LoadAtoms, you can easily build your own or load in existing structure files from a path:

In [3]:
from ase import Atoms

# list of structures 
structures = [
    Atoms("H2O"),
    Atoms("NH3"),
    Atoms("CH4"),
]

# create a dataset object from a list of structures
small_molecules = dataset(structures)
small_molecules

Dataset:
    structures: 3
    atoms: 12
    species:
        H: 75.00%
        N: 8.33%
        C: 8.33%
        O: 8.33%
    properties:
        per atom: ()
        per structure: ()
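
To make the summary above concrete, the following sketch recomputes the structure count, atom count and species percentages for the same three molecules in plain Python. The symbol lists are written out by hand here, as an assumption to keep the example dependency-free; with ASE they would come from `Atoms.get_chemical_symbols()`.

```python
from collections import Counter

# chemical symbols of each structure, written out by hand
structures = [
    ["H", "H", "O"],            # H2O
    ["N", "H", "H", "H"],       # NH3
    ["C", "H", "H", "H", "H"],  # CH4
]

n_structures = len(structures)

# count every atom of every structure by its chemical symbol
counts = Counter(symbol for s in structures for symbol in s)
n_atoms = sum(counts.values())

# share of each species among all atoms, in percent
percentages = {sym: 100 * n / n_atoms for sym, n in counts.items()}

for sym, pct in sorted(percentages.items(), key=lambda kv: -kv[1]):
    print(f"{sym}: {pct:.2f}%")
```

Hydrogen accounts for 9 of the 12 atoms, which gives the 75.00% shown in the summary.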

In [4]:
from ase.io import write

# save a dataset to a file
write("small_molecules.traj", small_molecules)

# create a dataset object from a file
dataset("small_molecules.traj")

Dataset:
    structures: 3
    atoms: 12
    species:
        H: 75.00%
        N: 8.33%
        C: 8.33%
        O: 8.33%
    properties:
        per atom: ()
        per structure: ()

### Dataset manipulation

`Dataset` objects are just lists of `ase.Atoms` objects, which makes them very easy to manipulate.

In [5]:
# access a specific structure by index
structure = gap17[0]
structure

Atoms(symbols='C64', pbc=True, cell=[9.483921, 9.483921, 9.483921], force=..., calculator=SinglePointCalculator(...))

Properties of structures can be accessed just as you normally would in ASE:

In [6]:
# get the atomic numbers of all atoms in the structure
structure.arrays['numbers']

array([6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
       6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
       6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6])

Similarly, new dataset objects can be created from subsets of a larger dataset:

In [7]:
# access a range of structures by index
# returns a new dataset object with only the selected structures
gap17[:4]

Dataset:
    structures: 4
    atoms: 256
    species:
        C: 100.00%
    properties:
        per atom: (force)
        per structure: (energy, detailed_ct, config_type, split)
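
The behaviour above, where an integer index gives a single structure and a slice gives a new dataset, can be sketched with a small list-like class. `StructureList` is a hypothetical stand-in written for illustration; it is not the LoadAtoms implementation.

```python
class StructureList:
    """A minimal list-like container whose slices are new containers."""

    def __init__(self, structures):
        self._structures = list(structures)

    def __len__(self):
        return len(self._structures)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # a slice yields a new container, not a bare list
            return StructureList(self._structures[index])
        # an integer yields a single structure
        return self._structures[index]


# strings stand in for structures here
data = StructureList(["s0", "s1", "s2", "s3", "s4", "s5"])
subset = data[:4]
single = data[0]
```

Returning the container type from slices is what lets the subset keep the same interface as the full dataset.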

LoadAtoms also contains some useful in-built functions for working with and manipulating datasets:

**filter_by():** filters a dataset by a given property; returns a new dataset object with only the structures that match the given criteria

In [8]:
from load_atoms import filter_by

# only retain structures labelled as bulk amorphous
bulk_amo = filter_by(gap17, config_type="bulk_amo")

# only retain structures with fewer than 64 atoms
small = filter_by(gap17, lambda atoms: len(atoms) < 64)

len(bulk_amo), len(small)

(3410, 1434)

In [9]:
# new, filtered dataset object
bulk_amo

Dataset:
    structures: 3,410
    atoms: 224,665
    species:
        C: 100.00%
    properties:
        per atom: (force)
        per structure: (energy, detailed_ct, config_type, split)
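
The two filtering styles used above, a keyword match on a property and an arbitrary callable, can be sketched in plain Python as follows. `Structure` and `filter_structures` are hypothetical stand-ins for `ase.Atoms` and `filter_by`. Combining several criteria with a logical AND is an assumption of this sketch, as is reading per-structure properties from an `info` dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Structure:
    """Stand-in for ase.Atoms: a list of symbols plus an `info` dictionary."""

    symbols: list
    info: dict = field(default_factory=dict)

    def __len__(self):
        return len(self.symbols)


def filter_structures(structures, *predicates, **info_matches):
    """Keep structures that pass every predicate and match every info value."""

    def keep(s):
        return all(p(s) for p in predicates) and all(
            s.info.get(key) == value for key, value in info_matches.items()
        )

    return [s for s in structures if keep(s)]


toy = [
    Structure(["C"] * 64, {"config_type": "bulk_amo"}),
    Structure(["C"] * 32, {"config_type": "bulk_amo"}),
    Structure(["C"] * 2, {"config_type": "dimer"}),
]

bulk = filter_structures(toy, config_type="bulk_amo")
small = filter_structures(toy, lambda s: len(s) < 64)
both = filter_structures(toy, lambda s: len(s) < 64, config_type="bulk_amo")
```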

**cross_validate_split():** splits the dataset into k folds; returns two new dataset objects which correspond to the train and test sets for a particular fold

In [10]:
from load_atoms import cross_validate_split

# obtains the data from the first fold (fold=0) of a 5-fold cross-validation split
train, test = cross_validate_split(gap17, fold=0, k=5, seed=42)
len(train), len(test)

(3624, 906)

In [11]:
train

Dataset:
    structures: 3,624
    atoms: 226,698
    species:
        C: 100.00%
    properties:
        per atom: (force)
        per structure: (energy, detailed_ct, config_type, split)

In [12]:
test

Dataset:
    structures: 906
    atoms: 58,267
    species:
        C: 100.00%
    properties:
        per atom: (force)
        per structure: (energy, detailed_ct, config_type, split)

Instead of specifying a number of folds, you can obtain a train/test split based on a specified number of test structures:

In [13]:
train, test = cross_validate_split(gap17, fold=0, n_test=100, seed=42)
len(train), len(test)

(4430, 100)
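
The idea behind both splitting modes can be sketched on bare indices: shuffle with a fixed seed, cut out one contiguous block as the test set, and keep the rest for training. `split_indices` is a hypothetical helper written for illustration. It reproduces the set sizes shown above, but it is not the LoadAtoms implementation and it will not select the same structures.

```python
import random


def split_indices(n, fold, k=None, n_test=None, seed=0):
    """Return (train, test) index lists for one fold of a shuffled split."""
    if (k is None) == (n_test is None):
        raise ValueError("pass exactly one of k or n_test")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle
    if n_test is None:
        n_test = n // k  # fold size; any remainder always stays in train
    start = fold * n_test
    if start + n_test > n:
        raise ValueError("fold is out of range")
    test = indices[start:start + n_test]
    train = indices[:start] + indices[start + n_test:]
    return train, test


# 5-fold split of 4,530 structures, then a split with 100 test structures
train_k, test_k = split_indices(4530, fold=0, k=5, seed=42)
train_n, test_n = split_indices(4530, fold=0, n_test=100, seed=42)
```

Because the shuffle depends only on the seed, different values of `fold` give test sets that do not overlap.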

The full list of in-built functions can be found at ...