# Getting `ase.db` to do what we want

James Kermode, June 2015

The [ASE database module](https://wiki.fysik.dtu.dk/ase/ase/db/db.html) is a lightweight way to store atomic configurations. However, it currently lacks some features we need:

 * No support for storing original output files from calculations along with records
 * No command line tool for extracting configurations to files
 * No easy way to include arbitrary per-frame and per-atom data

I've extended `ase.db.core.Database.write()` and `ase.db.row.AtomsRow.toatoms()` so that all data in the `Atoms.info` and `Atoms.arrays` dictionaries can be stored and recovered, and added some new command-line options to `ase-db` to do what we need:

 * `-o/--store-original-file` - attach the original input file to the database record as a string in the `data` section
 * `-x/--extract-original-file` - write the original file stored in the `data` section of each selected record back to disk
 * `-W/--write-to-file [type]:filename` - write rows matching a query to files, optionally specifying the format
 * `-A/--all-data` - include the contents of `atoms.info` (per-frame dictionary) and `atoms.arrays` (per-atom dictionary) in database records
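
To make the `-o`/`-x` round trip concrete, here is a minimal sketch using only the standard-library `sqlite3` module. The single-table layout, the column names `file_name` and `file_text`, and the helpers `store_with_original` and `extract_original` are all hypothetical, chosen for illustration; this is not the schema or the code that `ase.db` uses.

```python
import os
import sqlite3
import tempfile


def store_with_original(con, path):
    """Store a record together with the verbatim text of its source file."""
    with open(path) as f:
        text = f.read()
    con.execute(
        "INSERT INTO records (file_name, file_text) VALUES (?, ?)",
        (os.path.basename(path), text),
    )


def extract_original(con, outdir):
    """Write every stored file back to disk and return the paths written."""
    written = []
    rows = con.execute("SELECT file_name, file_text FROM records ORDER BY id")
    for name, text in rows:
        target = os.path.join(outdir, name)
        with open(target, "w") as f:
            f.write(text)
        written.append(target)
    return written


# Hypothetical single-table layout, for illustration only.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE records ("
    "id INTEGER PRIMARY KEY, file_name TEXT, file_text TEXT)"
)

# Create a small stand-in for a calculation output file.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "dump_000.xyz")
original_text = "1\nenergy=0.5\nSi 0.0 0.0 0.0\n"
with open(source, "w") as f:
    f.write(original_text)

store_with_original(con, source)
os.remove(source)  # the file content now lives only in the database
restored = extract_original(con, workdir)
with open(restored[0]) as f:
    restored_text = f.read()
```

Storing the file as a plain string keeps the record self-contained, at the cost of duplicating data that is also parsed into the structured fields.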

## Generate some dummy data

In [41]:
import glob
import numpy as np
from ase.io import read, write
from ase.lattice import bulk
from ase.calculators.singlepoint import SinglePointCalculator

N = 10

!rm -f dump*.xyz dump*.cif
for i in range(N):
    atoms = bulk('Si', crystalstructure='diamond', a=5.43, cubic=True)
    atoms.rattle()

    # simulate a calculation with random results
    e = np.random.uniform()
    f = np.random.uniform(size=3*len(atoms)).reshape((len(atoms), 3))
    s = np.random.uniform(size=9).reshape((3, 3))
    calc = SinglePointCalculator(atoms, energy=e, forces=f, stress=s)
    atoms.set_calculator(calc)
    f = atoms.get_forces()
    e = atoms.get_potential_energy()

    # add some arbitrary data
    atoms.info['integer_info'] = 42
    atoms.info['real_info'] = 217
    atoms.info['config_type'] = 'diamond'
    atoms.new_array('array_data', np.ones_like(atoms.numbers))

    atoms.write('dump_%03d.xyz' % i, format='extxyz')

## Make a `.db` file and add our `.xyz` files, attaching original files with `-o`

In [42]:
!rm -f tmp.db
for addfile in glob.glob("dump*.xyz"):
    !ase-db tmp.db -Aoa $addfile

Added Si8 from dump_000.xyz
Added Si8 from dump_001.xyz
Added Si8 from dump_002.xyz
Added Si8 from dump_003.xyz
Added Si8 from dump_004.xyz
Added Si8 from dump_005.xyz
Added Si8 from dump_006.xyz
Added Si8 from dump_007.xyz
Added Si8 from dump_008.xyz
Added Si8 from dump_009.xyz


In [43]:
!ase-db tmp.db

id|age|user        |formula|calculator|energy| fmax|pbc| volume|charge|   mass| smax
 1| 4s|jameskermode|Si8    |unknown   | 0.024|1.177|TTT|160.103| 0.000|224.684|0.942
 2| 4s|jameskermode|Si8    |unknown   | 0.855|1.506|TTT|160.103| 0.000|224.684|0.986
 3| 4s|jameskermode|Si8    |unknown   | 0.410|1.362|TTT|160.103| 0.000|224.684|0.918
 4| 4s|jameskermode|Si8    |unknown   | 0.973|1.401|TTT|160.103| 0.000|224.684|0.920
 5| 3s|jameskermode|Si8    |unknown   | 0.538|1.309|TTT|160.103| 0.000|224.684|0.965
 6| 3s|jameskermode|Si8    |unknown   | 0.221|1.458|TTT|160.103| 0.000|224.684|0.993
 7| 3s|jameskermode|Si8    |unknown   | 0.529|1.335|TTT|160.103| 0.000|224.684|0.826
 8| 3s|jameskermode|Si8    |unknown   | 0.568|1.214|TTT|160.103| 0.000|224.684|0.951
 9| 2s|jameskermode|Si8    |unknown   | 0.460|1.214|TTT|160.103| 0.000|224.684|0.873
10| 2s|jameskermode|Si8    |unknown   | 0.741|1.253|TTT|160.103| 0.000|224.684|0.782
Rows: 10
Keys: config_type, integer_info, original_fi

## Extract records from database in various file formats

* First, let's dump record #1 in `POSCAR` format

In [44]:
!ase-db tmp.db 1 -W vasp:-

Si 
 1.0000000000000000
     5.4299999999999997    0.0000000000000000    0.0000000000000000
     0.0000000000000000    5.4299999999999997    0.0000000000000000
     0.0000000000000000    0.0000000000000000    5.4299999999999997
   8
Cartesian
  0.0004967100000000 -0.0001382600000000  0.0006476900000000
  1.3590230299999999  1.3572658500000001  1.3572658600000000
  0.0015792100000000  2.7157674300000001  2.7145305300000002
  1.3580425599999999  4.0720365799999998  4.0720342699999996
  2.7152419600000002 -0.0019132800000000  2.7132750799999998
  4.0719377100000003  1.3564871700000001  4.0728142500000004
  2.7140919800000001  2.7135877000000002  0.0014656500000000
  4.0722742199999997  4.0725675299999997  1.3560752500000000
Wrote 1 rows.


* Now, we extract the same record in extended XYZ format, including all the additional data (`-A` option)

In [45]:
!ase-db tmp.db 1 -AW -

8
Lattice="5.43 0.0 0.0 0.0 5.43 0.0 0.0 0.0 5.43" Properties=species:S:1:pos:R:3:array_data:I:1:Z:I:1:forces:R:3 stress="0.796071140077 0.874200025244 0.664711001956 0.879650083717 0.249105884066 0.00529155715579 0.226770606161 0.942011627316 0.638690883362" original_file_name=dump_000.xyz integer_info=42 real_info=217 pbc="T T T" energy=0.0244401372704 config_type=diamond unique_id=abad1499a4d532f5ac7afba9ae16029a
Si      0.00049671      -0.00013826       0.00064769        1       14       0.66676912       0.45667764       0.29363953 
Si      1.35902303       1.35726585       1.35726586        1       14       0.74198155       0.69267489       0.59521078 
Si      0.00157921       2.71576743       2.71453053        1       14       0.41062755       0.13588458       0.82468071 
Si      1.35804256       4.07203658       4.07203427        1       14       0.56795887       0.10382025       0.04625466 
Si      2.71524196      -0.00191328       2.71327508        1       14       0.737
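
The per-frame data ends up in the comment line of the extended XYZ output as `key=value` pairs, with values that contain spaces wrapped in double quotes. As a rough sketch of how such a line can be taken apart, the snippet below uses the standard-library `shlex` module on an abbreviated copy of the comment line. The helper `parse_comment_line` is hypothetical and is not ASE's parser; it leaves every value as a string and performs no type conversion.

```python
import shlex


def parse_comment_line(line):
    """Split an extended XYZ comment line into a dict of raw string values."""
    info = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            info[key] = value
    return info


# Abbreviated comment line (stress, energy and other keys omitted for brevity).
comment = (
    'Lattice="5.43 0.0 0.0 0.0 5.43 0.0 0.0 0.0 5.43" '
    'Properties=species:S:1:pos:R:3 '
    'integer_info=42 real_info=217 pbc="T T T" config_type=diamond'
)
info = parse_comment_line(comment)

# Convert the values we care about explicitly.
lattice = [float(x) for x in info["Lattice"].split()]
integer_info = int(info["integer_info"])
```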

* A more complicated query: we extract all configs with `energy>0.7` and write them to separate CIF files

In [46]:
!ase-db tmp.db 'energy>0.7' -W dump_%02d.cif
!ls dump*.cif

Wrote 3 rows.
dump_00.cif dump_01.cif dump_02.cif
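
The `-W` argument combines an optional format prefix with a file name that may contain a printf-style field. The sketch below shows one way such an argument could be interpreted; the helpers `parse_target` and `expand` are hypothetical, and the assumption that matching rows are numbered from zero is inferred from the file names listed above rather than taken from the `ase-db` source.

```python
def parse_target(spec):
    """Split '[type]:filename' into (format or None, filename)."""
    fmt, sep, filename = spec.partition(":")
    if not sep:
        return None, spec  # no prefix given: format left to be guessed
    return fmt, filename


def expand(filename, n_rows):
    """Return one file name per row if the name has a % field, else one name."""
    if "%" in filename:
        return [filename % i for i in range(n_rows)]
    return [filename]


vasp_target = parse_target("vasp:-")
plain_target = parse_target("dump_%02d.cif")
cif_names = expand(plain_target[1], 3)
```

A simple split on the first colon would misread file names that themselves contain a colon, so a real implementation would need a more careful rule.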


## Recover original files with `-x` argument

In [51]:
!rm -f dump*.xyz
!ase-db tmp.db -x
!ls dump*.xyz
!cat dump_000.xyz

Writing dump_000.xyz
Writing dump_001.xyz
Writing dump_002.xyz
Writing dump_003.xyz
Writing dump_004.xyz
Writing dump_005.xyz
Writing dump_006.xyz
Writing dump_007.xyz
Writing dump_008.xyz
Writing dump_009.xyz
Extracted original output files for 10/10 selected rows
dump_000.xyz dump_001.xyz dump_002.xyz dump_003.xyz dump_004.xyz dump_005.xyz dump_006.xyz dump_007.xyz dump_008.xyz dump_009.xyz
8
Lattice="5.43 0.0 0.0 0.0 5.43 0.0 0.0 0.0 5.43" Properties=species:S:1:pos:R:3:array_data:I:1:Z:I:1:forces:R:3 stress="0.796071140077 0.879650083717 0.226770606161 0.874200025244 0.249105884066 0.942011627316 0.664711001956 0.00529155715579 0.638690883362" energy=0.0244401372704 config_type=diamond real_info=217 integer_info=42 pbc="T T T"
Si      0.00049671      -0.00013826       0.00064769        1       14       0.66676912       0.45667764       0.29363953 
Si      1.35902303       1.35726585       1.35726586        1       14       0.74198155       0.69267489       0.59521078 
Si      0.001

## Access via the Python API

In [50]:
from ase.db import connect
con = connect('tmp.db')
for row in con.select('id=1'):
    atoms = row.toatoms(add_to_info_and_arrays=True)
    print atoms, 'original_file', atoms.info['original_file_name']

Atoms(symbols='Si8', array_data=..., positions=..., cell=[5.43, 5.43, 5.43], pbc=[True, True, True], calculator=SinglePointCalculator(...)) original_file dump_000.xyz
