# SFILES Validator

The purpose of this notebook is to explore methods for parsing and validating an SFILES text string.

## SFILES Notation

SFILES strings are read left to right. Process groups are delimited by a left parenthesis "(" and a terminating right parenthesis ")", and are not nested. The contents of a process group consist of a series of strings, each corresponding to a process stream, separated by forward slashes "/". Each process stream consists of an optional type, designated by a sequence of lower-case alphabetic characters, followed by a sequence of one or more upper-case alphabetic characters, each denoting a chemical species.
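As a rough illustration of the notation for a single process group, the sketch below uses only the standard `re` module. The names `STREAM`, `PROCESS_GROUP`, and `split_process_group` are hypothetical helpers introduced here, and the pattern accepts any lower-case prefix as a type, so it checks the shape of a group but not whether the type is a known one.

```python
import re

# Assumed shape of one stream: optional lower-case type, then one or more
# upper-case letters (one letter per chemical species).
STREAM = r"[a-z]*[A-Z]+"
# A process group: "(" stream ("/" stream)* ")", with no nesting.
PROCESS_GROUP = re.compile(rf"\(({STREAM}(?:/{STREAM})*)\)")

def split_process_group(text):
    """Return [(type, [species, ...]), ...] for one group, or None if malformed."""
    match = PROCESS_GROUP.fullmatch(text)
    if match is None:
        return None
    streams = []
    for stream in match.group(1).split("/"):
        unit_type, species = re.fullmatch(r"([a-z]*)([A-Z]+)", stream).groups()
        streams.append((unit_type, list(species)))
    return streams

print(split_process_group("(rAB/pABCD)"))    # two streams: types 'r' and 'p'
print(split_process_group("(rAB/(pABCD))"))  # nested parentheses are rejected
```

A stream without a type, such as `BC` in `(A/BC)`, comes back with an empty type string in this sketch.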

## Process Types

An SFILES string specifies a process or stream type, represented in the grammar as a sequence of lower-case alphabetic characters. The available types and their properties are given in the following dictionary.

In [165]:
types = {
    'i'    : {'name': 'input'},
    'o'    : {'name': 'output'},
    'f'    : {'name': 'flash'},
    'e'    : {'name': ''},
    'm'    : {'name': ''},
    'n'    : {'name': ''},
    'p'    : {'name': 'reactor product'},
    'cyc'  : {'name': 'solvent based azeotropic distillation'}, 
    'r'    : {'name': 'reactor'},
    'sw'   : {'name': 'pressure swing distillation'},
    'pms'  : {'name': 'polar molecule sieve based separation'},
    'ms'   : {'name': 'molecular sieve based separation'},
    'lmem' : {'name': 'liquid membrane based separation'},
    'gmem' : {'name': 'gas membrane based separation'},
    'crs'  : {'name': 'crystallization'},
    'ab'   : {'name': 'absorption'}
}
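One way such a dictionary might be used is sketched below: splitting a stream string into its type code and species, rejecting unknown codes, and listing entries whose `name` is still blank. Only a few entries are copied here so the cell runs on its own; `split_stream` and `unnamed` are hypothetical names, not part of the notebook.

```python
from itertools import takewhile

# A few entries copied from the `types` dictionary so this sketch runs alone.
types = {
    'i':   {'name': 'input'},
    'm':   {'name': ''},
    'p':   {'name': 'reactor product'},
    'ms':  {'name': 'molecular sieve based separation'},
    'pms': {'name': 'polar molecule sieve based separation'},
}

def split_stream(stream):
    """Split e.g. 'pmsABC' into ('pms', 'ABC'); the type code may be empty."""
    code = ''.join(takewhile(str.islower, stream))
    if code and code not in types:
        raise ValueError(f"unknown type code {code!r} in stream {stream!r}")
    return code, stream[len(code):]

# Codes whose 'name' entry has not been filled in yet.
unnamed = sorted(code for code, props in types.items() if not props['name'])

print(split_stream('pmsABC'))
print(split_stream('BC'))
print(unnamed)
```

Because type codes are lower case and species are upper case, the whole leading lower-case run is taken as the code, so `pms` is never mistaken for `p` followed by something else.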

## Parsing Rules

The next cell develops a representation of the SFILES grammar using the `pyparsing` library.

In [166]:
from pyparsing import Literal, Word, Group, Suppress
from pyparsing import Optional, OneOrMore, ZeroOrMore, oneOf, nestedExpr
from pyparsing import alphas, nums

LPAR  = Suppress("(")
RPAR  = Suppress(")")
LBRA  = Suppress("[")
RBRA  = Suppress("]")
SLASH = Suppress("/")
GT = Literal(">")
LT = Literal("<")

# components
component = Word(alphas.upper(), exact=1)

# mixtures
mixture = Group(OneOrMore(component))

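# first unit and stream in a process group; the type defaults to 'dist' when omitted
# (named unit_type to avoid shadowing the built-in `type`)
unit_type = Optional(oneOf(' '.join(types.keys())), default='dist')
stream = Group(unit_type + mixture)

# subsequent units and streams in a process group; the type defaults to 's' when omitted
unit_type_ = Optional(oneOf(' '.join(types.keys())), default='s')
stream_ = Group(unit_type_ + mixture)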

# process group
processgroup = Group(LPAR + stream + ZeroOrMore(SLASH + stream_) + RPAR)
           
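# a process group sequence comprises process groups, connectors, and recycles.
# Note: when no connector is written, Optional inserts its default, which here is the
# Literal object GT rather than the string '>'; that is why implicit connectors print
# as ">" (double quotes) in the results. Pass default='>' to get a plain string.
connector = Optional(GT | LT, default=GT)
recycle = Word(nums, exact=1)
sequence = Group(processgroup + ZeroOrMore(connector + (processgroup | recycle)))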

# nested branches
branchsequence = OneOrMore(connector + (processgroup | recycle ))
branch = nestedExpr(opener=LBRA, closer=RBRA, content=branchsequence)

# an SFILES expression starts with a sequence, followed by any number of branches or sequences
sfiles = sequence + ZeroOrMore(branch | sequence)

# example
results = sfiles.parseString('(iA)(rAB/pABCD)<1<2[<(iB)](mABC/D)[<(oD)](A/BC)1(cycB/C)2(oC)')
results.asList()

[[[['i', ['A']]],
  ">",
  [['r', ['A', 'B']], ['p', ['A', 'B', 'C', 'D']]],
  '<',
  '1',
  '<',
  '2'],
 ['<', [['i', ['B']]]],
 [[['m', ['A', 'B', 'C']], ['s', ['D']]]],
 ['<', [['o', ['D']]]],
 [[['dist', ['A']], ['s', ['B', 'C']]],
  ">",
  '1',
  ">",
  [['cyc', ['B']], ['s', ['C']]],
  ">",
  '2',
  ">",
  [['o', ['C']]]]]
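The nested lists returned by `asList()` can be walked with plain Python. The sketch below is one possible way to pull every stream out of such a result; it assumes a stream is always a two-element list of the form `[type, [species, ...]]`, as in the output above. The `parsed` literal is an excerpt of that output with connectors written as plain strings, and `iter_streams` is a hypothetical helper.

```python
# Excerpt of the parse result shown above (first three top-level items).
parsed = [
    [[['i', ['A']]], '>', [['r', ['A', 'B']], ['p', ['A', 'B', 'C', 'D']]],
     '<', '1', '<', '2'],
    ['<', [['i', ['B']]]],
    [[['m', ['A', 'B', 'C']], ['s', ['D']]]],
]

def is_stream(node):
    """True for a [type, [species, ...]] pair (assumed stream layout)."""
    return (isinstance(node, list) and len(node) == 2
            and isinstance(node[0], str)
            and isinstance(node[1], list)
            and all(isinstance(s, str) for s in node[1]))

def iter_streams(node):
    """Yield (type, species string) for every stream nested inside node."""
    if is_stream(node):
        yield node[0], ''.join(node[1])
    elif isinstance(node, list):
        for child in node:
            yield from iter_streams(child)
    # connectors and recycle numbers are plain strings and are skipped

streams = list(iter_streams(parsed))
species = sorted({letter for _, mixture in streams for letter in mixture})

print(streams)
print(species)
```

A flat list like this is a convenient starting point for later checks, for example confirming that every species leaving the flowsheet also enters it.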

## Tests

In [173]:
processgroup_tests = """\
    (A/BC)
    (ABC/DE)
    (cycA/B)
    (fABC/BCD)
    (rABC/nE/pABCD)
    (rABC/nE/pABCD)
    (swA/B)
    (pmsABC/D)
    (msABC/D)
    (lmemABC/D)
    (gmemABC/D)
    (crsABC/D)
    (abEAB/eF/EABF/EF)
    (iABCD)
    (oABD)
    (ABC/D)
    (rAB/pABD)
"""
    
sequence_tests = """\
    (iA)(rAB/pABCD)
    (iA)(oB)(iC)
    (iA)<(oB)>2
    (A/BC)2(oD)
    (A/BC)1
"""

branch_tests = """\
    (iA)[(oA)]
    (iA)[<(oD)]
    (iA)[<(A/BD)]
    (iA)[(A/BD)(BD/B)[(oA)]]
"""

sfiles_tests = """\
    (iA)(rAB/pABCD)<1<2[<(iB)](mABC/D)[<(oD)](A/BC)1(cycB/C)2(oC)
    (iB)(rAB/pABCD)<1<2[<(iA)](mABC/D)[<(oD)](A/BC)1(cycB/C)2(oC)
    (iA)(rAB/pABCD)[(iB)]    
    (iA)(rAB/pABCD)[<(iB)]
    (iA)(rAB/pABCD)<1<2[<(iB)](mABC/D)[<(oD)](A/BC)1(cycB/C)2(oC)
    (iB)(rAB/pABCD)<1<2[<(iA)](mABC/D)[<(oD)](A/BC)1(cycB/C)2(oC)
    (iA)(rAB/pABCD)[<(iB)](mABC/D)[(oD)](A/BC)[(oA)](oBC)
    (iA)[(oB)(oC)[(oD)]]
    (iABCDE)(AB/CDE)[(A/B)[(oA)](oB)]
"""

sfiles.runTests(processgroup_tests)
sfiles.runTests(sequence_tests)
sfiles.runTests(branch_tests)
sfiles.runTests(sfiles_tests)


(A/BC)
[[[['dist', ['A']], ['s', ['B', 'C']]]]]
[0]:
  [[['dist', ['A']], ['s', ['B', 'C']]]]
  [0]:
    [['dist', ['A']], ['s', ['B', 'C']]]
    [0]:
      ['dist', ['A']]
      [0]:
        dist
      [1]:
        ['A']
    [1]:
      ['s', ['B', 'C']]
      [0]:
        s
      [1]:
        ['B', 'C']


(ABC/DE)
[[[['dist', ['A', 'B', 'C']], ['s', ['D', 'E']]]]]
[0]:
  [[['dist', ['A', 'B', 'C']], ['s', ['D', 'E']]]]
  [0]:
    [['dist', ['A', 'B', 'C']], ['s', ['D', 'E']]]
    [0]:
      ['dist', ['A', 'B', 'C']]
      [0]:
        dist
      [1]:
        ['A', 'B', 'C']
    [1]:
      ['s', ['D', 'E']]
      [0]:
        s
      [1]:
        ['D', 'E']


(cycA/B)
[[[['cyc', ['A']], ['s', ['B']]]]]
[0]:
  [[['cyc', ['A']], ['s', ['B']]]]
  [0]:
    [['cyc', ['A']], ['s', ['B']]]
    [0]:
      ['cyc', ['A']]
      [0]:
        cyc
      [1]:
        ['A']
    [1]:
      ['s', ['B']]
      [0]:
        s
      [1]:
        ['B']


(fABC/BCD)
[[[['f', ['A', 'B', 'C']], ['s', ['B', '

(True,
 [('(iA)(rAB/pABCD)<1<2[<(iB)](mABC/D)[<(oD)](A/BC)1(cycB/C)2(oC)',
   ([([([(['i', (['A'], {})], {})], {}), ">", ([(['r', (['A', 'B'], {})], {}), (['p', (['A', 'B', 'C', 'D'], {})], {})], {}), '<', '1', '<', '2'], {}), (['<', ([(['i', (['B'], {})], {})], {})], {}), ([([(['m', (['A', 'B', 'C'], {})], {}), (['s', (['D'], {})], {})], {})], {}), (['<', ([(['o', (['D'], {})], {})], {})], {}), ([([(['dist', (['A'], {})], {}), (['s', (['B', 'C'], {})], {})], {}), ">", '1', ">", ([(['cyc', (['B'], {})], {}), (['s', (['C'], {})], {})], {}), ">", '2', ">", ([(['o', (['C'], {})], {})], {})], {})], {})),
  ('(iB)(rAB/pABCD)<1<2[<(iA)](mABC/D)[<(oD)](A/BC)1(cycB/C)2(oC)',
   ([([([(['i', (['B'], {})], {})], {}), ">", ([(['r', (['A', 'B'], {})], {}), (['p', (['A', 'B', 'C', 'D'], {})], {})], {}), '<', '1', '<', '2'], {}), (['<', ([(['i', (['A'], {})], {})], {})], {}), ([([(['m', (['A', 'B', 'C'], {})], {}), (['s', (['D'], {})], {})], {})], {}), (['<', ([(['o', (['D'], {})], {})], {})], {}), 
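`runTests` parses each test string in full by default, whereas a bare `parseString` call succeeds as soon as a valid prefix has been matched. For validation, passing `parseAll=True` is therefore the safer choice. The sketch below illustrates the difference with a deliberately reduced grammar (process groups only, any lower-case type code) so that it runs on its own; `is_valid` and `grammar` are hypothetical names, and in the notebook the `sfiles` expression could be passed as `parser` instead.

```python
import string
from pyparsing import Group, OneOrMore, Optional, ParseException, Suppress, Word, ZeroOrMore

# Reduced stand-in grammar: an assumption made for this sketch only.
species = Word(string.ascii_uppercase, exact=1)
stream = Group(Optional(Word(string.ascii_lowercase)) + OneOrMore(species))
processgroup = Group(Suppress("(") + stream + ZeroOrMore(Suppress("/") + stream) + Suppress(")"))
grammar = OneOrMore(processgroup)

def is_valid(text, parser=grammar):
    """Return True only if the parser consumes the entire string."""
    try:
        parser.parseString(text, parseAll=True)
    except ParseException:
        return False
    return True

# Without parseAll, the unterminated second group is silently ignored.
prefix_only = grammar.parseString("(iA)(rAB/pABCD").asList()

print(prefix_only)
print(is_valid("(iA)(rAB/pABCD)"))
print(is_valid("(iA)(rAB/pABCD"))
```

Returning a boolean keeps the sketch short; a fuller validator would probably surface the `ParseException` message and column so the user can see where the string went wrong.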

In [169]:
sfiles.runTests(processgroup_tests)


(True,
 [('(A/BC)',
   ([([([(['dist', (['A'], {})], {}), (['s', (['B', 'C'], {})], {})], {})], {})], {})),
  ('(ABC/DE)',
   ([([([(['dist', (['A', 'B', 'C'], {})], {}), (['s', (['D', 'E'], {})], {})], {})], {})], {})),
  ('(cycA/B)',
   ([([([(['cyc', (['A'], {})], {}), (['s', (['B'], {})], {})], {})], {})], {})),
  ('(fABC/BCD)',
   ([([([(['f', (['A', 'B', 'C'], {})], {}), (['s', (['B', 'C', 'D'], {})], {})], {})], {})], {})),
  ('(rABC/nE/pABCD)',
   ([([([(['r', (['A', 'B', 'C'], {})], {}), (['n', (['E'], {})], {}), (['p', (['A', 'B', 'C', 'D'], {})], {})], {})], {})], {})),
  ('(rABC/nE/pABCD)',
   ([([([(['r', (['A', 'B', 'C'], {})], {}), (['n', (['E'], {})], {}), (['p', (['A', 'B', 'C', 'D'], {})], {})], {})], {})], {})),
  ('(swA/B)',
   ([([([(['sw', (['A'], {})], {}), (['s', (['B'], {})], {})], {})], {})], {})),
  ('(pmsABC/D)',
   ([([([(['pms', (['A', 'B', 'C'], {})], {}), (['s', (['D'], {})], {})], {})], {})], {})),
  ('(msABC/D)',
   ([([([(['ms', (['A', 'B', 'C'], {})]