# ARC Typesystem
In this tutorial, you will learn the basic data structures of the ARC Typesystem, as well as how to save and load the core ARC types.

## Basic Types

There are two types of objects we deal with in ARC:

An **Element** is any linguistic object of interest. In our case, `Phoneme`, `Syllable`, `Word`, and `Stream` are Elements. Elements can be composed of other Elements in a nested fashion: a `Word` or a `Stream` consists of `Syllable`s, a `Syllable` consists of `Phoneme`s, and `Phoneme`s are atomic. Inside a composite Element, sub-elements can repeat, e.g. the same Phoneme can occur several times inside a Syllable. Elements can be part of a Register, which can be thought of as a corpus of Elements. As in real corpora, every Element can be annotated, so it has an `.info` field that holds arbitrary annotations in dictionary format. Elements in a Register *don't repeat*.

A **Register** is essentially an ordered set with some extra functionality. We use this container type to create ordered collections of Elements that do not repeat, i.e. ordered, annotated sets of Phonemes, Syllables, and Words (like little corpora). Since every Element in the Register has a unique string representation, it can be hashed and thus found quickly in memory. The `Lexicon` type is implemented as a Register of Words, and so is any collection of Phonemes, Syllables, or Words that we use to generate higher-level Elements.

In summary:
- `Phoneme`, `Syllable`, `Word` are subclasses of `Element`
- `Stream` is the same as a `Word` just longer, since it consists of repeatable `Syllable` objects
- whenever you see multiple elements, e.g. a `phonemes` object, it's a `Register`
- `Lexicon` is a special `Register` of `Word`s
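
The relationships in this summary can be pictured with a few plain dataclasses. This is only an illustrative sketch: the class names mirror ARC's, but the definitions below are hypothetical stand-ins and not ARC's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Phoneme:
    """Atomic element: an id plus free-form annotations."""
    id: str
    info: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Syllable:
    """Composite element: an ordered list of phonemes, repeats allowed."""
    id: str
    phonemes: List[Phoneme]
    info: Dict[str, Any] = field(default_factory=dict)


g, a = Phoneme("ɡ"), Phoneme("aː")

# Sub-elements may repeat inside an element ...
gag = Syllable("ɡaːɡ", [g, a, g])

# ... but a register-like mapping keyed by id holds each element only once.
register = {}
for phoneme in gag.phonemes:
    register[phoneme.id] = phoneme

print([p.id for p in gag.phonemes])  # three sub-elements, "ɡ" occurs twice
print(list(register))                # unique ids, in insertion order
```

Keying the collection by the element's string id is what makes repeats impossible at the register level while leaving them possible inside an element.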

## Phonemes
Phonemes are the atomic unit of the ARC Typesystem and form the basis for constructing other types like Syllables and Words.
To enjoy the full functionality of ARC, you'll need Phonemes annotated with their phonetic features. Luckily, ARC comes with an extensive corpus of Phonemes with phonetic features.
Let's load them and see what they look like.

In [24]:
from arc import load_phonemes
help(load_phonemes)

Help on function load_phonemes in module arc.io:

load_phonemes(path_to_json: Union[str, os.PathLike, NoneType] = None, language_control=False) -> ~RegisterType



In [27]:
phonemes = load_phonemes()
print(phonemes)

k͡p|ɡ͡b|c|ɡ|k|q|ɖ|ɟ|ɠ|ɢ|... (5275 elements total)


The `phonemes` variable is a collection of `Phoneme` objects, more specifically a `Register`. What you see when you print any Register is a short summary of the first elements.
You can treat the Register like most Python collection types, meaning you can access elements, iterate over it, etc.

> Note: Internally, `Register`s are `OrderedDict`s (with some extra convenience methods). Essentially, you can treat a Register like both of the Python built-in types `dict` and `list`.

Let's see that in action.

In [28]:
print("We can reference elements of a Corpus by position/index:", phonemes[0], ", or by its string representation:", phonemes["k"])

We can reference elements of a Corpus by position/index: k͡p , or by its string representation: k
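
How one container can accept both kinds of lookup can be sketched with a small `OrderedDict` subclass. `MiniRegister` below is a hypothetical illustration, not ARC's `Register` class; it only assumes that integer arguments mean positions and string arguments mean ids.

```python
from collections import OrderedDict


class MiniRegister(OrderedDict):
    """Hypothetical ordered mapping with positional and key access."""

    def __getitem__(self, item):
        if isinstance(item, int):
            # Positional access: find the key stored at that position first.
            return super().__getitem__(list(self.keys())[item])
        # Key access: plain dictionary behaviour.
        return super().__getitem__(item)


reg = MiniRegister()
for phoneme_id in ["k͡p", "ɡ͡b", "c", "ɡ", "k"]:
    reg[phoneme_id] = {"id": phoneme_id}

print(reg[0]["id"], reg["k"]["id"])
```

Building a list of keys for every positional lookup is simple but linear in the register size; a real implementation might cache the key order.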


Internally, Elements are `Dict`-like objects, more specifically, [Pydantic](https://docs.pydantic.dev/latest/) types.

In [29]:
print(phonemes["k"])
phonemes["k"]

k


Phoneme(id='k', info={'features': ['-', '-', '+', '-', '-', '-', '-', '0', '-', '-', '-', '-', '-', '0', '-', '+', '-', '+', '-', '0', '-']})

Annotations can be referenced via the `.info` property, which can hold arbitrary dictionary data:

In [30]:
print(phonemes["k"].info)

{'features': ['-', '-', '+', '-', '-', '-', '-', '0', '-', '-', '-', '-', '-', '0', '-', '+', '-', '+', '-', '0', '-']}


Phoneme features can be hard to interpret, so you can also get features directly, e.g. the "is labial" binary feature, called `lab`:

In [31]:
help(phonemes["k"].get_binary_feature)

Help on method get_binary_feature in module arc.types.phoneme:

get_binary_feature(label: Literal['syl', 'son', 'cons', 'cont', 'delrel', 'lat', 'nas', 'strid', 'voi', 'sg', 'cg', 'ant', 'cor', 'distr', 'lab', 'hi', 'lo', 'back', 'round', 'tense', 'long']) method of arc.types.phoneme.Phoneme instance



In [32]:
phonemes["k"].get_binary_feature("lab")

False
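
One way to read the raw feature list is to pair it with the 21 labels listed in the `get_binary_feature` signature. The sketch below assumes that the list positions follow the order of those labels (an assumption, though it is consistent with the `lab` result above) and that `'+'` marks a feature as present. `decode_features` is a hypothetical helper, not part of ARC.

```python
LABELS = ["syl", "son", "cons", "cont", "delrel", "lat", "nas", "strid",
          "voi", "sg", "cg", "ant", "cor", "distr", "lab", "hi", "lo",
          "back", "round", "tense", "long"]

# Feature list of the phoneme "k", copied from the output above.
K_FEATURES = ["-", "-", "+", "-", "-", "-", "-", "0", "-", "-", "-",
              "-", "-", "0", "-", "+", "-", "+", "-", "0", "-"]


def decode_features(features):
    """Map each label to True only where the feature is marked '+'."""
    if len(features) != len(LABELS):
        raise ValueError("expected one value per feature label")
    return {label: value == "+" for label, value in zip(LABELS, features)}


decoded = decode_features(K_FEATURES)
print(decoded["lab"])
print([label for label, present in decoded.items() if present])
```

Note that this sketch collapses both `'-'` and `'0'` to `False`; whether ARC distinguishes the two is not shown in this tutorial.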

## The Register

While Registers in ARC print as compact summaries of their contents, they can be arbitrarily complex data structures.

Regardless of the contents, Elements and Registers are always JSON serializable, as long as they are valid (which is checked at initialization):

In [34]:
print(phonemes.to_json()[:80] + "...")

{"k͡p": {"id": "k͡p", "info": {"features": ["-", "-", "+", "-", "-", "-", "-", "...


... which means they can be written to file and read later as a register. The Register container type has methods for reading and writing:

In [35]:
help(phonemes.save)

Help on method save in module arc.types.base_types:

save(path: Union[str, os.PathLike] = None) method of arc.types.base_types.Register instance



In [36]:
import os

os.makedirs("results", exist_ok=True)
phonemes.save(os.path.join("results", "test_phonemes.json"))
loaded_phonemes = load_phonemes(os.path.join("results", "test_phonemes.json"))
print(loaded_phonemes)

k͡p|ɡ͡b|c|ɡ|k|q|ɖ|ɟ|ɠ|ɢ|... (5275 elements total)
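
The same round trip can be sketched with the standard library alone. The snippet assumes the on-disk layout matches the `to_json()` excerpt above (a mapping from id to an object with `id` and `info`); the file name and the shortened feature lists are made up for illustration.

```python
import json
import os
import tempfile

# Two entries shaped like the to_json() excerpt; feature lists are shortened.
register = {
    "k͡p": {"id": "k͡p", "info": {"features": ["-", "-", "+"]}},
    "k": {"id": "k", "info": {"features": ["-", "-", "+"]}},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "mini_phonemes.json")

    # ensure_ascii=False keeps IPA symbols readable in the file.
    with open(path, "w", encoding="utf-8") as file:
        json.dump(register, file, ensure_ascii=False)

    with open(path, encoding="utf-8") as file:
        loaded = json.load(file)

# Dicts keep insertion order, so the element order survives the round trip.
print(list(loaded) == list(register), loaded == register)
```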


### Some useful operations

In [52]:
from arc import Register
help(Register)

Help on class Register in module arc.types.base_types:

class Register(collections.OrderedDict)
 |  Register(other=(), /, **kwargs)
 |  
 |  Method resolution order:
 |      Register
 |      collections.OrderedDict
 |      builtins.dict
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __contains__(self, item: Union[str, arc.types.base_types.Element])
 |      True if the dictionary has the specified key, else False.
 |  
 |  __getitem__(self, item)
 |      x.__getitem__(y) <==> x[y]
 |  
 |  __init__(self, other=(), /, **kwargs)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __iter__(self)
 |      Implement iter(self).
 |  
 |  __str__(self)
 |      Return str(self).
 |  
 |  append(self, obj: arc.types.base_types.Element)
 |  
 |  empty_like(self)
 |  
 |  filter(self, func, *args, **kwargs)
 |  
 |  flatten(self) -> ~RegisterType
 |  
 |  get_self_with_info_key(self)
 |  
 |  get_subset(self, size: int) -> ~RegisterType
 |      Create a 

In [38]:
from arc.io import read_phoneme_corpus
corpus_phonemes = read_phoneme_corpus()
print(phonemes)
phonemes = phonemes.intersection(corpus_phonemes)
print(phonemes)

k͡p|ɡ͡b|c|ɡ|k|q|ɖ|ɟ|ɠ|ɢ|... (5275 elements total)
ɡ|k|b|d|p|t|x|ç|ʃ|f|... (38 elements total)
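
Conceptually, intersecting two registers keeps the elements whose ids occur in both. The sketch below uses plain dicts and assumes the result follows the order of the left-hand register, which is what the output above suggests. `intersect` is a hypothetical helper, not ARC's method.

```python
def intersect(left, right):
    """Keep entries of `left` whose keys also occur in `right`, in `left`'s order."""
    return {key: value for key, value in left.items() if key in right}


all_phonemes = {pid: {"id": pid} for pid in ["k͡p", "ɡ͡b", "c", "ɡ", "k", "q"]}
corpus_phonemes = {pid: {"id": pid} for pid in ["k", "b", "ɡ"]}

common = intersect(all_phonemes, corpus_phonemes)
print("|".join(common))
```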


## Syllables
Our first composite type is the `Syllable`, consisting of a list of `Phoneme`s. Let's make a collection of syllables that follow the `cV` pattern, meaning they consist of a single consonant `c` followed by a long vowel `V`.
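
To make the `cV` pattern concrete, the following sketch pairs every consonant with every long vowel. It only illustrates the pattern: the phoneme lists are hand-picked, and ARC's `make_syllables` additionally applies statistical and language controls, so its result is presumably a filtered subset of such a cross product.

```python
from itertools import product

# Hand-picked examples; a real run would take these from the phoneme register.
consonants = ["ɡ", "k", "b"]
long_vowels = ["aː", "iː"]

# "cV": one consonant followed by one long vowel.
cv_syllables = [c + v for c, v in product(consonants, long_vowels)]
print("|".join(cv_syllables))
```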

In [41]:
from arc.core.syllable import make_syllables
help(make_syllables)

Help on function make_syllables in module arc.core.syllable:

make_syllables(phonemes: ~RegisterType, phoneme_pattern: str = 'cV', unigram_control: bool = True, language_control: bool = True, language_alpha: Optional[float] = 0.05, from_format: Literal['ipa', 'xsampa'] = 'xsampa', lang: str = 'deu') -> ~RegisterType
    _summary_
    
    Args:
        phonemes (RegisterType): A Register of phonemes that will be used as a basis to generate the syllables
        phoneme_pattern (str, optional): describes how a syllable is structured, e.g. "cV" syllables consist of a single-consonant character and a long vowel. Defaults to "cV".
        unigram_control (bool, optional): apply statistical control (on the basis of p-val of uniform distribution) to single unigrams. Defaults to True.
        language_control (bool, optional): apply language specific controls (only german for now) on the syllable level. Defaults to True.
        language_alpha (Optional[float], optional): which p-value to ass

In [42]:
syllables = make_syllables(phonemes, phoneme_pattern="cV")
print(syllables)

ɡaː|ɡiː|ɡyː|ɡɛː|kaː|koː|kuː|køː|kɛː|baː|... (75 elements total)


The syllables Register behaves much like the phonemes Register, except that each Syllable is itself composed of Phonemes.

In [43]:
print(syllables["ɡaː"], syllables[1])
syllables["ɡaː"], syllables[1]

ɡaː ɡiː


(Syllable(id='ɡaː', info={'binary_features': [0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1], 'phonotactic_features': [['plo', 'oth'], ['a']], 'freq': 85, 'prob': 8.41048e-05}, phonemes=[Phoneme(id='ɡ', info={'features': ['-', '-', '+', '-', '-', '-', '-', '0', '+', '-', '-', '-', '-', '0', '-', '+', '-', '+', '-', '0', '-'], 'word_position_prob': {0: 0.6206003986790039, 1: 0.0370116926006367, 2: 0.18437416322037428, 3: 0.0536729047038172, 4: 0.03079349022641397, 5: 0.030019933950194876, 6: 0.016601707774240575, 7: 0.013061200202314719, 8: 0.004938859302014221, 9: 0.00395703787450537, 10: 0.0016958733747880158, 11: 0.0011603344143286424, 12: 0.0006545476183392342, 13: 0.00038677813810954746, 14: 0.0003570259736395823, 15: 0.00020826515128975634, 16: 0.00014876082234982596, 17: 0.00017851298681979114, 18: 5.950432893993038e-05, 19: 5.950432893993038e-05, 20: 0.0, 21: 0.0, 22: 2.975216446996519e-05, 23: 0.0, 24: 2.975216446996519e-05}}), Phoneme(id='aː', info={'features': ['+', '+', '-', '

You can get the sub-elements of any composite element as a list:

In [44]:
syllables["ɡaː"].get_elements()

[Phoneme(id='ɡ', info={'features': ['-', '-', '+', '-', '-', '-', '-', '0', '+', '-', '-', '-', '-', '0', '-', '+', '-', '+', '-', '0', '-'], 'word_position_prob': {0: 0.6206003986790039, 1: 0.0370116926006367, 2: 0.18437416322037428, 3: 0.0536729047038172, 4: 0.03079349022641397, 5: 0.030019933950194876, 6: 0.016601707774240575, 7: 0.013061200202314719, 8: 0.004938859302014221, 9: 0.00395703787450537, 10: 0.0016958733747880158, 11: 0.0011603344143286424, 12: 0.0006545476183392342, 13: 0.00038677813810954746, 14: 0.0003570259736395823, 15: 0.00020826515128975634, 16: 0.00014876082234982596, 17: 0.00017851298681979114, 18: 5.950432893993038e-05, 19: 5.950432893993038e-05, 20: 0.0, 21: 0.0, 22: 2.975216446996519e-05, 23: 0.0, 24: 2.975216446996519e-05}}),
 Phoneme(id='aː', info={'features': ['+', '+', '-', '+', '-', '-', '-', '0', '+', '-', '-', '-', '-', '-', '-', '-', '+', '+', '-', '+', '+'], 'word_position_prob': {0: 0.12874454553597706, 1: 0.7254789051403124, 2: 0.04774335547109934,

Finally, you can iterate both over the Elements of a Register and over the sub-elements of an Element:

In [45]:
for syllable in syllables[:2]:
    print("Syllable", syllable, "consists of phonemes ", end="")
    for phoneme in syllable:
        print(phoneme, end=" ")
    print("")

Syllable ɡaː consists of phonemes ɡ aː 
Syllable ɡiː consists of phonemes ɡ iː 


## Export to SSML
Once we are done making syllables, we can export them to Speech Synthesis Markup Language (SSML) for later reference.

In [46]:
from arc.io import export_speech_synthesiser
export_speech_synthesiser(syllables, syllables_dir=os.path.join("results", "ssml"))

Done
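
For orientation, SSML is an XML format in which a `<phoneme>` element can carry an IPA transcription. The sketch below builds such a snippet with the standard library. It is an assumption about what a syllable might look like in SSML, not a reproduction of the files that `export_speech_synthesiser` writes; `syllable_to_ssml` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def syllable_to_ssml(ipa):
    """Wrap an IPA string in a minimal SSML document (illustrative only)."""
    speak = ET.Element("speak")
    phoneme = ET.SubElement(speak, "phoneme", alphabet="ipa", ph=ipa)
    phoneme.text = ipa
    return ET.tostring(speak, encoding="unicode")


ssml = syllable_to_ssml("ɡaː")
print(ssml)
```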


## Words
`Word`s are made out of `Syllable`s, just as we made syllables from phonemes before.

Since one of ARC's main features is rhythmicity control, our `make_words` function will only create words that have minimum overlap of phonotactic features. By default, this function generates 10000 words, but you can change that with the `n_words` option. With 10000 words, this should run fairly quickly. If you set the number higher, you may also want to set the `progress_bar=True` flag in the function arguments.
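
The idea of keeping only words with minimal feature overlap can be sketched as rejection sampling. This is a simplified, hypothetical stand-in for `make_words`: it assumes that "overlap" means two syllables of a word sharing a phonotactic feature, and it ignores the bigram, trigram, and positional controls. The feature sets are taken from syllable annotations printed in this tutorial.

```python
import random

# Phonotactic features per syllable, flattened into sets for easy comparison.
FEATURES = {
    "nuː": {"son", "den", "u"},
    "lyː": {"son", "den", "y"},
    "køː": {"plo", "oth", "ø"},
    "ɡaː": {"plo", "oth", "a"},
    "vaː": {"fri", "lab", "a"},
}


def has_overlap(word):
    """True if any two syllables of the word share a phonotactic feature."""
    return any(
        FEATURES[a] & FEATURES[b]
        for i, a in enumerate(word)
        for b in word[i + 1:]
    )


def make_words_sketch(n_words=5, num_syllables=3, max_tries=1000, seed=0):
    """Rejection sampling: draw syllables at random, keep overlap-free words."""
    rng = random.Random(seed)
    syllables = sorted(FEATURES)
    words = {}  # dict keys act as an ordered set of unique words
    for _ in range(max_tries):
        if len(words) >= n_words:
            break
        candidate = tuple(rng.sample(syllables, num_syllables))
        if not has_overlap(candidate):
            words["".join(candidate)] = candidate
    return words


words = make_words_sketch()
print("|".join(words))
```

The `max_tries` cap mirrors the parameter of the same name in the `make_words` signature: when controls are strict, the sampler may return fewer words than requested instead of looping forever.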

In [47]:
from arc import make_words
help(make_words)

Help on function make_words in module arc.core.word:

make_words(syllables: ~RegisterType, num_syllables=3, bigram_control=True, bigram_alpha=None, trigram_control=True, trigram_alpha=None, positional_control=True, positional_control_position=None, position_alpha=0, phonotactic_control=True, n_look_back=2, n_words=10000, max_tries=100000, progress_bar: bool = True) -> ~RegisterType
    _summary_
    
    Args:
        syllables (RegisterType): The Register of syllables to use as a basis for word generation
        num_syllables (int, optional): how many syllables are in a word. Defaults to 3.
        bigram_control (bool, optional): apply statistical control on the bigram level. Defaults to True.
        bigram_alpha (_type_, optional): which p-value to assume for bigram control. Defaults to None.
        trigram_control (bool, optional): apply statistical control on the trigram level. Defaults to True.
        trigram_alpha (_type_, optional): which p-value to assume for trigram contr

In [48]:
words = make_words(syllables, n_words=10_000, progress_bar=False, positional_control=True, position_alpha=0.001)
print(words)

bigram control...
trigram control...
positional control...
ʃoːpaːhuː|foːryːkaː|løːbyːçaː|fuːnøːɡaː|løːkuːfiː|ʃuːbiːhøː|nyːɡaːfoː|seːmyːkaː|ʃuːhøːbyː|nɛːbiːçaː|... (9720 elements total)


The words register has some relevant info about how it has been created:

In [49]:
for key in words.info:
    print(key)
print("")
print(f"For example, the type of syllables used to create the words is '{words.info['syllables_info']['syllable_type']}'")

n_syllables_per_word
n_look_back
phonotactic_control
syllables_info
bigram_pval
trigram_pval

For example, the type of syllables used to create the words is 'cV'


## Summary of important functions

In [14]:
from arc import load_phonemes, make_syllables, make_words, Register, Element

phonemes: Register = load_phonemes()
print(phonemes)

syllables = make_syllables(phonemes)
print(syllables)

words = make_words(syllables)
print(words)

ɡ|k|b|d|p|t|x|ç|ʃ|f|... (38 elements total)
ɡaː|ɡiː|ɡyː|ɡɛː|kaː|koː|kuː|køː|kɛː|baː|... (75 elements total)


100%|██████████| 10000/10000 [00:12<00:00, 827.13it/s]


bigram control...
trigram control...
positional control...
heːʃøːpaː|kaːʃuːmoː|peːryːçaː|kaːfyːroː|loːpuːçaː|riːpuːçaː|doːfiːhøː|ɡɛːnyːfuː|moːtɛːçaː|puːʃaːhøː|... (10000 elements total)


In [15]:
import os
words.save(os.path.join("results", "words.json"))

In [17]:
from arc import load_words

print(load_words(os.path.join("results", "words.json")))

heːʃøːpaː|kaːʃuːmoː|peːryːçaː|kaːfyːroː|loːpuːçaː|riːpuːçaː|doːfiːhøː|ɡɛːnyːfuː|moːtɛːçaː|puːʃaːhøː|... (10000 elements total)


In [18]:
from arc.io import export_speech_synthesiser
export_speech_synthesiser(syllables, syllables_dir=os.path.join("results", "ssml"))

Done


In [19]:
print(words.get_subset(10))

zøːhuːpeː|vaːɡɛːreː|ɡɛːruːfaː|faːhoːtuː|ɡɛːluːvaː|myːkoːsuː|poːzyːhiː|ruːɡaːfiː|paːheːʃoː|puːçaːnøː
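
The subset above does not start with the first words of the register, which suggests that `get_subset` samples elements at random; the truncated help text does not confirm this. Under that assumption, a subset of an ordered mapping could be drawn as follows (`get_subset_sketch` is a hypothetical helper).

```python
import random


def get_subset_sketch(register, size, seed=None):
    """Return a new dict with `size` randomly chosen entries of `register`."""
    rng = random.Random(seed)
    keys = rng.sample(list(register), size)
    return {key: register[key] for key in keys}


words = {w: {"id": w} for w in ["heːʃøːpaː", "kaːʃuːmoː", "peːryːçaː", "kaːfyːroː"]}
subset = get_subset_sketch(words, 2, seed=1)
print("|".join(subset))
```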


In [20]:
for word in words.get_subset(2):
    print(word)

byːhøːsiː
reːkoːfiː


In [21]:
print(words['lyːfiːkɛː'])

lyːfiːkɛː


In [22]:
words['lyːfiːkɛː']

Word(id='lyːfiːkɛː', info={'binary_features': [[1, 0, 0], [0, 0, 1], [0, 0, 1], [0, 1, 0], [1, 0, 0], [1, 1, 0], [1, 0, 0], [0, 0, 0], [1, 0, 0], [0, 0, 0], [1, 1, 0], [0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]}, syllables=[Syllable(id='lyː', info={'binary_features': [1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1], 'phonotactic_features': [['son', 'den'], ['y']], 'freq': 60, 'prob': 5.93681e-05}, phonemes=[Phoneme(id='l', info={'features': ['-', '+', '+', '+', '-', '+', '-', '-', '+', '-', '-', '+', '+', '-', '-', '-', '-', '-', '-', '0', '-'], 'word_position_prob': {0: 0.07766470273165506, 1: 0.26216619481215087, 2: 0.3287359400107124, 3: 0.07513964343101998, 4: 0.08835794628510216, 5: 0.08066799296044073, 6: 0.04361466064733338, 7: 0.015590328257709082, 8: 0.01289310582293978, 9: 0.0055092202922947435, 10: 0.003825847425204683, 11: 0.002065957609610529, 12: 0.0013199173616956156, 13: 0.00078429872216696, 14: 0.0006695232994108195, 15: 0.0003634555053944449, 16: 0.00034432626826842147,

In [8]:
nu_ko_va = words['nuːkøːvaː']

for syllable in nu_ko_va:
    print(syllable)
    print(syllable.info)

print(nu_ko_va[0])

try:
    nu_ko_va['nuː']
except TypeError:
    print("Sub-elements are accessed by index, not by key, because they can repeat, i.e. there could be multiple appearances of the same syllable in a word etc.")
    print("It will throw a TypeError")

nuː
{'binary_features': [1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1], 'phonotactic_features': [['son', 'den'], ['u']], 'freq': 357, 'prob': 0.0003532401}
køː
{'binary_features': [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1], 'phonotactic_features': [['plo', 'oth'], ['ø']], 'freq': 45, 'prob': 4.45261e-05}
vaː
{'binary_features': [0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1], 'phonotactic_features': [['fri', 'lab'], ['a']], 'freq': 100, 'prob': 9.89468e-05}
nuː
Sub-elements are accessed by index, not by key, because they can repeat, i.e. there could be multiple appearances of the same syllable in a word etc.
It will throw a TypeError
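
Why key access fails for sub-elements can be illustrated with a small container class. `MiniWord` is a hypothetical stand-in, not ARC's `Word`; it assumes sub-elements live in a list, so only integers and slices are valid indices.

```python
class MiniWord:
    """Hypothetical composite element holding its syllables in a list."""

    def __init__(self, syllables):
        self.syllables = list(syllables)
        self.id = "".join(self.syllables)

    def __getitem__(self, item):
        # Lists reject string indices with a TypeError, which matches the
        # behaviour demonstrated in the cell above.
        return self.syllables[item]

    def __iter__(self):
        return iter(self.syllables)


word = MiniWord(["nuː", "køː", "nuː"])  # the same syllable may occur twice
print(word[0], word[:2])

try:
    word["nuː"]
except TypeError:
    print("a key would be ambiguous here, so only positions work")
```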


In [9]:
nu_ko_va.get_elements()

[Syllable(id='nuː', info={'binary_features': [1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1], 'phonotactic_features': [['son', 'den'], ['u']], 'freq': 357, 'prob': 0.0003532401}, phonemes=[Phoneme(id='n', info={'features': ['-', '+', '+', '-', '-', '-', '+', '-', '+', '-', '-', '+', '+', '-', '-', '-', '-', '-', '-', '0', '-'], 'word_position_prob': {0: 0.18406973725953912, 1: 0.17740028264067143, 2: 0.2780405502840827, 3: 0.05586479393187783, 4: 0.11546448245032158, 5: 0.05872725175208375, 6: 0.0460084215383728, 7: 0.03188359818879243, 8: 0.020361664695872868, 9: 0.013274017246849134, 10: 0.007671675367000259, 11: 0.004441496265105414, 12: 0.002862457820205924, 13: 0.001542987338851556, 14: 0.0008075447754737115, 15: 0.0005119257058806564, 16: 0.00036772128168892224, 17: 0.00024514752112594814, 18: 0.0001442044241917342, 19: 0.00015862486661090763, 20: 9.373287572462723e-05, 21: 2.884088483834684e-05, 22: 1.442044241917342e-05, 23: 7.21022120958671e-06, 24: 7.21022120958671e-06}}), Phon

In [10]:
nu = words['nuːkøːvaː'][0]

for phoneme in nu:
    print(phoneme)
    print(phoneme.info)

n
{'features': ['-', '+', '+', '-', '-', '-', '+', '-', '+', '-', '-', '+', '+', '-', '-', '-', '-', '-', '-', '0', '-'], 'word_position_prob': {0: 0.18406973725953912, 1: 0.17740028264067143, 2: 0.2780405502840827, 3: 0.05586479393187783, 4: 0.11546448245032158, 5: 0.05872725175208375, 6: 0.0460084215383728, 7: 0.03188359818879243, 8: 0.020361664695872868, 9: 0.013274017246849134, 10: 0.007671675367000259, 11: 0.004441496265105414, 12: 0.002862457820205924, 13: 0.001542987338851556, 14: 0.0008075447754737115, 15: 0.0005119257058806564, 16: 0.00036772128168892224, 17: 0.00024514752112594814, 18: 0.0001442044241917342, 19: 0.00015862486661090763, 20: 9.373287572462723e-05, 21: 2.884088483834684e-05, 22: 1.442044241917342e-05, 23: 7.21022120958671e-06, 24: 7.21022120958671e-06}}
uː
{'features': ['+', '+', '-', '+', '-', '-', '-', '0', '+', '-', '-', '-', '-', '-', '+', '+', '-', '+', '+', '+', '+'], 'word_position_prob': {0: 0.21468460464525954, 1: 0.4075390277954055, 2: 0.22210940474679

In [11]:
from arc import Syllable, Word, Register, Element

print("Registers are like Dicts:", isinstance(words, Register), isinstance(syllables, Register), isinstance(phonemes, Register), isinstance(words, dict))

print("Subsets of Registers are Registers again:", isinstance(words.get_subset(10), Register))

print("Elements are the things inside a Register:", isinstance(nu_ko_va, Element), isinstance(nu_ko_va, Word), isinstance(nu, Element), isinstance(nu, Syllable), isinstance(syllables[0], Syllable))

print("Sub-Elements of an Element are Lists:", isinstance(nu_ko_va.get_elements(), list), isinstance(nu_ko_va[:2], list))

Registers are like Dicts: True True True True
Subsets of Registers are Registers again: True
Elements are the things inside a Register: True True True True True
Sub-Elements of an Element are Lists: True True


In [12]:
syllables = words.flatten()
print(syllables)

muː|kaː|zøː|tuː|hiː|faː|ɡyː|moː|byː|nøː|... (75 elements total)
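
Flattening can be thought of as collecting every sub-element of every word exactly once. The sketch below uses plain dicts and lists in place of ARC types, and `flatten_sketch` is a hypothetical helper. It keeps first-seen order, whereas the order of ARC's output above evidently follows a different rule.

```python
def flatten_sketch(words):
    """Collect the unique syllables of all words in first-seen order."""
    unique = {}
    for syllables in words.values():
        for syllable in syllables:
            unique.setdefault(syllable, None)
    return list(unique)


words = {
    "nuːkøːvaː": ["nuː", "køː", "vaː"],
    "vaːɡɛːreː": ["vaː", "ɡɛː", "reː"],
}
flat = flatten_sketch(words)
print("|".join(flat))
```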


In [23]:
from arc.io import read_syllables_corpus

corpus = read_syllables_corpus()
half_corpus = corpus.get_subset(len(corpus)//2)

print(half_corpus)

print(syllables.intersection(corpus))

toːsts|ɪçs|kras|ʃtɔʏ|bsəŋ|zɔɐkt|iːz|darft|tseːn|flan|... (3198 elements total)
ɡaː|ɡiː|ɡyː|ɡɛː|kaː|koː|kuː|køː|kɛː|baː|... (75 elements total)
