# Extra Tutorial: ARC Typesystem
You will learn the basic data structures of the ARC Typesystem, as well as how to save and load the core ARC types. If you just want to reproduce the experiments from the paper, see `controlled_stream_generation.ipynb`.

## Basic Types

There are two types of objects we deal with in ARC:

An **Element** is any linguistic object of interest. In our case, `Phoneme`, `Syllable`, `Word`, and `Stream` are Elements. Elements can consist of other Elements: a `Word` and a `Stream` consist of `Syllable`s, a `Syllable` consists of `Phoneme`s, and `Phoneme`s are atomic. If an Element consists of multiple sub-elements, the sub-elements can repeat, e.g. a Phoneme can occur multiple times inside a Syllable. Elements can be part of a Register, which can be thought of as a corpus of Elements. Like in real corpora, every Element can be annotated, so it has an `.info` field that holds arbitrary annotations in dictionary format. Elements in a Register *don't repeat*.

A **Register** is essentially an ordered set with some extra functionality. We use this container type to create ordered collections of Elements that do not repeat, i.e. ordered sets of Phonemes, Syllables, and Words. Since every Element in the Register is unique in its string representation, it can be hashed and thus found quickly in memory. The `Lexicon` type is implemented as a Register of Words. Any collection of Phonemes, Syllables, or Words that we use to generate higher-level Elements is a Register as well.
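To make the difference between repeating sub-elements and non-repeating Register entries concrete, here is a minimal sketch in plain Python. It is not ARC's implementation: `ToyElement` and `ToyRegister` are hypothetical names that only mimic the behaviour described above.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToyElement:
    """Hypothetical stand-in for an ARC Element (not the real class)."""
    id: str
    info: dict[str, Any] = field(default_factory=dict)          # arbitrary annotations
    elements: list["ToyElement"] = field(default_factory=list)  # sub-elements may repeat

    def __str__(self) -> str:
        return self.id


class ToyRegister(dict):
    """Hypothetical ordered set of elements, keyed by their string representation."""

    def add(self, element: ToyElement) -> None:
        # setdefault keeps the first entry, so a duplicate is silently ignored
        self.setdefault(str(element), element)


b, a = ToyElement("b"), ToyElement("a")

# Sub-elements can repeat inside a composite element ...
bab = ToyElement("bab", elements=[b, a, b])

# ... but a register stores each element only once, in insertion order.
register = ToyRegister()
for phoneme in bab.elements:
    register.add(phoneme)

print([str(p) for p in bab.elements])  # ['b', 'a', 'b']
print(list(register))                  # ['b', 'a']
```

Keying the register by the string representation is what makes lookups fast: finding an element is a single dictionary access rather than a scan through a list.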

## Phonemes
Phonemes are the atomic units of the ARC Typesystem and form the basis for constructing other types like Syllables and Words.
To enjoy the full functionality of ARC, you'll need Phonemes annotated with their phonetic features. Luckily, ARC comes with an extensive corpus of Phonemes with phonetic features.
Let's load them and see what they look like.

In [28]:
from arc import load_phonemes
phonemes = load_phonemes()
print(phonemes)

k͡p|ɡ͡b|c|ɡ|k|q|ɖ|ɟ|ɠ|ɢ|... (5275 elements total)


The `phonemes` variable is a collection of `Phoneme` objects, more specifically a `Register`. What you see when you print any Register is a short summary of its first elements.
You can treat the Register like most Python collection types, meaning you can access elements, iterate over it, etc.

> Note: Internally, `Register`s are `OrderedDict`s (with some extra convenience methods). Essentially, you can treat a Register like both of the Python builtin types `Dict` and `List`.

Let's see that in action.

In [29]:
print("We can reference elements of a Corpus by position/index:", phonemes[0], ", or by its string representation:", phonemes["k"])

We can reference elements of a Corpus by position/index: k͡p , or by its string representation: k
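How can one container behave like both a `Dict` and a `List`? The following is a minimal sketch, assuming nothing about ARC's actual code: `IndexableRegister` is a hypothetical class that accepts integer positions and slices in addition to regular string keys.

```python
from collections import OrderedDict


class IndexableRegister(OrderedDict):
    """Hypothetical sketch: an OrderedDict that also accepts int and slice keys."""

    def __getitem__(self, key):
        if isinstance(key, int):
            # positional access, like a list
            return list(self.values())[key]
        if isinstance(key, slice):
            # slicing returns a new register containing the selected entries
            return IndexableRegister(list(self.items())[key])
        # everything else is treated as a regular dictionary key
        return super().__getitem__(key)


# Toy entries; the real Register holds Phoneme objects instead of plain dicts.
toy_phonemes = IndexableRegister((p, {"id": p}) for p in ["k͡p", "ɡ͡b", "c", "ɡ", "k"])

print(toy_phonemes[0]["id"])           # first entry, by position
print(toy_phonemes["k"]["id"])         # same kind of lookup, by string key
print(list(toy_phonemes[:2].keys()))   # a slice is again a register
```

The sketch converts to a list on every positional access, which is simple but linear in the size of the register; string-key lookups stay constant time.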


Internally, Elements are `Dict`-like objects, more specifically, [Pydantic](https://docs.pydantic.dev/latest/) types.

In [30]:
phonemes["k"]

Phoneme(id='k', info={'features': ['-', '-', '+', '-', '-', '-', '-', '0', '-', '-', '-', '-', '-', '0', '-', '+', '-', '+', '-', '0', '-']})

Annotations can be referenced via the `.info` property, which can hold arbitrary dictionary data.

In [31]:
phonemes["k"].info

{'features': ['-',
  '-',
  '+',
  '-',
  '-',
  '-',
  '-',
  '0',
  '-',
  '-',
  '-',
  '-',
  '-',
  '0',
  '-',
  '+',
  '-',
  '+',
  '-',
  '0',
  '-']}

Phoneme features can be hard to interpret, so you can also get features directly, e.g. the "is labial" binary feature, called `lab`:

In [32]:
phonemes["k"].get_binary_feature("lab")

False

Finally, you can get some help on which features the binary feature vector holds:

In [33]:
help(phonemes["k"].get_binary_feature)

Help on method get_binary_feature in module arc.core.phoneme:

get_binary_feature(label: Literal['syl', 'son', 'cons', 'cont', 'delrel', 'lat', 'nas', 'strid', 'voi', 'sg', 'cg', 'ant', 'cor', 'distr', 'lab', 'hi', 'lo', 'back', 'round', 'tense', 'long']) method of arc.core.phoneme.Phoneme instance
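To see how a label could relate to the feature vector, here is a small sketch in plain Python. It assumes that the order of the labels in the signature above matches the positions in the `features` list, and that `'+'` means the feature is present while `'-'` and `'0'` both count as absent. Both are assumptions made for illustration; `binary_feature` is a hypothetical helper, not ARC's method.

```python
# Labels copied from the help output above; their order is assumed to match
# the positions of the 'features' list.
FEATURE_LABELS = ['syl', 'son', 'cons', 'cont', 'delrel', 'lat', 'nas',
                  'strid', 'voi', 'sg', 'cg', 'ant', 'cor', 'distr', 'lab',
                  'hi', 'lo', 'back', 'round', 'tense', 'long']

# Feature vector of the phoneme "k", copied from the output shown earlier.
K_FEATURES = ['-', '-', '+', '-', '-', '-', '-', '0', '-', '-', '-',
              '-', '-', '0', '-', '+', '-', '+', '-', '0', '-']


def binary_feature(features, label):
    """Hypothetical helper: True if the feature at the label's position is '+'."""
    return features[FEATURE_LABELS.index(label)] == "+"


print(binary_feature(K_FEATURES, "lab"))  # False
print([label for label in FEATURE_LABELS if binary_feature(K_FEATURES, label)])
# ['cons', 'hi', 'back']
```

Under these assumptions the lookup agrees with the `False` returned for `lab` above.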



## Registers

While Registers in ARC print as compact summaries of their contents, they can be arbitrarily complex data structures.

In [34]:
print("This is the print output:", phonemes, end="\n\n")

from pprint import pprint
print("These are the first 2 entries of the Phonemes Register:", end="\n\n")
for ph in phonemes[:2]:
    pprint(ph)
    print("")

This is the print output: k͡p|ɡ͡b|c|ɡ|k|q|ɖ|ɟ|ɠ|ɢ|... (5275 elements total)

These are the first 2 entries of the Phonemes Register:

Phoneme(id='k͡p', info={'features': ['-', '-', '+', '-', '-', '-', '-', '0', '-', '-', '-', '0', '-', '0', '+', '+', '-', '0', '-', '0', '-']})

Phoneme(id='ɡ͡b', info={'features': ['-', '-', '+', '-', '-', '-', '-', '0', '+', '-', '-', '0', '-', '0', '+', '+', '-', '0', '-', '0', '-']})



Regardless of the contents, Elements and Registers are always JSON serializable, as long as they are valid (which is checked by Pydantic at initialization):

In [35]:
phonemes.to_json()

'{"k͡p": {"id": "k͡p", "info": {"features": ["-", "-", "+", "-", "-", "-", "-", "0", "-", "-", "-", "0", "-", "0", "+", "+", "-", "0", "-", "0", "-"]}}, "ɡ͡b": {"id": "ɡ͡b", "info": {"features": ["-", "-", "+", "-", "-", "-", "-", "0", "+", "-", "-", "0", "-", "0", "+", "+", "-", "0", "-", "0", "-"]}}, "c": {"id": "c", "info": {"features": ["-", "-", "+", "-", "-", "-", "-", "0", "-", "-", "-", "-", "-", "0", "-", "+", "-", "-", "-", "0", "-"]}}, "ɡ": {"id": "ɡ", "info": {"features": ["-", "-", "+", "-", "-", "-", "-", "0", "+", "-", "-", "-", "-", "0", "-", "+", "-", "+", "-", "0", "-"]}}, "k": {"id": "k", "info": {"features": ["-", "-", "+", "-", "-", "-", "-", "0", "-", "-", "-", "-", "-", "0", "-", "+", "-", "+", "-", "0", "-"]}}, "q": {"id": "q", "info": {"features": ["-", "-", "+", "-", "-", "-", "-", "0", "-", "-", "-", "-", "-", "0", "-", "-", "-", "+", "-", "0", "-"]}}, "ɖ": {"id": "ɖ", "info": {"features": ["-", "-", "+", "-", "-", "-", "-", "0", "+", "-", "-", "-", "+", "-",

... which means they can be written to file. The Register container type has a method for that:

In [36]:
phonemes.save("test_phonemes.json")

loaded_phonemes = load_phonemes("test_phonemes.json")
print(loaded_phonemes)

k͡p|ɡ͡b|c|ɡ|k|q|ɖ|ɟ|ɠ|ɢ|... (5275 elements total)
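The save and load round trip can be illustrated with the standard library alone. This is a sketch under stated assumptions, not what `save` or `load_phonemes` do internally: `toy_register` is a hypothetical dictionary that mimics the JSON layout printed above (element id as key, `id` and `info` as fields), and the annotation values are made up.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical toy data in the same shape as the JSON output shown above.
toy_register = {
    "k": {"id": "k", "info": {"note": "toy annotation"}},
    "ɡ": {"id": "ɡ", "info": {"note": "another toy annotation"}},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy_register.json"

    # ensure_ascii=False keeps IPA symbols readable in the file
    path.write_text(json.dumps(toy_register, ensure_ascii=False), encoding="utf-8")

    # Reading the file back yields an equal dictionary with the same key order
    restored = json.loads(path.read_text(encoding="utf-8"))

print(restored == toy_register)
print(list(restored))
```

Writing with an explicit `utf-8` encoding avoids platform-dependent defaults, which matters for a corpus made up of IPA symbols.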


## Syllables
Our first composite type is the `Syllable`, which consists of a list of `Phoneme`s. Let's make a collection of syllables that follow the `cV` pattern, meaning they consist of a single-character phoneme `c` followed by a long vowel `V`.

In [37]:
from arc.core.syllable import make_feature_syllables
artificial_syllables = make_feature_syllables(phonemes, phoneme_pattern="cV")
print(artificial_syllables)

cʔː|cɥː|cɰː|cʋː|cʍː|cjː|cwː|cɹː|cɻː|cɑː|... (2294 elements total)
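The idea of expanding a pattern into syllables can be sketched as a Cartesian product over the candidates for each slot. The example below follows the reading of `cV` given above (single-character consonant, then long vowel) and uses a tiny hypothetical inventory. It does not reproduce how `make_feature_syllables` selects phonemes from their features.

```python
from itertools import product

# Hypothetical mini inventory of (symbol, is_vowel) pairs; not ARC's corpus.
inventory = [("k", False), ("b", False), ("k͡p", False),
             ("a", True), ("aː", True), ("iː", True)]


def candidates(slot):
    """Return the symbols that may fill one slot of the pattern."""
    if slot == "c":  # single-character consonant
        return [s for s, is_vowel in inventory if not is_vowel and len(s) == 1]
    if slot == "V":  # long vowel, marked with the IPA length sign
        return [s for s, is_vowel in inventory if is_vowel and s.endswith("ː")]
    raise ValueError(f"unknown pattern slot: {slot!r}")


def make_toy_syllables(pattern):
    """Combine every candidate of each slot, in pattern order."""
    slots = [candidates(slot) for slot in pattern]
    return ["".join(combo) for combo in product(*slots)]


print("|".join(make_toy_syllables("cV")))  # kaː|kiː|baː|biː
```

The multi-character `k͡p` and the short vowel `a` are excluded by the slot rules, so two consonants and two long vowels yield four syllables.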


They behave pretty much like Phonemes.

In [38]:
print(artificial_syllables["cʔː"], artificial_syllables[1])
artificial_syllables["cʔː"], artificial_syllables[1]

cʔː cɥː


(Syllable(id='cʔː', info={'binary_features': [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], 'phonotactic_features': [['plo', 'oth'], []]}, phonemes=[Phoneme(id='c', info={'features': ['-', '-', '+', '-', '-', '-', '-', '0', '-', '-', '-', '-', '-', '0', '-', '+', '-', '-', '-', '0', '-']}), Phoneme(id='ʔː', info={'features': ['-', '+', '-', '-', '-', '-', '-', '0', '-', '-', '+', '-', '-', '0', '-', '-', '-', '-', '-', '0', '+']})]),
 Syllable(id='cɥː', info={'binary_features': [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1], 'phonotactic_features': [['plo', 'oth'], []]}, phonemes=[Phoneme(id='c', info={'features': ['-', '-', '+', '-', '-', '-', '-', '0', '-', '-', '-', '-', '-', '0', '-', '+', '-', '-', '-', '0', '-']}), Phoneme(id='ɥː', info={'features': ['-', '+', '-', '+', '0', '-', '-', '0', '+', '-', '-', '0', '-', '0', '+', '+', '-', '-', '+', '+', '+']})]))

... except that they have an additional `phonemes` field:

In [39]:
artificial_syllables["cʔː"].phonemes

[Phoneme(id='c', info={'features': ['-', '-', '+', '-', '-', '-', '-', '0', '-', '-', '-', '-', '-', '0', '-', '+', '-', '-', '-', '0', '-']}),
 Phoneme(id='ʔː', info={'features': ['-', '+', '-', '-', '-', '-', '-', '0', '-', '-', '+', '-', '-', '0', '-', '-', '-', '-', '-', '0', '+']})]

With any composite element, you can use the same function to get the sub-elements:

In [40]:
artificial_syllables["cʔː"].get_elements()

[Phoneme(id='c', info={'features': ['-', '-', '+', '-', '-', '-', '-', '0', '-', '-', '-', '-', '-', '0', '-', '+', '-', '-', '-', '0', '-']}),
 Phoneme(id='ʔː', info={'features': ['-', '+', '-', '-', '-', '-', '-', '0', '-', '-', '+', '-', '-', '0', '-', '-', '-', '-', '-', '0', '+']})]

This is not a `Register` but a standard Python `List`, since phonemes can repeat inside a syllable, and the collections are usually rather small.

Finally, you can iterate over both the Elements of a Register and the sub-elements of an Element:

In [41]:
for syllable in artificial_syllables[:2]:
    print("Syllable", syllable, "consists of phonemes", end="")
    for phoneme in syllable:
        print(" ", end="")
        print(phoneme, end="")
    print("")

Syllable cʔː consists of phonemes c ʔː
Syllable cɥː consists of phonemes c ɥː


## Merge and Filter operations for Registers

Since we started with an international Phoneme Register, and because we generate artificial syllables, there will be many Syllables we do not want to include in our further analysis. Let's filter out some of them.

We'll start by filtering based on a real corpus of syllables. ARC comes with an example corpus in German, so let's use that as an example. 

> The filter implementations are specific to the German corpus, so you might want to implement your own filters. We will discuss that in a later tutorial. If you are curious, you can take a look at the `arc.controls.filter` submodule to see how to implement a filter.

In [42]:
from arc.io import read_syllables_corpus
german_syllable_corpus = read_syllables_corpus()  # defaults to the German corpus that comes with ARC
print(german_syllable_corpus)

jaː|ɪç|das|n|dan|ɡə|tn|ə|daː|diː|... (6397 elements total)


In [43]:
artificial_syllables_valid_german = artificial_syllables.intersection(german_syllable_corpus)
print(artificial_syllables_valid_german)

ɡaː|ɡeː|ɡiː|ɡoː|ɡuː|ɡyː|ɡøː|ɡɛː|kaː|keː|... (130 elements total)
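Conceptually, intersecting two Registers keeps only the entries whose string representation occurs in both. Here is a minimal sketch with plain dictionaries. `ordered_intersection` is a hypothetical helper, and keeping the order and the annotations of the left operand is an assumption, not documented ARC behaviour.

```python
def ordered_intersection(left, right):
    """Keep the entries of `left` whose keys also occur in `right`, in `left`'s order."""
    return {key: value for key, value in left.items() if key in right}


# Toy registers: keys are syllable strings, values stand in for Syllable objects.
artificial = {"cʔː": "artificial", "ɡaː": "artificial", "kaː": "artificial"}
corpus = {"kaː": "corpus", "das": "corpus", "ɡaː": "corpus"}

valid = ordered_intersection(artificial, corpus)
print("|".join(valid))  # ɡaː|kaː
```

Because membership tests on dictionary keys are hash lookups, the cost grows with the size of the left register only.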


In our original publication, we filter syllables based on the p-value for the syllable being uniformly distributed with the others. This is implemented as a filter:

In [44]:
from arc.controls.filter import filter_uniform_syllables, filter_common_phoneme_syllables

syllables_german_filtered = filter_uniform_syllables(artificial_syllables_valid_german)
print("Syllables with uniform probability of occurrence: ", syllables_german_filtered)

syllables_german_filtered = filter_common_phoneme_syllables(syllables_german_filtered)
print("Syllables with common phonemes: ", syllables_german_filtered)

Syllables with uniform probability of occurrence:  ɡaː|ɡiː|ɡyː|ɡɛː|kaː|koː|kuː|køː|kɛː|baː|... (77 elements total)
Syllables with common phonemes:  ɡaː|ɡiː|ɡyː|ɡɛː|kaː|koː|kuː|køː|kɛː|baː|... (76 elements total)
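If you want to write your own filter, the general shape is a function that takes a Register and returns a smaller one. The sketch below uses plain dictionaries and made-up values: the `p_unif` annotation, the `0.05` threshold, and the helper `filter_register` are all hypothetical and are not taken from ARC's filter module.

```python
def filter_register(register, keep):
    """Return a new dict with only the entries for which `keep(element)` is true."""
    return {key: element for key, element in register.items() if keep(element)}


# Toy syllables with a made-up 'p_unif' annotation (hypothetical field name).
toy_syllables = {
    "ɡaː": {"id": "ɡaː", "info": {"p_unif": 0.40}},
    "ɡeː": {"id": "ɡeː", "info": {"p_unif": 0.01}},
    "kaː": {"id": "kaː", "info": {"p_unif": 0.25}},
}

THRESHOLD = 0.05  # illustrative cut-off, not the value used in the publication

kept = filter_register(toy_syllables, lambda s: s["info"]["p_unif"] > THRESHOLD)
print("|".join(kept))  # ɡaː|kaː
```

Passing the criterion as a function keeps the filtering loop generic, so several criteria can be chained by calling `filter_register` repeatedly.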


If you have a native (in our case German) phoneme corpus, you can filter the syllables based on that.

## Export to SSML
Once we are done making syllables, we can export them to Speech Synthesis Markup Language (SSML) for later reference.

In [45]:
from arc.io import export_speech_synthesiser
export_speech_synthesiser(syllables_german_filtered, syllables_dir="results/ssml")

Done
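For orientation, SSML describes pronunciations with the `<phoneme>` element, which takes an `alphabet` and a `ph` attribute. The sketch below builds such a snippet for a single syllable. It illustrates the markup only: `syllable_to_ssml` is a hypothetical helper, the `de-DE` language tag is an assumption, and the files written by `export_speech_synthesiser` may look different.

```python
from xml.sax.saxutils import escape, quoteattr


def syllable_to_ssml(ipa, lang="de-DE"):
    """Hypothetical helper: wrap one IPA string in a minimal SSML document."""
    return (
        f'<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(ipa)}</phoneme>'
        "</speak>"
    )


print(syllable_to_ssml("ɡaː"))
```

`quoteattr` and `escape` take care of characters that would otherwise break the XML, so the helper stays safe for arbitrary input strings.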


## Words
`Word`s are made out of `Syllable`s, just as we made Syllables from Phonemes before.

Since one of ARC's main features is rhythmicity control, our `make_words` function will only create words that have minimum overlap of phonotactic features. By default, this function generates 10,000 words, but you can change that with the `n_words` option. With 10,000 words, this should run fairly quickly. If you set the number higher, you may also want to set the `progress_bar=True` flag in the function arguments.

In [46]:
from arc import make_words
words = make_words(syllables_german_filtered, n_words=10_000, progress_bar=True)
print(words)



bigram control...
trigram control...
positional control...


100%|██████████| 10000/10000 [00:22<00:00, 442.18it/s]

heːtuːfyː|doːfyːhøː|tyːfaːhoː|hoːtuːfaː|luːfyːɡiː|deːfaːhoː|faːniːkoː|riːɡaːfoː|foːheːtyː|ʃoːɡiːmuː|... (1844 elements total)
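To get a feeling for this kind of constrained generation, here is a toy sketch based on rejection sampling. It swaps ARC's phonotactic feature controls for a much simpler stand-in rule (no phoneme symbol may occur twice within a word), so it does not reproduce `make_words`. `make_toy_words` and its parameters are hypothetical.

```python
import random


def make_toy_words(syllables, n_candidates, n_syllables=3, seed=0):
    """Draw random syllable combinations and keep the unique ones that pass the rule."""
    rng = random.Random(seed)
    words = {}
    for _ in range(n_candidates):
        combo = rng.sample(syllables, n_syllables)
        # Stand-in rule: reject the candidate if any symbol repeats
        # (the length mark is ignored for this check).
        symbols = "".join(combo).replace("ː", "")
        if len(set(symbols)) < len(symbols):
            continue
        # Duplicates are dropped; the first occurrence is kept.
        words.setdefault("".join(combo), combo)
    return words


toy_syllables = ["heː", "tuː", "fyː", "doː", "høː", "faː", "niː", "koː"]
toy_words = make_toy_words(toy_syllables, n_candidates=200)
print(len(toy_words), "unique words from 200 candidates")
```

In this toy version, the number of returned words can be smaller than the number of candidates, because rejected and duplicate candidates are dropped.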





Again, we apply some filters, but this time at the word level.

In [47]:
from arc.controls.filter import filter_common_phoneme_words, filter_gram_stats

words_filtered = filter_common_phoneme_words(words, position=0)
print(words_filtered)

#words_filtered = filter_gram_stats(words_filtered)
#print(words_filtered)

heːtuːfyː|doːfyːhøː|tyːfaːhoː|hoːtuːfaː|luːfyːɡiː|deːfaːhoː|faːniːkoː|riːɡaːfoː|foːheːtyː|ʃoːɡiːmuː|... (1844 elements total)


In [48]:
print(words_filtered.info)

{'phoneme_feature_labels': ['syl', 'son', 'cons', 'cont', 'delrel', 'lat', 'nas', 'strid', 'voi', 'sg', 'cg', 'ant', 'cor', 'distr', 'lab', 'hi', 'lo', 'back', 'round', 'tense', 'long'], 'syllable_feature_labels': [['son', 'back', 'hi', 'lab', 'cor', 'cont', 'lat', 'nas', 'voi'], ['back', 'hi', 'lo', 'lab', 'tense', 'long']], 'syllable_type': 'cV', 'bigram_pval': None, 'bigrams_count': 1473, 'trigram_pval': None, 'trigrams_count': 21266}


Even with all the phonotactic conditions we applied, there may be many words left to choose from to build our `Lexicon`s and `Stream`s later on.

However, we can always get a random subsample of a Register by running:

In [49]:
words_subset = words_filtered.get_subset(100)
print(words_subset, words_subset.info)

fyːhiːdeː|doːfaːhiː|kaːfoːruː|luːɡiːfaː|ʃuːɡiːmoː|ʃiːpoːheː|ʃiːpoːhuː|høːtoːfuː|ʃuːbiːhøː|nuːkɛːfoː|... (100 elements total) {'phoneme_feature_labels': ['syl', 'son', 'cons', 'cont', 'delrel', 'lat', 'nas', 'strid', 'voi', 'sg', 'cg', 'ant', 'cor', 'distr', 'lab', 'hi', 'lo', 'back', 'round', 'tense', 'long'], 'syllable_feature_labels': [['son', 'back', 'hi', 'lab', 'cor', 'cont', 'lat', 'nas', 'voi'], ['back', 'hi', 'lo', 'lab', 'tense', 'long']], 'syllable_type': 'cV', 'bigram_pval': None, 'bigrams_count': 1473, 'trigram_pval': None, 'trigrams_count': 21266}
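Drawing a random subsample from an ordered collection can be sketched with the standard library. This is an illustration with a plain dictionary, not the implementation of `get_subset`; `get_random_subset` and its `seed` parameter are hypothetical.

```python
import random


def get_random_subset(register, k, seed=None):
    """Return `k` randomly chosen entries of `register` as a new dict."""
    rng = random.Random(seed)
    chosen_keys = rng.sample(list(register), k)  # sampling without replacement
    return {key: register[key] for key in chosen_keys}


# Toy register: keys are word strings, values stand in for Word objects.
toy_words = {word: {"id": word} for word in
             ["heːtuːfyː", "doːfyːhøː", "tyːfaːhoː", "hoːtuːfaː", "luːfyːɡiː"]}

subset = get_random_subset(toy_words, 3, seed=42)
print(len(subset), all(key in toy_words for key in subset))  # 3 True
```

Sampling without replacement guarantees that the subset contains no duplicates, which matches the requirement that Elements in a Register do not repeat.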


In [50]:
words_subset.save("results/words.json")

This concludes our first tutorial.
We've made `Syllable`s from `Phoneme`s and `Word`s from `Syllable`s, and applied filters to them.
Finally, we saved the generated words to a JSON file.
In the other tutorial, we will pick up where we left off and load the saved words to generate a `Lexicon`, a Register of `Word`s with specific phonotactic requirements. Later, we will use Lexicons to generate different types of streams.