# ARC Typesystem
In this tutorial, you will learn the basic data structures of the ARC Typesystem, as well as how to save and load the core ARC types.

> ⚠️ We recommend using a virtual environment

> ⚠️ If you use a virtual environment, make sure you use the right kernel for this notebook. You can usually select it in the top right corner. If your environment is not in the list, you have to add the ipython kernel from the environment like so:
> 1. Activate virtual environment in terminal
> 2. Run `pip install ipykernel`
> 3. Run `python -m ipykernel install --user --name arc --display-name "Python (ARC)"`
> 4. Reload this page

In [44]:
%pip install --upgrade git+https://github.com/milosen/arc.git

Collecting git+https://github.com/milosen/arc.git
  Cloning https://github.com/milosen/arc.git to /private/var/folders/n1/bxdrmv296493f6tbg9v8pjnh0000gn/T/pip-req-build-qkr5cw7n
  Running command git clone --filter=blob:none --quiet https://github.com/milosen/arc.git /private/var/folders/n1/bxdrmv296493f6tbg9v8pjnh0000gn/T/pip-req-build-qkr5cw7n
  Resolved https://github.com/milosen/arc.git to commit 7c80f8d820e0f0241d860e90cd76e23fdcbd9b37
  Preparing metadata (setup.py) ... done
Note: you may need to restart the kernel to use updated packages.


## Basic Types

There are two types of objects we deal with in ARC:

An **Element** is any linguistic object of interest. In our case, `Phoneme`, `Syllable`, `Word`, and `Stream` are Elements. Elements can consist of other Elements in a dictionary-style fashion: a `Word` and a `Stream` consist of `Syllable`s, a `Syllable` consists of `Phoneme`s, and `Phoneme`s are atomic. If an Element consists of multiple sub-elements, as a Syllable does, the sub-elements can repeat, e.g. a Phoneme can occur multiple times inside a Syllable. Elements can be part of a Register, which can be thought of as a corpus of Elements. As in real corpora, every Element can be annotated, so it has an `.info` field that holds arbitrary annotations in dictionary format. Elements in a Register *don't repeat*.

A **Register** is essentially an ordered set with some extra functionality. We use this container type to create ordered collections of Elements that do not repeat, i.e. ordered, annotated sets of Phonemes, Syllables, and Words (like little corpora). Since every Element in the Register has a unique string representation, it can be hashed and thus found quickly in memory. The `Lexicon` type is implemented as a Register of Words, and any collection of Phonemes, Syllables, or Words that we use to generate higher-level elements is a Register as well.

In summary
- `Phoneme`, `Syllable`, `Word` are subclasses of `Element`
- a `Stream` is like a `Word`, just longer, since both consist of repeatable `Syllable` objects
- whenever you see multiple elements, e.g. a `phonemes` object, it's a `Register`
- `Lexicon` is a special `Register` of `Word`s
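
The difference between repeating sub-elements and non-repeating Register entries can be illustrated with a small standard-library sketch. The classes `SketchElement`, `SketchSyllable`, and `SketchRegister` are hypothetical stand-ins written for this illustration. They are not ARC's classes and only mimic the behaviour described above.

```python
from collections import OrderedDict
from dataclasses import dataclass, field


@dataclass
class SketchElement:
    """Hypothetical stand-in for an Element: an id plus free-form annotations."""
    id: str
    info: dict = field(default_factory=dict)


@dataclass
class SketchSyllable(SketchElement):
    """Hypothetical composite element; its sub-elements may repeat."""
    phonemes: list = field(default_factory=list)


class SketchRegister(OrderedDict):
    """Hypothetical stand-in for a Register: ordered, keyed by element id."""

    def add(self, element):
        # Keys are unique, so adding an id that is already present changes nothing.
        self.setdefault(element.id, element)


k, a = SketchElement("k"), SketchElement("a")

# Inside a composite element, the same phoneme can occur more than once.
syllable = SketchSyllable(id="kak", phonemes=[k, a, k])
print([p.id for p in syllable.phonemes])  # ['k', 'a', 'k']

# Inside a register, each element appears only once, in insertion order.
register = SketchRegister()
for element in syllable.phonemes:
    register.add(element)
print(list(register))  # ['k', 'a']
```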

## Phonemes
Phonemes are the atomic unit of the ARC Typesystem and form the basis for constructing other types like Syllables and Words.
To enjoy the full functionality of ARC, you'll need Phonemes annotated with their phonetic features. Luckily, ARC comes with an extensive corpus of Phonemes with phonetic features.
Let's load them and see what they look like.

In [1]:
from arc import load_phonemes
help(load_phonemes)

Help on function load_phonemes in module arc.io:

load_phonemes(path_to_json: Union[str, os.PathLike, NoneType] = None, language_control=False) -> ~RegisterType



In [8]:
phonemes = load_phonemes()
print(phonemes)

k͡p|ɡ͡b|c|ɡ|k|q|ɖ|ɟ|ɠ|ɢ|... (5275 elements total)


The `phonemes` variable is a collection of `Phoneme` objects, more specifically a `Register`. What you see when you print any Register is a short summary of its first elements.
You can treat the Register like most Python collection types, meaning you can access elements, iterate over it, etc.

> Note: Internally, `Register`s are `OrderedDict`s (with some extra convenience methods). Essentially, you can treat a Register like both of the Python builtin types `Dict` and `List`.

Let's see that in action.

In [9]:
print("We can reference elements of a Corpus by position/index:", phonemes[0], ", or by its string representation:", phonemes["k"])

We can reference elements of a Corpus by position/index: k͡p , or by its string representation: k
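
To make the dual access pattern concrete, here is a minimal sketch of a container that accepts both an integer position and a string key. `DualAccessRegister` is a hypothetical class written for this illustration and says nothing about how ARC implements the lookup internally.

```python
class DualAccessRegister:
    """Hypothetical container that can be indexed by position or by string id."""

    def __init__(self, ids):
        # A dict keeps insertion order and drops repeated ids.
        self._items = {element_id: element_id for element_id in ids}

    def __getitem__(self, key):
        if isinstance(key, int):
            # List-style access: look up by position.
            return list(self._items.values())[key]
        # Dict-style access: look up by string representation.
        return self._items[key]

    def __len__(self):
        return len(self._items)


sketch = DualAccessRegister(["p", "b", "k", "p"])
print(sketch[0])    # p
print(sketch["k"])  # k
print(len(sketch))  # 3
```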


Internally, Elements are `Dict`-like objects, more specifically, [Pydantic](https://docs.pydantic.dev/latest/) types.

In [10]:
print(phonemes["k"])
phonemes["k"]

k


Phoneme(id='k', info={'features': ['-', '-', '+', '-', '-', '-', '-', '0', '-', '-', '-', '-', '-', '0', '-', '+', '-', '+', '-', '0', '-']})

Annotations can be referenced via the `.info` property, which can hold arbitrary dictionary data:

In [11]:
print(phonemes["k"].info)

{'features': ['-', '-', '+', '-', '-', '-', '-', '0', '-', '-', '-', '-', '-', '0', '-', '+', '-', '+', '-', '0', '-']}


Phoneme features can be hard to interpret as a raw list, so you can also query a single feature by its label, e.g. the "is labial" binary feature, called `lab`:

In [12]:
help(phonemes["k"].get_binary_feature)

Help on method get_binary_feature in module arc.types.phoneme:

get_binary_feature(label: Literal['syl', 'son', 'cons', 'cont', 'delrel', 'lat', 'nas', 'strid', 'voi', 'sg', 'cg', 'ant', 'cor', 'distr', 'lab', 'hi', 'lo', 'back', 'round', 'tense', 'long']) method of arc.types.phoneme.Phoneme instance



In [13]:
phonemes["k"].get_binary_feature("lab")

False
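
One way to picture this lookup is as an index into the feature list. The sketch below assumes that the feature list is ordered like the labels in the signature above and that only `'+'` counts as true. Both are assumptions made for illustration. `sketch_binary_feature` is a hypothetical helper and not part of ARC.

```python
# Labels copied from the get_binary_feature signature shown above.
FEATURE_LABELS = [
    "syl", "son", "cons", "cont", "delrel", "lat", "nas", "strid", "voi", "sg",
    "cg", "ant", "cor", "distr", "lab", "hi", "lo", "back", "round", "tense", "long",
]

# Feature list of the phoneme "k", copied from the output above.
K_FEATURES = ['-', '-', '+', '-', '-', '-', '-', '0', '-', '-', '-',
              '-', '-', '0', '-', '+', '-', '+', '-', '0', '-']


def sketch_binary_feature(features, label):
    """Hypothetical lookup: True only if the feature at the label's position is '+'."""
    return features[FEATURE_LABELS.index(label)] == "+"


print(sketch_binary_feature(K_FEATURES, "lab"))   # False
print(sketch_binary_feature(K_FEATURES, "cons"))  # True
```

Under these assumptions, the sketch reproduces the `False` result for `lab` shown above.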

## The Register

While Registers in ARC print as compact summaries of their contents, they can be arbitrarily complex data structures.

Regardless of the contents, Elements and Registers are always JSON serializable, as long as they are valid (which is checked at initialization):

In [14]:
print(phonemes.to_json()[:80] + "...")

{"k͡p": {"id": "k͡p", "info": {"features": ["-", "-", "+", "-", "-", "-", "-", "...


... which means they can be written to file and read later as a register. The Register container type has methods for reading and writing:

In [15]:
help(phonemes.save)

Help on method save in module arc.types.base_types:

save(path: Union[str, os.PathLike] = None) method of arc.types.base_types.Register instance



In [16]:
import os

os.makedirs("results", exist_ok=True)
phonemes.save(os.path.join("results", "test_phonemes.json"))
loaded_phonemes = load_phonemes(os.path.join("results", "test_phonemes.json"))
print(loaded_phonemes)

k͡p|ɡ͡b|c|ɡ|k|q|ɖ|ɟ|ɠ|ɢ|... (5275 elements total)
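
The same round trip can be sketched with the standard library alone. The dictionary below mirrors the `{"id": ..., "info": ...}` shape visible in the `to_json()` output above, with shortened, made-up feature lists. The file name is hypothetical, and the sketch does not use ARC's `save` or `load_phonemes`.

```python
import json
import os
import tempfile

# Register-shaped data: element id -> {"id": ..., "info": ...}
sketch_register = {
    "k": {"id": "k", "info": {"features": ["-", "+"]}},
    "a": {"id": "a", "info": {"features": ["+", "-"]}},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sketch_register.json")

    # ensure_ascii=False keeps IPA symbols readable in the file.
    with open(path, "w", encoding="utf-8") as file:
        json.dump(sketch_register, file, ensure_ascii=False)

    with open(path, encoding="utf-8") as file:
        restored = json.load(file)

print(restored == sketch_register)  # True
print(list(restored))               # ['k', 'a']
```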


### Some useful operations

In [17]:
from arc import Register
# help(Register)

In [18]:
from arc.io import read_phoneme_corpus
corpus_phonemes = read_phoneme_corpus()
print(phonemes)
phonemes = phonemes.intersection(corpus_phonemes)
print(phonemes)

k͡p|ɡ͡b|c|ɡ|k|q|ɖ|ɟ|ɠ|ɢ|... (5275 elements total)
ɡ|k|b|d|p|t|x|ç|ʃ|f|... (38 elements total)
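
Conceptually, intersecting two registers keeps only the elements whose ids occur in both. The sketch below uses plain dictionaries and assumes that the order of the left-hand collection is preserved. `ordered_intersection` is a hypothetical helper, not ARC's `intersection` method.

```python
def ordered_intersection(left, right):
    """Keep the entries of `left` whose keys also occur in `right`, in `left`'s order."""
    return {key: value for key, value in left.items() if key in right}


all_phonemes = {"c": 1, "ɡ": 2, "k": 3, "q": 4}
corpus_sketch = {"k": 30, "ɡ": 20, "b": 10}

common = ordered_intersection(all_phonemes, corpus_sketch)
print(list(common))  # ['ɡ', 'k']
```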


## Syllables
Our first composite type is the `Syllable`, which consists of a list of `Phoneme`s. Let's make a collection of syllables that follow the `cV` pattern, meaning they consist of a single-character consonant `c` followed by a long vowel `V`.

In [19]:
from arc.core.syllable import make_syllables
help(make_syllables)

Help on function make_syllables in module arc.core.syllable:

make_syllables(phonemes: ~RegisterType, phoneme_pattern: str = 'cV', unigram_control: bool = True, language_control: bool = True, language_alpha: Optional[float] = 0.05, from_format: Literal['ipa', 'xsampa'] = 'xsampa', lang: str = 'deu') -> ~RegisterType
    _summary_
    
    Args:
        phonemes (RegisterType): A Register of phonemes that will be used as a basis to generate the syllables
        phoneme_pattern (str, optional): describes how a syllable is structured, e.g. "cV" syllables consist of a single-consonant character and a long vowel. Defaults to "cV".
        unigram_control (bool, optional): apply statistical control (on the basis of p-val of uniform distribution) to single unigrams. Defaults to True.
        language_control (bool, optional): apply language specific controls (only german for now) on the syllable level. Defaults to True.
        language_alpha (Optional[float], optional): which p-value to ass

In [20]:
syllables = make_syllables(phonemes, phoneme_pattern="cV")
print(syllables)

ɡaː|ɡiː|ɡyː|ɡɛː|kaː|koː|kuː|køː|kɛː|baː|... (75 elements total)
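
Without any of the statistical or language controls, the `cV` pattern amounts to pairing every consonant with every long vowel. The sketch below shows only that combinatorial core on a tiny hand-picked inventory. It is not how `make_syllables` is implemented, and it skips the filtering that reduces the real result to 75 syllables.

```python
from itertools import product

# Tiny hand-picked inventory, for illustration only.
consonants = ["ɡ", "k"]
long_vowels = ["aː", "iː"]

# Pair each consonant with each long vowel, keeping the inventory order.
cv_syllables = [consonant + vowel for consonant, vowel in product(consonants, long_vowels)]
print("|".join(cv_syllables))  # ɡaː|ɡiː|kaː|kiː
```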


The syllables register behaves much like the phonemes register, except that each syllable is itself composed of phonemes.

In [21]:
print(syllables["ɡaː"], syllables[1])
syllables["ɡaː"], syllables[1]

ɡaː ɡiː


(Syllable(id='ɡaː', info={'binary_features': [0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1], 'phonotactic_features': [['plo', 'oth'], ['a']], 'freq': 85, 'prob': 8.41048e-05}, phonemes=[Phoneme(id='ɡ', info={'features': ['-', '-', '+', '-', '-', '-', '-', '0', '+', '-', '-', '-', '-', '0', '-', '+', '-', '+', '-', '0', '-'], 'word_position_prob': {0: 0.6206003986790039, 1: 0.0370116926006367, 2: 0.18437416322037428, 3: 0.0536729047038172, 4: 0.03079349022641397, 5: 0.030019933950194876, 6: 0.016601707774240575, 7: 0.013061200202314719, 8: 0.004938859302014221, 9: 0.00395703787450537, 10: 0.0016958733747880158, 11: 0.0011603344143286424, 12: 0.0006545476183392342, 13: 0.00038677813810954746, 14: 0.0003570259736395823, 15: 0.00020826515128975634, 16: 0.00014876082234982596, 17: 0.00017851298681979114, 18: 5.950432893993038e-05, 19: 5.950432893993038e-05, 20: 0.0, 21: 0.0, 22: 2.975216446996519e-05, 23: 0.0, 24: 2.975216446996519e-05}}), Phoneme(id='aː', info={'features': ['+', '+', '-', '

You can get the sub-elements of any composite element as a list:

In [22]:
syllables["ɡaː"].get_elements()

[Phoneme(id='ɡ', info={'features': ['-', '-', '+', '-', '-', '-', '-', '0', '+', '-', '-', '-', '-', '0', '-', '+', '-', '+', '-', '0', '-'], 'word_position_prob': {0: 0.6206003986790039, 1: 0.0370116926006367, 2: 0.18437416322037428, 3: 0.0536729047038172, 4: 0.03079349022641397, 5: 0.030019933950194876, 6: 0.016601707774240575, 7: 0.013061200202314719, 8: 0.004938859302014221, 9: 0.00395703787450537, 10: 0.0016958733747880158, 11: 0.0011603344143286424, 12: 0.0006545476183392342, 13: 0.00038677813810954746, 14: 0.0003570259736395823, 15: 0.00020826515128975634, 16: 0.00014876082234982596, 17: 0.00017851298681979114, 18: 5.950432893993038e-05, 19: 5.950432893993038e-05, 20: 0.0, 21: 0.0, 22: 2.975216446996519e-05, 23: 0.0, 24: 2.975216446996519e-05}}),
 Phoneme(id='aː', info={'features': ['+', '+', '-', '+', '-', '-', '-', '0', '+', '-', '-', '-', '-', '-', '-', '-', '+', '+', '-', '+', '+'], 'word_position_prob': {0: 0.12874454553597706, 1: 0.7254789051403124, 2: 0.04774335547109934,

Finally, you can iterate over both the Elements of a Register and the sub-elements of an Element:

In [23]:
for syllable in syllables[:2]:
    print("Syllable", syllable, "consists of phonemes ", end="")
    for phoneme in syllable:
        print(phoneme, end=" ")
    print("")

Syllable ɡaː consists of phonemes ɡ aː 
Syllable ɡiː consists of phonemes ɡ iː 


## Export to SSML
Once we are done making syllables, we can export them to Speech Synthesis Markup Language (SSML) for later reference.

In [24]:
from arc.io import export_speech_synthesiser
export_speech_synthesiser(syllables, syllables_dir=os.path.join("results", "ssml"))

Done


## Words
`Word`s are made out of `Syllable`s, in the same way that we made syllables from phonemes.

Since one of ARC's main features is rhythmicity control, our `make_words` function will only create words that have minimum overlap of phonotactic features. By default, this function generates 10000 words, but you can change that with the `n_words` option. With 10000 words, this should run fairly quickly. If you set the number higher, however, you may also want to pass `progress_bar=True` to the function.

In [25]:
from arc import make_words
help(make_words)

Help on function make_words in module arc.core.word:

make_words(syllables: ~RegisterType, num_syllables=3, bigram_control=True, bigram_alpha=None, trigram_control=True, trigram_alpha=None, positional_control=True, positional_control_position=None, position_alpha=0, phonotactic_control=True, n_look_back=2, n_words=10000, max_tries=100000, progress_bar: bool = True) -> ~RegisterType
    _summary_
    
    Args:
        syllables (RegisterType): The Register of syllables to use as a basis for word generation
        num_syllables (int, optional): how many syllables are in a word. Defaults to 3.
        bigram_control (bool, optional): apply statistical control on the bigram level. Defaults to True.
        bigram_alpha (_type_, optional): which p-value to assume for bigram control. Defaults to None.
        trigram_control (bool, optional): apply statistical control on the trigram level. Defaults to True.
        trigram_alpha (_type_, optional): which p-value to assume for trigram contr

In [27]:
words = make_words(syllables, n_words=10_000, progress_bar=False, positional_control=True, position_alpha=0.001)
print(words)

bigram control...
trigram control...
positional control...
tuːfiːheː|biːhøːʃaː|biːnyːçaː|høːbyːsiː|baːhuːʃoː|ʃøːmeːɡiː|muːʃiːɡaː|ɡyːʃuːmeː|puːʃaːhiː|doːhiːfuː|... (9720 elements total)
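
To give an intuition for overlap control with a look-back window, the following sketch draws random syllable triples and rejects any candidate in which a syllable shares a feature with one of the `n_look_back` preceding syllables. This is a strong simplification built on assumptions: each syllable's "features" are simply its consonant and its vowel, and the bigram, trigram, and positional controls are left out. `sketch_make_words` is a hypothetical function and does not reproduce ARC's `make_words`.

```python
import random

# Hypothetical feature sets: each syllable is described by its consonant and vowel.
SYLLABLES = {
    "ɡaː": {"ɡ", "aː"}, "ɡiː": {"ɡ", "iː"},
    "kaː": {"k", "aː"}, "kuː": {"k", "uː"},
    "biː": {"b", "iː"}, "buː": {"b", "uː"},
}


def sketch_make_words(syllables, num_syllables=3, n_look_back=2,
                      n_words=5, max_tries=1000, seed=0):
    """Rejection-sample words whose nearby syllables share no features."""
    rng = random.Random(seed)
    ids = list(syllables)
    words = {}
    for _ in range(max_tries):
        if len(words) >= n_words:
            break
        candidate = rng.sample(ids, num_syllables)
        accepted = True
        for position, syllable in enumerate(candidate):
            # Compare against at most n_look_back preceding syllables.
            previous = candidate[max(0, position - n_look_back):position]
            if any(syllables[syllable] & syllables[other] for other in previous):
                accepted = False
                break
        if accepted:
            # Keyed by the word's string form, so repeated words collapse.
            words["".join(candidate)] = candidate
    return words


words = sketch_make_words(SYLLABLES)
print("|".join(words))
```

Because words are stored by their string form, duplicates collapse into one entry, which is one reason a generator of this kind can return fewer words than requested.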


The words register has some relevant info about how it has been created:

In [28]:
for key in words.info:
    print(key)
print("")
print(f"For example, the type of syllables used to create the words is '{words.info['syllables_info']['syllable_type']}'")

n_syllables_per_word
n_look_back
phonotactic_control
syllables_info
bigram_pval
trigram_pval

For example, the type of syllables used to create the words is 'cV'


## Bonus functions

We can always get a random subsample of a Register by running:

In [29]:
words_subset = words.get_subset(10)
print(words_subset)

hiːbaːzyː|vaːtuːhøː|kɛːfiːreː|ʃøːɡyːmuː|poːʃaːhøː|saːhiːbyː|vaːtyːhøː|foːhuːdeː|heːʃiːbøː|luːkøːfyː
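
Drawing a random subsample from a dictionary-like collection can be sketched with `random.sample`. The helper `sketch_get_subset` and its `seed` parameter are hypothetical and only illustrate the idea. They do not reflect the signature or implementation of `Register.get_subset`.

```python
import random


def sketch_get_subset(register, size, seed=None):
    """Return `size` randomly chosen entries of a dict-like register."""
    rng = random.Random(seed)
    chosen_keys = rng.sample(list(register), size)
    return {key: register[key] for key in chosen_keys}


# Ids taken from the subsample printed above; the values are placeholders.
word_sketch = {"hiːbaːzyː": 0, "vaːtuːhøː": 1, "kɛːfiːreː": 2, "ʃøːɡyːmuː": 3}
subset = sketch_get_subset(word_sketch, 2, seed=1)
print(len(subset))  # 2
```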


In [30]:
words.save(os.path.join("results", "words.json"))

This concludes our first tutorial.
We've made `Syllable`s from `Phoneme`s and `Word`s from `Syllable`s and applied filters to them.
Finally, we saved the generated words to a JSON file.
In the next tutorial, we will pick up where we left off and load the saved words to generate a `Lexicon`, a Register of `Word`s with specific phonotactic requirements. Later, we will use Lexicons to generate different types of streams.