# Tutorial 1
You will learn the basic data structures of the ARC Typesystem, as well as how to save and load the core ARC types.

## Basic Types

There are two types of objects we deal with in ARC:

An **Element** is any linguistic object of interest. In our case, `Phoneme`, `Syllable`, `Word`, and `Stream` are Elements. These objects can consist of other elements in a dictionary-style fashion, i.e. `Word` and `Stream` consist of `Syllable`s, a `Syllable` consists of `Phoneme`s, and `Phoneme`s are atomic. If an Element consists of multiple sub-elements, as in the case of a Syllable, the sub-elements can repeat, e.g. a Phoneme can occur multiple times inside a Syllable. Elements can be part of a Corpus. As in real corpora, every element can be annotated, so it has an `.info` field that can hold arbitrary annotations in dictionary format.

A **Register** is essentially an ordered set with some extra functionality. We use this container type to create ordered collections of Elements that do not repeat, i.e. ordered sets of Phonemes, Syllables, and Words. Since every element in the Register is unique in its string representation, it can be hashed and thus found quickly in memory. The `Lexicon` type is implemented as a Register of words, and any corpus of Phonemes or Syllables is a Register as well.
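
To make the two concepts concrete, here is a minimal sketch in plain Python. `ToyElement` and `ToyRegister` are hypothetical stand-ins written for this illustration and are not ARC classes. The sketch only assumes what is described above: elements carry an `.info` dictionary, and a register keeps unique elements in insertion order, keyed by their string representation.

```python
from dataclasses import dataclass, field


@dataclass
class ToyElement:
    """Hypothetical stand-in for an Element: an id plus free-form annotations."""
    id: str
    info: dict = field(default_factory=dict)

    def __str__(self) -> str:
        return self.id


class ToyRegister:
    """Hypothetical stand-in for a Register: an ordered set keyed by string form."""

    def __init__(self, elements=()):
        self._items = {}  # plain dicts preserve insertion order in Python 3.7+
        for element in elements:
            self.add(element)

    def add(self, element: ToyElement) -> None:
        # An element whose string form is already present is not added again.
        self._items.setdefault(str(element), element)

    def __getitem__(self, key):
        # Integers are treated as positions, everything else as dictionary keys.
        if isinstance(key, int):
            return list(self._items.values())[key]
        return self._items[key]

    def __iter__(self):
        return iter(self._items.values())

    def __len__(self) -> int:
        return len(self._items)


register = ToyRegister(
    [ToyElement("k", {"place": "velar"}), ToyElement("a"), ToyElement("k")]
)
print(len(register), register[0], register["k"].info)
```

The duplicate `k` is dropped, and the same element is reachable both by position and by its string form.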

## Phonemes
Phonemes are the atomic unit of the ARC-Typesystem and form the basis for constructing other types like Syllables and Words.
To enjoy the full functionality of ARC, you'll need Phonemes annotated with their phonetic features. Luckily, ARC comes with an extensive corpus of Phonemes and phonetic features.
Let's load them and see what they look like.

In [None]:
from arc import load_phonemes
phonemes = load_phonemes()
print(phonemes)

The `phonemes` variable is a Collection of Phoneme-Objects, more specifically a `Register`. What you see when you print any Register is a short summary of the first elements.
You can treat the Register like most Python collection types, meaning you can access elements, iterate over it, and so on.

> Note: Internally, `Register`s are `OrderedDict`s (with some extra convenience methods). Essentially, you can treat a Register like both of the Python built-in types `dict` and `list`.

Let's see that in action.

In [None]:
print("We can reference elements of a Corpus by position/index:", phonemes[0], ", or by its string representation:", phonemes["k"])

Internally, Elements are `Dict`-like objects, more specifically, [Pydantic](https://docs.pydantic.dev/latest/) types.

In [None]:
phonemes["k"]

Annotations can be referenced via the `.info` property, which can hold arbitrary dictionary data.

In [None]:
phonemes["k"].info

Phoneme features can be hard to interpret, so you can also get features directly, e.g. the "is labial" binary feature, called `lab`:

In [None]:
phonemes["k"].get_binary_feature("lab")
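
As a rough illustration of what such a named lookup might do, the sketch below maps a feature name to a position in a vector of `+`/`-` marks. The feature names, their order, the example values, and the helper `lookup_binary_feature` are all assumptions made for this example; ARC's actual feature inventory and implementation may differ.

```python
# Hypothetical feature order; ARC's real feature list may be different.
FEATURE_NAMES = ["son", "lab", "cor", "dor"]


def lookup_binary_feature(features: list, name: str) -> bool:
    """Return True if the named feature is marked "+" in the feature vector."""
    return features[FEATURE_NAMES.index(name)] == "+"


# Illustrative values for a velar stop: not sonorant, not labial, not coronal, dorsal.
k_features = ["-", "-", "-", "+"]

print("lab:", lookup_binary_feature(k_features, "lab"))
print("dor:", lookup_binary_feature(k_features, "dor"))
```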

Finally, you can get some help on which features the binary feature vector holds:

In [None]:
help(phonemes["k"].get_binary_feature)

While Registers in ARC print as compact summaries of their contents, they can be arbitrarily complex data structures.

In [None]:
print("This is the print output:", phonemes, end="\n\n")

from pprint import pprint
print("These are the contents of the Phoneme Register:", end="\n\n")
for ph in phonemes:
    pprint(ph)
    print("")

Regardless of the contents, Elements and Registers are always JSON serializable, as long as they are valid (which is checked by Pydantic at initialization):

In [None]:
phonemes.to_json()

... which means they can be written to file. The Corpus type has a method for that:

In [None]:
phonemes.save("test_phonemes.json")

In [None]:
from arc import load_phonemes

loaded_phonemes = load_phonemes("test_phonemes.json")
print(loaded_phonemes)
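
The save-and-load round trip can also be pictured with the standard library alone. The sketch below is an assumption-laden toy: it uses a plain dictionary with made-up annotations instead of an ARC Register, and it does not reproduce ARC's file layout. It only shows that JSON-serializable data comes back unchanged and in the same order.

```python
import json
import tempfile
from pathlib import Path

# Toy data with invented annotations; not the structure ARC writes to disk.
toy_phonemes = {
    "k": {"id": "k", "info": {"features": ["-", "+"]}},
    "aː": {"id": "aː", "info": {"features": ["+", "-"]}},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy_phonemes.json"
    # ensure_ascii=False keeps IPA symbols readable in the file.
    path.write_text(json.dumps(toy_phonemes, ensure_ascii=False), encoding="utf-8")
    reloaded = json.loads(path.read_text(encoding="utf-8"))

print("Same contents:", reloaded == toy_phonemes)
print("Same order:", list(reloaded) == list(toy_phonemes))
```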

## Syllables
Our first composite type is the `Syllable`, consisting of a list of `Phoneme`s. Let's make a collection of syllables that follow the `cV` pattern, meaning they consist of a single-character phoneme `c` followed by a long vowel `V`.

In [None]:
from arc.generation.syllables import make_feature_syllables
artificial_syllables = make_feature_syllables(phonemes, phoneme_pattern="cV")
print(artificial_syllables)
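
Conceptually, pattern-based generation amounts to combining every candidate for the first slot with every candidate for the second. The following sketch shows this with `itertools.product` on a tiny, made-up inventory. The helper `make_cv_syllables` and the phoneme lists are hypothetical, and the sketch ignores the phonetic features that `make_feature_syllables` works with.

```python
from itertools import product

# Hypothetical mini-inventory, chosen only for illustration.
single_char_phonemes = ["k", "t", "m"]
long_vowels = ["aː", "iː"]


def make_cv_syllables(onsets, vowels):
    """Map each onset + vowel string to the list of phonemes it consists of."""
    return {onset + vowel: [onset, vowel] for onset, vowel in product(onsets, vowels)}


toy_syllables = make_cv_syllables(single_char_phonemes, long_vowels)
print(list(toy_syllables))
# → ['kaː', 'kiː', 'taː', 'tiː', 'maː', 'miː']
```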

In [None]:
print(artificial_syllables["cʔː"], artificial_syllables[1])
artificial_syllables["cʔː"], artificial_syllables[1]

Finally, you can iterate over both the Elements of a Register and the Sub-Elements of an Element:

In [None]:
for syllable in artificial_syllables:
    print("Syllable", syllable, "consists of phonemes", end="")
    for phoneme in syllable:
        print(" ", end="")
        print(phoneme, end="")
    print("")

## Merge and Filter operations for Corpora

Since we started with an international Phoneme corpus, and because we generate artificial syllables, there will be many Syllables we do not want to include in our further analysis. Let's filter out some of them.

We'll start by filtering based on a real corpus of syllables. ARC comes with an example corpus in German, so let's use that.

> The filter implementations are specific to the German corpus, so you might want to implement your own filters. We will discuss that in a later tutorial. If you are curious, you can take a look at the `arc.filter` submodule to see how to implement a filter.

In [None]:
from arc.io import read_syllables_corpus
german_syllable_corpus = read_syllables_corpus()  # defaults to the German corpus that comes with ARC
print(german_syllable_corpus)

In [None]:
artificial_syllables_valid_german = artificial_syllables.intersection(german_syllable_corpus)
print(artificial_syllables_valid_german)
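
For intuition, an intersection of two ordered sets can be written as a filter over the first one. The sketch below uses plain dictionaries and invented entries. `ordered_intersection` is a hypothetical helper, and keeping the order and values of the left operand is a choice made for this sketch rather than a documented property of `Register.intersection`.

```python
def ordered_intersection(left: dict, right: dict) -> dict:
    """Keep the entries of `left` whose keys also occur in `right`, in `left`'s order."""
    return {key: value for key, value in left.items() if key in right}


# Invented toy data: generated syllables and a small "real" corpus.
toy_artificial = {"kaː": ["k", "aː"], "ŋiː": ["ŋ", "iː"], "taː": ["t", "aː"]}
toy_corpus = {"taː": {"freq": 12}, "baː": {"freq": 7}, "kaː": {"freq": 3}}

toy_valid = ordered_intersection(toy_artificial, toy_corpus)
print(list(toy_valid))
# → ['kaː', 'taː']
```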

In our original publication, we filter syllables based on the p-value that the syllable is uniformly distributed with the others. This can be implemented as a filter:

In [None]:
from arc.filter import filter_uniform_syllables, filter_common_phoneme_syllables

syllables_german_filtered = filter_uniform_syllables(artificial_syllables_valid_german)
print("Syllables with uniform probability of occurrence: ", syllables_german_filtered)

syllables_german_filtered = filter_common_phoneme_syllables(syllables_german_filtered)
print("Syllables with common phonemes: ", syllables_german_filtered)

If you have a native (in our case German) phoneme corpus, you can filter the syllables based on that.

## Export to SSML
Once we are done making syllables, we can export them to Speech Synthesis Markup Language (SSML) for later reference.

In [None]:
from arc.io import export_speech_synthesiser
export_speech_synthesiser(syllables_german_filtered, syllables_dir="ssml")
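
For readers unfamiliar with SSML, the sketch below builds a minimal fragment for one syllable using the standard `<phoneme>` element with an IPA transcription. `syllable_to_ssml` is a hypothetical helper, and the markup that `export_speech_synthesiser` actually writes may look different. A complete SSML document would also carry version and namespace attributes on `<speak>`, which are left out here.

```python
import xml.etree.ElementTree as ET


def syllable_to_ssml(ipa: str) -> str:
    """Wrap an IPA string in a minimal <speak><phoneme> fragment."""
    speak = ET.Element("speak")
    phoneme = ET.SubElement(speak, "phoneme", alphabet="ipa", ph=ipa)
    phoneme.text = ipa  # fallback text for synthesizers that ignore the ph attribute
    return ET.tostring(speak, encoding="unicode")


ssml = syllable_to_ssml("kaː")
print(ssml)
```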

## Words
`Word`s are made out of `Syllable`s, just as we made syllables from phonemes above.

Since one of ARC's main features is rhythmicity control, our `make_words` function only creates words that have minimum overlap of phonotactic features. By default, this function generates 10000 words, but you can change that with the `n_words` option. With 10000 words, this should run fairly quickly. If you set the number higher, you may also want to set the `progress_bar=True` flag in the function arguments.

In [None]:
from arc.generation.words import make_words
words = make_words(syllables_german_filtered, n_words=100_000, progress_bar=True)
print(words)
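
The general idea of generating words under a constraint can be sketched as sample-and-reject. The toy below is not ARC's algorithm: it replaces the feature-overlap criterion with a much simpler stand-in rule (no phoneme may repeat within a word), and the helper `make_toy_words`, the three-syllable word length, and the syllable inventory are all assumptions for illustration.

```python
import random
from itertools import product

# Invented toy syllables: each string maps to its list of phonemes.
toy_syllables = {c + v: [c, v] for c, v in product("ktm", ["aː", "iː", "uː"])}


def make_toy_words(syllables, n_words, n_syllables=3, seed=0):
    """Sample syllable sequences and keep those in which no phoneme repeats."""
    rng = random.Random(seed)  # seeded so that repeated runs give the same words
    keys = list(syllables)
    words = {}
    for _ in range(n_words):
        candidate = rng.sample(keys, n_syllables)
        phonemes = [p for key in candidate for p in syllables[key]]
        # Stand-in constraint: reject any candidate with a repeated phoneme.
        if len(set(phonemes)) == len(phonemes):
            words["".join(candidate)] = candidate
    return words


toy_words = make_toy_words(toy_syllables, n_words=200)
print(len(toy_words), "toy words kept out of 200 candidates")
```

Because candidates are rejected, the number of words kept is usually well below the number of candidates drawn.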

Again, we apply some filters, but this time at the word level.

In [None]:
from arc.filter import filter_common_phoneme_words, filter_gram_stats

words_filtered = filter_common_phoneme_words(words, position=0)
print(words_filtered)

#words_filtered = filter_gram_stats(words_filtered)
#print(words_filtered)

In [None]:
print(words_filtered.info)

Even with all the phonotactic conditions we applied, there may be many words left to choose from to build our `Lexicon`s and `Stream`s later on.

However, we can always get a random subsample of a Register by running:

In [None]:
words_subset = words_filtered.get_subset(100)
print(words_subset, words_subset.info)
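
A random subsample of an ordered collection can be drawn with `random.sample`. In the sketch below, `get_toy_subset` is a hypothetical helper working on a plain dictionary. The fixed seed and the decision to keep the original order of the selected entries are choices made for this example, not documented behavior of `get_subset`.

```python
import random


def get_toy_subset(register: dict, n: int, seed: int = 0) -> dict:
    """Pick n random entries and return them in their original order."""
    rng = random.Random(seed)
    chosen = set(rng.sample(list(register), n))
    return {key: value for key, value in register.items() if key in chosen}


toy_register = {f"word{i}": i for i in range(20)}
toy_subset = get_toy_subset(toy_register, 5)
print(list(toy_subset))
```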

In [None]:
words_subset.save("words.json")

This concludes our first tutorial.
We've made `Syllable`s from `Phoneme`s and `Word`s from `Syllable`s and applied filters to them.
Finally, we saved the generated words to a JSON file.
In the next tutorial, we will pick up where we left off and load the saved words to generate a `Lexicon`, a Register of `Word`s with specific phonotactic requirements. Later, we will use Lexicons to generate different types of streams.