# Tutorial 1
You will learn the basics of saving and loading data with the core ARC-Types.

## Phonemes
Phonemes are the atomic unit of the ARC-Typesystem and form the basis for constructing other types like Syllables and Words.
To enjoy the full functionality of ARC, you'll need Phonemes with the phonetic feature fields filled in. Luckily, ARC comes with an extensive corpus of Phonemes and phonetic features.
Let's load them and see what they look like.

In [1]:
from arc import load_default_phonemes
phonemes = load_default_phonemes()
print(phonemes)

k͡p|ɡ͡b|c|ɡ|k|q|ɖ|ɟ|ɠ|ɢ|... (5175 elements total)


The `phonemes` variable is a collection of `Phoneme` objects, more specifically an `ARC-Collection`. What you see when you print any `ARC-Collection` is a short summary of its top-level elements.
You can treat an `ARC-Collection` like most Python collection types: you can access elements, iterate over it, and so on.

> Note: Internally, `ARC-Collection`s are `OrderedDict`s (with some extra convenience methods). This means you can treat them like both of the Python builtin types `dict` and `list`.

Let's see that in action.

In [2]:
print(phonemes[0], phonemes["k"])

k͡p k
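
To make the idea of combined positional and key access concrete, here is a minimal sketch of how such a container could work. `DualAccessCollection` is a hypothetical stand-in written for this tutorial, not ARC's actual class, and the string values are placeholders for real `Phoneme` objects.

```python
from collections import OrderedDict


class DualAccessCollection:
    """Hypothetical stand-in for an ARC-Collection (not the real implementation)."""

    def __init__(self, pairs):
        # Insertion order is preserved, so positions are well defined.
        self._items = OrderedDict(pairs)

    def __getitem__(self, key):
        # Integers are treated as positions, everything else as a normal key.
        if isinstance(key, int):
            return list(self._items.values())[key]
        return self._items[key]

    def __iter__(self):
        # Iterate over the elements themselves rather than over their ids.
        return iter(self._items.values())

    def __len__(self):
        return len(self._items)


demo = DualAccessCollection(
    [("k͡p", "phoneme k͡p"), ("ɡ͡b", "phoneme ɡ͡b"), ("k", "phoneme k")]
)
first_by_position = demo[0]
k_by_key = demo["k"]
print(first_by_position, "|", k_by_key)
```

The design choice here is to dispatch on the type of the index, which only works as long as element ids are never integers.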


Internally, these are `Dict`-like objects.

In [3]:
phonemes["k"]

Phoneme(id='k', info={'features': ['-', '-', '+', '-', '-', '-', '-', '0', '-', '-', '-', '-', '-', '0', '-', '+', '-', '+', '-', '0', '-']})

Notice that the object has an `info` field with phonetic features. You can also get features directly:

In [4]:
phonemes["k"].get_feature_symbol("lab"), phonemes["k"].get_binary_feature("lab")

('-', False)

In [5]:
help(phonemes["k"].get_binary_feature)

Help on method get_binary_feature in module arc.types:

get_binary_feature(label: Literal['syl', 'son', 'cons', 'cont', 'delrel', 'lat', 'nas', 'strid', 'voi', 'sg', 'cg', 'ant', 'cor', 'distr', 'lab', 'hi', 'lo', 'back', 'round', 'tense', 'long']) method of arc.types.Phoneme instance
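
The two accessors can be pictured as a lookup into the `features` list by label. The sketch below assumes that the list stored in `info['features']` follows the same order as the labels in the signature above, and that a feature counts as `True` only when its symbol is `'+'`. Both are assumptions for illustration. The two functions are standalone re-implementations, not ARC's code.

```python
# Labels copied from the help output above; their order is assumed to match
# the order of the entries in info['features'].
FEATURE_LABELS = [
    "syl", "son", "cons", "cont", "delrel", "lat", "nas", "strid", "voi", "sg",
    "cg", "ant", "cor", "distr", "lab", "hi", "lo", "back", "round", "tense", "long",
]

# Feature symbols of the phoneme "k", copied from the output of In [3].
k_features = [
    "-", "-", "+", "-", "-", "-", "-", "0", "-", "-", "-",
    "-", "-", "0", "-", "+", "-", "+", "-", "0", "-",
]


def get_feature_symbol(features, label):
    """Return the raw symbol ('+', '-' or '0') stored for a feature label."""
    return features[FEATURE_LABELS.index(label)]


def get_binary_feature(features, label):
    """Assumption: only '+' counts as True; '-' and '0' both map to False."""
    return get_feature_symbol(features, label) == "+"


lab = (get_feature_symbol(k_features, "lab"), get_binary_feature(k_features, "lab"))
cons = (get_feature_symbol(k_features, "cons"), get_binary_feature(k_features, "cons"))
print(lab, cons)
```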



## Syllables
Our first composite type is the `Syllable`, which consists of a list of `Phoneme`s. Let's make a collection of syllables that follow the `cV` pattern, meaning they consist of a single-character phoneme `c` followed by a long vowel `V`.

In [6]:
from arc.data import make_feature_syllables
syllables = make_feature_syllables(phonemes, phoneme_pattern="cV")
print(syllables)

cʔː|cɥː|cɰː|cʋː|cʍː|cjː|cwː|cɹː|cɻː|cɑː|... (2108 elements total)


In [7]:
print(syllables["cʔː"], syllables[1])
syllables["cʔː"], syllables[1]

cʔː cɥː


(Syllable(id='cʔː', info={'binary_features': [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'phonotactic_features': [['plo', 'oth'], []]}, phonemes=[Phoneme(id='c', info={'features': ['-', '-', '+', '-', '-', '-', '-', '0', '-', '-', '-', '-', '-', '0', '-', '+', '-', '-', '-', '0', '-']}), Phoneme(id='ʔː', info={'features': ['-', '+', '-', '-', '-', '-', '-', '0', '-', '-', '+', '-', '-', '0', '-', '-', '-', '-', '-', '0', '+']})]),
 Syllable(id='cɥː', info={'binary_features': [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1], 'phonotactic_features': [['plo', 'oth'], []]}, phonemes=[Phoneme(id='c', info={'features': ['-', '-', '+', '-', '-', '-', '-', '0', '-', '-', '-', '-', '-', '0', '-', '+', '-', '-', '-', '0', '-']}), Phoneme(id='ɥː', info={'features': ['-', '+', '-', '+', '0', '-', '-', '0', '+', '-', '-', '0', '-', '0', '+', '+', '-', '-', '+', '+', '+']})]))

Finally, you can iterate over both the elements of a Register and the sub-elements of each element:

In [8]:
for syllable in syllables:
    print("Syllable ", syllable, f"consists of phonemes {[str(phoneme) for phoneme in syllable]}")

Syllable  cʔː consists of phonemes ['c', 'ʔː']
Syllable  cɥː consists of phonemes ['c', 'ɥː']
Syllable  cɰː consists of phonemes ['c', 'ɰː']
Syllable  cʋː consists of phonemes ['c', 'ʋː']
Syllable  cʍː consists of phonemes ['c', 'ʍː']
Syllable  cjː consists of phonemes ['c', 'jː']
Syllable  cwː consists of phonemes ['c', 'wː']
Syllable  cɹː consists of phonemes ['c', 'ɹː']
Syllable  cɻː consists of phonemes ['c', 'ɻː']
Syllable  cɑː consists of phonemes ['c', 'ɑː']
Syllable  cɤː consists of phonemes ['c', 'ɤː']
Syllable  cʉː consists of phonemes ['c', 'ʉː']
Syllable  caː consists of phonemes ['c', 'aː']
Syllable  ceː consists of phonemes ['c', 'eː']
Syllable  ciː consists of phonemes ['c', 'iː']
Syllable  coː consists of phonemes ['c', 'oː']
Syllable  cuː consists of phonemes ['c', 'uː']
Syllable  cyː consists of phonemes ['c', 'yː']
Syllable  cæː consists of phonemes ['c', 'æː']
Syllable  cøː consists of phonemes ['c', 'øː']
Syllable  cœː consists of phonemes ['c', 'œː']
Syllable  cɒː

## Merge and Filter

Since we started with an international Phoneme corpus, there will be many Syllables we do not want to include in our further analysis. Let's filter out some of them.

We'll start by filtering based on a corpus of syllables. ARC comes with an example corpus in German, which is used by default when you call filters without supplying a path to a custom file.

> The filter implementations are specific to the corpus, so you might want to implement your own filters. We will discuss that in a later tutorial. If you are curious, you can take a look at the `arc.filter` submodule to see how to implement a filter.

In [9]:
from arc.io import read_syllables_corpus
syllable_corpus = read_syllables_corpus()
print(syllable_corpus)

jaː|ɪç|das|n|dan|ɡə|tn|ə|daː|diː|... (6397 elements total)


In [16]:
from arc.filter import merge_collections
syllables_german = merge_collections(syllables, syllable_corpus)
print(syllables_german)

ɡaː|ɡeː|ɡiː|ɡoː|ɡuː|ɡyː|ɡøː|ɡɛː|kaː|keː|... (130 elements total)
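
Judging by the element counts, merging here keeps only the syllables that appear in both collections. The sketch below shows that idea as an intersection by id on plain `OrderedDict`s. `intersect_by_id` is a hypothetical helper, and the real `merge_collections` may also combine the `info` fields of matching entries, which this sketch does not attempt.

```python
from collections import OrderedDict


def intersect_by_id(left, right):
    """Keep the entries of `left` whose id also occurs in `right`.

    The order of `left` is preserved. This is an illustrative helper,
    not the implementation of arc.filter.merge_collections.
    """
    return OrderedDict((key, value) for key, value in left.items() if key in right)


# Toy stand-ins for the generated syllables and the corpus syllables.
generated = OrderedDict.fromkeys(["cʔː", "ɡaː", "ɡeː", "kaː"], "generated")
corpus = OrderedDict.fromkeys(["jaː", "ɡaː", "kaː", "das"], "corpus")

merged = intersect_by_id(generated, corpus)
print("|".join(merged))
```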


In our original publication, we filter syllables based on the p-value for the hypothesis that the syllable is uniformly distributed with respect to the others. We made a filter for that:

In [17]:
from arc.filter import filter_uniform_syllables, filter_common_phoneme_syllables

syllables_german_filtered = filter_uniform_syllables(syllables_german)
print("Syllables with uniform probability of occurrence: ", syllables_german_filtered)

syllables_german_filtered = filter_common_phoneme_syllables(syllables_german_filtered)
print("Syllables with common phonemes: ", syllables_german_filtered)

Syllables with uniform probability of occurrence:  ɡaː|ɡiː|ɡyː|ɡɛː|kaː|koː|kuː|køː|kɛː|baː|... (77 elements total)

Syllables with common phonemes:  ɡaː|ɡiː|ɡyː|ɡɛː|kaː|koː|kuː|køː|kɛː|baː|... (76 elements total)
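
As a rough picture of what a p-value based filter does, the sketch below keeps every syllable whose p-value lies above a significance threshold, so that uniformity is not rejected. The p-values, the threshold `alpha=0.05` and the direction of the comparison are all assumptions made for illustration. The statistics behind `filter_uniform_syllables` are not shown in this tutorial.

```python
def filter_by_p_value(p_values, alpha=0.05):
    """Keep the syllables whose (hypothetical) uniformity p-value exceeds alpha."""
    return {syllable: p for syllable, p in p_values.items() if p > alpha}


# Invented p-values, only here to make the example runnable.
toy_p_values = {"ɡaː": 0.40, "ɡeː": 0.01, "kaː": 0.20}

kept = filter_by_p_value(toy_p_values)
print("|".join(kept))
```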


If you have a native (in our case German) phoneme corpus, you can filter the syllables based on that.

## Export to SSML
Once we are done making syllables, we can export them to Speech Synthesis Markup Language (SSML) for later reference.

In [18]:
from arc.io import export_speech_synthesiser
export_speech_synthesiser(syllables_german_filtered, syllables_dir="ssml")
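
For orientation, SSML describes pronunciation with a `<phoneme>` element that carries an IPA string in its `ph` attribute. The sketch below builds a minimal fragment of that kind with the standard library. The helper `syllable_to_ssml` is hypothetical, and the files written by `export_speech_synthesiser` may be structured differently.

```python
import xml.etree.ElementTree as ET


def syllable_to_ssml(ipa):
    """Wrap one IPA string in a minimal SSML fragment (illustrative only)."""
    speak = ET.Element("speak")
    # alphabet="ipa" tells the synthesiser how to interpret the ph attribute.
    phoneme = ET.SubElement(speak, "phoneme", alphabet="ipa", ph=ipa)
    # The element text is the written form the ph attribute applies to.
    phoneme.text = ipa
    return ET.tostring(speak, encoding="unicode")


ssml = syllable_to_ssml("ɡaː")
print(ssml)
```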

## Words
`Word`s are made out of `Syllable`s, just as `Syllable`s are made out of `Phoneme`s.

Since one of ARC's main features is rhythmicity control, our `make_words` function will only create words that have minimal overlap of phonotactic features.

In [19]:
from arc.data import make_words
words = make_words(syllables_german_filtered)
print(words)

foːtiːhuː|høːfyːdeː|kuːfoːraː|ʃuːpeːhoː|doːfuːhiː|siːbøːhuː|zøːbeːhiː|toːfaːhøː|lyːfaːkuː|løːvaːɡɛː|... (10000 elements total)
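
The overlap constraint can be illustrated with a simplified rule: build three-syllable words and reject any word in which two neighbouring syllables share a feature. This is stricter and simpler than "minimal overlap", and both the rule and the feature sets below are invented for the example. It is a sketch of the idea, not of what `make_words` does internally.

```python
from itertools import permutations

# Invented feature sets per syllable, for illustration only.
toy_features = {
    "foː": {"fri", "lab"},
    "tiː": {"plo", "cor"},
    "huː": {"fri", "glo"},
    "kuː": {"plo", "dor"},
}


def make_toy_words(features, n_syllables=3):
    """Combine syllables into words whose neighbouring syllables share no feature."""
    words = []
    for combo in permutations(features, n_syllables):
        neighbours = zip(combo, combo[1:])
        if all(features[a].isdisjoint(features[b]) for a, b in neighbours):
            words.append("".join(combo))
    return words


toy_words = make_toy_words(toy_features)
print("|".join(toy_words))
```

In this toy data, `foːtiːhuː` passes because neither neighbouring pair shares a feature, while `foːhuːtiː` is rejected because `foː` and `huː` share `fri`.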


Again, we apply some filters, but this time at word level.

In [20]:
from arc.filter import filter_common_phoneme_words, filter_gram_stats

words_filtered = filter_common_phoneme_words(words, position=0)
print(words_filtered)

#words_filtered = filter_gram_stats(words_filtered)
#print(words_filtered)

foːtiːhuː|høːfyːdeː|kuːfoːraː|ʃuːpeːhoː|doːfuːhiː|zøːbeːhiː|toːfaːhøː|lyːfaːkuː|løːvaːɡɛː|zøːmeːkoː|... (9234 elements total)


Even with all the phonotactic conditions we applied, there are still many words to choose from to build our `Lexicon`s and streams later on.

However, we can always get a random subsample of a Register by running:

In [22]:
words_subsampled = words_filtered.get_subset(100)
print(words_subsampled)

zyːɡiːmuː|ʃøːɡyːmuː|huːbøːsaː|lyːfaːkuː|ʃøːkɛːmeː|foːlaːkøː|foːkøːluː|puːhøːʃaː|deːçaːmuː|loːkɛːvaː|... (100 elements total)
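
Conceptually, subsampling draws a fixed number of distinct elements at random. A minimal sketch on a plain `OrderedDict` is shown below. The helper `get_random_subset` and its `seed` parameter are hypothetical and only mirror the idea behind `get_subset`, whose actual sampling strategy is not described here.

```python
import random
from collections import OrderedDict


def get_random_subset(collection, size, seed=None):
    """Return `size` distinct entries of `collection`, drawn without replacement."""
    rng = random.Random(seed)  # a local generator keeps the global state untouched
    chosen = rng.sample(list(collection), size)
    return OrderedDict((key, collection[key]) for key in chosen)


toy_words = OrderedDict((f"word{i}", i) for i in range(20))
subset = get_random_subset(toy_words, 5, seed=0)
print(len(subset), "of", len(toy_words), "elements")
```

Passing a seed makes the draw reproducible, which is useful when the subsample feeds into later analysis steps.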


In [23]:
words_subsampled.save()
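
Saving boils down to serialising the collection to JSON and reading it back later. The sketch below shows such a round trip with the standard library. The file name `words.json` and the dictionary layout are assumptions for illustration and do not describe the file that `save()` writes.

```python
import json
import tempfile
from pathlib import Path

# Toy data with an assumed layout; the real file may look different.
toy_words = {
    "foːtiːhuː": {"syllables": ["foː", "tiː", "huː"]},
    "høːfyːdeː": {"syllables": ["høː", "fyː", "deː"]},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "words.json"
    # ensure_ascii=False keeps the IPA characters readable in the file.
    path.write_text(json.dumps(toy_words, ensure_ascii=False, indent=2), encoding="utf-8")
    restored = json.loads(path.read_text(encoding="utf-8"))

print(restored == toy_words)
```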

This concludes our first tutorial.
We've made `Syllable`s from `Phoneme`s and `Word`s from `Syllable`s and applied filters to them.
Finally, we saved the generated words to a JSON file.
In the next tutorial, we will pick up where we left off and load the saved words to generate a `Lexicon`, a Register of `Word`s with specific phonotactic requirements. Later, we will use Lexicons to generate different types of streams.