# Tutorial 2

In this tutorial, we load the words and generate a Lexicon with minimal feature overlap between its words. We then cover the two main ways to generate random streams: `Word`-based and `Syllable`-based stream generation.

First, we load the words from tutorial 1 (or let the script below generate them from scratch).

In [1]:
import os

from arc import load_words

FORCE_RECOMPUTE = False

if os.path.exists("words.json") and not FORCE_RECOMPUTE:
    words = load_words("words.json")
else:
    from arc import load_phonemes
    feature_phonemes = load_phonemes()
    
    from arc.core.syllable import make_feature_syllables
    feature_syllables = make_feature_syllables(feature_phonemes, phoneme_pattern="cV")

    from arc.io import read_syllables_corpus
    german_syllable_corpus = read_syllables_corpus()  # defaults to the german corpus that comes with ARC

    syllables_valid_german = feature_syllables.intersection(german_syllable_corpus)

    from arc.tpc.filter import filter_uniform_syllables, filter_common_phoneme_syllables
    print("Syllables valid german: ", syllables_valid_german)
    
    syllables_german_filtered = filter_uniform_syllables(syllables_valid_german)
    print("Syllables with uniform probability of occurrence: ", syllables_german_filtered)
    
    syllables_german_filtered = filter_common_phoneme_syllables(syllables_german_filtered)
    print("Syllables with common phonemes: ", syllables_german_filtered)

    from arc.core.word import make_words
    print("Make words...")
    words = make_words(syllables_german_filtered, n_words=11_000, max_tries=100_000, progress_bar=True)

    from arc.tpc.filter import filter_common_phoneme_words
    print("Select words with common phonemes (german) ...")
    words = filter_common_phoneme_words(words, position=0)

    print("Save words ...")
    words.save("words.json")
    
print(words)

zɛːkøːmyː|kɛːluːfyː|laːɡɛːfoː|reːɡaːfuː|ryːɡiːfaː|ʃiːmoːkuː|ʃaːmuːɡyː|ʃoːbeːhøː|køːʃɛːmuː|tiːhøːvaː|... (10124 elements total)


In [3]:
print("Select words with common bigrams and trigrams (german) ...")
from arc.tpc.filter import filter_gram_stats
words = filter_gram_stats(words)

print("Sample subset of words (n=200) ...")
words = words.get_subset(200)

Select words with common bigrams and trigrams (german) ...
Sample subset of words (n=200) ...


## Lexicon

Now we generate minimum-overlap Lexicons. Let's start with 4 words each.

By default, the function generates at most 5 `Lexicon`s. Let's generate 2 and print some info.

In [4]:
from arc.tpc.lexicon import make_lexicons_from_words

lexicons = make_lexicons_from_words(words, n_lexicons=2)

for lexicon in lexicons:
    print(lexicon, ", cumulative_overlap:", lexicon.info["cumulative_overlap"], ", max_pairwise_overlap:", lexicon.info["max_pairwise_overlap"])



fuːɡiːryː|ʃiːbyːhoː|kuːloːfaː|heːpoːzuː (4 elements total) , cumulative_overlap: 3 , max_pairwise_overlap: 1
fuːɡiːryː|ʃiːbyːhoː|heːpoːzyː|kuːloːfaː (4 elements total) , cumulative_overlap: 3 , max_pairwise_overlap: 1


By default, Lexicons with the minimum possible cumulative overlap between word features are generated first, starting at zero overlap. If not all requested Lexicons can be generated with the given parameters, the allowed overlap is increased, and a warning message indicates this.

This process repeats until any of the following is true:
- the requested number of Lexicons has been generated
- the maximum allowed overlap is reached (set via `max_overlap`)
- the set of all word combinations is exhausted

If one or more Lexicons are returned, their `info` fields hold the cumulative overlap across all word pairs in the Lexicon, as well as the maximum pairwise overlap used.
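
The escalation described above can be sketched on toy data. This is a minimal illustration and not ARC's implementation: `pairwise_overlap`, `find_lexicons`, and the feature labels are hypothetical, each word is reduced to a flat set of feature labels, and no warning is emitted when the allowed overlap increases.

```python
from itertools import combinations


def pairwise_overlap(features_a, features_b):
    """Number of feature labels two toy words share."""
    return len(features_a & features_b)


def find_lexicons(words, n_words, n_lexicons, max_overlap):
    """Collect lexicons, raising the allowed cumulative overlap step by step.

    `words` maps a word name to a frozenset of feature labels (toy data).
    """
    found = []
    for allowed in range(max_overlap + 1):  # start at zero overlap
        for combo in combinations(sorted(words), n_words):
            total = sum(
                pairwise_overlap(words[a], words[b])
                for a, b in combinations(combo, 2)
            )
            if total == allowed:
                found.append((combo, total))
                if len(found) == n_lexicons:  # requested number reached
                    return found
    # All combinations were checked up to max_overlap.
    return found


# Hypothetical toy words, not taken from the ARC corpus.
toy_words = {
    "ka": frozenset({"plosive", "velar"}),
    "mi": frozenset({"nasal", "labial"}),
    "su": frozenset({"fricative", "alveolar"}),
    "pa": frozenset({"plosive", "labial"}),
}

lexicons = find_lexicons(toy_words, n_words=3, n_lexicons=2, max_overlap=2)
for combo, overlap in lexicons:
    print("|".join(combo), ", cumulative_overlap:", overlap)
```

Only one zero-overlap combination exists in this toy set, so the second lexicon is found after the allowed overlap rises to 1. With `max_overlap=0`, the same call would return a single lexicon.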

## Stream

### Single Stream

Stream generation internally generates a Lexicon first and then a Stream based on it. For convenience, the following cell generates a stream directly from words. The cell should execute quickly. If it doesn't, try reducing the number of words per lexicon or increasing the allowed rhythmicity index.

In [6]:
from arc.tpc.stream import make_stream_from_words

stream = make_stream_from_words(words, rand_mode="word", n_words=4, max_rhythmicity=0.2, max_lexicons=100, max_tries_randomize=10)

print("")

print(stream)

print("")

for key, val in stream.info.items():
    print(f"{key}: {str(val)}")
    print("")




heːpoːzuːkuːloːfaːʃiːbyːhoːfuːɡiːryːʃiːbyːhoːkuːloːfaːfuːɡiːryːheːpoːzuːfuːɡiːryːkuːloːfaːheːpoːzuːʃiːbyːhoːfuːɡiːryːkuːloːfaːʃiːbyːhoːheːpoːzuːkuːloːfaːfuːɡiːryːheːpoːzuːʃiːbyːhoːkuːloːfaːheːpoːzuːfuːɡiːryːʃiːbyːhoːfuːɡiːryːʃiːbyːhoːheːpoːzuːkuːloːfaːʃiːbyːhoːkuːloːfaːheːpoːzuːfuːɡiːryːʃiːbyːhoːheːpoːzuːkuːloːfaːfuːɡiːryːkuːloːfaːheːpoːzuːʃiːbyːhoːfuːɡiːryːheːpoːzuːʃiːbyːhoːkuːloːfaːfuːɡiːryːheːpoːzuːfuːɡiːryːkuːloːfaːʃiːbyːhoːkuːloːfaːfuːɡiːryːʃiːbyːhoːheːpoːzuːʃiːbyːhoːfuːɡiːryːkuːloːfaːheːpoːzuːfuːɡiːryːheːpoːzuːkuːloːfaːʃiːbyːhoːkuːloːfaːʃiːbyːhoːfuːɡiːryːheːpoːzuːʃiːbyːhoːheːpoːzuːfuːɡiːryːkuːloːfaːheːpoːzuːkuːloːfaːfuːɡiːryːʃiːbyːhoːheːpoːzuːfuːɡiːryːʃiːbyːhoːkuːloːfaːheːpoːzuːʃiːbyːhoːfuːɡiːryːkuːloːfaːʃiːbyːhoːfuːɡiːryːheːpoːzuːkuːloːfaːfuːɡiːryːheːpoːzuːʃiːbyːhoːkuːloːfaːfuːɡiːryːʃiːbyːhoːkuːloːfaːheːpoːzuːkuːloːfaːʃiːbyːhoːheːpoːzuːfuːɡiːryːkuːloːfaːheːpoːzuːʃiːbyːhoːfuːɡiːryːʃiːbyːhoːheːpoːzuːkuːloːfaːfuːɡiːryːkuːloːfaːʃiːbyːhoːheːpoːzuːfuːɡiːryːʃiːbyːhoːkuːloːfaːfuːɡiːryː

As you can see, the `.info` field holds useful information about the generated stream: which Lexicon was used to generate it, the rhythmicity indexes achieved for each feature, and which randomization mode was used. The randomization mode can be `syllable` or `word`. Here it is easy to verify that the mode is `word`, since the individual words of the Lexicon can be recognized in the stream. In contrast, syllable-level randomization breaks the words down into syllables and shuffles the syllables across the whole lexicon, destroying word-level information.
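
To make the difference between the two modes concrete, here is a simplified sketch using the four words of the Lexicon printed earlier. The helpers `word_level_stream` and `syllable_level_stream` are hypothetical and are not ARC functions. They only shuffle, so unlike `make_stream_from_words` they do not control rhythmicity or prevent immediate repetitions.

```python
import random

# The four words from the Lexicon above, split into their syllables.
lexicon = [
    ("fuː", "ɡiː", "ryː"),
    ("ʃiː", "byː", "hoː"),
    ("kuː", "loː", "faː"),
    ("heː", "poː", "zuː"),
]


def word_level_stream(lexicon, n_repetitions, rng):
    """Shuffle whole words; the syllables of each word stay together."""
    stream = []
    for _ in range(n_repetitions):
        block = list(lexicon)
        rng.shuffle(block)
        for word in block:
            stream.extend(word)
    return stream


def syllable_level_stream(lexicon, n_repetitions, rng):
    """Pool all syllables and shuffle them, ignoring word boundaries."""
    pool = [syllable for word in lexicon for syllable in word]
    stream = []
    for _ in range(n_repetitions):
        block = list(pool)
        rng.shuffle(block)
        stream.extend(block)
    return stream


rng = random.Random(0)  # fixed seed so the sketch is reproducible
word_stream = word_level_stream(lexicon, n_repetitions=2, rng=rng)
syllable_stream = syllable_level_stream(lexicon, n_repetitions=2, rng=rng)

print("".join(word_stream))
print("".join(syllable_stream))
```

Both streams contain exactly the same syllables the same number of times. Only the word-level stream is guaranteed to keep each word's three syllables adjacent and in order.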

Next, we will use this distinction to generate a compatible set of streams for testing statistical learning hypotheses.

### Set of Compatible Streams

If the previous cell ran quickly, we can step it up and generate a complete set of compatible streams for our study. If `streams` is empty, try increasing the allowed maximum rhythmicity.

In [7]:
from arc.tpc.stream import make_compatible_streams
streams = make_compatible_streams(words, n_words=4, max_rhythmicity=0.2)

for i, stream in enumerate(streams):
    print("========= Stream Nr. ", i + 1, " =========")
    
    print("")
    
    print(stream)
    
    print("")
    
    for key, val in stream.info.items():
        print(f"{key}: {str(val)}")
        print("")




fuːɡiːryːheːpoːzuːʃiːbyːhoːkuːloːfaːfuːɡiːryːʃiːbyːhoːheːpoːzuːkuːloːfaːʃiːbyːhoːfuːɡiːryːkuːloːfaːheːpoːzuːfuːɡiːryːʃiːbyːhoːkuːloːfaːheːpoːzuːfuːɡiːryːheːpoːzuːkuːloːfaːʃiːbyːhoːheːpoːzuːʃiːbyːhoːfuːɡiːryːkuːloːfaːheːpoːzuːkuːloːfaːfuːɡiːryːʃiːbyːhoːkuːloːfaːfuːɡiːryːheːpoːzuːʃiːbyːhoːfuːɡiːryːkuːloːfaːʃiːbyːhoːheːpoːzuːkuːloːfaːheːpoːzuːfuːɡiːryːʃiːbyːhoːheːpoːzuːfuːɡiːryːkuːloːfaːʃiːbyːhoːfuːɡiːryːkuːloːfaːheːpoːzuːʃiːbyːhoːheːpoːzuːfuːɡiːryːʃiːbyːhoːkuːloːfaːʃiːbyːhoːkuːloːfaːfuːɡiːryːheːpoːzuːʃiːbyːhoːfuːɡiːryːheːpoːzuːkuːloːfaːʃiːbyːhoːheːpoːzuːkuːloːfaːfuːɡiːryːʃiːbyːhoːkuːloːfaːheːpoːzuːfuːɡiːryːheːpoːzuːʃiːbyːhoːkuːloːfaːfuːɡiːryːheːpoːzuːkuːloːfaːʃiːbyːhoːfuːɡiːryːkuːloːfaːfuːɡiːryːʃiːbyːhoːheːpoːzuːkuːloːfaːʃiːbyːhoːfuːɡiːryːheːpoːzuːʃiːbyːhoːheːpoːzuːfuːɡiːryːkuːloːfaːheːpoːzuːʃiːbyːhoːfuːɡiːryːkuːloːfaːheːpoːzuːfuːɡiːryːʃiːbyːhoːkuːloːfaːʃiːbyːhoːkuːloːfaːfuːɡiːryːheːpoːzuːkuːloːfaːheːpoːzuːʃiːbyːhoːfuːɡiːryːkuːloːfaːʃiːbyːhoːheːpoːzuːfuːɡiːryːʃiːbyːhoːheːpoːzuːkuːloːfaː
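
If `streams` comes back empty, the retry with a higher maximum rhythmicity can be automated. The helper below is a hypothetical sketch, not part of ARC: `first_nonempty` takes any callable that maps a threshold to a list of streams, and the stand-in `fake_make_streams` only imitates an empty result below a cutoff so that the example runs without ARC installed.

```python
def first_nonempty(make_streams, thresholds):
    """Try each threshold in order and return the first non-empty result.

    Returns (threshold, streams), or (None, []) if every attempt is empty.
    """
    for threshold in thresholds:
        streams = make_streams(threshold)
        if streams:
            return threshold, streams
    return None, []


def fake_make_streams(max_rhythmicity):
    """Stand-in generator (assumption): succeeds only from 0.2 upwards."""
    return ["stream"] if max_rhythmicity >= 0.2 else []


threshold, found_streams = first_nonempty(fake_make_streams, [0.1, 0.2, 0.3])
print(threshold, found_streams)
```

With ARC, the callable could wrap the call from the cell above, for example `lambda r: make_compatible_streams(words, n_words=4, max_rhythmicity=r)`. Listing the thresholds from strict to lenient keeps the lowest rhythmicity that still yields streams.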