# Controlled Stream Generation

We first generate words and a Lexicon with minimal feature overlap between the words. Next, we introduce the three main ways to generate random streams, which differ in how the transition probabilities (TPs) between syllables are structured: word-structured TPs, fully random (uniform) TPs, and position-controlled TPs. Finally, we compare Streams generated from our controlled Lexicons against random baseline Streams and Streams generated from reference lexicons from the literature. The comparison is based on the repetitiveness of the syllable features.
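To make the notion of transition probabilities concrete, here is a minimal sketch that estimates them from a toy sequence. It does not use the `arc` API: the helper `transition_probabilities` and the single-letter "syllables" are hypothetical and serve only as an illustration.

```python
from collections import Counter, defaultdict


def transition_probabilities(stream):
    """Estimate P(next | current) from adjacent pairs in a sequence."""
    pair_counts = Counter(zip(stream, stream[1:]))
    # Count each element as a "current" item (the last element has no successor)
    current_counts = Counter(stream[:-1])
    tps = defaultdict(dict)
    for (current, nxt), count in pair_counts.items():
        tps[current][nxt] = count / current_counts[current]
    return dict(tps)


# Toy stream built from two hypothetical words, "ABC" and "DEF"
toy_stream = list("ABCDEFABCABCDEFDEFABC")
tps = transition_probabilities(toy_stream)

print(tps["A"])  # within a word: A is always followed by B
print(tps["C"])  # at a word boundary: C is followed by either D or A
```

In a word-structured stream like this toy example, within-word TPs are high and TPs across word boundaries are lower. The other TP modes change this structure.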

First, we generate/reload the words register (see arc types tutorial).

## Syllable and Word Generation

In [2]:
from arc import load_phonemes, make_syllables, make_words
import numpy as np 
import random

np.random.seed(100)
random.seed(100)

print("Load phonemes...")
phonemes = load_phonemes()
print(phonemes)

print("Make syllables...")
syllables = make_syllables(phonemes, phoneme_pattern="cV", unigram_control=True, language_alpha=0.05)
print(syllables)

print("Make words...")
words = make_words(syllables, n_words=10_000, max_tries=100_000)
print(words)

print("Save words ...")
words.save("words.json")

Load phonemes...
k͡p|ɡ͡b|c|ɡ|k|q|ɖ|ɟ|ɠ|ɢ|... (5275 elements total)
Make syllables...
ɡaː|ɡiː|ɡyː|ɡɛː|kaː|koː|kuː|køː|kɛː|baː|... (76 elements total)
Make words...


100%|█████████▉| 9987/10000 [00:12<00:00, 206.56it/s] 

bigram control...
trigram control...
positional control...


100%|██████████| 10000/10000 [00:22<00:00, 454.49it/s]


loːkuːfiː|nyːfaːkuː|huːfiːtyː|moːzuːɡaː|tɛːheːfoː|raːkɛːfiː|nuːkoːfaː|kaːfuːnɛː|hiːtoːfyː|kuːnɛːfoː|... (1839 elements total)
Save words ...


In [4]:
best_lexicon = ["heːdoːfaː", "riːfoːɡyː", "ʃuːhiːboː", "vaːkuːniː"]
np.random.seed(6923)
random.seed(6923)
print([word in words for word in best_lexicon])

[True, True, True, True]


In [8]:
best_lexicon = ["heːdoːfaː", "riːfoːɡyː", "ʃuːhiːboː", "vaːkuːniː"]

# Fix the seeds so the result is reproducible
np.random.seed(6923)
random.seed(6923)

# Take a 200-word subset and check which reference words it still contains
subset_words = words.get_subset(200)
is_in = [word in subset_words for word in best_lexicon]
print(is_in)

[True, False, False, False]


## Lexicon Generation

Now we generate Lexicons with minimal feature repetitiveness. Let's start with 4 words each.

By default, `make_lexicons` generates at most 5 `Lexicon`s. Let's generate 2 and print some info.

In [5]:
from arc import make_lexicons, load_words

words = load_words("words.json")

lexicons = make_lexicons(words, n_lexicons=2, n_words=4, control_features=True)
print("")

for lexicon in lexicons:
    print("Lexicon:", lexicon)
    print("cumulative_feature_repetitiveness:", lexicon.info["cumulative_feature_repetitiveness"])
    print("max_pairwise_feature_repetitiveness:", lexicon.info["max_pairwise_feature_repetitiveness"])
    print("")




Lexicon: toːheːfaː|roːfyːɡiː|muːkaːzyː|fiːdeːhøː
cumulative_feature_repetitiveness: 2
max_pairwise_feature_repetitiveness: 1

Lexicon: hiːdoːvaː|biːhuːzyː|koːfuːnɛː|tɛːheːfiː
cumulative_feature_repetitiveness: 2
max_pairwise_feature_repetitiveness: 1



By default, Lexicons with the minimum possible cumulative feature repetitiveness are generated first, starting at zero. This means words are joined into a Lexicon only if no pair of words in it has overlapping features. If the requested number of Lexicons cannot be generated with zero overlap, the allowed overlap is incremented until all Lexicons are collected, which is indicated by a warning message.

This process is repeated until any of the following conditions is true:
- the requested number of Lexicons has been generated
- the maximum allowed overlap is reached (set via `max_overlap`)
- the set of all word combinations is exhausted

If one or more Lexicons are returned, each Lexicon's `.info` field holds the cumulative overlap across all of its word pairs as well as the maximum pairwise overlap.
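The search strategy described above can be sketched in plain Python. This is not the `arc` implementation: the function `collect_lexicons`, the toy feature sets, and the overlap measure (number of shared features) are assumptions made for illustration.

```python
from itertools import combinations


def pairwise_overlap(features_a, features_b):
    """Toy overlap measure (an assumption): number of shared features."""
    return len(features_a & features_b)


def collect_lexicons(word_features, n_words, n_lexicons, max_overlap):
    """Collect word combinations, relaxing the allowed cumulative overlap step by step."""
    lexicons = []
    seen = set()
    for allowed in range(max_overlap + 1):  # stops once max_overlap is reached
        for combo in combinations(sorted(word_features), n_words):
            if combo in seen:
                continue
            overlaps = [
                pairwise_overlap(word_features[a], word_features[b])
                for a, b in combinations(combo, 2)
            ]
            if sum(overlaps) <= allowed:
                seen.add(combo)
                lexicons.append({
                    "words": combo,
                    "cumulative": sum(overlaps),
                    "max_pairwise": max(overlaps),
                })
                if len(lexicons) == n_lexicons:  # requested number reached
                    return lexicons
    return lexicons  # combinations exhausted or overlap limit hit


# Hypothetical words with hypothetical feature labels
word_features = {
    "w1": {"a", "b"},
    "w2": {"c", "d"},
    "w3": {"a", "c"},
    "w4": {"e", "f"},
}
result = collect_lexicons(word_features, n_words=2, n_lexicons=5, max_overlap=1)
for lexicon in result:
    print(lexicon)
```

In this toy run only four word pairs have zero overlap, so the fifth lexicon is found after the allowed overlap is raised to 1.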

## Stream Generation

We want to generate a complete set of compatible Lexicons for our study, i.e. a compatible set of streams for testing statistical learning hypotheses. If `streams` is empty, try increasing the allowed maximum rhythmicity (`max_rhythmicity`).

As the output below shows, the `.info` field holds useful information about each generated stream: which Lexicon was used to generate it, the rhythmicity indexes achieved for each feature, and which randomization/TP-structure mode was used.

The function `make_streams` tries to generate one stream for each Lexicon and TP mode and discards those that do not meet the `max_rhythmicity` requirement. By default, all streams from a Lexicon are discarded if the Lexicon can't generate streams for all requested TP modes.
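The discarding rule can be illustrated with a small stand-alone sketch. It is not how `make_streams` is implemented. The function `filter_streams`, the TP mode labels, the feature names, and the choice to compare the largest per-feature index against the threshold with `<=` are all assumptions. Only the dictionary keys mirror the `.info` fields used in this tutorial.

```python
def filter_streams(candidates, max_rhythmicity, tp_modes, require_all_tp_modes=True):
    """Keep candidate streams whose per-feature indexes all stay within the threshold."""
    kept_by_lexicon = {}
    for cand in candidates:
        # Assumption: a stream passes if its largest feature index is <= the threshold
        if max(cand["rhythmicity_indexes"].values()) <= max_rhythmicity:
            kept_by_lexicon.setdefault(cand["lexicon"], []).append(cand)

    result = []
    for streams in kept_by_lexicon.values():
        modes = {s["stream_tp_mode"] for s in streams}
        # Drop the whole lexicon if any requested TP mode is missing
        if require_all_tp_modes and modes != set(tp_modes):
            continue
        result.extend(streams)
    return result


# Hypothetical TP mode labels and candidate streams
tp_modes = ["word_structured", "random", "position_controlled"]
candidates = [
    {"lexicon": "lex_a", "stream_tp_mode": "word_structured",
     "rhythmicity_indexes": {"feat_1": 0.02, "feat_2": 0.05}},
    {"lexicon": "lex_a", "stream_tp_mode": "random",
     "rhythmicity_indexes": {"feat_1": 0.01, "feat_2": 0.08}},
    {"lexicon": "lex_a", "stream_tp_mode": "position_controlled",
     "rhythmicity_indexes": {"feat_1": 0.03, "feat_2": 0.04}},
    {"lexicon": "lex_b", "stream_tp_mode": "word_structured",
     "rhythmicity_indexes": {"feat_1": 0.30, "feat_2": 0.02}},  # exceeds 0.1
    {"lexicon": "lex_b", "stream_tp_mode": "random",
     "rhythmicity_indexes": {"feat_1": 0.05, "feat_2": 0.05}},
    {"lexicon": "lex_b", "stream_tp_mode": "position_controlled",
     "rhythmicity_indexes": {"feat_1": 0.06, "feat_2": 0.01}},
]

strict = filter_streams(candidates, 0.1, tp_modes)
lenient = filter_streams(candidates, 0.1, tp_modes, require_all_tp_modes=False)
print(len(strict), len(lenient))
```

With the strict setting, `lex_b` loses all of its streams because one TP mode fails the threshold. With the lenient setting, only the failing stream is dropped.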

In [6]:
def print_stream_info(stream):
    print("Stream:", "|".join([syll.id for syll in stream]))
    print("TP mode:", stream.info["stream_tp_mode"])
    print("Lexicon:", stream.info["lexicon"])
    print("Feature PRIs:", stream.info["rhythmicity_indexes"])
    print("")


In [7]:
from arc import make_streams, make_lexicons, load_words

words = load_words("words.json")
lexicons = make_lexicons(words, n_lexicons=20, n_words=4, control_features=True)
streams = make_streams(lexicons, max_rhythmicity=0.1, require_all_tp_modes=True)
print("")

for stream in streams:
    print_stream_info(stream)

print("Num Streams: ", len(streams))




Stream: høː|hoː|deː|nøː|heː|ɡiː|zuː|paː|boː|zyː|vaː|foː|heː|foː|zuː|boː|høː|vaː|ɡiː|zyː|nøː|deː|hoː|paː|vaː|boː|heː|hoː|ɡiː|høː|foː|deː|zyː|paː|nøː|zuː|høː|zuː|hoː|zyː|deː|foː|boː|nøː|paː|ɡiː|vaː|heː|deː|heː|høː|nøː|boː|vaː|paː|zuː|zyː|foː|ɡiː|hoː|zuː|foː|nøː|hoː|høː|boː|ɡiː|paː|deː|vaː|zyː|heː|zuː|heː|nøː|høː|zyː|ɡiː|foː|hoː|vaː|deː|boː|paː|heː|vaː|zuː|nøː|foː|zyː|hoː|boː|deː|paː|høː|ɡiː|heː|paː|hoː|foː|høː|deː|ɡiː|nøː|zyː|boː|zuː|vaː|nøː|ɡiː|boː|hoː|heː|zyː|zuː|deː|høː|paː|foː|vaː|høː|heː|boː|foː|paː|zyː|vaː|hoː|nøː|deː|zuː|ɡiː|deː|nøː|vaː|zyː|høː|foː|heː|ɡiː|zuː|boː|paː|hoː|vaː|deː|foː|zyː|hoː|paː|zuː|ɡiː|høː|boː|nøː|heː|paː|vaː|boː|hoː|zyː|zuː|høː|deː|heː|nøː|foː|ɡiː|zyː|boː|foː|vaː|zuː|hoː|heː|høː|nøː|ɡiː|deː|paː|zyː|ɡiː|nøː|vaː|paː|høː|hoː|zuː|deː|boː|heː|foː|zuː|vaː|ɡiː|foː|nøː|boː|høː|zyː|paː|heː|hoː|deː|ɡiː|vaː|høː|paː|boː|zuː|nøː|zyː|heː|deː|hoː|foː|boː|ɡiː|hoː|nøː|høː|zuː|zyː|foː|paː|deː|vaː|heː|zuː|foː|høː|ɡiː|heː|vaː|hoː|boː|deː|zyː|nøː|paː|foː|deː|zuː|paː|ɡiː|boː|vaː|nøː