# Controlled Stream Generation

We first generate words and a lexicon with minimal feature overlap between its words. We then introduce the three main ways to generate random streams from a lexicon. Each mode specifies how the transition probabilities (TPs) between syllables are structured:

1. uniformly distributed TPs, called "TP-random position-random" in the paper,
2. position-controlled TPs, called "TP-random position-fixed", and
3. TPs that fully preserve the words, called "TP-structured".
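
To make the notion of a transition probability concrete, here is a minimal sketch that estimates forward TPs, P(next | current), from a sequence of syllables. It does not use the `arc` API; the helper `transition_probabilities` and the toy words "pa-bi-ku" and "go-la-tu" are made up for illustration only.

```python
from collections import Counter, defaultdict

def transition_probabilities(stream):
    """Estimate forward TPs P(next | current) from a list of syllables."""
    pair_counts = Counter(zip(stream, stream[1:]))
    # The last syllable has no successor, so it is excluded from the denominator.
    first_counts = Counter(stream[:-1])
    tps = defaultdict(dict)
    for (current, nxt), count in pair_counts.items():
        tps[current][nxt] = count / first_counts[current]
    return dict(tps)

# Toy word-preserving stream: two hypothetical words concatenated in a fixed order.
toy_words = [["pa", "bi", "ku"], ["go", "la", "tu"]]
word_order = [0, 1, 0, 0, 1, 1, 0, 1]
toy_stream = [syll for i in word_order for syll in toy_words[i]]

tps = transition_probabilities(toy_stream)
print(tps["pa"])  # within a word: {'bi': 1.0}
print(tps["ku"])  # across a word boundary: {'go': 0.75, 'pa': 0.25}
```

In a stream that preserves the words, within-word TPs are 1 while TPs across word boundaries are lower; the two TP-random modes remove this contrast.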

## Syllable and Word Generation

First, we generate (or reload) the word register (see the arc types tutorial).

In [75]:
from arc import load_phonemes, make_syllables, make_words
import numpy as np 
import random

np.random.seed(100)
random.seed(100)

print("Load phonemes...")
phonemes = load_phonemes()
print(phonemes)

print("Make syllables...")
syllables = make_syllables(phonemes, phoneme_pattern="cV", unigram_control=True, language_alpha=0.05)
print(syllables)

print("Make words...")
words = make_words(syllables, n_words=10_000, max_tries=100_000)
print(words)

#print("Save words ...")
#words.save("words.json")

Load phonemes...
k͡p|ɡ͡b|c|ɡ|k|q|ɖ|ɟ|ɠ|ɢ|... (5275 elements total)
Make syllables...


100%|██████████| 10000/10000 [23:56:58<00:00,  8.62s/it]


ɡaː|ɡiː|ɡyː|ɡɛː|kaː|koː|kuː|køː|kɛː|baː|... (76 elements total)
Make words...


100%|█████████▉| 9988/10000 [00:11<00:00, 210.17it/s] 

bigram control...
trigram control...
positional control...


100%|██████████| 10000/10000 [00:21<00:00, 457.67it/s]

loːkuːfiː|nyːfaːkuː|huːfiːtyː|moːzuːɡaː|tɛːheːfoː|raːkɛːfiː|nuːkoːfaː|kaːfuːnɛː|hiːtoːfyː|kuːnɛːfoː|... (1839 elements total)





## Lexicon Generation

Now we generate lexicons with minimal feature repetitiveness. Let's start with 4 words each.

By default, the function generates at most 5 `Lexicon`s. Let's generate 2 and print some info.

In [76]:
from arc import make_lexicons, load_words

lexicons = make_lexicons(words, n_lexicons=2, n_words=4, control_features=True)
print("")

for lexicon in lexicons:
    print("Lexicon:", lexicon)
    print("cumulative_feature_repetitiveness:", lexicon.info["cumulative_feature_repetitiveness"])
    print("max_pairwise_feature_repetitiveness:", lexicon.info["max_pairwise_feature_repetitiveness"])
    print("")




Lexicon: nuːkaːfoː|faːhoːtiː|kuːriːfyː|zyːbeːhuː
cumulative_feature_repetitiveness: 2
max_pairwise_feature_repetitiveness: 1

Lexicon: hoːdeːfiː|peːhuːʃoː|ɡiːfaːnuː|tiːheːvaː
cumulative_feature_repetitiveness: 2
max_pairwise_feature_repetitiveness: 1



By default, Lexicons with the minimum possible cumulative feature repetitiveness will be generated first, starting at zero. This means words will be joined into a lexicon only if no pair of words in the lexicon has overlapping features. If it is not possible to generate the requested number of Lexicons with zero overlap, the allowed overlap will be incremented until all lexicons are collected, which will be indicated by a warning message.

This process will be repeated until any of the following statements is true:
- the requested number of Lexicons has been generated
- the maximum allowed overlap is reached (set via `max_overlap`)
- the set of all word combinations is exhausted
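
The search with stepwise relaxation described above can be sketched as follows. This is not the implementation behind `make_lexicons`; it is a simplified illustration under stated assumptions: words are plain dicts, `pairwise_overlap` is a hypothetical measure that counts shared feature labels, and all word combinations are enumerated exhaustively.

```python
from itertools import combinations

def pairwise_overlap(word_a, word_b):
    """Hypothetical overlap measure: number of shared feature labels."""
    return len(word_a["features"] & word_b["features"])

def collect_lexicons(words, n_lexicons, n_words, max_overlap):
    """Collect lexicons, raising the allowed cumulative overlap step by step."""
    # Score every candidate lexicon once by its cumulative pairwise overlap.
    scored = []
    for idx in combinations(range(len(words)), n_words):
        cumulative = sum(
            pairwise_overlap(words[i], words[j]) for i, j in combinations(idx, 2)
        )
        scored.append((cumulative, idx))

    found = []
    for allowed in range(max_overlap + 1):  # stops once max_overlap is reached
        for cumulative, idx in scored:
            if cumulative == allowed:
                found.append({
                    "words": [words[i]["id"] for i in idx],
                    "cumulative_overlap": cumulative,
                })
                if len(found) == n_lexicons:  # requested number reached
                    return found
    return found  # all admissible combinations exhausted

toy_words = [
    {"id": "w1", "features": {"labial", "front"}},
    {"id": "w2", "features": {"velar", "back"}},
    {"id": "w3", "features": {"labial", "back"}},
    {"id": "w4", "features": {"coronal", "mid"}},
]
toy_lexicons = collect_lexicons(toy_words, n_lexicons=2, n_words=3, max_overlap=2)
for lex in toy_lexicons:
    print(lex)
```

Here only one combination has zero overlap, so the second lexicon is taken from the candidates with a cumulative overlap of 1. The three return paths mirror the three stopping conditions listed above.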

If one or more Lexicons are returned, their `info` fields hold the cumulative overlap across all word pairs in the Lexicon as well as the maximum pairwise overlap.

## Stream Generation

We want to generate a complete set of compatible lexicons for our study, i.e. a compatible set of streams for testing statistical learning hypotheses. If `streams` is empty, try increasing the allowed maximum rhythmicity (`max_rhythmicity`).

As the output below shows, the `.info` field holds some useful information about the generated stream, i.e. which Lexicon has been used to generate it, the rhythmicity indexes achieved for each feature, and which randomization/TP-structure mode has been used.

The function `make_streams` will try to generate one stream for each lexicon and TP mode, and will discard those that do not meet the `max_rhythmicity` requirement. By default, all streams from a lexicon will be discarded if the lexicon can't generate streams for all requested TP modes.
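
The discard logic just described can be illustrated with a small stand-alone sketch. It is not how `make_streams` is implemented: streams are represented as plain dicts, the helper `filter_streams` is hypothetical, and it assumes that a stream passes when its largest per-feature rhythmicity index does not exceed the threshold. The TP mode labels are taken from the paper's naming and are illustrative only.

```python
TP_MODES = ["TP-random position-random", "TP-random position-fixed", "TP-structured"]

def filter_streams(candidates, tp_modes, max_rhythmicity, require_all_tp_modes=True):
    """Keep streams below the rhythmicity threshold, grouped per lexicon."""
    by_lexicon = {}
    for stream in candidates:
        by_lexicon.setdefault(stream["lexicon"], []).append(stream)

    kept = []
    for streams in by_lexicon.values():
        # Assumption: the threshold applies to the largest feature index.
        passing = [
            s for s in streams
            if max(s["rhythmicity_indexes"].values()) <= max_rhythmicity
        ]
        covered_modes = {s["tp_mode"] for s in passing}
        if require_all_tp_modes and covered_modes != set(tp_modes):
            continue  # drop every stream of this lexicon
        kept.extend(passing)
    return kept

# Toy candidates: "lex1" passes in all modes, "lex2" fails in one mode.
toy_candidates = [
    {"lexicon": lex, "tp_mode": mode, "rhythmicity_indexes": {"feat_a": pri, "feat_b": 0.0}}
    for lex, pris in [("lex1", [0.05, 0.1, 0.0]), ("lex2", [0.05, 0.2, 0.0])]
    for mode, pri in zip(TP_MODES, pris)
]

strict = filter_streams(toy_candidates, TP_MODES, max_rhythmicity=0.1)
relaxed = filter_streams(toy_candidates, TP_MODES, max_rhythmicity=0.1,
                         require_all_tp_modes=False)
print(len(strict), len(relaxed))  # 3 5
```

With the strict setting, a single failing mode removes all streams of that lexicon, which keeps the remaining set balanced across TP modes.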

In [77]:
from arc import make_streams, make_lexicons, load_words

def print_stream_info(stream):
    print("Stream:", "|".join([syll.id for syll in stream]))
    print("TP mode:", stream.info["stream_tp_mode"])
    print("Lexicon:", stream.info["lexicon"])
    print("Feature PRIs:", stream.info["rhythmicity_indexes"])
    print("")

lexicons = make_lexicons(words, n_lexicons=20, n_words=4, control_features=True)
streams = make_streams(lexicons, max_rhythmicity=0.1, require_all_tp_modes=True)
print("")

print_stream_info(streams[0])

print("Total number of generated streams: ", len(streams))




Stream: høː|muː|koː|poː|ryː|foː|deː|ʃiː|fuː|ʃoː|ɡiː|huː|ʃiː|høː|poː|foː|koː|deː|ryː|ʃoː|muː|huː|fuː|ɡiː|muː|deː|høː|fuː|poː|ʃiː|ɡiː|foː|ʃoː|koː|huː|ryː|koː|ɡiː|ʃiː|poː|huː|foː|fuː|ryː|deː|muː|ʃoː|høː|ʃoː|deː|koː|ʃiː|foː|ɡiː|høː|ryː|muː|poː|fuː|huː|deː|poː|ʃoː|ryː|ʃiː|muː|ɡiː|koː|fuː|foː|høː|huː|muː|foː|ryː|huː|ɡiː|poː|koː|høː|ʃiː|ʃoː|fuː|deː|fuː|muː|høː|deː|huː|poː|ɡiː|ʃoː|foː|ʃiː|koː|ryː|høː|foː|muː|ryː|poː|deː|ɡiː|fuː|koː|ʃoː|ʃiː|huː|høː|ɡiː|ryː|fuː|ʃiː|deː|foː|poː|muː|huː|ʃoː|koː|muː|fuː|høː|koː|foː|huː|deː|ʃoː|poː|ʃiː|ryː|ɡiː|deː|muː|ʃiː|poː|høː|fuː|foː|huː|koː|ryː|ɡiː|ʃoː|huː|muː|fuː|høː|deː|ɡiː|ryː|koː|ʃoː|poː|foː|ʃiː|huː|ʃiː|foː|ɡiː|deː|høː|koː|poː|ʃoː|muː|ryː|fuː|ʃiː|høː|foː|ʃoː|huː|poː|fuː|deː|ryː|muː|koː|ɡiː|foː|muː|ʃoː|fuː|ɡiː|poː|ryː|huː|høː|ʃiː|deː|koː|høː|ryː|foː|poː|muː|deː|ʃoː|ʃiː|fuː|koː|huː|ɡiː|huː|fuː|ryː|deː|poː|koː|foː|høː|muː|ɡiː|ʃiː|ʃoː|foː|koː|ʃiː|muː|høː|huː|ryː|ʃoː|deː|fuː|poː|ɡiː|koː|muː|ʃiː|ɡiː|fuː|huː|ʃoː|ryː|høː|poː|deː|foː|deː|ʃiː|ryː|poː|huː|koː|fuː|ʃoː