# Controlled Stream Generation

In this tutorial, we generate words and lexica with minimal feature overlap between words. We then introduce the three main ways to generate random streams from a lexicon. Each one specifies how the transition probabilities (TPs) between syllables are structured:

1. uniformly distributed TPs, called "TP-random position-random" in the paper,
2. position-controlled TPs, called "TP-random position-fixed", and
3. TPs that fully preserve the words, called "TP-structured".
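
To make the idea of transition probabilities concrete, here is a minimal sketch that estimates TPs from a syllable sequence and contrasts a word-preserving stream with an unstructured one. It does not use ARC: the toy lexicon, the helper `transition_probabilities`, and the sampling-based streams are assumptions made for illustration only, and ARC's own TP modes may construct their streams differently.

```python
import random
from collections import Counter


def transition_probabilities(syllables):
    """Estimate P(next syllable | current syllable) from adjacent pairs."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


# Hypothetical toy lexicon (not generated by ARC): two trisyllabic words.
toy_words = [["pa", "bi", "ku"], ["ti", "bu", "do"]]
all_syllables = [syll for word in toy_words for syll in word]
rng = random.Random(0)

# Word-preserving stream: whole words are concatenated in random order.
structured = [syll for _ in range(200) for syll in rng.choice(toy_words)]
structured_tps = transition_probabilities(structured)

# Unstructured stream: syllables are drawn independently of each other.
unstructured = [rng.choice(all_syllables) for _ in range(600)]
unstructured_tps = transition_probabilities(unstructured)

print("within-word TP:", structured_tps[("pa", "bi")])
print("highest TP without word structure:", round(max(unstructured_tps.values()), 2))
```

In the word-preserving stream every within-word transition has a TP of 1, while transitions across word boundaries are shared among the words. In the unstructured stream no transition is strongly preferred.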

## Syllable and Word Generation

First, we generate or reload the words register (see the arc types tutorial).

Because ARC runs probabilistically (to speed things up), we set the random seeds to make sure our runs are reproducible.

In [1]:
import numpy as np
import random
import os

np.random.seed(0)
random.seed(0)

In [2]:
from arc import load_words

print("Load words...")
words = load_words(os.path.join("results", "words.json"))
print(words)

Load words...
nyːfuːɡaː|huːpoːzɛː|løːvaːkuː|tiːfyːhøː|ɡiːʃoːmyː|køːvaːroː|tuːfoːheː|nuːkɛːfiː|hiːtyːvaː|hiːtyːfuː|... (1849 elements total)


## Lexicon Generation

Now we generate lexica with minimal feature repetitiveness. Let's start with 4 words each.

By default, `make_lexicons` generates at most 5 `Lexicon`s. Let's generate 2 and print some info.

In [3]:
from arc import make_lexicons, load_words

lexicons = make_lexicons(words, n_lexicons=2, n_words=4, control_features=True)
print("")

for lexicon in lexicons:
    print(lexicon)

Increasing allowed overlaps: MAX_PAIRWISE_OVERLAP=1, MAX_CUMULATIVE_OVERLAP=1
Increasing allowed overlaps: MAX_PAIRWISE_OVERLAP=1, MAX_CUMULATIVE_OVERLAP=2



ɡaːfiːloː|hiːdoːfyː|foːreːɡiː|ʃuːkaːmɛː
koːruːvaː|faːhøːdeː|ʃoːpeːhuː|nuːɡiːfoː


By default, Lexicons with the minimum possible cumulative feature repetitiveness are generated first, starting at zero. This means words are joined into a lexicon only if the features of all word pairs in the lexicon have no overlap. If it is not possible to generate the requested number of Lexicons with zero overlap, the allowed overlap is increased until all lexicons are collected. Each increase is reported in a warning message.

This process is repeated until any of the following statements is true:
- the requested number of Lexicons has been generated
- the maximum allowed overlap is reached (set via `max_overlap`)
- the set of all word combinations is exhausted
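
The search loop described above can be sketched as follows. This is not ARC's implementation: the toy feature sets, the helper names, and the use of set intersection as the overlap measure are assumptions, and the sketch tracks only the cumulative overlap, not the separate pairwise limit.

```python
from itertools import combinations

# Hypothetical toy data: each word maps to the features it repeats.
word_features = {
    "w1": {"son"},
    "w2": {"lab"},
    "w3": {"son", "cor"},
    "w4": {"voi"},
}


def pairwise_overlap(a, b):
    """Number of features shared by two words (assumed overlap measure)."""
    return len(word_features[a] & word_features[b])


def find_lexicons(words, n_words, n_lexicons, max_overlap):
    found, seen = [], set()
    # Start at zero allowed overlap and relax the limit step by step.
    for allowed in range(max_overlap + 1):
        for candidate in combinations(words, n_words):
            if candidate in seen:
                continue
            cumulative = sum(
                pairwise_overlap(a, b) for a, b in combinations(candidate, 2)
            )
            if cumulative <= allowed:
                seen.add(candidate)
                found.append((candidate, cumulative))
                if len(found) == n_lexicons:
                    return found  # requested number reached
    return found  # overlap limit reached or combinations exhausted


lexicons_found = find_lexicons(sorted(word_features), 3, 3, max_overlap=1)
for candidate, cumulative in lexicons_found:
    print("|".join(candidate), "cumulative overlap:", cumulative)
```

The two zero-overlap lexicons are collected first. The third is accepted only after the allowed overlap has been raised to 1, which mirrors the three stopping conditions listed above.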

If one or more Lexicons are returned, their `info` fields hold the cumulative overlap between all word pairs achieved by the Lexicon, as well as the maximum pairwise overlap used.

In [4]:
for lexicon in lexicons:
    print("Lexicon:", lexicon)
    print("cumulative_feature_repetitiveness:", lexicon.info["cumulative_feature_repetitiveness"])
    print("max_pairwise_feature_repetitiveness:", lexicon.info["max_pairwise_feature_repetitiveness"])
    print("")

Lexicon: ɡaːfiːloː|hiːdoːfyː|foːreːɡiː|ʃuːkaːmɛː
cumulative_feature_repetitiveness: 2
max_pairwise_feature_repetitiveness: 1

Lexicon: koːruːvaː|faːhøːdeː|ʃoːpeːhuː|nuːɡiːfoː
cumulative_feature_repetitiveness: 2
max_pairwise_feature_repetitiveness: 1



## Stream Generation

For our study, we want a complete set of compatible streams for testing statistical learning hypotheses, generated from a compatible set of lexicons. If `streams` comes back empty, try increasing the allowed maximum rhythmicity.

The function `make_streams` will try to generate one stream for each lexicon and TP mode. If you specify `max_rhythmicity`, it will discard the streams that do not meet the requirement. By default, all streams from a lexicon are discarded if the lexicon can't generate streams for all requested TP modes. The output below shows a collection of streams. Because streams can get long, only their keys are printed; each key consists of the lexicon used to generate the stream and its TP mode.
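
The discarding rules can be illustrated with a small stand-alone sketch. It does not call ARC: the candidate tuples, their rhythmicity values, and the helper `filter_streams` are hypothetical, and comparing a single highest rhythmicity index per stream against the threshold is an assumption about how the check works.

```python
from collections import defaultdict

# Hypothetical candidates: (lexicon key, TP mode, highest rhythmicity index).
candidates = [
    ("lexA", "random", 0.09),
    ("lexA", "word_structured", 0.12),
    ("lexA", "position_controlled", 0.08),
    ("lexB", "random", 0.05),
    ("lexB", "word_structured", 0.31),
    ("lexB", "position_controlled", 0.07),
]
tp_modes = ["random", "word_structured", "position_controlled"]


def filter_streams(candidates, tp_modes, max_rhythmicity, require_all_tp_modes=True):
    # Step 1: drop every stream that exceeds the rhythmicity threshold.
    kept = defaultdict(list)
    for lexicon, mode, rhythmicity in candidates:
        if rhythmicity <= max_rhythmicity:
            kept[lexicon].append(mode)
    # Step 2: optionally drop lexicons that lost one of the requested modes.
    result = []
    for lexicon, modes in kept.items():
        if require_all_tp_modes and not set(tp_modes) <= set(modes):
            continue
        result.extend((lexicon, mode) for mode in modes)
    return result


strict = filter_streams(candidates, tp_modes, max_rhythmicity=0.2)
relaxed = filter_streams(candidates, tp_modes, 0.2, require_all_tp_modes=False)
print(len(strict), len(relaxed))
```

With the strict setting, the second lexicon disappears entirely because one of its streams exceeds the threshold. With `require_all_tp_modes=False`, its two remaining streams are kept.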

In [10]:
from arc import make_streams, make_lexicons, load_words

lexicons = make_lexicons(words, n_lexicons=1, n_words=4, control_features=True)
streams = make_streams(lexicons, require_all_tp_modes=True)

print(streams)

Increasing allowed overlaps: MAX_PAIRWISE_OVERLAP=1, MAX_CUMULATIVE_OVERLAP=1
Increasing allowed overlaps: MAX_PAIRWISE_OVERLAP=1, MAX_CUMULATIVE_OVERLAP=2


Lexicon-lyːfoːkuː-huːʃiːbaː-faːhiːtuː-koːvaːniː_TP-random|Lexicon-lyːfoːkuː-huːʃiːbaː-faːhiːtuː-koːvaːniː_TP-word_structured|Lexicon-lyːfoːkuː-huːʃiːbaː-faːhiːtuː-koːvaːniː_TP-position_controlled


To inspect a stream, you just have to select one:

In [14]:
stream = streams[0]
print(stream)

kuːbaːlyːhiːtuːfaːkoːniːvaːhuːfoːʃiːfaːhuːtuːfoːlyːniːbaːkuːkoːʃiːhiːvaːniːfaːkuːvaːʃiːfoːtuːkoːhiːlyːhuːbaːtuːvaːfoːkoːkuːhuːniːʃiːlyːfaːhiːbaːfoːfaːniːhiːkoːhuːvaːbaːʃiːkuːtuːlyːbaːvaːkuːʃiːniːtuːhuːlyːkoːfaːfoːhiːkuːlyːtuːhiːfaːbaːkoːfoːniːhuːʃiːvaːlyːvaːfaːʃiːbaːhiːniːkoːtuːkuːfoːhuːhiːhuːkuːfaːtuːʃiːkoːbaːniːlyːfoːvaːtuːniːfoːbaːfaːlyːʃiːhuːkoːvaːhiːkuːniːkuːhiːfoːlyːfaːvaːkoːʃiːtuːbaːhuːfaːhiːʃiːlyːkuːkoːbaːniːfoːhuːvaːtuːfoːkuːbaːvaːʃiːfaːkoːlyːhuːhiːtuːniːbaːtuːlyːkuːhuːkoːniːʃiːfoːvaːhiːfaːlyːkoːhiːhuːniːkuːtuːfaːvaːfoːʃiːbaːʃiːvaːkoːfoːtuːbaːhuːfaːkuːhiːlyːniːtuːhuːfoːniːkoːvaːfaːbaːhiːʃiːkuːlyːbaːfoːfaːniːvaːhuːkuːʃiːtuːkoːlyːhiːvaːniːfaːhuːtuːʃiːkoːkuːfoːhiːbaːlyːtuːhiːniːlyːfoːkoːfaːʃiːhuːbaːkuːvaːlyːvaːbaːkoːhuːʃiːniːhiːfoːkuːfaːtuːvaːkuːniːhuːlyːʃiːhiːkoːtuːfoːbaːfaːfoːniːkuːʃiːhiːkoːfaːhuːlyːbaːvaːtuːkuːfoːhuːʃiːtuːvaːlyːfaːhiːniːkoːbaːtuːniːfaːkuːbaːhuːfoːkoːhiːlyːʃiːvaːbaːfoːkuːfaːtuːlyːhuːhiːʃiːkoːvaːniːhiːfaːkoːtuːhuːbaːʃiːniːfoːvaːkuːlyːfoːlyːkuːkoːhuːvaːʃiːfaːniːt

In [15]:
print("TP mode:", stream.info["stream_tp_mode"])
print("Lexicon:", stream.info["lexicon"])
print("Feature PRIs:") 
for feat, pri in stream.info["rhythmicity_indexes"].items():
    print(" ", feat, pri)

TP mode: random
Lexicon: lyːfoːkuː|huːʃiːbaː|faːhiːtuː|koːvaːniː
Feature PRIs:
  phon_1_son 0.06315789473684211
  phon_1_back 0.03508771929824561
  phon_1_hi 0.03508771929824561
  phon_1_lab 0.07192982456140351
  phon_1_cor 0.09298245614035087
  phon_1_cont 0.015789473684210527
  phon_1_lat 0.0
  phon_1_nas 0.0
  phon_1_voi 0.09298245614035087
  phon_2_back 0.005263157894736842
  phon_2_hi 0.010526315789473684
  phon_2_lo 0.03684210526315789
  phon_2_lab 0.05087719298245614
  phon_2_tense 0.0
  phon_2_long 0.0


As you can see, the `.info` field holds useful information about the generated stream: which Lexicon was used to generate it, the rhythmicity indexes achieved for each feature, and which randomization/TP-structure mode was used.
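
As a small follow-up, the rhythmicity indexes can be summarized with plain dictionary operations. The sketch below uses a stand-alone dictionary holding a subset of the values printed above, so it runs without ARC. The 0.05 cut-off is an arbitrary, hypothetical threshold chosen only for illustration.

```python
# A subset of the values printed above, copied into a plain dictionary.
rhythmicity_indexes = {
    "phon_1_son": 0.06315789473684211,
    "phon_1_cor": 0.09298245614035087,
    "phon_1_lat": 0.0,
    "phon_2_lab": 0.05087719298245614,
}

# Feature with the highest rhythmicity index in this subset.
worst_feature = max(rhythmicity_indexes, key=rhythmicity_indexes.get)

# Features above a hypothetical cut-off of 0.05.
flagged = [feat for feat, pri in rhythmicity_indexes.items() if pri > 0.05]

print("highest:", worst_feature)
print("above 0.05:", flagged)
```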

This concludes the second tutorial. The third and last tutorial in this series shows how to use your own data.