# Controlled Stream Generation

In this tutorial, we generate words and lexicons with minimal feature overlap between words. We then introduce the three main ways to generate syllable streams from a lexicon. Each mode specifies how the transition probabilities (TPs) between syllables are structured:

1. uniformly distributed TPs, called "TP-random position-random" in the paper,
2. position-controlled TPs, called "TP-random position-fixed", and
3. TPs that fully preserve the words, called "TP-structured".
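
To make the notion of a TP concrete, here is a minimal sketch that estimates TPs from a toy word-structured stream. It does not use ARC: the toy lexicon, the helper names `transition_probabilities` and `word_structured_stream`, and the rule that a word never follows itself are all assumptions made for illustration.

```python
import random
from collections import Counter


def transition_probabilities(stream):
    """Estimate P(next syllable | current syllable) from adjacent pairs."""
    pair_counts = Counter(zip(stream, stream[1:]))
    first_counts = Counter(stream[:-1])  # the last syllable has no successor
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}


def word_structured_stream(lexicon, n_words, rng):
    """Concatenate whole words in random order.

    Assumption for this sketch: a word is never repeated back to back.
    """
    stream, previous = [], None
    for _ in range(n_words):
        word = rng.choice([w for w in lexicon if w != previous])
        stream.extend(word)
        previous = word
    return stream


# Hypothetical toy lexicon: 4 words of 3 syllables each (ASCII stand-ins).
toy_lexicon = [
    ("ku", "my", "sho"),
    ("ze", "po", "hu"),
    ("fy", "ka", "ni"),
    ("hi", "zy", "ba"),
]

rng = random.Random(0)
stream = word_structured_stream(toy_lexicon, n_words=200, rng=rng)
tps = transition_probabilities(stream)

within = tps[("ku", "my")]  # inside a word: always 1.0
across = {pair: p for pair, p in tps.items() if pair[0] == "sho"}
print("within-word TP ku -> my:", within)
print("TPs across the word boundary after 'sho':", across)
```

In a stream like this, every within-word transition has a TP of 1, while the transitions out of a word-final syllable are spread over the first syllables of the other words. The two TP-random modes remove that contrast in different ways.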

## Syllable and Word Generation

First, we load the word register from disk (see the ARC types tutorial for how to generate it).

Because ARC runs probabilistically (to speed things up), we set the random seeds to make sure our runs are reproducible.

In [1]:
import numpy as np 
import random
import os

np.random.seed(0)
random.seed(0)

In [2]:
from arc import load_words

print("Load words...")
words = load_words(os.path.join("results", "words.json"))
print(words)

Load words...
tuːfiːheː|biːhøːʃaː|biːnyːçaː|høːbyːsiː|baːhuːʃoː|ʃøːmeːɡiː|muːʃiːɡaː|ɡyːʃuːmeː|puːʃaːhiː|doːhiːfuː|... (9720 elements total)


## Lexicon Generation

Now we generate lexicons with minimal feature repetitiveness.

In [3]:
from arc import make_lexicons
help(make_lexicons)

Help on function make_lexicons in module arc.core.lexicon:

make_lexicons(words: ~RegisterType, n_lexicons: int = 5, n_words: int = 4, max_overlap: int = 1, lag_of_interest: int = 1, max_word_matrix: int = 200, unique_words: bool = False, control_features: bool = True) -> List[arc.types.base_types.Register]
    _summary_
    
    Args:
        words (RegisterType): The Register of words which the lexicon generation is based on.
        n_lexicons (int, optional): How many lexicons to generate. Defaults to 5.
        n_words (int, optional): How many words should be in a lexicon. Defaults to 4.
        max_overlap (int, optional): How much feature overlap between pairwise word features is allowed. Defaults to 1.
        lag_of_interest (int, optional): the frequency of the word features for which a feature is consideret 'overlapping'. 1 means the feature frequency is the number of syllables in 1 word. Defaults to 1.
        max_word_matrix (int, optional): How many words to use maximum 

Let's generate 2 lexicons with 4 words each and print some info.

In [4]:
lexicons = make_lexicons(words, n_lexicons=2, n_words=4, control_features=True)
print("")

for lexicon in lexicons:
    print(lexicon)

Increasing allowed overlaps: MAX_PAIRWISE_OVERLAP=1, MAX_CUMULATIVE_OVERLAP=1
Increasing allowed overlaps: MAX_PAIRWISE_OVERLAP=1, MAX_CUMULATIVE_OVERLAP=2



kuːmyːʃøː|zɛːpoːhuː|fyːkaːniː|hiːzyːbaː
kuːmyːʃøː|tɛːfyːhøː|beːʃaːhoː|huːʃiːpaː


> ⚠️ The runtime of this function depends on the parameters when `control_features=True`. If it takes too long, consider reducing the number of words in the lexicon or the number of lexicons. If you don't get any output, consider increasing the maximum pairwise overlap allowed.

By default, lexicons with the minimum possible cumulative feature repetitiveness are generated first, starting at zero. This means words are joined into a lexicon only if no pair of words in the lexicon has overlapping features. If the requested number of lexicons cannot be generated with zero overlap, the allowed overlap is increased step by step until all lexicons are collected; each increase is reported in a warning message.

This process is repeated until any of the following is true:
- the requested number of Lexicons has been generated
- the maximum allowed overlap is reached (set via `max_overlap`)
- the set of all word combinations is exhausted
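
The search strategy described above can be sketched as follows. This is a simplified illustration under stated assumptions, not ARC's implementation: words are represented by hypothetical sets of feature labels, overlap is the number of shared labels, and only a single pairwise threshold is escalated. The function name `find_lexicons` and the toy data are invented for this example.

```python
from itertools import combinations


def pairwise_overlap(features_a, features_b):
    """Number of feature labels two words share (toy definition)."""
    return len(features_a & features_b)


def find_lexicons(word_features, n_lexicons, n_words, max_overlap):
    """Collect word combinations, allowing more overlap only when needed."""
    candidates = list(combinations(sorted(word_features), n_words))
    found, seen = [], set()
    for allowed in range(max_overlap + 1):  # stop once max_overlap is reached
        for combo in candidates:
            if combo in seen:
                continue
            overlaps = [
                pairwise_overlap(word_features[a], word_features[b])
                for a, b in combinations(combo, 2)
            ]
            if max(overlaps) <= allowed:
                seen.add(combo)
                found.append({
                    "words": combo,
                    "max_pairwise": max(overlaps),
                    "cumulative": sum(overlaps),
                })
                if len(found) == n_lexicons:  # requested number reached
                    return found
    return found  # may be shorter than requested (combinations exhausted)


# Hypothetical feature sets for four toy words.
toy_features = {
    "A": {"f1", "f2"},
    "B": {"f3", "f4"},
    "C": {"f5", "f6"},
    "D": {"f1", "f5"},
}

result = find_lexicons(toy_features, n_lexicons=2, n_words=3, max_overlap=1)
for entry in result:
    print(entry)
```

Here the first lexicon is found with zero overlap, and the second only after the allowed overlap is raised to 1, which mirrors the "Increasing allowed overlaps" messages in the output above.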

If one or more lexicons are returned, their `info` fields hold the cumulative overlap across all word pairs in the lexicon as well as the maximum pairwise overlap used.

In [5]:
for lexicon in lexicons:
    print("Lexicon:", lexicon)
    print("cumulative_feature_repetitiveness:", lexicon.info["cumulative_feature_repetitiveness"])
    print("max_pairwise_feature_repetitiveness:", lexicon.info["max_pairwise_feature_repetitiveness"])
    print("")

Lexicon: kuːmyːʃøː|zɛːpoːhuː|fyːkaːniː|hiːzyːbaː
cumulative_feature_repetitiveness: 2
max_pairwise_feature_repetitiveness: 1

Lexicon: kuːmyːʃøː|tɛːfyːhøː|beːʃaːhoː|huːʃiːpaː
cumulative_feature_repetitiveness: 2
max_pairwise_feature_repetitiveness: 1



## Stream Generation

For our study, we want a complete set of compatible streams, generated from our lexicons, for testing statistical learning hypotheses. If the resulting `streams` register is empty, try increasing the allowed maximum rhythmicity (`max_rhythmicity`).

The function `make_streams` tries to generate one stream for each lexicon and TP mode. If you specify `max_rhythmicity`, it discards streams that do not meet the requirement. By default (`require_all_tp_modes=True`), all streams from a lexicon are discarded if the lexicon cannot produce streams for all requested TP modes. The output below shows a collection of streams. Because streams can get long, only their keys are printed; each key consists of the lexicon used to generate the stream and its TP mode.

In [6]:
from arc import make_streams
help(make_streams)

Help on function make_streams in module arc.core.stream:

make_streams(lexicons: List[~RegisterType], max_rhythmicity: Optional[float] = None, stream_length: int = 32, max_tries_randomize: int = 10, tp_modes: tuple = ('random', 'word_structured', 'position_controlled'), require_all_tp_modes: bool = True) -> ~RegisterType
    _summary_
    
    Args:
        lexicons (List[LexiconType]): A list of lexicons used as a basis for generatng the streams
        max_rhythmicity (Optional[float], optional): check rhythmicity and discard all streams that have at least one feature with higher PRI than this number. Defaults to None.
        stream_length (int, optional): how many syllables are in a stream in multiples of syllables in the lexicon. Defaults to 4.
        max_tries_randomize (int, optional): if max_rhythmicity is given and violated, how many times to try with a new randomization. Defaults to 10.
        tp_modes (tuple, optional): the ways (modes) in which to control for transition p

In [8]:
streams = make_streams(lexicons)

print(streams)

kuːmyːʃøːzɛːpoːhuːfyːkaːniːhiːzyːbaː_random|kuːmyːʃøːzɛːpoːhuːfyːkaːniːhiːzyːbaː_word_structured|kuːmyːʃøːzɛːpoːhuːfyːkaːniːhiːzyːbaː_position_controlled|kuːmyːʃøːtɛːfyːhøːbeːʃaːhoːhuːʃiːpaː_random|kuːmyːʃøːtɛːfyːhøːbeːʃaːhoːhuːʃiːpaː_word_structured|kuːmyːʃøːtɛːfyːhøːbeːʃaːhoːhuːʃiːpaː_position_controlled


> ⚠️ The runtime of this function depends on the parameters, especially when you specify a `max_rhythmicity`, because the function re-samples the random stream until `max_rhythmicity` is satisfied. This takes time, because TP-statistics need to be controlled each time. If it takes too long, consider removing the option.
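
The resampling behaviour can be pictured as a bounded retry loop. The sketch below is an assumption-laden illustration: `toy_periodicity` is a made-up stand-in score (the share of syllables that recur at a fixed lag), not ARC's PRI, and `sample_until_ok` is a hypothetical helper that plays the role of the `max_tries_randomize` logic.

```python
import random


def toy_periodicity(stream, lag):
    """Toy stand-in score: share of positions whose syllable recurs `lag` steps later."""
    pairs = list(zip(stream, stream[lag:]))
    return sum(a == b for a, b in pairs) / len(pairs)


def sample_until_ok(syllables, n_repeats, max_score, max_tries, rng, lag=3):
    """Reshuffle until the score is acceptable or the tries run out."""
    for attempt in range(1, max_tries + 1):
        stream = syllables * n_repeats  # fresh list on every attempt
        rng.shuffle(stream)
        score = toy_periodicity(stream, lag)
        if score <= max_score:
            return stream, score, attempt
    return None  # no acceptable stream within max_tries


# Hypothetical toy syllables (ASCII stand-ins).
toy_syllables = ["ku", "my", "sho", "ze", "po", "hu"]

accepted = sample_until_ok(
    toy_syllables, n_repeats=20, max_score=0.25, max_tries=10, rng=random.Random(0)
)
rejected = sample_until_ok(
    toy_syllables, n_repeats=20, max_score=-1.0, max_tries=10, rng=random.Random(0)
)
print("accepted:", accepted is not None)
print("rejected:", rejected is None)
```

Each retry recomputes the score from scratch, which is why a strict threshold combined with many tries makes the run slow, and why a threshold that cannot be met yields no stream at all.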

To inspect a stream, select one either by index or by key:

In [9]:
stream = streams[0]
print(stream)

kuː|baː|poː|ʃøː|kaː|zɛː|fyː|hiː|myː|niː|huː|zyː|huː|poː|kuː|myː|zyː|kaː|hiː|zɛː|baː|fyː|niː|ʃøː|zyː|kuː|huː|zɛː|ʃøː|fyː|kaː|myː|hiː|baː|niː|poː|niː|fyː|ʃøː|poː|huː|baː|kaː|kuː|zɛː|hiː|zyː|myː|kuː|poː|kaː|ʃøː|zɛː|niː|zyː|fyː|huː|myː|baː|hiː|huː|ʃøː|kuː|zyː|hiː|niː|baː|myː|fyː|zɛː|kaː|poː|fyː|kuː|niː|kaː|baː|zyː|poː|myː|zɛː|huː|hiː|ʃøː|myː|ʃøː|huː|niː|zɛː|zyː|baː|kuː|kaː|fyː|poː|hiː|poː|zyː|ʃøː|hiː|kaː|niː|kuː|fyː|baː|zɛː|myː|huː|kuː|ʃøː|baː|huː|kaː|zyː|niː|hiː|fyː|myː|poː|zɛː|poː|baː|ʃøː|niː|myː|kaː|huː|fyː|zyː|zɛː|kuː|hiː|kuː|niː|myː|ʃøː|huː|poː|kaː|zyː|fyː|hiː|zɛː|baː|ʃøː|niː|kuː|baː|huː|myː|zyː|zɛː|fyː|poː|hiː|kaː|zɛː|huː|kuː|kaː|ʃøː|myː|baː|fyː|niː|hiː|zyː|poː|fyː|kaː|huː|ʃøː|poː|zɛː|zyː|niː|baː|myː|hiː|kuː|zyː|baː|hiː|myː|zɛː|niː|fyː|kuː|ʃøː|kaː|poː|huː|baː|zyː|kaː|niː|ʃøː|zɛː|kuː|myː|fyː|huː|hiː|poː|ʃøː|fyː|zyː|hiː|baː|kuː|poː|niː|zɛː|kaː|myː|huː|zɛː|poː|myː|kuː|huː|niː|zyː|ʃøː|hiː|fyː|baː|kaː|baː|niː|poː|zyː|huː|fyː|myː|kaː|hiː|ʃøː|kuː|zɛː|hiː|huː|zyː|myː|niː|kaː|fyː|zɛː|ʃøː|baː|

In [10]:
print("Lexicon:", stream.info["lexicon"])
print("TP mode:", stream.info["stream_tp_mode"])
print("Feature PRIs:") 
for feat, pri in stream.info["rhythmicity_indexes"].items():
    print(" ", feat, pri)

Lexicon: kuːmyːʃøː|zɛːpoːhuː|fyːkaːniː|hiːzyːbaː
TP mode: random
Feature PRIs:
  phon_1_son 0.06878306878306878
  phon_1_back 0.03968253968253968
  phon_1_hi 0.03968253968253968
  phon_1_lab 0.07671957671957672
  phon_1_cor 0.06084656084656084
  phon_1_cont 0.03968253968253968
  phon_1_lat 0.0
  phon_1_nas 0.05291005291005291
  phon_1_voi 0.06613756613756613
  phon_2_back 0.082010582010582
  phon_2_hi 0.015873015873015872
  phon_2_lo 0.042328042328042326
  phon_2_lab 0.07936507936507936
  phon_2_tense 0.0
  phon_2_long 0.0


As you can see, the `.info` field holds some useful information about the generated stream, namely which lexicon was used to generate it, the rhythmicity indexes achieved for each feature, and which randomization/TP-structure mode was used.

This concludes the second tutorial. The third and last tutorial in this series shows how to use your own data.
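
As a small sketch of how such per-feature values could be checked against a threshold, the snippet below works on a plain dictionary that copies a few of the PRI values printed above. The helpers `worst_feature` and `passes` are hypothetical and not part of ARC; the pass rule follows the `max_rhythmicity` description in the help text (a stream fails if any feature exceeds the limit).

```python
# A few PRI values copied from the output above, stored in a plain dict.
rhythmicity_indexes = {
    "phon_1_son": 0.06878306878306878,
    "phon_1_lab": 0.07671957671957672,
    "phon_1_lat": 0.0,
    "phon_2_back": 0.082010582010582,
    "phon_2_hi": 0.015873015873015872,
}


def worst_feature(pris):
    """Return the (feature, PRI) pair with the highest PRI."""
    return max(pris.items(), key=lambda item: item[1])


def passes(pris, max_rhythmicity):
    """True if no feature has a PRI above the limit."""
    return all(value <= max_rhythmicity for value in pris.values())


feature, value = worst_feature(rhythmicity_indexes)
print("highest PRI:", feature, value)
print("passes at 0.1:", passes(rhythmicity_indexes, 0.1))
print("passes at 0.05:", passes(rhythmicity_indexes, 0.05))
```

For these values, the feature with the highest PRI is `phon_2_back`, so the stream would pass a limit of 0.1 but fail a limit of 0.05.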