# Controlled Stream Generation

We first generate words and a lexicon with minimal feature overlap between words. Next, we introduce the three main ways to generate random streams from a lexicon. Each mode specifies how the transition probabilities (TPs) between syllables are structured:

1. uniformly distributed TPs, called "TP-random position-random" in the paper,
2. position-controlled TPs, called "TP-random position-fixed", and
3. TPs that fully preserve the words, called "TP-structured".
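
To make the notion of a transition probability concrete, here is a minimal pure-Python sketch that does not use ARC. It builds a toy word-preserving stream and estimates TPs from adjacent syllable pairs. The helper names (`transition_probabilities`, `word_structured_stream`) and the toy lexicon are assumptions made for illustration, not part of the ARC API.

```python
import random
from collections import Counter


def transition_probabilities(stream):
    """Estimate P(next | current) from adjacent syllable pairs."""
    pair_counts = Counter(zip(stream, stream[1:]))
    first_counts = Counter(stream[:-1])
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}


def word_structured_stream(lexicon, n_words, rng):
    """Concatenate whole words in random order, never the same word twice in a row."""
    stream, previous = [], None
    for _ in range(n_words):
        word = rng.choice([w for w in lexicon if w != previous])
        stream.extend(word)
        previous = word
    return stream


# Hypothetical toy lexicon of two-syllable words (not produced by ARC).
toy_lexicon = [("ba", "ku"), ("di", "mo"), ("fe", "ga")]
toy_stream = word_structured_stream(toy_lexicon, n_words=150, rng=random.Random(0))
tps = transition_probabilities(toy_stream)

print("within word  ba -> ku:", tps[("ba", "ku")])
print("across words ku -> di:", tps.get(("ku", "di"), 0.0))
```

Because words are kept whole, the within-word transition is always 1.0, while transitions across word boundaries are lower because the next word is chosen at random.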

## Installation

> ⚠️ We recommend using a virtual environment

> ⚠️ If you use a virtual environment, make sure you use the right kernel for this notebook. You can usually select it in the top right corner. If your environment is not in the list, you have to add the ipython kernel from the environment like so:
> 1. Activate virtual environment in terminal
> 2. Run `pip install ipykernel`
> 3. Run `python -m ipykernel install --user --name arc --display-name "Python (ARC)"`
> 4. Reload this page

In [None]:
%pip list

Package                   Version    Editable project location
------------------------- ---------- -------------------------------
annotated-types           0.6.0
anyio                     4.1.0
appnope                   0.1.3
arc                       1.0        /Users/nmilosevic/workspace/arc
argon2-cffi               23.1.0
argon2-cffi-bindings      21.2.0
arrow                     1.3.0
asttokens                 2.4.1
async-lru                 2.0.4
attrs                     23.1.0
Babel                     2.13.1
beautifulsoup4            4.12.2
bleach                    6.1.0
certifi                   2023.11.17
cffi                      1.16.0
charset-normalizer        3.3.2
comm                      0.2.0
contourpy                 1.2.1
cycler                    0.12.1
debugpy                   1.8.0
decorator                 5.1.1
defusedxml                0.7.1
exceptiongroup            1.2.0
executing                 2.0.1
fastjsonschema            2.19.0
fonttools         



In [1]:
%pip install --upgrade git+https://github.com/milosen/arc.git

Collecting git+https://github.com/milosen/arc.git
  Cloning https://github.com/milosen/arc.git to /private/var/folders/3q/q1slz36d5c74bbp9g06q3m0r0000gx/T/pip-req-build-e3ot8eum
  Running command git clone --filter=blob:none --quiet https://github.com/milosen/arc.git /private/var/folders/3q/q1slz36d5c74bbp9g06q3m0r0000gx/T/pip-req-build-e3ot8eum
  Resolved https://github.com/milosen/arc.git to commit 6f21fc54e7205ece8bf66c82cf370bd089ed475a
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: arc
  Building wheel for arc (setup.py) ... done
  Created wheel for arc: filename=arc-1.0-py3-none-any.whl size=8364590 sha256=6d1ef11da9d4ab526896926d62cd9abf8041365443b3879817b917b48aa7b743
  Stored in directory: /private/var/folders/3q/q1slz36d5c74bbp9g06q3m0r0000gx/T/pip-ephem-wheel-cache-x167hs4b/wheels/ba/6b/57/d1e3ae32907d4440fdb6d1c99e436e0cffcc45c89afb632f13
Successfully built arc
Installing collected packages: arc
  Attempting uninstall: arc


## Syllable and Word Generation

Because ARC runs probabilistically (to speed things up), we set the random seed to make our runs reproducible.

In [1]:
from arc import set_seed

set_seed(0)
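
How `set_seed` works internally is not shown in this notebook. As a general illustration of why seeding matters, the following sketch uses only Python's standard `random` module. The function `sample_syllables` and its syllable pool are hypothetical.

```python
import random


def sample_syllables(seed, n=5):
    """Draw n syllables from a fixed toy pool using a seeded generator."""
    pool = ["ba", "di", "ku", "mo", "fe", "ga"]  # hypothetical syllables
    rng = random.Random(seed)
    return [rng.choice(pool) for _ in range(n)]


first_run = sample_syllables(seed=0)
second_run = sample_syllables(seed=0)
print(first_run == second_run)  # True: same seed, same draws
```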

In [2]:
from arc import load_phonemes, make_syllables, make_words

phonemes = load_phonemes()
print(phonemes)

syllables = make_syllables(phonemes)
print(syllables)

words = make_words(syllables)
print(words)

ɡ|k|b|d|p|t|x|ç|ʃ|f|... (38 elements total)
ɡaː|ɡiː|ɡyː|ɡɛː|kaː|koː|kuː|køː|kɛː|baː|... (75 elements total)


100%|██████████| 10000/10000 [00:15<00:00, 628.79it/s]


bigram control...
trigram control...
positional control...
vaːtoːhuː|doːhuːfiː|biːroːçaː|piːzuːhoː|ɡaːzuːmoː|loːkuːvaː|huːbiːzyː|ɡaːʃøːmoː|deːhøːfuː|deːfyːhuː|... (10000 elements total)


## Get Help

In [3]:
help(make_syllables)

Help on function make_syllables in module arc.core.syllable:

make_syllables(phonemes: ~RegisterType, phoneme_pattern: str = 'cV', unigram_control: bool = True, language_control: bool = True, language_alpha: Optional[float] = 0.05, from_format: Literal['ipa', 'xsampa'] = 'xsampa', lang: str = 'deu') -> ~RegisterType
    _summary_
    
    Args:
        phonemes (RegisterType): A Register of phonemes that will be used as a basis to generate the syllables
        phoneme_pattern (str, optional): describes how a syllable is structured, e.g. "cV" syllables consist of a single-consonant character and a long vowel. Defaults to "cV".
        unigram_control (bool, optional): apply statistical control (on the basis of p-val of uniform distribution) to single unigrams. Defaults to True.
        language_control (bool, optional): apply language specific controls (only german for now) on the syllable level. Defaults to True.
        language_alpha (Optional[float], optional): which p-value to ass

## Lexicon Generation

Now we generate lexicons with minimal feature repetitiveness.

Let's generate 2 lexicons with 4 words each and print some info.

In [12]:
from arc import make_lexicons

lexicons = make_lexicons(words, n_lexicons=2, n_words=4)
print("")

for i, lexicon in enumerate(lexicons):
    print(i, ":", lexicon)




0 : ɡyːfoːnuː|boːzøːhuː|ryːɡaːfuː|ʃiːmoːkaː
1 : høːboːsaː|vaːroːkøː|koːʃiːmeː|nøːɡɛːfiː


> ⚠️ The runtime of this function depends on the parameters when `control_features=True`. If it takes too long, consider reducing the number of words in the lexicon or the number of lexicons. If you don't get any output, consider increasing the maximum pairwise overlap allowed.

By default, lexicons with the minimum possible cumulative feature repetitiveness are generated first, starting at zero. This means words are joined into a lexicon only if no pair of words in the lexicon has overlapping features. If the requested number of lexicons cannot be generated with zero overlap, the allowed overlap is increased until all lexicons are collected, which is indicated by a warning message.

This process is repeated until any of the following is true:
- the requested number of lexicons has been generated
- the maximum allowed overlap is reached (set via `max_overlap`)
- the set of all word combinations is exhausted
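
The search described above can be sketched in plain Python. This is an illustrative toy under stated assumptions, not ARC's implementation: each word is represented as a set of feature labels, overlap is counted as the number of shared labels, and the function name `find_lexicons` and its result keys are made up for this example.

```python
from itertools import combinations


def find_lexicons(word_features, n_lexicons, n_words, max_overlap):
    """Collect lexicons, relaxing the allowed pairwise overlap step by step."""
    found, seen = [], set()
    for allowed in range(max_overlap + 1):
        for candidate in combinations(sorted(word_features), n_words):
            if candidate in seen:
                continue
            # Shared features for every pair of words in the candidate lexicon.
            overlaps = [
                len(word_features[a] & word_features[b])
                for a, b in combinations(candidate, 2)
            ]
            if max(overlaps) <= allowed:
                seen.add(candidate)
                found.append({
                    "words": candidate,
                    "cumulative_overlap": sum(overlaps),
                    "max_pairwise_overlap": allowed,
                })
                if len(found) == n_lexicons:
                    return found
    return found  # may hold fewer lexicons than requested


# Hypothetical words, each described by a toy set of features.
toy_words = {
    "w1": {"lab", "son"},
    "w2": {"cor", "voi"},
    "w3": {"lab", "voi"},
    "w4": {"nas", "hi"},
}
lexicons_found = find_lexicons(toy_words, n_lexicons=5, n_words=2, max_overlap=1)
for lex in lexicons_found:
    print(lex)
```

In this toy data only four word pairs have zero overlap, so the fifth lexicon is found only after the allowed overlap is raised to 1. The sketch stops for the same three reasons listed above: enough lexicons, maximum overlap reached, or all combinations exhausted.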

If one or more lexicons are returned, their `info` fields hold the cumulative overlap across all word pairs in the lexicon as well as the maximum pairwise overlap used.

In [13]:
for lexicon in lexicons:
    print("Lexicon:", lexicon)
    print("cumulative_feature_repetitiveness:", lexicon.info["cumulative_feature_repetitiveness"])
    print("max_pairwise_feature_repetitiveness:", lexicon.info["max_pairwise_feature_repetitiveness"])
    print("")

Lexicon: ɡyːfoːnuː|boːzøːhuː|ryːɡaːfuː|ʃiːmoːkaː
cumulative_feature_repetitiveness: 2
max_pairwise_feature_repetitiveness: 1

Lexicon: høːboːsaː|vaːroːkøː|koːʃiːmeː|nøːɡɛːfiː
cumulative_feature_repetitiveness: 2
max_pairwise_feature_repetitiveness: 1



## Stream Generation

We now want to generate a complete set of compatible streams from our lexicons, i.e. a set of streams for testing statistical learning hypotheses. If `streams` is empty, try increasing the allowed maximum rhythmicity.

The function `make_streams` will try to generate one stream for each lexicon and TP mode. If you specify `max_rhythmicity`, it will discard the streams that do not meet the requirement. By default, all streams from a lexicon are discarded if the lexicon cannot generate streams for all requested TP modes. Because streams can get long, printing a collection of streams shows only each stream's key, which consists of the lexicon used to generate it and its TP mode.

In [14]:
from arc import make_streams
help(make_streams)

Help on function make_streams in module arc.core.stream:

make_streams(lexicons: List[~RegisterType], max_rhythmicity: Optional[float] = None, stream_length: int = 32, max_tries_randomize: int = 10, tp_modes: tuple = ('random', 'word_structured', 'position_controlled'), require_all_tp_modes: bool = True) -> ~RegisterType
    _summary_
    
    Args:
        lexicons (List[LexiconType]): A list of lexicons used as a basis for generatng the streams
        max_rhythmicity (Optional[float], optional): check rhythmicity and discard all streams that have at least one feature with higher PRI than this number. Defaults to None.
        stream_length (int, optional): how many syllables are in a stream in multiples of syllables in the lexicon. Defaults to 4.
        max_tries_randomize (int, optional): if max_rhythmicity is given and violated, how many times to try with a new randomization. Defaults to 10.
        tp_modes (tuple, optional): the ways (modes) in which to control for transition p

In [17]:
streams = make_streams(lexicons)

> ⚠️ The runtime of this function depends on the parameters, especially when you specify a `max_rhythmicity`, because the function re-samples the random stream until `max_rhythmicity` is satisfied. This takes time, because TP-statistics need to be controlled each time. If it takes too long, consider removing the option.
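
The re-sampling behaviour can be pictured as a bounded retry loop. The sketch below is a generic illustration, not ARC's code: `resample_until`, `random_stream`, and the placeholder score `repetition_rate` are hypothetical and do not compute the rhythmicity index that ARC uses.

```python
import random


def repetition_rate(stream):
    """Placeholder score: share of positions that repeat the previous syllable."""
    repeats = sum(a == b for a, b in zip(stream, stream[1:]))
    return repeats / (len(stream) - 1)


def resample_until(make_candidate, score, max_score, max_tries, rng):
    """Draw candidates until one scores at or below max_score, or give up."""
    for attempt in range(1, max_tries + 1):
        candidate = make_candidate(rng)
        if score(candidate) <= max_score:
            return candidate, attempt
    return None, max_tries


toy_syllables = ["ba", "di", "ku", "mo"]  # hypothetical toy inventory


def random_stream(rng, length=24):
    """Draw a random toy stream of the given length."""
    return [rng.choice(toy_syllables) for _ in range(length)]


toy_stream, tries = resample_until(
    random_stream, repetition_rate, max_score=0.2, max_tries=10, rng=random.Random(0)
)
print("accepted" if toy_stream is not None else "discarded", "after", tries, "tries")
```

Each retry repeats the full generation and scoring step, which is why a strict threshold combined with a high number of tries increases the runtime.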

To inspect a stream, select one either by index or by key:

In [18]:
stream = streams[0]
print(stream)

foː|kaː|ʃiː|ɡyː|zøː|huː|boː|nuː|ɡaː|fuː|ryː|moː|zøː|fuː|foː|ryː|ʃiː|boː|huː|moː|nuː|kaː|ɡaː|ɡyː|ʃiː|foː|zøː|moː|huː|fuː|ɡaː|boː|kaː|nuː|ɡyː|ryː|fuː|ʃiː|ɡaː|zøː|ryː|nuː|moː|ɡyː|boː|foː|huː|kaː|huː|ɡaː|moː|ʃiː|zøː|ɡyː|kaː|fuː|boː|ryː|foː|nuː|boː|moː|fuː|nuː|foː|ʃiː|kaː|zøː|ɡaː|ryː|ɡyː|huː|ɡyː|nuː|zøː|ʃiː|fuː|huː|foː|boː|ɡaː|kaː|moː|ryː|ɡaː|huː|ryː|zøː|kaː|foː|ɡyː|fuː|moː|boː|ʃiː|nuː|fuː|kaː|ɡyː|moː|ɡaː|ʃiː|huː|nuː|ryː|boː|zøː|foː|ɡaː|foː|fuː|zøː|boː|ɡyː|nuː|ʃiː|moː|kaː|ryː|huː|ʃiː|ryː|kaː|boː|fuː|ɡyː|foː|moː|zøː|nuː|huː|ɡaː|nuː|boː|fuː|ɡyː|ɡaː|ʃiː|moː|foː|kaː|huː|zøː|ryː|fuː|kaː|foː|ryː|ɡyː|zøː|ɡaː|nuː|huː|moː|ʃiː|boː|zøː|nuː|kaː|fuː|huː|boː|ɡyː|moː|foː|ʃiː|ryː|ɡaː|huː|kaː|nuː|ryː|boː|ʃiː|foː|ɡaː|moː|ɡyː|fuː|zøː|boː|nuː|fuː|ɡaː|ɡyː|ryː|ʃiː|zøː|huː|foː|moː|kaː|zøː|foː|ɡyː|kaː|boː|moː|nuː|ɡaː|fuː|ryː|huː|ʃiː|huː|ryː|nuː|moː|fuː|boː|ɡaː|kaː|ʃiː|ɡyː|foː|zøː|fuː|ʃiː|nuː|foː|huː|ɡyː|ɡaː|zøː|moː|boː|kaː|ryː|kaː|ɡyː|boː|huː|nuː|zøː|ʃiː|ɡaː|foː|fuː|moː|ryː|zøː|ɡyː|huː|fuː|nuː|ʃiː|kaː|moː|ɡaː|ryː|

In [19]:
print("Lexicon:", stream.info["lexicon"])
print("TP mode:", stream.info["stream_tp_mode"])
print("Feature PRIs:") 
for feat, pri in stream.info["rhythmicity_indexes"].items():
    print(" ", feat, pri)

Lexicon: ɡyːfoːnuː|boːzøːhuː|ryːɡaːfuː|ʃiːmoːkaː
TP mode: random
Feature PRIs:
  phon_1_son 0.10052910052910052
  phon_1_back 0.03968253968253968
  phon_1_hi 0.03968253968253968
  phon_1_lab 0.07936507936507936
  phon_1_cor 0.07936507936507936
  phon_1_cont 0.037037037037037035
  phon_1_lat 0.0
  phon_1_nas 0.03439153439153439
  phon_1_voi 0.026455026455026454
  phon_2_back 0.0026455026455026454
  phon_2_hi 0.047619047619047616
  phon_2_lo 0.018518518518518517
  phon_2_lab 0.05291005291005291
  phon_2_tense 0.0
  phon_2_long 0.0


As you can see, the `.info` field holds useful information about the generated stream: which lexicon was used to generate it, the rhythmicity indexes achieved for each feature, and which randomization/TP-structure mode was used.
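
As a small illustration of how such PRI values relate to a rhythmicity threshold, the sketch below works on a plain dictionary rather than on an ARC stream object. The threshold value is a hypothetical choice, and the check only mirrors the description in the `make_streams` help text (a stream is discarded if at least one feature has a PRI above the threshold). It is not the library's actual code.

```python
# PRI values copied from the output above (first four features only).
pris = {
    "phon_1_son": 0.10052910052910052,
    "phon_1_back": 0.03968253968253968,
    "phon_1_hi": 0.03968253968253968,
    "phon_1_lab": 0.07936507936507936,
}
max_rhythmicity = 0.1  # hypothetical threshold chosen for illustration

# Feature with the highest PRI, and all features above the threshold.
worst_feature = max(pris, key=pris.get)
violations = {feat: pri for feat, pri in pris.items() if pri > max_rhythmicity}

print(worst_feature)     # phon_1_son
print(bool(violations))  # True: one feature exceeds the threshold
```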

This concludes the second tutorial. The third and final tutorial in this series shows how to use your own data.