# Controlled Stream Generation

> ⚠️ We are in the process of migrating the old library called `arc` to a new library called `alparc`. While the command line tools carry the new name, the core library modules used in this and the following tutorials are still imported from `arc`. This will change in the future.

We will generate words and a lexicon with minimal feature overlap between words. Next, we introduce the 3 main ways to generate random streams based on a lexicon. Each specifies how the transition probabilities (TPs) of their syllables are structured:

1. uniformly distributed TPs, called "TP-random position-random" in the paper,
2. position-controlled TPs, called "TP-random position-fixed", and
3. TPs that fully preserve the words, called "TP-structured".
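
To make the notion of a transition probability concrete, here is a minimal sketch that estimates TPs from a toy stream in which words are kept intact, similar in spirit to the "TP-structured" mode. The syllables, the lexicon, and the helper `transition_probabilities` are illustrative assumptions and not part of the library.

```python
import random
from collections import Counter, defaultdict


def transition_probabilities(stream):
    """Estimate TP(a -> b) = count(a followed by b) / count(a followed by anything)."""
    pair_counts = Counter(zip(stream, stream[1:]))
    first_counts = Counter(stream[:-1])
    tps = defaultdict(dict)
    for (a, b), n in pair_counts.items():
        tps[a][b] = n / first_counts[a]
    return dict(tps)


# Toy lexicon with placeholder syllables (not output of the library).
lexicon = [("pa", "bi", "ku"), ("ti", "bu", "do"), ("go", "la", "tu")]

rng = random.Random(0)

# Concatenate whole words and avoid repeating the same word twice in a row.
structured = []
previous = None
for _ in range(300):
    word = rng.choice([w for w in lexicon if w != previous])
    structured.extend(word)
    previous = word

tps = transition_probabilities(structured)
print(tps["pa"])  # within-word transition: {'bi': 1.0}
print(tps["ku"])  # transitions across a word boundary are spread over other words
```

In such a stream, within-word TPs are 1 while TPs across word boundaries are lower. The two TP-random modes remove this contrast in different ways.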

## Installation

> ⚠️ We recommend using a virtual environment

> ⚠️ If you use a virtual environment, make sure you use the right kernel for this notebook. You can usually select it in the top right corner. If your environment is not in the list, you have to add the ipython kernel from the environment like so:
> 1. Activate virtual environment in terminal
> 2. Run `pip install ipykernel`
> 3. Run `python -m ipykernel install --user --name arc --display-name "Python (ARC)"`
> 4. Reload this page

In [19]:
%pip install --upgrade --editable ..

Obtaining file:///Users/nikola/workspace/alparc
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: alparc
  Building editable for alparc (pyproject.toml) ... done
  Created wheel for alparc: filename=alparc-0.0.1-0.editable-py3-none-any.whl size=2592 sha256=5b829dec0f39fe144331ae82f272e423ff5aaec5c73fb0336fa3ff94e0ce0626
  Stored in directory: /private/var/folders/n1/bxdrmv296493f6tbg9v8pjnh0000gn/T/pip-ephem-wheel-cache-utx_n5xx/wheels/4e/8b/0b/d3bdb9934e92c8d0097403841e577be8baadd88948ad73ee72
Successfully built alparc
Installing collected packages: alparc
  Attempting uninstall: alparc
    Found existing installation: alparc 0.0.1
    Uninstalling alparc-0.0.1:
      Successfully uninstalled alparc-0.0.1
Successfully installed alparc-0.


## Syllable and Word Generation

Because ARC runs probabilistically (to speed things up), we set the random seeds to make sure our runs are reproducible.
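
One common way to fix seeds in Python is sketched here. Whether ARC draws from Python's global `random` generator, from NumPy's, or from both is an assumption that this tutorial does not confirm. The value of `SEED` is arbitrary.

```python
import random

SEED = 42  # arbitrary value, chosen for illustration only

# Seeding the global generator makes subsequent draws repeatable.
random.seed(SEED)
first_run = [random.random() for _ in range(3)]

# Re-seeding with the same value reproduces the same draws.
random.seed(SEED)
second_run = [random.random() for _ in range(3)]

print(first_run == second_run)  # True

# Assumption: if the library relies on NumPy's global generator,
# the analogous call would be numpy.random.seed(SEED).
```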

In [2]:
from alparc import load_phonemes, make_syllables, make_words, make_lexicons, make_streams

phonemes = load_phonemes()
print(phonemes)

syllables = make_syllables(phonemes)
print(syllables)

words = make_words(syllables)
print(words)

ɡ|k|b|d|p|t|x|ç|ʃ|f|... (38 elements total)
ɡaː|ɡiː|ɡyː|ɡɛː|kaː|koː|kuː|køː|kɛː|baː|... (75 elements total)


100%|██████████| 10000/10000 [00:12<00:00, 825.74it/s]


ʃoːɡɛːmeː|reːfaːkoː|çaːroːbuː|heːʃoːpaː|myːseːɡɛː|ɡyːmɛːʃiː|reːɡaːfyː|niːpuːçaː|tiːçaːmeː|ʃøːɡaːmoː|... (10000 elements total)


In [19]:
import os
words.save(os.path.join("results", "test_words"))

## Get Help

In [5]:
help(make_words)

Help on function make_words in module arc.core.word:

make_words(syllables: ~RegisterType, num_syllables=3, bigram_control=True, bigram_alpha=None, trigram_control=True, trigram_alpha=None, positional_control=True, positional_control_position=None, position_alpha=0, phonotactic_control=True, n_look_back=2, n_words=10000, max_tries=100000, progress_bar: bool = True) -> ~RegisterType
    _summary_
    
    Args:
        syllables (RegisterType): The Register of syllables to use as a basis for word generation
        num_syllables (int, optional): how many syllables are in a word. Defaults to 3.
        bigram_control (bool, optional): apply statistical control on the bigram level. Defaults to True.
        bigram_alpha (_type_, optional): which p-value to assume for bigram control. Defaults to None.
        trigram_control (bool, optional): apply statistical control on the trigram level. Defaults to True.
        trigram_alpha (_type_, optional): which p-value to assume for trigram contr

In [6]:
help(make_syllables)

Help on function make_syllables in module arc.core.syllable:

make_syllables(phonemes: ~RegisterType, phoneme_pattern: str = 'cV', unigram_control: bool = True, language_control: bool = True, language_alpha: Optional[float] = 0.05, from_format: Literal['ipa', 'xsampa'] = 'xsampa', lang: str = 'deu') -> ~RegisterType
    _summary_
    
    Args:
        phonemes (RegisterType): A Register of phonemes that will be used as a basis to generate the syllables
        phoneme_pattern (str, optional): describes how a syllable is structured, e.g. "cV" syllables consist of a single-consonant character and a long vowel. Defaults to "cV".
        unigram_control (bool, optional): apply statistical control (on the basis of p-val of uniform distribution) to single unigrams. Defaults to True.
        language_control (bool, optional): apply language specific controls (only german for now) on the syllable level. Defaults to True.
        language_alpha (Optional[float], optional): which p-value to ass

## Lexicon Generation

Now we generate lexicons with minimal feature repetitiveness.

Let's generate 2 lexicons with 4 words each and print some info.

In [3]:
from alparc import make_lexicons

lexicons = make_lexicons(words, n_lexicons=2, n_words=4)
print("")

for i, lexicon in enumerate(lexicons):
    print(i, ":", lexicon)




0 : høːboːsuː|doːfuːheː|buːçaːnyː|ɡyːløːfoː
1 : muːtɛːçaː|faːkuːreː|seːpaːhuː|ɡiːluːfyː


In [4]:
help(make_lexicons)

Help on function make_lexicons in module alparc.core.lexicon:

make_lexicons(words: ~RegisterType, n_lexicons: int = 5, n_words: int = 4, max_overlap: int = 1, lag_of_interest: int = 1, max_word_matrix: int = 200, unique_words: bool = False, binary_feature_control: bool = True, progress_bar: bool = False, control_features: List[Literal['syl', 'son', 'cons', 'cont', 'delrel', 'lat', 'nas', 'strid', 'voi', 'sg', 'cg', 'ant', 'cor', 'distr', 'lab', 'hi', 'lo', 'back', 'round', 'tense', 'long']] = ['syl', 'son', 'cons', 'cont', 'delrel', 'lat', 'nas', 'strid', 'voi', 'sg', 'cg', 'ant', 'cor', 'distr', 'lab', 'hi', 'lo', 'back', 'round', 'tense', 'long']) -> List[alparc.types.base_types.Register]
    _summary_
    
    Args:
        words (RegisterType): The Register of words which the lexicon generation is based on.
        n_lexicons (int, optional): How many lexicons to generate. Defaults to 5.
        n_words (int, optional): How many words should be in a lexicon. Defaults to 4.
       

> ⚠️ The runtime of this function depends on the parameters, especially when `binary_feature_control=True`. If it takes too long, consider reducing the number of words per lexicon (`n_words`) or the number of lexicons (`n_lexicons`). If you don't get any output, consider increasing the maximum pairwise overlap allowed (`max_overlap`).

By default, lexicons with the minimum possible cumulative feature repetitiveness are generated first, starting at zero. This means words are joined into a lexicon only if no pair of words in the lexicon has overlapping features. If the requested number of lexicons cannot be generated with zero overlap, the allowed overlap is increased until all lexicons are collected; a warning message indicates when this happens.

This process is repeated until any of the following statements is true:
- the requested number of Lexicons has been generated
- the maximum allowed overlap is reached (set via `max_overlap`)
- the set of all word combinations is exhausted
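
The escalation procedure described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the library's implementation: words are reduced to hypothetical feature sets, overlap is counted as the number of shared features, and the helper names `pairwise_overlap` and `collect_lexicons` are invented for this example.

```python
from itertools import combinations


def pairwise_overlap(features_a, features_b):
    """Toy overlap measure: number of features two words share."""
    return len(features_a & features_b)


def collect_lexicons(word_features, n_lexicons, n_words, max_overlap):
    """Collect word combinations, relaxing the allowed pairwise overlap step by step."""
    lexicons = []
    seen = set()
    for allowed in range(max_overlap + 1):  # start at zero overlap
        for combo in combinations(sorted(word_features), n_words):
            if combo in seen:
                continue
            overlaps = [
                pairwise_overlap(word_features[a], word_features[b])
                for a, b in combinations(combo, 2)
            ]
            if max(overlaps) <= allowed:
                seen.add(combo)
                lexicons.append({
                    "words": combo,
                    "cumulative_overlap": sum(overlaps),
                    "max_pairwise_overlap": allowed,
                })
                if len(lexicons) == n_lexicons:
                    return lexicons  # requested number reached
    return lexicons  # max_overlap reached or combinations exhausted


# Hypothetical words with hypothetical feature sets.
word_features = {
    "w1": {"lab", "voi"},
    "w2": {"cor"},
    "w3": {"nas", "voi"},
    "w4": {"hi"},
}

result = collect_lexicons(word_features, n_lexicons=3, n_words=3, max_overlap=1)
for lexicon in result:
    print(lexicon)
```

Only two combinations are free of overlap here, so the third lexicon is found after the allowed overlap is raised to 1.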

If one or more lexicons are returned, their `info` fields hold the cumulative overlap between all word pairs achieved by the lexicon (`cumulative_feature_repetitiveness`) as well as the maximum pairwise overlap used (`max_pairwise_feature_repetitiveness`).

In [5]:
for lexicon in lexicons:
    print("Lexicon:", lexicon)
    print("cumulative_feature_repetitiveness:", lexicon.info["cumulative_feature_repetitiveness"])
    print("max_pairwise_feature_repetitiveness:", lexicon.info["max_pairwise_feature_repetitiveness"])
    print("")

Lexicon: høːboːsuː|doːfuːheː|buːçaːnyː|ɡyːløːfoː
cumulative_feature_repetitiveness: 2
max_pairwise_feature_repetitiveness: 1

Lexicon: muːtɛːçaː|faːkuːreː|seːpaːhuː|ɡiːluːfyː
cumulative_feature_repetitiveness: 2
max_pairwise_feature_repetitiveness: 1



## Stream Generation

We want to generate a complete set of compatible lexicons for our study, i.e. a compatible set of streams for testing statistical learning hypotheses. If `streams` is empty, try increasing the maximum allowed rhythmicity (`max_rhythmicity`).

The function `make_streams` will try to generate one stream for each lexicon and TP mode. If you specify `max_rhythmicity`, it will discard streams that do not meet the requirement. By default, all streams from a lexicon are discarded if the lexicon can't generate streams for all requested TP modes. The result is a collection of streams. Because streams can get long, printing the collection shows only their keys, each consisting of the lexicon used to generate the stream and its TP mode.
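
The discarding rules described above can be illustrated with a small self-contained sketch. It is not the library's code: the helper `filter_streams`, the candidate dictionaries, and all PRI values are hypothetical, and the re-sampling controlled by `max_tries_randomize` is left out.

```python
def filter_streams(candidates, tp_modes, max_rhythmicity=None, require_all_tp_modes=True):
    """Keep candidate streams according to the rules described in the text.

    candidates: list of dicts with keys 'lexicon', 'tp_mode', and 'pris'
    (a mapping from feature name to rhythmicity index).
    """
    kept = []
    for candidate in candidates:
        # Discard a stream if any feature exceeds the rhythmicity threshold.
        too_rhythmic = (
            max_rhythmicity is not None
            and max(candidate["pris"].values()) > max_rhythmicity
        )
        if not too_rhythmic:
            kept.append(candidate)

    if require_all_tp_modes:
        # Drop every stream of a lexicon that lacks one of the requested modes.
        modes_by_lexicon = {}
        for candidate in kept:
            modes_by_lexicon.setdefault(candidate["lexicon"], set()).add(candidate["tp_mode"])
        kept = [c for c in kept if modes_by_lexicon[c["lexicon"]] >= set(tp_modes)]
    return kept


# Hypothetical candidates with made-up PRI values.
candidates = [
    {"lexicon": "A", "tp_mode": "random", "pris": {"feat": 0.05}},
    {"lexicon": "A", "tp_mode": "word_structured", "pris": {"feat": 0.20}},
    {"lexicon": "B", "tp_mode": "random", "pris": {"feat": 0.01}},
    {"lexicon": "B", "tp_mode": "word_structured", "pris": {"feat": 0.03}},
]
modes = ("random", "word_structured")

kept = filter_streams(candidates, modes, max_rhythmicity=0.1)
kept_loose = filter_streams(candidates, modes, max_rhythmicity=0.1, require_all_tp_modes=False)
print([(c["lexicon"], c["tp_mode"]) for c in kept])
print(len(kept_loose))
```

In this toy data, lexicon A loses its word-structured stream to the threshold, so with `require_all_tp_modes=True` its remaining stream is dropped as well.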

In [9]:
from alparc import make_streams
help(make_streams)

Help on function make_streams in module alparc.core.stream:

make_streams(lexicons: Optional[List[~RegisterType]], max_rhythmicity: Optional[float] = None, stream_length: int = 15, max_tries_randomize: int = 10, tp_modes: tuple = ('random', 'word_structured', 'position_controlled'), require_all_tp_modes: bool = True) -> ~RegisterType
    _summary_
    
    Args:
        lexicons (List[LexiconType]): A list of lexicons used as a basis for generatng the streams
        max_rhythmicity (Optional[float], optional): check rhythmicity and discard all streams that have at least one feature with higher PRI than this number. Defaults to None.
        stream_length (int, optional): how many syllables are in a stream in multiples of syllables in the lexicon. Defaults to 4.
        max_tries_randomize (int, optional): if max_rhythmicity is given and violated, how many times to try with a new randomization. Defaults to 10.
        tp_modes (tuple, optional): the ways (modes) in which to control for

In [10]:
streams = make_streams(lexicons)

> ⚠️ The runtime of this function depends on the parameters, especially when you specify a `max_rhythmicity`, because the function re-samples the random stream until `max_rhythmicity` is satisfied. This takes time, because TP-statistics need to be controlled each time. If it takes too long, consider removing the option.

To inspect a stream, select one either by index or by key:

In [17]:
stream = streams[0]
print(stream)

høː_çaː_foː_buː_suː...høː_boː_løː_nyː_buː


In [18]:
print("Lexicon:", stream.info["lexicon"])
print("TP mode:", stream.info["stream_tp_mode"])
print("Feature PRIs:") 
for feat, pri in stream.info["rhythmicity_indexes"].items():
    print(" ", feat, pri)

Lexicon: høːboːsuː|doːfuːheː|buːçaːnyː|ɡyːløːfoː
TP mode: random
Feature PRIs:
  phon_1_son 0.047619047619047616
  phon_1_back 0.0
  phon_1_hi 0.03221288515406162
  phon_1_lab 0.0938375350140056
  phon_1_cor 0.0700280112044818
  phon_1_cont 0.029411764705882353
  phon_1_lat 0.0
  phon_1_nas 0.0
  phon_1_voi 0.04341736694677871
  phon_2_back 0.028011204481792718
  phon_2_hi 0.08263305322128851
  phon_2_lo 0.0
  phon_2_lab 0.08263305322128851
  phon_2_tense 0.0
  phon_2_long 0.0


As you can see, the `.info` field holds some useful information about the generated stream, i.e. which lexicon has been used to generate it, the rhythmicity indexes achieved for each feature, and which randomization/TP-structure mode has been used.
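
As a small worked example, the PRI values printed above can be summarised with plain Python. The dictionary below is copied by hand from that output; with the library loaded it would presumably come from `stream.info["rhythmicity_indexes"]`. The threshold of 0.1 is an arbitrary illustrative value, not a recommendation.

```python
# PRI values copied from the printed output of the example stream.
pris = {
    "phon_1_son": 0.047619047619047616,
    "phon_1_back": 0.0,
    "phon_1_hi": 0.03221288515406162,
    "phon_1_lab": 0.0938375350140056,
    "phon_1_cor": 0.0700280112044818,
    "phon_1_cont": 0.029411764705882353,
    "phon_1_lat": 0.0,
    "phon_1_nas": 0.0,
    "phon_1_voi": 0.04341736694677871,
    "phon_2_back": 0.028011204481792718,
    "phon_2_hi": 0.08263305322128851,
    "phon_2_lo": 0.0,
    "phon_2_lab": 0.08263305322128851,
    "phon_2_tense": 0.0,
    "phon_2_long": 0.0,
}

# Feature with the highest rhythmicity index in this stream.
worst_feature = max(pris, key=pris.get)
print(worst_feature, round(pris[worst_feature], 4))  # phon_1_lab 0.0938

# Check the stream against an illustrative threshold.
threshold = 0.1
passes = all(value <= threshold for value in pris.values())
print(passes)  # True
```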

This concludes the second tutorial. The third and final tutorial in this series shows how to use your own data.