# Controlled Stream Generation

In this tutorial, we generate words and lexicons with minimal feature overlap between words. We then introduce the three main ways to generate random streams from a lexicon. Each specifies how the transition probabilities (TPs) between syllables are structured:

1. uniformly distributed TPs, called "TP-random position-random" in the paper,
2. position-controlled TPs, called "TP-random position-fixed", and
3. TPs that fully preserve the words, called "TP-structured".
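
To make the notion of a transition probability concrete, here is a minimal sketch in plain Python that estimates forward TPs from a syllable sequence. It does not use ARC. The helper `transition_probabilities` and the toy syllables are hypothetical and serve only to illustrate what a word-preserving ("TP-structured") stream looks like.

```python
from collections import Counter, defaultdict


def transition_probabilities(stream):
    """Estimate forward TPs P(next | current) from a list of syllables."""
    pair_counts = Counter(zip(stream, stream[1:]))
    # Only syllables that have a successor count as a "current" syllable.
    current_counts = Counter(stream[:-1])
    tps = defaultdict(dict)
    for (current, following), n in pair_counts.items():
        tps[current][following] = n / current_counts[current]
    return dict(tps)


# Toy lexicon of two three-syllable words (made up, not generated by ARC).
toy_words = [["pa", "bi", "ku"], ["go", "la", "tu"]]

# Word-preserving stream: words are always concatenated whole.
word_order = [0, 1, 0, 1, 1, 0, 0, 1]
toy_stream = [syllable for i in word_order for syllable in toy_words[i]]

tps = transition_probabilities(toy_stream)
print("within word  :", tps["pa"])
print("word boundary:", tps["ku"])
```

In such a stream, within-word transitions are fully predictable, while transitions across word boundaries are spread over several possible word onsets.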

## Installation

> ⚠️ We recommend using a virtual environment

> ⚠️ If you use a virtual environment, make sure you use the right kernel for this notebook. You can usually select it in the top right corner. If your environment is not in the list, you have to add the ipython kernel from the environment like so:
> 1. Activate virtual environment in terminal
> 2. Run `pip install ipykernel`
> 3. Run `python -m ipykernel install --user --name arc --display-name "Python (ARC)"`
> 4. Reload this page

In [None]:
%pip install --upgrade git+https://github.com/milosen/arc.git


## Syllable and Word Generation

Because ARC runs probabilistically (to speed things up), we set the random seeds to make sure our runs are reproducible.
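
A minimal sketch of seeding is shown below. It assumes that ARC draws its random numbers from Python's global `random` generator and, if installed, NumPy's global generator. This is an assumption and is not confirmed by this tutorial. The value of `SEED` is arbitrary.

```python
import random

SEED = 42  # arbitrary value, chosen for illustration only

# Assumption: the library samples from these global generators.
random.seed(SEED)
try:
    import numpy as np
    np.random.seed(SEED)
except ImportError:
    pass  # NumPy is optional for this sketch

# Re-seeding reproduces the same sequence of draws.
first_draws = [random.random() for _ in range(3)]
random.seed(SEED)
second_draws = [random.random() for _ in range(3)]
print(first_draws == second_draws)
```

If the library uses its own generator instead, seeding the global ones will not make runs reproducible.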

In [1]:
from arc import load_phonemes, make_syllables, make_words, make_lexicons, make_streams

phonemes = load_phonemes()
print(phonemes)

syllables = make_syllables(phonemes)
print(syllables)

words = make_words(syllables)
print(words)

ɡ|k|b|d|p|t|x|ç|ʃ|f|... (38 elements total)
ɡaː|ɡiː|ɡyː|ɡɛː|kaː|koː|kuː|køː|kɛː|baː|... (75 elements total)


100%|██████████| 10000/10000 [00:11<00:00, 835.77it/s]


bigram control...
trigram control...
positional control...
çaːbøːriː|buːsiːheː|ʃiːmoːɡaː|boːhøːsaː|tuːhøːvaː|zuːpeːhoː|nøːfoːɡaː|saːpuːhøː|bøːzyːhuː|roːɡɛːfyː|... (10000 elements total)


In [2]:
words.save("test_words")

In [None]:
import webbrowser
import os

webbrowser.open('file://' + os.path.realpath("test_words.json"))

True

## Get Help

In [None]:
help(make_words)

In [None]:
help(make_syllables)

## Lexicon Generation

Now we generate lexicons with minimal feature repetitiveness.

Let's generate 2 lexicons with 4 words each and print some info.

In [5]:
from arc import make_lexicons

lexicons = make_lexicons(words, n_lexicons=2, n_words=4)
print("")

for i, lexicon in enumerate(lexicons):
    print(i, ":", lexicon)




0 : byːhiːzøː|løːvaːkoː|ʃøːheːpaː|køːsiːmyː
1 : buːhoːʃøː|kaːriːfoː|zøːɡɛːmuː|løːvaːkoː


In [None]:
help(make_lexicons)

> ⚠️ The runtime of this function depends on the parameters when `control_features=True`. If it takes too long, consider reducing the number of words in the lexicon or the number of lexicons. If you don't get any output, consider increasing the maximum pairwise overlap allowed.

By default, lexicons with the minimum possible cumulative feature repetitiveness are generated first, starting at zero. This means words are joined into a lexicon only if no pair of words in the lexicon has overlapping features. If the requested number of lexicons cannot be generated with zero overlap, the allowed overlap is increased until all lexicons are collected, and a warning message indicates this.

This process is repeated until any of the following is true:
- the requested number of Lexicons has been generated
- the maximum allowed overlap is reached (set via `max_overlap`)
- the set of all word combinations is exhausted
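
The search described above can be sketched in plain Python as follows. This illustrates the stopping conditions only and is not ARC's implementation. `pairwise_overlap`, `collect_lexicons`, and the toy words (tuples of feature labels) are hypothetical.

```python
from itertools import combinations


def pairwise_overlap(word_a, word_b):
    """Toy overlap measure: number of feature labels two words share."""
    return len(set(word_a) & set(word_b))


def collect_lexicons(words, n_lexicons, n_words, max_overlap):
    """Collect lexicons, relaxing the allowed pairwise overlap step by step.

    Assumes n_words >= 2 so that every candidate has at least one word pair.
    """
    found = []    # entries: (lexicon, cumulative overlap, allowed overlap)
    seen = set()
    for allowed in range(max_overlap + 1):            # start at zero overlap
        for candidate in combinations(words, n_words):
            if candidate in seen:
                continue
            overlaps = [pairwise_overlap(a, b)
                        for a, b in combinations(candidate, 2)]
            if max(overlaps) <= allowed:
                seen.add(candidate)
                found.append((candidate, sum(overlaps), allowed))
                if len(found) == n_lexicons:          # enough lexicons
                    return found
    # Maximum overlap reached or all combinations exhausted.
    return found


toy_words = [("a", "b"), ("c", "d"), ("e", "f"), ("a", "d")]
result = collect_lexicons(toy_words, n_lexicons=5, n_words=2, max_overlap=1)
for lexicon, cumulative, allowed in result:
    print(lexicon, "cumulative:", cumulative, "allowed:", allowed)
```

In this toy run, four lexicons are found with zero overlap, and the fifth is only found after the allowed overlap is raised to one.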

If one or more lexicons are returned, their `info` fields hold the cumulative overlap across all word pairs in the lexicon as well as the maximum pairwise overlap that was allowed.

In [65]:
for lexicon in lexicons:
    print("Lexicon:", lexicon)
    print("cumulative_feature_repetitiveness:", lexicon.info["cumulative_feature_repetitiveness"])
    print("max_pairwise_feature_repetitiveness:", lexicon.info["max_pairwise_feature_repetitiveness"])
    print("")

Lexicon: byːhiːzøː|løːvaːkoː|ʃøːheːpaː|køːsiːmyː
cumulative_feature_repetitiveness: 2
max_pairwise_feature_repetitiveness: 1

Lexicon: buːhoːʃøː|kaːriːfoː|zøːɡɛːmuː|løːvaːkoː
cumulative_feature_repetitiveness: 2
max_pairwise_feature_repetitiveness: 1



## Stream Generation

We want to generate a complete set of compatible lexicons for our study, and from them a compatible set of streams for testing statistical learning hypotheses. If `streams` is empty, try increasing the allowed maximum rhythmicity.

The function `make_streams` tries to generate one stream for each lexicon and TP mode. If you specify `max_rhythmicity`, it discards streams that do not meet the requirement. By default, all streams from a lexicon are discarded if the lexicon cannot produce streams for all requested TP modes. The collection of streams is printed below. Because streams can get long, only their keys are shown, each consisting of the lexicon used to generate the stream and its TP mode.

In [None]:
from arc import make_streams
help(make_streams)

In [6]:
streams = make_streams(lexicons)

> ⚠️ The runtime of this function depends on the parameters, especially when you specify a `max_rhythmicity`, because the function re-samples the random stream until `max_rhythmicity` is satisfied. This takes time, because TP-statistics need to be controlled each time. If it takes too long, consider removing the option.
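
The re-sampling idea can be sketched as a simple rejection loop. The score below is a made-up stand-in and is not ARC's rhythmicity index. `toy_score`, `resample_until`, and all parameter values are hypothetical and only show why a stricter threshold costs more attempts.

```python
import random


def toy_score(word_order):
    """Stand-in score: share of immediate word repetitions.

    This is NOT ARC's rhythmicity index; it only plays the same role here.
    """
    repeats = sum(a == b for a, b in zip(word_order, word_order[1:]))
    return repeats / (len(word_order) - 1)


def resample_until(n_words, length, max_score, rng, max_tries=1000):
    """Draw random word orders until one satisfies the threshold."""
    for attempt in range(1, max_tries + 1):
        order = [rng.randrange(n_words) for _ in range(length)]
        if toy_score(order) <= max_score:
            return order, attempt
    return None, max_tries  # give up: no sample met the threshold


rng = random.Random(0)
order, tries = resample_until(n_words=4, length=24, max_score=0.2, rng=rng)
print("attempts needed:", tries)
print("accepted order :", order)
```

Lowering `max_score` in this sketch increases the expected number of attempts, which mirrors the runtime trade-off described above.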

To inspect a stream, select one either by index or by key:

In [66]:
import json
import os
import webbrowser

from arc.types.base_types import Register

def write_out_streams(streams: Register, open_in_browser: bool = True, file_name: str = "streams.json"):
    """Write all streams and their metadata to a JSON file and optionally open it."""

    with open(file_name, 'w') as file:
        json.dump({"streams": [{
            "stream": stream.id,
            "info": {
                # Store the lexicon as a single "|"-separated string of word ids
                "lexicon": "|".join([word.id for word in stream.info["lexicon"]]),
                "rhythmicity_indexes": stream.info["rhythmicity_indexes"],
                "stream_tp_mode": stream.info["stream_tp_mode"],
                "n_syllables_per_word": stream.info["n_syllables_per_word"],
                "n_look_back": stream.info["n_look_back"],
                "phonotactic_control": stream.info["phonotactic_control"],
                "syllables_info": stream.info["syllables_info"],
                }} for stream in streams], "info": streams.info}, file)

    if open_in_browser:
        # Open the file that was just written, not a hard-coded name
        webbrowser.open('file://' + os.path.realpath(file_name))

In [67]:
write_out_streams(streams)

In [54]:
type(streams[0].info["rhythmicity_indexes"]["phon_1_son"])

<class 'float'>

In [None]:
stream = streams[0]
print(stream)

In [None]:
print("Lexicon:", stream.info["lexicon"])
print("TP mode:", stream.info["stream_tp_mode"])
print("Feature PRIs:") 
for feat, pri in stream.info["rhythmicity_indexes"].items():
    print(" ", feat, pri)

As you can see, the `.info` field holds useful information about the generated stream: which lexicon was used to generate it, the rhythmicity indexes achieved for each feature, and which randomization/TP-structure mode was used.

This concludes the second tutorial. The third and final tutorial in this series covers how to use your own data.