# Use your own data

In this tutorial, we discuss two ways of using your own data:

1. You have one or more lexicons you want to evaluate and generate streams with
2. You already have streams and just want to evaluate them

If you want to extend ARC, you are welcome to contribute to the [ARC Project](https://github.com/milosen/arc).


## 1. Loading/creating your custom lexicon

Let's say you have a lexicon consisting of the (pseudo-)words 'piɾuta', 'baɡoli', 'tokuda', and 'ɡuhaɪbo'.

We assume you have prepared your lexicon as a list of lists, with one inner list of syllables per word (see below), and that all syllables are of the same type. The function `to_lexicon()` accepts the syllable types we call 'cv' and 'cV'. A 'cv' syllable consists of a single-character consonant and a short vowel, e.g. 'pi'. Because it is common in the literature, 'cv' also allows diphthongs, e.g. 'haɪ'. A 'cV' syllable consists of a single-character consonant together with a long vowel, e.g. 'tuː'.
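
Before calling `to_lexicon()`, it can be useful to check that all syllables in a raw lexicon really are of the same type. The following is a minimal sketch in plain Python, not part of `alparc`. The helper names `guess_syllable_type` and `lexicon_syllable_types` are hypothetical, and the rule used here (a trailing IPA length mark means 'cV', anything else means 'cv') is a simplifying assumption based on the description above.

```python
LENGTH_MARK = "ː"  # IPA length mark, as in 'tuː'


def guess_syllable_type(syllable: str) -> str:
    """Hypothetical helper: 'cV' if the syllable ends in a length mark, else 'cv'."""
    return "cV" if syllable.endswith(LENGTH_MARK) else "cv"


def lexicon_syllable_types(raw_lexicon):
    """Collect the set of guessed syllable types over all words."""
    return {guess_syllable_type(syl) for word in raw_lexicon for syl in word}


raw_lexicon = [
    ['pi', 'ɾu', 'ta'],
    ['ba', 'ɡo', 'li'],
    ['to', 'ku', 'da'],
    ['ɡu', 'haɪ', 'bo'],
]

types = lexicon_syllable_types(raw_lexicon)
if len(types) != 1:
    raise ValueError(f"Mixed syllable types in lexicon: {types}")
print("Syllable type:", next(iter(types)))
```

If the set contains more than one type, the lexicon mixes short and long vowels and would need to be split or corrected before conversion.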

In [9]:
from alparc import to_lexicon

raw_lexicon = [
  ['pi', 'ɾu', 'ta'],
  ['ba', 'ɡo', 'li'],
  ['to', 'ku', 'da'],
  ['ɡu', 'haɪ', 'bo']
]

lexicon = to_lexicon(raw_lexicon, syllable_type="cv")

print("Lexicon:", lexicon)
print("")

# Print each metadata entry stored alongside the lexicon
for key, value in lexicon.info.items():
    print(f"{key}:", value)


Lexicon: piɾuta|baɡoli|tokuda|ɡuhaɪbo

syllables_info: {'syllable_feature_labels': [['son', 'back', 'hi', 'lab', 'cor', 'cont', 'lat', 'nas', 'voi'], ['back', 'hi', 'lo', 'lab', 'tense', 'long']], 'syllable_type': 'cv'}
cumulative_feature_repetitiveness: 7
max_pairwise_feature_repetitiveness: 2


### 1.1. Custom Lexicon: Moving upstream

Now we "move upstream" in the generation process. We turn the lexicon into a stream using the standard `alparc` functions introduced earlier.

In [10]:
from alparc import make_streams
streams = make_streams([lexicon])

print("Streams (summary):", streams)
print("")

for key, value in streams.info.items():
    print(f"{key}:", value)

Streams (summary): piɾutabaɡolitokudaɡuhaɪbo_random|piɾutabaɡolitokudaɡuhaɪbo_word_structured|piɾutabaɡolitokudaɡuhaɪbo_position_controlled

tp_modes: ('random', 'word_structured', 'position_controlled')
max_rhythmicity: None
max_tries_randomize: 10
stream_length: 15
require_all_tp_modes: True


In [11]:
for stream in streams:
    tp_mode = stream.info['stream_tp_mode']
    pris = stream.info['rhythmicity_indexes']

    print(f"Stream ({tp_mode}): ", stream)
    print("PRIs:")
    max_feat = "phon_1_son"  # feature with the highest PRI seen so far
    cum = 0  # running sum of the PRIs across all features
    for feat, pri in pris.items():
        print(" ", feat, pri)
        if pri > pris[max_feat]:
            max_feat = feat
        cum += pri

    print("Max PRI across features:", max_feat, pris[max_feat])
    print("Cumulative PRI across features:", cum)
    print(" ")

Stream (random):  ta_to_ba_pi_ɡo...bo_haɪ_da_ku_to
PRIs:
  phon_1_son 0.058823529411764705
  phon_1_back 0.08263305322128851
  phon_1_hi 0.08263305322128851
  phon_1_lab 0.07282913165266107
  phon_1_cor 0.06862745098039216
  phon_1_cont 0.058823529411764705
  phon_1_lat 0.004201680672268907
  phon_1_nas 0.0
  phon_1_voi 0.02100840336134454
  phon_2_back 0.0
  phon_2_hi 0.06162464985994398
  phon_2_lo 0.12745098039215685
  phon_2_lab 0.05742296918767507
  phon_2_tense 0.0
  phon_2_long 0.0
Max PRI across features: phon_2_lo 0.12745098039215685
Cumulative PRI across features: 0.696078431372549
 
Stream (word_structured):  pi_ɾu_ta_to_ku...ɡo_li_to_ku_da
PRIs:
  phon_1_son 0.1400560224089636
  phon_1_back 0.14705882352941177
  phon_1_hi 0.14705882352941177
  phon_1_lab 0.12044817927170869
  phon_1_cor 0.04481792717086835
  phon_1_cont 0.1400560224089636
  phon_1_lat 0.0
  phon_1_nas 0.0
  phon_1_voi 0.011204481792717087
  phon_2_back 0.0
  phon_2_hi 0.011204481792717087
  phon_2_lo 0.096
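
The two summary values printed per stream can be reproduced from the PRI dictionary alone. The sketch below uses only Python built-ins and the PRI values printed above for the random stream; it does not call `alparc`, and the variable names are chosen for illustration.

```python
# PRI values copied from the output of the random stream above
pris = {
    "phon_1_son": 0.058823529411764705,
    "phon_1_back": 0.08263305322128851,
    "phon_1_hi": 0.08263305322128851,
    "phon_1_lab": 0.07282913165266107,
    "phon_1_cor": 0.06862745098039216,
    "phon_1_cont": 0.058823529411764705,
    "phon_1_lat": 0.004201680672268907,
    "phon_1_nas": 0.0,
    "phon_1_voi": 0.02100840336134454,
    "phon_2_back": 0.0,
    "phon_2_hi": 0.06162464985994398,
    "phon_2_lo": 0.12745098039215685,
    "phon_2_lab": 0.05742296918767507,
    "phon_2_tense": 0.0,
    "phon_2_long": 0.0,
}

# Feature with the largest PRI (first one wins in case of ties)
max_feature = max(pris, key=pris.get)

# Sum of the PRIs over all features
cumulative = sum(pris.values())

print("Max PRI across features:", max_feature, pris[max_feature])
print("Cumulative PRI across features:", cumulative)
```

Using `max` with a `key` function avoids the manual bookkeeping and does not require choosing a starting feature.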

### 1.2. Custom Lexicon: Moving backwards

"Moving backwards" in the generation process, i.e. going from a lexicon back to its words, syllables, and phonemes, is less common, but we've got you covered. Let's say you want to compare the syllables in your custom lexicon with the ARC corpus.

In [12]:
syllables = lexicon.flatten()
print(syllables)

pi|ɾu|ta|ba|ɡo|li|to|ku|da|ɡu|... (12 elements total)
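
Conceptually, flattening a lexicon concatenates the syllables of all words into one sequence. The sketch below illustrates this idea on the raw list of lists with the standard library only; it is an analogy, not the `alparc` implementation of `flatten()`.

```python
from itertools import chain

raw_lexicon = [
    ['pi', 'ɾu', 'ta'],
    ['ba', 'ɡo', 'li'],
    ['to', 'ku', 'da'],
    ['ɡu', 'haɪ', 'bo'],
]

# Concatenate the per-word syllable lists, preserving order
flat_syllables = list(chain.from_iterable(raw_lexicon))

print("|".join(flat_syllables))
print(len(flat_syllables), "elements total")
```

Four words of three syllables each give the twelve elements reported above.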


In [13]:
from alparc.io import read_syllables_corpus

corpus_syllables = read_syllables_corpus()

syllables_with_corpus_stats = syllables.intersection(corpus_syllables)

print(syllables_with_corpus_stats)
syllables_with_corpus_stats["pi"].info

# Note: the 'freq' and 'prob' entries in the info are new; they come from the corpus.

pi|ta|ba|ɡo|li|to|ku|da|ɡu|haɪ|... (11 elements total)


{'binary_features': [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
 'phonotactic_features': [['plo', 'lab'], ['i']],
 'freq': 70,
 'prob': 6.92628e-05}
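
The effect of the intersection can be pictured as a lookup: syllables found in the corpus are kept and enriched with corpus statistics, and the rest are dropped. The sketch below mimics this with plain dictionaries. The `toy_corpus` is a stand-in invented for illustration (only the 'pi' statistics are taken from the output above, the other entries are left empty), and `split_by_corpus` is a hypothetical helper, not an `alparc` function.

```python
def split_by_corpus(syllables, corpus):
    """Hypothetical helper: return (kept, missing) while preserving order."""
    kept = {syl: corpus[syl] for syl in syllables if syl in corpus}
    missing = [syl for syl in syllables if syl not in corpus]
    return kept, missing


# Toy stand-in for a syllable corpus (assumption, not the real ARC corpus)
toy_corpus = {
    "pi": {"freq": 70, "prob": 6.92628e-05},  # values from the output above
    "ta": {},  # statistics omitted in this sketch
    "ba": {},  # statistics omitted in this sketch
}

kept, missing = split_by_corpus(["pi", "ɾu", "ta", "ba"], toy_corpus)

print("kept:", "|".join(kept))
print("missing:", missing)
print("pi stats:", kept["pi"])
```

Checking the `missing` list is a quick way to find out why an intersection has fewer elements than the original collection.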

In [14]:
phonemes = syllables.flatten()
print(phonemes)

p|i|ɾ|u|t|a|b|ɡ|o|l|... (13 elements total)


In [15]:
from alparc.io import read_phoneme_corpus
corpus_phonemes = read_phoneme_corpus()

phonemes_with_stats = phonemes.intersection(corpus_phonemes)
print(phonemes_with_stats)
print(phonemes_with_stats["p"].info)

Only german phoneme corpus available


p|i|ɾ|u|t|a|b|ɡ|o|l|... (13 elements total)
{'features': ['-', '-', '+', '-', '-', '-', '-', '0', '-', '-', '-', '+', '-', '0', '+', '-', '-', '-', '-', '0', '-']}


## 2. Evaluating your stream

Again, we assume you have prepared your data as a flat list of syllables. Here, we repeat a sequence of twelve syllables `stream_length` times.

In [16]:
from alparc import to_stream

stream = ['pi', 'ɾu', 'ta', 'ba', 'ɡo', 'li', 'to', 'ku', 'da', 'ɡu', 'ki', 'bo']*streams.info['stream_length']

stream = to_stream(stream)

print("Stream: ", stream, end="\n\n")
print("rhythmicity indexes (PRIs)", stream.info['rhythmicity_indexes'])

Stream:  pi_ɾu_ta_ba_ɡo...ku_da_ɡu_ki_bo

rhythmicity indexes (PRIs) {'phon_1_son': 0.0, 'phon_1_back': 0.1724137931034483, 'phon_1_hi': 0.1724137931034483, 'phon_1_lab': 0.08620689655172414, 'phon_1_cor': 0.0, 'phon_1_cont': 0.0, 'phon_1_lat': 0.0, 'phon_1_nas': 0.0, 'phon_1_voi': 0.0, 'phon_2_back': 0.0, 'phon_2_hi': 0.0, 'phon_2_lo': 0.0, 'phon_2_lab': 0.0, 'phon_2_tense': 0.0, 'phon_2_long': 0.0}


As you can see by comparing these values with the ones in Section 1.1, even with a custom lexicon, the randomization of a stream has an effect on the PRIs.
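
One way to see how the hand-built stream differs from a randomized one, independently of `alparc`, is to count the distinct successors of each syllable. The sketch below does this for the raw list used above, assuming 15 repetitions (the `stream_length` shown in Section 1.1). It does not compute PRIs; it only illustrates that the syllable order is fixed.

```python
from collections import defaultdict

base = ['pi', 'ɾu', 'ta', 'ba', 'ɡo', 'li', 'to', 'ku', 'da', 'ɡu', 'ki', 'bo']
repetitions = 15  # assumed to match stream_length from Section 1.1
raw_stream = base * repetitions

# Map each syllable to the set of syllables that directly follow it
successors = defaultdict(set)
for current, following in zip(raw_stream, raw_stream[1:]):
    successors[current].add(following)

max_successors = max(len(s) for s in successors.values())

print("Stream length:", len(raw_stream))
print("Max distinct successors per syllable:", max_successors)
```

In this fixed-order stream every syllable is always followed by the same syllable, whereas a randomized stream would typically show several successors per syllable.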

This concludes our third and last tutorial. We hope you feel ready to use ARC and to help us extend it.