# Use your own data

In this tutorial, we discuss two ways of using your own data:

1. You have one or more lexicons you want to evaluate and generate streams with
2. You already have streams and just want to evaluate them

If you want to expand ARC, we are happy to invite you to contribute to the [ARC Project](https://github.com/milosen/arc).


## 1. Loading/creating your custom lexicon

Let's say you have a lexicon consisting of the (pseudo-)words 'piɾuta', 'baɡoli', 'tokuda', and 'ɡuhaɪbo'.

We assume you have prepared your lexicon as a list of lists (see below), and that all syllables are of the same type. The function `to_lexicon()` accepts two syllable types, which we call 'cv' and 'cV'. A 'cv' syllable consists of a single-character consonant and a short vowel, e.g. 'pi'. Because it is common in the literature, 'cv' also allows diphthongs, e.g. 'haɪ'. A 'cV' syllable consists of a single-character consonant and a long vowel, e.g. 'tuː'.
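Before handing your data to `to_lexicon()`, it can help to check that it really has the expected shape. The following is a minimal sketch in plain Python, not part of `arc`: the helpers `guess_syllable_type` and `check_raw_lexicon` are hypothetical, and the rule "a syllable ending in the length mark 'ː' is 'cV', everything else is 'cv'" is our simplifying assumption, not the check that `arc` performs internally.

```python
def guess_syllable_type(syllable):
    """Rough heuristic (assumption): 'cV' if the syllable ends in the IPA length mark, else 'cv'."""
    return "cV" if syllable.endswith("ː") else "cv"


def check_raw_lexicon(raw_lexicon):
    """Return the syllable type shared by all syllables, or raise ValueError."""
    # The lexicon must be a non-empty list of non-empty lists (one inner list per word).
    if not raw_lexicon or not all(isinstance(word, list) and word for word in raw_lexicon):
        raise ValueError("expected a non-empty list of non-empty lists of syllables")
    # Collect the guessed type of every syllable; a valid lexicon has exactly one.
    types = {guess_syllable_type(s) for word in raw_lexicon for s in word}
    if len(types) != 1:
        raise ValueError(f"mixed syllable types: {sorted(types)}")
    return types.pop()


raw_lexicon = [
    ["pi", "ɾu", "ta"],
    ["ba", "ɡo", "li"],
    ["to", "ku", "da"],
    ["ɡu", "haɪ", "bo"],
]
print(check_raw_lexicon(raw_lexicon))  # cv
```

The returned string could then be passed as the `syllable_type` argument, but treat the heuristic as a sanity check only.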

In [9]:
from arc import to_lexicon
import numpy as np

raw_lexicon = [
  ['pi', 'ɾu', 'ta'],
  ['ba', 'ɡo', 'li'],
  ['to', 'ku', 'da'],
  ['ɡu', 'haɪ', 'bo']
]

lexicon = to_lexicon(raw_lexicon, syllable_type="cv")

print("Lexicon:", lexicon)
print("")

for key, value in lexicon.info.items():
    print(f"{key}:", value)


Lexicon: piɾuta|baɡoli|tokuda|ɡuhaɪbo

syllable_feature_labels: [['son', 'back', 'hi', 'lab', 'cor', 'cont', 'lat', 'nas', 'voi'], ['back', 'hi', 'lo', 'lab', 'tense', 'long']]
syllable_type: cv
cumulative_feature_repetitiveness: 7
max_pairwise_feature_repetitiveness: 2


### 1.1. Custom Lexicon: Moving upstream

Now we "move upstream" in the generation process. We turn the lexicon into a stream using the standard `arc` functions introduced earlier.

In [10]:
from arc import make_streams
streams = make_streams([lexicon])

print("Streams (summary):", streams)
print("")

for key, value in streams.info.items():
    print(f"{key}:", value)

Streams (summary): piɾutabaɡolitokudaɡuhaɪbo_random|piɾutabaɡolitokudaɡuhaɪbo_word_structured|piɾutabaɡolitokudaɡuhaɪbo_position_controlled

tp_modes: ('random', 'word_structured', 'position_controlled')
max_rhythmicity: None
max_tries_randomize: 10
stream_length: 32
require_all_tp_modes: True


In [11]:
for stream in streams:
    tp_mode = stream.info['stream_tp_mode']
    pris = stream.info['rhythmicity_indexes']
    
    print(f"Stream ({tp_mode}): ", stream, end="\n\n")
    print("PRIs:", pris, end="\n\n")

Stream (random):  haɪ|da|ɡo|ku|ɡu|ba|ɾu|bo|pi|li|ta|to|ɡo|pi|ta|ba|haɪ|ɾu|ɡu|li|da|bo|ku|to|da|ba|ɡu|to|ɾu|li|pi|ɡo|bo|ta|haɪ|ku|ɾu|to|haɪ|pi|da|ku|ta|ɡu|bo|ba|li|ɡo|haɪ|ba|da|ɡu|ta|pi|to|bo|li|ku|ɡo|ɾu|ɡo|to|ba|bo|ɾu|ta|da|pi|haɪ|li|ɡu|ku|da|ta|ɾu|haɪ|to|ɡu|ɡo|li|ba|pi|ku|bo|ɡu|pi|ɾu|ku|ba|ta|bo|haɪ|ɡo|da|to|li|to|ku|pi|ba|ɡo|ɡu|ɾu|da|haɪ|ta|li|bo|to|ta|ɡo|ba|ku|li|ɾu|pi|ɡu|haɪ|bo|da|li|haɪ|ɡu|da|ɾu|ba|to|pi|bo|ɡo|ta|ku|haɪ|bo|to|li|ɾu|pi|ku|ɡu|da|ta|ba|ɡo|bo|da|haɪ|pi|ɡo|ta|ɡu|ɾu|to|ku|ba|li|bo|haɪ|ku|li|to|pi|ɾu|ɡu|ba|ta|ɡo|da|ba|ku|to|ɡo|li|ta|pi|bo|ɾu|haɪ|da|ɡu|bo|ta|haɪ|ɡu|pi|da|ku|ɡo|ɾu|li|ba|to|ɡu|to|bo|pi|haɪ|ba|ɾu|da|li|ɡo|ku|ta|to|ɾu|ku|da|bo|ɡo|ba|ɡu|haɪ|ta|li|pi|ta|ɾu|ɡo|to|da|pi|ba|bo|ɡu|ku|haɪ|li|ɡu|ta|ku|bo|li|da|ɾu|ba|haɪ|ɡo|pi|to|haɪ|to|ta|da|ɡo|ɡu|li|ku|ɾu|bo|ba|pi|li|haɪ|ɾu|ta|bo|ku|pi|ɡu|ɡo|da|to|ba|da|ɡu|bo|ɡo|haɪ|ba|to|ku|li|ɾu|ta|pi|ku|da|haɪ|ɡu|li|to|ba|ɾu|bo|pi|ɡo|ta|haɪ|pi|da|ba|ku|ɡo|to|ɡu|ta|li|bo|ɾu|pi|ta|to|bo|da|li|haɪ|ɾu|ɡu|ku|ba|ɡo|ɡu|pi|haɪ|ta|ɡo|bo|b
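To get a feeling for how the stream types differ, you can look at the transitional probabilities between adjacent syllables. The sketch below is plain Python and independent of `arc`; we assume here that `tp` in `tp_modes` stands for transitional probability, and the function `transitional_probabilities` and the toy stream are ours, for illustration only. It uses the standard definition TP(a→b) = count(a followed by b) / count(a).

```python
from collections import Counter


def transitional_probabilities(syllables):
    """Map each adjacent pair (a, b) to count(a followed by b) / count(a as first element)."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    # The last syllable has no successor, so it is not counted as a first element.
    first_counts = Counter(syllables[:-1])
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


# Toy stream (illustrative, not arc output): the word 'piɾuta' appears three times.
toy_stream = ["pi", "ɾu", "ta", "ba", "ɡo", "li", "pi", "ɾu", "ta", "pi", "ɾu", "ta"]
tps = transitional_probabilities(toy_stream)
print(tps[("pi", "ɾu")])  # 1.0 (within a word)
print(tps[("ta", "ba")])  # 0.5 (across a word boundary)
```

Within-word transitions are fully predictable in this toy example, while transitions across word boundaries are not.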

### 1.2. Custom Lexicon: Moving downstream

"Moving downstream" in the generation process, i.e. generating words, syllables, and phonemes, is less common, but we've got you covered. Let's say you want to compare the syllables in your custom lexicon with the ARC corpus.

In [12]:
syllables = lexicon.flatten()
print(syllables)

pi|ɾu|ta|ba|ɡo|li|to|ku|da|ɡu|... (12 elements total)


In [13]:
from arc.io import read_syllables_corpus
corpus_syllables = read_syllables_corpus()

syllables_with_corpus_stats = syllables.intersection(corpus_syllables)
print(syllables_with_corpus_stats)
syllables_with_corpus_stats["pi"].info

pi|ta|ba|ɡo|li|to|ku|da|ɡu|haɪ|... (11 elements total)


{'binary_features': [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
 'phonotactic_features': [['plo', 'lab'], ['i']],
 'freq': 70,
 'prob': 6.92628e-05}
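Note that the intersection has 11 elements while the lexicon has 12 syllables, so one syllable has no corpus statistics. If you want to see which items are missing, a small set-based helper is enough. This is a sketch on plain Python lists with a made-up toy corpus; `split_by_coverage` is a hypothetical helper, not an `arc` function.

```python
def split_by_coverage(own_items, reference_items):
    """Split own_items into those found in reference_items and those missing, keeping order."""
    reference = set(reference_items)
    covered = [item for item in own_items if item in reference]
    missing = [item for item in own_items if item not in reference]
    return covered, missing


own_syllables = ["pi", "ɾu", "ta", "ba"]
toy_corpus = ["pi", "ta", "ba", "ko"]  # toy data, not the real corpus

covered, missing = split_by_coverage(own_syllables, toy_corpus)
print(covered)  # ['pi', 'ta', 'ba']
print(missing)  # ['ɾu']
```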

In [14]:
phonemes = syllables.flatten()
print(phonemes)

p|i|ɾ|u|t|a|b|ɡ|o|l|... (13 elements total)


In [15]:
from arc.io import read_phoneme_corpus
corpus_phonemes = read_phoneme_corpus()

phonemes_with_stats = phonemes.intersection(corpus_phonemes)
print(phonemes_with_stats)
print(phonemes_with_stats["p"].info)

p|t|a|b|ɡ|l|k|d|h
{'features': ['-', '-', '+', '-', '-', '-', '-', '0', '-', '-', '-', '+', '-', '0', '+', '-', '-', '-', '-', '0', '-'], 'word_position_prob': {0: 0.17205350479361617, 1: 0.21734887192146507, 2: 0.28055571502382454, 3: 0.19685400998909236, 4: 0.04449164705206958, 5: 0.03743039210057983, 6: 0.018542970319765772, 7: 0.012859521212469143, 8: 0.007463114989379413, 9: 0.003961191802055227, 10: 0.003387106033641426, 11: 0.0014926229978758827, 12: 0.0020667087662896836, 13: 0.0006314943452551811, 14: 0.00040186003788966073, 15: 0.0002870428842069005, 16: 5.74085768413801e-05, 17: 0.0, 18: 5.74085768413801e-05, 19: 5.74085768413801e-05}}


## 2. Reading your stream

Again, we assume you have prepared your data as a list of syllables, as shown below.

In [16]:
from arc import to_stream

stream = ['pi', 'ɾu', 'ta', 'ba', 'ɡo', 'li', 'to', 'ku', 'da', 'ɡu', 'ki', 'bo']*streams.info['stream_length']

stream = to_stream(stream)

print("Stream: ", stream, end="\n\n")
print("rhythmicity indexes (PRIs)", stream.info['rhythmicity_indexes'])

Stream:  pi|ɾu|ta|ba|ɡo|li|to|ku|da|ɡu|ki|bo|pi|ɾu|ta|ba|ɡo|li|to|ku|da|ɡu|ki|bo|pi|ɾu|ta|ba|ɡo|li|to|ku|da|ɡu|ki|bo|pi|ɾu|ta|ba|ɡo|li|to|ku|da|ɡu|ki|bo|pi|ɾu|ta|ba|ɡo|li|to|ku|da|ɡu|ki|bo|pi|ɾu|ta|ba|ɡo|li|to|ku|da|ɡu|ki|bo|pi|ɾu|ta|ba|ɡo|li|to|ku|da|ɡu|ki|bo|pi|ɾu|ta|ba|ɡo|li|to|ku|da|ɡu|ki|bo|pi|ɾu|ta|ba|ɡo|li|to|ku|da|ɡu|ki|bo|pi|ɾu|ta|ba|ɡo|li|to|ku|da|ɡu|ki|bo|pi|ɾu|ta|ba|ɡo|li|to|ku|da|ɡu|ki|bo|pi|ɾu|ta|ba|ɡo|li|to|ku|da|ɡu|ki|bo|pi|ɾu|ta|ba|ɡo|li|to|ku|da|ɡu|ki|bo|pi|ɾu|ta|ba|ɡo|li|to|ku|da|ɡu|ki|bo|pi|ɾu|ta|ba|ɡo|li|to|ku|da|ɡu|ki|bo|pi|ɾu|ta|ba|ɡo|li|to|ku|da|ɡu|ki|bo|pi|ɾu|ta|ba|ɡo|li|to|ku|da|ɡu|ki|bo|pi|ɾu|ta|ba|ɡo|li|to|ku|da|ɡu|ki|bo|pi|ɾu|ta|ba|ɡo|li|to|ku|da|ɡu|ki|bo|pi|ɾu|ta|ba|ɡo|li|to|ku|da|ɡu|ki|bo|pi|ɾu|ta|ba|ɡo|li|to|ku|da|ɡu|ki|bo|pi|ɾu|ta|ba|ɡo|li|to|ku|da|ɡu|ki|bo|pi|ɾu|ta|ba|ɡo|li|to|ku|da|ɡu|ki|bo|pi|ɾu|ta|ba|ɡo|li|to|ku|da|ɡu|ki|bo|pi|ɾu|ta|ba|ɡo|li|to|ku|da|ɡu|ki|bo|pi|ɾu|ta|ba|ɡo|li|to|ku|da|ɡu|ki|bo|pi|ɾu|ta|ba|ɡo|li|to|ku|da|ɡu|ki|bo|pi|ɾu|ta|ba|ɡo|li|t

As you can see, even with a custom lexicon, the randomization of a stream has an effect on the PRIs.
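If your stream is stored as a delimited string (for example, in a text file) rather than as a Python list, you first need to split it into syllables. A minimal sketch, assuming the syllables are separated by `|` as in the printed streams above; `parse_stream` is a hypothetical helper of ours, not part of `arc`.

```python
def parse_stream(text, sep="|"):
    """Split a delimited stream string into a list of non-empty syllables."""
    return [part.strip() for part in text.split(sep) if part.strip()]


raw_text = "pi|ɾu|ta|ba|ɡo|li"
syllable_list = parse_stream(raw_text)
print(syllable_list)  # ['pi', 'ɾu', 'ta', 'ba', 'ɡo', 'li']
```

The resulting list has the same form as the input we passed to `to_stream()` in this section.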

This concludes our third and final tutorial. We hope you feel ready to use ARC and to help us extend it.