## Adding more phonetic transcriptions

Not all the verbs found in CHILDES-DB have a transcription in `english_merged.txt`.

As a workaround, we decided to try creating transcriptions for these verbs using [NLTK's CMUdict](https://www.nltk.org/api/nltk.corpus.reader.cmudict.html). 

CMUdict uses ARPAbet transcription. [1](http://www.speech.cs.cmu.edu/cgi-bin/cmudict)

This Jupyter/Python notebook contains code that:
(1) obtains ARPAbet transcriptions for the verbs not in `english_merged.txt`,
(2) uses phonecodes to get IPA transcriptions, and
(3) saves the stem IPA to a new CSV file.

In [26]:
!pip install nltk



Get the list of verbs without a transcription from `verb_tokens_transcribed.csv`

In [27]:
import pandas as pd

df = pd.read_csv('verb_tokens_transcribed.csv')
empty_df = df[df['stem_celex_encoding'].isnull()]

In [28]:
empty_df

Unnamed: 0,gloss,target_child_age,speaker_role,target_child_id,utterance_id,stem,stem_celex_encoding,past_tense,past_tense_celex_encoding
1,says,3.000062,Mother,2530,917970,say,,,
9,fit,3.000062,Mother,2530,918404,fit,,,
10,let's,3.000062,Mother,2530,918416,let,,,
14,let's,3.000062,Mother,2530,918590,let,,,
17,read,3.000062,Mother,2530,918647,read,,,
...,...,...,...,...,...,...,...,...,...
1218416,do,35.986365,Target_Child,16213,10465001,do,,,
1218417,has,35.986365,Target_Child,16213,10465026,have,,,
1218419,do,35.986365,Target_Child,16213,10465128,do,,,
1218420,have,35.986365,Target_Child,16213,10465171,have,,,


In [29]:
verbs_to_add = empty_df['stem'].unique()

In [30]:
verbs_to_add

<StringArray>
[   'say',    'fit',    'let',   'read',     'do',    'put',   'make',
   'have',  'learn',    'dog',
 ...
  'armor',  'crest',  'tiger',   'molt', 'flurry',   'jive',  'scone',
  'hoove', 'lesson',   'sire']
Length: 552, dtype: str

For some reason, there seem to be some nouns among the missing verbs (for example `dog`, `tiger`, and `scone`).

We could try to filter these out with NLTK's part-of-speech tagging if we had the context (the whole utterance), but we leave this for future work.

## Get ARPAbet transcription using CMUdict

In [48]:
from collections import defaultdict

from nltk.corpus import cmudict

# Requires the CMUdict corpus; if it is missing, run nltk.download('cmudict') once.
prondict = cmudict.dict()

# transcribed_verbs maps a verb to its ARPAbet transcription string
transcribed_verbs = {}

for verb in verbs_to_add:
    if verb in prondict:
        # prondict[verb] is a list of pronunciations, each a list of ARPAbet phones.
        # We use the first pronunciation in the list.
        transcription = " ".join(prondict[verb][0])
        transcribed_verbs[verb] = transcription

In [49]:
# Check how many transcribed verbs we have vs what we started with

len(transcribed_verbs), len(verbs_to_add)

(484, 552)
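
The counts above mean that 68 stems have no CMUdict entry. A minimal sketch of how those stems could be listed for inspection follows; the two containers are toy stand-ins for the notebook's `verbs_to_add` and `transcribed_verbs`, and `blorp` is a made-up token used only for illustration.

```python
# Toy stand-ins for the notebook's `verbs_to_add` and `transcribed_verbs`
verbs_to_add = ["say", "fit", "blorp"]
transcribed_verbs = {"say": "S EY1", "fit": "F IH1 T"}

# Stems that received no transcription, kept in their original order
missing_verbs = [verb for verb in verbs_to_add if verb not in transcribed_verbs]
print(missing_verbs)
```

Looking at this list by hand is one way to judge whether the untranscribed stems are typos, interjections, or genuine verbs.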

Note: These new transcriptions cover only the stem.

In contrast, the existing CSV file `verb_tokens_transcribed_ipa.csv` has transcriptions for both the stem and the past tense, plus the past-tense type (regular or irregular).

Maybe the stem-only transcriptions could be used as test data for the model?

## Convert ARPAbet to IPA using phonecodes

The code below is adapted from `celexToIPA.py`.

In [50]:
!pip install phonecodes

Collecting phonecodes
  Using cached phonecodes-2.0.0-py3-none-any.whl.metadata (9.2 kB)
Using cached phonecodes-2.0.0-py3-none-any.whl (20 kB)
Installing collected packages: phonecodes
Successfully installed phonecodes-2.0.0


In [64]:
from phonecodes import phonecodes

# ipa_transcribed maps a verb to its IPA transcription string
ipa_transcribed = {}

for verb, arpa in transcribed_verbs.items():
    ipa_transcription = phonecodes.convert(arpa, "arpabet", "ipa")
    # Remove the spaces between phones in the IPA transcription
    ipa_transcribed[verb] = ipa_transcription.replace(" ", "")
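
To make the conversion step less opaque, here is a rough, hand-written illustration of what an ARPAbet-to-IPA mapping does: each phone is looked up in a table, and a trailing stress digit on a vowel becomes an IPA stress mark placed before that vowel. The table and the function `arpabet_to_ipa` are hypothetical and cover only the phones in the examples; this is not how phonecodes is implemented, and it ignores details such as the reduced vowels phonecodes produces for unstressed syllables.

```python
# Hand-written toy table; NOT the mapping used by phonecodes
ARPABET_TO_IPA = {"S": "s", "EY": "eɪ", "M": "m", "K": "k"}
# ARPAbet stress digits: 0 = unstressed, 1 = primary, 2 = secondary
STRESS_MARKS = {"0": "", "1": "ˈ", "2": "ˌ"}


def arpabet_to_ipa(arpa):
    """Convert a space-separated ARPAbet string to IPA using the toy table."""
    ipa = []
    for phone in arpa.split():
        if phone[-1].isdigit():
            # Vowel with a stress digit: emit the stress mark, then the vowel
            ipa.append(STRESS_MARKS[phone[-1]] + ARPABET_TO_IPA[phone[:-1]])
        else:
            ipa.append(ARPABET_TO_IPA[phone])
    return "".join(ipa)


print(arpabet_to_ipa("S EY1"))    # sˈeɪ
print(arpabet_to_ipa("M EY1 K"))  # mˈeɪk
```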

In [65]:
len(ipa_transcribed) == len(transcribed_verbs)

True

In [66]:
ipa_transcribed.items()

dict_items([('say', 'sˈeɪ'), ('fit', 'fˈɪt'), ('let', 'lˈɛt'), ('read', 'ɹˈɛd'), ('do', 'dˈu'), ('put', 'pˈʊt'), ('make', 'mˈeɪk'), ('have', 'hˈæv'), ('learn', 'lˈɝn'), ('dog', 'dˈɔɡ'), ('close', 'klˈoʊs'), ('ooh', 'ˈu'), ('excuse', 'ɨkskjˈus'), ('bless', 'blˈɛs'), ('hit', 'hˈɪt'), ('use', 'jˈus'), ('wad', 'wˈɑd'), ('baby', 'bˈeɪbi'), ('record', 'ɹəkˈɔɹd'), ('shut', 'ʃˈʌt'), ('snack', 'snˈæk'), ('be', 'bˈi'), ('ribbit', 'ɹˈɪbɨt'), ('set', 'sˈɛt'), ('shampoo', 'ʃæmpˈu'), ('incorporate', 'ɨnkˈɔɹpɚˌeɪt'), ('waffle', 'wˈɑfəl'), ('jack', 'dʒˈæk'), ('best', 'bˈɛst'), ('boink', 'bˈɔɪnk'), ('ham', 'hˈæm'), ('weird', 'wˈɪɹd'), ('chicken', 'tʃˈɪkən'), ('tear', 'tˈɛɹ'), ('bam', 'bˈæm'), ('till', 'tˈɪl'), ('spill', 'spˈɪl'), ('hurt', 'hˈɝt'), ('max', 'mˈæks'), ('yuck', 'jˈʌk'), ('live', 'lˈaɪv'), ('toy', 'tˈɔɪ'), ('meow', 'miˈaʊ'), ('pop', 'pˈɑp'), ('rid', 'ɹˈɪd'), ('quack', 'kwˈæk'), ('loaf', 'lˈoʊf'), ('task', 'tˈæsk'), ('sauce', 'sˈɔs'), ('poof', 'pˈuf'), ('woof', 'wˈuf'), ('even', 'ˈivɨn'), ('
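
The output above shows a side effect of always taking the first CMUdict pronunciation: `read` comes out as `ɹˈɛd` (the past-tense form) and `live` as `lˈaɪv` (the adjective), neither of which is the verb stem. One possible check, sketched below, is to flag every stem that has more than one pronunciation so it can be reviewed by hand. The `prondict` here is a small toy stand-in written in the same shape as `cmudict.dict()` (word mapped to a list of phone lists), not the real dictionary.

```python
# Toy stand-in in the shape of cmudict.dict(): word -> list of pronunciations
prondict = {
    "read": [["R", "EH1", "D"], ["R", "IY1", "D"]],
    "live": [["L", "AY1", "V"], ["L", "IH1", "V"]],
    "fit": [["F", "IH1", "T"]],
}
verbs_to_add = ["read", "live", "fit"]

# Keep only the stems with several pronunciations, listing every variant
ambiguous = {
    verb: [" ".join(phones) for phones in prondict[verb]]
    for verb in verbs_to_add
    if verb in prondict and len(prondict[verb]) > 1
}

for verb, variants in ambiguous.items():
    print(verb, variants)
```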

Save to a new CSV file

In [68]:
INPUT_CSV = "verb_tokens_transcribed_ipa.csv"
OUTPUT_CSV = "verb_tokens_transcribed_ipa_stemonly.csv"

df = pd.read_csv(INPUT_CSV)

# Copy each stem's IPA transcription from ipa_transcribed into the stem_ipa column
for word, transcription in ipa_transcribed.items():
    df.loc[df['stem'] == word, 'stem_ipa'] = transcription

df.to_csv(OUTPUT_CSV, index=False)
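
The loop above scans the whole DataFrame once per transcribed verb, which can be slow on a file with over a million rows. A possible alternative, sketched here on toy data, is a single `Series.map` lookup. The DataFrame and dictionary below are stand-ins for the notebook's `df` and `ipa_transcribed`. The sketch also assumes there is no existing `stem_ipa` column to preserve, since `map` overwrites the column and leaves NaN for stems that are not in the dictionary.

```python
import pandas as pd

# Toy stand-ins for the notebook's DataFrame and `ipa_transcribed` dict
df = pd.DataFrame({"stem": ["say", "fit", "say", "walk"]})
ipa_transcribed = {"say": "sˈeɪ", "fit": "fˈɪt"}

# Look up every stem in one pass; stems absent from the dict become NaN
df["stem_ipa"] = df["stem"].map(ipa_transcribed)
print(df)
```

If the input file already had a `stem_ipa` column with values worth keeping, the mapped result could be combined with it (for example with `fillna`) instead of replacing it.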