# PHONEMES: TRANSCRIPTION & PREPROCESSING

This notebook covers the phoneme part of the project and **briefly**:

0. describes the **phonetic set in Indonesian** 


2. explains the choice of the **phonetic transcription**


3. walks through the phoneme **preprocessing**

### 0. PHONEME SET IN INDONESIAN ###

According to Soderberg and Olson (2008), **Bahasa Indonesia has 32 phonemes**:

**Vowels** (/a/, /e/, /ə/, /i/, /o/, /u/)

**Diphthongs** (/ai̯/, /au̯/, /oi̯/)

**Plosives** (/b/, /d/, /g/, /k/, /ʔ/, /p/, /t/)

**Affricates** (/tʃ/, /dʒ/)

**Nasals** (/m/, /n/, /ɲ/, /ŋ/)

**Trill** (/r/) 

**Fricatives** (/f/, /h/, /x/, /s/, /ʃ/, /z/)

**Approximants** (/w/, /j/)

**Lateral approximant** (/l/)
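
The inventory above can also be written down as a small Python mapping, which makes the total of 32 easy to verify. This is only a sketch: the names `INVENTORY` and `all_reference_phonemes` are illustrative and not part of the project code.

```python
# Phoneme inventory as listed above (after Soderberg and Olson 2008).
# The name INVENTORY is illustrative, not part of the project code.
INVENTORY = {
    "vowels": ["a", "e", "ə", "i", "o", "u"],
    "diphthongs": ["ai̯", "au̯", "oi̯"],
    "plosives": ["b", "d", "g", "k", "ʔ", "p", "t"],
    "affricates": ["tʃ", "dʒ"],
    "nasals": ["m", "n", "ɲ", "ŋ"],
    "trill": ["r"],
    "fricatives": ["f", "h", "x", "s", "ʃ", "z"],
    "approximants": ["w", "j"],
    "lateral approximant": ["l"],
}

# Flatten the classes into one list and count the phonemes.
all_reference_phonemes = [p for group in INVENTORY.values() for p in group]
print(len(all_reference_phonemes))  # 32
```

Keeping the inventory grouped by class makes it possible to look up the class of a segment later, should that be needed as a feature.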


### 1. MODULES AND LOADING DATA ###

In [2]:
import pandas as pd

In [3]:
# The file is semicolon-separated; pass the separator as a keyword argument
df = pd.read_csv('ind_lw2.csv', sep=';')

In [5]:
df[:10]

Unnamed: 0,value,segments,language,borrowing_score,age_label,donor
0,dunia,"['d', 'u', 'n', 'i', 'a']",Indonesian,1.0,Early Malay,Arabic
1,alam,"['a', 'l', 'a', 'm']",Indonesian,1.0,Early Malay,Arabic
2,jagat,"['dʒ', 'a', 'g', 'a', 't']",Indonesian,1.0,Early Malay,Sanskrit
3,buana,"['b', 'u', 'a', 'n', 'a']",Indonesian,1.0,Early Malay,Sanskrit
4,darat,"['d', 'a', 'r', 'a', 't']",Indonesian,0.0,Prehistorical,
5,tanah,"['t', 'a', 'n', 'a', 'h']",Indonesian,0.0,Prehistorical,
6,bumi,"['b', 'u', 'm', 'i']",Indonesian,1.0,Early Malay,Sanskrit
7,tanah,"['t', 'a', 'n', 'a', 'h']",Indonesian,0.0,Prehistorical,
8,debu,"['d', 'e', 'b', 'u']",Indonesian,0.0,Prehistorical,
9,serdak,"['s', 'e', 'r', 'd', 'a', 'ʔ']",Indonesian,0.0,Prehistorical,


### 2. TRANSCRIPTION CHOICE ###

To represent the words as phonetic transcriptions, we had two options: use the transcriptions provided by the WOLD developers, or use an external Python package that converts the graphemic representation into a phonetic one.

Since phonemes are key features in our classification, we examined both transcription variants and their outputs in order to choose the more suitable one.




#### 2.1. Epitran lib

`epitran` is one of the few grapheme-to-phoneme packages that support Indonesian. We ran it on our word list and examined the resulting transcriptions manually.

The transcriptions mishandle two cases that matter for our task.

First, the glottal stop [ʔ] was transcribed as [k], e.g. ombak -> [o m b a k] instead of the correct [o m b a ʔ]. In Indonesian, [k] and [ʔ] are independent and fairly common phonemes, and distinguishing them is crucial for our research.

Second, the diphthong [aʊ] was split into two separate vowels, [a] and [u], e.g. pulau -> [p u l a u] instead of [p u l aʊ]. We assume that diphthongs should be represented as single units. This matters for our future sequential models and especially for the Bag of Sounds, which analyses each phoneme unit independently.
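
To illustrate why the treatment of diphthongs matters, the sketch below compares the two segmentations of pulau under a simple count-based Bag of Sounds. It assumes that a Bag of Sounds is just a count of segment units per word; the helper `bag_of_sounds` is hypothetical and is not the project's implementation.

```python
from collections import Counter

def bag_of_sounds(segments):
    """Count how often each segment unit occurs in one word (hypothetical helper)."""
    return Counter(segments)

# 'pulau' with the diphthong kept as a single unit
with_diphthong = bag_of_sounds(["p", "u", "l", "aʊ"])
# 'pulau' with the diphthong split into two vowels
split_vowels = bag_of_sounds(["p", "u", "l", "a", "u"])

print(with_diphthong["u"], split_vowels["u"])  # 1 2
print("aʊ" in split_vowels)                    # False
```

With the split segmentation the count of [u] is inflated and the diphthong never appears as a feature of its own, so the two variants yield different feature vectors for the same word.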

#### 2.2. WOLD representation

In [6]:
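import ast

# The segments column holds strings such as "['d', 'u', 'n', 'i', 'a']";
# ast.literal_eval parses them into lists without executing arbitrary code.
df.segments = df.segments.apply(ast.literal_eval)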

In [7]:
segments = []

for i in df.segments:
    segments.extend(i)

In [12]:
all_phonemes = []

for l in df.segments:
    for s in l:
        if s not in all_phonemes:
            all_phonemes.append(s)

In [13]:
set(all_phonemes)

{'+',
 'A/a',
 'J/dʒ',
 'K/k',
 'M/m',
 'R/r',
 'S/s',
 'a',
 'aɪ',
 'aʊ',
 'aː',
 'b',
 'd',
 'dʒ',
 'e',
 'eː',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'kʰ',
 'kʷ',
 'l',
 'm',
 'n',
 'o',
 'p',
 'r',
 's',
 'ss/s',
 't',
 'tʃ',
 'u',
 'v',
 'w',
 'x',
 'z',
 'é/e',
 'éé/eː',
 'ŋ',
 'ɲ',
 'ʃ',
 'ʔ'}

The WOLD transcriptions seemed complete and diverse, although some segments looked questionable: [A/a], [R/r] and the other segments containing a slash. We could not justify treating them as separate units, so we decided to check every word containing them.

After displaying those words from the database, we checked their IPA transcriptions and found no reason for most of them to be separate: [A/a] stood for [a], [R/r] for [r], and so on. The exceptions were [é/e] and [éé/eː], which represent vowels not covered by the other phonemes.

We decided to use the WOLD transcriptions, but to replace the unnecessary segments with the phonemes they actually refer to.
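
The inspection step described above can be sketched as follows. The snippet uses plain Python on a toy list of `(word, segments)` pairs instead of the real dataframe; the third row is invented purely for illustration, and the helper name `words_with_slash` is hypothetical.

```python
# Toy stand-in for the (value, segments) columns of the dataframe.
rows = [
    ("alam", ["a", "l", "a", "m"]),              # row taken from the table above
    ("bumi", ["b", "u", "m", "i"]),              # row taken from the table above
    ("<hypothetical>", ["A/a", "l", "a", "m"]),  # invented row with a slash segment
]

def words_with_slash(rows):
    """Return (word, slash segments) for every row with a segment containing '/'."""
    hits = []
    for word, segs in rows:
        slashed = [s for s in segs if "/" in s]
        if slashed:
            hits.append((word, slashed))
    return hits

print(words_with_slash(rows))  # [('<hypothetical>', ['A/a'])]
```

On the real dataframe, a comparable filter could be written as `df[df.segments.apply(lambda segs: any('/' in s for s in segs))]`, assuming the segments column has already been parsed into lists.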

### 3. WOLD PHONEMES: PREPROCESSING ###

In [14]:
# Slash-marked segments and the plain phonemes they stand for
replacements = {'A/a': 'a', 'K/k': 'k', 'M/m': 'm', 'R/r': 'r',
                'S/s': 's', 'ss/s': 's', 'J/dʒ': 'dʒ'}

segments_prep = []

for word in df.segments:
    # Replace mapped segments; keep all others (including 'é/e' and 'éé/eː') unchanged
    phonemes = [replacements.get(seg, seg) for seg in word]
    segments_prep.append(phonemes)

In [15]:
df.segments = segments_prep

In [16]:
df.segments[1050:1060]

1050            [dʒ, a, ŋ, k, a, r]
1051                   [s, a, u, h]
1052    [p, e, l, a, b, u, h, a, n]
1053             [b, a, n, d, a, r]
1054       [m, e, n, d, a, r, a, t]
1055         [m, e, m, p, u, ɲ, aɪ]
1056                      [a, d, a]
1057       [m, e, m, i, l, i, k, i]
1058       [m, e, m, i, l, i, k, i]
1059       [m, e, ŋ, a, m, b, i, l]
Name: segments, dtype: object
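
A quick sanity check after preprocessing could confirm that no unexpected slash segments remain. The sketch below assumes that `é/e` and `éé/eː` are the only slash segments meant to survive, as stated in section 2.2; the names `KEEP` and `leftover_slash_segments` are hypothetical.

```python
# Slash segments that are intentionally kept (the exceptions named in section 2.2)
KEEP = {"é/e", "éé/eː"}

def leftover_slash_segments(words):
    """Collect slash segments that are not in KEEP."""
    return {s for word in words for s in word if "/" in s and s not in KEEP}

# Rows 1050 and 1051 from the output above
cleaned = [["dʒ", "a", "ŋ", "k", "a", "r"], ["s", "a", "u", "h"]]
print(leftover_slash_segments(cleaned))  # set()
```

Applied to the full column, e.g. `leftover_slash_segments(df.segments)`, an empty set would indicate that every unnecessary segment was replaced.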

### Resources

Haspelmath, Martin and Tadmor, Uri (eds.) 2009. World Loanword Database. Leipzig: Max Planck Institute for Evolutionary Anthropology. (Available online at http://wold.clld.org, Accessed on 2021-02-26.)

Mortensen, David R., Dalmia, Siddharth and Littell, Patrick. 2018. Epitran: Precision G2P for many languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, France. European Language Resources Association (ELRA).

Soderberg, C. D. and Olson, K. S. 2008. Indonesian. Journal of the International Phonetic Association, 38(2), 209-213.

