---
title: "A statistical analysis of vowel inventories of world languages"
subtitle: "Multilingual NLP -- Lab 1"
author: "Philippos Triantafyllou"
date-modified: last-modified
date-format: long
lang: en
format: html
theme: cosmo
toc: true
code-line-numbers: true
echo: true
output: true
embed-resources: true
---

:::{.callout-note}
## Instructions

One of the aims of this practical session is to examine two well-known proposed linguistic 
universals relating to vowel systems. First, that virtually all languages possess the basic vowel triangle [i], [a], [u] (or close equivalents). Second, that if a language has highly marked vowels (for example nasal, long, or front rounded vowels), it almost always also has the corresponding simpler vowels. In this lab, we will test whether these are indeed universals by drawing on `PHOIBLE`, an open, cross-linguistic database of phonological inventories covering over 2,000 languages.

:::

In [1]:
#| echo: false
#| output: false

import os
my_path = 'vowel-inventories'
print(os.path.basename(os.getcwd()) == my_path)

True


## Getting started

We start with a first look at the dataset. Each row is a single phoneme, and phonemes are grouped into inventories identified by the variable `InventoryID`.
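To make the one-row-per-phoneme layout concrete, here is a minimal sketch on a toy excerpt: seven rows hand-copied from the Korean (`InventoryID` 1) and Tableland Lamalama (`InventoryID` 3020) inventories. The names `toy`, `inventory_sizes` and `vowels_by_inventory` are illustrative only, and the real inventories contain more phonemes than shown here.

```python
import pandas as pd

# Toy excerpt: a few rows hand-copied from two PHOIBLE inventories.
# The real inventories are larger; this only illustrates the layout.
toy = pd.DataFrame(
    {
        "InventoryID": [1, 1, 1, 3020, 3020, 3020, 3020],
        "LanguageName": ["Korean"] * 3 + ["Tableland Lamalama"] * 4,
        "Phoneme": ["h", "j", "k", "ʔ", "θ", "a", "i"],
        "SegmentClass": ["consonant"] * 5 + ["vowel"] * 2,
    }
)

# One row per phoneme, so counting rows per InventoryID gives inventory sizes.
inventory_sizes = toy.groupby("InventoryID").size()

# Collect the vowels of each inventory into a set.
vowels_by_inventory = (
    toy[toy["SegmentClass"] == "vowel"]
    .groupby("InventoryID")["Phoneme"]
    .apply(set)
)

print(inventory_sizes.to_dict())
print(vowels_by_inventory.to_dict())
```

Inventories without any vowel rows in the excerpt (here, the Korean one) simply do not appear in `vowels_by_inventory`.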

In [2]:
#| echo: false

import pandas as pd

# Load data from csv file
data = pd.read_csv("data/phoible.csv", low_memory=False)
display(data)


Unnamed: 0,InventoryID,Glottocode,ISO6393,LanguageName,SpecificDialect,GlyphID,Phoneme,Allophones,Marginal,SegmentClass,...,advancedTongueRoot,periodicGlottalSource,epilaryngealSource,spreadGlottis,constrictedGlottis,fortis,lenis,raisedLarynxEjective,loweredLarynxImplosive,click
0,1,kore1280,kor,Korean,,0068,h,ç h ɦ,,consonant,...,-,-,-,+,-,-,-,-,-,-
1,1,kore1280,kor,Korean,,006A,j,j,,consonant,...,-,+,-,-,-,-,-,-,-,-
2,1,kore1280,kor,Korean,,006B,k,k̚ ɡ k,,consonant,...,-,-,-,-,-,-,-,-,-,-
3,1,kore1280,kor,Korean,,006B+02B0,kʰ,kʰ,,consonant,...,-,-,-,+,-,-,-,-,-,-
4,1,kore1280,kor,Korean,,006B+02C0,kˀ,kˀ,,consonant,...,-,-,-,-,+,-,-,-,-,-
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105479,3020,lamu1254,lby,Tableland Lamalama,,0294,ʔ,,False,consonant,...,-,-,-,-,+,-,-,-,-,-
105480,3020,lamu1254,lby,Tableland Lamalama,,03B8,θ,,False,consonant,...,-,-,-,-,-,-,-,-,-,-
105481,3020,lamu1254,lby,Tableland Lamalama,,0061,a,,False,vowel,...,-,+,-,-,-,0,0,-,-,0
105482,3020,lamu1254,lby,Tableland Lamalama,,0069,i,,False,vowel,...,-,+,-,-,-,0,0,-,-,0


We can take a quick look at the summary statistics of the first 10 columns (variables), which are the ones of interest here.

In [None]:
#| label: tbl-dataset
#| tbl-cap: "Summary statistics of first 10 columns"

data.iloc[:, :10].describe(include='all')

Unnamed: 0,InventoryID,Glottocode,ISO6393,LanguageName,SpecificDialect,GlyphID,Phoneme,Allophones,Marginal,SegmentClass
count,105484.0,105465,105459,105484,21985,105484,105484,51904,84610,105484
unique,,2176,2094,2716,544,3142,3142,6891,2,3
top,,kham1282,mis,Iron Ossetic,W2,006D,m,m,False,consonant
freq,,622,828,444,120,2915,2915,1091,83263,72282
mean,1479.331083,,,,,,,,,
std,843.110759,,,,,,,,,
min,1.0,,,,,,,,,
25%,769.0,,,,,,,,,
50%,1464.0,,,,,,,,,
75%,2237.0,,,,,,,,,


Immediately we can see some interesting things:

- the most frequent value for `Glottocode` is `kham1282` with 622 rows;
- the most frequent value for `LanguageName` is `Iron Ossetic` with 444 rows;
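- `ISO6393` has 828 rows labelled `mis`; this is the ISO 639-3 special code for uncoded languages, so we treat these rows as having no ISO code;
- the columns that identify languages disagree with each other, both in their number of non-missing rows and in their number of distinct values (2176 for `Glottocode`, 2094 for `ISO6393` and 2716 for `LanguageName`).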
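Because `describe()` treats `mis` as an ordinary string, it counts as one more distinct ISO code and hides the rows that effectively have no code. The following is a minimal sketch of recoding it as a missing value before counting. The series `iso` holds hypothetical toy values rather than real PHOIBLE rows, and treating `mis` as missing is an assumption of this lab.

```python
import pandas as pd

# Toy ISO 639-3 column (hypothetical values, not taken from PHOIBLE).
iso = pd.Series(["kor", "mis", "mis", "eng", None, "eng"], name="ISO6393")

# Raw count: "mis" is treated as a regular code.
n_unique_raw = iso.nunique()

# Recode "mis" as missing (assumption: "mis" carries no language identity).
iso_clean = iso.mask(iso == "mis")
n_unique_clean = iso_clean.nunique()
n_missing = int(iso_clean.isna().sum())

print(n_unique_raw, n_unique_clean, n_missing)
```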

### Q1: determining the number of unique languages

We want to determine the number of unique languages in our dataset. This number underpins the rest of the analysis, so it is worth spending some time digging into the dataset.

A first variable that grabs our attention is `InventoryID`. When we look at the `PHOIBLE` documentation, we can see:

> For the most part, every phonological inventory in PHOIBLE is based on one-and-only-one language description (usually a research article, book chapter, dissertation, or descriptive grammar).

This seems like a sound basis for our analysis: each `InventoryID` corresponds to one description of a phonological inventory. But we also have other variables: `LanguageName`, `Glottocode` and `ISO6393`. Unlike `InventoryID`, they appear to identify the language itself.

Looking at the summary statistics in the table above, we see that these variables do not have the same number of unique values. This means that a value in one variable may map to several values in another.

Naively, if we start from `InventoryID` and `LanguageName`, what do we find?

We group by `LanguageName` and check whether each language has exactly one phonological inventory, as it should in theory.

In [4]:
# Count distinct inventories per language name, then keep names with more than one
subset = data.groupby("LanguageName")["InventoryID"].nunique().reset_index()
subset = subset[subset["InventoryID"] > 1]
print("Are there languages with more than one inventory?", len(subset) > 0)

Are there languages with more than one inventory? True


It seems that some languages have more than one inventory. How many languages? How many inventories do they have on average?

In [None]:
#| label: tbl-inventory
#| tbl-cap: "Descriptive statistics for languages with multiple inventories"

from IPython.display import Markdown

print("Number of languages with more than one phonological inventory:", len(subset))
statistics = subset.describe().round(2).reset_index()
statistics.columns = ["Measure", "Value"]
Markdown(statistics.to_markdown(index=False))

Number of languages with more than one phonological inventory: 208


| Measure   |   Value |
|:----------|--------:|
| count     |  208    |
| mean      |    2.46 |
| std       |    1.01 |
| min       |    2    |
| 25%       |    2    |
| 50%       |    2    |
| 75%       |    3    |
| max       |   10    |

Of the 2716 language names, 208 have more than one inventory, and at least half of these have exactly two (the median is 2). This is consistent with the documentation quoted above: presumably, different linguists have compiled slightly different inventories for the same language or dialect. There is at least one outlier, a language with 10 inventories.

In [6]:
print(subset[subset["InventoryID"] == 10]['LanguageName'].values[0])

Iron Ossetic


Going back to the summary statistics table, another entry accounts for many rows: `kham1282`, the most frequent value of the `Glottocode` variable.

In [7]:
data.loc[data['Glottocode'] == 'kham1282']['InventoryID'].value_counts()

InventoryID
2519    133
2489     96
2591     78
2525     77
2587     64
2327     62
2328     61
2600     51
Name: count, dtype: int64

We can look up the corresponding value(s) in the `LanguageName` variable.

In [8]:
data.loc[data['Glottocode'] == 'kham1282']['LanguageName'].unique().tolist()

['Rgyalthang Tibetan',
 'Brag-g.yab Tibetan',
 'Nangchenpa Tibetan',
 'Soghpo Tibetan',
 'Kami Tibetan',
 'Sangdam Tibetan',
 'Dongwang Tibetan',
 'Kham Tibetan']

Doing the same with `ISO6393`, we get:

In [9]:
data.loc[data['Glottocode'] == 'kham1282']['ISO6393'].unique().tolist()

['khg']

So `LanguageName` makes finer distinctions between "languages" than the other two identifiers. There are also discrepancies between the different classification schemes. On the link between inventories and `Glottocode`, for example, the documentation says:

> Every phonological inventory in PHOIBLE has a unique numeric inventory ID. Since most PHOIBLE inventories (aside from some UPSID or SPA ones, as mentioned above) are based on a single document, it is fairly straightforward to link each PHOIBLE inventory to the Glottolog, which provides links between linguistic description documents and unique identifiers for dialects, languages, and groupings of dialects and languages at various levels.

Furthermore, the documentation notes differences between `Glottocode` and `ISO6393`: for example, some languages with a Glottocode have no ISO code, and a single ISO code may be shared by several Glottocodes.
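Both kinds of mismatch can be detected with a few lines of pandas. The following is a minimal sketch on a toy table with hypothetical codes (`aaaa0001`, `aaa` and so on are invented for illustration and are not PHOIBLE entries). The variable names `no_iso` and `shared_iso` are likewise illustrative.

```python
import pandas as pd

# Toy rows with hypothetical codes (not real PHOIBLE entries).
toy = pd.DataFrame(
    {
        "Glottocode": ["aaaa0001", "aaaa0002", "bbbb0001", "cccc0001"],
        "ISO6393": ["aaa", "aaa", "bbb", None],
    }
)

# Glottocodes that have no ISO 639-3 code at all.
no_iso = toy.loc[toy["ISO6393"].isna(), "Glottocode"].tolist()

# ISO codes shared by more than one Glottocode.
# groupby drops missing keys by default, so rows without an ISO code are ignored.
codes_per_iso = toy.groupby("ISO6393")["Glottocode"].nunique()
shared_iso = codes_per_iso[codes_per_iso > 1].index.tolist()

print(no_iso)
print(shared_iso)
```

On the real data, a placeholder code such as `mis` would presumably need to be recoded as missing first, otherwise it would show up as one heavily shared ISO code.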

Grouping by `ISO6393`, we can see how many inventories each code covers, together with the `Glottocode` and `LanguageName` labels that correspond to it.

In [22]:
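# For each ISO 639-3 code, count its inventories and list the
# Glottocodes and language names attached to it.
result = (
    data
    .groupby("ISO6393", as_index=False)
    .agg(
        Inventories=("InventoryID", "nunique"),
        Glottocode=("Glottocode", lambda x: ", ".join(pd.unique(x.dropna()))),
        Names=("LanguageName", lambda x: ", ".join(pd.unique(x.dropna())))
    )
    .query("Inventories >= 5")  # keep only codes with at least 5 inventories
    .sort_values("Inventories", ascending=False)
)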

In [23]:
#| echo: false
#| label: tbl-discrepancies
#| tbl-cap: "Discrepancies between different language classifications"

Markdown(result.to_markdown(index=False))

| ISO6393   |   Inventories | Glottocode                                                                                                                                                                                                                                                                   | Names                                                                                                                                                                                                                                                                                                                                                  |
|:----------|--------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| mis       |            30 | pisa1245, lizu1234, east2773, zhon1235, vach1239, fore1274, mink1237, guwa1244, mith1236, cola1237, yari1243, west2443, djad1246, kera1256, lowe1402, ngin1247, gudj1237, kawa1290, wala1263, tyan1235, luth1234, mbiy1238, ngko1236, yadh1237, bula1255, yulp1239, sout2770 | Pisamira, Lizu, Dolakha Newar, Zhongu Tibetan, Eastern Khanty, Forest Nenets, Minkin, Guwar, Djindewal, Mithaka, Kolakngat, Yari-Yari, East Djadjawurung, Jardwadjali, Keramin, Ngayawang, Ngintait, Gudjal, Ogh Awarrangg, Ogh Unyjan, Walangama, Thaynakwithi, Luthigh, Mbiywom, Ngkoth, Yadhaykenu, Bularnu, Yulparija, West Djadjawurung, Ngunawal |
| oss       |            12 | osse1243, digo1242                                                                                                                                                                                                                                                           | Ossetian, Iron Ossetic, Digor Ossetic                                                                                                                                                                                                                                                                                                                  |
| bzr       |            10 | biri1256                                                                                                                                                                                                                                                                     | Barna, Biri, Garingbal, Miyan, Wiri, Yambina, Yangga, Yilba, Yuwi, Wangan                                                                                                                                                                                                                                                                              |
| eng       |             9 | stan1293                                                                                                                                                                                                                                                                     | English, English (American), American English, English (Australian), English (British), English (New Zealand)                                                                                                                                                                                                                                          |
| khg       |             8 | kham1282                                                                                                                                                                                                                                                                     | Rgyalthang Tibetan, Brag-g.yab Tibetan, Nangchenpa Tibetan, Soghpo Tibetan, Kami Tibetan, Sangdam Tibetan, Dongwang Tibetan, Kham Tibetan                                                                                                                                                                                                              |
| nld       |             8 | dutc1256                                                                                                                                                                                                                                                                     | Dutch                                                                                                                                                                                                                                                                                                                                                  |
| eus       |             8 | basq1248, basq1250                                                                                                                                                                                                                                                           | Basque, BASQUE, Zuberoan Basque                                                                                                                                                                                                                                                                                                                        |
| mhr       |             7 | east2328                                                                                                                                                                                                                                                                     | Cheremis, MARI, Meadow Mari, Eastern Mari                                                                                                                                                                                                                                                                                                              |
| gup       |             7 | gunw1252, gund1246, gune1238, mura1269, guma1252, naia1238                                                                                                                                                                                                                   | Gunwinggu, Gun-Dedjnjenghmi, Gun-Djeihmi, Kune, Kuninjku, Kunwinjku, Mayali                                                                                                                                                                                                                                                                            |
| kca       |             6 | khan1273                                                                                                                                                                                                                                                                     | Ostyak, KHANTY, Eastern Khanty, Northern Khanty                                                                                                                                                                                                                                                                                                        |
| gwn       |             6 | gwan1268                                                                                                                                                                                                                                                                     | Gwandara (Karshi), Gwandara (Cancara), Gwandara (Toni), Gwandara (Gitata), Gwandara (Koro), Gwandara (Nimbia)                                                                                                                                                                                                                                          |
| nyf       |             6 | giry1241, kamb1298, kaum1238                                                                                                                                                                                                                                                 | Giryama, Jiβana, Kambe, Kauma, Raβai, Reβe                                                                                                                                                                                                                                                                                                             |
| nys       |             6 | nyun1247, kani1276                                                                                                                                                                                                                                                           | Balardung, Kaniyang, Minang, Wiilman, Wudjari, Yuwat                                                                                                                                                                                                                                                                                                   |
| sgw       |             6 | seba1251, ezha1238, chah1248, gume1239                                                                                                                                                                                                                                       | Muher, Ezha, Chaha, Gumer, Gura, Gyeto                                                                                                                                                                                                                                                                                                                 |
| lzz       |             6 | lazz1240                                                                                                                                                                                                                                                                     | Laz                                                                                                                                                                                                                                                                                                                                                    |
| lit       |             6 | lith1251                                                                                                                                                                                                                                                                     | Lithuanian, LITHUANIAN                                                                                                                                                                                                                                                                                                                                 |
| spa       |             5 | stan1288                                                                                                                                                                                                                                                                     | Spanish, SPANISH                                                                                                                                                                                                                                                                                                                                       |
| tts       |             5 | nort2741                                                                                                                                                                                                                                                                     | Northeastern Thai                                                                                                                                                                                                                                                                                                                                      |
| udm       |             5 | udmu1245, bese1243                                                                                                                                                                                                                                                           | Udmurt, Beserman                                                                                                                                                                                                                                                                                                                                       |
| unr       |             5 | mund1320                                                                                                                                                                                                                                                                     | Mundari, MUNDARI, Bhumij                                                                                                                                                                                                                                                                                                                               |
| xtc       |             5 | katc1250, katc1249                                                                                                                                                                                                                                                           | Katcha, KADUGLI, Kadugli (Kadugli), Kadugli (Miri), Kadugli (Katcha)                                                                                                                                                                                                                                                                                   |
| ben       |             5 | beng1280                                                                                                                                                                                                                                                                     | Bengali, BENGALI                                                                                                                                                                                                                                                                                                                                       |
| khr       |             5 | khar1287                                                                                                                                                                                                                                                                     | Kharia, KHARIA                                                                                                                                                                                                                                                                                                                                         |
| bod       |             5 | tibe1272                                                                                                                                                                                                                                                                     | Tibetan, Lhasa Tibetan, Drokpa Tibetan, Dingri Tibetan, Shigatse Tibetan                                                                                                                                                                                                                                                                               |
| kbd       |             5 | kaba1278                                                                                                                                                                                                                                                                     | Kabardian, KABARDIAN                                                                                                                                                                                                                                                                                                                                   |
| hin       |             5 | hind1269                                                                                                                                                                                                                                                                     | Hindi-Urdu, HINDI-URDU, Hindi                                                                                                                                                                                                                                                                                                                          |
| hau       |             5 | haus1257                                                                                                                                                                                                                                                                     | Hausa, HAUSA                                                                                                                                                                                                                                                                                                                                           |
| gnl       |             5 | gang1268                                                                                                                                                                                                                                                                     | Barada, Gabalbara, Gangulu, Ganulu, Yetimarala                                                                                                                                                                                                                                                                                                         |
| gle       |             5 | iris1253                                                                                                                                                                                                                                                                     | Irish Gaelic, IRISH, Irish                                                                                                                                                                                                                                                                                                                             |
| ell       |             5 | mode1248                                                                                                                                                                                                                                                                     | Modern Greek, GREEK, Greek                                                                                                                                                                                                                                                                                                                             |
| che       |             5 | chec1245                                                                                                                                                                                                                                                                     | Chechen                                                                                                                                                                                                                                                                                                                                                |
| car       |             5 | gali1262                                                                                                                                                                                                                                                                     | Carib, CARIB                                                                                                                                                                                                                                                                                                                                           |
| bym       |             5 | bidy1243                                                                                                                                                                                                                                                                     | Bidyara, Dharawala, Wadjabangay, Yandjibara, Yiningay                                                                                                                                                                                                                                                                                                  |
| bsk       |             5 | buru1296                                                                                                                                                                                                                                                                     | Burushaski, BURUSHASKI                                                                                                                                                                                                                                                                                                                                 |
| yrk       |             5 | nene1249                                                                                                                                                                                                                                                                     | Yurak, NENETS, Tundra Nenets                                                                                                                                                                                                                                                                                                                           |
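Several rows of the table list what looks like the same name in different letter case (for example `Basque` and `BASQUE`, or `Spanish` and `SPANISH`), which inflates the number of distinct `LanguageName` values. The following is a minimal sketch of a case-insensitive count, assuming that letter case carries no linguistic meaning here. The handful of names is copied from the table above, and the variable names are illustrative.

```python
import pandas as pd

# A few name variants copied from the table above.
names = pd.Series(
    ["Basque", "BASQUE", "Zuberoan Basque", "Spanish", "SPANISH"],
    name="LanguageName",
)

# Case-insensitive key (assumption: "BASQUE" and "Basque" are the same label).
normalized = names.str.casefold()

n_raw = names.nunique()
n_normalized = normalized.nunique()
print(n_raw, n_normalized)
```

Case folding only merges spelling variants. It does not decide whether `Zuberoan Basque` should count as a separate language, which remains a question about the classification scheme.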

## Testing linguistic universals

### Q4: identify the three most frequent vowels

### Q5: compute the proportion of languages that contain vowels from the basic triangle