## Parse SAPhon inventory spreadsheets to look for errors

Use this notebook to process a `.tsv` file downloaded from the SAPhon [Tupian Nasal Typology Input](https://docs.google.com/spreadsheets/d/1dvXFvLIV4y84CglgjAl-ZVb09IuGazs1SzFO_UJpmnI/edit#gid=1164878023) spreadsheet. Warning messages are emitted for spreadsheet values that do not validate. Make corrections in the spreadsheet, then download a new `.tsv` file and try again.

In the terminology of this notebook, each language sheet consists of multiple documents, which appear as consecutive rows in the spreadsheet. A row labelled `Doctype` introduces a new document. Each language sheet must have exactly one `synthesis` document and may have one or more `ref` documents.
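
The document layout described above can be illustrated with a minimal sketch. It is not the `spreadsheet` module's implementation: the helper `split_documents`, the sample rows, and the assumption that the first column holds the row label and the second column holds the document type are all hypothetical.

```python
import csv
import io

def split_documents(rows, marker="Doctype"):
    """Group consecutive rows into documents (hypothetical helper).

    A row whose first cell equals `marker` starts a new document.
    Rows that appear before the first marker are ignored in this sketch.
    """
    docs = []
    for row in rows:
        if row and row[0].strip() == marker:
            docs.append([row])
        elif docs:
            docs[-1].append(row)
    return docs

# Made-up sample in the assumed layout: label in column 1, value in column 2.
sample = (
    "Doctype\tsynthesis\n"
    "Natural Classes\tT, p, t\n"
    "Doctype\tref\n"
    "Natural Classes\tT, p\n"
)
rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))
docs = split_documents(rows)
doctypes = [doc[0][1] for doc in docs]

# A language sheet must contain exactly one synthesis document.
has_one_synthesis = doctypes.count("synthesis") == 1
```

Under these assumptions, `doctypes` lists the document types in sheet order, so the "exactly one `synthesis`" rule reduces to a count.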

To run a cell in this notebook, click on it, then click the `Run` button or press `Ctrl`+`Enter`. Execute the cells in order from top to bottom.

## Getting started

When you are ready to validate data for a language:

1. Visit the [input spreadsheet](https://docs.google.com/spreadsheets/d/1dvXFvLIV4y84CglgjAl-ZVb09IuGazs1SzFO_UJpmnI/edit#gid=1164878023).
1. Select the tab for the language you want to check.
1. Click `File` > `Download` > `Tab Separated Values (.tsv)`.
1. (Optional) Rename the downloaded file to a simple form that matches the language name. It is fine to omit diacritics from the filename, as the name won't be reused outside of this notebook.
1. Change the value of the `infile` variable in the next cell to match the filename of the `.tsv` file.

In [7]:
import spreadsheet

infile = 'tapiete.tsv'

## Read the language file

The next cell reads the file into a dictionary named `lang`. The dictionary contains the content of the `synthesis` and `ref` documents.

In [8]:
lang = spreadsheet.read_lang(infile)

No match for field 'optioanlity'. Parse possibly incorrect.


Check the `Natural Classes`/`Segments` fields. The return value `natclasses` is a dictionary that maps each `synthesis`/`ref` document to its structured natural class data, which can be referenced in the processes of that document. The `flatnatclasses` variable is a dictionary that maps each document to a flat list of all the symbols used in its natural class data.
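
The relationship between the structured and flat forms might look like the sketch below. The exact shape of `natclasses` is not shown in this notebook, so the nested dictionary (class symbol mapped to its member segments) and the helper `flatten_natclasses` are assumptions made for illustration only.

```python
def flatten_natclasses(natclasses):
    """Flatten per-document natural class data (hypothetical helper).

    Assumes each document maps a class symbol to a list of member segments.
    The flat list holds each class symbol followed by its members.
    """
    flat = {}
    for doc, classes in natclasses.items():
        symbols = []
        for class_symbol, segments in classes.items():
            symbols.append(class_symbol)
            symbols.extend(segments)
        flat[doc] = symbols
    return flat

# Made-up example data in the assumed shape.
natclasses_example = {"synthesis": {"T": ["p", "t", "k"], "N": ["m", "n"]}}
flat_example = flatten_natclasses(natclasses_example)
```

The ordering chosen here (class symbol first, then its segments) mirrors the symbol lists printed in the warning messages of later cells.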

In [9]:
natclasses, flatnatclasses = spreadsheet.check_natclasses(lang)

Natural Class symbol "ND" not recognized.

Character ŋg (b'\xc5\x8bg') is not in ipa-table.txt.

Character ŋg (b'\xc5\x8bg') is not in ipa-table.txt.



Check the `allophones` field for errors. The target phone for each allophone must match one of the natural class symbols.
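
As a rough sketch of the membership rule, the check amounts to looking up each allophone target in the flat symbol list of the same document. The helper `unknown_allophone_targets` and the sample values are hypothetical; the real `spreadsheet.check_allophones` may parse and report differently.

```python
def unknown_allophone_targets(targets, symbols):
    """Return the targets that are not among the known symbols (hypothetical helper)."""
    known = set(symbols)
    return [target for target in targets if target not in known]

# Made-up subset of a document's natural class symbols and segments.
symbols = ["N", "m", "n", "ŋ"]

# 'm mb' is treated as one unsplit target, so it cannot match any symbol.
missing = unknown_allophone_targets(["m", "ɲ", "m mb"], symbols)
```

A target such as `'m mb'` fails this kind of lookup even though `m` is a valid symbol, which suggests checking the spreadsheet cell for a missing separator.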

In [11]:
allophones = spreadsheet.check_allophones(lang, flatnatclasses)

Allophone 'ɲ' (b'\xc9\xb2') not in Natural Classes/Segments 'T, p, t, tʃ, k, kʷ, ʔ, Z, dʒ, S, s, ʃ, h, N, m, n, ŋ, R, ɾ, w, V, i, e, ɨ, a, u, o, Ṽ, ĩ, ẽ, ɨ̃, ã, ũ, õ' for synthesis

Allophone 'm mb' (b'm mb') not in Natural Classes/Segments 'T, p, t, tʃ, k, kʷ, ʔ, Z, dʒ, S, s, ʃ, h, N, m, n, ŋ, R, ɾ, w, V, i, e, ɨ, a, u, o, Ṽ, ĩ, ẽ, ɨ̃, ã, ũ, õ' for synthesis

Expected 2-ple or 4-ple for allophones. Got 3-ple '['m mb', '_V', 'LO:ND']' for synthesis

Allophone 'ŋɡ' (b'\xc5\x8b\xc9\xa1') not in Natural Classes/Segments 'T, p, t, tʃ, k, kʷ, ʔ, Z, dʒ, S, s, ʃ, h, mb, nd, ŋg, R, ɾ, w, V, i, e, ɨ, a, u, o, Ṽ, ĩ, ẽ, ɨ̃, ã, ũ, õ' for González, Hebe Alicia. 2005. A grammar of Tapiete (Tupi-Guarani). University of Pittsburgh. González, Hebe Alicia. 2008. Una aproximación a la fonología del tapiete. LIAMES (8). pp. 7-43.

proc_name 'LDNH:R' does not match available names 'LO:ND, LDNH:L, LN:ɾ, LN:NT, MPP, fronting, LN, MPP, palatalization, spirantization' for González, Hebe Al
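
The byte strings printed in these warnings help diagnose characters that look alike. In the `ŋɡ` warning above, the allophone's bytes end in `\xc9\xa1`, while the segment list appears to use an ASCII `g`. The sketch below, using only the standard library, shows one way to compare two such strings; the helper `describe` is hypothetical.

```python
import unicodedata

def describe(text):
    """List the code point and Unicode name of each character (hypothetical helper)."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]

allophone = "\u014b\u0261"  # eng followed by IPA script g
segment = "\u014bg"         # eng followed by ASCII g

same = allophone == segment            # the strings differ despite looking alike
allophone_bytes = allophone.encode("utf-8")
allophone_info = describe(allophone)
segment_info = describe(segment)
```

Comparing `allophone_info` with `segment_info` shows which character differs, so the spreadsheet can be corrected to use one form consistently.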

Validate the `Morpheme IDs` field. The return value `morph_id_map` is a dictionary that maps each `synthesis`/`ref` document to the morpheme ids that can be referenced in the processes of that document.
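
The field-count check for a morpheme entry can be sketched as follows. This is an illustration under assumptions, not the module's code: the helper `check_arity`, the comma separator, and the whitespace stripping are all hypothetical, and the meaning of the five expected fields is not specified here.

```python
def check_arity(entry, expected=5, sep=","):
    """Return a warning string if `entry` has the wrong number of fields, else None.

    Hypothetical helper: assumes fields are separated by `sep` and strips
    surrounding whitespace from each field before counting.
    """
    fields = [field.strip() for field in entry.split(sep)]
    if len(fields) != expected:
        return f"Expected {expected}-ple for morph_id. Got {len(fields)}-ple '{fields}'"
    return None

# A four-field entry triggers a warning; a five-field entry passes.
msg = check_arity("mɨ, prefix, mɨ, CAUS")
ok = check_arity("a, b, c, d, e")
```

Unlike the warnings shown in this notebook, this sketch strips whitespace from each field, so `' prefix'` would be reported as `'prefix'`.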

In [12]:
morph_id_map = spreadsheet.check_morpheme_ids(lang)

Expected 5-ple for morph_id. Got 4-ple '['mɨ', ' prefix', ' mɨ', ' CAUS']' for synthesis

Expected 5-ple for morph_id. Got 4-ple '[' kɨɾɨ', ' root', ' ŋɡɨnɨ', ' ticklish']' for synthesis

Expected 5-ple for morph_id. Got 4-ple '[' kɨɾɨ', ' root', ' ŋɡɨnɨ', ' ticklish']' for González, Hebe Alicia. 2005. A grammar of Tapiete (Tupi-Guarani). University of Pittsburgh. González, Hebe Alicia. 2008. Una aproximación a la fonología del tapiete. LIAMES (8). pp. 7-43.

Display `morph_id_map` to inspect the morpheme ids collected for each document.



In [13]:
morph_id_map

{'synthesis': ['mɨ',
  ' kɨdʒe',
  ' kʷaɾu',
  ' kɨɾɨ',
  ' kaɾu',
  ' kʷeɾa',
  ' kɨʔa',
  ' kuʔi',
  ' tʃe.sleep',
  ' tʃe.enter',
  ' soso'],
 'González, Hebe Alicia. 2005. A grammar of Tapiete (Tupi-Guarani). University of Pittsburgh. González, Hebe Alicia. 2008. Una aproximación a la fonología del tapiete. LIAMES (8). pp. 7-43.': ['mbɨ',
  ' kɨdʒe',
  ' kʷaɾu',
  ' kɨɾɨ',
  ' kaɾu',
  ' kʷeɾa',
  ' kɨʔa',
  ' kuʔi',
  ' tʃe.sleep',
  ' tʃe.enter',
  ' soso']}

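Check the processes of the language. Process fields are validated against the natural class symbols and morpheme ids collected above; for example, each `opacities` value must match a natural class symbol, a segment, or a morpheme id of the same document.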

In [14]:
spreadsheet.check_procs(lang, flatnatclasses, morph_id_map)

opacities value 'Nothing' (b'Nothing') not in Natural Classes/Segments/Morpheme IDs 'T, p, t, tʃ, k, kʷ, ʔ, Z, dʒ, S, s, ʃ, h, N, m, n, ŋ, R, ɾ, w, V, i, e, ɨ, a, u, o, Ṽ, ĩ, ẽ, ɨ̃, ã, ũ, õ, mɨ,  kɨdʒe,  kʷaɾu,  kɨɾɨ,  kaɾu,  kʷeɾa,  kɨʔa,  kuʔi,  tʃe.sleep,  tʃe.enter,  soso' for synthesis

