## Parse SAPhon inventory spreadsheets to look for errors

Use this notebook to process a `.tsv` file downloaded from the SAPhon [Tupian Nasal Typology Input](https://docs.google.com/spreadsheets/d/1dvXFvLIV4y84CglgjAl-ZVb09IuGazs1SzFO_UJpmnI/edit#gid=1164878023) spreadsheet. Warning messages are emitted for spreadsheet values that do not validate. Make corrections in the spreadsheet, then download a new `.tsv` file and try again.

In the terminology of this notebook, each language sheet consists of multiple documents, which appear as consecutive rows in the sheet. A row labelled 'Doctype' introduces a new document. Each language sheet must have exactly one `synthesis` document and may have one or more `ref` documents.
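The document layout described above can be sketched as follows. This is only an illustration, not the code used by the `spreadsheet` module: the function names `split_documents` and `doctype_counts` are hypothetical, and the sketch assumes that the 'Doctype' label sits in the first column and the document type (`synthesis` or `ref`) in the second.

```python
import csv
import io


def split_documents(rows):
    """Group rows into documents; a row whose first cell is 'Doctype' starts a new one."""
    documents = []
    for row in rows:
        if row and row[0].strip() == 'Doctype':
            documents.append([row])
        elif documents:
            documents[-1].append(row)
        # Rows that precede the first 'Doctype' row are ignored in this sketch.
    return documents


def doctype_counts(documents):
    """Count documents by type (assumed to be the second cell of the 'Doctype' row)."""
    counts = {}
    for doc in documents:
        header = doc[0]
        doctype = header[1].strip() if len(header) > 1 else ''
        counts[doctype] = counts.get(doctype, 0) + 1
    return counts


# Made-up sample content standing in for a downloaded .tsv file.
sample = "Doctype\tsynthesis\nNatural Classes\tC\nDoctype\tref\nNatural Classes\tV\n"
rows = list(csv.reader(io.StringIO(sample), delimiter='\t'))
docs = split_documents(rows)
counts = doctype_counts(docs)

if counts.get('synthesis', 0) != 1:
    print("warning: expected exactly one synthesis document")
```

With the sample above, the rows split into two documents, one `synthesis` and one `ref`, so no warning is printed.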

To run a cell in this notebook, click on it, then click the `Run` button or press `Ctrl`+`Enter` on the keyboard. Execute the cells in order from top to bottom.

## Getting started

When you are ready to validate data for a language:

1. Visit the [input spreadsheet](https://docs.google.com/spreadsheets/d/1dvXFvLIV4y84CglgjAl-ZVb09IuGazs1SzFO_UJpmnI/edit#gid=1164878023).
1. Select the tab for the language you want to check.
1. Click `File` > `Download` > `Tab Separated Values (.tsv)`.
1. (Optional) Rename the downloaded file to a simple form that matches the language name. It is fine to omit diacritics from the filename, as the name won't be reused outside of this notebook.
1. Change the value of the `infile` variable in the next cell to match the filename of the `.tsv` file.

In [None]:
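import os
from pathlib import Path

import spreadsheet  # provides the validation functions used in this notebook

# Folder in which to look for the downloaded .tsv file.
downloads = os.path.join(str(Path.home()), 'Downloads')

# Filename of the .tsv file for the language to check.
infile = 'tapiete.tsv'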
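
If the filename does not match the downloaded file, reading it will fail. As an optional convenience, a check along the following lines can confirm that the file is present before going further. This is a minimal sketch that is not part of the `spreadsheet` module, and the helper name `find_tsv` is hypothetical.

```python
from pathlib import Path


def find_tsv(directory, filename):
    """Return the path to `filename` in `directory`, or raise with a list of candidates."""
    directory = Path(directory)
    path = directory / filename
    if path.is_file():
        return path
    # List the .tsv files that are present, to help spot a misspelled filename.
    available = []
    if directory.is_dir():
        available = sorted(p.name for p in directory.glob('*.tsv'))
    raise FileNotFoundError(f"{path} not found. Available .tsv files: {available}")


# Example call, using the variables defined in the cell above:
# find_tsv(downloads, infile)
```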

## Read the language file

The next cell reads the file into a dictionary named `lang`. The dictionary contains the content of the `synthesis` and `ref` documents.

In [None]:
lang = spreadsheet.read_lang(os.path.join(downloads, infile))

Check the `Natural Classes`/`Segments` fields. The return value `natclasses` is a dictionary that maps each `synthesis`/`ref` document to the structured natural class data that can be referenced in that document's processes. The `flatnatclasses` variable is a dictionary that contains, for each document, a flat list of all the symbols used in its natural class data. The third return value, `catsymb`, is passed to `check_procs` later in the notebook.

In [None]:
natclasses, flatnatclasses, catsymb = spreadsheet.check_natclasses(lang)

Check the `allophones` field for errors. The target phone for each allophone must match one of the natural class symbols.

In [None]:
allophones, alloprocs = spreadsheet.check_allophones(lang, flatnatclasses)
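
The rule that each allophone target must match a natural class symbol can be illustrated with a small standalone sketch. The data below is made up, the function name `unmatched_targets` is hypothetical, and the real structures returned by the `spreadsheet` module may be shaped differently. The sketch assumes only that symbols and targets are available as lists keyed by document.

```python
def unmatched_targets(targets, natclass_symbols):
    """Return the targets that do not appear among the natural class symbols."""
    known = set(natclass_symbols)
    return [t for t in targets if t not in known]


# Made-up example data, keyed by document name.
flat_symbols = {'synthesis': ['p', 't', 'k', 'm', 'n']}
targets = {'synthesis': ['p', 'b', 'm']}

problems = {}
for doc, doc_targets in targets.items():
    problems[doc] = unmatched_targets(doc_targets, flat_symbols.get(doc, []))
    for target in problems[doc]:
        print(f"warning: {doc}: allophone target {target!r} is not a natural class symbol")
```

Here `b` is reported because it is not among the symbols listed for the `synthesis` document.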

Validate the `Morpheme IDs` field. The return value `morph_id_map` is a dictionary that maps each `synthesis`/`ref` document to the set of morpheme IDs that can be referenced in the processes of that document.

In [None]:
morph_id_map = spreadsheet.check_morpheme_ids(lang)

Check the processes of the language. This check uses the values returned by the previous cells, so run those cells first.

In [None]:
spreadsheet.check_procs(lang, flatnatclasses, morph_id_map, catsymb, alloprocs)