# Computational Methods for Linguistic Typology


## Part 2 - The typological data cycle

Matthew J Carroll

Australian Linguistic Society 2022 CoEDL Masterclasses

<center>
<img src="Images/datascience.jpg" width="1000"> 

<center>
<img src="Images/typscience.jpg" width="1000"> 

## Setting up a work environment

- Folder structure w/backup & version control
    - [Github](https://github.com/) 
    - [Cloudstor](https://cloudstor.aarnet.edu.au/)
    - Dropbox
    - University server
- Documentation process
- Programming environment (Python/R)
    - Jupyter Notebook
- Writing environment
    - [Overleaf](https://www.overleaf.com/) (Latex)
    - Google Docs 
    - Local


# Identifying a domain

<center>
<img src="Images/typscience.jpg" width="1000"> 

- Literature review
- Defining an informal 'base' 
- Anticipating problems

## Verbose Exponence

|Number|Form|
|-----------|-----------|
|SG| *wɐ-mdəd-ə* |
|DU| *nə-mdəd-anɛ*|
|PL|*nə-mdəd-ə*|

1 PRS Partial Paradigm of yəmdədə 'to sit' in Yei

**Verbose exponence**: *The expression of some category uses more component elements than strictly necessary*

# Data acquisition

<center>
<img src="Images/typscience.jpg" width="1000"> 

<center>
    <img src='Images/data.png' width=700>


**How do we know where to look?**

<img src="Images/pacific_central.jpg" width=15000>

- Inferring possible candidate languages from typological knowledge
- Conferring with colleagues

<img src="Images/library.jpg" class="center">

**How can we speed up the grammar reading process?**

- About 1 week to read a grammar sufficiently
- About 25% of grammars are relevant to the general topic
- < 10% of those fill the distant cells of the typology
- So roughly 40 weeks of reading to get a single example for 1 cell in the typology
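
The 40-week figure follows from the three numbers above. A minimal sketch of the arithmetic, treating 25% and 10% as exact rates (the slide gives 10% as an upper bound, so 40 weeks is a lower bound); the variable names are illustrative:

```python
# Expected reading time to find one example for a distant cell,
# using the rough rates quoted in the list above.
weeks_per_grammar = 1      # time to read one grammar sufficiently
share_relevant = 0.25      # grammars relevant to the general topic
share_distant = 0.10       # of those, grammars filling a distant cell

# Probability that any one grammar yields a usable example.
hit_rate = share_relevant * share_distant

# Expected number of grammars (and hence weeks) per example.
grammars_needed = 1 / hit_rate
weeks_needed = grammars_needed * weeks_per_grammar

print(round(weeks_needed))
```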

## Data acquisition: gather data

![](Images/babel2.png)

## Data acquisition: explore the data

![](Images/fedden.png)

![](Images/newscan.png)

![](Images/oldscan.png)

![](Images/oldscanbad.png)

![](Images/oldscan3.png)

Different versions of the same document



Filename: Author-TextIDYear-Version.pdf

- adelaar-amalgamation-malagasy2010.pdf
- adelaar-austronesian2013.pdf
- adelaar-austronesian-asia-madagascar2005.pdf
- adelaar-austronesian-asia-madagascar2005v2.pdf
- adelaar-austronesian-historical2005.pdf
- adelaar-austronesian-historical2005-o.pdf
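
The naming convention can be unpacked mechanically. The sketch below is one possible reading of `Author-TextIDYear-Version.pdf`, not the collection's official scheme: it assumes the author part contains no hyphen and that the first four-digit run is the year. `FILENAME_RE` and `parse_filename` are hypothetical names.

```python
import re

# Assumed pattern for the convention Author-TextIDYear-Version.pdf:
# author = text up to the first hyphen, year = first four-digit run,
# version = whatever follows the year (may be empty).
FILENAME_RE = re.compile(
    r"^(?P<author>[^-]+)-(?P<textid>.+?)(?P<year>\d{4})(?P<version>.*)\.pdf$"
)

def parse_filename(name):
    """Split a PDF filename into its parts, or return None if it does not fit."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    parts["version"] = parts["version"].lstrip("-")  # '-o' becomes 'o'
    return parts

print(parse_filename("adelaar-austronesian-asia-madagascar2005v2.pdf"))
```

Names that do not fit the assumed pattern return `None`, which makes it easy to list the files that need renaming by hand.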

#### Data understanding report

- More than 10,000 PDFs
    - The 'best' reference for a language
        - Grammars
        - Other descriptive materials
- Majority OCR’d
    - Already searchable
- Varying quality and sources
    - Fairly messy and difficult to access
- Lots of duplicate files
    - Best files have the longest name

## Data acquisition: update the data

Before we can search the data we need to:

Step 1 - Clean up the PDFs
- Sort the collection
- Remove bad candidates for searching

Step 2 - Extract the text
- Existing tool pdftotext.exe


### Step 1 - Clean the data

How do we handle repeated files?

Filename: Author-TextIDYear-Version.pdf

- adelaar-amalgamation-malagasy2010.pdf
- adelaar-austronesian2013.pdf
- adelaar-austronesian-asia-madagascar2005.pdf
- adelaar-austronesian-asia-madagascar2005v2.pdf
- adelaar-austronesian-historical2005.pdf
- adelaar-austronesian-historical2005-o.pdf

Import the packages we need:

In [7]:
import os
import re
import fnmatch
from PyPDF2 import PdfFileReader

Our folder containing pdfs:

In [8]:
pdfdir = os.listdir('./grammars/pdfs/')

We want the longest file names:

Find files whose name (without `.pdf`) is the start of another file's name:

In [9]:
subsetlst = []

for file in pdfdir:
    filestring = re.escape(file[0:-4]) + r'.+\.pdf'  # escape the name so characters like '.' or '(' match literally
    regex = re.compile(filestring) # <filename>.+\.pdf
    for file2 in pdfdir:
        if regex.match(file2):
            subsetlst.append(file) 

In [10]:
len(subsetlst)

9

Another method:

In [None]:
subsetlst2 = []

for file in pdfdir:
    filestring = file[0:-4] 
    for file2 in pdfdir:
        file2string = file2[0:-4]
        if filestring == file2string: 
            pass
        elif filestring in file2string: 
            subsetlst2.append(file)  

Make a list of the files whose names aren't a substring of another file's name:

In [11]:
uniqlst = []     

for file in pdfdir:
    if file not in subsetlst:
        uniqlst.append(file)

In [12]:
len(uniqlst)

42
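
The same idea can be tried without a folder of PDFs. This is a minimal sketch on the six example filenames, assuming that a longer version always begins with the shorter file's name (so it uses `str.startswith` rather than a regex); `drop_shorter_versions` is a hypothetical helper, not part of the notebook.

```python
def drop_shorter_versions(filenames):
    """Keep only files whose stem is not a proper prefix of another file's stem."""
    stems = [name[:-4] for name in filenames]  # strip '.pdf'
    kept = []
    for name, stem in zip(filenames, stems):
        has_longer_version = any(
            other != stem and other.startswith(stem) for other in stems
        )
        if not has_longer_version:
            kept.append(name)
    return kept

examples = [
    "adelaar-amalgamation-malagasy2010.pdf",
    "adelaar-austronesian2013.pdf",
    "adelaar-austronesian-asia-madagascar2005.pdf",
    "adelaar-austronesian-asia-madagascar2005v2.pdf",
    "adelaar-austronesian-historical2005.pdf",
    "adelaar-austronesian-historical2005-o.pdf",
]

print(drop_shorter_versions(examples))
```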

### Quality analyser





- How do we work out what is a good quality pdf?
- High quality scans ≠ Good quality OCR


<img src="Images/goodscan.png" class="center">

How do we work out what is a good quality pdf?
- Clear text on white pages
- Prefer generated PDFs
- Very clean scans on text mode
- Lower file sizes

pdfqual:

- Quality score = file size in bytes / number of pages
- Asks for a threshold (20,000 default)
- Collects the files from the previous list of unique pdfs whose score is < threshold

In [13]:
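threshold = input('Enter quality threshold in bytes (Basic 20000): ')
threshold = float(threshold or 20000)  # fall back to the default if nothing is entered

cleanpdfs = []

for file in uniqlst:
    path = './grammars/pdfs/' + file
    pdf = PdfFileReader(path)
    pages = pdf.getNumPages()
    size = os.stat(path).st_size  # file size in bytes
    qual = size / pages  # bytes per page: lower suggests clean text rather than a heavy scan
    if qual < threshold:
        cleanpdfs.append((str(file),str(qual)))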

Enter quality threshold in bytes (Basic 20000): 20000




In [15]:
len(cleanpdfs)

16

### Step 2 - Scrape the text


`for /F "tokens=1" %%A in (.\statfile.tsv) do .\pdftotext.exe -enc UTF-8 ..\pdfs\%%A .\texts\%%A.txt`
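
The batch line above reads file names from `statfile.tsv` and calls `pdftotext.exe` on each one. A rough Python equivalent is sketched below. The directory paths, the function names, and the use of the `cleanpdfs` list as input are assumptions; only the `-enc UTF-8` option and the `<name>.pdf.txt` output naming are taken from the command above.

```python
import subprocess
from pathlib import Path

def build_pdftotext_command(pdf_path, txt_path, exe="pdftotext"):
    """Return the argument list for one pdftotext call with UTF-8 output."""
    return [exe, "-enc", "UTF-8", str(pdf_path), str(txt_path)]

def extract_all(filenames, pdf_dir="./grammars/pdfs", txt_dir="./grammars/texts"):
    """Run pdftotext on each named PDF, writing <name>.txt into txt_dir."""
    Path(txt_dir).mkdir(parents=True, exist_ok=True)
    for name in filenames:
        command = build_pdftotext_command(
            Path(pdf_dir) / name, Path(txt_dir) / (name + ".txt")
        )
        subprocess.run(command, check=True)  # raises if pdftotext fails

# Example (requires pdftotext on the PATH):
# extract_all([name for name, qual in cleanpdfs])
```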


### Gather the data

<center>
    <img src='Images/data.png' width=700>


*Search for the same gloss twice in the one word*

In [16]:
leipzig = ['1','2','3','1SG','1DU', '1DL','1PL','2SG','2DU','2DL','2PL','3SG','3DU','3DL','3PL','ABE','ABL','ABS','ACC','ACCOM','ACT','ADJ','ADE','ADM','ADV','AFF','AG','AGT','AGR','ALL','AL','ALLOC','ALIEN','AN','AND','ANT','ANTE','ANTIC','ANTIP','AP','AOR','APP','APL','APPL','APPR','APRX','ART','ASP','ASS','AT','ATT','AUD','AUG','AUX','BEN','CAP','CAU','CAUS','CENT','CF','CL','CNSQ','COL','COM','COMP','COMPL','CPL','CONC','COND','CONJ','CONT','CTN','CNTR','COP','COR','CRAS','CRS','DAT','DE','DEC','DECL','DEF','DEL','DEL','DEO','DEP','DES','DESI','DEST','DIM','DIR','DISJ','DIST','DISTR','DITR','DLM','DU','DUB','DUR','DY','DYAD','DYN','ELA','EMP','EPIS','ERG','ESS','EV','EVID','EVIT','EX','EXCL','EXCLAM','DUR','EXESS','EXH','EXIST','EXO','EXP','EXPER','FEM','FACT','FOC','FORM','FP','FR','FREQ','FUT','GEN','GER','GNO','GT','HAB','HBL','HEST','HIST','HOD','HON','HORT','HSY','HUM','HYP','IGNOR','ILL','IMM','IMP','IMPERF','IMPR','IMPREC','INCH','INCHO','INCEP','IND','INDF','NDEF','INE','INF','INFER','INFR','INEL','INS','INSTR','INT','NTR','INV','IO','IPFV','IRR','IS','ITER','JUS','LAT','LD','LOC','LOG','MASC','MID','MIM','MIR','MLT','MLTP','MOD','MOM','NEUT','NEG','NMZ','NZ','NOMI','NOM','NS','NUM','OBJ','OB','OBL','OBV','OPT','PART','PAS','PASS','PAT','PA','PAU','PEG','PER','PERF','PRF','PERS','PFV','PL','PLU','PLUR','PN','PRO','PO','POL','POS','POSS','POST','POSTE','POSTEL','POT','PP','PPFV','PPP','PR','PREC','PRED','PREP','PRESP','PRET','PRT','PRF','PERF','PRIV','PRS','PRES','PROB','PROG','PROH','PROL','PROP','PROS','PROSP','PRSP','PROT','PROX','PST','PT','PTCP','PCP','PTV','PURP','QUOT','REAL','REC','RECP','REF','RFR','REFL','REL','REM','REP','RES','RET','SBJ','SUB','SBJV','SJV','SE','SEM','SENS','SEQ','SG','SGV','SIM','SJV','SBJV','SPEC','SS','STAT','STV','SUB','SU','SUBR','SUBORD','SBRD','SR','SUBE','SUBL','SUC','SUP','SUPE','TAM','TEL','TEMP','TERM','TNS','TOP','TR','TRANS','TRANSL','TRI','TRN','TVF','UH','UND','UR','USIT','VB','VBZ','VD','VEN','VER','VIA','VIS',
'VI','VN','VOC','VOL','VT','WH']

In [17]:
with open("multiple_exponence.tsv", 'w', encoding='utf-8') as results:
    for lang in os.listdir('./grammars/texts/'):
        print(lang)  # progress feedback: one line per file searched
        filename = './grammars/texts/' + lang
        for gloss in leipzig:
            page = 1
            linenum = 0
            sep = r"[|\\.\-]"  # separators allowed between glosses: | \ . -
            g = re.escape(gloss)
            # rex1: the gloss, a separator, other material, a separator, the same gloss again
            rex1 = g + sep + r"[^\d\s]+" + sep + g
            # rex2: the gloss directly repeated (GLOSS-GLOSS), flanked by non-digit characters
            rex2 = r"[^\d]" + sep + "?" + g + r"\-" + g + sep + r"?[^\d]"
            regex = re.compile("(" + rex1 + ")|(" + rex2 + ")")  # to ignore case, add flags=re.IGNORECASE
            with open(filename, encoding='utf-8') as file:  # text was extracted with -enc UTF-8
                for line in file:
                    if '\x0c' in line:  # form feed marks a page break in the extracted text
                        page = page + 1
                        continue
                    linenum = linenum + 1
                    for thing in regex.findall(line):
                        output = str(gloss) + '\t' + lang + '\t' + str(linenum) + '\t' + str(page) + '\t' + str(line)
                        results.write(output)

quigley_awara2003_s.pdf.txt
pekkanen_clause-tatana1988_o.pdf.txt
gamudze_guhu-samane2013_s.pdf.txt
bowern_bardi2012v2.pdf.txt
mohamed_sihan2011.pdf.txt
stegeman-hunter_akawaio2014.pdf.txt
ballantyne_yapese2005_s.pdf.txt
naitoro_areare2013.pdf.txt
harvey_gaagudju1992v2.pdf.txt
foley_yimas1991.pdf.txt
fedden_mian2011v2.pdf.txt
hardin_maia2002.pdf.txt
campbell_giimbiyu2006_o.pdf.txt
aikhenvald_tariana1999_s.pdf.txt
thoron_kichua1886.pdf.txt
sanders-sanders_kamasau1987.pdf.txt
zeitoun_rukai2005.pdf.txt
toland-toland_karo-rawa1991.pdf.txt
vandervoort_koaia2000.pdf.txt
fleischmann-turpeinen_bine1977.pdf.txt
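
To see what the search pattern catches, it helps to run it on single lines. This is a minimal sketch that rebuilds the two sub-patterns from the search cell in raw-string form and applies them to gloss lines taken from the Bardi results; `SEP` and `find_double_gloss` are hypothetical names introduced here.

```python
import re

SEP = r"[|\\.\-]"  # separators allowed between glosses: | \ . -

def find_double_gloss(gloss, line):
    """Return the stretches of `line` where `gloss` occurs twice in one word."""
    g = re.escape(gloss)
    # GLOSS, separator, other material, separator, GLOSS
    rex1 = g + SEP + r"[^\d\s]+" + SEP + g
    # GLOSS-GLOSS directly repeated, flanked by non-digit characters
    rex2 = r"[^\d]" + SEP + "?" + g + r"\-" + g + SEP + r"?[^\d]"
    pattern = re.compile("(" + rex1 + ")|(" + rex2 + ")")
    return [m.group(0) for m in pattern.finditer(line)]

print(find_double_gloss("1", "1-PST-1AUG-do/say-CONT-REM.PST tide"))
print(find_double_gloss("1", "1SG-go-PST tide"))
```

Because the pattern only checks that the same label appears twice within one whitespace-free string, every hit still has to be checked by hand against the grammar.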


## Exploring the data

<center>
    <img src='Images/data.png' width=700>


### pandas

- Data analysis library for Python
- Roughly what 'dplyr' is for R users
- Like MS Excel, but for Python
- Uses dataframes for the manipulation and analysis of data


In [18]:
import pandas as pd

In [19]:
me_df = pd.read_csv("multiple_exponence.tsv",
                 sep='\t',
                 names=["Gloss", "File", "Line", "Page", "Match"])
 

In [20]:
me_df.head(20)

| |Gloss|File|Line|Page|Match|
|-----------|-----------|-----------|-----------|-----------|-----------|
|0|1|bowern_bardi2012v2.pdf.txt|28250|781|splash=TEMP 1-PST-1AUG-do/say-CONT-REM.PST sal...|
|1|1|bowern_bardi2012v2.pdf.txt|28261|782|1-PST-1AUG-do/say-CONT-REM.PST ﬁsh=TEMP one.place|
|2|1|bowern_bardi2012v2.pdf.txt|28275|782|1-PST-1AUG-wait.for-CONT-REM.PST tide|
|3|1|bowern_bardi2012v2.pdf.txt|28297|782|1-PST-1AUG-kill-CONT-REM.PST=3A.DO|
|4|1|bowern_bardi2012v2.pdf.txt|28303|782|Barda=gid a-ng-arr-a-na-n=irr away=TEMP 1-PST-...|
|5|1|bowern_bardi2012v2.pdf.txt|28366|783|1-FUT-1AUG-take-FUT=2M.DO off|
|6|1|bowern_bardi2012v2.pdf.txt|28383|784|2M 1-FUT-1AUG-take-FUT=2M.DO off go=REL away off|
|7|3|bowern_bardi2012v2.pdf.txt|4053|164|pur i-n-­cal=­cir look 3-TR-see-3A.IO|
|8|3|bowern_bardi2012v2.pdf.txt|6311|240|clean’im 3-PST-put-REM.PST-3A.DO this|
|9|3|bowern_bardi2012v2.pdf.txt|6467|245|3-PST-AUG-put-REM.PST-REL-INDF-3A.DO|


In [21]:
len(me_df)

1270

In [22]:
me_df.describe()

| |Line|Page|
|-----------|-----------|-----------|
|count|1270.0|1270.0|
|mean|15667.544882|401.680315|
|std|7531.868969|208.434678|
|min|105.0|8.0|
|25%|9524.5|244.25|
|50%|16008.0|397.0|
|75%|22272.25|550.0|
|max|33235.0|807.0|


In [24]:
len(me_df["File"].unique())


9

In [25]:
me_df.groupby('File').size()

File
aikhenvald_tariana1999_s.pdf.txt    119
bowern_bardi2012v2.pdf.txt          535
campbell_giimbiyu2006_o.pdf.txt      10
fedden_mian2011v2.pdf.txt           468
foley_yimas1991.pdf.txt              71
hardin_maia2002.pdf.txt              17
harvey_gaagudju1992v2.pdf.txt        46
naitoro_areare2013.pdf.txt            3
zeitoun_rukai2005.pdf.txt             1
dtype: int64

In [28]:
me_df[me_df['File'].str.contains('hardin')]


| |Gloss|File|Line|Page|Match|
|-----------|-----------|-----------|-----------|-----------|-----------|
|1123|1|hardin_maia2002.pdf.txt|749|23|1s=TP talk PROX=MN return-CAU1-IR.1s 2p-heart ...|
|1124|1|hardin_maia2002.pdf.txt|789|24|49) wi=nor saki saki-arav=o me+da sarar duwa=g...|
|1125|1|hardin_maia2002.pdf.txt|1011|29|work-VR1-IPF-RL.1s/3p D1 a.little=LIM talk-VR1...|
|1126|1|hardin_maia2002.pdf.txt|1025|29|lamua-t-io." bad-CAU1-IR.1s ‘If I see just 40 ...|
|1127|1|hardin_maia2002.pdf.txt|1147|32|tomato 3s-DAT a.steal-VR1-RL.1p SS run.away-SE...|
|1128|1|hardin_maia2002.pdf.txt|1355|36|talk do-MN talk-VR1-RL.1s/3p=TP=EM talk NEG+AD...|
|1129|1|hardin_maia2002.pdf.txt|1784|45|imar-a-go-mo assembly-VR1-IPF-1s/3p ‘they were...|
|1130|1|hardin_maia2002.pdf.txt|1863|47|165) Awun maia=di ono=ra sinam-tato-mo. dog PL...|
|1131|1|hardin_maia2002.pdf.txt|2299|55|210) Yo yo-nor emuar=at dumag avia-mi tete ono...|
|1132|1|hardin_maia2002.pdf.txt|4104|95|407) Yo yo-nor emuar=at dumag avia-mi tete ono...|


In [29]:
dropnames = ['hardin', 'naitoro', 'rukai']

for name in dropnames:
    me_df = me_df[~me_df['File'].str.contains(name)]

In [30]:
len(me_df)

1249
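
A pattern that matches no file name is silently ignored by `str.contains`, so a misspelled name drops nothing and raises no error. The sketch below reports such patterns. `drop_files` and the toy frame are illustrative assumptions, not part of the original notebook; the file names are taken from the results above.

```python
import pandas as pd

def drop_files(df, patterns, column="File"):
    """Drop rows whose file name contains any pattern; report patterns that match nothing."""
    unmatched = []
    for pattern in patterns:
        mask = df[column].str.contains(pattern, regex=False)
        if not mask.any():
            unmatched.append(pattern)
        df = df[~mask]
    return df, unmatched

# Toy frame with three of the file names from the search results.
toy = pd.DataFrame({"File": [
    "hardin_maia2002.pdf.txt",
    "naitoro_areare2013.pdf.txt",
    "foley_yimas1991.pdf.txt",
]})

kept, unmatched = drop_files(toy, ["hardin", "no-such-file"])
print(list(kept["File"]), unmatched)
```

Checking the `unmatched` list before moving on is a cheap way to confirm that every intended file was actually removed.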

# Building datasets

<center>
<img src="Images/typscience.jpg" width="1000"> 

### Types of data

https://wals.info/


https://www.smg.surrey.ac.uk/syncretism/


https://lexicalsplitsdb.surrey.ac.uk/

https://github.com/autotyp/autotyp-data/tree/0.1.0


https://zenodo.org/record/4898419#.Yas2mNBByUk


https://sacha.beniamine.net/dataset/beniamine-maiden-round-2019/


https://www.gerlingo.com/

Move away from *databases* and toward *datasets*

- A dataset is a collection of data in a tabular format
- Query the data directly
- Generalisable across projects
- As close to language data as possible
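
As a small illustration of querying a tabular dataset directly, the sketch below stores the Yei partial paradigm from the verbose exponence example as tab-separated text and filters it with pandas. The in-memory string stands in for a file on disk, and the variable names are illustrative.

```python
import io
import pandas as pd

# The Yei partial paradigm from earlier, stored as plain tab-separated text.
tsv = "Number\tForm\nSG\twɐ-mdəd-ə\nDU\tnə-mdəd-anɛ\nPL\tnə-mdəd-ə\n"

paradigm = pd.read_csv(io.StringIO(tsv), sep="\t")

# Query the data directly: which cells share the prefix nə-?
shared_prefix = paradigm[paradigm["Form"].str.startswith("nə-")]
print(shared_prefix)
```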

# Data/Code Sharing

<center>
<img src="Images/typscience.jpg" width="1000"> 

## Why share your data and code?

1. Reproducibility
2. Journals require it
3. Learn a lot
4. Extends the impact of your project
5. Build a portfolio

## Data

The ‘FAIR Guiding Principles for scientific data management and stewardship’ were published in *Scientific Data* in 2016. https://www.go-fair.org/fair-principles/


**F**indable
- Will anyone else know that your data exists?
    - Solutions: put it in a standard repository, or at least a description of the data. Get a digital object identifier (DOI).
    
**A**ccessible
- Once someone knows that the data exists, can they get it?
    - Usually solved by being in a repository, but for non-open data, may require more procedures.

**I**nteroperable
- Is your data in a format that can be used by others, like CSV instead of PDF?

**R**eusable
- Is there a license allowing others to re-use?


### Sharing on Zenodo

Directly:
- Easy
- Good for finalised datasets that won't change

Github:
- A couple more steps
- Allows version control

https://zenodo.org/

https://sandbox.zenodo.org/login/
