# An Interface to Automatic Cognate Detection in LingPy.

This interface will walk you through analyzing a set of linguistic data and suggest possible sets of cognates. You'll be able to download the file with analyzed cognate sets to examine yourself. 

## Usage
In order to use this interface, you'll work through cells in this Jupyter notebook, customizing the code to produce the results you want. If you're not familiar with Python or Jupyter notebooks, please start with the guide at this link: http://nbviewer.jupyter.org/github/jupyter/notebook/blob/master/docs/source/examples/Notebook/Notebook%20Basics.ipynb

It'll get you familiar with how to use the rest of this notebook.

### A quick refresher:
Click on any cell to edit it. Type the needed information according to the instructions. Press "Shift" + "Enter" to evaluate a cell. All of the Python code in the notebook's cells runs together as a single program, so any time you evaluate a cell, you're adding to or changing your program. Cells can be moved or re-run in any order.

### Citations and Motivation
This workflow is built from the workflow example in LingPy's documentation, found here: http://lingpy.org/tutorial/workflow.html. For the first several sections of this guide, the documentation and this document are largely parallel: reading them together could be helpful.
This guide also uses code from a similar, runnable workflow that was included in LingPy 2.5, and has since been deprecated.

Despite the overlap, this file functions as a standalone tool, independent of either of these sources. It goes into more depth than the current example documentation, and provides an interactive way to start using LingPy.

## Getting Started
First, we'll import the code we need to run this interface in LingPy. These import statements cover everything used later on.

In [None]:
from lingpy import *
from lingpy.tests.util import test_data

## Input and output files
Next, we'll set up the first input and output files. We'll first pick the input file on which we want to try out automatic cognate detection. The first code cell below provides a default file to use the first time through this guide, but there are other example files listed below.

### test data files: 
- KSL.qlc
- tutorial.qlc
- Haspelmath-2009-1460.tsv
- Ringe-2002-421.tsv
- leach1969-67-161.csv


These files can be found at the link https://github.com/lingpy/lingpy/tree/master/lingpy/tests/test_data, where you can peruse them in more detail. If you want to change the file used, you can do so here. This cell defines the input file for the guide:

In [None]:
input_file = test_data("KSL.qlc") # you can change the file name inside the quotation marks

LingPy's automatic cognate detection ends with a file that contains suggested cognate sets. You can choose here what you'd like to call that output file. For now, the default assumes you used the default input file.

If you'd like to change it, please provide a file name with no spaces and no extension:

In [None]:
output_file = "output/KSL_output" # by starting the string with 'output/' the files will be placed inside a folder

If loading the input file produces an error, you've probably asked for a test data file that does not exist. Check your spelling, and try again.
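
The default output name starts with 'output/', so the files are meant to land in a folder of that name. Whether that folder gets created for you may depend on your setup, so here is a minimal sketch that creates it if it is missing. The helper name `ensure_parent_folder` is made up for this example and is not part of LingPy; the demonstration runs inside a temporary directory so it leaves nothing behind.

```python
from pathlib import Path
import tempfile


def ensure_parent_folder(file_stem):
    """Create the folder part of a path like 'output/KSL_output' if it is missing."""
    folder = Path(file_stem).parent
    folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return folder


# Demonstrated inside a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    created = ensure_parent_folder(Path(tmp) / "output" / "KSL_output")
    folder_exists = created.is_dir()
    print(created.name, folder_exists)
```

In the notebook itself, you could call `ensure_parent_folder(output_file)` once after defining `output_file`.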

## Automatic Cognate Detection in LingPy: Overview
Automatic cognate detection is the process of computationally performing the comparative method. LingPy implements several different ways to complete this analysis. 

The interface we're about to work through assumes familiarity with the comparative method, a little Python, and enough command line knowledge to be reading this file right now.

In this workflow, the first thing that will be created is a LexStat object. This is a Python object that extends the functionality of a basic word list object (called Wordlist), which we'll talk about how to build later. 

For now, all we need is an input file, and to know that LexStat is the basic class to run automatic cognate detection in LingPy. LexStat holds the data we're currently looking at, while everything we will run adds information to that object. This can be written to an output file that looks a lot like the input, with more columns added. 

We'll start with just an input file to create a LexStat object:

In [None]:
lex = LexStat(input_file)

We'll calculate a coverage statistic first, to start understanding our dataset. We see a brief summary of the dataset from our input file, showing us each language in the dataset, and how many concepts are included in each language.

In [None]:
lex.coverage()

Since the number of concepts is the same for all of the languages in the dataset, we can make the initial assumption that the dataset was well curated (and probably has the same 200 concepts in each language). 
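
To make the idea of coverage concrete, here is a small plain-Python sketch that counts how many distinct concepts each language has a word for. It is not LingPy's implementation of coverage(); the rows and the function name `concept_coverage` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical toy rows: (language, concept, transcription).
rows = [
    ("German", "hand", "hant"),
    ("German", "foot", "fuːs"),
    ("English", "hand", "hænd"),
    ("English", "foot", "fʊt"),
    ("Dutch", "hand", "hɑnt"),
]


def concept_coverage(rows):
    """Count the distinct concepts that each language has at least one word for."""
    concepts = defaultdict(set)
    for language, concept, _transcription in rows:
        concepts[language].add(concept)
    return {language: len(found) for language, found in concepts.items()}


coverage = concept_coverage(rows)
print(coverage)
```

In this toy data, Dutch covers fewer concepts than the other two languages, which is the kind of gap a coverage check is meant to reveal.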

## Scoring and Clustering Cognate Sets
Next, we will compute the scorers for the dataset, using the LexStat algorithm. A scorer is computed for each pair of languages in the dataset, and defines a threshold of similarity and cognacy for that pair. This allows cognate sets to be determined.

In [None]:
lex.get_scorer()

Now that scoring functions are computed, the cognate clusters across all languages in the set can be determined. This is done by running a clustering algorithm. Again, we continue with the LexStat method.

In [None]:
lex.cluster()
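
If you're curious what "clustering by a threshold" means, here is a deliberately simplified sketch. It groups words whose pairwise distance falls below a cutoff (single-linkage grouping). LingPy's own clustering algorithms are more refined than this; the distances, the threshold value, and the function name `threshold_clusters` are all made up for illustration.

```python
from itertools import combinations

# Hypothetical pairwise distances for four words of one concept
# (0 = identical, 1 = maximally different). These numbers are invented.
distances = {
    ("hant", "hænd"): 0.2,
    ("hant", "hɑnt"): 0.1,
    ("hænd", "hɑnt"): 0.25,
    ("hant", "ruka"): 0.9,
    ("hænd", "ruka"): 0.95,
    ("hɑnt", "ruka"): 0.9,
}


def threshold_clusters(words, distances, threshold):
    """Merge the clusters of any two words that are closer than the threshold."""
    cluster_of = {word: index for index, word in enumerate(words)}
    for a, b in combinations(words, 2):
        d = distances.get((a, b), distances.get((b, a)))
        if d is not None and d < threshold:
            old, new = cluster_of[b], cluster_of[a]
            # Relabel every member of b's cluster so both groups share one id.
            for word in words:
                if cluster_of[word] == old:
                    cluster_of[word] = new
    return cluster_of


words = ["hant", "hænd", "hɑnt", "ruka"]
clusters = threshold_clusters(words, distances, threshold=0.5)
print(clusters)
```

Words that end up with the same number are proposed as one cognate set; raising or lowering the threshold makes the grouping more or less generous.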

The last step is to output the data to a new file. The full output can contain the scoring functions and a lot of computed parameters. We can look at the file here (it should be on your computer, in the 'output' folder alongside this notebook if you kept the default output file name); however, the output isn't very readable right now. Still, we'll pause here to see what we've done so far.

In [None]:
lex.output('tsv', filename=output_file, ignore="all", prettify=False)

Note: The last two parameters here, "ignore" and "prettify", are set to keep the output smaller. If you'd like to later, you can remove these parameters, and the resulting output file will contain the scoring functions computed for all language pairs.

## LexStat all together

Now that we've looked at the current output, let's put the first few steps together. Here, we'll make a function, which performs the steps of basic cognate analysis using the steps of the LexStat we've seen so far.

In [None]:
def basicLexStat(infile, outfile):
    lex = LexStat(infile)
    lex.coverage()
    lex.get_scorer()
    lex.cluster(ref="cognates")
    lex.output('tsv', filename=outfile, ignore="all", prettify=False)
    return lex

We'll run it one more time, all at once:

In [None]:
lex = basicLexStat(input_file, output_file)

## Evaluation
Now that we've detected cognates in this input data, we can evaluate the algorithm. We import the necessary functions from the evaluation package, and run the bcubes() function to calculate scores.

In [None]:
from lingpy.evaluate.acd import bcubes, diff
bcubes(lex, "cogid", "cognates")

B-cubed scores are a method to evaluate and compare cluster decisions (lingpy.evaluate). They are measures of similarity between two clustering processes. If there is a gold standard cognate annotation for a dataset, these scores can be used to evaluate adherence to the standard. Using a different interpretation, the scores can be used to evaluate the differences between cognate detection methods.
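
To show what these scores measure, here is a plain-Python sketch of the textbook B-cubed definition: for every word, precision asks how much of its proposed cluster is also in its gold cluster, and recall asks the reverse. This is not LingPy's bcubes() code, which may differ in details such as how concepts are handled; the data and the function name `bcubed` are assumptions for illustration.

```python
def bcubed(gold, test):
    """Return B-cubed (precision, recall, F-score) for two clusterings.

    Both arguments map each item to a cluster label.
    """
    items = list(gold)
    precision = recall = 0.0
    for item in items:
        same_gold = {other for other in items if gold[other] == gold[item]}
        same_test = {other for other in items if test[other] == test[item]}
        overlap = len(same_gold & same_test)
        precision += overlap / len(same_test)
        recall += overlap / len(same_gold)
    precision /= len(items)
    recall /= len(items)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical cognate judgements for four words of one concept.
gold = {"hant": 1, "hænd": 1, "hɑnt": 1, "ruka": 2}
test = {"hant": 1, "hænd": 1, "hɑnt": 2, "ruka": 2}

precision, recall, f_score = bcubed(gold, test)
print(round(precision, 3), round(recall, 3), round(f_score, 3))
```

Identical clusterings score 1.0 on all three values; the further the automatic clusters drift from the reference, the lower the scores.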

## Alignments
The next step is to make the cognate sets more readable for manual inspection. LingPy's Alignments class will help with this. Now that each token in the input dataset has been annotated with cognate scores, indicating suggested cognate sets, we can align the sets visually in a new output file.

To do this, we have to start by making an Alignments object from our existing LexStat object. 

In [None]:
alm = Alignments(lex, ref="cognates")

Here's a guide to the difference between these classes: think of the LexStat object as the original dataset, with columns added containing new analysis; think of the Alignments object as the dataset with reordered rows to better highlight the results of the analysis.

Next, we need to run the alignment analysis.

In [None]:
alm.align()

To see the results, we use the output() method. The output options are html or tsv, and the file name can be chosen as needed.

In [None]:
# first argument: 'html' or 'tsv'
# second argument: any string
alm.output('html', filename="KSL") # change filename or output type

The tsv and html output files should be pretty similar. The html file is easier to read, using color and design to show alignments. The tsv file is editable, allowing for adjustments to the automatically detected cognate sets.

## First full automatic cognate detection workflow
Finally, we'll add the alignments steps into our workflow, so we can see everything together.

In [None]:
def basicACD(infile, outfile, output_type, intermediate=False):
    lex = LexStat(infile)
    lex.coverage()
    lex.get_scorer()
    lex.cluster(ref="cognates")
    if intermediate:
        lex.output('tsv', filename=outfile+"_intermediate",\
                   ignore="all", prettify=False)

    alm = Alignments(lex, ref="cognates")
    alm.align()
    alm.output(output_type, filename=outfile) # change filename or output type
    
    return lex, alm

Note: There's one new element in the workflow that we haven't talked about: the 'intermediate' parameter. Since Alignments has its own output, we don't always need output from LexStat as well. If you'd like to see that file, you can use 'True' as the last argument in your call to basicACD() below. If you don't want to create that output file, you don't need to include that argument at all: it will default to False, because 'intermediate=False' is included in the function definition.

In [None]:
lex, alm = basicACD(input_file, output_file, 'html', True)

## Setting Parameters
Now, we can start setting parameters to customize the cognate detection workflow. We'll go through the steps again, adding more detail along the way. We'll make the edits by declaring a function called customACD() and editing and running it repeatedly. Since we'll use the same name, the Jupyter notebook will redefine the function each time.

We'll start by copying in the code from basicACD().

In [None]:
def customACD(infile, outfile, output_type, intermediate=False):
    lex = LexStat(infile)
    lex.coverage()
    lex.get_scorer()
    lex.cluster(ref="cognates")
    if intermediate:
        lex.output('tsv', filename=outfile+"_intermediate",\
                   ignore="all", prettify=False)

    alm = Alignments(lex, ref="cognates")
    alm.align()
    alm.output(output_type, filename=outfile) # change filename or output type
    
    return lex, alm

### Sound Class Models
One of the main parameters for LexStat is the sound class model used to encode the raw phonetic data in the input file. 

Sound classes are based on the idea that a sound is more likely to change to another sound that is similar to it. These similar sounds are grouped into classes, and the model is used to make judgements for correspondences between languages. 

LingPy allows a few different sound classes to be used in cognate detection. The available sound class models are:

##### Dolgo

Dolgopolsky was one of the first linguists to study sound change by using sound classes. He proposed ten sound classes: labial obstruents; dental obstruents; sibilants; velar obstruents together with dental and alveolar affricates; labial nasals; the remaining nasals; liquids; the voiced labial fricative and initial rounded vowels; palatal approximants; and laryngeals together with initial velar nasals. To use the 'dolgo' model, we encode the data into strings that mark which sound class each sound belongs to. 
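
Here is a small sketch of what "encoding data into sound class strings" looks like. The table below is a heavily simplified, hypothetical mapping written for this example; it is not the actual Dolgopolsky, SCA, or ASJP model that ships with LingPy, and the function name `to_sound_classes` is also invented.

```python
# Hypothetical, heavily simplified sound-class table for illustration only.
# It is NOT one of the models shipped with LingPy.
TOY_CLASSES = {
    "p": "P", "b": "P", "f": "P",
    "t": "T", "d": "T",
    "s": "S", "z": "S",
    "k": "K", "g": "K",
    "m": "M",
    "n": "N",
    "l": "R", "r": "R",
}
VOWELS = set("aeiou")


def to_sound_classes(segments):
    """Replace each segment with its class symbol; vowels become 'V', unknowns '?'."""
    result = []
    for segment in segments:
        if segment in VOWELS:
            result.append("V")
        else:
            result.append(TOY_CLASSES.get(segment, "?"))
    return "".join(result)


# Two words that differ only in voicing collapse to the same class string.
first = to_sound_classes("tak")
second = to_sound_classes("dag")
print(first, second)
```

Because similar sounds share a symbol, words that have drifted apart by small sound changes can still look alike once they are encoded.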

##### SCA

This is the default sound class model, developed as part of the LexStat method in LingPy. The sound classes used in SCA are largely based on the Dolgopolsky model (dolgo), but are extended further with an eleventh class to cover diacritics and vowels.

Source for Dolgo and SCA:
https://marija.gforge.uni.lu/esslli2010stus_submission_10.pdf

##### ASJP

The ASJP model is based on a standardized orthography (ASJPcode) for language comparison. The sounds of any language are transcribed into characters in ASJPcode, which consists of 7 vowel symbols and 34 consonant symbols. Languages with more than 7 vowels end up with multiple vowels represented by a single symbol. Similarly, rarer consonants are merged with more common, similar sounds.

Source for ASJP:
https://www.researchgate.net/profile/Soren_Wichmann/publication/40853552_Explorations_in_automated_language_classification/links/09e41508a4943d4255000000.pdf

If you'd like to go directly to the source documentation, it's available here: http://lingpy.org/docu/data/model.html

Now that you've read about the models, you can add this parameter to our cognate detection workflow.

In [None]:
user_model_choice = "dolgo" # choose 'sca', 'dolgo', or 'asjp'

Then we can redefine our workflow to include the new parameter:

In [None]:
def customACD(infile, outfile, output_type, model_choice, 
              intermediate=False):
    lex = LexStat(infile, model=model_choice)
    lex.coverage()
    lex.get_scorer()
    lex.cluster(ref="cognates")
    if intermediate:
        lex.output('tsv', filename=outfile+"_intermediate",\
                   ignore="all", prettify=False)

    alm = Alignments(lex, ref="cognates")
    alm.align()
    alm.output(output_type, filename=outfile) # change filename or output type
    
    return lex, alm

Note: this is the first custom parameter we're focusing on. Remember, the output file is at this point still probably the same, so running the new function will write over the old output. If you'd like to change this, redefine the variable 'output_file' on a new line of code, or replace the argument 'output_file' with a string of your choice. The same idea goes for any new customization you'd like to make!

And we'll run the new version:

In [None]:
lex, alm = customACD(input_file, output_file, 'html', user_model_choice, True)

Another useful LexStat parameter is 'check'. This turns on error checking for the input file given. Errors might be unexpected characters, or bad formatting. By setting 'check' to 'True', an error log will be created, letting you know if there's anything you should look at in your input file. We'll add it to our workflow here:

In [None]:
def customACD(infile, outfile, output_type, model_choice, 
              check_choice=False,
              intermediate=False):
    lex = LexStat(infile, model=model_choice, check=check_choice)
    lex.coverage()
    lex.get_scorer()
    lex.cluster(ref="cognates")
    if intermediate:
        lex.output('tsv', filename=outfile+"_intermediate",\
                   ignore="all", prettify=False)

    alm = Alignments(lex, ref="cognates")
    alm.align()
    alm.output(output_type, filename=outfile) # change filename or output type
    
    return lex, alm

In [None]:
lex, alm = customACD(input_file, output_file,
                     'html', user_model_choice, True, True)

Several parameters for LexStat allow for simple changes like renaming the headings of columns in output files:
- 'segments' (default: "tokens"): the column containing segmented transcriptions
- 'transcription' (default: "ipa"): the column containing unsegmented token transcriptions
- 'classes' (default: "classes"): the column containing the sound class representation of token transcriptions
- 'numbers' (default: "numbers"): the column containing triples with numeric identifiers for tokens: language id, sound class string, and prosodic string

Now, try it yourself! Here's another copy of our workflow function:

In [None]:
def customACD(infile, outfile, output_type, model_choice, 
              check_choice=False,
              intermediate=False):
    # add a parameter in the following line to rename a heading in 
    # your output!
    lex = LexStat(infile, model=model_choice, check=check_choice)
    lex.coverage()
    lex.get_scorer()
    lex.cluster(ref="cognates")
    if intermediate:
        lex.output('tsv', filename=outfile+"_intermediate",\
                   ignore="all", prettify=False)

    alm = Alignments(lex, ref="cognates")
    alm.align()
    alm.output(output_type, filename=outfile) # change filename or output type
    
    return lex, alm

Try adding a new parameter to the function declaration. Make sure to add it to the LexStat() call on the first line as well.

Remember that you can start a function parameter on a new line, like 'intermediate' above, if that's easier to read. Also remember that any parameters with a default value, like 'intermediate', must be at the end of the list!
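
As a quick reminder of how default values behave in Python, here is a small sketch. The function `run_workflow` is a made-up stand-in for customACD() that only reports the settings it received, so it runs without LingPy.

```python
def run_workflow(infile, outfile, output_type,
                 check_choice=False,
                 intermediate=False):
    """Stand-in for customACD() that only reports the settings it received."""
    return {
        "infile": infile,
        "outfile": outfile,
        "output_type": output_type,
        "check_choice": check_choice,
        "intermediate": intermediate,
    }


# Positional: the fourth value goes to check_choice, the fifth to intermediate.
positional = run_workflow("in.tsv", "out", "html", True, True)

# Keyword: leave check_choice at its default and set only intermediate.
keyword = run_workflow("in.tsv", "out", "html", intermediate=True)

# Python rejects a parameter without a default placed after one with a default.
try:
    compile("def broken(a=1, b): pass", "<demo>", "exec")
    ordering_allowed = True
except SyntaxError:
    ordering_allowed = False

print(positional["check_choice"], keyword["check_choice"], ordering_allowed)
```

Naming the argument, as in `intermediate=True`, is often easier to read than a row of bare True and False values.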

Finally, run your new version of the function:

In [None]:
lex, alm = customACD(input_file, output_file,
                     'html', user_model_choice, True, True)

### Alignments parameters
We'll look at one more parameter before we move on. This one is used on the Alignments object, 'alm'. The 'mode' parameter defines the type of alignment analysis. The options are "global" and "dialign": whether the alignment analysis weights global similarities more heavily, or maximizes local similarities. Varying this parameter could produce different results, providing a new perspective or a different baseline for exploring a dataset. The sources are listed below. Let's try it here:

In [None]:
# First, copy and paste your most recent version of customACD() into
# this cell, after the comment.
#
# Change the following line:  
#
# alm.align()
#
# adding the parameter mode. It should look something like this:
#
# alm.align(mode='dialign')
#
# Remember, you can add the new 'mode' parameter as an argument to the 
# customACD() function declaration!


Global alignment method ("global"):
https://s3.amazonaws.com/academia.edu.documents/25023781/6.pdf?AWSAccessKeyId=AKIAIWOWYYGZ2Y53UL3A&Expires=1516673409&Signature=GxRLVNQiZPQONdBTlT5RGCuu9vY%3D&response-content-disposition=inline%3B%20filename%3DA_general_method_applicable_to_the_searc.pdf

Local alignment method ("dialign"):
http://www.pnas.org/content/93/22/12098.full.pdf
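
To give a feel for what a global alignment does, here is a textbook Needleman-Wunsch sketch in plain Python. It aligns two strings from end to end using simple match, mismatch, and gap scores. LingPy's alignments rely on more refined scoring, so treat this only as an illustration of the idea; the scores and the function name `global_align` are assumptions made for this example.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Align two sequences end to end with Needleman-Wunsch dynamic programming."""

    def pair_score(x, y):
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]),
                score[i - 1][j] + gap,   # gap in b
                score[i][j - 1] + gap,   # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == (
            score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1])
        ):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), score[-1][-1]


aligned_a, aligned_b, total = global_align("hand", "and")
print(aligned_a)
print(aligned_b)
print(total)
```

The dash marks a gap: a position where one word has a sound and the other has nothing to line up with it.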

## Going forward
### More customization
Now, we've looked at the steps of automatic cognate detection in LingPy, created a workflow, and customized the workflow with parameters. 

Where do you go from here?

There are many more parameters and customizations to be made in LingPy. This interface has talked about a few, but more can be found starting at these links:
- http://lingpy.org/docu/compare/lexstat.html 
- http://lingpy.org/reference/lingpy.align.html#lingpy.align.sca.Alignments

The documentation on these classes is a good place to start in developing a more detailed workflow for research using automatic cognate detection.

After looking through these classes, you can continue to explore the documentation, here: http://lingpy.org/docu/index.html to find any other useful tools LingPy offers.

### Using larger datasets
After you feel familiar using LingPy, it's time to try it out on larger datasets. The first place to start could be the files in the 'input/' directory of this repository. The following files in this directory are prepared as input to LingPy:
- combined_fijian_maori.csv
- slavic.tsv

Try changing the definition of "input_file" in a new cell, and re-running customACD(). The resulting output should be more robust and complex to explore.

### Using your own data
The following tutorial in LingPy's documentation is a good start for understanding input formats: http://lingpy.org/tutorial/lingpy.basic.wordlist.html. But we'll go over some basics here:

A simple input file ready to be converted into a Wordlist (performed by creating a LexStat object) has the following:
- a column of tokens
- a column of transcriptions, giving a mapping of each transcription in a language to a token concept
- a column of doculects, marking what language each token is transcribed into
- a gold standard annotation column

The gold standard annotation column can be difficult. When using LingPy to compare automatic methods on existing hand-annotated data, one must just include the expert annotations for each word in the input file. When using LingPy to explore a new dataset, there may be no cognate sets proposed yet for the data. 

In this case, I suggest that as a first exploratory step, each transcription is labeled with the same id for all transcriptions of the same token. This would effectively propose that the considered languages are 100% cognate. LingPy will suggest its own cognate sets, and these can be examined to learn more about the data (as far as LingPy's methods will allow).
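
The placeholder annotation described above can be sketched like this: every word gets its own ID, and all words for the same concept share one cognate ID. The column names used here (ID, DOCULECT, CONCEPT, IPA, COGID) are an assumption modeled on the headings used in LingPy's word list tutorial, so check them against that tutorial before relying on them. The word list and the function name `build_rows` are invented for illustration.

```python
import csv
import io

# Hypothetical word list: (doculect, concept, transcription).
words = [
    ("German", "hand", "hant"),
    ("English", "hand", "hænd"),
    ("German", "foot", "fuːs"),
    ("English", "foot", "fʊt"),
]


def build_rows(words):
    """Give every word a numeric ID and one shared cognate ID per concept."""
    concept_ids = {}
    rows = []
    for number, (doculect, concept, ipa) in enumerate(words, start=1):
        # The first time a concept appears, it receives the next free cognate ID.
        cogid = concept_ids.setdefault(concept, len(concept_ids) + 1)
        rows.append([number, doculect, concept, ipa, cogid])
    return rows


header = ["ID", "DOCULECT", "CONCEPT", "IPA", "COGID"]
rows = build_rows(words)

# Written to an in-memory buffer here; use open("my_data.tsv", "w") for a real file.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
tsv_text = buffer.getvalue()
print(tsv_text)
```

Once real expert judgements exist, the placeholder COGID values can be replaced with them and used as the gold standard for evaluation.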

Additionally, the Python script swadesh_scraper.py included in this repository outlines the creation of a LingPy input file, from data collection (via web scraping) to formatting. This file could be useful to explore, especially after trying out customACD() on the slavic.tsv input example. slavic.tsv is the product of the swadesh_scraper.py script.

### Exploring output
The html output of our customACD() workflow gives a good visual representation of the cognate sets LingPy proposes. If you'd like a little more room to explore, you can upload output in tsv form to EDICTOR at http://edictor.digling.org, an online file editor built for LingPy. There, alignments are shown in color, can be toggled by subset, and can be edited in place. EDICTOR could be useful in the course of research using these tools.

### Statistical metrics
Remember the B-cubed scores introduced in the Evaluation section earlier in the notebook? These can be run on any LingPy output, and used to compare methods. Be careful: these scores are limited in what they can tell you about the analysis. Manual inspection is another good place to start.

#### Final Notes
You can keep using this notebook to work with LingPy and explore! Start typing in a cell, and try out more customizations. Or start a new one, and use it to document the research process.