# Handling Cognate Sets: Partial, Strict, and Loose Cognates in LingPy


Handling partial cognates in LingPy is straightforward with the `Partial` class of the `compare` package. To use such data in "classical" analyses, partial cognates need to be converted to "normal" cognates, and the `Partial` class also offers solutions for this conversion. In the following, we will learn how to load a file containing partial cognates into LingPy, convert the data to two different formats of normal cognates derived from the partial cognates, and save the data as a nexus file in multistate format suitable for use with Paup.

Let's start by loading a file that contains cognate sets, which we will later export to nexus format.
Our file of choice is a dataset used in the study of [Hamed and Wang (2006)](http://bibliography.lingpy.org?key=Hamed2006). The second author of the paper kindly provided us with the data, which we converted to LingPy format so that it can be easily manipulated.



In [1]:
from helpers.util import data_path # path to our cookbook data
from lingpy import *
wl = Wordlist(data_path('Wang2006.tsv'))

Now that we have loaded the file, let's look at its content, by querying the header.

In [2]:
print(', '.join(sorted(wl.header, key=lambda x: x[1])))

wangid, partial, doculect, concept, ipa


The column we are interested in here is the column "partial", since it contains the partial cognate information in this dataset (most cognate sets are only partially cognate in Chinese dialects). The cognates are given as Chinese characters, as is traditionally done in Chinese linguistics.

In [3]:
for k in wl.get_list(concept='hand', flat=True)[3:7]:
    print(wl[k, 'doculect'], wl[k, 'partial'])

Chengdu 手
Fuzhou 手
Guangzhou 手
Lianchang 手骨


This is only a small part of all the words for "hand", but it shows that Lianchang has two characters for "hand", indicating a two-syllable word, while the others have one and the same character. In terms of partial cognates, we would note this as:

Doculect | Partial Cognate Set
--- | ---
Chengdu | 1
Fuzhou | 1
Guangzhou | 1
Lianchang | 1 2

In order to convert the data into this form, we renumber all Chinese characters.

In [4]:
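# assemble characters for Chinese data
nums = {}
idx = 1
for k in wl:
    for char in wl[k, 'partial']:
        # assign a new running number to each character not seen before
        if char not in nums:
            nums[char] = str(idx)
            idx += 1
# convert each word's characters into a list of partial cognate IDs
wl.add_entries('partial_ids', 'partial', lambda x: [nums[y] for y in x])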
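
To make the renumbering concrete, here is a minimal, self-contained sketch that applies the same idea to the four entries for "hand" shown above. The dictionary `words` is a hypothetical stand-in for the wordlist and is not part of LingPy.

```python
# Hypothetical stand-in for the wordlist: doculect -> characters of the word.
words = {
    "Chengdu": "手",
    "Fuzhou": "手",
    "Guangzhou": "手",
    "Lianchang": "手骨",
}

# Assign a running number to every character the first time it is seen.
nums = {}
for chars in words.values():
    for char in chars:
        if char not in nums:
            nums[char] = str(len(nums) + 1)

# Replace each character by its number to obtain partial cognate IDs.
partial_ids = {doc: [nums[c] for c in chars] for doc, chars in words.items()}

for doc, ids in partial_ids.items():
    print(doc, " ".join(ids))
```

The result corresponds to the small table above: the three one-character words share a single ID, while the Lianchang word receives two.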

We need another hack, as we want to convert the partial cognates to full cognates using two different methods. `Partial` expects a column "segments" as input, so we create it using the `add_entries` function. The content of this column is of no interest to us, as we don't want to compare strings; we only want to deal with the partial cognate sets.

In [5]:
wl.add_entries('segments', 'ipa', lambda x: 't t t')

Now we can load the data into `Partial` and compute full cognate sets from the partial ones. Here, `Partial` offers two versions:

1. strict cognates (only words identical in *all* aspects will be judged to be cognate)
2. loose cognates (all words that form a connected component in a network of shared cognate morphemes are judged to be cognate)

The difference is easily illustrated:

Language | Partial Cognate Sets | Strict Cognate Sets | Loose Cognate Sets
--- | --- | --- | ---
Lang1 | 1 2 | 1 | 1
Lang2 | 1 3 | 2 | 1
Lang3 | 3 4 | 3 | 1
Lang4 | 2 4 | 4 | 1

While none of the words are cognate under strict coding, all of them are cognate under loose coding. Why? Because if we connect all words that share at least one partial cognate set, we obtain a single chain of connected words. Even though Lang2 and Lang4 do not share any morpheme with each other, they are loosely cognate, since each can be linked to words which in turn can be linked to the other.
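
The two codings can be reproduced on this toy example with a few lines of plain Python. The following is a minimal sketch of the idea, not LingPy's implementation: the function names `strict_ids` and `loose_ids` are hypothetical, strict coding is taken to mean "identical tuples of partial cognate IDs", and loose coding is computed as connected components with a small union-find.

```python
from itertools import combinations

# Partial cognate IDs from the illustration (toy data).
partial = {
    "Lang1": (1, 2),
    "Lang2": (1, 3),
    "Lang3": (3, 4),
    "Lang4": (2, 4),
}

def strict_ids(partial):
    """Words are cognate only if their partial cognate IDs are identical."""
    seen = {}
    return {lang: seen.setdefault(ids, len(seen) + 1)
            for lang, ids in partial.items()}

def loose_ids(partial):
    """Words are cognate if they belong to the same connected component of
    the network linking words that share at least one partial cognate ID."""
    parent = {lang: lang for lang in partial}

    def find(x):
        # Follow parent links up to the representative of x's component.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the components of every pair of words sharing an ID.
    for a, b in combinations(partial, 2):
        if set(partial[a]) & set(partial[b]):
            parent[find(a)] = find(b)

    seen = {}
    return {lang: seen.setdefault(find(lang), len(seen) + 1)
            for lang in partial}

print(strict_ids(partial))
print(loose_ids(partial))
```

Under these assumptions, every word ends up in its own strict set, while all four words fall into a single loose set, even though Lang1 and Lang3 (like Lang2 and Lang4) share no ID directly.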

In the `Partial` class, we can easily create these data using the `Partial.add_cognate_ids` function.

In [6]:
from lingpy.compare.partial import Partial
part = Partial(wl, segments='segments')
part.add_cognate_ids('partial_ids', 'strictid', idtype='strict')
part.add_cognate_ids('partial_ids', 'looseid', idtype='loose')

Writing the data to file is now straightforward, as we only need to use the `output` function. Here, we specify "multistate.nex" as the format in order to output the data in Paup's multistate format. We set missing characters to "?" and export the data for two reference columns: the loose and the strict encoding.

In [7]:
# export multistate format
part.output('tsv', filename='Wang2004', ignore='all', prettify=False)
part.output('multistate.nex', filename='Wang2004-strict', ref='strictid', missing='?')
part.output('multistate.nex', filename='Wang2004-loose', ref='looseid', missing="?")
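
As a rough illustration of what multistate coding means, the sketch below turns cognate IDs into one character per concept: each cognate set of a concept gets its own symbol, a doculect with several words for a concept gets a polymorphic state in parentheses, and a doculect without a word gets the missing symbol. This is a simplified sketch under those assumptions, not LingPy's actual export routine; the function `multistate_matrix`, the symbol inventory, and the toy data are all hypothetical.

```python
import string

# Symbols available as character states (an assumed inventory of 62 symbols).
SYMBOLS = string.ascii_lowercase + string.ascii_uppercase + string.digits

def multistate_matrix(data, doculects, missing="?"):
    """Encode cognate IDs as multistate characters.

    data: list with one dict per concept, mapping doculect -> list of cognate IDs.
    Assumes no concept has more cognate sets than there are symbols.
    """
    rows = {doc: "" for doc in doculects}
    for concept in data:
        # Give every cognate ID of this concept a symbol, in sorted order.
        ids = sorted({i for cogs in concept.values() for i in cogs})
        symbol = {i: SYMBOLS[n] for n, i in enumerate(ids)}
        for doc in doculects:
            states = sorted(symbol[i] for i in set(concept.get(doc, [])))
            if not states:
                rows[doc] += missing                      # no word attested
            elif len(states) == 1:
                rows[doc] += states[0]                    # a single state
            else:
                rows[doc] += "(" + "".join(states) + ")"  # polymorphic state
    return rows

# Toy data: two concepts, three doculects.
toy = [
    {"A": [1], "B": [1], "C": [2]},
    {"A": [3, 4], "B": [4]},
]
rows = multistate_matrix(toy, ["A", "B", "C"])
for doc, row in rows.items():
    print(doc, row)
```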

And this is how the file for loose encoding looks in the end:

In [8]:
cat Wang2004-loose.multistate.nex

#NEXUS

BEGIN DATA;
DIMENSIONS ntax=23 NCHAR=200;
FORMAT RESPECTCASE DATATYPE=STANDARD symbols="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOP0123456789" GAP=? MISSING=? interleave=yes;
OPTIONS MSTAXA = POLYMORPH;

MATRIX

Anyi       (ef)caaa(cf)aaaaabcabaa(ce)aaaa(bd)aba(bce)aaaaab(ab)aaaabba(ae)aa(bc)aaaaaaaaaaaabaaaalaaaaaaaeaaaadacaaaadabacaaaaaaaaaba(ab)aaaaa(ab)aabaaabadaaaakceaaaaaaaabaabjaaaaabaaaaacaaabaaca(mn)aaaaaeaaaaabaaabaasbbaaaaabaaaaaaaaaaaaaaaaaa
Beijing    i(ab)baacaaaaaacababc(ac)aaabababaaaaaea(be)aaabbaeaa(ce)aaaaaaaa(ab)aaabacaaa(ad)aaaaaa(cf)aaabdadaaaa(afg)aba(dij)aaaaaaaadbaaa(ad)aaaaaa(ac)aaabadaa(ac)aa(ch)baaaaaaaa(ab)aacaa(abcd)abaaaaaaabbaacaa(ac)bb(cd)aaaaabaaaabaaa(bf)aatcbaaaaabaaaaaaaabaaaaaaaa(ae)
Changsha   fbbaacaaaaaa(bc)abaa(ac)aaaa(bc)aba(ab)aaaeabaaaaabba(ae)aacaaaaaaaaaaaabaaaabaaaaaaacaaaadadaaeaaabadaaaaaaaaabaaaaaaaaaaaaaabadaaaaa(ac)(ab)aaaaaaaa(bdl)aaaaaaabaaaaaaa(bc)baa(ab)aa(ac)bb(ac)aaaaf(ab)aaaabaaabaa(AB)cbaaa(ab)abaaaaa(ae)aab