In [1]:
import csv # for writing dataframes to csv
import random # for making a random choice
import os # for scanning directories
import itertools
import string # for generating strings
from collections import Counter

import kintypes as kt # bringing large lists of kin types into the namespace
import math # for calculating logs
import pandas as pd
import numpy as np
import scipy.stats
import statistics
import re
from tqdm import tqdm

# testing = True # set to True to run code blocks with tests and examples
# filtering = False # set to True to run the filtering process

# Internal co-selection

Internal co-selection refers to a process of kin term evolution whereby terminological changes in one part of the paradigm co-occur with changes in related parts of the paradigm, increasing the predictive structure of the paradigm.

In this notebook, we will build simulations to investigate the robustness of this tendency cross-linguistically, using data from Kinbank, a global database of kin terminology. 

We will measure internal co-selection in terms of the **mutual information** between Generation N and Generation N+1 in a particular kinship system. That tells us how much information can be gained from one generation by observing the other - how certain we can be about which children 'go with' which parents. This can be calculated as the **entropy** of one generation (how much unpredictable variation there is) minus the **conditional entropy** between the two generations (how much unpredictability remains in one generation after observing another).

## Calculating mutual information of a kinship system

To calculate the mutual information (MI) of a particular kinship system, we must perform the following steps:

1. Extract kin terminology data from Kinbank for this language.
2. Condense the full kinship system down to the terms we are interested in: Ego's generation and Ego's parents' generation.
3. Calculate the probabilities of each kin term within the generation in which it belongs; and the probabilities of each parent-child pair.
4. Calculate entropy, conditional entropy, subtract them from each other to get the mutual information of the system.

After we get that going, we can do these same calculations on simulated kinship systems.

### 1. Extract kin terminology from Kinbank

First, let's actually load our data in. The following function `get_kb_files()` pulls the full list of Kinbank filenames - one file per language. Later, we can iterate through these to generate MI values for every language in our dataset.

In [2]:
def get_kb_files(path) -> list:
    files = []
    directory = os.scandir(path)
    morgan = []
    for file in directory:
        if 'Morgan' in file.name:
            morgan.append(file.name)
        else:
            files.append(file.name)
    files += morgan # all morgan files at the end, so if there is a duplicate, the non morgan data is used
    return files

all_kb_files = get_kb_files('../../kinbank')

In [3]:
all_kb_files

['Ngombe_Binja_binj1250.csv',
 'Ottowa_Ojibwa_otta1242.csv',
 'Tlingit_tlin1245.csv',
 'Bannock_bann1248.csv',
 'Sa_saaa1241.csv',
 'Kimaama_kima1246.csv',
 'Dhuwal-Dhuwala_(Yolngu)_dhuw1248.csv',
 'Tenharim_tenh1241.csv',
 'Mosetén_Chimané_mose1249.csv',
 'Tsaangi_tsaa1242.csv',
 'Ngoni_ngon1269.csv',
 'Nyakyusa_nyak1260.csv',
 'Balaesang_bala1314.csv',
 'KazymKhanty_nort2672.csv',
 'Aklanon_akla1241.csv',
 'Paakantyi_darl1243.csv',
 'Fipa_fipa1238.csv',
 'Namakura_nama1268b.csv',
 'Woleaian_wole1240.csv',
 'Dondo_dond1249.csv',
 'Nengone_neng1238.csv',
 'Punu_punu1239.csv',
 'Kuot_kuot1243.csv',
 'Buin_buin1247.csv',
 'Ontong_Java_onto1237.csv',
 'Nunamiut_nort2944.csv',
 'Angaite_anga1316.csv',
 'Kemtuik_kemt1242.csv',
 'Futunans_east2447.csv',
 'Iraya_iray1237.csv',
 'Mblafe_mbla1238.csv',
 'Godoberi_ghod1238.csv',
 'Chamarro_cham1312.csv',
 'Pipil_pipi1250.csv',
 'Malgana_malg1242a.csv',
 'Central_Tagbanwa_cent2090.csv',
 'Kashmiri_kash1277.csv',
 'Macushi_macu1259.csv',
 'Haya_ha

Using one of these filenames, we can extract the kin terminology from that file and populate a dictionary with it. We're only interested in two columns from the Kinbank data: `parameter`, which contains a short code indicating a **kin type**, and `word`, which contains the **kin term** associated with that kin type. An example of a row in the English data would be `mMeB, uncle`, where `mMeB` means 'male speaker's mother's older brother', and `uncle` is the term associated with that person.

In [4]:
def get_kin_terms(filepath: str) -> dict:
    ks = {}
    with open(filepath, encoding='utf8') as f:
        csv_reader = csv.DictReader(f)
        next(csv_reader) # to skip the header row
        for line in csv_reader:
            kin_type = line['parameter']
            kin_term = line['word']
            kin_term = kin_term.split(',')[0]
            if '(' in kin_term:
                kin_term = kin_term.split('(')[0][:-1]
            ks[kin_type] = kin_term
    return ks

### 2. Condense the system down

We're interested in the mutual information between the kin terms in Generation 0 (Ego's generaiton) and Generation +1 (Ego's parents' generation).

We want create a data structure that pairs up parent types with the corresponding child types. This is because we're interested in whether kinship systems maintain patterns of terminological distinctions and mergers across these two generations, so we will need to know which parent terms 'go with' which child terms.

In `kintypes`, you will find a list of pairs of kin types, where the first element in the pair is a parent type, and the second is their child; e.g. mMeB and mMeBD (mother's elder brother and mother's elder brother's daughter). 

`get_pairs()` takes a kinship system as input, and outputs a list of tuples. The first element in the tuple is the parent term, the second is the corresponding child term. 

In [45]:
def get_pairs(ks: dict) -> list:
    pairs_of_terms = []
    parent_types = []

    for pair in kt.ics_pairs:
        if pair[0] in ks and pair[1] in ks:
            pairs_of_terms.append((ks[pair[0]],ks[pair[1]]))
            parent_types.append(pair[0])
                
    return pairs_of_terms

But for our calculations, we'll still need to know which terms belong to which generation. Luckily, we know that the 0th element in each tuple is from Ego's parents' generation and the 1st element is from Ego's generation. So we can happily split these tuples down the middle and populate two lists with the terms. `split_pairs()` takes a list of pairs and sorts it into terms that belong to Ego's generation and terms that belong to Ego's parents' generation.

In [6]:
def split_pairs(pairs: list) -> list:
    gn = []
    gn1 = []
    for pair in pairs:
        gn.append(pair[1])
        gn1.append(pair[0])
    
    return gn,gn1

### 3. Calculate probabilities

To calculate entropy, we need a probability distribution over the terms in one single generation of a kinship system. So let's start with a function that can calculate the probability of a particular term.

Given a term and the full list of terms in the same generation, `probability()` counts how many times that term exists in `generation` and divides that by the total length of `generation`.

In [7]:
def probability(term: str, generation: list) -> float:
    return generation.count(term)/len(generation)

When calculating mutual information, we also need the **conditional entropy** between the two generations of our system. To calculate this, we will need not only the probabilities of terms in a generation, but also the **joint probabilities** of every pair of terms across those two generations. In other words, we need to calculate the probabilities of our `get_pairs` output.

Given two terms, `joint_probability()` counts how many pairs made of those two terms exist in `pairs`, then divides that by the total length of `pairs`.

In [8]:
def joint_probability(term1: str, term2: str, pairs: list) -> float:
    pair = (term1,term2)
    return pairs.count(pair)/len(pairs)

### 4. Calculating entropy and mutual information

Entropy (in bits) is defined as 

$$
H(X) = -\sum_{x \in X}p(x) log_2p(x)
$$

or, in English, it is the inverse sum over a distribution X of the probability of y * the log probability of y.

Entropy is a measure of the average level of uncertainty about the possible outcomes of a variable.

First, let's define a function `entropy()` that will iterate over a generation of the kinship system and output the entropy of that generation. 

Note: we only need one generation's entropy score to calculate mutual information - we make the arbitrary choice to calculate the entropy of Ego's parents' generation later in this notebook.

In [9]:
def entropy(generation: list) -> list:
    entropy = 0
    for term in set(generation): # using a set as we want to count each unique term only once
        p = probability(term,generation)
        #print('entropy of',term,p*math.log(p))
        entropy += p*math.log2(p)
    return round(-entropy,5)

Conditional entropy of Y given X is defined as

$$
H(Y|X) = -\sum_{x \in X,y \in Y}p(x,y) log_2 {p(x,y) \over p(x)}
$$

or in English, the inverse sum over two distributions Y and X of the probability of each y * the log probability of each y given x.

Conditional entropy is the amount of information needed to describe the outcome of a random variable Y given that we already know the value of another random variable X.

To calculate it, we need the joint probability of each pair (given by `joint_probability()`) and the probability of one member of that pair (given by `probability()`). We can then calculate the conditional probability of parent term given child term as the joint probability of those terms over the probability of the parent term.

`conditional_entropy()` iterates over all pairs to output the conditional entropy of Ego's generation given Ego's parents' generation.

In [10]:
def conditional_entropy(gn: list, pairs:list) -> float:
    entropy = 0
    for x,y in set(pairs): # x = parent, y = child
        p_xy = joint_probability(x,y,pairs)
        p_y = probability(y,gn)
        if p_xy > 0 and p_y > 0:
            #print('p(', x, '|', y,') = ', p_xy/p_y, 'p(y) = ', p_y)
            entropy += p_xy * math.log2(p_xy/p_y)
    return round(-entropy,5)

Finally, mutual information is defined as

$$
I(X;Y) \equiv H(X) - H(X|Y)
$$

or in English, entropy of X minus the conditional entropy of X given Y.

In this study, it is equal to the entropy of Ego's parents' generation minus the conditional entropy of Ego's parents' generation given Ego's generation. It tells us how much mutual dependence there is between these two generations; i.e. how much we can predict about one by observing the other.

In [11]:
def mutual_information(pairs: list):
    gn,gn1 = split_pairs(pairs)
    e = entropy(gn1)
    ce = conditional_entropy(gn,pairs)
    mi = e - ce
    return round(mi,5)

In [12]:
# def mutual_information(e,ce):
#     return round(e - ce, 5)

### 5. Calculate MI for each language and save data

With all of the above infrastructure, we can calculate the mutual information between G0 and G+1 for all the languages in Kinbank and save this data to a separate .csv file.

First, a function that calculates the relevant data and stores it.

In [13]:
def write_data(pairs,results):
    gn,gn1 = split_pairs(pairs)
    egn = entropy(gn)
    egn1 = entropy(gn1)
    ce = conditional_entropy(gn,pairs)
    mi = mutual_information(pairs)
    
    results['mutual_information'] = mi
    results['entropy_gn'] = egn
    results['entropy_gn1'] = egn1
    results['conditional_entropy'] = ce
    results['variation_gn'] = len(set(gn))
    results['variation_gn1'] = len(set(gn1))
    results['number_of_pairs'] = len(set(pairs))
    
    return results

In [14]:
families = ['Afro-Asiatic','Algic','Arawakan','Atlantic-Congo','Austroasiatic','Austronesian','Cariban','Dravidian','Indo-European',
            'Nakh-Daghestanian','Nuclear Trans New Guinea', 'Other','Pama-Nyungan','Pano-Tacanan','Salishan','Sino-Tibetan',
            'Tai-Kadai','Tupian','Turkic','Uralic','Uto-Aztecan']

In [15]:
def calculate_MI(families: list,filename: str):
    df = []
    codes = []
    filepath = '../../kinbank-family/'
    
    for family in families:
        all_files = get_kb_files(filepath + family)
        
        for file in all_files:
            print(file)
            match = re.search("[a-z]{4}[0-9]{4}[a-z]?\\.", file)
            code = match.group()
            code = code[:len(code)-1]
            language = file.split('_' + code)[0]
            
            if 'Morgan' in language:
                match = re.search("Morgan[0-9]{4}",language)
                morgan = match.group()
                language = language.split(morgan + '_')[1]

            if code not in codes:
                codes.append(code)

                ks = get_kin_terms(filepath + family + '/' + file)

                pairs = get_pairs(ks)
                
                # g0,g1 = split_ks(ks)
                g0,g1 = split_pairs(pairs)

                g0_relatives,g1_relatives = split_ks(ks)

                if pairs: # if pairs is not empty

                    results = {}
                    results['language'] = language
                    results['language_family'] = family
                    results['code'] = code
                    results['simulation_code'] = code + '_REAL'
                    results['simulation'] = 'N'
                    # results['mutual_information'] = entropy(list(g1.values())) - conditional_entropy(list(g0.values()),pairs)
                    results['mutual_information'] = mutual_information(pairs)
                    # results['entropy_gn'] = entropy(list(g0.values()))
                    results['entropy_gn'] = entropy(g0)
                    # results['entropy_gn1'] = entropy(list(g1.values()))
                    results['entropy_gn1'] = entropy(g1)
                    # results['conditional_entropy'] = conditional_entropy(list(g0.values()),pairs)
                    results['conditional_entropy'] = conditional_entropy(g0,pairs)
                    # results['variation_gn'] = len(set(g0.values()))
                    results['variation_gn'] = len(g0_relatives)
                    # results['variation_gn1'] = len(set(gn1.values()))
                    results['variation_gn1'] = len(g1_relatives)
                    results['number_of_pairs'] = len(set(pairs))

                    df.append(results)
        
    pd.DataFrame(df).to_csv('../data/raw/' + filename + '.csv',index=False)
    
    return pd.DataFrame(df)

In [16]:
# calculate_MI(['Indo-European'],'test')

In [17]:
# calculate_MI(families,'kinbank_mi_FINAL')

## Simulating kinship systems

To investigate whether kinship systems have higher mutual information than chance, we can build a random baseline for each language to serve as a point of comparison.

To do this, we will take each language in our dataset, and randomly scramble which terms go with which relatives (within generations). This will randomise the syncretisms within the paradigm, breaking any predictable structure built by the internal co-selection process, while maintaining the amount of variation across the system overall.
 
We take the following steps:

1. Extract the kinship system of a language from kinbank.
2. Filter the two generations we are interested in.
3. Randomly reassign the kinship terms to new types.
4. Repeat the process 1000 times for each language.

We already have the infrastructure for the first two! `get_kin_terms()`,  `get_pairs()` and `split_pairs()` will do this for us. So let's skip to 3, and write a function that randomises which terms form pairs, assuming that we have already extracted the kinship system and filtered the relevant pairs.

### 3. Randomly rearrange kin terms

`kintypes.py` also contains a list of which kin types are in which generation, which we can use to split a kinship system by generation.

In [18]:
def split_ks(ks):
    gn = {}
    gn1 = {}
    for entry in ks:
        if entry in kt.generation_n:
            gn[entry] = ks[entry]
        elif entry in kt.generation_n1:
            gn1[entry] = ks[entry]
        else:
            pass

    return gn,gn1        

Once the kinship system is split, we can then shuffle one of the generations and recombine the system.

In [19]:
def shuffle_ks(ks):
    gn_terms = []
    gn,gn1 = split_ks(ks)
    
    new_ks = {}
    
    for term in gn:
        gn_terms.append(gn[term])

    random.shuffle(gn_terms)
    
    for i in range(len(gn)):
        # print(i)
        key = list(gn.keys())[i]
        new_ks[key] = gn_terms[i]
        
    print(entropy(list(new_ks.values())))

    return {**new_ks,**gn1}

In [20]:
# shuffle_ks(test_ks)

If we split the kinship system, shuffle the G0 terms, and stick the system back together, then we can use `get_pairs()` just as we did for the real kinship systems to filter out the two generations we're interested in. From there, we can calculate MI for each simulation and save the data in a similar way as before.

In [21]:
def simulate_ks(ks: dict) -> list:
    simulation = shuffle_ks(ks)
    shuffled_pairs = get_pairs(simulation)
    return shuffled_pairs

In [22]:
def simulate_MI(families: list, filename: str, times: int):
    df = []
    codes = []
    filepath = '../../kinbank-family/'
    
    for family in tqdm(families):
        all_files = get_kb_files(filepath + family)
    
        for file in all_files:
            match = re.search("[a-z]{4}[0-9]{4}[a-z]?\\.", file)
            code = match.group()
            code = code[:len(code)-1]
            language = file.split('_' + code)[0]
            
            if 'Morgan' in language:
                match = re.search("Morgan[0-9]{4}",language)
                morgan = match.group()
                language = language.split(morgan + '_')[1]

            if code not in codes:
                codes.append(code)

                ks = get_kin_terms(filepath + family + '/' + file)
                true_pairs = get_pairs(ks)
                
                tg0,tg1 = split_ks(ks)
                true_value = mutual_information(true_pairs)

                for i in range(times):
                    sim = shuffle_ks(ks)
                    g0,g1 = split_ks(sim)
                    pairs = get_pairs(sim)
                    if pairs:
                        
                        results = {}
                        results['language'] = language
                        results['language_family'] = family
                        results['code'] = code
                        results['simulation_code'] = code + '_' + str(i)
                        results['simulation'] = 'Y'
                        
                        # egn = entropy(list(g0.values()))
                        # print(egn)
                        # egn1 = entropy(list(g1.values()))
                        # ce = conditional_entropy(list(g0.values()),pairs)
                        # mi = egn1 - ce
    
                        results['mutual_information'] = mutual_information(pairs)
                        results['true_value'] = true_value
                        results['entropy_gn'] = entropy(split_pairs(pairs)[0])
                        results['entropy_gn1'] = entropy(split_pairs(pairs)[1])
                        results['conditional_entropy'] = conditional_entropy(split_pairs(pairs)[1],pairs)
                        results['variation_gn'] = len(tg0)
                        results['variation_gn1'] = len(tg1)
                        results['number_of_pairs'] = len(set(pairs))
#                         write_data(pairs,results)

                        df.append(results)
                
    
    pd.DataFrame(df).to_csv('../data/raw/' + filename + '.csv',index=False)
    
    return pd.DataFrame(df)

In [23]:
# simulate_MI(['Indo-European'],'test',10)

In [24]:
# simulate_MI(families,'simulated_mi_FINAL',1000)

## Edit Distance

In this section we calculate the normalised Levenshtein edit distance between pairs of kin terms in each language and generate a monte carlo sample of edit distance for each language so that we can calculate the z score (as well as the correlation between edit distances).

To do this we need:
* A function to calculate the levenshtein distance between two forms.
* A function to calculate the distance between two meanings, which should have e.g. MB and MBD be closer than MZ and MBD.
* A function to calculate the correlation between a table of form distances and a table of meaning distances.
* A function to generate a monte carlo sample of correlations (ie the correlation of scrambled, simulated data).


UPDATE: we are not just comparing all labels for all relatives, but rather all labels for all categories! This prevents languages being penalised for being compositional but not having all unique labels across aunts and uncles.

So the first thing we need to do is work out which categories each language has, and who are the children of which category members.

In [25]:
test_lang = random.choice(all_kb_files)

test_ks = get_kin_terms('../../kinbank/' + test_lang)

gn,gn1 = split_ks(test_ks)

In [26]:
def category_based_pairs(g1):
    
    # work out which g1 relatives share a term
    
    relatives = list(g1.keys())
    terms = list(g1.values())

    categories = []

    for term in terms:
        cat = []
        indices = [index for index, element in enumerate(terms) if element == term]
        for index in indices:
            cat.append(relatives[index])
        categories.append(cat)

    categories = [list(i) for i in set(map(tuple, categories))]
    
    # work out which children should share a term on this basis
    
    child_cats = []

    for cat in categories:
        new_category = []
        for individual in cat:
            for pair in kt.ics_pairs:
                if pair[0] == individual:
                    new_category.append(pair[1])
        child_cats.append(new_category)
        
    # and make a new list of parent-child pairs that should have equal semantic distance
    
    new_pairs = []
    
    for i in range(len(categories)):
        new_pairs += list(itertools.product(categories[i], child_cats[i]))
        
    return new_pairs


In [27]:
# category_based_pairs(gn1)

In [28]:
from Levenshtein import distance as lvs_dist

In [29]:
def edit_distance(g0,g1):
    edit_dists = []

    for relative1 in g0:
        term1 = g0[relative1]
        for relative2 in g1:
            data = {}
            term2 = g1[relative2]
            
            if relative1 == relative2:
                pass
            
            else:
            
                if len(term1) > 0 and len(term2) > 0:
                    dist = lvs_dist(term1,term2)/len(max(term1,term2))
                    edit_dists.append(dist)
                else:
                    pass

    return edit_dists

To calculate semantic distance:

In [30]:
def semantic_distance(g0,g1): # (df,ks,pairs):
    sem_distances = []
    
    pairs = category_based_pairs(g1)

    for relative1 in g0:
        for relative2 in g1:
            
            if relative1 == relative2:
                pass
            
            else:
            
                if len(g0[relative1]) > 0 and len(g1[relative2]) > 0:

                    if (relative1,relative2) in pairs or (relative2,relative1) in pairs:
                        distance = 1
                        sem_distances.append(distance)

                    else:
                        distance = 2
                        sem_distances.append(distance)
                        
#     df['semantic_distance'] = sem_distances
#     
#     return df
    return sem_distances


In [31]:
# semantic_distance(gn,gn1)

The above will get us a meaning distance measure where if you are the child of a person, your meaning distance will be 1, and if you are not, your distance will be 2.

This captures the kind of compositionality in meaning we're interested in (e.g. that children of a person will have a similar form to that person) but not all kinds of kinship compositionality e.g. the Swedish situation.

Might have to do something more complex with features, e.g.
male vs female
mother side vs father side
young vs old
child-of vs not child-of

Anyway, for now we can calculate the correlation between our edit distance and our semantic distance with scipy.

But to get a z-score, we can just reuse our shuffling infrastructure from the original simulation (repeated here).

In [32]:
def shuffle_ks(ks):
    gn_terms = []
    gn,gn1 = split_ks(ks)
    for term in gn:
        gn_terms.append(gn[term])

    random.shuffle(gn_terms)
    
    for i in range(len(gn)):
        # print(i)
        key = list(gn.keys())[i]
        ks[key] = gn_terms[i]

    return ks

In [33]:
# def shuffle_distance(distances):
#     shuffled_edit_dists = distances.copy()
#     random.shuffle(shuffled_edit_dists)
#     return shuffled_edit_dists

And finally combine all of that into a single function that calculates the true correlation between edit distance and semantic distance for each language; simulates that language 1000 times and calculates the same correlation; and saves a z-score for each language.

**EDIT: as of 25/06 we are no longer looking at semantic distance - instead, we are calculating the average edit distance across all parent-child pairs.**

In [34]:
def sample_distance(families,times,filename):
    output_df = []
    
    codes = []
    
    for family in tqdm(families):
        files = get_kb_files('../../kinbank-family/' + family)
        for file in files:
            
            match = re.search('[a-z]{4}[0-9]{4}[a-z]?\\.', file)
            code = match.group()
            code = code.split('.')[0]
            language = file.split('_' + code)[0]

            if 'Morgan' in language:
                match = re.search("Morgan[0-9]{4}",language)
                morgan = match.group()
                language = language.split(morgan + '_')[1]
                
            data = {}

            if code not in codes:
                codes.append(code)

                full_ks = get_kin_terms('../../kinbank/' + file)
                gn,gn1 = split_ks(full_ks)
#                 ks = {**gn,**gn1}
                

                edit = edit_distance(gn,gn1)
                sem = semantic_distance(gn,gn1)
        
                if len(edit) > 2:
                    true_corr = scipy.stats.pearsonr(np.array(edit),np.array(sem))[0]

                    sample_correlations = []
                    for i in range(times):
                        sample = shuffle_ks(full_ks)
                        sample_gn,sample_gn1 = split_ks(sample)
                        sample_edit = edit_distance(sample_gn,sample_gn1)
                        sample_sem = semantic_distance(sample_gn,sample_gn1)
                        sample_corr = scipy.stats.pearsonr(np.array(sample_edit),np.array(sample_sem))[0]
                        sample_correlations.append(sample_corr)

#                     print(sample_correlations)
                    mean = np.mean(sample_correlations)
                    print(mean)
                    sd = np.std(sample_correlations)
                    print(sd)
                    z = (true_corr - mean) / sd
                    print(z)

                    data['language'] = language
                    data['family'] = family
                    data['code'] = code
                    data['correlation'] = true_corr
                    data['mean'] = mean
                    data['sd'] = sd
                    data['z'] = z

                    output_df.append(data)
                
    pd.DataFrame(output_df).to_csv('../data/raw/' + filename + '.csv',index=False)

    return pd.DataFrame(output_df)
                    

In [35]:
def sample_distance(families,times,filename):
    output_df = []
    
    codes = []
        
    for family in tqdm(families):
        
        files = get_kb_files('../../kinbank-family/' + family)
        
        for file in files:

            print(file)
                
            match = re.search('[a-z]{4}[0-9]{4}[a-z]?\\.', file)
            code = match.group()
            code = code.split('.')[0]
            language = file.split('_' + code)[0]

            if 'Morgan' in language:
                match = re.search("Morgan[0-9]{4}",language)
                morgan = match.group()
                language = language.split(morgan + '_')[1]
                                
            data = {}

            if code not in codes:
                codes.append(code)

                full_ks = get_kin_terms('../../kinbank/' + file)
                pairs = get_pairs(full_ks)
                
                edit_distance = []
                for pair in pairs:
                    try:
                        ed = lvs_dist(pair[0],pair[1]) / max(len(pair[0]),len(pair[1]))
                        edit_distance.append(ed)
                    except:
                        continue
                try:            
                    true_mean = statistics.mean(edit_distance)
                except:
                    continue
#                 print(true_mean)

                sample_avgs = []
                for i in tqdm(range(times)):
                    sample = shuffle_ks(full_ks)
                    sample_pairs = get_pairs(sample)
                    distances = []
                    for pair in sample_pairs:
                        try:
                            sample_ed = lvs_dist(pair[0],pair[1]) / max(len(pair[0]),len(pair[1]))
                            distances.append(sample_ed)
                        except:
                            continue
                    sample_avg = statistics.mean(distances)
                    sample_avgs.append(sample_avg)

#                 print(sample_avgs)

# #                     print(sample_correlations)
                sample_mean = statistics.mean(sample_avgs)
#                 print(sample_mean)
                sd = statistics.stdev(sample_avgs)
#                 print(sd)
                if true_mean == sample_mean:
                    z = 0
                else:
                    z = (true_mean - sample_mean) / sd

#                 print(z)

                data['language'] = language
                data['family'] = family
                data['code'] = code
                data['average_edit_distance'] = true_mean
                data['sample_mean'] = sample_mean
                data['sd'] = sd
                data['z'] = -z

                output_df.append(data)
                
    pd.DataFrame(output_df).to_csv('../data/raw/' + filename + '.csv',index=False)

    return pd.DataFrame(output_df)

In [36]:
# sample_distance(['Indo-European'],100,'TEST')

In [37]:
# sample_distance(families, 1000, 'average_edit_distance')

In [38]:
cornish = get_kin_terms('../../kinbank/Cornish_corn1251.csv')
bende = get_kin_terms('../../kinbank/Bende_bend1258.csv')
tagalog = get_kin_terms('../../kinbank/Tagalog_taga1270.csv')
andi = get_kin_terms('../../kinbank/Andi_andi1255.csv')
hindi = get_kin_terms('../../kinbank/Hindi_hind1269.csv')

cornish_pairs = get_pairs(get_kin_terms('../../kinbank/Cornish_corn1251.csv'))

english_pairs = get_pairs(get_kin_terms('../../kinbank/English_stan1293.csv'))

bende_pairs = get_pairs(bende)

tagalog_pairs = get_pairs(tagalog)

andi_pairs = get_pairs(andi)

hindi_pairs = get_pairs(hindi)

print(hindi)
print(hindi_pairs)

{'fDD': 'dhevtī', 'meZ': 'dīdī', 'mF': 'bāp', 'mM': 'mātā', 'mSS': 'potā', 'mSD': 'potī', 'mDS': 'dhevtā', 'mFB': 'tāū', 'mFZ': 'buā', 'mFyB': 'cācā', 'mFeB': 'tāū', 'mFeZ': 'buā', 'mFyZ': 'buā', 'mBS': 'bhatījā', 'mBD': 'bhatījī', 'mZS': 'bhānjā', 'mZD': 'bhānjī', 'meBS': 'bhatījā', 'myBS': 'bhatījā', 'meBD': 'bhatījī', 'myBD': 'bhatījī', 'meZS': 'bhānjā', 'myZS': 'bhānjā', 'meZD': 'bhānjī', 'myZD': 'bhānjī', 'mBW': 'bhābahū', 'mZH': 'bahenoī', 'mWB': 'sālā', 'mHB': 'devar', 'mSW': 'bahū', 'mDH': 'jamāī', 'feZ': 'dīdī', 'fF': 'bāp', 'fM': 'mātā', 'fSS': 'potā', 'fSD': 'potī', 'fDS': 'dhevtā', 'fFB': 'tāū', 'fFZ': 'buā', 'fFyB': 'cācā', 'fFeB': 'tāū', 'fFeZ': 'buā', 'fFyZ': 'buā', 'fBS': 'bhatījā', 'fBD': 'bhatījī', 'fZS': 'bhānjā', 'fZD': 'bhānjī', 'feBS': 'bhatījā', 'fyBS': 'bhatījā', 'feBD': 'bhatījī', 'fyBD': 'bhatījī', 'feZS': 'bhānjā', 'fyZS': 'bhānjā', 'feZD': 'bhānjī', 'fyZD': 'bhānjī', 'fBW': 'bhābahū', 'fZH': 'bahenoī', 'fWB': 'sālā', 'fHB': 'devar', 'fSW': 'bahū', 'fDH': 'ja

In [39]:
dists = []
for pair in hindi_pairs:
    dists.append(lvs_dist(pair[0],pair[1]) / max(len(pair[0]),len(pair[1])))

statistics.mean(dists)



0.888043623043623

In [40]:
sample_avgs = []

for i in range(50):
    sample = shuffle_ks(hindi)
    sample_pairs = get_pairs(sample)
    distances = []
    for pair in sample_pairs:
        try:
            sample_ed = lvs_dist(pair[0],pair[1]) / max(len(pair[0]),len(pair[1]))
            distances.append(sample_ed)
        except:
            continue
    sample_avg = statistics.mean(distances)
    sample_avgs.append(sample_avg)

statistics.mean(sample_avgs)

0.9290237512487513

ottowa_ojibwa otta1242
yurok yuro 1248
wapishana wapi 1253
mokpwe mokp1239

In [44]:
# def get_pairs(ks: dict) -> list:
#     pairs_of_terms = []
#     parent_types = []

#     for pair in kt.pairs_no_parents:
#         if pair[0] in ks and pair[1] in ks:
#             pairs_of_terms.append((ks[pair[0]],ks[pair[1]]))
#             parent_types.append(pair[0])
                
#     return pairs_of_terms

In [42]:
ojibwa = get_kin_terms('../../kinbank/Ottowa_Ojibwa_otta1242.csv') # missing cell values, need to remove empty strings
yurok = get_kin_terms('../../kinbank/Yurok_yuro1248.csv') # nuncles and cousins split by gender, siblings split by age + gender
wapishana = get_kin_terms('../../kinbank/Wapishana_wapi1253.csv') # older siblings split by age and gender, younger not split by gender. cousins all the same. nuncles split by gender and side of the family
mokpwe = get_kin_terms('../../kinbank/Mokpwe_mokp1239.csv') # opposite gender cousins split by gender, siblings split by age, nuncles split by gender and mother's brother distinguished

print(split_ks(mokpwe)[0], '\n\n', split_ks(mokpwe)[1], '\n')
print(get_pairs(mokpwe))

{'mMeBS': 'mwanyango', 'mMyBS': 'mwanyango', 'mMeZS': 'mwanyango', 'mMyZS': 'mwanyango', 'mMeBD': 'ndomé', 'mMyBD': 'ndomé', 'mMeZD': 'ndomé', 'mMyZD': 'ndomé', 'fMeBS': 'ndomé', 'fMyBS': 'ndomé', 'fMeZS': 'ndomé', 'fMyZS': 'ndomé', 'fMeBD': 'mwanyango', 'fMyBD': 'mwanyango', 'fMeZD': 'mwanyango', 'fMyZD': 'mwanyango', 'mFeBS': 'mwanyango', 'mFyBS': 'mwanyango', 'mFeZS': 'mwanyango', 'mFyZS': 'mwanyango', 'mFeBD': 'ndomé', 'mFyBD': 'ndomé', 'mFeZD': 'ndomé', 'mFyZD': 'ndomé', 'fFeBS': 'ndomé', 'fFyBS': 'ndomé', 'fFeZS': 'ndomé', 'fFyZS': 'ndomé', 'fFeBD': 'mwanyango', 'fFyBD': 'mwanyango', 'fFeZD': 'mwanyango', 'fFyZD': 'mwanyango', 'meB': 'waólúlu', 'myB': 'wenongoni', 'meZ': 'waólúlu', 'myZ': 'wenongoni', 'feB': 'waólúlu', 'fyB': 'wenongoni', 'feZ': 'waólúlu', 'fyZ': 'wenongoni'} 

 {'mFeB': 'melá', 'mFyB': 'melá', 'mFeZ': 'atí', 'mFyZ': 'atí', 'mMeZ': 'atí', 'mMyZ': 'atí', 'fFeB': 'melá', 'fFyB': 'melá', 'fFeZ': 'atí', 'fFyZ': 'atí', 'fMeZ': 'atí', 'fMyZ': 'atí', 'mF': 'rángó', 'mM'

In [46]:
mutual_information(get_pairs(mokpwe))

0.72193

Lots of languages get an MI score of 0.9183 because that's what you get if your cousins and nuncles have an MI of 0, but your parents and siblings are distinct from them. This is not a problem! It is still interesting from a structural point of view - namely, that languages seek to distinguish parents and siblings from cousins and nuncles because you can specify people more informatively that way.