In [4]:
import csv # for writing dataframes to csv
import random # for making a random choice
import os # for scanning directories
import itertools
import string # for generating strings
from collections import Counter

import kintypes as kt # bringing large lists of kin types into the namespace
import math # for calculating logs
import pandas as pd
import re

testing = True # set to True to run code blocks with tests and examples
filtering = False # set to True to run the filtering process

# Internal co-selection

Internal co-selection refers to the tendency for kinship systems to have cross-generational consistency in the terminological distinctions or mergers that are made. That is, if your parents' elder brothers share a kin term, then so too will their children. Or if your parents' sisters are distinguished from your parents' brothers, so too will their children be distinguished. 

Imagine a kinship system like so: as in English, you call your parents' brothers are  *uncles*, and their sisters *aunts*. You call the child of your uncle a *chuncle*, and the child of your aunt a *chaunt*. Thus, you make the same sorts of distinctions among your parents' siblings' generation of kin as are made among your own generation of kin - and you can be certain about which children belong to which parents as a result. This is an example of internal co-selection.

In this notebook, we will gather information about the robustness of this tendency cross-linguistically, using data from Kinbank, a global database of kin terminology. We will also create simulations of existing kinship systems to find out whether internal co-selection is more common in kinship systems cross-linguistically (for a given amount of terminological variation) than we would expect by chance.

We will measure internal co-selection in terms of the **mutual information** between Generation N and Generation N+1 in a particular kinship system. That tells us how much information can be gained from one generation by observing the other - how certain we can be about which children 'go with' which parents. This can be calculated as the **entropy** of one generation (how much unpredictable variation there is) minus the **conditional entropy** between the two generations (how much unpredictability remains in one generation after observing another).

## The procedure

To calculate the mutual information (MI) of a particular kinship system, we must perform the following steps:

1. Extract kin terminology data from Kinbank for this language.
2. Condense the full kinship system down to the terms we are interested in: Ego's generation and Ego's parents' generation.
3. Calculate the probabilities of each kin term within the generation in which it belongs; and the probabilities of each parent-child pair.
4. Calculate entropy, conditional entropy, subtract them from each other to get the mutual information of the system.

After we get that going, we can do these same calculations on simulated kinship systems.

### Extract kin terminology from Kinbank

First, let's actually load our data in. The following function `get_kb_files()` pulls the full list of Kinbank filenames. Later, we can iterate through these to generate MI values for every language in our dataset.

In [5]:
def get_kb_files(path) -> list:
    files = []
    directory = os.scandir(path)
    for file in directory:
        files.append(file.name)
    return files

all_kb_files = get_kb_files('../languages/kinbank')

Using one of these filenames, we can extract the kin terminology from that file and populate a dictionary with it. We're only interested in two columns from the Kinbank data: `parameter`, which contains a short code indicating a **kin type**, and `word`, which contains the **kin term** associated with that kin type. An example of a row in the English data would be `mMeB, uncle`, where `mMeB` means 'male speaker's mother's older brother', and `uncle` is the term associated with that person.

In [6]:
def get_kin_terms(filepath: str) -> dict:
    kin_system = {}
    with open(filepath, encoding='utf8') as f:
        csv_reader = csv.DictReader(f)
        next(csv_reader) # to skip the header row
        for line in csv_reader:
            kin_type = line['parameter']
            kin_term = line['word']
            kin_system[kin_type] = kin_term
    return kin_system

Let's pick a random kinship system to test with throughout this notebook.

In [7]:
if testing:
    
    random.seed(47) # set a seed for reproducibility

    file = random.choice(all_kb_files) # pick a random filename from all_kb_files

    filepath = '../languages/kinbank/' # the filepath where the kinbank files are kept

    k = get_kin_terms(filepath + file)

    print(file,k)

Yidiny_yidi1250.csv {'mB': 'nganytyakuman', 'mZ': 'tyangkul', 'meB': 'yapa', 'myB': 'yapatyipa', 'mF': 'pimpi', 'mM': 'ngalpu', 'mS': 'karkun', 'mD': 'kalngkir', 'mFF': 'kamin', 'mFM': 'papim', 'mMF': 'ngatyim', 'mMM': 'kumpu', 'mSS': 'tyumpariy', 'mSD': 'tyumpariy', 'mDS': 'tyumpariy', 'mDD': 'tyumpariy', 'mFB': 'pimpi', 'mFZ': 'tyutyum', 'mMB': 'kalnga', 'mMZ': 'ngalpu', 'mBS': 'karkun', 'mBD': 'kalngkir', 'mFZD': 'manka', 'mMBD': 'manka', 'mFZS': 'manka', 'mMBS': 'manka', 'mH': 'mungka', 'mW': 'wakal', 'mZH': 'muwa', 'mWB': 'muwa', 'mWZ': 'wakal', 'mSW': 'tungkarr', 'mDH': 'tungkarr', 'meZ': 'tyangkul', 'myZ': 'tyangkul', 'mFeB': 'pimpi', 'mFyB': 'pimpi', 'mFeZ': 'tyutyum', 'mFyZ': 'tyutyum', 'mMeZ': 'ngalpu', 'mMyZ': 'ngalpu', 'mMeB': 'kalnga', 'mMyB': 'kalnga', 'meBS': 'karkun', 'myBS': 'karkun', 'meBD': 'kalngkir', 'myBD': 'kalngkir', 'mFeZS': 'manka', 'mFyZS': 'manka', 'mFeZD': 'manka', 'mFyZD': 'manka', 'mMeBS': 'manka', 'mMyBS': 'manka', 'mMeBD': 'manka', 'mMyBD': 'manka', 'mF

As we can see from printing the system, there's a lot of extra kin terms here that we don't need for our experiment today. We're only interested in Ego and Ego's parents' generations, but the system contains kin types like `mS` (male speaker's son) or `mDD` (make speaker's daughter's daughter). In the next section, we'll reduce `k` down to just the terms we're interested in.

### Condense the system down

The list of possible **kin types** is far larger and more unwieldy than the set of **kin terms** in any language. For instance, while 'father's elder brother' and 'father's younger brother' are not distinguished in English (both take the term *uncle*), these distinctions are indeed encoded by terminology in other languages, like Hindi.

We want create a data structure that pairs up parent types with the corresponding child types. This is because we're interested in whether kinship systems maintain patterns of terminological distinctions and mergers across these two generations, we will need to know which parent terms 'go with' which child terms.

In `kintypes`, you will find **a list of pairs of kin types**, where the first element in the pair is a parent type, and the second is their child; e.g. `mMeB` and `mMeBD` (mother's elder brother and mother's elder brother's daughter). We will be filtering our full kinship system according to this list of pairs; that is, both types in the pair need to be present to be counted in the MI calculation.

The following function takes a kinship system as input, and outputs a list of tuples. The first element in the tuple is the parent term, the second is the corresponding child term. 

In [8]:
def filter_ks(ks: dict) -> dict:
    filtered_ks = {}
    for pair in kt.ics_pairs:
        if pair[0] in ks and pair[1] in ks:
            filtered_ks[pair[0]] = ks[pair[0]]
            filtered_ks[pair[1]] = ks[pair[1]]
        else:
            pass

    return filtered_ks

In [9]:
def get_pairs(ks: dict) -> list:
    pairs_of_terms = []
    parent_types = []

    for pair in kt.ics_pairs:
        if pair[0] in ks and pair[1] in ks:
            pairs_of_terms.append((ks[pair[0]],ks[pair[1]]))
            parent_types.append(pair[0])
                
    return pairs_of_terms


But for our calculations, we'll still need to know which terms belong to which generation. Luckily, we know that the 0th element in each tuple is from Ego's parents' generation and the 1st element is from Ego's generation. So we can happily split these tuples down the middle and populate two lists with the terms.

In [10]:
def split_pairs(pairs: list) -> list:
    gn = []
    gn1 = []
    for pair in pairs:
        gn.append(pair[1])
        gn1.append(pair[0])
    
    return gn,gn1

To illustrate what these functions do, let's test them out with our random kinship system, `k`.

In [11]:
if testing:
    k_pairs = get_pairs(k)
    print(k_pairs)

[('ngalpu', 'yapa'), ('ngalpu', 'yapatyipa'), ('ngalpu', 'tyangkul'), ('ngalpu', 'tyangkul'), ('pimpi', 'yapa'), ('pimpi', 'yapatyipa'), ('pimpi', 'tyangkul'), ('pimpi', 'tyangkul'), ('kalnga', 'manka'), ('kalnga', 'manka'), ('kalnga', 'manka'), ('kalnga', 'manka'), ('tyutyum', 'manka'), ('tyutyum', 'manka'), ('tyutyum', 'manka'), ('tyutyum', 'manka'), ('ngalpu', 'yapa'), ('ngalpu', 'yapatyipa'), ('ngalpu', 'tyangkul'), ('ngalpu', 'tyangkul'), ('pimpi', 'yapa'), ('pimpi', 'yapatyipa'), ('pimpi', 'tyangkul'), ('pimpi', 'tyangkul'), ('kalnga', 'manka'), ('kalnga', 'manka'), ('kalnga', 'manka'), ('kalnga', 'manka'), ('tyutyum', 'manka'), ('tyutyum', 'manka'), ('tyutyum', 'manka'), ('tyutyum', 'manka')]


In [12]:
if testing:
    
    k_gn,k_gn1 = split_pairs(k_pairs)

    print("Ego's generation: ", k_gn, '\n')

    print("Ego's parents' generation: ", k_gn1)

Ego's generation:  ['yapa', 'yapatyipa', 'tyangkul', 'tyangkul', 'yapa', 'yapatyipa', 'tyangkul', 'tyangkul', 'manka', 'manka', 'manka', 'manka', 'manka', 'manka', 'manka', 'manka', 'yapa', 'yapatyipa', 'tyangkul', 'tyangkul', 'yapa', 'yapatyipa', 'tyangkul', 'tyangkul', 'manka', 'manka', 'manka', 'manka', 'manka', 'manka', 'manka', 'manka'] 

Ego's parents' generation:  ['ngalpu', 'ngalpu', 'ngalpu', 'ngalpu', 'pimpi', 'pimpi', 'pimpi', 'pimpi', 'kalnga', 'kalnga', 'kalnga', 'kalnga', 'tyutyum', 'tyutyum', 'tyutyum', 'tyutyum', 'ngalpu', 'ngalpu', 'ngalpu', 'ngalpu', 'pimpi', 'pimpi', 'pimpi', 'pimpi', 'kalnga', 'kalnga', 'kalnga', 'kalnga', 'tyutyum', 'tyutyum', 'tyutyum', 'tyutyum']


`get_pairs()` gives us a long list of pairs.

`split_pairs()` takes this long list of pairs and sorts it into terms that belong to Ego's generation and terms that belong to Ego's parents' generation. Importantly, since the order of the `pairs` list is preserved when we run `split_generations()`, we can still work out which terms form a parent-child pair by indexing `gn` and `gn1`.

Now we have our data structures, we can start to do some calculations.

### Calculating probabilities

To calculate entropy, we need a probability distribution over the terms in one single generation of a kinship system. So let's start with a function that can calculate the probability of a particular term.

Given a term and the full list of terms in the same generation, this function counts how many times that term exists in `generation` and divides that by the total length of `generation`.

In [13]:
def probability(term: str, generation: list) -> float:
    #print(generation.count(term),len(generation))
    return generation.count(term)/len(generation)

So if we pick a term at random from our Nogai kinship system, it will output the probability of picking that term.

In [20]:
if testing:
    print('Generation n+1')
    for term in set(k_gn1):
        print(term, probability(term,k_gn1))
    print('\nGeneration n')
    for term in set(k_gn):
        print(term,probability(term,k_gn))

Generation n+1
tyutyum 0.25
pimpi 0.25
ngalpu 0.25
kalnga 0.25

Generation n
manka 0.5
yapatyipa 0.125
yapa 0.125
tyangkul 0.25


When calculating mutual information, we also need the **conditional entropy** between the two generations of our system. To calculate this, we will need not only the probabilities of terms in a generation, but also the **joint probabilities** of every pair of terms across those two generations. In other words, we need to calculate the probabilities of our `get_pairs` output.

Given two terms, this function counts how many pairs made of those two terms exist in `pairs`, then divides that by the total length of `pairs`.

In [21]:
def joint_probability(term1: str, term2: str, pairs: list) -> float:
    pair = (term1,term2)
    # print(pairs.count(pair)/len(pairs))
    return pairs.count(pair)/len(pairs)

Once again, we can test this with a random pair from our list of pairs:

In [23]:
if testing:
    sum_jp = []
    for pair in set(k_pairs):
        print(pair)
        jp = joint_probability(pair[0],pair[1],k_pairs)
        print(jp)
        sum_jp.append(jp)
        # print(pair, jp)
    print(sum_jp, sum(sum_jp))

('pimpi', 'tyangkul')
0.125
('pimpi', 'yapa')
0.0625
('kalnga', 'manka')
0.25
('pimpi', 'yapatyipa')
0.0625
('ngalpu', 'tyangkul')
0.125
('ngalpu', 'yapa')
0.0625
('tyutyum', 'manka')
0.25
('ngalpu', 'yapatyipa')
0.0625
[0.125, 0.0625, 0.25, 0.0625, 0.125, 0.0625, 0.25, 0.0625] 1.0


Sometimes, these probability values will be really really small, but non-zero, leading to rounding errors where the probabilities do not sum to 1. To resolve this, we can normalise our probability distribution before we use it to do any further calculations.

In [125]:
def normalise(probabilities):
    """Edited from https://stackoverflow.com/questions/26916150/normalize-small-probabilities-in-python#26916260"""
    if sum(probabilities) > 0:
        factor = 1 / sum(probabilities)
        return [factor * p for p in probabilities]

In [26]:
if testing:
    sum_jp = []
    for pair in set(k_pairs):
        print(pair)
        jp = joint_probability(pair[0],pair[1],k_pairs)
        print(jp)
        sum_jp.append(jp)
    print(sum_jp, normalise(sum_jp))

('pimpi', 'tyangkul')
0.125
('pimpi', 'yapa')
0.0625
('kalnga', 'manka')
0.25
('pimpi', 'yapatyipa')
0.0625
('ngalpu', 'tyangkul')
0.125
('ngalpu', 'yapa')
0.0625
('tyutyum', 'manka')
0.25
('ngalpu', 'yapatyipa')
0.0625
[0.125, 0.0625, 0.25, 0.0625, 0.125, 0.0625, 0.25, 0.0625] [0.125, 0.0625, 0.25, 0.0625, 0.125, 0.0625, 0.25, 0.0625]


Now we can calculate probabilities, we can use these functions to calculate entropy, conditional entropy, and mutual information.

### Calculating entropy and mutual information

Entropy (in bits) is defined as 

$$
H(X) = -\sum_{x \in X}p(x) log_2p(x)
$$

or, in English, it is the inverse sum over a distribution X of the probability of y * the log probability of y.

Entropy is a measure of the average level of uncertainty about the possible outcomes of a variable.

The functions we defined above only calculate a single probability at a time, so our next functions will need to iterate over the kinship system in order to have a full probability distribution. First, let's define a function that will iterate over a generation of the kinship system to output the entropy of that generation. 

Note: we only need one generation's entropy score to calculate mutual information - we will make the arbitrary choice to calculate the entropy of Ego's parents' generation later in this notebook.

In [144]:
# def entropy(generation: list) -> list:
#     entropy = 0
#     probs = []
#     for term in set(generation): # using a set as we want to count each unique term only once
#         p = probability(term,generation)
#         probs.append(p)
#         #print('entropy of',term,p*math.log(p))
#     for p in normalise(probs):
#         entropy += p*math.log2(p)
#     return round(-entropy, 5)

In [156]:
def entropy(generation: list) -> list:
    entropy = 0
    for term in set(generation): # using a set as we want to count each unique term only once
        p = probability(term,generation)
        #print('entropy of',term,p*math.log(p))
        entropy += p*math.log2(p)
    return round(-entropy,5)

In [145]:
if testing:
    print(entropy(k_gn1))

2.0


Moving on, conditional entropy of Y given X is defined as

$$
H(Y|X) = -\sum_{x \in X,y \in Y}p(x,y) log_2 {p(x,y) \over p(x)}
$$

or in English, the inverse sum over two distributions Y and X of the probability of each y * the log probability of each y given x.

Conditional entropy is the amount of information needed to describe the outcome of a random variable Y given that we already know the value of another random variable X.

To calculate it, we need the joint probability of each pair (given by `joint_probability()`) and the probability of one member of that pair (given by `probability()`). We can then calculate the conditional probability of parent term given child term as the joint probability of those terms over the probability of the parent term.

As before, we will define a function that iterates over all pairs to output the conditional entropy of Ego's generation given Ego's parents' generation.

In [146]:
# def conditional_entropy(gn: list, pairs:list) -> float:
#     entropy = 0
#     jp = []
#     pr = []
#     probs = {}

#     for term in set(gn):
#         p = probability(term,gn)
#         probs[term] = p

#     # print(sum(probs.values()))
#     norm_probs = normalise(probs.values())
#     index = 0
#     for i in probs:
#         probs[i] = norm_probs[index]
#         index += 1

#     # print(probs)
#     # print(sum(norm_probs))
        
#     for x,y in set(pairs): # x = parent, y = child
#         # print(x,y)
#         p_xy = joint_probability(x,y,pairs)
#         p_y = probs[y]
#         jp.append(p_xy)
#         pr.append(p_y)
        
#     jp = normalise(jp)
    
#     for i in range(len(jp)):
#         if jp[i] > 0 and pr[i] > 0:
#             #print('p(', x, '|', y,') = ', p_xy/p_y, 'p(y) = ', p_y)
#             entropy += jp[i] * math.log2(jp[i]/pr[i])
#             # print(jp[i],pr[i])
#     return round(-entropy,5)

In [157]:
def conditional_entropy(gn: list, pairs:list) -> float:
    entropy = 0
    for x,y in set(pairs): # x = parent, y = child
        p_xy = joint_probability(x,y,pairs)
        p_y = probability(y,gn)
        if p_xy > 0 and p_y > 0:
            #print('p(', x, '|', y,') = ', p_xy/p_y, 'p(y) = ', p_y)
            entropy += p_xy * math.log2(p_xy/p_y)
    return round(-entropy,5)

In [158]:
conditional_entropy(k_gn,k_pairs)

# sum([0.14285714285714285,0.2857142857142857,0.2857142857142857,0.2857142857142857])

1.0

In [59]:
if testing:
    print(conditional_entropy(k_gn,k_pairs))

# [0.125, 0.0625, 0.25, 0.0625, 0.125, 0.0625, 0.25, 0.0625]
#manka 0.5 yapatyipa 0.125 yapa 0.125 tyangkul 0.25

1.0


Finally, mutual information is defined as

$$
I(X;Y) \equiv H(X) - H(X|Y)
$$

or in English, entropy of X minus the conditional entropy of X given Y.

In this study, it is equal to the entropy of Ego's parents' generation minus the conditional entropy of Ego's parents' generation given Ego's generation. It tells us how much mutual dependence there is between these two generations; i.e. how much we can know about one by observing the other.

So long as we make sure to input the right entropy and conditional entropy values, we only need a simple function for this one:

In [150]:
def mutual_information(pairs: list):
    gn,gn1 = split_pairs(pairs)
    e = entropy(gn1)
    ce = conditional_entropy(gn,pairs)
    mi = e - ce
    return round(mi,5)

In [151]:
if testing:
    print(mutual_information(k_pairs))

1.0


And there we have it! Step 4 complete. We can now take any (filtered) Kinbank file and output the mutual information between Ego's generation and Ego's parents' generation in that language.

But right at the beginning of this notebook, I mentioned using **simulations** to test the robustness of our claim that languages exhibit internal co-selection in their kinship systems. These simulations give us a baseline with which to compare the MI scores of real languages. Do languages across the world have greater mutual information between two generations than we would expect by chance?

## Simulations

If we want to argue that internal co-selection is a product of cultural evolution, we need to dispel the possibility that it occurs by chance.

To get an idea of how much information would be shared between two generations purely by chance, we need to create some randomly generated kinship systems. We can compare the MI of these simulations to the real languages to see whether the real languages have significantly greater mutual information between generations.

An important aspect of MI that we have not discussed so far: it is dependent on the amount of variation within the kinship system. A system with only one unique term in each generation would have MI of 0, which seems pretty terrible! But given this very limited variation (indeed, no variation), 0 is the highest MI such a language could have. As such, we perhaps need to modify our claim that kinship systems have "high MI" to be more specific: kinship systems in the wild have high MI *for the amount of variation in terminology they have*.

To compare real languages to simulations, we need a simulation which maintains the number of terms while randomising which child terms pair with which parent terms. To do this, we will take each language in our data, and randomly scramble which terms go with which types (within generation). This will randomise the syncretisms within the generations while maintaining the same amount of variation across the system overall.

To do this, we need to take the following steps:

1. Extract the kinship system of a language from kinbank (check!)
2. Filter the two generations we are interested in (check!)
3. Randomly reassign the kinship terms to new types.
4. Repeat the process a bunch of times for each language.

We already have the infrastructure for the first two! `get_kin_terms()`,  `get_pairs()` and `split_pairs()` will do this for us. So let's skip to 3, and write a function that randomises which terms form pairs, assuming that we have already extracted the kinship system and filtered the relevant pairs.

Remember that the order of `pairs` is preserved when we run `split_pairs()`. So when we pass `gn` and `gn1` to `shuffle_pairs()`, we know that we can re-unite our pairs by using the same index. Equally, when we shuffle `gn` and `gn1` in place, we know that we can safely combine them to make a new, randomised pair in place of the 'real' Nogai pair.

In [60]:
def split_ks(ks):
    filtered_ks = filter_ks(ks)
    gn = {}
    gn1 = {}
    for entry in filtered_ks:
        if entry in kt.generation_n:
            gn[entry] = filtered_ks[entry]
        elif entry in kt.generation_n1:
            gn1[entry] = filtered_ks[entry]
        else:
            pass

    return gn,gn1        

In [64]:
print(split_ks(k))

({'meB': 'yapa', 'myB': 'yapatyipa', 'meZ': 'tyangkul', 'myZ': 'tyangkul', 'mMeBS': 'manka', 'mMeBD': 'manka', 'mMyBS': 'manka', 'mMyBD': 'manka', 'mFeZS': 'manka', 'mFeZD': 'manka', 'mFyZS': 'manka', 'mFyZD': 'manka', 'feB': 'yapa', 'fyB': 'yapatyipa', 'feZ': 'tyangkul', 'fyZ': 'tyangkul', 'fMeBS': 'manka', 'fMeBD': 'manka', 'fMyBS': 'manka', 'fMyBD': 'manka', 'fFeZS': 'manka', 'fFeZD': 'manka', 'fFyZS': 'manka', 'fFyZD': 'manka'}, {'mM': 'ngalpu', 'mF': 'pimpi', 'mMeB': 'kalnga', 'mMyB': 'kalnga', 'mFeZ': 'tyutyum', 'mFyZ': 'tyutyum', 'fM': 'ngalpu', 'fF': 'pimpi', 'fMeB': 'kalnga', 'fMyB': 'kalnga', 'fFeZ': 'tyutyum', 'fFyZ': 'tyutyum'})


In [65]:
def shuffle_ks(ks):
    gn_terms = []
    gn,gn1 = split_ks(ks)
    for term in gn:
        gn_terms.append(gn[term])

    random.shuffle(gn_terms)
    
    for i in range(len(gn)):
        # print(i)
        key = list(gn.keys())[i]
        ks[key] = gn_terms[i]

    return ks

In [66]:
print(shuffle_ks(k))

{'mB': 'nganytyakuman', 'mZ': 'tyangkul', 'meB': 'manka', 'myB': 'manka', 'mF': 'pimpi', 'mM': 'ngalpu', 'mS': 'karkun', 'mD': 'kalngkir', 'mFF': 'kamin', 'mFM': 'papim', 'mMF': 'ngatyim', 'mMM': 'kumpu', 'mSS': 'tyumpariy', 'mSD': 'tyumpariy', 'mDS': 'tyumpariy', 'mDD': 'tyumpariy', 'mFB': 'pimpi', 'mFZ': 'tyutyum', 'mMB': 'kalnga', 'mMZ': 'ngalpu', 'mBS': 'karkun', 'mBD': 'kalngkir', 'mFZD': 'manka', 'mMBD': 'manka', 'mFZS': 'manka', 'mMBS': 'manka', 'mH': 'mungka', 'mW': 'wakal', 'mZH': 'muwa', 'mWB': 'muwa', 'mWZ': 'wakal', 'mSW': 'tungkarr', 'mDH': 'tungkarr', 'meZ': 'manka', 'myZ': 'manka', 'mFeB': 'pimpi', 'mFyB': 'pimpi', 'mFeZ': 'tyutyum', 'mFyZ': 'tyutyum', 'mMeZ': 'ngalpu', 'mMyZ': 'ngalpu', 'mMeB': 'kalnga', 'mMyB': 'kalnga', 'meBS': 'karkun', 'myBS': 'karkun', 'meBD': 'kalngkir', 'myBD': 'kalngkir', 'mFeZS': 'yapatyipa', 'mFyZS': 'manka', 'mFeZD': 'tyangkul', 'mFyZD': 'manka', 'mMeBS': 'manka', 'mMyBS': 'manka', 'mMeBD': 'tyangkul', 'mMyBD': 'manka', 'mFZeS': 'manka', 'mFZ

This should result in a version of our kinship system `k` where all of the generation n (Ego's generation) terms should be randomly shuffled w.r.t which keys they belong with. Everything else will remain the same. The terms we don't need for our analysis will be filtered out by our `get_pairs` function at a later stage.

We don't need to shuffle the generation n+1 terms - shuffling one generation is sufficient to break any correlation between them. Because each term in ego's generation pairs uniquely with a parent term (but not vice versa) this prevents a situation where new kin categories are artificially created.

In [67]:
if testing:
    # print(shuffle_ks(k))
    #print(get_pairs(shuffle_ks(k)))
    print(split_pairs(get_pairs(shuffle_ks(k))))

(['manka', 'manka', 'manka', 'tyangkul', 'manka', 'manka', 'manka', 'tyangkul', 'manka', 'manka', 'manka', 'manka', 'manka', 'yapatyipa', 'yapa', 'tyangkul', 'tyangkul', 'manka', 'yapa', 'manka', 'tyangkul', 'manka', 'yapa', 'manka', 'manka', 'yapatyipa', 'manka', 'manka', 'manka', 'manka', 'tyangkul', 'manka'], ['ngalpu', 'ngalpu', 'ngalpu', 'ngalpu', 'pimpi', 'pimpi', 'pimpi', 'pimpi', 'kalnga', 'kalnga', 'kalnga', 'kalnga', 'tyutyum', 'tyutyum', 'tyutyum', 'tyutyum', 'ngalpu', 'ngalpu', 'ngalpu', 'ngalpu', 'pimpi', 'pimpi', 'pimpi', 'pimpi', 'kalnga', 'kalnga', 'kalnga', 'kalnga', 'tyutyum', 'tyutyum', 'tyutyum', 'tyutyum'])


In [68]:
if testing:
    sim = shuffle_ks(k)
    sim_pairs = get_pairs(sim)
    sim_gn,sim_gn1 = split_pairs(sim_pairs)
    # print(Counter(sim_gn))
    print(sim_pairs)
    print('\n\n', k_pairs)


[('ngalpu', 'yapa'), ('ngalpu', 'manka'), ('ngalpu', 'tyangkul'), ('ngalpu', 'manka'), ('pimpi', 'yapa'), ('pimpi', 'manka'), ('pimpi', 'tyangkul'), ('pimpi', 'manka'), ('kalnga', 'manka'), ('kalnga', 'manka'), ('kalnga', 'manka'), ('kalnga', 'manka'), ('tyutyum', 'manka'), ('tyutyum', 'yapatyipa'), ('tyutyum', 'manka'), ('tyutyum', 'manka'), ('ngalpu', 'manka'), ('ngalpu', 'manka'), ('ngalpu', 'manka'), ('ngalpu', 'yapa'), ('pimpi', 'manka'), ('pimpi', 'manka'), ('pimpi', 'manka'), ('pimpi', 'yapa'), ('kalnga', 'manka'), ('kalnga', 'tyangkul'), ('kalnga', 'manka'), ('kalnga', 'yapatyipa'), ('tyutyum', 'manka'), ('tyutyum', 'tyangkul'), ('tyutyum', 'manka'), ('tyutyum', 'tyangkul')]


 [('ngalpu', 'yapa'), ('ngalpu', 'yapatyipa'), ('ngalpu', 'tyangkul'), ('ngalpu', 'tyangkul'), ('pimpi', 'yapa'), ('pimpi', 'yapatyipa'), ('pimpi', 'tyangkul'), ('pimpi', 'tyangkul'), ('kalnga', 'manka'), ('kalnga', 'manka'), ('kalnga', 'manka'), ('kalnga', 'manka'), ('tyutyum', 'manka'), ('tyutyum', 'man

Now we can treat `sim_pairs` just as we treated `pairs`! Let's calculate the entropy, conditional entropy, and mutual information of this simulated system.

In [72]:
if testing:
    sim = shuffle_ks(k)
    sim_pairs = get_pairs(sim)
    sim_gn,sim_gn1 = split_pairs(sim_pairs)
    e = entropy(sim_gn1)
    ce = conditional_entropy(sim_gn,sim_pairs)
    mi = mutual_information(sim_pairs)
    print('Simulation: ', e,ce,mi)
    print('Real: ', entropy(k_gn1),conditional_entropy(k_gn,k_pairs),mutual_information(k_pairs))

Simulation:  2.0 1.8732093304262503 0.12679066957374974
Real:  2.0 1.0 1.0


Wait! One of these values are exactly the same as the real kinship system! I thought this was a randomised simulation - what gives? 

Variation gives! Entropy remains the same regardless, because the amount of variation in the simulation **does not change** by design.

The MI of these two systems (and by extension, the conditional entropy) **does** vary, which is what we want. Let's try with another language:

In [130]:
if testing:
    # file2 = random.choice(all_kb_files)
    # print(file2)
    # k2 = get_kin_terms(filepath + 'Shoshoni_shos1248.csv')
    k2 = get_kin_terms(filepath + 'Southeast_Ambrym_sout2859.csv')
    # k2 = get_kin_terms(filepath + file2)
    # print(k2)

    k2_pairs = get_pairs(k2)
    print(set(k2_pairs))
    k2_gn,k2_gn1 = split_pairs(k2_pairs)
    print(Counter(k2_gn))
    k2_e = entropy(k2_gn1)
    k2_ce = conditional_entropy(k2_gn,k2_pairs)
    k2_mi = mutual_information(k2_pairs)
    
    k2_sim = shuffle_ks(k2)
    k2_sim_pairs = get_pairs(k2_sim)
    k2_sim_gn,k2_sim_gn1 = split_pairs(k2_sim_pairs)
    # print(k2_sim_gn)
    print(Counter(k2_sim_gn))
    k2_sim_e = entropy(k2_sim_gn1)
    k2_sim_ce = conditional_entropy(k2_sim_gn,k2_sim_pairs)
    k2_sim_mi = mutual_information(k2_sim_pairs)
    
    print('Real:', k2_e,k2_ce,k2_mi)
    print('Simulation:',k2_sim_e,k2_sim_ce,k2_sim_mi)
    
    print(len(k2_sim),len(k2_pairs))
    
    # for i in k2_pairs:
    #     if i not in k2_sim:
    #         print(i)
            
    # for i in k2_sim:
    #     if i not in k2_pairs:
    #         print(i)


{('pap', 'hinne'), ('metuo', 'avukokon'), ('metuo', 'mase'), ('tame', 'hinne'), ('nine', 'tu'), ('nine', 'hinne'), ('tine', 'avukokon'), ('tine', 'mase'), ('pap', 'tu'), ('tame', 'tu')}
Counter({'hinne': 16, 'tu': 14, 'mase': 8, 'avukokon': 8})
pap hinne
metuo avukokon
metuo mase
tame hinne
nine tu
nine hinne
tine avukokon
tine mase
pap tu
tame tu
pap hinne
metuo avukokon
metuo mase
tame hinne
nine tu
nine hinne
tine avukokon
tine mase
pap tu
tame tu
Counter({'tu': 14, 'hinne': 13, 'avukokon': 10, 'mase': 9})
pap mase
tine mase
pap hinne
metuo mase
tine hinne
pap avukokon
metuo hinne
tine avukokon
tame mase
pap tu
metuo avukokon
tine tu
metuo tu
nine mase
tame hinne
nine hinne
tame avukokon
tame tu
nine avukokon
nine tu
pap mase
tine mase
pap hinne
metuo mase
tine hinne
pap avukokon
metuo hinne
tine avukokon
tame mase
pap tu
metuo avukokon
tine tu
metuo tu
nine mase
tame hinne
nine hinne
tame avukokon
tame tu
nine avukokon
nine tu
Real: 2.2571523171759043 1.3238390641791251 0.933313252

Now we see that while the entropy of our new language and its simulation are equal, the conditional entropy for the simulation is greater and therefore the mutual information of the simulation is lower. What about if we did this 1000 times? How often would the mutual information of the simulation be lower then?

## Tidying up

We have all the pieces we need now to calculate MI and simulate kinship systems - all we need to do is write a few more functions that stick all of those pieces together in a neat parcel.

First, a function that takes pairs and spits out entropy, conditional entropy, and MI:

Second, a function that builds a simulated list of pairs when we pass in a kinship system:

In [57]:
# def simulate_ks(ks: dict) -> list:
#     pairs = get_pairs(ks)
#     if pairs:
#         gn,gn1 = split_pairs(pairs)
#         simulation = shuffle_pairs(gn,gn1)
#         return simulation

In [131]:
def simulate_ks(ks: dict) -> list:
    simulation = shuffle_ks(ks)
    shuffled_pairs = get_pairs(simulation)
    return shuffled_pairs

In [132]:
def write_data(pairs,results):
    gn,gn1 = split_pairs(pairs)
    egn = entropy(gn)
    egn1 = entropy(gn1)
    ce = conditional_entropy(gn,pairs)
    mi = mutual_information(pairs)
    
    results['mutual_information'] = mi
    results['entropy_gn'] = egn
    results['entropy_gn1'] = egn1
    results['conditional_entropy'] = ce
    results['variation_gn'] = len(set(gn))
    results['variation_gn1'] = len(set(gn1))
    results['number_of_pairs'] = len(set(pairs))
    
    return results

And a couple of functions that put everything together, saves the results to a separate file, and output a `pandas` dataframe so that we can take a good look. `ics_simulation` takes the full list of Kinbank filenames, extracts the relevant kin terms, performs the randomisation simulation on it a specified number of times, calculates entropy, conditional entropy, and MI for each simulation, and saves all that data to a separate file. It also performs some regex magic on the filename so that we get each language's unique code as well as each language's name in full.

In [152]:
def ics_simulation(families: list, filepath: str, filename: str, times: int):
    df = []
    codes = []
    
    for family in families:
        all_files = get_kb_files(filepath + family)
    
        for file in all_files:
            match = re.search('[a-z]{4}[0-9]{4}[a-z]?', file)
            code = match.group()
            language = file.split('_' + code)[0]

            if code not in codes:
                codes.append(code)

                ks = get_kin_terms(filepath + family + '/' + file)

                for i in range(times):
                    pairs = simulate_ks(ks)
                    if pairs:
                        results = {}
                        results['language'] = language
                        results['language_family'] = family
                        results['code'] = code
                        results['simulation_code'] = code + '_' + str(i)
                        results['simulation'] = 'Y'
                        write_data(pairs,results)

                        df.append(results)
    
    pd.DataFrame(df).to_csv('../data/raw/' + filename + '.csv',index=False)
    
    return pd.DataFrame(df)

`ics_real` performs similarly to `ics_simulation`, but instead of performing the randomisation, it calculates entropy, conditional entropy, and MI for the language as-is. It does this for every file in the Kinbank data and saves the data to a separate file.

In [140]:
def ics_real(families: list,filepath: str,filename: str):
    df = []
    codes = []
    
    for family in families:
        all_files = get_kb_files(filepath +  family)
        
        for file in all_files:
            print(file)
            match = re.search("[a-z]{4}[0-9]{4}[a-z]?", file)
            code = match.group()
            language = file.split('_' + code)[0]

            if code not in codes:
                codes.append(code)

                ks = get_kin_terms(filepath + family + '/' + file)

                pairs = get_pairs(ks)

                if pairs: # if pairs is not empty
                    mi = mutual_information(pairs)

                    results = {}
                    results['language'] = language
                    results['language_family'] = family
                    results['code'] = code
                    results['simulation_code'] = code + '_REAL'
                    results['simulation'] = 'N'
                    write_data(pairs,results)

                    df.append(results)
        
    pd.DataFrame(df).to_csv('../data/raw/' + filename + '.csv',index=False)
    
    return pd.DataFrame(df)

## Let's go!

If we want to create a dataset from the full set of filtered language data, we just have to run:

In [142]:
families = ['Austronesian','Cariban','Dravidian','Indo-European','Nakh-Daghestanian','Nuclear Trans New Guinea',
            'Other','Pama-Nyungan','Pano-Tacanan','Salishan','Sino-Tibetan','Tai-Kadai','Tupian',
            'Turkic','Uralic','Uto-Aztecan']

language_filepath = '../languages/kinbank-family/'

In [160]:
ics_real(families,language_filepath, 'full-kinship-mi-data')

Sa_saaa1241.csv
Namakura_nama1268b.csv
Woleaian_wole1240.csv
Nengone_neng1238.csv
Ontong_Java_onto1237.csv
Pama_(Paamanese)_paam1238.csv
Sa_a_saaa1240.csv
Tigak_tiga1245.csv
Southwest_Tanna_sout2869.csv
Maranao_(Lanao_Moro)_mara1404.csv
Wiwirano_wiwi1237.csv
Maguindanao_(Magindonao_Moro)_magu1243.csv
Penrhyn_(Tongareva)_penr1237.csv
Sungwaloge_(Nalemba_Edward)_mari1426g.csv
Tolaki_tola1247.csv
Yabem_yabe1254.csv
Yami_Tao__yami1254.csv
Merlav_merl1237.csv
Māori_maor1246.csv
West_Coast_Bajau_west2560.csv
Sungwaloge_(Nalemba_Simeone_Tari)_mari1426f.csv
Tausug_(Sulu_Moro)_taus1251.csv
Tasiko_tasi1237.csv
Mussau-Emira_muss1246.csv
Pampanga_(Kampampangan)_pamp1243.csv
Sungagage_mari1426i.csv
Rade_(Rhade)_rade1240.csv
Maori_maor1246.csv
Mekeo_meke1243.csv
Tialo_(Tomini)_tomi1243.csv
Morgan1871_Kings_Mill_Islands_gilb1244.csv
Satawalese_sata1237.csv
Senbarei_unua1237.csv
Mekeo_(West)_meke1243b.csv
Takuu_taku1257.csv
Tombulu_tomb1243.csv
Rotumans_rotu1241.csv
Minangkabau_mina1268.csv
Mekongga_

Unnamed: 0,language,language_family,code,simulation_code,simulation,mutual_information,entropy_gn,entropy_gn1,conditional_entropy,variation_gn,variation_gn1,number_of_pairs
0,Sa,Austronesian,saaa1241,saaa1241_REAL,N,0.00000,1.95021,1.00000,1.00000,4,2,8
1,Namakura,Austronesian,nama1268b,nama1268b_REAL,N,0.49261,2.23593,1.49261,1.00000,5,3,10
2,Woleaian,Austronesian,wole1240,wole1240_REAL,N,0.00000,-0.00000,1.91830,1.91830,1,4,4
3,Nengone,Austronesian,neng1238,neng1238_REAL,N,0.69802,1.75343,2.45321,1.75519,5,6,14
4,Ontong_Java,Austronesian,onto1237,onto1237_REAL,N,0.00000,1.00000,1.53049,1.53049,2,3,6
...,...,...,...,...,...,...,...,...,...,...,...,...
798,Luiseńo,Uto-Aztecan,luis1253,luis1253_REAL,N,1.82193,2.62193,2.62193,0.80000,8,7,13
799,Cahuilla,Uto-Aztecan,cahu1264,cahu1264_REAL,N,1.63910,2.48346,2.78064,1.14154,7,8,16
800,Lower_Pima_Pima_Bajo,Uto-Aztecan,pima1248,pima1248_REAL,N,0.00000,-0.00000,1.00000,1.00000,1,2,2
801,Chemehuevi,Uto-Aztecan,chem1251,chem1251_REAL,N,1.00000,1.50000,1.50000,0.50000,3,3,4


And for the simulations:

In [162]:
ics_simulation(families,language_filepath,'simulated-kinship-mi-data',1000)

Unnamed: 0,language,language_family,code,simulation_code,simulation,mutual_information,entropy_gn,entropy_gn1,conditional_entropy,variation_gn,variation_gn1,number_of_pairs
0,Sa,Austronesian,saaa1241,saaa1241_0,Y,0.0,1.95021,1.0,1.0,4,2,8
1,Sa,Austronesian,saaa1241,saaa1241_1,Y,0.0,1.95021,1.0,1.0,4,2,8
2,Sa,Austronesian,saaa1241,saaa1241_2,Y,0.0,1.95021,1.0,1.0,4,2,8
3,Sa,Austronesian,saaa1241,saaa1241_3,Y,0.0,1.95021,1.0,1.0,4,2,8
4,Sa,Austronesian,saaa1241,saaa1241_4,Y,0.0,1.95021,1.0,1.0,4,2,8
...,...,...,...,...,...,...,...,...,...,...,...,...
802995,Aztec,Uto-Aztecan,clas1250,clas1250_995,Y,0.0,1.00000,1.0,1.0,2,2,4
802996,Aztec,Uto-Aztecan,clas1250,clas1250_996,Y,0.0,1.00000,1.0,1.0,2,2,4
802997,Aztec,Uto-Aztecan,clas1250,clas1250_997,Y,0.0,1.00000,1.0,1.0,2,2,4
802998,Aztec,Uto-Aztecan,clas1250,clas1250_998,Y,0.0,1.00000,1.0,1.0,2,2,4


If we wanted to run a simulation on a single language family:

In [139]:
ics_simulation(['Austronesian'],family_filepath,10)

Unnamed: 0,language,language_family,code,simulation_code,simulation,mutual_information,entropy_gn,entropy_gn1,conditional_entropy,variation_gn,variation_gn1,number_of_pairs
0,Sa,Austronesian,saaa1241,saaa1241_0,Y,-2.220446e-16,1.950212,1.000000,1.000000,4,2,8
1,Sa,Austronesian,saaa1241,saaa1241_1,Y,-2.220446e-16,1.950212,1.000000,1.000000,4,2,8
2,Sa,Austronesian,saaa1241,saaa1241_2,Y,-2.220446e-16,1.950212,1.000000,1.000000,4,2,8
3,Sa,Austronesian,saaa1241,saaa1241_3,Y,-2.220446e-16,1.950212,1.000000,1.000000,4,2,8
4,Sa,Austronesian,saaa1241,saaa1241_4,Y,-2.220446e-16,1.950212,1.000000,1.000000,4,2,8
...,...,...,...,...,...,...,...,...,...,...,...,...
2045,Takia,Austronesian,taki1248,taki1248_5,Y,1.326965e-01,1.356039,1.932112,1.799415,3,4,10
2046,Takia,Austronesian,taki1248,taki1248_6,Y,5.755225e-02,1.325785,1.932112,1.874559,3,4,11
2047,Takia,Austronesian,taki1248,taki1248_7,Y,4.905052e-02,1.421912,1.932112,1.883061,3,4,11
2048,Takia,Austronesian,taki1248,taki1248_8,Y,1.010305e-01,1.325785,1.932112,1.831081,3,4,11
