# Evolution of Culture and Institutions

## Introduction

Welcome to a project that I have been working on, on and off, for some time. The basic idea is to use information on language and geography to supplement cross-cultural data sets such as George Murdock's [**Ethnographic Atlas**](http://intersci.ss.uci.edu/wiki/index.php/Ethnographic_Atlas). The project involves creating tools for recreating migratory histories (really, distributions over histories), drawing on some of Python's wonderful geography and geometry packages, and then matching those histories with ethnographic data.

In the end, the hope is that I can get a feel for the nature of cultural evolution, and the mutual dependencies between culture, environment, and technological progress. 

## Historical Recreation

The first step in the project is to recreate histories (or really distributions over possible migratory histories, based on what is known, and what seems most likely). Accordingly, in this initial notebook, I develop some Python tools for estimating branching times and locations. This approach is borrowed wholly from the historical linguistics literature, with a few tweaks to make it work better with available data. 

## Packages and Setup

I have created (rather haphazardly, as the experimental process is ongoing) two modules, `PyIEClasses` and `PyIETools`, for which I hope to cook up thorough documentation, since the way I have conceptualized and organized phylogenetic relationships is, I think, a little different from the usual. Anyways, here we import the required packages:

In [1]:
import os
import pandas as pd
import pathlib
import numpy as np
import re
import collections
import matplotlib.pyplot as plt
import random
from scipy.stats import norm

# Some Kung-Fu needed to go to the next directory and import everything...

start_dir = os.getcwd()
python_dir = pathlib.Path(os.getcwd()).parts[:-1] + ('Python',)
os.chdir(pathlib.Path(*python_dir))

import PyInstEvo

os.chdir(start_dir)
os.chdir('..')

pd.set_option('display.width', 1000)
np.set_printoptions(linewidth=120)

%matplotlib inline

## Importing and Arranging Basic Data

The data set consists of so-called [**Swadesh lists**](https://en.wikipedia.org/wiki/Swadesh_list) for approximately 4500 languages, which I have obtained from the [**Automatic Similarity Judgement Program**](http://asjp.clld.org/). I have taken this data and merged it with the **Ethnographic Atlas** to the best of my ability. I have also merged in some information on climate and geography. Most of this was done in **`Stata`**, but might be improved on now that I have a better handle on Python.

I have also, to the best of my ability, added in [**Merritt Ruhlen's**](https://en.wikipedia.org/wiki/Merritt_Ruhlen) linguistic classifications from his **Languages of the World**. I did this only because, as a [**Joseph Greenberg**](https://en.wikipedia.org/wiki/Joseph_Greenberg) enthusiast, Ruhlen likes reduction so his classification system made the number of different language stocks to deal with much smaller. In retrospect, nothing really depends upon this and I might be injecting controversy into the project by using these classifications, even if my purpose was pragmatic. 

The raw data that I use in the project should all be in the `\\IEData` folder in this repository. The source of my **Ethnographic Atlas** data is the [**World Cultures Journal**](http://www.worldcultures.org/).  

One last note: Joseph Greenberg was a New Yorker from Brooklyn, and [**Morris Swadesh**](https://en.wikipedia.org/wiki/Morris_Swadesh) was employed by the City University of New York, like myself, until he was fired in 1949 during the Red Scare for "being a communist."

Anyways...a first step is reading the data in...

In [10]:
Data = pd.read_stata(os.path.join(os.getcwd(), 'IEData', 'words_4_useMerged.dta'))

In [11]:
# A look at all the columns

print(Data.columns)

Index(['name', 'trimFlag', 'ruhlen_1', 'ruhlen_2', 'ruhlen_3', 'lat1', 'lon1', 'eaName', 'wordCount', 'trimFlag2',
       ...
       'v99', 'vclimate1', 'vclimate2', 'v1000', 'vmet', 'vweave', 'vpot', 'vtechcomp', 'langnotes', '_merge'], dtype='object', length=268)


Note how the variables `ruhlen_1`, `ruhlen_2`, etc. appear. These variables have a nested structure, so in the code further below I create string variables that become progressively more specific, which lets me render the classification as nested panels. That comes in due time; for now, a useful first thing to have is a means of adding information about the nesting structure.

What follows is an example where I shift all the information to the right for a particular language stock (Altaic) and then fill in a new base grouping, which keeps together Japanese and Korean languages as a first branching. Note that there are [two levels of controversy here](https://en.wikipedia.org/wiki/Altaic_languages) - whether or not Japanese and Korean should be together, and also whether or not they should be included in Altaic at all. 
 

In [12]:
# Shift every classification level one column to the right for the Altaic stock
altaic = Data['ruhlen_1'] == 'ALTAIC'
for i in reversed(range(2, 16)):
    to  = 'ruhlen_' + str(i + 1)
    fro = 'ruhlen_' + str(i)
    Data.loc[altaic, to] = Data.loc[altaic, fro]

ruhlenTree = ['ruhlen_' + str(i) for i in range(1, 17)]

# Fill in the new second level. Indexing with .loc on the frame itself
# (rather than Data['ruhlen_2'].loc[...]) avoids chained assignment.
# Note that the last three masks are not restricted to the Altaic rows.
Data.loc[altaic, 'ruhlen_2'] = 'AltaicProp'
Data.loc[Data['ruhlen_3'] == '', 'ruhlen_2']                  = 'JaponicKorean'
Data.loc[Data['ruhlen_3'] == 'Japanese-Ryukyuan', 'ruhlen_2'] = 'JaponicKorean'
Data.loc[Data['ruhlen_3'] == 'Ryukyuan', 'ruhlen_2']          = 'JaponicKorean'


Now, let's get down to details and sort the data according to the tree:

In [13]:
Data.sort_values(by=ruhlenTree, inplace=True)

One aspect of the project is that 4500 language groups have proven to be too many for me to deal with all at once! I therefore created a variable called `trimFlag` which I use to trim down the list. I have made some modifications to this file at various times for reasons I forget, but these are all catalogued in the file `TrimMod.txt`. To read in this file and arrange it isn't so bad:

In [14]:
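trimList = []
try:
    # The with-block closes the file even if reading fails part-way
    with open(os.path.join(os.getcwd(), 'IEData', 'TrimMod.txt'), 'r') as trimFile:
        for line in trimFile:
            # Strip quotation marks, backslashes, and trailing whitespace
            trimList.append(line.replace('\"', '').replace('\\', '').rstrip())
except FileNotFoundError:
    print('Wrong computer!')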

We wish to keep the languages in the above list because they are important. I envision wanting to change the list at some point, so it's useful to have it in this form, where we can add or remove languages without much trouble. We use the variable `trimFlag` to mark with a one the languages we want to exclude; hence we keep all the languages with a `trimFlag` of zero.

In [15]:
Data.loc[Data.name.isin(trimList), 'trimFlag'] = 0
Data = Data.loc[Data['trimFlag'] == 0]

print(Data['ruhlen_1'].value_counts())

AMERIND      205
INDOPAC      126
NIGERKORD    110
AUSTRIC      105
INDOHITT      96
NILOSAHAR     93
SINOTIBET     86
AUSTRAL       84
AFROASIA      77
ALTAIC        39
CAUCASIAN     24
URALICYUK     23
NADENE        21
ELAMODRA      17
KHOISAN       17
BURUSHASK     10
ESKIMOAL       9
KHET           6
CHUKCHI        5
NAHALI         2
BASQUE         1
SUMERIAN       1
GILYAK         1
Name: ruhlen_1, dtype: int64


We also need to get rid of a few pidgin languages and things of that nature. Moreover, we need to clean up some of the expiry dates, as languages that have effectively existed until the present don't need to be treated as moribund (treating them so adds parameters to the modeling below).

For now, we also suppose that the standard deviation of expiration dates is 40 (rather arbitrarily, but this really means the date is most likely within plus or minus a century).

Line by line, I:
- Drop pidgin languages
- Drop languages that don't have categories
- If a language died too recently, make it non-expired
- Fill in a value for the standard deviation of the expiration date
- If a language is missing a name, fill in the Ethnographic Atlas name
- Make a container for a dead language dummy

In [16]:
Data = Data.loc[Data.ruhlen_3.str.contains('based') == False]    # drop pidgin languages

Data['ruhlen_1'] = Data['ruhlen_1'].replace('', np.nan)          # empty category -> missing

Data.dropna(subset=['ruhlen_1'], inplace=True)

Data.loc[Data['ex_date'] > 1900, 'ex_date'] = np.nan             # died too recently: not expired

Data['ex_date_sd'] = np.nan

# NaN != NaN is always True, so test for a recorded date with notna()
Data.loc[Data['ex_date'].notna(), 'ex_date_sd'] = 40

Data.loc[Data['name'] == '', 'name'] = Data['eaName']

Data['dead'] = 0
Data['deadOne'] = 1 - Data['dead']
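
One subtlety in the missing-value handling deserves a small illustration: NaN never compares equal to anything, including itself, so a `!= np.nan` test is true for every row, while `notna()` picks out only the rows with a recorded date. The sketch below uses a made-up three-row frame; the names and values are illustrative and are not taken from the data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): two languages with an expiry date, one without
toy = pd.DataFrame({'name': ['A', 'B', 'C'],
                    'ex_date': [1650.0, np.nan, 1820.0]})

# Comparing against NaN is True for every row, including the missing one
naive_mask = toy['ex_date'] != np.nan

# notna() is True only where a date is actually recorded
valid_mask = toy['ex_date'].notna()

# Fill the standard deviation only for rows that have a date
toy['ex_date_sd'] = np.nan
toy.loc[valid_mask, 'ex_date_sd'] = 40

print(naive_mask.tolist())
print(valid_mask.tolist())
```

With the `notna()` mask, language `B` keeps a missing standard deviation, which is what one would want for a language with no expiry date.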

Just a quick look at the data, to be sure that it is sorted correctly:

In [17]:
print(Data[ ['ruhlen_1', 'ruhlen_2'] ][0:10])
print(Data[ ['ruhlen_1', 'ruhlen_2', 'ruhlen_3'] ][500:510])

      ruhlen_1 ruhlen_2
104   AFROASIA   Berber
310   AFROASIA   Berber
3867  AFROASIA   Berber
207   AFROASIA   Berber
206   AFROASIA   Berber
1002  AFROASIA   Berber
106   AFROASIA   Berber
601   AFROASIA   Berber
3172  AFROASIA   Berber
603   AFROASIA   Berber
    ruhlen_1       ruhlen_2   ruhlen_3
43   AUSTRIC  Austroasiatic  Mon-Khmer
136  AUSTRIC  Austroasiatic  Mon-Khmer
405  AUSTRIC  Austroasiatic  Mon-Khmer
699  AUSTRIC  Austroasiatic  Mon-Khmer
137  AUSTRIC  Austroasiatic  Mon-Khmer
342  AUSTRIC  Austroasiatic      Munda
28   AUSTRIC  Austroasiatic      Munda
317  AUSTRIC  Austroasiatic   Paiwanic
689  AUSTRIC  Austroasiatic   Paiwanic
117  AUSTRIC       Miao-Yao     Mienic


## Nested panels from classifications


We want to render the data in the form of nested panels. So, in the above, we would have the *Afroasia* group marked by a unique dummy variable, and then have the *Afroasia / Berber* group have its own dummy, etc. The following is one way to create unique strings for each nested group, and then to transform them into numbers.

This is done in the following block of code:

In [18]:
Data['T1'] = Data.ruhlen_1
for x in range(2, 17):
    T       = 'T' + str(x)
    Tm1     = 'T' + str(x - 1)
    ruhlen  = 'ruhlen_' + str(x)
    Data[T] = Data[Tm1] + Data[ruhlen]
Data['T17'] = Data.T16+Data.name
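
To see why cumulative concatenation yields nested group keys, here is a minimal sketch on a toy frame with only two classification levels. The rows and the `ruhlen_2` labels are illustrative assumptions, not a copy of the actual data.

```python
import pandas as pd

# Toy classification (illustrative labels): two stocks, three sub-groups, four languages
toy = pd.DataFrame({
    'ruhlen_1': ['AFROASIA', 'AFROASIA', 'AUSTRIC', 'AUSTRIC'],
    'ruhlen_2': ['Berber', 'Semitic', 'Austroasiatic', 'Austroasiatic'],
    'name':     ['SIWI', 'MALTESE', 'KHMER', 'MUNDARI'],
})

# Each key is the previous key plus the next level's label, so two rows
# share a key at level k only if they agree on every level up to k
toy['T1'] = toy['ruhlen_1']
toy['T2'] = toy['T1'] + toy['ruhlen_2']
toy['T3'] = toy['T2'] + toy['name']

group_counts = toy[['T1', 'T2', 'T3']].nunique().tolist()
print(group_counts)
```

The number of distinct keys can only grow from one level to the next, and the final key, which includes the language name, separates every row.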


Here is a way to arrange all these grouping variables into what is something like a Stata local.


In [19]:
columns  = []
columns2 = []
for x in range(1, 17):
    columns.append('ruhlen_' + str(x))
    columns2.append('T' + str(x))

Everything looks okay, but let's check that we have unique identifiers at the end.
That is, our last column, `T17`, should uniquely identify each language. Just as a check, let's be sure that everything in the last column is unique!

In [20]:
print(len(Data) == len(set(Data['T17'])))

True


The next job is to make unique numerical identifiers for each thing just so they are easier to look at, 
if for no other reason. 

In [21]:
for n in range(1, 18):
    print(n, end= ' ')
    counter = 1
    TR       = 'TR' + str(n)
    T        = 'T' + str(n)
    Data[TR] = np.nan
    for x in collections.OrderedDict.fromkeys(Data[T]):   
        Data.loc[Data[T] == x, TR] = counter
        counter = counter + 1 

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 
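
As an aside, `pandas.factorize` produces the same kind of numbering without an explicit inner loop: it assigns integer codes in order of first appearance. The sketch below is an alternative under the assumption that the key column has no missing values (missing values receive the code -1); the toy column is made up.

```python
import pandas as pd

# Toy key column (hypothetical), already in sorted tree order
toy = pd.DataFrame({'T1': ['AFROASIA', 'AFROASIA', 'ALTAIC', 'AFROASIA', 'AMERIND']})

# factorize returns codes 0, 1, 2, ... in order of first appearance
codes, labels = pd.factorize(toy['T1'])

# Start counting at 1, as the loop above does
toy['TR1'] = codes + 1

print(toy['TR1'].tolist())
```

One difference from the loop is that the identifiers come back as integers rather than floats.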

As we have a lot of words, but many are missing, we can pare things down to a reduced Swadesh list of the words that occur most commonly in the data. Once we have it up and running, we can take a quick look at a few words.

At some point, we should add in word meanings. But let's get our pared-down list in place, and then take a look at how the **ASJP** renders them in ASCII:


In [22]:
words=['word1', 'word2', 'word3', 'word11', 'word12', 'word18', 'word19', 'word21', 'word22', 'word23', 
       'word25', 'word28', 'word30', 'word31', 'word34', 'word39', 'word40', 'word41', 'word43', 'word44', 
       'word47', 'word48', 'word51', 'word53', 'word54', 'word57', 'word58', 'word61', 'word66', 'word72', 
       'word74', 'word75', 'word77', 'word82', 'word85', 'word86', 'word92', 'word95', 'word96', 'word100']
print(Data[ ['name', 'word1', 'word2','word3','word11','word18', 'word77'] ][0:10])

                                   name word1 word2      word3 word11     word18   word77
104                                SIWI   niS   Sik     inCini   ijin    bunadem    adxax
310               TASHELHIT/IDA_USEMLAL    nk    ky       nkni    yan      bnadm   tagunt
3867  TAMAZIGHT_CENTRAL_ATLAS/AYT_NDHIR   n3k   S3g      nukni    yun    b3na83m    azr7u
207                             METMATA  n3CC  S3kk      n3Sni                       azru
206                             TUMZABT   n3S   n3C     n3Snin   ig3n     bnad3m     adxa
1002                     OUARGLA_BERBER  n3SS  S3kk     n3Snin  igg3n    takrumt    adXaX
106               TARIFIT/BENI_IZNASSEN   n3C   S3k      n3Cin   ij3n     bna83m     azru
601                      AHAGGAR_TUAREG   nak   kay  nakkaned7   iyan     awadam   tahunt
3172                           TAMASHEQ   nak   kay   nakaned7   iyan     awad3m   t3hunt
603                              ZENAGA  n37k   k3k      n3kni   yu7n  3g8inad3m  t37rg3t


In [23]:
Loc = np.where(Data['name'] == 'ENGLISH')[0][0]

EnglishList=[]
for word in words:
    EnglishList.append(Data[word].iloc[Loc])

In [24]:
for word in EnglishList:
    print(word, end=" ")

Ei yu wi w3n tu pers3n fiS dag laus tri lif skin bl3d bon horn ir Ei nos tu8 t3N ni hEnd brest liv3r drink si hir dEi k3m s3n star wat3r ston fEir pE8 maunt3n nEit ful nu nem 

From the above, we can intuit, anyways, that the words are: I, You, We, One, Two, Person (man), Fish, Dog, Louse, Tree, Leaf, Skin, Blood, Bone, Horn, Ear, Eye, Nose, Tooth, Tongue, Knee, Hand, Breast, Liver, Drink, See, Hear, Die, Come, Sun, Star, Water, Stone, Fire, Path, Mountain, Night, Full, New, Name. If one is interested, one can read about how and why these words are chosen in the [Swadesh list Wikipedia entry](https://en.wikipedia.org/wiki/Swadesh_list).

## Dolgopolsky classes

The first actual function we will make reduces words into so-called Dolgopolsky classes. These classes, while simple and inexact, have a nice attribute from the perspective of comparative linguistics: a constant number of states. If one relies on expert cognacy judgements, as is commonly done in historical linguistics, one ends up with more classes for larger linguistic stocks. We avoid this by using a constant set of 10 states for all languages. Some research suggests this isn't a terrible first approximation.

We have two functions in the `PyIETools` module that accomplish this: `worddolg`, which takes a word and returns its Dolgopolsky class, and `worddolgs`, which does the same for a list of words. We apply these functions and get back numeric classes in the following block.

With these functions in hand, we can convert our list into an array of Dolgopolsky classes. We also print out a sample: the class of the second word on the list for each language.

In [25]:
Words = np.asarray(Data[words])

# Reduce each language's row of words to Dolgopolsky classes, then stack the rows
W = np.vstack([PyInstEvo.worddolgs(row) for row in Words])
print(W[:,1])

[2. 3. 2. ... 1. 1. 1.]


The print statement shows what we have: an array of numbers for each society/language telling us which class each word on our list belongs to.


In estimation, we need to have these in dummy-variable form. The function `charnum_to_dummies` translates the numbers from the above work into a string of 10 dummies per word. Each word then becomes amenable to Markov chain analysis. Class 2 (the classes are numbered from zero, as the outputs here show), for example, becomes
`[0 0 1 0 0 0 0 0 0 0]`. Felsenstein notes that this setup easily handles missing words: we just use a state indicator vector of all ones, `[1 1 1 1 1 1 1 1 1 1]`.

In [26]:
States = PyInstEvo.charnum_to_dummies(W)

states = []
dim1   = int(np.shape(States)[1]/10)
DogList= ['p', 't', 's', 'c', 'm', 'N', 'l', 'w', 'y', 'i']
for i in range(0, dim1):
    for j in range(0, 10):
        states.append(words[i] + str(i) + '_' + DogList[j])

print(states[0:5])

for i in range(0, int(dim1*10)):
    Data[ states[i] ] = States[:, i]

# Print out result for verification:

print(Data['word10_p'][0:10])


['word10_p', 'word10_t', 'word10_s', 'word10_c', 'word10_m']
104     0
310     0
3867    0
207     0
206     0
1002    0
106     0
601     0
3172    0
603     0
Name: word10_p, dtype: uint8


Here is a small sample. If a word is missing, all of its states are possible so it is a bunch of ones.

In [27]:
States[:5,0:40]

array([[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1]], dtype=uint8)
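
For readers without the module at hand, the encoding can be sketched as follows. This is a hypothetical stand-in, not the implementation of `charnum_to_dummies`: the function name is made up, and it assumes that classes are zero-based codes (consistent with the sample above) and that a missing word is marked by NaN.

```python
import numpy as np

def classes_to_dummies(classes, n_states=10):
    """Hypothetical stand-in: one block of n_states indicators per word."""
    classes = np.asarray(classes, dtype=float)
    n_langs, n_words = classes.shape
    out = np.zeros((n_langs, n_words * n_states), dtype=np.uint8)
    for i in range(n_langs):
        for j in range(n_words):
            start = j * n_states
            if np.isnan(classes[i, j]):
                # Missing word (assumed to be NaN): every state is possible
                out[i, start:start + n_states] = 1
            else:
                # Observed word: switch on the indicator for its zero-based class
                out[i, start + int(classes[i, j])] = 1
    return out

# Two languages, two words; the second language is missing its second word
toy = np.array([[5, 2], [5, np.nan]])
dummies = classes_to_dummies(toy)
print(dummies)
```

Each observed word contributes exactly one 1 to its block of ten, and a missing word contributes ten.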

## Prior information on language splits

The next thing to be done is to read in all of our prior information on splits. This is fundamental in manipulating trees, as it is what one must use to calibrate the tree. 

We have gathered this information from various sources, including Ackerman, the ASJP website, and some additional sources, all of which we should cite. In fact, we need to add another column to the file to keep track of the source of each entry! Of course, the methods we use can easily accommodate additional prior information on linguistic splits.

All of our information is expressed in terms of bilateral splits. We read this in as a separate data frame:

In [28]:
Splits=pd.read_csv(os.getcwd() + '//IEData//PriorSplitInfo.csv',
                   names=('phylum','language1','language2','years','sdyears'))

In [29]:
print(Splits[0:15])            # A small sample

       phylum        language1                language2  years  sdyears
0   AfroAsia          ETHIOPIC                     GEEZ   2450       10
1   AfroAsia          ETHIOPIC                    TIGRE   2450       10
2   AfroAsia          ETHIOPIC                  ARGOBRA   2450       10
3   AfroAsia           MALTESE  TUNISIAN_ARABIC_MAGHRIB    910       10
4   AfroAsia     EASTERN_OROMO              MECHA_OROMO   2500       10
5   AfroAsia     EASTERN_OROMO                  W_OROMO    460       10
6   AfroAsia           W_OROMO                     ORMA    460       10
7     Altaic         MONGOLIAN                   MOGHOL    750       10
8     Altaic           TURKISH           JONK_KHORASANI   1419       10
9     Altaic            MANCHU                    HEZHE    236       10
10    Altaic            KYRGYZ              KAZAN_TATAR    900       10
11    Altaic           CHUVASH                    UZBEK   2500       10
12    Altaic            KOREAN                 JAPANESE   5000  
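
One natural way to use a row of this table in calibration is to treat `years` and `sdyears` as the mean and standard deviation of a normal prior on the split time. This is an assumption about how the information might enter the estimation, not a description of the code in `PyInstEvo`, and the function name below is hypothetical.

```python
import math

def split_log_prior(proposed_years, prior_years, prior_sd):
    """Log density of a proposed split time under a Normal(prior_years, prior_sd) prior."""
    z = (proposed_years - prior_years) / prior_sd
    return -0.5 * z * z - math.log(prior_sd) - 0.5 * math.log(2.0 * math.pi)

# MALTESE / TUNISIAN_ARABIC_MAGHRIB row from the table: 910 years, sd 10
at_mean   = split_log_prior(910, 910, 10)
off_by_30 = split_log_prior(940, 910, 10)

# A proposal three standard deviations away is penalized relative to the mean
print(at_mean > off_by_30)
```

Under this reading, a small `sdyears` pins the split date tightly, while a larger value lets the word data pull the estimate further from the prior.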

The other source of prior information for calibrating linguistic trees is prior depth. We haven't worked nearly hard enough on setting these up, being content for the time being merely to figure out how to incorporate the information. We still have to go through and pin down more exact information on the approximate depth of each language stock (in millennia before the present).

In [30]:
Depths=pd.read_csv(os.getcwd() + '//IEData//PriorDepth.csv', 
                   names=('phylum','min','max'))
print(Depths)

       phylum  min  max
0   AfroAsia     9   12
1     Altaic     6    8
2     Amerind    9   11
3     Austric    8   12
4     Austral    9   12
5   Burushask    3    6
6   Caucasian    4    7
7     Chukchi    3    6
8    ElamoDra    4    6
9    EskimoAl    4    5
10   IndoHitt    7    9
11    IndoPac    8   10
12       Khet    3    5
13    Khoisan   10   12
14     NaDene    6    8
15  NigerKord   10   12
16  NiloSahar   10   12
17  SinoTibet    6    8
18  UralicYuk    5    7
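
To make the units concrete, the sketch below draws a stock depth uniformly between the stated bounds and converts it to years. Treating `min` and `max` as the ends of a uniform range is an assumption made only for illustration, and `draw_depth` is a hypothetical helper.

```python
import random

def draw_depth(min_millennia, max_millennia, rng):
    """Draw a stock depth in years, uniformly between bounds given in millennia."""
    return 1000.0 * rng.uniform(min_millennia, max_millennia)

rng = random.Random(0)                  # fixed seed so the draw is repeatable
altaic_depth = draw_depth(6, 8, rng)    # Altaic row of the table: min 6, max 8

print(6000 <= altaic_depth <= 8000)
```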


As a final thing to do, we need to organize our material on expiration dates of certain languages. Some of this information is in the original data, while some of it is in a free-standing file, so we want to merge this information together and into our original data file. 

Here goes, while reading from the `DeathMods.csv` file which describes modifications to moribund languages. 

In [31]:
DeathDates = pd.read_csv(os.getcwd() + '//IEData//DeathMods.csv',
                         names=('language', 'ex_date', 'ex_date_sd'))
DeathDates = DeathDates.rename(columns={'language': 'name'})

# Look up the modified dates by language name. Assigning a Series indexed by
# name directly to Data, whose index is integer, would align on nothing and
# leave the new columns entirely missing.
DeathByName = DeathDates.drop_duplicates(subset='name').set_index('name')
Data['ex_date_new']    = Data['name'].map(DeathByName['ex_date'])
Data['ex_date_sd_new'] = Data['name'].map(DeathByName['ex_date_sd'])

# Now, replace ex_date with ex_date_new if ex_date_new is not missing...

Data.loc[Data['ex_date_new']<10000 , "ex_date"  ]    = Data['ex_date_new']
Data.loc[Data['ex_date_sd_new']<10000, "ex_date_sd"] = Data['ex_date_sd_new']

## Pickling

As a last step, we pickle the data for ease of access in our next notebook.



In [32]:
Data.to_pickle(os.getcwd()    + '//IEData//MasterData.pkl')
Splits.to_pickle(os.getcwd() + '//IEData//Splits.pkl')
Depths.to_pickle(os.getcwd()  + '//IEData//Depths.pkl')

Subsequent work builds on this combined data through development of tools and classes. 