# Basic Statistics

The corpus contains the following 14 files, which live in the `corpus` directory at https://github.com/ilonabudapesti/buddhism-nlp/tree/master/Pali_Oxford_MSt/corpus

They are plain text files taken directly from the Digital Pali Reader, which uses an electronic version of the CTS Edition.

In [15]:
import nltk, os, re, pprint
from nltk import word_tokenize, sent_tokenize

fileids = sorted(os.listdir('../corpus'))  # listdir order is arbitrary; sort for a reproducible listing
fileids

['01 Angulimala MN86.txt',
 '02 Yodhajiva SN42.3.txt',
 '03 Baddhekaratta MN131.txt',
 '04 Mahagovinda DN19.txt',
 '06 Kammavācā.txt',
 '06 Upasampadaviddhi V.txt',
 '07 Parivārapāḷi V482, 485.txt',
 '08 Parivāra-aṭṭhakathā V484-5 linguistics.txt',
 '09 Jataka 113 A.txt',
 '09 Jataka 148 A.txt',
 '09 Jataka 407 A.txt',
 '10 Bhikkhunīsaṃyutta SN5.1-5.txt',
 '11 Bhikkhuni pācittiya 8-9.txt',
 '12 Vaccakuṭivatta V.txt']

The files are combined into one large string, which is 122,497 characters long.

In [28]:
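raw = ''
# Concatenate every corpus file into one string and mirror it to corpus.txt.
with open('corpus.txt', 'w', encoding='utf-8') as c:
    for fileid in fileids:
        with open('../corpus/' + fileid, encoding='utf-8') as f:
            for line in f:
                raw += line
                c.write(line)
len(raw)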

122497

In [30]:
sents = sent_tokenize(raw)
len(sents)

1474

In [31]:
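# Texts and the number of sentences in each.
# Separate names are used so the corpus-wide `raw` and `sents` are not overwritten.
for fileid in fileids:
    with open('../corpus/' + fileid, encoding='utf-8') as f:
        file_sents = sent_tokenize(f.read())
    print(fileid, len(file_sents))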

01 Angulimala MN86.txt 191
02 Yodhajiva SN42.3.txt 19
03 Baddhekaratta MN131.txt 51
04 Mahagovinda DN19.txt 432
06 Kammavācā.txt 229
06 Upasampadaviddhi V.txt 144
07 Parivārapāḷi V482, 485.txt 6
08 Parivāra-aṭṭhakathā V484-5 linguistics.txt 49
09 Jataka 113 A.txt 35
09 Jataka 148 A.txt 58
09 Jataka 407 A.txt 89
10 Bhikkhunīsaṃyutta SN5.1-5.txt 64
11 Bhikkhuni pācittiya 8-9.txt 62
12 Vaccakuṭivatta V.txt 54


In [4]:
sents[500:505]

['atha kho, bho, mahāgovindassa brāhmaṇassa catunnaṃ māsānaṃ accayena ahudeva ukkaṇṭhanā ahu paritassanā — “sutaṃ kho pana metaṃ brāhmaṇānaṃ vuddhānaṃ mahallakānaṃ ācariyapācariyānaṃ bhāsamānānaṃ — ‘yo vassike cattāro māse paṭisallīyati, karuṇaṃ jhānaṃ jhāyati, so brahmānaṃ passati, brahmunā sākaccheti brahmunā sallapati brahmunā mantetī’ti.',
 'na kho panāhaṃ brahmānaṃ passāmi, na brahmunā sākacchemi na brahmunā sallapāmi na brahmunā mantemī’”ti.',
 '♦  brahmunā sākacchā (DN 19)\n\n♦ 318.',
 '“atha V.2.176 kho, bho, brahmā sanaṅkumāro mahāgovindassa brāhmaṇassa cetasā cetoparivitakkamaññāya P.2.240 seyyathāpi nāma balavā puriso samiñjitaṃ vā bāhaṃ pasāreyya, pasāritaṃ vā bāhaṃ samiñjeyya, evameva, brahmaloke antarahito mahāgovindassa brāhmaṇassa sammukhe pāturahosi.',
 'atha kho, bho, mahāgovindassa brāhmaṇassa ahudeva bhayaṃ ahu chambhitattaṃ ahu lomahaṃso yathā taṃ adiṭṭhapubbaṃ rūpaṃ disvā.']

It would be more useful to know the number of words than the number of characters in the corpus. We therefore tokenize the raw string with `word_tokenize`, which splits it into words and also separates punctuation marks into tokens of their own. We sample a few elements.

Then we use NLTK's built-in `Text` class to wrap our list of tokens in an NLTK `Text` object. This gives us access to several built-in methods, such as `concordance`, which lists the occurrences of a specific token in the `Text` together with their context.

In [5]:
tokens = word_tokenize(raw)
tokens[100:110]

['pasupālakā',
 'kassakā',
 'pathāvino',
 'bhagavantaṃ',
 'yena',
 'coro',
 'aṅgulimālo',
 'tenaddhānamaggapaṭipannaṃ',
 '.',
 'disvāna']

In [6]:
text = nltk.Text(tokens)
text.concordance('samaṇa')

Displaying 9 of 9 matches:
āna bhagavantaṃ etadavocuṃ — “ mā , samaṇa , etaṃ maggaṃ paṭipajji . etasmiṃ ,
, etaṃ maggaṃ paṭipajji . etasmiṃ , samaṇa , magge coro aṅgulimālo nāma luddo 
 aṅgulīnaṃ mālaṃ dhāreti . etañhi , samaṇa , maggaṃ dasapi purisā vīsampi puri
ino bhagavantaṃ etadavocuṃ — “ mā , samaṇa , etaṃ maggaṃ paṭipajji , etasmiṃ s
a , etaṃ maggaṃ paṭipajji , etasmiṃ samaṇa magge coro aṅgulimālo nāma luddo lo
vā aṅgulīnaṃ mālaṃ dhāreti . etañhi samaṇa maggaṃ dasapi purisā vīsampi purisā
āya ajjhabhāsi — ♦ “ gacchaṃ vadesi samaṇa ṭhitomhi , ♦ mamañca brūsi ṭhitamaṭ
si ṭhitamaṭṭhitoti . ♦ pucchāmi taṃ samaṇa etamatthaṃ , ♦ kathaṃ ṭhito tvaṃ ah
oyaṃ paccupādi ( sī . ) , mahāvanaṃ samaṇa paccupādi ( syā . kaṃ . ) } . ♦ soh


We can take a frequency distribution of the text, which shows how often each distinct token (also called a *word type*) occurs.

We list the tokens in order of frequency. The punctuation marks ',' and '.' are the most frequent, appearing 1557 and 1377 times respectively. They are followed by 'ti', 'kho' and 'na', with 364, 278 and 203 occurrences respectively. Words that occur with high frequency but carry little semantic content are called *stop-words*. English examples include 'and', 'the', 'it' and many pronouns.

The first word in the frequency distribution with semantic meaning is 'bhante', which as a form of address falls within the frequency range of stop-words.

Note that words from the Angulimala (MN 86) and Mahagovinda (DN 19) suttas will tend to rank higher in the frequency distribution, because these texts are longer and so offer more opportunity for repetition.
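One way to reduce this length bias is to compare relative rather than raw counts. The following is a minimal sketch using only the standard library; the helper `relative_freqs` and the toy token lists are hypothetical stand-ins for per-file token lists, not part of the notebook.

```python
from collections import Counter

def relative_freqs(tokens):
    """Map each distinct token to its share of all tokens in one text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

# Toy stand-ins for two tokenized texts of different length.
long_text = ["kho"] * 6 + ["bhante"] * 2 + ["samaṇa"] * 2   # 10 tokens
short_text = ["kho"] * 2 + ["bhante"] * 2                   # 4 tokens

rel_long = relative_freqs(long_text)
rel_short = relative_freqs(short_text)

# 'bhante' has the same raw count (2) in both texts,
# but a larger share of the shorter one.
print(rel_long["bhante"], rel_short["bhante"])
```

Comparing shares instead of counts keeps a long sutta from dominating the ranking simply because it contains more tokens.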

In [7]:
fd = nltk.FreqDist(text)
# Write every token and its count to a file, most frequent first.
with open('most_common.txt', 'w', encoding='utf-8') as mc:
    for token, count in fd.most_common():
        mc.write(token + " " + str(count) + "\n")

In [8]:
len(text) # number of tokens (words, including punctuation and references) in the corpus

19395

In [9]:
# We remove punctuation and numbers, lowercase everything, and count the unique word types.

vocab = sorted(set(w.lower() for w in text if w.isalpha()))
len(vocab)

# There are 4,052 unique word types in the corpus

4052

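A list of all word types is below. Python sorts strings by Unicode code point, so letters with diacritics (such as 'ā' and 'ñ') come after all unaccented Latin letters, which is why 'abbhutaṃ' precedes 'abbhā'. Because every token was lowercased, no capital letters appear.
Note that words with different conjugations and declensions count as different word types.
To find the number of unique head-words we will need to **lemmatize** or **stem** our vocabulary.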
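To make the idea concrete, here is a deliberately crude suffix-stripping sketch. The `ENDINGS` tuple is a hypothetical, very incomplete list of nominal endings chosen only for illustration, and `crude_stem` and `group_by_stem` are made-up helper names. A real Pāli stemmer or lemmatizer would need proper paradigms and sandhi handling.

```python
from collections import defaultdict

# Hypothetical, deliberately incomplete list of nominal endings.
ENDINGS = ("assa", "ena", "aṃ", "o")

def crude_stem(word, min_stem=3):
    """Strip the longest listed ending, keeping at least min_stem characters."""
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if word.endswith(ending) and len(word) - len(ending) >= min_stem:
            return word[: -len(ending)]
    return word

def group_by_stem(words):
    """Collect word forms that share the same crude stem."""
    groups = defaultdict(list)
    for w in words:
        groups[crude_stem(w)].append(w)
    return dict(groups)

# Forms taken from the vocabulary listing.
forms = ["addhamāsassa", "addhamāsaṃ", "addhamāso", "abhabbo"]
groups = group_by_stem(forms)
print(groups)
```

Here the three forms of 'addhamāsa' collapse into one group, while 'abhabbo' stays on its own. Blind suffix stripping will also merge unrelated words, so the resulting counts should be read as a rough lower bound on the number of head-words at best.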

In [10]:
vocab

['abbhamiva',
 'abbhantare',
 'abbhañjāpetvā',
 'abbhaññāsi',
 'abbhokāsaṃ',
 'abbhuggacchi',
 'abbhuggantvā',
 'abbhuggato',
 'abbhuggañchi',
 'abbhutaṃ',
 'abbhā',
 'abhabbo',
 'abhikkantavaṇṇo',
 'abhikkantaṃ',
 'abhikkantāya',
 'abhinanditvā',
 'abhinandunti',
 'abhinimanteyyāma',
 'abhinimminitvā',
 'abhinippanno',
 'abhinipphanno',
 'abhipatthitan',
 'abhipatthitaṃ',
 'abhiramatu',
 'abhiramāmī',
 'abhiramāmīti',
 'abhiruyha',
 'abhisambhosi',
 'abhisambhoti',
 'abhisaṅkhāresi',
 'abhisaṅkhāsi',
 'abhisitto',
 'abhisiñceyyuṃ',
 'abhisiñci',
 'abhisiñcissāmī',
 'abhisiñciṃsu',
 'abhivādeti',
 'abhivādetvā',
 'abhivādeyyāma',
 'abhiññā',
 'abhiññāya',
 'abhāsatha',
 'abhāsi',
 'abhāsittha',
 'abhūtaṃ',
 'abyattena',
 'abyattā',
 'abyāpajjena',
 'acarimaṃ',
 'accantaṃ',
 'accayena',
 'acchariyaṃ',
 'acchādetvā',
 'adantānaṃ',
 'adassāvī',
 'adatvā',
 'adayāpanno',
 'adaṇḍena',
 'addasa',
 'addasaṃ',
 'addasā',
 'addasāsuṃ',
 'addhagato',
 'addhamāsassa',
 'addhamāsaṃ',
 'addhamāso',

In [11]:
text.concordance('aggi')

Displaying 1 of 1 matches:
me pubbe , yiṭṭhukāmassa me sato . ♦ aggi pajjalito āsi , kusapattaparitthato 


In [12]:
wc = text.concordance('aggi')
type(wc)

Displaying 1 of 1 matches:
me pubbe , yiṭṭhukāmassa me sato . ♦ aggi pajjalito āsi , kusapattaparitthato 


NoneType

In [13]:
for i in wc:
    print(i)

TypeError: 'NoneType' object is not iterable

# Q: Iterable concordance list would be nice 
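`Text.concordance` only prints its results and returns `None`, which is why the loop above fails. Recent NLTK releases also provide `Text.concordance_list`, which returns the matches as a list; it is worth checking the documentation of the installed version. As a library-independent alternative, here is a minimal keyword-in-context sketch in plain Python. The function name `concordance_lines` and its `width` parameter are assumptions made for this example.

```python
def concordance_lines(tokens, word, width=3):
    """Return (left, match, right) context tuples for each occurrence of word.

    Matching is case-insensitive; width is the number of context tokens per side.
    """
    target = word.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((" ".join(left), tok, " ".join(right)))
    return hits

# Tokens taken from the single 'aggi' match shown above.
sample = "♦ aggi pajjalito āsi , kusapattaparitthato".split()
lines = concordance_lines(sample, "aggi")
for left, match, right in lines:
    print(left, "|", match, "|", right)
```

Because the result is an ordinary list of tuples, it can be iterated, filtered, or written to a file.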

In [None]:
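def occur(word):
    # Sentences containing `word` anywhere, even inside a longer word.
    return [sent for sent in sents if word in sent]

def word_occur(word):
    # Sentences containing `word` as a whole word; re.escape guards against regex metacharacters.
    return [sent for sent in sents if re.search(r"\b" + re.escape(word) + r"\b", sent)]

occur('aggi')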

In [None]:
word_occur('aggi')

In [None]:
text.concordance('Taṃ') # The concordance method finds occurrences regardless of capitalization

In [None]:
len(occur('taṃ'))

In [None]:
len(word_occur('taṃ'))

# Q: concordance 85 vs regexp 76 ? (capitalization?)

In [None]:
word_occur('taṃ')
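A plausible explanation for the gap, offered as a hypothesis and not a verified diagnosis: `concordance` counts every matching token and ignores case, whereas `word_occur` counts sentences, so a sentence with two matches is counted once, and its regular expression is case-sensitive. The sketch below reproduces both ways of counting on a few invented toy sentences; `count_occurrences` and `count_sentences` are hypothetical helper names.

```python
import re

def count_occurrences(sentences, word):
    """Count every whole-word match, ignoring case (what a concordance reports)."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return sum(len(pattern.findall(s)) for s in sentences)

def count_sentences(sentences, word):
    """Count sentences with at least one case-sensitive whole-word match."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return sum(1 for s in sentences if pattern.search(s))

# Invented toy sentences, not taken from the corpus.
toy_sents = [
    "taṃ kho pana taṃ bhagavantaṃ",   # two matches in one sentence
    "Taṃ sutvā",                       # capitalized match
    "etaṃ maggaṃ",                     # no whole-word match
]
n_occurrences = count_occurrences(toy_sents, "taṃ")
n_sentences = count_sentences(toy_sents, "taṃ")
print(n_occurrences, n_sentences)
```

The two numbers differ even on this tiny sample, so the two methods should not be expected to agree on the corpus either.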

# Letter Frequencies

In [None]:
import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline
mpl.rcParams['font.family'] = 'Arial'

raw[:100]

In [None]:
# Keep alphabetic tokens only, then count the individual lowercase letters.
alphatokens = [t for t in tokens if t.isalpha()]
raw2 = ''.join(alphatokens)
raw3 = [s.lower() for s in raw2]
fd = nltk.FreqDist(raw3)
fd.plot()
fd.most_common(50)

Note that 'ś' occurs only once, which is strange, since 'ś' is not part of the Pāli alphabet. The occurrences of 'f' and 'w' probably come from bracketed references pointing to variant readings.

### Question: Should we remove content in brackets before processing? () {} []
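If the answer turns out to be yes, one possible approach is sketched below. It drops everything inside `()`, `{}` or `[]` with a simple depth counter, so nested notes such as `{... (sī.) ...}` are removed as a whole. The function name `strip_bracketed` and the sample string (loosely modelled on the variant-reading notes in the concordance output) are assumptions for illustration only.

```python
PAIRS = {"(": ")", "{": "}", "[": "]"}
CLOSERS = set(PAIRS.values())

def strip_bracketed(text):
    """Drop everything inside (), {} or [], including nested brackets.

    Bracket types are not checked against each other; only depth is tracked.
    """
    out = []
    depth = 0
    for ch in text:
        if ch in PAIRS:
            depth += 1
        elif ch in CLOSERS:
            depth = max(0, depth - 1)   # tolerate stray closing brackets
        elif depth == 0:
            out.append(ch)
    # Collapse the whitespace (including newlines) left behind by removed spans.
    return " ".join("".join(out).split())

sample = "mahāvanaṃ samaṇa paccupādi {paccupādi (sī.), mahāvanaṃ (syā. kaṃ.)} sohaṃ"
cleaned = strip_bracketed(sample)
print(cleaned)
```

One caveat: an opening bracket that is never closed would cause the rest of the text to be dropped, so the input should be checked for unbalanced brackets before this is applied to a whole file.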

In [None]:
def make_string(file):
    # Return the lowercase letters of a file as a list, skipping non-alphabetic tokens.
    with open(file, encoding='utf-8') as f:
        raw = f.read()
    tokens = word_tokenize(raw)
    alphatokens = [t for t in tokens if t.isalpha()]
    return [s.lower() for s in ''.join(alphatokens)]

cfd = nltk.ConditionalFreqDist(
        (fileid, s)
        for fileid in fileids
        for s in make_string('../corpus/' + fileid)
)
cpd = nltk.ConditionalProbDist(cfd, nltk.MLEProbDist)

In [None]:
pd1 = cpd[fileids[0]]
pd1.samples()

In [None]:
# The underlying letter counts for the first file.
fd1 = pd1.freqdist()
fd1.plot()

In [None]:
allfds = [cpd[fn].freqdist() for fn in fileids]
unsorted_list = [(x, allfds[0].freq(x)) for x in allfds[0].keys()]
sorted_list = sorted(unsorted_list)
print(unsorted_list)
xlabs, vals = zip(*sorted_list)
nvals = len(xlabs)
plt.plot(range(nvals), vals, label='Angulimala')
plt.xticks(range(nvals), xlabs)
plt.legend()

In [None]:
allxlabs, allvals = [], []
for fd in allfds:
    arr = sorted([(x, fd.freq(x)) for x in fd.keys()])
    xlabs, vals = zip(*arr)
    allxlabs.append(xlabs)
    allvals.append(vals)
assert allxlabs[0] == allxlabs[1] == allxlabs[2]

# Zipf's Law in Pāli

Let `f(w)` be the frequency of a word `w` in free text. Suppose that all the words of a text are ranked by frequency, with the most frequent word first, and let `r` be the rank of `w`. Zipf's law states that the frequency of a word type is inversely proportional to its rank (i.e. `f × r = k`, for some constant `k`). For example, the 50th most common word type should occur three times as frequently as the 150th most common word type.

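Zipf's law holds empirically for English, and the larger the sample, the better the fit. Note, however, that randomly generated texts over the English alphabet (with the space as one of the characters) can also produce a Zipf-like curve, so a good fit is not by itself evidence of linguistic structure.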

The question is whether Zipf's law holds true for Pāli.
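Taking logarithms of `f × r = k` gives `log f = log k - log r`, so under Zipf's law a log-log plot of frequency against rank should be roughly a straight line with slope close to -1. The sketch below estimates that slope by ordinary least squares using only the standard library. The function `zipf_slope` and the synthetic tokens are assumptions made for illustration, not results from the corpus.

```python
import math
from collections import Counter

def zipf_slope(tokens, top_n=100):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = [n for _, n in Counter(tokens).most_common(top_n)]
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic tokens built so that the word of rank r occurs 120 // r times,
# i.e. an exactly Zipfian toy sample (120, 60, 40, 30, 24, 20).
toy_tokens = [f"w{r}" for r in range(1, 7) for _ in range(120 // r)]
slope = zipf_slope(toy_tokens)
print(round(slope, 2))
```

On real data, calling `zipf_slope(alphatokens)` would give a single number to compare across texts; a slope noticeably different from -1 would suggest a different exponent for this corpus.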

In [None]:
text

In [None]:
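fd = nltk.FreqDist(alphatokens)
n = 100
mc = fd.most_common(n)
ranks = list(range(1, n+1))
freqs = [x[1] for x in mc]
# Note the exponent: this plots rank**0.7 * freq, not rank * freq.
# The curve is flat wherever frequency falls off as rank**-0.7.
freqstimesranks = [(i+1)**0.7*x for i, x in enumerate(freqs)]
plt.plot(ranks, freqstimesranks)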

In [None]:
plt.loglog(ranks, freqs)

In [47]:
pron1st = ["ahaṃ", "maṃ", "mayā", "me", "mama", "mayhaṃ", "mayi", "mad", 
           "mayaṃ", "amhe", "no", "amhehi", "amhākaṃ", "asmākaṃ", "amhesu"]

pron2nd = ["tvaṃ", "tuvaṃ", "taṃ", "tayā", "tvayā", "tava", "tuyhaṃ", "te", "tayi", "tvayi",
          "tumhe", "vo", "tumhehi", "tumhākaṃ", "tumhesu"]

pron3rd = ["so", "sa", "ta", "tena", "tasmā", "tamhā", "tassa", "tasmiṃ", "tamhi", 'tad',
          'tehi', 'tesaṃ', 'tesu', 
          'sā', 'tāya', 'tassā', 'tissā', 'tāya', 'tassaṃ', 'tissaṃ', 
          'tā', 'tāyo', 'tāhi', 'tāsaṃ', 'tāsu',
          'tāni']

deic = ['ayaṃ', 'imaṃ', 'iminā', 'anena', 'imasmā', 'imamhā', 'asmā', 'imassa', 'assa', 
       'imasmiṃ', 'imamhi', 'asmiṃ', 'idaṃ',
       'ime', 'imehi', 'imesaṃ', 'imesu', 
       'imāya', 'imissā', 'assā', 'imissaṃ', 'imāyaṃ',
       'imā', 'imāhi', 'imāsaṃ', 'imāsu']

In [48]:
# The four word lists must be defined before they are collected here.
studylist = [pron1st, pron2nd, pron3rd, deic]

In [51]:
%pprint
[element for group in studylist for element in group]  # `group`, so the builtin `list` is not shadowed

Pretty printing has been turned OFF


['ahaṃ', 'maṃ', 'mayā', 'me', 'mama', 'mayhaṃ', 'mayi', 'mad', 'mayaṃ', 'amhe', 'no', 'amhehi', 'amhākaṃ', 'asmākaṃ', 'amhesu', 'tvaṃ', 'tuvaṃ', 'taṃ', 'tayā', 'tvayā', 'tava', 'tuyhaṃ', 'te', 'tayi', 'tvayi', 'tumhe', 'vo', 'tumhehi', 'tumhākaṃ', 'tumhesu', 'so', 'sa', 'ta', 'tena', 'tasmā', 'tamhā', 'tassa', 'tasmiṃ', 'tamhi', 'tad', 'tehi', 'tesaṃ', 'tesu', 'sā', 'tāya', 'tassā', 'tissā', 'tāya', 'tassaṃ', 'tissaṃ', 'tā', 'tāyo', 'tāhi', 'tāsaṃ', 'tāsu', 'tāni', 'ayaṃ', 'imaṃ', 'iminā', 'anena', 'imasmā', 'imamhā', 'asmā', 'imassa', 'assa', 'imasmiṃ', 'imamhi', 'asmiṃ', 'idaṃ', 'ime', 'imehi', 'imesaṃ', 'imesu', 'imāya', 'imissā', 'assā', 'imissaṃ', 'imāyaṃ', 'imā', 'imāhi', 'imāsaṃ', 'imāsu']

In [74]:
def show_example_sentences(word):
    # For each file, report how many sentences contain `word` as a token, followed by those sentences.
    out = ''
    for fileid in fileids:
        with open('../corpus/' + fileid, encoding='utf-8') as f:
            sents = sent_tokenize(f.read())
        ex_sents = [s for s in sents if word in word_tokenize(s)]
        out += fileid + " " + word + " " + str(len(ex_sents)) + "\n" + str(ex_sents) + "\n"
    return out
   
show_example_sentences("imasmiṃ")

"01 Angulimala MN86.txt imasmiṃ 0\n[]\n02 Yodhajiva SN42.3.txt imasmiṃ 0\n[]\n03 Baddhekaratta MN131.txt imasmiṃ 0\n[]\n04 Mahagovinda DN19.txt imasmiṃ 0\n[]\n06 Kammavācā.txt imasmiṃ 0\n[]\n06 Upasampadaviddhi V.txt imasmiṃ 0\n[]\n07 Parivārapāḷi V482, 485.txt imasmiṃ 0\n[]\n08 Parivāra-aṭṭhakathā V484-5 linguistics.txt imasmiṃ 0\n[]\n09 Jataka 113 A.txt imasmiṃ 0\n[]\n09 Jataka 148 A.txt imasmiṃ 1\n['so “laddhaṃ dāni me imasmiṃ sarīre mudu khāditabbayuttakaṭṭhānan”ti tato paṭṭhāya khādanto antokucchiṃ pavisitvā vakkahadayādīni khāditvā pipāsitakāle lohitaṃ pivitvā nipajjitukāmakāle udaraṃ pattharitvā nipajjati.']\n09 Jataka 407 A.txt imasmiṃ 0\n[]\n10 Bhikkhunīsaṃyutta SN5.1-5.txt imasmiṃ 0\n[]\n11 Bhikkhuni pācittiya 8-9.txt imasmiṃ 2\n['♦ 826. yā P.4.266 panāti yā yādisā ... pe ... bhikkhunīti ... pe ... ayaṃ imasmiṃ atthe adhippetā bhikkhunīti.', '♦ 830. yā panāti yā yādisā ... pe ... bhikkhunīti ... pe ... ayaṃ imasmiṃ atthe adhippetā bhikkhunīti.']\n12 Vaccakuṭivatt

In [75]:
# Write the example sentences for every pronoun and deictic form to one file.
with open("pronouns_with_examples.txt", "w", encoding="utf-8") as f:
    for group in studylist:
        for element in group:
            f.write(show_example_sentences(element))