# XML corpus

We import the Pāli Canon from the Digital Pāli Reader. The files are in Extensible Markup Language ([XML](https://en.wikipedia.org/wiki/XML)) format.

- Number of XML files:  162
- Number of characters in the whole Tipiṭaka:  90,083,712

The file names indicate their content:

1. The first character of the file name:
    - Vinaya
        - v
    - Sutta
        - d
        - m
        - s
        - a
        - k
    - Abhidhamma
        - x
1. The last character of the file name: 
    - `m` stands for *mūla* (root text)
    - `a` for *aṭṭhakathā* (commentary)
    - `t` for *ṭīkā* (sub-commentary)
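
The naming convention above can be sketched as a small lookup. The helper name `describe_file` and the two dictionaries are illustrative assumptions, not part of the Digital Pāli Reader. Prefixes that the list does not cover (the file listing in this notebook also contains names starting with `b` and `g`) are reported as unknown rather than guessed.

```python
import os

# Hypothetical lookup tables that mirror the list above.
BASKETS = {
    "v": "Vinaya",
    "d": "Sutta", "m": "Sutta", "s": "Sutta", "a": "Sutta", "k": "Sutta",
    "x": "Abhidhamma",
}
LAYERS = {"m": "mūla", "a": "aṭṭhakathā", "t": "ṭīkā"}


def describe_file(filename):
    """Return (basket, layer) for a file name such as 'd1m.xml'."""
    stem, _ext = os.path.splitext(filename)
    if not stem:
        return "unknown", "unknown"
    basket = BASKETS.get(stem[0], "unknown")   # first character of the name
    layer = LAYERS.get(stem[-1], "unknown")    # last character before '.xml'
    return basket, layer


for name in ["v1m.xml", "a10t.xml", "x2m.xml", "g1m.xml"]:
    print(name, describe_file(name))
```

Reading the layer from the last character of the stem gives the same result as the `f[-5]` index used later in this notebook, as long as every name ends in `.xml`.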

We first concatenate all XML files into a single file, `corpusinxml.txt`. Finally we remove the XML markup and write a plain-text version of the corpus to the file `corpustextonly.txt`. (Both files are found in the same directory as this Jupyter Notebook file.)

In [2]:
%pprint
import nltk, os, matplotlib
from bs4 import BeautifulSoup
%matplotlib inline

# list the XML files in the parent directory
fileids = os.listdir("..")
fileids = [f for f in fileids if len(f) < 9 and 'xml' in f]
print(fileids)
print("Number of XML files: ", len(fileids))

# concatenate all files into one string and one combined file
parts = []
with open('corpusinxml.txt', 'w') as c:
    for fileid in fileids:
        with open("../" + fileid) as f:
            content = f.read()
        parts.append(content)
        c.write(content)
raw = ''.join(parts)
print("Number of characters in the whole Tipiṭaka: ", len(raw))

Pretty printing has been turned ON
['a10a.xml', 'a10m.xml', 'a10t.xml', 'a11a.xml', 'a11m.xml', 'a11t.xml', 'a1a.xml', 'a1m.xml', 'a1t.xml', 'a2a.xml', 'a2m.xml', 'a2t.xml', 'a3a.xml', 'a3m.xml', 'a3t.xml', 'a4a.xml', 'a4m.xml', 'a4t.xml', 'a5a.xml', 'a5m.xml', 'a5t.xml', 'a6a.xml', 'a6m.xml', 'a6t.xml', 'a7a.xml', 'a7m.xml', 'a7t.xml', 'a8a.xml', 'a8m.xml', 'a8t.xml', 'a9a.xml', 'a9m.xml', 'a9t.xml', 'b1m.xml', 'b2m.xml', 'd1a.xml', 'd1m.xml', 'd1t.xml', 'd2a.xml', 'd2m.xml', 'd2t.xml', 'd3a.xml', 'd3m.xml', 'd3t.xml', 'g1m.xml', 'g2m.xml', 'g3m.xml', 'g4m.xml', 'g5m.xml', 'k10a.xml', 'k10m.xml', 'k11m.xml', 'k12a.xml', 'k12m.xml', 'k13a.xml', 'k13m.xml', 'k14a.xml', 'k14m.xml', 'k15a.xml', 'k15m.xml', 'k16m.xml', 'k17m.xml', 'k18m.xml', 'k19m.xml', 'k1a.xml', 'k1m.xml', 'k20m.xml', 'k21m.xml', 'k2a.xml', 'k2m.xml', 'k3a.xml', 'k3m.xml', 'k4a.xml', 'k4m.xml', 'k5a.xml', 'k5m.xml', 'k6a.xml', 'k6m.xml', 'k7a.xml', 'k7m.xml', 'k8a.xml', 'k8m.xml', 'k9a.xml', 'k9m.xml', 'm1a.xml', 'm1m.x

In [4]:
# strip the XML markup and save the plain text
with open('corpusinxml.txt') as f:
    raw = f.read()
raw2 = BeautifulSoup(raw, 'lxml').get_text()
with open('corpustextonly.txt', 'w') as c:
    c.write(raw2)

# Text only corpus

The next step is to tokenize our corpus, that is, to break up a single long string into a list of words. The tokenizer we use is NLTK's built-in `word_tokenize`.

After removing the manuscript-variation markers (`^ea^`, `^eb^`) and every token that is not purely alphabetic, we lowercase what remains to produce the list of tokens. Tokenization is usually regarded as a preliminary step to parsing.
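
The filtering step can be sketched as a standalone helper. The name `clean_tokens` and the sample tokens are made up for illustration; the `^ea^`/`^eb^` check and the `isalpha()` test are the ones this notebook's cells use. One caveat that may be worth checking (an assumption about the input, not something verified against the corpus): `str.isalpha()` rejects a token whose diacritics are stored as separate combining characters, so this sketch normalizes each token to NFC first, which the notebook itself does not do.

```python
import unicodedata


def clean_tokens(tokens):
    """Lowercase tokens, dropping variation markers and non-alphabetic items."""
    cleaned = []
    for t in tokens:
        # Compose base letter + combining mark into a single code point (NFC),
        # so isalpha() accepts letters such as 'ṃ' however they are stored.
        t = unicodedata.normalize("NFC", t)
        if "^ea^" in t or "^eb^" in t:
            continue  # manuscript-variation marker
        if t.isalpha():
            cleaned.append(t.lower())
    return cleaned


# Made-up sample; "Evam\u0323" spells 'Evaṃ' with a combining dot below.
sample = ["Evam\u0323", "me", "suta\u1e43", ".", "^ea^", "123"]
print(clean_tokens(sample))
```

If the corpus already uses precomposed characters throughout, the normalization is a no-op and the helper behaves like the list comprehension in the cells.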

We list the most common collocations of the text, followed by the 50 most common words.

In [7]:
from nltk import word_tokenize

with open('corpustextonly.txt', 'r') as f:
    raw3 = f.read()
tokens = word_tokenize(raw3)
# keep lowercased alphabetic tokens; drop manuscript-variation markers
tokens = [t.lower() for t in tokens
          if "^ea^" not in t
          and "^eb^" not in t
          and t.isalpha()]
tokens[:50]

['namo',
 'tassa',
 'bhagavato',
 'arahato',
 'aṅguttaranikāye',
 'paṭhamapaṇṇāsakaṃ',
 'ānisaṃsavaggo',
 'kimatthiyasuttavaṇṇanā',
 'dasakanipātassa',
 'paṭhame',
 'anavajjasīlāni',
 'amaṅkubhāvassa',
 'avippaṭisārassa',
 'atthāya',
 'saṃvattantīti',
 'so',
 'nesaṃ',
 'ānisaṃsoti',
 'nāma',
 'taruṇavipassanā',
 'nāma',
 'balavavipassanā',
 'nāma',
 'maggo',
 'nāma',
 'arahattaphalaṃ',
 'nāma',
 'paccavekkhaṇañāṇaṃ',
 'arahattatthāya',
 'gacchanti',
 'cetanākaraṇīyasuttavaṇṇanā',
 'dutiye',
 'cetanāya',
 'na',
 'cetetvā',
 'kappetvā',
 'pakappetvā',
 'kātabbaṃ',
 'dhammasabhāvo',
 'eso',
 'kāraṇaniyamo',
 'ayaṃ',
 'pavattenti',
 'paripuṇṇaṃ',
 'karonti',
 'pāraṃ',
 'orimatīrabhūtā',
 'tebhūmakavaṭṭā',
 'nibbānapāraṃ',
 'gamanatthāya']

In [8]:
text = nltk.Text(tokens)
text.collocations()

dhammaṃ paṭicca; dhammo uppajjati; uppajjati hetupaccayā; atha kho;
hevaṃ vattabbe; eseva nayo; tasmiṃ samaye; kho bhikkhave; tesaṃ
tattha; ārammaṇapaccayena paccayo; āpatti dukkaṭassa; tenupasaṅkami
upasaṅkamitvā; paraṃ maraṇā; puna caparaṃ; upanissayapaccayena
paccayo; kāyassa bhedā; dhammassa ārammaṇapaccayena; kha yassa; pana
samayena; atha naṃ


In [9]:
fd = nltk.FreqDist(text)
fd.most_common(50)

[('ca', 124664),
 ('na', 115997),
 ('ti', 100631),
 ('vā', 86685),
 ('hoti', 56120),
 ('pana', 55227),
 ('taṃ', 48846),
 ('tattha', 42965),
 ('kho', 42961),
 ('evaṃ', 41307),
 ('so', 39691),
 ('bhikkhave', 32747),
 ('nāma', 32175),
 ('te', 31782),
 ('tassa', 29663),
 ('hi', 28832),
 ('vuttaṃ', 23984),
 ('nti', 22969),
 ('attho', 20175),
 ('ayaṃ', 20132),
 ('tena', 19863),
 ('viya', 19721),
 ('me', 19462),
 ('tesaṃ', 19068),
 ('atha', 18408),
 ('dhammaṃ', 17971),
 ('bhagavā', 17081),
 ('dhammo', 16889),
 ('uppajjati', 16857),
 ('katvā', 16826),
 ('paccayo', 16463),
 ('dhammā', 16308),
 ('yaṃ', 15895),
 ('ekaṃ', 15311),
 ('āha', 15143),
 ('idaṃ', 15058),
 ('paṭicca', 14963),
 ('yathā', 14905),
 ('no', 14823),
 ('bhante', 14781),
 ('tasmā', 14600),
 ('bhikkhu', 13841),
 ('tato', 13812),
 ('attano', 13470),
 ('dve', 13458),
 ('yo', 13386),
 ('ettha', 12278),
 ('tathā', 11925),
 ('tīṇi', 11289),
 ('atthi', 11206)]

# Most Frequent 1,000

Kurt Schmidt published a book that consists of the 1,000 most frequent Pāli words: [Frequency Dictionary of Pāli: Core Vocabulary for Learners](https://www.amazon.com/Frequency-Dictionary-Pali-Vocabulary-Learners/dp/1478369159).

We can do the same below in three lines of code. The list can be found in the file `most_common_1000.txt` in the same directory as this file. (Instant publishing, just add water.)

Note the problem with Schmidt's title. Page 416, the first page of the word index, already shows that his list treats different surface forms of the same vocabulary item as separate entries. For example, the two members of each of the following pairs are listed as separate "vocabulary" items.

https://www.dropbox.com/s/afa79kkw4yz1bsb/Screenshot%202017-11-28%2016.53.21.png?dl=0

- aham
- ahampi


- aññataraṃ
- aññataro


This conflates surface forms with vocabulary items, a confusion closely related to the type-token distinction. It would have been better to title his book *Most Frequent Surface Forms (Tokens)*. A vocabulary list should count each lexical item once, with its inflected and combined forms grouped under it, rather than giving every surface form its own entry. I will talk more about this in the stemming/lemmatization section.
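
A toy illustration of the difference, using a hand-made token list rather than the corpus. The `LEMMAS` table is a hypothetical stand-in for a real lemmatizer, and the headwords assigned here are for illustration only.

```python
from collections import Counter

# Hand-made sample, not taken from the corpus.
tokens = ["aham", "ahampi", "aham", "aññataraṃ", "aññataro", "aham"]

# Hypothetical form-to-headword table standing in for a lemmatizer.
LEMMAS = {
    "aham": "aham",
    "ahampi": "aham",
    "aññataraṃ": "aññatara",
    "aññataro": "aññatara",
}

surface_counts = Counter(tokens)                   # one entry per surface form
lemma_counts = Counter(LEMMAS[t] for t in tokens)  # one entry per headword

print("running words:", len(tokens))
print("distinct surface forms:", len(surface_counts))
print("distinct vocabulary items:", len(lemma_counts))
```

A frequency list built from `surface_counts` has four entries here, while one built from `lemma_counts` has two.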

(#todo: compare Schmidt's list with mine)

In [10]:
with open('most_common_1000.txt','w') as mc:
    for word, count in fd.most_common(1000):
        mc.write(f"{word} {count}\n")

# Rare form of personal pronoun

Warder (needs citation) lists *asmākaṃ* as a rare form of *amhākaṃ*, the genitive of the first person plural pronoun. We now have a tool to compare the frequencies of these two forms.

|form|frequency|probability|
|-----|---------|----------|
|amhākaṃ | 3271 |  99.42% |
|asmākaṃ | 19 |     0.58%  |
| **total**| **3290** |   **100%** |

We find that asmākaṃ is indeed about 170 times less frequent (3271 / 19 ≈ 172).

Note 1: This approach does not take into account sandhi forms in which the initial `a` has been elided, such as 'mhākaṃ and 'smākaṃ. (#todo)

Note 2: It would be interesting to see exactly where the 19 occurrences of asmākaṃ come from and whether they share some property, such as author or era. (#todo ?)
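
The percentages in the table, and the ratio between the two counts, can be reproduced with a few lines of arithmetic. The counts are hard-coded from the table here; the variable names are illustrative.

```python
# Counts copied from the table above.
counts = {"amhākaṃ": 3271, "asmākaṃ": 19}
total = sum(counts.values())

for form, n in counts.items():
    # Share of each form among all occurrences of the two forms.
    print(f"{form}\t{n}\t{n / total:.2%}")

ratio = counts["amhākaṃ"] / counts["asmākaṃ"]
print(f"amhākaṃ is about {ratio:.0f} times as frequent as asmākaṃ")
```

With the notebook's `fd` in scope, the two counts could instead be read as `fd['amhākaṃ']` and `fd['asmākaṃ']`, since an NLTK `FreqDist` can be indexed like a dictionary.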

In [52]:
text.concordance('amhākaṃ')

Displaying 25 of 3271 matches:
sa gotamassa dhammadesanāya saddhiṃ amhākaṃ dhammadesanaṃ amhākaṃ vā dhammadesa
anāya saddhiṃ amhākaṃ dhammadesanaṃ amhākaṃ vā dhammadesanāya saddhiṃ samaṇassa
cetiyaṃ pūjimhā ti kathetuṃ vaṭṭati amhākaṃ ñātakā sūrā samatthā ti vā pubbe ma
 bhikkhuno kālakato pitā vā mātā vā amhākaṃ ñātakatthero sīlavā kalyāṇadhammo t
māsi therā disvāva ime paccayā neva amhākaṃ na kokālikassa kappantī ti paṭikkhi
o atthaṃ nīharitvā dassetā no yathā amhākaṃ bhagavā byākareyya ajitasuttavaṇṇan
 nānākaraṇaṃ samaṇassa vā gotamassa amhākaṃ vā yadidaṃ dhammadesanāya vā dhamma
ṃ piṇḍāya pavisimhā tesaṃ no bhante amhākaṃ etadahosi kho tāva sāvatthiyaṃ piṇḍ
 nānākaraṇaṃ samaṇassa vā gotamassa amhākaṃ vā yadidaṃ dhammadesanāya vā dhamma
aṃ paviṭṭho ca tathā tesaṃ no āvuso amhākaṃ acirapakkantassa bhagavato etadahos
sa vitthārena atthaṃ tesaṃ no āvuso amhākaṃ etadahosi kho āyasmā ānando satthu 
ave veditabbo tathā tesaṃ no bhante amhākaṃ acirapakkantassa bhagavato etadahos
a vitthār

In [53]:
text.concordance('asmākaṃ')

Displaying 19 of 19 matches:
ppavedite ayaṃ kho panāvuso amhākaṃ asmākaṃ bhagavatā bhagavato dhammo svākkhāt
aṃmamaṃ naṃsesvamussa savibhattissa asmākaṃ mamaṃ honti vā yathākkamaṃ asmākaṃ 
 asmākaṃ mamaṃ honti vā yathākkamaṃ asmākaṃ amhākaṃ mamaṃ mama simhi amhassa sa
i amhebhi mayhaṃ mama amhaṃ amhākaṃ asmākaṃ mayā amhehi amhebhi mayhaṃ mama amh
i amhebhi mayhaṃ mama amhaṃ amhākaṃ asmākaṃ mayi amhesu asmesu ettha pana katha
smiṃ vuttā ayañhi suddhakattuvisaye asmākaṃ ruci sukhatīti sukhito dukkhatīti d
akārena pāḷiyo paṭibhanti no tattha asmākaṃ khantiyā dajjā dajja ntiādīni satta
i tesaṃ doso hotīti na hoti suṇātha asmākaṃ sodhanaṃ tathā hi aṭṭhakathācariyeh
kvaci tumhaṃ tumhākaṃ amhaṃ amhākaṃ asmākaṃ vā pañcamiyaṃ amhatumhanturāja iccā
ggaho tumhaṃ tumhākaṃ amhaṃ amhākaṃ asmākaṃ dhammatā smiṃmhī ti vattate sabbesa
 puriso nāma dullabho tena kāraṇena asmākaṃ evarūpesu ṭhānesu adhivāsanakhantiy
ṃ kumāro kathessatī ti vadati nāyaṃ asmākaṃ ruccati ettha nipātamattaṃ vidhuro 
ā ñātisaṅgh

# Basic statistics of the Tipiṭaka

The three 'baskets' of the Buddhist Canon are Vinaya, Sutta, and Abhidhamma. We will now take the *mūla* files of each basket and list below how many tokens each basket contains (that is, how many words long it is in total).


| basket  | #files | #tokens |
|---------|--------|---------|
|Vinaya   |    6   |  403,657|
|Sutta    |   43   |1,534,386|
|Abhidhamma|   2   |  114,368|
|**total**|   51   |2,052,411|
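
As a quick consistency check, the totals in the table can be summed and turned into each basket's share of the *mūla* tokens. This is a minimal sketch with the counts hard-coded from the table; the dictionary name is illustrative.

```python
# Token counts copied from the table above (mūla files only).
basket_tokens = {"Vinaya": 403_657, "Sutta": 1_534_386, "Abhidhamma": 114_368}
total = sum(basket_tokens.values())

for basket, n in basket_tokens.items():
    # Name, token count with thousands separators, share of the total.
    print(f"{basket:<11}{n:>10,}{n / total:>8.1%}")
print(f"{'total':<11}{total:>10,}")
```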

In [21]:
# make vinaya text, mūla only
vinaya_files=[f for f in os.listdir('..') if f[0] == 'v' and f[-5]=='m']
vinaya_files

['v1m.xml', 'v2m.xml', 'v3m.xml', 'v4m.xml', 'v5m.xml', 'v6m.xml']

In [57]:
vinaya_tokens = []
for fileid in vinaya_files:
    with open("../" + fileid) as f:
        raw = f.read()
    # strip markup, tokenize, then keep lowercased alphabetic tokens only
    raw2 = BeautifulSoup(raw, 'lxml').get_text()
    tokens = word_tokenize(raw2)
    tokens = [t.lower() for t in tokens
              if "^ea^" not in t
              and "^eb^" not in t
              and t.isalpha()]
    # 'cumulative' is the running total before this file is added
    print(fileid, len(tokens), 'cumulative', len(vinaya_tokens))
    vinaya_tokens.extend(tokens)
print('vinaya tokens', len(vinaya_tokens))

v1m.xml 69771 cumulative 0
v2m.xml 46150 cumulative 69771
v3m.xml 31302 cumulative 115921
v4m.xml 100456 cumulative 147223
v5m.xml 94064 cumulative 247679
v6m.xml 61914 cumulative 341743
vinaya tokens 403657


In [14]:
# make sutta text, mūla only
sutta_files = [f for f in os.listdir('..') if f[0] in 'dmsak' and f[-5]=='m']
print(len(sutta_files))

43


In [16]:
sutta_tokens = []
for fileid in sutta_files:
    with open("../" + fileid) as f:
        raw = f.read()
    # strip markup, tokenize, then keep lowercased alphabetic tokens only
    raw2 = BeautifulSoup(raw, 'lxml').get_text()
    tokens = word_tokenize(raw2)
    tokens = [t.lower() for t in tokens
              if "^ea^" not in t
              and "^eb^" not in t
              and t.isalpha()]
    print(fileid, len(tokens))
    sutta_tokens.extend(tokens)
print('sutta tokens', len(sutta_tokens))

a10m.xml 45990
a11m.xml 8095
a1m.xml 7530
a2m.xml 9420
a3m.xml 37783
a4m.xml 48188
a5m.xml 43596
a6m.xml 26897
a7m.xml 21592
a8m.xml 29842
a9m.xml 16757
d1m.xml 43569
d2m.xml 51792
d3m.xml 43343
k10m.xml 62048
k11m.xml 12254
k12m.xml 9564
k13m.xml 3762
k14m.xml 42057
k15m.xml 46897
k16m.xml 73095
k17m.xml 48428
k18m.xml 72520
k19m.xml 71467
k1m.xml 1131
k20m.xml 28007
k21m.xml 33035
k2m.xml 5295
k3m.xml 20253
k4m.xml 11675
k5m.xml 20119
k6m.xml 12236
k7m.xml 10321
k8m.xml 15237
k9m.xml 6004
m1m.xml 80387
m2m.xml 89322
m3m.xml 69114
s1m.xml 37052
s2m.xml 41180
s3m.xml 43465
s4m.xml 65319
s5m.xml 68748
sutta tokens 1534386


In [None]:
print(len(set(sutta_tokens)))
sutta_text = nltk.Text(sutta_tokens)
sutta_text.collocations()
fd = nltk.FreqDist(sutta_text)
fd.most_common(50)
fd.plot(cumulative=True)

In [None]:
dir(fd)
len(fd.hapaxes())

In [13]:
# make abhidhamma text, mūla only
abhi_files = [f for f in os.listdir('..') if f[0] == 'x'  and f[-5]=='m']
abhi_files

['x1m.xml', 'x2m.xml']

In [17]:
abhidhamma_tokens = []
for fileid in abhi_files:
    with open("../" + fileid) as f:
        raw = f.read()
    # strip markup, tokenize, then keep lowercased alphabetic tokens only
    raw2 = BeautifulSoup(raw, 'lxml').get_text()
    tokens = word_tokenize(raw2)
    tokens = [t.lower() for t in tokens
              if "^ea^" not in t
              and "^eb^" not in t
              and t.isalpha()]
    # 'cumulative' is the running total before this file is added
    print(fileid, len(tokens), 'cumulative', len(abhidhamma_tokens))
    abhidhamma_tokens.extend(tokens)
print('abhidamma tokens', len(abhidhamma_tokens))

x1m.xml 59416 cumulative 0
x2m.xml 54952 cumulative 59416
abhidamma tokens 114368
