# Basic Statistics

The corpus contains the following 12 files, which live in the `corpus` directory at https://github.com/ilonabudapesti/buddhism-nlp/tree/master/Pali_Oxford_MSt/corpus

They are plain text files taken directly from the Digital Pali Reader, which uses an electronic version of the CTS Edition.

In [None]:
import nltk, os, re, pprint
from nltk import word_tokenize

fileids = os.listdir('../corpus')

The 12 files are combined into one large string. The length of this string is 122,497 characters. 

In [None]:
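raw = ''
for fileid in fileids:
    # Assumes the files are UTF-8 encoded; `with` closes each file after reading.
    with open(os.path.join('../corpus', fileid), encoding='utf-8') as f:
        raw += f.read()

len(raw)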

It is more useful to know the number of words than the number of characters in the corpus. We therefore tokenize the raw string: `word_tokenize` splits it into words and also separates punctuation marks into tokens of their own. We then sample a few elements.

Next we use NLTK's built-in `Text` class to wrap our list of tokens in an NLTK `Text` object. This gives us access to several built-in methods, such as `concordance`, which lists the occurrences of a specific token in the `Text` together with their surrounding context.

In [None]:
tokens = word_tokenize(raw)
tokens[100:110]
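
To see why punctuation-aware tokenization gives different counts from plain whitespace splitting, here is a minimal sketch using only the standard library. The sample sentence and the regular expression are assumptions made for illustration; the regex is a rough approximation and not the algorithm `word_tokenize` actually uses.

```python
import re
import unicodedata

# Illustrative sample sentence (an assumption, not read from the corpus files).
sample = "Evaṃ me sutaṃ. Ekaṃ samayaṃ bhagavā sāvatthiyaṃ viharati, jetavane."

# Compose base letters and combining marks so that letters with
# diacritics are single characters that match \w.
sample = unicodedata.normalize('NFC', sample)

# Whitespace splitting leaves punctuation glued to the preceding word.
by_space = sample.split()

# Rough regex approximation: runs of word characters, or single
# punctuation marks, each become their own token.
by_regex = re.findall(r"\w+|[^\w\s]", sample)

print(len(by_space), by_space[:3])
print(len(by_regex), by_regex[:4])
```

The second list is longer because each full stop and comma is counted as a separate token, which is why punctuation shows up in the token counts later in this notebook.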

In [None]:
text = nltk.Text(tokens)
text.concordance('samaṇa')

We can take a frequency distribution of the text, which shows how many times each distinct token (also called a *word-type*) occurs.

We list the word-types in order of frequency. The punctuation marks ',' and '.' are the most frequent, appearing 1557 and 1377 times respectively. They are followed by 'ti', 'kho' and 'na', with 364, 278 and 203 occurrences respectively. Words that appear with high frequency but carry little semantic content are called *stop-words*. Examples of stop-words in English are 'and', 'the', 'it' and many pronouns.

The first word in the frequency distribution with semantic meaning is 'bhante', which as a form of address falls within the frequency range of stop-words.

Note that words from the Angulimala and Mahagovinda, both from the DN, will appear higher in the frequency distribution because those texts are longer and therefore offer more chance for repetition.

In [None]:
fd = nltk.FreqDist(text)
fd.most_common()
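
One way to look past punctuation and stop-words is to filter them out before counting. The sketch below uses `collections.Counter` on a made-up token list; both `sample_tokens` and the three-word `stop_words` set are hypothetical stand-ins, and a real Pāli stop-word list would need to be compiled separately.

```python
from collections import Counter

# Hypothetical mini token list standing in for the notebook's `tokens`.
sample_tokens = ['bhante', ',', 'ti', 'kho', 'aggi', '.', 'na', 'bhante', 'ti', '.']

# Hypothetical stop-word list: only the three particles named in the text above.
stop_words = {'ti', 'kho', 'na'}

# Keep alphabetic tokens that are not stop-words, then count them.
content = [t for t in sample_tokens if t.isalpha() and t not in stop_words]
counts = Counter(content)

print(counts.most_common(3))
```

Applied to the real corpus, the same filter could be run over `tokens` to see which content words rise to the top once particles and punctuation are set aside.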

In [None]:
len(text) # number of tokens (words, including punctuation and references) in the corpus

In [None]:
# We remove punctuation and numbers, lowercase the rest, and count unique word-types.

vocab = sorted({w.lower() for w in text if w.isalpha()})
len(vocab)

# There are 4,052 unique word-types in the corpus

A list of all word-types is below. Python's `sorted` orders strings by Unicode code point, so plain Latin letters come first and letters with diacritics (such as 'ā' or 'ñ') sort after 'z'; because the vocabulary is lowercased, it contains no capital letters.
Note that words with different conjugations and declensions count as different word-types.
To find the number of unique head-words we will need to **lemmatize** or **stem** our vocabulary.
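
To make the idea of stemming concrete, here is a toy suffix stripper. It is only a sketch: the `ENDINGS` list is a small, hypothetical sample of a-stem noun endings, the function name `toy_stem` is made up, and a usable Pāli stemmer or lemmatizer would need far more rules plus handling of sandhi and verb forms.

```python
# Hypothetical, incomplete sample of endings, longest first so that
# 'assa' is tried before the shorter 'a'.
ENDINGS = sorted(['assa', 'ena', 'aṃ', 'o', 'a'], key=len, reverse=True)

def toy_stem(word, min_stem=3):
    """Strip the longest matching ending, keeping at least `min_stem` characters."""
    for ending in ENDINGS:
        if word.endswith(ending) and len(word) - len(ending) >= min_stem:
            return word[:-len(ending)]
    return word

# Five inflected forms that the vocabulary would count as five word-types.
forms = ['samaṇo', 'samaṇaṃ', 'samaṇassa', 'samaṇena', 'samaṇa']
stems = {toy_stem(w) for w in forms}

print(len(forms), 'forms ->', len(stems), 'stem')
```

The `min_stem` guard keeps short particles such as 'na' from being reduced to a single letter.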

In [None]:
vocab

In [None]:
text.concordance('aggi')

In [None]:
text.concordance('Taṃ') # The concordance method finds occurrences regardless of capitalization

# Letter Frequencies

In [None]:
import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline
mpl.rcParams['font.family'] = 'Arial'

raw[:100]

In [None]:
# Keep alphabetic tokens only, then count the individual lowercase letters.
alphatokens = [t for t in tokens if t.isalpha()]
letters = ''.join(alphatokens).lower()
fd = nltk.FreqDist(letters)
fd.plot()
fd.most_common(50)

Note that there is only one occurrence of 'ś', which is strange. The occurrences of 'f' and 'w' probably come from bracketed references pointing to variant readings.
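
To check where such rare letters come from, it helps to print each occurrence with some surrounding context. The helper below is a minimal sketch; the name `contexts` and the sample string are made up for illustration, and in the notebook the function would be called on `raw` instead.

```python
def contexts(text, char, width=20):
    """Return every occurrence of `char` with up to `width` characters on each side."""
    hits = []
    start = text.find(char)
    while start != -1:
        lo = max(0, start - width)
        hits.append(text[lo:start + len(char) + width])
        start = text.find(char, start + 1)
    return hits

# Made-up sample with a bracketed reference; not taken from the corpus.
sample_raw = "evam me sutam [w. 12] ekam samayam"

for hit in contexts(sample_raw, 'w', width=8):
    print(repr(hit))
```

Calling `contexts(raw, 'ś')` in the notebook would show whether the single 'ś' sits inside a bracketed reference or in the running text.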

### Question: Should we remove content in brackets before processing? () {} []
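
If the answer turns out to be yes, a regular expression is one simple way to do it. The following is a sketch under stated assumptions: it handles only non-nested brackets, the names `BRACKETED` and `strip_brackets` are made up, and the sample string is invented. Note that it also collapses all runs of whitespace, including line breaks, into single spaces.

```python
import re

# Matches non-nested (...), [...] and {...} spans; nested brackets are not handled.
BRACKETED = re.compile(r"\([^()]*\)|\[[^\[\]]*\]|\{[^{}]*\}")

def strip_brackets(text):
    """Remove bracketed spans and collapse the whitespace left behind."""
    without = BRACKETED.sub(' ', text)
    return re.sub(r"\s+", ' ', without).strip()

# Made-up sample; not taken from the corpus.
sample = "evam me sutam [w. 12] ekam (f) samayam {3}"
print(strip_brackets(sample))
```

Comparing the letter frequencies of `raw` and `strip_brackets(raw)` would show how much of the 'f' and 'w' count comes from bracketed material.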

# Zipf's Law in Pāli

Let `f(w)` be the frequency of a word w in free text. Suppose that all the words of a text are ranked according to their frequency, with the most frequent word first. Zipf's law states that the frequency of a word type is inversely proportional to its rank (i.e. `f × r = k`, for some constant `k`). For example, the 50th most common word type should occur three times as frequently as the 150th most common word type.
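
The arithmetic behind that example can be written out explicitly. Writing $f_r$ for the frequency of the word type at rank $r$ (a notation introduced here only for this step), the law $f \times r = k$ gives:

$$
f_r = \frac{k}{r}
\quad\Longrightarrow\quad
\frac{f_{50}}{f_{150}} = \frac{k/50}{k/150} = \frac{150}{50} = 3
$$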

Zipf's law holds empirically for English, and the fit improves as the sample size grows; it does not hold for randomly generated texts using the English alphabet.

The question is whether Zipf's law holds true for Pāli.
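
One way to start answering this is to check whether rank times frequency stays roughly constant, and whether a log-log plot of frequency against rank has a slope near -1. The sketch below uses only the standard library; the helper names `zipf_table` and `loglog_slope` are made up, and the counts are synthetic values chosen to follow the law exactly, so they say nothing about the Pāli corpus itself.

```python
import math
from collections import Counter

def zipf_table(tokens, top=10):
    """Return (rank, word, frequency, rank * frequency) for the `top` most common words."""
    ranked = Counter(tokens).most_common(top)
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]

def loglog_slope(freqs):
    """Least-squares slope of log(frequency) on log(rank).

    `freqs` must be sorted from most to least frequent and contain
    at least two values. A slope near -1 is what Zipf's law predicts.
    """
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic counts that follow f = 120 / r exactly, used only to check the helpers.
ideal = [120, 60, 40, 30, 24, 20]
print(loglog_slope(ideal))

# Toy tokens whose rank * frequency product is constant.
toy_tokens = ['a'] * 6 + ['b'] * 3 + ['c'] * 2
for row in zipf_table(toy_tokens):
    print(row)
```

For the corpus, the frequencies would come from a word-level distribution such as `nltk.FreqDist(text)`, ideally after deciding how to treat punctuation and bracketed references, since both affect the top ranks.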