<a href="https://colab.research.google.com/github/probabll/ntmi-tutorials/blob/main/Rank_frequency.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# ILOs

After completing this tutorial the student 

* can analyse rank frequency and fit a Zipf/Zeta distribution to data



# Table of contents


* [Rank-Frequency](#rankfreq)
  

## Packages

Everything can be installed with pip. For example, run `!pip install numpy` in a cell.
Some tools might require restarting the notebook's kernel.


In [None]:
!pip install numpy
!pip install scipy
!pip install matplotlib
!pip install pandas
!pip install seaborn
!pip install nltk
!pip install tabulate

In [None]:
import numpy as np
import matplotlib.pyplot as plt 
import scipy.stats as st
import urllib  # sometimes we need to download stuff
import gzip    # sometimes the stuff we downloaded is gzipped
import json    # sometimes we download dictionaries stored in json format
import pandas as pd    # great for organising tabular data
import seaborn as sns  # lots of fancy plotting functions coded for us
import nltk
from tabulate import tabulate
from collections import Counter
from itertools import cycle

## <a name="nltk"></a> NLTK

[NLTK](https://www.nltk.org) is a platform for building Python programs to work with human language data. It provides access to corpora and other linguistic resources, as well as a simple interface for developing NLP applications. 

Before you start programming make sure you have installed all necessary packages. You can install packages directly from your jupyter notebook using the command `!pip install <package>`.

In [None]:
import nltk

The first time you use nltk, you will have to download some packages. 

In [None]:
## These are the packages needed for this tutorial:

nltk.download('punkt')
nltk.download('treebank')
nltk.download('alpino')
nltk.download('floresta')


## If you are running this locally, you can also install 'all',
##  but it will take a while (hence, we don't recommend downloading 'all' on colab)

<details>
    <summary> Some people reported the error <i>SSL: Certificate verify failed</i> on macOS. If it happens to you, you can use the following workaround.
    </summary>
    
```python
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download('all')
```

</details>

## <a name="rankfreq"></a> Rank-Frequency

    
***If you choose not to work through this section in week 1, it's still advisable to study the section at a later moment: its content is useful and relevant to this course.***
    
Natural languages are remarkably productive, and human speakers are very creative. Day after day, the vocabulary of every natural language actively spoken on the planet is continuously changing. New words are created, existing words are reused in novel ways, some words lose their prominence.

In most NLP applications the vocabulary of a language is frozen. We consider a "vocabulary" the set of all known types at a given time. Here we use the word *type* to distinguish, for example, the unique token `the` from its many occurrences in a corpus (which we usually call *instances*).

Entries in a vocabulary are generally referred to as *words*, but in NLP they really are *tokens*, where a *token* is any sequence of characters that we treat as a unit (typically a sequence of non-blank characters). For example, linguistically speaking `Oct.` is not a word (it's an abbreviated form of the word `October`), but it may well be a token in our NLP system's vocabulary. Conversely, `camera-ready` is a word in English, but any one occurrence of `camera-ready` may be split into one or more tokens depending on our tokenization strategy (e.g., `camera`, `-`, `ready`). Moreover, while linguistically `Oct.` really is an instance of the word `October`, a computer cannot tell that unless we are explicit about it. NLP systems won't be able to infer the relationship between these two strings, `Oct.` and `October`, unless we give them the means to do so. 
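
To make the distinction between instances and types concrete, here is a minimal sketch. The sentence and the two tokenization strategies are made up for illustration; they are not the tokenizers used to prepare NLTK's corpora. It counts instances and types under whitespace splitting and under a simple regular expression that separates punctuation.

```python
import re

text = "The camera-ready paper is due in Oct. and the camera is ready ."

# Strategy 1: a token is any maximal sequence of non-blank characters.
whitespace_tokens = text.split()

# Strategy 2: a token is either a run of word characters or a single
# punctuation symbol, so `camera-ready` becomes `camera`, `-`, `ready`.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

for name, tokens in [("whitespace", whitespace_tokens), ("regex", regex_tokens)]:
    n_instances = len(tokens)   # every occurrence counts
    n_types = len(set(tokens))  # distinct strings only
    print(f"{name}: {n_instances} instances, {n_types} types")
```

Note that neither strategy relates `Oct.` to `October`, and that `The` and `the` count as different types in both.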

We will now begin to appreciate one of the most important aspects of written language: *data sparsity*. Data sparsity affects many aspects of NLP systems, and a system's vocabulary is probably the best example.

You are probably aware of [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law), an empirical finding that the frequency of any word is inversely proportional to its rank in the frequency table. We will now verify this finding. 
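
As a quick illustration of what the law predicts, the sketch below assumes the idealised form $f(r) = C / r$. The count `C` of the most frequent word is a made-up number, not taken from any corpus, and the function name is ours.

```python
# Idealised Zipf's law: the frequency of the word of rank r is C / r,
# where C is the frequency of the most frequent word (rank 1).
C = 60_000  # hypothetical count of the top-ranked word

def zipf_predicted_frequency(rank, top_frequency=C):
    """Frequency predicted for a given rank under f(r) = C / r."""
    return top_frequency / rank

for rank in [1, 2, 3, 10, 100]:
    print(rank, zipf_predicted_frequency(rank))
```

Under this idealisation the product of rank and frequency is constant, which is what "inversely proportional" means.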

**Data** NLTK also provides access to corpora. 

You can check the documentation of the [corpus package](https://www.nltk.org/api/nltk.corpus.html#module-nltk.corpus) online or on your own jupyter notebook using `nltk.corpus?`. Here is a [list of available corpora](http://www.nltk.org/nltk_data/).

Corpora in NLTK are mostly already pre-processed at the basic levels (e.g., sentence splitting, and tokenization). 

Let's have a look at a sample from the English Penn Treebank (again, a section of the WSJ corpus). If you are a Dutch speaker, you can also check Alpino (Dutch). 

In [None]:
from nltk.corpus import floresta as pt_floresta
from nltk.corpus import alpino as nl_alpino
from nltk.corpus import treebank as en_ptb

In [None]:
len(en_ptb.words()), len(nl_alpino.words()), len(pt_floresta.words())

To work with rank and frequency we need to determine the number of occurrences of each token in the vocabulary of a given corpus. In Python, a `Counter` (from `collections`) can help us achieve that. If you are not familiar with `Counter` but know `dict`, the two are very similar; check the Python docs.

In [None]:
from collections import Counter

In [None]:
counter = Counter(en_ptb.words())

The counter stores a dictionary where each key is an observed token and its value is the number of times it occurred.

In [None]:
# Note that we can use the counter as a dictionary that maps from a token to its count:
'day' in counter.keys(), counter['day']

In [None]:
'NTMI' in counter.keys()  # it looks like our course has not been mentioned in the Penn Treebank yet ;)

Counters can sort the vocabulary for us:

In [None]:
counter.most_common(10)

Unfortunately, computers do not know that tokens like 'Day' and 'day' refer to the same thing.

In [None]:
# Note that for a computer 'Day' and 'day' are different tokens.
'Day' in counter.keys(), counter['Day']

**Exercise with solution** One way to deal with that is to lowercase the data as a pre-processing step. This has downsides. Can you think of some?

<details>
    <summary> <b>Click for a solution</b>  </summary>

Not every language uses lower/upper case characters (e.g., Chinese, Japanese, Arabic).

---
    
</details>    
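
Here is a small sketch of lowercasing as a pre-processing step, on a made-up list of tokens. Counts of case variants are merged, which shrinks the vocabulary, but tokens such as `US` and `us` are merged too.

```python
from collections import Counter

tokens = ["Day", "after", "day", "the", "US", "told", "us", "The", "same"]

cased = Counter(tokens)
lowercased = Counter(t.lower() for t in tokens)

print(len(cased), len(lowercased))      # vocabulary size before and after
print(cased["day"], lowercased["day"])  # `Day` and `day` are now one type
print(lowercased["us"])                 # `US` and `us` are conflated as well
```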

This is not the only issue contributing to sparse vocabularies. Morphological inflection does that too: singular vs plural, gender marking, and syntactic case are all linguistic devices that contribute to data sparsity. In some applications we might want to treat all instances of `day`, `Day`, `days`, and `Days` as if they referred to the same type (the English word `DAY`).

One relatively simple way to reduce the vocabulary size by collapsing different variants of a certain base form is to use a [stemmer](https://en.wikipedia.org/wiki/Stemming). NLTK provides options for a few languages including [English, Dutch, and Portuguese](https://www.nltk.org/api/nltk.stem.html). 

Here is an example of what stemmers do:

In [None]:
from nltk.stem.snowball import EnglishStemmer, DutchStemmer, PortugueseStemmer

In [None]:
en_stemmer = EnglishStemmer()
nl_stemmer = DutchStemmer()
pt_stemmer = PortugueseStemmer()

In [None]:
from tabulate import tabulate

for i, s in zip(range(3), en_ptb.sents()):
    rows = []
    rows.extend([(w, en_stemmer.stem(w)) for w in s])
    print(tabulate(rows, headers=['word', 'stem']))
    print()

Here we use a log-log plot to verify Zipf's law (i.e., if you plot the log of the rank vs the log of the frequency, you should see something close to a straight line). 


In [None]:
def get_ranks(words):
    """Map a list of words to a np.array of ranks, where the most frequent word is assigned rank 1"""
    counter = Counter(words)
    w2r = {word: rank for rank, (word, count) in enumerate(counter.most_common(), 1)}
    return np.array([w2r[w] for w in words])

def get_rankfreq_pairs(words):
    """
    Map a list of words to an np.array with shape [K, 2] where K is the number of distinct tokens in the input list.
    The first column of the array is the rank, the second column of the array is the count.
    """
    counter = Counter(words)
    # rank-frequency
    rf = np.array([[r, n] for r, (w, n) in enumerate(counter.most_common(), 1)])
    return rf

Log-log plots

In [None]:
from itertools import cycle
palette = cycle(sns.color_palette())

for corpus_name, corpus, stemmer in zip(['en_ptb', 'nl_alpino', 'pt_floresta'], [en_ptb, nl_alpino, pt_floresta], [en_stemmer, nl_stemmer, pt_stemmer]):

    words = corpus.words()
    rf = get_rankfreq_pairs((stemmer.stem(w) for w in words)) # we stem the words (the Snowball stemmers also lowercase them), which makes sense for these languages
    c = next(palette)
    _ = plt.loglog(rf[:,0], rf[:,1], '-', c=c, label=corpus_name)
_ = plt.legend()

The general tendency indeed looks like a straight line, which in a log-log plot denotes a power-law relationship between rank and frequency: frequency is proportional to $\text{rank}^{-s}$ for some $s > 0$, where $-s$ is the slope of the line. As the slope is negative, frequency decays polynomially with an increase in rank. For a slope close to $-1$, for example, the second most frequent word is about half as frequent as the first, the third about a third as frequent, and so on. 

The lines are not very straight at the extremes (lowest and highest ranks). The lowest ranks are probably distorted by the presence of stop words. As for the highest ranks, the information is probably distorted by the limited dataset size (many types occur only once or twice). Generally, it does look like Zipf's law is indeed a robust finding.
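
To see why a straight line in a log-log plot corresponds to a power law, note that if $f(r) = C r^{-s}$, then $\log f(r) = \log C - s \log r$, which is a line with slope $-s$. The sketch below checks this on synthetic frequencies (the values of `C` and `s` are arbitrary choices for illustration) using an ordinary least-squares fit of log-frequency on log-rank.

```python
import math

C, s = 1000.0, 1.2  # arbitrary constants for this illustration
ranks = range(1, 201)
log_r = [math.log(r) for r in ranks]
log_f = [math.log(C * r ** (-s)) for r in ranks]

# Ordinary least squares for log_f = intercept + slope * log_r
n = len(log_r)
mean_x = sum(log_r) / n
mean_y = sum(log_f) / n
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(log_r, log_f))
    / sum((x - mean_x) ** 2 for x in log_r)
)
intercept = mean_y - slope * mean_x

print(f"slope = {slope:.3f}, exp(intercept) = {math.exp(intercept):.1f}")
```

On real corpora the same fit is only a rough summary, since the extremes of the curve deviate from the line.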



### Zipf and Zeta 

We will now design a statistical model of the rank. For that, we can use the [Zipf distribution](https://en.wikipedia.org/wiki/Zipf's_law), which predicts that out of a population of $N > 0$ elements, the probability of the element of rank $k \ge 1$ is 
\begin{align}
\mathrm{Zipf}(k|N, s) = \frac{\frac{1}{k^s}}{H_{N,s}}
\end{align}
where the normalisation term is defined as $H_{N,s} = \sum_{n=1}^N \frac{1}{n^s}$, and $s>1$ is called the *power parameter*.

The Zipf distribution requires a fixed number of elements $N$ and thus supports ranks in $\{1, \ldots, N\}$.
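
A minimal sketch of this pmf in plain Python follows. The function names and the values of `N` and `s` are ours, chosen for illustration. It also checks that the probabilities over ranks $1, \ldots, N$ sum to one.

```python
def generalised_harmonic(N, s):
    """H_{N,s} = sum_{n=1}^{N} 1 / n^s, the normalisation term."""
    return sum(1.0 / n ** s for n in range(1, N + 1))

def zipf_pmf(k, N, s):
    """Probability of rank k under Zipf(N, s); zero outside {1, ..., N}."""
    if k < 1 or k > N:
        return 0.0
    return (1.0 / k ** s) / generalised_harmonic(N, s)

N, s = 1000, 1.5
probs = [zipf_pmf(k, N, s) for k in range(1, N + 1)]
print(sum(probs))            # should be (numerically) 1
print(probs[0] / probs[1])   # ratio between ranks 1 and 2, i.e. 2 ** s
```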

A slightly more convenient option is the [Zeta distribution](https://en.wikipedia.org/wiki/Zeta_distribution), which generalises the Zipf distribution by removing the need to specify the total number of elements in the population. This is convenient when we are analysing populations (corpora) of different sizes.

The Zeta distribution assigns probability 
\begin{align}
\mathrm{Zeta}(k|s) = \frac{\frac{1}{k^s}}{\zeta(s)}
\end{align}
to rank $X=k$. The power parameter is $s>1$ as before, and $\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^s}$ is the [Riemann zeta function](https://en.wikipedia.org/wiki/Riemann_zeta_function) which is implemented in `scipy.special.zeta`. The support of the Zeta distribution is $\mathbb N_1 = \{1, 2, \ldots\}$.
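
As a sanity check of the formula, the sketch below evaluates it directly with `scipy.special.zeta` and compares the result against the pmf of `scipy.stats.zipf`, which is SciPy's one-parameter (Zeta) distribution. The power `s = 2.0` and the range of ranks are arbitrary choices.

```python
import numpy as np
from scipy.special import zeta
from scipy.stats import zipf

s = 2.0
k = np.arange(1, 11)

# Zeta(k|s) = k^{-s} / zeta(s); the second argument q=1 selects the Riemann zeta function
manual_pmf = k ** (-s) / zeta(s, 1)
scipy_pmf = zipf(s).pmf(k)

print(np.allclose(manual_pmf, scipy_pmf))
```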

**Exercise with solution - Grid search MLE for Zeta** 
    
Perform a grid search to estimate the power parameter of the Zeta distribution for each dataset. Note that power parameters must be greater than 1. A reasonable range of parameters to test is something like `[1.01, 3.0]`. Obtain samples from the MLE Zeta and plot them against the observations (use as many samples as you have observations). Comment on the fit: do you think the model fits the data well? Does it fit the data well for most rank values, or are there types of values (e.g., very low or very high) for which the model does not do so well?

As scipy has a stable implementation of the Riemann zeta function (`scipy.special.zeta`), we could implement the pmf of the Zeta distribution ourselves. Generally, however, it is a good idea to reuse high-quality mathematical code. It turns out that the statistics package in scipy has a stable implementation of the Zeta distribution but, funnily enough, it is named `scipy.stats.zipf` instead of `scipy.stats.zeta`. By some historical accident, the two terms 'Zipf distribution' and 'Zeta distribution' came to be used somewhat interchangeably in statistics. When in doubt, check the parameters: if the distribution takes two parameters (total population size $N > 1$ and power $s > 1$), we have the classic Zipf; if it takes one parameter (just the power $s>1$), we have Zipf's generalisation called Zeta.

In any case, we will go on with the *Zeta* distribution, and you can count on its good implementation in scipy, which is called `scipy.stats.zipf`.


<details>
 <summary> Hint </summary>

Check the grid search we did for the Poisson case: we first implemented the Poisson likelihood function, and then implemented the search. The strategy here is very similar, but make sure that you use the correct pmf.

---

</details>

<details>
 <summary> You do not need to use this trick, but you may find it useful. </summary>

For discrete data, sometimes we store the *counts* of the outcomes rather than the outcomes themselves, that is, we store a vector $\mathbf c$ where $0 \le c_k \le N$ is the number of times outcome $k$ occurs in $\mathcal D$. We can re-express the log-likelihood function in terms of counts:

\begin{align} 
\mathcal L_{\mathcal{D}}(\theta) &= \sum_{n=1}^N \log f_\theta(x_n) = \sum_{k \in \mathrm{supp}(P_X)} c_k \log f_\theta(k) ~,
\end{align}

where in practice we only evaluate the terms for which $c_k > 0$ in the dataset.

</details>
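
The identity between the sum over observations and the sum over counts can be checked numerically. The sketch below uses a made-up dataset of ranks and, as a stand-in for $f_\theta$, a Zipf pmf over a small population. The helper `log_pmf` and the values of `N_ELEMENTS` and `s` are illustrative assumptions.

```python
import math
from collections import Counter

N_ELEMENTS, s = 5, 1.5  # illustrative population size and power
H = sum(1.0 / n ** s for n in range(1, N_ELEMENTS + 1))

def log_pmf(k):
    """log Zipf(k | N_ELEMENTS, s) for k in {1, ..., N_ELEMENTS}."""
    return -s * math.log(k) - math.log(H)

data = [1, 1, 1, 2, 1, 3, 2, 1, 5, 2]  # made-up observations (ranks)

# Sum over individual observations
ll_observations = sum(log_pmf(x) for x in data)

# Sum over distinct outcomes, each weighted by its count
counts = Counter(data)
ll_counts = sum(c * log_pmf(k) for k, c in counts.items())

print(ll_observations, ll_counts)
```

The count-based version touches each distinct outcome once, which matters when a corpus has many more tokens than types.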

In [None]:
def zeta_log_likelihood(power, rank, freq):
    r"""
    This function should return a single real value representing the log-likelihood     
     \sum_{r=1}^R count(r) * log Zeta(rank=r|power)
    where count(r) is the number of times the rank r occurs in the dataset.
    """
    raise NotImplementedError("Implement me!")

<details>
 <summary> Click for a solution </summary>

```python
def zeta_log_likelihood(power, rank, freq=None):
    dist = st.zipf(power)   # scipy calls it Zipf, but it is the single-parameter version (thus, Zeta)
    ll = dist.logpmf(rank)  # log-probability of each rank
    if freq is not None:
        # weight the log-probability of each rank by the number of times it occurs
        ll = ll * np.asarray(freq)
    return ll.sum()
```

</details>

Grid search:

In [None]:
from itertools import cycle

palette = cycle(sns.color_palette())

for corpus_name, corpus, stemmer in zip(['en_ptb', 'nl_alpino', 'pt_floresta'], [en_ptb, nl_alpino, pt_floresta], [en_stemmer, nl_stemmer, pt_stemmer]):
    # get all words
    words = corpus.words()
    # obtain their rank frequencies
    rf = get_rankfreq_pairs((stemmer.stem(w) for w in words))
    
    # plot the data
    c = next(palette)
    _ = plt.loglog(rf[:,0], rf[:,1], '-', c=c)
    
    # make a grid
    grid = np.linspace(1.01, 3., 300)
    # evaluate log-likelihood for each parameter value in the grid
    lls = np.array([zeta_log_likelihood(power, rf[:,0], rf[:,1]) for power in grid])
    # find the id of the parameter that maximises the log-likelihood in the grid
    k = np.argmax(lls)
    # this is the parameter that attains the maximum (in the grid)
    mle = grid[k]

    # plot samples from the corresponding Zeta distribution
    N = len(words)
    x_ = st.zipf(mle).rvs(N) # scipy calls it Zipf, but it's the single parameter version (i.e., Zeta)
    rf_ = get_rankfreq_pairs(x_)
    _ = plt.loglog(rf_[:,0], rf_[:,1], '--', c=c, label=f"X ~ Zeta({mle:.4f}): {corpus_name}")
_ = plt.legend()

The models are a fairly decent approximation of the data, except towards the extremes (lowest and highest ranks). But, generally, it does look like the data could follow a Zeta distribution.