<a href="https://colab.research.google.com/github/probabll/ntmi-tutorials/blob/main/Werkcollege1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we practice identifying and estimating parameters for statistical models of linguistic data.

In [None]:
!pip install nltk

In [None]:
import nltk

NLTK gives us access to [WordNet](https://www.nltk.org/howto/wordnet.html), a resource containing rich information about the lexicon of various languages.

In [None]:
nltk.download('wordnet')
nltk.download('omw')

In [None]:
from nltk.corpus import wordnet as wn

In [None]:
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

# Data

For example, WordNet has a repository of English lemmas, and these lemmas are categorised into parts of speech (syntactic function) such as nouns (n) or verbs (v).

In [None]:
sum(1 for _ in wn.all_lemma_names()), sum(1 for _ in wn.all_lemma_names('n')), sum(1 for _ in wn.all_lemma_names('v'))

Let's check some nouns:

In [None]:
for lemma in list(wn.all_lemma_names('n'))[50:60]:
    print(lemma)

And some verbs:

In [None]:
for lemma in list(wn.all_lemma_names('v'))[50:60]:
    print(lemma)

A few other languages are also part of the resource:

In [None]:
for lemma in list(wn.all_lemma_names('v', lang='nld'))[50:60]:
    print(lemma)

Let's analyse the English data:

In [None]:
nouns = [lemma for lemma in wn.all_lemma_names('n')]
verbs = [lemma for lemma in wn.all_lemma_names('v')]

In [None]:
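# Note: wn.synsets(lemma) counts the senses of a lemma across all parts of speech;
# pass pos='n' or pos='v' to restrict the count to a single part of speech.
num_synsets_nouns = np.array([len(wn.synsets(lemma)) for lemma in nouns])
num_synsets_verbs = np.array([len(wn.synsets(lemma)) for lemma in verbs])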

In [None]:
fig, ax = plt.subplots(1, 2, sharex=False, sharey=False, figsize=(8, 4))
_ = ax[0].hist(num_synsets_nouns, bins='auto')
_ = ax[0].set_xlabel('Number of senses for nouns')
_ = ax[1].hist(num_synsets_verbs, bins='auto')
_ = ax[1].set_xlabel('Number of senses for verbs')

In [None]:
st.describe(num_synsets_nouns)

In [None]:
st.describe(num_synsets_verbs)

# Empirical investigation and solution

The number of senses decays fairly quickly. Nouns in particular do not seem as polysemous as verbs in English. The distribution for nouns is very concentrated at 1. The distribution for verbs is smoother.

The number of observations in each bin decreases rather fast as the number of senses increases. Perhaps exponentially fast?

In [None]:
fig, ax = plt.subplots(1, 2, sharex=False, sharey=False, figsize=(8, 4))
_ = ax[0].hist(num_synsets_nouns, bins='auto', log=True)
_ = ax[0].set_xlabel('Number of senses for nouns')
_ = ax[1].hist(num_synsets_verbs, bins='auto', log=True)
_ = ax[1].set_xlabel('Number of senses for verbs')

Plotting the y-axis in log scale helps us see that an exponential decay is plausible.

We also see, though, that nouns still concentrate heavily at 1.
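
To see why a log-scale y-axis is informative here, recall that an exactly exponential decay becomes a straight line once we take logs. The sketch below checks this for a Geometric pmf; the value of `p` is an arbitrary choice made only for illustration.

```python
import numpy as np
import scipy.stats as st

# Hypothetical success probability, chosen only for illustration.
p = 0.4
k = np.arange(1, 11)

# log pmf of a Geometric on {1, 2, ...}: log(p) + (k - 1) * log(1 - p)
log_pmf = st.geom(p).logpmf(k)

# Consecutive differences are constant, so the log pmf is a straight line in k.
slopes = np.diff(log_pmf)
print(slopes)
print(np.log(1 - p))
```

A histogram whose bar heights fall roughly along a straight line in log scale is therefore compatible with this kind of decay.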

In [None]:
fig, ax = plt.subplots(1, 2, sharex=False, sharey=False, figsize=(8, 4))
_ = ax[0].hist(np.log(num_synsets_nouns), bins='auto', log=True)
_ = ax[0].set_xlabel('Log number of senses for nouns')
_ = ax[1].hist(np.log(num_synsets_verbs), bins='auto', log=True)
_ = ax[1].set_xlabel('Log number of senses for verbs')

A log-log plot shows the same trend.

To capture the behaviour of the data in terms of "number of senses" in a statistical law, we need to look for a law that

1. supports integers starting from 1
2. has a pmf that decays (roughly) exponentially quickly
3. has a variance that is not too high

We can contrast properties of known laws against these objectives.
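
A minimal sketch of such a contrast, using the frozen distributions in `scipy.stats`. The parameter values are arbitrary and only serve to illustrate the checks; the grid `k` is truncated at 50, which is enough to locate the start of the support and the mode for these choices.

```python
import numpy as np
import scipy.stats as st

# Parameter values are arbitrary and only illustrate the checks.
candidates = {
    "Binomial(10, 0.3)": st.binom(10, 0.3),
    "Poisson(1.5)": st.poisson(1.5),
    "Geometric(0.5)": st.geom(0.5),
    "Zipf(2.0)": st.zipf(2.0),
}

k = np.arange(0, 50)
summary = {}
for name, dist in candidates.items():
    pmf = dist.pmf(k)
    lowest = int(k[pmf > 0][0])     # smallest outcome with non-zero mass
    mode = int(k[np.argmax(pmf)])   # most probable outcome on this grid
    summary[name] = (lowest, mode)
    print(f"{name}: support starts at {lowest}, mode at {mode}")
```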

The Binomial distribution does not seem appropriate: its generative story involves a fixed, known number of trials, which we don't have here.

The Geometric, the Poisson, and the Zipf distributions are possibly appropriate in terms of goal number 1, though the Poisson also places mass on 0.

If we look carefully, the mode of the data samples is always at 1, namely, the majority of nouns/verbs have a single sense. The Poisson distribution seems less adequate now. A Poisson with rate $\lambda$ has its mode at 1 only when $1 \le \lambda \le 2$, so if we tried to pick another Poisson (for example, in an attempt to find one with mean and variance more similar to the data), we would have to give up on matching the observed mode.
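
As a quick check of how the Poisson mode moves with its rate, the sketch below locates the most probable outcome for a few rates. The rates are arbitrary illustrative values; recall also that the mean and the variance of a Poisson are both equal to its rate, so the mode cannot be tuned independently of them.

```python
import numpy as np
import scipy.stats as st

k = np.arange(0, 30)
modes = {}
for rate in [0.5, 1.5, 2.5, 3.5]:  # arbitrary rates, for illustration only
    pmf = st.poisson(rate).pmf(k)
    modes[rate] = int(k[np.argmax(pmf)])  # most probable outcome
    print(f"Poisson({rate}): mode={modes[rate]}, mean=variance={rate}")
```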

The Geometric and the Zipf remain candidates, for both distributions have their modes fixed at 1 for any choice of parameter. Both have pmfs that decay quickly (exponentially for the Geometric, polynomially for the Zipf), so the final decision will have to depend on other properties.

One thing to know about the Zipf law is that it has extremely [heavy tails](https://en.wikipedia.org/wiki/Heavy-tailed_distribution). Draws from Zipf will often deviate dramatically far from 1, even though 1 will remain the most frequent outcome.
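
One way to quantify the difference in tails without sampling is to compare survival functions, that is, the probability of drawing a value larger than some threshold. This is a sketch with arbitrary parameter choices (`Zipf(1.5)` and `Geometric(0.5)`), not a fit to the data.

```python
import numpy as np
import scipy.stats as st

# Arbitrary parameters: both laws have their mode at 1.
zipf = st.zipf(1.5)
geom = st.geom(0.5)

tails = {}
for threshold in [10, 100, 1000]:
    # sf(t) = P(X > t)
    tails[threshold] = (float(zipf.sf(threshold)), float(geom.sf(threshold)))
    z, g = tails[threshold]
    print(f"P(X > {threshold}): Zipf={z:.2e}, Geometric={g:.2e}")
```

Under these assumptions the Geometric tail vanishes very quickly, while the Zipf keeps a non-negligible probability of producing values in the thousands.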

Let's enumerate a few Zipf distributions, sample from them, and describe some properties of the samples:

In [None]:
for power in np.linspace(1.001, 1.5, 10):
    x_ = st.zipf(power).rvs(size=10000)
    print(f"1st trial with Zipf({power:.4f})", st.mode(x_), st.describe(x_))  
    x_ = st.zipf(power).rvs(size=10000)
    print(f"2nd trial with Zipf({power:.4f})", st.mode(x_), st.describe(x_))

Look at the range of the samples we get (property `minmax` of the describe result): those values are clearly inadequate for the data we are trying to model.

So, we have enough to continue with the Geometric as our first plausible attempt.

We can find the [Geometric MLE solution on Wikipedia](https://en.wikipedia.org/wiki/Geometric_distribution) and note that by default the scipy Geometric is the version with support $\{1, 2, \ldots \}$ (not $\{0, 1, \ldots\}$).
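
For reference, the estimator can be derived in a few lines. This is a sketch assuming $N$ independent observations $x_1, \ldots, x_N$ and the pmf $p(1-p)^{x-1}$ on $\{1, 2, \ldots\}$; the symbols $N$, $x_i$ and $\bar{x}$ (the sample mean) are notation introduced here.

$$
\begin{aligned}
\log L(p) &= \sum_{i=1}^{N} \left[ \log p + (x_i - 1) \log(1 - p) \right] = N \log p + \left( \sum_{i=1}^{N} x_i - N \right) \log(1 - p) \\
\frac{\mathrm{d}}{\mathrm{d}p} \log L(p) &= \frac{N}{p} - \frac{\sum_{i=1}^{N} x_i - N}{1 - p} = 0 \\
\hat{p} &= \frac{N}{\sum_{i=1}^{N} x_i} = \frac{1}{\bar{x}}
\end{aligned}
$$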

In [None]:
def mle_geometric(x):
    prob = 1 / np.mean(x)  # the scipy Geometric has support {1, ...} not {0, ...}
    return st.geom(prob)

In [None]:
geom_n = mle_geometric(num_synsets_nouns)
geom_n.args

In [None]:
geom_v = mle_geometric(num_synsets_verbs)
geom_v.args

In [None]:
n_ = geom_n.rvs(size=num_synsets_nouns.size)
v_ = geom_v.rvs(size=num_synsets_verbs.size)

Clearly, we can capture the mean; after all, the Geometric parameter is estimated directly from the sample mean:

In [None]:
st.describe(n_)

In [None]:
st.describe(v_)
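
The same comparison can be made without sampling, since a fitted Geometric has closed-form moments: mean $1/p$ and variance $(1-p)/p^2$. The sketch below uses a small made-up array `x` in place of the WordNet counts, so the numbers it prints say nothing about the real data.

```python
import numpy as np
import scipy.stats as st

x = np.array([1, 1, 2, 1, 4, 1, 1, 2, 1, 6])  # made-up sense counts
fitted = st.geom(1 / np.mean(x))               # Geometric MLE, support {1, 2, ...}

# Analytic mean and variance of the fitted distribution.
model_mean, model_var = fitted.stats(moments="mv")
print("sample mean:", x.mean(), "model mean:", float(model_mean))
print("sample var: ", x.var(ddof=1), "model var: ", float(model_var))
```

The mean matches by construction, while the variance is whatever the single parameter implies, so it is the variance that tells us something about the quality of the fit.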

In [None]:
fig, ax = plt.subplots(2, 2, sharex='col', sharey='col', figsize=(8, 4))
_ = ax[0, 0].hist(num_synsets_nouns, bins='auto')
_ = ax[0, 0].set_xlabel('Number of senses for nouns')
_ = ax[0, 1].hist(num_synsets_verbs, bins='auto')
_ = ax[0, 1].set_xlabel('Number of senses for verbs')

_ = ax[1, 0].hist(n_, bins='auto')
_ = ax[1, 0].set_xlabel('Geometric for nouns')
_ = ax[1, 1].hist(v_, bins='auto')
_ = ax[1, 1].set_xlabel('Geometric for verbs')

It looks like the Geometric assumption works better for verbs than for nouns.
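
A comparison that depends less on visual inspection is to put the observed relative frequencies next to the fitted pmf, value by value. The sketch below does this on a toy array `x` with made-up values standing in for `num_synsets_nouns`; it illustrates the procedure only.

```python
import numpy as np
import scipy.stats as st

# Toy sense counts standing in for num_synsets_nouns (made-up values).
x = np.array([1, 1, 1, 1, 1, 1, 2, 1, 3, 1, 1, 2, 1, 1, 5, 1])

prob = 1 / np.mean(x)      # Geometric MLE, support {1, 2, ...}
fitted = st.geom(prob)

values = np.arange(1, x.max() + 1)
# Relative frequency of each value in the sample.
empirical = np.array([(x == v).mean() for v in values])
model = fitted.pmf(values)

for v, e, m in zip(values, empirical, model):
    print(f"k={v}: empirical={e:.3f}, geometric={m:.3f}")
```

In this toy sample the observed share of 1s exceeds what the fitted Geometric assigns to 1, which is the kind of mismatch to look for in the noun data.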

At this point in the course, we do not know enough to prescribe a model capable of a better fit for the nouns, but we will get there.