# Learning a Predictive N-Gram Model

This notebook demonstrates how to use a Markov model to predict the next word in legal text. Specifically, we model the language used in [German court decisions](https://de.wikipedia.org/wiki/Urteil_(Recht)). The focus is on showing how data from the [Open Legal Data Project](https://openlegaldata.io) can be used for machine learning.

_Note_: This demo is not about building the best predictive model for the legal domain, nor about building a competitive n-gram implementation. We use a simple fixed-order n-gram implementation without escaping, smoothing, or exclusion techniques.

## Installation

Install all repo requirements by running:
```
pipenv --python 3.7
pipenv install
```

To install this environment as a Jupyter Notebook kernel, run:
```
pipenv run python -m ipykernel install --name oldp-notebook
```

## Obtain

We obtain the training (and test) data using the [OLDP SDK for Python](https://github.com/openlegaldata/oldp-sdk-python). For a more detailed example of API client usage, refer to the [OLDP Client Demo](https://github.com/openlegaldata/oldp-notebooks/blob/master/notebooks/oldp-client-demo.ipynb) notebook.

In [1]:
import oldp_client 

conf = oldp_client.Configuration()
conf.api_key['api_key'] = '123abc'  # Replace this with your API key
api_client = oldp_client.ApiClient(conf)
cases_api = oldp_client.CasesApi(api_client)
cases = cases_api.cases_list(court_id=2).results  # first page (10 cases) for court=Europäischer Gerichtshof

## Clean

The raw data that we obtain from the API is HTML. Before we can tokenize the text, we have to strip the HTML tags, decode HTML entities, and normalize whitespace.

In [2]:
from utils import preprocessing

def clean(content):
    content = preprocessing.remove_pattern(content, r'\n|\t', replace_with=' ')
    content = preprocessing.remove_pattern(content, r'<[^>]+>')
    content = preprocessing.replace_html_special_ents(content)
    content = preprocessing.remove_whitespace(content)
    return content

text = ''
for case in cases:
    text += clean(case.content)
    
print("Before: ...{}...".format(cases[0].content[0:100]))
print("After: ...{}...".format(text[0:100]))

Before: ...<h2>Tenor</h2>

<div>
					
					<p>Als funktional zust&#228;ndig wird die allgemeine Zivilkammer be...
After: ...Tenor Als funktional zuständig wird die allgemeine Zivilkammer bestimmt. Gründe I. Die in München an...
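
The `preprocessing` helpers used above come from this repository's `utils` module. For readers without the repository at hand, the same steps can be approximated with the standard library alone. This is a minimal sketch, not the repository's implementation: the function name `clean_html` and the sample string are made up for illustration, and it assumes that tags are replaced by the empty string.

```python
import html
import re

def clean_html(content):
    """Approximate the cleaning steps using only the standard library."""
    content = re.sub(r'[\n\t]', ' ', content)        # newlines and tabs -> spaces
    content = re.sub(r'<[^>]+>', '', content)        # drop HTML tags
    content = html.unescape(content)                 # decode entities such as &#228;
    content = re.sub(r'\s+', ' ', content).strip()   # collapse runs of whitespace
    return content

# Hypothetical sample, shaped like the raw API content shown above
sample = '<h2>Tenor</h2>\n\n<div>\n\t<p>Als funktional zust&#228;ndig wird</p></div>'
print(clean_html(sample))  # Tenor Als funktional zuständig wird
```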


## Explore

In [3]:
import spacy
import collections
import numpy as np

np.random.seed(0)

class Corpus:

    def __init__(self, text, test_percentage=0.1):
        self.test_percentage = test_percentage
        
        # use spacy NLP to do the tokenization and sentence boundary detection
        nlp = spacy.load('de_core_news_sm')
        self.doc = nlp(text)

    def get_words(self):
        for token in self.doc:
            yield token.text
    
    def get_sentences(self, test=False):
        # NOTE: the random draws are repeated on every call, so the train/test
        # assignment is not fixed across calls and the two sets may overlap.
        for sent in self.doc.sents:
            # split into training and test sentences, according to the given percentage
            if (np.random.random() >= self.test_percentage and not test) or \
                (np.random.random() < self.test_percentage and test):
                yield sent
                
    def get_ngrams(self, n, test=False):
        for sent in self.get_sentences(test=test):
            # skip short sentences (fewer than 10 tokens)
            if len(sent) < 10:
                continue
            # slide a window of n tokens over the sentence
            for pos in range(len(sent)):
                if len(sent)-pos < n:
                    break
                yield tuple(sent[pos + i].text for i in range(n))
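
Because `get_sentences` draws new random numbers each time it is called, the sentences returned for `test=False` and `test=True` are sampled independently of each other. If a fixed, disjoint split is wanted, one option is to assign every sentence index to one side exactly once and reuse that assignment. The following is only a sketch of that idea: `split_indices` is a hypothetical helper that is not part of the notebook, and it uses the standard library's `random` module rather than NumPy.

```python
import random

def split_indices(num_items, test_percentage=0.1, seed=0):
    """Return (train_idx, test_idx), a reproducible and disjoint partition of range(num_items)."""
    rng = random.Random(seed)  # private generator, independent of global state
    train_idx, test_idx = [], []
    for i in range(num_items):
        # one draw per item decides its side once and for all
        if rng.random() < test_percentage:
            test_idx.append(i)
        else:
            train_idx.append(i)
    return train_idx, test_idx

train_idx, test_idx = split_indices(1000)
print(len(train_idx), len(test_idx))
```

The index lists could then be used to select sentences from `list(doc.sents)`, so that training and evaluation always see the same two sets.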

In [4]:
def print_most_common(n):
    counter = collections.Counter(corpus.get_ngrams(n))
    print('\nThe most common {}-grams:'.format(n))
    for k, v in counter.most_common(5):
        print('{}: {}'.format(k, v))

corpus = Corpus(text)

print('Number of words in corpus: ', len(list(corpus.get_words())))
print('Number of training sentences in corpus: ', len(list(corpus.get_sentences())))
print('Number of test sentences in corpus: ', len(list(corpus.get_sentences(test=True))))
print('Size of alphabet:', len(set(corpus.get_words())))
    
print_most_common(1)
print_most_common(3)
print_most_common(5)

Number of words in corpus:  30282
Number of training sentences in corpus:  1673
Number of test sentences in corpus:  192
Size of alphabet: 5270

The most common 1-grams:
(',',): 1225
('.',): 1008
('der',): 882
('die',): 666
('des',): 407

The most common 3-grams:
(',', 'dass', 'die'): 35
('Abs.', '1', 'Satz'): 32
(',', 'dass', 'der'): 22
('1', 'Satz', '1'): 21
('§', '11', 'Abs.'): 18

The most common 5-grams:
('§', '124', 'Abs.', '2', 'Nr.'): 13
('Abs.', '5', 'Satz', '1', 'VwGO'): 8
('§', '11', 'Abs.', '2a', 'TierSchG'): 8
(',', 'juris', ',', 'Rn', '.'): 7
('vom', '19', '.', 'März', '2018'): 7


## Learning a Model

In [5]:
class NgramModel:
    
    def __init__(self, n=3):
        self.n = n
        self.ngrams = None
        self.alphabet = None
    
    def learn(self, corpus):
        self.ngrams = collections.Counter(corpus.get_ngrams(self.n))
        self.alphabet = set(corpus.get_words())
        
    def predict(self, context):
        """Return the relative frequencies of all words seen after the last n-1 context tokens."""
        if len(context) < self.n - 1:
            raise ValueError('The context has to be at least of length {}!'.format(self.n - 1))
        # keep only the last n-1 tokens (an empty context for n=1)
        context = tuple(context[len(context) - (self.n - 1):])
            
        matches = {}
        for word in self.alphabet:
            count = self.ngrams[context + (word,)]
            if count > 0:
                matches[word] = count
        # an unseen context yields an empty dict, i.e. no prediction
        total_count = sum(matches.values())
        return {k: v / total_count for k, v in matches.items()}
    
    def predict_str(self, context_str):
        nlp = spacy.load('de_core_news_sm')
        context = [token.text for token in nlp(context_str)]
        return self.predict(context)
        

corpus = Corpus(text)

model = NgramModel(n=3)
model.learn(corpus)

model.predict(['der', 'Europäischen'])

{'Union': 0.75, 'Gemeinschaft': 0.125, 'Kommission': 0.125}
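
To make the estimate behind `predict` concrete: the probability of a word is the count of the n-gram (context plus word) divided by the total count of all n-grams that start with the same context. The toy example below reproduces this on a hand-written token list. It is a simplified sketch with made-up data, and it scans the counted n-grams rather than the alphabet as the class above does.

```python
import collections

# Made-up token sequence for illustration
tokens = ('der Rat der Europäischen Union und '
          'der Rat der Europäischen Gemeinschaft').split()
n = 3

# count every window of n consecutive tokens
ngrams = collections.Counter(
    tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def predict(context):
    """Relative frequency of each word observed after the last n-1 context tokens."""
    context = tuple(context[-(n - 1):])
    matches = {ng[-1]: c for ng, c in ngrams.items() if ng[:-1] == context}
    total = sum(matches.values())
    return {word: c / total for word, c in matches.items()}

print(predict(['der', 'Europäischen']))  # {'Union': 0.5, 'Gemeinschaft': 0.5}
print(predict(['nicht', 'gesehen']))     # {} (unseen context, no prediction)
```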

## Interpret

We can use the predictive model to guess the next word in a sentence with legal content. This could be used as an autocompletion feature in a legal text editor.

To compare several fixed-order models, we use cross entropy as a measure (in bits per predicted word; lower is better). Of the tested values, n=10 gives the lowest test cross entropy. However, presumably because the training dataset is too small, only about 12% of the contexts could be completed (if a context was not seen in the training data, the implemented algorithm does not make a prediction). It seems likely that the good performance, especially for higher _n_, is caused by the large number of set phrases (or tokens) in this domain.
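
As a small worked example of the measure: cross entropy here is the average of -log2(p) over the words that actually occurred, where p is the probability the model assigned to each of them. The probabilities below are invented for illustration, and `cross_entropy` is a hypothetical standalone helper rather than the function used in this notebook.

```python
import math

def cross_entropy(probabilities):
    """Average negative log2 probability assigned to the words that actually occurred."""
    return -sum(math.log2(p) for p in probabilities) / len(probabilities)

# Four predictions: the true word received probability 0.5, 0.25, 1.0 and 0.25
print(cross_entropy([0.5, 0.25, 1.0, 0.25]))  # (1 + 2 + 0 + 2) / 4 = 1.25
```

A value of 0 would mean that every scored word was predicted with probability 1, which is why long, frequently repeated phrases push the measure toward zero.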

In [6]:
d = model.predict_str('Am 23. Dezember 2006 nahm der Sicherheitsrat der Vereinten Nationen (im')
pred_next_word = max(d.keys(), key=lambda key: d[key])
pred_next_word

'Folgenden'

In [7]:
def evaluate(n):
    """Train an n-gram model and print its training and test cross entropy."""
    corpus = Corpus(text)

    model = NgramModel(n=n)
    model.learn(corpus)
    
    print('\nN={}:'.format(n))
    print('Training cross ent: {} (count={})'.format(*cross_ent(model, corpus, n)))
    print('Test cross ent: {} (count={})'.format(*cross_ent(model, corpus, n, test=True)))

def cross_ent(model, corpus, n, test=False):
    """Return (average -log2 probability, number of scored n-grams)."""
    total = 0.0
    count = 0
    for ngram in corpus.get_ngrams(n, test=test):
        context = ngram[0:n-1]
        pred = ngram[n-1]
        distr = model.predict(context)

        # only count ngrams that occurred in the training data
        if pred in distr:
            total -= np.log2(distr[pred])
            count += 1

    # avoid a division by zero if no n-gram could be scored
    if count == 0:
        return float('nan'), 0
    return total / count, count

evaluate(2)
evaluate(3)
evaluate(5)
evaluate(10)


N=2:
Training cross ent: 3.6120551673497037 (count=23229)
Test cross ent: 3.5750781588445535 (count=2215)

N=3:
Training cross ent: 0.8136898275785924 (count=21265)
Test cross ent: 0.7985905308149392 (count=2364)

N=5:
Training cross ent: 0.07329458669651902 (count=19292)
Test cross ent: 0.08784666850808225 (count=2149)

N=10:
Training cross ent: 0.005888186456617001 (count=14754)
Test cross ent: 0.0020242914979757085 (count=1482)
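
The `count` values above only include n-grams that the model was able to score, because n-grams that never occurred in the training data are skipped. To put the cross entropy into perspective, it can help to report the scored share as well. The sketch below shows one way to compute it on made-up data; `scored_fraction` is a hypothetical helper and the toy n-grams are not taken from the corpus.

```python
import collections

def scored_fraction(train_ngrams, test_ngrams):
    """Fraction of test n-grams that also occur in the training n-grams."""
    counts = collections.Counter(train_ngrams)
    scored = sum(1 for ng in test_ngrams if counts[ng] > 0)
    return scored / len(test_ngrams)

# Toy data for illustration
train = [('a', 'b', 'c'), ('b', 'c', 'd'), ('a', 'b', 'c')]
test = [('a', 'b', 'c'), ('a', 'b', 'd'), ('x', 'y', 'z'), ('b', 'c', 'd')]
print(scored_fraction(train, test))  # 2 of 4 test n-grams were seen: 0.5
```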
