<a href="https://colab.research.google.com/github/liadmagen/NLP-Course/blob/master/exercises_notebooks/04_LM_PP_Attachment_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PP Attachment 

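Let's try out prepositional phrase (PP) attachment classification! Given a verb `v`, a noun `n1`, a preposition `p`, and a second noun `n2`, the task is to decide whether the prepositional phrase attaches to the noun ('N') or to the verb ('V').

Through this exercise, you'll practice classifying linguistic aspects of text.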

# Setup
Loading the data

In [1]:
import csv

from tqdm.notebook import tqdm
from random import choice
from urllib.request import urlopen


In [2]:
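def read_pp_examples(file_url):
    """Download the dataset and parse each line into {'answer': ..., 'pp': (v, n1, p, n2)}."""
    pp_examples = []

    for line in tqdm(urlopen(file_url)):
        # Each line holds five whitespace-separated tokens: v n1 p n2 answer
        line = line.decode("utf-8").strip().split()
        assert len(line) == 5
        v, n1, p, n2, answer = line
        pp_examples.append({'answer': answer, 'pp': (v, n1, p, n2)})
    return pp_examples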
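
To check the expected file format without network access, the same parsing logic can be run on an in-memory sample. This is only a minimal sketch: `parse_pp_lines` is a hypothetical helper, and the two sample lines are reconstructed from examples printed later in this notebook rather than copied from the file.

```python
def parse_pp_lines(lines):
    """Parse an iterable of byte strings, each holding 'v n1 p n2 answer'."""
    examples = []
    for raw in lines:
        fields = raw.decode("utf-8").strip().split()
        if len(fields) != 5:
            raise ValueError(f"expected 5 fields, got {len(fields)}: {fields}")
        v, n1, p, n2, answer = fields
        examples.append({'answer': answer, 'pp': (v, n1, p, n2)})
    return examples

# Assumed sample lines (reconstructed, not read from the real file)
sample = [b"see improvement in formulation N\n",
          b"block measure with actions V\n"]
parsed = parse_pp_lines(sample)
print(parsed[0])
```

Raising `ValueError` instead of using `assert` is a design choice of this sketch: the check then still runs when Python is started with `-O`.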

In [3]:
pp_samples_url = 'https://raw.githubusercontent.com/liadmagen/NLP-Course/master/dataset/pp_examples.txt'

In [4]:
pp_examples = read_pp_examples(pp_samples_url)



# Step 1: Looking at the data

Always look at the data first!

In [5]:
len(pp_examples)

25858

In [6]:
print(choice(pp_examples))

{'answer': 'N', 'pp': ('see', 'improvement', 'in', 'formulation')}


In [7]:
example = choice(pp_examples)
example['pp']

('block', 'measure', 'with', 'actions')

In [8]:
example['answer']

'V'
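
Beyond sampling single examples, it can help to look at aggregate counts. The sketch below tallies labels and prepositions with `collections.Counter`. The `toy_examples` list and the `summarize` helper are made up for illustration so the cell runs on its own; in the notebook you would pass `pp_examples` instead.

```python
from collections import Counter

# Toy data in the same shape as pp_examples (the third entry is invented)
toy_examples = [
    {'answer': 'N', 'pp': ('see', 'improvement', 'in', 'formulation')},
    {'answer': 'V', 'pp': ('block', 'measure', 'with', 'actions')},
    {'answer': 'V', 'pp': ('eat', 'pizza', 'with', 'fork')},
]

def summarize(examples):
    """Count how often each label and each preposition occurs."""
    label_counts = Counter(ex['answer'] for ex in examples)
    prep_counts = Counter(ex['pp'][2] for ex in examples)  # index 2 is p
    return label_counts, prep_counts

label_counts, prep_counts = summarize(toy_examples)
print(label_counts.most_common())
print(prep_counts.most_common(5))
```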

# Step 2: Deciding on the measurement

In [9]:
amt = int(0.75 * len(pp_examples))
train_examples, test_examples = pp_examples[:amt], pp_examples[amt:]

print(len(train_examples), len(test_examples))

19393 6465


We'll define a classifier evaluator.

Given a set of examples and a classifier, it returns the accuracy score.

In [10]:
def evaluate_classifier(examples, pp_resolver):
    """
    examples: a list of {'pp':(v,n1,p,n2), 'answer':answer }
    pp_resolver has a classify() function: from (v,n1,p,n2) to 'N' / 'V'
    """
    correct = 0.0
    incorrect = 0.0
    for example in examples:
        answer = pp_resolver.classify(example['pp'])
        if answer == example['answer']:
            correct += 1
        else:
            incorrect += 1
    return correct / (correct + incorrect)
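
Overall accuracy can hide how a classifier behaves on each class. As a possible complement to `evaluate_classifier`, the sketch below reports accuracy separately for each gold label. The names `per_class_accuracy` and `ConstantResolver`, and the toy examples, are hypothetical and only serve to make the cell self-contained.

```python
from collections import Counter

def per_class_accuracy(examples, pp_resolver):
    """Return {gold_label: accuracy on the examples with that gold label}."""
    correct = Counter()
    total = Counter()
    for example in examples:
        gold = example['answer']
        total[gold] += 1
        if pp_resolver.classify(example['pp']) == gold:
            correct[gold] += 1
    return {label: correct[label] / total[label] for label in total}

class ConstantResolver:
    """Hypothetical stand-in classifier that always returns one label."""
    def __init__(self, label):
        self.label = label
    def classify(self, pp):
        return self.label

# Toy data (the third entry is invented)
toy = [
    {'answer': 'N', 'pp': ('see', 'improvement', 'in', 'formulation')},
    {'answer': 'V', 'pp': ('block', 'measure', 'with', 'actions')},
    {'answer': 'V', 'pp': ('eat', 'pizza', 'with', 'fork')},
]
scores = per_class_accuracy(toy, ConstantResolver('V'))
print(scores)
```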


# Classifiers

Let's test it on two extremely naive classifiers:

In [11]:
class AlwaysSayN:
    """
    This naive classifier always answers 'N' (noun attachment)
    """
    def __init__(self): pass
    def classify(self, pp):
        return 'N'


In [12]:
class AlwaysSayV:
    """
    This naive classifier always answers 'V' (verb attachment)
    """
    def __init__(self): pass
    def classify(self, pp):
        return 'V'


In [13]:
evaluate_classifier(test_examples, AlwaysSayV())

0.4634184068058778

In [None]:
evaluate_classifier(test_examples, AlwaysSayN())


0.5365815931941222

We can see that always answering 'N' yields an accuracy of about 53.7%.

---



It also means that our dataset is fairly balanced ;)

Instead, we could check which class is the majority in the training set and always predict it:

In [14]:
class MajorityClassResolver:
    def __init__(self, training_examples):
        answers = [item['answer'] for item in training_examples]
        num_n = len([a for a in answers if a == 'N'])
        num_v = len([a for a in answers if a == 'V'])
        if num_v > num_n:
            self.answer = 'V'
        else:
            self.answer = 'N'
    def classify(self, pp):
        return self.answer


In [15]:
evaluate_classifier(test_examples, MajorityClassResolver(train_examples))

0.5365815931941222

Or we can make it a bit more sophisticated by memorizing the training examples and backing off to the majority class for tuples we have not seen:

In [16]:
class LookupResolver:
    def __init__(self, training_examples):
        self.answers = {}
        for item in training_examples:
            self.answers[item['pp']] = item['answer']
        self.backoff = MajorityClassResolver(training_examples)
        
    def classify(self, pp):
        if pp in self.answers:
            return self.answers[pp]
        else:
            return self.backoff.classify(pp)


In [17]:
evaluate_classifier(test_examples, LookupResolver(train_examples))

0.6009280742459396
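
Note that `LookupResolver` keeps only the last answer seen for each tuple, so a tuple that occurs with both labels in the training data is resolved by file order. One possible refinement, sketched here under that observation, is to take a majority vote per tuple. `VotingLookupResolver` is a hypothetical name and the toy training data is invented; no accuracy on the real dataset is claimed.

```python
from collections import Counter, defaultdict

class VotingLookupResolver:
    """Hypothetical variant of LookupResolver: majority vote per tuple."""
    def __init__(self, training_examples):
        votes = defaultdict(Counter)
        for item in training_examples:
            votes[item['pp']][item['answer']] += 1
        # Most frequent answer for each memorized tuple
        self.answers = {pp: c.most_common(1)[0][0] for pp, c in votes.items()}
        # Fall back to the overall majority class for unseen tuples
        overall = Counter(item['answer'] for item in training_examples)
        self.default = overall.most_common(1)[0][0]

    def classify(self, pp):
        return self.answers.get(pp, self.default)

t = ('eat', 'pizza', 'with', 'fork')
u = ('see', 'improvement', 'in', 'formulation')
toy_train = [
    {'answer': 'V', 'pp': t},
    {'answer': 'V', 'pp': t},
    {'answer': 'N', 'pp': t},   # conflicting label, seen last
    {'answer': 'N', 'pp': u},
    {'answer': 'N', 'pp': u},
]
resolver = VotingLookupResolver(toy_train)
print(resolver.classify(t), resolver.classify(u))
```

In this toy set the tuple `t` is labeled 'V' twice and 'N' once; the vote returns 'V', whereas a last-answer lookup would return 'N'.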

# Exercise - Your Turn:

Implement a discriminative PP-attachment model using a classifier of your choice, e.g. a [Naive Bayes classifier](https://web.stanford.edu/~jurafsky/slp3/4.pdf), from a toolkit such as [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB).

Possible features:

Single items:
* Identity of v
* Identity of p
* Identity of n1
* Identity of n2

Pairs:
* Identity of (v, p)
* Identity of (n1, p)
* Identity of (p, n2)

Triplets:
* Identity of (v, n1, p)
* Identity of (v, p, n2)
* Identity of (n1, p, n2)

Quadruple:
* Identity of (v, n1, p, n2)


Corpus Level:

* Have we seen the (v, p) pair in a 5-word window in a big corpus?
* Have we seen the (n1, p) pair in a 5-word window in a big corpus?
* Have we seen the (n1, p, n2) triplet in a 5-word window in a big corpus?
* Also: we can use counts, or binned counts.

Distance:
* Distance (in words) between v and p
* Distance (in words) between n1 and p
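
As a starting point, the identity features can be encoded as a dictionary of binary indicators. The sketch below is one possible encoding, not a prescribed one: `extract_features` is a hypothetical helper, the feature-name format is arbitrary, and the exact set of pairs and triplets is illustrative. Corpus-level and distance features are left out because they need information that the `(v, n1, p, n2)` tuples do not carry. Dictionaries of this shape can be turned into a feature matrix with sklearn's `DictVectorizer`.

```python
def extract_features(pp):
    """Map a (v, n1, p, n2) tuple to a dict of binary indicator features."""
    v, n1, p, n2 = pp
    return {
        # Single items
        f"v={v}": 1,
        f"n1={n1}": 1,
        f"p={p}": 1,
        f"n2={n2}": 1,
        # Pairs
        f"v_p={v}_{p}": 1,
        f"n1_p={n1}_{p}": 1,
        f"p_n2={p}_{n2}": 1,
        # Triplets
        f"v_n1_p={v}_{n1}_{p}": 1,
        f"v_p_n2={v}_{p}_{n2}": 1,
        f"n1_p_n2={n1}_{p}_{n2}": 1,
        # Quadruple
        f"v_n1_p_n2={v}_{n1}_{p}_{n2}": 1,
    }

feats = extract_features(('see', 'improvement', 'in', 'formulation'))
for name in sorted(feats):
    print(name)
```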

In [None]:
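from sklearn.naive_bayes import GaussianNB

class NaiveBayesClassifier():

  def __init__(self, training_examples):
    # Keep the model on the instance so classify() can use it later.
    self.classifier = GaussianNB()
    # TODO: extract features from training_examples and fit self.classifier.

  def classify(self, pp):
    # TODO: extract features for pp and return the predicted label ('N' or 'V').
    pass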

In [None]:
evaluate_classifier(test_examples, NaiveBayesClassifier(train_examples))
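
To see the mechanics of Naive Bayes without any toolkit, here is a from-scratch sketch. It is not sklearn's `GaussianNB`: it is a small categorical Naive Bayes over the four word identities only, with add-`alpha` smoothing. The class name `TinyNaiveBayesResolver`, the `alpha` parameter, and the toy training data are all hypothetical, and no accuracy on the real dataset is claimed.

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayesResolver:
    """Hypothetical pure-Python categorical Naive Bayes over (v, n1, p, n2)."""

    def __init__(self, training_examples, alpha=1.0):
        self.alpha = alpha
        self.label_counts = Counter()
        # counts[(position, label)][word] = times word filled that slot with that label
        self.counts = defaultdict(Counter)
        self.vocab = [set() for _ in range(4)]
        for item in training_examples:
            label = item['answer']
            self.label_counts[label] += 1
            for pos, word in enumerate(item['pp']):
                self.counts[(pos, label)][word] += 1
                self.vocab[pos].add(word)

    def _log_score(self, pp, label):
        # log P(label) + sum over slots of log P(word | slot, label)
        total = sum(self.label_counts.values())
        score = math.log(self.label_counts[label] / total)
        for pos, word in enumerate(pp):
            # The extra +1 reserves one smoothed slot for unseen words
            denom = self.label_counts[label] + self.alpha * (len(self.vocab[pos]) + 1)
            score += math.log((self.counts[(pos, label)][word] + self.alpha) / denom)
        return score

    def classify(self, pp):
        return max(self.label_counts, key=lambda label: self._log_score(pp, label))

# Invented toy training data in the same shape as train_examples
toy_train = [
    {'answer': 'V', 'pp': ('eat', 'pizza', 'with', 'fork')},
    {'answer': 'V', 'pp': ('eat', 'soup', 'with', 'spoon')},
    {'answer': 'N', 'pp': ('see', 'improvement', 'in', 'formulation')},
    {'answer': 'N', 'pp': ('buy', 'book', 'of', 'poems')},
]
nb = TinyNaiveBayesResolver(toy_train)
print(nb.classify(('eat', 'salad', 'with', 'fork')))
```

Because the class exposes `classify(pp)`, it should plug into the evaluator as `evaluate_classifier(test_examples, TinyNaiveBayesResolver(train_examples))`.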