# Explore structured prediction tasks and compare different prediction methods

**Authors: maria.boritchev@telecom-paris.fr and matthieu.labeau@telecom-paris.fr**

## Objectives:

- Explore Part-of-Speech (POS) tagging, in particular through tools integrated into the ```nltk``` library.

- Use the notion of _chunking_ and its different implementations to explore sentence structure.

- Implement BIO tagging for POS and compare our implementation with tools integrated into the ```nltk``` library.

- Explore Named Entity Recognition (NER) with tools integrated into the ```nltk``` library and ```spaCy``` visualisations.

- Compare these tools with a neural network implementation using BERT. 

- Throughout this lab, we will use several datasets: ```nlp-getting-started```, ```entity-annotated-corpus```, and ```wikibooks-dataset```. These datasets need to be downloaded from the course page.

In [1]:
# The main packages needed for this lab:
import numpy as np
import pandas as pd
import nltk 
import os

In this lab, we will first use the functions ```word_tokenize``` and ```pos_tag``` for the POS tagging task. 

### Obtaining and exploring the data
We start from the beginning with the ```nlp-getting-started``` dataset. 

In [2]:
# Read data from nlp-getting-started:
nlp_start_df = pd.read_csv('nlp-getting-started/train.csv')

Explore the data directories and get familiar with their contents and type: how is the data organised? 

<span style="color:red">Questions:</span> What type of natural language data are we working with (sentences, words)? What are the sources and languages of the data?

In [3]:
# Examine example sentences:


<span style="color:green">To code:</span> Using the functions ```word_tokenize``` and ```pos_tag``` of ```nltk```, tokenize an example sentence and apply POS tagging to it.

In [12]:
from nltk import word_tokenize, pos_tag
# Tokenize a sentence and apply POS tagging:


<span style="color:red">Question:</span> What is the set of POS-tags used by the ```pos_tag``` function?

As we discussed during the lecture, there can be different sets of POS-tags. The following shows you the documentation for the ```UPENN``` tagset.

In [90]:
nltk.download('tagsets')

nltk.help.upenn_tagset('NN')
nltk.help.upenn_tagset('IN')
nltk.help.upenn_tagset('DT')

NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
IN: preposition or conjunction, subordinating
    astride among uppon whether out inside pro despite on by throughout
    below within for towards near behind atop around if like until below
    next into if beside ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those


[nltk_data] Downloading package tagsets to /Users/maria/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


<span style="color:green">To code:</span>  Produce a different POS tagging of the same sentence using a different set of POS-tags of your choice.

In [None]:
# Tokenize the sentence and apply a different set of POS-tags of your choice:


### Chunking
_Chunking_ is the process of grouping small pieces of information into larger units. In the context of POS-tagging, its primary use is to build sub-syntactic trees from grammars over POS-tags, written as _regular expressions_. The resulting sub-sentences are called _chunks_. For example, we can use chunking to divide a sentence into groups of _noun phrases_ (NPs). There are no pre-defined rules or grammars for chunking: they need to be written as regular expressions, depending on our needs. For example, if we want to chunk **only** ```NN``` tags, we need to use the pattern ```mychunk:{<NN>}```. Conversely, if we want to chunk **all types of tags** which start with 'NN', we'll use ```mychunk:{<NN.*>}```.
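To make the difference between the two patterns concrete, here is a minimal sketch on a hand-tagged sentence. The tagged tokens and the helper name `chunk_words` are made up for illustration (they are not part of `nltk`), and writing the tags by hand means no tagger model needs to be downloaded.

```python
from nltk import RegexpParser

# Hand-written (word, POS-tag) pairs, so no tagger model is needed here.
tagged = [("Experts", "NNS"), ("examine", "VBP"), ("airplane", "NN"),
          ("debris", "NN"), ("on", "IN"), ("Reunion", "NNP"), ("Island", "NNP")]

def chunk_words(pattern, tagged_tokens):
    """Return the words of every 'mychunk' chunk produced by a one-rule grammar."""
    tree = RegexpParser(pattern).parse(tagged_tokens)
    return [[word for word, _ in subtree.leaves()]
            for subtree in tree.subtrees(lambda t: t.label() == "mychunk")]

only_nn = chunk_words("mychunk: {<NN>}", tagged)       # exactly the NN tag
all_nouns = chunk_words("mychunk: {<NN.*>}", tagged)   # NN, NNS, NNP, ...
print(only_nn)
print(all_nouns)
```

With ```{<NN>}```, only _airplane_ and _debris_ are chunked; with ```{<NN.*>}```, _Experts_, _Reunion_ and _Island_ are chunked as well. In both cases every chunk holds a single word, since neither pattern contains a repetition operator.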

<span style="color:green">To code:</span>  Apply the chunking method to a sentence from the ```nlp-getting-started``` dataset.

In [7]:
from nltk import RegexpParser
from nltk.draw.tree import TreeView
from IPython.display import Image
import svgling

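# Chunk runs of NN-tagged nouns in 'sentence'.
# 'sentence' is assumed to be the raw example string from the cells above;
# the chunker expects (word, POS-tag) pairs, so we tokenize and tag it first.
patterns = """mychunk:{<NN>+}"""
chunker = RegexpParser(patterns)
output = chunker.parse(pos_tag(word_tokenize(sentence)))
print("After Chunking:\n", output)
svgling.draw_tree(output)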


<span style="color:green">To code:</span> Produce a chunking of the same sentence that retrieves all adjacent nouns (noun-based tags).

In [8]:
# Chunk all adjacent nouns from 'sentence'


Chunking can also be expressed with BIO tags: ```B```eginning of a chunk, ```I```nside of a chunk, ```O```utside of a chunk.
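As a small illustration of the scheme itself, the sketch below turns chunk spans into one BIO tag per token. The helper `spans_to_bio` and the span list are hypothetical and written by hand; they are not an `nltk` function or real chunker output.

```python
def spans_to_bio(n_tokens, spans):
    """Turn (start, end, label) chunk spans into one BIO tag per token.

    `end` is exclusive and spans are assumed not to overlap.
    """
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the chunk
    return tags

tokens = ["debris", "found", "on", "Reunion", "Island"]
chunks = [(0, 1, "mychunk"), (3, 5, "mychunk")]  # hand-written chunk spans
bio_tags = spans_to_bio(len(tokens), chunks)
print(list(zip(tokens, bio_tags)))
```

The ```B-``` prefix is what lets us tell two adjacent chunks apart from one longer chunk, which a plain inside/outside marking could not do.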

In [92]:
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint # To display the list nicely

iob_tagged = tree2conlltags(output)
iob_tagged

[('Experts', 'NNS', 'B-mychunk'),
 ('in', 'IN', 'O'),
 ('France', 'NNP', 'B-mychunk'),
 ('begin', 'VB', 'O'),
 ('examining', 'VBG', 'O'),
 ('airplane', 'JJ', 'O'),
 ('debris', 'NN', 'B-mychunk'),
 ('found', 'VBD', 'O'),
 ('on', 'IN', 'O'),
 ('Reunion', 'NNP', 'B-mychunk'),
 ('Island', 'NNP', 'I-mychunk'),
 (':', ':', 'O'),
 ('French', 'JJ', 'O'),
 ('air', 'NN', 'B-mychunk'),
 ('accident', 'NN', 'I-mychunk'),
 ('experts', 'NNS', 'I-mychunk'),
 ('on', 'IN', 'O'),
 ('Wedn', 'NNP', 'B-mychunk'),
 ('...', ':', 'O'),
 ('http', 'NN', 'B-mychunk'),
 (':', ':', 'O'),
 ('//t.co/v4SMAESLK5', 'NN', 'B-mychunk')]

<span style="color:green">To code:</span> Implement a tree crawler that retrieves the list of triples ```('word','POS-tag','BIO-tag')``` from the trees output by ```chunker.parse```.

In [None]:
# Tree crawler retrieving triples from chunker.parse output trees

# Named Entity Recognition (NER)

NER is also implemented in ```nltk```, using chunking: see the ```ne_chunk()``` function in the ```nltk.chunk``` module.

<span style="color:green">To code:</span>  Implement NER on the same sentence with the method showcased below.

In [14]:
nltk.download('maxent_ne_chunker')
nltk.download('words')
from nltk.chunk import ne_chunk

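def extract_ne(trees, labels):
    """Return the subtrees of 'trees' whose label is in 'labels'."""
    ne_list = []
    # Iterate over the argument, not over the global 'ne_res'
    for tree in trees:
        # Plain (word, tag) tuples have no label; only entity subtrees do
        if hasattr(tree, 'label'):
            if tree.label() in labels:
                ne_list.append(tree)

    return ne_list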

# NER on 'sentence' sentence    
            
ne_res = ne_chunk(pos_tag(word_tokenize(sentence)))
labels = ['ORGANIZATION']

print(extract_ne(ne_res, labels))

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/maria/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /Users/maria/nltk_data...
[nltk_data]   Package words is already up-to-date!


NameError: name 'sentence' is not defined
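The trees returned by ```ne_chunk``` mix plain ```(word, tag)``` tuples with labelled subtrees. The sketch below shows one way to turn the subtrees into readable strings. The tree is built by hand to imitate that shape (it is not real model output), and `entity_strings` is a hypothetical helper name.

```python
from nltk import Tree

# A hand-built tree shaped like ne_chunk output (illustrative, not model output).
toy_ne_tree = Tree("S", [
    Tree("GPE", [("France", "NNP")]),
    ("hosts", "VBZ"),
    Tree("ORGANIZATION", [("Telecom", "NNP"), ("Paris", "NNP")]),
])

def entity_strings(tree, labels):
    """Return (label, text) pairs for top-level subtrees whose label is in `labels`."""
    found = []
    for node in tree:
        # Plain tokens are tuples; named entities are Tree instances.
        if isinstance(node, Tree) and node.label() in labels:
            text = " ".join(word for word, _ in node.leaves())
            found.append((node.label(), text))
    return found

orgs = entity_strings(toy_ne_tree, {"ORGANIZATION"})
print(orgs)
```

Joining the leaves keeps multi-word entities together, which is convenient when comparing the output with other NER tools later in the lab.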

<span style="color:red">Questions:</span> What other labels are available?

<span style="color:green">To code:</span> Implement a NE extraction using two or more different labels. 

In [None]:
# NER using two or more different labels

## NER and ```spaCy```

The ```spaCy``` library also contains a statistical NER system. It is trained to identify named and numeric entities, such as companies, locations, organizations and products. To try it out, we first need more complex data.

In [15]:
import sqlite3

cnx = sqlite3.connect('wikibooks.sqlite')
df_wikibooks = pd.read_sql_query("SELECT * FROM en", cnx)
df_wikibooks.head()

| Unnamed: 0 | title | url | abstract | body_text | body_html |
|---|---|---|---|---|---|
| 0 | Wikibooks: Radiation Oncology/NHL/CLL-SLL | https://en.wikibooks.org/wiki/Radiation_Oncolo... | Chronic Lymphocytic Leukemia and Small Lymphoc... | Front Page: Radiation Oncology \| RTOG Trials \|... | `<div class="mw-parser-output"><table width="10...` |
| 1 | Wikibooks: Romanian/Lesson 9 | https://en.wikibooks.org/wiki/Romanian/Lesson_9 | ==Băuturi/Beverages== | Băuturi/Beverages[edit \| edit source]\nTea : C... | `<div class="mw-parser-output"><h2><span id="B....` |
| 2 | Wikibooks: Karrigell | https://en.wikibooks.org/wiki/Karrigell | Karrigell is an open Source Python web framewo... | Karrigell is an open Source Python web framewo... | `<div class="mw-parser-output"><p>Karrigell is ...` |
| 3 | Wikibooks: The Pyrogenesis Engine/0 A.D./GuiSe... | https://en.wikibooks.org/wiki/The_Pyrogenesis_... | ====setupUnitPanel==== | setupUnitPanel[edit \| edit source]\nHelper fun... | `<div class="mw-parser-output"><h4><span class=...` |
| 4 | Wikibooks: LMIs in Control/pages/Exterior Coni... | https://en.wikibooks.org/wiki/LMIs_in_Control/... | == The Concept == | Contents\n\n1 The Concept\n2 The System\n3 The... | `<div class="mw-parser-output"><div id="toc" cl...` |


The following displays ```spaCy```'s NER on a given document from the ```wikibooks``` dataset.

In [16]:
import spacy
#spacy.cli.download('en_core_web_sm')
nlp = spacy.load("en_core_web_sm")
wiki_ex = df_wikibooks.iloc[11]['body_text']
# print(wiki_ex)
doc = nlp(wiki_ex)
doc 

This Wikibooks page is a fact sheet and analysis on the article "Habitual physical activity in children and adolescents with cystic fibrosis" about how exercise is related to the disease Cystic Fibrosis.

Contents

1 Background of this research
2 Where is the research from ?
3 What kind of research was this?
4 What did the research involve?

4.1 Pulmonary Function testing
4.2 Pros / Cons of this test


5 What were the basic results?
6 What conclusion can we take from this research ?
7 Practical Advice
8 Further information/ Resources

8.1 Cystic Fibrosis Australia
8.2 Cystic Fibrosis's National Ambassador Nathan Charles


9 References



Background of this research[edit | edit source]
The research was about the effects of taking part in exercise constantly or making it a habit in the population of children and teens that are severing from the genetic condition cystic Fibrosis.
What is  Cystic Fibrosis
It is a genetic condition, affecting lungs and digestion. Unfortunately, there is no 

In [17]:
print('All entity types that spacy recognised from the document above')
set([ent.label_ for ent in doc.ents])

All entity types that spacy recognised from the document above


{'CARDINAL',
 'DATE',
 'GPE',
 'NORP',
 'ORG',
 'PERCENT',
 'PERSON',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART'}

<span style="color:red">Questions:</span> What other labels are available?

<span style="color:green">To code:</span> Print out all persons and organizations recognised in the document.

In [18]:
# Print out all persons and organisations from the document above


```spaCy``` also features a very nice visualization tool for NE. The following showcases this tool on a Wikibooks page.

In [None]:
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)

<span style="color:red">Questions:</span> Can you see problems in this annotation? What adaptations would you suggest?

### Comparing accuracies with a pre-trained model

We will now use a pre-trained BERT-based model from the ```transformers``` library. Assuming we use the model as is, and do not do any fine-tuning, we can use the high-level interface from the library, ```pipeline```.
First, let's look at the model and the tags it uses. 

In [19]:
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
print('Entities from the pretrained model')
print(model.config.id2label)

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Entities from the pretrained model
{0: 'O', 1: 'B-MISC', 2: 'I-MISC', 3: 'B-PER', 4: 'I-PER', 5: 'B-ORG', 6: 'I-ORG', 7: 'B-LOC', 8: 'I-LOC'}


Now, let's look at one last dataset. It contains **reference tags** for every word in the sentences. However, its tagset is not exactly the same as the model's, so we will need to use a mapping. For simplicity, we will only look at the *Location* tags in what follows.

In [20]:
# Let's re-format the dataset for convenience
df = pd.read_csv('./entity-annotated-corpus/ner_dataset.csv', encoding='unicode_escape')
df['Sentence #'] = df['Sentence #'].ffill()
# Grouping sentences together into one sentence by row, for words, pos, tags
df_gr = df.groupby('Sentence #').agg(lambda x: list(x))
# Just renaming indexes
df_gr.index  = [int(s[9:]) for s in df_gr.index]
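To see what this re-formatting does, here is the same sequence of steps on a tiny hand-made frame. The values are invented for illustration and the POS column is left out; only the column names ```Sentence #```, ```Word``` and ```Tag``` follow the cell above.

```python
import pandas as pd

# Toy frame: the sentence id is only given on the first word of each sentence.
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", None, "Sentence: 2", None, None],
    "Word": ["Paris", "sleeps", "London", "is", "big"],
    "Tag": ["B-geo", "O", "B-geo", "O", "O"],
})

# Propagate each sentence id down to the words that belong to it.
toy["Sentence #"] = toy["Sentence #"].ffill()
# One row per sentence, each cell holding the list of values for that sentence.
toy_gr = toy.groupby("Sentence #").agg(list)
# Keep only the integer part of 'Sentence: N' as index.
toy_gr.index = [int(s[9:]) for s in toy_gr.index]

print(toy_gr.loc[2, "Word"])
print(toy_gr.loc[2, "Tag"])
```

After grouping, the words and the tags of a sentence are two lists of the same length, so they can be compared position by position with the output of a tagger.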

In [None]:
# What's the tagset ? 
tags = []
for tag in df_gr['Tag'].to_list():
    tags.extend(tag)
print('Entities in our data set')
print(set(tags))

<span style="color:red">Questions:</span> What is the tagset used in the previous cell?

In [21]:
# Let's look at an example!
from transformers import pipeline

example = df_gr.loc[1]['Word']
example_tag = df_gr.loc[1]['Tag']
NER_model = pipeline("ner",
                     model="dslim/bert-base-NER",
                     grouped_entities=True)
print("Output from the pipeline containing the BERT-based model:")
print(NER_model(example))
print("Reference tag list")
print(example_tag)

<span style="color:green">To code:</span> Assuming the following mapping between the tags from the dataset and those output by the BERT model:
```python
entity_mapping = {
'O': 'O',
'B-per': 'B-PER',
'I-per': 'I-PER',
'B-org': 'B-ORG',
'I-org': 'I-ORG',
'B-geo': 'B-LOC',
'I-geo': 'I-LOC',
'B-art': 'B-MISC', 'B-eve': 'B-MISC', 'B-gpe': 'B-MISC', 'B-nat': 'B-MISC', 'B-tim': 'B-MISC',
'I-art': 'I-MISC', 'I-eve': 'I-MISC', 'I-gpe': 'I-MISC', 'I-nat': 'I-MISC', 'I-tim': 'I-MISC',
}
``` 

Assuming you can use the less precise tagset (from the BERT model), find a way to compute the accuracy of the BERT-based model on all **complete locations** in the dataset. Compare it with the same value obtained for the NLTK model.
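As a starting point, here is a minimal sketch of one possible reading of "complete locations": a location counts as correct only if the prediction contains a span with exactly the same start and end. It assumes the predictions have already been aligned to one tag per word, which the pipeline output does not give directly. The helper names `location_spans` and `complete_location_accuracy` and the toy tag lists are hypothetical.

```python
entity_mapping = {
    'O': 'O',
    'B-per': 'B-PER', 'I-per': 'I-PER',
    'B-org': 'B-ORG', 'I-org': 'I-ORG',
    'B-geo': 'B-LOC', 'I-geo': 'I-LOC',
    'B-art': 'B-MISC', 'B-eve': 'B-MISC', 'B-gpe': 'B-MISC', 'B-nat': 'B-MISC', 'B-tim': 'B-MISC',
    'I-art': 'I-MISC', 'I-eve': 'I-MISC', 'I-gpe': 'I-MISC', 'I-nat': 'I-MISC', 'I-tim': 'I-MISC',
}

def location_spans(tags):
    """Return the set of (start, end) spans of the form B-LOC I-LOC* (end exclusive)."""
    spans, start = set(), None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        if start is not None and tag != "I-LOC":
            spans.add((start, i))
            start = None
        if tag == "B-LOC":
            start = i
    return spans

def complete_location_accuracy(reference_tags, predicted_tags):
    """Fraction of reference locations that are predicted with identical boundaries."""
    reference = location_spans(reference_tags)
    if not reference:
        return None  # no location in the reference, nothing to score
    return len(reference & location_spans(predicted_tags)) / len(reference)

# Toy example: tags written by hand, not taken from the dataset or the model.
reference = [entity_mapping[t] for t in ["B-geo", "I-geo", "O", "B-geo", "O", "B-per"]]
predicted = ["B-LOC", "I-LOC", "O", "O", "O", "B-PER"]
score = complete_location_accuracy(reference, predicted)
print(score)
```

This score only looks at the reference locations, so it does not penalise spurious predicted locations. Whether that is the right notion of accuracy for the comparison with NLTK is part of the question.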