# Named-entity recognition startup notebook

### General goals

Learn about NLP in general
- Data preprocessing
- State of the art models and how to use them
- Error measurements in NLP
- What is easy and what is hard for the model to learn in NER on the given dataset

### Tasks

- Start by understanding and installing the models, and work out how they can be applied to solve "NER" as a classification problem

- Evaluate performance (NER classification performance, computational complexity: space and time) on the dataset (1) using models (2).

(1) Dataset: UD-DDT (DaNE) ( https://github.com/alexandrainst/danlp/blob/master/docs/docs/datasets.md )
- Annotated with PER (person), ORG (organization), LOC (location) and MISC (miscellaneous)

(2) Models ( https://github.com/alexandrainst/danlp/tree/master/danlp/models )
- BERT NER model
- flair_ner_model

### Questions

- Is classification performance measured only by accuracy (overall and per category), or should I also look at more sophisticated metrics?
- How do I measure computational space complexity?
- Check: is time complexity the prediction time as a function of the number of words/sentences/observations?
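
One possible way to approach the two complexity questions above is to time prediction on inputs of increasing length and to record peak memory during a call. The sketch below is only an illustration: `dummy_predict` is a hypothetical stand-in for a real model's prediction function, and `measure` is a helper name made up for this example. Note that `tracemalloc` only sees allocations made through Python's allocator, so memory held by native libraries (for example the tensors of a deep-learning framework) would probably need to be measured at the process level instead.

```python
import time
import tracemalloc


def dummy_predict(tokens):
    """Hypothetical stand-in for a NER model: tags every token as 'O'."""
    return tokens, ["O" for _ in tokens]


def measure(predict_fn, tokens, repeats=5):
    """Return (best wall-clock seconds, peak Python-level bytes) for one call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn(tokens)
        timings.append(time.perf_counter() - start)

    # Trace a separate call so the tracing overhead does not distort the timings.
    tracemalloc.start()
    predict_fn(tokens)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return min(timings), peak_bytes


# Prediction cost as a function of sentence length (number of tokens).
results = {}
for n_tokens in (10, 100, 1000):
    sentence = ["ord"] * n_tokens
    results[n_tokens] = measure(dummy_predict, sentence)

for n_tokens, (seconds, peak_bytes) in results.items():
    print(f"{n_tokens:>5} tokens: {seconds:.6f} s, peak {peak_bytes} bytes")
```

Swapping `dummy_predict` for a real model's prediction function and plotting the timings against the token count would give an empirical picture of how prediction time scales.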

In [43]:
from danlp.datasets import DDT
ddt = DDT()

In [44]:
parts = ddt.load_as_simple_ner(True)  # True: use the predefined train/dev/test splits

In [6]:
from danlp.models.bert_models import BertNer

In [8]:
model = BertNer()

Downloading file /var/folders/bd/r7xjdv5927lc2m8hkq9g_dbh0000gn/T/tmprbpw979l


In [45]:
parts[0][0][0]

['På',
 'fredag',
 'har',
 'SID',
 'inviteret',
 'til',
 'reception',
 'i',
 'SID-huset',
 'i',
 'anledning',
 'af',
 'at',
 'formanden',
 'Kjeld',
 'Christensen',
 'går',
 'ind',
 'i',
 'de',
 'glade',
 'tressere',
 '.']

In [46]:
parts[0][0][0], parts[0][1][0]

(['På',
  'fredag',
  'har',
  'SID',
  'inviteret',
  'til',
  'reception',
  'i',
  'SID-huset',
  'i',
  'anledning',
  'af',
  'at',
  'formanden',
  'Kjeld',
  'Christensen',
  'går',
  'ind',
  'i',
  'de',
  'glade',
  'tressere',
  '.'],
 ['O',
  'O',
  'O',
  'B-ORG',
  'O',
  'O',
  'O',
  'O',
  'B-LOC',
  'O',
  'O',
  'O',
  'O',
  'O',
  'B-PER',
  'I-PER',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O'])
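
The tags above follow the BIO scheme: `B-` opens an entity, `I-` continues it, and `O` marks tokens outside any entity. As a rough sketch of how such a tag sequence could be turned back into entities, the hypothetical helper `bio_to_spans` below (not part of DaNLP) groups tags into `(label, start, end)` spans, using the sentence shown above as input.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (label, start, end) spans; end is exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an 'I-' tag with the same label.
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            # A stray 'I-' without a preceding 'B-' is treated as opening a span.
            start, label = i, tag[2:]
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


tokens = ['På', 'fredag', 'har', 'SID', 'inviteret', 'til', 'reception', 'i',
          'SID-huset', 'i', 'anledning', 'af', 'at', 'formanden', 'Kjeld',
          'Christensen', 'går', 'ind', 'i', 'de', 'glade', 'tressere', '.']
tags = ['O', 'O', 'O', 'B-ORG', 'O', 'O', 'O', 'O', 'B-LOC', 'O', 'O', 'O',
        'O', 'O', 'B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

entities = [(label, " ".join(tokens[start:end]))
            for label, start, end in bio_to_spans(tags)]
print(entities)
# [('ORG', 'SID'), ('LOC', 'SID-huset'), ('PER', 'Kjeld Christensen')]
```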

In [47]:
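modinput = parts[0][0][0]  # tokens of the first sentence in the first split
targs = parts[0][1][0]  # gold BIO tags for the same sentence
preds = model.predict(modinput)[1]  # keep only the predicted tags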

In [48]:
# Token-level accuracy for this one sentence:
# the share of tokens whose predicted tag equals the gold tag.
tot_words = len(preds)
correctly_classified_words = sum(pred == targ for pred, targ in zip(preds, targs))

print(correctly_classified_words / tot_words)

1.0
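
Token accuracy can look flattering for NER because most tokens are `O`, so per-label precision, recall and F1 are a common complement. The sketch below computes them at token level over several sentences. The `gold` and `pred` lists are made-up toy data, not model output, and `per_label_scores` is a hypothetical helper. Stricter entity-level scoring, where a prediction counts only if the whole span matches, is also widely used for NER and would need the tags grouped into spans first.

```python
from collections import Counter


def per_label_scores(gold_sentences, pred_sentences):
    """Token-level precision, recall and F1 per entity label, ignoring the 'O' tag."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_tags, pred_tags in zip(gold_sentences, pred_sentences):
        for gold, pred in zip(gold_tags, pred_tags):
            # Strip the B-/I- prefix so all tokens of an entity count towards one label.
            gold_label = gold[2:] if gold != "O" else None
            pred_label = pred[2:] if pred != "O" else None
            if gold_label == pred_label:
                if gold_label is not None:
                    tp[gold_label] += 1
            else:
                if pred_label is not None:
                    fp[pred_label] += 1
                if gold_label is not None:
                    fn[gold_label] += 1

    scores = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Made-up toy data: two short sentences with gold tags and imagined predictions.
gold = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG", "O"]]
pred = [["B-PER", "O", "O", "B-LOC"], ["B-ORG", "B-ORG", "O"]]

scores = per_label_scores(gold, pred)
for label, label_scores in scores.items():
    print(label, label_scores)
```

In this toy example 5 of 7 tokens are tagged correctly, yet recall for PER is only 0.5 and precision for ORG is only 0.5, which the single accuracy number would hide.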
