# Reformat data from CoNLL 2003

CoNLL 2003 is an annotated dataset for named entity recognition (NER). It contains the following types of entities:
- Person (`PER`)
- Location (`LOC`)
- Organisation (`ORG`)
- Miscellaneous (`MISC`), for example nationalities such as "German" or "British"

We will use this dataset to test the performance of the transformer-based NER model provided by the open-source framework [spaCy](https://spacy.io/).

The format of the data was not ideal for use with the model, so we had to reformat it. This is the purpose of this notebook.

## Imports

In [11]:
import copy
import string
import pickle

from tqdm.notebook import tqdm

## Data loading

In [4]:
# Read the whole file; the context manager closes it afterwards
with open("../../data/conll2003.txt", "r") as f:
    whole_text = f.read()

In [5]:
print(whole_text[:150])

EU	B-ORG
rejects	O
German	B-MISC
call	O
to	O
boycott	O
British	B-MISC
lamb	O
.	O

Peter	B-PER
Blackburn	I-PER

BRUSSELS	B-LOC
1996-08-22	O

The	O
Euro


## Text processing
### Separating the sentences, the tokens, and the words from their associated entities

In [6]:
# Split the sentences, which are separated by two consecutive newlines
sentences = whole_text.split("\n\n")
# Separate the tokens, which are separated by a single newline
# Separate each word from its entity tag, which are separated by a tab
tokenized_sentences = [[t.split("\t") for t in s.split("\n")] for s in sentences]

In [7]:
tokenized_sentences[:3]

[[['EU', 'B-ORG'],
  ['rejects', 'O'],
  ['German', 'B-MISC'],
  ['call', 'O'],
  ['to', 'O'],
  ['boycott', 'O'],
  ['British', 'B-MISC'],
  ['lamb', 'O'],
  ['.', 'O']],
 [['Peter', 'B-PER'], ['Blackburn', 'I-PER']],
 [['BRUSSELS', 'B-LOC'], ['1996-08-22', 'O']]]
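
The splitting step above can also be sketched as a small function. This is only an illustrative variant, not the code used in this notebook: the name `parse_conll` is hypothetical, and the filtering of empty blocks and of lines without a tab is an assumption about how one might want to handle stray blank lines in the file.

```python
def parse_conll(text):
    """Split raw CoNLL-style text into sentences of [word, tag] pairs."""
    sentences = []
    # Sentences are separated by two consecutive newlines
    for block in text.split("\n\n"):
        # Keep only lines that contain a tab, i.e. both a word and a tag
        pairs = [line.split("\t") for line in block.split("\n") if "\t" in line]
        # Skip blocks that contain no token at all (e.g. trailing blank lines)
        if pairs:
            sentences.append(pairs)
    return sentences


sample = "EU\tB-ORG\nrejects\tO\n\nPeter\tB-PER\nBlackburn\tI-PER\n\n"
parsed = parse_conll(sample)
print(parsed)
```

With this kind of filtering, empty entries never reach the later processing steps, so they do not need to be sliced off by position.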

### Compute the start and end character indices of each entity

In [8]:
all_sentence_and_list_of_ents = []
# Loop over the sentences
for sentence in tqdm(tokenized_sentences[:-2]):
    # We rebuild the sentence bit by bit, so we start with an empty string
    string_sentence = ""
    # We keep the current entity type in memory to see when it changes
    current_ent = 'O'

    # For each new sentence we initialise an empty list of entities
    list_of_entities = []
    # Each entity will be described by a dictionary
    current_entity_dict = {}
    # We need to know if the current token is the first token to manage the
    # space between the tokens in the sentence
    first_token = True
    for t_ent in sentence:
        # Get the next token and its entity type (the B-/I- prefix is dropped)
        t, ent = t_ent[0], t_ent[1].split('-')[-1]
        # Length of the sentence so far, i.e. the end index of the previous token
        previous_end = len(string_sentence)
        # Unless the current token is the first one or it starts with an apostrophe,
        # we add a space at the end of the sentence before adding the token
        if (not first_token) and (not t.startswith("'")):
            string_sentence += " "
        # We add the token to the sentence
        string_sentence += t
        # If the type of the current entity just changed
        if current_ent != ent:
            # If we end a meaningful entity ('O' means no entity) we complete
            # its corresponding dictionary
            if current_ent != 'O':
                # The entity ends where the previous token ended (exclusive index),
                # whether or not a space was inserted before the current token
                current_entity_dict["end"] = previous_end
                # We add the dictionary to the list of entities
                list_of_entities.append(copy.deepcopy(current_entity_dict))
                # We create a new dictionary for the new entity
                current_entity_dict = {}
            # If we start a meaningful entity, we start putting information in
            # its dictionary
            if ent != 'O':
                # We add the type of the new entity
                current_entity_dict['ent'] = ent
                # We get the index of the first character of the entity in the sentence
                current_entity_dict['start'] = len(string_sentence) - len(t)
        # We update the current entity
        current_ent = ent
        # We indicate that the next token will no longer be the first one
        first_token = False
    # At the end of the sentence we deal with the potential last entity
    if current_ent != 'O':
        current_entity_dict["end"] = len(string_sentence)
        list_of_entities.append(copy.deepcopy(current_entity_dict))
        current_entity_dict = {}
    # We add the list of entities corresponding to the current sentence to the dataset
    all_sentence_and_list_of_ents.append({"string": string_sentence, "ents": list_of_entities})

### We have a look at the results

In [9]:
all_sentence_and_list_of_ents[:5]

[{'string': 'EU rejects German call to boycott British lamb .',
  'ents': [{'ent': 'ORG', 'start': 0, 'end': 2},
   {'ent': 'MISC', 'start': 11, 'end': 17},
   {'ent': 'MISC', 'start': 34, 'end': 41}]},
 {'string': 'Peter Blackburn',
  'ents': [{'ent': 'PER', 'start': 0, 'end': 15}]},
 {'string': 'BRUSSELS 1996-08-22',
  'ents': [{'ent': 'LOC', 'start': 0, 'end': 8}]},
 {'string': 'The European Commission said on Thursday it disagreed with German advice to consumers to shun British lamb until scientists determine whether mad cow disease can be transmitted to sheep .',
  'ents': [{'ent': 'ORG', 'start': 4, 'end': 23},
   {'ent': 'MISC', 'start': 59, 'end': 65},
   {'ent': 'MISC', 'start': 94, 'end': 101}]},
 {'string': "Germany's representative to the European Union's veterinary committee Werner Zwingmann said on Wednesday consumers should buy sheepmeat from countries other than Britain until the scientific advice was clearer .",
  'ents': [{'ent': 'LOC', 'start': 0, 'end': 7},
   {'ent

#### We verify that the entity indices are correct for the first sentence

In [63]:
all_sentence_and_list_of_ents[0]['string'][0:2]

'EU'

In [64]:
all_sentence_and_list_of_ents[0]['string'][11:17]

'German'

In [65]:
all_sentence_and_list_of_ents[0]['string'][34:41]

'British'

##### Observation:
For the first sentence, each pair of indices returns exactly the expected entity text.
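
Checking one sentence by hand does not cover cases such as tokens that start with an apostrophe. As a rough sketch, the same check could be run over every record by testing that each span sits on a token boundary. The helper name `find_misaligned_entities` is hypothetical, and the boundary rule (a span must be followed by a space, an apostrophe, or the end of the string) is an assumption derived from how the sentences are rebuilt in this notebook. The second sample record contains a deliberately misaligned span.

```python
def find_misaligned_entities(records):
    """Return (record_index, entity) pairs whose span is not on a token boundary."""
    problems = []
    for i, record in enumerate(records):
        text = record["string"]
        for ent in record["ents"]:
            start, end = ent["start"], ent["end"]
            span = text[start:end]
            # A span should start at the beginning, after a space,
            # or with an apostrophe (tokens glued to the previous one)
            starts_ok = start == 0 or text[start - 1] == " " or span.startswith("'")
            # A span should end at the end of the string or right before
            # a space or an apostrophe
            ends_ok = end == len(text) or (end < len(text) and text[end] in " '")
            if not span or span != span.strip() or not (starts_ok and ends_ok):
                problems.append((i, ent))
    return problems


sample = [
    {"string": "EU rejects German call",
     "ents": [{"ent": "ORG", "start": 0, "end": 2},
              {"ent": "MISC", "start": 11, "end": 17}]},
    # Deliberately wrong: the span [0:6] is "German", cutting "Germany" short
    {"string": "Germany's representative",
     "ents": [{"ent": "LOC", "start": 0, "end": 6}]},
]
problems = find_misaligned_entities(sample)
print(problems)
```

An empty list for the whole dataset would suggest the offsets are consistent with the token boundaries. It would not prove that the labels themselves are right.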

## Save the reformatted dataset as a pickle file

In [12]:
filename = '../../data/ner_testing.pkl'
# The context manager closes the file even if pickling fails
with open(filename, 'wb') as outfile:
    pickle.dump(all_sentence_and_list_of_ents, outfile)
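
For later use, the pickle can be read back and each record turned into `(start, end, label)` character-offset triples. The sketch below is a self-contained illustration on a one-record sample written to a temporary directory. The helper `to_offset_tuples` is a hypothetical name, and choosing triples as the target shape is an assumption. Such triples could then be compared with `(ent.start_char, ent.end_char, ent.label_)` taken from the entities of a spaCy `Doc`.

```python
import os
import pickle
import tempfile

# One record in the same shape as the dataset built in this notebook
records = [
    {"string": "Peter Blackburn", "ents": [{"ent": "PER", "start": 0, "end": 15}]}
]


def to_offset_tuples(record):
    """Return the text and its entities as (start, end, label) triples."""
    spans = [(e["start"], e["end"], e["ent"]) for e in record["ents"]]
    return record["string"], spans


# Round trip through a temporary file instead of the real data path
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ner_testing.pkl")
    with open(path, "wb") as outfile:
        pickle.dump(records, outfile)
    with open(path, "rb") as infile:
        loaded = pickle.load(infile)

text, spans = to_offset_tuples(loaded[0])
print(text, spans)
```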