# Assignment #4: Extraction of Named Entities: Prerequisites
Author: Pierre Nugues

__You must execute this notebook before you start the assignment.__

The goal of the assignment is to create a system to extract named entities from a text. You will apply it to the CoNLL 2003 dataset.

In this part, you will collect the datasets and the files you need to train your models. You will also collect the script you need to evaluate them.

## Collecting the Training and Test Sets

As annotated data and annotation scheme, you will use the data created for [CoNLL 2003](https://www.clips.uantwerpen.be/conll2003/ner/).
1. Read the description of the CoNLL 2003 task
2. Download the training, validation, and test sets from https://data.deepai.org/conll2003.zip and decompress them. See the instructions below.
3. Note that the tagging scheme has been changed to IOB2 


In [1]:
!wget https://data.deepai.org/conll2003.zip
!unzip -u conll2003.zip

--2023-03-22 13:59:02--  https://data.deepai.org/conll2003.zip
Resolving data.deepai.org (data.deepai.org)... 169.150.247.33
Connecting to data.deepai.org (data.deepai.org)|169.150.247.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 982975 (960K) [application/zip]
Saving to: ‘conll2003.zip.1’


2023-03-22 13:59:02 (9.61 MB/s) - ‘conll2003.zip.1’ saved [982975/982975]

Archive:  conll2003.zip
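
Once the archive is decompressed, each dataset file follows the CoNLL 2003 column format: one token per line with its part of speech, chunk tag, and named entity tag, a blank line between sentences, and `-DOCSTART-` lines between documents. The sketch below shows one way to group such lines into sentences. The function name `read_conll` and the inline sample are illustrative assumptions, as is the file name `train.txt` in the closing comment; check the names of the files you extracted.

```python
def read_conll(lines):
    """Group CoNLL 2003 lines into sentences of (token, pos, chunk, ner) tuples."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # A blank line ends a sentence; -DOCSTART- marks a document boundary
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(tuple(line.split()))
    if current:  # last sentence when the input does not end with a blank line
        sentences.append(current)
    return sentences


# Illustrative sample in the IOB2 scheme (not read from the downloaded files)
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
""".splitlines()

sentences = read_conll(sample)
for sentence in sentences:
    print([(token, ner) for token, _, _, ner in sentence])

# With the real data (file name is an assumption):
# with open("train.txt", encoding="utf-8") as f:
#     train_sentences = read_conll(f)
```

In IOB2, every entity starts with a `B-` tag, including single-token entities, and `I-` tags only continue an entity that is already open.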


## The evaluation script

You will train the models with the training set and evaluate them with the test set. For this, you will apply the `conlleval` script, which computes the precision, the recall, and their harmonic mean: F1.

`conlleval` was written in Perl. Some people rewrote it in Python and you will use one such translation in this lab. The cell below installs it. The source code is available from this address: https://github.com/kaniblu/conlleval

In [2]:
!pip install conlleval
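
To make the metric concrete, here is a minimal sketch of an entity-level F1 on a single sentence. It is not the `conlleval` implementation and does not use its API: the helper names `iob2_spans` and `prf1` are hypothetical, and the code assumes well-formed IOB2 tags. An entity counts as correct only if its boundaries and its type both match the gold annotation.

```python
def iob2_spans(tags):
    """Return the set of (start, end, type) entity spans in an IOB2 tag sequence."""
    spans, start, etype = set(), None, None
    # The trailing "O" is a sentinel that closes a span ending at the last token
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def prf1(gold, predicted):
    """Precision, recall, and F1 over exact span matches."""
    gold_spans, pred_spans = iob2_spans(gold), iob2_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


gold = ["B-ORG", "O", "B-PER", "I-PER", "O"]
pred = ["B-ORG", "O", "B-PER", "O", "O"]
p, r, f = prf1(gold, pred)
print(p, r, f)
```

In this example, the predicted `PER` entity stops one token too early, so only the `ORG` entity matches: one correct entity out of two predicted and two in the gold annotation.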



## Collecting the Embeddings

You will represent the words with dense vectors instead of one-hot encodings. GloVe embeddings are one such representation. The GloVe files contain a list of words, where each word is represented by a vector of a fixed dimension. In this notebook, we will use the files of 400,000 lowercase words with 50- and 100-dimensional vectors.
Download either:
* The GloVe 6B embeddings from [https://nlp.stanford.edu/projects/glove/](https://nlp.stanford.edu/projects/glove/), keeping the 50d and 100d vectors; or
* A local copy of this dataset with the cell below (faster)

In [3]:
!wget https://fileadmin.cs.lth.se/nlp/nobackup/embeddings/nobackup/glove.6B.100d.txt.gz

--2023-03-22 13:59:07--  https://fileadmin.cs.lth.se/nlp/nobackup/embeddings/nobackup/glove.6B.100d.txt.gz
Resolving fileadmin.cs.lth.se (fileadmin.cs.lth.se)... 130.235.16.7
Connecting to fileadmin.cs.lth.se (fileadmin.cs.lth.se)|130.235.16.7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 134409071 (128M) [application/x-gzip]
Saving to: ‘glove.6B.100d.txt.gz’


2023-03-22 13:59:08 (112 MB/s) - ‘glove.6B.100d.txt.gz’ saved [134409071/134409071]



In [4]:
!wget https://fileadmin.cs.lth.se/nlp/nobackup/embeddings/nobackup/glove.6B.50d.txt.zip

--2023-03-22 13:59:08--  https://fileadmin.cs.lth.se/nlp/nobackup/embeddings/nobackup/glove.6B.50d.txt.zip
Resolving fileadmin.cs.lth.se (fileadmin.cs.lth.se)... 130.235.16.7
Connecting to fileadmin.cs.lth.se (fileadmin.cs.lth.se)|130.235.16.7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 69240158 (66M) [application/zip]
Saving to: ‘glove.6B.50d.txt.zip’


2023-03-22 13:59:09 (111 MB/s) - ‘glove.6B.50d.txt.zip’ saved [69240158/69240158]



In [5]:
!gunzip -k glove.6B.100d.txt.gz
!unzip -u glove.6B.50d.txt.zip
!mkdir -p glove
!mv glove.6B.100d.txt glove
!mv glove.6B.50d.txt glove
!rm glove.6B.100d.txt.gz glove.6B.50d.txt.zip

Archive:  glove.6B.50d.txt.zip
  inflating: glove.6B.50d.txt        
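
Each line of a GloVe text file holds a word followed by the components of its vector, separated by spaces. The sketch below shows one possible way to load such a file into a dictionary. The function name `load_glove` is hypothetical, and the three-dimensional sample values are made up for illustration; the real files have 50 or 100 values per line.

```python
def load_glove(lines):
    """Map each word to its vector, stored as a list of floats."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:  # skip empty or malformed lines
            continue
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings


# Toy 3-dimensional sample with made-up values
sample = [
    "the 0.1 0.2 -0.3\n",
    "of 0.4 -0.5 0.6\n",
]
embeddings = load_glove(sample)
print(len(embeddings), embeddings["the"])

# With the files moved to the glove folder by the cell above:
# with open("glove/glove.6B.100d.txt", encoding="utf-8") as f:
#     embeddings = load_glove(f)
```

Plain lists keep the sketch free of dependencies; in a model, you would typically convert the vectors to arrays or tensors.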
