<a href="https://colab.research.google.com/github/isegura/BasicNLP/blob/master/Exercise_Spacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise: Reading an NLP dataset

When you develop an NLP system, the first task is to read the dataset (or corpus, which is a collection of annotated texts). For a NER system, the annotations represent the entities to recognize. These annotations usually include the text of the entity, its start and end positions within the text, and its entity type. The annotations needed for a text classification system, on the other hand, are at document level. That is, each document is annotated with its corresponding category.

In this tutorial, we will learn how to load the texts and their annotations to develop a system for recognizing drug names in abstracts (summaries of medical articles published in Medline). 

We will load a small part of the CHEMDNER corpus (https://biocreative.bioinformatics.udel.edu/resources/biocreative-iv/chemdner-corpus/). Our dataset includes two files:
- **training.abstracts.txt** contains the texts. Each line contains the id and the text of a Medline abstract.
- **training.annotations.txt** contains the annotations of the drug mentions in these abstracts. Each line describes one drug mention with the following fields: id of the abstract, A (refers to 'Abstract'), start position, end position, drug name, and its type.
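As a minimal sketch of how the two line layouts described above could be parsed, the code below assumes that the fields are separated by tabs. The names `Annotation`, `parse_annotation_line` and `parse_abstract_line` are hypothetical, and the sample line is made up only to show the field order.

```python
from typing import NamedTuple, Tuple


class Annotation(NamedTuple):
    doc_id: str
    section: str      # 'A' refers to the abstract, per the description above
    start: int
    end: int
    mention: str
    entity_type: str


def parse_annotation_line(line: str) -> Annotation:
    """Split one annotation line (assumed tab-separated) into typed fields."""
    doc_id, section, start, end, mention, entity_type = line.rstrip("\n").split("\t")
    return Annotation(doc_id, section, int(start), int(end), mention, entity_type)


def parse_abstract_line(line: str) -> Tuple[str, str]:
    """Return (id, text); everything after the first tab is kept unchanged."""
    doc_id, text = line.rstrip("\n").split("\t", 1)
    return doc_id, text


# Made-up sample line, not taken from the corpus.
sample = "12345\tA\t0\t7\tMercury\tTRIVIAL"
ann = parse_annotation_line(sample)
print(ann.mention, ann.start, ann.end)
```

Converting the positions to integers at parse time makes it easier to compare them with token offsets later on.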

In [0]:
from google.colab import drive
drive.mount("/content/drive/")
!ls

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
drive  sample_data


In [0]:
# Base folder in Google Drive. Replace it with the folder where you
# save the notebooks of this course.
sst_home = 'drive/My Drive/Colab Notebooks/'
sst_home += 'TESI/basicNLP/sample/'

file_text = sst_home + 'training.abstracts.txt'
file_ann = sst_home + 'training.annotations.txt'

# Print every abstract; the with-block closes the file when done.
with open(file_text, encoding='utf-8') as texts:
    for text in texts:
        print(text)

# Print every annotation line.
with open(file_ann, encoding='utf-8') as annotations:
    for a in annotations:
        print(a)


21826085	DPP6 as a candidate gene for neuroleptic-induced tardive dyskinesia.	We implemented a two-step approach to detect potential predictor gene variants for neuroleptic-induced tardive dyskinesia (TD) in schizophrenic subjects. First, we screened associations by using a genome-wide (Illumina HumanHapCNV370) SNP array in 61 Japanese schizophrenia patients with treatment-resistant TD and 61 Japanese schizophrenia patients without TD. Next, we performed a replication analysis in 36 treatment-resistant TD and 138 non-TD subjects. An association of an SNP in the DPP6 (dipeptidyl peptidase-like protein-6) gene, rs6977820, the most promising association identified by the screen, was significant in the replication sample (allelic P=0.008 in the replication sample, allelic P=4.6 × 10(-6), odds ratio 2.32 in the combined sample). The SNP is located in intron-1 of the DPP6 gene and the risk allele was associated with decreased DPP6 gene expression in the human postmortem prefrontal cortex. Ch

The goal of this exercise is to read the texts and their annotations, and then to represent the tokens of the texts in IOB format.

The IOB format is a simple text chunking format that writes one token per line, followed by a whitespace and a tag that marks whether the token belongs to a named entity. To mark named entities that span multiple tokens, the tags have a prefix of either B- (beginning of a named entity) or I- (inside a named entity). O (outside) tags mark tokens that are not part of any named entity. The example below uses the bare tags B, I and O, without an entity type.

For example:

* **Mercury** *induces the expression of cyclooxygenase-2 and inducible **nitric oxide** synthase.* 

The IOB format will be:
<table>
<tr><td>Mercury B</td></tr>
<tr><td>induces O</td></tr>
<tr><td>the O</td></tr>
<tr><td>expression O</td></tr>
<tr><td>of O</td></tr>
<tr><td>cyclooxygenase-2 O</td></tr>
<tr><td>and O</td></tr>
<tr><td>inducible O</td></tr>
<tr><td>nitric B</td></tr>
<tr><td>oxide I</td></tr>
<tr><td>synthase O</td></tr>
<tr><td>. O</td></tr>
</table>
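The tagging shown in the table can be sketched in a few lines. This is only an illustration under stated assumptions: tokens come from a simple regular expression instead of spaCy or NLTK, entity spans are given as `(start, end)` character offsets with an exclusive end, and the function name `iob_tags` is hypothetical.

```python
import re
from typing import List, Tuple


def iob_tags(text: str, spans: List[Tuple[int, int]]) -> List[Tuple[str, str]]:
    """Tag each token B, I or O given (start, end) spans with exclusive end."""
    tagged = []
    # A token is a run of word characters and hyphens, or one punctuation mark.
    for match in re.finditer(r"[\w-]+|[^\w\s]", text):
        tag = "O"
        for start, end in spans:
            if match.start() == start:        # token opens the entity
                tag = "B"
                break
            if start < match.start() < end:   # token continues the entity
                tag = "I"
                break
        tagged.append((match.group(), tag))
    return tagged


sentence = ("Mercury induces the expression of cyclooxygenase-2 "
            "and inducible nitric oxide synthase.")
# Offsets are computed here; in the exercise they come from the annotation file.
entities = [(sentence.index(m), sentence.index(m) + len(m))
            for m in ("Mercury", "nitric oxide")]

for token, tag in iob_tags(sentence, entities):
    print(f"{token} {tag}")
```

A token that starts before an entity but overlaps it (for instance, an entity inside a longer hyphenated word) is tagged O by this sketch, so a real solution needs to decide how to handle such cases.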

Moreover, we can extend this IOB format by including the following features:
- PoS tag,
- lemma,
- shape,
- isUpper (a boolean value indicating whether the token is uppercase),
- isNumber (a boolean value indicating whether the token is a number),
- isPunct (a boolean value indicating whether the token is a punctuation mark),
- start and end positions within the text,
- $w_{-2}$, $w_{-1}$: the two previous words, their PoS tags and their IOB tags,
- $w_1$, $w_2$: the two next words, their PoS tags and their IOB tags,
- IOB tag.
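The context features $w_{-2}$, $w_{-1}$, $w_1$ and $w_2$ need special care at the start and end of a text, where some neighbours do not exist. One possible approach, sketched here with a hypothetical helper `context_window` and a hypothetical placeholder `<PAD>`, is to pad the missing positions:

```python
from typing import List

PAD = "<PAD>"  # hypothetical placeholder for positions outside the text


def context_window(items: List[str], i: int, size: int = 2) -> List[str]:
    """Return the `size` items before and after position i, padded at the edges."""
    return [items[j] if 0 <= j < len(items) else PAD
            for j in range(i - size, i + size + 1) if j != i]


words = ["Mercury", "induces", "the", "expression"]
print(context_window(words, 0))  # ['<PAD>', '<PAD>', 'induces', 'the']
print(context_window(words, 2))  # ['Mercury', 'induces', 'expression', '<PAD>']
```

The same function can be applied to the list of PoS tags or IOB tags to obtain the corresponding context features.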

You should use spaCy (or NLTK) to obtain these features.
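As a rough sketch of how some of these features map onto spaCy token attributes, the code below uses a blank English pipeline, which only tokenizes. This is an assumption made to keep the example free of model downloads: the PoS tag (`token.pos_`) and lemma (`token.lemma_`) are only filled in by a trained pipeline such as `en_core_web_sm`. The example sentence is made up.

```python
import spacy

# A blank pipeline only tokenizes. For PoS tags and lemmas, load a trained
# pipeline instead, e.g. spacy.load("en_core_web_sm") once it is installed.
nlp = spacy.blank("en")
doc = nlp("Mercury induces 2 genes.")

rows = []
for token in doc:
    rows.append([
        token.text,
        token.shape_,                       # e.g. capital followed by lowercase
        str(token.is_upper),                # isUpper
        str(token.like_num),                # isNumber
        str(token.is_punct),                # isPunct
        str(token.idx),                     # start position in the text
        str(token.idx + len(token.text)),   # end position (exclusive)
    ])

for row in rows:
    print("\t".join(row))
```

Note that `like_num` is also true for number words such as "ten", so it is slightly broader than a strict digit check.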

Remember that each line should represent one token and its features. Please use '\t' to separate the features. Each line should also contain the id of the abstract that the token belongs to.

Save the result to a file named "training.iob.txt".
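Writing the output could look like the following sketch. The helper name `write_iob` is hypothetical, and the rows are made-up values with only three fields (abstract id, token, IOB tag); a real row would carry all the features listed above.

```python
import io
from typing import Iterable, Sequence, TextIO


def write_iob(rows: Iterable[Sequence[str]], out: TextIO) -> None:
    """Write one token per line with tab-separated fields."""
    for row in rows:
        out.write("\t".join(row) + "\n")


# Made-up rows: abstract id, token, IOB tag.
rows = [("12345", "Mercury", "B"), ("12345", "induces", "O")]

buffer = io.StringIO()   # in-memory stand-in for a file
write_iob(rows, buffer)
print(buffer.getvalue())

# To produce the requested file instead:
# with open("training.iob.txt", "w", encoding="utf-8") as out:
#     write_iob(rows, out)
```

Passing a file-like object to the function keeps it easy to check the output in memory before writing to Google Drive.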




