# MedNER
---  
  
A supervised NER system for the medical domain will be built.

First, the required libraries are imported. We use the Hugging Face Transformers library, which is designed for working with pre-trained Transformer models such as BERT. To install it:


In [39]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


The remaining libraries are imported:

In [40]:
from transformers import BertModel, BertTokenizer, BertTokenizerFast
import torch
from tqdm import tqdm
import torch.nn as nn
import numpy as np
import random
import time
import spacy
import os

Check that the notebook is running on a GPU (it should print "Running on cuda").

In [41]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Running on {}".format(device))

Running on cpu


Switch to the directory that contains the data:

In [42]:
from google.colab import drive
drive.mount('/content/drive')

%cd "/content/drive/MyDrive/universidad/4.CURSO/HP/PROIEKTUA/data/"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/universidad/4.CURSO/HP/PROIEKTUA/data


## Read the data and build the *train*, *dev* and *test* partitions

The entire annotated corpus is in the file *corpus_pubtator.txt*. The data is in PubTator format. Specifically, each case is represented as follows:
  
```
PMID | t | Title text  
PMID | a | Abstract text    
PMID TAB StartIndex TAB EndIndex TAB MentionTextSegment TAB SemanticTypeID TAB EntityID
...
```

The first two lines give the title and abstract texts (with no line breaks or tabs inside the text). The following lines give the mentions, one per line. *StartIndex* and *EndIndex* are 0-based character indices into the document text, which is built by concatenating the Title and the Abstract separated by a single SPACE character. *MentionTextSegment* is the actual mention found between those character positions. *EntityID* is the id of the UMLS entity (concept), and *SemanticTypeID* is the id of the semantic type the entity is linked to in UMLS. If the UMLS entity is linked to more than one semantic type, this field holds a comma-separated list of all the type IDs. All UMLS concepts that are not in the active 2017-AA release are linked to the special semantic type *UnknownType*.

An example follows:
```
25763772|t|DCTN4 as a modifier of chronic Pseudomonas aeruginosa infection in cystic fibrosis
25763772|a|Pseudomonas aeruginosa (Pa) infection in cystic fibrosis (CF) patients is associated with worse long-term pulmonary disease and shorter survival, and chronic Pa infection (CPA) is associated with reduced lung function, faster rate of lung decline, increased rates of exacerbations and shorter survival. By using exome sequencing and extreme phenotype design, it was recently shown that isoforms of dynactin 4 (DCTN4) may influence Pa infection in CF, leading to worse respiratory disease. The purpose of this study was to investigate the role of DCTN4 missense variants on Pa infection incidence, age at first Pa infection and chronic Pa infection incidence in a cohort of adult CF patients from a single centre. Polymerase chain reaction and direct sequencing were used to screen DNA samples for DCTN4 variants. A total of 121 adult CF patients from the Cochin Hospital CF centre have been included, all of them carrying two CFTR defects: 103 developed at least 1 pulmonary infection with Pa, and 68 patients of them had CPA. DCTN4 variants were identified in 24% (29/121) CF patients with Pa infection and in only 17% (3/18) CF patients with no Pa infection. Of the patients with CPA, 29% (20/68) had DCTN4 missense variants vs 23% (8/35) in patients without CPA. Interestingly, p.Tyr263Cys tend to be more frequently observed in CF patients with CPA than in patients without CPA (4/68 vs 0/35), and DCTN4 missense variants tend to be more frequent in male CF patients with CPA bearing two class II mutations than in male CF patients without CPA bearing two class II mutations (P = 0.06). Our observations reinforce that DCTN4 missense variants, especially p.Tyr263Cys, may be involved in the pathogenesis of CPA in male CF.
25763772        0       5       DCTN4   T116,T123    C4308010
25763772        23      63      chronic Pseudomonas aeruginosa infection        T047    C0854135
25763772        67      82      cystic fibrosis T047    C0010674
25763772        83      120     Pseudomonas aeruginosa (Pa) infection   T047    C0854135
...
```
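
The record layout described above can be parsed with a few lines of Python. The following is a minimal sketch, not the code used in this notebook: `Mention` and `parse_record` are hypothetical names, the abstract is shortened for brevity, and mention lines are assumed to be tab-separated as the format specification states.

```python
from typing import Dict, List, NamedTuple


class Mention(NamedTuple):
    start: int          # 0-based start offset (StartIndex)
    end: int            # end offset, exclusive (EndIndex)
    text: str           # MentionTextSegment
    types: List[str]    # SemanticTypeID values, split on commas
    entity_id: str      # EntityID


def parse_record(lines: List[str]) -> Dict:
    """Parse the lines of one PubTator record into a dictionary."""
    pmid, _, title = lines[0].rstrip("\n").split("|", 2)
    abstract = lines[1].rstrip("\n").split("|", 2)[2]
    mentions = []
    for line in lines[2:]:
        # Mention lines are tab-separated
        _, start, end, text, types, entity_id = line.rstrip("\n").split("\t")
        mentions.append(Mention(int(start), int(end), text, types.split(","), entity_id))
    # Offsets refer to the title and the abstract joined by a single space
    return {"pmid": pmid, "text": title + " " + abstract, "mentions": mentions}


record = parse_record([
    "25763772|t|DCTN4 as a modifier of chronic Pseudomonas aeruginosa infection in cystic fibrosis\n",
    # Abstract shortened here; only its beginning is needed for the offsets below
    "25763772|a|Pseudomonas aeruginosa (Pa) infection in cystic fibrosis (CF) patients\n",
    "25763772\t0\t5\tDCTN4\tT116,T123\tC4308010\n",
    "25763772\t23\t63\tchronic Pseudomonas aeruginosa infection\tT047\tC0854135\n",
    "25763772\t67\t82\tcystic fibrosis\tT047\tC0010674\n",
    "25763772\t83\t120\tPseudomonas aeruginosa (Pa) infection\tT047\tC0854135\n",
])
for m in record["mentions"]:
    # Each mention text should equal the corresponding slice of the joined text
    assert record["text"][m.start:m.end] == m.text
```

The last mention starts at offset 83 because the title is 82 characters long and one space separates it from the abstract.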

The *train*, *dev* and *test* partitions are defined in the files *corpus_pubtator_pmids_trng.txt, corpus_pubtator_pmids_dev.txt, corpus_pubtator_pmids_test.txt*. They hold random 60%, 20% and 20% splits of the documents, respectively. Each file lists the PMID codes of the cases that belong to its partition.

The partitions are therefore written out as one file each for the *train*, *dev* and *test* cases. To do so:
1. One file each is created to store the *train*, *dev* and *test* cases.
2. The files *corpus_pubtator_pmids_trng.txt, corpus_pubtator_pmids_dev.txt, corpus_pubtator_pmids_test.txt* are read. For each code read, the corresponding case is taken from *corpus_pubtator.txt* and copied into the matching file (*train*, *dev* or *test*).
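
The two steps can also be done in a single pass over the corpus. The sketch below is an alternative to the functions defined in this notebook, not a copy of them: it groups the corpus lines by PMID once and then writes each partition, instead of rescanning the corpus for every code. `group_by_pmid` and `write_partition` are hypothetical names, and the demonstration uses made-up toy data in temporary files.

```python
import os
import tempfile


def group_by_pmid(corpus_path):
    """Read the corpus once and group its lines by PMID, keeping file order."""
    cases = {}
    with open(corpus_path, encoding="utf-8") as corpus:
        for line in corpus:
            if not line.strip():
                continue  # skip blank lines between records
            # The PMID is everything before the first "|" or tab
            pmid = line.split("|", 1)[0].split("\t", 1)[0]
            cases.setdefault(pmid, []).append(line)
    return cases


def write_partition(cases, pmids_path, out_path):
    """Copy the cases whose PMIDs are listed in pmids_path into out_path."""
    written = 0
    with open(pmids_path, encoding="utf-8") as pmids, \
         open(out_path, "w", encoding="utf-8") as out:
        for pmid in (p.strip() for p in pmids):
            if pmid in cases:
                out.writelines(cases[pmid])
                written += 1
    return written


# Tiny demonstration on temporary files (toy data, not from the corpus)
workdir = tempfile.mkdtemp()
corpus_path = os.path.join(workdir, "corpus.txt")
pmids_path = os.path.join(workdir, "pmids.txt")
out_path = os.path.join(workdir, "train.txt")
with open(corpus_path, "w", encoding="utf-8") as f:
    f.write("1|t|Title one\n1|a|Abstract one\n1\t0\t5\tTitle\tT000\tC000\n\n")
    f.write("2|t|Title two\n2|a|Abstract two\n")
with open(pmids_path, "w", encoding="utf-8") as f:
    f.write("2\n")

n_written = write_partition(group_by_pmid(corpus_path), pmids_path, out_path)
with open(out_path, encoding="utf-8") as f:
    partition_text = f.read()
```

Opening the output with mode `"w"` makes the step safe to re-run, whereas appending would duplicate cases on a second execution.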

### Define global variables holding the file paths

In [43]:
data_directory = "corpus_pubtator.txt"

train_directory = "corpus_pubtator_pmids_trng.txt"
dev_directory = "corpus_pubtator_pmids_dev.txt"
test_directory = "corpus_pubtator_pmids_test.txt"

train_def_directory = "corpus_pubtator_train.txt"
dev_def_directory = "corpus_pubtator_dev.txt"
test_def_directory = "corpus_pubtator_test.txt"

### Create the files

In [44]:
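def createFile(path):
    """Create an empty file at `path` if it does not exist yet."""
    if not os.path.exists(path):
        # Make sure the parent directory exists, then create the empty file
        parent = os.path.dirname(path)
        if parent:
            os.makedirs(parent, exist_ok=True)
        open(path, "a").close()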

def createDirectory(path):
    if not os.path.exists(path):
        os.mkdir(path)

createFile("./" + train_def_directory)
createFile("./" + dev_def_directory)
createFile("./" + test_def_directory)

### Write to the files

In [45]:
def idatzi_kasua(write_directory, line):
  with open(write_directory, "a") as f:
    f.write(line)

### Read the files

In [46]:
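def aurkitu_kasua(data_directory, write_directory, code):
  """Copy every line of the case whose PMID is `code` into `write_directory`."""
  aurkitu = False
  with open(data_directory) as file:
    for line in file:
      # The PMID is the first field, before the first "|" (title/abstract lines)
      # or the first tab (mention lines)
      pmid = line.split("|", 1)[0].split("\t", 1)[0]

      if pmid == code:
        aurkitu = True
        idatzi_kasua(write_directory, line)
      elif aurkitu:
        # The lines of a case are contiguous, so stop at the first line that no longer matches
        break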

In [47]:
def read_makePartitions_txt(filename, write_directory):
    """Read one PMID per line and copy the matching case into `write_directory`."""
    with open(filename) as file:
        for line in file:
            code = line.strip()
            if code:  # skip blank lines
                aurkitu_kasua(data_directory, write_directory, code)

In [None]:
read_makePartitions_txt(train_directory, train_def_directory)
read_makePartitions_txt(dev_directory, dev_def_directory)
read_makePartitions_txt(test_directory, test_def_directory)

## Z1
Using the MedMentions corpus, train a general NER system that determines whether a term belongs to UMLS, that is, whether it is health-related. In other words, identify the terms, with a single output class (medical, or MED).

[DCTN4] as a modifier of [chronic Pseudomonas aeruginosa infection] in
[cystic fibrosis]

### BIO tagging
To complete Z1, now that the partitions exist, their cases are converted from the MedMentions format to BIO tagging, because the models used later require this format.


There is a single label, MED. A token that starts an entity is tagged B-MED, a token inside an entity is tagged I-MED, and a token that is not part of any entity is tagged O. An example follows:

```
[DCTN4] as a modifier of [chronic Pseudomonas aeruginosa infection] in [cystic fibrosis]

B-MED O O O O B-MED I-MED I-MED I-MED O B-MED I-MED
```
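
To make the scheme concrete, the sketch below reads the tag sequence of the example back into entity strings. It is only an illustration: `bio_to_entities` is a hypothetical helper that does not appear elsewhere in this notebook, and tokens are split on whitespace for simplicity.

```python
def bio_to_entities(tokens, tags):
    """Group tokens tagged B-MED / I-MED into entity strings."""
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-MED":
            # A new entity starts; close the previous one if it is still open
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif tag == "I-MED" and current:
            current.append(token)
        else:
            # "O" (or an I-MED with no open entity) closes the current entity
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


tokens = "DCTN4 as a modifier of chronic Pseudomonas aeruginosa infection in cystic fibrosis".split()
tags = "B-MED O O O O B-MED I-MED I-MED I-MED O B-MED I-MED".split()
entities = bio_to_entities(tokens, tags)
print(entities)
# ['DCTN4', 'chronic Pseudomonas aeruginosa infection', 'cystic fibrosis']
```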

First, new files are created to store the BIO-tagged cases of each *train*, *dev* and *test* partition.

In [23]:
# Global variables
train_BIO_directory = "corpus_BIO_train.txt"
dev_BIO_directory = "corpus_BIO_dev.txt"
test_BIO_directory = "corpus_BIO_test.txt"

# Create the files
createFile("./" + train_BIO_directory)
createFile("./" + dev_BIO_directory)
createFile("./" + test_BIO_directory)

Now the previously created files (corpus_pubtator_train.txt, corpus_pubtator_test.txt and corpus_pubtator_dev.txt) are read and tagged. For each case, the first two lines are read, which are the title and the abstract. Their tokens are tagged O unless they appear in the following lines, since there is one line per entity. A token is tagged B-MED if it starts an entity and I-MED if it is any other part of an entity. To decide this, the following lines are inspected, using *MentionTextSegment* and *StartIndex* to determine whether the token is the start or a later part of an entity.
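
The core of this procedure, turning character spans into per-token tags, can be sketched independently of the tokenizer. The version below is a simplified illustration and not the notebook's implementation: it splits on whitespace instead of using spaCy, `spans_to_bio` is a hypothetical name, and when mentions overlap the first matching span in the list wins.

```python
import re


def spans_to_bio(text, spans):
    """Tag the whitespace-separated tokens of text given (start, end) character spans."""
    tagged = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end in spans:
            if match.start() == start:
                tag = "B-MED"   # token begins exactly where a mention begins
                break
            if start < match.start() < end:
                tag = "I-MED"   # token begins inside a mention
                break
        tagged.append((match.group(), tag))
    return tagged


title = "DCTN4 as a modifier of chronic Pseudomonas aeruginosa infection in cystic fibrosis"
# (StartIndex, EndIndex) pairs of the three title mentions in the example record
spans = [(0, 5), (23, 63), (67, 82)]
tagged = spans_to_bio(title, spans)
print(" ".join(tag for _, tag in tagged))
```

Because `re.finditer` reports the offset of every token, no separate bookkeeping of character positions is needed in this simplified setting.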

The *spaCy* tokenizer is used to tokenize the title and the abstract. Since a *MentionTextSegment* may consist of several words, it is tokenized with spaCy as well.

spaCy is loaded and the tokenizer is defined:

In [24]:
# Only spaCy's tokenization is used here, so the other components are removed from the pipeline
nlp = spacy.load('en_core_web_sm')
for pipe_name in ('tagger', 'ner', 'parser'):
  if pipe_name in nlp.pipe_names:
    nlp.remove_pipe(pipe_name)

def tokenize(text):
  # Double every space so that spaCy emits each original space as its own " " token;
  # the running sum of token lengths can then track character offsets in the original text
  new_text = text.replace(" ", "  ")
  return [token.text for token in nlp(new_text)]


Read each file and tag it in BIO format, with a single class:

In [25]:
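def idatzi_entitateak(entitate_guztiak, write_directory):
  """Append one "token<TAB>tag" line per non-space token of a case."""
  for entitatea in entitate_guztiak:
    if entitatea[0] != " ":
      text = entitatea[0] + "\t" + entitatea[3] + "\n"
      idatzi_kasua(write_directory, text)
  # A blank line closes the case so that read_txt can split the file into sequences
  if entitate_guztiak:
    idatzi_kasua(write_directory, "\n")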


In [73]:
def readBIO_txt(filename, write_directory):
    """ Read the input one line at a time and write it out with BIO tags """
    with open(filename) as file:
        lines = "" # will hold the title and the abstract, joined by a space
        entitate_guztiak = []
        cont = 0
        cont_len = 0
        for line in file:
          if  "|" in line:
            if cont % 2 == 0:
              # a new title line: write out the previous case
              idatzi_entitateak(entitate_guztiak, write_directory)
              entitate_guztiak = []
              lines = ""           
              cont_len = 0   
            cont += 1
            # split on the first two "|" only, so a "|" inside the text is preserved
            text_ta = line.split("|", 2)[2].rstrip("\n")
            tokenizatu_text_ta = tokenize(text_ta+ " ")
            for t in tokenizatu_text_ta:
              entitate_guztiak.append([t,cont_len , cont_len+len(t) ,"O"])
              cont_len += len(t)
            lines += text_ta + " "
          else:
            desk = line.split("\t")
            index_start = int(desk[1])
            index_end = int(desk[2])
            entitate = lines[index_start:index_end]
            entitate_tokenizatu = tokenize(entitate)
            first = True
            for ent in entitate_tokenizatu:
              #print(str(ent)+" "+str(index_start)+ " "+str(index_start + len(ent)))
              if first:
                if ent != " ":
                  try:
                    index = entitate_guztiak.index([ent,index_start, index_start + len(ent), "O"])
                    entitate_guztiak[index] = [ent,index_start, index_start + len(ent), "B-MED"]
                    index_start += len(ent)
                    first = False
                  except ValueError:
                    # no untagged token at this offset: insert the token after the last matched one
                    index += 1
                    entitate_guztiak.insert(index, [ent, len(entitate_guztiak[index-1][0]), len(entitate_guztiak[index-1][0])+len(ent), "B-MED"])
                    index_start += len(ent)
                    first = False
                else:
                  index_start += 1
              elif ent != " ":
                try:
                  index = entitate_guztiak.index([ent,index_start, index_start +len(ent), "O"])
                  entitate_guztiak[index] = [ent,index_start, index_start +len(ent), "I-MED"]
                  index_start += len(ent)
                except ValueError:
                  # no untagged token at this offset: insert the token after the last matched one
                  index += 1
                  entitate_guztiak.insert(index, [ent, len(entitate_guztiak[index-1][0]), len(entitate_guztiak[index-1][0])+len(ent), "I-MED"])
                  index_start += len(ent)
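              else:
                index_start += 1
        # write out the last case, which no later title line would flush
        idatzi_entitateak(entitate_guztiak, write_directory)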


In [74]:
readBIO_txt(dev_def_directory ,dev_BIO_directory)
readBIO_txt(test_def_directory ,test_BIO_directory)
readBIO_txt(train_def_directory ,train_BIO_directory)

### Read the BIO-tagged data

Next, a function is defined to read the previously tagged data. It reads the data and builds a list of sentences. Each sentence is a list of tuples, and each tuple has two elements: the word and its tag (O, B or I).


In [None]:
def read_txt(filename):    
    sentences=[]
    sentence=[]
    with open(filename) as file:
        for line in file:
            cols=line.rstrip().split("\t")
            if len(cols) < 2:
                if len(sentence) > 0:
                    sentences.append(sentence)
                sentence=[]
                continue
                
            word=cols[0]
            tag=cols[1]
            
            sentence.append((word, tag))
            
        if len(sentence) > 0:
            sentences.append(sentence)
            
    return sentences
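
The labeler defined later in this notebook looks tags up in a global `tag_vocab` dictionary whose definition is not shown in this section. The sketch below shows one plausible way to build such a mapping from the output of `read_txt`, assuming it maps each tag string to an integer id. `build_tag_vocab`, the sample `sentences` and the resulting id order are illustrative, not the values actually used.

```python
def build_tag_vocab(sentences):
    """Map every tag seen in the data to an integer id, in sorted order."""
    tags = sorted({tag for sentence in sentences for _, tag in sentence})
    return {tag: idx for idx, tag in enumerate(tags)}


# Same structure as the list returned by read_txt:
# one list of (word, tag) tuples per sequence
sentences = [
    [("DCTN4", "B-MED"), ("as", "O"), ("a", "O"), ("modifier", "O")],
    [("cystic", "B-MED"), ("fibrosis", "I-MED")],
]
tag_vocab = build_tag_vocab(sentences)
print(tag_vocab)  # {'B-MED': 0, 'I-MED': 1, 'O': 2}
```

Under this assumption the number of output labels of the classifier would be `len(tag_vocab)`.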


### Defining the BERT-base sequence labeler

In this case, the tokenizer and the transformer parameters are downloaded through the Transformers library. Models are stored under a specific "model_name".

Mini-batches are organized according to the tokenizer.

Instead of using the Hugging Face model (BertForTokenClassification), we include the code that makes BERT work as a sequence classifier for NER. The reason is to expose the complications that arise from subword tokenization.

While NER datasets assign one BIO-format tag to each word, the BERT tokenizer may split those words into several tokens. The tags therefore have to be adjusted to match the new length of the tokenized sequence. There are several ways to do this: the BIO tag of the original word can be assigned to all of its subwords or, alternatively, the BIO tag can be assigned to the first subword while a special label (-100) is assigned to the remaining subwords of that word. The first option is used here: the tags are "stretched" by giving every subword of an original word the same BIO tag.

Here is an example:
```
input in sentences (3 tokens): 

                     words: Hello San Sebastian! 
                     tags:  O     B   I

input as organized for batch_x and batch_adjusted_tags: 

                subwords:    [CLS] Hel #lo San Sebast #ian ! [SEP] [PAD] ... [PAD]
                .word_ids(): None  0   0   1   2      2    2 None  None      None
                tags:        -1    O   O   B   I      I    I -1    -1        -1    
                
                Note that -1 tags will be ignored when computing loss
```
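
Both alignment strategies can be written as one small function over the word ids. This is a minimal sketch for illustration: `align_tags` is a hypothetical helper, the word ids are copied by hand from the example above rather than produced by a tokenizer, and the tags are kept as strings for readability (the labeler maps them to integer ids).

```python
def align_tags(word_ids, tags, first_only=False, ignore_label=-1):
    """Expand word-level tags to subword level using the tokenizer's word ids."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_label)    # [CLS], [SEP], [PAD]
        elif first_only and word_id == previous:
            aligned.append(ignore_label)    # later subwords of the same word
        else:
            aligned.append(tags[word_id])   # the tag of the original word
        previous = word_id
    return aligned


# Word ids for: [CLS] Hel #lo San Sebast #ian ! [SEP] [PAD]
word_ids = [None, 0, 0, 1, 2, 2, 2, None, None]
tags = ["O", "B", "I"]

# Option used in this notebook: every subword inherits the tag of its word
stretched = align_tags(word_ids, tags)
# Alternative: only the first subword keeps the tag, the rest get -100
first_subword = align_tags(word_ids, tags, first_only=True, ignore_label=-100)

print(stretched)      # [-1, 'O', 'O', 'B', 'I', 'I', 'I', -1, -1]
print(first_subword)  # [-100, 'O', -100, 'B', 'I', -100, -100, -100, -100]
```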


In [None]:
class BERTSequenceLabeler(nn.Module):

    
    def __init__(self, params):
        super().__init__()
    
        self.model_name=params["model_name"]
        self.tokenizer = BertTokenizerFast.from_pretrained(self.model_name, do_lower_case=params["doLowerCase"], do_basic_tokenize=False)
        self.bert = BertModel.from_pretrained(self.model_name)
        self.num_labels = params["label_length"]

        self.fc = nn.Linear(params["embedding_size"], self.num_labels)

    def forward(self, batch_x): 
    
        bert_output = self.bert(input_ids=batch_x["input_ids"],
                         attention_mask=batch_x["attention_mask"],
                         token_type_ids=batch_x["token_type_ids"],
                         output_hidden_states=True)

        bert_hidden_states = bert_output['hidden_states']

        # Note that the hidden states of all layers are returned, hence the use of -1 to access the states of the top layer
        out = bert_hidden_states[-1]
        out = self.fc(out)

        return out.squeeze()

    def get_batches(self, all_data, batch_size=32, max_toks=256):
        """ Get batches for input x, y data, with data tokenized according to the BERT tokenizer
            (and limited to a maximum number of BERT tokens) """

        batches_x=[]
        batches_y=[]
        
        for i in range(0, len(all_data), batch_size):

            current_batch=[]

            data=all_data[i:i+batch_size]       # returns a list of word,tag pairs
            def extract_words(item):            # returns a list of words
              return [word for word,tag in item]
            def extract_tags(item):             # returns a list of tags
              return [tag for word,tag in item]
            sentences = [extract_words(item) for item in data]

            # is_split_into_words=True tells the tokenizer that the input is already split into words.
            # Since the tokenizer returns subwords, `batch_x.word_ids(j)` gives the original word index of each subword.
            batch_x = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="pt", max_length=max_toks, is_split_into_words=True)

            batch_adjusted_tags = []
            for j, item in enumerate(data):
              word_ids = batch_x.word_ids(j)
              tags = extract_tags(item)
              # Special tokens ([CLS], [SEP], [PAD]) have no word id and get -1;
              # every subword of a word gets that word's tag id from the global `tag_vocab`
              adjusted_tags = [-1 if word_id is None else tag_vocab[tags[word_id]] for word_id in word_ids]
              batch_adjusted_tags.append(adjusted_tags)

            batches_x.append(batch_x.to(device))
            batches_y.append(torch.LongTensor(batch_adjusted_tags).to(device))
            
        return batches_x, batches_y
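
The example earlier in this section notes that -1 tags are ignored when computing the loss. The sketch below shows what that means, written in plain Python so that it runs without PyTorch. `masked_cross_entropy` and the toy numbers are illustrative and not part of this notebook. In PyTorch, `nn.CrossEntropyLoss(ignore_index=-1)` gives this behaviour; whether the training loop uses exactly that is not shown in this section.

```python
import math


def masked_cross_entropy(logits, tags, ignore_label=-1):
    """Mean negative log-likelihood over positions whose tag is not ignore_label."""
    total, count = 0.0, 0
    for scores, tag in zip(logits, tags):
        if tag == ignore_label:
            continue  # special token: contributes nothing to the loss
        log_norm = math.log(sum(math.exp(s) for s in scores))
        total += log_norm - scores[tag]  # -log softmax(scores)[tag]
        count += 1
    return total / count if count else 0.0


# One toy sequence: 4 subword positions and 3 labels;
# the first and last positions stand for special tokens
logits = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [5.0, 1.0, 0.0]]
tags = [-1, 0, 2, -1]
loss = masked_cross_entropy(logits, tags)

# Changing the scores at ignored positions leaves the loss unchanged
changed = [[9.0, 9.0, 0.0], logits[1], logits[2], [0.0, 0.0, 7.0]]
assert masked_cross_entropy(changed, tags) == loss
```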
  

# TRASH

In [None]:

            '''
            desk = line.split("\t")
            index_start = int(desk[1])
            index_end = int(desk[2])
            text_segment = desk[3].split(" ")
            first = True
            #print(cont)
            while cont != index_end:
              if cont < index_start:
                fragment = lines[cont:index_start]
                print(fragment)
                tokens = fragment.split(" ")
                for token in tokens:
                  if token !=" ":
                    text = token+"\t O \n"
                    idatzi_kasua(write_directory, text)
                cont += len(fragment)
                print(cont)
              elif cont == index_start:
                print("sartu")
                if len(text_segment) == 1: # consists of a single word
                  text = lines[cont:index_end+1] +"\t B-MED \n"
                  idatzi_kasua(write_directory, text)
                else: 
                  for segment in text_segment:
                    if first:
                      text = lines[cont:cont+len(segment)] +"\t B-MED \n"
                      idatzi_kasua(write_directory, text)
                      first = False
                      cont = len(text)+1
                    else:
                      text = lines[cont:cont+len(segment)] +"\t I-MED \n"
                      idatzi_kasua(write_directory, text)
                cont = index_end
              lines = lines[index_end:]
            
            sartu = False
            print(lines)
            desk = line.split("\t")
            index_start = int(desk[1])
            index_end = int(desk[2])
            del  lines[0:len(remove_list)]
            remove_list = []
            for l in lines:
              print(cont)
              if cont == index_end:
                print("sartu")
                break
              if l == " ":
                cont += 1
                remove_list.append(lines.index(l))
              elif cont == index_start:
                text = l +"\t B-MED \n"
                sartu = True
                idatzi_kasua(write_directory, text)
                remove_list.append(lines.index(l))
              elif cont < index_end and sartu:
                sartu = False
                cont += len(l)
                text = l +"\t I-MED \n"
                idatzi_kasua(write_directory, text)
                remove_list.append(lines.index(l))
              else:
                cont += len(l) 
                text = l +"\t O \n"
                idatzi_kasua(write_directory, text)
                remove_list.append(lines.index(l))








        atera = False
            while not atera:
              if cont == index_start or cont > index_start:
                if len(text_segment) == 1: #entity has only one word
                  if cont > index_start:
                    cont = index_start
                  l = lines[cont:index_end]
                  cont += len(l)
                  text = l +"\t B-MED \n"
                  idatzi_kasua(write_directory, text)
                  atera = True
                elif len(text_segment) > 1 : #entity has more than one word 
                  count = 0
                  if cont > index_start:
                      cont = index_start
                  while cont != index_end:
                    if cont == index_start:
                      l = lines[cont:len(text_segment[0])]
                      text =  l +"\t B-MED \n"
                      idatzi_kasua(write_directory, text)
                    else:
                      l = lines[cont:len(text_segment[count])]
                      text = l +"\t I-MED \n"
                      idatzi_kasua(write_directory, text)
                    count += 1
                    cont += len(l)
                  atera = True

              else:
                text_other_segment = tokenize(lines[cont:index_start])
                count = 0
                for l in text_other_segment:
                  if l != " ":
                    text = l +"\t O \n"
                    idatzi_kasua(write_directory, text)
                    count += 1
              
                  cont += len(l)

                '''
              
            
           

In [10]:
def readBIO2_txt(filename, write_directory):
    
    """ Read input one by line """
    with open(filename) as file:
        lines = [] # will hold the tokens of the title and the abstract
        cont = 0
        remove_list = []
        prev_index_start = None
        prev_index_end = -1
        sartu  = False
        for line in file:
          if  "|" in line:
            text_ta = (line.split("|")[2]).split("\n")[0]
            lines.extend(tokenize(text_ta))
          else:
            print(lines)
            desk = line.split("\t")
            index_start = int(desk[1])
            index_end = int(desk[2])
            prev = False
            if prev_index_start == index_start:
              del  lines[0:len(remove_list)-len(text_segment)]
            elif  index_start < prev_index_end:
              prev = True
              del lines[0:len(remove_list)-len(text_segment)]
            else:
              del  lines[0:len(remove_list)]
            prev_index_start = index_start
            prev_index_end = index_end
            text_segment = tokenize(desk[3])
            remove_list = []
            for l in lines:
              if l in text_segment: #if it is in MentionTextSegment
                cont += 1
                if len(text_segment) == 1 and text_segment[0] == l: #entity has only one word
                  text = "  ".join(text_segment) + " " +l +"\t B-MED \n"
                  idatzi_kasua(write_directory, text)
                elif len(text_segment) > 1 and text_segment[0] == l: #entity has more than one word and l is the first one
                  sartu = True
                  text = "  ".join(text_segment)+ " "  + l +"\t B-MED \n"
                  idatzi_kasua(write_directory, text)
                elif  len(text_segment) > 1 and text_segment[0] != l and sartu: #entity has more than one word and l is not the first one
                  text = "  ".join(text_segment)+ " "  + l +"\t I-MED \n"
                  idatzi_kasua(write_directory, text)
                remove_list.append(lines.index(l))

                if cont == len(text_segment):
                  sartu = False
                  cont = 0
                  break

              elif  not prev: #if it is NOT  in MentionTextSegment
                text =   "  ".join(text_segment)+ " "  + l +"\t O \n"
                idatzi_kasua(write_directory, text)
                remove_list.append(lines.index(l))

            