
# Named Entity Recognition based on dictionaries

In this tutorial, we study the simplest approach to the NER task: looking up entity names in dictionaries. We will use the spacy-lookup library:

https://github.com/mpuig/spacy-lookup

spacy-lookup requires spacy v2.0.16 or higher.
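
The version requirement can be checked before going further. The following is a minimal sketch, not part of spacy-lookup: the helpers `version_tuple` and `is_supported` are hypothetical names introduced here, and the comparison only looks at the first three numeric parts of the version string.

```python
import re

# Minimum spaCy version stated above
MIN_SPACY = (2, 0, 16)

def version_tuple(version):
    """Convert '2.0.16' (or '2.1.0a1') into a tuple of leading integers."""
    numbers = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    return tuple(numbers)

def is_supported(version, minimum=MIN_SPACY):
    """Return True if the version string is at least the minimum."""
    return version_tuple(version) >= minimum

print(is_supported("2.0.16"))
print(is_supported("2.0.12"))
```

Once spaCy is installed, the same check can be applied to the installed version with `is_supported(spacy.__version__)`.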


In particular, this tutorial will teach you to:
- Add Named Entities metadata to Doc objects in spaCy.
- Detect Named Entities using dictionaries.

First, you need to download the Spanish spaCy model and install spacy-lookup:


In [30]:
!python -m spacy download es
!pip install spacy-lookup

✔ Download and installation successful
You can now load the model via spacy.load('es_core_news_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/es_core_news_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/es
You can now load the model via spacy.load('es')


Now, we must mount Google Drive to access our local files and set the path to the folder that contains them.

In [31]:
from google.colab import drive
drive.mount("/content/drive/")

!ls 'drive/My Drive/Colab Notebooks/TESI/4-NER/'

#MODIFY THIS PATH 
path='drive/My Drive/Colab Notebooks/TESI/4-NER/'


Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
 chemdner_corpus		       IntroNER-spacy.ipynb   Untitled0.ipynb
'Dictionary based NER (spacy).ipynb'   resources


The following cell contains the code to load the dictionaries (stored in 'resources'). In particular, we are going to load one dictionary containing names of rare diseases and another containing symptoms.

In [32]:
import csv

def load_dictionary(path_file, dictionary):
  """Read a CSV file and append the lowercased first column of each row
  to `dictionary`. The list is modified in place, so the function
  returns nothing (None); do not assign its result to a variable."""
  with open(path_file, 'r', newline='') as f:
    reader = csv.reader(f)
    list_values = list(reader)

  for x in list_values:
    # skip blank rows, which csv.reader yields as empty lists
    if x:
      dictionary.append(x[0].lower())

  print('dictionary loaded', len(dictionary))


path_file=path+'resources/diseases-rare.csv'
diseases=[]
load_dictionary(path_file,diseases)
#print(type(diseases))

path_file=path+'resources/sintomas-es.csv'
symptoms=[]
load_dictionary(path_file,symptoms)
#print(type(symptoms))



dictionary loaded 20633
dictionary loaded 82
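
As an alternative to filling a list passed as a parameter, a loader can build the list and return it. The sketch below is an illustration only: `read_first_column` is a hypothetical helper, and the in-memory sample (including its second column) is made up to stand in for a file in 'resources'.

```python
import csv
import io

def read_first_column(handle):
    """Return the lowercased first column of every non-empty CSV row."""
    return [row[0].lower() for row in csv.reader(handle) if row]

# Made-up sample standing in for a CSV file; the blank line is skipped
sample = io.StringIO("Disartria,code1\nDisfagia,code2\n\nEscoliosis\n")
terms = read_first_column(sample)
print(terms)  # ['disartria', 'disfagia', 'escoliosis']
```

The caller must assign the result (`terms = read_first_column(...)`); a function without a `return` statement always yields `None`.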



Now, we load the model and replace its NER component with the dictionary-based entity component diseaseEnt. We do this so that the entities found by the default model do not overlap with ours.
Then, we also add symptomEnt at the end of the pipeline.



In [33]:
import spacy
from spacy_lookup import Entity

nlp = spacy.load('es')

#diseases = Entity(keywords_list=['ataxia de Friedrich', 'FRDA', 'escoliosis', 'escoliosis', 'diabetes mellitus'],label="Disease")
diseaseEnt = Entity(keywords_list=diseases,label="DISEASE")
symptomEnt = Entity(keywords_list=symptoms,label="SYMPTOM")

# replace the model's default NER component with the disease dictionary
nlp.replace_pipe("ner", diseaseEnt)
# append the symptom dictionary as the last pipeline component
nlp.add_pipe(symptomEnt, last=True)

print('entities loaded in nlp')


entities loaded in nlp
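
To illustrate why overlapping entities are a concern, here is a minimal sketch of one common way to resolve them: keep the longest span and drop any span that overlaps it. This is not how spacy-lookup is implemented; `drop_overlaps` is a hypothetical helper, and the `SYMPTOM` span in the sample is invented for the example.

```python
def drop_overlaps(spans):
    """Keep the longest spans first and discard any span that overlaps a kept one.

    Each span is a (start_char, end_char, label) tuple with an exclusive end.
    """
    kept = []
    # Longest first; ties are broken by the earliest start
    for start, end, label in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)

# The (3, 9) span is a made-up match that lies inside the (3, 22) span
spans = [(3, 22, "DISEASE"), (3, 9, "SYMPTOM"), (24, 28, "DISEASE")]
print(drop_overlaps(spans))  # [(3, 22, 'DISEASE'), (24, 28, 'DISEASE')]
```

spaCy's `spacy.util.filter_spans` applies a similar longest-first rule to `Span` objects, if it is available in your version.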


Now, we can process a text and show its entities:

In [35]:
text = u"La ataxia de Friedrich (FRDA) es un trastorno neurodegenerativo hereditario que se caracteriza clásicamente por una ataxia progresiva de la marcha, disartria, disfagia, disfunción oculomotora, pérdida de los reflejos tendinosos profundos, signos de afectación del tracto piramidal, escoliosis, y en algunos casos, miocardiopatía, diabetes mellitus, pérdida visual y audición defectuosa"
text=text.lower()
doc = nlp(text)
#print([(token.text, token._.canonical) for token in doc if token._.is_entity])
#we show the entities and their types recognized in the text
#dict([(str(x), x.label_) for x in nlp(str(text)).ents])

for x in doc.ents:
  print(x.text,x.label_,x.start_char,x.end_char)

ataxia de friedrich DISEASE 3 22
frda DISEASE 24 28
ataxia progresiva de la marcha SYMPTOM 116 146
disartria SYMPTOM 148 157
disfagia SYMPTOM 159 167
disfunción oculomotora SYMPTOM 169 191
pérdida de los reflejos tendinosos profundos SYMPTOM 193 237
signos de afectación del tracto piramidal SYMPTOM 239 280
miocardiopatía SYMPTOM 314 328
pérdida visual SYMPTOM 349 363
audición defectuosa SYMPTOM 366 385
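
The character offsets printed above can be reproduced with a plain dictionary lookup. The following is a simplified sketch using a regular expression, assuming lowercased text and terms; it does not claim to mirror the internals of spacy-lookup, and `find_terms` is a hypothetical helper.

```python
import re

def find_terms(text, terms, label):
    """Return (matched text, label, start_char, end_char) for each dictionary hit."""
    if not terms:
        return []
    # Try longer terms first so multi-word names win over their substrings
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b")
    return [(m.group(), label, m.start(), m.end()) for m in pattern.finditer(text)]

text = "La ataxia de Friedrich (FRDA) es un trastorno neurodegenerativo".lower()
hits = find_terms(text, ["ataxia de friedrich", "frda"], "DISEASE")
for hit in hits:
    print(*hit)
# ataxia de friedrich DISEASE 3 22
# frda DISEASE 24 28
```

The end offset is exclusive, so `text[3:22]` gives back `ataxia de friedrich`.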


In [36]:
from spacy import displacy
ents=["DISEASE","SYMPTOM"]
colors={"DISEASE":"#F9E79F","SYMPTOM":"#82E0AA"}
options = {"ents": ents, "colors": colors}

#displacy.render(doc, jupyter=True, style="ent", options=get_entity_options(ents))
displacy.render(doc, jupyter=True, style="ent", options=options)
