# ScispaCy for Bio-medical Named Entity Recognition (NER)

The rapid growth of the Internet has produced an immense volume of data that is created and shared as text, images, video, and audio. This flood of information also affects specialized fields such as biomedicine, where the number of published documents (articles, books, and technical reports) is growing exponentially.

Named entity recognition (NER) assigns a named-entity tag to a word or phrase, traditionally by using rules and heuristics. The entities to be recognized typically denote a person, a location, or an organization. NER extracts nominal and numeric information from a document and classifies each word as, for example, a person, an organization, or a date. Every word in the document is assigned either to one of the existing categories or to "none of the above".
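
As a rough illustration of the rule-and-heuristic approach described above, the sketch below combines a tiny lookup list with one regular expression. The names in `GAZETTEER`, the date pattern, and the sample sentence are all made up for this example; it is not how scispaCy works internally.

```python
import re

# Hypothetical lookup lists; a real system would load far larger ones.
GAZETTEER = {
    "PERSON": {"Jane Doe"},
    "ORG": {"Example University"},
}

# Simple heuristic: a four-digit year, optionally followed by -MM-DD.
DATE_PATTERN = re.compile(r"\b\d{4}(?:-\d{2}-\d{2})?\b")


def rule_based_ner(text):
    """Return (start, end, label, surface) tuples sorted by position."""
    found = []
    for label, names in GAZETTEER.items():
        for name in names:
            for match in re.finditer(re.escape(name), text):
                found.append((match.start(), match.end(), label, match.group()))
    for match in DATE_PATTERN.finditer(text):
        found.append((match.start(), match.end(), "DATE", match.group()))
    return sorted(found)


# Made-up sample sentence.
sample = "Jane Doe joined Example University on 2021-03-15."
entities = rule_based_ner(sample)
for start, end, label, surface in entities:
    print(f"{label:<7}{start:>3}-{end:<3} {surface}")
```

Every word that no rule matches implicitly falls into the "none of the above" class.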

Biomedical named entity recognition (Bio-NER) is a fundamental task in processing biomedical text; it targets entities such as RNA, proteins, cell types, cell lines, DNA, drugs, and diseases. It is one of the most basic and central tasks in biomedical knowledge discovery from text. Finding a gene name in a text is analogous to finding a company name or a person's name in a newspaper. Recognizing biomedical named entities is harder than recognizing general-domain named entities. Many studies have recognized named entities using supervised learning algorithms built on many rules.

Supervised approaches have used hidden Markov models (HMMs), decision trees, support vector machines (SVMs), and conditional random fields (CRFs). These methods are normally trained on data described by many features derived from linguistic rules, and are evaluated on test data that does not appear in the training data. More recently, work has shifted toward deep learning techniques for Bio-NER. scispaCy is a Python package containing spaCy models for processing biomedical, scientific, or clinical text.
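
To make the idea of rule-derived features more concrete, here is a minimal sketch of the kind of per-token feature dictionary that is often fed to a CRF or SVM tagger. The function name `token_features` and the particular features are assumptions chosen for illustration, not taken from any specific Bio-NER system.

```python
def token_features(tokens, i):
    """Build a feature dict for tokens[i], using a one-token context window."""
    word = tokens[i]
    features = {
        "lower": word.lower(),
        "suffix3": word[-3:].lower(),
        "is_title": word.istitle(),
        "is_upper": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "has_hyphen": "-" in word,
    }
    # Neighbouring words, with sentinel values at the sentence boundaries.
    features["prev"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
    features["next"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return features


tokens = "They accumulate in tumor-bearing mice".split()
feats = token_features(tokens, 3)
print(feats)
```

A supervised tagger would be trained on one such dictionary per token, paired with the gold label for that token.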

## What is scispaCy?
scispaCy is an open-source Python package built on spaCy, a library for advanced natural language processing that is written in Python and Cython and published under the MIT license. scispaCy adds statistical neural network models for processing biomedical, scientific, or clinical text.

Like spaCy, scispaCy is a convenient way to prepare text for deep learning. It interoperates with TensorFlow, PyTorch, scikit-learn, Gensim, and the rest of Python's AI ecosystem, and it makes it straightforward to build linguistically sophisticated statistical models for a variety of NLP problems.

Original Source - 
https://maheshdivakaran.medium.com/scispacy-for-bio-medical-named-entity-recognition-ner-63ed548f1df0 



## Features
scispaCy builds on spaCy and inherits its general feature set:

Non-destructive tokenization

Named entity recognition

Support for 52+ languages

20 statistical models for 9 languages

Pre-trained word vectors

State-of-the-art speed

Easy deep learning integration

Part-of-speech tagging

Labeled dependency parsing

Syntax-driven sentence segmentation

Built-in visualizers for syntax and NER

Convenient string-to-hash mapping

Export to numpy data arrays

Efficient binary serialization

Easy model packaging and deployment

Robust, rigorously evaluated accuracy




In [2]:
! pip install -U spacy
! pip install scispacy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting spacy
  Using cached spacy-3.4.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.3 MB)
Collecting thinc<8.2.0,>=8.1.0
  Using cached thinc-8.1.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (821 kB)
Installing collected packages: thinc, spacy
  Attempting uninstall: thinc
    Found existing installation: thinc 8.0.17
    Uninstalling thinc-8.0.17:
      Successfully uninstalled thinc-8.0.17
  Attempting uninstall: spacy
    Found existing installation: spacy 3.2.4
    Uninstalling spacy-3.2.4:
      Successfully uninstalled spacy-3.2.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
scispacy 0.2.5 requires spacy<3.0.0,>=2.3.0, but you have spacy 3.4.1 which is incompatible.
en-core-sci-sm 0.5.0 requires spacy<3.3.0,>=3.2.3
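
The conflict reported in the log comes down to a version-range check. The sketch below reproduces that check by hand for the requirement `spacy<3.3.0,>=3.2.3` quoted above; `parse_version` and `in_range` are hypothetical helpers that assume purely numeric dotted versions, which is a simplification of what pip actually does.

```python
def parse_version(text):
    """Turn '3.4.1' into (3, 4, 1); assumes purely numeric dotted versions."""
    return tuple(int(part) for part in text.split("."))


def in_range(version, minimum, below):
    """True if minimum <= version < below."""
    return parse_version(minimum) <= parse_version(version) < parse_version(below)


# Requirement from the log: en-core-sci-sm 0.5.0 requires spacy<3.3.0,>=3.2.3
for candidate in ("3.4.1", "3.2.4"):
    ok = in_range(candidate, minimum="3.2.3", below="3.3.0")
    print(f"spacy {candidate}: {'compatible' if ok else 'incompatible'}")
```

This is why the freshly upgraded spaCy 3.4.1 is flagged as incompatible with the 0.5.0 model. In practice the installed version can be read with `importlib.metadata.version("spacy")` (Python 3.8+), and the `packaging` library offers a more complete specifier check than this hand-rolled comparison.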

Next, download a pre-trained scispaCy model. The available models, along with the entity classes each one recognizes, are listed on the project page:

https://allenai.github.io/scispacy/

In [3]:
! pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_sm-0.5.0.tar.gz

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_sm-0.5.0.tar.gz
  Using cached https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_sm-0.5.0.tar.gz (15.9 MB)
Collecting spacy<3.3.0,>=3.2.3
  Using cached spacy-3.2.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.0 MB)
Collecting thinc<8.1.0,>=8.0.12
  Using cached thinc-8.0.17-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (660 kB)
Collecting catalogue<2.1.0,>=2.0.6
  Using cached catalogue-2.0.8-py3-none-any.whl (17 kB)
Collecting srsly<3.0.0,>=2.4.1
  Using cached srsly-2.4.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (458 kB)
Installing collected packages: catalogue, srsly, thinc, spacy
  Attempting uninstall: catalogue
    Found existing installation: catalogue 1.0.0
    Uninstalling catalogue-1.0.0:
      Successfully uninstall

In [4]:
import scispacy
import spacy
import en_core_sci_sm   # The model we are going to use
from spacy import displacy
from scispacy.abbreviation import AbbreviationDetector
from scispacy.umls_linking import UmlsEntityLinker   # Imported here but not used in the cells below


## Define the model

In [5]:
nlp = spacy.load("en_core_sci_sm")   # Load the pre-trained model

Input text

In [6]:
text = '''Myeloid derived suppressor cells (MDSC) are immature 
          myeloid cells with immunosuppressive activity. 
          They accumulate in tumor-bearing mice and humans 
          with different types of cancer, including hepatocellular 
          carcinoma (HCC).'''

## Finding the sentences in the text

In [7]:
doc = nlp(text)

print(list(doc.sents))

[Myeloid derived suppressor cells (MDSC) are immature 
          myeloid cells with immunosuppressive activity., 
          They accumulate in tumor-bearing mice and humans 
          with different types of cancer, including hepatocellular 
          carcinoma (HCC).]
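
The sentences printed above still carry the line breaks and indentation of the triple-quoted string. One optional way to avoid that, sketched here with a hypothetical helper called `normalize_whitespace`, is to collapse all whitespace before handing the text to the pipeline.

```python
def normalize_whitespace(raw):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return " ".join(raw.split())


text = '''Myeloid derived suppressor cells (MDSC) are immature
          myeloid cells with immunosuppressive activity.
          They accumulate in tumor-bearing mice and humans
          with different types of cancer, including hepatocellular
          carcinoma (HCC).'''

clean_text = normalize_whitespace(text)
print(clean_text)
```

Passing `clean_text` to `nlp(...)` instead of `text` should keep the embedded newlines out of sentence and entity spans, at the cost of character offsets that no longer match the original string.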


## Finding the biomedical entities in the text

In [8]:
print(doc.ents)

(Myeloid, suppressor cells, MDSC, immature, myeloid cells, immunosuppressive activity, accumulate, tumor-bearing mice, humans, cancer, hepatocellular 
          carcinoma, HCC)
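
Printing `doc.ents` shows only the surface strings. A small helper such as the hypothetical `entity_rows` below can also list each entity's character offsets and label. It relies only on the `text`, `start_char`, `end_char`, and `label_` attributes that spaCy spans expose, so the sketch uses stand-in objects with invented values to stay runnable without a model.

```python
from types import SimpleNamespace


def entity_rows(doc):
    """Return (text, start_char, end_char, label) for each entity in doc.ents."""
    return [
        # Collapse internal whitespace so multi-line entities print on one line.
        (" ".join(ent.text.split()), ent.start_char, ent.end_char, ent.label_)
        for ent in doc.ents
    ]


# Stand-in for a spaCy Doc; offsets and the label are illustrative placeholders.
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="MDSC", start_char=34, end_char=38, label_="ENTITY"),
    SimpleNamespace(text="hepatocellular \n          carcinoma",
                    start_char=227, end_char=262, label_="ENTITY"),
])

rows = entity_rows(fake_doc)
for surface, start, end, label in rows:
    print(f"{surface:<28}{start:>4}-{end:<4} {label}")
```

With the notebook's pipeline, the same helper would be called as `entity_rows(nlp(text))`.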


## Visualizing dependency parses

In [9]:
from spacy import displacy
displacy.render(next(doc.sents), style='dep', jupyter=True)

## Add AbbreviationDetector
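The notebook imports `AbbreviationDetector` but does not show it in use. The sketch below is one way it might be wired in, assuming scispaCy 0.4 or later on spaCy 3, where the component is added by its registered name `"abbreviation_detector"` (older 0.2.x releases instead used `nlp.add_pipe(AbbreviationDetector(nlp))`). The helper names `add_abbreviation_detector` and `abbreviation_pairs` are hypothetical, and the stand-in objects at the end are invented so the second helper can be tried without a model.

```python
from types import SimpleNamespace


def add_abbreviation_detector(nlp):
    """Append scispaCy's abbreviation detector to a spaCy 3 pipeline."""
    # Importing the module registers the "abbreviation_detector" factory.
    from scispacy.abbreviation import AbbreviationDetector  # noqa: F401
    if "abbreviation_detector" not in nlp.pipe_names:
        nlp.add_pipe("abbreviation_detector")
    return nlp


def abbreviation_pairs(doc):
    """Return unique (short form, long form) pairs in order of first appearance."""
    pairs = []
    for abrv in doc._.abbreviations:
        pair = (abrv.text, abrv._.long_form.text)
        if pair not in pairs:
            pairs.append(pair)
    return pairs


def _fake_abbreviation(short, long):
    """Stand-in mimicking the span attributes used by abbreviation_pairs."""
    return SimpleNamespace(text=short, _=SimpleNamespace(long_form=SimpleNamespace(text=long)))


# Invented stand-in data, not real detector output.
fake_doc = SimpleNamespace(_=SimpleNamespace(abbreviations=[
    _fake_abbreviation("MDSC", "Myeloid derived suppressor cells"),
    _fake_abbreviation("MDSC", "Myeloid derived suppressor cells"),
]))
pairs = abbreviation_pairs(fake_doc)
print(pairs)

# Usage with the notebook's pipeline (requires scispaCy and the model):
# nlp = add_abbreviation_detector(nlp)
# print(abbreviation_pairs(nlp(text)))
```

The detector may report the same abbreviation once per occurrence, which is why the helper removes duplicates.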

## Visualizing named entities
To visualize the entities in a standalone web page, you can call displacy.serve():
    displacy.serve(doc, style="ent")
On Colab this requires a tunnel such as pyngrok, so it is easier to render the result inline:

In [15]:
displacy.render(doc, style="ent", jupyter=True)