# spaCy

The spaCy ecosystem provides several open-source NER models:
* spaCy ships general-purpose models that cover entities such as person names and locations.
* scispaCy contains models for processing biomedical, scientific or clinical text. [More information here](https://allenai.github.io/scispacy/)
* medspaCy is a toolkit for clinical NLP built on spaCy. [More information here](https://spacy.io/universe/project/medspacy)

In [None]:
import json
import spacy

In [None]:
with open('../../../data/llm_dataset.json') as f:
    data = json.load(f)

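**spacy.cli** can be used to download spaCy's own models, which can then be loaded with `spacy.load` and used in this pipeline. scispaCy models are distributed separately and need to be installed as described on the scispaCy page linked above.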
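The download-then-load flow can be sketched as a small helper. This is a minimal sketch, not part of the original notebook: the name `load_or_download` and its `loader`/`downloader` parameters are hypothetical, and it assumes the model name is one that `spacy.cli.download` recognises (the core spaCy models). `spacy.load` raises `OSError` when a package is not installed, which is what the helper relies on.

```python
def load_or_download(name, loader=None, downloader=None):
    """Load a pipeline by name, downloading the package first if it is missing.

    `loader` and `downloader` are hypothetical hooks that default to
    spacy.load and spacy.cli.download; they exist so the helper can be
    exercised without network access.
    """
    if loader is None or downloader is None:
        # Import lazily so the helper can be defined without spaCy installed
        import spacy
        import spacy.cli
        loader = loader or spacy.load
        downloader = downloader or spacy.cli.download
    try:
        return loader(name)
    except OSError:
        # The package is not installed yet: fetch it, then retry once
        downloader(name)
        return loader(name)
```

In the notebook this could be called as `nlp = load_or_download("en_core_web_sm")`. In some environments the kernel may need a restart before a freshly installed package becomes importable.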

In [None]:
# import spacy.cli

# spaCy models
# spacy.cli.download("en_core_web_sm")
# spacy.cli.download("en_core_web_md")
# spacy.cli.download("en_core_web_lg")

# scispaCy models
# These are not spaCy's own packages, so they must already be installed
# (see the scispaCy page linked above) before they can be loaded.
nlp = spacy.load("en_core_sci_scibert")
# nlp = spacy.load("en_core_sci_md")

# medspaCy


In [None]:
# Pick one document index from a list of candidates ([5] selects document 567)
num = [0, 15, 30, 165, 345, 567, 735][5]

# Build (text, {'entities': [...]}) annotations, with one
# (start_char, end_char, label) tuple per entity found in each text
def generate_annotation(texts):
    annotations = []
    for text in texts:
        doc = nlp(text)
        entities = []
        for ent in doc.ents:
            entities.append((ent.start_char, ent.end_char, ent.label_))
        annotations.append((text, {'entities': entities}))
    return annotations


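# Generate annotations for the selected document.
# generate_annotation expects a list of texts, so wrap the single string in a list;
# passing the string directly would iterate over its characters.
annotations = generate_annotation([data[num]])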

# Map each NER label to a display colour. zip stops at the shorter sequence,
# so any label beyond the seventh falls back to displaCy's default colour.
s_colours = ['#e6194B', '#3cb44b', '#ffe119', '#ffd8b1', '#f58231', '#f032e6', '#42d4f4']
col_dict = dict(zip(nlp.pipe_labels['ner'], s_colours))

options = {'ents': nlp.pipe_labels['ner'], 'colors': col_dict}

doc = nlp(data[num])

# Render the document with its entities highlighted in colour
spacy.displacy.render(doc, style='ent', jupyter=True, options=options)

# List the text and label of each entity
[(ent.text, ent.label_) for ent in doc.ents]

# This returns a set of (entity text, label) pairs that could be useful for overlaying.
# def spacy_large_ner(document, model):
#   return {(ent.text.strip(), ent.label_) for ent in model(document).ents}

# print(spacy_large_ner(data[1], nlp))
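
The commented-out helper above returns a set of `(text, label)` pairs. If a dictionary keyed by label is more convenient for overlaying, one possible sketch groups the pairs as follows. The name `group_entities_by_label` is hypothetical and the sample pairs are made up for illustration; in the notebook the input would come from `doc.ents`.

```python
from collections import defaultdict


def group_entities_by_label(pairs):
    """Group (entity_text, label) pairs into {label: sorted list of unique texts}."""
    grouped = defaultdict(set)
    for text, label in pairs:
        # Strip surrounding whitespace so near-duplicate spans collapse
        grouped[label].add(text.strip())
    return {label: sorted(texts) for label, texts in grouped.items()}


# Made-up pairs standing in for {(ent.text, ent.label_) for ent in doc.ents}
sample_pairs = {("aspirin", "ENTITY"), ("headache ", "ENTITY"), ("London", "GPE")}
result = group_entities_by_label(sample_pairs)
print(result)
```

Sorting the texts keeps the output stable between runs, since set iteration order is not guaranteed.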
