<a href="https://colab.research.google.com/github/jay05Hawk/Spacy/blob/main/Spacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What's $\color{red}{\text{spaCy}}$?

spaCy is a **free**, **open-source library** for advanced **natural language processing** (NLP) in Python.

If you're working with a lot of text, you'll eventually want to know more about it. For example, what's it about? What do the words mean in context? Who is doing what to whom? What products and companies are mentioned in the text? Which texts are similar to each other?

spaCy is designed specifically for **production use** and helps you build applications that process and "understand" large volumes of text. It can be used to build **information extraction** or **natural language processing** systems, or to pre-process text for **deep learning**.



In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.pos_, token.dep_)

# Text: The original word text.
# Lemma: The base form of the word.
# POS: The simple part-of-speech tag.
# Tag: The detailed part-of-speech tag.
# Dep: Syntactic dependency, i.e. the relation between tokens.
# Shape: The word shape – capitalization, punctuation, digits.
# is alpha: Does the token consist of alphabetic characters?
# is stop: Is the token part of a stop list, i.e. the most common words of the language?

## $\color{red}{\text{Tokenization}}$

During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off – whereas “U.K.” should remain one token.  Each *Doc* consists of individual tokens, and we can iterate over them:

In [None]:


nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)
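
To make the idea of rule-based tokenization more concrete, here is a minimal pure-Python sketch. It is not spaCy's tokenizer: the names `SPECIAL_CASES`, `PREFIXES`, `SUFFIXES` and `simple_tokenize` are hypothetical, and the rule sets are a tiny illustrative subset. It only shows the general idea of splitting on whitespace, keeping special cases such as "U.K." intact, and peeling punctuation off the edges of each chunk.

```python
# Hypothetical special cases: strings that must stay a single token.
SPECIAL_CASES = {"U.K.", "U.S."}

# Symbols to split off the start or end of a chunk (illustrative subset only).
PREFIXES = "$(\"'"
SUFFIXES = ".,!?;:)\"'"


def simple_tokenize(text):
    """Split on whitespace, then peel prefixes and suffixes off each chunk."""
    tokens = []
    for chunk in text.split():
        if chunk in SPECIAL_CASES:
            tokens.append(chunk)
            continue
        leading, trailing = [], []
        # Peel leading symbols such as "$" one at a time.
        while len(chunk) > 1 and chunk[0] in PREFIXES:
            leading.append(chunk[0])
            chunk = chunk[1:]
        # Peel trailing punctuation, stopping once a special case remains.
        while len(chunk) > 1 and chunk[-1] in SUFFIXES and chunk not in SPECIAL_CASES:
            trailing.insert(0, chunk[-1])
            chunk = chunk[:-1]
        tokens.extend(leading + [chunk] + trailing)
    return tokens


tokens = simple_tokenize("Apple is looking at buying U.K. startup for $1 billion.")
print(tokens)
```

In this sketch the sentence-final period is split off while "U.K." stays whole, which mirrors the behaviour described above.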

In [None]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Coronavirus: Delhi resident tests positive for coronavirus, total 31 people infected in India")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)
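
The "shape" column printed above can look cryptic at first. The following sketch approximates how such a word shape might be computed; `word_shape` and its `max_run` parameter are names chosen for this illustration, and spaCy's own implementation may differ in detail. Uppercase letters map to `X`, lowercase letters to `x`, digits to `d`, and long runs of the same symbol are truncated.

```python
def word_shape(text, max_run=4):
    """Approximate a word shape: X = uppercase, x = lowercase, d = digit."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept as-is
        # Count how many times the same shape symbol repeats in a row.
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        # Keep at most max_run copies of a repeated symbol.
        if run <= max_run:
            shape.append(shape_char)
    return "".join(shape)


for word in ["Coronavirus", "Delhi", "31", "U.K.", "$1"]:
    print(word, word_shape(word))
```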

In [None]:
# Using spaCy’s built-in displaCy visualizer, here’s what our example sentence and its dependencies look like:

from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Google, Apple crack down on fake coronavirus apps")
#displacy.serve(doc, style="dep")

displacy.render(doc, style='dep', jupyter=True, options={'distance': 150})

## $\color{red}{\text{Named Entities}}$

A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.

Named entities are available as the *ents* property of a *Doc*:

In [None]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Coronavirus: Delhi resident tests positive for coronavirus, total 31 people infected in India")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
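
The `start_char` and `end_char` values printed above are character offsets into the original text. The sketch below illustrates that relationship without running a model: the entity strings and labels in `hypothetical_entities` are hand-written for illustration and are not model predictions.

```python
text = "Coronavirus: Delhi resident tests positive for coronavirus, total 31 people infected in India"

# Hand-written (text, label) pairs for illustration only; not model output.
hypothetical_entities = [("Delhi", "GPE"), ("31", "CARDINAL"), ("India", "GPE")]

spans = []
for ent_text, label in hypothetical_entities:
    start_char = text.index(ent_text)        # offset of the first character
    end_char = start_char + len(ent_text)    # offset just past the last character
    spans.append((ent_text, start_char, end_char, label))

for ent_text, start_char, end_char, label in spans:
    # Slicing the original text with the offsets recovers the entity text.
    assert text[start_char:end_char] == ent_text
    print(ent_text, start_char, end_char, label)
```

Because the end offset is exclusive, `text[start_char:end_char]` is exactly the entity string, which is handy when highlighting entities in the source text.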

$\color{red}{\text{en_core_web_sm}}$


LANGUAGE ------> **en** English

TYPE   ------> **core**  Vocabulary, syntax, entities

GENRE ------> **web**  written text (blogs, news, comments)

SIZE ------> **sm** 12 MB

## Visualizing the $\color{red}{\text{Named Entity recognizer}}$

The entity visualizer, *ent*, highlights named entities and their labels in the text.

In [None]:
from spacy import displacy

text = "Coronavirus: Delhi resident tests positive for coronavirus, total 31 people infected in India"

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.render(doc, style='ent', jupyter=True)
#displacy.serve(doc, style="ent")
# https://spacy.io/api/annotation#named-entities

## Word vectors and similarity

Similarity is determined by comparing word vectors or “word embeddings”, multi-dimensional meaning representations of a word. Word vectors can be generated using an algorithm like word2vec; each vector is an array of floating-point numbers.

__Important note:__ To make them compact and fast, spaCy's small models (all the packages that end in sm) **don't ship with word vectors** and only include context-sensitive tensors. This means you can still use similarity() to compare documents, tokens and spans, but the results won't be as good, and individual tokens won't have any vectors assigned. So, in order to use *real* word vectors, you need to download a larger model:


In [None]:
import spacy.cli
spacy.cli.download("en_core_web_md")
import en_core_web_md
nlp = en_core_web_md.load()

In [None]:
nlp = spacy.load("en_core_web_md")
tokens = nlp("lion bear apple banana fadsfdshds")

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)
# Vector norm: The L2 norm of the token’s vector (the square root of the sum of the values squared)
# has vector: Does the token have a vector representation?
# OOV: Out-of-vocabulary

In [None]:
nlp = spacy.load("en_core_web_md")  # make sure to use larger model!
tokens = nlp("lion bear cow apple mango spinach")

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))
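
Similarity scores like the ones above are, by default, based on the cosine similarity between vectors. The sketch below computes the L2 norm and the cosine similarity by hand. The three-dimensional vectors in `toy_vectors` are invented for illustration (real word vectors have hundreds of dimensions), and the helper names `l2_norm` and `cosine_similarity` are not part of spaCy.

```python
import math


def l2_norm(vector):
    """Square root of the sum of the squared values."""
    return math.sqrt(sum(v * v for v in vector))


def cosine_similarity(a, b):
    """Dot product divided by the product of the two L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a, norm_b = l2_norm(a), l2_norm(b)
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # avoid dividing by zero for all-zero vectors
    return dot / (norm_a * norm_b)


# Invented toy vectors; assumed values, not taken from any model.
toy_vectors = {
    "lion": [0.9, 0.8, 0.1],
    "bear": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

for word1, vec1 in toy_vectors.items():
    for word2, vec2 in toy_vectors.items():
        print(word1, word2, round(cosine_similarity(vec1, vec2), 3))
```

With these toy values, "lion" and "bear" point in nearly the same direction and score close to 1, while "lion" and "apple" score much lower.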

## Pipelines

When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by the default models consists of a tagger, a parser and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.
<img src='/content/18.png'>
<img src="19.png">
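
The pipeline idea can be sketched in a few lines of plain Python. This is only a toy model of the control flow, not spaCy's implementation: the "doc" is a dictionary, and the components `tagger` and `entity_recognizer` use deliberately naive, made-up rules. What matters is that each component receives the doc, annotates it and returns it to be passed on to the next one.

```python
def tokenizer(text):
    """Turn raw text into a toy 'doc' (a plain dict in this sketch)."""
    return {"text": text, "tokens": text.split()}


def tagger(doc):
    # Naive stand-in rule: digits are numbers, everything else is a word.
    doc["tags"] = ["NUM" if token.isdigit() else "WORD" for token in doc["tokens"]]
    return doc


def entity_recognizer(doc):
    # Naive stand-in rule: capitalized tokens are treated as entities.
    doc["ents"] = [token for token in doc["tokens"] if token[:1].isupper()]
    return doc


# Ordered list of (name, component) pairs, applied after tokenization.
pipeline = [("tagger", tagger), ("ner", entity_recognizer)]


def run(text):
    doc = tokenizer(text)
    for name, component in pipeline:
        doc = component(doc)  # each component returns the processed doc
    return doc


doc = run("Apple is buying a startup for 1 billion")
print(doc["tags"])
print(doc["ents"])
```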



## Vocab, hashes and lexemes

Whenever possible, spaCy tries to store data in a vocabulary, the Vocab, that will be shared by multiple documents. To save memory, spaCy also encodes all strings to hash values – in this case for example, “coffee” has the hash 3197928453018144401. Entity labels like “ORG” and part-of-speech tags like “VERB” are also encoded. Internally, spaCy only “speaks” in hash values.

<img src="/content/20.png">

In [None]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("I love coffee")
print(doc.vocab.strings["coffee"])  # 3197928453018144401

print(doc.vocab.strings[3197928453018144401])  # 'coffee'

In [None]:
doc = nlp("I love tea, over coffee")
print(doc.vocab.strings["tea"]) # 6041671307218480733


print(doc.vocab.strings[6041671307218480733])

In [None]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("I love tea, over coffee!")
for word in doc:
    lexeme = doc.vocab[word.text]
    # print(lexeme)
    print(lexeme.text, lexeme.orth, lexeme.shape_, lexeme.prefix_, lexeme.suffix_,
            lexeme.is_alpha, lexeme.is_digit, lexeme.is_title, lexeme.lang_)

In [None]:
from spacy.tokens import Doc
from spacy.vocab import Vocab

nlp = spacy.load("en_core_web_sm")
doc = nlp("I love tea, over coffee")  # Original Doc
print(doc.vocab.strings["tea"])  # 6041671307218480733
print(doc.vocab.strings[6041671307218480733])  # 'tea' 

empty_doc = Doc(Vocab())  # New Doc with empty Vocab
# empty_doc.vocab.strings[6041671307218480733] will raise an error :(

empty_doc.vocab.strings.add("tea")  # Add "tea" and generate hash
print(empty_doc.vocab.strings[6041671307218480733])  # 'tea' 

new_doc = Doc(doc.vocab)  # Create new doc with first doc's vocab
print(new_doc.vocab.strings[6041671307218480733])  # 'tea' 👍
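
The behaviour above (a hash can always be computed from a string, but a string can only be recovered from a hash if the store has seen it) can be mimicked with a small dictionary-based class. This is a sketch under stated assumptions: `ToyStringStore` and `hash_string` are hypothetical names, and `hashlib.blake2b` is used as a stand-in 64-bit hash, so the numbers it produces will not match spaCy's hash values.

```python
import hashlib


def hash_string(string):
    """Stand-in 64-bit hash; spaCy uses its own hash function."""
    digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class ToyStringStore:
    """Two-way lookup: string -> hash always works, hash -> string needs add()."""

    def __init__(self):
        self._by_hash = {}

    def add(self, string):
        key = hash_string(string)
        self._by_hash[key] = string
        return key

    def __getitem__(self, item):
        if isinstance(item, str):
            return hash_string(item)   # hashing needs no stored data
        return self._by_hash[item]     # raises KeyError for unknown hashes


store = ToyStringStore()
tea_hash = store.add("tea")
print(tea_hash, store[tea_hash])

empty_store = ToyStringStore()
try:
    empty_store[tea_hash]
except KeyError:
    print("hash not found in the empty store")
```

This is why sharing one vocabulary between documents matters: the hash alone carries no text, so the mapping back to the string has to live somewhere.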

## $\color{red}{\text{Knowledge base}}$

To support the entity linking task, spaCy stores external knowledge in a KnowledgeBase. The knowledge base (KB) uses the Vocab to store its data efficiently.

A knowledge base is created by first adding all entities to it. Next, for each potential mention or alias, a list of relevant KB IDs and their prior probabilities is added. The sum of these prior probabilities should never exceed 1 for any given alias.

In [None]:
from spacy.kb import KnowledgeBase

# load the model and create an empty KB
# (note: from spaCy v3.5 on, KnowledgeBase is abstract; use spacy.kb.InMemoryLookupKB there)
nlp = spacy.load('en_core_web_sm')
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)

# adding entities
kb.add_entity(entity="Q1004791", freq=6, entity_vector=[0, 3, 5])
kb.add_entity(entity="Q42", freq=342, entity_vector=[1, 9, -3])
kb.add_entity(entity="Q5301561", freq=12, entity_vector=[-2, 4, 2])

# adding aliases
kb.add_alias(alias="Douglas", entities=["Q1004791", "Q42", "Q5301561"], probabilities=[0.6, 0.1, 0.2])
kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[0.9])

print()
print("Number of entities in KB:",kb.get_size_entities()) # 3
print("Number of aliases in KB:", kb.get_size_aliases()) # 2

In [None]:
# look up the candidate entities for the alias "Douglas" added in the previous cell
candidates = kb.get_alias_candidates("Douglas")
for c in candidates:
    print(" ", c.entity_, c.prior_prob, c.entity_vector)
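
The rule that the prior probabilities of an alias should never sum to more than 1 can be illustrated with a small stand-alone table. `ToyAliasTable` is a hypothetical class written for this sketch, not spaCy's knowledge base; it only stores aliases, checks the constraint, and returns candidates ordered by prior probability.

```python
class ToyAliasTable:
    """Toy alias -> [(entity_id, prior_probability)] mapping."""

    def __init__(self):
        self._aliases = {}

    def add_alias(self, alias, entities, probabilities):
        if len(entities) != len(probabilities):
            raise ValueError("entities and probabilities must have the same length")
        # Small tolerance for floating-point rounding in the sum.
        if sum(probabilities) > 1.0 + 1e-9:
            raise ValueError("prior probabilities for an alias must not sum to more than 1")
        self._aliases[alias] = list(zip(entities, probabilities))

    def candidates(self, alias):
        """Return candidates for an alias, highest prior first."""
        pairs = self._aliases.get(alias, [])
        return sorted(pairs, key=lambda pair: pair[1], reverse=True)


table = ToyAliasTable()
table.add_alias("Douglas", ["Q1004791", "Q42", "Q5301561"], [0.6, 0.1, 0.2])
table.add_alias("Douglas Adams", ["Q42"], [0.9])

best_entity, best_prior = table.candidates("Douglas")[0]
print(best_entity, best_prior)
```

The priors may sum to less than 1, as they do for "Douglas" here, which leaves room for the mention referring to an entity that is not in the table at all.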