In [1]:
!pip install -U spacy




In [2]:
!python -m spacy download en_core_web_lg


Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)

In [3]:
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)

In [4]:
!pip install wordcloud




In [5]:
import spacy
nlp = spacy.load('en_core_web_sm')

# 1.4 Linguistic annotations
spaCy provides a variety of linguistic annotations to give you insights into a text’s grammatical structure.
This includes the word types, like the parts of speech, and how the words are related to each other. For example, if you’re analyzing text, it makes a huge difference whether a noun is the subject of a sentence, or the object – or whether “google” is used as a verb, or refers to the website or company in a specific context.

Once you have downloaded and installed a model, you can load it via spacy.load(). This will return a Language object containing all components and data needed to process text. We usually call it nlp. Calling the nlp object on a string of text will return a processed Doc:

In [6]:
import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp("My friend param is starting its own Company which is GenerationX")
for token in doc:
    print(token.text, token.pos_, token.dep_)


My PRON poss
friend NOUN compound
param NOUN nsubj
is AUX aux
starting VERB ROOT
its PRON poss
own ADJ amod
Company PROPN dobj
which PRON nsubj
is AUX relcl
GenerationX PROPN attr


# spaCy’s Processing Pipeline
The first step when working with spaCy is to pass a text string to an nlp object. This object is essentially a pipeline of several text-processing components that the input passes through in order. nlp.pipe_names lists the components that are currently active.

In [7]:
import spacy
nlp = spacy.load('en_core_web_sm')

# Process a text with the nlp object to create a Doc
doc = nlp("Manvendra is playing chess and eating pizza")

In [8]:
nlp.pipe_names


['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [9]:
# Legacy call; spaCy v3 prefers nlp.select_pipes(disable=["tagger", "parser"])
nlp.disable_pipes('tagger', 'parser')


['tagger', 'parser']

In [10]:
nlp.pipe_names


['tok2vec', 'attribute_ruler', 'lemmatizer', 'ner']
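
To make the idea of a pipeline concrete, here is a minimal pure-Python sketch. It is not spaCy's implementation: each component is simply a function that receives a document and returns it with extra annotations, and the pipeline applies the enabled components in order. The names `ToyPipeline`, `tokenize` and `lowercase` are hypothetical and exist only for this illustration.

```python
from typing import Callable, Dict, List, Tuple

# A "document" here is just a dict of annotations (an assumption for illustration).
ToyDoc = Dict[str, object]
Component = Callable[[ToyDoc], ToyDoc]


def tokenize(doc: ToyDoc) -> ToyDoc:
    """Split the raw text on whitespace and store the tokens."""
    doc["tokens"] = str(doc["text"]).split()
    return doc


def lowercase(doc: ToyDoc) -> ToyDoc:
    """Add a lowercased copy of every token."""
    doc["lower"] = [t.lower() for t in doc["tokens"]]
    return doc


class ToyPipeline:
    """Runs named components in order, skipping any that are disabled."""

    def __init__(self, components: List[Tuple[str, Component]]) -> None:
        self.components = components
        self.disabled: set = set()

    @property
    def pipe_names(self) -> List[str]:
        return [name for name, _ in self.components if name not in self.disabled]

    def disable(self, *names: str) -> None:
        self.disabled.update(names)

    def __call__(self, text: str) -> ToyDoc:
        doc: ToyDoc = {"text": text}
        for name, component in self.components:
            if name not in self.disabled:
                doc = component(doc)
        return doc


pipeline = ToyPipeline([("tokenize", tokenize), ("lowercase", lowercase)])
full = pipeline("Manvendra is playing chess")
pipeline.disable("lowercase")
partial = pipeline("Manvendra is playing chess")
print(pipeline.pipe_names)                  # ['tokenize']
print("lower" in full, "lower" in partial)  # True False
```

As with the real pipeline, a disabled component leaves no annotations behind, so anything that relies on its output has nothing to read.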

During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language.



In [11]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Akash has been buyed by byju's in 73,000 Core's")
for token in doc:
    print(token.text)

Akash
has
been
buyed
by
byju
's
in
73,000
Core
's
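
The "rules specific to each language" can be pictured with a toy rule-based tokenizer. The sketch below is an assumption-laden simplification built on one regular expression, not spaCy's tokenizer: it splits off the clitic 's, keeps numbers with thousands separators together, and emits other punctuation as separate tokens. `TOKEN_PATTERN` and `toy_tokenize` are hypothetical names.

```python
import re

# Toy rules (an assumption, far simpler than spaCy's tokenizer):
#   1. split off the clitic 's
#   2. keep numbers with thousands separators together
#   3. keep runs of word characters together
#   4. emit any other non-space character on its own
TOKEN_PATTERN = re.compile(r"'s|\d+(?:,\d+)*|\w+|[^\w\s]")


def toy_tokenize(text: str) -> list:
    """Return the tokens matched by TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)


tokens = toy_tokenize("Akash has been buyed by byju's in 73,000 Core's")
print(tokens)
```

On this sentence the toy rules produce the same eleven tokens as the spaCy output above. On other text they will diverge from spaCy quickly, because spaCy also handles exceptions, prefixes, suffixes and infixes.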


# 2.2 Part-Of-Speech (POS) Tagging
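Part of speech (POS) is a grammatical role that explains how a particular word is used in a sentence. spaCy exposes a coarse-grained tag as token.pos_ (from the Universal POS tag set, for example PROPN, AUX, DET) and a fine-grained tag as token.tag_ (for example NNP, VBP, DT). Traditional English grammar distinguishes eight parts of speech: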

1. Noun
2. Pronoun
3. Adjective
4. Verb
5. Adverb
6. Preposition
7. Conjunction
8. Interjection

In [12]:
import spacy 
nlp = spacy.load('en_core_web_sm')

# Process the text to create a Doc
doc = nlp("I am Ritesh,currently a Computer Science and NLP Researcher")
 
# Iterate over the tokens
for token in doc:
    # Print the token and its part-of-speech tag
    print(token, token.tag_, token.pos_, spacy.explain(token.tag_))

I PRP PRON pronoun, personal
am VBP AUX verb, non-3rd person singular present
Ritesh NNP PROPN noun, proper singular
, , PUNCT punctuation mark, comma
currently RB ADV adverb
a DT DET determiner
Computer NNP PROPN noun, proper singular
Science NNP PROPN noun, proper singular
and CC CCONJ conjunction, coordinating
NLP NNP PROPN noun, proper singular
Researcher NNP PROPN noun, proper singular
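
Once tokens carry POS tags, simple filtering and counting become possible. The sketch below hard-codes the (token, coarse POS) pairs from the output above instead of calling spaCy, so that it runs on its own. The variable names are illustrative.

```python
from collections import Counter

# (token, coarse POS) pairs copied from the output above.
tagged = [
    ("I", "PRON"), ("am", "AUX"), ("Ritesh", "PROPN"), (",", "PUNCT"),
    ("currently", "ADV"), ("a", "DET"), ("Computer", "PROPN"),
    ("Science", "PROPN"), ("and", "CCONJ"), ("NLP", "PROPN"),
    ("Researcher", "PROPN"),
]

# How often does each coarse tag occur?
pos_counts = Counter(pos for _, pos in tagged)

# Keep only the tokens tagged as proper nouns.
proper_nouns = [text for text, pos in tagged if pos == "PROPN"]

print(pos_counts.most_common(1))  # [('PROPN', 5)]
print(proper_nouns)
```

With a real Doc the same pattern would read `token.text` and `token.pos_` from each token instead of a hand-written list.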


In [13]:
import spacy
from spacy import displacy

doc = nlp("I am Ritesh,currently a Computer Science and NLP Researcher")
displacy.render(doc, style="dep", jupyter=True)


# 2.3 Dependency Parsing
Dependency parsing is the process of extracting the dependency parse of a sentence to represent its grammatical structure. It defines the dependency relationship between headwords and their dependents. The head of a sentence has no dependency and is called the root of the sentence. The verb is usually the head of the sentence. All other words are linked to the headword.

The dependencies can be mapped in a directed graph representation:

- Words are the nodes.
- The grammatical relationships are the edges.

Dependency parsing helps you know what role a word plays in the text and how different words relate to each other. It’s also used in shallow parsing and named entity recognition.

Here’s how you can use dependency parsing to see the relationships between words:

In [14]:
import spacy 
nlp = spacy.load('en_core_web_sm')

# Process the text to create a Doc
doc = nlp("I am Ritesh,currently a Computer Science and NLP Researcher")

# Iterate over the tokens
for token in doc:
    # Print the token and its dependency label
    print(token.text, "-->", token.dep_)


I --> nsubj
am --> ROOT
Ritesh --> attr
, --> punct
currently --> advmod
a --> det
Computer --> nmod
Science --> appos
and --> cc
NLP --> compound
Researcher --> conj


The dependency tag ROOT denotes the main verb or action in the sentence. The other words are directly or indirectly connected to the ROOT word of the sentence. You can find out what other tags stand for by executing the code below:



In [15]:
spacy.explain("nsubj"), spacy.explain("ROOT"), spacy.explain("aux"),spacy.explain('nmod'), spacy.explain("advcl"), spacy.explain("dobj")


('nominal subject',
 'root',
 'auxiliary',
 'modifier of nominal',
 'adverbial clause modifier',
 'direct object')
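
The directed-graph view of a parse (words as nodes, grammatical relationships as edges) can be sketched without spaCy. The triples below are hand-annotated for the short sentence "She ate the pizza" and are an assumption for illustration, not parser output. Keying the graph by token text also assumes every word occurs only once.

```python
from collections import defaultdict

# Hand-annotated (token, dependency label, head) triples for "She ate the pizza".
# Written by hand for illustration, not produced by a parser.
triples = [
    ("She", "nsubj", "ate"),
    ("ate", "ROOT", "ate"),    # in spaCy the root token is its own head
    ("the", "det", "pizza"),
    ("pizza", "dobj", "ate"),
]

# Directed edges: head -> list of (dependent, label).
children = defaultdict(list)
head_of = {}
root = None
for token, label, head in triples:
    if label == "ROOT":
        root = token
    else:
        children[head].append((token, label))
        head_of[token] = head


def path_to_root(token: str) -> list:
    """Follow head links from a token up to the root."""
    path = [token]
    while path[-1] in head_of:
        path.append(head_of[path[-1]])
    return path


print(root)                 # ate
print(children["ate"])      # [('She', 'nsubj'), ('pizza', 'dobj')]
print(path_to_root("the"))  # ['the', 'pizza', 'ate']
```

With a real Doc, `token.head` and `token.children` provide the same links directly.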

# 2.4 Lemmatization
Lemmatization is the process of reducing inflected forms of a word while still ensuring that the reduced form belongs to the language. This reduced form or root word is called a lemma.

For example, organizes, organized and organizing are all forms of organize. Here, organize is the lemma. The inflection of a word allows you to express different grammatical categories like tense (organized vs organize), number (trains vs train), and so on. Lemmatization is necessary because it helps you reduce the inflected forms of a word so that they can be analyzed as a single item. It can also help you normalize the text.

spaCy exposes the attribute lemma_ on the Token class. This attribute holds the lemmatized form of a token:

In [16]:
import spacy 
nlp = spacy.load('en_core_web_sm')

# Process the text to create a Doc
doc = nlp("Reliance is looking at buying U.K. based analytics startup for $7 billion")

# Iterate over the tokens
for token in doc:
    # Print the token and its lemma
    print(token.text, "-->", token.lemma_)

Reliance --> reliance
is --> be
looking --> look
at --> at
buying --> buy
U.K. --> U.K.
based --> base
analytics --> analytic
startup --> startup
for --> for
$ --> $
7 --> 7
billion --> billion
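
The point of lemmatization, analyzing several inflected forms as a single item, can be shown with a tiny lookup table. This is a hypothetical sketch: `LEMMA_TABLE` and `toy_lemma` are made up for the example, and a real lemmatizer relies on much larger tables plus rules that depend on the part of speech.

```python
from collections import defaultdict

# Tiny hand-written lookup table (hypothetical; a real lemmatizer uses
# far larger tables plus rules that depend on the part of speech).
LEMMA_TABLE = {
    "organizes": "organize",
    "organized": "organize",
    "organizing": "organize",
    "trains": "train",
}


def toy_lemma(word: str) -> str:
    """Look the lowercased word up; fall back to the lowercased word itself."""
    return LEMMA_TABLE.get(word.lower(), word.lower())


words = ["organizes", "Organized", "organizing", "trains", "train"]

# Group the surface forms under their shared lemma.
groups = defaultdict(list)
for word in words:
    groups[toy_lemma(word)].append(word)

print(dict(groups))
```

Five surface forms collapse into two lemmas, which is what makes counting or searching by lemma more robust than matching raw strings.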


# 2.5 Sentence Boundary Detection (SBD)
Sentence Boundary Detection is the process of locating the start and end of sentences in a given text. This allows you to divide a text into linguistically meaningful units. You’ll use these units when you’re processing your text to perform tasks such as part-of-speech tagging and entity extraction.

In spaCy, the sents property is used to extract sentences. Here’s how you would extract the total number of sentences and the sentences for a given input text:

In [17]:
import spacy 
nlp = spacy.load('en_core_web_sm')

# Process the text to create a Doc
doc = nlp("Reliance is looking at buying U.K. based analytics startup for $7 billion.This is India.India is great")
 
sentences = list(doc.sents)
len(sentences)


3

In [18]:
sentences

[Reliance is looking at buying U.K. based analytics startup for $7 billion.,
 This is India.,
 India is great]

In [19]:
for sentence in sentences:
    print(sentence)


Reliance is looking at buying U.K. based analytics startup for $7 billion.
This is India.
India is great
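
To see why sentence boundary detection needs more than punctuation matching, compare the result above with a naive splitter. The sketch below assumes the deliberately simplistic rule that a sentence ends at every period, question mark or exclamation mark, and applies it to the same text.

```python
import re

text = ("Reliance is looking at buying U.K. based analytics startup "
        "for $7 billion.This is India.India is great")

# Naive rule (an assumption for illustration): a sentence is a run of
# characters up to and including the next '.', '!' or '?'.
naive = [s.strip() for s in re.findall(r"[^.!?]+[.!?]?", text) if s.strip()]

print(len(naive))  # 5
for sentence in naive:
    print(repr(sentence))
```

The naive rule cuts the abbreviation "U.K." in two and returns five fragments, whereas spaCy found the three sentences shown above.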


# 2.6 Named Entity Recognition (NER)


A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.

Named entities are available as the ents property of a Doc:

In [20]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Reliance is looking at buying U.K. based analytics startup for $7 billion")
# Show the entities found in the text
print(doc.ents)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

(Reliance, U.K., $7 billion)
Reliance 0 8 ORG
U.K. 30 34 GPE
$7 billion 63 73 MONEY
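
The start_char and end_char values are character offsets into the original string, with the end offset exclusive. The sketch below copies the offsets and labels from the output above and uses plain string slicing to recover each entity and group it by label. No spaCy call is involved.

```python
from collections import defaultdict

text = "Reliance is looking at buying U.K. based analytics startup for $7 billion"

# (start_char, end_char, label) triples copied from the output above.
spans = [(0, 8, "ORG"), (30, 34, "GPE"), (63, 73, "MONEY")]

by_label = defaultdict(list)
for start, end, label in spans:
    # end_char is exclusive, so a plain slice recovers the entity text.
    by_label[label].append(text[start:end])

print(dict(by_label))
# {'ORG': ['Reliance'], 'GPE': ['U.K.'], 'MONEY': ['$7 billion']}
```

Because the offsets refer to the raw text, they can be used to highlight or replace entities in the original string without re-tokenizing it.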


# 2.7 Entity Detection
Entity detection, also called entity recognition, identifies important elements such as places, people, organizations, and languages within an input text. This helps you extract information quickly, since you can pick out important topics or identify key sections of text.

Let’s try out entity detection on a few paragraphs about the Amazon rainforest. We’ll use .label_ to get the label string for each detected entity (.label is its integer ID), and then look at the entities in a more visual format using spaCy’s displaCy visualizer.



In [3]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
doc= nlp(u"""The Amazon rainforest,[a] alternatively, the Amazon Jungle, also known in English as Amazonia, is a moist broadleaf tropical rainforest in the Amazon biome that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 km2 (2,700,000 sq mi), of which 5,500,000 km2 (2,100,000 sq mi) are covered by the rainforest. This region includes territory belonging to nine nations.

The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, Colombia with 10%, and with minor amounts in Bolivia, Ecuador, French Guiana, Guyana, Suriname, and Venezuela. Four nations have "Amazonas" as the name of one of their first-level administrative regions and France uses the name "Guiana Amazonian Park" for its rainforest protected area. The Amazon represents over half of the planet's remaining rainforests,[2] and comprises the largest and most biodiverse tract of tropical rainforest in the world, with an estimated 390 billion individual trees divided into 16,000 species.[3]

Etymology
The name Amazon is said to arise from a war Francisco de Orellana fought with the Tapuyas and other tribes. The women of the tribe fought alongside the men, as was their custom.[4] Orellana derived the name Amazonas from the Amazons of Greek mythology, described by Herodotus and Diodorus.[4]

History
See also: History of South America § Amazon, and Amazon River § History
Tribal societies are well capable of escalation to all-out wars between tribes. Thus, in the Amazonas, there was perpetual animosity between the neighboring tribes of the Jivaro. Several tribes of the Jivaroan group, including the Shuar, practised headhunting for trophies and headshrinking.[5] The accounts of missionaries to the area in the borderlands between Brazil and Venezuela have recounted constant infighting in the Yanomami tribes. More than a third of the Yanomamo males, on average, died from warfare.[6]""")

entities=[(i, i.label_, i.label) for i in doc.ents]
print(entities)
unique_labels = set([(ent.label_, ent.label) for ent in doc.ents])
print("Unique Entity Labels:+++++++++++++++++++++++++++++++++++++++++++++++++++")
for label, label_id in unique_labels:
    print(f"{label}: {spacy.explain(label)}")

[(Amazon, 'ORG', 383), (Amazon, 'ORG', 383), (Jungle, 'PRODUCT', 386), (English, 'LANGUAGE', 389), (Amazonia, 'GPE', 384), (Amazon, 'ORG', 383), (Amazon, 'ORG', 383), (South America, 'LOC', 385), (7,000,000, 'CARDINAL', 397), (2,700,000 sq mi, 'QUANTITY', 395), (5,500,000, 'CARDINAL', 397), (2,100,000 sq mi, 'QUANTITY', 395), (nine, 'CARDINAL', 397), (Brazil, 'GPE', 384), (60%, 'PERCENT', 393), (Peru, 'GPE', 384), (13%, 'PERCENT', 393), (Colombia, 'GPE', 384), (10%, 'PERCENT', 393), (Bolivia, 'GPE', 384), (Ecuador, 'GPE', 384), (French, 'NORP', 381), (Guiana, 'PERSON', 380), (Guyana, 'GPE', 384), (Suriname, 'GPE', 384), (Venezuela, 'GPE', 384), (Four, 'CARDINAL', 397), (Amazonas, 'PERSON', 380), (one, 'CARDINAL', 397), (first, 'ORDINAL', 396), (France, 'GPE', 384), (Guiana Amazonian Park, 'WORK_OF_ART', 388), (Amazon, 'ORG', 383), (over half, 'CARDINAL', 397), (rainforests,[2, 'PRODUCT', 386), (an estimated 390 billion, 'MONEY', 394), (16,000, 'CARDINAL', 397), (Amazon, 'ORG', 383), (F


Using this technique, we can identify a variety of entities within the text. The spaCy documentation provides a full list of supported entity types, such as countries, cities and states (GPE), dates (DATE), numbers (CARDINAL), and people (PERSON). The output above also shows that predictions are not always right: the small model labels “Amazon” as ORG and “Guiana” as PERSON.

Using displaCy we can also visualize our input text, with each identified entity highlighted by color and labeled. We’ll use style="ent" to tell displaCy that we want to visualize entities here.

In [22]:
displacy.render(doc, style="ent", jupyter=True)


# 2.8 Similarity
Similarity is determined by comparing word vectors or “word embeddings”, multi-dimensional meaning representations of a word. Word vectors can be generated using an algorithm like word2vec.

spaCy also has built-in support for dense, real-valued vectors that carry distributional similarity information.

Models that come with built-in word vectors make them available as the Token.vector attribute. Doc.vector and Span.vector will default to an average of their token vectors. You can also check if a token has a vector assigned, and get the L2 norm, which can be used to normalize vectors.

In [23]:
import spacy

nlp = spacy.load("en_core_web_lg")
tokens = nlp("Let's Visit Taj Mahel, and then go to Goa")

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

Let True 68.22077 False
's True 85.970726 False
Visit True 44.06666 False
Taj True 54.14669 False
Mahel False 0.0 True
, True 64.72698 False
and True 60.75837 False
then True 43.94667 False
go True 100.762375 False
to True 125.107445 False
Goa True 54.965668 False


The words “Let”, “Visit” and “Goa” are all common in English, so they are part of the model’s vocabulary and come with a vector. “Mahel” (a misspelling of “Mahal”), on the other hand, is out-of-vocabulary, so its vector consists of 300 zeros and its norm is 0.0, as the output shows. If your application benefits from a larger vocabulary with more vectors, consider a pipeline with a larger vector table or a separate vectors package, for example en_vectors_web_lg, which was distributed for older spaCy releases and includes over 1 million unique vectors.
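
The L2 norm and the averaging behind Doc.vector are simple arithmetic. The sketch below uses made-up 3-dimensional vectors (the real ones have 300 dimensions) and includes an all-zero vector to mimic the out-of-vocabulary case. `l2_norm` and `mean_vector` are hypothetical helper names, not spaCy functions.

```python
import math

# Made-up 3-dimensional vectors (real spaCy vectors have 300 dimensions).
token_vectors = {
    "taj": [1.0, 2.0, 2.0],
    "mahel": [0.0, 0.0, 0.0],  # out-of-vocabulary: all zeros
    "goa": [3.0, 0.0, 4.0],
}


def l2_norm(vector: list) -> float:
    """Square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in vector))


def mean_vector(vectors: list) -> list:
    """Component-wise average, analogous to the default Doc.vector."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


norms = {word: l2_norm(vec) for word, vec in token_vectors.items()}
doc_vector = mean_vector(list(token_vectors.values()))

print(norms)  # {'taj': 3.0, 'mahel': 0.0, 'goa': 5.0}
print(doc_vector)
```

Note that the zero vector still takes part in the average and pulls it toward the origin, which is one reason many out-of-vocabulary tokens can weaken a document vector.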

spaCy is able to compare two objects, and make a prediction of how similar they are. Predicting similarity is useful for building recommendation systems or flagging duplicates. For example, you can suggest content to a user that’s similar to what they’re currently looking at, or label a support ticket as a duplicate if it’s very similar to an already existing one.

Each Doc, Span and Token comes with a .similarity() method that lets you compare it with another object, and determine the similarity. Of course similarity is always subjective: whether “Goa” and “Taj” are similar really depends on how you’re looking at it. spaCy’s similarity model usually assumes a fairly general-purpose definition of similarity.

In [24]:
import spacy

nlp = spacy.load("en_core_web_lg")  # make sure to use larger model!
tokens = nlp("Dog Cat Monkey")

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

Dog Dog 1.0
Dog Cat 0.7704364657402039
Dog Monkey 0.5838620066642761
Cat Dog 0.7704364657402039
Cat Cat 1.0
Cat Monkey 0.6340912580490112
Monkey Dog 0.5838620066642761
Monkey Cat 0.6340912580490112
Monkey Monkey 1.0


In this case, the model’s predictions are pretty on point. Dog and Cat score highest (about 0.77), while Monkey is less similar to either of them (about 0.58 and 0.63). Identical tokens are obviously 100% similar to each other (just not always exactly 1.0, because of vector math and floating point imprecisions).
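
By default, .similarity() is the cosine similarity of the two vectors. The sketch below spells that calculation out in plain Python on made-up 2-dimensional vectors. `cosine_similarity` is a hypothetical helper, and returning 0.0 for a zero vector is a choice made for this example.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Dot product divided by the product of the L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no direction to compare (e.g. an all-zero OOV vector)
    return dot / (norm_a * norm_b)


# Made-up 2-dimensional vectors, chosen only to illustrate the arithmetic.
dog = [1.0, 0.0]
cat = [1.0, 1.0]
oov = [0.0, 0.0]

print(cosine_similarity(dog, dog))  # 1.0
print(cosine_similarity(dog, cat))  # about 0.707
print(cosine_similarity(dog, oov))  # 0.0
```

Cosine similarity depends only on the angle between the vectors, not their length, so a frequent word with a large norm is not automatically "more similar" to everything.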



# WordCloud

In [None]:
import spacy
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
nlp = spacy.load("en_core_web_lg")  # make sure to use larger model!
tokens = nlp("Bitcoin is the name of the best-known cryptocurrency, the one for which blockchain technology was invented. A cryptocurrency is a medium of exchange, such as the US dollar, but is digital and uses encryption techniques to control the creation of monetary units and to verify the transfer of funds. A blockchain is a decentralized ledger of all transactions across a peer-to-peer network. Using this technology, participants can confirm transactions without a need for a central clearing authority. Potential applications can include fund transfers, settling trades, voting, and many other issues.")

# Keep only adjectives and nouns, lowercased and joined by single spaces
newText = " ".join(word.text.lower() for word in tokens if word.pos_ in ['ADJ', 'NOUN'])

wordcloud = WordCloud(stopwords=STOPWORDS).generate(newText)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
import spacy
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

nlp = spacy.load("en_core_web_lg")  # make sure to use a larger model!

text = "I love programming, you can test me with my skills any time"
tokens = nlp(text)

new_text = ''
for word in tokens:
    if word.pos_ in ['ADJ', 'NOUN']:
        if new_text:  # Check if new_text is not empty before joining
            new_text += " "
        new_text += word.text.lower()

wordcloud = WordCloud(stopwords=STOPWORDS).generate(new_text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
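
A word cloud is a picture of word frequencies: the more often a word occurs in the filtered text, the larger it is drawn. The sketch below computes those frequencies with the standard library. The (word, POS) pairs are hand-written stand-ins for spaCy tokens and are an assumption for illustration.

```python
from collections import Counter

# Hand-written (word, POS) pairs standing in for spaCy tokens (hypothetical).
tagged = [
    ("blockchain", "NOUN"), ("is", "AUX"), ("a", "DET"),
    ("decentralized", "ADJ"), ("ledger", "NOUN"), ("of", "ADP"),
    ("transactions", "NOUN"), ("blockchain", "NOUN"), ("technology", "NOUN"),
]

# Same filter as the cells above: keep adjectives and nouns, lowercased.
kept = [word.lower() for word, pos in tagged if pos in {"ADJ", "NOUN"}]
frequencies = Counter(kept)

print(frequencies.most_common(2))
```

A frequency mapping like this can also be passed to WordCloud's generate_from_frequencies method instead of building one long string.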