spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.
spaCy is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

**Tokenization**	Segmenting text into words, punctuation marks, etc.

**Part-of-speech (POS) Tagging**	Assigning word types to tokens, like verb or noun.

**Dependency Parsing**	Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.

**Lemmatization**	Assigning the base forms of words. For example, the lemma of “was” is “be”, and the lemma of “rats” is “rat”.

**Sentence Boundary Detection (SBD)**	Finding and segmenting individual sentences.

**Named Entity Recognition (NER)**	Labelling named “real-world” objects, like persons, companies or locations.

**Entity Linking (EL)**	Disambiguating textual entities to unique identifiers in a knowledge base.

**Similarity**	Comparing words, text spans and documents to measure how similar they are to each other.

**Text Classification**	Assigning categories or labels to a whole document, or parts of a document.

**Rule-based Matching**	Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.

**Training**	Updating and improving a statistical model’s predictions.

**Serialization**	Saving objects to files or byte strings.
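
Two of the features listed above, rule-based matching and serialization, need no trained model, so they can be tried with a blank English pipeline. The following is a minimal sketch, assuming spaCy v3 is already installed; the pattern name `HELLO_WORLD` and the sample sentence are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# A blank pipeline contains only the tokenizer; no model download is needed.
nlp = spacy.blank("en")
doc = nlp("Hello, world! Hello world!")

# Rule-based matching: "hello", then one punctuation token, then "world".
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HELLO_WORLD", [pattern])  # hypothetical pattern name

# Each match is a (match_id, start, end) triple of token indices.
matched_texts = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched_texts)

# Serialization: save the Doc to a byte string and load it back.
data = doc.to_bytes()
restored = Doc(nlp.vocab).from_bytes(data)
print(restored.text == doc.text)
```

Only the first "Hello, world" matches, because the pattern requires a punctuation token between the two words.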

In [None]:
1. NLTK   - Text Data
2. spaCy  - Text Data

In [2]:
! pip install spacy

Collecting spacy
  Downloading spacy-3.7.1-cp310-cp310-win_amd64.whl (12.1 MB)

In [4]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.0/en_core_web_sm-3.7.0-py3-none-any.whl (12.8 MB)

In [5]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [6]:
text = "The rain in Spain falls mainly on the plain."
doc = nlp(text)

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.is_stop)

The the DET True
rain rain NOUN False
in in ADP True
Spain Spain PROPN False
falls fall VERB False
mainly mainly ADV False
on on ADP True
the the DET True
plain plain NOUN False
. . PUNCT False


In [7]:
import pandas as pd
cols = ("text", "lemma", "POS", "explain", "stopword")
rows = []
for t in doc:
    row = [t.text, t.lemma_, t.pos_, spacy.explain(t.pos_), t.is_stop]
    rows.append(row)
df = pd.DataFrame(rows, columns=cols)
print(df)

     text   lemma    POS      explain  stopword
0     The     the    DET   determiner      True
1    rain    rain   NOUN         noun     False
2      in      in    ADP   adposition      True
3   Spain   Spain  PROPN  proper noun     False
4   falls    fall   VERB         verb     False
5  mainly  mainly    ADV       adverb     False
6      on      on    ADP   adposition      True
7     the     the    DET   determiner      True
8   plain   plain   NOUN         noun     False
9       .       .  PUNCT  punctuation     False
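
A common pre-processing step is to keep only the lemmas of content words. The sketch below does this in plain Python on the rows printed above, copied in by hand so that the snippet runs without a model; with a live `doc`, the same filter could read `t.is_stop` and `t.pos_` directly.

```python
# (text, lemma, POS, is_stop) rows copied from the output above.
token_rows = [
    ("The", "the", "DET", True),
    ("rain", "rain", "NOUN", False),
    ("in", "in", "ADP", True),
    ("Spain", "Spain", "PROPN", False),
    ("falls", "fall", "VERB", False),
    ("mainly", "mainly", "ADV", False),
    ("on", "on", "ADP", True),
    ("the", "the", "DET", True),
    ("plain", "plain", "NOUN", False),
    (".", ".", "PUNCT", False),
]

# Drop stop words and punctuation, keep the lemma of everything else.
content_lemmas = [
    lemma
    for _text, lemma, pos, is_stop in token_rows
    if not is_stop and pos != "PUNCT"
]
print(content_lemmas)
```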


In [8]:
from spacy import displacy
displacy.render(doc, style="dep")

In [9]:
text = (
    "We were all out at the zoo one day, I was doing some acting, "
    "walking on the railing of the gorilla exhibit. I fell in. "
    "Everyone screamed and Tommy jumped in after me, forgetting that "
    "he had blueberries in his front pocket. The gorillas just went wild."
)
doc = nlp(text)
for sent in doc.sents:
    print(">", sent)

> We were all out at the zoo one day, I was doing some acting, walking on the railing of the gorilla exhibit.
> I fell in.
> Everyone screamed and Tommy jumped in after me, forgetting that he had blueberries in his front pocket.
> The gorillas just went wild.


In [10]:
for sent in doc.sents:
    print(">", sent.start, sent.end)

> 0 25
> 25 29
> 29 48
> 48 54
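
The numbers printed by `sent.start` and `sent.end` are token indices into `doc`, not character offsets. The following is a minimal sketch of the difference, assuming a blank pipeline with spaCy's rule-based `sentencizer` component in place of the `en_core_web_sm` parser (so boundaries may differ on harder text); the names `nlp_rules` and `short_doc` are illustrative.

```python
import spacy

# Blank pipeline plus punctuation-based sentence splitting; no model needed.
nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")
short_doc = nlp_rules("I fell in. Everyone screamed.")

for sent in short_doc.sents:
    # Slicing the Doc with the token indices gives back the same sentence.
    same = short_doc[sent.start:sent.end].text == sent.text
    # start/end count tokens; start_char/end_char count characters.
    print(sent.start, sent.end, sent.start_char, sent.end_char, same)
```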


In [15]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Ashutosh is going to the movie wit 10 friends to cine planet in Kompally")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

Ashutosh Ashutosh PROPN NNP nsubj Xxxxx True False
is be AUX VBZ aux xx True True
going go VERB VBG ROOT xxxx True False
to to ADP IN prep xx True True
the the DET DT det xxx True True
movie movie NOUN NN compound xxxx True False
wit wit NOUN NN pobj xxx True False
10 10 NUM CD nummod dd False False
friends friend NOUN NNS npadvmod xxxx True False
to to PART TO aux xx True True
cine cine VERB VB xcomp xxxx True False
planet planet NOUN NN dobj xxxx True False
in in ADP IN prep xx True True
Kompally Kompally PROPN NNP pobj Xxxxx True False
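
The `shape_` column abstracts each token's orthography: uppercase letters become `X`, lowercase letters become `x`, digits become `d`, and long runs of the same symbol are cut off after four. The helper below, `approx_shape`, is a hypothetical plain-Python approximation written for illustration; it is not spaCy's own implementation and may differ on edge cases.

```python
def approx_shape(text: str) -> str:
    """Approximate a spaCy-style word shape (illustrative only)."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            sym = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            sym = "d"
        else:
            sym = ch  # punctuation and other symbols are kept as-is
        # Count how long the current run of identical symbols is.
        run = run + 1 if sym == last else 1
        last = sym
        # Keep at most four symbols per run.
        if run <= 4:
            shape.append(sym)
    return "".join(shape)


for word in ["Ashutosh", "is", "10", "Kompally"]:
    print(word, approx_shape(word))
```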


## Named Entities

In [20]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Mayank is going to the movie with 10 friends to cine planet in Kompally")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Mayank 0 6 ORG
10 34 36 CARDINAL
Kompally 63 71 LOC
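
`start_char` and `end_char` are character offsets into the original string, so slicing the text with them recovers each entity. The sketch below checks this in plain Python, with the three tuples copied from the output above.

```python
sentence = "Mayank is going to the movie with 10 friends to cine planet in Kompally"

# (text, start_char, end_char, label) copied from the printed entities.
printed_ents = [
    ("Mayank", 0, 6, "ORG"),
    ("10", 34, 36, "CARDINAL"),
    ("Kompally", 63, 71, "LOC"),
]

# The end offset is exclusive, like a normal Python slice.
recovered = [sentence[start:end] for _text, start, end, _label in printed_ents]
print(recovered)
```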


In [21]:
import spacy
from spacy import displacy

text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.serve(doc, style="ent")




Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [22]:
doc = nlp("This is a sentence about Infosys.")
doc.user_data["title"] = "This is a title"
displacy.serve(doc, style="ent")


Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [23]:
import spacy
from spacy import displacy
from spacy.tokens import Span

text = "Welcome to the Bank of India."

nlp = spacy.blank("en")
doc = nlp(text)

doc.spans["sc"] = [
    Span(doc, 3, 6, "ORG"),
    Span(doc, 5, 6, "GPE"),
]

displacy.serve(doc, style="span")


Using the 'span' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.
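
The `Span(doc, start, end, label)` arguments are token indices with an exclusive end, which is how the two spans above can overlap on "India". The following is a minimal sketch that prints the spans instead of serving them; the names `nlp_blank` and `bank_doc` are illustrative.

```python
import spacy
from spacy.tokens import Span

nlp_blank = spacy.blank("en")
bank_doc = nlp_blank("Welcome to the Bank of India.")

# Tokens: Welcome(0) to(1) the(2) Bank(3) of(4) India(5) .(6)
bank_doc.spans["sc"] = [
    Span(bank_doc, 3, 6, "ORG"),  # tokens 3..5
    Span(bank_doc, 5, 6, "GPE"),  # token 5 only
]

for span in bank_doc.spans["sc"]:
    print(span.text, span.start, span.end, span.label_)
```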


In [25]:
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.0/en_core_web_md-3.7.0-py3-none-any.whl (42.8 MB)

### Similarity between two sentences

In [26]:
import spacy

nlp = spacy.load("en_core_web_md")  # make sure to use larger package!
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

# Similarity of two documents
print(doc1, "<->", doc2, doc1.similarity(doc2))
# Similarity of tokens and spans
french_fries = doc1[2:4]
burgers = doc1[5]
print(french_fries, "<->", burgers, french_fries.similarity(burgers))


I like salty fries and hamburgers. <-> Fast food tastes very good. 0.691649353055761
salty fries <-> hamburgers 0.6938489675521851
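
By default, spaCy's `similarity` is the cosine similarity of the two objects' vectors, and a `Doc` or `Span` vector is the average of its token vectors. The sketch below reproduces that calculation in plain Python on made-up three-dimensional vectors; the numbers are hypothetical and unrelated to the real `en_core_web_md` vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Hypothetical toy word vectors, for illustration only.
salty = [1.0, 0.0, 1.0]
fries = [1.0, 1.0, 0.0]
hamburgers = [1.0, 1.0, 1.0]

# A span vector is the average of its token vectors.
span_vector = average([salty, fries])
print(round(cosine(span_vector, hamburgers), 3))
```

Because the score depends only on the angle between vectors, it says how closely two texts point in the same direction in vector space, not whether they mean the same thing.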
