In [1]:
import spacy
import pandas as pd

##### Features of spaCy:
* **Tokenization**: segmenting text into words, punctuation marks, etc.

* **Part-of-Speech (PoS) Tagging**: assigning word types to tokens, like verb or noun.

* **Dependency Parsing**: assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.

* **Lemmatization**: assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat".

* **Sentence Boundary Detection (SBD)**: finding and segmenting individual sentences.

* **Named Entity Recognition**: labelling named "real-world" objects, like persons, companies, or locations.

* **Entity Linking**: disambiguating textual entities to unique identifiers in a knowledge base.

* **Similarity**: comparing words, text spans, and documents to measure how similar they are to each other.

* **Text Classification**: assigning categories or labels to a whole document, or parts of a document.

* **Rule-based Matching**: finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.

* **Training**: updating and improving a statistical model's predictions.

* **Serialization**: saving objects to files or byte strings.

In [2]:
spacy.cli.download("en_core_web_sm")

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [3]:
nlp_model = spacy.load("en_core_web_sm")
string = "Apple is looking at buying U.K. startup for $1 billion."
doc = nlp_model(string)

# Print each token with its coarse part-of-speech tag and dependency label.
for token in doc:
    print(token.text, token.pos_, token.dep_)

Apple PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN nsubj
startup VERB ccomp
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj
. PUNCT punct


In [4]:
columns = ["text", "lemma", "pos", "tag", "dep", "shape", "alpha", "stop"]

nlp_model = spacy.load("en_core_web_sm")
string = "Apple is looking at buying U.K. startup for $1 billion."
doc = nlp_model(string)

# Collect one row of attributes per token, then build the DataFrame once.
# This avoids calling pd.concat inside the loop and gives every row a unique index.
rows = [
    [token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
     token.shape_, token.is_alpha, token.is_stop]
    for token in doc
]
df = pd.DataFrame(rows, columns=columns)

df

Unnamed: 0,text,lemma,pos,tag,dep,shape,alpha,stop
0,Apple,Apple,PROPN,NNP,nsubj,Xxxxx,True,False
1,is,be,AUX,VBZ,aux,xx,True,True
2,looking,look,VERB,VBG,ROOT,xxxx,True,False
3,at,at,ADP,IN,prep,xx,True,True
4,buying,buy,VERB,VBG,pcomp,xxxx,True,False
5,U.K.,U.K.,PROPN,NNP,nsubj,X.X.,False,False
6,startup,startup,VERB,VBD,ccomp,xxxx,True,False
7,for,for,ADP,IN,prep,xxx,True,True
8,$,$,SYM,$,quantmod,$,False,False
9,1,1,NUM,CD,compound,d,False,False
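
The `shape` column above abstracts each token's orthography: uppercase letters become `X`, lowercase letters become `x`, digits become `d`, and other characters are kept. The following is a minimal pure-Python sketch of that idea, not spaCy's own implementation. It assumes that runs of the same shape character are capped at four, which is consistent with `looking` appearing as `xxxx` in the table. The function name `word_shape` is chosen here for illustration.

```python
def word_shape(text: str) -> str:
    """Approximate a token's shape: X for uppercase, x for lowercase, d for digits."""
    shape = []
    last = ""   # shape character of the previous input character
    run = 0     # length of the current run of identical shape characters
    for ch in text:
        if ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mapped = "d"
        else:
            mapped = ch  # punctuation and symbols are kept as-is
        run = run + 1 if mapped == last else 1
        last = mapped
        if run <= 4:  # assumption: runs longer than four characters are truncated
            shape.append(mapped)
    return "".join(shape)


for word in ["Apple", "looking", "U.K.", "$", "1"]:
    print(word, word_shape(word))
```

For the tokens shown in the table, this sketch reproduces the listed shapes (`Xxxxx`, `xxxx`, `X.X.`, `$`, `d`).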


In [5]:
columns = ["text", "start", "end", "label"]

nlp_model = spacy.load("en_core_web_sm")
string = "Apple is looking at buying U.K. startup for $1 billion."
doc = nlp_model(string)

# One row per named entity: its text, character offsets, and entity label.
rows = [[ent.text, ent.start_char, ent.end_char, ent.label_] for ent in doc.ents]
df = pd.DataFrame(rows, columns=columns)

df

Unnamed: 0,text,start,end,label
0,Apple,0,5,ORG
1,U.K.,27,31,GPE
2,$1 billion,44,54,MONEY


GPE - Geopolitical entity, i.e., countries, cities, states
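
The `start` and `end` columns in the entity table are character offsets into the original string, with `end` exclusive, so each entity text can be recovered with an ordinary Python slice. The short sketch below checks this using the offsets copied from the table; the variable names `entities` and `spans` are illustrative only.

```python
text = "Apple is looking at buying U.K. startup for $1 billion."

# (start_char, end_char, label) triples copied from the entity table.
entities = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]

# Slicing with [start:end] recovers the entity text; the end offset is exclusive.
spans = [(text[start:end], label) for start, end, label in entities]

for span_text, label in spans:
    print(f"{span_text!r} -> {label}")
```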

In [7]:
spacy.cli.download("en_core_web_md")

Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [17]:
columns = ["text", "has_vector", "vector_norm", "is_oov"]

nlp = spacy.load("en_core_web_md")
tokens = nlp("dog cat banana rtwreterwt")

# One row per token: whether it has a word vector, the vector's L2 norm,
# and whether the token is out-of-vocabulary (OOV).
rows = [[token.text, token.has_vector, token.vector_norm, token.is_oov] for token in tokens]
df = pd.DataFrame(rows, columns=columns)

df

Unnamed: 0,text,has_vector,vector_norm,is_oov
0,dog,True,7.443447,False
1,cat,True,7.443447,False
2,banana,True,6.895898,False
3,rtwreterwt,False,0.0,True
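
The `vector_norm` column is the L2 (Euclidean) norm of each token's word vector. An out-of-vocabulary token such as `rtwreterwt` has no vector, which is why its norm is `0.0`. The sketch below computes the same quantity in plain Python; the two-dimensional vectors in `toy_vectors` are made up for illustration and are not taken from `en_core_web_md`.

```python
import math


def l2_norm(vector):
    """Return the Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vector))


# Hypothetical toy vectors; real spaCy vectors have hundreds of dimensions.
toy_vectors = {
    "known_word": [3.0, 4.0],
    "oov_word": [0.0, 0.0],  # tokens without a vector behave like an all-zero vector
}

norms = {word: l2_norm(vec) for word, vec in toy_vectors.items()}
print(norms)
```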


In [23]:
nlp = spacy.load("en_core_web_md")

doc1 = nlp("I like salty fries and hamburgers.") # type: Doc
doc2 = nlp("Fast food tastes very good.") # type: Doc

print(doc1, '<->', doc2, "similarity score:", doc1.similarity(doc2))

french_fries = doc1[2:4] # type: Span
burgers = doc1[5] # type: Token

print(french_fries, '<->', burgers, "similarity score:", french_fries.similarity(burgers))

I like salty fries and hamburgers. <-> Fast food tastes very good. similarity score: 0.8015959858894348
salty fries <-> hamburgers similarity score: 0.5733411312103271
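
By default, spaCy estimates similarity as the cosine similarity between vectors, where the vector of a `Doc` or `Span` is the average of its token vectors. The following is a rough pure-Python sketch of that calculation under those assumptions. The helper names `mean_vector` and `cosine_similarity` and the toy two-dimensional vectors are hypothetical and only serve to show the arithmetic.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy "documents": each inner list stands in for one token's word vector.
doc_a_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_b_tokens = [[1.0, 1.0]]

score = cosine_similarity(mean_vector(doc_a_tokens), mean_vector(doc_b_tokens))
print("similarity score:", score)
```

Because the score depends only on the direction of the averaged vectors, two texts with no words in common can still score highly, as the fries and fast-food sentences do above.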


In [25]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("I love coffee")
print(doc.vocab.strings["coffee"])  # string -> 64-bit hash
print(doc.vocab.strings[3197928453018144401])  # hash -> string

3197928453018144401
coffee
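
The lookup above works in both directions because the vocabulary stores each string once and refers to it by a hash. The class below is a toy illustration of such a two-way store, not spaCy's `StringStore`. It uses `hashlib.blake2b` as a stand-in hash function, so its hash values will differ from the ones spaCy prints, and the class name `ToyStringStore` is hypothetical.

```python
import hashlib


class ToyStringStore:
    """Map strings to 64-bit integer hashes and hashes back to strings."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def _hash(string: str) -> int:
        # Stand-in hash (assumption): 8-byte BLAKE2b digest read as an unsigned integer.
        digest = hashlib.blake2b(string.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, string: str) -> int:
        """Remember the string and return its hash."""
        key = self._hash(string)
        self._by_hash[key] = string
        return key

    def __getitem__(self, key):
        # A string key returns its hash; an integer key returns the stored string.
        if isinstance(key, str):
            return self._hash(key)
        return self._by_hash[key]


store = ToyStringStore()
store.add("coffee")
coffee_hash = store["coffee"]
print(coffee_hash)
print(store[coffee_hash])
```

Hashing is one-way, so the integer can only be resolved back to text if the string was added to the store first; looking up an unseen hash raises a `KeyError` in this sketch.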
