In [2]:
import spacy

In [3]:
nlp = spacy.load("en_core_web_sm")

### Intro to tokenization and POS

In [4]:
# Read a Wikipedia page that was saved as plain text
with open("data/wiki_us.txt", "r") as f:
    text = f.read()
print(text)


The United States of America (USA or U.S.A.), commonly known as the United States (US or U.S.) or America, is a country primarily located in North America, between Canada and Mexico. It is a federation of 50 states, a federal capital district (Washington, D.C.), and 326 Indian reservations. Outside the union of states, it asserts sovereignty over five major unincorporated island territories and various uninhabited islands.[j] The country has the world's third-largest land area,[d] largest maritime exclusive economic zone,
and the third-largest population, exceeding 334 million.[k]
Paleo-Indians migrated across the Bering land bridge more than 12,000 years ago. British colonization led to the first settlement of the Thirteen Colonies in Virginia in 1607. Clashes with the British Crown over taxation and political representation sparked the American Revolution, with the Second Continental Congress formally declaring independence on July 4, 1776. Following its victory in the Revolutionary

In [5]:
#Creating a doc object
doc = nlp(text)
print(doc)


The United States of America (USA or U.S.A.), commonly known as the United States (US or U.S.) or America, is a country primarily located in North America, between Canada and Mexico. It is a federation of 50 states, a federal capital district (Washington, D.C.), and 326 Indian reservations. Outside the union of states, it asserts sovereignty over five major unincorporated island territories and various uninhabited islands.[j] The country has the world's third-largest land area,[d] largest maritime exclusive economic zone,
and the third-largest population, exceeding 334 million.[k]
Paleo-Indians migrated across the Bering land bridge more than 12,000 years ago. British colonization led to the first settlement of the Thirteen Colonies in Virginia in 1607. Clashes with the British Crown over taxation and political representation sparked the American Revolution, with the Second Continental Congress formally declaring independence on July 4, 1776. Following its victory in the Revolutionary

Notice that the two printouts above look identical. However, the doc object and the text object differ significantly, as the code snippet below shows.

In [6]:
print(len(text))
print(len(doc))

# len(text) counts characters, while len(doc) counts tokens


66643
11728


The difference in length shows that these are two different types of data.

This difference comes from tokenization. The text object is a plain string, so its length counts individual characters. The doc object is a sequence of tokens (words, punctuation marks, and so on), so its length counts tokens.

The tokenizer does not split on whitespace alone: commas, brackets, other special characters, and acronyms are also handled, which a simple split() call cannot do.

This makes the spaCy tokenizer a powerful tool that eases language-based operations.
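To make the contrast concrete without loading a model, here is a rough stdlib-only sketch. The regular expression is a toy pattern written for this one sentence fragment (an assumption for illustration); it is not how spaCy tokenizes internally, and it only imitates the effect of separating brackets and commas while keeping a dotted acronym together.

```python
import re

sample = "The United States of America (USA or U.S.A.), commonly known"

# Whitespace splitting leaves punctuation glued to the neighbouring words.
whitespace_tokens = sample.split()

# Toy pattern (assumption, not spaCy's rules): try dotted acronyms first,
# then runs of word characters, then any single non-space symbol.
TOKEN_PATTERN = re.compile(r"(?:[A-Za-z]\.){2,}|\w+|[^\w\s]")
regex_tokens = TOKEN_PATTERN.findall(sample)

print(len(sample), len(whitespace_tokens), len(regex_tokens))
print(whitespace_tokens)
print(regex_tokens)
```

The three lengths differ for the same reason len(text) and len(doc) differ: one counts characters, the others count pieces produced by a splitting rule.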

In [7]:
# Iterating over a plain string yields individual characters
for token in text[0:5]:
    print(token)

# Splitting on whitespace leaves punctuation attached to the words
for token in text.split()[:10]:
    print(token)



T
h
e
 
The
United
States
of
America
(USA
or
U.S.A.),
commonly
known


In [8]:
# Division with the help of spaCy
for token in doc[:10]:
    print(token)



The
United
States
of
America
(
USA
or
U.S.A.


In [9]:
# Print every sentence (each one is a Span of tokens, not a single token)
# spaCy finds the sentences through sentence boundary detection
for sent in doc.sents:
    print("\n",sent)


 
The United States of America (USA or U.S.A.), commonly known as the United States (US or U.S.) or America, is a country primarily located in North America, between Canada and Mexico.

 It is a federation of 50 states, a federal capital district (Washington, D.C.), and 326 Indian reservations.

 Outside the union of states, it asserts sovereignty over five major unincorporated island territories and various uninhabited islands.[j]

 The country has the world's third-largest land area,[d] largest maritime exclusive economic zone,
and the third-largest population, exceeding 334 million.[k]
Paleo-Indians migrated across the Bering land bridge more than 12,000 years ago.

 British colonization led to the first settlement of the Thirteen Colonies in Virginia in 1607.

 Clashes with the British Crown over taxation and political representation sparked the American Revolution, with the Second Continental Congress formally declaring independence on July 4, 1776.

 Following its victory in the

In [10]:
# doc.sents is a generator and cannot be indexed directly,
# so we convert it to a list before picking the first sentence
sen = list(doc.sents)[0]
print(sen)


The United States of America (USA or U.S.A.), commonly known as the United States (US or U.S.) or America, is a country primarily located in North America, between Canada and Mexico.
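The reason for the list() call can be seen with any Python generator. The sketch below uses a hypothetical stand-in generator (sentences() is invented for illustration and is not part of spaCy) to show that indexing fails, and that next() or itertools.islice can fetch the first items without building the whole list. The same idea should carry over to doc.sents, which may matter for a long document.

```python
from itertools import islice


def sentences():
    """Hypothetical stand-in for a generator such as doc.sents."""
    yield "First sentence."
    yield "Second sentence."
    yield "Third sentence."


# Generators do not support indexing, so this raises a TypeError.
try:
    sentences()[0]
    indexing_error = None
except TypeError as exc:
    indexing_error = exc

first_via_list = list(sentences())[0]      # builds the full list first
first_via_next = next(sentences())         # consumes only the first item
first_two = list(islice(sentences(), 2))   # consumes only the first two items

print(type(indexing_error).__name__)
print(first_via_list, first_via_next, first_two)
```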


In [15]:
for token in sen[:10]:
    print(token)



The
United
States
of
America
(
USA
or
U.S.A.


### Properties of tokens

In [62]:
token = sen[0]
print(token)
token = sen[1]
print(token)
token = sen[2]
print(token)
token = sen[7]
print(token)



The
United
USA


In [63]:
token.text

'USA'

### Token Attributes

left_edge: The leftmost token of this token’s syntactic descendants.

right_edge: The rightmost token of this token’s syntactic descendants.

Together these mark the boundaries of the token's subtree in the dependency parse. They should not be confused with simply showing the first or last token of the sentence.

They show how spaCy has grouped the words of the sentence syntactically.

In [64]:
token.left_edge
# Leftmost token of this token's subtree in the dependency parse
# Should not be confused with simply showing the first token of the sentence
# It shows how spaCy has grouped the words of the sentence syntactically

USA

In [65]:
token.right_edge

U.S.A.
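The two results above can be reproduced on a toy tree. The sketch below is plain Python, not spaCy: the head indices are written by hand as an assumption (both "or" and "U.S.A." are taken to attach to "USA"), and the helper names are hypothetical. It only illustrates the idea that the edges are the leftmost and rightmost members of a token's subtree.

```python
tokens = ["USA", "or", "U.S.A."]
# Hand-written head indices (assumption): None means the head lies outside
# this fragment; "or" and "U.S.A." are both attached to "USA".
heads = [None, 0, 0]


def subtree_indices(index, heads):
    """Return the index itself plus the indices of all its descendants."""
    found = [index]
    for child, head in enumerate(heads):
        if head == index:
            found.extend(subtree_indices(child, heads))
    return found


def left_edge(index, heads, tokens):
    """Leftmost token in the subtree rooted at tokens[index]."""
    return tokens[min(subtree_indices(index, heads))]


def right_edge(index, heads, tokens):
    """Rightmost token in the subtree rooted at tokens[index]."""
    return tokens[max(subtree_indices(index, heads))]


print(left_edge(0, heads, tokens), right_edge(0, heads, tokens))
```

A token with no children is its own left and right edge, which is why a leaf such as "or" would return itself for both.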

Adding an extra _ at the end of an attribute name (type, lemma, iob, etc.) returns its string form instead of the integer ID.

ent_type: Shows the named entity type (numerical encoding)

ent_type_: Shows the named entity type (string label)
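One way to picture the two forms is a lookup table that maps each label string to an integer. The classes below are a toy analogue written for this note (ToyStringStore and ToyToken are hypothetical names, and the IDs they produce are unrelated to spaCy's actual values); they only show why both an integer attribute and an underscore-suffixed string attribute can exist side by side.

```python
class ToyStringStore:
    """Toy two-way mapping between label strings and integer IDs."""

    def __init__(self):
        self._ids = {}
        self._labels = []

    def add(self, label):
        """Return the ID for a label, registering it on first use."""
        if label not in self._ids:
            self._ids[label] = len(self._labels)
            self._labels.append(label)
        return self._ids[label]

    def lookup(self, label_id):
        """Return the label string stored under an ID."""
        return self._labels[label_id]


class ToyToken:
    """Toy token that keeps the integer form and derives the string form."""

    def __init__(self, text, ent_label, store):
        self.text = text
        self._store = store
        self.ent_type = store.add(ent_label)  # integer form

    @property
    def ent_type_(self):
        # String form, looked up from the integer on demand.
        return self._store.lookup(self.ent_type)


store = ToyStringStore()
token_demo = ToyToken("USA", "GPE", store)
print(token_demo.ent_type, token_demo.ent_type_)
```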

In [69]:
print(token.ent_type) # Named entity type as an integer ID
print(token.ent_type_) # Named entity type as a string label
print(token.ent_iob_) # Position of the token within an entity: B = begins an entity, I = inside an entity, O = outside any entity
print(token.lemma_) # Base (dictionary) form of the word after lemmatization, e.g. "known" becomes "know"
print(token.morph) # Morphological analysis, which gives the grammatical features of the word
print(token.pos_) # Part of speech of the token in the given sentence
print(token.dep_) # Syntactic dependency relation between the token and its head
print(token.lang_) # Language of the vocabulary the token belongs to

384
GPE
B
USA
Number=Sing
PROPN
appos
en
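The B/I/O codes are easier to read once they are grouped back into whole entities. The following is a minimal sketch in plain Python; the (text, iob, label) triples are typed in by hand as hypothetical example tags rather than taken from the model, and group_entities is a helper invented for illustration.

```python
def group_entities(tagged_tokens):
    """Group (text, iob, label) triples into (entity_text, label) pairs."""
    entities = []
    words, label = [], None

    def flush():
        # Store the entity collected so far, if there is one.
        if words:
            entities.append((" ".join(words), label))

    for text, iob, ent_label in tagged_tokens:
        if iob == "B":
            # A new entity starts here; close the previous one first.
            flush()
            words, label = [text], ent_label
        elif iob == "I" and words:
            # Continuation of the entity that is currently open.
            words.append(text)
        else:
            # "O" (or a stray "I"): the token is outside any entity.
            flush()
            words, label = [], None
    flush()
    return entities


# Hypothetical tags for illustration only.
tagged = [
    ("the", "O", ""),
    ("United", "B", "GPE"),
    ("States", "I", "GPE"),
    ("(", "O", ""),
    ("US", "B", "GPE"),
    (")", "O", ""),
]
entities = group_entities(tagged)
print(entities)
```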


In [67]:
print(sen[13].lemma_)
print(sen[13]) #Example of lemmatization in action
print(sen[13].morph)

know
known
Aspect=Perf|Tense=Past|VerbForm=Part
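The morph output is a set of Feature=Value pairs joined by "|". As a small illustration, the printed string can be split into a dictionary with plain Python. The parse_morph helper is hypothetical and works only on the printed text; the morph object itself provides its own accessors, which are not used here.

```python
morph_string = "Aspect=Perf|Tense=Past|VerbForm=Part"


def parse_morph(morph_text):
    """Split a 'Feature=Value|Feature=Value' string into a dictionary."""
    if not morph_text:
        return {}
    return dict(pair.split("=", 1) for pair in morph_text.split("|"))


features = parse_morph(morph_string)
print(features)
print(features["Tense"])
```

Read this way, "known" is a past participle: the tense is Past and the verb form is Part.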


In [70]:
text = "Hello, this is a new statement for experimentation purposes with Jarvis"
doc2 = nlp(text)

In [72]:
for token in doc2:
    print(token.text, token.pos_, token.dep_)

Hello INTJ intj
, PUNCT punct
this PRON nsubj
is AUX ROOT
a DET det
new ADJ amod
statement NOUN attr
for ADP prep
experimentation NOUN compound
purposes NOUN pobj
with ADP prep
Jarvis PROPN pobj


In [73]:
from spacy import displacy
displacy.render(doc2, style="dep")
# displaCy draws the dependency parse of the statement, with each token's POS tag shown beneath it