In [1]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Apple is looking at buying U.K startup for 1 billion dollars($)')

for token in doc:
    print(token.text, token.pos_, token.dep_)

Apple PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K PROPN dobj
startup VERB dep
for ADP prep
1 NUM compound
billion NUM nummod
dollars($ PROPN pobj
) PUNCT punct


### Tokenization

During processing, spaCy first tokenizes the text, i.e. it segments the text into words, punctuation marks and so on. This is done by applying rules specific to each language.

In [2]:
text = """Let's go to N.Y.!"""

In [3]:
text

"Let's go to N.Y.!"

In [4]:
nlp = spacy.load('en_core_web_sm')

In [5]:
docs = nlp(text)

In [6]:
docs

Let's go to N.Y.!

In [7]:
for token in docs:
    print(token.text)

Let
's
go
to
N.Y.
!
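
Because tokenization is rule-based, it can be tried without a trained model. The following is a minimal sketch, assuming spaCy v3 is installed; the name `blank_nlp` is chosen here for illustration. It also shows that tokenization is non-destructive: every token keeps its character offset and trailing whitespace, so the original string can be rebuilt.

```python
import spacy

# A blank English pipeline contains only the rule-based tokenizer,
# so no trained model has to be downloaded for this sketch.
blank_nlp = spacy.blank("en")

text = "Let's go to N.Y.!"
doc = blank_nlp(text)

# Each token remembers where it starts in the original string (token.idx).
for token in doc:
    print(token.idx, repr(token.text))

tokens = [token.text for token in doc]

# text_with_ws includes the whitespace that followed the token,
# so joining the pieces gives back the input exactly.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```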


### Part-Of-Speech Tagging and Dependency Parsing

**Text** : The original word text.

**Lemma** : The base form of the word.

**POS** : The simple UPOS part-of-speech tag.

**Tag** : The detailed part-of-speech tag.

**Dep** : Syntactic dependency, i.e. the relation between tokens.

**Shape** : The word shape - capitalization, punctuation, digits.

**is alpha** : Does the token consist of alphabetic characters?

**is stop** : Is the token part of a stop list, i.e. one of the most common words of the language?
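
The **Shape** attribute can be read as a character-class mapping: uppercase letters become `X`, lowercase letters become `x`, digits become `d`, and other characters are kept. The following pure-Python sketch illustrates that idea. `word_shape` is a hypothetical helper written for this example, not spaCy's own function, and it assumes that a run of the same class is capped at four symbols.

```python
def word_shape(text):
    """Approximate a spaCy-style word shape (illustrative sketch only)."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept as they are
        if shape_char == last:
            run += 1
        else:
            run = 0
            last = shape_char
        # Keep at most four consecutive symbols of the same class.
        if run < 4:
            shape.append(shape_char)
    return "".join(shape)


for word in ["Apple", "is", "looking", "U.K."]:
    print(word, word_shape(word), word.isalpha())
```

In the actual pipeline, the same information is available directly as `token.shape_` and `token.is_alpha`.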

In [8]:
import spacy

In [9]:
nlp = spacy.load('en_core_web_sm')

In [10]:
text = "Apple is looking at buying U.K. startup for $1 billion"

In [11]:
doc = nlp(text)

In [13]:
from beautifultable import BeautifulTable

In [14]:
table = BeautifulTable()

In [15]:
table.columns.header = ['text','POS','TAG','Dep','Shape','is_alpha','is_stop']
for token in doc:
    table.rows.append([token.text, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop])

In [17]:
print(table)

+---------+-------+-----+----------+-------+----------+---------+
|  text   |  POS  | TAG |   Dep    | Shape | is_alpha | is_stop |
+---------+-------+-----+----------+-------+----------+---------+
|  Apple  | PROPN | NNP |  nsubj   | Xxxxx |    1     |    0    |
+---------+-------+-----+----------+-------+----------+---------+
|   is    |  AUX  | VBZ |   aux    |  xx   |    1     |    1    |
+---------+-------+-----+----------+-------+----------+---------+
| looking | VERB  | VBG |   ROOT   | xxxx  |    1     |    0    |
+---------+-------+-----+----------+-------+----------+---------+
|   at    |  ADP  | IN  |   prep   |  xx   |    1     |    1    |
+---------+-------+-----+----------+-------+----------+---------+
| buying  | VERB  | VBG |  pcomp   | xxxx  |    1     |    0    |
+---------+-------+-----+----------+-------+----------+---------+
|  U.K.   | PROPN | NNP |   dobj   | X.X.  |    0     |    0    |
+---------+-------+-----+----------+-------+----------+---------+
| startup 

### Visualizing Dependency Parsing

* Dependency parsing is the process of extracting the grammatical structure of a sentence as a set of directed relations. Each relation links a headword to one of its dependents.

In [18]:
from spacy import displacy

In [23]:
options = {'compact':True, 'distance':100, 'bg':"#AAFFFF", "font":"Source Sans Pro"}
displacy.render(doc, style='dep', jupyter=True, options=options)
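
The arcs drawn by displaCy can also be inspected in code through `token.head` and `token.children`. The sketch below assumes spaCy v3, where a `Doc` can be built from hand-written annotations. The `words`, `heads` and `deps` values are typed in manually for illustration and are not the output of a trained parser.

```python
import spacy
from spacy.tokens import Doc

# Only the vocabulary is needed here; no trained model is loaded.
vocab = spacy.blank("en").vocab

# Hand-written annotations (assumed for this example).
words = ["Apple", "is", "looking", "at", "startups"]
# heads[i] is the index of the headword of token i; the root points to itself.
heads = [2, 2, 2, 2, 3]
deps = ["nsubj", "aux", "ROOT", "prep", "pobj"]

parsed = Doc(vocab, words=words, heads=heads, deps=deps)

# One (dependent, relation, headword) triple per token.
relations = [(token.text, token.dep_, token.head.text) for token in parsed]
for text, dep, head in relations:
    print(f"{text:10} --{dep}--> {head}")

# The root is the token labelled ROOT; its children are its direct dependents.
root = [token for token in parsed if token.dep_ == "ROOT"][0]
dependents = [child.text for child in root.children]
print(root.text, dependents)
```

With a trained pipeline such as `en_core_web_sm`, the same attributes are filled in by the parser, so the loop works unchanged on `nlp(text)`.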

### Sentence Boundary Detection

Sentence boundary detection is the process of locating the start and end of sentences in a given text.

In [24]:
para = """In this lecture, you will learn how to detect sentences in a large paragraph. Later on, you can apply other processing techniques to these sentences.
🔊 Watch till last for a detailed description

💯 Read Full Blog with Code
       https://kgptalkie.com
💬 Leave your comments and doubts in the comment section
📌 Save this channel and video for watch later
👍 Like this video to show your support and love ❤️"""

In [25]:
doc = nlp(para)

In [26]:
list(doc)

[In,
 this,
 lecture,
 ,,
 you,
 will,
 learn,
 how,
 to,
 detect,
 sentences,
 in,
 a,
 large,
 paragraph,
 .,
 Later,
 on,
 ,,
 you,
 can,
 apply,
 other,
 processing,
 techniques,
 to,
 these,
 sentences,
 .,
 ,
 🔊,
 Watch,
 till,
 last,
 for,
 a,
 detailed,
 description,
 
 ,
 💯,
 Read,
 Full,
 Blog,
 with,
 Code,
 
        ,
 https://kgptalkie.com,
 ,
 💬,
 Leave,
 your,
 comments,
 and,
 doubts,
 in,
 the,
 comment,
 section,
 ,
 📌,
 Save,
 this,
 channel,
 and,
 video,
 for,
 watch,
 later,
 ,
 👍,
 Like,
 this,
 video,
 to,
 show,
 your,
 support,
 and,
 love,
 ❤,
 ️]

In [27]:
sents = list(doc.sents)

In [28]:
sents

[In this lecture, you will learn how to detect sentences in a large paragraph.,
 Later on, you can apply other processing techniques to these sentences.,
 🔊 Watch till last for a detailed description
 
 💯 Read Full Blog with Code
        https://kgptalkie.com
 💬 Leave your comments and doubts in the comment section
 📌,
 Save this channel and video for watch later
 👍,
 Like this video to show your support and love ❤️]
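
In the output above, the boundaries come from the trained dependency parser, which places some of them in odd positions around the emoji lines. One possible alternative is spaCy's rule-based `sentencizer`, which splits on sentence-final punctuation only. The following is a minimal sketch, assuming the spaCy v3 `add_pipe("sentencizer")` API; the sample text and the name `blank_nlp` are made up for illustration.

```python
import spacy

# Blank pipeline plus the rule-based sentencizer; no trained model is needed.
blank_nlp = spacy.blank("en")
blank_nlp.add_pipe("sentencizer")

sample = (
    "In this lecture, you will learn how to detect sentences. "
    "Later on, you can apply other techniques!"
)
sample_doc = blank_nlp(sample)

# doc.sents yields one Span per detected sentence.
sentences = [sent.text for sent in sample_doc.sents]
for sentence in sentences:
    print(sentence)
```

Whether the rule-based or the parser-based splitter works better depends on the text: punctuation rules are predictable, while the parser can also handle text without clear punctuation.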

### Stop Words

Stop words are the most common words in a language. In English, examples include "the", "are", "but" and "they". They are often removed before analysis, for example:


i am eating -> i eating


i was eating -> i eating
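
As a minimal sketch of this filtering, the snippet below removes stop words from a single sentence. It assumes spaCy v3 and uses a blank pipeline named `blank_nlp` for illustration. Note that spaCy's stop list is lowercase, so a plain membership test on `token.text` is case-sensitive and lets capitalized words such as "In" through; comparing the lowercased text, or using `token.is_stop`, avoids that.

```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

blank_nlp = spacy.blank("en")
sentence = blank_nlp("In this lecture, you will learn how to detect sentences.")

# Case-sensitive test: "In" survives because only "in" is in the set.
kept_case_sensitive = [t.text for t in sentence if t.text not in STOP_WORDS]

# Lowercasing first makes the test case-insensitive; punctuation is dropped too.
kept = [
    t.text for t in sentence
    if t.text.lower() not in STOP_WORDS and not t.is_punct
]

# The built-in flag gives the same result in spaCy v3.
kept_is_stop = [t.text for t in sentence if not t.is_stop and not t.is_punct]

print(kept_case_sensitive)
print(kept)
```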

In [29]:
from spacy.lang.en.stop_words import STOP_WORDS

stopwords = STOP_WORDS

In [30]:
stopwords

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'fron

In [31]:
for sent in sents:
    print(sent.text)
    for token in sent:
        if token.text not in stopwords:
            print(token)

In this lecture, you will learn how to detect sentences in a large paragraph.
In
lecture
,
learn
detect
sentences
large
paragraph
.
Later on, you can apply other processing techniques to these sentences.

Later
,
apply
processing
techniques
sentences
.


🔊 Watch till last for a detailed description

💯 Read Full Blog with Code
       https://kgptalkie.com
💬 Leave your comments and doubts in the comment section
📌
🔊
Watch
till
detailed
description



💯
Read
Full
Blog
Code

       
https://kgptalkie.com


💬
Leave
comments
doubts
comment
section


📌
Save this channel and video for watch later
👍
Save
channel
video
watch
later


👍
Like this video to show your support and love ❤️
Like
video
support
love
❤
️


### Lemmatization


Lemmatization is the process of reducing the inflected forms of a word to a common base form that is itself a valid word in the language. This base form is called a lemma.



Playing

Plays  -----> Play ----> Common root form 'play'

Played



Lemmatization is necessary because it helps you reduce the inflected forms of a word so that they can be analyzed as a single item.

In [32]:
text = 'playing played plays play'

In [33]:
doc = nlp(text)

In [36]:
for token in doc:
    print(token.text, token.lemma_)

playing play
played play
plays play
play play
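
To illustrate why treating inflected forms as a single item matters, the sketch below counts words before and after mapping them to a lemma. `TOY_LEMMAS` and `toy_lemma` are hypothetical stand-ins written only for this example; they are not spaCy's lemmatizer. With a loaded pipeline, `token.lemma_` would take their place.

```python
from collections import Counter

# Hypothetical toy lookup table, written only for this illustration.
TOY_LEMMAS = {"playing": "play", "played": "play", "plays": "play"}


def toy_lemma(word):
    """Return the base form from the toy table, or the lowercased word itself."""
    lowered = word.lower()
    return TOY_LEMMAS.get(lowered, lowered)


words = "playing played plays play".split()

# Counting surface forms treats every inflection as a different item.
surface_counts = Counter(words)

# Counting lemmas groups all inflections under one entry.
lemma_counts = Counter(toy_lemma(word) for word in words)

print(surface_counts)
print(lemma_counts)
```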
