In [6]:
import spacy
from spacy import displacy

In [7]:
# Load the small English pipeline trained on web data and create the spaCy object 'nlp'
nlp = spacy.load("en_core_web_sm")

We can use this object to run text through a defined pipeline of components and store the result in a variable for later access. The result is another spaCy object, of type 'Doc', which gives us access to all the analyses of the pipeline through its attributes and methods. In a Doc object we can access tokens, their lemmas, their PoS tags, sentences, noun chunks, named entities, etc.

In [8]:
test_input = "In a video on social media, he said there was now a culture where a claim is made with the idea that a settlement will be cheaper than taking it to court, even if there's no basis for the claim"
# running nlp pipeline on the test_input
doc = nlp(test_input)

In [3]:
example = "It was an extraordinarily good day."

In [4]:
preprocessed = nlp(example)

In [5]:
displacy.render(preprocessed, style="dep")

# 1. Tokenization

The basic unit in NLP is usually the token. Tokenization is the process of breaking a text into its smallest units, called tokens; words, numbers, and punctuation marks can all be tokens. In the output below, note that punctuation is treated as a separate token, and check how "there's" is tokenized. Try a few other test inputs to better understand the concept of a token.
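
To make the idea of a token concrete, here is a minimal sketch of a rule-based tokenizer built on Python's re module. It is not spaCy's tokenizer, which applies language-specific rules and exceptions; the pattern and the name toy_tokenize are assumptions made for illustration, and the pattern only handles the straight apostrophe. Like the loop over a Doc, it reports each token's position and its character offset in the text.

```python
import re

# Hypothetical pattern: a run of word characters, an apostrophe clitic
# such as 's, or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+|'\w+|[^\w\s]")


def toy_tokenize(text):
    """Return (token index, token text, character offset) triples."""
    return [
        (index, match.group(), match.start())
        for index, match in enumerate(TOKEN_PATTERN.finditer(text))
    ]


for index, token, offset in toy_tokenize("It's good."):
    print(index, token, offset)
# 0 It 0
# 1 's 2
# 2 good 5
# 3 . 9
```

The clitic and the final period come out as separate tokens, which mirrors what the spaCy output shows for "there's" and for punctuation.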

In [9]:
for token in doc:
    print(token.i, token, token.idx)
    
print()

0 In 0
1 a 3
2 video 5
3 on 11
4 social 14
5 media 21
6 , 26
7 he 28
8 said 31
9 there 36
10 was 42
11 now 46
12 a 50
13 culture 52
14 where 60
15 a 66
16 claim 68
17 is 74
18 made 77
19 with 82
20 the 87
21 idea 91
22 that 96
23 a 101
24 settlement 103
25 will 114
26 be 119
27 cheaper 122
28 than 130
29 taking 135
30 it 142
31 to 145
32 court 148
33 , 153
34 even 155
35 if 160
36 there 163
37 's 168
38 no 171
39 basis 174
40 for 180
41 the 184
42 claim 188



In [12]:
for token in preprocessed:
    print(token.i, token, token.idx)
    
print()

0 It 0
1 was 3
2 an 7
3 extraordinarily 10
4 good 26
5 day 31
6 . 34



<b>spaCy</b> provides sentence segmentation by grouping tokens together. Try different test inputs to analyze the quality of the sentence segmentation.
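
As a baseline for judging segmentation quality, the sketch below splits a text on sentence-final punctuation with a regular expression. This is not how spaCy segments text; the boundary rule and the name naive_sentences are assumptions made for illustration. The second input shows the kind of case where such a simple rule breaks down.

```python
import re

# Hypothetical rule: a sentence ends at '.', '!' or '?' followed by whitespace.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def naive_sentences(text):
    """Split text at whitespace that follows sentence-final punctuation."""
    return [part for part in SENTENCE_BOUNDARY.split(text.strip()) if part]


print(naive_sentences("But Google is starting from behind. The company made a late push."))
# ['But Google is starting from behind.', 'The company made a late push.']

print(naive_sentences("Dr. Smith arrived. He was late."))
# ['Dr.', 'Smith arrived.', 'He was late.']
```

Abbreviations such as "Dr." are a useful test input when comparing this baseline with the sentences spaCy returns.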

In [15]:
sentences = doc.sents

for sentence in sentences:
    print()
    print(sentence)
    for token in sentence:
        print(token.text)


In a video on social media, he said there was now a culture where a claim is made with the idea that a settlement will be cheaper than taking it to court, even if there's no basis for the claim
In
a
video
on
social
media
,
he
said
there
was
now
a
culture
where
a
claim
is
made
with
the
idea
that
a
settlement
will
be
cheaper
than
taking
it
to
court
,
even
if
there
's
no
basis
for
the
claim


In [16]:
sentence_two = preprocessed.sents

for sentence in sentence_two:
    print()
    print(sentence)
    for token in sentence:
        print(token.text)


It was an extraordinarily good day.
It
was
an
extraordinarily
good
day
.


# 2. Lemmatization

Lemmatization is the process of reducing a word to its dictionary form, the lemma. It is different from stemming, which only strips affixes from a word, and it is typically more expensive to compute than stemming.

When we apply the <b>‘lemmatize’</b> process to the word <b>‘made’</b>, it changes it to <b>‘make’</b>, which is the dictionary form of the word.
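
The difference between the two approaches can be sketched with a toy comparison. Neither function is spaCy's implementation: LEMMA_TABLE is a hypothetical lookup table filled with a few pairs taken from the lemmatizer output in this notebook, and toy_stem is a deliberately crude suffix stripper, both assumed here only for illustration.

```python
# Hypothetical lookup table; the pairs come from the lemmatizer output in this notebook.
LEMMA_TABLE = {
    "made": "make",
    "said": "say",
    "was": "be",
    "taking": "take",
    "cheaper": "cheap",
    "media": "medium",
}

# Suffixes the toy stemmer strips, tried in this order.
SUFFIXES = ("ing", "er", "ed", "s")


def toy_lemma(word):
    """Look the word up in the table; fall back to the lower-cased word."""
    return LEMMA_TABLE.get(word.lower(), word.lower())


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["made", "taking", "cheaper", "was"]:
    print(word, toy_stem(word), toy_lemma(word))
# made made make
# taking tak take
# cheaper cheap cheap
# was was be
```

The stemmer leaves irregular forms such as 'made' and 'was' untouched and produces the non-word 'tak', while the lookup returns dictionary forms.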

In [17]:
for token in doc:
    print(token.text, token.lemma_) 

In in
a a
video video
on on
social social
media medium
, ,
he he
said say
there there
was be
now now
a a
culture culture
where where
a a
claim claim
is be
made make
with with
the the
idea idea
that that
a a
settlement settlement
will will
be be
cheaper cheap
than than
taking take
it it
to to
court court
, ,
even even
if if
there there
's be
no no
basis basis
for for
the the
claim claim


# 3. Part of Speech Tagging (POS-Tagging)

POS tagging is the labeling of the words in a text according to their word types (noun, adjective, adverb, verb, etc.). A part-of-speech tagger assigns a word class to each token. The number of word classes depends on the tagset that the model uses. POS tagging is usually treated as a supervised learning problem that uses features such as the previous word, the next word, and whether the first letter is capitalized. Because tags are assigned per token, a sentence must first be split into tokens; tagging runs after tokenization.
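
The features mentioned above can be sketched as a small function that describes one token in its context. This is an illustration of feature extraction for a supervised tagger in general, not the features spaCy's model uses; the function name token_features and the boundary markers "<START>" and "<END>" are assumptions.

```python
def token_features(tokens, i):
    """Describe tokens[i] with simple context features for a tagger."""
    word = tokens[i]
    return {
        "word": word.lower(),
        "is_capitalized": word[:1].isupper(),
        # Hypothetical boundary markers for the first and last token.
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


tokens = ["It", "was", "an", "extraordinarily", "good", "day", "."]
print(token_features(tokens, 0))
# {'word': 'it', 'is_capitalized': True, 'prev_word': '<START>', 'next_word': 'was'}
print(token_features(tokens, 4))
# {'word': 'good', 'is_capitalized': False, 'prev_word': 'extraordinarily', 'next_word': 'day'}
```

A classifier trained on such feature dictionaries paired with gold tags would then predict one tag per token.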

In [18]:
for token in doc:
    print(token.text, token.pos_, token.tag_)

In ADP IN
a DET DT
video NOUN NN
on ADP IN
social ADJ JJ
media NOUN NNS
, PUNCT ,
he PRON PRP
said VERB VBD
there PRON EX
was VERB VBD
now ADV RB
a DET DT
culture NOUN NN
where SCONJ WRB
a DET DT
claim NOUN NN
is AUX VBZ
made VERB VBN
with ADP IN
the DET DT
idea NOUN NN
that SCONJ IN
a DET DT
settlement NOUN NN
will AUX MD
be AUX VB
cheaper ADJ JJR
than ADP IN
taking VERB VBG
it PRON PRP
to ADP IN
court NOUN NN
, PUNCT ,
even ADV RB
if SCONJ IN
there PRON EX
's VERB VBZ
no DET DT
basis NOUN NN
for ADP IN
the DET DT
claim NOUN NN


In [21]:
spacy.explain("VBZ")  # spaCy provides a short explanation for each tag

'verb, 3rd person singular present'

In [22]:
first_token = doc[0]
dir(first_token)

['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 'ancestors',
 'check_flag',
 'children',
 'cluster',
 'conjuncts',
 'dep',
 'dep_',
 'doc',
 'ent_id',
 'ent_id_',
 'ent_iob',
 'ent_iob_',
 'ent_kb_id',
 'ent_kb_id_',
 'ent_type',
 'ent_type_',
 'get_extension',
 'has_dep',
 'has_extension',
 'has_head',
 'has_morph',
 'has_vector',
 'head',
 'i',
 'idx',
 'iob_strings',
 'is_alpha',
 'is_ancestor',
 'is_ascii',
 'is_bracket',
 'is_currency',
 'is_digit',
 'is_left_punct',
 'is_lower',
 'is_oov',
 'is_punct',
 'is_quote',
 'is_right_punct',
 'is_sent_end',
 'is_sent_start',
 'is_space',
 'is_stop',
 'is_title',
 'is_upper',
 'lang',
 'lang_',
 'le

The attributes without <b>_</b> return the integer IDs that spaCy uses internally. Variants with <b>_</b> return the human-readable string for the same value.

In [23]:
print(first_token.tag, first_token.tag_)

1292078113972184607 IN


# 4. Named Entity Recognition

Named entity recognition (NER) is a natural language processing technique that detects named entities in a text, such as person names, location names, and company names, and classifies them into predefined categories. It is also known as entity identification, entity extraction, or entity chunking.

In [24]:
text = "But Google is starting from behind. The company made a late push into hardware, and Apple’s Siri, available on iPhones, and Amazon’s Alexa software, which runs on its Echo and Dot devices, have clear leads in consumer adoption."
doc2 = nlp(text)

In [25]:
for ent in doc2.ents:
    print(ent.text, ent.label_)

Google ORG
Apple’s Siri ORG
iPhones ORG
Amazon’s Alexa ORG
Echo GPE
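
Once entities are extracted, a common next step is to group them by category. The sketch below does this with the (text, label) pairs copied from the output above, so it runs without a model; the helper name group_by_label is an assumption. With a real Doc, the pairs would come from (ent.text, ent.label_) for each ent in doc2.ents.

```python
from collections import defaultdict

# (text, label) pairs copied from the entity output above.
entities = [
    ("Google", "ORG"),
    ("Apple’s Siri", "ORG"),
    ("iPhones", "ORG"),
    ("Amazon’s Alexa", "ORG"),
    ("Echo", "GPE"),
]


def group_by_label(pairs):
    """Collect entity texts under their label, preserving input order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


print(group_by_label(entities))
# {'ORG': ['Google', 'Apple’s Siri', 'iPhones', 'Amazon’s Alexa'], 'GPE': ['Echo']}
```

Grouping also makes questionable labels easier to spot, such as the product name 'Echo' being tagged as GPE.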


In [26]:
displacy.render(doc2, jupyter=True, style="ent")

# 5. Calculating Frequencies

A common analysis step for language corpora is the extraction of frequency statistics.

In [28]:
from collections import Counter

test_input2 = "Processing raw text intelligently is difficult: most words are rare, and it’s common for words that look completely different to mean almost the same thing. The same words in a different order can mean something completely different. Even splitting text into useful word-like units can be difficult in many languages. While it’s possible to solve some problems starting from only the raw characters, it’s usually better to use linguistic knowledge to add useful information. That’s exactly what spaCy is designed to do: you put in raw text, and get back a Doc object, that comes with a variety of annotations."
doc3 = nlp(test_input2) # running the NLP pipeline on the test input

word_frequencies = Counter()

for sentence in doc3.sents:
    words = []
    
    for token in sentence:
        # filtering out the punctuation
        if not token.is_punct:
            words.append(token.text)

    # update once per sentence; updating inside the token loop would recount earlier words
    word_frequencies.update(words)
    

print(word_frequencies)

Counter({'to': 5, '’s': 4, 'raw': 3, 'text': 3, 'words': 3, 'it': 3, 'different': 3, 'in': 3, 'a': 3, 'is': 2, 'difficult': 2, 'and': 2, 'that': 2, 'completely': 2, 'mean': 2, 'the': 2, 'same': 2, 'can': 2, 'useful': 2, 'Processing': 1, 'intelligently': 1, 'most': 1, 'are': 1, 'rare': 1, 'common': 1, 'for': 1, 'look': 1, 'almost': 1, 'thing': 1, 'The': 1, 'order': 1, 'something': 1, 'Even': 1, 'splitting': 1, 'into': 1, 'word': 1, 'like': 1, 'units': 1, 'be': 1, 'many': 1, 'languages': 1, 'While': 1, 'possible': 1, 'solve': 1, 'some': 1, 'problems': 1, 'starting': 1, 'from': 1, 'only': 1, 'characters': 1, 'usually': 1, 'better': 1, 'use': 1, 'linguistic': 1, 'knowledge': 1, 'add': 1, 'information': 1, 'That': 1, 'exactly': 1, 'what': 1, 'spaCy': 1, 'designed': 1, 'do': 1, 'you': 1, 'put': 1, 'get': 1, 'back': 1, 'Doc': 1, 'object': 1, 'comes': 1, 'with': 1, 'variety': 1, 'of': 1, 'annotations': 1})


In [29]:
num_tokens = len(doc3)  # all tokens of the processed text, punctuation included
num_words = sum(word_frequencies.values())
num_types = len(word_frequencies.keys())

print(num_tokens, num_words, num_types)

117 105 74
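
The counts above are case-sensitive, so 'The' and 'the' are counted as separate types. As a sketch of one common refinement, the function below lower-cases the words before counting and also reports the type-token ratio (types divided by words). The name frequency_summary and the short word list are assumptions for illustration; with spaCy the words would come from the non-punctuation tokens of a Doc.

```python
from collections import Counter


def frequency_summary(words):
    """Count lower-cased words and return counts, words, types, and their ratio."""
    counts = Counter(word.lower() for word in words)
    num_words = sum(counts.values())
    num_types = len(counts)
    return counts, num_words, num_types, num_types / num_words


# Words of the second sentence of the test input, punctuation removed.
words = ["The", "same", "words", "in", "a", "different", "order",
         "can", "mean", "something", "completely", "different"]

counts, num_words, num_types, ratio = frequency_summary(words)
print(counts.most_common(1))
# [('different', 2)]
print(num_words, num_types, round(ratio, 2))
# 12 11 0.92
```

Whether to fold case depends on the analysis: it merges sentence-initial forms with their lower-case variants, but it also merges proper nouns with common words.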
