In [1]:
# language pipeline breakdown

import spacy
#--
nlp = spacy.load('en_core_web_md')
# spacy.load() returns a Language instance, nlp. The Language class is the text processing pipeline.

#--
doc = nlp("I went there.")
# Calling nlp on the sample sentence "I went there." returns a Doc instance, doc.
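
To make the relationship between the pipeline object and its output concrete, here is a minimal sketch. It assumes spaCy v3 and uses a blank English pipeline with only a rule-based sentencizer added, instead of en_core_web_md, so it runs without a model download. A trained model lists more components in pipe_names.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Assumption: a blank English pipeline stands in for en_core_web_md.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence splitter, added for illustration

doc = nlp("I went there.")

print(isinstance(nlp, Language))  # nlp is the pipeline object
print(nlp.pipe_names)             # components that run after the tokenizer
print(isinstance(doc, Doc))       # doc is the processed text
```

The tokenizer always runs first and is not listed in pipe_names; only the components added after it appear there.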


#### Tokenizer

Tokenization is the first step in the NLP pipeline.
Tokenization simply means splitting the sentence into its tokens. A token is a unit of semantics. You can think of a token as the smallest meaningful part of a piece of text.

Tokens can be words, numbers, punctuation, currency symbols, and any other meaningful symbols that are the building blocks of a sentence.

In [7]:
# process of tokenization 

import spacy
nlp = spacy.load('en_core_web_md')

doc = nlp("I like apple and Apple.")

print([token.text for token in doc])
#print([i.text for i in doc])

#for i in doc:
    #print(i.text)
    
# spaCy creates the Token objects implicitly when it creates the Doc object,
# so iterating over doc yields Token objects whatever the loop variable is called.

['I', 'like', 'apple', 'and', 'Apple', '.']


Most domains that you'll process have characteristic words and phrases that need custom tokenization rules.

When we work with a specific domain such as medicine, insurance, or finance, we often come across words, abbreviations, and entities that need special attention.


In [9]:
# default tokenization, before adding a custom special-case rule

import spacy
from spacy.symbols import ORTH

nlp = spacy.load('en_core_web_md')

doc = nlp("lemme that")

print([i.text for i in doc])


['lemme', 'that']


In [10]:
special_case = [{ORTH:"lem"} , {ORTH:"me"}]
# ORTH stands for orthography, i.e. the verbatim text of the token
nlp.tokenizer.add_special_case("lemme", special_case)
# We defined a special case in which the word "lemme" is tokenized as two tokens, "lem" and "me",
# and then added the rule to the nlp object's tokenizer.

print([i.text for i in nlp("lemme that")])

['lem', 'me', 'that']


In [11]:
# similarly

print([i.text for i in nlp("lemme that!")])
# the special case still applies, and the trailing ! is split off as its own token.

['lem', 'me', 'that', '!']


Debugging the tokenizer with nlp.tokenizer.explain(sentence), which returns one (rule, token text) tuple per token:

In [15]:
import spacy
nlp = spacy.load("en_core_web_md")
text = "Let's go!"
doc = nlp(text)

# the explainer !
tok_exp = nlp.tokenizer.explain(text)

for t in tok_exp:
    # each item is a (rule, token_text) tuple, so only t[0] and t[1] exist;
    # t[2] would raise an IndexError
    print(t[1], "\t", t[0])

Let 	 SPECIAL-1
's 	 SPECIAL-2
go 	 TOKEN
! 	 SUFFIX


#### Tokenizing sentences!!

spaCy uses the dependency parser to split text into sentences.

In [16]:
import spacy

nlp = spacy.load("en_core_web_md")

text = "I flied to N.Y yesterday. It was around 5 pm."

doc = nlp(text)

for sent in doc.sents:

    print(sent.text)

I flied to N.Y yesterday.
It was around 5 pm.
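
When no parser is loaded, sentence boundaries can instead come from spaCy's rule-based sentencizer component. The sketch below assumes spaCy v3 and a blank English pipeline. It is a swapped-in alternative for illustration, not what en_core_web_md does, and because it splits only on sentence-final punctuation its boundaries can differ from the parser's.

```python
import spacy

# Assumption: blank pipeline with the rule-based sentencizer, no parser.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("I flew to Boston yesterday. It was late.")

# doc.sents works as soon as some component has set sentence boundaries
sentences = [sent.text for sent in doc.sents]
print(sentences)
```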


#### Lemmatization
A lemma is the base form of a token, e.g.
eating -> eat
sitting -> sit

In [19]:
import spacy
nlp = spacy.load('en_core_web_md')

#doc = nlp("I go there and will eat an ice-cream.")

doc = nlp("I went there and will eats an ice-creams.")

for token in doc:
    print(token.text , token.lemma_)

I I
went go
there there
and and
will will
eats eat
an an
ice ice
- -
creams cream
. .


Playing with Token attributes:

- token.text
- token.text_with_ws
- token.i
- token.idx
- token.doc
- token.sent
- token.is_sent_start
- token.ent_type

In [20]:
doc = nlp("Hello doctor!")
doc[0]

Hello

In [23]:
# token.text
# -- doc = nlp("Hello doctor!")
print(doc.text)
print(doc[0].text)

Hello doctor!
Hello


In [24]:
# token.text_with_ws returns the token text plus its trailing whitespace, if any

doc[0].text_with_ws

'Hello '

In [26]:
# len(doc) is the number of tokens; len(token) is the number of characters in the token
print(len(doc))
print(len(doc[0]))


3
5


In [29]:
# token.i gives the index of the token within the Doc

token = doc[0]
token.i

0

In [30]:
# token.idx gives the character offset of the token within the document text
doc[0].idx

0

In [31]:
doc[1].idx
# returns 6 because "doctor", at doc[1], starts at character offset 6.

6
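
The difference between token.i (position in the token sequence) and token.idx (character offset in the text) is easiest to see by slicing the original string with the offsets. This is a minimal sketch that assumes a blank English pipeline, since only the tokenizer is needed.

```python
import spacy

nlp = spacy.blank("en")  # assumption: tokenizer only, no trained model
doc = nlp("Hello doctor!")

for token in doc:
    start = token.idx
    end = token.idx + len(token)  # len(token) is the character count
    # slicing the raw text with the offsets recovers the token text
    print(token.i, token.idx, repr(doc.text[start:end]))

offsets = [(token.i, token.idx) for token in doc]
```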

In [32]:
# using entities

doc = nlp("Tim Cook is CEO")
doc.ents

(Tim Cook,)

In [37]:
doc[3].ent_type_
# an empty string '' means the token is not part of an entity
# e.g. "is" and "CEO" (doc[3]) are not entities
# however, "Tim" and "Cook" belong to an entity of type PERSON.

''
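
The link between doc.ents and token.ent_type_ can also be seen without a trained model by setting an entity by hand. In this sketch the PERSON span is assigned manually on a blank pipeline, an assumption made so the example is deterministic. With en_core_web_md the entity recognizer fills these values in instead.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # assumption: no entity recognizer is loaded
doc = nlp("Tim Cook is CEO")

# Manually mark tokens 0 and 1 as one PERSON entity (illustration only)
doc.ents = [Span(doc, 0, 2, label="PERSON")]

ent_types = [(token.text, token.ent_type_) for token in doc]
print(doc.ents)
print(ent_types)
```

Tokens inside the entity span report its label, and all other tokens report an empty string.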

In [38]:
dir(doc)

['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '_bulk_merge',
 '_get_array_attrs',
 '_py_tokens',
 '_realloc',
 '_vector',
 '_vector_norm',
 'cats',
 'char_span',
 'copy',
 'count_by',
 'doc',
 'ents',
 'extend_tensor',
 'from_array',
 'from_bytes',
 'from_dict',
 'from_disk',
 'from_docs',
 'get_extension',
 'get_lca_matrix',
 'has_annotation',
 'has_extension',
 'has_unknown_spaces',
 'has_vector',
 'is_nered',
 'is_parsed',
 'is_sentenced',
 'is_tagged',
 'lang',
 'lang_',
 'mem',
 'noun_chunks',
 'noun_chunks_iterator',
 'remove_extension',
 'retokenize',
 'sentiment',
 'sents',
 'set_ents',
 'se

#### Span Object
Span objects represent phrases or segments of the text. Technically, a Span has to be a contiguous sequence of tokens. We usually don't create Span objects directly; instead, we slice a Doc object.



In [39]:
doc = nlp("Tim Cook is CEO. He is from USA.")
doc[4:]

. He is from USA.

In [40]:
doc[:4]

Tim Cook is CEO

In [41]:
doc[3:8]

CEO. He is from

In [42]:
doc[9:]

.

In [43]:
doc[10]  # raises IndexError: the doc has 10 tokens, so valid indices are 0 to 9

IndexError: [E040] Attempt to access token at 10, max length 10.
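
A slice of a Doc is a Span that remembers where it sits in the parent document. The following sketch assumes a blank English pipeline, whose tokenizer splits the sentence into the same 10 tokens, and inspects the slice boundaries.

```python
import spacy

nlp = spacy.blank("en")  # assumption: tokenizer only
doc = nlp("Tim Cook is CEO. He is from USA.")

span = doc[3:8]  # tokens 3..7, the end index is exclusive

print(type(span).__name__)
print(span.text)
print(span.start, span.end, len(span))
# tokens inside a Span keep their index in the parent Doc
print(span[0].text, span[0].i)
```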

### Commonly used spaCy features

token.lower_ returns the token in lowercase.

is_alpha returns True if all the characters of the token are alphabetic letters.

is_ascii returns True if all the characters of token are ASCII characters.

is_digit returns True if all the characters of the token are numbers.

is_punct returns True if the token is a punctuation mark.

is_left_punct and is_right_punct return True if the token is a left punctuation mark or right punctuation mark, respectively.

is_space returns True if the token is only whitespace characters.

is_bracket returns True for bracket characters.

is_quote returns True for quotation marks.


is_currency returns True for currency symbols such as $ and €.

like_url, like_num, and like_email are attributes about the token shape and return True if the token looks like a URL, a number, or an email address, respectively.

is_oov and is_stop are semantic features, as opposed to the preceding shape features. is_oov returns True if the token is out of vocabulary (OOV), that is, the loaded model has no word vector for it. OOV words are unknown to the language model.
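
The attributes listed above can be tried out in a few lines. This sketch assumes a blank English pipeline, which is enough because these flags are computed from the token text and the language defaults. is_oov is left out because it depends on the word vectors of a trained model. The sample sentence and the `checks` list are made up for illustration.

```python
import spacy

nlp = spacy.blank("en")  # assumption: no trained model needed for these flags
doc = nlp("Email me at joe@example.com or pay $20 today!")

checks = ["is_alpha", "is_punct", "is_currency", "like_num", "like_email", "is_stop"]

# For every token, collect the names of the flags that are True
true_flags = {
    token.text: [name for name in checks if getattr(token, name)]
    for token in doc
}

for text, flags in true_flags.items():
    print(text, flags)
```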


### token.shape_
token.shape_ is an unusual feature that few other NLP libraries offer. It returns a string that shows a token's orthographic features: digits are replaced with d, uppercase letters with X, and lowercase letters with x, while other characters such as punctuation are kept unchanged.

In [52]:
# example of token.shape_

doc = nlp("Tim Cook is Apple CEO in 2021.")

for token in doc:
    print(token.text , token.shape_)
    
# every uppercase letter is printed as X
# every lowercase letter is printed as x
# every digit is printed as d, so 123 becomes ddd
# other characters, such as ., stay as they are

Tim Xxx
Cook Xxxx
is xx
Apple Xxxxx
CEO XXX
in xx
2021 dddd
. .
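
The mapping can be approximated in plain Python. The function below is a hypothetical sketch (the name sketch_shape is not part of spaCy) written from the rule described above, plus the documented detail that spaCy truncates runs of the same shape character after four. It is not spaCy's own source code.

```python
def sketch_shape(text: str) -> str:
    """Approximate token.shape_: X/x for letters, d for digits, runs cut at 4."""
    shape = []
    last = ""
    run = 0  # how many times the current shape character has repeated
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept as they are
        if shape_char == last:
            run += 1
        else:
            run = 0
            last = shape_char
        if run < 4:  # keep at most four of the same character in a row
            shape.append(shape_char)
    return "".join(shape)


for word in ["Tim", "Apple", "CEO", "2021", ".", "internationalization"]:
    print(word, sketch_shape(word))
```

Truncating long runs keeps the number of distinct shapes small, so long lowercase words all collapse to the same shape.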


**is_stop** is a feature that is frequently used when preparing text for machine learning algorithms. Often, we filter out words that do not carry much meaning, such as the, a, an, and, just, with, and so on. Such words are called stop words.
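
Filtering stop words is a one-line list comprehension over the Doc. The sketch assumes a blank English pipeline, which already carries spaCy's default English stop word list. The sample sentence is made up for illustration.

```python
import spacy

nlp = spacy.blank("en")  # assumption: language defaults only, no trained model
doc = nlp("the cat sat on the mat with a hat")

# keep only the tokens that are not stop words
content_words = [token.text for token in doc if not token.is_stop]
print(content_words)
```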