## Tokenization in spaCy
In spaCy, tokenization is the process of breaking a text into individual units called tokens. Tokens are typically words, numbers, punctuation marks, or other meaningful elements of the text. The tokenizer identifies token boundaries, for example by separating punctuation marks and symbols from the words they are attached to. Tokenization is an essential first step in NLP tasks because every later analysis step operates on tokens. spaCy's tokenizer is rule-based and language-specific, which lets it handle cases such as contractions, abbreviations, and hyphenated words.
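
To see why a dedicated tokenizer is needed, here is a minimal sketch in plain Python (no spaCy involved) that splits the example sentence used later in this notebook on whitespace only. It is an illustration of the problem, not of how spaCy works internally.

```python
text = '"let\'s go to N.Y.!"'

# Whitespace splitting leaves quotes and punctuation glued to the words,
# and it cannot separate the contraction "let's" into "let" and "'s".
naive_tokens = text.split()
print(naive_tokens)
```

The result contains pieces such as `"let's` and `N.Y.!"`, which is why a tokenizer has to apply extra rules for punctuation, contractions, and abbreviations.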

Following the NLP pipeline discussed in the previous notebook and shown in the image below, this notebook covers sentence and word tokenization with the **spaCy library.**

<img src = "img.jpg" width = "800px" height = "400px"></img>

**Tokenization is a process of splitting text into meaningful segments.**

<img src = "img1.jpg" width = "800px" height = "400px"></img>

In [1]:
# Let's import spaCy:
import spacy

In [2]:
# Next we create a language object. spaCy offers several ways to do this; one is to create a blank
# pipeline for the English language ('en').
nlp = spacy.blank("en")   # This pipeline only knows English tokenization rules. For another language, pass its
                          # language code or search for the available spaCy language models.

<img src = "img2.jpg" width = "600px" height = "200px"></img>

In [13]:
# Next we create a document by passing our text to 'nlp'. The text could be a sentence, a paragraph, or a multi-page document.
# Then we simply iterate over the document to get its tokens.
doc = nlp('''"let's go to N.Y.!"''')
for token in doc:
    print(token)

"
let
's
go
to
N.Y.
!
"


**The process is performed as follows:**

<img src = "img3.jpg" width = "800px" height = "600px"></img>
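
The rule that produced each token can also be inspected in code. The sketch below uses the tokenizer's `explain` method, which returns `(rule_name, token_text)` pairs; it assumes a blank English pipeline, and the variable names are chosen for illustration only.

```python
import spacy

nlp = spacy.blank("en")
text = '''"let's go to N.Y.!"'''

# explain() returns one (rule_name, token_text) pair per token,
# naming the tokenizer rule or pattern that produced it.
explained = nlp.tokenizer.explain(text)
for rule_name, token_text in explained:
    print(f"{token_text!r:10} <- {rule_name}")

# The token texts reported by explain() match the tokens of the Doc.
token_texts = [token.text for token in nlp(text)]
print(token_texts)
```

The rule names in the output show which tokens came from prefix or suffix splitting and which came from a special case such as a contraction or an abbreviation.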

In [3]:
# Let's pass some other text.
doc = nlp("Dr. Faizan visited the hospital and he ordered the necessary tools.")
for token in doc:
    print(token)

Dr.
Faizan
visited
the
hospital
and
he
ordered
the
necessary
tools
.


In [4]:
# We can choose each token individually:
doc[3]

the

In [5]:
# or take a slice, which returns a span of tokens:
doc[:4]

Dr. Faizan visited the

In [10]:
# Now let's check the type of 'nlp'; it should be an English language object.
type(nlp)

spacy.lang.en.English

In [11]:
# Similarly, 'doc' is a Doc object:
type(doc)

spacy.tokens.doc.Doc

In [14]:
# and each token is a Token object:
type(token)

spacy.tokens.token.Token
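
The same type checks can be written as a short script. This is a minimal sketch assuming a blank English pipeline; it also shows that slicing a `Doc` returns a `Span` rather than a list. The variable names `single` and `first_four` are illustrative.

```python
import spacy
from spacy.tokens import Doc, Span, Token

nlp = spacy.blank("en")
doc = nlp("Dr. Faizan visited the hospital and he ordered the necessary tools.")

single = doc[3]        # integer index -> Token
first_four = doc[:4]   # slice -> Span, a view over part of the Doc

print(type(doc).__name__, len(doc))
print(type(single).__name__, repr(single.text))
print(type(first_four).__name__, repr(first_four.text))
```

Because a `Span` keeps a reference to its `Doc`, the tokens inside it expose the same attributes as tokens accessed directly.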

In [18]:
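# Go back to the first example sentence, so that doc[1] is 'let':
doc = nlp('''"let's go to N.Y.!"''')
token1 = doc[1]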
token1

let

In [16]:
# Calling 'dir' on a Python object lists all of its attributes and methods:
token1 = doc[1]
dir(token1)

['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 'ancestors',
 'check_flag',
 'children',
 'cluster',
 'conjuncts',
 'dep',
 'dep_',
 'doc',
 'ent_id',
 'ent_id_',
 'ent_iob',
 'ent_iob_',
 'ent_kb_id',
 'ent_kb_id_',
 'ent_type',
 'ent_type_',
 'get_extension',
 'has_dep',
 'has_extension',
 'has_head',
 'has_morph',
 'has_vector',
 'head',
 'i',
 'idx',
 'iob_strings',
 'is_alpha',
 'is_ancestor',
 'is_ascii',
 'is_bracket',
 'is_currency',
 'is_digit',
 'is_left_punct',
 'is_lower',
 'is_oov',
 'is_punct',
 'is_quote',
 'is_right_punct',
 'is_sent_end',
 'is_sent_start',
 'is_space',
 'is_stop',
 'is_title',
 'is_upper',
 'lang',
 'lang_',
 'le

In [19]:
# So let's try some of these attributes.
token1.is_alpha   # Returns True because the token consists only of alphabetic characters.

True

In [20]:
token1.like_num

False

In [22]:
# Now let's have another text:
doc = nlp("She gives two $ to her brother.")

In [23]:
token2 = doc[2]
token2.text

'two'

In [25]:
token2.like_num # Returns True because 'two' is a number word

True

In [27]:
token3 = doc[3]
token3.text

'$'

In [30]:
token3.is_currency

True

In [31]:
# Let's print some attributes of each token in a loop; 'token.i' is the token's index in the document.
for token in doc:
    print(token, "==>", "index: ", token.i,
          "is_alpha:", token.is_alpha, 
          "is_punct:", token.is_punct, 
          "like_num:", token.like_num,
          "is_currency:", token.is_currency,
         )

She ==> index:  0 is_alpha: True is_punct: False like_num: False is_currency: False
gives ==> index:  1 is_alpha: True is_punct: False like_num: False is_currency: False
two ==> index:  2 is_alpha: True is_punct: False like_num: True is_currency: False
$ ==> index:  3 is_alpha: False is_punct: False like_num: False is_currency: True
to ==> index:  4 is_alpha: True is_punct: False like_num: False is_currency: False
her ==> index:  5 is_alpha: True is_punct: False like_num: False is_currency: False
brother ==> index:  6 is_alpha: True is_punct: False like_num: False is_currency: False
. ==> index:  7 is_alpha: False is_punct: True like_num: False is_currency: False
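
Instead of printing, the same flags can be collected into a list of dictionaries and then filtered. This is a small sketch assuming a blank English pipeline; the names `FLAGS`, `rows`, `number_like`, and `currency` are illustrative and not part of spaCy.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("She gives two $ to her brother.")

# Token attributes to collect for every token.
FLAGS = ("is_alpha", "is_punct", "like_num", "is_currency")

# One dict per token: its index, its text, and the value of each flag.
rows = [
    {"i": token.i, "text": token.text, **{flag: getattr(token, flag) for flag in FLAGS}}
    for token in doc
]

# Keep only the tokens for which a given flag is True.
number_like = [row["text"] for row in rows if row["like_num"]]
currency = [row["text"] for row in rows if row["is_currency"]]
print(number_like, currency)
```

Keeping the flags in a plain data structure makes it easy to pass them to other tools, for example to build a table.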


### Collecting email ids of students from students information sheet

In [33]:
# First we read the file:
with open("students.txt") as f:
    text = f.readlines()
text   # 'readlines' returns a list of lines.

['Dayton high school, 8th grade students information\n',
 '\n',
 'Name\tbirth day   \temail\n',
 '-----\t------------\t------\n',
 'Virat   5 June, 1882    virat@kohli.com\n',
 'Maria\t12 April, 2001  maria@sharapova.com\n',
 'Serena  24 June, 1998   serena@williams.com \n',
 'Joe      1 May, 1997    joe@root.com\n',
 '\n',
 '\n',
 '\n']

In [34]:
# Now join the list of lines into a single string:
text = " ".join(text)
text



In [35]:
# Next we can find the emails in the text using the 'like_email' token attribute:
doc = nlp(text)
emails = []
for token in doc:
    if token.like_email:
        emails.append(token.text)
emails 

['virat@kohli.com',
 'maria@sharapova.com',
 'serena@williams.com',
 'joe@root.com']

In [37]:
# You can work with other languages as well. Here is a Persian example (the sentence means
# "Thank you for your kindness; the amount of $6000 was added to your account."):
nlp = spacy.blank("fa")
doc = nlp("ممنون از لطف شما، مبلغ 6000$ به حساب شما اضافه شد.")
for token in doc:
    print(token, token.is_currency, token.like_num)

ممنون False False
از False False
لطف False False
شما False False
، False False
مبلغ False False
6000 False True
$ True False
به False False
حساب False False
شما False False
اضافه False False
شد False False
. False False


In [38]:
# Sometimes you want to customize the tokenizer: when a particular keyword appears in the text, the tokenizer
# should return it split into several tokens instead of one.

from spacy.symbols import ORTH

doc = nlp("gimme double cheese extra large healthy pizza")
tokens = [token.text for token in doc]
tokens

['gimme', 'double', 'cheese', 'extra', 'large', 'healthy', 'pizza']

In [39]:

nlp.tokenizer.add_special_case("gimme", [
    {ORTH: "gim"},
    {ORTH: "me"},
])
doc = nlp("gimme double cheese extra large healthy pizza")
tokens = [token.text for token in doc]
tokens    # A special case cannot change the text itself: the pieces must join back to 'gimme', so we cannot
          # return 'give' and 'me' instead of 'gim' and 'me'. We'll handle that in upcoming notebooks.

['gim', 'me', 'double', 'cheese', 'extra', 'large', 'healthy', 'pizza']
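
A special case cannot change the text, but it can attach a normalized form to a token through the `NORM` attribute. The sketch below is one possible way to record "give" for "gim"; it assumes a blank English pipeline and is not the approach the later notebooks necessarily take. It also checks that a rule which tries to change the text is rejected.

```python
import spacy
from spacy.symbols import NORM, ORTH

nlp = spacy.blank("en")

# The ORTH pieces must join back to the original string ("gim" + "me" == "gimme").
# NORM attaches a normalized form without touching the text itself.
nlp.tokenizer.add_special_case("gimme", [
    {ORTH: "gim", NORM: "give"},
    {ORTH: "me"},
])

doc = nlp("gimme double cheese pizza")
texts = [token.text for token in doc]
norms = [token.norm_ for token in doc]
print(texts)
print(norms)

# A rule whose pieces do not reproduce the original text is rejected.
try:
    nlp.tokenizer.add_special_case("gimme", [{ORTH: "give"}, {ORTH: "me"}])
    rejected = False
except ValueError:
    rejected = True
print("text-changing rule rejected:", rejected)
```

The token text stays "gim", while `token.norm_` carries the normalized form for any later step that wants to use it.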

In [40]:
# Next topic: a blank pipeline contains only the tokenizer and no other components.
# For example, if we now try to split a text into sentences, we get an error.
doc = nlp("Dr. Strange loves pav bhaji of mumbai. Hulk loves chat of delhi")
for sentence in doc.sents:
    print(sentence)

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: `nlp.add_pipe('sentencizer')`. Alternatively, add the dependency parser or sentence recognizer, or set sentence boundaries by setting `doc[i].is_sent_start`.

In [41]:
# The error occurs because no component sets sentence boundaries; in fact the pipeline is empty:
nlp.pipe_names 

[]

In [42]:
# If we add the sentencizer component, the problem is solved.
nlp.add_pipe("sentencizer")

<spacy.pipeline.sentencizer.Sentencizer at 0x1cccaf0c840>

In [45]:
# Now if we check the pipeline, the sentencizer has been added:
nlp.pipe_names

['sentencizer']

In [47]:
# So now we can split the text into sentences:
doc = nlp("Dr. Strange loves pav bhaji of mumbai. Hulk loves chat of delhi.")
for sentence in doc.sents:
    print(sentence)

Dr. Strange loves pav bhaji of mumbai. Hulk loves chat of delhi.


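* The sentencizer we added is a simple rule-based component that only looks at punctuation tokens; it has no trained knowledge of English. Here it returned the whole text as a single sentence. Note that `nlp` at this point is still the blank Persian pipeline created with 'spacy.blank("fa")' above, so the text was tokenized with Persian rules, which may be why no boundary was found after "mumbai." If we use the trained 'nlp = spacy.load("en_core_web_sm")' pipeline instead, the sentences are split correctly.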
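
To check how much the result depends on the language of the blank pipeline, the following sketch runs the rule-based sentencizer under a blank English and a blank Persian pipeline and prints the tokens and sentences for each. The helper `split_with_sentencizer` is a hypothetical name introduced here for illustration; no expected output is shown, since it depends on the tokenizer rules of each language.

```python
import spacy

TEXT = "Dr. Strange loves pav bhaji of mumbai. Hulk loves chat of delhi."

def split_with_sentencizer(lang_code, text):
    """Build a blank pipeline for lang_code, add the sentencizer, and split text."""
    nlp = spacy.blank(lang_code)
    nlp.add_pipe("sentencizer")
    doc = nlp(text)
    tokens = [token.text for token in doc]
    sentences = [sent.text for sent in doc.sents]
    return tokens, sentences

for lang_code in ("en", "fa"):
    tokens, sentences = split_with_sentencizer(lang_code, TEXT)
    print(lang_code, "tokens:", tokens)
    print(lang_code, "sentences:", sentences)
```

Comparing the two token lists shows whether the period after "mumbai" becomes its own token, which is what the sentencizer needs in order to place a sentence boundary there.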

In [51]:
# Let's see the 'nlp = spacy.load("en_core_web_sm")' pipeline:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Dr. Strange loves pav bhaji of mumbai. Hulk loves chat of delhi.")
for sentence in doc.sents:
    print(sentence)            # This will work fine.

Dr. Strange loves pav bhaji of mumbai.
Hulk loves chat of delhi.


* So the basic idea is: with a **blank pipeline (nlp = spacy.blank("en"))** we get only the tokenizer and must add every other component manually. If we load a **trained pipeline (nlp = spacy.load("en_core_web_sm"))**, all of its components are already in place.

<img src = "img4.jpg" width = "600px" height = "300px"></img>

### Exercise
(1) Think Stats is a free book for studying statistics (https://greenteapress.com/thinkstats2/thinkstats2.pdf)

This book references many websites from which you can download free datasets. You are an NLP engineer working for a company and you want to collect all dataset websites from this book. To keep the exercise simple, you are given a paragraph from the book and you want to grab all URLs from it using spaCy.

In [52]:
text='''
Look for data to help you address the question. Governments are good
sources because data from public research is often freely available. Good
places to start include http://www.data.gov/, and http://www.science.
gov/, and in the United Kingdom, http://data.gov.uk/.
Two of my favorite data sets are the General Social Survey at http://www3.norc.org/gss+website/, 
and the European Social Survey at http://www.europeansocialsurvey.org/.
'''

# TODO: Write code here
# Hint: token has an attribute that can be used to detect a url

(2) Extract all money transactions from the sentence below along with their currency. The output should be:

two $

500 €

In [53]:

transactions = "Tony gave two $ to Peter, Bruce gave 500 € to Steve"

# TODO: Write code here
# Hint: Use token.i for the index of a token and token.is_currency for currency symbol detection

In [55]:
# Let's first find the website URLs in the text:
doce = nlp(text)
URLs = []
for token in doce:
    if token.like_url:
        URLs.append(token.text)
URLs 

['http://www.data.gov/',
 'http://www.science',
 'http://data.gov.uk/.',
 'http://www3.norc.org/gss+website/',
 'http://www.europeansocialsurvey.org/.']
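
Two of the collected URLs still end with the period that closed their sentence. A minimal clean-up sketch in plain Python is shown below; the set of characters to strip (`".,;"`) is an assumption that suits this paragraph and may need adjusting for other texts.

```python
# URLs exactly as collected in the cell above.
urls = [
    "http://www.data.gov/",
    "http://www.science",
    "http://data.gov.uk/.",
    "http://www3.norc.org/gss+website/",
    "http://www.europeansocialsurvey.org/.",
]

# Strip sentence punctuation that stayed attached to the end of a URL.
cleaned = [url.rstrip(".,;") for url in urls]
print(cleaned)
```

The entry `http://www.science` stays incomplete because the URL is broken across two lines in the source paragraph; stripping punctuation cannot repair that.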

In [58]:
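# Next, extract the money transactions: a number-like token followed by a currency symbol.
doc_e = nlp(transactions)
for token in doc_e:
    next_i = token.i + 1
    # Guard against an IndexError when the number-like token is the last token.
    if token.like_num and next_i < len(doc_e) and doc_e[next_i].is_currency:
        print(token.text, doc_e[next_i].text)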

two $
500 €


### Further Reading
https://spacy.io/usage/linguistic-features#tokenization