<h2 align="center">spaCy Tokenization Tutorial</h2>

In [1]:
import spacy


Create a blank language object and tokenize the words in a sentence.

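In [2]:
nlp = spacy.blank("en")
# nlp = spacy.blank("bn")  # a blank Bengali pipeline works the same way

doc = nlp("Dr. Strange loves pav bhaji of mumbai as it costs only 2$ per plate.")
# doc = nlp("আমাদের দেশে অনেক নদী-নালা আছে।")  # "There are many rivers and canals in our country."

# Iterating over a Doc yields one Token at a time
for token in doc:
    print(token)

Dr.
Strange
loves
pav
bhaji
of
mumbai
as
it
costs
only
2
$
per
plate
.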


Creating a blank language object gives you a tokenizer and an empty pipeline. We will look at language pipelines in more detail in the next tutorial.

<img src="spacy_blank_pipeline.jpg" height="100" width="500"/>
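The claim about the empty pipeline can be checked directly. The sketch below is a minimal check that assumes only that spaCy is installed (no trained model is needed). It lists the pipeline components through `nlp.pipe_names` and then tokenizes a short sentence to show that the tokenizer still runs.

```python
import spacy

# A blank pipeline has a tokenizer but no processing components
nlp = spacy.blank("en")
print(nlp.pipe_names)  # names of the components that run after the tokenizer

# Tokenization still works because the tokenizer is not a pipeline component
doc = nlp("Dr. Strange loves pav bhaji.")
tokens = [token.text for token in doc]
print(tokens)
```

Components such as a tagger or parser only appear in `pipe_names` once they are added to the pipeline or a trained pipeline is loaded.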

<h3>Using index to grab tokens</h3>

In [4]:
doc[0]

Dr.

In [5]:
token = doc[1]
token.text

'Strange'

In [6]:
type(nlp)

spacy.lang.en.English

In [7]:
type(doc)

spacy.tokens.doc.Doc

In [8]:
type(token)

spacy.tokens.token.Token

<h3>Span object</h3>

In [10]:
span = doc[0:5]
span

Dr. Strange loves pav bhaji

In [11]:
type(span)

spacy.tokens.span.Span
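A `Span` is a slice of a `Doc` and follows Python's slicing rules. The short sketch below rebuilds the same English sentence on a blank pipeline and inspects the span's boundaries. The variable names are illustrative only.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Dr. Strange loves pav bhaji of mumbai as it costs only 2$ per plate.")

span = doc[0:5]              # tokens 0..4; the end index is exclusive
print(span.text)
print(span.start, span.end)  # token offsets of the span within the Doc
print(len(span))             # number of tokens in the span

last = doc[-1]               # negative indices count from the end of the Doc
print(last.text)
```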

<h3>Token attributes</h3>

In [12]:
doc = nlp("Tony gave two $ to Peter.")

In [13]:
token0 = doc[0]
token0

Tony

In [14]:
token0.is_alpha

True

In [15]:
token0.like_num

False

In [16]:
token2 = doc[2]
token2

two

In [17]:
token2.like_num

True

In [18]:
token3 = doc[3]
token3

$

In [19]:
token3.like_num

False

In [20]:
token3.is_currency

True

In [21]:
for token in doc:
    print(token, "==>", "index: ", token.i, "is_alpha:", token.is_alpha, 
          "is_punct:", token.is_punct, 
          "like_num:", token.like_num,
          "is_currency:", token.is_currency,
         )

Tony ==> index:  0 is_alpha: True is_punct: False like_num: False is_currency: False
gave ==> index:  1 is_alpha: True is_punct: False like_num: False is_currency: False
two ==> index:  2 is_alpha: True is_punct: False like_num: True is_currency: False
$ ==> index:  3 is_alpha: False is_punct: False like_num: False is_currency: True
to ==> index:  4 is_alpha: True is_punct: False like_num: False is_currency: False
Peter ==> index:  5 is_alpha: True is_punct: False like_num: False is_currency: False
. ==> index:  6 is_alpha: False is_punct: True like_num: False is_currency: False
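The same attributes can be used as filters instead of being printed token by token. The sketch below groups the tokens of the sentence above by attribute, using list comprehensions. The list names are chosen here for illustration.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Tony gave two $ to Peter.")

# Each comprehension keeps the tokens for which one attribute is True
numbers = [token.text for token in doc if token.like_num]
currencies = [token.text for token in doc if token.is_currency]
words = [token.text for token in doc if token.is_alpha]

print(numbers)
print(currencies)
print(words)
```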


<h3>Collecting student email addresses from a student information sheet</h3>

In [22]:
with open("students.txt") as f:
    text = f.readlines()
text

['Dayton high school, 8th grade students information\n',
 '\n',
 'Name\tbirth day   \temail\n',
 '-----\t------------\t------\n',
 'Virat   5 June, 1882    virat@kohli.com\n',
 'Maria\t12 April, 2001  maria@sharapova.com\n',
 'Serena  24 June, 1998   serena@williams.com \n',
 'Joe      1 May, 1997    joe@root.com\n',
 '\n',
 '\n',
 '\n']

In [23]:
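text = " ".join(text)
text

'Dayton high school, 8th grade students information\n \n Name\tbirth day   \temail\n -----\t------------\t------\n Virat   5 June, 1882    virat@kohli.com\n Maria\t12 April, 2001  maria@sharapova.com\n Serena  24 June, 1998   serena@williams.com \n Joe      1 May, 1997    joe@root.com\n \n \n \n'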


In [24]:
doc = nlp(text)
# Keep only the tokens that look like email addresses
emails = []
for token in doc:
    if token.like_email:
        emails.append(token.text)
emails

['virat@kohli.com',
 'maria@sharapova.com',
 'serena@williams.com',
 'joe@root.com']
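The loop above can be wrapped in a small reusable function. The sketch below is one way to do it: `extract_emails` is a hypothetical helper name, not part of spaCy, and the sample text is two rows copied from the sheet above, so that the example runs without `students.txt`.

```python
import spacy

def extract_emails(text, nlp):
    """Return the text of every token that spaCy flags as email-like."""
    return [token.text for token in nlp(text) if token.like_email]

nlp = spacy.blank("en")

# Two rows from the student sheet, kept in memory instead of read from a file
sample = (
    "Virat   5 June, 1882    virat@kohli.com\n"
    "Maria\t12 April, 2001  maria@sharapova.com\n"
)

emails = extract_emails(sample, nlp)
print(emails)
```

Passing `nlp` in as an argument avoids rebuilding the pipeline on every call.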

<h3>Support in other languages</h3>

spaCy supports many languages, but not all of them ship with trained pipelines. See https://spacy.io/usage/models#languages for the full list.

In [28]:
nlp = spacy.blank("bn")
# Bengali: "I went to the market with five thousand taka."
doc = nlp("আমি পাঁচ হাজার ৳ নিয়ে বাজারে গেলাম।")
for token in doc:
    #print(token, token.is_currency)
    print(token, token.like_num)

আমি False
পাঁচ False
হাজার False
৳ False
নিয়ে False
বাজারে False
গেলাম False
। False
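In the run above, `like_num` is `False` even for পাঁচ ("five") and হাজার ("thousand"). One plausible reading, stated here as an assumption, is that number words are only recognized when the language data defines them. The sketch below runs the same check on an English and a Bengali blank pipeline so the two can be compared. `number_like_tokens` is a hypothetical helper, and the English sentence is a rough translation added for this comparison.

```python
import spacy

def number_like_tokens(lang_code, text):
    """Tokenize text with a blank pipeline and keep the number-like tokens."""
    nlp = spacy.blank(lang_code)
    return [token.text for token in nlp(text) if token.like_num]

english = number_like_tokens("en", "I took five thousand rupees to the market.")
bengali = number_like_tokens("bn", "আমি পাঁচ হাজার ৳ নিয়ে বাজারে গেলাম।")

print(english)
print(bengali)  # the result may differ between spaCy versions
```

If the Bengali list comes back empty, spelled-out numbers would need to be handled separately for that language, for example with a custom word list.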
