# Tokenization

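**Tokenization:** Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.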

In [0]:
#!pip install spacy
#!python -m spacy download en_core_web_sm

In [0]:
import spacy

## Solve the examples from the slides using spaCy

In [0]:
nlp = spacy.load('en_core_web_sm')

example1 = nlp("This is an example of text tokenization")

for token in example1:
    print(token.text)


In [0]:
example2 = nlp("The quick brown fox jumps over the lazy dog")

for token in example2:
    print(token.text)

In [0]:
example3 = nlp("We’re the champions")

for token in example3:
    print(token.text)

In [0]:
example4 = nlp("Will we have dinner today?")

for token in example4:
    print(token.text)
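
For comparison with the spaCy cells above, here is a minimal rule-based tokenizer using only the standard library. It is a sketch, not spaCy's algorithm: the pattern and the name `simple_tokenize` are assumptions made for illustration.

```python
import re

# Assumed rule: a token is either a run of word characters
# or a single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word and punctuation tokens (illustrative only)."""
    return TOKEN_RE.findall(text)


print(simple_tokenize("Will we have dinner today?"))
print(simple_tokenize("We’re the champions"))
```

This rule breaks the contraction "We’re" into three pieces ("We", "’", "re"). Library tokenizers such as spaCy's add language-specific exception rules for cases like this, which is one reason to prefer them over a single regular expression.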

# Stemming

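**Stemming:** A heuristic process that chops off word endings to reduce inflected forms to a common base form (the stem), which need not be a dictionary word.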

## Solving the Stemming examples from the slides using NLTK
## More examples added in this notebook
### You can install NLTK from https://www.nltk.org/install.html

In [0]:
!pip install nltk



In [0]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

In [0]:
example = "Cats Running Was"
example = [stemmer.stem(token) for token in example.split(" ")]
print(" ".join(example))

cat run wa
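
The output "wa" for "Was" may look odd. It follows from the plural-stripping rules in step 1a of Porter's algorithm, sketched below in plain Python. This is not NLTK's implementation (`PorterStemmer` applies several further steps and some extensions); the function name `porter_step_1a` is hypothetical.

```python
def porter_step_1a(word):
    """Apply the four suffix rules of Porter's step 1a to a single word."""
    word = word.lower()
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS  (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]   # IES  -> I   (ponies -> poni)
    if word.endswith("ss"):
        return word        # SS   -> SS  (caress -> caress)
    if word.endswith("s"):
        return word[:-1]   # S    ->     (cats -> cat, was -> wa)
    return word


for w in ["Cats", "Was", "caresses", "ponies"]:
    print(w, "->", porter_step_1a(w))
```

The rules look only at the spelling of the word, so "was" loses its final "s" just as "cats" does.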


In [0]:
lyrics = "You better lose yourself in the music, the moment "\
+ "You own it, you better never let it go "\
+ "You only get one shot, do not miss your chance to blow "\
+ "This opportunity comes once in a lifetime "
lyrics = [stemmer.stem(token) for token in lyrics.split(" ")]
print(" ".join(lyrics))

you better lose yourself in the music, the moment you own it, you better never let it go you onli get one shot, do not miss your chanc to blow thi opportun come onc in a lifetim 


In [0]:
review = "Bromwell High is a cartoon comedy. "\
+ "It ran at the same time as some other programs about school life, such as \"Teachers\". "\
+ "My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much "\
+ "closer to reality than is \"Teachers\". The scramble to survive financially, the insightful "\
+ "students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation "\
+ ", all remind me of the schools I knew and their students. When I saw the episode in which a student "\
+ "repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. "\
+ "A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. "\
+ "I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!"

review= [stemmer.stem(token) for token in review.split(" ")]
print(" ".join(review))

bromwel high is a cartoon comedy. It ran at the same time as some other program about school life, such as "teachers". My 35 year in the teach profess lead me to believ that bromwel high' satir is much closer to realiti than is "teachers". the scrambl to surviv financially, the insight student who can see right through their pathet teachers' pomp, the petti of the whole situat , all remind me of the school I knew and their students. when I saw the episod in which a student repeatedli tri to burn down the school, I immedi recal ......... at .......... high. A classic line: inspector: i'm here to sack one of your teachers. student: welcom to bromwel high. I expect that mani adult of my age think that bromwel high is far fetched. what a piti that it isn't!


# Lemmatization

## In this notebook we solve the examples from the slides using spaCy.

In [0]:
import spacy

In [0]:
nlp = spacy.load('en_core_web_sm')

In [0]:
example1 = nlp("Animals")
for token in example1:
    print(token.lemma_)

animal


In [0]:
example2 = nlp("is am are")
for token in example2:
    print(token.lemma_)

be
be
be
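
Unlike a stemmer, a lemmatizer maps each word to a dictionary form. The simplest way to picture this is a lookup table, sketched below. The table `LEMMA_TABLE` is hand-written and hypothetical, and this is not necessarily how spaCy's lemmatizer works internally.

```python
# Hypothetical, hand-written lemma table; a real lemmatizer relies on a
# full vocabulary and usually on part-of-speech information as well.
LEMMA_TABLE = {
    "is": "be",
    "am": "be",
    "are": "be",
    "was": "be",
    "animals": "animal",
    "comes": "come",
}


def lookup_lemma(word):
    """Return the table's lemma for a word, or the lowercased word itself."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)


print([lookup_lemma(w) for w in "is am are".split()])
print(lookup_lemma("Was"))
```

With this table "Was" maps to "be", whereas the Porter stemmer in the previous section produced "wa".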


In [0]:
lyrics = "You better lose yourself in the music, the moment "\
+ "You own it, you better never let it go " \
+ "You only get one shot, do not miss your chance to blow "\
+ "This opportunity comes once in a lifetime"

example3 = nlp(lyrics)

for token in example3:
    print(token.lemma_)

-PRON-
better
lose
-PRON-
in
the
music
,
the
moment
-PRON-
own
-PRON-
,
-PRON-
better
never
let
-PRON-
go
-PRON-
only
get
one
shot
,
do
not
miss
-PRON-
chance
to
blow
this
opportunity
come
once
in
a
lifetime


# Vectorization
## In this notebook we solve the examples in the slides and more using scikit-learn
## You can install scikit-learn from: http://scikit-learn.org/stable/install.html

In [0]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(binary=True, token_pattern=r'\b[^\d\W]+\b')

In [0]:
corpus = ["The dog is on the table", "the cats now are on the table"]
vectorizer.fit(corpus)
print(vectorizer.transform(["The dog is on the table"]).toarray())

[[0 0 1 1 0 1 1 1]]


In [0]:
vocab = vectorizer.vocabulary_

for key in sorted(vocab.keys()):
    print("{}: {}".format(key, vocab[key]))

are: 0
cats: 1
dog: 2
is: 3
now: 4
on: 5
table: 6
the: 7
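
To make the vector above easier to read, the following standard-library sketch reproduces the same binary bag-of-words encoding by hand. It assumes what `CountVectorizer` does by default (lowercasing, alphabetically sorted vocabulary); the helper names are hypothetical.

```python
import re

# Same token pattern as passed to the vectorizer above: word characters without digits.
TOKEN_RE = re.compile(r"\b[^\d\W]+\b")


def tokenize(text):
    """Lowercase the text and extract the tokens matched by TOKEN_RE."""
    return TOKEN_RE.findall(text.lower())


def fit_vocabulary(corpus):
    """Map each distinct term to a column index, in alphabetical order."""
    terms = sorted({tok for doc in corpus for tok in tokenize(doc)})
    return {term: index for index, term in enumerate(terms)}


def binary_vector(text, vocab):
    """Return 1 for every vocabulary term present in the text, else 0."""
    vec = [0] * len(vocab)
    for tok in tokenize(text):
        if tok in vocab:
            vec[vocab[tok]] = 1
    return vec


corpus = ["The dog is on the table", "the cats now are on the table"]
vocab = fit_vocabulary(corpus)
print(binary_vector("The dog is on the table", vocab))
```

Each position in the vector corresponds to one vocabulary index, so the 1s at positions 2, 3, 5, 6 and 7 stand for "dog", "is", "on", "table" and "the". Because the encoding is binary, "the" contributes 1 even though it occurs twice.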


In [0]:
corpus2 = ["I am jack", "You are john", "I am john"]
vectorizer.fit(corpus2)

print(vectorizer.transform(corpus2).toarray())

[[1 0 1 1 0 0]
 [0 1 0 0 1 1]
 [1 0 1 0 1 0]]


In [0]:
vocab = vectorizer.vocabulary_

for key in sorted(vocab.keys()):
    print("{}: {}".format(key, vocab[key]))

am: 0
are: 1
i: 2
jack: 3
john: 4
you: 5


# Word2vec
## In this notebook we will play with spaCy's word vectors

In [0]:
import spacy

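# Note: en_core_web_sm ships without static word vectors, so similarity()
# relies on the pipeline's context-sensitive tensors, spaCy warns about this,
# and the scores below are only rough. Load en_core_web_lg for real word vectors.
#nlp = spacy.load('en_core_web_lg')
nlp = spacy.load('en_core_web_sm')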

In [0]:
example1 = "man woman king queen"
tokens = nlp(example1)
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

man man 1.0
man woman 0.63921684
man king 0.5535262
man queen 0.18746983
woman man 0.63921684
woman woman 1.0
woman king 0.6757708
woman queen 0.26638454
king man 0.5535262
king woman 0.6757708
king king 1.0
king queen 0.3645325
queen man 0.18746983
queen woman 0.26638454
queen king 0.3645325
queen queen 1.0


In [0]:
example1 = "walking walked swimming swam"
tokens = nlp(example1)
for token1 in tokens:
    for token2 in tokens:
        if token1.text == token2.text:
            continue
        print(token1.text, token2.text, token1.similarity(token2))

walking walked 0.11502897
walking swimming 0.7308439
walking swam 0.15747786
walked walking 0.11502897
walked swimming 0.13930066
walked swam 0.09911855
swimming walking 0.7308439
swimming walked 0.13930066
swimming swam 0.24586298
swam walking 0.15747786
swam walked 0.09911855
swam swimming 0.24586298


In [0]:
example1 = "spain russia madrid moscow"
tokens = nlp(example1)
for token1 in tokens:
    for token2 in tokens:
        if token1.text == token2.text:
            continue
        print(token1.text, token2.text, token1.similarity(token2))
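
The scores in this section come from `token1.similarity(token2)`, which in spaCy defaults to cosine similarity between the two tokens' vectors. The sketch below computes cosine similarity by hand; the three-dimensional vectors are made up for illustration and are not taken from any model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Made-up vectors, for illustration only.
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # points in the same direction as a
c = [0.0, 0.0, 3.0]  # orthogonal to a

print(cosine_similarity(a, b))  # close to 1: same direction
print(cosine_similarity(a, c))  # 0.0: nothing in common
```

Cosine similarity depends only on direction, not length, which is why a score of 1.0 appears whenever a token is compared with itself.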

In [0]:
example1 = "cat dog"
tokens = nlp(example1)
for token1 in tokens:
    for token2 in tokens:
        if token1.text == token2.text:
            continue
        print(token1.text, token2.text, token1.similarity(token2))

cat dog 0.44518518
dog cat 0.44518518


In [0]:
example1 = "cat pizza"
tokens = nlp(example1)
for token1 in tokens:
    for token2 in tokens:
        if token1.text == token2.text:
            continue
        print(token1.text, token2.text, token1.similarity(token2))

cat pizza 0.3983068
pizza cat 0.3983068


In [0]:
example1 = "flower pasta"
tokens = nlp(example1)
for token1 in tokens:
    for token2 in tokens:
        if token1.text == token2.text:
            continue
        print(token1.text, token2.text, token1.similarity(token2))

flower pasta 0.4998248
pasta flower 0.4998248


# Named Entity Recognition
## In this notebook we will explore spaCy's abilities at detecting named entities.

In [0]:
import spacy

nlp = spacy.load('en_core_web_sm')

In [0]:
example = "Google, a company founded by Larry Page and Sergey Brin in the United States of America "\
+ "has one of the world’s most advanced search engines."

doc = nlp(example)

for ent in doc.ents:
    print(ent.text, ent.label_)

Google ORG
Larry Page PERSON
Sergey Brin PERSON
the United States of America GPE
one CARDINAL


In [0]:
example = "The company's rapid growth since incorporation has triggered a chain of products, acquisitions, and partnerships beyond Google's core search engine (Google Search). It offers services designed for work and productivity (Google Docs, Google Sheets, and Google Slides), email (Gmail), scheduling and time management (Google Calendar), cloud storage (Google Drive), instant messaging and video chat (Duo, Hangouts), language translation (Google Translate), mapping and navigation (Google Maps, Waze, Google Earth, Street View), video sharing (YouTube), note-taking (Google Keep), and photo organizing and editing (Google Photos). The company leads the development of the Android mobile operating system, the Google Chrome web browser, and Chrome OS, a lightweight operating system based on the Chrome browser. Google has moved increasingly into hardware; from 2010 to 2015, it partnered with major electronics manufacturers in the production of its Nexus devices, and it released multiple hardware products in October 2016, including the Google Pixel smartphone"
doc = nlp(example)

for ent in doc.ents:
    print(ent.text, ent.label_)

Google ORG
Google Search ORG
Google Docs ORG
Google Sheets ORG
Google Slides ORG
Gmail PERSON
Google Calendar ORG
Google Drive ORG
Duo PERSON
Hangouts NORP
Google Translate ORG
Google Maps PERSON
Waze PERSON
Google Earth PERSON
Street View FAC
Google Keep PERSON
Google ORG
Android ORG
Google Chrome PRODUCT
Chrome OS ORG
Chrome ORG
Google ORG
2010 DATE
2015 DATE
Nexus ORG
October 2016 DATE
Google Pixel ORG


In [0]:
example = "U.S. officials are meeting with former Taliban members "\
+ "amid intensifying efforts to wind down America's longest war, three of the "\
+ "militant group's commanders told NBC News."

doc = nlp(example)

for ent in doc.ents:
    print(ent.text, ent.label_)

U.S. GPE
Taliban ORG
America GPE
three CARDINAL
NBC News ORG
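
When a text yields many entities, grouping them by label makes the result easier to scan and makes mislabelled items stand out. The sketch below does this with the standard library. The `(text, label)` pairs are copied from the output printed above; with spaCy they would come from iterating over `doc.ents`. The function name `group_by_label` is hypothetical.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above; with spaCy these would be
# built as [(ent.text, ent.label_) for ent in doc.ents].
entities = [
    ("U.S.", "GPE"),
    ("Taliban", "ORG"),
    ("America", "GPE"),
    ("three", "CARDINAL"),
    ("NBC News", "ORG"),
]


def group_by_label(pairs):
    """Collect entity texts under their label, keeping first-seen order."""
    groups = defaultdict(list)
    for text, label in pairs:
        groups[label].append(text)
    return dict(groups)


for label, texts in group_by_label(entities).items():
    print(label, texts)
```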


In [0]:
example = "It’s been an arduous year for German chancellor Angela Merkel, so far. "\
+ "She has battled through coalition negotiations to form a government, chivvied "\
+ "the European Union into a loose agreement on migrants, weathered insults from "\
+ "US president Donald Trump, and headed off a revolt from her interior minister. No wonder "\
+ "then one journalist at her summer news conference in Berlin today (July 20) asked if "\
+ "she was, honestly, just exhausted. “I can’t complain,” Merkel said, “I have a few days "\
+ "holiday now and am looking forward to sleeping a bit longer.”"

doc = nlp(example)

for ent in doc.ents:
    print(ent.text, ent.label_)

an arduous year DATE
German NORP
Angela Merkel PERSON
the European Union ORG
US GPE
Donald Trump PERSON
summer DATE
Berlin GPE
today DATE
July 20 DATE
Merkel ORG
a few days holiday DATE
