<a href="https://colab.research.google.com/github/michalis0/DataMining_and_MachineLearning/blob/master/week9/Tokenization_stemming_lemmatization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Analytics Tokenization - Stemming - Lemmatization

There are two main packages for text processing and analytics in Python: NLTK and spaCy. spaCy is the "new kid on the block" and should give better and faster results than NLTK, but both packages are fine for getting started with text analytics.

We will play a bit with spaCy and NLTK.

In [1]:
# we update and install spaCy
!pip install -U spacy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting spacy
  Downloading spacy-3.4.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.4 MB)
Installing collected packages: spacy
  Attempting uninstall: spacy
    Found existing installation: spacy 3.4.2
    Uninstalling spacy-3.4.2:
      Successfully uninstalled spacy-3.4.2
Successfully installed spacy-3.4.3


In [2]:
# we download the small English pipeline
# (shortcuts like 'en' are deprecated as of spaCy v3.0; use the full package name)
!python -m spacy download en_core_web_sm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


Load the spaCy language model.

In [3]:
import spacy
sp = spacy.load('en_core_web_sm')

## Tokenization

We create a simple spaCy document.


In [4]:
sentence = sp(u'The Royal Swedish Academy of Sciences has awarded the Nobel Prize in Physics 2019 "for contributions to our understanding of the evolution of the universe and Earth\'s place in the cosmos"')

SpaCy automatically breaks your document into tokens when a document is created using the model.

A token simply refers to an individual part of a sentence having some semantic value. Let's see what tokens we have in our document:

In [5]:
for word in sentence:
    print(word.text)

The
Royal
Swedish
Academy
of
Sciences
has
awarded
the
Nobel
Prize
in
Physics
2019
"
for
contributions
to
our
understanding
of
the
evolution
of
the
universe
and
Earth
's
place
in
the
cosmos
"


These are the tokens in our document. We can also see the part of speech (POS) of each token using the `.pos_` attribute, as shown below. POS tagging is particularly useful for words that can take several POS tags. For instance, the word "fish" can be a noun or a verb, depending on the context.

In [6]:
s1= sp("I like to fish")

for word in s1:
    print(word.text,  word.pos_)
print()
s2= sp("The fish jumped out of my hand")
for word in s2:
    print(word.text,  word.pos_)


I PRON
like VERB
to PART
fish VERB

The DET
fish NOUN
jumped VERB
out ADP
of ADP
my PRON
hand NOUN


You see how powerful POS tagging is! 

Let's visualize it.



In [7]:
from spacy import displacy

sen = sp(u"The fish jumped out of my hand")
displacy.render(sen, style='dep', jupyter=True, options={'distance': 85})

In [11]:
# create another sentence
sentence2 = sp(u"She isn't looking to buy an apartment.")

# and we print out the coarse POS tag, the fine-grained tag, and its explanation
for word in sentence2:
    print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

She          PRON       PRP      pronoun, personal
is           AUX        VBZ      verb, 3rd person singular present
n't          PART       RB       adverb
looking      VERB       VBG      verb, gerund or present participle
to           PART       TO       infinitival "to"
buy          VERB       VB       verb, base form
an           DET        DT       determiner
apartment    NOUN       NN       noun, singular or mass
.            PUNCT      .        punctuation mark, sentence closer


Notice that "isn't" is split into 2 tokens: "is" and "n't".

You can also break down a document into sentences.

In [12]:
document = sp(u'The Big Bang model describes the universe from its very first moments.  Even today, this ancient radiation is all around us.')
for i,sentence in enumerate(document.sents):
    print(i, ":", sentence)

0 : The Big Bang model describes the universe from its very first moments.  
1 : Even today, this ancient radiation is all around us.
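
To make the idea of sentence segmentation concrete, here is a minimal, library-free sketch. It is not how spaCy finds sentence boundaries; the helper name `naive_sentence_split` and the regular expression are assumptions chosen purely for illustration.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '!' or '?' when whitespace follows (illustrative only)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

text = ("The Big Bang model describes the universe from its very first moments.  "
        "Even today, this ancient radiation is all around us.")
sentences = naive_sentence_split(text)
for i, s in enumerate(sentences):
    print(i, ":", s)
```

A rule this simple would also split after abbreviations such as "U.K." whenever whitespace follows, which is one reason to prefer a full pipeline for real text.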


In [13]:
# more elaborate tokenization
sentence3 = sp(u'I\'m leaving U.K. for U.S.A.')
for word in sentence3:
    print(word.text)

I
'm
leaving
U.K.
for
U.S.A.


You see that U.K. and U.S.A. are each recognized as a single token instead of being split at the periods.

In [14]:
# and another one

sentence4 = sp(u"Hello, I am Michalis from Zurich, Switzerland, email me at michalis@gmail.com")
for word in sentence4:
    print(word.text)

Hello
,
I
am
Michalis
from
Zurich
,
Switzerland
,
email
me
at
michalis@gmail.com


The email was correctly recognized as one token.
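
For comparison, here is a rough, library-free sketch of rule-based tokenization with a single regular expression. It is only an approximation for illustration and not spaCy's tokenizer; the pattern `TOKEN_RE` and the helper `naive_tokenize` are hypothetical names introduced here.

```python
import re

# Hypothetical pattern: try an email-like shape first, then runs of word
# characters, then any single punctuation mark.
TOKEN_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+|\w+|[^\w\s]")

def naive_tokenize(text):
    """Return the tokens matched by TOKEN_RE, in order."""
    return TOKEN_RE.findall(text)

tokens = naive_tokenize(
    "Hello, I am Michalis from Zurich, Switzerland, email me at michalis@gmail.com"
)
print(tokens)
```

On this sentence the sketch keeps the email address together, but it would break "U.K." into four pieces and "I'm" into three. Handling such cases is what spaCy's tokenizer exceptions and prefix/suffix rules are for.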

## Stemming

Stemming refers to reducing a word to its root form. In natural language processing tasks you will often encounter different words that share the same root, for instance compute, computer, computing, and computed. You may want to reduce such words to their root form for the sake of uniformity. This is where stemming comes into play.

SpaCy does not include stemming; we will use NLTK. The most popular stemmer (for English) is the "Porter Stemmer". 

In [15]:
# import only the stemmer class instead of using a wildcard import
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
tokens = ['compute', 'computer', 'computed', 'computing']
for token in tokens:
    print(token + ' --> ' + stemmer.stem(token))

compute --> comput
computer --> comput
computed --> comput
computing --> comput
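
To give a feel for what a stemmer does, here is a deliberately crude suffix-stripping sketch. It is not the Porter algorithm (which applies several ordered rule phases with conditions on the remaining stem); the suffix list and the helper `naive_stem` are assumptions made for this example only.

```python
# Hand-picked suffix list (an assumption for this example), longest first.
SUFFIXES = ("ing", "ed", "er", "e")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

stems = {w: naive_stem(w) for w in ["compute", "computer", "computed", "computing"]}
for word, stem in stems.items():
    print(word, "-->", stem)
```

All four words end up as "comput" here as well, but such a naive rule set mangles many other words, so use a real stemmer in practice.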


## Lemmatization

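Lemmatization is less aggressive than stemming: instead of chopping off suffixes, it maps each word to its dictionary form (the lemma), so the result is a real word.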

In [16]:
sentence = sp(u'run runs running runner talks talk talking talked')

for word in sentence:
    print(word.text," --> ", word.lemma_)

run  -->  run
runs  -->  run
running  -->  run
runner  -->  runner
talks  -->  talk
talk  -->  talk
talking  -->  talk
talked  -->  talk


## Stopwords

Stop words are very common words such as "the", "a", "an", etc. that carry little meaning on their own. They are often not very useful for NLP tasks such as text classification, so it is often better to remove them before further processing of the document.

spaCy's English stop word list contains 326 entries in the version used here (the exact number can differ between releases). In addition, depending upon our requirements, we can also add stop words to the list or remove them from it.

To see the default spaCy stop words, we can use the `STOP_WORDS` set in `spacy.lang.en.stop_words`, as shown below:

In [17]:
from spacy.lang.en.stop_words import STOP_WORDS

spacy_stopwords = STOP_WORDS

# Printing the total number of stop words:
print('Number of stop words: %d' % len(spacy_stopwords))

# Printing the first twenty stop words (a set is unordered, so the selection can vary):
print('First twenty stop words: %s' % list(spacy_stopwords)[:20])

Number of stop words: 326
First twenty stop words: ['after', 'is', 'indeed', 'am', 'several', '‘d', 'somewhere', 'seeming', 'perhaps', 'whereby', 'make', 'latterly', 'me', 'no', 'against', 'front', 'can', 'full', 'already', 'between']


In [18]:
# typically we should remove stopwords from our text.

text = "There are many documents that contain stopwords they are not very useful"
filtered_sentence=[]
doc = sp(text)

# filtering stop words
for word in doc:
    if not word.is_stop:
        filtered_sentence.append(word)
print("Filtered Sentence:",filtered_sentence)

Filtered Sentence: [documents, contain, stopwords, useful]
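
Stripped of the library, stop word removal is just a set membership test. The sketch below assumes a tiny hand-picked stop list (`TOY_STOP_WORDS`, a hypothetical name; spaCy's own list is much larger) and plain whitespace splitting instead of a real tokenizer.

```python
# Tiny hand-picked stop list (an assumption; spaCy's own list is much larger).
TOY_STOP_WORDS = {"there", "are", "many", "that", "they", "not", "very"}

def remove_stop_words(text, stop_words=TOY_STOP_WORDS):
    """Keep the whitespace-separated words whose lowercase form is not a stop word."""
    return [w for w in text.split() if w.lower() not in stop_words]

kept = remove_stop_words(
    "There are many documents that contain stopwords they are not very useful"
)
print(kept)
```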


In [19]:
# is a word a stopword?
sp.vocab['wonder'].is_stop

False

## Detecting entities

While we are at it, we can see that it's very easy to detect entities (this is called **Named Entity Recognition**). spaCy comes with a pre-trained classifier that detects important entities: locations, times, people, money, etc.

To get the named entities from a document, use its `ents` attribute. Let's retrieve the named entities from `sentence4`, the sentence with the email address above:

In [20]:
for entity in sentence4.ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))

Zurich - GPE - Countries, cities, states
Switzerland - GPE - Countries, cities, states


## Exercise

Use the sentence below. 

- How many tokens does it have?
- How many entities are recognized?


In [21]:
sentence = "This year's Nobel Prize in economics was awarded to three scholars who revolutionized the effort to end global poverty: Abhijit Banerjee and Esther Duflo of MIT and Michael Kremer of Harvard are essentially credited with applying the scientific method to an enterprise that, until recently, was largely based on gut instincts."

# show them. how many tokens

# show them. how many entities



## Advanced Entities and showing them in the text

Of course there are many more entities that we can detect in a text. Let's see an example.

In [22]:
from spacy import displacy

nytimes= sp(u"""New York City on Tuesday declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak, becoming the latest national flash point over refusals to inoculate against dangerous diseases.

At least 285 people have contracted measles in the city since September, mostly in Brooklyn’s Williamsburg neighborhood. The order covers four Zip codes there, Mayor Bill de Blasio (D) said Tuesday.

The mandate orders all unvaccinated people in the area, including a concentration of Orthodox Jews, to receive inoculations, including for children as young as 6 months old. Anyone who resists could be fined up to $1,000.""")

entities=[(i, i.label_, i.label) for i in nytimes.ents]
entities

[(New York City, 'GPE', 384),
 (Tuesday, 'DATE', 391),
 (At least 285, 'CARDINAL', 397),
 (September, 'DATE', 391),
 (Brooklyn, 'GPE', 384),
 (Williamsburg, 'PERSON', 380),
 (four, 'CARDINAL', 397),
 (Zip, 'PERSON', 380),
 (Bill de Blasio, 'PERSON', 380),
 (Tuesday, 'DATE', 391),
 (Orthodox Jews, 'NORP', 381),
 (6 months old, 'DATE', 391),
 (up to $1,000, 'MONEY', 394)]

In [23]:
displacy.render(nytimes, style = "ent",jupyter = True)

Coming back to the stemming example: all 4 words were reduced to "comput", which isn't actually a word (but it can be considered a token), and this shows that the 4 words have something in common.

# Text Representation (Bag of words)

In [24]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

texts = [ "I like the Matrix and the Patriot", "I did not like the Ants movie", "I hate comedies, but the "]
# texts = [
#     "Walks like a duck, talks like a duck", 
#     "Beijing duck is the dish I like",
#     "Roger Rabbit has the recipe of success",
#     "A recipe for rabbit",
#     "A recipe for Beijing duck"
# ]

# using default tokenizer 
count = CountVectorizer(ngram_range=(1,2))
bow = count.fit_transform(texts)

# Show feature matrix
bow.toarray()

array([[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 2, 0,
        1, 1],
       [0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1,
        0, 0],
       [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 0]])

In [25]:
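# Get feature names (get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is available from scikit-learn 1.0)
feature_names = list(count.get_feature_names_out())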

# View feature names
feature_names



['and',
 'and the',
 'ants',
 'ants movie',
 'but',
 'but the',
 'comedies',
 'comedies but',
 'did',
 'did not',
 'hate',
 'hate comedies',
 'like',
 'like the',
 'matrix',
 'matrix and',
 'movie',
 'not',
 'not like',
 'patriot',
 'the',
 'the ants',
 'the matrix',
 'the patriot']

In [26]:
# show as a dataframe
pd.DataFrame(
    bow.todense(), 
    columns=feature_names
    )

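| | and | and the | ants | ants movie | but | but the | comedies | comedies but | did | did not | ... | matrix | matrix and | movie | not | not like | patriot | the | the ants | the matrix | the patriot |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 1 | 2 | 0 | 1 | 1 |
| 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |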
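
To see where these counts come from, here is a small library-free sketch that counts 1-grams and 2-grams for a single text. The helper `ngram_counts` is a hypothetical name; the regular expression mirrors `CountVectorizer`'s default `token_pattern`, which lowercases the text and keeps only tokens of two or more word characters (which is why "I" does not appear among the features).

```python
import re
from collections import Counter

def ngram_counts(text, ngram_range=(1, 2)):
    """Count all n-grams of the text for n in the inclusive ngram_range."""
    # lowercase, then keep tokens of 2+ word characters
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    lo, hi = ngram_range
    counts = Counter()
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts("I like the Matrix and the Patriot")
for term in sorted(counts):
    print(term, counts[term])
```

The non-zero entries should line up with the first row of the matrix above; the vectorizer additionally builds one shared vocabulary across all texts, which this sketch leaves out.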


## Exercise:

Above, we used `ngram_range=(1,2)`, i.e. 1-grams (each word on its own) and 2-grams. Change the code above to create a bag-of-words representation that includes

- A. 1-grams only
- B. 1-, 2-, and 3-grams

**Hint:** Use the `ngram_range` parameter in the `CountVectorizer`.

What do you notice? How many features does your document-term matrix have now?


## TF-IDF Representation

Recall that:

- term frequency: tf = count(word, document) / len(document)
- inverse document frequency: idf = log( len(collection) / count(document_containing_term, collection) )
- tf-idf = tf * idf

Note that the IDF value of a word is the same for every document, because it depends only on the collection as a whole (the total number of documents and how many of them contain the word). TF values, on the other hand, differ from document to document.

As an example, take the two documents "The car is driven on the road." and "The truck is driven on the highway". The first document has 7 words and "car" occurs once, so the TF of "car" in that document is 1/7 ≈ 0.14.

Now the IDF of the word "car": we have 2 documents and "car" occurs in 1 of them, so its IDF is log(2/1) ≈ 0.69 (using the natural logarithm).



Finally, the TF-IDF values are calculated by multiplying TF values with their corresponding IDF values.
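
Working the definitions above through in code may help. This is a minimal sketch, assuming whitespace tokenization, lowercasing, the natural logarithm, and the two example sentences without their final punctuation; the helper names `tf` and `idf` are introduced here for illustration.

```python
import math

docs = ["The car is driven on the road", "The truck is driven on the highway"]
tokenized = [d.lower().split() for d in docs]

def tf(term, tokens):
    """Share of the document's tokens that are equal to term."""
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    """Natural log of (number of documents / documents containing term)."""
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)  # assumes term occurs in at least one document

tf_car = tf("car", tokenized[0])
idf_car = idf("car", tokenized)
tfidf_car = tf_car * idf_car
print(round(tf_car, 3), round(idf_car, 3), round(tfidf_car, 3))
```

With this plain definition, a word that occurs in every document (such as "the") gets an IDF of 0 and therefore a TF-IDF of 0.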

**Note**: In the example below, you will not get exactly the product of those two numbers, because scikit-learn's `TfidfVectorizer` uses raw counts as TF, applies a smoothed IDF, and normalizes each row to have a norm of 1. However, the relative importance of the terms won't change.



In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
texts = [
    "The car is driven on the road.", 
    "The truck is driven on the highway"
]
# using default tokenizer in TfidfVectorizer
tfidf = TfidfVectorizer(ngram_range=(1, 1))
features = tfidf.fit_transform(texts)
pd.DataFrame(
    features.todense(),
    columns=tfidf.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
)



| | car | driven | highway | is | on | road | the | truck |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.424717 | 0.30219 | 0.0 | 0.30219 | 0.30219 | 0.424717 | 0.60438 | 0.0 |
| 1 | 0.0 | 0.30219 | 0.424717 | 0.30219 | 0.30219 | 0.0 | 0.60438 | 0.424717 |
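
The numbers above can be reproduced by hand, which shows why they differ from the textbook formula. The sketch below assumes `TfidfVectorizer`'s default settings (`smooth_idf=True`, `norm='l2'`, raw counts as TF) and uses pre-tokenized documents to keep things short; `tfidf_row` is a hypothetical helper name.

```python
import math
from collections import Counter

docs = [["the", "car", "is", "driven", "on", "the", "road"],
        ["the", "truck", "is", "driven", "on", "the", "highway"]]
n_docs = len(docs)
# document frequency: in how many documents does each term occur?
df = Counter(term for doc in docs for term in set(doc))

def tfidf_row(doc):
    """Raw count times smoothed idf, then scale the row to unit L2 norm."""
    counts = Counter(doc)
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

row0 = tfidf_row(docs[0])
for term in sorted(row0):
    print(term, round(row0[term], 6))
```

Because of the "+ 1" in the smoothed IDF, words that occur in every document (such as "the") keep a non-zero weight instead of dropping to 0.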


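## Combining scikit-learn and spaCy

Of course you can combine the two tools: here spaCy does the tokenization and lemmatization inside scikit-learn's `CountVectorizer`.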

In [29]:
import spacy
import pandas as pd
from html import unescape

# create a dataframe from a word matrix
def wordmatrix_2_df(wm, feat_names):
    # create an index label for each row (document)
    doc_names = ['Doc{:d}'.format(idx) for idx in range(wm.shape[0])]
    df = pd.DataFrame(data=wm.toarray(), index=doc_names,
                      columns=feat_names)
    return df

# create a spaCy tokenizer/lemmatizer from the trained pipeline
# (a blank spacy.lang.en.English() pipeline has no lemmatizer, so every
# token.lemma_ would be empty and all tokens would collapse into one unnamed feature)
lemmatizer = spacy.load('en_core_web_sm')

# remove html entities from docs and
# set everything to lowercase
def my_preprocessor(doc):
    return(unescape(doc).lower())

# tokenize the doc and lemmatize its tokens
def my_tokenizer(doc):
    tokens = lemmatizer(doc)
    return([token.lemma_ for token in tokens])

corpora = ['University of Lausanne', 'University of Geneva', 'University of Zurich']

custom_vec = CountVectorizer(preprocessor=my_preprocessor, tokenizer=my_tokenizer)
cwm = custom_vec.fit_transform(corpora)
# get_feature_names() was removed in scikit-learn 1.2
tokens = custom_vec.get_feature_names_out()
wordmatrix_2_df(cwm, tokens)
