In [1]:
import warnings
warnings.filterwarnings('ignore')

# Tokenization

Tokenization using nltk:

In [2]:
import nltk
nltk.download('punkt')

# The text I used is the song "Misty Mountains"
lyrics = """Far over the Misty Mountains cold To dungeons deep and caverns old
We must away, ere break of day,
To find our long forgotten gold.

The pines were roaring on the height,
The winds were moaning in the night.
The fire was red, it flaming spread;
The trees like torches blazed with light.

The wind was on the withered heath,
But in the forest stirred no leaf:
There shadows lay be night or day,
And dark things silent crept beneath.
 
The wind went on from West to East;
All movement in the forest ceased,
But shrill and harsh across the marsh
Its whistling voices were released.

Farewell we call to hearth and hall!
Though wind may blow and rain may fall,
We must away ere break of day
Far over the wood and mountain tall.""".replace('\n', ' ')

# Tokenizing sentences
sentences = nltk.sent_tokenize(lyrics)
# Tokenizing words
words = nltk.word_tokenize(lyrics)

[nltk_data] Downloading package punkt to /home/jrock/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
print(f'Here are the first 5 words: \n{words[:5]}.\nAnd here is the first sentence: \n{sentences[:1]}')

Here are the first 5 words: 
['Far', 'over', 'the', 'Misty', 'Mountains'].
And here is the first sentence: 
['Far over the Misty Mountains cold To dungeons deep and caverns old We must away, ere break of day, To find our long forgotten gold.']


In [4]:
print(f'There are {len(sentences)} sentences, and {len(words)} words.')

There are 7 sentences, and 153 words.


How many unique tokens are there in the text?

In [5]:
print(f'There are {len(set(words))} unique words.')

There are 99 unique words.


In [6]:
print(f'If we just split the text with pure python we have {len(lyrics.split())} words.')

If we just split the text with pure python we have 135 words.


**Why do the two word lists differ?**

In [7]:
list(set(lyrics.split()) - set(words))

['red,',
 'spread;',
 'night.',
 'fall,',
 'away,',
 'beneath.',
 'East;',
 'gold.',
 'day,',
 'heath,',
 'leaf:',
 'light.',
 'ceased,',
 'height,',
 'tall.',
 'released.',
 'hall!']

In [8]:
list(set(words) - set(lyrics.split()))

['East',
 'spread',
 ',',
 'hall',
 'leaf',
 'height',
 'fall',
 '.',
 '!',
 'red',
 'released',
 ';',
 'ceased',
 'gold',
 'beneath',
 'heath',
 ':',
 'tall',
 'light']

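We can see that `str.split` only splits on whitespace, so punctuation stays attached to the preceding word (`'hall!'`, `'leaf:'`), while the NLTK tokenizer separates it into tokens of its own (`'hall'`, `'!'`, `'leaf'`, `':'`). This is why the tokenizer returns more tokens (153) than the plain split (135).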
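
The effect can be reproduced without NLTK. The sketch below is only a rough approximation, assuming a regular expression that treats every punctuation character as its own token. It is not the algorithm `nltk.word_tokenize` uses, and the names `simple_tokenize` and `line` are made up for illustration.

```python
import re

def simple_tokenize(text):
    # Hypothetical helper: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

line = "Farewell we call to hearth and hall!"

split_tokens = line.split()            # whitespace only: 'hall!' stays glued together
regex_tokens = simple_tokenize(line)   # punctuation becomes a separate token

print(split_tokens[-1])
print(regex_tokens[-2:])
```

The plain split yields one token fewer here, because `hall!` counts as a single item.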

# Lemmatizing and Stemming

#### **What is the difference between lemmatizing and stemming?**

**Stemming reduces a word to a stem by applying rules to its form, while lemmatization uses the word's context and a vocabulary to return its dictionary form (the lemma).**
  
Stemming algorithms work by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word. The result is not guaranteed to be a real word.

Lemmatization takes into consideration the morphological analysis of the words. It needs dictionaries which the algorithm can look through to link each form back to its lemma.
* Stemming is often used in sentiment analysis, while lemmatization is often used in chatbots and question answering.
* For instance, a naive suffix-stripping stemmer would reduce ‘Caring‘ to ‘Car‘, while lemmatizing ‘Caring‘ would return ‘Care‘.

Lemmatization keeps the meaning of the word,  
stemming transforms the word by rules.  
  
If we need the meaning, use lemmatization; if not, stemming is much faster.
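
To make the contrast concrete, here is a minimal sketch using only plain Python. Both helpers are hypothetical: `naive_stem` strips a few hard-coded suffixes, and `LEMMA_TABLE` is a tiny hand-written stand-in for the dictionaries a real lemmatizer relies on. Neither reflects how NLTK or spaCy work internally.

```python
def naive_stem(word):
    # Rule-based: strip a known suffix without checking that the result is a word.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical, hand-written lookup table standing in for a real lemma dictionary.
LEMMA_TABLE = {"caring": "care", "mountains": "mountain", "were": "be"}

def lookup_lemma(word):
    # Dictionary-based: return the known lemma, or the word itself if unknown.
    return LEMMA_TABLE.get(word, word)

for word in ["caring", "mountains", "were"]:
    print(word, naive_stem(word), lookup_lemma(word))
```

The rules turn `caring` into `car` and leave the irregular form `were` untouched, while the lookup maps them to `care` and `be`.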

## Stemming

In [9]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmed = [stemmer.stem(word) for word in words]
print(f'There are {len(set(stemmed))} unique words after the stemming')

There are 92 unique words after the stemming


In [10]:
print('The stemmed words:\n' ,stemmed[:12] , '...')

The stemmed words:
 ['far', 'over', 'the', 'misti', 'mountain', 'cold', 'to', 'dungeon', 'deep', 'and', 'cavern', 'old'] ...


We can compare some of the stemmed words:

In [11]:
before_and_after = {words[idx]: stemmed[idx] for idx in range(8)} 
before_and_after

{'Far': 'far',
 'over': 'over',
 'the': 'the',
 'Misty': 'misti',
 'Mountains': 'mountain',
 'cold': 'cold',
 'To': 'to',
 'dungeons': 'dungeon'}

The stemmer changed several of the words.  
It lowercased every word, stripped the plural `s` (`Mountains` became `mountain`, `dungeons` became `dungeon`),  
and replaced the trailing `y` with `i` (`Misty` became `misti`).

## Stemmer implementation

Here is an implementation of a simple stemmer:

In [12]:
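def stemmer(string):
    """A toy suffix-stripping stemmer (note: rebinds the name `stemmer` used for PorterStemmer above)."""
    # cutting the end of a verb
    if string[-3:] in ['ize', 'ing']:
        return string[:-3]

    # cutting the end of a noun, unless the word is just the suffix itself
    if string[-4:] in ['ment', 'ship'] and string not in ['ment', 'ship']:
        return string[:-4]
    return string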

In [13]:
imp_stemmed = [stemmer(word) for word in words]
before_and_after_imp = {words[idx]: imp_stemmed[idx] for idx in range(len(words))} 
before_and_after_imp['whistling']

'whistl'

## Lemmatization

We will use the spaCy library to lemmatize the text and compare the output with the stemming performed above. First we load spaCy's small English model, `en_core_web_sm`,  
which contains spaCy's saved data about how to process English text. Then we use it to lemmatize:

In [22]:
import spacy
# spacy.prefer_gpu()
nlp = spacy.load('en_core_web_sm')
lem = [token.lemma_.lower() for token in nlp(lyrics)]
print(f'There are {len(set(lem))} unique words after the lemmatization')

There are 94 unique words after the lemmatization


And now let's see some before and after:

In [23]:
before_and_after_lem = {words[idx]: lem[idx] for idx in range(13)}
before_and_after_lem

{'Far': 'far',
 'over': 'over',
 'the': 'the',
 'Misty': 'misty',
 'Mountains': 'mountains',
 'cold': 'cold',
 'To': 'to',
 'dungeons': 'dungeon',
 'deep': 'deep',
 'and': 'and',
 'caverns': 'cavern',
 'old': 'old',
 'We': 'we'}

Unlike stemming, lemmatization returns real words:  
`Misty` became `misty`, whereas the stemmer turned it into `misti`.

# Stop Words

Removing stop words (very common words that contribute little to the text's meaning) can reduce the noise for NLP models.

In [24]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /home/jrock/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Let's have a look at the list of stop words:

In [25]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Now filter the stop words out of the lyrics:

In [26]:
filtered_lyrics = [word for word in words if word.lower() not in stop_words]
print(f'Here are the first 10 words in the lyrics:\n{words[:10]}')
print(f'Here are the first 10 words in the filtered lyrics:\n{filtered_lyrics[:10]}')

Here are the first 10 words in the lyrics:
['Far', 'over', 'the', 'Misty', 'Mountains', 'cold', 'To', 'dungeons', 'deep', 'and']
Here are the first 10 words in the filtered lyrics:
['Far', 'Misty', 'Mountains', 'cold', 'dungeons', 'deep', 'caverns', 'old', 'must', 'away']


# Bag of Words

We will now see how a sentence can be transformed into a feature vector using a bag of words model.  
We can represent each word as a one-hot encoded vector (with a single 1 in the column for that word), and add their vectors together to get the feature vector for a sentence:  
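
As a minimal sketch of that idea in plain Python (no scikit-learn), the snippet below builds one-hot vectors for a single line of the lyrics and sums them. The names `vocabulary`, `column`, `one_hot` and `features` are illustrative only and are not part of any library.

```python
sentence = "the wind was on the withered heath".split()

# Vocabulary: one column per distinct word, in sorted order.
vocabulary = sorted(set(sentence))
column = {word: idx for idx, word in enumerate(vocabulary)}

def one_hot(word):
    # A vector of zeros with a single 1 in the word's column.
    vector = [0] * len(vocabulary)
    vector[column[word]] = 1
    return vector

# Sum the one-hot vectors element-wise to get the sentence's feature vector.
features = [sum(values) for values in zip(*(one_hot(w) for w in sentence))]

print(vocabulary)
print(features)
```

The word `the` occurs twice in the sentence, so its column holds a 2 while every other column holds a 1.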


In [27]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
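# Use the NLTK sentences from above; lyrics.split('.') adds an empty trailing string and does not split on '!'
X = vectorizer.fit_transform(sentences)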
X.shape

(7, 90)

What do the rows and columns of the feature matrix X represent?  

Each row is a sentence and each column is a word from the vocabulary.  
Each value is the number of times that word appears in that sentence.  
The columns are sorted alphabetically: the second column is the word 'all' and the third is 'and'.  
Note: the model doesn't keep the order of the words, hence the name bag of words.
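
The loss of word order can be checked with a small sketch in plain Python. The helper `bag_of_words` and the three-word `vocabulary` are assumptions made for illustration. It counts words with `collections.Counter` instead of using `CountVectorizer`.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    # Count each vocabulary word in the text; word positions are discarded.
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["day", "night", "or"]
a = bag_of_words("night or day", vocabulary)
b = bag_of_words("day or night", vocabulary)

print(a, b, a == b)
```

Both phrases contain the same words, so they map to the same vector even though the order differs.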

In [28]:
print(f'There are {X.shape[0]} sentences in the text and {X.shape[1]} unique words')

There are 7 sentences in the text and 90 unique words


In [29]:
for idx, word in enumerate(vectorizer.get_feature_names_out()):
    print(f'at index {idx}, the word is {word}')

at index 0, the word is across
at index 1, the word is all
at index 2, the word is and
at index 3, the word is away
at index 4, the word is be
at index 5, the word is beneath
at index 6, the word is blazed
at index 7, the word is blow
at index 8, the word is break
at index 9, the word is but
at index 10, the word is call
at index 11, the word is caverns
at index 12, the word is ceased
at index 13, the word is cold
at index 14, the word is crept
at index 15, the word is dark
at index 16, the word is day
at index 17, the word is deep
at index 18, the word is dungeons
at index 19, the word is east
at index 20, the word is ere
at index 21, the word is fall
at index 22, the word is far
at index 23, the word is farewell
at index 24, the word is find
at index 25, the word is fire
at index 26, the word is flaming
at index 27, the word is forest
at index 28, the word is forgotten
at index 29, the word is from
at index 30, the word is gold
at index 31, the word is hall
at index 32, the word is h