# NLP Exercises

This section contains five exercises:
1. Build your own tokenizer by implementing two functions based on regular expressions.
2. Get part-of-speech tags from Trump's speech.
3. Get the nouns in the last 10 sentences from Trump's speech and find the nouns divided by sentences. Use spaCy.
4. Build your own Bag of Words implementation using the tokenizer created before.
5. Build a 5-gram model and clean up the results.

## Exercise 1. Build your own tokenizer

Build two different tokenizers:
- ``tokenize_sentence``: a function that tokenizes text into sentences,
- ``tokenize_words``: a function that tokenizes text into words.

In [1]:
# Importing necessary libraries

from typing import List
import re

In [2]:
# Definitions of new functions

def tokenize_words(text: str) -> List[str]:
    """Tokenize text into words using regex.

    Parameters
    ----------
    text: str
            Text to be tokenized

    Returns
    -------
    List[str]
            List containing words tokenized from text

    """
    list_of_words = re.split(r'\W+', text)   
    return list_of_words

def tokenize_sentence(text: str) -> List[str]:
    """Tokenize text into sentences using regex.

    Parameters
    ----------
    text: str
            Text to be tokenized

    Returns
    -------
    List[str]
            List containing sentences tokenized from text

    """
    # Split right after every '.', '!' or '?' (zero-width lookbehind)
    list_of_sentences = re.split(r'(?<=[.!?])', text)
    return list_of_sentences

In [3]:
# Checking function performance

text = "Here we go again. I was supposed to add this text later.\
Well, it's 10.p.m. here, and I'm actually having fun making this course. :o\
I hope you are getting along fine with this presentation, I really did try.\
And one last sentence, just so you can test you tokenizers better."

print("--> Tokenized sentences:\n")
print(tokenize_sentence(text))

print("\n--> Tokenized words:\n")
print(tokenize_words(text))

--> Tokenized sentences:

['Here we go again.', ' I was supposed to add this text later.', "Well, it's 10.", 'p.', 'm.', " here, and I'm actually having fun making this course.", ' :oI hope you are getting along fine with this presentation, I really did try.', 'And one last sentence, just so you can test you tokenizers better.', '']

--> Tokenized words:

['Here', 'we', 'go', 'again', 'I', 'was', 'supposed', 'to', 'add', 'this', 'text', 'later', 'Well', 'it', 's', '10', 'p', 'm', 'here', 'and', 'I', 'm', 'actually', 'having', 'fun', 'making', 'this', 'course', 'oI', 'hope', 'you', 'are', 'getting', 'along', 'fine', 'with', 'this', 'presentation', 'I', 'really', 'did', 'try', 'And', 'one', 'last', 'sentence', 'just', 'so', 'you', 'can', 'test', 'you', 'tokenizers', 'better', '']
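
The output above shows two side effects of the simple patterns: both lists end with an empty string, and `10.p.m.` is broken into three "sentences". The following is a minimal sketch of one possible refinement, not the reference solution. The function names `tokenize_words_strict` and `tokenize_sentence_strict` are hypothetical, and the sketch assumes that a sentence boundary is always followed by whitespace.

```python
import re
from typing import List


def tokenize_words_strict(text: str) -> List[str]:
    """Return runs of word characters; findall never yields empty strings."""
    return re.findall(r"\w+", text)


def tokenize_sentence_strict(text: str) -> List[str]:
    """Split only where '.', '!' or '?' is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    # Drop empty fragments left over at the edges of the text
    return [part.strip() for part in parts if part.strip()]


print(tokenize_words_strict("Hello, world!"))
print(tokenize_sentence_strict("Here we go. It is 10.p.m! Done?"))
```

The trade-off is that sentences glued together without a space (as happens in the test text, where the line continuations join `later.` and `Well`) are no longer separated, so which variant is preferable depends on the input.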


## Exercise 2. Get tags from Trump speech using NLTK

Read the ``trump.txt`` file and find the part-of-speech tag for each word using NLTK.

In [4]:
# Importing necessary libraries

import nltk 
from nltk import word_tokenize  
from nltk import pos_tag     

In [5]:
# Loading file

# The context manager closes the file once it has been read
with open("../datasets/trump.txt", "r", encoding="utf-8") as file:
    trump = file.read()

In [7]:
# Tokenizing and tagging words in file

list_of_words = word_tokenize(trump)
list_of_words_tagged = pos_tag(list_of_words)

N_words_printed = 100
N_words_total = len(list_of_words)  # reuse the tokens instead of tokenizing again

print("--> Tagged words from Trump's speech:\n")

print(list_of_words_tagged[:N_words_printed], "\n")

print("--> To keep the output short, only the first", N_words_printed, "out of", N_words_total,
      "words are printed with their tags (list_of_words_tagged contains all of them).\n")

--> Tagged words from Trump's speech:

[('Thank', 'NNP'), ('you', 'PRP'), ('very', 'RB'), ('much', 'RB'), ('.', '.'), ('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Mr.', 'NNP'), ('Vice', 'NNP'), ('President', 'NNP'), (',', ','), ('Members', 'NNP'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('the', 'DT'), ('First', 'NNP'), ('Lady', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('United', 'NNP'), ('States', 'NNPS'), (',', ','), ('and', 'CC'), ('citizens', 'NNS'), ('of', 'IN'), ('America', 'NNP'), (':', ':'), ('Tonight', 'NN'), (',', ','), ('as', 'IN'), ('we', 'PRP'), ('mark', 'VBP'), ('the', 'DT'), ('conclusion', 'NN'), ('of', 'IN'), ('our', 'PRP$'), ('celebration', 'NN'), ('of', 'IN'), ('Black', 'NNP'), ('History', 'NNP'), ('Month', 'NNP'), (',', ','), ('we', 'PRP'), ('are', 'VBP'), ('reminded', 'VBN'), ('of', 'IN'), ('our', 'PRP$'), ('Nation', 'NN'), ("'s", 'POS'), ('path', 'NN'), ('towards', 'NNS'), ('civil', 'JJ'), ('rights', 'NNS'), ('and', 'CC'), ('the', 'DT'), ('work', 'NN'), ('th
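
Since the tagger returns a list of `(word, tag)` tuples, the result can be summarized with standard Python tools. The sketch below counts how often each tag occurs; `sample_tagged` is just the first few pairs copied from the output above, and `tag_frequencies` is a hypothetical helper name.

```python
from collections import Counter

# First few (word, tag) pairs copied from the output above
sample_tagged = [
    ('Thank', 'NNP'), ('you', 'PRP'), ('very', 'RB'), ('much', 'RB'),
    ('.', '.'), ('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','),
]


def tag_frequencies(tagged):
    """Count how many times each POS tag appears in a tagged token list."""
    return Counter(tag for _, tag in tagged)


freqs = tag_frequencies(sample_tagged)
print(freqs.most_common(2))
```

In the notebook the same helper could be applied to the full `list_of_words_tagged` list.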

## Exercise 3. Get the nouns in the last 10 sentences from Trump's speech and find the nouns divided by sentences. Use spaCy.

Use Python list slicing to get the last 10 sentences and display the nouns from them.
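
As a quick reminder of the list feature in question, a negative slice returns the tail of a list. The sketch below uses a made-up list of 15 placeholder sentences; the variable names are illustrative only.

```python
# Hypothetical stand-in for a list of tokenized sentences
sentences = [f"Sentence {i}." for i in range(1, 16)]

# A negative start index counts from the end of the list
last_ten = sentences[-10:]

print(len(last_ten), last_ten[0], last_ten[-1])
```

If the list holds fewer than 10 items, the slice simply returns the whole list instead of raising an error.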

In [8]:
# Importing necessary libraries

import spacy
from nltk.tokenize import sent_tokenize # Sentence tokenizer

In [9]:
# Loading file

# The context manager closes the file once it has been read
with open("../datasets/trump.txt", "r", encoding="utf-8") as file:
    trump = file.read()

In [10]:
# Tokenizing text into sentences and selecting the last 10 sentences

N_sentences = 10

trump_sentences = sent_tokenize(trump)
trump_last_N_sentences = trump_sentences[-N_sentences:]
# Join with a space so that neighbouring sentences are not glued together
trump = ' '.join(trump_last_N_sentences)

print("--> Last 10 sentences in Trump's speech are:\n")
print(trump, "\n")

nouns_count = 0

print("--> Nouns in last 10 sentences in Trump's speech are:\n")

nlp = spacy.load('en_core_web_sm')
for token in nlp(trump):
    if token.pos_ == 'NOUN':
        nouns_count = nouns_count + 1
        print(token.text, " ", end='')

print("\n\n--> Average number of nouns per sentence in last 10 sentences in Trump's speech is:", nouns_count / N_sentences)

--> Last 10 sentences in Trump's speech are:

When we fulfill this vision, when we celebrate our 250 years of glorious freedom, we will look back on tonight as when this new chapter of American greatness began. The time for small thinking is over. The time for trivial fights is behind us. We just need the courage to share the dreams that fill our hearts, the bravery to express the hopes that stir our souls, and the confidence to turn those hopes and those dreams into action. From now on, America will be empowered by our aspirations, not burdened by our fears; inspired by the future, not bound by failures of the past; and guided by our vision, not blinded by our doubts. I am asking all citizens to embrace this renewal of the American spirit. I am asking all Members of Congress to join me in dreaming big and bold, and daring things for our country. I am asking everyone watching tonight to seize this moment. Believe in yourselves, believe in your future, and believe, once more, in America. Thank y
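
The cell above prints the nouns as one flat sequence and reports an average. If "nouns divided by sentences" is instead read as a per-sentence breakdown, the sentences can be processed one at a time. The sketch below is only one possible reading of the task; `nouns_by_sentence` is a hypothetical helper that takes the pipeline as an argument, so it assumes nothing beyond a callable returning tokens with `text` and `pos_` attributes, as spaCy pipelines do.

```python
from typing import Callable, List


def nouns_by_sentence(sentences: List[str], nlp: Callable) -> List[List[str]]:
    """Return one list of noun texts per input sentence."""
    return [
        [token.text for token in nlp(sentence) if token.pos_ == "NOUN"]
        for sentence in sentences
    ]
```

With the objects from the cell above, a call might look like `nouns_by_sentence(trump_last_N_sentences, nlp)`, and `len()` of each inner list gives the noun count for that sentence.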

## Exercise 4. Build your own Bag Of Words implementation using tokenizer created before 

You need to implement the following methods:

- ``fit_transform`` - takes a list of strings and returns a matrix with its BoW representation
- ``get_feature_names`` - returns the list of words corresponding to the columns of the BoW matrix
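
To make the expected result concrete, here is a tiny worked example of a Bag of Words matrix built with the standard library only. It is a sketch under simplifying assumptions (lowercasing, words defined as runs of `\w` characters), and the two documents are made up for illustration.

```python
import re
from collections import Counter

# Two made-up documents
docs = ["see below", "Really, see below. See?"]

# Lowercase each document and split it into words
tokenized = [re.findall(r"\w+", doc.lower()) for doc in docs]

# Vocabulary: every distinct word, in alphabetical order (one column each)
vocab = sorted({word for doc in tokenized for word in doc})

# One row per document, one count per vocabulary word
matrix = [[Counter(doc)[word] for word in vocab] for doc in tokenized]

print(vocab)
print(matrix)
```

Each row corresponds to a document and each column to a word, which is the layout ``fit_transform`` and the feature-name method are expected to produce.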

In [11]:
# Importing necessary libraries

import numpy as np
import pandas as pd
import spacy

In [12]:
# BagOfWords class

class BagOfWords:
    """Basic BoW implementation."""

    def __init__(self):
        # State lives on the instance, so separate vectorizers and
        # repeated calls do not accumulate words from earlier corpora
        self.__unique_words = []

    def fit_transform(self, corpus: list):
        """Transform list of strings into BoW array.

        Parameters
        ----------
        corpus: List[str]
                Corpus of texts to be transformed

        Returns
        -------
        np.array
                Matrix representation of BoW

        """
        # Tokenizing corpus: lowercase every word and drop the empty
        # strings that re.split leaves at the edges of a text
        tokenized = [
            [word.lower() for word in tokenize_words(s) if word]
            for s in corpus
        ]

        # Selecting unique words in alphabetical order
        self.__unique_words = sorted({word for doc in tokenized for word in doc})

        # Calculating number of occurrences of each word in each document
        bow_list = [
            [doc.count(word) for word in self.__unique_words]
            for doc in tokenized
        ]
        return np.array(bow_list)

    def get_feature_names(self) -> list:
        """Return words corresponding to columns of matrix.

        Returns
        -------
        List[str]
                Words being transformed by fit function

        """
        return self.__unique_words

In [13]:
# Checking how BagOfWords class works

corpus = [
     'Bag Of Words is based on counting',
     'words occurences throughout multiple documents.',
     'This is the third document.',
     'As you can see most of the words occur only once.',
     'This gives us a pretty sparse matrix, see below. Really, see below',
]    
    
vectorizer = BagOfWords()

X = vectorizer.fit_transform(corpus)
#print(X)

print("--> Bag Of Words displayed as DataFrame:\n")
display(pd.DataFrame(X, columns= vectorizer.get_feature_names()))

print("--> Number of words (columns) in our Bag Of Words: ", len(vectorizer.get_feature_names()),"\n")
print("--> Number of documents(rows) in our Bag Of Words: ", len(X),"\n")

print("--> Words in a bag of words: ", vectorizer.get_feature_names())

--> Bag Of Words displayed as DataFrame:



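| | a | as | bag | based | below | can | counting | document | documents | gives | ... | really | see | sparse | the | third | this | throughout | us | words | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 4 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 2 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |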


--> Number of words (columns) in our Bag Of Words:  31 

--> Number of documents(rows) in our Bag Of Words:  5 

--> Words in a bag of words:  ['a', 'as', 'bag', 'based', 'below', 'can', 'counting', 'document', 'documents', 'gives', 'is', 'matrix', 'most', 'multiple', 'occur', 'occurences', 'of', 'on', 'once', 'only', 'pretty', 'really', 'see', 'sparse', 'the', 'third', 'this', 'throughout', 'us', 'words', 'you']


## Exercise 5. Build a 5-gram model and clean up the results.

There are three tasks to do:
1. Use a 5-gram model instead of a 3-gram model.
2. Capitalize the first letter of each sentence.
3. Remove the whitespace between the last word in a sentence and . ! or ?.

Hint: for 2. and 3. implement a function called ``clean_generated()`` that takes the generated text and fixes both issues at once. It could be easier to fix the text after it's generated rather than making changes in the while loop.
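
Before moving to the real corpus, it may help to see what a 5-gram model stores: for every context of four consecutive tokens, a count of the tokens that followed it. The sketch below builds such a table for a made-up toy sentence; `ngram_counts` is a hypothetical helper, not the function used in the notebook.

```python
from collections import Counter, defaultdict


def ngram_counts(tokens, n):
    """Map each (n-1)-token context to a Counter of the tokens that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        *context, last = tokens[i:i + n]
        counts[" ".join(context)][last] += 1
    return counts


toy = "the cat sat on the mat and the cat sat on the sofa".split()
toy_counts = ngram_counts(toy, 5)

print(dict(toy_counts["cat sat on the"]))
print(dict(toy_counts["the cat sat on"]))
```

A longer context makes the generated text follow the source more closely, because most four-word contexts in a small corpus have only one possible continuation.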

In [14]:
# Loading necessary libraries

from nltk.book import *
import re
import random

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [15]:
# Definition of functions

wall_street = text7.tokens
tokens = wall_street

# cleanup - keeps only the tokens that start with a letter, a digit or one of . ! ?
def cleanup():
    compiled_pattern = re.compile("^[a-zA-Z0-9.!?]")
    clean = list(filter(compiled_pattern.match,tokens))
    return clean
tokens = cleanup()

# build_ngrams - builds ngrams
def build_ngrams():
    ngrams = []
    for i in range(len(tokens)-N+1):
        ngrams.append(tokens[i:i+N])
    return ngrams

# ngram_freqs - for each context (the first N-1 tokens of an ngram) counts how often each token follows it
def ngram_freqs(ngrams):
    counts = {}

    for ngram in ngrams:
        token_seq  = SEP.join(ngram[:-1])
        last_token = ngram[-1]

        if token_seq not in counts:
            counts[token_seq] = {}

        if last_token not in counts[token_seq]:
            counts[token_seq][last_token] = 0

        counts[token_seq][last_token] += 1

    return counts

# next_word - picks the next word at random, weighted by how often it followed the last N-1 tokens of the text
def next_word(text, N, counts):

    token_seq = SEP.join(text.split()[-(N-1):])
    choices = counts[token_seq].items()

    total = sum(weight for choice, weight in choices)
    r = random.uniform(0, total)
    upto = 0
    for choice, weight in choices:
        upto += weight
        # >= also covers the rare case where r equals total exactly
        if upto >= r:
            return choice
    assert False # should not reach here
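
The cumulative-sum loop in `next_word` is a hand-written weighted random draw. For comparison, the standard library offers `random.choices`, which accepts weights directly. The sketch below shows that alternative; `weighted_choice` is a hypothetical helper name and the example counts are made up.

```python
import random


def weighted_choice(followers, rng=random):
    """Pick one key of `followers`, with probability proportional to its count."""
    words = list(followers)
    weights = list(followers.values())
    # random.choices returns a list of k samples, so take the single element
    return rng.choices(words, weights=weights, k=1)[0]


# Made-up continuation counts for one context
followers = {"mat": 3, "sofa": 1}
print(weighted_choice(followers, random.Random(0)))
```

Passing a seeded `random.Random` instance, as in the usage line, makes the generated text reproducible between runs.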

In [16]:
# clean_generated - cleans generated text

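def clean_generated(generated):
    # Remove the whitespace before sentence-ending punctuation
    generated = generated.replace(" .", ".").replace(" ?", "?").replace(" !", "!")
    # Capitalize the first character of the text; later sentences keep the casing of the corpus
    return generated[:1].upper() + generated[1:]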
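
The function above capitalizes only the very first character of the text and relies on the corpus for the casing of later sentences. A variant that capitalizes the start of every sentence could look like the sketch below. The name `clean_generated_all` is hypothetical, and the approach assumes that every `.`, `!` or `?` followed by whitespace ends a sentence, which is not true for abbreviations.

```python
import re


def clean_generated_all(generated: str) -> str:
    """Remove spaces before . ! ? and capitalize the start of every sentence."""
    # Drop any whitespace that precedes sentence-ending punctuation
    text = re.sub(r"\s+([.!?])", r"\1", generated)
    # Uppercase a lowercase letter at the start of the text or after
    # sentence-ending punctuation followed by whitespace
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )


print(clean_generated_all("we won . it was fun ! really ?"))
```

Because of the abbreviation caveat, a phrase such as `U.S. cars` would become `U.S. Cars`, so this variant trades one kind of error for another on the Wall Street Journal text.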

In [17]:
# Assessing performance of functions

N = 5  # order of the ngram model; start_seq must contain N-1 words

SEP = " "

sentence_count = 5

ngrams = build_ngrams()
start_seq = "We have managed to"  # Changed to first N-1 words

counts = ngram_freqs(ngrams)

if start_seq is None:
    start_seq = random.choice(list(counts.keys()))
generated = start_seq.lower()

sentences = 0
while sentences < sentence_count:
    generated += SEP + next_word(generated, N, counts)
    sentences += 1 if generated.endswith(('.','!', '?')) else 0

print("--> Generated text before cleaning and corrections:\n\n",generated)

generated = clean_generated(generated)

print("\n--> Generated text after cleaning and corrections:\n\n",generated)

--> Generated text before cleaning and corrections:

 we have managed to maximize our direct-mail capability . In addition Buick is a relatively respected nameplate among American Express card holders says 0 an American Express spokeswoman . When the company asked members in a mailing which cars they would like to get information about for possible future purchases Buick came in fourth among U.S. cars and in the top 10 of all cars the spokeswoman says 0 . American Express has more than 24 million card holders in the U.S.

--> Generated text after cleaning and corrections:

 We have managed to maximize our direct-mail capability. In addition Buick is a relatively respected nameplate among American Express card holders says 0 an American Express spokeswoman. When the company asked members in a mailing which cars they would like to get information about for possible future purchases Buick came in fourth among U.S. cars and in the top 10 of all cars the spokeswoman says 0. American Express