# NLP Exercises

We have five exercises in this section. The exercises are:
1. Build your own tokenizer, where you need to implement two functions to implement a tokenizer based on regular expression.
2. Get tags from Trump speech.
3. Get the nouns in the last 10 sentences from Trump's speech and find the nouns divided by sentencens. Use SpaCy.
4. Build your own Bag Of Words implementation using tokenizer created before.
5. Build a 5-gram model and clean up the results.

## Exercise 1. Build your own tokenizer

Build two different tokenizers:
- ``tokenize_sentence``: function tokenizing text into sentences,
- ``tokenize_word``: function tokenizing text into words.

In [2]:
from typing import List
import re

def tokenize_words(text: str) -> list:
    """Tokenize text into words using regex.

    Parameters
    ----------
    text: str
            Text to be tokenized

    Returns
    -------
    List[str]
            List containing words tokenized from text

    """
    return re.split('\\s+', text)
def tokenize_sentence(text: str) -> list:
    """Tokenize text into words using regex.

    Parameters
    ----------
    text: str
            Text to be tokenized

    Returns
    -------
    List[str]
            List containing words tokenized from text

    """
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|!)\s', text)
    return sentences
text = "Here we go again. I was supposed to add this text later. \
Well, it's 10.p.m. here, and I'm actually having fun making this course. :o\
I hope you are getting along fine with this presentation, I really did try. \
And one last sentence, just so you can test you tokenizers better."

print("Tokenized sentences:")
print(tokenize_sentence(text))

print("Tokenized words:")
print(tokenize_words(text))

Tokenized sentences:
['Here we go again.', 'I was supposed to add this text later.', "Well, it's 10.p.m. here, and I'm actually having fun making this course.", ':oI hope you are getting along fine with this presentation, I really did try.', 'And one last sentence, just so you can test you tokenizers better.']
Tokenized words:
['Here', 'we', 'go', 'again.', 'I', 'was', 'supposed', 'to', 'add', 'this', 'text', 'later.', 'Well,', "it's", '10.p.m.', 'here,', 'and', "I'm", 'actually', 'having', 'fun', 'making', 'this', 'course.', ':oI', 'hope', 'you', 'are', 'getting', 'along', 'fine', 'with', 'this', 'presentation,', 'I', 'really', 'did', 'try.', 'And', 'one', 'last', 'sentence,', 'just', 'so', 'you', 'can', 'test', 'you', 'tokenizers', 'better.']


## Exercise 2. Get tags from Trump speech using NLTK

You should use the ``trump.txt`` file, read it and find the tags for each word. Use NLTK for it.

In [12]:
import nltk
from nltk import word_tokenize
from nltk import pos_tag
import string

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

file_path = "../datasets/trump.txt"
with open(file_path, "r", encoding="utf-8") as file:
    trump = file.read()

words = word_tokenize(trump)

words = [word for word in words if word not in string.punctuation]

word_tags = pos_tag(words)

for word, tag in word_tags[:20]:
    print(f"{word}: {tag}")


Thank: NNP
you: PRP
very: RB
much: RB
Mr.: NNP
Speaker: NNP
Mr.: NNP
Vice: NNP
President: NNP
Members: NNP
of: IN
Congress: NNP
the: DT
First: NNP
Lady: NNP
of: IN
the: DT
United: NNP
States: NNPS
and: CC


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ola55\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ola55\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## Exercise 3. Get the nouns in the last 10 sentences from Trump's speech and find the nouns divided by sentencens. Use SpaCy.

Please use Python list features to get the last 10 sentences and display nouns from it.

In [4]:
import spacy

nlp = spacy.load("en_core_web_sm")

file_path = "../datasets/trump.txt"
with open(file_path, "r", encoding="utf-8") as file:
    trump = file.read()

doc = nlp(trump)

sentences = list(doc.sents)

last_10_sentences = sentences[-10:]

nouns_by_sentence = []
for sentence in last_10_sentences:
    nouns = [token.text for token in sentence if token.pos_ == "NOUN"]
    nouns_by_sentence.append(nouns)

for i, nouns in enumerate(nouns_by_sentence, 1):
    print(f'Sentence: {str(sentence).rstrip()}')
    print(f"{nouns}")


Sentence: Thank you, God bless you, and God bless the United States.
['vision', 'years', 'freedom', 'tonight', 'chapter', 'greatness']
Sentence: Thank you, God bless you, and God bless the United States.
['time', 'thinking']
Sentence: Thank you, God bless you, and God bless the United States.
['time', 'fights']
Sentence: Thank you, God bless you, and God bless the United States.
['courage', 'dreams', 'hearts', 'bravery', 'hopes', 'souls', 'confidence', 'hopes', 'dreams', 'action']
Sentence: Thank you, God bless you, and God bless the United States.
['aspirations', 'fears', 'future', 'failures', 'past', 'vision', 'doubts']
Sentence: Thank you, God bless you, and God bless the United States.
['citizens', 'renewal', 'spirit']
Sentence: Thank you, God bless you, and God bless the United States.
['Members', 'things', 'country']
Sentence: Thank you, God bless you, and God bless the United States.
['tonight', 'moment']
Sentence: Thank you, God bless you, and God bless the United States.
['you

## Exercise 4. Build your own Bag Of Words implementation using tokenizer created before 

You need to implement following methods:

- ``fit_transform`` - gets a list of strings and returns matrix with it's BoW representation
- ``get_features_names`` - returns list of words corresponding to columns in BoW

In [9]:
import spacy
import numpy as np

class BagOfWords:
    """Basic BoW implementation using spaCy for text processing."""

    def __init__(self):
        self.__nlp = spacy.load("en_core_web_sm")
        self.__bow_list = []

    def fit_transform(self, corpus: list) -> np.array:
        """Transform list of strings into BoW array.

        Parameters
        ----------
        corpus: List[str]
                Corpus of texts to be transformed

        Returns
        -------
        np.array
                Matrix representation of BoW
        """
        # Process the documents and build the vocabulary
        docs_tokens = self.__process_documents(corpus)
        self.__build_vocabulary(docs_tokens)

        # Create and return the BoW matrix
        return self.__create_bow_matrix(docs_tokens)

    def get_feature_names(self) -> list:
        """Return words corresponding to columns of matrix.

        Returns
        -------
        List[str]
                Words being transformed by fit function
        """
        return self.__bow_list

    def __process_documents(self, corpus: list) -> list:
        """Tokenize and preprocess the documents.

        Parameters
        ----------
        corpus: List[str]
                Corpus of texts to be processed

        Returns
        -------
        List[List[str]]
                Tokenized and preprocessed documents
        """
        docs_tokens = []
        for doc in corpus:
            tokens = [token.lemma_.lower() for token in self.__nlp(doc)
                      if not token.is_punct and not token.is_stop]
            docs_tokens.append(tokens)
        return docs_tokens

    def __build_vocabulary(self, docs_tokens: list):
        """Build the vocabulary from the tokenized documents.

        Parameters
        ----------
        docs_tokens: List[List[str]]
                     Tokenized and preprocessed documents
        """
        unique_tokens = set(token for doc_tokens in docs_tokens for token in doc_tokens)
        self.__bow_list = sorted(unique_tokens)

    def __create_bow_matrix(self, docs_tokens: list) -> np.array:
        """Create the BoW matrix from tokenized documents.

        Parameters
        ----------
        docs_tokens: List[List[str]]
                     Tokenized and preprocessed documents

        Returns
        -------
        np.array
                Matrix representation of BoW
        """
        bow_matrix = np.zeros((len(docs_tokens), len(self.__bow_list)), dtype=np.int32)
        for i, doc in enumerate(docs_tokens):
            for token in doc:
                bow_matrix[i, self.__bow_list.index(token)] += 1
        return bow_matrix

# Example usage:
corpus = [
     'Bag Of Words is based on counting',
     'words occurrences throughout multiple documents.',
     'This is the third document.',
     'As you can see most of the words occur only once.',
     'This gives us a pretty sparse matrix, see below. Really, see below',
]


bow = BagOfWords()
matrix = bow.fit_transform(corpus)
feature_names = bow.get_feature_names()

print("Bag of Words matrix:")
print(matrix)
print("\nFeature names:")
print(feature_names)
print(len(feature_names))


Bag of Words matrix:
[[1 1 1 0 0 0 0 0 0 0 0 0 1]
 [0 0 0 1 0 0 1 0 1 0 0 1 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 1 0]
 [0 0 0 0 1 1 0 0 0 1 1 0 0]]

Feature names:
['bag', 'base', 'count', 'document', 'give', 'matrix', 'multiple', 'occur', 'occurrence', 'pretty', 'sparse', 'word', 'words']
13


## Exercise 5. Build a 5-gram model and clean up the results.

There are three tasks to do:
1. Use 5-gram model instead of 3.
2. Change to capital letter each first letter of a sentence.
3. Remove the whitespace between the last word in a sentence and . ! or ?.

Hint: for 2. and 3. implement a function called ``clean_generated()`` that takes the generated text and fix both issues at once. It could be easier to fix the text after it's generated rather then doing some changes in the while loop.

In [None]:
from random import random
from nltk.book import *

wall_street = text7.tokens

import re

tokens = wall_street

def cleanup():
    compiled_pattern = re.compile("^[a-zA-Z0-9.!?]")
    clean = list(filter(compiled_pattern.match,tokens))
    return clean
tokens = cleanup()

def build_ngrams():
    ngrams = []
    for i in range(len(tokens)-N+1):
        ngrams.append(tokens[i:i+N])
    return ngrams

def ngram_freqs(ngrams):
    counts = {}

    for ngram in ngrams:
        token_seq  = SEP.join(ngram[:-1])
        last_token = ngram[-1]

        if token_seq not in counts:
            counts[token_seq] = {}

        if last_token not in counts[token_seq]:
            counts[token_seq][last_token] = 0

        counts[token_seq][last_token] += 1;

    return counts

def next_word(text, N, counts):

    token_seq = SEP.join(text.split()[-(N-1):]);
    choices = counts[token_seq].items();

    total = sum(weight for choice, weight in choices)
    r = random.uniform(0, total)
    upto = 0
    for choice, weight in choices:
        upto += weight;
        if upto > r: return choice
    assert False # should not reach here

In [None]:
def clean_generated():
    # put your code here
    pass

N=5 # fix it for other value of N

SEP=" "

sentence_count=5

ngrams = build_ngrams()
start_seq="We have"

counts = ngram_freqs(ngrams)

if start_seq is None: start_seq = random.choice(list(counts.keys()))
generated = start_seq.lower();

sentences = 0
while sentences < sentence_count:
    generated += SEP + next_word(generated, N, counts)
    sentences += 1 if generated.endswith(('.','!', '?')) else 0

# put your code here:

print(generated)