# NLP Exercises

This section contains five exercises:
1. Build your own tokenizer: implement two regular-expression-based functions, one splitting text into sentences and one splitting it into words.
2. Get part-of-speech tags from Trump's speech using NLTK.
3. Get the nouns in the last 10 sentences of Trump's speech, divided by sentence. Use spaCy.
4. Build your own Bag Of Words implementation using the tokenizer created before.
5. Build a 5-gram model and clean up the results.

## Exercise 1. Build your own tokenizer

Build two different tokenizers:
- ``tokenize_sentence``: function tokenizing text into sentences,
- ``tokenize_words``: function tokenizing text into words.

In [2]:
import re
from typing import List

def tokenize_words(text: str) -> List[str]:
    """Tokenize text into words using regex.

    Parameters
    ----------
    text: str
            Text to be tokenized

    Returns
    -------
    List[str]
            List containing words tokenized from text

    """
    words = re.findall(r"\b\w+'\w+\b|\b\w+\.\w+\.\w+\b|:\w|\b\w+\b", text)
    return words

def tokenize_sentence(text: str) -> List[str]:
    """Tokenize text into sentences using regex.

    Parameters
    ----------
    text: str
            Text to be tokenized

    Returns
    -------
    List[str]
            List containing sentences tokenized from text

    """
    sentences = re.split(r'(?<=[.!?])\s+(?![a-z])|(?<=[.!?])(?=[A-Z])', text)
    sentences = [sentence.strip() for sentence in sentences if sentence.strip()]
    return sentences

text = "Here we go again. I was supposed to add this text later.\
Well, it's 10.p.m. here, and I'm actually having fun making this course. :o\
I hope you are getting along fine with this presentation, I really did try.\
And one last sentence, just so you can test you tokenizers better."

print("Tokenized sentences:")
print(tokenize_sentence(text))

print("Tokenized words:")
print(tokenize_words(text))

Tokenized sentences:
['Here we go again.', 'I was supposed to add this text later.', "Well, it's 10.p.m. here, and I'm actually having fun making this course.", ':oI hope you are getting along fine with this presentation, I really did try.', 'And one last sentence, just so you can test you tokenizers better.']
Tokenized words:
['Here', 'we', 'go', 'again', 'I', 'was', 'supposed', 'to', 'add', 'this', 'text', 'later', 'Well', "it's", '10.p.m', 'here', 'and', "I'm", 'actually', 'having', 'fun', 'making', 'this', 'course', ':o', 'hope', 'you', 'are', 'getting', 'along', 'fine', 'with', 'this', 'presentation', 'I', 'really', 'did', 'try', 'And', 'one', 'last', 'sentence', 'just', 'so', 'you', 'can', 'test', 'you', 'tokenizers', 'better']
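
The `':oI hope'` item in the output comes from the sample text itself: a backslash at the end of a line inside a string literal joins the lines without adding a space. The sketch below reproduces this on a shorter, made-up string and shows how the two alternatives of the sentence pattern behave. `SENTENCE_BOUNDARY` is a hypothetical name; the pattern is copied from `tokenize_sentence` so the block runs on its own.

```python
import re

# Same pattern as in tokenize_sentence, repeated so this block is self-contained
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?![a-z])|(?<=[.!?])(?=[A-Z])')

# A trailing backslash inside a string literal glues the lines together
# without inserting a space or a newline
joined = "First one.\
Second one. It's 10.p.m. here. :o\
Third one."

print(joined)

# First alternative: whitespace after . ! ? unless a lowercase letter follows
# (keeps "10.p.m. here" together). Second alternative: a zero-width split
# between . ! ? and a capital letter (handles "one.Second").
parts = [p.strip() for p in SENTENCE_BOUNDARY.split(joined) if p.strip()]
print(parts)
```

The emoticon `:o` does not end in `.`, `!` or `?`, so the pattern has no boundary to split on and it stays attached to the following sentence.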


## Exercise 2. Get tags from Trump's speech using NLTK

Read the ``trump.txt`` file and find the part-of-speech tag for each word using NLTK.

In [3]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# Read the whole speech; the context manager closes the file afterwards
with open("./trump.txt", "r", encoding="utf-8") as file:
    trump = file.read()
words = word_tokenize(trump)

tags = pos_tag(words)
print(list(zip(*tags))[1])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


('NNP', 'PRP', 'RB', 'RB', '.', 'NNP', 'NNP', ',', 'NNP', 'NNP', 'NNP', ',', 'NNP', 'IN', 'NNP', ',', 'DT', 'NNP', 'NNP', 'IN', 'DT', 'NNP', 'NNPS', ',', 'CC', 'NNS', 'IN', 'NNP', ':', 'NN', ',', 'IN', 'PRP', 'VBP', 'DT', 'NN', 'IN', 'PRP$', 'NN', 'IN', 'NNP', 'NNP', 'NNP', ',', 'PRP', 'VBP', 'VBN', 'IN', 'PRP$', 'NN', 'POS', 'NN', 'NNS', 'JJ', 'NNS', 'CC', 'DT', 'NN', 'WDT', 'RB', 'VBZ', 'TO', 'VB', 'VBN', '.', 'JJ', 'NNS', 'VBG', 'NNP', 'NN', 'NNS', 'CC', 'NN', 'IN', 'JJ', 'NNS', ',', 'RB', 'RB', 'IN', 'JJ', 'NN', 'POS', 'NN', 'IN', 'NNP', 'NNP', ',', 'VBP', 'PRP', 'IN', 'IN', 'PRP', 'MD', 'VB', 'DT', 'NN', 'VBN', 'IN', 'NNS', ',', 'PRP', 'VBP', 'DT', 'NN', 'WDT', 'VBZ', 'JJ', 'IN', 'VBG', 'NN', 'CC', 'NN', 'IN', 'DT', 'IN', 'PRP$', 'RB', 'RB', 'NNS', '.', 'DT', 'JJ', 'NN', 'VBZ', 'DT', 'NN', 'IN', 'NN', ',', 'NN', ',', 'CC', 'NN', 'IN', 'DT', 'JJ', 'NN', ',', 'PDT', 'DT', 'NN', 'IN', 'TO', 'DT', 'JJ', '.', 'DT', 'NN', 'VBZ', 'RB', 'IN', 'PRP$', 'NNS', ',', 'CC', 'PRP', 'MD', 'VB', '
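
`pos_tag` returns a list of `(word, tag)` pairs, and `list(zip(*tags))[1]` keeps only the tags. A minimal sketch of that transposition, using hand-written pairs in the same shape (they are an illustration, not tagger output):

```python
# Hand-written (word, tag) pairs in the shape pos_tag returns; not tagger output
tags = [("Thank", "NNP"), ("you", "PRP"), ("very", "RB"), ("much", "RB"), (".", ".")]

# zip(*tags) transposes the pairs into one tuple of words and one tuple of tags
words_only, tags_only = zip(*tags)

print(words_only)
print(tags_only)
```

Printing `tags` directly instead would keep each word next to its tag, which is usually easier to inspect.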

## Exercise 3. Get the nouns in the last 10 sentences from Trump's speech and find the nouns divided by sentences. Use spaCy.

Use Python list slicing to get the last 10 sentences and display the nouns from them.

In [4]:
import spacy

# Read the whole speech; the context manager closes the file afterwards
with open("./trump.txt", "r", encoding="utf-8") as file:
    trump = file.read()

nlp = spacy.load("en_core_web_sm")
doc = nlp(trump)

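# Materialize the sentence spans so they can be sliced
sentences = list(doc.sents)

last_ten = sentences[-10:]

# noun_chunks yields noun phrases (pronouns included), not single noun tokens
nouns = []
for sent in last_ten:
    for noun in sent.noun_chunks:
        nouns.append(noun)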

print(nouns)


[we, this vision, we, our 250 years, glorious freedom, we, tonight, this new chapter, American greatness, The time, small thinking, The time, trivial fights, us, We, the courage, the dreams, that, our hearts, the bravery, the hopes, that, our souls, the confidence, those hopes, those dreams, action, America, our aspirations, our fears, the future, failures, the past, our vision, our doubts, I, all citizens, this renewal, the American spirit, I, all Members, Congress, me, things, our country, I, everyone, this moment, yourselves, your future, America, you, God, you, God, the United States]
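
The exercise asks for nouns divided by sentence, while the cell above collects noun chunks into one flat list. Below is a minimal sketch of the grouping step, assuming each token exposes spaCy's `pos_` and `text` attributes. The function name `nouns_by_sentence` and the `SimpleNamespace` stand-in tokens are hypothetical; they only let the block run without loading a model.

```python
from types import SimpleNamespace
from typing import Iterable, List


def nouns_by_sentence(sentences: Iterable[Iterable]) -> List[List[str]]:
    """Return one list of noun strings per sentence.

    Each token is assumed to expose spaCy-style `pos_` and `text` attributes.
    """
    return [
        [token.text for token in sentence if token.pos_ in ("NOUN", "PROPN")]
        for sentence in sentences
    ]


def tok(text: str, pos: str) -> SimpleNamespace:
    """Build a hypothetical stand-in for a spaCy token."""
    return SimpleNamespace(text=text, pos_=pos)


# Made-up, hand-tagged sentences used only to exercise the function
demo_sentences = [
    [tok("God", "PROPN"), tok("bless", "VERB"), tok("you", "PRON")],
    [tok("Thank", "VERB"), tok("you", "PRON")],
]

grouped = nouns_by_sentence(demo_sentences)
print(grouped)
```

With spaCy loaded, `last_ten` from the cell above could be passed in directly, since iterating over a sentence span yields its tokens.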


## Exercise 4. Build your own Bag Of Words implementation using the tokenizer created before

You need to implement the following methods:

- ``fit_transform`` - takes a list of strings and returns a matrix with its BoW representation
- ``get_feature_names`` - returns the list of words corresponding to the columns of the BoW matrix

In [5]:
import re

import numpy as np

class BagOfWords:
    """Basic BoW implementation."""

    def __init__(self):
        # Sorted vocabulary; filled in by fit_transform
        self.__bow_list = []

    def fit_transform(self, corpus: list):
        """Transform list of strings into BoW array.

        Parameters
        ----------
        corpus: List[str]
                Corpus of texts to be transformed

        Returns
        -------
        np.array
                Matrix representation of BoW

        """
        # Tokenize each document once, lowercased, with the Exercise 1 tokenizer
        tokenized = [tokenize_words(sentence.lower()) for sentence in corpus]

        self.__bow_list = sorted({word for words in tokenized for word in words})
        # Map each word to its column so that lookups do not scan the list
        index = {word: j for j, word in enumerate(self.__bow_list)}
        bow_matrix = np.zeros((len(corpus), len(self.__bow_list)), dtype=int)

        for i, words in enumerate(tokenized):
            for word in words:
                bow_matrix[i, index[word]] += 1  # Count word occurrences

        return bow_matrix


    def get_feature_names(self) -> list:
        """Return words corresponding to columns of matrix.

        Returns
        -------
        List[str]
                Words being transformed by fit function

        """
        return self.__bow_list

corpus = [
     'Bag Of Words is based on counting',
     'words occurences throughout multiple documents.',
     'This is the third document.',
     'As you can see most of the words occur only once.',
     'This gives us a pretty sparse matrix, see below. Really, see below',
]

vectorizer = BagOfWords()

X = vectorizer.fit_transform(corpus)
print(X)

vectorizer.get_feature_names()
len(vectorizer.get_feature_names())

[[0 0 1 1 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0]
 [0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0]
 [0 1 0 0 0 1 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 1 0 1 0 0 0 0 1 1]
 [1 0 0 0 2 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 1 2 1 0 0 1 0 1 0 0]]


31
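
Most entries of the matrix above are zero, so a row is easier to read when each nonzero count is paired with its word. A minimal sketch in plain Python follows. It assumes a simplified `\w+` tokenizer rather than the regex from Exercise 1, and `bow_rows` is a hypothetical helper, not part of the class above.

```python
import re
from collections import Counter
from typing import Dict, List


def bow_rows(corpus: List[str]) -> List[Dict[str, int]]:
    """Return, for each document, a word -> count mapping sorted by word."""
    rows = []
    for document in corpus:
        # Simplified tokenizer (assumption): lowercase runs of word characters
        counts = Counter(re.findall(r"\w+", document.lower()))
        rows.append(dict(sorted(counts.items())))
    return rows


rows = bow_rows([
    "This is the third document.",
    "Really, see below. See below",
])
print(rows[0])
print(rows[1])
```

Each mapping holds the same information as the nonzero cells of one matrix row, keyed by word instead of by column position.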

## Exercise 5. Build a 5-gram model and clean up the results.

There are three tasks to do:
1. Use a 5-gram model instead of a 3-gram model.
2. Capitalize the first letter of each sentence.
3. Remove the whitespace between the last word of a sentence and ``.``, ``!`` or ``?``.

Hint: for 2. and 3., implement a function called ``clean_generated()`` that takes the generated text and fixes both issues at once. It may be easier to fix the text after it is generated rather than making changes inside the while loop.

In [16]:
nltk.download('gutenberg')
nltk.download('genesis')
nltk.download('inaugural')
nltk.download('nps_chat')
nltk.download('webtext')
nltk.download('treebank')
from nltk.book import *

wall_street = text7.tokens

import random  # used by next_word below
import re

tokens = wall_street

def cleanup():
    compiled_pattern = re.compile("^[a-zA-Z0-9.!?]")
    clean = list(filter(compiled_pattern.match,tokens))
    return clean
tokens = cleanup()

def build_ngrams():
    ngrams = []
    for i in range(len(tokens)-N+1):
        ngrams.append(tokens[i:i+N])
    return ngrams

def ngram_freqs(ngrams):
    counts = {}

    for ngram in ngrams:
        token_seq  = SEP.join(ngram[:-1])
        last_token = ngram[-1]

        if token_seq not in counts:
            counts[token_seq] = {}

        if last_token not in counts[token_seq]:
            counts[token_seq][last_token] = 0

        counts[token_seq][last_token] += 1

    return counts

def next_word(text, N, counts):
    """Sample the next token given the last N-1 tokens of text."""
    token_seq = SEP.join(text.split()[-(N-1):])
    choices = counts[token_seq].items()

    # Weighted random choice: walk the cumulative counts until they reach r
    total = sum(weight for choice, weight in choices)
    r = random.uniform(0, total)
    upto = 0
    for choice, weight in choices:
        upto += weight
        if upto >= r:
            return choice
    assert False # should not reach here

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package genesis to /root/nltk_data...
[nltk_data]   Package genesis is already up-to-date!
[nltk_data] Downloading package inaugural to /root/nltk_data...
[nltk_data]   Package inaugural is already up-to-date!
[nltk_data] Downloading package nps_chat to /root/nltk_data...
[nltk_data]   Package nps_chat is already up-to-date!
[nltk_data] Downloading package webtext to /root/nltk_data...
[nltk_data]   Package webtext is already up-to-date!
[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Package treebank is already up-to-date!


In [25]:
import random

def clean_generated(generated):
    generated = generated.replace(" .", ".")
    generated = generated.replace(" ?", "?")
    generated = generated.replace(" !", "!")

    sentences = tokenize_sentence(generated)

    clean = []
    for sentence in sentences:
        if len(sentence) > 0:
            sentence = sentence[0].upper() + sentence[1:]
            clean.append(sentence)

    return ' '.join(clean)

N=5 # fix it for other value of N

SEP=" "

sentence_count=5

ngrams = build_ngrams()
start_seq="We have found that"

counts = ngram_freqs(ngrams)

if start_seq is None or start_seq not in counts:
    start_seq = random.choice(list(counts.keys()))
# Keep the original casing: the keys in counts are case-sensitive
generated = start_seq

success = False
while not success:
    try:
        sentences = 0
        while sentences < sentence_count:
            generated += SEP + next_word(generated, N, counts)
            sentences += 1 if generated.endswith(('.', '!', '?')) else 0

        success = True

    except KeyError:
        # Dead end (context never seen in the corpus): restart from a new seed
        start_seq = random.choice(list(counts.keys()))
        generated = start_seq

generated = clean_generated(generated)

print(generated)

Some grain elevators are offering farmers 2.15 a bushel for corn. Many farmers probably would n't sell until prices rose at least 20 cents a bushel said 0 Lyle Reed president of Chicago Central Pacific Railroad Co. of Waterloo Iowa. It is n't clear however whether support for the proposal will be broad enough to pose a serious challenge to the White House acid-rain plan. While the new proposal might appeal to the dirtiest utilities it might not win the support of utilities many in the West that already have added expensive cleanup equipment or burn cleaner-burning fuels.
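
The cumulative-weight loop in `next_word` can also be written with the standard library's `random.choices`, which draws items in proportion to the given `weights`. Below is a minimal sketch on a tiny made-up token list with a 3-gram model so the counts are easy to check by hand. `build_counts` and `sample_next` are hypothetical names, not the functions used above.

```python
import random
from collections import defaultdict
from typing import Dict, List

SEP = " "


def build_counts(tokens: List[str], n: int) -> Dict[str, Dict[str, int]]:
    """Count how often each token follows each (n-1)-token context."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for i in range(len(tokens) - n + 1):
        context = SEP.join(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts


def sample_next(context: str, counts: Dict[str, Dict[str, int]]) -> str:
    """Draw the next token with probability proportional to its count."""
    followers = counts[context]
    return random.choices(list(followers), weights=list(followers.values()), k=1)[0]


# Made-up toy corpus, for illustration only
toy_tokens = "the cat sat on the mat . the cat ran .".split()
toy_counts = build_counts(toy_tokens, 3)

# "the cat" is followed once by "sat" and once by "ran"
print(dict(toy_counts["the cat"]))
print(sample_next("the cat", toy_counts))
```

With a 5-gram model most contexts in a small corpus have a single follower, which is why the generated text above reproduces long stretches of the source almost verbatim.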
