## Introduction to Lab 01

This lab is an introduction to some basic text processing.

So let's first set some text...

In [None]:
sentence = "The voice that navigated was definitely that of a machine, and yet you could tell that the machine was a woman, which hurt my mind a little.\n How can machines have genders?\n The machine also had an American accent.\n How can machines have nationalities?\n This can't be a good idea, making machines talk like real people, can it?\n Giving machines humanoid identities?"

Now let's try to split the text on whitespace (the default behaviour of split())

In [None]:
sentence.split()

You will observe that some words are not well separated from punctuation: characters such as commas, full stops and question marks stay attached to them.
So we need to find a way to remove those characters... but, before we do that, let's see how we can create a quick feature vector first!

In [None]:
tokens = sorted(sentence.split()) # splitting on whitespace, then sorting
vocab = sorted(set(tokens)) # removing duplicates with set(), then sorting
vocab # just printing the vocab so we can look at it

We can see that sorting places digits (if there are any) first, followed by capitalised words and then lower-case words, each group sorted alphabetically. We also see that repeated words appear only once in the vocabulary list. Let's compare the size of the two lists.

In [None]:
tokens_len = len(tokens)
vocab_len = len(vocab)

print("tokens:", tokens_len)
print("vocab:", vocab_len)
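The sort order noted earlier comes from Python comparing strings by character code, where digits come before upper-case letters and upper-case letters before lower-case ones. The sketch below uses a small made-up word list (not the lab text) to show this, plus a case-insensitive alternative.

```python
# Hypothetical mini-vocabulary, chosen only to show the ordering
words = ["the", "The", "machine", "2nd", "American"]

default_order = sorted(words)               # digits, then capitals, then lower case
folded_order = sorted(words, key=str.lower) # compare as if everything were lower case

print(default_order)
print(folded_order)
```

Whether "The" and "the" should count as the same word is a normalisation choice that comes up again later in the lab.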

Let's try to print the matrix of tokens against the vocabulary. We will use the NumPy lib for that.

In [None]:
import numpy as np

matrix = np.zeros((tokens_len, vocab_len), int)
for i, token in enumerate(tokens):
    matrix[i, vocab.index(token)] = 1

matrix

It is not easy to see, but some columns contain a 1 in more than one row (these are the words that occur several times), whereas the rest contain exactly one 1. To make it a little more readable, we could use a Pandas DataFrame! Both Pandas and NumPy are very useful libs that we will use many times.

In [None]:
import pandas as pd

pd.DataFrame(matrix, columns=vocab, index=tokens)

Now this is a lot clearer, and if we wanted we could carry on making it look nicer.

Let's now carry on building the bag of words (BoW)
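As a quick check of the earlier observation that some columns hold more than one 1, the sketch below rebuilds a tiny one-hot matrix with plain lists and sums each column. The five-word sample is made up for illustration; with the NumPy matrix from the lab, matrix.sum(axis=0) should give the same kind of result.

```python
# Made-up sample, small enough to check by eye
sample_tokens = sorted("the machine was a machine".split())
sample_vocab = sorted(set(sample_tokens))

# One row per token, one column per vocabulary word
one_hot = [[1 if tok == word else 0 for word in sample_vocab]
           for tok in sample_tokens]

# Summing each column counts how often that word occurs
column_sums = dict(zip(sample_vocab, (sum(col) for col in zip(*one_hot))))
print(column_sums)
```

The column sums are word frequencies, which is what the bag of words stores directly.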

In [None]:
bow = {} # setting this up as a dictionary

for token in tokens:
    bow[token] = 1

sorted(bow.items()) # let's print it

Since bow is a dictionary, repeated words are stored only once.
Pandas also offers a more efficient dictionary-like structure called Series.

In [None]:
df = pd.DataFrame(pd.Series(dict([(token, 1) for token in tokens])), columns=['sent']).T
df

In [None]:
corpus = {}
for i, sent in enumerate(sentence.split('\n')):
    corpus['sent{}'.format(i)] = dict((tok, 1) for tok in sent.split())

df = pd.DataFrame.from_records(corpus).fillna(0).astype(int).T
df

Now we have built one feature vector for each sentence in the original text. Let's do a dot product calculation.

In [None]:
df = df.T
print("dot product of sent0 and sent1:", df.sent0.dot(df.sent1), "and dot product of sent0 and sent2:", df.sent0.dot(df.sent2))

As we see from the results, the higher the dot product, the more words the two sentences share, and so the more similar the vectors are...
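To make the link between the dot product and shared words concrete, here is a minimal sketch using plain dictionaries instead of a DataFrame. The helper names (binary_bow, dot, cosine) are made up for this example. The cosine helper is an addition that the lab does not use: it divides by the vector lengths so that long sentences do not score higher just for having more words.

```python
import math

def binary_bow(text):
    """Map each whitespace-separated token to 1 (presence only)."""
    return {tok: 1 for tok in text.split()}

def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(u[k] * v[k] for k in u.keys() & v.keys())

def cosine(u, v):
    """Dot product divided by the product of the two vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

a = binary_bow("The machine was a woman")
b = binary_bow("The machine also had an American accent.")
c = binary_bow("How can machines have genders?")

print(dot(a, b), dot(a, c))    # shared tokens: "The" and "machine", then none
print(round(cosine(a, b), 3))
```

For binary vectors the dot product is simply the number of tokens that the two sentences have in common.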

### Tokenization

We can improve our vocabulary now by removing the punctuation. Let's do that with regular expressions.

In [None]:
import re

tokens = re.split(r'[-\s.,;!?]+', sentence)
tokens
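One detail to watch with re.split: when the text ends with a separator (here the final question mark), the result ends with an empty string. A minimal sketch on a single sentence from the text, with the empty strings filtered out:

```python
import re

sample = "How can machines have genders?"

raw = re.split(r'[-\s.,;!?]+', sample)  # trailing "?" leaves an empty last item
clean = [tok for tok in raw if tok]     # keep only non-empty tokens

print(raw)
print(clean)
```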

Although this seems to work well... you might still have issues with characters that were not anticipated in the pattern. So we usually use an existing NLP tokeniser to do this job. Let's try the NLTK lib.

NLTK also supports regular expressions:

In [None]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+|\$[0-9.]+|\S+') # \$ matches a literal dollar sign (an unescaped $ means end of string)
tokenizer.tokenize(sentence)

But there are other, more specialised tokenisers:

In [None]:
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(sentence)

For now let's use the regular expression word pattern \w+, so we can control what we do.

In [None]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"\w+")
tokens = tokenizer.tokenize(sentence)
print(tokens)
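Note that \w+ also splits contractions, so "can't" becomes "can" and "t". The sketch below uses re.findall as a stand-in for the tokenizer and shows one possible pattern that keeps an apostrophe inside a word. The second pattern is only an illustration, not something the lab prescribes.

```python
import re

sample = "This can't be a good idea"

plain = re.findall(r"\w+", sample)                  # apostrophe acts as a break
contractions = re.findall(r"\w+(?:'\w+)?", sample)  # optional 'suffix kept with the word

print(plain)
print(contractions)
```

Which behaviour is preferable depends on what happens next in the pipeline, for example whether the stop word list contains "can't" or "can".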



At this point you could try out other tokenisers from other libraries and see if there are any differences.

### n-Gram Creation

We will now calculate the 2-grams

In [None]:
from nltk.util import ngrams

list(ngrams(tokens, 2))

and 3-grams

In [None]:
list(ngrams(tokens, 3))

If we want the n-grams as strings rather than tuples, then we need to convert them

In [None]:
bigrams = [" ".join(x) for x in list(ngrams(tokens, 2))]
print(bigrams)
trigrams = [" ".join(x) for x in list(ngrams(tokens, 3))]
print()
print(trigrams)
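To see what an n-gram function does under the hood, here is a minimal pure-Python sketch. The name simple_ngrams is made up for this example and this is not how NLTK implements it. It zips the token list with shifted copies of itself, so each tuple holds n consecutive tokens.

```python
def simple_ngrams(seq, n):
    """Return all runs of n consecutive items as tuples."""
    return list(zip(*(seq[i:] for i in range(n))))

words = ["How", "can", "machines", "have", "genders"]

pairs = simple_ngrams(words, 2)
triples = [" ".join(g) for g in simple_ngrams(words, 3)]

print(pairs)
print(triples)
```

A list of k tokens gives k - n + 1 n-grams, so five tokens produce four 2-grams and three 3-grams.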

Another important step we looked at in the lectures is the removal of stop words. Let's try to use the nltk stop word list to remove them.

### Stop-word Removal

First let's download the list.

In [None]:
import nltk
nltk.download('stopwords')

And now let's take a look at it

In [None]:
stop_words = nltk.corpus.stopwords.words('english')
print("number of stopwords:", len(stop_words))
print(stop_words)

Other libs have different stop words. Let's see a much larger set from sklearn

In [None]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as sklearn_stop_words

print("number of stopwords:", len(sklearn_stop_words))
print(sklearn_stop_words)

Strangely enough, although there are more stop words in sklearn, you will find that nltk has words that are not contained in sklearn. So you might want to join the two lists.
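Joining the lists is a set union. The sketch below uses two tiny stand-in lists so that it runs without any downloads. In the notebook, list_a and list_b would be stop_words and sklearn_stop_words, and the words shown here are placeholders rather than a claim about what either library contains.

```python
# Tiny stand-in lists (hypothetical contents, for illustration only)
list_a = ["the", "a", "that", "ain"]
list_b = frozenset(["the", "a", "that", "whereas"])

only_in_a = set(list_a) - set(list_b)  # words the second list is missing
combined = set(list_a) | set(list_b)   # union of both lists, no duplicates

print(sorted(only_in_a))
print(len(combined))
```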

For normalising the text you could do something as simple as making sure all words are lower case.

In [None]:
norm_tokens = [x.lower() for x in tokens]
print(norm_tokens)

### Stemming

For stemming the words we could use nltk again

In [None]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
stem_tokens = [stemmer.stem(x) for x in norm_tokens]
print(stem_tokens)

For lemmatising, nltk can again do the job

In [None]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stem_tokens = [lemmatizer.lemmatize(x) for x in norm_tokens]
print(stem_tokens)

The sentence we have has no issues with the lemmas... but look at the following example

In [None]:
print(lemmatizer.lemmatize("better"))
print(lemmatizer.lemmatize("better", 'a')) # declaring the POS as adjective

If we don't include the POS, the nltk WordNet lemmatizer treats every word as a noun and so does not work well. So let's try to fix that

In [None]:
from nltk.corpus import wordnet
nltk.download('averaged_perceptron_tagger')

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    # now we need to convert from nltk to wordnet POS notations (for compatibility reasons)
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN) # return and default to noun if not found

In [None]:
stem_tokens = [lemmatizer.lemmatize(x, pos=get_wordnet_pos(x)) for x in norm_tokens]
print(stem_tokens)

If we look at the words now, more of them have been reduced to the same lemma, so we get higher counts in our bag of words

### Feature-vector Creation

In [None]:
from collections import Counter

bow = Counter(stem_tokens)
bow

Now let's check the 6 most frequent words

In [None]:
bow.most_common(6)

Now let's remove the stop words

In [None]:
no_stop_tokens = [x for x in stem_tokens if x not in stop_words]
count = Counter(no_stop_tokens)
count

Finally... let's make our feature vector using the frequency ratio (term count / total number of terms in the doc)

In [None]:
document_vector = []
doc_length = len(no_stop_tokens)
for key, value in count.most_common():
    document_vector.append(value / doc_length)

print(document_vector)
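The list above keeps only the numbers, so it is no longer clear which word each value belongs to. As a small variation, the sketch below stores the ratios in a dictionary keyed by word. The four-token sample is made up for illustration.

```python
from collections import Counter

# Made-up sample standing in for no_stop_tokens
sample_tokens = ["machine", "voice", "machine", "woman"]

counts = Counter(sample_tokens)
total = sum(counts.values())

# term count / total number of terms, keyed by the term itself
tf = {word: n / total for word, n in counts.most_common()}
print(tf)
```

Because every count is divided by the same total, the values of a document always add up to 1.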

We have explored many options already, and we will continue with more advanced feature vectors in the next lab, plus some visualisations in charts. So until then please try different experiments on your own:
* change the text so that it has more sentences on different topics (so you can compare the feature vectors later)
* try to use different libraries for tokenising, PoS tagging, stemming and lemmatising
* try to use other distance metrics to compare vectors, such as Euclidean distance
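As a starting point for the last experiment, here is a minimal sketch of the Euclidean distance between two frequency vectors stored as dictionaries. The function name and the two sample documents are made up for illustration. A word that is missing from one document counts as 0 in that document.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = u.keys() | v.keys()
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))

# Hypothetical term-frequency vectors for two short documents
doc1 = {"machine": 0.5, "voice": 0.25, "woman": 0.25}
doc2 = {"machine": 0.5, "accent": 0.5}

print(euclidean(doc1, doc1))            # identical vectors are at distance 0
print(round(euclidean(doc1, doc2), 3))
```

Unlike the dot product, a smaller value here means that the vectors are more similar.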