## Chapter 2 of Getting Started with Natural Language Processing
by Ekaterina Kochmar

## Tokenizing

Looking at some methods to split textual data into meaningful words

#### Using split to separate on whitespace

In [2]:
text = 'Michael Jordan is considered by many to be the greatest basketball player of all time'

text_words = text.split(" ")
print(text_words)

['Michael', 'Jordan', 'is', 'considered', 'by', 'many', 'to', 'be', 'the', 'greatest', 'basketball', 'player', 'of', 'all', 'time']
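
One detail of `split` is worth knowing at this point: `split(" ")` and `split()` treat runs of whitespace differently. A minimal sketch, using a made-up `sample` string that contains a double space and a trailing space:

```python
# Hypothetical sample with a double space and a trailing space
sample = "Michael  Jordan is great "

# Splitting on a single space yields an empty string for every extra space
by_space = sample.split(" ")

# Splitting with no argument collapses any run of whitespace
by_whitespace = sample.split()

print(by_space)
print(by_whitespace)
```

Calling `split()` without an argument is usually the safer default when the only goal is to separate on whitespace.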


### Considering punctuation
Splitting on whitespace alone leaves punctuation attached to the words, so `context?` and `context` become two different features even though they are the *same* word. This causes problems for NLP and is something tokenization should handle.

In [3]:
text = 'How should we regard Jordan\'s "greatness" in a basketball context? Is it Greatness at \
some or all facets of the game? Does the U.S.A have different criteria to U.K? First, lets consider \
Jordan\'s  greatness. '

question = "How great was Jordan? "
answer = "I'm going to say he was great! "
answer2 = "I haven't met anyone who disagrees."
text = text +" "+ question +" "+ answer +" "+ answer2

text_words = text.split(" ")
print(text_words)

['How', 'should', 'we', 'regard', "Jordan's", '"greatness"', 'in', 'a', 'basketball', 'context?', 'Is', 'it', 'Greatness', 'at', 'some', 'or', 'all', 'facets', 'of', 'the', 'game?', 'Does', 'the', 'U.S.A', 'have', 'different', 'criteria', 'to', 'U.K?', 'First,', 'lets', 'consider', "Jordan's", '', 'greatness.', '', 'How', 'great', 'was', 'Jordan?', '', "I'm", 'going', 'to', 'say', 'he', 'was', 'great!', '', 'I', "haven't", 'met', 'anyone', 'who', 'disagrees.']


In the output above, **"greatness"**, **Greatness** and **greatness** all have the same meaning but are represented as different words. The capitalisation of words will be dealt with when we normalise the textual data.
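
To make the problem concrete, here is a small sketch that compares the three surface forms as they come out of the whitespace split. Stripping punctuation with `str.strip` is only a rough illustration chosen for this example, not how a real tokenizer works:

```python
import string

# The three surface forms of the same word in the whitespace-split output
variants = ['"greatness"', 'Greatness', 'greatness.']

# As raw strings they count as three different features
print(len(set(variants)))

# Strip leading/trailing punctuation and lower-case each form
normalised = {v.strip(string.punctuation).lower() for v in variants}
print(normalised)
```

After this crude normalisation all three forms collapse into a single feature.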

Let's update the algorithm so that it considers punctuation when splitting the text into words.

In [4]:
text

'How should we regard Jordan\'s "greatness" in a basketball context? Is it Greatness at some or all facets of the game? Does the U.S.A have different criteria to U.K? First, lets consider Jordan\'s  greatness.  How great was Jordan?  I\'m going to say he was great!  I haven\'t met anyone who disagrees.'

### My attempt

In [5]:

# List of words
words = []

# Track the current word
current_word = ""

# List of delimiters
delimiters = ['"', '.', "?", "!"]

# iterate through each character in the text
for c in text:
    if len(current_word) > 0:
        previous_char = current_word[-1]
    
    if c == " ":
        words.append(current_word)    # add the current_word to words list
        current_word = ""             # initialise the current_word to empty string again
    elif c in delimiters and previous_char != " ":
        words.append(current_word)
        words.append(c)
        current_word = ""
    elif c in delimiters and previous_char == " ":
        words.append(c)
        current_word = ""
    else:
        current_word += c 
  


In [6]:
print(words)

['How', 'should', 'we', 'regard', "Jordan's", '', '"', 'greatness', '"', '', 'in', 'a', 'basketball', 'context', '?', '', 'Is', 'it', 'Greatness', 'at', 'some', 'or', 'all', 'facets', 'of', 'the', 'game', '?', '', 'Does', 'the', 'U', '.', 'S', '.', 'A', 'have', 'different', 'criteria', 'to', 'U', '.', 'K', '?', '', 'First,', 'lets', 'consider', "Jordan's", '', 'greatness', '.', '', '', 'How', 'great', 'was', 'Jordan', '?', '', '', "I'm", 'going', 'to', 'say', 'he', 'was', 'great', '!', '', '', 'I', "haven't", 'met', 'anyone', 'who', 'disagrees', '.']

This attempt leaves empty strings in the output because `current_word` is appended on every space or delimiter, even when it is empty. Note also that `previous_char` can never be a space (spaces are never added to `current_word`), so the third branch never runs.


## From the book

In [7]:

delimiters = ['"', ".", "?", "!"]
words = []
current_word = ""
 
for char in text:
    if char == " ":
        if not current_word == "":
            words.append(current_word)
            current_word = ""
    elif char in delimiters:
        if current_word == "":
            words.append(char)
        else:
            words.append(current_word)
            words.append(char)
            current_word = ""
    else:
        current_word += char
        
    
print(words)

['How', 'should', 'we', 'regard', "Jordan's", '"', 'greatness', '"', 'in', 'a', 'basketball', 'context', '?', 'Is', 'it', 'Greatness', 'at', 'some', 'or', 'all', 'facets', 'of', 'the', 'game', '?', 'Does', 'the', 'U', '.', 'S', '.', 'A', 'have', 'different', 'criteria', 'to', 'U', '.', 'K', '?', 'First,', 'lets', 'consider', "Jordan's", 'greatness', '.', 'How', 'great', 'was', 'Jordan', '?', "I'm", 'going', 'to', 'say', 'he', 'was', 'great', '!', 'I', "haven't", 'met', 'anyone', 'who', 'disagrees', '.']


There are still some issues with the code above if it is to serve as a tokenizer:
* It can't handle abbreviations like "U.K." or "i.e.", which are split at every period.
* Contractions like "I'm" will not be understood as "I am".
    * A tokenizer would split the answer into [I, 'm, going, to, say, he, was, great, !], and further rules would then recognise 'm as short for "am" in this example.
    * Similarly, "haven't" should be recognised as "have not". A tokenizer would split answer2 into [I, have, n't, met, anyone, who, disagrees, .]

Good tokenizing packages ensure that text is properly split into words
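
As a rough illustration of the kind of rules involved, the sketch below uses a single regular expression to keep dotted abbreviations together and to split contractions into a stem and a clitic. The pattern and the `simple_tokenize` helper are assumptions made up for this example; real tokenizers use far more elaborate rules.

```python
import re

# Order matters: earlier alternatives win over later ones.
TOKEN_PATTERN = re.compile(
    r"(?:[A-Za-z]\.){2,}"   # dotted abbreviations such as U.K. or i.e.
    r"|\w+(?=n't)"          # the stem before n't, e.g. "have" in "haven't"
    r"|n't"                 # the negation itself
    r"|'\w+"                # clitics such as 'm or 's
    r"|\w+"                 # ordinary words and numbers
    r"|[^\w\s]"             # any other single punctuation mark
)

def simple_tokenize(text):
    """Hypothetical helper: return the list of tokens found in text."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_tokenize("I'm sure the U.K. hasn't changed, i.e. nothing moved.")
print(tokens)
```

The sketch only recognises abbreviations in which every letter is followed by a period, and it cannot tell whether the final period of an abbreviation also ends the sentence.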

## Enron case study

#### Reading in the data

In [8]:
import os
import codecs

# ISO-8859-1 (also known as latin-1) is a character encoding covering Western European languages
# Enron emails are in English

def read_in(folder):
    files = os.listdir(folder)
    a_list = []
    for file in files:
        if not file.startswith("."):
            f = codecs.open(os.path.join(folder,file), "r", 
                encoding="ISO-8859-1", errors="ignore")
            a_list.append(f.read())
            f.close()
    return a_list

In [9]:
data_path = 'data/enron1/'

# each element in the list contains contents of one email
ham_list = read_in(os.path.join(data_path, 'ham'))
spam_list = read_in(os.path.join(data_path, 'spam'))

print(len(spam_list))
print(len(ham_list))
print("***SPAM***")
print(spam_list[1])
print("***HAM***")
print(ham_list[0])


1500
3672
***SPAM***
Subject: last notice
problem mount sure pattern . art , were do any . give cloud had , it
noun us may . their such let past , part sound . stand buy through ,
every . leave say current . wear rest , blow hair final word
before . way talk quick was . seem this though talk live wild
problem . map opposite able , sleep put world .
- -
phone : 837 - 444 - 1269
mobile : 268 - 464 - 9520
email : yorkers @ alltel . net

***HAM***
Subject: enron / hpl actuals for november 27 , 2000
teco tap 20 . 000 / enron ; 100 . 000 / hpl gas daily


#### Preprocessing the data
Combine the ham and spam datasets, attach a label to each email, then shuffle the combined dataset.

In [10]:
import random 

all_emails = [(email_content, "spam") for email_content in spam_list]
all_emails += [(email_content, "ham") for email_content in ham_list]
print(all_emails[1])

# Note
# From lib/python3.8/string.py
# whitespace -- a string containing all ASCII whitespace
# Some strings for ctype-style character classification
# whitespace = ' \t\n\r\v\f'

random.seed(42)
random.shuffle(all_emails)
print(f"Dataset size = {str(len(all_emails))} emails")

('Subject: last notice\r\nproblem mount sure pattern . art , were do any . give cloud had , it\r\nnoun us may . their such let past , part sound . stand buy through ,\r\nevery . leave say current . wear rest , blow hair final word\r\nbefore . way talk quick was . seem this though talk live wild\r\nproblem . map opposite able , sleep put world .\r\n- -\r\nphone : 837 - 444 - 1269\r\nmobile : 268 - 464 - 9520\r\nemail : yorkers @ alltel . net\r\n', 'spam')
Dataset size = 5172 emails
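
Fixing the seed before shuffling is what makes the later train/test split reproducible. A minimal sketch of that property, using a made-up list of labelled emails and a hypothetical `shuffled_copy` helper that works on a private `random.Random` instance rather than the global generator:

```python
import random

def shuffled_copy(items, seed):
    """Hypothetical helper: return a shuffled copy using a fixed seed."""
    rng = random.Random(seed)   # independent generator, leaves global state alone
    copy = list(items)
    rng.shuffle(copy)
    return copy

# Made-up labelled emails
emails = [("buy now", "spam"), ("meeting at 10", "ham"),
          ("cheap meds", "spam"), ("gas nominations", "ham")]

first = shuffled_copy(emails, seed=42)
second = shuffled_copy(emails, seed=42)

print(first == second)                    # same seed, same order
print(sorted(first) == sorted(emails))    # nothing added or lost
```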


#### Tokenizing
Each email is a single string of characters, so we need to split the text into words. Use NLTK's tokenizer for this.

In [11]:
import nltk 

user = os.environ["USER"]
print(user)

nltk.download('punkt', download_dir='/home/'+user+'/repos/hosting-ml-as-service/venv/share/nltk_data')

jack


[nltk_data] Downloading package punkt to /home/jack/repos/hosting-ml-
[nltk_data]     as-service/venv/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [12]:

from nltk import word_tokenize

def tokenize(text):
    # word_tokenize already returns a list of tokens
    return word_tokenize(text)

sentence = "What's the best way to split a sentence into words?"
print(tokenize(sentence))

['What', "'s", 'the', 'best', 'way', 'to', 'split', 'a', 'sentence', 'into', 'words', '?']


#### Extract and normalise the features
* Iterate over the emails and tokenize the text, which is first transformed to lower case so that it is normalised.
* Each email's features and its label are paired together in a tuple.
* The tokenized list of words is converted to a dictionary with the words as keys and `True` as the value.

In [13]:
def get_features(text):
    features = {}
    word_list = [word for word in word_tokenize(text.lower())]
    for word in word_list:
        features[word] = True    # switch on the flag that this word is contained in the email
    return features

all_features = [(get_features(email), label) for (email, label) in all_emails]
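
Because the features are stored in a dictionary, a word that occurs several times in an email is recorded only once: the features capture presence, not counts. A small sketch of this, using a hypothetical `presence_features` stand-in that tokenizes with `str.split` instead of NLTK so that it runs without any downloads:

```python
def presence_features(text):
    """Hypothetical stand-in for get_features that uses str.split
    instead of NLTK's word_tokenize."""
    return {word: True for word in text.lower().split()}

# "stop" and "." both occur twice but appear once in the features
features = presence_features("Stop paying more . Stop wasting money .")
print(features)
print(len(features))
```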


In [14]:
all_emails[0]

("Subject: no doctor ' s visit .\r\nstop wasting money on prescription drugs . get them online for 80 % off .\r\nvlagra , clalis , zyban , prozac , xenlcal , and many many more . . .\r\nstop paying more than you have too !\r\n- todays special -\r\nviagra , retail price $ 15 . 99 , our price $ 2 . 99\r\ncialis , retail price $ 17 . 99 , our price $ 3 . 99\r\nshipped world wide\r\nno prescription required\r\nvisit us here : http : / / imkjbest . com / z /\r\n",
 'spam')

In [15]:
print(all_features[0])

({'subject': True, ':': True, 'no': True, 'doctor': True, "'": True, 's': True, 'visit': True, '.': True, 'stop': True, 'wasting': True, 'money': True, 'on': True, 'prescription': True, 'drugs': True, 'get': True, 'them': True, 'online': True, 'for': True, '80': True, '%': True, 'off': True, 'vlagra': True, ',': True, 'clalis': True, 'zyban': True, 'prozac': True, 'xenlcal': True, 'and': True, 'many': True, 'more': True, 'paying': True, 'than': True, 'you': True, 'have': True, 'too': True, '!': True, '-': True, 'todays': True, 'special': True, 'viagra': True, 'retail': True, 'price': True, '$': True, '15': True, '99': True, 'our': True, '2': True, 'cialis': True, '17': True, '3': True, 'shipped': True, 'world': True, 'wide': True, 'required': True, 'us': True, 'here': True, 'http': True, '/': True, 'imkjbest': True, 'com': True, 'z': True}, 'spam')


In [16]:
import json

# Representation of first feature
a = json.dumps(all_features[0][0], indent=4)

print(a)
print("----------")
print(f"label: {all_features[0][1]}")


{
    "subject": true,
    ":": true,
    "no": true,
    "doctor": true,
    "'": true,
    "s": true,
    "visit": true,
    ".": true,
    "stop": true,
    "wasting": true,
    "money": true,
    "on": true,
    "prescription": true,
    "drugs": true,
    "get": true,
    "them": true,
    "online": true,
    "for": true,
    "80": true,
    "%": true,
    "off": true,
    "vlagra": true,
    ",": true,
    "clalis": true,
    "zyban": true,
    "prozac": true,
    "xenlcal": true,
    "and": true,
    "many": true,
    "more": true,
    "paying": true,
    "than": true,
    "you": true,
    "have": true,
    "too": true,
    "!": true,
    "-": true,
    "todays": true,
    "special": true,
    "viagra": true,
    "retail": true,
    "price": true,
    "$": true,
    "15": true,
    "99": true,
    "our": true,
    "2": true,
    "cialis": true,
    "17": true,
    "3": true,
    "shipped": true,
    "world": true,
    "wide": true,
    "required": true,
    "us": true,
    "here": true,
    "http": true,
    "/": true,
    "imkjbest": true,
    "com": true,
    "z": true
}
----------
label: spam

### Train the classifier

#### Naive Bayes

Use Bayes' formula to calculate the probability of a class given the evidence:

```
P(C=c|E) = P(E|C=c) * P(C=c) / P(E)
```
* `P(C=c)`: Prior probability of the class being equal to c
* `P(E)`: Probability of the evidence
* `P(E|C=c)`: Likelihood of seeing the evidence when the class is C=c
    * If there is a data generating process behind instances of class C=c, how often would such an instance look like E?
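
A worked example may help. The class counts below are the ones printed when the data was read in (1500 spam, 3672 ham); the two likelihood values are made-up numbers chosen purely for illustration.

```python
# Class priors from the dataset sizes printed earlier in this notebook
n_spam, n_ham = 1500, 3672
p_spam = n_spam / (n_spam + n_ham)
p_ham = n_ham / (n_spam + n_ham)

# Assumed likelihoods of some evidence E under each class (made-up numbers)
p_e_given_spam = 0.05
p_e_given_ham = 0.001

# P(E) via the law of total probability
p_e = p_e_given_spam * p_spam + p_e_given_ham * p_ham

# Bayes' formula for each class
p_spam_given_e = p_e_given_spam * p_spam / p_e
p_ham_given_e = p_e_given_ham * p_ham / p_e

print(round(p_spam_given_e, 3), round(p_ham_given_e, 3))
```

Even though spam is the less frequent class, evidence that is much more likely under spam pushes the posterior strongly towards spam.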

#### Likelihood of the evidence

```    
P(E|C=c) = N(E when C=c) / N(C=c)
```

The Bayes classifier is considered *naive* because it makes the assumption of conditional independence.

**Note:**
We have transformed email content from something like

``` "Participate in our new lottery now!" ```  

to  

```{'Participate': True, 'in': True,..., '!': True}```

so that this dictionary represents the feature vector of the instance and 'Spam' is the class label.

Conditional independence says that, given the class, each word in our feature vector is independent of the others, so "new" is independent of "lottery" in this example.

Therefore we can rewrite `P(E|C=c)`, or in this example

```P('Participate': True, 'in': True,..., '!': True | C='Spam')```

as the product

```P('Participate': True | C='Spam') * P('in': True | C='Spam') *...* P('!': True | C='Spam')```

More generally, if the feature vector is represented as `E = [e1, e2,..., ek]` then:

```P(e1 ⋀ e2 ⋀ ... ⋀ ek | C=c) == P(e1 | C=c) * P(e2 | C=c) *...* P(ek | C=c)```

Each term can be estimated from the training data:

```P(ei | C=c) = N(instances in Class c with feature ei present) / N(instances in Class c)```
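
The counting estimate above can be sketched on a toy training set. The data and the helper names `word_likelihood` and `evidence_likelihood` are made up for illustration, and only the words that are present are considered:

```python
from math import prod

# Toy training set (made up): the set of words present in each email, plus a label
training = [
    ({"win", "lottery", "now"}, "spam"),
    ({"new", "lottery", "offer"}, "spam"),
    ({"meeting", "now"}, "ham"),
    ({"new", "gas", "nomination"}, "ham"),
]

def word_likelihood(word, label):
    """N(instances of the class containing word) / N(instances of the class)."""
    in_class = [words for words, lab in training if lab == label]
    return sum(word in words for words in in_class) / len(in_class)

def evidence_likelihood(words, label):
    """P(E | class) under the naive assumption: product of per-word terms."""
    return prod(word_likelihood(w, label) for w in words)

print(word_likelihood("lottery", "spam"))
print(evidence_likelihood(["new", "lottery"], "spam"))
print(evidence_likelihood(["new", "lottery"], "ham"))
```

Note that a single word never seen in a class drives the whole product for that class to zero. Smoothing the counts is a common remedy, but it goes beyond what this section covers.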

#### Prediction

Generally, if there are two classes in the training data, we predict a class following this pseudocode:

```
for each instance Ei
    for each class cj
        estimate P(cj | Ei)
    if P(c1 | Ei) > P(c2 | Ei)
        Predict(c1)
    else
        Predict(c2)
```
**Note:** P(E) is the same for both classes, so it cancels out when comparing the class probabilities and does not need to be calculated.
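
The pseudocode above can be turned into a small runnable sketch. Everything here is illustrative: the toy data and the function names are made up, and the add-one smoothing is an assumption of this sketch that keeps unseen words from zeroing out a class.

```python
from collections import Counter, defaultdict

def train_naive_bayes(instances):
    """Count classes and, per class, the instances containing each word."""
    class_counts = Counter(label for _, label in instances)
    word_counts = defaultdict(Counter)
    for words, label in instances:
        word_counts[label].update(set(words))
    return class_counts, word_counts

def score(words, label, class_counts, word_counts):
    """P(C=c) times the product of P(e | C=c); P(E) is left out on purpose."""
    total = sum(class_counts.values())
    result = class_counts[label] / total
    for word in words:
        # Add-one smoothing (an assumption of this sketch) avoids zero products
        result *= (word_counts[label][word] + 1) / (class_counts[label] + 2)
    return result

def predict(words, class_counts, word_counts):
    """Return the class with the highest unnormalised posterior."""
    return max(class_counts, key=lambda c: score(words, c, class_counts, word_counts))

# Made-up toy data
toy = [
    (["win", "lottery", "now"], "spam"),
    (["new", "lottery", "offer"], "spam"),
    (["meeting", "now"], "ham"),
    (["new", "gas", "nomination"], "ham"),
]
counts = train_naive_bayes(toy)
print(predict(["lottery", "offer"], *counts))
print(predict(["gas", "meeting"], *counts))
```

Using `max` over the classes generalises the two-class comparison to any number of classes.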



In [21]:
from nltk import NaiveBayesClassifier, classify

def train(features, proportion):
    train_size = int(len(features) * proportion)
    train_set, test_set = features[:train_size], features[train_size:]
    print(f"Training set size = {str(len(train_set))} emails")
    print(f"Test set size = {str(len(test_set))} emails")
    classifier = NaiveBayesClassifier.train(train_set)
    return train_set, test_set, classifier

train_set, test_set, classifier = train(all_features, 0.8)

Training set size = 4137 emails
Test set size = 1035 emails


In [26]:
def evaluate(train_set, test_set, classifier):
    print(f"Train acc = {str(classify.accuracy(classifier, train_set))}")
    print(f"Test acc = {str(classify.accuracy(classifier, test_set))}")
    
    # Lists the features whose likelihood differs most between the two classes,
    # i.e. those with the largest ratio P(word=True | spam) / P(word=True | ham), or its inverse
    classifier.show_most_informative_features(50)
    
evaluate(train_set, test_set, classifier)

Train acc = 0.9579405366207396
Test acc = 0.9342995169082126
Most Informative Features
               forwarded = True              ham : spam   =    203.2 : 1.0
                     hou = True              ham : spam   =    196.3 : 1.0
                    2004 = True             spam : ham    =    159.8 : 1.0
            prescription = True             spam : ham    =    129.3 : 1.0
                    pain = True             spam : ham    =    100.4 : 1.0
                     ect = True              ham : spam   =     86.7 : 1.0
                   cheap = True             spam : ham    =     84.3 : 1.0
                  farmer = True              ham : spam   =     84.3 : 1.0
                     sex = True             spam : ham    =     82.7 : 1.0
                featured = True             spam : ham    =     79.5 : 1.0
                    2001 = True              ham : spam   =     76.2 : 1.0
              nomination = True              ham : spam   =     72.6 : 1.0
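
The ratios in this listing compare how likely a feature is under one class versus the other. The sketch below illustrates the idea on made-up data; the `likelihood_ratio` helper and its `smoothing` parameter are assumptions of this example, and NLTK's own probability estimates may differ.

```python
def likelihood_ratio(word, spam_docs, ham_docs, smoothing=0.5):
    """Hypothetical helper: ratio of P(word | spam) to P(word | ham).

    smoothing is an assumption of this sketch; it keeps the ratio finite
    when a word never occurs in one of the classes."""
    p_spam = (sum(word in d for d in spam_docs) + smoothing) / (len(spam_docs) + 2 * smoothing)
    p_ham = (sum(word in d for d in ham_docs) + smoothing) / (len(ham_docs) + 2 * smoothing)
    return p_spam / p_ham

# Made-up documents, each represented as the set of words it contains
spam_docs = [{"cheap", "meds"}, {"cheap", "stocks"}, {"prescription", "cheap"}]
ham_docs = [{"forwarded", "gas"}, {"meeting", "forwarded"}, {"cheap", "flights", "forwarded"}]

for word in ["cheap", "forwarded"]:
    ratio = likelihood_ratio(word, spam_docs, ham_docs)
    if ratio >= 1:
        print(f"{word}: spam : ham = {ratio:.1f} : 1.0")
    else:
        print(f"{word}: ham : spam = {1 / ratio:.1f} : 1.0")
```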
             

#### Word occurrences

Check word occurrence in all available contexts
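
Checking a word in context means listing each occurrence together with a few neighbouring tokens. The idea can be sketched in plain Python; the `keyword_in_context` helper and its `window` parameter are made up for this illustration and are not part of NLTK.

```python
def keyword_in_context(tokens, keyword, window=3):
    """Hypothetical helper: return each occurrence of keyword with up to
    `window` tokens of context on either side."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append(" ".join(left + [keyword] + right))
    return hits

tokens = "follow your stocks and news headlines , exchange files".split()
for line in keyword_in_context(tokens, "stocks", window=2):
    print(line)
```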

e.g. "stocks" is a strong predictor of spam. Why?  
```
stocks = True             spam : ham    =     40.3 : 1.0
```

Surely "stocks" can also be used in a harmless way, so let's check.

In [28]:

from nltk.text import Text

def concordance(data_list, search_word):
    for email in data_list:
        word_list = [word for word in word_tokenize(email.lower())]
        text_list = Text(word_list)
        if search_word in word_list:
            # concordance checks for the occurrences of the specified word
            # prints out word in context
            # NLTK by default prints 36 characters either side of the search word
            text_list.concordance(search_word)
            
print("STOCKS in HAM: ")
concordance(ham_list, "stocks")
print("\n\nSTOCKS in SPAM: ")
concordance(spam_list, "stocks")

STOCKS in HAM: 
Displaying 1 of 1 matches:
ur member directory . * follow your stocks and news headlines , exchange files
Displaying 1 of 1 matches:
ur member directory . * follow your stocks and news headlines , exchange files
Displaying 1 of 1 matches:
ur member directory . * follow your stocks and news headlines , exchange files
Displaying 1 of 1 matches:
ad my portfolio is diversified into stocks that have lost even more money than


STOCKS in SPAM: 
Displaying 1 of 1 matches:
dge - ksige are you tired of buying stocks and not having them perform ? our s
Displaying 5 of 5 matches:
hursday ! some of these littie voip stocks have been realiy moving lateiy . an
t can happen with these sma | | cap stocks when they take off . and it happens
 statements . as with many microcap stocks , today ' s company has additiona |
is report pertaining to investing , stocks , securities must be understood as 
ntative before deciding to trade in stocks featured within this report . none 
Displaying 2 

Displaying 1 of 1 matches:
scovering value in natural resource stocks elgin resources ( elr - tsx ) extra
Displaying 3 of 3 matches:
 plays . widespread gains in energy stocks are inflating the portfolios of agg
st levels of the year , with energy stocks outperforming all other market sect
utions that sma | | and micro - cap stocks are high - risk investments and tha
Displaying 4 of 4 matches:
watch this one trade . these little stocks can surprise in a big way sometimes
might occur . as with many microcap stocks , today ' s company has additional 
his email pertaining to investing , stocks , securities must be understood as 
ntative before deciding to trade in stocks featured within this email . none o
Displaying 1 of 1 matches:
cautions that small and micro - cap stocks are high - risk investments and tha
Displaying 5 of 5 matches:
ck monday some of these littie voip stocks have been really moving lately . an
t can happen with these sma | | cap stocks when they take off . and it happ