## Chapter 2 of Getting Started with Natural Language Processing
by Ekaterina Kochmar

## Tokenizing

Looking at some methods for splitting textual data into meaningful words.

#### Using split to separate on whitespace

In [1]:
text = 'Michael Jordan is considered by many to be the greatest basketball player of all time'

text_words = text.split(" ")
print(text_words)

['Michael', 'Jordan', 'is', 'considered', 'by', 'many', 'to', 'be', 'the', 'greatest', 'basketball', 'player', 'of', 'all', 'time']
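One detail worth keeping in mind: `split(" ")` splits on every single space, so doubled spaces produce empty strings, whereas `split()` with no argument splits on any run of whitespace. A minimal sketch, using a made-up sample string:

```python
sample = "Michael  Jordan is\tconsidered great"

# Splitting on a single space keeps an empty string for the doubled space
# and does not split on the tab character.
by_space = sample.split(" ")

# Splitting with no argument treats any run of whitespace as one separator.
by_whitespace = sample.split()

print(by_space)
print(by_whitespace)
```

Neither form deals with punctuation, which stays attached to the neighbouring word.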


### Considering punctuation
When splitting on whitespace, punctuation stays attached to the neighbouring word. A word followed by punctuation (for example `context?`) therefore becomes a different feature from the *same* word without punctuation (`context`). This causes problems for NLP, and it is something tokenization should handle.

In [2]:
text = 'How should we regard Jordan\'s "greatness" in a basketball context? Is it Greatness at \
some or all facets of the game? Does the U.S.A have different criteria to U.K? First, lets consider \
Jordan\'s  greatness. '

question = "How great was Jordan? "
answer = "I'm going to say he was great! "
answer2 = "I haven't met anyone who disagrees."
text = text +" "+ question +" "+ answer +" "+ answer2

text_words = text.split(" ")
print(text_words)

['How', 'should', 'we', 'regard', "Jordan's", '"greatness"', 'in', 'a', 'basketball', 'context?', 'Is', 'it', 'Greatness', 'at', 'some', 'or', 'all', 'facets', 'of', 'the', 'game?', 'Does', 'the', 'U.S.A', 'have', 'different', 'criteria', 'to', 'U.K?', 'First,', 'lets', 'consider', "Jordan's", '', 'greatness.', '', 'How', 'great', 'was', 'Jordan?', '', "I'm", 'going', 'to', 'say', 'he', 'was', 'great!', '', 'I', "haven't", 'met', 'anyone', 'who', 'disagrees.']


In the text above, **"greatness"**, **Greatness** and **greatness** all have the same meaning but will be represented differently as words. The capitalisation of words will be dealt with when we normalise the textual data.
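As a rough illustration of what normalising these variants could look like, the sketch below strips leading and trailing punctuation and lower-cases each token. The helper name `normalise` is hypothetical, and this approach discards punctuation rather than keeping it as separate tokens, so it is not a substitute for a tokenizer.

```python
import string

variants = ['"greatness"', 'Greatness', 'greatness.']

def normalise(token):
    """Strip leading/trailing punctuation and lower-case the token."""
    return token.strip(string.punctuation).lower()

normalised = [normalise(token) for token in variants]
print(normalised)
```

All three variants collapse to the same string, so they would count as a single feature.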

Let's update the algorithm so that it considers punctuation when splitting the text into words.

In [3]:
text

'How should we regard Jordan\'s "greatness" in a basketball context? Is it Greatness at some or all facets of the game? Does the U.S.A have different criteria to U.K? First, lets consider Jordan\'s  greatness.  How great was Jordan?  I\'m going to say he was great!  I haven\'t met anyone who disagrees.'

### My attempt

In [4]:

# List of words
words = []

# Track the current word
current_word = ""

# List of delimiters
delimiters = ['"', '.', "?", "!"]

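# Start with no previous character so the first iteration cannot raise a NameError
previous_char = ""

# iterate through each character in the text
for c in text:
    if len(current_word) > 0:
        previous_char = current_word[-1]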
    
    if c == " ":
        words.append(current_word)    # add the current_word to words list
        current_word = ""             # initialise the current_word to empty string again
    elif c in delimiters and previous_char != " ":
        words.append(current_word)
        words.append(c)
        current_word = ""
    elif c in delimiters and previous_char == " ":
        words.append(c)
        current_word = ""
    else:
        current_word += c 
  


In [5]:
print(words)

['How', 'should', 'we', 'regard', "Jordan's", '', '"', 'greatness', '"', '', 'in', 'a', 'basketball', 'context', '?', '', 'Is', 'it', 'Greatness', 'at', 'some', 'or', 'all', 'facets', 'of', 'the', 'game', '?', '', 'Does', 'the', 'U', '.', 'S', '.', 'A', 'have', 'different', 'criteria', 'to', 'U', '.', 'K', '?', '', 'First,', 'lets', 'consider', "Jordan's", '', 'greatness', '.', '', '', 'How', 'great', 'was', 'Jordan', '?', '', '', "I'm", 'going', 'to', 'say', 'he', 'was', 'great', '!', '', '', 'I', "haven't", 'met', 'anyone', 'who', 'disagrees', '.']


### From the book

In [6]:

delimiters = ['"', ".", "?", "!"]
words = []
current_word = ""
 
for char in text:
    if char == " ":
        if not current_word == "":
            words.append(current_word)
            current_word = ""
    elif char in delimiters:
        if current_word == "":
            words.append(char)
        else:
            words.append(current_word)
            words.append(char)
            current_word = ""
    else:
        current_word += char
        
    
print(words)

['How', 'should', 'we', 'regard', "Jordan's", '"', 'greatness', '"', 'in', 'a', 'basketball', 'context', '?', 'Is', 'it', 'Greatness', 'at', 'some', 'or', 'all', 'facets', 'of', 'the', 'game', '?', 'Does', 'the', 'U', '.', 'S', '.', 'A', 'have', 'different', 'criteria', 'to', 'U', '.', 'K', '?', 'First,', 'lets', 'consider', "Jordan's", 'greatness', '.', 'How', 'great', 'was', 'Jordan', '?', "I'm", 'going', 'to', 'say', 'he', 'was', 'great', '!', 'I', "haven't", 'met', 'anyone', 'who', 'disagrees', '.']
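The loop above only appends a word when it reaches a space or a delimiter, so a text that ends mid-word would lose its final token. The sample text happens to end with a period, which hides this. A sketch of the same logic wrapped in a function with a final flush, where the name `simple_tokenize` is hypothetical:

```python
def simple_tokenize(text, delimiters=('"', ".", "?", "!")):
    """Split text on spaces, emitting each delimiter as its own token."""
    words = []
    current_word = ""
    for char in text:
        if char == " ":
            if current_word:
                words.append(current_word)
                current_word = ""
        elif char in delimiters:
            if current_word:
                words.append(current_word)
                current_word = ""
            words.append(char)
        else:
            current_word += char
    # Flush the last word when the text does not end in a space or delimiter.
    if current_word:
        words.append(current_word)
    return words

print(simple_tokenize('He said "great" and left'))
```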


There are still some issues with the code above if it is to operate as a tokenizer:
* It can't handle abbreviations like "U.K." or "i.e."; they are split at every period.
* Contractions like "I'm" will not be understood as "I am".
    * A tokenizer would split the answer into [I, 'm, going, to, say, he, was, great, !], and further rules would then recognise 'm as short for "am" in this example.
    * Similarly, "haven't" should be recognised as "have not". A tokenizer would split answer2 into [I, have, n't, met, anyone, who, disagrees, .]

Good tokenizing packages ensure that text is properly split into words
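To give a feel for the kind of rules such packages apply, here is a small regular-expression sketch that keeps simple abbreviations together and splits contractions into a stem plus `n't` or a clitic such as `'m`. The pattern and the name `regex_tokenize` are illustrative assumptions, not how any particular library works. The sketch has known gaps: it absorbs a sentence-final period into an abbreviation like "U.K.", and it treats an opening single quote as a clitic.

```python
import re

# Alternatives are tried in order, so the more specific patterns come first.
TOKEN_PATTERN = re.compile(
    r"""
    \b(?:[A-Za-z]\.)+[A-Za-z]\.?   # abbreviations such as U.S.A, U.K. or i.e.
    | \w+(?=n't)                   # the stem that precedes n't
    | n't                          # the negation itself
    | '\w+                         # clitics such as 'm or 's
    | \w+                          # ordinary words
    | [^\w\s]                      # any other single punctuation mark
    """,
    re.VERBOSE,
)

def regex_tokenize(text):
    """Return the list of tokens matched by TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)

print(regex_tokenize("I'm going to say he was great! "))
print(regex_tokenize("I haven't met anyone who disagrees."))
print(regex_tokenize("Does the U.S.A have different criteria to U.K?"))
```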

## Enron case study

#### Reading in the data

In [7]:
import os
import codecs

def read_in(folder):
    a_list = []
    for file in os.listdir(folder):
        # skip hidden files such as .DS_Store
        if not file.startswith("."):
            # the with block closes the file even if reading fails
            with codecs.open(os.path.join(folder, file), "r",
                             encoding="ISO-8859-1", errors="ignore") as f:
                a_list.append(f.read())
    return a_list
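`os.listdir` returns files in arbitrary order, so the position of a given email in the list can differ between machines, which in turn affects a seeded shuffle. A possible variant that reads files in sorted name order is sketched below. The name `read_in_sorted` is hypothetical, and `newline=""` is passed on the assumption that the original line endings should be preserved.

```python
from pathlib import Path

def read_in_sorted(folder, encoding="ISO-8859-1"):
    """Return the contents of every non-hidden file in folder, in name order."""
    contents = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and not path.name.startswith("."):
            # newline="" keeps the original line endings (e.g. \r\n) untouched.
            with path.open("r", encoding=encoding, errors="ignore", newline="") as f:
                contents.append(f.read())
    return contents

# Example call, assuming the same folder layout as in this notebook:
# ham_list = read_in_sorted("data/enron1/ham")
```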

In [8]:
data_path = 'data/enron1/'

# each element in the list contains contents of one email
ham_list = read_in(os.path.join(data_path, 'ham'))
spam_list = read_in(os.path.join(data_path, 'spam'))

print(len(spam_list))
print(len(ham_list))
print("***SPAM***")
print(spam_list[2])
print("***HAM***")
print(ham_list[0])


1500
3672
***SPAM***
Subject: office xp - $60 convenes permian
Even is story. Live together store went where grass. Pull, your,
Large. Yes then north, stick said. Number produce thing, hair
Corner, play, basic. Air may figure. View house then which,
Ready. Collect clear with. Wide arm what, industry, sound, clear
Talk. Job distant, keep fit. Sense quick would seem, take. I
Row, and. Sound, decimal color work coast friend. Ride east sand
Took. Believe him then, modern, old catch like.

***HAM***
Subject: on - call notes
Please see the attached file for on - call notes.
Bob


#### Preprocessing the data
Combine the ham and spam datasets, attach a label to each email, then shuffle the combined dataset.

In [9]:
import random 

all_emails = [(email_content, "spam") for email_content in spam_list]
all_emails += [(email_content, "ham") for email_content in ham_list]
random.seed(42)
random.shuffle(all_emails)
print(f"Dataset size = {len(all_emails)} emails")

Dataset size = 5172 emails
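Seeding the generator makes the shuffle repeatable as long as the input list is in the same order each time. A toy sketch of that idea follows, using a private `random.Random` instance so the global generator is left alone. The emails and the helper name `shuffled_copy` are made up for illustration.

```python
import random
from collections import Counter

# Toy stand-ins for (email_content, label) tuples.
toy_emails = [
    ("win a prize now", "spam"),
    ("meeting moved to 3pm", "ham"),
    ("cheap meds online", "spam"),
    ("see attached notes", "ham"),
]

def shuffled_copy(pairs, seed=42):
    """Return a shuffled copy using a private, seeded random generator."""
    rng = random.Random(seed)
    result = list(pairs)
    rng.shuffle(result)
    return result

first = shuffled_copy(toy_emails)
second = shuffled_copy(toy_emails)

print(first == second)                       # same seed and input give the same order
print(Counter(label for _, label in first))  # each label stays paired with its email
```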


#### Tokenizing
Each email is a single string of characters, so the text needs to be split into words. NLTK's tokenizer is used for this.

In [10]:
import nltk 

nltk.download('punkt', download_dir='/home/js/repos/hosting-ml-as-microservice/venv/share/nltk_data')

[nltk_data] Downloading package punkt to /home/js/repos/hosting-ml-as-
[nltk_data]     microservice/venv/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [11]:

from nltk import word_tokenize

def tokenize(sentence):
    # word_tokenize already returns a list of token strings
    return word_tokenize(sentence)

# named `sentence` rather than `input` to avoid shadowing the built-in
sentence = "What's the best way to split a sentence into words?"
print(tokenize(sentence))

['What', "'s", 'the', 'best', 'way', 'to', 'split', 'a', 'sentence', 'into', 'words', '?']


#### Extract and normalise the features
* Iterate over the emails and tokenize the text, which is first converted to lower case so that it is normalised.
* Each email's features and its label are paired together in a tuple.
* The tokenized list of words is converted to a dictionary with the words as keys and every value set to True.

In [12]:
def get_features(text):
    features = {}
    # lower-case before tokenizing so 'Fact' and 'fact' become the same feature
    word_list = word_tokenize(text.lower())
    for word in word_list:
        features[word] = True
    return features

all_features = [(get_features(email), label) for (email, label) in all_emails]


In [13]:
all_emails[0]

('Subject: how to confidently attract, meet and seduce more hot women\r\nFact: you can go out tonight and approach\r\nAny beautiful woman confidently and without fear. You can know\r\nExactly what to say to break the ice...... And exactly\r\nWhat to do to get her into bed!\r\nRmove\r\n',
 'spam')

In [14]:
print(all_features[0])

({'subject': True, ':': True, 'how': True, 'to': True, 'confidently': True, 'attract': True, ',': True, 'meet': True, 'and': True, 'seduce': True, 'more': True, 'hot': True, 'women': True, 'fact': True, 'you': True, 'can': True, 'go': True, 'out': True, 'tonight': True, 'approach': True, 'any': True, 'beautiful': True, 'woman': True, 'without': True, 'fear': True, '.': True, 'know': True, 'exactly': True, 'what': True, 'say': True, 'break': True, 'the': True, 'ice': True, '......': True, 'do': True, 'get': True, 'her': True, 'into': True, 'bed': True, '!': True, 'rmove': True}, 'spam')


In [15]:
print(all_features[0][0])
print("----------")
print(f"label: {all_features[0][1]}")


{'subject': True, ':': True, 'how': True, 'to': True, 'confidently': True, 'attract': True, ',': True, 'meet': True, 'and': True, 'seduce': True, 'more': True, 'hot': True, 'women': True, 'fact': True, 'you': True, 'can': True, 'go': True, 'out': True, 'tonight': True, 'approach': True, 'any': True, 'beautiful': True, 'woman': True, 'without': True, 'fear': True, '.': True, 'know': True, 'exactly': True, 'what': True, 'say': True, 'break': True, 'the': True, 'ice': True, '......': True, 'do': True, 'get': True, 'her': True, 'into': True, 'bed': True, '!': True, 'rmove': True}
----------
label: spam
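Because every value is simply `True`, these features record only whether a word is present, not how often it occurs or in what order. A toy comparison with word counts is sketched below. Whitespace splitting stands in for the tokenizer so the example runs without NLTK, and the name `presence_features` is hypothetical.

```python
from collections import Counter

def presence_features(tokens):
    """Map each distinct token to True, ignoring how often it occurs."""
    return {token: True for token in tokens}

# Whitespace splitting is a stand-in tokenizer for this sketch.
tokens = "free money free prizes free".lower().split()

print(presence_features(tokens))
print(Counter(tokens))
```

The repeated word appears once in the presence dictionary, while the counter keeps its frequency.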


#### Train the classifier