<a href="https://colab.research.google.com/github/h-aldarmaki/NLPCourse/blob/main/Text_Normalization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Assignment 1: Text Normalization

In this assignment, you will download a text corpus (Sherlock Holmes) and perform some text normalization steps. You will count the number of word types before and after normalization. 

##**What you need to do:**



1.   Read the instructions and comments below, and try to understand what is happening in each step. Run all blocks of code and inspect the output in each step. 
2.   Answer all questions (there are 5 questions) and submit your answers on Blackboard. 




## 1. Downloading the dataset

In the following block, we download the dataset and do some simple processing using command-line tools like sed and grep. You can execute commands on the underlying operating system by prefixing them with an exclamation mark (!). For example, the following command deletes empty lines from the file sherlock.txt:

```! sed -i '/^$/d' sherlock.txt```

The above is not Python code but a Unix command (it is like executing something directly from the command-line terminal).
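For comparison, here is a minimal Python sketch of what deleting empty lines with `sed '/^$/d'` does, applied to a string rather than a file. The helper name `drop_empty_lines` and the sample text are made up for illustration. Note that the pattern `^$` only matches completely empty lines, so a line containing just spaces is kept.

```python
def drop_empty_lines(text):
    """Keep only lines that are not completely empty (similar to sed '/^$/d')."""
    return "\n".join(line for line in text.split("\n") if line != "")


# made-up sample: one empty line and one whitespace-only line
sample = "first line\n\nsecond line\n   \nthird line"
cleaned = drop_empty_lines(sample)
print(cleaned)
```

Only the truly empty line disappears; the whitespace-only line survives, just as it would with the `sed` command.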

Run the following block and examine the output. The file will be saved as sherlock.txt

In [None]:
# first, download the text corpus

! wget https://sherlock-holm.es/stories/plain-text/cnus.txt 
! mv cnus.txt sherlock.txt

# delete empty lines:
! sed -i '/^$/d' sherlock.txt

# delete the first 80 lines to remove the headers and table of contents;
# grep prints the line number of the first 'CHAPTER I' heading so we can check where the text starts
! grep -n -m 1 'CHAPTER I' sherlock.txt
! sed -i '1,80d' sherlock.txt


#display the first 50 lines in the file
! head -50 sherlock.txt

--2021-09-22 07:17:10--  https://sherlock-holm.es/stories/plain-text/cnus.txt
Resolving sherlock-holm.es (sherlock-holm.es)... 49.12.76.210, 2a01:4f8:c17:3ff5::2
Connecting to sherlock-holm.es (sherlock-holm.es)|49.12.76.210|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3382026 (3.2M) [text/plain]
Saving to: ‘cnus.txt’


2021-09-22 07:17:12 (3.05 MB/s) - ‘cnus.txt’ saved [3382026/3382026]

79:          CHAPTER I
     In the year 1878 I took my degree of Doctor of Medicine of the
     University of London, and proceeded to Netley to go through the
     course prescribed for surgeons in the army. Having completed my
     studies there, I was duly attached to the Fifth Northumberland
     Fusiliers as Assistant Surgeon. The regiment was stationed in India
     at the time, and before I could join it, the second Afghan war had
     broken out. On landing at Bombay, I learned that my corps had
     advanced through the passes, and was already deep in the enemy's


## 2. Open the files and count word types

In the following few blocks, we will read the file 'sherlock.txt' in Python, then do some counting. Instead of manually counting word types, we will use the ```Counter``` class from the ```collections``` module. 
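As a toy illustration of the difference between word tokens and word types, the sketch below counts both for a short made-up sentence (the `toy_` names are not part of the assignment code). Because the text is only split on whitespace, case and attached punctuation produce separate types.

```python
from collections import Counter

toy_text = "The cat saw the dog. The dog saw the cat."
toy_words = toy_text.split()          # whitespace splitting only
toy_counter = Counter(toy_words)

num_tokens = len(toy_words)           # every occurrence counts
num_types = len(toy_counter)          # distinct strings only

print(f"tokens: {num_tokens}, types: {num_types}")
print(toy_counter["The"], toy_counter["the"])    # counted separately
print(toy_counter["dog"], toy_counter["dog."])   # 'dog' and 'dog.' are different types
```

A `Counter` returns 0 for a key it has never seen, so looking up a missing word does not raise an error.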

Inspect the code below and run each block, then inspect the output. 

In [None]:

# let's first read the whole file and store it in the variable 'text'
filename = 'sherlock.txt'
with open(filename, 'rt') as file:
  text = file.read()

# split words using whitespaces ... 
words = text.split()
print(words[:100])

In [None]:
# now let's count the number of word types! 
# we will use the Counter class from collections

from collections import Counter

counter = Counter(words)
print(f"Number of unique word types: {len(counter)}")

In [None]:
# if you want to know the frequency of a specific word:
n = counter["Sherlock"]
print(f"The word 'Sherlock' occurred {n} times")
n = counter["the"]
print(f"The word 'the' occurred {n} times")
n = counter["The"]
print(f"The word 'The' occurred {n} times")
n = counter["doesn't"]
print(f"The word 'doesn't' occurred {n} times")

##3. Tokenization

We will now use ```nltk``` for word tokenization. We will use the function ```word_tokenize``` which takes a string and returns a list of tokens. Note that some NLTK tools require downloading associated resources, so you will also see calls to ```nltk.download()``` as needed. 
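To see why tokenization changes the counts, compare whitespace splitting with a simple regular-expression tokenizer. This is only a rough stand-in written for illustration (the name `simple_tokenize` is hypothetical); `word_tokenize` uses more elaborate rules and will give different output, for example for abbreviations and contractions.

```python
import re


def simple_tokenize(text):
    """Split into word-like chunks (keeping internal apostrophes) and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)


sentence = "Mr. Holmes isn't here, unfortunately."
split_tokens = sentence.split()
regex_tokens = simple_tokenize(sentence)

print(split_tokens)   # punctuation stays glued to the words
print(regex_tokens)   # punctuation becomes separate tokens
```

Separating punctuation increases the number of tokens but tends to reduce the number of types, since forms like `here,` and `here` collapse into one.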

## **Question 1:**

* (a) Before tokenization, how many word tokens did we have in the original text? And how many unique word types did we have?

* (b) After tokenization, how many word tokens do we have in the text? And how many unique word types do we now have?

* (c) What is the most frequent word in this corpus (after tokenization)?

In [None]:
# Now let's do some normalizations ... 

# convert the text to lowercase characters

text = text.lower()

#import required function and download resources
import nltk
nltk.download('punkt')  # tokenizer models; recent NLTK releases may ask for 'punkt_tab' instead
from nltk.tokenize import word_tokenize

# tokenize text 
tokens = word_tokenize(text)

print(tokens[:100])

word_tokenize("Mr. Holmes isn't here, unfortunately.")


##Removing Punctuation

In some applications, punctuation marks are not useful and we want to remove them. The following block shows how to remove punctuation tokens from a tokenized list of words:
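One subtlety worth knowing: `string.punctuation` is a string, so a test like `token not in string.punctuation` is a substring check. Single punctuation characters are removed, but multi-character tokens such as `--` are kept because they do not appear as a substring. The sketch below uses made-up tokens to show the difference from a stricter filter; the stricter filter is an alternative shown for comparison, not the method used in this assignment, so it would give different counts.

```python
import string

# made-up tokens, including some multi-character punctuation tokens
toy_tokens = ["holmes", ",", "``", "yes", "''", "--", "!", "watson"]

# substring test against the punctuation string
kept_substring = [t for t in toy_tokens if t not in string.punctuation]

# stricter alternative: drop a token if every one of its characters is punctuation
punct_chars = set(string.punctuation)
kept_strict = [t for t in toy_tokens if not all(ch in punct_chars for ch in t)]

print(kept_substring)
print(kept_strict)
```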

##**Question 2:**

How many unique types do we have after removing all punctuation?

In [None]:
#we can remove punctuation marks:
import string
print(string.punctuation) #this is a string containing all ASCII punctuation characters

# remove all tokens that are punctuation marks. The resulting list is stored in words
words = []
for t in tokens:
  if t not in string.punctuation:
    words.append(t)

#alternatively, you could use the following one-liner to do exactly the same thing!
#words = [word for word in tokens if word not in string.punctuation]


print(words[:100])


##4. Splitting Sentences

Without punctuation, we can no longer tell where sentences start and end. Identifying sentence boundaries is not trivial. In a paragraph, a sentence can end with a period (.), an exclamation mark (!), or a question mark (?). But a period does not always indicate the end of a sentence; for example, it also appears in abbreviations such as Mr., Ph.D., and etc. 

NLTK has a pre-trained function called ```sent_tokenize``` that splits a text into a list of sentences. We need to split sentences before tokenizing the text and removing punctuation, so we will go back a few steps, apply this function to the text (already lowercased above), and then re-tokenize it sentence by sentence. 
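To make the abbreviation problem concrete, here is a naive splitter that breaks after every period, exclamation mark, or question mark followed by whitespace. It is a deliberately simplistic sketch (the function name and sample text are made up), not how `sent_tokenize` works.

```python
import re


def naive_sentence_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


sample = "Mr. Holmes looked up. Was it a clue? He smiled!"
pieces = naive_sentence_split(sample)
for piece in pieces:
    print(repr(piece))
```

The sample has three sentences, but the naive rule returns four pieces because it treats the period in "Mr." as a sentence boundary.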


##**Question 3:**

How many sentences do we have in this corpus?

In [None]:
#keep sentences before processing the words ... we will use sentence tokenizer

from nltk import sent_tokenize  #this is a pre-trained sentence tokenizer model

sentences = sent_tokenize(text)

print(sentences[0])
print(sentences[1])



### Tokenizing the sentences

Now that we have a list of sentences, we can go over them one by one and apply ```word_tokenize```. In Python, this can be done in one line of code as shown below:

In [None]:
sen_tokens = [word_tokenize(sentence) for sentence in sentences]
print(" ".join(sen_tokens[0]))
print(" ".join(sen_tokens[1]))
print(" ".join(sen_tokens[3]))

##Removing Punctuation from the list

Now that we have a list of lists, we will need to use a for loop to process each sentence and remove the punctuation marks. We will store the result in ```sen_words```

In [None]:
#remove punctuation
sen_words=[]
for sentence in sen_tokens:
  sen_words.append([word for word in sentence if word not in string.punctuation])

print(" ".join(sen_words[0]))
print(" ".join(sen_words[1]))
print(" ".join(sen_words[3]))

##5. Stemming

We will now use the Porter Stemmer to normalize the text even more. In practice, we only use stemming in some applications where we don't have enough data and need to reduce the number of types. Notice that the output is now made up of stems, some of which are not valid English words. 

To count the number of stems, we cannot use ```Counter``` directly, since we now have a list of lists. We first need to flatten the list (convert it to a one-dimensional list rather than a list of lists). To do that, we will use the ```flatten``` function from ```pandas``` (imported from ```pandas.core.common```), as done in the second code block below. 
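If you prefer to avoid the pandas helper, the standard library can flatten one level of nesting as well. The sketch below uses `itertools.chain.from_iterable` on made-up data; it is an alternative shown for illustration, not the approach used in the assignment code.

```python
from collections import Counter
from itertools import chain

# made-up list of tokenized sentences (a list of lists)
toy_sentences = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]

# chain.from_iterable yields the items of each inner list in order
flat = list(chain.from_iterable(toy_sentences))
toy_counter = Counter(flat)

print(flat)
print(f"types: {len(toy_counter)}, count of 'the': {toy_counter['the']}")
```

`Counter` accepts any iterable, so `Counter(chain.from_iterable(toy_sentences))` would also work without building the intermediate list.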

##**Question 4:**

How many unique word types do we now have after stemming?

In [None]:

from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()

sen_stemmed = []
for sentence in sen_words:
  sen_stemmed.append([porter.stem(word) for word in sentence])
print(" ".join(sen_stemmed[0]))



In [None]:
from pandas.core.common import flatten

counter = Counter(flatten(sen_words))

print(f"Number of unique word types: {len(counter)}")

n = counter["land"]
print(f"The word 'land' occurred {n} times")
n = counter["landing"]
print(f"The word 'landing' occurred {n} times")
n = counter["doesn't"]
print(f"The word 'doesn't' occurred {n} times")
n = counter["n't"]
print(f"The word 'n't' occurred {n} times")


In [None]:
counter = Counter(flatten(sen_stemmed))

print(f"Number of unique stems: {len(counter)}")
n = counter["land"]
print(f"The word 'land' occurred {n} times")
n = counter["landing"]
print(f"The word 'landing' occurred {n} times")
n = counter["n't"]
print(f"The word 'n't' occurred {n} times")

##6. Lemmatization

Stemming is a crude approximation of lemmatization. It applies rules blindly and can produce non-words. Lemmatization is a more informed form of normalization that maps each word to its actual 'lemma', or dictionary form. However, it requires additional resources (in this case WordNet) and is also a little slower. 
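The difference can be illustrated with a toy example. Both functions below are hypothetical stand-ins: `crude_stem` is a few suffix-stripping rules, not the Porter algorithm, and `toy_lemmatize` looks words up in a tiny hand-written table rather than WordNet.

```python
def crude_stem(word):
    """Strip the first matching suffix, with no check that the result is a word."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# hypothetical hand-written lemma table standing in for a real lexical resource
TOY_LEMMAS = {"studies": "study", "was": "be", "landing": "land", "mice": "mouse"}


def toy_lemmatize(word):
    """Return the dictionary form if the word is in the table, else the word itself."""
    return TOY_LEMMAS.get(word, word)


for w in ["studies", "was", "landing", "mice"]:
    print(f"{w}: stem={crude_stem(w)}, lemma={toy_lemmatize(w)}")
```

The rule-based function produces the non-word "studi" and cannot handle irregular forms like "mice", while the lookup handles both but only for words that the resource covers.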

In the following block, we use NLTK's ```WordNetLemmatizer``` to lemmatize the text. Note that we need to download wordnet first. 

NOTE: we are doing lemmatization without specifying the part of speech (noun, verb, etc.), so the lemmatization may not be accurate. 

##**Question 5:**

* (a) How many unique types do we now have after lemmatization?
* (b) What are the 10 most frequent lemmas in the corpus?
* (c) The lemmatization in the code below is not accurate because it assumes everything is a noun. What would you need to have better lemmatization? What type of additional steps would you need for that?

In [None]:
#lemmatization

from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

lem = WordNetLemmatizer()

sen_lemmatized = []
for sentence in sen_words:
  sen_lemmatized.append([lem.lemmatize(word) for word in sentence])
print(" ".join(sen_lemmatized[0]))


