<a href="https://colab.research.google.com/github/h-aldarmaki/NLPCourse/blob/main/Text_Normalization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 1: Text Normalization

In this assignment, you will download a text corpus (Sherlock Holmes) and perform several text normalization steps. You will count the number of word types before and after normalization.

## **What you need to do:**



1.   Read the instructions and comments below, and try to understand what is happening in each step. Run all blocks of code and inspect the output in each step. 
2.   Answer all questions (there are 5 questions). 




## 1. Downloading the dataset

In the following block, we download the dataset and do some simple processing using command-line tools like sed and grep. You can execute commands on the underlying operating system by prefixing them with an exclamation mark (!). For example, the following command deletes empty lines from the file sherlock.txt:

```! sed -i '/^$/d' sherlock.txt```

The above is not Python code but a Unix command (it's like executing something directly from the command-line terminal).
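
For readers less familiar with `sed`, a rough Python counterpart may make the pattern easier to read. This is only a sketch on a made-up string (the name `raw` and its contents are assumptions for illustration, not part of the corpus): `/^$/` matches lines with nothing between their start and end, and `d` deletes them.

```python
# made-up stand-in for a file's contents (not the real corpus)
raw = "first line\n\nsecond line\n\n\nthird line\n"

# keep only lines that are not completely empty, which is what /^$/d removes;
# like sed, this does NOT remove lines that contain only spaces
kept = [line for line in raw.split("\n") if line != ""]
cleaned = "\n".join(kept) + "\n"

print(cleaned)
```

In the notebook itself the `sed` command edits the file in place, so no Python is needed for this step.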

Run the following block and examine the output. The file will be saved as sherlock.txt

In [None]:
# first, download the text corpus

! wget https://sherlock-holm.es/stories/plain-text/cnus.txt 
! mv cnus.txt sherlock.txt

# delete empty lines:
! sed -i '/^$/d' sherlock.txt

# I want to delete the first few lines to remove headers and table of contents
! grep -n -m 1 'CHAPTER I' sherlock.txt
! sed -i '1,80d' sherlock.txt


#display the first 50 lines in the file
! head -50 sherlock.txt

## 2. Open the files and count word types

In the following few blocks, we will read the file 'sherlock.txt' in Python, then do some counting. Instead of manually counting word types, we will use the ```Counter``` class from the ```collections``` module.
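
To make the difference between tokens and types concrete, here is a minimal sketch on a made-up sentence (the names `toy_words` and `toy_counter` are assumptions for illustration). Tokens are all the items in the list; types are the distinct items, which is what `len()` of a `Counter` reports.

```python
from collections import Counter

# made-up example, split on whitespace
toy_words = "the dog saw the cat and The cat ran".split()
toy_counter = Counter(toy_words)

n_tokens = len(toy_words)    # every occurrence counts
n_types = len(toy_counter)   # distinct strings only; 'the' and 'The' are different types

print(n_tokens, n_types)
print(toy_counter["the"], toy_counter["The"])
print(toy_counter["horse"])  # a word that never occurs has count 0, not an error
```

Note that counting is case-sensitive and that looking up a missing word simply returns 0.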

Inspect the code below and run each block, then inspect the output. 

## **Question 1:**

* (a) How many times does the word 'Go' occur in this text?
* (b) How many times does the word 'go' occur in this text?
* (c) How many times does the word 'doesn't' occur in this text?
* (d) How many times does the word 'does' occur in this text?
* (e) What is the most frequent word in this text?


In [None]:

# let's first read the whole file and store it in the variable 'text'
filename = 'sherlock.txt'
# the with-statement closes the file automatically
with open(filename, 'rt') as file:
    text = file.read()

# split words using whitespaces ... 
words = text.split()
print(words[:100])

In [None]:
# now let's count the number of word types! 
# we will use the Counter class from collections

from collections import Counter

counter = Counter(words)
print(f"Number of unique word types: {len(counter)}")

In [None]:
# if you want to know the frequency of a specific word:
n = counter["Sherlock"]
print(f"The word 'Sherlock' occurred {n} times")

#find the most frequent words:
print(counter.most_common(3))

## 3. Splitting Sentences

Identifying where sentences start and end is not trivial. In a paragraph, a sentence can end with a period (.), an exclamation mark (!), or a question mark (?). But a period does not always mark the end of a sentence; it also appears after abbreviations such as "Mr." or "Ph.D."
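
A small sketch shows how a naive rule goes wrong. The text and the variable names here are made up for illustration, and the regular expression is a deliberately simple rule, not what NLTK does internally: it breaks after any '.', '!' or '?' that is followed by whitespace.

```python
import re

# made-up example containing an abbreviation
toy_text = "Mr. Holmes arrived at noon. Was he late? No!"

# naive rule: split at whitespace that follows '.', '!' or '?'
naive_sentences = re.split(r"(?<=[.!?])\s+", toy_text)

for s in naive_sentences:
    print(repr(s))
```

The text has three sentences, but the naive rule returns four pieces because it also breaks after "Mr.".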

NLTK has a pre-trained function called ```sent_tokenize``` that can reliably split a text into a list of sentences. We need to split the text into sentences before doing any further processing.


## **Question 1:**

In the code block below, split ``text`` into sentences, and store the result as ``sentences``. How many sentences do we have in this corpus?

In [None]:
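import nltk
nltk.download('punkt')  #sent_tokenize needs the Punkt models; newer NLTK releases may ask for 'punkt_tab' instead
from nltk import sent_tokenize  #this is a pre-trained sentence tokenizer model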

#your code here:



#Check the output to make sure everything is fine
print(sentences[0])
print(sentences[1])

## 4. Tokenization

We will now use ```nltk``` for word tokenization, via the function ```word_tokenize```, which takes a string and returns a list of tokens. Some NLTK tools require downloading associated resources, so you will also see calls to ```nltk.download()``` as needed. Because ``sentences`` is a list, we need to loop over the sentences and apply the function to each sentence separately.
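
To see why tokenization changes the counts, the sketch below compares whitespace splitting with a rough regex tokenizer on a made-up sentence. The regex is only an assumed stand-in to keep the example self-contained; it is not NLTK's algorithm, and ```word_tokenize``` treats cases such as contractions differently.

```python
import re
from collections import Counter

toy_text = "Go home, Watson. Go home!"  # made-up example text

# whitespace splitting leaves punctuation glued to the words
ws_tokens = toy_text.split()

# rough stand-in tokenizer: runs of word characters, or single punctuation marks
re_tokens = re.findall(r"\w+|[^\w\s]", toy_text.lower())

print(ws_tokens)
print(re_tokens)
print(Counter(ws_tokens)["home"], Counter(re_tokens)["home"])
```

With whitespace splitting, "home," and "home!" are two different types and plain "home" never occurs; after tokenization both become the same type "home", while the punctuation marks become tokens of their own.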

## **Question 2:**

* (a) Before tokenization, how many word tokens did we have in the original text? And how many unique word types did we have?

* (b) After tokenization, how many word tokens do we have in the text? And how many unique word types do we now have?

* (c) What is the most frequent word in this corpus (after tokenization)?

### **Hint**
Each element in ``tok_sentences`` will be a list of tokens. You will need to convert this to a one-dimensional list of tokens before you can apply ``Counter`` correctly. To do that, you can use the ``flatten`` function.
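
As an illustration of what flattening means, here is a minimal sketch using only the standard library on a made-up nested list (`toy_tok_sentences` and `flat` are assumed names). It uses `itertools.chain.from_iterable` as one possible alternative way to flatten a list of lists. Passing the nested list to ``Counter`` directly would fail, because lists are not hashable.

```python
from itertools import chain
from collections import Counter

# made-up tokenized sentences: a list of lists of tokens
toy_tok_sentences = [["the", "dog", "barked", "."], ["the", "cat", "ran", "."]]

# join the inner lists end to end into one flat list
flat = list(chain.from_iterable(toy_tok_sentences))
toy_counts = Counter(flat)

print(flat)
print(len(flat), len(toy_counts))
```

Wrapping the result in `list()` matters when you need both `len()` and a `Counter`, since a lazy iterator can only be consumed once.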

In [None]:

#import required function and download resources
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

#the following can be used to flatten a nested list (it returns a generator; wrap it in list() to get a list)
from pandas.core.common import flatten


tok_sentences=[]
for sen in sentences:
    #your code here, follow the steps below:
    
    #1. convert sen to lowercase
    #2. apply word_tokenize
    #3. append the result to tok_sentences


#Check the output to make sure everything is fine
print(" ".join(tok_sentences[0]))
print(" ".join(tok_sentences[1]))
print(" ".join(tok_sentences[3]))


#flatten returns a generator, so convert it to a list to allow len() and repeated use
flat_sentence_list = list(flatten(tok_sentences))
#Now you can count the word types:


## 5. Removing Punctuation

In some applications, punctuation is not useful and we want to remove it. The following block shows how to remove all punctuation from tokenized lists of words.

## **Question 3:**

How many unique types do we have after removing all punctuation?

In [None]:
#we can remove punctuation marks:
import string
print(string.punctuation) #this is a string containing all ASCII punctuation marks

#remove punctuation
no_punc_sentences=[]
for sentence in tok_sentences:
    new_sen=[]
    for word in sentence:
        if word not in string.punctuation:
            new_sen.append(word)
    no_punc_sentences.append(new_sen)

print(" ".join(no_punc_sentences[0]))
print(" ".join(no_punc_sentences[1]))
print(" ".join(no_punc_sentences[3]))
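
One detail worth knowing: because `string.punctuation` is a string, `word not in string.punctuation` is a substring test. Single marks like `,` are removed, but multi-character tokens made only of punctuation are kept unless they happen to appear as a substring. The sketch below uses a made-up token list (all names are assumptions for illustration) to compare that test with a stricter, optional variant that checks every character; the stricter variant gives different counts from the filter above.

```python
import string

# made-up tokens, including multi-character punctuation tokens
toy_tokens = ["``", "hello", ",", "world", "''", "--", "!"]

# substring test: same check as `word not in string.punctuation`
substring_filtered = [t for t in toy_tokens if t not in string.punctuation]

# stricter variant: drop a token if ALL of its characters are punctuation
strict_filtered = [
    t for t in toy_tokens
    if not all(ch in string.punctuation for ch in t)
]

print(substring_filtered)
print(strict_filtered)
```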


## 6. Stemming

We will now use the Porter Stemmer to normalize the text even more. In practice, we only use stemming in some applications where we don't have enough data and we need to reduce the number of types. Notice that the output is now made up of stems, some of which are not valid English words.
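
To illustrate the idea without any downloads, the sketch below uses a deliberately crude suffix stripper. It is an assumption made for illustration and is not the Porter algorithm, which applies a much more careful sequence of rules; `SUFFIXES`, `crude_stem`, and the word list are all made up.

```python
# suffixes are tried in this order, longer ones first
SUFFIXES = ("ions", "ion", "ing", "ed", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

toy_words = ["connect", "connected", "connecting", "connection", "connections", "studies"]
toy_stems = [crude_stem(w) for w in toy_words]

print(toy_stems)
print(len(set(toy_words)), len(set(toy_stems)))
```

Six word types collapse into two stems, and one of them ("studie") is not a valid English word, which mirrors the behavior described above.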

## **Question 4:**

Stem all the words in each sentence. How many unique word types do we now have after stemming?

In [None]:

from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()

sen_stemmed = []
for sentence in no_punc_sentences:
  #your code here:


#flatten & count the word types