# **Assignment 1 on Natural Language Processing**

### Date : 4th Sept, 2020

#### Instructor : Prof. Sudeshna Sarkar

#### Teaching Assistants : Alapan Kuila, Aniruddha Roy, Anusha Potnuru, Uppada Vishnu

# NLTK Library

The [NLTK](https://www.nltk.org/) Python library is widely used as an education and research tool. Tokenization, stemming, lemmatization, punctuation handling, and character and word counts are some of its features that will be discussed in this tutorial.

**Installing NLTK** <br>
NLTK can be installed using the pip or conda package managers. For detailed installation instructions, follow this [link](https://www.nltk.org/install.html).

To ensure we are all on the same page, the coding environment will be **python3**. We suggest downloading Anaconda3 and creating a separate environment for this assignment. 
Anaconda3 installers for Windows and Linux are available at https://docs.anaconda.com/anaconda/install/. 
NLTK and its data packages can then be installed as follows: 
```bash
# install the library into the active environment
pip3 install nltk
# nltk.download() is a Python call, so it has to run inside the interpreter
python3 -c "import nltk; nltk.download()"
```

**Note for questions and answers:**

Keep your answers to the point and write them in the text box labelled **Answer here**.

# Tokenizing Words and Sentences using NLTK

**Tokenization** is the process by which a large body of text is divided into smaller parts called tokens. <br>Many NLP tasks depend on recognizing patterns in the text, and tokens are the units in which such patterns are found.<br>

The Natural Language Toolkit has an important module, `tokenize`, which provides, among others, the following two functions:

1. `word_tokenize`, which splits text into word and punctuation tokens
2. `sent_tokenize`, which splits text into sentences
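
To make the two kinds of tokenization concrete, here is a minimal pure-Python sketch based on regular expressions. The function names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical and are not part of NLTK; the rules are a rough approximation and not the algorithm NLTK uses.

```python
import re

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when whitespace and an uppercase letter follow."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

def simple_word_tokenize(text):
    """Return runs of word characters, with each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

text = "Hello there. How are you? I am fine!"
sentences = simple_sent_tokenize(text)
tokens = simple_word_tokenize(text)

print(sentences)  # ['Hello there.', 'How are you?', 'I am fine!']
print(tokens)     # ['Hello', 'there', '.', 'How', 'are', 'you', '?', 'I', 'am', 'fine', '!']
```

A rule this simple breaks on abbreviations: `simple_sent_tokenize("Dr. Smith arrived.")` yields two "sentences". Handling such cases is one reason to prefer a trained sentence tokenizer such as the one behind `sent_tokenize`.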

In [1]:
# Importing modules
import nltk
nltk.download('punkt') # For tokenizers
nltk.download('inaugural') # For dataset
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package punkt to C:\Users\harshal
[nltk_data]     d\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package inaugural to C:\Users\harshal
[nltk_data]     d\AppData\Roaming\nltk_data...
[nltk_data]   Package inaugural is already up-to-date!


In [2]:
# Sample corpus.
from nltk.corpus import inaugural
corpus = inaugural.raw('1789-Washington.txt')
# print(corpus)

### **TASK**:

For the given corpus, 
1. Print the number of sentences and tokens. 
2. Print the average number of tokens per sentence.
3. Print the number of unique tokens
4. Print the number of tokens after stopword removal using the stopwords from nltk.


In [3]:
# TODO

corpus_sents=sent_tokenize(corpus)
corpus_toks=word_tokenize(corpus)

In [4]:
# 1.Printing the number of sentences and tokens.

num_toks=len(corpus_toks)
num_sents=len(corpus_sents)

print("number of tokens in corpus is {}".format(num_toks)
      ,"number of sentences in corpus is {}".format(num_sents),sep='\n')

number of tokens in corpus is 1537
number of sentences in corpus is 23


In [5]:
# 2. Printing the average number of tokens per sentence
# avg_tok_per_sent = total number of tokens/total number of sentences

avg_tok_per_sent = num_toks/num_sents

print("Average number of tokens per sentence is {}".format(avg_tok_per_sent))

Average number of tokens per sentence is 66.82608695652173


In [6]:
# 3. Printing the number of unique tokens

print("Number of  unique tokens in the corpus is {}".format( len(set(corpus_toks)) ))

Number of  unique tokens in the corpus is 626


In [7]:
# importing the requirements
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to C:\Users\harshal
[nltk_data]     d\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
# 4. Printing the number of tokens after stopword removal 

# list of all the English stopwords
eng_stopwords=stopwords.words('english')

# note: this counts every remaining token occurrence, not unique tokens;
# for unique tokens we could take the set difference instead

# keeping only the tokens that are not in the stopwords list
# (the check is case-sensitive, so a capitalized stopword such as "The" is kept)

filtered_toks = [i for i in corpus_toks if i not in eng_stopwords]
print("Number of tokens after removing stopwords are {}".format(len(filtered_toks)))

Number of tokens after removing stopwords are 800
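
The count above depends on two details of the filter: the membership test is case-sensitive, and punctuation tokens are still counted. The sketch below shows the effect on a small hand-made example. `STOPWORDS` is a hypothetical mini list used only for illustration (NLTK's English list is longer and, like this one, all lowercase), and the token list is made up.

```python
# Hypothetical mini stopword set for illustration only.
STOPWORDS = {"the", "of", "and", "is", "a"}

tokens = ["The", "voice", "of", "the", "people", "is",
          "the", "voice", "of", "reason", "."]

# Case-sensitive test, as in the cell above: "The" survives
# because only the lowercase form "the" is in the set.
case_sensitive = [t for t in tokens if t not in STOPWORDS]

# Case-insensitive test that also drops tokens without any
# alphanumeric character (i.e. pure punctuation).
cleaned = [t for t in tokens
           if t.lower() not in STOPWORDS and any(c.isalnum() for c in t)]

print(len(case_sensitive), case_sensitive)  # 6 ['The', 'voice', 'people', 'voice', 'reason', '.']
print(len(cleaned), cleaned)                # 4 ['voice', 'people', 'voice', 'reason']
```

Whether to lowercase and drop punctuation depends on what the count is meant to measure; the task statement only asks for stopword removal, so the simpler filter is a valid reading.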


# Stemming and Lemmatization with NLTK

**What is Stemming?** <br>
Stemming is a kind of normalization for words. Normalization maps the different surface forms of a word to one common form so that they can be looked up and counted together. Words that share a meaning but vary with the context or sentence, such as *grow*, *grows* and *growing*, are normalized to the same string.<br>
Stemming does this by stripping affixes according to a set of rules, so the resulting stem is not always a valid dictionary word.
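
The rule-based idea can be sketched in a few lines of plain Python. This is a toy illustration only: `SUFFIX_RULES` and `toy_stem` are hypothetical, and the rule table is far smaller and cruder than the ones used by real stemmers.

```python
# Hypothetical rule table: (suffix, replacement), tried in order.
SUFFIX_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),   # guard: keep a final "ss" so "glass" is not cut to "glas"
    ("ing", ""),
    ("ly", ""),
    ("s", ""),
]

def toy_stem(word, min_stem_len=3):
    """Apply the first matching rule that leaves at least min_stem_len characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= min_stem_len:
                return stem
    return word

words = ["grows", "cats", "running", "fairly", "ponies", "glass"]
stemmed = {w: toy_stem(w) for w in words}
print(stemmed)
# {'grows': 'grow', 'cats': 'cat', 'running': 'runn',
#  'fairly': 'fair', 'ponies': 'poni', 'glass': 'glass'}
```

Outputs such as `runn` and `poni` show why a stem need not be a real word: the rules only look at the characters of the word and never consult a dictionary.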

There are many stemmers provided by NLTK, such as **PorterStemmer**, **SnowballStemmer** and **LancasterStemmer**.<br>

We will compare the outputs of PorterStemmer and SnowballStemmer.

In [9]:
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer # Note that SnowballStemmer has language as parameter.

words = ["grows","leaves","fairly","cats","trouble","misunderstanding","friendships","easily", "rational", "relational"]

# TODO
# create an instance of both the stemmers and perform stemming on above words
SS_stemmer = SnowballStemmer("english")
PS_stemmer = PorterStemmer()

# for each word in the list get its stem
stemmed_words_SS=[SS_stemmer.stem(w) for w in words]
stemmed_words_PS=[PS_stemmer.stem(w) for w in words]

print("Original words\n",words,'\n')
print("Stemmed words using SnowballStemmer\n",stemmed_words_SS,'\n')
print("Stemmed words using PorterStemmer\n",stemmed_words_PS,'\n')

# TODO
# Complete the function which takes a sentence/corpus and gets its stemmed version.
def stemSentence(sentence=None):
    if sentence is None:
        return 

    # for stemming the words
    stemmer = SnowballStemmer("english")
    
    # first get the words
    words=word_tokenize(sentence)
    
    # getting the stemmed words
    stems=[stemmer.stem(t) for t in words]
    
    # joining them to form one string again
    sentence_stem = ' '.join(stems)
    
    return sentence_stem


Original words
 ['grows', 'leaves', 'fairly', 'cats', 'trouble', 'misunderstanding', 'friendships', 'easily', 'rational', 'relational'] 

Stemmed words using SnowballStemmer
 ['grow', 'leav', 'fair', 'cat', 'troubl', 'misunderstand', 'friendship', 'easili', 'ration', 'relat'] 

Stemmed words using PorterStemmer
 ['grow', 'leav', 'fairli', 'cat', 'troubl', 'misunderstand', 'friendship', 'easili', 'ration', 'relat'] 



In [10]:
s="hie there wordly world"

stemSentence(s)

'hie there word world'

**What is Lemmatization?** <br>
Lemmatization is the process of finding the lemma of a word based on its meaning and part of speech. It relies on morphological analysis of the word to remove inflectional endings and return its base or dictionary form, which is known as the lemma.<br>

*The NLTK lemmatization method is based on WordNet's built-in morphy function.*
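
A simplified picture of dictionary-based lemmatization is: check a list of irregular forms first, then try detaching suffixes and accept a candidate only if it is a dictionary entry for the given part of speech. The sketch below follows that picture with hypothetical, hand-made tables (`VOCAB`, `EXCEPTIONS`, `DETACH_RULES`) standing in for WordNet; it is not NLTK's implementation.

```python
# Hypothetical mini dictionary standing in for WordNet, keyed by part of speech.
VOCAB = {
    "n": {"leaf", "cat", "friendship", "world"},
    "v": {"grow", "leave", "run", "be", "have"},
}
# Hypothetical irregular forms that no suffix rule can recover.
EXCEPTIONS = {
    "n": {"leaves": "leaf"},
    "v": {"was": "be", "has": "have", "running": "run"},
}
# Hypothetical (suffix, replacement) rules per part of speech.
DETACH_RULES = {
    "n": [("ies", "y"), ("es", ""), ("s", "")],
    "v": [("ies", "y"), ("es", "e"), ("ing", ""), ("ed", ""), ("s", "")],
}

def toy_lemmatize(word, pos="n"):
    """Return a dictionary form of word for the given pos, or word itself if none is found."""
    if word in EXCEPTIONS[pos]:
        return EXCEPTIONS[pos][word]
    if word in VOCAB[pos]:
        return word
    for suffix, replacement in DETACH_RULES[pos]:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in VOCAB[pos]:  # accept only real dictionary entries
                return candidate
    return word

for w in ["leaves", "grows", "was", "friendships"]:
    print(w, "n ->", toy_lemmatize(w, "n"), "| v ->", toy_lemmatize(w, "v"))
```

The same word can map to different lemmas depending on the part of speech (`leaves` gives `leaf` as a noun and `leave` as a verb), and a word with no dictionary match for that part of speech comes back unchanged (`grows` as a noun).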

In [11]:
#imports
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet') # Since the lemmatization method is based on WordNet's built-in morphy function.

words = ["grows","leaves","fairly","cats","trouble","running","friendships","easily", "was", "relational","has"]

#TODO
# Create an instance of the Lemmatizer and perform Lemmatization on above words
# You can also give Parts-of-speech(pos) to the Lemmatizer for example "v" (verb). Check the differences in the outputs.

lemmatizer = WordNetLemmatizer() 
  
lemmas = [ lemmatizer.lemmatize(w) for w in words ]
lemmas_v = [ lemmatizer.lemmatize(w,pos='v') for w in words ]
lemmas_n = [ lemmatizer.lemmatize(w,pos='n') for w in words ]
lemmas_a = [ lemmatizer.lemmatize(w,pos='a') for w in words ]
print("Lemmas using default pos\n",lemmas,'\n')
print("Lemmas using pos='n'\n",lemmas_n,'\n')
print("Lemmas using pos='v'\n",lemmas_v,'\n')
print("Lemmas using pos='a'\n",lemmas_a,'\n')


#TODO
# Complete the function which takes a sentence/corpus and gets its lemmatized version.
def lemmatizeSentence(sentence=None,pos='n'):
    if sentence is None:
        return 

    # instance of the Lemmatizer
    lemmatizer = WordNetLemmatizer() 

    # first get the words
    words=word_tokenize(sentence)

    # getting the lemmatized words using the given pos (default 'n', i.e. noun)
    lemma=[lemmatizer.lemmatize(t,pos=pos) for t in words]

    # joining them to form one string again
    sentence_lemma = ' '.join(lemma)

    return sentence_lemma

[nltk_data] Downloading package wordnet to C:\Users\harshal
[nltk_data]     d\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Lemmas using default pos
 ['grows', 'leaf', 'fairly', 'cat', 'trouble', 'running', 'friendship', 'easily', 'wa', 'relational', 'ha'] 

Lemmas using pos='n'
 ['grows', 'leaf', 'fairly', 'cat', 'trouble', 'running', 'friendship', 'easily', 'wa', 'relational', 'ha'] 

Lemmas using pos='v'
 ['grow', 'leave', 'fairly', 'cat', 'trouble', 'run', 'friendships', 'easily', 'be', 'relational', 'have'] 

Lemmas using pos='a'
 ['grows', 'leaves', 'fairly', 'cats', 'trouble', 'running', 'friendships', 'easily', 'was', 'relational', 'has'] 



In [12]:
s="hie there wordly world grows leaves"

lemmatizeSentence(s,pos='n')

'hie there wordly world grows leaf'

**Question:** Give an example of two words which have the same stem but different lemmas. Show the stem and lemma of both words in the code below.



**Answer here:** 
1. Natural 
2. Nature

In [26]:
#TODO
# Write code to print the stem and lemma of both your words
s1,s2='Natural','Nature'

s_stem=stemSentence(s1)
s_lemma=lemmatizeSentence(s1,pos='a')
print("stem of {} is {}\nlemma of {} is {}".format(s1,s_stem,s1,s_lemma))
print('\n')

s_stem=stemSentence(s2)
s_lemma=lemmatizeSentence(s2,pos='a')
print("stem of {} is {}\nlemma of {} is {}".format(s2,s_stem,s2,s_lemma))

stem of Natural is natur
lemma of Natural is Natural


stem of Nature is natur
lemma of Nature is Nature


**Question:** Write a comparison between stemming and lemmatization.

**Answer here:**

| Stemming | Lemmatization |
|:-----|:----|
|1. Rule-based method that reduces words to their stem, e.g. by removing the suffix 'ing' | 1. Dictionary-based method that reduces words to their lemma, e.g. WordNetLemmatizer looks words up in WordNet |
|2. As it is rule-based, it is a simple and fast method | 2. As it needs dictionary lookups, it is a relatively slow method |
|3. Since the rules do not hold for every word, the output is sometimes not a valid word (e.g. 'troubl', 'easili') | 3. Since it uses a dictionary, the output is a valid word in the language; a word it cannot resolve is returned unchanged |
|4. It does not require a part-of-speech tag to work | 4. It needs the part-of-speech tag to give correct results; if no tag is specified, noun is assumed |
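
Point 4 matters in practice because part-of-speech taggers such as `nltk.pos_tag` emit Penn Treebank tags (for example `NNS` or `VBD`), while the WordNet lemmatizer expects single-letter codes such as `'n'`, `'v'`, `'a'` and `'r'`. A minimal sketch of the conversion follows; the helper name `penn_to_wordnet_pos` and the tagged example are hypothetical.

```python
def penn_to_wordnet_pos(tag, default="n"):
    """Map a Penn Treebank tag to a WordNet pos letter; fall back to noun."""
    if tag.startswith("JJ"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    if tag.startswith("NN"):
        return "n"  # nouns: NN, NNS, NNP, NNPS
    return default

# Hand-written (token, tag) pairs in the shape a tagger would return.
tagged = [("leaves", "NNS"), ("was", "VBD"), ("fairly", "RB"),
          ("rational", "JJ"), ("the", "DT")]

wordnet_pos = [(tok, penn_to_wordnet_pos(tag)) for tok, tag in tagged]
print(wordnet_pos)
# [('leaves', 'n'), ('was', 'v'), ('fairly', 'r'), ('rational', 'a'), ('the', 'n')]
```

Each pair can then be passed on as `lemmatizer.lemmatize(tok, pos=pos)`, so that every token is lemmatized with its own part of speech instead of one fixed value for the whole sentence.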
