## NLP Lab Assignment 1

***Student Details:***

- Name : Anjishnu Mukherjee
- Registration Number : B05-511017020
- Exam Roll Number : 510517086
- Email : 511017020.anjishnu@students.iiests.ac.in

## Mount Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


## Import libraries

In [2]:
import random
import string
from typing import List, Optional

import nltk
from nltk.corpus import brown, stopwords
from nltk.tokenize import word_tokenize

In [3]:
nltk.download("brown")
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## Data files

In [4]:
data_dir = "/content/drive/MyDrive/NLP_LAB/Assignment-1/"
data_file_1 = "sample-text-1.txt"
data_file_2 = "sample-text-2.txt"
bengali_stopwords = "stopwords-bn.txt"

## Solution class

- Here, I define a class that takes the path of the input text file, the language, and an optional set of stopwords as parameters.
- All the functionality required by the assignment is defined as methods of this class.

In [5]:
class Basics:
    ''' Assignment 1 Solution Class'''

    def __init__(self, path: str, language: str,
                 stop_words: Optional[set] = None) -> None:
        '''Creates an object of the class with given parameters.'''

        super().__init__()
        self.PATH = path
        self.language = language
        self.stop_words = stop_words

        try:
            with open(self.PATH, 'r') as inputFile:
                self.raw_text = inputFile.read()
        except IOError:
            self.raw_text = None
            print("Couldn't read input file.")

    def __repr__(self) -> str:
        '''Formalized string representation of this class.'''

        rep = self.__class__.__qualname__ + \
            '(' + self.PATH + ',' + self.language + ') object.'
        return rep

    def add_line_numbers(self) -> None:
        '''Add a line number to every non-empty line of input file and print it.'''

        if self.raw_text is not None:
            line_number = 1
            for line in self.raw_text.splitlines():

                # empty lines and whitespaces are ignored
                if not line.isspace() and line != "":
                    print(line_number, "\t", line)
                    line_number += 1
        else:
            raise Exception("Couldn't read input file.")

    def __tokenize(self, text: Optional[str] = None) -> List:
        '''Normalizes input text by converting to lower case and then tokenizing
        it. Uses the text of the input file when no text is given.'''

        # an empty or missing `text` means "use the input file"
        if not text:
            if self.raw_text is None:
                raise Exception("Couldn't read input file.")
            text = self.raw_text

        return word_tokenize(text.lower())

    def __remove_punctuations(self, tokens: List) -> List:
        '''Removes all possible punctuations from the list of tokens.'''

        if tokens:
            # list of all English punctuations
            punctuations = set(string.punctuation)

            # this is the only Bengali punctuation which is not there in English
            if self.language == 'bn':
                punctuations.add('।')

            # keep a token if at least one of its characters is not a
            # punctuation mark; this drops every token that consists only of
            # punctuation (e.g. ',' or '--')
            words = []
            for token in tokens:
                for char in token:
                    if char not in punctuations:
                        words.append(token)
                        break

            return words
        else:
            raise Exception("No tokens found.")

    def __clean_tokens(self, tokens: List) -> List:
        '''Removes stopwords and, for English text, every token that is not
        purely alphabetic (e.g. contractions and hyphenated words).'''

        if tokens:
            # use english stopwords by default, otherwise use specified stopwords
            if self.stop_words is None:
                stop_words = set(stopwords.words('english'))
            else:
                stop_words = self.stop_words

            # remove stopwords from collection of tokens
            clean_tokens = []
            for word in tokens:
                if word not in stop_words:
                    clean_tokens.append(word)

            # remove contractions (e.g. hasn't) and hyphenated words (e.g.
            # on-campus): for English text, the loop below drops a token if any
            # of its characters is not alphabetic
            if self.language == 'en':
                proper_english_tokens = []
                for word in clean_tokens:
                    if word.isalpha():
                        proper_english_tokens.append(word)

                return proper_english_tokens

            return clean_tokens
        else:
            raise Exception("No tokens found.")

    def vocab_size(self, text: Optional[str] = None) -> int:
        '''The size of the vocabulary is the number of unique tokens in input
        text.'''

        # __tokenize uses the input file when `text` is empty and raises if the
        # file could not be read; a set only stores unique elements
        return len(set(self.__tokenize(text)))

    def word_freq(self, word: str, section: str) -> int:
        '''Computes frequency of input word in given section of Brown corpus.'''

        # find all the words in the given section
        section_words = self.__remove_punctuations(
            list(brown.words(categories=section)))

        # find frequency of given word
        freq = 0
        for token in section_words:
            if token == word:
                freq += 1
        return freq

    def test_word_freq(self) -> None:
        '''Test the word_freq method of the class by comparing with
        nltk.FreqDist.'''

        # choose 5 random categories of brown corpus
        brown_categories = brown.categories()
        test_categories = random.sample(brown_categories, 5)

        print("-" * 26)
        print('{:>5} |{:>16}'.format("word_freq", "nltk.FreqDist"))
        print("-" * 26)

        # choose 3 random words from each of the 5 random categories chosen above
        for category in test_categories:
            section_words = self.__remove_punctuations(
                list(brown.words(categories=category)))
            test_words = random.sample(section_words, 3)
            nltk_freq_dist = nltk.FreqDist(section_words)
            for word in test_words:
                # verify the correctness using assert statement
                assert self.word_freq(word, category) == nltk_freq_dist[word]

                # print the values as well for visual comparison
                print('{:>5}{:>15}'.format(self.word_freq(word, category),
                                           nltk_freq_dist[word]))

    def percent(self, word: str, text: Optional[str] = None) -> float:
        '''Calculates how often a word occurs in a text as a percentage.'''

        # calculate total number of words; __tokenize uses the input file when
        # `text` is empty and raises if the file could not be read
        text_words = self.__remove_punctuations(self.__tokenize(text))
        total_count = len(text_words)

        # calculate frequency of given word; tokens are lower-cased, so `word`
        # should be given in lower case as well
        frequency = text_words.count(word)

        # return percentage for word
        return (frequency / total_count) * 100

    def n_most_frequent(self, text: Optional[str] = None,
                        num_words: Optional[int] = -1) -> List:
        '''Finds N most frequent words of text, except stopwords, contractions,
        conjunctions, punctuations.'''

        # tokenize and remove punctuations; __tokenize uses the input file when
        # `text` is empty and raises if the file could not be read
        tokens = self.__remove_punctuations(self.__tokenize(text))

        # remove stopwords, contractions, conjunctions
        cleaned_tokens = self.__clean_tokens(tokens)

        # calculate freq distribution of the tokens
        freq_dist = nltk.FreqDist(cleaned_tokens)

        # sort the frequency distribution in decreasing order of frequency
        sorted_freq_dist = sorted(freq_dist.items(), key=lambda item: -item[1])

        # if num_words is -1 (default), then return all word frequencies
        most_freq_words = []
        if num_words == -1:
            for i in range(len(sorted_freq_dist)):
                most_freq_words.append(sorted_freq_dist[i][0])
        else:
            for i in range(min(len(sorted_freq_dist), num_words)):
                most_freq_words.append(sorted_freq_dist[i][0])

        return most_freq_words

    def n_letter_words(self, text: Optional[str] = None,
                       num_words: Optional[int] = 4) -> List:
        '''Finds all words that are exactly num_words letters long and returns
        them in decreasing order of frequency.'''

        # find all words sorted by decreasing frequency; n_most_frequent uses
        # the input file when `text` is empty and raises if the file could not
        # be read
        freq_sorted_words = self.n_most_frequent(text)

        # choose only n letter words from above output
        n_letter_words = []
        for word in freq_sorted_words:
            if len(word) == num_words:
                n_letter_words.append(word)

        return n_letter_words

    def words_occuring_n_times(self, count: Optional[int] = 3) -> List:
        '''Finds all words that occur at least n times in the Brown Corpus.'''

        # all categories in brown corpus
        brown_categories = brown.categories()

        # aggregate all words from each category
        all_words = []
        for category in brown_categories:
            section_words = self.__remove_punctuations(
                list(brown.words(categories=category)))
            all_words.extend(section_words)

        # words from the list above, which have frequency >= count
        valid_words = []
        nltk_freq_dist = dict(nltk.FreqDist(all_words))
        for word, freq in nltk_freq_dist.items():
            if freq >= count:
                valid_words.append(word)

        return valid_words
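
The punctuation filter in `__remove_punctuations` keeps a token when at least one of its characters is not a punctuation mark. The same rule can be sketched as a standalone function that uses only the standard library. The helper name `drop_punctuation_tokens` and the sample tokens are made up for illustration and are not part of the class above.

```python
import string
from typing import Iterable, List


def drop_punctuation_tokens(tokens: Iterable[str],
                            extra_punctuation: str = "") -> List[str]:
    """Keep tokens that contain at least one non-punctuation character."""
    # English punctuation plus any language-specific marks passed in
    punctuation = set(string.punctuation) | set(extra_punctuation)
    return [token for token in tokens
            if any(char not in punctuation for char in token)]


# hypothetical sample tokens, not taken from the input files
sample = ["hasn't", ",", "``", "on-campus", "--", "caa"]
kept = drop_punctuation_tokens(sample)
print(kept)  # ["hasn't", 'on-campus', 'caa']
```

Tokens such as `hasn't` and `on-campus` survive this step because they contain letters; in the class they are only dropped later, by the `isalpha()` check in `__clean_tokens`.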

## English

- For stopwords, I use NLTK's default English stopword list.
- The tasks mentioned in the assignment are executed consecutively in separate cells to show the output for each.

In [6]:
english_solution = Basics(data_dir+data_file_1, 'en')

In [7]:
english_solution.add_line_numbers()

1 	 ﻿Braving winter rain, thousands gathered at the Bhupen Hazarika memorial at Jalukbari here to join a protest rally against the Citizenship Amendment Act (CAA).
2 	 Organised by the artistes of the state and backed by the All Assam Students’ Union (AASU), thousands of protesters, wearing black masks, hit the city streets to express their anger against the amended citizenship law and vowed to uproot the current government if it failed to respect public sentiment.
3 	 The huge rally was organised a day after the BJP, in a show of strength, held a massive gathering of its workers in support of CAA at Khanapara ground.
4 	 Leading Sunday’s rally from Jalukbari to Dighalipukhuri, a distance of around 14km, AASU general secretary Lurinjyoti Gogoi said, “The BJP showed its strength with its party workers but the anti-CAA movement has public support. In a democracy, public is the power and this movement has witnessed spontaneous public support from day one.”
5 	 The AASU leader also warned 
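
The first line of the output above appears to start with an invisible byte-order mark (U+FEFF), which suggests that the input file was saved as UTF-8 with a BOM. A minimal sketch, assuming that encoding, of how Python's `utf-8-sig` codec strips the mark on reading. The temporary file is only a stand-in for the real input file.

```python
import os
import tempfile

# write a small stand-in file that starts with a UTF-8 byte-order mark
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"\xef\xbb\xbfBraving winter rain")
    sample_path = tmp.name

# plain utf-8 keeps the mark as the first character of the text
with open(sample_path, "r", encoding="utf-8") as input_file:
    with_bom = input_file.read()

# utf-8-sig removes the mark if it is present
with open(sample_path, "r", encoding="utf-8-sig") as input_file:
    without_bom = input_file.read()

os.remove(sample_path)

print(repr(with_bom[:8]))     # '\ufeffBraving'
print(repr(without_bom[:7]))  # 'Braving'
```

If the mark is left in, it can end up attached to the first token of the text.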

In [8]:
print('Vocabulary size : ', english_solution.vocab_size())

Vocabulary size :  298


In [9]:
english_solution.test_word_freq()

--------------------------
word_freq |   nltk.FreqDist
--------------------------
  806            806
  263            263
   44             44
   36             36
   61             61
   95             95
    3              3
    6              6
   85             85
  127            127
   16             16
    2              2
  530            530
    1              1
    2              2
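
`test_word_freq` cross-checks a hand-written counting loop against a reference frequency table. The same idea can be sketched with `collections.Counter` from the standard library standing in for `nltk.FreqDist`. The token list is randomly generated toy data, not the Brown corpus, and the function name `manual_count` is hypothetical.

```python
import random
from collections import Counter
from typing import List


def manual_count(word: str, tokens: List[str]) -> int:
    """Count occurrences of word with an explicit loop."""
    freq = 0
    for token in tokens:
        if token == word:
            freq += 1
    return freq


# toy data: 200 tokens drawn from a tiny made-up vocabulary
rng = random.Random(0)
toy_tokens = [rng.choice(["the", "of", "jury", "said"]) for _ in range(200)]
reference = Counter(toy_tokens)

# pick 3 random tokens and compare both ways of counting
for word in rng.sample(toy_tokens, 3):
    assert manual_count(word, toy_tokens) == reference[word]
    print('{:>5}{:>15}'.format(manual_count(word, toy_tokens), reference[word]))
```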


In [10]:
word = 'we'
print('Percentage of \"', word, '\" is : ', english_solution.percent(word), '%.')

Percentage of " we " is :  1.2326656394453006 %.
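
`percent` divides the number of occurrences of the word by the total number of non-punctuation tokens and multiplies the result by 100. A minimal sketch of the same arithmetic on a toy token list follows. The tokens and the function name `percent_of` are made up for illustration.

```python
from collections import Counter
from typing import List


def percent_of(word: str, tokens: List[str]) -> float:
    """Share of tokens equal to word, as a percentage."""
    if not tokens:
        raise ValueError("No tokens found.")
    return Counter(tokens)[word] / len(tokens) * 100


# hypothetical tokens, already lower-cased and free of punctuation
toy_tokens = ["we", "will", "protest", "until", "we", "win", "they", "said"]
print(percent_of("we", toy_tokens))  # 2 of 8 tokens -> 25.0
```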


In [11]:
print('10 most frequent words : ')
print(english_solution.n_most_frequent(None, 10))

10 most frequent words : 
['movement', 'caa', 'assam', 'said', 'government', 'protest', 'state', 'also', 'people', 'aasu']
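
The ranking step of `n_most_frequent` can also be expressed with `collections.Counter.most_common`, which returns items in decreasing order of count. This is a sketch using the standard library in place of `nltk.FreqDist`, not the method used in the class. The function name `most_frequent` and the toy data are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, List, Optional, Set


def most_frequent(tokens: Iterable[str], stop_words: Set[str],
                  n: Optional[int] = None) -> List[str]:
    """Return the n most frequent non-stopword tokens (all if n is None)."""
    counts = Counter(token for token in tokens if token not in stop_words)
    return [word for word, _ in counts.most_common(n)]


# hypothetical tokens and stopwords
toy_tokens = ["the", "movement", "has", "support", "the",
              "movement", "caa", "movement", "caa"]
toy_stop_words = {"the", "has"}
print(most_frequent(toy_tokens, toy_stop_words, 2))  # ['movement', 'caa']
```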


In [12]:
print('4 letter words in decreasing order of frequency from left to right: ')
print(english_solution.n_letter_words())

4 letter words in decreasing order of frequency from left to right: 
['said', 'also', 'aasu', 'join', 'till', 'rain', 'goal', 'city', 'huge', 'show', 'held', 'five', 'many', 'warn', 'stop', 'dass', 'duck', 'yuva', 'sure', 'garg', 'worn', 'deaf', 'amid', 'oust', 'next', 'take', 'icon', 'upon', 'vote', 'bank', 'avik']


In [13]:
print('Words occurring at least thrice in Brown corpus : ')
english_solution.words_occuring_n_times()

Words occurring at least thrice in Brown corpus : 


['Dan',
 'Morgan',
 'told',
 'himself',
 'he',
 'would',
 'forget',
 'Ann',
 'Turner',
 'He',
 'was',
 'well',
 'rid',
 'of',
 'her',
 'certainly',
 "didn't",
 'want',
 'a',
 'wife',
 'who',
 'as',
 'If',
 'had',
 'married',
 "he'd",
 'have',
 'been',
 'asking',
 'for',
 'trouble',
 'But',
 'all',
 'this',
 'Sometimes',
 'woke',
 'up',
 'in',
 'the',
 'middle',
 'night',
 'thinking',
 'and',
 'then',
 'could',
 'not',
 'get',
 'back',
 'to',
 'sleep',
 'His',
 'plans',
 'dreams',
 'revolved',
 'around',
 'so',
 'much',
 'long',
 'that',
 'now',
 'felt',
 'if',
 'nothing',
 'The',
 'easiest',
 'thing',
 'be',
 'sell',
 'out',
 'Al',
 'Budd',
 'leave',
 'country',
 'but',
 'there',
 'stubborn',
 'streak',
 'him',
 "wouldn't",
 'allow',
 'it',
 'best',
 'bitterness',
 'disappointment',
 'poisoned',
 'hard',
 'work',
 'found',
 'tired',
 'enough',
 'at',
 'went',
 'simply',
 'because',
 'too',
 'exhausted',
 'stay',
 'awake',
 'Each',
 'day',
 'less',
 'often',
 'each',
 'hurt',
 'little',
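
The list above contains both `'He'` and `'he'`, which shows that the Brown corpus words are counted case-sensitively here. A minimal sketch of how the threshold filter behaves with and without case folding, on toy data. The function name `words_at_least` and its `fold_case` flag are hypothetical and not part of the class.

```python
from collections import Counter
from typing import Iterable, List


def words_at_least(tokens: Iterable[str], count: int = 3,
                   fold_case: bool = False) -> List[str]:
    """Return words whose frequency is at least count."""
    if fold_case:
        tokens = [token.lower() for token in tokens]
    freq = Counter(tokens)
    return [word for word, n in freq.items() if n >= count]


# hypothetical tokens
toy = ["He", "he", "he", "The", "the", "night"]
print(words_at_least(toy))                  # []
print(words_at_least(toy, fold_case=True))  # ['he']
```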

## Bengali

- Bengali text is represented in Unicode, which Python 3 strings handle natively.
- I use language-specific stopwords for Bengali from [this](https://github.com/stopwords-iso/stopwords-bn/blob/master/stopwords-bn.txt) source.
- The tasks mentioned in the assignment are executed consecutively in separate cells to show the output for each.

In [14]:
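# assumes the stopword file is UTF-8 encoded
with open(data_dir + bengali_stopwords, encoding='utf-8') as stopwords_file:
    stopwords_bn = set(stopwords_file.read().split())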

In [15]:
bengali_solution = Basics(data_dir+data_file_2, 'bn', stopwords_bn)

In [16]:
bengali_solution.add_line_numbers()

1 	 ﻿এক টাকা বাড়লেই রাজস্থানে সেঞ্চুরি হাঁকাবে পেট্রোলের দাম। ডিজেলের দামও ৯০-এর কোটা পার করেছে সেখানে। যা দামের নিরিখে দেশের মধ্যে সবচেয়ে বেশি।
2 	 দাম বৃদ্ধির কারণ হিসেবে বিশ্ব বাজারে অশোধিত তেল এবং দুই জ্বালানির দরের হিসেবকে দায়ী করেছে কেন্দ্র। তবে বিরোধীদের অভিযোগ, পেট্রোল এবং ডিজেলের উপর কেন্দ্রের শুল্ক বৃদ্ধির কারণেই ক্রেতাদের বাড়তি কড়ি গুনতে হচ্ছে। কংগ্রেস সাংসদ অধীররঞ্জন চৌধুরী তেলের দাম বৃদ্ধি নিয়ে কেন্দ্রকে কটাক্ষ করে টুইট করেন, ‘আত্মনির্ভর ভারত বৃদ্ধির আরও একটা শিখরে পৌঁছতে চলেছে। সেঞ্চুরি থেকে পাঁচ পয়েন্ট দূরে রয়েছে পেট্রোলের দাম। খুব শীঘ্রই নরেন্দ্র মোদীজি পেট্রোলের দাম ১০০ টাকা করবেন।’
3 	 রবিবার গোটা দেশে লিটারপিছু ২৯ পয়সা বেড়েছে পেট্রোলের দাম। লিটারপিছু ডিজেলের দাম বেড়েছে ৩২ পয়সা। পর পর টানা ৬ দিন বাড়ল এই জীবাশ্ম জ্বালানির দাম। এ ক’দিনে লিটারপিছু পেট্রোলের মোট দাম বেড়েছে ১.৮০ টাকা এবং ডিজেলের ১.৮৮ টাকা।
4 	 দাম বৃদ্ধির নিরিখে রাজস্থানের পরেই রয়েছে মুম্বই। সেখানে পেট্রোলের দাম ৯৫ টাকা ছাড়িয়ে গিয়েছে। ডিজেল ৮৬.০৪ টাকা। দেশের চার মেট্রো শহরের মধ্যে মুম্বইত

In [17]:
print('Vocabulary size : ', bengali_solution.vocab_size())

Vocabulary size :  234


In [18]:
word = 'টাকা'
print('Percentage of \"', word, '\" is : ', bengali_solution.percent(word), '%.')

Percentage of " টাকা " is :  1.2345679012345678 %.


In [19]:
print('10 most frequent words : ')
print(bengali_solution.n_most_frequent(None, 10))

10 most frequent words : 
['দাম', 'টাকা।', 'ডিজেলের', 'পেট্রোলের', 'দেশের', 'তেলের', 'টাকা', 'দাম।', 'লিটার', 'সবচেয়ে']
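
The output above contains tokens such as `'টাকা।'` and `'দাম।'` next to `'টাকা'`, which suggests that the tokenizer left the danda (`।`, U+0964) attached to the preceding word, so the same word is counted under two different forms. One possible workaround, sketched below, is to pad every danda with spaces before tokenizing. The function name `pad_danda` is hypothetical, and `str.split` is only a stand-in for `word_tokenize` so that the example runs without NLTK.

```python
from typing import List

DANDA = "\u0964"  # '।', the Bengali full stop


def pad_danda(text: str) -> str:
    """Surround every danda with spaces so it becomes a separate token."""
    return text.replace(DANDA, f" {DANDA} ")


# sample sentence ending in a danda
sample = "পেট্রোলের দাম ১০০ টাকা।"

# whitespace splitting stands in for the real tokenizer here
tokens: List[str] = pad_danda(sample).split()
print(tokens)

# the danda is now a token of its own and can be filtered as punctuation
words = [token for token in tokens if token != DANDA]
print(words)
```

With the danda split off, the punctuation filter in the class would remove it, because `'।'` is added to the punctuation set when the language is `'bn'`.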


In [20]:
print('4 letter words in decreasing order of frequency from left to right: ')
print(bengali_solution.n_letter_words())

4 letter words in decreasing order of frequency from left to right: 
['টাকা', 'দাম।', 'পয়সা', 'দেশে', 'টানা', 'দামও', 'কোটা', 'দরের', 'দায়ী', 'কড়ি', 'টুইট', 'ভারত', 'একটা', 'পাঁচ', 'দূরে', 'দিনে', '১.৮০', '১.৮৮', 'দিনই', 'একটু', 'একুট', 'করে।', 'সকাল', 'অয়েল', 'চলতি']
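
`n_letter_words` compares `len(word)` with the requested length, and in Python `len` counts Unicode code points. For Bengali this includes dependent vowel signs, so `'টাকা'` counts as 4 even though it has two consonants, each followed by a vowel sign. A rough sketch of the difference, using `unicodedata` to skip combining marks. The function name `base_letter_count` is hypothetical, and this is only an approximation: it is not a full grapheme-cluster count, and consonant conjuncts would still count each consonant separately.

```python
import unicodedata


def base_letter_count(word: str) -> int:
    """Count code points that are not combining marks (Mn, Mc, Me)."""
    return sum(1 for char in word
               if not unicodedata.category(char).startswith("M"))


word = "\u099f\u09be\u0995\u09be"  # 'টাকা'
print(len(word), base_letter_count(word))  # 4 2
```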
