
Notebook 1: Tokenization
===============

CS 6120 Natural Language Processing, Amir

Everyone should turn this notebook in individually at the end of class.

In [None]:
# Saichandu Juluri

Saving notebooks as pdfs
----------

Feel free to add cells to this notebook as you wish. Make sure to leave **code that you've written** and any **answers to questions** that you've written in your notebook. Turn in your notebook as a PDF at the end of the lecture day.


To convert your notebook to a pdf for turn in, you'll do the following:
1. Kernel -> Restart & Run All (clear your kernel's memory and run all cells)
2. File -> Download As -> .html -> open in a browser -> print to pdf

(The download as pdf option doesn't preserve formatting and output as nicely as taking the step "through" html, but will do if the above doesn't work for you.)

Task 1: working with strings in python
-------
Strings in Python are __immutable__, so string methods never modify a string in place. Methods such as `str.lower()` and `str.upper()` return a new string with the case changed.

[See the string python documentation for information on string methods available](https://docs.python.org/3/library/stdtypes.html#string-methods)
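
A minimal sketch of what immutability means in practice. The variable names and the sample sentence are only illustrative: the case methods return new strings, and trying to assign to a character raises a `TypeError`.

```python
# Strings are immutable, so case methods return a new string
# and leave the original untouched.
original = "Call me Ishmael."
lowered = original.lower()
uppered = original.upper()

print(original)  # the original string is unchanged
print(lowered)
print(uppered)

# Item assignment is not allowed on a str and raises TypeError.
try:
    original[0] = "c"
except TypeError as err:
    print("TypeError:", err)
```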

In [None]:
# this is an example of using type hints
# in a function definition in python
# Read about caveats and what this does/does not enforce here:
# https://docs.python.org/3/library/typing.html
# you aren't required to use type hints, but might find them helpful


def tokenize(s: str) -> list:
    """
    Tokenize a string by splitting on single space characters
    (tabs and newlines are not treated as separators)
    Parameters:
        s - string piece of text
    returns a list of strings from the text.
    Each item is an individual linguistic unit.
    """
    tokens = s.split(" ")
    return tokens



test_string = "fill me in with whatever you want!"
tokenized = tokenize(test_string)
print(tokenized)

['fill', 'me', 'in', 'with', 'whatever', 'you', 'want!']


#### Q1. What decisions does your tokenizer make about what should/should not be a token?
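
One way to explore this question is to compare `s.split(" ")` with `s.split()` on a small string. The sketch below uses a made-up sample (not taken from the book) that contains a double space and a newline.

```python
sample = "Call me  Ishmael.\nSome years ago"

# Splitting on a single space keeps an empty string for the repeated
# space and leaves the newline embedded inside one token.
by_space = sample.split(" ")
print(by_space)

# Splitting with no argument treats any run of whitespace
# (spaces, tabs, newlines) as a single separator.
by_whitespace = sample.split()
print(by_whitespace)
```

In both cases punctuation stays attached to the neighboring word, so `'Ishmael.'` is a different token from `'Ishmael'`.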

In [None]:
# example of reading in files with the readline() and read() methods

# read in the text of Moby Dick (ensure the txt file is in the same directory as this notebook)
# if you do not already have it, download the text from http://www.gutenberg.org/files/2701/2701-0.txt
# right click and 'save as' into the directory where this notebook is located, as 'mobydick.txt'
with open('mobydick.txt', "r", encoding='utf-8') as moby:
    print(moby.readline())  # first line is the title
    print(moby.readline())  # second line is blank

# now read in the full contents
with open('mobydick.txt', "r", encoding='utf-8') as moby:
    contents = moby.read()

print(len(contents)) # how long is this string?

﻿The Project Gutenberg eBook of Moby-Dick; or The Whale, by Herman Melville



1238355
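
Text files saved as UTF-8 sometimes begin with an invisible byte order mark (BOM). Assuming the downloaded file has one, the `utf-8` codec keeps it as the first character of the string, while `utf-8-sig` strips it. The sketch below demonstrates the difference on a small temporary file rather than on the book itself; the file content is made up for illustration.

```python
import os
import tempfile

# Write a small file that starts with a UTF-8 byte order mark (BOM).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"\xef\xbb\xbfThe Project Gutenberg eBook\n")
    path = tmp.name

# 'utf-8' keeps the BOM as the character U+FEFF at the start of the text.
with open(path, "r", encoding="utf-8") as f:
    with_bom = f.read()

# 'utf-8-sig' removes a leading BOM if one is present.
with open(path, "r", encoding="utf-8-sig") as f:
    without_bom = f.read()

os.remove(path)

print(repr(with_bom[0]))
print(len(with_bom), len(without_bom))
```

If the BOM is present, it adds one character to `len(contents)` and becomes part of the first token.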


In [None]:
# call your tokenize function on the contents of moby dick
toks = tokenize(contents)

# calculate the number of tokens
num_tokens = len(toks)
print(num_tokens)

# calculate the size of the vocabulary,
# counting each unique token once
num_vocab = len(set(toks))
print(num_vocab)


197668
44635


#### Q2. How many tokens are in *Moby Dick* when you use your `tokenize` function on its contents?
197668
#### Q3. How big is the __vocabulary__ of *Moby Dick* when you use your `tokenize` function on its contents?
44635
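
The difference between tokens (every occurrence) and vocabulary (unique types) can also be seen with `collections.Counter`. This is a small sketch on a made-up sentence; `vocabulary_stats` is a hypothetical helper, not part of the assignment.

```python
from collections import Counter


def vocabulary_stats(tokens):
    """Return (number of tokens, number of unique types, top 3 types)."""
    counts = Counter(tokens)
    num_tokens = sum(counts.values())  # every occurrence counts
    num_types = len(counts)            # each unique token counts once
    return num_tokens, num_types, counts.most_common(3)


sample_tokens = "the whale and the sea and the ship".split(" ")
num_sample_tokens, num_sample_types, top = vocabulary_stats(sample_tokens)
print(num_sample_tokens, num_sample_types)
print(top)
```

Calling `vocabulary_stats(toks)` on the Moby Dick tokens should give the same two numbers as `len(toks)` and `len(set(toks))`.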

Task 2: write a classifier
----

A classifier is, in essence, a function that takes some data $x$ and assigns some label $y$ to it. For a binary classifier, we can model this as a function that takes a data point $x$ and returns either `True` or `False`.

Later in this class we'll learn about how to build classifiers that automatically learn how to do this, but we'll start where NLP started—writing some rule-based classifiers.

In [None]:
def classify_sentence_end(text: str, target_index: int) -> bool:
    """
    Classify whether or not a location is the end of a sentence within
    a given text
    Parameters:
        text - string piece of text
        target_index - int candidate location
    returns true if the target index is the end of a sentence.
    False otherwise.
    """
    # TODO: write a simple, rule-based classifier that
    # decides whether or not a specific location is the
    # end of a sentence

    # considering '.' as end of sentence.
    return text[target_index] == '.'

In [None]:
# example text
# feel free to go through different examples
example = "Stocks were up as advancing issues outpaced declining issues on the NYSE by 1.5 to 1. Large- and small-cap stocks were both strong, while the S.&P. 500 index gained 0.46% to finish at 2,457.59. Among individual stocks, the two top percentage gainers in the S.&P. 500 were Incyte Corporation and Gilead Sciences Inc."

# this code will go through and
# build up a string based on the sentence
# decisions that your classifier comes up with
# it will put "****" between the sentences
# you do not need to modify any code here
so_far = ""
for index in range(len(example)):
    result = classify_sentence_end(example, index)
    so_far += example[index]
    if result:
        print(so_far)
        print("****")
        so_far = ""

print(so_far)



Stocks were up as advancing issues outpaced declining issues on the NYSE by 1.
****
5 to 1.
****
 Large- and small-cap stocks were both strong, while the S.
****
&P.
****
 500 index gained 0.
****
46% to finish at 2,457.
****
59.
****
 Among individual stocks, the two top percentage gainers in the S.
****
&P.
****
 500 were Incyte Corporation and Gilead Sciences Inc.
****



#### Q4. How many sentences did your classifier find?
10

#### Q5. Do you believe that your classifier made any errors?
Yes. Because the classifier treats every '.' as the end of a sentence, it splits at the decimal point in numbers such as 1.5, 0.46, and 2,457.59. It also splits at the periods inside the abbreviation S.&P. Of the 10 boundaries it reported, only 3 are true sentence ends.
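
One possible refinement, sketched under simple assumptions: a punctuation mark ends a sentence only if it is the last character of the text, or if it is followed by whitespace and the next non-space character is uppercase. The names `classify_sentence_end_v2` and `split_sentences` are hypothetical, and the sample text is a shortened variant of the example above.

```python
def classify_sentence_end_v2(text: str, target_index: int) -> bool:
    """Rule-based guess at whether target_index ends a sentence."""
    if text[target_index] not in ".!?":
        return False
    # Punctuation at the very end of the text closes the last sentence.
    if target_index == len(text) - 1:
        return True
    # A period inside a number (0.46) or abbreviation (S.&P.)
    # is not followed by whitespace.
    if not text[target_index + 1].isspace():
        return False
    # Require the next non-space character to be uppercase.
    rest = text[target_index + 1:].lstrip()
    return rest == "" or rest[0].isupper()


def split_sentences(text: str) -> list:
    """Split text into sentences using classify_sentence_end_v2."""
    sentences, so_far = [], ""
    for index, char in enumerate(text):
        so_far += char
        if classify_sentence_end_v2(text, index):
            sentences.append(so_far.strip())
            so_far = ""
    if so_far.strip():
        sentences.append(so_far.strip())
    return sentences


sample = "It gained 0.46% to finish at 2,457.59. The S.&P. 500 rose."
for sentence in split_sentences(sample):
    print(sentence)
```

These rules still make mistakes: an abbreviation followed by a capitalized word (for example "Inc. Apple" or "Mr. Smith") would be split, and a sentence that starts with a digit or lowercase word would be missed.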

Task 3: install `nltk`
-----

If you finish the first two tasks, work on making sure that you have `nltk` downloaded and accessible to your jupyter notebooks. While you will not be allowed to use `nltk` for most of your homework, we will use it frequently in class to demonstrate tools.

[`nltk`](https://www.nltk.org/) (natural language toolkit) is a python package that comes with many useful implementations of NLP tools and datasets.

From the command line, using pip: `pip3 install nltk`

[installing nltk](https://www.nltk.org/install.html)

In [None]:
! pip install nltk



In [None]:
import nltk
# for the tokenizers that we're going to use
# won't cause an error if you've already downloaded it
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
example = "N.K. Jemison is a science fiction author."
words = nltk.word_tokenize(example)
print(words)

['N.K', '.', 'Jemison', 'is', 'a', 'science', 'fiction', 'author', '.']


In [None]:
moby_nltk_tokens = nltk.word_tokenize(contents)
# feel free to add/edit code
print(len(set(moby_nltk_tokens)))

22245


#### Q6. How does the size of the vocabulary for Moby Dick compare when you use `nltk`'s tokenizer vs. the one that you made?
Vocabulary size when tokenizing on spaces: 44635

Vocabulary size when tokenizing with nltk's tokenizer: 22245

The vocabulary is much smaller with nltk's word tokenizer than with the tokenizer that I made.

That is a decrease of about 50% in vocabulary size.

My tokenizer only splits on single spaces, so punctuation stays attached to words ("whale", "whale," and "whale." count as three different types) and words separated only by a newline stay glued together. nltk's `word_tokenize` is rule-based: it separates punctuation from words and splits on all whitespace, so these variants collapse into a single type. It does not merge frequent character pairs the way a subword method such as byte pair encoding does.
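
The effect of separating punctuation from words can be illustrated with a small regex-based sketch. This is not nltk's algorithm; `simple_punct_tokenize` is a hypothetical helper, and the sample text is made up.

```python
import re


def simple_punct_tokenize(text: str) -> list:
    """Return words (allowing internal hyphens/apostrophes) and
    individual punctuation marks as separate tokens."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)


sample = (
    "I saw the whale. You saw the whale, and the whale! "
    "I saw the sea. You saw the sea, and the sea!"
)

space_vocab = set(sample.split(" "))
regex_vocab = set(simple_punct_tokenize(sample))

print(len(space_vocab), sorted(space_vocab))
print(len(regex_vocab), sorted(regex_vocab))
```

With space splitting, "whale.", "whale," and "whale!" are three separate types. With punctuation split off, they collapse into "whale" plus a few punctuation types that are shared across the whole text, which is why the gap grows on a long book.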
