# **My Tokenizer**

In this assignment, you are asked to create your own word tokenizer without the help of external tokenizers. Steps to the assignment:
1. Choose one of the corpora from nltk.corpus list given - assign it to corpus_name
1. Create your tokenizer in the code block - tokenize the selected corpus into token_list
1. Give the raw corpus text, corpus_raw, and your token list, my_tokenized_list, to the evaluation block

Splitting only on whitespace is not enough. Try at least two other improvements to the tokenization. Please write sufficient comments to show your reasoning.

## Rules
### Allowed:
 - Choosing a top-down tokenizer or bottom-up tokenizer
 - Using regular expressions library (import re)
 - Adding additional coding blocks
 - Having an additional dataset if you are creating a bottom-up tokenizer but you need to be able to run the code standalone.

### Not allowed:
 - Using tokenizer libraries such as nltk.tokenize, or any other external libraries to tokenize.
 - Changing the contents of the evaluation block at the end of the notebook.

## Assignment Report
Please write a short assignment report at the end of the notebook (max 500 words). Please include all of the following points in the report:
 - Corpus name and the selection reason
 - Design of the tokenizer and reasoning
 - Challenges you have faced while writing the tokenizer and challenges with the specific corpus
 - Limitations of your approach
 - Possible improvements to the system

## Grading
You will be graded with the following criteria:
 - running complete code (0.5),
 - tokenizer algorithm (2),
 - clear commenting (0.5),
 - evaluation score - comparison with nltk word tokenizer (at most 1 point),
 - assignment report (1).

## Submission

Submission will be made to SUCourse. Please submit your file using the following naming convention.


`studentid_studentname_tokenizer.ipynb` (e.g. `26744_aysegulrana_tokenizer.ipynb`)


**Deadline is October 22nd, 5pm.**

In [12]:
import re  # used by re.findall below; imported here so this cell also runs on its own

def my_tokenizer(corpus_raw):
    '''
    type corpus_raw: string
    param corpus_raw: The raw output of the corpus to be tokenized
    rtype: list
    return: a list of tokens extracted from the corpus_raw
    '''

    # write your tokenizer here and apply to corpus_raw. Return the resulting token_list.
    # you are NOT allowed to use external tokenizers such as word_tokenize from nltk.
    # Only splitting on whitespace is not enough. At least try two other improvements on the tokenization.

    # Lower the letters
    first_tokens = corpus_raw.lower()

    # Tokenize words. An apostrophe starts a new token together with the letters after it, for example, tokyo's is tokenized as "tokyo" and "'s". The characters ; > & ( ) " are each tokenized on their own.
    first_tokens = re.findall(r"'(?:\w+)?|;|>|&|\(|\)|\"|[^\s';>&\(\)\"]+", first_tokens)

    # New list to hold the final tokens
    token_list = []

    # This loop splits off a period or comma only when it ends a token (so numbers like 1.76 or 2,000,132 and abbreviations like u.s. aren't split)
    for token in first_tokens:
        # Split a trailing period only if the token has no other period (so we don't split the dot at the end of abbreviations like u.s.).
        # The len(token) > 1 check keeps a lone "." from adding an empty token.
        if len(token) > 1 and token.endswith('.') and '.' not in token[:-1]:
            token_list.append(token[:-1])  # Append the part before the dot
            token_list.append('.')          # Append the dot
        # Split a trailing comma, unless the token is a lone ","
        elif len(token) > 1 and token.endswith(','):
            token_list.append(token[:-1])  # Append the part before the comma
            token_list.append(',')          # Append the comma
        else:
            token_list.append(token)

    return token_list
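
Because the evaluation at the end of the notebook compares token sets with Jaccard similarity, a toy check like the one below can help judge a change before running the full corpus. The helper `jaccard` and the two sample sets are hypothetical and only mirror the formula; this is not the graded function.

```python
def jaccard(set_a, set_b):
    """Size of the intersection divided by size of the union (illustrative helper)."""
    return len(set_a & set_b) / len(set_a | set_b)


# Made-up token sets: the reference has one token ("." ) that "mine" is missing.
mine = {"tokyo", "'s", "exports", "rose", "1.76", "pct", ","}
reference = {"tokyo", "'s", "exports", "rose", "1.76", "pct", ",", "."}

# 7 shared tokens out of 8 distinct tokens overall
print(jaccard(mine, reference))  # 0.875

# Sets ignore repetition, so token frequencies do not affect the score
print(jaccard(set(["a", "a", "b"]), {"a", "b"}))  # 1.0
```

Since only distinct tokens count, a single recurring mistake costs at most one set element, while every novel wrong token form costs one each.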

You are allowed to add code blocks above to use for your tokenizer or evaluate it.



In [13]:
#main code to run your tokenizer.

#import your libraries here
import nltk
import re

#select the corpus name from the list below
#gutenberg, webtext, reuters, product_reviews_2

corpus_name = 'reuters'

#download the corpus and import it.
nltk.download(corpus_name)
from nltk.corpus import reuters


#get the raw text output of the corpus to the corpus_raw variable.
corpus_raw = reuters.raw()

#call your tokenizer method
my_tokenized_list = my_tokenizer(corpus_raw)



[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data]   Package reuters is already up-to-date!


## Please do not touch the code below that will evaluate your tokenizer with the nltk word tokenizer. You will get zero points from evaluation if you do so.

In [14]:
def similarity_score(set_a, set_b):
    '''
    type set_a: set
    param set_a: The first set to be compared
    type set_b: set
    param set_b: The tokens extracted from the corpus_raw
    rtype: float
    return: similarity score with two sets using Jaccard similarity.
    '''

    jaccard_similarity = float(len(set_a.intersection(set_b)) / len(set_a.union(set_b)))

    return jaccard_similarity

In [15]:
from nltk import word_tokenize
nltk.download('punkt')
from nltk import punkt

def evaluation(corpus_raw, token_list):
    '''
    type corpus_raw: string
    param corpus_raw: The raw output of the corpus
    type token_list: list
    param token_list: The tokens extracted from the corpus_raw
    rtype: float
    return: comparison score with the given token list and the nltk tokenizer.
    '''

    #The comparison score only looks at the tokens but not the frequencies of the tokens.
    #we assume case folding is already applied to the token_list
    corpus_raw = corpus_raw.lower()
    nltk_tokens = word_tokenize(corpus_raw, language='english')

    score = similarity_score(set(token_list), set(nltk_tokens))

    return score

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [16]:
#Evaluation

eval_score= evaluation(corpus_raw, my_tokenized_list)

print('The similarity score is {:.2f}'.format(eval_score))

The similarity score is 0.97


Please write your report below using clear headlines with markdown syntax.

# **Assignment Report**

## Corpus Name and Selection Reasoning

For this assignment, I have chosen the Reuters corpus from the NLTK library. It is a well-known corpus consisting of a large collection of news documents, covering diverse topics such as economics, international events, and various industries. This variety makes it suitable for testing my tokenizer on abbreviations, special names, complex punctuation, contractions, possessives, and diverse sentence structures. Overall, it provides a comprehensive challenge for evaluating the robustness of the tokenizer.

## Design of the Tokenizer and Reasoning

The tokenizer is designed to handle common tokenization tasks while addressing specific language challenges. Initially, I converted all words to lowercase and split the text by whitespace and apostrophes, as contractions and possessives like "it's" and "John's" are common in English. The part following the apostrophe is kept together with the apostrophe to retain its meaning.

Additionally, I ensured that the symbols ;, >, &, (, ), and the double quotation mark (") are tokenized independently, as these are typically standalone symbols. I avoided splitting on all punctuation marks, as that would break abbreviations like "U.S." and numbers like "2,000". Instead, a period or comma is split off only when it ends a token: a trailing period is separated only if the token contains no other period, so "u.s." stays whole, while a trailing comma is always separated. Periods and commas inside a token, as in "1.76" or "2,000,132", are left untouched.
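
As a rough illustration of this design, the sketch below condenses the two stages into one function. The name `sketch_tokenize` is hypothetical, and the `len(token) > 1` guard against empty tokens is an assumption of the sketch.

```python
import re

# Stage 1: an apostrophe plus any following word characters, one standalone
# symbol, or a run of characters that is neither whitespace nor such a symbol.
TOKEN_PATTERN = re.compile(r"'\w*|[;>&()\"]|[^\s';>&()\"]+")


def sketch_tokenize(text):
    """Illustrative two-stage tokenizer (hypothetical helper name)."""
    tokens = []
    for token in TOKEN_PATTERN.findall(text.lower()):
        # Stage 2: split a trailing period only when the rest of the token
        # has no other period, so abbreviations such as "u.s." stay whole.
        if len(token) > 1 and token.endswith(".") and "." not in token[:-1]:
            tokens.extend([token[:-1], "."])
        # A trailing comma is always split off; inner commas are untouched.
        elif len(token) > 1 and token.endswith(","):
            tokens.extend([token[:-1], ","])
        else:
            tokens.append(token)
    return tokens


print(sketch_tokenize("Tokyo's exports rose 1.76 pct, the U.S. said."))
# ['tokyo', "'s", 'exports', 'rose', '1.76', 'pct', ',', 'the', 'u.s.', 'said', '.']
```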

## Challenges Faced

The main challenge was deciding how to handle punctuation, as the Reuters corpus uses punctuation marks in many different contexts. Simply tokenizing every punctuation mark separately didn't work well: it broke abbreviations, hyphenated words, and decimal numbers, which lowered the similarity score. To address this, I explored different approaches, such as tokenizing only certain punctuation marks, keeping periods and commas with the preceding word, and separating the part after the apostrophe from the apostrophe itself. The final solution balances these cases and handles punctuation well.

## Limitations of the Approach

One limitation of the tokenizer is the loss of case information. Since all text is converted to lowercase, it cannot distinguish proper nouns or specialized terms from regular words (for example, "US" and "us"). Additionally, the tokenizer doesn't account for complex cases involving multiple punctuation marks, such as email addresses or URLs. Another limitation is that the tokenizer assumes English-language text and may not generalize well to other languages or scripts.

## Possible Improvements

To improve the tokenizer, more sophisticated regular expressions that recognize patterns such as email addresses, URLs, and dates could be implemented. Another potential improvement would be to incorporate handling for multi-word expressions and named entities (e.g., "New York City" as one token).
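
One possible shape for such patterns is sketched below. The names `SPECIAL_PATTERN` and `pre_split` are hypothetical, and the URL, email, and date patterns are deliberately simplified assumptions, not complete definitions; anything they do not match falls through as a whitespace-delimited chunk for the existing rules.

```python
import re

# Simplified, illustrative patterns; real URLs, emails and dates come in
# many more shapes than these cover.
SPECIAL_PATTERN = re.compile(
    r"""
    https?://[^\s<>"]+[^\s<>".,;:!?)]      # URL that does not end in punctuation
    | [\w.+-]+@[\w-]+(?:\.[\w-]+)+         # email address
    | \d{1,2}/\d{1,2}/\d{2,4}              # numeric date such as 22/10/2024
    | \S+                                  # fallback: whitespace-delimited chunk
    """,
    re.VERBOSE,
)


def pre_split(text):
    """Return chunks with URLs, emails and numeric dates kept whole."""
    return SPECIAL_PATTERN.findall(text)


print(pre_split("Mail john.doe@example.com or see https://example.com/a?b=1&c=2."))
# ['Mail', 'john.doe@example.com', 'or', 'see', 'https://example.com/a?b=1&c=2', '.']
```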