# **My Tokenizer**

In this assignment, you are asked to create your own word tokenizer without the help of external tokenizers. Steps to the assignment:
1. Choose one of the corpora from the given nltk.corpus list and assign its name to `corpus_name`
1. Create your tokenizer in the code block and tokenize the selected corpus into a token list
1. Give the raw corpus text, `corpus_raw`, and your token list, `my_tokenized_list`, to the evaluation block

Splitting on whitespace alone is not enough. Try at least two other improvements to the tokenization. Please write sufficient comments to show your reasoning.

## Rules
### Allowed:
 - Choosing a top-down tokenizer or bottom-up tokenizer
 - Using regular expressions library (import re)
 - Adding additional coding blocks
 - Using an additional dataset if you are creating a bottom-up tokenizer, provided the code can still be run standalone.

### Not allowed:
 - Using tokenizer libraries such as nltk.tokenize, or any other external libraries to tokenize.
 - Changing the contents of the evaluation block at the end of the notebook.

## Assignment Report
Please write a short assignment report at the end of the notebook (max 500 words). Please include all of the following points in the report:
 - Corpus name and the selection reason
 - Design of the tokenizer and reasoning
 - Challenges you have faced while writing the tokenizer and challenges with the specific corpus
 - Limitations of your approach
 - Possible improvements to the system

## Grading
You will be graded with the following criteria:
 - running complete code (0.5),
 - tokenizer algorithm (2),
 - clear commenting (0.5),
 - evaluation score - comparison with nltk word tokenizer (at most 1 point),
 - assignment report (1).

## Submission

Submission will be made to SUCourse. Please submit your file using the following naming convention.


`studentid_studentname_tokenizer.ipynb  - ex. 26744_aysegulrana_tokenizer.ipynb`


**Deadline is October 22nd, 5pm.**

In [1]:
%pip install nltk



In [2]:
import nltk
import re

In [3]:
def extract_country_pair(input_string):
    # Each side of the hyphen is either a dotted abbreviation (e.g. U.S.A)
    # or a capitalized word (e.g. Japan)
    pattern = r'([A-Z](?:\.[A-Z])+|[A-Z][a-zA-Z]*)-([A-Z](?:\.[A-Z])+|[A-Z][a-zA-Z]*)'

    # Find all matches using the pattern
    matches = re.findall(pattern, input_string)

    # Combine matches to form full country pairs
    result = [f"{match[0]}-{match[1]}" for match in matches]
    return result

In [4]:
# Example usage
input_string1 = "U.S.A-Japan"
input_string2 = "Turkey-U.K"
input_string3 = "U.S.A-U.K"
input_string4 = "Turkey-China"

# Extract and print results
print(extract_country_pair(input_string1))  # Output: ['U.S.A-Japan']
print(extract_country_pair(input_string2))  # Output: ['Turkey-U.K']
print(extract_country_pair(input_string3))  # Output: ['U.S.A-U.K']
print(extract_country_pair(input_string4))  # Output: ['Turkey-China']

['U.S.A-Japan']
['Turkey-U.K']
['U.S.A-U.K']
['Turkey-China']


In [5]:
def extract_hyphenated_words(input_string):
    # Regex pattern for hyphenated compound words
    pattern = r'\b\w+(?:-\w+)+\b'

    # Find all matches using the pattern
    matches = re.findall(pattern, input_string)
    return matches

In [6]:
# Example usage
input_string1 = "The well-being of the community is important."
input_string2 = "They worked together for a long-run success."
input_string3 = "The project is related to up-to-date technology."

# Extract and print results
print(extract_hyphenated_words(input_string1))  # Output: ['well-being']
print(extract_hyphenated_words(input_string2))  # Output: ['long-run']
print(extract_hyphenated_words(input_string3))  # Output: ['up-to-date']

['well-being']
['long-run']
['up-to-date']


In [7]:
def handle_contractions(tokens):
    # List to hold the processed tokens
    processed_tokens = []

    # Regex to match common contractions
    contraction_pattern = r"(\w+)(n't|'ll|'ve|'re|'d|'m)"

    # Iterate through each token to split contractions
    for token in tokens:
        match = re.match(contraction_pattern, token)
        if match:
            processed_tokens.extend([match.group(1), match.group(2)])
        else:
            processed_tokens.append(token)

    return processed_tokens

In [8]:
# Example usage
tokens = ["can't", "he'll", "I've", "you're", "they'd", "I'm", "running"]

# Handle contractions in the list of tokens
print(handle_contractions(tokens))

['ca', "n't", 'he', "'ll", 'I', "'ve", 'you', "'re", 'they', "'d", 'I', "'m", 'running']
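Because the pattern above is applied with `re.match` and only the first two groups are kept, any characters after the contraction suffix (for example the comma in `can't,`) are dropped from the output. The following is a minimal sketch of a variant that keeps the remainder as its own token. The name `split_contractions_keep_rest` and the third capture group are assumptions made for illustration and are not part of the notebook.

```python
import re

# Hypothetical variant of handle_contractions: the third group captures
# whatever follows the suffix, so no characters are lost.
CONTRACTION_RE = re.compile(r"(\w+)(n't|'ll|'ve|'re|'d|'m)(.*)", re.DOTALL)

def split_contractions_keep_rest(tokens):
    processed = []
    for token in tokens:
        match = CONTRACTION_RE.fullmatch(token)
        if match:
            stem, suffix, rest = match.groups()
            processed.extend([stem, suffix])
            if rest:
                # Keep trailing characters (e.g. punctuation) as a separate token
                processed.append(rest)
        else:
            processed.append(token)
    return processed

print(split_contractions_keep_rest(["can't,", "I'm", "running"]))
# ['ca', "n't", ',', 'I', "'m", 'running']
```

The same idea would apply to `handle_possessives`, whose pattern has the same shape.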


In [9]:
def handle_possessives(tokens):
    # List to hold the processed tokens
    processed_tokens = []

    # Regex to match possessives ending in 's (e.g., "China's")
    possessive_pattern = r"(\w+)('s)"

    # Iterate through each token to split possessives
    for token in tokens:
        match = re.match(possessive_pattern, token)
        if match:
            processed_tokens.extend([match.group(1), match.group(2)])
        else:
            processed_tokens.append(token)

    return processed_tokens

In [10]:
# Example usage
tokens = ["China's", "children's", "Japan", "it's", "Tom's", "running"]

# Handle possessives in the list of tokens
print(handle_possessives(tokens))

['China', "'s", 'children', "'s", 'Japan', 'it', "'s", 'Tom', "'s", 'running']


In [11]:
def handle_decimals(text):
    # Regex pattern to match decimal numbers (e.g., 15.6, 0.001)
    pattern = r'\b\d+\.\d+\b'

    # Find all matches using the pattern
    tokens = re.findall(pattern, text)

    return tokens

In [12]:
# Example usage
sample_text = """The GDP was 15.6 billion dlrs last year, among the world's largest. The profit margin was 0.05%."""

print(handle_decimals(sample_text))

['15.6', '0.05']
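`re.findall` returns only the number itself, so when this function is given a single whitespace-delimited token such as `0.05%.`, the percent sign and the period are not part of the result. The sketch below shows one possible way to keep the text around the number. The function name `split_around_decimal` is hypothetical, and the returned prefix and suffix would still need further splitting by the other handlers.

```python
import re

DECIMAL_RE = re.compile(r"\d+\.\d+")

def split_around_decimal(token):
    """Return [before, number, after] for the first decimal in token,
    omitting empty parts; return [] when the token contains no decimal."""
    match = DECIMAL_RE.search(token)
    if match is None:
        return []
    before = token[:match.start()]
    after = token[match.end():]
    return [part for part in (before, match.group(), after) if part]

print(split_around_decimal("0.05%."))  # ['0.05', '%.']
print(split_around_decimal("-15.6"))   # ['-', '15.6']
print(split_around_decimal("dlrs"))    # []
```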


In [13]:
def handle_eras(text):
    # Regex pattern to match eras (e.g., mid-1988, early-2000s, late-19th)
    pattern = r'\b(?:mid|early|late)-\d{4}(?:s)?\b|\b(?:mid|early|late)-\d{1,2}(?:st|nd|rd|th)?\b'

    # Find all matches using the pattern
    tokens = re.findall(pattern, text)

    return tokens

In [14]:
# Example usage
sample_text = """The GDP was 15.6 billion dlrs mid-1988. On the other hand, late-20th century has not begun yet"""

print(handle_eras(sample_text))

['mid-1988', 'late-20th']


In [15]:
def handle_parentheses(text):
    # Regex pattern to specifically match parentheses (), [], {}
    pattern = r'[\(\)\[\]\{\}]'

    # Find all matches using the pattern
    tokens = re.findall(pattern, text)

    return tokens

In [16]:
# Example usage
sample_text = """erosion of
  exports (of goods subject to {}tariffs] to the"""

print(handle_parentheses(sample_text))

['(', '{', '}', ']']


In [17]:
def handle_special_symbols(text):
    # Regex pattern to match the special symbols $, %, #, &, @, <, >, !, \, ", ?, _, /, +, -, *
    pattern = r'[\$%#&@<>\!\\"\?_\/\+\-\*]'

    # Find all matches using the pattern
    tokens = re.findall(pattern, text)

    return tokens

In [18]:
# Example usage
sample_text = """&lt;Taiwan$/; "If Safe@ _Group>"""

print(handle_special_symbols(sample_text))

['&', '$', '/', '"', '@', '_', '>']


In [19]:
def handle_date_formats(text):
    # Regex pattern to match dates in various formats (e.g., 01/12/2023, 1.12.2023)
    pattern = r'\b(0?[1-9]|[12][0-9]|3[01])[/-](0?[1-9]|1[0-2])[/-](\d{4})\b|\b(0?[1-9]|[12][0-9]|3[01])\.(0?[1-9]|1[0-2])\.(\d{4})\b'

    # Find all matches using the pattern
    matches = re.finditer(pattern, text)

    tokens = []
    for match in matches:
        tokens.append(match.group())

    return tokens

In [20]:
# Example usage
sample_text = """I saw him at 20/10/2024 and 21.10.2024"""

print(handle_date_formats(sample_text))

['20/10/2024', '21.10.2024']


In [21]:
def handle_commas_dots_colons_semicolons(text):
    # Regex pattern to match commas, dots, colons, and semicolons as standalone tokens
    pattern = r'[,:;\.]'

    # Find all matches using the pattern
    tokens = re.findall(pattern, text)

    return tokens

In [22]:
# Example usage
sample_text = """&lt;Taiwan$/; "If., S:afe@ _Group>"""

print(handle_commas_dots_colons_semicolons(sample_text))

[';', ';', '.', ',', ':']


In [32]:
def my_tokenizer(text):
    '''
    type text: string
    param text: The raw text of the corpus to be tokenized
    rtype: list
    return: a list of lowercased tokens extracted from text
    '''
    # Split text by whitespace to get initial tokens
    tokens = text.split()
    token_list = []

    # Process each token using different handlers.
    # Iterating directly avoids list.pop(0), which shifts the whole list on every call.
    for token in tokens:

        # Step 1: Handle decimal numbers
        decimals_tokens= handle_decimals(token)
        if decimals_tokens:
            token_list.extend(decimals_tokens)
            continue

        # Step 2: Extract country or abbreviation pairs
        abbreviation_pairs_tokens= extract_country_pair(token)
        if abbreviation_pairs_tokens:
            token_list.extend(abbreviation_pairs_tokens)
            continue

        # Step 3: Extract hyphenated words
        hyphenated_words_tokens = extract_hyphenated_words(token)
        if hyphenated_words_tokens:
            token_list.extend(hyphenated_words_tokens)
            continue

        # Step 4: Extract eras
        era_tokens = handle_eras(token)
        if era_tokens:
            token_list.extend(era_tokens)
            continue

        # Step 5: Extract dates
        date_tokens = handle_date_formats(token)
        if date_tokens:
            token_list.extend(date_tokens)
            continue

        # Step 6: Extract parentheses
        parentheses_tokens = handle_parentheses(token)
        if parentheses_tokens:
            token_list.extend(parentheses_tokens)
            continue

        # Step 9: Extract special symbols ($, %, #, &, @, <, >, ", ?, _, /, +, -, *) when the token starts with one
        if re.match(r'[\$%#&@<>\"\?_\/\+\-\*]', token):
            token_list.extend(handle_special_symbols(token))
            continue

        # Step 10: Extract commas, dots, colons, and semicolons
        if re.match(r'[,:;\.]', token):
            token_list.extend(handle_commas_dots_colons_semicolons(token))
            continue

        # General case: Add the token to the list
        token_list.append(token)

    # Step 11: Handle possessives in the tokens
    token_list = handle_possessives(token_list)

    # Step 12: Handle contractions in the tokens
    token_list = handle_contractions(token_list)

    # Step 13: Convert all tokens to lowercase
    token_list = [token.lower() for token in token_list]

    return token_list


In [33]:
# Sample text to test the tokenizer
sample_text = """Korea's economic growth, mid-1988, late-19th century, 15.6 billion dlrs last year, among the world's largest. The profit margin was 0.05%. Symbols like <>, (), [], {}, "", &lt;MC.T> and & others should be handled correctly. The price is $100 and the rate is 5%. Use #hashtag for posts. Special symbols like _ / \\ + - * are also included. The event was held on 01/12/2023 and 1.12.2023."""

# Tokenize the sample text
token_list = my_tokenizer(sample_text)
print(token_list)

['korea', "'s", 'economic', 'growth,', 'mid-1988', 'late-19th', 'century,', '15.6', 'billion', 'dlrs', 'last', 'year,', 'among', 'the', 'world', "'s", 'largest.', 'the', 'profit', 'margin', 'was', '0.05', 'symbols', 'like', '<', '>', '(', ')', '[', ']', '{', '}', '"', '"', '&', '>', 'and', '&', 'others', 'should', 'be', 'handled', 'correctly.', 'the', 'price', 'is', '$', 'and', 'the', 'rate', 'is', '5%.', 'use', '#', 'for', 'posts.', 'special', 'symbols', 'like', '_', '/', '\\', '+', '-', '*', 'are', 'also', 'included.', 'the', 'event', 'was', 'held', 'on', '01/12/2023', 'and', '1.12']
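The last token in the output above is `1.12` rather than `1.12.2023` because the decimal handler runs before the date handler and the first handler that matches wins. The sketch below reproduces that effect in isolation and shows how the result changes when the date pattern is tried first. The helper `first_match` and the simplified, non-capturing date pattern are assumptions made for illustration.

```python
import re

DECIMAL_RE = re.compile(r"\b\d+\.\d+\b")
# Dotted dates only, with non-capturing groups so findall returns whole matches
DOTTED_DATE_RE = re.compile(
    r"\b(?:0?[1-9]|[12][0-9]|3[01])\.(?:0?[1-9]|1[0-2])\.\d{4}\b"
)

def first_match(token, patterns):
    """Return (name, matches) for the first pattern that finds anything."""
    for name, pattern in patterns:
        found = pattern.findall(token)
        if found:
            return name, found
    return None, [token]

token = "1.12.2023."
print(first_match(token, [("decimal", DECIMAL_RE), ("date", DOTTED_DATE_RE)]))
# ('decimal', ['1.12'])
print(first_match(token, [("date", DOTTED_DATE_RE), ("decimal", DECIMAL_RE)]))
# ('date', ['1.12.2023'])
```

In general, trying the more specific pattern before the more general one avoids this kind of partial match.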


You are allowed to add code blocks above to use for your tokenizer or evaluate it.



In [34]:
#main code to run your tokenizer.
#import your libraries here
import nltk
import re

#select the corpus name from the list below
#gutenberg, webtext, reuters, product_reviews_2
corpus_name = 'reuters'

#download the corpus and import it.
nltk.download(corpus_name)
from nltk.corpus import reuters

#get the raw text output of the corpus to the corpus_raw variable.
file_ids = reuters.fileids()
corpus_raw = ' '.join([reuters.raw(file_id) for file_id in file_ids])

# Print a portion of the raw corpus to inspect
print(corpus_raw[:700])

#call your tokenizer method
my_tokenized_list = my_tokenizer(corpus_raw)

# Print the tokens in groups of 20
for i in range(0, 750, 20): #len(my_tokenized_list)
    print(my_tokenized_list[i:i+20])

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data]   Package reuters is already up-to-date!


ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RIFT
  Mounting trade friction between the
  U.S. And Japan has raised fears among many of Asia's exporting
  nations that the row could inflict far-reaching economic
  damage, businessmen and officials said.
      They told Reuter correspondents in Asian capitals a U.S.
  Move against Japan might boost protectionist sentiment in the
  U.S. And lead to curbs on American imports of their products.
      But some exporters said that while the conflict would hurt
  them in the long-run, in the short-term Tokyo's loss might be
  their gain.
      The U.S. Has said it will impose 300 mln dlrs of tariffs on
  imports of Japanese electronics goods on Apri
['asian', 'exporters', 'fear', 'damage', 'from', 'u.s.-japan', 'rift', 'mounting', 'trade', 'friction', 'between', 'the', 'u.s.', 'and', 'japan', 'has', 'raised', 'fears', 'among', 'many']
['of', 'asia', "'s", 'exporting', 'nations', 'that', 'the', 'row', 'could', 'inflict', 'far-reaching', 'econom

## Please do not touch the code below that will evaluate your tokenizer with the nltk word tokenizer. You will get zero points from evaluation if you do so.

In [35]:
def similarity_score(set_a, set_b):
    '''
    type set_a: set
    param set_a: The first set to be compared
    type set_b: set
    param set_b: The tokens extracted from the corpus_raw
    rtype: float
    return: similarity score with two sets using Jaccard similarity.
    '''

    jaccard_similarity = float(len(set_a.intersection(set_b)) / len(set_a.union(set_b)))

    return jaccard_similarity

In [36]:
from nltk import word_tokenize
nltk.download('punkt')
from nltk import punkt

def evaluation(corpus_raw, token_list):
    '''
    type corpus_raw: string
    param corpus_raw: The raw output of the corpus
    type token_list: list
    param token_list: The tokens extracted from the corpus_raw
    rtype: float
    return: comparison score with the given token list and the nltk tokenizer.
    '''

    #The comparison score only looks at the tokens but not the frequencies of the tokens.
    #we assume case folding is already applied to the token_list
    corpus_raw = corpus_raw.lower()
    nltk_tokens = word_tokenize(corpus_raw, language='english')

    score = similarity_score(set(token_list), set(nltk_tokens))

    return score

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [37]:
#Evaluation

eval_score = evaluation(corpus_raw, my_tokenized_list)

print('The similarity score is {:.2f}'.format(eval_score))

The similarity score is 0.65


Please write your report below using clear headlines with markdown syntax.

# Assignment Report

### Corpus name and the selection reason

The chosen corpus for this assignment is Reuters. 
It was selected because it contains diverse financial news articles that include abbreviations, punctuation, and complex structures. 
This diversity made it a suitable choice for testing the robustness and accuracy of the tokenizer. 
Additionally, since the dataset is expected to contain few typos, tokenization errors are easier to attribute to the tokenizer rather than to noisy input.

### Design of the tokenizer and reasoning

The tokenizer was designed using a modular approach, with each function handling a specific tokenization challenge. 
Functions were created to address decimals, possessives, contractions, hyphenated words, special symbols, and country pairs. 
Regular expressions were used for their flexibility in matching complex patterns. 
The modular design allowed for systematic debugging and easy extension to handle new cases.
Thirteen cases are considered in addition to splitting on whitespace. They were chosen after observing, in a separate script, how NLTK tokenizes the Reuters corpus.
Based on these observations, abbreviations (e.g. U.S.A), abbreviation pairs (e.g. U.S.A-Japan), hyphenated words (e.g. far-reaching), eras (e.g. mid-1900), dates (e.g. 23.10.2024 or 23/10/2024), parentheses and brackets (e.g. ({[]})), possessives (e.g. Korea's), contractions (e.g. wouldn't, he'll), special symbols, and punctuation were identified as cases to be tokenized separately. Each function was tested with short sample strings.
The NLTK tokenizer does not change case, but the evaluation block lowercases the raw corpus before running it, so all tokens are lowercased here as well in order to increase the similarity score.
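To illustrate why case folding matters for a set-based score, the toy example below computes Jaccard similarity between a lowercased reference set and a token set with and without lowercasing. The token sets are made up for illustration, and `jaccard` is a standalone helper rather than the evaluation block's function.

```python
def jaccard(set_a, set_b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    return len(set_a & set_b) / len(set_a | set_b)

# Assumed toy data: the reference tokens are already lowercased
reference = {"the", "u.s.", "and", "japan"}
cased = {"The", "U.S.", "And", "Japan", "the"}
folded = {token.lower() for token in cased}

print(jaccard(cased, reference))   # 1 shared token out of 8 distinct -> 0.125
print(jaccard(folded, reference))  # identical sets -> 1.0
```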

### Challenges faced

Challenges included handling edge cases without affecting other parts of the text. For example, decimals like "-15.6" required careful handling to ensure correct tokenization. Special characters such as parentheses, brackets, and ampersands were also difficult to split correctly without losing the surrounding context. Additionally, the sequential processing of tokens sometimes led to incorrect splits due to overlapping patterns. Punctuation was the most problematic, because the same character can act as sentence punctuation or as part of an abbreviation or another kind of token.

### Limitations of the approach

The tokenizer's sequential nature means that earlier transformations can affect subsequent ones, leading to inconsistencies. The functions could be nested more. For example, the hyphen function could also cover abbreviation pairs such as "U.S.A-Japan" as well as inner hyphenated words such as "well-being". The reliance on regular expressions also limits flexibility in handling ambiguous text. The tokenizer is rule-based and cannot learn from data, which limits its adaptability to new or unforeseen text structures.
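One common alternative to a chain of handlers, which is not what this notebook does, is a single regular expression with ordered alternatives applied to the whole text with `re.findall`. The sketch below is a simplified illustration under that assumption. The pattern `TOKEN_RE` and the function `tokenize_single_pass` are hypothetical names, and the alternatives cover only a few of the cases discussed above.

```python
import re

# More specific alternatives come first; the last one catches any leftover symbol,
# so every non-whitespace character ends up in some token.
TOKEN_RE = re.compile(r"""
    \d{1,2}[./-]\d{1,2}[./-]\d{4}   # dates such as 20/10/2024 or 21.10.2024
  | \d+\.\d+                        # decimals such as 15.6
  | (?:[A-Za-z]\.){2,}              # dotted abbreviations such as U.S.
  | \w+(?:-\w+)*                    # words, including hyphenated ones
  | [^\w\s]                         # any other single non-space symbol
""", re.VERBOSE)

def tokenize_single_pass(text):
    return TOKEN_RE.findall(text)

print(tokenize_single_pass("U.S.-Japan rift cost 15.6 mln dlrs (far-reaching) on 21.10.2024."))
# ['U.S.', '-', 'Japan', 'rift', 'cost', '15.6', 'mln', 'dlrs',
#  '(', 'far-reaching', ')', 'on', '21.10.2024', '.']
```

Because the regex engine tries the alternatives in order at each position, the ordering problem is confined to one place instead of being spread across several `continue` branches.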

### Possible improvements

To bring the tokenizer closer to the NLTK tokenizer, punctuation handling could be implemented more carefully. For example, handling cases like "5%.", "posts.", and "<MC.T>" with more specific punctuation rules would improve accuracy. For a case like "0.05%.", the `handle_decimals` function should be extended to also tokenize any text before and after the number. Also, instead of checking some cases with inline regular expressions in `my_tokenizer`, the helper functions could be called directly, as is done for the earlier steps. Finally, comprehensive testing with diverse corpora could help identify and address further edge cases.
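As a rough sketch of the punctuation idea, the function below peels non-word characters off both ends of a whitespace-delimited token and emits each as its own token. The name `peel_punctuation` is hypothetical. The rule is deliberately naive: it would also strip the final period of an abbreviation such as "U.S.", so abbreviations would have to be recognized before it is applied.

```python
import re

# Leading non-word characters, a lazily matched core, trailing non-word characters
EDGE_RE = re.compile(r"(\W*)(.*?)(\W*)", re.DOTALL)

def peel_punctuation(token):
    leading, core, trailing = EDGE_RE.fullmatch(token).groups()
    # Each peeled character becomes its own token; the core is kept whole
    return list(leading) + ([core] if core else []) + list(trailing)

print(peel_punctuation("5%."))     # ['5', '%', '.']
print(peel_punctuation("posts."))  # ['posts', '.']
print(peel_punctuation("<MC.T>"))  # ['<', 'MC.T', '>']
```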

---

Ipek Akkus - 30800 - ipek.akkus