# Tweets Tokenization

The goal of the assignment is to write a tweet tokenizer. The input of the code will be a set of tweet texts and the output will be the tokens in each tweet. The assignment is made up of four tasks.

The [data](https://drive.google.com/file/d/15x_wPAflvYQ2Xh38iNQGrqUIWLj5l5Nw/view?usp=share_link) contains 5 files, each with 44 tweets. Tweets are separated by newlines. For manual tokenization only one file should be used.

Grading:
- 30 points - Tokenize tweets by hand
- 30 points - Implement 4 tokenizers
- 20 points - Stemming and Lemmatization
- 20 points - Explain sentencepiece (for masters only)


Remarks: 
- Use Python 3 or greater
- Max is 80 points for bachelors, 100 points for masters

## Tokenize tweets by hand

As a first task you need to tokenize 15 tweets by hand. This will allow you to understand the problem from a linguistic point of view. The guidelines for tweet tokenization are as follows:

- Each smiley is a separate token
- Each hashtag is an individual token. Each user reference is an individual token
- If a word has spaces between its letters, it is converted to a single token
- If a sentence ends with a word that legitimately has a full stop (abbreviations, for example), add a final full stop
- All punctuation marks are individual tokens. This includes double quotes and single quotes
- A URL is a single token

Example of output

    Input tweet
    @xfranman Old age has made N A T O!

    Tokenized tweet (separated by comma)
    @xfranman , Old , age , has , made , NATO , !


    1. Input tweet
    ...
    1. Tokenized tweet
    ...

    2. Input tweet
    ...
    2. Tokenized tweet
    ...

In [1]:
# this task is about tokenizing by hand, so I used neither regex nor nltk

def is_url(text):

    if "http" in text:
        return True
    return False


def tokenize(text):
    punct = ["'", '"', ",", ":", ";", "?", "!", "-", "~", "/"]
    tokens = []
    tok = ""

    # TODO : separate punctuation as individual tokens
    # TODO : emojis as single token

    for word in text.split():

        # checking if there is a word with spaces in between its letters
        if len(word) == 1 and word not in punct:
            tok += word
            continue

        if tok:
            tokens.append(tok)
            tok = ""

        if is_url(word):
            tokens.append(word)
            continue
        # splitting punctuation inside a word off as individual tokens
        word_part=""
        for c in word:
            if c in punct:
                if word_part:
                    tokens.append(word_part)
                    word_part = ""
                tokens.append(c)
                continue
            word_part += c

        tokens.append(word_part)

    return tokens

i = 1
for tweet in open("data/assignment1/file1", 'r'):
    print(f"Input tweet #{i}:\n {tweet}")
    print(f"Tokenized tweet #{i}:\n ")
    token_str = ""
    for token in tokenize(tweet):
        token_str += token + ' , '
    print(token_str)
    if i >= 15:
        break
    i+=1



Input tweet #1:
 @anitapuspasari waduh..

Tokenized tweet #1:
 
@anitapuspasari , waduh.. , 
Input tweet #2:
 " Could journos please stop putting the word ""gate"" after everything they write... gate."

Tokenized tweet #2:
 
" ,  , Could , journos , please , stop , putting , the , word , " , " , gate , " , " ,  , after , everything , they , write... , gate. , " ,  , 
Input tweet #3:
 20% More Ridiculous Sale @20x200 ends tonight! - get 20% off by entering 'RIDONK' at checkout. More info: http://bit.ly/ridonktues

Tokenized tweet #3:
 
20% , More , Ridiculous , Sale , @20x200 , ends , tonight , ! ,  , - ,  , get , 20% , off , by , entering , ' , RIDONK , ' ,  , at , checkout. , More , info , : ,  , http://bit.ly/ridonktues , 
Input tweet #4:
 @Studio85 I have a pair of those shoes. They are comfy. Like being barefoot. Okay for running, but not on concrete, as I've discovered.

Tokenized tweet #4:
 
@Studio85 , I , have , a , pair , of , those , shoes. , They , are , comfy. , Like , bein
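
The recorded output above still contains empty tokens and leaves full stops attached to words (`waduh..`, `checkout.`), and the cell would split the guideline example `N A T O!` into `NAT`, `O`, `!`. The following is a minimal sketch of one way to close those gaps, not the graded solution. The names `split_word`, `tokenize_by_rules`, and `SMILEYS` are hypothetical, the smiley set is a small illustrative assumption, and merging runs of single letters will also merge genuinely adjacent one-letter words such as "I a".

```python
PUNCT = set("'\",:;?!-~/.")  # the punctuation list from the cell above plus the full stop
SMILEYS = {":)", ":(", ":D", ";)", ":-)", ":-("}  # assumption: small illustrative set


def split_word(word):
    """Split one whitespace-delimited word into word parts and punctuation marks."""
    if word.startswith(("http://", "https://")) or word in SMILEYS:
        return [word]  # URLs and smileys stay whole
    pieces, part = [], ""
    for ch in word:
        if ch in PUNCT:
            if part:
                pieces.append(part)
                part = ""
            pieces.append(ch)
        else:
            part += ch
    if part:  # never emit an empty remainder
        pieces.append(part)
    return pieces


def tokenize_by_rules(text):
    tokens = []
    run = []  # pending single letters of a spaced-out word, e.g. ["N", "A", "T", "O"]

    def flush():
        if run:
            tokens.append("".join(run))
            run.clear()

    for word in text.split():
        for piece in split_word(word):
            if len(piece) == 1 and piece.isalpha():
                run.append(piece)
            else:
                flush()
                tokens.append(piece)
    flush()  # a spaced-out word may end the tweet
    return tokens


print(" , ".join(tokenize_by_rules("@xfranman Old age has made N A T O!")))
```

Splitting each word first and merging single letters afterwards is what lets the trailing `O!` contribute its `O` to `NATO` before the `!` is emitted.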

## Implement 4 tokenizers

Your task is to implement the 4 different tokenizers that take a list of tweets on a topic and output tokenization for each:

- White Space Tokenization
- Sentencepiece
- Tokenizing text using regular expressions
- NLTK TweetTokenizer

For tokenizing text using regular expressions, use the rules from task 1: combine them into a regular expression and build a tokenizer around it.

In [13]:
def white_space_tokenizer(text: str):
    tokens = text.split()
    return tokens

In [7]:
# based on : https://github.com/google/sentencepiece/blob/master/python/README.md
# based on https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb

import sentencepiece as spm

def prepare_data_file(files,dataset):
    data=""
    for i in files:
        with open (f"data/assignment1/file{i}", 'r', encoding="utf8") as file:
            data += file.read()
            data += "\n"

    with open(f"{dataset}.txt", "w", encoding="utf8") as file:
        file.write(data)


# Here I used the first 3 data files as training data and
# the last two as testing data

prepare_data_file(["1", "2", "3"], "train")
prepare_data_file(["4", "5"], "test")

# training sentencePiece trainer
# the whole vocabulary list from the training data is ~1300, seems appropriate size to use

spm.SentencePieceTrainer.train('--input=train.txt --model_prefix=m --vocab_size=1300')
sp = spm.SentencePieceProcessor()
sp.load('m.model')

def sentencepiece_wrapper(text: str):
    tokens = sp.Encode(text, out_type=str)
    return tokens

test_data = ""
with open("test.txt", 'r', encoding="utf8") as file:
    test_data += file.read()

# for line in test_data.split("\n"):
#     print(sentencepiece_wrapper(line))

In [10]:
import re
def re_tokenizer(text: str):
    tokens=[]
    for word in text.split():
        tokens += re.findall(r'(https?://[^\s]+)|(\B#\w+\b|@\w+\b)|(\b\w+\b|[^\s]+)|(\W+)', word)

    res = []
    for token  in tokens:
        for word in token:
            if word != '':
                res.append(word)

    return res
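
The flattening loop in `re_tokenizer` is needed because `re.findall` returns one tuple per match when the pattern contains capture groups. As a hedged alternative, the sketch below uses a single pattern without capture groups, ordered from most to least specific, so `findall` returns the tokens directly and every punctuation mark becomes its own token. The names `TOKEN_RE` and `regex_tokenize` are hypothetical, the smiley alternative covers only a few ASCII smileys by assumption, and the rule about words with spaces between letters is not handled here.

```python
import re

TOKEN_RE = re.compile(
    r"""
      https?://\S+        # a URL is a single token
    | [@#]\w+             # user reference or hashtag
    | [:;=]-?[()DPp]      # assumption: a handful of ASCII smileys
    | \w+                 # a word or a number
    | [^\w\s]             # any other single character, e.g. a punctuation mark
    """,
    re.VERBOSE,
)


def regex_tokenize(text: str):
    # No capture groups, so findall returns the matched strings themselves.
    return TOKEN_RE.findall(text)


print(regex_tokenize("@marklobbezoo Mooie winnaar!"))
```

The order of the alternatives matters: the URL branch has to come before the word and punctuation branches, otherwise `http://...` would be split at every `:` and `/`.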

In [11]:
# based on https://www.geeksforgeeks.org/python-nltk-nltk-tweettokenizer/
from nltk.tokenize import TweetTokenizer

def nltk_tweet_tokenizer(text: str) :
    tokenizer = TweetTokenizer()
    tokens = tokenizer.tokenize(text)
    return tokens

Run your implementations on the data. Compare the results, decide which one is better. List the advantages of the best tokenizer.

In [14]:
# TODO: Run tokenizer on the data
# TODO: Time the tokenization process

import time
def run_tokenizer(tokenizer):
    st = time.time()
    num_tweets = 0
    tokens=[]
    for tweet in test_data.split('\n'):
        tokens += tokenizer(tweet)
        num_tweets +=1
    et = time.time()
    elapsed_time = et-st

    # printing stats
    example_tweet = test_data.split('\n')[0]
    print(f"this tokenizer took {elapsed_time} to tokenize {num_tweets} tweets\n Here is an example \n Input tweet: \n {example_tweet}\n Tokenized tweet: \n {tokenizer(example_tweet)}")


print("Trying TweeTokenizer\n")
run_tokenizer(nltk_tweet_tokenizer)
print("------------------------\n")

print("Trying re_tokenizer\n")
run_tokenizer(re_tokenizer)
print("------------------------\n")

print("Trying whitespace tokenizer\n")
run_tokenizer(white_space_tokenizer)
print("------------------------\n")

print("Trying sentiencePiece\n")
run_tokenizer(sentencepiece_wrapper)
print("------------------------\n")

Trying TweeTokenizer

this tokenizer took 0.03604841232299805 to tokenize 91 tweets
 Here is an example 
 Input tweet: 
 @marklobbezoo Mooie winnaar!
 Tokenized tweet: 
 ['@marklobbezoo', 'Mooie', 'winnaar', '!']
------------------------

Trying re_tokenizer

this tokenizer took 0.008000612258911133 to tokenize 91 tweets
 Here is an example 
 Input tweet: 
 @marklobbezoo Mooie winnaar!
 Tokenized tweet: 
 ['@marklobbezoo', 'Mooie', 'winnaar', '!']
------------------------

Trying whitespace tokenizer

this tokenizer took 0.0010025501251220703 to tokenize 91 tweets
 Here is an example 
 Input tweet: 
 @marklobbezoo Mooie winnaar!
 Tokenized tweet: 
 ['@marklobbezoo', 'Mooie', 'winnaar!']
------------------------

Trying sentiencePiece

this tokenizer took 0.016391277313232422 to tokenize 91 tweets
 Here is an example 
 Input tweet: 
 @marklobbezoo Mooie winnaar!
 Tokenized tweet: 
 ['▁@', 'm', 'ark', 'lo', 'bb', 'ez', 'oo', '▁Mo', 'o', 'ie', '▁wi', 'n', 'n', 'aar', '!']
----------------

After running all the tokenizers on the same tweets, I noticed that the whitespace tokenizer is the fastest and TweetTokenizer is the slowest. I don't believe there is a single "best" tokenizer, since the right choice depends on the task and domain. For this specific task I would pick TweetTokenizer, because it is designed specifically for the problem at hand (tokenizing tweets). If the domain were less clearly defined, I would go with SentencePiece, because it learns its segmentation from the dataset at hand, as described in the "Explain sentencepiece" section.
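
The timings above come from a single run of a few milliseconds each, so they are sensitive to noise such as first-call setup costs. A minimal sketch of a steadier measurement, assuming a hypothetical helper `best_time` that repeats the loop and keeps the fastest run, could look like this. `str.split` stands in for any of the tokenizer functions above.

```python
import time


def best_time(tokenizer, tweets, repeats=5):
    """Return the fastest of `repeats` wall-clock timings, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        for tweet in tweets:
            tokenizer(tweet)
        best = min(best, time.perf_counter() - start)
    return best


tweets = ["@marklobbezoo Mooie winnaar!"] * 100
print(f"whitespace split: {best_time(str.split, tweets):.6f} s")
```

Taking the minimum rather than the mean is a common choice for micro-benchmarks, since slower runs mostly reflect interference from the rest of the system.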

## Stemming and Lemmatization

Your task is to write two functions: stem and lemmatize. Input is a text, so you need to tokenize it first.

In [15]:
# based on https://www.geeksforgeeks.org/snowball-stemmer-nlp/

from nltk.stem.snowball import SnowballStemmer

def stem(text: str):
    # I will just use tweettokenizer
    tokens = nltk_tweet_tokenizer(text)
    stemmer = SnowballStemmer(language='english')

    stem_words=[]
    for w in tokens:
        x = stemmer.stem(w)
        stem_words.append(x)

    return stem_words

In [16]:
# based on https://www.projectpro.io/recipes/use-spacy-lemmatizer
import spacy

# kept the tagger but disabled the parser and ner as they are not needed

nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

def lemmatize(text: str) :
    tokens = nlp(text)
    lemmas = [token.lemma_ for token in tokens]
    return lemmas

## Explain sentencepiece (for masters only)

For this task you will have to use the sentencepiece text tokenizer. Your task is to read about how it works and write an explanation, at least 10 sentences long, of how the tokenizer works.


### SentencePiece concept
SentencePiece is an unsupervised text tokenizer and detokenizer, mainly for neural network-based text generation systems where the vocabulary size is fixed before the neural model is trained. SentencePiece allows us to build a purely end-to-end system that does not depend on language-specific pre/postprocessing. It treats the text as a sequence of Unicode characters, thus making it a


### SentencePiece components
SentencePiece is composed of four components: a normalizer, an encoder, a decoder, and a trainer, as shown below.
![SentencePiece components](../../AppData/Local/Temp/0_wbfbmY9bT_3fbWq7.png)

- Normalizer: it standardizes the text into a canonical form. According to its documentation (https://github.com/google/sentencepiece/blob/master/doc/normalization.md), it follows part of NFKC normalization. It also offers the option to use custom normalization rules.
- Encoder/decoder: they map text to tokens and back. The SentencePiece paper calls this property lossless tokenization and states it as Decode(Encode(Normalize(text))) = Normalize(text). Whitespace is handled by replacing it with the meta symbol "▁" (U+2581, which resembles an underscore) when encoding, and by replacing that symbol with whitespace again when decoding.
- Trainer: it uses either Byte-Pair Encoding or a unigram language model to build up a vocabulary of subword units.
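
The lossless property described for the encoder and decoder can be illustrated with a toy round trip. This is only a sketch under stated assumptions: `normalize`, `encode`, and `decode` are hypothetical stand-ins, plain NFKC plus whitespace collapsing approximates the real normalizer, and the "encoder" emits one piece per character instead of using a trained vocabulary. The whitespace marker is the character `▁` (U+2581), as visible in the SentencePiece output earlier in this notebook.

```python
import unicodedata

META = "\u2581"  # "▁", the whitespace marker (not an ASCII underscore)


def normalize(text):
    # Assumption: NFKC plus collapsing runs of whitespace approximates the normalizer.
    return " ".join(unicodedata.normalize("NFKC", text).split())


def encode(text):
    # Toy encoder: mark spaces with META, then emit one piece per character.
    return list(text.replace(" ", META))


def decode(pieces):
    # Concatenate the pieces and turn the marker back into spaces.
    return "".join(pieces).replace(META, " ")


raw = "Ｍｏｏｉｅ   winnaar!"  # full-width letters and extra spaces
norm = normalize(raw)
print(norm)
print(decode(encode(norm)) == norm)
```

Because spaces survive inside the pieces as an ordinary symbol, decoding needs no language-specific rules about where spaces belong. The toy version would break on input that already contains `▁`, which is outside the scope of this sketch.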

### What if there are multiple ways to split up a word based on the vocabulary list?

Another way to phrase the problem is "How do we best split a sentence to ensure that when words are used in the same context, they are matched to the same IDs?"

SentencePiece addresses this problem with subword regularization, in which the segmentation model samples different ways to split the sentence during training, which makes the neural model more accurate and robust.
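
To make the ambiguity concrete, the sketch below enumerates every way to split a word with a small unigram vocabulary and then samples one split in proportion to its probability, which is the basic idea behind subword regularization. The vocabulary and its probabilities are made up for illustration, the helper names are hypothetical, and the brute-force enumeration is an assumption chosen for readability rather than how SentencePiece computes this internally.

```python
import math
import random

# Hypothetical unigram vocabulary with made-up piece probabilities.
VOCAB = {
    "h": 0.05, "e": 0.05, "l": 0.05, "o": 0.05,
    "he": 0.10, "ll": 0.10, "hell": 0.15, "llo": 0.15, "hello": 0.30,
}


def segmentations(word):
    """Return every split of `word` into pieces that are all in VOCAB."""
    if not word:
        return [[]]
    result = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in VOCAB:
            for rest in segmentations(word[end:]):
                result.append([piece] + rest)
    return result


def score(segmentation):
    # Unigram model: the probability of a split is the product of its pieces.
    return math.prod(VOCAB[piece] for piece in segmentation)


def sample_segmentation(word, alpha=1.0, rng=random):
    # alpha < 1 flattens the distribution, alpha > 1 sharpens it.
    candidates = segmentations(word)
    weights = [score(s) ** alpha for s in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


print(max(segmentations("hello"), key=score))  # most probable split
print(sample_segmentation("hello", alpha=0.5, rng=random.Random(0)))
```

Picking the highest-scoring split gives deterministic tokenization, while sampling exposes the downstream model to several valid splits of the same word.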

## Resources

1. [Regular Expressions 1](https://realpython.com/regex-python/)
2. [Regular Expressions 2](https://realpython.com/regex-python-part-2/)
3. [Spacy Lemmatizer](https://spacy.io/api/lemmatizer)
4. [NLTK Stem](https://www.nltk.org/howto/stem.html)
5. [SentencePiece](https://github.com/google/sentencepiece)
6. [sentencepiece tokenizer](https://towardsdatascience.com/sentencepiece-tokenizer-demystified-d0a3aac19b15)

In [1]:
# Install libraries
!pip install sentencepiece
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp310-cp310-win_amd64.whl (1.1 MB)
     ---------------------------------------- 1.1/1.1 MB 3.8 MB/s eta 0:00:00
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.97



