## Tokenisation Notebook

##### Example: Hate Speech
https://huggingface.co/cardiffnlp/twitter-roberta-base-hate?text=I+like+you.+I+love+you

##### Example: Emojis
https://huggingface.co/cardiffnlp/twitter-roberta-base-emoji

Christian's tokeniser code:

In [None]:
from nltk.tokenize import word_tokenize

In [None]:
import re

line = 'A cat sat on the mat. His name was Måns.'

def tokenize(line):
    # Initialise lists
    tokens = []
    unmatchable = []

    # Compile patterns for speedup
    token_pat = re.compile(r'\w+')
    skippable_pat = re.compile(r'\s+')  # typically spaces

    # As long as there's any material left...
    while line:
        # Try finding a skippable token delimiter first.
        skippable_match = re.search(skippable_pat, line)
        if skippable_match and skippable_match.start() == 0:
            # If there is one at the beginning of the line, just skip it.
            line = line[skippable_match.end():]
        else:
            # Else try finding a real token.
            token_match = re.search(token_pat, line)
            if token_match and token_match.start() == 0:
                # If there is one at the beginning of the line, tokenise it.
                tokens.append(line[:token_match.end()])
                line = line[token_match.end():]
            else:
                # Else there is unmatchable material here.
                # It ends where a skippable or token match starts, or at the end of the line.
                unmatchable_end = len(line)
                if skippable_match:
                    unmatchable_end = skippable_match.start()
                if token_match:
                    unmatchable_end = min(unmatchable_end, token_match.start())
                # Add it to unmatchable and discard from line.
                unmatchable.append(line[:unmatchable_end])
                line = line[unmatchable_end:]

    print(tokens)
    print(unmatchable)

In [None]:
# Two adjacent string literals are concatenated by Python into one string.
line2 = (
    "~40 million SSI numbers have been stolen & used by Illegal Aliens to get work, according to agency records!"
    "~Obama stopped sending notice to employers notifying them when numbers don't match their identity!@realDonaldTrump! #BuildTheWall #DeportThemAll!"
)
tokenize(line2)

['40', 'million', 'SSI', 'numbers', 'have', 'been', 'stolen', 'used', 'by', 'Illegal', 'Aliens', 'to', 'get', 'work', 'according', 'to', 'agency', 'records', 'Obama', 'stopped', 'sending', 'notice', 'to', 'employers', 'notifying', 'them', 'when', 'numbers', 'don', 't', 'match', 'their', 'identity', 'realDonaldTrump', 'BuildTheWall', 'DeportThemAll']
['~', '&', ',', '!~', "'", '!@', '!', '#', '#', '!']
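The function above prints its results, so they cannot be reused by later cells. As a minimal sketch (not Christian's code), the same loop can be written so that it returns both lists and walks the string by position instead of re-slicing it. The name `tokenize_returning` and the module-level pattern names are assumptions made for this illustration.

```python
import re

TOKEN_PAT = re.compile(r'\w+')
SKIPPABLE_PAT = re.compile(r'\s+')  # typically spaces


def tokenize_returning(line):
    """Return (tokens, unmatchable) using the same rules as tokenize()."""
    tokens, unmatchable = [], []
    pos = 0
    while pos < len(line):
        # Pattern.match(string, pos) only succeeds if the match starts at pos.
        skippable = SKIPPABLE_PAT.match(line, pos)
        if skippable:
            pos = skippable.end()
            continue
        token = TOKEN_PAT.match(line, pos)
        if token:
            tokens.append(token.group())
            pos = token.end()
            continue
        # Unmatchable material ends where the next skippable or token
        # match starts, or at the end of the line.
        upcoming = (SKIPPABLE_PAT.search(line, pos), TOKEN_PAT.search(line, pos))
        end = min((m.start() for m in upcoming if m), default=len(line))
        unmatchable.append(line[pos:end])
        pos = end
    return tokens, unmatchable


tokens, unmatchable = tokenize_returning('A cat sat on the mat. His name was Måns.')
print(tokens)
print(unmatchable)
```

Returning the lists keeps the printing decision with the caller and makes the function easy to test.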


### Tokenisation Features we want to look out for:
- ".."
    - two or more dots (ignoring spaces between them) mark the end of a sentence or a verbal pause
- "#topic"
    - one token block
    - if at the end of a sentence, create a new sentence containing the topic list (not part of the "speech")
- "@username"
    - one token block, shows directed speech
- emojis
    - convert to words
- " | "
    - the news header is to the left of the separator; the comment on the news comes after it
- "https:"/"http:"
    - link, ignore
- email address
    - remove
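
The features listed above could be sketched with one alternation of named groups plus two small helpers. This is only an illustration under assumptions: `FEATURE_PAT`, `tokenize_features`, `split_header` and `split_trailing_hashtags` are hypothetical names, the `"..."` pause token is an arbitrary normalisation, and emoji conversion is left out because it is handled separately.

```python
import re

# Hypothetical patterns, one per listed feature. Order matters: earlier
# alternatives win when several could match at the same position.
FEATURE_PAT = re.compile(
    r"""
    (?P<url>https?://\S+|www\.\S+)              # links: ignored
  | (?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)     # e-mail addresses: removed
  | (?P<mention>@\w+)                           # @username: one token block
  | (?P<hashtag>\#\w+)                          # #topic: one token block
  | (?P<pause>\.(?:\s*\.)+)                     # two or more dots, spaces ignored
  | (?P<word>\w+(?:'\w+)*)                      # words, keeping "don't" whole
  | (?P<punct>[^\w\s])                          # any other single symbol
    """,
    re.VERBOSE,
)
TRAILING_TAGS_PAT = re.compile(r"(?:\s*#\w+)+\W*$")


def split_header(line):
    """Split 'news header | comment' at the first ' | '; header is None if absent."""
    header, sep, comment = line.partition(" | ")
    return (header, comment) if sep else (None, line)


def split_trailing_hashtags(line):
    """Separate a run of hashtags that closes the line from the 'speech'."""
    m = TRAILING_TAGS_PAT.search(line)
    if not m:
        return line, []
    return line[:m.start()], re.findall(r"#\w+", m.group())


def tokenize_features(line):
    """Return (tokens, dropped); links and e-mail addresses go to dropped."""
    tokens, dropped = [], []
    for m in FEATURE_PAT.finditer(line):
        kind, text = m.lastgroup, m.group()
        if kind in ("url", "email"):
            dropped.append(text)
        elif kind == "pause":
            tokens.append("...")  # assumed normalisation for any run of dots
        else:
            tokens.append(text)
    return tokens, dropped


example = ("Storm hits coast | stay safe everyone.. pls RT @weatherbot, "
           "info at https://example.org/x or me@example.com #storm #safety!")
header, comment = split_header(example)
speech, topics = split_trailing_hashtags(comment)
print(header)
print(tokenize_features(speech))
print(topics)
```

Hashtags in the middle of a sentence stay in the token stream as single tokens; only a run of hashtags that closes the line is moved into the topic list.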



Jannik's Implementation:

In [None]:
import re

string = "test one two three four. well this is stupid. hahaha"
match = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', string)
print(match)

['test one two three four.', 'well this is stupid.', 'hahaha']
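
To see what the two negative lookbehinds in this split pattern do, the following sketch runs the same regular expression on text containing a title and an abbreviation. The wrapper name `split_sentences` and the sample sentence are assumptions made for this illustration.

```python
import re

# Same pattern as above: split on whitespace that follows '.' or '?',
# unless it follows something like "e.g." or a title like "Mr.".
SENT_SPLIT_PAT = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s')


def split_sentences(text):
    """Split text into sentences with the lookbehind-based pattern."""
    return SENT_SPLIT_PAT.split(text)


sample = "Mr. Smith arrived. Was he late? Use e.g. a regex. Wow! Nobody knew."
for sentence in split_sentences(sample):
    print(sentence)
```

Two limitations are visible here: `!` is not treated as a sentence end, so "Wow! Nobody knew." stays in one piece, and a sentence that ends in a capitalised two-letter word such as "Ok." would not be split either, because it looks like a title.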


In [None]:
# Skipping: spaces, links, e-mail addresses.
# \S+ stops a link at the next whitespace, so the text after it is kept.
skippable_pat = re.compile(r'\s+|https?://\S+|www\.\S+|[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
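
When writing a link alternative for a skippable pattern, how the alternative ends decides how much text gets skipped. The sketch below compares two hypothetical link patterns (`greedy_link` and `bounded_link` are names made up for this example) by substituting their matches away.

```python
import re

# Two hypothetical link patterns that differ only in how far they reach.
greedy_link = re.compile(r'https?.*')        # runs to the end of the line
bounded_link = re.compile(r'https?://\S+')   # stops at the next whitespace

text = "see https://example.org/page for details"

after_greedy = greedy_link.sub('', text)
after_bounded = bounded_link.sub('', text)

print(repr(after_greedy))   # everything after the link is lost
print(repr(after_bounded))  # the words after the link survive
```

For tweets that continue after a link, the bounded form keeps the trailing comment available for tokenisation.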

Emoji and emoticon conversion:

In [None]:
from emot.emo_unicode import UNICODE_EMOJI, UNICODE_EMOJI_ALIAS, EMOTICONS_EMO
from flashtext import KeywordProcessor

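# Formatting: merge emoticons and emoji into one lookup table.
# Later dicts win on duplicate keys, so the alias names take precedence.
all_emoji_emoticons = {**EMOTICONS_EMO, **UNICODE_EMOJI, **UNICODE_EMOJI_ALIAS}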
all_emoji_emoticons = {k:v.replace(":","").replace("_"," ").strip() for k,v in all_emoji_emoticons.items()}

kp_all_emoji_emoticons = KeywordProcessor()
for k,v in all_emoji_emoticons.items():
    kp_all_emoji_emoticons.add_keyword(k, v)
kp_all_emoji_emoticons.replace_keywords('I am an 👽 hehe :-)). Lets try another one 😲. It seems 👌')


'I am an alien hehe Very happy. Lets try another one astonished. It seems ok hand'
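
If `emot` and `flashtext` are not available, a rough fallback is to use the official Unicode character names from the standard library. This is a sketch under assumptions, not a replacement for the approach above: `emoji_to_words` is a hypothetical helper, it handles only single code points in the "Symbol, other" category, and it does not cover ASCII emoticons such as `:-)`.

```python
import unicodedata


def emoji_to_words(text):
    """Replace pictographic symbols with their lower-cased Unicode names."""
    out = []
    for ch in text:
        # Category 'So' (Symbol, other) covers most single-code-point emoji.
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            out.append(name.lower() if name else ch)
        else:
            out.append(ch)
    return "".join(out)


converted = emoji_to_words("I am an 👽 hehe. Lets try another one 😲. It seems 👌")
print(converted)
```

The wording differs from the alias names used by `emot`: Unicode names are more formal, for example 👽 becomes "extraterrestrial alien" rather than "alien".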
