## Tokenization breaks long text into small pieces (tokens). The tokens are then represented as numbers, for example with one-hot encoding, and embeddings can be used to reduce the dimensionality. Either one-hot vectors or embeddings can serve as input to our machine learning algorithms.
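To make the link between one-hot vectors and embeddings concrete, here is a minimal pure-Python sketch. The 4-word vocabulary and the 2-dimensional embedding values are made up for illustration; a real embedding table is learned during training. The helper names `one_hot` and `vector_times_matrix` are hypothetical.

```python
def one_hot(index, vocab_size):
    """Return a vector of zeros with a single 1.0 at position `index`."""
    vector = [0.0] * vocab_size
    vector[index] = 1.0
    return vector


def vector_times_matrix(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    columns = len(matrix[0])
    return [sum(v * row[c] for v, row in zip(vector, matrix)) for c in range(columns)]


# Hypothetical vocabulary of 4 tokens, each with a made-up 2-dimensional embedding.
embedding_table = [
    [0.1, 0.9],
    [0.4, 0.2],
    [0.7, 0.5],
    [0.3, 0.8],
]

token_id = 2
sparse = one_hot(token_id, vocab_size=len(embedding_table))  # length = vocabulary size
dense = vector_times_matrix(sparse, embedding_table)         # length = embedding size

print(sparse)
print(dense)
```

Multiplying a one-hot vector by the table simply selects one row, which is why an embedding can be seen as a lower-dimensional replacement for the one-hot vector.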
### SUMMARY
Tokenization can be done in 3 ways:

#1) Word Tokenizer:

    a) Ready-made, no training needed (only splits text into lowercase words)  (NO VECTORIZATION):
                        from tensorflow.keras.preprocessing.text import text_to_word_sequence
                        result = text_to_word_sequence(text) 
    b) Train our own tokenizer (YES VECTORIZATION):
                        from tensorflow.keras.preprocessing.text import Tokenizer
                        #create a Tokenizer instance; always set oov_token (e.g. '<unk>') so unseen words map to it
                        t = Tokenizer(oov_token='<unk>')
                        #train the tokenizer; docs is the list of corpus documents [doc1, doc2, doc3, ...]
                        t.fit_on_texts(docs)   
                        #returns one row per text: a bag-of-words (multi-hot) vector over the fitted vocabulary
                        t.texts_to_matrix(["please welcome Eminem"],mode='binary')  
                        #the Tokenizer class has other methods as well
                        #mode can be 'binary', 'count', 'freq' or 'tfidf'
                        #'freq' is each word's count divided by the number of words in that document
                        
                        t.texts_to_sequences(['what a dirty place']) 
                        #use this method when training sequence models such as RNNs
                        
#2) Sub word Tokenizer
                    from tokenizers import Tokenizer
                    from tokenizers.models import BPE, Unigram, WordLevel, WordPiece
                    from tokenizers.trainers import BpeTrainer, WordLevelTrainer, WordPieceTrainer, UnigramTrainer

                    ## a pre-tokenizer that first segments the text into words, so a token is never longer than one word
                    from tokenizers.pre_tokenizers import Whitespace
                    
                    #1st define the "prepare_tokenizer_trainer" function (see the code below)
                    #2nd define the "train_tokenizer" function (see the code below)
                    #3rd define the "tokenize" function (see the code below)
                    #4th train the tokenizer
                    trained_tokenizer = train_tokenizer(files, alg) #files = list of text file paths; alg is one of 'WLV', 'BPE', 'UNI', 'WPC'
                    #5th create tokens for any new text
                    output = tokenize(text, trained_tokenizer) #text = 'any text we need to tokenize'
                    #6th get the integer ids of the new text
                    print(output.ids) #these ids can be fed to an RNN model; one-hot vectors can also be built from them
                    


#3) Character Tokenizer (not generally used, so not covered in this Python code)

In [40]:
#Detailed code and its output follow below

In [None]:
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
# Split on whitespace

# A very crude way to tokenize: punctuation such as , and . stays attached to the words
text.split() 

# Crude sentence tokenizer: split on a period followed by a space
text.split(". ")
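As a slightly less crude alternative to plain `split()`, the sketch below uses the standard `re` module to drop punctuation from words and to keep the sentence-ending punctuation with each sentence. The function names `split_words` and `split_sentences` are hypothetical, and the regular expressions are a simplification that ignores cases such as abbreviations.

```python
import re


def split_words(text):
    """Return runs of letters, digits and underscores; punctuation is dropped."""
    return re.findall(r"\w+", text)


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [part.strip() for part in parts if part.strip()]


sample = "Hello, world. Bye!"
print(split_words(sample))      # ['Hello', 'world', 'Bye']
print(split_sentences(sample))  # ['Hello, world.', 'Bye!']
```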

## Word Tokenization
A tokenizer can be trained, or we can use the ready-made tokenizer in Keras, which needs no training.
To use a tokenizer, create an instance of the class, convert the text into tokens, and then vectorize the text.

In [2]:
#Ready-made tokenizer: Keras lowercases the text and strips punctuation before splitting it into words
from tensorflow.keras.preprocessing.text import text_to_word_sequence
result = text_to_word_sequence(text)
print(result)

['founded', 'in', '2002', 'spacex’s', 'mission', 'is', 'to', 'enable', 'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi', 'planet', 'species', 'by', 'building', 'a', 'self', 'sustaining', 'city', 'on', 'mars', 'in', '2008', 'spacex’s', 'falcon', '1', 'became', 'the', 'first', 'privately', 'developed', 'liquid', 'fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'earth']
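The output above can be understood with a small pure-Python sketch of what `text_to_word_sequence` does with its default arguments: lowercase the text, replace each filtered character with a space, and split. This is an approximation written for illustration, not the Keras source; the name `simple_word_sequence` is hypothetical, and the filter string is assumed to match the Keras default.

```python
# Assumed to match the default `filters` argument of text_to_word_sequence.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_word_sequence(text, filters=DEFAULT_FILTERS, lower=True):
    """Lowercase, turn filtered characters into spaces, then split on whitespace."""
    if lower:
        text = text.lower()
    table = str.maketrans({ch: " " for ch in filters})
    return text.translate(table).split()


print(simple_word_sequence("Founded in 2002, SpaceX’s multi-planet species."))
```

The curly apostrophe in `SpaceX’s` is not in the filter string, which is why `'spacex’s'` stays as one token while `multi-planet` is split at the hyphen.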


In [3]:
#Using the Tokenizer class: the main way of creating tokens
#Training a word tokenizer
from tensorflow.keras.preprocessing.text import Tokenizer
# define 5 documents
docs = ['How are you doing Kushank?',
'My name is Gurjot Singh Kushank',
'Nice to meet you',
'What is your name Kushank Kushank?',
'what a nice place to meet!']
# create the tokenizer
t = Tokenizer(oov_token='<unk>')
# fit the tokenizer on the documents, i.e. t.fit_on_texts(docs) builds the vocabulary from docs
t.fit_on_texts(docs)  

In [4]:
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs) #number of documents in which each word appears

OrderedDict([('how', 1), ('are', 1), ('you', 2), ('doing', 1), ('kushank', 4), ('my', 1), ('name', 2), ('is', 2), ('gurjot', 1), ('singh', 1), ('nice', 2), ('to', 2), ('meet', 2), ('what', 2), ('your', 1), ('a', 1), ('place', 1)])
5
{'<unk>': 1, 'kushank': 2, 'you': 3, 'name': 4, 'is': 5, 'nice': 6, 'to': 7, 'meet': 8, 'what': 9, 'how': 10, 'are': 11, 'doing': 12, 'my': 13, 'gurjot': 14, 'singh': 15, 'your': 16, 'a': 17, 'place': 18}
defaultdict(<class 'int'>, {'you': 2, 'are': 1, 'how': 1, 'kushank': 3, 'doing': 1, 'gurjot': 1, 'name': 2, 'my': 1, 'is': 2, 'singh': 1, 'meet': 2, 'to': 2, 'nice': 2, 'your': 1, 'what': 2, 'place': 1, 'a': 1})


In [9]:
t.index_word

{1: '<unk>',
 2: 'kushank',
 3: 'you',
 4: 'name',
 5: 'is',
 6: 'nice',
 7: 'to',
 8: 'meet',
 9: 'what',
 10: 'how',
 11: 'are',
 12: 'doing',
 13: 'my',
 14: 'gurjot',
 15: 'singh',
 16: 'your',
 17: 'a',
 18: 'place'}
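The index order above follows word frequency: the OOV token gets index 1, and the remaining words are ranked from most to least frequent, with ties kept in order of first appearance. A minimal sketch of that ranking, assuming a simplified word splitter (letters only) and a hypothetical helper name `build_word_index`:

```python
import re

docs = ['How are you doing Kushank?',
        'My name is Gurjot Singh Kushank',
        'Nice to meet you',
        'What is your name Kushank Kushank?',
        'what a nice place to meet!']


def build_word_index(documents, oov_token='<unk>'):
    """Rank words by descending count; index 0 is left unused, the OOV token gets 1."""
    counts = {}
    for doc in documents:
        # Simplified splitting: lowercase and keep runs of letters only.
        for word in re.findall(r"[a-z]+", doc.lower()):
            counts[word] = counts.get(word, 0) + 1
    # sorted() is stable, so words with equal counts keep their first-seen order.
    ranked = sorted(counts, key=lambda word: counts[word], reverse=True)
    return {word: i for i, word in enumerate([oov_token] + ranked, start=1)}


word_index = build_word_index(docs)
print(word_index)
```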

In [6]:
#bag-of-words (multi-hot) encoding of the corpus, a kind of one-hot encoding per document
#the matrix has 19 columns although there are 18 indexes, because column 0 is reserved (typically for padding) and never assigned to a word
print(t.texts_to_matrix(docs,mode='binary'))
#mode can be 'binary', 'count', 'freq' or 'tfidf' #'freq' is each word's count divided by the number of words in that document

[[0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1.]]
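To show how the `binary`, `count` and `freq` modes differ, here is a pure-Python sketch that builds one matrix row from a sequence of word indexes. It is an illustration rather than the Keras implementation; `sequences_to_matrix_sketch` is a hypothetical name, and the example sequence is the fourth document written as indexes from `t.word_index`.

```python
def sequences_to_matrix_sketch(sequences, num_columns, mode="binary"):
    """Turn lists of word indexes into fixed-width rows, one row per document."""
    matrix = []
    for seq in sequences:
        counts = {}
        for index in seq:
            counts[index] = counts.get(index, 0) + 1
        row = [0.0] * num_columns
        for index, count in counts.items():
            if mode == "binary":
                row[index] = 1.0               # word present or not
            elif mode == "count":
                row[index] = float(count)      # how many times the word occurs
            elif mode == "freq":
                row[index] = count / len(seq)  # share of the document's words
            else:
                raise ValueError(f"unsupported mode: {mode}")
        matrix.append(row)
    return matrix


# 'What is your name Kushank Kushank?' as word indexes.
doc4 = [9, 5, 16, 4, 2, 2]
for mode in ("binary", "count", "freq"):
    print(mode, sequences_to_matrix_sketch([doc4], num_columns=19, mode=mode)[0])
```

Only column 2 (`kushank`, which occurs twice) changes between the modes for this document.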


In [16]:
#texts_to_sequences returns the index of each word in the sentence; 'dirty' was not seen during fitting, so it maps to '<unk>' (index 1)
t.texts_to_sequences(['what a dirty place'])

[[9, 17, 1, 18]]
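The lookup behind this result can be sketched in a few lines: every known word is replaced by its index, and every unknown word falls back to the OOV index. The dictionary below is only the subset of `t.word_index` needed for this sentence, and `to_sequence` and `to_words` are hypothetical helper names.

```python
# Subset of the fitted word index shown earlier.
word_index = {'<unk>': 1, 'what': 9, 'a': 17, 'place': 18}
index_word = {index: word for word, index in word_index.items()}


def to_sequence(sentence, word_index, oov_token='<unk>'):
    """Map each lowercase word to its index, falling back to the OOV index."""
    oov_index = word_index[oov_token]
    return [word_index.get(word, oov_index) for word in sentence.lower().split()]


def to_words(sequence, index_word):
    """Map indexes back to words; OOV words cannot be recovered."""
    return [index_word[index] for index in sequence]


ids = to_sequence('what a dirty place', word_index)
print(ids)                        # [9, 17, 1, 18]
print(to_words(ids, index_word))  # ['what', 'a', '<unk>', 'place']
```

Decoding shows the cost of the OOV token: the original word `dirty` is lost once it has been mapped to index 1.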

## Sub Word Tokenizer 
 First train the subword tokenizer on the complete corpus, then tokenize the text, and then vectorize it.

In [1]:
from tokenizers import Tokenizer
from tokenizers.models import BPE, Unigram, WordLevel, WordPiece
from tokenizers.trainers import BpeTrainer, WordLevelTrainer, WordPieceTrainer, UnigramTrainer

## a pre-tokenizer that first segments the text into words, so a token is never longer than one word
from tokenizers.pre_tokenizers import Whitespace

In [12]:
unk_token = "<UNK>"  # token for unknown words
spl_tokens = ["<UNK>", "<SEP>", "<MASK>", "<CLS>"]  # special tokens

def prepare_tokenizer_trainer(alg):
    """
    Prepares the tokenizer and trainer with unknown & special tokens.
    ‘WLV’ - Word Level Algorithm
    ‘WPC’ - WordPiece Algorithm
    ‘BPE’ - Byte Pair Encoding
    ‘UNI’ - Unigram
    Two objects are created: a tokenizer and a trainer. Subword tokenization works like fitting a model:
    the tokenizer is trained first, then it identifies the tokens of a given text and maps them to ids.
    """
    if alg == 'BPE':
        tokenizer = Tokenizer(BPE(unk_token = unk_token))
        trainer = BpeTrainer(special_tokens = spl_tokens, vocab_size = 40000) # vocab_size sets the maximum vocabulary size and can be changed
    elif alg == 'UNI':
        tokenizer = Tokenizer(Unigram())
        trainer = UnigramTrainer(unk_token= unk_token, special_tokens = spl_tokens, vocab_size = 40000)
    elif alg == 'WPC':
        tokenizer = Tokenizer(WordPiece(unk_token = unk_token))
        trainer = WordPieceTrainer(special_tokens = spl_tokens, vocab_size = 40000)
    else:
        tokenizer = Tokenizer(WordLevel(unk_token = unk_token))
        trainer = WordLevelTrainer(special_tokens = spl_tokens, vocab_size = 40000)
    
    tokenizer.pre_tokenizer = Whitespace()
    return tokenizer, trainer
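For intuition about what the `'BPE'` option learns, here is a toy pure-Python sketch of byte-pair merging: start from characters and repeatedly merge the most frequent adjacent pair. It is a simplified illustration and not the `tokenizers` implementation; the function names and the tiny word list are made up for the example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merges from a list of words."""
    vocab = Counter(tuple(word) for word in words)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_merges(word, merges):
    """Tokenize a new word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
print(apply_merges("lowly", merges))
```

After two merges the frequent stem `low` has become a single token, while the unseen ending of `lowly` stays split into characters. This is the same effect seen in the BPE output of the notebook, where rare words are broken into several pieces.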

In [13]:
def train_tokenizer(files, alg='WLV'):
    """
    Takes a list of text file paths, trains the tokenizer, saves it as JSON and reloads it.
    """
    tokenizer, trainer = prepare_tokenizer_trainer(alg)
    tokenizer.train(files, trainer) # training the tokenizer
    tokenizer.save(f"./tokenizer-trained-{alg}.json")
    tokenizer = Tokenizer.from_file(f"./tokenizer-trained-{alg}.json")
    return tokenizer

def tokenize(input_string, tokenizer):
    """
    Tokenizes the input string using the tokenizer provided.
    """
    output = tokenizer.encode(input_string)
    return output

In [33]:
#Note: train() expects a list of text file paths, so passing text from a variable gives errors with this code (the library's train_from_iterator method accepts in-memory text instead)
small_file = ['Tokenizer_data.txt']
#'Tokenizer_data.txt'
tokens_dict = {}

for files in [small_file]:
    print(f"========Using vocabulary from {files}=======")
    for alg in ['WLV', 'BPE', 'UNI', 'WPC']:
        trained_tokenizer = train_tokenizer(files, alg)
        input_string = "This is a deep learning tokenization tutorial. Tokenization is the first step in a deep learning NLP pipeline. We will be comparing the tokens generated by each tokenization model. Excited much?!😍"
        output = tokenize(input_string, trained_tokenizer)
        tokens_dict[alg] = output.tokens
        print("----", alg, "----")
        print(output.tokens, "->", len(output.tokens))
#ids are the numeric representation of the text (the index of each token in the vocabulary); this prints the ids from the last algorithm in the loop (WPC)
print(output.ids)
# .type_ids

---- WLV ----
['This', 'is', 'a', 'deep', 'learning', '<UNK>', '<UNK>', '.', '<UNK>', 'is', 'the', 'first', 'step', 'in', 'a', 'deep', 'learning', '<UNK>', '<UNK>', '.', 'We', 'will', 'be', 'comparing', 'the', '<UNK>', 'generated', 'by', 'each', '<UNK>', 'model', '.', '<UNK>', 'much', '<UNK>'] -> 35
---- BPE ----
['This', 'is', 'a', 'deep', 'learning', 'to', 'ken', 'ization', 't', 'ut', 'or', 'ial', '.', 'T', 'ok', 'en', 'ization', 'is', 'the', 'first', 'step', 'in', 'a', 'deep', 'learning', 'N', 'L', 'P', 'pi', 'pe', 'line', '.', 'We', 'will', 'be', 'comparing', 'the', 'to', 'k', 'ens', 'generated', 'by', 'each', 'to', 'ken', 'ization', 'model', '.', 'Ex', 'c', 'ited', 'much', '?', '!', '<UNK>'] -> 55
---- UNI ----
['Thi', 's', 'is', 'a', 'deep', 'learn', 'ing', 'to', 'ken', 'iz', 'ation', 't', 'u', 'to', 'rial', '.', 'To', 'ken', 'iz', 'ation', 'is', 'the', 'fir', 's', 't', 'step', 'in', 'a', 'deep', 'learn', 'ing', 'N', 'L', 'P', 'pi', 'pe', 'line', '.', 'We', 'will', 'be', 'compar'

In [38]:
#Note to self: resume "Text Representation in NLP" from the 57-minute mark
output.token_to_sequence

<function Encoding.token_to_sequence(self, token_index)>