------------------
#### Gensim
- In Gensim, tokenization is usually a preprocessing step for `topic modeling`, `document similarity analysis`, and other natural language processing tasks.
- Gensim provides several tokenization utilities, from the one-call `simple_preprocess` helper used below to the configurable filter pipeline in `gensim.parsing.preprocessing`.
-----------------

In [16]:
from gensim.utils import simple_preprocess

In [17]:
# Example text
text = "Gensim provides robust tokenization functionality for NLP tasks."

In [18]:
# Tokenize the text
tokens = simple_preprocess(text)

In [19]:
# Print tokens
print(tokens)

['gensim', 'provides', 'robust', 'tokenization', 'functionality', 'for', 'nlp', 'tasks']


### Parameters of `simple_preprocess`:

- **doc**: str
  - The input document to tokenize. To process several documents, call the function once per document.
  
- **deacc**: bool, optional (default=False)
  - Whether to remove accent marks from tokens. Setting this to True strips accents with Gensim's `deaccent` utility (for example, `café` becomes `cafe`).
  
- **min_len**: int, optional (default=2)
  - Minimum length of tokens to be included in the output. Tokens shorter than this length will be excluded from the output.
  
- **max_len**: int, optional (default=15)
  - Maximum length of tokens to be included in the output. Tokens longer than this length will be excluded from the output.
  
`simple_preprocess` takes no other parameters. Lowercasing is always applied, and tokens that start with an underscore are dropped. Document-frequency filtering (`no_below`, `no_above`, `keep_n`) is not part of tokenization; it is applied later to a `gensim.corpora.Dictionary` through `filter_extremes`. For a custom chain of filter functions, see `preprocess_string` in `gensim.parsing.preprocessing`.

### Returns:
- list of str
  - The tokens extracted from the input document, lowercased and filtered by length.
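
To make the parameters concrete, here is a rough pure-Python sketch of the behaviour described above. It is not Gensim's code: `approx_simple_preprocess`, `strip_accents`, and the `ALPHABETIC` pattern are hypothetical names, and the sketch assumes tokens are runs of non-digit word characters and that `deacc` is off by default.

```python
import re
import unicodedata

# Assumption: a token is a run of word characters that are not digits.
ALPHABETIC = re.compile(r"(?:(?!\d)\w)+")


def strip_accents(text):
    """Drop combining accent marks, e.g. 'café' -> 'cafe'."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


def approx_simple_preprocess(doc, deacc=False, min_len=2, max_len=15):
    """Lowercase, optionally de-accent, tokenize, then filter by length."""
    text = doc.lower()
    if deacc:
        text = strip_accents(text)
    return [
        tok for tok in ALPHABETIC.findall(text)
        if min_len <= len(tok) <= max_len and not tok.startswith("_")
    ]


sample = "I can't believe it!"
default_tokens = approx_simple_preprocess(sample)            # one-letter tokens dropped
short_tokens = approx_simple_preprocess(sample, min_len=1)   # one-letter tokens kept

accented = "Caf\u00e9 d\u00e9j\u00e0 vu"
kept_accents = approx_simple_preprocess(accented)
no_accents = approx_simple_preprocess(accented, deacc=True)

# 'supercalifragilistic' has 20 characters, more than the default max_len of 15.
long_word = approx_simple_preprocess("a supercalifragilistic demo")

print(default_tokens, short_tokens, kept_accents, no_accents, long_word, sep="\n")
```

Lowering `min_len` to 1 shows where the missing pieces of a contraction go: the apostrophe acts as a separator, and the leftover single letters are normally filtered out.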


In [20]:
# Contraction: "He's"
# OK here: the apostrophe splits it into "he" and "s", and the one-letter "s" is dropped
text = "He's going to the park tomorrow."

# Tokenize the text
tokens = simple_preprocess(text)

tokens

['he', 'going', 'to', 'the', 'park', 'tomorrow']

In [21]:
# Contraction: "can't"
# Not good: the split leaves "can" and drops "t", so the negation is lost
text = "I can't believe it!"

# Tokenize the text
tokens = simple_preprocess(text)

tokens

['can', 'believe', 'it']
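
One possible workaround, sketched here under assumptions, is to expand contractions before tokenizing so the negation survives. The `CONTRACTIONS` mapping and the `expand_contractions` helper are hypothetical and deliberately tiny; a real project would need a much fuller list.

```python
import re

# Hypothetical, deliberately small mapping used only for illustration.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "i'm": "i am",
}

# Match any known contraction as a whole word, ignoring case.
CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Replace known contractions with their lowercase expanded form."""
    text = text.replace("\u2019", "'")  # treat curly apostrophes like straight ones
    return CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


expanded = expand_contractions("I can't believe it!")
print(expanded)
```

Because the expanded text contains `cannot` with no apostrophe, passing it to `simple_preprocess` should keep that word as a single token.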

In [22]:
# Text containing emoji
# Not good: the emoji are discarded, and "I'm" disappears because "i" and "m" are each shorter than min_len
text = "I'm feeling 😊😊😊😊 today!"

# Tokenize the text
tokens = simple_preprocess(text)

tokens

['feeling', 'today']

In [23]:
# Abbreviations and possessives
# Not good: "St." loses its period and "John's" loses its possessive
text = "He graduated from St. John's University."

# Tokenize the text
tokens = simple_preprocess(text)

tokens

['he', 'graduated', 'from', 'st', 'john', 'university']

In [24]:
# Hyphenated words
# Not good: "high-speed" is split into "high" and "speed"
text = "The quick brown fox jumped over the high-speed fence."

# Tokenize the text
tokens = simple_preprocess(text)

tokens

['the',
 'quick',
 'brown',
 'fox',
 'jumped',
 'over',
 'the',
 'high',
 'speed',
 'fence']

#### Example: comparing tricky inputs

In [25]:
compare_list = ['https://t.co/9z2J3P33Uc',
                'laugh/cry',
                '😬😭😓🤢🙄😱',
                "world's problems",
                "@datageneral",
                "It's interesting",
                "don't spell my name right",
                'all-nighter',
                "My favorite color is blue",
                "My favorite colors are blue, red, and green."]

In [26]:
for sent in compare_list:
    # Tokenize each string with the default settings
    tokens = simple_preprocess(sent)

    # Print tokens
    print(tokens)

['https', 'co', 'uc']
['laugh', 'cry']
[]
['world', 'problems']
['datageneral']
['it', 'interesting']
['don', 'spell', 'my', 'name', 'right']
['all', 'nighter']
['my', 'favorite', 'color', 'is', 'blue']
['my', 'favorite', 'colors', 'are', 'blue', 'red', 'and', 'green']
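
The output above loses URLs, @mentions, emoji, apostrophes, and hyphens. If those matter for the task, one option is a small regex tokenizer tailored to social-media text. The sketch below is an illustration, not part of Gensim: `TOKEN_RE` and `social_tokenize` are hypothetical names, and the emoji range is a rough code-point approximation rather than a complete definition.

```python
import re

TOKEN_RE = re.compile(
    r"https?://\S+"                          # a URL kept as one token
    r"|@\w+"                                 # an @mention kept with its @
    r"|[^\W\d_]+(?:['\u2019-][^\W\d_]+)*"    # words, allowing inner ' or -
    r"|[\U0001F300-\U0001FAFF]"              # rough emoji range (assumption)
)


def social_tokenize(text):
    """Return tokens, lowercasing everything except URLs."""
    tokens = TOKEN_RE.findall(text)
    return [tok if tok.startswith("http") else tok.lower() for tok in tokens]


examples = [
    "https://t.co/9z2J3P33Uc",
    "laugh/cry",
    "world's problems",
    "@datageneral",
    "don't spell my name right",
    "all-nighter",
]
for sent in examples:
    print(social_tokenize(sent))
```

The trade-off is that tokens such as `don't` and `all-nighter` now stay intact, so any later vocabulary or stop-word step has to expect them in that form. A URL followed directly by punctuation would also keep that punctuation, which this sketch does not handle.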
