# Syntax
Below is the syntax for the gensim.utils.tokenize() function:

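```python
gensim.utils.tokenize(text, lowercase=False, deacc=False, encoding='utf8', errors='strict', to_lower=False, lower=False)
```

`text` is the input text to be tokenized. It can be a string or bytes.

`lowercase` is an optional parameter that specifies whether to convert the text to lowercase before tokenization. The default value is False.

`deacc` is an optional parameter specifying whether to remove accent marks from the text (for example, "é" becomes "e"). The default value is False.

`encoding` is an optional parameter that gives the encoding used to decode `text` when it is passed as bytes. The default value is 'utf8'.

`errors` is an optional parameter that specifies how to handle decoding errors in the text. The default value is 'strict'.

`to_lower` and `lower` are optional convenience aliases for `lowercase`. Setting any of the three to True lowercases the text.

The function returns a generator of tokens rather than a list. Tokens are contiguous runs of alphabetic characters, so digits and punctuation are dropped.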
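To make the effect of `lowercase` and `deacc` concrete without installing gensim, here is a minimal standard-library sketch. It is an approximation, not gensim's code: the names `sketch_tokenize`, `strip_accents`, and `ALPHABETIC` are hypothetical, and the regular expression is an assumption based on the documented behaviour (alphabetic runs, no digits).

```python
import re
import unicodedata

# Assumed pattern: runs of word characters that are not digits.
ALPHABETIC = re.compile(r"(?:(?!\d)\w)+")


def strip_accents(text):
    """Remove combining accent marks, e.g. 'é' -> 'e'."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


def sketch_tokenize(text, lowercase=False, deacc=False):
    """Yield alphabetic tokens lazily, like a generator-based tokenizer."""
    if lowercase:
        text = text.lower()
    if deacc:
        text = strip_accents(text)
    for match in ALPHABETIC.finditer(text):
        yield match.group()


print(list(sketch_tokenize("Hello World 42")))
# ['Hello', 'World']
print(list(sketch_tokenize("Café Déjà vu 2024!", lowercase=True, deacc=True)))
# ['cafe', 'deja', 'vu']
```

With both flags left at False, case and accents are preserved. In both calls the digits and the exclamation mark disappear because they never match the pattern.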

## Corner cases
- joint → _joint
- jointed → _joint, ed
- disjointed → _di, s, jo, int, ed
- unisex → _un, ise, x
- true → _true
- untrue → _un, tr, ue
- estimate → _estimate
- overestimate → _over, est, imate

In [3]:
from gensim.utils import tokenize

text = """ Welcome to Educative Answers.
        joint
        jointed 
        disjointed 
        unisex 
        true 
        untrue 
        estimate 
        overestimate 
        """

# tokenize() returns a generator, so wrap it in list() to materialize the tokens
tokens = list(tokenize(text, lowercase=True))

print(tokens)

['welcome', 'to', 'educative', 'answers', 'joint', 'jointed', 'disjointed', 'unisex', 'true', 'untrue', 'estimate', 'overestimate']


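### Difference from NLTK
- NLTK's `word_tokenize` does not remove dots: the period after "Answers" is kept as a separate token, whereas gensim's `tokenize` drops it.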

In [4]:
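# Import necessary libraries
# (word_tokenize requires the NLTK Punkt tokenizer data to be downloaded beforehand)
from nltk.tokenize import word_tokenize

# Tokenize documents: lowercase each document, then split it into tokens
text_list = [text]
tokens_nltk = [word_tokenize(doc.lower()) for doc in text_list]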

print(tokens_nltk)

[['welcome', 'to', 'educative', 'answers', '.', 'joint', 'jointed', 'disjointed', 'unisex', 'true', 'untrue', 'estimate', 'overestimate']]
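A quick way to confirm where the two tokenizers disagree is to diff their outputs. The sketch below copies the two printed results into literal lists (the variable names are only illustrative) so it runs without gensim or NLTK installed.

```python
# Outputs copied from the two cells above.
gensim_tokens = ['welcome', 'to', 'educative', 'answers', 'joint', 'jointed',
                 'disjointed', 'unisex', 'true', 'untrue', 'estimate',
                 'overestimate']
nltk_tokens = ['welcome', 'to', 'educative', 'answers', '.', 'joint',
               'jointed', 'disjointed', 'unisex', 'true', 'untrue',
               'estimate', 'overestimate']

# Tokens that appear only in the NLTK output.
seen = set(gensim_tokens)
only_in_nltk = [tok for tok in nltk_tokens if tok not in seen]

print(only_in_nltk)
# ['.']
```

For this sample text the period is the only difference, and the word tokens are identical.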


## Function used

In [8]:
from nltk.tokenize import RegexpTokenizer

# r'\w+' matches runs of letters, digits, and underscores, so punctuation is dropped
tokenizer = RegexpTokenizer(r'\w+')
tokens_custom_fn = tokenizer.tokenize(text.lower())

print(tokens_custom_fn)

['welcome', 'to', 'educative', 'answers', 'joint', 'jointed', 'disjointed', 'unisex', 'true', 'untrue', 'estimate', 'overestimate']
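The `\w+` tokenizer matches gensim's output on this sample, but the two are not interchangeable on text that contains digits. The following sketch uses only the `re` module to illustrate the difference. It assumes that `re.findall(r'\w+', ...)` behaves like `RegexpTokenizer(r'\w+')` for this pattern, and that gensim's tokens are alphabetic runs without digits; the sample sentence is made up for illustration.

```python
import re

sample = "Python3 was released in 2008."

# Same pattern as the RegexpTokenizer above: digits count as word characters.
word_tokens = re.findall(r"\w+", sample.lower())

# Assumed stand-in for gensim-style tokens: word characters excluding digits.
alphabetic_tokens = re.findall(r"(?:(?!\d)\w)+", sample.lower())

print(word_tokens)
# ['python3', 'was', 'released', 'in', '2008']
print(alphabetic_tokens)
# ['python', 'was', 'released', 'in']
```

If numbers matter for the downstream task (years, version numbers, quantities), the `\w+` pattern keeps them, while an alphabetic-only tokenizer discards them.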
