---
<strong>
    <h1 align='center'><strong>Gensim</strong></h1>
</strong>

---

## **Difference between `nltk.word_tokenize` and `gensim.utils.simple_preprocess`**


In [13]:
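import nltk
from nltk.tokenize import word_tokenize

# Tokenizer models needed by word_tokenize (newer NLTK releases may require 'punkt_tab' instead)
nltk.download('punkt', quiet=True)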

text = "How's it going, folks?"
tokens = word_tokenize(text)
print(tokens)

['How', "'s", 'it', 'going', ',', 'folks', '?']


In [14]:
from gensim.utils import simple_preprocess

text = "How's it going, folks?"
tokens = simple_preprocess(text)
print(tokens)

['how', 'it', 'going', 'folks']


### **When to use which?**

- Use `nltk.word_tokenize` when you need fine-grained, **linguistically sophisticated** tokenization. It preserves case, keeps punctuation as separate tokens, and splits contractions (`How's` becomes `How` and `'s`).

- Use `gensim.utils.simple_preprocess` when you prefer a simple and efficient approach. It lowercases the text, keeps only alphabetic tokens (punctuation and digits are dropped), and discards tokens shorter than `min_len` (default 2) or longer than `max_len` (default 15). This is why the `s` left over from `How's` is missing from the output above. The choice between the two depends on the specific requirements of your NLP task.
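
To make the second bullet concrete, here is a rough standard-library approximation of what `simple_preprocess` does with its default settings. The name `simple_preprocess_sketch` is hypothetical and the `deacc` option is left out; treat it as an illustration of the behavior, not as Gensim's actual implementation.

```python
import re

# Runs of word characters that are not digits (an approximation of
# the alphabetic-token pattern used by gensim's tokenizer).
ALPHABETIC = re.compile(r"(?:(?!\d)\w)+", re.UNICODE)


def simple_preprocess_sketch(text, min_len=2, max_len=15):
    """Hypothetical stand-in for gensim.utils.simple_preprocess (no deaccenting)."""
    # Lowercase first, then pull out alphabetic runs; punctuation and digits
    # act as separators and never appear in a token.
    tokens = (match.group() for match in ALPHABETIC.finditer(text.lower()))
    # Keep tokens within the length bounds and skip ones starting with "_".
    return [
        token for token in tokens
        if min_len <= len(token) <= max_len and not token.startswith("_")
    ]


print(simple_preprocess_sketch("How's it going, folks?"))
# ['how', 'it', 'going', 'folks']
```

The apostrophe splits `how's` into `how` and `s`, and the one-letter `s` is then dropped by the `min_len` filter.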

In [17]:
import gensim
import string
import nltk
from nltk.corpus import stopwords
from pprint import pprint

nltk.download('stopwords', quiet=True)

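# Build the stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    # Lowercase the text (simple_preprocess also lowercases, so this is only explicit)
    text = text.lower()
    # Delete punctuation; contractions are fused, e.g. "how's" becomes "hows"
    text = ''.join(char for char in text if char not in string.punctuation)
    # Tokenize; digits and tokens outside 2-15 characters are dropped
    tokens = gensim.utils.simple_preprocess(text, deacc=False, min_len=2, max_len=15)
    # Remove stop words
    tokens = [word for word in tokens if word not in STOP_WORDS]
    return tokens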


corpus = [
    "This is an EXAMPLE sentence.",
    "Another sentence with SOME punctuation!",
    "Remove STOP words from this text. And it has extra spaces   between  words!",
    "How's it going, folks? I can't believe it's already September 2023!",
    "This text contains numbers like 123 and symbols #@$%! Testing...",
    "NLP is fantastic! Isn't it?",
    "Let's test our function on this corpus."
]

# Preprocess the corpus
preprocessed_corpus = [preprocess_text(doc) for doc in corpus]

# Print the preprocessed tokens for every document
pprint(preprocessed_corpus)

[['example', 'sentence'],
 ['another', 'sentence', 'punctuation'],
 ['remove', 'stop', 'words', 'text', 'extra', 'spaces', 'words'],
 ['hows', 'going', 'folks', 'cant', 'believe', 'already', 'september'],
 ['text', 'contains', 'numbers', 'like', 'symbols', 'testing'],
 ['nlp', 'fantastic', 'isnt'],
 ['lets', 'test', 'function', 'corpus']]
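
Tokens such as `hows`, `cant`, and `isnt` in the output above appear because punctuation is deleted before tokenization, which fuses each contraction into a single word. The small sketch below uses only the standard library and plain whitespace splitting (an assumption made for illustration, not the Gensim tokenizer) to contrast deleting punctuation with replacing it by spaces.

```python
import string

text = "How's it going, folks?"

# Deleting punctuation glues the two halves of a contraction together.
delete_table = str.maketrans("", "", string.punctuation)
deleted_tokens = text.translate(delete_table).lower().split()

# Replacing punctuation with spaces keeps the halves as separate tokens.
space_table = str.maketrans(string.punctuation, " " * len(string.punctuation))
spaced_tokens = text.translate(space_table).lower().split()

print(deleted_tokens)  # ['hows', 'it', 'going', 'folks']
print(spaced_tokens)   # ['how', 's', 'it', 'going', 'folks']
```

Which behavior is preferable depends on the task: fused forms like `hows` escape the stop-word filter, while the split form leaves a stray `s` that a minimum-length filter would remove.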
