Text tokenization is the process of splitting a piece of text into smaller units called “tokens.”

It transforms unstructured text into structured data that models can understand.

The goal of tokenization is to break text down into meaningful units, such as words, phrases, or sentences, which can then be fed into machine learning models.

Tokenization enables natural language processing tasks like part-of-speech tagging (identifying verbs vs nouns, etc.), named entity recognition (categories like person, organization, location), and relationship extraction (family relationships, professional relationships, etc.).
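
To make the idea of splitting text into tokens concrete, here is a minimal sketch using only Python's standard library. It is not the algorithm NLTK uses; `simple_word_tokenize` is a hypothetical helper that assumes a token is either a run of word characters or a single punctuation mark. Real tokenizers also handle cases such as contractions and abbreviations, which this sketch does not.

```python
import re

def simple_word_tokenize(text):
    # Hypothetical helper: runs of word characters become one token,
    # and every other non-space character (punctuation) becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

text = "Today is a friday and its so sunny outside!"
tokens = simple_word_tokenize(text)
print(tokens)
```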

In [2]:
import nltk

In [3]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [4]:
text = "Today is a friday and its so sunny outside!"

In [5]:
# tokenizing the sentence above

tokenized = nltk.tokenize.word_tokenize(text)

In [9]:
print(tokenized)

['Today', 'is', 'a', 'friday', 'and', 'its', 'so', 'sunny', 'outside', '!']


In [6]:
from sklearn.feature_extraction.text import CountVectorizer

In [7]:
vectorizer = CountVectorizer()

In [8]:
vectorized = vectorizer.fit_transform(tokenized).toarray()

In [10]:
print(vectorized)

[[0 0 0 0 0 0 0 1]
 [0 0 1 0 0 0 0 0]
 [0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0]
 [0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 1 0]
 [0 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 0]]

Each row above corresponds to one token rather than one sentence, because CountVectorizer treats every element of the input list as a separate document. The eight columns are the vocabulary in alphabetical order (and, friday, is, its, outside, so, sunny, today). The rows for 'a' and '!' are all zeros because, with default settings, CountVectorizer lowercases the text and keeps only tokens of two or more alphanumeric characters.


In [11]:
# using sklearn

from sklearn.feature_extraction.text import TfidfVectorizer

In [12]:
vectorizer = TfidfVectorizer()

In [13]:
sparse_matrix = vectorizer.fit_transform(tokenized)

In [14]:
print(sparse_matrix)

  (0, 7)	1.0
  (1, 2)	1.0
  (3, 1)	1.0
  (4, 0)	1.0
  (5, 3)	1.0
  (6, 5)	1.0
  (7, 6)	1.0
  (8, 4)	1.0


CountVectorizer and TfidfVectorizer are both classes in scikit-learn's feature extraction module. Both convert a collection of text documents into a numeric feature matrix, but they differ in what the matrix entries hold.

CountVectorizer:

It counts the occurrences of each word in each document.
The output is a sparse matrix where each row corresponds to a document, and each column corresponds to a unique word in the entire corpus.
The matrix elements contain the count of each word in the respective document.
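
The description above can be illustrated by passing whole sentences, rather than individual tokens, as the documents. This is a small sketch under assumptions: the second sentence in `docs` is made up for illustration, and the vectorizer runs with its default settings (lowercasing, tokens of at least two characters).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Each list element is one document; the second sentence is an
# illustrative example and not part of the original notebook.
docs = [
    "Today is a friday and its so sunny outside!",
    "Yesterday was so rainy and so cold outside.",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

# vocabulary_ maps each word to its column index; sort words by that index.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(vocab)
print(counts.toarray())
```

Here each row is a sentence and each column a word from the combined vocabulary, so the entry for "so" in the second row is 2.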

TfidfVectorizer:

It computes the Term Frequency-Inverse Document Frequency (TF-IDF) values for each word in the document.
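TF-IDF weighs how often a word appears in a document against how many documents in the corpus contain it, so words that are common across the whole corpus receive lower weights than words that are distinctive to a few documents.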
The output is a sparse matrix where each row corresponds to a document, and each column corresponds to a unique word in the entire corpus.
The matrix elements contain the TF-IDF values of each word in the respective document.
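
To show how a TF-IDF value is computed, here is a pure-Python sketch. It assumes the smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row, which are the defaults scikit-learn documents for TfidfVectorizer. The tiny pre-tokenized corpus and the helper names `idf` and `tfidf_row` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical corpus: each inner list is one already-tokenized document.
docs = [
    ["today", "is", "sunny"],
    ["today", "is", "rainy"],
    ["sunny", "sunny", "outside"],
]

vocab = sorted({term for doc in docs for term in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency (assumed scikit-learn default).
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_row(doc):
    tf = Counter(doc)  # raw term counts within this document
    raw = [tf[term] * idf(term) for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]  # L2-normalize the row

rows = [tfidf_row(doc) for doc in docs]
for doc, row in zip(docs, rows):
    print(doc, [round(value, 3) for value in row])
```

In the second document, "rainy" appears in only one document and therefore gets a higher weight than "today", which appears in two. The same normalization explains the notebook output earlier: when every document is a single token, its row has one non-zero entry, which L2 normalization scales to exactly 1.0.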