# **Text Processing and Feature Extraction in NLP**


This guide walks through four common techniques for turning raw text into features for NLP models: tokenization, Bag of Words vectorization, one-hot encoding, and Word2Vec word embeddings.

**Steps to be followed:**
1. Tokenization
2. Bag of Words (BoW) Vectorization
3. One-Hot Encoding
4. Word Embeddings (Word2Vec)

## __Step 1: Tokenization__


What is Tokenization?

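Tokenization is the process of breaking text into smaller units (tokens), such as words, subwords, or punctuation marks.
It is usually the first preprocessing step, because later stages (vectorization, embedding, model training) operate on tokens rather than on raw strings.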
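
To make the idea concrete, here is a minimal sketch of a word-level tokenizer built on a regular expression. The helper name `simple_tokenize` and the pattern are assumptions made for illustration; this is not how NLTK works internally, and `word_tokenize` handles many cases (contractions, abbreviations, and so on) that this sketch ignores.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    Hypothetical helper for illustration only; not NLTK's algorithm.
    """
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("AI is transforming the world.")
print(tokens)
```

Note that the final period becomes a token of its own, just as it does in the NLTK output in this section.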

In [1]:
text_data = [
    "AI is transforming the world.",
    "Natural language processing is a subset of AI.",
    "Deep learning and machine learning are popular AI techniques.",
    "AI applications are diverse and growing rapidly."
]


In [2]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # tokenizer models; newer NLTK releases may ask for 'punkt_tab' instead

# Tokenize the text data
tokenized_data = [word_tokenize(sentence) for sentence in text_data]
tokenized_data


[nltk_data] Downloading package punkt to /voc/work/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


[['AI', 'is', 'transforming', 'the', 'world', '.'],
 ['Natural', 'language', 'processing', 'is', 'a', 'subset', 'of', 'AI', '.'],
 ['Deep',
  'learning',
  'and',
  'machine',
  'learning',
  'are',
  'popular',
  'AI',
  'techniques',
  '.'],
 ['AI', 'applications', 'are', 'diverse', 'and', 'growing', 'rapidly', '.']]

## __Step 2: Bag of Words (BoW) Vectorization__

What is BoW?

Bag of Words (BoW) represents each document as a vector of word counts over a shared vocabulary.
Word order is discarded: only how often each word occurs in the document is kept.
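
The counting step can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: the two example documents are already lowercased and free of punctuation, splitting is done on whitespace, and the helper `bow_vector` is a hypothetical name, not part of scikit-learn.

```python
from collections import Counter

docs = [
    "ai is transforming the world",
    "deep learning and machine learning are popular ai techniques",
]

# Vocabulary: every distinct word, sorted so the column order is stable
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc, vocab):
    """Return the count of each vocabulary word in one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bow_vector(doc, vocab) for doc in docs]

for word, count in zip(vocab, vectors[1]):
    print(f"{word}: {count}")
```

Each document becomes one row, each vocabulary word one column, and a word that appears twice in a document (here "learning") gets a count of 2.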

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the text data
vectorized_data = vectorizer.fit_transform(text_data)

# Get the feature names (tokens)
feature_names = vectorizer.get_feature_names_out()

# Convert to array for better visualization
vector_array = vectorized_data.toarray()

feature_names, vector_array


(array(['ai', 'and', 'applications', 'are', 'deep', 'diverse', 'growing',
        'is', 'language', 'learning', 'machine', 'natural', 'of',
        'popular', 'processing', 'rapidly', 'subset', 'techniques', 'the',
        'transforming', 'world'], dtype=object),
 array([[1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
        [1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0],
        [1, 1, 0, 1, 1, 0, 0, 0, 0, 2, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
        [1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]]))
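
The feature list above has 21 entries and contains neither `a` nor `.`, even though both appear in the tokenized output of Step 1. By default, `CountVectorizer` lowercases the text and keeps only tokens made of two or more word characters. The sketch below imitates that behaviour with the standard library. The regular expression is the default `token_pattern` of `CountVectorizer`; the helper name `default_like_tokens` is hypothetical, and the code is an approximation rather than scikit-learn's own implementation.

```python
import re

# Default token_pattern of CountVectorizer: two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_like_tokens(text):
    """Approximate CountVectorizer's default lowercasing and tokenization."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = default_like_tokens("Natural language processing is a subset of AI.")
print(tokens)
```

The single-character word "a" and the period are dropped, which is why they have no column in the matrix.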

## __Step 3: One-Hot Encoding__

What is One-Hot Encoding?

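One-hot encoding represents each word as a binary vector: every word in the vocabulary gets a unique index, and its vector is 1 at that index and 0 everywhere else.
The cell below then sums the word vectors of each sentence, so the final per-sentence vectors hold token counts (for example, 2 for "learning" in the third sentence) rather than strictly 0s and 1s.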
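
As a minimal sketch of the idea, the following plain-Python version encodes the words of one sentence. The five-word vocabulary and the helper `one_hot` are assumptions chosen for illustration; they are not taken from scikit-learn.

```python
# Tiny illustrative vocabulary, sorted so each word has a stable index
vocab = sorted({"ai", "is", "transforming", "the", "world"})
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector that is 1 at the word's index and 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

sentence = ["ai", "is", "transforming", "the", "world"]
word_vectors = [one_hot(word) for word in sentence]

# Summing the word vectors column-wise collapses the sentence into one vector of counts
sentence_vector = [sum(column) for column in zip(*word_vectors)]

print(one_hot("the"))
print(sentence_vector)
```

Every word vector contains exactly one 1; only the summed sentence vector can contain larger values.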

In [4]:
import nltk
from nltk.tokenize import word_tokenize
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Ensure NLTK punkt is downloaded
nltk.download('punkt')

# Sample Text Dataset
text_data = [
    "AI is transforming the world.",
    "Natural language processing is a subset of AI.",
    "Deep learning and machine learning are popular AI techniques.",
    "AI applications are diverse and growing rapidly."
]

# Tokenize the text data
tokenized_data = [word_tokenize(sentence.lower()) for sentence in text_data]

# Flatten the list of tokenized words
flattened_tokens = [word for sentence in tokenized_data for word in sentence]

# Create a sorted vocabulary of unique words
# (sorted so the list lines up with the encoder's columns, which are in sorted order)
vocabulary = sorted(set(flattened_tokens))

# Fit the OneHotEncoder on the vocabulary (one category per word)
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoder.fit(np.array(vocabulary).reshape(-1, 1))

# Encode each sentence
one_hot_encoded_data = []
for sentence in tokenized_data:
    # One row per token, each row a one-hot vector
    encoded_sentence = encoder.transform(np.array(sentence).reshape(-1, 1))
    # Sum the rows to get a single per-sentence vector of token counts
    one_hot_encoded_data.append(np.sum(encoded_sentence, axis=0))

# Output the vocabulary and the per-sentence vectors
vocabulary, np.array(one_hot_encoded_data)


[nltk_data] Downloading package punkt to /voc/work/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


(['.',
  'a',
  'ai',
  'and',
  'applications',
  'are',
  'deep',
  'diverse',
  'growing',
  'is',
  'language',
  'learning',
  'machine',
  'natural',
  'of',
  'popular',
  'processing',
  'rapidly',
  'subset',
  'techniques',
  'the',
  'transforming',
  'world'],
 array([[1., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 1., 1., 1.],
        [1., 1., 1., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 1., 1., 0.,
         1., 0., 1., 0., 0., 0., 0.],
        [1., 0., 1., 1., 0., 1., 1., 0., 0., 0., 0., 2., 1., 0., 0., 1.,
         0., 0., 0., 1., 0., 0., 0.],
        [1., 0., 1., 1., 1., 1., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0.,
         0., 1., 0., 0., 0., 0., 0.]]))

In [5]:
b = np.array(one_hot_encoded_data)

In [6]:
b[0]

array([1., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 1., 1., 1.])
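
The row `b[0]` is easier to read once each column is paired with its token. `OneHotEncoder` sorts its categories, so the columns follow the sorted token order; in the notebook that order is available as `encoder.categories_[0]`. The sketch below rebuilds it in plain Python, with the token list and the row values copied by hand from the cells above.

```python
# Sorted tokens: the column order OneHotEncoder uses for these string categories
columns = sorted([
    ".", "a", "ai", "and", "applications", "are", "deep", "diverse",
    "growing", "is", "language", "learning", "machine", "natural", "of",
    "popular", "processing", "rapidly", "subset", "techniques", "the",
    "transforming", "world",
])

# Values of b[0], copied from the output above
row = [1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

# Keep the tokens whose count is non-zero
present = [token for token, count in zip(columns, row) if count > 0]
print(present)
```

The recovered tokens are exactly those of the first sentence, "AI is transforming the world.", including the period.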

## __Step 4: Word Embeddings (Word2Vec)__

What is Word2Vec?

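Word2Vec learns a dense, fixed-length vector for each word from the contexts in which it appears.
Words that occur in similar contexts end up with similar vectors.
Note that the four-sentence corpus used here is far too small to learn meaningful embeddings; the cell below only demonstrates the API.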
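
"Similar vectors" is usually measured with cosine similarity. The sketch below computes it with the standard library on three made-up 3-dimensional vectors; the numbers are hypothetical and were not produced by any trained model. With a trained gensim model, `model.wv.similarity(word1, word2)` returns the same kind of score.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, invented for illustration only
toy_embeddings = {
    "learning": [0.9, 0.8, 0.1],
    "training": [0.8, 0.9, 0.2],
    "world":    [0.1, 0.2, 0.9],
}

sim_close = cosine_similarity(toy_embeddings["learning"], toy_embeddings["training"])
sim_far = cosine_similarity(toy_embeddings["learning"], toy_embeddings["world"])
print(f"learning vs training: {sim_close:.3f}")
print(f"learning vs world:    {sim_far:.3f}")
```

A score near 1 means the vectors point in almost the same direction; a score near 0 means they are nearly orthogonal.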

In [7]:
import nltk
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec

nltk.download('punkt')

# Sample Text Dataset
text_data = [
    "AI is transforming the world.",
    "Natural language processing is a subset of AI.",
    "Deep learning and machine learning are popular AI techniques.",
    "AI applications are diverse and growing rapidly."
]

# Tokenize the text data
tokenized_data = [word_tokenize(sentence.lower()) for sentence in text_data]

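# Train the Word2Vec model
model = Word2Vec(
    sentences=tokenized_data,
    vector_size=100,  # dimensionality of each word vector
    window=5,         # maximum distance between the target word and a context word
    min_count=1,      # keep every word, even those that occur only once
    workers=4,        # number of training threads
)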

# Get the vocabulary
vocabulary = list(model.wv.index_to_key)

# Get word embeddings for the vocabulary
word_embeddings = {word: model.wv[word] for word in vocabulary}

# Example: Get embedding for the word 'ai'
embedding_ai = word_embeddings['ai']

vocabulary, embedding_ai


[nltk_data] Downloading package punkt to /voc/work/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


(['ai',
  '.',
  'is',
  'are',
  'and',
  'learning',
  'a',
  'transforming',
  'the',
  'world',
  'natural',
  'language',
  'processing',
  'rapidly',
  'subset',
  'growing',
  'deep',
  'machine',
  'popular',
  'techniques',
  'applications',
  'diverse',
  'of'],
 array([-5.3929625e-04,  2.3846110e-04,  5.1058391e-03,  9.0093408e-03,
        -9.2996191e-03, -7.1191201e-03,  6.4605977e-03,  8.9770071e-03,
        -5.0151777e-03, -3.7672531e-03,  7.3795291e-03, -1.5375089e-03,
        -4.5374273e-03,  6.5554040e-03, -4.8583425e-03, -1.8160876e-03,
         2.8794403e-03,  9.9324819e-04, -8.2843015e-03, -9.4504934e-03,
         7.3131216e-03,  5.0689015e-03,  6.7586694e-03,  7.6146907e-04,
         6.3525592e-03, -3.4040422e-03, -9.4959326e-04,  5.7699587e-03,
        -7.5216787e-03, -3.9335201e-03, -7.5099799e-03, -9.3070691e-04,
         9.5371082e-03, -7.3208129e-03, -2.3332399e-03, -1.9356081e-03,
         8.0778319e-03, -5.9298510e-03,  4.8323451e-05, -4.7512641e-03,
       