# **Text Processing and Feature Extraction in NLP**


This guide walks through various text processing techniques in NLP, including tokenization, vectorization, one-hot encoding, and word embeddings.

**Steps to be followed:**
1. Tokenization
2. Bag of Words (BoW) Vectorization
3. One-Hot Encoding
4. Word Embeddings (Word2Vec)

## __Step 1: Tokenization__


What is Tokenization?

Tokenization is the process of breaking text into smaller units (tokens), such as words, subwords, or punctuation marks.
Tokens are the basic units that later steps (vectorization, encoding, embeddings) count or map to numbers.
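
As a rough illustration of word-level tokenization, the sketch below splits a sentence with Python's built-in `re` module. The helper name `simple_tokenize` and its regular expression are assumptions made for this example; they only approximate NLTK's `word_tokenize` and do not reproduce its algorithm.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    Hypothetical helper for illustration only; not NLTK's tokenizer.
    """
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("AI is transforming the world.")
print(tokens)  # ['AI', 'is', 'transforming', 'the', 'world', '.']
```

For this simple sentence the result matches what `word_tokenize` returns; contractions and abbreviations are cases where the two would differ.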

In [19]:
text_data = [
    "AI is transforming the world.",
    "Natural language processing is a subset of AI.",
    "Deep learning and machine learning are popular AI techniques.",
    "AI applications are diverse and growing rapidly."
]


In [2]:
import nltk
from nltk.tokenize import word_tokenize

In [3]:
word_tokenize(text_data[0])

['AI', 'is', 'transforming', 'the', 'world', '.']

In [4]:
# Tokenize the text data
tokenized_data = [word_tokenize(sentence) for sentence in text_data]
tokenized_data

[['AI', 'is', 'transforming', 'the', 'world', '.'],
 ['Natural', 'language', 'processing', 'is', 'a', 'subset', 'of', 'AI', '.'],
 ['Deep',
  'learning',
  'and',
  'machine',
  'learning',
  'are',
  'popular',
  'AI',
  'techniques',
  '.'],
 ['AI', 'applications', 'are', 'diverse', 'and', 'growing', 'rapidly', '.']]


## __Step 2: Bag of Words (BoW) Vectorization__

What is BoW?

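Bag of Words (BoW) represents each document as a numerical vector of word counts over a shared vocabulary.
Word order and grammar are ignored; only how often each word occurs in each sentence/document is kept.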


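Toy example (word frequencies): the three documents below contain 5 unique words (cat, hat, mat, rat, sat), so the vocabulary size is 5 and each document becomes a vector of 5 features.

- cat sat mat
- cat hat
- cat cat rat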

In [40]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the text data
vectorized_data = vectorizer.fit_transform(text_data)

#dir(vectorizer)


In [41]:
CountVectorizer()

In [42]:
# Get the feature names (tokens)
feature_names = vectorizer.get_feature_names_out()
feature_names

array(['ai', 'and', 'applications', 'are', 'deep', 'diverse', 'growing',
       'is', 'language', 'learning', 'machine', 'natural', 'of',
       'popular', 'processing', 'rapidly', 'subset', 'techniques', 'the',
       'transforming', 'world'], dtype=object)

In [43]:
(vectorizer.vocabulary_)

{'ai': 0,
 'is': 7,
 'transforming': 19,
 'the': 18,
 'world': 20,
 'natural': 11,
 'language': 8,
 'processing': 14,
 'subset': 16,
 'of': 12,
 'deep': 4,
 'learning': 9,
 'and': 1,
 'machine': 10,
 'are': 3,
 'popular': 13,
 'techniques': 17,
 'applications': 2,
 'diverse': 5,
 'growing': 6,
 'rapidly': 15}

In [47]:
import pandas as pd

# One row per sentence, one column per vocabulary word
df = pd.DataFrame(vectorized_data.toarray(), columns=vectorizer.get_feature_names_out())

In [45]:
df["Text"] = text_data

In [50]:
df  # columns are the vocabulary, values are word counts

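|  | ai | and | applications | are | deep | diverse | growing | is | language | learning | ... | natural | of | popular | processing | rapidly | subset | techniques | the | transforming | world |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | ... | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 2 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |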


In [None]:

# Convert to array for better visualization
vector_array = vectorized_data.toarray()

feature_names, vector_array

## __Step 3: One-Hot Encoding__

What is One-Hot Encoding?

One-Hot Encoding represents each word (or category) as a binary vector.
Each word in the vocabulary gets a unique index, and its vector is all 0s except for a single 1 at that index.
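
A minimal sketch of this mapping in plain Python follows. The four-word `vocab` and the helper `one_hot` are hypothetical, chosen only to keep the vectors short.

```python
vocab = ["ai", "is", "the", "world"]  # hypothetical toy vocabulary
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a list of 0s with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector


print(one_hot("ai"))     # [1, 0, 0, 0]
print(one_hot("world"))  # [0, 0, 0, 1]
```

Each vector is as long as the vocabulary, which is why one-hot vectors become very sparse for realistic vocabularies.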

In [None]:
# Nominal categorical feature (no natural order): use a one-hot encoder
# gender   f  m  t
# f = 1    1  0  0
# m = 2    0  1  0
# t = 3    0  0  1

# Ordinal categorical feature (ordered levels): use a label/ordinal encoder
# education: ug = 1, pg = 2, phd
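
The nominal/ordinal distinction in the notes above can be sketched without any library. The names `education_code` and `one_hot_nominal` are made up for this example, and giving `phd` the code 3 is an assumption that simply continues the numbering from the notes.

```python
# Ordinal feature: the levels have a natural order, so integer codes preserve it.
education_order = ["ug", "pg", "phd"]  # assumed order: ug < pg < phd
education_code = {level: rank for rank, level in enumerate(education_order, start=1)}

# Nominal feature: no order exists, so each category gets its own 0/1 column.
gender_categories = ["f", "m", "t"]


def one_hot_nominal(value):
    """Encode a nominal value as a 0/1 list over gender_categories."""
    return [1 if value == category else 0 for category in gender_categories]


print(education_code)        # {'ug': 1, 'pg': 2, 'phd': 3}
print(one_hot_nominal("m"))  # [0, 1, 0]
```

Integer codes would be misleading for the nominal feature, because a model could read `t = 3` as "greater than" `f = 1`.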

In [51]:
import nltk
from sklearn.preprocessing import OneHotEncoder #nominal categorical
from nltk.tokenize import word_tokenize, sent_tokenize
import numpy as np

In [2]:
# Sample Text Dataset
text_data = [
    "AI is transforming the world.",
    "Natural language processing is a subset of AI.",
    "Deep learning and machine learning are popular AI techniques.",
    "AI applications are diverse and growing rapidly."
]
len(text_data)

4

In [53]:
# Quick check: punctuation becomes its own token
word_tokenize("i like it!")

['i', 'like', 'it', '!']

In [52]:
# Sentence-level tokenization: each entry is already one sentence, so each list holds a single item
tokenized_data = [sent_tokenize(sentence.lower()) for sentence in text_data]
tokenized_data

[['ai is transforming the world.'],
 ['natural language processing is a subset of ai.'],
 ['deep learning and machine learning are popular ai techniques.'],
 ['ai applications are diverse and growing rapidly.']]

In [None]:
# LLMs operate on tokens
# Models need numerical input, so tokens must be mapped to numbers

In [5]:
# Tokenize the text data
tokenized_data = [word_tokenize(sentence.lower()) for sentence in text_data]
tokenized_data

[['ai', 'is', 'transforming', 'the', 'world', '.'],
 ['natural', 'language', 'processing', 'is', 'a', 'subset', 'of', 'ai', '.'],
 ['deep',
  'learning',
  'and',
  'machine',
  'learning',
  'are',
  'popular',
  'ai',
  'techniques',
  '.'],
 ['ai', 'applications', 'are', 'diverse', 'and', 'growing', 'rapidly', '.']]

In [11]:
list_token = []
for i in tokenized_data:
    for j in i:
        list_token.append(j)

In [13]:
# list_token

In [10]:
# Flatten the list of tokenized words
flattened_tokens = [word for sentence in tokenized_data for word in sentence]
type(flattened_tokens)

list

In [16]:
len(flattened_tokens)

33

In [14]:
# A set keeps only unique elements, which is how the vocabulary is built below
{1, 1, 1, 1, 2, 2, 2}

{1, 2}

In [21]:
# LLM tokenizers use a fixed subword vocabulary, typically tens of thousands of tokens (GPT-2: 50,257)

# from transformers import AutoModelForCausalLM, AutoTokenizer
# model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2-large")
# tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2-large")

In [None]:
# tokenizer  # inspect the tokenizer after uncommenting the loading code above

In [37]:
# Create a vocabulary of unique words
vocabulary = list(set(flattened_tokens))
(vocabulary)

['popular',
 'transforming',
 'world',
 'growing',
 'the',
 'ai',
 'a',
 'diverse',
 'is',
 '.',
 'are',
 'deep',
 'machine',
 'language',
 'processing',
 'subset',
 'applications',
 'and',
 'techniques',
 'learning',
 'of',
 'natural',
 'rapidly']
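
The order of `list(set(...))` shown above is arbitrary and can change between runs, because Python randomizes string hashing per process by default. If a reproducible vocabulary is needed, one option is sketched below; the short `tokens` list is a stand-in for `flattened_tokens`, and the other names are chosen for this example.

```python
tokens = ["ai", "is", "transforming", "the", "world", ".", "ai"]  # toy stand-in for flattened_tokens

# sorted() gives the same order on every run
sorted_vocab = sorted(set(tokens))

# dict.fromkeys drops duplicates while keeping first-occurrence order
ordered_vocab = list(dict.fromkeys(tokens))

# A stable word -> index lookup built from the sorted vocabulary
word_to_index = {word: i for i, word in enumerate(sorted_vocab)}

print(sorted_vocab)   # ['.', 'ai', 'is', 'the', 'transforming', 'world']
print(ordered_vocab)  # ['ai', 'is', 'transforming', 'the', 'world', '.']
```

`OneHotEncoder` sorts its categories internally, so the sorted variant also lines up with the column order it reports.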

In [27]:
features = np.array(vocabulary).reshape(-1, 1)

In [29]:
# Initialize OneHotEncoder with the vocabulary
encoder = OneHotEncoder()
x1 = encoder.fit_transform(features)

In [33]:
# dir(encoder)

In [38]:
encoder.get_feature_names_out()

array(['x0_.', 'x0_a', 'x0_ai', 'x0_and', 'x0_applications', 'x0_are',
       'x0_deep', 'x0_diverse', 'x0_growing', 'x0_is', 'x0_language',
       'x0_learning', 'x0_machine', 'x0_natural', 'x0_of', 'x0_popular',
       'x0_processing', 'x0_rapidly', 'x0_subset', 'x0_techniques',
       'x0_the', 'x0_transforming', 'x0_world'], dtype=object)

In [36]:
import pandas as pd
pd.DataFrame(x1.toarray(), columns=encoder.get_feature_names_out())

Unnamed: 0,x0_.,x0_a,x0_ai,x0_and,x0_applications,x0_are,x0_deep,x0_diverse,x0_growing,x0_is,...,x0_natural,x0_of,x0_popular,x0_processing,x0_rapidly,x0_subset,x0_techniques,x0_the,x0_transforming,x0_world
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [39]:


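# Encode each sentence as one vector over the vocabulary
one_hot_encoded_data = []
for sentence in tokenized_data:
    # One one-hot row per token, shape (n_tokens, vocab_size)
    encoded_sentence = encoder.transform(np.array(sentence).reshape(-1, 1))
    # Summing the rows gives per-word counts for the sentence (so 'learning' can be 2).
    # The sum of a sparse matrix is a 1 x vocab_size matrix, hence the extra dimension in the output.
    one_hot_encoded_data.append(np.sum(encoded_sentence, axis=0))

# Output the vocabulary and the encoded sentences.
# Columns follow the encoder's sorted categories, not the order of `vocabulary`.
vocabulary, np.array(one_hot_encoded_data)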


(['popular',
  'transforming',
  'world',
  'growing',
  'the',
  'ai',
  'a',
  'diverse',
  'is',
  '.',
  'are',
  'deep',
  'machine',
  'language',
  'processing',
  'subset',
  'applications',
  'and',
  'techniques',
  'learning',
  'of',
  'natural',
  'rapidly'],
 array([[[1., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 1., 1., 1.]],
 
        [[1., 1., 1., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 1., 1., 0.,
          1., 0., 1., 0., 0., 0., 0.]],
 
        [[1., 0., 1., 1., 0., 1., 1., 0., 0., 0., 0., 2., 1., 0., 0., 1.,
          0., 0., 0., 1., 0., 0., 0.]],
 
        [[1., 0., 1., 1., 1., 1., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0.,
          0., 1., 0., 0., 0., 0., 0.]]]))
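
The 2 in the third sentence's vector shows that summing one-hot rows produces word counts, in effect a bag-of-words vector, and not a strictly binary encoding. The sketch below reproduces this on a toy scale and shows one way to clip the counts back to 0/1; the four-word `vocab` and all variable names are assumptions for this example.

```python
vocab = ["ai", "deep", "learning", "machine"]  # hypothetical toy vocabulary
index = {word: i for i, word in enumerate(vocab)}
sentence = ["deep", "learning", "machine", "learning", "ai"]

# One one-hot row per token
one_hot_rows = []
for token in sentence:
    row = [0] * len(vocab)
    row[index[token]] = 1
    one_hot_rows.append(row)

# Column-wise sum of the one-hot rows gives per-word counts
count_vector = [sum(column) for column in zip(*one_hot_rows)]

# Clipping to 1 gives a binary presence/absence vector
binary_vector = [min(count, 1) for count in count_vector]

print(count_vector)   # [1, 1, 2, 1]
print(binary_vector)  # [1, 1, 1, 1]
```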

In [None]:
b = np.array(one_hot_encoded_data)

In [None]:
b[0]

## __Step 4: Word Embeddings (Word2Vec)__

What is Word2Vec?

Word2Vec learns a dense, fixed-length vector representation for each word from the contexts in which it appears.
Words that occur in similar contexts end up with similar vectors.
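
"Similar vectors" is commonly measured with cosine similarity. The sketch below computes it in plain Python on three invented 3-dimensional vectors; the numbers in `toy_embeddings` are made up for illustration and are not the output of any trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, invented for this example
toy_embeddings = {
    "deep": [0.9, 0.1, 0.0],
    "machine": [0.8, 0.2, 0.1],
    "world": [0.0, 0.1, 0.9],
}

sim_close = cosine_similarity(toy_embeddings["deep"], toy_embeddings["machine"])
sim_far = cosine_similarity(toy_embeddings["deep"], toy_embeddings["world"])

print(sim_close > sim_far)  # True
```

A value near 1 means the vectors point in nearly the same direction; a value near 0 means they are close to orthogonal.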

In [None]:
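import nltk
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec

# Tokenizer data used by word_tokenize (recent NLTK releases may need 'punkt_tab' instead)
nltk.download('punkt')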

# Sample Text Dataset
text_data = [
    "AI is transforming the world.",
    "Natural language processing is a subset of AI.",
    "Deep learning and machine learning are popular AI techniques.",
    "AI applications are diverse and growing rapidly."
]

# Tokenize the text data
tokenized_data = [word_tokenize(sentence.lower()) for sentence in text_data]

# Train the Word2Vec model
model = Word2Vec(sentences=tokenized_data, vector_size=100, window=5, min_count=1, workers=4)

# Get the vocabulary
vocabulary = list(model.wv.index_to_key)

# Get word embeddings for the vocabulary
word_embeddings = {word: model.wv[word] for word in vocabulary}

# Example: Get embedding for the word 'ai'
embedding_ai = word_embeddings['ai']

vocabulary, embedding_ai

