### Tokenization

Tokenization is the process of breaking down a text into smaller components, typically words or subwords, called tokens. It’s a crucial step in natural language processing, as it transforms text into a structured format that models can work with.

In [6]:
import nltk
nltk.download('punkt_tab')


[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\mindf\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [5]:
from nltk.tokenize import word_tokenize

text = "Tokenization is essential to natural language processing"
tokens = word_tokenize(text)
print(tokens)

['Tokenization', 'is', 'essential', 'to', 'natural', 'language', 'processing']
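
To make the idea concrete without any external resources, here is a minimal sketch of word-level tokenization using only a regular expression. The function name `simple_tokenize` and the pattern are illustrative assumptions, not what `word_tokenize` does internally; NLTK's tokenizer handles many more cases (contractions, abbreviations, and so on).

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("The product is very good, and easy to use.")
print(tokens)
```

Each punctuation mark becomes its own token here, which is roughly how word-level tokenizers treat sentence-final periods and commas.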


### Stop Word Removal

Stop word removal is the process of eliminating common words (like "the," "is," "and") from text data. These words generally carry less meaningful information in natural language processing (NLP) tasks, so removing them can help reduce the noise and focus on the more important terms in the text.

Let’s say we are analyzing customer reviews for a product. The sentence "The product is very good and easy to use" includes several stop words that add little to our understanding of customer sentiment. After removing the stop words in NLTK’s English list (which includes "very"), we get "product good easy use," which still conveys the main sentiment without the filler words.

In [7]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')

text = "The product is very good and easy to use."
# Tokenize the text
tokens = word_tokenize(text)

# Get the list of English stop words
stop_words = set(stopwords.words("english"))

# Remove stop words
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)


['product', 'good', 'easy', 'use', '.']


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mindf\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
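
The filtering step itself is just a set-membership test. The sketch below shows it in plain Python with a small hand-picked stop list; `TOY_STOP_WORDS` and `remove_stop_words` are hypothetical names introduced for illustration, and the list is far shorter than NLTK's English list. Unlike the cell above, this version also drops punctuation tokens.

```python
# Hypothetical, hand-picked stop list for illustration only;
# NLTK's English stop word list is much longer.
TOY_STOP_WORDS = {"the", "is", "very", "and", "to", "a", "of"}

def remove_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    """Keep alphanumeric tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.isalnum() and t.lower() not in stop_words]

tokens = ["The", "product", "is", "very", "good", "and", "easy", "to", "use", "."]
filtered = remove_stop_words(tokens)
print(filtered)
```

Using a `set` for the stop list keeps each lookup constant-time, which matters once the corpus grows beyond a few sentences.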


### Lemmatization and Stemming

Lemmatization and stemming are both techniques used in natural language processing (NLP) to reduce words to their base forms, but they do so in different ways:

Stemming: A rule-based method that strips suffixes from words to reduce them to a "stem," which is not necessarily a valid word. It is fast but crude.
Lemmatization: This process reduces words to their base or dictionary form (lemma). It relies on a vocabulary and the word’s part of speech, so it is usually more precise than stemming.

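Let’s consider the words "running," "ran," and "runner."

Stemming (for example, with the Porter stemmer) reduces "running" to "run" but leaves "ran" and "runner" unchanged, because it only strips suffixes by rule and cannot handle irregular forms.
Lemmatization, when told the words are verbs, reduces both "running" and "ran" to "run," but leaves "runner" as "runner" (since it’s a different base form).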

In [9]:
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize


# Download necessary resources

nltk.download('wordnet')
nltk.download('omw-1.4')

text = "The cats are running faster than the runners."

# Tokenize the text
tokens = word_tokenize(text)

# Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in tokens]
print("Stemmed Tokens:", stemmed_tokens)

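# Lemmatization
# pos='v' treats every token as a verb, so verb forms such as "running" become "run",
# while the plural noun "runners" is left unchanged (it would need pos='n').
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word, pos='v') if word not in ['the', 'are'] else word for word in tokens]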
print("Lemmatized Tokens:", lemmatized_tokens)


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mindf\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\mindf\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Stemmed Tokens: ['the', 'cat', 'are', 'run', 'faster', 'than', 'the', 'runner', '.']
Lemmatized Tokens: ['The', 'cat', 'are', 'run', 'faster', 'than', 'the', 'runners', '.']
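
The difference between the two approaches can be sketched with a toy example: a rule-based suffix stripper versus a dictionary lookup. Both `toy_stem` and `toy_lemmatize`, along with the tiny `TOY_LEMMAS` table, are hypothetical helpers written for illustration; they are far simpler than the Porter algorithm or WordNet.

```python
def toy_stem(word):
    """Strip a known suffix by rule; the result may not be a real word."""
    word = word.lower()
    for suffix in ("ing", "s"):
        # Only strip if at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Collapse a doubled final letter, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word

# Hypothetical mini-dictionary mapping inflected forms to lemmas.
TOY_LEMMAS = {"running": "run", "ran": "run", "runs": "run", "cats": "cat"}

def toy_lemmatize(word):
    """Look the word up in a dictionary of known forms; fall back to the word itself."""
    word = word.lower()
    return TOY_LEMMAS.get(word, word)

for w in ["running", "ran", "runner"]:
    print(w, "->", toy_stem(w), "/", toy_lemmatize(w))
```

The rule-based stemmer cannot relate "ran" to "run" because no suffix rule applies, whereas the lookup handles it only because the irregular form is listed in its dictionary.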


### Cosine Similarity

Cosine similarity is a metric that measures how similar two vectors (for example, vector representations of text documents) are, based on the cosine of the angle between them. It’s often used in text analysis and natural language processing (NLP) to compare documents regardless of their length, because it depends only on the direction of the vectors, not their magnitude. In general the value ranges from -1 to 1; for non-negative vectors such as word counts or TF-IDF weights, it ranges from 0 (no terms in common) to 1 (same direction).
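
For two vectors A and B, the cosine similarity is the dot product divided by the product of the vector lengths: cos(θ) = (A · B) / (‖A‖ ‖B‖). The sketch below computes this by hand for two small word-count vectors; the function name `cosine_similarity_manual` and the example values are illustrative assumptions, and returning 0.0 for an all-zero vector is a convention chosen here.

```python
import math

def cosine_similarity_manual(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: an all-zero vector matches nothing
    return dot / (norm_a * norm_b)

# Illustrative word-count vectors over a six-word vocabulary.
doc1_counts = [0, 1, 0, 1, 1, 1]
doc2_counts = [1, 0, 1, 0, 1, 1]
print(cosine_similarity_manual(doc1_counts, doc2_counts))  # 0.5
```

The two vectors share two of their four non-zero entries, so the dot product is 2 and each length is 2, giving 2 / (2 × 2) = 0.5.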

In [1]:
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Download necessary resources
nltk.download('stopwords')

# Define the documents
doc1 = "I love programming in Python."
doc2 = "Python programming is fun."

# Create a list of documents
documents = [doc1, doc2]

# Remove stopwords (for better results)
stop_words = set(stopwords.words('english'))

# Tokenize and remove stopwords
def preprocess_text(text):
    tokens = nltk.word_tokenize(text.lower())
    return [word for word in tokens if word.isalnum() and word not in stop_words]

# Apply preprocessing to both documents
doc1_tokens = preprocess_text(doc1)
doc2_tokens = preprocess_text(doc2)

# Create the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents into TF-IDF vectors
tfidf_matrix = vectorizer.fit_transform([' '.join(doc1_tokens), ' '.join(doc2_tokens)])

# Compute cosine similarity
cosine_sim = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])

print(f"Cosine Similarity between doc1 and doc2: {cosine_sim[0][0]}")


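ModuleNotFoundError: No module named 'sklearn'

This cell fails because scikit-learn is not installed in the environment; it runs once the package is installed with `pip install scikit-learn`.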

### Bag of Words

In [4]:
!pip install scikit-learn


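The Bag of Words (BoW) model represents each document as a vector of word counts over a shared vocabulary, ignoring grammar and word order.

Steps in BoW: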

Tokenization: Split the text into individual words (tokens).

Vocabulary Creation: Create a list of all unique words in the entire corpus.

Word Frequency: Count the frequency of each word in each document.

Vector Representation: Each document is represented as a vector of word frequencies.
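
The four steps above can be sketched in plain Python, without scikit-learn. The helper names `tokenize` and `bag_of_words` are hypothetical, and the tokenization rule (lowercase, keep tokens of two or more word characters) is an assumption chosen to mirror `CountVectorizer`'s default behaviour.

```python
import re
from collections import Counter

def tokenize(text):
    # Step 1: lowercase and keep tokens of two or more word characters
    # (an assumption that mirrors CountVectorizer's defaults).
    return re.findall(r"\b\w\w+\b", text.lower())

def bag_of_words(documents):
    tokenized = [tokenize(doc) for doc in documents]
    # Step 2: the vocabulary is the sorted set of all unique tokens in the corpus.
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    # Step 3: count how often each token occurs in each document.
    counts = [Counter(doc) for doc in tokenized]
    # Step 4: one row per document, one column per vocabulary word.
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix

vocab, matrix = bag_of_words(["I love programming in Python.", "Python programming is fun."])
print(vocab)
print(matrix)
```

A `Counter` returns 0 for words it has never seen, which is why absent vocabulary words need no special handling when the rows are built.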

Example:

Let’s consider two documents:

Document 1: "I love programming in Python."

Document 2: "Python programming is fun."

We’ll represent these documents using the Bag of Words model and then create a matrix where each row corresponds to a document, and each column corresponds to a word from the vocabulary.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

# Define the documents
documents = ["I love programming in Python.", "Python programming is fun."]

# Create a CountVectorizer object
vectorizer = CountVectorizer()

# Fit and transform the documents into a Bag of Words representation
X = vectorizer.fit_transform(documents)

# Get the vocabulary (the unique words)
vocabulary = vectorizer.get_feature_names_out()

# Convert the matrix to an array to see the word counts
bow_matrix = X.toarray()

print("Vocabulary:", vocabulary)
print("Bag of Words Matrix:\n", bow_matrix)


Vocabulary: ['fun' 'in' 'is' 'love' 'programming' 'python']
Bag of Words Matrix:
 [[0 1 0 1 1 1]
 [1 0 1 0 1 1]]


Document 1: [0, 1, 0, 1, 1, 1] — This means "Document 1" has:
0 occurrences of "fun"
1 occurrence of "in"
0 occurrences of "is"
1 occurrence of "love"
1 occurrence of "programming"
1 occurrence of "python"
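
The same reading of a row can be done programmatically by pairing each vocabulary word with its count. This is a small sketch that copies the vocabulary and the first row from the output above into plain lists; the variable names are illustrative.

```python
# Values copied from the printed vocabulary and the first row of the matrix.
vocabulary = ["fun", "in", "is", "love", "programming", "python"]
doc1_row = [0, 1, 0, 1, 1, 1]

# Pair each word with its count in Document 1.
word_counts = dict(zip(vocabulary, doc1_row))

# Keep only the words that actually occur in the document.
present = [word for word, count in word_counts.items() if count > 0]
print(word_counts)
print(present)
```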

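Note that "I" does not appear in the vocabulary: by default, CountVectorizer lowercases the text and ignores single-character tokens. Bag of Words is also a form of vectorization, since it turns text into numeric vectors that models can work with.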