<a href="https://colab.research.google.com/github/kavyajeetbora/nlp_rag/blob/master/NLP_basics/02_feature_extraction_part_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Extraction and Vectorizing Methods - Part 2

In [1]:
import nltk
import re
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

## N-Grams

N-grams are contiguous sequences of 'n' items from a given sample of text or speech. These items can be words, characters, or symbols, depending on the application. N-grams are widely used in various NLP tasks to capture the context and structure of the text.

1. Types of N-grams
- Unigram (n=1): Single words. Example: "This is a test" → ["This", "is", "a", "test"]
- Bigram (n=2): Pairs of consecutive words. Example: "This is a test" → ["This is", "is a", "a test"]
- Trigram (n=3): Triplets of consecutive words. Example: "This is a test" → ["This is a", "is a test"]

2. Applications of N-grams
- Language Modeling: Predicting the next word in a sequence based on the previous 'n' words.
- Text Classification: Using N-grams as features to classify documents.
- Spelling Correction: Identifying and correcting misspelled words by analyzing N-gram patterns.
- Machine Translation: Translating text by considering N-gram sequences to maintain context.
- Text Mining: Extracting meaningful patterns and insights from text data.
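
The sliding window behind the unigram, bigram, and trigram examples above can be sketched in plain Python. This is a minimal illustration, not NLTK's implementation; `make_ngrams` is a hypothetical helper name. It also shows the character-level case mentioned in the definition.

```python
def make_ngrams(items, n):
    """Return all contiguous n-item windows of a sequence as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

text = "This is a test"

# Word-level bigrams: split on whitespace first.
word_bigrams = make_ngrams(text.split(), 2)
print("Word bigrams:", word_bigrams)

# Character-level trigrams: a string is already a sequence of characters.
char_trigrams = ["".join(gram) for gram in make_ngrams("test", 3)]
print("Character trigrams:", char_trigrams)
```

A sequence shorter than `n` simply yields an empty list.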

In [2]:
from nltk import ngrams

# Sample text
text = "This is a test"

# Tokenize the text
tokens = text.split()

# Generate bigrams (n=2)
bigrams = list(ngrams(tokens, 2))
print("Bigrams:", bigrams)

# Generate trigrams (n=3)
trigrams = list(ngrams(tokens, 3))
print("Trigrams:", trigrams)

Bigrams: [('This', 'is'), ('is', 'a'), ('a', 'test')]
Trigrams: [('This', 'is', 'a'), ('is', 'a', 'test')]


N-grams, such as bigrams and trigrams, are used in various NLP tasks to capture the context and relationships between words. Here are some common applications:

**Applications of N-grams**
1. Language Modeling: N-grams help in predicting the next word in a sequence. For example, in a bigram model, the probability of a word depends on the previous word. This is useful in applications like text generation and autocomplete.

2. Text Classification: N-grams can be used as features to classify documents. For instance, in spam detection, certain bigrams or trigrams might be more common in spam emails than in legitimate ones.

3. Sentiment Analysis: N-grams capture phrases that convey sentiment. For example, bigrams like "not good" or "very happy" can provide more context than individual words.

4. Spelling Correction: N-grams help in identifying and correcting misspelled words by analyzing the context in which they appear. For example, the bigram "hte cat" is rare while "the cat" is common, so "hte" can be corrected to "the".

5. Machine Translation: N-grams are used to maintain the context and fluency of translations. They help in ensuring that translated phrases make sense in the target language.

6. Information Retrieval: Search engines use N-grams to improve the relevance of search results. For example, bigrams and trigrams can help in understanding user queries better.
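
The language-modeling idea in item 1, where the probability of a word depends on the previous word, can be sketched with simple bigram counts. The corpus below is invented for illustration and `next_word_probs` is a hypothetical helper; a real model would need far more text and some form of smoothing for unseen pairs.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration only.
corpus = [
    "this is a test",
    "this is a place",
    "this was a test",
]

# bigram_counts[prev][nxt] = how often `nxt` directly follows `prev`.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        bigram_counts[prev][nxt] += 1

def next_word_probs(prev):
    """Maximum-likelihood estimate of P(next | prev) from bigram counts."""
    counts = bigram_counts.get(prev, Counter())
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()} if total else {}

print(next_word_probs("this"))
print(next_word_probs("a"))
```

An autocomplete feature built on this would suggest the highest-probability continuation of the last typed word.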

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Sample data
texts = ["This is a place", "Which place do you stay?", "This is a test", "Stay in this place"]
labels = ["statement", "question", "statement", "statement"]

# Create a CountVectorizer with bigrams
vectorizer = CountVectorizer(ngram_range=(2, 2))

# Create a pipeline with the vectorizer and a classifier
model = make_pipeline(vectorizer, MultinomialNB())

# Train the model
model.fit(texts, labels)

# Predict the label of a new text
new_text = ["Where do you stay?"]
predicted_label = model.predict(new_text)
print("Predicted label:", predicted_label)

Predicted label: ['question']


In [4]:
import nltk
from nltk.translate.bleu_score import sentence_bleu

# Reference and candidate translations
reference = [['this', 'is', 'a', 'test']]
candidate = ['this', 'was', 'a', 'test']

# Calculate BLEU score
bleu_score = sentence_bleu(reference, candidate)
print("BLEU Score:", bleu_score)

BLEU Score: 1.0547686614863434e-154


The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
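
The near-zero score and the warnings above follow from how BLEU combines n-gram precisions: by default it takes a geometric mean over 1-grams to 4-grams, so a single order with no overlap drives the result to (almost) zero. The sketch below recomputes the clipped n-gram precisions by hand for the same sentence pair. It is a simplified illustration with a single reference and no brevity penalty, and `modified_precision` is a hypothetical helper, not NLTK's function.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(reference, candidate, n):
    """Clipped n-gram precision of a candidate against a single reference."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

reference = ['this', 'is', 'a', 'test']
candidate = ['this', 'was', 'a', 'test']

precisions = [modified_precision(reference, candidate, n) for n in range(1, 5)]
print(precisions)
```

The 1-gram and 2-gram precisions are non-zero, but no 3-gram or 4-gram of the candidate appears in the reference. As the warning suggests, for short sentences it may be more informative to pass lower-order weights such as `weights=(0.5, 0.5)` to `sentence_bleu`, or to supply a `SmoothingFunction`.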


## Word Embeddings

Word embeddings represent words as vectors in a continuous vector space. These vectors capture semantic meaning and relationships between words, which makes them a powerful tool in NLP.

Key Concepts
1. Dense Vectors: Unlike traditional methods like Bag of Words (BoW) or TF-IDF, which create sparse vectors, word embeddings create dense vectors where each dimension captures some aspect of the word's meaning.

2. Contextual Similarity: Words that are similar in meaning are placed closer together in the vector space. For example, "king" and "queen" would have similar vectors.

3. Dimensionality Reduction: Compared with sparse vectors that have one dimension per vocabulary term, word embeddings use far fewer dimensions (typically tens to a few hundred), which makes them cheaper to compute with while preserving semantic relationships.
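
To make the sparse-versus-dense contrast concrete, the sketch below builds a Bag of Words vector over a tiny invented corpus and counts how many entries are non-zero. The corpus and the `bow_vector` helper are assumptions made for illustration only.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
corpus = [
    "the king rules the kingdom",
    "the queen rules the palace",
    "a fox jumps over the lazy dog",
]

# One dimension per distinct word in the corpus.
vocab = sorted({word for doc in corpus for word in doc.split()})

def bow_vector(doc):
    """Count vector of a document over the shared vocabulary."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vec = bow_vector(corpus[0])
nonzero = sum(1 for x in vec if x != 0)
print("Vector length:", len(vec), "| non-zero entries:", nonzero)
```

The BoW vector grows by one dimension for every new vocabulary word and is mostly zeros, whereas an embedding keeps a fixed length (for example, 100) no matter how large the vocabulary becomes.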

### Word2Vec

Word2Vec: Developed by Google, it uses neural networks to learn word associations from a large corpus of text. It has two models:

1. Continuous Bag of Words (CBOW): Predicts the target word from its surrounding context words. It is faster to train and tends to represent frequent words slightly better.
2. Skip-gram: Predicts the surrounding context words from a target word. It is better at capturing semantic relationships and handling rare words, but is computationally more expensive.
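
The difference between the two models is easiest to see in the training examples each one derives from a sentence. The sketch below is a simplified illustration with a fixed window of one word on each side; `context_windows` is a hypothetical helper, and actual Word2Vec implementations add details such as subsampling and negative sampling that are omitted here.

```python
def context_windows(tokens, window):
    """Yield (target, context_words) for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

tokens = ["we", "are", "learning", "word2vec"]

# CBOW: the context words are the input, the target word is the label.
cbow_examples = [(context, target) for target, context in context_windows(tokens, 1)]

# Skip-gram: the target word is the input, each context word is a separate label.
skipgram_examples = [
    (target, ctx_word)
    for target, context in context_windows(tokens, 1)
    for ctx_word in context
]

print("CBOW:", cbow_examples)
print("Skip-gram:", skipgram_examples)
```

Skip-gram produces one training example per (target, context) pair, so it generates more examples per sentence than CBOW, which is one reason it is slower to train.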

Word2Vec creates dense vector representations of words, capturing their meanings and relationships. These vectors are useful in various NLP tasks like text classification, sentiment analysis, and machine translation.

In [6]:
from gensim.models import Word2Vec

# Example usage
sentences = [["this", "is", "a", "sample"], ["we", "are", "learning", "word2vec"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
vector = model.wv['sample']
print(vector)
print(vector.shape)

[-0.00713902  0.00124103 -0.00717672 -0.00224462  0.0037193   0.00583312
  0.00119818  0.00210273 -0.00411039  0.00722533 -0.00630704  0.00464722
 -0.00821997  0.00203647 -0.00497705 -0.00424769 -0.00310898  0.00565521
  0.0057984  -0.00497465  0.00077333 -0.00849578  0.00780981  0.00925729
 -0.00274233  0.00080022  0.00074665  0.00547788 -0.00860608  0.00058446
  0.00686942  0.00223159  0.00112468 -0.00932216  0.00848237 -0.00626413
 -0.00299237  0.00349379 -0.00077263  0.00141129  0.00178199 -0.0068289
 -0.00972481  0.00904058  0.00619805 -0.00691293  0.00340348  0.00020606
  0.00475375 -0.00711994  0.00402695  0.00434743  0.00995737 -0.00447374
 -0.00138926 -0.00731732 -0.00969783 -0.00908026 -0.00102275 -0.00650329
  0.00484973 -0.00616403  0.00251919  0.00073944 -0.00339215 -0.00097922
  0.00997913  0.00914589 -0.00446183  0.00908303 -0.00564176  0.00593092
 -0.00309722  0.00343175  0.00301723  0.00690046 -0.00237388  0.00877504
  0.00758943 -0.00954765 -0.00800821 -0.0076379   0.

### GloVe

GloVe (Global Vectors for Word Representation): Developed by Stanford, it combines the advantages of both global matrix factorization and local context window methods.
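
The "global" part refers to word-word co-occurrence counts collected over the whole corpus. As a rough sketch of that first step only (not the GloVe training objective itself), the code below counts how often two words appear within a fixed window of each other. The corpus, the window size, and the `cooccurrence_counts` helper are assumptions chosen for illustration, and any distance-based weighting of pairs is omitted.

```python
from collections import Counter

# Toy tokenized corpus, invented for illustration only.
corpus = [
    ["this", "is", "a", "sample"],
    ["this", "is", "a", "test"],
]

def cooccurrence_counts(sentences, window=2):
    """Count (word, neighbor) pairs that occur within `window` positions."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

counts = cooccurrence_counts(corpus, window=2)
print(counts[("this", "is")])
print(counts[("is", "sample")])
print(counts[("this", "sample")])
```

Pairs that never fall inside the same window, such as "this" and "sample" here, get a count of zero.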

In [8]:
import numpy as np

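def vectorize_sentence_glove(sentence, glove_vectors):
    """Average the vectors of the tokens that are found in glove_vectors."""
    vectors = []
    for word in sentence:
        if word in glove_vectors:
            vectors.append(glove_vectors[word])

    # Return None when no token has a vector, so callers can detect it
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return None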


# Example usage with a stand-in dictionary. Real pre-trained GloVe vectors
# must be downloaded separately; the dummy 3-dimensional vectors below
# only demonstrate the averaging.
glove_vectors = {}

# Add some dummy vectors to the dictionary
glove_vectors['the'] = np.array([0.1, 0.2, 0.3])
glove_vectors['quick'] = np.array([0.4, 0.5, 0.6])
glove_vectors['brown'] = np.array([0.7, 0.8, 0.9])
glove_vectors['fox'] = np.array([1.0, 1.1, 1.2])


sentence = ["the", "quick", "brown", "fox"]
sentence_vector = vectorize_sentence_glove(sentence, glove_vectors)

if sentence_vector is not None:
    print("Sentence vector:", sentence_vector)
    print("Shape:", sentence_vector.shape)
else:
    print("No words in the sentence were found in the GloVe vocabulary.")

Sentence vector: [0.55 0.65 0.75]
Shape: (3,)


In [9]:
import numpy as np
import gensim.downloader as api

# Load the GloVe model
glove_model = api.load("glove-wiki-gigaword-100")




In [10]:
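def vectorize_sentence(sentence, model):
    # The GloVe vocabulary is lowercase, so normalize case and strip
    # surrounding punctuation before the lookup.
    words = [w.strip(".,!?;:\"'()") for w in sentence.lower().split()]
    word_vectors = [model[word] for word in words if word in model]
    # Fall back to a zero vector when no token is in the vocabulary
    if not word_vectors:
        return np.zeros(model.vector_size)
    return np.mean(word_vectors, axis=0)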

In GloVe (Global Vectors for Word Representation), each word is assigned a single static embedding: the same fixed vector regardless of the sentence it appears in. The vectors are learned from global word-word co-occurrence statistics over a large corpus. In the model loaded above, every vector has 100 dimensions:

In [17]:
print(glove_model['duplicate'].shape)

(100,)


In [12]:
# Example sentence
sentence = "This is an example sentence for vectorization."

# Vectorize the sentence
sentence_vector = vectorize_sentence(sentence, glove_model)

print(f"Vector representation of the sentence: {sentence_vector.shape}")

Vector representation of the sentence: (100,)


Math with word vectors

In [30]:
word1 = glove_model["king"]
word2 = glove_model["female"]

result = word2 + word1

To map a vector back to words, find the vocabulary words whose vectors are closest to it, typically measured by cosine similarity. The input words themselves usually rank first, because the sum stays close to both of them:

In [31]:
glove_model.similar_by_vector(result, topn=5)

[('king', 0.7884284853935242),
 ('female', 0.7797942757606506),
 ('male', 0.7577809691429138),
 ('queen', 0.7418192028999329),
 ('father', 0.7039036154747009)]
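
The nearest-vector lookup can be sketched without any library. The 3-dimensional vectors below are made up for illustration and are not real GloVe values; `cosine` and `nearest` are hypothetical helpers. The sketch also excludes the input words from the ranking, which is why "queen" rather than "king" comes out on top here.

```python
import math

# Hypothetical 3-dimensional vectors, made up for illustration only.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(query, vectors, topn=3, exclude=()):
    """Rank vocabulary words by cosine similarity to the query vector."""
    scored = [(word, cosine(query, vec))
              for word, vec in vectors.items() if word not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# king - man + woman, computed element-wise
query = [k - m + w for k, m, w in zip(toy_vectors["king"],
                                      toy_vectors["man"],
                                      toy_vectors["woman"])]

result = nearest(query, toy_vectors, topn=2, exclude=("king", "man", "woman"))
print(result)
```

With gensim, `most_similar(positive=[...], negative=[...])` performs this kind of arithmetic and leaves the input words out of the results.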