<a href="https://colab.research.google.com/github/kavyajeetbora/nlp_rag/blob/master/NLP_basics/02_feature_extraction_part_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Extraction and Vectorizing Methods - Part 2

In [None]:
import nltk
import re
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

## N-Grams

N-grams are contiguous sequences of 'n' items from a given sample of text or speech. These items can be words, characters, or symbols, depending on the application. N-grams are widely used in various NLP tasks to capture the context and structure of the text.

1. Types of N-grams
- Unigram (n=1): Single words. Example: "This is a test" → ["This", "is", "a", "test"]
- Bigram (n=2): Pairs of consecutive words. Example: "This is a test" → ["This is", "is a", "a test"]
- Trigram (n=3): Triplets of consecutive words. Example: "This is a test" → ["This is a", "is a test"]

2. Applications of N-grams
- Language Modeling: Predicting the next word in a sequence based on the previous n-1 words.
- Text Classification: Using N-grams as features to classify documents.
- Spelling Correction: Identifying and correcting misspelled words by analyzing N-gram patterns.
- Machine Translation: Translating text by considering N-gram sequences to maintain context.
- Text Mining: Extracting meaningful patterns and insights from text data.

In [69]:
from nltk import ngrams

# Sample text
text = "This is a test"

# Tokenize the text
tokens = text.split()

# Generate bigrams (n=2)
bigrams = list(ngrams(tokens, 2))
print("Bigrams:", bigrams)

# Generate trigrams (n=3)
trigrams = list(ngrams(tokens, 3))
print("Trigrams:", trigrams)

Bigrams: [('This', 'is'), ('is', 'a'), ('a', 'test')]
Trigrams: [('This', 'is', 'a'), ('is', 'a', 'test')]
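N-grams do not have to be built from words. As a minimal sketch, the same sliding window can be applied to characters; `char_ngrams` is a hypothetical helper written for this illustration, not an NLTK function.

```python
def char_ngrams(text, n):
    """Return all contiguous character n-grams of `text` (illustrative helper)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

word = "test"

# Slide a window of size n over the characters of the word
char_bigrams = char_ngrams(word, 2)
char_trigrams = char_ngrams(word, 3)

print("Character bigrams:", char_bigrams)
print("Character trigrams:", char_trigrams)
```

Character n-grams like these are one common way to stay robust to misspellings and unseen words, since similar spellings share most of their n-grams.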


N-grams, such as bigrams and trigrams, are used in various NLP tasks to capture the context and relationships between words. Here are some common applications:

**Applications of N-grams**
1. Language Modeling: N-grams help in predicting the next word in a sequence. For example, in a bigram model, the probability of a word depends on the previous word. This is useful in applications like text generation and autocomplete.

2. Text Classification: N-grams can be used as features to classify documents. For instance, in spam detection, certain bigrams or trigrams might be more common in spam emails than in legitimate ones.

3. Sentiment Analysis: N-grams capture phrases that convey sentiment. For example, bigrams like "not good" or "very happy" can provide more context than individual words.

4. Spelling Correction: N-grams help in identifying and correcting misspelled words by analyzing the context in which they appear. For example, "hte cat" is a very rare bigram while "the cat" is common, which suggests that "hte" should be corrected to "the".

5. Machine Translation: N-grams are used to maintain the context and fluency of translations. They help in ensuring that translated phrases make sense in the target language.

6. Information Retrieval: Search engines use N-grams to improve the relevance of search results. For example, bigrams and trigrams can help in understanding user queries better.
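The bigram language model from item 1 can be sketched with plain counts: the probability of the next word is estimated as the number of times it follows the previous word, divided by how often the previous word appears as a context. The corpus below is made up for illustration, and `next_word_probs` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Tiny toy corpus, invented for this example only
corpus = [
    "this is a test",
    "this is a place",
    "this was a test",
]

# bigram_counts[prev][nxt] = how often `nxt` directly follows `prev`
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        bigram_counts[prev][nxt] += 1

def next_word_probs(prev):
    """Maximum-likelihood estimate of P(next word | previous word)."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

probs_after_this = next_word_probs("this")
probs_after_a = next_word_probs("a")

print("After 'this':", probs_after_this)
print("After 'a':", probs_after_a)
```

An autocomplete feature could simply suggest the highest-probability entry. A real model would need a much larger corpus and some form of smoothing for bigrams that never occur in the training data.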

In [70]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Sample data
texts = ["This is a place", "Which place do you stay?", "This is a test", "Stay in this place"]
labels = ["statement", "question", "statement", "statement"]

# Create a CountVectorizer with bigrams
vectorizer = CountVectorizer(ngram_range=(2, 2))

# Create a pipeline with the vectorizer and a classifier
model = make_pipeline(vectorizer, MultinomialNB())

# Train the model
model.fit(texts, labels)

# Predict the label of a new text
new_text = ["Where do you stay?"]
predicted_label = model.predict(new_text)
print("Predicted label:", predicted_label)

Predicted label: ['question']
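The point about sentiment phrases such as "not good" can be illustrated without any classifier. In this sketch, two made-up sentences contain exactly the same words, so their unigram counts are identical, yet their bigrams differ. `ngram_counts` is a hypothetical helper, not part of scikit-learn or NLTK.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count word n-grams in `text` after lowercasing and whitespace splitting."""
    tokens = text.lower().split()
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Two invented sentences with opposite sentiment but the same vocabulary
positive = "the food was good not bad"
negative = "the food was bad not good"

same_unigrams = ngram_counts(positive, 1) == ngram_counts(negative, 1)
positive_bigrams = ngram_counts(positive, 2)
negative_bigrams = ngram_counts(negative, 2)

print("Identical unigram counts:", same_unigrams)
print("Bigrams only in the negative sentence:",
      sorted(set(negative_bigrams) - set(positive_bigrams)))
```

A unigram bag-of-words model cannot tell these two sentences apart, while a bigram model sees ("not", "good") in one and ("not", "bad") in the other.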


In [75]:
import nltk
from nltk.translate.bleu_score import sentence_bleu

# Reference and candidate translations
reference = [['this', 'is', 'a', 'test']]
candidate = ['this', 'was', 'a', 'test']

# Calculate BLEU score (default weights: uniform over 1-gram to 4-gram precision)
bleu_score = sentence_bleu(reference, candidate)
print("BLEU Score:", bleu_score)

BLEU Score: 1.0547686614863434e-154
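The score above is effectively zero even though three of the four words match. The sketch below recomputes the clipped (modified) n-gram precisions that BLEU is built on, in plain Python, to show where the zero comes from. `modified_precision` is a hypothetical helper that handles a single reference only. It is not NLTK's implementation.

```python
import math
from collections import Counter

def modified_precision(reference, candidate, n):
    """Return (matched, total) clipped n-gram counts for one reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    # Each candidate n-gram is credited at most as often as it occurs in the reference
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched, sum(cand.values())

reference = ['this', 'is', 'a', 'test']
candidate = ['this', 'was', 'a', 'test']

precisions = {n: modified_precision(reference, candidate, n) for n in range(1, 5)}
for n, (matched, total) in precisions.items():
    print(f"{n}-gram precision: {matched}/{total}")

# Geometric mean of the unigram and bigram precisions only.
# Both sentences have the same length, so no brevity penalty applies.
p1 = precisions[1][0] / precisions[1][1]
p2 = precisions[2][0] / precisions[2][1]
bleu_1_2 = math.exp(0.5 * math.log(p1) + 0.5 * math.log(p2))
print("BLEU over 1-grams and 2-grams:", bleu_1_2)
```

No 3-gram or 4-gram of the candidate appears in the reference, so the geometric mean over all four orders collapses to (almost) zero. For short sentences like this one, `sentence_bleu` accepts a `weights` argument to restrict the n-gram orders and a `smoothing_function` argument to soften zero counts.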


## Word Embeddings

Word embeddings represent words as vectors in a continuous vector space. These vectors capture semantic meanings and relationships between words, making them a powerful tool in NLP.

**Key Concepts**
1. Dense Vectors: Unlike traditional methods like Bag of Words (BoW) or TF-IDF, which create sparse vectors, word embeddings create dense vectors where each dimension captures some aspect of the word's meaning.

2. Contextual Similarity: Words that are similar in meaning are placed closer together in the vector space. For example, "king" and "queen" would have similar vectors.

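3. Dimensionality Reduction: Compared with vocabulary-sized sparse representations such as BoW or one-hot vectors, word embeddings use far fewer dimensions (typically a few hundred at most), which makes them more computationally efficient while preserving semantic relationships.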
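Closeness in the vector space is usually measured with cosine similarity. The sketch below uses hand-made 4-dimensional vectors that are invented purely for illustration. Real embeddings are learned from data and have many more dimensions.

```python
import math

# Toy dense vectors, made up for this example (not learned embeddings)
embeddings = {
    "king":  [0.9, 0.8, 0.1, 0.3],
    "queen": [0.9, 0.7, 0.2, 0.9],
    "apple": [0.1, 0.0, 0.9, 0.4],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_king_queen = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_king_apple = cosine_similarity(embeddings["king"], embeddings["apple"])

print("king vs queen:", round(sim_king_queen, 3))
print("king vs apple:", round(sim_king_apple, 3))
```

With these toy values, "king" scores higher against "queen" than against "apple", which is the behavior one would expect from trained embeddings.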

### Word2Vec

Developed at Google, Word2Vec uses a shallow neural network to learn word associations from a large corpus of text. It comes in two model architectures: Continuous Bag of Words (CBOW) and Skip-gram.
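The two architectures differ in what they predict: CBOW predicts a center word from its surrounding context words, while Skip-gram predicts each context word from the center word. The sketch below only builds the training examples each variant would see and does not train a network. The sentence, the window size, and the `context_windows` helper are assumptions made for illustration.

```python
def context_windows(tokens, window):
    """Yield (center, context_words) for every position in `tokens`."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

tokens = "the quick brown fox".split()
window = 1  # number of words taken on each side; chosen arbitrarily here

# CBOW: (context words) -> center word
cbow_examples = [
    (context, center) for center, context in context_windows(tokens, window)
]

# Skip-gram: center word -> one context word per training pair
skipgram_examples = [
    (center, ctx_word)
    for center, context in context_windows(tokens, window)
    for ctx_word in context
]

print("CBOW examples:", cbow_examples)
print("Skip-gram examples:", skipgram_examples)
```

Skip-gram produces one pair per context word, so it generates more training examples from the same text than CBOW does.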