# One Hot Encoding

In [None]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

### One Hot Encoding --> Using Sklearn

In [None]:
# Read the data
df = pd.read_csv("/content/drive/MyDrive/NLP Techniques/Car_Details.csv")

# One-Hot Encoding for 'fuel', 'seller_type', and 'transmission'
one_hot_encoder = OneHotEncoder(sparse_output=False)
encoded_features = one_hot_encoder.fit_transform(df[["fuel", "seller_type", "transmission"]])
one_hot_df = pd.DataFrame(encoded_features, columns=one_hot_encoder.get_feature_names_out(["fuel", "seller_type", "transmission"]))

# Combine the encoded features with the original DataFrame
df = df.drop(columns=["fuel", "seller_type", "transmission"])
df = pd.concat([df, one_hot_df], axis=1)

# Display the resulting DataFrame
print(df.head())

                       name  year  selling_price  km_driven         owner  \
0             Maruti 800 AC  2007          60000      70000   First Owner   
1  Maruti Wagon R LXI Minor  2007         135000      50000   First Owner   
2      Hyundai Verna 1.6 SX  2012         600000     100000   First Owner   
3    Datsun RediGO T Option  2017         250000      46000   First Owner   
4     Honda Amaze VX i-DTEC  2014         450000     141000  Second Owner   

   fuel_CNG  fuel_Diesel  fuel_Electric  fuel_LPG  fuel_Petrol  \
0       0.0          0.0            0.0       0.0          1.0   
1       0.0          0.0            0.0       0.0          1.0   
2       0.0          1.0            0.0       0.0          0.0   
3       0.0          0.0            0.0       0.0          1.0   
4       0.0          1.0            0.0       0.0          0.0   

   seller_type_Dealer  seller_type_Individual  seller_type_Trustmark Dealer  \
0                 0.0                     1.0                

### One Hot Encoding --> From Scratch

In [None]:
def one_hot_encode(df, columns):
    """
    Perform one-hot encoding on specified columns of a DataFrame.

    Args:
    df (pd.DataFrame): The DataFrame to encode (left unmodified).
    columns (list): A list of columns to one-hot encode.

    Returns:
    pd.DataFrame: A new DataFrame with one-hot encoded columns.
    """
    # Work on a copy so the caller's DataFrame is not mutated
    df = df.copy()

    for column in columns:
        # Get unique values in order of first appearance
        unique_values = df[column].unique()

        # Create a binary column for each unique value
        for value in unique_values:
            df[f"{column}_{value}"] = (df[column] == value).astype(int)

        # Drop the original column
        df = df.drop(columns=column)

    return df

# Read the data
df = pd.read_csv("/content/drive/MyDrive/NLP Techniques/Car_Details.csv")

# Apply one-hot encoding
df = one_hot_encode(df, ["fuel", "seller_type", "transmission"])

# Display the resulting DataFrame
print(df.head())

                       name  year  selling_price  km_driven         owner  \
0             Maruti 800 AC  2007          60000      70000   First Owner   
1  Maruti Wagon R LXI Minor  2007         135000      50000   First Owner   
2      Hyundai Verna 1.6 SX  2012         600000     100000   First Owner   
3    Datsun RediGO T Option  2017         250000      46000   First Owner   
4     Honda Amaze VX i-DTEC  2014         450000     141000  Second Owner   

   fuel_Petrol  fuel_Diesel  fuel_CNG  fuel_LPG  fuel_Electric  \
0            1            0         0         0              0   
1            1            0         0         0              0   
2            0            1         0         0              0   
3            1            0         0         0              0   
4            0            1         0         0              0   

   seller_type_Individual  seller_type_Dealer  seller_type_Trustmark Dealer  \
0                       1                   0                

## Pros and Cons of One-Hot Encoding

### Pros of One-Hot Encoding

- **No Ordinal Relationships**: One-hot encoding does not impose any ordinal relationship between the categories. Each category is treated independently, which is beneficial when there is no inherent order in the categories.

- **Simplicity**: One-hot encoding is straightforward to implement and understand. It creates binary columns that are easy to interpret.


### Cons of One-Hot Encoding

- **Increased Dimensionality**: One-hot encoding can significantly increase the dimensionality of the dataset, especially if there are many unique categories. This can lead to the "curse of dimensionality" and make the model more complex and slower to train.

- **Sparsity**: The resulting matrix is mostly zeros. Stored densely, this wastes memory and computation (sparse matrix formats reduce the cost). With few samples per category, the many near-empty columns can also contribute to overfitting.

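- **Multicollinearity**: The full set of one-hot columns for a variable always sums to 1, so any one column is perfectly predictable from the others (the "dummy variable trap"). This can destabilize the coefficients of unregularized linear models such as linear regression; dropping one column per variable (e.g., `OneHotEncoder(drop="first")`) avoids it.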

- **Not Suitable for High Cardinality**: For categorical variables with high cardinality (many unique values), one-hot encoding can become impractical due to the large number of resulting columns.

- **Interpretability**: While one-hot encoding is simple, the interpretation of the resulting model can become complex, especially when dealing with a large number of categories.

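- **OOV**: Categories or words that were not seen when the encoder was fitted (out-of-vocabulary values) have no column of their own. By default sklearn's `OneHotEncoder` raises an error for them; with `handle_unknown="ignore"` they are encoded as all zeros.

- **Semantic Meaning**: Every pair of distinct one-hot vectors is the same distance apart, so the encoding cannot express that some words or categories are more similar than others.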
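
The multicollinearity, OOV, and equidistance points above can be checked on a tiny example. This is a minimal NumPy sketch, not the sklearn implementation: the `one_hot` helper, its `drop_first` flag, and the three fuel categories are hypothetical names chosen for illustration.

```python
import numpy as np

def one_hot(values, categories, drop_first=False):
    """Encode values against a fixed category list (hypothetical helper).

    Values that are not in `categories` become all-zero rows.
    """
    index = {category: i for i, category in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)), dtype=int)
    for row, value in enumerate(values):
        if value in index:  # unseen categories keep their all-zero row
            encoded[row, index[value]] = 1
    # Dropping the first column removes the exact linear dependence
    return encoded[:, 1:] if drop_first else encoded

categories = ["CNG", "Diesel", "Petrol"]  # illustrative category list
full = one_hot(["Petrol", "Diesel", "CNG"], categories)

# Every row sums to 1, so the columns are linearly dependent
row_sums = full.sum(axis=1)

# Any two distinct one-hot vectors are the same distance apart
dist_petrol_diesel = np.linalg.norm(full[0] - full[1])
dist_petrol_cng = np.linalg.norm(full[0] - full[2])

# An unseen category is encoded as all zeros
unknown = one_hot(["Hydrogen"], categories)

# With the first column dropped, "CNG" becomes the all-zero reference level
reduced = one_hot(["CNG", "Petrol"], categories, drop_first=True)

print(row_sums, dist_petrol_diesel, dist_petrol_cng)
print(unknown)
print(reduced)
```

Note that with `drop_first=True` the reference category and an unseen category both map to all zeros, which is one reason to treat the two options with care when they are combined.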

### When to Use One-Hot Encoding

- Use one-hot encoding when the categorical variable does not have an inherent order and when the number of unique categories is manageable.
- It is particularly useful for algorithms that do not handle categorical data natively, such as linear models and neural networks.


# Label Encoding

### Label Encoding --> Using Sklearn

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
# Read the data
df = pd.read_csv("/content/drive/MyDrive/NLP Techniques/Car_Details.csv")

# Label Encoding for 'owner'
label_encoder = LabelEncoder()
df["owner"] = label_encoder.fit_transform(df["owner"])

print(df.head())

                       name  year  selling_price  km_driven    fuel  \
0             Maruti 800 AC  2007          60000      70000  Petrol   
1  Maruti Wagon R LXI Minor  2007         135000      50000  Petrol   
2      Hyundai Verna 1.6 SX  2012         600000     100000  Diesel   
3    Datsun RediGO T Option  2017         250000      46000  Petrol   
4     Honda Amaze VX i-DTEC  2014         450000     141000  Diesel   

  seller_type transmission  owner  
0  Individual       Manual      0  
1  Individual       Manual      0  
2  Individual       Manual      0  
3  Individual       Manual      0  
4  Individual       Manual      2  


### Label Encoding --> From Scratch

In [None]:
def label_encode(df, column):
    """
    Perform label encoding on a specified column of a DataFrame.

    Args:
    df (pd.DataFrame): The DataFrame to encode.
    column (str): The column to label encode.

    Returns:
    pd.DataFrame: The DataFrame with the label encoded column.
    """
    # Get unique values and sort them
    unique_values = sorted(df[column].unique())

    # Create a mapping from value to integer
    value_to_int = {value: i for i, value in enumerate(unique_values)}

    # Map the values to integers
    df[column] = df[column].map(value_to_int)

    return df

# Read the data
df = pd.read_csv("/content/drive/MyDrive/NLP Techniques/Car_Details.csv")

# Apply label encoding
df = label_encode(df, "owner")

# Display the resulting DataFrame
print(df.head())

                       name  year  selling_price  km_driven    fuel  \
0             Maruti 800 AC  2007          60000      70000  Petrol   
1  Maruti Wagon R LXI Minor  2007         135000      50000  Petrol   
2      Hyundai Verna 1.6 SX  2012         600000     100000  Diesel   
3    Datsun RediGO T Option  2017         250000      46000  Petrol   
4     Honda Amaze VX i-DTEC  2014         450000     141000  Diesel   

  seller_type transmission  owner  
0  Individual       Manual      0  
1  Individual       Manual      0  
2  Individual       Manual      0  
3  Individual       Manual      0  
4  Individual       Manual      2  


## Pros and Cons of Label Encoding

### Pros of Label Encoding

- **Simplicity**: Label encoding is straightforward to implement and understand. It involves assigning a unique integer to each category, which is computationally efficient.

- **Reduced Dimensionality**: Unlike one-hot encoding, label encoding does not increase the dimensionality of the dataset. This can be beneficial when dealing with high-cardinality categorical variables.

- **Memory Efficient**: Label encoding is more memory-efficient compared to one-hot encoding, especially for categorical variables with many unique values.

- **Compatibility with Tree-Based Algorithms**: Label encoding usually works acceptably with tree-based algorithms (like decision trees and random forests), because they split on thresholds and do not assume that the distance between two codes is meaningful. The arbitrary order can still influence which splits are found.

### Cons of Label Encoding

- **Ordinal Relationships**: Label encoding imposes an ordinal relationship between categories, which may not exist. This can mislead algorithms that assume a natural ordering (like linear regression or k-nearest neighbors).

- **Algorithm Sensitivity**: Some algorithms are sensitive to the arbitrary ordering imposed by label encoding, which can lead to poor model performance if the algorithm assumes a linear relationship between the encoded values.

- **Interpretability**: The resulting model can be harder to interpret, as the integer values do not have any inherent meaning related to the original categories.

- **Not Suitable for All Algorithms**: Label encoding is not suitable for algorithms that assume linearity or require numerical input with meaningful distances between values.

### When to Use Label Encoding

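- Use integer encoding when the categorical variable has an inherent order, or when using tree-based algorithms that are less affected by the arbitrary order. Note that `LabelEncoder` assigns codes in sorted (alphabetical) order, which is why "Second Owner" maps to 2 rather than 1 above, and that sklearn documents it as an encoder for target labels; for ordered input features, supply the order explicitly (e.g., with `OrdinalEncoder(categories=...)`).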
- It is particularly useful for reducing dimensionality and memory usage when dealing with high-cardinality categorical variables.
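
The difference between an alphabetical code and a code that follows the real order can be sketched in plain Python. The size labels below are hypothetical and are not taken from the car dataset; the point is only that sorting the labels does not recover their meaning.

```python
# Hypothetical ordered category, used only for illustration
sizes = ["Small", "Large", "Medium", "Small"]

# Alphabetical codes, as produced by sorting the unique labels
alphabetical = {value: i for i, value in enumerate(sorted(set(sizes)))}

# Codes that follow an order supplied by hand (an assumption of this sketch)
explicit_order = ["Small", "Medium", "Large"]
ordinal = {value: i for i, value in enumerate(explicit_order)}

alphabetical_codes = [alphabetical[s] for s in sizes]
ordinal_codes = [ordinal[s] for s in sizes]

print(alphabetical_codes)  # [2, 0, 1, 2] -> "Large" < "Medium" < "Small"
print(ordinal_codes)       # [0, 2, 1, 0] -> "Small" < "Medium" < "Large"
```

With the alphabetical mapping, "Large" receives the smallest code, so a model that reads the integers as magnitudes would see the order reversed.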


# Bag of Words

### Bag of Words --> Using Sklearn

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample corpus
corpus = [
    'The cat sat on the mat.',
    'The dog sat on the log.',
    'Dogs and cats are great pets.'
]

# Create the CountVectorizer object
vectorizer = CountVectorizer()

# Fit and transform the corpus
X = vectorizer.fit_transform(corpus)

# Get the feature names (vocabulary)
feature_names = vectorizer.get_feature_names_out()

# Print the feature names and the transformed vectors
print("Feature Names:", feature_names)
print("Bag of Words Representation:\n", X.toarray())

Feature Names: ['and' 'are' 'cat' 'cats' 'dog' 'dogs' 'great' 'log' 'mat' 'on' 'pets'
 'sat' 'the']
Bag of Words Representation:
 [[0 0 1 0 0 0 0 0 1 1 0 1 2]
 [0 0 0 0 1 0 0 1 0 1 0 1 2]
 [1 1 0 1 0 1 1 0 0 0 1 0 0]]


### Bag of Words --> From Scratch

In [None]:
import numpy as np
from collections import Counter
import re

def tokenize(text):
    # Simple tokenizer using regex to find words
    return re.findall(r'\b\w+\b', text.lower())

def build_vocabulary(corpus):
    # Build a set of unique words from the corpus
    vocabulary = set()
    for document in corpus:
        vocabulary.update(tokenize(document))
    return sorted(vocabulary)

def encode_documents(corpus, vocabulary):
    # Encode each document as a vector of word counts
    word_index = {word: index for index, word in enumerate(vocabulary)}
    features = []
    for document in corpus:
        word_counts = Counter(tokenize(document))
        feature_vector = np.zeros(len(vocabulary))
        for word, count in word_counts.items():
            if word in word_index:
                feature_vector[word_index[word]] = count
        features.append(feature_vector)
    return np.array(features)

# Sample corpus
corpus = [
    'The cat sat on the mat.',
    'The dog sat on the log.',
    'Dogs and cats are great pets.'
]

# Build the vocabulary
vocabulary = build_vocabulary(corpus)

# Encode the documents
features = encode_documents(corpus, vocabulary)

# Print the vocabulary and feature vectors
print("Vocabulary:", vocabulary)
print("Bag of Words Representation:\n", features)

Vocabulary: ['and', 'are', 'cat', 'cats', 'dog', 'dogs', 'great', 'log', 'mat', 'on', 'pets', 'sat', 'the']
Bag of Words Representation:
 [[0. 0. 1. 0. 0. 0. 0. 0. 1. 1. 0. 1. 2.]
 [0. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 1. 2.]
 [1. 1. 0. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0.]]


## Pros and Cons of Bag of Words

### Pros

- **Simplicity**: The BoW model is straightforward to implement and understand. It converts text into numerical features that can be easily used with various machine learning algorithms.

- **Efficiency**: It is computationally efficient, especially for large datasets, as it involves simple counting operations.

- **Scalability**: The model can scale well with large corpora and is suitable for tasks like document classification and information retrieval.

- **Compatibility**: It works well with traditional machine learning algorithms that require fixed-length feature vectors, such as logistic regression, SVMs, and naive Bayes classifiers.

- **Baseline Performance**: Despite its simplicity, BoW often provides a strong baseline performance for many text classification tasks.

### Cons

- **Loss of Context**: The BoW model ignores the order of words, which means it loses contextual information. This can be a significant limitation for tasks where word order is important, such as sentiment analysis or language translation.

- **Sparse Representation**: The resulting feature vectors are often sparse, especially with large vocabularies. This can lead to inefficiencies in storage and computation.

- **Ignoring Semantics**: It treats all words independently and does not capture semantic relationships between words. For example, "king" and "queen" are treated as completely different features despite their semantic similarity.

- **High Dimensionality**: The dimensionality of the feature space can be very high, equal to the size of the vocabulary. This can lead to the "curse of dimensionality" and overfitting, especially with small datasets.

- **No Handling of Synonyms and Polysemy**: The model does not handle synonyms (different words with similar meanings) or polysemy (words with multiple meanings) effectively.

- **Limited to Known Words**: It can only represent words that are present in the training corpus. Out-of-vocabulary words are not handled well.
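
The out-of-vocabulary limitation can be seen by encoding a new sentence against the vocabulary built from the sample corpus. This is a minimal sketch in plain Python; the `encode` helper and the new sentence are assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of word characters
    return re.findall(r"\b\w+\b", text.lower())

corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Dogs and cats are great pets.",
]
vocabulary = sorted({word for doc in corpus for word in tokenize(doc)})

def encode(document, vocabulary):
    # Count only the words that have a slot in the vocabulary
    counts = Counter(tokenize(document))
    return [counts.get(word, 0) for word in vocabulary]

new_document = "The bird sat on the mat."  # "bird" never occurred in the corpus
vector = encode(new_document, vocabulary)
unseen = [word for word in tokenize(new_document) if word not in vocabulary]

print(vector)  # [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 2]
print(unseen)  # ['bird']
```

The word "bird" leaves no trace in the vector, so the new sentence is indistinguishable from "The sat on the mat."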

### Conclusion

- The Bag of Words model is a useful starting point for many NLP tasks due to its simplicity and efficiency.
- Because word order is ignored and a single extra word barely changes the counts, these two sentences with opposite meanings get nearly identical vectors:

      A. This is a very good movie

      B. This is not a very good movie
- However, its limitations in capturing context and semantics often necessitate the use of more advanced techniques, such as word embeddings (e.g., Word2Vec, GloVe) or transformer-based models (e.g., BERT), for more complex and nuanced language understanding tasks.
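
How close sentences A and B end up can be measured with cosine similarity on their count vectors. This is a small sketch using only the standard library; the `cosine_similarity` helper is written here for illustration and, unlike `CountVectorizer`'s default tokenizer, the regex below keeps one-letter words such as "a".

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    # Word counts for one sentence
    return Counter(re.findall(r"\b\w+\b", text.lower()))

def cosine_similarity(a, b):
    # Dot product over shared words divided by the product of vector lengths
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b)

sentence_a = bag_of_words("This is a very good movie")
sentence_b = bag_of_words("This is not a very good movie")
similarity = cosine_similarity(sentence_a, sentence_b)

print(round(similarity, 4))  # 0.9258
```

The similarity equals 6 / sqrt(6 * 7), so the single word "not" moves the vectors only slightly even though it reverses the meaning.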


# N-grams

## Implementation using Sklearn

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Sample text data
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "Never jump over the lazy dog quickly."
]

# CountVectorizer with n-grams
vectorizer = CountVectorizer(ngram_range=(1, 2))  # Unigrams and bigrams
X = vectorizer.fit_transform(corpus)

# Print the feature names (n-grams)
print("Feature names (n-grams):", vectorizer.get_feature_names_out())

# Print the vectorized output
print("Vectorized output:\n", X.toarray())

# Alternatively, use TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf = tfidf_vectorizer.fit_transform(corpus)

# Print the TF-IDF vectorized output
print("TF-IDF Vectorized output:\n", X_tfidf.toarray())

Feature names (n-grams): ['brown' 'brown fox' 'dog' 'dog quickly' 'fox' 'fox jumps' 'jump'
 'jump over' 'jumps' 'jumps over' 'lazy' 'lazy dog' 'never' 'never jump'
 'over' 'over the' 'quick' 'quick brown' 'quickly' 'the' 'the lazy'
 'the quick']
Vectorized output:
 [[1 1 1 0 1 1 0 0 1 1 1 1 0 0 1 1 1 1 0 2 1 1]
 [0 0 1 1 0 0 1 1 0 0 1 1 1 1 1 1 0 0 1 1 1 0]]
TF-IDF Vectorized output:
 [[0.26666724 0.26666724 0.18973594 0.         0.26666724 0.26666724
  0.         0.         0.26666724 0.26666724 0.18973594 0.18973594
  0.         0.         0.18973594 0.18973594 0.26666724 0.26666724
  0.         0.37947187 0.18973594 0.26666724]
 [0.         0.         0.23031454 0.32369906 0.         0.
  0.32369906 0.32369906 0.         0.         0.23031454 0.23031454
  0.32369906 0.32369906 0.23031454 0.23031454 0.         0.
  0.32369906 0.23031454 0.23031454 0.        ]]


## N-grams --> From Scratch

In [None]:
import re
from collections import Counter

def extract_ngrams(text, n, mode='word', remove_punctuation=True):
    """
    Extracts n-grams from a given text.

    Parameters:
    - text: str : The input text.
    - n: int : The number of items in each n-gram.
    - mode: str : 'word' for word n-grams, 'char' for character n-grams.
    - remove_punctuation: bool : Whether to remove punctuation from the text.

    Returns:
    - ngrams: list of tuples : The extracted n-grams.
    - ngram_freq: Counter : The frequency count of each n-gram.
    """

    if remove_punctuation:
        # Remove punctuation using regex
        text = re.sub(r'[^\w\s]', '', text)

    if mode == 'word':
        # Tokenize the text into words
        tokens = text.split()
    elif mode == 'char':
        # Tokenize the text into characters
        tokens = list(text)
    else:
        raise ValueError("Mode should be 'word' or 'char'.")

    # Generate n-grams
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    # Count the frequency of each n-gram
    ngram_freq = Counter(ngrams)

    return ngrams, ngram_freq

# Example usage
text = "The quick brown fox jumps over the lazy dog. The dog is very lazy."
n = 2  # Bigrams
mode = 'word'  # Word n-grams
ngrams, ngram_freq = extract_ngrams(text, n, mode, remove_punctuation=True)

print("Extracted n-grams:", ngrams)
print("N-gram frequencies:", dict(ngram_freq))

Extracted n-grams: [('The', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps'), ('jumps', 'over'), ('over', 'the'), ('the', 'lazy'), ('lazy', 'dog'), ('dog', 'The'), ('The', 'dog'), ('dog', 'is'), ('is', 'very'), ('very', 'lazy')]
N-gram frequencies: {('The', 'quick'): 1, ('quick', 'brown'): 1, ('brown', 'fox'): 1, ('fox', 'jumps'): 1, ('jumps', 'over'): 1, ('over', 'the'): 1, ('the', 'lazy'): 1, ('lazy', 'dog'): 1, ('dog', 'The'): 1, ('The', 'dog'): 1, ('dog', 'is'): 1, ('is', 'very'): 1, ('very', 'lazy'): 1}


## Pros and Cons of N-grams

### Pros of Using N-grams for Text Representation

- **Context Capture:**

N-grams capture local word order and context, which can be beneficial for tasks that require understanding of word sequences, such as language modeling and machine translation.

- **Simplicity:**

N-grams are straightforward to implement and understand. They don't require complex neural network architectures or extensive computational resources.

- **Flexibility:**

You can easily adjust the value of \( n \) to capture different levels of context. For example, unigrams capture single words, bigrams capture pairs of words, and so on.

### Cons of Using N-grams for Text Representation

- **Dimensionality:**

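The number of possible n-grams grows exponentially with \( n \) (up to \( V^n \) for a vocabulary of \( V \) words). Only n-grams that actually occur in the corpus become features, but the result is still high-dimensional and sparse vectors, which can be computationally expensive and challenging to manage.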

- **Data Sparsity:**

As \( n \) increases, the number of possible n-grams grows rapidly, leading to data sparsity issues. This can be problematic for tasks with limited training data.

- **Lack of Semantic Meaning:**

N-grams do not capture semantic relationships between words. For example, "king" and "queen" might have similar meanings but different n-gram representations.

- **Fixed Context Window:**

N-grams have a fixed context window size, which may not be sufficient to capture long-range dependencies in text. This can be a limitation for tasks that require understanding of broader context.

- **Memory and Storage:**

Storing and processing high-dimensional n-gram vectors can be memory-intensive, especially for large text corpora.

- **Out-of-Vocabulary Words:**

N-grams can struggle with out-of-vocabulary words, as they rely on exact matches of word sequences. This can be mitigated by using character n-grams, but it's still a challenge.
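
The character n-gram mitigation mentioned above can be sketched as follows. The word pair and the `char_ngrams` helper are assumptions for illustration: a word-level model that has seen only "quick" finds no match for "quickly", while the two words share most of their character trigrams.

```python
def char_ngrams(word, n=3):
    # Set of all contiguous character n-grams in the word
    return {word[i:i + n] for i in range(len(word) - n + 1)}

seen_word, unseen_word = "quick", "quickly"

word_level_match = seen_word == unseen_word
shared = char_ngrams(seen_word) & char_ngrams(unseen_word)
union = char_ngrams(seen_word) | char_ngrams(unseen_word)
jaccard = len(shared) / len(union)

print(word_level_match)  # False
print(sorted(shared))    # ['ick', 'qui', 'uic']
print(jaccard)           # 0.6
```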

### Conclusion

N-grams offer a simple and effective way to capture local context in text data, but they come with limitations in terms of dimensionality, data sparsity, and semantic understanding.
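
To make the dimensionality and sparsity trade-off concrete, the sketch below counts the distinct n-grams observed in the example sentence from the from-scratch section and compares them with the number of n-grams that are possible in principle. The text is lowercased here, which is an assumption of this sketch and differs from the `extract_ngrams` example.

```python
import re

text = "The quick brown fox jumps over the lazy dog. The dog is very lazy."
tokens = re.findall(r"\b\w+\b", text.lower())
vocab_size = len(set(tokens))

observed_counts = {}
possible_counts = {}
for n in (1, 2, 3):
    observed = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    observed_counts[n] = len(observed)
    possible_counts[n] = vocab_size ** n  # upper bound on distinct n-grams

print(observed_counts)  # {1: 10, 2: 13, 3: 12}
print(possible_counts)  # {1: 10, 2: 100, 3: 1000}
```

Almost every bigram and trigram occurs exactly once, while the space of possible n-grams grows tenfold with each step, which is the data sparsity problem in miniature.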


# TF-IDF

## Term Frequency-Inverse Document Frequency (TF-IDF)

Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). It is often used in information retrieval and text mining as a weighting factor for the terms in a document vector.

### TF-IDF Components

#### Term Frequency (TF)
Measures how frequently a term occurs in a document. It is calculated as:

$$
\text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
$$




#### Inverse Document Frequency (IDF)
Measures how rare, and therefore how informative, a term is across the entire corpus. It is calculated as:

$$
\text{IDF}(t, D) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing term } t} \right)
$$


#### TF-IDF
Combines TF and IDF to give a score that highlights the importance of a term in a document relative to the corpus. It is calculated as:

$$
\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)
$$



In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample corpus
documents = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Dogs and cats are common pets.",
    "Cats and dogs are good companions."
]

# Create the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Get the feature names (words)
feature_names = tfidf_vectorizer.get_feature_names_out()

# Convert the TF-IDF matrix to a dense format for better readability
tfidf_dense = tfidf_matrix.toarray()

# Output the TF-IDF matrix and feature names
print(tfidf_dense)
print(feature_names)

[[0.         0.         0.41777218 0.         0.         0.
  0.         0.         0.         0.         0.41777218 0.32937638
  0.         0.32937638 0.65875277]
 [0.         0.         0.         0.         0.         0.
  0.41777218 0.         0.         0.41777218 0.         0.32937638
  0.         0.32937638 0.65875277]
 [0.37222485 0.37222485 0.         0.37222485 0.47212003 0.
  0.         0.37222485 0.         0.         0.         0.
  0.47212003 0.         0.        ]
 [0.37222485 0.37222485 0.         0.37222485 0.         0.47212003
  0.         0.37222485 0.47212003 0.         0.         0.
  0.         0.         0.        ]]
['and' 'are' 'cat' 'cats' 'common' 'companions' 'dog' 'dogs' 'good' 'log'
 'mat' 'on' 'pets' 'sat' 'the']
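
The numbers above do not follow directly from the textbook formulas, because `TfidfVectorizer` with default settings uses raw term counts as TF, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and then scales each row to unit L2 length. The sketch below reproduces the first row with the standard library only; the helper name `tfidf_row` is an assumption, and the regex mirrors the default token pattern, which keeps tokens of two or more characters.

```python
import math
import re
from collections import Counter

documents = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Dogs and cats are common pets.",
    "Cats and dogs are good companions.",
]

# Lowercase and keep tokens with at least two word characters
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed IDF (sklearn default, smooth_idf=True)
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def tfidf_row(tokens):
    # Raw count times IDF, followed by L2 normalization of the row
    counts = Counter(tokens)
    raw = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

first_row = tfidf_row(tokenized[0])
print({term: round(value, 8) for term, value in sorted(first_row.items())})
```

The printed weights for "cat", "mat", "on", "sat", and "the" match the non-zero entries of the first row in the matrix above.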


## TF-IDF --> From Scratch

In [2]:
import re
import pandas as pd
from collections import Counter
from math import log

# Sample corpus
documents = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Dogs and cats are common pets.",
    "Cats and dogs are good companions."
]

def tokenize(doc):
    # Lowercase and strip punctuation so "The"/"the" and "mat."/"mat" count as one term
    return re.findall(r"\b\w+\b", doc.lower())

# Step 1: Calculate Term Frequency (TF)
def compute_tf(documents):
    tf_matrix = []
    for doc in documents:
        # Count the frequency of each word in the document
        tokens = tokenize(doc)
        word_count = Counter(tokens)
        total_words = len(tokens)
        # Compute TF for each word
        tf_scores = {word: count / total_words for word, count in word_count.items()}
        tf_matrix.append(tf_scores)
    return tf_matrix

# Step 2: Calculate Inverse Document Frequency (IDF)
def compute_idf(documents):
    doc_counts = {}
    total_docs = len(documents)
    # Count the number of documents that contain each word
    for doc in documents:
        for word in set(tokenize(doc)):
            doc_counts[word] = doc_counts.get(word, 0) + 1
    # Compute IDF for each word (natural logarithm)
    return {word: log(total_docs / count) for word, count in doc_counts.items()}

# Step 3: Compute TF-IDF
def compute_tfidf(tf_matrix, idf_scores):
    tfidf_matrix = []
    for tf_scores in tf_matrix:
        tfidf_scores = {word: tf * idf_scores.get(word, 0) for word, tf in tf_scores.items()}
        tfidf_matrix.append(tfidf_scores)
    return tfidf_matrix

# Compute TF, IDF, and TF-IDF
tf_matrix = compute_tf(documents)
idf_scores = compute_idf(documents)
tfidf_matrix = compute_tfidf(tf_matrix, idf_scores)

# Convert the TF-IDF matrix to a DataFrame for better readability.
# Reorder the columns by name; assigning sorted names to `columns` would mislabel them.
tfidf_df = pd.DataFrame(tfidf_matrix).fillna(0)
tfidf_df = tfidf_df[sorted(tfidf_df.columns)]
tfidf_df = tfidf_df.T

tfidf_df

Unnamed: 0,0,1,2,3
and,0.0,0.0,0.115525,0.115525
are,0.0,0.0,0.115525,0.115525
cat,0.231049,0.0,0.0,0.0
cats,0.0,0.0,0.115525,0.115525
common,0.0,0.0,0.231049,0.0
companions,0.0,0.0,0.0,0.231049
dog,0.0,0.231049,0.0,0.0
dogs,0.0,0.0,0.115525,0.115525
good,0.0,0.0,0.0,0.231049
log,0.0,0.231049,0.0,0.0
mat,0.231049,0.0,0.0,0.0
on,0.115525,0.115525,0.0,0.0
pets,0.0,0.0,0.231049,0.0
sat,0.115525,0.115525,0.0,0.0
the,0.231049,0.231049,0.0,0.0


## Pros and Cons of TF-IDF

### Pros of TF-IDF

- **Simplicity**: TF-IDF is straightforward to implement and understand. It's based on simple mathematical concepts and doesn't require complex computations.

- **Effectiveness**: It's effective for many text mining and information retrieval tasks, such as keyword extraction, document similarity, and feature extraction for machine learning models.

- **Scalability**: TF-IDF can handle large datasets efficiently. It scales well with the size of the corpus and the number of documents.

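- **Down-weights Common Words**: Terms that appear in most documents receive low scores, so uninformative words contribute little. TF-IDF does not reduce the dimensionality of the feature space by itself (it still has one feature per vocabulary term), but the scores can guide feature selection.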

- **Interpretability**: The resulting scores are interpretable, making it easy to understand the importance of words in a document relative to the corpus.

### Cons of TF-IDF

- **Sparse Representation**: TF-IDF matrices are often sparse, especially with large vocabularies. This can lead to inefficiencies in storage and computation.

- **Ignoring Semantics**: TF-IDF does not capture the semantic meaning of words. It treats words independently and does not consider the context or relationships between words.

- **Sensitivity to Corpus Size**: The IDF component is sensitive to the size and nature of the corpus. Adding or removing documents can change the IDF values significantly.

- **Assumes Word Independence**: TF-IDF assumes that words are independent, which is not always true in natural language. It doesn't account for phrases or multi-word expressions.

- **Ignores Word Order**: TF-IDF does not consider the order of words in a document, which can be crucial for understanding the meaning of text.

- **Not Suitable for Short Texts**: TF-IDF may not perform well with very short texts, as there is not enough context to determine the importance of words accurately.
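
The sensitivity of IDF to the corpus can be illustrated with the unsmoothed formula from the components section. In this sketch the documents are written as hand-tokenized sets of words, and the `idf` helper is an assumption for illustration.

```python
import math

def idf(term, documents):
    # log(total documents / documents containing the term), natural log
    containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / containing)

small_corpus = [
    {"the", "cat", "sat", "on", "mat"},
    {"the", "dog", "sat", "on", "log"},
]
large_corpus = small_corpus + [
    {"dogs", "and", "cats", "are", "common", "pets"},
    {"cats", "and", "dogs", "are", "good", "companions"},
]

idf_cat_small = idf("cat", small_corpus)
idf_cat_large = idf("cat", large_corpus)
idf_the_small = idf("the", small_corpus)
idf_the_large = idf("the", large_corpus)

print(round(idf_cat_small, 4), round(idf_cat_large, 4))  # 0.6931 1.3863
print(round(idf_the_small, 4), round(idf_the_large, 4))  # 0.0 0.6931
```

Adding two documents doubles the IDF of "cat" and lifts "the" from zero to a positive weight, even though the original documents did not change.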


# **Word2Vec**

Word2Vec is a popular technique used in natural language processing (NLP) to create word embeddings, which are dense vector representations of words. These embeddings capture semantic similarity between words, meaning that words with similar meanings will have similar vector representations. Word2Vec was developed by Tomas Mikolov and his team at Google in 2013.

## Key Concepts of Word2Vec

### Word Embeddings

- Word embeddings are numerical representations of words in a continuous vector space.
- They capture semantic and syntactic similarity, meaning that words with similar meanings or grammatical roles will have similar vector representations.

### Architecture

Word2Vec uses a neural network model to learn word associations from a large corpus of text. There are two main architectures: Continuous Bag of Words (CBOW) and Skip-gram.

### Training Process

- The model is trained using a large corpus of text.
- During training, the model adjusts the weights (embeddings) to maximize the probability of predicting the correct words.
- The training objective is to maximize the likelihood of the context words given the target word (Skip-gram) or vice versa (CBOW).

### Vector Space

- The resulting word vectors are positioned in a high-dimensional space (typically 100-300 dimensions).
- Words with similar meanings are closer to each other in this vector space.
- Mathematical operations can be performed on these vectors to capture semantic relationships (e.g., "King" - "Man" + "Woman" ≈ "Queen").
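
The vector arithmetic can be sketched with hand-made vectors. The 2-dimensional values below are toy numbers invented for illustration (first axis roughly "royalty", second axis roughly "gender"); they are not trained embeddings, and real Word2Vec analogies hold only approximately.

```python
import numpy as np

# Toy vectors chosen by hand, not learned from data
vectors = {
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# "King" - "Man" + "Woman"
query = vectors["king"] - vectors["man"] + vectors["woman"]

# Rank the remaining words by cosine similarity to the query vector
candidates = {w: v for w, v in vectors.items() if w not in {"king", "man", "woman"}}
nearest = max(candidates, key=lambda word: cosine(query, candidates[word]))

print(nearest)  # queen
```

Excluding the three input words from the candidates follows the usual convention for analogy queries; without it, the nearest neighbour is often one of the inputs.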

### Applications

- Word2Vec embeddings are used in various NLP tasks such as sentiment analysis, machine translation, and text classification.
- They provide a way to represent words in a format that can be used as input for machine learning models.

### Limitations

- Word2Vec does not capture polysemy (multiple meanings of a word) well, as it generates a single vector for each word.
- It does not consider the context beyond the window size.
- It does not handle out-of-vocabulary words (words not seen during training).

## Example

Consider the sentence: "The quick brown fox jumps over the lazy dog."

- In CBOW with a window size of 2, given the context words ["quick", "brown", "jumps", "over"], the model predicts the target word "fox".
- In Skip-gram with the same window, given the target word "fox", the model predicts each of the context words ["quick", "brown", "jumps", "over"].
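
How the two architectures turn a sentence into training examples can be sketched as follows, assuming a symmetric window of 2 words on each side and lowercased tokens. The `training_pairs` helper is a hypothetical name and is not part of any Word2Vec library.

```python
def training_pairs(tokens, window_size=2):
    """Build CBOW (context -> target) and Skip-gram (target -> context) examples."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        start = max(0, i - window_size)
        # Words to the left and right of the target, excluding the target itself
        context = tokens[start:i] + tokens[i + 1:i + 1 + window_size]
        cbow.append((context, target))
        skipgram.extend((target, context_word) for context_word in context)
    return cbow, skipgram

tokens = "the quick brown fox jumps over the lazy dog".split()
cbow, skipgram = training_pairs(tokens)
fox_pairs = [pair for pair in skipgram if pair[0] == "fox"]

print(cbow[3])    # (['quick', 'brown', 'jumps', 'over'], 'fox')
print(fox_pairs)  # [('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumps'), ('fox', 'over')]
```

CBOW produces one example per position, whereas Skip-gram produces one example per (target, context) pair, which is why Skip-gram sees more training examples from the same text.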

Word2Vec has been instrumental in advancing the field of NLP by providing a way to represent words in a continuous vector space that captures semantic meaning.


## **Continuous Bag of Words (CBOW)**

### Objective

- The CBOW model aims to predict a target word from a window of surrounding context words.
- It maximizes the probability of the target word given the context words.

### Architecture

- The input to the model is a set of context words, and the output is the target word.
- The context words are represented as a "bag of words," meaning the order of words does not matter.
- The model uses a neural network with a single hidden layer to predict the target word.

### Training

- During training, the model adjusts the weights (word vectors) to maximize the likelihood of predicting the correct target word.
- The context words are fed into the network, and the output is compared to the actual target word.
- The error is backpropagated to update the word vectors.
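
The forward pass and the backpropagated error can be written out for a single training example. This is a minimal NumPy sketch under assumed sizes (6 words, 3 dimensions) and arbitrary context and target indices; it derives the gradients of the softmax cross-entropy loss and compares two of them with finite differences as a sanity check.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 6, 3                   # assumed toy sizes
W1 = rng.normal(size=(vocab_size, embedding_dim))  # input embeddings
W2 = rng.normal(size=(embedding_dim, vocab_size))  # output weights
context_indices = np.array([1, 2, 4])              # arbitrary context word ids
target_index = 3                                   # arbitrary target word id

def loss_fn(W1, W2):
    # Average the context embeddings, score every word, take -log p(target)
    hidden = W1[context_indices].mean(axis=0)
    scores = hidden @ W2
    scores = scores - scores.max()
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[target_index]

# Analytic gradients
hidden = W1[context_indices].mean(axis=0)
scores = hidden @ W2
probs = np.exp(scores - scores.max())
probs /= probs.sum()
error = probs.copy()
error[target_index] -= 1                           # softmax output minus one-hot target
grad_W2 = np.outer(hidden, error)                  # every output column gets a gradient
grad_hidden = W2 @ error
grad_W1_row = grad_hidden / len(context_indices)   # same for each context word

def numeric_grad(matrix, i, j, eps=1e-6):
    # Central finite difference on one entry of W1 or W2
    original = matrix[i, j]
    matrix[i, j] = original + eps
    plus = loss_fn(W1, W2)
    matrix[i, j] = original - eps
    minus = loss_fn(W1, W2)
    matrix[i, j] = original
    return (plus - minus) / (2 * eps)

numeric_W2 = numeric_grad(W2, 0, target_index)
numeric_W1 = numeric_grad(W1, context_indices[0], 0)
print(np.isclose(numeric_W2, grad_W2[0, target_index], atol=1e-5))
print(np.isclose(numeric_W1, grad_W1_row[0], atol=1e-5))
```

A gradient-descent step then subtracts `learning_rate * grad_W2` from `W2` and `learning_rate * grad_W1_row` from each context row of `W1`.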

### Advantages

- CBOW is faster to train than Skip-gram because it makes one prediction per context window instead of one per (target, context) pair.
- It tends to represent frequent words well.

### Disadvantages

- CBOW may not capture rare words as effectively as Skip-gram because it relies on the context words to predict the target word.


In [4]:
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Sample data
sentences = [
    ['the', 'quick', 'brown', 'fox'],
    ['jumps', 'over', 'the', 'lazy', 'dog']
]

# Hyperparameters
window_size = 2
embedding_dim = 5
epochs = 1000
learning_rate = 0.05

# Preprocess data
label_encoder = LabelEncoder()
all_words = [word for sentence in sentences for word in sentence]
label_encoder.fit(all_words)
vocab_size = len(label_encoder.classes_)

# Initialize weights
W1 = np.random.rand(vocab_size, embedding_dim)
W2 = np.random.rand(embedding_dim, vocab_size)

# Training CBOW
for epoch in range(epochs):
    for sentence in sentences:
        for target_idx in range(len(sentence)):
            context_words = []
            for j in range(max(0, target_idx - window_size), min(len(sentence), target_idx + window_size + 1)):
                if j != target_idx:
                    context_words.append(sentence[j])

            if not context_words:
                continue

            context_indices = label_encoder.transform(context_words)
            target_index = label_encoder.transform([sentence[target_idx]])[0]

            # Forward pass
            context_vectors = W1[context_indices]
            hidden_layer = np.mean(context_vectors, axis=0)
            output_layer = np.dot(W2.T, hidden_layer)

            # Softmax to get probabilities
            exp_output_layer = np.exp(output_layer - np.max(output_layer))
            probs = exp_output_layer / np.sum(exp_output_layer)

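            # Backpropagation (softmax + cross-entropy): error = probs - one_hot(target)
            error = probs.copy()
            error[target_index] -= 1
            grad_hidden = np.dot(W2, error)  # gradient w.r.t. the hidden layer, taken before W2 changes
            W2 -= learning_rate * np.outer(hidden_layer, error)
            # Each context vector gets an equal share of the hidden-layer gradient
            np.subtract.at(W1, context_indices, learning_rate * grad_hidden / len(context_indices))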

# Output the word embeddings
word_embeddings = {word: W1[idx] for word, idx in zip(label_encoder.classes_, range(vocab_size))}
print("CBOW Word Embeddings:", word_embeddings)


CBOW Word Embeddings: {np.str_('brown'): array([-0.00904331, -0.05148498, -0.05104738, -0.0195077 , -0.01440888]), np.str_('dog'): array([ 0.11481653, -0.05281729,  0.04417411, -0.18932073,  0.02775146]), np.str_('fox'): array([-0.07818388,  0.09821605, -0.04733932,  0.16028635,  0.05183857]), np.str_('jumps'): array([ 0.12595104,  0.03393909,  0.06901787,  0.05164945, -0.06299382]), np.str_('lazy'): array([-0.14233399,  0.01974778, -0.07652036,  0.05914171,  0.05738064]), np.str_('over'): array([-0.13777552,  0.06394807, -0.06190092,  0.17926208,  0.01165647]), np.str_('quick'): array([0.00428204, 0.03786512, 0.05040596, 0.00489799, 0.00931386]), np.str_('the'): array([ 0.09059586, -0.0571445 ,  0.04824039, -0.1150323 , -0.03578787])}


## **Skip-gram**

### Objective

- The Skip-gram model aims to predict the context words given a target word.
- It maximizes the probability of the context words given the target word.
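
A rough formalization of this objective, using notation that does not appear in the original text (\( T \) is the number of positions in the corpus, \( c \) the window size, \( v_w \) the input vector of word \( w \), \( u_w \) its output vector, and \( V \) the vocabulary):

\[
J_{\text{SG}} = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p\left(w_{t+j} \mid w_t\right),
\qquad
p(w_O \mid w_I) = \frac{\exp\left(u_{w_O}^{\top} v_{w_I}\right)}{\sum_{w \in V} \exp\left(u_w^{\top} v_{w_I}\right)}
\]

Here \( w_I \) is the input (target) word and \( w_O \) is one of its context words.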

### Architecture

- The input to the model is the target word, and the output is the context words within a certain window size.
- The model uses a neural network with a single hidden layer to predict multiple context words.
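
As an illustration of this input/output structure, the sketch below lists the (target, context) pairs that Skip-gram trains on. The helper name `skipgram_pairs` is hypothetical.

```python
def skipgram_pairs(sentence, window_size):
    """Return every (target_word, context_word) pair within the window."""
    pairs = []
    for i, target in enumerate(sentence):
        start = max(0, i - window_size)
        end = min(len(sentence), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, sentence[j]))
    return pairs


pairs = skipgram_pairs(['the', 'quick', 'brown'], window_size=1)
for target, context in pairs:
    print(target, '->', context)
```

Each target word appears once per neighbour, so a single position in the sentence contributes several training pairs.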

### Training

- During training, the model adjusts the weights (word vectors) to maximize the likelihood of predicting the correct context words.
- The target word is fed into the network, and the output is compared to the actual context words.
- The error is backpropagated to update the word vectors.

### Advantages

- Skip-gram represents rare words and phrases more effectively, and it is generally reported to work well even with a small amount of training data.
- It can handle a larger vocabulary size compared to CBOW.

### Disadvantages

- Skip-gram is slower to train than CBOW because each target word produces one prediction per context word rather than one prediction per window.


In [9]:
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Sample data
sentences = [
    ['the', 'quick', 'brown', 'fox'],
    ['jumps', 'over', 'the', 'lazy', 'dog']
]

# Hyperparameters
window_size = 2
embedding_dim = 5
epochs = 1000
learning_rate = 0.05

# Preprocess data
label_encoder = LabelEncoder()
all_words = [word for sentence in sentences for word in sentence]
label_encoder.fit(all_words)
vocab_size = len(label_encoder.classes_)

# Initialize weights
W1 = np.random.rand(vocab_size, embedding_dim)
W2 = np.random.rand(embedding_dim, vocab_size)

# Training Skip-gram
for epoch in range(epochs):
    for sentence in sentences:
        for target_idx in range(len(sentence)):
            target_word = sentence[target_idx]
            target_index = label_encoder.transform([target_word])[0]

            context_words = []
            for j in range(max(0, target_idx - window_size), min(len(sentence), target_idx + window_size + 1)):
                if j != target_idx:
                    context_words.append(sentence[j])

            if not context_words:
                continue

            context_indices = label_encoder.transform(context_words)

            # Forward pass
            target_vector = W1[target_index]
            output_layer = np.dot(W2.T, target_vector)

            # Softmax to get probabilities
            exp_output_layer = np.exp(output_layer - np.max(output_layer))
            probs = exp_output_layer / np.sum(exp_output_layer)

            # Backpropagation: sum the softmax errors (probs - one_hot) over all context words
            error = len(context_indices) * probs
            np.subtract.at(error, context_indices, 1)
            grad_hidden = np.dot(W2, error)  # gradient w.r.t. the target vector, taken before W2 changes
            W2 -= learning_rate * np.outer(target_vector, error)
            W1[target_index] -= learning_rate * grad_hidden

# Output the word embeddings
word_embeddings = {word: W1[idx] for word, idx in zip(label_encoder.classes_, range(vocab_size))}
print("Skip-gram Word Embeddings:", word_embeddings)

Skip-gram Word Embeddings: {np.str_('brown'): array([ 0.09126157, -0.0405691 ,  0.3118307 ,  0.09852563, -0.00686632]), np.str_('dog'): array([ 0.12044257, -0.0464454 , -0.15187716,  0.08494878,  0.09691155]), np.str_('fox'): array([0.36346851, 0.35255664, 0.14547659, 0.33697078, 0.1187854 ]), np.str_('jumps'): array([ 0.25497921,  0.17467231, -0.25533569, -0.36104801,  0.29881799]), np.str_('lazy'): array([0.34726189, 0.13480416, 0.31155007, 0.0880946 , 0.08570625]), np.str_('over'): array([-0.06939735,  0.03165416, -0.11039496,  0.30406574,  0.46997456]), np.str_('quick'): array([ 0.23636718, -0.2558761 ,  0.10151488,  0.14332214,  0.32082459]), np.str_('the'): array([0.25066364, 0.23073517, 0.50548869, 0.16661775, 0.56623956])}


### Comparison of CBOW and Skip-gram

#### Context Window

- Both models use a context window to define the range of context words.
- The window size can be adjusted to include more or fewer context words.

#### Training Time

- CBOW is generally faster to train because it makes one prediction per context window.
- Skip-gram makes one prediction per (target, context) pair, which makes it slower to train.
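
The difference in workload can be checked by counting predictions on the toy corpus used earlier. This is only a counting sketch; the helper `context_size` is hypothetical.

```python
sentences = [
    ['the', 'quick', 'brown', 'fox'],
    ['jumps', 'over', 'the', 'lazy', 'dog'],
]
window_size = 2


def context_size(sentence, i, window_size):
    """Number of context words available around position i."""
    start = max(0, i - window_size)
    end = min(len(sentence), i + window_size + 1)
    return end - start - 1


# CBOW: one prediction per position that has at least one neighbour.
cbow_predictions = sum(
    1 for s in sentences for i in range(len(s)) if context_size(s, i, window_size) > 0
)
# Skip-gram: one prediction per (target, context) pair.
skipgram_predictions = sum(
    context_size(s, i, window_size) for s in sentences for i in range(len(s))
)
print(cbow_predictions, skipgram_predictions)  # 9 24
```

With a window of 2, Skip-gram makes roughly two to three times as many predictions per epoch on this corpus.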

#### Word Frequency

- Skip-gram tends to represent rare words better because every occurrence of a word yields a separate update for each of its context words, whereas CBOW averages a rare word together with its neighbours.

#### Use Cases

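- CBOW is suitable when training speed is a concern and most of the words of interest are frequent.
- Skip-gram is better when rare words matter or a broader range of word meanings must be captured, and it is generally reported to work well even with limited training data.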

Both CBOW and Skip-gram are fundamental techniques in Word2Vec for creating word embeddings that capture semantic meaning. The choice between the two depends on the specific requirements and constraints of the task at hand.


# GloVe: Global Vectors for Word Representation

GloVe, which stands for Global Vectors for Word Representation, is an unsupervised learning algorithm developed by Stanford researchers to generate word embeddings. Word embeddings are vector representations of words that capture semantic similarity, meaning that words with similar meanings will have similar vector representations.

## Key Concepts of GloVe

- **Global Context**: Unlike some other embedding methods that focus on local context windows, GloVe considers the global context of words in a corpus. It constructs a word-context co-occurrence matrix from the entire text corpus, which captures how frequently words co-occur with each other.

- **Co-occurrence Matrix**: GloVe builds a co-occurrence matrix \( X \), where each element \( X_{ij} \) represents the number of times word \( j \) appears in the context of word \( i \). The context is typically defined by a window of words around the target word.

- **Objective Function**: GloVe fits a weighted least-squares model. For each co-occurring pair, it minimizes the squared difference between the dot product of the two word vectors (plus bias terms) and the logarithm of the co-occurrence count \( X_{ij} \). The objective is motivated by ratios of co-occurrence probabilities, which helps in capturing meaningful linear substructures in the word vector space.

- **Dimensionality Reduction**: Training implicitly factorizes the (log) co-occurrence matrix into low-dimensional vectors. This process yields two sets of word vectors: one for words as "center" words and one for words as "context" words. The final word vectors are typically the sum (or average) of these two sets.

- **Efficiency**: GloVe is computationally efficient and can be trained on large corpora. It leverages statistical information across the entire text corpus, making it effective for capturing global word-word co-occurrence information.
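
The objective can be written as \( J = \sum_{i,j} f(X_{ij}) \left( w_i^{\top} c_j + b_i + b_j - \log X_{ij} \right)^2 \), where \( f \) is a weighting function that caps the influence of very frequent pairs. The symbols \( w_i \), \( c_j \), \( b_i \), and \( b_j \) (center vector, context vector, and their biases) are notation chosen for this sketch. A minimal illustration of the weighting function and of the loss for one pair follows; the defaults `x_max=100` and `alpha=0.75` are the commonly used values, and the function names are hypothetical.

```python
import numpy as np


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x): grows as (x / x_max) ** alpha, capped at 1."""
    return min(1.0, (x / x_max) ** alpha)


def glove_pair_loss(w_i, c_j, b_i, b_j, x_ij, x_max=100.0, alpha=0.75):
    """Weighted squared error for a single co-occurring pair (requires x_ij > 0)."""
    residual = np.dot(w_i, c_j) + b_i + b_j - np.log(x_ij)
    return glove_weight(x_ij, x_max, alpha) * residual ** 2


# Hypothetical vectors: the prediction is 2.0 while log(1) = 0, so the residual is 2.
print(glove_pair_loss(np.ones(2), np.ones(2), 0.0, 0.0, x_ij=1.0))
```

Pairs with a zero count are skipped entirely, which is why only the nonzero entries of the co-occurrence matrix take part in training.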

## Advantages of GloVe

- **Captures Global Statistics**: By considering the entire corpus, GloVe captures global statistical information, which can be beneficial for capturing the meaning of words.
- **Efficient Training**: GloVe can be trained relatively quickly on large datasets due to its use of matrix factorization techniques.
- **High-Quality Embeddings**: GloVe produces high-quality word embeddings that capture semantic relationships between words.

## Applications

- **Natural Language Processing (NLP)**: GloVe embeddings are widely used in various NLP tasks such as sentiment analysis, machine translation, and text classification.
- **Semantic Similarity**: They can be used to measure the semantic similarity between words, which is useful in information retrieval and recommendation systems.
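
Semantic similarity between two embeddings is usually measured with cosine similarity. The sketch below uses small made-up vectors purely for illustration; `most_similar`, `toy_vocab`, and `toy_embeddings` are hypothetical names, and real embeddings would come from a trained model.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1 = same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def most_similar(word, vocab, embeddings, top_n=3):
    """Rank the other words by cosine similarity; vocab maps word -> row index."""
    query = embeddings[vocab[word]]
    scores = [
        (other, cosine_similarity(query, embeddings[idx]))
        for other, idx in vocab.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up 2-dimensional vectors, not the output of any trained model.
toy_vocab = {'cat': 0, 'dog': 1, 'car': 2}
toy_embeddings = np.array([[1.0, 0.9], [0.9, 1.0], [-1.0, 0.2]])
print(most_similar('cat', toy_vocab, toy_embeddings, top_n=2))
```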

Overall, GloVe is a powerful tool for generating word embeddings that capture the semantic relationships between words, leveraging global statistical information from the text corpus.


In [11]:
import numpy as np

class GloVe:
    def __init__(self, corpus, vector_size=50, window_size=10, learning_rate=0.05, epochs=50):
        self.corpus = corpus
        self.vector_size = vector_size
        self.window_size = window_size
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.vocab = self.build_vocab()
        self.cooccurrence_matrix = self.build_cooccurrence_matrix()
        self.W = np.random.rand(len(self.vocab), vector_size)
        self.C = np.random.rand(len(self.vocab), vector_size)
        self.bias_w = np.random.rand(len(self.vocab))
        self.bias_c = np.random.rand(len(self.vocab))

    def build_vocab(self): # This method constructs a vocabulary from the corpus.
        vocab = {}
        for sentence in self.corpus:
            for word in sentence:
                if word not in vocab:
                    vocab[word] = len(vocab)
        return vocab

    def build_cooccurrence_matrix(self): # This method constructs a co-occurrence matrix based on the context window size.
        cooccurrence_matrix = np.zeros((len(self.vocab), len(self.vocab)))
        for sentence in self.corpus:
            for i, center_word in enumerate(sentence):
                context_words = sentence[max(0, i - self.window_size): i] + sentence[i + 1: min(len(sentence), i + self.window_size + 1)]
                center_idx = self.vocab[center_word]
                for context_word in context_words:
                    context_idx = self.vocab[context_word]
                    cooccurrence_matrix[center_idx][context_idx] += 1.0 / len(context_words)
        return cooccurrence_matrix

    def train(self): # This method updates the word vectors and biases using the GloVe objective function.
        for epoch in range(self.epochs):
            for i in range(len(self.vocab)):
                for j in range(len(self.vocab)):
                    if i != j and self.cooccurrence_matrix[i][j] > 0:
                        weight = min(1.0, (self.cooccurrence_matrix[i][j] / 100) ** 0.75)
                        cost_inner = np.dot(self.W[i], self.C[j]) + self.bias_w[i] + self.bias_c[j] - np.log(self.cooccurrence_matrix[i][j])
                        grad_w = weight * cost_inner * self.C[j]
                        grad_c = weight * cost_inner * self.W[i]
                        self.W[i] -= self.learning_rate * grad_w
                        self.C[j] -= self.learning_rate * grad_c
                        self.bias_w[i] -= self.learning_rate * weight * cost_inner
                        self.bias_c[j] -= self.learning_rate * weight * cost_inner

    def get_embeddings(self): # This method returns the final word embeddings by averaging the word vectors and context vectors.
        return (self.W + self.C) / 2

# Example usage:
corpus = [
    ["i", "like", "machine", "learning"],
    ["i", "love", "cats"],
    ["i", "enjoy", "learning", "new", "things"]
]

glove = GloVe(corpus, vector_size=5, window_size=2, learning_rate=0.05, epochs=100)
glove.train()
embeddings = glove.get_embeddings()

print("Word Embeddings:\n", embeddings)

Word Embeddings:
 [[ 0.17939874  0.41627203  0.01563113  0.23874276  0.00279134]
 [ 0.3306892  -0.129326    0.21291018  0.03439553  0.6701372 ]
 [ 0.29421412  0.33080237  0.14943978  0.64166909  0.29803278]
 [ 0.41879369  0.20771939 -0.1807786  -0.05692185  0.15818512]
 [ 0.18601556  0.42714543  0.34674826  0.32768166  0.17758825]
 [ 0.02458444  0.28506487  0.4131357   0.57836569  0.18452116]
 [ 0.40176779  0.13027695  0.60868793  0.33681591  0.75884909]
 [ 0.02119305  0.51771323  0.09235924  0.30762168  0.1090183 ]
 [ 0.14177566  0.72977264  0.71460173  0.38812816  0.62755391]]
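
The class above weights each co-occurrence by `1.0 / len(context_words)`. The original GloVe formulation instead lets a pair of words that are \( d \) positions apart contribute \( 1/d \) to the count, so that nearby words matter more. A minimal sketch of that variant follows; the function name `cooccurrence_by_distance` is hypothetical.

```python
import numpy as np


def cooccurrence_by_distance(corpus, vocab, window_size):
    """Co-occurrence counts where a pair d positions apart contributes 1/d."""
    X = np.zeros((len(vocab), len(vocab)))
    for sentence in corpus:
        for i, center in enumerate(sentence):
            start = max(0, i - window_size)
            end = min(len(sentence), i + window_size + 1)
            for j in range(start, end):
                if j != i:
                    X[vocab[center], vocab[sentence[j]]] += 1.0 / abs(i - j)
    return X


corpus = [["i", "like", "machine", "learning"]]
vocab = {word: idx for idx, word in enumerate(corpus[0])}
X = cooccurrence_by_distance(corpus, vocab, window_size=2)
print(X[vocab["i"]])  # "like" contributes 1.0, "machine" contributes 0.5
```

With a symmetric window the resulting matrix is symmetric, because each pair is counted once from either side with the same distance.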


### Pros and Cons

GloVe has its own set of advantages and disadvantages. The key ones are listed below.

#### Pros of GloVe:

- **Efficient Training**: GloVe is designed to be computationally efficient. Its co-occurrence matrix is computed once, up front, and training then iterates only over the nonzero entries, which can make training faster than methods that repeatedly scan the corpus, such as Skip-gram with Negative Sampling (SGNS).

- **Global Context**: GloVe considers the global context of words in the entire corpus, not just local context windows. This can lead to more robust embeddings that capture broader semantic relationships.

- **Scalability**: GloVe can scale well to large corpora due to its efficient use of matrix factorization techniques.

- **Interpretability**: The co-occurrence matrix and the resulting embeddings can be more interpretable, as they directly capture the statistical relationships between words.

- **Pre-trained Embeddings**: Pre-trained GloVe embeddings are widely available and have been trained on large corpora, making them easy to use for various NLP tasks.
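
Pre-trained GloVe vectors are commonly distributed as plain text, with one word per line followed by its vector components separated by spaces. Assuming that layout, a minimal loader could look like the sketch below; `load_glove_text` is a hypothetical helper, and the in-memory sample stands in for a real file.

```python
import io

import numpy as np


def load_glove_text(lines):
    """Parse lines of the form 'word v1 v2 ... vd' into a dict of NumPy vectors."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:  # skip empty or malformed lines
            continue
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings


# Made-up sample data; a real file would be opened with open(path, encoding="utf-8").
sample = io.StringIO("the 0.1 -0.2 0.3\ncat 0.5 0.4 -0.1\n")
vectors = load_glove_text(sample)
print(vectors["cat"])
```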

#### Cons of GloVe:

- **Memory Usage**: The co-occurrence matrix can be very large, especially for large vocabularies, leading to high memory usage.

- **Less Suitable for Rare Words**: GloVe may not perform as well for rare words, as it relies on co-occurrence statistics that may be sparse for infrequent words.

- **Static Embeddings**: Like other traditional embedding methods, GloVe produces static embeddings that do not change with context. This can be a limitation compared to contextual embeddings produced by models like BERT.

- **Dependence on Corpus**: The quality of GloVe embeddings is highly dependent on the quality and size of the training corpus. Small or poorly curated corpora can lead to less effective embeddings.

- **Less Flexible**: GloVe is less flexible compared to neural network-based methods that can be fine-tuned for specific tasks. GloVe embeddings are typically used as-is after training.

- **No Subword Information**: GloVe does not inherently capture subword information, which can be a disadvantage for languages with rich morphology or for handling out-of-vocabulary words.
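
To give a sense of the memory cost mentioned above, the following back-of-the-envelope sketch estimates the size of a dense co-occurrence matrix stored as 8-byte floats. The vocabulary sizes are illustrative assumptions, and `dense_matrix_gib` is a hypothetical helper.

```python
def dense_matrix_gib(vocab_size, bytes_per_entry=8):
    """Size in GiB of a dense vocab_size x vocab_size matrix."""
    return vocab_size ** 2 * bytes_per_entry / 2 ** 30


for vocab_size in (10_000, 100_000, 400_000):
    print(vocab_size, round(dense_matrix_gib(vocab_size), 1), "GiB")
```

The cost grows quadratically with the vocabulary, which is why practical implementations typically store only the nonzero entries rather than a dense array like the one in the toy class above.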

In summary, GloVe is a powerful and efficient method for generating word embeddings, particularly suitable for tasks where global context is important. However, it has limitations related to memory usage, handling of rare words, and the static nature of the embeddings.
