<a href="https://colab.research.google.com/github/manojbejawada/FMML_MODULE/blob/main/Lab_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Jashn Arora


---

## **Section 1: Understanding NLP tools**

In this lab we will apply KNN to a real-world NLP task: text classification. First, we look at some NLP techniques for text classification and at the tools used for NLP in Python.

## Section 1.1: Data Cleaning and Preprocessing step

Raw text must be processed and converted into a form suitable for machine-learning algorithms.  
For text, several things need to be taken into account:  


1.   Removing numbers from the text
2.   Handling capitalization and punctuation.
3.   Stemming and Lemmatizing text.  

Most importantly, algorithms cannot work on words or images directly; these need to be converted into vectors, a form that algorithms can understand.
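
As a minimal sketch of the first two steps in the list above (removing numbers, and handling capitalization and punctuation), assuming plain English text: the helper name `basic_clean` and the sample sentence are made up for illustration and are not part of the lab code.

```python
import re

def basic_clean(text):
    """Replace non-letter characters with spaces, lowercase, and collapse whitespace."""
    letters_only = re.sub(r"[^A-Za-z]", " ", text)  # drops digits and punctuation
    return " ".join(letters_only.lower().split())   # lowercase and collapse repeated spaces

cleaned = basic_clean("Great phone!!! Bought 2 in 2021.")
print(cleaned)
```

Stemming and lemmatization (step 3) need a library such as NLTK, which is introduced next.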



### **NLTK**
NLTK (Natural Language Toolkit) is a commonly used library for processing text. We will use it in this lab. Let's first import it and download the resources we need.


In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                # Fall back to noun, the lemmatizer's default part of speech
                return wordnet.NOUN

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)  # List of (token, POS tag) pairs
        for token, tag in tagged:
            text_result.append(wordnet_lemmatizer.lemmatize(token, get_tag(tag)))
        return text_result

    if stemmer:
        text_result = []
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        for t in tokens:
            text_result.append(snowball_stemmer.stem(t))
        return text_result

    # Neither lemmatizing nor stemming was requested: return the cleaned tokens as they are
    return word_tokenize(text)

In [None]:
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

Troubling
troubl
trouble


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
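
The idea can be sketched on a tiny corpus. The three sentences below are made up for illustration; the snippet assumes scikit-learn's `CountVectorizer` with default settings, the same class the lab uses.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the movie was good", "the movie was not good", "good good movie"]  # toy corpus

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs).toarray()  # rows = documents, columns = vocabulary words
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)  # words in column order

print(vocab)
print(bow)

# Word order is discarded: both phrases map to the same count vector.
a = vectorizer.transform(["good movie"]).toarray()
b = vectorizer.transform(["movie good"]).toarray()
same_vector = bool((a == b).all())
print(same_vector)
```

Each row counts how often every vocabulary word occurs in one document, and the last check shows why the representation is called a "bag".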

In [None]:
# Cleans the train and test documents and converts them to bag-of-words document-term matrices, with the option of removing stopwords.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test


## Section 1.3: TF-IDF
TF-IDF (term frequency-inverse document frequency) weights each word by how informative it is. It addresses a weakness of Bag-of-Words, where very frequent words dominate the counts even when they say little about a document, which makes it well suited to text classification.

The number of times a term occurs in a document is called its term frequency (TF).

Document frequency (DF) is the number of documents in which the term is present. Inverse document frequency (IDF) is the (usually log-scaled) inverse of the document frequency and measures the informativeness of term *t*: the fewer documents contain *t*, the higher its IDF. The TF-IDF weight of a term in a document is the product of its TF and its IDF.
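
To make the definitions concrete, here is a small sketch on a made-up corpus. It assumes scikit-learn's `TfidfVectorizer` with default settings, which uses the smoothed formula idf(t) = ln((1 + N) / (1 + df(t))) + 1, where N is the number of documents; other libraries and textbooks use slightly different variants.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the food was great", "the service was slow", "the food was cold"]  # toy corpus

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# Map each vocabulary term to its learned IDF value.
idf = {term: vectorizer.idf_[idx] for term, idx in vectorizer.vocabulary_.items()}

# "the" occurs in 3 documents, "food" in 2, "great" in 1.
for term in ["the", "food", "great"]:
    print(term, round(float(idf[term]), 3))
```

The term that appears in every document gets the lowest IDF, and the term that appears in only one document gets the highest.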




In [None]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [None]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving reviews.csv to reviews.csv


In [None]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [None]:
df = df.dropna()

In [None]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.
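
Why the distance metric matters can be sketched with toy data. The two-word vocabulary, the count vectors, and the labels below are invented for illustration; the point is only that Euclidean distance is sensitive to document length, while cosine distance compares the direction of the vectors.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy count vectors over a made-up vocabulary: [count("good"), count("bad")]
X = np.array([[8, 1], [9, 2], [1, 2], [0, 3]])
y = np.array([1, 1, 0, 0])   # 1 = positive review, 0 = negative review
query = np.array([[2, 1]])   # a short, mostly positive review

# cosine is only supported by the brute-force search, as in tfidf_knn()
euclid = KNeighborsClassifier(n_neighbors=1, metric="euclidean").fit(X, y)
cosine = KNeighborsClassifier(n_neighbors=1, metric="cosine", algorithm="brute").fit(X, y)

euclid_pred = int(euclid.predict(query)[0])
cosine_pred = int(cosine.predict(query)[0])
print("euclidean:", euclid_pred, "cosine:", cosine_pred)
```

With Euclidean distance the short query is closest to the short negative review, while cosine distance matches it to the longer reviews with a similar ratio of "good" to "bad".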

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.

In [None]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()

KNN with BOW accuracy = 62.30366492146597%




Cross Validation Accuracy: 0.62
[0.60784314 0.58431373 0.66141732]




In [None]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()

KNN with TFIDF accuracy = 70.15706806282722%




Cross Validation Accuracy: 0.73
[0.7254902  0.74117647 0.72834646]


# **Section 4: SPAM TEXT DATASET**
Now let's use what we've learnt to classify texts as spam or not spam.

In [None]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving spam.csv to spam.csv


In [None]:
import pandas as pd
df = pd.read_csv('spam.csv')
df

| | Category | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


In [None]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [None]:
df.head(5)

| | Category | Message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [None]:
len(df)

5572

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [None]:
# This cell may take some time to run
predicted, y_test = bow_knn()

KNN with BOW accuracy = 92.19730941704036%
Cross Validation Accuracy: 0.91
[0.90713324 0.90040377 0.91245791]




In [None]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()

KNN with TFIDF accuracy = 98.56502242152466%
Cross Validation Accuracy: 0.97
[0.96837147 0.96769852 0.96363636]


### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in a better accuracy than Bag-of-Words?

The **TF-IDF** (Term Frequency-Inverse Document Frequency) approach generally results in better accuracy than the **Bag-of-Words (BoW)** approach for several important reasons:

### 1. **Account for Word Frequency Across Documents**:
   - **BoW**: The Bag-of-Words model represents a document as a vector of word counts, without taking into account how frequently a word appears in different documents. This means that common words (such as "the," "is," and "and") are treated the same as more informative words, even though they do not carry as much semantic meaning.
   - **TF-IDF**: In contrast, **TF-IDF** not only considers the **term frequency (TF)** — how often a word appears in a document — but also the **inverse document frequency (IDF)**, which penalizes words that appear frequently across many documents. This reduces the influence of common, less informative words and increases the weight of words that are rare but significant to specific documents.

   - **Result**: TF-IDF helps to highlight words that are more unique to a specific document or a set of documents, making it more effective at capturing the semantic content of the text. As a result, TF-IDF often leads to better accuracy in tasks like classification or clustering.

### 2. **Reducing the Impact of Stop Words**:
   - **BoW**: In BoW, common "stop words" (e.g., "a," "the," "in") are treated with equal importance as other words. These words are often very frequent in all documents and do not provide meaningful information for text classification or clustering tasks.
   - **TF-IDF**: Since the **IDF** component of TF-IDF down-weights the importance of terms that appear in many documents, stop words (which appear frequently across most documents) receive very low weights, reducing their influence on the model. This leads to a better representation of the text’s content.

   - **Result**: By de-emphasizing stop words, TF-IDF allows the model to focus on the more meaningful, distinctive terms in the dataset, leading to better performance.

### 3. **Capture of Discriminative Features**:
   - **BoW**: BoW treats all words equally, so it may not adequately capture the features that distinguish between different classes or categories. In other words, it may treat less relevant words as equally important as the key distinguishing terms.
   - **TF-IDF**: By adjusting the importance of terms based on their frequency across the entire corpus, TF-IDF helps capture the **discriminative features** — words that are important for distinguishing between different documents or classes. Words that are common across the entire dataset are down-weighted, while those that are specific to a particular class or document are given higher importance.

   - **Result**: This makes TF-IDF more effective for tasks like document classification or sentiment analysis, where distinguishing between documents based on key terms is important.

### 4. **Improved Weighting for Rare Terms**:
   - **BoW**: In the Bag-of-Words model, the frequency of words is counted without any consideration of how rare or common the words are across the entire dataset. This means rare, but potentially important, terms may not be given enough weight.
   - **TF-IDF**: The **IDF** component boosts the weight of rare words that appear in only a few documents, making them more impactful for classification or clustering. This helps capture more specialized or domain-specific vocabulary that could be crucial for distinguishing between different classes or topics.

   - **Result**: By giving more weight to rare, distinctive terms, TF-IDF leads to a more accurate representation of the underlying meaning in the text.

### 5. **Better Handling of Large Corpora**:
   - **BoW**: As the size of the corpus grows, the BoW approach tends to treat common words as highly significant, even if they don’t carry any distinguishing information. This can lead to a less effective representation of the text as the vocabulary grows.
   - **TF-IDF**: The **IDF** of a word that appears in most documents stays close to its minimum however large the corpus becomes, so common words keep a low weight and more emphasis goes to terms that are less common but more informative.

   - **Result**: Because IDF is computed from corpus-wide statistics, TF-IDF weights adapt to the corpus, whereas raw BoW counts remain dominated by the most frequent words.

### 6. **Sparsity and Computational Efficiency**:
   - **BoW**: The Bag-of-Words model results in a large, sparse feature matrix in which most word counts are zero. TF-IDF does not change this: with the same vocabulary and scikit-learn's default settings, a TF-IDF matrix has non-zero entries in exactly the same positions as the BoW matrix, and only the values differ. Both representations therefore have the same memory and computational cost.
   - **Result**: The advantage of TF-IDF lies in how the non-zero entries are weighted, not in sparsity or efficiency. Note that calling `.toarray()`, as this lab does, converts either matrix to a dense array and gives up the memory savings of the sparse format.
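
One way to compare the sparsity of the two representations is to count their non-zero entries. The sketch below uses a made-up corpus and assumes scikit-learn's default vectorizer settings; with those defaults every IDF value is positive, so both matrices are non-zero in the same positions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the food was great", "the service was slow", "the food was cold"]  # toy corpus

bow = CountVectorizer().fit_transform(docs)    # scipy sparse matrix of counts
tfidf = TfidfVectorizer().fit_transform(docs)  # scipy sparse matrix of TF-IDF weights

# nnz is the number of stored non-zero entries in a sparse matrix.
print(bow.shape, bow.nnz)
print(tfidf.shape, tfidf.nnz)

# Compare the positions of the non-zero entries in both matrices.
same_pattern = bool(np.array_equal(bow.toarray() > 0, tfidf.toarray() > 0))
print(same_pattern)
```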

### Summary of Key Advantages of TF-IDF over BoW:
1. **Reduces the influence of common words (stop words)**, making the model focus on more informative terms.
2. **Captures discriminative features** that are important for distinguishing between classes or documents.
3. **Improves handling of rare terms**, which are often important for capturing the unique content of specific documents.
4. **Often improves accuracy** on held-out data, although the gain depends on the dataset and the classifier.
5. **Focuses on the most relevant words** by adjusting for their frequency across the entire corpus.

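In conclusion, **TF-IDF** generally outperforms **Bag-of-Words** because it helps prioritize the most informative and unique terms in the dataset, leading to better classification accuracy, particularly for tasks that require distinguishing between different topics or categories. In this lab, however, the two models also differ in distance metric (Euclidean vs. cosine) and in neighbour weighting (uniform vs. distance), and stop words are removed in both, so part of the accuracy gap may come from those choices rather than from TF-IDF alone.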

2. Can you think of techniques that are better than both BoW and TF-IDF ?


Yes, there are several techniques that can often outperform **Bag-of-Words (BoW)** and **TF-IDF**, especially when it comes to capturing semantic meaning, handling complex relationships in text, and dealing with issues like high dimensionality and sparsity. Here are some of the more advanced techniques:

### 1. **Word Embeddings (e.g., Word2Vec, GloVe, FastText)**
   - **What it is**: Word embeddings map words to dense, continuous vectors in a space of a few hundred dimensions, where semantically similar words are represented by vectors that are close together.
   - **How it’s better**:
     - **Captures semantic meaning**: Unlike BoW and TF-IDF, which treat words as independent features, word embeddings capture **semantic relationships** (e.g., synonyms, analogies).
     - **Reduces dimensionality**: Word embeddings typically use a much smaller number of dimensions (e.g., 100 to 300) compared to BoW or TF-IDF, which can have thousands of dimensions.
     - **Contextual similarities**: Embeddings allow you to understand that "king" and "queen" are similar, even if they don't appear in the same context within the same document.
   - **Popular models**: Word2Vec (Skip-Gram and CBOW), GloVe (Global Vectors for Word Representation), and FastText (which also captures subword information).
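
The notion of "close together" can be sketched with cosine similarity. The 3-dimensional vectors below are invented for illustration only; real Word2Vec or GloVe vectors are learned from large corpora and typically have 100 to 300 dimensions. The helper names are hypothetical.

```python
import numpy as np

# Hypothetical embeddings, hand-picked so that "king" and "queen" point in similar directions.
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.7, 0.2]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sentence_vector(tokens):
    """Average the vectors of known tokens, a simple way to embed a whole text."""
    return np.mean([embeddings[t] for t in tokens if t in embeddings], axis=0)

sim_royal = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_fruit = cosine_similarity(embeddings["king"], embeddings["apple"])
print(round(sim_royal, 3), round(sim_fruit, 3))
print(sentence_vector(["king", "queen"]))
```

Averaged vectors like the one returned by `sentence_vector` could replace the BoW or TF-IDF rows as input to a KNN classifier.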
   
### 2. **Contextualized Word Embeddings (e.g., BERT, GPT, RoBERTa)**
   - **What it is**: These models generate embeddings that take into account the **context** in which a word appears. Unlike traditional word embeddings, which assign a single vector to each word, contextualized embeddings generate a different vector for a word depending on the sentence or document it's used in.
   - **How it’s better**:
     - **Captures word meaning in context**: For example, "bank" in the context of "river bank" vs. "bank" in "financial bank" would have different representations.
     - **Handles polysemy**: By considering the surrounding words, these models can distinguish between words with multiple meanings (polysemous words).
     - **State-of-the-art performance**: Models like BERT (Bidirectional Encoder Representations from Transformers) have set new benchmarks in many NLP tasks, including question answering, sentiment analysis, and text classification.
   - **Popular models**: BERT, RoBERTa, GPT-3, T5, and other transformer-based models.

### 3. **Transformer-Based Models (e.g., BERT, GPT, T5)**
   - **What it is**: Transformer models, based on **self-attention mechanisms**, are designed to capture long-range dependencies and context in text. These models are trained on large amounts of text data and fine-tuned for specific tasks.
   - **How it’s better**:
     - **Better handling of long-range dependencies**: Unlike BoW, TF-IDF, and even Word2Vec, transformer models can understand **longer contextual relationships** between words in a sentence or paragraph.
     - **Pre-trained models**: Pre-trained transformer models (like BERT and GPT) can be fine-tuned for specific downstream tasks with relatively small labeled datasets, resulting in superior performance.
     - **Better at understanding complex language**: They excel in understanding nuances, idiomatic expressions, and complex language features.
   - **Popular models**: BERT (for bidirectional context), GPT-3 (for autoregressive generation), and T5 (for text-to-text tasks).

### 4. **Latent Dirichlet Allocation (LDA)**
   - **What it is**: LDA is a **generative probabilistic model** that assumes documents are mixtures of topics, and each word in a document is attributable to one of these topics. LDA helps uncover the **latent thematic structure** in a collection of documents.
   - **How it’s better**:
     - **Topic modeling**: LDA not only extracts important words but also discovers the **underlying topics** in the corpus. This can help understand the main themes of a document.
     - **Dimensionality reduction**: LDA reduces the feature space by focusing on a smaller set of topics, improving the interpretability and efficiency of subsequent analysis.
   - **Popular use cases**: Topic modeling, document clustering, and understanding the thematic structure of a corpus.
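
A minimal sketch of topic modeling, assuming scikit-learn's `LatentDirichletAllocation` and a made-up four-document corpus. Note that LDA is fitted on bag-of-words counts, so it builds on BoW rather than replacing it; the number of topics (2) is an arbitrary choice for this toy example.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the team won the football match",
    "the player scored a goal in the match",
    "the bank raised interest rates",
    "the bank approved the loan and interest",
]  # toy corpus

counts = CountVectorizer(stop_words="english").fit_transform(docs)  # BoW counts as input

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one row per document, one column per topic

print(doc_topics.shape)
print(np.round(doc_topics, 2))
```

Each row is a probability distribution over topics, so a document is described by 2 numbers instead of one count per vocabulary word.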

### 5. **Doc2Vec (Paragraph Vectors)**
   - **What it is**: Doc2Vec is an extension of Word2Vec that creates vector representations not just for individual words but for **entire documents or paragraphs**. It captures the overall semantic content of a document.
   - **How it’s better**:
     - **Captures document-level semantics**: BoW and TF-IDF represent a document by its word counts, and Word2Vec produces one vector per word. Doc2Vec instead learns a dense, fixed-length vector for each document, allowing for better document-level analysis.
     - **Good for document clustering, classification, and retrieval**: By representing entire documents as vectors, Doc2Vec is useful for tasks like document classification, similarity detection, and clustering.

### 6. **Universal Sentence Encoder (USE)**
   - **What it is**: USE is a pre-trained model from Google that generates fixed-length embeddings for entire sentences or paragraphs, capturing semantic meaning at the sentence level.
   - **How it’s better**:
     - **Handles sentence-level semantics**: USE is designed to capture the meaning of entire sentences or short texts, which is more useful for tasks like sentence similarity, semantic search, and sentence-level classification.
     - **Pre-trained and easy to use**: Like BERT, it is pre-trained on large datasets and can be easily fine-tuned for specific tasks.

### 7. **ELMo (Embeddings from Language Models)**
   - **What it is**: ELMo is a deep contextualized word representation model, which provides word embeddings based on the entire context of a sentence, allowing for better representation of word meanings.
   - **How it’s better**:
     - **Context-sensitive**: Unlike static embeddings like Word2Vec, ELMo produces embeddings that vary depending on the sentence, capturing nuances in word meanings based on surrounding context.
     - **Historically significant**: Although BERT and GPT have largely surpassed ELMo, it was a significant step forward in embedding techniques.

### 8. **Sentence Transformers (SBERT)**
   - **What it is**: SBERT is a modification of BERT designed specifically to generate **sentence-level embeddings** that capture the meaning of an entire sentence, allowing for efficient pairwise sentence similarity computations.
   - **How it’s better**:
     - **Better for sentence similarity**: SBERT is optimized for tasks like sentence matching, semantic search, and paraphrase detection, where capturing the meaning of entire sentences or text pairs is crucial.
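     - **Efficient computation**: With standard BERT, comparing two sentences requires feeding them through the model together, so every pair needs its own forward pass. SBERT is fine-tuned to encode each sentence independently into a fixed-length vector, so embeddings can be computed once and compared with a cheap measure such as cosine similarity, which makes large-scale similarity search and retrieval practical.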

---

### Summary of Techniques Better Than BoW and TF-IDF:

1. **Word Embeddings (Word2Vec, GloVe, FastText)**: Capture semantic relationships and provide dense vector representations for words.
2. **Contextualized Embeddings (BERT, GPT, RoBERTa)**: Provide context-sensitive embeddings, improving handling of polysemy and word meaning in context.
3. **Transformer Models (BERT, GPT, T5)**: Powerful models for capturing long-range dependencies and understanding complex language.
4. **Latent Dirichlet Allocation (LDA)**: Helps with topic modeling and dimensionality reduction by identifying underlying topics in the corpus.
5. **Doc2Vec**: Generates fixed-length vectors for entire documents, capturing document-level semantics.
6. **Universal Sentence Encoder (USE)**: Useful for capturing sentence-level semantics and used for tasks like semantic similarity.
7. **ELMo**: Contextualized word embeddings that adapt to the surrounding sentence.
8. **SBERT (Sentence-BERT)**: Optimized for tasks like sentence similarity and semantic search.

### Conclusion:
While **BoW** and **TF-IDF** are simple and effective, they do not capture the deeper semantic meaning of text or relationships between words as well as more advanced methods like **word embeddings** and **transformer-based models**. These newer techniques, particularly **contextualized embeddings** like **BERT** and **GPT**, are currently considered state-of-the-art for most NLP tasks and tend to outperform traditional methods in terms of accuracy, handling complex language patterns, and understanding context.

3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.


**Stemming** and **lemmatization** are both techniques used in **Natural Language Processing (NLP)** to reduce words to their root or base forms, but they do so in different ways and have distinct advantages and disadvantages. Here’s a comparison of the two approaches based on their respective **pros and cons**:

### 1. **Stemming**

**Stemming** is the process of reducing a word to its **root form** by chopping off affixes, in practice mostly suffixes. The goal is to normalize the word so that variations like "running" and "runs" become "run." The process is typically **rule-based**, using predefined rules to strip affixes from words.

#### Pros of Stemming:
1. **Simple and Fast**:
   - Stemming algorithms like **Porter Stemmer** or **Snowball Stemmer** are computationally simple and fast, making them suitable for large datasets where processing speed is a priority.
   
2. **Effective for Reducing Variations**:
   - Stemming helps in reducing word variations, especially in cases where different forms of a word are used in the same context. For example, "jumps," "jumping," and "jumped" all become "jump."

3. **Reduces Dimensionality**:
   - By reducing related words to a common root, stemming can help reduce the feature space in models, improving efficiency in tasks like text classification or clustering.

4. **No Need for Extensive Lexical Resources**:
   - Stemming is based on rules, so it doesn't require large lexical databases or word dictionaries, making it lightweight and easy to implement.

#### Cons of Stemming:
1. **Over-Stemming**:
   - Stemming can strip suffixes too aggressively, so that words with different meanings collapse to the same stem. For example, "universe", "university", and "universal" are all reduced to "univers" by the Porter and Snowball stemmers, which can blur distinctions that matter for downstream NLP tasks.
   
2. **Loss of Meaning**:
   - Stemming doesn't consider the context or part of speech of a word, so it cannot relate irregular forms. For instance, a stemmer leaves "better" unchanged and never links it to "good", whereas a lemmatizer that knows "better" is an adjective can map it to "good".
   
3. **Non-Standardized Output**:
   - Since stemming is heuristic-based, it may yield non-standard, non-dictionary forms of words that could make it difficult to interpret or use in human-readable formats.
   
4. **Inaccurate for Complex Words**:
   - It may fail for more complex or irregular words, as it simply applies a set of rules rather than understanding the word's true meaning.
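The drawbacks listed above can be observed directly. The following is a minimal sketch using NLTK's `PorterStemmer`; the example words are chosen for illustration, and other stemmers may behave differently on them.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Over-stemming: words with different meanings end up with one shared stem.
related = ["universe", "university", "universal"]
over_stemmed = {w: stemmer.stem(w) for w in related}
print(over_stemmed)

# Irregular forms: suffix rules cannot relate "better" to "good".
irregular = {w: stemmer.stem(w) for w in ["better", "good"]}
print(irregular)
```

The first dictionary maps all three words to the same non-dictionary stem, while the second keeps "better" and "good" as unrelated stems.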

---

### 2. **Lemmatization**

**Lemmatization** is a more advanced technique that reduces words to their **lemma** (the base form of a word), using knowledge of **part-of-speech (POS)** and linguistic rules. Unlike stemming, which uses rules to chop off affixes, lemmatization looks up the dictionary form of a word and uses its part of speech to choose among the possible lemmas.
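The lookup idea can be sketched without any external resource. The mini-lexicon `LEXICON`, the tag names, and the function `toy_lemmatize` below are hypothetical and exist only for illustration; a real lemmatizer consults a full lexical database such as WordNet together with morphological rules.

```python
# Hypothetical mini-lexicon keyed by (word, POS tag).
# A real lemmatizer would consult a full resource such as WordNet.
LEXICON = {
    ("better", "ADJ"): "good",
    ("better", "VERB"): "better",
    ("running", "VERB"): "run",
    ("running", "NOUN"): "running",
    ("mice", "NOUN"): "mouse",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Return the lemma stored for (word, pos), or the word itself if unknown."""
    return LEXICON.get((word.lower(), pos), word)


print(toy_lemmatize("better", "ADJ"))    # good
print(toy_lemmatize("better", "VERB"))   # better
print(toy_lemmatize("running", "VERB"))  # run
print(toy_lemmatize("mice", "NOUN"))     # mouse
print(toy_lemmatize("blorp", "NOUN"))    # blorp (not in the lexicon)
```

Keying the lookup on the POS tag is what lets the same surface form map to different lemmas, and the fallback shows why coverage of the lexicon matters.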

#### Pros of Lemmatization:
1. **Context-Aware and Accurate**:
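   - Lemmatization can use **POS tags** to disambiguate a word (e.g., "better" as an adjective has the lemma "good," while "better" as a verb stays "better"). Given the correct tag, the word is reduced to the right lemma and its meaning is preserved. Note that NLTK's `WordNetLemmatizer` does not tag words itself: the POS must be passed in, and it defaults to noun.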
   
2. **Produces Valid Words**:
   - Lemmatization returns dictionary forms rather than truncated strings, so the resulting base forms are meaningful (e.g., "run" instead of "runn"). A word missing from the lexicon is typically returned unchanged rather than mangled.

3. **Preserves Word Meaning**:
   - By considering the context and part of speech, lemmatization better preserves the **semantic meaning** of words, which is important for tasks like sentiment analysis, text summarization, and question answering.

4. **Better for Text Interpretation**:
   - Lemmatization results in more natural and interpretable output compared to stemming. This makes it more suitable when the goal is to generate human-readable text or perform more accurate downstream tasks.

#### Cons of Lemmatization:
1. **Computationally Expensive**:
   - Lemmatization is more **resource-intensive** than stemming because it requires a **lexicon** and **part-of-speech tagging**, which adds extra processing time. This can be a drawback when working with large datasets or when speed is critical.
   
2. **Complexity**:
   - The lemmatization process is more **complex** and draws on more linguistic knowledge. This makes it harder to implement from scratch, so it usually relies on a lexical database such as **WordNet** (available through Python's **nltk** library).
   
3. **Relies on External Resources**:
   - Lemmatization depends on the availability of **external lexical resources** (e.g., WordNet, morphological dictionaries), which may not exist for all languages or domains. In some cases, these resources might not be comprehensive or accurate enough.

4. **Slower for Large Datasets**:
   - Due to the need for dictionary lookups and POS tagging, lemmatization tends to be **slower** than stemming, especially for large datasets or real-time applications.

---

### Comparison of Stemming vs. Lemmatization

| Feature                | **Stemming**                              | **Lemmatization**                          |
|------------------------|-------------------------------------------|--------------------------------------------|
| **Complexity**          | Simple and fast                          | More complex and slower                   |
| **Accuracy**            | Less accurate, can produce incorrect roots | More accurate, context-aware               |
| **Preservation of Meaning** | Can conflate words with different meanings | Preserves meaning when the POS is correct  |
| **Output**              | Stems that may not be dictionary words    | Dictionary forms (lemmas)                  |
| **Context Awareness**   | Does not consider context                 | Uses POS and context for accurate output   |
| **Performance on Large Datasets** | Faster, more efficient             | Slower due to need for POS tagging         |
| **Use Case**            | Suitable for quick, approximate text processing tasks | Suitable for tasks needing high accuracy and semantic understanding |

---

### Conclusion

- **Stemming** is faster and simpler, making it suitable for applications where processing speed is crucial and where slight inaccuracies are acceptable. However, it can produce non-standard words and may lead to loss of meaning, especially in complex language tasks.
  
- **Lemmatization** provides more accurate and semantically meaningful results because it uses part-of-speech information and dictionaries to reduce words to their base forms. It's better for tasks where accuracy and preserving the intended meaning are important, but it comes at the cost of being slower and more computationally expensive.

**Choosing between the two** depends on the specific task and resource constraints. For simple applications where speed is key, stemming may suffice. For more nuanced tasks that require preserving meaning and ensuring linguistic correctness, lemmatization is often the better choice.

### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW: https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
