<a href="https://colab.research.google.com/github/lavanya950/FMML-LABS-MAIN/blob/main/Lab_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Jashn Arora


---

## **Section 1: Understanding NLP tools**

In this lab we will use KNN on a real-world NLP application: text classification. But first, let's look at some NLP techniques for text classification and the tools we use when working with Python for NLP.

## Section 1.1: Data Cleaning and Preprocessing

Raw text must be cleaned and converted into a form that is suitable for machine-learning algorithms.  
For text, several things need to be taken into account:  


1.   Removing numbers from the text
2.   Handling capitalization and punctuation.
3.   Stemming and Lemmatizing text.  

Most importantly, one can't use words or images directly in algorithms; they need to be converted into vectors, a form that algorithms can understand.
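As a minimal sketch of the first two steps in the list above (this is not the `cleanText` function used later in the lab), the snippet below relies only on Python's standard library. The helper name `basic_clean` is hypothetical and introduced purely for illustration.

```python
import re

def basic_clean(text):
    """Toy cleaner: keep letters only, lowercase, split on whitespace."""
    # Replace every character that is not an ASCII letter with a space.
    letters_only = re.sub(r"[^A-Za-z]", " ", text)
    # Lowercase so that "Free" and "free" map to the same token.
    return letters_only.lower().split()

print(basic_clean("Free entry in 2 a wkly comp!"))
# ['free', 'entry', 'in', 'a', 'wkly', 'comp']
```

Stemming, lemmatization, and vectorization build on token lists like this one.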



### **NLTK**
NLTK (Natural Language Toolkit) is a commonly used library for processing text. We will use it in this lab. Let's first download the NLTK resources we need.


In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                # Fall back to noun, which is also WordNetLemmatizer's default POS
                return wordnet.NOUN

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)  # List of (token, Penn Treebank tag) pairs
        for token, tag in tagged:
            text_result.append(wordnet_lemmatizer.lemmatize(token, get_tag(tag)))
        return text_result

    if stemmer:
        text_result = []
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        for t in tokens:
            text_result.append(snowball_stemmer.stem(t))
        return text_result

    # Neither lemmatization nor stemming requested: return the plain tokens
    return word_tokenize(text)

In [None]:
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

Troubling
troubl
trouble


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
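To make the idea concrete, here is a small sketch that builds bag-of-words count vectors by hand using only the standard library. The function name `bag_of_words` and the two example sentences are made up for illustration; `CountVectorizer`, used in the next cell, applies its own tokenization rules, so its output can differ in detail.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of documents."""
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary is the sorted set of all words seen in any document.
    vocab = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One count per vocabulary word; word order in the document is lost.
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

vocab, vectors = bag_of_words(["the food was good", "the service was not good"])
print(vocab)    # ['food', 'good', 'not', 'service', 'the', 'was']
print(vectors)  # [[1, 1, 0, 0, 1, 1], [0, 1, 1, 1, 1, 1]]
```

Shuffling the words inside a document leaves its vector unchanged, which is exactly the "bag" property described above.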

In [None]:
# Functions to convert document(s) to a list of words, with the option of removing stopwords. Returns document-term matrix.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test


## Section 1.3: TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) weights each word by how often it occurs in a document and how rare it is across the corpus. This addresses a weakness of Bag of Words, which counts every word the same way no matter how common it is, and it works well for text classification.

The number of times a term occurs in a document is called its Term frequency (TF).

Document frequency (DF) is the number of documents in which the term is present. Inverse document frequency (IDF) is based on the inverse of DF and measures the informativeness of term *t*: the fewer documents contain *t*, the higher its IDF.
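The definitions above can be sketched with one common textbook form, TF-IDF(t, d) = TF(t, d) × ln(N / DF(t)), where N is the number of documents. This is an illustrative assumption rather than the exact formula used in the next cell: by default `TfidfVectorizer` applies a smoothed IDF and normalizes each document vector, so its numbers will differ. The function name `tf_idf` and the example corpus are made up.

```python
import math

def tf_idf(docs):
    """Textbook TF-IDF: raw term count times ln(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    vocab = sorted({w for tokens in tokenized for w in tokens})
    # Document frequency: number of documents containing each term.
    df = {w: sum(1 for tokens in tokenized if w in tokens) for w in vocab}
    idf = {w: math.log(n_docs / df[w]) for w in vocab}
    # One {term: weight} dictionary per document.
    weights = [{w: tokens.count(w) * idf[w] for w in set(tokens)} for tokens in tokenized]
    return idf, weights

docs = ["the movie was great", "the movie was boring", "the plot was great"]
idf, weights = tf_idf(docs)
print(round(idf["the"], 3))     # 0.0   -> appears in every document
print(round(idf["boring"], 3))  # 1.099 -> appears in one document out of three
```

A word that occurs in every document gets zero weight here, while a word confined to one document gets the largest weight.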




In [None]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [None]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving reviews.csv to reviews.csv


In [None]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [None]:
df = df.dropna()

In [None]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.
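One of the differences between the two models is the distance metric: Euclidean for Bag-of-Words and cosine for TF-IDF. The sketch below, using made-up toy count vectors and standard-library code only, shows why the choice matters: Euclidean distance grows with document length, while cosine distance (1 minus cosine similarity) depends only on the direction of the vectors.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

short_review = [1, 2, 0]   # toy word counts
long_review = [2, 4, 0]    # same word proportions, twice the length
other_review = [0, 1, 2]   # different words dominate

# Euclidean distance sees the two similar reviews as clearly apart (about 2.24).
print(euclidean_distance(short_review, long_review))
# Cosine distance is approximately zero because the vectors point the same way.
print(cosine_distance(short_review, long_review))
# A review with different word proportions is far away under cosine distance.
print(cosine_distance(short_review, other_review))
```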

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.
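As a rough preview of what `cv=3` means in the calls above, the sketch below splits sample indices into three folds and uses each fold once for validation. It is a simplified illustration with a hypothetical helper name (`k_fold_indices`); for classifiers, scikit-learn uses stratified folds, which additionally keep the class proportions similar in every fold.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size

for train, val in k_fold_indices(7, 3):
    print(val)
# [0, 1, 2]
# [3, 4]
# [5, 6]
```

The model is trained on the `train` indices and scored on the `val` indices of each fold, which gives the three scores printed by the cells below.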

In [None]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()


KNN with BOW accuracy = 62.30366492146597%




Cross Validation Accuracy: 0.62
[0.60784314 0.58431373 0.66141732]




In [None]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()


KNN with TFIDF accuracy = 70.15706806282722%




Cross Validation Accuracy: 0.73
[0.7254902  0.74117647 0.72834646]


# Section 4: SPAM TEXT DATASET
Now let's use what we've learnt to classify texts as spam or not spam.

In [None]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving spam.csv to spam.csv


In [None]:
import pandas as pd
df = pd.read_csv('spam.csv')
df

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [None]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [None]:
df.head(5)

Unnamed: 0,Category,Message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
len(df)

5572

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [None]:
# This cell may take some time to run
predicted, y_test = bow_knn()


KNN with BOW accuracy = 92.19730941704036%
Cross Validation Accuracy: 0.91
[0.90713324 0.90040377 0.91245791]




In [None]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()


KNN with TFIDF accuracy = 98.56502242152466%
Cross Validation Accuracy: 0.97
[0.96837147 0.96769852 0.96363636]


### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in a better accuracy than Bag-of-Words ?

The **TF-IDF (Term Frequency - Inverse Document Frequency)** approach generally results in better accuracy than the **Bag-of-Words (BoW)** model because of the following key reasons:

### 1. **Handles Word Frequency Differently**
   - **Bag-of-Words** simply counts how many times a word appears in a document. While this gives a basic idea of word presence, it does not account for the relative importance of words in the corpus. Words that occur frequently in most documents (e.g., common words like "the", "is", "and") are given the same weight as more informative words, which could lead to poor performance in text classification tasks.
   
   - **TF-IDF**, on the other hand, adjusts the frequency of words by considering not only how often a word appears in a particular document (Term Frequency, TF), but also how often the word appears across all documents in the corpus (Inverse Document Frequency, IDF). This helps **downweight common words** and **highlight more informative words** that are unique to certain documents, improving the representation of the documents.

### 2. **Reduces the Influence of Common Words**
   - **Problem with Bag-of-Words**: In BoW, the most common words across all documents (often stopwords like "is", "the", "on", etc.) can dominate the feature set. These words are not very informative for distinguishing between classes, but because they appear frequently in many documents, they can overwhelm the analysis and skew the model.
   
   - **TF-IDF Advantage**: TF-IDF reduces the weight of words that appear frequently across many documents (which are likely less informative), and **increases the weight** of words that are specific to a smaller subset of documents (which are more informative and relevant). The IDF component helps to give more importance to rare terms, leading to a better feature representation for classification tasks.

### 3. **Improves Document Representation**
   - **BoW**: Treats words independently and weights every occurrence equally, regardless of how informative the word is. The resulting vectors are **sparse** and high-dimensional, and their largest values often belong to common words.
   
   - **TF-IDF**: By down-weighting frequent terms and emphasizing less common, more distinctive terms, TF-IDF provides a **more nuanced representation** of the text. It captures the relative importance of each term within the document and the corpus, making it better suited for distinguishing between different documents or classes.

### 4. **Better Differentiation Between Documents**
   - **BoW** may struggle to distinguish documents when they share a lot of common words. This is because the frequency of those common words is treated equally across all documents.
   
   - **TF-IDF** improves the ability to differentiate documents because it gives more weight to words that are unique to certain documents. For example, if two documents share a common theme but use different specialized vocabulary, TF-IDF will help distinguish them better because it emphasizes the unique words in each document, rather than just counting all words equally.

### 5. **Keeps the Dimensionality but Shrinks Uninformative Features**
   - **Problem with BoW**: The BoW approach can lead to very high-dimensional feature vectors, especially when there is a large vocabulary. Many of these dimensions may correspond to common words that don't provide useful information.
   
   - **TF-IDF's Effect**: TF-IDF does not remove any features: its vectors have the same number of dimensions as the BoW vectors and are typically just as sparse. What changes is the scale. Dimensions that correspond to common words shrink, so they **contribute less to the distance between documents**, which matters for a distance-based model such as KNN.

### 6. **Improves Accuracy by Giving Contextual Weight to Words**
   - In the BoW model, each word is treated as an independent feature, without consideration for how its frequency compares across the entire corpus. This can lead to inaccurate representations for words that may appear frequently in one document but are generally not important in distinguishing different classes.
   
   - With TF-IDF, the **importance of a word is measured** by both how frequently it appears in the document and how rare it is in the corpus. This helps **emphasize important terms** that can more effectively distinguish between documents or classes, leading to better classification accuracy.

### Example:
Let’s consider a document classification task where we are classifying movie reviews as either "positive" or "negative". Common words like "movie", "film", "actor", etc., appear in most reviews regardless of sentiment. If we use **Bag-of-Words**, these common words could dominate the feature vector and reduce the impact of sentiment-bearing words like "great", "boring", "exciting", etc.

However, in **TF-IDF**, these common words will be down-weighted (because they appear in many reviews), and sentiment-heavy words will receive higher weights. This better reflects the **unique aspects** of each document and helps improve the classifier's performance.

### **Summary of Key Differences:**

| **Feature**                 | **Bag-of-Words (BoW)**                          | **TF-IDF**                                                              |
|-----------------------------|-------------------------------------------------|-------------------------------------------------------------------------|
| **Term Weighting**          | Raw counts; every occurrence counts equally     | Counts scaled by how rare the term is across the corpus                 |
| **Effect of Common Words**  | Common words can dominate the model             | Common words are down-weighted and have less influence                  |
| **Impact on Rare Terms**    | Rare, distinctive terms get no extra weight     | Rare, distinctive terms are emphasized                                  |
| **Accuracy**                | Can perform poorly if common words dominate     | Often better, because important, distinctive words carry more weight    |
| **Dimensionality**          | One feature per vocabulary word                 | Same as BoW (one feature per vocabulary word); only the values change   |
| **Document Representation** | Simple word count vector                        | Weighted vector reflecting the importance of each word                  |

### **Conclusion:**
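TF-IDF generally leads to **better accuracy** than Bag-of-Words because it emphasizes important, distinctive terms while reducing the impact of frequent, less informative words. This improves the ability of machine learning models to differentiate between documents and classes, leading to more effective text classification, sentiment analysis, and other natural language processing tasks. Note that in this lab the two KNN models also differ in distance metric (Euclidean vs. cosine) and neighbor weighting (uniform vs. distance), so part of the accuracy gap may come from those settings rather than from TF-IDF alone.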



2. Can you think of techniques that are better than both BoW and TF-IDF ?


Yes, there are several techniques that can be **better** than both **Bag-of-Words (BoW)** and **TF-IDF** for text representation, particularly when dealing with more complex tasks or larger corpora. While **BoW** and **TF-IDF** are simple and useful methods for many text-based problems, they have limitations, such as high dimensionality, lack of semantic meaning, and the inability to capture word order and context. Here are some more advanced techniques that can address these issues:

### 1. **Word Embeddings (Word2Vec, GloVe, FastText)**
   - **Description**: Word embeddings are dense vector representations in which each word is mapped to a vector of, typically, a few hundred dimensions that captures semantic meaning based on the contexts the word appears in across a large corpus. These embeddings are trained in such a way that words with similar meanings appear closer in the vector space.
     - **Word2Vec**: Uses models like **Continuous Bag of Words (CBOW)** or **Skip-gram** to learn vector representations by predicting a word from its surrounding words (CBOW) or the surrounding words from the word (Skip-gram).
     - **GloVe**: Constructs word vectors by capturing the co-occurrence statistics of words in the corpus, focusing on the global context of the word relationships.
     - **FastText**: An extension of Word2Vec that considers **subword** information, making it more effective for morphologically rich languages and words that are rare or misspelled.

   - **Why it’s better**:
     - **Captures Semantics**: Unlike BoW and TF-IDF, which treat words as independent entities, word embeddings capture the **semantic relationships** between words (e.g., "king" is closer to "queen" than to "apple").
     - **Lower Dimensionality**: Embeddings typically use far fewer dimensions (e.g., 100-300) than the vocabulary-sized sparse vectors in BoW and TF-IDF.
     - **Similarity Between Related Words**: Similar words (e.g., "dog" and "puppy") have similar embeddings, so documents that use different but related words can still end up close together.
   
   - **Limitations**:
     - Embeddings require large corpora for training or pre-trained models.
     - They may still have issues with **polysemy** (same word with different meanings) unless extended to more sophisticated models (e.g., BERT, GPT).
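As a small illustration of how static word embeddings can be used for documents, the sketch below averages word vectors into one dense document vector. The 3-dimensional vectors in `toy_embeddings` are invented for this example and do not come from any trained model; real embeddings would be loaded from a pre-trained Word2Vec, GloVe, or FastText model.

```python
import math

# Hypothetical 3-dimensional "embeddings"; real models learn these from a corpus.
toy_embeddings = {
    "dog":   [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "apple": [0.0, 0.1, 0.9],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def document_vector(tokens, embeddings):
    """Average the vectors of known tokens to get one dense vector per document."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        raise ValueError("no token has an embedding")
    return [sum(dim) / len(known) for dim in zip(*known)]

# Related words are closer to each other than unrelated ones.
print(cosine_similarity(toy_embeddings["dog"], toy_embeddings["puppy"]) >
      cosine_similarity(toy_embeddings["dog"], toy_embeddings["apple"]))  # True
```

Document vectors built this way can be passed to KNN in place of the BoW or TF-IDF matrices, although averaging still discards word order.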

---

### 2. **Contextualized Word Embeddings (BERT, GPT, RoBERTa, T5)**
   - **Description**: **Contextualized embeddings** are a major step forward from traditional word embeddings. They are generated by models like **BERT** (Bidirectional Encoder Representations from Transformers) or **GPT** (Generative Pretrained Transformer), which are pre-trained on large corpora and fine-tuned for specific tasks.
     - These embeddings are generated dynamically based on the **context** in which a word appears, meaning that the embedding for a word can change depending on its surrounding words. For example, the word "bank" would have different embeddings when used in "river bank" versus "bank account."
   
   - **Why it’s better**:
     - **Context-Aware**: Unlike BoW and TF-IDF (which treat words as fixed entities), contextual embeddings change depending on the sentence, allowing for more **accurate representation of word meanings**.
     - **Superior Performance**: Models like BERT have shown **state-of-the-art results** across a variety of NLP tasks, including sentiment analysis, question answering, and text classification.
     - **Transfer Learning**: You can fine-tune pre-trained models on your specific task, leveraging large-scale knowledge from the pre-training phase.

   - **Limitations**:
     - Computationally expensive and requires significant resources.
     - Training from scratch requires a massive amount of data, although pre-trained models like BERT or GPT are widely available for fine-tuning.

---

### 3. **Doc2Vec (Paragraph Vectors)**
   - **Description**: Doc2Vec (an extension of Word2Vec) is designed to represent **entire documents** as vectors. While Word2Vec generates vector representations for individual words, Doc2Vec learns a vector representation for the entire document, along with word vectors. The key idea is that each document has its own **document vector** that helps represent the overall context of the document, in addition to word vectors.
   
   - **Why it’s better**:
     - **Captures Document-Level Context**: Unlike BoW or TF-IDF, which focus only on word-level features, Doc2Vec captures the overall meaning of an entire document.
     - **Efficient for Document Similarity**: It's particularly useful for tasks like document classification, document retrieval, or finding **similar documents** based on semantic meaning.
   
   - **Limitations**:
     - Similar to Word2Vec, Doc2Vec requires a large corpus of data to produce meaningful vectors.
     - Can be computationally intensive depending on the corpus size.

---

### 4. **Latent Semantic Analysis (LSA)**
   - **Description**: LSA is a technique that uses **Singular Value Decomposition (SVD)** to reduce the dimensionality of the term-document matrix (usually produced by BoW or TF-IDF). LSA attempts to uncover latent structures or relationships between terms in a corpus by projecting words and documents into a reduced vector space. The goal is to capture hidden **semantic** relationships between words by considering the context in which they appear.
   
   - **Why it’s better**:
     - **Captures Latent Semantics**: LSA goes beyond the surface-level word frequencies and captures deeper, latent semantic structures. This makes it more robust to synonyms and related terms.
     - **Reduces Noise**: By keeping only the strongest components, LSA discards low-variance directions that mostly reflect noise in word usage.
   
   - **Limitations**:
     - It is based on the **linear assumptions** of SVD and may not be as effective for capturing more complex, non-linear relationships in language.
     - Still relies on the initial **BoW or TF-IDF matrix**, which might be sparse for very large corpora.
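The SVD step can be sketched with NumPy on a tiny, made-up document-term matrix. The counts, the column labels, and the choice of `k = 2` are assumptions for illustration; in practice scikit-learn's `TruncatedSVD` is commonly applied to the BoW or TF-IDF matrix instead.

```python
import numpy as np

# Toy document-term count matrix: rows are documents, columns are terms.
# Columns: car, automobile, engine, fruit, apple
X = np.array([
    [2, 0, 1, 0, 0],
    [0, 2, 1, 0, 0],
    [0, 0, 0, 2, 1],
    [0, 0, 0, 1, 2],
], dtype=float)

# Full SVD, then keep only the k largest singular values (truncated SVD).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_topics = U[:, :k] * s[:k]   # each document as a k-dimensional vector

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents 0 and 1 share only the word "engine" in the raw counts ...
print(round(cosine(X[0], X[1]), 2))                    # 0.2
# ... but land in the same direction in the reduced space.
print(round(cosine(doc_topics[0], doc_topics[1]), 2))  # 1.0
```

Here "car" and "automobile" never co-occur, yet both appear alongside "engine", which is the kind of latent relationship LSA is meant to pick up.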

---

### 5. **Topic Modeling (LDA)**
   - **Description**: **Latent Dirichlet Allocation (LDA)** is a generative probabilistic model that assumes each document is a mixture of various topics, and each topic is characterized by a distribution over words. LDA helps to uncover the **underlying topics** in a corpus of documents, providing a better understanding of document content by grouping words into topics.
   
   - **Why it’s better**:
     - **Unsupervised**: LDA doesn’t require labeled data, making it useful for discovering hidden themes or topics in a large corpus of documents.
     - **Better Representations**: LDA creates topic-based representations rather than just relying on word counts, making it better suited for applications like **document clustering** or **theme identification**.
   
   - **Limitations**:
     - Requires fine-tuning the number of topics and other hyperparameters.
     - Ignores word order (LDA is itself built on bag-of-words counts), and the learned topics can be hard to interpret or can vary between runs.

---

### 6. **Recurrent Neural Networks (RNNs) / Long Short-Term Memory Networks (LSTMs) / GRUs**
   - **Description**: **RNNs**, **LSTMs**, and **GRUs** are deep learning models designed for sequence data. They are particularly powerful for capturing **sequential dependencies** in text. Unlike the methods discussed earlier, these models process words in **order** and can capture contextual relationships between words over longer spans, making them effective for understanding sentences, paragraphs, or even entire documents.
   
   - **Why it’s better**:
     - **Contextual Understanding**: RNNs, LSTMs, and GRUs can learn long-term dependencies between words in a sequence, capturing more complex **semantic and syntactic patterns** in text.
     - **Superior Performance on Sequential Data**: These models perform particularly well on tasks like **language modeling**, **machine translation**, and **speech recognition**, where word order and context are crucial.
   
   - **Limitations**:
     - Computationally expensive, especially when processing long sequences.
     - RNNs suffer from vanishing/exploding gradient problems, which LSTMs and GRUs try to address, but these still require careful training.

---

### Conclusion

While **Bag-of-Words** and **TF-IDF** are solid baseline techniques for text representation, more advanced methods can offer significant improvements, especially in tasks requiring deeper understanding and context of text. These include:

- **Word Embeddings (Word2Vec, GloVe, FastText)** for better semantic understanding.
- **Contextualized Embeddings (BERT, GPT)** for state-of-the-art performance in many NLP tasks.
- **Doc2Vec** for capturing document-level meaning.
- **Latent Semantic Analysis (LSA)** for uncovering latent topics and reducing noise.
- **Topic Modeling (LDA)** for identifying underlying topics in documents.
- **RNNs/LSTMs/GRUs** for tasks requiring sequential dependencies.

Each of these methods has its own strengths and is more suitable for specific kinds of problems, but overall, they provide richer, more informative representations of text than traditional BoW or TF-IDF models.



3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.


**Stemming** and **Lemmatization** are two common techniques used in natural language processing (NLP) to reduce words to their base or root forms. Both methods help to normalize words, making them easier to analyze by reducing inflected words or words with derivational affixes to a common root form. However, they operate differently, and each has its own pros and cons depending on the context in which they are used. Let’s break down both techniques and evaluate their advantages and disadvantages.

### **1. Stemming**

**Stemming** is the process of reducing a word to its root form by removing prefixes and suffixes. This is done through simple algorithms (such as **Porter Stemmer**, **Lancaster Stemmer**, or **Snowball Stemmer**) that apply heuristic rules to remove suffixes from words, often without regard for whether the result is a valid word.

- **Example**:
  - "running" → "run"
  - "happiness" → "happi"
  - "better" → "better" (unchanged, because no suffix rule applies; a more aggressive stemmer may shorten it)
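The rule-based idea can be sketched in a few lines of plain Python. This is a toy suffix stripper written for illustration only: the suffix list, the `min_stem_len` threshold, and the `toy_stem` name are assumptions of this sketch, and it is far simpler than the Porter, Lancaster, or Snowball rule sets.

```python
# Toy suffix-stripping stemmer. The rules below are illustrative
# assumptions, not the rule set of any real stemmer.
SUFFIXES = ("ness", "ing", "ed", "ly", "s")


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            stem = word[: -len(suffix)]
            # Collapse a doubled final consonant left behind by the cut
            # ("runn" -> "run"), but keep endings such as "ll" or "ss".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word


for word in ["running", "happiness", "better", "jumped", "quickly", "cats"]:
    print(word, "->", toy_stem(word))
```

Note that the function never consults a dictionary: "happiness" is cut to "happi" simply because the "-ness" rule fires, which is exactly why stems are not guaranteed to be real words.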

#### **Pros of Stemming**:
1. **Speed**: Stemming algorithms are generally faster than lemmatization because they apply simple, rule-based heuristics without the need for a dictionary lookup or part-of-speech tagging.
2. **Simplicity**: Stemming is conceptually simpler. You don’t need complex resources like dictionaries or word mappings; a stemmer just cuts off common affixes (like “-ing,” “-ly,” “-ed”).
3. **Less Memory Intensive**: Since no dictionary or lexicon is required, stemming is less resource-intensive than lemmatization.
4. **Good for Large Datasets**: Stemming is particularly useful when working with large text corpora where speed and simplicity are paramount.

#### **Cons of Stemming**:
1. **Inaccuracy**: The biggest drawback of stemming is that it can produce **non-words** that don’t exist in the dictionary, such as "happi" instead of "happy" or "studi" instead of "study".
2. **Loss of Meaning**: Stemming works only on the surface form of a word, so it cannot relate words whose connection is not visible in the spelling. For example, it leaves "better" as "better," whereas lemmatization can map it to "good."
3. **Over-Stemming/Under-Stemming**: A stemmer may over-reduce words so that unrelated words end up with the same stem (e.g., "universe" and "university" both become "univers"), or under-reduce them so that related words keep different stems (e.g., "ran" stays "ran" while "running" becomes "run").
4. **Context Ignorance**: Since stemming algorithms don’t consider context, they might reduce words incorrectly. For example, “organization” might be stemmed to "organ," which is a different word with a different meaning.
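Over-stemming and under-stemming are easiest to see by grouping words according to the stem they receive. The sketch below does this with a toy stemmer; the suffix rules and the `toy_stem` and `conflation_classes` names are assumptions made for this illustration and do not reproduce any real stemmer's rules.

```python
from collections import defaultdict

# Toy rules for illustration only; not the Porter rule set.
SUFFIXES = ("ity", "ing", "al", "e")


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            stem = word[: -len(suffix)]
            # Collapse a doubled final letter left by the cut ("runn" -> "run").
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word


def conflation_classes(words, stem_fn):
    """Group words that receive the same stem."""
    groups = defaultdict(list)
    for word in words:
        groups[stem_fn(word)].append(word)
    return dict(groups)


words = ["universe", "university", "universal", "run", "running", "ran"]
classes = conflation_classes(words, toy_stem)

for stem, members in classes.items():
    print(stem, "<-", members)
```

Here "universe", "university", and "universal" collapse into one group even though their meanings differ (over-stemming), while "ran" is left in a group of its own, apart from "run" and "running" (under-stemming).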

---

### **2. Lemmatization**

**Lemmatization** is a more sophisticated technique that aims to reduce a word to its **dictionary form**, known as the **lemma**. Lemmatization involves using vocabulary and grammar rules, often with the help of a **part-of-speech (POS) tagger** to determine the word's correct lemma. For example, "better" might be reduced to "good" (as it’s an adjective), while "running" is reduced to "run" (as a verb).

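- **Example**:
  - "running" → "run" (as a verb)
  - "happiness" → "happiness" (already a dictionary form; lemmatization does not strip derivational suffixes such as "-ness")
  - "better" → "good" (as an adjective)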
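The dictionary-plus-POS idea can be sketched with a small lookup table. The `LEMMA_TABLE` entries and the `toy_lemmatize` function below are hypothetical and written for this illustration; a real lemmatizer relies on a full lexicon such as WordNet rather than a handful of hand-picked entries.

```python
# Hypothetical mini-lexicon keyed by (word, part of speech).
# POS codes used in this sketch: "n" = noun, "v" = verb, "a" = adjective.
LEMMA_TABLE = {
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("better", "a"): "good",
    ("studies", "n"): "study",
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos); fall back to the word itself."""
    word = word.lower()
    return LEMMA_TABLE.get((word, pos), word)


print(toy_lemmatize("better", pos="a"))
print(toy_lemmatize("better"))  # default POS is noun, so no entry matches
print(toy_lemmatize("leaves", pos="n"))
print(toy_lemmatize("leaves", pos="v"))
print(toy_lemmatize("happiness"))
```

The same surface form can map to different lemmas depending on the tag ("leaves" as a noun versus a verb), and a wrong or missing tag leaves "better" unchanged. This is why the quality of the POS tagger matters for lemmatization.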

#### **Pros of Lemmatization**:
1. **Accuracy**: Lemmatization takes the word's part of speech into account and reduces words to their correct dictionary form. For example, "better" (as an adjective) becomes "good," and "running" (as a verb) becomes "run." This reduces the likelihood of creating **non-words** or reducing words incorrectly.
2. **Preserves Meaning**: Since lemmatization uses a dictionary and part-of-speech information, it can better preserve the **meaning** of words. For instance, it can map "leaves" to "leaf" when it is used as a noun and to "leave" when it is used as a verb.
3. **More Natural**: The results of lemmatization are usually **real words**, and the text remains grammatically sound, which is useful for tasks like text generation or when the goal is to maintain linguistic accuracy.
4. **Improves Accuracy in Downstream Tasks**: Lemmatization is generally better when working with models that rely on precise and well-formed inputs, such as **named entity recognition (NER)**, **information retrieval (IR)**, and **machine translation**.

#### **Cons of Lemmatization**:
1. **Slower**: Lemmatization is more computationally expensive than stemming, as it involves accessing a dictionary or lexicon and performing part-of-speech tagging. This can lead to slower processing times, especially when dealing with large datasets.
2. **Requires More Resources**: Lemmatization requires access to **WordNet** or a similar lexicon and may need additional resources like POS taggers. It is more resource-intensive than stemming, both in terms of memory and processing power.
3. **Complexity**: Lemmatization is more complex than stemming, as it requires context and additional linguistic knowledge to determine the correct lemma. This means that lemmatization tools are more complicated to implement or use compared to stemmers.
4. **Possible Errors in POS Tagging**: If the part-of-speech tagger used for lemmatization makes a mistake (for example, mistaking a noun for a verb), the lemma might be incorrect, which could lead to errors in the final results.

---

### **Comparison of Stemming and Lemmatization**

| **Aspect**               | **Stemming**                                           | **Lemmatization**                                     |
|--------------------------|--------------------------------------------------------|-------------------------------------------------------|
| **Speed**                | Faster; simple, rule-based algorithms                 | Slower; requires more processing (POS tagging, dictionary lookup) |
| **Complexity**           | Simple, heuristic-based approach                      | More complex; requires a dictionary and POS tagging   |
| **Accuracy**             | Can produce non-words or incorrect stems              | More accurate; ensures that words are reduced to their valid lemma |
| **Preserving Meaning**   | May lose meaning due to over-simplification            | Better preserves meaning by considering the part of speech |
| **Resource Usage**       | Less resource-intensive                               | More resource-intensive (requires lexicon, POS tagging) |
| **Use Case**             | Suitable for quick, exploratory tasks where precision is less important | Suitable for tasks where precision is critical, such as information retrieval or document classification |

### **When to Use Stemming vs Lemmatization?**

- **Use Stemming when**:
  - You are working with **large-scale data** and need a quick, computationally efficient solution.
  - The accuracy of the final words is less important than speed (e.g., in document clustering or exploratory text analysis).
  - You are working with **highly noisy text** (e.g., social media data, where spelling mistakes and informal language are common), and you don't need precise word forms.
  
- **Use Lemmatization when**:
  - The **semantic meaning** of the words is crucial, and you want to preserve the correct form (e.g., for text classification, sentiment analysis, or named entity recognition).
  - **Accuracy is paramount**, especially in tasks that require high precision and the model needs to deal with varied word forms (e.g., in search engines, machine translation, or knowledge graph creation).
  - You are working with **formal text** where accurate linguistic structure matters.

### **Conclusion**

- **Stemming** is a **faster and simpler** method that is suitable for large-scale applications where computational efficiency is key and a little inaccuracy is acceptable.
- **Lemmatization**, while more **accurate and semantically sensitive**, is computationally expensive and slower, making it more appropriate for tasks where precision is crucial and where the text is more complex or formal.

Both techniques have their place in natural language processing, and the choice between them largely depends on the specific problem, the dataset, and the trade-offs you are willing to make between accuracy and computational efficiency.

### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
