<a href="https://colab.research.google.com/github/jayalakshmikarri04/FMML-lab/blob/main/Copy_of_Lab_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Jashn Arora


---

## **Section 1: Understanding NLP tools**

In this lab we will be using KNN on a real world NLP application i.e. is text classification. But first look at some NLP techniques for text classification and tools that we use when we want to use python for NLP.

## Section 1.2: Data Cleaning and Preprocessing step

Raw text must be processed and converted into a form so that it is suitable to use with various machine-learning algorithms.  
In case of text, there are lots of things that need to be taken into account.  


1.   Removing numbers from the text
2.   Handling capitalization and punctuation.
3.   Stemming and Lemmatizing text.  

And most importantly, one can't just use words or images directly in algorithms; they need to be converted into vectors- a form that algorithms can understand.



### **NLTK**
NLTK (or Natural Language Tool Kit) is a commonly used library for processing text. We will use this tool in this lab. Lets first install it.


In [1]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [8]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                return ''

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)
        for t in tagged:
            try:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0], get_tag(t[1][:2])))
            except:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0]))
        return text_result

    if stemmer:
        text_result = []
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        for t in tokens:
            text_result.append(snowball_stemmer.stem(t))
        return text_result

In [10]:
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

LookupError: 
**********************************************************************
  Resource [93mpunkt_tab[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('punkt_tab')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mtokenizers/punkt_tab/english/[0m

  Searched in:
    - '/root/nltk_data'
    - '/usr/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.

In [5]:
# Functions to convert document(s) to a list of words, with the option of removing stopwords. Returns document-term matrix.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test

## Section 1.3: TF-IDF
TF-IDF technique is used to find meaning of sentences consisting of words and cancels out the incapabilities of Bag of Words technique which is good for text classification or for helping a machine read words in numbers.

The number of times a term occurs in a document is called its Term frequency (TF).

 Document frequency is the number of documents in which the word is present.  Inverse DF (IDF) is the inverse of the document frequency which measures the informativeness of term *t*.




In [None]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [None]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [None]:
df = df.dropna()

In [None]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.

In [None]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()

In [None]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()

# Section 4: SPAM TEXT DATASET
Now let's use what we've learnt to classify texts as spam or not spam.

In [None]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas as pd
df = pd.read_csv('spam.csv')
df

In [None]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [None]:
df.head(5)

In [None]:
len(df)

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [None]:
# This cell may take some time to run
predicted, y_test = bow_knn()

In [None]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()

### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in a better accuracy than Bag-of-Words ?
The TF-IDF (Term Frequency-Inverse Document Frequency) approach generally results in better accuracy than the Bag-of-Words (BoW) approach because it accounts for the importance of words within a corpus, rather than treating all words equally, as in the BoW model.

Here’s a breakdown of the key reasons why TF-IDF tends to be more effective:

1. Weighting Word Importance:

In BoW, each word in a document is treated as equally important, and the frequency of each word is counted. This can lead to issues, especially with common words (like "the", "is", etc.) that appear frequently across many documents but do not provide meaningful information for classification or analysis.

TF-IDF, however, assigns a higher weight to words that are frequent within a specific document (Term Frequency), but also reduces the weight of words that appear frequently across many documents (Inverse Document Frequency). This helps prioritize unique, informative words.



2. Handling Stopwords:

Common words (stopwords) that appear across many documents do not provide distinguishing power for classification or clustering tasks. TF-IDF naturally downweights these words, whereas BoW treats them as equally important.



3. Focus on Distinctive Terms:

TF-IDF emphasizes words that are distinctive to a particular document or a small group of documents. This allows the model to capture more meaningful features for tasks like text classification, while BoW might focus on less informative words that appear in most documents.



4. Improved Generalization:

By reducing the impact of common, non-informative words, TF-IDF helps the model generalize better and avoid overfitting, which is a common issue with BoW, especially when there are a lot of frequent words or large datasets.




Overall, TF-IDF provides a more nuanced view of word importance, which generally results in better performance for text analysis tasks compared to the simpler BoW approach.


2. Can you think of techniques that are better than both BoW and TF-IDF ?
Yes, there are several techniques that can outperform both Bag-of-Words (BoW) and TF-IDF, especially for tasks involving more complex language understanding. These advanced techniques often capture semantic meaning, contextual information, and relationships between words more effectively than BoW and TF-IDF. Some of these methods include:

1. Word Embeddings (Word2Vec, GloVe, FastText)

Word2Vec: This model learns dense, low-dimensional vector representations of words by capturing semantic relationships and contexts. Unlike BoW and TF-IDF, Word2Vec captures the meaning of words based on their surrounding words in a large corpus.

GloVe (Global Vectors for Word Representation): Like Word2Vec, GloVe generates word vectors, but it does so by factorizing the word co-occurrence matrix, capturing global statistical information about word relationships.

FastText: An extension of Word2Vec, FastText represents words as bags of character n-grams, which helps handle out-of-vocabulary words better, especially for morphologically rich languages.


Advantages:

These models capture semantic similarity between words (e.g., "king" and "queen" are closer in vector space).

They handle rare and out-of-vocabulary words better than BoW and TF-IDF.


2. Contextualized Word Embeddings (BERT, GPT, ELMo)

BERT (Bidirectional Encoder Representations from Transformers): BERT and similar models (like GPT) are based on transformer architecture, which learns contextualized representations of words, meaning the representation of a word changes depending on its surrounding words. This is in contrast to Word2Vec or GloVe, where words have static embeddings.

ELMo (Embeddings from Language Models): ELMo generates word embeddings based on the entire sentence context, which means that it can distinguish between different meanings of a word depending on its context (e.g., "bank" as a financial institution vs. "bank" of a river).


Advantages:

BERT and similar models outperform traditional approaches by capturing fine-grained contextual meanings, handling polysemy (words with multiple meanings), and understanding word order.

These models can also be fine-tuned for specific tasks like sentiment analysis, question answering, etc.


3. Transformer-Based Models (BERT, RoBERTa, T5)

RoBERTa: A variant of BERT that is trained with different hyperparameters and larger datasets. It often shows superior performance compared to BERT in various NLP tasks.

T5 (Text-to-Text Transfer Transformer): This model treats all NLP tasks as a text-to-text problem, allowing it to be fine-tuned for a wide range of tasks like summarization, translation, and classification.


Advantages:

These models handle complex language understanding better than BoW and TF-IDF.

Transformer-based models understand the relationships between words in a sequence, improving the overall comprehension of context and meaning.


4. Topic Modeling (LDA, NMF)

Latent Dirichlet Allocation (LDA): LDA is a generative probabilistic model that can discover abstract topics from a collection of documents. Each document is assumed to be a mixture of topics, and each topic is a distribution over words.

Non-Negative Matrix Factorization (NMF): NMF is another technique used for topic modeling that factorizes a document-term matrix into two lower-rank matrices, focusing on identifying latent topics in the data.


Advantages:

Topic modeling techniques can reveal the underlying themes or topics in a collection of documents, which can be more meaningful than individual word counts.

This can improve classification or clustering tasks by focusing on topic-based features instead of individual words.


5. Doc2Vec (Paragraph Vectors)

Doc2Vec is an extension of Word2Vec that generates vector representations not only for words but also for entire documents or paragraphs. It captures the context of words within a document and produces a fixed-length vector for a document.


Advantages:

Unlike BoW or TF-IDF, which represent documents as sparse vectors, Doc2Vec produces dense vectors, making it more efficient for various machine learning models.

It captures the semantic meaning of documents, making it useful for document classification, clustering, and recommendation systems.


6. Neural Networks for Text Classification

Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) (including LSTMs and GRUs) have been applied to NLP tasks for classification, sentiment analysis, and sequence prediction. These models learn hierarchical representations of words and sentences, which allows them to capture more complex patterns than BoW or TF-IDF.


Advantages:

These models can learn contextual relationships within a sentence or document, which makes them more capable of handling complex language tasks compared to traditional techniques.


Conclusion:

While BoW and TF-IDF are useful for many basic tasks, modern approaches like word embeddings, transformer-based models (like BERT), and topic modeling are better suited for tasks requiring deeper semantic understanding, context, and relationships between words. Transformer models, in particular, have set new benchmarks for performance in NLP tasks, making them one of the most powerful techniques available today.


3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.
To better understand Stemming and Lemmatization, here's a summary of each technique and their respective pros and cons.

Stemming:

Stemming is a technique in natural language processing (NLP) that reduces a word to its root form, which may not always be a valid word. For example:

"running" → "run"

"happily" → "happi"

"better" → "better"


Stemming algorithms (e.g., Porter Stemmer, Snowball Stemmer) typically use simple rules or heuristics to remove affixes (like prefixes or suffixes) from words.

Pros of Stemming:

1. Efficiency: Stemming is computationally faster and simpler than lemmatization because it applies basic rules without needing to consider word meaning.


2. Reduced Dimensionality: By reducing words to their root forms, stemming can help reduce the vocabulary size, making models faster and more efficient.


3. Good for Search Engines: Stemming can be useful in search engines where exact word matching isn't necessary, and retrieving any form of a word can be helpful.



Cons of Stemming:

1. Over-Stemming or Under-Stemming: Stemming algorithms sometimes reduce words incorrectly, either by removing too many affixes (over-stemming) or not enough (under-stemming). For example, "better" might incorrectly be stemmed to "bet," losing meaning.


2. Non-Standard Outputs: The resulting "stemmed" word may not always be a valid word, making it difficult to interpret in the context of the original text.


3. Context Ignorance: Stemming doesn't account for the meaning of the word or context in which it is used. For instance, "running" and "runner" might be reduced to the same stem "run," but they have different meanings.



Lemmatization:

Lemmatization is a more sophisticated technique where words are reduced to their base or dictionary form (lemma). Unlike stemming, lemmatization considers the context of the word and applies rules based on its part of speech. For example:

"running" → "run" (verb)

"better" → "good" (adjective)


Lemmatization requires a vocabulary and part-of-speech tagging, so it’s computationally more expensive than stemming.

Pros of Lemmatization:

1. Context-Aware: Lemmatization takes the meaning and part of speech into account, which ensures that the root word is valid. For example, "better" becomes "good," which makes more sense contextually.


2. Accuracy: The results of lemmatization are more accurate and meaningful. Lemmatized words are valid words in the language, which helps maintain their semantics.


3. Better for Text Analysis: Because lemmatization results in more semantically accurate representations, it’s often better for tasks like machine translation, sentiment analysis, and other sophisticated NLP applications.



Cons of Lemmatization:

1. Computationally Intensive: Lemmatization is slower than stemming because it involves more processing, such as part-of-speech tagging and the use of a lexicon.


2. Dependency on Lexicons: Lemmatization requires external resources like dictionaries or morphological analyzers, which might not always be available or comprehensive, especially for languages with rich morphology.


3. Complexity: Lemmatization algorithms are more complex, which can make them harder to implement and fine-tune for specific tasks.



Comparison of Stemming and Lemmatization:

Summary:

Stemming is fast and efficient but may produce non-standard or meaningless words, which might not always be suitable for tasks that require a precise understanding of the text.

Lemmatization, on the other hand, provides more accurate and semantically meaningful results, but it is slower and computationally more expensive. It’s more appropriate for tasks requiring understanding of the text's meaning and context.


In general, if you need a quick and computationally cheaper approach where precision is not critical (e.g., search engines), stemming may be sufficient. However, for tasks requiring more linguistic understanding and accuracy (e.g., sentiment analysis, machine translation), lemmatization is typically the better choice.

### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
