<a href="https://colab.research.google.com/github/honey799/FMML/blob/main/Mod3_lab3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Jashn Arora


---

## **Section 1: Understanding NLP tools**

In this lab we will use KNN on a real-world NLP application: text classification. But first, let's look at some NLP techniques for text classification and the tools we use when working with Python for NLP.

## Section 1.1: Data Cleaning and Preprocessing

Raw text must be processed and converted into a form that is suitable for machine-learning algorithms.  
In the case of text, several things need to be taken into account:


1.   Removing numbers from the text
2.   Handling capitalization and punctuation.
3.   Stemming and Lemmatizing text.  

Most importantly, one can't use words or images directly in algorithms; they must first be converted into vectors, a numeric form that algorithms can work with.
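
As a minimal sketch of the first two steps, assuming plain English text, the snippet below replaces digits and punctuation with spaces and lowercases what is left. The helper name `basic_clean` is hypothetical and is not the `cleanText` function used later in this lab; stemming and lemmatization are left out here.

```python
import re

def basic_clean(text):
    """Hypothetical helper: drop non-letters, lowercase, and split on whitespace."""
    letters_only = re.sub(r"[^A-Za-z]", " ", text)  # digits and punctuation become spaces
    return letters_only.lower().split()

tokens = basic_clean("The 3 Cats sat, quietly, on 2 mats!")
print(tokens)
```

The result is still a list of strings; turning it into a numeric vector is the job of the Bag-of-Words and TF-IDF steps.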



### **NLTK**
NLTK (the Natural Language Toolkit) is a commonly used library for processing text. We will use it in this lab. Let's first download the NLTK resources we need.


In [7]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [None]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                return wordnet.NOUN  # fall back to the lemmatizer's default POS rather than an invalid ''

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)
        for t in tagged:
            try:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0], get_tag(t[1][:2])))
            except:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0]))
        return text_result

    if stemmer:
        # Reduce each token to its Snowball stem.
        snowball_stemmer = SnowballStemmer('english')
        return [snowball_stemmer.stem(t) for t in word_tokenize(text)]

    # Neither lemmatizing nor stemming was requested: return the cleaned tokens as-is.
    return word_tokenize(text)

In [None]:
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

Troubling
troubl
trouble


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
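
To make the idea concrete, here is a minimal pure-Python sketch of a bag-of-words built by hand. The two toy sentences and the helper `to_bow` are assumptions for illustration; the lab itself uses scikit-learn's `CountVectorizer`, which also handles tokenization and stopwords.

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

# Vocabulary: every distinct word in the corpus, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

def to_bow(doc, vocab):
    """Hypothetical helper: count how often each vocabulary word occurs in doc."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

bow = [to_bow(doc, vocab) for doc in docs]
print(vocab)
print(bow)

# Word order is discarded: a reordered sentence maps to the same vector.
print(to_bow("the mat sat on the cat", vocab) == bow[0])
```

Each row has one entry per vocabulary word, so every document ends up as a vector of the same length regardless of how long the text was.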

In [None]:
# Functions to convert document(s) to a list of words, with the option of removing stopwords. Returns document-term matrix.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test


## Section 1.3: TF-IDF
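TF-IDF (term frequency-inverse document frequency) addresses a limitation of Bag-of-Words: with raw counts, words that are frequent everywhere dominate the representation. TF-IDF instead weights each term by how often it occurs in a document, discounted by how many documents in the corpus contain it. Like Bag-of-Words, it turns text into numeric vectors that can be used for text classification.

The number of times a term occurs in a document is called its term frequency (TF).

Document frequency (DF) is the number of documents in which the term is present. Inverse document frequency (IDF) is a decreasing function of DF, commonly the logarithm of the total number of documents divided by DF; it measures the informativeness of term *t*. The TF-IDF weight of a term in a document is the product of its TF and its IDF.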
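
The following is a minimal sketch that computes TF-IDF by hand, assuming the textbook definition idf(t) = log(N / df(t)), where N is the number of documents. The toy reviews and the helper `tf_idf` are made up for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

docs = [
    "the food was great".split(),
    "the service was slow".split(),
    "the food was cold".split(),
]
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(doc):
    """Hypothetical helper: raw term count times log(N / df)."""
    tf = Counter(doc)
    return {term: tf[term] * math.log(n_docs / df[term]) for term in tf}

weights = tf_idf(docs[0])
print(weights)
```

Terms that appear in every document ("the", "was") get weight 0 under this definition, while "great", which appears in only one document, gets the largest weight.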




In [None]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [6]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()


In [None]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [None]:
df = df.dropna()

In [None]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.
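
One of those differences is the distance metric: Euclidean for the Bag-of-Words model and cosine for the TF-IDF model. As a rough sketch of why this matters, the brute-force KNN below (a hypothetical helper, not scikit-learn's implementation) classifies the same toy query with both metrics. The count vectors are invented for illustration.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm  # assumes neither vector is all zeros

def knn_predict(train_X, train_y, query, k, distance):
    """Hypothetical brute-force KNN: majority vote among the k closest training points."""
    nearest = sorted(range(len(train_X)), key=lambda i: distance(train_X[i], query))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

# Toy count vectors over the vocabulary ["good", "bad"].
train_X = [[10, 1], [0, 1]]   # a long, mostly positive review and a short negative one
train_y = ["pos", "neg"]
query = [2, 0]                # a short review that only says "good"

print(knn_predict(train_X, train_y, query, k=1, distance=euclidean))
print(knn_predict(train_X, train_y, query, k=1, distance=cosine_distance))
```

Euclidean distance is sensitive to document length, so the short query lands next to the short negative review. Cosine distance compares direction only, so the query matches the long positive review.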

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.
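
In the meantime, here is a rough sketch of what `cv=3` requests: the training rows are split into three folds, and each fold takes one turn as the held-out set while the other two are used for fitting. The helper `k_fold_indices` is hypothetical and uses plain contiguous folds. For classifiers, scikit-learn's `cross_val_score` uses stratified folds by default when `cv` is an integer, which this sketch does not reproduce.

```python
def k_fold_indices(n_samples, k):
    """Hypothetical helper: yield (train_idx, val_idx) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(7, 3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

The reported cross-validation accuracy is the mean of the scores obtained on the three held-out folds.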

In [None]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()

In [None]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()

# **Section 4: SPAM TEXT DATASET**
Now let's use what we've learnt to classify texts as spam or not spam.

In [None]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas as pd
# Skip malformed rows (pandas >= 1.3; older versions used error_bad_lines=False).
df = pd.read_csv('spam.csv', on_bad_lines='skip')
df

In [None]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [None]:
df.head(5)

In [None]:
len(df)

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [None]:
# This cell may take some time to run
predicted, y_test = bow_knn()

In [None]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()

### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in better accuracy than Bag-of-Words?
2. Can you think of techniques that are better than both BoW and TF-IDF?
3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.

# 1A)
TF-IDF (term frequency-inverse document frequency) and Bag-of-Words are both text-representation techniques used in NLP and machine learning, and which one works better depends on the task and dataset. Here are some reasons why TF-IDF often gives better accuracy than Bag-of-Words:

## Term Importance Weighting

Both representations ignore word order. Bag-of-Words weights a term only by its raw count in a document, so a word that is frequent everywhere counts as much as one that is distinctive. TF-IDF combines the term frequency (TF) with the inverse document frequency (IDF), giving higher weights to terms that are frequent in a specific document but rare in the rest of the corpus, which makes them more discriminative.

## Handling Common Words

Bag-of-Words assigns high counts to common words that carry little meaning. Removing stopwords, as both models in this lab do, handles the most obvious cases. TF-IDF goes further by downweighting any term that is frequent across all documents, including common words that are not on the stopword list.

## Effect on the Feature Space

TF-IDF does not reduce dimensionality: the vocabulary, and therefore the number of features, is the same as for Bag-of-Words. What changes is the scale of each feature. Terms that occur in almost every document receive small weights and contribute little to the distance between two documents. This matters for a distance-based method such as KNN, especially when the number of features is large relative to the number of training instances.

## Better Representation of Document Content

Because distinctive terms carry the largest weights, a TF-IDF vector reflects what sets a document apart from the rest of the corpus rather than just how long it is or which filler words it contains.

## Noise Reduction

Bag-of-Words is sensitive to noise because a term's weight depends only on its count. The IDF factor reduces the impact of terms that are common across the corpus and therefore uninformative.

## Semantic Understanding

TF-IDF does not model meaning: synonyms remain unrelated features, just as in Bag-of-Words. It does, however, emphasize terms that are indicative of a document's content, which can help in tasks where a few topic-specific words decide the class.

The choice between Bag-of-Words and TF-IDF depends on the characteristics of the data and the nature of the task. In some cases Bag-of-Words performs well, especially when the focus is on word presence rather than importance. Note also that the two models in this lab differ in more than the representation (the TF-IDF model uses cosine distance and distance-weighted votes), so the accuracy gap is not due to TF-IDF alone. More advanced techniques, such as word embeddings (e.g., Word2Vec, GloVe), capture semantic relationships between words. The choice of text representation should align with the goals of the machine learning task.


# 2A)
Bag-of-Words (BoW) and TF-IDF are commonly used techniques for text representation in machine learning, but several more advanced techniques capture richer features of textual data. Here are a few alternatives:

## Word Embeddings

Word embeddings, such as Word2Vec, GloVe, and FastText, represent words as dense vectors in a continuous vector space. These embeddings capture semantic relationships and similarity between words, which BoW and TF-IDF cannot do.

## Doc2Vec (Paragraph Vectors)

An extension of Word2Vec, Doc2Vec represents entire documents as vectors. It learns to embed documents in a continuous vector space, taking into account the context in which words appear.

## BERT (Bidirectional Encoder Representations from Transformers)

BERT is a transformer-based model that is pre-trained on large corpora and captures bidirectional context for the words in a sentence. It achieved state-of-the-art results on a wide range of NLP tasks when it was introduced, and fine-tuning it for a specific task often works well.

## Attention Mechanisms

Attention mechanisms, as used in transformers, allow a model to focus on different parts of the input sequence when making predictions. This helps capture long-range dependencies in text.

## Subword Tokenization

Algorithms such as Byte Pair Encoding (BPE), and tools such as SentencePiece, tokenize text into subword units. This is particularly useful for handling rare or out-of-vocabulary words.

## TF-IDF with Word Embeddings

Combining TF-IDF with word embeddings can provide a richer representation: TF-IDF gives the importance of each word in a document, and these weights are used to form a weighted combination of the word embeddings.

## ELMo (Embeddings from Language Models)

ELMo is a word representation produced by a deep, context-dependent neural network. It considers the entire input sentence when generating each word's representation, capturing syntactic and semantic context.

## ULMFiT (Universal Language Model Fine-tuning)

ULMFiT is a transfer learning approach designed for NLP tasks. It pre-trains a language model on a large corpus and then fine-tunes it for a specific task, which has been shown to be effective when labeled data is limited.

The choice of technique depends on the specific task, the dataset, and the computational resources available. Experimenting with different approaches, or ensembling them, can also improve performance.
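
As an illustration of the TF-IDF-weighted embedding idea mentioned above, here is a minimal sketch of a weighted average of word vectors. The 2-D "embeddings" and the TF-IDF weights are invented toy values, and `weighted_doc_vector` is a hypothetical helper; real embeddings would come from a trained model such as Word2Vec or GloVe.

```python
# Toy 2-D "embeddings", invented for illustration; real ones come from a trained model.
embeddings = {
    "great": [0.9, 0.1],
    "food": [0.2, 0.8],
    "the": [0.5, 0.5],
}
# Hypothetical TF-IDF weights for one document; "the" is common, so its weight is 0.
tfidf_weights = {"great": 1.0, "food": 0.5, "the": 0.0}

def weighted_doc_vector(weights, embeddings):
    """Hypothetical helper: TF-IDF-weighted average of word vectors."""
    total = sum(weights.values())  # assumes at least one non-zero weight
    dim = len(next(iter(embeddings.values())))
    doc_vec = [0.0] * dim
    for word, w in weights.items():
        for i, value in enumerate(embeddings[word]):
            doc_vec[i] += w * value / total
    return doc_vec

doc_vec = weighted_doc_vector(tfidf_weights, embeddings)
print(doc_vec)
```

The document vector is pulled toward "great", the most heavily weighted word, and is unaffected by "the".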


# 3A)
Stemming and lemmatization are techniques used in natural language processing (NLP) and text mining to reduce words to their base or root form. Here's an overview of each, along with their pros and cons:

## Stemming

### Definition

Stemming is the process of removing suffixes or prefixes from a word to obtain its root form (stem). The goal is to reduce words to a common base form, even if the result is not a valid word.

### Pros

#### Computational Efficiency

Stemming is usually faster than lemmatization because it relies on simple rule-based operations.

#### Reduction of Inflections

Stemming reduces inflected words to a common base form, which can help reduce the dimensionality of the feature space.

### Cons

#### Overstemming or Understemming

Stemming may result in overstemming (where different words are reduced to the same stem even though they have different meanings) or understemming (where words with the same meaning are not reduced to the same stem).

#### Loss of Meaning

Since stemming doesn't consider the context or meaning of words, it can lead to a loss of meaning. The resulting stems might not be valid words and may not carry the same semantic information as the original words.

#### Language Dependency

Stemming rules are language-dependent, and different languages may require different stemming algorithms.
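
The overstemming and understemming failure modes can be reproduced with a toy rule-based stemmer. The suffix list and the helper `toy_stem` below are assumptions made for illustration; this is not the Porter or Snowball algorithm, only a sketch of the same suffix-stripping idea.

```python
SUFFIXES = ["ity", "ing", "al", "ed", "e", "s"]  # toy rule list, checked in this order

def toy_stem(word):
    """Hypothetical suffix stripper; keeps at least three characters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Overstemming: three words with different meanings collapse to one stem.
print([toy_stem(w) for w in ["universe", "university", "universal"]])
# Understemming: two forms of the same word keep different stems.
print([toy_stem(w) for w in ["alumnus", "alumni"]])
```

Adding more rules fixes some cases and breaks others, which is why practical stemmers are tuned per language.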
## Lemmatization

### Definition

Lemmatization is the process of reducing words to their base or dictionary form (lemma) by considering the word's meaning and context.

### Pros

#### Preservation of Meaning

Lemmatization preserves the semantic meaning of words by reducing them to their base form, which is a valid word. This can be crucial for tasks that require a deeper understanding of the text.

#### Better for Information Retrieval

In information retrieval tasks, lemmatization can improve performance by grouping together words with the same meaning, reducing noise in the data.

### Cons

#### Computational Complexity

Lemmatization is generally more computationally expensive than stemming, as it involves looking up words in a lexicon or using more complex algorithms to determine the base form.

#### Resource Dependency

Lemmatization often requires access to a lexicon or database of word forms, and it may not perform well with out-of-vocabulary words or in languages with limited resources.
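
A lexicon-based lemmatizer can be sketched as a lookup keyed by word and part of speech. The tiny lexicon and the helper `toy_lemmatize` below are invented for illustration; a real lemmatizer such as NLTK's `WordNetLemmatizer` relies on a far larger resource.

```python
# Toy lexicon, invented for illustration; keys are (word, part-of-speech) pairs.
LEXICON = {
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
    ("better", "a"): "good",
    ("mice", "n"): "mouse",
}

def toy_lemmatize(word, pos):
    """Hypothetical lookup lemmatizer: return the lexicon entry, or the word unchanged."""
    return LEXICON.get((word, pos), word)

print(toy_lemmatize("meeting", "v"))  # verb reading
print(toy_lemmatize("meeting", "n"))  # noun reading
print(toy_lemmatize("blorfs", "n"))   # out-of-vocabulary word
```

The same surface form maps to different lemmas depending on its part of speech, and a word missing from the lexicon is returned unchanged, which shows both the strength of the approach and its dependence on the underlying resource.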
## Summary

In machine learning, the choice between stemming and lemmatization depends on the specific task and the goals of the analysis. If computational efficiency and simplicity are priorities, stemming might be preferred. If preserving semantic meaning and using a more accurate representation of words are important, lemmatization is a better choice. In some cases, a combination of both techniques or experimenting with different approaches may yield the best results.






### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW: https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
