
# **Understanding NLP tools**

  NLTK

In [12]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [13]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Clean one document from the train or test data and return a list of tokens.

    Strips HTML, replaces numbers and punctuation with spaces, and lowercases the text.
    The tokens are lemmatized if lemmatize is True, stemmed if stemmer is True,
    and returned unchanged if both are False.
    """

    # Convert non-string inputs (e.g. NaN floats, numpy integers) to strings
    if isinstance(text, (float, numpy.int64)):
        text = str(text)
    # Decode bytes; str has no decode() and passes through unchanged
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()

    tokens = word_tokenize(text)  # Generate list of tokens

    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            """Map a Penn Treebank tag to a WordNet POS; default to noun."""
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                return wordnet.NOUN

        return [wordnet_lemmatizer.lemmatize(word, get_tag(tag)) for word, tag in pos_tag(tokens)]

    if stemmer:
        snowball_stemmer = SnowballStemmer('english')
        return [snowball_stemmer.stem(t) for t in tokens]

    # Neither option requested: return the cleaned tokens as-is
    return tokens
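
The cleaning step at the start of `cleanText` can be tried on its own. The sketch below applies the same regular expression and lowercasing to a made-up message (the sample string and the helper name `basic_clean` are assumptions, not part of the notebook) and splits on whitespace instead of calling `word_tokenize`, so it runs without any NLTK data.

```python
import re

def basic_clean(text):
    """Replace every non-letter with a space, then lowercase (same regex as cleanText)."""
    return re.sub(r"[^A-Za-z]", " ", text).lower()

sample = "Free entry in 2 a wkly comp!"  # hypothetical example message
cleaned = basic_clean(sample)
tokens = cleaned.split()  # whitespace split stands in for word_tokenize here
print(tokens)  # ['free', 'entry', 'in', 'a', 'wkly', 'comp']
```

The digit and the exclamation mark become spaces, so they never reach the stemmer or the vectorizer.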

In [14]:
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

Troubling
troubl
trouble


Bag of Words

In [15]:
# Functions to convert document(s) to a list of words, with the option of removing stopwords. Returns document-term matrix.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test
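
To make the fit/transform split above concrete, here is a minimal pure-Python sketch of a bag-of-words matrix. It is a simplification rather than what `CountVectorizer` does internally, and the helper names and toy documents are assumptions made for illustration. The vocabulary is built from the training documents only, so a word that appears only in the test data gets no column.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct training word to a column index (sorted for a stable order)."""
    words = sorted({w for doc in docs for w in doc.split()})
    return {w: i for i, w in enumerate(words)}

def to_counts(doc, vocab):
    """Turn one document into a count vector over the fitted vocabulary."""
    counts = Counter(doc.split())
    return [counts.get(w, 0) for w in vocab]  # dicts preserve insertion order

train_docs = ["good movie", "bad movie bad plot"]  # toy data
test_docs = ["good plot twist"]                    # "twist" is unseen in training

vocab = fit_vocabulary(train_docs)
train_matrix = [to_counts(d, vocab) for d in train_docs]
test_matrix = [to_counts(d, vocab) for d in test_docs]

print(list(vocab))   # ['bad', 'good', 'movie', 'plot']
print(train_matrix)  # [[0, 1, 1, 0], [2, 0, 1, 1]]
print(test_matrix)   # [[0, 1, 0, 1]]
```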


TF-IDF

In [16]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test
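
For reference, the calls above leave `TfidfVectorizer` at its default settings. Assuming those defaults (`smooth_idf=True`, `norm='l2'`, raw term counts), the weight of term $t$ in document $d$ should be:

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
w(t, d) = \mathrm{tf}(t, d)\,\mathrm{idf}(t),
\qquad
\hat{w}(t, d) = \frac{w(t, d)}{\sqrt{\sum_{t'} w(t', d)^2}}
$$

Here $n$ is the number of training documents, $\mathrm{df}(t)$ is the number of training documents that contain $t$, and $\mathrm{tf}(t, d)$ is the raw count of $t$ in $d$. The last step rescales every document vector to unit length.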

# **Reviews Sentiment Analysis**

In [17]:
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/Machine Learning Minors/Originals/MODULE 3/reviews.csv')

In [18]:
df = df.dropna()

In [19]:
df.to_csv('reviews.csv', index=False)

Task 1

In [20]:
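from sklearn import metrics, neighbors
from sklearn.model_selection import cross_val_score, train_test_split

def bow_knn_tweaks():
    """Method for experimenting with different parameters for KNN using bag-of-words."""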

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)

    # Different configurations
    params = [
        {"n_neighbors": 3, "weights": 'uniform', "p": 2, "metric": 'euclidean'},
        {"n_neighbors": 5, "weights": 'distance', "p": 1, "metric": 'manhattan'},
        {"n_neighbors": 7, "weights": 'distance', "p": 2, "metric": 'cosine'},
        {"n_neighbors": 10, "weights": 'uniform', "p": 1, "metric": 'minkowski'},
        {"n_neighbors": 15, "weights": 'distance', "p": 2, "metric": 'euclidean'}
    ]

    for param in params:
        knn = neighbors.KNeighborsClassifier(
            n_neighbors=param["n_neighbors"],
            weights=param["weights"],
            p=param["p"],
            metric=param["metric"],
            n_jobs=1
        )

        knn.fit(X_train, y_train)
        predicted = knn.predict(X_test)
        acc = metrics.accuracy_score(y_test, predicted)
        print(f'KNN with BOW accuracy (n={param["n_neighbors"]}, weight={param["weights"]}, p={param["p"]}, metric={param["metric"]}) = {acc * 100:.2f}%')

        scores = cross_val_score(knn, X_train, y_train, cv=3)
        print(f"Cross Validation Accuracy: {scores.mean():.2f}")
        print(scores)
        print('\n')
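
The configurations above vary both `p` and `metric`. In scikit-learn, `p` is documented as the power parameter of the Minkowski metric, so it should only have an effect in the `'minkowski'` configuration. The sketch below uses a hypothetical helper to show how `p=1` and `p=2` reduce to the Manhattan and Euclidean distances.

```python
def minkowski(a, b, p):
    """Minkowski distance: (sum of |a_i - b_i|^p) raised to 1/p."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

u, v = [0, 0], [3, 4]
manhattan = minkowski(u, v, p=1)  # |3| + |4|
euclidean = minkowski(u, v, p=2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)  # 7.0 5.0
```

The configuration with `metric='minkowski'` and `p=1` is therefore another Manhattan-distance run, differing from the second configuration only in `n_neighbors` and `weights`.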


In [21]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()



KNN with BOW accuracy = 92.19730941704036%
Cross Validation Accuracy: 0.91
[0.90713324 0.90040377 0.91245791]




In [22]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()



KNN with TFIDF accuracy = 98.56502242152466%
Cross Validation Accuracy: 0.97
[0.96837147 0.96769852 0.96363636]


# **Text Classification**

In [23]:
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/Machine Learning Minors/Originals/MODULE 3/spam.csv')
df

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [24]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [25]:
df.head()

Unnamed: 0,Category,Message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [26]:
len(df)

5572

Task 2

In [29]:
def bow_knn_tweaks():
    """Experiment with different parameters for KNN using bag-of-words."""

    training_data = pd.read_csv('/content/drive/MyDrive/Machine Learning Minors/Originals/MODULE 3/spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)

    # Parameter configurations to test
    params = [
        {"n_neighbors": 3, "weights": 'uniform', "p": 2, "metric": 'euclidean'},
        {"n_neighbors": 5, "weights": 'distance', "p": 1, "metric": 'manhattan'},
        {"n_neighbors": 7, "weights": 'distance', "p": 2, "metric": 'cosine'},
        {"n_neighbors": 10, "weights": 'uniform', "p": 1, "metric": 'minkowski'},
        {"n_neighbors": 15, "weights": 'distance', "p": 2, "metric": 'euclidean'}
    ]

    for param in params:
        knn = neighbors.KNeighborsClassifier(
            n_neighbors=param["n_neighbors"],
            weights=param["weights"],
            p=param["p"],
            metric=param["metric"],
            n_jobs=1
        )

        knn.fit(X_train, y_train)
        predicted = knn.predict(X_test)
        acc = metrics.accuracy_score(y_test, predicted)
        print(f'KNN with BOW accuracy (n={param["n_neighbors"]}, weight={param["weights"]}, p={param["p"]}, metric={param["metric"]}) = {acc * 100:.2f}%')

        scores = cross_val_score(knn, X_train, y_train, cv=3)
        print(f"Cross Validation Accuracy: {scores.mean():.2f}")
        print(scores)
        print('\n')


In [30]:
def tfidf_knn_tweaks():
    """Experiment with different parameters for KNN using tf-idf."""

    training_data = pd.read_csv('/content/drive/MyDrive/Machine Learning Minors/Originals/MODULE 3/spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)

    # Parameter configurations to test
    params = [
        {"n_neighbors": 3, "weights": 'uniform', "p": 2, "metric": 'euclidean'},
        {"n_neighbors": 5, "weights": 'distance', "p": 1, "metric": 'manhattan'},
        {"n_neighbors": 7, "weights": 'distance', "p": 2, "metric": 'cosine'},
        {"n_neighbors": 10, "weights": 'uniform', "p": 1, "metric": 'minkowski'},
        {"n_neighbors": 15, "weights": 'distance', "p": 2, "metric": 'euclidean'}
    ]

    for param in params:
        knn = neighbors.KNeighborsClassifier(
            n_neighbors=param["n_neighbors"],
            weights=param["weights"],
            p=param["p"],
            metric=param["metric"],
            n_jobs=1
        )

        knn.fit(X_train, y_train)
        predicted = knn.predict(X_test)
        acc = metrics.accuracy_score(y_test, predicted)
        print(f'KNN with TFIDF accuracy (n={param["n_neighbors"]}, weight={param["weights"]}, p={param["p"]}, metric={param["metric"]}) = {acc * 100:.2f}%')

        scores = cross_val_score(knn, X_train, y_train, cv=3)
        print(f"Cross Validation Accuracy: {scores.mean():.2f}")
        print(scores)
        print('\n')
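
The hand-written parameter loops above could also be expressed with scikit-learn's `GridSearchCV`. The following is an alternative sketch, not the notebook's method. It uses a made-up toy corpus instead of `spam.csv`, skips `cleanText`, and searches a smaller grid. Placing the vectorizer inside a `Pipeline` means it is refit on each cross-validation training fold, so the held-out fold does not contribute to the vocabulary or the idf weights.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Toy corpus, made up for illustration: 1 = spam, 0 = ham
texts = [
    "win a free prize now", "free entry win cash", "claim your free cash prize",
    "urgent win cash now", "free prize claim now", "cash prize urgent claim",
    "see you at lunch", "are you coming home", "call me after lunch",
    "coming home late tonight", "see you tonight", "call me when home",
]
labels = [1] * 6 + [0] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("knn", KNeighborsClassifier()),
])

# Step-name prefixes ("knn__") route each parameter to the right pipeline step
param_grid = {
    "knn__n_neighbors": [1, 3],
    "knn__weights": ["uniform", "distance"],
    "knn__metric": ["euclidean", "cosine"],
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(texts, labels)
print(search.best_params_)
print(f"Best cross-validation accuracy: {search.best_score_:.2f}")
```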


In [31]:
predicted, y_test = bow_knn()



KNN with BOW accuracy = 92.19730941704036%
Cross Validation Accuracy: 0.91
[0.90713324 0.90040377 0.91245791]




In [32]:
predicted, y_test = tfidf_knn()



KNN with TFIDF accuracy = 98.56502242152466%
Cross Validation Accuracy: 0.97
[0.96837147 0.96769852 0.96363636]


**Questions**

**1. Why does the TF-IDF approach generally result in better accuracy than Bag-of-Words?**


The TF-IDF (Term Frequency-Inverse Document Frequency) approach often results in better accuracy than Bag-of-Words (BoW) because it provides more informative weighting of terms:

1. Incorporating Term Importance: TF-IDF assigns weights to terms based on their frequency in a document and their rarity across the dataset. This down-weights common words (like "the," "and") that appear frequently in all documents but carry little informational value, emphasizing words that are more distinctive to specific documents.

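2. Capturing Distinctive Terms: Since TF-IDF penalizes high-frequency but unimportant terms, it helps to differentiate between categories (e.g., "spam" and "ham") by emphasizing rare but meaningful terms.

3. Length Normalization: scikit-learn's `TfidfVectorizer` L2-normalizes each document vector by default, so the distances KNN relies on depend less on message length than they do with raw counts.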
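
The weighting described above can be checked by hand on a toy corpus. The sketch below is a pure-Python illustration that assumes the smoothed idf and L2 normalization used by scikit-learn's default `TfidfVectorizer`. The three documents are made up. In a bag-of-words vector every word of the first document would have the same count of 1. With TF-IDF, "call", which occurs in every document, ends up with a smaller weight than "prize", which occurs in only one.

```python
import math

docs = [
    "free prize call now",
    "call me now",
    "call you later",
]  # toy corpus, made up for illustration
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)
vocab = sorted({w for doc in tokenized for w in doc})

# Document frequency: number of documents containing each term
df = {w: sum(w in doc for doc in tokenized) for w in vocab}
# Smoothed idf, assumed to match scikit-learn's default (smooth_idf=True)
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_vector(doc):
    """Raw term counts times idf, then L2-normalized."""
    weights = [doc.count(w) * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in weights))
    return [x / norm for x in weights]

first = dict(zip(vocab, tfidf_vector(tokenized[0])))
for word in ["call", "now", "prize"]:
    print(word, round(first[word], 3))
```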

**2. Can you think of techniques that are better than both BoW and TF-IDF?**


Word Embeddings (e.g., Word2Vec, GloVe): Word embeddings are vector representations of words that capture semantic relationships and contextual meaning. These techniques place words in a vector space such that words with similar meanings are close together, helping the model generalize beyond exact word matches and understand word context better.

Contextualized Embeddings (e.g., BERT, GPT): These embeddings take context into account dynamically, meaning that a word’s vector representation changes depending on the surrounding words. This approach is more powerful for understanding nuanced language patterns and often performs better on complex NLP tasks.

Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA): These topic modeling techniques reduce dimensionality by capturing latent structures in the data, grouping similar words and reducing noise. They are particularly useful for summarization and understanding overarching themes in text.
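
As a rough illustration of the "similar meanings are close together" idea, the sketch below compares hand-made three-dimensional vectors with cosine similarity. The vectors are invented for this example and do not come from Word2Vec, GloVe, or any trained model.

```python
import math

# Hand-made 3-d vectors for illustration only; real embeddings are learned
# from data and have far more dimensions.
toy_embeddings = {
    "good":  [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "table": [0.0, 0.1, 0.9],
}

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_close = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
sim_far = cosine_similarity(toy_embeddings["good"], toy_embeddings["table"])
print(round(sim_close, 2), round(sim_far, 2))
```

In BoW or TF-IDF, "good" and "great" occupy separate columns, so two reviews that use one word each share nothing on that dimension. With embeddings the two words are near neighbours.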

**3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.**

**Stemming:**

Stemming reduces words to their base or root form by stripping suffixes, sometimes resulting in non-meaningful root words.

Pros:
Faster and computationally less intensive.
Works well in contexts where approximate matching is acceptable (e.g., search engines).

Cons:
Can distort or conflate words, which matters in applications where the exact word form is important; a rule-based stemmer also cannot relate irregular forms such as “better” and “good”.
Produces stems that are often not dictionary words (e.g., “troubling” becomes “troubl” in the example above), which can confuse downstream models.

**Lemmatization:**

Lemmatization reduces words to their canonical or dictionary form, considering the context and part of speech.

Pros:
Produces grammatically correct forms of words, leading to better standardization and interpretation.
Often improves performance in NLP tasks where context matters, as it maintains the semantic meaning of words.

Cons:
More computationally expensive as it requires context and part-of-speech tagging.
Requires additional language resources (like WordNet), which may not be available for all languages.
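
The trade-off can be seen in a toy sketch. The suffix list and lemma table below are hypothetical and deliberately tiny. Real stemmers such as Snowball apply many more rules, and real lemmatizers such as WordNet's rely on a full lexicon. The stemmer needs no resources but can return non-words and cannot handle irregular forms. The lemmatizer returns dictionary forms but needs both a lookup table and the part of speech.

```python
# Toy illustration only, not NLTK's algorithms
SUFFIXES = ["ing", "ed", "s"]  # hypothetical, deliberately tiny rule set

# Hypothetical lemma table keyed by (word, part of speech)
LEMMAS = {
    ("better", "adj"): "good",
    ("ran", "verb"): "run",
    ("troubling", "verb"): "trouble",
}

def toy_stem(word):
    """Strip the first matching suffix; no dictionary check."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word, pos):
    """Look the word up with its part of speech; fall back to the word itself."""
    return LEMMAS.get((word, pos), word)

for word, pos in [("troubling", "verb"), ("better", "adj"), ("ran", "verb")]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word, pos))
# troubling -> troubl | trouble
# better -> better | good
# ran -> ran | run
```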