# NLP-Various Implementations | Text Classification using TF-IDF Vectorization

**Overview:** In this part of the project, I implemented a machine learning pipeline that classifies text using the Multinomial Naïve Bayes and linear Support Vector Machine (SVM) algorithms. The `TfidfVectorizer` class transforms text into tf-idf feature vectors, and the train_classifier() function fits the classifier. The predict() function outputs predicted labels, whereas the run_model() function returns accuracy, dimensionality, time cost, and misclassification data for later use. The analyze_results() function looks at the texts that all four models misclassify and reports the number of such texts per true label, the most common misclassification pair, and a random text with its true label and each model's prediction. The code also compares the performance of Naïve Bayes and SVM on word uni-grams and character tri-grams.

## 1. Import all the necessary modules

**Briefly:** the `csv` library provides functionality for working with comma-separated values (CSV) files, `time` provides functions for time-related tasks, `random` provides tools for generating random numbers, `defaultdict` is a dictionary that supplies default values for nonexistent keys, `PrettyTable` displays data in a table format, `TfidfVectorizer` is a class from `sklearn.feature_extraction.text` that transforms text into feature vectors, and `MultinomialNB` and `LinearSVC` are classifiers from `sklearn.naive_bayes` and `sklearn.svm` respectively, used here for text classification.

In [1]:
import csv
import time
import random
from prettytable import PrettyTable
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

## 2. Load the training and testing datasets

The function load_dataset() takes the path of a CSV file containing text data and returns a dictionary with the features and labels. The first row is treated as a header and skipped. The features are the second and third columns of each row joined by a space, while the labels are taken from the first column. The function is called twice to load the training and test data from the two CSV files at the specified paths and store them in the variables train_data and test_data.

In [2]:
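def load_dataset(path):
    # newline="" lets the csv module handle line breaks inside quoted fields
    with open(path, "r", encoding="utf-8", newline="") as f:
        rows = list(csv.reader(f))[1:]  # skip the header row
    return {
        "features": [row[1] + " " + row[2] for row in rows],  # second and third columns, joined by a space
        "labels": [row[0] for row in rows],  # first column
    }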

train_data = load_dataset("C:/Users/natalia/pyproj/nlp-proj/assignment-2b/train.csv")
test_data = load_dataset("C:/Users/natalia/pyproj/nlp-proj/assignment-2b/test.csv") 
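
As a minimal sketch of what load_dataset() returns, the snippet below writes a tiny CSV to a temporary file and loads it. The three-column layout (label, title, description), the header names, and the two sample rows are assumptions made for illustration; the contents of the actual train.csv and test.csv are not shown in this notebook.

```python
import csv
import os
import tempfile

def load_dataset(path):
    # Same logic as the notebook function: skip the header, join columns 2 and 3
    with open(path, "r", encoding="utf-8", newline="") as f:
        rows = list(csv.reader(f))[1:]
    return {
        "features": [row[1] + " " + row[2] for row in rows],
        "labels": [row[0] for row in rows],
    }

# Hypothetical sample rows: label, title, description
sample_rows = [
    ["label", "title", "description"],
    ["3", "Markets rally", "Stocks rose sharply on Monday."],
    ["4", "New chip, announced", "A faster processor was unveiled."],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    with open(sample_path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerows(sample_rows)
    sample_data = load_dataset(sample_path)

print(sample_data["features"])
print(sample_data["labels"])
```

Note that the labels come back as strings rather than integers, because csv.reader does no type conversion.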

## 3. Extract tf-idf features, train the classifiers, and evaluate

**Run a model:** The function run_model() builds tf-idf features from the training and test texts, trains the given classifier, and evaluates it on the test set:

* **Step.1:** the function takes several inputs: train_data and test_data, the dictionaries returned by load_dataset(); n, the n-gram size; approach, the analyzer passed to TfidfVectorizer ("word" or "char"); and classifier, an unfitted scikit-learn estimator such as MultinomialNB() or LinearSVC().
* **Step.2:** extract_features() fits a TfidfVectorizer on the training texts only and applies the same vocabulary to the test texts, train_classifier() fits the classifier on the training matrix, and detect_misclassification() groups every misclassified test text, together with its predicted label, by true label.
* **Step.3:** finally, it returns the test accuracy, the dimensionality (the size of the vocabulary), the time cost in seconds for feature extraction and training, and the misclassification data.

The cell ends by running the first configuration, Multinomial Naïve Bayes on word uni-grams.

In [3]:
def run_model(train_data, test_data, n, approach, classifier):
    start = time.time()
    # Entry (i, j) of X_train / X_test is the tf-idf score of the j-th feature (n-gram) in the i-th document
    X_train, X_test, vocab = extract_features(train_data["features"], test_data["features"], n, approach)
    y_train, y_test = train_data["labels"], test_data["labels"]
    clf = train_classifier(X_train, y_train, classifier)
    end = time.time()  # the time cost covers feature extraction and training, not prediction
    misclass_data = detect_misclassification(test_data, predict(clf, X_test))
    return clf.score(X_test, y_test), len(vocab), end - start, misclass_data

def extract_features(train_text, test_text, n, approach):
    # ngram_range=(n, n) sets both the lower and upper bound, so only n-grams of exactly size n are extracted
    vectorizer = TfidfVectorizer(ngram_range=(n, n), lowercase=True, analyzer=approach)
    # Fit the vocabulary and idf weights on the training texts only, then reuse them for the test texts
    return vectorizer.fit_transform(train_text), vectorizer.transform(test_text), vectorizer.vocabulary_

def train_classifier(X_train, y_train, classifier):
    return classifier.fit(X_train, y_train)

def predict(classifier, X_test):
    return classifier.predict(X_test)

def detect_misclassification(test_data, y_pred):
    # Map each true label to the list of (text, predicted_label) pairs the classifier got wrong
    misclass_data = defaultdict(list)
    for text, true_label, predicted_label in zip(test_data["features"], test_data["labels"], y_pred):
        if true_label != predicted_label:
            misclass_data[true_label].append((text, predicted_label))
    return misclass_data

accuracy, dimensionality, time_cost, misclass_data_mnv1w = run_model(train_data, test_data, 1, "word", MultinomialNB())
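
To make the tf-idf features more concrete, here is a pure-Python sketch of the two n-gram settings used in this project and of the weighting applied to them. It assumes scikit-learn's documented defaults for TfidfVectorizer: raw term counts, a smoothed idf of ln((1 + N) / (1 + df)) + 1, L2 normalization of each document vector, and, for word n-grams, a token pattern that keeps tokens of two or more word characters. The helper names and the toy documents are hypothetical, and details such as whitespace normalization for character n-grams are left out, so this is an illustration rather than a reimplementation.

```python
import math
import re
from collections import Counter

def word_ngrams(text, n):
    # Lowercase, keep tokens of 2+ word characters, then join n consecutive tokens
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n):
    # Slide a window of n characters over the lowercased text, spaces included
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tfidf(docs_as_ngrams):
    n_docs = len(docs_as_ngrams)
    # Document frequency: in how many documents each n-gram occurs
    df = Counter(gram for doc in docs_as_ngrams for gram in set(doc))
    idf = {gram: math.log((1 + n_docs) / (1 + count)) + 1 for gram, count in df.items()}
    vectors = []
    for doc in docs_as_ngrams:
        weights = {gram: count * idf[gram] for gram, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # L2-normalize so that every non-empty document vector has unit length
        vectors.append({gram: w / norm for gram, w in weights.items()} if norm else {})
    return vectors

docs = ["The cat sat", "The dog sat down"]  # toy documents
word_vectors = tfidf([word_ngrams(d, 1) for d in docs])
char_vectors = tfidf([char_ngrams(d, 3) for d in docs])

print(word_vectors[0])
print(sorted(char_vectors[0]))
```

In the first toy document, "cat" receives a higher weight than "the" because "the" occurs in both documents and therefore has a lower idf.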

### 3.1. Multinomial Naïve Bayes with word uni-grams

**Display the results:** The function visualize() takes the name of a model together with its accuracy, dimensionality, and time cost, and prints them as a one-row table.

* **Step.1:** the function takes four inputs: model, a description of the configuration that is printed as the table title; accuracy, the fraction of test texts classified correctly; dimensionality, the number of features in the vocabulary; and time_cost, the number of seconds spent on feature extraction and training.
* **Step.2:** it prints the model name in bold and creates a table using PrettyTable with bold column headers.
* **Step.3:** it finally adds the three values as a single row and prints the table.

The function visualize() is called here with the results of the first run from the previous cell: Multinomial Naïve Bayes trained on tf-idf word uni-grams.

In [4]:
def visualize(model, accuracy, dimensionality, time_cost):
    print(f"\033[1m{model}:\033[0m")
    pt = PrettyTable(field_names=[f"\033[1m{field}\033[0m" for field in ["Accuracy", "Dimensionality", "Time Cost"]])
    pt.add_row([accuracy, dimensionality, time_cost])
    print(pt)

visualize("Multinomial Naïve Bayes using tf-idf word uni-grams", accuracy, dimensionality, time_cost)

Multinomial Naïve Bayes using tf-idf word uni-grams:
+--------------------+----------------+-------------------+
|      Accuracy      | Dimensionality |     Time Cost     |
+--------------------+----------------+-------------------+
| 0.9022368421052631 |     64999      | 5.747888565063477 |
+--------------------+----------------+-------------------+
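
The Time Cost column is the wall-clock difference between the two time.time() calls in run_model(), which bracket feature extraction and training but not prediction or scoring. As a hedged sketch of that measurement pattern, the helper below times an arbitrary callable. The name timed() and the use of time.perf_counter(), a clock intended for measuring short intervals, are my own choices for illustration and are not part of the notebook code.

```python
import time

def timed(func, *args, **kwargs):
    """Return (result, elapsed_seconds) for a single call to func."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Toy workload standing in for feature extraction plus training
total, elapsed = timed(sum, range(1_000_000))
print(f"sum={total}, elapsed={elapsed:.4f}s")
```

A single measurement like this varies from run to run, so the time costs reported in the tables are best read as rough indications.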


### 3.2. Multinomial Naïve Bayes with character tri-grams

The run_model() function is called with n set to 3, the "char" analyzer, and a fresh MultinomialNB() classifier, so the features are tf-idf weighted character tri-grams instead of word uni-grams. The misclassification data is stored in the variable misclass_data_mnv3c for the later comparison across models.

In [5]:
accuracy, dimensionality, time_cost, misclass_data_mnv3c = run_model(train_data, test_data, 3, "char", MultinomialNB())
visualize("Multinomial Naïve Bayes using tf-idf character tri-grams", accuracy, dimensionality, time_cost)

Multinomial Naïve Bayes using tf-idf character tri-grams:
+--------------------+----------------+--------------------+
|      Accuracy      | Dimensionality |     Time Cost      |
+--------------------+----------------+--------------------+
| 0.8686842105263158 |     31074      | 24.929243564605713 |
+--------------------+----------------+--------------------+


Compared with word uni-grams, character tri-grams roughly halve the dimensionality (31,074 versus 64,999 features), but they lower the Naïve Bayes accuracy from about 0.902 to about 0.869 and raise the time cost from about 5.7 to about 24.9 seconds.

### 3.3. Support Vector Machines with word uni-grams

The run_model() function is called with n set to 1, the "word" analyzer, and a LinearSVC classifier with regularization parameter C=1. The misclassification data is stored in the variable misclass_data_svm1w.

In [6]:
accuracy, dimensionality, time_cost, misclass_data_svm1w = run_model(train_data, test_data, 1, "word", LinearSVC(C=1))
visualize("Support Vector Machines using tf-idf word uni-grams", accuracy, dimensionality, time_cost)

Support Vector Machines using tf-idf word uni-grams:
+--------------------+----------------+--------------------+
|      Accuracy      | Dimensionality |     Time Cost      |
+--------------------+----------------+--------------------+
| 0.9196052631578947 |     64999      | 12.181857585906982 |
+--------------------+----------------+--------------------+


On the same word uni-gram features, the linear SVM reaches an accuracy of about 0.920, compared with about 0.902 for Multinomial Naïve Bayes, at roughly twice the time cost (about 12.2 versus 5.7 seconds).

### 3.4. Support Vector Machines with character tri-grams

The run_model() function is called with n set to 3, the "char" analyzer, and LinearSVC(C=1). The misclassification data is stored in the variable misclass_data_svm3c.

In [7]:
accuracy, dimensionality, time_cost, misclass_data_svm3c = run_model(train_data, test_data, 3, "char", LinearSVC(C=1))
visualize("Support Vector Machines using tf-idf character tri-grams", accuracy, dimensionality, time_cost)

Support Vector Machines using tf-idf character tri-grams:
+--------------------+----------------+-------------------+
|      Accuracy      | Dimensionality |     Time Cost     |
+--------------------+----------------+-------------------+
| 0.9121052631578948 |     31074      | 39.59399223327637 |
+--------------------+----------------+-------------------+


With character tri-grams the SVM reaches an accuracy of about 0.912, slightly below its word uni-gram result and well above Naïve Bayes on the same features (about 0.869), and it has the highest time cost of the four runs (about 39.6 seconds).

## 4. Analyze the misclassifications

The function analyze_results() takes the misclassification data of the four models and keeps only the test texts that all four models misclassified. It prints three things: the number of such texts per true label; the most common (true label, predicted label) pair, where each of the four models' predictions for a text is counted once; and a randomly chosen text with its true label and the prediction of each model.
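
The pair counting performed by analyze_results() can be illustrated on toy data. The sketch below keeps only the texts that every model misclassified and then counts each (true label, predicted label) pair once per model, which is why the frequency of a pair can exceed the number of commonly misclassified texts. The texts and labels are invented for illustration, only two models are used to keep it short, and collections.Counter stands in for the defaultdict(int) used in the notebook code.

```python
from collections import Counter

# Hypothetical data: true label per text, and each model's prediction per text
true_labels = {"text A": "3", "text B": "3", "text C": "1"}
predictions = {
    "mnv1w": {"text A": "4", "text B": "4", "text C": "1"},
    "svm1w": {"text A": "4", "text B": "2", "text C": "2"},
}

# Texts that every model got wrong
common = [
    text for text, true_label in true_labels.items()
    if all(model_preds[text] != true_label for model_preds in predictions.values())
]

# Count each (true, predicted) pair once per model
pair_counts = Counter(
    (true_labels[text], model_preds[text])
    for text in common
    for model_preds in predictions.values()
)
most_common_pair, frequency = pair_counts.most_common(1)[0]
print(common, most_common_pair, frequency)
```

Here two texts are misclassified by both models, yet the pair ("3", "4") has a frequency of 3, because it is counted for every model that made that prediction.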

In [8]:
def analyze_results(mnv1w, mnv3c, svm1w, svm3c):
    # Keep only the texts that all four models misclassified, together with each model's predicted label
    common_misclass_data = defaultdict(list)
    for true_label in mnv1w.keys():
        for text, label in mnv1w[true_label]:
            # '' marks a model that did not misclassify this text
            labels = [label] + [next((p for t, p in model[true_label] if t == text), '') for model in [mnv3c, svm1w, svm3c]]
            if all(labels):
                common_misclass_data[true_label].append((text, labels))

    # Number of commonly misclassified texts per true label
    misclass_counts = {true_label: len(misclass_tuple) for true_label, misclass_tuple in common_misclass_data.items()}
    pt = PrettyTable(field_names=[f"\033[1m{field}\033[0m" for field in ["True Label", "Misclassification Times"]])
    pt.add_rows([(true_label, count) for true_label, count in misclass_counts.items()])
    print(pt)

    # Frequency of each (true label, predicted label) pair, counted once per model
    misclass_freqs = defaultdict(int)
    for true_label, values in common_misclass_data.items():
        for text, pred_labels in values:
            for pl in pred_labels:
                misclass_freqs[(true_label, pl)] += 1
    max_tuple, max_count = max(misclass_freqs.items(), key=lambda x: x[1])
    print("\033[1m" + "Most common Misclassification Pair:" + "\033[0m")
    pt = PrettyTable(field_names=[f"\033[1m{field}\033[0m" for field in ["True Label", "Predicted Label", "Frequency"]])
    pt.add_row([max_tuple[0], max_tuple[1], max_count])
    print(pt)

    # Show one randomly chosen text with its true label and the prediction of each model
    rand_true_label = random.choice(list(common_misclass_data.keys()))
    rand_misclass_tuple = random.choice(common_misclass_data[rand_true_label])
    print("\033[1m" + "Text: " + "\033[0m" + rand_misclass_tuple[0] + "\033[1m" + "\nTrue Label: " + "\033[0m" + rand_true_label)
    pt = PrettyTable(field_names=[f"\033[1m{field}\033[0m" for field in ["Model", "Prediction"]])
    pt.add_rows([(model, rand_misclass_tuple[1][idx]) for idx, model in enumerate(["mnv1w", "mnv3c", "svm1w", "svm3c"])])
    print(pt)

analyze_results(misclass_data_mnv1w, misclass_data_mnv3c, misclass_data_svm1w, misclass_data_svm3c)

+------------+-------------------------+
| True Label | Misclassification Times |
+------------+-------------------------+
|     4      |            85           |
|     3      |           135           |
|     1      |           112           |
|     2      |            9            |
+------------+-------------------------+
Most common Misclassification Pair:
+------------+-----------------+-----------+
| True Label | Predicted Label | Frequency |
+------------+-----------------+-----------+
|     3      |        4        |    381    |
+------------+-----------------+-----------+
Text: Feds kick off digital TV consumer campaign WASHINGTON - It #39;s one of the biggest technical changes in television since color TV: the digital transition. And because many Americans remain in the dark about it, federal regulators began an education campaign Monday to enlighten them.
True Label: 3
+-------+------------+
| Model | 