# NLP-Various Implementations | Text Classification using TF-IDF Vectorization

**Overview:** In this part of the project, I implemented a text classification pipeline that compares two machine learning algorithms, Naive Bayes and Support Vector Machines (SVM). The TfidfVectorizer class transforms text into feature vectors, and the train_classifier() function trains the classifier. The predict() function outputs predicted labels, and the run_model() function returns accuracy, dimensionality, time cost, and misclassification data for later use. The analyze_results() function outputs the most common misclassification pair and a random misclassified text with its true label and the label predicted by each model. The code also makes it possible to compare the performance of Naive Bayes and SVM across different feature settings (word uni-grams and character tri-grams).

## 1. Import all the necessary modules

**Briefly:** the `csv` library provides functionality for working with Comma Separated Value (CSV) files, `time` provides functions for time-related tasks, and `random` provides tools for generating random numbers. `defaultdict` creates a dictionary with default values for nonexistent keys, and `PrettyTable` displays data in a table format. `TfidfVectorizer` is a class from `sklearn.feature_extraction.text` that transforms text into feature vectors, while `MultinomialNB` and `LinearSVC` are classifiers from `sklearn.naive_bayes` and `sklearn.svm`, respectively, used here for text classification.

In [1]:
import csv
import time
import random
from prettytable import PrettyTable
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

## 2. Load the training and testing datasets

The function load_dataset() takes in the path of a CSV file containing text data and returns a dictionary with the features and labels. The features are a concatenation of the second and third columns of each row in the CSV file, while the labels correspond to the first column. The function is used to load the training and test data from two CSV files in the specified file paths and store them in variables named train_data and test_data.

In [2]:
def load_dataset(path):
    with open(path, "r", encoding="utf-8") as f:
        data = list(csv.reader(f))
    return {"features": [row[1] + ' ' + row[2] for row in data[1:]], "labels": [row[0] for row in data[1:]]}

train_data = load_dataset("C:/Users/natalia/pyproj/nlp-proj/assignment-2b/train.csv")
test_data = load_dataset("C:/Users/natalia/pyproj/nlp-proj/assignment-2b/test.csv") 
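
To make the column handling concrete, here is a minimal sketch that applies the same logic to an in-memory CSV instead of a file on disk. The helper name `load_rows`, the header names, and the two sample rows are invented for illustration and are not taken from the actual dataset.

```python
import csv
import io

def load_rows(file_obj):
    """Same logic as load_dataset(), but reading from an already open file object."""
    rows = list(csv.reader(file_obj))
    body = rows[1:]  # skip the header row
    return {
        "features": [row[1] + " " + row[2] for row in body],  # columns 2 and 3 joined
        "labels": [row[0] for row in body],                   # column 1
    }

# Hypothetical header and rows, made up for this example.
sample_csv = io.StringIO(
    "Class Index,Title,Description\n"
    '3,"Oil prices rise","Crude futures climbed on Monday."\n'
    '2,"Cup final set","Two clubs will meet in the final."\n'
)
sample = load_rows(sample_csv)
print(sample["labels"])       # ['3', '2']
print(sample["features"][0])  # Oil prices rise Crude futures climbed on Monday.
```

Note that the labels stay strings, exactly as `csv.reader` returns them, which is why a later step converts them with `int()` before indexing into the class list.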

## 3. Build and run the text classification model

The run_model function takes in five arguments and extracts features from the training and test data using TfidfVectorizer. It then trains a classifier using the extracted features and returns the accuracy score, dimensionality, time taken for feature extraction and training, and misclassification data.

* **Step.1:** the run_model function is called with five arguments: train_data, test_data, n, approach, and classifier. At the same moment, the timer that measures the model's time cost starts.
* **Step.2:** the extract_features function is called within run_model, to extract features from the train_data and test_data using TfidfVectorizer. The extracted features are assigned to variables X_train, X_test, and vocab. The labels from train_data and test_data are assigned to y_train and y_test, respectively.
* **Step.3:** the train_classifier function is also called within run_model, with X_train, y_train, and classifier as arguments, to train a classifier using the provided training data. The timer is then stopped, so the measured time covers feature extraction and training.
* **Step.4:** the detect_misclassification function is called within run_model as well, with test_data and the predicted labels from the classifier's predict function as arguments to calculate misclassification data. The predict function takes classifier and X_test as arguments and returns predicted labels for the test data.
* **Step.5:** finally, the run_model function returns four values: accuracy score, dimensionality, time cost, and misclassification data. Accuracy is the classifier's accuracy score on the test data, dimensionality is the number of unique features in the extracted vocabulary, time_cost is the time taken to extract features and train the classifier (prediction is not timed), and misclass_data is a dictionary that maps each true label to a list of (text, predicted label) tuples.
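
The steps above can be sketched with each stage timed separately, which shows what a single start/stop timer around feature extraction and training does and does not include. This is an illustrative variant, not the project's code: the toy texts, their labels, and the `timings` dictionary are assumptions made up for the example.

```python
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data invented for illustration; labels are strings, as in the CSV files.
train_texts = ["stocks fall on weak earnings", "team wins the cup final",
               "markets rally after rate cut", "striker scores twice in derby"]
train_labels = ["3", "2", "3", "2"]
test_texts = ["markets fall after earnings", "team scores in the final"]

timings = {}

t0 = time.perf_counter()
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learn vocabulary and idf on train only
X_test = vectorizer.transform(test_texts)        # reuse the fitted vocabulary
timings["features"] = time.perf_counter() - t0

t0 = time.perf_counter()
clf = MultinomialNB().fit(X_train, train_labels)
timings["training"] = time.perf_counter() - t0

t0 = time.perf_counter()
y_pred = clf.predict(X_test)
timings["prediction"] = time.perf_counter() - t0

for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.6f}s")
```

In this breakdown, the time cost reported by run_model corresponds to the sum of the "features" and "training" stages.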

### 3.1. Multinomial Naïve Bayes using tf-idf word uni-grams

The run_model() function is called with five arguments: train_data, test_data, 1 for the n-gram range, "word" for the tokenization approach and an instance of the MultinomialNB class as the classification algorithm. The function extracts features using TF-IDF with word uni-grams, trains a Multinomial Naïve Bayes classifier, and calculates the accuracy score, dimensionality, time cost, and misclassification data.

In [3]:
def run_model(train_data, test_data, n, approach, classifier):
    start = time.time()
    # X_train and X_test are sparse matrices: entry (i, j) holds the tf-idf score
    # of the j-th feature (word or character n-gram) in the i-th document.
    X_train, X_test, vocab = extract_features(train_data["features"], test_data["features"], n, approach)
    y_train, y_test = train_data["labels"], test_data["labels"]
    clf = train_classifier(X_train, y_train, classifier)
    end = time.time()  # the timer covers feature extraction and training only
    misclass_data = detect_misclassification(test_data, predict(clf, X_test))
    return clf.score(X_test, y_test), len(vocab), end - start, misclass_data

def extract_features(train_text, test_text, n, approach):
    vectorizer = TfidfVectorizer(ngram_range=(n,n), lowercase=True, analyzer=approach) # where (n,n): the lower and upper bounds of the range of n-grams to be extracted
    return vectorizer.fit_transform(train_text), vectorizer.transform(test_text), vectorizer.vocabulary_

def train_classifier(X_train, y_train, classifier):
    return classifier.fit(X_train, y_train)

def predict(classifier, X_test):
    return classifier.predict(X_test)

def detect_misclassification(test_data, y_pred):
    misclass_data = defaultdict(list)
    for i in range(len(test_data["labels"])):
        true_label = test_data["labels"][i]
        predicted_label = y_pred[i]
        if true_label != predicted_label:
            text = test_data["features"][i]
            misclass_data[true_label].append((text, predicted_label))
    return misclass_data

accuracy, dimensionality, time_cost, misclass_data_mnv1w = run_model(train_data, test_data, 1, "word", MultinomialNB())
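
To see what the `n` and `approach` arguments change, the following sketch fits TfidfVectorizer on two invented sentences, once with word uni-grams and once with character tri-grams, and prints the resulting vocabularies. The sentences are made up for illustration; the vectorizer settings mirror the ones used in extract_features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat"]  # toy corpus, invented for this example

# Word uni-grams: one feature per distinct word.
word_vec = TfidfVectorizer(ngram_range=(1, 1), lowercase=True, analyzer="word")
word_vec.fit(docs)
word_vocab = sorted(word_vec.vocabulary_)
print(word_vocab)  # ['cat', 'dog', 'sat', 'the']

# Character tri-grams: one feature per distinct 3-character window,
# including windows that span the space between words.
char_vec = TfidfVectorizer(ngram_range=(3, 3), lowercase=True, analyzer="char")
char_vec.fit(docs)
char_vocab = sorted(char_vec.vocabulary_)
print(char_vocab)
```

The length of `vocabulary_` is what run_model reports as dimensionality.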

The visualize function takes a model name, accuracy, dimensionality, and time cost as inputs, and prints a pretty table showing the model's performance. In this case, it is used to display the results of this model: Multinomial Naïve Bayes using tf-idf word uni-grams.

In [4]:
def visualize(model, accuracy, dimensionality, time_cost):
    print(f"\033[1m{model}:\033[0m")
    pt = PrettyTable(field_names=[f"\033[1m{field}\033[0m" for field in ["Accuracy", "Dimensionality", "Time Cost"]])
    pt.add_row([accuracy, dimensionality, time_cost])
    print(pt)

visualize("Multinomial Naïve Bayes using tf-idf word uni-grams", accuracy, dimensionality, time_cost)

Multinomial Naïve Bayes using tf-idf word uni-grams:
+--------------------+----------------+-------------------+
|      Accuracy      | Dimensionality |     Time Cost     |
+--------------------+----------------+-------------------+
| 0.9022368421052631 |     64999      | 6.860435485839844 |
+--------------------+----------------+-------------------+


### 3.2. Multinomial Naïve Bayes using tf-idf character tri-grams

The run_model() function is called with five arguments: train_data, test_data, 3 for the n-gram range, "char" for the tokenization approach, and an instance of the MultinomialNB class as the classification algorithm. The function extracts features using TF-IDF with character tri-grams, trains a Multinomial Naïve Bayes classifier, and calculates the accuracy score, dimensionality, time cost, and misclassification data. The visualize function then displays the accuracy, dimensionality, and time cost of the model.

In [5]:
accuracy, dimensionality, time_cost, misclass_data_mnv3c = run_model(train_data, test_data, 3, "char", MultinomialNB())
visualize("Multinomial Naïve Bayes using tf-idf character tri-grams", accuracy, dimensionality, time_cost)

Multinomial Naïve Bayes using tf-idf character tri-grams:
+--------------------+----------------+--------------------+
|      Accuracy      | Dimensionality |     Time Cost      |
+--------------------+----------------+--------------------+
| 0.8686842105263158 |     31074      | 24.951099157333374 |
+--------------------+----------------+--------------------+


### 3.3. Support Vector Machines using tf-idf word uni-grams

The run_model() function is called with five arguments: train_data, test_data, 1 for the n-gram range, "word" for the tokenization approach, and an instance of the LinearSVC class with the C parameter set to 1 as the classification algorithm. The function extracts features using TF-IDF with word uni-grams, trains a linear Support Vector Machine classifier, and calculates the accuracy score, dimensionality, time cost, and misclassification data. The visualize function then displays the accuracy, dimensionality, and time cost of the model.

In [6]:
accuracy, dimensionality, time_cost, misclass_data_svm1w = run_model(train_data, test_data, 1, "word", LinearSVC(C=1))
visualize("Support Vector Machines using tf-idf word uni-grams", accuracy, dimensionality, time_cost)

Support Vector Machines using tf-idf word uni-grams:
+--------------------+----------------+-------------------+
|      Accuracy      | Dimensionality |     Time Cost     |
+--------------------+----------------+-------------------+
| 0.9196052631578947 |     64999      | 11.84224820137024 |
+--------------------+----------------+-------------------+
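
Swapping MultinomialNB for LinearSVC needs no other code changes because both scikit-learn estimators expose the same `fit`, `predict`, and `score` methods, and `fit` returns the estimator itself. The sketch below illustrates this on a tiny invented dataset; the texts, labels, and the `scores` dictionary are assumptions for the example, and the printed scores are training accuracy on toy data, not comparable to the results in this document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy data invented for illustration.
texts = ["stocks fall on weak earnings", "team wins the cup final",
         "markets rally after rate cut", "striker scores twice in derby"]
labels = ["3", "2", "3", "2"]

X = TfidfVectorizer().fit_transform(texts)

scores = {}
for classifier in (MultinomialNB(), LinearSVC(C=1)):
    fitted = classifier.fit(X, labels)
    # fit() returns the same object, so the fitted model and the
    # original instance can be used interchangeably afterwards.
    assert fitted is classifier
    scores[type(classifier).__name__] = classifier.score(X, labels)

print(scores)
```

This shared interface is why a helper like train_classifier can accept any of these classifier instances as its last argument.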


### 3.4. Support Vector Machines using tf-idf character tri-grams

The run_model() function is called with five arguments: train_data, test_data, 3 for the n-gram range, "char" for the tokenization approach, and an instance of the LinearSVC class with the C parameter set to 1 as the classification algorithm. The function extracts features using TF-IDF with character tri-grams, trains a linear Support Vector Machine classifier, and calculates the accuracy score, dimensionality, time cost, and misclassification data. The visualize function then displays the accuracy, dimensionality, and time cost of the model.

In [7]:
accuracy, dimensionality, time_cost, misclass_data_svm3c = run_model(train_data, test_data, 3, "char", LinearSVC(C=1))
visualize("Support Vector Machines using tf-idf character tri-grams", accuracy, dimensionality, time_cost)

Support Vector Machines using tf-idf character tri-grams:
+--------------------+----------------+-------------------+
|      Accuracy      | Dimensionality |     Time Cost     |
+--------------------+----------------+-------------------+
| 0.9121052631578948 |     31074      | 38.08852005004883 |
+--------------------+----------------+-------------------+


## 4. Analyze the results of the models

The analyze_results function analyzes the misclassification data of multiple models. It builds a common_misclass_data dictionary that stores the texts misclassified by all the models, then calls three helper functions that count the misclassified texts for each true label, determine the most common misclassification pair, and display a random misclassified text with the prediction made by each model.

* **Step.1:** the analyze_results function starts by creating the common_misclass_data dictionary to store the texts that were misclassified by all the models. For each text misclassified by the first model, it looks up the same text under the same true label in every other model's misclassification data and keeps it only if every model got it wrong, storing the text together with the list of predicted labels from all the models. It then calls three helper functions, count_times(), get_top_pair(), and get_random_text(), to analyze the common misclassifications further.
* **Step.2:** the count_times() function is called within analyze_results. It takes as input the common_misclass_data dictionary and the list of classes that contains all the possible labels in the dataset. It first counts the number of misclassified texts for each true label by computing the length of the corresponding dictionary value (which is a list of tuples where each tuple contains a misclassified text and its predicted labels from all models). Then, it creates a pretty table to display the number of common misclassified texts for each true label across all models.
* **Step.3:** the get_top_pair() function is also called within analyze_results. It takes as input the common_misclass_data dictionary and the list of classes that contains all the possible labels in the dataset. It then iterates through each true label and the corresponding misclassified texts and counts the number of times each pair of true label and predicted label occurs in the list of misclassified texts. It then determines the pair with the highest count and prints it as the most common misclassification pair. Finally, the function creates a PrettyTable object to display the frequencies of all the misclassification pairs, sorted in descending order by frequency.
* **Step.4:** the get_random_text() function is called within analyze_results as well. It takes as input the list of all the models, the common_misclass_data dictionary, and the list of classes that contains all the possible labels in the dataset. It randomly selects a misclassified text and displays it along with its true label. It then shows the predictions made by each model in a table. The purpose of this function is to allow the user to see an example of a misclassified text and the various predictions made by the models for that text.
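
The intersection and pair-counting logic described in these steps can be sketched on toy data. This is an illustrative variant rather than the project's implementation: the function name `common_misclassified`, the two hypothetical models, and their texts are invented, and the lookup uses dictionaries and `collections.Counter` instead of the generator scan and `defaultdict(int)`.

```python
from collections import Counter, defaultdict

classes = ["World", "Sports", "Business", "Sci/Tech"]

# Hypothetical per-model data: true label -> [(text, predicted label), ...]
model_a = {"3": [("text one", "4"), ("text two", "1")]}
model_b = {"3": [("text one", "4")], "1": [("text three", "2")]}

def to_category(label, classes):
    # Labels are 1-based numeric strings, so "3" maps to classes[2].
    return classes[int(label) - 1]

def common_misclassified(per_model):
    """Keep texts that every model misclassified, with each model's prediction."""
    # Index every model after the first by (true label, text) for direct lookups.
    lookups = [
        {(true, text): pred for true, items in model.items() for text, pred in items}
        for model in per_model[1:]
    ]
    common = defaultdict(list)
    for true, items in per_model[0].items():
        for text, pred in items:
            preds = [pred] + [lookup.get((true, text)) for lookup in lookups]
            if all(p is not None for p in preds):
                common[true].append((text, preds))
    return common

common = common_misclassified([model_a, model_b])
print(dict(common))  # {'3': [('text one', ['4', '4'])]}

# Each model's prediction contributes one count to its (true, predicted) pair.
pair_counts = Counter(
    (true, pred)
    for true, items in common.items()
    for _text, preds in items
    for pred in preds
)
(top_true, top_pred), top_count = pair_counts.most_common(1)[0]
print(to_category(top_true, classes), to_category(top_pred, classes), top_count)
# Business Sci/Tech 2
```

Because every model adds its own count, a pair's frequency can exceed the number of common texts for that class, which is consistent with the output tables (381 Business/Sci/Tech pairs from 135 common Business texts). One difference from the generator-based lookup: if the same text appeared twice under one true label, the dictionary index would keep the last prediction rather than the first.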

> The to_category() function takes a 1-based numerical label as input, along with a list of classes, and returns the corresponding string label for that numerical value.

The analyze_results function is called with three arguments: a list of category labels (["World", "Sports", "Business", "Sci/Tech"]), a list of model names (["mnv1w", "mnv3c", "svm1w", "svm3c"]), and a list of misclassification data for each model.

In [8]:
def count_times(classes, common_misclass_data):
    misclass_counts = {true_label: len(misclass_tuples) for true_label, misclass_tuples in common_misclass_data.items()}
    pt = PrettyTable(field_names=[f"\033[1m{field}\033[0m" for field in ["True Label", "Misclassified Texts"]])
    for true_label, count in misclass_counts.items():
        pt.add_row([to_category(true_label, classes), count])
    print("\033[1mCommon Misclassified Texts per Class:\033[0m")
    print(pt)

def get_top_pair(classes, common_misclass_data):
    misclass_freqs = defaultdict(int)
    for true_label, values in common_misclass_data.items():
        for text, pred_labels in values:
            for pl in pred_labels:
                misclass_freqs[(true_label, pl)] += 1
    max_tuple, max_count = max(misclass_freqs.items(), key=lambda x: x[1])
    sorted_tuples = sorted(misclass_freqs.items(), key=lambda x: x[1], reverse=True)
    print(f"\n\033[1mMost common Misclassification Pair:\033[0m ({to_category(max_tuple[0], classes)}, {to_category(max_tuple[1], classes)})")
    pt = PrettyTable(field_names=[f"\033[1m{field}\033[0m" for field in ["True Label", "Predicted Label", "Frequency"]])
    for tup, count in sorted_tuples:
        pt.add_row([to_category(tup[0], classes), to_category(tup[1], classes), count])
    print(pt)

def get_random_text(classes, models, common_misclass_data):
    rand_true_label = random.choice(list(common_misclass_data.keys()))
    rand_misclass_tuple = random.choice(common_misclass_data[rand_true_label])
    print("\n\033[1m" + "Random Text: " + "\033[0m" + rand_misclass_tuple[0] + "\033[1m" + "\nTrue Label: " + "\033[0m" + to_category(rand_true_label, classes))
    pt = PrettyTable(field_names=[f"\033[1m{field}\033[0m" for field in ["Model", "Prediction"]])
    for idx, model in enumerate(models):
        pt.add_row([model, to_category(rand_misclass_tuple[1][idx], classes)])
    print(pt)

def to_category(label, classes):
    return classes[int(label) - 1]
    
def analyze_results(classes, models, misclassified):
    common_misclass_data = defaultdict(list)
    for true_label in misclassified[0].keys():
        for text, label in misclassified[0][true_label]:
            labels = [label] + [next((l for t, l in model[true_label] if t == text), '') for model in misclassified[1:]]
            if all(labels):  # keep the text only if every model misclassified it
                common_misclass_data[true_label].append((text, labels))
    count_times(classes, common_misclass_data)
    get_top_pair(classes, common_misclass_data)
    get_random_text(classes, models, common_misclass_data)

analyze_results(["World", "Sports", "Business", "Sci/Tech"], ["mnv1w", "mnv3c", "svm1w", "svm3c"], [misclass_data_mnv1w, misclass_data_mnv3c, misclass_data_svm1w, misclass_data_svm3c])

Common Misclassified Texts per Class:
+------------+---------------------+
| True Label | Misclassified Texts |
+------------+---------------------+
|  Sci/Tech  |          85         |
|  Business  |         135         |
|   World    |         112         |
|   Sports   |          9          |
+------------+---------------------+

Most common Misclassification Pair: (Business, Sci/Tech)
+------------+-----------------+-----------+
| True Label | Predicted Label | Frequency |
+------------+-----------------+-----------+
|  Business  |     Sci/Tech    |    381    |
|  Sci/Tech  |     Business    |    206    |
|   World    |     Business    |    194    |
|   World    |      Sports     |    144    |
|  Business  |      World      |    123    |
|   World    |     Sci/Tech    |    110    |
|  Sci/Tech  |      World      |    103    |
|  Business  |      Sports     |     36    |
|  Sci/Tech  |      Sports     |     31    |
|   Sports  