**Class TFIDF:**

First, the class was initialized with the attributes *corpus*, *num_documents*, *vocab*, and *term_idf_cache*. These store the input documents, the number of documents, the vocabulary set, and cached IDF values, so that the IDF of each term is computed only once. The constructor also loads the NLTK English stopword list and a WordNet lemmatizer.

Next, **methods** were implemented to calculate the Term Frequency (TF) and Inverse Document Frequency (IDF) of each term in the corpus. TF is the number of occurrences of a term in a document divided by the total number of tokens in that document. IDF is computed as log(N / (1 + df)), where N is the number of documents in the corpus and df is the number of documents containing the term; the 1 in the denominator avoids a division by zero for terms that appear in no document.
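
To make the two quantities concrete, here is a minimal sketch that assumes the smoothed form log(N / (1 + df)) used in calculate_idf. The helper names and the toy corpus are hypothetical and only serve as illustration; documents are assumed to be already tokenized.

```python
import math

def term_frequency(term, tokens):
    """Occurrences of `term` divided by the number of tokens in the document."""
    return tokens.count(term) / len(tokens) if tokens else 0.0

def inverse_document_frequency(term, tokenized_docs):
    """Smoothed IDF: log(N / (1 + df)), with df = documents containing `term`."""
    df = sum(1 for tokens in tokenized_docs if term in tokens)
    return math.log(len(tokenized_docs) / (1 + df))

# Hypothetical toy corpus, already tokenized
docs = [["cat", "sat", "mat"], ["dog", "sat"], ["cat", "cat", "dog"], ["bird", "sang"]]

tf = term_frequency("cat", docs[2])              # 2 occurrences out of 3 tokens
idf = inverse_document_frequency("bird", docs)   # log(4 / (1 + 1))
tfidf = tf * inverse_document_frequency("cat", docs)
```

With this form of the formula, a term that occurs in every document gets a slightly negative IDF, because N / (1 + N) is below 1.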

Subsequently, the **fit method** was developed to fit the TF-IDF model on the corpus. It iterates over the documents, adds the terms of each document to the vocabulary set, and counts for every term the number of documents in which it appears (its document frequency). Finally, each vocabulary term is assigned a column index.
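
The fitting step can be sketched as follows. This is an illustrative stand-alone function, not the class method itself: the function name is hypothetical, the input is assumed to be already tokenized, and the vocabulary is sorted here so that column indices are reproducible.

```python
def build_vocab_and_doc_freq(tokenized_docs):
    """Collect the vocabulary and count, per term, the documents containing it."""
    vocab = set()
    doc_freq = {}
    for tokens in tokenized_docs:
        vocab.update(tokens)
        # set(tokens) makes sure a document is counted once per term,
        # no matter how often the term occurs in it
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    # Map every term to a column index (sorted for a stable order)
    term_indices = {term: idx for idx, term in enumerate(sorted(vocab))}
    return vocab, doc_freq, term_indices

# Hypothetical toy corpus
docs = [["cat", "sat", "cat"], ["dog", "sat"]]
vocab, doc_freq, term_indices = build_vocab_and_doc_freq(docs)
```

Note that "cat" occurs twice in the first document but contributes only 1 to its document frequency.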

Following the fitting of the model, the **transform method** was created to transform a set of documents into their TF-IDF representation. It iterates over each document and its terms, skips terms that were not seen during fitting, computes the TF-IDF score by multiplying TF by IDF, and stores the scores in a sparse matrix (csr_matrix from SciPy) with one row per document and one column per vocabulary term.
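
The sparse matrix is built from three parallel lists (values, row indices, column indices). The sketch below shows only that construction step; the scores and the matrix shape are made-up values, not output of the model.

```python
from scipy.sparse import csr_matrix

# Hypothetical TF-IDF scores given as parallel lists:
# data[k] is stored at position (rows[k], cols[k])
rows = [0, 0, 1]
cols = [0, 2, 1]
data = [0.5, 0.25, 0.75]

# 2 documents (rows) and a vocabulary of 4 terms (columns)
matrix = csr_matrix((data, (rows, cols)), shape=(2, 4))

# Only the three non-zero entries are stored; toarray() expands to a dense array
dense = matrix.toarray()
```

Passing shape explicitly matters: without it, trailing vocabulary columns that happen to be empty in this batch of documents would be dropped.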

Finally, **text preprocessing** steps, including tokenization, removal of stopwords, and lemmatization, were incorporated into the split method. The token pattern keeps only words of two or more characters, so single-character tokens are discarded. Tokenization uses a regular expression, while lemmatization and stopword removal use NLTK (WordNetLemmatizer and the English stopword list). The input text is therefore lowercased, tokenized, lemmatized, and filtered before the TF-IDF computation.
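
The effect of the token pattern can be checked in isolation. The sketch below reuses the same regular expression as the split method but, to stay self-contained, replaces the NLTK stopword list with a tiny hypothetical set and leaves lemmatization out.

```python
import re

TOKEN_PATTERN = r"\b\w\w+\b"            # words of two or more characters
STOPWORDS = {"the", "is", "on", "and"}  # hypothetical stand-in for the NLTK list

def tokenize(text):
    """Lowercase, extract tokens with the pattern, and drop stopwords."""
    tokens = re.findall(TOKEN_PATTERN, text.lower())
    return [token for token in tokens if token not in STOPWORDS]

tokens = tokenize("The cat is on a mat, and I saw it!")
# "a" and "i" are dropped by the pattern; "the", "is", "on", "and" by the stopword set
```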


**cast_list_as_strings** function:

Converts every element of a list to a string and returns the resulting list. This function was provided by the professor.

**perform_grid_search** function:

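Performs a manual grid search over the values of C, solver, and penalty given in param_grid. Each combination is fitted on the training set and scored by accuracy on the validation set. The function returns the best validation score and the corresponding parameters for Logistic Regression.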
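
The search pattern itself is independent of the model. The following sketch shows the same exhaustive loop written with itertools.product; the function names are hypothetical, and toy_score is a made-up stand-in for fitting the classifier and measuring validation accuracy.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every parameter combination and keep the best-scoring one."""
    best_score, best_params = float("-inf"), None
    names = list(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params

def toy_score(params):
    # Hypothetical scoring function; a real one would fit a model with
    # `params` and return its accuracy on the validation set
    bonus = 0.05 if params["penalty"] == "l2" else 0.0
    return 0.9 - abs(params["C"] - 1.0) * 0.1 + bonus

param_grid = {"C": [0.1, 1.0, 10.0], "penalty": ["l1", "l2"]}
best_score, best_params = grid_search(param_grid, toy_score)
```

Because the selection is made on the validation set, the reported best score is an optimistic estimate; a separate test set is needed for the final evaluation.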

**plot_roc_curve_and_threshold** function:

This function takes TF-IDF data, true labels, and a fitted classifier. It can optionally compute and plot the ROC curve and, if requested, find and print the optimal threshold.

- It calculates the ROC AUC score if calculate_roc is set to True.
- If calculate_roc is True, it also plots the ROC curve, with the thresholds shown as a color gradient.
- If calculate_optimal_threshold is True and no optimal threshold is provided, it determines the threshold that maximizes the sum of the true positive rate (TPR) and the true negative rate (TNR) on the ROC curve. Since TNR = 1 - FPR, this is the same as maximizing TPR - FPR (Youden's J statistic).
- If an optimal threshold is found or provided, it applies this custom threshold to the predicted probabilities to obtain binary predictions.
- It then calculates and prints the classification report and extracts the accuracy and the macro-averaged precision, recall, and F1-score.
- Finally, it returns the optimal threshold (None if not applicable) together with a dictionary containing accuracy, precision, recall, and F1-score.
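
The threshold selection can be illustrated on its own. The sketch below is written in plain Python with made-up ROC points; the function name is hypothetical, and in the actual function the fpr, tpr, and thresholds arrays come from sklearn.metrics.roc_curve.

```python
def youden_threshold(fpr, tpr, thresholds):
    """Return the threshold maximizing J = TPR - FPR, together with J."""
    j_scores = [t - f for t, f in zip(tpr, fpr)]
    best_idx = max(range(len(j_scores)), key=j_scores.__getitem__)
    return thresholds[best_idx], j_scores[best_idx]

# Hypothetical ROC points, ordered by decreasing threshold
fpr = [0.0, 0.1, 0.3, 0.6, 1.0]
tpr = [0.0, 0.5, 0.8, 0.9, 1.0]
thresholds = [1.0, 0.8, 0.5, 0.3, 0.0]

threshold, j = youden_threshold(fpr, tpr, thresholds)

# Apply the custom threshold to hypothetical predicted probabilities
scores = [0.9, 0.4, 0.55, 0.2]
predictions = [int(score >= threshold) for score in scores]
```

Maximizing J weights false positives and false negatives equally; if one kind of error is more costly, a different criterion may be more appropriate.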

In [None]:
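import math
import re

import matplotlib.pyplot as plt
import nltk
import numpy as np
import sklearn.linear_model  # LogisticRegression, used in perform_grid_search
import sklearn.metrics  # ROC and classification metrics, used in plot_roc_curve_and_threshold
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from scipy.sparse import csr_matrix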

def cast_list_as_strings(mylist):
    """
    Return a list with every element of mylist converted to a string.
    """
    return [str(x) for x in mylist]

class TFIDF:
    def __init__(self):
        self.corpus = None
        self.num_documents = None
        self.vocab = None
        self.term_idf_cache = {}
        self.tfidf_scores = None
        self.stopwords = set(stopwords.words('english'))
        self.lemmatizer = WordNetLemmatizer()

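    def calculate_tf(self, term, document):
        # Count the term among the preprocessed tokens rather than as a raw substring:
        # str.count would also match "cat" inside "category" and would miss
        # lemmatized or lowercased forms
        tokens = self.split(document)
        total_words = len(tokens)
        return tokens.count(term) / total_words if total_words > 0 else 0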

    def calculate_idf(self, term):
        if term in self.term_idf_cache:
            return self.term_idf_cache[term]

        num_documents_with_term = self.doc_freq.get(term, 0)
        idf = math.log(self.num_documents / (1 + num_documents_with_term))
        self.term_idf_cache[term] = idf
        return idf

    def fit(self, corpus):
        self.corpus = corpus
        self.num_documents = len(corpus)
        self.vocab = set()  # Vocabulary set
        self.doc_freq = {}  # Document frequency: number of documents containing each term
        self.term_idf_cache = {}  # Discard IDF values cached by a previous fit

        # Build the vocabulary and count, for each term, the documents that contain it
        for document in self.corpus:
            terms = self.split(document)  # split() already removes stopwords
            self.vocab.update(terms)
            for term in set(terms):  # Use set to count each document only once
                self.doc_freq[term] = self.doc_freq.get(term, 0) + 1

        # Precompute term indices; sorting keeps the column order reproducible across runs
        self.term_indices = {term: idx for idx, term in enumerate(sorted(self.vocab))}

    def transform(self, questions):
        tfidf_data = []
        tfidf_row = []
        tfidf_col = []

        # Iterate over documents and terms to compute TF-IDF scores
        for i, document in enumerate(questions):
            for term in set(self.split(document)):
                if term in self.vocab:  # Skip terms that were not seen during fit
                    # calculate_idf caches its result, so each term's IDF is computed only once
                    tfidf_score = self.calculate_tf(term, document) * self.calculate_idf(term)
                    tfidf_data.append(tfidf_score)
                    tfidf_row.append(i)
                    tfidf_col.append(self.term_indices[term])

        self.tfidf_scores = csr_matrix((tfidf_data, (tfidf_row, tfidf_col)), shape=(len(questions), len(self.vocab)))

        return self.tfidf_scores

    def split(self, text, token_pattern=r"\b\w\w+\b"):
        # Lowercase the text and extract tokens of two or more word characters
        terms = re.findall(token_pattern, text.lower())
        # Lemmatize each token (WordNetLemmatizer treats tokens as nouns by default)
        lemmatized_terms = [self.lemmatizer.lemmatize(term) for term in terms]
        return [term for term in lemmatized_terms if term not in self.stopwords]  # Exclude stopwords after lemmatization

def perform_grid_search(X_train, y_train, X_val, y_val, param_grid):
    # Initialize Logistic Regression model
    log_reg = sklearn.linear_model.LogisticRegression(random_state=123, max_iter=100000)

    # Perform grid search
    best_score = 0
    best_params = None
    for C in param_grid['C']:
        for solver in param_grid['solver']:
            for penalty in param_grid['penalty']:
                log_reg.set_params(C=C, solver=solver, penalty=penalty)
                log_reg.fit(X_train, y_train)
                score = log_reg.score(X_val, y_val)

                # Check if this is the best score
                if score > best_score:
                    best_score = score
                    best_params = {'C': C, 'solver': solver, 'penalty': penalty}

    return best_score, best_params

def plot_roc_curve_and_threshold(tfidf_data, y_true, clf, optimal_threshold=None, calculate_roc=True, calculate_optimal_threshold=False):

    results = {}
    # Extract probabilities for the positive class
    y_score = clf.predict_proba(tfidf_data)[:, 1]

    roc_auc_score = None
    accuracy = None
    precision = None
    recall = None
    f1_score = None

    # Calculate ROC AUC score if requested
    if calculate_roc:
        roc_auc_score = sklearn.metrics.roc_auc_score(y_true, y_score)
        print("ROC AUC score:", roc_auc_score)

        # Generate ROC curve if requested
        fpr, tpr, thresholds = sklearn.metrics.roc_curve(y_true, y_score)
        thresholds_clipped = np.clip(thresholds, 0, 1)

        # Plot ROC curve with thresholds as gradient color
        plt.figure()
        plt.scatter(fpr, tpr, c=thresholds_clipped, cmap='viridis', s=5)
        plt.colorbar(label='Threshold')
        plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random classifier')
        plt.xlim([0.0, 1.0])
        plt.ylim([0.0, 1.05])
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title('ROC Curve with Thresholds')
        plt.legend(loc="lower right")
        plt.show()

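    # Find the optimal threshold if requested and none was provided
    if calculate_optimal_threshold and optimal_threshold is None:
        # Compute the ROC curve here so this step also works when calculate_roc is False
        fpr, tpr, thresholds = sklearn.metrics.roc_curve(y_true, y_score)
        # Youden's J statistic: maximizing TPR - FPR is the same as maximizing TPR + TNR
        optimal_idx = np.argmax(tpr - fpr)
        optimal_threshold = np.clip(thresholds[optimal_idx], 0, 1)
        print("Optimal Threshold:", optimal_threshold)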

    # Apply custom threshold to obtain binary predictions and calculate classification report
    if optimal_threshold is not None:
        y_pred_custom = (y_score >= optimal_threshold).astype(int)

        # Calculate classification report
        print(sklearn.metrics.classification_report(y_true, y_pred_custom))
        report = sklearn.metrics.classification_report(y_true, y_pred_custom, output_dict=True)

        # Extract relevant metrics from the classification report
        accuracy = report['accuracy']
        precision = report['macro avg']['precision']
        recall = report['macro avg']['recall']
        f1_score = report['macro avg']['f1-score']

        results = {'Accuracy': accuracy, 'Precision': precision, 'Recall': recall, 'F1-score': f1_score}

    # Return the metrics and optimal threshold
    return optimal_threshold, results