# Question Classification
author : Kobe Thonissen

e-mail : kobe.thonissen@gmail.com

course : LINFO2263 - Computational Linguistics

disclaimer : the function "remove_html_tags" and a part of the function "tokenize" were offered by the professor (Pierre Dupont) and/or the course assistants (Amaury Fierens and Benoît Ronval)

## Project Overview
The goal of this project is to classify a dataset of 60k questions from Stack Overflow into one out of two categories: 
- High quality questions (HQ): Questions that received a high number of upvotes and comments, indicating community engagement
- Low quality questions (LQ): Questions that received less interaction from the community

This project explores three different approaches to text classification:

- Naive Bayes Classification: A probabilistic classifier based on Bayes' theorem that assumes independence between features
- Binary Naive Bayes classification: A variation where each word is counted only once regardless of frequency
- Naive Bayes Classification with negative tokens: An enhanced version that accounts for negation in language

The Binary Naive Bayes Classifier resulted in the highest accuracy rate (82.649). The Naive Bayes Classifier performed slightly worse (79.547), and the inclusion on negation tokens did not improve the accuracy (79.883)



## Preprocessing

This section handles data loading and text preprocessing steps needed before classification:
1. Loading the dataset from CSV files
2. Cleaning HTML tags from questions
3. Tokenizing text into individual words
4. Creating a vocabulary to replace infrequent words (n<5>) by the special token <UNK>
5. Converting tokenized text into bag-of-words representations (a list containing the individual words of the text)

In [4]:
# imports and functions

import pandas as pd
import nltk
import math
from nltk.tokenize import WordPunctTokenizer
from nltk.probability import FreqDist, ConditionalFreqDist, ConditionalProbDist
from nltk import MLEProbDist
from nltk.lm.vocabulary import Vocabulary
from nltk.lm import MLE, Laplace
from nltk.metrics.scores import accuracy


def remove_html_tags(text):
    """
    Removes HTML tags from text (any sequence of characters between < and >).
    
    Args:
        text (str): Input text containing HTML tags
        
    Returns:
        str: Clean text with HTML tags removed
    """
    import re
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)


def tokenize(path):
    """
    Processes the dataset by removing HTML tags and tokenizing the text.
    
    Args:
        path (str): Path to a CSV file containing columns "Text" (with HTML-formatted text) and "Y" (quality label)
        
    Returns:
        DataFrame: Processed pandas dataframe with:
        - HTML tags removed from "Text" column
        - New "Tokens" column containing lists of word tokens
    """
    df_corpus = pd.read_csv(path)
    df_corpus['Text'] = df_corpus['Text'].apply(lambda x: remove_html_tags(x))
    df_corpus['Tokens'] = df_corpus['Text'].apply(lambda x: WordPunctTokenizer().tokenize(x)).astype('object')
    return(df_corpus)

def to_bow(df, vocab):
    '''
    Converts tokenized text to bag-of-words representation, filtering out out-of-vocabulary tokens.
    
    Args:
        df (DataFrame): DataFrame containing a "Tokens" column with lists of tokens
        vocab (Vocabulary): NLTK vocabulary object defining valid tokens
        
    Returns:
        DataFrame: DataFrame with new "BOW" column containing lists of one-tuples for each token
    '''
    df["BOW"] = ""
    for i in range(len(df)):
        text = df.at[i, "Tokens"]
        text_without_oov = []
        for token in text:
            if vocab.lookup(token) != "<UNK>":
                text_without_oov.append( (token,) )
        df.at[i, "BOW"] = list(text_without_oov)
    return df

def create_vocab(df, cutoff):
    """
    Creates a vocabulary from tokens in the dataset, filtering out rare words.
    
    Args:
        df (DataFrame): DataFrame containing a "Tokens" column
        cutoff (int): Minimum frequency threshold - tokens appearing less than this will be excluded
        
    Returns:
        Vocabulary: NLTK vocabulary object containing all tokens meeting the frequency threshold
    """
    all_tokens = []
    for text in df["Tokens"]:
        for token in text:
            all_tokens.append(token)
    vocab = Vocabulary(all_tokens, unk_cutoff=cutoff)
    return(vocab)

def logscore(text, model, category_prob):
    """
    Computes log-probability of a text belonging to a category using Naive Bayes principles.
    
    The function calculates: log P(c) + sum(log P(w_i|c)) where:
    - P(c) is the prior probability of category c
    - P(w_i|c) is the probability of word w_i given category c
    
    Args:
        text (list): List of word tokens (as tuples)
        model: Language model that provides word probabilities
        category_prob (float): Prior probability of the category
        
    Returns:
        float: Log probability score of the text belonging to the category
    """
    sum = 0
    for word in text:
        word_prob = model.score(word[0])  # probability of word given category
        sum += math.log(word_prob, 10)
    return math.log(category_prob, 10) + sum

def argmax_category(text, model_lq, model_hq, category_prob):
    """
    Determines the most likely category (HQ or LQ) for a given text.
    
    Args:
        text (list): List of word tokens (as tuples)
        model_lq: Language model for low quality questions
        model_hq: Language model for high quality questions
        category_prob (dict): Prior probabilities for "HQ" and "LQ" categories
        
    Returns:
        str: Predicted category ("HQ" or "LQ")
    """
    label_given_text_prob_hq = logscore(text, model_hq, category_prob["HQ"])
    label_given_text_prob_lq = logscore(text, model_lq, category_prob["LQ"])
    if label_given_text_prob_hq > label_given_text_prob_lq:
        return "HQ"
    else:
        return "LQ"
    
def predict_categories(df, model_hq, model_lq, category_prob):
    """
    Predicts categories for all texts in the dataframe.
    
    Args:
        df (DataFrame): DataFrame containing "BOW" column with tokenized texts
        model_hq: Language model for high quality questions
        model_lq: Language model for low quality questions
        category_prob (dict): Prior probabilities for "HQ" and "LQ" categories
        
    Returns:
        DataFrame: Original dataframe with an additional "Predicted_Y" column containing predictions
    """
    df["Predicted_Y"] = df["BOW"].apply(lambda x: argmax_category(x, model_lq, model_hq, category_prob))
    return df

In [5]:
# tokenize training dataset
df_train = tokenize("../corpora/train.csv")
df_test = tokenize("../corpora/test.csv")


#construct vocabulary, omitting all words with less then 5 occurences
vocab_train =create_vocab(df_train, 5)

# transform texts into bags of words, omitting OOV-tokens
df_train = to_bow(df_train, vocab_train)
df_test = to_bow(df_test, vocab_train)

# get proportions of HQ and LQ texts in df_train
category_prop = df_train['Y'].value_counts(normalize=True)


## Naive Bayes
Naive Bayes allows to classify texts based on the frequency of words. 

In this implementation:
1. We train separate language models for HQ and LQ questions
2. We use Laplace smoothing to handle unseen words
3. We work with log probabilities to avoid numerical underflow


In [6]:
# train MLE model with Laplace smoothing for each category
df_train_hq = df_train[df_train["Y"]=="HQ"]
lm_hq = Laplace(order=1, vocabulary=vocab_train)
lm_hq.fit(df_train_hq["BOW"].tolist(), vocab_train)

df_train_lq = df_train[df_train["Y"]=="LQ"]
lm_lq = Laplace(order=1, vocabulary=vocab_train)
lm_lq.fit(df_train_lq["BOW"].tolist(), vocab_train)

In [7]:
# predict category
df_train_prediction = predict_categories(df_train, lm_hq, lm_lq, category_prop)
df_test_prediction = predict_categories(df_test, lm_hq, lm_lq, category_prop)

# present results
print("results training data:")
print(pd.crosstab(df_train_prediction["Predicted_Y"], df_train_prediction["Y"]))
print("results test data:")
print(pd.crosstab(df_test_prediction["Predicted_Y"], df_train_prediction["Y"]))
print("accuracy test data:")
accuracy_test = accuracy(df_test_prediction["Y"].tolist(), df_test_prediction["Predicted_Y"].tolist())*100
print(round(accuracy_test, 3))

results training data:
Y              HQ    LQ
Predicted_Y            
HQ           3588  1458
LQ            366  6407
results test data:
Y             HQ   LQ
Predicted_Y          
HQ           172  352
LQ           231  438
accuracy test data:
79.547


## Binary Naive Bayes
This section implements Binary Naive Bayes classification, a variant of the standard approach.

The key difference in Binary Naive Bayes is that we only consider the presence or absence 
of a word, not its frequency within a document. This means each word is counted at most once
per document, regardless of how many times it appears.

This approach:
1. Reduces the impact of word repetition in longer documents
2. May perform better for certain types of text where the mere presence of certain terms
   is more important than their frequency
3. Often has lower variance in performance across different document lengths


In [8]:
# preprocessing
def to_unique_bow(df, vocab):
    '''
    remove OOV-tokens from text
    create set: no token occurs multiple times
    transform list of words into list of tuples containing individual words
    '''
    df["BOW"] = ""
    for i in range(len(df)):
        text = df.at[i, "Tokens"]
        text_without_oov = []
        for token in text:
            if vocab.lookup(token) != "<UNK>":
                text_without_oov.append( (token,) )
        set_text = list(set(text_without_oov))
        df.at[i, "BOW"] = list(set_text)
    return df


# tokenize training dataset
df_train = tokenize("../corpora/train.csv")
df_test = tokenize("../corpora/test.csv")


#construct vocabulary, omitting all words with less then 5 occurences
vocab_train =create_vocab(df_train, 5)

# transform texts into bags of words, omitting OOV-tokens
df_train = to_unique_bow(df_train, vocab_train)
df_test = to_unique_bow(df_test, vocab_train)

# get proportions of HQ and LQ texts in df_train
category_prop = df_train['Y'].value_counts(normalize=True)

In [9]:
# train MLE model with Laplace smoothing for each category
df_train_hq = df_train[df_train["Y"]=="HQ"]
lm_hq = Laplace(order=1, vocabulary=vocab_train)
lm_hq.fit(df_train_hq["BOW"].tolist(), vocab_train)

df_train_lq = df_train[df_train["Y"]=="LQ"]
lm_lq = Laplace(order=1, vocabulary=vocab_train)
lm_lq.fit(df_train_lq["BOW"].tolist(), vocab_train)

In [10]:


# predict category
df_train_prediction = predict_categories(df_train, lm_hq, lm_lq, category_prop)
df_test_prediction = predict_categories(df_test, lm_hq, lm_lq, category_prop)

# present results
print("results training data:")
print(pd.crosstab(df_train_prediction["Predicted_Y"], df_train_prediction["Y"]))
print("results test data:")
print(pd.crosstab(df_test_prediction["Predicted_Y"], df_train_prediction["Y"]))
print("accuracy test data:")
accuracy_test = accuracy(df_test_prediction["Y"].tolist(), df_test_prediction["Predicted_Y"].tolist())*100
print(round(accuracy_test, 3))

results training data:
Y              HQ    LQ
Predicted_Y            
HQ           3513   997
LQ            441  6868
results test data:
Y             HQ   LQ
Predicted_Y          
HQ           141  316
LQ           262  474
accuracy test data:
82.649


## Negative tokens
This section enhances Naive Bayes classification by incorporating negation handling.

Negation can significantly alter the meaning of text, but standard bag-of-words models
don't capture this nuance. For example, "good" and "not good" have opposite meanings
but would be treated as independent features in standard models.

This implementation:
1. Identifies negative contexts (words appearing after negative words like 'not', 'no', 'never')
2. Marks words in negative contexts with a "_NOT" suffix, creating distinct features
3. Treats these marked words as separate vocabulary items in the model



In [11]:
# preprocessing

def to_bow_with_negative_tokens(df, vocab):
    '''
    remove OOV-tokens from text
    transform list of words into list of tuples containing individual words
    preprocess the questions so as to replace any token t between a negative token (['not', 'no', 'never']) and a punctuation sign (['.', ',', ':', '?', '!']) by the token t_NOT.
    '''
    df["BOW"] = ""
    for i in range(len(df)):
        text = df.at[i, "Tokens"]
        text_without_oov = []
        negative_context = False
        for token in text:
            if token in ['not', 'no', 'never']:
                negative_context=True
            if token in ['.', ',', ':', '?', '!']:
                negative_context=False
            if vocab.lookup(token) != "<UNK>":
                if negative_context:
                    token = token + "_NOT"
                text_without_oov.append( (token,) )
        df.at[i, "BOW"] = list(text_without_oov)
    return df


# tokenize training dataset
df_train = tokenize("../corpora/train.csv")
df_test = tokenize("../corpora/test.csv")


#construct vocabulary, omitting all words with less then 5 occurences
vocab_train =create_vocab(df_train, 5)

# transform texts into bags of words, omitting OOV-tokens
df_train = to_bow_with_negative_tokens(df_train, vocab_train)
df_test = to_bow_with_negative_tokens(df_test, vocab_train)

# get proportions of HQ and LQ texts in df_train
category_prop = df_train['Y'].value_counts(normalize=True)

In [12]:
# train MLE model with Laplace smoothing for each category
df_train_hq = df_train[df_train["Y"]=="HQ"]
lm_hq = Laplace(order=1, vocabulary=vocab_train)
lm_hq.fit(df_train_hq["BOW"].tolist(), vocab_train)

df_train_lq = df_train[df_train["Y"]=="LQ"]
lm_lq = Laplace(order=1, vocabulary=vocab_train)
lm_lq.fit(df_train_lq["BOW"].tolist(), vocab_train)

In [13]:
# predict category
df_train_prediction = predict_categories(df_train, lm_hq, lm_lq, category_prop)
df_test_prediction = predict_categories(df_test, lm_hq, lm_lq, category_prop)

# present results
print("results training data:")
print(pd.crosstab(df_train_prediction["Predicted_Y"], df_train_prediction["Y"]))
print("results test data:")
print(pd.crosstab(df_test_prediction["Predicted_Y"], df_train_prediction["Y"]))
print("accuracy test data:")
accuracy_test = accuracy(df_test_prediction["Y"].tolist(), df_test_prediction["Predicted_Y"].tolist())*100
print(round(accuracy_test, 3))

results training data:
Y              HQ    LQ
Predicted_Y            
HQ           3578  1471
LQ            376  6394
results test data:
Y             HQ   LQ
Predicted_Y          
HQ           169  355
LQ           234  435
accuracy test data:
79.883
