# Lab 5: Bayesian Classification
### COSC 426: Fall 2025, Colgate University

## Part 1: Build a unigram model

#### Part 1.1

Start by understanding what is happening when you initilize a UnigramModel object

In [1]:
from UnigramModel import UnigramModel
import util
import math
import pandas as pd

sample_model = UnigramModel(tokenize = util.nltk_tokenize,
                            tokenizer_kwargs = {},
                            vocab = util.get_vocab('data/glove_vocab.txt'),
                            unk_token = '[UNK]',
                            train_paths = ['data/sample-alice.txt'],
                            smooth = 'add-0.1',
                            lower = True
                           )                    

**What are the parameters required to initialize a `UnigramModel` object and how are these parameters used in the `UnigramModel` class?**

#### Part 1.2

Verify your implementation of `get_prob` and `evaluate` with the code below. 

*Hint: Running into errors "Rabbit" in `get_probs`? Think carefully about when are where preprocessing is being applied in the pipeline, and what the expected input for `get_probs` is*

We need tokenizer, vocab(with unknown token), train data, smoothing stretegy.
They all have default values. Tokeniger is used to split text in to tokens. Vocab is used to mark whether a work is known or not. Train data is the text we use to find n-gram frequencies. `smooth` is used to indiciate the kind of smoothing strategy we use get the probability.

In [3]:
## print prob of words
expected_probs = {'rabbit': 0.00010176146615934851,
                  'it': 0.0002755005547240899,
                  '[UNK]': 0.00017622107554423767,
                  'Alice': 0.00017622107554423767,
                  'Rabbit': 0.00017622107554423767,
                 }

for word in expected_probs:
    if sample_model.get_prob(word) != expected_probs[word]:
        print(word, '\t incorrect')
        print('expected', expected_probs[word])
        print('got', sample_model.get_prob(word))
        print()
        

## Create paths and then load it
sample_model.evaluate(datafpath= 'data/sample-alice.txt',
                      predfpath = 'predictions/my_sample_preds_alice.tsv')

correct_df = pd.read_csv('predictions/sample_preds_alice.tsv', sep='\t')
my_df = pd.read_csv('predictions/my_sample_preds_alice.tsv', sep='\t')

## Does element wise comparison
print('Are dfs same?', correct_df.equals(my_df))

Are dfs same? True


## Part 2: Implement building blocks of a Naive Bayesian Classifier

Implement the building blocks for a Bayesian classifier. Here is a function that might be useful. 

### Part 2.1: Describe your approach

p(class | text) ~ p(text | class) * p(class)

How is unigram model related to likelihood?  
p(text | class) is probability of the word from the unigram model trained on the class text.

We can get p(text|class) by useing the unigram model trained on the class text, then we calculate p(class) using the 1/number of class. When we multiply the `likelihood` and the `prior`, we get the probability of the class given text.

### Part 2.2 Implement functions

In [4]:
import pandas as pd
def summarize(fname:str, aggregrate_type:str, aggregrate_col:str, groupby_cols:list, delimiter='\t'):
    """
    Args:
        fname: fpath to tsv/ csv file
        aggregate_type: mean or sum

        aggrefate_col: the column with values you want to aggregate over

        groupby_cols: the columns with the groups. 

    Returns:
        Pandas Dataframe with as many rows as unique group combinations. The values of rows in each group is either summed together or averaged depending on the aggregate_type. 

    """
    dat = pd.read_csv(fname, sep=delimiter)

    summ = dat.groupby(groupby_cols).agg({aggregrate_col: aggregrate_type}).reset_index()

    return summ

#### **Calculating the likelihood**

Start by calculating the likelihood of some text given models trained on text from different classes --- i.e., $P(text \mid model=class1)$, $P(text \mid model=class2)$, etc

*Hint: Think about why `summarize` function provided is useful* 

In [None]:
def get_likelihood(models_dict:dict, eval_fpath, class_label):
    """
    Params:
        models_dict: keys are classes and values are the models trained on the classes
        eval_fpath: the file models should be evaluated on
        class_label: the correct class label for sequences in the file 

    Returns:
        A Dataframe with the following columns: 
            sentid: id of the sentence
            model: the model being used to generate the likelihood
            likelihood: the sum of log probability across all the words in the sequence
            target_class: same as class_label
        
    """

    text = models_dict[class_label].preprocess([eval_fpath])
    rows = []
    for model, um in models_dict.items():
        for sentid in range(len(text)):
            sent = text[sentid]
            if sent:
                likelihood = 0
                for wordpos in range(len(sent)):
                    word = sent[wordpos]
                    prob = um.get_prob(word)
                    surp = -math.log2(prob)
                    likelihood += surp
                print(likelihood)
                rows.append([sentid, model, likelihood, class_label])

    # df = pd.DataFrame(rows, columns=["sentid", "model", "likelihood", "class_label"])
    df = pd.DataFrame(rows, columns=["sentid", "model", "likelihood", "target_class"])
    display(df)
    return df

Once you've implemented this function, verify that your output matches the expected output below. 

In [22]:
import numpy as np
pd.set_option('display.precision', 16)
sample_models = {
    'alice': UnigramModel(tokenize = util.nltk_tokenize,
                            tokenizer_kwargs = {},
                            vocab = util.get_vocab('data/glove_vocab.txt'),
                            unk_token = '[UNK]',
                            train_paths = ['data/sample-alice.txt'],
                            smooth = 'add-0.1',
                            lower = True
                           ),
    'sherlock': UnigramModel(tokenize = util.nltk_tokenize,
                            tokenizer_kwargs = {},
                            vocab = util.get_vocab('data/glove_vocab.txt'),
                            unk_token = '[UNK]',
                            train_paths = ['data/sample-sherlock.txt'],
                            smooth = 'add-0.1',
                            lower = True
                           )
}

my_df = get_likelihood(sample_models, 'data/sample-lookingglass.txt', 'alice').reset_index(drop=True)
correct_df = pd.read_csv('predictions/sample-likelihood.tsv', sep='\t').reset_index(drop=True)

#using this instead of equal because of floating point imprecision
print('Printing proportion of matched values across correct_df and my_df\n')
print('likelihood', np.isclose(my_df['likelihood'], correct_df['likelihood']).sum()/len(my_df)) 
for col in ['sentid', 'model', 'target_class']:
    print(col, (my_df[col] == correct_df[col]).sum()/len(my_df))


180.5044598701084
220.15188281698912
195.1884750975023
224.90033841242868
142.72540016937006
206.11602807604277
209.32081472792535
224.675331977782
232.79057254670866
165.76192744935642
83.71744581144395
190.46320534397114
236.0914456224832
207.55854889511605
236.35248407204332
151.7848078885429
224.30028729581866
223.38135905065934
231.02055505009957
246.55984983693654
178.68641037485153
81.9863002621919


Unnamed: 0,sentid,model,likelihood,target_class
0,0,alice,180.5044598701084,alice
1,1,alice,220.15188281698912,alice
2,2,alice,195.18847509750233,alice
3,3,alice,224.90033841242868,alice
4,4,alice,142.72540016937006,alice
5,6,alice,206.1160280760428,alice
6,7,alice,209.32081472792532,alice
7,8,alice,224.675331977782,alice
8,9,alice,232.79057254670863,alice
9,10,alice,165.76192744935642,alice


Printing proportion of matched values across correct_df and my_df

likelihood 0.0
sentid 1.0
model 1.0
target_class 1.0


#### Calculating the prior

In [None]:
def get_prior(data: dict) -> dict:
    """
    Args:
        data: dictionary where keys are the classes, and values are filepaths to the class specific data

    Returns:
        Dictionary with prior probability for each class, which is the number of words in the class divided by the total number of words across all classes. 

    """
        
    pass


In [None]:
dat_dict = {'sherlock': 'data/sample-sherlock.txt',
            'alice': 'data/sample-alice.txt'}

correct_prior = {'sherlock': 0.4423076923076923, 'alice': 0.5576923076923077}
get_prior(dat_dict)

{'sherlock': 0.4423076923076923, 'alice': 0.5576923076923077}

#### Compute posterior

*Hint: Think about why `summarize` function provided is useful*

In [None]:
def get_posterior(models_dict, eval_fpath, class_label, prior_dict):
    """
    Args:
        Dictionary where keys are classes and values are the models trained on the classes

        eval_fpath: the file models should be evaluated on. 

        class_label: the label of the file that models are evaluated on

        prior_dict: prior probabilities of classes

    Returns:
        A Dataframe with the following columns: sentid, model, likelihood, class. 

    If you set eval_fpath to sample_reviews_test_positive.txt, you should get a dataframe that looks like this. (Its ok if you end up having additional columns)

   sentid model     likelihood   class        prior      posterior
       0  positive  -100.975898  positive    -0.736966   -101.712864
       1  positive  -100.941133  positive    -0.736966   -101.678099
       0  negative  -101.938780  positive    -1.321928   -103.260708
       1  negative  -101.938780  positive    -1.321928   -103.260708


    """

    pass

In [None]:
my_df = get_posterior(sample_models, 
                    'data/sample-lookingglass.txt',
                    'alice',
                     get_prior(dat_dict)).reset_index(drop=True)

correct_df = pd.read_csv('predictions/sample-posterior.tsv', sep='\t').reset_index(drop=True)
print('posterior', np.isclose(my_df['posterior'], correct_df['posterior']).sum()/len(my_df)) 



              

posterior 1.0


#### Implement classify

In [None]:
def classify(posterior):
    """
    Args:
        Dataframe with posterior probabilities

    Returns: 
        Dataframe where each sentence id is associated with a prediction. 
    """

    # converts the data from long to wide
    classes = posterior['model'].unique()
    wide_df = posterior.pivot(index=['sentid', 'target_class'],
                              columns=['model'],
                              values='posterior').reset_index()    
    #Finish the rest of the function 

    pass



In [None]:
posterior = get_posterior(sample_models, 
            'data/sample-lookingglass.txt',
            'alice',
             get_prior(dat_dict)).reset_index(drop=True)
my_df = classify(posterior).reset_index(drop=True)

correct_df = pd.read_csv('predictions/sample-classify.tsv', sep='\t').reset_index(drop=True)
print('pred', (my_df['pred']==correct_df['pred']).sum()/len(my_df)) 



Index(['sentid', 'target_class', 'alice', 'sherlock'], dtype='object', name='model')
pred 1.0


#### Compute accuracy

In [None]:


def analyze(models_dict, eval_dict, prior_dict):
    """
    Args:
        models_dict: keys are classes, values are models trained on data from the class. 

        eval_dict: keys are classes, values are fpaths to evaluation data where the correct label is the class associated with the key

        prior_dict: keys are classes, values are prior probabilties of the classes. 

    Returns:
        Float which is the accuracy of the predictions across all classes

    """


    pass

In [None]:
sample_models = {
    'alice': UnigramModel(tokenize = util.nltk_tokenize,
                            tokenizer_kwargs = {},
                            vocab = util.get_vocab('data/glove_vocab.txt'),
                            unk_token = '[UNK]',
                            train_paths = ['data/sample-alice.txt'],
                            smooth = 'add-0.1',
                            lower = True
                           ),
    'sherlock': UnigramModel(tokenize = util.nltk_tokenize,
                            tokenizer_kwargs = {},
                            vocab = util.get_vocab('data/glove_vocab.txt'),
                            unk_token = '[UNK]',
                            train_paths = ['data/sample-sherlock.txt'],
                            smooth = 'add-0.1',
                            lower = True
                           )
}

## for simplicity making eval the same as train
sample_eval = {
    'alice': ['data/sample-alice.txt'],
    'sherlock': ['data/sample-sherlock.txt']
}

sample_prior = get_prior({'sherlock': ['data/sample-sherlock.txt'],
                          'alice': ['data/sample-alice.txt']})

analyze(sample_models, sample_eval, sample_prior)

target_acc = 39.473684210526315


Index(['sentid', 'target_class', 'alice', 'sherlock'], dtype='object', name='model')
Index(['sentid', 'target_class', 'alice', 'sherlock'], dtype='object', name='model')


39.473684210526315

## Part 3: Build Naive Bayesian Sentiment Classifier
Add as many code and markdown chunks as is helpful

## Part 4 (optional): Build Bigram Bayesian Sentiment Classifier
Add as many code and markdown chunks as is helpful