# CSC 620 - HW 6 - Text Classification Using Naïve Bayes
### Mark Kim

This assignment implements a text classifier using the Naïve Bayes method with
Laplace (add-one) smoothing.  The smoothed estimate of each conditional
probability is:
$$ \hat{P}(w|c) = \frac{\operatorname{count}(w,c) + 1}{\operatorname{count}(c) + \lvert
V \rvert}, $$
where $\operatorname{count}(c)$ is the total number of word tokens in class $c$
and $\lvert V \rvert$ is the size of the vocabulary.  Once all the conditional
probabilities are calculated, a document with words $X$ is assigned the class
with the highest score:
$$ C_{NB} = \underset{c\in C}{\operatorname{argmax}}\ P(c)\prod_{x\in X}
P(x|c) $$
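
As a small worked illustration of the smoothed estimate, the sketch below applies the add-one formula to made-up counts for a single class. The toy words and counts are assumptions for illustration only and are not taken from the Yelp data.

```python
from collections import Counter

# Hypothetical counts of each vocabulary word within the positive class.
pos_counts = Counter({"great": 3, "food": 2, "bad": 0})
vocab_size = len(pos_counts)          # |V| = 3
total_pos = sum(pos_counts.values())  # count(c) = 5 word tokens

def smoothed(word):
    """Add-one estimate: (count(w, c) + 1) / (count(c) + |V|)."""
    return (pos_counts[word] + 1) / (total_pos + vocab_size)

probs = {w: smoothed(w) for w in pos_counts}
print(probs)  # {'great': 0.5, 'food': 0.375, 'bad': 0.125}
```

Note that the word with a zero count still receives a nonzero probability, and the smoothed estimates over the vocabulary still sum to 1.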

This project uses a data set of 50 restaurant reviews pulled from Yelp.  Of
these, 40 reviews were used for the training set and 10 for the test set.

In [1]:
from nltk import word_tokenize, RegexpTokenizer, download

from numpy import array, savetxt

from csv import writer

import pandas as pd

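from typing import Literal

# Type alias for a sentiment label: 0 (negative) or 1 (positive).
# Note: `0|1` would evaluate to the integer 1 (bitwise OR), not a type.
Sentiment = Literal[0, 1]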

#### Helper Functions

These functions initialize a word's per-class counts in the dictionary and
increment a word's count for a given sentiment.

In [2]:
def insert_dict( x: str, d: dict, s: Sentiment ):
    d[x][s] = d.get(x)[s]+1

def init_dict( x: str, d: dict ):
    d[x] = {0: 0, 1: 0}

#### Class Probability Function

This function calculates the log score of a class for a text/document: the log
prior of the class plus each word's log-likelihood multiplied by the number of
times the word occurs.  To avoid numerical underflow from multiplying many
small fractions, the calculation sums log-likelihoods rather than multiplying
probabilities.

In [3]:
def cl_probability( vocab: dict, cl: str, p_model: pd.DataFrame, c_model: pd.DataFrame ):
    # Start from the log prior of the class
    prob = [p_model.loc[cl, 'probability']]
    for word in vocab:
        # Add the word's log-likelihood once per occurrence in the document
        prob.append(c_model.loc[(word, cl), 'probability'] * vocab[word])
    return sum(prob)
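
To make the scoring step concrete, here is a minimal sketch of the same calculation on a toy model. It uses plain dictionaries instead of the dataframes above, and the names `log_prior`, `log_cond`, and `log_score` as well as all probabilities are hypothetical.

```python
from math import log

# Hypothetical toy model: log prior per class and log P(w | c) per (word, class).
log_prior = {"negative": log(0.5), "positive": log(0.5)}
log_cond = {
    ("great", "positive"): log(0.5), ("great", "negative"): log(0.125),
    ("bad", "positive"): log(0.125), ("bad", "negative"): log(0.5),
}

def log_score(word_counts, cls):
    """log P(c) + sum over words of count(w) * log P(w | c)."""
    return log_prior[cls] + sum(
        n * log_cond[(w, cls)] for w, n in word_counts.items()
    )

doc = {"great": 2, "bad": 1}  # word -> number of occurrences in the document
scores = {c: log_score(doc, c) for c in log_prior}
best = max(scores, key=scores.get)
print(best)  # positive
```

Here the positive score is log(1/64) and the negative score is log(1/256), so the positive class wins.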

#### Classification Function

This function receives a text string, a prior probability table, and a
conditional probability table.  It tokenizes the text, discards any words that
are not contained in the model, and counts the occurrences of the remaining
words.  Using these counts, it calls the `cl_probability()` function to compute
the log score of the negative and positive classes and returns 1 if the
positive score is higher and 0 otherwise.

In [4]:
def classify( text: str, p_model: pd.DataFrame, c_model: pd.DataFrame ):
    tokenizer = RegexpTokenizer(r'\w+')
    t_arr = tokenizer.tokenize(text.lower())
    # Words known to the model (first index level), collected once for fast lookup
    known = set(c_model.reset_index().loc[:, 'level_0'].values)
    vocab = {}
    for word in t_arr:
        # Skip words that never appeared in the training data
        if word not in known:
            continue
        vocab[word] = vocab.get(word, 0) + 1

    p_neg = cl_probability( vocab, 'negative', p_model, c_model)
    p_pos = cl_probability( vocab, 'positive', p_model, c_model)
    # Ties are resolved in favor of the negative class
    return 1 if p_neg < p_pos else 0
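
The tokenize, filter, and count steps can be sketched on their own with the standard library. This is only an illustration: `re.findall(r"\w+", ...)` stands in for `RegexpTokenizer(r'\w+')` on the assumption that it splits text the same way, and `count_known_words` and the `known` set are hypothetical.

```python
import re
from collections import Counter

def count_known_words(text, known_words):
    """Lowercase and tokenize the text, then count only words the model knows."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(t for t in tokens if t in known_words)

known = {"great", "food", "service"}  # hypothetical model vocabulary
counts = count_known_words("Great food, GREAT prices!", known)
print(counts)  # Counter({'great': 2, 'food': 1})
```

The word "prices" is dropped because it is not in the hypothetical vocabulary, which mirrors how unseen words are ignored at classification time.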

#### Create Training Vocabulary Function

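This function takes a dataframe of labeled, tokenized reviews and returns a
vocabulary that maps each word to its count in negative (0) and positive (1)
reviews, along with the total number of word tokens in each class.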

In [5]:
def mk_train_vocab(df):
    wc_pos = 0
    wc_neg = 0
    df_neg = df[df['sentiment'] == 0]
    df_pos = df[df['sentiment'] == 1]
    vocab = {}
    for p in df['tok_text']:
        for q in p:
            init_dict(q, vocab)
    for p in df_neg['tok_text']:
        for q in p:
            insert_dict(q, vocab, 0)
            wc_neg += 1
    for p in df_pos['tok_text']:
        for q in p:
            insert_dict(q, vocab, 1)
            wc_pos += 1
    return vocab, wc_pos, wc_neg
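
To show the shape of the counts this step produces, the sketch below builds the same kind of structure from a toy list of tokenized, labeled documents. The documents and the names `docs`, `counts`, and `totals` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical tokenized reviews paired with a sentiment label (0 or 1).
docs = [(["great", "food"], 1), (["bad", "food"], 0), (["great"], 1)]

counts = defaultdict(lambda: {0: 0, 1: 0})  # word -> count per class
totals = {0: 0, 1: 0}                       # total word tokens per class
for tokens, label in docs:
    for tok in tokens:
        counts[tok][label] += 1
        totals[label] += 1

print(dict(counts))
print(totals)  # {0: 2, 1: 3}
```

Every word has an entry for both classes, so a word that never appears in one class has a count of 0 there, which is what the smoothing step later relies on.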

#### Read Training Data

Here, we read in the training data and compute the prior probability of each
class from the label counts.  Each review is then lowercased and tokenized to
prepare the training data for counting words per class.

In [6]:
train_df = pd.read_csv("./train.csv", names=['text', 'sentiment'])
c_prior = train_df['sentiment'].value_counts()
p_prior = c_prior/(c_prior[0] + c_prior[1])
p_prior = p_prior.rename({0: 'negative', 1: 'positive'})
p_prior = list(zip(p_prior.index, p_prior))

tokenizer = RegexpTokenizer(r'\w+')
train_df['tok_text'] = train_df.apply(lambda row: tokenizer.tokenize(row['text'].lower()), axis=1)
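
As a quick illustration of the prior calculation, the sketch below computes class priors from a short, made-up list of labels using only the standard library. The labels and the names `names`, `priors`, and `log_priors` are hypothetical.

```python
from collections import Counter
from math import log

labels = [1, 0, 1, 1]  # hypothetical sentiment labels
names = {0: "negative", 1: "positive"}

# Prior of a class = number of documents in the class / total documents.
label_counts = Counter(labels)
priors = {names[c]: n / len(labels) for c, n in label_counts.items()}
log_priors = {k: log(v) for k, v in priors.items()}
print(priors)  # {'positive': 0.75, 'negative': 0.25}
```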

Build the vocabulary and per-class word counts from the training data.

In [7]:
vocab, wc_pos, wc_neg = mk_train_vocab(train_df)

#### Create Arrays for Positive and Negative Reviews

This produces two arrays, one per class, that pair each word in the training
vocabulary with its conditional probability given that class.  Laplace
smoothing is applied to the probabilities.

In [8]:
from math import log

pos_array = []
neg_array = []
for word in vocab:
    pos_array.append([word, 'positive', (vocab.get(word)[1]+1)/(wc_pos+len(vocab))])
    neg_array.append([word, 'negative', (vocab.get(word)[0]+1)/(wc_neg+len(vocab))])
maxp = ['', '', 0]
minp = ['', '', 1]

#### Log-likelihood

Because multiplying many small fractions can underflow, I took the logarithm of
each probability.  Working in log space turns the products in the classifier
into sums, and a word that occurs several times contributes its log-probability
multiplied by its count.

In [9]:
log_p_prior = list(map(lambda x: [x[0], log(x[1])], p_prior))
logneg_array = list(map(lambda x: [x[0], x[1], log(x[2])],neg_array))
logpos_array = list(map(lambda x: [x[0], x[1], log(x[2])],pos_array))
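
The underflow concern can be demonstrated directly. In this small sketch, the probability `p` and the count `n` are arbitrary values chosen for illustration.

```python
from math import log, prod

p = 1e-5  # an arbitrary small word probability
n = 100   # number of such factors in the product

product = prod([p] * n)  # true value is 1e-500, too small for a float
log_sum = n * log(p)     # the same quantity in log space

print(product)  # 0.0
print(log_sum)  # about -1151.29
```

The direct product collapses to 0.0, which would make every class score indistinguishable, while the log-space sum remains an ordinary finite number.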

#### Write Model to CSV

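As specified, the model is written to a CSV file.  The log priors are written
under a `PP` header line and the log conditional probabilities under an `LP`
header line.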

In [10]:
from os import remove, path

filename = './model.csv'

if path.exists(filename):
    remove(filename)

with open(filename, 'a') as f:
    wr = writer(f, delimiter=',')
    f.write('PP\n')
    wr.writerows(log_p_prior)
    f.write('\nLP\n')
    wr.writerows(logneg_array)
    wr.writerows(logpos_array)
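
For reference, a file in this two-section layout could also be parsed explicitly with the `csv` module. The sketch below is one possible approach, not the method used in this notebook; `parse_model` and the sample values are hypothetical.

```python
import csv
import io

# Hypothetical model file contents in the PP / LP layout.
sample = (
    "PP\nnegative,-0.69\npositive,-0.69\n\n"
    "LP\ngreat,negative,-2.1\ngreat,positive,-0.7\n"
)

def parse_model(lines):
    """Split the file into log priors (PP) and log conditionals (LP)."""
    priors, cond = {}, {}
    section = None
    for row in csv.reader(lines):
        if not row:                    # blank separator line
            continue
        if row in (["PP"], ["LP"]):    # section header
            section = row[0]
        elif section == "PP":
            priors[row[0]] = float(row[1])
        elif section == "LP":
            cond[(row[0], row[1])] = float(row[2])
    return priors, cond

priors, cond = parse_model(io.StringIO(sample))
print(priors)
print(cond)
```

Parsing by section header avoids depending on a fixed number of rows before the `LP` section.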

### Test Model

#### Read Trained Model

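To test a model, the trained model file is read in and pre-processed.  This
relies on pandas using the extra leading fields of each row as the index,
because every data row has more fields than its one-column header (`PP` or
`LP`).  As a result, the prior table is indexed by class and the conditional
table is indexed by (word, class).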

In [11]:
p_df = pd.read_csv(filename, skiprows=0, nrows=2)
c_df = pd.read_csv(filename, skiprows=4)
p_df = p_df.rename(index={0: 'negative', 1: 'positive'}, columns={'PP': 'probability'})
c_df = c_df.rename(columns={'LP': 'probability'})

#### Read Test Data

We read in the test data, classify each review, and collect the predictions in a list.

In [12]:
test_df = pd.read_csv("./test.csv", names=['text', 'sentiment'])

predict = []
for text in test_df['text']:
    predict.append(classify( text, p_df, c_df ))

#### Post-Processing

Add the predictions to the existing test dataframe and convert the numeric labels to `negative`/`positive`.

In [13]:
test_df['prediction'] = predict
test_df['sentiment'] = test_df.apply(lambda row: 'positive' if row['sentiment'] == 1 else 'negative', axis=1)
test_df['prediction'] = test_df.apply(lambda row: 'positive' if row['prediction'] == 1 else 'negative', axis=1)
test_df = test_df.rename(columns={'sentiment': 'actual'})

#### Output Results to CSV File

In [14]:
test_df.to_csv('./test_predictions.csv')
test_df

| | text | actual | prediction |
|---|---|---|---|
| 0 | Server did a great job handling our large rowd... | positive | positive |
| 1 | Would come back again if I had a sushi craving... | positive | negative |
| 2 | He deserves 5 stars. | positive | positive |
| 3 | My boyfriend and I came here for the first tim... | positive | positive |
| 4 | They have great dinners. | positive | positive |
| 5 | Not my thing. | negative | negative |
| 6 | If you are reading this please don't go there. | negative | negative |
| 7 | Tonight I had the Elk Filet special...and it s... | negative | negative |
| 8 | We ordered some old classics and some new dish... | negative | positive |
| 9 | A FLY was in my apple juice.. A FLY!!!!!!!! | negative | negative |
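
The predictions above can be summarized with a simple accuracy calculation. The sketch below copies the (actual, prediction) pairs from the output by hand; the names `results` and `accuracy` are introduced here for illustration.

```python
# (actual, prediction) pairs copied from rows 0-9 of the output above.
results = [
    ("positive", "positive"), ("positive", "negative"),
    ("positive", "positive"), ("positive", "positive"),
    ("positive", "positive"), ("negative", "negative"),
    ("negative", "negative"), ("negative", "negative"),
    ("negative", "positive"), ("negative", "negative"),
]

correct = sum(actual == predicted for actual, predicted in results)
accuracy = correct / len(results)
print(accuracy)  # 0.8
```

Rows 1 and 8 are the two misclassified reviews, one in each direction.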
