# Movie Reviews Sentiment Analysis with Naive Bayes


## Goal

In this project, I will build a multinomial Naive Bayes classifier to predict whether a movie review is positive or negative.  As part of the project, I will also implement *k*-fold cross validation.


## Discussion

### Naive bayes classifiers for text documents

A text classifier predicts to which class, $c$, an unknown document $d$ belongs. In our case, the predictions are binary: $c=0$ for negative movie review and $c=1$ for positive movie review. We can think about classification mathematically as picking the most likely class:

$$
c^*= \underset{c}{argmax} ~P(c|d)
$$

We can replace $P(c|d)$, using Bayes' theorem:

$$
P(c | d) = \frac{P(c)P(d|c)}{P(d)}
$$

to get the formula 

$$
c^*= \underset{c}{argmax} ~\frac{P(c)P(d|c)}{P(d)}
$$

Since $P(d)$ is a constant for any given document $d$, we can use the following equivalent but simpler formula:

$$
c^*= \underset{c}{argmax} ~ P(c)P(d|c)
$$

Training then consists of estimating $P(c)$ and $P(d|c)$, which we will get to shortly.
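To see why dropping $P(d)$ is safe, here is a minimal sketch with made-up numbers (the priors and likelihoods below are assumptions chosen only for illustration). Dividing every score by the same constant rescales them but cannot change which class wins:

```python
# Toy numbers, assumed purely for illustration.
prior = {0: 0.5, 1: 0.5}            # P(c)
likelihood = {0: 0.002, 1: 0.006}   # P(d|c) for one fixed document d

# Unnormalized scores P(c) * P(d|c)
scores = {c: prior[c] * likelihood[c] for c in prior}

# P(d) is the same for every class, so dividing by it rescales all scores equally.
p_d = sum(scores.values())
posteriors = {c: s / p_d for c, s in scores.items()}

best_unnormalized = max(scores, key=scores.get)
best_posterior = max(posteriors, key=posteriors.get)
print(best_unnormalized, best_posterior)
```

Both selections pick the same class; only the normalized values sum to 1.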

### Representing documents

Text classification requires a representation for document $d$. When loading a document, we first load the text and then tokenize the words, stripping away punctuation and stop words like *the*. The list of words is a fine representation for a document except that each document has a different length, which makes training most models problematic as they assume tabular data with a fixed number of features.  

The simplest and most common approach is to: 
1. Create an overall vocabulary, $V$, as the set of unique words across all documents in all classes. 
2. Sort the unique words in the vocabulary alphabetically so that each word always maps to the same word vector index. The training features are those words.
3. Represent each document as a binary word vector, where a 1 in a column indicates __that the word is present in the document__. Something like this:

In [1]:
import pandas as pd
df = pd.DataFrame(data=[[1,1,0,0],
                        [0,0,1,1]], columns=['cat','food','hong','kong'])
df

|   | cat | food | hong | kong |
|---|-----|------|------|------|
| 0 | 1 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 | 1 |


This tends to work well for very short strings/documents, such as article titles or tweets. For longer documents, using a binary presence or absence loses information. Instead, it's better to __count the number of times each word is present__. For example, here are 3 documents and resulting word vectors:

In [3]:
d1 = "cats food cats cats"
d2 = "hong kong hong kong"
d3 = "cats in hong kong"  # assume we strip stop words like "in"
df = pd.DataFrame(data=[[3,1,0,0],
                        [0,0,2,2],
                        [1,0,1,1]],
                  columns=['cat','food','hong','kong'])
df

|   | cat | food | hong | kong |
|---|-----|------|------|------|
| 0 | 3 | 1 | 0 | 0 |
| 1 | 0 | 0 | 2 | 2 |
| 2 | 1 | 0 | 1 | 1 |


These fixed-length word vectors are how most models expect data, including sklearn's implementation. (`GaussianNB` assumes Gaussian distributions for its probability estimates, whereas we are assuming multinomial, but we can still shove our data in.)  Here's how to train a Naive Bayes model with sklearn using the trivial/toy `df` data and get the training set accuracy:

In [4]:
from sklearn.naive_bayes import GaussianNB
import numpy as np

X = df.values
y = [0, 1, 1] # assume document classes
sknb = GaussianNB()
sknb.fit(X, y)
y_pred = sknb.predict(X)
print(f"Correct = {np.sum(y==y_pred)} / {len(y)} = {100*np.sum(y==y_pred) / len(y):.1f}%")

Correct = 3 / 3 = 100.0%


Because it is convenient to keep word vectors in a 2D matrix and it is what sklearn likes, I will use the same representation in this project. 

Functions used are:
- Given the directory name, the function `load_docs()` will return a list of word lists where each word list is the raw list of tokens, typically with repeated words. 
- Then,  function `vocab()` will create the combined vocabulary as a mapping from word to  word feature index, starting with index 1.  Index 0 is reserved for unknown words.  Vocabulary $V$ should be a `defaultintdict()`, so that unknown words get mapped to value/index 0. 
- Finally, function `vectorize_docs()` will convert the word lists to a 2D matrix of word counts, one row per document:

```
neg = load_docs(neg_dir)
pos = load_docs(pos_dir)
V = vocab(neg,pos)
vneg = vectorize_docs(neg, V)
vpos = vectorize_docs(pos, V)
```

The `defaultintdict` class behaves exactly like `defaultdict(int)` except `d['foo']` does **not** add `'foo'` to dictionary `d`. (Booo for that default behavior in `defaultdict`!)

```
class defaultintdict(dict):
    def __init__(self): # Create dictionary of ints
        self._factory=int
        super().__init__()

    def __missing__(self, key):
        "Override default behavior so missing returns 0"
        return 0
```
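The difference from `defaultdict(int)` is easy to check. The sketch below repeats a stripped-down version of the class so that it runs on its own. It also shows one detail worth knowing: `dict.get()` does not go through `__missing__`, so `get()` on a missing key still returns `None`.

```python
from collections import defaultdict

class defaultintdict(dict):
    """Stripped-down copy of the class above: missing keys read as 0 but are not stored."""
    def __missing__(self, key):
        return 0

plain = defaultdict(int)
_ = plain['foo']            # the lookup inserts 'foo' with value 0
quiet = defaultintdict()
value = quiet['foo']        # the lookup returns 0 and inserts nothing

print(len(plain), len(quiet), value)
print(quiet.get('foo'))     # dict.get() bypasses __missing__
```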

### Estimating probabilities

To train our model, we need to estimate $P(c)$ and $P(d|c)$ for all classes and documents. Estimating $P(c)$ is easy: it's just the number of documents in class $c$ divided by the total number of documents. To estimate $P(d | c)$, Naive Bayes assumes that each word is *conditionally independent*, given the class, meaning:

$$
P(d | c) = \prod_{w \in d} P(w | c)
$$

so that gives us:

$$
c^*= \underset{c}{argmax} ~ P(c) \prod_{w \in d} P(w | c)
$$

where $w$ ranges over every word occurrence in $d$, not just the unique words, so the product includes $P(w|c)$ 5 times if $w$ appears 5 times in $d$.

Because we are going to use word counts, not binary word vectors, in fixed-length vectors, we need to include $P(w|c)$ explicitly multiple times for repeated $w$ in $d$:

$$
c^*= \underset{c}{argmax} ~ P(c) \prod_{w \in V} P(w | c)^{n_w(d)}
$$

where $n_w(d)$ is the number of times $w$ occurs in $d$, $V$ is the overall vocabulary (set of unique words from all documents); $n_w(d)=0$ when $w$ isn't present in $d$.

Now we have to figure out how to estimate $P(w | c)$, the probability of seeing word $w$ given that we're looking at a document from class $c$. That's just the number of times $w$ appears in all documents from class $c$ divided by the total number of words (including repeats) in all documents from class $c$:

$$P(w | c) = \frac{wordcount(w,c)}{wordcount(c)}$$
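As a sketch of these two estimates, the snippet below reuses the toy count matrix and the assumed labels `y = [0, 1, 1]` from earlier. The helper name `estimate` is hypothetical, and no smoothing is applied yet:

```python
# Columns are the toy vocabulary: cat, food, hong, kong
X = [[3, 1, 0, 0],
     [0, 0, 2, 2],
     [1, 0, 1, 1]]
y = [0, 1, 1]   # assumed document classes

def estimate(X, y, c):
    """Return (P(c), [P(w|c) for each vocabulary word]) without smoothing."""
    rows = [row for row, label in zip(X, y) if label == c]
    prior = len(rows) / len(X)                      # docs in c / all docs
    word_counts = [sum(col) for col in zip(*rows)]  # wordcount(w, c) per word
    total = sum(word_counts)                        # wordcount(c)
    return prior, [n / total for n in word_counts]

prior0, p_w0 = estimate(X, y, 0)
prior1, p_w1 = estimate(X, y, 1)
print(prior0, p_w0)
print(prior1, p_w1)
```

Class 0 never sees *hong* or *kong*, so those estimates come out as exactly 0.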

### Making predictions

Once we have the appropriate parameter estimates, we have a model that can make predictions in an ideal setting:

$$
c^*= \underset{c}{argmax} ~ P(c) \prod_{w \in V} P(w | c)^{n_w(d)}
$$

#### Avoiding $P(w|c)=0$

If word $w$ does not exist in class $c$ (but is in $V$), then the overall product goes to 0 (and when we take the log below, the classifier would try to evaluate $log(0)$, which is undefined).  To solve the problem, we use *Laplace smoothing*, which just means adding 1 to each word count when computing $P(w|c)$ and compensating by adding $|V|$ to the denominator (adding 1 for each vocabulary word):

$$P(w | c) = \frac{wordcount(w,c) + 1}{wordcount(c) + |V|}$$

where $|V|$ is the size of the vocabulary for all documents in all classes.  Adding this to the denominator keeps $P(w|c)$ a probability. This way, even if $wordcount(w,c)$ is 0, the ratio is greater than 0.  (Note: Each doc's word vector has length $|V|$. During training, any $w$ not found in docs of $c$ will have word count 0, so summing the counts gives just the total number of words in $c$. However, once we add 1 to every count, $c$ looks like it has every word in $V$.  Hence, we must add $|V|$, not $|V_c|$, to the denominator. If $w$ is not in any doc of class $c$, then $P(w|c)=1/(wordcount(c)+|V|)$, which is a very low probability.)
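A small sketch of the effect, using assumed toy counts for a single class (one document containing *cat* three times and *food* once):

```python
# Assumed toy counts for one class c: wordcount(w, c)
counts = {'cat': 3, 'food': 1, 'hong': 0, 'kong': 0}
total = sum(counts.values())   # wordcount(c) = 4
V_size = len(counts)           # |V| = 4

unsmoothed = {w: n / total for w, n in counts.items()}
smoothed = {w: (n + 1) / (total + V_size) for w, n in counts.items()}

print(unsmoothed)   # 'hong' and 'kong' are 0, which would zero out the product
print(smoothed)     # every word now has a nonzero probability
print(sum(smoothed.values()))
```

The smoothed values still sum to 1 because the numerators gain one count per vocabulary word and the denominator gains $|V|$.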

####  Dealing with missing words

Laplace smoothing deals with $w$ that are in the vocabulary $V$ but that are not in a class, hence, giving $wordcount(w,c)=0$ for some $c$. There's one last slightly different problem. If a future unknown document contains a word not in $V$ (i.e., not in the training data), then what should $wordcount(w,c)$ be?  Probably not 0 because that would mean we had data indicating it does not appear in class $c$ when we have *no* training data on it.

To be strictly correct and keep $P(w | c)$ a probability in the presence of unknown words, all we have to do is add 1 to the denominator in addition to the Laplace smoothing changes:

$$P(w | c) = \frac{wordcount(w,c) + 1}{wordcount(c) + |V| + 1}$$

We are lumping all unknown words into a single "wildcard" word that exists in every $V$. That has the effect of increasing the overall vocabulary size for class $c$ to include room for an unknown word (and all unknown words map to that same spot). In this way, an unknown word gets probability:

$$P(unknown|c) = \frac{0 + 1}{wordcount(c) + |V| + 1}$$

In the end, this is no big deal: every class gets nearly the same nudge for the unknown word (the denominators differ only through $wordcount(c)$), so classification is rarely affected.
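As a quick arithmetic check of the unknown-word formula, take assumed toy counts for one class: a training vocabulary of 4 known words with counts 3, 1, 0, and 0, so $wordcount(c)=4$ and $|V|=4$. The wildcard slot then gets

$$
P(unknown|c) = \frac{0 + 1}{4 + 4 + 1} = \frac{1}{9}
$$

and the four known words together get

$$
\sum_{w \in V} P(w|c) = \frac{(3+1)+(1+1)+(0+1)+(0+1)}{4 + 4 + 1} = \frac{8}{9}
$$

so the smoothed estimates, including the wildcard, still sum to 1.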

To deal with unknown words in the project, we can reserve word index 0 to mean unknown word. All words in the training vocabulary start at index 1. So, if we normally have $|V|$ words in the training vocabulary, we will now have $|V|+1$; no word will ever have 0 word count. Each word vector will be of length $|V|+1$.  

In the code, the `vocab()` function folds the wildcard into the vocabulary: it returns a mapping with $|uniquewords|+1$ entries, and from here on $|V|$ refers to that size.  Once computed, the size of the vocabulary should never change; all word vectors are size $|V|$.

With this new adjusted estimate of $P(w|c)$, we can simplify the overall prediction problem to use $w \in V$:

$$
c^*= \underset{c}{argmax} ~ P(c) \prod_{w \in V} P(w | c)^{n_w(d)}
$$

That means we can use dot products for prediction, which is faster than iterating in python through unique document words.

#### Floating point underflow

The next problem involves the limitations of floating-point arithmetic in computers. Because the probabilities are less than one and there could be tens of thousands multiplied together, we risk floating-point underflow. That just means that eventually the product will attenuate to zero and our classifier is useless.  The solution is simply to take the log of the right hand side because log is monotonic and won't affect the $argmax$:

$$
c^*= \underset{c}{argmax} \left \{ log(P(c)) + \sum_{w \in V} log(P(w | c)^{n_w(d)}) \right \}
$$

Or,

$$
c^* = \underset{c}{argmax} \left \{ log(P(c)) + \sum_{w \in V} n_w(d) \times log(P(w | c)) \right \}
$$
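The underflow is easy to reproduce. The sketch below assumes a word probability of 0.001 and a document with 200 word occurrences (both values are made up for illustration). The raw product collapses to exactly 0.0, while the sum of logs stays an ordinary finite number:

```python
import math

p = 0.001   # assumed small word probability P(w|c)
n = 200     # assumed number of word occurrences in a document

product = 1.0
for _ in range(n):
    product *= p          # mathematically 1e-600, far below the smallest positive double

log_sum = n * math.log(p)  # the same quantity in log space

print(product)   # underflows to 0.0
print(log_sum)   # finite and still usable for argmax comparisons
```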

#### Speed issues

For large data sets, Python loops often are too slow and so we have to rely on vector operations, which are implemented in C. For example, the `predict(X)` method receives a 2D matrix of word vectors and must make a prediction for each one. The temptation is to write the very readable:

```
y_pred = []
for each row d in X:
    y_pred.append(prediction for d)
return y_pred
```

But, we can use the built-in `numpy` functions such as `np.dot` (same as the `@` operator) and apply functions across vectors. For example, if I have a vector, $v$, and I'd like the log of each value, don't write a loop. Use `np.log(v)`, which will give us a vector with the results.

My `predict()` method consists primarily of a matrix-vector multiplication per class followed by `argmax`. My implementation is, oddly, twice as fast as sklearn and appears to be more accurate for 4-fold cross validation.
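To illustrate the loop versus the vectorized form, here is a minimal sketch on the toy count matrix. The priors and smoothed word probabilities are assumptions derived from that toy data, and the function names `predict_loop` and `predict_vectorized` are hypothetical rather than part of the project API:

```python
import numpy as np

X = np.array([[3, 1, 0, 0],
              [0, 0, 2, 2],
              [1, 0, 1, 1]], dtype=float)

# Assumed toy parameters: log P(c) and log P(w|c), one row per class.
log_prior = np.log(np.array([1/3, 2/3]))
log_P_w = np.log(np.array([[4/8, 2/8, 1/8, 1/8],
                           [2/11, 1/11, 4/11, 4/11]]))

def predict_loop(X):
    """Readable but slow: one Python-level iteration per document and per word."""
    y_pred = []
    for d in X:
        scores = [log_prior[c] + sum(n * lp for n, lp in zip(d, log_P_w[c]))
                  for c in range(2)]
        y_pred.append(int(np.argmax(scores)))
    return np.array(y_pred)

def predict_vectorized(X):
    """Same computation as one matrix product: scores has shape (ndocs, nclasses)."""
    scores = log_prior + X @ log_P_w.T
    return np.argmax(scores, axis=1)

print(predict_loop(X), predict_vectorized(X))
```

Both functions compute $log(P(c)) + \sum_{w} n_w(d) \times log(P(w|c))$ for each class; the vectorized version pushes the inner loops into NumPy.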

## Deliverables

To submit your project, ensure that your `bayes.py` file is submitted to your repository. That file must be in the root of your `bayes`-*userid* repository.  It should not have a main program; it should consist of a collection of functions. You must implement the following functions:

* `load_docs(docs_dirname)`
* `vocab(neg, pos)`
* `vectorize(V, docwords)`
* `vectorize_docs(docs, V)`
* `kfold_CV(model, X, y, k=4)` (You must implement manually; don't use sklearn's version)

and implement class `NaiveBayes621` with these methods

* `fit(self, X, y)`
* `predict(self, X)`



## 1. Load data
Gather a labeled dataset containing text samples with corresponding sentiment labels (e.g., positive, negative, neutral).

In [2]:
from typing import Sequence
import sys
import re
import string
import os
import numpy as np
import codecs
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import KFold

from bayes import ENGLISH_STOP_WORDS

In [6]:
class defaultintdict(dict):
    """
    Behaves exactly like defaultdict(int) except d['foo'] does NOT
    add 'foo' to dictionary d. (Booo for that default behavior in
    defaultdict!)
    """
    def __init__(self):
        self._factory=int
        super().__init__()

    def __missing__(self, key):
        return 0


def filelist(root) -> Sequence[str]:
    """Return a fully-qualified list of filenames under root directory; sort names alphabetically."""
    allfiles = []
    for path, subdirs, files in os.walk(root):
        for name in files:
            allfiles.append(os.path.join(path, name))
    return sorted(allfiles)


def get_text(filename:str) -> str:
    """
    Load and return the text of a text file, assuming latin-1 encoding as that
    is what the BBC corpus uses.  Use codecs.open() function not open().
    """
    # The context manager closes the file even if read() raises.
    with codecs.open(filename, encoding='latin-1', mode='r') as f:
        return f.read()

def words(text:str) -> Sequence[str]:
    """
    Given a string, return a list of words normalized as follows.
    Split the string to make words first by using regex compile() function
    and string.punctuation + '0-9\\r\\t\\n]' to replace all those
    char with a space character.
    Split on space to get word list.
    Ignore words < 3 char long.
    Lowercase all words
    Remove English stop words
    """
    ctrl_chars = '\x00-\x1f'
    regex = re.compile(r'[' + ctrl_chars + string.punctuation + '0-9\r\t\n]')
    nopunct = regex.sub(" ", text)  # delete stuff but leave at least a space to avoid clumping together
    words = nopunct.split(" ")
    words = [w for w in words if len(w) > 2]  # ignore a, an, to, at, be, ...
    words = [w.lower() for w in words]
    words = [w for w in words if w not in ENGLISH_STOP_WORDS]
    return words
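The normalization steps in `words()` can be sketched in a self-contained form. This variant is an illustration, not the project code: it uses a tiny stand-in stop word set instead of `ENGLISH_STOP_WORDS`, a hypothetical function name `tokenize`, and `re.escape()` so that punctuation characters such as `]` and `\` are safe inside the character class.

```python
import re
import string

STOP_WORDS = {'the', 'and', 'for'}   # tiny stand-in for ENGLISH_STOP_WORDS (assumption)

# Punctuation, digits, whitespace, and control characters all act as separators.
SEPARATORS = re.compile('[' + re.escape(string.punctuation) + r'0-9\s\x00-\x1f]')

def tokenize(text):
    cleaned = SEPARATORS.sub(' ', text)                          # replace separators with spaces
    tokens = [w.lower() for w in cleaned.split() if len(w) > 2]  # drop words under 3 chars
    return [w for w in tokens if w not in STOP_WORDS]            # drop stop words

tokens = tokenize("The cat's food, in Hong-Kong 1997!")
print(tokens)
```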


def load_docs(docs_dirname:str) -> Sequence[Sequence]:
    """
    Load all .txt files under docs_dirname and return a list of word lists, one per doc.
    Ignore empty and non ".txt" files.
    """
    docs = []
    for filename in filelist(docs_dirname):
        if not filename.endswith('.txt'):
            continue  # ignore non-".txt" files, as the docstring promises
        text = get_text(filename)
        if not text.strip():
            continue  # ignore empty files
        docs.append(words(text))
    return docs

In [7]:
neg_dir = 'review_polarity/txt_sentoken/neg' 
pos_dir = 'review_polarity/txt_sentoken/pos'

neg = load_docs(neg_dir) 
pos = load_docs(pos_dir)

assert len(neg) == 1000
assert len(pos) == 1000
print(neg[3]) # print the word list of one document labeled 'negative'

['quest', 'camelot', 'warner', 'bros', 'feature', 'length', 'fully', 'animated', 'attempt', 'steal', 'clout', 'disney', 'cartoon', 'empire', 'mouse', 'reason', 'worried', 'recent', 'challenger', 'throne', 'fall', 'promising', 'flawed', 'century', 'fox', 'production', 'anastasia', 'disney', 'hercules', 'lively', 'cast', 'colorful', 'palate', 'beat', 'hands', 'came', 'time', 'crown', 'best', 'piece', 'animation', 'year', 'contest', 'quest', 'camelot', 'pretty', 'dead', 'arrival', 'magic', 'kingdom', 'mediocre', 'pocahontas', 'keeping', 'score', 'isn', 'nearly', 'dull', 'story', 'revolves', 'adventures', 'free', 'spirited', 'kayley', 'voiced', 'jessalyn', 'gilsig', 'early', 'teen', 'daughter', 'belated', 'knight', 'king', 'arthur', 'round', 'table', 'kayley', 'dream', 'follow', 'father', 'footsteps', 'gets', 'chance', 'evil', 'warlord', 'ruber', 'gary', 'oldman', 'round', 'table', 'member', 'gone', 'bad', 'steals', 'arthur', 'magical', 'sword', 'excalibur', 'accidentally', 'loses', 'dange

## 2. Create a vocabulary from the entire dataset

In [8]:
def vocab(neg:Sequence[Sequence], pos:Sequence[Sequence]) -> dict:
    """
    Given neg and pos lists of word lists, construct a mapping from word to word index.
    Use index 0 to mean unknown word, '__unknown__'. The real words start from index one.
    The words should be sorted so the first vocabulary word is index one.
    The length of the dictionary is |uniquewords|+1 because of "unknown word".
    |V| is the length of the vocabulary including the unknown word slot.

    Sort the unique words in the vocab alphabetically so we standardize which
    word is associated with which word vector index.

    E.g., given neg = [['hi']] and pos=[['mom']], return:

    V = {'__unknown__':0, 'hi':1, 'mom':2}

    and so |V| is 3
    """
    V_neg = {w for doc in neg for w in doc}   # unique words across negative docs
    V_pos = {w for doc in pos for w in doc}   # unique words across positive docs
    allwords = sorted(V_neg | V_pos)   # a list of all words except 'unknown'
    allwords.insert(0, '__unknown__') # insert 'unknown' at the beginning of the list

    # convert the list to dictionary
    V = defaultintdict()
    for ind, word in enumerate(allwords):
        V[word] = ind
    return V

In [9]:
V = vocab(neg,pos)
print(len(V))

allwords = np.array([*V.keys()])
print(allwords)

print(V['__unknown__'])

38373
['__unknown__' 'aaa' 'aaaaaaaaah' ... 'zwigoff' 'zycie' 'zzzzzzz']
0


In [10]:
rs = np.random.RandomState(42) # get same list back each time
idx = rs.randint(0,len(V),size=100)
allwords = np.array([*V.keys()])
subset = allwords[idx]
print(sorted(subset))

['acceptable', 'accompanies', 'alek', 'allows', 'amistad', 'amnesia', 'anti', 'armored', 'arty', 'atrophied', 'authentically', 'barbecue', 'bastille', 'battles', 'beatrice', 'bedtimes', 'bolt', 'bombarded', 'braun', 'breathed', 'cavern', 'characers', 'charms', 'cimino', 'comely', 'compensating', 'contentious', 'delayed', 'deliveree', 'denise', 'dependant', 'deuce', 'disintegrated', 'doom', 'embarassed', 'enterprises', 'entrepreneur', 'eurocentrism', 'examinations', 'existing', 'exposure', 'fahdlan', 'fer', 'flirts', 'franken', 'gait', 'gloat', 'goal', 'groaning', 'groundbreaking', 'homeworld', 'hovertank', 'independance', 'inputs', 'instinctively', 'invincibility', 'kermit', 'lanai', 'lava', 'lavender', 'libidinous', 'locating', 'meshes', 'metamorphoses', 'moff', 'moribund', 'mortal', 'neptune', 'observatory', 'onstage', 'orbiting', 'overemotional', 'overly', 'paradise', 'paramedic', 'parent', 'paz', 'portion', 'prays', 'pseudonym', 'psycholically', 'quinland', 'redcoats', 'robo', 'sac

## 3. Represent each text sample as a vector of word frequencies

In [11]:
def vectorize(V:dict, docwords:Sequence) -> np.ndarray:
    """
    Return a row vector (based upon V) for docwords. The first element of the
    returned vector is the count of unknown words. So |V| is |uniquewords|+1.
    """
    row_vector = np.zeros(len(V))
    for word in docwords:
        word_index = V.get(word)
        if word_index:
            row_vector[word_index] += 1
        else:     # if the word is unknown
            row_vector[0] += 1
    return row_vector
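A toy run makes the index-0 convention concrete. The sketch below uses plain dicts and lists, with hypothetical helper names `build_vocab` and `count_vector` that mirror `vocab()` and `vectorize()` without depending on the corpus:

```python
def build_vocab(*doc_groups):
    """Map each unique word to an index starting at 1; index 0 is the unknown slot."""
    unique = sorted({w for docs in doc_groups for doc in docs for w in doc})
    return {w: i for i, w in enumerate(['__unknown__'] + unique)}

def count_vector(V, docwords):
    """Count words per vocabulary index; words missing from V are counted at index 0."""
    row = [0] * len(V)
    for w in docwords:
        row[V.get(w, 0)] += 1
    return row

V_toy = build_vocab([['cat', 'food', 'cat']], [['hong', 'kong']])
row = count_vector(V_toy, ['cat', 'dog', 'cat', 'kong'])   # 'dog' was never seen in training
print(V_toy)
print(row)
```

The unseen word *dog* lands in position 0, and the vector length stays fixed at `len(V_toy)`.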

In [12]:
def vectorize_docs(docs:Sequence, V:dict) -> np.ndarray:
    """
    Return a matrix where each row represents a document's word vector.
    Each column represents a single word feature. There are |V| columns,
    where |V| = len(V) already includes the unknown-word slot in position 0.
    Invoke vectorize(V, docwords) to vectorize each doc for each row of matrix
    :param docs: list of word lists, one per doc
    :param V: Mapping from word to index; e.g., first word -> index 1
    :return: numpy 2D matrix with word counts per doc: ndocs x nwords
    """
    D = np.zeros((len(docs), len(V)))
    for i, doc in enumerate(docs):
        D[i] = vectorize(V, doc)  # one row of word counts per document
    return D

In [13]:
vneg = vectorize_docs(neg, V)
vpos = vectorize_docs(pos, V)

In [14]:
vneg

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

## 4. Construct the training data

In [15]:
# Construct the training data
X = np.vstack([vneg, vpos])
y = np.vstack([np.zeros(shape=(len(vneg), 1)),
               np.ones(shape=(len(vpos), 1))]).reshape(-1)

In [16]:
# test *vectorize* function
d1 = vectorize(V, words("mostly very funny , the story is quite appealing."))
d2 = vectorize(V, words("there is already a candidate for the worst of 1997."))

In [17]:
p = len(V)
assert len(d1)==p, f"d1 should be 1x{p} but is 1x{len(d1)}"
assert len(d2)==p, f"d2 should be 1x{p} but is 1x{len(d2)}"
d1_idx = np.nonzero(d1)[0]  # np.nonzero returns a tuple; take the index array
d2_idx = np.nonzero(d2)[0]
true_d1_idx = np.array([ 1367, 13337, 26872, 32570 ])
true_d2_idx = np.array([ 4676, 37932 ])
assert (d1_idx==true_d1_idx).all(), f"{d1_idx} should be {true_d1_idx}"
assert (d2_idx==true_d2_idx).all(), f"{d2_idx} should be {true_d2_idx}"

## 5. Implement and test Naive Bayes classifier

In [18]:
class NaiveBayes621:
    """
    This object behaves like a sklearn model with fit(X,y) and predict(X) functions.
    Limited to two classes, 0 and 1 in the y target.
    """
    def fit(self, X:np.ndarray, y:np.ndarray) -> None:
        """
        Given 2D word vector matrix X, one row per document, and 1D binary vector y
        train a Naive Bayes classifier assuming a multinomial distribution for
        p(w|c), the probability of word w given class c. p(w|c) is estimated by
        the number of times w occurs in all documents of class c divided by the
        total words in class c. p(c) is estimated by the number of documents
        in c divided by the total number of documents.

        The first column of X is a column of zeros to represent missing vocab words.
        """
        self.V_length = X.shape[1]  # |V|, the size of the vocabulary for all documents in all classes

        # P(c)
        self.P_0 = (y == 0).sum() / len(y)  # P(0)
        self.P_1 = (y == 1).sum() / len(y)  # P(1)

        # P(w|c)  = (wordcount(w, c) + 1) / (wordcount(c) + |V| + 1)
        self.P_w_0 = (X[y == 0].sum(axis=0) + 1) / (X[y == 0].sum() + self.V_length + 1)
        self.P_w_1 = (X[y == 1].sum(axis=0) + 1) / (X[y == 1].sum() + self.V_length + 1)
        
    def predict(self, X:np.ndarray) -> np.ndarray:
        """
        Given 2D word vector matrix X, one row per document, return binary vector
        indicating class 0 or 1 for each row of X.
        """
        # Per-class score: log P(c) + sum_w n_w(d) * log P(w|c), computed for
        # all rows at once as a matrix-vector product.
        log_post_0 = np.log(self.P_0) + X @ np.log(self.P_w_0)
        log_post_1 = np.log(self.P_1) + X @ np.log(self.P_w_1)
        return 1 - (log_post_0 > log_post_1)

In [19]:
model = NaiveBayes621()
model.fit(X, y)
y_pred = model.predict(X)
accuracy = np.sum(y==y_pred) / len(y)
# print(f"training accuracy {accuracy}")
print(accuracy)
print(f"Correct = {np.sum(y==y_pred)} / {len(y)} = {100*accuracy:.1f}%")

0.9725
Correct = 1945 / 2000 = 97.2%


## 6. Implement and test k-fold cross validation

In [20]:
def kfold_CV(model, X:np.ndarray, y:np.ndarray, k=4) -> np.ndarray:
    """
    Run k-fold cross validation using model and 2D word vector matrix X and binary
    y class vector. Return a 1D numpy vector of length k with the accuracies, the
    ratios of correctly-identified documents to the total number of documents. You
    can use KFold from sklearn to get the splits but must loop through the splits
    with a loop to implement the cross-fold testing.  Pass random_state=999 to KFold
    so we always get same sequence (wrong in practice) so student eval unit tests
    are consistent. Shuffle the elements before walking the folds.
    """
    accuracies = []
    kf = KFold(n_splits=k, shuffle=True, random_state=999)
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        accuracy = np.sum(y_test==y_pred) / len(y_test)
        accuracies.append(accuracy)
    return np.array(accuracies)
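If the splits themselves also have to be produced by hand, a minimal sketch could look like the following. The function name `kfold_indices` is hypothetical, and its shuffle will not reproduce the folds of `KFold(random_state=999)`, so accuracies would differ from the reference numbers below:

```python
import random

def kfold_indices(n, k=4, seed=999):
    """Shuffle indices 0..n-1 and return k (train, test) index-list pairs."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)   # seeded so the folds are repeatable
    # The first n % k folds get one extra element so every index is used once.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        folds.append((train, test))
        start += size
    return folds

folds = kfold_indices(10, k=4)
for train, test in folds:
    print(len(train), len(test))
```

Each index appears in exactly one test fold, and the index lists can be used directly to select rows of a NumPy array, as in `X[train]`.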

In [21]:
# Test my kfold_CV, so just use sklearn model
sklearn_accuracies = kfold_CV(MultinomialNB(), X, y, k=4)
true_sklearn_accuracies = np.array([0.798, 0.78, 0.812, 0.808])

sklearn_avg = np.mean(sklearn_accuracies)
true_avg = np.mean(true_sklearn_accuracies)

print(f"kfold {sklearn_accuracies} vs true {true_sklearn_accuracies}")
print(np.abs(sklearn_avg-true_avg))

kfold [0.798 0.78  0.812 0.808] vs true [0.798 0.78  0.812 0.808]
0.0


In [22]:
# Compare NaiveBayes621 vs sklearn MultinomialNB() using my kfold_CV
accuracies = kfold_CV(NaiveBayes621(), X, y, k=4)
sklearn_accuracies = kfold_CV(MultinomialNB(), X, y, k=4)

our_avg = np.mean(accuracies)
sklearn_avg = np.mean(sklearn_accuracies)

print(f"sklearn kfold {sklearn_accuracies} vs my {accuracies}")
print(np.abs(our_avg-sklearn_avg))

sklearn kfold [0.798 0.78  0.812 0.808] vs my [0.798 0.78  0.812 0.808]
0.0
