# Naive Bayes

$P(c | x) = \frac{P(x | c) P(c)}{P(x)}$

$P(C_k | x_1, x_2, ..., x_n) = \frac{P(x_1, x_2, ..., x_n | C_k) P(C_k)}{P(x_1, x_2, ..., x_n)}$

Naive assumption: given class $C_k$, each feature $x_i$ is conditionally independent of $x_j$ when $i \neq j$

$P(C_k | x_1, x_2, ..., x_n) = \frac{P(C_k) \prod_{i=1}^nP(x_i | C_k)}{P(x_1, x_2, ..., x_n)}$

$P(x_1, x_2, ..., x_n)$ is the same for every class $C_k$, so it can be dropped when comparing classes

$P(C_k | x_1, x_2, ..., x_n) \propto P(C_k) \prod_{i=1}^nP(x_i | C_k)$

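Again, apply log probabilities so that the product of many small probabilities does not underflow:

$\log P(C_k | x_1, x_2, ..., x_n) = \log P(C_k) + \sum_{i=1}^n \log P(x_i | C_k) + \text{const}$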
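
The log-space scoring can be sketched in plain Python. This is only an illustration: the helper name `log_posterior_scores` and all probabilities below are made up, not estimated from the review corpus.

```python
import math

def log_posterior_scores(priors, likelihoods, features):
    """Return the unnormalized log posterior of each class.

    priors:      {class: P(C_k)}
    likelihoods: {class: {feature: P(x_i | C_k)}}
    features:    the observed features x_1 .. x_n
    """
    scores = {}
    for c, prior in priors.items():
        # log P(C_k) + sum_i log P(x_i | C_k)
        scores[c] = math.log(prior) + sum(math.log(likelihoods[c][x]) for x in features)
    return scores

# Toy numbers, invented for illustration only
priors = {'pos': 0.5, 'neg': 0.5}
likelihoods = {
    'pos': {'good': 0.08, 'bad': 0.01, 'film': 0.05},
    'neg': {'good': 0.02, 'bad': 0.07, 'film': 0.05},
}
scores = log_posterior_scores(priors, likelihoods, ['good', 'film', 'good'])
predicted = max(scores, key=scores.get)  # class with the highest log score
print(scores, predicted)
```

Because the log is monotonic, the class with the largest log score is also the class with the largest posterior, so the shared denominator never has to be computed.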

Manually split the dataset into training and test sets at a 0.9 / 0.1 ratio

In [1]:
import os, glob
# read a pos/neg dir
# each document is a review
corpus_folder = './review_polarity/txt_sentoken/'
def readDir(senti_folder, pattern, top_doc_num):
    file_list = []
    path_pattern = os.path.join(corpus_folder, senti_folder, pattern + '*.txt')
    all_txt_paths = glob.glob(path_pattern)
    # NOTE: only the first top_doc_num documents in the pos/neg folder are used
    for file_path in all_txt_paths[:top_doc_num]:
        # print(file_path)
        word_List = readFile(file_path)
        # print(word_List)
        file_list.append(word_List)
    return file_list

In [2]:
# read one sample document
def readFile(file_path):
    with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
        tokenized_words = f.read().split()
        return tokenized_words

In [3]:
train_pos_file_list = readDir('pos', 'cv[0-8]', top_doc_num = 90)
train_neg_file_list = readDir('neg', 'cv[0-8]', top_doc_num = 90)
train_pos_labels = [1 for i in range(len(train_pos_file_list))]
train_neg_labels = [0 for i in range(len(train_neg_file_list))]

test_pos_file_list = readDir('pos', 'cv9', top_doc_num = 10)
test_neg_file_list = readDir('neg', 'cv9', top_doc_num = 10)
test_pos_labels = [1 for i in range(len(test_pos_file_list))]
test_neg_labels = [0 for i in range(len(test_neg_file_list))]

train_file_list = train_pos_file_list + train_neg_file_list
test_file_list = test_pos_file_list + test_neg_file_list

train_labels = train_pos_labels + train_neg_labels
test_labels = test_pos_labels + test_neg_labels

In [4]:
# from sklearn.model_selection import train_test_split
# train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.1, random_state = 0)
# print(train_features.shape, test_features.shape)

In [5]:
import string, re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [6]:
# create English stop words list (you can always define your own stopwords)
# stop_words = set(stopwords.words('english'))

In [7]:
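# word presence/absence
def clean(tokenized):
    punctuation_free = [x for x in tokenized if not re.fullmatch('[' + string.punctuation + ']+', x)]
    # NOTE: the de-duplicated set is computed but not returned, so this currently
    # behaves like clean1; return ' '.join(unique_punctuation_free) for true presence/absence
    unique_punctuation_free = set(punctuation_free)
    return ' '.join(punctuation_free)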

In [8]:
# term frequency
def clean1(tokenized):
    punctuation_free = [x for x in tokenized if not re.fullmatch('[' + string.punctuation + ']+', x)]
    return ' '.join(punctuation_free)

POS tag list: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [9]:
from nltk import pos_tag
def clean2(tokenized):
    punctuation_free = [x for x in tokenized if not re.fullmatch('[' + string.punctuation + ']+', x)]
    # tag every word with its Penn Treebank POS tag
    word_posTags = pos_tag(punctuation_free)
    # ASSUMPTION: "adjv" means adjectives (JJ, JJR, JJS) and adverbs (RB, RBR, RBS)
    adjv_words = [word for word, tag in word_posTags if tag.startswith(('JJ', 'RB'))]
    return ' '.join(adjv_words)

In [10]:
# train_file_list_clean / test_file_list_clean are lists of cleaned documents
# each entry is a single string: the document's remaining tokens joined by spaces
train_file_list_clean = [clean(doc) for doc in train_file_list]
test_file_list_clean = [clean(doc) for doc in test_file_list]

In [11]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [12]:
# vectorizer = CountVectorizer(stop_words='english')

In [None]:
# CountVectorizer can build bigram features through its ngram_range parameter
# However, think about how to use it correctly
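
To make the bigram idea concrete, here is a pure-Python sketch of the kind of features that `ngram_range=(1, 2)` asks for: every single token plus every pair of adjacent tokens. The helper `unigrams_and_bigrams` is hypothetical and skips the lowercasing and token filtering that `CountVectorizer` applies by default.

```python
def unigrams_and_bigrams(text):
    """Return the unigrams followed by the space-joined bigrams of a whitespace-tokenized string."""
    tokens = text.split()
    # pair each token with its right-hand neighbour
    bigrams = [' '.join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

features = unigrams_and_bigrams('not a good film')
print(features)
```

One point worth checking before combining bigrams with `stop_words='english'`: if words such as "not" are filtered out before the pairs are formed, a bigram like "not good" can never appear as a feature.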

## sublinear tf-idf = (1 + log(tf)) x idf

See 11_7_tf_idf, "TF with Log Variant"

Or see the TfidfVectorizer documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
Which parameters should you use?
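
The effect of the log variant can be checked with a few lines of plain Python. The sketch below assumes the smoothed idf described in the scikit-learn documentation, $\ln\frac{1 + n}{1 + df} + 1$; the function names and the document counts are illustrative only.

```python
import math

def sublinear_tf(tf):
    """Log-scaled term frequency: 1 + log(tf) for tf > 0, else 0."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

def smoothed_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0

# Raw counts grow linearly, the sublinear weight grows only logarithmically
for tf in (1, 2, 10, 100):
    print(tf, round(sublinear_tf(tf), 3))

# Example weight for a word seen 10 times in a document and present in
# 20 of 180 documents (180 = 90 positive + 90 negative training reviews)
weight = sublinear_tf(10) * smoothed_idf(n_docs=180, doc_freq=20)
print(round(weight, 3))
```

A word repeated 100 times therefore counts only a few times more than a word used once. Note that `TfidfVectorizer` also L2-normalizes each document vector by default, which this sketch leaves out.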

In [13]:
# TF-IDF
vectorizer = TfidfVectorizer(min_df = 2, stop_words='english', sublinear_tf=True)
train_features = vectorizer.fit_transform(train_file_list_clean)
test_features = vectorizer.transform(test_file_list_clean)

In [14]:
# type(test_features)

In [15]:
# tokens = vectorizer.get_feature_names()

In [16]:
# import pandas as pd
# # create a dataframe from a word matrix
# def wm2df(wm, feat_names):
    
#     # create an index for each row
#     doc_names = ['Doc{:d}'.format(idx) for idx, _ in enumerate(wm)]
#     df = pd.DataFrame(data=wm.toarray(), index=doc_names,
#                       columns=feat_names)
#     return(df)

In [17]:
# wm2df(train_features, tokens)

In [18]:
nb_clf = MultinomialNB()

In [19]:
nb_clf.fit(train_features, train_labels)
predictions = nb_clf.predict(test_features)

In [20]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(test_labels, predictions)

In [21]:
print(accuracy)

0.65


In [22]:
# show most informative features

In [23]:
class_labels = nb_clf.classes_
print(class_labels)
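# in the binary case coef_ covers only the positive class
# (coef_ was deprecated for MultinomialNB in scikit-learn 0.24 and is absent from recent releases;
#  feature_log_prob_ holds the per-class log probabilities)
nb_clf.coef_.shape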

[0 1]


(1, 5615)

In [24]:
import numpy as np
def show_most_informative_features(vectorizer, classifier, n=10):
    class_labels = classifier.classes_
    # on scikit-learn < 1.0 use vectorizer.get_feature_names() instead
    feature_names = vectorizer.get_feature_names_out()
    # feature_count_ is the summed TF-IDF weight of each feature per class, so this
    # ranks the heaviest features in each class, not the most discriminative ones
    topn_pos_class = sorted(zip(classifier.feature_count_[1], feature_names), reverse=True)[:n]
    topn_neg_class = sorted(zip(classifier.feature_count_[0], feature_names), reverse=True)[:n]

    print("Important words in positive reviews")
    for weight, feature in topn_pos_class:
        print(class_labels[1], weight, feature)
    print("-----------------------------------------")
    print("Important words in negative reviews")
    for weight, feature in topn_neg_class:
        print(class_labels[0], weight, feature)

In [25]:
show_most_informative_features(vectorizer, nb_clf)


Important words in positive reviews
1 3.485474104928371 film
1 2.7786316707475223 movie
1 2.1538134510442064 good
1 2.115805364453226 like
1 1.940998036559053 story
1 1.9060070367569755 life
1 1.85315053428588 time
1 1.7799379088882974 just
1 1.7351539214143994 does
1 1.7247076555982765 films
-----------------------------------------
Important words in negative reviews
0 3.4191328989873586 film
0 3.3487884884628056 movie
0 2.453029626130787 like
0 2.2465396824527475 just
0 2.1835677841571113 story
0 2.028496773507254 bad
0 1.9553805165118996 time
0 1.889658169350273 character
0 1.8744950379789147 characters
0 1.8152638070975888 going
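
Both lists above are dominated by words such as "film" and "movie" that are heavy in either class. One possible alternative, sketched here with made-up numbers, is to rank features by the difference of their per-class log probabilities. The helper `most_discriminative` is hypothetical and not part of scikit-learn.

```python
def most_discriminative(feature_names, log_prob_neg, log_prob_pos, n=3):
    """Rank features by log P(x | pos) - log P(x | neg).

    Returns (top positive features, top negative features).
    """
    diffs = sorted(
        ((pos - neg, name) for name, neg, pos in zip(feature_names, log_prob_neg, log_prob_pos)),
        reverse=True,
    )
    top_pos = [name for _, name in diffs[:n]]        # largest positive differences
    top_neg = [name for _, name in diffs[::-1][:n]]  # largest negative differences
    return top_pos, top_neg

# Toy log probabilities, invented for illustration only
names = ['film', 'good', 'bad', 'great', 'boring']
log_neg = [-2.0, -4.0, -3.0, -6.0, -3.5]
log_pos = [-2.0, -3.0, -5.0, -3.5, -6.0]
top_pos, top_neg = most_discriminative(names, log_neg, log_pos, n=2)
print(top_pos, top_neg)
```

With the fitted classifier, the two rows of `nb_clf.feature_log_prob_` (index 0 for the negative class, index 1 for the positive class) could be passed in place of the toy lists. A word like "film", equally likely in both classes, then gets a difference of zero and drops out of both rankings.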


### Note: the following tools are easiest to build on macOS or Linux with a C++ toolchain

# GloVe

git clone http://github.com/stanfordnlp/glove

cd glove && make

./demo.sh

python eval/python/evaluate.py

python eval/python/distance.py

python eval/python/word_analogy.py
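
Once vectors have been trained, they can also be inspected from Python. The sketch below assumes the vectors are available as plain text with one `word v1 v2 ...` entry per line; the sample lines are invented, and the helper names `parse_vectors` and `cosine` are hypothetical.

```python
import math

def parse_vectors(lines):
    """Parse 'word v1 v2 ...' lines into a {word: [floats]} dict."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Invented 3-dimensional vectors, for illustration only
sample = ['king 0.5 0.8 0.1', 'queen 0.45 0.85 0.15', 'apple -0.7 0.1 0.6']
vecs = parse_vectors(sample)
print(cosine(vecs['king'], vecs['queen']), cosine(vecs['king'], vecs['apple']))
```

For a real file, something like `parse_vectors(open('vectors.txt', encoding='utf-8'))` should work, where the file name `vectors.txt` is an assumption about where the training script wrote its output.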

# fastText

git clone https://github.com/facebookresearch/fastText.git

cd fastText

mkdir build && cd build && cmake ..

make && make install

./fasttext skipgram -input data.txt -output model

cat queries.txt | ./fasttext print-word-vectors model.bin