# Naive Bayes

$$P(c | x) = \frac{P(x | c) P(c)}{P(x)}$$

$P(C_k | x_1, x_2, ..., x_n) = \frac{P(x_1, x_2, ..., x_n | C_k) P(C_k)}{P(x_1, x_2, ..., x_n)}$

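Naive Bayes assumes that, given the class $C_k$, each feature $x_i$ is conditionally independent of every other feature $x_j$ ($i \neq j$), so the joint likelihood factorizes: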

$P(C_k | x_1, x_2, ..., x_n) = \frac{P(C_k) \prod_{i=1}^nP(x_i | C_k)}{P(x_1, x_2, ..., x_n)}$

The denominator $P(x_1, x_2, ..., x_n)$ does not depend on $k$, so it can be dropped when comparing classes:

$P(C_k | x_1, x_2, ..., x_n) \propto P(C_k) \prod_{i=1}^nP(x_i | C_k)$

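As before, work with log probabilities: the product becomes a sum, which avoids floating-point underflow when $n$ is large.

$\log P(C_k | x_1, x_2, ..., x_n) = \log P(C_k) + \sum_{i=1}^n \log P(x_i | C_k) + \text{const}$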
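A minimal sketch of why the log form matters. The prior and likelihood values are hypothetical toy numbers chosen only to trigger underflow; they do not come from the review corpus.

```python
import math

# Hypothetical toy values, for illustration only.
prior = 0.5                  # P(C_k)
likelihoods = [1e-5] * 100   # P(x_i | C_k) for 100 features

# Multiplying many small probabilities underflows to 0.0 in double precision.
product = prior
for p in likelihoods:
    product *= p

# Summing logs stays finite and keeps the ordering between classes intact.
log_score = math.log(prior) + sum(math.log(p) for p in likelihoods)

print(product)
print(log_score)
```

Because the logarithm is monotonic, the class with the largest log score is also the class with the largest posterior.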

In [1]:
import string, re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [2]:
# read one sample document
def readFile(data_path):
    with open(data_path, 'r', encoding='utf-8', errors='ignore') as f:
        return [line.strip() for line in f.readlines()]
            

In [3]:
# create a sample corpus
positive_lines = readFile('./review_polarity/txt_sentoken/pos/cv000_29590.txt')
positive_labels = [1 for i in range(len(positive_lines))]
negative_lines = readFile('./review_polarity/txt_sentoken/neg/cv000_29416.txt')
negative_labels = [0 for i in range(len(negative_lines))]

# compile sample documents into a list
doc_list = positive_lines + negative_lines
senti_labels = positive_labels + negative_labels

In [4]:
#doc_list
#senti_labels

In [5]:
# create English stop words list (you can always define your own stopwords)
stop_words = set(stopwords.words('english'))

# Create a WordNetLemmatizer object
lemmatizer = WordNetLemmatizer()

In [6]:
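# Function to strip punctuation & lemmatize verbs and nouns
# (the commented-out line below also removes stop words).
translator = str.maketrans('', '', string.punctuation)  # maps every punctuation character to None
def clean1(doc):
    tokenized = word_tokenize(doc.lower())
    stop_free = [x.translate(translator) for x in tokenized]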
    #stop_free = [x for x in tokenized if not re.fullmatch('[' + string.punctuation + ']+', x) and x not in stop_words]
    lemma_verb = [lemmatizer.lemmatize(word,'v') for word in stop_free]
    lemma_noun = [lemmatizer.lemmatize(word,'n') for word in lemma_verb]
    y = [s for s in lemma_noun if len(s) > 2]
    return ' '.join(y)

POS tag list: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [7]:
from nltk import pos_tag
def clean(doc):
    tokenized = word_tokenize(doc.lower())
    punctuation_free = [x for x in tokenized if not re.fullmatch('[' + string.punctuation + ']+', x)]
    # Apply POS tagging after removing punctuations
    word_posTags = pos_tag(punctuation_free)
    stop_free = [x for x in punctuation_free if x not in stop_words]
    # Apply POS tagging after removing stop words
    # word_posTags = pos_tag(stop_free)
    pos_tags = [t[1] for t in word_posTags]    
    lemma_verb = [lemmatizer.lemmatize(word,'v') for word in stop_free]
    lemma_noun = [lemmatizer.lemmatize(word,'n') for word in lemma_verb]
    y = [s for s in lemma_noun if len(s) > 2]
    # add document length feature (len(y) must be converted to a string first)
    # ' '.join(y) + ' ' + str(len(y))
    # add pos tag features
    return ' '.join(y + pos_tags)

In [8]:
clean('This is a test.')

'test DT VBZ DT NN'

https://nlp.stanford.edu/software/
How would you add other features, such as DependencyParser features?

In [9]:
path_to_jar = '/Users/xiaofengzhu/Downloads/stanford_models/stanford-parser-full-2014-08-27/stanford-parser.jar'
path_to_models_jar = '/Users/xiaofengzhu/Downloads/stanford_models/stanford-parser-full-2014-08-27/stanford-parser-3.4.1-models.jar'

In [10]:
from nltk.parse.stanford import StanfordDependencyParser

dependency_parser = StanfordDependencyParser(path_to_jar=path_to_jar, path_to_models_jar=path_to_models_jar)

result = dependency_parser.raw_parse('I shot an elephant in my pajamas')
dep = next(result)

list(dep.triples())

[(('shot', 'VBD'), 'nsubj', ('I', 'PRP')),
 (('shot', 'VBD'), 'dobj', ('elephant', 'NN')),
 (('elephant', 'NN'), 'det', ('an', 'DT')),
 (('shot', 'VBD'), 'prep', ('in', 'IN')),
 (('in', 'IN'), 'pobj', ('pajamas', 'NNS')),
 (('pajamas', 'NNS'), 'poss', ('my', 'PRP$'))]
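
One possible way to feed dependency information to a bag-of-words model is to encode each triple as a single token and append those tokens to the cleaned document, as `clean` does with POS tags. The sketch below is an assumption rather than the notebook's method; `triples_to_tokens` is a hypothetical helper, and the triples are copied from the output above.

```python
# Triples copied from the parser output above.
triples = [
    (('shot', 'VBD'), 'nsubj', ('I', 'PRP')),
    (('shot', 'VBD'), 'dobj', ('elephant', 'NN')),
    (('elephant', 'NN'), 'det', ('an', 'DT')),
    (('shot', 'VBD'), 'prep', ('in', 'IN')),
    (('in', 'IN'), 'pobj', ('pajamas', 'NNS')),
    (('pajamas', 'NNS'), 'poss', ('my', 'PRP$')),
]

def triples_to_tokens(triples):
    """Encode each (head, relation, dependent) triple as one underscore-joined token."""
    return ['{}_{}_{}'.format(head.lower(), rel, dep.lower())
            for (head, _head_tag), rel, (dep, _dep_tag) in triples]

dep_tokens = triples_to_tokens(triples)
print(' '.join(dep_tokens))
```

Underscores count as word characters in the default `CountVectorizer` token pattern, so each encoded triple stays one feature.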

## Try four choices in the following list
- tokenization: white space vs. treebank-style vs. sentiment-aware style

- stemming / lemmatization

- negation / negation scope

- POS tagging

- dependency parsing

- text length

- pos/neg word ratio (i.e., the ratio of pos to neg words as defined by a quality sentiment lexicon)
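
As an illustration of the negation-scope option, the sketch below appends `_NEG` to every token between a negation word and the next clause-ending punctuation mark. The negation list, the `mark_negation` name, and the example sentence are assumptions made for this sketch; a real experiment would likely need a fuller list.

```python
import re

# Hypothetical, deliberately small negation list.
NEGATIONS = {"not", "no", "never", "n't", "cannot"}
# A token that is a single clause-ending punctuation mark closes the scope.
CLAUSE_END = re.compile(r"^[.,:;!?]$")

def mark_negation(tokens):
    """Append _NEG to tokens that fall inside a negation scope."""
    marked = []
    negated = False
    for tok in tokens:
        if CLAUSE_END.match(tok):
            negated = False          # punctuation ends the scope
            marked.append(tok)
        elif tok.lower() in NEGATIONS:
            negated = True           # scope starts after the negation word
            marked.append(tok)
        else:
            marked.append(tok + "_NEG" if negated else tok)
    return marked

example = "i did not like the plot , but the cast was great .".split()
marked_example = mark_negation(example)
print(marked_example)
```

With this encoding, `like` and `like_NEG` become separate vocabulary entries, so the classifier can learn different weights for them.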

In [11]:
# corpus_clean is a list of cleaned documents (a list of strings)
# each string holds the space-joined lemmas of a document followed by its POS tags
corpus_clean = [clean(doc.strip()) for doc in doc_list]

In [12]:
#corpus_clean

In [13]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
# # if you want to use Pipeline
# from sklearn.pipeline import Pipeline
# nb_clf = Pipeline([('vect', CountVectorizer()),
#                      ('clf', MultinomialNB())])

In [14]:
vectorizer = CountVectorizer()
train_features = vectorizer.fit_transform(corpus_clean)

In [26]:
# # sparse matrix
#train_features

<60x528 sparse matrix of type '<class 'numpy.int64'>'
	with 1328 stored elements in Compressed Sparse Row format>
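
Each row of the matrix holds one document's word counts and each column corresponds to one vocabulary term. The idea can be sketched in plain Python on two hypothetical documents; this is a simplified illustration, not how `CountVectorizer` is implemented, and it skips the tokenization and lowercasing that class applies.

```python
from collections import Counter

# Hypothetical two-document corpus, for illustration only.
docs = ["good movie good cast", "bad movie"]

# Vocabulary: every distinct word, in sorted order (one column per word).
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document: how often each vocabulary word occurs in it.
rows = []
for doc in docs:
    counts = Counter(doc.split())
    rows.append([counts[word] for word in vocab])

print(vocab)
print(rows)
```

Most entries are zero for real corpora, which is why the result above is stored as a sparse matrix.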

In [16]:
nb_clf = MultinomialNB()
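
To connect the classifier back to the formulas at the top, here is a small pure-Python sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, scored in log space. The function names `train_nb` and `predict_nb` and the four toy documents are hypothetical; the sketch only illustrates the estimate, it is not scikit-learn's implementation.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Estimate log P(C_k) and smoothed log P(word | C_k) from tokenized strings."""
    vocab = sorted({w for d in docs for w in d.split()})
    log_prior, log_lik = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d.split())
        total = sum(counts.values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_lik

def predict_nb(doc, log_prior, log_lik):
    """Return the class with the highest log posterior (up to a constant)."""
    scores = {}
    for c in log_prior:
        # Words outside the training vocabulary are ignored.
        scores[c] = log_prior[c] + sum(
            log_lik[c][w] for w in doc.split() if w in log_lik[c])
    return max(scores, key=scores.get)

# Hypothetical toy data: 1 = positive, 0 = negative.
toy_docs = ["great fun great cast", "dull plot", "great plot", "dull dull cast"]
toy_labels = [1, 0, 1, 0]
log_prior, log_lik = train_nb(toy_docs, toy_labels)
print(predict_nb("great cast", log_prior, log_lik))
print(predict_nb("dull plot", log_prior, log_lik))
```

The smoothing term `alpha` keeps a word that never occurs in a class from driving that class's probability to zero.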

In [17]:
# in the past
# nb_clf.fit(train_features, senti_labels)
# predictions = nb_clf.predict(test_features)

In [18]:
# cross validation
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, accuracy_score, make_scorer

In [19]:
# Accumulate true and predicted labels across folds for a pooled classification report
originalclass = []
predictedclass = []

# Make our custom scorer
def classification_report_with_accuracy_score(y_true, y_pred):
    originalclass.extend(y_true)
    predictedclass.extend(y_pred)
    return accuracy_score(y_true, y_pred) # return accuracy score

In [20]:
import numpy as np
# the number of folds you want to set
NUM_FOLDS = 3
results = cross_val_score(nb_clf, train_features, senti_labels, \
          cv=NUM_FOLDS, scoring=make_scorer(classification_report_with_accuracy_score))

# Classification report over the pooled predictions from all folds of the K-fold cross-validation
print(classification_report(originalclass, predictedclass))

             precision    recall  f1-score   support

          0       0.72      0.94      0.81        35
          1       0.86      0.48      0.62        25

avg / total       0.78      0.75      0.73        60
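
The per-class numbers in the report can be reproduced from a confusion matrix. The counts below are inferred from the rounded values and supports in the report (an assumption; the notebook does not print them), and `prf` is a hypothetical helper.

```python
# Counts inferred from the report above, not printed by the notebook.
# Outer key: true class; inner key: predicted class.
confusion = {0: {0: 33, 1: 2}, 1: {0: 13, 1: 12}}

def prf(confusion, cls):
    """Precision, recall and F1 for one class from a nested confusion dict."""
    tp = confusion[cls][cls]
    fp = sum(confusion[other][cls] for other in confusion if other != cls)
    fn = sum(confusion[cls][other] for other in confusion[cls] if other != cls)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

metrics = {cls: tuple(round(v, 2) for v in prf(confusion, cls)) for cls in confusion}
accuracy = (confusion[0][0] + confusion[1][1]) / 60
print(metrics)
print(accuracy)
```

The low recall for class 1 means the classifier labels about half of the positive lines as negative.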



# Word2vec

In [21]:
from gensim.models import Word2Vec
# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]
# train model (gensim < 4.0 API; gensim >= 4.0 renames size to vector_size)
model = Word2Vec(sentences, size=100, window=5, min_count=1, workers=4)
# the constructor has already trained on sentences; this runs 10 more epochs
model.train(sentences, total_examples=len(sentences), epochs=10)
# summarize the trained model
print(model)
# summarize vocabulary (gensim >= 4.0: list(model.wv.key_to_index))
words = list(model.wv.vocab)
print(words)
# access vector for one word (indexing the model directly is deprecated)
print(model.wv['word2vec'])
# save model
model.save('model.bin')

Word2Vec(vocab=14, size=100, alpha=0.025)
['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec', 'second', 'yet', 'another', 'one', 'more', 'and', 'final']
[-1.1493862e-03 -3.5563384e-03 -3.3391684e-03 -5.0757156e-04
  1.1734351e-03  9.1065059e-04  7.9620909e-04  1.9075215e-03
 -2.2747642e-03  3.3968594e-03  3.3861948e-03  3.0182032e-03
 -7.4280753e-05 -3.6991336e-03 -7.9530553e-04 -4.3100165e-03
 -4.5402953e-03 -4.6887724e-03 -2.1508015e-03  2.7396295e-03
  1.4228281e-03 -1.6857482e-03  7.2761270e-04 -4.9594543e-03
  6.2634121e-04  2.5783337e-04  1.3557264e-03  1.7752484e-03
 -8.8586094e-04  7.7218033e-04  4.1525191e-04 -1.9109838e-03
 -3.1143862e-03 -2.5595869e-03 -1.5352821e-03  3.9402780e-04
  1.5057779e-03  3.7732718e-03 -3.6545335e-03  2.2936778e-03
  2.3889232e-03  4.0901573e-03 -4.0119444e-03  4.4486411e-03
  4.8838388e-03 -3.3961335e-04 -1.0031519e-03  2.4855658e-03
 -9.1987365e-04  3.9645368e-03 -1.1648053e-03 -4.6440419e-03
  1.1779242e-03  1.9377724e-03  3.8601507e-0



In [22]:
# load model
new_model = Word2Vec.load('model.bin')
print(new_model)

Word2Vec(vocab=14, size=100, alpha=0.025)


In [23]:
new_model.wv.most_similar(positive='one', topn=2)

[('and', 0.22534842789173126), ('another', 0.17810091376304626)]
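
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal pure-Python sketch of that measure, using hypothetical two-dimensional vectors rather than the trained embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, for illustration only.
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])     # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])   # unrelated directions
print(parallel, orthogonal)
```

With only five short training sentences, the similarities in this section reflect the random initialization more than any real word meaning.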

In [24]:
new_model.wv.similarity(w1='is', w2='sentence')

2.0

In [25]:
# # pip install glove-python
# # https://github.com/maciejkula/glove-python
# from glove import Corpus, Glove
# corpus = Corpus()
# corpus.fit(sentences, window=10)
# glove = Glove(no_components=100, learning_rate=0.05)
# glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
# glove.add_dictionary(corpus.dictionary)
# glove.most_similar('man')
# glove.most_similar('man', number=10)