# Bayes' Rule

Posterior probability = $\frac{\text{Prior probability} \cdot \text{Likelihood}}{\text{Evidence}}$

or

$P(y|X) = \frac{P(y) \cdot P(X|y)}{P(X)}$
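As a quick numeric check of the rule, here is a minimal sketch with made-up numbers (the spam scenario and all probabilities are assumptions for illustration, not taken from any dataset in these notes):

```python
# Hypothetical numbers: 20% of emails are spam, and the word "free"
# appears in 60% of spam emails and 5% of non-spam emails.
p_spam = 0.2
p_free_given_spam = 0.6
p_free_given_ham = 0.05

# Evidence P(X): total probability of seeing "free" in any email.
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Posterior P(y|X) = P(y) * P(X|y) / P(X)
p_spam_given_free = p_spam * p_free_given_spam / p_free
print(round(p_spam_given_free, 3))
```

Even though only 20% of emails are spam under these assumptions, observing the word raises the posterior well above the prior.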

The whole Naive Bayes classifier can be summed up as:


$y^* = \arg\max_y P(y|X) = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i|y)$

For example, the query "Python Download" is scored as:

$y^* = \arg\max_y P(y) \cdot P(\text{'Python'}|y) \cdot P(\text{'Download'}|y)$, where $y$ is the class label.
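The argmax can be sketched in a few lines of plain Python. The class names, priors, and word likelihoods below are hypothetical values chosen for illustration; logs are summed instead of multiplying probabilities, which is a common way to avoid numeric underflow when there are many words.

```python
import math

# Hypothetical priors and per-class word likelihoods (illustrative values only).
priors = {"programming": 0.5, "zoology": 0.5}
likelihoods = {
    "programming": {"python": 0.05, "download": 0.03},
    "zoology": {"python": 0.02, "download": 0.001},
}

def predict(words):
    """Return the class maximizing log P(y) + sum_i log P(x_i | y)."""
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for word in words:
            score += math.log(likelihoods[label][word])
        scores[label] = score
    return max(scores, key=scores.get)

print(predict(["python", "download"]))
```

This sketch only handles words that appear in the likelihood tables; a real classifier would also need smoothing for unseen words.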

# Two Classic Naive Bayes Variants for Text

## Multinomial Naive Bayes
The multinomial Naïve Bayes model assumes that the data follows a multinomial distribution. Given the set of features that defines a data instance, we assume that the features are generated independently of each other and that each feature can occur multiple times. Counts therefore matter: each feature value is a count or a weighted count, such as word occurrence counts or TF-IDF weights.

Suppose you have a document and you record which words are used in it. That is a bag-of-words model. If you only record whether each word is present or not, each feature follows a Bernoulli distribution, and all the words together follow a multivariate Bernoulli distribution. If instead the number of times a word occurs is important, you track frequencies: in the statement "to be or not to be", the word "to" occurs twice, "be" occurs twice, and "or" and "not" occur once each. To give more importance to rarer words, you can additionally apply term frequency-inverse document frequency (TF-IDF) weighting.
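The count-based view can be sketched with scikit-learn's `CountVectorizer` and `MultinomialNB`. The four-document corpus and its labels are hypothetical, and all variable names are chosen only for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Word counts for the example sentence from the text.
demo_vec = CountVectorizer()
demo_counts = demo_vec.fit_transform(["to be or not to be"]).toarray()[0]
demo_terms = sorted(demo_vec.vocabulary_, key=demo_vec.vocabulary_.get)
print(dict(zip(demo_terms, demo_counts.tolist())))

# Hypothetical toy corpus: 1 = positive, 0 = negative.
docs = [
    "great phone great battery",
    "love this phone",
    "terrible phone bad battery",
    "bad bad screen",
]
labels = [1, 1, 0, 0]

# Fit the multinomial model on raw word counts.
count_vec = CountVectorizer()
nb = MultinomialNB().fit(count_vec.fit_transform(docs), labels)
multinomial_pred = nb.predict(count_vec.transform(["great battery", "bad screen"]))
print(multinomial_pred)
```

Repeated words such as "great" or "bad" contribute more than once to the class statistics, which is the behavior the multinomial assumption describes.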

## Bernoulli Naïve Bayes model
Here, the assumption is that the data follows a multivariate Bernoulli distribution, where each feature is binary: the word is either present or not present. Only the presence of a word is modeled; it does not matter how many times the word occurs. It also does not matter how informative the word is, that is, whether it is a word like "the", which is common in all documents, or a word like "significant", which is less common. When every feature is modeled as binary in this way, the full set of features follows a multivariate Bernoulli model.
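A minimal sketch of the binary view, assuming scikit-learn's `CountVectorizer(binary=True)` for presence/absence features and `BernoulliNB` as the classifier. The toy corpus and labels are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# binary=True records only presence (1) or absence (0) of each word.
binary_vec = CountVectorizer(binary=True)
presence = binary_vec.fit_transform(["to be or not to be"]).toarray()[0]
print(presence.tolist())  # repeated words are still marked only once

# Hypothetical toy corpus: 1 = positive, 0 = negative.
docs = [
    "great phone great battery",
    "love this phone",
    "terrible phone bad battery",
    "bad bad screen",
]
labels = [1, 1, 0, 0]

doc_vec = CountVectorizer(binary=True)
bnb = BernoulliNB().fit(doc_vec.fit_transform(docs), labels)
bernoulli_pred = bnb.predict(doc_vec.transform(["great battery", "bad screen"]))
print(bernoulli_pred)
```

Unlike the multinomial model, the Bernoulli model also uses the absence of vocabulary words as evidence when scoring a document.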

# Support Vector Machines: main info

One of the most critical parameters is C, which controls regularization. Regularization here determines how important it is to label each individual data point correctly, compared with fitting a simpler general model. A larger value of C means less regularization: you fit the training data as closely as possible and give every individual data point a lot of importance, even if that makes the generalization error high. A smaller value of C means more regularization: you tolerate errors on individual data points in exchange for a simpler model, so the generalization error is expected to be lower.

Available kernels include linear, RBF, and polynomial.

For multi-class problems, the strategy is either ovr (one-vs-rest) or ovo (one-vs-one).

class_weight: different classes can be given different weights.
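The effect of C can be sketched with scikit-learn's `SVC` on a tiny, hypothetical dataset. For a linear kernel the margin width is 2 / ||w||, so a smaller weight norm corresponds to a wider, more regularized margin. The data points and variable names are assumptions made for this illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data (two clusters in 2-D).
X = np.array([[0, 0], [0, 1], [1, 0], [2, 2], [2, 3], [3, 2]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# Small C = strong regularization; large C = weak regularization.
strong_reg = SVC(kernel="linear", C=0.01).fit(X, y)
weak_reg = SVC(kernel="linear", C=100.0).fit(X, y)

# For a linear kernel, coef_ holds the weight vector w.
norm_strong = np.linalg.norm(strong_reg.coef_)
norm_weak = np.linalg.norm(weak_reg.coef_)
print("||w|| with C=0.01:", norm_strong)
print("||w|| with C=100: ", norm_weak)

# Other kernels and class weights are set through the same constructor.
rbf_model = SVC(kernel="rbf", C=1.0, class_weight="balanced").fit(X, y)
print(rbf_model.predict([[0.5, 0.5], [2.5, 2.5]]))
```

With the small C the model accepts margin violations in exchange for a smaller weight norm, while the large C pushes toward separating every training point.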

# Using NLTK's NaiveBayesClassifier

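import nltk
from nltk.classify import NaiveBayesClassifier

# train_set is a list of (feature_dict, label) pairs
classifier = NaiveBayesClassifier.train(train_set)

classifier.classify(unlabeled_instance)         # label for one feature dict
classifier.classify_many(unlabeled_instances)   # labels for a list of feature dicts

nltk.classify.util.accuracy(classifier, test_set)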
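The calls above expect each instance as a feature dictionary. A minimal end-to-end sketch, assuming a hypothetical `word_features` helper and a made-up four-sentence training set:

```python
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

def word_features(text):
    """Hypothetical helper: mark each lowercase word in the text as present."""
    return {word: True for word in text.lower().split()}

# Made-up labeled data: (feature_dict, label) pairs.
train_set = [
    (word_features("great phone love it"), "pos"),
    (word_features("excellent battery great screen"), "pos"),
    (word_features("terrible phone hate it"), "neg"),
    (word_features("awful battery terrible screen"), "neg"),
]
test_set = [
    (word_features("great battery"), "pos"),
    (word_features("terrible screen"), "neg"),
]

classifier = NaiveBayesClassifier.train(train_set)
single_label = classifier.classify(word_features("great battery"))
many_labels = classifier.classify_many([features for features, _ in test_set])
print(single_label, many_labels, accuracy(classifier, test_set))
```

The feature extractor is the part you would normally customize; the classifier itself only sees dictionaries of feature names and values.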

# Sentiment Analysis

In [2]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv(r'D:\\Coursera Data Science\\Course 4\\Amazon_Unlocked_Mobile.csv')

In [4]:
df.head()

| | Product Name | Brand Name | Price | Rating | Reviews | Review Votes |
|---|---|---|---|---|---|---|
| 0 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 5 | I feel so LUCKY to have found this used (phone... | 1.0 |
| 1 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 4 | nice phone, nice up grade from my pantach revu... | 0.0 |
| 2 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 5 | Very pleased | 0.0 |
| 3 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 4 | It works good but it goes slow sometimes but i... | 0.0 |
| 4 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 4 | Great phone to replace my lost phone. The only... | 0.0 |


In [5]:
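df.dropna(inplace=True)                  # drop rows with missing values
df = df[df['Rating'] != 3].copy()        # drop neutral ratings; copy avoids SettingWithCopyWarning
df['Positively Rated'] = np.where(df['Rating'] > 3, 1, 0)  # 1 for ratings 4-5, 0 for ratings 1-2
df.head(10)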

| | Product Name | Brand Name | Price | Rating | Reviews | Review Votes | Positively Rated |
|---|---|---|---|---|---|---|---|
| 0 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 5 | I feel so LUCKY to have found this used (phone... | 1.0 | 1 |
| 1 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 4 | nice phone, nice up grade from my pantach revu... | 0.0 | 1 |
| 2 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 5 | Very pleased | 0.0 | 1 |
| 3 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 4 | It works good but it goes slow sometimes but i... | 0.0 | 1 |
| 4 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 4 | Great phone to replace my lost phone. The only... | 0.0 | 1 |
| 5 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 1 | I already had a phone with problems... I know ... | 1.0 | 0 |
| 6 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 2 | The charging port was loose. I got that solder... | 0.0 | 0 |
| 7 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 2 | Phone looks good but wouldn't stay charged, ha... | 0.0 | 0 |
| 8 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 5 | I originally was using the Samsung S2 Galaxy f... | 0.0 | 1 |
| 11 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 5 | This is a great product it came after two days... | 0.0 | 1 |


In [6]:
df['Positively Rated'].mean()

0.7482686025879323

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 308277 entries, 0 to 413839
Data columns (total 7 columns):
Product Name        308277 non-null object
Brand Name          308277 non-null object
Price               308277 non-null float64
Rating              308277 non-null int64
Reviews             308277 non-null object
Review Votes        308277 non-null float64
Positively Rated    308277 non-null int32
dtypes: float64(2), int32(1), int64(1), object(3)
memory usage: 17.6+ MB


In [8]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['Reviews'], df['Positively Rated'], random_state=0)

In [9]:
X_train[0]

"I feel so LUCKY to have found this used (phone to us & not used hard at all), phone on line from someone who upgraded and sold this one. My Son liked his old one that finally fell apart after 2.5+ years and didn't want an upgrade!! Thank you Seller, we really appreciate it & your honesty re: said used phone.I recommend this seller very highly & would but from them again!!"

We need to convert the text into a numeric representation before we can feed it to a model.

## TF-IDF

Term frequency (TF) = $\frac{\text{number of repetitions of a word in a sentence}}{\text{total number of words in the sentence}}$

Inverse document frequency (IDF) = $\log\left(\frac{\text{total number of sentences}}{\text{number of sentences containing that word}}\right)$

After calculating these, TF and IDF are multiplied, and the resulting vector is used as the model input.
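The two formulas can be checked by hand with a short pure-Python sketch. The three tokenized sentences are made up for illustration. Note that scikit-learn's `TfidfVectorizer` uses a smoothed and normalized variant by default, so its numbers will not match this textbook version exactly.

```python
import math

# Hypothetical tokenized sentences.
sentences = [
    "good phone".split(),
    "bad phone".split(),
    "good good battery".split(),
]

def tf(word, sentence):
    """Repetitions of the word divided by the sentence length."""
    return sentence.count(word) / len(sentence)

def idf(word, all_sentences):
    """Log of (number of sentences / number of sentences containing the word)."""
    containing = sum(1 for s in all_sentences if word in s)
    return math.log(len(all_sentences) / containing)

def tfidf(word, sentence, all_sentences):
    return tf(word, sentence) * idf(word, all_sentences)

# "bad" occurs in one sentence only, so it gets a higher weight than "phone".
print(tfidf("bad", sentences[1], sentences))
print(tfidf("phone", sentences[1], sentences))
```

The `idf` helper assumes the word occurs in at least one sentence; an unseen word would cause a division by zero in this simple form.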

## Bag of words
In this technique, a vocabulary is built from the training set, and each sentence is then represented as a vector over that vocabulary. In the binary variant, an entry is one if the word is present in the sentence and zero otherwise; `CountVectorizer`, used below, stores the number of occurrences instead by default. The resulting matrix is sparse because most entries are zero.
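A minimal pure-Python sketch of the binary variant, using two made-up training sentences; the helper name `binary_bow` is hypothetical.

```python
# Hypothetical training sentences.
train_sentences = ["the phone is good", "the battery is bad"]

# Vocabulary: every distinct word seen in training, in sorted order.
vocabulary = sorted({word for s in train_sentences for word in s.split()})

def binary_bow(sentence):
    """1 if the vocabulary word occurs in the sentence, else 0."""
    words = set(sentence.split())
    return [1 if term in words else 0 for term in vocabulary]

print(vocabulary)
print(binary_bow("the phone is bad"))
```

Words outside the training vocabulary are simply ignored, which is also how a fitted vectorizer treats unseen words at prediction time.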

In [10]:
# using bag of words
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer().fit(X_train)

In [11]:
# In scikit-learn >= 1.0, get_feature_names() is replaced by get_feature_names_out()
vec.get_feature_names()[::2000]

['00',
 '4less',
 'adr6275',
 'assignment',
 'blazingly',
 'cassettes',
 'condishion',
 'debi',
 'dollarsshipping',
 'esteem',
 'flashy',
 'gorila',
 'human',
 'irullu',
 'like',
 'microsaudered',
 'nightmarish',
 'p770',
 'poori',
 'quirky',
 'responseive',
 'send',
 'sos',
 'synch',
 'trace',
 'utiles',
 'withstanding']

In [12]:
len(vec.get_feature_names())

53216

In [13]:
X_train_vec = vec.transform(X_train)
X_train_vec

<231207x53216 sparse matrix of type '<class 'numpy.int64'>'
	with 6117776 stored elements in Compressed Sparse Row format>

In [14]:
from sklearn.linear_model import LogisticRegression

md = LogisticRegression()
md.fit(X_train_vec, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [15]:
from sklearn.metrics import roc_auc_score

y_pred = md.predict(vec.transform(X_test))
roc_auc_score(y_test, y_pred)

0.92686860143504235
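The score above is computed from hard 0/1 predictions. AUC is more commonly computed from continuous scores, which for a fitted `LogisticRegression` are available from `predict_proba(...)[:, 1]` or `decision_function(...)`. The sketch below uses made-up labels and scores, not the review data, to show that the two inputs can give different values.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and model scores (e.g. predicted probabilities).
y_true = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.3, 0.45, 0.4, 0.7, 0.9]

# Thresholding at 0.5 throws away the ranking information in the scores.
hard_labels = [1 if s >= 0.5 else 0 for s in scores]

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, hard_labels)
print(auc_from_scores, auc_from_labels)
```

Which variant is appropriate depends on what you want to measure; the score-based version evaluates how well the model ranks positives above negatives.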

In [16]:
feature_names = np.array(vec.get_feature_names())
sorted_coe_idx = md.coef_[0].argsort()

print('Negative\n', feature_names[sorted_coe_idx[:10]])
print('Positive\n', feature_names[sorted_coe_idx[:-11:-1]])

Negative
 ['worst' 'false' 'worthless' 'junk' 'mony' 'garbage' 'useless' 'messing'
 'unusable' 'blacklist']
Positive
 ['excelent' 'excelente' 'exelente' 'excellent' 'loving' 'loves' 'efficient'
 'perfecto' 'amazing' 'lovely']
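The slicing used to pick the most negative and most positive words can be seen on a small, hypothetical coefficient vector:

```python
import numpy as np

# Hypothetical feature names and coefficients.
names = np.array(["bad", "ok", "great", "awful", "good"])
coefs = np.array([-1.5, 0.1, 2.0, -2.2, 1.2])

# argsort returns indices that order the coefficients from smallest to largest.
order = coefs.argsort()

most_negative = names[order[:2]]      # first two: smallest coefficients
most_positive = names[order[:-3:-1]]  # walk backwards from the end: two largest
print(most_negative, most_positive)
```

The slice `[:-11:-1]` in the cell above works the same way, taking the ten largest coefficients in descending order.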


In [17]:
# now using tfidf

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(min_df=5).fit(X_train)  # min_df=5 ignores terms that appear in fewer than 5 documents
len(vect.get_feature_names())

17951
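The effect of `min_df` on vocabulary size can be sketched on three made-up documents; the variable names are chosen only for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus.
toy_docs = ["good phone", "good battery", "bad phone"]

all_terms = TfidfVectorizer().fit(toy_docs)
frequent_terms = TfidfVectorizer(min_df=2).fit(toy_docs)  # keep terms found in >= 2 documents

print(sorted(all_terms.vocabulary_))
print(sorted(frequent_terms.vocabulary_))
```

Terms that occur in a single document ("bad", "battery") are dropped, which is the same mechanism that shrinks the review vocabulary when `min_df=5` is used.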

In [19]:
X_train_vect = vect.transform(X_train)

md2 = LogisticRegression()
md2.fit(X_train_vect, y_train)
y_pred = md2.predict(vect.transform(X_test))
roc_auc_score(y_test, y_pred)

0.92661006667468371

In [20]:
feature_names = np.array(vect.get_feature_names())
sorted_coe_idx = md2.coef_[0].argsort()

print('Negative\n', feature_names[sorted_coe_idx[:10]])
print('Positive\n', feature_names[sorted_coe_idx[:-11:-1]])

Negative
 ['not' 'worst' 'useless' 'disappointed' 'terrible' 'return' 'waste' 'poor'
 'horrible' 'doesn']
Positive
 ['love' 'great' 'excellent' 'perfect' 'amazing' 'awesome' 'perfectly'
 'easy' 'best' 'loves']


In [21]:
# Bag of words approach

print(md.predict(vec.transform(['not an issure, phone is working', 'an issue, phone is not working'])))

[0 0]


In [22]:
# TF-IDF approach

print(md2.predict(vect.transform(['not an issure, phone is working', 'an issue, phone is not working'])))

[0 0]


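Both models classify both sentences as negative, because single-word features ignore word order and context. To capture some context, we add n-grams (here, word pairs) through the `ngram_range` parameter.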
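To see what n-gram features look like, here is a minimal pure-Python sketch that lists the unigrams and bigrams of a token sequence. The helper `ngrams` is hypothetical and only mirrors the idea behind `ngram_range=(1, 2)`.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous word sequences of length n_min..n_max."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

print(ngrams("phone is not working".split()))
```

With bigrams, "not working" becomes its own feature, so the model can weight it differently from "not" and "working" taken separately.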

In [23]:
vec2 = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)
X_train_vectorized = vec2.transform(X_train)

len(vec2.get_feature_names())

198917

In [24]:
md3 = LogisticRegression()
md3.fit(X_train_vectorized, y_train)
y_pred = md3.predict(vec2.transform(X_test))
roc_auc_score(y_test, y_pred)

0.96712638794243788

In [25]:
feature_names = np.array(vec2.get_feature_names())
sorted_coe_idx = md3.coef_[0].argsort()

print('Negative\n', feature_names[sorted_coe_idx[:10]])
print('Positive\n', feature_names[sorted_coe_idx[:-11:-1]])

Negative
 ['no good' 'worst' 'junk' 'not good' 'not happy' 'horrible' 'garbage'
 'terrible' 'looks ok' 'nope']
Positive
 ['not bad' 'excelent' 'excelente' 'excellent' 'perfect' 'no problems'
 'exelente' 'awesome' 'no issues' 'great']


In [26]:
print(md3.predict(vec2.transform(['not an issure, phone is working', 'an issue, phone is not working'])))

[0 0]
