## Sentiment Analysis (Cheese and Butter review)

In [28]:
import numpy as np
import pandas as pd
import sklearn

df = pd.read_csv('butter_cheese_review_ie.csv')
# the review text is in the first column (Review); the class label is in the second (Brand)
df.head()

Unnamed: 0,Review,Brand,Category
0,"I didn’t get the flavor I was expecting, espec...",Irish,Butter
1,Kerrygold is not a dairy in Ireland. It is jus...,Irish,Butter
2,This is an excellent butter for eating but ter...,Irish,Butter
3,I purchased an 8 oz at the local Kroger for 3....,Irish,Butter
4,And I'm picky about the dairy I use. save your...,Irish,Butter


In [29]:
df.drop(columns=["Category"], inplace=True)

In [30]:
df.tail()

Unnamed: 0,Review,Brand
45,simply great better,Dutch
46,Making cookies produced a great addition to th...,Dutch
47,Love it,Dutch
48,Great butter. This was the first time I had bo...,Dutch
49,"This is great butter, its smooth to the taste,...",Dutch


In [31]:
# convert label to a numerical variable
df['Brand'] = df.Brand.map({'Irish':0, 'Dutch':1})
df

Unnamed: 0,Review,Brand
0,"I didn’t get the flavor I was expecting, espec...",0
1,Kerrygold is not a dairy in Ireland. It is jus...,0
2,This is an excellent butter for eating but ter...,0
3,I purchased an 8 oz at the local Kroger for 3....,0
4,And I'm picky about the dairy I use. save your...,0
5,Drastically over priced,0
6,I don't know why this was so pricey. It's just...,0
7,"I've read that it's only 90% grass fed, but ""f...",0
8,Creamy smooth good texture nice butter,0
9,this butter taste's creamer than the butter yo...,0


In [33]:
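# df.values mixes strings and integers, so it returns an object-dtype array
numpy_array = df.values
X = numpy_array[:,0]   # review text
Y = numpy_array[:,1]   # brand label (0 = Irish, 1 = Dutch)
Y = Y.astype('int64')  # cast the labels from object dtype to integers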
print("X")
print(X)
print("Y")
print(Y)

X
['I didn’t get the flavor I was expecting, especially for the price. I wanted to love this so much.'
 "Kerrygold is not a dairy in Ireland. It is just an umbrella marketing brandname for many different dairies. In Europe this is not considered a premium brand, is very ordinary. It has a yellower color than American butters, I'll give it that. And But the taste is very, very disappointing to me. Many common brands and even the kind of butter served in cafeterias taste much better to me. I won't be purchasing it again."
 'This is an excellent butter for eating but terrible for baking. It makes my cookies fall apart and leaves a distinct salty taste. I will use it for table but never again for baking.'
 'I purchased an 8 oz at the local Kroger for 3.99 I also purchased a pound of the Kroger brand butter well what a disappointment in the Kerrygold pure Irish butter Id much rather use the Kroger brand.'
 "And I'm picky about the dairy I use. save your money this is good but there is cheap

#### Vectorisation of words

In [34]:
# create an instance of the CountVectorizer class
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()

`fit()` learns the vocabulary of the corpus: it maps each unique word in the documents to a column index, as shown below. (From Dr. Muhammad's notes)

In [35]:
vec.fit(X)
vec.vocabulary_

{'didn': 114,
 'get': 188,
 'the': 416,
 'flavor': 169,
 'was': 448,
 'expecting': 149,
 'especially': 139,
 'for': 176,
 'price': 324,
 'wanted': 447,
 'to': 426,
 'love': 256,
 'this': 422,
 'so': 383,
 'much': 278,
 'kerrygold': 231,
 'is': 223,
 'not': 286,
 'dairy': 105,
 'in': 216,
 'ireland': 221,
 'it': 225,
 'just': 228,
 'an': 23,
 'umbrella': 436,
 'marketing': 267,
 'brandname': 56,
 'many': 265,
 'different': 115,
 'dairies': 104,
 'europe': 141,
 'considered': 90,
 'premium': 322,
 'brand': 55,
 'very': 445,
 'ordinary': 300,
 'has': 199,
 'yellower': 475,
 'color': 83,
 'than': 413,
 'american': 21,
 'butters': 64,
 'll': 249,
 'give': 189,
 'that': 415,
 'and': 25,
 'but': 62,
 'taste': 406,
 'disappointing': 118,
 'me': 269,
 'common': 85,
 'brands': 57,
 'even': 143,
 'kind': 233,
 'of': 289,
 'butter': 63,
 'served': 371,
 'cafeterias': 69,
 'better': 49,
 'won': 468,
 'be': 43,
 'purchasing': 333,
 'again': 13,
 'excellent': 146,
 'eating': 134,
 'terrible': 411,
 '

`CountVectorizer()` has converted the documents into a set of unique words, sorted alphabetically and indexed. (From Dr. Muhammad's notes)
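The vocabulary above contains fragments such as `didn` and `ll`. A minimal pure-Python sketch of the vocabulary-building step may help explain why. It assumes scikit-learn's default settings (lowercasing, and a token pattern that keeps only runs of two or more word characters); `build_vocabulary` and `toy_docs` are hypothetical names used for illustration and are not part of the notebook.

```python
import re

# Assumed to mirror CountVectorizer's default token pattern:
# runs of two or more word characters, so "I" and the "t" in "didn't" are dropped.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def build_vocabulary(docs):
    """Return a word -> column index mapping, with words sorted alphabetically."""
    tokens = set()
    for doc in docs:
        tokens.update(TOKEN_PATTERN.findall(doc.lower()))
    return {token: index for index, token in enumerate(sorted(tokens))}


# Hypothetical two-document corpus.
toy_docs = ["I didn't love it", "I'll love this butter"]
vocab = build_vocabulary(toy_docs)
print(vocab)
```

The apostrophe is not a word character, so "didn't" splits into `didn` and `t`, and the single-character `t` is discarded.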

#### Removing Stop Words

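Stop words are very common words (such as "the", "is" and "was") that carry little information about the class. They can be removed by passing the parameter `stop_words='english'` when instantiating `CountVectorizer()`.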
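As a small illustration before applying this to the reviews, the sketch below compares the vocabulary learned with and without the built-in English stop-word list. The two sentences in `toy_docs` are made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus.
toy_docs = ["This is an excellent butter", "The price was terrible"]

# Vocabulary with the default settings (no stop-word filtering).
full_vocab = sorted(CountVectorizer().fit(toy_docs).vocabulary_)

# Vocabulary after removing scikit-learn's built-in English stop words.
filtered_vocab = sorted(CountVectorizer(stop_words="english").fit(toy_docs).vocabulary_)

# Words that were dropped by the stop-word filter.
removed = sorted(set(full_vocab) - set(filtered_vocab))

print("kept:   ", filtered_vocab)
print("removed:", removed)
```

Only the content-bearing words remain, which is the same effect that shrinks the review vocabulary in the next cell.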

In [36]:
# re-create the vectorizer with scikit-learn's built-in English stop-word list
vec = CountVectorizer(stop_words='english')
vec.fit(X)
vec.vocabulary_

{'didn': 87,
 'flavor': 131,
 'expecting': 114,
 'especially': 108,
 'price': 243,
 'wanted': 343,
 'love': 196,
 'kerrygold': 173,
 'dairy': 78,
 'ireland': 167,
 'just': 171,
 'umbrella': 335,
 'marketing': 205,
 'brandname': 34,
 'different': 88,
 'dairies': 77,
 'europe': 110,
 'considered': 64,
 'premium': 241,
 'brand': 33,
 'ordinary': 223,
 'yellower': 357,
 'color': 57,
 'american': 14,
 'butters': 41,
 'll': 189,
 'taste': 314,
 'disappointing': 91,
 'common': 59,
 'brands': 35,
 'kind': 175,
 'butter': 40,
 'served': 286,
 'cafeterias': 45,
 'better': 27,
 'won': 351,
 'purchasing': 252,
 'excellent': 112,
 'eating': 105,
 'terrible': 319,
 'baking': 24,
 'makes': 202,
 'cookies': 68,
 'fall': 119,
 'apart': 17,
 'leaves': 183,
 'distinct': 94,
 'salty': 276,
 'use': 338,
 'table': 312,
 'purchased': 251,
 'oz': 229,
 'local': 190,
 'kroger': 178,
 '99': 4,
 'pound': 240,
 'disappointment': 92,
 'pure': 253,
 'irish': 168,
 'id': 162,
 'picky': 237,
 'save': 278,
 'money': 2

In [37]:
# encode each review as a sparse row of word counts over the fitted vocabulary
X_transformed = vec.transform(X)
X_transformed

<50x360 sparse matrix of type '<class 'numpy.int64'>'
	with 591 stored elements in Compressed Sparse Row format>

In [38]:
print(X_transformed)

  (0, 87)	1
  (0, 108)	1
  (0, 114)	1
  (0, 131)	1
  (0, 196)	1
  (0, 243)	1
  (0, 343)	1
  (1, 14)	1
  (1, 27)	1
  (1, 33)	1
  (1, 34)	1
  (1, 35)	1
  (1, 40)	1
  (1, 41)	1
  (1, 45)	1
  (1, 57)	1
  (1, 59)	1
  (1, 64)	1
  (1, 77)	1
  (1, 78)	1
  (1, 88)	1
  (1, 91)	1
  (1, 110)	1
  (1, 167)	1
  (1, 171)	1
  :	:
  (46, 107)	1
  (46, 151)	1
  (46, 203)	1
  (46, 248)	1
  (46, 267)	1
  (47, 196)	1
  (48, 31)	1
  (48, 33)	1
  (48, 40)	2
  (48, 43)	1
  (48, 48)	1
  (48, 151)	1
  (48, 326)	1
  (49, 40)	1
  (49, 57)	1
  (49, 76)	1
  (49, 151)	1
  (49, 215)	1
  (49, 262)	1
  (49, 294)	1
  (49, 314)	1
  (49, 338)	1
  (49, 347)	1
  (49, 353)	1
  (49, 356)	1
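Each printed entry reads `(document index, vocabulary index)  count`. Using the vocabulary shown earlier, `(0, 87)` is `didn` and `(0, 243)` is `price` in the first review. The sketch below shows one way to decode a row back into words; `decode_row` and the toy corpus are hypothetical and only illustrate the idea.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus.
toy_docs = ["excellent butter, excellent price", "terrible for baking"]

toy_vec = CountVectorizer(stop_words="english")
toy_matrix = toy_vec.fit_transform(toy_docs)

# Invert the word -> index mapping so column indices can be read as words.
index_to_word = {index: word for word, index in toy_vec.vocabulary_.items()}


def decode_row(matrix, row_number):
    """Return {word: count} for the non-zero entries of one document row."""
    coo = matrix.tocoo()  # exposes parallel arrays of row, column and value
    return {
        index_to_word[col]: int(value)
        for row, col, value in zip(coo.row, coo.col, coo.data)
        if row == row_number
    }


first_doc = decode_row(toy_matrix, 0)
second_doc = decode_row(toy_matrix, 1)
print(first_doc)
print(second_doc)
```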


Convert sparse matrix into a more easily interpretable array.

In [39]:
# converting the sparse matrix to a dense array
# note the high number of zeros
X = X_transformed.toarray()
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

To make better sense of the dataset, let us examine the vocabulary and the document-term matrix together in a pandas dataframe. A matrix can be converted into a dataframe with `pd.DataFrame(matrix, columns=columns)`. (From Dr. Muhammad's notes)

In [40]:
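# converting matrix to dataframe
# column labels come from the fitted vocabulary (get_feature_names_out() needs scikit-learn >= 1.0)
pd.DataFrame(X, columns=vec.get_feature_names_out())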



Unnamed: 0,10,100,65,90,99,ability,able,acts,actually,addition,...,women,won,wont,works,world,ya,yellow,yellower,youngest,zealand
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,1,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,1,0,1,0,0,0,1,0,0,...,1,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


This table shows how many times a particular word occurs in each document. In other words, it is a frequency table of the words. (From Dr. Muhammad's notes)
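A frequency table of this kind can be summarised column by column. The sketch below uses a small hand-written table (`toy_counts`, not taken from the review data) to show total occurrences per word and the number of documents each word appears in.

```python
import pandas as pd

# Hypothetical document-term counts: 3 documents (rows) x 3 words (columns).
toy_counts = pd.DataFrame(
    [[1, 0, 2],
     [0, 1, 1],
     [1, 0, 1]],
    columns=["butter", "price", "taste"],
)

# Total number of occurrences of each word across all documents.
term_totals = toy_counts.sum(axis=0).sort_values(ascending=False)

# Number of documents in which each word occurs at least once.
doc_frequency = (toy_counts > 0).sum(axis=0)

print(term_totals)
print(doc_frequency)
```

Applied to the review dataframe, the same two lines would rank the vocabulary by how often, and in how many reviews, each word appears.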

## Testing Data

In [41]:
test_docs = pd.read_csv('Butter_Cheese_Tweets_test.csv') 
test_docs

Unnamed: 0,edit_history_tweet_ids,author_id,text,id,Tweet
0,['1608553799244992512'],251500400.0,@JackLombardi @AdamKinzinger We are all “globa...,1.60855e+18,Butter
1,['1608493257683603456'],7.32941e+17,Enjoy 🇮🇪 McCambridge soda bread mix across the...,1.60849e+18,Butter
2,['1608349591920934914'],1569071000.0,"https://t.co/ZPm4YMdViV Lot of 4, Noritake #Ir...",1.60835e+18,Butter
3,['1608294795796955136'],15258780.0,@plentyofalcoves the things I saw them do with...,1.60829e+18,Butter
4,['1608184326499274752'],1569071000.0,"https://t.co/ZPm4YMdViV Lot of 4, Noritake #Ir...",1.60818e+18,Butter
5,['1608080492280135681'],1569071000.0,"https://t.co/ZPm4YMdViV Lot of 4, Noritake #Ir...",1.60808e+18,Butter
6,['1608073307923906563'],1.47873e+18,Lots of #Brexit problems. About 2:09: An order...,1.60807e+18,Butter
7,['1608056702103863298'],45626090.0,I went from not having cows milk in few years ...,1.60806e+18,Butter
8,['1608042608009547776'],3871294000.0,@punchedmonet_ @alannah_siobhan Wasn't it an i...,1.60804e+18,Butter
9,['1607779250463191040'],14441450.0,"@DanKaszeta In Ireland ""green bacon"" refers to...",1.60778e+18,Butter


In [42]:
test_docs.drop(["id", "author_id", "edit_history_tweet_ids"], axis=1, inplace = True)
test_docs.head()

Unnamed: 0,text,Tweet
0,@JackLombardi @AdamKinzinger We are all “globa...,Butter
1,Enjoy 🇮🇪 McCambridge soda bread mix across the...,Butter
2,"https://t.co/ZPm4YMdViV Lot of 4, Noritake #Ir...",Butter
3,@plentyofalcoves the things I saw them do with...,Butter
4,"https://t.co/ZPm4YMdViV Lot of 4, Noritake #Ir...",Butter


In [43]:
# convert label to a numerical variable
test_docs['Tweet'] = test_docs.Tweet.map({'Butter':0, 'Cheese':1})
test_docs

Unnamed: 0,text,Tweet
0,@JackLombardi @AdamKinzinger We are all “globa...,0
1,Enjoy 🇮🇪 McCambridge soda bread mix across the...,0
2,"https://t.co/ZPm4YMdViV Lot of 4, Noritake #Ir...",0
3,@plentyofalcoves the things I saw them do with...,0
4,"https://t.co/ZPm4YMdViV Lot of 4, Noritake #Ir...",0
5,"https://t.co/ZPm4YMdViV Lot of 4, Noritake #Ir...",0
6,Lots of #Brexit problems. About 2:09: An order...,0
7,I went from not having cows milk in few years ...,0
8,@punchedmonet_ @alannah_siobhan Wasn't it an i...,0
9,"@DanKaszeta In Ireland ""green bacon"" refers to...",0


In [44]:
test_numpy_array = test_docs.values
X_test = test_numpy_array[:,0]
Y_test = test_numpy_array[:,1]
Y_test = Y_test.astype('int')
print("X_test")
print(X_test)
print("Y_test")
print(Y_test)

X_test
['@JackLombardi @AdamKinzinger We are all “globalists” whether we like it or not. We live in a “global” economy where we purchase oil from the Arabs, iPhones from the Chinese, vegetables from Mexico and Butter from Ireland. We travel globally. If we burn coal in our back yard it affects the globe. Get it?'
 'Enjoy 🇮🇪 McCambridge soda bread mix across the water in USA 🇺🇸\r\nA taste of Ireland that is hard to beat! We love our soda bread fresh from the oven, with lashings of real butter and your favourite jam! 😋\r\nOrder now from @amazon\r\n\r\nhttps://t.co/vvXd1nI8r6\r\n\r\n#IrishMade https://t.co/TUxr7pAXsv'
 'https://t.co/ZPm4YMdViV Lot of 4, Noritake #Ireland Tipperary Bread &amp; Butt https://t.co/fPxsMYaAOv'
 '@plentyofalcoves the things I saw them do with butter while I was in Ireland… I’ll be intimately involved with a slice of toast at a diner in Albany, but I remember the Irish breakfasts and it turns to ash in my mouth'
 'https://t.co/ZPm4YMdViV Lot of 4, Noritake #Irel

In [45]:
# reuse the vectorizer fitted on the reviews; words it did not see during fit() are ignored
X_test_transformed = vec.transform(X_test)
X_test_transformed

<60x360 sparse matrix of type '<class 'numpy.int64'>'
	with 296 stored elements in Compressed Sparse Row format>
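The test matrix has the same 360 columns as the training matrix because `transform()` only counts words already in the fitted vocabulary. A minimal sketch of that behaviour, using made-up training and test sentences (the URL is a placeholder, not a real tweet link):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training and test texts.
train_docs = ["excellent butter", "terrible price"]
new_docs = ["excellent cheese from Ireland", "https://t.co/example"]

# The vocabulary is learned from the training texts only.
toy_vec = CountVectorizer(stop_words="english").fit(train_docs)

# transform() reuses that vocabulary; unknown words contribute nothing.
new_matrix = toy_vec.transform(new_docs)

print(sorted(toy_vec.vocabulary_))
print(new_matrix.shape)
print(new_matrix.toarray())
```

The second text shares no word with the training vocabulary, so its row is all zeros. Documents like that give the classifier nothing to go on beyond what it learned about the classes overall.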

In [46]:
X_test=X_test_transformed.toarray()
X_test

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

### Multinomial Naive Bayes

In [47]:
# building a multinomial NB model
from sklearn.naive_bayes import MultinomialNB

# instantiate NB class
mnb=MultinomialNB()

# fitting the model on training data
mnb.fit(X,Y)

# predicting probabilities of test data
mnb.predict_proba(X_test)


array([[0.87256745, 0.12743255],
       [0.91094314, 0.08905686],
       [0.58778899, 0.41221101],
       [0.81271594, 0.18728406],
       [0.58778899, 0.41221101],
       [0.58778899, 0.41221101],
       [0.89402384, 0.10597616],
       [0.94888454, 0.05111546],
       [0.86351967, 0.13648033],
       [0.43207949, 0.56792051],
       [0.58778899, 0.41221101],
       [0.72783733, 0.27216267],
       [0.80835724, 0.19164276],
       [0.58778899, 0.41221101],
       [0.58778899, 0.41221101],
       [0.58778899, 0.41221101],
       [0.58778899, 0.41221101],
       [0.60342948, 0.39657052],
       [0.98465237, 0.01534763],
       [0.79581731, 0.20418269],
       [0.97503454, 0.02496546],
       [0.80835724, 0.19164276],
       [0.58778899, 0.41221101],
       [0.45156354, 0.54843646],
       [0.58778899, 0.41221101],
       [0.87415018, 0.12584982],
       [0.46310834, 0.53689166],
       [0.72209741, 0.27790259],
       [0.99540969, 0.00459031],
       [0.36001046, 0.63998954],
       [0.

In [48]:
proba=mnb.predict_proba(X_test)
# columns follow mnb.classes_: the model was fitted on Brand, so 0 = Irish and 1 = Dutch
print("probability of test document belonging to class Irish" , proba[:,0])
print("probability of test document belonging to class Dutch" , proba[:,1])

probability of test document belonging to class Irish [0.87256745 0.91094314 0.58778899 0.81271594 0.58778899 0.58778899
 0.89402384 0.94888454 0.86351967 0.43207949 0.58778899 0.72783733
 0.80835724 0.58778899 0.58778899 0.58778899 0.58778899 0.60342948
 0.98465237 0.79581731 0.97503454 0.80835724 0.58778899 0.45156354
 0.58778899 0.87415018 0.46310834 0.72209741 0.99540969 0.36001046
 0.70596625 0.58778899 0.98115831 0.75528417 0.93925912 0.30258448
 0.82591716 0.82647465 0.58778899 0.64379691 0.86351967 0.72491281
 0.56857249 0.77214993 0.60581081 0.75528417 0.75978444 0.82236617
 0.70673194 0.98465237 0.72491281 0.74038678 0.94624798 0.86979644
 0.83361649 0.85083016 0.94208433 0.88256455 0.88934753 0.72491281]
probability of test document belonging to class Dutch [0.12743255 0.08905686 0.41221101 0.18728406 0.41221101 0.41221101
 0.10597616 0.05111546 0.13648033 0.56792051 0.41221101 0.27216267
 0.19164276 0.41221101 0.41221101 0.41221101 0.41221101 0.39657052
 0.01534763 0.2041

In [49]:
# the model was fitted on Brand, so the columns are Irish (0) and Dutch (1)
pd.DataFrame(proba, columns=['Irish','Dutch'])

Unnamed: 0,Irish,Dutch
0,0.872567,0.127433
1,0.910943,0.089057
2,0.587789,0.412211
3,0.812716,0.187284
4,0.587789,0.412211
5,0.587789,0.412211
6,0.894024,0.105976
7,0.948885,0.051115
8,0.86352,0.13648
9,0.432079,0.567921


### Bernoulli Naive Bayes

In [50]:
from sklearn.naive_bayes import BernoulliNB

# instantiating bernoulli NB class
bnb=BernoulliNB()

# fitting the model
bnb.fit(X,Y)

# predicting probability of test data
proba_bnb=bnb.predict_proba(X_test)

In [51]:
# the model was fitted on Brand, so the columns are Irish (0) and Dutch (1)
pd.DataFrame(proba_bnb, columns=['Irish','Dutch'])

Unnamed: 0,Irish,Dutch
0,0.229883,0.770117
1,0.18551,0.81449
2,0.039628,0.960372
3,0.319536,0.680464
4,0.039628,0.960372
5,0.039628,0.960372
6,0.185964,0.814036
7,0.504983,0.495017
8,0.141946,0.858054
9,0.023975,0.976025
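The two models can disagree sharply because they use the features differently. `MultinomialNB` uses how often each word occurs, whereas `BernoulliNB` only uses whether a word occurs (its default `binarize=0.0` turns every positive count into 1) and also treats the absence of a word as evidence. The sketch below illustrates the first difference on made-up counts; all array names are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical word counts: 4 documents x 3 words, two classes.
X_toy = np.array([[3, 0, 1],
                  [2, 1, 0],
                  [0, 2, 0],
                  [0, 1, 4]])
y_toy = np.array([0, 0, 1, 1])

# Two test documents that differ only in how often the first word occurs.
once = np.array([[1, 0, 0]])
five_times = np.array([[5, 0, 0]])

bnb_toy = BernoulliNB().fit(X_toy, y_toy)
mnb_toy = MultinomialNB().fit(X_toy, y_toy)

# BernoulliNB binarises its input, so repeated words do not change the result.
bnb_ignores_counts = np.allclose(
    bnb_toy.predict_proba(once), bnb_toy.predict_proba(five_times)
)

# MultinomialNB multiplies in the word likelihood once per occurrence.
mnb_uses_counts = not np.allclose(
    mnb_toy.predict_proba(once), mnb_toy.predict_proba(five_times)
)

print("BernoulliNB ignores repeated words:", bnb_ignores_counts)
print("MultinomialNB reacts to repeated words:", mnb_uses_counts)
```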
