# Sentiment analysis 

In [1]:
### Jeff Scanlon
#### jscanlo2

### Daniel Schnelbach
#### dschnelb

First, parse and read in the file.  Each line consists of whitespace-separated fields:


0: topic category label (books, camera, dvd, health, music, or software)
1: sentiment category label (pos or neg)
2: document identifier
3 and on: the document tokens

We are only interested in fields 1 and 3 onward.  Note that splitting a line also splits the document (field 3 onward) into individual words.  These need to be joined again to form a sentence.


In [None]:
def read_file(fname='all_sentiment_shuffled.txt'):
    all_labels = []
    all_text = []
    # 'with' closes the file even if parsing fails part-way through
    with open(fname, 'r', encoding='latin-1') as fp:
        for line in fp:
            # '_' marks fields we don't use (topic and document identifier)
            _, lbl, _, *words = line.split()
            all_labels.append(lbl)
            all_text.append(' '.join(words))
    return all_labels, all_text
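
To make the unpacking step concrete, here is a minimal sketch on a single made-up line. The line content and the document identifier `241.txt` are assumptions for illustration, not taken from the data file.

```python
# A made-up line in the same whitespace-separated layout as the data file.
sample_line = "music neg 241.txt i bought this album because i loved the title song ."

# Starred unpacking: the first three fields get names, everything else lands in `words`.
topic, label, doc_id, *words = sample_line.split()

# Re-join the tokens so the document is a single string again.
text = " ".join(words)

print(label)  # neg
print(text)   # i bought this album because i loved the title song .
```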

In [None]:
# Just so we know what join does
'-'.join(['list', 'of', 'strings'])

'list-of-strings'

As with the digits exercise, we return a list of labels and a list of items to be classified; in this case, the items are the sentences commenting on the product.

In [None]:
all_labels, all_text = read_file()

The cell below only works on Unix-like systems such as macOS or Linux.  `wc` (word count) is a Unix program that counts the characters, words, and lines in a file.  With the `-l` command-line argument it counts only the lines.  Note the '!' at the beginning of the line, which tells the notebook that what follows is a shell command.  Not sure if there is a Windows equivalent; Windows users may have to comment this out. You can also check the number of entries by opening the file in an editor.

In [None]:
!wc -l 'all_sentiment_shuffled.txt'

11914 all_sentiment_shuffled.txt
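
If `wc` is not available, the same check can be sketched in plain Python. The helper name `count_lines` is an assumption introduced here; it reads the file with the same `latin-1` encoding used in `read_file`. Note that `wc -l` counts newline characters, so the two can differ by one if the last line has no trailing newline.

```python
def count_lines(path, encoding="latin-1"):
    """Count the lines in a text file on any platform, similar to `wc -l`."""
    # Iterating over the file object yields one line at a time,
    # so the whole file never has to sit in memory.
    with open(path, "r", encoding=encoding) as fp:
        return sum(1 for _ in fp)

# Usage (assumes the data file is in the working directory):
# count_lines('all_sentiment_shuffled.txt')
```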


*Sanity check:*  have we read and processed the whole file?  The number of labels and sentences should equal the number of lines in the file.

In [None]:
(len(all_labels), len(all_text))

(11914, 11914)

Check a few elements, e.g., look at the first line.

In [None]:
(all_labels[0], all_text[0])

('neg',
 "i bought this album because i loved the title song . it 's such a great song , how bad can the rest of the album be , right ? well , the rest of the songs are just filler and are n't worth the money i paid for this . it 's either shameless bubblegum or oversentimentalized depressing tripe . kenny chesney is a popular artist and as a result he is in the cookie cutter category of the nashville music scene . he 's gotta pump out the albums so the record company can keep lining their pockets while the suckers out there keep buying this garbage to perpetuate more garbage coming out of that town . i 'll get down off my soapbox now . but country music really needs to get back to it 's roots and stop this pop nonsense . what country music really is and what it is considered to be by mainstream are two different things .")

## Bag of Words model

The URL below describes how to use `CountVectorizer`.

http://scikit-learn.org/stable/modules/feature_extraction.html

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer

CountVectorizer()

In [None]:
corpus = [
     'This is the first document.',
     'This is the second second document.',
     'And the third one.',
     'Is this the first document?',
 ]

In [None]:
# how to use vectorizer?

X = vectorizer.fit_transform(corpus)

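# Notice that word order is lost - a BAG is a multiset: it keeps counts but not order.
# (Newer scikit-learn releases replace this method with get_feature_names_out().)
vectorizer.get_feature_names()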

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
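
To see what the vectorizer is doing, here is a minimal standard-library sketch of the same bag-of-words idea. It is not scikit-learn's implementation; it only assumes the documented defaults (lowercasing, and tokens of two or more word characters). The names `tokenize`, `vocab`, `column`, and `rows` are introduced here for illustration.

```python
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# Keep runs of two or more word characters, mirroring the default token pattern.
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(doc):
    return TOKEN.findall(doc.lower())

# Sorted vocabulary: each word's position is its column index.
vocab = sorted({tok for doc in corpus for tok in tokenize(doc)})
column = {word: j for j, word in enumerate(vocab)}

# One row of counts per document; word order inside a document is discarded.
rows = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    rows.append([counts[word] for word in vocab])

print(vocab)
print(rows[1])                    # [0, 1, 0, 1, 0, 2, 1, 0, 1]
print(rows[1][column["second"]])  # 2
```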

In [None]:
X.toarray().shape

(4, 9)

In [None]:
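# toarray() builds a fresh dense copy on every call, so this assignment
# changes a temporary array and is not stored back in X
X.toarray()[1][5] = X.toarray()[1][5] + 1
X.toarray()[1][5]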

2

In [None]:
import numpy as np
x = np.array([0,0,0,0])
x

array([0, 0, 0, 0])

In [None]:
x[2] = x[2] + 1
x[2]

1

In [None]:
corpus[1]

'This is the second second document.'

As described in the documentation at the URL, pass in all the text.  Note that `all_text` is a list of sentences, not a list of lists of words: when reading in the file, the tokens were joined to recreate each sentence.

In [None]:
# Produce X_train by fitting the vectorizer on the full corpus

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(all_text)

A check to see how big the training matrix is.  It should have one row for each sentence read in (11914).  The total number of unique words in the corpus turns out to be 46933.

In [None]:
X_train.shape

(11914, 46933)

### How many words (just words) are in the data set?

In [None]:
sum([len(text.split()) for text in all_text])

1794123

If one wants, one can look at some of these features just to see what is going on.  Note that the vectorizer stores a sorted version of the features.  Hence the beginning of the feature list is a bunch of numbers.  It is only towards the middle that we see actual words. This is also discussed at the above URL.

In [None]:
f = vectorizer.get_feature_names()
print(len(f))
print(f[:10])
print(f[2000:2010])
print(f[30000:30010])

46933
['00', '000', '0003', '000mb', '004144', '007', '00am', '00pm', '01', '02']
['agendas', 'agent', 'agents', 'agentz', 'ager', 'agers', 'ages', 'agey', 'aggh', 'agglomerations']
['overpraised', 'overpriced', 'overproduced', 'overproduction', 'overpronnouncing', 'overprotecting', 'overrated', 'overrating', 'overreach', 'overreached']


In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn import metrics

Feature vectors in this case hold word counts, not real numbers.  Hence GaussianNB should NOT be used. By our discussion in class, MultinomialNB should be used, as we are dealing with discrete tokens.  BernoulliNB could also be used, but we get better performance with MultinomialNB.
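
As a rough illustration of why word counts suit a multinomial model, here is a from-scratch sketch of multinomial Naive Bayes with add-one smoothing. It is not scikit-learn's code; the function names and the four toy documents are assumptions made up for this example.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit class priors and add-alpha smoothed word log-probabilities."""
    vocab = {w for doc in docs for w in doc.split()}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, lbl in zip(docs, labels):
        word_counts[lbl].update(doc.split())

    log_prior = {c: math.log(n / len(labels)) for c, n in class_docs.items()}
    log_lik = {}
    for c in class_docs:
        # Smoothing adds alpha to every word count, so no probability is zero.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_lik

def predict(doc, log_prior, log_lik):
    """Pick the class with the highest log-posterior score."""
    scores = {}
    for c in log_prior:
        # Each occurrence of a word adds its log-probability once;
        # words never seen in training are skipped.
        scores[c] = log_prior[c] + sum(
            log_lik[c][w] for w in doc.split() if w in log_lik[c]
        )
    return max(scores, key=scores.get)

# Toy data, invented for illustration only.
docs = ["great song great album", "loved this album",
        "garbage filler songs", "bad album garbage"]
labels = ["pos", "pos", "neg", "neg"]

log_prior, log_lik = train_multinomial_nb(docs, labels)
print(predict("great album", log_prior, log_lik))    # pos
print(predict("garbage songs", log_prior, log_lik))  # neg
```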

In [None]:
model = MultinomialNB()
#model = BernoulliNB()

In [None]:
model.fit(X_train, all_labels)

MultinomialNB()

Another sanity check: we predict on the same set we trained on.  Performance should be very high.

In [None]:
expected = all_labels
predicted = model.predict(X_train)

Did we get the correct number of predictions?

In [None]:
len(predicted)

11914

As one would expect, performance should be high --- we are testing on the training set!  If it isn't, then there is something not right in the above steps.

In [None]:
print(metrics.accuracy_score(expected, predicted))

0.9194225281181803


Now, let's do the 10-fold cross-validation.

In [None]:
from sklearn.model_selection import cross_val_score
y = all_labels
scores = cross_val_score(model, X_train, y, cv=10, scoring='accuracy')

`X_train` and `y` are split into 10 folds.  `cross_val_score` automatically trains on 9 of these folds and tests against the remaining fold, repeating so that each fold serves as the test set once.
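
The fold bookkeeping can be sketched in a few lines of plain Python. This is a simplified illustration, not what `cross_val_score` runs internally: for a classifier with an integer `cv`, scikit-learn uses stratified folds, whereas the hypothetical helper `kfold_indices` below just cuts contiguous blocks.

```python
def kfold_indices(n_samples, n_folds=10):
    """Yield (train_idx, test_idx) pairs over contiguous, non-stratified folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for k in range(n_folds):
        size = base + (1 if k < extra else 0)
        test_idx = list(range(start, start + size))
        # Everything outside the test block is used for training.
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(11914, 10))
print([len(test) for _, test in folds])
# [1192, 1192, 1192, 1192, 1191, 1191, 1191, 1191, 1191, 1191]
```

Each sample lands in exactly one test fold, so averaging the ten scores uses every document for testing once.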

In [None]:
scores

array([0.8045302 , 0.80788591, 0.83892617, 0.81291946, 0.80856423,
       0.8186398 , 0.80268682, 0.79848866, 0.81612091, 0.81024349])

Taking the mean of the score ...

In [None]:
scores.mean()

0.8119005657644864

... we get pretty good performance

## Extensions

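We used a bigram matrix instead of the unigram matrix; the resulting score boost is shown below. See the scratchwork under "Get n-grams" - the functions could be used for n-grams of any size. Note that `x_train_2` is built in that section, so those cells must be run first.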

In [None]:
y = all_labels
scores = cross_val_score(model, x_train_2, y, cv=10, scoring='accuracy')

In [None]:
scores

array([0.86073826, 0.85234899, 0.86661074, 0.85654362, 0.86146096,
       0.85390428, 0.85138539, 0.86397985, 0.87153652, 0.85894207])

In [None]:
scores.mean()

0.8597450678748331

## Get n-grams

In [None]:
import re

def generate_ngrams(string, n):
    
    x = re.sub(r'[^a-zA-Z0-9\s]', ' ', string)
        
    tokens = [token for token in x.split(" ") if token != ""]
    
    ngrams = zip(*[tokens[i:] for i in range(n)])
    
    return [" ".join(ngram) for ngram in ngrams]

In [None]:
#Test
generate_ngrams(all_text[1], 2)

['i was',
 'was misled',
 'misled and',
 'and thought',
 'thought i',
 'i was',
 'was buying',
 'buying the',
 'the entire',
 'entire cd',
 'cd and',
 'and it',
 'it contains',
 'contains one',
 'one song']

In [None]:
# Create ordered dict in which we make each n-gram a unique key and 
# make its value correspond to its place in order (index)

from collections import OrderedDict 
d = OrderedDict()
idx = 0
for string in all_text:
    bigrams = generate_ngrams(string, 2)
    for b in bigrams:
        if b not in d.keys():
            d[b] = idx
            idx += 1

In [None]:
# Just checking...
list(d.keys())[:10]

['i bought',
 'bought this',
 'this album',
 'album because',
 'because i',
 'i loved',
 'loved the',
 'the title',
 'title song',
 'song it']

In [None]:
list(d.values())[:10]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [None]:
len(d.keys())

512535

In [None]:
zero_arr = np.zeros((len(d.keys()),),dtype='int8')
zero_arr.shape

(512535,)

In [None]:
# Set array collector
arr_ls = []

In [None]:
# For every review, get its n-grams, then use the dict to look up each
# n-gram's index in a fresh zero array and add 1 to the count at that index.
# Append each review's array to the list of arrays.
for string in all_text:
    bigrams = generate_ngrams(string, 2)
    arr = np.zeros((len(d.keys()),),dtype='int8')
    for b in bigrams:
        arr[ d[b] ] = arr[ d[b] ] + 1
    arr_ls.append(arr)

In [None]:
len(arr_ls)

11914

In [None]:
arr_ls[3].shape

(512535,)

In [None]:
# These are all super inefficient and I should have looked for scipy sparse matrix functions earlier... 
bigram_mx = np.vstack(arr_ls)

In [None]:
bigram_mx = bigram_mx.astype('int8')

In [None]:
from scipy import sparse
x_train_2 = sparse.csr_matrix(bigram_mx) 

In [None]:
x_train_2

<11914x512535 sparse matrix of type '<class 'numpy.int8'>'
	with 1503144 stored elements in Compressed Sparse Row format>
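
The dense intermediate above holds 11914 x 512535 one-byte entries, roughly 6 GB, even though only about 1.5 million of them are nonzero. As a hedged alternative, the counts could be collected as (row, column, value) triplets and never made dense. The helper `sparse_ngram_counts` and the two sample strings are assumptions for illustration; `generate_ngrams` is repeated so the block runs on its own.

```python
import re
from collections import Counter

def generate_ngrams(string, n):
    # Same cleaning and zip-based windowing as the notebook's function.
    cleaned = re.sub(r"[^a-zA-Z0-9\s]", " ", string)
    tokens = [token for token in cleaned.split(" ") if token != ""]
    return [" ".join(gram) for gram in zip(*[tokens[i:] for i in range(n)])]

def sparse_ngram_counts(texts, n=2):
    """Return (rows, cols, vals, vocab) triplets for an n-gram count matrix."""
    vocab = {}  # n-gram -> column index, in first-seen order
    rows, cols, vals = [], [], []
    for i, text in enumerate(texts):
        # Count each n-gram once per document instead of filling a dense row.
        for gram, count in Counter(generate_ngrams(text, n)).items():
            j = vocab.setdefault(gram, len(vocab))
            rows.append(i)
            cols.append(j)
            vals.append(count)
    return rows, cols, vals, vocab

# Two made-up reviews, for illustration only.
texts = ["i was misled and thought i was buying the entire cd",
         "i was buying one song"]
rows, cols, vals, vocab = sparse_ngram_counts(texts, n=2)
print(len(vocab))  # 11 distinct bigrams
print(vals[0])     # 2, because 'i was' occurs twice in the first review
```

The triplets can then be handed to `scipy.sparse.csr_matrix((vals, (rows, cols)), shape=(len(texts), len(vocab)))` to obtain a matrix like `x_train_2`.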

In [None]:
# get back memory...
del bigram_mx
del arr_ls

In [None]:
## Got it... Back to the top to run. 