__Sentiment Analysis__, also called __Opinion Mining__, is a popular sub-discipline of the broader field of __NLP__. It is concerned with analyzing the polarity of documents.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('movie_data.csv')

In [3]:
df.head()

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


### Introducing the bag-of-words model

1. We create a vocabulary of unique tokens -- for example, words -- from the entire set of documents.
2. We construct a feature vector from each document that contains the counts of how often each word occurs in that particular document.

Since the unique words in each document represent only a small subset of all the words in the bag-of-words vocabulary, the feature vectors will mostly consist of zeros, which is why we call them __sparse__.
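The two steps above can be sketched in plain Python, without scikit-learn. This is only an illustration restricted to single words; the names `vocabulary` and `to_vector` are hypothetical helpers, and the lowercasing and whitespace splitting are simplifying assumptions rather than a description of what _CountVectorizer_ does internally.

```python
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet',
]

# Step 1: build a vocabulary of unique (lowercased) tokens and give each
# token a column index, here in alphabetical order.
tokens_per_doc = [doc.lower().split() for doc in docs]
unique_tokens = sorted({word for tokens in tokens_per_doc for word in tokens})
vocabulary = {word: index for index, word in enumerate(unique_tokens)}

# Step 2: turn each document into a vector of word counts.
def to_vector(tokens):
    counts = Counter(tokens)
    return [counts.get(word, 0) for word in vocabulary]

vectors = [to_vector(tokens) for tokens in tokens_per_doc]

for vector in vectors:
    print(vector)
```

The first two vectors contain zeros for every vocabulary word that the document does not use; with a realistic vocabulary of tens of thousands of words, almost all entries would be zero.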

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(ngram_range=(1,2))

docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet',
])

bag = count.fit_transform(docs)

In [10]:
print(count.vocabulary_)

{'the': 10, 'sun': 7, 'is': 2, 'shining': 5, 'the sun': 11, 'sun is': 8, 'is shining': 3, 'weather': 13, 'sweet': 9, 'the weather': 12, 'weather is': 14, 'is sweet': 4, 'and': 0, 'shining and': 6, 'and the': 1}


In [11]:
print(bag.toarray())

[[0 0 1 1 0 1 0 1 1 0 1 1 0 0 0]
 [0 0 1 0 1 0 0 0 0 1 1 0 1 1 1]
 [1 1 2 1 1 1 1 1 1 1 2 1 1 1 1]]


These values in the feature vectors are also called the __raw term frequencies__: $tf(t,d)$, the number of times a term _t_ occurs in a document _d_.

Because we set _ngram_range=(1,2)_, the vocabulary that we just created contains both single words, called __1-grams (unigrams)__, and pairs of adjacent words, called __2-grams (bigrams)__. More generally, contiguous sequences of items in NLP (words, letters, or symbols) are called __n-grams__.
The choice of the number _n_ in the n-gram model depends on the particular application.
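As a small illustration of what an n-gram is, the sketch below extracts contiguous runs of _n_ tokens from a tokenized sentence. The function name `ngrams` is a hypothetical helper, not part of scikit-learn.

```python
def ngrams(tokens, n):
    """Return all contiguous runs of n tokens, joined by single spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'the sun is shining'.split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)  # ['the', 'sun', 'is', 'shining']
print(bigrams)   # ['the sun', 'sun is', 'is shining']
```

These bigrams are the same two-word entries that appear as keys in the vocabulary printed earlier.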

### Assessing word relevancy via term frequency - inverse document frequency

When we are analyzing text data, we often encounter words that occur across multiple documents from both classes. These frequently occurring words typically don't contain useful or discriminatory information. We will learn about a useful technique called __term frequency-inverse document frequency (tf-idf)__ that can be used to downweight these frequently occurring words in the feature vectors.


<center>$tf\_idf(t,d) = tf(t,d) \times idf(t,d)$</center>

Here, $tf(t,d)$ is the term frequency introduced in the previous section, and $idf(t,d)$ is the inverse document frequency, which can be calculated as follows:

<center>$idf(t,d) = \log \frac{n_d}{1+df(d,t)}$</center>

Here, $n_d$ is the total number of documents, and $df(d,t)$ is the number of documents _d_ that contain the term _t_.

__Note__: Adding the constant 1 to the denominator is optional and serves the purpose of assigning a non-zero value to terms that occur in all training samples; the _log_ is used to ensure that low document frequencies are not given too much weight.
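To make the formula concrete, here is a minimal sketch that evaluates it by hand for the three example sentences. It assumes the natural logarithm (the formula above does not fix the base) and single-word terms only; `df`, `idf`, and `tf_idf` are hypothetical helper names.

```python
import math

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet',
]
tokenized = [doc.lower().split() for doc in docs]
n_d = len(tokenized)  # total number of documents

def df(term):
    """Number of documents that contain the term."""
    return sum(term in tokens for tokens in tokenized)

def idf(term):
    """Inverse document frequency with the optional +1 in the denominator."""
    return math.log(n_d / (1 + df(term)))

def tf_idf(term, tokens):
    """Raw term frequency times inverse document frequency."""
    return tokens.count(term) * idf(term)

for term in ('and', 'sun', 'is'):
    print(term, df(term), round(idf(term), 3))
```

With only three documents the effect of the added constant is easy to see: _and_ (one document) receives a positive weight, _sun_ (two documents) receives exactly zero, and _is_ (all three documents) receives a slightly negative weight.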


In [12]:
from sklearn.feature_extraction.text import TfidfTransformer


In [13]:
tfidf = TfidfTransformer(use_idf=True, norm='l2',smooth_idf=True)
np.set_printoptions(precision=2)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[ 0.    0.    0.31  0.4   0.    0.4   0.    0.4   0.4   0.    0.31  0.4
   0.    0.    0.  ]
 [ 0.    0.    0.31  0.    0.4   0.    0.    0.    0.    0.4   0.31  0.
   0.4   0.4   0.4 ]
 [ 0.29  0.29  0.35  0.22  0.22  0.22  0.29  0.22  0.22  0.22  0.35  0.22
   0.22  0.22  0.22]]


The implementation of _tf-idf_ in scikit-learn differs slightly from the formulas given above. The tf-idf computed in scikit-learn is:
<center>$tf\_idf(t,d) = tf(t,d) \times (idf(t,d)+1)$</center>


and, with _smooth_idf=True_ as used above, the inverse document frequency is computed as follows:
<center>$idf(t,d) = \log \frac{1+n_d}{1+df(d,t)}$</center>


While it is also more typical to normalize the raw term frequencies before calculating the tf-idfs, the _TfidfTransformer_ class normalizes the tf-idfs directly. By default (_norm='l2'_), scikit-learn's _TfidfTransformer_ applies __L2-normalization__, which returns a vector of length 1 by dividing an un-normalized feature vector __v__ by its L2-norm:
<center>$v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}} = \frac{v}{(\sum_{i=1}^{n} v_i^2)^{1/2}}$</center>
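The scikit-learn variant and the L2-normalization can be checked by hand. The sketch below recomputes the tf-idf vectors for the three example sentences in plain Python, using unigrams and bigrams and the natural logarithm. The helper names are hypothetical, and the tokenization is a simplification that happens to match _CountVectorizer_ on these sentences.

```python
import math
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet',
]

def unigrams_and_bigrams(text):
    tokens = text.lower().split()
    bigrams = [' '.join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

counts = [Counter(unigrams_and_bigrams(doc)) for doc in docs]
vocabulary = sorted(set().union(*counts))  # alphabetical column order
n_d = len(docs)

def idf(term):
    """Smoothed inverse document frequency: log((1 + n_d) / (1 + df))."""
    df = sum(term in c for c in counts)
    return math.log((1 + n_d) / (1 + df))

def tfidf_vector(term_counts):
    """tf * (idf + 1) for every vocabulary term, then L2-normalized."""
    raw = [term_counts[term] * (idf(term) + 1) for term in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

for c in counts:
    print([round(v, 2) for v in tfidf_vector(c)])
```

Rounded to two decimals, the rows agree with the matrix printed by _TfidfTransformer_ above: for example, _is_ scores 0.31 and _sun_ scores 0.4 in the first document.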

### Cleaning text data

The first important step, before we build our bag-of-words model, is to clean the text data by stripping it of all unwanted characters.



In [16]:
df.loc[0, 'review'][-50:]

'is seven.<br /><br />Title (Brazil): Not Available'

In [25]:
# Creating a preprocessor that removes HTML tags and keeps emoticons

import re
def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)  # replace all HTML tags with ''
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)  # find emoticons like :) :-D
    # lowercase, replace non-word characters with spaces, then append the
    # emoticons with their '-' noses removed
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text

In [26]:
preprocessor(df.loc[0, 'review'][-50:])

'is seven title brazil not available'

In [27]:
preprocessor("<a> this :) is :( a test :-)!")

' this is a test :) :( :)'

Let's apply our _preprocessor_ function to all the movie reviews in our DataFrame:

In [28]:
df['review'] = df['review'].apply(preprocessor)

### Processing documents into tokens

In [29]:
def tokenizer(text):
    return text.split()

tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In the context of tokenization, another useful technique is word stemming, which is the process of transforming a word into its root form. Commonly used __stemmers__ include:

1. Porter Stemmer
2. Snowball Stemmer
3. Lancaster Stemmer

In [33]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

Using the _PorterStemmer_ from the _nltk_ package, we modified our tokenizer function to reduce words to their root form.
While stemming can create non-real words, such as _thu_ (from _thus_), as shown in the previous example, a technique called __lemmatization__ aims to obtain the canonical (grammatically correct) forms of individual words, the so-called __lemmas__.

However, lemmatization is computationally more expensive than stemming, and stemming and lemmatization have little impact on the performance of text classification.

### Stop-word Removal

In [34]:
from nltk.corpus import stopwords

stop = stopwords.words('english')

In [36]:
[w for w in tokenizer_porter('a runner likes  running and run a lot')[-10:] if w not in stop]

['runner', 'like', 'run', 'run', 'lot']

### Training a logistic regression model for document classification


In [48]:
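# .loc slices include the end label, so stop at 24999 to keep the
# first 25,000 rows for training without overlapping the test set
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values

X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values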

In [49]:
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

In [68]:
tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None)


param_grid = [
               {
                   'vect__ngram_range': [(1,1)],
                   'vect__stop_words':[stop, None],
                   'vect__tokenizer':[tokenizer, tokenizer_porter],
                   'clf__penalty': ['l1', 'l2'],
                   'clf__C':[1.0,10.0,100.0]
               },
              
               {
                   'vect__ngram_range': [(1,1)],
                   'vect__stop_words':[stop, None],
                   'vect__tokenizer':[tokenizer, tokenizer_porter],
                   'vect__use_idf':[False],
                   'vect__norm':[None],
                   'clf__penalty': ['l1', 'l2'],
                   'clf__C':[1.0,10.0,100.0]
               }
             ]

lr_tfidf = Pipeline([
                          ('vect', tfidf), 
                          # liblinear supports both the 'l1' and 'l2' penalties used in the grid
                          ('clf', LogisticRegression(random_state=1, dual=False, solver='liblinear'))
                         ])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid, scoring='accuracy',cv=5,verbose=1, n_jobs=-1)

In [69]:
%%time
gs_lr_tfidf.fit(X_train,y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


KeyboardInterrupt: 

### Online  algorithms and out-of-core learning

As we saw above, it is computationally expensive to construct the feature vectors for the 50,000-review movie dataset during grid search. In many real-world applications, it is not uncommon to work with even larger datasets.
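The size of the search explains the cost. As a rough sketch, the number of candidates can be counted from the number of options per parameter in each of the two grids defined earlier; the dictionary below only records those counts and is not a scikit-learn object.

```python
from math import prod

# Number of options per parameter in each of the two grids
grid_sizes = [
    {'ngram_range': 1, 'stop_words': 2, 'tokenizer': 2,
     'penalty': 2, 'C': 3},
    {'ngram_range': 1, 'stop_words': 2, 'tokenizer': 2,
     'use_idf': 1, 'norm': 1, 'penalty': 2, 'C': 3},
]

candidates = sum(prod(grid.values()) for grid in grid_sizes)
cv_folds = 5
total_fits = candidates * cv_folds

print(candidates, total_fits)  # 48 240
```

Each of these 240 fits has to vectorize tens of thousands of reviews again, and half of them also run the Porter stemmer over every word.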

We will now apply a technique called __out-of-core learning__, which allows us to work with such large datasets by fitting the classifier incrementally on smaller batches of the dataset. For this, we will make use of the _partial_fit_ method of the _SGDClassifier_ in scikit-learn to stream the documents directly from our local drive and train a logistic regression model using small mini-batches of documents.
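To show what incremental fitting means, here is a toy logistic regression in plain Python that is updated one mini-batch at a time by stochastic gradient descent. The class `TinyOnlineLogReg` and its learning rate are hypothetical; this is a sketch of the idea, not how _SGDClassifier_ is implemented.

```python
import math

class TinyOnlineLogReg:
    """Toy logistic regression that learns from one mini-batch at a time."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, X, y):
        # One gradient step per sample; earlier batches are never revisited.
        for x, label in zip(X, y):
            error = self.predict_proba(x) - label
            self.weights = [w - self.learning_rate * error * xi
                            for w, xi in zip(self.weights, x)]
            self.bias -= self.learning_rate * error
        return self

# Made-up data: feature 0 signals class 1, feature 1 signals class 0.
stream = [([1.0, 0.0], 1), ([0.0, 1.0], 0)] * 100

model = TinyOnlineLogReg(n_features=2)
for start in range(0, len(stream), 10):   # mini-batches of 10 samples
    batch = stream[start:start + 10]
    X = [x for x, _ in batch]
    y = [label for _, label in batch]
    model.partial_fit(X, y)

print(model.predict_proba([1.0, 0.0]) > 0.5)
print(model.predict_proba([0.0, 1.0]) < 0.5)
```

Only one mini-batch is held in memory at any time, which is what makes the approach suitable for datasets that do not fit into memory.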

In [3]:
import numpy as np
import re
from nltk.corpus import stopwords
stop = stopwords.words('english')

def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)  # replace all HTML tags with ''
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)  # find emoticons like :) :-D
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

Next, we define a generator function, _stream_docs_, that reads in and returns one document at a time:

In [5]:
def stream_docs(path):
    with open(path , 'r', encoding='utf-8') as csv:
        next(csv) # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label
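The slicing in _stream_docs_ relies on every line ending with a comma, a single-digit label, and a newline. The sketch below applies the same slicing to a small in-memory file; the two reviews are made up for illustration, and `stream_docs_from` is a hypothetical variant that takes an open file handle instead of a path.

```python
import io

def stream_docs_from(handle):
    """Same parsing as stream_docs, but reading from an open file handle."""
    next(handle)  # skip header
    for line in handle:
        # line[-1] is '\n', line[-2] is the label, line[-3] is the comma
        text, label = line[:-3], int(line[-2])
        yield text, label

fake_csv = io.StringIO(
    'review,sentiment\n'
    '"A fine film, really.",1\n'
    '"Dull and overlong.",0\n'
)

parsed = list(stream_docs_from(fake_csv))
for text, label in parsed:
    print(label, text)
```

Note that the surrounding quotation marks stay in the text, and that a final line without a trailing newline would be parsed incorrectly by this slicing.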

In [6]:
next(stream_docs(path='movie_data.csv'))

('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich f

Next, we define a function, _get_minibatch_, that will take a document stream from the _stream_docs_ function and return a particular number of documents specified by the _size_ parameter:

In [18]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y
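One detail of _get_minibatch_ is worth seeing in action: if the stream runs out in the middle of a batch, the function returns `(None, None)` and the documents already collected for that batch are dropped. The sketch below repeats the function on a made-up stream of five documents.

```python
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

# Five made-up documents, read in batches of two
stream = iter([('doc %d' % i, i % 2) for i in range(5)])

first = get_minibatch(stream, size=2)
second = get_minibatch(stream, size=2)
third = get_minibatch(stream, size=2)   # only one document is left

print(first)
print(second)
print(third)
```

With 50,000 reviews, 45 batches of 1,000 documents plus one batch of 5,000 documents divide the data evenly, so nothing is lost in this notebook.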

Unfortunately, we can't use _CountVectorizer_ for out-of-core learning since it requires holding the complete vocabulary in memory. Also, _TfidfVectorizer_ needs to keep all the feature vectors of the training dataset in memory to calculate the inverse document frequencies.

However, another useful vectorizer for text processing implemented in scikit-learn is _HashingVectorizer_. It is data-independent and makes use of the hashing trick via the 32-bit MurmurHash3 function.
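The hashing trick itself fits in a few lines. The sketch below maps each token straight to a column index by hashing it, so no vocabulary has to be stored. It uses `zlib.crc32` from the standard library as a stand-in for MurmurHash3, and `hashed_counts` is a hypothetical helper, so the indices will not match those of _HashingVectorizer_.

```python
import zlib

def hashed_counts(tokens, n_features):
    """Count tokens in a fixed-size vector indexed by a hash of each token."""
    vector = [0] * n_features
    for token in tokens:
        # crc32 is only a stand-in here for the MurmurHash3 function
        index = zlib.crc32(token.encode('utf-8')) % n_features
        vector[index] += 1
    return vector

tokens = 'the sun is shining and the weather is sweet'.split()
vector = hashed_counts(tokens, n_features=16)

print(vector)
```

Because the index depends only on the token, the same function can be applied to any later batch of documents without ever having seen the full dataset.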

In [19]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vect  = HashingVectorizer(decode_error='ignore',
                         n_features= 2**21,
                         preprocessor = None,
                         tokenizer= tokenizer)

# 'log_loss' selects logistic regression; scikit-learn versions before 1.1 spell it 'log'
clf = SGDClassifier(loss='log_loss', random_state=1, max_iter=1)
doc_stream = stream_docs(path='movie_data.csv')

We have initialized _HashingVectorizer_ with our _tokenizer_ and set the number of features to $2^{21}$.

We reinstantiated a logistic regression classifier by setting the _loss_ parameter of the _SGDClassifier_ to _'log'_ (spelled _'log_loss'_ in scikit-learn 1.1 and later).

Note that choosing a large number of features in the _HashingVectorizer_ reduces the chance of hash collisions, but it also increases the number of coefficients in our logistic regression model.
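The trade-off can be illustrated by hashing the same tokens into a small and a large feature space and counting how many tokens end up sharing a column. As a simplification, the sketch uses `zlib.crc32` instead of MurmurHash3 and 1,000 made-up tokens; `count_collisions` is a hypothetical helper.

```python
import zlib

def count_collisions(tokens, n_features):
    """Number of distinct tokens minus the number of columns they occupy."""
    columns = {zlib.crc32(t.encode('utf-8')) % n_features for t in tokens}
    return len(tokens) - len(columns)

tokens = ['word%d' % i for i in range(1000)]   # 1,000 distinct made-up tokens

small = count_collisions(tokens, 2**4)    # 16 columns
large = count_collisions(tokens, 2**21)   # 2,097,152 columns

print(small, large)
```

With 16 columns, at least 984 of the 1,000 tokens must share a column with another token. With $2^{21}$ columns collisions become rare, at the price of a model with 2,097,152 coefficients.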

In [20]:
import pyprind

pbar = pyprind.ProgBar(45)
classes = np.array([0,1])

#  fitting clf only for 45,000 movie reviews
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream=doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train,y_train, classes=classes)
    pbar.update()

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:45


We initialized the progress bar object with 45 iterations and, in the following _for_ loop, we iterated over 45 mini-batches of documents, where each mini-batch consists of 1,000 documents. Having completed the incremental learning process, we will use the last 5,000 documents to evaluate the performance of our model:



In [21]:
X_test,y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print('Accuracy : %.3f' % clf.score(X_test, y_test))

Accuracy : 0.867


The accuracy of our model is about 87 percent, slightly below the accuracy of the grid-search approach to hyperparameter tuning from the previous section. However, out-of-core learning is very memory-efficient and took less than a minute to complete. Finally, we can use the last 5,000 documents to update our model:

In [24]:
clf = clf.partial_fit(X_test, y_test)

A more modern alternative to the bag-of-words model is __word2vec__, an algorithm that Google released in 2013.
 - <font style="color:red">The word2vec algorithm is an unsupervised learning algorithm based on neural networks that attempts to automatically learn the relationships between words. The idea behind word2vec is to put words that have similar meanings into similar clusters, and via clever vector spacing, the model can reproduce certain words using simple vector math, for example, _king - man + woman = queen._
 
 The original C implementation, with useful links to the relevant papers and alternative implementations, can be found at</font>
 [word2vec](https://code.google.com/p/word2vec/).
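The vector arithmetic behind _king - man + woman = queen_ can be illustrated with a toy example. The two-dimensional vectors below are hand-made for illustration (first axis "royalty", second axis "gender") and are not learned word2vec embeddings, which typically have hundreds of dimensions.

```python
import math

# Hand-made toy vectors: (royalty, gender). Purely illustrative values.
vectors = {
    'king':   (1.0, 1.0),
    'queen':  (1.0, -1.0),
    'man':    (0.0, 1.0),
    'woman':  (0.0, -1.0),
    'prince': (0.9, 1.0),
    'girl':   (0.1, -1.0),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman, computed component by component
target = tuple(k - m + w for k, m, w in
               zip(vectors['king'], vectors['man'], vectors['woman']))

# Pick the most similar remaining word, excluding the three query words
candidates = [w for w in vectors if w not in ('king', 'man', 'woman')]
best = max(candidates, key=lambda w: cosine(vectors[w], target))

print(target, best)
```

In a trained model the result of the arithmetic rarely lands exactly on a word vector, which is why the nearest neighbour by cosine similarity is used.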