## Bag of Words

### Sentiment analysis



In [12]:
import pandas as pd
import numpy as np
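# quoting=3 (csv.QUOTE_NONE) treats the double quotes inside reviews as literal characters
sentiment_train = pd.read_csv('data/labeledTrainData.tsv', header=0, delimiter="\t", quoting=3)
# Only the last expression in a cell is displayed, so show a random sample of rows
sentiment_train.sample(10)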

Unnamed: 0,id,sentiment,review
2879,"""533_10""",1,"""I LOVE this movie. Director Michael Powell on..."
16485,"""8430_3""",0,"""OK, let me again admit that I haven't seen an..."
9180,"""3406_4""",0,"""Space Camp, which had the unfortunate luck to..."
4843,"""7683_10""",1,"""I LOVED GOOD TIMES with the rest of many of y..."
24978,"""9397_9""",1,"""Vaguely reminiscent of great 1940's westerns,..."
23985,"""921_7""",1,"""PERHAPS in an attempt to find another \""Hot P..."
14488,"""11374_2""",0,"""I couldn't relate to this film. It failed to ..."
5368,"""11884_4""",0,"""The worlds largest inside joke. The world's l..."
8823,"""1035_1""",0,"""Horrible acting, horrible cast and cheap prop..."
7343,"""11515_10""",1,"""It has been so many years since I saw this bu..."


In [24]:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [25]:
def preprocessor(text):
    """ Return a cleaned version of text
    """
    # Remove HTML markup
    text = re.sub('<[^>]*>', '', text)
    # Save emoticons for later appending
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Remove any non-word character and append the emoticons,
    # removing the nose character for standardization. Convert to lower case
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', ''))
    
    return text

# Quick check of preprocessor() on an empty string
print(preprocessor(''))
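
An empty string exercises little of the function. As a rough sketch, the snippet below runs the same cleaning logic on a made-up fragment containing HTML markup and emoticons. The function is repeated so the snippet runs on its own, and `sample_text` is an assumption for illustration, not a review from the dataset.

```python
import re

def preprocessor(text):
    """Return a cleaned version of text (same logic as the cell above)."""
    # Remove HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons before the non-word characters are stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lower-case, collapse non-word runs to spaces, append nose-less emoticons
    return (re.sub(r'[\W]+', ' ', text.lower()) + ' '
            + ' '.join(emoticons).replace('-', ''))

# Hypothetical sample text, not taken from the dataset
sample_text = '</a>This :) is :( a test :-)!'
cleaned = preprocessor(sample_text)
print(cleaned.split())
# → ['this', 'is', 'a', 'test', ':)', ':(', ':)']
```

The emoticons end up at the end of the string, so their position in the review is lost but their presence is kept as a feature.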

 


In [26]:
from nltk.stem import PorterStemmer

porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

# Testing
print(tokenizer_porter('Hi there, I am loving this, like with a lot of love'))


['Hi', 'there,', 'I', 'am', 'love', 'this,', 'like', 'with', 'a', 'lot', 'of', 'love']


## Training Logistic Regression

In [27]:
# Split the dataset into train and test sets
from sklearn.model_selection import train_test_split
X = sentiment_train['review'].values
y = sentiment_train['sentiment'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

In [28]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

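# Note: stop words are filtered after tokenization, so stemmed tokens (e.g. 'thi' from 'this')
# may not match the unstemmed entries in `stop`; scikit-learn may warn about this inconsistency.
tfidf = TfidfVectorizer(stop_words=stop,
                        tokenizer=tokenizer_porter,
                        preprocessor=preprocessor)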

# A pipeline chains several processing steps together once the initial exploration is done.
# Steps that transform features (normalising numerical values, turning text into vectors,
# filling in missing data) are transformers; steps that predict a target by fitting an
# algorithm are estimators. A Pipeline applies its transformers in order and ends with an
# estimator, so the whole chain can be fitted on the training data with a single call.
clf = Pipeline([('vect', tfidf),
                ('clf', LogisticRegression(random_state=0))])
clf.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2',
        preprocessor=<function preproc...nalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])
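
To make the transformer/estimator chaining concrete, here is a minimal sketch on a tiny made-up corpus. It uses a default `TfidfVectorizer` rather than the custom tokenizer and preprocessor above, and the names `toy_texts`, `toy_labels` and `toy_clf` are hypothetical, so the numbers it produces say nothing about the real model.

```python
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, only for illustration (1 = positive, 0 = negative)
toy_texts = ['great movie loved it', 'wonderful acting great cast',
             'horrible film hated it', 'awful acting horrible plot']
toy_labels = [1, 1, 0, 0]

# Same two-step structure as above: a transformer followed by an estimator
toy_clf = Pipeline([('vect', TfidfVectorizer()),
                    ('clf', LogisticRegression(random_state=0))])
toy_clf.fit(toy_texts, toy_labels)

# Each fitted step stays accessible by the name given in the pipeline
vocab_size = len(toy_clf.named_steps['vect'].vocabulary_)

# Raw strings go in; the pipeline vectorizes them before classifying
toy_proba = toy_clf.predict_proba(['great cast', 'horrible plot'])
print(vocab_size, toy_proba.shape)
```

Because the vectorizer lives inside the pipeline, new text is vectorized with the vocabulary learned during `fit`, and no separate `transform` call is needed at prediction time.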

In [92]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Predict once and reuse the result for every metric
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.8853333333333333
[[3293  503]
 [ 357 3347]]
             precision    recall  f1-score   support

          0       0.90      0.87      0.88      3796
          1       0.87      0.90      0.89      3704

avg / total       0.89      0.89      0.89      7500
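
The figures in the report can be re-derived by hand from the confusion matrix, which is a useful sanity check. The sketch below copies the matrix printed above and assumes scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions
cm = [[3293, 503],
      [357, 3347]]

total = sum(sum(row) for row in cm)
# Accuracy is the share of reviews on the diagonal
accuracy = (cm[0][0] + cm[1][1]) / total

# Precision divides by the column total, recall by the row total
precision_0 = cm[0][0] / (cm[0][0] + cm[1][0])
recall_0 = cm[0][0] / (cm[0][0] + cm[0][1])
precision_1 = cm[1][1] / (cm[1][1] + cm[0][1])
recall_1 = cm[1][1] / (cm[1][1] + cm[1][0])

print(round(accuracy, 4))                          # → 0.8853
print(round(precision_0, 2), round(recall_0, 2))   # → 0.9 0.87
print(round(precision_1, 2), round(recall_1, 2))   # → 0.87 0.9
```

The row totals (3796 and 3704) match the `support` column of the report.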



In [29]:
sentiment_test = pd.read_csv('data/testData.tsv', header=0, delimiter="\t", quoting=3)
sample_test = sentiment_test.head(100)
sample = sample_test.loc[:, 'review'].values
# predict_proba returns one row per review: [P(negative), P(positive)]
preds = clf.predict_proba(sample)
for review, proba in zip(sample, preds):
    print(f'{review} --> Negative, Positive  = {proba}')


"Naturally in a film who's main themes are of mortality, nostalgia, and loss of innocence it is perhaps not surprising that it is rated more highly by older viewers than younger ones. However there is a craftsmanship and completeness to the film which anyone can enjoy. The pace is steady and constant, the characters full and engaging, the relationships and interactions natural showing that you do not need floods of tears to show emotion, screams to show fear, shouting to show dispute or violence to show anger. Naturally Joyce's short story lends the film a ready made structure as perfect as a polished diamond, but the small changes Huston makes such as the inclusion of the poem fit in neatly. It is truly a masterpiece of tact, subtlety and overwhelming beauty." --> Negative, Positive  = [0.02358603 0.97641397]
"This movie is a disaster within a disaster film. It is full of great action scenes, which are only meaningful if you throw away all sense of reality. Let's see, word to the wise

In [30]:
import pickle
import os

# Use a context manager so the file is closed once the model has been written
with open(os.path.join('data', 'logisticRegression.pkl'), 'wb') as f:
    pickle.dump(clf, f, protocol=4)
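
Reading the model back is the mirror image of saving it. The sketch below shows the round trip with a stand-in dictionary in a temporary directory so that it runs on its own; the `model` object and the temporary path are assumptions, and in the notebook the object would be the fitted `clf`.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted pipeline so the snippet runs on its own (assumption)
model = {'name': 'logisticRegression', 'classes': [0, 1]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'logisticRegression.pkl')
    # Save with the same protocol as the cell above
    with open(path, 'wb') as f:
        pickle.dump(model, f, protocol=4)
    # Only unpickle files from a trusted source: loading can execute arbitrary code
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == model)
```

When the real pipeline is reloaded in a new session, `preprocessor` and `tokenizer_porter` need to be defined or importable there first, since pickle stores functions by reference rather than by value.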