## Week 3 - Lab - Logistic Regression

### Sentiment analysis

In [1]:
import numpy as np
import pandas as pd

sentiment = pd.read_csv('labeledTrainData.tsv', delimiter='\t',encoding='utf-8')
sentiment.head()


Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...


### Bag of words

In [4]:
reviews = sentiment.review.head(3)
reviews

0    With all this stuff going down at the moment w...
1    \The Classic War of the Worlds\" by Timothy Hi...
2    The film starts with a manager (Nicholas Bell)...
Name: review, dtype: object

### CountVectorizer

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
bag = vectorizer.fit_transform(reviews)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())


['000', '10', '20', '2002', '2006', 'about', 'actors', 'actual', 'addition', 'afoul', 'again', 'against', 'agent', 'agree', 'all', 'alone', 'also', 'amateur', 'ambition', 'an', 'and', 'angel', 'animal', 'animals', 'another', 'anyway', 'appear', 'appreciated', 'appropriately', 'are', 'area', 'as', 'astounding', 'at', 'attack', 'attacked', 'attention', 'australian', 'average', 'away', 'bad', 'badly', 'barrel', 'bases', 'bc', 'be', 'beasts', 'because', 'becoming', 'begins', 'beheading', 'behind', 'being', 'bell', 'belle', 'below', 'bestest', 'better', 'beyond', 'bigger', 'biography', 'bit', 'blood', 'bloody', 'book', 'boring', 'bottom', 'bound', 'br', 'brian', 'buddy', 'bunch', 'but', 'by', 'call', 'came', 'camilla', 'can', 'car', 'carnivorous', 'carradine', 'center', 'certain', 'character', 'chills', 'cinema', 'classic', 'cliff', 'closed', 'comes', 'complex', 'computer', 'consenting', 'convincing', 'cool', 'course', 'creature', 'criminal', 'criticize', 'critics', 'cruise', 'crusoe', 'cur

In [6]:
print(bag.toarray())  

[[0 0 1 ... 2 0 0]
 [0 0 0 ... 0 0 0]
 [1 1 0 ... 0 1 1]]


In [7]:
# vocabulary_ maps each token to its column index in the count matrix
print(vectorizer.vocabulary_['is'])

226
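To make the bag-of-words idea concrete, here is a minimal standard-library sketch of roughly what `CountVectorizer` does with its default settings: lowercase the text, keep tokens of two or more word characters, sort the vocabulary, and count occurrences per document. The two toy sentences are made up for illustration and are not taken from the dataset.

```python
import re
from collections import Counter

# Toy documents (hypothetical, not from labeledTrainData.tsv)
docs = ["The film is good", "The film is not good, not at all"]

# Approximation of CountVectorizer's default token pattern: 2+ word characters
token_pattern = re.compile(r"\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in docs]

# Sorted vocabulary; each token's position is its column index
vocabulary = sorted({tok for toks in tokenized for tok in toks})
index = {tok: i for i, tok in enumerate(vocabulary)}

# One row per document, one column per vocabulary entry
matrix = [[Counter(toks)[tok] for tok in vocabulary] for toks in tokenized]

print(vocabulary)
print(matrix)
print(index['is'])
```

Looking up `index['is']` plays the same role as looking up the column of `'is'` in the fitted vectorizer above.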


### TF-IDF weighting with TfidfTransformer

In [8]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()

np.set_printoptions(precision=2)

tfidf_ed = tfidf.fit_transform(bag).toarray()
print(tfidf_ed)

[[0.   0.   0.03 ... 0.06 0.   0.  ]
 [0.   0.   0.   ... 0.   0.   0.  ]
 [0.03 0.03 0.   ... 0.   0.03 0.03]]
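The values above can be reproduced by hand. The sketch below assumes `TfidfTransformer`'s default settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and applies them to a small made-up count matrix; the helper name `tfidf_rows` is hypothetical.

```python
import math

def tfidf_rows(counts):
    """Smoothed TF-IDF with L2 row normalization on a list-of-lists count matrix."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents containing each term
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    rows = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        rows.append([v / norm if norm else 0.0 for v in weighted])
    return idf, rows

# Toy counts: 3 documents, 3 terms (hypothetical data)
toy_counts = [[1, 2, 0], [1, 0, 1], [1, 0, 0]]
idf, rows = tfidf_rows(toy_counts)
print([round(w, 2) for w in idf])
for r in rows:
    print([round(v, 2) for v in r])
```

A term that appears in every document gets the minimum IDF of 1, while rarer terms are weighted more heavily before normalization.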


## Data clean up

### Removing stop words

In [9]:
from collections import Counter
vocab = Counter()
# Use a singular loop variable so the earlier `reviews` Series is not overwritten
for review in sentiment['review']:
    for word in review.split():
        vocab[word] += 1
vocab.most_common(20)

[('the', 287032),
 ('a', 155096),
 ('and', 152664),
 ('of', 142972),
 ('to', 132568),
 ('is', 103228),
 ('in', 85580),
 ('I', 65973),
 ('that', 64560),
 ('this', 57196),
 ('it', 54429),
 ('/><br', 50935),
 ('was', 46698),
 ('as', 42510),
 ('with', 41721),
 ('for', 41070),
 ('but', 33790),
 ('The', 33762),
 ('on', 30766),
 ('movie', 30500)]

In [10]:
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jeekayen/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [11]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

vocab_reduced = Counter()

for w, c in vocab.items():
    if w.lower() not in stop:
        vocab_reduced[w] = c

vocab_reduced.most_common(20)

[('/><br', 50935),
 ('movie', 30500),
 ('film', 27397),
 ('one', 20688),
 ('like', 18133),
 ('would', 11922),
 ('good', 11435),
 ('really', 10815),
 ('even', 10607),
 ('see', 10155),
 ('-', 9355),
 ('get', 8777),
 ('story', 8526),
 ('much', 8507),
 ('time', 7764),
 ('make', 7485),
 ('could', 7462),
 ('also', 7422),
 ('first', 7339),
 ('people', 7335)]

### Removing special characters and trash

In [12]:
import re

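def preprocessor(text):
    """Return a cleaned version of text."""
    # Strip HTML tags such as <br />
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, replace runs of non-word characters with a space,
    # then append the emoticons with their "noses" (-) removed
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', ''))

    return text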
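A quick check of the cleaning step on a made-up sentence may help. The block repeats the `preprocessor` logic (written with raw-string patterns) so that it runs on its own; the sample text is hypothetical.

```python
import re

def preprocessor(text):
    """Return a cleaned version of text (same logic as the cell above)."""
    text = re.sub(r'<[^>]*>', '', text)                              # drop HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)    # save emoticons
    text = (re.sub(r'[\W]+', ' ', text.lower())
            + ' ' + ' '.join(emoticons).replace('-', ''))
    return text

sample = "<br />This movie is GREAT :-) but the ending... :("
cleaned = preprocessor(sample)
print(cleaned.split())
```

The HTML tag and punctuation disappear, the words are lowercased, and the emoticons are kept at the end of the string because they can carry sentiment.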

### Tokenizing

In [13]:
from nltk.stem import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    return text.split()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

## Training Logistic Regression

In [14]:
from sklearn.model_selection import train_test_split

X = sentiment['review']
y = sentiment['sentiment']  # target is the 0/1 label column, not the review text
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=28)

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words=stop,
                        tokenizer=tokenizer_porter,
                        preprocessor=preprocessor)

clf = Pipeline([('vect', tfidf),
                ('clf', LogisticRegression(random_state=0))])
clf.fit(X_train, y_train)

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

predictions = clf.predict(X_test)
print('accuracy:', accuracy_score(y_test, predictions))
print('confusion matrix:\n', confusion_matrix(y_test, predictions))
print('classification report:\n', classification_report(y_test, predictions))
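For reference, the metrics reported above can be computed by hand for a binary problem. The following is a minimal sketch on made-up labels; the helper `binary_metrics` is hypothetical, and the confusion matrix uses the same layout as scikit-learn (rows are true labels, columns are predicted labels, in the order 0 then 1).

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and confusion matrix for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        'confusion_matrix': [[tn, fp], [fn, tp]],
        'accuracy': (tp + tn) / len(pairs),
        'precision': tp / (tp + fp) if tp + fp else 0.0,
        'recall': tp / (tp + fn) if tp + fn else 0.0,
    }

# Toy labels (hypothetical, not model output)
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(toy_true, toy_pred)
print(metrics)
```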


### Testing

In [None]:
test = [
    """With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci's character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ's music.<br /><br />Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.<br /><br />Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ's bestest buddy in this movie is a girl! 
Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i've gave this subject....hmmm well i don't know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter.
"""
]

preds = clf.predict_proba(test)

# predict_proba columns follow clf.classes_, i.e. [0, 1] = [negative, positive]
for i in range(len(test)):
    print(f'{test[i]} --> Negative, Positive = {preds[i]}')
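The two numbers printed per review come from the logistic (sigmoid) function applied to a weighted sum of the TF-IDF features. Here is a minimal sketch of that mapping for a binary model; the weights, bias, and feature values are hypothetical and are not the fitted coefficients of `clf`. The returned order, negative class first and positive class second, mirrors how scikit-learn orders `predict_proba` columns by `classes_`.

```python
import math

def predict_proba_binary(weights, bias, features):
    """Return [P(class 0), P(class 1)] for one sample of a binary logistic model."""
    z = bias + sum(w * x for w, x in zip(weights, features))  # linear score
    p_pos = 1.0 / (1.0 + math.exp(-z))                        # sigmoid
    return [1.0 - p_pos, p_pos]

# Hypothetical coefficients for three features
toy_weights = [1.5, -2.0, 0.5]
toy_bias = -0.1
toy_features = [0.6, 0.0, 0.8]   # e.g. TF-IDF values for one review

probs = predict_proba_binary(toy_weights, toy_bias, toy_features)
print(f'Negative, Positive = {probs}')
```

A positive linear score pushes the second probability above 0.5, which is the threshold `predict` uses to label a review as positive.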