# Assignment 1

## Important notes
**Submission deadline:**
* **Thursday, 12.03.2020**

**Points: 13 + 2bp**

This assignment is meant to test your skills in course pre-requisites:  Scientific Python programming and  Machine Learning. If it is hard, I strongly advise you to drop the course.

Please use GitHub’s [pull requests](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/about-pull-requests) and issues to send corrections!

You can solve the assignment in any system you like, but we encourage you to try out [Google Colab](https://colab.research.google.com/).

In [2]:
#setup
import numpy as np
import pandas as pd
from tqdm import tqdm, trange

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from collections import defaultdict

from sklearn.metrics import confusion_matrix
from mglearn.tools import visualize_coefficients

import re
import scipy.optimize as sopt

In [4]:
class CustomCountVectorizer():
    
    def __init__(self, neg_words=[], enh_words=[], enh_mult=2, no_words=None):
        self.neg_words = neg_words
        self.enh_words = enh_words
        self.enh_mult = enh_mult
        self.no_words = no_words
    
    def fit(self, sentences):
        count = defaultdict(int)
        for s in sentences:
            for w in s.split():
                count[w] += 1
        
        srt = sorted(count.items(), key=lambda kv: kv[1], reverse=True)
        if self.no_words is not None:
            srt = srt[:self.no_words]
        
        self.idx_vocabulary = {w:i for i, (w, c) in enumerate(srt)}
    
    def transform(self, sentences):
        vect = np.zeros((len(sentences), len(self.idx_vocabulary)))

        for i, s in enumerate(sentences):
            neg = 1
            enh = 1
            
            for w in s.split():
                if w not in self.idx_vocabulary:
                    continue
                
                idx = self.idx_vocabulary[w]
                
                if w in self.neg_words:
                    neg *= -1
                elif w in self.enh_words:
#                     enh += self.enh_mult
                    enh *= self.enh_mult
                else:
                    vect[i][idx] += neg * enh
        
        return vect
    
    def fit_transform(self, sentences):
        self.fit(sentences)
        return self.transform(sentences)

In [5]:
class CustomLogisticRegression:
    
    def __init__(self, neg_words=[], enh_words=[], no_words=None, enh_mult=2,
                 params=None, max_iter=100, solver=sopt.fmin_l_bfgs_b):
        self.neg_words = neg_words
        self.enh_words = enh_words
        self.no_words = no_words
        self.enh_mult = enh_mult
        self.params = params
        self.max_iter = max_iter
        self.solver = solver
    
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))
    
    def logreg_loss(self, params, sentences, Y):
        params = params.astype(np.float64)
        Y = Y.astype(np.float64)
        
#         N, E = params[:2]
#         Theta = params[2:]
        
        cv = CustomCountVectorizer(neg_words=self.neg_words, enh_words=self.enh_words,
                                  enh_mult=self.enh_mult, no_words=self.no_words)
        X = cv.fit_transform(sentences)
        X = np.hstack((np.ones((X.shape[0], 3)), X))  # bias, N, E
        X = X.astype(np.float64)
        
        Z = np.dot(X, params.T)

        sig_Z = self.sigmoid(Z)
        Y_ = Y[:,np.newaxis]
        nll = -np.sum((Y_ * np.log2(sig_Z + 1e-50) + (1-Y_) * np.log2(1 - sig_Z + 1e-50)))
        nll += np.sum(params**2) / 2
        
        grad = np.dot(X.T, (sig_Z-Y).T)
        grad = grad.reshape(params.shape) + params

        return nll / len(Y), grad / len(Y)
    
    def fit(self, sentences, Y):
        params = self.params
        cv = CustomCountVectorizer(neg_words=self.neg_words, enh_words=self.enh_words,
                                  no_words=self.no_words, enh_mult=self.enh_mult)
        X = cv.fit_transform(sentences)
        
        if params is None:
            params = np.ones(X.shape[1] + 3)
        
        X = np.hstack((np.ones((X.shape[0], 1)), X))
        params = self.solver(lambda p: self.logreg_loss(p, sentences, Y), 
                             params,
                             maxiter=self.max_iter)[0]
        self.params = params
        
    def predict(self, sentences):
        self.N, self.E = self.params[:2]
        Theta = self.params[2:]
        cv = CustomCountVectorizer(neg_words=self.neg_words, enh_words=self.enh_words,
                                  no_words=self.no_words, enh_mult=self.enh_mult)
        X = cv.fit_transform(sentences)
        X = np.hstack((np.ones((X.shape[0], 1)), X))
        
        preds = np.dot(Theta, X.T)
        return preds, self.sigmoid(preds)

1. **[1p]** Download data competition from a Kaggle competition on sentiment prediction from [[https://www.kaggle.com/c/CountVectorizerent-analysis-on-movie-reviews/data](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data)].  Keep only full sentences, i.e. for each `SenteceId` keep only the entry with the lowest `PhraseId`.  Use first 7000 sentences as a `train set` and the remaining 1529 sentences as the `test set`. 

In [6]:
def transform(s):
#     return re.sub('\ +\W\ +', ' ', s.lower())
    return re.sub('[^\w\s]', '', s.lower())

In [7]:
df = pd.read_csv('data/train.tsv', sep='\t')
df = df.groupby(['SentenceId'], as_index=False).agg({'PhraseId' : 'min',
                                               'Phrase' : 'first',
                                               'Sentiment' : 'first'})
df['Sentiment'] /= 4
df = df.drop(['PhraseId', 'SentenceId'], axis=1)
df['Phrase'] = df['Phrase'].apply(transform)

In [8]:
train_df = df.iloc[:7000]
test_df = df.iloc[7000:]

In [9]:
train_df.head()

Unnamed: 0,Phrase,Sentiment
0,a series of escapades demonstrating the adage ...,0.25
1,this quiet introspective and entertaining ind...,1.0
2,even fans of ismail merchant s work i suspect...,0.25
3,a positively thrilling combination of ethnogra...,0.75
4,aggressive selfglorification and a manipulativ...,0.25


2. **[1p]** Prepare the data for logistic regression:
	Map the sentiment scores $0,1,2,3,4$ to a probability of the sentence being by setting $p(\textrm{positive}) = \textrm{sentiment}/4$.
	Build a dictionary of most frequent 20000 words.

In [10]:
word_count = defaultdict(int)
for i in trange(len(train_df), position=0, leave=True):
    for w in train_df.iloc[i]['Phrase'].split():
        word_count[w] += 1

100%|██████████| 7000/7000 [00:01<00:00, 6186.28it/s]


In [11]:
word_count = dict(sorted(word_count.items(), key=lambda kv: kv[1], reverse=True)[:2000])

3. **[3p]** Treat each document as a bag of words. e.g. if the vocabulary is 
	```
	0: the
	1: good
	2: movie
	3: is
	4: not
	5: a
	6: funny
	```
	Then the encodings can be:
	```
	good:                           [0,1,0,0,0,0,0]
	not good:                       [0,1,0,0,1,0,0] 
	the movie is not a funny movie: [1,0,2,1,1,1,1]
	```
    Train a logistic regression model to predict the sentiment. Compute the correlation between the predicted probabilities and the sentiment. Record the most positive and negative words.
    Please note that in this model each word gets its sentiment parameter $S_w$ and the score for a sentence is 
    $$\text{score}(\text{sentence}) = \sum_{w\text{ in sentence}}S_w$$

for full dictionary

In [12]:
train_y = train_df['Sentiment']
test_y = test_df['Sentiment']

cv = CustomCountVectorizer()
train_x = cv.fit_transform(train_df['Phrase'])
test_x = cv.transform(test_df['Phrase'])

In [13]:
%%time
clf = LogisticRegression(multi_class='multinomial', max_iter=1000, solver='lbfgs')
clf.fit(train_x, train_y*4)

CPU times: user 2min 7s, sys: 3.94 s, total: 2min 11s
Wall time: 56.5 s


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='multinomial', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [14]:
preds_tr = clf.predict(train_x)

print(f'Confusion matrix:\n{confusion_matrix(train_y*4, preds_tr)}')
print(f'Accuracy for train: {(train_y*4 == preds_tr).mean()}')
print(f'Accuracy for test: {(test_y*4 == clf.predict(test_x)).mean()}')

Confusion matrix:
[[ 838   26   11    3    0]
 [   5 1779   17    9    0]
 [   2   33 1300   24    3]
 [   2   20   13 1852   15]
 [   1    8   14   25 1000]]
Accuracy for train: 0.967
Accuracy for test: 0.39110529758011775


In [15]:
print(f'Most negative words: {np.array(list(cv.idx_vocabulary.keys()))[clf.coef_[0].argsort()[-5:]]}')
print(f'Most positive words: {np.array(list(cv.idx_vocabulary.keys()))[clf.coef_[-1].argsort()[-5:]]}')

Most negative words: ['dull' 'terrible' 'devoid' 'stupid' 'worst']
Most positive words: ['remarkable' 'captivating' 'best' 'beautifully' 'masterpiece']


for 2000 most popular words

In [16]:
cv = CustomCountVectorizer(no_words=2000)
train_x = cv.fit_transform(train_df['Phrase'])
test_x = cv.transform(test_df['Phrase'])

In [17]:
%%time
clf = LogisticRegression(multi_class='multinomial', max_iter=1000, solver='lbfgs')
clf.fit(train_x, train_y*4)

CPU times: user 23.4 s, sys: 1.31 s, total: 24.7 s
Wall time: 8.61 s


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='multinomial', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [18]:
preds_tr = clf.predict(train_x)

print(f'Confusion matrix:\n{confusion_matrix(train_y*4, preds_tr)}')
print(f'Accuracy for train: {(train_y*4 == preds_tr).mean()}')
print(f'Accuracy for test: {(test_y*4 == clf.predict(test_x)).mean()}')

Confusion matrix:
[[ 569  174   67   58   10]
 [  51 1444  146  146   23]
 [  30  219  903  182   28]
 [  22  161  120 1506   93]
 [   8   54   46  181  759]]
Accuracy for train: 0.7401428571428571
Accuracy for test: 0.3590582079790713


In [19]:
print(f'Most negative words: {np.array(list(cv.idx_vocabulary.keys()))[clf.coef_[0].argsort()[-5:]]}')
print(f'Most positive words: {np.array(list(cv.idx_vocabulary.keys()))[clf.coef_[-1].argsort()[-5:]]}')

Most negative words: ['dull' 'horrible' 'stupid' 'devoid' 'worst']
Most positive words: ['entertaining' 'beautifully' 'dazzling' 'remarkable' 'masterpiece']


4. **[3p]** Now prepare an encoding in which negation flips the sign of the following words. For instance for our vocabulary the encodings become:
	```
	good:                           [0,1,0,0,0,0,0]
	not good:                       [0,-1,0,0,1,0,0]
	not not good:                   [0,1,0,0,0,0,0]
	the movie is not a funny movie: [1,0,0,1,1,-1,-1]
	```
	For best results, you will probably need to construct a list of negative words.
	
	Again train a logistic regression classifier and compare the results to the Bag of Words approach.
	
	Please note that this model still maintains a single parameter for each word, but now the sentence score is
	$$\text{score}(\text{sentence}) = \sum_{w\text{ in sentence}}-1^{\text{count of negations preceeding }w}S_w$$



In [20]:
neg_words = ['no', 'none', 'not', 'never', 'nobody']
# neg_words = ['no']

for full dictionary

In [21]:
cv2 = CustomCountVectorizer(neg_words=neg_words)
train_x2 = cv2.fit_transform(train_df['Phrase'])
test_x2 = cv2.transform(test_df['Phrase'])

In [22]:
%%time
clf2 = LogisticRegression(multi_class='multinomial', max_iter=1000, solver='lbfgs')
clf2.fit(train_x2, train_y*4)

CPU times: user 1min 59s, sys: 3.61 s, total: 2min 2s
Wall time: 51.2 s


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='multinomial', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [23]:
preds_tr2 = clf2.predict(train_x2)

print(f'Confusion matrix:\n{confusion_matrix(train_y*4, preds_tr2)}')
print(f'Accuracy for train: {(train_y*4 == preds_tr2).mean()}')
print(f'Accuracy for test: {(test_y*4 == clf2.predict(test_x2)).mean()}')

Confusion matrix:
[[ 831   32    6    8    1]
 [   2 1779   18   11    0]
 [   2   38 1298   21    3]
 [   5   15   17 1852   13]
 [   2   12   13   26  995]]
Accuracy for train: 0.965
Accuracy for test: 0.37933289731850883


for 2000 most popular words

In [24]:
cv2 = CustomCountVectorizer(neg_words=neg_words, no_words=2000)
train_x2 = cv2.fit_transform(train_df['Phrase'])
test_x2 = cv2.transform(test_df['Phrase'])

In [25]:
%%time
clf2 = LogisticRegression(multi_class='multinomial', max_iter=1000, solver='lbfgs')
clf2.fit(train_x2, train_y*4)

CPU times: user 22.7 s, sys: 1.25 s, total: 24 s
Wall time: 8.31 s


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='multinomial', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [26]:
preds_tr2 = clf2.predict(train_x2)

print(f'Confusion matrix:\n{confusion_matrix(train_y*4, preds_tr2)}')
print(f'Accuracy for train: {(train_y*4 == preds_tr2).mean()}')
print(f'Accuracy for test: {(test_y*4 == clf2.predict(test_x2)).mean()}')

Confusion matrix:
[[ 537  186   62   80   13]
 [  58 1422  113  187   30]
 [  42  253  845  190   32]
 [  31  174  128 1484   85]
 [  12   63   55  176  742]]
Accuracy for train: 0.7185714285714285
Accuracy for test: 0.355134074558535


5. **[5p]** Now also consider emphasizing words such as `very`. They can boost (multiply by a constant >1) the following words.
	Implement learning the modifying multiplier for negation and for emphasis. One way to do this is to introduce a model which has:
	- two modifiers, $N$ for negation and $E$ for emphasis
	- a sentiment score $S_w$ for each word 
And score each sentence as:
$$\text{score}(\text{sentence}) = \sum_{w\text{ in sentence}}N^{\text{#negs prec. }w}E^{\text{#emphs prec. }w}S_w$$

You will need to implement a custom logistic regression model to support it.

In [27]:
enh_words = ['very', 'extremely', 'exceedingly', 'exceptionally', 'especially', 'tremendously', 'immensely', 'vastly', 'hugely', 'extraordinarily', 'extra', 'excessively', 'overly', 'over', 'abundantly', 'inordinately', 'singularly', 'significantly', 'distinctly', 'outstandingly', 'uncommonly', 'unusually', 'decidedly', 'particularly', 'eminently', 'supremely', 'highly', 'remarkably', 'really', 'truly', 'mightily', 'thoroughly', 'all that', 'to a great extent', 'most', 'so', 'too', 'unco', 'très', 'right', 'terrifically', 'awfully', 'terribly', 'devilishly', 'madly', 'majorly', 'seriously', 'desperately', 'mega', 'ultra', 'oh-so', 'too-too', 'stinking', 'mucho', 'damn', 'damned', 'too … for words', 'devilish', 'hellish', 'frightfully', 'ever so', 'well', 'bloody', 'dead', 'dirty', 'jolly', 'fair', 'real', 'mighty', 'powerful', 'awful', 'plumb', 'darned', 'way', 'bitching', 'mad', 'lekker', 'exceeding', 'sore']

In [31]:
%%time
clf3 = CustomLogisticRegression(neg_words=neg_words, enh_words=enh_words, max_iter=200)
clf3.fit(train_df['Phrase'], train_y)

  


CPU times: user 6min 40s, sys: 3min 56s, total: 10min 36s
Wall time: 9min 25s


In [32]:
preds = clf3.predict(train_df['Phrase'])[1]

In [33]:
(np.round(4*preds)/4 == train_y).mean()

0.49328571428571427

6. **[2pb]** Propose, implement, and evaluate an extension to the above model.