# Assignment 1

## Important notes
**Submission deadline:**
* **Thursday, 12.03.2020**

**Points: 13 + 2bp**

This assignment is meant to test your skills in course pre-requisites:  Scientific Python programming and  Machine Learning. If it is hard, I strongly advise you to drop the course.

Please use GitHub’s [pull requests](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/about-pull-requests) and issues to send corrections!

You can solve the assignment in any system you like, but we encourage you to try out [Google Colab](https://colab.research.google.com/).

In [1]:
#setup
import numpy as np
import pandas as pd
from tqdm import tqdm, trange

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from collections import defaultdict

from sklearn.metrics import confusion_matrix

import re
import scipy.optimize as sopt

In [2]:
class CustomCountVectorizer():
    
    def __init__(self, is_negation=False, is_enhance=False, 
                 neg_words=[], enh_words=[], enh_mult=2, no_words=None):
        self.is_negation = is_negation
        self.is_enhance = is_enhance
        self.neg_words = neg_words
        self.enh_words = enh_words
        self.enh_mult = enh_mult
        self.no_words = no_words
    
    
    def fit(self, sentences):
        count = defaultdict(int)
        for s in sentences:
            for w in s.split():
                count[w] += 1
                
        srt = sorted(count.items(), key=lambda kv: kv[1], reverse=True)
        if self.no_words is not None:
            srt = srt[:self.no_words]
        
        self.idx_vocabulary = {w:i for i, (w, c) in enumerate(srt)}
    
    def transform(self, sentences):
        vect = np.zeros((len(sentences), len(self.idx_vocabulary)))

        for i, s in enumerate(sentences):
            neg = 1
            enh = 1
            
            for w in s.split():
                if w not in self.idx_vocabulary:
                    continue
                
                idx = self.idx_vocabulary[w]
                
                if self.is_negation and w in neg_words:
                    neg *= -1
                elif self.is_enhance and w in enh_words:
                    enh *= self.enh_mult
                else:
                    vect[i][idx] += neg * enh
        return vect

In [3]:
class CustomLogisticRegression:
    
    def __init__(self, Theta=None, max_iter=1000, solver=sopt.fmin_l_bfgs_b):
        self.Theta = Theta
        self.max_iter = max_iter
        self.solver = solver
    
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))
    
    def logreg_loss(self, Theta, X, Y):
        Theta = Theta.astype(np.float64)
        X = X.astype(np.float64)
        Y = Y.astype(np.float64)
        
        Z = np.dot(X, Theta.T)

        sig_Z = self.sigmoid(Z)
        Y_ = Y[:,np.newaxis]
        nll = -np.sum((Y_ * np.log2(sig_Z + 1e-50) + (1-Y_) * np.log2(1 - sig_Z + 1e-50)))
        nll += np.sum(Theta**2) / 2
        
        grad = np.dot(X.T, (sig_Z-Y).T)
        grad = grad.reshape(Theta.shape) + Theta

        return nll / len(Y), grad / len(Y)
    
    def fit(self, X, Y):
        Theta = self.Theta
        if Theta is None:
            Theta = np.ones(X.shape[1] + 1)
            
        X = np.hstack((np.ones((X.shape[0], 1)), X))
        Theta = self.solver(lambda th: self.logreg_loss(th, X, Y), 
                            Theta,
                            maxiter=self.max_iter)[0]
        self.Theta = Theta
        
    def predict(self, X):
        X = np.hstack((np.ones((X.shape[0], 1)), X))
        preds = np.dot(self.Theta, X.T)
        return preds, self.sigmoid(preds)

1. **[1p]** Download data competition from a Kaggle competition on sentiment prediction from [[https://www.kaggle.com/c/CountVectorizerent-analysis-on-movie-reviews/data](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data)].  Keep only full sentences, i.e. for each `SenteceId` keep only the entry with the lowest `PhraseId`.  Use first 7000 sentences as a `train set` and the remaining 1529 sentences as the `test set`. 

In [4]:
def transform(s):
    return re.sub('\ *\W\ *', ' ', s.lower())

In [5]:
df = pd.read_csv('data/train.tsv', sep='\t')
df = df.groupby(['SentenceId'], as_index=False).agg({'PhraseId' : 'min',
                                               'Phrase' : 'first',
                                               'Sentiment' : 'first'})
df['Sentiment'] /= 4
df = df.drop(['PhraseId', 'SentenceId'], axis=1)
df['Phrase'] = df['Phrase'].apply(transform)

In [6]:
train_df = df.iloc[:7000]
test_df = df.iloc[7000:]

In [7]:
train_df.head()

Unnamed: 0,Phrase,Sentiment
0,a series of escapades demonstrating the adage ...,0.25
1,this quiet introspective and entertaining inde...,1.0
2,even fans of ismail merchant s work i suspect ...,0.25
3,a positively thrilling combination of ethnogra...,0.75
4,aggressive self glorification and a manipulati...,0.25


2. **[1p]** Prepare the data for logistic regression:
	Map the sentiment scores $0,1,2,3,4$ to a probability of the sentence being by setting $p(\textrm{positive}) = \textrm{sentiment}/4$.
	Build a dictionary of most frequent 20000 words.

In [8]:
word_count = defaultdict(int)
for i in trange(len(train_df), position=0, leave=True):
    for w in train_df.iloc[i]['Phrase'].split():
        word_count[w] += 1

100%|██████████| 7000/7000 [00:01<00:00, 6230.65it/s]


In [9]:
word_count = dict(sorted(word_count.items(), key=lambda kv: kv[1], reverse=True)[:2000])

3. **[3p]** Treat each document as a bag of words. e.g. if the vocabulary is 
	```
	0: the
	1: good
	2: movie
	3: is
	4: not
	5: a
	6: funny
	```
	Then the encodings can be:
	```
	good:                           [0,1,0,0,0,0,0]
	not good:                       [0,1,0,0,1,0,0] 
	the movie is not a funny movie: [1,0,2,1,1,1,1]
	```
    Train a logistic regression model to predict the sentiment. Compute the correlation between the predicted probabilities and the sentiment. Record the most positive and negative words.
    Please note that in this model each word gets its sentiment parameter $S_w$ and the score for a sentence is 
    $$\text{score}(\text{sentence}) = \sum_{w\text{ in sentence}}S_w$$

In [10]:
train_y = train_df['Sentiment']
test_y = test_df['Sentiment']

cv = CountVectorizer()
train_x = cv.fit_transform(train_df['Phrase'])
test_x = cv.transform(test_df['Phrase'])

In [11]:
%%time
clf = LogisticRegression(multi_class='multinomial', max_iter=1000, solver='lbfgs')
clf.fit(train_x, train_y*4)

CPU times: user 6.28 s, sys: 579 ms, total: 6.86 s
Wall time: 2.33 s


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='multinomial', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [12]:
preds_tr = clf.predict(train_x)
print('Sklearn Logistic Regression')
print(f'Confusion matrix:\n{confusion_matrix(train_y*4, preds_tr)}')
print(f'Accuracy for train: {(train_y*4 == preds_tr).mean()}')
print(f'Accuracy for test: {(test_y*4 == clf.predict(test_x)).mean()}')

Sklearn Logistic Regression
Confusion matrix:
[[ 834   28   12    3    1]
 [   6 1771   19   14    0]
 [   3   38 1296   23    2]
 [   2   19   14 1854   13]
 [   1    8   13   28  998]]
Accuracy for train: 0.9647142857142857
Accuracy for test: 0.38848920863309355


In [13]:
c_cv = CustomCountVectorizer()
c_cv.fit(train_df['Phrase'])
c_train_x = c_cv.transform(train_df['Phrase'])
c_test_x = c_cv.transform(test_df['Phrase'])

In [14]:
%%time
c_clf = CustomLogisticRegression()
c_clf.fit(c_train_x, train_y)

CPU times: user 3min 54s, sys: 2min 1s, total: 5min 55s
Wall time: 4min 30s


In [15]:
c_preds_tr = np.round(c_clf.predict(c_train_x)[1]*4)/4
c_preds_te = np.round(c_clf.predict(c_test_x)[1]*4)/4
print('Custom Logistic Regression')
print(f'Confusion matrix:\n{confusion_matrix(train_y*4, c_preds_tr*4)}')
print(f'Accuracy for train: {(train_y == c_preds_tr).mean()}')
print(f'Accuracy for test: {(test_y == c_preds_te).mean()}')

Custom Logistic Regression
Confusion matrix:
[[ 281  554   43    0    0]
 [  33 1213  556    8    0]
 [   1  183 1040  138    0]
 [   1    7  552 1308   34]
 [   0    0   52  611  385]]
Accuracy for train: 0.6038571428571429
Accuracy for test: 0.3597122302158273


4. **[3p]** Now prepare an encoding in which negation flips the sign of the following words. For instance for our vocabulary the encodings become:
	```
	good:                           [0,1,0,0,0,0,0]
	not good:                       [0,-1,0,0,1,0,0]
	not not good:                   [0,1,0,0,0,0,0]
	the movie is not a funny movie: [1,0,0,1,1,-1,-1]
	```
	For best results, you will probably need to construct a list of negative words.
	
	Again train a logistic regression classifier and compare the results to the Bag of Words approach.
	
	Please note that this model still maintains a single parameter for each word, but now the sentence score is
	$$\text{score}(\text{sentence}) = \sum_{w\text{ in sentence}}-1^{\text{count of negations preceeding }w}S_w$$



In [17]:
neg_words = ['no', 'none', 'not', 'never', 'nobody']

In [18]:
cv2 = CustomCountVectorizer(is_negation=True, neg_words=neg_words)
cv2.fit(train_df['Phrase'])
train_x2 = cv2.transform(train_df['Phrase'])
test_x2 = cv2.transform(test_df['Phrase'])

In [20]:
%%time
clf2 = LogisticRegression(multi_class='multinomial', max_iter=1000, solver='lbfgs')
clf2.fit(train_x2, train_y*4)

CPU times: user 1min 57s, sys: 3.64 s, total: 2min
Wall time: 50.7 s


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='multinomial', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [21]:
preds_tr2 = clf2.predict(train_x2)
print('Sklearn Logistic Regression')
print(f'Confusion matrix:\n{confusion_matrix(train_y*4, preds_tr2)}')
print(f'Accuracy for train: {(train_y*4 == preds_tr2).mean()}')
print(f'Accuracy for test: {(test_y*4 == clf2.predict(test_x2)).mean()}')

Sklearn Logistic Regression
Confusion matrix:
[[ 830   29   10    8    1]
 [   3 1778   17   12    0]
 [   5   44 1288   21    4]
 [   5   17   18 1849   13]
 [   2   13   16   26  991]]
Accuracy for train: 0.9622857142857143
Accuracy for test: 0.36625245258338784


In [22]:
%%time
c_clf2 = CustomLogisticRegression()
c_clf2.fit(train_x2, train_y)

CPU times: user 1min 26s, sys: 45.2 s, total: 2min 11s
Wall time: 1min 40s


In [24]:
c_preds_tr2 = np.round(c_clf2.predict(train_x2)[1]*4)/4
c_preds_te2 = np.round(c_clf2.predict(test_x2)[1]*4)/4
print('Custom Logistic Regression')
print(f'Confusion matrix:\n{confusion_matrix(train_y*4, c_preds_tr2*4)}')
print(f'Accuracy for train: {(train_y == c_preds_tr2).mean()}')
print(f'Accuracy for test: {(test_y == c_preds_te2).mean()}')

Custom Logistic Regression
Confusion matrix:
[[ 333  456   84    4    1]
 [ 146 1020  587   56    1]
 [   6  283  816  250    7]
 [   1   39  532 1201  129]
 [   0    1   70  515  462]]
Accuracy for train: 0.5474285714285714
Accuracy for test: 0.35317200784826686


5. **[5p]** Now also consider emphasizing words such as `very`. They can boost (multiply by a constant >1) the following words.
	Implement learning the modifying multiplier for negation and for emphasis. One way to do this is to introduce a model which has:
	- two modifiers, $N$ for negation and $E$ for emphasis
	- a sentiment score $S_w$ for each word 
And score each sentence as:
$$\text{score}(\text{sentence}) = \sum_{w\text{ in sentence}}N^{\text{#negs prec. }w}E^{\text{#emphs prec. }w}S_w$$

You will need to implement a custom logistic regression model to support it.

In [25]:
enh_words = ['extremely', 'exceedingly', 'exceptionally', 'especially', 'tremendously', 'immensely', 'vastly', 'hugely', 'extraordinarily', 'extra', 'excessively', 'overly', 'over', 'abundantly', 'inordinately', 'singularly', 'significantly', 'distinctly', 'outstandingly', 'uncommonly', 'unusually', 'decidedly', 'particularly', 'eminently', 'supremely', 'highly', 'remarkably', 'really', 'truly', 'mightily', 'thoroughly', 'all that', 'to a great extent', 'most', 'so', 'too', 'unco', 'très', 'right', 'terrifically', 'awfully', 'terribly', 'devilishly', 'madly', 'majorly', 'seriously', 'desperately', 'mega', 'ultra', 'oh-so', 'too-too', 'stinking', 'mucho', 'damn', 'damned', 'too … for words', 'devilish', 'hellish', 'frightfully', 'ever so', 'well', 'bloody', 'dead', 'dirty', 'jolly', 'fair', 'real', 'mighty', 'powerful', 'awful', 'plumb', 'darned', 'way', 'bitching', 'mad', 'lekker', 'exceeding', 'sore']

In [28]:
cv3 = CustomCountVectorizer(is_negation=True, neg_words=neg_words, is_enhance=True,
                           enh_words=enh_words, enh_mult=2)
cv3.fit(train_df['Phrase'])
train_x3 = cv3.transform(train_df['Phrase'])
test_x3 = cv3.transform(test_df['Phrase'])

In [29]:
%%time
clf3 = CustomLogisticRegression()
clf3.fit(train_x3, train_y)

CPU times: user 2min 30s, sys: 1min 19s, total: 3min 50s
Wall time: 2min 58s


In [31]:
preds_tr3 = np.round(clf3.predict(train_x3)[1]*4)/4
preds_te3 = np.round(clf3.predict(test_x3)[1]*4)/4
print('Custom Logistic Regression')
print(f'Confusion matrix:\n{confusion_matrix(train_y*4, preds_tr3*4)}')
print(f'Accuracy for train: {(train_y == preds_tr3).mean()}')
print(f'Accuracy for test: {(test_y == preds_te3).mean()}')

Custom Logistic Regression
Confusion matrix:
[[ 274  544   59    1    0]
 [  13 1152  632   13    0]
 [   0  128 1123  111    0]
 [   0    8  591 1277   26]
 [   0    0   51  633  364]]
Accuracy for train: 0.5985714285714285
Accuracy for test: 0.3276651406147809


6. **[2pb]** Propose, implement, and evaluate an extension to the above model.