# Assignment 1

## Important notes
**Submission deadline:**
* **Thursday, 12.03.2020**

**Points: 13 + 2bp**

This assignment is meant to test your skills in course pre-requisites:  Scientific Python programming and  Machine Learning. If it is hard, I strongly advise you to drop the course.

Please use GitHub’s [pull requests](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/about-pull-requests) and issues to send corrections!

You can solve the assignment in any system you like, but we encourage you to try out [Google Colab](https://colab.research.google.com/).

In [1]:
#setup
import numpy as np
import pandas as pd
from tqdm import tqdm, trange

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from collections import defaultdict

from sklearn.metrics import confusion_matrix

import re
import scipy.optimize as sopt

1. **[1p]** Download data competition from a Kaggle competition on sentiment prediction from [[https://www.kaggle.com/c/CountVectorizerent-analysis-on-movie-reviews/data](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data)].  Keep only full sentences, i.e. for each `SenteceId` keep only the entry with the lowest `PhraseId`.  Use first 7000 sentences as a `train set` and the remaining 1529 sentences as the `test set`. 

In [2]:
def transform(s):
    return re.sub('\ *\W\ *', ' ', s.lower())

In [3]:
df = pd.read_csv('data/train.tsv', sep='\t')
df = df.groupby(['SentenceId'], as_index=False).agg({'PhraseId' : 'min',
                                               'Phrase' : 'first',
                                               'Sentiment' : 'first'})
df['Sentiment'] /= 4
df = df.drop(['PhraseId', 'SentenceId'], axis=1)
df['Phrase'] = df['Phrase'].apply(transform)

In [4]:
train_df = df.iloc[:7000]
test_df = df.iloc[7000:]

In [5]:
train_df.head()

Unnamed: 0,Phrase,Sentiment
0,a series of escapades demonstrating the adage ...,0.25
1,this quiet introspective and entertaining inde...,1.0
2,even fans of ismail merchant s work i suspect ...,0.25
3,a positively thrilling combination of ethnogra...,0.75
4,aggressive self glorification and a manipulati...,0.25


2. **[1p]** Prepare the data for logistic regression:
	Map the sentiment scores $0,1,2,3,4$ to a probability of the sentence being by setting $p(\textrm{positive}) = \textrm{sentiment}/4$.
	Build a dictionary of most frequent 20000 words.

In [6]:
word_count = defaultdict(int)
for i in trange(len(train_df), position=0, leave=True):
    for w in train_df.iloc[i]['Phrase'].split():
        word_count[w] += 1

100%|██████████| 7000/7000 [00:01<00:00, 5522.38it/s]


In [7]:
word_count = dict(sorted(word_count.items(), key=lambda kv: kv[1], reverse=True)[:2000])

3. **[3p]** Treat each document as a bag of words. e.g. if the vocabulary is 
	```
	0: the
	1: good
	2: movie
	3: is
	4: not
	5: a
	6: funny
	```
	Then the encodings can be:
	```
	good:                           [0,1,0,0,0,0,0]
	not good:                       [0,1,0,0,1,0,0] 
	the movie is not a funny movie: [1,0,2,1,1,1,1]
	```
    Train a logistic regression model to predict the sentiment. Compute the correlation between the predicted probabilities and the sentiment. Record the most positive and negative words.
    Please note that in this model each word gets its sentiment parameter $S_w$ and the score for a sentence is 
    $$\text{score}(\text{sentence}) = \sum_{w\text{ in sentence}}S_w$$

In [157]:
class CustomCountVectorizer():
    
    def __init__(self, idx_vocabulary=None, is_negation=False, is_enhance=False, 
                 neg_words=[], enh_words=[], enh_mult=2):
        self.is_negation = is_negation
        self.is_enhance = is_enhance
        self.idx_vocabulary = idx_vocabulary
        self.neg_words = neg_words
        self.enh_words = enh_words
        self.enh_mult = enh_mult
        
    def __find_idx_vocabulary(self, sentences):
        voc = set()
        for s in sentences:
            for w in s.split():
                if w.isalpha():
                    voc.add(w)
        self.idx_vocabulary = {w:i for i, w in enumerate(list(voc))}
    
    def transform(self, sentences):
        if self.idx_vocabulary is None:
            self.__find_idx_vocabulary(sentences)
            
        vect = np.zeros((len(sentences), len(self.idx_vocabulary)),
                       dtype=np.int32)

        for i, s in tqdm(enumerate(sentences), desc='transform', position=0, 
                         leave=True, total=len(sentences)):
            neg = 1
            enh = 1
            
            for w in s.split():
                if not w.isalpha():
                    continue
                idx = self.idx_vocabulary[w]
                vect[i][idx] += 1
#                 if self.is_negation and w in neg_words:
#                     neg *= -1
#                 elif self.is_enhance and w in enh_words:
#                     enh *= self.enh_mult
#                 elif w in self.vocabulary:
#                     idx = np.where(self.vocabulary == w)
#                     vect[i][idx] += neg * enh
        return vect

In [11]:
def compute_error(model, train_x, train_y, test_x, test_y):
    print(f'Train error: {np.count_nonzero(model.predict(train_x) != train_y) / len(train_y)}')
    print(f'Test error:  {np.count_nonzero(model.predict(test_x) != test_y) / len(test_y)}')

In [12]:
vocabulary = np.array(list(word_count.keys()))

cv = CustomCountVectorizer(vocabulary=vocabulary)

train_x = cv.transform(train_df['Phrase'])
train_y = train_df['Sentiment'] * 4

test_x = cv.transform(test_df['Phrase'])
test_y = test_df['Sentiment'] * 4

transform: 100%|██████████| 7000/7000 [00:05<00:00, 1298.95it/s]
transform: 100%|██████████| 1529/1529 [00:01<00:00, 1305.12it/s]


In [13]:
%%time
clf = LogisticRegression(multi_class='multinomial', max_iter=1000, solver='lbfgs')
clf.fit(train_x, train_y)

CPU times: user 24.8 s, sys: 1.29 s, total: 26.1 s
Wall time: 8.94 s


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='multinomial', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [14]:
compute_error(clf, train_x, train_y, test_x, test_y)

Train error: 0.25971428571428573
Test error:  0.631131458469588


In [83]:
class MyCountVectorizer:
    def __init__(self, min_df=-1, max_df=1e18, negative=False, enhance=False):
        self.min_df = min_df
        self.max_df = max_df
        self.negative = negative
        self.enhence_coeff = 5
        self.enhance = enhance
        
    def fit(self, df):
        words_cnt = defaultdict(int)
        col = df.columns[0]
        
        for i in range(len(df)):
            text = df.iloc[i][col]
            for word in text.split():
                words_cnt[word] += 1
                
        all_words = []
        for word, cnt in words_cnt.items():
            if self.min_df <= cnt <= self.max_df:
                all_words.append(word)
                
        self.all_words_ids = {w:i for i,w in enumerate(all_words)}
        self.width = len(all_words)
        
    
    def transform(self, df):
        col = df.columns[0]
        count_matrix = np.zeros([len(df), self.width], \
                                dtype=np.int32)
        
        for i in range(len(df)):
            text = df.iloc[i][col].split()

            cnt_neg = 0
            for word in text:
                if word in self.all_words_ids:
                    pos = self.all_words_ids[word]
                    count_matrix[i][pos] += 1
                    
        return count_matrix

In [84]:
cvD = MyCountVectorizer()
cvD.fit(train_df)
trD = cvD.transform(train_df)

In [85]:
lrD = CustomLogisticRegression()
lrD.fit(trD, train_y/4)
(np.round(lrD.predict(trD)[1]*4) == train_y).mean()

0.6038571428571429

In [161]:
len(cvD.all_words_ids), len(cvG.idx_vocabulary)

(13831, 13695)

In [158]:
cvG = CustomCountVectorizer()
trG = cvG.transform(train_df['Phrase'])

lrG = CustomLogisticRegression()
lrG.fit(trG, train_y/4)
(np.round(lrG.predict(trG)[1]*4) == train_y).mean()

transform: 100%|██████████| 7000/7000 [00:00<00:00, 12448.25it/s]


0.46085714285714285

KeyError: ','

In [71]:
# cv = MyCountVectorizer()
# cv.fit(train_df)

# X_train = cv.transform(train_df) 
# X_test = cv.transform(test_df)

cv = CustomCountVectorizer()
X_train = cv.transform(train_df['Phrase'])

0.6038571428571429

In [99]:
np.round(tmp_pr*4)/4

array([0.  , 1.  , 0.5 , ..., 0.75, 0.75, 0.75])

In [13]:
most_negs = vocabulary[clf.coef_[0].argsort()[-5:]]
most_poss = vocabulary[clf.coef_[-1].argsort()[-5:]]
print(f'Most negative words: {most_negs}')
print(f'Most positive words: {most_poss}')

Most negative words: ['incoherent' 'dull' 'stupid' 'devoid' 'worst']
Most positive words: ['entertaining' 'best' 'beautifully' 'remarkable' 'masterpiece']


4. **[3p]** Now prepare an encoding in which negation flips the sign of the following words. For instance for our vocabulary the encodings become:
	```
	good:                           [0,1,0,0,0,0,0]
	not good:                       [0,-1,0,0,1,0,0]
	not not good:                   [0,1,0,0,0,0,0]
	the movie is not a funny movie: [1,0,0,1,1,-1,-1]
	```
	For best results, you will probably need to construct a list of negative words.
	
	Again train a logistic regression classifier and compare the results to the Bag of Words approach.
	
	Please note that this model still maintains a single parameter for each word, but now the sentence score is
	$$\text{score}(\text{sentence}) = \sum_{w\text{ in sentence}}-1^{\text{count of negations preceeding }w}S_w$$



In [64]:
neg_words = ['no', 'none', 'not', 'never', 'nobody']

In [15]:
cv2 = CustomCountVectorizer(vocabulary=vocabulary, is_negation=True, neg_words=neg_words)
train_x2 = cv2.transform(train_df['Phrase'])
test_x2 = cv2.transform(test_df['Phrase'])

transform: 100%|██████████| 7000/7000 [00:05<00:00, 1227.55it/s]
transform: 100%|██████████| 1529/1529 [00:01<00:00, 1185.62it/s]


In [16]:
%%time
clf2 = LogisticRegression(multi_class='multinomial', max_iter=1000, solver='lbfgs')
clf2.fit(train_x2, train_y)

CPU times: user 23.8 s, sys: 1.29 s, total: 25 s
Wall time: 8.58 s


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='multinomial', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [19]:
compute_error(clf2, train_x2, train_y, test_x2, test_y)

Train error: 0.2805714285714286
Test error:  0.6402877697841727


5. **[5p]** Now also consider emphasizing words such as `very`. They can boost (multiply by a constant >1) the following words.
	Implement learning the modifying multiplier for negation and for emphasis. One way to do this is to introduce a model which has:
	- two modifiers, $N$ for negation and $E$ for emphasis
	- a sentiment score $S_w$ for each word 
And score each sentence as:
$$\text{score}(\text{sentence}) = \sum_{w\text{ in sentence}}N^{\text{#negs prec. }w}E^{\text{#emphs prec. }w}S_w$$

You will need to implement a custom logistic regression model to support it.

In [45]:
class CustomLogisticRegression:
    
    def __init__(self, Theta=None, max_iter=1000, solver=sopt.fmin_l_bfgs_b):
        self.Theta = Theta
        self.max_iter = max_iter
        self.solver = solver
    
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))
    
    def logreg_loss(self, Theta, X, Y):
        Theta = Theta.astype(np.float64)
        X = X.astype(np.float64)
        Y = Y.astype(np.float64)
        
        Z = np.dot(X, Theta.T)

        sig_Z = self.sigmoid(Z)
        Y_ = Y[:,np.newaxis]
        nll = -np.sum((Y_ * np.log2(sig_Z + 1e-50) + (1-Y_) * np.log2(1 - sig_Z + 1e-50)))
        nll += np.sum(Theta**2) / 2
        
        grad = np.dot(X.T, (sig_Z-Y).T)
        grad = grad.reshape(Theta.shape) + Theta

        return nll / len(Y), grad / len(Y)
    
    def fit(self, X, Y):
        Theta = self.Theta
        if Theta is None:
            Theta = np.ones(X.shape[1] + 1)
            
        X = np.hstack((np.ones((X.shape[0], 1)), X))
        Theta = self.solver(lambda th: self.logreg_loss(th, X, Y), 
                            Theta,
                            maxiter=self.max_iter)[0]
        self.Theta = Theta
        
    def predict(self, X):
        X = np.hstack((np.ones((X.shape[0], 1)), X))
        preds = np.dot(self.Theta, X.T)
        return preds, self.sigmoid(preds)

In [51]:
train_y = np.array(train_y) / np.max(train_y)

enh_mult = 5
cv3 = CustomCountVectorizer(vocabulary=voc, is_negation=True, is_enhance=True, 
                           neg_words=neg_words, enh_words=enh_words, enh_mult=enh_mult)
train_x3 = cv3.transform(train_df['Phrase'])
test_x3 = cv3.transform(test_df['Phrase'])

transform: 100%|██████████| 7000/7000 [00:05<00:00, 1336.82it/s]
transform: 100%|██████████| 1529/1529 [00:01<00:00, 1343.99it/s]


In [55]:
%%time
clf3 = CustomLogisticRegression()
clf3.fit(train_x3, train_y)

  if __name__ == '__main__':


CPU times: user 33.5 s, sys: 18 s, total: 51.4 s
Wall time: 46 s


In [56]:
tr_preds = clf3.predict(train_x3)
te_preds = clf3.predict(test_x3)

In [57]:
tr_err = (np.round(tr_preds[1]*4)/4 != train_y).mean()
te_err = (np.round(te_preds[1]*4)/4 != test_y).mean()
print(f'Train err: {tr_err}\nTest err: {te_err}')

Train err: 0.6297142857142857
Test err: 0.9692609548724657


6. **[2pb]** Propose, implement, and evaluate an extension to the above model.