# Assignment 1

## Important notes
**Submission deadline:**
* **Thursday, 12.03.2020**

**Points: 13 + 2bp**

This assignment is meant to test your skills in course pre-requisites:  Scientific Python programming and  Machine Learning. If it is hard, I strongly advise you to drop the course.

Please use GitHub’s [pull requests](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/about-pull-requests) and issues to send corrections!

You can solve the assignment in any system you like, but we encourage you to try out [Google Colab](https://colab.research.google.com/).

## Assignment text
1. **[1p]** Download data competition from a Kaggle competition on sentiment prediction from [[https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data)].  Keep only full sentences, i.e. for each `SenteceId` keep only the entry with the lowest `PhraseId`.  Use first 7000 sentences as a `train set` and the remaining 1529 sentences as the `test set`. 

2. **[1p]** Prepare the data for logistic regression:
	Map the sentiment scores $0,1,2,3,4$ to a probability of the sentence being by setting $p(\textrm{positive}) = \textrm{sentiment}/4$.
	Build a dictionary of most frequent 20000 words.

3. **[3p]** Treat each document as a bag of words. e.g. if the vocabulary is 
	```
	0: the
	1: good
	2: movie
	3: is
	4: not
	5: a
	6: funny
	```
	Then the encodings can be:
	```
	good:                           [0,1,0,0,0,0,0]
	not good:                       [0,1,0,0,1,0,0] 
	the movie is not a funny movie: [1,0,2,1,1,1,1]
	```
    Train a logistic regression model to predict the sentiment. Compute the correlation between the predicted probabilities and the sentiment. Record the most positive and negative words.
    Please note that in this model each word gets its sentiment parameter $S_w$ and the score for a sentence is 
    $$\text{score}(\text{sentence}) = \sum_{w\text{ in sentence}}S_w$$

4. **[3p]** Now prepare an encoding in which negation flips the sign of the following words. For instance for our vocabulary the encodings become:
	```
	good:                           [0,1,0,0,0,0,0]
	not good:                       [0,-1,0,0,1,0,0]
	not not good:                   [0,1,0,0,0,0,0]
	the movie is not a funny movie: [1,0,0,1,1,-1,-1]
	```
	For best results, you will probably need to construct a list of negative words.
	
	Again train a logistic regression classifier and compare the results to the Bag of Words approach.
	
	Please note that this model still maintains a single parameter for each word, but now the sentence score is
	$$\text{score}(\text{sentence}) = \sum_{w\text{ in sentence}}-1^{\text{count of negations preceeding }w}S_w$$

5. **[5p]** Now also consider emphasizing words such as `very`. They can boost (multiply by a constant >1) the following words.
	Implement learning the modifying multiplier for negation and for emphasis. One way to do this is to introduce a model which has:
	- two modifiers, $N$ for negation and $E$ for emphasis
	- a sentiment score $S_w$ for each word 
And score each sentence as:
$$\text{score}(\text{sentence}) = \sum_{w\text{ in sentence}}N^{\text{\#negs prec. }w}E^{\text{\#emphs prec. }w}S_w$$

You will need to implement a custom logistic regression model to support it.

6. **[2pb]** Propose, implement, and evaluate an extension to the above model.


In [0]:
import csv
import numpy as np
import scipy.optimize as sopt

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [0]:
class new_sentence():
  def __init__(self, row):
    self.id    = row[1]
    self.words = row[2].split()
    self.prob  = float(row[3]) / 4.0

with open('drive/My Drive/Uczelnia/Neurony/train.tsv') as tsvfile:
  reader = csv.reader(tsvfile, delimiter='\t')
  whole_set = []
  words     = []
  for row in reader:
    if row[0] != 'PhraseId':
      sen = new_sentence(row)
      if (len(whole_set) == 0 or sen.id != whole_set[-1].id) and len(sen.words) > 0:
        whole_set.append(sen)
        for word in sen.words:
          words.append(word)

train_set = whole_set[0:7000]
test_set  = whole_set[7000:]
print("Size of training set:", len(train_set))
print("Size of test set:", len(test_set))

Size of training set: 7000
Size of test set: 1529


In [0]:
unique_words = list(set(words))
# count_words = []
# for word in unique_words:
#   num = words.count(word)
#   count_words.append([num, word])

# count_words.sort(reverse = True)
# vocab = count_words[:2000]
# voc = []
# for p in vocab:
#   voc.append(p[1])

voc = unique_words
rev_voc = {}
for i, word in enumerate(voc):
  rev_voc[word] = i
print("Size of vocabulary:", len(voc))

Size of vocabulary: 18132


In [0]:
# Task 3
def logis(encoding, sentiment):
  theta = np.exp(np.int(-1.0) * np.dot(encoding, sentiment))
  return np.int(1.0) / (np.int(1.0) + theta)

def log_loss(encoding, sentiment, true_sentiment):
  h = logis(encoding, sentiment)
  err = np.double(0.0)
  if h != np.double(0.0):
    err -= true_sentiment * np.log(h)
  if h != np.double(1.0):
    err -= (np.double(1.0) - true_sentiment) * np.log(np.double(1.0) - h)
  grad = (h - true_sentiment) * encoding
  return err, grad

def upd_sent(sentiment):
  g_loss = g_grad = np.double(0.0)
  for sentence in train_set:
    enc = np.zeros(len(voc) + 1)
    enc[-1] = np.double(1.0)
    for word in sentence.words:
      if word in voc:
        enc[rev_voc[word]] += np.double(1.0)
    loss, grad = log_loss(enc, sentiment, sentence.prob)
    g_loss += loss
    g_grad += grad
  print(g_loss)
  return g_loss, g_grad

word_sent = sopt.fmin_l_bfgs_b(lambda x: upd_sent(x), np.zeros(len(voc) + 1),
                               maxiter = 30)[0]

4852.030263920194
6682.810661488786
4832.42270415843
4818.071644032589
4773.902116862635
4683.3468953965
4594.711881327117
4497.299781346748
4407.827214460151
4326.51255404781
4236.234257282805
4175.5215826398235
4113.619751579541
4030.1301766571028
4708.862187668917
4020.3971759665806
3971.5194076295334
3948.8750465177463
3922.9282429251693
3897.9841121548516
3865.359790532102
3818.2200511894334
3850.9357327254857
3794.366479807621
3750.617594244316
3721.894481038437
3697.9030127471533
3682.3198604092104
3672.5446576580234
3663.7287940007786
3655.9120827521733
3645.05298447347
3623.8626094171454
3658.706109169818
3608.1224928068036
3587.6271909239363


In [0]:
error = 0.0
mse   = 0.0
for sentence in test_set:
  enc = np.zeros(len(voc) + 1)
  for word in sentence.words:
    if word in voc:
      enc[rev_voc[word]] += np.double(1.0)
  error += log_loss(enc, word_sent, sentence.prob)[0]
  mse   += (sentence.prob - logis(enc, word_sent)) ** 2

print("Final error:", error)
print("Mean square error:", mse)
print("Most negative word:", 
      voc[np.where(word_sent == np.amin(word_sent))[0][0]])
print("Most positive word:", 
      voc[np.where(word_sent == np.amax(word_sent))[0][0]])

Final error: 985.1239688319794
Mean square error: 112.15649172292257
Most negative word: worst
Most positive word: remarkable


In [0]:
# Task 4
neg_words = ['not', 'no', 'neither', 'never', 'no one', 'nobody',
             'none', 'nor', 'nothing', 'nowhere'] # from Cambridge Dictionary

def upd_sent_neg(sentiment):
  g_loss = g_grad = np.double(0.0)
  for sentence in train_set:
    enc = np.zeros(len(voc) + 1)
    word_mean = np.double(1.0)
    enc[-1] = np.double(1.0)
    for word in sentence.words:
      if word in voc:
        enc[rev_voc[word]] += word_mean
      if word in neg_words:
        word_mean *= np.double(-1.0)
    loss, grad = log_loss(enc, sentiment, sentence.prob)
    g_loss += loss
    g_grad += grad
  print(g_loss)
  return g_loss, g_grad

word_sent_neg = sopt.fmin_l_bfgs_b(lambda x: upd_sent_neg(x),
                                   np.zeros(len(voc) + 1), maxiter = 30)[0]

4852.030263920194
6623.704555001974
4830.81464143412
4817.439739341191
4775.581803605916
4688.073818127904
4612.006533919176
4526.571866383059
4442.84377882224
4347.227986459409
4292.927824632196
4235.983020532392
4195.428572665868
4155.692802962758
4085.7699836527618
4065.7853795727115
4007.3837719229646
3994.892680414913
3984.0242641688374
3946.120165980295
3882.7765868056968
3868.5339147146547
3822.7383772814164
3806.287551870615
3779.223938750483
3749.693442828891
3724.8472006308734
3710.8595998510013
3703.8588494037267
3693.0693849272743
3678.841559515922
3657.347072309784
3622.508653495511


In [0]:
err_with_neg = 0.0
mse_with_neg = 0.0
for sentence in test_set:
  enc = np.zeros(len(voc) + 1)
  word_mean = np.double(1.0)
  enc[-1] = np.double(1.0)
  for word in sentence.words:
    if word in voc:
      enc[rev_voc[word]] += word_mean
    if word in neg_words:
      word_mean *= np.double(-1.0)
  err_with_neg += log_loss(enc, word_sent_neg, sentence.prob)[0]
  mse_with_neg += (sentence.prob - logis(enc, word_sent_neg)) ** 2

print("Final error:", err_with_neg)
print("Mean square error:", mse_with_neg)

Final error: 1003.0639712672682
Mean square error: 120.91617474109681


In [0]:
# Task 5
emph_words = ['very', 'really', 'much', 'more', 'extremely']
length = len(voc) + 3     # bias, negation multiplier N, emphasis multiplier E
word_sent_def = np.zeros(length)
word_sent_def[-2] = np.double(-1.0) 
word_sent_def[-1] = np.double(1.5)

def log_loss_param(encoding, sentiment, true_sentiment, deN, deE):
  h = logis(encoding, sentiment)
  err = np.double(0.0)
  if h != np.double(0.0):
    err -= true_sentiment * np.log(h)
  if h != np.double(1.0):
    err -= (np.double(1.0) - true_sentiment) * np.log(np.double(1.0) - h)
  grad = (h - true_sentiment) * encoding
  grad[-2] = np.sum((h - true_sentiment) * np.dot(deN, sentiment))
  grad[-1] = np.sum((h - true_sentiment) * np.dot(deE, sentiment))
  return err, grad

def upd_sent_param(sentiment):
  g_loss = g_grad = np.double(0.0)
  for sentence in train_set:
    enc = np.zeros(length)
    enc[-3] = np.double(1.0)
    N = sentiment[-2]
    E = sentiment[-1]
    num_neg = num_emp = np.double(0.0)
    dN = dE = np.zeros(length)

    for word in sentence.words:
      if word in voc:
        enc[rev_voc[word]] += np.prod(np.power([N, E], [num_neg, num_emp]))
        dN[rev_voc[word]]  += num_neg * np.prod(np.power([N, E], [num_neg - 1, num_emp]))
        dE[rev_voc[word]]  += num_emp * np.prod(np.power([N, E], [num_neg, num_emp - 1]))
      if word in neg_words:
        num_neg += 1
      if word in emph_words:
        num_emp += 1
    loss, grad = log_loss_param(enc, sentiment, sentence.prob, dN, dE)
    g_loss += loss
    g_grad += grad
  print(g_loss)
  return g_loss, g_grad


word_sent_param = sopt.fmin_l_bfgs_b(lambda x: upd_sent_param(x),
                                   word_sent_def, maxiter = 30)[0]

print("Negation parameter:", word_sent_param[-2])
print("Emphasis parameter:", word_sent_param[-1])

4852.030263920194
6742.334748120995
4832.372235535328
4819.602303515584
4779.344591520989
4689.970614881983
4612.211899534437
4527.447412993914
4445.942617123351
4340.733697080622
4306.609613316562
4244.193660429922
4209.953877424586
4172.610532168653
4107.047550604742
4127.038495548291
4067.0915793820163
4031.6219256147447
4012.696044121923
3994.273420364525
3932.6607531045506
3931.811546993564
3903.075998070227
3877.6116028777583
3841.84696604829
3808.2659625238302
3785.3324351671254
3753.0045971694094
3746.510588674248
3739.995547404545
3733.7673484996126
3725.266157999171
3708.449599381723
3678.317795405719
3717.6641419835482
3662.1089207997607
Negation parameter: -0.4198439752434978
Emphasis parameter: 2.0801560247564934


In [0]:
err_param = 0.0
mse_param = 0.0
for sentence in test_set:
  enc = np.zeros(length)
  enc[-3] = np.double(1.0)
  N = word_sent_param[-2]
  E = word_sent_param[-1]
  num_neg = num_emp = np.double(0.0)

  for word in sentence.words:
    if word in voc:
      enc[rev_voc[word]] += np.prod(np.power([N, E], [num_neg, num_emp]))
    if word in neg_words:
      num_neg += 1
    if word in emph_words:
      num_emp += 1
  err_param += log_loss_param(enc, word_sent_param, sentence.prob, 0, 0)[0]
  mse_param += (sentence.prob - logis(enc, word_sent_param)) ** 2

print("Final error:", err_param)
print("Mean square error:", mse_param)

Final error: 997.2152547064817
Mean square error: 117.03937884708144
