# Annotation task: actor classification

## Group members:
- Mariann Burk
- Maxine Hofstetter


## task

* given
    * train, development and test data
    * use train and dev to train and tune your model
    * use test data once at the end of tuning

* specify a basic feedforward net with PyTorch
    * 3 or more layers
    * output layer a single node for binary classification (0 or 1)
    * use BCELoss 
    * use ReLU in the hidden layers
    * determine performance with sklearn metrics

* tune your model
    * check whether layer normalisation improves the result
    * check whether dropout improves it

* use the tuned model to produce a file as the basis for annotation
    * the annotation is carried out on the ***dev set***, not on the test set
    
* determine IAA with Cohen's Kappa
* Merge your annotations into a single gold standard by discussing the cases where you disagreed with your partner

* Create a zip file and upload to olat (exclude word embeddings)

## howto
* load fasttext word embeddings, needed for indexing (code given)
* load the data (code given)
* apply sklearn MLP to have a baseline (code given)
* your task: apply your own MLP, try to tune it
* your task: save the data for annotation
* your task: annotate
* your task: determine the real accuracy on the dev set by replacing the old dev_y with a true_dev_y created from your annotation

# Maxine: I designed the neural net based on these two articles:
https://www.deeplearningwizard.com/deep_learning/practical_pytorch/pytorch_feedforward_neuralnetwork/

https://medium.com/biaslyai/pytorch-introduction-to-neural-network-feedforward-neural-network-model-e7231cff47cb

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# load fasttext

In [None]:
## set path names

# embedding="/home/klenner/Lehre/ml20/cc.de.300.vec" # Manfred

# Maxine: downloaded the embeddings from https://fasttext.cc/docs/en/crawl-vectors.html (Models > German > text); the upload took approx. 2 hours
embedding="/content/drive/MyDrive/ml_colabs/series_5/cc.de.300.vec" 

# mariann: trying a more generic path
# embedding ="/cc.de.300.vec.zip"


# where the data is located (train and test)
# name="named/named_23_klenner/" # Manfred
name="/content/drive/MyDrive/ml_colabs/series_5/" # Maxine

# where the results of the model are written (plain text data)
# this file should be exported, annotated and imported again
filename = 'devX_data_annotation.txt'

In [None]:
import numpy as np

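def load_emb_from_file(filepath):
    """Read a fastText .vec text file into a word -> row lookup and a list of vectors."""
    word_to_index = {}
    embeddings = []
    with open(filepath, "r", encoding="utf-8") as fp:
        for line in fp:
            parts = line.rstrip().split(" ") # each line: word num1 num2 ...
            if len(parts) == 2: # skip the header line "<vocab size> <dimension>"
                continue
            word_to_index[parts[0]] = len(embeddings) # row of this word in embeddings
            embedding_i = np.array([float(val) for val in parts[1:]])
            embeddings.append(embedding_i)
    return word_to_index, embeddings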

In [None]:
widx,emb=load_emb_from_file(embedding)
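
Loading the complete embedding file keeps every vector in memory, although only the nouns in the train, dev and test sets are ever looked up. The following is a minimal sketch of a loader that keeps only the words it is asked for. The name `load_emb_subset` and the toy file are assumptions made for illustration and are not part of the given code.

```python
import os
import tempfile

import numpy as np


def load_emb_subset(filepath, vocabulary):
    """Load vectors only for the words in `vocabulary` (hypothetical helper)."""
    word_to_index = {}
    embeddings = []
    with open(filepath, "r", encoding="utf-8") as fp:
        for line in fp:
            parts = line.rstrip().split(" ")
            if len(parts) == 2:  # header line: "<vocab size> <dimension>"
                continue
            word = parts[0]
            if word in vocabulary:
                word_to_index[word] = len(embeddings)
                embeddings.append(np.array([float(val) for val in parts[1:]]))
    return word_to_index, embeddings


# tiny made-up file in the .vec layout, only to show the call
toy_content = "3 2\nTomate 0.1 0.2\nLehrer 0.3 0.4\nHaus 0.5 0.6\n"
with tempfile.NamedTemporaryFile("w", suffix=".vec", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write(toy_content)

toy_widx, toy_emb = load_emb_subset(tmp.name, {"Tomate", "Haus"})
os.remove(tmp.name)
print(toy_widx)  # {'Tomate': 0, 'Haus': 1}
```

With the real data, the vocabulary would be something like `set(trainX) | set(devX) | set(testX)`, which requires loading the pickled data before the embeddings.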

# load data

* train, dev and test sets
* dataX: nouns per line
* datay: 0 (non-actors), 1 (actors)

In [None]:
# transform data to fasttext  embeddings
def w2e(data):
    out=[]
    for w in data:
        out.append(emb[widx[w]])
    return out

In [None]:
import pickle

pickle_in = open(name+"devy_data.pickle","rb")
devy = pickle.load(pickle_in)
pickle_in = open(name+"devX_data.pickle","rb")
devX = pickle.load(pickle_in)

pickle_in = open(name+"trainX_data.pickle","rb")
trainX = pickle.load(pickle_in)
pickle_in = open(name+"trainy_data.pickle","rb")
trainy = pickle.load(pickle_in)

pickle_in = open(name+"testy_data.pickle","rb") # Maxine: I added 'name+' for test data to make colab find the files
testy = pickle.load(pickle_in)
pickle_in = open(name+"testX_data.pickle","rb")
testX = pickle.load(pickle_in)

# transform data to fasttext embeddings
X_train=w2e(trainX)
X_dev=w2e(devX)
X_test=w2e(testX)

y_dev=devy
y_train=trainy
y_test=testy

In [None]:
# Maxine: datasets are lists now
len(X_train), len(y_train)

(14953, 14953)

# determine baseline

In [None]:
from sklearn.neural_network import MLPClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support

import torch
import torch.optim as optim
from torch.nn.parameter import Parameter
import torch.nn as nn

In [None]:
# this is a sklearn baseline

clf = MLPClassifier(random_state=1, hidden_layer_sizes=(300,100), max_iter=300).fit(X_train, y_train)

y_pred_dev = clf.predict(X_dev)
accuracy_score(y_dev, y_pred_dev), precision_recall_fscore_support(y_dev, y_pred_dev)

(0.9180064308681672,
 (array([0.94663573, 0.85340314]),
  array([0.93577982, 0.87634409]),
  array([0.94117647, 0.86472149]),
  array([436, 186])))

In [None]:
# TODO implement a pytorch NN implementation for the task here

    # requirements: 
    # 3 or more layers 
    # output layer a single node for binary classification (0 or 1)
    # use BCELoss
    # use ReLU in the hidden layers 
    # determine performance with sklearn metrics

class FeedforwardNN(torch.nn.Module):
  def __init__(self, input_size, hidden_size, output_size):
    super(FeedforwardNN, self).__init__()

    # three hidden layers, each followed by a ReLU
    self.fc1 = torch.nn.Linear(input_size, hidden_size)
    self.relu_1 = torch.nn.ReLU()

    self.fc2 = torch.nn.Linear(hidden_size, hidden_size)
    self.relu_2 = torch.nn.ReLU()

    self.fc3 = torch.nn.Linear(hidden_size, hidden_size)
    self.relu_3 = torch.nn.ReLU()

    # output layer: a single node for binary classification
    self.fc4 = torch.nn.Linear(hidden_size, output_size)

  def forward(self, x):
    # hidden layers use ReLU, as required by the task
    out = self.relu_1(self.fc1(x))
    out = self.relu_2(self.fc2(out))
    out = self.relu_3(self.fc3(out))

    # BCELoss expects probabilities, so the output node is squashed into (0, 1)
    out = torch.sigmoid(self.fc4(out))

    return out

In [None]:
# instantiate the model

#input_size = len(X_train)
input_size = 300 # Maxine: the input size is the dimension of one embedding (cc.de.300.vec has 300 dimensions), not the number of training examples
hidden_size = 20
output_size = 1 # Maxine: see instructions: output layer a single node for binary classification (0 or 1)
model = FeedforwardNN(input_size, hidden_size, output_size)
model

FeedforwardNN(
  (fc1): Linear(in_features=300, out_features=20, bias=True)
  (relu_1): ReLU()
  (fc2): Linear(in_features=20, out_features=20, bias=True)
  (relu_2): ReLU()
  (fc3): Linear(in_features=20, out_features=20, bias=True)
  (relu_3): ReLU()
  (fc4): Linear(in_features=20, out_features=1, bias=True)
)
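
The task asks whether layer normalisation and dropout improve the result. One way to try this is sketched below, assuming a variant class in which both can be switched on and off. The class name `TunedFeedforwardNN`, the flag `use_layer_norm` and the dropout rate of 0.2 are illustrative choices, not values from the assignment.

```python
import torch
import torch.nn as nn


class TunedFeedforwardNN(nn.Module):
    """Hypothetical variant with optional layer normalisation and dropout."""

    def __init__(self, input_size, hidden_size, output_size,
                 use_layer_norm=True, dropout=0.2):
        super().__init__()
        layers = []
        in_features = input_size
        for _ in range(3):  # three hidden layers, as in FeedforwardNN
            layers.append(nn.Linear(in_features, hidden_size))
            if use_layer_norm:
                layers.append(nn.LayerNorm(hidden_size))
            layers.append(nn.ReLU())
            if dropout > 0:
                layers.append(nn.Dropout(dropout))
            in_features = hidden_size
        layers.append(nn.Linear(hidden_size, output_size))
        layers.append(nn.Sigmoid())  # probabilities, as BCELoss expects
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)


tuned_model = TunedFeedforwardNN(300, 20, 1)
tuned_model.eval()  # switches dropout off for evaluation
with torch.no_grad():
    probs = tuned_model(torch.randn(4, 300))
print(probs.shape)  # torch.Size([4, 1])
```

Dropout behaves differently during training and evaluation, so `tuned_model.train()` has to be called before the training loop and `tuned_model.eval()` before predicting on the dev or test set.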

In [None]:
# hyperparameters

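#criterion = nn.BCEWithLogitsLoss() # Maxine: this loss applies the sigmoid itself and would be the choice for a model that returns raw scores
# we are asked to use BCELoss, which expects probabilities between 0 and 1, so the model's output has to pass through a sigmoid
criterion = nn.BCELoss() 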

learning_rate = 0.001

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
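
The relation between the two loss functions can be checked on a few made-up values. This is a small sketch, not part of the assignment code: `BCELoss` applied to sigmoid outputs should give the same value as `BCEWithLogitsLoss` applied to the raw scores, up to floating-point error.

```python
import torch
import torch.nn as nn

# made-up raw scores of an output node and their gold labels
logits = torch.tensor([[-2.0], [0.5], [3.0]])
targets = torch.tensor([[0.0], [1.0], [1.0]])

# BCELoss needs probabilities, so the sigmoid is applied first
loss_bce = nn.BCELoss()(torch.sigmoid(logits), targets)

# BCEWithLogitsLoss applies the sigmoid internally
loss_with_logits = nn.BCEWithLogitsLoss()(logits, targets)

print(loss_bce.item(), loss_with_logits.item())
print(torch.allclose(loss_bce, loss_with_logits))
```

Passing raw scores to `BCELoss` directly does not work, because values outside the range 0 to 1 are not valid probabilities.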

In [None]:
# Maxine: I think we have to convert the data to tensors

'''
# approach with tensors > x and y are going to be separate datasets
X_train = torch.FloatTensor(X_train)
X_dev = torch.FloatTensor(X_dev)
X_test = torch.FloatTensor(X_test)

y_train = torch.FloatTensor(y_train)
y_dev = torch.FloatTensor(y_dev)
y_test = torch.FloatTensor(y_test)

'''
# approach with DataLoader > x and y are going to form one set
# prepare data: turn into tensors
from torch.utils.data import DataLoader
X_train_tensor = torch.FloatTensor(np.array(X_train))
X_test_tensor = torch.FloatTensor(np.array(X_test))
X_dev_tensor = torch.FloatTensor(np.array(X_dev))
y_train_tensor = torch.Tensor(np.array(y_train))
y_test_tensor = torch.Tensor(np.array(y_test))
y_dev_tensor = torch.Tensor(np.array(y_dev))

# prepare data: turn into tensor datasets
train_dataset = torch.utils.data.TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = torch.utils.data.TensorDataset(X_test_tensor, y_test_tensor)
dev_dataset = torch.utils.data.TensorDataset(X_dev_tensor, y_dev_tensor)

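# turn into DataLoaders
# only the training data is shuffled, so that dev and test predictions stay in the same order as devX and testX
train_dl = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_dl = DataLoader(test_dataset, batch_size=64, shuffle=False)
dev_dl = DataLoader(dev_dataset, batch_size=64, shuffle=False)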


In [None]:
# Maxine: imports we need to get acc (like in series_3)
from numpy import vstack
from sklearn.metrics import accuracy_score

In [None]:
# Maxine: TODO: add accuracy measure 
'''
# approach without iterating over x, y in enumerate(dl)
epochs = 100
for epoch in range(epochs):
  optimizer.zero_grad()

  y_pred = model(X_train)

  # Maxine: I did this because it was of type Long before but it kept asking for float
  y_pred = y_pred.to(device).float()
  y_train= y_train.to(device).float()

  loss = criterion(y_pred.squeeze(), y_train)

  # accuracy >> fails because of tensor shape: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
  y_pred = y_pred.detach().numpy()
  y_pred = y_pred.round()
  actual = y_train.numpy()
  actual  = actual.reshape((len(actual), 1))

  y_pred, actual = vstack(y_pred), vstack(actual)
  acc = accuracy_score(actual, y_pred) 

  print('Epoch {}: train loss: {} accuracy {}'.format(epoch, loss.item(), acc))

  loss.backward()
  optimizer.step()
'''
# mariann: added the iteration count for understanding purposes
# train the model and report the average dev loss after every epoch
def train(epochs):
    for i in range(epochs):
        model.train()
        for batch, target in train_dl:
            optimizer.zero_grad()

            y_pred = model(batch)

            # mariann: the target is unsqueezed (instead of squeezing the prediction)
            # so that both have the shape (batch_size, 1)
            loss = criterion(y_pred, target.unsqueeze(1))

            loss.backward()

            optimizer.step()

        # validation: no gradients needed, the loss is summed over all dev batches
        model.eval()
        loss_validation = 0.0
        with torch.no_grad():
            for batch, target in dev_dl:
                y_pred = model(batch)
                loss_validation += criterion(y_pred, target.unsqueeze(1)).item() * len(batch)
        print('Epoch {}: dev loss {}'.format(i, loss_validation / len(dev_dataset)))

train(50)

# predict the complete dev and test sets in their original order
model.eval()
with torch.no_grad():
    y_pred = model(X_dev_tensor)         # dev probabilities, shape (n, 1)
    y_pred_test = model(X_test_tensor)   # test probabilities, shape (n, 1)

# the single output node is a probability, so it is thresholded at 0.5
# (torch.max over dimension 1 of a one-column output always returns index 0)
dev_labels = (y_pred.squeeze(1) >= 0.5).int().numpy()
test_labels = (y_pred_test.squeeze(1) >= 0.5).int().numpy()

# performance on the dev set with sklearn metrics
print(accuracy_score(y_dev, dev_labels), precision_recall_fscore_support(y_dev, dev_labels))

# mariann: calculate accuracy on the test set (to be used only once, at the end of tuning)
accuracy = 100 * accuracy_score(y_test, test_labels)
print('Accuracy: {}'.format(accuracy))

In [None]:
# set threshold to convert prediction either to 1 or 0
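
A minimal sketch of this conversion, assuming the model returns one probability per noun. The helper name `to_labels` and the example values are made up for illustration.

```python
import numpy as np


def to_labels(probabilities, threshold=0.5):
    """Map probabilities to class labels: 1 if >= threshold, else 0."""
    probabilities = np.asarray(probabilities).reshape(-1)  # (n, 1) -> (n,)
    return (probabilities >= threshold).astype(int)


# made-up model outputs in the shape (n, 1) that a single output node produces
example_probs = np.array([[0.91], [0.08], [0.50], [0.49]])
print(to_labels(example_probs))  # [1 0 1 0]
```

The threshold of 0.5 is the usual default. A different value trades precision against recall and could itself be tuned on the dev set.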

# write to file for annotation

In [None]:
# one noun per line incl. the predicted class, e.g. "Tomate 1"
# nouns and labels are in the same file to ease annotation
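# predict the complete dev set in its original order and threshold the probabilities at 0.5
model.eval()
with torch.no_grad():
    y_pred = (model(X_dev_tensor).squeeze(1) >= 0.5).int().numpy()

with open(name+filename,"w") as file:
    for w,label in zip(devX, y_pred):
        file.write(w)
        file.write(" ")
        file.write(str(label))  # y_pred are the predictions of your tuned model
        file.write("\n")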

In [None]:
print(file)

<_io.TextIOWrapper name='/content/drive/MyDrive/ml_colabs/series_5/devX_data_annotation.txt' mode='w' encoding='UTF-8'>


# read annotated data 

* output data are lines like "Tomate 1" (1 is the predicted class)
* 1 is actor, 0 is non-actor, so the prediction "Tomate 1" is wrong (a tomato is not an actor)
* use "x" in front of a wrong prediction as annotation, e.g. "xTomate 1"
* annotate only incorrect predictions

In [None]:
# reads your annotated data back into true_dev_y
# filename is the name of the file of your annotated data
# creates true_dev_y: your annotated data as a gold standard

h,a=0,0   # h: predictions left unchanged, a: all lines read

true_dev_y=[]
with open(name+filename,"r") as file:
    for line in file:
        w,l=line.split(" ")
        l=l.strip("\n")
        a+=1
        if w[0]=='x' and l=='1':   # prediction was wrong, actually the label is 0
            true_dev_y.append(0)
        elif w[0]=='x' and l=='0': # prediction was wrong, actually the label is 1
            true_dev_y.append(1)
        else:                      # prediction was correct
            true_dev_y.append(int(l))
            h+=1
    

ValueError: ignored

# how good is your best model really?

* apply your best model to your new gold standard (the corrected dev file)

In [None]:
# compare it first to the baseline

clf = MLPClassifier(random_state=1, hidden_layer_sizes=(300,100), max_iter=300).fit(X_train, y_train)

y_pred_dev=clf.predict(X_dev)
accuracy_score(true_dev_y, y_pred_dev),precision_recall_fscore_support(true_dev_y, y_pred_dev)

In [None]:
# TODO: Calculate the inter-annotator agreement between the 2 members of the group using Cohen's kappa
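
Cohen's kappa corrects the observed agreement p_o for the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The following sketch uses `cohen_kappa_score` from sklearn on two made-up label lists. The variable names and the labels are assumptions, and in the notebook the two lists would come from reading each member's annotated file. In the toy data the annotators agree on 6 of 8 nouns (p_o = 0.75) and each assigns half of the nouns to class 1 (p_e = 0.5), which gives a kappa of 0.5.

```python
from sklearn.metrics import cohen_kappa_score

# made-up labels of two annotators for the same eight nouns
labels_annotator_1 = [1, 0, 1, 1, 0, 0, 1, 0]
labels_annotator_2 = [1, 0, 0, 1, 0, 1, 1, 0]

kappa = cohen_kappa_score(labels_annotator_1, labels_annotator_2)
print(kappa)
```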

In [None]:
# TODO: Merge your annotations into a single gold standard by discussing the cases where you disagreed with your partner
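
One possible way to organise the merge, sketched under the assumption that both annotations are available as label lists in the same order as the nouns. The helpers `find_disagreements` and `merge_annotations`, the toy nouns and the `resolved` dictionary are hypothetical. The decisions in `resolved` stand for the outcome of the discussion and are keyed by position, so that a noun occurring twice is not mixed up.

```python
def find_disagreements(nouns, labels_1, labels_2):
    """Return (position, noun, label_1, label_2) for every disagreement."""
    return [(i, noun, l1, l2)
            for i, (noun, l1, l2) in enumerate(zip(nouns, labels_1, labels_2))
            if l1 != l2]


def merge_annotations(labels_1, labels_2, resolved):
    """Keep agreed labels and take discussed labels from `resolved`."""
    gold = []
    for i, (l1, l2) in enumerate(zip(labels_1, labels_2)):
        if l1 == l2:
            gold.append(l1)
        else:
            gold.append(resolved[i])  # KeyError if a case was not discussed yet
    return gold


# made-up example data
nouns = ["Tomate", "Lehrer", "Haus", "Firma"]
labels_1 = [0, 1, 0, 1]
labels_2 = [0, 1, 0, 0]

open_cases = find_disagreements(nouns, labels_1, labels_2)
print(open_cases)  # [(3, 'Firma', 1, 0)]

resolved = {3: 1}  # outcome of the discussion for position 3
gold_standard = merge_annotations(labels_1, labels_2, resolved)
print(gold_standard)  # [0, 1, 0, 1]
```

The resulting list can then take the place of `true_dev_y` when the models are evaluated against the merged gold standard.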