# Phase 2: Generation of Difficult Cases

The goal of this phase is to generate difficult instances for the task of sentiment analysis. The requirements differ slightly between the two task types (classification versus sequence labeling); pick the task that you built your baseline model for in phase 1.

In both cases you should participate in assignment 3. In other words, you will do either assignments 1 and 3 or assignments 2 and 3.


#### How to Generate the Samples
There are three main methods to generate the samples:
* You can use the Checklist paper code: https://github.com/marcotcr/checklist
* You can write code yourself to generate the samples. You can make use of any method you prefer, including a POS-tagger, word embeddings and contextualized embeddings
* You can generate samples manually

For each of these strategies, you should think of a variety of types of difficult cases (so that the whole set does not consist of the same type of sample), like the categories in Table 1 of the CheckList paper.
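As an illustration of the second strategy (writing your own generation code), here is a minimal template-based sketch. The word lists, templates and helper names (`POSITIVE`, `NEGATIVE`, `make_case`, `generate_cases`) are hypothetical and only show how several categories can be covered; gold labels produced this way should still be checked by hand.

```python
import json

# Hypothetical mini-lexicons, for illustration only.
POSITIVE = ["great", "wonderful"]
NEGATIVE = ["boring", "terrible"]

def make_case(text, sentiment, category):
    """Build one instance with the three required keys."""
    return {"reviewText": text, "sentiment": sentiment, "category": category}

def generate_cases():
    cases = []
    # Negation: only positive adjectives are negated here, because the
    # label of e.g. "not terrible" is debatable.
    for adj in POSITIVE:
        cases.append(make_case("this album is not " + adj, "negative", "negation"))
    labelled = [(adj, "positive") for adj in POSITIVE] + [(adj, "negative") for adj in NEGATIVE]
    for adj, label in labelled:
        # Contrast: the clause after "but" carries the overall sentiment.
        cases.append(make_case("the cover looks awful but the music is " + adj, label, "contrast"))
        # Intensifier: the label stays the same as the adjective's polarity.
        cases.append(make_case("this album is really " + adj, label, "intensifier"))
    return cases

cases = generate_cases()
for case in cases[:3]:
    print(json.dumps(case))
```

Keeping the category next to each template makes it easy to see which types of difficulty are over- or under-represented in the final set.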

Note that you have to briefly present your approach in week 14 (before the project proposal; you will get 2 minutes for phase 2 and 5 minutes for the project proposal).

#### For Inspiration:
* [Beyond Accuracy: Behavioral Testing of NLP Models with CheckList](https://www.aclweb.org/anthology/2020.acl-main.442.pdf)
* [Towards Linguistically Generalizable NLP Systems: A Workshop and Shared Task](https://www.aclweb.org/anthology/W17-5401.pdf)
* [Breaking NLP: Using Morphosyntax, Semantics, Pragmatics and World Knowledge to Fool Sentiment Analysis Systems](https://www.aclweb.org/anthology/W17-5405.pdf)


## 1. Classification

The formal requirements are:

* 100-1000 utterances should be handed in on **LearnIt before 30-03 11:59AM**
* The file must be in the same format as the training data: one JSON dict per line, and each instance needs at least the keys "reviewText", "sentiment", and "category".
* The "category" key indicates which type of alteration/difficulty you included.
* The gold labels must be correct!

Assuming you write a function that generates examples, writing the final file can be done like this:

In [None]:
import json

def swap(sentiment):
    if sentiment == 'positive':
        return 'negative'
    elif sentiment == 'negative':
        return 'positive'

def dataGenerator(inputSents):
    outputSents = []
    for instance in inputSents:
        if 'great' in instance[0]:
            outputSents.append({'reviewText': instance[0].replace('great', 'not great'), 'sentiment': swap(instance[1]), 'category': 'negation'})
    return outputSents

inputSents = [['this is a great album', 'positive']]

#with open('group13.json', 'w') as outFile:
#    for instance in dataGenerator(inputSents):
#        # 'sentiment' is a string, either 'positive' or 'negative', 'reviewText' contains the review,
#        # and 'category' indicates the type of alteration you did.
#        outFile.write(json.dumps(instance) + '\n')

You should check whether your final file is in the correct format with the following code:

In [None]:
import json
inputPath = 'group13.json'

numInstances = 0
with open(inputPath) as inFile:
    for lineIdx, line in enumerate(inFile):
        numInstances += 1
        try:
            data = json.loads(line)
        except ValueError as e:
            print('error, instance ' + str(lineIdx+1) + ' is not in valid json format')
            continue
        # all three keys are required, report the first one that is missing
        missing = [key for key in ['reviewText', 'sentiment', 'category'] if key not in data]
        if missing:
            print('error, instance ' + str(lineIdx+1) + ' does not contain key "' + missing[0] + '"')
            continue
        if data['sentiment'] not in ['positive', 'negative']:
            print('error, instance ' + str(lineIdx+1) + ': sentiment is not positive/negative')
            continue

if numInstances < 100:
    print('Too few instances (' + str(numInstances) + '), please generate more')
if numInstances > 1000:
    print('Too many instances (' + str(numInstances) + '), please remove some')
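Because the set should cover a variety of difficulty types, it can also help to look at how many instances each category has. The following is a small sketch of such a check; the function name `category_counts` and the sample lines are made up for illustration, and with a real file you would pass the open file object instead of the list.

```python
import json
from collections import Counter

def category_counts(lines):
    """Count instances per 'category' in an iterable of JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore empty lines
        counts[json.loads(line).get("category", "<missing>")] += 1
    return counts

# Hypothetical sample lines; replace with `open('group13.json')` for a real check.
sample_lines = [
    '{"reviewText": "this is a not great album", "sentiment": "negative", "category": "negation"}',
    '{"reviewText": "this album is not wonderful", "sentiment": "negative", "category": "negation"}',
    '{"reviewText": "awful cover but great music", "sentiment": "positive", "category": "contrast"}',
]

counts = category_counts(sample_lines)
for category, n in counts.most_common():
    print(category, n)
```

A very skewed distribution is a hint that the set mostly contains one type of sample.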

## 3. Prediction
06-04 11:59AM is the deadline for handing in the predictions of the baseline on the difficult cases of all the groups. The data file will be made available as soon as possible after your hand-ins (we aim for 02-04), and all you have to do is re-run your baseline from phase 1. Note that some of the meta-information might not be available, so if your baseline relies on it, you have to either retrain without these features or predict without them.

The CodaLab link will appear here and will be posted on Slack when available.

In [None]:
# create the same structure as the data
import json
import numpy as np
import pandas as pd
import gzip
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
PATH = {'train':'../data/music/music_reviews_train.json.gz',
        'dev': '../data/music/music_reviews_dev.json.gz',
        'test': '../data/music/music_reviews_test_masked.json.gz',
        'difficult': '../data/difficult_cases/phase2_testData-masked.txt'}

In [None]:
def load_data(path):
    '''
    Function to load the data
    -----
    Takes in the argument: 
        'path' - takes the form PATH['(train, dev or test)']
    '''
    dic = {}
    # the context manager closes the gzip file once all lines are read
    with gzip.open(path) as f:
        for i, line in enumerate(f):
            dic[i] = json.loads(line)
    return dic

In [None]:

train_data = load_data(PATH['train'])
dev_data = load_data(PATH['dev'])
difficult_data = {}
instances = 0
with open(PATH['difficult']) as f:   
    for line in f:
        difficult_data[instances] = json.loads(line)
        instances+=1

In [None]:
def sent_encode(sent):
    '''
    Helper function to encode sentiment
    ------
    Takes in string description
        'sent' - either positive or negative
    Returns binary encoding
        1 = positive sentiment
        0 = negative sentiment
    '''
    if sent == 'positive':
        return 1
    if sent == 'negative':
        return 0 
    return 'unknown sentiment'

In [None]:
def clean(data):
    '''
    Function to clean the data
    -----
    Takes in data set from load_data()
        'data' - nested dictionary  
    Returns two lists
        cleaned - X list
        ys - y list
    '''
    cleaned = [] 
    ys = []
    for idx in data:
        review = data[idx].get('reviewText', None) # some data does not have a review text
        summary = data[idx].get('summary', None) # some data does not have a summary 
        
        # combine summary and review
        if review is None and summary is None:
            continue
        elif review is None:
            text = summary
        elif summary is None:
            text = review
        else:
            text = summary + ' ' + review
        text = text.lower() 
        sequence = word_tokenize(text)  # splits gotta into got ta
        cleaned.append(sequence)

        # encode sentiment
        ys.append(sent_encode(data[idx]['sentiment']))

    return cleaned, ys

In [None]:
def clean_difficult(data):
    '''
    Function to clean the difficult cases
    -----
    Unlike clean(), no instance is skipped, so the output stays aligned with the input
    '''
    cleaned = [] 
    ys = []
    for idx in data:
        # fall back to an empty string if an instance has no review text
        review = data[idx].get('reviewText') or ''
        sequence = word_tokenize(review.lower())  # splits gotta into got ta
        cleaned.append(sequence)

        # encode sentiment; the labels may be masked or missing in this file
        ys.append(sent_encode(data[idx].get('sentiment')))
    return cleaned, ys

In [None]:
train_clean, y_train = clean(train_data)
dev_clean, y_dev = clean(dev_data)
# dev reviews are only added to build the vocabulary and corpus; y_train is not extended
train_clean += dev_clean 
cleaned_difficult, _ = clean_difficult(difficult_data)

In [None]:
def get_vocab_corpus(dataset):
    '''
    Function computing vocabulary and corpus for a dataset
    -----
    Takes a cleaned dataset - list 
        dataset - X list 
    Returns
        vocab - set of unique tokens in dataset
        corpus - list of strings; sentences in dataset 
    '''
    vocab = set()
    corpus = []
    for text in dataset:
        sentence = ''
        for token in text:
            vocab.add(token)
            if token in ['.','!','?',',']:
                sentence += token 
            else:
                sentence += ' ' + token 
        corpus.append(sentence.lstrip()) 
    return vocab, corpus

In [None]:
train_vocabulary, train_corpus = get_vocab_corpus(train_clean)
difficult_vocabulary, difficult_corpus = get_vocab_corpus(cleaned_difficult)



In [None]:
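def get_bow(vocab, corpus, vectorizer=None):
    '''
    Function returning sparse matrix of Term Frequency - Inverse Document Frequencies
    -----
    Takes vocab and corpus
        vocab - set of unique words
        corpus - list of strings
        vectorizer - fitted TfidfVectorizer to reuse; if None, a new one is fitted on corpus
    Returns
        bow - 2d sparse matrix; input to model
        vectorizer - the fitted TfidfVectorizer
    '''
    if vectorizer is None:
        # sorted() gives a reproducible column order (a set has no stable order across runs);
        # NOTE: the order must match the one used when the saved model was trained
        vectorizer = TfidfVectorizer(vocabulary=sorted(vocab))
        bow = vectorizer.fit_transform(corpus)
    else:
        bow = vectorizer.transform(corpus)
    return bow, vectorizer

In [None]:
train_bow, vectorizer = get_bow(train_vocabulary, train_corpus)
# reuse the IDF weights fitted on the training corpus for the difficult cases
difficult_bow, _ = get_bow(train_vocabulary, difficult_corpus, vectorizer)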
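
TF-IDF weights depend on the corpus the IDF statistics are estimated on, so features for new data are usually computed with the statistics from the training corpus. The sketch below illustrates this with a hand-written, simplified TF-IDF. The helper names (`fit_idf`, `transform`) and the toy corpora are hypothetical; the IDF follows the smoothed formula documented for scikit-learn, and the L2 normalisation step is left out. In scikit-learn terms it corresponds to calling `fit` on the training corpus and `transform` on new data.

```python
import math
from collections import Counter

def fit_idf(corpus, vocab):
    """Smoothed IDF per term, estimated from `corpus` only."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

def transform(corpus, vocab, idf):
    """Term counts weighted by the given IDF values (no normalisation)."""
    rows = []
    for doc in corpus:
        counts = Counter(doc.split())
        rows.append([counts[term] * idf[term] for term in vocab])
    return rows

toy_vocab = sorted({"album", "bad", "good"})
toy_train = ["good good album", "bad album"]
toy_difficult = ["not good"]

idf_from_train = fit_idf(toy_train, toy_vocab)
idf_from_difficult = fit_idf(toy_difficult, toy_vocab)

# Same sentence, different feature values depending on where the IDF came from.
features_reused = transform(toy_difficult, toy_vocab, idf_from_train)
features_refitted = transform(toy_difficult, toy_vocab, idf_from_difficult)
print(features_reused)
print(features_refitted)
```

A model trained on features weighted with the training IDF expects the first variant; the second one silently changes the scale of every feature.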

In [None]:
## load model 
with open('../models/logreg_music.pkl','rb') as f:
    log_reg_clf = pickle.load(f)
y_pred = log_reg_clf.predict(difficult_bow)

In [None]:
def pred_test(test, ys):
    '''
    Function to insert predictions into test data
    '''
    index = 0
    for key in test:
        test[key]['sentiment'] = reverse_encode(ys[index])
        index += 1
    return test

In [None]:
def reverse_encode(sent):
    if sent == 1:
        return 'positive'
    if sent == 0:
        return 'negative'

In [None]:
finished_test_data = pred_test(difficult_data,y_pred)
finished_test_data

In [None]:
#with open('../data/difficult_cases/predictions.json', 'w') as f:
#    json.dump(finished_test_data, f)
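
The expected upload format is not stated here. If the predictions need to be in the same format as the data (one JSON dict per line), which is an assumption, they could be written roughly as follows. The function name `write_json_lines`, the toy predictions and the output path are hypothetical.

```python
import json
import os
import tempfile

def write_json_lines(instances, path):
    """Write one JSON dict per line, mirroring the format of the data files."""
    with open(path, "w") as out_file:
        for instance in instances:
            out_file.write(json.dumps(instance) + "\n")

# Toy stand-in for the nested dict {index: instance} used in this notebook.
toy_predictions = {
    0: {"reviewText": "this is a not great album", "sentiment": "negative"},
    1: {"reviewText": "really wonderful record", "sentiment": "positive"},
}

# Hypothetical output location in the system's temporary directory.
out_path = os.path.join(tempfile.gettempdir(), "predictions_example.json")
write_json_lines(toy_predictions.values(), out_path)

with open(out_path) as in_file:
    written_lines = in_file.read().splitlines()
print(len(written_lines), "lines written")
```

Writing the values in dict order keeps the predictions in the same order as the instances in the input file.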