# Project phase 1: Baseline

The goal of this phase is to create a baseline model. Note that the word "baseline" can mean different things. In the course we distinguished three types of baselines:
1. The simplest possible approach (a majority baseline, i.e. everything is positive, or everything is a noun)
2. A simple machine learning classifier (logistic regression with words as features)
3. The "state-of-the-art" approach on which you want to improve (your starting point)

For this phase you need to build a baseline of type 2 or 3.
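
For comparison, a type 1 (majority) baseline fits in a few lines. The sketch below is only an illustration: the function name `majority_baseline` and the toy labels are made up, and nothing here is part of the required submission.

```python
from collections import Counter


def majority_baseline(train_labels, n_test):
    """Predict the most frequent training label for every test instance."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    return [majority_label] * n_test


# Made-up labels, only to show the behaviour
toy_train_labels = ["positive", "negative", "positive", "positive"]
predictions = majority_baseline(toy_train_labels, n_test=3)
print(predictions)
```

Any type 2 or 3 baseline should beat the score this produces, which makes it a useful sanity check.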

If you plan to have a research question like "can we improve sentiment detection systems by doing X?", the answer is most relevant if you have a competitive baseline (type 3). In this case we suggest using a BiLSTM or even a transformer-based model, so that you can re-use the baseline for the final research question (phase 3).

You should pick one of the following tasks to create your baseline for.

## Task 1: Sentiment classification
* The data can be found in the `classification` folder.
* The goal is to predict the label in the `sentiment` field.
* **You have to upload the predictions for `music_reviews_test_masked.json.gz` to CodaLab. (The link will be posted here on Monday.) Note that the format should match the JSON files in the repository.**
* **Also upload a .txt file on LearnIt (one per group) with a short description of your baseline.**
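
A minimal sketch of writing predictions back out, assuming the repository files hold one JSON object per line (which is how the loading code in this notebook reads them). The helper `write_predictions`, the output file name, and the toy reviews are hypothetical; check the exact format CodaLab expects before uploading.

```python
import gzip
import json
import os
import tempfile


def write_predictions(reviews, predicted_labels, out_path):
    """Write one JSON object per line, with 'sentiment' set to the prediction."""
    with gzip.open(out_path, "wt", encoding="utf-8") as f:
        for review, label in zip(reviews, predicted_labels):
            record = dict(review)        # copy so the input stays untouched
            record["sentiment"] = label
            f.write(json.dumps(record) + "\n")


# Toy data; the field names follow the ones used in this notebook
toy_reviews = [{"reviewText": "Great album", "sentiment": None},
               {"reviewText": "Boring", "sentiment": None}]
out_path = os.path.join(tempfile.mkdtemp(), "predictions.json.gz")  # hypothetical name
write_predictions(toy_reviews, ["positive", "negative"], out_path)

# Read the file back to confirm the format round-trips
with gzip.open(out_path, "rt", encoding="utf-8") as f:
    written = [json.loads(line) for line in f]
print(written[0]["sentiment"], written[1]["sentiment"])
```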

The data can be read like:

In [None]:
import numpy as np 
import torch 
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [None]:
import gzip
import json
for line in gzip.open('classification/music_reviews_dev.json.gz'):
    review_data = json.loads(line)
    for key in review_data:
        print('"' + key +'": ' + str(review_data[key]))
    break
paths = {'train':'classification/music_reviews_train.json.gz',
        'test':'classification/music_reviews_test_masked.json.gz'}

In [None]:
# Build the training vocabulary (word -> column index) and collect the review texts
train_vocab = {}
train_sentences = {}
train_no_reviewText = []  # 1-based line numbers of reviews without text
counter1 = 0  # lines read
counter2 = 0  # vocabulary size
counter3 = 0  # reviews with text
with gzip.open(paths['train']) as train:
    for line in train:
        counter1 += 1
        review = json.loads(line)  # parse each line once
        if 'reviewText' in review:
            train_sentences[counter3] = review['reviewText']
            counter3 += 1
            for word in review['reviewText'].split():
                if word not in train_vocab:
                    train_vocab[word] = counter2
                    counter2 += 1
        else:
            train_no_reviewText.append(counter1)
print(counter3, counter2)

In [None]:
# Binary bag-of-words feature matrix (reviews x vocabulary), called the "gram matrix" below
m1 = torch.zeros(counter3, counter2)


In [None]:
# Need idx2word
idx2word = dict([(value, key) for key, value in train_vocab.items()])

# Fill the matrix: 1 where the word occurs in the review

for sen in train_sentences: 
    for word in train_sentences[sen].split(): 
        m1[sen, train_vocab[word]] = 1

In [None]:
# Collect the labels (1 = positive, 0 = negative) for reviews that have text.
# train_vocab must not be reset here: the test features below rely on it.
train_labels = {}
counter = 0
with gzip.open(paths['train']) as train:
    for line in train:
        a = json.loads(line)
        if 'reviewText' in a:
            if a['sentiment'] == 'positive':
                train_labels[counter] = 1
            elif a['sentiment'] == 'negative':
                train_labels[counter] = 0
            counter += 1

counter2

In [None]:
batch_size = 5000
num_batches = int(len(m1)/batch_size)
train_feats_batches = m1[:batch_size*num_batches].view(num_batches,batch_size, counter2)
for feats_batch in train_feats_batches:
    print(feats_batch.shape)

bingus = list(train_labels.values())
bingus = torch.FloatTensor(bingus)

num_batches = int(len(bingus)/batch_size)
train_label_batches = bingus[:batch_size*num_batches].view(num_batches,batch_size,1)
counter = 1
for feats_batch in train_label_batches:
    counter+=1
    print(feats_batch.shape)
counter

In [None]:
# Read the test reviews and build a test-only vocabulary
test_vocab = {}
test_sentences = {}
test_no_reviewText = []  # 1-based line numbers of reviews without text
counter1 = 0  # lines read
counter2 = 0  # test vocabulary size
counter3 = 0  # reviews with text
with gzip.open(paths['test']) as test:
    for line in test:
        counter1 += 1
        review = json.loads(line)  # parse each line once
        if 'reviewText' in review:
            test_sentences[counter3] = review['reviewText']
            counter3 += 1
            for word in review['reviewText'].split():
                # check test_vocab so that each word gets exactly one index
                if word not in test_vocab:
                    test_vocab[word] = counter2
                    counter2 += 1
        else:
            test_no_reviewText.append(counter1)
print('Vocab size: ', counter2)
        
# Construct the feature matrix with the same columns as the training matrix m1
m2 = torch.zeros(counter3, m1.shape[1])
print('m2 constructed!')


# Need idx2word
idx2word = dict([(value, key) for key, value in train_vocab.items()])
print('idx2word done!')


# Fill the matrix; words not seen in training are skipped
for sen in test_sentences: 
    for word in test_sentences[sen].split(): 
        if word in train_vocab:
            m2[sen, train_vocab[word]] = 1
print('Gram matrix done')

#Note labels
test = gzip.open(paths['test'])
test_labels = {}
counter = 0
for line in test:
    a = json.loads(line)
    if 'reviewText' in a.keys():
        if a['sentiment'] == 'positive':
            test_labels[counter] = 1
        elif a['sentiment'] == 'negative': 
            test_labels[counter] = 0
        counter +=1
print('Labels noted!')
        
#Divide into batches

batch_size = 2499
num_batches = int(len(m2)/batch_size)
test_feats_batches = m2[:batch_size*num_batches].view(num_batches,batch_size, m2.shape[1])
print('Feature Matrix shapes: ')
for feats_batch in test_feats_batches:
    print(feats_batch.shape)
bingus = list(test_labels.values())
bingus = torch.FloatTensor(bingus)
num_batches = int(len(bingus)/batch_size)
test_label_batches = bingus[:batch_size*num_batches].view(num_batches,batch_size,1)
print('label matrix shapes: ')
for feats_batch in test_label_batches:
    print(feats_batch.shape)

In [None]:
def load_vocab(filepath):
    """Read a gzipped JSON-lines file; return its vocabulary, review texts and counts."""
    vocab = {}
    no_reviewText = []  # 1-based line numbers of reviews without text
    labels = {}         # not filled here; kept so the returned keys stay the same
    sentences = {}      # review index -> review text
    line_count = 0
    with gzip.open(filepath) as f:
        for line in f:
            line_count += 1
            review = json.loads(line)  # parse each line once
            if 'reviewText' in review:
                sentences[len(sentences)] = review['reviewText']
                for word in review['reviewText'].split():
                    # check the local vocabulary, not the global train_vocab
                    if word not in vocab:
                        vocab[word] = len(vocab)
            else:
                no_reviewText.append(line_count)
    final_dict = {'line_count' : line_count,
                 'review_count' : len(sentences),
                 'vocab_size' : len(vocab),
                 'no_text_reviews' : no_reviewText,
                 'labels' : labels,
                 'vocabulary' : vocab,
                 'sentences' : sentences}
    return final_dict

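def construct_gram(vocab, num_sen, vocab_len, sentences): 
    # Construct the binary feature ("gram") matrix: reviews x vocabulary
    m2 = torch.zeros(num_sen, vocab_len)
    print('m2 constructed!')
    for sen in sentences: 
        for word in sentences[sen].split(): 
            # skip words that are not in the vocabulary instead of raising KeyError
            if word in vocab:
                m2[sen, vocab[word]] = 1
    print('Gram matrix done')
    return m2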



In [None]:
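# The cells below need a fitted `model`.
# LogisticRegression.fit() starts from scratch on every call, so fit once on all batches.
model = LogisticRegression()
model.fit(train_feats_batches.reshape(-1, train_feats_batches.shape[-1]).numpy(),
          train_label_batches.reshape(-1).numpy())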


In [None]:
preds = {}
counter = 0
for batch in test_feats_batches:
    preds[counter] = model.predict(batch)
    counter+=1

In [None]:
new_arr = []
for i in preds:
    for j in preds[i]:
        new_arr.append(j)
len(new_arr)
new_arr = np.array(new_arr)
new_arr.shape

In [None]:
for pred, true in zip(preds, test_label_batches):
    # classification_report expects (y_true, y_pred), in that order
    print(classification_report(true.numpy().ravel(), preds[pred]))

In [None]:
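counter = 0
new_data = []
with gzip.open(paths['test']) as test:
    for i in test:
        bingus = json.loads(i)
        # new_arr only holds predictions for reviews with text, in file order
        if 'reviewText' in bingus and counter < len(new_arr):
            if new_arr[counter] == 0:
                bingus['sentiment'] = 'negative'
            elif new_arr[counter] == 1:
                bingus['sentiment'] = 'positive'
            counter += 1
        new_data.append(bingus)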

In [None]:
for i in new_data:
    print(i)
    break

In [None]:
#with open("final.json", 'a') as f:
#    for i in new_data:
#        json.dump(i,f)
#        f.write('\n')

# Part 2: Break it down

In this part of the project, we try to break our own model in order to find ways to improve it.

### Suggested methods: 
- Change language
- More negation
- Reviews of other products

### Things we should also consider: 
- Better tokenization
- Model tuning
- Actually using the development data

In [None]:
# better tokenization: 
'''
1. Look up better regex expression
2. Remove stopwords
3. implement padding(might have to wait on that one lmao)
4. Use a way more sophisticated model. (might wanna wait on that one too)
'''













In [2]:
#Reopen file + imports: 
import gzip
import json
import torch 
from nltk.tokenize import TweetTokenizer
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn.metrics import classification_report

# This is how they load the data ------------------------------------------
#for line in gzip.open('../classification/music_reviews_dev.json.gz'):
#    review_data = json.loads(line)
#    for key in review_data:
#        print('"' + key +'": ' + str(review_data[key]))
#    break
#--------------------------------------------------------------------------


paths = {'train':'../classification/music_reviews_train.json.gz',
        'test':'../classification/music_reviews_test_masked.json.gz',
        'dev' : '../classification/music_reviews_dev.json.gz'}

In [3]:
# Vocabs built using TweetTokenizer
# 


def build_vocab(filepath):
    """Read a gzipped JSON-lines file; return vocabulary, labels, review texts and counts."""
    vocab = {}
    no_reviewText = []  # 1-based line numbers of reviews without text
    labels = {}         # review index -> 1 (positive) / 0 (negative)
    sentences = {}      # review index -> review text
    tokenizer = TweetTokenizer()
    line_count = 0
    with gzip.open(filepath) as f:
        for line in f:
            line_count += 1
            a = json.loads(line)  # parse each line once
            if 'reviewText' in a:
                idx = len(sentences)
                sentences[idx] = a['reviewText']
                if a['sentiment'] == 'positive':
                    labels[idx] = 1
                elif a['sentiment'] == 'negative':
                    labels[idx] = 0
                for word in tokenizer.tokenize(a['reviewText']):
                    if word not in vocab:
                        vocab[word] = len(vocab)
            else:
                no_reviewText.append(line_count)
    final_dict = {'line_count' : line_count,
                 'review_count' : len(sentences),
                 'vocab_size' : len(vocab),
                 'no_text_reviews' : no_reviewText,
                 'labels' : labels,
                 'vocabulary' : vocab,
                 'sentences' : sentences}
    return final_dict

train_set =  build_vocab(paths['train'])
dev_set = build_vocab(paths['dev'])



In [4]:
# Might wanna run tests again just to see if the new tokenization improved performance in any way. 
tokenizer = TweetTokenizer()
# unigrams : 
def create_unigram(vocab, sentences, tokenizer):
    # Binary bag-of-words matrix: one row per review, one column per vocabulary word
    m1 = torch.zeros(len(sentences), len(vocab))
    # Set the entries for the words that occur in each review
    for sen in range(len(sentences)): 
        for word in tokenizer.tokenize(sentences[sen]): 
            if word in vocab:
                m1[sen, vocab[word]] = 1
    return m1

train_unigram = create_unigram(train_set['vocabulary'], train_set['sentences'], tokenizer)

def create_batches(matrix, batch_size,labels): 
    num_batches = int(len(matrix)/batch_size)
    feats_batches = matrix[:batch_size*num_batches].view(num_batches,batch_size, matrix.shape[1])
    bingus = torch.FloatTensor(list(labels.values()))
    num_batches = int(len(bingus)/batch_size)
    label_batches = bingus[:batch_size*num_batches].view(num_batches,batch_size,1)
    return feats_batches, label_batches
train_feat_batches = create_batches(train_unigram, 2499, train_set['labels'])

RuntimeError: [enforce fail at C:\Users\builder\tkoch\workspace\pytorch\pytorch_1647970138273\work\c10\core\CPUAllocator.cpp:76] data. DefaultCPUAllocator: not enough memory: you tried to allocate 44822182944 bytes.
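
The allocation failure above comes from storing every (review, word) cell of the matrix as a dense float, although almost all of those cells are zero. One possible way around it, sketched here on made-up toy data, is to let scikit-learn's `CountVectorizer` build a sparse matrix and fit `LogisticRegression` on that directly. The `token_pattern` below simply splits on whitespace and is an assumption; passing `tokenizer=TweetTokenizer().tokenize` instead should give the tokenization used in this notebook.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up toy data standing in for the review texts and labels
toy_texts = ["I love this album", "I hate this album", "great band", "awful band"]
toy_labels = [1, 0, 1, 0]

# binary=True gives 0/1 features like the matrices in this notebook;
# only the non-zero entries are stored
vectorizer = CountVectorizer(binary=True, lowercase=False, token_pattern=r"\S+")
X_train = vectorizer.fit_transform(toy_texts)

model = LogisticRegression()
model.fit(X_train, toy_labels)

# transform() reuses the training vocabulary and ignores unseen words
X_new = vectorizer.transform(["I love this unseen band"])
print(X_train.shape, X_train.nnz)
print(model.predict(X_new))
```

With this representation the memory use grows with the number of words that actually occur in the reviews rather than with reviews times vocabulary size, and batching is no longer needed for fitting.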

In [5]:
labels_batches = list(train_feat_batches[1])
feat_batches = list(train_feat_batches[0])

NameError: name 'train_feat_batches' is not defined

In [6]:
# Make model
model = LogisticRegression()

# Train model
# fit() discards any previous fit, so train once on all batches stacked together
model.fit(torch.cat(feat_batches).numpy(), torch.cat(labels_batches).numpy().ravel())

NameError: name 'feat_batches' is not defined

In [7]:
dev_unigram = create_unigram(train_set['vocabulary'], dev_set['sentences'], tokenizer)

In [8]:
dev_batches = create_batches(dev_unigram,2499,dev_set['labels'])
preds = {}
counter = 0
for batch in dev_batches[0]:
    preds[counter] = model.predict(batch)
    counter+=1
new_arr = []
for i in preds:
    for j in preds[i]:
        new_arr.append(j)
len(new_arr)
new_arr = np.array(new_arr)
for pred, true in zip(preds, dev_batches[1]):
    # classification_report expects (y_true, y_pred), in that order
    print(classification_report(true.numpy().ravel(), preds[pred]))

NameError: name 'create_batches' is not defined

In [None]:
hard_preds = {}
counter = 0
for batch, pred_dict, true_dict in zip(dev_batches[0], preds.values(), dev_batches[1]):
    for sentence, pred, true in zip(batch, pred_dict, true_dict):
        if pred != true:
            hard_preds[counter] = int(pred)
        counter += 1

In [None]:
hard_preds

In [None]:
# grabbing the indices of the first 200 misclassified dev sentences
hard_sens = []
for i in hard_preds:
    if len(hard_sens) >= 200:
        break
    hard_sens.append(i)

In [None]:
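new_data = []
counter = 0
with gzip.open(paths['dev']) as test:
    for i in test:
        bingus = json.loads(i)
        # hard_sens indexes reviews with text only, so skip the ones without
        if 'reviewText' not in bingus:
            continue
        if counter in hard_sens:
            new_data.append(bingus)
        counter += 1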

In [None]:
with open("hard_cases.json", 'w') as f:  # 'w' so that rerunning does not append duplicates
    for i in new_data:
        json.dump(i,f)
        f.write('\n')

### Checklist section:
We follow the approach from the official CheckList documentation.
It is a fairly crude way of doing this, but hopefully it works.


In [9]:
# Load the same stuff that they do in the doc. 
import checklist.editor
import checklist.text_generation
from checklist.test_types import MFT, INV, DIR
from checklist.expect import Expect
import numpy as np
import spacy
from checklist.test_suite import TestSuite
from checklist.perturb import Perturb
from spacy.lang.en.examples import sentences 
import random 
# Load editor. 

editor = checklist.editor.Editor()
editor.tg
nlp = spacy.load('en_core_web_sm')
sentences = dev_set['sentences'].values()
parsed_data = list(nlp.pipe(sentences))
suite = TestSuite()

In [10]:
parsed_data[0]

My dentist recommended this as a relaxation technique for dental visits. They give me an ipod with headphones, play this on it and it relieves some of the stress of dental treatment, which I dislike intensely.
It worked so well that I bought my own copy to try at home. I fall asleep after a couple of minutes and stay asleep. Instead of tossing and turning, I hardly move at all. Highly recommend.

We need at least 100 samples. What we can do is define some categories of difficult sentences for our logistic regression:

 - Negation: A negation such as "I don't like the artist" is difficult for our classifier, as it will see the word "like", which typically has positive connotations. In this example, however, "don't" reverses the meaning, so "like" ends up expressing negative sentiment towards the artist, which our baseline logistic regression cannot detect.

 - Irony/Sarcasm: Our baseline model basically just looks at individual words and learns whether they are typically positive or negative. If a reviewer is being ironic or sarcastic and uses words that indicate the opposite of what they truly mean, our logistic regression will fail. For example, the sentence "This album rocks, I love bleeding from my ears!" uses very positive words, but because the reviewer is being sarcastic, those words convey a negative sentiment.

 - Pos/neg: If a reviewer uses a mix of positive and negative words, our logistic regression may not be able to tell with any confidence which sentiment the review has. For example, "I liked the old album but this new one sucks" is hard because it contains both the positive word "liked" and the negative word "sucks", and our logistic regression is too simple to understand that in this context it is the reviewed album that is bad.
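
To make the negation case concrete: with binary word features, logistic regression adds up one weight per word that is present and squashes the sum with a sigmoid. The weights below are made up for illustration and are not taken from our trained model.

```python
import math

# Hypothetical per-word weights; a real model learns these from the training data
toy_weights = {"like": 1.5, "love": 2.0, "hate": -2.0, "don't": -0.4}
toy_bias = 0.0


def positive_probability(text):
    """P(positive) for a bag-of-words logistic regression with binary features."""
    words = set(text.split())  # binary features: each word counts once
    score = toy_bias + sum(toy_weights.get(w, 0.0) for w in words)
    return 1.0 / (1.0 + math.exp(-score))


p_plain = positive_probability("I like the artist")
p_negated = positive_probability("I don't like the artist")
print(round(p_plain, 3), round(p_negated, 3))
```

Under these assumed weights, "don't" can only shift the score by a fixed amount. It cannot flip the contribution of "like", so the negated sentence is still scored as positive.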

Since we need at least 100 samples, we can create templates for these 3 categories and generate, say, 40 samples per category, for 120 samples in total.
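
As a rough picture of what template filling amounts to, the sketch below expands a template over every combination of lexicon entries and then samples from the result. It uses only the standard library; `fill_template` and the toy lexicons are hypothetical and are not how CheckList is implemented internally.

```python
import itertools
import random
import re


def fill_template(template, lexicons):
    """Return every filling of the {placeholders} in `template`."""
    names = re.findall(r"\{(\w+)\}", template)
    combos = itertools.product(*(lexicons[name] for name in names))
    return [template.format(**dict(zip(names, combo))) for combo in combos]


# Toy lexicons, much smaller than the ones used in this notebook
toy_lexicons = {"negation": ["don't", "can't"],
                "pos_verb": ["like", "enjoy", "love"],
                "music_noun": ["album", "band"]}

candidates = fill_template("I {negation} {pos_verb} the {music_noun}", toy_lexicons)
print(len(candidates))  # 2 * 3 * 2 = 12 combinations

random.seed(0)  # fixed seed so the sample is reproducible
samples = random.sample(candidates, 5)
print(samples)
```

Sampling without replacement from the full expansion avoids duplicate sentences, which is why a template needs at least as many combinations as the number of samples drawn from it.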

In [68]:
####### Creating lists of stuff for use in generating sentences #######

# Constructing collection of music_nouns for use in all categories to reference whatever is being reviewed
music_noun = ['project', 'artist', 'album', 'genre', 'compilation', 'ep',
              'singer', 'band', 'guitar', 'drummer', 'guitarist', 'pianist',
              'group']

# Negations
negations = ['don\'t', 'can\'t' , 'not', 'won\'t', 'nothing']
past_negations = ['didn\'t', 'wouldn\'t', 'couldn\'t', 'shouldn\'t']

# Pos/Neg Adjectives
pos_adj = ['adorable', 'amazing', 'awesome', 'beautiful', 'brilliant', 'captivating',
           'creative', 'elegant', 'energetic' , 'excellent', 'exceptional', 'exciting',
           'extraordinary', 'fabulous', 'fantastic', 'fun', 'good', 'great', 'happy',
           'imaginative', 'incredible', 'nice', 'perfect', 'sweet', 'wonderful']
neg_adj = ['abrasive', 'annoying', 'average', 'awful', 'bad', 'boring', 'careless',
           'creepy', 'difficult', 'dreadful', 'frustrating', 'hard', 'horrible',
           'lame', 'lousy', 'nasty', 'poor', 'ridiculous', 'rough', 'sad', 'terrible',
           'ugly', 'unhappy', 'unpleasant', 'weird']

# Positive Verbs in Present and Past Tenses
pos_verb_present = ['admire', 'appreciate', 'enjoy', 'like', 'love', 'recommend', 'value', 'welcome']
pos_verb_past = ['admired', 'appreciated', 'enjoyed', 'liked', 'loved', 'recommended', 'valued', 'welcomed']

# Negative Verbs in Present and Past Tenses
neg_verb_present = ['abhor', 'despise', 'dislike', 'dread', 'hate', 'loathe', 'regret', 'resent']
neg_verb_past = ['abhorred', 'despised', 'disliked', 'dreaded', 'hated', 'loathed', 'regretted', 'resented']

# Neutral Verbs in Present and Past Tenses
neutral_verb_present = ['find', 'see']
neutral_verb_past = ['found', 'saw']



####### adding lexicons #######
editor.add_lexicon('music_noun', music_noun, overwrite=True)

editor.add_lexicon('negation', negations, overwrite=True)
editor.add_lexicon('past_negation', past_negations, overwrite=True)

editor.add_lexicon('pos_adj', pos_adj, overwrite=True)
editor.add_lexicon('neg_adj', neg_adj, overwrite=True)

editor.add_lexicon('pos_verb_present', pos_verb_present, overwrite=True)
editor.add_lexicon('pos_verb_past', pos_verb_past, overwrite=True)

editor.add_lexicon('neg_verb_present', neg_verb_present, overwrite=True)
editor.add_lexicon('neg_verb_past', neg_verb_past, overwrite=True)

editor.add_lexicon('neutral_verb_present', neutral_verb_present, overwrite=True)
editor.add_lexicon('neutral_verb_past', neutral_verb_past, overwrite=True)

# mixed lexicons? not too sure how these work - Aidan
editor.add_lexicon('pos_verb', pos_verb_present + pos_verb_past, overwrite=True)
editor.add_lexicon('neg_verb', neg_verb_present + neg_verb_past, overwrite=True)
editor.add_lexicon('neutral_verb', neutral_verb_present + neutral_verb_past, overwrite=True)

In [86]:
# to get 40 negations can create 2 templates for negations and randomly sample 20 sentences for each template
template_negation_neg = list(editor.template('I {negation} {pos_verb_present} the {music_noun}')['data']) # template 1
template_negation_pos = list(editor.template('I {negation} {neg_verb_present} the {music_noun}')['data']) # template 2

In [87]:
negation_neg = random.sample(template_negation_neg, 20)
negation_neg

["I don't value the compilation",
 "I don't enjoy the album",
 'I nothing admire the guitar',
 'I nothing admire the singer',
 'I not like the genre',
 "I can't enjoy the guitar",
 'I nothing welcome the band',
 "I can't admire the guitar",
 'I not value the group',
 'I nothing love the pianist',
 "I don't admire the pianist",
 "I won't value the genre",
 "I don't love the group",
 "I don't recommend the project",
 'I not welcome the group',
 "I don't welcome the pianist",
 "I don't enjoy the drummer",
 'I nothing recommend the album',
 'I not welcome the pianist',
 "I won't appreciate the singer"]

In [88]:
negation_pos = random.sample(template_negation_pos, 20)
negation_pos

["I don't dislike the guitarist",
 'I nothing abhor the drummer',
 "I don't despise the guitar",
 "I can't dislike the ep",
 'I not dread the guitar',
 'I not dislike the drummer',
 'I nothing dread the ep',
 'I nothing regret the group',
 "I won't dread the drummer",
 'I not resent the album',
 "I don't dislike the album",
 "I won't resent the guitarist",
 "I can't regret the ep",
 "I don't dislike the pianist",
 'I nothing loathe the guitar',
 "I won't dislike the project",
 'I not hate the project',
 'I nothing dislike the singer',
 "I can't abhor the singer",
 'I nothing abhor the artist']

In [89]:
# Irony/Sarcasm Templates
template_ironsarc_pos = list(editor.template("Sooooo totally didn't {pos_verb_present} this {music_noun}...")['data'])
template_ironsarc_neg = list(editor.template("Sooooo totally didn't {neg_verb_present} this {music_noun}...")['data'])

In [90]:
# negative sentiment
ironsarc_neg = random.sample(template_ironsarc_neg, 20)
ironsarc_neg

["Sooooo totally didn't regret this drummer...",
 "Sooooo totally didn't despise this ep...",
 "Sooooo totally didn't resent this genre...",
 "Sooooo totally didn't dread this album...",
 "Sooooo totally didn't dread this singer...",
 "Sooooo totally didn't resent this band...",
 "Sooooo totally didn't resent this pianist...",
 "Sooooo totally didn't abhor this genre...",
 "Sooooo totally didn't hate this drummer...",
 "Sooooo totally didn't loathe this guitarist...",
 "Sooooo totally didn't dread this artist...",
 "Sooooo totally didn't loathe this ep...",
 "Sooooo totally didn't resent this artist...",
 "Sooooo totally didn't regret this pianist...",
 "Sooooo totally didn't resent this singer...",
 "Sooooo totally didn't loathe this singer...",
 "Sooooo totally didn't dislike this pianist...",
 "Sooooo totally didn't loathe this pianist...",
 "Sooooo totally didn't hate this guitarist...",
 "Sooooo totally didn't loathe this band..."]

In [91]:
# positive sentiment
ironsarc_pos = random.sample(template_ironsarc_pos, 20)
ironsarc_pos

["Sooooo totally didn't like this genre...",
 "Sooooo totally didn't enjoy this guitarist...",
 "Sooooo totally didn't value this artist...",
 "Sooooo totally didn't like this guitarist...",
 "Sooooo totally didn't enjoy this project...",
 "Sooooo totally didn't like this compilation...",
 "Sooooo totally didn't welcome this drummer...",
 "Sooooo totally didn't appreciate this singer...",
 "Sooooo totally didn't love this group...",
 "Sooooo totally didn't appreciate this artist...",
 "Sooooo totally didn't recommend this band...",
 "Sooooo totally didn't like this album...",
 "Sooooo totally didn't appreciate this guitarist...",
 "Sooooo totally didn't value this project...",
 "Sooooo totally didn't appreciate this drummer...",
 "Sooooo totally didn't value this band...",
 "Sooooo totally didn't appreciate this guitar...",
 "Sooooo totally didn't value this singer...",
 "Sooooo totally didn't recommend this guitar...",
 "Sooooo totally didn't recommend this artist..."]

In [92]:
# pos/neg templates
template_posneg_pos = list(editor.template("I {neg_verb_past} the old {music_noun}, but this new one I {pos_verb_present}")['data'])
template_posneg_neg = list(editor.template("I {pos_verb_past} the old {music_noun}, but this new one I {neg_verb_present}")['data'])

In [93]:
posneg_pos = random.sample(template_posneg_pos, 20)
posneg_pos

['I abhorred the old band, but this new one I recommend',
 'I dreaded the old ep, but this new one I love',
 'I dreaded the old genre, but this new one I value',
 'I loathed the old drummer, but this new one I enjoy',
 'I hated the old guitarist, but this new one I recommend',
 'I abhorred the old group, but this new one I value',
 'I loathed the old singer, but this new one I enjoy',
 'I abhorred the old ep, but this new one I welcome',
 'I hated the old project, but this new one I welcome',
 'I despised the old guitar, but this new one I enjoy',
 'I loathed the old project, but this new one I admire',
 'I dreaded the old guitar, but this new one I enjoy',
 'I hated the old album, but this new one I recommend',
 'I despised the old artist, but this new one I enjoy',
 'I hated the old drummer, but this new one I admire',
 'I dreaded the old group, but this new one I appreciate',
 'I loathed the old artist, but this new one I admire',
 'I disliked the old drummer, but this new one I adm

In [94]:
posneg_neg = random.sample(template_posneg_neg, 20)
posneg_neg

['I loved the old project, but this new one I resent',
 'I recommended the old drummer, but this new one I despise',
 'I appreciated the old album, but this new one I loathe',
 'I recommended the old project, but this new one I resent',
 'I enjoyed the old singer, but this new one I despise',
 'I liked the old singer, but this new one I despise',
 'I enjoyed the old pianist, but this new one I hate',
 'I valued the old guitar, but this new one I loathe',
 'I admired the old guitarist, but this new one I hate',
 'I liked the old genre, but this new one I dislike',
 'I recommended the old album, but this new one I regret',
 'I loved the old artist, but this new one I hate',
 'I liked the old ep, but this new one I hate',
 'I admired the old genre, but this new one I regret',
 'I admired the old album, but this new one I despise',
 'I admired the old drummer, but this new one I loathe',
 'I welcomed the old compilation, but this new one I despise',
 'I enjoyed the old band, but this new o

### Saving generated sentences to json file format

In [110]:
# pairing each list of sentences with its category and sentiment:
generated = [(negation_neg, 'Negation', 'Negative'),
             (negation_pos, 'Negation', 'Positive'),
             (ironsarc_neg, 'Irony/Sarcasm', 'Negative'),
             (ironsarc_pos, 'Irony/Sarcasm', 'Positive'),
             (posneg_neg, 'Pos/Neg', 'Negative'),
             (posneg_pos, 'Pos/Neg', 'Positive')]

# opening in 'w' mode so that rerunning this code always overwrites the existing file
with open("hard_cases.json", 'w') as f:
    # looping through and adding them all to json file
    for category, category_name, sentiment in generated:
        for reviewText in category:
            # one JSON object per line
            json.dump({'reviewText': reviewText, 'sentiment': sentiment, 'category': category_name}, f)
            f.write('\n')

In [12]:
sen_1 = list(editor.template('I {negation} {pos_verb_present} the {music_noun}')['data'])
sen_2 = list(editor.template('I {past_negation} {neg_verb_present} the really {neg_adj} {music_noun}')['data'])
irony_1 = list(editor.template('No the {neg_adj} {music_noun}, was totally {pos_adj} yeah...')['data'])
negation_3 = list(editor.template('I actually {past_negation} {neg_verb_present} the {music_noun}')['data'])
negation_4 = list(editor.template('Contrary to what i thought, the {music_noun}, wasn\'t {neg_adj}')['data'])

In [13]:
samples_1 = random.sample(sen_1, 15)
samples_2 = random.sample(sen_2, 15)
samples_3 = random.sample(irony_1, 15)
samples_4 = random.sample(negation_3, 15)
samples_5 = random.sample(negation_4, 15)

In [16]:
samples_1

["I don't appreciate the ep",
 'I not welcome the drummer',
 "I don't like the genre",
 "I can't appreciate the artist",
 'I not recommend the drummer',
 "I can't welcome the artist",
 "I don't enjoy the singer",
 "I won't love the artist",
 "I don't welcome the ep",
 'I nothing like the guitarist',
 'I nothing appreciate the band',
 "I don't recommend the genre",
 "I can't value the genre",
 "I won't appreciate the project",
 'I not enjoy the album']

In [17]:
samples_2

["I shouldn't regret the really boring genre",
 "I couldn't abhor the really rough album",
 "I shouldn't regret the really boring album",
 "I didn't abhor the really poor guitar",
 "I shouldn't hate the really sad guitarist",
 "I couldn't regret the really hard ep",
 "I didn't abhor the really nasty singer",
 "I couldn't regret the really bad singer",
 "I wouldn't abhor the really terrible drummer",
 "I didn't regret the really terrible album",
 "I couldn't dislike the really rough singer",
 "I wouldn't dread the really nasty singer",
 "I shouldn't regret the really hard ep",
 "I wouldn't abhor the really sad ep",
 "I didn't hate the really sad ep"]

In [18]:
samples_3

['No the sad album, was totally brilliant yeah...',
 'No the ugly drummer, was totally wonderful yeah...',
 'No the ugly compilation, was totally extraordinary yeah...',
 'No the terrible drummer, was totally happy yeah...',
 'No the average album, was totally awesome yeah...',
 'No the nasty singer, was totally wonderful yeah...',
 'No the lame artist, was totally good yeah...',
 'No the bad ep, was totally fantastic yeah...',
 'No the unhappy project, was totally extraordinary yeah...',
 'No the unpleasant ep, was totally incredible yeah...',
 'No the weird project, was totally extraordinary yeah...',
 'No the difficult ep, was totally good yeah...',
 'No the rough ep, was totally exciting yeah...',
 'No the hard artist, was totally great yeah...',
 'No the awful artist, was totally adorable yeah...']

In [14]:
samples_4

["I actually didn't abhor the compilation",
 "I actually couldn't hate the guitar",
 "I actually didn't hate the singer",
 "I actually didn't hate the project",
 "I actually wouldn't hate the album",
 "I actually couldn't hate the project",
 "I actually couldn't despise the guitarist",
 "I actually shouldn't abhor the album",
 "I actually couldn't abhor the genre",
 "I actually wouldn't regret the artist",
 "I actually wouldn't dread the artist",
 "I actually wouldn't hate the guitar",
 "I actually shouldn't regret the album",
 "I actually couldn't despise the compilation",
 "I actually shouldn't regret the band"]

In [15]:
samples_5

["Contrary to what i thought, the artist, wasn't hard",
 "Contrary to what i thought, the drummer, wasn't rough",
 "Contrary to what i thought, the ep, wasn't terrible",
 "Contrary to what i thought, the band, wasn't nasty",
 "Contrary to what i thought, the project, wasn't frustrating",
 "Contrary to what i thought, the guitarist, wasn't terrible",
 "Contrary to what i thought, the drummer, wasn't sad",
 "Contrary to what i thought, the band, wasn't ridiculous",
 "Contrary to what i thought, the album, wasn't poor",
 "Contrary to what i thought, the artist, wasn't unpleasant",
 "Contrary to what i thought, the genre, wasn't bad",
 "Contrary to what i thought, the compilation, wasn't annoying",
 "Contrary to what i thought, the compilation, wasn't ridiculous",
 "Contrary to what i thought, the compilation, wasn't average",
 "Contrary to what i thought, the ep, wasn't weird"]