# Project phase 1: Baseline

The goal of this phase is to create a baseline model. Note that the word baseline can mean different things. In the course we distinguished three different types of baselines:
* 1. The simplest possible approach (majority baseline, i.e. everything is positive or noun)
* 2. A simple machine learning classifier (logistic regression with words as features)
* 3. The ``state-of-the-art'' approach on which you want to improve (your starting point)

For this phase you need to make a number 2 or 3 baseline. 

If you plan to have a research question like: can we improve sentiment detection systems by doing X, the answer to the question is the most relevant if you have a competetive baseline (3). In this case we would suggest to use a BiLSTM or even a transformer based model, so that you can re-use the baseline for the final research question (phase 3).

You should pick one of the following tasks to create your baseline for.

## Task 1: Sentiment classification
* The data can be found in the `classification` folder.
* The goal is to predict the label in the `sentiment` field.
* **You have to upload the predictions of `music_reviews_test_masked.json.gz` to CodaLab. (The link will be posted here on monday). Note that the format should match the json files in the repository.**
* **Also upload a .txt file on LearnIt (one per group) with a short description of your baseline.**

The data can be read like:

In [1]:
import numpy as np 
import torch 
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [2]:
import gzip
import json
for line in gzip.open('classification/music_reviews_dev.json.gz'):
    review_data = json.loads(line)
    for key in review_data:
        print('"' + key +'": ' + str(review_data[key]))
    break
paths = {'train':'classification/music_reviews_train.json.gz',
        'test':'classification/music_reviews_test_masked.json.gz'}

"vote": 3
"verified": True
"reviewTime": 12 19, 2012
"reviewerID": A1KKWETTT5BZ6N
"asin": B00474S1J2
"reviewText": My dentist recommended this as a relaxation technique for dental visits. They give me an ipod with headphones, play this on it and it relieves some of the stress of dental treatment, which I dislike intensely.
It worked so well that I bought my own copy to try at home. I fall asleep after a couple of minutes and stay asleep. Instead of tossing and turning, I hardly move at all. Highly recommend.
"summary": Out like a light!
"unixReviewTime": 1355875200
"sentiment": positive
"id": 0


In [23]:
train_vocab = {}
train = gzip.open(paths['train'])
counter1 = 0
counter2 = 0
counter3 = 0
train_no_reviewText = []
labels = {}
train_sentences = {}
for line in train:
    counter1 +=1
    #print(line)
    if 'reviewText' in json.loads(line).keys():
        train_sentences[counter3] = json.loads(line)['reviewText']
        counter3 += 1
        for word in json.loads(line)['reviewText'].split():
            if word not in train_vocab.keys():
                train_vocab[word] = counter2
                counter2 += 1
    else:
        train_no_reviewText.append(counter1)
print(counter3,counter2)

99946 226347


In [4]:
# Gram matrix 
m1 = torch.zeros(counter3, counter2)


In [5]:
# Need idx2word
idx2word = dict([(value, key) for key, value in train_vocab.items()])

# Begin correcting gram matrix

for sen in train_sentences: 
    for word in train_sentences[sen].split(): 
        m1[sen, train_vocab[word]] = 1

In [6]:
train_vocab = {}
train = gzip.open(paths['train'])
train_labels = {}
counter = 0
for line in train:
    a = json.loads(line)
    if 'reviewText' in a.keys():
        if a['sentiment'] == 'positive':
            train_labels[counter] = 1
        elif a['sentiment'] == 'negative': 
            train_labels[counter] = 0
        counter +=1

        
#len(labels)
#print(type(labels))



counter2

226347

In [7]:
batch_size = 5000
num_batches = int(len(m1)/batch_size)
train_feats_batches = m1[:batch_size*num_batches].view(num_batches,batch_size, counter2)
for feats_batch in train_feats_batches:
    print(feats_batch.shape)

bingus = list(train_labels.values())
bingus = torch.FloatTensor(bingus)

num_batches = int(len(bingus)/batch_size)
train_label_batches = bingus[:batch_size*num_batches].view(num_batches,batch_size,1)
counter = 1
for feats_batch in train_label_batches:
    counter+=1
    print(feats_batch.shape)
counter

torch.Size([5000, 226347])
torch.Size([5000, 226347])
torch.Size([5000, 226347])
torch.Size([5000, 226347])
torch.Size([5000, 226347])
torch.Size([5000, 226347])
torch.Size([5000, 226347])
torch.Size([5000, 226347])
torch.Size([5000, 226347])
torch.Size([5000, 226347])
torch.Size([5000, 226347])
torch.Size([5000, 226347])
torch.Size([5000, 226347])
torch.Size([5000, 226347])
torch.Size([5000, 226347])
torch.Size([5000, 226347])
torch.Size([5000, 226347])
torch.Size([5000, 226347])
torch.Size([5000, 226347])
torch.Size([5000, 1])
torch.Size([5000, 1])
torch.Size([5000, 1])
torch.Size([5000, 1])
torch.Size([5000, 1])
torch.Size([5000, 1])
torch.Size([5000, 1])
torch.Size([5000, 1])
torch.Size([5000, 1])
torch.Size([5000, 1])
torch.Size([5000, 1])
torch.Size([5000, 1])
torch.Size([5000, 1])
torch.Size([5000, 1])
torch.Size([5000, 1])
torch.Size([5000, 1])
torch.Size([5000, 1])
torch.Size([5000, 1])
torch.Size([5000, 1])


20

In [8]:
# Encode test labels
test_vocab = {}
test = gzip.open(paths['test'])
counter1 = 0
counter2 = 0
counter3 = 0
test_no_reviewText = []
test_labels = {}
test_sentences = {}
for line in test:
    counter1 +=1
    #print(line)
    if 'reviewText' in json.loads(line).keys():
        test_sentences[counter3] = json.loads(line)['reviewText']
        counter3 += 1
        for word in json.loads(line)['reviewText'].split():
            if word not in train_vocab.keys():
                test_vocab[word] = counter2
                counter2 += 1
    else:
        test_no_reviewText.append(counter1)
print('Vocab size: ', counter2)
        
# Construct gram matrix
m2 = torch.zeros(counter3, 226347)
print('m2 constructed!')


# Need idx2word
idx2word = dict([(value, key) for key, value in train_vocab.items()])
print('idx2word done!')


# Begin correcting gram matrix
for sen in test_sentences: 
    for word in test_sentences[sen].split(): 
        if word in train_vocab.keys():
            m2[sen, train_vocab[word]] = 1
print('Gram matrix done')

#Note labels
test = gzip.open(paths['test'])
test_labels = {}
counter = 0
for line in test:
    a = json.loads(line)
    if 'reviewText' in a.keys():
        if a['sentiment'] == 'positive':
            test_labels[counter] = 1
        elif a['sentiment'] == 'negative': 
            test_labels[counter] = 0
        counter +=1
print('Labels noted!')
        
#Divide into batches

batch_size = 2499
num_batches = int(len(m2)/batch_size)
test_feats_batches = m2[:batch_size*num_batches].view(num_batches,batch_size, 226347)
print('Feature Matrix shapes: ')
for feats_batch in test_feats_batches:
    print(feats_batch.shape)
bingus = list(test_labels.values())
bingus = torch.FloatTensor(bingus)
num_batches = int(len(bingus)/batch_size)
test_label_batches = bingus[:batch_size*num_batches].view(num_batches,batch_size,1)
print('label matrix shapes: ')
for feats_batch in test_label_batches:
    print(feats_batch.shape)

Vocab size:  397685
m2 constructed!
idx2word done!
Gram matrix done
Labels noted!
Feature Matrix shapes: 
torch.Size([2499, 226347])
torch.Size([2499, 226347])
torch.Size([2499, 226347])
label matrix shapes: 


In [9]:
def load_vocab(filepath):
    test_vocab = {}
    test = gzip.open(filepath)
    counter1 = 0
    counter2 = 0
    counter3 = 0
    test_no_reviewText = []
    test_labels = {}
    test_sentences = {}
    for line in test:
        counter1 +=1
        #print(line)
        if 'reviewText' in json.loads(line).keys():
            test_sentences[counter3] = json.loads(line)['reviewText']
            counter3 += 1
            for word in json.loads(line)['reviewText'].split():
                if word not in train_vocab.keys():
                    test_vocab[word] = counter2
                    counter2 += 1
        else:
            test_no_reviewText.append(counter1)
    final_dict = {'line_count' : counter1,
                 'review_count' : counter3,
                 'vocab_size' : counter2,
                 'no_text_reviews' : test_no_reviewText,
                 'labels' : test_labels,
                 'vocabulary' : test_vocab,
                 'sentences' : test_sentences}
    return final_dict

def construct_gram(vocab, num_sen, vocab_len, sentences): 
    # Construct gram matrix
    m2 = torch.zeros(num_sen, vocab_len)
    print('m2 constructed!')
    for sen in sentences: 
        for word in sentences[sen].split(): 
            m2[sen, vocab[word]] = 1
    print('Gram matrix done')
    return m2



In [12]:
#model = LogisticRegression()
#for feat, label in zip(train_feats_batches,train_label_batches):
#    model.fit(feat,label)


  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


In [13]:
preds = {}
counter = 0
for batch in test_feats_batches:
    preds[counter] = model.predict(batch)
    counter+=1

In [14]:
new_arr = []
for i in preds:
    for j in preds[i]:
        new_arr.append(j)
len(new_arr)
new_arr = np.array(new_arr)
new_arr.shape

(7497,)

In [83]:
for pred, true in zip(preds, test_label_batches):
    print(classification_report(preds[pred], true))

              precision    recall  f1-score   support

         0.0       0.73      0.85      0.78       901
         1.0       0.91      0.82      0.86      1598

    accuracy                           0.83      2499
   macro avg       0.82      0.84      0.82      2499
weighted avg       0.84      0.83      0.83      2499

              precision    recall  f1-score   support

         0.0       0.72      0.83      0.77       943
         1.0       0.89      0.81      0.85      1556

    accuracy                           0.82      2499
   macro avg       0.81      0.82      0.81      2499
weighted avg       0.83      0.82      0.82      2499

              precision    recall  f1-score   support

         0.0       0.75      0.86      0.80       991
         1.0       0.90      0.81      0.85      1508

    accuracy                           0.83      2499
   macro avg       0.82      0.83      0.83      2499
weighted avg       0.84      0.83      0.83      2499

              preci

In [15]:
test = gzip.open(paths['test'])
counter = 0
new_data = []
for i in test:
    bingus = json.loads(i)
    if new_arr[counter] == 0:
        bingus['sentiment'] = 'negative'
    elif new_arr[counter] == 1:
        bingus['sentiment'] = 'positive'
    new_data.append(bingus)

In [17]:
for i in new_data:
    print(i)
    break

{'verified': True, 'reviewTime': '10 24, 2017', 'reviewerID': 'A2HAJB8L9NVYTZ', 'asin': 'B007Y1AMHE', 'reviewText': 'ok', 'summary': 'ok', 'unixReviewTime': 1508803200, 'sentiment': 'positive', 'id': 0}


In [21]:
#with open("final.json", 'a') as f:
#    for i in new_data:
#        json.dump(i,f)
#        f.write('\n')

# Part 2: Break it down

In this part of the project, we are tasked with breaking our own model down, to try and improve it. <br> 

### Suggested methods: 
- Change language
- More negation
- Reviews of other products

### Things we should also consider: 
- Better tokenization
- Model tuning
- Acutually using development data

In [1]:
# better tokenization: 
'''
1. Look up better regex expression
2. Remove stopwords
3. implement padding(might have to wait on that one lmao)
4. Use a way more sophisticated model. (might wanna wait on that one too)
'''













'\n1. Look up better regex expression\n2. Remove stopwords\n3. implement padding(might have to wait on that one lmao)\n4. Use a way more sophisticated model. (might wanna wait on that one too)\n'

In [2]:
#Reopen file + imports: 
import gzip
import json
import torch 
from nltk.tokenize import TweetTokenizer
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn.metrics import classification_report

# This is how they load the data ------------------------------------------
#for line in gzip.open('../classification/music_reviews_dev.json.gz'):
#    review_data = json.loads(line)
#    for key in review_data:
#        print('"' + key +'": ' + str(review_data[key]))
#    break
#--------------------------------------------------------------------------


paths = {'train':'../classification/music_reviews_train.json.gz',
        'test':'../classification/music_reviews_test_masked.json.gz',
        'dev' : '../classification/music_reviews_dev.json.gz'}

In [3]:
# Vocabs built using TweetTokenizer
# 


def build_vocab(filepath):
    train_vocab = {}
    train = gzip.open(filepath)
    counter1 = 0
    counter2 = 0
    counter3 = 0
    counter = 0
    no_reviewText = []
    labels = {}
    sentences = {}
    tokenizer = TweetTokenizer()
    for line in train:
        counter1 +=1
        #print(line)
        if 'reviewText' in json.loads(line).keys():
            a = json.loads(line)
            sentences[counter3] = a['reviewText']
            counter3 += 1
            if a['sentiment'] == 'positive':
                labels[counter] = 1
            elif a['sentiment'] == 'negative': 
                labels[counter] = 0
            counter +=1
            for word in tokenizer.tokenize(json.loads(line)['reviewText']):
                if word not in train_vocab.keys():
                    train_vocab[word] = counter2
                    counter2 += 1
        else:
            no_reviewText.append(counter1)
    final_dict = {'line_count' : counter1,
                 'review_count' : counter3,
                 'vocab_size' : counter2,
                 'no_text_reviews' : no_reviewText,
                 'labels' : labels,
                 'vocabulary' : train_vocab,
                 'sentences' : sentences}
    return final_dict

train_set =  build_vocab(paths['train'])
dev_set = build_vocab(paths['dev'])



In [4]:
# Might wanna run tests again just to see if the new tokenization improved performance in any way. 
tokenizer = TweetTokenizer()
# unigrams : 
def create_unigram(vocab, sentences, tokenzier):
    # Create matrix
    m1 = torch.zeros(len(sentences), len(vocab))
    # Correct indices
    for sen in range(len(sentences)): 
        for word in tokenizer.tokenize(sentences[sen]): 
            if word in vocab.keys():
                m1[sen, vocab[word]] = 1
    return m1

train_unigram = create_unigram(train_set['vocabulary'], train_set['sentences'], tokenizer)

def create_batches(matrix, batch_size,labels): 
    num_batches = int(len(matrix)/batch_size)
    feats_batches = matrix[:batch_size*num_batches].view(num_batches,batch_size, matrix.shape[1])
    bingus = torch.FloatTensor(list(labels.values()))
    num_batches = int(len(bingus)/batch_size)
    label_batches = bingus[:batch_size*num_batches].view(num_batches,batch_size,1)
    return feats_batches, label_batches
train_feat_batches = create_batches(train_unigram, 2499, train_set['labels'])

In [5]:
labels_batches = list(train_feat_batches[1])
feat_batches = list(train_feat_batches[0])

In [6]:
# Make model
model = LogisticRegression()

# Train model

for feat, label in zip(feat_batches, labels_batches):
    model.fit(feat,label)

  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)


  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)
  return f(**kwargs)


In [7]:
dev_unigram = create_unigram(train_set['vocabulary'], dev_set['sentences'], tokenizer)

In [8]:
dev_batches = create_batches(dev_unigram,2499,dev_set['labels'])
preds = {}
counter = 0
for batch in dev_batches[0]:
    preds[counter] = model.predict(batch)
    counter+=1
new_arr = []
for i in preds:
    for j in preds[i]:
        new_arr.append(j)
len(new_arr)
new_arr = np.array(new_arr)
for pred, true in zip(preds, dev_batches[1]):
    print(classification_report(preds[pred], true))

              precision    recall  f1-score   support

         0.0       0.74      0.87      0.80       900
         1.0       0.92      0.83      0.87      1599

    accuracy                           0.84      2499
   macro avg       0.83      0.85      0.84      2499
weighted avg       0.85      0.84      0.85      2499

              precision    recall  f1-score   support

         0.0       0.71      0.84      0.77       922
         1.0       0.90      0.80      0.85      1577

    accuracy                           0.82      2499
   macro avg       0.81      0.82      0.81      2499
weighted avg       0.83      0.82      0.82      2499

              precision    recall  f1-score   support

         0.0       0.76      0.87      0.81       987
         1.0       0.90      0.82      0.86      1512

    accuracy                           0.84      2499
   macro avg       0.83      0.84      0.83      2499
weighted avg       0.84      0.84      0.84      2499

              preci

In [9]:
hard_preds = {}
counter = 0
for batch, pred_dict, true_dict in zip(dev_batches[0], preds.values(), dev_batches[1]):
    for sentence, pred, true in zip(batch, pred_dict, true_dict):
        if pred != true:
            hard_preds[counter] = int(pred)
        counter += 1

In [10]:
hard_preds

{0: 0,
 2: 1,
 7: 1,
 8: 1,
 9: 0,
 10: 1,
 25: 0,
 32: 1,
 37: 0,
 48: 0,
 49: 1,
 58: 1,
 62: 0,
 64: 0,
 71: 0,
 78: 1,
 79: 1,
 87: 1,
 88: 1,
 91: 1,
 93: 1,
 95: 1,
 108: 1,
 109: 1,
 123: 1,
 131: 0,
 132: 1,
 133: 0,
 136: 1,
 143: 1,
 147: 1,
 150: 1,
 153: 0,
 162: 0,
 176: 1,
 187: 0,
 189: 0,
 196: 0,
 206: 0,
 216: 1,
 222: 1,
 224: 0,
 226: 0,
 230: 1,
 239: 1,
 241: 1,
 243: 0,
 249: 1,
 261: 1,
 276: 1,
 285: 1,
 289: 0,
 293: 1,
 297: 1,
 308: 0,
 312: 1,
 316: 1,
 317: 1,
 320: 1,
 324: 0,
 339: 1,
 352: 0,
 358: 1,
 360: 1,
 363: 1,
 385: 0,
 391: 1,
 394: 1,
 401: 0,
 408: 1,
 419: 1,
 424: 1,
 445: 1,
 447: 1,
 454: 1,
 456: 1,
 459: 1,
 460: 0,
 465: 1,
 486: 1,
 499: 1,
 501: 1,
 516: 0,
 517: 0,
 532: 1,
 546: 1,
 551: 1,
 559: 0,
 565: 0,
 572: 1,
 595: 0,
 598: 1,
 604: 1,
 607: 1,
 611: 1,
 612: 1,
 616: 1,
 625: 0,
 629: 1,
 632: 1,
 647: 0,
 651: 1,
 654: 0,
 657: 1,
 659: 1,
 661: 1,
 666: 1,
 683: 0,
 686: 1,
 689: 0,
 693: 1,
 697: 0,
 699: 1,
 706: 0,
 

In [11]:
# grabbing the first 200 misclassification sentences
hard_sens = []
for i in hard_preds:
    hard_sentence = train_set['sentences'][i]
    if len(hard_sens) >= 200:
        break
    else:
        hard_sens.append(i)

In [12]:
test = gzip.open(paths['dev'])
new_data = []
counter = 0
for i in test:
    bingus = json.loads(i)
    if counter in hard_sens:
        new_data.append(bingus)
    counter+=1

In [13]:
with open("hard_cases.json", 'a') as f:
    for i in new_data:
        json.dump(i,f)
        f.write('\n')

SyntaxError: invalid character in identifier (<ipython-input-14-7d73b51de4c3>, line 24)

### Checklist section:
Gonna follow how they do in the official documentation.
Pretty dumb donkey way of doing this. But hope it works.


In [41]:
# Load the same stuff that they do in the doc. 
import checklist.editor
import checklist.text_generation
from checklist.test_types import MFT, INV, DIR
from checklist.expect import Expect
import numpy as np
import spacy
from checklist.test_suite import TestSuite
from checklist.perturb import Perturb
from spacy.lang.en.examples import sentences 
import random 
# Load editor. 

editor = checklist.editor.Editor()
editor.tg
nlp = spacy.load('en_core_web_sm')
sentences = dev_set['sentences'].values()
parsed_data = list(nlp.pipe(sentences))
suite = TestSuite()

In [16]:
parsed_data[0]

My dentist recommended this as a relaxation technique for dental visits. They give me an ipod with headphones, play this on it and it relieves some of the stress of dental treatment, which I dislike intensely.
It worked so well that I bought my own copy to try at home. I fall asleep after a couple of minutes and stay asleep. Instead of tossing and turning, I hardly move at all. Highly recommend.

In [59]:
# Constructing collection of music_nouns
music_noun = ['project', 'artist', 'album', 'genre', 'compilation', 'ep', 'singer', 'band', 'guitar', 'drummer', 'guitarist']
editor.add_lexicon('music_noun', music_noun, overwrite=True)

# Negations

negations = ['don\'t', 'can\'t' , 'not', 'won\'t', 'nothing']
past_negations = ['didn\'t', 'wouldn\'t', 'couldn\'t', 'shouldn\'t']
editor.add_lexicon('negation', negations, overwrite=True)
editor.add_lexicon('past_negation', past_negations, overwrite=True)

# Post / neg adj 
pos_adj = ['good', 'great', 'excellent', 'amazing', 'extraordinary', 'beautiful', 'fantastic', 'nice', 'incredible', 'exceptional', 'awesome', 'perfect', 'fun', 'happy', 'adorable', 'brilliant', 'exciting', 'sweet', 'wonderful']
neg_adj = ['awful', 'bad', 'horrible', 'weird', 'rough', 'lousy', 'unhappy', 'average', 'difficult', 'poor', 'sad', 'frustrating', 'hard', 'lame', 'nasty', 'annoying', 'boring', 'creepy', 'dreadful', 'ridiculous', 'terrible', 'ugly', 'unpleasant']
editor.add_lexicon('pos_adj', pos_adj, overwrite=True)
editor.add_lexicon('neg_adj', neg_adj, overwrite=True)

# Bunch of other shit 

pos_verb_present = ['like', 'enjoy', 'appreciate', 'love',  'recommend', 'admire', 'value', 'welcome']
neg_verb_present = ['hate', 'dislike', 'regret',  'abhor', 'dread', 'despise' ]
neutral_verb_present = ['see', 'find']
pos_verb_past = ['liked', 'enjoyed', 'appreciated', 'loved', 'admired', 'valued', 'welcomed']
neg_verb_past = ['hated', 'disliked', 'regretted',  'abhorred', 'dreaded', 'despised']
neutral_verb_past = ['saw', 'found']
editor.add_lexicon('pos_verb_present', pos_verb_present, overwrite=True)
editor.add_lexicon('neg_verb_present', neg_verb_present, overwrite=True)
editor.add_lexicon('neutral_verb_present', neutral_verb_present, overwrite=True)
editor.add_lexicon('pos_verb_past', pos_verb_past, overwrite=True)
editor.add_lexicon('neg_verb_past', neg_verb_past, overwrite=True)
editor.add_lexicon('neutral_verb_past', neutral_verb_past, overwrite=True)
editor.add_lexicon('pos_verb', pos_verb_present+ pos_verb_past, overwrite=True)
editor.add_lexicon('neg_verb', neg_verb_present + neg_verb_past, overwrite=True)
editor.add_lexicon('neutral_verb', neutral_verb_present + neutral_verb_past, overwrite=True)


In [99]:
sen_1 = list(editor.template('I {negation} {pos_verb_present} the {music_noun}')['data'])
sen_2 = list(editor.template('I {past_negation} {neg_verb_present} the really {neg_adj} {music_noun}')['data'])
irony_1 = list(editor.template('No the {neg_adj} {music_noun}, was totally {pos_adj} yeah...')['data'])
negation_3 = list(editor.template('I actually {past_negation} {neg_verb_present} the {music_noun}')['data'])
negation_4 = list(editor.template('Contrary to what i thought, the {music_noun}, wasn\'t {neg_adj}')['data'])



In [100]:
samples_1 = random.sample(sen_1, 15)
samples_2 = random.sample(sen_2, 15)
samples_3 = random.sample(irony_1, 15)
samples_4 = random.sample(negation_3, 15)
samples_5 = random.sample(negation_4, 15)

In [102]:
samples_4

["I actually didn't hate the artist",
 "I actually wouldn't despise the album",
 "I actually wouldn't dread the guitarist",
 "I actually shouldn't hate the ep",
 "I actually didn't dread the genre",
 "I actually wouldn't dread the compilation",
 "I actually wouldn't abhor the artist",
 "I actually couldn't hate the guitar",
 "I actually didn't despise the genre",
 "I actually wouldn't abhor the genre",
 "I actually shouldn't dislike the ep",
 "I actually couldn't dread the guitarist",
 "I actually shouldn't dislike the guitarist",
 "I actually couldn't abhor the compilation",
 "I actually wouldn't despise the ep"]