## Phase 1: Baseline 
The goal of this phase is to create a baseline model. Note that the word baseline can mean different things. In the course we distinguished three different types of baselines:

1. The simplest possible approach (a majority baseline, e.g. labelling every review as positive or, in a tagging task, every token as a noun)
2. A simple machine learning classifier (logistic regression with words as features)
3. The 'state-of-the-art' approach on which you want to improve (your starting point)
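
As a point of comparison for the first baseline type, a majority baseline can be sketched in a few lines. This is only an illustration: the helper names `majority_baseline` and `accuracy` and the toy labels are made up here and are not part of the course material.

```python
from collections import Counter

def majority_baseline(train_labels):
    """Return a predictor that always outputs the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]

    def predict(texts):
        # The input texts are ignored on purpose: every item gets the same label
        return [majority_label for _ in texts]

    return predict

def accuracy(gold, predicted):
    """Fraction of positions where gold and predicted labels agree."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

# Toy data, invented for illustration only
toy_train_labels = ['positive', 'positive', 'negative', 'positive']
toy_dev_texts = ['great album', 'not my thing', 'love it']
toy_dev_gold = ['positive', 'negative', 'positive']

predict = majority_baseline(toy_train_labels)
toy_dev_pred = predict(toy_dev_texts)
print(toy_dev_pred, accuracy(toy_dev_gold, toy_dev_pred))
```

The accuracy of such a predictor equals the share of the majority class in the evaluation data, which makes it a useful floor for the logistic regression model below.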

### Task: Sentiment classification
The data can be found in the **classification folder**.  
The goal is to **predict the label** in the sentiment field.  
You have to upload the predictions for music_reviews_test_masked.json.gz to CodaLab (the link will be posted here on Monday). Note that the format should match the JSON files in the repository.  
Also upload a .txt file on LearnIt (one per group) with a short description of your baseline.  

### 0 - Imports

In [None]:
import numpy as np
import pandas as pd
import gzip
import json
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

### 1 - Data Preprocessing

1.1 - Load the Data


In [None]:
PATH = {'train':'../data/classification/music_reviews_train.json.gz',
        'dev': '../data/classification/music_reviews_dev.json.gz',
        'test': '../data/classification/music_reviews_test_masked.json.gz'}

In [None]:
def load_data(path):
    '''
    Function to load the data
    -----
    Takes in the argument: 
        'path' - takes the form PATH['(train, dev or test)']
    '''
    dic = {}
    # open in text mode so the file is decoded and closed properly
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for i, line in enumerate(f):
            dic[i] = json.loads(line)  # one JSON object per line
    return dic

In [None]:
train_data = load_data(PATH['train'])
dev_data = load_data(PATH['dev'])
test_data = load_data(PATH['test'])
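
Before cleaning, it can help to check which fields are actually present and how the labels are distributed. The sketch below assumes the nested-dictionary shape returned by `load_data()` and the field names used later in this notebook (`reviewText`, `summary`, `sentiment`); the helper `summarize` and the toy records are hypothetical.

```python
from collections import Counter

def summarize(data):
    """Count how often each field and each sentiment label occurs."""
    field_counts = Counter()
    label_counts = Counter()
    for record in data.values():
        field_counts.update(record.keys())
        # .get() so that records without a label are counted under None
        label_counts[record.get('sentiment')] += 1
    return field_counts, label_counts

# Toy records in the same shape as the output of load_data() (invented values)
toy_data = {
    0: {'reviewText': 'Great record', 'summary': 'Loved it', 'sentiment': 'positive'},
    1: {'summary': 'Meh', 'sentiment': 'negative'},
    2: {'reviewText': 'Solid', 'sentiment': 'positive'},
}

field_counts, label_counts = summarize(toy_data)
print(field_counts)
print(label_counts)
```

A field count below the number of records shows how many entries lack that field, which is the case that `clean()` has to handle.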

1.2 - Data Cleaning

In [None]:
def sent_encode(sent):
    '''
    Helper function to encode sentiment
    ------
    Takes in string description
        'sent' - either positive or negative
    Returns binary encoding
        1 = positive sentiment
        0 = negative sentiment
    '''
    if sent == 'positive':
        return 1
    if sent == 'negative':
        return 0 
    return 'unknown sentiment'

In [None]:
def clean(data):
    '''
    Function to clean the data
    -----
    Takes in data set from load_data()
        'data' - nested dictionary  
    Returns two lists
        cleaned - X list
        ys - y list
    '''
    cleaned = [] 
    ys = []
    for idx in data:
        review = data[idx].get('reviewText', None)  # some entries have no review text
        summary = data[idx].get('summary', None)  # some entries have no summary

        # combine summary and review; skip entries that have neither
        if review is None and summary is None:
            continue
        elif review is None:
            text = summary
        elif summary is None:
            text = review
        else:
            text = summary + ' ' + review

        sequence = word_tokenize(text)  # splits gotta into got ta
        cleaned.append(sequence)

        # encode sentiment
        ys.append(sent_encode(data[idx]['sentiment']))

    return cleaned, ys

In [None]:
cleaned_train, y_train = clean(train_data)
cleaned_dev, y_dev = clean(dev_data)
cleaned_test, _ = clean(test_data)

1.3 Vocab & Corpus

In [None]:
def get_vocab_corpus(dataset):
    '''
    Function computing vocabulary and corpus for a dataset
    -----
    Takes a cleaned dataset - list 
        dataset - X list 
    Returns
        vocab - set of unique tokens in dataset
        corpus - list of strings; sentences in dataset 
    '''
    vocab = set()
    corpus = []
    for text in dataset:
        sentence = ''
        for token in text:
            vocab.add(token)
            if token in ['.','!','?',',']:
                sentence += token 
            else:
                sentence += ' ' + token 
        corpus.append(sentence.lstrip()) 
    return vocab, corpus

In [None]:
train_vocabulary, train_corpus = get_vocab_corpus(cleaned_train)
dev_vocabulary, dev_corpus = get_vocab_corpus(cleaned_dev)
test_vocabulary, test_corpus = get_vocab_corpus(cleaned_test) # dev and test vocab not used
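
To make the rejoining step concrete, here is a small worked example of the same rule on a hand-written token list. The helper name `join_tokens` is hypothetical, and the tokens are typed out by hand (rather than produced by `word_tokenize`) so that the snippet runs without NLTK data.

```python
def join_tokens(tokens):
    """Rejoin tokens with spaces, attaching . ! ? , to the preceding token."""
    sentence = ''
    for token in tokens:
        if token in ['.', '!', '?', ',']:
            sentence += token  # no space before these punctuation marks
        else:
            sentence += ' ' + token
    return sentence.lstrip()  # drop the space added before the first token

# Hand-written tokens, mimicking a tokenizer that splits "gotta" into "got ta"
tokens = ['Great', 'album', ',', 'got', 'ta', 'love', 'it', '!']
vocab = set(tokens)

print(join_tokens(tokens))  # Great album, got ta love it!
print(len(vocab))           # 8
```

Note that the rejoined string is not identical to the raw text: split contractions such as "got ta" stay split, and any other punctuation is surrounded by spaces.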

1.4 Combine train and dev for cross validation

In [None]:
traindev_corpus = train_corpus + dev_corpus
y_traindev = y_train + y_dev

1.5 Bag of Words

In [None]:
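def get_bow(vocab, corp, vectorizer=None):
    '''
    Function returning sparse matrix of Term Frequency - Inverse Document Frequencies
    -----
    Takes vocab and corpus, plus an optional fitted vectorizer
        vocab - set of unique words
        corp - list of strings
        vectorizer - fitted TfidfVectorizer to reuse; None fits a new one
    Returns
        bow - sparse 2d matrix; input to model
        vectorizer - the fitted TfidfVectorizer
    '''
    if vectorizer is None:
        # fit the IDF weights on this corpus (sorted for a stable column order)
        vectorizer = TfidfVectorizer(vocabulary=sorted(vocab))
        bow = vectorizer.fit_transform(corp)
    else:
        # reuse the IDF weights learned at fit time instead of refitting
        bow = vectorizer.transform(corp)
    return bow, vectorizer

In [None]:
train_bow, tfidf = get_bow(train_vocabulary, traindev_corpus)
test_bow, _ = get_bow(train_vocabulary, test_corpus, tfidf)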
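
One detail worth checking at this step is where the IDF weights come from: a `TfidfVectorizer` learns them from whatever corpus it is fitted on, so the model only sees consistent features if the test texts go through the vectorizer fitted on the training texts. The sketch below, on invented toy texts, compares `transform` with a refit on the test texts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy texts, invented for illustration only
train_texts = ['great album', 'great songs great sound', 'boring album']
test_texts = ['great boring album']

# Fit once on the training texts: vocabulary and IDF weights come from here
vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_texts)

# Reuse the fitted vectorizer: same columns, same IDF weights
test_matrix = vectorizer.transform(test_texts)

# For comparison: same vocabulary, but IDF weights refitted on the test texts
refit = TfidfVectorizer(vocabulary=sorted(vectorizer.vocabulary_))
refit_matrix = refit.fit_transform(test_texts)

print(train_matrix.shape, test_matrix.shape)
print(test_matrix.toarray().round(3))
print(refit_matrix.toarray().round(3))
```

Both test matrices have the same columns, but the values differ because the refitted vectorizer computes its document frequencies from the test texts alone.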

### 2 - Run the Model

2.1 Grid search with cross-validation on the combined train and dev data to find the best model

In [None]:
lr = LogisticRegression()
parameters = {'max_iter':[100,500,1000], 'C': [1,2,3,4]}
grid = GridSearchCV(lr, parameters)
grid.fit(train_bow, y_traindev)
grid.best_score_

In [None]:
best_model = grid.best_estimator_
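
A possible variation, sketched here under assumptions rather than taken from the notebook: wrapping the vectorizer and the classifier in a scikit-learn `Pipeline` lets `GridSearchCV` refit the TF-IDF weights inside each fold, so the held-out fold does not influence the features. The toy texts, the step names `tfidf` and `clf`, and `cv=2` are chosen only to keep the example small (the default is 5-fold).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only
texts = ['great album', 'love this record', 'wonderful songs', 'great sound',
         'boring album', 'hate this record', 'terrible songs', 'awful sound']
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {'clf__C': [1, 2, 3, 4]}
search = GridSearchCV(pipeline, param_grid, cv=2, scoring='accuracy')
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
for params, score in zip(search.cv_results_['params'],
                         search.cv_results_['mean_test_score']):
    print(params, round(score, 3))
```

Looking at `cv_results_` rather than only `best_score_` shows how much the score actually varies across the grid, which helps to judge whether the chosen parameters matter.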

2.2 Predict on Test Data

In [None]:
y_pred = best_model.predict(test_bow)
y_pred

2.3 Insert predictions into the test data

In [None]:
def reverse_encode(sent):
    if sent == 1:
        return 'positive'
    if sent == 0:
        return 'negative'

In [None]:
def pred_test(test, ys):
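    '''
    Function to insert predictions into test data
    -----
    Skips entries without review text and summary, mirroring clean(),
    so that predictions stay aligned with their reviews
    '''
    index = 0
    for key in test:
        review = test[key].get('reviewText', None)
        summary = test[key].get('summary', None)
        if review is None and summary is None:
            continue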
        test[key]['sentiment'] = reverse_encode(ys[index])
        index += 1
    return test

In [None]:
finished_test_data = pred_test(test_data,y_pred)
finished_test_data

In [None]:
#test_json=[json.dumps(i)+'\n' for i in finished_test_data.values()]
#with open ('music_reviews_test.json', 'w') as file:
#    file.writelines(test_json)
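
Since the input files are gzip-compressed with one JSON object per line, a submission in the same layout could be written as sketched below. Whether CodaLab expects the compressed or the plain file is not stated here, so treat the output format as an assumption; the helper names, the file name and the toy records are hypothetical.

```python
import gzip
import json
import os
import tempfile

def write_jsonl_gz(records, path):
    """Write one JSON object per line to a gzip-compressed text file."""
    with gzip.open(path, 'wt', encoding='utf-8') as handle:
        for record in records:
            handle.write(json.dumps(record) + '\n')

def read_jsonl_gz(path):
    """Read a gzip-compressed file with one JSON object per line."""
    with gzip.open(path, 'rt', encoding='utf-8') as handle:
        return [json.loads(line) for line in handle]

# Toy predictions in the same shape as finished_test_data (invented values)
toy_predictions = {
    0: {'summary': 'Loved it', 'sentiment': 'positive'},
    1: {'summary': 'Meh', 'sentiment': 'negative'},
}

# A temporary directory keeps the example from leaving files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy_predictions.json.gz')
    write_jsonl_gz(toy_predictions.values(), out_path)
    round_trip = read_jsonl_gz(out_path)

print(round_trip == list(toy_predictions.values()))
```

Reading the file back and comparing it with the in-memory records is a cheap check that nothing was lost or reordered before uploading.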