## Phase 1: Baseline 
The goal of this phase is to create a baseline model. Note that the word baseline can mean different things. In the course we distinguished three different types of baselines:

1. The simplest possible approach (majority baseline, i.e. always predict the most frequent class, such as labelling everything as positive or as a noun)
2. A simple machine learning classifier (logistic regression with words as features)
3. The 'state-of-the-art' approach on which you want to improve (your starting point)
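
The first type of baseline can be sketched in a few lines. This is a minimal illustration on made-up toy labels, not part of the assignment code; the helper name `majority_baseline` and the toy lists are assumptions.

```python
from collections import Counter

def majority_baseline(train_labels):
    """Return a predict function that always outputs the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    return lambda items: [majority_label for _ in items]

# toy labels, invented for illustration only
toy_train_labels = ['positive', 'positive', 'negative']
toy_dev_labels = ['negative', 'positive']

predict = majority_baseline(toy_train_labels)
toy_predictions = predict(toy_dev_labels)

# accuracy of the majority baseline on the toy dev labels
toy_accuracy = sum(p == y for p, y in zip(toy_predictions, toy_dev_labels)) / len(toy_dev_labels)
print(toy_predictions, toy_accuracy)
```

Any learned model should beat this number on the dev set; if it does not, something is likely wrong in the pipeline.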

### Task: Sentiment classification
The data can be found in the **classification folder**.  
The goal is to **predict the label** in the sentiment field.  
Upload your predictions for `music_reviews_test_masked.json.gz` to CodaLab (the link will be posted here on Monday). Note that the format should match the JSON files in the repository.  
Also upload a .txt file on LearnIt (one per group) with a short description of your baseline.  

## 0 - Imports

In [None]:
import numpy as np
import pandas as pd
import gzip
import json
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.linear_model import LogisticRegression
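# set to True on the first run to download the NLTK resources used below
# (newer NLTK releases may additionally need 'punkt_tab')
run = False
if run:
    nltk.download('punkt')
    nltk.download('stopwords')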

## 1 - Data Preprocessing

### 1.1 - Load the Data


In [None]:
PATH = {'train':'../data/classification/music_reviews_train.json.gz',
        'dev': '../data/classification/music_reviews_dev.json.gz',
        'test': '../data/classification/music_reviews_test_masked.json.gz'}

In [None]:
def load_data(path):
    '''
    Load a gzipped file with one JSON object (review) per line.
    -----
    Takes in the argument: 
        'path' - one of PATH['train'], PATH['dev'] or PATH['test']
    Returns a dict mapping the line index to the review dict.
    '''
    dic = {}
    # open in text mode; the context manager closes the file when done
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for i, line in enumerate(f):
            dic[i] = json.loads(line)
    return dic

In [None]:
train_data = load_data(PATH['train'])
dev_data = load_data(PATH['dev'])
test_data = load_data(PATH['test'])
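
The loader reads each file line by line and parses every line as JSON, so it expects one JSON object per line inside a gzip archive. The sketch below writes and reads back a tiny file in that layout to make the structure concrete. The two toy reviews are invented; only the field names `reviewText`, `summary` and `sentiment` come from this notebook.

```python
import gzip
import json
import os
import tempfile

# two invented reviews; the second one has no 'reviewText', as can happen in the real data
toy_reviews = [
    {'reviewText': 'Great album', 'summary': 'Loved it', 'sentiment': 'positive'},
    {'summary': 'Not for me', 'sentiment': 'negative'},
]

# write one JSON object per line into a gzip file
toy_path = os.path.join(tempfile.mkdtemp(), 'toy_reviews.json.gz')
with gzip.open(toy_path, 'wt', encoding='utf-8') as f:
    for review in toy_reviews:
        f.write(json.dumps(review) + '\n')

# read it back into a dict keyed by line index
with gzip.open(toy_path, 'rt', encoding='utf-8') as f:
    loaded = {i: json.loads(line) for i, line in enumerate(f)}

# quick sanity check: label distribution
label_counts = {}
for review in loaded.values():
    label_counts[review['sentiment']] = label_counts.get(review['sentiment'], 0) + 1
print(label_counts)
```

Checking the label distribution right after loading is a cheap way to see how strong the majority baseline will be.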

### 1.2 - Data Cleaning

In [None]:
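def sent_encode(sent):
    '''Encode the sentiment label as 1 (positive) or 0 (negative).'''
    if sent == 'positive':
        return 1
    if sent == 'negative':
        return 0
    # anything else, e.g. the masked labels of the test set
    return 'unknown sentiment'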

In [None]:
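def clean(data):
    '''
    Combine summary and review text, tokenize, lowercase and remove stop words.
    Returns the cleaned reviews (keyed by their original index) and the encoded labels.
    Reviews with neither a summary nor a review text are skipped, so the
    result can be shorter than `data`.
    '''
    stop = set(stopwords.words('english'))  # set for fast membership tests
    cleaned = {}
    ys = []
    for idx in data:
        review = data[idx].get('reviewText')  # some data does not have a review text
        summary = data[idx].get('summary')  # some data does not have a summary

        # combine summary and review
        if review is None and summary is None:
            continue
        elif review is None:
            text = summary
        elif summary is None:
            text = review
        else:
            text = summary + ' ' + review

        # tokenize, lowercase and remove stop words
        # (word_tokenize splits contractions, e.g. 'gotta' into 'got' + 'ta')
        tokens = [token.lower() for token in word_tokenize(text)]
        cleaned[idx] = {'text': [token for token in tokens if token not in stop]}

        # encode sentiment (.get avoids a KeyError if the field is missing)
        ys.append(sent_encode(data[idx].get('sentiment')))

    return cleaned, ys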

In [None]:
cleaned_train, y_train = clean(train_data)
cleaned_dev, y_dev = clean(dev_data)
cleaned_test, _ = clean(test_data)

### 1.3 - Vocab & Corpus

In [None]:
def get_vocab_corpus(train):
    vocab = set()
    corpus = []
    for idx in train:
        text = train[idx]['text']
        sentence = ''
        for token in text:
            vocab.add(token)
            if token in ['.','!','?',',']:
                sentence += token 
            else:
                sentence += ' ' + token 
        corpus.append(sentence.lstrip()) 
    return vocab, corpus

In [None]:
train_vocabulary, train_corpus = get_vocab_corpus(cleaned_train)
dev_vocabulary, dev_corpus = get_vocab_corpus(cleaned_dev)
test_vocabulary, test_corpus = get_vocab_corpus(cleaned_test) # dev and test vocab not used

### 1.4 - Bag of Words (TF-IDF weighted)

In [None]:
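def get_bow(vocab, train_corp, *other_corps):
    '''
    Fit a TF-IDF vectorizer on the training corpus only and apply it to every
    corpus, so that all matrices share the same columns and IDF weights.
    '''
    # sort the vocabulary so the column order is reproducible across runs
    vocab = {word: i for i, word in enumerate(sorted(vocab))}
    # note: TfidfVectorizer re-tokenizes with its default pattern (words of 2+ characters),
    # so punctuation and one-character entries in vocab end up as all-zero columns
    vectorizer = TfidfVectorizer(vocabulary=vocab)
    train_bow = vectorizer.fit_transform(train_corp)
    # transform (not fit_transform) keeps the IDF weights learned on the training data
    return [train_bow] + [vectorizer.transform(corp) for corp in other_corps]

In [None]:
train_bow, dev_bow, test_bow = get_bow(train_vocabulary, train_corpus, dev_corpus, test_corpus)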
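
TF-IDF weights depend on the corpus the vectorizer is fitted on. The sketch below uses made-up toy sentences to compare two ways of vectorizing a dev sentence: applying a vectorizer fitted on the training sentences, and refitting a new vectorizer on the dev sentence alone with the same vocabulary. All names and sentences here are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpora, invented for illustration only
toy_train = ['great album', 'great songs', 'boring album']
toy_dev = ['boring album']

# fit on train, then only transform dev: vocabulary and IDF come from train
vec = TfidfVectorizer()
X_train = vec.fit_transform(toy_train)
X_dev = vec.transform(toy_dev)

# refit on dev with the same vocabulary: same columns, but IDF now comes from dev
refit = TfidfVectorizer(vocabulary=sorted(vec.vocabulary_))
X_dev_refit = refit.fit_transform(toy_dev)

same_columns = X_train.shape[1] == X_dev.shape[1] == X_dev_refit.shape[1]
same_values = np.allclose(X_dev.toarray(), X_dev_refit.toarray())
print(same_columns, same_values)
```

Both matrices have the same columns, but the feature values differ, so a classifier trained on train-fitted features would see dev features on a different scale if the vectorizer were refitted.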

## 2 - Train the Model

### 2.1 - Fit Training Data

In [None]:
lr = LogisticRegression()
# LogisticRegression accepts sparse input, so the matrix does not need to be densified with .toarray()
lr.fit(train_bow, np.array(y_train))

In [None]:
lr.score(dev_bow, y_dev)  # mean accuracy on the dev set
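
The task asks for a predictions file in the same format as the JSON files in the repository. The sketch below shows one possible way to write such a file. It assumes that the expected output is one JSON object per line with the `sentiment` field filled in as `'positive'` or `'negative'`, and that reviews without any text get a default label; both are assumptions to check against the repository. The helper `write_predictions`, the `LABELS` mapping and the toy data are hypothetical.

```python
import gzip
import json
import os
import tempfile

LABELS = {1: 'positive', 0: 'negative'}  # inverse of the 1/0 encoding used for training

def write_predictions(data, cleaned, preds, out_path, default='positive'):
    """Write one JSON object per line, copying each review and filling in 'sentiment'."""
    # the cleaned dict can lack indices of reviews without text, so map predictions back by index
    pred_by_idx = dict(zip(cleaned.keys(), preds))
    with gzip.open(out_path, 'wt', encoding='utf-8') as f:
        for idx, review in data.items():
            out = dict(review)
            label = pred_by_idx.get(idx)
            # reviews without a prediction fall back to the default label (an assumption)
            out['sentiment'] = LABELS[int(label)] if label is not None else default
            f.write(json.dumps(out) + '\n')

# toy stand-ins for the loaded test data, the cleaned test data and the model predictions
toy_data = {0: {'reviewText': 'great album', 'sentiment': None},
            1: {'sentiment': None},  # no text at all, so it is absent from toy_cleaned
            2: {'summary': 'boring', 'sentiment': None}}
toy_cleaned = {0: {'text': ['great', 'album']}, 2: {'text': ['boring']}}
toy_preds = [1, 0]

out_path = os.path.join(tempfile.mkdtemp(), 'toy_predictions.json.gz')
write_predictions(toy_data, toy_cleaned, toy_preds, out_path)

# read the file back to check the result
with gzip.open(out_path, 'rt', encoding='utf-8') as f:
    written = [json.loads(line) for line in f]
print([review['sentiment'] for review in written])
```

In this notebook, the call would take `test_data`, `cleaned_test` and `lr.predict(test_bow)` as the first three arguments, with an output file name of your choice.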