# Spoilers

In [1]:
__author__ = "Kristine Guo and Caroline Ho"
__version__ = "CS224u, Stanford, Spring 2018 term"

## Contents

0. [Overview](#Overview)
0. [Set-up](#Set-up)
0. [Latent Semantic Analysis](#Latent-Semantic-Analysis)
  0. [Overview of the LSA method](#Overview-of-the-LSA-method)
  0. [Motivating example for LSA](#Motivating-example-for-LSA)
  0. [Applying LSA to real VSMs](#Applying-LSA-to-real-VSMs)
  0. [Other resources for matrix factorization](#Other-resources-for-matrix-factorization)
0. [GloVe](#GloVe)
  0. [Overview of the GloVe method](#Overview-of-the-GloVe-method)
  0. [GloVe implementation notes](#GloVe-implementation-notes)
  0. [Applying GloVe to our motivating example](#Applying-GloVe-to-our-motivating-example)
  0. [Testing the GloVe implementation](#Testing-the-GloVe-implementation)
  0. [Applying GloVe to real VSMs](#Applying-GloVe-to-real-VSMs)
0. [Autoencoders](#Autoencoders)
  0. [Overview of the autoencoder method](#Overview-of-the-autoencoder-method)
  0. [Testing the autoencoder implementation](#Testing-the-autoencoder-implementation)
  0. [Applying autoencoders to real VSMs](#Applying-autoencoders-to-real-VSMs)
0. [word2vec](#word2vec)
  0. [Training data](#Training-data)
  0. [Basic skip-gram](#Basic-skip-gram)
  0. [Skip-gram with noise contrastive estimation ](#Skip-gram-with-noise-contrastive-estimation-)
  0. [word2vec resources](#word2vec-resources)
0. [Other methods](#Other-methods)
0. [Exploratory exercises](#Exploratory-exercises)

## Overview

The matrix weighting schemes reviewed in the first notebook for this unit deliver solid results. However, they are not capable of capturing higher-order associations in the data. 

With dimensionality reduction, the goal is to eliminate correlations in the input VSM and capture such higher-order notions of co-occurrence, thereby improving the overall space.

As a motivating example, consider _gnarly_ and _wicked_ used as positive slang adjectives. Since both are positive, we expect them to be similar in a good VSM. However, at least stereotypically, _gnarly_ is Californian and _wicked_ is Bostonian. Thus, they are unlikely to occur often in the same texts, and so the methods we've reviewed so far will not be able to model their similarity.

Dimensionality reduction techniques are often capable of capturing such semantic similarities (and have the added advantage of shrinking the size of our data structures).

## Set-up

* Make sure your environment meets all the requirements for [the cs224u repository](https://github.com/cgpotts/cs224u/). For help getting set up, see [setup.ipynb](setup.ipynb).

* Make sure you've downloaded [the data distribution for this unit](http://web.stanford.edu/class/cs224u/data/vsmdata.zip), unpacked it, and placed it in the current directory (or wherever you point `data_home` to below).

In [116]:
from collections import Counter
import copy
from nltk.corpus import stopwords
import numpy as np
import os
import pandas as pd
import PorterStemmer
import scipy.stats
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import classification_report, accuracy_score, f1_score
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC
import string

In [4]:
data_home = 'tvtropes'

In [10]:
dev1 = pd.read_csv(
    os.path.join(data_home, 'dev1.balanced.csv'))

In [11]:
dev2 = pd.read_csv(
    os.path.join(data_home, 'dev2.balanced.csv'))

In [12]:
test = pd.read_csv(
    os.path.join(data_home, 'test.balanced.csv'))

In [13]:
train = pd.read_csv(
    os.path.join(data_home, 'train.balanced.csv'))

In [23]:
print(test.loc[0, 'trope'])

WorkCom


In [126]:
ps = PorterStemmer.PorterStemmer()
translator = str.maketrans(string.punctuation, ' '*len(string.punctuation))
stop_words = set(stopwords.words('english'))

def parse_sentence(sentence):
    # Punctuation becomes whitespace, so a contraction such as "don't"
    # splits into the separate tokens "don" and "t".
    tokens = sentence.translate(translator).lower().split()
    # Drop stop words from the lowercased tokens.
    return [word for word in tokens if word not in stop_words]
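
To make the tokenization concrete, here is a minimal sketch of the same steps on two toy sentences. The names `DEMO_STOP_WORDS` and `parse_sentence_demo` are hypothetical, and the tiny stop-word set is an assumption standing in for NLTK's English list so that the example runs without any corpus download.

```python
import string

# Hypothetical stand-in for NLTK's English stop-word list (assumption: the
# real list is much longer); kept tiny so no corpus download is needed.
DEMO_STOP_WORDS = {"the", "a", "is", "in", "that", "of"}

# Map every ASCII punctuation character to a space, as the notebook does.
translator = str.maketrans(string.punctuation, " " * len(string.punctuation))


def parse_sentence_demo(sentence, stop_words=DEMO_STOP_WORDS):
    # Replace punctuation with spaces, lowercase, and split on whitespace.
    tokens = sentence.translate(translator).lower().split()
    # Keep only the tokens that are not stop words.
    return [tok for tok in tokens if tok not in stop_words]


tokens = parse_sentence_demo("The butler, in the end, is the killer!")
print(tokens)  # ['butler', 'end', 'killer']

# The apostrophe is punctuation too, so the contraction is split in two.
split_contraction = parse_sentence_demo("He doesn't die")
print(split_contraction)  # ['he', 'doesn', 't', 'die']
```

The second call shows a side effect worth keeping in mind: after this preprocessing, no token can still contain `n't`.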

## Features

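Each feature function maps a tokenized sentence (the output of `parse_sentence`) to a `Counter` of feature counts: raw unigrams, Porter-stemmed unigrams, and bigrams padded with `<S>` and `</S>` boundary symbols.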

In [122]:
def unigrams_phi(s):
    return Counter(s)

In [123]:
def stemmed_phi(s):
    return Counter([ps.stem(word) for word in s])

In [124]:
def bigrams_phi(s):
    # Pad with sentence-boundary symbols so that the first and last
    # words each take part in a bigram.
    t = ['<S>'] + list(s) + ['</S>']
    return Counter(zip(t, t[1:]))
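
As a quick illustration of what these feature functions produce, the sketch below applies unigram and bigram counting to a toy token list. The `*_demo` names are hypothetical re-statements of the functions above (stemming is left out so that the example depends only on the standard library).

```python
from collections import Counter


def unigrams_demo(tokens):
    # One feature per distinct token, valued by its count.
    return Counter(tokens)


def bigrams_demo(tokens):
    # Pad with boundary symbols, then count adjacent pairs as tuples.
    padded = ['<S>'] + list(tokens) + ['</S>']
    return Counter(zip(padded, padded[1:]))


tokens = ['butler', 'killer', 'butler']
uni = unigrams_demo(tokens)
bi = bigrams_demo(tokens)

print(uni)            # Counter({'butler': 2, 'killer': 1})
print(sorted(bi))     # the four distinct bigrams, including boundary pairs
print(sum(bi.values()))  # 4: a sentence of n tokens yields n + 1 bigrams
```

Bigram features are far sparser than unigram features, which is consistent with the much larger feature count printed for `bigrams_phi` in the experiment output.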

## Experiment

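Each experiment featurizes the training sentences with a feature function, fits a classifier, and evaluates it on held-out data, printing the feature count, accuracy, and a classification report, and returning the macro-averaged F1 score.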

In [82]:
def vectorize_X(X, phi, vectorizer=None):
    feat_dicts = [phi(parse_sentence(sentence)) for sentence in X]
    if vectorizer is None:
        # Training: learn the feature vocabulary from this data.
        vectorizer = DictVectorizer(sparse=False)
        return (vectorizer.fit_transform(feat_dicts), vectorizer)
    # Evaluation: reuse the training vocabulary; unseen features are dropped.
    return (vectorizer.transform(feat_dicts), vectorizer)

In [99]:
def build_dataset(data, phi, vectorizer=None):
    X = data['sentence'].tolist()
    y = data['spoiler'].tolist()
    feat_matrix, vectorizer = vectorize_X(X, phi, vectorizer)
    # L2-normalize each row so that sentence length does not dominate.
    return (normalize(feat_matrix), y, vectorizer)
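
The fit-then-transform pattern used here can be seen on a toy example. This is only a sketch with made-up feature counts: it fits a `DictVectorizer` on two "training" dictionaries, transforms one "test" dictionary containing an unseen feature, and L2-normalizes the rows the way `build_dataset` does.

```python
from collections import Counter

from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import normalize

# Made-up feature counts, for illustration only.
train_feats = [Counter(['butler', 'killer']),
               Counter(['happy', 'ending', 'ending'])]
test_feats = [Counter(['killer', 'twist'])]  # 'twist' never seen in training

vectorizer = DictVectorizer(sparse=False)
X_train = vectorizer.fit_transform(train_feats)  # learns the column layout
X_test = vectorizer.transform(test_feats)        # reuses that layout

# Columns follow the sorted training vocabulary.
print(sorted(vectorizer.vocabulary_))  # ['butler', 'ending', 'happy', 'killer']

# 'twist' has no column, so only the 'killer' entry is non-zero.
print(X_test)

# Each non-zero row is rescaled to unit Euclidean length.
X_train_norm = normalize(X_train)
print((X_train_norm ** 2).sum(axis=1))
```

Reusing the fitted vectorizer is what keeps the test matrix aligned with the columns the classifier was trained on.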

In [85]:
def fit_svc(X, y):
    mod = LinearSVC()
    mod.fit(X, y)
    return mod

In [105]:
def experiment(phi, train_func, train_data, test_data):
    X, y, vectorizer = build_dataset(train_data, phi)
    print(X.shape[1])  # number of features
    mod = train_func(X, y)
    # Reuse the fitted vectorizer so test columns match the training columns.
    X_test, y_test, _ = build_dataset(test_data, phi, vectorizer)
    predictions = mod.predict(X_test)
    print('Accuracy: %0.03f' % accuracy_score(y_test, predictions))
    print(classification_report(y_test, predictions, digits=3))
    return f1_score(y_test, predictions, average='macro')
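
The value returned by `experiment` is the macro-averaged F1, that is, the unweighted mean of the per-class F1 scores. The sketch below spells this out in plain Python on made-up labels. The helper names are hypothetical, and returning 0.0 for a class with no true or predicted instances is an assumption of this sketch.

```python
def f1_for_label(y_true, y_pred, label):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); define it as 0.0 when the class is absent.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    # Unweighted mean: every class counts equally, whatever its support.
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


# Made-up gold labels and predictions, for illustration only.
y_true = [True, True, True, True, False, False]
y_pred = [True, True, True, False, True, False]

f1_true = f1_for_label(y_true, y_pred, True)    # 0.75
f1_false = f1_for_label(y_true, y_pred, False)  # 0.5
print(macro_f1(y_true, y_pred))                 # 0.625
```

This is why the returned score can be read off a classification report: averaging the two per-class F1 values printed for a run gives the macro F1 up to rounding.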

In [128]:
experiment(unigrams_phi, fit_svc, train, dev1)
experiment(stemmed_phi, fit_svc, train, dev1)
experiment(bigrams_phi, fit_svc, train, dev1)

18968
Accuracy: 0.614
             precision    recall  f1-score   support

      False      0.578     0.633     0.604       496
       True      0.652     0.598     0.624       570

avg / total      0.618     0.614     0.615      1066

13520
Accuracy: 0.618
             precision    recall  f1-score   support

      False      0.584     0.623     0.603       496
       True      0.652     0.614     0.632       570

avg / total      0.620     0.618     0.619      1066

110710
Accuracy: 0.598
             precision    recall  f1-score   support

      False      0.568     0.567     0.567       496
       True      0.623     0.625     0.624       570

avg / total      0.598     0.598     0.598      1066



0.5955589791028989

## Sentiment

In [134]:
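def sentiment_phi(s):
    t = list(s)
    is_neg = 0  # 0 means no negation window is currently open
    for i in range(len(t)):
        # Close the window once four tokens after the negation are marked.
        if is_neg > 4:
            is_neg = 0
        if is_neg > 0:
            is_neg += 1
            t[i] = 'NOT_' + t[i]
        # A negation word (or a token containing "n't") opens a new window.
        if "n't" in t[i] or t[i] in ('not', 'no', 'never'):
            is_neg = 1
    return Counter(t)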
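
To see the negation marking in action, here is a small sketch that applies the same windowing logic to a toy token list. The names `mark_negation_demo`, `NEGATION_WORDS`, and `WINDOW` are hypothetical; the function returns the marked token list rather than a `Counter` so that the effect is easy to read.

```python
NEGATION_WORDS = {'not', 'no', 'never'}
WINDOW = 4  # number of tokens marked after a negation trigger


def mark_negation_demo(tokens, window=WINDOW):
    marked = []
    remaining = 0  # tokens still to be marked in the current window
    for tok in tokens:
        if remaining > 0:
            tok = 'NOT_' + tok
            remaining -= 1
        marked.append(tok)
        # As in sentiment_phi, the trigger test runs on the possibly
        # prefixed token, so 'NOT_never' does not reopen the window.
        if "n't" in tok or tok in NEGATION_WORDS:
            remaining = window
    return marked


tokens = ['never', 'saw', 'twist', 'coming', 'honestly', 'shocked']
marked = mark_negation_demo(tokens)
print(marked)
# ['never', 'NOT_saw', 'NOT_twist', 'NOT_coming', 'NOT_honestly', 'shocked']
```

One caveat worth checking: `parse_sentence` replaces apostrophes with spaces and removes stop words before this function runs, so tokens containing `n't` never reach it, and if the stop-word list contains `not` and `no` (which appears to be the case for NLTK's English list), `never` would be the only trigger left. That may help explain why the scores are so close to the unigram baseline.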

In [135]:
experiment(sentiment_phi, fit_svc, train, dev1)

19485
Accuracy: 0.614
             precision    recall  f1-score   support

      False      0.579     0.629     0.603       496
       True      0.651     0.602     0.625       570

avg / total      0.617     0.614     0.615      1066



0.6141201960551175