# Non-Pretrained Methods!

All code for our non-pretrained methods is contained in this notebook.

The bulk of the work for each method is encapsulated in a dedicated class, and some methods also have testing functions that can be used to repeat our experiments. This notebook also contains some of our preliminary results, but these are not necessarily final; please refer to our report for the final results for each method.

Prior to the definition of our methods we load the data and import the necessary libraries. All validation/test errors were calculated using the model_performance function, as provided in the original skeleton.

Here is a brief outline of our code/how to repeat experiments:

 1. Regression on TF-IDF features
    - Method class: TfidfRegressor
    - Experiment function: test_regressor
    - How to repeat experiments:
      - To set min_df, you must hardcode it in the TfidfVectorizer call inside
        TfidfRegressor. This sets the minimum number of headlines a word must
        feature in to be included in the model.
      - For all other parameters, call test_regressor with the desired
        experiment parameters (examples in code)
      - Performance on Dev/Validation set and test set will be printed.
 2. Predicting funniness from Perplexity
    - WARNING: Running the perplexity experiments takes a long time, due to
      first training a language model, and then computing the perplexity of
      each item in the training dataset (see method classes for details).
    - **Perplexity of whole headline**
      - Method class: SentenceLMFunnyEstimator
      - Experiment function: test_sentence_perplexity_model
      - How to repeat experiments:
        - Code for all reported experiments is in the cells following
          test_sentence_perplexity_model
        - Call test_sentence_perplexity_model with appropriate experiment
          parameters
        - MAKE SURE YOU USE THE CORRECT TYPE OF DATASETS (see comments above
          test_sentence_perplexity_model)
        - Performance on Dev/Validation set will be printed.
    - **Perplexity of n-grams around edit**
      - Method class: EditContextLMFunnyEstimator
      - Test function: None :'(
      - How to repeat experiments:
        - Code for running all the experiments is in the code cells below
          EditContextLMFunnyEstimator, with comments relating each experiment
          to the entry in Table 3 of the report.
 3. Averaging across part-of-speech tags
    - Method class: POSTagFunninessPredictor
    - Experiment function: test_pos_predictor
    - How to repeat experiments:
      - Call test_pos_predictor with experiment parameters.
      - Performance on Dev/Validation set and test set will be printed.
      - Examples of running experiment are given below test_pos_predictor.
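
As a rough illustration of the min_df setting described under item 1, the sketch below filters a vocabulary by document frequency in plain Python. It is not the notebook's code: `filter_by_min_df` and the toy headlines are hypothetical, and the actual experiments rely on the `min_df` argument of scikit-learn's `TfidfVectorizer`.

```python
from collections import Counter

def filter_by_min_df(headlines, min_df):
    """Return the words that occur in at least `min_df` distinct headlines."""
    doc_freq = Counter()
    for headline in headlines:
        # Count each word once per headline (document frequency, not term frequency)
        doc_freq.update(set(headline.lower().split()))
    return {word for word, df in doc_freq.items() if df >= min_df}

# Toy headlines, made up for illustration
toy_headlines = [
    "Trump meets twins",
    "Trump visits Iraq",
    "Twins visit zoo",
]
print(sorted(filter_by_min_df(toy_headlines, min_df=2)))
```

Raising min_df shrinks the feature set, which is the lever the TF-IDF results use to trade training fit against dev error.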


In [None]:
# You will need to download any word embeddings required for your code, e.g.:

# !wget http://nlp.stanford.edu/data/glove.6B.zip
# !unzip glove.6B.zip
# !wget http://nlp.stanford.edu/data/glove.twitter.27B.zip
# !unzip glove.twitter.27B.zip
# For any packages that Colab does not provide automatically you will also need to install these below, e.g.:

! pip install torch
! pip install nltk==3.5
! pip install truecase


Collecting nltk==3.5
  Downloading https://files.pythonhosted.org/packages/92/75/ce35194d8e3022203cca0d2f896dbb88689f9b3fce8e9f9cff942913519d/nltk-3.5.zip (1.4MB)
Building wheels for collected packages: nltk
  Building wheel for nltk (setup.py) ... done
  Created wheel for nltk: filename=nltk-3.5-cp37-none-any.whl size=1434676 sha256=002bf10f92a21dfce3543fd199f4f75bc2f2f69fba675c4c71f0a46d2fdf464c
  Stored in directory: /root/.cache/pip/wheels/ae/8c/3f/b1fe0ba04555b08b57ab52ab7f86023639a526d8bc8d384306
Successfully built nltk
Installing collected packages: nltk
  Found existing installation: nltk 3.2.5
    Uninstalling nltk-3.2.5:
      Successfully uninstalled nltk-3.2.5
Successfully installed nltk-3.5
Collecting truecase
  Downloading https://files.pythonhosted.org/packages/87/52/0824cdadfe0b924f1f10b3a4042b2e15ae2477ed8acc032418a449a62936/truecase-0.0.12-py3-none-any.whl (28.4MB)

In [None]:
# Imports

import torch
import torch.nn as nn
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from torch.utils.data import Dataset, random_split
import codecs

import re
import tqdm


import multiprocessing
from multiprocessing import Pool

from sklearn.linear_model import LinearRegression, Lasso, ElasticNet

import nltk
from nltk.lm.preprocessing import padded_everygram_pipeline, padded_everygrams
from nltk.lm import MLE, Laplace
from nltk.util import pad_sequence, everygrams
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk
from nltk.chunk.util import tree2conlltags

from truecase import get_true_case

nltk.download('popular')
nltk.download('universal_tagset')

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gazetteers.zip.
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/genesis.zip.
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gutenberg.zip.
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/inaugural.zip.
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/movie_reviews.zip.
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/names.zip.
[nltk_data]    | Downloading package shakespeare to /root/nltk_data...
[nlt

True

In [None]:
# Setting random seed and device
SEED = 1

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")

In [None]:
# Load data
!wget https://cs.rochester.edu/u/nhossain/semeval-2020-task-7-dataset.zip
!unzip semeval-2020-task-7-dataset.zip

--2021-02-27 20:44:03--  https://cs.rochester.edu/u/nhossain/semeval-2020-task-7-dataset.zip
Resolving cs.rochester.edu (cs.rochester.edu)... 192.5.53.208
Connecting to cs.rochester.edu (cs.rochester.edu)|192.5.53.208|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1621456 (1.5M) [application/zip]
Saving to: ‘semeval-2020-task-7-dataset.zip’


2021-02-27 20:44:04 (8.12 MB/s) - ‘semeval-2020-task-7-dataset.zip’ saved [1621456/1621456]

Archive:  semeval-2020-task-7-dataset.zip
   creating: semeval-2020-task-7-dataset/
  inflating: semeval-2020-task-7-dataset/.DS_Store  
   creating: semeval-2020-task-7-dataset/subtask-1/
  inflating: semeval-2020-task-7-dataset/subtask-1/train_funlines.csv  
  inflating: semeval-2020-task-7-dataset/subtask-1/.DS_Store  
  inflating: semeval-2020-task-7-dataset/subtask-1/test.csv  
  inflating: semeval-2020-task-7-dataset/subtask-1/dev.csv  
 extracting: semeval-2020-task-7-dataset/subtask-1/baseline.zip  
  inflating: semeval-2

In [None]:
# How we print the model performance
def model_performance(output, target, print_output=False):
    """
    Returns SSE and MSE per batch (printing the MSE and the RMSE)
    """

    sq_error = (output - target)**2

    sse = np.sum(sq_error)
    mse = np.mean(sq_error)
    rmse = np.sqrt(mse)

    if print_output:
        print(f'| MSE: {mse:.5f} | RMSE: {rmse:.5f} |')

    return sse, mse
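
For intuition about what `model_performance` reports, here is a small worked example in plain Python. The numbers are made up for illustration and `error_metrics` is a hypothetical helper; the notebook itself uses the NumPy version above.

```python
import math

def error_metrics(output, target):
    """Return (SSE, MSE, RMSE) for two equal-length sequences of numbers."""
    sq_error = [(o - t) ** 2 for o, t in zip(output, target)]
    sse = sum(sq_error)
    mse = sse / len(sq_error)
    return sse, mse, math.sqrt(mse)

# Illustrative predictions and grades (not taken from the dataset)
sse, mse, rmse = error_metrics([1.0, 0.5, 2.0], [1.0, 1.5, 0.0])
print(f'| SSE: {sse:.5f} | MSE: {mse:.5f} | RMSE: {rmse:.5f} |')
```

Note that `model_performance` returns the pair (SSE, MSE) and only prints the RMSE, which is why the cell outputs further down end with a two-element tuple.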

In [None]:

train_df = pd.read_csv('/content/semeval-2020-task-7-dataset/subtask-1/train.csv')
train_fl_df = pd.read_csv('/content/semeval-2020-task-7-dataset/subtask-1/train_funlines.csv')
val_df = pd.read_csv('/content/semeval-2020-task-7-dataset/subtask-1/dev.csv')
test_df = pd.read_csv('/content/semeval-2020-task-7-dataset/subtask-1/test.csv')

def extract_data(df):
    raw_data = df['original']
    edit_data = df['edit']
    # Raw strings avoid invalid-escape warnings; the patterns are unchanged
    original_data = pd.Series([re.sub(r'<|/>', '', s) for s in raw_data])
    edited_data = pd.Series([re.sub(r'<.*/>', e, s) for s, e in zip(raw_data, edit_data)])
    grade_data = df['meanGrade']
    return raw_data, edit_data, original_data, edited_data, grade_data

train_raw, train_edit, train_original, train_edited, train_grades = extract_data(train_df)
train_fl_raw, train_fl_edit, train_fl_original, train_fl_edited, train_fl_grades = extract_data(train_fl_df)
val_raw, val_edit, val_original, val_edited, val_grades = extract_data(val_df)
test_raw, test_edit, test_original, test_edited, test_grades = extract_data(test_df)


print(train_raw[0])
print(train_original[0])
print(train_edited[0])

# Combine the base training set with the FunLines training set
# (pd.concat replaces Series.append, which was removed in pandas 2.0)
combined_train_raw = pd.concat([train_raw, train_fl_raw], ignore_index=True)
combined_train_edit = pd.concat([train_edit, train_fl_edit], ignore_index=True)
combined_train_original = pd.concat([train_original, train_fl_original], ignore_index=True)
combined_train_edited = pd.concat([train_edited, train_fl_edited], ignore_index=True)
combined_train_grades = pd.concat([train_grades, train_fl_grades], ignore_index=True)

France is ‘ hunting down its citizens who joined <Isis/> ’ without trial in Iraq
France is ‘ hunting down its citizens who joined Isis ’ without trial in Iraq
France is ‘ hunting down its citizens who joined twins ’ without trial in Iraq
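
The three lines printed above show the effect of `extract_data` on one row. A standalone sketch of the same two substitutions, using only the `re` module and the first training headline, might look like this. `strip_markup` and `apply_edit` are hypothetical names, and the non-greedy pattern reflects our assumption that each headline contains exactly one `<word/>` marker.

```python
import re

def strip_markup(raw):
    """Drop the '<' and '/>' markers, keeping the original word."""
    return re.sub(r'<|/>', '', raw)

def apply_edit(raw, edit):
    """Replace the marked span '<word/>' with the edit word."""
    # A function replacement avoids backslash interpretation in `edit`
    return re.sub(r'<.*?/>', lambda _: edit, raw)

raw = "France is ‘ hunting down its citizens who joined <Isis/> ’ without trial in Iraq"
print(strip_markup(raw))
print(apply_edit(raw, "twins"))
```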


# Approach 1: TF-IDF Regression


In [None]:

class TfidfRegressor:

    def __init__(self, train_corpus, train_values, rm=Lasso(alpha=0.0001), sw='english'):

        self.tfidfVectorizer = TfidfVectorizer(stop_words=sw)

        train_corpus_counts = self.tfidfVectorizer.fit_transform(train_corpus)
        self.regression_model = rm.fit(train_corpus_counts, train_values)

    def predict(self, sample_corpus):

        sample_counts = self.tfidfVectorizer.transform(sample_corpus)
        return self.regression_model.predict(sample_counts)


# train_data is the dataset of edited headlines to train on
# train_labels is the scores for those headlines
# rm is the regression model to use (e.g. Lasso(0.0006) )
# sw is type of stopwords to remove ('english' or None)
# examples of how to run are below function definition
def test_regressor(train_data, train_labels, rm, sw):
    model = TfidfRegressor(train_data, train_labels, rm, sw)

    print("Number of TF-IDF features")
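    # vocabulary_ is available in every scikit-learn version
    # (get_feature_names was removed in scikit-learn 1.2)
    print(len(model.tfidfVectorizer.vocabulary_))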

    print("Is this just predicting the mean score?")
    print(np.allclose(np.mean(train_labels), model.regression_model.intercept_))

    # Evaluate performance
    train_tfidf_preds = model.predict(train_data)
    print("\nTrain performance:")
    model_performance(train_tfidf_preds, train_labels, True)


    val_tfidf_preds = model.predict(val_edited)
    print("\nDev performance:")
    model_performance(val_tfidf_preds, val_grades, True)

    test_tfidf_preds = model.predict(test_edited)
    print("\nTest performance")
    model_performance(test_tfidf_preds, test_grades, True)
    
    print()
    print()


#test_regressor(combined_train_edited, combined_train_grades, LinearRegression(), None)

test_regressor(combined_train_edited, combined_train_grades, Lasso(0.0002), 'english')
test_regressor(combined_train_edited, combined_train_grades, Lasso(0.00006), 'english')


Number of TF-IDF features
16283
Is this just predicting the mean score?
False

Train performance:
| MSE: 0.33 | RMSE: 0.58 |

Dev performance:
| MSE: 0.35 | RMSE: 0.59 |

Test performance
| MSE: 0.34 | RMSE: 0.58 |


Number of TF-IDF features
16283
Is this just predicting the mean score?
False

Train performance:
| MSE: 0.28 | RMSE: 0.53 |

Dev performance:
| MSE: 0.35 | RMSE: 0.59 |

Test performance
| MSE: 0.33 | RMSE: 0.58 |
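
To make the features behind these numbers more concrete, the following is a minimal pure-Python sketch of TF-IDF weighting. It uses the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, which to our understanding matches scikit-learn's defaults; the whitespace tokenisation, the `tfidf_vectors` helper, and the toy headlines are simplifying assumptions and not what `TfidfVectorizer` does internally.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Smoothed TF-IDF with L2 normalisation over whitespace tokens."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    doc_freq = Counter(word for toks in tokenised for word in set(toks))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}

    vectors = []
    for toks in tokenised:
        weights = {w: count * idf[w] for w, count in Counter(toks).items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        vectors.append({w: v / norm for w, v in weights.items()})
    return vectors

# Toy headlines, made up for illustration
vecs = tfidf_vectors(["trump meets twins", "trump visits iraq"])
# 'trump' appears in both headlines, so it gets a lower weight than 'twins'
print(vecs[0])
```

The regression model then learns one coefficient per vocabulary word on top of these weights, which is why the feature count (16283 above) dwarfs the number of training headlines' typical length.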




## TF-IDF Regression Results

## Edit Word Model (inc stop words):

Train performance:
| MSE: 0.14 | RMSE: 0.37 |

Dev performance:
| MSE: 0.36 | RMSE: 0.60 |

Test performance:
| MSE: 0.37 | RMSE: 0.61 |

## Edited Sentence Model:

**Base**

Train performance:
| MSE: 0.02 | RMSE: 0.14 |

Dev performance:
| MSE: 22.15 | RMSE: 4.71 |

Test performance
| MSE: 19.83 | RMSE: 4.45 |

**No stop words**

Train performance:
| MSE: 0.02 | RMSE: 0.16 |

Dev performance:
| MSE: 8.95 | RMSE: 2.99 |

Test performance
| MSE: 8.51 | RMSE: 2.92 |

**No stop words, min_df=2**

Train performance:
| MSE: 0.08 | RMSE: 0.28 |

Dev performance:
| MSE: 3.23 | RMSE: 1.80 |

**No stop words, min_df=3**

Train performance:
| MSE: 0.11 | RMSE: 0.33 |

Dev performance:
| MSE: 2.25 | RMSE: 1.50 |

**No stop words, min_df=5**

Train performance:
| MSE: 0.18 | RMSE: 0.42 |

Dev performance:
| MSE: 0.61 | RMSE: 0.78 |

**No stop words, min_df=10**

Train performance:
| MSE: 0.25 | RMSE: 0.50 |

Dev performance:
| MSE: 0.40 | RMSE: 0.63 |

**No stop words, min_df=20**

Train performance:
| MSE: 0.29 | RMSE: 0.54 |

Dev performance:
| MSE: 0.35 | RMSE: 0.59 |

**No stop words, Lasso Regression, alpha=0.5**

(just predicts mean)

Train performance:
| MSE: 0.34 | RMSE: 0.58 |

Dev performance:
| MSE: 0.33 | RMSE: 0.58 |

**No stop words, Lasso Regression, alpha=0.0001**

Train performance:
| MSE: 0.27 | RMSE: 0.52 |

Dev performance:
| MSE: 0.33 | RMSE: 0.57 |

Test performance
| MSE: 0.32 | RMSE: 0.56 |


**No stop words, Lasso Regression, alpha=0.00005**

Train performance:
| MSE: 0.20 | RMSE: 0.45 |

Dev performance:
| MSE: 0.34 | RMSE: 0.59 |

**No stop words, Lasso Regression, alpha=0.00001**

Train performance:
| MSE: 0.07 | RMSE: 0.27 |

Dev performance:
| MSE: 0.50 | RMSE: 0.70 |



## Edited Sentence Model, trained on normal + FunLines

**Baseline**

Train performance:
| MSE: 0.04 | RMSE: 0.21 |

Dev performance:
| MSE: 2.22 | RMSE: 1.49 |

Test performance
| MSE: 2.25 | RMSE: 1.50 |

**No stop words**

Train performance:
| MSE: 0.05 | RMSE: 0.22 |

Dev performance:
| MSE: 1.95 | RMSE: 1.40 |

Test performance
| MSE: 1.92 | RMSE: 1.38 |

**No stop words, Lasso Regression, alpha=0.0001**

Train performance:
| MSE: 0.31 | RMSE: 0.56 |

Dev performance:
| MSE: 0.35 | RMSE: 0.59 |

Test performance
| MSE: 0.33 | RMSE: 0.58 |

**No stop words, Lasso Regression, alpha=0.00005**

Train performance:
| MSE: 0.26 | RMSE: 0.51 |

Dev performance:
| MSE: 0.35 | RMSE: 0.59 |

Test performance
| MSE: 0.33 | RMSE: 0.58 |
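
A useful reference point when reading the numbers above is the constant predictor that always outputs the mean training grade; the heavily regularised Lasso run (alpha=0.5) collapses to exactly this. The sketch below shows how such a baseline could be scored. The grades are made up (not dataset values) and `mean_baseline_rmse` is a hypothetical helper.

```python
import math

def mean_baseline_rmse(train_grades, eval_grades):
    """RMSE of always predicting the mean of the training grades."""
    mean_grade = sum(train_grades) / len(train_grades)
    mse = sum((g - mean_grade) ** 2 for g in eval_grades) / len(eval_grades)
    return math.sqrt(mse)

# Made-up grades, for illustration only
train = [0.2, 1.0, 1.8, 0.6, 1.4]
dev = [0.0, 1.0, 2.0]
print(f'Baseline RMSE: {mean_baseline_rmse(train, dev):.5f}')
```

Any configuration whose dev RMSE is not clearly below this kind of baseline has learned little beyond the average grade.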

# Approach 2: Relating Perplexity to Funniness

In [None]:
## Language models/perplexity


def tokenize(s):
    # Compare by value: `is not ''` tests identity and raises a SyntaxWarning
    return [t for t in re.split(r'\s+', s) if t != '']

# Idea: train LM on normal data, then find correlation between perplexity
# of LM on edited data and funniness
class SentenceLMFunnyEstimator():

    def __init__(self, order, normal_data, funny_data, funny_labels):

        self.order = order

        train_normal_corp = [tokenize(s) for s in normal_data]

        train_d, train_v = padded_everygram_pipeline(order, train_normal_corp)

        # Laplace smoothing: plain MLE gives unseen n-grams zero probability,
        # which makes the perplexity infinite
        self.lm = Laplace(order)
        self.lm.fit(train_d, train_v)

        train_funny_corp = [tokenize(s) for s in funny_data]
        print(len(train_funny_corp))

        perps = self.get_perplexities(train_funny_corp)
        perps = np.expand_dims(np.array(perps), axis=1)

        self.regressor = LinearRegression().fit(perps, funny_labels)

        train_preds = self.regressor.predict(perps)

        print("Train Performance:")
        print()
        model_performance(train_preds, funny_labels, True)
        print()

    def predict(self, sentences):
        # Assume tokenised sentence
        perps = self.get_perplexities(sentences)

        np_perps = np.expand_dims(np.array(perps), axis=1)

        return self.regressor.predict(np_perps)

    def _get_perplexity(self, sentence):
        return self.lm.perplexity(padded_everygrams(self.order, sentence))

    def get_perplexities(self, sentences):
        with Pool(processes=multiprocessing.cpu_count()) as pool:
            return pool.map(self._get_perplexity, sentences)
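
To show what the perplexity feature measures, here is a simplified sketch of an add-one (Laplace) smoothed bigram model in plain Python. It is an approximation under stated assumptions rather than a reimplementation of `nltk.lm.Laplace`: it scores bigrams only (the class above scores every n-gram order up to `order`), it has no unknown-word token in its vocabulary, and `train_bigram_counts`, `laplace_perplexity`, and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def train_bigram_counts(sentences):
    """Count padded bigrams and their contexts over tokenised sentences."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for toks in sentences:
        padded = ['<s>'] + toks + ['</s>']
        vocab.update(padded)
        for prev, word in zip(padded, padded[1:]):
            bigrams[(prev, word)] += 1
            contexts[prev] += 1
    return bigrams, contexts, len(vocab)

def laplace_perplexity(toks, bigrams, contexts, vocab_size):
    """Perplexity of one sentence under an add-one smoothed bigram model."""
    padded = ['<s>'] + toks + ['</s>']
    log_prob = 0.0
    for prev, word in zip(padded, padded[1:]):
        # Add-one smoothing: unseen bigrams still get non-zero probability
        prob = (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log2(prob)
    n_bigrams = len(padded) - 1
    return 2 ** (-log_prob / n_bigrams)

# Toy corpus, made up for illustration
corpus = [s.split() for s in ["police arrest man", "police arrest suspect"]]
model = train_bigram_counts(corpus)
print(laplace_perplexity("police arrest man".split(), *model))
print(laplace_perplexity("police arrest twins".split(), *model))
```

The second headline contains a word the model has never seen, so it receives the higher perplexity; the experiments test whether that kind of surprise correlates with funniness.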

In [None]:
val_toks =  [[t for t in s.split(' ') if t != ''] for s in val_edited]

# ngram - N of n-gram language model to use
# data_orig - unedited headlines for training set to use
#             (all are named <something>_original )
# data_edited - edited headlines for training set to use
#               (all are named <something>_edited )
# data_grades - scores of data_edited headlines
#               (all are named <something>_grades)
def test_sentence_perplexity_model(ngram, data_orig, data_edited, data_grades):

    lfme = SentenceLMFunnyEstimator(ngram, data_orig, data_edited, data_grades)
    preds = lfme.predict(val_toks)
    
    print("Validation Performance")
    print()
    sse, mse = model_performance(preds, val_grades, True)
    print()
    print("Regressor intercept: ", lfme.regressor.intercept_)
    print()
    print("Regressor perplexity coefficient: ", lfme.regressor.coef_)
    print()
    print()

In [None]:

print('Base dataset results')
print('3 gram lfme')
print()
test_sentence_perplexity_model(3, train_original, train_edited, train_grades)
print('2 gram lfme')
print()
test_sentence_perplexity_model(2, train_original, train_edited, train_grades)
print('1 gram lfme')
print()
test_sentence_perplexity_model(1, train_original, train_edited, train_grades)


Base dataset results
3 gram lfme

9652
Train Performance:

| MSE: 0.33839 | RMSE: 0.58172 |

Validation Performance

| MSE: 0.35156 | RMSE: 0.59293 |

Regressor intercept:  1.094308667001804

Regressor perplexity coefficient:  [-0.00014175]


2 gram lfme

9652
Train Performance:

| MSE: 0.33787 | RMSE: 0.58127 |

Validation Performance

| MSE: 0.33970 | RMSE: 0.58284 |

Regressor intercept:  1.0641348931340255

Regressor perplexity coefficient:  [-7.66573025e-05]


1 gram lfme

9652
Train Performance:

| MSE: 0.33894 | RMSE: 0.58218 |

Validation Performance

| MSE: 0.33187 | RMSE: 0.57608 |

Regressor intercept:  0.9824355881975723

Regressor perplexity coefficient:  [-1.48554245e-05]




In [None]:

print('Combined dataset results')
print('3 gram lfme')
print()
test_sentence_perplexity_model(3, combined_train_original, combined_train_edited, combined_train_grades)
print('2 gram lfme')
print()
test_sentence_perplexity_model(2, combined_train_original, combined_train_edited, combined_train_grades)
print('1 gram lfme')
print()
test_sentence_perplexity_model(1, combined_train_original, combined_train_edited, combined_train_grades)


Combined dataset results
3 gram lfme

17900
Train Performance:

| MSE: 0.35496 | RMSE: 0.59578 |

Validation Performance

| MSE: 0.37317 | RMSE: 0.61087 |

Regressor intercept:  1.0417056013230948

Regressor perplexity coefficient:  [2.71052366e-05]


2 gram lfme

17900
Train Performance:

| MSE: 0.35518 | RMSE: 0.59597 |

Validation Performance

| MSE: 0.35436 | RMSE: 0.59528 |

Regressor intercept:  1.1093102757298205

Regressor perplexity coefficient:  [-7.37558235e-06]


1 gram lfme

17900
Train Performance:

| MSE: 0.35438 | RMSE: 0.59530 |

Validation Performance

| MSE: 0.35426 | RMSE: 0.59520 |

Regressor intercept:  1.121725776362903

Regressor perplexity coefficient:  [-7.73702134e-06]
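
The regressor in these experiments has a single input feature, so its fit reduces to the closed-form least-squares line. The sketch below computes that line in plain Python; the perplexities and grades are made up for illustration and `fit_line` is a hypothetical stand-in for `LinearRegression`.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x

# Made-up perplexities and grades, for illustration only
perplexities = [100.0, 200.0, 300.0, 400.0]
grades = [1.2, 1.0, 0.9, 0.7]
slope, intercept = fit_line(perplexities, grades)
print(f'coefficient: {slope:.6f}, intercept: {intercept:.4f}')
```

Read this way, the very small coefficients printed above suggest the fitted lines are almost flat, so the predictions stay close to a constant.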




In [None]:
# Idea: Same as above, but now just focus on fitting perplexity of context
# around edit


class EditContextLMFunnyEstimator():

    def __init__(self, order, normal_data, unprocessed_data, edit_words, funny_labels):

        self.order = order

        train_normal_corp = [tokenize(s) for s in normal_data]

        train_d, train_v = padded_everygram_pipeline(order, train_normal_corp)

        # Laplace smoothing: plain MLE gives unseen n-grams zero probability,
        # which makes the perplexity infinite
        self.lm = Laplace(order)
        self.lm.fit(train_d, train_v)

        e_context_everygram = self.extract_edit_context(unprocessed_data, edit_words)

        perps = self.get_perplexities(e_context_everygram)
        perps = np.expand_dims(np.array(perps), axis=1)

        self.regressor = LinearRegression().fit(perps, funny_labels)

        train_preds = self.regressor.predict(perps)

        print("Train Performance:")
        print()
        model_performance(train_preds, funny_labels, True)
        print()

    def extract_edit_context(self, unprocessed_data, editwords):

        edits_with_context = []

        for s, e in zip(unprocessed_data, editwords):
            # Split the raw headline around the <word/> marker
            first_half, second_half = re.split(r'<.*/>', s)
            
            fh_toks = tokenize(first_half)
            edit_tok = tokenize(e)
            sh_toks = tokenize(second_half)

            fh_pad = list(pad_sequence(fh_toks, self.order, pad_left=True, left_pad_symbol='<s>'))
            sh_pad = list(pad_sequence(sh_toks, self.order, pad_right=True, right_pad_symbol='</s>'))

            edit_in_context = fh_pad[len(fh_pad) - self.order + 1:] +\
                                edit_tok + sh_pad[:self.order]
            edits_with_context.append(list(everygrams(edit_in_context, max_len=self.order)))
        
        return edits_with_context

    def predict(self, test_orig, test_edit):
        # test_orig holds raw headlines (still containing the <word/> marker),
        # test_edit the corresponding edit words
        test_edit_context_everygrams = self.extract_edit_context(test_orig, test_edit)

        perps = self.get_perplexities(test_edit_context_everygrams)

        np_perps = np.expand_dims(np.array(perps), axis=1)

        return self.regressor.predict(np_perps)

    def _get_perp(self, sentence):
        return self.lm.perplexity(sentence)

    def get_perplexities(self, sentences):
        with Pool(processes=multiprocessing.cpu_count()) as pool:
            return pool.map(self._get_perp, sentences)
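
The window selected by `extract_edit_context` can be hard to picture from the slicing alone. The following sketch reproduces the same slicing without NLTK, on the first training headline. `edit_context` is a hypothetical helper, `str.split` stands in for `tokenize`, and the final `everygrams` step is left out.

```python
import re

def edit_context(raw, edit, order):
    """Tokens of the edit word plus its neighbours, as in extract_edit_context."""
    before, after = re.split(r'<.*?/>', raw, maxsplit=1)
    before_toks = ['<s>'] * (order - 1) + before.split()
    after_toks = after.split() + ['</s>'] * (order - 1)
    # Keep the last (order - 1) tokens before the edit and the first `order` after it
    left = before_toks[len(before_toks) - order + 1:]
    return left + edit.split() + after_toks[:order]

raw = "France is ‘ hunting down its citizens who joined <Isis/> ’ without trial in Iraq"
print(edit_context(raw, "twins", order=3))
```

For a trigram model this keeps two tokens to the left of the edit and three to the right, so the window is not symmetric around the edit word.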

In [None]:
# OD Trigram
context_lfme3 = EditContextLMFunnyEstimator(3, train_original, train_raw, train_edit, train_grades)

context_preds = context_lfme3.predict(val_raw, val_edit)
model_performance(context_preds, val_grades, True)

Train Performance:

| MSE: 0.33984 | RMSE: 0.58295 |

| MSE: 0.33232 | RMSE: 0.57647 |


(803.872178443305, 0.33231590675622363)

In [None]:
# OD Bigram
context_lfme2 = EditContextLMFunnyEstimator(2, train_original, train_raw, train_edit, train_grades)

context_preds = context_lfme2.predict(val_raw, val_edit)
model_performance(context_preds,  val_grades, True)

# OD Unigram
context_lfme1 = EditContextLMFunnyEstimator(1, train_original, train_raw, train_edit, train_grades)

context_preds = context_lfme1.predict(val_raw, val_edit)
model_performance(context_preds, val_grades, True)

Train Performance:

| MSE: 0.34029 | RMSE: 0.58334 |

| MSE: 0.33328 | RMSE: 0.57730 |
Train Performance:

| MSE: 0.33915 | RMSE: 0.58237 |

| MSE: 0.33339 | RMSE: 0.57740 |


(806.4742345195133, 0.3333915810332838)

In [None]:
print(context_lfme3.regressor.coef_)
print(context_lfme3.regressor.intercept_)
print(context_lfme2.regressor.coef_)
print(context_lfme2.regressor.intercept_)
print(context_lfme1.regressor.coef_)
print(context_lfme1.regressor.intercept_)

[-1.43947639e-05]
0.9798939828486125
[-5.75712532e-06]
0.9618051982754973
[9.17320358e-07]
0.9059682086772738


In [None]:
# OD + EF Trigram
print("Trigram")
comb_context_lfme3 = EditContextLMFunnyEstimator(3, combined_train_original, combined_train_raw, combined_train_edit, combined_train_grades)
context_preds = comb_context_lfme3.predict(val_raw, val_edit)
model_performance(context_preds, val_grades, True)
print(comb_context_lfme3.regressor.coef_)
print(comb_context_lfme3.regressor.intercept_)
print()

# OD + EF Bigram
print("Bigram")
comb_context_lfme2 = EditContextLMFunnyEstimator(2, combined_train_original, combined_train_raw, combined_train_edit, combined_train_grades)
context_preds = comb_context_lfme2.predict(val_raw, val_edit)
model_performance(context_preds, val_grades, True)
print(comb_context_lfme2.regressor.coef_)
print(comb_context_lfme2.regressor.intercept_)
print()

# OD + EF Unigram
print("Unigram")
comb_context_lfme1 = EditContextLMFunnyEstimator(1, combined_train_original, combined_train_raw, combined_train_edit, combined_train_grades)
context_preds = comb_context_lfme1.predict(val_raw, val_edit)
model_performance(context_preds, val_grades, True)
print(comb_context_lfme1.regressor.coef_)
print(comb_context_lfme1.regressor.intercept_)
print()

Trigram
Train Performance:

| MSE: 0.35514 | RMSE: 0.59593 |

| MSE: 0.35593 | RMSE: 0.59660 |
[-3.2094862e-06]
1.1064255723083531

Bigram
Train Performance:

| MSE: 0.35518 | RMSE: 0.59597 |

| MSE: 0.35745 | RMSE: 0.59787 |
[-1.68569998e-06]
1.1021019208947336

Unigram
Train Performance:

| MSE: 0.35296 | RMSE: 0.59410 |

| MSE: 0.35778 | RMSE: 0.59814 |
[6.49494333e-07]
1.0581312525883806



In [None]:
for _ in range(5):
  print(comb_context_lfme3.lm.generate(15, text_seed='<s>'))

['Sparks', 'Outrage', 'Across', 'India', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>']
['Center', 'Clients', 'Rely', 'On', 'Medicaid', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>']
['strange', 'collection', 'of', 'location', 'data', 'even', 'when', 'told', 'not', 'to', '-', '9to5Mac', '</s>', '</s>', '</s>']
['executive', 'cooperating', 'in', 'US', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>']
[',', 'exit', 'polls', 'show', 'Trump', ',', 'Putin', 'to', 'hold', 'presidency', '.', '</s>', '</s>', '</s>', '</s>']


In [None]:
for _ in range(5):
  print(comb_context_lfme3.lm.generate(10, text_seed='<s>'))

['Dozens', 'dead', 'in', 'possible', 'gas', 'attack', 'in', 'Syria', '</s>', '</s>']
['hiding', 'something', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>']
["'", ':', 'midwestern', 'workers', 'savaged', 'by', 'Trump', '’s', 'indefensible', 'defense']
['<s>', '<s>', 'Alexandria', 'Ocasio-Cortez', ':', 'Trump', 'and', 'the', 'art', 'of']
['fight', 'climate', 'crisis', ':', 'Death', 'toll', 'rises', 'after', 'bombs', 'target']


## **Perplexity results**
**LM on unedited headlines, do linear regression between perplexities of edited headlines and funniness:**

Trigram:

 - Model training score:  0.006487269059277545
 - MSE: 0.35
 - RMSE: 0.59
 - Perplexity coefficient: [-0.00014175]
 - Intercept: 1.094308667001804

Bigram:

 - Model training score:  0.00802764275019574
 - MSE: 0.34
 - RMSE: 0.58
 - Perplexity coefficient: [-7.66573025e-05]
 - Intercept: 1.0641348931340255

Unigram:

 - Model training score:  0.00489278156461237
 - MSE: 0.33
 - RMSE: 0.58
 - Perplexity coefficient: [-1.48554245e-05]
 - Intercept: 0.9824355881975723

**Same LM, now fit perplexities just from words around edit**

Trigram:

 - Model training score:  0.0022555031224428257
 - MSE: 0.33
 - RMSE: 0.58
 - Perplexity coefficient: [-1.43947639e-05]
 - Intercept: 0.9798939828486125

Bigram:

 - Model training score:  0.0009207103129988958
 - MSE: 0.33
 - RMSE: 0.58
 - Perplexity coefficient: [-5.75712532e-06]
 - Intercept: 0.9618051982754973

Unigram:

 - Model training score:  0.00426709488810717
 - MSE: 0.33
 - RMSE: 0.58
 - Perplexity coefficient: [9.17320358e-07]
 - Intercept: 0.9059682086772738

**LM on unedited + FunLines unedited headlines, do linear regression between perplexities of edited headlines and funniness:**

Trigram:

 - Model training score:  0.0008570347665806112
 - MSE: 0.37
 - RMSE: 0.61
 - Perplexity coefficient: [2.71052366e-05]
 - Intercept: 1.0417056013230948

Bigram:

 - Model training score:  0.00021799908115582856
 - MSE: 0.35
 - RMSE: 0.60
 - Perplexity coefficient: [-7.37558235e-06]
 - Intercept: 1.1093102757298205

Unigram:
 
 - Model training score:  0.0024696745781895846
 - MSE: 0.35
 - RMSE: 0.60
 - Perplexity coefficient: [-7.73702134e-06]
 - Intercept: 1.121725776362903


**Same extended LM, now fit perplexities just from words around edit**

Trigram: 
 - Model training score:  0.00035018215042048606
 - MSE: 0.36
 - RMSE: 0.60
 - Perplexity coefficient: [-3.2094862e-06]
 - Intercept: 1.1064255723083531

Bigram: 
 - Model training score:  0.00022490457320711865
 - MSE: 0.36
 - RMSE: 0.60
 - Perplexity coefficient: [-1.68569998e-06]
 - Intercept: 1.1021019208947336

Unigram: 
 - Model training score:  0.006474407232901269
 - MSE: 0.36
 - RMSE: 0.60
 - Perplexity coefficient: [6.49494333e-07]
 - Intercept: 1.0581312525883806


# Approach 3: Averaging across POS/NER tag


In [None]:

class POSTagFunninessPredictor:

    def __init__(self, data_edited, data_orig, data_edit, data_grades, tagset, min_count, use_ner):

        self.avg_score = np.mean(data_grades)
        self.tagset = tagset
        self.use_ner = use_ner

        tag_counts = {}
        tag_score_sums = {}

        for i in range(len(data_edited)):
            s = data_edited[i]
            s_o = data_orig[i]
            e = data_edit[i]
            g = data_grades[i]

            t = self.get_edit_tag(s, s_o, e)

            if t in tag_counts:
                tag_counts[t] += 1
                tag_score_sums[t] += g
            else:
                tag_counts[t] = 1
                tag_score_sums[t] = g

        self.tag_counts = tag_counts
        self.avg_tag_scores = {}

        for t in tag_counts.keys():
            if tag_counts[t] >= min_count:
                self.avg_tag_scores[t] = tag_score_sums[t] / tag_counts[t]

    def get_edit_tag(self, s, s_o, e):

        # Remove punctuation that messes up truecaser
        s = re.sub(r'\s(-|\||~)\s', ' ', s)
        # Use the same pattern as for `s` so both sentences stay token-aligned
        s_o = re.sub(r'\s(-|\||~)\s', ' ', s_o)

        s = get_true_case(s)
        s_o = get_true_case(s_o)

        tagged_s = pos_tag(word_tokenize(get_true_case(s)), tagset=self.tagset)
        tagged_s_o = pos_tag(word_tokenize(get_true_case(s_o)), tagset=self.tagset)

        conll_tag_s = tree2conlltags(ne_chunk(tagged_s))
        conll_tag_s_o = tree2conlltags(ne_chunk(tagged_s_o))

        if len(conll_tag_s_o) < len(conll_tag_s):
            conll_tag_s_o += [('', 'FAKE', 'O')] * (len(conll_tag_s) - len(conll_tag_s_o))

        found = False
        edit_tag = 'UNK' # For unknown
        for (w, t, conll_t), (w_o, _, _) in zip(conll_tag_s, conll_tag_s_o):

            if w == w_o:
                continue
            if w.lower() == e.lower():
                if found:
                    print("WARNING - multiple occurrences of edit word")
                    print(s)
                    print(e)
                    print()
                else:
                    if self.use_ner and conll_t != 'O':
                        edit_tag = conll_t[2:]
                    else:
                        edit_tag = t
                    found = True
        if not found:
            print("ERROR - could not find edit word")
            print(s)
            print(s_o)
            print(e)
            print(tagged_s)
            print(tagged_s_o)
            print(ne_chunk(tagged_s))
            print(ne_chunk(tagged_s_o))
            print(conll_tag_s)
            print(conll_tag_s_o)
            print()
        return edit_tag

    def predict(self, sentence, sentence_orig, edit):
        t = self.get_edit_tag(sentence, sentence_orig, edit)

        if t in self.avg_tag_scores:
            return self.avg_tag_scores[t]
        else:
            return self.avg_score
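
Stripped of the tagging step, the predictor above is a lookup table of per-tag mean grades with a fallback to the overall mean. The sketch below isolates that logic in plain Python. The tags and grades are hypothetical (in the notebook the tags come from NLTK's tagger and chunker), as are the helper names `fit_tag_averages` and `predict_grade`.

```python
from collections import defaultdict

def fit_tag_averages(tags, grades, min_count=0):
    """Mean grade per tag (tags seen fewer than min_count times are dropped)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for tag, grade in zip(tags, grades):
        sums[tag] += grade
        counts[tag] += 1
    overall = sum(grades) / len(grades)
    per_tag = {t: sums[t] / counts[t] for t in counts if counts[t] >= min_count}
    return per_tag, overall

def predict_grade(tag, per_tag, overall):
    """Use the tag's mean grade, falling back to the overall mean."""
    return per_tag.get(tag, overall)

# Hypothetical tags and grades, for illustration only
tags = ['NN', 'NN', 'VB', 'NN', 'JJ']
grades = [1.0, 2.0, 0.5, 1.5, 0.0]
per_tag, overall = fit_tag_averages(tags, grades, min_count=2)
print(predict_grade('NN', per_tag, overall), predict_grade('JJ', per_tag, overall))
```

With `min_count=2`, the rare tags fall back to the overall mean, which mirrors how `min_count` guards against averages estimated from only a handful of headlines.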


In [None]:
re.sub(r'\s(-|\||~)\s', ' ', ' test ~ bbc')

' test bbc'

In [None]:

def test_pos_predictor(all_data=False, tagset=None, min_count=0, use_ner=False):
    if all_data:
        pos_predictor = POSTagFunninessPredictor(combined_train_edited,
                                                 combined_train_original,
                                                 combined_train_edit,
                                                 combined_train_grades,
                                                 tagset=tagset,
                                                 min_count=min_count,
                                                 use_ner=use_ner)
    else:
        pos_predictor = POSTagFunninessPredictor(train_edited, train_original,
                                                 train_edit, train_grades,
                                                 tagset=tagset,
                                                 min_count=min_count,
                                                 use_ner=use_ner)

    print(pos_predictor.tag_counts)


    val_res = [pos_predictor.predict(val_edited[i], val_original[i], val_edit[i])
               for i in range(len(val_edited))]
    val_res = np.array(val_res)
    print("Validation performance:")
    print()
    model_performance(val_res, val_grades, True)
    print()

    test_res = [pos_predictor.predict(test_edited[i], test_original[i], test_edit[i])
               for i in range(len(test_edited))]
    test_res = np.array(test_res)
    print("Test performance:")
    print()
    model_performance(test_res, test_grades, True)
    print()
    print()
    print()

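# NOTE: every run below passes use_ner=True, so these runs correspond to the
# "+ NER" entries in the results listing, even though the printed labels
# do not mention NER.
print("Normal train, PTB tagset")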
test_pos_predictor(use_ner=True)
print("Normal train, PTB tagset, min_count=10")
test_pos_predictor(use_ner=True, min_count=10)
print("Normal train, PTB tagset, min_count=50")
test_pos_predictor(use_ner=True, min_count=50)
print("Normal train, Universal tagset")
test_pos_predictor(use_ner=True, tagset='universal')
print("Normal train, Universal tagset, min_count=10")
test_pos_predictor(use_ner=True, tagset='universal', min_count=10)


## Train on normal data + Funlines

# print("All train, PTB tagset")
# test_pos_predictor(all_data=True)
# print("All train, PTB tagset, min_count=10")
# test_pos_predictor(all_data=True, min_count=10)
# print("All train, PTB tagset, min_count=50")
# test_pos_predictor(all_data=True, min_count=50)
# print("All train, Universal tagset")
# test_pos_predictor(all_data=True, tagset='universal')
# print("All train, Universal tagset, min_count=10")
# test_pos_predictor(all_data=True, tagset='universal', min_count=10)

Normal train, PTB tagset
{'NNS': 1758, 'VBG': 358, 'NN': 3723, 'VBP': 120, 'NNP': 497, 'PERSON': 748, 'VB': 590, 'VBD': 162, 'GPE': 488, 'JJ': 405, 'VBZ': 248, 'ORGANIZATION': 346, 'PRP': 24, 'WDT': 1, 'IN': 14, 'VBN': 40, 'RB': 45, 'DT': 9, 'NNPS': 22, 'FACILITY': 15, 'WP': 2, 'RP': 7, 'GSP': 4, 'FW': 5, 'LOCATION': 2, 'CD': 5, 'JJR': 4, 'MD': 1, 'PRP$': 4, 'CC': 1, 'WRB': 1, 'PDT': 1, 'JJS': 2}
Validation performance:

| MSE: 0.33274 | RMSE: 0.57684 |

Test performance:

| MSE: 0.32875 | RMSE: 0.57337 |



Normal train, PTB tagset, min_count=10
{'NNS': 1758, 'VBG': 358, 'NN': 3723, 'VBP': 120, 'NNP': 497, 'PERSON': 748, 'VB': 590, 'VBD': 162, 'GPE': 488, 'JJ': 405, 'VBZ': 248, 'ORGANIZATION': 346, 'PRP': 24, 'WDT': 1, 'IN': 14, 'VBN': 40, 'RB': 45, 'DT': 9, 'NNPS': 22, 'FACILITY': 15, 'WP': 2, 'RP': 7, 'GSP': 4, 'FW': 5, 'LOCATION': 2, 'CD': 5, 'JJR': 4, 'MD': 1, 'PRP$': 4, 'CC': 1, 'WRB': 1, 'PDT': 1, 'JJS': 2}
Validation performance:

| MSE: 0.33243 | RMSE: 0.57656 |

Test performance:

| MSE: 0.32830 | RMSE: 0.57297 |

POS Results:


**Normal train, PTB tagset**

{'NNS': 1797, 'VBG': 358, 'NN': 3805, 'VBP': 120, 'NNP': 1932, 'VB': 590, 'VBD': 162, 'JJ': 434, 'VBZ': 248, 'PRP': 24, 'WDT': 1, 'IN': 14, 'VBN': 40, 'RB': 45, 'NNPS': 39, 'DT': 9, 'WP': 2, 'RP': 7, 'FW': 5, 'CD': 5, 'JJR': 4, 'MD': 1, 'PRP\$': 4, 'CC': 1, 'JJS': 3, 'WRB': 1, 'PDT': 1}


Validation performance:

| MSE: 0.33193 | RMSE: 0.57613 |

Test performance:

| MSE: 0.32910 | RMSE: 0.57367 |




**Normal train, PTB tagset, min_count=10**



Validation performance:

| MSE: 0.33223 | RMSE: 0.57639 |

Test performance:

| MSE: 0.32847 | RMSE: 0.57312 |



**Normal train, PTB tagset, min_count=50**

Validation performance:

| MSE: 0.33265 | RMSE: 0.57676 |

Test performance:

| MSE: 0.32863 | RMSE: 0.57327 |


**Normal train, Universal tagset**

{'NOUN': 7573, 'VERB': 1519, 'ADJ': 441, 'PRON': 30, 'DET': 11, 'ADP': 14, 'ADV': 46, 'PRT': 7, 'X': 5, 'NUM': 5, 'CONJ': 1}

Validation performance:

| MSE: 0.33452 | RMSE: 0.57838 |

Test performance:

| MSE: 0.33100 | RMSE: 0.57533 |


**Normal train, Universal tagset, min_count=10**

Validation performance:

| MSE: 0.33489 | RMSE: 0.57870 |

Test performance:

| MSE: 0.33080 | RMSE: 0.57515 |


**All train, PTB tagset**

{'NNS': 3301, 'VBG': 752, 'NN': 6412, 'VBP': 203, 'NNP': 4285, 'VB': 1034, 'VBD': 335, 'JJ': 745, 'VBZ': 430, 'PRP': 54, 'WDT': 1, 'IN': 21, 'VBN': 68, 'RB': 70, 'NNPS': 97, 'DT': 27, 'WP': 2, 'RP': 7, 'FW': 7, 'CD': 13, 'JJR': 10, 'MD': 1, 'PRP\$': 8, 'CC': 3, 'JJS': 8, 'WRB': 2, 'PDT': 1, 'POS': 1, 'UNK': 1, '\$': 1}

Validation performance:

| MSE: 0.35459 | RMSE: 0.59547 |

Test performance:

| MSE: 0.35155 | RMSE: 0.59291 |



**All train, PTB tagset, min_count=10**

Validation performance:

| MSE: 0.35497 | RMSE: 0.59579 |

Test performance:

| MSE: 0.35133 | RMSE: 0.59273 |


**All train, PTB tagset, min_count=50**

Validation performance:

| MSE: 0.35540 | RMSE: 0.59615 |

Test performance:

| MSE: 0.35111 | RMSE: 0.59255 |


**All train, Universal tagset**

{'NOUN': 14095, 'VERB': 2823, 'ADJ': 763, 'PRON': 64, 'DET': 29, 'ADP': 21, 'ADV': 72, 'PRT': 8, 'X': 7, 'NUM': 13, 'CONJ': 3, 'UNK': 1, '.': 1}

Validation performance:

| MSE: 0.35973 | RMSE: 0.59977 |

Test performance:

| MSE: 0.35359 | RMSE: 0.59463 |


**All train, Universal tagset, min_count=10**

Validation performance:

| MSE: 0.35984 | RMSE: 0.59987 |

Test performance:

| MSE: 0.35349 | RMSE: 0.59455 |

**Normal train, PTB tagset + NER**

{'NNS': 1758, 'VBG': 358, 'NN': 3723, 'VBP': 120, 'NNP': 497, 'PERSON': 748, 'VB': 590, 'VBD': 162, 'GPE': 488, 'JJ': 405, 'VBZ': 248, 'ORGANIZATION': 346, 'PRP': 24, 'WDT': 1, 'IN': 14, 'VBN': 40, 'RB': 45, 'DT': 9, 'NNPS': 22, 'FACILITY': 15, 'WP': 2, 'RP': 7, 'GSP': 4, 'FW': 5, 'LOCATION': 2, 'CD': 5, 'JJR': 4, 'MD': 1, 'PRP\$': 4, 'CC': 1, 'WRB': 1, 'PDT': 1, 'JJS': 2}

Validation performance:

| MSE: 0.33274 | RMSE: 0.57684 |

Test performance:

| MSE: 0.32875 | RMSE: 0.57337 |



**Normal train, PTB tagset + NER, min_count=10**

Validation performance:

| MSE: 0.33243 | RMSE: 0.57656 |

Test performance:

| MSE: 0.32830 | RMSE: 0.57297 |



**Normal train, PTB tagset + NER, min_count=50**

Validation performance:

| MSE: 0.33288 | RMSE: 0.57696 |

Test performance:

| MSE: 0.32845 | RMSE: 0.57311 |



**Normal train, Universal tagset + NER**

{'NOUN': 7494, 'VERB': 1519, 'ADJ': 437, 'GPE': 30, 'ORGANIZATION': 48, 'PRON': 30, 'DET': 11, 'ADP': 14, 'ADV': 46, 'PRT': 7, 'X': 5, 'PERSON': 5, 'NUM': 5, 'CONJ': 1}

Validation performance:

| MSE: 0.33440 | RMSE: 0.57827 |

Test performance:

| MSE: 0.33141 | RMSE: 0.57569 |



**Normal train, Universal tagset + NER, min_count=10**

Validation performance:

| MSE: 0.33487 | RMSE: 0.57868 |

Test performance:

| MSE: 0.33079 | RMSE: 0.57515 |
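
The predictor evaluated above maps the tag of the edit word to an average grade and falls back to the global average when the tag is not in its table. The following is a minimal sketch of that idea, not the notebook's actual class: `fit_tag_averages` and `predict_from_tag` are hypothetical helper names, and the assumption that `min_count` drops tags seen fewer than `min_count` times (so they use the global mean) is inferred from the fallback in `predict`, not confirmed by the code shown here.

```python
from collections import defaultdict


def fit_tag_averages(tags, grades, min_count=0):
    """Return (per-tag mean grade, global mean grade).

    Assumption: tags seen fewer than `min_count` times are dropped,
    so they fall back to the global mean at prediction time.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for tag, grade in zip(tags, grades):
        sums[tag] += grade
        counts[tag] += 1
    avg_score = sum(grades) / len(grades)
    avg_tag_scores = {
        tag: sums[tag] / counts[tag]
        for tag in counts
        if counts[tag] >= min_count
    }
    return avg_tag_scores, avg_score


def predict_from_tag(tag, avg_tag_scores, avg_score):
    # Unseen or filtered-out tags receive the global mean grade.
    return avg_tag_scores.get(tag, avg_score)


# Toy data (made up for illustration, not taken from the dataset).
tags = ["NN", "NN", "VB", "JJ"]
grades = [1.0, 2.0, 0.5, 3.0]
avg_tag_scores, avg_score = fit_tag_averages(tags, grades, min_count=2)

print(predict_from_tag("NN", avg_tag_scores, avg_score))  # per-tag mean
print(predict_from_tag("VB", avg_tag_scores, avg_score))  # global mean fallback
```

With a threshold like this, raising `min_count` moves rare tags onto the global mean, which may explain why the results for different `min_count` values differ only in the third or fourth decimal place.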



# Baseline (predict mean of train)

In [None]:
# Baseline for the task

train_mean = np.mean(train_grades)

train_pred_baseline = torch.zeros(len(train_grades)) + train_mean
val_pred_baseline = torch.zeros(len(val_grades)) + train_mean
test_pred_baseline = torch.zeros(len(test_grades)) + train_mean

print("Mean: ", train_mean)

print("\nBaseline train performance:")
sse, mse = model_performance(train_pred_baseline, train_grades, True)

print("\nBaseline validation performance:")
sse, mse = model_performance(val_pred_baseline, val_grades, True)

print("\nBaseline test performance:")
sse, mse = model_performance(test_pred_baseline, test_grades, True)

Mean:  0.9355712114932938

Baseline train performance:
| MSE: 0.34060 | RMSE: 0.58361 |

Baseline validation performance:
| MSE: 0.33455 | RMSE: 0.57840 |

Baseline test performance:
| MSE: 0.33029 | RMSE: 0.57471 |
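
For reference, the mean baseline and the two reported metrics can be reproduced on toy data as sketched below. `mse_rmse` is a hypothetical stand-in written for this example; it is not the skeleton's `model_performance`, whose exact return values are not shown in this notebook. The sketch only assumes the usual definitions: MSE is the mean squared difference and RMSE is its square root.

```python
import numpy as np


def mse_rmse(pred, target):
    """Return (MSE, RMSE) for two equally sized arrays of grades."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mse = float(np.mean((pred - target) ** 2))
    return mse, float(np.sqrt(mse))


# Toy grades (made up for illustration).
train = np.array([0.0, 1.0, 2.0, 1.0])
held_out = np.array([0.0, 2.0])

# The baseline predicts the training mean for every item.
train_mean = train.mean()
train_pred = np.full(len(train), train_mean)
held_out_pred = np.full(len(held_out), train_mean)

train_mse, train_rmse = mse_rmse(train_pred, train)
held_mse, held_rmse = mse_rmse(held_out_pred, held_out)

print(f"| MSE: {train_mse:.5f} | RMSE: {train_rmse:.5f} |")
print(f"| MSE: {held_mse:.5f} | RMSE: {held_rmse:.5f} |")
```

On the training set itself, the MSE of this baseline equals the variance of the training grades, so any method that scores below the baseline on held-out data is explaining some of that variance.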
