This notebook applies traditional NLP techniques (SVD, trigram models, etc.) to the economic news article dataset to extract signal for the relevance labels. To build robust trigram models, smoothing and discounting are applied to the maximum likelihood statistics so that the probability estimates are balanced for binary classification. The SVD step attempts simple clustering to find underlying similarities between article snippets.

In [1]:
import pandas as pd
import numpy as np
import scipy as sp
import nltk
import re
import os
import gc
import pickle
import json
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
def load_pickle(path):
    # Close the file handle once the object has been read
    with open(path, "rb") as f:
        return pickle.load(f)
trainX = load_pickle('./data/relevance_trainX.pkl')
trainY = load_pickle('./data/relevance_trainY.pkl')
testX = load_pickle('./data/relevance_testX.pkl')
testY = load_pickle('./data/relevance_testY.pkl')

<h2> Supervised Learning - Trigram Model </h2>

Create two trigram models $ M_{+} $ and $ M_{-} $ such that, for every sentence $ S_{i} $, the probability of relevance is given by normalizing the two model likelihoods (a softmax over their log-probabilities), namely $\frac{P_{M_{+}}(S_{i})}{P_{M_{+}}(S_{i}) + P_{M_{-}}(S_{i})} $
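
As a minimal sketch of the normalization above, assuming each model returns a sentence log-probability: the function name `relevance_probability` and the two example log-probabilities are hypothetical and not part of the notebook. Working in log space avoids underflow, since the probability of a long snippet under a trigram model is extremely small.

```python
import math

def relevance_probability(logp_pos, logp_neg):
    """Return P_{M+}(S) / (P_{M+}(S) + P_{M-}(S)) from two log-probabilities."""
    # Shift by the larger log-probability so exp() cannot underflow to 0/0.
    m = max(logp_pos, logp_neg)
    pos = math.exp(logp_pos - m)
    neg = math.exp(logp_neg - m)
    return pos / (pos + neg)

# Hypothetical log-probabilities of one snippet under M+ and M-
print(relevance_probability(-120.0, -123.0))
```

This form treats both classes as equally likely a priori, which is what the formula above states.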

In [3]:
#Token cleaning: <start>/<end> padding, <num> tag, quote and period stripping
def num_tag(num):
    # True if the token contains at least one digit
    return any(ch in "0123456789" for ch in num)
def cleaner_vocab(s):
    # Strip wrapping single quotes, then drop periods from multi-character tokens
    if len(s)>1 and s[0]=="'" and s[-1]=="'": s = s[1:-1]
    if "." in s and len(s)!=1: s = s.replace(".","")
    return s
def trigram_text(s):
    # Two <start> tokens give the first real word a full trigram context
    tokens = ("<start> <start> "+s+" <end>").split()
    return [cleaner_vocab("<num>" if num_tag(t) else t) for t in tokens]
trainX = trainX.apply(trigram_text)

In [4]:
postrainX = trainX[trainY==1].copy().reset_index(drop=True)
negtrainX = trainX[trainY==0].copy().reset_index(drop=True)

In [5]:
#Define a Trigram Model
from collections import defaultdict

class LangModel():
    def __init__(self, vocab):
        # Sparse nested counters: dict.fromkeys(vocab, d) would make every key
        # share the same inner dict, so all contexts would get identical counts.
        self.vocab = set(vocab)
        self.unigram = defaultdict(int)
        self.bigram = defaultdict(lambda: defaultdict(int))
        self.trigram = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    
    def add_experience(self, word1, word2, word3):
        # Update c(word3), c(word2, word3) and c(word1, word2, word3)
        self.unigram[word3]+=1
        self.bigram[word2][word3]+=1
        self.trigram[word1][word2][word3]+=1

In [6]:
#Creating the shared vocabulary and the unigram, bigram and trigram count dictionaries
vocab = {tok for sent in trainX for tok in sent}
posModel, negModel = LangModel(vocab), LangModel(vocab)

In [7]:
#Define word counts
for sent in postrainX:
    for i in range(2, len(sent)):
        posModel.add_experience(sent[i-2], sent[i-1], sent[i])
for sent in negtrainX:
    for i in range(2, len(sent)):
        negModel.add_experience(sent[i-2], sent[i-1], sent[i])

<p> Application of linear interpolation to the maximum likelihood estimates: </p>
$ P(w|u,v) = \lambda_{1} \cdot Q_{ML}(w|u,v) + \lambda_{2} \cdot Q_{ML}(w|v) + (1 - \lambda_{1} - \lambda_{2}) \cdot Q_{ML}(w) $
<br/><br/>
where $ Q_{ML} $ denotes the traditional maximum likelihood estimates (trigram, bigram, and unigram, respectively), and
<br/>
$ \lambda_{1} = \dfrac {c(u,v)}{c(u,v)+\gamma} $, 
$ \lambda_{2} = (1 - \lambda_{1}) \cdot \dfrac {c(v)}{c(v)+\gamma} $
<br/>
where $ c(\cdot) $ is a count taken from the training data and $ \gamma > 0 $ is chosen by minimizing perplexity (equivalently, minimizing log loss) on a held-out portion of the training set.
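
The interpolation and the choice of $ \gamma $ can be sketched as follows. This is only an illustration under stated assumptions, not the notebook's implementation: the helpers `count_ngrams`, `interp_prob` and `perplexity`, the toy sentences, the candidate grid for $ \gamma $, and the probability floor are all hypothetical. Counts are kept in flat `Counter` objects keyed by tuples rather than in the `LangModel` class so that the block is self-contained.

```python
import math
from collections import Counter

def count_ngrams(sentences):
    """Count n-grams and the contexts they condition on (assumed helper)."""
    uni, bi, tri = Counter(), Counter(), Counter()
    ctx1, ctx2 = Counter(), Counter()  # c(v) and c(u, v) as contexts
    for sent in sentences:
        for i in range(2, len(sent)):
            u, v, w = sent[i - 2], sent[i - 1], sent[i]
            uni[w] += 1
            bi[(v, w)] += 1
            tri[(u, v, w)] += 1
            ctx1[v] += 1
            ctx2[(u, v)] += 1
    return uni, bi, tri, ctx1, ctx2

def interp_prob(u, v, w, counts, gamma):
    """P(w | u, v) by linear interpolation; requires gamma > 0."""
    uni, bi, tri, ctx1, ctx2 = counts
    lam1 = ctx2[(u, v)] / (ctx2[(u, v)] + gamma)
    lam2 = (1 - lam1) * ctx1[v] / (ctx1[v] + gamma)
    # Maximum likelihood estimates; an unseen context contributes 0
    q_tri = tri[(u, v, w)] / ctx2[(u, v)] if ctx2[(u, v)] else 0.0
    q_bi = bi[(v, w)] / ctx1[v] if ctx1[v] else 0.0
    q_uni = uni[w] / sum(uni.values())
    return lam1 * q_tri + lam2 * q_bi + (1 - lam1 - lam2) * q_uni

def perplexity(sentences, counts, gamma, floor=1e-12):
    """exp of the average negative log-probability per predicted token."""
    log_sum, n = 0.0, 0
    for sent in sentences:
        for i in range(2, len(sent)):
            p = interp_prob(sent[i - 2], sent[i - 1], sent[i], counts, gamma)
            # Floor (an assumption) guards against words unseen in training
            log_sum += math.log(max(p, floor))
            n += 1
    return math.exp(-log_sum / n)

# Toy data, already padded the same way trigram_text pads a snippet
train = [
    ["<start>", "<start>", "rates", "rise", "<end>"],
    ["<start>", "<start>", "rates", "fall", "<end>"],
    ["<start>", "<start>", "stocks", "rise", "<end>"],
]
heldout = [["<start>", "<start>", "stocks", "fall", "<end>"]]

counts = count_ngrams(train)
# Pick the candidate gamma with the lowest held-out perplexity
best_gamma = min([0.1, 1.0, 10.0], key=lambda g: perplexity(heldout, counts, g))
print(best_gamma, perplexity(heldout, counts, best_gamma))
```

Because the three weights sum to one and each $ Q_{ML} $ is itself a distribution over seen words, the interpolated estimate remains a proper distribution for any context, including contexts never observed in training (where it falls back to the unigram estimate).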

In [None]:
#choosing a single smoothing parameter by minimizing perplexity


In [None]:
#applying linear interpolation

In [None]:
#analyzing train set results

In [None]:
#calculating test set probabilities

<h2> Semi-Supervised Learning - SVD Algorithm </h2>