This notebook fits a linear regression model to replicate the feature weighting process in
Aker et al. 2016, "A Graph-based Approach to Topic Clustering for Online Comments to News".

For the linear regression model, positive instances are comment pairs from the same cluster and negative instances are comment pairs from distinct clusters, as identified in the gold standard.
The model only approximates the one it seeks to replicate: here the target value is binary (1 for positive and 0 for negative instances), whereas in the original paper positive instances had a "quote" score in the range [0.5, 1] as their target value.
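
As a rough sketch of the setup described above, the regression can be written as follows. The notation (features f_k, weights w_k, comments c_i) is an assumption introduced here for illustration and is not taken from the paper. The model has no intercept term, because the feature matrix used below contains no constant column.

```latex
\hat{y}_{ij} = \sum_{k=1}^{7} w_k \, f_k(c_i, c_j),
\qquad
y_{ij} =
\begin{cases}
1 & \text{if } c_i \text{ and } c_j \text{ are in the same gold-standard cluster} \\
0 & \text{otherwise}
\end{cases}

\min_{w} \; \sum_{i,j} \bigl( y_{ij} - \hat{y}_{ij} \bigr)^2
```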

In [2]:
import numpy as np
import pandas as pd
import sklearn.linear_model  # the submodule must be imported explicitly to use LinearRegression
from scipy.optimize import nnls
from scipy.optimize import lsq_linear

from similarity_measures import *

In [3]:
df = pd.read_csv('../dataset/3/annotations.tsv', sep='\t').reset_index(drop=True)

In [4]:
# one row per ordered comment pair (i, j): 7 feature columns plus the target in column 7
targets = np.zeros((len(df)*len(df), 8), dtype=float)

In [5]:
sparse_tfidf = tf_idf_vectorize(df["text"].tolist())
sparse_tf = tf_vectorize(df["text"].tolist())
sim_matrix_tfidf = get_cosine_similarity_pairwise(sparse_tfidf)
sim_matrix_tf = get_cosine_similarity_pairwise(sparse_tf)

df["nes"] = df.text.map(lambda row: set(named_entities(row)))

In [6]:
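def fill_cells(targets, df, i, j, target):
    """Write the 7 feature values and the target for the comment pair (i, j) into one row."""
    row = len(df)*i+j  # row-major index of the ordered pair (i, j)
    targets[row,0] = sim_matrix_tf[i][j]     # cosine similarity on TF vectors
    targets[row,1] = sim_matrix_tfidf[i][j]  # cosine similarity on TF-IDF vectors
    targets[row,2] = cosine_modified(df.iloc[i].text, df.iloc[j].text, is_set=False)
    targets[row,3] = dice(df.iloc[i].text, df.iloc[j].text, is_set=False)
    # NOTE: this call is identical to the one for column 3, so columns 3 and 4
    # hold the same values and are perfectly collinear in the regression.
    targets[row,4] = dice(df.iloc[i].text, df.iloc[j].text, is_set=False)
    targets[row,5] = same_thread(df.iloc[i], df.iloc[j])
    targets[row,6] = ne_overlap(df.iloc[i].nes, df.iloc[j].nes, chunked=True)
    targets[row,7] = target

    return targets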
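
The row index used in `fill_cells` is a row-major flattening of the pair (i, j). A minimal sketch of the mapping and its inverse is shown here; the helper names `pair_to_row` and `row_to_pair` are hypothetical and not part of the notebook.

```python
def pair_to_row(i, j, n):
    """Row index of the ordered pair (i, j) among n comments (row-major)."""
    return n * i + j


def row_to_pair(row, n):
    """Inverse mapping: recover (i, j) from a row index."""
    return divmod(row, n)


n_comments = 4  # toy size, chosen only for illustration
row = pair_to_row(2, 3, n_comments)
pair = row_to_pair(row, n_comments)
print(row, pair)
```

This can be handy when inspecting a single row of `targets` and tracing it back to the two comments it describes.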

In [7]:
# fill positive and negative instances
# (positional access via iloc matches df.index because the index was reset above)
for i in df.index:
    for j in df.index:
        target = 1.0 if df.iloc[i].cluster == df.iloc[j].cluster else 0.0
        targets = fill_cells(targets, df, i, j, target)

In [8]:
print("No. of positive instances: {}".format(len(targets[targets[:,7] == 1.0])))
print("No. of negative instances: {}".format(len(targets[targets[:,7] == 0.0])))

No. of positive instances: 2524
No. of negative instances: 7476
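
The two counts sum to 10000, which is 100 * 100 ordered pairs. Because the loop runs over every ordered pair, each comment is also paired with itself and every unordered pair appears twice. The following sketch reproduces that labelling on toy data with NumPy broadcasting; the cluster ids are made up for illustration.

```python
import numpy as np

# Hypothetical gold-standard cluster ids for five comments (toy data).
clusters = np.array([0, 0, 1, 1, 1])

# labels[i, j] is 1.0 when comments i and j share a cluster, else 0.0.
labels = (clusters[:, None] == clusters[None, :]).astype(float)

# Flatten in row-major order so that pair (i, j) lands in row len(clusters) * i + j.
y_toy = labels.ravel()

n_pos = int(y_toy.sum())
n_neg = int(y_toy.size - n_pos)
print(n_pos, n_neg)
```

With cluster sizes 2 and 3, the number of positive instances is 2*2 + 3*3, since self-pairs and both orderings are counted.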


In [9]:
X = targets[:,0:7]
y = targets[:,7]
# An unconstrained fit results in some feature weights being negative:
# reg = sklearn.linear_model.LinearRegression().fit(X, y)

In [10]:
# use Non-negative least squares optimization
x, rnorm = nnls(X, y)
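
To illustrate the difference between an unconstrained fit and NNLS, here is a small sketch on synthetic data. The data and the variable names (`X_toy`, `y_toy`, `w_ols`, `w_nnls`) are assumptions made for this example. It uses `np.linalg.lstsq` as the unconstrained baseline, which, like `nnls`, fits no intercept.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

# Synthetic data whose true second weight is negative.
X_toy = rng.random((200, 2))
y_toy = X_toy @ np.array([1.0, -0.5]) + 0.01 * rng.standard_normal(200)

# Unconstrained least squares recovers the negative weight.
w_ols, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

# NNLS restricts every weight to be >= 0.
w_nnls, rnorm_toy = nnls(X_toy, y_toy)

print("unconstrained:", w_ols)
print("non-negative: ", w_nnls)
```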

In [12]:
# NNLS sets one coefficient (index 3) to exactly zero
x

array([0.21632516, 0.16902832, 0.15730916, 0.        , 0.14558136,
       0.35676363, 0.10640532])
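
One possible explanation for the zero coefficient is that columns 3 and 4 of `targets` are filled by the same `dice(...)` call, which would make them identical. With duplicated columns the least-squares problem has no unique solution, and weight can be moved freely between the two copies. A minimal sketch on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(50)
b = rng.random(50)

# Three columns, but the last two are identical (toy stand-in for columns 3 and 4).
X_dup = np.column_stack([a, b, b])
rank = np.linalg.matrix_rank(X_dup)

# Shifting weight between the duplicate columns leaves the predictions unchanged.
w1 = np.array([0.3, 0.2, 0.0])
w2 = np.array([0.3, 0.1, 0.1])
same_predictions = np.allclose(X_dup @ w1, X_dup @ w2)

print(rank, same_predictions)
```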

In [13]:
# to avoid zero weights, fit again with a lower bound of 0.1 on every coefficient
sol = lsq_linear(X, y, bounds=(0.1, np.inf))

In [14]:
# weights found through least squares in [0.1, inf]
sol["x"]

array([0.22332566, 0.12433157, 0.15268604, 0.1       , 0.1       ,
       0.35188559, 0.10040456])
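
The effect of the lower bound can be seen in isolation on synthetic data. In this sketch (all data and names are assumptions for illustration), the true weight of the second feature is zero, so the bounded fit should push that coefficient onto the bound of 0.1 or close to it.

```python
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(0)

# Synthetic data: the second feature has a true weight of zero.
X_toy = rng.random((200, 3))
y_toy = X_toy @ np.array([0.8, 0.0, 0.3])

# Bounded least squares: every coefficient is constrained to be >= 0.1.
sol_toy = lsq_linear(X_toy, y_toy, bounds=(0.1, np.inf))
w_bounded = sol_toy.x

print(w_bounded)
```

Coefficients that sit on the bound, such as indices 3 and 4 in the output above, are weights the data alone would set lower, so their value reflects the chosen bound rather than the fit.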

The same procedure is now applied to the basic feature set, which consists only of the TF-IDF cosine similarity (column 1) and the thread relationship (column 5).

In [15]:
# select feature subset
X_basic = X[:, [1,5]]

In [16]:
# again, NNLS sets the weight of the thread relationship to zero
x, rnorm = nnls(X_basic, y)
x

array([1.65807057, 0.        ])

In [17]:
sol = lsq_linear(X_basic, y, bounds=(0.1, np.inf))

In [18]:
# weights found
sol["x"]

array([1.60643947, 0.1       ])
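
As a sketch of how such weights might be used, the features of a comment pair can be combined into a single score by a weighted sum. This assumes the weights are applied linearly, as in the regression itself; the feature values below are made up, and only the weights come from the output above.

```python
import numpy as np

# Weights for the basic feature set, copied from the output above.
w_basic = np.array([1.60643947, 0.1])

# Hypothetical feature rows: [TF-IDF cosine similarity, same-thread indicator].
features = np.array([
    [0.40, 1.0],  # similar comments in the same thread
    [0.05, 0.0],  # dissimilar comments in different threads
])

# Weighted sum of the features for each pair.
scores = features @ w_basic
print(scores)
```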