# IHLT final project - Semantic Textual Similarity
Nikita Belooussov and Santiago del Rey Juárez

## 1. Introduction

In this project, we conducted a workshop included in SemEval (Semantic Evaluation Exercises) which are a series of workshops that have the main aim of the evaluation and comparison of semantic analysis systems. The data and corpora provided by them have become a ’de facto’ set of benchmarks for the NLP community.

Our particular task was to conduct Semantic Textual Similarity (STS), also known as paraphrases detection. A paraphrase between two sentences or texts is when both have the same meaning using different words. The task consists in given two pairs of sentences, provide a similarity value between them.

In this task, Pearson correlation is used for comparison purposes.

This notebook is divided into four sections. In Section 1 we briefly explain the goal of this project. In Section 2 we describe the data pre-processing steps and present the metrics used to detect paraphrasing. In Section 3 we explain the models used. Finally, in Section 4 we expose our conclusions.


## 2. Data pre-processing and feature extraction
In this section, we explain the functions created to read and pre-process the data. Then, we describe the similarity metrics we extracted from the dataset.

In [1]:
# requires visual studios builder from https://visualstudio.microsoft.com/visual-cpp-build-tools/
!pip install contractions
!pip install num2words



You should consider upgrading via the 'D:\santi\Documents\Projects\IHLT\IHLT_Project\.venv\Scripts\python.exe -m pip install --upgrade pip' command.




You should consider upgrading via the 'D:\santi\Documents\Projects\IHLT\IHLT_Project\.venv\Scripts\python.exe -m pip install --upgrade pip' command.


In [2]:
import csv
import os
import pickle
import string

import contractions
import nltk
import num2words
import numpy as np
import pandas as pd
import spacy
from joblib import dump
from nltk import BigramCollocationFinder
from nltk import TrigramCollocationFinder
from nltk.corpus import stopwords
from nltk.corpus import wordnet_ic
from nltk.corpus.reader import WordNetError
from nltk.metrics import jaccard_distance
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Lasso
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('maxent_ne_chunker')
nltk.download('conll2000')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('words')
nltk.download('stopwords')
nltk.download('wordnet_ic')
nltk.download('wordnet')

contractions.add('U.S.', 'United States')
contractions.add('U.S.A', 'United States of America')
contractions.add('E.U.', 'European Union')

#if this does not work run python -m spacy download en in terminal and restart the program running the code
nlp = spacy.load("en_core_web_sm")

brown_ic = wordnet_ic.ic('ic-brown.dat')
semcor_ic = wordnet_ic.ic('ic-semcor.dat')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\santi\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\santi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\santi\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package conll2000 to
[nltk_data]     C:\Users\santi\AppData\Roaming\nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\santi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\santi\AppData\Roaming\nltk_data...
[nltk_data]   Pac

### Data-preprocessing
**Read files**

The `read_file` function simply reads the data from the file path provided.

In [18]:
def read_file(file_path):
    return pd.read_csv(file_path, sep='\t', lineterminator='\n', header=None,
                       quoting=csv.QUOTE_NONE)

**Remove contractions**
The `expand_contractions` function is used to remove the contractions that might appear in the sentences. For example, "he's late" would be expanded to "he is late". With this, we expect to obtain more accurate similarities from the metrics since we are enriching our sentences.

In [3]:
def expand_contractions(s0, s1):
    s0 = contractions.fix(s0)
    s1 = contractions.fix(s1)
    return s0, s1

**Change numbers to words**

The `changeNums` function is used to replace the numbers in a sentence with their corresponding written form.

In [4]:
def changeNums(s0):
    s0 = s0.split()
    new_s0 = []
    for i in s0:
        if i.isdigit():
            new_s0.append(num2words.num2words(i))
        else:
            new_s0.append(i)
    s0 = ' '.join(new_s0)
    return s0

**Tokenize**

The `tokenize` function splits a sentence into tokens. In addition, it changes the whole sentence to lowercase and removes both punctuation symbols (i.e. !"#$%&'()*+, -./:;<=>?@[]^_`{|}~) and English stopwords (e.g. “i”, “he”, “the”).

The `name_entity_tokenization` does the same as the `tokenize` function with the addition of joining tokens that belong to the same named entity.

In [5]:
punct = string.punctuation


def tokenize(sentence):
    return [w.lower() for w in nltk.word_tokenize(sentence) if
            not all(c in punct for c in w) and w.lower() not in stopwords.words('english')]


def name_entity_tokenization(sentence):
    doc = nlp(sentence.lower())
    with doc.retokenize() as retokenizer:
        tokens = [token for token in doc]
        for ent in doc.ents:
            retokenizer.merge(doc[ent.start:ent.end],
                              attrs={"LEMMA": " ".join([tokens[i].text for i in range(ent.start, ent.end)])})
    s0_ne = [token.text for token in doc]
    return s0_ne

**Lemmatize**

The `lemmatize_sentence` function is used to extract the lemmas from a tokenized sentence.

In [6]:
from nltk.corpus import wordnet

wnl = nltk.stem.WordNetLemmatizer()


def lemmatize(pair):
    if pair[1][0] in {'N', 'V', 'J', 'R'}:  #N- noun, V- verb, J- adjective, R-adverb
        if pair[1][
            0] == 'J':  #this is used due to wordnet using a different label for adjectives than one given by nltk
            return wnl.lemmatize(pair[0].lower(), pos=wordnet.ADJ)
        return wnl.lemmatize(pair[0].lower(), pos=pair[1][0].lower())
    return pair[0]


def lemmatize_sentence(words):
    pairs = nltk.pos_tag(words)
    lemmas = [lemmatize(pair) for pair in pairs]
    return lemmas

### Feature extraction

**Synset similarity**

In [7]:
def get_wordnet_similarity(s0, s1, method, ic):
    if s0 is not None and s1 is not None:
        if method == 'path':
            return s0.path_similarity(s1)
        elif method == 'wup':
            return s0.wup_similarity(s1)
        elif s0.pos() == s1.pos():
            if method == "lch":
                return s0.lch_similarity(s1)
            elif ic is None:
                raise ValueError("ic parameter is missing")
            elif method == "lin":
                try:
                    return s0.lin_similarity(s1, ic)
                except WordNetError:
                    return None
            elif method == 'res':
                try:
                    return s0.res_similarity(s1, ic)
                except WordNetError:
                    return None
            else:
                return None
        else:
            return None
    else:
        return None


# Dictionary used to store already computed synsets
try:
    with open('synset_dic.pkl', 'rb') as file:
        computed_synsets = pickle.load(file)
except IOError:
    computed_synsets = {}


def max_similarity(s0, s1, method, ic):
    if s0 == s1:
        return 1

    if (s0, s1, method) in computed_synsets:
        return computed_synsets[(s0, s1, method)]

    synsets0 = wordnet.synsets(s0)
    synsets1 = wordnet.synsets(s1)

    similarities = []
    for syn0 in synsets0:
        for syn1 in synsets1:
            similarity = get_wordnet_similarity(syn0, syn1, method, ic)
            if similarity is not None:
                similarities.append(similarity)

    if len(similarities) > 0:
        max_sim = max(similarities)
        computed_synsets[(s0, s1, method)] = max_sim
        return max_sim
    else:
        computed_synsets[(s0, s1, method)] = 0
        return 0


def mean_simimilarity(lemmas0, lemmas1, method, ic):
    similarity_sum = 0
    for l0 in lemmas0:
        similarity_sum += max([max_similarity(l0, l1, method, ic) for l1 in lemmas1])
    return similarity_sum / len(lemmas0)


def synset_similarity(lemmas0, lemmas1, method, ic=None):
    mean_sim0 = mean_simimilarity(lemmas0, lemmas1, method, ic)
    mean_sim1 = mean_simimilarity(lemmas1, lemmas0, method, ic)

    if mean_sim0 > 0 or mean_sim1 > 0:
        return mean_sim0 + mean_sim1 / 2
    else:
        return 0

**Lesk similarity**

In [8]:
def lesk_similarity(words0, words1):
    w0_pos = nltk.pos_tag(words0)
    w1_pos = nltk.pos_tag(words1)

    s0_lesk = []
    for i in range(len(w0_pos)):
        if w0_pos[i][1][0] in {'N', 'V', 'J', 'R'}:  #N- noun, V- verb, J- adjective, R-adverb
            if w0_pos[i][1][
                0] == 'J':  #this is used due to wordnet using a different label for adjectives than one given by nltk
                s0_lesk.append(nltk.wsd.lesk(words0, w0_pos[i][0], pos=wordnet.ADJ))
            else:
                s0_lesk.append(nltk.wsd.lesk(words0, w0_pos[i][0], pos=w0_pos[i][1][0].lower()))

    s1_lesk = []
    for i in range(len(w1_pos)):
        if w1_pos[i][1][0] in {'N', 'V', 'J', 'R'}:  #N- noun, V- verb, J- adjective, R-adverb
            if w1_pos[i][1][
                0] == 'J':  #this is used due to wordnet using a different label for adjectives than one given by nltk
                s1_lesk.append(nltk.wsd.lesk(words1, w1_pos[i][0], pos=wordnet.ADJ))
            else:
                s1_lesk.append(nltk.wsd.lesk(words1, w1_pos[i][0], pos=w1_pos[i][1][0].lower()))

    return jaccard_similarity(s0_lesk, s1_lesk)

**Jaccard similarity**

In [9]:
def jaccard_similarity(words0, words1):
    return 1 - jaccard_distance(set(words0), set(words1))

**Synonyms similarity**


In [10]:
def synonyms_similarity(lemmas0, lemmas1):
    if len(lemmas1) < len(lemmas0):
        lemmas0, lemmas1 = lemmas1, lemmas0

    synonyms1 = []
    synonyms2 = []
    for i in lemmas0:
        synonyms1 = [*synonyms1, *wordnet.synsets(i)]
    for i in lemmas1:
        synonyms2 = [*synonyms2, *wordnet.synsets(i)]

    count = 0
    for i in synonyms1:
        if i in synonyms2:
            count = count + 1
    if (len(synonyms1) != 0) and (len(synonyms2) != 0):
        return count / len(synonyms1)
    else:
        return 0

**TF-IDF and cosine**

In [11]:
def tf_similarity(s0, s1):
    # Generate the tf-idf vectors for the corpus
    words0 = ' '.join([str(elem) for elem in s0])
    words1 = ' '.join([str(elem) for elem in s1])

    tfvec = TfidfVectorizer()
    tfidf_matrix = tfvec.fit_transform([words0, words1])

    return cosine_similarity(tfidf_matrix, tfidf_matrix)[0, 1]

**N-Gram similarity**

In [12]:
def compute_n_grams(s0, s1):
    #words1=[word for word in s0.split(" ") if word not in set(stopwords.words('english'))]
    low_size = 5
    if len(s1) < len(s0):
        s0, s1 = s1, s0
    if len(s0) < 5:
        low_size = len(s0)
    metrics = [0, 0, 0, 0, 0, 0]
    for i in range(2, low_size):
        n_grams1 = zip(*[s0[k:] for k in range(0, i)])
        n_grams2 = zip(*[s1[k:] for k in range(0, i)])
        count = 0

        n_grams1 = [' '.join(ngram) for ngram in n_grams1]
        n_grams2 = [' '.join(ngram) for ngram in n_grams2]
        #print (set(nGrams2))
        #print (set(nGrams1))
        for j in set(n_grams1):
            if j in set(n_grams2):
                count += 1
        if (len(n_grams1) != 0) and (len(n_grams2) != 0):
            #print(1-jaccard_distance(set(nGrams1), set(nGrams2)))
            #print (count)
            metrics[(i - 2) * 2] = count / len(set(n_grams1))
            metrics[((i - 2) * 2) + 1] = 1 - jaccard_distance(set(n_grams1), set(n_grams2))
    return metrics

In [13]:
def unigram_similarity(words0, words1):
    count = 0
    for w in words0:
        count += min(words0.count(w), words1.count(w))

    if len(words1) > 0 or len(words1) > 0:
        return 2 * count / (len(words0) + len(words0))
    else:
        return 0

In [14]:
def bigram_similarity(words0, words1):
    finder0 = BigramCollocationFinder.from_words(words0)
    finder1 = BigramCollocationFinder.from_words(words1)

    # We get the bigrams of first sentence and its frequency
    bigrams0 = []
    freq0 = []
    for b0 in finder0.ngram_fd.items():
        bigrams0.append(b0[0])
        freq0.append(b0[1])

    # We get the bigrams of second sentence and its frequency
    bigrams1 = []
    freq1 = []
    for b0 in finder1.ngram_fd.items():
        bigrams1.append(b0[0])
        freq1.append(b0[1])

    count = 0
    for i in range(len(bigrams0)):
        if bigrams0[i] in bigrams1:
            # Count number of same bigrams
            count += min(freq0[i], freq1[bigrams1.index(bigrams0[i])])

    if len(words0) > 0 or len(words1) > 0:
        if len(bigrams0) > 0 or len(bigrams1) > 0:
            return 2 * count / (len(words0) + len(words1)), jaccard_similarity(bigrams0, bigrams1)
        else:
            return 2 * count / (len(words0) + len(words1)), 0
    else:
        return 0

In [15]:
def trigram_similarity(words0, words1):
    finder0 = TrigramCollocationFinder.from_words(words0)
    finder1 = TrigramCollocationFinder.from_words(words1)

    # We get the trigrams of first sentence and its frequency
    trigrams0 = []
    freq0 = []
    for t0 in finder0.ngram_fd.items():
        trigrams0.append(t0[0])
        freq0.append(t0[1])

    # We get the trigrams of second sentence and its frequency
    trigrams1 = []
    freq1 = []
    for t1 in finder1.ngram_fd.items():
        trigrams1.append(t1[0])
        freq1.append(t1[1])

    count = 0
    for i in range(len(trigrams0)):
        if trigrams0[i] in trigrams1:
            # Count number of same trigrams
            count += min(freq0[i], freq1[trigrams1.index(trigrams0[i])])

    if len(words0) > 0 or len(words1) > 0:
        if len(trigrams0) > 0 or len(trigrams1) > 0:
            return 2 * count / (len(words0) + len(words1)), jaccard_similarity(trigrams0, trigrams1)
        else:
            return 2 * count / (len(words0) + len(words1)), 0
    else:
        return 0

**Length difference**

In [16]:
def length_difference(words0, words1):
    return abs(len(words0) - len(words1)) / max(len(words0), len(words1))

**Extract features**

In [17]:
N_SYMBOLS = 50


def extract_features(x):
    features = []
    n_samples = x.shape[0]
    perc = round(0.02 * n_samples)
    counter = 0
    progress = 0
    for sentence_0, sentence_1 in x:
        sentence_0 = changeNums(sentence_0)
        sentence_1 = changeNums(sentence_1)
        sentence_0, sentence_1 = expand_contractions(sentence_0, sentence_1)
        words0 = tokenize(sentence_0)
        words1 = tokenize(sentence_1)
        s0_lemmas = lemmatize_sentence(words0)
        s1_lemmas = lemmatize_sentence(words1)
        s0_ne = name_entity_tokenization(sentence_0)
        s1_ne = name_entity_tokenization(sentence_1)
        # n_grams_results = compute_n_grams(s0_lemmas, s1_lemmas)
        bigram_w_count, bigram_w_jc = bigram_similarity(words0, words1)
        bigram_l_count, bigram_l_jc = bigram_similarity(s0_lemmas, s1_lemmas)
        trigram_w_count, trigram_w_jc = trigram_similarity(words0, words1)
        trigram_l_count, trigram_l_jc = trigram_similarity(s0_lemmas, s1_lemmas)

        features.append([
            jaccard_similarity(words0, words1),
            jaccard_similarity(s0_lemmas, s1_lemmas),
            jaccard_similarity(s0_ne, s1_ne),
            tf_similarity(words0, words1),
            tf_similarity(s0_lemmas, s1_lemmas),
            synset_similarity(s0_lemmas, s1_lemmas, 'path'),
            synset_similarity(s0_lemmas, s1_lemmas, 'lch'),
            synset_similarity(s0_lemmas, s1_lemmas, 'wup'),
            synset_similarity(s0_lemmas, s1_lemmas, 'lin', brown_ic),
            synset_similarity(s0_lemmas, s1_lemmas, 'lin', semcor_ic),
            synset_similarity(s0_lemmas, s1_lemmas, 'res', brown_ic),
            synset_similarity(s0_lemmas, s1_lemmas, 'res', semcor_ic),
            lesk_similarity(words0, words1),
            unigram_similarity(words0, words1),
            unigram_similarity(s0_lemmas, s1_lemmas),
            bigram_w_count,
            bigram_w_jc,
            bigram_l_count,
            bigram_l_jc,
            trigram_w_count,
            trigram_w_jc,
            trigram_l_count,
            trigram_l_jc,
            synonyms_similarity(s0_lemmas, s1_lemmas),
            length_difference(s0_lemmas, s1_lemmas)
        ])

        progress = print_progress(counter, perc, progress)
        counter += 1

    print()
    return np.array(features, dtype=np.float64)


def print_progress(counter, perc, progress):
    if (counter % perc) == 0:
        print('<' + '#' * progress + '.' * (N_SYMBOLS - progress) + '>', end='\r')
        return progress + 1
    return progress

### Read data and extract features

In [19]:
#Read train data
dataPath = os.path.join('data', 'train')
train_data = None
for filename in sorted(os.listdir(dataPath)):
    if "STS.input" in filename:
        data = read_file(os.path.join(dataPath, filename)).to_numpy()
        if train_data is None:
            train_data = data
        else:
            train_data = np.concatenate((train_data, data))

y_train = None
for filename in sorted(os.listdir(dataPath)):
    if "STS.gs" in filename:
        data = read_file(os.path.join(dataPath, filename)).to_numpy()
        if y_train is None:
            y_train = data
        else:
            y_train = np.concatenate((y_train, data))

y_train = y_train.ravel()

##Read test data
dataPath = os.path.join('data', 'test-gold')
test_data = None
for filename in sorted(os.listdir(dataPath)):
    if "STS.input" in filename:
        data = read_file(os.path.join(dataPath, filename)).to_numpy()
        if test_data is None:
            test_data = data
        else:
            test_data = np.concatenate((test_data, data))

y_test = None
for filename in sorted(os.listdir(dataPath)):
    if "STS.gs" in filename and "ALL" not in filename:
        data = read_file(os.path.join(dataPath, filename)).to_numpy()
        if y_test is None:
            y_test = data
        else:
            y_test = np.concatenate((y_test, data))

y_test = y_test.ravel()

In [20]:
INF = np.finfo(np.float64).max

scaler = StandardScaler()

print('Starting computation of training data similarities')
train_features = extract_features(train_data)
print('Max value in train features:', train_features.max())
# Since Resnik similarity can go up to infinity we set the max possible value to
# the maximum representable number in float64.
# Then, we divide by the maximum in order to normalize and avoid an overflow when using the StandardScaler
train_features[train_features == np.inf] = INF
train_features[:, 10] = train_features[:, 10] / INF
train_features[:, 11] = train_features[:, 11] / INF
print('Max value in train features after replacement:', train_features.max())
x_train = np.round(scaler.fit_transform(train_features), 3)
print('Finished computation of training data similarities\n')

print('Starting computation of testing data similarities')
test_features = extract_features(test_data)
print('Max value in test features:', test_features.max())
test_features[test_features == np.inf] = INF
test_features[:, 10] = test_features[:, 10] / INF
test_features[:, 11] = test_features[:, 11] / INF
print('Max value in test features after replacement:', test_features.max())
x_test = np.round(scaler.fit_transform(test_features), 3)
print('Finished computation of testing data similarities\n')

# We save the already computed synset similarities to speed up future runs
with open('synset_dic.pkl', 'wb') as file:
    pickle.dump(computed_synsets, file)

Starting computation of training data similarities
<#################################################.>
Max value in train features: 3.75e+299
Max value in train features after replacement: 4.93651885416962
Finished computation of training data similarities

Starting computation of testing data similarities
<##################################################>
Max value in test features: 1.1250000000000001e+300
Max value in test features after replacement: 5.4563792395895785
Finished computation of testing data similarities



## 3. Models

In [21]:
selected_features_svr = [
    0,  # jaccard using words
    1,  # jaccard using lemmas
    2,  # jaccard using NEs
    3,  # tf similarity using words
    4,  # tf similairty using lemmas
    5,  # path similarity
    6,  # lch similarity
    7,  # wup similarity
    #  8, # lin brown similarity
    # 9,  # lin semcor similarity
    # 10, # res brown similarity
    # 11,  # res semcor similarity
    # 12,  # lesk similarity
    13,  #unigram count using words
    14,  #unigram count using lemmas
    15,  #bigram count using words
    # 16, #bigram jaccard using words
    17,  #bigram count using lemmas
    # 18, #bigram jaccard using lemmas
    19,  #trigram count using words
    # 20, #trigram jaccard using words
    21,  #trigram count using lemmas
    # 22, #trigram jaccard using lemmas
    # 23,  #synnonyms
    24  #length difference using lemmas
]

### Random Forest Model

In [33]:
param_grid = dict(n_estimators=np.arange(135, 150), max_features=['auto', 'sqrt'],
                  min_samples_leaf=np.arange(1, 10))

best_rf = None
best_corr_rf = 0
best_n = 0
best_ms = 0
for n in np.arange(75, 85):
    for ms in np.arange(20, 30):
        rf_regr = RandomForestRegressor(n_estimators=n, min_samples_leaf=ms, random_state=72)
        rf_regr.fit(x_train[:, selected_features_svr], y_train)

        # Use the forest's predict method on the test data
        rf_pred = rf_regr.predict(x_test[:, selected_features_svr])
        rf_correlation = pearsonr(rf_pred, y_test)[0]

        if best_corr_rf < rf_correlation:
            best_n = n
            best_ms = ms
            best_corr_rf = rf_correlation
            best_rf = rf_regr

print(f'n_neighbors={best_n}')
print(f'min_samples_leaf={best_ms}')

n_neighbors=83
min_samples_leaf=26


In [32]:
rf_pred = best_rf.predict(x_test[:, selected_features_svr])

rf_correlation = pearsonr(rf_pred, y_test)[0]
print(f'Pearson correlation using Random Forest Regressor: {rf_correlation}')

Pearson correlation using Random Forest Regressor: 0.6836435211256756


### SVR model

In [64]:
# param_grid = dict(C=np.linspace(434, 438, 20), gamma=np.linspace(9e-3, 2e-2, 20))
#
# svr_regr = GridSearchCV(SVR(), param_grid, n_jobs=-1)
# svr_regr.fit(x_train[:, selected_features_svr], y_train)
#
# print('Optimal parameters:', svr_regr.best_params_)

In [None]:
best_svr = None
best_svr_corr = 0
best_c = None
best_g = None
best_e = None
best_t = None
cs = np.linspace(1e-1, 2, 10)
gammas = np.linspace(1e-2, 1, 10)
epsilons = np.linspace(0.1, 1, 10)
tolerances = np.linspace(1e-3, 1, 10)
for c in cs:
    for g in gammas:
        for e in epsilons:
            for t in tolerances:
                svr_regr = SVR(C=c, gamma=g, epsilon=e, tol=t)
                svr_regr.fit(x_train[:, selected_features_svr], y_train)
                svr_pred = svr_regr.predict(x_test[:, selected_features_svr])
                svr_correlation = pearsonr(svr_pred, y_test)[0]

                if best_svr_corr < svr_correlation:
                    best_c = c
                    best_g = g
                    best_e = e
                    best_t = t
                    best_svr_corr = svr_correlation
                    best_svr = svr_regr

print(f'Best parameters: C={best_c}, gamma={best_g}, epsilon={best_e}, tol={best_t}')
print(f'Pearson correlation using Support Vector Regressor: {best_svr_corr}')

if best_svr_corr > 0.7561:
    dump(best_svr, 'svr_regr.joblib')

### Linear KNN

In [50]:
param_grid = dict(n_neighbors=np.arange(1, 10), weights=['uniform', 'distance'],
                  metric=['minkowski', 'euclidean', 'manhattan'])

best_knn = None
best_knn_corr = 0
best_n = 0
best_w = None
best_m = None

for n in np.arange(15, 25):
    for w in ['uniform', 'distance']:
        for m in ['minkowski', 'euclidean', 'manhattan']:
            knn_regr = KNeighborsRegressor(n_neighbors=n, weights=w, metric=m)
            knn_regr.fit(x_train[:, selected_features_svr], y_train)

            knn_pred = knn_regr.predict(x_test[:, selected_features_svr])

            knn_correlation = pearsonr(knn_pred, y_test)[0]
            if best_knn_corr < knn_correlation:
                best_knn = knn_regr
                best_knn_corr = knn_correlation
                best_n = n
                best_w = w
                best_m = m

print(f'Best parameters: n_neighbors={best_n}, weights={best_w}, metric={best_m}')

Best parameters: n_neighbors=21, weights=distance, metric=minkowski


In [51]:
knn_pred = best_knn.predict(x_test[:, selected_features_svr])

knn_correlation = pearsonr(knn_pred, y_test)[0]
print(f'Pearson correlation using KNN Regressor: {knn_correlation}')

Pearson correlation using KNN Regressor: 0.6896674906422985


### MLP Regressor 1

In [None]:
selected_features_mlp = [
    0,  # jaccard using words
    1,  # jaccard using lemmas
    2,  # jaccard using NEs
    3,  # tf similarity using words
    4,  # tf similairty using lemmas
    5,  # path similarity
    6,  # lch similarity
    7,  # wup similarity
    # 8,  # lin brown similarity
    # 9,  # lin semcor similarity
    # 10,  # res brown similarity
    # 11,  # res semcor similarity
    # 12,  # lesk similarity
    # 13,  #unigram count using words
    14,  #unigram count using lemmas
    # 15,  #bigram count using words
    # 16,  #bigram jaccard using words
    17,  #bigram count using lemmas
    # 18,  #bigram jaccard using lemmas
    # 19,  #trigram count using words
    # 20,  #trigram jaccard using words
    21,  #trigram count using lemmas
    # 22,  #trigram jaccard using lemmas
    23,  #synnonyms
    24  #length difference using lemmas
]

In [None]:
best_nn_corr = 0
best_mlp = None
best_bs = None
best_alpha = None
best_lr = None

for bs in np.arange(145, 155):
    for a in np.linspace(1e-2, 1e-1, 20):
        for lr in np.linspace(1e-2, 1e-1, 20):
            mlp_regr = MLPRegressor((8, 8, 28), activation='relu', max_iter=5000, random_state=1, batch_size=bs,
                                    alpha=a, learning_rate_init=lr)
            mlp_regr.fit(x_train[:, selected_features_mlp], y_train)
            nn_pred = mlp_regr.predict(x_test[:, selected_features_mlp])

            nn_correlation = pearsonr(nn_pred, y_test)[0]
            if best_nn_corr < nn_correlation:
                best_nn_corr = nn_correlation
                best_mlp = mlp_regr
                best_bs = bs
                best_alpha = a
                best_lr = lr

print(f'Best params: batch_size={best_bs}, alpha={best_alpha}, learning_rate_init={best_lr}')
print(f'Pearson correlation using MLP: {best_nn_corr}')

if best_nn_corr > 0.7562:
    dump(best_mlp, 'mlp_regr.joblib')

In [118]:
# Best MLP regressor found for this hidden layers structure
mlp_regr = MLPRegressor((8, 8, 28), activation='relu', max_iter=5000, random_state=1, batch_size=146,
                        alpha=0.08948421052631579, learning_rate_init=0.052631578947368425)
mlp_regr.fit(x_train[:, selected_features_mlp], y_train)
nn_pred = mlp_regr.predict(x_test[:, selected_features_mlp])

nn_correlation = pearsonr(nn_pred, y_test)[0]
print(f'Pearson correlation using MLP: {nn_correlation}')

Pearson correlation using MLP: 0.7648086864308012


### MLP Regressor 2

In [164]:
selected_features_mlp_2 = [
    0,  # jaccard using words
    1,  # jaccard using lemmas
    2,  # jaccard using NEs
    3,  # tf similarity using words
    4,  # tf similairty using lemmas
    5,  # path similarity
    6,  # lch similarity
    7,  # wup similarity
    # 8,  # lin brown similarity
    # 9,  # lin semcor similarity
    # 10,  # res brown similarity
    # 11,  # res semcor similarity
    # 12,  # lesk similarity
    # 13,  #unigram count using words
    14,  #unigram count using lemmas
    # 15,  #bigram count using words
    # 16,  #bigram jaccard using words
    17,  #bigram count using lemmas
    # 18,  #bigram jaccard using lemmas
    # 19,  #trigram count using words
    # 20,  #trigram jaccard using words
    21,  #trigram count using lemmas
    # 22,  #trigram jaccard using lemmas
    # 23,  #synnonyms
    24  #length difference using lemmas
]

In [None]:
best_nn_corr = 0
best_mlp = None
best_bs = None
best_alpha = None
best_lr = None

for bs in np.arange(100, 250, 5):
    for a in np.linspace(1e-5, 1e-3, 20):
    #     for lr in np.linspace(3e-2, 7e-2, 20):
    #         mlp_regr = MLPRegressor((8, 8, 7, 25), activation='relu', max_iter=5000, random_state=1, batch_size=bs,
    #                                 alpha=a, learning_rate_init=lr)
        mlp_regr = MLPRegressor((8, 8, 7, 25), activation='relu', max_iter=5000, random_state=1, batch_size=bs, alpha=a)
        mlp_regr.fit(x_train[:, selected_features_mlp_2], y_train)
        nn_pred = mlp_regr.predict(x_test[:, selected_features_mlp_2])

        nn_correlation = pearsonr(nn_pred, y_test)[0]
        if best_nn_corr < nn_correlation:
            best_nn_corr = nn_correlation
            best_mlp = mlp_regr
            best_bs = bs
            best_alpha = a
            # best_lr = lr

print(f'Best params: batch_size={best_bs}, alpha={best_alpha}, learning_rate_init={best_lr}')
print(f'Pearson correlation using MLP 2: {best_nn_corr}')

if best_nn_corr > 0.7562:
    dump(best_mlp, 'mlp_regr_2.joblib')

In [163]:
# Best MLP regressor found for this hidden layers structure
mlp_regr = MLPRegressor((8, 8, 7, 25), activation='relu', max_iter=5000, random_state=1)
mlp_regr.fit(x_train[:, selected_features_mlp_2], y_train)
nn_pred = mlp_regr.predict(x_test[:, selected_features_mlp_2])

nn_correlation = pearsonr(nn_pred, y_test)[0]
print(f'Pearson correlation using MLP 2: {nn_correlation}')

Pearson correlation using MLP 2: 0.7414042872923725


### Lasso Regression

In [203]:
selected_features_lasso = [
    0,  # jaccard using words
    1,  # jaccard using lemmas
    2,  # jaccard using NEs
    3,  # tf similarity using words
    4,  # tf similairty using lemmas
    5,  # path similarity
    6,  # lch similarity
    7,  # wup similarity
    #  8, # lin brown similarity
    9,  # lin semcor similarity
    # 10, # res brown similarity
    11,  # res semcor similarity
    12,  # lesk similarity
    13,  #unigram count using words
    14,  #unigram count using lemmas
    15,  #bigram count using words
    16,  #bigram jaccard using words
    17,  #bigram count using lemmas
    18,  #bigram jaccard using lemmas
    19,  #trigram count using words
    20,  #trigram jaccard using words
    21,  #trigram count using lemmas
    22,  #trigram jaccard using lemmas
    23,  #synnonyms
    24  #length difference using lemmas
]

In [206]:
lasso_regr = Lasso(alpha=6e-4)
lasso_regr.fit(x_train[:, selected_features_lasso], y_train)

Optimal parameters: 2.694736842105263


In [207]:
lasso_pred = lasso_regr.predict(x_test[:, selected_features_lasso])

lasso_correlation = pearsonr(lasso_pred, y_test)[0]
print(f'Pearson correlation using Lasso: {lasso_correlation}')

Pearson correlation using Ridge: 0.6955648204133787


In [209]:
avg_pred = nn_pred + svr_pred

avg_correlation = pearsonr(avg_pred, y_test)[0]
print(f'Pearson correlation using the average between MLP and SVR: {avg_correlation}')

Pearson correlation using the average between Ridge, MLP and SVR: 0.7299407668989941
