# Introduction

In this notebook, we demonstrate how to train a multi-layer perceptron model for sentence similarity.

Continuing from last week, we use the dataset of sentence pairs from PubMed articles (https://www.ncbi.nlm.nih.gov/pubmed/28881973). The input is a pair of sentences and the output is the similarity of the pair, as annotated by expert curators.

We will cover the following:

(1) A quick review of what we did last week

(2) Feature engineering: implementing new features

(3) Train an updated linear regression model

(4) Train an MLP model

(5) Hyperparameter tuning

Note that training a deep learning model requires significant computational resources and time. Here we use a simple MLP model with a very small dataset for demonstration. In practice, training a robust deep learning model requires a much larger dataset.


# Install required libraries and load the dataset

In [None]:
import pandas as pd
import numpy as np
import nltk
from nltk import word_tokenize
from string import punctuation
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
stopword_list = set(stopwords.words('english'))
from sklearn.linear_model import LinearRegression

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
train_df = pd.read_csv('train-0.csv', sep='\t', names=['sentence_1', 'sentence_2', 'similarity'])
test_df = pd.read_csv('test-0.csv', sep='\t', names=['sentence_1', 'sentence_2', 'similarity'])

print(train_df.head())
print(test_df.head())

                                          sentence_1  ... similarity
0  This form of necrosis, also termed necroptosis...  ...        4.0
1  BAF53 and β-actin subunits have been implicate...  ...        3.8
2  T47D, MCF-7, Skbr3, HeLa, and Caco-2 cells wer...  ...        3.0
3  This oxidative branch activity is elevated in ...  ...        0.0
4  Centrosomes increase both in size and in micro...  ...        0.0

[5 rows x 3 columns]
                                          sentence_1  ... similarity
0  It has been shown, however, that ubiquitinatio...  ...        3.0
1  Ironically, Rest has recently been described a...  ...        3.2
2  In PC9 cells, loss of GATA6 and/or HOPX did no...  ...        0.2
3  MiR-223 seems to be a target molecule of TFs r...  ...        1.2
4  BAF53 and β-actin subunits have been implicate...  ...        3.0

[5 rows x 3 columns]


# Feature engineering

We need to think about which features matter for sentence similarity. One reasonable assumption is that two sentences are more similar if they share more words. So as a first step, we could use the Jaccard similarity (the number of distinct words the two sentences share, divided by the number of distinct words that appear in either sentence) as a feature.

Last week, for demonstration, we used only this feature to train a linear regression model. But many other features are possible. For example: do the two sentences mention the same entities? Do they share common phrases? Do they come from the same section of the paper?

You could think about many features and train the models again. The method and logic remain the same.

NEW: we add more features: (1) the length ratio, (2) the edit distance (e.g., the distance between 'feature' and 'features' is 1), and (3) the overlap of character bigrams and trigrams.
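To make the edit distance concrete, here is a minimal sketch of the Levenshtein distance in plain Python. The helper name `levenshtein` is hypothetical and only for illustration; the notebook itself relies on `nltk.edit_distance`, which with its default settings should give the same values on these inputs.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]

print(levenshtein('feature', 'features'))  # 1: insert the final 's'
print(levenshtein('kitten', 'sitting'))    # 3
```

Note that the raw distance grows with sentence length, so long dissimilar pairs produce much larger values than the ratio-based features.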

# Preprocess sentences

To generate the features, we first need to preprocess the sentences. Think about the preprocessing steps that you learnt in week 3 and decide which to apply. For demonstration, I used case folding, word tokenization, stop word removal and punctuation removal. You can experiment with others.

In [None]:
def preprocess_sentence(sentence):
  #case folding
  sentence = sentence.lower()
  processed_tokens = []
  #word tokenization
  for token in word_tokenize(sentence):
    #remove stopwords and punctuations
    if token not in punctuation and token not in stopword_list:
      processed_tokens.append(token)
  return processed_tokens



With the preprocessing function in place, we can compute the Jaccard similarity of the words in each pair.

In [None]:
def compute_shared_words(sentence1, sentence2):
  #preprocess the pairs using our created preprocess functions
  sentence1_tokens = preprocess_sentence(sentence1)
  sentence2_tokens = preprocess_sentence(sentence2)
  #use Python's built-in set operations to get the intersection and union of the tokens
  shared_words = len(set(sentence1_tokens) & set(sentence2_tokens))
  total_words = len(set(sentence1_tokens) | set(sentence2_tokens))
  #avoid division by zero when both sentences are empty after preprocessing
  if total_words == 0:
    return 0.0
  #jaccard similarity = shared_words/total_words
  return shared_words/total_words
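As a quick sanity check, the same computation can be traced by hand on two toy token lists. The tokens below are made up for illustration and are assumed to be already preprocessed.

```python
# Hypothetical, already-preprocessed token lists
tokens_1 = ['cells', 'treated', 'drug']
tokens_2 = ['cells', 'treated', 'placebo']

shared = set(tokens_1) & set(tokens_2)  # {'cells', 'treated'}
total = set(tokens_1) | set(tokens_2)   # {'cells', 'treated', 'drug', 'placebo'}

jaccard = len(shared) / len(total)
print(jaccard)  # 2 shared words out of 4 distinct words = 0.5
```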

Add new features. You could try other features as well.

In [None]:
def compute_length_ratio(sentence1, sentence2):
  sentence1_tokens = preprocess_sentence(sentence1)
  sentence2_tokens = preprocess_sentence(sentence2)
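  #ratio of the shorter token count to the longer one, so the value lies in [0, 1]
  shorter, longer = sorted([len(sentence1_tokens), len(sentence2_tokens)])
  #avoid division by zero when both sentences are empty after preprocessing
  return shorter/longer if longer > 0 else 0.0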

In [None]:
def computed_shared_ngrams(sentence1, sentence2, n):
  sentence1_tokens = preprocess_sentence(sentence1)
  sentence2_tokens = preprocess_sentence(sentence2)

  #the tokens are joined into one string, so these are character n-grams (spaces included)
  sentence1_ngrams = set(nltk.ngrams(' '.join(sentence1_tokens), n))
  sentence2_ngrams = set(nltk.ngrams(' '.join(sentence2_tokens), n))
  return 1 - nltk.jaccard_distance(sentence1_ngrams, sentence2_ngrams)
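Because the function above passes a joined string to `nltk.ngrams`, it compares character n-grams rather than word n-grams. The sketch below reproduces that idea in plain Python; `char_ngrams` is a hypothetical helper and it builds substrings, whereas `nltk.ngrams` yields tuples of characters, but the resulting overlap should be the same.

```python
def char_ngrams(text, n):
    """Set of all length-n character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

bigrams_1 = char_ngrams('feature', 2)   # fe, ea, at, tu, ur, re
bigrams_2 = char_ngrams('features', 2)  # the same six plus 'es'

# Jaccard similarity of the two bigram sets
overlap = len(bigrams_1 & bigrams_2) / len(bigrams_1 | bigrams_2)
print(sorted(bigrams_1))
print(overlap)  # 6 shared bigrams out of 7 distinct ones
```

Character n-grams give partial credit to near-matches such as singular and plural forms, which the word-level Jaccard feature treats as entirely different tokens.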

In [None]:
def compute_edit_distance(sentence1, sentence2):
  sentence1_tokens = preprocess_sentence(sentence1)
  sentence2_tokens = preprocess_sentence(sentence2)
  return nltk.edit_distance(' '.join(sentence1_tokens), ' '.join(sentence2_tokens))

Now we can compute the features for the training set and the test set.

In [None]:
def compute_features(instances):
  features = []
  #go through each row of the dataframe
  for _, row in instances.iterrows():
    row_feature = []
    #compute the five features
    row_feature.append(compute_shared_words(row.sentence_1, row.sentence_2))
    row_feature.append(compute_length_ratio(row.sentence_1, row.sentence_2))
    row_feature.append(compute_edit_distance(row.sentence_1, row.sentence_2))
    row_feature.append(computed_shared_ngrams(row.sentence_1, row.sentence_2, 2))
    row_feature.append(computed_shared_ngrams(row.sentence_1, row.sentence_2, 3))
    features.append(np.array(row_feature))
  return np.array(features)

In [None]:
train_features = compute_features(train_df)
test_features = compute_features(test_df)

print(train_features.shape)
print(test_features.shape)

#train the linear regression model using the training data
reg = LinearRegression().fit(train_features, train_df['similarity'])

predictions = reg.predict(test_features)

print(predictions)

(80, 5)
(20, 5)
[2.35132634 2.10961077 1.00882604 1.14717013 2.08288977 1.99332065
 2.42909151 3.0762702  3.90012101 1.25958376 1.28381563 1.57025241
 1.29908133 1.11894975 1.01228613 2.13004045 0.89101831 2.50396893
 1.69758698 2.13110393]


# Evaluation

Pearson correlation is used to evaluate the performance of the model, as suggested by the dataset creators. It ranges from -1 to 1. A higher value indicates that the predictions agree more closely with the human annotations.
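For intuition, the coefficient can be computed directly from its definition: the covariance of the two lists divided by the product of their standard deviations. The sketch below uses a hypothetical helper `pearson` on made-up numbers; in the notebook, `scipy.stats.pearsonr` does this work.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(variance_x * variance_y)

increasing = pearson([1, 2, 3, 4], [2, 4, 6, 8])  # predictions rise with the labels
decreasing = pearson([1, 2, 3, 4], [8, 6, 4, 2])  # predictions fall as the labels rise
print(increasing, decreasing)
```

A value near 1 means the predictions rank and space the pairs much like the annotators did, a value near 0 means no linear relationship, and a negative value means the predictions move in the opposite direction.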

In [None]:
from scipy.stats import pearsonr

print (pearsonr(predictions, test_df['similarity'])[0])

0.7788680664542982


# Train an MLP model

In [None]:
from sklearn.neural_network import MLPRegressor

mlp = MLPRegressor(hidden_layer_sizes=(100,), max_iter=1000,
                   batch_size=20, random_state=0).fit(train_features, train_df['similarity'])

predictions = mlp.predict(test_features)

print (pearsonr(predictions, test_df['similarity'])[0])

0.6009781080964407


# Hyperparameter tuning

In [None]:
#try one hidden layer with 10, 20, ..., 200 units
for hidden_units in range(10, 201, 10):
  mlp = MLPRegressor(hidden_layer_sizes=(hidden_units,), max_iter=1000,
                   batch_size=20, random_state=0).fit(train_features, train_df['similarity'])

  predictions = mlp.predict(test_features)

  print (hidden_units, pearsonr(predictions, test_df['similarity'])[0])

10 0.41477545400640803
20 0.7030378975884407
30 0.6127807298214164
40 0.6543617015861044
50 0.6901274190444908
60 0.695676701666488
70 0.6825270694290609
80 0.6542659764767693
90 0.6352368701930838
100 0.6009781080964407
110 0.649257720519443
120 0.6618048646929301
130 0.6547502396317846
140 0.6173160490748243
150 0.6933434647930015
160 0.6307066582205453
170 0.6566601112592119
180 0.6424480052268968
190 0.7213345501514812
200 0.6762522177537691
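One caveat: the loop above scores every setting on the test set, so the best score it reports is likely an optimistic estimate. A common alternative is to choose the hyperparameter by cross-validation on the training data alone and to use the test set only once at the end. The sketch below illustrates this with synthetic data standing in for `train_features` and the similarity labels (an assumption, so the numbers are not comparable to the results above); it uses scikit-learn's `cross_val_score`, whose default score for a regressor is R^2 rather than Pearson correlation.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for the real training data (assumption: 80 rows and
# 5 features, matching the shapes printed earlier in the notebook)
rng = np.random.default_rng(0)
X_train = rng.random((80, 5))
y_train = 4 * X_train[:, 0] + rng.normal(0, 0.1, 80)

cv_scores = {}
for hidden_units in (10, 50, 100):
    mlp = MLPRegressor(hidden_layer_sizes=(hidden_units,), max_iter=1000,
                       batch_size=20, random_state=0)
    # 5-fold cross-validation on the training data only
    fold_scores = cross_val_score(mlp, X_train, y_train, cv=5)
    cv_scores[hidden_units] = fold_scores.mean()

# Keep the setting with the best mean validation score
best_units = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print(best_units)
```

With only 80 training pairs the fold scores will be noisy, so small differences between settings should not be over-interpreted.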


# Notes

We have finished the first set of examples of using deep learning models for text mining. As mentioned, this is a demonstration of a simple MLP model on a very small dataset. However, the full pipeline still holds, and you are encouraged to apply it to larger datasets and more challenging problems for practice.