We need to build a model that clusters questions in a given dataset and is able to classify new ones.
A priori, there are two directions we could take:

* Use some kind of topic modeling method to decide what the texts are talking about, then decide which topics fit new questions.
* Use a totally unsupervised clustering method to define clusters, then use them to classify new questions.

Which method to use will depend on how the questions in the dataset look, since topic modeling only makes sense for moderately long texts. Let's load the training questions and take a look at them.

In [1]:
questions_file = 'nlp_test.csv'

with open(questions_file) as file:
    questions = file.readlines()

print(questions[:20])

['"question"\n', 'In what ways did OneEleven not help you? How can we improve?                                                                                                                                                                                                                                                                                                                                                                                                                                                        \n', 'Which is your primary time zone?                                                                                                                                                                                                                                                                                                                                                                                                                                                            

We can make some relevant observations:

* The first row is a column title that should be removed.
* Each question is padded with trailing whitespace.
* There are questions in different languages.

Let's remove the column title and the trailing whitespace, and look at the length of the questions in more detail.

In [2]:
questions = [q.strip() for q in questions[1:]]
qs_len = [len(q) for q in questions]

print("Min len: {}".format(min(qs_len)))
print("Max len: {}".format(max(qs_len)))
print("Avg len: {}".format(sum(qs_len)/float(len(qs_len))))

Min len: 0
Max len: 527
Avg len: 96.9002


There is an empty question! Let's remove it.

In [3]:
questions = [q for q in questions if len(q)>0]

Summarizing, we have 100k short questions. Their length (about 97 characters on average, and at most 527) rules out topic modeling: the texts are too short for it. Therefore, we will stick to a clustering method. To do this, we need a way of encoding the texts as numbers that can be fed to an ML algorithm. In other words, we need to decide which embedding method we are going to use.
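
Character counts alone can hide how short the texts are in words, which is what matters for topic modeling. The following is a minimal sketch of a slightly richer length summary; the helper name `length_summary` and the three sample strings are illustrative assumptions, and in the notebook the function would be applied to the `questions` list.

```python
import statistics

def length_summary(texts):
    """Return character- and word-length statistics for a list of strings."""
    char_lens = [len(t) for t in texts]
    word_lens = [len(t.split()) for t in texts]
    return {
        "median_chars": statistics.median(char_lens),
        "max_chars": max(char_lens),
        "median_words": statistics.median(word_lens),
        "max_words": max(word_lens),
    }

# Hypothetical sample; in the notebook this would be the `questions` list.
sample = [
    "Which is your primary time zone?",
    "In what ways did OneEleven not help you? How can we improve?",
    "where do you live?",
]
summary = length_summary(sample)
print(summary)
```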

A first question is how to handle multilinguality. If we use different encodings for different languages, the clusters will almost certainly be divided by language. I will assume this is not desirable, so I will use a multilingual encoding that uses the same representation for all languages.
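
A shared representation means that a question and its translation should end up close together in the vector space, which is usually measured with cosine similarity. The sketch below only illustrates that check: the three-dimensional vectors are made up and stand in for real sentence embeddings, and `cosine_similarity` is a hypothetical helper, not part of any library used here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors standing in for real sentence embeddings.
rate_en = [0.9, 0.1, 0.2]  # "rate our services from 0 to 10"
rate_es = [0.8, 0.2, 0.2]  # "califica nuestros servicios del 0 al 10"
live_en = [0.1, 0.9, 0.3]  # "where do you live?"

same_meaning = cosine_similarity(rate_en, rate_es)
different_meaning = cosine_similarity(rate_en, live_en)
print(same_meaning > different_meaning)
```

With a real multilingual encoder, the same comparison would be run on the encoder's output vectors instead of the toy ones.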

One option is to use a language model such as BERT. These models are actually trained to perform different tasks, but embeddings can be obtained from their internal representation. For example, the library 'BERT-as-service' (https://github.com/hanxiao/bert-as-service) provides vector representations from BERT embeddings.

However, there are two issues that we should take into account with this approach:

* Normally, word- or token-based embeddings are not optimal for embedding whole sentences like the ones we have. A solution is to use the token embeddings as a base to fine-tune a model for sentences (as in https://github.com/UKPLab/sentence-transformers). However, while 100k questions are enough to build a clustering model (or, for example, a classification one), they are not enough to train an embedding, and at the time of writing there are no multilingual fine-tuned models available.
* Using very large models such as BERT is slow and expensive. For example, BERT-as-service simply does not run on my CPU. We could use a platform such as Colab for training, but ultimately we need the model to encode new sentences for the HTTP service.

For these reasons, I decided to use an encoder that is multilingual out of the box and trained on short sentences like the ones we have: Google's Multilingual Universal Sentence Encoder, released in July 2019 (https://ai.googleblog.com/2019/07/multilingual-universal-sentence-encoder.html).

In [4]:
import tensorflow_hub as hub
import tensorflow_text

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")


Now we need to decide which clustering method we are going to use. There are two salient options for clustering text: HDBSCAN and k-Means. The advantage of the first method is that it does not need to know the number of clusters a priori. However, I will use k-Means, since it is simple, fast on a dataset of this size, and assigns every question to a cluster (HDBSCAN can leave points unassigned as noise).

We also need to decide which framework/library we are going to use. We are using TensorFlow to obtain the embeddings, so it would make sense to implement k-Means in it and use the embeddings directly. However, after some experimentation and research I discovered that the methods to load the multilingual embeddings in estimators are not yet fully implemented (see, for example, this issue: https://github.com/tensorflow/hub/issues/420). For simplicity, I decided to obtain the embeddings first and then train the models separately, using scikit-learn.

Now we compute the embedding for each sentence. The Universal Sentence Encoder already performs the preprocessing, and prefers sentences as raw as possible. We do it in batches to avoid memory problems.

In [None]:
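# Compute embeddings in batches of 1000 questions to keep memory usage bounded.
batch_size = 1000
qs_embeds = []
for start in range(0, len(questions), batch_size):
    batch_idx = start // batch_size
    if batch_idx % 10 == 0:
        print(batch_idx)  # progress indicator, printed every 10 batches
    # Slicing past the end is safe, and no empty batch is ever sent to the encoder.
    qs_embeds.extend(embed(questions[start:start + batch_size]).numpy())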


0
10
20
30
40
50
60
70
80


Next we are going to train the k-Means model. In many cases a preliminary dimensionality reduction step helps find better clusters (and makes the data more manageable). The sentence vectors produced by the encoder are already fairly compact (512 dimensions), so I am going to skip this step here, but it may be worth checking whether it yields better results. PCA is a natural candidate for this reduction; t-SNE is another option, although it is mostly used for visualization.
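
If one did want to try the reduction step, a minimal sketch with scikit-learn's `PCA` could look like the following. The random matrix stands in for the real embeddings, and the choice of 50 components is an arbitrary assumption for illustration, not a tuned value.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random data standing in for the (n_questions, 512) embedding matrix.
rng = np.random.default_rng(0)
fake_embeds = rng.normal(size=(1000, 512))

# Keep 50 components; this number is an arbitrary choice for illustration.
pca = PCA(n_components=50, random_state=0)
reduced = pca.fit_transform(fake_embeds)

print(reduced.shape)
print("Variance retained: %0.3f" % pca.explained_variance_ratio_.sum())
```

The retained variance gives a rough idea of how much information the reduction discards before the clusters are recomputed on `reduced`.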

As noted above, we do not know a reasonable number of clusters a priori. This will most likely depend on what the results will be used for. For now, we will make a quick analysis using the Silhouette Score, which does not require a labeled dataset, to see which number of clusters works better.

In [None]:
from sklearn.cluster import KMeans
from sklearn import metrics

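def train_evaluate_k_means(klusters):
    """Fit k-Means with `klusters` clusters and print its Silhouette Coefficient."""
    km = KMeans(n_clusters=klusters, init='k-means++', max_iter=100, n_init=1,
                verbose=False)
    print("%s clusters" % klusters)
    km.fit(qs_embeds)
    # The score is estimated on a random sample of 10,000 points to keep it fast.
    print("Silhouette Coefficient: %0.3f"
          % metrics.silhouette_score(qs_embeds, km.labels_, sample_size=10000))
    print()
    return km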

In [None]:
for k in [25,50,100,150,200,300]:
    train_evaluate_k_means(k)

The results are not amazing so far, at least according to the Silhouette Score, which ranges from -1 to 1. However, we need to remember that text vectors are a very special type of input, and this particular measure may well not be meaningful in this case. Measuring the quality of clusters of complex data is a difficult task, especially in a totally unsupervised setting. A more suitable evaluation would come from using the clusters in a downstream task.
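
For reference, the Silhouette Coefficient of a single point is (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is its mean distance to the points of the nearest other cluster. The following is a minimal pure-Python sketch of that definition on made-up one-dimensional data, not the scikit-learn implementation; the function name `silhouette_1d` and the toy points are illustrative assumptions, and the sketch assumes at least two clusters.

```python
def silhouette_1d(points, labels):
    """Mean silhouette coefficient for 1-D points, using absolute distance."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # Distances from point i to every other point, grouped by cluster.
        by_cluster = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(other, []).append(abs(p - q))
        own = by_cluster.pop(label, [])
        if not own:
            scores.append(0.0)  # usual convention for a single-point cluster
            continue
        a = sum(own) / len(own)                                # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())  # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [0, 1, 10, 11]
tight = silhouette_1d(points, [0, 0, 1, 1])  # clusters match the two groups
mixed = silhouette_1d(points, [0, 1, 0, 1])  # clusters cut across the groups
print(round(tight, 3), round(mixed, 3))
```

A labeling that matches the two groups scores close to 1, while one that cuts across them scores below 0, which is the sense in which values near 0 indicate overlapping clusters.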

We will keep the number of clusters that gave the best results when we tried: 25.
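
The choice of k can also be made programmatically by keeping the scores and taking the maximum. A minimal sketch, assuming synthetic two-dimensional blobs from `make_blobs` in place of the real embeddings and a small, arbitrary list of candidate values:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 well-separated groups, standing in for the embeddings.
X, _ = make_blobs(n_samples=300, centers=[(0, 0), (10, 10), (-10, 10)],
                  cluster_std=0.5, random_state=0)

scores = {}
for k in [2, 3, 4, 5]:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = silhouette_score(X, model.labels_)

# Pick the candidate with the highest Silhouette Coefficient.
best_k = max(scores, key=scores.get)
print(scores)
print("Best k:", best_k)
```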

In [None]:
km = train_evaluate_k_means(25)
y_kmeans = km.predict(qs_embeds)

Now we have the questions clustered, but we still need a way to classify new data without re-running k-Means at inference time. A fitted k-Means model could assign a new point to its nearest centroid, but here we will train a separate classifier using the clusters found by k-Means as labels. First let us save the questions along with their clusters, which will be useful when serving the model to find other questions in the same cluster.
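
As a point of comparison with the classifier approach, the assignment rule of k-Means itself is simple: a point belongs to the cluster whose centroid is nearest. The sketch below shows that rule with made-up two-dimensional centroids; `nearest_centroid` is a hypothetical helper, and in the notebook the centroids would come from `km.cluster_centers_`.

```python
import math

def nearest_centroid(vector, centroids):
    """Return the index of the centroid closest to `vector` (Euclidean)."""
    distances = [math.dist(vector, c) for c in centroids]
    return distances.index(min(distances))

# Toy centroids standing in for the fitted cluster centers.
toy_centroids = [(0.0, 0.0), (5.0, 5.0)]

label_a = nearest_centroid((1.0, 0.5), toy_centroids)
label_b = nearest_centroid((4.0, 6.0), toy_centroids)
print(label_a, label_b)
```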

In [None]:
import pandas as pd
labeled_questions = pd.DataFrame()
labeled_questions['question'] = questions
labeled_questions['label'] = y_kmeans

labeled_questions.to_csv('clusters.csv')
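
At serving time, finding other questions in a cluster amounts to a lookup from label to questions. A minimal sketch of such an index follows; the helper names and the toy data are illustrative assumptions, and in practice the pairs would be read back from `clusters.csv`.

```python
from collections import defaultdict

def build_cluster_index(questions, labels):
    """Map each cluster label to the list of questions assigned to it."""
    index = defaultdict(list)
    for question, label in zip(questions, labels):
        index[label].append(question)
    return index

def questions_in_cluster(index, label, limit=5):
    """Return up to `limit` questions from the given cluster."""
    return index.get(label, [])[:limit]

# Toy data standing in for the saved (question, label) pairs.
toy_questions = ["where do you live?", "what country are you in?",
                 "did you like our product?"]
toy_labels = [3, 3, 8]

index = build_cluster_index(toy_questions, toy_labels)
print(questions_in_cluster(index, 3))
```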

We will test two different classifiers to see which one gives better results: a Random Forest and a Multi-layer Perceptron.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier 

X_train, X_test, y_train, y_test = train_test_split(qs_embeds, y_kmeans, test_size=0.2, random_state=42)

mlp = MLPClassifier(solver='lbfgs', alpha=1e-5, random_state=1)
mlp.fit(X_train, y_train)
print("Score of the Multi-layer perceptron classifier on the test set: {}".format(mlp.score(X_test, y_test)))

rf = RandomForestClassifier(random_state=0)
rf.fit(X_train, y_train)
print("Score of the Random Forest classifier on the test set: {}".format(rf.score(X_test, y_test)))


The MLP wins! The good results also suggest that the clusters are reasonably well separated in the embedding space. Let's train the model on the full data.
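
One way to put a classifier score in context is to compare it with the accuracy of always predicting the largest cluster, since cluster sizes can be very uneven. A minimal sketch of that baseline, with a made-up label list; in the notebook the input would be `y_test`, and `majority_baseline_accuracy` is a hypothetical helper.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Made-up labels: three of the five items belong to cluster 0.
baseline = majority_baseline_accuracy([0, 0, 0, 1, 2])
print(baseline)
```

A classifier score well above this baseline is a stronger sign that the clusters are learnable from the embeddings.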

In [None]:
mlp.fit(qs_embeds, y_kmeans)

Now we can try to classify some new questions.

In [None]:
new_questions = ["cuantos años tienes?", "where do you live?", "what country are you in?", "rate our services from 0 to 10", "califica nuestros servicios del 0 al 10", "are you satisfied with our services?", "did you like our product?"]
mlp.predict(embed(new_questions).numpy())  # convert the tensor to a NumPy array for scikit-learn

Interesting results. The classifier selected the same cluster for the following pairs:
("where do you live?", "what country are you in?"), ("rate our services from 0 to 10", "califica nuestros servicios del 0 al 10") and ("are you satisfied with our services?", "did you like our product?").

We are satisfied with these results, so we will save the model.

In [None]:
from joblib import dump, load
dump(mlp, 'model.joblib') 

Finally, let's try to visualize the clusters we obtained. To plot our points we first need to reduce their dimensionality to 2, which we will do with t-SNE.

In [None]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

tsne_init = 'pca'
tsne_perplexity = 20.0
tsne_early_exaggeration = 4.0
tsne_learning_rate = 1000
random_state = 1
model = TSNE(n_components=2, random_state=random_state, init=tsne_init, perplexity=tsne_perplexity,
         early_exaggeration=tsne_early_exaggeration, learning_rate=tsne_learning_rate)

transformed_points = model.fit_transform(qs_embeds[:200])

plt.scatter(transformed_points[:, 0], transformed_points[:, 1], c=y_kmeans[:200], s=50, cmap='viridis')
plt.show()