# Clustering the most frequent questions with Sentence Transformers

References:
- https://www.sbert.net
- https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/clustering/fast_clustering.py
- https://www.sbert.net/docs/pretrained_models.html


In [11]:
import numpy as np
import pandas as pd
import os
import time

In [12]:
from google.colab import drive
drive.mount('/content/drive')

FOLDER_BASE = './drive/MyDrive/meetups/festivales/'
filename = os.path.join(FOLDER_BASE, 'ejemplos-contacto.txt')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [13]:
# Malformed rows raise an error by default; `error_bad_lines` is deprecated in recent pandas, so it is omitted.
df = pd.read_csv(filename, sep=",")

In [14]:
# df.tail(5)
df['asunto']

0                             cambiar dni de la entrada
1                                            devolución
2                            mi perro rompió la entrada
3                         quería modificar los asientos
4                                puedo cambiar la fecha
5                          sobre el saldo de la pulsera
6                                descargar las entradas
7                              no puedo ir al concierto
8                                          aplazamiento
9                                  factura de la compra
10                                cambiar número de dni
11                                 recuperar la entrada
12                                      cambiar entrada
13                                   cambio de asientos
14                                     devolver entrada
15                                     cambio de butaca
16                                cambiar a entrada VIP
17                                  descarga de 

In [6]:
!pip install sentence_transformers

Collecting sentence_transformers
  Downloading sentence-transformers-2.1.0.tar.gz (78 kB)
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.12.4-py3-none-any.whl (3.1 MB)
Collecting tokenizers>=0.10.3
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting huggingface-hub
  Downloading huggingface_hub-0.1.2-py3-none-any.whl (59 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_

In [7]:
from sentence_transformers import SentenceTransformer, util

In [8]:
# Alternative models: 'distilbert-base-nli-mean-tokens', 'paraphrase-multilingual-MiniLM-L12-v2'
model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')
# Pass a plain list of strings rather than a pandas Series
embeddings = model.encode(df['asunto'].tolist(), show_progress_bar=True)


In [15]:
corpus_sentences = list(df['asunto'])
print("Creating embeddings")
corpus_embeddings = model.encode(corpus_sentences, batch_size=64, show_progress_bar=True, convert_to_tensor=True)

print("Starting clustering")
start_time = time.time()

# Two parameters to tune:
# min_community_size: only keep clusters with at least this many elements (2 here)
# threshold: treat sentence pairs with a cosine similarity above this value as similar
clusters = util.community_detection(corpus_embeddings, threshold=0.8, min_community_size=2, init_max_size=len(corpus_embeddings))

print("Clustering done after {:.2f} sec".format(time.time() - start_time))

# Print each cluster and count how many sentences were assigned to one
total = 0
for i, cluster in enumerate(clusters):
    print("\nCluster {}, #{} items ".format(i+1, len(cluster)))
    total += len(cluster)
    for sentence_id in cluster:
        print("\t", corpus_sentences[sentence_id])

print('Total clustered: ' + str(total))

Creating embeddings

Starting clustering
Clustering done after 0.00 sec

Cluster 1, #5 items 
	 cambiar fecha
	 puedo cambiar la fecha
	 urgente cambiar fecha
	 cambiar entradas de fecha
	 cambiar número de dni

Cluster 2, #4 items 
	 mi entrada VIP
	 compra entrada VIP
	 cambiar a entrada VIP
	 entrada + mesa VIP

Cluster 3, #3 items 
	 quería modificar los asientos
	 cambiar los asientos
	 cambio de asientos

Cluster 4, #3 items 
	 devolver entrada
	 recuperar la entrada
	 devolución

Cluster 5, #3 items 
	 no encuentro mi entrada
	 urgente no encuentro mi entrada
	 he perdido la entrada

Cluster 6, #3 items 
	 hay asientos para personas con movilidad reducida
	 consulta asientos para movilidad reducida
	 entradas para personas con movilidad reducida

Cluster 7, #2 items 
	 sobre el saldo de la pulsera
	 saldo de la pulsera

Cluster 8, #2 items 
	 no puedo ir al concierto
	 no puedo asistir

Cluster 9, #2 items 
	 autorización para menores
	 autorización para hijos

Cluster 10, #2 items 
	 mi perro se c
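
The `threshold` and `min_community_size` arguments passed to `util.community_detection` decide which sentences end up grouped together. The sketch below is a simplified pure-Python illustration of that idea on toy 2-D vectors. It is not the library's implementation, and the names `cosine`, `greedy_communities` and `toy_vectors` are hypothetical, introduced only for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_communities(vectors, threshold, min_size):
    """Simplified greedy grouping by cosine similarity (illustrative only)."""
    n = len(vectors)
    # Candidate community for each item: every item at least `threshold` similar to it.
    candidates = []
    for i in range(n):
        members = [j for j in range(n) if cosine(vectors[i], vectors[j]) >= threshold]
        if len(members) >= min_size:
            candidates.append(members)
    # Largest candidates first; each item may belong to only one community.
    candidates.sort(key=len, reverse=True)
    assigned, communities = set(), []
    for members in candidates:
        free = [j for j in members if j not in assigned]
        if len(free) >= min_size:
            communities.append(free)
            assigned.update(free)
    return communities


# Toy vectors (assumed data): two tight pairs plus one vector between them.
toy_vectors = [
    [1.0, 0.0],   # 0
    [0.95, 0.1],  # 1: close to 0
    [0.0, 1.0],   # 2
    [0.1, 0.9],   # 3: close to 2
    [0.7, 0.7],   # 4: halfway between both pairs
]

strict = greedy_communities(toy_vectors, threshold=0.95, min_size=2)
loose = greedy_communities(toy_vectors, threshold=0.6, min_size=2)
print(strict)  # [[0, 1], [2, 3]]
print(loose)   # [[0, 1, 2, 3, 4]]
```

With the strict threshold, vector 4 stays unclustered; with the loose one, it bridges both pairs and everything collapses into a single group. A similar effect may explain why a threshold that is too low can pull loosely related subjects into the same cluster.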

In [10]:
len(corpus_embeddings)

59
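
The corpus holds 59 sentences, but only those that land in a cluster are counted in the total printed by the clustering cell. A minimal sketch for finding the sentences left out is shown below. It assumes `clusters` is a list of lists of sentence indices, as iterated in the clustering cell; the helper `labels_from_clusters` and the toy data are hypothetical.

```python
def labels_from_clusters(clusters, corpus_size):
    """Return one label per corpus item; -1 marks items left out of every cluster."""
    labels = [-1] * corpus_size
    for cluster_id, cluster in enumerate(clusters):
        for sentence_id in cluster:
            labels[sentence_id] = cluster_id
    return labels


# Hypothetical toy result: 6 sentences, two clusters, sentences 2 and 5 unclustered.
example_clusters = [[0, 3], [1, 4]]
labels = labels_from_clusters(example_clusters, corpus_size=6)
unclustered = [i for i, label in enumerate(labels) if label == -1]

print(labels)       # [0, 1, -1, 0, 1, -1]
print(unclustered)  # [2, 5]
```

In the notebook, `labels_from_clusters(clusters, len(corpus_sentences))` could be stored as a new column of `df`, since `corpus_sentences` was built from `df['asunto']` in row order.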