## Visualize GTST recaps

In this notebook, we demonstrate how to use WizMap to visualize a text dataset.

In [1]:
import pandas as pd
from umap import UMAP
from sentence_transformers import SentenceTransformer
import wizmap

import tensorflow_text
import tensorflow_hub as hub

import plotly.express as px
from matplotlib import pyplot as plt




In [2]:
#### Load the pre-trained embedding model --> 7 min
model_robbert_v2 = SentenceTransformer('jegorkitskerkin/robbert-v2-dutch-base-mqa-finetuned')


In [4]:
test = ['ik ben een testzin', 'dit is een andere testzin', 'dit is een derde testzin']
model_robbert_v2.encode(test)

array([[ 0.8653715 ,  0.03769124, -0.5125764 , ..., -0.13767365,
        -0.16036746, -0.8000312 ],
       [ 0.5744579 , -0.02723801, -0.1616888 , ..., -0.0946954 ,
        -0.12979764, -0.13668044],
       [ 0.44140196,  0.14330591,  0.09215166, ..., -0.16832323,
        -0.10179315, -0.12682426]], dtype=float32)

In [3]:
#### Alternative model: load the Universal Sentence Encoder Multilingual
model_univ = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

In [5]:
test = ['ik ben een testzin', 'dit is een andere testzin', 'dit is een derde testzin']
model_univ(test)

<tf.Tensor: shape=(3, 512), dtype=float32, numpy=
array([[ 0.07495385, -0.02435048, -0.02693162, ..., -0.03103555,
         0.01695868,  0.00593072],
       [-0.00305933,  0.03385365,  0.00483184, ..., -0.03993021,
        -0.03613787, -0.05969235],
       [-0.00483806,  0.00139287, -0.02006915, ...,  0.01503419,
        -0.01113884, -0.0462756 ]], dtype=float32)>
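
A quick way to sanity-check embeddings like the ones printed above is to compare sentences pairwise with cosine similarity. The sketch below uses made-up three-dimensional vectors rather than real model output, and the helper name `cosine_similarity` is an assumption for illustration; with real embeddings you would pass rows of the encoded array instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors standing in for real sentence embeddings.
toy_embeddings = {
    "sentence_a": [0.9, 0.1, -0.5],
    "sentence_b": [0.6, 0.0, -0.2],
    "sentence_c": [-0.4, 0.8, 0.3],
}

names = list(toy_embeddings)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        score = cosine_similarity(toy_embeddings[first], toy_embeddings[second])
        print(f"{first} vs {second}: {score:.3f}")
```

Scores close to 1 indicate vectors pointing in nearly the same direction; scores near 0 or below indicate unrelated or opposing directions.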

## 1. Extract Embeddings

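We use [Sentence Transformers](https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models/) to extract embeddings of the GTST recaps. Instead of a small general-purpose model such as `all-MiniLM-L6-v2`, this notebook uses the Dutch model `jegorkitskerkin/robbert-v2-dutch-base-mqa-finetuned` loaded above, with the Universal Sentence Encoder Multilingual as an alternative.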

See also https://pypi.org/project/sentence-transformers/

In [6]:
# Load the GTST dataset
GTST = pd.read_csv('GTST_Daily_data.csv')
GTST

Unnamed: 0,text_of_1month,datums,datums2
0,Arnie Alberts' wereld van rozegeur en manesch...,Maandag 1 oktober 1990,1 10 1990
1,Laura Alberts verwerkt de scheiding van haar ...,Dinsdag 2 oktober 1990,2 10 1990
2,Linda wil hogerop in het modellenvak. De ruzi...,Woensdag 3 oktober 1990,3 10 1990
3,Annette ontmoet haar vroegere schoolvriendin ...,Donderdag 4 oktober 1990,4 10 1990
4,Helen beschuldigt Peter van diefstal. De sche...,Vrijdag 5 oktober 1990,5 10 1990
...,...,...,...
4085,Janine kan niet geloven wat ze ontdekt heeft....,Vrijdag 25 mei 2012,25 5 2012
4086,"Janine en Ludo zouden gelukkig moeten zijn, m...",Maandag 28 mei 2012,28 5 2012
4087,Amy probeert Nina in de val te lokken. Als Yv...,Dinsdag 29 mei 2012,29 5 2012
4088,Bing liegt tegen Jef over wat er speelt met J...,Woensdag 30 mei 2012,30 5 2012


In [7]:
GTST_texts = GTST['text_of_1month'].values.tolist()
GTST_texts[9]

" Peter confronteert zijn ouders met hun lang stilgehouden geheim. Jan neemt een rigoureus besluit, zonder dit aan Petra te vertellen. De relatie van Simon en Brigitte verloopt, tot Linda's wanhoop, voorspoedig. "

In [8]:
### Take only the first 10 elements of GTST_texts as a test sample
sample_text = GTST_texts[:10]

#### Create text 

Example call on a sentence_transformers model:

```model.encode(['Hallo, hoe gaat het?'])```

In [9]:
## Encode all GTST recaps with the robbert_v2 model --> 7:30 min.
BATCH_SIZE = 128
embeddings_rbv2 = model_robbert_v2.encode(GTST_texts, batch_size=BATCH_SIZE, show_progress_bar=True)
embeddings_rbv2.shape

Batches: 100%|██████████| 32/32 [07:28<00:00, 14.02s/it]


(4090, 768)
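
The progress bar reports 32 batches because 4090 texts are split into chunks of `BATCH_SIZE = 128`. The sketch below reproduces that arithmetic with placeholder strings. The helper `iter_batches` is a hypothetical name introduced for illustration, not part of sentence_transformers.

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


n_texts = 4090      # number of recaps, taken from the shape printed above
batch_size = 128    # same value as BATCH_SIZE in the notebook

n_batches = math.ceil(n_texts / batch_size)
print(n_batches)

# Placeholder strings stand in for the real recaps.
placeholder_texts = [f"recap {i}" for i in range(n_texts)]
batch_lengths = [len(batch) for batch in iter_batches(placeholder_texts, batch_size)]
print(len(batch_lengths), batch_lengths[-1])
```

The last batch is smaller than the others because 4090 is not a multiple of 128. A helper like this could also be used to feed texts to a model in chunks when encoding everything in one call uses too much memory.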

In [11]:
## Encode all GTST recaps with the Universal Sentence Encoder Multilingual; this runs much faster (the dimension is lower though: 512 instead of 768)
embeddings_univ = model_univ(GTST_texts)
embeddings_univ.shape

TensorShape([4090, 512])

## 2. Dimensionality Reduction

Then, we apply a dimensionality reduction technique (e.g., UMAP, t-SNE, PCA) to project the embeddings from their high-dimensional space (768 dimensions for RobBERT v2, 512 for the Universal Sentence Encoder) into a 2D space. Here we use UMAP, but any dimensionality reduction technique works.

To keep the notebook's running time short, we use UMAP's default parameters, apart from setting the metric to cosine. However, it is good practice to tune the parameters when you use WizMap on your own dataset.
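
As a starting point for tuning, one could sweep over UMAP's `n_neighbors` and `min_dist` parameters and compare the resulting plots by eye. The sketch below only builds the parameter grid. The candidate values and the fixed `random_state` are illustrative assumptions, not tuned recommendations.

```python
from itertools import product

# Candidate values are illustrative assumptions, not tuned recommendations.
n_neighbors_options = [5, 15, 50]   # 15 is UMAP's default
min_dist_options = [0.0, 0.1, 0.5]  # 0.1 is UMAP's default

param_grid = [
    {"n_neighbors": n, "min_dist": d, "metric": "cosine", "random_state": 42}
    for n, d in product(n_neighbors_options, min_dist_options)
]

for params in param_grid:
    # In the notebook, each setting could be applied as:
    #   embeddings_2d = UMAP(**params).fit_transform(embeddings_rbv2)
    print(params)
```

Smaller `n_neighbors` values emphasize local structure, while larger values preserve more of the global layout; `min_dist` controls how tightly points are packed in the projection.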

In [12]:
######## UMAP on robbert v2 model --> 30 sec
reducer = UMAP(metric='cosine')
embeddings_2d = reducer.fit_transform(embeddings_rbv2)

### transform to dataframe
df_embeddings = pd.DataFrame(embeddings_2d, columns=['x', 'y'])

### create a new column with the text
df_embeddings['text'] = GTST_texts

In [13]:
df_embeddings.sample(5)

Unnamed: 0,x,y,text
1948,10.375008,-0.786545,Laura probeert spullen van de asielzoekers te...
1913,11.491228,-1.594548,Benjamin moedigt Harmsen aan om zijn pistool ...
2547,8.796644,-1.615356,Janine weigert een dokter bij Nina te laten: z...
2802,10.457141,-1.904461,Een verliefde Ludo vraagt of Janine weer bij ...
92,10.974566,0.255857,Myriam kan nog maar net voorkomen dat ze betr...


In [14]:
######## UMAP on universal sentence encoder multilingual --> 19 sec
reducer = UMAP(metric='cosine')
embeddings_uv_2d = reducer.fit_transform(embeddings_univ)

### transform to dataframe
df_embeddings_uv = pd.DataFrame(embeddings_uv_2d, columns=['x', 'y'])

### create a new column with the text
df_embeddings_uv['text'] = GTST_texts

In [15]:
df_embeddings_uv.sample(5)

Unnamed: 0,x,y,text
3342,8.538155,4.659112,"Janine krijgt hulp uit onverwachte hoek, maar..."
2926,3.797188,2.61232,Sjors wil weten of Bing oprecht gevoelens hee...
2242,6.29725,1.100445,Remco troeft Janine af door Simon doodleuk me...
144,11.225287,2.541335,Peter ontmaskert de dader van de overval op H...
1196,8.468872,4.353613,Jef probeert Sylvia over te halen om iets teg...


In [17]:
### Create a scatter plot with plotly
fig = px.scatter(df_embeddings, x='x', y='y', hover_data=['text'], height=800, width=800)

### set title
fig.update_layout(title='UMAP projection of the GTST recaps with robbert v2 model')
fig

In [18]:
### Create a scatter plot with plotly
fig = px.scatter(df_embeddings_uv, x='x', y='y', hover_data=['text'], height=800, width=800)
### set title
fig.update_layout(title='UMAP projection of the GTST recaps with universal sentence encoder multilingual')
fig

## 3. Generate Two JSON Files for WizMap

To use WizMap on your embeddings, you need to generate two JSON files.

- One JSON file encodes the contour plot and multi-level summaries.
- The other JSON file encodes the raw data (the GTST recaps in this example).

Fortunately, the `WizMap` Python library makes it extremely easy to generate these two files. 
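
The raw data file uses the NDJSON format (newline-delimited JSON): one JSON document per line, which lets a reader process it line by line. The sketch below illustrates the format itself with the standard library. The record fields are hypothetical; the real schema is whatever `wizmap.generate_data_list` produces.

```python
import json

# Hypothetical records; the real schema is whatever WizMap generates.
records = [
    {"x": 10.37, "y": -0.79, "text": "first recap"},
    {"x": 8.80, "y": -1.62, "text": "second recap"},
]


def to_ndjson(rows):
    """Serialize rows as NDJSON: one JSON document per line."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows) + "\n"


def from_ndjson(text):
    """Parse NDJSON text back into a list, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


ndjson_text = to_ndjson(records)
print(ndjson_text)
print(from_ndjson(ndjson_text) == records)
```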

In [18]:
xs = embeddings_uv_2d[:, 0].astype(float).tolist()
ys = embeddings_uv_2d[:, 1].astype(float).tolist()
texts = GTST_texts

In [19]:
data_list = wizmap.generate_data_list(xs, ys, texts)
grid_dict = wizmap.generate_grid_dict(xs, ys, texts, 'GTST recaps')

Start generating data list...
Start generating contours...
Start generating multi-level summaries...


4090it [00:00, 90536.65it/s]
100%|██████████| 6/6 [00:05<00:00,  1.13it/s]


In [20]:
# Save the JSON files
wizmap.save_json_files(data_list, grid_dict, output_dir='./')

## 4. Host JSON Files and Display WizMap

After generating these two JSON files (one with the `.json` extension and one with `.ndjson`), you need to host them somewhere reachable over the network so that you can provide two URLs to WizMap.

Depending on your needs, there are many options to store the files.

1. **Local host**. If you are running WizMap on your local machine, you can simply start a local server and use ‘local host’ URLs to send your JSON files to WizMap. 
2. **Static website hosting service** (e.g., GitHub Pages, Vercel, Hugging Face). You can use many free website hosting services to host your JSON files. A limitation is that these services usually have file size limits. For example, you can only include files that are smaller than 100MB in GitHub.
3. **Cloud storage** (e.g., AWS S3, Cloudflare R2). The most general option is to put the JSON files on a cloud storage site. There is no size limit, but you might need to pay for the service.
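
For the local host option, a minimal sketch using only Python's standard library is shown below. It serves the current directory on a free local port and adds an `Access-Control-Allow-Origin` header, on the assumption that the page displaying WizMap is served from a different origin and therefore needs CORS to fetch the files. The names `CORSRequestHandler` and `start_server` are hypothetical.

```python
import threading
import urllib.request
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


class CORSRequestHandler(SimpleHTTPRequestHandler):
    """Static file handler that lets pages from other origins fetch the files."""

    def end_headers(self):
        self.send_header("Access-Control-Allow-Origin", "*")
        super().end_headers()

    def log_message(self, format, *args):
        pass  # keep the notebook output quiet


def start_server(directory=".", port=0):
    """Serve `directory` on localhost in a background thread (port 0 picks a free port)."""
    handler = partial(CORSRequestHandler, directory=directory)
    server = ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


server = start_server(".")
base_url = f"http://127.0.0.1:{server.server_address[1]}"
print(f"data_url = {base_url}/data.ndjson")
print(f"grid_url = {base_url}/grid.json")

# Quick self-check: request the directory index and show the CORS header.
with urllib.request.urlopen(base_url + "/") as response:
    print(response.status, response.headers["Access-Control-Allow-Origin"])
```

Call `server.shutdown()` when you are done. The printed URLs only resolve if `data.ndjson` and `grid.json` exist in the served directory.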


Here, we host `data.ndjson` and `grid.json` on GitHub Pages.


In [19]:
data_url = 'https://longhowlam.github.io/data.ndjson'
grid_url = 'https://longhowlam.github.io/grid.json'

In [20]:
# Display wizmap
wizmap.visualize(data_url, grid_url, height=700)