# Author: Zhengxiang (Jack) Wang 
# Date: 2021-07-20
# GitHub: https://github.com/jaaack-wang 
# About: embedding visualization using paddlepaddle tool

# Overview

In previous notebooks, we learned how to [load pre-trained word embeddings in paddlenlp](https://colab.research.google.com/drive/1WSyYtDiHwXe4MFTwe_X6hQ5atBqNsFax?usp=sharing) and how to [calculate text cosine similarity in paddlenlp](https://colab.research.google.com/drive/1QYSJ3x6Ap5HG8O4R4yqAyw6iq18JahdO?usp=sharing). In this notebook, we will learn how to visualize word and text embeddings using [VisualDL](https://github.com/PaddlePaddle/VisualDL), PaddlePaddle's deep learning visualization toolkit. More concretely, we will visualize high-dimensional word embeddings in a 3-D coordinate system.

<br>

**Please note that, for this notebook, all the code should be executed on your local machine (e.g., in a terminal) unless you use Baidu's [AI Studio](https://aistudio.baidu.com/aistudio/index). Neither Colab nor Jupyter Notebook seems to work well with `visualdl`. <ins>If you do not want to download `paddlepaddle` and `paddlenlp`, that is fine too</ins>. This notebook will show you a general way to visualize word/text embeddings.**

<br>


<table align="right">
  <td>
    <a target="_blank" href="https://colab.research.google.com/drive/1B9pcYR9fVvmB1pPWiIqb0u_WmxlY--T8?usp=sharing"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" /> Run in Google Colab </a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/jaaack-wang"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" /> Author's GitHub </a>
  </td>
  <td>
    <a href="https://docs.google.com/uc?export=download&id=1B9pcYR9fVvmB1pPWiIqb0u_WmxlY"><img src="https://www.tensorflow.org/images/download_logo_32px.png" /> Download this notebook </a>
  </td>
</table> 


<br>


# Table of Contents
- [1. General use of VisualDL for embedding visualization](#1)
- [2. Visualizing word embeddings](#2)
  - [2.1 With `paddlepaddle` and `paddlenlp` installed](#2-1)
  - [2.2 Without `paddlepaddle` and `paddlenlp` installed](#2-2)
- [3. Visualizing sentence embeddings](#3)
  - [3.1 With `paddlepaddle` and `paddlenlp` installed](#3-1)
  - [3.2 Without `paddlepaddle` and `paddlenlp` installed](#3-2)
- [4. References](#4)

<a name="1"></a>
# 1. General use of VisualDL for embedding visualization

The general use of VisualDL can be found on its [GitHub project page](https://github.com/PaddlePaddle/VisualDL). 

<br>

For visualizing embeddings, 

  1. you first need to get the embeddings as `"embeddings"` as well as the corresponding words or texts as `"labels"`;
  2. then, you need to create a log file that stores the embeddings as follows:


```python
from visualdl import LogWriter

# placeholders: replace them with your own data before running
# labels: a list of the words or texts whose embeddings we will visualize
# embeddings: the embedding vectors for the labels, in the same order
with LogWriter(logdir='./embds_vdl') as writer:
    writer.add_embeddings(tag='A_Example', mat=[em for em in embeddings], metadata=labels)
```

   3. finally, you need to launch the VisualDL panel to see the visualized embeddings as illustrated [here](https://github.com/PaddlePaddle/VisualDL#2-launch-panel). There are three ways to do so:
    - If you are using Baidu's [AI Studio](https://aistudio.baidu.com/aistudio/index) and running a notebook there, you can launch the VisualDL panel by clicking the `"可视化"` ("Visualization") button on the left bar.
    - On your command line, run `visualdl --logdir dir`, where `dir` is the path to the directory where the log file created above is stored (this might not work if `visualdl` is not on your `$PATH`).
    - Run Python in your shell or terminal, and do the following:
    
```python
from visualdl.server import app
# replace "dir" with the path to the directory where the log file created above is stored
app.run(logdir="dir")
```

<br>

**Of course, we need to install `visualdl` first! (you will also need to have it installed for step 3)**

```
pip3 install --upgrade visualdl
```
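
Before logging, it can be worth checking that `labels` and `embeddings` line up one-to-one, since a mismatch is easy to introduce when slicing a vocabulary. The sketch below is a minimal, hypothetical helper (`check_embedding_inputs` and `log_embeddings` are names made up here, not part of VisualDL). It relies only on the `add_embeddings` call shown above and imports `visualdl` lazily, so the check itself runs with NumPy alone.

```python
import numpy as np


def check_embedding_inputs(labels, embeddings):
    """Return the embeddings as a 2-D float32 array after basic sanity checks."""
    mat = np.asarray(embeddings, dtype=np.float32)
    if mat.ndim != 2:
        raise ValueError(f'expected a 2-D matrix, got shape {mat.shape}')
    if len(labels) != mat.shape[0]:
        raise ValueError(f'{len(labels)} labels but {mat.shape[0]} embeddings')
    return mat


def log_embeddings(labels, embeddings, logdir='./embds_vdl', tag='A_Example'):
    """Validate the inputs, then write them with the same call as in step 2."""
    from visualdl import LogWriter  # imported here so the check above needs only numpy
    mat = check_embedding_inputs(labels, embeddings)
    with LogWriter(logdir=logdir) as writer:
        writer.add_embeddings(tag=tag, mat=[em for em in mat], metadata=list(labels))


# toy data: three made-up 4-dimensional vectors
toy_labels = ['cat', 'dog', 'car']
toy_embeddings = [[0.1, 0.2, 0.3, 0.4], [0.1, 0.25, 0.3, 0.35], [0.9, 0.1, 0.0, 0.2]]
toy_mat = check_embedding_inputs(toy_labels, toy_embeddings)
print(toy_mat.shape)  # (3, 4)
```

With `visualdl` installed, `log_embeddings(toy_labels, toy_embeddings)` would then write the toy data to `./embds_vdl`.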

<a name="2"></a>
# 2. Visualizing word embeddings

First, we need to load the pre-trained word embeddings available in `paddlenlp.embeddings.TokenEmbedding` and get the embeddings of the words or texts you want to visualize. You can refer to [loading pre-trained word embedding in paddlenlp.ipynb](https://colab.research.google.com/drive/1WSyYtDiHwXe4MFTwe_X6hQ5atBqNsFax?usp=sharing) and [calculating text cosine similarity in paddlenlp.ipynb](https://colab.research.google.com/drive/1QYSJ3x6Ap5HG8O4R4yqAyw6iq18JahdO?usp=sharing) to see how this can be done for words and for sentences, respectively.

<br>

To visualize pre-trained word embeddings that are not natively available in `paddlenlp` while still using `paddlenlp`, you can refer to [loading pre-trained word embedding in paddlenlp.ipynb](https://colab.research.google.com/drive/1WSyYtDiHwXe4MFTwe_X6hQ5atBqNsFax?usp=sharing) to learn how to manually load a pre-trained word embedding model.



<a name="2-1"></a>
### 2.1 With `paddlepaddle` and `paddlenlp` installed

  - First, make sure that you have `paddlepaddle` and `paddlenlp` installed by running the following commands:
  ```
  pip3 install --upgrade paddlepaddle
  pip3 install --upgrade paddlenlp
  ```

  <br>

  - In Python, load the word embedding model you want to visualize. Here, we will visualize the first 10,000 words of `glove.wiki2014-gigaword.target.word-word.dim50.en`. Only 10,000 word embeddings are visualized for ease of display.
  ```python
from paddlenlp.embeddings import TokenEmbedding
# load the model: glove.wiki2014-gigaword.target.word-word.dim50.en
# model size: 73.45 MB; vocab size: 400002
tk_embedding = TokenEmbedding(embedding_name="glove.wiki2014-gigaword.target.word-word.dim50.en")
'''Output:
[2021-07-20 19:55:56,510] [    INFO] - Loading token embedding...
[2021-07-20 19:55:57,498] [    INFO] - Finish loading embedding vector.
[2021-07-20 19:55:57,498] [    INFO] - Token Embedding info:             
Unknown index: 400000             
Unknown token: [UNK]             
Padding index: 400001             
Padding token: [PAD]             
Shape :[400002, 50]
'''
```

<br>

  - Then, let's get the first 10,000 words and their embeddings.
  ```python
words = tk_embedding.vocab.to_tokens(list(range(10000)))
words_ems = tk_embedding.search(words)
  ```

<br>

  - Then, create a log file to record the word embeddings using `visualdl.LogWriter`.
  ```python
from visualdl import LogWriter
# you can define the logdir and add the tag name in your way
with LogWriter(logdir='./embds_vdl') as writer:
    writer.add_embeddings(tag='glove_word_embds', mat=[em for em in words_ems], metadata=words)
  ```

<br>

  - Finally, let's visualize the word embeddings we just logged.
  ```python
from visualdl.server import app
app.run(logdir='./embds_vdl')
'''
output:
11324
VisualDL 2.2.0
* Running on http://localhost:8040/ (Press CTRL+C to quit)
127.0.0.1 - - [20/Jul/2021 20:05:02] "GET /alive HTTP/1.1" 204 -
Running VisualDL at http://localhost:8040/ (Press CTRL+C to quit)
Serving VisualDL on localhost; to expose to the network, use a proxy or pass --host 0.0.0.0'''
  ```

<br>

You will see something like the following when you copy and paste http://localhost:8040/ into your browser and click "A" in the loaded webpage. Successful!

<img src='https://drive.google.com/uc?export=view&id=19MaoquRipqV237ODbEVdvlyTzK5OnY8J' width="1000" height="600">


<a name="2-2"></a>

### 2.2 Without `paddlepaddle` and `paddlenlp` installed

  - For your convenience, I have already extracted 10,000 examples from `glove.6B.50d.txt`, downloaded from the [GloVe website](https://nlp.stanford.edu/projects/glove/). You can use the reduced version directly by clicking [here](https://drive.google.com/file/d/1wU0LLC3KcZleSsT-eRrYQ0puC1_ykXjJ/view?usp=sharing).

<br>

  - In Python, run the following code to load the words and their embeddings stored in the [glove.6B.50d_reduced.txt](https://drive.google.com/file/d/1wU0LLC3KcZleSsT-eRrYQ0puC1_ykXjJ/view?usp=sharing) file that you just downloaded.
  ```python
  # we will need numpy to convert the embeddings (stored as str) into numpy arrays
  import numpy as np

  file_path = 'path/to/glove.6B.50d_reduced.txt'
  words, words_ems = [], []
  # the with-statement closes the file once all lines are read
  with open(file_path, 'r', encoding='utf-8') as glove:
      for line in glove:
          # each line is "word v1 v2 ... v50"; split off the word only
          word, vector = line.split(maxsplit=1)
          words.append(word)
          words_ems.append(np.array(vector.split(), dtype=np.float32))
  ```
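
Each line of a GloVe text file holds a token followed by its vector components, separated by spaces. The sketch below applies the same `split(maxsplit=1)` logic to two made-up lines held in memory (the numbers are invented for illustration, and the vectors are shortened to 4 dimensions), so you can see what the loop produces without downloading anything.

```python
import io

import numpy as np

# two invented lines in GloVe's "token v1 v2 ... vn" layout
sample = io.StringIO('the 0.1 -0.2 0.3 0.4\ncat 0.5 0.6 -0.7 0.8\n')

sample_words, sample_ems = [], []
for line in sample:
    token, vector = line.split(maxsplit=1)   # split off the token only
    sample_words.append(token)
    sample_ems.append(np.array(vector.split(), dtype=np.float32))

print(sample_words)         # ['the', 'cat']
print(sample_ems[0].shape)  # (4,)
```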

  <br>

  - Then, create a log file to record the word embeddings using `visualdl.LogWriter`.
  ```python
from visualdl import LogWriter
# you can define the logdir and add the tag name in your way
with LogWriter(logdir='./embds_vdl') as writer:
    writer.add_embeddings(tag='glove_word_embds', mat=[em for em in words_ems], metadata=words)
  ```

<br>

  - Finally, let's visualize the word embeddings we just logged.
  ```python
from visualdl.server import app
app.run(logdir='./embds_vdl')
'''
output:
11324
VisualDL 2.2.0
* Running on http://localhost:8040/ (Press CTRL+C to quit)
127.0.0.1 - - [20/Jul/2021 20:05:02] "GET /alive HTTP/1.1" 204 -
Running VisualDL at http://localhost:8040/ (Press CTRL+C to quit)
Serving VisualDL on localhost; to expose to the network, use a proxy or pass --host 0.0.0.0'''
  ```

<br>

You will see the same picture as above when you copy and paste http://localhost:8040/ into your browser and click "A" in the loaded webpage.  
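
To get a feel for what a 3-D view of 50-dimensional vectors involves, the following is a minimal sketch of a PCA projection with NumPy. It is only an illustration of dimensionality reduction in general, with random vectors standing in for real embeddings. It is not VisualDL's own implementation, and `project_to_3d` is a hypothetical helper name.

```python
import numpy as np


def project_to_3d(embeddings):
    """Project row vectors onto their first three principal components."""
    mat = np.asarray(embeddings, dtype=np.float64)
    centered = mat - mat.mean(axis=0)          # PCA works on zero-mean columns
    # rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:3].T


rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(100, 50))   # stand-in for 100 word vectors of dim 50
points = project_to_3d(fake_embeddings)
print(points.shape)  # (100, 3)
```

Each row of `points` is the 3-D coordinate of one vector. Words whose original vectors are close tend to stay close after projection, although some structure is inevitably lost when 47 dimensions are dropped.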

<a name="3"></a>
# 3. Visualizing sentence embeddings

To visualize sentence embeddings, you first need sentence embeddings to visualize. To learn how to generate sentence embeddings from word embeddings with a Bag-of-Words model, you can refer to [calculating text cosine similarity in paddlenlp.ipynb](https://colab.research.google.com/drive/1QYSJ3x6Ap5HG8O4R4yqAyw6iq18JahdO?usp=sharing).

<br>

**Once you get the sentence embeddings ready, the rest is identical to how you visualize word embeddings above.** 
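
As a quick illustration of the Bag-of-Words idea used in this section, a sentence embedding is just the element-wise sum of the embeddings of its tokens. The toy vocabulary and 3-dimensional vectors below are invented for the example.

```python
import numpy as np

# invented 3-dimensional "embeddings" for a tiny vocabulary
toy_vocab = {
    'a': np.array([1.0, 0.0, 0.0]),
    'bird': np.array([0.0, 2.0, 0.0]),
    'sings': np.array([0.0, 0.0, 3.0]),
}

tokens = 'a bird sings'.split()
sentence_vector = np.sum([toy_vocab[tk] for tk in tokens], axis=0)
print(sentence_vector)  # [1. 2. 3.]

# word order does not matter for a sum, which is the main limitation of this model
reordered = np.sum([toy_vocab[tk] for tk in reversed(tokens)], axis=0)
print(np.array_equal(sentence_vector, reordered))  # True
```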

<a name="3-1"></a>
### 3.1 With `paddlepaddle` and `paddlenlp` installed

  - First, make sure that you have `paddlepaddle` and `paddlenlp` installed by running the following commands:
  ```
  pip3 install --upgrade paddlepaddle
  pip3 install --upgrade paddlenlp
  ```

  <br>

  - In Python, load the word embedding model you want to use to calculate the sentence embeddings. Here, we will load `glove.wiki2014-gigaword.target.word-word.dim50.en`.
  ```python
from paddlenlp.embeddings import TokenEmbedding
# load the model: glove.wiki2014-gigaword.target.word-word.dim50.en
# model size: 73.45 MB; vocab size: 400002
tk_embedding = TokenEmbedding(embedding_name="glove.wiki2014-gigaword.target.word-word.dim50.en")
'''Output:
[2021-07-20 19:55:56,510] [    INFO] - Loading token embedding...
[2021-07-20 19:55:57,498] [    INFO] - Finish loading embedding vector.
[2021-07-20 19:55:57,498] [    INFO] - Token Embedding info:             
Unknown index: 400000             
Unknown token: [UNK]             
Padding index: 400001             
Padding token: [PAD]             
Shape :[400002, 50]
'''
```

<br>

  - Then, we will continue to use [sts_sample.tsv](https://drive.google.com/file/d/1TJgl4WtKY4JlcCVZtd9CQK-1w26wNo9D/view?usp=sharing) which contains 50 English sentence pairs, namely, 100 sentences. You are also welcome to use any English sentences whose embeddings you want to visualize.

  ```python
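  # First load the sents
  file_path = 'file/path/to/sts_sample.tsv'
  sents = []
  # the with-statement closes the file once all lines are read
  with open(file_path, 'r', encoding='utf-8') as sts:
      for line in sts:
          # the first two tab-separated columns hold the sentence pair
          line = line.rstrip('\n').split('\t')
          sents.append(line[0])
          sents.append(line[1])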
  # check the first 5 sentences
  print('The first 5 sentences:\n', sents[:5])
  '''output:
  The first 5 sentences:  
  ['A multi-colored bird clings to a wire fence.', 'A bird holding on to a metal gate.', 'A woman is mixing meat.', 'A woman is feeding a man.', 'Two men sailing in a small sailboat.']
  '''
  ```

<br>

  - Then, reuse the function we built in the third section of [calculating text cosine similarity in paddlenlp.ipynb](https://colab.research.google.com/drive/1QYSJ3x6Ap5HG8O4R4yqAyw6iq18JahdO?usp=sharing), which applies the simple Bag-of-Words model to calculate the sentence embeddings.
  ```python
  import numpy as np

  def sentence_embedding(sentence, tokenizer, embedder):
      def embedding(sent):
          tokens = tokenizer(sent)
          # sum the token embeddings element-wise (Bag-of-Words)
          return np.sum(embedder.search(tokens), axis=0)

      if isinstance(sentence, str):
          return embedding(sentence)
      elif isinstance(sentence, list):
          return [embedding(sent) for sent in sentence]
      else:
          # the f prefix must be on the string that contains the placeholder
          raise TypeError('sentence should be either a str or a list. '
                          f'{type(sentence)} not supported.')
  ```

  <br>

  - Then, let's get the sentence embeddings for the 100 sentences we just loaded.
  ```python
  # here, we simply use space to tokenize English text
  en_tokenizer = lambda x: x.split() 
  sent_ems = sentence_embedding(sents, en_tokenizer, tk_embedding)
  ```

  <br>

  - Finally, let's log the sentence embeddings and visualize them!
    ```python
from visualdl import LogWriter
# you can define the logdir and add the tag name in your way
with LogWriter(logdir='./sent_embds_vdl') as writer:
      writer.add_embeddings(tag='glove_sent_embds', mat=[em for em in sent_ems], metadata=sents)
# visualize them
from visualdl.server import app
app.run(logdir='./sent_embds_vdl')
'''output:
11760
 * Running on http://localhost:8040/ (Press CTRL+C to quit)
127.0.0.1 - - [20/Jul/2021 21:43:12] "GET /alive HTTP/1.1" 204 -
Running VisualDL at http://localhost:8040/ (Press CTRL+C to quit)
Serving VisualDL on localhost; to expose to the network, use a proxy or pass --host 0.0.0.0
'''
  ```

<br>

You will see something like the following when you copy and paste http://localhost:8040/ into your browser and click "A" in the loaded webpage. Successful!

<img src='https://drive.google.com/uc?export=view&id=1ALsLToHkOfKU3gOpf2hXxWTh4rrfRsh2' width="1000" height="600">

<a name="3-2"></a>
### 3.2 Without `paddlepaddle` and `paddlenlp` installed

  - For generating sentence embeddings, you may want to load the entire `glove.6B.50d.txt` so that there is less chance of running into unseen words. You will also want to set a fallback embedding for all unseen words. For your convenience, you can download `glove.6B.50d.txt` by clicking [here](https://drive.google.com/file/d/1o1fUeoAt260P90FeP_L5eICiQowHIcvY/view?usp=sharing).


<br>

  - In Python, run the following code to load the words and their embeddings stored in the [glove.6B.50d.txt](https://drive.google.com/file/d/1o1fUeoAt260P90FeP_L5eICiQowHIcvY/view?usp=sharing) file that you just downloaded. You will also want to create a `tk_em_dict` that you can use later to look up word embeddings.
  ```python
  # we will need numpy to convert the embeddings (stored as str) into numpy arrays
  import numpy as np
  from collections import defaultdict

  file_path = 'path/to/glove.6B.50d.txt'
  words, words_ems = [], []
  # the with-statement closes the file once all lines are read
  with open(file_path, 'r', encoding='utf-8') as glove:
      for line in glove:
          # each line is "word v1 v2 ... v50"; split off the word only
          word, vector = line.split(maxsplit=1)
          words.append(word)
          words_ems.append(np.array(vector.split(), dtype=np.float32))
  # create the tk_em_dict; all unseen tokens map to np.zeros(50)
  tk_em_dict = defaultdict(lambda: np.zeros(50, dtype=np.float32), zip(words, words_ems))
  ```
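
The `defaultdict` fallback is worth a quick look because it has a side effect: looking up an unseen token with `[]` returns the zero vector and also stores that token as a new key. The sketch below uses an invented two-word dictionary with 3-dimensional vectors, and `.get()` is shown as a way to look up a token without inserting it.

```python
from collections import defaultdict

import numpy as np

dim = 3  # the tutorial uses 50; 3 keeps the example readable
known = {
    'bird': np.ones(dim, dtype=np.float32),
    'fence': np.full(dim, 2.0, dtype=np.float32),
}
toy_dict = defaultdict(lambda: np.zeros(dim, dtype=np.float32), known)

print(toy_dict['bird'])         # [1. 1. 1.]
print(toy_dict['sailboat'])     # [0. 0. 0.] (unseen, so the default is returned)
print('sailboat' in toy_dict)   # True: the lookup above inserted the key
print(toy_dict.get('gate'))     # None: .get() does not trigger the default
print(len(toy_dict))            # 3
```

For 100 sentences the extra keys are harmless, but with a large corpus the dictionary can grow with every misspelled or rare token.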

 <br>

  - Then, we will continue to use [sts_sample.tsv](https://drive.google.com/file/d/1TJgl4WtKY4JlcCVZtd9CQK-1w26wNo9D/view?usp=sharing) which contains 50 English sentence pairs, namely, 100 sentences. You are also welcome to use any English sentences whose embeddings you want to visualize.

  ```python
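  # First load the sents
  file_path = 'file/path/to/sts_sample.tsv'
  sents = []
  # the with-statement closes the file once all lines are read
  with open(file_path, 'r', encoding='utf-8') as sts:
      for line in sts:
          # the first two tab-separated columns hold the sentence pair
          line = line.rstrip('\n').split('\t')
          sents.append(line[0])
          sents.append(line[1])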
  # check the first 5 sentences
  print('The first 5 sentences:\n', sents[:5])
  '''output:
  The first 5 sentences:  
  ['A multi-colored bird clings to a wire fence.', 'A bird holding on to a metal gate.', 'A woman is mixing meat.', 'A woman is feeding a man.', 'Two men sailing in a small sailboat.']
  '''
  ```

<br>

  - Then, reuse the function we built in the third section of [calculating text cosine similarity in paddlenlp.ipynb](https://colab.research.google.com/drive/1QYSJ3x6Ap5HG8O4R4yqAyw6iq18JahdO?usp=sharing), with small modifications, to apply the simple Bag-of-Words model and calculate the sentence embeddings.
  ```python
  def sentence_embedding(sentence, tokenizer, tk_em_dict):
      def embedding(sent):
          tokens = tokenizer(sent)
          # look up each lowercased token and sum the vectors element-wise
          return np.sum([tk_em_dict[tk.lower()] for tk in tokens], axis=0)

      if isinstance(sentence, str):
          return embedding(sentence)
      elif isinstance(sentence, list):
          return [embedding(sent) for sent in sentence]
      else:
          # the f prefix must be on the string that contains the placeholder
          raise TypeError('sentence should be either a str or a list. '
                          f'{type(sentence)} not supported.')
  ```

  <br>

  - Then, let's get the sentence embeddings for the 100 sentences we just loaded.
  ```python
  # here, we simply use space to tokenize English text for convenience
  # you can build or import a more sophisticated tokenizer if you wish
  en_tokenizer = lambda x: x.split() 
  sent_ems = sentence_embedding(sents, en_tokenizer, tk_em_dict)
  ```
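
Splitting on whitespace leaves punctuation attached to words (for example `'fence.'`), and such tokens are unlikely to be found in the GloVe vocabulary, so they would fall back to the zero vector. A slightly more careful tokenizer, sketched below with the standard `re` module, separates punctuation from words. The pattern and the name `simple_tokenizer` are assumptions made for illustration, not something this tutorial prescribes.

```python
import re

# a word (keeping internal hyphens and apostrophes), or a single punctuation mark
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def simple_tokenizer(text):
    """Lowercase the text and split it into words and punctuation marks."""
    return TOKEN_PATTERN.findall(text.lower())


print(simple_tokenizer('A multi-colored bird clings to a wire fence.'))
# ['a', 'multi-colored', 'bird', 'clings', 'to', 'a', 'wire', 'fence', '.']
```

It can be passed in place of `en_tokenizer`, e.g. `sentence_embedding(sents, simple_tokenizer, tk_em_dict)`.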

  <br>

  - Finally, let's log the sentence embeddings and visualize them!
    ```python
from visualdl import LogWriter
# you can define the logdir and add the tag name in your way
with LogWriter(logdir='./sent_embds_vdl') as writer:
      writer.add_embeddings(tag='glove_sent_embds', mat=[em for em in sent_ems], metadata=sents)
# visualize them
from visualdl.server import app
app.run(logdir='./sent_embds_vdl')
'''output:
11857
 * Running on http://localhost:8040/ (Press CTRL+C to quit)
127.0.0.1 - - [20/Jul/2021 21:43:12] "GET /alive HTTP/1.1" 204 -
Running VisualDL at http://localhost:8040/ (Press CTRL+C to quit)
Serving VisualDL on localhost; to expose to the network, use a proxy or pass --host 0.0.0.0
'''
  ```

<br>

You will see a picture similar to the one above when you copy and paste http://localhost:8040/ into your browser and click "A" in the loaded webpage.
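
If you want to cross-check what you see in the panel, one option is to compute nearest neighbors directly with cosine similarity, the measure covered in the previous notebook. The following is a minimal sketch under that assumption. `nearest_neighbors` is a hypothetical helper, and the 2-dimensional toy vectors are invented.

```python
import numpy as np


def nearest_neighbors(query_index, embeddings, top_k=3):
    """Return indices of the top_k rows most cosine-similar to the query row."""
    mat = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(mat, axis=1)
    norms[norms == 0] = 1.0                  # avoid dividing by zero for all-zero rows
    unit = mat / norms[:, None]              # scale every row to unit length
    sims = unit @ unit[query_index]          # cosine similarity to the query
    sims[query_index] = -np.inf              # exclude the query itself
    return np.argsort(-sims)[:top_k].tolist()


toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(nearest_neighbors(0, toy, top_k=2))  # [1, 2]
```

With the data from this section, `[sents[i] for i in nearest_neighbors(0, sent_ems)]` would list the sentences closest to the first one.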

<a name="4"></a>
# 4. References

- [VisualDL](https://github.com/PaddlePaddle/VisualDL)