In [None]:
# Author: Zhengxiang (Jack) Wang
# Date: 2021-07-18
# GitHub: https://github.com/jaaack-wang
# About: Loading pre-trained word embedding in paddlenlp.

# Overview

`paddlenlp` provides an [Embedding API](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/embeddings.html) that can load various open-source pre-trained word embedding models in a single line of code by calling `paddlenlp.embeddings.TokenEmbedding`. All you need to do is specify the name of the desired word embedding model when constructing `TokenEmbedding`. This notebook also serves as a brief introduction to the word embedding models that can be loaded natively in `paddlenlp`.


<table align="right">
  <td>
    <a target="_blank" href="https://colab.research.google.com/drive/1WSyYtDiHwXe4MFTwe_X6hQ5atBqNsFax?usp=sharing"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" /> Run in Google Colab </a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/jaaack-wang"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" /> Author's GitHub </a>
  </td>
  <td>
    <a href="https://docs.google.com/uc?export=download&id=1WSyYtDiHwXe4MFTwe_X6hQ5atBqNsFax"><img src="https://www.tensorflow.org/images/download_logo_32px.png" /> Download this notebook </a>
  </td>
</table> 


<br>




# Table of Contents

- [1. Loading pre-trained word embedding models](#1)
  - [1.1 Make sure `paddlepaddle` and `paddlenlp` have been installed](#1-1)
  - [1.2 Two ways of loading a word embedding model](#1-2)
    - [1.2.1 First way: Specifying the model's name inside paddlenlp.embeddings.TokenEmbedding](#1-2-1)
    - [1.2.2 Second way: Manually load a model](#1-2-2)
- [2. Pre-trained word embedding available in `paddlenlp`](#2)
  - [2.1 Chinese word embedding models](#2-1)
  - [2.2 English word embedding models](#2-2)
  - [2.3 General info of these models](#2-3)
- [3. References](#3)

<a name="1"></a>
# 1. Loading pre-trained word embedding models

<a name="1-1"></a>
### 1.1 Make sure `paddlepaddle` and `paddlenlp` have been installed

In [None]:
# First make sure that you have paddlepaddle installed before using paddlenlp
!pip3 install --upgrade paddlepaddle

Requirement already up-to-date: paddlepaddle in /usr/local/lib/python3.7/dist-packages (2.1.1)


In [None]:
# install the latest paddlenlp
!pip3 install --upgrade paddlenlp

Requirement already up-to-date: paddlenlp in /usr/local/lib/python3.7/dist-packages (2.0.5)


<a name="1-2"></a>
### 1.2 Two ways of loading a word embedding model
- Specifying the model's name inside `paddlenlp.embeddings.TokenEmbedding`
- Manually loading a model
<br><br>

First, we need to import `paddlenlp.embeddings.TokenEmbedding`.


**Reference**: [TokenEmbedding](https://paddlenlp.readthedocs.io/zh/latest/source/paddlenlp.embeddings.token_embedding.html) 

**Parameters or functions of interest**:

- **embedding_name (str, optional)** -- The name of the pre-trained embedding model. Use **paddlenlp.embeddings.list_embedding_name()** to list the names of all available embedding models. Defaults to <ins>w2v.baidu_encyclopedia.target.word-word.dim300</ins>.
  - **`vocab.to_tokens(indices)`**: indices: `list` or `int`. Returns: the corresponding tokens in the model's vocabulary.
  - **`search(words)`**: words: `list` or `str` or `int`. Returns: the vectors of the specified words.
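
To make the two lookups above concrete, here is a toy, pure-Python stand-in. It is only an illustration and not `paddlenlp`'s implementation: the class name `ToyEmbedding`, the placeholder vectors, and the fallback to the unknown-token vector for out-of-vocabulary words are all assumptions made for this sketch. The special tokens `[UNK]` and `[PAD]` are appended at the end of the vocabulary, mirroring the model info printed later in this notebook.

```python
# Toy illustration only; NOT how paddlenlp implements TokenEmbedding.
class ToyEmbedding:
    def __init__(self, tokens, vectors, unk_token="[UNK]", pad_token="[PAD]"):
        dim = len(vectors[0])
        # Special tokens are appended after the regular vocabulary.
        self.idx_to_token = list(tokens) + [unk_token, pad_token]
        self.token_to_idx = {t: i for i, t in enumerate(self.idx_to_token)}
        self.unk_idx = self.token_to_idx[unk_token]
        # Placeholder vectors: 0.5s for [UNK] (arbitrary), zeros for [PAD].
        self.vectors = [list(v) for v in vectors] + [[0.5] * dim, [0.0] * dim]

    def to_tokens(self, indices):
        """Map an index or a list of indices back to tokens."""
        if isinstance(indices, int):
            return self.idx_to_token[indices]
        return [self.idx_to_token[i] for i in indices]

    def search(self, words):
        """Return one vector per word (this toy accepts str or list of str only)."""
        if isinstance(words, str):
            words = [words]
        return [self.vectors[self.token_to_idx.get(w, self.unk_idx)] for w in words]


toy = ToyEmbedding(["之", "不"], [[0.1, 0.2], [0.3, 0.4]])
print(toy.to_tokens([0, 1]))
print(toy.search("之"))
```

Note that `search` returns a list of vectors even for a single word, which is similar in spirit to the two-dimensional array returned in the real example further down.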

In [None]:
# import paddlenlp.embeddings.TokenEmbedding
from paddlenlp.embeddings import TokenEmbedding

<a name="1-2-1"></a>
##### **1.2.1 First way: Specifying the model's name inside paddlenlp.embeddings.TokenEmbedding**

Even if you do not have the embedding model installed beforehand, `paddlenlp` will automatically download it for you.

In [None]:
# w2v.sikuquanshu.target.word-word.dim300 is the smallest model, only 20.7 MB
token_embedding = TokenEmbedding(embedding_name="w2v.sikuquanshu.target.word-word.dim300")

# print the loaded model info
print(token_embedding)

[2021-07-18 22:18:01,280] [    INFO] - Loading token embedding...
[2021-07-18 22:18:01,484] [    INFO] - Finish loading embedding vector.
[2021-07-18 22:18:01,492] [    INFO] - Token Embedding info:
Unknown index: 19527
Unknown token: [UNK]
Padding index: 19528
Padding token: [PAD]
Shape :[19529, 300]


Object   type: TokenEmbedding(19529, 300, padding_idx=19528, sparse=False)             
Unknown index: 19527             
Unknown token: [UNK]             
Padding index: 19528             
Padding token: [PAD]             
Parameter containing:
Tensor(shape=[19529, 300], dtype=float32, place=CPUPlace, stop_gradient=False,
       [[ 0.10636300, -0.15333299, -0.00901000, ...,  0.08603100, -0.21248300, -0.11166500],
        [ 0.03748500, -0.02521100,  0.13308799, ...,  0.10142900, -0.23215801,  0.02003300],
        [ 0.12701400, -0.05975900,  0.09185100, ...,  0.06476500, -0.24700700, -0.10331500],
        ...,
        [-0.08905500,  0.03231100,  0.10511100, ..., -0.01329400,  0.01188400, -0.02801500],
        [ 0.01407126, -0.02133216, -0.00614622, ...,  0.05757869, -0.02248066,  0.01433324],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,  0.        ,  0.        ]])


In [None]:
# We can check the vocabulary in the model as follows.
# As shown above, the vocabulary size is 19529.
vocab = token_embedding.vocab.to_tokens(list(range(19529)))

# check the first 20 tokens in the vocabulary
print("These are the first 20 words in the model's vocabulary:\n\n", vocab[:20])

These are the first 20 words in the model's vocabulary:

 ['之', '不', '以', '】', '【', '为', '而', '也', '其', '于', '曰', '丨', '子', '有', '人', '防', '云', '无', '中', '一']


In [None]:
# To check the corresponding embedding of a given word/token
# using TokenEmbedding().search(word)
w_e = token_embedding.search('之')

print('This is the pre-trained word embedding (300 dims) for 之 in the model: \n\n', w_e)

This is the pre-trained word embedding (300 dims) for 之 in the model: 

 [[ 1.06363e-01 -1.53333e-01 -9.01000e-03  1.62510e-01 -9.58900e-03
  -1.20850e-01  1.02605e-01  3.34600e-03  1.35963e-01  2.41380e-02
  -4.15040e-02  4.03810e-02  1.36350e-02 -7.29750e-02 -8.56830e-02
  -1.13738e-01  4.78800e-03 -1.21901e-01  5.74850e-02 -7.23350e-02
  -2.18830e-01  1.52990e-02  9.93080e-02  5.20510e-02  2.95950e-02
   1.19608e-01 -1.05340e-02 -1.18171e-01  2.17480e-02 -1.03129e-01
   1.92324e-01  5.96840e-02 -8.32330e-02 -2.02900e-02  9.60600e-03
  -7.17610e-02 -3.91730e-02 -1.14774e-01 -2.03384e-01  9.58800e-03
   7.48400e-02  3.13860e-02  3.51280e-02  3.80000e-05 -6.72880e-02
   5.08650e-02 -8.90020e-02  4.20170e-02 -1.87957e-01  1.58577e-01
   4.19750e-02  1.68900e-01 -1.11841e-01  3.50260e-02  3.53730e-02
  -5.59850e-02 -4.40030e-02  1.55936e-01  3.93170e-02 -5.75030e-02
   1.07756e-01 -8.60630e-02 -2.31200e-02 -6.34500e-03  1.56109e-01
  -1.43128e-01 -2.75030e-02  5.26310e-02 -6.49380e-02 -5

<a name="1-2-2"></a>
##### **1.2.2 Second way: Manually load a model**

- First, download the embedding model from the internet and decompress it;
- Then, move the decompressed model file to <ins>`HOME_DIRECTORY/.paddlenlp/models/embeddings/`</ins>;
- Then, convert the model file into `".npz"` format;
- Finally, load the model by its file name (without the extension) using `TokenEmbedding(embedding_name=...)`, as shown above.
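
The step of placing a file under the hidden `.paddlenlp` directory can be done in code rather than by hand. Below is a minimal sketch using only the standard library. The helper name `install_embedding_file` and the file name `toy_model.npz` are hypothetical, and the demo runs against a throwaway temporary directory instead of your real home directory.

```python
import os
import shutil
import tempfile


def install_embedding_file(src_path, home_dir=None):
    """Copy a model file into <home>/.paddlenlp/models/embeddings/ (hypothetical helper)."""
    home_dir = home_dir or os.path.expanduser("~")
    target_dir = os.path.join(home_dir, ".paddlenlp", "models", "embeddings")
    os.makedirs(target_dir, exist_ok=True)  # create the hidden directory if needed
    dst = os.path.join(target_dir, os.path.basename(src_path))
    shutil.copy(src_path, dst)
    return dst


# Demo against a temporary "home" so nothing on the real machine is touched.
with tempfile.TemporaryDirectory() as fake_home:
    src = os.path.join(fake_home, "toy_model.npz")
    with open(src, "wb") as f:
        f.write(b"placeholder")  # stand-in content, not a real embedding file
    installed = install_embedding_file(src, home_dir=fake_home)
    installed_ok = os.path.exists(installed)

print(installed_ok)
```

Calling `install_embedding_file(path)` without `home_dir` would target the current user's home directory.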

<br>

---

Please note that: 
- 1. `HOME_DIRECTORY` refers to the directory you are in when you first open a terminal/command shell on your computer; 
- 2. `".paddlenlp"` is a hidden directory that normally cannot be seen in your file browser. On macOS, you can make hidden files visible by pressing `command` + `shift` + `.` and then move the model file into the `~/.paddlenlp/models/embeddings/` directory. You can also do the same thing programmatically.  
- 3. The `.npz` file should contain two arrays named ***vocab*** and ***embedding***. See [`numpy.savez`](https://numpy.org/doc/stable/reference/generated/numpy.savez.html) for how to create such a file.
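
As a small illustration of the `.npz` layout described in the last note, the sketch below saves two toy arrays under the keys `vocab` and `embedding` and reads them back. The tokens, vector values, and the file name `toy_embedding.npz` are made up for the example, and the file is written to a temporary directory.

```python
import os
import tempfile

import numpy as np

# Toy data: two tokens with 3-dimensional vectors (values are made up).
vocab = np.asarray(["hello", "world"])
embedding = np.asarray([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], dtype=np.float32)

tmp_dir = tempfile.mkdtemp()
out_fp = os.path.join(tmp_dir, "toy_embedding.npz")

# The keyword names become the array names stored inside the .npz archive.
np.savez(out_fp, vocab=vocab, embedding=embedding)

# Read the archive back to confirm its structure.
with np.load(out_fp) as data:
    array_names = sorted(data.files)
    loaded_vocab = data["vocab"].tolist()
    loaded_shape = data["embedding"].shape

print(array_names, loaded_vocab, loaded_shape)
```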


---

<br>


Below, let's take "glove.6B.50d" downloaded from [Stanford GloVe](https://nlp.stanford.edu/projects/glove/) as an example, although this model is already available in `paddlenlp.embeddings.TokenEmbedding` via `embedding_name="glove.wiki2014-gigaword.target.word-word.dim50.en"`. The original "glove.6B.50d" file is in ".txt" format. 
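
For readers unfamiliar with the GloVe text format, each line holds a token followed by its vector values, separated by spaces. The sketch below parses two made-up lines held in memory, so it runs without downloading anything. The function name `parse_glove_lines` and the sample values are illustrative assumptions, not part of GloVe or `paddlenlp`.

```python
import io


def parse_glove_lines(lines):
    """Split each 'token v1 v2 ...' line into a token and a list of floats."""
    vocab, vectors = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        token, values = line.split(" ", 1)
        vocab.append(token)
        vectors.append([float(x) for x in values.split()])
    return vocab, vectors


# Two made-up lines in the same layout as a GloVe .txt file.
sample = io.StringIO("the 0.1 0.2 0.3\n, 0.4 0.5 0.6\n")
toy_vocab, toy_vectors = parse_glove_lines(sample)
print(toy_vocab, toy_vectors)
```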

If you run the following code on Colab, please save [this file](https://drive.google.com/file/d/1o1fUeoAt260P90FeP_L5eICiQowHIcvY/view?usp=sharing) to "My Drive" (you can either download the file and then upload it, or simply add a shortcut to this file in "My Drive" by clicking "Add shortcut to Drive"). 

<br>

**More on how to handle external files in Colab can be found [here](https://colab.research.google.com/notebooks/io.ipynb).**

In [None]:
from google.colab import drive
drive.mount('/drive')
file_path = '/drive/My Drive/glove.6B.50d.txt'

Drive already mounted at /drive; to attempt to forcibly remount, call drive.mount("/drive", force_remount=True).


In [None]:
import numpy as np


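# Parse the GloVe text file: each line is a token followed by its vector values
vocab, em = [], []
with open(file_path, 'r', encoding='utf-8') as glove:
  for line in glove:
    line = line.split(maxsplit=1)
    vocab.append(line[0])
    em.append(np.array(line[1].strip().split(), dtype=np.float32))


# sanity check: there must be exactly one vector per token
assert len(vocab) == len(em), 'vocab size does not match the size of word embeddings'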

In [None]:
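import os

# change the type of vocab and em from list to numpy.array
# before saving them as a .npz file
vocab = np.asarray(vocab)
em = np.array(em)


# output file path (on Colab, the home directory is /root)
out_dir = os.path.expanduser('~/.paddlenlp/models/embeddings')
os.makedirs(out_dir, exist_ok=True)  # create the directory if it does not exist yet
out_fp = os.path.join(out_dir, 'glove.6B.50d.npz')
np.savez(out_fp, vocab=vocab, embedding=em)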

In [None]:
from paddlenlp.embeddings import TokenEmbedding

tk_em = TokenEmbedding(embedding_name='glove.6B.50d')

[2021-07-18 22:18:09,016] [    INFO] - Loading token embedding...
[2021-07-18 22:18:10,382] [    INFO] - Finish loading embedding vector.
[2021-07-18 22:18:10,384] [    INFO] - Token Embedding info:
Unknown index: 400000
Unknown token: [UNK]
Padding index: 400001
Padding token: [PAD]
Shape :[400002, 50]


In [None]:
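# let's debug: verify that the loaded vectors match the original ones

def de_bug():
  errs = []
  tk_em_dict = dict(zip(vocab, em))
  for v in vocab:
    # search() returns an array of shape (1, dim);
    # flag the token if any value differs from the original vector
    if not np.allclose(tk_em_dict[v], tk_em.search(v)[0]):
      errs.append(v)
  
  if len(errs) == 0:
    print('No bug identified, congratulations!')
  
  return errs


de_bug()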

No bug identified, congratulations!


[]

<a name="2"></a>
# 2. Pre-trained word embedding available in `paddlenlp`

All the pre-trained word embedding models can be found [here](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/embeddings/constant.py). You can also run the following code to access the names of these models. 

In [None]:
from paddlenlp.embeddings.constant import EMBEDDING_NAME_LIST

tmp = '{0:30}{1:20}{2:30}'
# number the models starting from 1
for counter, name in enumerate(EMBEDDING_NAME_LIST, start=1):
  print(tmp.format(f'embedding model {counter}', '------>', name))


embedding model 1             ------>             w2v.baidu_encyclopedia.target.word-word.dim300
embedding model 2             ------>             w2v.baidu_encyclopedia.target.word-character.char1-1.dim300
embedding model 3             ------>             w2v.baidu_encyclopedia.target.word-character.char1-2.dim300
embedding model 4             ------>             w2v.baidu_encyclopedia.target.word-character.char1-4.dim300
embedding model 5             ------>             w2v.baidu_encyclopedia.target.word-ngram.1-2.dim300
embedding model 6             ------>             w2v.baidu_encyclopedia.target.word-ngram.1-3.dim300
embedding model 7             ------>             w2v.baidu_encyclopedia.target.word-ngram.2-2.dim300
embedding model 8             ------>             w2v.baidu_encyclopedia.target.word-wordLR.dim300
embedding model 9             ------>             w2v.baidu_encyclopedia.target.word-wordPosition.dim300
embedding model 10            ------>             w2v.baidu_enc

<a name="2-1"></a>
### 2.1 Chinese word embedding models

The following pre-trained Chinese word embedding models come from [Chinese-Word-Vectors](https://github.com/Embedding/Chinese-Word-Vectors)

<br>

Some explanations (for the following headers): 

- Corpus: The corpus on which the corresponding word embedding was trained. 
- Word: During pre-training, the predicted target is at the word level. 
- Word + N-gram: The predicted target is either a word or an n-gram. 
- Word + Character: The predicted target is either a word or a character. "word-character.char1-2" means the length of the predicted character sequence is either 1 or 2. 
- Word + Character + N-gram: The predicted target is a word, a character, or an n-gram. "ngram-char" means n-grams at the character level. 


| Corpus | Word | Word + N-gram | Word + Character | Word + Character + N-gram |
| ------------------------------------------- | ----   | ---- | ----   | ---- |
| Baidu Encyclopedia 百度百科                 | w2v.baidu_encyclopedia.target.word-word.dim300 | w2v.baidu_encyclopedia.target.word-ngram.1-2.dim300 | w2v.baidu_encyclopedia.target.word-character.char1-2.dim300 | w2v.baidu_encyclopedia.target.bigram-char.dim300 |
| Wikipedia_zh 中文维基百科                   | w2v.wiki.target.word-word.dim300 | w2v.wiki.target.word-bigram.dim300 | w2v.wiki.target.word-char.dim300 | w2v.wiki.target.bigram-char.dim300 |
| People's Daily News 人民日报                | w2v.people_daily.target.word-word.dim300 | w2v.people_daily.target.word-bigram.dim300 | w2v.people_daily.target.word-char.dim300 | w2v.people_daily.target.bigram-char.dim300 |
| Sogou News 搜狗新闻                         | w2v.sogou.target.word-word.dim300 | w2v.sogou.target.word-bigram.dim300 | w2v.sogou.target.word-char.dim300 | w2v.sogou.target.bigram-char.dim300 |
| Financial News 金融新闻                     | w2v.financial.target.word-word.dim300 | w2v.financial.target.word-bigram.dim300 | w2v.financial.target.word-char.dim300 | w2v.financial.target.bigram-char.dim300 |
| Zhihu_QA 知乎问答                           | w2v.zhihu.target.word-word.dim300 | w2v.zhihu.target.word-bigram.dim300 | w2v.zhihu.target.word-char.dim300 | w2v.zhihu.target.bigram-char.dim300 |
| Weibo 微博                                  | w2v.weibo.target.word-word.dim300 | w2v.weibo.target.word-bigram.dim300 | w2v.weibo.target.word-char.dim300 | w2v.weibo.target.bigram-char.dim300 |
| Literature 文学作品                         | w2v.literature.target.word-word.dim300 | w2v.literature.target.word-bigram.dim300 | w2v.literature.target.word-char.dim300 | w2v.literature.target.bigram-char.dim300 |
| Complete Library in Four Sections 四库全书  | w2v.sikuquanshu.target.word-word.dim300 | w2v.sikuquanshu.target.word-bigram.dim300 | NA | NA |
| Mixed-large 综合                            | w2v.mixed-large.target.word-word.dim300 | NA | w2v.mixed-large.target.word-char.dim300 | NA |

<br>

In particular, for the Baidu Encyclopedia corpus, both target vectors and context vectors are provided for different co-occurrence types. 

<br>


| Co-occurrence type          | Target vector | Context vector  |
| --------------------------- | ------   | ---- |
|    Word → Word              | w2v.baidu_encyclopedia.target.word-word.dim300     |   w2v.baidu_encyclopedia.context.word-word.dim300    |
|    Word → Ngram (1-2)       |  w2v.baidu_encyclopedia.target.word-ngram.1-2.dim300    |   w2v.baidu_encyclopedia.context.word-ngram.1-2.dim300    |
|    Word → Ngram (1-3)       |  w2v.baidu_encyclopedia.target.word-ngram.1-3.dim300    |   w2v.baidu_encyclopedia.context.word-ngram.1-3.dim300    |
|    Ngram (1-2) → Ngram (1-2)|  w2v.baidu_encyclopedia.target.word-ngram.2-2.dim300   |   w2v.baidu_encyclopedia.context.word-ngram.2-2.dim300    |
|    Word → Character (1)     |  w2v.baidu_encyclopedia.target.word-character.char1-1.dim300    |  w2v.baidu_encyclopedia.context.word-character.char1-1.dim300     |
|    Word → Character (1-2)   |  w2v.baidu_encyclopedia.target.word-character.char1-2.dim300    |  w2v.baidu_encyclopedia.context.word-character.char1-2.dim300     |
|    Word → Character (1-4)   |  w2v.baidu_encyclopedia.target.word-character.char1-4.dim300    |  w2v.baidu_encyclopedia.context.word-character.char1-4.dim300     |
|    Word → Word (left/right) |   w2v.baidu_encyclopedia.target.word-wordLR.dim300   |   w2v.baidu_encyclopedia.context.word-wordLR.dim300    |
|    Word → Word (distance)   |   w2v.baidu_encyclopedia.target.word-wordPosition.dim300   |   w2v.baidu_encyclopedia.context.word-wordPosition.dim300    |

<a name="2-2"></a>
### 2.2 English word embedding models

<br>

- Word2Vec
  - Google News: w2v.google_news.target.word-word.dim300.en
- [FastText](https://fasttext.cc/docs/en/english-vectors.html)
  - Wiki2017: fasttext.wiki-news.target.word-word.dim300.en
  - Crawl: fasttext.crawl.target.word-word.dim300.en 


- [GloVe](https://nlp.stanford.edu/projects/glove/)


| Corpus | Wiki2014 + GigaWord | Twitter |
| ---------------  | -------------  | ------  |
| 25 dims   | NA  | glove.twitter.target.word-word.dim25.en  |
| 50 dims   | glove.wiki2014-gigaword.target.word-word.dim50.en  | glove.twitter.target.word-word.dim50.en  |
| 100 dims   | glove.wiki2014-gigaword.target.word-word.dim100.en | glove.twitter.target.word-word.dim100.en  |
| 200 dims   | glove.wiki2014-gigaword.target.word-word.dim200.en  | glove.twitter.target.word-word.dim200.en  |
| 300 dims   | glove.wiki2014-gigaword.target.word-word.dim300.en  | NA  |


<br>
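
The model names listed in the tables above appear to follow a dotted pattern: algorithm, corpus, vector type (target or context), co-occurrence type, dimension, and an optional `.en` suffix for English models. The sketch below splits a name along those lines. The pattern is inferred from the names in this notebook rather than from any documented specification, and the function `parse_embedding_name` is a hypothetical helper, not part of `paddlenlp`.

```python
def parse_embedding_name(name):
    """Split an embedding model name into its apparent components (inferred pattern)."""
    parts = name.split(".")
    # English models in the tables carry a trailing ".en"; others have no suffix.
    language = "en" if parts[-1] == "en" else None
    if language == "en":
        parts = parts[:-1]
    return {
        "algorithm": parts[0],                   # e.g. w2v, glove, fasttext
        "corpus": parts[1],                      # e.g. baidu_encyclopedia
        "vector_type": parts[2],                 # target or context
        "cooccurrence": ".".join(parts[3:-1]),   # e.g. word-character.char1-2
        "dim": int(parts[-1][len("dim"):]),      # e.g. dim300 -> 300
        "language": language,
    }


zh_info = parse_embedding_name("w2v.baidu_encyclopedia.target.word-character.char1-2.dim300")
en_info = parse_embedding_name("glove.wiki2014-gigaword.target.word-word.dim50.en")
print(zh_info)
print(en_info)
```

A helper like this can be handy for filtering the full name list, for example to keep only 300-dimensional target vectors.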

<a name="2-3"></a>
### 2.3 General info of these models


| Model | File size | Vocab size |
|-----|---------|---------|
| w2v.baidu_encyclopedia.target.word-word.dim300                         | 678.21 MB  | 635965 |
| w2v.baidu_encyclopedia.target.word-character.char1-1.dim300            | 679.15 MB  | 636038 |
| w2v.baidu_encyclopedia.target.word-character.char1-2.dim300            | 679.30 MB  | 636038 |
| w2v.baidu_encyclopedia.target.word-character.char1-4.dim300            | 679.51 MB  | 636038 |
| w2v.baidu_encyclopedia.target.word-ngram.1-2.dim300                    | 679.48 MB  | 635977 |
| w2v.baidu_encyclopedia.target.word-ngram.1-3.dim300                    | 671.27 MB  | 628669 |
| w2v.baidu_encyclopedia.target.word-ngram.2-2.dim300                    | 7.28 GB    | 6969069 |
| w2v.baidu_encyclopedia.target.word-wordLR.dim300                       | 678.22 MB  | 635958 |
| w2v.baidu_encyclopedia.target.word-wordPosition.dim300                 | 679.32 MB  | 636038 |
| w2v.baidu_encyclopedia.target.bigram-char.dim300                       | 679.29 MB  | 635976 |
| w2v.baidu_encyclopedia.context.word-word.dim300                        | 677.74 MB  | 635952 |
| w2v.baidu_encyclopedia.context.word-character.char1-1.dim300           | 678.65 MB  | 636200 |
| w2v.baidu_encyclopedia.context.word-character.char1-2.dim300           | 844.23 MB  | 792631 |
| w2v.baidu_encyclopedia.context.word-character.char1-4.dim300           | 1.16 GB    | 1117461 |
| w2v.baidu_encyclopedia.context.word-ngram.1-2.dim300                   | 7.25 GB    | 6967598 |
| w2v.baidu_encyclopedia.context.word-ngram.1-3.dim300                   | 5.21 GB    | 5000001 |
| w2v.baidu_encyclopedia.context.word-ngram.2-2.dim300                   | 7.26 GB    | 6968998 |
| w2v.baidu_encyclopedia.context.word-wordLR.dim300                      | 1.32 GB    | 1271031 |
| w2v.baidu_encyclopedia.context.word-wordPosition.dim300                | 6.47 GB    | 6293920 |
| w2v.wiki.target.bigram-char.dim300                                     | 375.98 MB  | 352274 |
| w2v.wiki.target.word-char.dim300                                       | 375.52 MB  | 352223 |
| w2v.wiki.target.word-word.dim300                                       | 374.95 MB  | 352219 |
| w2v.wiki.target.word-bigram.dim300                                     | 375.72 MB  | 352219 |
| w2v.people_daily.target.bigram-char.dim300                             | 379.96 MB  | 356055 |
| w2v.people_daily.target.word-char.dim300                               | 379.45 MB  | 355998 |
| w2v.people_daily.target.word-word.dim300                               | 378.93 MB  | 355989 |
| w2v.people_daily.target.word-bigram.dim300                             | 379.68 MB  | 355991 |
| w2v.weibo.target.bigram-char.dim300                                    | 208.24 MB  | 195199 |
| w2v.weibo.target.word-char.dim300                                      | 208.03 MB  | 195204 |
| w2v.weibo.target.word-word.dim300                                      | 207.94 MB  | 195204 |
| w2v.weibo.target.word-bigram.dim300                                    | 208.19 MB  | 195204 |
| w2v.sogou.target.bigram-char.dim300                                    | 389.81 MB  | 365112 |
| w2v.sogou.target.word-char.dim300                                      | 389.89 MB  | 365078 |
| w2v.sogou.target.word-word.dim300                                      | 388.66 MB  | 364992 |
| w2v.sogou.target.word-bigram.dim300                                    | 388.66 MB  | 364994 |
| w2v.zhihu.target.bigram-char.dim300                                    | 277.35 MB  | 259755 |
| w2v.zhihu.target.word-char.dim300                                      | 277.40 MB  | 259940 |
| w2v.zhihu.target.word-word.dim300                                      | 276.98 MB  | 259871 |
| w2v.zhihu.target.word-bigram.dim300                                    | 277.53 MB  | 259885 |
| w2v.financial.target.bigram-char.dim300                                | 499.52 MB  | 467163 |
| w2v.financial.target.word-char.dim300                                  | 499.17 MB  | 467343 |
| w2v.financial.target.word-word.dim300                                  | 498.94 MB  | 467324 |
| w2v.financial.target.word-bigram.dim300                                | 499.54 MB  | 467331 |
| w2v.literature.target.bigram-char.dim300                               | 200.69 MB  | 187975 |
| w2v.literature.target.word-char.dim300                                 | 200.44 MB  | 187980 |
| w2v.literature.target.word-word.dim300                                 | 200.28 MB  | 187961 |
| w2v.literature.target.word-bigram.dim300                               | 200.59 MB  | 187962 |
| w2v.sikuquanshu.target.word-word.dim300                                | 20.70 MB   | 19529 |
| w2v.sikuquanshu.target.word-bigram.dim300                              | 20.77 MB   | 19529 |
| w2v.mixed-large.target.word-char.dim300                                | 1.35 GB    | 1292552 |
| w2v.mixed-large.target.word-word.dim300                                | 1.35 GB    | 1292483 |
| w2v.google_news.target.word-word.dim300.en                             | 1.61 GB    | 3000000 |
| glove.wiki2014-gigaword.target.word-word.dim50.en                      | 73.45 MB   | 400002 |
| glove.wiki2014-gigaword.target.word-word.dim100.en                     | 143.30 MB  | 400002 |
| glove.wiki2014-gigaword.target.word-word.dim200.en                     | 282.97 MB  | 400002 |
| glove.wiki2014-gigaword.target.word-word.dim300.en                     | 422.83 MB  | 400002 |
| glove.twitter.target.word-word.dim25.en                                | 116.92 MB  | 1193516 |
| glove.twitter.target.word-word.dim50.en                                | 221.64 MB  | 1193516 |
| glove.twitter.target.word-word.dim100.en                               | 431.08 MB  | 1193516 |
| glove.twitter.target.word-word.dim200.en                               | 848.56 MB  | 1193516 |
| fasttext.wiki-news.target.word-word.dim300.en                          | 541.63 MB  | 999996 |
| fasttext.crawl.target.word-word.dim300.en                              | 1.19 GB    | 2000002 |

<a name="3"></a>
# 3. References




- [PaddleNLP Embedding API](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/embeddings.html)
- [paddlenlp.embeddings docs](https://paddlenlp.readthedocs.io/zh/latest/source/paddlenlp.embeddings.html)
- [paddlenlp.embeddings source code](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/embeddings)