Cross-lingual word embeddings from Twitter

This repository contains the pre-trained cross-lingual word embeddings from the paper Learning Cross-lingual Embeddings from Twitter via Distant Supervision.

Twitter pre-trained word embeddings

We release the 100-dimensional monolingual and cross-lingual word embeddings, trained on Twitter, that were used in our experiments (English, Spanish, Italian, German and Farsi):

  • Monolingual FastText embeddings: Available here
  • Cross-lingual embeddings post-processed with plain averaging: Available here
  • Cross-lingual embeddings post-processed with weighted averaging: Available here
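Once downloaded, the embeddings can be loaded and compared across languages. The following is a minimal sketch, not the authors' code. It assumes the files use the word2vec-style text format commonly written by FastText and VecMap (an optional `<vocab_size> <dim>` header, then one word and its vector per line); this README does not state the file format. The function names `load_embeddings` and `cosine` and the toy 3-dimensional vectors are hypothetical and only illustrate the idea.

```python
import io
import math


def load_embeddings(handle):
    """Read word vectors from a word2vec-style text stream (assumed format)."""
    vectors = {}
    for index, line in enumerate(handle):
        parts = line.split()
        # Skip the optional "<vocab_size> <dim>" header and any blank lines.
        if len(parts) < 2 or (index == 0 and len(parts) == 2):
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy stand-in for a cross-lingual file; real files hold 100-dimensional vectors.
sample = io.StringIO(
    "3 3\n"
    "dog 1.0 0.0 0.0\n"
    "perro 0.9 0.1 0.0\n"
    "casa 0.0 1.0 0.0\n"
)
emb = load_embeddings(sample)
print(cosine(emb["dog"], emb["perro"]) > cosine(emb["dog"], emb["casa"]))
```

For a real file, pass an open file handle (for example `open(path, encoding="utf-8")`) instead of the `io.StringIO` object. In a shared cross-lingual space, translations such as "dog" and "perro" would be expected to score higher than unrelated word pairs.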

Update: Embeddings for Japanese are now also available!

Note 1: All words are lowercased.

Note 2: All emoji have been unified into a single neutral encoding across languages (no skin tone modifiers). All Twitter users have been anonymized with @user.
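To look up words from new tweets, the input text should be normalized in the same way as the vocabulary. The sketch below approximates the two notes above: it lowercases the text, replaces user mentions with `@user`, and strips the Unicode skin tone modifiers (U+1F3FB to U+1F3FF). The exact preprocessing pipeline used by the authors is not described in this README, so the regular expressions and the function name `normalize_tweet` are assumptions.

```python
import re

# Unicode skin tone modifiers (Fitzpatrick types 1-2 through 6).
SKIN_TONES = re.compile("[\U0001F3FB-\U0001F3FF]")
# Simplified mention pattern (an assumption): "@" followed by word characters.
MENTION = re.compile(r"@\w+")


def normalize_tweet(text):
    """Approximate the normalization described in Notes 1 and 2."""
    text = SKIN_TONES.sub("", text)    # keep the base emoji, drop the modifier
    text = MENTION.sub("@user", text)  # anonymize user mentions
    return text.lower()                # all words are lowercased


# Thumbs-up emoji with a medium skin tone modifier, plus a user mention.
example = normalize_tweet("Thanks @JohnDoe \U0001F44D\U0001F3FD")
print(example)
```

Tokenization is left out on purpose, since this README does not say which tokenizer was used.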

Reference paper

If you use any of these resources, please cite the following paper:

@article{twitteremb19,
  author = 	"Camacho-Collados, Jose and Doval, Yerai and Mart\'{i}nez-C\'{a}mara, Eugenio and Espinosa-Anke, Luis and Barbieri, Francesco and Schockaert, Steven",
  title = 	"Learning Cross-lingual Embeddings from Twitter via Distant Supervision",
  journal = 	"arXiv preprint arXiv:1905.07358",
  year = 	"2019"
}

Full code coming soon.

If you use FastText or VecMap, please also cite their corresponding papers.
