Analysis in Embeddings

DISCLAIMER: This is not an officially supported Google product.

This repository contains implementations of word embedding models and the Quick-Thought sentence embedding model.

Word Embedding

1. Data

A subset of Wikipedia articles: enwiki9. The preprocessing script is at data/wiki9.py.

from data.wiki9 import write_wiki9_articles
write_wiki9_articles()

In total there are 140,000 articles. For the experiments, 70,000 of them were used to train the word embedding models.

2. Models

Word embedding models are usually lookup tables that map a word to a low-dimensional vector. Available models include Word2Vec, FastText, and GloVe. Models were trained using the public libraries gensim and glove-python.
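To illustrate the lookup-table view, here is a minimal sketch in plain Python. The words, the vector values, and the helper names (cosine, most_similar) are made up for illustration and are not taken from this repository or from gensim.

```python
import math

# Toy lookup table: each word maps to a fixed low-dimensional vector.
# The words and values are invented for illustration only.
embeddings = {
    "king": [0.9, 0.1, 0.4],
    "queen": [0.8, 0.2, 0.5],
    "apple": [0.1, 0.9, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, table):
    """Return the other word whose vector is closest to `word` by cosine similarity."""
    query = table[word]
    candidates = [w for w in table if w != word]
    return max(candidates, key=lambda w: cosine(query, table[w]))


print(most_similar("king", embeddings))
```

A trained model works the same way at lookup time; only the table is learned from text rather than written by hand.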

Hyper-parameters were kept as in the original papers. For example, to launch training for Word2Vec, run:

python train_word_embedding.py --model=w2v

The --model option can be w2v for Word2Vec, ft for FastText, or glove for GloVe.

A TensorFlow implementation of Word2Vec is also available, with an option for training with differential privacy (DP). To train with DP, run:

CUDA_VISIBLE_DEVICES="0" python train_word_embedding_dp.py --dpsgd \
--noise_multiplier=0.1 --l2_norm_clip=0.25 --batch_size=512

These options are explained in tensorflow-privacy.
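As a rough guide to what --l2_norm_clip and --noise_multiplier control, the sketch below shows one DP-SGD gradient aggregation step in plain Python: each per-example gradient is clipped to an L2 norm of at most l2_norm_clip, the clipped gradients are summed, Gaussian noise with standard deviation noise_multiplier * l2_norm_clip is added, and the result is averaged. The function names are hypothetical and this is not the code in train_word_embedding_dp.py or dp_optimizer; it only illustrates the general technique.

```python
import math
import random


def clip_by_l2_norm(grad, l2_norm_clip):
    """Scale `grad` down so that its L2 norm is at most `l2_norm_clip`."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, l2_norm_clip / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_average_gradient(per_example_grads, l2_norm_clip, noise_multiplier, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, then average."""
    dim = len(per_example_grads[0])
    clipped = [clip_by_l2_norm(g, l2_norm_clip) for g in per_example_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Noise scale is tied to the clipping bound (assumed convention, see lead-in).
    stddev = noise_multiplier * l2_norm_clip
    noisy = [s + rng.gauss(0.0, stddev) for s in summed]
    return [n / len(per_example_grads) for n in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.1, 0.0]]  # toy per-example gradients
print(dp_average_gradient(grads, l2_norm_clip=0.25, noise_multiplier=0.1, rng=rng))
```

A smaller l2_norm_clip limits how much any single example can move the model, and a larger noise_multiplier gives stronger privacy at the cost of noisier updates.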

Trained models are evaluated using standard evaluation questions. To run the evaluation:

python eval_word_embedding.py --model=w2v
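The exact question set is not described here. Assuming the questions are word analogies of the form "a is to b as c is to ?", a common way to score them is vector arithmetic followed by a nearest-neighbour search. The sketch below uses a made-up toy table and hypothetical function names; it is not the logic of eval_word_embedding.py.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def answer_analogy(table, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    target = [vb - va + vc for va, vb, vc in zip(table[a], table[b], table[c])]
    candidates = [w for w in table if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, table[w]))


def analogy_accuracy(table, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(answer_analogy(table, a, b, c) == d for a, b, c, d in questions)
    return correct / len(questions)


# Toy vectors invented for illustration only.
toy_table = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, -2.0],
}
questions = [
    ("man", "woman", "king", "queen"),
    ("king", "queen", "man", "woman"),
]
print(analogy_accuracy(toy_table, questions))
```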

Sentence Embedding

1. Data

A collection of books crawled using scripts from https://github.com/soskek/bookcorpus. After preprocessing, there are more than 14,000 books and 30,000,000 sentences. The preprocessing script is at data/bookcorpus.py.

from data.bookcorpus import preprocess_pipeline
preprocess_pipeline()

2. Model

Sentence embedding models are usually neural networks that take a sequence of words as input and output a low-dimensional vector. We train QuickThought locally on half of all books. The model is trained to predict the sentences before and after a given input sentence. The implementation is based on https://github.com/lajanugen/S2V with slight modifications. To train QuickThought on books, run:

CUDA_VISIBLE_DEVICES="0" python train_quick_thought.py --batch_size=500 \
--emb_dim=620 --encoder_dim=1200 --cell_type=LSTM --epochs=1
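The training objective described above (predicting the sentences before and after an input sentence) can be sketched as a classification over the other sentences in a batch: every other sentence is scored by a dot product with the input sentence, and the loss rewards high probability for the immediate neighbours. The function below is a plain-Python illustration under these assumptions: a context window of one sentence on each side, and sentence vectors that are already computed. It is not the TensorFlow code used in train_quick_thought.py.

```python
import math


def quick_thought_loss(input_vecs, candidate_vecs):
    """Average cross-entropy of picking the neighbouring sentences for each input.

    input_vecs[i] and candidate_vecs[i] both encode sentence i of a contiguous
    passage with at least two sentences. They may come from two separate
    encoders or from the same one.
    """
    n = len(input_vecs)
    total = 0.0
    for i in range(n):
        # Score every other sentence as a candidate context for sentence i.
        others = [j for j in range(n) if j != i]
        scores = [
            sum(a * b for a, b in zip(input_vecs[i], candidate_vecs[j]))
            for j in others
        ]
        # Numerically stable log-softmax over the candidates.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        log_probs = {j: s - log_z for j, s in zip(others, scores)}
        # Targets are the sentences immediately before and after sentence i.
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < n]
        total += -sum(log_probs[j] for j in neighbours) / len(neighbours)
    return total / n


# Toy vectors: adjacent sentences are similar in `ordered`, dissimilar in `shuffled`.
ordered = [[3.0, 0.0], [3.0, 0.0], [0.0, 3.0], [0.0, 3.0]]
shuffled = [[3.0, 0.0], [0.0, 3.0], [3.0, 0.0], [0.0, 3.0]]
print(quick_thought_loss(ordered, ordered))
print(quick_thought_loss(shuffled, shuffled))
```

The loss is lower when neighbouring sentences have similar vectors, which is what pushes the encoder to place sentences from the same context close together.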

QuickThought is evaluated by using the model as a feature extractor for downstream sentence classification tasks. Evaluation on TREC and MSRP is currently supported. Run evaluation with:

CUDA_VISIBLE_DEVICES="0" python eval_quick_thought.py --eval_data=trec
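The feature-extractor setup can be sketched as follows: the sentence encoder is kept frozen, and only a small classifier is trained on top of its output vectors. In this plain-Python sketch, encode is a hypothetical keyword-counting stand-in for the trained QuickThought encoder, the data is a made-up two-class toy set rather than TREC or MSRP, and the classifier is a simple logistic regression. None of this is taken from eval_quick_thought.py.

```python
import math


def encode(sentence):
    """Hypothetical stand-in for a frozen sentence encoder (NOT QuickThought)."""
    words = sentence.lower().split()
    return [float(words.count("who")), float(words.count("where"))]


def train_logistic_regression(features, labels, lr=0.5, epochs=200):
    """Fit a binary logistic regression with plain stochastic gradient descent."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))
            error = p - y  # gradient of the log loss with respect to z
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict(weights, bias, x):
    """Return 1 if the linear score is positive, else 0."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if z > 0 else 0


# Toy question-type data: 1 = asks about a person, 0 = asks about a location.
train_sentences = ["who wrote hamlet", "where is paris",
                   "who painted guernica", "where is the nile"]
train_labels = [1, 0, 1, 0]
test_sentences = ["who discovered radium", "where is cairo"]
test_labels = [1, 0]

# Only the classifier is trained; the encoder is never updated.
weights, bias = train_logistic_regression(
    [encode(s) for s in train_sentences], train_labels)
predictions = [predict(weights, bias, encode(s)) for s in test_sentences]
accuracy = sum(p == y for p, y in zip(predictions, test_labels)) / len(test_labels)
print(accuracy)
```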
