Cannot reproduce SOTA WikiText-103 results #112
Comments
And I found something interesting: if I set tgt_len to 1600 and mem_len to 128, the test ppl goes down to 19.7.
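For orientation: in Transformer-XL, each evaluated token can attend to at most mem_len cached hidden states plus the tokens that precede it in the current segment, so tgt_len and mem_len jointly set the attention window. A rough back-of-the-envelope sketch (not taken from the repository) comparing the settings discussed in this thread:

```python
# Back-of-the-envelope sketch (not from the repository): the average attention
# window implied by the (tgt_len, mem_len) pairs mentioned in this thread.
# Token i of a segment can attend to mem_len cached states plus its i
# predecessors in the current segment; same_length evaluation instead clips
# every token to one fixed-size window.

def avg_context(tgt_len: int, mem_len: int) -> float:
    """Average number of preceding positions visible per target token."""
    return mem_len + (tgt_len - 1) / 2.0

for tgt_len, mem_len in [(128, 1600), (384, 384), (2048, 2048), (1600, 128)]:
    print(f"tgt_len={tgt_len:4d}  mem_len={mem_len:4d}  "
          f"avg context ~= {avg_context(tgt_len, mem_len):7.1f}")
```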
This bug is caused by the TensorFlow 2.x GPU version's precision-retention method for parallel computation.
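If GPU numerical precision under TF 2.x is the suspected cause, one way to probe it is to force higher-precision arithmetic and re-run the evaluation. A minimal sketch, assuming a standard Keras setup rather than the exact code in this issue:

```python
# Minimal sketch for probing the precision hypothesis under TF 2.x; this is an
# illustration, not part of the repository's code.
import tensorflow as tf
from tensorflow import keras

# Run the whole model in float64 instead of the default float32; slow on GPU,
# but it largely removes accumulation error from the comparison.
keras.backend.set_floatx("float64")

# On newer TF releases (>= 2.4) with Ampere GPUs, TensorFloat-32 can also
# silently lower matmul precision; disabling it is another variable to test.
if hasattr(tf.config.experimental, "enable_tensor_float_32_execution"):
    tf.config.experimental.enable_tensor_float_32_execution(False)
```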
@menghuanlater Hi, I was curious whether you changed the environment to TF 1.12 and Python 2.7 as the author suggests. Did you manage to get the result of perplexity 18.03?
When I switched to TF 1.12 and Python 2.7, it was possible to get SOTA results by loading the pre-trained weights, and the same was true with the PyTorch 1.4 framework.
I use the pretrained-xl weights and the same vocab to build Transformer-XL Large (we use TensorFlow 2.0) to evaluate the test set. But in my experiments, I find that {tgt_len=128, mem_len=1600, clamp_len=1000} only reaches a test ppl around 35, {tgt_len=384, mem_len=384, clamp_len=1000} reaches around 24, and {tgt_len=2048, mem_len=2048, clamp_len=1000} reaches around 20, but none of these settings reach the paper's result of 18.3. Why?
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pickle
from DataService import DataObjForWT_PTB as DataObj
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "2"

# Vocabulary sizes per dataset (only wikitext-103 is used here).
vocab_size_dic = {
    "wikitext-103": 267736,
    "enwiki8": 0,
    "text8": 0
}


class Vanilla_XL(keras.Model):
    """Transformer-XL Large re-implementation (only the constructor is shown in this post)."""

    def __init__(self, dataset_name: str, segment_size: int, dropout_attn, dropout_norm, n_layers,
                 n_heads, d_embed, d_model, ffn_mul, cutoffs):
        super(Vanilla_XL, self).__init__()
        self.vocab_size = vocab_size_dic[dataset_name]
        self.segment_size = segment_size
        self.dropout_attn = dropout_attn
        self.dropout_norm = dropout_norm
        self.d_model = d_model
        self.d_embed = d_embed
        self.ffn_mul = ffn_mul
        self.cutoffs = cutoffs
        self.n_layers = n_layers
        self.n_heads = n_heads
        # (layer construction omitted in the original post)


class AdaptiveEmbedding(keras.layers.Layer):
    """Adaptive input embedding with cutoffs (body omitted in the original post)."""

    def __init__(self, cutoffs, embed_drop_rate, input_dim, out_dim, d_embed, div_value=4):
        super(AdaptiveEmbedding, self).__init__()
        assert isinstance(cutoffs, list)
        self.cutoffs = cutoffs
        self.input_dim = input_dim
        self.out_dim = out_dim
        self.d_embed = d_embed
        self.div_value = div_value


class AdaptiveSoftmax(keras.layers.Layer):
    """Adaptive softmax tied to the adaptive embedding."""

    def __init__(self, cutoffs, d_embed, adaptive_embedding_obj, div_value=4):
        super(AdaptiveSoftmax, self).__init__()
        self.cutoffs = cutoffs
        self.d_embed = d_embed
        self.div_value = div_value
        assert isinstance(adaptive_embedding_obj, AdaptiveEmbedding)
        self.adaptive_embedding_obj = adaptive_embedding_obj
        # Cluster embedding and bias are initialised from the released checkpoint weights.
        self.tail_clusters_embedding = keras.layers.Embedding(
            input_dim=len(self.cutoffs) - 2, output_dim=self.d_embed,
            weights=[tf.convert_to_tensor(pre_train_weights["transformer/adaptive_softmax/cutoff_0/cluster_W:0"])]
        )
        self.clusters_bias = tf.Variable(
            initial_value=tf.convert_to_tensor(pre_train_weights["transformer/adaptive_softmax/cutoff_0/cluster_b:0"]),
            dtype=tf.float32
        )


class SingleTransformerBlock(keras.layers.Layer):
    """One decoder block with relative attention (body omitted in the original post)."""

    def __init__(self, d_model, ffn_size, n_heads, dropout_attn, dropout_norm, cur_layer):
        super(SingleTransformerBlock, self).__init__()
        self.n_heads = n_heads
        self.cur_layer = cur_layer
        self.d_model = d_model


class GeneralFunction:
    @staticmethod
    def create_look_ahead_mask(q_len: int, k_len: int, same_length=True):
        # Causal mask over the concatenated (memory + current segment) keys;
        # with same_length=True every query row keeps the same number of unmasked positions.
        mask = tf.linalg.band_part(tf.ones(shape=(k_len, k_len), dtype=tf.float32), -1, 0)[-q_len:, ...]
        if same_length:
            x = mask[:, 0: q_len]
            y = mask[:, q_len:]
            x = tf.linalg.band_part(x, 0, -1)
            mask = tf.concat([x, y], axis=1)
        return mask


class Main:
    def __init__(self, **kwargs):
        self.kwargs = kwargs
        self.data_obj = DataObj(dataset_name=kwargs["dataset_name"], segment_size=kwargs["segment_size"],
                                pad_id=PAD, batch_size=batch_size)
        self.cache = self.get_init_cache()
        self.model = Vanilla_XL(
            dataset_name=kwargs["dataset_name"], n_heads=kwargs["n_heads"], n_layers=kwargs["n_layers"],
            dropout_norm=kwargs["dropout_norm"], dropout_attn=kwargs["dropout_attn"],
            d_embed=kwargs["d_embed"], ffn_mul=kwargs["ffn_mul"], segment_size=kwargs["segment_size"],
            cutoffs=kwargs["cutoffs"], d_model=kwargs["d_model"]
        )
        # (get_init_cache, train and eval methods omitted in the original post)


if __name__ == "__main__":
    # Pre-trained TF checkpoint weights, exported to a pickle file.
    with open("InitWeights/WT103/weights.p", "rb") as f:
        pre_train_weights = pickle.load(f)
    dataset = "wikitext-103"
    PAD = 0
    batch_size = 1
    G = GeneralFunction()
    # Adaptive softmax / embedding cutoffs for the WikiText-103 vocabulary.
    _cutoffs = [
        1, 20001, 40001, 200001, vocab_size_dic[dataset]
    ]
    a_epoch_segment = {
        "384": 268820 // batch_size,
        "512": 201615 // batch_size,
        "256": 403230 // batch_size
    }
    E = Main(dataset_name=dataset, segment_size=128, mem_len=1600, n_heads=16, d_model=1024, n_layers=18,
             d_embed=1024, batch_size=batch_size, dropout_attn=0.2, dropout_norm=0.2,
             ffn_mul=4, cutoffs=_cutoffs, method="AC001")
    E.train()
```
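For comparison with the configurations above, here is a minimal sketch of how segment-level evaluation with memory carry-over is usually wired up: the cached states are passed from one segment to the next and the loss is averaged per token over the whole split. The names `model`, `iterate_test_segments`, and the `(loss_sum, n_tokens, new_mems)` return signature are illustrative assumptions, not the repository's API.

```python
# Hedged sketch of a Transformer-XL style evaluation loop: tgt_len / mem_len
# only behave as in the paper if cached memories are carried across segments
# and the loss is averaged per token over the split.
import math

def evaluate(model, iterate_test_segments, n_layers):
    mems = [None] * n_layers               # per-layer cached hidden states
    total_nll, total_tokens = 0.0, 0

    for inputs, targets in iterate_test_segments():       # segments of tgt_len tokens
        loss_sum, n_tokens, mems = model.eval_step(inputs, targets, mems)
        total_nll += float(loss_sum)                       # sum of -log p over tokens
        total_tokens += int(n_tokens)

    return math.exp(total_nll / total_tokens)              # test perplexity
```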