TESGAN

Requirements

  • Python 3.6.9
  • PyTorch 1.10.1+cu113

Paper

Introduction

In this work, we apply Generative Adversarial Networks to synthesizing a text embedding space called a seed. Text Embedding Space GAN (TESGAN) does not explicitly refer to the training data because it is trained in an unsupervised manner similar to the original GAN framework; as a result, data memorization does not occur when synthesizing sentences. Additionally, unlike previous studies, TESGAN does not generate discrete tokens but instead creates a text embedding space, thereby avoiding the gradient backpropagation problem raised in SeqGAN.
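The backpropagation problem can be illustrated with a toy example (NumPy only; the shapes and weights here are hypothetical, not TESGAN's architecture): a generator that outputs a continuous embedding has a nonzero finite-difference sensitivity to its weights, while one that emits a discrete token via argmax is piecewise constant, so its gradient is zero almost everywhere.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # toy generator weights (hypothetical)
z = rng.normal(size=3)        # latent noise vector
eps = 1e-4

def embedding(W):
    # continuous output: a point in the text embedding space (a "seed")
    return W @ z

def token(W):
    # discrete output: index of the highest-scoring token
    return int(np.argmax(W @ z))

# finite-difference sensitivity with respect to one weight
W2 = W.copy()
W2[0, 0] += eps

d_embed = np.abs(embedding(W2) - embedding(W)).sum() / eps  # nonzero
d_token = (token(W2) - token(W)) / eps                      # zero almost everywhere
print(d_embed > 0, d_token == 0)
```

Because the embedding path is differentiable end to end, discriminator gradients can flow back into the generator without tricks such as policy gradients.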

Metric

  • We use the Fréchet BERT Distance (FBD) to compare results. FBD measures the quality and diversity of generated sentences; a lower value is better.
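    As a rough sketch (not the authors' implementation), the Fréchet distance between two Gaussians fitted to BERT sentence features of the real and generated sets can be computed as below; the BERT feature extraction step is omitted and random arrays stand in for the features.

    ```python
    import numpy as np

    def frechet_distance(x, y):
        """Fréchet distance between Gaussians fitted to two feature sets
        (rows = samples, columns = feature dimensions)."""
        mu1, mu2 = x.mean(axis=0), y.mean(axis=0)
        s1 = np.cov(x, rowvar=False)
        s2 = np.cov(y, rowvar=False)

        # matrix square root of s1 via eigendecomposition (s1 is symmetric PSD)
        vals, vecs = np.linalg.eigh(s1)
        sqrt_s1 = vecs @ np.diag(np.sqrt(np.clip(vals, 0, None))) @ vecs.T

        # Tr((s1 s2)^{1/2}) from the symmetric product sqrt_s1 @ s2 @ sqrt_s1
        prod_vals = np.linalg.eigvalsh(sqrt_s1 @ s2 @ sqrt_s1)
        tr_sqrt = np.sqrt(np.clip(prod_vals, 0, None)).sum()

        diff = mu1 - mu2
        return float(diff @ diff + np.trace(s1) + np.trace(s2) - 2 * tr_sqrt)

    rng = np.random.default_rng(0)
    real = rng.normal(size=(200, 8))        # stand-in for BERT features of real text
    fake = rng.normal(size=(200, 8)) + 0.5  # shifted: worse generated samples
    print(frechet_distance(real, real))  # ≈ 0 for identical sets
    ```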

  • The Multi-sets Jaccard (MSJ) measures the similarity between the generative model's output and the real distribution by comparing generated text samples. MS-Jaccard focuses on the similarity of n-gram frequencies between the two sets, considering the average frequency of each generated n-gram per sentence; a higher value is better.
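    A simplified MS-Jaccard for a single n can be sketched as follows (a toy illustration, not the paper's exact formulation): per-sentence average n-gram frequencies are compared via sums of element-wise minima and maxima.

    ```python
    from collections import Counter

    def ngram_freq(sentences, n):
        """Average n-gram frequency per sentence over a set of sentences."""
        counts = Counter()
        for s in sentences:
            toks = s.split()
            counts.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
        return {g: c / len(sentences) for g, c in counts.items()}

    def ms_jaccard(real, generated, n=2):
        """Jaccard-style similarity of averaged n-gram frequencies (higher is better)."""
        p, q = ngram_freq(real, n), ngram_freq(generated, n)
        grams = set(p) | set(q)
        num = sum(min(p.get(g, 0), q.get(g, 0)) for g in grams)
        den = sum(max(p.get(g, 0), q.get(g, 0)) for g in grams)
        return num / den if den else 0.0

    real = ["the cat sat on the mat", "the dog sat on the rug"]
    print(ms_jaccard(real, real))       # identical sets score 1.0
    print(ms_jaccard(real, ["a b c"]))  # disjoint n-grams score 0.0
    ```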

  • LM measures the quality of generated samples: bad samples score poorly under a well-trained language model.

  • SBL measures the diversity of generated samples based on token combinations.
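    Assuming SBL denotes a Self-BLEU-style score, a heavily simplified sketch (unigram and bigram precision only, no brevity penalty; helper names are my own) scores each generated sentence against the rest, so near-duplicate outputs push the score up, i.e. indicate lower diversity.

    ```python
    from collections import Counter

    def ngrams(toks, n):
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    def precision(hyp, refs, n):
        """Clipped n-gram precision of hyp against the union of refs."""
        h = ngrams(hyp, n)
        r = Counter()
        for ref in refs:
            r |= ngrams(ref, n)  # element-wise max of counts
        overlap = sum(min(c, r[g]) for g, c in h.items())
        total = sum(h.values())
        return overlap / total if total else 0.0

    def self_bleu(sentences):
        """Average simplified BLEU of each sentence vs. the others (lower = more diverse)."""
        toks = [s.split() for s in sentences]
        scores = []
        for i, hyp in enumerate(toks):
            refs = toks[:i] + toks[i + 1:]
            p1, p2 = precision(hyp, refs, 1), precision(hyp, refs, 2)
            scores.append((p1 * p2) ** 0.5)
        return sum(scores) / len(scores)

    dup = ["the cat sat", "the cat sat", "the cat sat"]
    div = ["the cat sat", "a dog ran home", "birds fly south"]
    print(self_bleu(dup) > self_bleu(div))  # duplicates score higher
    ```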

  • Data Synthesis Ratio (DSR)

    We calculate the Data Synthesis Ratio (DSR) to evaluate the data memorization ratio and synthesis diversity. DSR is the harmonic mean of the ratio of samples without data memorization and the diversity ratio; a higher value is better.
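    Under the description above, DSR can be sketched as a plain harmonic mean of the two ratios (the variable names here are my own, not the paper's notation):

    ```python
    def dsr(non_memorization_ratio, diversity_ratio):
        """Harmonic mean of the non-memorization ratio and the diversity
        ratio; both in [0, 1], and a higher result is better."""
        if non_memorization_ratio + diversity_ratio == 0:
            return 0.0
        return (2 * non_memorization_ratio * diversity_ratio
                / (non_memorization_ratio + diversity_ratio))

    print(round(dsr(0.9, 0.6), 4))  # 0.72
    ```

    As with any harmonic mean, a model that memorizes heavily or collapses to a few outputs drags the score toward zero even if the other ratio is high.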

Testing

  • Seed Interpretation Model

    python3 src/main.py -d gpu -m train
    
    • You have to set "model" and "max_len" in src/config.json to "interpretation" and 64, respectively.

    • For a more detailed setup, you can refer to the model/interp_sigmoid/interp_sigmoid.json file.

  • TESGAN

    python3 src/main.py -d gpu -m train
    
    • You have to set "model" and "max_len" in src/config.json to "tesgan" and 16, respectively.

    • If you want to train P-TESGAN, set "perturbed" to 1 in src/config.json.

    • For a more detailed setup, you can refer to the model/tesgan_sigmoid/tesgan_sigmoid.json or model/tesganP_sigmoid/tesganP_sigmoid.json files.
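    For reference, the TESGAN settings described above would look roughly like this in src/config.json (only the keys named in this README are shown; the actual file contains more fields, so consult the provided model/*.json examples for the full set):

    ```json
    {
        "model": "tesgan",
        "max_len": 16,
        "perturbed": 1
    }
    ```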

  • Text Synthesizing

    python3 src/main.py -d gpu -m syn -n {model file name} --interp-name {seed interpretation model dir}
    
    • After training both the seed interpretation model and TESGAN, you can synthesize sentences with your TESGAN model.

    • You must give the file name with a .pt extension as the -n argument.

    • For example

      python3 src/main.py -d gpu -m syn -n tesgan_sigmoid_17.pt --interp-name interp_sigmoid
      
  • Post-processing Synthesizing Results

    python3 etc/pp.py --input-path {synthesized txt path} --output-path {pp output txt path}
    
    • To obtain more realistic synthesized text, we provide a simple post-processing script.

    • For example

      python3 etc/pp.py --input-path syn/syn.txt --output-path syn/pp/syn.txt
      

Acknowledgement
