- Python 3.6.9
- PyTorch 1.10.1+cu113
- TESGAN paper (Accepted at NEJLT)
In this experiment, we apply Generative Adversarial Networks to synthesize a text embedding space called a seed.
Text Embedding Space GAN (TESGAN) does not explicitly refer to the training data because it trains in an unsupervised way, similar to the original GAN framework.
Thus, data memorization does not occur when synthesizing sentences.
Additionally, unlike previous studies, TESGAN does not generate discrete tokens but instead creates a text embedding space, thereby avoiding the gradient backpropagation problem raised in SeqGAN.
-
We use the Fréchet BERT Distance (FBD) to compare results. FBD measures the quality and diversity of generated sentences; a lower value is better.
-
The Multi-sets Jaccard (MSJ) measures the similarity between the generative model's distribution and the real distribution by comparing generated text samples. MS-Jaccard focuses on the similarity of n-gram frequencies between the two sets, taking into account the average frequency of each generated n-gram per sentence; a higher value is better.
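The core of MS-Jaccard can be sketched as a Jaccard similarity over multisets of n-grams (intersection and union taken as element-wise min and max of counts). This is a minimal illustration only; it omits the per-sentence frequency normalization and n-gram averaging that the full metric uses, and the function names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def multiset_jaccard(real_sents, gen_sents, n=2):
    """Jaccard similarity between the n-gram multisets of two corpora:
    sum of min counts over sum of max counts. Higher means the generated
    n-gram distribution is closer to the real one."""
    real = Counter(g for s in real_sents for g in ngrams(s.split(), n))
    gen = Counter(g for s in gen_sents for g in ngrams(s.split(), n))
    keys = set(real) | set(gen)
    inter = sum(min(real[k], gen[k]) for k in keys)
    union = sum(max(real[k], gen[k]) for k in keys)
    return inter / union if union else 0.0
```

Identical corpora score 1.0 and corpora sharing no n-grams score 0.0.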
-
LM measures the quality of generated samples: bad samples receive poor scores under a well-trained language model.
-
SBL measures the diversity of generated samples based on token combinations.
-
We calculate the Data Synthesis Ratio (DSR) to evaluate the data memorization ratio and synthesis diversity. DSR is the harmonic mean of the ratio of samples without data memorization and the diversity ratio; a higher value is better.
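As a worked example of the harmonic-mean definition above, DSR can be computed from the two ratios directly (the function name and argument names here are illustrative, not from the codebase):

```python
def dsr(non_memorized_ratio, diversity_ratio):
    """Data Synthesis Ratio: harmonic mean of the ratio of samples
    without data memorization and the diversity ratio.
    Both inputs are in [0, 1]; higher is better."""
    total = non_memorized_ratio + diversity_ratio
    if total == 0:
        return 0.0
    return 2 * non_memorized_ratio * diversity_ratio / total
```

For instance, with no memorization (ratio 1.0) but diversity 0.5, DSR is 2/3, penalizing the weaker of the two properties.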
-
python3 src/main.py -d gpu -m train
-
You have to set "model" and "max_len" in src/config.json to "interpretation" and 64, respectively.
-
For a more detailed setup, you can refer to the model/interp_sigmoid/interp_sigmoid.json file.
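A minimal config sketch for this step (only the "model" and "max_len" keys are stated in this README; the shipped model/interp_sigmoid/interp_sigmoid.json contains the full set of options):

```json
{
    "model": "interpretation",
    "max_len": 64
}
```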
-
-
python3 src/main.py -d gpu -m train
-
You have to set "model" and "max_len" in src/config.json to "tesgan" and 16, respectively.
-
If you want to train P-TESGAN, set "perturbed" to 1 in src/config.json.
-
For a more detailed setup, you can refer to the model/tesgan_sigmoid/tesgan_sigmoid.json or model/tesganP_sigmoid/tesganP_sigmoid.json files.
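A minimal config sketch for this step (the "model", "max_len", and "perturbed" keys are the ones named in this README; "perturbed": 1 applies only when training P-TESGAN, and the shipped JSON files contain the full set of options):

```json
{
    "model": "tesgan",
    "max_len": 16,
    "perturbed": 1
}
```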
-
-
python3 src/main.py -d gpu -m syn -n {model file name} --interp-name {seed interpretation model dir}
-
After training both the seed interpretation model and TESGAN, you can synthesize sentences with your TESGAN model.
-
You must give the file name with a .pt extension as the -n argument.
-
For example:
python3 src/main.py -d gpu -m syn -n tesgan_sigmoid_17.pt --interp-name interp_sigmoid
-
-
python3 etc/pp.py --input-path {synthesized txt path} --output-path {pp output txt path}
-
For more realistic synthesized text, we provide a simple post-processing script.
-
For example:
python3 etc/pp.py --input-path syn/syn.txt --output-path syn/pp/syn.txt
-
- multiset_distances.py and bert_distances.py are based on IAmS4n's implementation. Many thanks to the authors.
- The DailyDialog dataset was used in this experiment. Many thanks to the authors.