CoAS: Composite Audio Steganography Based on Text and Speech Synthesis

This repository hosts the official PyTorch implementation of the paper "CoAS: Composite Audio Steganography Based on Text and Speech Synthesis" (accepted by IEEE TIFS 2025).

Method

We propose Composite Audio Steganography (CoAS), a method based on text and speech synthesis. It exploits the multi-carrier nature of audio data by using side-channel information from the transcript to support the steganographic process. We first map the secret message to Gaussian noise in a distribution-preserving manner and embed that noise into the audio generation process of a diffusion model. Using a fixed random seed as the key would reduce audio diversity, so we instead embed the key into the audio text, and the receiver retrieves it via speech recognition. The system can therefore select a random key for each transmission, which ensures accurate message extraction while keeping the generated audio diverse for better concealment.
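The idea of a distribution-preserving mapping from message bits to Gaussian noise can be illustrated with a minimal sketch. This is not the repository's implementation: it assumes the message is a uniformly random bit string (for example, already encrypted) and uses a standard inverse-CDF construction, in which each group of `payload` bits selects one of 2^payload equal-probability intervals of the standard normal distribution. The function names `bits_to_noise` and `noise_to_bits` are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def bits_to_noise(bits, payload, rng):
    """Map a bit string to standard-normal samples, `payload` bits per sample."""
    if len(bits) % payload != 0:
        raise ValueError("bit string length must be a multiple of payload")
    n_bins = 2 ** payload
    samples = []
    for i in range(0, len(bits), payload):
        index = int(bits[i:i + payload], 2)
        # Uniform position inside the chosen bin. If the bits are uniform,
        # u is uniform on [0, 1), so inv_cdf(u) follows N(0, 1).
        u = (index + rng.random()) / n_bins
        u = min(max(u, 1e-12), 1 - 1e-12)  # inv_cdf is undefined at 0 and 1
        samples.append(STD_NORMAL.inv_cdf(u))
    return samples


def noise_to_bits(samples, payload):
    """Recover the bit string by finding the bin each sample falls into."""
    n_bins = 2 ** payload
    chunks = []
    for x in samples:
        index = min(int(STD_NORMAL.cdf(x) * n_bins), n_bins - 1)
        chunks.append(format(index, f"0{payload}b"))
    return "".join(chunks)


# Illustrative round trip with made-up values.
rng = random.Random(2025)
message_bits = "0110100111010010"
noise = bits_to_noise(message_bits, payload=4, rng=rng)
recovered = noise_to_bits(noise, payload=4)
```

The sketch round-trips exact samples only; it does not model how the noise is recovered from the stego audio on the receiver side.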

Getting Started

The modules used in the CoAS system are being separated out and released step by step.

Provably Secure Linguistic Steganography

In the CoAS system, you can use any provably secure linguistic steganography method to embed the random number seed into the audio text. We do not cover the details here; one method we recommend is Discop.

Audio Steganography

The audio steganography module in CoAS is based on FastDiff and ProDiff and is implemented for the text-to-speech (TTS) task.

git clone https://github.com/meterial/CoAS.git
cd CoAS
conda env create -f environment.yml 
conda activate coas

We directly use the pre-trained audio generation models provided by Rongjie Huang. You can also train your own models by following the instructions in FastDiff and ProDiff, then place your checkpoints in checkpoints/$method_name$/model_ckpt_steps_.ckpt. The expected layout is:

── checkpoints/
    ├── FastDiff
    │   ├──config.yaml
    │   └──model_ckpt_steps_500000.ckpt
    ├── ProDiff
    │   ├──config.yaml
    │   └──model_ckpt_steps_200000.ckpt
    └── ProDiff_Teacher
        ├──config.yaml
        └──model_ckpt_steps_188000.ckpt
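Before running inference, it can help to confirm that the checkpoint layout above is in place. The following is a minimal sketch of such a check, not part of the repository: `check_checkpoints` is a hypothetical helper, and it assumes only the directory names and file patterns shown in the tree.

```python
from pathlib import Path

# Directory names taken from the layout shown above.
EXPECTED_METHODS = ("FastDiff", "ProDiff", "ProDiff_Teacher")


def check_checkpoints(root="checkpoints"):
    """Return {method: [problems]} for every method folder that looks incomplete."""
    problems = {}
    for method in EXPECTED_METHODS:
        folder = Path(root) / method
        if not folder.is_dir():
            problems[method] = ["missing directory"]
            continue
        issues = []
        if not (folder / "config.yaml").is_file():
            issues.append("missing config.yaml")
        if not any(folder.glob("model_ckpt_steps_*.ckpt")):
            issues.append("no model_ckpt_steps_*.ckpt file")
        if issues:
            problems[method] = issues
    return problems


if __name__ == "__main__":
    for method, issues in check_checkpoints().items():
        print(f"{method}: {', '.join(issues)}")
```

An empty result means every expected folder contains a config file and at least one checkpoint file.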

Message Embedding

In the message embedding phase of CoAS, the sender needs, in addition to the secret message, the audio text and the random number seed used in the provably secure linguistic steganography step above.

python inference/CoAS.py embed --text $audio_text$ --message $secret_message$ --seed $random_number_seed$

The stego audio will be saved as inferout/$audio_text$.wav.
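When the embed command is driven from Python, passing the arguments as a list avoids shell quoting problems with audio text that contains spaces or punctuation. The sketch below only assembles the command shown above; `build_embed_cmd` is a hypothetical helper, the example values are made up, and the exact format the script expects for the message and seed is an assumption.

```python
import shlex


def build_embed_cmd(text, message, seed, script="inference/CoAS.py"):
    """Build the argument list for the embed command shown above."""
    return [
        "python", script, "embed",
        "--text", text,
        "--message", message,
        "--seed", str(seed),
    ]


# Made-up example values.
cmd = build_embed_cmd("Hello world, this is a test.", "secret", 1234)
print(shlex.join(cmd))  # shell-safe rendering of the same command
# To execute from the repository root: subprocess.run(cmd, check=True)
```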

Message Extraction

During the message extraction phase of CoAS, the receiver needs to use the same audio text and random number seed as the sender to ensure correct extraction.

python inference/CoAS.py extract --text $audio_text$ --audio $stego_audio_path$ --seed $random_number_seed$

The audio text can be recovered with the speech recognition method described below, and the random number seed can be extracted from the audio text with the provably secure linguistic steganography algorithm described above.
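The receiver-side order of operations (transcribe the audio, recover the seed from the transcript, then run extraction) can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: `transcribe` and `recover_seed` are hypothetical callables that you would supply yourself, standing in for the speech recognition step and the linguistic steganography decoder, and the stub values are made up.

```python
def build_extract_cmd(audio_path, transcribe, recover_seed,
                      script="inference/CoAS.py"):
    """Chain the receiver steps and build the extract command's arguments.

    transcribe(audio_path) -> str   (hypothetical speech recognition wrapper)
    recover_seed(text) -> int       (hypothetical linguistic stego decoder)
    """
    text = transcribe(audio_path)
    seed = recover_seed(text)
    return [
        "python", script, "extract",
        "--text", text,
        "--audio", str(audio_path),
        "--seed", str(seed),
    ]


# Stubs with made-up values, used only to show the data flow.
def fake_transcribe(audio_path):
    return "Hello world, this is a test."


def fake_recover_seed(text):
    return 1234


extract_cmd = build_extract_cmd("inferout/example.wav",
                                fake_transcribe, fake_recover_seed)
```

Extraction depends on the transcript and seed matching the sender's, so any error in the transcript propagates to both arguments.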

Speech Recognition

CoAS uses existing speech recognition models such as parakeet and whisper. You can run speech recognition with one of the following commands.

python speech_reco/asr.py parakeet -a $audio$
python speech_reco/asr.py whisper -a $audio$

Additional Notes

The default payload is 4; you can change it in modules/FastDiff/module/util.py. Keep the payload the same for embedding and extraction, otherwise the secret message cannot be extracted.
The sampling rate of the audio files generated by the diffusion models is 22.05 kHz.
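To check that a file still has the sampling rate stated above (for example, after it has passed through another tool), a small check like the following can be used. It is a sketch that assumes the file is a PCM WAV readable by Python's standard `wave` module; `has_expected_rate` is a hypothetical helper, not part of the repository.

```python
import wave

EXPECTED_RATE = 22050  # 22.05 kHz, as stated in the note above


def has_expected_rate(path, expected=EXPECTED_RATE):
    """Return True if the WAV file at `path` uses the expected sampling rate."""
    with wave.open(str(path), "rb") as wav:
        return wav.getframerate() == expected
```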

Acknowledgements

We borrow heavily from the code of FastDiff and ProDiff, and we thank the authors for sharing it.

Citation

If you find our work useful for your research, please consider citing the following paper:

@ARTICLE{11036088,
 author={Li, Yiming and Chen, Kejiang and Wang, Yaofei and Zhang, Xin and Wang, Guanjie and Zhang, Weiming and Yu, Nenghai},
 journal={IEEE Transactions on Information Forensics and Security}, 
 title={CoAS: Composite Audio Steganography Based on Text and Speech Synthesis}, 
 year={2025},
 volume={20},
 number={},
 pages={5978-5991},
 keywords={Steganography;Diffusion models;Security;Speech synthesis;Receivers;Noise reduction;Gaussian noise;Entropy;Reviews;Linguistics;Steganography;provably secure;text;audio;diffusion model},
 doi={10.1109/TIFS.2025.3579581} }

