
All About Speech

This repository organizes papers, learning materials, and code for understanding speech. There is a separate repository for machine/deep learning here.

To Dos:

  • organize stars
  • add more papers
    • papers to read:
      1. Speech-T: Transducer for TTS and Beyond

TTS

  • TTS

    • DC-TTS [paper] [pytorch] [tensorflow]
    • Microsoft's LightSpeech [paper] [code]
    • SpeechFormer [paper] [code]
    • Non-Attentive Tacotron [paper] [pytorch]
    • Parallel Tacotron 2 [paper] [code]
    • FCL-taco2: Fast, Controllable and Lightweight version of Tacotron2 [paper] [code]
    • Transformer TTS: Neural Speech Synthesis with Transformer Network [paper] [code]
    • VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech [paper] [code]
    • Reformer-TTS (adaptation of Reformer to TTS) [code]
  • Prompt-based TTS (see [link])

  • Voice Conversion / Voice Cloning / Speaker Embedding

    • StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks [paper] [code]
    • Neural Voice Cloning with Few Audio Samples (Baidu) [paper] [code]
    • Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques [paper] [code]
    • Unet-TTS: Improving Unseen Speaker and Style Transfer in One-Shot Voice Cloning [paper] [code]
    • FragmentVC: Any-to-any voice conversion by end-to-end extracting and fusing fine-grained voice fragments with attention [paper] [code]
    • VectorQuantizedCPC: Vector-Quantized Contrastive Predictive Coding for Acoustic Unit Discovery and Voice Conversion [paper] [code]
    • Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data [paper] [code]
    • Again-VC: A One-shot Voice Conversion using Activation Guidance and Adaptive Instance Normalization [paper] [code]
    • AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss [paper] [code]
    • SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model [code]
    • Deep Speaker: an End-to-End Neural Speaker Embedding System [paper] [code]
    • VQMIVC: One-shot (any-to-any) Voice Conversion [paper] [code]
  • Style (Emotion, Prosody)

    • SMART-TTS Single Emotional TTS [code]
    • Cross Speaker Emotion Transfer [paper] [code]
    • AutoPST: Global Rhythm Style Transfer Without Text Transcriptions [paper] [code]
    • Transforming spectrum and prosody for emotional voice conversion with non-parallel training data [paper] [code]
    • Multi-Reference Neural TTS Stylization with Adversarial Cycle Consistency [paper] [code]
    • Learning Latent Representations for Style Control and Transfer in End-to-end Speech Synthesis (Tacotron-VAE) [paper] [code]
    • Time Domain Neural Audio Style Transfer (NIPS 2017) [paper] [code]
    • Meta-StyleSpeech and StyleSpeech [paper] [code]
    • Cross-Speaker Emotion Transfer Based on Speaker Condition Layer Normalization and Semi-Supervised Training in Text-to-Speech [paper] [code]
  • Cross-lingual

    • End-to-End Code-switching TTS with Cross-Lingual Language Model
      • Mandarin and English
      • cross-lingual and multi-speaker
      • baseline: "Building a mixed-lingual neural TTS system with only monolingual data"
    • Building a mixed-lingual neural TTS system with only monolingual data
    • Transfer Learning, Style Control, and Speaker Reconstruction Loss for Zero-Shot Multilingual Multi-Speaker Text-to-Speech on Low-Resource Languages
      • has many good references
    • Exploring Disentanglement with Multilingual and Monolingual VQ-VAE [paper] [code]
  • Music Related

    • Learning the Beauty in Songs: Neural Singing Voice Beautifier (ACL 2022) [paper] [code]
    • Speech to Singing (Interspeech 2020) [paper] [code]
    • DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (AAAI 2022) [paper] [code]
    • A Universal Music Translation Network (ICLR 2019)
    • Jukebox: A Generative Model for Music (OpenAI) [paper] [code]
  • Toolkits

    • IMS Toucan Speech Synthesis Toolkit [paper] [code]
    • CREPE pitch tracker [code]
    • SpeechBrain - Useful tools to facilitate speech research [code]
  • Vocoders

  • Attention

ASR

  • Towards End-to-End Spoken Language Understanding

Speech Classification, Detection, Filter, etc.

  • HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection [paper] [code]
  • Google AI's VoiceFilter System [paper] [code]
  • Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning (Interspeech 2019) [paper] [code]
  • Multimodal Emotion Recognition with Transformer-Based Self Supervised Feature Fusion [paper] [code]
  • Emotion Recognition from Speech Using Wav2vec 2.0 Embeddings (Interspeech 2021) [paper] [code]
  • Exploring Wav2vec 2.0 fine-tuning for improved speech emotion recognition [paper] [code]
  • Rethinking CNN Models for Audio Classification [paper] [code]
  • EEG-based emotion recognition using SincNet [paper] [code]

Speaker Verification

  • Cross attentive pooling for speaker verification (IEEE SLT 2021) [paper] [code]
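Whatever pooling produces the embedding, a verification trial is typically scored by cosine similarity between two fixed-dimensional speaker embeddings and compared against a tuned threshold. A minimal sketch (the threshold value and function names are illustrative, not from the paper above):

```python
import numpy as np

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embedding vectors."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(np.dot(a, b))

def same_speaker(emb_a, emb_b, threshold=0.7):
    """Accept the trial if the score clears the decision threshold."""
    return cosine_score(emb_a, emb_b) >= threshold
```

In practice the threshold is chosen on a development set, e.g. at the equal error rate operating point.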

Linguistics


Datasets

  1. VGGSound: A Large-scale Audio-Visual Dataset [paper] [code]
  2. CSS10: A collection of single-speaker speech datasets for 10 languages [code]
  3. IEMOCAP: 12 hours of audiovisual data from 10 actors (5 male, 5 female) [website]
  4. VoxCeleb [repo]

Data Augmentation

  1. Audiomentations (fast audio data augmentation in PyTorch) [code]
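Augmentation libraries like this compose transforms such as noise addition, time stretch, and pitch shift. The core of additive-noise augmentation at a target SNR fits in a few lines of NumPy; the sketch below is a generic illustration, not Audiomentations' actual API:

```python
import numpy as np

def add_gaussian_noise(x, snr_db, seed=None):
    """Add white Gaussian noise so the result has roughly the given SNR in dB."""
    rng = np.random.default_rng(seed)
    signal_power = np.mean(x ** 2)
    # Scale noise power so that 10*log10(signal/noise) == snr_db.
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=x.shape)
    return x + noise
```

Randomizing `snr_db` per training example is the usual way to get variety without changing labels.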

Aligners

  1. Montreal Forced Aligner

Data (Pre)processing / Augmentation

  • Data (pre)processing
  1. Korean pronunciation and romanization based on the Wiktionary ko-pron Lua module [code]
  2. Audio Signal Processing [code]
  3. Phonological Features (for the paper "Phonological features for 0-shot multilingual speech synthesis") [paper] [code]
  4. SMART-G2P (converts English and Kanji expressions in Korean sentences into Korean pronunciation) [code]
  5. Kakao Grapheme-to-Phoneme Conversion Package for Mandarin [code]
  6. Webaverse Speech Tool [code]
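G2P tools like the two above map written forms to pronunciations. At their simplest they are a lexicon lookup with a fallback rule for out-of-vocabulary words; the sketch below uses a tiny made-up lexicon with ARPAbet-style symbols purely to illustrate the interface (none of the entries come from either package):

```python
# Toy grapheme-to-phoneme conversion: lexicon lookup with a
# letter-by-letter spelling fallback for out-of-vocabulary words.
LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "hello":  ["HH", "AH", "L", "OW"],
}

# Pronunciations of letter names, used when a word is not in the lexicon.
LETTER_NAMES = {"a": ["EY"], "b": ["B", "IY"], "c": ["S", "IY"]}

def g2p(word):
    """Return a phoneme sequence for `word` (case-insensitive)."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    # OOV fallback: spell the word out letter by letter.
    phones = []
    for ch in word:
        phones.extend(LETTER_NAMES.get(ch, ["UNK"]))
    return phones
```

Real packages add rule-based or neural fallbacks instead of spelling, but the lexicon-first structure is the same.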

Verification

  1. MCD [repo]
  • The code works, but I am not sure it is correct: the MCD values come out too high even for pairs of similar audio clips.
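For reference, mel-cepstral distortion between two frame-aligned MCEP sequences is usually computed as (10/ln 10) · sqrt(2 · Σ_d (c_d − c′_d)²), averaged over frames, with the 0th (energy) coefficient excluded; differing conventions on frame alignment and c0 handling are a common reason implementations disagree. A minimal version, assuming the inputs are already time-aligned:

```python
import numpy as np

def mel_cepstral_distortion(ref, syn):
    """MCD in dB between two (frames, dims) MCEP arrays, c0 excluded."""
    ref, syn = np.asarray(ref, float), np.asarray(syn, float)
    assert ref.shape == syn.shape, "sequences must be frame-aligned"
    diff = ref - syn
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return (10.0 / np.log(10.0)) * float(np.mean(per_frame))
```

If an implementation skips the DTW alignment step or includes c0, its numbers will not be comparable to published MCD values.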

Other Research That May Help

  • Text to Image Synthesis
  • AudioMAE (Masked Autoencoders that Listen) [code]

Organizations

Other Repositories to Refer to - Speech Included/Related

Learning Materials

  1. Digital Signal Processing Lecture [link]
  2. Ratsgo's Speechbook [link]
  3. YSDA Course in Speech Processing [code]
  4. NHN Forward YouTube video [link]