
All About Speech

This repository organizes papers, learning materials, and code for understanding speech. There is a separate repository for machine/deep learning here.

To Dos:

  • organize stars
  • add more papers
    • papers to read:
      1. Speech-T: Transducer for TTS and Beyond

TTS

  • TTS

    • DC-TTS [paper] [pytorch] [tensorflow]
    • Microsoft's LightSpeech [paper] [code]
    • SpeechFormer [paper] [code]
    • Non-Attentive Tacotron [paper] [pytorch]
    • Parallel Tacotron 2 [paper] [code]
    • FCL-taco2: Fast, Controllable and Lightweight version of Tacotron2 [paper] [code]
    • Transformer TTS: Neural Speech Synthesis with Transformer Network [paper] [code]
    • VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech [paper] [code]
    • Reformer-TTS (adaptation of Reformer to TTS) [code]
  • Prompt-based TTS (see [link])

  • Voice Conversion / Voice Cloning / Speaker Embedding

    • StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks [paper] [code]
    • Neural Voice Cloning with Few Audio Samples (Baidu) [paper] [code]
    • Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques [paper] [code]
    • Unet-TTS: Improving Unseen Speaker and Style Transfer in One-Shot Voice Cloning [paper] [code]
    • FragmentVC: Any-to-any voice conversion by end-to-end extracting and fusing fine-grained voice fragments with attention [paper] [code]
    • VectorQuantizedCPC: Vector-Quantized Contrastive Predictive Coding for Acoustic Unit Discovery and Voice Conversion [paper] [code]
    • Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data [paper] [code]
    • Again-VC: A One-shot Voice Conversion using Activation Guidance and Adaptive Instance Normalization [paper] [code]
    • AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss [paper] [code]
    • SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model [code]
    • Deep Speaker: an End-to-End Neural Speaker Embedding System [paper] [code]
    • VQMIVC: One-shot (any-to-any) Voice Conversion [paper] [code]
  • Style (Emotion, Prosody)

    • SMART-TTS Single Emotional TTS [code]
    • Cross Speaker Emotion Transfer [paper] [code]
    • AutoPST: Global Rhythm Style Transfer Without Text Transcriptions [paper] [code]
    • Transforming spectrum and prosody for emotional voice conversion with non-parallel training data [paper] [code]
    • Multi-Reference Neural TTS Stylization with Adversarial Cycle Consistency [paper] [code]
    • Learning Latent Representations for Style Control and Transfer in End-to-end Speech Synthesis (Tacotron-VAE) [paper] [code]
    • Time Domain Neural Audio Style Transfer (NIPS 2017) [paper] [code]
    • Meta-StyleSpeech and StyleSpeech [paper] [code]
    • Cross-Speaker Emotion Transfer Based on Speaker Condition Layer Normalization and Semi-Supervised Training in Text-to-Speech [paper] [code]
  • Cross-lingual

    • End-to-End Code-switching TTS with Cross-Lingual Language Model
      • Mandarin and English
      • cross-lingual and multi-speaker
      • baseline: "Building a mixed-lingual neural TTS system with only monolingual data"
    • Building a mixed-lingual neural TTS system with only monolingual data
    • Transfer Learning, Style Control, and Speaker Reconstruction Loss for Zero-Shot Multilingual Multi-Speaker Text-to-Speech on Low-Resource Languages
      • has many good references
    • Exploring Disentanglement with Multilingual and Monolingual VQ-VAE [paper] [code]
  • Music Related

    • Learning the Beauty in Songs: Neural Singing Voice Beautifier (ACL 2022) [paper] [code]
    • Speech to Singing (Interspeech 2020) [paper] [code]
    • DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (AAAI 2022) [paper] [code]
    • A Universal Music Translation Network (ICLR 2019)
    • Jukebox: A Generative Model for Music (OpenAI) [paper] [code]
  • Toolkits

    • IMS Toucan Speech Synthesis Toolkit [paper] [code]
    • CREPE pitch tracker [code]
    • SpeechBrain - Useful tools to facilitate speech research [code]
  • Vocoders

  • Attention

ASR

  • Towards End-to-End Spoken Language Understanding

Speech Classification, Detection, Filter, etc.

  • HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection [paper] [code]
  • Google AI's VoiceFilter System [paper] [code]
  • Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning (Interspeech 2019) [paper] [code]
  • Multimodal Emotion Recognition with Transformer-Based Self Supervised Feature Fusion [paper] [code]
  • Emotion Recognition from Speech Using Wav2vec 2.0 Embeddings (Interspeech 2021) [paper] [code]
  • Exploring Wav2vec 2.0 fine-tuning for improved speech emotion recognition [paper] [code]
  • Rethinking CNN Models for Audio Classification [paper] [code]
  • EEG-based emotion recognition using SincNet [paper] [code]

Speaker Verification

  • Cross attentive pooling for speaker verification (IEEE SLT 2021) [paper] [code]
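Whatever pooling produces the embedding, a verification trial is typically scored by cosine similarity between two fixed-dimensional speaker embeddings and compared against a tuned threshold. A minimal sketch (the threshold value and function names are illustrative, not from the paper above):

```python
import numpy as np

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embedding vectors."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(np.dot(a, b))

def same_speaker(emb_a, emb_b, threshold=0.7):
    """Accept the trial if the score clears the decision threshold."""
    return cosine_score(emb_a, emb_b) >= threshold
```

In practice the threshold is chosen on a development set, e.g. at the equal error rate operating point.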

Linguistics


Datasets

  1. VGGSound: A Large-scale Audio-Visual Dataset [paper] [code]
  2. CSS10: A collection of single-speaker speech datasets for 10 languages [code]
  3. IEMOCAP: 12 hours of audiovisual data from 10 actors (5 male, 5 female) [website]
  4. VoxCeleb [repo]

Data Augmentation

  1. Audiomentations (fast audio data augmentation in PyTorch) [code]
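Augmentation libraries like this compose transforms such as noise addition, time stretch, and pitch shift. The core of additive-noise augmentation at a target SNR fits in a few lines of NumPy; the sketch below is a generic illustration, not Audiomentations' actual API:

```python
import numpy as np

def add_gaussian_noise(x, snr_db, seed=None):
    """Add white Gaussian noise so the result has roughly the given SNR in dB."""
    rng = np.random.default_rng(seed)
    signal_power = np.mean(x ** 2)
    # Scale noise power so that 10*log10(signal/noise) == snr_db.
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=x.shape)
    return x + noise
```

Randomizing `snr_db` per training example is the usual way to get variety without changing labels.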

Aligners

  1. Montreal Forced Aligner

Data (Pre)processing / Augmentation

  • Data (pre)processing
  1. Korean pronunciation and romanization based on the Wiktionary ko-pron Lua module [code]
  2. Audio Signal Processing [code]
  3. Phonological Features (for the paper "Phonological features for 0-shot multilingual speech synthesis") [paper] [code]
  4. SMART-G2P (converts English and Kanji expressions in Korean sentences into Korean pronunciation) [code]
  5. Kakao Grapheme-to-Phoneme Conversion Package for Mandarin [code]
  6. Webaverse Speech Tool [code]
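G2P tools like the two above map written forms to pronunciations. At their simplest they are a lexicon lookup with a fallback rule for out-of-vocabulary words; the sketch below uses a tiny made-up lexicon with ARPAbet-style symbols purely to illustrate the interface (none of the entries come from either package):

```python
# Toy grapheme-to-phoneme conversion: lexicon lookup with a
# letter-by-letter spelling fallback for out-of-vocabulary words.
LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "hello":  ["HH", "AH", "L", "OW"],
}

# Pronunciations of letter names, used when a word is not in the lexicon.
LETTER_NAMES = {"a": ["EY"], "b": ["B", "IY"], "c": ["S", "IY"]}

def g2p(word):
    """Return a phoneme sequence for `word` (case-insensitive)."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    # OOV fallback: spell the word out letter by letter.
    phones = []
    for ch in word:
        phones.extend(LETTER_NAMES.get(ch, ["UNK"]))
    return phones
```

Real packages add rule-based or neural fallbacks instead of spelling, but the lexicon-first structure is the same.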

Verification

  1. MCD [repo]
  • The code works, but I am not sure it is correct: the MCD values come out too high even for pairs of similar audio clips.
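For reference, mel-cepstral distortion between two frame-aligned MCEP sequences is usually computed as (10/ln 10) · sqrt(2 · Σ_d (c_d − c′_d)²), averaged over frames, with the 0th (energy) coefficient excluded; differing conventions on frame alignment and c0 handling are a common reason implementations disagree. A minimal version, assuming the inputs are already time-aligned:

```python
import numpy as np

def mel_cepstral_distortion(ref, syn):
    """MCD in dB between two (frames, dims) MCEP arrays, c0 excluded."""
    ref, syn = np.asarray(ref, float), np.asarray(syn, float)
    assert ref.shape == syn.shape, "sequences must be frame-aligned"
    diff = ref - syn
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return (10.0 / np.log(10.0)) * float(np.mean(per_frame))
```

If an implementation skips the DTW alignment step or includes c0, its numbers will not be comparable to published MCD values.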

Other Research That May Help

  • Text to Image Synthesis
  • AudioMAE (Masked Autoencoders that Listen) [code]

Organizations

Other Repositories to Refer to - Speech Included/Related

Learning Materials

  1. Digital Signal Processing Lecture [link]
  2. Ratsgo's Speechbook [link]
  3. YSDA Course in Speech Processing [code]
  4. NHN Forward YouTube video [link]