massabaali7/speech_parallel_corpus

speech_parallel_corpus

Creating Speech-to-Speech Corpus From Dubbed Series

Dubbed series have become increasingly popular in recent years, with strong support from major media service providers. This popularity is backed by studies showing that dubbed versions of TV shows are more popular than their subtitled equivalents. We propose an unsupervised approach for constructing a speech-to-speech corpus, aligned at the short-segment level, that yields parallel speech in the source and target languages. Our methodology combines video frames, speech recognition, machine translation, and noisy-frame removal algorithms to match segments across both languages.

Install required libraries

  • pip install pydub
  • pip install inaSpeechSegmenter
  • pip install image-similarity-measures
  • pip install SpeechRecognition
  • pip install googletrans==4.0.0-rc1
  • pip install textblob-ar-mk
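
Before running the scripts, it can help to confirm that every package in the list above is actually installed. The following is a minimal sketch, not part of the repository: the helper name `installed_versions` is hypothetical, and the check relies only on the distribution names given in the install list, queried through the standard-library `importlib.metadata` module.

```python
from importlib import metadata

# Distribution names copied from the install list above.
REQUIRED = [
    "pydub",
    "inaSpeechSegmenter",
    "image-similarity-measures",
    "SpeechRecognition",
    "googletrans",
    "textblob-ar-mk",
]


def installed_versions(names=REQUIRED):
    """Map each distribution name to its installed version, or None if missing.

    Hypothetical helper for illustration; it is not provided by the repository.
    """
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Since the install list pins googletrans to 4.0.0-rc1, the printed version for that package is worth comparing against the pin.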

Download wiki word vectors

  • Choose the wiki word vectors for your target language from https://fasttext.cc/docs/en/pretrained-vectors.html. In our case the target language is Arabic: https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.ar.zip
  • Upload two videos for the same episode (in two different languages)
    • Run the video matching algorithm
  • Run VAD (Voice Activity Detection) to create CSV files for both episodes
    • python run_speech_segment.py "ep1TR.wav" "ep1AR.wav"
  • Run the automatic matching algorithm
    • Pass the following arguments in this order: Path, Dubbed file, Org file (the original-language file), langSrc (source language), langTrgt (target language)
    • python run_segment_automatic_match.py "/content/gdrive/MyDrive/parallel_corpus/samples/" "ep1AR" "ep1TR" "tr" "ar"
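
The two command-line steps above can be chained from a small Python driver. This is a sketch under stated assumptions rather than something shipped with the repository: the function names are hypothetical, the `.wav` file names are assumed to be the episode names with a `.wav` suffix (as in the `ep1TR` / `ep1TR.wav` example), and the scripts are assumed to be in the current working directory. The argument order follows the commands shown above.

```python
import subprocess
import sys


def vad_command(org_wav, dubbed_wav):
    """Build the VAD command; the original-language file comes first, as in the example."""
    return [sys.executable, "run_speech_segment.py", org_wav, dubbed_wav]


def match_command(path, dubbed_file, org_file, lang_src, lang_trgt):
    """Build the matching command: Path, Dubbed file, Org file, langSrc, langTrgt."""
    return [
        sys.executable,
        "run_segment_automatic_match.py",
        path,
        dubbed_file,
        org_file,
        lang_src,
        lang_trgt,
    ]


def run_pipeline(path, dubbed_file, org_file, lang_src, lang_trgt):
    """Run VAD, then automatic matching; raise if either script exits with an error."""
    # Assumption: the audio files are named <episode>.wav.
    subprocess.run(vad_command(f"{org_file}.wav", f"{dubbed_file}.wav"), check=True)
    subprocess.run(
        match_command(path, dubbed_file, org_file, lang_src, lang_trgt), check=True
    )


# Example call mirroring the commands above (not executed here):
# run_pipeline("/content/gdrive/MyDrive/parallel_corpus/samples/", "ep1AR", "ep1TR", "tr", "ar")
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces, and `check=True` stops the run before matching if VAD fails.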

To see the matched segments, check the result.csv file.
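
A quick way to inspect the output is to print its header and first few rows. The sketch below makes no assumption about which columns result.csv contains; it simply reports whatever header the file has. The helper name `preview_csv` is hypothetical, and UTF-8 encoding is an assumption.

```python
import csv
from itertools import islice


def preview_csv(path, limit=5):
    """Return (column names, first `limit` rows as dicts) from a CSV file.

    Hypothetical helper for illustration; assumes the file is UTF-8 encoded
    and that its first row is a header.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(islice(reader, limit))
        return list(reader.fieldnames or []), rows


# Example usage (not executed here):
# columns, rows = preview_csv("result.csv")
# print(columns)
# for row in rows:
#     print(row)
```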
