About the GALE Phase 2 Arabic Broadcast Conversation:

LDC2013S02: http://catalog.ldc.upenn.edu/LDC2013S02
LDC2013S07: http://catalog.ldc.upenn.edu/LDC2013S07
LDC2013T17: http://catalog.ldc.upenn.edu/LDC2013T17
LDC2013T04: http://catalog.ldc.upenn.edu/LDC2013T04


GALE Phase 2 Arabic Broadcast Conversation Speech was developed by the Linguistic Data Consortium (LDC) and comprises approximately 200 hours of Arabic broadcast conversation speech collected in 2006 and 2007 by LDC as part of the DARPA GALE (Global Autonomous Language Exploitation) Program.

The data has two types of speech: conversational and report. This recipe trains and tests on both, and results are reported for each type. The training data is 320 hours and the test data is 9.3 hours.

The dictionaries and scripts can be obtained from the QCRI portal: http://alt.qcri.org/

The experiments here are based on the above corpus.

s5: Phoneme based
s5b: Grapheme based. This is the recommended setup; it includes nnet3 and chain modeling.


[1] "A Complete Kaldi Recipe For Building Arabic Speech Recognition Systems", A. Ali, Y. Zhang, P. Cardinal, N. Dahak, S. Vogel, J. Glass. SLT 2014
[2] "QCRI Advanced Transcription Systems (QATS) for the Arabic Multi-Dialect Broadcast Media Recognition: MGB-2 Challenge", S. Khurana, A. Ali. SLT 2016