AISHELL-2

AISHELL-2 is by far the largest free speech corpus available for Mandarin ASR research.

1. DATA

Training data

  • 1000 hours of speech data (around 1 million utterances)
  • 1991 speakers (845 male and 1146 female)
  • clean recording environment (studio or quiet living room)
  • read speech
  • reading prompts from various domains: entertainment, finance, technology, sports, control commands, places of interest, etc.
  • near field recording via 3 parallel channels (iOS, Android, Microphone).
  • iOS data is free for non-commercial research and education use (e.g. universities and non-commercial institutes)
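
As a rough sanity check on the figures above, the sketch below derives the total speaker count and the average utterance length. It assumes exactly 1000 hours and 1,000,000 utterances, although both are only approximate in the list; the constant and function names are made up for this illustration.

```python
# Figures quoted in the list above. The hour and utterance counts are
# rounded ("around 1 million"), so derived values are approximate.
TOTAL_HOURS = 1000
TOTAL_UTTERANCES = 1_000_000
MALE_SPEAKERS = 845
FEMALE_SPEAKERS = 1146


def corpus_summary():
    """Derive a few per-utterance and per-speaker averages."""
    total_speakers = MALE_SPEAKERS + FEMALE_SPEAKERS
    return {
        "total_speakers": total_speakers,
        # Average utterance duration in seconds.
        "avg_utterance_seconds": TOTAL_HOURS * 3600 / TOTAL_UTTERANCES,
        # Average number of utterances recorded per speaker.
        "utterances_per_speaker": TOTAL_UTTERANCES / total_speakers,
    }


if __name__ == "__main__":
    for key, value in corpus_summary().items():
        print(f"{key}: {value:.2f}")
```

Under these assumptions, the male and female counts add up to the stated 1991 speakers, and an average utterance lasts about 3.6 seconds.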

Evaluation data

Currently we release AISHELL2-2018A-EVAL, containing:

  • dev: 2500 utterances from 5 speakers
  • test: 5000 utterances from 10 speakers

Both sets are available across the three channel conditions.

Those interested can download the sets from here. Note that we may later release updated or additional evaluation sets on the website, targeting different applications and scenarios.

2. RECIPE

Based on the standard Kaldi system, AISHELL-2 provides a self-contained Mandarin ASR recipe, with:

  • a word segmentation module, which is a must-have component for Chinese ASR systems
  • an open-source Mandarin lexicon (DaCiDian, open-sourced here)
  • a simplified GMM training and alignment-generation recipe (we stop at the speaker-independent stage)
  • an LF-MMI TDNN training and decoding recipe
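
To illustrate why a word segmentation module is needed, here is a minimal sketch of lexicon-driven segmentation using forward maximum matching. This is only an illustration, not the recipe's segmenter, which may work differently. The function name `segment` and the toy lexicon are hypothetical and are not taken from DaCiDian.

```python
def segment(text, lexicon, max_word_len=None):
    """Split unspaced text into words by forward maximum matching.

    At each position the longest substring found in `lexicon` is taken;
    a character with no lexicon match is emitted as a one-character word.
    """
    if max_word_len is None:
        max_word_len = max((len(w) for w in lexicon), default=1)
    max_word_len = max(1, max_word_len)

    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Toy lexicon for illustration only (an assumption, not real lexicon data).
TOY_LEXICON = {"语音", "识别", "语音识别", "研究"}

if __name__ == "__main__":
    print(" ".join(segment("语音识别研究", TOY_LEXICON)))
```

Because Mandarin text has no spaces between words, transcripts must be split into lexicon words like this before they can be mapped to pronunciations or used for language model training. Greedy matching is the simplest option; statistical segmenters generally handle ambiguous boundaries better.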

REFERENCE

We released a paper on arXiv with a more detailed description of the corpus and some preliminary results. If you use AISHELL-2 in experiments, please cite the paper as below:

@ARTICLE{aishell2,
   author = {{Du}, J. and {Na}, X. and {Liu}, X. and {Bu}, H.},
   title = "{AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale}",
   journal = {ArXiv},
   eprint = {1808.10583},
   primaryClass = "cs.CL",
   year = 2018,
   month = Aug,
}

APPLY FOR DATA/CONTACT

The AISHELL foundation is a non-profit online organization, with members from the speech industry and research institutes.

We hope the AISHELL-2 corpus and recipe will benefit the entire speech community.

Depending on your location and internet speed, we distribute the corpus in two ways:

  • hard-disk delivery
  • cloud-disk downloading

To apply for the AISHELL-2 corpus free of charge, fill in a simple application form confirming that:

  • university department / educational institute information has been fully provided
  • the data will be used only for non-commercial research / education

The AISHELL foundation covers all data distribution fees (including the corpus, hard-disk cost, etc.).

Data re-distribution inside your university department is OK for convenience. However, users are not supposed to re-distribute the data to other universities or educational institutes.

To get the application form, or if you come across any problem with the recipe, contact us at:

aishell.foundation@gmail.com