Around 60 million people worldwide speak Pashto. Despite this large number of speakers, it is still considered a low-resource language because little digital content is available online.
Our goal is to build a community that advances the adoption of Pashto in digital products (automatic speech recognition, transcription, digital dictionaries, grammar correction, text-to-speech systems, etc.).
We aim to create open-source (publicly available) projects with the help of volunteers. We will set a timeline with specific goals for each quarter of the year.
We must tackle several challenges to lift Pashto from a low-resource language to one that is well represented on the web. One of the biggest challenges for content creation in Pashto is typing grammatically correct sentences with the available keyboards.
Our first goal is to create an automatic speech recognition (ASR) system that transcribes spoken Pashto into written Pashto. Building such a system requires training data in Pashto. Such data is usually collected through another open-source project, Mozilla Common Voice. Unfortunately, Pashto is one of the few languages with no data in the Common Voice project.
Our top challenges, in order of priority, are as follows:
- Complete the translation of the Common Voice portal into Pashto
- Create sentences in the Pashto language for Common Voice
- Record utterances for the sentences collected for Common Voice
- Train or fine-tune the ASR model
- Devise a unified approach to Pashto language corpus creation
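
To make the sentence-creation step more concrete, here is a minimal sketch of how volunteers might pre-filter candidate sentences before submitting them. Everything in it is an assumption for illustration: the function names, the 14-word limit, the no-digits rule, and the check that letters fall in the Unicode Arabic block (U+0600 to U+06FF) are not taken from this project and should be checked against the current Common Voice sentence guidelines.

```python
# Hypothetical pre-filter for candidate Pashto sentences.
# All thresholds and rules below are assumptions, not official Common Voice rules.

MAX_WORDS = 14                 # assumed upper bound on sentence length
ARABIC_BLOCK_START = 0x0600    # start of the Unicode "Arabic" block
ARABIC_BLOCK_END = 0x06FF      # end of the Unicode "Arabic" block


def is_candidate_sentence(text, max_words=MAX_WORDS):
    """Return True if the text passes the assumed basic checks."""
    sentence = text.strip()
    if not sentence:
        return False
    # Reject sentences that are too long to read aloud comfortably.
    if len(sentence.split()) > max_words:
        return False
    # Reject digits (Western or Arabic-Indic); numbers should be spelled out.
    if any(ch.isdigit() for ch in sentence):
        return False
    # Every letter must come from the Arabic block, which covers Pashto letters.
    return all(
        ARABIC_BLOCK_START <= ord(ch) <= ARABIC_BLOCK_END
        for ch in sentence
        if ch.isalpha()
    )


def filter_sentences(lines):
    """Keep unique sentences that pass the checks, preserving input order."""
    seen = set()
    kept = []
    for line in lines:
        sentence = line.strip()
        if sentence in seen or not is_candidate_sentence(sentence):
            continue
        seen.add(sentence)
        kept.append(sentence)
    return kept
```

A filter like this only catches mechanical problems such as foreign-script words, digits, and duplicates. Grammar, spelling, and licensing of the sentences would still need review by Pashto speakers.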
Coming soon...