Where do I find acii2cf file? #1

Open
djinn opened this issue Jul 23, 2021 · 4 comments
Labels: documentation (improvements or additions to documentation), help wanted (extra attention is needed)

Comments

djinn commented Jul 23, 2021

The Guru Makefile requires acii2cf, but I cannot find this file referenced anywhere except the hindawi repo. Where does this file live?

Contributor

obonac commented Jul 23, 2021

@djinn Thank you for reaching out :)

https://github.com/hindawiai/hindawi2021/blob/master/Romenagri/acii2cf.lex
This is the Lex source for the Romenagri CF in this repo.

We are currently developing in the Chintamani repo (Telugu target), which also serves as a surrogate for other languages. The Telugu branch is getting all the commits right now: https://github.com/hindawiai/chintamani/tree/telugu

There have been core changes to Romenagri in Chintamani to bring it as close to the IPA (International Phonetic Alphabet) as possible. The acii2cf sources and many front-end filters for Perso-Arabic scripts are available there.

Support for all Indic scripts covered by the ISCII standard is there. Round trips should work, but that is yet to be tested.

A lot of housekeeping is needed to merge all of this back! I will open a reference issue in Chintamani.

Here's my current target workflow (run from the ./Romenagri directory of a Chintamani clone on the telugu branch):
printf "یہ ہائی اسکول کے طلبا کو تربیت دیتا ہے" | . ./fltr_ar_pra | . ./fltr_ar_prb | ./fltr_ur_hi | iconv -tutf16 | uni2acii | acii2cf | tr '^' '_' | rmn2acii | acii2uni | iconv -futf16
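Because the workflow above chains many small tools, a missing one (acii2cf, for instance) only surfaces as a failure in the middle of the pipeline. A minimal sketch of a pre-flight check, assuming each stage is either a file in the current ./Romenagri directory or a command on PATH. The function name check_stages is hypothetical and not part of the repo:

```shell
# Hypothetical helper (not part of the repo): list pipeline stages that
# cannot be found either in the current directory or on PATH.
check_stages() {
    missing=0
    for stage in "$@"; do
        # Filters such as fltr_ar_pra are sourced, so a plain file is enough.
        if [ -e "./$stage" ] || command -v "$stage" >/dev/null 2>&1; then
            continue
        fi
        echo "missing: $stage"
        missing=1
    done
    return "$missing"
}

# Stage names are taken from the workflow above.
check_stages fltr_ar_pra fltr_ar_prb fltr_ur_hi iconv uni2acii acii2cf \
    tr rmn2acii acii2uni || echo "build or locate the missing stages first"
```

The check only confirms that each stage can be found; it says nothing about whether a checked-in binary is current.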

obonac commented Jul 23, 2021

Tracking at hindawiai/chintamani#1

obonac self-assigned this Jul 23, 2021
obonac added the documentation and help wanted labels Jul 23, 2021

obonac commented Jul 23, 2021

Chintamani has binaries checked in; please recompile. (The bin files will be removed in the next commit.)

https://github.com/hindawiai/chintamani/blob/telugu/Romenagri/acii2cf.lex

If you are compiling by hand, then the Romenagri lib will need to be built first.

There are test corpus files for other scripts, e.g.:
cat corp_pa.txt | ./flatten_uni_dev

Our first target is to round-trip through Devanagari. That works just fine, except that there are phonetic shifts between languages: for example, Abhishek is pronounced more like Obhishek in Bangla. Our objective is to stay as close to phonetic fidelity as feasible. That will help other components, such as TTS and speech, as we get to the AI layers.
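The round-trip goal can be checked mechanically by comparing the input with the output of a forward conversion followed by a reverse conversion. A minimal sketch, assuming each direction can be written as a single shell command string. The name round_trip_ok is hypothetical, and the tr commands are stand-ins for the real conversion stages:

```shell
# Hypothetical helper: succeed if the text survives a forward conversion
# followed by a backward conversion unchanged.
round_trip_ok() {
    text=$1 forward=$2 backward=$3
    result=$(printf '%s' "$text" | sh -c "$forward" | sh -c "$backward")
    if [ "$result" = "$text" ]; then
        echo "round trip ok"
    else
        echo "round trip differs: $result"
        return 1
    fi
}

# Stand-in converters; replace them with the two halves of the real pipeline.
round_trip_ok "abhishek" "tr a-z A-Z" "tr A-Z a-z"
```

An exact string comparison like this would flag the phonetic shifts mentioned above as differences, so a real check may need a looser comparison.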

obonac pinned this issue Jul 23, 2021
2 participants