hebrew_rs

Hebrew alephbet primatives and parsing library for rust.

The aim for this library is to fully support all unicode Hebrew characters including final forms, nikkud dots, letters, words, and phrases (sentences).

Progress

Support basic letter parsing
Support final letter parsing
Support converting letters into their final versions and vice versa
Support letters with most common nikkud
Provide ability to remove all nikkud from a letter or word

Note that while dots can be removed, they cannot be added to words that did not have them to begin with. There is no logic to determine the proper nikkud. Dots in the original parsed string can be preserved however.

Support all unicode Hebrew characters (See dots.rs for what is an isn't supported)
Support word parsing
Support phrase (and sentence) parsing

A naive method of just tokenizing on space characters can split phrases up into a vector of words, which in turn are a vec of letter structs that include a base and dots. A more fully featured method will hopefully be implemented in the future.

Preserve all split characters in phrases
Provide full word representations of each letter

Right to left (RTL) VS Left to Right (LTR)

Hebrew is written from right to left (RTL) in contrast to a language like English that is written left to right (LTR). Many programs detect RTL languages and will display them in LTR format on machines that have set their local language as a LTR system. This means the actual underlying text is actually stored RTL, but will look to an end user of a LTR system as written as LTR. It would be difficult to write hebrew without this because you would have to type a single character and then move the cursor to the left for every character you would type.

This needs to be kept in mind because any parsed data will be expected to be stored in RTL as is normal for Hebrew, even though it will be displayed LTR on LTR systems.

Letters

Hebrew has 22 total letters. 5 of them have a special final or 'sophit' form which are used sometimes when it occurs at the end of a word:

Non end forms

א, ב, ג, ד, ה, ו, ז, ח, ט, י, כ, ל, מ, נ, ס, ע, פ, צ, ק, ר, ש, ת

Note that ה (Heh) looks a lot like ח (Chet) and ת (Tav)

End forms

ך, ם, ן, ף, ץ

Non end forms to end forms conversion

כ ← ך

מ ← ם

נ ← ן

פ ← ף

צ ← ץ

Note that some final versions look like a different character but are not:

Kaph (כ) looks a lot like Resh (ר), but is actually Kaph sophit (ך) Nun (נ) looks a lot like Vav (ו), but is actually Nun sophit (ן)

Nikkud / Dots

In addition to the letters, Hebrew also has special diacritical marks. These marks modify how letters are pronounced. These modifiers are often not used by native speakers as they know how to properly pronounce words through experience.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
examples		examples
src		src
.gitignore		.gitignore
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

hebrew_rs

Progress

Right to left (RTL) VS Left to Right (LTR)

Letters

Non end forms

End forms

Non end forms to end forms conversion

Nikkud / Dots

About

Releases

Packages

Languages

License

jgrowl/hebrew_rs

Folders and files

Latest commit

History

Repository files navigation

hebrew_rs

Progress

Right to left (RTL) VS Left to Right (LTR)

Letters

Non end forms

End forms

Non end forms to end forms conversion

Nikkud / Dots

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages