import and convert dictionaries from other programs #39

Closed
balshetzer opened this Issue Jun 14, 2012 · 6 comments

balshetzer commented Jun 14, 2012

Many Plover users have steno experience with other programs and therefore have mature dictionaries in those programs' formats. A tool should exist to easily convert other programs' dictionaries to the Plover dictionary format.

stenoknight commented Jun 14, 2012

Someone's also mentioned the possibility of a steno student starting on Plover and being able to export their Plover dictionary to rtf/cre format so that they can use it with commercial steno software. That's a pretty low priority, in my opinion, but something to consider as an additional feature in the conversion script.

stenoknight commented Jun 25, 2012

Some steps towards converting from DigitalCAT-formatted rtf/cre to Eclipse-formatted rtf/cre, which will make it convertible to Plover's json format. (Thanks to Ed)

//============Optional dual installation comments start============\

My install of ploverwin ver 212 dated approx April 16, 2012
uses a dict format setting of "DCAT" in the config file.
It works well for words (but not numbers and not commands).

My next ploverwin ver install after that was ver 220
dated approx June 4, 2012.
Ploverwin ver 220 has errors using the DC format,
so it must use the dict format setting of "Eclipse"
in the config.json file.

After I installed ver 220, I still had ver 212 installed,
so I had 2 separate ploverwin installations,
at 2 different locations,
using 2 "different Kim dictionaries",
using 2 different dictionary configuration settings.

Note that I run only a single version of ploverwin at a time.

Also note that when I say "different Kim dictionaries" I mean
different names, different locations, different config settings,
but the actual dictionary entries
started out being exactly the same
(prior to the start of the conversion change process).

This dual installation situation helped me to experiment
because I could easily test for word output differences
between that DC supported version and the Eclipse only version.

============Optional dual installation comments end====================//


When Plover refuses to output English due to steno in the wrong format, it
often outputs the correctly formatted steno that it wants to see in the dictionary.

This often happens with words and also with numbers.

I looked in your dictionary, Mirabai, to see if certain character combinations
were present or not. If they were NOT present in your dictionary -- then I
removed those character combinations from Kim's DC format dictionary
(that I was trying to convert from the DC to the Eclipse format) by editing.

I am a novice user of TED Notepad. http://jsimlo.sk/notepad/download.php

This is my 1st contact with regular expressions.

My editing method is: "find & replace all", case sensitive on, regex off.
But before "find & replace all", I use "find", case sensitive on, regex on
to test to see if a "find & replace all" will do what I want it to do,
without changing anything else (I want the English right side unchanged).

To search only the English (right side only), I use the following Perl regex:
": ".*#.*$

(# is now representing what I am searching for.)

To search only the steno (left side only), I use the following Perl regex:
^.*#.*": 
(Note that there is a space after the colon.)

(# is now representing what I am searching for.)
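The same "test before replacing" check can be sketched in Python. This is only an illustration, not the method Ed used: the function name `side_contains` is hypothetical, and it assumes each dictionary line looks like `"STENO": "english",`.

```python
import re

def side_contains(line, needle, side):
    """Return True if `needle` occurs on the given side of a dictionary line.

    Assumes (hypothetically) that each line has the shape "STENO": "english",
    """
    target = re.escape(needle)  # treat the needle as literal text, not a regex
    if side == "steno":
        # needle somewhere before the closing quote + colon + space
        pattern = r'^.*' + target + r'.*": '
    elif side == "english":
        # needle somewhere after the colon + space + opening quote
        pattern = r'": ".*' + target + r'.*$'
    else:
        raise ValueError("side must be 'steno' or 'english'")
    return re.search(pattern, line) is not None
```

For example, `side_contains('"AT": "at-bat",', "-", "english")` is true while the same call with `"steno"` is false, so a hyphen replacement limited to the steno side would leave that entry alone.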

I don't know if it matters in what order the changes are done.
I usually change one character combination at a time.
My 1st change was "-" to "".
My 2nd change was "-E" to "E".
My 3rd change was "-U" to "U".
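A scripted version of these one-at-a-time replacements might look like the sketch below. It is an assumption-laden illustration rather than what was actually run: it assumes the dictionary has already been loaded as a Python dict (for example with `json.load`), and `rewrite_steno_keys` is a hypothetical helper. Working on the keys only guarantees the English side is never touched.

```python
def rewrite_steno_keys(entries, replacements):
    """Apply (old, new) text replacements, in order, to the steno keys only.

    Returns the converted dict plus a list of (original, converted) keys that
    were skipped because the converted key already existed with a different
    translation.
    """
    converted = {}
    collisions = []
    for steno, english in entries.items():
        new_steno = steno
        for old, new in replacements:
            new_steno = new_steno.replace(old, new)
        if new_steno in converted and converted[new_steno] != english:
            collisions.append((steno, new_steno))  # needs a manual decision
            continue
        converted[new_steno] = english
    return converted, collisions
```

Calling it with `[("-E", "E"), ("-U", "U")]` would mirror the 2nd and 3rd changes; the collision list plays the role of the entries that had to be changed back by hand.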

As you see below, BU files have names according to their changes.
(If anyone wants the ~654 megs of BU files I will gladly send them.)
Here are directory listings of dictionary BU files sorted by time:

Directory of C:\PloverDictBUsForMirabai1

05/26/2012 04:03 AM 7,593,697 dictDcat001.json Used with ver212 DC format
06/06/2012 12:49 PM 7,593,852 dict.jsonDc04 Added PloverToggle for ver220
06/06/2012 02:15 PM 7,535,724 dict.jsonDc05StarHyphenStar ("-" to "")
1st edit to convert from dc to eclipse format
06/06/2012 03:21 PM 7,527,938 dict.jsonDc06-EE
06/06/2012 06:16 PM 7,526,967 dict.jsonDc07-UU
06/07/2012 06:27 AM 7,515,531 dict.jsonDc08O-ROR(3)
(3 English entries to manually change back after find & replace)
06/07/2012 04:13 PM 7,491,127 dict.jsonDc09O-UOU
06/07/2012 05:02 PM 7,488,026 dict.jsonDc10A-TAT
06/07/2012 05:18 PM 7,466,532 dict.jsonDc11A-PAP(3)
06/07/2012 05:39 PM 7,458,785 dict.jsonDc12A-BAB(2)
06/07/2012 05:54 PM 7,456,626 dict.jsonDc13A-DAD(1)
06/07/2012 07:07 PM 7,419,784 dict.jsonDc14A-EAE
06/07/2012 07:17 PM 7,416,467 dict.jsonDc15A-FAF
06/07/2012 07:56 PM 7,415,002 dict.jsonDc16A-GAG(4)
(4 English entries to manually change back after find & replace)
06/07/2012 11:15 PM 7,407,436 dict.jsonDc17A-LAL
06/08/2012 01:17 AM 7,396,034 dict.jsonDc18A-RAR
06/08/2012 01:31 AM 7,393,193 dict.jsonDc19A-SAS(1)
06/08/2012 01:41 AM 7,386,891 dict.jsonDc20A-UAU
06/08/2012 01:53 AM 7,386,011 dict.jsonDc21A-ZAZ
06/08/2012 01:53 AM 7,386,011 dict.jsonA99
06/08/2012 05:18 AM 7,312,143 dict.jsonDc22O-EOE
06/08/2012 05:37 AM 7,305,776 dict.jsonDc23O-B_and_O-F
06/08/2012 12:30 PM 7,290,608 dict.jsonDc24O-POP
06/08/2012 12:40 PM 7,286,116 dict.jsonDc25O-LOL
06/08/2012 12:51 PM 7,284,844 dict.jsonDc26O-GOG
06/08/2012 01:01 PM 7,282,933 dict.jsonDc27O-TOT
06/08/2012 01:10 PM 7,281,163 dict.jsonDc28O-SOS
06/08/2012 03:40 PM 7,280,076 dict.jsonDc30O-DOD
06/08/2012 03:50 PM 7,279,586 dict.jsonDc31O-ZOZ
06/08/2012 06:34 PM 7,269,923 dict.jsonDc32S-ESE
06/08/2012 06:53 PM 7,268,272 dict.jsonDc33S-USU
06/08/2012 07:08 PM 7,260,433 dict.jsonDc34T-ETE
06/08/2012 07:13 PM 7,259,305 dict.jsonDc35T-UTU
06/08/2012 08:37 PM 7,252,035 dict.jsonDc36K-EKE
06/08/2012 08:39 PM 7,249,765 dict.jsonDc37K-UKU
06/08/2012 08:50 PM 0 dict.jsonDc38-EE missed these 1st pass
06/08/2012 09:16 PM 7,240,687 dict.jsonDc39P-EPE
06/08/2012 09:35 PM 7,230,956 dict.jsonDc40W-EWE
06/08/2012 09:41 PM 7,216,056 dict.jsonDc41H-EHE
06/08/2012 09:48 PM 7,186,221 dict.jsonDc42R-ERE
06/08/2012 10:03 PM 0 dict.jsonDc43-UU missed these 1st pass
06/08/2012 10:17 PM 7,184,559 dict.jsonDc44P-UPU
06/08/2012 10:22 PM 7,181,255 dict.jsonDc45W-UWU
06/08/2012 10:24 PM 7,176,346 dict.jsonDc46H-UHU
06/08/2012 10:28 PM 7,171,239 dict.jsonDc47R-URU
06/08/2012 10:28 PM 7,171,239 dict.jsonA98
06/08/2012 10:28 PM 7,171,239 dict_json_A98 This completes word conversion.

I worked on the word part of the conversion 1st,
and on the number part after the word part was done.

A DigitalCAT format rule for any single stroke containing a number
seems to be that the steno will begin with the character "#"
so by sorting the entire dictionary I ended up with most (but not all)
of the number entries grouped together. I then made a numbers-only
dictionary file to experiment with the numbers.

The Eclipse rule for any single stroke containing a number
seems to be that the character "#" will NOT be in the steno --
only the number will be in the steno.

So that gives me my 1st numbercentric edit: delete all #s from the steno side,
as long as the single stroke contains a number.

Eclipse seems to use the "#" character in the steno for a single stroke
only when the numberbar is used without a number key being pressed.
So for any strokes that did NOT have numbers 1234506789
I did not delete the "#" character.
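As a rough sketch of that "#" rule, assuming multi-stroke steno is joined with "/" and that the rule is exactly as observed above (the function name `strip_number_sign` is hypothetical):

```python
DIGITS = set("1234506789")

def strip_number_sign(steno):
    """Drop '#' from every stroke that contains a digit; keep it otherwise.

    Assumption: strokes in a multi-stroke entry are separated by '/'.
    """
    fixed = []
    for stroke in steno.split("/"):
        if any(ch in DIGITS for ch in stroke):
            stroke = stroke.replace("#", "")  # the digit already implies the number bar
        fixed.append(stroke)
    return "/".join(fixed)
```

A stroke that is only the number bar, such as `"#"`, passes through unchanged, which matches the case where the numberbar is used without a number key.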

06/09/2012 12:29 AM 7,171,227 dict.jsonA97#s
06/09/2012 10:57 PM 7,171,105 dict.jsonA96#doublingDone (11223344550066778899)
(edited to make them work)
06/09/2012 11:17 PM 7,171,195 dict.json95BadTopEnd--------------
06/09/2012 11:23 PM 7,171,129 dict.json94GoodTop!!--------------->just markers
06/09/2012 11:35 PM 7,171,196 dict.json93GoodTop!!AndBottomZZZ--/
06/09/2012 11:44 PM 7,171,196 !!dict92Alpha#Sorted.json <<<<<<<<<<<<<SORTED
06/10/2012 01:28 AM 46,968 #dict#only01.txt
06/10/2012 01:34 AM 44,675 #dict#only02_2293#sDeleted.txt
06/10/2012 02:07 AM 44,724 #dict#only02_2293#sDeletedB.json
06/10/2012 08:31 PM 44,281 !dict#only09.json

I continued to work on numbers, but in a different folder.

To summarize the numbers work: 1st I sorted,
then I isolated all the number entries starting with #,
then I worked on them in a numbers-only dict (removing the # character & some hyphens),
but there were 75 other entries that were numbers that did not start with "#",
so I had to work on those, also.

Note that this was to get the numbers to work for the "dictionary defined numbers" only.

At this point the right side numbers bug is present,
because the dictionary entries that address that bug are not in the dictionary yet.

Directory of C:\PloverDictBUsForMirabai2

06/08/2012 10:28 PM 7,171,239 cA98_No#sChangedYet.json
06/10/2012 10:13 PM 7,171,330 B88_A98_plus2Lines.json
06/10/2012 10:47 PM 7,171,330 B87_sorted.json
06/10/2012 11:00 PM 7,171,302 B75_sameAs_B86.json
06/10/2012 11:00 PM 7,171,302 B86_oneTestLineRemoved.json
06/10/2012 11:46 PM 0 B86_oneLineDelFromFull.json-
06/10/2012 11:51 PM 48,398 B85_#sMostly#s.json
06/11/2012 12:17 AM 46,002 B84_DelAll#s.json
06/11/2012 12:26 AM 46,006 B83_put4#sBack.json
06/11/2012 12:32 AM 45,564 B82_-EE_-UU.json I guess I missed these before
06/11/2012 01:35 AM 45,546 B81_59-D59D.json
06/11/2012 01:48 AM 45,597 B815-D5D_rem.json
06/11/2012 02:01 AM 45,517 B80_0-D0D.json
06/11/2012 08:48 PM 45,515 B79_5-G5G_5-R5R.json
06/12/2012 04:38 AM 46,399 B800-D0D_rem.json
06/12/2012 06:15 AM 45,937 B78_Degree_.json
06/12/2012 06:43 AM 45,933 B77_.json
06/12/2012 05:44 PM 7,123,010 B74_FullButTop#sDeleted.json
06/13/2012 03:42 AM 7,122,977 B73_!WorkgOnThe75#s.json
06/13/2012 04:46 AM 7,122,995 B72_!WorkgOnThe75#s.json
06/13/2012 03:43 PM 7,122,988 B71_!WorkgOnThe75#s.json
06/13/2012 04:01 PM 7,122,964 B70_!WorkgOnThe75#s.json
06/13/2012 04:55 PM 7,122,966 B69_!WorkgOnThe75#s.json
06/13/2012 05:48 PM 7,122,966 B68_!WorkgOnThe75#sGUD.json (GUD=GOOD)
06/13/2012 06:21 PM 7,122,981 B67_!WorkgOnThe75#sBAD.json
06/13/2012 07:35 PM 7,168,982 B66_(B76onTopOfB67)BAD.json
06/13/2012 08:23 PM 7,122,939 B65_!WorkgOnThe75#sGUD.json
06/13/2012 08:30 PM 7,168,940 B64_(B76onTopOfB65)GUD.json
06/14/2012 12:44 AM 7,137,292 B63_Slash-E_Slash-U.json /-E to /E /-U to /U
06/14/2012 12:59 AM 7,137,289 B62_0thru9-E_0thru9-U.json more -E to E -U to U
06/14/2012 01:09 AM 7,137,288 B61_Line232208_Del1#.json <<< Converted (maybe)
I think the file B61 has all the needed changes to be in the eclipse format for plover,
but I am not absolutely sure, because it needs more testing.

The right side numbers bug was not resolved (maybe) until later, so it is still in B61.

The top of this file contains entries that may fix the right side numbers bug:

06/16/2012 12:50 AM 7,142,087 B54_1234with-AllCombosOfRtSide.json

End_Of_Message_And_End_Of_File->
->

stenoknight commented Jul 16, 2012

From the Launchpad site (for Eclipse-formatted dictionaries):

A list of (Vim-flavored) regular expressions that will convert a dictionary exported in rtf/cre format into Python dictionary format. Ideally this should be turned into a simple script that new users can run on their dictionaries without prior knowledge of regular expressions. This has only been fully tested with rtf/cre dictionaries exported by Eclipse. Additional formatting is probably necessary for rtf/cre files exported from CAT software other than Eclipse. More testing is required. Note that Plover currently supports two types of steno dictionary: Eclipse format, where hyphens are only made explicit when necessary, and DigitalCAT format, where all hyphens are explicit. Default format is Eclipse, so if you are importing a DigitalCAT dictionary, change the format in Plover's .config file.
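To make the difference between the two hyphen conventions concrete, here is a small sketch of going from "all hyphens explicit" to "only necessary hyphens explicit". The key set is an assumption inferred from the backup file names earlier in this thread (A-T to AT, O-R to OR, -E to E, 5-G to 5G, 0-D to 0D); including the star is a further guess, and `to_eclipse_style` is a hypothetical name, not part of Plover.

```python
# Assumption: a hyphen is redundant when the stroke contains one of these keys.
IMPLICIT_HYPHEN_KEYS = set("AO*EU50")

def to_eclipse_style(steno):
    """Remove hyphens from strokes where a middle key already marks the split."""
    strokes = []
    for stroke in steno.split("/"):
        if "-" in stroke and any(key in IMPLICIT_HYPHEN_KEYS for key in stroke):
            stroke = stroke.replace("-", "")
        strokes.append(stroke)
    return "/".join(strokes)
```

Strokes such as `-T` or `S-T` keep their hyphen, since without it the left-hand and right-hand keys could not be told apart.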


escape backslashes

%s/\\/\\\\/g

escape "

%s/"/\\"/g

convert double spaces to single spaces

%s/  / /g

Remove lines with court reporter-specific paragraphing commands (this is drastic, but they cause no end of trouble. Will maybe try to support them to some degree in a later version.)

%s/^.{$}.$\n//
%s/^.\par\.$\n//

Convert steno half of entry to Python format

%s/{\.\cxs ([^}]+)}/"\1": /

Get rid of any lines that don't start with quotes. (i.e., more court reporting formatting residue)

%s/^[^"].*$\n//

Convert infixes.

%s/: \cxds (.*)\cxds/: {^\1^}/

Convert suffixes.

%s/: \cxds (.*)/: {^\1}/

Convert prefixes.

%s/: (.*)\cxds/: {\1^}/

Delete "force uncap" command (caption-specific command that Plover doesn't need to implement now, if ever.)

%s/{l1}//g
%s/{l0}//g

Delete \cxp, the punctuation marker, since Plover recognizes specific punctuation marks independently.

%s/\cxp//g

Convert glue strokes.

%s/\cxfing /&/g

Convert "cap next" strokes.

%s/\cxfc /-|/g

Convert "stitch" strokes to suffix with hyphen.

%s/{\cxstit /{^-/

Search for other cx strokes and deal with them manually.

/cx

Delete spaces at ends of line.

%s/ \n/^M/g - (don't type in the ^M; do control-q, then control-m, and what will display is ^M)

Convert other half of entries.

%s/^"([-A-Z0-9/*]+)": (.*)$/"\1": "\2",

Put in curly brackets at beginning and end of dictionary

I'm sure there's a way to do this automatically, but I just did it manually.
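As a possible starting point for the "simple script" mentioned above, here is a minimal Python sketch of the same steps. It is not the Launchpad regex list itself and has not been run against real exports: it assumes one entry per line in the shape `{\*\cxs STENO}translation`, handles only the control words named in this comment, and all function names are hypothetical. Letting `json.dumps` write the output takes care of the backslash and quote escaping and of the curly brackets at the beginning and end.

```python
import json
import re

# Assumed entry shape, one per line: {\*\cxs STENO}translation
ENTRY = re.compile(r"^\{\\\*\\cxs ([^}]+)\}(.*)$")

def convert_translation(text):
    """Map the RTF/CRE control words named in this thread to Plover syntax."""
    text = text.replace("\\cxp", "")               # punctuation marker: drop
    text = text.replace("\\cxfing ", "&")          # glue stroke
    text = text.replace("\\cxfc ", "-|")           # cap next
    text = text.replace("{\\cxstit ", "{^-")       # stitch: suffix with hyphen
    text = re.sub(r" {2,}", " ", text).strip()     # collapse runs of spaces
    attach_left = text.startswith("\\cxds")        # suffix or infix
    if attach_left:
        text = text[len("\\cxds"):].lstrip()
    attach_right = text.endswith("\\cxds")         # prefix or infix
    if attach_right:
        text = text[:-len("\\cxds")].rstrip()
    if attach_left or attach_right:
        text = ("{" + ("^" if attach_left else "") + text
                + ("^" if attach_right else "") + "}")
    return text

def rtf_lines_to_dict(lines):
    """Keep only lines that look like entries; everything else is skipped."""
    entries = {}
    for line in lines:
        match = ENTRY.match(line.strip())
        if match:
            entries[match.group(1)] = convert_translation(match.group(2))
    return entries

def to_json(entries):
    """Serialize with escaping and the outer curly brackets handled by json."""
    return json.dumps(entries, indent=0, sort_keys=True)
```

Anything else starting with cx would still need to be searched for and dealt with by hand, as in the manual procedure.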

You can find a ~9 MB zip file containing several unconverted dictionaries in rtf format, as well as a few converted dictionaries in json format, in both Eclipse (only necessary hyphens explicit) and DigitalCAT (all hyphens explicit) flavors of steno here:

http://stenoknight.com/plover/ploverdicts.zip

The DigitalCAT dictionaries will require much more weeding, since they have extra metadata that the regular expressions in the launchpad blueprint don't account for. Stuff like dictentrydate, which we can just cut out completely, and conflicts, which will require the sacrifice of the entry, since Plover doesn't support conflict differentiation (nor will it ever, if I have anything to say about it). Basically anything starting with cx is steno-specific metadata.

balshetzer commented Dec 11, 2012

I thought I'd put a reference here to the rtf cre spec:
http://www.legalxml.org/workgroups/substantive/transcripts/cre-spec.htm

balshetzer commented Dec 25, 2012

I ran my script on the dictionaries in the zip file and it ran into some problems with ab-digitalcat-0528.rtf because it had something in it that wasn't legal RTF. I took a look and that part of the file didn't make sense. Is it possible that there was some kind of copy-paste change in that file, or is it as it was on export?

balshetzer commented Jul 12, 2013

Plover now supports RTF dictionaries natively.

@balshetzer balshetzer closed this Jul 12, 2013