Support to detect languages in Word #2047

Closed
nvaccessAuto opened this Issue Jan 17, 2012 · 26 comments

1 participant

@nvaccessAuto

Reported by jfauchon on 2012-01-17 12:32
Hello,

Please, can you add support for detecting languages in Microsoft Word?

Thanks for your work.

@nvaccessAuto

Comment 1 by jteh on 2012-01-17 21:08
Can Word documents indicate the language of each piece of text? If not, this can't be done at present.

@nvaccessAuto

Comment 2 by jhomme (in reply to comment description) on 2012-01-18 12:35
Hi,
In the Word VBA Help, I see this item under the Range object.

DetectLanguage
Analyzes the specified text to determine the language that it is written in.  

There are other items that have to do with language. It also says that a Range is a piece of text of any size.

Hope that helps.
Jim
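
As a rough illustration of how a per-range language ID could be used once it has been read from Word: Word's language IDs appear to coincide with Windows locale identifiers (LCIDs), so a numeric ID can be mapped to a locale name. This is only a minimal sketch built on Python's standard `locale.windows_locale` table; the function name `lcid_to_locale` is hypothetical and is not part of NVDA or Word.

```python
import locale
from typing import Optional


def lcid_to_locale(lcid: int) -> Optional[str]:
    """Map a Windows LCID (assumed to match Word's language ID) to a locale name.

    Returns None when the ID is not in Python's built-in table.
    """
    # locale.windows_locale is a plain dict shipped with Python on every platform.
    return locale.windows_locale.get(lcid)


if __name__ == "__main__":
    # 1033 and 1041 are the LCIDs for US English and Japanese.
    for lcid in (1033, 1041):
        print(lcid, lcid_to_locale(lcid))
```

An unknown ID yields `None`, which a caller could treat as "leave the speech language unchanged".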

@nvaccessAuto

Attachment bug-2047-auto-language-detection-Word.patch added by manish on 2013-02-22 11:33
Description:
patch file created using bzr send.

@nvaccessAuto

Comment 3 by manish on 2013-02-22 12:26
I have tried to post this comment several times now; apologies if it eventually shows up multiple times.
I have attached a patch that fixes this issue. I tested it with MS Word 2010 and the Vocalizer voices for English and Hindi. It should work for other languages as well. I would appreciate it if someone could review and commit this patch.

@nvaccessAuto

Comment 4 by mdcurran on 2013-02-22 14:47
Thank you for this patch. I have not tested it myself yet, but it looks correct. Nice work.

However, there are a few issues:

  • Your patch removes some whitespace at the end of particular existing lines. This is nice, but should be in a separate patch if necessary as it makes it harder to read your functional changes.
  • There is a farEastLanguageID property in MS Word also. For languages such as Chinese, this property is set to the correct Asian language, and languageID is left on English or something else. We must work out whether the font is farEast and if so, use the farEastLanguageID property rather than languageID. Not sure how to do this yet.
  • You hard-code autoLanguageSwitching in the formatConfig to true. Please remove this. It must be left up to the user's configuration. Unless there is another bug we don't know about?

Thanks for your work.

@nvaccessAuto

Comment 5 by manish (in reply to comment 4) on 2013-02-28 12:24
I have made the following changes based on your comments below:

  • removed extraneous white space changes from the bundle.

  • For far east languages, I now read both farEastLanguageID and languageID from Word. When setting the language command in winword.py, I first check whether farEastLanguageID is a valid far east language. If so, it takes precedence; otherwise, the regular languageID property is used. This will work correctly on a non far east language computer if I open a far east language or mixed language document. On a far east language computer, however, the logic may need to be inverted, depending on how Word fills these two values. I will ask Takuya-san, who works on Japanese NVDA, to help me test this.

  • I did not understand your comment about my hard coding the auto language detection property. Which line do you see that happening?
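
The precedence rule described in the second point above can be sketched roughly as follows. This is not the code from the patch: the function names are hypothetical, and the test for a "valid far east language" is an assumption (the primary language of the ID is Chinese, Japanese or Korean, read from the low 10 bits of a Windows LCID).

```python
# Primary language IDs (low 10 bits of an LCID) for Chinese, Japanese and Korean.
# Treating exactly these three as "far east" is an assumption for this sketch.
FAR_EAST_PRIMARY_LANGS = {0x04, 0x11, 0x12}


def is_far_east(lcid: int) -> bool:
    """Return True if the LCID's primary language is Chinese, Japanese or Korean."""
    return (lcid & 0x3FF) in FAR_EAST_PRIMARY_LANGS


def choose_language_id(language_id: int, far_east_language_id: int) -> int:
    """Prefer the far east ID when it names a far east language, else the regular ID."""
    if is_far_east(far_east_language_id):
        return far_east_language_id
    return language_id


if __name__ == "__main__":
    # Regular ID is US English (1033), far east ID is Japanese (1041).
    print(choose_language_id(1033, 1041))
    # Far east ID does not name a far east language, so the regular ID wins.
    print(choose_language_id(1081, 1033))
```

On a far east language computer the order of the two checks might have to be swapped, as noted above.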

@nvaccessAuto

Comment 6 by manish (in reply to comment 5) on 2013-02-28 13:45
Please hold off on reviewing this. I've found some problems that I'll fix and re-upload.

@nvaccessAuto

Comment 7 by mdcurran (in reply to comment 5) on 2013-03-05 00:10
Replying to manish:

  • I did not understand your comment about my hard coding the auto language detection property. Which line do you see that happening?

Not sure of the exact line number, but it's in winword.py, in getTextWithFields.
The line you added is:
formatConfig['autoLanguageSwitching']=config.conf['speech'].get('autoLanguageSwitching',False)
Is AutoLanguageSwitching not already in formatConfig when passed in by speakTextInfo etc?
If you find that you do need this line, let me know in exactly what situations and I will look into it.

Other than this, things look good. I will await further fixes from you and I can also test with Chinese.
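
For reference, the behaviour of the quoted line can be sketched with plain dictionaries standing in for NVDA's configuration object. As written, the line copies whatever the speech section contains and falls back to False when the key is absent. The helper name `with_auto_language_switching` is hypothetical and only serves this illustration.

```python
def with_auto_language_switching(format_config: dict, conf: dict) -> dict:
    """Return a copy of format_config carrying the user's speech setting."""
    # Copy first so the caller's dictionary is never mutated.
    result = dict(format_config)
    # Same lookup as the quoted line: fall back to False when the key is absent.
    result["autoLanguageSwitching"] = conf["speech"].get("autoLanguageSwitching", False)
    return result


if __name__ == "__main__":
    user_conf = {"speech": {"autoLanguageSwitching": True}}
    print(with_auto_language_switching({"reportPage": True}, user_conf))
```

Whether the key is already present in formatConfig at that point is exactly the open question here; the sketch does not answer it.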

@nvaccessAuto

Comment 8 by manish (in reply to comment 7) on 2013-03-10 13:45
In the line above my change and in several other places, the FormatConfig being passed to this method is being filled with:
formatConfig=config.conf['documentFormatting'].copy()

The autoLanguageSwitching property is however defined in the Speech section of the config file and doesn't get populated here. Should there be a separate "autoLanguageSwitching" property in the DocumentFormatting section of the config?

The new patch attached after this comment takes care of the other 2 items: blank spaces and far east lang support. Please let me know if you face any problems with chinese.

Regards,
Manish

@nvaccessAuto

Comment 9 by mdcurran on 2013-03-12 07:38
Um, the most recent patch seems to be missing most of the language code in winword.cpp. In fact it only contains a few language constants.
Am I supposed to apply both patches, or is there a mistake with this patch?

@nvaccessAuto

Comment 10 by manish (in reply to comment 9) on 2013-03-12 15:26
Oops, I uploaded the wrong patch file. Uploading the correct one now. Sorry.

@nvaccessAuto

Attachment bug-2047-fixes-with-far-east-support.patch added by manish on 2013-03-12 15:28
Description:
Correcting the patch file.

@nvaccessAuto

Comment 11 by mdcurran on 2013-03-13 23:38
Merged in 3c7c78b.
However, some testing found that I misunderstood the far east language property; in fact it's not needed at all, as languageID exposes all East Asian languages correctly by itself. Therefore I stripped that code out of the patch.
Sorry for the confusion and extra work there Manish.
Successfully tested in MS Word 2003, 2007 and 2010 with English, German, French, Chinese Taiwan, Chinese Hong Kong and Japanese. Hopefully all other languages therefore work okay.
Thanks Manish for your work on this ticket.
Changes:
Milestone changed from None to 2013.1
State: closed

@nvaccessAuto

Comment 12 by nishimotz on 2013-03-22 10:34
With the Japanese version of Windows 8 and Word 2013, "page break" and "line feed" are not translated if the automatic language switching option is enabled.

They should be translated using the entries of \f and \n in symbols.dic.

It occurs with main-5974 and 2013.1-5975 snapshots.
It doesn't occur with NVDA 2013.1 beta1.

@nvaccessAuto

Comment 13 by jteh on 2013-03-23 06:47
Perhaps I'm misunderstanding, but the language used for those characters depends on the language set for that character in Word. If you're reading an English section of text, you will hear English names for those symbols. If this isn't what you meant, please provide exact steps to reproduce and possibly a sample document.

@nvaccessAuto

Attachment test-ja-en.docx added by nishimotz on 2013-03-23 14:09
Description:
test-ja-en.docx regarding comment 12

@nvaccessAuto

Comment 14 by nishimotz on 2013-03-23 14:28
Though it is arguable whether "page break" and "line feed" should follow the content language, I first doubt that Word is reporting the language attributes of such special characters correctly.

The attached file test-ja-en.docx contains four lines.
Lines 1 and 2 contain the Japanese Kana characters 0x3042 (vowel 'a') and 0x3044 (vowel 'i') respectively.
Lines 3 and 4 contain 'a' and 'b' respectively.

According to the status bar of the Japanese version of Microsoft Word 2013 (32bit), lines 1 and 2 are written in Japanese.

If the automatic language switching option is enabled, the line feeds are always announced as "line feed", although I expect "kaigyo" (line feed in Japanese) for lines 1 and 2.

Tested with nvda_snapshot_main-5974.exe and SAPI5 voice (Microsoft Haruka Desktop) which comes with Windows 8 Japanese (64bit).

@nvaccessAuto

Comment 15 by nishimotz on 2013-03-24 04:35
The issue in comment:14 can be reproduced with each of these combinations: Japanese Windows 8 + Word 2013, Japanese Windows 7 sp1 + Word 2010, and Japanese Windows XP sp3 + Word 2003.

As far as I know, the language attribute of the "line feed" character is "en-us" even when the preceding and trailing characters are Japanese.
It does not seem to be changeable by the user in Japanese Microsoft Word.

If my understanding is correct, for translating "line feed" it would be better to consider the language setting of NVDA rather than the content property.

@nvaccessAuto

Comment 16 by mdcurran on 2013-03-25 05:11
I fear that trying to choose what characters should be spoken with the interface language rather than document markup language will never be correct in all situations. You mention new line, form feed, and page break. But what about tab, and other marks such as the paragraph mark if MS Word is configured to show it?

There is what I would consider a bug in MS Word where it's impossible to select any non-eastern text (alphanumeric characters, line feed, page break etc) and tell MS Word it's Japanese. This even applies when typing with a Japanese layout: MS Word marks Japanese characters as Japanese, but not the typed line feeds. This is not a problem in European languages; there the line feeds seem to stay in the chosen language.

The most obvious workaround is for East Asian users to disable auto language switching, though I don't like this option. Remember, though, that this is a problem with MS Word. Web documents, PDFs etc, when marked up correctly, would not have this problem.

I'm interested in any opinions on what we should do here.

We could:

  • Remove the language detection code for Microsoft Word altogether: this would fix the issue for East Asian users, but deny a very useful feature to the rest of the world.
  • Somehow disable language detection in MS Word specifically for East Asian users: this would solve the problem when reading that particular language, but deny a useful feature to any East Asian user reading documents in other languages.
  • Identify all language-neutral characters (at least all whitespace ones) and somehow ignore document language markup for them and use the interface language instead.
@nvaccessAuto

Comment 17 by jteh (in reply to comment 16) on 2013-03-25 05:14
Replying to mdcurran:

  • Identify all language-neutral characters (at least all whitespace ones) and somehow ignore document language markup for them and use the interface language instead.

I'd probably choose this option. However, imo, it should only be done for Microsoft Word, since Word is the problem here, not these characters in general. Also, we should use current speech language, not interface language, just like we do for symbols normally.

@nvaccessAuto

Comment 18 by jteh on 2013-03-25 05:15
Furthermore, can we ignore the language for these characters only for east Asian languages?

@nvaccessAuto

Comment 19 by mdcurran on 2013-03-25 05:18
I guess that would mean in MS Word backend marking these language changes as whitespace, which most code would ignore except for speech if the synth language was east-asian... hmmm, achieves the result but a little messy perhaps...

@nvaccessAuto

Comment 20 by jteh on 2013-03-25 05:55
Hmm. I was thinking more of post-processing the output, rather than inventing a fake language code. However, I just realised the solution I proposed affects everything, where it probably only needs to affect reading by character.

Perhaps just ignoring the language when reading whitespace (or perhaps even line/page/paragraph marks) by character is sufficient. I just can't decide whether it's appropriate. To give us a bit of distance, if I'm arrowing through German text when English is my primary language, should I be hearing "space" and "line feed" in English or German? I would have thought German, but comment:14 disagrees.

@nvaccessAuto

Comment 21 by mdcurran on 2013-03-25 23:38
To confuse things further: characters such as tab and page break are announced when reading by larger chunks than character. How would these be handled? page break was one of the originally mentioned issues.

@nvaccessAuto

Comment 22 by mdcurran on 2013-03-26 01:47
1d78fb2 removes the language attribute from any format change that surrounds only whitespace. Therefore line feeds, tabs, page breaks etc should now be spoken in the synth's set language if it has one, or else the NVDA interface language, rather than in the language MS Word reports.
nishimotz: please test this change and let me know if it's suitable. Note that all we're doing is reverting to old functionality for whitespace.
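
The idea can be sketched as follows. This is not the code from 1d78fb2: the data model (a list of `(text, language)` pairs) and the function name `strip_language_from_whitespace` are assumptions made purely for illustration.

```python
from typing import List, Optional, Tuple

# One formatting run: its text and the language the document reports for it.
Run = Tuple[str, Optional[str]]


def strip_language_from_whitespace(runs: List[Run]) -> List[Run]:
    """Drop the language from runs made up of whitespace only.

    A language of None means "no language change", so the speech layer
    would fall back to whatever language it would use otherwise.
    """
    cleaned: List[Run] = []
    for text, language in runs:
        # str.isspace() is True only for non-empty, all-whitespace text,
        # which covers line feeds, tabs and form feeds (page breaks).
        if text.isspace():
            language = None
        cleaned.append((text, language))
    return cleaned


if __name__ == "__main__":
    runs = [("\u3042", "ja_JP"), ("\n", "en_US"), ("a", "en_US"), ("\f", "en_US")]
    for run in strip_language_from_whitespace(runs):
        print(run)
```

Runs containing any visible character keep their language, so ordinary text is still switched as before.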

@nvaccessAuto

Comment 23 by nishimotz on 2013-03-26 03:57
NVDA Japanese team is happy with this solution.
This change avoids confusion for Japanese-language users.
Thank you.

Dear Manish,
I believe I met you last year. Thank you for this contribution.
Takuya Nishimoto, a.k.a. nishimotz

@nvaccessAuto nvaccessAuto added this to the 2013.1 milestone Nov 10, 2015