Not sure if I can ask questions here, but I got stuck on this when trying to download a TMX from the OPUS JW300 set. It has nothing to do with the set, I think. #4
Comments
Could you show the flags you used to initialize |
My your_script.py:
import opustools_pkg
opus_reader = opustools_pkg.OpusRead(["-d", "JW300", "-s", "en", "-t", "ta", "-wm" "tmx"])
opus_reader.printPairs()
|
You are missing a comma between "-wm" and "tmx", but I don't think that's causing the error here. I suspect this is an issue with Windows encoding behavior. I'll try to set up a Windows environment and see if I can replicate the error. In the meantime, you could try running your script in a Unix-like environment, if you have access to one. |
it was the comma indeed! thank you so much for spotting this!
|
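For context on the missing comma: Python joins adjacent string literals into one string, so a list written without the comma contains a single merged flag instead of two separate items. A minimal sketch using plain lists (no opustools_pkg call involved; the variable names are illustrative only):

```python
# Adjacent string literals are concatenated by Python, so a missing comma
# silently merges two intended list items into one.
with_bug = ["-d", "JW300", "-s", "en", "-t", "ta", "-wm" "tmx"]
fixed = ["-d", "JW300", "-s", "en", "-t", "ta", "-wm", "tmx"]

print(with_bug[-1])               # -wmtmx
print(len(with_bug), len(fixed))  # 7 8
```

No syntax error is raised for the first list, which is why the mistake is easy to miss.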
I think there is still a problem, maybe only on Windows. I have not been able to test on Linux yet.
This is my script:
import opustools_pkg
opus_reader = opustools_pkg.OpusRead(["-d", "JW300", "-f", "-s", "en",
"-t", "ta", "-wm", "tmx", "-w", "enta.tmx"])
opus_reader.printPairs()
and the error I get is:
Traceback (most recent call last):
  File "your_script.py", line 3, in <module>
    opus_reader.printPairs()
  File "C:\Users\gertv\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\opustools_pkg\opus_read.py", line 350, in printPairs
    lastline = self.readAlignment(gzipAlign)
  File "C:\Users\gertv\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\opustools_pkg\opus_read.py", line 308, in readAlignment
    lastline = self.outputPair(self.par, line)[1]
  File "C:\Users\gertv\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\opustools_pkg\opus_read.py", line 251, in outputPair
    self.sendPairOutput(wpair)
  File "C:\Users\gertv\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\opustools_pkg\opus_read.py", line 210, in sendPairOutput
    self.resultfile.write(wpair[0])
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.1264.0_x64__qbz5n2kfra8p0\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 126-132: character maps to <undefined>
and the TMX looks like this:
<?xml version="1.0" encoding="utf-8"?>
<tmx version="1.4.">
<header srclang="en"
adminlang="en"
segtype="sentence"
datatype="PlainText" />
<body>
<tu>
<tuv xml:lang="en"><seg>The Beauty of Bovine Design “ DAD , today our
schoolteacher said that a cow has four stomachs , which it has developed by
a process of evolution .</seg></tuv>
|
@gertva I haven't been able to set up a Windows environment, but I changed something in the program: I now specify the encoding of the result files, which might help with your issue. Upgrade opustools_pkg to version 0.0.43, run your script again and let me know if it works. I'll probably get my hands on a Windows machine by tomorrow, so I'll be able to do proper debugging. |
works like magic now!
thank you so much.
gert
|
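For readers hitting the same UnicodeEncodeError: the traceback shows the write going through cp1252, which has no mapping for Tamil characters. The snippet below is only a sketch of the general remedy (passing an explicit encoding to open()), not the actual change made in opustools_pkg; the sample text and the temporary file path are assumptions for illustration.

```python
import os
import tempfile

# Sample Tamil text (illustrative); cp1252 has no mapping for these characters.
text = "<seg>வணக்கம்</seg>\n"

# Encoding with cp1252 fails the same way the traceback does.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# An explicit encoding makes the output independent of the platform default.
path = os.path.join(tempfile.mkdtemp(), "enta.tmx")
with open(path, "w", encoding="utf-8") as resultfile:
    resultfile.write(text)

with open(path, encoding="utf-8") as f:
    round_trip = f.read()
```

Without the encoding argument, open() in text mode falls back to a locale-dependent default, which on many Windows setups is cp1252.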
Great, good to hear! |
A question: if I want to filter on language, is this the correct construct?
opus_reader = opustools_pkg.OpusRead(["-d", "JW300", "-f", "-ln", "-s",
"en", "-t", "ta", "-wm", "tmx", "-w", "enta.tmx", "src_cld2", "en", "0.95",
"trg_cld2", "ta", "0.95", "src_langid", "en", "0.95", "trg_langid", "ta",
"0.95"])
My output file is always the same size, so I wonder if it is correct.
Thanks for your help.
On Mon 2 Sep 2019 at 12:53, miau1 <notifications@github.com> wrote:
Closed #4.
|
Almost, you have to include -- before the language id flags. You can also use split() to make building the argument list a little easier, like this:
opus_reader = opustools_pkg.OpusRead("-d JW300 -f -ln -s en -t ta -wm tmx -w enta.tmx --src_cld2 en 0.95 --trg_cld2 ta 0.95 --src_langid en 0.95 --trg_langid ta 0.95".split())
But for this to work, you have to add the language ids to the zip files, as they don't yet include them by default. First you need to install pycld2 and langid:
pip install pycld2
pip install langid
Then you can run a script like this to create the language ids:
from opustools_pkg.opus_langid import OpusLangid
OpusLangid("-f JW300_latest_xml_en.zip -v".split()).processFiles()
OpusLangid("-f JW300_latest_xml_en.zip -v".split()).processFiles()
And then you can filter by language ids. |
thanks for explaining so clearly.
|
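As a small illustration of the split() tip: splitting a flag string on whitespace gives the same list as writing every item out by hand. The sketch below uses plain Python only and does not call opustools_pkg. The shlex variant and the file name with a space are additions for illustration, not something from the thread.

```python
import shlex

# Writing every argument out by hand...
explicit = ["-d", "JW300", "-f", "-ln", "-s", "en", "-t", "ta",
            "-wm", "tmx", "-w", "enta.tmx",
            "--src_cld2", "en", "0.95", "--trg_cld2", "ta", "0.95"]

# ...is equivalent to splitting one string on whitespace.
from_split = ("-d JW300 -f -ln -s en -t ta -wm tmx -w enta.tmx "
              "--src_cld2 en 0.95 --trg_cld2 ta 0.95").split()

print(from_split == explicit)  # True

# Caveat: str.split() would break a value that contains spaces;
# shlex.split() respects quotes (hypothetical file name).
quoted = shlex.split('-w "my output.tmx"')
print(quoted)  # ['-w', 'my output.tmx']
```

For flag strings without spaces inside values, as in this thread, plain split() is enough.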
Not sure if I can ask questions here, but I got stuck on this when trying to download a TMX from the OPUS JW300 set. It has nothing to do with the set, I think.
I have no idea how to fix this. Your help is highly appreciated.
Originally posted by @gertva in #3 (comment)