not sure if I can ask questions here, but I got stuck on this when I try to download a TMX from the OPUS JW300 set. It has nothing to do with the set, I think. #4

Closed
miau1 opened this issue Aug 30, 2019 · 11 comments
Labels
bug Something isn't working

Comments

miau1 commented Aug 30, 2019

Not sure if I can ask questions here, but I got stuck on this when I tried to download a TMX from the OPUS JW300 set. I don't think it has anything to do with the set itself.

Alignment file /proj/nlpl/data/OPUS/JW300/latest/xml/en-ta.xml.gz not found. The following files are available for downloading:

   8 MB https://object.pouta.csc.fi/OPUS-JW300/v1/xml/en-ta.xml.gz
 263 MB https://object.pouta.csc.fi/OPUS-JW300/v1/xml/en.zip
  94 MB https://object.pouta.csc.fi/OPUS-JW300/v1/xml/ta.zip

 365 MB Total size
Downloading 3 file(s) with the total size of 365 MB. Continue? (y/n) y
JW300_latest_xml_en-ta.xml.gz ... 100% of 8 MB
JW300_latest_xml_en.zip ... 100% of 263 MB
JW300_latest_xml_ta.zip ... 100% of 94 MB
Traceback (most recent call last):
  File "your_script.py", line 3, in <module>
    opus_reader.printPairs()
  File "C:\Users\gertv\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\opustools_pkg\opus_read.py", line 350, in printPairs
    lastline = self.readAlignment(gzipAlign)
  File "C:\Users\gertv\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\opustools_pkg\opus_read.py", line 308, in readAlignment
    lastline = self.outputPair(self.par, line)[1]
  File "C:\Users\gertv\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\opustools_pkg\opus_read.py", line 251, in outputPair
    self.sendPairOutput(wpair)
  File "C:\Users\gertv\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\opustools_pkg\opus_read.py", line 210, in sendPairOutput
    self.resultfile.write(wpair[0])
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.1264.0_x64__qbz5n2kfra8p0\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 83-89: character maps to <undefined>

I have no idea how to fix this. Your help is highly appreciated.

Originally posted by @gertva in #3 (comment)

miau1 commented Aug 30, 2019

Could you show the flags you used to initialize opus_reader?

miau1 added the bug label Aug 30, 2019

gertva commented Aug 30, 2019

My your_script.py:

import opustools_pkg
opus_reader = opustools_pkg.OpusRead(["-d", "JW300", "-s", "en", "-t", "ta", "-wm" "tmx"])
opus_reader.printPairs()

miau1 commented Aug 30, 2019

> My your_script.py:
>
> import opustools_pkg
> opus_reader = opustools_pkg.OpusRead(["-d", "JW300", "-s", "en", "-t", "ta", "-wm" "tmx"])
> opus_reader.printPairs()

You are missing a comma between "-wm" and "tmx", but I don't think that's causing the error here. I suspect this is an issue with Windows encoding behavior. I'll try to set up a Windows environment and see if I can replicate the error. In the meantime, you could try running your script in a Unix-like environment, if you have access to one.
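
As a side note on the missing comma: Python joins adjacent string literals at parse time, so the list ends up with one merged argument instead of two. The sketch below uses only plain lists and does not call opustools_pkg; the variable names are made up for illustration.

```python
# Adjacent string literals are concatenated by the parser, so the missing
# comma silently merges "-wm" and "tmx" into a single argument.
without_comma = ["-d", "JW300", "-s", "en", "-t", "ta", "-wm" "tmx"]
with_comma = ["-d", "JW300", "-s", "en", "-t", "ta", "-wm", "tmx"]

print(without_comma[-1])                      # -wmtmx
print(len(without_comma), len(with_comma))    # 7 8
```

No syntax error is raised for the first list, which is why this kind of slip is easy to overlook.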

gertva commented Aug 30, 2019 via email

gertva commented Aug 30, 2019 via email

miau1 commented Sep 2, 2019

@gertva I haven't been able to set up a Windows environment, but I changed something in the program: I now specify the encoding of the result files, which might help with your issue. Upgrade opustools_pkg to version 0.0.43, run your script again and let me know if it works.

I'll probably get my hands on a Windows machine by tomorrow, so I'll be able to do proper debugging.
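
For anyone hitting the same traceback, here is a minimal sketch of the general idea, not the actual change made in opustools_pkg. It assumes the output file was previously opened with the platform default encoding (cp1252 in the traceback above). The sample text and the file name are made up.

```python
import os
import tempfile

# Made-up Tamil sample standing in for one line of TMX output.
sample = "<seg>வணக்கம்</seg>\n"

# cp1252, the codec named in the traceback, has no mapping for Tamil script.
try:
    sample.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Passing encoding= explicitly makes the write independent of the
# platform default encoding.
path = os.path.join(tempfile.mkdtemp(), "result.tmx")
with open(path, "w", encoding="utf-8") as resultfile:
    resultfile.write(sample)

with open(path, encoding="utf-8") as f:
    round_trip = f.read()

print(cp1252_ok, round_trip == sample)   # False True
```

On Unix-like systems the default encoding is usually UTF-8 already, which would explain why the error shows up only on Windows.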

gertva commented Sep 2, 2019 via email

miau1 commented Sep 2, 2019

Great, good to hear!

miau1 closed this as completed Sep 2, 2019

gertva commented Sep 2, 2019 via email

miau1 commented Sep 2, 2019

Almost. You have to include -- before the language id flags. You can also use split() to build the argument list a little more easily, like this:

opus_reader = opustools_pkg.OpusRead("-d JW300 -f -ln -s en -t ta -wm tmx -w enta.tmx --src_cld2 en 0.95 --trg_cld2 ta 0.95 --src_langid en 0.95 --trg_langid ta 0.95".split())

But for this to work, you first have to add the language ids to the zip files, as they don't yet include them by default. To do that, you need to install pycld2 and langid:

pip install pycld2
pip install langid

Then you can run a script like this to create the language ids:

from opustools_pkg.opus_langid import OpusLangid

OpusLangid("-f JW300_latest_xml_en.zip -v".split()).processFiles()
OpusLangid("-f JW300_latest_xml_ta.zip -v".split()).processFiles()

And then you can filter by language ids.
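
A small caveat about the split() trick, shown with plain strings only (nothing here calls opustools_pkg): str.split() cuts on every run of whitespace, so it works as long as no single argument contains a space. The file name with a space below is hypothetical.

```python
import shlex

# A shortened version of the flag string used above.
flags = "-d JW300 -s en -t ta -wm tmx -w enta.tmx"
args = flags.split()
print(args)

# Hypothetical output file name containing a space.
quoted = '-w "en ta.tmx"'
print(quoted.split())        # ['-w', '"en', 'ta.tmx"']
print(shlex.split(quoted))   # ['-w', 'en ta.tmx']
```

If a path or file name might contain spaces, shlex.split() from the standard library respects the quotes, or the list can simply be written out by hand.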

gertva commented Sep 2, 2019 via email
