Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Excel output in non-ascii encodings #3710

Closed
wants to merge 4 commits into from
Closed

Excel output in non-ascii encodings #3710

wants to merge 4 commits into from

Conversation

jtornero
Copy link

First, I would like to thank the developers for this excellent work.

My patch is about the possibility of specify the workbook encoding when making output to excel xls files. By default, pandas hasn't it, so say-so non ascii language speakers weren't able to save files with special characters, receiving an error like, for instance:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)

My change makes possible to specify both in ExcelWriter and DataFrame.to_excel the parameter encoding which is passed by both the ExcelWriter init and DataFrame.to_excel functions to the appropiate xlwt function to make the trick.

I'm sorry my explanations aren't very proffesional, but I think they are enough for this purpose.

Also I have to apologize in the sense that I've run the tests with nosetests and I lack of knowledge about tests interpretation, but I have checked the log and no error comes from the modules/files I've worked with.

Best regards,

Jorge Tornero

…using xlwt, also for output without ExcelWriter - just filename
…using xlwt, also for output without ExcelWriter - just filename - final
@jreback
Copy link
Contributor

jreback commented May 29, 2013

@jtornero this is nice, but a couple of things.

  1. the argument encoding needs to be moved to the end of the named arguments (as its a new argument)
  2. needs some tests, e.g. look at _check_extension_mixed in test_excel.py; at the very least need to create a frame, write it with an encoding, then read it back and check for equality

@jreback
Copy link
Contributor

jreback commented May 31, 2013

@jtornero

couple of things

  1. don't include the test output (test2.txt) as a file
  2. put the tests in test_excel (don't create another test file)
  3. don't include extra files in the repo, just write them out and then read them back in to test
    (if you really need say 1 file to make sure the encoding is correct, ok)
  4. is there a reason that the encoding should even have a default? shouldn't it be empty? (e.g. let the system encoding determine it), unless overriden by the user?

@jreback
Copy link
Contributor

jreback commented May 31, 2013

also...need to rebase off of master

git br your-branch --set-upstream master
git checkout master

git pull

git checkout your-branch
git pull --rebase

then merge conflcts

the excel stuff has been moved to pandas.io.excel (tests are in the same location)....these should show up as a merge conflict

fix and proceed

thanks

@jtornero
Copy link
Author

jtornero commented Jun 1, 2013

@jreback

First, I think I should learn this git stuff a bit before doing the big mess. I have to recognize that git overwhelmes me a little.

I tried to do what you said and no way. Maybe it is better for me (and for all) to start in git from scratch (delete my fork and start again)

Anyway, about your suggest has two optionals parameters:ions (will be followed in the future)

  1. I won't include the test results file in the future. I thought that maybe it will be useful for someone interested in the same issue

  2. I created a separate test file apart from test_excel because (AFAIK) I must specify the # -- coding: -- specified in PEP 0263, taking into account that my non-ascii sample data must contain non ascii characters. I don't know if it a better idea to have such data in a file apart, as I think you suggest in your point 3), load it, and make the savings/conversions

  3. In this line, the extra files are those generated by the test itself. I guess I should provide a way to get rid of them after test completion, shouldn't I?

  4. pandas.ExcelWriter/.toExcel makes use of xlwt.Workbook(). This class has an optional parameter encoding, of which default is ascii and it what makes the magic here. As far as my python knowledge lets me go, I need to pass that parameter to ExcelWriter to be passed to xlwt_Workbook. And all the say-so ascii writer have it already made, I think.

Best regards,

Jorge Tornero

@jreback
Copy link
Contributor

jreback commented Jun 3, 2013

@jtornero the encoding my default in xlwt.Workbook(), but encoding=None should be the default in to_excel. so the idea is to test the following cases

  • encoding=None, the current behavior and the default
  • encoding='ascii', should equal the above case
  • encoding='another encoding'

these are easy enough to test without an actual data file; this will temporarily create a file and clean it up when done.

you will get read_excel if you rebase off of master (or start over), up to u, git is an animal!

with ensure_clean('test_encoding.xls') as path:
    df.to_excel(path,encoding=whatever)
    result = read_excel(path,encoding=whatever)
    assert_frame_equal(result,df)

@jreback
Copy link
Contributor

jreback commented Aug 23, 2013

@jtornero how's this coming along?

@jtornero
Copy link
Author

Hello again. Sorry for the time been away. I'll be on holiday in a week, and I think I'll have some spare time to try to cope with it. I really need to start over with my git forking etc., because I made sort of mess with my repository. Maybe it's a big thing for my level of knowledge, but I'll try to do it again.

Best regards

@jreback
Copy link
Contributor

jreback commented Sep 26, 2013

ping!

@jtornero
Copy link
Author

Yes, I'm still alive. I've started again and I'm going to try to make it work this weekend. Sorry.

@jtornero
Copy link
Author

Well I've made it and it passes the tests. But I can't manage to commit my changes in this *****+ of git. I know it's my fault but I am sooooo disappointed... need help about this. Also, the changes are so simle after the new code of excel output. But I'm so stuck... guess I'll be able to put my changes here some time.

I don't know if it's the thing about Travis IC or what, but 've done this:

git clone https://github.com/jtornero/pandas.git

git checkout -b excelout_encodings

after my changes:

git add .

git status

I've got this:

On branch excelout_encodings
Changes to be committed:
(use "git reset HEAD ..." to unstage)

   modified:   pandas/core/frame.py
   modified:   pandas/io/excel.py
   modified:   pandas/io/tests/test_excel.py

Then
git commit -a -m "#3710 ENH/TST: Support for non-ascii encodings in DataFrame.to_excel"

and then:

git push origin master

Well... then nothing happens. I guess I should be able to see my commit in my repository... but nothing happens!!!

I don't know how to do it... maybe it is related to Travis IC?

Thank yo very much

@jtratner
Copy link
Contributor

@jtornero it looks like you aren't making that many changes. Might be easier to go back and checkout pydata/master then apply just your changes to it.

@jreback
Copy link
Contributor

jreback commented Sep 29, 2013

@jtornero I believe the #5025 is the current PR, this should be closed, yes? (as its not an issue but another PR for the same issue)

@jtornero
Copy link
Author

I guess everything could be "assigned" to #3710, but I had to re-git everything and started an new pull request, hence the #5025... Sorry for my unskillfullness

@jreback jreback closed this Sep 30, 2013
@jreback
Copy link
Contributor

jreback commented Mar 12, 2014

merged via 268ee80

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants