Skip to content
This repository has been archived by the owner on Jan 28, 2023. It is now read-only.

Added Excel read write functions #97

Merged
merged 26 commits into from Nov 5, 2020
Merged

Added Excel read write functions #97

merged 26 commits into from Nov 5, 2020

Conversation

LeandroC89
Copy link
Contributor

#58
I've added excel read/write functions and their respective tests. (Plus two files, one for data and another generated by the test)
I've also had to add a dependency to apache poi

For now I kept excel values as strings, as Excel is a bit tricky with the way it stores its data (and their typing), this way we can read a data with the displayed value instead of excel's internal format (same for numbers where the .0 would be added).

I've been using these a lot on a project but since I've made some changes/refactoring in order to contribute they don't have all the live testing their previous versions had (although I still tested on my projects and a few variations to be sure)

Please let me know your feedback on this

Thanks and best regards

@holgerbrandl
Copy link
Owner

holgerbrandl commented Oct 9, 2020

Thanks for this great contribution. Excel support is definitely a big missing piece in the puzzle so far.

I think after merging it, we/me should try to modularize the artifacts to prevent an ever growing set of dependencies and to still keep the slim possibility that krangl could support multiplatform someday. So artifacts should be organized somehow into

  • de.mpicbg.scicomp:krangl-corel

  • de.mpicbg.scicomp:krangl-excel

  • de.mpicbg.scicomp:krangl-smile (withe Machine learning bindings to https://haifengl.github.io/)

  • .. some more as listed in the submodules under examples in the repo

  • de.mpicbg.scicomp:krangl-all (a main aggregator pom for those who don't care about dependency trees

Does this make sense to you? I would clearly take care of this modularization.

Specifically concerning the PR, I think we should apply some fine-tuning to improve compliance with the state of the art (which is https://readxl.tidyverse.org imho. In particular, we/me/you should

I know this is a lot, and I'm happy to help with the implementation, but it may take some days/weeks before I manage to realize all these requirements. Could you support by taking over some of these extensions?

@LeandroC89
Copy link
Contributor Author

Thank you, sure I can make the changes. As I don't have much time I'll be taking on only a couple of them at a time, and I'll add comments every time I start working on another so that we don't overlap in case you're working on them too.

For now I'll start with:

  • simplify naming readSheetFromExcelFile --> readExcel (same for write)
  • Add a secondary public read method (while reusing as much as code as possible) that accepts also Int indices as "sheet" argument
  • Consider adding colTypes as used in DataFrame.readCSV and also needed for excel because poor type handling (see https://readxl.tidyverse.org/reference/read_excel.html)

Depending on how my weekend goes I'll try to tackle as much of the others as I can.

Best Regards

@LeandroC89
Copy link
Contributor Author

LeandroC89 commented Oct 11, 2020

Changes made:

  • Renaming read/write functions
  • Overloaded readExcel to allow sheet retrieval by Index (minimal code duplication)
  • Added colTypes parameter
  • Added column type guessing via peekCol (will test this one further and add guessMax limit)

Will start working on:

  • replace in rowNumber with cellRange because users tend to put multiple tables into single sheets. See
    https://readxl.tidyverse.org/reference/read_excel.html for the spec of this feature
  • Add dokka example (there are some in the codebase already, so this should be quick)
  • Add javadoc to public API

I'll also add these smaller changes

  • Renaming some of the parameters to match read_excel and write_excel specifications
  • Ensure names are unique as in readCSV

I believe I didn't answer the modularization part, but yes, it makes perfect sense to me.

Edit: Corrected "minimal code reuse" since I meant the exact opposite.

Signed-off-by: LeandroC89 <leandroadao3@gmail.com>
- Added guessMax to readExcel type guessing

Signed-off-by: LeandroC89 <leandroadao3@gmail.com>
Signed-off-by: LeandroC89 <leandroadao3@gmail.com>
…mmit)

Signed-off-by: LeandroC89 <leandroadao3@gmail.com>
Signed-off-by: LeandroC89 <leandroadao3@gmail.com>
Signed-off-by: LeandroC89 <leandroadao3@gmail.com>
Signed-off-by: LeandroC89 <leandroadao3@gmail.com>
@LeandroC89
Copy link
Contributor Author

Most of the things are already updated, there is one that I may need some light on what to do:

  • Add javadoc to public API

I've added the Dokka Samples and even ran the Javadoc generation to make sure the new functions are there (unversioned so they wont be pulled). Other than this I don't know what I should do, so any info would much appreciated.

I only now noticed that the cellRange is a bit more complex than I initially thought (specifying only the rows/cols, anchor and so on). I thought about doing it with a wildcard, for example "A*:D* would consider columns A,B,C,D up until the first blank row, or even "A1:*" stopping at the first blank column and row but I'll be checking how similar I can make it to read_excel without sacrificing much. So I may need to work on it during the weekend, that should also give me some time to refactor and retest all the functionalities properly.

Thanks

@holgerbrandl
Copy link
Owner

Thank you @LeandroC89 for this wonderful contribution.

Concerning cellranges, let's keep it simple. IMHO a simple D3:F4 without any anchoring or exclusions or wildcards is fine. What we could do potentially is to not use String as parameter type in readExcel but CellRangeAdress (which could be named just CellRange) directly. This would allow adding more complex cell range definitions later by simply enhancing CellRange with secondary constructors or more complex builder patterns.

One thing which I'll do (but you could prepare for it already) is to use dedicated files such as ExcelIO.kt, Exceltests.kt and so on for all code that imports poi. As pointed out in my initial comment, we should start providing submodules to minimize dependencies. I'll do that once we have merged your PR.

Concerning the javadoc, please see my code review comment.

Feel welcome if you need any help, or think that I should take over the remaining bits and pieces.

- Defaulted CellRange rows to the first and last populated rows when no address is provided
- Defaulted cellRange columns to first and last populated columns with headers (taken from first populated row) when no address is provided
 - Added test case for this default case

Signed-off-by: LeandroC89 <leandroadao3@gmail.com>
Signed-off-by: LeandroC89 <leandroadao3@gmail.com>
- Added capability to include/skip blank rows

Signed-off-by: LeandroC89 <leandroadao3@gmail.com>
Signed-off-by: LeandroC89 <leandroadao3@gmail.com>
Signed-off-by: LeandroC89 <leandroadao3@gmail.com>
Signed-off-by: LeandroC89 <leandroadao3@gmail.com>
Signed-off-by: LeandroC89 <leandroadao3@gmail.com>
@LeandroC89
Copy link
Contributor Author

LeandroC89 commented Oct 18, 2020

Hello @holgerbrandl and thank you for all the tips/reviews,

cellRange is now of type CellRangeAddress and if not provided will instead default to sheet values first/last row/column (which only take into account the populated rows)

I also added a few more functionalities in order to allow more flexibility without the users having to prepare the files beforehand, and moved the apache poi dependent functions to the new .kt files.

As for the javadoc I was unable to find the review comment (or the review) on the Files changed tab, I'm a bit new to contributions so I'm probably not looking in the right places and may need some help on that.

If there is any feature you think would be useful, or that something is missing/needing an update I'll gladly help as much as I can but feel free to take over if/when you think it best.

Thank you

Edit: Typo and phrasing

LeandroC89 and others added 3 commits October 21, 2020 10:36
Signed-off-by: LeandroC89 <leandroadao3@gmail.com>
…till throwing it)

Signed-off-by: LeandroC89 <leandroadao3@gmail.com>
@holgerbrandl holgerbrandl merged commit 457d34b into holgerbrandl:master Nov 5, 2020
@holgerbrandl
Copy link
Owner

Thanks again for this great contribution.

I've added just a default parameter. sheetIndex=0 in readExcel to enable usage with just a file argument.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants