Added column-filter functionality #16
Because by default CSV tables shouldn’t contain markdown syntax
This reverts commit e0d74b7.
Python2 support
because contaminated panflute 1.9.7 and 1.10 has been deleted.
pip no longer support py3.2
If the column_filter is not specified or is an empty list, then the table is not modified. Otherwise the raw_table_list is filtered based on the values in the column_filter (i.e., column indexes not specified in the filter are removed).

Each element in the column_filter list must be an integer or a dictionary with at least the key 'col'.

Specifying an integer in the column_filter list makes sure that column index is kept (the first column is index 0, as in Python list indexing).

Specifying a dictionary makes it possible to add one of the following optional keys (note: the keys are mutually exclusive, and specifying more than one has undefined behaviour).

- filter: removes the row if the content of this column doesn't match. Matching is an exact string comparison between the value of this key and the content of the cell. The value may be a list of strings to be matched.
- regex: removes the row if the content of this column doesn't match. The value of this key is passed directly to `re.match(pattern, string)` as the `pattern`, with the cell value as the `string`.

Note: Currently we assume that only a small number of regexes are used, so we don't compile them ourselves but rely on the built-in caching to handle it for us.

Example: This example won't filter out any column, but it demonstrates the three different ways that you may specify a column-filter. Try changing any one of them to see how columns or rows are filtered from the resulting table.

```{.table}
---
caption: "*Bar* table"
markdown: yes
column-filter:
  - 0
  - col: 1
    regex: ".*B|[\\d]"
  - col: 2
    filter: ['C', '3']
---
A,B,C
1,2,3
```
I think we need more discussion on the syntax for this. You might try asking people in the "Markdown, tables and CSV" Google Group to see if there are any suggestions there, and/or open an issue here (I'll open one soon). For now I'll put this pull request on hold.
And remember to include tests in pull requests. There are 2 kinds of tests here. One is a Python unit test that calls the functions and compares the results. The other runs pandoc directly and checks whether the output native AST is the same as a predefined one (usually generated automatically and just eyeballed to verify it's doing what it's supposed to do).
I also agree. Ideally, you want a solution that is both general and simple to implement. For instance, allowing lambdas that will be
@reenberg, can you briefly describe what exactly you want to accomplish? i.e. let's forget about syntax and how to do it for the moment, and just give a small before & after example of what you want to do. In particular, how you would want the regex to behave. e.g. the simplest kind of filter will be
My current issue is that I'm writing a document where I have a spreadsheet of events. This actually started out as a .csv file that I edited with a spreadsheet editor, but it has since evolved: I found the need for formulas (time calculations, column concatenation) and conditional formatting (to easily show groups of rows, etc.), so it is now an .ods document that I export to .csv. The .csv file describes all the event data, such as type, start, end, various kinds of descriptions, workloads, etc. My need is therefore specifically to keep only some of the columns (e.g., 1, 3, 4, 6, 7) and then also to filter the rows, so that I only get the rows concerning the specific event types. This was previously dealt with by some nasty LaTeX macros that I just couldn't be bothered to maintain any more. My initial implementation with the 'filter' and 'regex' properties was just what came to mind when coding it. Specifically, though, I'm using the regex right now to easily filter out 'event', 'event2' and 'event3'. I use suffix numbering of the event type to have the events in different colours when generating some of the other overviews (think something like graphs).
* Changed column_filter to table_filter. * Changed the filter into a generator.
9c62628 to c645dae
I'm accomplishing something similar using CSVKit, specifically @reenberg Why do you think this should be implemented in Pantable itself?
The “pandoc way” to accomplish a task like this, without bloating a filter, is to have another filter process the CSV before pantable (i.e. piping a filter before pantable). But inevitably this other filter has to be designed for pantable (e.g. which class to use), so it is not strictly composable (i.e. not entirely independent of pantable). This hypothetical other filter is therefore more like a pantable plugin, which may be why people want to put the two together. I think the solution you mention has to go through the shell (e.g. I'd use a makefile with an intermediate file chaining them together). A solution like this is not universal. Also, a build like this makes the document less reproducible (in the sense that more details about how the document is built are needed).
It's always nice that someone cares, even if it's just shy of 2 years since I left a reply to your comments @ickc. To be honest, I don't think that I knew about CSVkit back then. And I guess I just fell victim to the classic "everything looks like a nail when you have a hammer". I don't think that my proposed changes do anything more than what can be achieved with a good combination of However I remember thinking that it was cleaner not having to set up an "elaborate build pipeline"/Makefile to carve out all the intermediate files that I needed back then. It felt smoother to let the people writing the document and invoking pantable just specify what data they needed from the csv file inside markdown. Not everyone is comfortable piping unix tools.
c51c4a4 to accb831
<tl;dr> This adds the option to filter out entire columns and rows based on column data in a .csv file, for when you only want to create a small table from a large .csv file.
If the column_filter is not specified or is an empty list, then the table is
not modified. Otherwise the raw_table_list is filtered based on the values in
the column_filter (i.e., column indexes not specified in the filter are removed).
Each element in the column_filter list must be an integer or a dictionary
with at least the key 'col'.
Specifying an integer in the column_filter list makes sure that column
index is kept (the first column is index 0, as in Python list indexing).
Specifying a dictionary makes it possible to add one of the following
optional keys (note: the keys are mutually exclusive, and specifying more
than one has undefined behaviour).
- filter: removes the row if the content of this column doesn't match.
  Matching is an exact string comparison between the value of this key and
  the content of the cell. The value may be a list of strings to be matched.
- regex: removes the row if the content of this column doesn't match. The
  value of this key is passed directly to `re.match(pattern, string)` as the
  `pattern`, with the cell value as the `string`.
Note: Currently we assume that only a small number of regexes are used, so
we don't compile them ourselves but rely on the built-in caching to handle
it for us.
Example: This example won't filter out any column, but it demonstrates the
three different ways that you may specify a column-filter. Try changing any
one of them to see how columns or rows are filtered from the resulting
table.
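
For readers who want to see the described semantics spelled out, here is a minimal Python sketch. It is not the code from this pull request: the function name `apply_column_filter` and its signature are hypothetical, and it assumes that kept columns are emitted in the order they appear in the filter and that every row, including the header row, is subject to the row tests. The second assumption is at least consistent with the description's claim that a header of `A,B,C` with a data row of `1,2,3` passes the three sample filter entries unchanged.

```python
import re


def apply_column_filter(rows, column_filter=None):
    """Hypothetical helper: keep the listed columns, drop rows failing a test."""
    # No filter, or an empty list: the table is returned unmodified.
    if not column_filter:
        return rows

    keep = []       # column indexes to keep, in filter order (assumption)
    row_tests = []  # (column index, predicate on the cell text)
    for entry in column_filter:
        if isinstance(entry, int):
            keep.append(entry)
            continue
        col = entry["col"]
        keep.append(col)
        if "filter" in entry:
            wanted = entry["filter"]
            if isinstance(wanted, str):
                wanted = [wanted]
            # Exact string match against any of the wanted values.
            row_tests.append(
                (col, lambda cell, wanted=tuple(wanted): cell in wanted)
            )
        elif "regex" in entry:
            pattern = entry["regex"]
            # re.match anchors at the start of the cell; the re module
            # caches recently used patterns internally.
            row_tests.append(
                (col, lambda cell, pattern=pattern: re.match(pattern, cell) is not None)
            )

    return [
        [row[i] for i in keep]
        for row in rows
        if all(test(row[col]) for col, test in row_tests)
    ]


table = [["A", "B", "C"], ["1", "2", "3"]]
column_filter = [
    0,
    {"col": 1, "regex": r".*B|[\d]"},
    {"col": 2, "filter": ["C", "3"]},
]
result = apply_column_filter(table, column_filter)
```

With the sample filter, `result` equals `table`, since both rows pass both tests and all three columns are listed. Changing the last entry to `{"col": 2, "filter": "3"}` would drop the `A,B,C` row under the assumptions above.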