ENH: Implement skiptrailingspace parameter for the read_table function #42054

owenlamont · 2021-06-16T13:43:33Z

Is your feature request related to a problem?

I had to do several additional manual data cleaning steps to parse certain markdown tables using the read_table function. The most cumbersome was not having an option to strip trailing spaces between values and separators. See mock code example at the end of this description.

Describe the solution you'd like

I would like the read_table function to have a skiptrailingspace boolean parameter that would automatically strip trailing whitespace in the same way the skipinitialspace parameter enables stripping leading whitespace.

API breaking implications

It would involve adding a new named parameter to read_table - so if users were calling with positional arguments it would be a breaking change if it wasn't made the last argument to the function.

Describe alternatives you've considered

Manual data cleaning is tolerable but more onerous. Another alternative that would be ideal from my perspective would be a read_markdown function that could extract indexed tables from markdown text in the same way the read_html function does for html. I suppose a related alternative is to use another Python package to render the markdown then use read_html but that is also onerous.

Additional context

See code example below for my current manual work-around.

import pandas as pd
import io

annoying_table = """
| Heading 1 | Heading 2 | Heading 3 |
| --------- | --------- | --------- |
|      3.14 | cat       | 72        |
|        42 | mat       | 87.3      |
|   1234.56 | rat       | 128       |"""

df = (
    pd.read_table(
        filepath_or_buffer=io.StringIO(annoying_table),
        sep="|",
        skipinitialspace=True,
        # Would like a skiptrailingspace=True too
        skiprows=[2],
    )
    .convert_dtypes()
    .iloc[:, 1:-1]
)

df.columns = df.columns.str.strip()
print(df)

print(df.loc[0,"Heading 2"]) # output "cat       "

# Hacky work-around to strip trailing spaces
for column in df.select_dtypes("string").columns:
    df[column] = df[column].str.strip()

print(df.loc[0,"Heading 2"]) # output "cat"

owenlamont · 2021-06-16T13:47:45Z

If this change is approved I'd be happy to work on the implementation.

pablodz · 2021-06-17T03:49:00Z

What about this method

import pandas as pd
import io

tablestr = """
| Heading 1 | Heading 2 | Heading 3 |
| --------- | --------- | --------- |
|      3.14 | cat       | 72        |
|        42 | mat       | 87.3      |
|   1234.56 | rat       | 128       |
|  -3434.00 |           | 987       |"""

pd.read_table(filepath_or_buffer=io.StringIO(tablestr), 
              sep="|", 
              header=0, 
              index_col=1, 
              skipinitialspace=True)\
  .dropna(axis=1, how='all')\
  .iloc[1:]

Output:

More info here

pablodz · 2021-06-17T04:03:08Z

Another method could be

import pandas as pd
import io

tablestr = """
| Heading 1 | Heading 2 | Heading 3 |
| --------- | --------- | --------- |
|      3.14 | cat       | 72        |
|        42 | mat       | 87.3      |
|   1234.56 | rat       | 128       |
|  -3434.00 |           | 987       |"""

def read_table_markdown(tablestr):    
    splitted=[row.split('|')[1:-1] for row in tablestr.split('\n')][1:]
    columns_fixed=[' '.join(column.split(' ')[1:-1]) for column in splitted[0]]
    values_fixed=[[''.join(value.split(' ')[1:-1]) for value in  row] for row in splitted[2:]]
    df=pd.DataFrame(values_fixed,columns=columns_fixed)
    return df

read_table_markdown(tablestr)

Output:

owenlamont · 2021-06-18T14:13:34Z

Thanks @pablodz - your first method didn't strip off the trailing spaces - your second way does but you did all the heavy lifting in plain Python (which I acknowledge might be the most appropriate implementation).

When I Googled "Pandas read markdown" not much came up other than this StackOverflow answer which is what led me to using read_table. I did just notice in the read_table documentation the reference to IO tools documentation which nicely describes the reading and writing functions for different data formats but it notably omits markdown in the list of formats.

That does get to the heart of my question though - is it worth adding some syntactic sugar functionality to read_table to make parsing markdown tables easier or is this such a niche case it isn't worth further complicating the Pandas API? I was hoping to get some feedback on that. I'm happy if the consensus is that it isn't worth complicating the API further for this issue to be closed in that case. My original thought was it seemed like a fairly obvious pairing for the skipinitialspace parameter that it could be justified.

owenlamont added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 16, 2021

owenlamont changed the title ~~ENH: Implement striptrailingspace parameter for the read_table function~~ ENH: Implement skiptrailingspace parameter for the read_table function Jun 16, 2021

mroeschke added Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 21, 2021

jbrockmendel added the IO CSV read_csv, to_csv label Jan 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Implement skiptrailingspace parameter for the read_table function #42054

ENH: Implement skiptrailingspace parameter for the read_table function #42054

owenlamont commented Jun 16, 2021 •

edited

owenlamont commented Jun 16, 2021

pablodz commented Jun 17, 2021

pablodz commented Jun 17, 2021 •

edited

owenlamont commented Jun 18, 2021 •

edited

ENH: Implement skiptrailingspace parameter for the read_table function #42054

ENH: Implement skiptrailingspace parameter for the read_table function #42054

Comments

owenlamont commented Jun 16, 2021 • edited

Is your feature request related to a problem?

Describe the solution you'd like

API breaking implications

Describe alternatives you've considered

Additional context

owenlamont commented Jun 16, 2021

pablodz commented Jun 17, 2021

pablodz commented Jun 17, 2021 • edited

owenlamont commented Jun 18, 2021 • edited

owenlamont commented Jun 16, 2021 •

edited

pablodz commented Jun 17, 2021 •

edited

owenlamont commented Jun 18, 2021 •

edited