Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Implement skiptrailingspace parameter for the read_table function #42054

Open
owenlamont opened this issue Jun 16, 2021 · 4 comments
Open
Labels
Enhancement IO CSV read_csv, to_csv Needs Discussion Requires discussion from core team before further action

Comments

@owenlamont
Copy link
Contributor

owenlamont commented Jun 16, 2021

Is your feature request related to a problem?

I had to do several additional manual data cleaning steps to parse certain markdown tables using the read_table function. The most cumbersome was not having an option to strip trailing spaces between values and separators. See mock code example at the end of this description.

Describe the solution you'd like

I would like the read_table function to have a skiptrailingspace boolean parameter that would automatically strip trailing whitespace in the same way the skipinitialspace parameter enables stripping leading whitespace.

API breaking implications

It would involve adding a new named parameter to read_table - so if users were calling with positional arguments it would be a breaking change if it wasn't made the last argument to the function.

Describe alternatives you've considered

Manual data cleaning is tolerable but more onerous. Another alternative that would be ideal from my perspective would be a read_markdown function that could extract indexed tables from markdown text in the same way the read_html function does for html. I suppose a related alternative is to use another Python package to render the markdown then use read_html but that is also onerous.

Additional context

See code example below for my current manual work-around.

import pandas as pd
import io

annoying_table = """
| Heading 1 | Heading 2 | Heading 3 |
| --------- | --------- | --------- |
|      3.14 | cat       | 72        |
|        42 | mat       | 87.3      |
|   1234.56 | rat       | 128       |"""

df = (
    pd.read_table(
        filepath_or_buffer=io.StringIO(annoying_table),
        sep="|",
        skipinitialspace=True,
        # Would like a skiptrailingspace=True too
        skiprows=[2],
    )
    .convert_dtypes()
    .iloc[:, 1:-1]
)

df.columns = df.columns.str.strip()
print(df)

print(df.loc[0,"Heading 2"]) # output "cat       "

# Hacky work-around to strip trailing spaces
for column in df.select_dtypes("string").columns:
    df[column] = df[column].str.strip()

print(df.loc[0,"Heading 2"]) # output "cat"
@owenlamont owenlamont added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 16, 2021
@owenlamont owenlamont changed the title ENH: Implement striptrailingspace parameter for the read_table function ENH: Implement skiptrailingspace parameter for the read_table function Jun 16, 2021
@owenlamont
Copy link
Contributor Author

If this change is approved I'd be happy to work on the implementation.

@pablodz
Copy link

pablodz commented Jun 17, 2021

What about this method

import pandas as pd
import io

tablestr = """
| Heading 1 | Heading 2 | Heading 3 |
| --------- | --------- | --------- |
|      3.14 | cat       | 72        |
|        42 | mat       | 87.3      |
|   1234.56 | rat       | 128       |
|  -3434.00 |           | 987       |"""

pd.read_table(filepath_or_buffer=io.StringIO(tablestr), 
              sep="|", 
              header=0, 
              index_col=1, 
              skipinitialspace=True)\
  .dropna(axis=1, how='all')\
  .iloc[1:]

Output:
image

More info here

@pablodz
Copy link

pablodz commented Jun 17, 2021

Another method could be

import pandas as pd
import io

tablestr = """
| Heading 1 | Heading 2 | Heading 3 |
| --------- | --------- | --------- |
|      3.14 | cat       | 72        |
|        42 | mat       | 87.3      |
|   1234.56 | rat       | 128       |
|  -3434.00 |           | 987       |"""

def read_table_markdown(tablestr):    
    splitted=[row.split('|')[1:-1] for row in tablestr.split('\n')][1:]
    columns_fixed=[' '.join(column.split(' ')[1:-1]) for column in splitted[0]]
    values_fixed=[[''.join(value.split(' ')[1:-1]) for value in  row] for row in splitted[2:]]
    df=pd.DataFrame(values_fixed,columns=columns_fixed)
    return df

read_table_markdown(tablestr)

Output:
image

@owenlamont
Copy link
Contributor Author

owenlamont commented Jun 18, 2021

Thanks @pablodz - your first method didn't strip off the trailing spaces - your second way does but you did all the heavy lifting in plain Python (which I acknowledge might be the most appropriate implementation).

When I Googled "Pandas read markdown" not much came up other than this StackOverflow answer which is what led me to using read_table. I did just notice in the read_table documentation the reference to IO tools documentation which nicely describes the reading and writing functions for different data formats but it notably omits markdown in the list of formats.

That does get to the heart of my question though - is it worth adding some syntactic sugar functionality to read_table to make parsing markdown tables easier or is this such a niche case it isn't worth further complicating the Pandas API? I was hoping to get some feedback on that. I'm happy if the consensus is that it isn't worth complicating the API further for this issue to be closed in that case. My original thought was it seemed like a fairly obvious pairing for the skipinitialspace parameter that it could be justified.

@mroeschke mroeschke added Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 21, 2021
@jbrockmendel jbrockmendel added the IO CSV read_csv, to_csv label Jan 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO CSV read_csv, to_csv Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

4 participants