New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Implement skiptrailingspace parameter for the read_table function #42054
Comments
If this change is approved I'd be happy to work on the implementation. |
What about this method import pandas as pd
import io
tablestr = """
| Heading 1 | Heading 2 | Heading 3 |
| --------- | --------- | --------- |
| 3.14 | cat | 72 |
| 42 | mat | 87.3 |
| 1234.56 | rat | 128 |
| -3434.00 | | 987 |"""
pd.read_table(filepath_or_buffer=io.StringIO(tablestr),
sep="|",
header=0,
index_col=1,
skipinitialspace=True)\
.dropna(axis=1, how='all')\
.iloc[1:] More info here |
Another method could be import pandas as pd
import io
tablestr = """
| Heading 1 | Heading 2 | Heading 3 |
| --------- | --------- | --------- |
| 3.14 | cat | 72 |
| 42 | mat | 87.3 |
| 1234.56 | rat | 128 |
| -3434.00 | | 987 |"""
def read_table_markdown(tablestr):
splitted=[row.split('|')[1:-1] for row in tablestr.split('\n')][1:]
columns_fixed=[' '.join(column.split(' ')[1:-1]) for column in splitted[0]]
values_fixed=[[''.join(value.split(' ')[1:-1]) for value in row] for row in splitted[2:]]
df=pd.DataFrame(values_fixed,columns=columns_fixed)
return df
read_table_markdown(tablestr) |
Thanks @pablodz - your first method didn't strip off the trailing spaces - your second way does but you did all the heavy lifting in plain Python (which I acknowledge might be the most appropriate implementation). When I Googled "Pandas read markdown" not much came up other than this StackOverflow answer which is what led me to using read_table. I did just notice in the read_table documentation the reference to IO tools documentation which nicely describes the reading and writing functions for different data formats but it notably omits markdown in the list of formats. That does get to the heart of my question though - is it worth adding some syntactic sugar functionality to read_table to make parsing markdown tables easier or is this such a niche case it isn't worth further complicating the Pandas API? I was hoping to get some feedback on that. I'm happy if the consensus is that it isn't worth complicating the API further for this issue to be closed in that case. My original thought was it seemed like a fairly obvious pairing for the skipinitialspace parameter that it could be justified. |
Is your feature request related to a problem?
I had to do several additional manual data cleaning steps to parse certain markdown tables using the read_table function. The most cumbersome was not having an option to strip trailing spaces between values and separators. See mock code example at the end of this description.
Describe the solution you'd like
I would like the read_table function to have a skiptrailingspace boolean parameter that would automatically strip trailing whitespace in the same way the skipinitialspace parameter enables stripping leading whitespace.
API breaking implications
It would involve adding a new named parameter to read_table - so if users were calling with positional arguments it would be a breaking change if it wasn't made the last argument to the function.
Describe alternatives you've considered
Manual data cleaning is tolerable but more onerous. Another alternative that would be ideal from my perspective would be a read_markdown function that could extract indexed tables from markdown text in the same way the read_html function does for html. I suppose a related alternative is to use another Python package to render the markdown then use read_html but that is also onerous.
Additional context
See code example below for my current manual work-around.
The text was updated successfully, but these errors were encountered: