read_html - how to prevent the conversion of numerical fields #21379
Comments
Actually, examples would be very much appreciated. My first guess is that you should be able to specify |
@gfyoung sure, sorry for the delayed response; please find the example below. Basically the table is the following:
Number in double-quotes (like a string)
According to my understanding, the suggested approach to processing numeric values is based on manipulating the "thousands" and "decimal" fields.
With this workaround, only a few cases lead to "prevention" of the numeric conversion:
But obviously, such a method could fail on other datasets. So the problem statement is: is it possible to extract the content of a table with read_html as it is, so that the result is stored as strings without pandas converting or casting any columns?
Example
I'd like to remind you that this example is not artificial. There could be files with numbers in both UK / US format, or at least an algorithm that has to handle both types.
Converters
Specifying the "converters" keyword with a mapping from a column name to a string leads to the following:
As I see, the results are all strings now, but it looks like the content was still converted.
dtype
@gfyoung, you mentioned specifying dtypes. As far as I can see, that is available in the read_csv function, but unfortunately not in read_html. In case "converters" is equivalent to "dtype", please see the section above.
Another problem - date formatting
Please also take a look at the Stack Overflow question I mentioned in the first message of this issue. The numeric conversion I've just described also attempts to convert date values like "01.12.2017" to numbers like 1122017, which is also a pretty important point.
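To illustrate the mixed-format and date concerns above, here is a minimal sketch of strict post-processing, assuming the cells are already available as raw strings. The helper `parse_number` is hypothetical and not part of pandas. It only accepts text that is well formed for the stated separators, so a date like "01.12.2017" is rejected rather than collapsed into 1122017.

```python
import re
from decimal import Decimal
from typing import Optional


def parse_number(text: str, thousands: str, decimal: str) -> Optional[Decimal]:
    """Hypothetical helper: return a Decimal if `text` is a well-formed
    number for the given separators, otherwise None (text is left alone)."""
    t, d = re.escape(thousands), re.escape(decimal)
    # Either properly grouped digits (1-3 digits, then groups of exactly 3)
    # or plain digits, followed by an optional fractional part.
    pattern = rf"[+-]?(?:\d{{1,3}}(?:{t}\d{{3}})+|\d+)(?:{d}\d+)?"
    stripped = text.strip()
    if re.fullmatch(pattern, stripped) is None:
        return None
    normalized = stripped.replace(thousands, "").replace(decimal, ".")
    return Decimal(normalized)


# EU-style reading of "60,00": comma is the decimal mark.
eu_value = parse_number("60,00", thousands=".", decimal=",")
# US-style reading of "60,00": not a valid grouping, so it is rejected.
us_value = parse_number("60,00", thousands=",", decimal=".")
# A date is not a well-formed number in either convention.
date_value = parse_number("01.12.2017", thousands=".", decimal=",")
```

The design choice here is to validate before stripping separators, so that ambiguous or non-numeric text stays a string and the caller decides per column which convention applies.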
Have you tried
    In [22]: html = "<html><body><table><tr><th>foo</th></tr><tr><td>60,00</td></tr></table></body></html>"
    In [23]: pd.read_html(html, thousands=None)[0]
    Out[23]:
           0
    0    foo
    1  60,00
If that works for you then would certainly take a PR to update documentation and add as an example. Alternately adding |
Hi everyone,
Is it possible to use the pandas.read_html function in a way that it won't convert the numbers in HTML tables and instead exports them as they are (as strings)?
Assume that there is an HTML table which contains the value "60,00".
Reading that table using pandas.read_html will lead to an integer 6000.
Adding the flag thousands='.' will result in a string "60.00".
Adding both flags thousands='.' and decimal=',' will result in a float 60.0.
Is it possible to ask pandas.read_html to stop converting numbers by itself based on the logic of "thousands" and "decimal", since the file may contain data in both EU and US formats?
It would be amazing to use pandas as a tool that just exports the data from HTML to a DataFrame as it is and leaves data postprocessing / conversion / etc. to later logic (which could also be implemented with the help of pandas).
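For comparison, here is a minimal sketch of what "export as it is" could look like outside of read_html, using only the standard library's `html.parser`. This is a swapped-in technique, not pandas' own behaviour. The class `RawTableParser` and the function `read_table_as_strings` are hypothetical names, and the sample HTML is an assumption that extends the one-cell table from this thread with a date row. The sketch ignores nested tables, rowspan and colspan.

```python
from html.parser import HTMLParser


class RawTableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, row by row, unconverted."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def read_table_as_strings(html):
    """Return all table rows as lists of strings, with no type inference."""
    parser = RawTableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


html = (
    "<html><body><table>"
    "<tr><th>foo</th></tr>"
    "<tr><td>60,00</td></tr>"
    "<tr><td>01.12.2017</td></tr>"
    "</table></body></html>"
)
rows = read_table_as_strings(html)
print(rows)
```

The resulting rows could then be handed to something like `pandas.DataFrame(rows[1:], columns=rows[0])`, so that any numeric or date conversion happens in a later, explicit step.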
I've described a similar issue here: https://stackoverflow.com/questions/47327966/pandas-converting-numbers-to-strings-unexpected-results
Thanks a lot in advance; please let me know if it would be reasonable to attach examples.