read_html - how to prevent the conversion of numerical fields #21379

vzelen · 2018-06-08T08:37:40Z

Hi everyone,

Is it possible to use pandas.read_html function in way so it won't convert the numbers in html tables and export them as they are (as strings)?

Assume that there is a html table which contains a value "60,00".

Reading that table using pandas.read_html will lead to an integer 6000.
Adding the flag thousands='.' will result in a string "60.00".
Adding both flags thousands='.' and decimal=',' will result in a float 60.0.

Is it possible to ask pandas.read_html to stop performing conversion of numbers by itself based on the logic of "thousands" and "decimals"? Since the file may contain data in both EU and US formats.

It would be amazing to use pandas as a tool that will just export the data from html to dataframe as it is and leaves the logic of data postprocessing / conversion / etc to the further logic (which could be also implemented with a help of pandas).

A similar issue that I've described here: https://stackoverflow.com/questions/47327966/pandas-converting-numbers-to-strings-unexpected-results

Thank you a lot in advance, please let me know if it's reasonable to attach examples.

gfyoung · 2018-06-08T09:08:01Z

Thank you a lot in advance, please let me know if it's reasonable to attach examples.

Actually, examples would be very much appreciated.

My first guess is that you should be able to specify dtype for the column containing that value, but let's see when you post examples.

vzelen · 2018-06-13T17:28:35Z

@gfyoung sure, sorry for the delay with the responce, please find the example attached below.

Basically the table is the following:

Number in double-quotes (like a string)
0 "1234"
1 "1234.0"
2 "1234.00"
3 "1,234"
4 "1,234.0"
5 "1,234.00"
6 "1234,0"
7 "1.234"
8 "1.234,0"
9 "1.234,00"

According to my understanding, suggested approach to process numeric values is based on the manipulations with "thousands" and "decimal" fields.
I'm iterating on all possible combinations of these keywords with values "," and "." (including empty case) an extracting the last column from the attached table:

{} --> [1234.0, 1234.0, 1234.0, 1234.0, 1234.0, 1234.0, 12340.0, 1.234, 1.234, 1.234]
{'thousands': '.'} --> ['1234', '12340', '123400', '1,234', '1,234.0', '1,234.00', '1234,0', '1234', '1.234,0', '1.234,00']
{'thousands': ','} --> [1234.0, 1234.0, 1234.0, 1234.0, 1234.0, 1234.0, 12340.0, 1.234, 1.234, 1.234]
{'decimal': '.'} --> [1234.0, 1234.0, 1234.0, 1234.0, 1234.0, 1234.0, 12340.0, 1.234, 1.234, 1.234]
{'decimal': ','} --> ['1234', '1234.0', '1234.00', '1234', '1,234.0', '1,234.00', '12340', '1.234', '1.234,0', '1.234,00']
{'decimal': '.', 'thousands': ','} --> [1234.0, 1234.0, 1234.0, 1234.0, 1234.0, 1234.0, 12340.0, 1.234, 1.234, 1.234]
{'decimal': ',', 'thousands': '.'} --> [1234.0, 12340.0, 123400.0, 1.234, 1.234, 1.234, 1234.0, 1234.0, 1234.0, 1234.0]

In this workaround, only few cases leads to "prevention" of numeric conversion:

'thousands': '.'
'decimal': ','

But obviously, such method could fail on some other datasets.

So problem statement: is it possible to extract (with read_html) the content of a table as it is, so the result will be stored as strings without conversions / casting of any columns by pandas?

Example

Thousands sep	Floats sep	Number of zeros at the end	Number in double-quotes (like a string)	Parsed number (different forms of number 1234.0)
Nothing	Nothing	0	"1234"	1234
Nothing	.	1	"1234.0"	1234.0
Nothing	.	2	"1234.00"	1234.00
","	Nothing	0	"1,234"	1,234
","	.	1	"1,234.0"	1,234.0
","	.	2	"1,234.00"	1,234.00
Nothing	","	1	"1234,0"	1234,0
.	","	0	"1.234"	1.234
.	","	1	"1.234,0"	1.234,0
.	","	2	"1.234,00"	1.234,00

I'm referering and reminding you guys, that this example is not artificial. There could be files with numbers in both UK / US format or at least an algorithm that have to handle both types.

Converters

Specification of the "converters" keyword with the mapping from a column name to a string leads to the following:

converters = {
 'Floats sep': str,
 'Number in double-quotes (like a string)': str,
 'Number of zeros at the end': str,
 'Parsed number (different forms of number 1234.0)': str,
 'Thousands sep': str
}
{} --> ['1234', '1234.0', '1234.00', '1234', '1234.0', '1234.00', '12340', '1.234', '1.2340', '1.23400']
{'thousands': '.'} --> ['1234', '12340', '123400', '1,234', '1,234.0', '1,234.00', '1234,0', '1234', '1.234,0', '1.234,00']
{'thousands': ','} --> ['1234', '1234.0', '1234.00', '1234', '1234.0', '1234.00', '12340', '1.234', '1.2340', '1.23400']
{'decimal': '.'} --> ['1234', '1234.0', '1234.00', '1234', '1234.0', '1234.00', '12340', '1.234', '1.2340', '1.23400']
{'decimal': ','} --> ['1234', '1234.0', '1234.00', '1234', '1,234.0', '1,234.00', '12340', '1.234', '1.234,0', '1.234,00']
{'decimal': '.', 'thousands': ','} --> ['1234', '1234.0', '1234.00', '1234', '1234.0', '1234.00', '12340', '1.234', '1.2340', '1.23400']
{'decimal': ',', 'thousands': '.'} --> ['1234', '12340', '123400', '1.234', '1.2340', '1.23400', '1234.0', '1234', '1234.0', '1234.00']

As I see, the results are all string now, but it looks like the content was still converted.

dtype

@gfyoung, you've mentioned specification of dtypes. As I see it's available in read_csv function, but unfortunately not in read_html. In case "converters" is equivalent to "dtype", please see the section above.

Another problem - date formatting

I'm reminding you guys that you may want to take a look at the stackoverflow question I've mentioned in the first message of this issue. This numeric converter that I've just described also attempts to convert date values like "01.12.2017" to numbers like 1122017, which is also pretty important point.

WillAyd · 2018-06-13T23:54:00Z

Have you tried thousands=None?

In [22]: html = "<html><body><table><tr><th>foo</th></tr><tr><td>60,00</td></tr></table></body></html>"

In [23]: pd.read_html(html, thousands=None)[0]
Out[23]: 
       0
0    foo
1  60,00

If that works for you then would certainly take a PR to update documentation and add as an example.

Alternately adding dtype as a keyword argument could work, but read_html operates differently from other read functions if only for the fact that it returns a list instead of a single DataFrame. Implementing that in a way that is clear and consistent may be complicated, though a PR for that would be welcome as well

gfyoung added Dtype Conversions Unexpected or buggy dtype conversions Numeric Operations Arithmetic, Comparison, and Logical operations IO HTML read_html, to_html, Styler.apply, Styler.applymap labels Jun 8, 2018

Zhenye-Na mentioned this issue Dec 31, 2019

How can I force Pandas read_html function to read digit field as string not integer #30589

Closed

mroeschke added the Enhancement label May 7, 2020

jbrockmendel removed the Numeric Operations Arithmetic, Comparison, and Logical operations label Sep 23, 2020

holymonson mentioned this issue Feb 14, 2021

ENH: read_html support dtype param #39804

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

read_html - how to prevent the conversion of numerical fields #21379

read_html - how to prevent the conversion of numerical fields #21379

vzelen commented Jun 8, 2018

gfyoung commented Jun 8, 2018 •

edited

vzelen commented Jun 13, 2018 •

edited

WillAyd commented Jun 13, 2018

read_html - how to prevent the conversion of numerical fields #21379

read_html - how to prevent the conversion of numerical fields #21379

Comments

vzelen commented Jun 8, 2018

gfyoung commented Jun 8, 2018 • edited

vzelen commented Jun 13, 2018 • edited

Example

Converters

dtype

Another problem - date formatting

WillAyd commented Jun 13, 2018

gfyoung commented Jun 8, 2018 •

edited

vzelen commented Jun 13, 2018 •

edited