Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read_html - how to prevent the conversion of numerical fields #21379

Open
vzelen opened this issue Jun 8, 2018 · 3 comments
Open

read_html - how to prevent the conversion of numerical fields #21379

vzelen opened this issue Jun 8, 2018 · 3 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions Enhancement IO HTML read_html, to_html, Styler.apply, Styler.applymap

Comments

@vzelen
Copy link

vzelen commented Jun 8, 2018

Hi everyone,

Is it possible to use pandas.read_html function in way so it won't convert the numbers in html tables and export them as they are (as strings)?

Assume that there is a html table which contains a value "60,00".

Reading that table using pandas.read_html will lead to an integer 6000.
Adding the flag thousands='.' will result in a string "60.00".
Adding both flags thousands='.' and decimal=',' will result in a float 60.0.

Is it possible to ask pandas.read_html to stop performing conversion of numbers by itself based on the logic of "thousands" and "decimals"? Since the file may contain data in both EU and US formats.

It would be amazing to use pandas as a tool that will just export the data from html to dataframe as it is and leaves the logic of data postprocessing / conversion / etc to the further logic (which could be also implemented with a help of pandas).

A similar issue that I've described here: https://stackoverflow.com/questions/47327966/pandas-converting-numbers-to-strings-unexpected-results

Thank you a lot in advance, please let me know if it's reasonable to attach examples.

@gfyoung gfyoung added Dtype Conversions Unexpected or buggy dtype conversions Numeric Operations Arithmetic, Comparison, and Logical operations IO HTML read_html, to_html, Styler.apply, Styler.applymap labels Jun 8, 2018
@gfyoung
Copy link
Member

gfyoung commented Jun 8, 2018

Thank you a lot in advance, please let me know if it's reasonable to attach examples.

Actually, examples would be very much appreciated.

My first guess is that you should be able to specify dtype for the column containing that value, but let's see when you post examples.

@vzelen
Copy link
Author

vzelen commented Jun 13, 2018

@gfyoung sure, sorry for the delay with the responce, please find the example attached below.

Basically the table is the following:

Number in double-quotes (like a string)
0 "1234"
1 "1234.0"
2 "1234.00"
3 "1,234"
4 "1,234.0"
5 "1,234.00"
6 "1234,0"
7 "1.234"
8 "1.234,0"
9 "1.234,00"

According to my understanding, suggested approach to process numeric values is based on the manipulations with "thousands" and "decimal" fields.
I'm iterating on all possible combinations of these keywords with values "," and "." (including empty case) an extracting the last column from the attached table:

{} --> [1234.0, 1234.0, 1234.0, 1234.0, 1234.0, 1234.0, 12340.0, 1.234, 1.234, 1.234]
{'thousands': '.'} --> ['1234', '12340', '123400', '1,234', '1,234.0', '1,234.00', '1234,0', '1234', '1.234,0', '1.234,00']
{'thousands': ','} --> [1234.0, 1234.0, 1234.0, 1234.0, 1234.0, 1234.0, 12340.0, 1.234, 1.234, 1.234]
{'decimal': '.'} --> [1234.0, 1234.0, 1234.0, 1234.0, 1234.0, 1234.0, 12340.0, 1.234, 1.234, 1.234]
{'decimal': ','} --> ['1234', '1234.0', '1234.00', '1234', '1,234.0', '1,234.00', '12340', '1.234', '1.234,0', '1.234,00']
{'decimal': '.', 'thousands': ','} --> [1234.0, 1234.0, 1234.0, 1234.0, 1234.0, 1234.0, 12340.0, 1.234, 1.234, 1.234]
{'decimal': ',', 'thousands': '.'} --> [1234.0, 12340.0, 123400.0, 1.234, 1.234, 1.234, 1234.0, 1234.0, 1234.0, 1234.0]

In this workaround, only few cases leads to "prevention" of numeric conversion:

  • 'thousands': '.'
  • 'decimal': ','

But obviously, such method could fail on some other datasets.

So problem statement: is it possible to extract (with read_html) the content of a table as it is, so the result will be stored as strings without conversions / casting of any columns by pandas?

Example

Thousands sep Floats sep Number of zeros at the end Number in double-quotes (like a string) Parsed number (different forms of number 1234.0)
Nothing Nothing 0 "1234" 1234
Nothing . 1 "1234.0" 1234.0
Nothing . 2 "1234.00" 1234.00
"," Nothing 0 "1,234" 1,234
"," . 1 "1,234.0" 1,234.0
"," . 2 "1,234.00" 1,234.00
Nothing "," 1 "1234,0" 1234,0
. "," 0 "1.234" 1.234
. "," 1 "1.234,0" 1.234,0
. "," 2 "1.234,00" 1.234,00

I'm referering and reminding you guys, that this example is not artificial. There could be files with numbers in both UK / US format or at least an algorithm that have to handle both types.

Converters

Specification of the "converters" keyword with the mapping from a column name to a string leads to the following:

converters = {
 'Floats sep': str,
 'Number in double-quotes (like a string)': str,
 'Number of zeros at the end': str,
 'Parsed number (different forms of number 1234.0)': str,
 'Thousands sep': str
}
{} --> ['1234', '1234.0', '1234.00', '1234', '1234.0', '1234.00', '12340', '1.234', '1.2340', '1.23400']
{'thousands': '.'} --> ['1234', '12340', '123400', '1,234', '1,234.0', '1,234.00', '1234,0', '1234', '1.234,0', '1.234,00']
{'thousands': ','} --> ['1234', '1234.0', '1234.00', '1234', '1234.0', '1234.00', '12340', '1.234', '1.2340', '1.23400']
{'decimal': '.'} --> ['1234', '1234.0', '1234.00', '1234', '1234.0', '1234.00', '12340', '1.234', '1.2340', '1.23400']
{'decimal': ','} --> ['1234', '1234.0', '1234.00', '1234', '1,234.0', '1,234.00', '12340', '1.234', '1.234,0', '1.234,00']
{'decimal': '.', 'thousands': ','} --> ['1234', '1234.0', '1234.00', '1234', '1234.0', '1234.00', '12340', '1.234', '1.2340', '1.23400']
{'decimal': ',', 'thousands': '.'} --> ['1234', '12340', '123400', '1.234', '1.2340', '1.23400', '1234.0', '1234', '1234.0', '1234.00']

As I see, the results are all string now, but it looks like the content was still converted.

dtype

@gfyoung, you've mentioned specification of dtypes. As I see it's available in read_csv function, but unfortunately not in read_html. In case "converters" is equivalent to "dtype", please see the section above.

Another problem - date formatting

I'm reminding you guys that you may want to take a look at the stackoverflow question I've mentioned in the first message of this issue. This numeric converter that I've just described also attempts to convert date values like "01.12.2017" to numbers like 1122017, which is also pretty important point.

@WillAyd
Copy link
Member

WillAyd commented Jun 13, 2018

Have you tried thousands=None?

In [22]: html = "<html><body><table><tr><th>foo</th></tr><tr><td>60,00</td></tr></table></body></html>"

In [23]: pd.read_html(html, thousands=None)[0]
Out[23]: 
       0
0    foo
1  60,00

If that works for you then would certainly take a PR to update documentation and add as an example.

Alternately adding dtype as a keyword argument could work, but read_html operates differently from other read functions if only for the fact that it returns a list instead of a single DataFrame. Implementing that in a way that is clear and consistent may be complicated, though a PR for that would be welcome as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Enhancement IO HTML read_html, to_html, Styler.apply, Styler.applymap
Projects
None yet
Development

No branches or pull requests

5 participants