read_html: fails to parse column #3606

Closed
timmie opened this issue May 14, 2013 · 10 comments

@timmie
Contributor

commented May 14, 2013

The second column of the table at
http://code.google.com/p/pythonxy/wiki/StandardPlugins#Python_packages

is not parsed, as this code shows:

# -*- coding: utf-8 -*-
# <nbformat>3.0</nbformat>

# <codecell>

import pandas as pd

# <codecell>

url = 'http://code.google.com/p/pythonxy/wiki/StandardPlugins'

# <codecell>

dfs = pd.read_html(url, attrs={'class': 'wikitable'})

# <codecell>

dfs

# <codecell>

dfs = pd.read_html(url, flavor='lxml', attrs={'class': 'wikitable'})

# <codecell>

dfs

# <codecell>

python_core = dfs[0]

# <codecell>

python_core[:10]

@cpcloud
Member

commented May 14, 2013

I will be submitting a PR soon that should fix issues like this. This is a result of stripping whitespace from a table when parsing the raw data, which might not have been the best decision on my part; it should probably be left to the user to decide what to keep.

@cpcloud
Member

commented May 15, 2013

@timmie Fixed in #3616. If you want to try it out, the branch is cpcloud/read-html-fixes. I even put a test in there using your table :) (https won't work with lxml, so do url.replace('https', 'http') before passing the URL to read_html.)
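The https workaround can be sketched as a small helper. A bare url.replace('https', 'http') rewrites every occurrence of the substring, so this sketch swaps only the scheme. The names `force_http` and `read_tables_over_http` are made up for illustration, and the lxml limitation is as reported in this thread, not something checked against current releases.

```python
from urllib.parse import urlsplit, urlunsplit


def force_http(url):
    """Return url with an https scheme downgraded to http; other URLs pass through."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        parts = parts._replace(scheme="http")
    return urlunsplit(parts)


def read_tables_over_http(url, **kwargs):
    """Hypothetical wrapper: downgrade the scheme, then hand off to pandas.read_html."""
    import pandas as pd  # imported lazily so force_http works without pandas

    return pd.read_html(force_http(url), **kwargs)
```

Something like read_tables_over_http(url, flavor='lxml', attrs={'class': 'wikitable'}) would then mirror the calls in the original report.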

@jreback
Contributor

commented May 20, 2013

closed by #3616

@jreback closed this May 20, 2013

@timmie
Contributor Author

commented Aug 2, 2013

@cpcloud

It occurs again.

# -*- coding: utf-8 -*-
# <nbformat>3.0</nbformat>

# <codecell>

import pandas as pd
pd.__version__

# <codecell>

url = 'http://code.google.com/p/pythonxy/wiki/StandardPlugins'

# <codecell>

dfs = pd.read_html(url, attrs={'class': 'wikitable'})

# <codecell>

match = 'Distribute'
dfs = pd.read_html(url, attrs={'class': 'wikitable'}, match=match)

# <codecell>

x = dfs[0]

# <codecell>

x.head()

  • The version column shows NaN.
  • The parse result is a list, not a DataFrame, hence x = dfs[0].

Using 0.12.dev (from yesterday).

@cpcloud
Member

commented Aug 2, 2013

I'll take a look. FYI the parse result has always been a list.
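To illustrate the list return value without hitting the network, here is a minimal sketch that parses an inline table. The HTML snippet is invented for the example, and it assumes a reasonably recent pandas with lxml (or bs4 plus html5lib) installed.

```python
from io import StringIO

import pandas as pd

# Made-up stand-in for the wiki page; not the real table.
HTML = """
<table class="wikitable">
  <tr><td>Distribute</td><td>Download, build, install</td></tr>
  <tr><td>Pip</td><td>Install and manage packages</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per matching table,
# even when only a single table matches.
dfs = pd.read_html(StringIO(HTML), attrs={"class": "wikitable"}, match="Distribute")
table = dfs[0]  # pick the DataFrame out of the list
print(type(dfs).__name__, len(dfs), table.shape)
```

Indexing with dfs[0], as in the report, is therefore the expected way to get at the DataFrame.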

@cpcloud
Member

commented Aug 2, 2013

@timmie try passing infer_types=False

In [18]: dfs = read_html('http://code.google.com/p/pythonxy/wiki/StandardPlugins',attrs={'class':'wikitable'},match='Distribute',infer_types=False)

In [19]: dfs[0]
Out[19]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 91 entries, 0 to 90
Data columns (total 4 columns):
0    91  non-null values
1    91  non-null values
2    91  non-null values
3    91  non-null values
dtypes: object(4)

In [20]: dfs[0].head()
Out[20]:
                0         1 2                                                  3
0          Python     2.7.5                            Python standard libraries
1  Base Libraries   1.1.0-5      shared libraries commonly used by other plugins
2     Base Python   1.3.0-5    A collection of small (in scope and size) but ...
3      Distribute  0.6.45-8    Download, build, install, upgrade, and uninsta...
4             Pip   1.3.1-2    pip is a tool for installing and managing Pyth...

@cpcloud
Member

commented Aug 2, 2013

I wonder if it might be useful to parse the src attribute of an img tag. I'll raise an issue.

@timmie
Contributor Author

commented Aug 6, 2013

OK, shall we add it to the docs, then?

@timmie
Contributor Author

commented Aug 6, 2013

BTW, thank you.

@cpcloud
Member

commented Aug 6, 2013

I believe infer_types is in the docs; let me check...
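infer_types is the keyword from the pandas version discussed here; as far as I know, later releases dropped it. Assuming a newer pandas, one way to keep version strings untouched is the converters argument of read_html. The table below is invented sample data, and treating converters as a stand-in for infer_types=False is an assumption, not something stated in this thread.

```python
from io import StringIO

import pandas as pd

# Invented sample rows shaped like the table in this issue
# (name, version, empty cell, description).
HTML = """
<table class="wikitable">
  <tr><td>Python</td><td>2.7.5</td><td></td><td>Python standard libraries</td></tr>
  <tr><td>Distribute</td><td>0.6.45-8</td><td></td><td>Download, build, install</td></tr>
</table>
"""

# converters maps a column position to a function applied to each raw cell;
# str leaves the version text exactly as it appears in the HTML.
dfs = pd.read_html(StringIO(HTML), attrs={"class": "wikitable"}, converters={1: str})
versions = dfs[0][1].tolist()
print(versions)
```

This needs lxml (or bs4 plus html5lib) installed, the same as any other read_html call.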
