read_html: fails to parse column #3606

Closed
timmie opened this Issue May 14, 2013 · 10 comments

Contributor

timmie commented May 14, 2013

The second column of the table
http://code.google.com/p/pythonxy/wiki/StandardPlugins#Python_packages

is not parsed as shown with this code:

# -*- coding: utf-8 -*-
# <nbformat>3.0</nbformat>

# <codecell>

import pandas as pd

# <codecell>

url = 'http://code.google.com/p/pythonxy/wiki/StandardPlugins'

# <codecell>

dfs = pd.read_html(url, attrs={'class': 'wikitable'})

# <codecell>

dfs

# <codecell>

dfs = pd.read_html(url, flavor='lxml', attrs={'class': 'wikitable'})

# <codecell>

dfs

# <codecell>

python_core = dfs[0]

# <codecell>

python_core[:10]
Member

cpcloud commented May 14, 2013

I will be submitting a PR soon that should fix issues like this. This is a result of stripping whitespace in a table when parsing the raw data, which might not have been the best decision on my part, i.e., I should probably let the user decide what they want to keep.

Member

cpcloud commented May 15, 2013

@timmie fixed in #3616 if you want to try it out; the branch is cpcloud/read-html-fixes. I even put a test in there using your table :) (https won't work with lxml, so do url.replace('https', 'http') before passing the URL to read_html).
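
A minimal sketch of the scheme workaround mentioned above. `force_http` is a hypothetical helper name, not part of pandas. It rewrites only the URL scheme, whereas a plain `url.replace('https', 'http')` would also rewrite any other occurrence of "https" in the URL.

```python
from urllib.parse import urlsplit, urlunsplit


def force_http(url):
    """Return url with an https scheme downgraded to http (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        # Only the scheme is changed; host, path, query and fragment stay as-is.
        parts = parts._replace(scheme="http")
    return urlunsplit(parts)


url = force_http("https://code.google.com/p/pythonxy/wiki/StandardPlugins")
```

The resulting `url` can then be passed to read_html in the same way as in the snippets in this thread.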

Contributor

jreback commented May 20, 2013

closed by #3616

jreback closed this May 20, 2013

Contributor

timmie commented Aug 2, 2013

@cpcloud

It occurs again.

# -*- coding: utf-8 -*-
# <nbformat>3.0</nbformat>

# <codecell>

import pandas as pd
pd.__version__

# <codecell>

url = 'http://code.google.com/p/pythonxy/wiki/StandardPlugins'

# <codecell>

dfs = pd.read_html(url, attrs={'class': 'wikitable'})

# <codecell>

match = 'Distribute'
dfs = pd.read_html(url, attrs={'class': 'wikitable'}, match=match)

# <codecell>

x = dfs[0]

# <codecell>

x.head()


  • The version column shows NaN.
  • The parse result is a list, not a DataFrame, hence x = dfs[0].

Using 0.12.dev (from yesterday).

Member

cpcloud commented Aug 2, 2013

I'll take a look. FYI the parse result has always been a list.

Member

cpcloud commented Aug 2, 2013

@timmie try passing infer_types=False

In [18]: dfs = read_html('http://code.google.com/p/pythonxy/wiki/StandardPlugins',attrs={'class':'wikitable'},match='Distribute',infer_types=False)

In [19]: dfs[0]
Out[19]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 91 entries, 0 to 90
Data columns (total 4 columns):
0    91  non-null values
1    91  non-null values
2    91  non-null values
3    91  non-null values
dtypes: object(4)

In [20]: dfs[0].head()
Out[20]:
                0         1 2                                                  3
0          Python     2.7.5                            Python standard libraries
1  Base Libraries   1.1.0-5      shared libraries commonly used by other plugins
2     Base Python   1.3.0-5    A collection of small (in scope and size) but ...
3      Distribute  0.6.45-8    Download, build, install, upgrade, and uninsta...
4             Pip   1.3.1-2    pip is a tool for installing and managing Pyth...
Member

cpcloud commented Aug 2, 2013

I wonder if it might be useful to parse the src attribute of an img tag... I'll raise an issue.
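
A rough sketch of what that idea could look like, using only the standard library's `html.parser`. This is not how pandas parses tables; `TableCellParser` is a hypothetical name and the sample markup is made up for illustration. Every cell is kept as a raw string, and an `<img>` inside a cell contributes its `src` value instead of leaving the cell empty.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect table rows as lists of strings; an <img> contributes its src."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being parsed
        self._cell = None   # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
        elif tag == "img" and self._cell is not None:
            src = dict(attrs).get("src")
            if src:
                self._cell.append(src)

    def handle_data(self, data):
        # Text outside of a cell (e.g. whitespace between rows) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Made-up sample markup, loosely modeled on the wiki table in this issue.
html = """
<table class="wikitable">
  <tr><td>Distribute</td><td>0.6.45-8</td><td><img src="ok.png"></td></tr>
  <tr><td>Pip</td><td>1.3.1-2</td><td></td></tr>
</table>
"""

parser = TableCellParser()
parser.feed(html)
parser.close()
rows = parser.rows
```

Here `rows` holds one list of strings per table row, so version strings such as 0.6.45-8 survive untouched and the image-only cell carries its file name.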

Contributor

timmie commented Aug 6, 2013

OK, shall we add it to the docs, then?

Contributor

timmie commented Aug 6, 2013

BTW, thank you.

Member

cpcloud commented Aug 6, 2013

I believe infer_types is in the docs, let me check...
