TST/BUG: fix bs4 tests that were getting erroneously run when lxml is installed but not bs4 #3741

Merged
merged 3 commits into from Jun 4, 2013

Conversation

Projects
None yet
2 participants
Member

cpcloud commented Jun 3, 2013

No description provided.

Contributor

jreback commented Jun 3, 2013

also maybe list the deps for lxml as well

e.g.
sudo apt-get install libxml2 libxslt1.1 libxslt-dev

Contributor

jreback commented Jun 3, 2013

fyi....i get this on 32-bit machine.....(this is on master)
and I uninstalled xlml,bs4 then reinstalled


vagrant@precise32:~/pandas_master$ nosetests -A 'not slow' pandas/io/tests/test_html.py
........................F.............................................
======================================================================
FAIL: test_banklist_no_match (pandas.io.tests.test_html.TestBs4LxmlParser)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/vagrant/pandas_master/pandas/io/tests/test_html.py", line 172, in test_banklist_no_match
    dfs = self.run_read_html(self.banklist_data, attrs={'id': 'table'})
  File "/home/vagrant/pandas_master/pandas/io/tests/test_html.py", line 393, in run_read_html
    return _run_read_html(parser, *args, **kwargs)
  File "/home/vagrant/pandas_master/pandas/io/tests/test_html.py", line 74, in _run_read_html
    infer_types, attrs)
  File "/home/vagrant/pandas_master/pandas/io/html.py", line 733, in _parse
    tables = p.parse_tables()
  File "/home/vagrant/pandas_master/pandas/io/html.py", line 160, in parse_tables
    tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
  File "/home/vagrant/pandas_master/pandas/io/html.py", line 391, in _parse_tables
    raise AssertionError('No tables found')
AssertionError: No tables found

----------------------------------------------------------------------
Ran 70 tests in 10.158s

FAILED (failures=1)
vagrant@precise32:~/pandas_master$ python ci/print_versions.py 

INSTALLED VERSIONS
------------------
Python: 2.7.3.final.0
OS: Linux 3.2.0-23-generic-pae #36-Ubuntu SMP Tue Apr 10 22:19:09 UTC 2012 i686
LC_ALL: en_US
LANG: en_US

Cython: 0.15.1
Numpy: 1.7.0
Scipy: 0.11.0
statsmodels: 0.5.0.dev-1bbd4ca
    patsy: 0.1.0
scikits.timeseries: Not installed
dateutil: 2.1
pytz: 2011k
PyTables: 2.4.0
    numexpr: 2.0.1
matplotlib: Not installed
openpyxl: 1.6.1
xlrd: 0.9.2
xlwt: 0.7.5
sqlalchemy: Not installed
lxml: 3.2.1
bs4: 4.0.2
html5lib: 1.0b1
Member

cpcloud commented Jun 3, 2013

can u post a link to your vagrant box so i can tinker

Member

cpcloud commented Jun 3, 2013

you're not using anaconda correct?

Member

cpcloud commented Jun 3, 2013

oh gosh print_versions.py doesn't work for me, darn those line endings...

Contributor

jreback commented Jun 3, 2013

not using anaconda!

how do I post a link to my box?

Member

cpcloud commented Jun 3, 2013

i think vagrant package --base my_base_box and then share a link 2 it through Dropbox or some other means

Member

cpcloud commented Jun 3, 2013

why the heck is only ONE test failing here?!?

Contributor

jreback commented Jun 3, 2013

alright....what e-mail address?

Member

cpcloud commented Jun 3, 2013

cpcloud@gmail.com
On Jun 3, 2013 2:17 PM, "jreback" notifications@github.com wrote:

alright....what e-mail address?


Reply to this email directly or view it on GitHubhttps://github.com/pydata/pandas/pull/3741#issuecomment-18860182
.

Contributor

jreback commented Jun 3, 2013

ok....sent you the link...prob about 1 more hour till upload is done ...big file 1.7gb!

Member

cpcloud commented Jun 3, 2013

whew that's big. thanks. here's the output of my current precise32 setup

INSTALLED VERSIONS
------------------
Python: 2.7.3.final.0
OS: Linux 3.2.0-45-generic-pae #70-Ubuntu SMP Wed May 29 20:31:05 UTC 2013 i686
LC_ALL: en_US
LANG: en_US

Cython: 0.19.1
Numpy: 1.7.1
Scipy: 0.12.0
statsmodels: 0.5.0.dev-1bbd4ca
    patsy: Not installed
scikits.timeseries: Not installed
dateutil: 2.1
pytz: 2013b
PyTables: 3.0.0
    numexpr: 2.1
matplotlib: 1.2.1
openpyxl: 1.6.2
xlrd: 0.9.2
xlwt: 0.7.5
sqlalchemy: Not installed
lxml: 3.2.1
bs4: 4.2.1
html5lib: 1.0b1
Contributor

jreback commented Jun 3, 2013

I have bs4 4.0.2

Member

cpcloud commented Jun 3, 2013

tried with that and can repro

Member

cpcloud commented Jun 3, 2013

@jreback drop into pdb and press u followed by

import bs4
soup  = bs4.BeautifulSoup(self._setup_build_doc())
soup

and you will see the root of the error is invalid markup.

Member

cpcloud commented Jun 3, 2013

this pr fixes that failing test

Contributor

jreback commented Jun 3, 2013

hmm....mine actually looks ok (it returns a big string), no error

Member

cpcloud commented Jun 3, 2013

it had to with the integrity of banklist.html

Member

cpcloud commented Jun 3, 2013

u don't get a bunch of extra whitespace?

Contributor

jreback commented Jun 3, 2013

fyi...also not sure if you can grab a bs4/lxml error message and put that on your no tables message, e.g.

AssertionError("not tables: ") or something

Contributor

jreback commented Jun 3, 2013

when i print soup just prints an html like string

Member

cpcloud commented Jun 3, 2013

hm check out what i get:

   /   s   c   r   i   p   t   >
   n   o   s   c   r   i   p   t   >
   a       h   r   e   f   =   "   h   t   t   p   :   /   /   w   w   w   .   o   m   n   i   t   u   r   e   .   c   o   m   "       t   i   t   l   e   =   "   W   e   b       A   n   a   l   y   t   i   c   s   "   >
   i   m   g       s   r   c   =   "   h   t   t   p   :   /   /   f   d   i   c   .   1   2   2   .   2   o   7   .   n   e   t   /   b   /   s   s   /   f   d   i   c   g   o   v   p   r   o   d   /   1   /   H   .   2   1   -   -   N   S   /   0   ?   [   A   Q   B   ]   %   2   6   c   l   =   S   e   s   s   i   o   n   %   2   6   A   Q   E   "       h   e   i   g   h   t   =   "   1   "       w   i   d   t   h   =   "   1   "       b   o   r   d   e   r   =   "   0   "       a   l   t   =   "   "       /   >   /       >
   /   n   o   s   c   r   i   p   t   >

   !   -   -   /   D   O       N   O   T       R   E   M   O   V   E   /   -   -   >
   !   -   -       E   n   d       S   i   t   e   C   a   t   a   l   y   s   t       c   o   d   e           e   r   s   i   o   n   :       H   .   2   1   .       -   -   >
   !   -   -       e   n   d       f   o   o   t   e   r       -   -   >
   !   -   -       E   N   D       F   O   O   T   E   R       I   N   C   L   U   D   E       -   -   &gt```
Member

cpcloud commented Jun 3, 2013

with


INSTALLED VERSIONS
------------------
Python: 2.7.3.final.0
OS: Linux 3.2.0-45-generic-pae #70-Ubuntu SMP Wed May 29 20:31:05 UTC 2013 i686
LC_ALL: en_US
LANG: en_US

Cython: 0.19.1
Numpy: 1.7.1
Scipy: 0.12.0
statsmodels: 0.5.0.dev-1bbd4ca
    patsy: Not installed
scikits.timeseries: Not installed
dateutil: 2.1
pytz: 2013b
PyTables: 3.0.0
    numexpr: 2.1
matplotlib: 1.2.1
openpyxl: 1.6.2
xlrd: 0.9.2
xlwt: 0.7.5
sqlalchemy: Not installed
lxml: 3.2.1
bs4: 4.0.2
html5lib: 1.0b1
Member

cpcloud commented Jun 3, 2013

dl'ing ur vagrant box now...2 MB ish / sec shouldn't be too long

Contributor

jreback commented Jun 3, 2013

hmmm...i get what looks like valid output.....

Member

cpcloud commented Jun 3, 2013

dude it's so cool that u can send me a whole computer over the interwebz

Member

cpcloud commented Jun 3, 2013

i get the same extra whitespace with ur vagrant box

Member

cpcloud commented Jun 3, 2013

   i   p   t   >
   s   c   r   i   p   t       l   a   n   g   u   a   g   e   =   "   J   a   v   a   S   c   r   i   p   t   "       t   y   p   e   =   "   t   e   x   t   /   j   a   v   a   s   c   r   i   p   t   "   > 
   i   f   (   n   a   v   i   g   a   t   o   r   .   a   p   p   V   e   r   s   i   o   n   .   i   n   d   e   x   O   f   (   '   M   S   I   E   '   )   >   =   0   )   d   o   c   u   m   e   n   t   .   w   r   i   t   e   (   u   n   e   s   c   a   p   e   (   '   %   3   C   '   )   +   '   \   !   -       +   '   -   '   )
   /   s   c   r   i   p   t   >
   n   o   s   c   r   i   p   t   >
   a       h   r   e   f   =   "   h   t   t   p   :   /   /   w   w   w   .   o   m   n   i   t   u   r   e   .   c   o   m   "       t   i   t   l   e   =   "   W   e   b       A   n   a   l   y   t   i   c   s   "   >
   i   m   g       s   r   c   =   "   h   t   t   p   :   /   /   f   d   i   c   .   1   2   2   .   2   o   7   .   n   e   t   /   b   /   s   s   /   f   d   i   c   g   o   v   p   r   o   d   /   1   /   H   .   2   1   -   -   N   S   /   0   ?   [   A   Q   B   ]   %   2   6   c   l   =   S   e   s   s   i   o   n   %   2   6   A   Q   E   "       h   e   i   g   h   t   =   "   1   "       w   i   d   t   h   =   "   1   "       b   o   r   d   e   r   =   "   0   "       a   l   t   =   "   "       /   >   /       >
   /   n   o   s   c   r   i   p   t   >

   !   -   -   /   D   O       N   O   T       R   E   M   O   V   E   /   -   -   >
   !   -   -       E   n   d       S   i   t   e   C   a   t   a   l   y   s   t       c   o   d   e           e   r   s   i   o   n   :       H   .   2   1   .       -   -   >
   !   -   -       e   n   d       f   o   o   t   e   r       -   -   >
   !   -   -       E   N   D       F   O   O   T   E   R       I   N   C   L   U   D   E       -   -   &gt
Contributor

jreback commented Jun 3, 2013

oh....ok....then that is fine (that's what your fix did)

weird that it doesn't raise?

Member

cpcloud commented Jun 3, 2013

i literally did nothing except download it followed by v init jreback-precise32 package.box && v up && v ssh (I've aliased v to vagrant)

Member

cpcloud commented Jun 3, 2013

the markup must be "more parseable" in the lastest version (the data set is updated every monday)

Contributor

jreback commented Jun 3, 2013

ahh...ok....well good that it checked otu

Member

cpcloud commented Jun 3, 2013

anyway i'm going to submit a couple of prs addressing the following. this is my first ever case of real honest-to-goodness dependency hell. i think i've sprouted a couple of gray hairs...

1st pr

  • docs for this mess

2nd

  • raise on an invalid lxml parse
  • falling back on html5lib
Member

cpcloud commented Jun 3, 2013

completely OT: how does one become a member of pydata?

Contributor

jreback commented Jun 3, 2013

wesm grants priv to push to master (and group membership)

you r prob deserving :)

Member

cpcloud commented Jun 3, 2013

ah ok. thanks!

Contributor

jreback commented Jun 3, 2013

sounds good
maybe a big warning at the beginning of read_html

Contributor

jreback commented Jun 4, 2013

ready to merge?

Member

cpcloud commented Jun 4, 2013

Not just yet I think might need to rebase I'm not at my computer just yet so I can't check. If I don't need to then it's ready 2 go.

Member

cpcloud commented Jun 4, 2013

added release and what's new...ready to merge

Contributor

jreback commented Jun 4, 2013

mergine.....3.2.1

@jreback jreback added a commit that referenced this pull request Jun 4, 2013

@jreback jreback Merge pull request #3741 from cpcloud/read-html
TST/BUG: fix bs4 tests that were getting erroneously run when lxml is installed but not bs4
2787d08

@jreback jreback merged commit 2787d08 into pandas-dev:master Jun 4, 2013

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment