Replace ftplib with urllib to pick up ftp_proxy when building lxml with STATIC_DEPS=true #189

Merged
merged 2 commits into from Mar 8, 2016

Conversation

Projects
None yet
4 participants
@sakurai-youhei
Contributor

sakurai-youhei commented Mar 3, 2016

I'm not sure if this pull-request has right direction but I've made small change to overcome following problem which I have. It'll be more than great if you'd merge this PR, provide feedback on this PR or anything else. Thanks in advance.

Problem
I want to realize CI of one Python project which uses lxml and other modules. To make preparation simple in CI, all of dependencies are written in requirements.txt and ready to be installed by pip install -r requirements.txt. Because the Python project is targeting Windows environment, I've also set up STATIC_DEPS=true in CI job.

However, build of lxml fails on CI servers because of no direct access to the Internet; Other modules are able to be installed through proxy server (with http_proxy and https_proxy environmental variables) but lxml is not because of usage of ftplib which doesn't take care of ftp_proxy environmental variable.

Downloading lxml-3.5.0.tar.gz (3.8MB)
Downloading from URL https://pypi.python.org/packages/source/l/lxml/lxml-3.5.0.tar.gz#md5=9f0c5f1eb43ff44d5455dab4b4efbe73 (from https://pypi.python.org/simple/lxml/)
Running setup.py (path:C:\Users\ADMINI~1\AppData\Local\Temp\pip-build-trs2ox77\lxml\setup.py) egg_info for package lxml
  Running command python setup.py egg_info
  Building lxml version 3.5.0.
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "C:\Users\ADMINI~1\AppData\Local\Temp\pip-build-trs2ox77\lxml\setup.py", line 233, in <module>
      **setup_extra_options()
    File "C:\Users\ADMINI~1\AppData\Local\Temp\pip-build-trs2ox77\lxml\setup.py", line 144, in setup_extra_options
      STATIC_CFLAGS, STATIC_BINARIES)
    File "C:\Users\ADMINI~1\AppData\Local\Temp\pip-build-trs2ox77\lxml\setupinfo.py", line 55, in ext_modules
      OPTION_DOWNLOAD_DIR, static_include_dirs, static_library_dirs)
    File "C:\Users\ADMINI~1\AppData\Local\Temp\pip-build-trs2ox77\lxml\buildlibxml.py", line 86, in get_prebuilt_libxml2xslt
      libs = download_and_extract_zlatkovic_binaries(download_dir)
    File "C:\Users\ADMINI~1\AppData\Local\Temp\pip-build-trs2ox77\lxml\buildlibxml.py", line 34, in download_and_extract_zlatkovic_binaries
      for fn in ftp_listdir(url):
    File "C:\Users\ADMINI~1\AppData\Local\Temp\pip-build-trs2ox77\lxml\buildlibxml.py", line 106, in ftp_listdir
      server = ftplib.FTP(netloc)
    File "C:\Python34\lib\ftplib.py", line 118, in __init__
      self.connect(host)
    File "C:\Python34\lib\ftplib.py", line 153, in connect
      source_address=self.source_address)
    File "C:\Python34\lib\socket.py", line 512, in create_connection
      raise err
    File "C:\Python34\lib\socket.py", line 503, in create_connection
      sock.connect(sa)
  TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

Notes

  • I tested this change in Python 2.7.10 and 3.4.3 with/without Squid proxy server.
Replace ftplib with urllib to pick up ftp_proxy
* Rewrite ftp_listdir to replace ftplib which doesn't
  take care of ftp_proxy environmental variable at all.
* Add parse_text_ftplist and parse_html_ftplist as a
  kind of sub-function of ftp_listdir.
  - parse_text_ftplist will be called when ftp_proxy
    environmental variable is empty.
  - parse_html_ftplist will be called when ftp_proxy
    environmental variable is not empty (i.e. there's
    proxy server) and urllib gets some HTML generated
    by proxy server (e.g. squid) instead of plaintext.
@scoder

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Mar 4, 2016

The 54 here looks like a fairly magic number that could use some explanation.

The 54 here looks like a fairly magic number that could use some explanation.

@scoder scoder closed this Mar 4, 2016

@scoder scoder reopened this Mar 4, 2016

@scoder

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Mar 4, 2016

Member

Sorry, wrong button. :) This looks ok, and if it helps...

Member

scoder commented Mar 4, 2016

Sorry, wrong button. :) This looks ok, and if it helps...

@sakurai-youhei

This comment has been minimized.

Show comment
Hide comment
@sakurai-youhei

sakurai-youhei Mar 7, 2016

Contributor

Thank you for your comment.

The '54' is because of LIST command in urllib (instead of NLST with ftplib); Here's what FTP server replays in case of direct access to FTP server. The '54' shouldn't be a problem as long as FTP server is running on Unix.

C:\Users\sakurai>C:\Python34\python.exe -c "from urllib.request import urlopen; print(urlopen('ftp://ftp.zlatkovic.com/pub/libxml/').read().decode('utf-8'))"
drwxr-xr-x   3 ftp      ftp          4096 Dec  7 12:10 64bit
dr-xr-xr-x   2 ftp      ftp          4096 May  3  2010 Delphi
-r--r--r--   1 ftp      ftp       1320616 Aug 17  2008 iconv-1.9.2.win32.zip
-rw-r--r--   1 ftp      ftp       2736611 Jun 19  2011 libxml2-2.7.8.win32.zip
-rw-r--r--   1 ftp      ftp        989833 Jun 19  2011 libxmlsec-1.2.18.win32.zip
-rw-r--r--   1 ftp      ftp        973680 Jun 20  2011 libxmlsec-nounicode-1.2.18.win32.zip
-r--r--r--   1 ftp      ftp        408001 Nov  1  2009 libxslt-1.1.26.win32.zip
-rw-r--r--   1 ftp      ftp           476 Sep  1  2011 md5sum.txt
dr-xr-xr-x   2 ftp      ftp          4096 Jun 19  2011 oldreleases
-r--r--r--   1 ftp      ftp       2481853 Aug 17  2008 openssl-0.9.8a.win32.zip
-r--r--r--   1 ftp      ftp        704733 Aug 17  2008 xsldbg-3.1.7.win32.zip
-r--r--r--   1 ftp      ftp        173472 Oct  2  2010 zlib-1.2.5.win32.zip


C:\Users\sakurai>C:\Python27\python.exe -c "from urllib import urlopen; print urlopen('ftp://ftp.zlatkovic.com/pub/libxml/').read().decode('utf-8')"
drwxr-xr-x   3 ftp      ftp          4096 Dec  7 12:10 64bit
dr-xr-xr-x   2 ftp      ftp          4096 May  3  2010 Delphi
-r--r--r--   1 ftp      ftp       1320616 Aug 17  2008 iconv-1.9.2.win32.zip
-rw-r--r--   1 ftp      ftp       2736611 Jun 19  2011 libxml2-2.7.8.win32.zip
-rw-r--r--   1 ftp      ftp        989833 Jun 19  2011 libxmlsec-1.2.18.win32.zip
-rw-r--r--   1 ftp      ftp        973680 Jun 20  2011 libxmlsec-nounicode-1.2.18.win32.zip
-r--r--r--   1 ftp      ftp        408001 Nov  1  2009 libxslt-1.1.26.win32.zip
-rw-r--r--   1 ftp      ftp           476 Sep  1  2011 md5sum.txt
dr-xr-xr-x   2 ftp      ftp          4096 Jun 19  2011 oldreleases
-r--r--r--   1 ftp      ftp       2481853 Aug 17  2008 openssl-0.9.8a.win32.zip
-r--r--r--   1 ftp      ftp        704733 Aug 17  2008 xsldbg-3.1.7.win32.zip
-r--r--r--   1 ftp      ftp        173472 Oct  2  2010 zlib-1.2.5.win32.zip

So how should I put explanation there? Shall I insert FTP_LIST_CMD_MAGIC=54 as variable or # '[54:]' trims permission, links, user, group, size and timestamp as comment?

Contributor

sakurai-youhei commented Mar 7, 2016

Thank you for your comment.

The '54' is because of LIST command in urllib (instead of NLST with ftplib); Here's what FTP server replays in case of direct access to FTP server. The '54' shouldn't be a problem as long as FTP server is running on Unix.

C:\Users\sakurai>C:\Python34\python.exe -c "from urllib.request import urlopen; print(urlopen('ftp://ftp.zlatkovic.com/pub/libxml/').read().decode('utf-8'))"
drwxr-xr-x   3 ftp      ftp          4096 Dec  7 12:10 64bit
dr-xr-xr-x   2 ftp      ftp          4096 May  3  2010 Delphi
-r--r--r--   1 ftp      ftp       1320616 Aug 17  2008 iconv-1.9.2.win32.zip
-rw-r--r--   1 ftp      ftp       2736611 Jun 19  2011 libxml2-2.7.8.win32.zip
-rw-r--r--   1 ftp      ftp        989833 Jun 19  2011 libxmlsec-1.2.18.win32.zip
-rw-r--r--   1 ftp      ftp        973680 Jun 20  2011 libxmlsec-nounicode-1.2.18.win32.zip
-r--r--r--   1 ftp      ftp        408001 Nov  1  2009 libxslt-1.1.26.win32.zip
-rw-r--r--   1 ftp      ftp           476 Sep  1  2011 md5sum.txt
dr-xr-xr-x   2 ftp      ftp          4096 Jun 19  2011 oldreleases
-r--r--r--   1 ftp      ftp       2481853 Aug 17  2008 openssl-0.9.8a.win32.zip
-r--r--r--   1 ftp      ftp        704733 Aug 17  2008 xsldbg-3.1.7.win32.zip
-r--r--r--   1 ftp      ftp        173472 Oct  2  2010 zlib-1.2.5.win32.zip


C:\Users\sakurai>C:\Python27\python.exe -c "from urllib import urlopen; print urlopen('ftp://ftp.zlatkovic.com/pub/libxml/').read().decode('utf-8')"
drwxr-xr-x   3 ftp      ftp          4096 Dec  7 12:10 64bit
dr-xr-xr-x   2 ftp      ftp          4096 May  3  2010 Delphi
-r--r--r--   1 ftp      ftp       1320616 Aug 17  2008 iconv-1.9.2.win32.zip
-rw-r--r--   1 ftp      ftp       2736611 Jun 19  2011 libxml2-2.7.8.win32.zip
-rw-r--r--   1 ftp      ftp        989833 Jun 19  2011 libxmlsec-1.2.18.win32.zip
-rw-r--r--   1 ftp      ftp        973680 Jun 20  2011 libxmlsec-nounicode-1.2.18.win32.zip
-r--r--r--   1 ftp      ftp        408001 Nov  1  2009 libxslt-1.1.26.win32.zip
-rw-r--r--   1 ftp      ftp           476 Sep  1  2011 md5sum.txt
dr-xr-xr-x   2 ftp      ftp          4096 Jun 19  2011 oldreleases
-r--r--r--   1 ftp      ftp       2481853 Aug 17  2008 openssl-0.9.8a.win32.zip
-r--r--r--   1 ftp      ftp        704733 Aug 17  2008 xsldbg-3.1.7.win32.zip
-r--r--r--   1 ftp      ftp        173472 Oct  2  2010 zlib-1.2.5.win32.zip

So how should I put explanation there? Shall I insert FTP_LIST_CMD_MAGIC=54 as variable or # '[54:]' trims permission, links, user, group, size and timestamp as comment?

@scoder

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Mar 7, 2016

Member

Well, "..._MAGIC" is obviously bad as name for a constant that is trying to reduce the magic. ;-)

The correct way to do this is to not cut off a fixed length prefix, but to split the line into 9 parts (line.split(None, 9)) and take the last one. Then, add a comment before the split command that shows an example output line and thus explains why 9 is the correct number here.

Member

scoder commented Mar 7, 2016

Well, "..._MAGIC" is obviously bad as name for a constant that is trying to reduce the magic. ;-)

The correct way to do this is to not cut off a fixed length prefix, but to split the line into 9 parts (line.split(None, 9)) and take the last one. Then, add a comment before the split command that shows an example output line and thus explains why 9 is the correct number here.

@sakurai-youhei sakurai-youhei changed the title from Replace ftplib with urllib to pick up ftp_proxy when building lxml with STATIC_DEP=true to Replace ftplib with urllib to pick up ftp_proxy when building lxml with STATIC_DEPS=true Mar 8, 2016

@sakurai-youhei

This comment has been minimized.

Show comment
Hide comment
@sakurai-youhei

sakurai-youhei Mar 8, 2016

Contributor

Thank you. I pushed one more commit to follow your instruction. Also, it's good to know split(None). :)

Contributor

sakurai-youhei commented Mar 8, 2016

Thank you. I pushed one more commit to follow your instruction. Also, it's good to know split(None). :)

@scoder

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Mar 8, 2016

Member

Thanks

Member

scoder commented Mar 8, 2016

Thanks

scoder added a commit that referenced this pull request Mar 8, 2016

Merge pull request #189 from sakurai-youhei/PR/static_dep-with-proxy
Replace ftplib with urllib to pick up ftp_proxy when building lxml with STATIC_DEPS=true

@scoder scoder merged commit 0334372 into lxml:master Mar 8, 2016

1 check failed

continuous-integration/travis-ci/pr The Travis CI build could not complete due to an error
Details
@debjan

This comment has been minimized.

Show comment
Hide comment
@debjan

debjan Mar 25, 2016

Hi, this breaks version detection on my pc.

I had to make following patch:

 def download_libxml2(dest_dir, version=None):
     """Downloads libxml2, returning the filename where the library was downloaded"""
-    version_re = re.compile(r'^LATEST_LIBXML2_IS_(.*)$')
+    version_re = re.compile(r'^-> libxml2-([0-9.]+[0-9]).tar.gz$')
     filename = 'libxml2-%s.tar.gz'
     return download_library(dest_dir, LIBXML2_LOCATION, 'libxml2',
                             version_re, filename, version=version)

 def download_libxslt(dest_dir, version=None):
     """Downloads libxslt, returning the filename where the library was downloaded"""
-    version_re = re.compile(r'^LATEST_LIBXSLT_IS_(.*)$')
+    version_re = re.compile(r'^-> libxslt-([0-9.]+[0-9]).tar.gz$')
     filename = 'libxslt-%s.tar.gz'
     return download_library(dest_dir, LIBXML2_LOCATION, 'libxslt',
                             version_re, filename, version=version)

as a workaround

debjan commented Mar 25, 2016

Hi, this breaks version detection on my pc.

I had to make following patch:

 def download_libxml2(dest_dir, version=None):
     """Downloads libxml2, returning the filename where the library was downloaded"""
-    version_re = re.compile(r'^LATEST_LIBXML2_IS_(.*)$')
+    version_re = re.compile(r'^-> libxml2-([0-9.]+[0-9]).tar.gz$')
     filename = 'libxml2-%s.tar.gz'
     return download_library(dest_dir, LIBXML2_LOCATION, 'libxml2',
                             version_re, filename, version=version)

 def download_libxslt(dest_dir, version=None):
     """Downloads libxslt, returning the filename where the library was downloaded"""
-    version_re = re.compile(r'^LATEST_LIBXSLT_IS_(.*)$')
+    version_re = re.compile(r'^-> libxslt-([0-9.]+[0-9]).tar.gz$')
     filename = 'libxslt-%s.tar.gz'
     return download_library(dest_dir, LIBXML2_LOCATION, 'libxslt',
                             version_re, filename, version=version)

as a workaround

@sakurai-youhei

This comment has been minimized.

Show comment
Hide comment
@sakurai-youhei

sakurai-youhei Mar 25, 2016

Contributor

Sorry for my breaking build on your pc. I suppose build is now broken on non-Windows environment because of difference between ftplib and urllib at ftp://xmlsoft.org/libxml2/ where I haven't checked at all; I checked my change only on Windows.

In [29]: import ftplib

In [30]: server = ftplib.FTP("xmlsoft.org")

In [31]: server.login()
Out[31]: '230 Login successful.'

In [32]: server.nlst("libxml2")[:5]
Out[32]:
['libxml2/LATEST_LIBXML2',
 'libxml2/LATEST_LIBXML2_IS_2.9.3',
 'libxml2/LATEST_LIBXSLT',
 'libxml2/LATEST_LIBXSLT_IS_1.1.28',
 'libxml2/ListArchives']

In [33]: import urllib

In [34]: urllib.urlopen("ftp://xmlsoft.org/libxml2/").readlines()[:5]
Out[34]:
['lrwxrwxrwx    1 50138    69             20 Mar 25 11:28 LATEST_LIBXML2 -> libxml2-2.9.3.tar.gz\r\n',
 'lrwxrwxrwx    1 50138    69             20 Mar 25 11:28 LATEST_LIBXML2_IS_2.9.3 -> libxml2-2.9.3.tar.gz\r\n',
 'lrwxrwxrwx    1 50138    69             21 Mar 25 11:28 LATEST_LIBXSLT -> libxslt-1.1.28.tar.gz\r\n',
 'lrwxrwxrwx    1 50138    69             21 Mar 25 11:28 LATEST_LIBXSLT_IS_1.1.28 -> libxslt-1.1.28.tar.gz\r\n',
 'drwxrwxr-x    4 50138    69           4096 Jun 19  2006 ListArchives\r\n']

@scoder, Do you mind if I suggest fix by myself? I'm thinking following 3 changes:

  • Not line.split(None, 9)[-1] but line.split(None, 8)[-1].rstrip()
  • Not r'^LATEST_LIBXML2_IS_(.*)$' but r'^LATEST_LIBXML2_IS_([0-9.]+[0-9])'
  • Not r'^LATEST_LIBXSLT_IS_(.*)$' but r'^LATEST_LIBXSLT_IS_([0-9.]+[0-9])'

If you or someone else wants to take it rather than to leave further changes by me, please feel free to let me know.

Contributor

sakurai-youhei commented Mar 25, 2016

Sorry for my breaking build on your pc. I suppose build is now broken on non-Windows environment because of difference between ftplib and urllib at ftp://xmlsoft.org/libxml2/ where I haven't checked at all; I checked my change only on Windows.

In [29]: import ftplib

In [30]: server = ftplib.FTP("xmlsoft.org")

In [31]: server.login()
Out[31]: '230 Login successful.'

In [32]: server.nlst("libxml2")[:5]
Out[32]:
['libxml2/LATEST_LIBXML2',
 'libxml2/LATEST_LIBXML2_IS_2.9.3',
 'libxml2/LATEST_LIBXSLT',
 'libxml2/LATEST_LIBXSLT_IS_1.1.28',
 'libxml2/ListArchives']

In [33]: import urllib

In [34]: urllib.urlopen("ftp://xmlsoft.org/libxml2/").readlines()[:5]
Out[34]:
['lrwxrwxrwx    1 50138    69             20 Mar 25 11:28 LATEST_LIBXML2 -> libxml2-2.9.3.tar.gz\r\n',
 'lrwxrwxrwx    1 50138    69             20 Mar 25 11:28 LATEST_LIBXML2_IS_2.9.3 -> libxml2-2.9.3.tar.gz\r\n',
 'lrwxrwxrwx    1 50138    69             21 Mar 25 11:28 LATEST_LIBXSLT -> libxslt-1.1.28.tar.gz\r\n',
 'lrwxrwxrwx    1 50138    69             21 Mar 25 11:28 LATEST_LIBXSLT_IS_1.1.28 -> libxslt-1.1.28.tar.gz\r\n',
 'drwxrwxr-x    4 50138    69           4096 Jun 19  2006 ListArchives\r\n']

@scoder, Do you mind if I suggest fix by myself? I'm thinking following 3 changes:

  • Not line.split(None, 9)[-1] but line.split(None, 8)[-1].rstrip()
  • Not r'^LATEST_LIBXML2_IS_(.*)$' but r'^LATEST_LIBXML2_IS_([0-9.]+[0-9])'
  • Not r'^LATEST_LIBXSLT_IS_(.*)$' but r'^LATEST_LIBXSLT_IS_([0-9.]+[0-9])'

If you or someone else wants to take it rather than to leave further changes by me, please feel free to let me know.

@scoder

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Mar 25, 2016

Member

Please try latest master.
177fb08
ba14368

Member

scoder commented Mar 25, 2016

Please try latest master.
177fb08
ba14368

@EmilStenstrom

This comment has been minimized.

Show comment
Hide comment
@EmilStenstrom

EmilStenstrom Jul 2, 2016

I'm getting this error installing lxml (assume that this is related to this PR):

$ STATIC_DEPS=true pip install lxml==3.6
Collecting lxml==3.6
  Using cached lxml-3.6.0.tar.gz
    Complete output from command python setup.py egg_info:
    Building lxml version 3.6.0.
    Latest version of libiconv is 1.14
    Downloading libiconv into libs/libiconv-1.14.tar.gz
    Unpacking libiconv-1.14.tar.gz into build/tmp
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/home/<user>/tmp/pip-build-s4Dtcu/lxml/setup.py", line 233, in <module>
        **setup_extra_options()
      File "/home/<user>/tmp/pip-build-s4Dtcu/lxml/setup.py", line 144, in setup_extra_options
        STATIC_CFLAGS, STATIC_BINARIES)
      File "setupinfo.py", line 64, in ext_modules
        multicore=OPTION_MULTICORE)
      File "buildlibxml.py", line 308, in build_libxml2xslt
        libxml2_dir  = unpack_tarball(download_libxml2(download_dir, libxml2_version), build_dir)
      File "buildlibxml.py", line 145, in download_libxml2
        version_re, filename, version=version)
      File "buildlibxml.py", line 180, in download_library
        % (name, fns))
    Exception: Could not find the most current version of the libxml2 from the files: <generator object parse_text_ftplist at 0x7f6841e37280>

I'm getting this error installing lxml (assume that this is related to this PR):

$ STATIC_DEPS=true pip install lxml==3.6
Collecting lxml==3.6
  Using cached lxml-3.6.0.tar.gz
    Complete output from command python setup.py egg_info:
    Building lxml version 3.6.0.
    Latest version of libiconv is 1.14
    Downloading libiconv into libs/libiconv-1.14.tar.gz
    Unpacking libiconv-1.14.tar.gz into build/tmp
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/home/<user>/tmp/pip-build-s4Dtcu/lxml/setup.py", line 233, in <module>
        **setup_extra_options()
      File "/home/<user>/tmp/pip-build-s4Dtcu/lxml/setup.py", line 144, in setup_extra_options
        STATIC_CFLAGS, STATIC_BINARIES)
      File "setupinfo.py", line 64, in ext_modules
        multicore=OPTION_MULTICORE)
      File "buildlibxml.py", line 308, in build_libxml2xslt
        libxml2_dir  = unpack_tarball(download_libxml2(download_dir, libxml2_version), build_dir)
      File "buildlibxml.py", line 145, in download_libxml2
        version_re, filename, version=version)
      File "buildlibxml.py", line 180, in download_library
        % (name, fns))
    Exception: Could not find the most current version of the libxml2 from the files: <generator object parse_text_ftplist at 0x7f6841e37280>
@EmilStenstrom

This comment has been minimized.

Show comment
Hide comment
@EmilStenstrom

EmilStenstrom Jul 2, 2016

Using the latest buillibxml.py from master fixes my problem. 👍

Using the latest buillibxml.py from master fixes my problem. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment