This repository has been archived by the owner on Nov 30, 2018. It is now read-only.

Version 0.3.0
Renaming to xmlr.
Ensuring unicode in all keys and values.
Docstrings expanded.
XMLParsingMethods enum improved.
Tests also test unicode compliance now.
Added kwargs to iter and parse to enable tweaking of iterparse method.
Minor general modifications to tests and methods.
hbldh committed May 23, 2016
1 parent 76a82c1 commit 24627f3
Showing 18 changed files with 301 additions and 192 deletions.
5 changes: 3 additions & 2 deletions .coveragerc
@@ -1,13 +1,14 @@
[run]
branch = True
source = xmller
include = */xmller/*
source = xmlr
include = */xmlr/*
omit =
*/setup.py

[report]
exclude_lines =
except ImportError
except Exception
if is_py3:


2 changes: 1 addition & 1 deletion .travis.yml
@@ -22,6 +22,6 @@ install:
- "pip install python-coveralls"
- "pip install lxml"
- "pip install -e ."
script: py.test tests/ --cov xmller --cov-report term-missing
script: py.test tests/ --cov xmlr --cov-report term-missing
after_success:
- coveralls
8 changes: 8 additions & 0 deletions HISTORY.rst
@@ -1,3 +1,11 @@
v0.3.0 (2016-05-23)
===================
- Renaming from `xmller` to `xmlr`.
- General improvements.
- Test coverage increased.
- More documentation.
- Development Status classifier increased from Alpha to Beta.

v0.2.0 (2016-05-20)
===================
- Bugfixes.
31 changes: 16 additions & 15 deletions README.md
@@ -1,7 +1,7 @@
# xmller
# xmlr

[![Build Status](https://travis-ci.org/hbldh/xmller.svg?branch=master)](https://travis-ci.org/hbldh/xmller)
[![Coverage Status](https://coveralls.io/repos/github/hbldh/xmller/badge.svg?branch=master)](https://coveralls.io/github/hbldh/xmller?branch=master)
[![Build Status](https://travis-ci.org/hbldh/xmlr.svg?branch=master)](https://travis-ci.org/hbldh/xmlr)
[![Coverage Status](https://coveralls.io/repos/github/hbldh/xmlr/badge.svg?branch=master)](https://coveralls.io/github/hbldh/xmlr?branch=master)

It can be problematic to handle large XML files (>> 10 MB) and using the `xml` module
in Python directly leads to huge memory overheads. Most often, these large XML
@@ -10,7 +10,7 @@ intrinsic need to be stored in XML.

This package provides iterative methods for dealing with them, reading the
XML documents into Python dict representation instead, according to the
methodology specified in \[3\]. `xmller` is inspired by the
methodology specified in \[3\]. `xmlr` is inspired by the
solutions described in \[1\] and \[2\], enabling the parsing of very
large documents without problems with overtaxing the memory.
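
The iterative approach described above can be illustrated with the standard library alone. The following is a minimal sketch, not the `xmlr` implementation: the helper name `iter_elements` and the sample data are hypothetical, and it assumes Python 3. It uses `xml.etree.ElementTree.iterparse` and clears each element once it has been handled, so that memory use does not grow with document size.

```python
import io
import xml.etree.ElementTree as ET


def iter_elements(source, tag):
    """Yield (tag, text) for every completed element named `tag`.

    Hypothetical helper for illustration; not part of xmlr.
    """
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem.tag, elem.text
            # Drop the element's text and children so they can be freed.
            elem.clear()


# Small in-memory document standing in for a very large file.
sample = io.BytesIO(b"<Root><Record>a</Record><Record>b</Record></Root>")
records = list(iter_elements(sample, "Record"))
print(records)
```

Only the `"end"` event is requested, because an element's text and children are guaranteed to be complete only when its closing tag has been seen.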

@@ -23,15 +23,15 @@ large documents without problems with overtaxing the memory.
## Installation

```
pip install git+https://www.github.com/hbldh/xmller
pip install git+https://www.github.com/hbldh/xmlr
```

## Usage

To parse an entire document, use the `xmlparse` method:

```python
from xmller import xmlparse
from xmlr import xmlparse

doc = xmlparse('very_large_doc.xml')

@@ -41,7 +41,7 @@ An iterator, `xmliter`, yielding elements of a specified type as they are parsed
the document is also present:

```python
from xmller import xmliter
from xmlr import xmliter

for d in xmliter('very_large_record.xml', 'Record'):
print(d)
@@ -55,8 +55,10 @@ The desired parser can also be specified. Available methods are:
- `LXML_ELEMENTTREE` - Using the `lxml.etree` solution. Requires
installation of the `lxml` package.

These can then be used like this:

```python
from xmller import xmliter, XMLParsingMethods
from xmlr import xmliter, XMLParsingMethods

for d in xmliter('very_large_record.xml', 'Record',
parser=XMLParsingMethods.LXML_ELEMENTTREE):
@@ -75,17 +77,16 @@ Tests are run with `pytest`:

```bash
$ py.test tests/

============================= test session starts ==============================
platform linux2 -- Python 2.7.11+, pytest-2.9.1, py-1.4.31, pluggy-0.3.1
rootdir: /home/hbldh/Repos/xmller, inifile:
collected 44 items
platform linux2 -- Python 2.7.6, pytest-2.9.1, py-1.4.31, pluggy-0.3.1
rootdir: /home/hbldh/Repos/xmlr, inifile:
collected 50 items

tests/test_iter.py ........................
tests/test_iter.py ...........................
tests/test_methods.py ..
tests/test_parsing.py ..................
tests/test_parsing.py .....................

========================== 44 passed in 2.70 seconds ===========================
========================== 50 passed in 0.50 seconds ===========================
```

The tests fetch some XML documents from
14 changes: 7 additions & 7 deletions measure_memory.py
@@ -15,7 +15,7 @@
from __future__ import absolute_import

import os
from xmller import xmlparse, xmliter, XMLParsingMethods
from xmlr import xmlparse, xmliter, XMLParsingMethods

def document_size(doc):
"""A storage size estimator.
@@ -57,38 +57,38 @@ def document_size(doc):

# xmlparse

print ('xmller.xmlparse using xml.etree.ElementTree')
print ('xmlr.xmlparse using xml.etree.ElementTree')
doc = xmlparse("/home/hbldh/Downloads/google-renewals-all-20080624.xml", XMLParsingMethods.ELEMENTTREE)
print('Size in MB: {0:.2f} MB'.format(document_size(doc)/1024./1024.))
del doc

print ('xmller.xmlparse using xml.etree.cElementTree')
print ('xmlr.xmlparse using xml.etree.cElementTree')
doc = xmlparse("/home/hbldh/Downloads/google-renewals-all-20080624.xml", XMLParsingMethods.C_ELEMENTTREE)
print('Size in MB: {0:.2f} MB'.format(document_size(doc)/1024./1024.))
del doc

print ('xmller.xmlparse using lxml.etree')
print ('xmlr.xmlparse using lxml.etree')
doc = xmlparse("/home/hbldh/Downloads/google-renewals-all-20080624.xml", XMLParsingMethods.LXML_ELEMENTTREE)
print('Size in MB: {0:.2f} MB'.format(document_size(doc)/1024./1024.))
del doc

# xmliter

print ('xmller.xmliter using xml.etree.ElementTree')
print ('xmlr.xmliter using xml.etree.ElementTree')
docs = []
for d in xmliter("/home/hbldh/Downloads/google-renewals-all-20080624.xml", "Record", XMLParsingMethods.ELEMENTTREE):
docs.append(d)
print('Size in MB: {0:.2f} MB'.format(document_size(docs)/1024./1024.))
del docs

print ('xmller.xmliter using xml.etree.cElementTree')
print ('xmlr.xmliter using xml.etree.cElementTree')
docs = []
for d in xmliter("/home/hbldh/Downloads/google-renewals-all-20080624.xml", "Record", XMLParsingMethods.C_ELEMENTTREE):
docs.append(d)
print('Size in MB: {0:.2f} MB'.format(document_size(docs)/1024./1024.))
del docs

print ('xmller.xmliter using lxml.etree')
print ('xmlr.xmliter using lxml.etree')
docs = []
for d in xmliter("/home/hbldh/Downloads/google-renewals-all-20080624.xml", "Record", XMLParsingMethods.LXML_ELEMENTTREE):
docs.append(d)
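
The body of `document_size` is not shown in this diff. As a rough idea of how such a storage size estimator could work, here is a sketch under stated assumptions: the function name `approximate_size` is hypothetical, it relies on `sys.getsizeof`, and it only descends into dicts, lists, tuples and sets. It is not the script's actual implementation.

```python
import sys


def approximate_size(obj, seen=None):
    """Roughly estimate the memory footprint of nested containers, in bytes.

    Hypothetical sketch; the real document_size may work differently.
    """
    if seen is None:
        seen = set()
    if id(obj) in seen:
        # Count every distinct object only once.
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        for key, value in obj.items():
            size += approximate_size(key, seen) + approximate_size(value, seen)
    elif isinstance(obj, (list, tuple, set, frozenset)):
        for item in obj:
            size += approximate_size(item, seen)
    return size


doc = {"Record": [{"Title": "Group A"}, {"Title": "Group B"}]}
print("Size in MB: {0:.6f} MB".format(approximate_size(doc) / 1024. / 1024.))
```

The `seen` set keeps shared objects, such as repeated key strings, from being counted more than once.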
60 changes: 30 additions & 30 deletions measure_timing.py
@@ -22,51 +22,51 @@
Python 2.7
--------
xmller.xmlparse using xml.etree.ElementTree
xmlr.xmlparse using xml.etree.ElementTree
Time: 114.6762 s
xmller.xmlparse using xml.etree.cElementTree
xmlr.xmlparse using xml.etree.cElementTree
Time: 51.9837 s
xmller.xmlparse using lxml.etree
xmlr.xmlparse using lxml.etree
Time: 53.6425 s
xmller.xmliter using xml.etree.ElementTree
xmlr.xmliter using xml.etree.ElementTree
Time: 117.9564 s
xmller.xmliter using xml.etree.cElementTree
xmlr.xmliter using xml.etree.cElementTree
Time: 60.8914 s
xmller.xmliter using lxml.etree
xmlr.xmliter using lxml.etree
Time: 48.2994 s
Python 3.5
--------
xmller.xmlparse using xml.etree.ElementTree
xmlr.xmlparse using xml.etree.ElementTree
Time: 73.2584 s
xmller.xmlparse using xml.etree.cElementTree
xmlr.xmlparse using xml.etree.cElementTree
Time: 72.6901 s
xmller.xmlparse using lxml.etree
xmlr.xmlparse using lxml.etree
Time: 53.3402 s
xmller.xmliter using xml.etree.ElementTree
xmlr.xmliter using xml.etree.ElementTree
Time: 71.5361 s
xmller.xmliter using xml.etree.cElementTree
xmlr.xmliter using xml.etree.cElementTree
Time: 72.6967 s
xmller.xmliter using lxml.etree
xmlr.xmliter using lxml.etree
Time: 48.9455 s
PyPy
----
xmller.xmlparse using xml.etree.ElementTree
xmlr.xmlparse using xml.etree.ElementTree
Time: 42.3088 s
xmller.xmlparse using xml.etree.cElementTree
xmlr.xmlparse using xml.etree.cElementTree
Time: 43.0353 s
xmller.xmlparse using lxml.etree
xmlr.xmlparse using lxml.etree
Time: 538.7466 s
xmller.xmliter using xml.etree.ElementTree
xmlr.xmliter using xml.etree.ElementTree
Time: 42.5941 s
xmller.xmliter using xml.etree.cElementTree
xmlr.xmliter using xml.etree.cElementTree
Time: 42.3841 s
xmller.xmliter using lxml.etree
xmlr.xmliter using lxml.etree
Time: 271.5306 s
@@ -76,31 +76,31 @@

# xmlparse

print ('xmller.xmlparse using xml.etree.ElementTree')
t = timeit.timeit('xmlparse("/home/hbldh/Downloads/google-renewals-all-20080624.xml", XMLParsingMethods.ELEMENTTREE)', number=n, setup='from xmller import xmlparse, XMLParsingMethods')
print ('xmlr.xmlparse using xml.etree.ElementTree')
t = timeit.timeit('xmlparse("/home/hbldh/Downloads/google-renewals-all-20080624.xml", XMLParsingMethods.ELEMENTTREE)', number=n, setup='from xmlr import xmlparse, XMLParsingMethods')
print("Time: {0:.4f} s".format(t / n))

print ('xmller.xmlparse using xml.etree.cElementTree')
t = timeit.timeit('xmlparse("/home/hbldh/Downloads/google-renewals-all-20080624.xml", XMLParsingMethods.C_ELEMENTTREE)', number=n, setup='from xmller import xmlparse, XMLParsingMethods')
print ('xmlr.xmlparse using xml.etree.cElementTree')
t = timeit.timeit('xmlparse("/home/hbldh/Downloads/google-renewals-all-20080624.xml", XMLParsingMethods.C_ELEMENTTREE)', number=n, setup='from xmlr import xmlparse, XMLParsingMethods')
print("Time: {0:.4f} s".format(t / n))

print ('xmller.xmlparse using lxml.etree')
t = timeit.timeit('xmlparse("/home/hbldh/Downloads/google-renewals-all-20080624.xml", XMLParsingMethods.LXML_ELEMENTTREE)', number=n, setup='from xmller import xmlparse, XMLParsingMethods')
print ('xmlr.xmlparse using lxml.etree')
t = timeit.timeit('xmlparse("/home/hbldh/Downloads/google-renewals-all-20080624.xml", XMLParsingMethods.LXML_ELEMENTTREE)', number=n, setup='from xmlr import xmlparse, XMLParsingMethods')
print("Time: {0:.4f} s".format(t / n))

# xmliter

print ('xmller.xmliter using xml.etree.ElementTree')
print ('xmlr.xmliter using xml.etree.ElementTree')
t = timeit.timeit('for d in xmliter("/home/hbldh/Downloads/google-renewals-all-20080624.xml", "Record", XMLParsingMethods.ELEMENTTREE): docs.append(d)',
number=n, setup='from xmller import xmliter, XMLParsingMethods; docs = []')
number=n, setup='from xmlr import xmliter, XMLParsingMethods; docs = []')
print("Time: {0:.4f} s".format(t / n))

print ('xmller.xmliter using xml.etree.cElementTree')
print ('xmlr.xmliter using xml.etree.cElementTree')
t = timeit.timeit('for d in xmliter("/home/hbldh/Downloads/google-renewals-all-20080624.xml", "Record", XMLParsingMethods.C_ELEMENTTREE): docs.append(d)',
number=n, setup='from xmller import xmliter, XMLParsingMethods; docs = []')
number=n, setup='from xmlr import xmliter, XMLParsingMethods; docs = []')
print("Time: {0:.4f} s".format(t / n))

print ('xmller.xmliter using lxml.etree')
print ('xmlr.xmliter using lxml.etree')
t = timeit.timeit('for d in xmliter("/home/hbldh/Downloads/google-renewals-all-20080624.xml", "Record", XMLParsingMethods.LXML_ELEMENTTREE): docs.append(d)',
number=n, setup='from xmller import xmliter, XMLParsingMethods; docs = []')
number=n, setup='from xmlr import xmliter, XMLParsingMethods; docs = []')
print("Time: {0:.4f} s".format(t / n))
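
The timing pattern used above can be tried without the large XML file. The sketch below assumes Python 3.5 or later (for the `globals` argument) and uses a hypothetical repetition count and a trivial statement. `timeit.timeit` returns the total time for `number` executions, which is why the script divides by `n`. State created outside the timed statement, like the `docs` list in the script's `setup` string, is shared by all repetitions.

```python
import timeit

n = 3  # hypothetical repetition count

# Total time for n runs of the statement; setup is executed once.
t = timeit.timeit("sorted(data, reverse=True)",
                  setup="data = list(range(1000))", number=n)
print("Time: {0:.4f} s".format(t / n))

# The same list is appended to on every repetition.
docs = []
timeit.timeit("docs.append(1)", number=n, globals={"docs": docs})
print(len(docs))
```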
8 changes: 4 additions & 4 deletions setup.py
@@ -31,17 +31,17 @@
def read(f):
return open(f, encoding='utf-8').read()

with open('xmller/__init__.py', 'r') as fd:
with open('xmlr/__init__.py', 'r') as fd:
version = re.search(r'^__version__\s*=\s*[\'"]([^\'"]*)[\'"]',
fd.read(), re.MULTILINE).group(1)
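
To see what the regular expression above extracts, it can be run against an in-memory string instead of a file. This is a small illustration only: the contents of `init_source` are made up and do not reproduce the package's real `__init__.py`.

```python
import re

# Hypothetical module source, for illustration only.
init_source = '"""Example package."""\n\n__version__ = "0.3.0"\n__author__ = "someone"\n'

match = re.search(r'^__version__\s*=\s*[\'"]([^\'"]*)[\'"]',
                  init_source, re.MULTILINE)
if match is None:
    # setup.py above would fail with AttributeError in this case.
    raise RuntimeError("No __version__ string found")
version = match.group(1)
print(version)
```

`re.MULTILINE` lets `^` match at the start of any line, so the assignment does not have to be the first line of the file.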


setup(
name='xmller',
name='xmlr',
version=version,
author='Henrik Blidh',
author_email='henrik.blidh@nedomkull.com',
url='https://github.com/hbldh/xmller',
url='https://github.com/hbldh/xmlr',
description='XML parsing package for very large files',
long_description=read('README.md'),
license='MIT',
@@ -55,7 +55,7 @@ def read(f):
'Programming Language :: Python :: 3.4',
'Programming Language :: Python :: 3.5',
'Operating System :: OS Independent',
'Development Status :: 3 - Alpha',
'Development Status :: 4 - Beta',
'Intended Audience :: Developers',
'Topic :: Text Processing :: Markup :: XML'
],
36 changes: 36 additions & 0 deletions tests/test_doc.xml
@@ -0,0 +1,36 @@
<?xml version="1.0" encoding="UTF-8"?>
<MainSection ns="https://example.org">
<AnItem>
<Year>1923</Year>
<MD5Sum></MD5Sum>
<Title>Group A</Title>
<Files>
<File type="txt">a.txt</File>
<File type="bat">b.bat</File>
<File type="exe">c.exe</File>
<File type="pdf">d.pdf</File>
<File type="ai">e.ai</File>
<File type="jpeg"></File>
<File></File>
</Files>
</AnItem>
<AnItem>
<Year>1988</Year>
<MD5Sum>3108ea337254c046f204d01a4bbdcb4d</MD5Sum>
<Title>Group B</Title>
<Files>
<File type="txt">a.txt</File>
</Files>
</AnItem>
<AnItem>
<Year>2014</Year>
<MD5Sum>3108ea337254c046f204d01a4bbdcb4d</MD5Sum>
<Title color="red">Group C</Title>
<Files>
<File type="exe">c.exe</File>
<File type="pdf">d.pdf</File>
<File type="ai">e.ai</File>
<File type="jpeg"></File>
</Files>
</AnItem>
</MainSection>
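
A document of this shape can be walked item by item with the standard library. The following sketch assumes Python 3 and inlines a shortened copy of the file so that it is self-contained. It uses `xml.etree.ElementTree.iterparse` directly rather than `xmlr`, and the dict layout it builds is an illustrative choice, not the representation `xmlr` produces.

```python
import io
import xml.etree.ElementTree as ET

# Shortened copy of the test document, inlined for illustration.
xml_bytes = b"""<?xml version="1.0" encoding="UTF-8"?>
<MainSection ns="https://example.org">
  <AnItem>
    <Year>1988</Year>
    <Title>Group B</Title>
    <Files>
      <File type="txt">a.txt</File>
    </Files>
  </AnItem>
  <AnItem>
    <Year>2014</Year>
    <Title color="red">Group C</Title>
    <Files>
      <File type="exe">c.exe</File>
      <File type="jpeg"></File>
    </Files>
  </AnItem>
</MainSection>
"""

items = []
for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
    if elem.tag == "AnItem":
        items.append({
            "Year": elem.findtext("Year"),
            "Title": elem.findtext("Title"),
            # An empty element such as <File type="jpeg"></File> has text None.
            "Files": [(f.get("type"), f.text) for f in elem.iter("File")],
        })
        # Free the processed item before moving on.
        elem.clear()

print(items)
```

The empty `File` and `MD5Sum` elements in the test document are the kind of edge case this makes visible: ElementTree reports their text as `None`, not as an empty string.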