Add messytables -> jts converter, with docs. Fixes #40
domoritz committed Apr 27, 2013
1 parent 8f22055 commit 6e98d06
Showing 7 changed files with 101 additions and 38 deletions.
79 changes: 46 additions & 33 deletions doc/index.rst
@@ -4,17 +4,17 @@
messytables: all your rows are belong to us
===========================================

Tabular data as published on the web is often not well formatted
and structured. Messytables tries to detect and fix errors in the
data. Typical examples include:

* Finding the header of a table when there are explanations and
text fragments in the first few rows of the table.
* Guessing the type of columns in CSV data.
* Guessing the format of a byte stream.

This library provides data structures and some heuristics to
fix these problems and read a wide range of different tabular
abominations.

Example
@@ -35,18 +35,18 @@ evaluate data. A typical use might look like this::
    # If you aren't sure what kind of file it is, you can use
    # AnyTableSet instead.
    #table_set = AnyTableSet.from_fileobj(fh)

    # A table set is a collection of tables:
    row_set = table_set.tables[0]

    # A row set is an iterator over the table, but it can only
    # be run once. To peek, a sample is provided:
    print row_set.sample.next()

    # guess column types:
    types = type_guess(row_set.sample)

    # and tell the row set to apply these types to
    # each row when traversing the iterator:
    row_set.register_processor(types_processor(types))

@@ -61,7 +61,7 @@ evaluate data. A typical use might look like this::
    for row in row_set:
        do_something(row)

As you can see in the example above, messytables gives you a toolbox
of independent methods. There is no ready-made ``row_set.guess_types()``
because there are many ways to perform type guessing that we may
implement in the future. Therefore, heuristic operations are independent
@@ -75,9 +75,9 @@ in generic data types (a dict in a list in a dict).

.. autoclass:: messytables.core.Cell
  :members: empty

  .. attribute:: value

    The actual content of the cell.

  .. attribute:: column
@@ -100,7 +100,7 @@ in generic data types (a dict in a list in a dict).
CSV support
-----------

CSV support uses Python's dialect sniffer to detect the separator and
quoting mechanism used in the input file.
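
As a rough illustration of what dialect sniffing provides, the sketch below calls the standard library's ``csv.Sniffer`` directly on a made-up sample. The sample text and the restricted delimiter set are assumptions for this example, and it does not show how ``CSVTableSet`` uses the sniffer internally.

```python
import csv
import io

# Hypothetical sample: semicolon-separated data with a header row.
sample = "name;dob;city\nmk;2012-01-02;Berlin\njd;2011-05-06;Paris\n"

# Restricting the candidate delimiters keeps the guess stable on tiny samples.
dialect = csv.Sniffer().sniff(sample, delimiters=";,\t")

# The detected dialect can be passed straight to csv.reader.
rows = list(csv.reader(io.StringIO(sample), dialect))

print(dialect.delimiter)
print(rows[1])
```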

.. autoclass:: messytables.commas.CSVTableSet
@@ -129,21 +129,21 @@ The newer, XML-based Excel format is also supported but uses a different class.
  :members: raw

ZIP file support
-------------
----------------

The library supports loading CSV or Excel files from within ZIP files.

.. autoclass:: messytables.zip.ZIPTableSet
  :members: from_fileobj, tables

Auto-detecting file format
-------------
--------------------------

The library supports loading files in a generic way.

.. autoclass:: messytables.any.AnyTableSet
  :members: from_fileobj

Type detection
--------------

@@ -166,9 +166,9 @@ The supported types include:
Headers detection
-----------------

The CSV convention is to include column headers as the first row of
the data file. Unfortunately, many people feel the need to put titles,
general info etc. at the top of tabular data. Therefore, we need to scan
the first few rows of the data to guess which one is actually the header.

.. automethod:: messytables.headers.headers_guess
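
The idea of scanning the first few rows can be sketched with a simplified heuristic: take the most common number of non-empty cells per row and pick the first row that is about that wide. The function name ``guess_header``, the ``tolerance`` parameter, and the sample rows are hypothetical, and this is not presented as the exact algorithm behind ``headers_guess``.

```python
from collections import Counter


def guess_header(rows, tolerance=1):
    """Return (offset, header) for the first row about as wide as the modal row."""
    # Count the non-empty cells in every sampled row.
    counts = [sum(1 for cell in row if cell.strip()) for row in rows]
    if not counts:
        return 0, []
    # The modal width is taken as the "real" width of the table.
    modal = Counter(counts).most_common(1)[0][0]
    for offset, (row, count) in enumerate(zip(rows, counts)):
        if count >= modal - tolerance:
            return offset, [cell for cell in row if cell.strip()]
    return 0, []


# Hypothetical sample with a title and a blank line above the header.
sample = [
    ["Annual report 2012", "", "", ""],
    ["", "", "", ""],
    ["name", "dob", "city", "amount"],
    ["mk", "2012-01-02", "Berlin", "3"],
    ["jd", "2011-05-06", "Paris", "4"],
]
offset, header = guess_header(sample)
print(offset, header)
```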
@@ -177,9 +177,9 @@ Stream processors
-----------------

Stream processors apply transformations to a ``RowSet`` during
iteration. To use one, register it with the row set. A processor is
simply a function that takes the ``RowSet`` and the current row (a list
of ``Cell`` objects) as arguments and returns a modified version of the
row, or ``None`` to indicate that the row should be dropped.
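
The processor contract described above can be sketched without the library. Everything in this block is a stand-in written for illustration: the ``Cell`` namedtuple, the two example processors, and the ``run`` driver are hypothetical and only mimic the "take the row set and a row, return a row or ``None``" convention.

```python
from collections import namedtuple

# Stand-in for a cell type, for illustration only.
Cell = namedtuple("Cell", ["column", "value"])


def strip_processor():
    """Build a processor that strips whitespace from string cell values."""
    def apply(row_set, row):
        return [
            cell._replace(value=cell.value.strip())
            if isinstance(cell.value, str) else cell
            for cell in row
        ]
    return apply


def drop_empty_processor():
    """Build a processor that drops rows whose cells are all empty."""
    def apply(row_set, row):
        if all(cell.value in ("", None) for cell in row):
            return None
        return row
    return apply


def run(row_set, rows, processors):
    """Apply each processor in turn; a None result drops the row."""
    for row in rows:
        for processor in processors:
            row = processor(row_set, row)
            if row is None:
                break
        else:
            yield row


rows = [
    [Cell("a", " hello "), Cell("b", 1)],
    [Cell("a", "  "), Cell("b", "")],
    [Cell("a", "bye"), Cell("b", 2)],
]
result = list(run(None, rows, [strip_processor(), drop_empty_processor()]))
print([[cell.value for cell in row] for row in result])
```

Writing each processor as a closure keeps its configuration separate from the per-row call, which matches the way the documented processors are created by calling a factory with arguments.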

@@ -191,26 +191,39 @@ Most processors are implemented as closures called with some arguments:

.. automethod:: messytables.headers.headers_processor


JSON table schema
-----------------

Messytables can convert guessed headers and types to the `JSON table schema`_.

.. _JSON Table Schema: http://www.dataprotocols.org/en/latest/json-table-schema.html

.. automethod:: messytables.jts.rowset_as_jts

.. automethod:: messytables.jts.headers_and_typed_as_jts
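
To give a feel for the result, the sketch below builds the same kind of ``fields`` list with plain dicts. The field keys (``id``, ``label``, ``type``) follow the test added in this commit; the helper name ``fields_for`` is hypothetical, and the block deliberately does not use the ``jsontableschema`` package.

```python
def fields_for(headers, types):
    """Pair each header with its type name, one schema field per column."""
    return [
        {"id": header, "label": header, "type": type_name}
        for header, type_name in zip(headers, types)
    ]


# Headers and type names as they might come out of header and type guessing.
schema = {"fields": fields_for(["name", "dob"], ["string", "date"])}
print(schema)
```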


License
-------

Copyright (c) 2011 The Open Knowledge Foundation Ltd.
Copyright (c) 2013 The Open Knowledge Foundation Ltd.

Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the "Software"),
to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
IN THE SOFTWARE.

1 change: 1 addition & 0 deletions messytables/__init__.py
@@ -12,3 +12,4 @@
from messytables.zip import ZIPTableSet
from messytables.any import AnyTableSet

from messytables.jts import rowset_as_jts, headers_and_typed_as_jts
1 change: 0 additions & 1 deletion messytables/core.py
@@ -198,4 +198,3 @@ def dicts(self, sample=False):

    def __repr__(self):
        return "RowSet(%s)" % self.name

42 changes: 42 additions & 0 deletions messytables/jts.py
@@ -0,0 +1,42 @@
'''
Convert a rowset to the json table schema (http://www.dataprotocols.org/en/latest/json-table-schema.html)
'''

import messytables
import jsontableschema

MESSYTABLES_TO_JTS_MAPPING = {
    messytables.types.StringType: 'string',
    messytables.types.IntegerType: 'integer',
    messytables.types.FloatType: 'number',
    messytables.types.DecimalType: 'number',
    messytables.types.DateType: 'date',
    messytables.types.DateUtilType: 'date'
}


def celltype_as_string(celltype):
    return MESSYTABLES_TO_JTS_MAPPING[celltype.__class__]


def rowset_as_jts(rowset, headers=None, types=None):
    ''' Create a json table schema from a rowset
    '''
    _, headers = messytables.headers_guess(rowset.sample)
    types = map(celltype_as_string, messytables.type_guess(rowset.sample))

    return headers_and_typed_as_jts(headers, types)


def headers_and_typed_as_jts(headers, types):
    ''' Create a json table schema from headers and types as
    returned from :meth:`~messytables.headers.headers_guess` and :meth:`~messytables.types.type_guess`.
    '''
    j = jsontableschema.JSONTableSchema()

    for field_id, field_type in zip(headers, types):
        j.add_field(field_id=field_id,
                    label=field_id,
                    field_type=field_type)

    return j
3 changes: 2 additions & 1 deletion requirements-test.txt
@@ -5,4 +5,5 @@ chardet==2.1.1
python-dateutil>=1.5.0,<2.0.0
httpretty
nose
requests>=1.0
requests>=1.0
json-table-schema
3 changes: 2 additions & 1 deletion setup.py
@@ -43,7 +43,8 @@
'python-magic==0.4.3', # used for type guessing
'openpyxl==1.5.7',
'chardet==2.1.1',
'python-dateutil>=1.5.0,<2.0.0'
'python-dateutil>=1.5.0,<2.0.0',
'json-table-schema'
],
tests_require=[],
entry_points=\
10 changes: 8 additions & 2 deletions test/test_rowset.py
@@ -231,8 +231,6 @@ def rows(skip_policy):

        second = lambda r: r[1].value

        print map(second, rows(True))

        assert "goodbye" in map(second, rows(True))
        assert " goodbye" in map(second, rows(False))

@@ -244,6 +242,14 @@ def test_bad_first_sheet(self):
        assert_equal(0, len(list(tables[0].sample)))
        assert_equal(1000, len(list(tables[1].sample)))

    def test_rowset_as_schema(self):
        from StringIO import StringIO as sio
        ts = CSVTableSet.from_fileobj(sio('''name,dob\nmk,2012-01-02\n'''))
        rs = ts.tables[0]
        jts = rowset_as_jts(rs).as_dict()
        assert_equal(jts['fields'], [{'type': 'string', 'id': u'name', 'label': u'name'},
                                     {'type': 'date', 'id': u'dob', 'label': u'dob'}])


class TypeGuessTest(unittest.TestCase):
    def test_type_guess(self):