Consolidated changes: typecast, python 3 support, etc. #155

Closed
wants to merge 33 commits into from
163d964
Use `typecast` for type conversion.
pudo Aug 6, 2015
6f6e574
Fix up type guessing tests.
pudo Aug 24, 2015
a933003
Hide coverage results.
pudo Aug 24, 2015
3c97004
Clean up imports.
pudo Aug 24, 2015
c099b29
Get rid of old type names.
pudo Aug 24, 2015
5346210
Clean out old aliases for XLSXTableSet
pudo Aug 24, 2015
0925697
Further pieces of clean up.
pudo Aug 24, 2015
e0deaa1
Start getting rid of the compatibility layer
pudo Aug 25, 2015
41592c5
Remove remaining awkward compatibility work-arounds.
pudo Aug 25, 2015
b08c056
avoid circular import
pudo Aug 25, 2015
5ac1713
Clean up README.
pudo Aug 25, 2015
c181d57
fix py3 compat
pudo Aug 25, 2015
81e7518
Don’t raise for 0 as a date.
pudo Jul 23, 2016
bc42f4d
merge
pudo Jul 23, 2016
059ace7
fix up test errors, attempt to make travis pass
pudo Jul 23, 2016
896612b
skip tests if en_GB is not supported
pudo Jul 23, 2016
716d976
remove ambiguous var
pudo Jul 23, 2016
47dd64f
dont score null values in type detection
pudo Jul 23, 2016
6a1a686
Move test utilities to a specific module.
pudo Jul 23, 2016
beddd57
Move the buffered reader to it’s own module.
pudo Jul 23, 2016
4bb2c75
Move guesser class to typecast.
pudo Jul 23, 2016
77ae286
Factor out CSV re-coder
pudo Jul 23, 2016
6911378
use cchardet
pudo Jul 23, 2016
fc223f2
simplify the handling of CSV dialects
pudo Jul 23, 2016
c5a1882
try relative imports with py3
pudo Jul 23, 2016
cd229a7
PEP8.
pudo Jul 23, 2016
03b3425
Simplify JTS code.
pudo Jul 23, 2016
ed3cc87
pep8
pudo Jul 23, 2016
99acadf
Move stuff around.
pudo Jul 23, 2016
3e48593
Formatting.
pudo Jul 23, 2016
fa5f2f4
Replace CSV reader with a fully streaming implementation.
pudo Jul 24, 2016
ec1a9df
Fix up Python 3 support
pudo Jul 24, 2016
54747aa
confirm at least python 3.5 is working
pudo Jul 24, 2016
5 changes: 5 additions & 0 deletions .gitignore
@@ -1,7 +1,12 @@
*.swp
*.egg-info
*.pyc
*.eggs
*.DS_Store
*/_build/*
*.py~
*.~lock.*#
.coverage
dist/*
.tox/*
pyenv3
6 changes: 3 additions & 3 deletions .travis.yml
@@ -1,12 +1,12 @@
language: python
python:
- "2.6"
- "2.7"
- "3.4"
- "3.5"
install:
- pip install -U pip setuptools
- pip install -e .
- pip install -r requirements-test.txt
- pip install coveralls
- pip install coveralls nose coverage httpretty
script: nosetests --with-coverage --cover-package=messytables
after_success:
- coveralls
30 changes: 0 additions & 30 deletions Dockerfile

This file was deleted.

12 changes: 3 additions & 9 deletions Makefile
@@ -1,10 +1,4 @@
run: build
@docker run \
--rm \
-ti \
messytables
test:
nosetests --with-coverage --cover-package=messytables --cover-erase

build:
@docker build -t messytables .

.PHONY: run build
.PHONY: run build test
14 changes: 3 additions & 11 deletions README.md
@@ -1,19 +1,11 @@
# Parsing for messy tables

[![Build Status](https://travis-ci.org/okfn/messytables.png?branch=master)](https://travis-ci.org/okfn/messytables)
[![Coverage Status](https://coveralls.io/repos/okfn/messytables/badge.png?branch=master)](https://coveralls.io/r/okfn/messytables?branch=master)
[![Latest Version](https://pypip.in/version/messytables/badge.svg)](https://pypi.python.org/pypi/messytables/)
[![Downloads](https://pypip.in/download/messytables/badge.svg)](https://pypi.python.org/pypi/messytables/)
[![Supported Python versions](https://pypip.in/py_versions/messytables/badge.svg)](https://pypi.python.org/pypi/ckanserviceprovider/)
[![Development Status](https://pypip.in/status/messytables/badge.svg)](https://pypi.python.org/pypi/messytables/)
[![License](https://pypip.in/license/messytables/badge.svg)](https://pypi.python.org/pypi/messytables/)
# Parsing for messy tables [![Build Status](https://travis-ci.org/okfn/messytables.png?branch=master)](https://travis-ci.org/okfn/messytables) [![Coverage Status](https://coveralls.io/repos/okfn/messytables/badge.png?branch=master)](https://coveralls.io/r/okfn/messytables?branch=master)

A library for dealing with messy tabular data in several formats, guessing types and detecting headers.

See the documentation at: https://messytables.readthedocs.io
See the full documentation at: https://messytables.readthedocs.org

Find the package at: https://pypi.python.org/pypi/messytables

See CONTRIBUTING.md for how to send patches, run tests.
See ``CONTRIBUTING.md`` for how to send patches, run tests.

**Contact**: Open Knowledge Labs - http://okfnlabs.org/contact/. We especially recommend the forum: http://discuss.okfn.org/category/open-knowledge-labs/
22 changes: 9 additions & 13 deletions messytables/__init__.py
@@ -1,25 +1,21 @@

from messytables.util import offset_processor, null_processor
from messytables.headers import headers_guess, headers_processor, headers_make_unique
from messytables.headers import headers_guess, headers_processor
from messytables.headers import headers_make_unique
from messytables.types import type_guess, types_processor
from messytables.types import StringType, IntegerType, FloatType, \
DecimalType, DateType, DateUtilType, BoolType
from messytables.error import ReadError

from messytables.core import Cell, TableSet, RowSet, seekable_stream
from messytables.commas import CSVTableSet, CSVRowSet
from messytables.buffered import seekable_stream
from messytables.core import Cell, TableSet, RowSet
from messytables.commas import CSVTableSet, CSVRowSet, TSVTableSet
from messytables.ods import ODSTableSet, ODSRowSet
from messytables.excel import XLSTableSet, XLSRowSet

# XLSXTableSet has been deprecated and its functionality is now provided by
# XLSTableSet. This is to retain backwards compatibility with anyone
# constructing XLSXTableSet directly (rather than using any_tableset)
XLSXTableSet = XLSTableSet
XLSXRowSet = XLSRowSet

from messytables.zip import ZIPTableSet
from messytables.html import HTMLTableSet, HTMLRowSet
from messytables.pdf import PDFTableSet, PDFRowSet
from messytables.any import any_tableset, AnyTableSet
from messytables.any import any_tableset

from messytables.jts import rowset_as_jts, headers_and_typed_as_jts

import warnings
warnings.filterwarnings('ignore', "Coercing non-XML name")
28 changes: 11 additions & 17 deletions messytables/any.py
@@ -1,8 +1,10 @@
from messytables import (ZIPTableSet, PDFTableSet, CSVTableSet, XLSTableSet,
HTMLTableSet, ODSTableSet)
import messytables
import re

from messytables import ZIPTableSet, PDFTableSet, CSVTableSet, XLSTableSet
from messytables import HTMLTableSet, ODSTableSet, TSVTableSet
from messytables.buffered import seekable_stream
from messytables.error import ReadError


MIMELOOKUP = {'application/x-zip-compressed': 'ZIP',
'application/zip': 'ZIP',
@@ -24,14 +26,13 @@
'application/pdf': 'PDF',
'text/plain': 'CSV', # could be TAB.
'application/CDFV2-corrupt': 'XLS',
'application/CDFV2-unknown': 'XLS',
'application/vnd.oasis.opendocument.spreadsheet': 'ODS',
'application/x-vnd.oasis.opendocument.spreadsheet': 'ODS',
}

def TABTableSet(fileobj):
return CSVTableSet(fileobj, delimiter='\t')

parsers = {'TAB': TABTableSet,
parsers = {'TAB': TSVTableSet,
'ZIP': ZIPTableSet,
'XLS': XLSTableSet,
'HTML': HTMLTableSet,
@@ -61,9 +62,9 @@ def get_mime(fileobj):
import magic
# Since we need to peek the start of the stream, make sure we can
# seek back later. If not, slurp in the contents into a StringIO.
fileobj = messytables.seekable_stream(fileobj)
fileobj = seekable_stream(fileobj)
header = fileobj.read(4096)
mimetype = magic.from_buffer(header, mime=True).decode('utf-8')
mimetype = magic.from_buffer(header, mime=True) # .decode('utf-8')
fileobj.seek(0)
if MIMELOOKUP.get(mimetype) == 'ZIP':
# consider whether it's an Microsoft Office document
@@ -159,13 +160,6 @@ def any_tableset(fileobj, mimetype=None, extension='', auto_detect=True, **kw):
mimetype=magic_mime))

if error:
raise messytables.ReadError('any: \n'.join(error))
raise ReadError('any: \n'.join(error))
else:
raise messytables.ReadError("any: Did not attempt any detection.")


class AnyTableSet:
'''Deprecated - use any_tableset instead.'''
@staticmethod
def from_fileobj(fileobj, mimetype=None, extension=None):
return any_tableset(fileobj, mimetype=mimetype, extension=extension)
raise ReadError("any: Did not attempt any detection.")
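The `any.py` hunks above only show fragments of the dispatch path, so here is a rough, self-contained sketch of the idea: map a MIME type to a format key, then map that key to a parser. The `MIMELOOKUP` entries are a subset of the table in the diff; `dispatch`, the placeholder parser functions, and the local `ReadError` class are hypothetical stand-ins for illustration and are not the library's actual implementation.

```python
# Subset of the MIMELOOKUP table shown in the diff.
MIMELOOKUP = {
    'application/zip': 'ZIP',
    'application/pdf': 'PDF',
    'text/plain': 'CSV',  # could be TAB.
}


class ReadError(Exception):
    """Local stand-in for messytables.error.ReadError (assumption)."""


# Hypothetical placeholder parsers; the real code registers classes such as
# CSVTableSet, PDFTableSet and ZIPTableSet instead.
def parse_csv(fileobj):
    return ('CSV', fileobj)


def parse_pdf(fileobj):
    return ('PDF', fileobj)


def parse_zip(fileobj):
    return ('ZIP', fileobj)


PARSERS = {'CSV': parse_csv, 'PDF': parse_pdf, 'ZIP': parse_zip}


def dispatch(fileobj, mimetype):
    """Pick a parser for `mimetype`, or raise ReadError if none matches."""
    key = MIMELOOKUP.get(mimetype)
    if key is None:
        raise ReadError("any: no parser registered for %r" % mimetype)
    return PARSERS[key](fileobj)
```

Importing `ReadError` and `seekable_stream` directly, as this PR does, keeps `any.py` from reaching back through the `messytables` package namespace at call time.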
89 changes: 89 additions & 0 deletions messytables/buffered.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,89 @@
import io

BUFFER_SIZE = 4096


def seekable_stream(fileobj):
    try:
        fileobj.seek(0)
        # if we got here, the stream is seekable
        return fileobj
    except:
        # otherwise seek failed, so slurp in stream and wrap
        # it in a BytesIO
        return BufferedFile(fileobj)


class BufferedFile(object):
    """A buffered file that preserves the beginning of a stream."""

    def __init__(self, fp, buffer_size=BUFFER_SIZE + 2):
        self.data = io.BytesIO()
        self.fp = fp
        self.offset = 0
        self.len = 0
        self.fp_offset = 0
        self.buffer_size = buffer_size

    def _next_line(self):
        try:
            return self.fp.readline()
        except AttributeError:
            return next(self.fp)

    def _read(self, n):
        return self.fp.read(n)

    @property
    def _buffer_full(self):
        return self.len >= self.buffer_size

    def readline(self):
        if self.len < self.offset < self.fp_offset:
            raise BufferError('Line is not available anymore')
        if self.offset >= self.len:
            line = self._next_line()
            self.fp_offset += len(line)

            self.offset += len(line)

            if not self._buffer_full:
                self.data.write(line)
                self.len += len(line)
        else:
            line = self.data.readline()
            self.offset += len(line)
        return line

    def read(self, n=-1):
        if n == -1:
            # if the request is to do a complete read, then do a complete
            # read.
            self.data.seek(self.offset)
            return self.data.read(-1) + self.fp.read(-1)

        if self.len < self.offset < self.fp_offset:
            raise BufferError('Data is not available anymore')
        if self.offset >= self.len:
            byte = self._read(n)
            self.fp_offset += len(byte)

            self.offset += len(byte)

            if not self._buffer_full:
                self.data.write(byte)
                self.len += len(byte)
        else:
            byte = self.data.read(n)
            self.offset += len(byte)
        return byte

    def tell(self):
        return self.offset

    def seek(self, offset):
        if self.len < offset < self.fp_offset:
            raise BufferError('Cannot seek because data is not buffered here')
        self.offset = offset
        if offset < self.len:
            self.data.seek(offset)
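To illustrate why `seekable_stream` exists, the following is a simplified sketch of the peek-then-rewind pattern that `get_mime` relies on. It is not the `BufferedFile` class from this PR: `OneWayStream`, `PeekableStream` and `make_seekable` are hypothetical names, and, unlike `BufferedFile`, this version keeps every byte it has read in memory instead of capping the buffer at `buffer_size`.

```python
import io


class OneWayStream(object):
    """Hypothetical non-seekable source, e.g. a network response body."""

    def __init__(self, payload):
        self._inner = io.BytesIO(payload)

    def read(self, n=-1):
        return self._inner.read(n)

    def seek(self, offset):
        raise io.UnsupportedOperation("seek")


class PeekableStream(object):
    """Remember everything read so far so the caller can rewind."""

    def __init__(self, fp):
        self.fp = fp
        self.head = b""   # bytes pulled from fp so far
        self.offset = 0   # current read position within head

    def read(self, n=-1):
        available = len(self.head) - self.offset
        if n < 0:
            # complete read: pull in whatever is left upstream
            self.head += self.fp.read()
        elif n > available:
            # top up the buffer with only the missing bytes
            self.head += self.fp.read(n - available)
        end = len(self.head) if n < 0 else self.offset + n
        chunk = self.head[self.offset:end]
        self.offset += len(chunk)
        return chunk

    def seek(self, offset):
        if offset > len(self.head):
            raise BufferError("cannot seek past buffered data")
        self.offset = offset


def make_seekable(fileobj):
    """Return fileobj if it can seek, else wrap it (mirrors seekable_stream)."""
    try:
        fileobj.seek(0)
        return fileobj
    except Exception:
        return PeekableStream(fileobj)


stream = make_seekable(OneWayStream(b"a,b\n1,2\n"))
header = stream.read(4)   # peek at the start, as MIME sniffing would
stream.seek(0)            # rewind before handing the stream to a parser
body = stream.read()
```

After the rewind, `body` contains the full payload including the four bytes already consumed by the peek, which is the property the parsers downstream depend on.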