Read Stata file incrementally
Remove testing code

Use partition in null_terminate

Manage warnings better in test

Further warning management in testing; add skip_data argument

Major refactoring to address code review

Fix strl reading, templatize docstrings

Fix bug in attaching docstring

Add new test file

Add release note

Call read instead of data when calling pandas.read_stata

various small issues following code review

Improve performance of %td processing

Docs edit (minor)
kshedden committed Mar 1, 2015
1 parent c88b0ba commit 709c034
Showing 5 changed files with 570 additions and 219 deletions.
33 changes: 26 additions & 7 deletions doc/source/io.rst
@@ -3821,22 +3821,41 @@ outside of this range, the variable is cast to ``int16``.
Reading from Stata format
~~~~~~~~~~~~~~~~~~~~~~~~~

-The top-level function ``read_stata`` will read a dta files
-and return a DataFrame. Alternatively, the class :class:`~pandas.io.stata.StataReader`
-can be used if more granular access is required. :class:`~pandas.io.stata.StataReader`
-reads the header of the dta file at initialization. The method
-:func:`~pandas.io.stata.StataReader.data` reads and converts observations to a DataFrame.
+The top-level function ``read_stata`` will read a dta file and return
+either a DataFrame or a :class:`~pandas.io.stata.StataReader` that can
+be used to read the file incrementally.

.. ipython:: python

   pd.read_stata('stata.dta')

+.. versionadded:: 0.16.0
+
+Specifying a ``chunksize`` yields a
+:class:`~pandas.io.stata.StataReader` instance that can be used to
+read ``chunksize`` lines from the file at a time. The ``StataReader``
+object can be used as an iterator.
+
+  reader = pd.read_stata('stata.dta', chunksize=1000)
+  for df in reader:
+      do_something(df)
+
+For more fine-grained control, use ``iterator=True`` and specify
+``chunksize`` with each call to
+:func:`~pandas.io.stata.StataReader.read`.
+
+.. ipython:: python
+
+  reader = pd.read_stata('stata.dta', iterator=True)
+  chunk1 = reader.read(10)
+  chunk2 = reader.read(20)
+
Currently the ``index`` is retrieved as a column.

The parameter ``convert_categoricals`` indicates whether value labels should be
read and used to create a ``Categorical`` variable from them. Value labels can
-also be retrieved by the function ``variable_labels``, which requires data to be
-called before use (see ``pandas.io.stata.StataReader``).
+also be retrieved by the function ``value_labels``, which requires :func:`~pandas.io.stata.StataReader.read`
+to be called before use.

The parameter ``convert_missing`` indicates whether missing value
representations in Stata should be preserved. If ``False`` (the default),
2 changes: 2 additions & 0 deletions doc/source/release.rst
@@ -55,6 +55,8 @@ performance improvements along with a large number of bug fixes.

Highlights include:

+- Allow Stata files to be read incrementally, support for long strings in Stata files (issue:`9493`:) :ref:`here<io.stata_reader>`.
+
See the :ref:`v0.16.0 Whatsnew <whatsnew_0160>` overview or the issue tracker on GitHub for an extensive list
of all API changes, enhancements and bugs that have been fixed in 0.16.0.
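
The incremental reading documented in the ``io.rst`` change can be tried end to end with a small self-contained script. The following is a minimal sketch, not code from this commit: the sample data, the file name ``example.dta``, and the chunk sizes are made up for illustration. It also assumes a pandas version in which ``StataReader`` can be used as a context manager, which may not hold for the 0.16.0 release discussed here.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data, written to a temporary .dta file for the demo.
df = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                   "score": [0.5, 1.5, 2.5, 3.5, 4.5]})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.dta")
    df.to_stata(path, write_index=False)

    # chunksize: read_stata returns a StataReader that yields
    # DataFrames of at most `chunksize` rows when iterated.
    with pd.read_stata(path, chunksize=2) as reader:
        chunk_lengths = [len(chunk) for chunk in reader]

    # iterator=True: the caller chooses how many rows each read() returns;
    # successive calls continue where the previous one stopped.
    with pd.read_stata(path, iterator=True) as reader:
        first = reader.read(2)
        second = reader.read(3)

print(chunk_lengths)
print(len(first), len(second))
```

Using ``chunksize`` suits a uniform loop over the whole file, while ``iterator=True`` is the better fit when the number of rows per read varies.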
