diff --git a/doc/redirects.csv b/doc/redirects.csv
index 4f4b3d7fc0780..8c62ecc362ccd 100644
--- a/doc/redirects.csv
+++ b/doc/redirects.csv
@@ -4,6 +4,10 @@
 # getting started
 10min,getting_started/10min
 basics,getting_started/basics
+comparison_with_r,getting_started/comparison/comparison_with_r
+comparison_with_sql,getting_started/comparison/comparison_with_sql
+comparison_with_sas,getting_started/comparison/comparison_with_sas
+comparison_with_stata,getting_started/comparison/comparison_with_stata
 dsintro,getting_started/dsintro
 overview,getting_started/overview
 tutorials,getting_started/tutorials
@@ -12,6 +16,7 @@ tutorials,getting_started/tutorials
 advanced,user_guide/advanced
 categorical,user_guide/categorical
 computation,user_guide/computation
+cookbook,user_guide/cookbook
 enhancingperf,user_guide/enhancingperf
 gotchas,user_guide/gotchas
 groupby,user_guide/groupby
diff --git a/doc/source/comparison_with_r.rst b/doc/source/getting_started/comparison/comparison_with_r.rst
similarity index 100%
rename from doc/source/comparison_with_r.rst
rename to doc/source/getting_started/comparison/comparison_with_r.rst
diff --git a/doc/source/comparison_with_sas.rst b/doc/source/getting_started/comparison/comparison_with_sas.rst
similarity index 100%
rename from doc/source/comparison_with_sas.rst
rename to doc/source/getting_started/comparison/comparison_with_sas.rst
diff --git a/doc/source/comparison_with_sql.rst b/doc/source/getting_started/comparison/comparison_with_sql.rst
similarity index 100%
rename from doc/source/comparison_with_sql.rst
rename to doc/source/getting_started/comparison/comparison_with_sql.rst
diff --git a/doc/source/comparison_with_stata.rst b/doc/source/getting_started/comparison/comparison_with_stata.rst
similarity index 100%
rename from doc/source/comparison_with_stata.rst
rename to doc/source/getting_started/comparison/comparison_with_stata.rst
diff --git a/doc/source/getting_started/comparison/index.rst b/doc/source/getting_started/comparison/index.rst
new file mode 100644
index 0000000000000..998706ce0c639
--- /dev/null
+++ b/doc/source/getting_started/comparison/index.rst
@@ -0,0 +1,15 @@
+{{ header }}
+
+.. _comparison:
+
+===========================
+Comparison with other tools
+===========================
+
+.. toctree::
+    :maxdepth: 2
+
+    comparison_with_r
+    comparison_with_sql
+    comparison_with_sas
+    comparison_with_stata
diff --git a/doc/source/getting_started/index.rst b/doc/source/getting_started/index.rst
index 116efe79beef1..4c5d26461a667 100644
--- a/doc/source/getting_started/index.rst
+++ b/doc/source/getting_started/index.rst
@@ -13,4 +13,5 @@ Getting started
     10min
     basics
     dsintro
+    comparison/index
     tutorials
diff --git a/doc/source/getting_started/overview.rst b/doc/source/getting_started/overview.rst
index 1e07df47aadca..b531f686951fc 100644
--- a/doc/source/getting_started/overview.rst
+++ b/doc/source/getting_started/overview.rst
@@ -6,25 +6,80 @@
 Package overview
 ****************
 
-:mod:`pandas` is an open source, BSD-licensed library providing high-performance,
-easy-to-use data structures and data analysis tools for the `Python `__
-programming language.
-
-:mod:`pandas` consists of the following elements:
-
-* A set of labeled array data structures, the primary of which are
-  Series and DataFrame.
-* Index objects enabling both simple axis indexing and multi-level /
-  hierarchical axis indexing.
-* An integrated group by engine for aggregating and transforming data sets.
-* Date range generation (date_range) and custom date offsets enabling the
-  implementation of customized frequencies.
-* Input/Output tools: loading tabular data from flat files (CSV, delimited,
-  Excel 2003), and saving and loading pandas objects from the fast and
-  efficient PyTables/HDF5 format.
-* Memory-efficient "sparse" versions of the standard data structures for storing
-  data that is mostly missing or mostly constant (some fixed value).
-* Moving window statistics (rolling mean, rolling standard deviation, etc.).
+**pandas** is a `Python `__ package providing fast,
+flexible, and expressive data structures designed to make working with
+"relational" or "labeled" data both easy and intuitive. It aims to be the
+fundamental high-level building block for doing practical, **real world** data
+analysis in Python. Additionally, it has the broader goal of becoming **the
+most powerful and flexible open source data analysis / manipulation tool
+available in any language**. It is already well on its way toward this goal.
+
+pandas is well suited for many different kinds of data:
+
+  - Tabular data with heterogeneously-typed columns, as in an SQL table or
+    Excel spreadsheet
+  - Ordered and unordered (not necessarily fixed-frequency) time series data.
+  - Arbitrary matrix data (homogeneously typed or heterogeneous) with row and
+    column labels
+  - Any other form of observational / statistical data sets. The data actually
+    need not be labeled at all to be placed into a pandas data structure
+
+The two primary data structures of pandas, :class:`Series` (1-dimensional)
+and :class:`DataFrame` (2-dimensional), handle the vast majority of typical use
+cases in finance, statistics, social science, and many areas of
+engineering. For R users, :class:`DataFrame` provides everything that R's
+``data.frame`` provides and much more. pandas is built on top of `NumPy
+`__ and is intended to integrate well within a scientific
+computing environment with many other 3rd party libraries.
+
+Here are just a few of the things that pandas does well:
+
+  - Easy handling of **missing data** (represented as NaN) in floating point as
+    well as non-floating point data
+  - Size mutability: columns can be **inserted and deleted** from DataFrame and
+    higher dimensional objects
+  - Automatic and explicit **data alignment**: objects can be explicitly
+    aligned to a set of labels, or the user can simply ignore the labels and
+    let `Series`, `DataFrame`, etc. automatically align the data for you in
+    computations
+  - Powerful, flexible **group by** functionality to perform
+    split-apply-combine operations on data sets, for both aggregating and
+    transforming data
+  - Make it **easy to convert** ragged, differently-indexed data in other
+    Python and NumPy data structures into DataFrame objects
+  - Intelligent label-based **slicing**, **fancy indexing**, and **subsetting**
+    of large data sets
+  - Intuitive **merging** and **joining** data sets
+  - Flexible **reshaping** and pivoting of data sets
+  - **Hierarchical** labeling of axes (possible to have multiple labels per
+    tick)
+  - Robust IO tools for loading data from **flat files** (CSV and delimited),
+    Excel files, databases, and saving / loading data from the ultrafast **HDF5
+    format**
+  - **Time series**-specific functionality: date range generation and frequency
+    conversion, moving window statistics, moving window linear regressions,
+    date shifting and lagging, etc.
+
+Many of these principles are here to address the shortcomings frequently
+experienced using other languages / scientific research environments. For data
+scientists, working with data is typically divided into multiple stages:
+munging and cleaning data, analyzing / modeling it, then organizing the results
+of the analysis into a form suitable for plotting or tabular display. pandas
+is the ideal tool for all of these tasks.
+
+Some other notes
+
+ - pandas is **fast**. Many of the low-level algorithmic bits have been
+   extensively tweaked in `Cython `__ code. However, as with
+   anything else generalization usually sacrifices performance. So if you focus
+   on one feature for your application you may be able to create a faster
+   specialized tool.
+
+ - pandas is a dependency of `statsmodels
+   `__, making it an important part of the
+   statistical computing ecosystem in Python.
+
+ - pandas has been used extensively in production in financial applications.
 
 Data Structures
 ---------------
diff --git a/doc/source/index.rst.template b/doc/source/index.rst.template
index bc420a906b59c..ab51911a610e3 100644
--- a/doc/source/index.rst.template
+++ b/doc/source/index.rst.template
@@ -22,93 +22,15 @@ pandas: powerful Python data analysis toolkit
 
 **Developer Mailing List:** https://groups.google.com/forum/#!forum/pydata
 
-**pandas** is a `Python `__ package providing fast,
-flexible, and expressive data structures designed to make working with
-"relational" or "labeled" data both easy and intuitive. It aims to be the
-fundamental high-level building block for doing practical, **real world** data
-analysis in Python. Additionally, it has the broader goal of becoming **the
-most powerful and flexible open source data analysis / manipulation tool
-available in any language**. It is already well on its way toward this goal.
-
-pandas is well suited for many different kinds of data:
-
-  - Tabular data with heterogeneously-typed columns, as in an SQL table or
-    Excel spreadsheet
-  - Ordered and unordered (not necessarily fixed-frequency) time series data.
-  - Arbitrary matrix data (homogeneously typed or heterogeneous) with row and
-    column labels
-  - Any other form of observational / statistical data sets. The data actually
-    need not be labeled at all to be placed into a pandas data structure
-
-The two primary data structures of pandas, :class:`Series` (1-dimensional)
-and :class:`DataFrame` (2-dimensional), handle the vast majority of typical use
-cases in finance, statistics, social science, and many areas of
-engineering. For R users, :class:`DataFrame` provides everything that R's
-``data.frame`` provides and much more. pandas is built on top of `NumPy
-`__ and is intended to integrate well within a scientific
-computing environment with many other 3rd party libraries.
-
-Here are just a few of the things that pandas does well:
-
-  - Easy handling of **missing data** (represented as NaN) in floating point as
-    well as non-floating point data
-  - Size mutability: columns can be **inserted and deleted** from DataFrame and
-    higher dimensional objects
-  - Automatic and explicit **data alignment**: objects can be explicitly
-    aligned to a set of labels, or the user can simply ignore the labels and
-    let `Series`, `DataFrame`, etc. automatically align the data for you in
-    computations
-  - Powerful, flexible **group by** functionality to perform
-    split-apply-combine operations on data sets, for both aggregating and
-    transforming data
-  - Make it **easy to convert** ragged, differently-indexed data in other
-    Python and NumPy data structures into DataFrame objects
-  - Intelligent label-based **slicing**, **fancy indexing**, and **subsetting**
-    of large data sets
-  - Intuitive **merging** and **joining** data sets
-  - Flexible **reshaping** and pivoting of data sets
-  - **Hierarchical** labeling of axes (possible to have multiple labels per
-    tick)
-  - Robust IO tools for loading data from **flat files** (CSV and delimited),
-    Excel files, databases, and saving / loading data from the ultrafast **HDF5
-    format**
-  - **Time series**-specific functionality: date range generation and frequency
-    conversion, moving window statistics, moving window linear regressions,
-    date shifting and lagging, etc.
-
-Many of these principles are here to address the shortcomings frequently
-experienced using other languages / scientific research environments. For data
-scientists, working with data is typically divided into multiple stages:
-munging and cleaning data, analyzing / modeling it, then organizing the results
-of the analysis into a form suitable for plotting or tabular display. pandas
-is the ideal tool for all of these tasks.
-
-Some other notes
-
- - pandas is **fast**. Many of the low-level algorithmic bits have been
-   extensively tweaked in `Cython `__ code. However, as with
-   anything else generalization usually sacrifices performance. So if you focus
-   on one feature for your application you may be able to create a faster
-   specialized tool.
-
- - pandas is a dependency of `statsmodels
-   `__, making it an important part of the
-   statistical computing ecosystem in Python.
-
- - pandas has been used extensively in production in financial applications.
-
-.. note::
-
-   This documentation assumes general familiarity with NumPy. If you haven't
-   used NumPy much or at all, do invest some time in `learning about NumPy
-   `__ first.
-
-See the package overview for more detail about what's in the library.
+:mod:`pandas` is an open source, BSD-licensed library providing high-performance,
+easy-to-use data structures and data analysis tools for the `Python `__
+programming language.
 
+See the :ref:`overview` for more detail about what's in the library.
 
 {% if single_doc and single_doc.endswith('.rst') -%}
 .. toctree::
-    :maxdepth: 4
+    :maxdepth: 2
 
     {{ single_doc[:-4] }}
 {% elif single_doc %}
@@ -118,21 +40,15 @@ See the package overview for more detail about what's in the library.
     {{ single_doc }}
 {% else -%}
 .. toctree::
-    :maxdepth: 4
+    :maxdepth: 2
 {% endif %}
 
 {% if not single_doc -%}
     What's New
     install
     getting_started/index
-    cookbook
     user_guide/index
-    r_interface
     ecosystem
-    comparison_with_r
-    comparison_with_sql
-    comparison_with_sas
-    comparison_with_stata
 {% endif -%}
 {% if include_api -%}
     api/index
diff --git a/doc/source/r_interface.rst b/doc/source/r_interface.rst
deleted file mode 100644
index 9839bba4884d4..0000000000000
--- a/doc/source/r_interface.rst
+++ /dev/null
@@ -1,94 +0,0 @@
-.. _rpy:
-
-{{ header }}
-
-******************
-rpy2 / R interface
-******************
-
-.. warning::
-
-    Up to pandas 0.19, a ``pandas.rpy`` module existed with functionality to
-    convert between pandas and ``rpy2`` objects. This functionality now lives in
-    the `rpy2 `__ project itself.
-    See the `updating section `__
-    of the previous documentation for a guide to port your code from the
-    removed ``pandas.rpy`` to ``rpy2`` functions.
-
-
-`rpy2 `__ is an interface to R running embedded in a Python process, and also includes functionality to deal with pandas DataFrames.
-Converting data frames back and forth between rpy2 and pandas should be largely
-automated (no need to convert explicitly, it will be done on the fly in most
-rpy2 functions).
-To convert explicitly, the functions are ``pandas2ri.py2ri()`` and
-``pandas2ri.ri2py()``.
-
-
-See also the documentation of the `rpy2 `__ project: https://rpy2.readthedocs.io.
-
-In the remainder of this page, a few examples of explicit conversion is given. The pandas conversion of rpy2 needs first to be activated:
-
-.. ipython::
-   :verbatim:
-
-   In [1]: from rpy2.robjects import pandas2ri
-      ...: pandas2ri.activate()
-
-Transferring R data sets into Python
-------------------------------------
-
-Once the pandas conversion is activated (``pandas2ri.activate()``), many conversions
-of R to pandas objects will be done automatically. For example, to obtain the 'iris' dataset as a pandas DataFrame:
-
-.. ipython::
-   :verbatim:
-
-   In [2]: from rpy2.robjects import r
-
-   In [3]: r.data('iris')
-
-   In [4]: r['iris'].head()
-   Out[4]:
-      Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
-   0           5.1          3.5           1.4          0.2  setosa
-   1           4.9          3.0           1.4          0.2  setosa
-   2           4.7          3.2           1.3          0.2  setosa
-   3           4.6          3.1           1.5          0.2  setosa
-   4           5.0          3.6           1.4          0.2  setosa
-
-If the pandas conversion was not activated, the above could also be accomplished
-by explicitly converting it with the ``pandas2ri.ri2py`` function
-(``pandas2ri.ri2py(r['iris'])``).
-
-Converting DataFrames into R objects
-------------------------------------
-
-The ``pandas2ri.py2ri`` function support the reverse operation to convert
-DataFrames into the equivalent R object (that is, **data.frame**):
-
-.. ipython::
-   :verbatim:
-
-   In [5]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]},
-      ...:                   index=["one", "two", "three"])
-
-   In [6]: r_dataframe = pandas2ri.py2ri(df)
-
-   In [7]: print(type(r_dataframe))
-   Out[7]:
-
-   In [8]: print(r_dataframe)
-   Out[8]:
-         A B C
-   one   1 4 7
-   two   2 5 8
-   three 3 6 9
-
-
-The DataFrame's index is stored as the ``rownames`` attribute of the
-data.frame instance.
-
-
-..
-   Calling R functions with pandas objects
-   High-level interface to R estimators
diff --git a/doc/source/cookbook.rst b/doc/source/user_guide/cookbook.rst
similarity index 100%
rename from doc/source/cookbook.rst
rename to doc/source/user_guide/cookbook.rst
diff --git a/doc/source/user_guide/index.rst b/doc/source/user_guide/index.rst
index 60e722808d647..d39cf7103ab63 100644
--- a/doc/source/user_guide/index.rst
+++ b/doc/source/user_guide/index.rst
@@ -37,3 +37,4 @@ Further information on any specific method can be obtained in the
     enhancingperf
     sparse
     gotchas
+    cookbook
diff --git a/doc/source/user_guide/style.ipynb b/doc/source/user_guide/style.ipynb
index a238c3b16e9ad..79a9848704eec 100644
--- a/doc/source/user_guide/style.ipynb
+++ b/doc/source/user_guide/style.ipynb
@@ -1133,7 +1133,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "with open(\"template_structure.html\") as f:\n",
+    "with open(\"templates/template_structure.html\") as f:\n",
     "    structure = f.read()\n",
     " \n",
     "HTML(structure)"
diff --git a/doc/source/templates/myhtml.tpl b/doc/source/user_guide/templates/myhtml.tpl
similarity index 100%
rename from doc/source/templates/myhtml.tpl
rename to doc/source/user_guide/templates/myhtml.tpl
diff --git a/doc/source/template_structure.html b/doc/source/user_guide/templates/template_structure.html
similarity index 100%
rename from doc/source/template_structure.html
rename to doc/source/user_guide/templates/template_structure.html