Conversion of DataFrame to R's data.frame #350

Closed · lbeltrame opened this issue Nov 8, 2011 · 9 comments

@lbeltrame (Contributor)

Although I have already produced code for this (see below), I'm posting this as an issue rather than a pull request to discuss the design, because there are some open issues in my code:

  • Series of dtype object need an explicit cast or rpy2's numpy conversion will treat them improperly
  • The performance has not been profiled
  • Probably some room for optimizations
  • Proper name for the function
  • The generation of an intermediate OrdDict object may cause problems in case of very large datasets

The code in the current form is posted below. If there is interest, I will work towards integrating it in pandas.rpy.common and add unit tests.

import numpy as np
import rpy2.robjects as robjects
import rpy2.robjects.numpy2ri as numpy2ri
from rpy2.robjects.packages import importr
import rpy2.rlike.container as rlc

def dataset_to_data_frame(dataset, strings_as_factors=True):

    # Activate conversion for numpy objects
    robjects.conversion.py2ri = numpy2ri.numpy2ri
    robjects.numpy2ri.activate()

    base = importr("base")
    columns = rlc.OrdDict()

    for column in dataset:
        value = dataset[column]

        # object type requires explicit cast
        if value.dtype == np.object:
            value = robjects.StrVector(value)
            # FIXME: how to generalize this?
            if not strings_as_factors:
                value = base.I(value)

        columns[column] = value

    dataframe = robjects.DataFrame(columns)
    dataframe.rownames = robjects.StrVector(dataset.index)

    # To prevent side-effects in other code
    robjects.conversion.py2ri = robjects.default_py2ri

    return dataframe
@lbeltrame (Contributor, Author)

I've been experimenting with alternative solutions, but so far most of them, save for the dict intermediate, are horribly slow.

@lbeltrame (Contributor, Author)

I think I might get some code to merge in for 0.8.0, but I'd need to adapt my unit tests (which use unittest) to pandas. Is there any documentation on how unit tests are handled in pandas, or any guidelines to follow?
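
For reference, a rough sketch of what a unittest-style test for the converter might look like (the import location of dataset_to_data_frame is an assumption based on the plan to put it in pandas.rpy.common; pandas' actual test layout may differ):

import unittest

import numpy as np
from pandas import DataFrame
# Assumed location, per the plan above to integrate into pandas.rpy.common
from pandas.rpy.common import dataset_to_data_frame

class TestDataFrameConversion(unittest.TestCase):

    def test_shape_and_names(self):
        df = DataFrame({"a": [1.0, 2.0, np.nan], "b": ["x", "y", "z"]},
                       index=["r1", "r2", "r3"])
        r_df = dataset_to_data_frame(df)

        # Same dimensions and column names on the R side
        self.assertEqual(r_df.nrow, 3)
        self.assertEqual(r_df.ncol, 2)
        self.assertEqual(list(r_df.colnames), ["a", "b"])

if __name__ == "__main__":
    unittest.main()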

@lbeltrame (Contributor, Author)

Progress: here is the current implementation (not yet integrated into my pandas clone, as I have no idea how to handle things like MultiIndex, which I don't use in my normal workflow). The main advantage is that NaNs are now translated to R's NA:

import numpy as np
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
import rpy2.rlike.container as rlc

def dataset_to_data_frame(dataset, strings_as_factors=True):

    base = importr("base")
    columns = rlc.OrdDict()

    # Type casting is more efficient than rpy2's own numpy2ri

    vectors = {np.float64: robjects.FloatVector,
               np.float32: robjects.FloatVector,
               np.float: robjects.FloatVector,
               np.int: robjects.IntVector,
               np.object_: robjects.StrVector,
               np.str: robjects.StrVector}

    for column in dataset:
        value = dataset[column]
        value = vectors[value.dtype.type](value)

        # These SHOULD be fast as they use vector operations

        if isinstance(value, robjects.StrVector):
            value.rx[value.ro == "nan"] = robjects.NA_Character
        else:
            value.rx[base.is_nan(value)] = robjects.NA_Logical

        if not strings_as_factors:
            value = base.I(value)

        columns[column] = value

    dataframe = robjects.DataFrame(columns)

    del columns

    dataframe.rownames = robjects.StrVector(dataset.index)

    return dataframe

@nspies (Contributor) commented Mar 26, 2012

I'm an rpy2 user. This is what I'm using to go between pandas and rpy2:

import numpy as np
import pandas
import rpy2.robjects as robj
import rpy2.rlike.container as rlc

def pandas_data_frame_to_rpy2_data_frame(pDataframe):
    orderedDict = rlc.OrdDict()

    for columnName in pDataframe:
        columnValues = pDataframe[columnName].values
        filteredValues = [value if pandas.notnull(value) else robj.NA_Real 
                          for value in columnValues]

        try:
            orderedDict[columnName] = robj.FloatVector(filteredValues)
        except ValueError:
            orderedDict[columnName] = robj.StrVector(filteredValues)

    rDataFrame = robj.DataFrame(orderedDict)
    rDataFrame.rownames = robj.StrVector(pDataframe.index)

    return rDataFrame

This:

  • avoids using importr("base") which is horrendously slow
  • uses the pandas definition of what is and isn't missing data; this may be slower but I'm not sure it will be (I have large but not enormous datasets)
  • coerces to a FloatVector unless it can't; if memory usage is an issue, one could try converting to an IntVector first (see the sketch at the end of this comment)
  • the function name makes explicit the direction of the conversion; the current pandas.rpy.common module is pretty confusing as convert_robj() could be conversion to or from robj

I'm not sure whether the call to robj.r.I() is sometimes necessary; I've omitted it since I almost never use StrVectors in data frames.
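
On the memory point above, a rough sketch (not from the thread) of how trying an IntVector first could look; the helper name is made up, and it reuses the notnull-filtered list built in the snippet above:

import numpy as np
import rpy2.robjects as robj

def column_to_r_vector(column, filteredValues):
    # Hypothetical helper: pick the narrowest R vector type per column.
    # Integer columns in pandas cannot hold NaN, so they can go straight
    # into an IntVector without any missing-value substitution.
    if np.issubdtype(column.dtype, np.integer):
        return robj.IntVector(column.values)
    try:
        return robj.FloatVector(filteredValues)
    except ValueError:
        return robj.StrVector(filteredValues)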

@lbeltrame (Contributor, Author)

The call to importr can be replaced with the much faster:

I = robjects.baseenv.get("I")
is_nan = robjects.baseenv.get("is.nan")

but yes, it is necessary if you deal with, e.g., "omics" data where you have primary identifiers (the index) and a series of non-float columns (annotation) alongside measurements (floats). If strings are handled as factors and you later convert them back to Python objects, you will get (unless you're careful) a list of ints rather than of strings.
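
As a quick illustration of that pitfall (a hedged example, not from the thread): in rpy2 an R factor iterates as its integer level codes, so a naive round-trip loses the original strings:

import rpy2.robjects as robjects

# A character vector turned into a factor on the R side...
fac = robjects.r('factor(c("tumor", "normal", "tumor"))')

# ...comes back as integer level codes, not the original strings.
print(list(fac))         # [2, 1, 2]  (levels sorted: "normal" = 1, "tumor" = 2)
print(list(fac.levels))  # ['normal', 'tumor']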

Also "pandas.notnull" doesn't work with strings (and R has a NA character type, again useful for annotations).

Of course (hopefully!) this will become much simpler once numpy adopts a missing data type.

@nspies (Contributor) commented Mar 27, 2012

Okay, fair enough on the need to use robj.r.I() (which I'd prefer over robj.baseenv.get("I")).

I think it's a mistake to use R's definition of missing data when converting from pandas, though -- since pandas.notnull doesn't recognize "nan" as null, you probably shouldn't be using that as a null value in pandas (or since you apparently are, you should convert it to something that pandas understands to be null, such as None or numpy.nan).
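
For instance (a small hedged illustration, assuming the missing values really are the literal string "nan" in an object column):

import numpy as np
import pandas

# Replace literal "nan" strings with a real missing value that
# pandas.notnull/isnull will recognize.
series = pandas.Series(["1.5", "nan", "2.0"])
series = series.replace("nan", np.nan)

print(pandas.notnull(series).tolist())  # [True, False, True]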

Finally, I'd definitely move the precomputed values (I and vectors) out of the function, both for speed and readability.

@lbeltrame (Contributor, Author)

After reviewing my own code that uses this, I noticed that it's there mostly for "historical" reasons: it can probably be replaced by pandas' own notnull.

@lbeltrame (Contributor, Author)

import numpy as np
from pandas import notnull
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
import rpy2.rlike.container as rlc

I = robjects.baseenv.get("I")
VECTOR_TYPES = {np.float64: robjects.FloatVector,
                np.float32: robjects.FloatVector,
                np.float: robjects.FloatVector,
                np.int: robjects.IntVector,
                np.object_: robjects.StrVector,
                np.str: robjects.StrVector}

def dataset_to_data_frame(dataset, strings_as_factors=True):

    columns = rlc.OrdDict()

    for column in dataset:
        values = dataset[column]
        value_type = values.dtype.type
        values = [item if notnull(item) else robjects.NA_Logical for item in values]
        values = VECTOR_TYPES[value_type](values)

        if not strings_as_factors:
            values = I(values)

        columns[column] = values

    dataframe = robjects.DataFrame(columns)

    del columns

    dataframe.rownames = robjects.StrVector(dataset.index)

    return dataframe

Here's another version. I should try to port my own unit tests for this to pandas....

@lbeltrame (Contributor, Author)

Since I don't use things like MultiIndex: how do those DataFrames get converted by convert_robj?
