Conversion of DataFrame to R's data.frame #350

Closed · lbeltrame opened this issue Nov 8, 2011 · 9 comments

@lbeltrame (Contributor)

Although I have already produced code for this (see below), I'm posting this as an issue rather than a pull request to discuss the design, because there are some open issues in my code:

  • Series of dtype object need an explicit cast or rpy2's numpy conversion will treat them improperly
  • The performance has not been profiled
  • Probably some room for optimizations
  • Proper name for the function
  • The generation of an intermediate OrdDict object may cause problems in case of very large datasets

The code in the current form is posted below. If there is interest, I will work towards integrating it in pandas.rpy.common and add unit tests.

import numpy as np
import rpy2.robjects as robjects
import rpy2.robjects.numpy2ri as numpy2ri
from rpy2.robjects.packages import importr
import rpy2.rlike.container as rlc

def dataset_to_data_frame(dataset, strings_as_factors=True):

    # Activate conversion for numpy objects
    robjects.conversion.py2ri = numpy2ri.numpy2ri
    robjects.numpy2ri.activate()

    base = importr("base")
    columns = rlc.OrdDict()

    for column in dataset:
        value = dataset[column]

        # object type requires explicit cast
        if value.dtype == np.object:
            value = robjects.StrVector(value)
            # FIXME: how to generalize this?
            if not strings_as_factors:
                value = base.I(value)

        columns[column] = value

    dataframe = robjects.DataFrame(columns)
    dataframe.rownames = robjects.StrVector(dataset.index)

    # To prevent side-effects in other code
    robjects.conversion.py2ri = robjects.default_py2ri

    return dataframe
@lbeltrame (Contributor, Author)

I've been experimenting with alternative solutions, but so far most of them, save for the dict intermediate, are horribly slow.

@lbeltrame (Contributor, Author)

I think I might get some code to merge in for 0.8.0, but I'd need to adapt my unit tests (which use unittest) to pandas. Is there any documentation on how unit tests are handled in pandas, or any guidelines to follow?
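
For reference, a rough sketch of what a unittest-style test for the converter might look like (the import location of dataset_to_data_frame is an assumption based on the plan to put it in pandas.rpy.common; pandas' actual test layout may differ):

import unittest

import numpy as np
from pandas import DataFrame
# Assumed location, per the plan above to integrate into pandas.rpy.common
from pandas.rpy.common import dataset_to_data_frame

class TestDataFrameConversion(unittest.TestCase):

    def test_shape_and_names(self):
        df = DataFrame({"a": [1.0, 2.0, np.nan], "b": ["x", "y", "z"]},
                       index=["r1", "r2", "r3"])
        r_df = dataset_to_data_frame(df)

        # Same dimensions and column names on the R side
        self.assertEqual(r_df.nrow, 3)
        self.assertEqual(r_df.ncol, 2)
        self.assertEqual(list(r_df.colnames), ["a", "b"])

if __name__ == "__main__":
    unittest.main()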

@lbeltrame (Contributor, Author)

Progress: here is the current implementation (not yet integrated into my pandas clone, as I have no idea how to handle things like MultiIndex, which I don't use in my normal workflow). The main advantage is that NaNs are now translated to R's NA:

import numpy as np
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
import rpy2.rlike.container as rlc

def dataset_to_data_frame(dataset, strings_as_factors=True):

    base = importr("base")
    columns = rlc.OrdDict()

    # Type casting is more efficient than rpy2's own numpy2ri

    vectors = {np.float64: robjects.FloatVector,
               np.float32: robjects.FloatVector,
               np.float: robjects.FloatVector,
               np.int: robjects.IntVector,
               np.object_: robjects.StrVector,
               np.str: robjects.StrVector}

    for column in dataset:
        value = dataset[column]
        value = vectors[value.dtype.type](value)

        # These SHOULD be fast as they use vector operations

        if isinstance(value, robjects.StrVector):
            value.rx[value.ro == "nan"] = robjects.NA_Character
        else:
            value.rx[base.is_nan(value)] = robjects.NA_Logical

        if not strings_as_factors:
            value = base.I(value)

        columns[column] = value

    dataframe = robjects.DataFrame(columns)

    del columns

    dataframe.rownames = robjects.StrVector(dataset.index)

    return dataframe

@nspies (Contributor) commented Mar 26, 2012

I'm an rpy2 user. This is what I'm using to go between pandas and rpy2:

import numpy as np
import pandas
import rpy2.robjects as robj
import rpy2.rlike.container as rlc

def pandas_data_frame_to_rpy2_data_frame(pDataframe):
    orderedDict = rlc.OrdDict()

    for columnName in pDataframe:
        columnValues = pDataframe[columnName].values
        filteredValues = [value if pandas.notnull(value) else robj.NA_Real 
                          for value in columnValues]

        try:
            orderedDict[columnName] = robj.FloatVector(filteredValues)
        except ValueError:
            orderedDict[columnName] = robj.StrVector(filteredValues)

    rDataFrame = robj.DataFrame(orderedDict)
    rDataFrame.rownames = robj.StrVector(pDataframe.index)

    return rDataFrame

This:

  • avoids using importr("base") which is horrendously slow
  • uses the pandas definition of what is and isn't missing data; this may be slower but I'm not sure it will be (I have large but not enormous datasets)
  • coerces to a FloatVector unless it can't; if memory usage is an issue, one could try converting to an IntVector first (see the sketch at the end of this comment)
  • the function name makes explicit the direction of the conversion; the current pandas.rpy.common module is pretty confusing as convert_robj() could be conversion to or from robj

I'm not sure whether the call to robj.r.I() is sometimes necessary; I've omitted it since I almost never use StrVectors in data frames.
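
On the memory point above, a rough sketch (not from the thread) of how trying an IntVector first could look; the helper name is made up, and it reuses the notnull-filtered list built in the snippet above:

import numpy as np
import rpy2.robjects as robj

def column_to_r_vector(column, filteredValues):
    # Hypothetical helper: pick the narrowest R vector type per column.
    # Integer columns in pandas cannot hold NaN, so they can go straight
    # into an IntVector without any missing-value substitution.
    if np.issubdtype(column.dtype, np.integer):
        return robj.IntVector(column.values)
    try:
        return robj.FloatVector(filteredValues)
    except ValueError:
        return robj.StrVector(filteredValues)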

@lbeltrame (Contributor, Author)

The call to importr can be replaced with the much faster:

I = robjects.baseenv.get("I")
is_nan = robjects.baseenv.get("is.nan")

but yes, it is necessary if you deal with, e.g., "omics" data where you have primary identifiers (the index) and a series of non-float columns (annotation) alongside measurements (floats). If strings are handled as factors and you later convert them back to Python objects, you will get (unless you're careful) a list of ints rather than of strings.
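
As a quick illustration of that pitfall (a hedged example, not from the thread): in rpy2 an R factor iterates as its integer level codes, so a naive round-trip loses the original strings:

import rpy2.robjects as robjects

# A character vector turned into a factor on the R side...
fac = robjects.r('factor(c("tumor", "normal", "tumor"))')

# ...comes back as integer level codes, not the original strings.
print(list(fac))         # [2, 1, 2]  (levels sorted: "normal" = 1, "tumor" = 2)
print(list(fac.levels))  # ['normal', 'tumor']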

Also "pandas.notnull" doesn't work with strings (and R has a NA character type, again useful for annotations).

Of course (hopefully!) this will become much simpler once numpy adopts a missing data type.

@nspies (Contributor) commented Mar 27, 2012

Okay, fair enough on the need to use robj.r.I() (which I'd prefer over robj.baseenv.get("I")).

I think it's a mistake to use R's definition of missing data when converting from pandas, though -- since pandas.notnull doesn't recognize "nan" as null, you probably shouldn't be using that as a null value in pandas (or since you apparently are, you should convert it to something that pandas understands to be null, such as None or numpy.nan).
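
For instance (a small hedged illustration, assuming the missing values really are the literal string "nan" in an object column):

import numpy as np
import pandas

# Replace literal "nan" strings with a real missing value that
# pandas.notnull/isnull will recognize.
series = pandas.Series(["1.5", "nan", "2.0"])
series = series.replace("nan", np.nan)

print(pandas.notnull(series).tolist())  # [True, False, True]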

Finally, I'd definitely move the precomputed values (I and vectors) out of the function, both for speed and readability.

@lbeltrame (Contributor, Author)

After reviewing my own code that uses this, I noticed that it's there mostly for "historical" reasons: it can probably be replaced by pandas' own notnull.

@lbeltrame (Contributor, Author)

import numpy as np
from pandas import notnull
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
import rpy2.rlike.container as rlc

I = robjects.baseenv.get("I")
VECTOR_TYPES = {np.float64: robjects.FloatVector,
                np.float32: robjects.FloatVector,
                np.float: robjects.FloatVector,
                np.int: robjects.IntVector,
                np.object_: robjects.StrVector,
                np.str: robjects.StrVector}

def dataset_to_data_frame(dataset, strings_as_factors=True):

    columns = rlc.OrdDict()

    for column in dataset:
        values = dataset[column]
        value_type = values.dtype.type
        values = [item if notnull(item) else robjects.NA_Logical for item in values]
        values = VECTOR_TYPES[value_type](values)

        if not strings_as_factors:
            values = I(values)

        columns[column] = values

    dataframe = robjects.DataFrame(columns)

    del columns

    dataframe.rownames = robjects.StrVector(dataset.index)

    return dataframe

Here's another version. I should try to port my own unit tests for this to pandas....

@lbeltrame (Contributor, Author)

Since I don't use things like MultiIndex: how do those DataFrames get converted by convert_robj?
