Explore options for a different DataFrameTransformer interface #152
Hi @MrPowers! But the …

---
Yep, let me make the current, quinn, and "ideal" options more clear.

**current** - I think this code would work ;)

```python
transformer = DataFrameTransformer(source_df)
df1 = transformer.lower_case("*").get_data_frame
df2 = df1.withColumn("funny", lit("spongebob"))
transformer2 = DataFrameTransformer(df2)
transformer2.trim_col("address").get_data_frame
```

**using the quinn transform method** (if the DataFrameTransformer methods weren't in a class)

```python
source_df\
    .transform(lambda df: lower_case(df, "*"))\
    .withColumn("funny", lit("spongebob"))\
    .transform(lambda df: trim_col(df, "address"))
```

**using the "ideal" transform method** - I'm not sure this is even possible

```python
source_df\
    .transform(lower_case("*"))\
    .withColumn("funny", lit("spongebob"))\
    .transform(trim_col("address"))
```

---
Yes, I think this is a great idea. But I'm thinking maybe the only way is to extend the DataFrame class in PySpark so we can add the quinn transform method. Do you think this could work?

---
Yep, the quinn library extends the PySpark DataFrame class to add the transform method. I am going to ask some coworkers / StackOverflow if they know how to write the "elegant" transform method. I'll report back ;)
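
*For reference, a minimal sketch of the kind of monkey-patching quinn uses to add `transform` to PySpark's `DataFrame` (the real quinn code may differ in details):*

```python
from pyspark.sql import DataFrame

def transform(self, f):
    # f is any function that takes a DataFrame and returns a DataFrame,
    # so custom transformations chain alongside native DataFrame methods.
    return f(self)

# Attach the method to the DataFrame class itself.
DataFrame.transform = transform
```

---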
Great @MrPowers! Let me know, please! I think we should add more of quinn's functionality to Optimus!

---
@pirate showed me how to define PySpark custom transformations with inner functions, so they can be easily chained with a transform method. I wrote a blog post with more details. How about I do a spike to see if the DataFrameTransformer can be refactored, so users can do this:

```python
(source_df
    .transform(lower_case("*"))
    .withColumn("funny", lit("spongebob"))
    .transform(trim_col("address")))
```

Instead of this:

```python
transformer = DataFrameTransformer(source_df)
df1 = transformer.lower_case("*").get_data_frame
df2 = df1.withColumn("funny", lit("spongebob"))
transformer2 = DataFrameTransformer(df2)
transformer2.trim_col("address").get_data_frame
```

Thanks!
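
*A sketch of the inner-function pattern described above; `lower_case` here is illustrative, not necessarily Optimus's implementation, and is assumed to accept either `"*"` or a list of column names:*

```python
from pyspark.sql import functions as F

def lower_case(columns):
    # Returns a function DataFrame -> DataFrame, so it can be chained
    # via DataFrame.transform alongside native methods.
    def inner(df):
        cols = df.columns if columns == "*" else columns
        for c in cols:
            df = df.withColumn(c, F.lower(F.col(c)))
        return df
    return inner
```

---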
Hi @MrPowers! We've talked about it and it actually sounds great. Amazing that you've found a way. What is the impact on the way the transformer is written right now? Can you create a small snippet of a function written in the new style with the transform method so we can check it out, please? I do think this could be a major change to the code. So please go ahead and do your magic :) 👍

---
Hi @MrPowers, any progress on this issue?

---
Sorry for the delay on this one @FavioVazquez. I'll get you something today or tomorrow. Thanks for following up!

---
Great! We'll be waiting :) @MrPowers

---
@FavioVazquez - Check out this pull request and let me know what you think 😉 I think this new interface I've outlined in the pull request will also encourage Optimus to focus on the unique functionality that your project is bringing to the table. A lot of the current `DataFrameTransformer` methods duplicate what the native Spark functions already do. I'm optimistic about the future of this project and am happy to catch up on a call to discuss the next steps!

---
@FavioVazquez - Glad we're on the same page. Let me get this pull request in better shape, so it can get merged into master. In the short run, I think we can build out the new interface and keep the existing code. I think we'll be able to clean up the existing code by leveraging the native Spark functions a bit more. Let me know what you think 😄 I think we'll also want to build out some Optimus functions, similar to the PySpark functions. I think a good next step is to go through the existing `DataFrameTransformer` methods.

---
Great! Yes, that should be done. The thing here, @MrPowers, is that I hate some of the names of the Spark functions; they are not intuitive at all. So what could we do there? Question: what did you mean when you said "I think we'll also want to build out some Optimus functions, similar to the PySpark functions"? Another thing: I was planning on adding an annotation that marks a function as experimental, like the ones in Spark, but I'm not sure how they do it. Do you know anything about this?

---
@MrPowers I think this will be a major change, so I'm putting it in the plans for version 2.0. You can check the board and add more issues there. It will be a great way of letting us know the state of progress. Thanks!

---
@FavioVazquez - I went through all the current `DataFrameTransformer` methods:

```python
def df(self):
def show(self, n=10, truncate=True):
def lower_case(self, columns):
def upper_case(self, columns):
def impute_missing(self, columns, out_cols, strategy):
def replace_na(self, value, columns=None):
def check_point(self):
def trim_col(self, columns):
def drop_col(self, columns):
def replace_col(self, search, change_to, columns):
def delete_row(self, func):
def set_col(self, columns, func, data_type):
def keep_col(self, columns):
def clear_accents(self, columns):
def remove_special_chars(self, columns):
def remove_special_chars_regex(self, columns, regex):
def rename_col(self, columns):
def lookup(self, column, str_to_replace, list_str=None):
def move_col(self, column, ref_col, position):
def count_items(self, col_id, col_search, new_col_feature, search_string):
def date_transform(self, columns, current_format, output_format):
def age_calculate(self, column, dates_format, name_col_age):
def cast_func(self, cols_and_types):
def empty_str_to_str(self, columns, custom_str):
def operation_in_type(self, parameters):
def row_filter_by_type(self, column_name, type_to_delete):
def undo_vec_assembler(self, column, feature_names):
def scale_vec_col(self, columns, name_output_col):
def split_str_col(self, column, feature_names, mark):
def remove_empty_rows(self, how="all"):
def remove_duplicates(self, cols=None):
def write_df_as_json(self, path):
def to_csv(self, path_name, header="true", mode="overwrite", sep=",", *args, **kargs):
def string_to_index(self, input_cols):
def index_to_string(self, input_cols):
def one_hot_encoder(self, input_cols):
def sql(self, sql_expression):
def vector_assembler(self, input_cols):
def normalizer(self, input_cols, p=2.0):
def select(self, columns):
def select_idx(self, indices):
```
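
*For context on the "duplicates native Spark" point, a hypothetical comparison; the equivalence is assumed from the method name, not taken from the Optimus source:*

```python
from pyspark.sql import functions as F

# DataFrameTransformer style:
#   DataFrameTransformer(df).trim_col("address").get_data_frame
# presumably corresponds roughly to the native PySpark:
df = df.withColumn("address", F.trim(F.col("address")))
```

---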
Hey @MrPowers, thanks for this. I think now is decision time. Lots of the functions you mention removing are in Spark, that's true, but some of them only work column by column; the Spark functions don't allow changing multiple columns at once. Why not keep the same names as in Spark, but add this multi-column functionality? That's on top of all the assertions we make to help the user, which Spark doesn't. On the other hand, why are you not sure about the feature transformations we programmed? They're a pain in the ass for most users and allow only a single transformation at a time, on top of the fact that you have to use the … We may move all of this into the OptimusML library, but I'm voting for keeping them. I think this new interface will be a great step forward for Optimus, so I'm in, but I want to emphasize that most of what we are doing here can be done with Spark; it's just not easy or pretty, and that's what we want to give the user. Also, most of our users come from pandas or dplyr, so the plan was to make something like that. What are your thoughts on this? And thank you again :)

---
I just wrote a blog post on how to perform operations on multiple columns of a DataFrame with the Scala API. I am going to do some research and see how to run operations on multiple columns with PySpark. I think I'll be able to figure something out with `reduce`. I haven't used the Spark ML library much yet, but I think you're right that the methods you've coded up might be super useful for users. Making the ML methods easily accessible might turn out to be the secret sauce of Optimus 😉 For now, I'll research how to run operations on multiple columns with PySpark and will get back to you with what I find!

---
@FavioVazquez - Here's one way we can make it easy for users to apply the transformations to multiple columns:

```python
from functools import reduce

from pyspark.sql.functions import regexp_replace

def remove_chars(colName, chars):
    def inner(df):
        # build a regex matching any of the given chars, each escaped
        regexp = "|".join(r'\{0}'.format(i) for i in chars)
        return df.withColumn(colName, regexp_replace(colName, regexp, ""))
    return inner

def multi_remove_chars(colNames, chars):
    def inner(df):
        # fold remove_chars over the list of column names
        return reduce(
            lambda memo_df, col_name: remove_chars(col_name, chars)(memo_df),
            colNames,
            df
        )
    return inner
```

There's a test in this commit. I am going to ask some people I work with that are more experienced with Python what they think about this approach. If we go with this approach, I'm not sure if we should expose both `remove_chars` and `multi_remove_chars`.
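
*Hypothetical usage for illustration; `source_df`, the column names, and the characters are placeholders, and a `DataFrame.transform` method (e.g. quinn's) is assumed:*

```python
from pyspark.sql.functions import lit

df = (source_df
    .transform(remove_chars("address", ["!", "#"]))                  # one column
    .withColumn("funny", lit("spongebob"))
    .transform(multi_remove_chars(["city", "country"], ["~", "*"]))) # several columns
```

---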
@MrPowers I think all functions should work on multiple columns, and keep the name `remove_chars`.

---
@FavioVazquez - Yep, I agree with your feedback. Something like this might work to keep the code clean:

```python
from functools import reduce

from pyspark.sql.functions import regexp_replace

# private single-column helper
def __remove_chars(col_name, removed_chars):
    def inner(df):
        regexp = "|".join(r'\{0}'.format(i) for i in removed_chars)
        return df.withColumn(col_name, regexp_replace(col_name, regexp, ""))
    return inner

# public multi-column API
def remove_chars(col_names, removed_chars):
    def inner(df):
        return reduce(
            lambda memo_df, col_name: __remove_chars(col_name, removed_chars)(memo_df),
            col_names,
            df
        )
    return inner
```
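
*With this design the public API is multi-column only; a hypothetical call site, again assuming a `DataFrame.transform` method:*

```python
# remove "!" and "*" from both columns in one chained call
clean_df = source_df.transform(remove_chars(["address", "city"], ["!", "*"]))
```

---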
Yes, I like that. @MrPowers, finish cleaning the PR up so I can merge it, and add the documentation, assertions, and more functions. Thanks :)

---
@FavioVazquez - I wrote the blog post on performing operations on multiple columns in a PySpark DataFrame. Let me know what you think! I think we should write private functions that work on a single column and then expose functions that work on multiple columns as the public API. I am traveling to a remote part of Colombia and will be offline for the next few days. I will pick this back up next week 😄

---
I have been working on simplifying how you can work with Optimus. After some experimentation with class hierarchies and decorators, it seems that the decorator option is more flexible and was the only way I could implement chaining.

```python
from functools import wraps  # this convenience func preserves name and docstring

from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# decorator to attach a custom function to a class
def add_method(cls):
    def decorator(func):
        @wraps(func)
        def wrapper(self, *args, **kwargs):
            return func(self, *args, **kwargs)
        setattr(cls, func.__name__, wrapper)
        # Note we are not binding func, but wrapper, which accepts self
        # but does exactly the same as func
        return func  # returning func means func can still be used normally
    return decorator

@add_method(DataFrame)
def lower(self, columns):
    for column in columns:
        self = self.withColumn(column, F.lower(col(column)))
    return self

@add_method(DataFrame)
def upper(self, columns):
    for column in columns:
        self = self.withColumn(column, F.upper(col(column)))
    return self

@add_method(DataFrame)
def reverse(self, columns):
    for column in columns:
        self = self.withColumn(column, F.reverse(col(column)))
    return self

schema = StructType([
    StructField("city", StringType(), True),
    StructField("country", StringType(), True),
    StructField("population", IntegerType(), True)])

countries = ['Colombia', 'US@A', 'Brazil', 'Spain']
cities = ['Bogotá', 'New York', ' São Paulo ', '~Madrid']
population = [37800000, 19795791, 12341418, 6489162]

# Create the DataFrame (assumes an active SparkSession named spark)
df = spark.createDataFrame(list(zip(cities, countries, population)), schema=schema)

# Some operations on multiple columns
r = (df.lower(["city", "country"])
     .withColumn("city", F.upper(col("city")))
     .reverse(["city"])
     .reverse(["city", "country"]))
r.show()
```

@MrPowers I was reading your article about processing multiple columns, but I cannot figure out how to use an implementation like this with chaining:

```python
def multi_remove_some_chars(col_names):
    def inner(df):
        for col_name in col_names:
            df = df.withColumn(
                col_name,
                remove_some_chars(col_name)  # remove_some_chars is from the blog post
            )
        return df
    return inner
```

Any thoughts about this?
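
*One possible answer (a sketch, not existing Optimus code): reuse the `add_method` decorator above to attach a generic `transform` method, after which inner-function transformations chain alongside the decorated methods:*

```python
@add_method(DataFrame)
def transform(self, f):
    # f is a function DataFrame -> DataFrame, e.g. multi_remove_some_chars([...])
    return f(self)

# e.g.: df.transform(multi_remove_some_chars(["city", "country"])).lower(["country"])
```

---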
This is a great idea @argenisleon, I think we should explore this option too. I created the PR for the second version in #217. It follows some of the things @MrPowers started. @argenisleon, check the reduce function there. The chaining part is very easy with the transformer that @MrPowers created in quinn, and now we should think about how to do it here.

---
I'm not sure how much we'll want to explore this option. Just want to introduce a design pattern that works well with the Scala API of Spark.

The Spark Scala API has a nifty `transform` method that lets users chain user-defined transformations and methods defined in the Dataset class. See this blog post for more information.

I like the `DataFrameTransformer` class, but it doesn't let users easily access the native PySpark `DataFrame` methods. We might want to take these methods out of the `DataFrameTransformer` class, so the user can mix and match the Optimus API and the PySpark API.

The `transform` method is defined in quinn. I'd love to make an interface like this, but I'm not sure how to implement it in Python.

Let me know what you think!