Explore options for a different DataFrameTransformer interface #152
Hi @MrPowers! But the …

---
Yep, let me make the current, quinn, and "ideal" options more clear.

**current** - I think this code would work ;)

```python
transformer = DataFrameTransformer(source_df)
df1 = transformer.lower_case("*").get_data_frame
df2 = df1.withColumn("funny", lit("spongebob"))
transformer2 = DataFrameTransformer(df2)
transformer2.trim_col("address").get_data_frame
```

**using the quinn transform method** (if the DataFrameTransformer methods weren't in a class)

```python
source_df\
    .transform(lambda df: lower_case(df, "*"))\
    .withColumn("funny", lit("spongebob"))\
    .transform(lambda df: trim_col(df, "address"))
```

**using the "ideal" transform method** - I'm not sure this is even possible

```python
source_df\
    .transform(lower_case("*"))\
    .withColumn("funny", lit("spongebob"))\
    .transform(trim_col("address"))
```

---
Yes, I think this is a great idea. But I'm thinking maybe the only way is to extend the DataFrame class in PySpark so we can add the quinn transform method. Do you think this could work?

---
Yep, the quinn library extends the PySpark DataFrame class to add the transform method. I am going to ask some coworkers / StackOverflow if they know how to write the "elegant" transform method. I'll report back ;)
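
*For reference, a minimal sketch of the kind of monkey-patching quinn uses to add `transform` to PySpark's `DataFrame` (the real quinn code may differ in details):*

```python
from pyspark.sql import DataFrame

def transform(self, f):
    # f is any function that takes a DataFrame and returns a DataFrame,
    # so custom transformations chain alongside native DataFrame methods.
    return f(self)

# Attach the method to the DataFrame class itself.
DataFrame.transform = transform
```

---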
Great @MrPowers! Let me know, please! I think we should add more of quinn's functionality to Optimus!

---
@pirate showed me how to define PySpark custom transformations with inner functions, so they can be easily chained with a transform method. I wrote a blog post with more details. How about I do a spike to see if the DataFrameTransformer can be refactored, so users can do this:

```python
(source_df
    .transform(lower_case("*"))
    .withColumn("funny", lit("spongebob"))
    .transform(trim_col("address")))
```

Instead of this:

```python
transformer = DataFrameTransformer(source_df)
df1 = transformer.lower_case("*").get_data_frame
df2 = df1.withColumn("funny", lit("spongebob"))
transformer2 = DataFrameTransformer(df2)
transformer2.trim_col("address").get_data_frame
```

Thanks!
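
*A sketch of the inner-function pattern described above; `lower_case` here is illustrative, not necessarily Optimus's implementation, and is assumed to accept either `"*"` or a list of column names:*

```python
from pyspark.sql import functions as F

def lower_case(columns):
    # Returns a function DataFrame -> DataFrame, so it can be chained
    # via DataFrame.transform alongside native methods.
    def inner(df):
        cols = df.columns if columns == "*" else columns
        for c in cols:
            df = df.withColumn(c, F.lower(F.col(c)))
        return df
    return inner
```

---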
Hi @MrPowers! We've talked about it and it actually sounds great. Amazing that you've found a way. What is the impact on the way the transformer is written right now? Can you create a small snippet of a function written in the new style with the transform method so we can check it out, please? I do think this could be a major change to the code. So please go ahead and do your magic :) 👍

---
Hi @MrPowers, any progress on this issue?

---
Sorry for the delay on this one @FavioVazquez. I'll get you something today or tomorrow. Thanks for following up!

---
Great! We'll be waiting :) @MrPowers

---
@FavioVazquez - Check out this pull request and let me know what you think 😉 I think this new interface I've outlined in the pull request will also encourage Optimus to focus on the unique functionality that your project is bringing to the table. A lot of the current `DataFrameTransformer` methods duplicate what the native Spark functions already do. I'm optimistic about the future of this project and am happy to catch up on a call to discuss the next steps!

---
@FavioVazquez - Glad we're on the same page. Let me get this pull request in better shape, so it can get merged into master. In the short run, I think we can build out the new interface and keep the existing code. I think we'll be able to clean up the existing code by leveraging the native Spark functions a bit more. Let me know what you think 😄 I think we'll also want to build out some Optimus functions, similar to the PySpark functions. I think a good next step is to go through the existing `DataFrameTransformer` methods.

---
Great! Yes, that should be done. The thing here, @MrPowers, is that I hate some of the names of the Spark functions; they are not intuitive at all. So what could we do there? Question: what did you mean when you said "I think we'll also want to build out some Optimus functions, similar to the PySpark functions"? Another thing: I was planning on adding an annotation that marks a function as experimental, like the ones in Spark, but I'm not sure how they do it. Do you know anything about this?

---
@MrPowers I think this will be a major change, so I'm putting it in the plans for version 2.0. You can check the board and add more issues there. It will be a great way of letting us know the state of progress. Thanks!

---
@FavioVazquez - I went through all the current `DataFrameTransformer` methods:

```python
def df(self):
def show(self, n=10, truncate=True):
def lower_case(self, columns):
def upper_case(self, columns):
def impute_missing(self, columns, out_cols, strategy):
def replace_na(self, value, columns=None):
def check_point(self):
def trim_col(self, columns):
def drop_col(self, columns):
def replace_col(self, search, change_to, columns):
def delete_row(self, func):
def set_col(self, columns, func, data_type):
def keep_col(self, columns):
def clear_accents(self, columns):
def remove_special_chars(self, columns):
def remove_special_chars_regex(self, columns, regex):
def rename_col(self, columns):
def lookup(self, column, str_to_replace, list_str=None):
def move_col(self, column, ref_col, position):
def count_items(self, col_id, col_search, new_col_feature, search_string):
def date_transform(self, columns, current_format, output_format):
def age_calculate(self, column, dates_format, name_col_age):
def cast_func(self, cols_and_types):
def empty_str_to_str(self, columns, custom_str):
def operation_in_type(self, parameters):
def row_filter_by_type(self, column_name, type_to_delete):
def undo_vec_assembler(self, column, feature_names):
def scale_vec_col(self, columns, name_output_col):
def split_str_col(self, column, feature_names, mark):
def remove_empty_rows(self, how="all"):
def remove_duplicates(self, cols=None):
def write_df_as_json(self, path):
def to_csv(self, path_name, header="true", mode="overwrite", sep=",", *args, **kargs):
def string_to_index(self, input_cols):
def index_to_string(self, input_cols):
def one_hot_encoder(self, input_cols):
def sql(self, sql_expression):
def vector_assembler(self, input_cols):
def normalizer(self, input_cols, p=2.0):
def select(self, columns):
def select_idx(self, indices):
```
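
*For context on the "duplicates native Spark" point, a hypothetical comparison; the equivalence is assumed from the method name, not taken from the Optimus source:*

```python
from pyspark.sql import functions as F

# DataFrameTransformer style:
#   DataFrameTransformer(df).trim_col("address").get_data_frame
# presumably corresponds roughly to the native PySpark:
df = df.withColumn("address", F.trim(F.col("address")))
```

---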
Hey @MrPowers, thanks for this. I think now is decision time. Lots of the functions you mention removing are in Spark, that's true, but some of them only work column by column; the Spark functions don't allow changing multiple columns at once. Why not keep the same names as in Spark, but add this multi-column functionality? That's on top of all the assertions we make to help the user, which Spark doesn't. On the other hand, why are you not sure about the feature transformations we programmed? They're a pain in the ass for most users and allow only a single transformation at a time, on top of the fact that you have to use the … We may move all of this into the OptimusML library, but I'm voting for keeping them. I think this new interface will be a great step forward for Optimus, so I'm in, but I want to emphasize that most of what we are doing here can be done with Spark; it's just not easy or pretty, and that's what we want to give the user. Also, most of our users come from pandas or dplyr, so the plan was to make something like that. What are your thoughts on this? And thank you again :)

---
I just wrote a blog post on how to perform operations on multiple columns of a DataFrame with the Scala API. I am going to do some research and see how to run operations on multiple columns with PySpark. I think I'll be able to figure something out with `reduce`. I haven't used the Spark ML library much yet, but I think you're right that the methods you've coded up might be super useful for users. Making the ML methods easily accessible might turn out to be the secret sauce of Optimus 😉 For now, I'll research how to run operations on multiple columns with PySpark and will get back to you with what I find!

---
@FavioVazquez - Here's one way we can make it easy for users to apply the transformations to multiple columns:

```python
from functools import reduce

from pyspark.sql.functions import regexp_replace

def remove_chars(colName, chars):
    def inner(df):
        # build a regex matching any of the given chars, each escaped
        regexp = "|".join(r'\{0}'.format(i) for i in chars)
        return df.withColumn(colName, regexp_replace(colName, regexp, ""))
    return inner

def multi_remove_chars(colNames, chars):
    def inner(df):
        # fold remove_chars over the list of column names
        return reduce(
            lambda memo_df, col_name: remove_chars(col_name, chars)(memo_df),
            colNames,
            df
        )
    return inner
```

There's a test in this commit. I am going to ask some people I work with that are more experienced with Python what they think about this approach. If we go with this approach, I'm not sure if we should expose both `remove_chars` and `multi_remove_chars`.
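
*Hypothetical usage for illustration; `source_df`, the column names, and the characters are placeholders, and a `DataFrame.transform` method (e.g. quinn's) is assumed:*

```python
from pyspark.sql.functions import lit

df = (source_df
    .transform(remove_chars("address", ["!", "#"]))                  # one column
    .withColumn("funny", lit("spongebob"))
    .transform(multi_remove_chars(["city", "country"], ["~", "*"]))) # several columns
```

---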
@MrPowers I think all functions should work on multiple columns, and keep the name `remove_chars`.

---
@FavioVazquez - Yep, I agree with your feedback. Something like this might work to keep the code clean:

```python
from functools import reduce

from pyspark.sql.functions import regexp_replace

# private single-column helper
def __remove_chars(col_name, removed_chars):
    def inner(df):
        regexp = "|".join(r'\{0}'.format(i) for i in removed_chars)
        return df.withColumn(col_name, regexp_replace(col_name, regexp, ""))
    return inner

# public multi-column API
def remove_chars(col_names, removed_chars):
    def inner(df):
        return reduce(
            lambda memo_df, col_name: __remove_chars(col_name, removed_chars)(memo_df),
            col_names,
            df
        )
    return inner
```
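
*With this design the public API is multi-column only; a hypothetical call site, again assuming a `DataFrame.transform` method:*

```python
# remove "!" and "*" from both columns in one chained call
clean_df = source_df.transform(remove_chars(["address", "city"], ["!", "*"]))
```

---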
Yes, I like that. @MrPowers, finish cleaning the PR up so I can merge it, and add the documentation, assertions, and more functions. Thanks :)

---
@FavioVazquez - I wrote the blog post on performing operations on multiple columns in a PySpark DataFrame. Let me know what you think! I think we should write private functions that work on a single column and then expose functions that work on multiple columns as the public API. I am traveling to a remote part of Colombia and will be offline for the next few days. I will pick this back up next week 😄

---
I have been working on simplifying how you can work with Optimus. After some experimentation with class hierarchies and decorators, it seems that the decorator option is more flexible and was the only way I could implement chaining.

```python
from functools import wraps  # this convenience func preserves name and docstring

from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# decorator to attach a custom function to a class
def add_method(cls):
    def decorator(func):
        @wraps(func)
        def wrapper(self, *args, **kwargs):
            return func(self, *args, **kwargs)
        setattr(cls, func.__name__, wrapper)
        # Note we are not binding func, but wrapper, which accepts self
        # but does exactly the same as func
        return func  # returning func means func can still be used normally
    return decorator

@add_method(DataFrame)
def lower(self, columns):
    for column in columns:
        self = self.withColumn(column, F.lower(col(column)))
    return self

@add_method(DataFrame)
def upper(self, columns):
    for column in columns:
        self = self.withColumn(column, F.upper(col(column)))
    return self

@add_method(DataFrame)
def reverse(self, columns):
    for column in columns:
        self = self.withColumn(column, F.reverse(col(column)))
    return self

schema = StructType([
    StructField("city", StringType(), True),
    StructField("country", StringType(), True),
    StructField("population", IntegerType(), True)])

countries = ['Colombia', 'US@A', 'Brazil', 'Spain']
cities = ['Bogotá', 'New York', ' São Paulo ', '~Madrid']
population = [37800000, 19795791, 12341418, 6489162]

# Create the DataFrame (assumes an active SparkSession named spark)
df = spark.createDataFrame(list(zip(cities, countries, population)), schema=schema)

# Some operations on multiple columns
r = (df.lower(["city", "country"])
     .withColumn("city", F.upper(col("city")))
     .reverse(["city"])
     .reverse(["city", "country"]))
r.show()
```

@MrPowers I was reading your article about processing multiple columns, but I cannot figure out how to use an implementation like this with chaining:

```python
def multi_remove_some_chars(col_names):
    def inner(df):
        for col_name in col_names:
            df = df.withColumn(
                col_name,
                remove_some_chars(col_name)  # remove_some_chars is from the blog post
            )
        return df
    return inner
```

Any thoughts about this?
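
*One possible answer (a sketch, not existing Optimus code): reuse the `add_method` decorator above to attach a generic `transform` method, after which inner-function transformations chain alongside the decorated methods:*

```python
@add_method(DataFrame)
def transform(self, f):
    # f is a function DataFrame -> DataFrame, e.g. multi_remove_some_chars([...])
    return f(self)

# e.g.: df.transform(multi_remove_some_chars(["city", "country"])).lower(["country"])
```

---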
This is a great idea @argenisleon, I think we should explore this option too. I created the PR for the second version in #217. It follows some of the things @MrPowers started. @argenisleon, check the reduce function there. The chaining part is very easy with the transformer that @MrPowers created in quinn, and now we should think about how to do it here.

---
I'm not sure how much we'll want to explore this option. Just want to introduce a design pattern that works well with the Scala API of Spark.

The Spark Scala API has a nifty `transform` method that lets users chain user-defined transformations and methods defined in the Dataset class. See this blog post for more information.

I like the `DataFrameTransformer` class, but it doesn't let users easily access the native PySpark `DataFrame` methods. We might want to take these methods out of the `DataFrameTransformer` class, so the user can mix and match the Optimus API and the PySpark API.

The `transform` method is defined in quinn. I'd love to make an interface like this, but I'm not sure how to implement it in Python.

Let me know what you think!