
Add delimiter option for reading CSV files for Feathr #307

Merged — 22 commits, merged Aug 16, 2022

Conversation

@ahlag (Contributor) commented May 30, 2022

Signed-off-by: Chang Yong Lik theeahlag@gmail.com

Description

  • Added delimiter options

Resolves #241

How was this patch tested?

Tested locally with sbt

sbt 'testOnly com.linkedin.feathr.offline.source.dataloader.TestCsvDataLoader'
sbt 'testOnly com.linkedin.feathr.offline.util.TestSourceUtils'
sbt 'testOnly com.linkedin.feathr.offline.source.dataloader.TestBatchDataLoader'
sbt 'testOnly com.linkedin.feathr.offline.source.dataloader.hdfs.TestFileFormat'
sbt 'testOnly com.linkedin.feathr.offline.source.dataloader.TestDataLoaderFactory'

Progress Tracker

  • Use the job configurations to implement this, i.e. add a setting called spark.feathr.inputFormat.csvOptions.sep that allows end users to pass the delimiter as an option
  • In the Scala code, searching for ss.read.format("csv").option("header", "true") will turn up a number of places that need to be modified. Eventually they will use the Spark CSV reader options (https://spark.apache.org/docs/3.2.0/sql-data-sources-csv.html).
  • The config can be read in the various places through something like: sqlContext.getConf("spark.feathr.inputFormat.csvOptions.sep", ",")
  • Add test cases
  • Also please update the job configuration docs (https://linkedin.github.io/feathr/how-to-guides/feathr-job-configuration.html) to make sure the options are clear to end users.
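Putting the tracker items together, a minimal sketch of the intended lookup. This is pure Scala, not the actual Feathr code: the `Map` parameter and `resolveDelimiter` name are stand-ins for `sqlContext.getConf(key, default)`.

```scala
// Hypothetical sketch: resolve the CSV delimiter from configuration,
// falling back to "," when the key is absent. The Map stands in for
// Spark's conf; in Feathr this would be sqlContext.getConf(key, default).
object CsvDelimiterSketch {
  val ConfKey = "spark.feathr.inputFormat.csvOptions.sep"

  def resolveDelimiter(conf: Map[String, String]): String =
    conf.getOrElse(ConfKey, ",")
}
```

The resolved value would then be handed to the reader, e.g. ss.read.format("csv").option("sep", delimiter).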

Does this PR introduce any user-facing changes?

Allows users to specify delimiters

Signed-off-by: Chang Yong Lik <theeahlag@gmail.com>
@ahlag ahlag force-pushed the feature/delimiter branch 2 times, most recently from 2b1c0a6 to ddb5643 Compare May 30, 2022 15:24
@ahlag ahlag force-pushed the feature/delimiter branch 2 times, most recently from c764d1e to 36fe0b3 Compare June 11, 2022 15:56
@xiaoyongzhu xiaoyongzhu added the safe to test Tag to execute build pipeline for a PR from forked repo label Jun 15, 2022
Signed-off-by: Chang Yong Lik <theeahlag@gmail.com>
(cherry picked from commit bc71fad93c08f6d06e40f7e289456c6a1b4d45e0)
ahlag and others added 4 commits June 23, 2022 16:27
Signed-off-by: Chang Yong Lik <theeahlag@gmail.com>
Signed-off-by: Chang Yong Lik <theeahlag@gmail.com>
Signed-off-by: Chang Yong Lik <theeahlag@gmail.com>
Signed-off-by: Chang Yong Lik <theeahlag@gmail.com>
Signed-off-by: Chang Yong Lik <theeahlag@gmail.com>
@ahlag ahlag changed the title [WIP] Add delimiter option for reading CSV files for Feathr Add delimiter option for reading CSV files for Feathr Jun 24, 2022
@xiaoyongzhu (Member) commented:

@ahlag thanks for the contribution! This PR looks good to me, but I'm not sure why the test fails. I spent a bit of time investigating and suspect it might be caused by the newly added tests.

@ahlag (Contributor, Author) commented Jul 9, 2022

@xiaoyongzhu
Did it fail in GitHub Actions or on the command line, e.g. sbt 'testOnly com.linkedin.feathr.offline.util.TestSourceUtils'? I tried rerunning the following commands locally but the unit tests were not failing.

sbt 'testOnly com.linkedin.feathr.offline.source.dataloader.TestCsvDataLoader'
sbt 'testOnly com.linkedin.feathr.offline.util.TestSourceUtils'
sbt 'testOnly com.linkedin.feathr.offline.source.dataloader.TestBatchDataLoader'
sbt 'testOnly com.linkedin.feathr.offline.source.dataloader.hdfs.TestFileFormat'
sbt 'testOnly com.linkedin.feathr.offline.source.dataloader.TestDataLoaderFactory'

@xiaoyongzhu (Member) commented:

Talked with @ahlag offline and asked him to run sbt test. I believe the issue is mostly that the failing tests were relying on loadDataFrame to read the CSV files.

Signed-off-by: changyonglik <theeahlag@gmail.com>
@ahlag (Contributor, Author) commented Jul 10, 2022

@xiaoyongzhu
I think I found the problem. The delimiter was not passed through successfully when I tried setting the option via sqlContext.
Is there a way to set the config with SparkSession in unit tests?

TestFileFormat.scala

    val sqlContext = ss.sqlContext
    sqlContext.setConf("spark.feathr.inputFormat.csvOptions.sep", "\t")

FileFormat.scala

val csvDelimiterOption = ss.sparkContext.getConf.get("spark.feathr.inputFormat.csvOptions.sep", ",")
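The two snippets above use different conf stores: sqlContext.setConf writes to the session's mutable runtime SQL conf, while ss.sparkContext.getConf returns the immutable SparkConf captured at SparkContext creation, which does not reflect runtime updates. A pure-Scala model of that mismatch (the two Maps are stand-ins for the two conf stores, not Spark APIs):

```scala
// Model of the conf mismatch (assumption based on Spark semantics, not Feathr code):
// sparkConf plays the role of ss.sparkContext.getConf (frozen at startup),
// runtimeConf plays the role of what sqlContext.setConf actually mutates.
object ConfMismatchSketch {
  val sparkConf: Map[String, String] = Map.empty   // frozen at SparkContext creation
  var runtimeConf: Map[String, String] = Map.empty // mutated by setConf

  def setConf(key: String, value: String): Unit =
    runtimeConf += (key -> value)

  // Reading from the frozen store misses runtime updates: always the default here.
  def readViaSparkContext(key: String, default: String): String =
    sparkConf.getOrElse(key, default)

  // Reading from the runtime store sees the value set by setConf.
  def readViaRuntimeConf(key: String, default: String): String =
    runtimeConf.getOrElse(key, default)
}
```

Reading the value via ss.conf.get(key, default) (or sqlContext.getConf) instead of ss.sparkContext.getConf should therefore see the value set in the test.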

@xiaoyongzhu (Member) commented:

> @xiaoyongzhu
> I think I found the problem. The delimiter was not passed through successfully when I tried setting the option via sqlContext.
> Is there a way to set the config with SparkSession in unit tests?
>
> TestFileFormat.scala
>
>     val sqlContext = ss.sqlContext
>     sqlContext.setConf("spark.feathr.inputFormat.csvOptions.sep", "\t")
>
> FileFormat.scala
>
> val csvDelimiterOption = ss.sparkContext.getConf.get("spark.feathr.inputFormat.csvOptions.sep", ",")

Hmm, that's a bit weird. Is it possible to force-set the delimiter?

@ahlag (Contributor, Author) commented Jul 10, 2022

Looks like there can only be one SparkContext per JVM:

[info] TestFileFormat:
[info] - testLoadDataFrame
[info] - testLoadDataFrameWithCsvDelimiterOption *** FAILED ***
[info]   org.apache.spark.SparkException: Only one SparkContext should be running in this JVM (see SPARK-2243).The currently running SparkContext was created at:

@ahlag (Contributor, Author) commented Jul 10, 2022

I think I will try a new approach. Since an existing SparkContext cannot be edited, and another one cannot be created in the same JVM, I will test this end-to-end by passing the config from the client.

@xiaoyongzhu (Member) commented:

> I think I will try a new approach. Since an existing SparkContext cannot be edited, and another one cannot be created in the same JVM, I will test this end-to-end by passing the config from the client.

I did some research and found this answer:
https://stackoverflow.com/a/44613011

> sqlContext.setConf("spark.sql.shuffle.partitions", "10") will set the property parameter for the whole application before the logicalPlan is generated.
>
> sqlContext.sql("set spark.sql.shuffle.partitions=15") will also set the property, but only for a particular query, and is applied at the time of logicalPlan creation.
>
> Choosing between them depends on what your requirement is.

Maybe you can try sqlContext.sql?

@ahlag (Contributor, Author) commented Jul 11, 2022

Ok, I'll give this a shot

@ahlag (Contributor, Author) commented Jul 27, 2022

@xiaoyongzhu
Ok! I have updated the release version.

xiaoyongzhu previously approved these changes Jul 27, 2022

@hangfei (Collaborator) left a comment:

Also update the wiki to note that we support TSV.
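One detail worth documenting for TSV support: a tab passed through a text job configuration typically arrives as the two-character literal \t, which has to be unescaped before handing it to the CSV reader. A small hypothetical helper in the spirit of the DelimiterUtils commit added later in this PR (the actual Feathr implementation may differ):

```scala
// Hypothetical sketch: convert escaped sequences from a text config
// (e.g. the two characters '\' and 't') into the actual control
// character that Spark's CSV reader expects for the "sep" option.
object DelimiterEscapeSketch {
  def unescape(raw: String): String =
    raw.replace("\\t", "\t").replace("\\n", "\n").replace("\\r", "\r")
}
```

With this in place, spark.feathr.inputFormat.csvOptions.sep set to "\t" in a job config would select tab-separated parsing, while a plain "," passes through unchanged.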

Signed-off-by: changyonglik <theeahlag@gmail.com>
Signed-off-by: changyonglik <theeahlag@gmail.com>
Signed-off-by: changyonglik <theeahlag@gmail.com>
Signed-off-by: changyonglik <theeahlag@gmail.com>
@ahlag ahlag force-pushed the feature/delimiter branch 2 times, most recently from 3842194 to 7711253 Compare August 4, 2022 13:00
Signed-off-by: changyonglik <theeahlag@gmail.com>
Signed-off-by: changyonglik <theeahlag@gmail.com>
Signed-off-by: changyonglik <theeahlag@gmail.com>
@ahlag (Contributor, Author) commented Aug 4, 2022

@xiaoyongzhu @hangfei
I have finished the changes. Could you review them?

xiaoyongzhu previously approved these changes Aug 11, 2022
@blrchen (Collaborator) commented Aug 16, 2022

@ahlag would you mind merging the latest main and resolving the conflicts so that we can get this merged? Thanks for your time!

@ahlag (Contributor, Author) commented Aug 16, 2022

@xiaoyongzhu @blrchen
Done! Could you merge it today? Otherwise I'm afraid it might pick up new conflicts.

@xiaoyongzhu xiaoyongzhu merged commit 6a0aba3 into feathr-ai:main Aug 16, 2022
@ahlag ahlag deleted the feature/delimiter branch August 17, 2022 08:05
ahlag added a commit to ahlag/feathr that referenced this pull request Aug 26, 2022
* Added documentation
* Added delimiter to CSVLoader (cherry picked from commit bc71fad93c08f6d06e40f7e289456c6a1b4d45e0)
* Added delimiter to BatchDataLoader, FileFormat and SourceUtils
* Added test case for BatchDataLoader
* Added test case for FileFormat
* Added test case for BatchDataLoader
* Added test case and fixed indent
* Passing failure
* Removed unused imports from BatchDataLoader
* Fixed test failures
* Added release version
* Removed trailing space
* Removed wildcard imports
* Paraphrased comments and docstring
* Added DelimiterUtils
* Refactored utils
* Updated wiki to support both tsv and csv
* Fixed spelling error
* trigger GitHub actions

Signed-off-by: Chang Yong Lik <theeahlag@gmail.com>
Signed-off-by: changyonglik <theeahlag@gmail.com>