
[HLS-387] GFF3 Data Source #182

Merged
merged 28 commits into projectglow:master on Apr 24, 2020

Conversation

Collaborator

@kianfar77 kianfar77 commented Mar 31, 2020

What changes are proposed in this pull request?

This PR adds the "gff" data source to Glow for reading GFF3 (https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md) files. Usage looks like:
val df = spark.read.format("gff").load

The data source can infer the schema or accept a user-specified schema. It flattens the attributes field by creating a column for each tag that appears in the attributes column of the GFF file.

The inferred schema starts with base fields corresponding to the first 8 columns of GFF3, called seqId, source, type, start, end, score, strand, and phase. These are followed by any official attribute fields among id, name, alias, parent, target, gap, derivesfrom, note, dbxref, ontologyterm, and iscircular that appear in the GFF tags, and then by any unofficial attribute fields that appear in the tags. The base and official fields keep the order listed above; the unofficial fields are in alphabetical order.

A user-specified schema can contain any subset of the fields corresponding to the 9 columns of GFF3 (named seqId, source, type, start, end, score, strand, phase, and attributes), the official attribute fields, and the unofficial attribute fields. The names of the official and unofficial fields are matched against the tag names in a case- and underscore-insensitive fashion. See the illustrative sketch below.
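
For illustration, a minimal sketch of both read modes; the file path and the chosen field subset are hypothetical, and the field types are assumptions based on the inferred schema described above:

import org.apache.spark.sql.types._

// Schema inference: one column per base field plus one per attribute tag found in the file.
val inferred = spark.read.format("gff").load("/data/sample.gff3")

// User-specified schema: any subset of base and attribute fields. Field names are matched
// against tags case- and underscore-insensitively, so "ID", "id", or "i_d" would all match.
val userSchema = StructType(Seq(
  StructField("seqId", StringType),
  StructField("start", LongType),
  StructField("end", LongType),
  StructField("ID", StringType)))
val subset = spark.read.format("gff").schema(userSchema).load("/data/sample.gff3")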

How is this patch tested?

  • Unit tests
  • Integration tests
  • Manual tests

(Details)

@codecov

codecov bot commented Apr 7, 2020

Codecov Report

Merging #182 into master will increase coverage by 0.14%.
The diff coverage is 97.93%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #182      +/-   ##
==========================================
+ Coverage   93.64%   93.79%   +0.14%     
==========================================
  Files          86       87       +1     
  Lines        4077     4222     +145     
  Branches      365      389      +24     
==========================================
+ Hits         3818     3960     +142     
- Misses        259      262       +3     
Impacted Files Coverage Δ
.../main/scala/io/projectglow/gff/GffDataSource.scala 97.00% <97.00%> (ø)
...src/main/scala/io/projectglow/common/schemas.scala 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 505cbf7...f3b1ae8. Read the comment docs.

Contributor

@henrydavidge henrydavidge left a comment

I had some preliminary comments but it looks like a great start to me!

One question: are bgzip files with a .gz extension handled correctly now? I'm not sure if we'll know they're splittable.

*/
override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation = {

createRelation(sqlContext, parameters, null)
Contributor

Think it would be better to infer schema here and then call the method below.

Collaborator Author

Done
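
For illustration, a sketch of the suggested structure: the no-schema overload infers the schema once and delegates to the schema-taking overload (inferSchema and GffRelation are illustrative names, not necessarily those used in the PR):

override def createRelation(
    sqlContext: SQLContext,
    parameters: Map[String, String]): BaseRelation = {
  val path = parameters("path") // simplified; real code validates the option
  // Infer the schema here, then reuse the schema-aware overload below.
  createRelation(sqlContext, parameters, inferSchema(sqlContext.sparkSession, path))
}

override def createRelation(
    sqlContext: SQLContext,
    parameters: Map[String, String],
    schema: StructType): BaseRelation = {
  new GffRelation(sqlContext, parameters("path"), schema)
}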

val sqlContext: SQLContext,
path: String,
requiredSchema: StructType
) extends BaseRelation with TableScan { // TODO: investigate use of PrunedFilteredScan
Contributor

IMO the column pruning would be helpful and probably very simple to implement. I doubt the filters are worth spending time on.

Collaborator Author

Incorporated column pruning; skipped filters.
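
One way to incorporate column pruning in a data source v1 relation is the PrunedScan trait. A minimal sketch under that assumption (the reader function is injected only to keep the example self-contained; this is not the PR's exact code):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, PrunedScan}
import org.apache.spark.sql.types.StructType

class GffRelationSketch(
    val sqlContext: SQLContext,
    path: String,
    fullSchema: StructType,
    readGff: (SQLContext, String, StructType) => RDD[Row]) // hypothetical reader
  extends BaseRelation with PrunedScan {

  override def schema: StructType = fullSchema

  // Spark passes only the selected column names, so the scan can project early.
  override def buildScan(requiredColumns: Array[String]): RDD[Row] = {
    val prunedSchema = StructType(
      requiredColumns.flatMap(c => fullSchema.fields.find(_.name == c)))
    readGff(sqlContext, path, prunedSchema)
  }
}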

gffOfficialAttributeFields.foldLeft(Seq[StructField]()) { (s, f) =>
if (attributeTags
.map(t => deUnderscore(t.toLowerCase))
.contains(f.name.toLowerCase)) {
Contributor

FYI, I think that Spark SQL is already case-insensitive

Collaborator Author

True. The case sensitivity against tag names still needs to be handled, though. Revised this section based on your other comments.
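
As an illustration, the case- and underscore-insensitive match can normalize both sides before comparing; the PR's normalizeString may differ in detail:

def normalizeString(s: String): String = s.replaceAll("_", "").toLowerCase

// "Derives_from", "derivesfrom", and "DERIVES_FROM" all normalize to "derivesfrom"
assert(normalizeString("Derives_from") == normalizeString("DERIVES_FROM"))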

}

def filterCommentsAndFasta(df: DataFrame): DataFrame = {
df.where(
Contributor

There's an option in the CSV reader to drop rows that don't have the right number of fields. Might save you some code here.

Contributor

As well as an option to filter out comments.

Collaborator Author

Used the comment-filtering option.
Used DROPMALFORMED; this has issues in Spark 2.4.4 and earlier, which I dealt with and documented in the code.
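
A sketch of the CSV reader options being referred to, assuming a tab-delimited read where '#' starts a comment (the schema and path here are placeholders):

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Placeholder 9-column string schema; the PR derives its columns from gffBaseSchema.
val columnarSchema = StructType(
  Seq("seqId", "source", "type", "start", "end", "score", "strand", "phase", "attributes")
    .map(StructField(_, StringType)))

val raw = spark.read
  .option("sep", "\t")
  .option("comment", "#")          // skip lines starting with '#'
  .option("mode", "DROPMALFORMED") // drop rows without the expected number of columns
  .schema(columnarSchema)
  .csv("/data/sample.gff3")        // path is illustrative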

)
.collect()(0).getAs[Seq[String]](0)

// generate the schema. The field names will be case and underscore insensitive
Contributor

Code style thing: I think this would be more concise and clear if you normalized all the names together and then used sortBy to put them in the order you want.

Collaborator Author

Revised.

}
)
.drop(attributesField.name, attributesMapColumnName) // TODO: Complete the option of keeping attributes field
.rdd
Contributor

For an easy performance gain, we can override needConversion and return queryExecution.toRdd.

Collaborator Author

Great. Done
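
A hedged sketch of that optimization: with needConversion overridden to false, the relation can hand back the DataFrame's internal rows directly (assembleDf stands in for however the parsed DataFrame is built):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.StructType

class NoConversionRelationSketch(
    val sqlContext: SQLContext,
    assembleDf: SQLContext => DataFrame, // hypothetical builder of the parsed DataFrame
    override val schema: StructType)
  extends BaseRelation with TableScan {

  // Tell Spark the scan already produces internal rows.
  override def needConversion: Boolean = false

  override def buildScan(): RDD[Row] = {
    // queryExecution.toRdd is an RDD[InternalRow]; because needConversion is false,
    // Spark accepts it here and skips the per-row conversion back to external Rows.
    assembleDf(sqlContext).queryExecution.toRdd.asInstanceOf[RDD[Row]]
  }
}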

gffBaseSchema
.fields
.toSeq
.map(f => StructField(f.name, StringType))
Contributor

Why do you have to parse everything as a string at first? You can't parse them as the proper data types?

Collaborator Author

I used the nullValue option of the CSV reader and was able to do it.
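
A small sketch of the point above: treating '.' as null lets the CSV reader parse typed columns directly instead of reading everything as strings first (the column types and path are assumptions):

import org.apache.spark.sql.types._

val gff3Columns = StructType(Seq(
  StructField("seqId", StringType),
  StructField("source", StringType),
  StructField("type", StringType),
  StructField("start", LongType),
  StructField("end", LongType),
  StructField("score", DoubleType),
  StructField("strand", StringType),
  StructField("phase", IntegerType),
  StructField("attributes", StringType)))

val typed = spark.read
  .option("sep", "\t")
  .option("nullValue", ".") // GFF3 uses '.' for missing values
  .schema(gff3Columns)
  .csv("/data/sample.gff3") // path is illustrative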

@kianfar77
Collaborator Author

I had some preliminary comments but it looks like a great start to me!

One question: are bgzip files with a .gz extension handled correctly now? I'm not sure if we'll know they're splittable.

As we discussed, it is handled, but not in a splittable fashion. Will address in a separate PR by extending the CSV data source.

Contributor

@henrydavidge henrydavidge left a comment

Made a first pass. I think the high level is really solid! Most of my comments are about code style and readability.

I'm a little surprised there aren't more edge cases to look at. We should be sure we've thought hard about the kinds of "weird" GFFs we'll encounter.

override def createRelation(
sqlContext: SQLContext,
parameters: Map[String, String]): BaseRelation = {
val path = parameters.get("path")
Contributor

Can you handle this validation logic in a common place? E.g., checkAndGetPath(options: Map[String, String]): String

Collaborator Author

Done
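
For illustration, such a shared helper might look like this (the exception type and message are assumptions):

def checkAndGetPath(options: Map[String, String]): String = {
  // Fail fast with a clear message if the caller did not provide a path.
  options.getOrElse(
    "path",
    throw new IllegalArgumentException("Path is required"))
}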


val attributeFields = attributeTags
.foldLeft(Seq[StructField]()) { (s, t) =>
val officialIdx = officialFieldNames.indexOf(normalizeString(t))
Contributor

nit: might be a little cleaner to gffOfficialAttributeFields.find(f => f.name == normalizeString(t)).map(_.dataType).getOrElse(StringType)

Collaborator Author

Done

.sortBy(_.name)(FieldNameOrdering)

StructType(
gffBaseSchema.fields.dropRight(1) ++ attributeFields
Contributor

nit: comment that attributes is the last field in gffBaseSchema.fields

Collaborator Author

Done

}

def filterFastaLines(df: DataFrame): DataFrame = {
df.where(
Contributor

This is kind of a confusing way to write this filter. fold is specifically called out in our style guide as a hard-to-understand pattern: https://github.com/databricks/scala-style-guide. Coalesce might be a good alternative here.

Collaborator Author

Used Coalesce
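
For illustration, a coalesce-based version of this filter, assuming FASTA sequence lines parse with nulls in every column after the first (the column handling is a sketch, not the PR's exact code):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col}

def filterFastaLines(df: DataFrame): DataFrame = {
  // Sequence lines in a trailing ##FASTA section contain no tabs, so every column
  // after the first comes back null; keep rows where at least one of them is set.
  val laterColumns = df.columns.drop(1).map(col)
  df.where(coalesce(laterColumns: _*).isNotNull)
}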

s :+ StructField(t, StringType)
}
}
.sortBy(_.name)(FieldNameOrdering)
Contributor

nit: I think the custom ordering is not necessary. It might be simpler to use the natural ordering on whether the field is in the official attributes and then the name.

attributes.sortBy(field => (!officialAttributes.contains(field), field.name))

Collaborator Author

Used your suggestion but with indexOf because I want the official fields to be in a set order as well.
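
A small self-contained sketch of that ordering, with placeholder inputs standing in for the PR's gffOfficialAttributeFields and the discovered tags:

import org.apache.spark.sql.types.{StringType, StructField}

val officialNames = Seq("ID", "Name", "Parent") // placeholder for the official field order
val attributeFields = Seq(
  StructField("my_custom_tag", StringType),
  StructField("Parent", StringType),
  StructField("ID", StringType))

val sorted = attributeFields.sortBy { f =>
  val idx = officialNames.indexOf(f.name)
  // indexOf is -1 for unofficial tags, so map it past all official positions;
  // official fields keep their canonical order, unofficial ones sort by name.
  (if (idx == -1) Int.MaxValue else idx, f.name)
}
// sorted: ID, Parent, my_custom_tag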

spark.conf.set(columnPruningConf, originalColumnPruning)

val attributeFields = attributeTags
.foldLeft(Seq[StructField]()) { (s, t) =>
Contributor

Why is this a fold instead of a simple map?

Collaborator Author

A remnant of the older version! I didn't notice that it does nothing. Changed.

:+ StructField("Is_circular", BooleanType)
)

test("Schema inference") {
Contributor

Are there any particular edge cases we should look for here? E.g., no attributes fields, attributes fields containing weird characters (can there be escaped = or ; inside attribute values)?

Collaborator Author

The only special character allowed unescaped in GFF3 is the space. I added tags in the test schema that include spaces, and values containing spaces are also covered in my test files.

= and ; and other special characters must be percent-encoded in GFF3, and I read them in their percent-encoded form as well.

GFF3 must have nine columns, so files with no attributes column are not valid. I also checked that with an online GFF validator.

@kianfar77
Collaborator Author

Made a first pass. I think the high level is really solid! Most of my comments are about code style and readability.

I'm a little surprised there aren't more edge cases to look at. We should be sure we've thought hard about the kinds of "weird" GFFs we'll encounter.

Please see my comment in the tests. GFF3 format is pretty restricted. I could not think of cases other than those mentioned that are both weird and valid.

@kianfar77 kianfar77 changed the title from [WIP][HLS-387] GFF/GTF Reader to [HLS-387] GFF3 Data Source on Apr 21, 2020
Contributor

@henrydavidge henrydavidge left a comment

Just had two comments on testing. The main code looks great!

assert(dfRows.sameElements(expectedRows))
}

test("Read gff with user-specified schema containing attributesField") {
Contributor

What's the expected behavior if the user provides a schema with fields that are not actually in the attributes map? We should have a test for that.

Collaborator Author

The column will be added. All entries will be null. Added a test for that.
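
A sketch of what such a test could check (the field name and testGff3Path are hypothetical):

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// testGff3Path: path to a small GFF3 test file (hypothetical)
val schemaWithUnknownTag = StructType(Seq(
  StructField("seqId", StringType),
  StructField("not_a_real_tag", StringType)))

val df = spark.read.format("gff").schema(schemaWithUnknownTag).load(testGff3Path)
// The unknown attribute column is present but every value is null.
assert(df.select("not_a_real_tag").collect().forall(_.isNullAt(0)))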

}
assert(e.getMessage.contains("GFF data source does not support writing!"))
}

Contributor

We should have a test for column pruning where we select only a few columns from the inferred schema.

Collaborator Author

Added the "column pruning" test
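
A sketch of a pruning check along those lines (testGff3Path is hypothetical):

import org.apache.spark.sql.Row

// testGff3Path: path to a small GFF3 test file (hypothetical)
val full = spark.read.format("gff").load(testGff3Path)
val pruned = full.select("seqId", "start", "end")

// Rows from the pruned scan should match the same projection of the full read.
val expected = full.collect().map(r =>
  Row(r.getAs[Any]("seqId"), r.getAs[Any]("start"), r.getAs[Any]("end")))
assert(pruned.collect().sameElements(expected))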

Contributor

@henrydavidge henrydavidge left a comment

LGTM!

@kianfar77 kianfar77 merged commit cbb1305 into projectglow:master Apr 24, 2020
henrydavidge pushed a commit to henrydavidge/glow that referenced this pull request Jun 22, 2020
* prep
* schema inferrer
* field order
* cleanup
* reader
* more reader
* projection
* works
* sbt
* test schema
* test schema
* test read
* updateToken
* more tests
* filess
* filtered schema
* revert
* datasource api
* comments and more
* working version
* working version
* final and tests
* case test
* comments
* empty
* tests added

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>
Signed-off-by: Henry Davidge <hhd@databricks.com>