Reading from Parquet file generates an assertion error #338

Closed
metasim opened this issue Sep 6, 2019 · 1 comment · Fixed by #339

Comments

@metasim
Member

commented Sep 6, 2019

I'm guessing something changed between Spark 2.2 and 2.3 that we didn't catch, because the unit tests around Parquet IO in EncodingSpec only counted rows and never performed an action that realized the tiles.

java.lang.AssertionError: assertion failed: User-defined types in Catalyst schema should have already been expanded:

{
  "type" : "struct",
  "fields" : [ {
    "name" : "tile",
    "type" : {
      "type" : "struct",
      "fields" : [ {
        "name" : "cell_context",
        "type" : {
          "type" : "struct",
          "fields" : [ {
            "name" : "cellType",
            "type" : {
              "type" : "struct",
              "fields" : [ {
                "name" : "cellTypeName",
                "type" : "string",
                "nullable" : false,
                "metadata" : { }
              } ]
            },
            "nullable" : false,
            "metadata" : { }
          }, {
            "name" : "dimensions",
            "type" : {
              "type" : "struct",
              "fields" : [ {
                "name" : "cols",
                "type" : "short",
                "nullable" : false,
                "metadata" : { }
              }, {
                "name" : "rows",
                "type" : "short",
                "nullable" : false,
                "metadata" : { }
              } ]
            },
            "nullable" : false,
            "metadata" : { }
          } ]
        },
        "nullable" : false,
        "metadata" : { }
      }, {
        "name" : "cell_data",
        "type" : {
          "type" : "struct",
          "fields" : [ {
            "name" : "cells",
            "type" : "binary",
            "nullable" : true,
            "metadata" : { }
          }, {
            "name" : "ref",
            "type" : {
              "type" : "struct",
              "fields" : [ {
                "name" : "source",
                "type" : {
                  "type" : "udt",
                  "class" : "org.apache.spark.sql.rf.RasterSourceUDT",
                  "pyClass" : "pyrasterframes.rf_types.RasterSourceUDT",
                  "sqlType" : {
                    "type" : "struct",
                    "fields" : [ {
                      "name" : "raster_source_kryo",
                      "type" : "binary",
                      "nullable" : false,
                      "metadata" : { }
                    } ]
                  }
                },
                "nullable" : false,
                "metadata" : { }
              }, {
                "name" : "bandIndex",
                "type" : "integer",
                "nullable" : false,
                "metadata" : { }
              }, {
                "name" : "subextent",
                "type" : {
                  "type" : "struct",
                  "fields" : [ {
                    "name" : "xmin",
                    "type" : "double",
                    "nullable" : false,
                    "metadata" : { }
                  }, {
                    "name" : "ymin",
                    "type" : "double",
                    "nullable" : false,
                    "metadata" : { }
                  }, {
                    "name" : "xmax",
                    "type" : "double",
                    "nullable" : false,
                    "metadata" : { }
                  }, {
                    "name" : "ymax",
                    "type" : "double",
                    "nullable" : false,
                    "metadata" : { }
                  } ]
                },
                "nullable" : true,
                "metadata" : { }
              }, {
                "name" : "subgrid",
                "type" : {
                  "type" : "struct",
                  "fields" : [ {
                    "name" : "colMin",
                    "type" : "integer",
                    "nullable" : false,
                    "metadata" : { }
                  }, {
                    "name" : "rowMin",
                    "type" : "integer",
                    "nullable" : false,
                    "metadata" : { }
                  }, {
                    "name" : "colMax",
                    "type" : "integer",
                    "nullable" : false,
                    "metadata" : { }
                  }, {
                    "name" : "rowMax",
                    "type" : "integer",
                    "nullable" : false,
                    "metadata" : { }
                  } ]
                },
                "nullable" : true,
                "metadata" : { }
              } ]
            },
            "nullable" : true,
            "metadata" : { }
          } ]
        },
        "nullable" : false,
        "metadata" : { }
      } ]
    },
    "nullable" : true,
    "metadata" : { }
  } ]
}

@metasim metasim self-assigned this Sep 6, 2019
@metasim

Member Author

commented Sep 6, 2019

The problem is likely a UDT nested inside another UDT:

                  "class" : "org.apache.spark.sql.rf.RasterSourceUDT",
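To make the nesting easier to spot in a dump like the one in the assertion message, here is a minimal sketch that walks a schema in that JSON layout and reports every entry still marked `"type" : "udt"`. It is only a diagnostic aid, not the fix for this issue. The names `find_udts` and `SCHEMA_JSON` are hypothetical (they are not part of Spark or RasterFrames), and the sample schema is a trimmed copy of the one reported above.

```python
import json


def find_udts(data_type, path=""):
    """Return (dotted_path, class_name) for each UDT left in a schema JSON tree.

    Hypothetical helper: assumes the JSON layout seen in the assertion message.
    """
    if not isinstance(data_type, dict):
        return []  # primitive type name such as "string" or "binary"
    kind = data_type.get("type")
    found = []
    if kind == "udt":
        found.append((path, data_type.get("class")))
        # A UDT's sqlType may itself contain further UDTs.
        found.extend(find_udts(data_type.get("sqlType"), path))
    elif kind == "struct":
        for field in data_type.get("fields", []):
            child = f"{path}.{field['name']}" if path else field["name"]
            found.extend(find_udts(field["type"], child))
    elif kind == "array":
        found.extend(find_udts(data_type.get("elementType"), path + "[]"))
    elif kind == "map":
        found.extend(find_udts(data_type.get("keyType"), path + "{key}"))
        found.extend(find_udts(data_type.get("valueType"), path + "{value}"))
    return found


# Trimmed copy of the schema from the assertion message (most fields omitted).
SCHEMA_JSON = """
{
  "type": "struct",
  "fields": [{
    "name": "tile",
    "type": {
      "type": "struct",
      "fields": [{
        "name": "cell_data",
        "type": {
          "type": "struct",
          "fields": [
            {"name": "cells", "type": "binary", "nullable": true, "metadata": {}},
            {"name": "ref", "type": {
              "type": "struct",
              "fields": [{
                "name": "source",
                "type": {
                  "type": "udt",
                  "class": "org.apache.spark.sql.rf.RasterSourceUDT",
                  "pyClass": "pyrasterframes.rf_types.RasterSourceUDT",
                  "sqlType": {"type": "struct", "fields": [
                    {"name": "raster_source_kryo", "type": "binary",
                     "nullable": false, "metadata": {}}
                  ]}
                },
                "nullable": false, "metadata": {}
              }]
            }, "nullable": true, "metadata": {}}
          ]
        },
        "nullable": false, "metadata": {}
      }]
    },
    "nullable": true, "metadata": {}
  }]
}
"""

leftover = find_udts(json.loads(SCHEMA_JSON))
for field_path, class_name in leftover:
    print(f"{field_path}: {class_name}")
```

Run against the full dump, the same walk should flag only the `source` field under `tile.cell_data.ref`, which matches the class quoted above.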
metasim added a commit to s22s/rasterframes that referenced this issue Sep 9, 2019
@vpipkt vpipkt closed this in #339 Sep 9, 2019
metasim added a commit to s22s/rasterframes that referenced this issue Sep 13, 2019
* develop: (254 commits)
  Incorporated PR feedback.
  Make python RasterSourceTest.test_list_of_list_of_str clearer, more stable
  Propagate errors encountered in RasterSourceToRasterRefs. Closes locationtech#267.

  Updated release notes.
  Switched Explode tiles to use UnsafeRow for slight improvement on memory pressure. Reworked TileExplodeBench
  Changed CatalystSerialize implementations to store schemas as fields rather than methods.
  Benchmark and fix for CellType reification issue. Closes locationtech#343
  PR feedback edits.
  Fleshed out details on using Scala. Closes locationtech#324
  Fixes locationtech#338.
  Tweaked parquet I/O tests to trigger UDT issue.
  Normalize RasterSourceDataSource param names between python and SQL
  PR feedback
  Run python tile exploder test for projected raster
  Fix for locationtech#333 and additional tests in that vein.
  Add failing unit test for issue 333, error in rf_agg_local_mean
  Updated ExplodeTiles to work with proj_raster type.
  Ignoring RGB composite tests until next round of improvements.
  IT test build fix.
  Incremental work on refactoring aggregate raster creation.
  ...