Spark SQL driver is not reading database schema of Hive #7630

Open · lucasloami opened this issue May 12, 2018 · 8 comments
@lucasloami commented May 12, 2018

Hi,

I'm using a Metabase build with the SparkSQL driver and it's working correctly (I described what I did in #7528 (comment)). I was able to add a new connection to my Hive instance and run queries against it.

However, Metabase can't read my database schema. Every time I press the Sync database schema now button, I receive the following error:

05-11 21:01:42 INFO sync.util :: STARTING: Sync metadata for sparksql Database 2 'teste'
05-11 21:01:42 ERROR sync.util :: Error running sync step: Output of db-metadata does not match schema: 

	   {:tables #{{:name (not (instance? java.lang.String nil))}}}  


("sync.fetch_metadata$fn__39574$db_metadata__39579.invoke(fetch_metadata.clj:10)"
 "sync.sync_metadata.tables$fn__40704$db_metadata__40709$fn__40710.invoke(tables.clj:127)"
 "sync.sync_metadata.tables$fn__40704$db_metadata__40709.invoke(tables.clj:124)"
 "sync.sync_metadata.tables$fn__40764$sync_tables_BANG___40769$fn__40770.invoke(tables.clj:144)"
 "sync.sync_metadata.tables$fn__40764$sync_tables_BANG___40769.invoke(tables.clj:139)"
 "sync.sync_metadata$fn__40794$sync_db_metadata_BANG___40799$fn__40800$fn__40801.invoke(sync_metadata.clj:26)"
 "sync.util$do_with_error_handling.invokeStatic(util.clj:124)"
 "sync.util$do_with_error_handling.invoke(util.clj:119)"
 "sync.util$do_with_error_handling.invokeStatic(util.clj:122)"
 "sync.util$do_with_error_handling.invoke(util.clj:119)"
 "driver$fn__26280.invokeStatic(driver.clj:245)"
 "driver$fn__26280.invoke(driver.clj:245)"
 "driver$fn__26215$G__26022__26224.invoke(driver.clj:61)"
 "sync.util$sync_in_context$fn__38140.invoke(util.clj:115)"
 "sync.util$with_db_logging_disabled$fn__38137.invoke(util.clj:106)"
 "sync.util$with_start_and_finish_logging$fn__38132.invoke(util.clj:92)"
 "sync.util$with_sync_events$fn__38129.invoke(util.clj:75)"
 "sync.util$with_duplicate_ops_prevented$fn__38120.invoke(util.clj:54)"
 "sync.util$do_sync_operation.invokeStatic(util.clj:142)"
 "sync.util$do_sync_operation.invoke(util.clj:139)"
 "sync.sync_metadata$fn__40794$sync_db_metadata_BANG___40799$fn__40800.invoke(sync_metadata.clj:23)"
 "sync.sync_metadata$fn__40794$sync_db_metadata_BANG___40799.invoke(sync_metadata.clj:20)"
 "api.database$fn__41692$fn__41693$fn__41694.invoke(database.clj:510)")

05-11 21:01:42 INFO sync.util :: FINISHED: Sync metadata for sparksql Database 2 'teste' (77 ms)

I'm not sure whether this bug is in Metabase itself or in the driver version I'm using (it's not the latest one), but I believe it's worth fixing: without the schema sync, Metabase can't display my Hive tables in the Admin -> Data Model view, and it can't display tables in the "Custom Question" and "Native Query" views either.

  • Metabase version: v.0.29.0 (I compiled it with SparkSQL dependencies)
  • Operating system: Ubuntu 17.10
  • SparkSQL driver version: v0.13.1

@lucasloami changed the title from "Spark SQL driver is not database schema of Hive" to "Spark SQL driver is not reading database schema of Hive" on May 13, 2018

@camsaul (Member) commented May 16, 2018

@lucasloami it sounds like it's complaining that, when it looks up metadata, it thinks one of your tables has a blank name. So either:

  • One of your tables actually does have a blank name, and we have to figure out what to do about that (a quick way to check this from the Spark side is sketched below the list), or
  • For some reason it's not fetching Table names properly (perhaps a version mismatch between the SparkSQL JDBC driver we're using and your version of SparkSQL)
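
A quick way to check the first possibility from the Spark side (a PySpark sketch, not Metabase code; it assumes an existing SparkSession named spark):

# List the tables Spark's catalog reports and flag any whose name is blank
# or whitespace-only.
for t in spark.catalog.listTables():
    if not t.name or not t.name.strip():
        print("table with a blank name:", t)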

@camsaul self-assigned this May 16, 2018

@m30m commented May 17, 2018

@camsaul
I faced the same problem. I'm pretty sure this is because of table partitioning. Here is the result of describe table MY_TABLE for a partitioned table:

+--------------------------+------------+----------+--+
|         col_name         | data_type  | comment  |
+--------------------------+------------+----------+--+
| pk                       | int        | NULL     |
.
.
.
| published_at             | timestamp  | NULL     |
| date                     | date       | NULL     |
| # Partition Information  |            |          |
| # col_name               | data_type  | comment  |
| date                     | date       | NULL     |
+--------------------------+------------+----------+--+

Maybe we can skip rows starting with #.
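
A rough Python sketch of that idea (illustrative only, not the driver's actual Clojure code; rows is assumed to be the list of (col_name, data_type, comment) tuples returned for the DESCRIBE query):

def visible_columns(rows):
    # Collect the real column names/types, ignoring the partition-information
    # section that Hive appends to DESCRIBE output for partitioned tables.
    cols, seen = [], set()
    for col_name, data_type, _comment in rows:
        name = (col_name or "").strip()
        if not name or name.startswith("#"):
            continue  # blank separator rows and "# ..." header rows
        if name in seen:
            continue  # partition columns are repeated below the "#" block
        seen.add(name)
        cols.append((name, data_type))
    return cols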

If you want to reproduce it easily in Spark, here is some sample code:

df = sc.parallelize(range(1000)).map(lambda x: [x % 10, x]).toDF().withColumnRenamed('_1', 'mod').withColumnRenamed('_2', 'num')
df.write.format('parquet').partitionBy('mod').saveAsTable('partition_test')

If I execute the same command without the partitionBy, Metabase is able to read it.
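
For completeness, describing the new table from the same session should show the extra partition-information rows that trip up the sync (hedged; the exact layout varies by Spark/Hive version):

spark.sql('DESCRIBE partition_test').show(truncate=False)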

@witwall commented May 25, 2018

I tested the latest version of Metabase, and it had no problem reading the schema; I also tested partitioned Parquet files.
I use this script to build the schema so that Metabase can find the Parquet files (it can be executed through a native query or beeline); since the resulting view exposes no partition metadata, it may sidestep the DESCRIBE issue described above:

create view trans as select * from parquet.`/Users/steven/hdfs/trans`
@srohr commented Sep 7, 2018

I've encountered this bug as well while trying to run Metabase with the SparkSQL driver, connecting to a Hive database on a Hortonworks HDP 3.0 cluster.

I've attempted recompiling the Metabase SparkSQL driver to use the same version of SparkSQL that HDP 3.0 uses (1.21.2.3.0.0.0-1634, an internal Hortonworks version available at http://repo.hortonworks.com/content/repositories/releases/). This did not result in any change; I still receive the error:

ERROR sync.util :: Error running sync step: Output of db-metadata does not match schema:
{:tables #{{:name (not (instance? java.lang.String nil))}}}

I also tried saving the table as JSON rather than Parquet; no change.

If you have any suggestions (e.g. Metabase options to enable more verbose logging), I'd be happy to try them.

@chenjiajia commented Jan 4, 2019

I found the problem: the Hive DESCRIBE table result contains extra partition-column rows, and the driver code does not handle them.

@chenjiajia commented Jan 7, 2019

I submitted the fix code: fix sparksql handle partition table error 7630 (#9215).

@FranciscoMMAlves commented Feb 21, 2019

We had a similar problem using SparkSQL to access an Impala server:
Error {:tables #{{:name (not (instance? java.lang.String nil))}}}
We solved the problem by correcting the schema names returned for the tables, columns, and types in the /src/metabase/driver/sparksql.clj file.


Hope it helps.

@camsaul added this to the 0.32.0 milestone on Feb 21, 2019

@AnonyV commented Feb 22, 2019

Hi All,
I faced the same problem using Metabase v0.31.0 and the metabase-sparksql-deps-1.2.1.spark2-standalone.jar plugin.

Every time I press the Sync database schema now button, I receive the following errors:

ERROR sync.util :: Error syncing fields for ??3,549?test.tmp?: java.lang.NullPointerException
("org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256)"

ERROR sync.util :: Error syncing fields for ??3,551?test.test_test1?: Output of table-metadata does not match schema: {:fields #{{:database-type (not ("Non-blank string" "")), :base-type (not ("???????" nil))} {:base-type (not ("???????" nil))}}}

ERROR sync.util :: Error running sync step: Don't know how to create ISeq from: clojure.lang.ExceptionInfo
("clojure.lang.RT.seqFrom(RT.java:550)"
"clojure.lang.RT.seq(RT.java:530)"

I can select data from all tables in SQL mode, and I can also use the 'Count rows' feature to count a table, but something goes wrong when the 'Raw data' feature tries to show the data and the fields of a table (presumably because the failed sync left the table with no Field metadata, which 'Raw data' needs but a plain row count does not).
I get these errors in the backend:

WARN metabase.query-processor :: {:status :failed,
:class java.lang.Exception,
:error "Table 'test' has no Fields associated with it.",
:stacktrace
("–> query_processor.middleware.add_implicit_clauses$fn__31049$add_implicit_fields__31054$fn__31060.invoke(add_implicit_clauses.clj:81)"
“query_processor.middleware.add_implicit_clauses$fn__31049$add_implicit_fields__31054.invoke(add_implicit_clauses.clj:64)”
“query_processor.middleware.add_implicit_clauses$fn__31143$add_implicit_mbql_clauses__31148$fn__31154.invoke(add_implicit_clauses.clj:107)”
“query_processor.middleware.add_implicit_clauses$fn__31143$add_implicit_mbql_clauses__31148.invoke(add_implicit_clauses.clj:105)”
“query_processor.middleware.add_implicit_clauses$maybe_add_implicit_clauses.invokeStatic(add_implicit_clauses.clj:116)”
“query_processor.middleware.add_implicit_clauses$maybe_add_implicit_clauses.invoke(add_implicit_clauses.clj:114)”
“query_processor.middleware.add_row_count_and_status$add_row_count_and_status$fn__31203.invoke(add_row_count_and_status.clj:15)”
“query_processor.middleware.driver_specific$process_query_in_context$fn__32507.invoke(driver_specific.clj:12)”
“query_processor.middleware.resolve_driver$resolve_driver$fn__35249.invoke(resolve_driver.clj:15)”
“query_processor.middleware.bind_effective_timezone$bind_effective_timezone$fn__31708$fn__31709.invoke(bind_effective_timezone.clj:9)”
“util.date$call_with_effective_timezone.invokeStatic(date.clj:88)”
“util.date$call_with_effective_timezone.invoke(date.clj:77)”
“query_processor.middleware.bind_effective_timezone$bind_effective_timezone$fn__31708.invoke(bind_effective_timezone.clj:8)”
“query_processor.middleware.store$initialize_store$fn__37454$fn__37455.invoke(store.clj:11)”
“query_processor.store$do_with_new_store.invokeStatic(store.clj:34)”
“query_processor.store$do_with_new_store.invoke(store.clj:30)”
“query_processor.middleware.store$initialize_store$fn__37454.invoke(store.clj:10)”
“query_processor.middleware.cache$maybe_return_cached_results$fn__32130.invoke(cache.clj:127)”
“query_processor.middleware.catch_exceptions$catch_exceptions$fn__32178.invoke(catch_exceptions.clj:64)”
“query_processor$process_query.invokeStatic(query_processor.clj:213)”
“query_processor$process_query.invoke(query_processor.clj:209)”
“query_processor$run_and_save_query_BANG_.invokeStatic(query_processor.clj:322)”
“query_processor$run_and_save_query_BANG_.invoke(query_processor.clj:316)”
“query_processor$fn__37627$process_query_and_save_execution_BANG___37632$fn__37633.invoke(query_processor.clj:354)”
“query_processor$fn__37627$process_query_and_save_execution_BANG___37632.invoke(query_processor.clj:340)”
“query_processor$fn__37651$process_query_and_save_with_max_BANG___37656$fn__37657.invoke(query_processor.clj:373)”
“query_processor$fn__37651$process_query_and_save_with_max_BANG___37656.invoke(query_processor.clj:369)”
“api.dataset$fn__41976$fn__41979.invoke(dataset.clj:45)”
“api.common$fn__23138$invoke_thunk_with_keepalive__23143$fn__23144$fn__23145.invoke(common.clj:435)”),
:query
{:type "query",
:query {:source-table 61},
:parameters [],
:constraints {:max-results 10000, :max-results-bare-rows 2000},
:info
{:executed-by 1,
:context :ad-hoc,
:card-id nil,
:nested? false,
:query-hash [82, 57, 57, 89, -33, -84, -68, 25, 19, -1, 12, -11, 110, 14, -49, 24, 39, -114, 22, 66, -16, -81, 97, 111, 94, 42, 41, -69, -74, -33, -44, 12],
:query-type "MBQL"}},
:preprocessed nil,
:native nil}


["JdbcSQLException:"
"Message: Syntax error in SQL statement "SELECT ""TABLE_ID"", COUNT(*) AS ""COUNT"" FROM ""METABASE_FIELD"" WHERE ((""TABLE_ID"" IN NULL[*]) AND ""ACTIVE"" = TRUE AND (""SPECIAL_TYPE"" IN (?, ?))) GROUP BY ""TABLE_ID"" "; expected "("; SQL statement:"
"SELECT "TABLE_ID", count(*) AS "COUNT" FROM "METABASE_FIELD" WHERE (("TABLE_ID" in NULL) AND "ACTIVE" = TRUE AND ("SPECIAL_TYPE" in (?, ?))) GROUP BY "TABLE_ID" [42001-197]"
"SQLState: 42001"
"Error Code: 42001"]}

Those are all my error logs for this issue; I hope they help with fixing the bug.
