
Column data types not being parsed under v0.8.0-dev #29

Closed
davidski opened this issue Jan 7, 2019 · 8 comments

davidski commented Jan 7, 2019

Under sergeant master (20190106), CSVH files read in over DBI (src_drill()) are coming across as character columns, while the same file parsed under the CRAN v0.5.0 sergeant is properly coerced into R data types.

Apologies for the lack of a reprex here. I'm in the thick of a Drill-powered project and wanted to throw this early report up in case it was related to the rapidjsonr conversion work and the new dev branches. If previously unknown, I'll generate a reprex when my hair is a bit less on fire. ;)

hrbrmstr self-assigned this Jan 7, 2019
hrbrmstr (Owner) commented Jan 7, 2019

Hey David!

Can you share a sanitized version of the query you're passing to tbl(src, …)?

By default, Drill treats all columns in a CSV as VARCHAR; e.g. this:

library(readr) # write_csv() comes from readr

write_csv(mtcars, "~/Data/mtcars.csv")

httr::POST(
  "http://localhost:8047/query.json",
  encode = "json",
  body = list(
    queryType = "SQL",
    query = "SELECT * FROM dfs.d.`/mtcars.csv` LIMIT 1"
  )
) -> res

cat(
  httr::content(res, as = "text")
)

causes this to be passed on down the wire:

{
  "queryId" : "23ccb7be-d896-51d5-c1d3-7e93a32a769f",
  "columns" : [ "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb" ],
  "rows" : [ {
    "carb" : "4",
    "mpg" : "21",
    "qsec" : "16.46",
    "cyl" : "6",
    "hp" : "110",
    "am" : "1",
    "disp" : "160",
    "wt" : "2.62",
    "vs" : "0",
    "drat" : "3.9",
    "gear" : "4"
  } ],
  "metadata" : [ "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR" ]
}

Would a type_convert = TRUE/FALSE option to tbl() be helpful? i.e. force a bypass of reading the metadata that comes along for the ride? I can add that pretty quickly to the 0.8.0 branch. (I should likely revert the main branch to pre-0.8.0 dev too, since I made the "will dev on branches" decision as part of a new year's resolution :-)

hrbrmstr (Owner) commented Jan 7, 2019

AUGHH!!! They've deprecated the `...` in tbl_sql() (meaning if I make the param change to tbl() it will just break at some point). Would making it a param to src_drill() be OK? That would impact all query results for that db, but it may be a workaround if having R do the heavy lifting (which is really just a call to readr::type_convert(), sans any type hints, after the query result is retrieved) is desired.

davidski (Author) commented Jan 7, 2019

Hrm. 🤔 My inexperience with Drill may be showing here. I've been using CTAS queries to convert CSVs (CSVH, technically) to Parquet. Your comments about Drill always returning VARCHAR are right in line with https://drill.apache.org/docs/text-files-csv-tsv-psv/, which cautions against using CTAS queries in this fashion.

There does seem to be a difference in how v0.5.0 and master operate. The following code under the CRAN-published v0.5.0 gives a typed version of this data set, while the same code under master gives me character() types for all of the columns.

library(tidyverse)
library(sergeant)
data(flights, package = "nycflights13")
flights %>% write_csv(here::here("data/flights.csvh"))
ds <- sergeant::src_drill()
dat <- tbl(ds, "dfs.root.`/data/flights.csvh`")
glimpse(dat)

I may have been relying inadvertently on sergeant magic... Would you recommend explicit conversions using drill_query() to go from CSV -> Parquet, in line with the drill guide? Not as slick as I was hoping, but perhaps more stable. 😁
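For reference, an explicit CTAS conversion along the lines the Drill docs describe might look like the following sketch. The workspace names, output table name, and column list here are illustrative assumptions, not taken from the thread:

```sql
-- Hypothetical CSVH -> Parquet conversion via CTAS with explicit casts.
-- Workspace (dfs.tmp / dfs.root), table name, and columns are made up.
ALTER SESSION SET `store.format` = 'parquet';

CREATE TABLE dfs.tmp.`/flights_parquet` AS
SELECT
  CAST(`year` AS INT)       AS `year`,
  CAST(dep_delay AS DOUBLE) AS dep_delay,
  carrier
FROM dfs.root.`/data/flights.csvh`;
```

Once materialized as Parquet, the types are stored in the files themselves, so subsequent queries return typed columns without any per-query casting.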

hrbrmstr (Owner) commented Jan 7, 2019

Aye. I inadvertently pushed the new code to master and just lazily didn't rebase.

You could also do:

dat <- tbl(ds, sql("
SELECT 
  CAST(`year` AS INT) `year`,
  CAST(`month` as INT) `month`,
  CAST(`day` AS INT) `day`,
  CAST(dep_time AS INT) dep_time,
  CAST(sched_dep_time AS INT) sched_dep_time,
  CAST(dep_delay AS DOUBLE) dep_delay,
  CAST(arr_time AS INT) arr_time,
  CAST(sched_arr_time AS INT) sched_arr_time,
  CAST(arr_delay AS DOUBLE) arr_delay,
  carrier,
  CAST(flight AS VARCHAR) flight,
  tailnum,
  origin
FROM dfs.data.`flights.csvh`
"))

glimpse(dat)
## Observations: ??
## Variables: 13
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, ...
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, 558, 558, 558, 559, 559, 559, 600, ...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, 600, 600, 600, 600, 559, 600, 600, ...
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1, 0, -1, 0, 0, 1, -8, -3, -4, -4, 0...
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849, 853, 924, 923, 941, 702, 854, 851,...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851, 856, 917, 937, 910, 706, 902, 858,...
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -14, 31, -4, -8, -7, 12, -6, -8, 16, ...
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "B6", "B6", "UA", "UA", "AA", ...
## $ flight         <dbl> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 49, 71, 194, 1124, 707, 1806, 1187, ...
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "N516JB", "N829AS", "N593JB", ...
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA", "JFK", "LGA", "JFK", "JFK", "JFK",...

hrbrmstr (Owner) commented Jan 7, 2019

The above is likely the best way to do it (apart from running readr::type_convert() ex post facto on the query result) since it's deterministic and actually pulls what is expected (provided folks are running the latest Drill). Old Drill should still behave the old way.
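The ex-post-facto route can be sketched with base R alone. This uses utils::type.convert() as a stand-in for readr::type_convert() (readr's version adds date/time parsing and finer guessing control), and the data frame below is made up for illustration:

```r
# Drill returns every CSV column as VARCHAR, so everything arrives in R
# as character; let R guess the real types after the fact.
res <- data.frame(
  mpg  = c("21", "22.8"),               # should parse to double
  cyl  = c("6", "4"),                   # should parse to integer
  name = c("Mazda RX4", "Datsun 710"),  # should stay character
  stringsAsFactors = FALSE
)

# base-R stand-in for readr::type_convert(); as.is = TRUE keeps
# character columns as character rather than converting to factor
typed <- utils::type.convert(res, as.is = TRUE)
str(typed)
```

The trade-off versus the CAST-laden SQL above is determinism: type guessing happens per result set, so two queries over the same file can come back with different types if the sampled values differ.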

davidski (Author) commented Jan 7, 2019

Sounds like a plan. I don't suppose you know of any dbWriteTable()-like magic that would make writing out a skeleton fully qualified CTAS query easier? I have some relatively wide tables that the current type conversion does a good job against. Ideally I'd like a flow for:

  • pulling in the first couple of records of a file w/ a best guess on types
  • converting the dataframe to a fully qualified CREATE TABLE query
  • inspecting the query manually
  • concatenating the CREATE TABLE with a SELECT for a Drill CTAS file conversion

After a few columns of manual query creation my typo probability increases. 😬

If the rebase item you mention in #29 (comment) isn't something you need tracking, please do feel free to close this issue.

Appreciate the package!

hrbrmstr (Owner) commented Jan 8, 2019

Something like…

ctas_profile <- function(x, new_table_name = "CHANGE____ME") {
  
  # assumes dplyr (collect, case_when) and readr (type_convert) are attached
  stopifnot(inherits(x, "tbl_drill"))
  
  # grab the first few rows and let readr guess the column types
  vals <- collect(head(x))
  
  vals <- suppressMessages(type_convert(vals))
  
  data_type <- function(x) {
    switch(
      class(x)[1],
      integer64 = "BIGINT",
      logical = "BOOLEAN",
      integer = "INTEGER",
      numeric = "DOUBLE",
      factor =  "VARCHAR",
      character = "VARCHAR",
      Date = "DATE",
      POSIXct = "TIMESTAMP"
    )
  }
  
  field_types <- vapply(vals, data_type, character(1))
  
  sprintf(
    "  CAST(`%s` AS %s) AS `%s`", 
    names(field_types), 
    field_types, 
    names(field_types)
  ) -> casts
  
  casts <- unlist(strsplit(paste0(casts, collapse=",\n"), "\n"))
  
  case_when(
    grepl("AS TIMESTAMP", casts) ~ sprintf("%s -- ** Change this TO_TIMESTAMP() with a proper format string ** (will eventually auto-convert)", casts),
    grepl("AS DATE", casts) ~ sprintf("%s -- ** Change this TO_DATE() with a proper format string ** (will eventually auto-convert)", casts),
    TRUE ~ casts
  ) -> casts
  
  orig_query <- x$ops$x
  
  if (!grepl("select", orig_query, ignore.case=TRUE)) {
    orig_query <- sprintf("SELECT * FROM %s", orig_query)
  }
  
  sprintf(
    "CREATE TABLE %s AS\nSELECT\n%s\nFROM (%s)\n", 
    new_table_name,
    paste0(casts, collapse="\n"),
    orig_query
  )  
  
}

A couple of light tests.

Bare data source

flt1 <- tbl(db, "dfs.d.`/flights.csvh`")

cat(
  ctas_profile(flt1)
)
CREATE TABLE CHANGE____ME AS
SELECT
  CAST(`year` AS DOUBLE) AS `year`,
  CAST(`month` AS DOUBLE) AS `month`,
  CAST(`day` AS DOUBLE) AS `day`,
  CAST(`dep_time` AS DOUBLE) AS `dep_time`,
  CAST(`sched_dep_time` AS DOUBLE) AS `sched_dep_time`,
  CAST(`dep_delay` AS DOUBLE) AS `dep_delay`,
  CAST(`arr_time` AS DOUBLE) AS `arr_time`,
  CAST(`sched_arr_time` AS DOUBLE) AS `sched_arr_time`,
  CAST(`arr_delay` AS DOUBLE) AS `arr_delay`,
  CAST(`carrier` AS CHARACTER) AS `carrier`,
  CAST(`flight` AS DOUBLE) AS `flight`,
  CAST(`tailnum` AS VARCHAR) AS `tailnum`,
  CAST(`origin` AS VARCHAR) AS `origin`,
  CAST(`dest` AS VARCHAR) AS `dest`,
  CAST(`air_time` AS DOUBLE) AS `air_time`,
  CAST(`distance` AS DOUBLE) AS `distance`,
  CAST(`hour` AS DOUBLE) AS `hour`,
  CAST(`minute` AS DOUBLE) AS `minute`,
  CAST(`time_hour` AS TIMESTAMP) AS `time_hour` -- ** Change this TO_TIMESTAMP() with a proper format string (will eventually auto-convert)
FROM (SELECT * FROM dfs.d.`/flights.csvh`)

Carved out fields with SELECT

flt2 <- tbl(db, sql("SELECT `year`, tailnum, time_hour FROM dfs.d.`/flights.csvh`"))

cat(
  ctas_profile(flt2, "dfs.d.`flights.parquet`")
)
CREATE TABLE dfs.d.`flights.parquet` AS
SELECT
  CAST(`year` AS DOUBLE) AS `year`,
  CAST(`tailnum` AS VARCHAR) AS `tailnum`,
  CAST(`time_hour` AS TIMESTAMP) AS `time_hour` -- ** Change this TO_TIMESTAMP() with a proper format string (will eventually auto-convert)
FROM (SELECT `year`, tailnum, time_hour FROM dfs.d.`/flights.csvh`)

I'll be augmenting it to handle figuring out the timestamp and date formats and replacing the CAST with the appropriate converter and get it into 0.8.0 (just running short on time right now).
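For the timestamp case, the eventual replacement would presumably swap the flagged CAST for Drill's TO_TIMESTAMP() with a Joda-style format string. A sketch (the format string below matches readr::write_csv()'s ISO 8601 output and is an assumption; the actual string depends on the data):

```sql
-- Illustrative replacement for the flagged CAST line; literal characters
-- in the Joda pattern are quoted, and quotes are doubled inside the SQL string.
SELECT
  TO_TIMESTAMP(time_hour, 'yyyy-MM-dd''T''HH:mm:ss''Z''') AS time_hour
FROM dfs.d.`/flights.csvh`
```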

hrbrmstr (Owner) commented Jan 8, 2019

I put this WIP ctas_profile() into the 0.8.0 branch as well.

hrbrmstr added this to In Progress in 0.8.0 Release Jan 8, 2019
hrbrmstr added this to the 0.8.0 release milestone Jan 8, 2019
0.8.0 Release automation moved this from In Progress to Done Jan 16, 2019