
Column data types not being parsed under v0.8.0-dev #29

Closed
davidski opened this issue Jan 7, 2019 · 8 comments

davidski commented Jan 7, 2019

Under sergeant master (20190106), CSVH files read in over DBI (src_drill()) are coming across as character columns, while the same file parsed under the CRAN v0.5.0 sergeant is properly coerced into R data types.

Apologies for the lack of a reprex here. I'm in the thick of a Drill-powered project and wanted to throw this early report up in case it was related to the rapidjsonr conversion work and the new dev branches. If previously unknown, I'll generate a reprex when my hair is a bit less on fire. ;)

hrbrmstr self-assigned this Jan 7, 2019
hrbrmstr (Owner) commented Jan 7, 2019

Hey David!

Can you share a sanitized version of the query you're passing to tbl(src, …)?

By default, Drill treats all columns in a CSV as VARCHAR; e.g. this:

library(readr) # write_csv() comes from readr

write_csv(mtcars, "~/Data/mtcars.csv")

httr::POST(
  "http://localhost:8047/query.json",
  encode = "json",
  body = list(
    queryType = "SQL",
    query = "SELECT * FROM dfs.d.`/mtcars.csv` LIMIT 1"
  )
) -> res

cat(
  httr::content(res, as = "text")
)

causes this to be passed on down the wire:

{
  "queryId" : "23ccb7be-d896-51d5-c1d3-7e93a32a769f",
  "columns" : [ "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb" ],
  "rows" : [ {
    "carb" : "4",
    "mpg" : "21",
    "qsec" : "16.46",
    "cyl" : "6",
    "hp" : "110",
    "am" : "1",
    "disp" : "160",
    "wt" : "2.62",
    "vs" : "0",
    "drat" : "3.9",
    "gear" : "4"
  } ],
  "metadata" : [ "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR" ]
}

Would a type_convert = TRUE/FALSE option to tbl() be helpful? i.e. force a bypass of reading the metadata that comes along for the ride? I can add that pretty quickly to the 0.8.0 branch. (I should likely revert the main branch to pre-0.8.0 dev too, since I made the "will dev on branches" decision as part of a new year's resolution :-)

hrbrmstr (Owner) commented Jan 7, 2019

AUGHH!!! They've deprecated the `...` in tbl_sql() (meaning if I make the param change to tbl() it will just break at some point). Would making it a param to src_drill() be OK? That would impact all query results for that db, but it may be a workaround if having R do the heavy lifting (which is really just a call to readr::type_convert(), sans any type hints, after the query result is retrieved) is desired.

davidski (Author) commented Jan 7, 2019

Hrm. 🤔 My inexperience with Drill may be showing here. I've been using CTAS queries to convert CSVs (CSVH, technically) to Parquet. Your comments about Drill always returning VARCHAR are right in line with https://drill.apache.org/docs/text-files-csv-tsv-psv/, which cautions against using CTAS queries in this fashion.

There does seem to be a difference in how v0.5.0 and master operate. The following code under the CRAN-published v0.5.0 gives a typed version of this data set, while the same code under master gives me character() types for all of the columns.

library(tidyverse)
library(sergeant)
data(flights, package = "nycflights13")
flights %>% write_csv(here::here("data/flights.csvh"))
ds <- sergeant::src_drill()
dat <- tbl(ds, "dfs.root.`/data/flights.csvh`")
glimpse(dat)

I may have been relying inadvertently on sergeant magic... Would you recommend explicit conversions using drill_query() to go from CSV -> Parquet, in line with the drill guide? Not as slick as I was hoping, but perhaps more stable. 😁
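For reference, an explicit CTAS conversion along the lines the Drill docs describe might look like the following sketch. The workspace names, output table name, and column list here are illustrative assumptions, not taken from the thread:

```sql
-- Hypothetical CSVH -> Parquet conversion via CTAS with explicit casts.
-- Workspace (dfs.tmp / dfs.root), table name, and columns are made up.
ALTER SESSION SET `store.format` = 'parquet';

CREATE TABLE dfs.tmp.`/flights_parquet` AS
SELECT
  CAST(`year` AS INT)       AS `year`,
  CAST(dep_delay AS DOUBLE) AS dep_delay,
  carrier
FROM dfs.root.`/data/flights.csvh`;
```

Once materialized as Parquet, the types are stored in the files themselves, so subsequent queries return typed columns without any per-query casting.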

hrbrmstr (Owner) commented Jan 7, 2019

Aye. I inadvertently pushed the new code to master and just lazily didn't rebase.

You could also do:

dat <- tbl(ds, sql("
SELECT 
  CAST(`year` AS INT) `year`,
  CAST(`month` as INT) `month`,
  CAST(`day` AS INT) `day`,
  CAST(dep_time AS INT) dep_time,
  CAST(sched_dep_time AS INT) sched_dep_time,
  CAST(dep_delay AS DOUBLE) dep_delay,
  CAST(arr_time AS INT) arr_time,
  CAST(sched_arr_time AS INT) sched_arr_time,
  CAST(arr_delay AS DOUBLE) arr_delay,
  carrier,
  CAST(flight AS VARCHAR) flight,
  tailnum,
  origin
FROM dfs.data.`flights.csvh`
"))

glimpse(dat)
## Observations: ??
## Variables: 13
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, ...
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, 558, 558, 558, 559, 559, 559, 600, ...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, 600, 600, 600, 600, 559, 600, 600, ...
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1, 0, -1, 0, 0, 1, -8, -3, -4, -4, 0...
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849, 853, 924, 923, 941, 702, 854, 851,...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851, 856, 917, 937, 910, 706, 902, 858,...
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -14, 31, -4, -8, -7, 12, -6, -8, 16, ...
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "B6", "B6", "UA", "UA", "AA", ...
## $ flight         <dbl> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 49, 71, 194, 1124, 707, 1806, 1187, ...
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "N516JB", "N829AS", "N593JB", ...
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA", "JFK", "LGA", "JFK", "JFK", "JFK",...

hrbrmstr (Owner) commented Jan 7, 2019

The above is likely the best way to do it (apart from running readr::type_convert() ex post facto on the query result) since it's deterministic and actually pulls what is expected (provided folks are running the latest Drill). Old Drill should still behave the old way.
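The ex-post-facto route can be sketched with base R alone. This uses utils::type.convert() as a stand-in for readr::type_convert() (readr's version adds date/time parsing and finer guessing control), and the data frame below is made up for illustration:

```r
# Drill returns every CSV column as VARCHAR, so everything arrives in R
# as character; let R guess the real types after the fact.
res <- data.frame(
  mpg  = c("21", "22.8"),               # should parse to double
  cyl  = c("6", "4"),                   # should parse to integer
  name = c("Mazda RX4", "Datsun 710"),  # should stay character
  stringsAsFactors = FALSE
)

# base-R stand-in for readr::type_convert(); as.is = TRUE keeps
# character columns as character rather than converting to factor
typed <- utils::type.convert(res, as.is = TRUE)
str(typed)
```

The trade-off versus the CAST-laden SQL above is determinism: type guessing happens per result set, so two queries over the same file can come back with different types if the sampled values differ.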

davidski (Author) commented Jan 7, 2019

Sounds like a plan. I don't suppose you know of any dbWriteTable()-like magic that would make writing out a skeleton fully qualified CTAS query easier? I have some relatively wide tables that the current type conversion does a good job against. Ideally I'd like a flow for:

  • pulling in the first couple of records of a file w/ a best guess on types
  • converting the dataframe to a fully qualified CREATE TABLE query
  • inspecting the query manually
  • concatenating the CREATE TABLE with a SELECT for a Drill CTAS file conversion

After a few columns of manual query creation my typo probability increases. 😬

If the rebase item you mention in #29 (comment) isn't something you need tracking, please do feel free to close this issue.

Appreciate the package!

hrbrmstr (Owner) commented Jan 8, 2019

Something like…

ctas_profile <- function(x, new_table_name = "CHANGE____ME") {
  
  # assumes dplyr (collect, case_when) and readr (type_convert) are attached
  stopifnot(inherits(x, "tbl_drill"))
  
  # grab the first few rows and let readr guess the column types
  vals <- collect(head(x))
  
  vals <- suppressMessages(type_convert(vals))
  
  data_type <- function(x) {
    switch(
      class(x)[1],
      integer64 = "BIGINT",
      logical = "BOOLEAN",
      integer = "INTEGER",
      numeric = "DOUBLE",
      factor =  "VARCHAR",
      character = "VARCHAR",
      Date = "DATE",
      POSIXct = "TIMESTAMP"
    )
  }
  
  field_types <- vapply(vals, data_type, character(1))
  
  sprintf(
    "  CAST(`%s` AS %s) AS `%s`", 
    names(field_types), 
    field_types, 
    names(field_types)
  ) -> casts
  
  casts <- unlist(strsplit(paste0(casts, collapse=",\n"), "\n"))
  
  case_when(
    grepl("AS TIMESTAMP", casts) ~ sprintf("%s -- ** Change this TO_TIMESTAMP() with a proper format string ** (will eventually auto-convert)", casts),
    grepl("AS DATE", casts) ~ sprintf("%s -- ** Change this TO_DATE() with a proper format string ** (will eventually auto-convert)", casts),
    TRUE ~ casts
  ) -> casts
  
  orig_query <- x$ops$x
  
  if (!grepl("select", orig_query, ignore.case=TRUE)) {
    orig_query <- sprintf("SELECT * FROM %s", orig_query)
  }
  
  sprintf(
    "CREATE TABLE %s AS\nSELECT\n%s\nFROM (%s)\n", 
    new_table_name,
    paste0(casts, collapse="\n"),
    orig_query
  )  
  
}

A couple of light tests.

Bare data source

flt1 <- tbl(db, "dfs.d.`/flights.csvh`")

cat(
  ctas_profile(flt1)
)
CREATE TABLE CHANGE____ME AS
SELECT
  CAST(`year` AS DOUBLE) AS `year`,
  CAST(`month` AS DOUBLE) AS `month`,
  CAST(`day` AS DOUBLE) AS `day`,
  CAST(`dep_time` AS DOUBLE) AS `dep_time`,
  CAST(`sched_dep_time` AS DOUBLE) AS `sched_dep_time`,
  CAST(`dep_delay` AS DOUBLE) AS `dep_delay`,
  CAST(`arr_time` AS DOUBLE) AS `arr_time`,
  CAST(`sched_arr_time` AS DOUBLE) AS `sched_arr_time`,
  CAST(`arr_delay` AS DOUBLE) AS `arr_delay`,
  CAST(`carrier` AS CHARACTER) AS `carrier`,
  CAST(`flight` AS DOUBLE) AS `flight`,
  CAST(`tailnum` AS VARCHAR) AS `tailnum`,
  CAST(`origin` AS VARCHAR) AS `origin`,
  CAST(`dest` AS VARCHAR) AS `dest`,
  CAST(`air_time` AS DOUBLE) AS `air_time`,
  CAST(`distance` AS DOUBLE) AS `distance`,
  CAST(`hour` AS DOUBLE) AS `hour`,
  CAST(`minute` AS DOUBLE) AS `minute`,
  CAST(`time_hour` AS TIMESTAMP) AS `time_hour` -- ** Change this TO_TIMESTAMP() with a proper format string (will eventually auto-convert)
FROM (SELECT * FROM dfs.d.`/flights.csvh`)

Carved out fields with SELECT

flt2 <- tbl(db, sql("SELECT `year`, tailnum, time_hour FROM dfs.d.`/flights.csvh`"))

cat(
  ctas_profile(flt2, "dfs.d.`flights.parquet`")
)
CREATE TABLE dfs.d.`flights.parquet` AS
SELECT
  CAST(`year` AS DOUBLE) AS `year`,
  CAST(`tailnum` AS VARCHAR) AS `tailnum`,
  CAST(`time_hour` AS TIMESTAMP) AS `time_hour` -- ** Change this TO_TIMESTAMP() with a proper format string (will eventually auto-convert)
FROM (SELECT `year`, tailnum, time_hour FROM dfs.d.`/flights.csvh`)

I'll be augmenting it to handle figuring out the timestamp and date formats and replacing the CAST with the appropriate converter and get it into 0.8.0 (just running short on time right now).
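For the timestamp case, the eventual replacement would presumably swap the flagged CAST for Drill's TO_TIMESTAMP() with a Joda-style format string. A sketch (the format string below matches readr::write_csv()'s ISO 8601 output and is an assumption; the actual string depends on the data):

```sql
-- Illustrative replacement for the flagged CAST line; literal characters
-- in the Joda pattern are quoted, and quotes are doubled inside the SQL string.
SELECT
  TO_TIMESTAMP(time_hour, 'yyyy-MM-dd''T''HH:mm:ss''Z''') AS time_hour
FROM dfs.d.`/flights.csvh`
```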

hrbrmstr (Owner) commented Jan 8, 2019

I put this WIP ctas_profile() into the 0.8.0 branch as well.

hrbrmstr added this to In Progress in 0.8.0 Release Jan 8, 2019
hrbrmstr added this to the 0.8.0 release milestone Jan 8, 2019
0.8.0 Release automation moved this from In Progress to Done Jan 16, 2019