-
Notifications
You must be signed in to change notification settings - Fork 0
/
when.Rmd
642 lines (428 loc) · 25.1 KB
/
when.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
---
title: "Generating the When Dimension based on date and time tables"
date: "2024-01-08"
output: rmarkdown::html_vignette
bibliography: bibliography.bib
vignette: >
%\VignetteIndexEntry{Generating the When Dimension based on date and time tables}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
# Introduction
The *When Dimension* plays a fundamental role in *Multidimensional Systems*, it allows us to express **when** the analysed focus of attention have occurred.
This dimension corresponds to the generically called “Time Dimension”, as named in @kimball1996data and @adamson2010star. Later it has also been called "Date Dimension" (date-based dimension) and "Time Dimension" (time-of-day-based dimension) to express granularity [@kimball2013data]. We prefer to call it "When Dimension" because granularity is not involved in that term.
Although conceptually the date and time can be represented together in a single dimension, in ROLAP (*Relational On-Line Analytical Processing*) star database systems, it is common to implement these concepts separately, sometimes in separate tables "to avoid a row count explosion in the date dimension" [@kimball2013data] by adding the time for each day.
In @kimball2013data these dimensions are considered *static dimensions*: they are built at the beginning of the project for the period under consideration and are referenced when adding the instances to the fact table. As mentioned there: "Typically, these dimensions are built in an afternoon with a spreadsheet."
The purpose of the `when` package is to assist in the implementation of the ***When Dimension*** in Multidimensional Systems implemented on a ROLAP star database, regardless of its granularity, with the following features:
- It supports the generation of tables with the granularity needed (second, minute, hour, date, week, month, quarter, semester or year) according to the relational star schema design criteria [@kimball2013data].
- We consider separate tables for **time-of-day** data and data obtained from **date** to avoid a row count explosion if a single table were used.
- It can be used to generate tables **statically**, indicating a period, or **dynamically**, indicating data series.
- In addition to attributes at the selected granularity, other attributes can be added at coarser granularity to facilitate querying. We can also add a surrogate key.
- We can obtain the tables in `tibble` format and in other formats (*xlsx*, *csv* or *Relational DBMS* table).
It relies on the functions offered by the [`lubridate`](https://CRAN.R-project.org/package=lubridate) package to obtain the components from the date and time.
The rest of this document is structured as follows: First, the general process of defining these dimensions is presented. Next, we present dimension-specific aspects that we can define, both for date-based and time-based dimensions. Finally, the document ends with conclusions and bibliography.
# Definition process
The definition process consists of the following steps:
1. Creating an object.
1. Definition of general aspects.
1. Definition of instances.
1. Selection of levels and attributes.
1. Generation of the result.
1. Getting the dimension table.
Steps 2, 3 and 4 can be done in any order, they can even be included in step 1.
Below we discuss each of them.
## Creating an object
The definition process begins by creating an object of the `when` class.
```{r setup}
library(when)
w_date <- when()
```
By default, an object is created based on date, but we can also create it based on time by indicating it using the `type` parameter.
```{r}
w_time <- when(type = 'time')
```
Virtually all configuration options can be defined using the object creation function. They can also be defined later using specific functions, as shown in the following section.
## Definition of general aspects {#general_aspects}
For example, using the following combination of functions we get the same result as defining the type when we create the object.
```{r}
w_time_2 <- when() |>
define_characteristics(type = 'time')
identical(w_time, w_time_2)
```
By default it considers the system language. Since our system is Spanish, for the literals of day and month names to appear in English, we must configure it explicitly.
```{r, eval=FALSE}
w_date <- w_date |>
define_characteristics(locale = Sys.setlocale("LC_TIME", "English"))
```
In section [Other configuration options](#other_config_options) we present the rest of the configuration options for dimension generation. For the examples that we are going to use, the default options of the package are enough.
Next we have to select and configure the levels and define the instances. These operations, along with the current one, can be performed in any order.
## Definition of instances
We can define the instances by giving a range of values or by indicating the specific instances. The values in both cases will depend on the type of dimension (date-based or time-based).
We can indicate these values at the time of creating the class object or through the `define_instances()` function.
- In the case of date, we can indicate a year, year and month (with the format `yyyy-mm`), or year month and day (`yyyy-mm-dd`).
- In the case of time, we can indicate an hour, hour and minute (with the format `hh:mm`), or hour, minute and second (`hh:mm:ss`).
In both cases we can use the string format, the one provided by the [`lubridate`](https://CRAN.R-project.org/package=lubridate) package or, when it is a single number (year or hour), we can use an integer value.
For example, below they are used with the format of `lubridate`.
```{r}
w_date <- w_date |>
define_instances(start = lubridate::today(),
end = lubridate::today() + lubridate::years(5))
```
In the following definitions we configure the dimensions so that they consider only the indicated dates. The second will be appropriate if we want to obtain the dimension at the year level, because we are indicating the specific values to consider (not a period).
```{r}
w_date_2_1 <-
when(
values = c(
"2023-12-31",
"2023-01-01",
"2022-12-31",
"2022-01-01",
"2021-12-31",
"2021-01-01"
)
)
w_date_2_2 <- w_date |>
define_instances(values = 2020:2030)
```
The following definition will be appropriate at any level of detail because we are giving the range of values to consider (dates between those two years).
```{r}
w_date_3 <- w_date |>
define_instances(start = 2020, end = 2030)
```
When only the year is indicated, it is considered its first day. Therefore the previous definition is equivalent to the following one.
```{r}
w_date_4 <- w_date |>
define_instances(start = "2020-01-01", end = "2030-01-01")
identical(w_date_3, w_date_4)
```
In the case of time, by default, it is considered at second level and all seconds of the day.
```{r}
w_time_3 <- w_time |>
define_instances(start = "00:00:00", end = "23:59:59")
identical(w_time, w_time_3)
```
We can also indicate a period or a set of times according to the indicated criteria. In the following example, the two definitions are equivalent.
```{r}
w_time_4 <- w_time |>
define_instances(start = 8, end = 17)
w_time_5 <- w_time |>
define_instances(start = "08:00:00", end = "17:00:00")
identical(w_time_4, w_time_5)
```
## Selection of levels and attributes
In the definition, by default, the attributes that we have considered to be most frequently used have been included. We can consult them for each level using the `get_level_attribute_names()` function with the parameter `selected = TRUE`. If we define the parameter as `selected = FALSE` (default option), it will return all available attributes. We can indicate the name of the level or, if we do not indicate any, it will obtain the attributes of all the levels. We obtain the level names using the `get_level_names()` function. Below are examples of using these functions for the date-based dimension.
```{r}
w_date |>
get_level_attribute_names(selected = TRUE)
w_date |>
get_level_names()
w_date |>
get_level_attribute_names(name = 'month', selected = TRUE)
w_date |>
get_level_attribute_names(name = 'month')
```
Also for the time-based dimension.
```{r}
w_time |>
get_level_attribute_names()
w_time |>
get_level_names()
```
There is a function to configure each level. We can select or deselect each attribute individually using the attribute name as a boolean parameter of the function. We can also select or deselect all attributes defined at the level using the special `include_all` and `exclude_all` parameters: Once all the attributes have been updated with these special parameters, in the same function call, we can select or deselect the attributes we want.
For example, if we do not want to include the name of the month, we can configure it using the `select_month_level()` function or at the time of object creation with a parameter of the same name, as shown below.
```{r}
w_date_5 <- w_date |>
select_month_level(month_name = FALSE)
w_date_6 <- when(
start = lubridate::today(),
end = lubridate::today() + lubridate::years(5),
month_name = FALSE
)
identical(w_date_5, w_date_6)
w_date_5 |>
get_level_attribute_names(name = 'month', selected = TRUE)
```
If we only want to include the name of the month, we can do it as follows.
```{r}
w_date_7 <- w_date |>
select_month_level(exclude_all = TRUE, month_name = TRUE)
w_date_7 |>
get_level_attribute_names(name = 'month', selected = TRUE)
```
There is available a similar function for each level: time, date, week, month, quarter, semester and year.
In the case of the date-based dimension levels, we can select or deselect them directly using the `select_date_levels()` function. It allows us to treat the levels globally in the same way that we have been treating the attributes. For example, to exclude the month level we can do the following.
```{r}
w_date_8 <- w_date |>
select_date_levels(month_level = FALSE)
w_date_8 |>
get_level_attribute_names(name = 'month', selected = TRUE)
```
We can also perform that same operation at object creation time, as shown below.
```{r}
w_date_9 <- when(
start = lubridate::today(),
end = lubridate::today() + lubridate::years(5),
month_level = FALSE
)
```
In the case of a time-based dimension, as we have seen, we only consider one level called "time", as shown again below.
```{r}
w_time |>
get_level_names()
```
We can configure the attributes it includes using the `select_time_level()` function. The attribute called "hour" must always be included. If we do not include the "minute" attribute then "second" will not be included either. Below are some examples of definition and the result obtained.
```{r}
w_time_6 <- w_time |>
select_time_level(exclude_all = TRUE)
w_time_6 |>
get_level_attribute_names(selected = TRUE)
w_time_7 <- w_time |>
select_time_level(minute = FALSE)
w_time_7 |>
get_level_attribute_names(selected = TRUE)
```
## Generation of the result
Once the characteristics of the dimension are defined, we can generate its table.
Although we have not yet generated the dimension table, the following function returns the names of the attributes it will contain. They can be obtained as a string or as a vector of strings.
```{r}
w_date |>
get_table_attribute_names(as_string = FALSE)
```
If it is generated in the form of a string (`as_string = TRUE`, default option), the result can be used to easily change the name of the attributes, when the table is generated, using the `set_table_attribute_names()` function, if deemed necessary, as shown in section [Renaming the attributes](#rename_att).
When we consider that the dimension is already properly defined, we can generate its corresponding table using the `generate_table()` function. We update the definition of the object. Any errors found will be notified to us.
```{r}
w_date <- w_date |>
generate_table()
w_time <- w_time |>
generate_table()
```
Once the table is generated, we can export it. If we want to access the table and it is not yet generated, it will send us an error message.
If we make changes to the configuration, it is our responsibility to generate the table again.
## Getting the dimension table {#getting_dim_table}
We can access the dimension table to work with it in R (for example using the [`rolap`](https://CRAN.R-project.org/package=rolap) package), or export it in the format that is most convenient for us to work with other tools.
Next we access the generated tables and display them. First of all for the date.
```{r}
t_date <- w_date |>
get_table()
```
The first and last rows of the obtained result are shown below.
```{r, results = "asis"}
pander::pandoc.table(rbind(head(t_date, 5), tail(t_date, 5)),
split.table = Inf)
```
Let's generate the dimension at the week level including all its attributes (if we change the definition, we have to generate the table again).
```{r}
t_date <- w_date |>
select_date_levels(day_level = FALSE) |>
select_week_level(include_all = TRUE) |>
generate_table() |>
get_table()
```
The first and last rows of the obtained result are shown below.
```{r, results = "asis"}
pander::pandoc.table(rbind(head(t_date, 5), tail(t_date, 5)),
split.table = Inf)
```
Next for the time.
```{r}
t_time <- w_time |>
get_table()
```
The start and end of the result table is shown below.
```{r, results = "asis"}
pander::pandoc.table(rbind(head(t_time, 5), tail(t_time, 5)),
split.table = Inf)
```
`day_part` is predefined in English. Literals or associated hours can be accessed and changed using the `get_day_part()` and `set_day_part()` functions, as shown in section [Names of the parts of the day](#name_parts_day).
Now we are going to generate it at the minute level.
```{r}
t_time <- w_time |>
select_time_level(second = FALSE) |>
generate_table() |>
get_table()
```
The start and end from the new result table is shown below.
```{r, results = "asis"}
pander::pandoc.table(rbind(head(t_time, 5), tail(t_time, 5)),
split.table = Inf)
```
We can take the generated tables to any Relational DBMS, as shown below, we just have to pass a connection to the database as a parameter.
```{r database}
my_db <- DBI::dbConnect(RSQLite::SQLite())
w_date |>
get_table_rdb(my_db)
w_time |>
get_table_rdb(my_db)
DBI::dbListTables(my_db)
DBI::dbDisconnect(my_db)
```
In the previous example, since we have not explicitly defined any names for the tables in the configuration, they have been assigned as a name the element on which they are based (date or time). We can define these names, as shown in section [Table name](#table_name).
Functions to get the tables in *xlsx* and *csv* format are also available: `get_table_xlsx()` and `get_table_csv()`.
# Other configuration options {#other_config_options}
In section [Definition of general aspects](#general_aspects) we have presented the necessary configuration options for the examples we have used. In particular, we have only needed to define the type on which the dimension is based (time or date) and the language for naming days and months (because it is different from the language of our operating system). The default options have been adequate for the examples shown.
In this section we are going to show other configuration options that may be necessary to define our dimensions.
## Dimension table features
The result of a definition operation performed with the `where` package is a table of a dimension at a given level of granularity, with the attributes selected for that level or associated coarser levels. So far we have focused on level attributes and table instances. In this section we are going to configure other characteristics of it.
### Table name {#table_name}
By default, the table is assigned the name of the base component used to generate the dimension: date or time. We have seen it in the tables that we have stored in the database, in the example of the section [Getting the dimension table](#getting_dim_table).
We can define the name of the dimension at the time of object creation or through the `define_characteristics()` function, as shown below.
```{r}
wd_1 <- when(name = 'dim_where')
wd_2 <- when() |>
define_characteristics(name = 'dim_where')
```
This name is the one that will be assigned to the database table where we store it or to the files we obtain. Below is the case for a relational database.
```{r}
my_db <- DBI::dbConnect(RSQLite::SQLite())
wd_1 |>
generate_table() |>
get_table_rdb(my_db)
DBI::dbListTables(my_db)
DBI::dbDisconnect(my_db)
```
Also the case of obtaining it in *csv* format (if we do not indicate a location, it stores it in one by default).
```{r}
wd_2 |>
generate_table() |>
get_table_csv()
```
### Surrogate key
By default the dimension table will include a surrogate key. In some situations we may be interested in the table having the date itself as its primary key (for example if we want to partition the fact table by the date).
Through the `surrogate_key` parameter, included in the `define_characteristics()` function and also the `when()` function, we can indicate that a surrogate key is not created, as it's shown in the following.
```{r}
when() |>
get_table_attribute_names(as_string = FALSE)
when(surrogate_key = FALSE) |>
get_table_attribute_names(as_string = FALSE)
```
The name of the surrogate key is `id`, is the first attribute of the table. If necessary, we can rename the attributes, including the surrogate key as shown in the next section.
### Renaming the attributes {#rename_att}
If we get a table in `tibble` format, we can rename its attributes using the functions of this class. But if we want to obtain it in a file or in a database, we may be interested in renaming its attributes before obtaining it.
In order to rename the attributes, we first have to generate the table.
```{r}
wd_3 <- when() |>
generate_table()
```
We can rename them using the `set_table_attribute_names()` function, passing a name for each of the available attributes. To facilitate the operation, using the `get_table_attribute_names()` function, we can obtain the attributes in string format. With this we only have to replace the names that we want to change, as shown below.
```{r}
wd_3 |>
get_table_attribute_names()
wd_3 <- wd_3 |>
set_table_attribute_names(
c(
'id_when',
'date',
'month_day',
'week_day',
'day_name',
'day_num_name',
'year_week',
'week',
'year_month',
'month',
'month_name',
'month_num_name',
'year'
)
)
wd_3 |>
get_table_attribute_names(as_string = FALSE)
```
## Names of the parts of the day {#name_parts_day}
We have associated each hour with the usual part of the day in English. We can consult them and change them if necessary (for example, to define them in another language) using the `get_day_part()` and `set_day_part()` functions.
Below is an example where the definition is modified to consider only two parts in the day.
```{r}
when() |>
get_day_part()
when() |>
set_day_part(hour = c(20:23, 0:5), name = "Night") |>
set_day_part(hour = c(6:19), name = "Day") |>
get_day_part()
```
## Configuration of day and week levels
This section shows the configuration options available for the day and week levels.
### The first day of the week
For days, in addition to the language of the name of the day of the week, we can define the start day of the week. There are two possibilities: Monday or Sunday. This determines the numbering of the days.
We can define it using the parameter `week_starts_monday` (which by default has the value TRUE) in the functions `define_characteristics()` or `when()`, as shown below.
```{r}
wd_1 <- when(week_starts_monday = FALSE)
wd_2 <- when() |>
define_characteristics(week_starts_monday = FALSE)
```
### Numbering of weeks in the year
Using the value of the `type` parameter (`type = "date"` or `type = "time"`) in the `define_characteristics()` or `when()` functions we can indicate whether the dimension is based on date or time.
This parameter can additionally take two more values: "iso" and "epi". Using these values (together with the "date" value) we can indicate the type of numbering of the weeks in the year. That is, the week number associated with each date depends on the type of date dimension selected:
- Default (`type = "date"`): It numbers blocks of 7 days beginning on January 1 (regardless of what day of the week it is). The last week of the year will last less than 7 days.
- ISO 8601 (`type = "iso"`): It numbers blocks of 7 days from Monday to Sunday. The first and last week of the year can contain days from the previous or next year.
- Epidemiological (`type = "epi"`): This week is like ISO 8601 only that it considers that the week begins on Sunday.
For the first and last days of the year, the year associated with the week may be different from the year of the date, depending on the type selected ("iso" or "epi").
The "year" attribute (alone or in combination with other attributes) displays the year associated with the date, regardless of the type selected. Only the "year_week" attribute displays the year of the week according to the selected type.
The next section shows how to obtain the attribute obtaining function and how to change it if we consider it necessary (for example, to get the different year types together).
### Get date range of weeks
If we want to work with the dimension at the week level, we may be interested in having the start or end date for each week. To obtain them and be able to use them to generate the dimension, we have the `get_week_date_range()` function.
```{r}
t <- get_week_date_range(start = "2024-01-01", end = "2029-12-31")
```
A period and the type of week numbering ("date", "iso" or "epi") are indicated (by default it is "date"). Returns the result in `tibble` form.
The start and end from the new result is shown below.
```{r, results = "asis"}
pander::pandoc.table(rbind(head(t, 5), tail(t, 5)),
split.table = Inf)
```
We can use the column of the dates that interest us to generate the dimension. Additionally, we leave only the date and the levels that interest us (week and year).
```{r}
tw <- when(values = t[["last"]], month_level = FALSE) |>
select_day_level(exclude_all = TRUE, date = TRUE) |>
generate_table() |>
get_table()
```
The start and end from the new result table is shown below.
```{r, results = "asis"}
pander::pandoc.table(rbind(head(tw, 5), tail(tw, 5)),
split.table = Inf)
```
## Change the definition function of an attribute {#change_def_function_att}
Each attribute is defined by a specific function. All definition functions have the same structure: they receive a *table*, a column with the date or time *values*, possibly additional configuration parameters, and return the *table* to which a new column with the attribute name and values has been added, calculated from the values.
We can get the definition function of any attribute using the `get_attribute_definition_function()` function. Below is the function for the "year" attribute and also the function of the year and week combination, "year_week" attribute (it uses one of the additional parameters available, the dimension type).
```{r}
wd <- when()
wd |>
get_attribute_definition_function(name = "year")
wd |>
get_attribute_definition_function(name = "year_week")
```
In view of the functions, we can define one with the same structure and establish it as a new attribute definition function using the `set_attribute_definition_function()` function. For example, below shows how to change the year attribute definition function to get an additional attribute if the year associated with the week can be different from the year of the date.
```{r}
f <- function(table, values, ...) {
dots <- list(...)
type <- dots[['type']]
table[['year']] <- as.character(lubridate::year(values))
if (type == 'iso') {
table[['week_year']] <- as.character(lubridate::isoyear(values))
} else if (type == 'epi') {
table[['week_year']] <- as.character(lubridate::epiyear(values))
}
table
}
wd <- wd |>
set_attribute_definition_function(name = "year", f)
```
In this case, what we do is add two columns for the year: one with the year of the date and another with the year of the week if this can be different from the first. We can see the result by changing the type of the dimension and generating the table, as shown below (the two year attributes have been added).
```{r}
t <- wd |>
define_characteristics(type = 'iso') |>
generate_table() |>
get_table()
names(t)
```
# Conclusions
The `when` package offers a set of operations that allow us to generate the *When Dimension* based on date or time at the level of detail we need.
We create an object of the `when` class and, from it, we configure the instances and characteristics of the levels and attributes of the dimension. To define the instances we can indicate a range of values or provide the specific values from which to generate the rest of the attributes.
As a result, we can obtain tables for this dimension to work from R, for example with the [`rolap`](https://CRAN.R-project.org/package=rolap) package, or work in other environments, including Relational DBMSs.
# Bibliography