-
Notifications
You must be signed in to change notification settings - Fork 0
/
v40-refresh.Rmd
502 lines (394 loc) · 18.4 KB
/
v40-refresh.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
---
title: "Incremental refresh of star databases"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Incremental refresh of star databases}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r setup, echo = FALSE, message=FALSE}
library(rolap)
```
# Introduction
Once we have star databases built with the data available at the moment, periodically we may obtain additional data, with the same structure as the initial data but from a later time or from another place. Sometimes the new data also contains instances which are already included in the star database to operate on them.
Suppose we need to jointly analyse all the data available at the moment: We must include them in the star database. One possibility is to integrate all available data into a flat table and build the star database again. Another possibility is to use the *incremental refresh* mechanism described in this document.
This document shows by means of an example the possibilities offered by the package in this context. First, the starting data sets are presented. The next section shows how to generate the refresh structures. The following section shows how to perform the incremental refresh. Finally, we present how to make changes to the transformation functions and add these changes for future refresh operations. Finish with the conclusions.
# Starting data sets
The starting data set is the content in the variable `mrs_db`, obtained in the vignette titled *Obtaining and transforming flat tables*, `vignette("v05-flat-table-op")`. It contains the constellation, formed by two star databases. Next we get their names.
```{r}
library(rolap)
mrs_db |>
get_fact_names()
```
The code to generate the constellation from the initial data is available in the vignette. Below is a graphic representation of the tables that make it up.
```{r}
db_dm <- mrs_db |>
as_dm_class(pk_facts = FALSE)
db_dm |>
dm::dm_draw(rankdir = "LR", view_type = "all")
```
From the original data source (the [Deaths in 122 U.S. cities - 1962-2016. 122 Cities Mortality Reporting System](https://catalog.data.gov/dataset/deaths-in-122-u-s-cities-1962-2016-122-cities-mortality-reporting-system) dataset), suppose **we obtain a set of data that we want to integrate with the previous data to analyse them together**.
We have stored it in the package, in a file with the same format as the original file which only contains a small portion of the original data. We have made sure that there is data that was already included in the data set considered to obtain the content of the `mrs_db` constellation and also new data. It is accessed below.
```{r, echo=FALSE}
# mrs_ft_new: Declared as a variable instead of reading from the file due to problem building macos-latest (release)
```
```{r, eval=FALSE}
file <-
system.file(
"extdata/mrs",
"mrs_122_us_cities_1962_2016_new.csv",
package = "rolap"
)
mrs_ft_new <-
read_flat_table_file(name = 'mrs new', file)
```
Using the `read_flat_table_file()` function we read a table stored in a text file and create a `flat_table` object with a name. Below are the first records of the table. We access the table using the `get_table()` function for the object of the `flat_table` class.
```{r, results = "asis"}
ft <- mrs_ft_new |>
get_table()
pander::pandoc.table(head(ft), split.table = Inf)
```
# Generation of star database refresh structures
From the table that we have read, with the information contained in the constellation variable, we can automatically generate refresh structures for the star databases that make up the constellation.
## Refresh structures
From the data read in the form of a `flat_table` object, using the `update_according_to()` function, we generate a refresh structure of the star database of the `mrs_db` constellation whose name is indicated.
```{r}
mrs_db_age_refresh <- mrs_ft_new |>
update_according_to(mrs_db, star = "mrs_age")
```
The modification structure contains a star database generated from the read data. We can see its graphical representation below.
```{r}
db_dm <- mrs_db_age_refresh |>
get_star_database() |>
as_dm_class(pk_facts = FALSE)
db_dm |>
dm::dm_draw(rankdir = "LR", view_type = "all")
```
In exactly the same way, we generate the refresh structure for the other star database.
```{r}
mrs_db_cause_refresh <- mrs_ft_new |>
update_according_to(mrs_db, star = "mrs_cause")
```
Below is its graphical representation.
```{r}
db_dm <- mrs_db_cause_refresh |>
get_star_database() |>
as_dm_class(pk_facts = FALSE)
db_dm |>
dm::dm_draw(rankdir = "LR", view_type = "all")
```
## Transformation code
The transformation code to generate the variable `mrs_db` can be obtained in the vignette titled *Obtaining and transforming flat tables*, `vignette("v05-flat-table-op")`.
It is also included in the refresh structure for each of the star databases obtained. It can be obtained using the `get_transformation_code()` and `get_transformation_file()` functions, in the form of a vector of strings and a file respectively.
```{r}
mrs_db_age_refresh |>
get_transformation_code()
```
If additional transformations are needed, it can be modified to our convenience. In the same way we consult it for the other star database, although we don't show the result here.
```{r}
code <- mrs_db_cause_refresh |>
get_transformation_code()
```
We can see that the selection of the star database is done using the `magrittr::extract2()` function, which was not used in the original vignette to select the elements of a list. It has been included here to preserve the pipe syntax throughout the transformation.
This code only needs to be used if we want to perform additional transformations. If the previously defined transformations are sufficient, they have already been automatically applied to the data in order to integrate them.
## Instances included in the structure
Seeing the new instances that are going to be added to the dimensions can help us determine whether or not we need to modify the transformation code. We can do this using the `get_new_dimension_instances()` function, as shown below.
```{r}
mrs_db_age_refresh |>
get_new_dimension_instances()
```
If necessary, starting from the code that we have obtained, we can perform transformations on the starting `flat_table` structure or on the `star_database` object obtained.
We can see that there are two new cities that were not included in the *where* dimension of the initial star databases: *Baltimore* and *Boston*.
# Incremental refresh
The most common thing is that refresh operations only include new instances in fact tables, but it may be the case that repeated instances appear: They may have different values in the measures, but the same values in the dimension foreign keys.
## Existing fact instances
To perform the incremental refresh operation, we must determine what happens to the fact table instances that were already included in the original star database. We can query existing fact instances using the `get_existing_fact_instances()` function, as can be seen below for each of the star databases.
```{r}
mrs_db_age_refresh |>
get_existing_fact_instances()
mrs_db_cause_refresh |>
get_existing_fact_instances()
```
In each case it may be interesting to perform a different operation on these previously existing instances. By default, what is done is to ignore them but we can indicate that they are grouped, replaced or deleted, using the `existing_instances` parameter.
## Incremental refresh operation
To better appreciate the update made, we are going to perform the same query before and after the incremental refresh operation: We obtain the design tables, along with the number of instances of each one. For the *where* dimension table, we get the names of the first cities to check that *Baltimore* and *Boston* were not included.
```{r}
l_db <- mrs_db |>
as_tibble_list()
names <- names(l_db)
for (i in seq_along(l_db)){
cat(sprintf("name: %s, %d rows\n", names[i], nrow(l_db[[i]])))
}
head(sort(l_db[['where']]$city), 15)
```
Next, we perform the incremental refresh for each of the star databases in the original constellation. For one of the star databases, through the `existing_instances` parameter, it has been indicated that the instances of previously existing facts are grouped with the new ones. For the other design, the default option is considered: Ignore fact table previously existing instances.
We will make a copy of the constellation to perform an operation later.
```{r}
mrs_db_seg <- mrs_db
mrs_db <- mrs_db |>
incremental_refresh(mrs_db_age_refresh) |>
incremental_refresh(mrs_db_cause_refresh, existing_instances = "group")
```
Finally, we consult the same data as before again.
```{r}
l_db <- mrs_db |>
as_tibble_list()
names <- names(l_db)
for (i in seq_along(l_db)){
cat(sprintf("name: %s, %d rows\n", names[i], nrow(l_db[[i]])))
}
head(sort(l_db[['where']]$city), 15)
```
We observe how the number of instances has increased in the facts and in the dimensions and the *where* dimension already contains the names of the cities that appeared as new in the update operation.
# Changes in transformation operations
In this case no problem has been detected in the new dimension data. But let's assume that we wanted to make some change to the modify operations before doing the update to show how it would be done.
## New transformation function
First of all, we obtain the transformation function. In this case we have stored it in a temporary file using the `get_transformation_file()` function and we have copied it here.
Starting from the `tibble` of the initial `flat_table`, the `star_schema` to define the `star_database`, the lookup tables and field definition functions that we have used, we can reproduce all the transformations and adapt them to the new situations that hypothetically have arisen.
```{r}
transform_instance_table <-
function(instance_df,
lookup_ft,
definition_fun,
star_sch) {
ft <-
flat_table(name = 'mrs',
instances = instance_df,
unknown_value = 'Not available') |>
transform_to_measure(
attributes = c(
'Pneumonia and Influenza Deaths',
'All Deaths',
'<1 year (all cause deaths)',
'1-24 years (all cause deaths)',
'25-44 years',
'45-64 years (all cause deaths)',
'65+ years (all cause deaths)'
),
k_sep = NULL,
decimal_sep = NULL
) |>
transform_attribute_format(
attributes = 'WEEK',
width = 2,
decimal_places = 0,
k_sep = ',',
decimal_sep = '.'
) |>
replace_empty_values(
attributes = c('Year', 'WEEK', 'Week Ending Date', 'REGION', 'State', 'City'),
empty_values = NULL
) |>
add_custom_column(name = 'city_state',
definition = definition_fun) |>
replace_attribute_values(
attributes = c('City', 'city_state'),
old = c('Wilimington', 'Wilimington DE'),
new = c('Wilmington', 'Wilmington DE')
) |>
join_lookup_table(fk_attributes = 'city_state',
lookup = lookup_ft) |>
select_attributes(
attributes = c(
'Year',
'WEEK',
'Week Ending Date',
'REGION',
'State',
'City',
'city_state',
'status',
'pop',
'lat',
'long'
)
) |>
separate_measures(
measures = list(
c('Pneumonia and Influenza Deaths', 'All Deaths'),
c(
'<1 year (all cause deaths)',
'1-24 years (all cause deaths)',
'25-44 years',
'45-64 years (all cause deaths)',
'65+ years (all cause deaths)'
)
),
names = c('mrs_cause', 'mrs_age'),
na_rm = TRUE
) |>
magrittr::extract2('mrs_cause') |>
snake_case() |>
as_star_database(schema = star_sch)
ft
}
```
We can get these parameters from the initial definition (in the vignette titled *Obtaining and transforming flat tables*, `vignette("v05-flat-table-op")`) or by consulting the refresh structure with the functions available for this purpose.
```{r}
instance_df <- mrs_ft_new |>
get_table()
lookup_list <- mrs_db_cause_refresh |>
get_lookup_tables()
star_sch <- mrs_db_cause_refresh |>
get_star_schema()
# function to define a derived column
city_state <- function(table) {
paste0(table$City, ' ', table$State)
}
mrs_db_cause_transf <-
transform_instance_table(
instance_df = instance_df,
lookup_ft = lookup_list[['us_cities']],
definition_fun = city_state,
star_sch = star_sch
)
```
We perform the same operation for the other star database.
```{r}
transform_instance_table_2 <-
function(instance_df,
lookup_ft,
definition_fun,
star_sch) {
ft <-
flat_table(name = 'mrs',
instances = instance_df,
unknown_value = 'Not available') |>
transform_to_measure(
attributes = c(
'Pneumonia and Influenza Deaths',
'All Deaths',
'<1 year (all cause deaths)',
'1-24 years (all cause deaths)',
'25-44 years',
'45-64 years (all cause deaths)',
'65+ years (all cause deaths)'
),
k_sep = NULL,
decimal_sep = NULL
) |>
transform_attribute_format(
attributes = 'WEEK',
width = 2,
decimal_places = 0,
k_sep = ',',
decimal_sep = '.'
) |>
replace_empty_values(
attributes = c('Year', 'WEEK', 'Week Ending Date', 'REGION', 'State', 'City'),
empty_values = NULL
) |>
add_custom_column(name = 'city_state',
definition = definition_fun) |>
replace_attribute_values(
attributes = c('City', 'city_state'),
old = c('Wilimington', 'Wilimington DE'),
new = c('Wilmington', 'Wilmington DE')
) |>
join_lookup_table(fk_attributes = 'city_state',
lookup = lookup_ft) |>
select_attributes(
attributes = c(
'Year',
'WEEK',
'Week Ending Date',
'REGION',
'State',
'City',
'city_state',
'status',
'pop',
'lat',
'long'
)
) |>
separate_measures(
measures = list(
c('Pneumonia and Influenza Deaths', 'All Deaths'),
c(
'<1 year (all cause deaths)',
'1-24 years (all cause deaths)',
'25-44 years',
'45-64 years (all cause deaths)',
'65+ years (all cause deaths)'
)
),
names = c('mrs_cause', 'mrs_age'),
na_rm = TRUE
) |>
magrittr::extract2('mrs_age') |>
transform_to_values(
attribute = 'age',
measure = 'all_deaths',
id_reverse = NULL,
na_rm = TRUE
) |>
snake_case() |>
replace_string(attributes = 'age',
string = ' (all cause deaths)',
replacement = NULL) |>
as_star_database(schema = star_sch)
ft
}
star_sch <- mrs_db_age_refresh |>
get_star_schema()
mrs_db_age_transf <-
transform_instance_table_2(
instance_df = instance_df,
lookup_ft = lookup_list[['us_cities']],
definition_fun = city_state,
star_sch = star_sch
)
```
The result of these functions, in each case, is a `star_database` containing the new transformations.
## Refresh structures
Now we have to create the refresh structure for each star database, considering the original constellation (we have made a copy before updating it in `mrs_db_seg`). To create the refresh structures, we now indicate that the operations are considered from the star database of the `sdb_operations` parameter.
```{r}
mrs_db_cause_transf_refresh <- mrs_ft_new |>
update_according_to(mrs_db_seg, star = "mrs_cause", sdb_operations = mrs_db_cause_transf)
mrs_db_age_transf_refresh <- mrs_ft_new |>
update_according_to(mrs_db_seg, star = "mrs_age", sdb_operations = mrs_db_age_transf)
```
That is, with the `sdb_operations` parameter, it takes the data to be refreshed from the constellation, but the modification operations are taken from this star database (where we have defined the new operations that have been considered necessary).
## Incremental refresh operation
To show the result of the incremental refresh operations, we are going to repeat the process carried out previously: We show the name and number of instances of the tables before and after the transformation, as well as the names of the first cities of the *where* dimension.
```{r}
l_db <- mrs_db_seg |>
as_tibble_list()
names <- names(l_db)
for (i in seq_along(l_db)){
cat(sprintf("name: %s, %d rows\n", names[i], nrow(l_db[[i]])))
}
head(sort(l_db[['where']]$city), 15)
```
To perform the incremental refresh, as the refresh structure has new transformation operations, we can indicate that these become the transformation operations of the constellation. Thus, in future updates, they will be taken into account automatically. This is indicated by the boolean parameter `replace_transformations`.
```{r}
mrs_db_seg <- mrs_db_seg |>
incremental_refresh(mrs_db_age_transf_refresh, replace_transformations = TRUE) |>
incremental_refresh(
mrs_db_cause_transf_refresh,
existing_instances = "group",
replace_transformations = TRUE
)
```
Finally, we consult the same data as before again.
```{r}
l_db <- mrs_db_seg |>
as_tibble_list()
names <- names(l_db)
for (i in seq_along(l_db)){
cat(sprintf("name: %s, %d rows\n", names[i], nrow(l_db[[i]])))
}
head(sort(l_db[['where']]$city), 15)
```
It can be seen that we have obtained the same results as before.
# Conclusions
This document shows the functions supporting incremental refreshing of star databases offered by the `rolap` package.
Starting from a flat table, the transformation operations through which a star database has been defined are automatically applied. These operations can be obtained and enriched to adapt to new situations that arise. Once the data has been conveniently updated, it can be integrated with the original star database indicating the operation that is considered most appropriate to integrate the data previously existing in the fact table.
This incremental refresh mechanism, although not trivial, facilitates this generally difficult process. It lays the foundations to continue simplifying it and offering new functionalities.