# 1. Starting the Project in R

We read the data from market.db.

In [90]:
library('DBI')
library('RSQLite')
library('magrittr')
library('dplyr')
library('lubridate')
library('tidyr')
library('ggplot2')

In [91]:
con = dbConnect(RSQLite::SQLite(), "../../01_Datos/market.db")

In [92]:
dbListTables(con)

Let's read the tables and look at their contents.

In [93]:
calendar = dbReadTable(con, "calendar")
sales = dbReadTable(con, "sales")
sell_prices = dbReadTable(con, "sell_prices")
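
As a minimal sketch of the same connect, list and read pattern, the snippet below uses an in-memory SQLite database with a two-row toy `calendar` table, so it runs without market.db. The names `con_demo`, `tables_demo` and `calendar_demo` are illustrative only. It also closes the connection with `dbDisconnect()`, which is a reasonable step once all tables are in memory.

```r
library(DBI)

# In-memory SQLite database standing in for market.db (toy data, not the real tables)
con_demo = dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con_demo, "calendar",
             data.frame(d = c("d_704", "d_705"),
                        date = c("2013-01-01", "2013-01-02")))

# Same calls as in the cells above: list the tables, then read one into a data frame
tables_demo = dbListTables(con_demo)
calendar_demo = dbReadTable(con_demo, "calendar")

# Release the connection once the data is in memory
dbDisconnect(con_demo)
```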

In [94]:
calendar %>% head()

,index,date,wm_yr_wk,weekday,wday,month,year,d,event_name_1,event_type_1,event_name_2,event_type_2
,<int>,<chr>,<int>,<chr>,<int>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>
1,0,2013-01-01,11249,Tuesday,4,1,2013,d_704,NewYear,National,,
2,1,2013-01-02,11249,Wednesday,5,1,2013,d_705,,,,
3,2,2013-01-03,11249,Thursday,6,1,2013,d_706,,,,
4,3,2013-01-04,11249,Friday,7,1,2013,d_707,,,,
5,4,2013-01-05,11250,Saturday,1,1,2013,d_708,,,,
6,5,2013-01-06,11250,Sunday,2,1,2013,d_709,,,,


In [95]:
sales %>% head()

,index,id,item_id,dept_id,cat_id,store_id,state_id,d_704,d_705,d_706,⋯,d_1789,d_1790,d_1791,d_1792,d_1793,d_1794,d_1795,d_1796,d_1797,d_1798
,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<int>,<int>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,8412,FOODS_3_090_CA_3_validation,FOODS_3_090,FOODS_3,FOODS,CA_3,CA,0,224,241,⋯,5,2,0,0,6,0,6,0,0,0
2,8442,FOODS_3_120_CA_3_validation,FOODS_3_120,FOODS_3,FOODS,CA_3,CA,0,0,0,⋯,54,63,44,0,65,90,104,73,76,97
3,8524,FOODS_3_202_CA_3_validation,FOODS_3_202,FOODS_3,FOODS,CA_3,CA,20,23,23,⋯,43,40,39,0,29,33,27,13,26,47
4,8574,FOODS_3_252_CA_3_validation,FOODS_3_252,FOODS_3,FOODS,CA_3,CA,34,27,40,⋯,31,43,32,0,52,37,32,29,34,27
5,8610,FOODS_3_288_CA_3_validation,FOODS_3_288,FOODS_3,FOODS,CA_3,CA,0,0,0,⋯,29,45,28,0,46,36,40,31,46,36
6,8651,FOODS_3_329_CA_3_validation,FOODS_3_329,FOODS_3,FOODS,CA_3,CA,39,75,54,⋯,66,56,60,0,46,64,49,44,37,44


In [96]:
sell_prices %>% head()

,index,store_id,item_id,wm_yr_wk,sell_price
,<int>,<chr>,<chr>,<int>,<dbl>
1,1862524,CA_3,FOODS_3_090,11249,1.25
2,1862525,CA_3,FOODS_3_090,11250,1.25
3,1862526,CA_3,FOODS_3_090,11251,1.25
4,1862527,CA_3,FOODS_3_090,11252,1.25
5,1862528,CA_3,FOODS_3_090,11301,1.38
6,1862529,CA_3,FOODS_3_090,11302,1.38


In all three tables we drop the index column because it carries no meaningful information. From sales we also drop id because it is redundant: it only combines item_id and store_id.

In [97]:
calendar = calendar %>% dplyr::select(-index)
sales = sales %>% dplyr::select(-index, -id)
sell_prices = sell_prices %>% dplyr::select(-index)
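
The redundancy of `id` can be checked before dropping it. The sketch below rebuilds `id` from `item_id` and `store_id` on two toy rows copied from the `head()` output above. The `"_validation"` suffix is inferred from those sample rows, so treat the pattern as an assumption to verify on the full table.

```r
# Toy rows mimicking the sales layout (values copied from the head() output above)
sales_demo = data.frame(
  id       = c("FOODS_3_090_CA_3_validation", "FOODS_3_120_CA_3_validation"),
  item_id  = c("FOODS_3_090", "FOODS_3_120"),
  store_id = c("CA_3", "CA_3")
)

# Rebuild id from the other columns; the suffix is an assumption based on the sample
id_rebuilt = paste0(sales_demo$item_id, "_", sales_demo$store_id, "_validation")

# TRUE means id adds nothing beyond item_id and store_id
id_is_redundant = all(id_rebuilt == sales_demo$id)
```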

Next we reshape sales from wide to long format to normalize it. In sales, each day is its own column (d_704 through d_1798) holding that day's sales; we pivot those columns into rows.

In [98]:
sales = pivot_longer(sales, 
                      cols = starts_with("d_"), 
                      names_to = "d", 
                      values_to = "ventas")
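
To make the reshape concrete, here is a small sketch on a toy wide table with two items and three day columns (values taken from the `head()` output above). The names `sales_wide` and `sales_long` are illustrative. Each item row becomes one row per day column.

```r
library(tidyr)

# Toy wide table: one row per item, one column per day
sales_wide = data.frame(
  item_id = c("FOODS_3_090", "FOODS_3_120"),
  d_704 = c(0L, 0L),
  d_705 = c(224L, 0L),
  d_706 = c(241L, 0L)
)

# Same call as above: day columns become (d, ventas) pairs
sales_long = pivot_longer(sales_wide,
                          cols = starts_with("d_"),
                          names_to = "d",
                          values_to = "ventas")

# Expected row count: number of items times number of day columns
n_expected = nrow(sales_wide) * sum(startsWith(names(sales_wide), "d_"))
```

Comparing `nrow(sales_long)` with `n_expected` is a cheap sanity check that can also be applied to the real table.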

We join the tables on their shared fields. First we join sales and calendar on the common field "d", which we created in sales in the previous step.

In [99]:
df = sales %>% left_join(calendar, by = "d")
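
A left join only preserves the row count of the left table when the join key is unique on the right side. The sketch below checks that condition on toy tables; `calendar_demo` and `sales_demo` are illustrative stand-ins, not the real data.

```r
library(dplyr)

calendar_demo = data.frame(d = c("d_704", "d_705"),
                           date = c("2013-01-01", "2013-01-02"))
sales_demo = data.frame(item_id = "FOODS_3_090",
                        d = c("d_704", "d_705"),
                        ventas = c(0L, 224L))

# Each day should appear exactly once in the calendar
key_is_unique = !any(duplicated(calendar_demo$d))

joined_demo = sales_demo %>% left_join(calendar_demo, by = "d")

# With a unique key, the join adds columns but no rows
rows_preserved = nrow(joined_demo) == nrow(sales_demo)
```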

Let's take a look:

In [100]:
df %>%  head()

item_id,dept_id,cat_id,store_id,state_id,d,ventas,date,wm_yr_wk,weekday,wday,month,year,event_name_1,event_type_1,event_name_2,event_type_2
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<int>,<chr>,<int>,<int>,<int>,<chr>,<chr>,<chr>,<chr>
FOODS_3_090,FOODS_3,FOODS,CA_3,CA,d_704,0,2013-01-01,11249,Tuesday,4,1,2013,NewYear,National,,
FOODS_3_090,FOODS_3,FOODS,CA_3,CA,d_705,224,2013-01-02,11249,Wednesday,5,1,2013,,,,
FOODS_3_090,FOODS_3,FOODS,CA_3,CA,d_706,241,2013-01-03,11249,Thursday,6,1,2013,,,,
FOODS_3_090,FOODS_3,FOODS,CA_3,CA,d_707,232,2013-01-04,11249,Friday,7,1,2013,,,,
FOODS_3_090,FOODS_3,FOODS,CA_3,CA,d_708,301,2013-01-05,11250,Saturday,1,1,2013,,,,
FOODS_3_090,FOODS_3,FOODS,CA_3,CA,d_709,270,2013-01-06,11250,Sunday,2,1,2013,,,,


Next we add sell_prices, which shares the fields "store_id", "item_id" and "wm_yr_wk" with df.

In [101]:
df = df %>% left_join(sell_prices, by = c("store_id", "item_id", "wm_yr_wk"))

Let's check that all the fields were merged correctly.

In [102]:
df %>%
  select(store_id, item_id, wm_yr_wk, d, sell_price) %>%
  arrange(store_id, item_id, wm_yr_wk, d) %>%
  head(10)

store_id,item_id,wm_yr_wk,d,sell_price
<chr>,<chr>,<int>,<chr>,<dbl>
CA_3,FOODS_3_090,11249,d_704,1.25
CA_3,FOODS_3_090,11249,d_705,1.25
CA_3,FOODS_3_090,11249,d_706,1.25
CA_3,FOODS_3_090,11249,d_707,1.25
CA_3,FOODS_3_090,11250,d_708,1.25
CA_3,FOODS_3_090,11250,d_709,1.25
CA_3,FOODS_3_090,11250,d_710,1.25
CA_3,FOODS_3_090,11250,d_711,1.25
CA_3,FOODS_3_090,11250,d_712,1.25
CA_3,FOODS_3_090,11250,d_713,1.25
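
The first ten rows cover a single item, so a more systematic check may be useful. A possible sketch: count the rows whose `sell_price` is `NA` after the join and list the unmatched keys with `anti_join()`. The toy tables below are hypothetical; week 11251 is deliberately left out of `prices_demo` to produce one unmatched row.

```r
library(dplyr)

df_demo = data.frame(store_id = "CA_3",
                     item_id = "FOODS_3_090",
                     wm_yr_wk = c(11249L, 11250L, 11251L))
# Hypothetical price table with no entry for week 11251
prices_demo = data.frame(store_id = "CA_3",
                         item_id = "FOODS_3_090",
                         wm_yr_wk = c(11249L, 11250L),
                         sell_price = c(1.25, 1.25))

keys = c("store_id", "item_id", "wm_yr_wk")
joined_demo = df_demo %>% left_join(prices_demo, by = keys)

# Rows with no matching price come back with NA in sell_price
n_missing = sum(is.na(joined_demo$sell_price))

# anti_join() returns the left-side rows that found no match
unmatched = df_demo %>% anti_join(prices_demo, by = keys)
```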


Now let's put the columns in a more convenient order.

In [103]:
columns_order = c('date', 'state_id', 'store_id', 'dept_id', 'cat_id', 'item_id', 'wm_yr_wk', 'd', 'ventas',
                      'sell_price', 'year', 'month', 'wday', 'weekday', 'event_name_1', 'event_type_1', 'event_name_2',
                      'event_type_2')

In [104]:
df = df[, columns_order]
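
Reordering by a name vector silently drops any column that is not listed. As a hedged alternative, the sketch below checks for dropped columns with `setdiff()` and reorders with `dplyr::select(all_of(...))`, which raises an error if a listed column does not exist. The three-column `df_demo` and `columns_order_demo` are illustrative only.

```r
library(dplyr)

df_demo = data.frame(ventas = 0L, date = "2013-01-01", item_id = "FOODS_3_090")
columns_order_demo = c("date", "item_id", "ventas")

# Columns present in the data but missing from the ordering vector would be lost
dropped = setdiff(names(df_demo), columns_order_demo)

# all_of() fails loudly if a listed column is absent from the data frame
df_demo = df_demo %>% dplyr::select(all_of(columns_order_demo))
```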

In [105]:
df %>% head()

date,state_id,store_id,dept_id,cat_id,item_id,wm_yr_wk,d,ventas,sell_price,year,month,wday,weekday,event_name_1,event_type_1,event_name_2,event_type_2
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<int>,<dbl>,<int>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>
2013-01-01,CA,CA_3,FOODS_3,FOODS,FOODS_3_090,11249,d_704,0,1.25,2013,1,4,Tuesday,NewYear,National,,
2013-01-02,CA,CA_3,FOODS_3,FOODS,FOODS_3_090,11249,d_705,224,1.25,2013,1,5,Wednesday,,,,
2013-01-03,CA,CA_3,FOODS_3,FOODS,FOODS_3_090,11249,d_706,241,1.25,2013,1,6,Thursday,,,,
2013-01-04,CA,CA_3,FOODS_3,FOODS,FOODS_3_090,11249,d_707,232,1.25,2013,1,7,Friday,,,,
2013-01-05,CA,CA_3,FOODS_3,FOODS,FOODS_3_090,11250,d_708,301,1.25,2013,1,1,Saturday,,,,
2013-01-06,CA,CA_3,FOODS_3,FOODS,FOODS_3_090,11250,d_709,270,1.25,2013,1,2,Sunday,,,,


We convert the date column from character to Date.

In [106]:
df$date = ymd(df$date)
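
`ymd()` returns a `Date` vector and turns any string it cannot parse into `NA` with a warning, so counting `NA` values after the conversion is a quick check. A small sketch with a made-up invalid entry:

```r
library(lubridate)

# The third value is intentionally invalid (hypothetical input)
dates_chr = c("2013-01-01", "2013-01-02", "not-a-date")

# Unparseable entries become NA; the warning is silenced for the demo
dates_parsed = suppressWarnings(ymd(dates_chr))

n_failed = sum(is.na(dates_parsed))
```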

We now split this data frame into the data we will work with and the most recent records, which we hold out to validate the forecast. The validation set contains everything from November 2015 onward.

We check the dimensions of the data frame before the split.

In [107]:
dim(df)

In [108]:
fecha_division = as.Date('2015-11-01')

In [109]:
val = df[df$date >= fecha_division, ] 
df = df[df$date < fecha_division, ]
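
Beyond comparing dimensions by eye, the split can be verified programmatically. The sketch below, on a four-row toy data frame with illustrative names, checks that the two pieces account for every row and that no working date falls on or after the first validation date.

```r
df_demo = data.frame(
  date = as.Date(c("2015-10-30", "2015-10-31", "2015-11-01", "2015-11-02")),
  ventas = c(5L, 2L, 0L, 6L)
)
fecha_division_demo = as.Date("2015-11-01")

# Same indexing pattern as the cell above
val_demo = df_demo[df_demo$date >= fecha_division_demo, ]
train_demo = df_demo[df_demo$date < fecha_division_demo, ]

# The two pieces should add up to the original row count
is_partition = nrow(train_demo) + nrow(val_demo) == nrow(df_demo)

# No leakage: the latest working date precedes the earliest validation date
no_overlap = max(train_demo$date) < min(val_demo$date)
```

Note that base R logical indexing returns all-`NA` rows where the condition is `NA`, so this check is only meaningful if `date` has no missing values.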

And after the split:

In [112]:
dim(df)

In [113]:
dim(val)

We save both datasets, working and validation, in .rds format.

In [115]:
saveRDS(df, file = "../022_Variables/df.rds")
saveRDS(val, file = "../022_Variables/val.rds")
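
One reason to prefer .rds over a text format here is that it preserves column types, including the `Date` class. A minimal round-trip sketch, writing to a temporary file rather than the project path:

```r
df_demo = data.frame(date = as.Date("2013-01-01"), ventas = 0L)

# Temporary file instead of the project directory used above
path_demo = tempfile(fileext = ".rds")
saveRDS(df_demo, file = path_demo)

# Reading it back should give an identical object, types included
df_restored = readRDS(path_demo)
round_trip_ok = identical(df_demo, df_restored)
```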