## Missing values in time series

Several meteorological variables in the `weather_train/test` dataset contain missing values. In this notebook we try out the `R` library `imputeTS` to handle these cases.

In [135]:
rm(list = ls())

In [136]:
# Libraries
packages <- c("data.table","ggplot2","tidyverse","tidyr","dplyr"
             ,"tibble","forecast","tsfknn","anytime","varhandle","lubridate","nortest"
             ,"normtest","scales","xts","TSrepr","imputeTS")

In [137]:
# Function that loads the given libraries, installing any that are missing

loadLibraries <- function(packages) {
  usePackage <- function(p){
    if ( !is.element(p, installed.packages()[, 1]) ) {
      install.packages(p, dep = TRUE)}
    require(p, character.only = TRUE)}
  
  # Iterate over the function argument (the original looped over the global variable)
  for (p in packages){ usePackage(p) }
  
}

In [138]:
# Load the libraries
loadLibraries(packages)

In [139]:
# Import the data
weather <- read.csv("../data/interim/weather.csv", header = TRUE, sep = ",")
# Group by the column name; `weather$site_id` would create an oddly named grouping column
weather_by_site <- weather %>% group_by(site_id)

In [140]:
head(weather)
tail(weather)

X,site_id,timestamp,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,sea_level_pressure,wind_direction,wind_speed
<int>,<int>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0,0,2016-01-01 00:00:00,25.0,6.0,20.0,,1019.5,0,0.0
1,0,2016-01-01 01:00:00,24.4,,21.1,-1.0,1020.0,70,1.5
2,0,2016-01-01 02:00:00,22.8,2.0,21.1,0.0,1020.0,0,0.0
3,0,2016-01-01 03:00:00,21.1,2.0,20.6,0.0,1020.0,0,0.0
4,0,2016-01-01 04:00:00,20.0,2.0,20.0,-1.0,1020.0,250,2.6
5,0,2016-01-01 05:00:00,19.4,,19.4,0.0,,0,0.0


,X,site_id,timestamp,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,sea_level_pressure,wind_direction,wind_speed
,<int>,<int>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
139768,139767,15,2016-12-31 18:00:00,2.8,,-7.8,,1007.5,180,8.2
139769,139768,15,2016-12-31 19:00:00,3.0,,-8.0,,,180,5.7
139770,139769,15,2016-12-31 20:00:00,2.8,2.0,-8.9,,1007.5,180,7.7
139771,139770,15,2016-12-31 21:00:00,2.8,,-7.2,,1007.5,180,5.1
139772,139771,15,2016-12-31 22:00:00,2.2,,-6.7,,1008.0,170,4.6
139773,139772,15,2016-12-31 23:00:00,1.7,,-5.6,-1.0,1008.5,180,8.8


In [141]:
sites <- unique(weather$site_id)
print(sites)

 [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15


In [142]:
# Convert the timestamp column to a date-time type
weather$timestamp <- unfactor(weather$timestamp)
weather$timestamp <- as.POSIXct(weather$timestamp, format = "%Y-%m-%d %H:%M:%S", tz="GMT")

In [164]:
# Build a list of time series of the `air_temperature` variable, one per `site_id`
airTempList <- list()
for (site in sites){
    # site_id starts at 0, so shift by one for R's 1-based list indexing
    siteData <- dplyr::filter(weather, site_id == site)
    airTempList[[site + 1]] <- xts(siteData$air_temperature, order.by = siteData$timestamp)
}


In [165]:
# Plots
...

ERROR: Error in eval(expr, envir, enclos): '...' used in an incorrect context
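
The plotting cell above is elided in the source, so its original content is unknown. As one possible starting point, the sketch below draws where the missing values fall in a toy series using `ggplot_na_distribution`, assuming `imputeTS` 3.1 or later (earlier releases exposed a similar plot as `plotNA.distribution`). The `temps` values are made up for illustration and are not taken from the dataset.

```r
library(imputeTS)

# Toy temperature series with two gaps (illustrative values only)
temps <- c(25.0, 24.4, NA, NA, NA, 20.0, 19.4, 18.9, NA, 18.0)

# Returns a ggplot object; missing stretches are highlighted in the background
p <- ggplot_na_distribution(temps)
print(p)
```

The same call should accept one of the `xts` series in `airTempList`, for example `ggplot_na_distribution(airTempList[[1]])`.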


In [163]:
# Missing-value statistics for each series
n <- 0
for (serie in airTempList){
    print(paste("**** SITE_ID: ", n))
    statsNA(serie)
    cat("\n")
    n <- n + 1
}


[1] "**** SITE_ID:  0"
[1] "Length of time series:"
[1] 8784
[1] "-------------------------"
[1] "Number of Missing Values:"
[1] 3
[1] "-------------------------"
[1] "Percentage of Missing Values:"
[1] "0.0342%"
[1] "-------------------------"
[1] "Stats for Bins"
[1] "  Bin 1 (2196 values from 1 to 2196) :      3 NAs (0.137%)"
[1] "  Bin 2 (2196 values from 2197 to 4392) :      0 NAs (0%)"
[1] "  Bin 3 (2196 values from 4393 to 6588) :      0 NAs (0%)"
[1] "  Bin 4 (2196 values from 6589 to 8784) :      0 NAs (0%)"
[1] "-------------------------"
[1] "Longest NA gap (series of consecutive NAs)"
[1] "3 in a row"
[1] "-------------------------"
[1] "Most frequent gap size (series of consecutive NA series)"
[1] "3 NA in a row (occuring 1 times)"
[1] "-------------------------"
[1] "Gap size accounting for most NAs"
[1] "3 NA in a row (occuring 1 times, making up for overall 3 NAs)"
[1] "-------------------------"
[1] "Overview NA series"
[1] "  3 NA in a row: 1 times"

[1] "**** SITE_ID
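
To make the figures in the report above easier to interpret, here is a minimal base-R sketch that computes the same kind of summary (number of NAs, percentage, and gap sizes) by hand. It is not how `statsNA` is implemented; the toy vector `x` and all variable names are assumptions chosen for illustration.

```r
# Toy series with two gaps: one of length 1 and one of length 3
x <- c(1.0, NA, 3.0, 4.0, NA, NA, NA, 8.0)

# Number and percentage of missing values
na_count <- sum(is.na(x))
na_pct <- 100 * na_count / length(x)

# Run-length encoding groups consecutive equal values of is.na(x),
# so the runs whose value is TRUE are the NA gaps
runs <- rle(is.na(x))
gap_sizes <- runs$lengths[runs$values]

# Longest run of consecutive NAs (0 if the series is complete)
longest_gap <- if (length(gap_sizes) > 0) max(gap_sizes) else 0

# How many times each gap size occurs
gap_table <- table(gap_sizes)

print(na_count)
print(na_pct)
print(longest_gap)
print(gap_table)
```

Gap length matters when choosing an imputation method: short gaps such as the three consecutive NAs at site 0 are usually good candidates for interpolation, while long gaps may need a different approach.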

### Method 1: Linear interpolation
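
As a minimal sketch of this method, the example below fills a three-value gap in a toy series with `na_interpolation(option = "linear")`, assuming `imputeTS` 3.0 or later (older releases named the function `na.interpolation`). The `temps` values are made up and only loosely resemble the first rows of the dataset.

```r
library(imputeTS)

# Toy hourly temperatures with a three-hour gap (illustrative values only)
temps <- c(25.0, 24.4, NA, NA, NA, 20.0, 19.4)

# Each NA is replaced by a point on the straight line joining the
# nearest observed values on either side of the gap
temps_linear <- na_interpolation(temps, option = "linear")
print(temps_linear)
```

Applied to the notebook's data, something like `lapply(airTempList, na_interpolation, option = "linear")` would be one way to impute every site, provided each series has enough observed values to interpolate between.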
### Method 2: Spline interpolation
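
A comparable sketch for this method, again assuming `imputeTS` 3.0 or later and a made-up toy series: `option = "spline"` fits a smooth curve through the observed points instead of straight segments, so the filled values can curve and may overshoot the neighbouring observations.

```r
library(imputeTS)

# Toy hourly temperatures with a three-hour gap (illustrative values only)
temps <- c(25.0, 24.4, NA, NA, NA, 20.0, 19.4)

# Missing positions are filled from a spline fitted to the observed values;
# the observed values themselves are left untouched
temps_spline <- na_interpolation(temps, option = "spline")
print(temps_spline)
```

Comparing this result with the linear fill on the same gap is a quick way to judge whether the extra smoothness is worth the risk of overshoot for a given variable.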