## Event gap distribution among presses and product lines

## 1. Loading libraries

In [8]:
library(dplyr)
library(ggplot2)
library(tidyr)
library(stats)

## 2. Loading data

In [9]:
load('IndigoData.dat')
glimpse(data)

Observations: 1,211,693
Variables: 22
$ event_RowNumber (dbl) 15765585098, 15765585142, 15765585284, 15765585445,...
$ event_press     (int) 45000586, 45000586, 45000586, 45000586, 45000586, 4...
$ Product_Line    (fctr) HP Indigo 7600, HP Indigo 7600, HP Indigo 7600, HP...
$ series          (fctr) 7000 Family Sheet Fed Commercial Press, 7000 Famil...
$ Region          (fctr) North America, North America, North America, North...
$ SubRegion       (fctr) MidAtlantic United States, MidAtlantic United Stat...
$ District        (fctr) UNITED STATES, UNITED STATES, UNITED STATES, UNITE...
$ Ownership       (fctr) DIRECT, DIRECT, DIRECT, DIRECT, DIRECT, DIRECT, DI...
$ event_recNum    (int) 739828, 739878, 740037, 740221, 740224, 740554, 740...
$ event_date      (chr) "2016-06-08 00:00:00.000", "2016-06-08 00:00:00.000...
$ event_time      (int) 74703, 75903, 84709, 93603, 95930, 120644, 132213, ...
$ event_name      (fctr) SAMPLE_PIP_AND_IMO_PARAMETERS, SAMPLE_PIP_AND_IMO_...
$ event_state 

## 3. Preparing data for analysis

### 3.1 Calculate daily averages for PRINT_STATE events

In [45]:
df_print_state_daily <- data %>%
    filter(event_state == "PRINT_STATE") %>%       
    # Keep only the calendar date; as.Date avoids time-zone and DST effects in later day differences
    mutate(event_date = as.Date(substr(event_date, 1, 10))) %>%
    mutate(event_press = factor(event_press)) %>%
    group_by(event_press, event_date) %>%
    summarise(
        pip_temperature = mean(PIP_Temperature),
        io_temperature = mean(IO_temperature),
        io_dirtiness = mean(IO_dirtiness),
        vessel_flow = mean(vessel_flow),
        io_conductivity = mean(IO_Conductivity),
        cs_voltage = mean(CS_Voltage),
        delta_pressure = mean(Delta_Pressure),
        product_line = first(Product_Line)
    ) %>% 
    arrange(event_press, event_date) %>%
    select(event_press, product_line, event_date, pip_temperature, io_temperature, 
           io_dirtiness, vessel_flow, io_conductivity, cs_voltage, delta_pressure) 
#glimpse(df_print_state_daily)

### 3.2 Calculate continuous gaps in time series

In [67]:
df_press_gaps <- df_print_state_daily %>%
    select(event_press, event_date, product_line) %>%
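    # Data is still grouped by event_press here, so lag() never crosses presses.
    # lag() already returns NA for the first row of each press; dplyr::lag has no na.pad argument.
    mutate(prior_event_date = lag(event_date)) %>%
    # Gap = whole days without data between two consecutive observation days
    mutate(delta_event_date = round(as.numeric(difftime(event_date, prior_event_date, units = "days"))) - 1) %>%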
    summarise(
        product_line = first(product_line),
        max_delta_event_date = max(delta_event_date, na.rm = TRUE),
        min_delta_event_date = min(delta_event_date, na.rm = TRUE),
        number_of_events = n(),
        mean_gap_size = mean(delta_event_date, na.rm = TRUE),
        median_gap_size = median(delta_event_date, na.rm = TRUE),
        sd_gap_size = sd(delta_event_date, na.rm = TRUE)
    )
glimpse(df_press_gaps)

Observations: 851
Variables: 8
$ event_press          (fctr) 40000024, 40000028, 40000034, 40000038, 40000...
$ product_line         (fctr) HP Indigo 7000, HP Indigo 7000, HP Indigo 700...
$ max_delta_event_date (dbl) 13, 18, 13, 6, 4, 15, 41, 10, 21, 10, 8, 7, 4,...
$ min_delta_event_date (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ number_of_events     (int) 36, 33, 83, 59, 109, 48, 16, 23, 44, 75, 81, 8...
$ mean_gap_size        (dbl) 4.0000000, 4.0000000, 1.1707317, 2.0172414, 0....
$ median_gap_size      (dbl) 2.0, 2.0, 0.0, 2.0, 0.0, 1.0, 7.0, 0.0, 1.0, 0...
$ sd_gap_size          (dbl) 4.4787078, 4.5860940, 2.3083578, 1.9600517, 0....
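
To make the gap definition concrete, here is a minimal base R sketch on made-up dates for a single hypothetical press (the dates are not taken from the data set). A gap is the number of days without data between two consecutive observation days, so consecutive days give a gap of 0.

```r
# Toy example: one hypothetical press observed on four days (made-up dates).
event_date <- as.Date(c("2016-06-01", "2016-06-02", "2016-06-05", "2016-06-06"))

# diff() on Date values returns the difference in days;
# subtracting 1 turns "days apart" into "days without data in between".
gap_days <- as.numeric(diff(event_date)) - 1

print(gap_days)          # 0 2 0
print(max(gap_days))     # 2
print(median(gap_days))  # 0
```

The same quantities per press are what `max_delta_event_date` and `median_gap_size` summarise above.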


## 4. Continuous gaps 
Sensor data is irregularly distributed within each press, with frequent continuous gaps of varying sizes.
To prevent a sliding window from passing through a time-series interval in which a press has no data, it is important to use only presses with small continuous gaps in this process.
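
The sliding-window concern can be illustrated with a small base R sketch. It uses made-up observation days for one hypothetical press, and `window_days` is an assumed window length, not a value chosen in this analysis. The sketch only marks which window start dates are fully covered by data.

```r
# Made-up observation days for one hypothetical press.
observed <- as.Date("2016-06-01") + c(0, 1, 2, 3, 6, 7, 8, 9, 10)

# Full daily calendar between the first and last observation.
calendar <- seq(min(observed), max(observed), by = "day")
has_data <- calendar %in% observed

# Assumed window length in days (illustrative only).
window_days <- 4

# Count the observed days inside each window of window_days consecutive days.
n_windows <- length(calendar) - window_days + 1
coverage <- sapply(seq_len(n_windows), function(i) {
    sum(has_data[i:(i + window_days - 1)])
})

# A window is usable only if every day in it has data.
usable_start <- calendar[seq_len(n_windows)][coverage == window_days]
print(usable_start)
```

In this toy series the two missing days make every window that overlaps them unusable, which is why presses with large gaps leave few usable windows.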

We should also collect some examples of presses whose PIP temperature frequently crosses the threshold, to understand how effective trends can be in these cases.
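
One possible way to find such presses is to compute, per press, the share of days whose daily average exceeds a threshold. The sketch below is only an illustration: the data frame is made up, and `pip_threshold` is an assumed value, since the actual threshold is not stated in this notebook.

```r
# Made-up daily averages for two hypothetical presses.
daily <- data.frame(
    event_press = c("A", "A", "A", "B", "B", "B"),
    pip_temperature = c(30, 36, 37, 29, 31, 36)
)

# Assumed threshold, for illustration only.
pip_threshold <- 35

# Flag the days above the threshold, then take the share of flagged days per press.
daily$above <- daily$pip_temperature > pip_threshold
crossings <- aggregate(above ~ event_press, data = daily, FUN = mean)
print(crossings)
```

Sorting `crossings` by the `above` column would then surface the presses that cross the threshold most often.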

### 4.1 Gap overview considering all presses

In [114]:
df_press_gaps %>%
    summarise(
        number_of_presses = n(),
        median_number_of_events = median(number_of_events, na.rm = TRUE),
        sd_number_of_events = sd(number_of_events, na.rm = TRUE),
        global_median_gap_size = median(median_gap_size, na.rm = TRUE),
        global_sd_gap_size = sd(median_gap_size, na.rm = TRUE)
        )        

| number_of_presses | median_number_of_events | sd_number_of_events | global_median_gap_size | global_sd_gap_size |
|---|---|---|---|---|
| 851 | 43 | 36.97867 | 1 | 11.42132 |


### 4.2 Gap overview by product line

In [115]:
df_press_gaps %>%
    mutate(total_presses = n()) %>%
    group_by(product_line) %>%  
    mutate(number_of_presses = n()) %>%
    summarise(        
        number_of_press = first(number_of_presses),
        perc_of_presses = first(number_of_presses) / first(total_presses),
        median_number_of_events = median(number_of_events, na.rm = TRUE),
        sd_number_of_events = sd(number_of_events, na.rm = TRUE),
        global_median_gap_size = median(median_gap_size, na.rm = TRUE),
        global_sd_gap_size = sd(median_gap_size, na.rm = TRUE)
    )   

| product_line | number_of_press | perc_of_presses | median_number_of_events | sd_number_of_events | global_median_gap_size | global_sd_gap_size |
|---|---|---|---|---|---|---|
| HP Indigo 7000 | 228 | 0.2679201 | 43 | 34.43099 | 1 | 9.945437 |
| HP Indigo 7500 | 246 | 0.2890717 | 43 | 38.28129 | 1 | 9.306400 |
| HP Indigo 7600 | 377 | 0.4430082 | 45 | 37.60761 | 1 | 13.35842 |


### 4.3 Conclusions from the gap analysis

The table above shows that 'HP Indigo 7600' accounts for about 44% of all presses with PRINT_STATE events (377 of 851).

Because of this imbalance, it seems a good approach to always split the data by product line before training any machine learning algorithm. That way, each product line is better represented.
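
A minimal sketch of such a split, using base R `split()` on a made-up per-press table. The press ids and event counts are hypothetical; only the product line names come from the analysis above.

```r
# Made-up per-press summary with the three product lines seen above.
press_summary <- data.frame(
    event_press = c("p1", "p2", "p3", "p4"),
    product_line = c("HP Indigo 7000", "HP Indigo 7500",
                     "HP Indigo 7600", "HP Indigo 7600"),
    number_of_events = c(40, 43, 45, 50)
)

# One data frame per product line; each can then be used to train a separate model.
by_line <- split(press_summary, press_summary$product_line)

# Number of presses available in each product line.
print(sapply(by_line, nrow))
```

Each element of `by_line` holds only the presses of one product line, so the larger 7600 group cannot dominate the others during training.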

The median number of events per press is between 43 and 45, depending on the product line, so it might be better to train on presses that have a similar number of events. This way, the different presses are represented more evenly during training.

### 4.4 Number of presses with medians similar to their corresponding product_line medians

In [127]:
event_perc_around_median <- 0.20
gap_perc_around_median <- 0.30

# Considering event_number_median +- 20% (event_perc_around_median)
event_number_median <- 44 # since all event medians are similar among product lines - an approximate number is being used here
event_upper_limit <- round(event_number_median*(1 + event_perc_around_median))
event_lower_limit <- round(event_number_median*(1 - event_perc_around_median))

# event gaps can't be equal to or bigger than 14 (applied below to median_gap_size)
gap_upper_limit <- 14

cat('With event_numbers',event_perc_around_median*100,'% around ',event_number_median,':', event_upper_limit,'and', event_lower_limit,'\n')
cat('With median_gap_size less than', gap_upper_limit,'\n')

df_press_gaps %>%
    group_by(product_line) %>%
    mutate(total_product_line = n()) %>%
    filter(number_of_events >= round(event_lower_limit) & number_of_events <= round(event_upper_limit)) %>%
    filter(median_gap_size < gap_upper_limit) %>%
    summarise(
        original_number_of_press = first(total_product_line),        
        filtered_number_of_press = n(),
        perc_of_filtered_press_from_original = filtered_number_of_press / original_number_of_press,
        # Note: this is the largest per-press median gap, not the largest single gap
        biggest_continuous_gap = max(median_gap_size)
    )
    

With event_numbers 20 % around  44 : 53 and 35 
With median_gap_size less than 14 


| product_line | original_number_of_press | filtered_number_of_press | perc_of_filtered_press_from_original | biggest_continuous_gap |
|---|---|---|---|---|
| HP Indigo 7000 | 228 | 42 | 0.1842105 | 3 |
| HP Indigo 7500 | 246 | 41 | 0.1666667 | 3 |
| HP Indigo 7600 | 377 | 55 | 0.1458886 | 3 |
