# Prepare data splits for calibration and validation

Lars Caspersen

## Aim

This notebook creates two calibration / validation splits of the bloom observations: a “full” split, which uses the common partition of 75% calibration and 25% validation data, and a “scarcity” split, which uses only ten observations per cultivar for calibration and all remaining data for validation.

We decided to use three cultivars per location. To keep the experimental design balanced, we used phenology data from a single location only, even when a cultivar had observations from several locations.
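The two split types described above can be sketched on toy data. This is a minimal illustration, not the code used in this notebook: the `bloom` table, its cultivar names, the seed, and the choice to draw the 75% per cultivar are all assumptions made for the example.

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only so that the sketch is repeatable

# toy bloom records (invented): 3 hypothetical cultivars, 40 years each
bloom <- expand.grid(cultivar = c("A", "B", "C"),
                     year = 1981:2020,
                     stringsAsFactors = FALSE) %>%
  mutate(yday = sample(60:120, n(), replace = TRUE))

# "full" split: 75% of each cultivar's observations go to calibration
# (drawing per cultivar is an assumption of this sketch)
full_split <- bloom %>%
  group_by(cultivar) %>%
  mutate(split = if_else(row_number() %in% sample(n(), size = round(0.75 * n())),
                         "Calibration", "Validation")) %>%
  ungroup()

# "scarcity" split: exactly ten observations per cultivar for calibration,
# everything else for validation
scarcity_split <- bloom %>%
  group_by(cultivar) %>%
  mutate(split = if_else(row_number() %in% sample(n(), size = 10),
                         "Calibration", "Validation")) %>%
  ungroup()

# observations per cultivar and split
table(full_split$cultivar, full_split$split)
table(scarcity_split$cultivar, scarcity_split$split)
```

Note that the two splits in this sketch are drawn independently, so the scarcity calibration years are not guaranteed to be a subset of the full calibration years.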

Prepare the cherry data

In [None]:
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

`summarise()` has grouped output by 'cultivar'. You can override using the
`.groups` argument.

Prepare the apricot data

In [None]:
# goal: three cultivars per location
# first read the apricot records and convert the bloom date (flowering_f50) to day of the year
apricot <- read.csv('data/combined_phenological_data_adamedor_clean.csv') %>% 
  filter(species == 'Apricot') %>% 
  select(species, cultivar, location, flowering_f50, year) %>% 
  mutate(yday = lubridate::mdy(flowering_f50) %>% lubridate::yday()) %>% 
  na.omit()  

# R sometimes has trouble with accented characters, so replace 'Búlida' with 'Bulida'
apricot$cultivar <- ifelse(apricot$cultivar == "B\xfalida",
                           yes = 'Bulida',
                           no = apricot$cultivar)


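# count the observations and the mean bloom date per cultivar and location,
# then keep only combinations with at least 20 observations
apricot_summary <- apricot %>% 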
  group_by(cultivar, location) %>% 
  summarise(n = n(),
            mean = mean(yday)) %>% 
  filter(n >= 20)

`summarise()` has grouped output by 'cultivar'. You can override using the
`.groups` argument.
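A summary table like `apricot_summary` can be used to pick one location per cultivar and three cultivars per location. The selection rule used for this study is not shown here, so the sketch below assumes a hypothetical rule (prefer the combinations with the most observations) and works on an invented `toy_summary` table with the same column names.

```r
library(dplyr)

# invented summary table with the same columns as apricot_summary
toy_summary <- data.frame(
  cultivar = c("A", "A", "B", "C", "D", "E"),
  location = c("L1", "L2", "L1", "L1", "L1", "L2"),
  n        = c(30, 22, 25, 28, 21, 35)
)

selected <- toy_summary %>%
  # hypothetical rule 1: keep each cultivar only at its best-covered location
  group_by(cultivar) %>%
  slice_max(order_by = n, n = 1, with_ties = FALSE) %>%
  # hypothetical rule 2: keep at most three cultivars per location
  group_by(location) %>%
  slice_max(order_by = n, n = 3, with_ties = FALSE) %>%
  ungroup()

selected
```

With these toy values, cultivar A is kept only at L1, and cultivar D is dropped because L1 already has three cultivars with more observations.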

Prepare the almond data. For the almond data I accidentally created the scarcity split first, but the end result has the same structure. Calibration data of the scarcity split is also part of the calibration data of the “full” split. I decided to keep it this way so that the splits stay reproducible.

In [None]:
# read the almond records, drop incomplete rows and convert the bloom date to day of the year
almond_adamedor <- read.csv('data/combined_phenological_data_adamedor_clean.csv') %>%
  filter(species == 'Almond') %>% 
  select(species, cultivar, location, year, flowering_f50) %>%
  drop_na() %>%
  mutate(yday = lubridate::mdy(flowering_f50) %>% lubridate::yday())

overview <- almond_adamedor %>%
  mutate(cult_loc = paste(cultivar, location, sep ='-')) %>%
  group_by(cult_loc, cultivar, location) %>%
  summarise(n = n())

`summarise()` has grouped output by 'cult_loc', 'cultivar'. You can override
using the `.groups` argument.
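The nesting described for the almond data, where the scarcity calibration observations are also part of the full calibration set, can be obtained by drawing one random order per cultivar and cutting it at two thresholds. This is a minimal sketch under assumptions, not the code used for the almond split: the `obs` table, the seed, and the column names `draw`, `scarcity` and `full` are invented for illustration, and every cultivar is assumed to have more than ten observations.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only so that the sketch is repeatable

# invented records: 2 hypothetical cultivars with 40 years each
obs <- data.frame(cultivar = rep(c("X", "Y"), each = 40),
                  year = rep(1981:2020, times = 2))

nested <- obs %>%
  group_by(cultivar) %>%
  mutate(
    # random permutation of 1..n within each cultivar
    draw = sample(n()),
    # the first ten draws form the scarcity calibration set
    scarcity = if_else(draw <= 10, "Calibration", "Validation"),
    # the first 75% of the draws form the full calibration set,
    # which therefore contains the scarcity calibration set
    full = if_else(draw <= round(0.75 * n()), "Calibration", "Validation")
  ) %>%
  ungroup()

# cross-tabulate the two splits; no observation is calibration in the
# scarcity split and validation in the full split
table(scarcity = nested$scarcity, full = nested$full)
```

Because both labels come from the same permutation, rerunning the sketch with the same seed reproduces both splits together.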