## Identifying Heat Stress Periods Based on Weather Data

This approach identifies heat stress periods from weather data alone. Temperature, relative humidity, and the Temperature-Humidity Index (THI), which combines the two, are known to influence heat stress in cows. By setting thresholds on these variables, we can flag the days on which weather conditions are likely to cause heat stress.

In [36]:
import pandas as pd
import numpy as np
import seaborn as sns
from scipy.optimize import curve_fit, OptimizeWarning
from tqdm import tqdm
import warnings

sns.set_theme()
sns.set_context("notebook")
%load_ext autoreload
%autoreload 2



In [37]:
dtype_dict = {
    'FarmName_Pseudo': 'str',
    'SE_Number': 'str',
    'AnimalNumber': 'Int64',          
    'StartDate': 'str',
    'StartTime': 'str',
    'DateTime': 'str',
    'LactationNumber': 'Int64',       
    'DaysInMilk': 'Int64', 
    'YearSeason': 'str',           
    'TotalYield': 'float',
    'BreedName': 'str',
    'Age': 'Int64',
    'Mother': 'str',
    'Father': 'str',
    'CullDecisionDate': 'str',
    'Temperature': 'float',
    'RelativeHumidity': 'float',      
    'THI_adj': 'float',
    'HW': 'Int64',                    
    'cum_HW': 'Int64',                
    'Temp15Threshold': 'Int64'        
}


# Load the CSV with specified dtypes
data = pd.read_csv('../Data/MergedData/CleanedYieldData.csv', dtype=dtype_dict)

# Convert date and time columns back to datetime and time objects
data['DateTime'] = pd.to_datetime(data['DateTime'], errors='coerce')
data['StartTime'] = pd.to_datetime(data['StartTime'], format='%H:%M:%S', errors='coerce').dt.time
data['StartDate'] = pd.to_datetime(data['StartDate'], errors='coerce')
data['CullDecisionDate'] = pd.to_datetime(data['CullDecisionDate'], errors='coerce')
data.head()

Unnamed: 0,FarmName_Pseudo,SE_Number,AnimalNumber,StartDate,StartTime,LactationNumber,DaysInMilk,TotalYield,DateTime,YearSeason,...,Mother,Father,CullDecisionDate,Temperature,RelativeHumidity,THI_adj,HW,cum_HW,Temp15Threshold,Age
0,a624fb9a,SE-064c0cec-1189,5189,2022-01-01,06:25:00,7,191,13.9,2022-01-01 06:25:00,2022-1,...,,,2022-12-20,-3.025,0.930917,28.012944,0,0,0,3095
1,a624fb9a,SE-064c0cec-1189,5189,2022-01-01,16:41:00,7,191,16.87,2022-01-01 16:41:00,2022-1,...,,,2022-12-20,-3.025,0.930917,28.012944,0,0,0,3095
2,a624fb9a,SE-064c0cec-1189,5189,2022-01-02,15:29:00,7,192,20.41,2022-01-02 15:29:00,2022-1,...,,,2022-12-20,-0.279167,0.990542,32.898193,0,0,0,3096
3,a624fb9a,SE-064c0cec-1189,5189,2022-01-02,03:31:00,7,192,16.28,2022-01-02 03:31:00,2022-1,...,,,2022-12-20,-0.279167,0.990542,32.898193,0,0,0,3096
4,a624fb9a,SE-064c0cec-1189,5189,2022-01-02,22:44:00,7,192,11.53,2022-01-02 22:44:00,2022-1,...,,,2022-12-20,-0.279167,0.990542,32.898193,0,0,0,3096


In [38]:
# Calculate the DailyYield for each cow each day
data['DailyYield'] = data.groupby(['SE_Number', 'StartDate'])['TotalYield'].transform('sum')

# Sort the data by AnimalNumber and StartDate
data.sort_values(['AnimalNumber', 'StartDate'], inplace=True)

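# Lag the daily yield by one row for each cow.
# Note: at this point there are still several rows per cow and day, so shift(1)
# can return the same day's value; the curve-fitting step later recomputes this column.
data['PreviousDailyYield'] = data.groupby('AnimalNumber')['DailyYield'].shift(1)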

# Calculate the daily yield change for each cow
data['DailyYieldChange'] = data['DailyYield'] - data['PreviousDailyYield']

# Group and aggregate data
data = data.groupby(['SE_Number', 'FarmName_Pseudo', 'StartDate']).agg({
    'DailyYield': 'first',
    'PreviousDailyYield': 'first',
    'DailyYieldChange': 'first',
    'HW': 'max',
    'Temperature': 'mean',
    'THI_adj': 'mean',
    'DaysInMilk': 'first',
    'YearSeason': 'first',
    'cum_HW': 'max',
    'Temp15Threshold': 'max',
    'Age': 'first',
    'BreedName': 'first',
    'LactationNumber': 'first'
}).reset_index()

# Renaming and formatting
data.rename(columns={
    'Temperature': 'MeanTemperature',
    'THI_adj': 'MeanTHI_adj',
    'StartDate': 'Date'
}, inplace=True)
data['Date'] = pd.to_datetime(data['Date'])

# Display the first few rows of the transformed data
data.head()

Unnamed: 0,SE_Number,FarmName_Pseudo,Date,DailyYield,PreviousDailyYield,DailyYieldChange,HW,MeanTemperature,MeanTHI_adj,DaysInMilk,YearSeason,cum_HW,Temp15Threshold,Age,BreedName,LactationNumber
0,SE-064c0cec-1189,a624fb9a,2022-01-01,30.77,30.77,0.0,0,-3.025,28.012944,191,2022-1,0,0,3095,02 SLB,7
1,SE-064c0cec-1189,a624fb9a,2022-01-02,48.22,30.77,17.45,0,-0.279167,32.898193,192,2022-1,0,0,3096,02 SLB,7
2,SE-064c0cec-1189,a624fb9a,2022-01-03,30.53,48.22,-17.69,0,2.033333,36.760487,193,2022-1,0,0,3097,02 SLB,7
3,SE-064c0cec-1189,a624fb9a,2022-01-04,42.26,30.53,11.73,0,0.066667,31.939524,194,2022-1,0,0,3098,02 SLB,7
4,SE-064c0cec-1189,a624fb9a,2022-01-05,38.49,42.26,-3.77,0,-3.7,26.498206,195,2022-1,0,0,3099,02 SLB,7
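
In the cell above, `shift(1)` runs before the data is reduced to one row per cow and day, which is why the first output row shows a `PreviousDailyYield` equal to the same day's `DailyYield` (30.77). A minimal sketch of one alternative, assuming the lag should be taken on day-level rows: aggregate first, then shift within each cow. The toy records below are made up for illustration.

```python
import pandas as pd

# Toy milking records: two cows, several milkings per day (values are made up)
records = pd.DataFrame({
    "SE_Number": ["A", "A", "A", "A", "B", "B", "B"],
    "StartDate": pd.to_datetime(
        ["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-03",
         "2022-01-01", "2022-01-02", "2022-01-02"]),
    "TotalYield": [10.0, 12.0, 25.0, 20.0, 8.0, 9.0, 9.5],
})

# 1) Collapse to one row per cow and day
daily = (
    records.groupby(["SE_Number", "StartDate"], as_index=False)["TotalYield"]
    .sum()
    .rename(columns={"TotalYield": "DailyYield"})
    .sort_values(["SE_Number", "StartDate"])
)

# 2) Shift within each cow, so the lag is the previous recorded day
daily["PreviousDailyYield"] = daily.groupby("SE_Number")["DailyYield"].shift(1)
daily["DailyYieldChange"] = daily["DailyYield"] - daily["PreviousDailyYield"]

print(daily)
```

The first day of each cow has no previous value and stays `NaN`. If recording days can be missing, the lag is the previous recorded day rather than the previous calendar day.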


In [39]:
# Define the THI threshold
THI_THRESHOLD = 61

# Daily heat load: signed distance of the mean adjusted THI from the threshold
data['HeatLoad'] = data['MeanTHI_adj'] - THI_THRESHOLD

# Running sum of the heat load over all rows, then clipped at zero.
# Note: this clips the global running total; it does not reset per cow.
data['CumulativeHeatLoad'] = data['HeatLoad'].cumsum()
data['CumulativeHeatLoad'] = data['CumulativeHeatLoad'].apply(lambda x: max(x, 0))

data.head(-5)

Unnamed: 0,SE_Number,FarmName_Pseudo,Date,DailyYield,PreviousDailyYield,DailyYieldChange,HW,MeanTemperature,MeanTHI_adj,DaysInMilk,YearSeason,cum_HW,Temp15Threshold,Age,BreedName,LactationNumber,HeatLoad,CumulativeHeatLoad
0,SE-064c0cec-1189,a624fb9a,2022-01-01,30.77,30.77,0.00,0,-3.025000,28.012944,191,2022-1,0,0,3095,02 SLB,7,-32.987056,0
1,SE-064c0cec-1189,a624fb9a,2022-01-02,48.22,30.77,17.45,0,-0.279167,32.898193,192,2022-1,0,0,3096,02 SLB,7,-28.101807,0
2,SE-064c0cec-1189,a624fb9a,2022-01-03,30.53,48.22,-17.69,0,2.033333,36.760487,193,2022-1,0,0,3097,02 SLB,7,-24.239513,0
3,SE-064c0cec-1189,a624fb9a,2022-01-04,42.26,30.53,11.73,0,0.066667,31.939524,194,2022-1,0,0,3098,02 SLB,7,-29.060476,0
4,SE-064c0cec-1189,a624fb9a,2022-01-05,38.49,42.26,-3.77,0,-3.700000,26.498206,195,2022-1,0,0,3099,02 SLB,7,-34.501794,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
487083,SE-fcdf259d-0044-0,f454e660,2023-06-03,12.67,15.75,-3.08,0,12.666667,53.132530,347,2023-3,0,1,4150,41 Fjällko,10,-7.867470,0
487084,SE-fcdf259d-0044-0,f454e660,2023-06-04,22.31,12.67,9.64,0,13.079167,56.726870,348,2023-3,0,1,4151,41 Fjällko,10,-4.273130,0
487085,SE-fcdf259d-0044-0,f454e660,2023-06-05,12.84,22.31,-9.47,0,14.237500,58.482418,349,2023-3,0,1,4152,41 Fjällko,10,-2.517582,0
487086,SE-fcdf259d-0044-0,f454e660,2023-06-06,9.47,12.84,-3.37,0,15.345833,60.546358,350,2023-3,0,1,4153,41 Fjällko,10,-0.453642,0


In [40]:
data['HeatLoad'].describe()

count    487093.000000
mean        -13.602936
std          13.001571
min         -50.929164
25%         -24.912793
50%         -13.281689
75%          -1.810921
max          13.596361
Name: HeatLoad, dtype: float64

In [41]:
data['RawCumulativeHeatLoad'] = data['HeatLoad'].cumsum()
data[['HeatLoad', 'RawCumulativeHeatLoad', 'CumulativeHeatLoad']].head(20)

Unnamed: 0,HeatLoad,RawCumulativeHeatLoad,CumulativeHeatLoad
0,-32.987056,-32.987056,0
1,-28.101807,-61.088863,0
2,-24.239513,-85.328377,0
3,-29.060476,-114.388852,0
4,-34.501794,-148.890647,0
5,-39.416952,-188.307599,0
6,-43.177558,-231.485157,0
7,-33.018292,-264.503449,0
8,-31.747625,-296.251074,0
9,-31.987207,-328.238281,0


In [None]:
# Calculate the cumulative heat load ensuring it never goes negative
data['CumulativeHeatLoad'] = data['RawCumulativeHeatLoad'].apply(lambda x: max(x, 0))
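
The running sum above is taken over the whole frame, across all cows, and clipped afterwards. Because the mean `HeatLoad` is about -13.6, the global total stays negative and the clipped column is 0 in every row shown. A hedged sketch of one alternative, assuming the intent is a per-cow accumulator that resets to zero instead of going negative: the helper `floored_cumsum` and the toy values are hypothetical and not part of the notebook.

```python
import pandas as pd

THI_THRESHOLD = 61  # same threshold as in the notebook


def floored_cumsum(values: pd.Series) -> pd.Series:
    """Running sum that is held at zero whenever it would go negative."""
    total = 0.0
    out = []
    for v in values:
        total = max(total + v, 0.0)
        out.append(total)
    return pd.Series(out, index=values.index)


# Toy data: two cows with made-up THI values
toy = pd.DataFrame({
    "SE_Number": ["A"] * 5 + ["B"] * 3,
    "Date": pd.to_datetime(
        ["2022-07-01", "2022-07-02", "2022-07-03", "2022-07-04", "2022-07-05",
         "2022-07-01", "2022-07-02", "2022-07-03"]),
    "MeanTHI_adj": [55.0, 64.0, 66.0, 58.0, 63.0, 70.0, 50.0, 62.0],
})
toy["HeatLoad"] = toy["MeanTHI_adj"] - THI_THRESHOLD

# Accumulate within each cow, in date order
toy = toy.sort_values(["SE_Number", "Date"])
toy["CumulativeHeatLoad"] = (
    toy.groupby("SE_Number")["HeatLoad"].transform(floored_cumsum)
)

print(toy[["SE_Number", "Date", "HeatLoad", "CumulativeHeatLoad"]])
```

With this variant a cool day lowers the accumulated load but a long cold spell cannot build up a deficit that hides later hot days.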

## Wilmink Lactation Curve
$$
Y(t) = a + bt + c \exp(-dt)
$$
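- \(Y(t)\): Milk yield at time \(t\) post-calving, where \(t\) = DaysInMilk
- \(a\): Level parameter that sets the overall height of the curve
- \(b\): Slope of the linear term; typically negative, describing the gradual decline in yield after the peak
- \(c\): Scale of the exponential term; typically negative, describing how far early-lactation yield sits below the level set by \(a\)
- \(d\): Decay rate of the exponential term, which controls how quickly yield rises toward its peak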

The Wilmink model captures the lactation curve by considering both linear and exponential components, providing a flexible representation of milk production dynamics over the lactation period.
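
To make the shape of the curve concrete, the sketch below evaluates it for one set of illustrative parameters (made up, not fitted to this dataset) and locates the peak. Setting \(dY/dt = b - c\,d\,\exp(-dt) = 0\) gives \(t^{*} = \ln(cd/b)/d\), which is defined when \(b\) and \(c\) have the same sign, as in the typical case \(b < 0\), \(c < 0\).

```python
import numpy as np


def wilmink(t, a, b, c, d):
    """Wilmink curve Y(t) = a + b*t + c*exp(-d*t)."""
    t = np.asarray(t, dtype=float)
    return a + b * t + c * np.exp(-d * t)


# Illustrative parameters only (assumed, not estimated from the data)
a, b, c, d = 40.0, -0.05, -15.0, 0.05

# Analytic peak day from dY/dt = 0
peak_day = np.log(c * d / b) / d

# Numerical check on a daily grid over a 305-day lactation
days = np.arange(1, 306)
curve = wilmink(days, a, b, c, d)
numeric_peak = days[np.argmax(curve)]

print(f"analytic peak day: {peak_day:.2f}")
print(f"grid peak day:     {numeric_peak}")
print(f"peak yield:        {curve.max():.2f}")
```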


Normalize the dataset using the Wilmink lactation curve, after removing outliers with unreasonable yield values.

In [42]:
# Define the Wilmink Lactation Curve function
def wilmink_lactation_curve(dim, a, b, c, d):
    dim = np.array(dim, dtype=float)
    return a + b * dim + c * np.exp(-d * dim)

# Function to detect and remove outliers
def remove_outliers(group, threshold=3.5):
    mean = np.mean(group['DailyYield'])
    std_dev = np.std(group['DailyYield'])
    return group[(group['DailyYield'] > mean - threshold * std_dev) & (group['DailyYield'] < mean + threshold * std_dev)]

# Function to smooth the data using a rolling average
def smooth_data(group, window=5):
    group = group.copy()
    group['DailyYield'] = group['DailyYield'].rolling(window, min_periods=1).mean()
    return group

# Function to fit the Wilmink Lactation Curve to the dataset
def fit_wilmink_lactation_curve(dataset):
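    # Initialize the output columns to NaN so they exist even if no fit succeeds
    dataset['ExpectedYield'] = np.nan
    dataset['NormalizedDailyYield'] = np.nan
    dataset['NormalizedDailyYieldChange'] = np.nan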
    
    valid_indices = []

    # Group the dataset by 'SE_Number' and 'LactationNumber' and fit the curve for each segment
    for (animal_number, lactation_number), group in tqdm(dataset.groupby(['SE_Number', 'LactationNumber']), unit=" Segments"):
        # Prepare the data for fitting
        group = remove_outliers(group, threshold=3.5)  # Remove outliers with threshold 3.5
        group = smooth_data(group)  # Smooth the data
        x_data = group['DaysInMilk'].values
        y_data = group['DailyYield'].values
        
        # Ensure there are no NaN or infinite values in the data
        if not np.isfinite(x_data).all() or not np.isfinite(y_data).all():
            print(f"Non-finite values found for cow {animal_number}, lactation {lactation_number}, skipping.")
            continue
        
        # Ensure there are enough data points to fit the curve
        if len(x_data) < 10 or len(y_data) < 10:
            print(f"Insufficient data points for cow {animal_number}, lactation {lactation_number}, skipping.")
            continue

        valid_indices.extend(group.index)
        
        # Fit the model
        try:
            # Initial parameter guesses
            initial_guesses = [np.mean(y_data), 0, np.mean(y_data) / 2, 0.1]
            # Constrain d >= 0 so that exp(-d * dim) cannot overflow; a, b, c are unbounded
            bounds = ([-np.inf, -np.inf, -np.inf, 0], [np.inf, np.inf, np.inf, np.inf])
            
            with warnings.catch_warnings():
                warnings.filterwarnings('error', category=OptimizeWarning)
                try:
                    popt, pcov = curve_fit(
                        wilmink_lactation_curve, x_data, y_data,
                        p0=initial_guesses, bounds=bounds, maxfev=30000
                    )
                    
                    # Predict the expected yield using the fitted model
                    dataset.loc[group.index, 'ExpectedYield'] = wilmink_lactation_curve(group['DaysInMilk'], *popt)
                    
                    # Normalize the DailyYield
                    dataset.loc[group.index, 'NormalizedDailyYield'] = group['DailyYield'] / dataset.loc[group.index, 'ExpectedYield']
                    
                    # Calculate the daily yield change and normalize it
                    dataset.loc[group.index, 'PreviousDailyYield'] = group['DailyYield'].shift(1)
                    dataset.loc[group.index, 'DailyYieldChange'] = group['DailyYield'] - dataset.loc[group.index, 'PreviousDailyYield']
                    dataset.loc[group.index, 'NormalizedDailyYieldChange'] = dataset.loc[group.index, 'DailyYieldChange'] / dataset.loc[group.index, 'ExpectedYield']
                
                except OptimizeWarning:
                    print(f"OptimizeWarning for cow {animal_number}, lactation {lactation_number}, skipping.")
            
        except RuntimeError as e:
            print(f"Curve fit failed for cow {animal_number}, lactation {lactation_number}: {e}")
        except ValueError as e:
            print(f"Value error for cow {animal_number}, lactation {lactation_number}: {e}")
    
    # Keep only valid indices
    dataset = dataset.loc[valid_indices].reset_index(drop=True)
    
    # Fill any NaN values in the newly created columns with 0
    dataset['ExpectedYield'] = dataset['ExpectedYield'].fillna(0)
    dataset['NormalizedDailyYield'] = dataset['NormalizedDailyYield'].fillna(0)
    dataset['PreviousDailyYield'] = dataset['PreviousDailyYield'].fillna(0)
    dataset['DailyYieldChange'] = dataset['DailyYieldChange'].fillna(0)
    dataset['NormalizedDailyYieldChange'] = dataset['NormalizedDailyYieldChange'].fillna(0)
    
    return dataset

# Apply the curve fitting function to your dataset
data = fit_wilmink_lactation_curve(data)

Insufficient data points for cow SE-5c06d92d-2621, lactation 3, skipping.
Insufficient data points for cow SE-5c06d92d-2639, lactation 3, skipping.
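
Inside `fit_wilmink_lactation_curve`, the normalized columns are computed from the smoothed yield (5-day rolling mean with `min_periods=1`), not from the raw `DailyYield`. The sketch below reproduces that arithmetic for the first five rows of cow `SE-064c0cec-1189`, taking the `DailyYield` and `ExpectedYield` values from the notebook's own output tables rather than refitting the curve.

```python
import pandas as pd

# Values copied from the notebook output for SE-064c0cec-1189, lactation 7
daily_yield = pd.Series([30.77, 48.22, 30.53, 42.26, 38.49])
expected_yield = pd.Series([35.914865, 35.799613, 35.68436, 35.569108, 35.453856])

# Same smoothing as smooth_data(): rolling mean over up to 5 days
smoothed = daily_yield.rolling(5, min_periods=1).mean()

# Same ratios as in the fitting loop
normalized = smoothed / expected_yield
previous = smoothed.shift(1)
change = smoothed - previous
normalized_change = change / expected_yield

print(pd.DataFrame({
    "DailyYield": daily_yield,
    "Smoothed": smoothed,
    "NormalizedDailyYield": normalized,
    "DailyYieldChange": change,
    "NormalizedDailyYieldChange": normalized_change,
}).round(6))
```

The second row, for example, gives a smoothed yield of 39.495 and a normalized yield of about 1.1032, even though the raw yield that day was 48.22.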

In [None]:
# # Define heat stress conditions
# def identify_weather_based_heat_stress(row):
#     if (row['HW'] == 1 or 
#         row['cum_HW'] > 0 or 
#         row['MeanTHI_adj'] > 63.5 or
#         row['MeanTemperature'] > 17.5):
#         return 1
#     return 0

# # Apply the function to identify heat stress periods
# data['HeatStress'] = data.apply(identify_weather_based_heat_stress, axis=1)

# data.head()
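
For reference, the commented-out rule above can also be written without a row-wise `apply`. This is a minimal sketch using the same thresholds (63.5 for `MeanTHI_adj`, 17.5 for `MeanTemperature`) on a made-up four-row frame; the name `weather_heat_stress` is hypothetical.

```python
import pandas as pd

# Toy rows with made-up values; column names follow the notebook
toy = pd.DataFrame({
    "HW": [0, 1, 0, 0],
    "cum_HW": [0, 1, 0, 0],
    "MeanTHI_adj": [55.0, 60.0, 64.0, 61.0],
    "MeanTemperature": [12.0, 16.0, 17.0, 18.0],
})

# A day is flagged if any one of the four weather conditions holds
weather_heat_stress = (
    (toy["HW"] == 1)
    | (toy["cum_HW"] > 0)
    | (toy["MeanTHI_adj"] > 63.5)
    | (toy["MeanTemperature"] > 17.5)
).astype(int)

print(weather_heat_stress.tolist())
```

Boolean column operations avoid calling a Python function once per row, which matters on a frame with several hundred thousand rows.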

In [None]:
# When CumulativeHeatLoad is greater than 0, the cow is considered to be under heat stress
data['HeatStress'] = (data['CumulativeHeatLoad'] > 0).astype(int)
data.head()

Unnamed: 0,SE_Number,FarmName_Pseudo,Date,DailyYield,PreviousDailyYield,DailyYieldChange,HW,MeanTemperature,MeanTHI_adj,DaysInMilk,...,Temp15Threshold,Age,BreedName,LactationNumber,HeatLoad,CumulativeHeatLoad,ExpectedYield,NormalizedDailyYield,NormalizedDailyYieldChange,HeatStress
0,SE-064c0cec-1189,a624fb9a,2022-01-01,30.77,0.0,0.0,0,-3.025,28.012944,191,...,0,3095,02 SLB,7,-32.987056,0,35.914865,0.856748,0.0,0
1,SE-064c0cec-1189,a624fb9a,2022-01-02,48.22,30.77,8.725,0,-0.279167,32.898193,192,...,0,3096,02 SLB,7,-28.101807,0,35.799613,1.103224,0.243718,0
2,SE-064c0cec-1189,a624fb9a,2022-01-03,30.53,39.495,-2.988333,0,2.033333,36.760487,193,...,0,3097,02 SLB,7,-24.239513,0,35.68436,1.023044,-0.083744,0
3,SE-064c0cec-1189,a624fb9a,2022-01-04,42.26,36.506667,1.438333,0,0.066667,31.939524,194,...,0,3098,02 SLB,7,-29.060476,0,35.569108,1.066796,0.040438,0
4,SE-064c0cec-1189,a624fb9a,2022-01-05,38.49,37.945,0.109,0,-3.7,26.498206,195,...,0,3099,02 SLB,7,-34.501794,0,35.453856,1.073339,0.003074,0


In [None]:
# Reorder columns
new_order = [
    "Date", "FarmName_Pseudo", "SE_Number", "Age", "BreedName",
    "LactationNumber", "DaysInMilk", "YearSeason", "DailyYield",
    "PreviousDailyYield", "DailyYieldChange", "ExpectedYield",
    "NormalizedDailyYield", "NormalizedDailyYieldChange", "HeatStress",
    "Temp15Threshold", "HW", "cum_HW", "MeanTemperature", "MeanTHI_adj",
    "HeatLoad", "CumulativeHeatLoad"
]
data = data[new_order]
data.head()

Unnamed: 0,Date,FarmName_Pseudo,SE_Number,Age,BreedName,LactationNumber,DaysInMilk,YearSeason,DailyYield,PreviousDailyYield,...,NormalizedDailyYield,NormalizedDailyYieldChange,HeatStress,Temp15Threshold,HW,cum_HW,MeanTemperature,MeanTHI_adj,HeatLoad,CumulativeHeatLoad
0,2022-01-01,a624fb9a,SE-064c0cec-1189,3095,02 SLB,7,191,2022-1,30.77,0.0,...,0.856748,0.0,0,0,0,0,-3.025,28.012944,-32.987056,0
1,2022-01-02,a624fb9a,SE-064c0cec-1189,3096,02 SLB,7,192,2022-1,48.22,30.77,...,1.103224,0.243718,0,0,0,0,-0.279167,32.898193,-28.101807,0
2,2022-01-03,a624fb9a,SE-064c0cec-1189,3097,02 SLB,7,193,2022-1,30.53,39.495,...,1.023044,-0.083744,0,0,0,0,2.033333,36.760487,-24.239513,0
3,2022-01-04,a624fb9a,SE-064c0cec-1189,3098,02 SLB,7,194,2022-1,42.26,36.506667,...,1.066796,0.040438,0,0,0,0,0.066667,31.939524,-29.060476,0
4,2022-01-05,a624fb9a,SE-064c0cec-1189,3099,02 SLB,7,195,2022-1,38.49,37.945,...,1.073339,0.003074,0,0,0,0,-3.7,26.498206,-34.501794,0


In [None]:
# Save the reordered DataFrame to a CSV file
data.to_csv('../Data/MergedData/HeatApproachYieldData.csv', index=False)
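
A CSV file does not store dtypes, so the datetime and nullable-integer columns have to be restored when the file is read back. The sketch below shows one way to do that, using an in-memory buffer and a made-up two-row frame in place of the real file; only a few of the columns are included.

```python
import io

import pandas as pd

# Small stand-in for the saved frame (values are illustrative)
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-01", "2022-01-02"]),
    "SE_Number": ["SE-064c0cec-1189", "SE-064c0cec-1189"],
    "LactationNumber": pd.array([7, 7], dtype="Int64"),
    "DailyYield": [30.77, 48.22],
    "HeatStress": [0, 0],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# Restore dtypes explicitly when reading
restored = pd.read_csv(
    buffer,
    dtype={"SE_Number": "str", "LactationNumber": "Int64", "HeatStress": "Int64"},
    parse_dates=["Date"],
)

print(restored.dtypes)
```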

### Variables Explanation for `HeatApproachYieldData.csv`

1. **Date**:
   - Description: The date when the milk yield was recorded.
   - Datatype: `datetime`
   - Format: `YYYY-MM-DD`
   - Example: `2022-01-01`

2. **FarmName_Pseudo**:
   - Description: A pseudo-identifier for the farm where the data was collected.
   - Datatype: `str`
   - Example: `a624fb9a`

3. **SE_Number**:
   - Description: A unique identifier for the cow, which has been formatted to include the farm and the animal number.
   - Datatype: `str`
   - Example: `SE-064c0cec-1189`

4. **Age**:
   - Description: The age of the cow in days.
   - Datatype: `Int64`
   - Example: `3095`

5. **BreedName**:
   - Description: The breed name of the cow.
   - Datatype: `str`
   - Example: `02 SLB`

6. **LactationNumber**:
   - Description: The number assigned to the cow's lactation cycle.
   - Datatype: `Int64`
   - Example: `7`

7. **DaysInMilk**:
   - Description: The number of days the cow has been in milk (lactating) at the time of recording.
   - Datatype: `Int64`
   - Example: `191`

8. **YearSeason**:
   - Description: The seasonal period based on the year and the month range.
   - Datatype: `str`
   - Example: `2022-1`
   - YearSeason parameters in yield datasets:
     - 1: Dec-Feb
     - 2: Mar-May
     - 3: Jun-Aug
     - 4: Sep-Nov

9. **DailyYield**:
   - Description: The total amount of milk produced by the cow in a single day.
   - Datatype: `float`
   - Example: `30.77`

10. **PreviousDailyYield**:
    - Description: The smoothed daily yield of the cow on the previous recorded day within the same lactation. Missing values, such as the first day of a lactation, are filled with 0.
    - Datatype: `float`
    - Example: `0.0`

11. **DailyYieldChange**:
    - Description: The change in smoothed daily milk yield from the previous recorded day. Set to 0 where no previous value exists.
    - Datatype: `float`
    - Example: `0.0`

12. **ExpectedYield**:
    - Description: The milk yield predicted for the cow's `DaysInMilk` by the Wilmink lactation curve fitted to that cow and lactation.
    - Datatype: `float`
    - Example: `35.914865`

13. **NormalizedDailyYield**:
    - Description: The smoothed daily yield (5-day rolling mean) divided by `ExpectedYield`. Values near 1 mean the cow is producing what the fitted curve predicts.
    - Datatype: `float`
    - Example: `0.856748`

14. **NormalizedDailyYieldChange**:
    - Description: `DailyYieldChange` divided by `ExpectedYield`.
    - Datatype: `float`
    - Example: `0.0`

15. **HeatStress**:
    - Description: A binary variable set to 1 when `CumulativeHeatLoad` is greater than 0, indicating heat stress, and 0 otherwise.
    - Datatype: `Int64`
    - Example: `0`

16. **Temp15Threshold**:
    - Description: A binary variable indicating if the temperature exceeded 15 degrees Celsius on the given day.
    - Datatype: `Int64`
    - Example: `0`

17. **HW**:
    - Description: A binary variable indicating the presence of a heatwave on the day.
    - Datatype: `Int64`
    - Example: `0`

18. **cum_HW**:
    - Description: Cumulative number of heatwave days up to the current date.
    - Datatype: `Int64`
    - Example: `0`

19. **MeanTemperature**:
    - Description: The mean temperature recorded on the day.
    - Datatype: `float`
    - Example: `-3.025`

20. **MeanTHI_adj**:
    - Description: The mean adjusted Temperature-Humidity Index for the day.
    - Datatype: `float`
    - Example: `28.012944`