In [2]:
import pandas as pd

In [43]:
# --------------------------------------------------------------------
#                          DATA LOADING
# --------------------------------------------------------------------
# The CSV file is assumed to be located at ../data/dairy_cows.csv.
# Adjust the path as needed for your environment.

df = pd.read_csv("../data/dairy_cows.csv")
df.head()

Unnamed: 0,ID,Species,Animal_Class,WQ_Principles,WQ_Criteria,Welfare_Hazards_Animal,Welfare_Hazards_Consequences,Welfare_Hazards_Affective_States,Welfare_Hazards_Impact,Ease_of_Hazard_Mitigation,Welfare_Indicator,Indicator_Type,Indicator_Dimensions,Indicator_Ease,Indicator_Resources,Hazards_Source_1,Hazards_Source_2,Global_Usage,Norway_Usage
0,1,Dairy cows,Tie stalls,Good Health,Absence of disease,Pasture access,Gastro-enteric disorders,Discomfort,High,Moderate,Abdominal discomfort,Welfare outcome,Health/physical/production,Moderate,Low,7.0,,F,Y
1,1,Dairy cows,Cubicles,Good Health,Absence of disease,Pasture access,Gastro-enteric disorders,Discomfort,High,Moderate,Abdominal discomfort,Welfare outcome,Health/physical/production,Moderate,Low,7.0,,F,Y
2,2,Dairy cows,Tie stalls,Appropiate behaviour,Expression of social behaviours,Continuous housing for long periods,General disruption of behaviour,Di(stress),Low,Difficult,Agonistic behaviour,Welfare outcome,Behavioural,Moderate,Low,7.0,,"F, R",Y
3,2,Dairy cows,Cubicles,Appropiate behaviour,Expression of social behaviours,Continuous housing for long periods,General disruption of behaviour,Di(stress),Low,Difficult,Agonistic behaviour,Welfare outcome,Behavioural,Moderate,Low,7.0,,"F, R",Y
4,3,Dairy cows,Tie stalls,Appropiate behaviour,Expression of other behaviours,Insufficient space,Restriction of movement,Discomfort,High,Moderate,Agonistic interactions,Welfare outcome,Behavioural,Moderate,Medium,,13a,R,N (research only)


## Data cleaning and preparation

1. Convert certain columns to string and strip whitespace to ensure consistency (e.g., "Pasture access" vs. " Pasture access").
2. ...

In [44]:
categorical_columns = ['Species', 'Animal_Class', 'Welfare_Hazards_Animal',
                       'Welfare_Hazards_Consequences', 'Welfare_Hazards_Impact',
                       'Ease_of_Hazard_Mitigation', 'Welfare_Indicator',
                       'Indicator_Ease', 'Indicator_Resources']
for col in categorical_columns:
    df[col] = df[col].astype(str).str.strip()
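
To illustrate why the whitespace stripping matters, here is a minimal sketch on a made-up three-row frame (the `demo` data is an assumption, not taken from the CSV). Labels that differ only by surrounding spaces count as separate categories until they are stripped:

```python
import pandas as pd

# Hypothetical sample: the same hazard label with stray whitespace.
demo = pd.DataFrame(
    {"Welfare_Hazards_Animal": ["Pasture access", " Pasture access", "Pasture access "]}
)

# Before cleaning, each spelling counts as a distinct category.
before = demo["Welfare_Hazards_Animal"].nunique()

# Same cleaning step as in the cell above.
demo["Welfare_Hazards_Animal"] = demo["Welfare_Hazards_Animal"].astype(str).str.strip()
after = demo["Welfare_Hazards_Animal"].nunique()

print(before, after)  # 3 1
```

Without this step, later `nunique` counts such as the coverage score would be inflated by spelling variants.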

## CATEGORICAL -> NUMERICAL MAPPINGS

We create numerical mappings for "Low", "High", etc. so we can compute average impacts, sums, or other numeric operations.

- *impact_mapping*: Map impact level to numeric (Low=1, High=2)
- *ease_mapping*: Map ease-of-measurement or hazard mitigation to numeric
   (Easy=1, Moderate=2, Difficult=3)
- *resources_mapping*: Map resource usage to numeric (Low=1, Medium=2, High=3)

In [45]:
impact_mapping = {
    'Low': 1, 
    'High': 2
}

ease_mapping = {
    'Easy': 1, 
    'Moderate': 2, 
    'Difficult': 3
}

resources_mapping = {
    'Low': 1, 
    'Medium': 2, 
    'High': 3
}

# Apply mappings to DataFrame columns
df['Impact_Num'] = df['Welfare_Hazards_Impact'].map(impact_mapping)
df['Ease_Num'] = df['Indicator_Ease'].map(ease_mapping)
df['Resources_Num'] = df['Indicator_Resources'].map(resources_mapping)
df['Mitigation_Ease_Num'] = df['Ease_of_Hazard_Mitigation'].map(ease_mapping)

df.head()

Unnamed: 0,ID,Species,Animal_Class,WQ_Principles,WQ_Criteria,Welfare_Hazards_Animal,Welfare_Hazards_Consequences,Welfare_Hazards_Affective_States,Welfare_Hazards_Impact,Ease_of_Hazard_Mitigation,...,Indicator_Ease,Indicator_Resources,Hazards_Source_1,Hazards_Source_2,Global_Usage,Norway_Usage,Impact_Num,Ease_Num,Resources_Num,Mitigation_Ease_Num
0,1,Dairy cows,Tie stalls,Good Health,Absence of disease,Pasture access,Gastro-enteric disorders,Discomfort,High,Moderate,...,Moderate,Low,7.0,,F,Y,2,2,1,2
1,1,Dairy cows,Cubicles,Good Health,Absence of disease,Pasture access,Gastro-enteric disorders,Discomfort,High,Moderate,...,Moderate,Low,7.0,,F,Y,2,2,1,2
2,2,Dairy cows,Tie stalls,Appropiate behaviour,Expression of social behaviours,Continuous housing for long periods,General disruption of behaviour,Di(stress),Low,Difficult,...,Moderate,Low,7.0,,"F, R",Y,1,2,1,3
3,2,Dairy cows,Cubicles,Appropiate behaviour,Expression of social behaviours,Continuous housing for long periods,General disruption of behaviour,Di(stress),Low,Difficult,...,Moderate,Low,7.0,,"F, R",Y,1,2,1,3
4,3,Dairy cows,Tie stalls,Appropiate behaviour,Expression of other behaviours,Insufficient space,Restriction of movement,Discomfort,High,Moderate,...,Moderate,Medium,,13a,R,N (research only),2,2,2,2
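
One thing worth checking after applying the mappings: `Series.map` with a dict returns NaN for any label that is not a key, and those NaNs then silently drop out of the averages. A small sketch of such a check follows; the helper name `unmapped_values` and the sample labels are assumptions made for illustration.

```python
import pandas as pd

def unmapped_values(series, mapping):
    """Return the distinct non-null labels in `series` that `mapping` does not cover."""
    labels = series.dropna().unique()
    return sorted(set(labels) - set(mapping))

# Hypothetical column containing a label the impact mapping does not know.
demo = pd.Series(["Low", "High", "Medium", "High"])
impact_mapping = {"Low": 1, "High": 2}

print(unmapped_values(demo, impact_mapping))   # ['Medium']
print(demo.map(impact_mapping).isna().sum())   # 1
```

Running a check like this on each mapped column before aggregating makes it clear whether any rows are being excluded from the scores.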


### Calculate coverage score per indicator

Coverage_Score is the weighted sum of two counts per indicator: the number of unique hazards and the number of unique consequences linked to it.

### Data aggregations
Aggregate the data by Welfare_Indicator. We'll compute:
- Coverage_Score: Number of unique hazards plus number of unique consequences an indicator covers (weighted by alpha and beta)
- Impact_Score: Average impact level (High=2, Low=1)
- Cost and Time: Estimated from Indicator_Ease and Indicator_Resources

In [51]:

# --------------------------------------------------------------------
#          AGGREGATION: COVERAGE & IMPACT PER INDICATOR
# --------------------------------------------------------------------
# 1. Compute Coverage_Score: 
#    Number of unique hazards and consequences each indicator covers.
# 2. Compute Impact_Score:
#    Average 'Impact_Num' across rows for each indicator.

alpha = 1.0  # weight for hazards
beta  = 1.0  # weight for consequences


coverage = (
    df.groupby('Welfare_Indicator', as_index=False)
    .agg(
       n_risks=('Welfare_Hazards_Animal', 'nunique'),
       n_cons=('Welfare_Hazards_Consequences', 'nunique')
    )
)

coverage['Coverage_Score'] = alpha * coverage['n_risks'] + beta * coverage['n_cons']
coverage = coverage[['Welfare_Indicator', 'Coverage_Score']]

impact = (
    df.groupby('Welfare_Indicator')
    .agg({'Impact_Num': 'mean'})
    .reset_index()
    .rename(columns={'Impact_Num': 'Impact_Score'})
)

# Merge coverage and impact DataFrames
df_agg = pd.merge(coverage, impact, on='Welfare_Indicator')
df_agg.drop_duplicates(subset='Welfare_Indicator', inplace=True)
df_agg.head()


Unnamed: 0,Welfare_Indicator,Coverage_Score,Impact_Score
0,Abdominal discomfort,2.0,2.0
1,Agonistic behaviour,2.0,1.0
2,Agonistic interactions,2.0,2.0
3,Allo-grooming,2.0,1.0
4,Altered resting posture,2.0,2.0
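
The aggregation can be traced by hand on a tiny example. The sketch below uses a made-up frame (indicators "A" and "B", hazards "h1"/"h2", consequences "c1"/"c2" are all hypothetical) and computes both scores in a single named aggregation, which is one possible way to write it:

```python
import pandas as pd

# Hypothetical rows: indicator A appears with two hazards and one consequence.
toy = pd.DataFrame({
    "Welfare_Indicator": ["A", "A", "A", "B"],
    "Welfare_Hazards_Animal": ["h1", "h2", "h2", "h1"],
    "Welfare_Hazards_Consequences": ["c1", "c1", "c1", "c2"],
    "Impact_Num": [2, 1, 1, 2],
})

alpha, beta = 1.0, 1.0  # same weights as in the cell above

summary = toy.groupby("Welfare_Indicator", as_index=False).agg(
    n_hazards=("Welfare_Hazards_Animal", "nunique"),
    n_cons=("Welfare_Hazards_Consequences", "nunique"),
    Impact_Score=("Impact_Num", "mean"),
)
summary["Coverage_Score"] = alpha * summary["n_hazards"] + beta * summary["n_cons"]

print(summary)
```

Indicator A gets a Coverage_Score of 3.0 (two hazards plus one consequence) and an Impact_Score of 4/3; indicator B gets 2.0 and 2.0. Note that the score adds the two counts; it does not count unique hazard-consequence pairs.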


### Define Coefficients for Cost and Time

Define the coefficients used to estimate each indicator's cost and time.
These coefficients can be adjusted based on priorities, for example by using higher coefficients to penalize difficult indicators more heavily.

#### Coefficients for Cost
- coeff_cost_ease = 50        (Coefficient for Indicator_Ease)
- coeff_cost_resources = 25   (Coefficient for Indicator_Resources)

#### Coefficients for Time
- coeff_time_ease = 2         (Coefficient for Indicator_Ease)
- coeff_time_resources = 1    (Coefficient for Indicator_Resources)

In [52]:
# --------------------------------------------------------------------
#          COST & TIME COEFFICIENTS (Customizable)
# --------------------------------------------------------------------
# We define some coefficients that will be used to estimate
# 'Total_Cost' and 'Total_Time' for each indicator. 
# The user can adjust them based on the project's needs.

coeff_cost_ease = 50          # Weight cost by how easy/difficult an indicator is
coeff_cost_resources = 25     # Weight cost by how resource-intensive an indicator is
coeff_mitigation = 0.0        # Example: you could incorporate hazard mitigation ease if desired

coeff_time_ease = 2           # Weight time by how easy/difficult an indicator is
coeff_time_resources = 1      # Weight time by how resource-intensive an indicator is


In [53]:
# --------------------------------------------------------------------
#    MERGE AGGREGATION RESULTS WITH EASE/RESOURCE INFO
# --------------------------------------------------------------------
# We retrieve each indicator's (Ease_Num, Resources_Num) from original df
# so we can compute total costs and times.

indicator_info = (
    df.groupby('Welfare_Indicator', as_index=False)
    .agg({
        'Ease_Num': 'mean',
        'Resources_Num': 'mean',
        'Mitigation_Ease_Num': 'mean'
    })
)
df_agg = pd.merge(df_agg, indicator_info, on='Welfare_Indicator', how='left')

# Compute estimated total cost & time
df_agg['Total_Cost'] = (
    coeff_cost_ease * df_agg['Ease_Num'] 
    + coeff_cost_resources * df_agg['Resources_Num']
)
df_agg['Total_Time'] = (
    coeff_time_ease * df_agg['Ease_Num']
    + coeff_time_resources * df_agg['Resources_Num']
)

df_agg


Unnamed: 0,Welfare_Indicator,Coverage_Score,Impact_Score,Ease_Num,Resources_Num,Mitigation_Ease_Num,Total_Cost,Total_Time
0,Abdominal discomfort,2.0,2.0,2.0,1.0,2.0,125.0,5.0
1,Agonistic behaviour,2.0,1.0,2.0,1.0,3.0,125.0,5.0
2,Agonistic interactions,2.0,2.0,2.0,2.0,2.0,150.0,6.0
3,Allo-grooming,2.0,1.0,2.0,2.0,3.0,150.0,6.0
4,Altered resting posture,2.0,2.0,2.0,1.0,3.0,125.0,5.0
5,Amount of eye white,3.0,2.0,2.0,2.0,2.0,150.0,6.0
6,Body condition scoring,5.0,2.0,2.0,1.0,1.5,125.0,5.0
7,Brush use,2.0,1.0,2.0,2.0,3.0,150.0,6.0
8,Calving behaviour (difficult/long calving),2.0,2.0,2.0,1.0,2.0,125.0,5.0
9,Calving records (death of cow),4.0,2.0,1.0,1.0,1.333333,75.0,3.0
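
As a quick check of the arithmetic, the sketch below recomputes the cost and time estimates for single indicators. The helper name `estimate_cost_time` is hypothetical; its default coefficients are the ones defined in this notebook.

```python
def estimate_cost_time(ease_num, resources_num,
                       coeff_cost_ease=50, coeff_cost_resources=25,
                       coeff_time_ease=2, coeff_time_resources=1):
    """Linear cost/time estimate from the numeric ease and resource levels."""
    cost = coeff_cost_ease * ease_num + coeff_cost_resources * resources_num
    time = coeff_time_ease * ease_num + coeff_time_resources * resources_num
    return cost, time

# Moderate ease (2), low resources (1), e.g. "Abdominal discomfort"
print(estimate_cost_time(2, 1))  # (125, 5)

# Easy (1), low resources (1), e.g. "Calving records (death of cow)"
print(estimate_cost_time(1, 1))  # (75, 3)
```

Both results match the Total_Cost and Total_Time columns in the table above.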


--------------------------------------------------------------------
##     OBJECTIVE FUNCTION & EFFICIENCY SCORING
--------------------------------------------------------------------
We define an objective score that focuses on:
   - Coverage_Score 
   - (optionally) Impact_Score
   - Possibly hazard mitigation or other factors

Then we define 'Efficiency' as the ratio of Objective_Score to the combined cost and time (or any other formula you prefer).

The user can tune weights for coverage, impact, cost, time, etc.


In [54]:
# Define weights for coverage and impact in the objective function
WEIGHT_COVERAGE = 1.0
WEIGHT_IMPACT = 0.0

# Define how much we penalize cost/time in the efficiency metric
WEIGHT_COST = 1.0
WEIGHT_TIME = 1.0

# Calculate the objective score
df_agg['Objective_Score'] = (
    WEIGHT_COVERAGE * df_agg['Coverage_Score'] 
    + WEIGHT_IMPACT * df_agg['Impact_Score']
    + coeff_mitigation * df_agg['Mitigation_Ease_Num']
)

# Calculate an 'Efficiency' metric
# E.g., Efficiency = Objective_Score / (Cost + Time)
df_agg['Efficiency'] = (
    df_agg['Objective_Score'] 
    / (WEIGHT_COST * df_agg['Total_Cost'] + WEIGHT_TIME * df_agg['Total_Time'])
)

# Sort by Efficiency (descending)
df_sorted = df_agg.sort_values(by='Efficiency', ascending=False).reset_index(drop=True)

print("\nIndicators Sorted by Efficiency (top 10):")
print(
    df_sorted[
        [
            'Welfare_Indicator', 
            'Coverage_Score', 
            'Impact_Score', 
            'Total_Cost', 
            'Total_Time', 
            'Objective_Score', 
            'Efficiency'
        ]
    ].head(10)
)

df_sorted.head(10)



Indicators Sorted by Efficiency (top 10):
                                   Welfare_Indicator  Coverage_Score  \
0                    Physiological stress indicators            36.0   
1                                           Injuries            10.0   
2  Cow Pain Scale (attention towards surroundings...            20.0   
3             Time budgets (disruption of behaviour)            16.0   
4                            Hot, red, painful udder             6.0   
5                                    Gait assessment             9.0   
6                                   Hock alterations             8.0   
7                                   Knee alterations             8.0   
8             Time budgets (prevention of behaviour)             9.0   
9                        Time budgets (lack of rest)             8.0   

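   Impact_Score  Total_Cost  Total_Time  Objective_Score  Efficiency  
0      1.050000       225.0         9.0             36.0    0.153846  
1      2.000000        75.0         3.0             10.0    0.128205  
2      1.285714       150.0         6.0             20.0    0.128205  
3      1.000000       150.0         6.0             16.0    0.102564  
4      2.000000        75.0         3.0              6.0    0.076923  
5      2.000000       125.0         5.0              9.0    0.069231  
6      2.000000       125.0         5.0              8.0    0.061538  
7      2.000000       125.0         5.0              8.0    0.061538  
8      1.000000       150.0         6.0              9.0    0.057692  
9      2.000000       150.0         6.0              8.0    0.051282  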

Unnamed: 0,Welfare_Indicator,Coverage_Score,Impact_Score,Ease_Num,Resources_Num,Mitigation_Ease_Num,Total_Cost,Total_Time,Objective_Score,Efficiency
0,Physiological stress indicators,36.0,1.05,3.0,3.0,2.15,225.0,9.0,36.0,0.153846
1,Injuries,10.0,2.0,1.0,1.0,2.5,75.0,3.0,10.0,0.128205
2,Cow Pain Scale (attention towards surroundings...,20.0,1.285714,2.0,2.0,2.0,150.0,6.0,20.0,0.128205
3,Time budgets (disruption of behaviour),16.0,1.0,2.0,2.0,2.2,150.0,6.0,16.0,0.102564
4,"Hot, red, painful udder",6.0,2.0,1.0,1.0,2.6,75.0,3.0,6.0,0.076923
5,Gait assessment,9.0,2.0,2.0,1.0,2.5,125.0,5.0,9.0,0.069231
6,Hock alterations,8.0,2.0,2.0,1.0,2.428571,125.0,5.0,8.0,0.061538
7,Knee alterations,8.0,2.0,2.0,1.0,2.428571,125.0,5.0,8.0,0.061538
8,Time budgets (prevention of behaviour),9.0,1.0,2.0,2.0,2.5,150.0,6.0,9.0,0.057692
9,Time budgets (lack of rest),8.0,2.0,2.0,2.0,2.142857,150.0,6.0,8.0,0.051282
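
The efficiency values can be verified by hand. The sketch below wraps the ratio in a hypothetical helper, `efficiency`, and reproduces the first two rows of the table:

```python
def efficiency(objective_score, total_cost, total_time,
               weight_cost=1.0, weight_time=1.0):
    """Objective score divided by the weighted sum of cost and time."""
    return objective_score / (weight_cost * total_cost + weight_time * total_time)

# "Physiological stress indicators": 36 / (225 + 9)
print(round(efficiency(36.0, 225.0, 9.0), 6))  # 0.153846

# "Injuries": 10 / (75 + 3)
print(round(efficiency(10.0, 75.0, 3.0), 6))   # 0.128205
```

With both weights at 1.0, cost (tens to hundreds) dominates the denominator while time (single digits) barely moves it. If time should matter comparably, one option is to raise WEIGHT_TIME or rescale the two quantities first.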


--------------------------------------------------------------------
###          SIMPLE GREEDY SELECTION UNDER CONSTRAINTS
--------------------------------------------------------------------
We demonstrate a greedy approach that picks the highest-efficiency indicators first, subject to the budget and time constraints and a maximum number of indicators.

**NOTE:** A simple greedy approach is not guaranteed to find a global optimum.
If we need the absolute best combination, consider integer programming or other optimization techniques (e.g., knapsack solutions).


In [55]:
# Constraints
BUDGET = 1200      # e.g., total budget
DEADLINE = 100      # e.g., total time available
MAX_INDICATORS = 7 # user-defined limit on the number of indicators

selected_indicators = []
total_cost = 0
total_time = 0

# Greedy selection
for index, row in df_sorted.iterrows():
    # Check constraints:
    if (
        len(selected_indicators) < MAX_INDICATORS
        and (total_cost + row['Total_Cost']) <= BUDGET
        and (total_time + row['Total_Time']) <= DEADLINE
    ):
        selected_indicators.append(row)
        total_cost += row['Total_Cost']
        total_time += row['Total_Time']
        
        # If we've reached the max number of indicators, stop.
        if len(selected_indicators) == MAX_INDICATORS:
            break

if selected_indicators:
    df_selected = pd.DataFrame(selected_indicators)
    print("\nOptimal Selection of Welfare Indicators (Greedy Approach):")
    print(
        df_selected[
            [
                'Welfare_Indicator', 
                'Coverage_Score', 
                'Impact_Score', 
                'Total_Cost', 
                'Total_Time', 
                'Objective_Score', 
                'Efficiency'
            ]
        ]
    )
    
    # Summary
    print(f"\nTotal Cost: ${total_cost} (Budget: ${BUDGET})")
    print(f"Total Time: {total_time} days (Deadline: {DEADLINE} days)")
    print(f"Total Coverage Score: {df_selected['Coverage_Score'].sum()}")
    print(f"Total Impact Score: {df_selected['Impact_Score'].sum()}")
    print(f"Overall Objective Score: {df_selected['Objective_Score'].sum()}")
else:
    print("No combination of indicators satisfies the constraints.")



Optimal Selection of Welfare Indicators (Greedy Approach):
                                   Welfare_Indicator  Coverage_Score  \
0                    Physiological stress indicators            36.0   
1                                           Injuries            10.0   
2  Cow Pain Scale (attention towards surroundings...            20.0   
3             Time budgets (disruption of behaviour)            16.0   
4                            Hot, red, painful udder             6.0   
5                                    Gait assessment             9.0   
6                                   Hock alterations             8.0   

   Impact_Score  Total_Cost  Total_Time  Objective_Score  Efficiency  
0      1.050000       225.0         9.0             36.0    0.153846  
1      2.000000        75.0         3.0             10.0    0.128205  
2      1.285714       150.0         6.0             20.0    0.128205  
3      1.000000       150.0         6.0             16.0    0.102564  
4      2.000000        75.0         3.0              6.0    0.076923  
5      2.000000       125.0         5.0              9.0    0.069231  
6      2.000000       125.0         5.0              8.0    0.061538  
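
To make the caveat about greedy selection concrete, here is a small counterexample. The three indicators and their numbers are made up for illustration, and only a budget constraint is used. Greedy selection by objective-per-cost ratio takes the best-ratio item first and then cannot fit anything else, while an exhaustive search finds a better pair:

```python
from itertools import combinations

# Hypothetical indicators: (name, objective, cost). Values are illustrative only.
items = [("X", 7.0, 6.0), ("Y", 5.0, 5.0), ("Z", 5.0, 5.0)]
budget = 10.0

def greedy(items, budget):
    """Pick items in descending objective/cost order while they fit the budget."""
    chosen, spent = [], 0.0
    for name, obj, cost in sorted(items, key=lambda it: it[1] / it[2], reverse=True):
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen

def brute_force(items, budget):
    """Try every non-empty subset and keep the feasible one with the highest objective."""
    best_names, best_score = [], 0.0
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            cost = sum(it[2] for it in combo)
            score = sum(it[1] for it in combo)
            if cost <= budget and score > best_score:
                best_names, best_score = [it[0] for it in combo], score
    return best_names

greedy_pick = greedy(items, budget)
exact_pick = brute_force(items, budget)
print(greedy_pick)  # ['X']         total objective 7
print(exact_pick)   # ['Y', 'Z']    total objective 10
```

Brute force is only workable for a handful of items, but it is a convenient reference for checking other methods on small inputs.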



## Constraints

We want to select a subset $S$ of indicators that **maximizes** the sum of objective scores, subject to constraints such as:

1. **Budget Constraint**  
   $$
   \sum_{i \in S} \mathrm{Total\_Cost}_i \;\;\le\;\; \mathrm{BUDGET}.
   $$

2. **Time Constraint**  
   $$
   \sum_{i \in S} \mathrm{Total\_Time}_i \;\;\le\;\; \mathrm{DEADLINE}.
   $$

3. **Maximum Number of Indicators**  
   $$
   |S| \;\;\le\;\; \mathrm{MAX\_INDICATORS}.
   $$

Each $\mathrm{Total\_Cost}_i$ and $\mathrm{Total\_Time}_i$ is associated with selecting indicator $i$.
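
The three constraints translate directly into a feasibility check on a candidate selection. The sketch below assumes the selection is a DataFrame with Total_Cost and Total_Time columns, as in this notebook; the helper name `is_feasible` is hypothetical, and the sample rows are the top three indicators from the efficiency table.

```python
import pandas as pd

def is_feasible(selection, budget, deadline, max_indicators):
    """True if the selected rows satisfy the budget, time, and count constraints."""
    return bool(
        len(selection) <= max_indicators
        and selection["Total_Cost"].sum() <= budget
        and selection["Total_Time"].sum() <= deadline
    )

candidate = pd.DataFrame({
    "Welfare_Indicator": ["Physiological stress indicators", "Injuries", "Cow Pain Scale"],
    "Total_Cost": [225.0, 75.0, 150.0],
    "Total_Time": [9.0, 3.0, 6.0],
})

print(is_feasible(candidate, budget=1200, deadline=100, max_indicators=7))  # True
print(is_feasible(candidate, budget=400, deadline=100, max_indicators=7))   # False (cost is 450)
```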

---

## Backtracking (Depth-First Search) Approach

A **backtracking** algorithm can **enumerate** possible subsets $S$ of indicators, respecting the constraints. For each indicator $i$, we try:

1. **Exclude** $i$ (do not pick it).  
2. **Include** $i$ (if it does not violate the budget, time, or count constraints).

During this process, we keep track of the **best** subset found so far, i.e., the one with the **largest sum** of $\mathrm{objective}_i$.

Backtracking takes $O(2^N)$ time in the worst case, but **pruning** (e.g., abandoning a branch as soon as a constraint is violated) can reduce the search space significantly. Because pruning only discards infeasible subsets, the search still finds the **globally optimal** subset that satisfies all constraints, and it remains practical for a moderate number of indicators.
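
A minimal sketch of the include/exclude search described above follows. It is one possible implementation, not the notebook's own code: the function name `best_subset`, the tuple layout `(name, objective, cost, time)`, and the sample items are all assumptions made for illustration.

```python
def best_subset(items, budget, deadline, max_items):
    """Depth-first search over include/exclude decisions.

    `items` is a list of (name, objective, cost, time) tuples.
    Returns (best_score, names_of_best_subset).
    """
    best = {"score": 0.0, "names": []}

    def search(i, chosen, score, cost, time):
        # Every node reached here is feasible, so it can update the best so far.
        if score > best["score"]:
            best["score"], best["names"] = score, list(chosen)
        # Stop when all items are decided or the count limit is reached.
        if i == len(items) or len(chosen) == max_items:
            return
        name, obj, c, t = items[i]
        # Include item i only if it keeps the selection within budget and deadline.
        if cost + c <= budget and time + t <= deadline:
            chosen.append(name)
            search(i + 1, chosen, score + obj, cost + c, time + t)
            chosen.pop()
        # Exclude item i.
        search(i + 1, chosen, score, cost, time)

    search(0, [], 0.0, 0.0, 0.0)
    return best["score"], best["names"]

# Hypothetical indicators: (name, objective, cost, time)
items = [("X", 7.0, 6.0, 1.0), ("Y", 5.0, 5.0, 1.0), ("Z", 5.0, 5.0, 1.0)]
result = best_subset(items, budget=10.0, deadline=100.0, max_items=3)
print(result)  # (10.0, ['Y', 'Z'])
```

The only pruning here is the feasibility test before the include branch. A bound on the remaining achievable objective could prune further, at the cost of more bookkeeping.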

---

### Putting It All Together

1. Compute for each indicator $i$:
   - $\mathrm{riskCoverage}_i$, $\mathrm{consequenceCoverage}_i$  
   - $\mathrm{Ease\_Num}_i$, $\mathrm{Impact\_Score}_i$, $\mathrm{Mitigation\_Ease\_Num}_i$  
   - A combined $\mathrm{objective}_i$ using the formula:
     $$
     \mathrm{objective}_i 
     \;=\; 
       w_{\mathrm{cover}} 
         \bigl(\alpha \, \mathrm{riskCoverage}_i 
               + \beta \, \mathrm{consequenceCoverage}_i \bigr)
       \;+\;
       w_{\mathrm{ease}} 
         \bigl(4 - \mathrm{Ease\_Num}_i \bigr)
       \;+\;
       w_{\mathrm{impact}} 
         \bigl(\mathrm{Impact\_Score}_i \bigr)
       \;+\;
       w_{\mathrm{miti}} 
         \bigl(4 - \mathrm{Mitigation\_Ease\_Num}_i \bigr).
     $$
2. Use **backtracking** to try each indicator’s inclusion/exclusion, track constraints, and maximize  
   $$
   \sum_{i \in S} \mathrm{objective}_i.
   $$
3. Return the subset $S$ (or multiple subsets if you store them) that attains the highest feasible objective.

By adjusting $\alpha, \beta, w_{\mathrm{cover}}, w_{\mathrm{ease}}, w_{\mathrm{impact}}, w_{\mathrm{miti}}$, we can emphasize or de-emphasize certain aspects of the problem, such as giving more weight to **coverage** vs. **impact**, or preferring indicators that are **easy** to measure.
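
The combined objective can be written as a small function. In the sketch below the name `indicator_objective` and the default weights are assumptions; the formula itself follows the expression given in step 1. The terms `4 - Ease_Num` and `4 - Mitigation_Ease_Num` flip the 1 to 3 scales so that easier indicators score higher (Easy gives 3, Difficult gives 1).

```python
def indicator_objective(risk_coverage, consequence_coverage, ease_num,
                        impact_score, mitigation_ease_num,
                        alpha=1.0, beta=1.0,
                        w_cover=1.0, w_ease=0.0, w_impact=0.0, w_miti=0.0):
    """Weighted objective for one indicator, following the formula in step 1."""
    return (
        w_cover * (alpha * risk_coverage + beta * consequence_coverage)
        + w_ease * (4 - ease_num)
        + w_impact * impact_score
        + w_miti * (4 - mitigation_ease_num)
    )

# Illustrative values: one hazard, one consequence, moderate ease (2),
# high impact (2), moderate mitigation ease (2).
coverage_only = indicator_objective(1, 1, 2, 2.0, 2)
all_terms = indicator_objective(1, 1, 2, 2.0, 2,
                                w_ease=1.0, w_impact=1.0, w_miti=1.0)
print(coverage_only)  # 2.0
print(all_terms)      # 8.0
```

With only the coverage weight switched on, the result equals the Coverage_Score used earlier in the notebook; turning on the other weights adds 2 for each of the ease, impact, and mitigation terms in this example.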