The goal of this project is to develop a machine learning model that tests whether the age at which American Woodcocks (AMWO) die can be predicted from variables including banding year, banding month, banding flyway, banding region, recovery year, recovery month, recovery flyway, recovery region, sex, method of recovery, and status. If so, which feature is most important for predicting age within each recovery month? A random forest regressor is trained, and its feature importances are assessed both globally and per recovery month. This project uses the AMWO recovery data from Saunders et al. (2019).


In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns   

In [None]:

# -----------------------------
# STEP 1: Load the CSV dataset and select first 100 rows
# -----------------------------
file_path = 'AMWO recoveries.csv'
data = pd.read_csv(file_path)

# Select the relevant columns
cols = ['B Year','B Month','B Flyway','B Region','R Year','R Month','R Flyway','R Region','Sex','How Obt','Status']
# Keep the first 100 rows; .copy() avoids SettingWithCopyWarning on later column assignments
data = data[cols].iloc[:100].copy()

# -----------------------------
# STEP 2: Clean and calculate age in months
# -----------------------------
# Convert year and month columns to numeric
for col in ['B Year','B Month','R Year','R Month']:
    data[col] = pd.to_numeric(data[col], errors='coerce')

# Drop rows with missing critical values
data = data.dropna(subset=['B Year','B Month','R Year','R Month'])

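# Calculate age in months, measured as the time elapsed between banding and recovery
data['age_months'] = ((data['R Year'] - data['B Year']) * 12) + (data['R Month'] - data['B Month'])

# Remove negative ages (recovery dated before banding); no upper bound is applied
data = data[data['age_months'] >= 0].copy()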

# -----------------------------
# STEP 3: Encode categorical features
# -----------------------------
categorical_cols = ['B Flyway','B Region','R Flyway','R Region','Sex','How Obt','Status']
for col in categorical_cols:
    data[col] = data[col].astype(str)
    le = LabelEncoder()
    data[col] = le.fit_transform(data[col])

# -----------------------------
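# STEP 4: Prepare features and target
# -----------------------------
# Caution: only the target column is dropped here. B Year, B Month, R Year, and R Month
# stay in X even though age_months is computed directly from them, so leakage is NOT removed.
X = data.drop(columns=['age_months'])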
y = data['age_months']

# Split data into training and testing sets. Let test size be 20%, random state 42 for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# -----------------------------
# STEP 5: Train Random Forest Regressor model 
# -----------------------------
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# -----------------------------
# STEP 6: Global Feature Importance
# -----------------------------
importances = model.feature_importances_
feature_names = X.columns
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances}).sort_values(by='Importance', ascending=False)

plt.figure()
plt.bar(importance_df['Feature'], importance_df['Importance'])
plt.ylabel('Importance')
plt.title('Global Feature Importance (First 100 Rows from CSV)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('global_feature_importance_csv.png')
plt.close()

# -----------------------------
# STEP 7: Feature Importance by Recovery Month
# -----------------------------
month_importance = {}
for month in sorted(data['R Month'].unique()):
    subset = data[data['R Month'] == month]
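    # Skip months with fewer than 5 recoveries; too few rows to fit a model
    if len(subset) < 5:
        continue
    # R Month is constant within each subset, so it cannot contribute any importance here
    X_m = subset.drop(columns=['age_months'])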
    y_m = subset['age_months']
    model_m = RandomForestRegressor(n_estimators=50, random_state=42)
    model_m.fit(X_m, y_m)
    month_importance[month] = model_m.feature_importances_

heatmap_data = pd.DataFrame(month_importance, index=feature_names)
sns.heatmap(heatmap_data, cmap='viridis')
plt.title('Feature Importance by Recovery Month (First 100 Rows from CSV)')
plt.xlabel('Month')
plt.ylabel('Feature')
plt.tight_layout()
plt.savefig('feature_importance_by_month_csv.png')
plt.close()


# -----------------------------
# STEP 8: Most Important Feature per Month
# -----------------------------
most_important_per_month = {month: feature_names[np.argmax(vals)] for month, vals in month_importance.items()}

print("Global Feature Importance (First 100 Rows from CSV):")
print(importance_df)
print("\nMost Important Feature per Recovery Month (First 100 Rows from CSV):")
for month, feature in most_important_per_month.items():
    print(f"Month {month}: {feature}")


Global Feature Importance (First 100 Rows from CSV):
     Feature  Importance
4     R Year    0.368209
9    How Obt    0.333718
5    R Month    0.182301
1    B Month    0.056053
3   B Region    0.017728
0     B Year    0.015735
7   R Region    0.013081
2   B Flyway    0.005709
8        Sex    0.003876
6   R Flyway    0.003589
10    Status    0.000000

Most Important Feature per Recovery Month (First 100 Rows from CSV):
Month 10: B Region
Month 11: Sex
Month 12: B Month


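The two figures produced show:
1. The global importance of each feature, as a proportion of the total importance (global_feature_importance_csv.png)
2. The importance of each feature within each recovery month (feature_importance_by_month_csv.png)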
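
The notebook creates a train/test split but never scores the model on X_test, and `feature_importances_` reports impurity-based importance computed from the training data. The sketch below shows one possible way to add a held-out evaluation and a permutation importance, which is a different technique from the one used above. It runs on synthetic data with an assumed relationship in which the target depends only on 'How Obt'. The toy values and variable names are illustrative and are not results from the AMWO data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic, label-encoded features (assumption: not real AMWO data)
rng = np.random.default_rng(42)
n = 300
X_toy = pd.DataFrame({
    'Sex': rng.integers(0, 2, size=n),
    'How Obt': rng.integers(0, 4, size=n),
    'B Region': rng.integers(0, 10, size=n),
})
# Assumed toy relationship: the target depends only on 'How Obt', plus noise
y_toy = 12 * X_toy['How Obt'] + rng.normal(0, 2, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Held-out performance: how well the model predicts rows it has not seen
pred = rf.predict(X_te)
test_r2 = r2_score(y_te, pred)
test_mae = mean_absolute_error(y_te, pred)
print(f"Test R^2: {test_r2:.3f}, test MAE: {test_mae:.2f} months")

# Permutation importance on the test set: drop in score when one column is shuffled
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)
perm_df = pd.DataFrame({
    'Feature': X_toy.columns,
    'Importance': perm.importances_mean,
}).sort_values(by='Importance', ascending=False)
top_feature = perm_df.iloc[0]['Feature']
print(perm_df)
```

In the notebook itself, the same calls could be applied to `model`, `X_test`, and `y_test` from STEP 4 and STEP 5. With only 100 rows, the test set holds at most 20 recoveries, so any such scores would be noisy.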