# Predicting diet proportions of all known extinct bird species using Random Forest

## PART 1. DATASET PRE-PROCESSING

This script contains the code used to pre-process the datasets used to train a Random Forest (RF) model.

### Loading packages

First, the necessary Python libraries are imported. These include pandas for data manipulation, scikit-learn for machine learning functionalities, matplotlib and seaborn for data visualization, and numpy for numerical operations.

In [301]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
import numpy as np

# 1. Pre-processing of AVONET dataset

## 1.1 Load AVONET dataset

We used a comprehensive dataset that combines information on the diets of extant bird species obtained from Tobias, J. et al., 2021.
The code reads an Excel file containing bird data using pd.read_excel() and stores it in a pandas DataFrame named data.

In [372]:
url = "https://github.com/martinezrubio/ECOTREX/raw/refs/heads/main/data/raw/AVONET_2022.xlsx"
data = pd.read_excel(url)
data.head()

Unnamed: 0,Species3,Family3,Order3,Total.individuals,Female,Male,Unknown,Complete.measures,Beak.Length_Culmen,Beak.Length_Nares,...,Migration,Trophic.Level,Trophic.Niche,Primary.Lifestyle,Min.Latitude,Max.Latitude,Centroid.Latitude,Centroid.Longitude,Range.Size,Species.Status
0,Accipiter albogularis,Accipitridae,Accipitriformes,5,2,0,3,4,27.7,17.8,...,2.0,Carnivore,Vertivore,Insessorial,-11.73,-4.02,-8.15,158.493765,37461.21,Extant
1,Accipiter badius,Accipitridae,Accipitriformes,10,4,6,0,8,20.6,12.1,...,3.0,Carnivore,Vertivore,Insessorial,-29.47,46.39,8.23,44.982464,22374973.0,Extant
2,Accipiter bicolor,Accipitridae,Accipitriformes,6,2,2,2,4,26.5,14.8,...,2.0,Carnivore,Vertivore,Generalist,,,,,,Extant
3,Accipiter brachyurus,Accipitridae,Accipitriformes,4,4,0,0,3,22.5,14.0,...,2.0,Carnivore,Vertivore,Insessorial,-6.31,-4.08,-5.45,150.681314,35580.71,Extant
4,Accipiter brevipes,Accipitridae,Accipitriformes,8,4,4,0,4,21.1,12.1,...,3.0,Carnivore,Vertivore,Generalist,31.19,55.86,45.24,45.32734,2936751.8,Extant


In [303]:
data.shape # Should be 9993 rows and 36 columns

(9993, 36)

In [373]:
data.columns

Index(['Species3', 'Family3', 'Order3', 'Total.individuals', 'Female', 'Male',
       'Unknown', 'Complete.measures', 'Beak.Length_Culmen',
       'Beak.Length_Nares', 'Beak.Width', 'Beak.Depth', 'Tarsus.Length',
       'Wing.Length', 'Kipps.Distance', 'Secondary1', 'Hand-Wing.Index',
       'Tail.Length', 'Mass', 'Mass.Source', 'Mass.Refs.Other', 'Inference',
       'Traits.inferred', 'Reference.species', 'Habitat', 'Habitat.Density',
       'Migration', 'Trophic.Level', 'Trophic.Niche', 'Primary.Lifestyle',
       'Min.Latitude', 'Max.Latitude', 'Centroid.Latitude',
       'Centroid.Longitude', 'Range.Size', 'Species.Status'],
      dtype='object')

### 1.1.1 Exclusion of extinct species

Species that are currently extinct or already represented in the AVOTREX dataset were excluded from the analysis to avoid duplication and ensure consistency between extant and extinct trait datasets.

In [374]:
species_to_remove = [
    "Acrocephalus luscinius",
    "Corvus hawaiiensis",
    "Cyanopsitta spixii",
    "Gallinula nesiotis",
    "Hemignathus lucidus",
    "Melamprosops phaeosoma",
    "Mitu mitu",
    "Philydor novaesi",
    "Prosobonia cancellata",
    "Todiramphus cinnamominus",
    "Zenaida graysoni",
    "Zosterops conspicillatus",
    "Phyllastrephus leucolepis" # Removed as it is no longer recognised as a valid species
]

In [375]:
data = data[~data["Species3"].isin(species_to_remove)].copy()  # .copy() avoids SettingWithCopyWarning when columns are added later

In [376]:
data.shape # Should be 9980 rows and 36 columns

(9980, 36)
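The row count drops by 13, which matches the length of the exclusion list. A more direct check is to confirm that every listed name was actually present before filtering. The sketch below shows one way to do this; the helper `drop_species` and the table `toy` are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def drop_species(df, names, col="Species3"):
    """Drop rows whose `col` is in `names`; also return the names that were not found."""
    missing = sorted(set(names) - set(df[col]))
    kept = df[~df[col].isin(names)].copy()
    return kept, missing

# Toy table for illustration only (not AVONET rows)
toy = pd.DataFrame({"Species3": ["Mitu mitu", "Accipiter badius", "Zenaida graysoni"]})

# "Not a_bird" is a deliberately misspelled entry to show how a non-match is reported
kept, missing = drop_species(toy, ["Mitu mitu", "Zenaida graysoni", "Not a_bird"])
```

An empty `missing` list would indicate that every name in the exclusion list matched a row.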

### 1.1.2 Remove non-informative variables

Non-informative metadata variables not used as predictors were removed to simplify the dataset prior to analysis.

In [377]:
cols_to_drop = [
    "Total.individuals",
    "Female",
    "Male",
    "Unknown",
    "Complete.measures",
    "Beak.Length_Nares",  # Removed as it was highly correlated with Beak.Length_Culmen
    "Secondary1",      # Removed as it was highly correlated with Kipps Distance
    "Hand-Wing.Index", # Removed as it was highly correlated with Kipps Distance
    "Mass.Source",
    "Mass.Refs.Other",
    "Inference",
    "Traits.inferred",
    "Reference.species",
    "Species.Status",
    "Habitat",         # Removed as we don't have enough information for extinct species to compare
    "Habitat.Density", # Removed as we don't have enough information for extinct species to compare
    "Migration",       # Removed as we don't have enough information for extinct species to compare
    "Trophic.Level",   # Removed as we don't have enough information for extinct species to compare
    "Trophic.Niche",   # Removed as we don't have enough information for extinct species to compare
    "Primary.Lifestyle", # Removed as we don't have enough information for extinct species to compare
    "Min.Latitude",    # Removed as we don't have enough information for extinct species to compare
    "Max.Latitude",    # Removed as we don't have enough information for extinct species to compare
    "Centroid.Latitude", # Removed as we don't have enough information for extinct species to compare
    "Centroid.Longitude", # Removed as we don't have enough information for extinct species to compare
    "Range.Size"       # Removed as we don't have enough information for extinct species to compare
]

In [378]:
data = data.drop(columns=cols_to_drop)

In [379]:
data.columns

Index(['Species3', 'Family3', 'Order3', 'Beak.Length_Culmen', 'Beak.Width',
       'Beak.Depth', 'Tarsus.Length', 'Wing.Length', 'Kipps.Distance',
       'Tail.Length', 'Mass'],
      dtype='object')

In [380]:
data.shape # Should be 9980 rows and 11 columns

(9980, 11)
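Several variables above were dropped because they were highly correlated with a retained trait. The notebook does not show how this was assessed, so the following is only a sketch of one possible screen, using Pearson correlation on made-up data. The helper `highly_correlated_pairs` and the threshold of 0.9 are assumptions.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, r) for every column pair with |Pearson r| >= threshold."""
    corr = df.corr(method="pearson")
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

# Synthetic measurements for illustration only
rng = np.random.default_rng(0)
culmen = rng.uniform(10, 60, size=200)
toy = pd.DataFrame({
    "Beak.Length_Culmen": culmen,
    "Beak.Length_Nares": 0.6 * culmen + rng.normal(0, 0.5, size=200),  # nearly collinear by construction
    "Tail.Length": rng.uniform(30, 200, size=200),                      # independent by construction
})

pairs = highly_correlated_pairs(toy)
```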

### 1.1.3 Log-transformation of morphological traits

Morphological traits were log-transformed to account for allometric scaling and reduce skewness prior to analysis.

In [381]:
vars_to_log = [
    "Beak.Length_Culmen",
    "Beak.Width",
    "Beak.Depth",
    "Tarsus.Length",
    "Wing.Length",
    "Kipps.Distance",
    "Tail.Length",
    "Mass"
]

for var in vars_to_log:
    data[f"log_{var}"] = np.log(data[var])

In [382]:
data[[f"log_{v}" for v in vars_to_log]].describe()

Unnamed: 0,log_Beak.Length_Culmen,log_Beak.Width,log_Beak.Depth,log_Tarsus.Length,log_Wing.Length,log_Kipps.Distance,log_Tail.Length,log_Mass
count,9980.0,9980.0,9980.0,9980.0,9980.0,9980.0,9980.0,9980.0
mean,3.076279,1.693549,1.83135,3.172099,4.643163,3.121708,4.295238,3.904417
std,0.562332,0.578921,0.672284,0.590073,0.595676,0.984829,0.569388,1.575007
min,1.504077,-0.356675,0.0,0.916291,-2.302585,-2.302585,-2.302585,0.641854
25%,2.681022,1.280934,1.335001,2.862201,4.198705,2.388763,3.916015,2.70538
50%,2.985682,1.609438,1.757858,3.091042,4.522875,2.98315,4.225373,3.569533
75%,3.349904,2.028148,2.24071,3.456317,4.990433,3.793239,4.613138,4.820282
max,6.026349,4.487512,4.708629,6.163315,6.671906,6.037871,6.700485,11.617285
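The minimum of -2.302585 in three of the columns equals ln(0.1), so raw values of 0.1 are present; the notebook does not say whether these are real measurements or placeholders. Since `np.log` returns `-inf` for zero and `NaN` for negative input, it can be worth checking the raw values before transforming. A minimal sketch, assuming a hypothetical helper `log_transform` and toy values:

```python
import numpy as np
import pandas as pd

def log_transform(df, columns):
    """Add natural-log columns named log_<col>; raise if any value is missing or <= 0."""
    out = df.copy()
    for col in columns:
        bad = out[col].isna() | (out[col] <= 0)
        if bad.any():
            raise ValueError(f"{col}: {int(bad.sum())} value(s) missing or not positive")
        out[f"log_{col}"] = np.log(out[col])
    return out

# Toy values for illustration only
toy = pd.DataFrame({"Mass": [1.9, 50.0, 111000.0], "Wing.Length": [0.1, 92.3, 310.0]})
logged = log_transform(toy, ["Mass", "Wing.Length"])
```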


In [384]:
cols_to_drop = [
    "Beak.Length_Culmen",
    "Beak.Width",
    "Beak.Depth",
    "Tarsus.Length",
    "Wing.Length",
    "Kipps.Distance",  
    "Tail.Length",      
    "Mass"    
]

In [385]:
data = data.drop(columns=cols_to_drop)

In [386]:
data.columns

Index(['Species3', 'Family3', 'Order3', 'log_Beak.Length_Culmen',
       'log_Beak.Width', 'log_Beak.Depth', 'log_Tarsus.Length',
       'log_Wing.Length', 'log_Kipps.Distance', 'log_Tail.Length', 'log_Mass'],
      dtype='object')

## 1.2 Incorporation of flight ability and island endemicity data

Flight ability and island endemicity information were imported from an external AVONET-derived dataset and integrated with the main trait database for downstream analyses.

In [387]:
url = "https://raw.githubusercontent.com/martinezrubio/ECOTREX/main/data/raw/AVONET_flight_endemicity.csv"
endemicity = pd.read_csv(url)
endemicity.head()

Unnamed: 0,species,Island_Endemicity,Flight_Ability
0,Accipiter_albogularis,yes,1.0
1,Accipiter_badius,no,1.0
2,Accipiter_bicolor,no,1.0
3,Accipiter_brachyurus,yes,1.0
4,Accipiter_brevipes,no,1.0


In [388]:
endemicity.shape # Should be 9993 rows and 3 columns 

(9993, 3)

**Formatting species names for consistent merging**

In [389]:
data["Species3"] = (
    data["Species3"]
    .str.strip()
    .str.replace(" ", "_", regex=False)
)
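The same normalisation is applied again later to the diet table. Wrapping it in a small function keeps both sides of each merge consistent. The function name `normalise_names` is hypothetical; the two string operations are the ones used in the cell above.

```python
import pandas as pd

def normalise_names(names):
    """Trim surrounding whitespace and replace spaces with underscores."""
    return names.str.strip().str.replace(" ", "_", regex=False)

# Toy names: one plain, one with stray whitespace, one already normalised
raw = pd.Series(["Accipiter badius", "  Mitu mitu ", "Gyps_africanus"])
clean = normalise_names(raw)
```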

In [390]:
data = data.merge(
    endemicity,
    left_on="Species3",
    right_on="species",
    how="left"
)

In [391]:
data = data.drop(columns=["species"])

In [392]:
data.columns

Index(['Species3', 'Family3', 'Order3', 'log_Beak.Length_Culmen',
       'log_Beak.Width', 'log_Beak.Depth', 'log_Tarsus.Length',
       'log_Wing.Length', 'log_Kipps.Distance', 'log_Tail.Length', 'log_Mass',
       'Island_Endemicity', 'Flight_Ability'],
      dtype='object')

In [393]:
data.shape # Should be 9980 rows and 13 columns 

(9980, 13)

In [394]:
data[["Flight_Ability", "Island_Endemicity"]].isna().any() # Should be False

Flight_Ability       False
Island_Endemicity    False
dtype: bool
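The shape and missing-value checks above can also be built into the merge itself. The sketch below uses two existing `DataFrame.merge` parameters, `validate` and `indicator`, on toy tables; the table contents are made up for illustration.

```python
import pandas as pd

# Toy tables: "Ee_ff" has no match in the lookup table
traits = pd.DataFrame({"Species3": ["Aa_bb", "Cc_dd", "Ee_ff"], "log_Mass": [2.1, 3.4, 5.0]})
extra = pd.DataFrame({"species": ["Aa_bb", "Cc_dd"], "Flight_Ability": [1.0, 0.0]})

merged = traits.merge(
    extra,
    left_on="Species3",
    right_on="species",
    how="left",
    validate="one_to_one",  # raises pandas.errors.MergeError if a key is duplicated on either side
    indicator=True,         # adds a _merge column: "both", "left_only" or "right_only"
)

# Species present in the trait table but absent from the lookup table
unmatched = merged.loc[merged["_merge"] == "left_only", "Species3"].tolist()
```

With `validate="one_to_one"`, a duplicated species name fails loudly instead of silently adding rows to the left join.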

## 1.3 Incorporation of foraging strategy and sea bird category

Foraging strategy and sea bird information were imported from an external AVONET-derived dataset and integrated with the main trait database for downstream analyses.

In [395]:
url = "https://raw.githubusercontent.com/martinezrubio/ECOTREX/main/data/raw/AVONET_fly_swim_walk_seabird.xlsx"
foraging = pd.read_excel(url)
foraging.head()

Unnamed: 0,Species,fly,swim,walk,seabird
0,Abeillia_abeillei,100.0,0.0,0.0,no
1,Abroscopus_albogularis,55.0,0.0,45.0,no
2,Abroscopus_schisticeps,60.0,0.0,40.0,no
3,Abroscopus_superciliaris,60.0,0.0,40.0,no
4,Aburria_aburri,50.0,0.0,50.0,no


In [396]:
foraging.shape # Should be 9981 rows and 5 columns 

(9981, 5)

**Merging foraging data with the main dataset**

In [397]:
data = data.merge(
    foraging,
    left_on="Species3",
    right_on="Species",
    how="left"
)

In [398]:
data = data.drop(columns="Species")

In [399]:
data.columns

Index(['Species3', 'Family3', 'Order3', 'log_Beak.Length_Culmen',
       'log_Beak.Width', 'log_Beak.Depth', 'log_Tarsus.Length',
       'log_Wing.Length', 'log_Kipps.Distance', 'log_Tail.Length', 'log_Mass',
       'Island_Endemicity', 'Flight_Ability', 'fly', 'swim', 'walk',
       'seabird'],
      dtype='object')

In [400]:
data.shape # Should be 9980 rows and 17 columns 

(9980, 17)

In [401]:
data[["fly", "swim", "walk", "seabird"]].isna().any() # Should be False

fly        False
swim       False
walk       False
seabird    False
dtype: bool

## 1.4 Incorporation of diet category

Diet category information was imported from AVONICHE and integrated with the main trait database for downstream analyses.

In [402]:
url = "https://raw.githubusercontent.com/martinezrubio/ECOTREX/main/data/raw/AVONICHE_2026.csv"
diet = pd.read_csv(url)
diet.head()

Unnamed: 0,Species3,Family3,Order3,Avibase_id,Inferred,MainDiet,MainNiche,Pa,Pt,Ne,...,APG,APS,APD,VAE,VAS,VPE,VGE,VGG,CAQ,CGR
0,Accipiter albogularis,Accipitridae,Accipitriformes,AVIBASE-BBB59880,0,Vt,VPE,0,0,0,...,0,0,0,30,0,70,0,0,0,0
1,Accipiter badius,Accipitridae,Accipitriformes,AVIBASE-1A0ECB6E,0,Vt,VPE,0,0,0,...,0,0,0,9,0,72,9,0,0,0
2,Accipiter bicolor,Accipitridae,Accipitriformes,AVIBASE-ADBE44E1,0,Vt,VPE,0,0,0,...,0,0,0,40,0,60,0,0,0,0
3,Accipiter brachyurus,Accipitridae,Accipitriformes,AVIBASE-68BF920B,0,Vt,VPE,0,0,0,...,0,0,0,30,0,70,0,0,0,0
4,Accipiter brevipes,Accipitridae,Accipitriformes,AVIBASE-8492E4B7,0,Vt,Generalist,0,0,0,...,0,0,0,6,36,18,0,0,0,0


In [403]:
diet.shape # Should be 9993 rows and 48 columns

(9993, 48)

In [404]:
diet.columns

Index(['Species3', 'Family3', 'Order3', 'Avibase_id', 'Inferred', 'MainDiet',
       'MainNiche', 'Pa', 'Pt', 'Ne', 'Se', 'Fr', 'In', 'Ap', 'Vt', 'Ca',
       'PAG', 'PAS', 'PAD', 'PEL', 'PGR', 'NAE', 'NGL', 'SEL', 'SGR', 'FAE',
       'FGL', 'FGR', 'IAE', 'ISA', 'ISS', 'ISG', 'IVS', 'IGE', 'IGG', 'APA',
       'APL', 'APP', 'APG', 'APS', 'APD', 'VAE', 'VAS', 'VPE', 'VGE', 'VGG',
       'CAQ', 'CGR'],
      dtype='object')

**Selecting diet variables**

In [405]:
diet_vars = [
    "MainDiet","MainNiche",
    "Pa", "Pt", "Ne", "Se", "Fr", "In", "Ap", "Vt", "Ca",
    "PAG", "PAS", "PAD", "PEL", "PGR", "NAE", "NGL", "SEL", "SGR", "FAE",
    "FGL", "FGR", "IAE", "ISA", "ISS", "ISG", "IVS", "IGE", "IGG", "APA",
    "APL", "APP", "APG", "APS", "APD", "VAE", "VAS", "VPE", "VGE", "VGG",
    "CAQ", "CGR"
]

**Formatting species names for consistent merging**

In [406]:
diet["Species3"] = (
    diet["Species3"]
    .str.strip()
    .str.replace(" ", "_", regex=False)
)

In [407]:
diet_sub = diet[["Species3"] + diet_vars]

In [408]:
data = data.merge(
    diet_sub,
    on="Species3",
    how="left"
)

In [409]:
data.columns

Index(['Species3', 'Family3', 'Order3', 'log_Beak.Length_Culmen',
       'log_Beak.Width', 'log_Beak.Depth', 'log_Tarsus.Length',
       'log_Wing.Length', 'log_Kipps.Distance', 'log_Tail.Length', 'log_Mass',
       'Island_Endemicity', 'Flight_Ability', 'fly', 'swim', 'walk', 'seabird',
       'MainDiet', 'MainNiche', 'Pa', 'Pt', 'Ne', 'Se', 'Fr', 'In', 'Ap', 'Vt',
       'Ca', 'PAG', 'PAS', 'PAD', 'PEL', 'PGR', 'NAE', 'NGL', 'SEL', 'SGR',
       'FAE', 'FGL', 'FGR', 'IAE', 'ISA', 'ISS', 'ISG', 'IVS', 'IGE', 'IGG',
       'APA', 'APL', 'APP', 'APG', 'APS', 'APD', 'VAE', 'VAS', 'VPE', 'VGE',
       'VGG', 'CAQ', 'CGR'],
      dtype='object')

In [410]:
data.shape # Should be 9980 rows and 60 columns

(9980, 60)

**Renaming columns**

In [411]:
data = data.rename(columns={
    "Species3": "Species",
    "Family3": "Family",
    "Order3": "Order",
    "MainDiet": "dn.cat",
    "MainNiche": "fd.cat"
})

In [412]:
diet_map = {
    "In": "Invertivore",
    "Ap": "Aquatic_predator",
    "Vt": "Vertivore",
    "Ne": "Nectarivore",
    "Fr": "Frugivore",
    "Se": "Granivore",
    "Pa": "Aquatic_Herbivore",
    "Pt": "Terrestrial_Herbivore",
    "Ca" : "Scavenger",
    "Omnivore" : "Omnivore"
}

In [413]:
data["diet_simple"] = data["dn.cat"].map(diet_map)

In [414]:
data[["dn.cat", "diet_simple"]].value_counts(dropna=False)

dn.cat    diet_simple          
In        Invertivore              4782
Omnivore  Omnivore                 1732
Fr        Frugivore                1030
Ap        Aquatic_predator          757
Se        Granivore                 664
Ne        Nectarivore               507
Vt        Vertivore                 311
Pt        Terrestrial_Herbivore      93
Pa        Aquatic_Herbivore          82
Ca        Scavenger                  22
Name: count, dtype: int64
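`Series.map` with a dictionary returns `NaN` for any code that is not a key, which is why the table above is printed with `dropna=False`. A short sketch of how unmapped codes could be listed explicitly, using a subset of `diet_map` and a made-up code `"Xx"`:

```python
import pandas as pd

# Subset of the mapping above, for illustration
diet_map = {"In": "Invertivore", "Vt": "Vertivore", "Omnivore": "Omnivore"}

# "Xx" is a made-up code with no entry in the mapping
codes = pd.Series(["In", "Vt", "Omnivore", "Xx"])

mapped = codes.map(diet_map)                      # unmapped codes become NaN
unmapped = sorted(codes[mapped.isna()].unique())  # codes that need a mapping entry
```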

### 1.4.1 Reclassification of scavenger diets into vertebrate-based feeding categories

The carrion-based diet category (Ca) was merged into the vertebrate-based feeding category (Vt) by adding the Ca proportions to Vt and recoding the dominant diet labels, ensuring consistency across diet representations.

In [415]:
data["Vt"] = data["Vt"] + data["Ca"]

In [416]:
data = data.drop(columns=["Ca"])

In [417]:
"Ca" in data.columns # Should be False

False

**Validation of diet proportion consistency**

Diet proportions were verified to ensure that the redistribution preserved the total dietary composition, with proportions summing to 100% for each species.

In [418]:
diet_cols = ["Pa", "Pt", "Ne", "Se", "Fr", "In", "Ap", "Vt"]

In [419]:
(data[diet_cols].sum(axis=1).round(5) == 100).all() # Should be True

True
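The effect of folding Ca into Vt can be illustrated on a toy table. The sketch below compares row totals before and after the merge and uses `np.isclose` as a tolerance-based alternative to rounding; the values are made up.

```python
import numpy as np
import pandas as pd

# Toy diet proportions (percent); each row sums to 100
toy = pd.DataFrame({
    "In": [20.0, 0.0],
    "Vt": [50.0, 10.0],
    "Ca": [30.0, 90.0],
})
total_before = toy[["In", "Vt", "Ca"]].sum(axis=1)

toy["Vt"] = toy["Vt"] + toy["Ca"]  # fold carrion into the vertebrate column
toy = toy.drop(columns=["Ca"])
total_after = toy[["In", "Vt"]].sum(axis=1)

sums_preserved = bool(np.isclose(total_before, total_after).all())
sums_to_100 = bool(np.isclose(total_after, 100.0).all())
```

Note that a missing value in either `Vt` or `Ca` would propagate to the sum, so this check assumes both columns are complete.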

**Inspection of scavenger species diet profiles**

Species previously classified as scavengers were inspected to confirm the redistribution of diet proportions and to check whether their diet was assigned entirely to Vt.

In [420]:
diet_cols = ["Pa", "Pt", "Ne", "Se", "Fr", "In", "Ap", "Vt"]

In [421]:
cols_to_show = ["Species"] + diet_cols

In [422]:
scavengers = data.loc[data["dn.cat"] == "Ca", cols_to_show]

In [423]:
scavengers.head()

Unnamed: 0,Species,Pa,Pt,Ne,Se,Fr,In,Ap,Vt
50,Aegypius_monachus,0,0,0,0,0,0,0,100
142,Gypaetus_barbatus,0,0,0,0,0,0,0,100
144,Gyps_africanus,0,0,0,0,0,0,0,100
145,Gyps_bengalensis,0,0,0,0,0,0,0,100
146,Gyps_coprotheres,0,0,0,0,0,0,0,100


**Reclassification of dominant diet category**

The dominant diet category was updated by recoding carrion-based diets (Ca) as vertebrate-based diets (Vt).

In [424]:
data["dn.cat"] = data["dn.cat"].replace({"Ca": "Vt"})

In [425]:
data["dn.cat"].value_counts()

dn.cat
In          4782
Omnivore    1732
Fr          1030
Ap           757
Se           664
Ne           507
Vt           333
Pt            93
Pa            82
Name: count, dtype: int64

**Harmonization of simplified diet categories**

Simplified diet categories were updated to reflect the removal of scavenger diets, reclassifying them as vertebrate-based diets.

In [426]:
data["diet_simple"] = data["diet_simple"].replace({"Scavenger": "Vertivore"})

In [427]:
data["diet_simple"].value_counts()

diet_simple
Invertivore              4782
Omnivore                 1732
Frugivore                1030
Aquatic_predator          757
Granivore                 664
Nectarivore               507
Vertivore                 333
Terrestrial_Herbivore      93
Aquatic_Herbivore          82
Name: count, dtype: int64

## 1.5 Phylogenetic PCA and integration of evolutionary information

Phylogenetic structure was accounted for by computing a Brownian-motion variance–covariance matrix from an avian phylogeny, running a PCA on that matrix, and retaining the first 12 principal components. These components summarize shared evolutionary history and were merged with the main dataset at the species level.

**Install dependencies (Biopython, openpyxl, R and rpy2)**

In [72]:
!pip install biopython openpyxl
from Bio import Phylo
import openpyxl
from io import StringIO
import requests



In [73]:
url = "https://raw.githubusercontent.com/martinezrubio/ECOTREX/main/data/raw/ecotrex_tree.nwk"
tree_str = requests.get(url).text
tree = Phylo.read(StringIO(tree_str), "newick")

In [80]:
conda install -c conda-forge rpy2 r-base r-essentials

Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Note: you may need to restart the kernel to use updated packages.


**Enable R execution inside the notebook (rpy2)**

In [35]:
%load_ext rpy2.ipython

**Compute phylogenetic PCA**

In [36]:
%%R -o pcs
library(ape)

tree <- read.tree("https://raw.githubusercontent.com/martinezrubio/ECOTREX/main/data/raw/ecotrex_tree.nwk")
V <- vcv(tree)
pca <- prcomp(V, scale.=TRUE)

pcs <- as.data.frame(pca$x[, 1:12])
colnames(pcs) <- paste0("PC", 1:12)
pcs$species <- rownames(V)

In [37]:
pcs.head()

Unnamed: 0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,PC11,PC12,species
Megalapteryx_didinus,-206.274554,-108.923832,-1.743762,-0.317601,3.892841,-0.483482,0.660844,-3.653352,13.55458,-76.396884,85.771716,-10.984549,Megalapteryx_didinus
Euryapteryx_curtus,-206.291422,-108.951216,-1.745061,-0.317972,3.898758,-0.484266,0.662108,-3.662704,13.602923,-76.694778,86.145266,-11.048494,Euryapteryx_curtus
Emeus_crassus,-206.291422,-108.951216,-1.745061,-0.317972,3.898758,-0.484266,0.662108,-3.662704,13.602923,-76.694778,86.145266,-11.048494,Emeus_crassus
Anomalopteryx_didiformis,-206.290785,-108.950181,-1.745012,-0.317958,3.898534,-0.484236,0.66206,-3.66235,13.601096,-76.683523,86.131152,-11.046078,Anomalopteryx_didiformis
Pachyornis_geranoides,-206.289058,-108.947378,-1.744879,-0.31792,3.897929,-0.484156,0.661931,-3.661393,13.596148,-76.653032,86.092918,-11.039533,Pachyornis_geranoides
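For readers without an R installation, the PCA step can be approximated in NumPy. This is a sketch under stated assumptions, not the notebook's method: it reproduces what `prcomp(V, scale.=TRUE)` does (centre each column, scale to unit sample variance, take the singular value decomposition) on a hand-written 3 × 3 covariance matrix for a toy tree. The function `pca_scores` is hypothetical, and component signs may differ from R's output because they are arbitrary.

```python
import numpy as np

def pca_scores(matrix, n_components):
    """PCA scores of the rows of `matrix` after centring and scaling each column."""
    X = np.asarray(matrix, dtype=float)
    X = X - X.mean(axis=0)
    X = X / X.std(axis=0, ddof=1)  # sample standard deviation, as in R's sd()
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * S)[:, :n_components]  # same as X @ Vt.T[:, :n_components]

# Toy Brownian-motion covariance for the tree ((A:1,B:1):1,C:2):
# diagonal = root-to-tip distance, off-diagonal = branch length shared from the root.
V = np.array([
    [2.0, 1.0, 0.0],
    [1.0, 2.0, 0.0],
    [0.0, 0.0, 2.0],
])

scores = pca_scores(V, n_components=2)
```

Each row of `scores` corresponds to one tip of the tree, in the same order as the rows of `V`.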


**Merge phylogenetic PCs with the main dataset**

In [428]:
data = data.merge(pcs, left_on="Species", right_on="species", how="left").drop(columns=["species"])

In [429]:
data.columns

Index(['Species', 'Family', 'Order', 'log_Beak.Length_Culmen',
       'log_Beak.Width', 'log_Beak.Depth', 'log_Tarsus.Length',
       'log_Wing.Length', 'log_Kipps.Distance', 'log_Tail.Length', 'log_Mass',
       'Island_Endemicity', 'Flight_Ability', 'fly', 'swim', 'walk', 'seabird',
       'dn.cat', 'fd.cat', 'Pa', 'Pt', 'Ne', 'Se', 'Fr', 'In', 'Ap', 'Vt',
       'PAG', 'PAS', 'PAD', 'PEL', 'PGR', 'NAE', 'NGL', 'SEL', 'SGR', 'FAE',
       'FGL', 'FGR', 'IAE', 'ISA', 'ISS', 'ISG', 'IVS', 'IGE', 'IGG', 'APA',
       'APL', 'APP', 'APG', 'APS', 'APD', 'VAE', 'VAS', 'VPE', 'VGE', 'VGG',
       'CAQ', 'CGR', 'diet_simple', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6',
       'PC7', 'PC8', 'PC9', 'PC10', 'PC11', 'PC12'],
      dtype='object')

## 1.6 Final dataset assembly

Relevant taxonomic, morphological, ecological, dietary, spatial, and phylogenetic variables were selected and explicitly ordered to define the final analysis-ready dataset used in downstream modelling.

In [430]:
final_columns = [
    "Species", "Family", "Order",
    "Island_Endemicity", "Flight_Ability",
    "log_Beak.Length_Culmen", "log_Beak.Width", "log_Beak.Depth",
    "log_Tarsus.Length", "log_Wing.Length", "log_Kipps.Distance", "log_Tail.Length", "log_Mass",

    "IAE", "ISA", "ISS", "ISG", "IVS", "IGE", "IGG",
    "APG", "APP", "APA", "APL", "APS", "APD",
    "FAE", "FGL", "FGR",
    "NAE", "NGL", "SEL", "SGR",
    "PEL", "PGR", "PAG", "PAS", "PAD",
    "VAE", "VAS", "VPE", "VGE", "VGG",
    "CAQ", "CGR",

    "In", "Ap", "Vt", "Ne", "Fr", "Se", "Pa", "Pt",

    "dn.cat", "fd.cat",
    "fly", "swim", "walk", "seabird",

    "PC1", "PC2", "PC3", "PC4", "PC5", "PC6",
    "PC7", "PC8", "PC9", "PC10", "PC11", "PC12"
]

In [431]:
data = data[final_columns]

In [432]:
data.columns.tolist()

['Species',
 'Family',
 'Order',
 'Island_Endemicity',
 'Flight_Ability',
 'log_Beak.Length_Culmen',
 'log_Beak.Width',
 'log_Beak.Depth',
 'log_Tarsus.Length',
 'log_Wing.Length',
 'log_Kipps.Distance',
 'log_Tail.Length',
 'log_Mass',
 'IAE',
 'ISA',
 'ISS',
 'ISG',
 'IVS',
 'IGE',
 'IGG',
 'APG',
 'APP',
 'APA',
 'APL',
 'APS',
 'APD',
 'FAE',
 'FGL',
 'FGR',
 'NAE',
 'NGL',
 'SEL',
 'SGR',
 'PEL',
 'PGR',
 'PAG',
 'PAS',
 'PAD',
 'VAE',
 'VAS',
 'VPE',
 'VGE',
 'VGG',
 'CAQ',
 'CGR',
 'In',
 'Ap',
 'Vt',
 'Ne',
 'Fr',
 'Se',
 'Pa',
 'Pt',
 'dn.cat',
 'fd.cat',
 'fly',
 'swim',
 'walk',
 'seabird',
 'PC1',
 'PC2',
 'PC3',
 'PC4',
 'PC5',
 'PC6',
 'PC7',
 'PC8',
 'PC9',
 'PC10',
 'PC11',
 'PC12']
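Selecting by an explicit list silently discards any column that is not listed; in this notebook, `diet_simple` is present before the selection and absent from `final_columns`. A small sketch of how dropped and missing columns could be reported, using a hypothetical helper `select_columns` and a toy table:

```python
import pandas as pd

def select_columns(df, columns):
    """Return df[columns], raising a readable error if any requested column is absent."""
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}")
    return df[columns]

# Toy table for illustration only
toy = pd.DataFrame({"Species": ["Aa_bb"], "log_Mass": [2.1], "diet_simple": ["Vertivore"]})

final = select_columns(toy, ["Species", "log_Mass"])
dropped = [c for c in toy.columns if c not in final.columns]  # columns left out by the selection
```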

## 1.7 Save AVONET pre-processed dataset

In [434]:
data.to_csv("avonet_processed.csv", index=False)  # relative path; adjust to your own output directory
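After saving, reading the file back is a cheap way to confirm that the shape and column order survived the round trip. The sketch below writes a toy table to a temporary directory; the file name matches the one used above, but the data are made up.

```python
import os
import tempfile
import pandas as pd

# Toy table for illustration only
toy = pd.DataFrame({
    "Species": ["Aa_bb", "Cc_dd"],
    "log_Mass": [2.1, 3.4],
    "seabird": ["no", "yes"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "avonet_processed.csv")
    toy.to_csv(path, index=False)   # index=False avoids writing an extra unnamed column
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy.shape
same_columns = reloaded.columns.tolist() == toy.columns.tolist()
```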