This repository contains the data, code, and results for the article “Reflectance spectroscopy and machine learning as a tool for the categorization of twin species based on the example of the Diachrysia genus”.
There are two datasets in the `data` directory: the first contains legislative information on moths (`Legislative`), the second contains the measured spectra (`Spectra`).
The files are available in two formats: `.csv` and `.xlsx`.
The `Legislative` file contains the following columns:
- ID - Individual identifier (species-number)
- Species - Diachrysia chrysitis or Diachrysia stenochrysis
- Sex - Male (♂) or Female (♀)
- Year_catch - year of the moth catch
- Day_catch - day of the year when the moth was caught
- Locality - place where the moth was caught
- UTM_code - zone in UTM coordinate system
- Longitude - east–west position in degrees
- Latitude - north–south position in degrees
- Feature_level - level of marking the morphological feature, where 1 means weak, 2 means strong, 3 means very strong
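Because `Day_catch` stores the day of the year rather than a calendar date, a catch date can be recovered by combining it with `Year_catch`. The Python sketch below is only an illustration and is not part of the repository's R code. It assumes that day 1 corresponds to 1 January, and the helper name `catch_date` is hypothetical.

```python
from datetime import date, timedelta


def catch_date(year_catch: int, day_catch: int) -> date:
    """Convert a year and a 1-based day of the year into a calendar date.

    Assumption: Day_catch = 1 means 1 January of Year_catch.
    """
    if day_catch < 1:
        raise ValueError("day_catch must be >= 1")
    result = date(year_catch, 1, 1) + timedelta(days=day_catch - 1)
    if result.year != year_catch:
        raise ValueError("day_catch exceeds the number of days in the year")
    return result


# Illustrative values only, not taken from the dataset.
print(catch_date(2020, 200))
```

The range check matters because leap years have 366 days, so the same `Day_catch` value can map to different calendar dates in different years.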
The `Spectra` file contains the following columns:
- ID - Individual identifier (species-number)
- Species - Diachrysia chrysitis or Diachrysia stenochrysis
- Scale - part of the scale on which the spectrometer measurement was made (Glass or Brown)
- 400-2100 - spectral band number
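The layout above (three descriptive columns followed by one column per spectral band) can be split into metadata and a numeric matrix before any analysis. The following Python sketch shows one way to do that on a small synthetic table. The band column names (`400`, `401`, ...), the ID values, and the reflectance numbers are assumptions made up for illustration, and `split_spectra` is a hypothetical helper, not a function from this repository.

```python
import csv
import io

# Synthetic stand-in for the Spectra table; the values are made up and the
# band column names ("400", "401", ...) are an assumption about the real file.
SAMPLE = """ID,Species,Scale,400,401,402
chrysitis-1,Diachrysia chrysitis,Glass,0.12,0.13,0.15
chrysitis-1,Diachrysia chrysitis,Brown,0.05,0.06,0.06
stenochrysis-1,Diachrysia stenochrysis,Glass,0.20,0.22,0.21
"""

META_COLUMNS = ("ID", "Species", "Scale")


def split_spectra(handle):
    """Split a Spectra-shaped CSV into metadata, band names, and values.

    Returns one metadata dict per row, the list of band column names,
    and one list of floats per row (in band order).
    """
    reader = csv.DictReader(handle)
    bands = [name for name in reader.fieldnames if name not in META_COLUMNS]
    metadata, values = [], []
    for row in reader:
        metadata.append({key: row[key] for key in META_COLUMNS})
        values.append([float(row[band]) for band in bands])
    return metadata, bands, values


metadata, bands, values = split_spectra(io.StringIO(SAMPLE))
print(bands)
print(metadata[0]["Scale"], values[0])
```

To try this on the real data, the `io.StringIO(SAMPLE)` argument could be replaced with an opened file handle for the `.csv` version of `Spectra` in the `data` directory, provided its header matches the assumptions above.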
To reproduce the results:
- Open the `diachrysia-classification.Rproj` project file in RStudio.
- Run `01_randomforest.R` to build classification models, assess the performance of classification at the general and individual level, and determine the importance of spectral features.
- Run `02_KS_test.R` to determine the importance of the spectral bands for species discrimination using the Kolmogorov–Smirnov test.
- Run `03_LDA_best_features.R` to determine the most useful spectral bands for species classification using Linear Discriminant Analysis and the D-statistic.
- Run `04_LDA_combinations.R` to determine the minimum set of spectral features needed to distinguish the species with 100% accuracy.
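To make the D-statistic mentioned in the steps above more concrete: for two samples, the Kolmogorov–Smirnov D is the largest absolute difference between their empirical cumulative distribution functions, so a band with D close to 1 separates the two species almost completely. The Python sketch below computes D per band and ranks the bands. It is an independent illustration on made-up numbers, not a port of `02_KS_test.R`, and the function names `ks_statistic` and `rank_bands` are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D: max |F_a(x) - F_b(x)| over observed x."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:
        # Empirical CDF value = fraction of observations <= x.
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def rank_bands(bands, values_a, values_b):
    """Rank bands by D, highest first; values_* hold one row of band values per sample."""
    scores = []
    for i, band in enumerate(bands):
        col_a = [row[i] for row in values_a]
        col_b = [row[i] for row in values_b]
        scores.append((band, ks_statistic(col_a, col_b)))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Made-up reflectance values for two species and two bands.
bands = ["400", "401"]
species_a = [[0.10, 0.30], [0.12, 0.31], [0.11, 0.35]]
species_b = [[0.20, 0.32], [0.22, 0.29], [0.21, 0.33]]
print(rank_bands(bands, species_a, species_b))
```

In this toy input the first band separates the two groups completely while the second overlaps, so the first band ranks higher. Evaluating the difference only at observed values is sufficient because empirical CDFs change only at those points.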
The code results were saved in the `results` directory:
- `ks-test.csv` - importance of the spectral bands for species discrimination determined by the D-statistic
- `rf-importance.csv` - average importance of the spectral features for classification in the random forest models