ketingchen/Acyl_ACP_TE_MachineLearning

Acyl_ACP_TE_MachineLearning

The R scripts in this repo were written for R version 3.6. The initial input files include "TE fatty acid profiles.csv", "TE substrate specificities.csv", and "pairwise comparison of TE sequences encoded with labels.csv". The output files generated by each step serve as input for later steps.
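
The file flow between the four steps listed below can be summarised in a small sketch. It is written in Python purely for illustration (the repository's scripts are in R); the file and script names are copied from this README, while the helper `missing_inputs` is a hypothetical name that is not part of the repository.

```python
# Illustrative only: checks that every step's inputs are either initial
# files or outputs of an earlier step. `missing_inputs` is a hypothetical
# helper, not something the repository provides.
PAIRWISE = "pairwise comparison of TE sequences encoded with labels.csv"

INITIAL_FILES = {
    "TE fatty acid profiles.csv",
    "TE substrate specificities.csv",
    PAIRWISE,
}

# (script, inputs, outputs), in the order the README runs them.
STEPS = [
    ("fatty_acid_profile_clustering.R",
     ["TE fatty acid profiles.csv"],
     ["FA_clustering.rds"]),
    ("RF_feature importance.R",
     [PAIRWISE, "FA_clustering.rds"],
     ["pvalues_10Runs_RF.txt", "importance_score_10Runs_RF.txt",
      "importance_rank.txt"]),
    ("RF_IFS.R",
     [PAIRWISE, "TE substrate specificities.csv", "importance_rank.txt"],
     ["IFS_evaluation.rds"]),
    ("IFS_summary.R",
     ["IFS_evaluation.rds", "importance_score_10Runs_RF.txt",
      "pvalues_10Runs_RF.txt"],
     ["IFS_model_comparison_mcc.csv"]),
]


def missing_inputs(steps, initial_files):
    """Return {script: [inputs not yet available]} for the given run order."""
    available = set(initial_files)
    missing = {}
    for script, inputs, outputs in steps:
        absent = [name for name in inputs if name not in available]
        if absent:
            missing[script] = absent
        available.update(outputs)
    return missing


# An empty dict means the steps can run in the listed order.
print(missing_inputs(STEPS, INITIAL_FILES))
```

Skipping Step 1 (for example `missing_inputs(STEPS[1:], INITIAL_FILES)`) reports that `RF_feature importance.R` lacks `FA_clustering.rds`, which is why the steps are run in order.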

The R scripts are used in the following order:

Step 1: Clustering analysis for fatty acid profiles of each TE --> use script fatty_acid_profile_clustering.R

This script depends on two additional scripts: PlotDimRdc.R and PLS.R.
Input: TE fatty acid profiles.csv
Output: FA_clustering.rds

Step 2: Evaluation of feature importance by random forest (RF) classifier --> use script RF_feature importance.R

Input: pairwise comparison of TE sequences encoded with labels.csv; FA_clustering.rds
Output: pvalues_10Runs_RF.txt; importance_score_10Runs_RF.txt; importance_rank.txt
Detailed steps:

  1. Define the instances for the RF classifier:
    Response --> comparison of fatty acid profile cluster membership between two TEs (1, same cluster; 0, different clusters).
    Feature --> sequence variation between two TEs at each amino acid position (0, same amino acid; 1, different amino acids).
  2. Construct the RF classifier. The classifier was built 10 times on the same dataset to account for the randomness involved in classifier construction.
  3. Calculate the feature importance scores and the associated p-values from the 10 RF classifiers.
  4. Rank the features by importance from high to low.
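
The pairwise encoding in item 1 and the ranking in item 4 can be sketched as follows. This is a minimal Python illustration, not the code in RF_feature importance.R. The function names, the toy sequences, and the cluster labels are all hypothetical. Two assumptions are made that the README does not state: sequences are already aligned to equal length (gap characters are compared like any other symbol), and the ranking uses the mean importance across runs.

```python
from itertools import combinations


def encode_pairs(aligned_seqs, clusters):
    """Build one instance per unordered pair of TEs.

    aligned_seqs: dict of TE name -> aligned amino acid sequence
    clusters:     dict of TE name -> fatty acid profile cluster label
    Returns a list of (name_a, name_b, features, response) tuples.
    """
    rows = []
    for a, b in combinations(sorted(aligned_seqs), 2):
        seq_a, seq_b = aligned_seqs[a], aligned_seqs[b]
        if len(seq_a) != len(seq_b):
            raise ValueError("sequences must be aligned to the same length")
        # Feature: 0 if both TEs share the residue at a position, else 1.
        features = [0 if x == y else 1 for x, y in zip(seq_a, seq_b)]
        # Response: 1 if both TEs fall in the same cluster, else 0.
        response = 1 if clusters[a] == clusters[b] else 0
        rows.append((a, b, features, response))
    return rows


def rank_features(importance_runs):
    """Rank positions from high to low by mean importance across runs
    (averaging is an assumption made for this sketch)."""
    n_runs = len(importance_runs)
    n_feat = len(importance_runs[0])
    means = [sum(run[j] for run in importance_runs) / n_runs
             for j in range(n_feat)]
    return sorted(range(n_feat), key=lambda j: means[j], reverse=True)


# Toy data (hypothetical): three TEs, three aligned positions.
seqs = {"TE1": "MKA", "TE2": "MRA", "TE3": "LKA"}
clusters = {"TE1": 1, "TE2": 1, "TE3": 2}
for row in encode_pairs(seqs, clusters):
    print(row)
```

With this toy input, the TE1/TE2 pair differs only at the second position and shares a cluster, so its features are `[0, 1, 0]` and its response is `1`.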

Step 3: Further trimming of the results in step 2 to obtain the short list of important features --> use script RF_IFS.R

An incremental feature selection (IFS) approach was used to identify the RF classifier that uses the fewest features while retaining optimal predictive performance.
Input: pairwise comparison of TE sequences encoded with labels.csv; TE substrate specificities.csv; importance_rank.txt
Output: IFS_evaluation.rds
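
The general shape of incremental feature selection can be sketched as below. This is a Python illustration under stated assumptions, not the code in RF_IFS.R. The `evaluate` callback is a hypothetical stand-in for training and validating an RF classifier on a feature subset. The position names and the toy scoring rule are invented for the example.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Score nested feature sets: top 1, top 2, ..., top N.

    ranked_features: features ordered from most to least important
    evaluate:        callable(list_of_features) -> score, higher is better
                     (hypothetical stand-in for fitting an RF classifier)
    Returns (smallest_best_subset, scores_for_every_subset_size).
    """
    if not ranked_features:
        raise ValueError("ranked_features must not be empty")
    scores = [evaluate(ranked_features[:n])
              for n in range(1, len(ranked_features) + 1)]
    # list.index returns the first occurrence of the maximum, i.e. the
    # subset with the fewest features among those that tie for best.
    best_n = scores.index(max(scores)) + 1
    return ranked_features[:best_n], scores


# Toy scorer (hypothetical): fraction of "informative" positions included.
informative = {"pos12", "pos45"}
ranked = ["pos12", "pos45", "pos7", "pos90"]


def toy_score(subset):
    return len(informative & set(subset)) / len(informative)


best, scores = incremental_feature_selection(ranked, toy_score)
print(best, scores)
```

Picking the first maximum is the simplest selection rule. Step 4 of this README instead compares adjacent models to locate a plateau, which is more robust when scores fluctuate between runs.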

Step 4: Summarize and plot the IFS results --> use script IFS_summary.R

To determine the minimum number of features needed to reach the plateau in Matthews correlation coefficient (MCC), the MCC was compared between every pair of adjacent models in the IFS (e.g. a model with n features vs. a model with n+1 features).
IFS_summary.R requires asterisk.R and errbar.R.
Input: IFS_evaluation.rds; importance_score_10Runs_RF.txt; pvalues_10Runs_RF.txt
Output: IFS_model_comparison_mcc.csv
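
The MCC and the adjacent-model comparison can be sketched as follows. This is a Python illustration, not the code in IFS_summary.R. The `mcc` function uses the standard confusion-matrix formula. The plateau rule with a fixed tolerance `tol` is an assumption made for this example, since the README does not state the criterion or statistical test used to compare adjacent models. The MCC values are invented.

```python
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column sum is zero
    return (tp * tn - fp * fn) / denom


def adjacent_gains(mean_mcc):
    """MCC of the model with n+1 features minus that of the model with n."""
    return [b - a for a, b in zip(mean_mcc, mean_mcc[1:])]


def plateau_size(mean_mcc, tol=0.01):
    """Smallest n after which no adjacent gain exceeds `tol`.

    `tol` is a hypothetical threshold chosen for this sketch; a
    significance test between adjacent models could replace it.
    """
    gains = adjacent_gains(mean_mcc)
    for n in range(1, len(mean_mcc) + 1):
        if all(g <= tol for g in gains[n - 1:]):
            return n
    return len(mean_mcc)


# Invented mean MCC values for models with 1..5 features.
mean_mcc = [0.40, 0.55, 0.70, 0.705, 0.70]
print(plateau_size(mean_mcc))
```

In this toy series the gains after the third model stay below the tolerance, so three features are reported as the plateau.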
