Acyl_ACP_TE_MachineLearning

R scripts in this repo are implemented in R version 3.6. The initial input files include "TE fatty acid profiles.csv", "TE substrate specificities.csv", and "pairwise comparison of TE sequences encoded with labels.csv". Output files generated by each step are used as the input for the next steps.

The R scripts are used in the following order:

Step 1: Clustering analysis for fatty acid profiles of each TE --> use script fatty_acid_profile_clustering.R

This script depends on two additional scripts: PlotDimRdc.R and PLS.R.
Input: TE fatty acid profiles.csv
Output: FA_clustering.rds
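
As a rough illustration of what this step produces, the sketch below clusters a few toy fatty acid profiles and saves the cluster membership as an .rds file. It is not the method in fatty_acid_profile_clustering.R: average-linkage hierarchical clustering is swapped in as a simple stand-in, and the toy matrix, its layout (rows = TEs, columns = fatty acid fractions), the choice of k = 2, and the temporary output path are all assumptions.

```r
# Hypothetical stand-in for "TE fatty acid profiles.csv":
# rows are TEs, columns are fatty acid fractions (assumed layout).
profiles <- rbind(
  TE1 = c(C8 = 0.80, C12 = 0.15, C16 = 0.05),
  TE2 = c(C8 = 0.75, C12 = 0.20, C16 = 0.05),
  TE3 = c(C8 = 0.05, C12 = 0.10, C16 = 0.85),
  TE4 = c(C8 = 0.10, C12 = 0.05, C16 = 0.85)
)

# Euclidean distances between profiles, then average-linkage clustering.
hc <- hclust(dist(profiles), method = "average")

# Cut the tree into k clusters (k = 2 is an arbitrary choice for the toy data).
clusters <- cutree(hc, k = 2)

# Save the membership vector; a temp file is used here instead of FA_clustering.rds.
out_file <- tempfile(fileext = ".rds")
saveRDS(clusters, out_file)
```

The saved object is a named integer vector (TE name to cluster id), which is the kind of information Step 2 needs to decide whether two TEs share a cluster.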

Step 2: Evaluation of feature importance by random forest (RF) classifier --> use script RF_feature importance.R

Input: pairwise comparison of TE sequences encoded with labels.csv; FA_clustering.rds
Output: pvalues_10Runs_RF.txt; importance_score_10Runs_RF.txt; importance_rank.txt
Detailed steps:

Define the instances for the RF classifier (each instance is a pair of TEs):
Response --> whether the two TEs belong to the same fatty acid profile cluster (1, same cluster; 0, different clusters).
Feature --> sequence variation between the two TEs at each amino acid position (0, same amino acid; 1, different amino acids).
Construct the RF classifier. The classifier was built 10 times on the same dataset to account for the randomness involved in classifier construction.
Calculate the feature importance scores and the associated p-values from the 10 RF classifiers.
Rank the features by importance, from high to low.
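
The response and feature encoding described above can be sketched in base R as follows. The toy sequences, the cluster labels, the helper name encode_pairs, and the column names (label, pos1, ...) are hypothetical; the sketch also assumes the sequences are already aligned to equal length. The real encoded table is "pairwise comparison of TE sequences encoded with labels.csv".

```r
# Toy aligned sequences and cluster membership (both hypothetical).
seqs    <- c(TE1 = "MKAL", TE2 = "MKSL", TE3 = "MRSL")
cluster <- c(TE1 = 1, TE2 = 1, TE3 = 2)

encode_pairs <- function(seqs, cluster) {
  # Split each sequence into residues: one row per TE, one column per position.
  residues <- do.call(rbind, strsplit(seqs, ""))
  # Every unordered pair of TEs; each column of `pairs` is one pair.
  pairs <- combn(names(seqs), 2)
  rows <- lapply(seq_len(ncol(pairs)), function(k) {
    a <- pairs[1, k]
    b <- pairs[2, k]
    # Feature: 1 if the residues differ at a position, 0 if they match.
    feat <- as.integer(residues[a, ] != residues[b, ])
    names(feat) <- paste0("pos", seq_along(feat))
    # Response: 1 if both TEs are in the same cluster, 0 otherwise.
    c(label = as.integer(cluster[[a]] == cluster[[b]]), feat)
  })
  out <- as.data.frame(do.call(rbind, rows))
  rownames(out) <- paste(pairs[1, ], pairs[2, ], sep = "_")
  out
}

pairwise <- encode_pairs(seqs, cluster)
print(pairwise)
```

Each row is one TE pair, so a table of this shape can be passed to a classifier with label as the response and the position columns as features.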

Step 3: Further trimming of the results in step 2 to obtain the short list of important features --> use script RF_IFS.R

An incremental feature selection (IFS) approach was used to identify the RF classifier that uses the fewest features while still achieving optimal predictive performance.
Input: pairwise comparison of TE sequences encoded with labels.csv; TE substrate specificities.csv; importance_rank.txt
Output: IFS_evaluation.rds
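
The general shape of an IFS loop is sketched below: features are added one at a time in rank order, and each resulting model is scored by cross-validated MCC. This is a minimal sketch under several assumptions, not the code in RF_IFS.R. To stay in base R, a nearest-centroid classifier is swapped in for the random forest; the simulated data, the ranked vector (standing in for importance_rank.txt), the 5-fold split, and the helper names mcc and fit_predict are all hypothetical.

```r
set.seed(42)

# Matthews correlation coefficient for 0/1 labels.
mcc <- function(truth, pred) {
  tp <- as.numeric(sum(truth == 1 & pred == 1))
  tn <- as.numeric(sum(truth == 0 & pred == 0))
  fp <- as.numeric(sum(truth == 0 & pred == 1))
  fn <- as.numeric(sum(truth == 1 & pred == 0))
  denom <- sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  if (denom == 0) return(0)
  (tp * tn - fp * fn) / denom
}

# Stand-in classifier (NOT a random forest): assign each test row to the
# class whose training centroid is closer in squared Euclidean distance.
fit_predict <- function(x_train, y_train, x_test) {
  c0 <- colMeans(x_train[y_train == 0, , drop = FALSE])
  c1 <- colMeans(x_train[y_train == 1, , drop = FALSE])
  d0 <- rowSums(sweep(x_test, 2, c0)^2)
  d1 <- rowSums(sweep(x_test, 2, c1)^2)
  as.integer(d1 < d0)
}

# Simulated data: 200 pairs, 6 binary position features, only pos1 informative.
n <- 200
y <- rbinom(n, 1, 0.5)
x <- matrix(rbinom(n * 6, 1, 0.5), nrow = n, ncol = 6,
            dimnames = list(NULL, paste0("pos", 1:6)))
x[, "pos1"] <- 1 - y                 # identical residues <-> same cluster
flip <- sample(n, 20)
x[flip, "pos1"] <- y[flip]           # add 10% noise

ranked <- paste0("pos", 1:6)         # hypothetical importance ranking
folds  <- sample(rep(1:5, length.out = n))

# IFS: model k uses the top-k ranked features; rows = folds, columns = models.
ifs <- sapply(seq_along(ranked), function(k) {
  feats <- ranked[seq_len(k)]
  sapply(1:5, function(f) {
    test <- folds == f
    pred <- fit_predict(x[!test, feats, drop = FALSE], y[!test],
                        x[test, feats, drop = FALSE])
    mcc(y[test], pred)
  })
})
colnames(ifs) <- paste0("top", seq_along(ranked))
mean_mcc <- colMeans(ifs)
```

The per-model MCC values collected this way are what a later step can compare to find where performance stops improving.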

Step 4: Summarize and plot the IFS results --> use script IFS_summary.R

To determine the minimum number of features needed to reach the plateau in Matthews correlation coefficient (MCC), the MCC was compared between every pair of adjacent models in the IFS series (e.g. a model with n features vs. a model with n+1 features).
IFS_summary.R requires asterisk.R and errbar.R.
Input: IFS_evaluation.rds; importance_score_10Runs_RF.txt; pvalues_10Runs_RF.txt
Output: IFS_model_comparison_mcc.csv
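
The adjacent-model comparison can be illustrated with the sketch below, which tests model n against model n+1 for each n and flags where the MCC still changes significantly. The statistical test used in IFS_summary.R is not stated here, so Welch's two-sample t-test is an assumption, as are the simulated MCC values, the 0.05 threshold, the helper name compare_adjacent, and the output column names.

```r
set.seed(7)

# Simulated MCC values: 10 repeats (rows) for 5 models (columns) whose
# true mean rises and then levels off. Purely illustrative numbers.
true_mean <- c(0.40, 0.60, 0.75, 0.76, 0.76)
mcc_runs <- sapply(true_mean, function(m) rnorm(10, mean = m, sd = 0.02))
colnames(mcc_runs) <- paste0("n", seq_along(true_mean))

compare_adjacent <- function(mcc_runs, alpha = 0.05) {
  k <- ncol(mcc_runs)
  idx <- seq_len(k - 1)
  res <- data.frame(
    n_features_a = idx,
    n_features_b = idx + 1,
    mean_mcc_a   = colMeans(mcc_runs)[idx],
    mean_mcc_b   = colMeans(mcc_runs)[idx + 1],
    # Welch's t-test between model n and model n+1 (assumed test).
    p_value      = vapply(idx, function(i) {
      t.test(mcc_runs[, i], mcc_runs[, i + 1])$p.value
    }, numeric(1)),
    row.names = NULL
  )
  res$significant <- res$p_value < alpha
  res
}

cmp <- compare_adjacent(mcc_runs)
print(cmp)

# Write the comparison table; a temp file stands in for the real output file.
csv_file <- tempfile(fileext = ".csv")
write.csv(cmp, csv_file, row.names = FALSE)
```

Under this reading, the plateau starts at the smallest n after which adding another feature no longer gives a significant change in MCC.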

ketingchen/Acyl_ACP_TE_MachineLearning
