# Comparing the features
Here we import the top 25 features for each model, as ranked by the SHAP package. As a reminder, the average accuracies were:

| Model | Accuracy |
| :---   | ---      |
|Logistic Regression|93.5%|
|Random Forests|93.8%|
|Gradient Boosting Machine|91.8%|
|Neural Network|97.1%|

The goal is to find the features that the models agree on so that we can look more closely at the corresponding chemical data. To do that, we first need to recover the retention times and m/z values saved during the cleaning and processing steps.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
#import the top 25 features from each model
rf_25 = pd.read_csv('models/RF_top25.csv', index_col=[0])
lg_25 = pd.read_csv('models/lg_top25.csv', index_col=[0])
gbm_25 = pd.read_csv('models/GB_top25.csv', index_col=[0])
nn_25 = pd.read_csv('models/NN_top25.csv',index_col=[0])

In [3]:
#rename logistic regression column
col_update = {'0':'col_name'}
lg_25 = lg_25.rename(columns=col_update)

In [4]:
#join dataframes
all_25 = pd.concat([lg_25, rf_25, gbm_25, nn_25])

In [5]:
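#take the 10 features that appear most often across the four top-25 lists
#(reading the labels from the index avoids relying on the column name that reset_index() produces, which differs between pandas versions)
top10_feats = all_25['col_name'].value_counts().head(10).index.tolist()
top10_feats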

['787', '957', '1200', '1196', '187', '208', '952', '81', '996', '1288']
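
The frequency count above shows how often a feature appears, but not which models selected it. A minimal sketch of one way to keep that information is shown here; it uses invented feature ids and toy tables, and only assumes that each top-25 table has a `col_name` column as in the cells above. The names `toy`, `labelled`, `agreement` and `models_per_feature` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the four top-25 tables (feature ids here are invented).
toy = {
    "lg": pd.DataFrame({"col_name": ["a", "b", "c"]}),
    "rf": pd.DataFrame({"col_name": ["a", "b", "d"]}),
    "gbm": pd.DataFrame({"col_name": ["a", "e", "f"]}),
    "nn": pd.DataFrame({"col_name": ["a", "b", "g"]}),
}

# Passing a dict to pd.concat uses its keys as an outer index level;
# names= labels the two index levels so the model label can be recovered.
labelled = pd.concat(toy, names=["model", "row"]).reset_index(level="model")

# Number of distinct models that selected each feature, most agreed-upon first.
agreement = (
    labelled.groupby("col_name")["model"]
    .nunique()
    .sort_values(ascending=False)
)

# The list of models that selected each feature.
models_per_feature = labelled.groupby("col_name")["model"].agg(list)

print(agreement.head())
print(models_per_feature["b"])
```

With the real tables, the dict would hold `lg_25`, `rf_25`, `gbm_25` and `nn_25` in place of the toy frames.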

In [6]:
peak_ids = pd.read_csv('../processing/peaks_updated.csv',index_col=[0])

In [7]:
peak_ids

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 1426 | 1427 | 1428 | 1429 | 1430 | 1431 | 1432 | 1433 | 1434 | 1435 |
| :--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|name|12.0|14.0|16.0|19.0|22.0|23.0|25.0|26.0|28.0|31.0|...|18911.0|19007.0|19127.0|19207.0|19454.0|19466.0|19548.0|19869.0|19994.0|20163.0|
|m/z|415.210919|437.192936|457.276727|327.17755|553.255367|119.085962|629.371696|480.30775|522.354513|763.515364|...|124.967214|453.16701|443.171244|309.581772|616.175797|224.091635|725.615763|522.35436|417.025757|309.098409|
|retention time|9.200066|9.200345|10.471969|8.663874|12.873119|9.200257|16.715573|10.962874|10.996964|16.248348|...|0.314417|9.316367|9.338894|9.07034|9.046608|2.96543|12.324333|11.067387|4.857672|9.662856|
|componentindex|2.0|2.0|36.0| |2.0| |1.0|7.0|32.0|7.0|...| | |122.0|112.0|10.0|101.0| |32.0| | |


In [8]:
#up to this point, features have been identified by their column position, so we can select the matching columns of peak_ids directly
top10 = peak_ids[top10_feats]

In [9]:
top10 = top10.T
top10.to_csv('top10_ML_features.csv')
top10

| | name | m/z | retention time | componentindex |
| :--- | --- | --- | --- | --- |
|787|4724.0|524.297733|11.319835|7.0|
|957|6813.0|639.406783|11.271281|51.0|
|1200|12240.0|462.289309|9.386694| |
|1196|12224.0|105.070272|1.378392| |
|187|539.0|605.414337|17.242694| |
|208|625.0|524.297665|11.419732|7.0|
|952|6800.0|595.380713|11.286608| |
|81|184.0|607.390045|16.717968|11.0|
|996|7376.0|678.476749|11.24638|4.0|
|1288|14090.0|474.289816|9.448015| |
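
To see which of the selected features share a molecular network, the rows can be grouped by `componentindex`. The sketch below retypes the five rows of the table above that have a component index assigned; the names `top10_subset`, `members` and `shared` are illustrative only. On the full `top10` table, `groupby` would drop the rows with a missing `componentindex` by default.

```python
import pandas as pd

# Rows from the table above that have a GNPS component index assigned.
top10_subset = pd.DataFrame(
    {
        "m/z": [524.297733, 639.406783, 524.297665, 607.390045, 678.476749],
        "retention time": [11.319835, 11.271281, 11.419732, 16.717968, 11.24638],
        "componentindex": [7.0, 51.0, 7.0, 11.0, 4.0],
    },
    index=["787", "957", "208", "81", "996"],
)

# Collect the feature ids that fall in each molecular network.
members = (
    top10_subset.rename_axis("feature")
    .reset_index()
    .groupby("componentindex")["feature"]
    .agg(list)
)

# Keep only the networks that contain more than one selected feature.
shared = members[members.map(len) > 1]
print(shared)
```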


# Summary
It looks like metabolite 787 was the most agreed upon, followed by 208. They belong to the same molecular network (componentindex) that we calculated with [GNPS](https://gnps.ucsd.edu/ProteoSAFe/static/gnps-splash.jsp), and they have nearly identical retention times and mass-to-charge ratios (m/z). These two metabolites are likely isomers, but we should check this more closely with the fragmentation data.
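
As a rough check of how close features 787 and 208 are, the m/z difference can be expressed in parts per million and compared with a tolerance. This is only a sketch: the values are copied from the table above, while the 5 ppm and 0.2 retention-time tolerances are hypothetical placeholders rather than settings used in this analysis. Passing such a check is consistent with the isomer hypothesis but does not replace the fragmentation data.

```python
def ppm_difference(mz_a: float, mz_b: float) -> float:
    """Absolute m/z difference in parts per million, relative to the mean m/z."""
    return abs(mz_a - mz_b) / ((mz_a + mz_b) / 2) * 1e6


# Values for features 787 and 208, copied from the table above.
mz_787, rt_787 = 524.297733, 11.319835
mz_208, rt_208 = 524.297665, 11.419732

ppm = ppm_difference(mz_787, mz_208)
rt_gap = abs(rt_787 - rt_208)

# Hypothetical tolerances, chosen only for illustration.
PPM_TOL = 5.0
RT_TOL = 0.2

same_mass = ppm <= PPM_TOL
close_rt = rt_gap <= RT_TOL

print(f"m/z difference: {ppm:.3f} ppm, retention time gap: {rt_gap:.3f}")
print(f"within tolerances: mass={same_mass}, retention time={close_rt}")
```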

Overall, this analysis suggests that it may be possible to predict whether an insect has been exposed to *Trypanosoma cruzi* from the metabolites present in its gut. This is intuitive, since the parasites themselves are expected to produce unique compounds, distinct from those of both the insect and the other microbes present.

Our analysis would benefit significantly from more observations, which would help reduce the over-fitting we observed in some of the models. Repeating a similar experiment with different insect species would also add support to our hypothesis. Ultimately, this could be a cheap alternative for monitoring the spread of the parasite in the environment.