# Comparing the features
Here we import the top 25 features for each model, as ranked by the shap package. As a reminder, the average accuracies were:

| Model | Accuracy |
| :---   | ---      |
|Logistic Regression|93.5%|
|Random Forests|93.8%|
|Gradient Boosting Machine|91.8%|


The purpose here is to find the features that all three models agree on so that we can look more closely at the corresponding chemical data. Before we do that, we need to recover the retention times and m/z ratios saved during the cleaning and processing steps.

In [15]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [16]:
#import the top 25 features from each model
rf_25 = pd.read_csv('models/RF_top25.csv', index_col=[0])
lg_25 = pd.read_csv('models/lg_top25.csv', index_col=[0])
gbm_25 = pd.read_csv('models/GB_top25.csv', index_col=[0])


In [17]:
#rename logistic regression column
col_update = {'0':'col_name'}
lg_25 = lg_25.rename(columns=col_update)

In [24]:
#join dataframes
all_25 = pd.concat([lg_25, rf_25, gbm_25])

In [44]:
all_25.to_csv("all25.csv")
print(all_25.iloc[45:55])

    col_name  feature_importance_vals
665      996                 0.959584
668     1002                 0.955820
798     1359                 0.923948
660      988                 0.885916
747     1267                 0.863689
714     1196               259.927967
557      787               187.831556
135      187                85.797950
668     1002                70.792206
189      263                52.315632
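
The excerpt above spans the boundary between two of the concatenated tables, and the importance values jump from roughly 0.9 to roughly 260, which suggests the models report importances on different scales. A minimal sketch of one way to keep the tables comparable is to record which model each row came from and rank features within each model. The frame names `model_a` and `model_b` and the column `rank_in_model` are hypothetical; the values are copied from the excerpt.

```python
import pandas as pd

# Toy stand-ins for two models' top-feature tables (values copied from the excerpt above)
model_a = pd.DataFrame({"col_name": ["996", "1002", "1359"],
                        "feature_importance_vals": [0.959584, 0.955820, 0.923948]})
model_b = pd.DataFrame({"col_name": ["1196", "787", "187"],
                        "feature_importance_vals": [259.927967, 187.831556, 85.797950]})

# keys= adds an index level recording which table each row came from
tagged = pd.concat([model_a, model_b],
                   keys=["model_a", "model_b"],
                   names=["model", "row"])

# Rank within each model (1 = most important) so the differing scales no longer matter
tagged["rank_in_model"] = (
    tagged.groupby(level="model")["feature_importance_vals"]
    .rank(ascending=False)
    .astype(int)
)
print(tagged)
```

Ranks are only one option; dividing by each model's maximum importance would be another, under the same assumption that only the relative ordering within a model is meaningful.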


In [53]:
#look at the top 10 features by frequency across the three models
top10_feats = all_25['col_name'].value_counts().head(10)

#select only features that all three models agree on
#(filtering the counts Series directly avoids relying on the column names
#produced by reset_index, which differ between pandas versions)
top_feats = list(top10_feats[top10_feats == 3].index)
top_feats

['81', '787', '1196', '1200', '952', '959', '717', '1002', '187']
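
Counting occurrences and keeping those equal to 3 assumes each feature appears at most once per model. As a cross-check, the same question can be asked with a set intersection. This is a sketch only: `shared_features` is a hypothetical helper, and the three small frames are made-up stand-ins for the real 25-row tables.

```python
import pandas as pd

# Hypothetical miniature top-feature lists; the real tables hold 25 rows each
lg = pd.DataFrame({"col_name": ["81", "787", "1002", "5"]})
rf = pd.DataFrame({"col_name": ["787", "81", "996", "1002"]})
gbm = pd.DataFrame({"col_name": ["1196", "787", "81", "1002"]})

def shared_features(*frames, column="col_name"):
    """Return the features present in every frame, in the order of the first frame."""
    common = set(frames[0][column].astype(str))
    for frame in frames[1:]:
        common &= set(frame[column].astype(str))
    return [f for f in frames[0][column].astype(str) if f in common]

print(shared_features(lg, rf, gbm))  # ['81', '787', '1002']
```

Casting to `str` guards against one table storing feature labels as integers and another as strings, which would otherwise make identical features look different.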

In [29]:
peak_ids = pd.read_csv('processing/peaks_updated.csv',index_col=[0])

In [30]:
peak_ids

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1426,1427,1428,1429,1430,1431,1432,1433,1434,1435
name,12.0,14.0,16.0,19.0,22.0,23.0,25.0,26.0,28.0,31.0,...,18911.0,19007.0,19127.0,19207.0,19454.0,19466.0,19548.0,19869.0,19994.0,20163.0
m/z,415.210919,437.192936,457.276727,327.17755,553.255367,119.085962,629.371696,480.30775,522.354513,763.515364,...,124.967214,453.16701,443.171244,309.581772,616.175797,224.091635,725.615763,522.35436,417.025757,309.098409
retention time,9.200066,9.200345,10.471969,8.663874,12.873119,9.200257,16.715573,10.962874,10.996964,16.248348,...,0.314417,9.316367,9.338894,9.07034,9.046608,2.96543,12.324333,11.067387,4.857672,9.662856
componentindex,2.0,2.0,36.0,,2.0,,1.0,7.0,32.0,7.0,...,,,122.0,112.0,10.0,101.0,,32.0,,


In [54]:
#recall that we have been using the order of the metabolites as feature names up to this point,
#so we can select the columns we want directly to make a new dataframe
top9 = peak_ids[top_feats]

In [55]:
top9 = top9.T
top9.to_csv('top9_ML_features.csv')
top9

| feature index | name | m/z | retention time | componentindex |
| :---   | ---      | ---      | ---      | ---      |
|81|184.0|607.390045|16.717968|11.0|
|787|4724.0|524.297733|11.319835|7.0|
|1196|12224.0|105.070272|1.378392|NaN|
|1200|12240.0|462.289309|9.386694|NaN|
|952|6800.0|595.380713|11.286608|NaN|
|959|6828.0|559.365236|14.77637|7.0|
|717|4000.0|548.299609|12.870013|2.0|
|1002|7453.0|459.301364|15.942867|11.0|
|187|539.0|605.414337|17.242694|NaN|
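
To see which of these features fall in the same molecular network, they can be grouped by `componentindex`. The sketch below rebuilds only that column from the table above (blank entries become NaN); the variable name `networks` is hypothetical.

```python
import numpy as np
import pandas as pd

# componentindex values copied from the table above; blank cells become NaN
top9 = pd.DataFrame(
    {"componentindex": [11.0, 7.0, np.nan, np.nan, np.nan, 7.0, 2.0, 11.0, np.nan]},
    index=["81", "787", "1196", "1200", "952", "959", "717", "1002", "187"],
)

# Map each network to the features it contains, skipping features with no network
networks = {
    component: list(group.index)
    for component, group in top9.dropna(subset=["componentindex"]).groupby("componentindex")
}
print(networks)
```

With these values, features 787 and 959 share component 7, and features 81 and 1002 share component 11.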


# Summary
It looks like metabolite number 787 was the most agreed upon, followed by 208. They belong to the same molecular network (componentindex) that we calculated from [GNPS](https://gnps.ucsd.edu/ProteoSAFe/static/gnps-splash.jsp) and have almost identical retention times and mass-to-charge ratios (m/z). It's probably safe to say that these two metabolites are isomers, but we can check this more closely with the fragmentation data.
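
The "almost identical m/z and retention time" comparison can be made explicit with a small check. This is a sketch under stated assumptions: the helper names are hypothetical, and the tolerances (10 ppm for m/z, 0.2 retention-time units) are illustrative defaults rather than values taken from this analysis. Passing the check only flags a candidate pair; the fragmentation data is still needed to confirm anything.

```python
def ppm_difference(mz_a, mz_b):
    """Relative m/z difference in parts per million, measured against the mean m/z."""
    return abs(mz_a - mz_b) / ((mz_a + mz_b) / 2) * 1e6

def is_candidate_isomer_pair(mz_a, rt_a, mz_b, rt_b, ppm_tol=10.0, rt_tol=0.2):
    """True when both m/z and retention time agree within the (assumed) tolerances."""
    return ppm_difference(mz_a, mz_b) <= ppm_tol and abs(rt_a - rt_b) <= rt_tol

# Features 787 and 717 from the top-9 table: clearly different masses, so not a candidate pair
print(is_candidate_isomer_pair(524.297733, 11.319835, 548.299609, 12.870013))

# Made-up pair with near-identical values, for illustration only
print(is_candidate_isomer_pair(524.2977, 11.32, 524.2980, 11.35))
```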

Overall, this analysis suggests that it may be possible to predict whether an insect has been exposed to *Trypanosoma cruzi* based on the metabolites present in its gut. This is intuitive, since the parasites themselves are expected to produce unique compounds, distinct from those of both the insect and the other microbes present.

Our analysis would benefit significantly from more observations to reduce the over-fitting that we observed in some of the models. We could also run a similar experiment on different insect species, which would add support to our hypothesis. Ultimately, this could be a cheap way to monitor the spread of the parasite in the environment.