<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-libraries" data-toc-modified-id="Import-libraries-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import libraries</a></span></li><li><span><a href="#Metadata" data-toc-modified-id="Metadata-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Metadata</a></span></li><li><span><a href="#Feature-correspondence" data-toc-modified-id="Feature-correspondence-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Feature correspondence</a></span></li><li><span><a href="#Get-features-that-have-MS2" data-toc-modified-id="Get-features-that-have-MS2-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Get features that have MS2</a></span></li><li><span><a href="#Merge" data-toc-modified-id="Merge-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Merge</a></span></li><li><span><a href="#Filter" data-toc-modified-id="Filter-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Filter</a></span><ul class="toc-item"><li><span><a href="#Exploring-chemical-classes..." data-toc-modified-id="Exploring-chemical-classes...-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Exploring chemical classes...</a></span></li></ul></li></ul></div>

Data preparation for multivariate analyses.

The input matrix is `featureCorrespondence.csv`. It needs some filtering first: I am retaining only the features that received an MS2 annotation, so some data wrangling is needed here.

# Import libraries

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
pwd

'/Volumes/NGG_TFAILY_LAB_1/URen_oak/2021/3_multiv_analyses'

# Metadata

Just adding a column with the scientific name wrapped in `italic('...')`, so species names can be shown in italics later.

In [3]:
metadata = pd.read_csv("../1_preprocessing/metadata.csv")
metadata['itSpeciesName'] = "italic('"+metadata['SpeciesName']+"')"
metadata.to_csv('metadata.csv', index=False)
metadata.head()

Unnamed: 0,SampleCode,DatePreparedforMetabolomics,SpeciesName,OakType,State,LeafType,LeafLife,FileName,itSpeciesName
0,QAU-1L,2019-10-10,Q. austrina,White,FL,Living,Deciduous,Tfaily_QAU-1L-M_15Nov19_Gimli_Zorbax-1190_neg....,italic('Q. austrina')
1,QGE-2L,2019-10-10,Q. geminata,White,FL,Living,Brevideciduous,Tfaily_QGE-2L-M_15Nov19_Gimli_Zorbax-1190_neg....,italic('Q. geminata')
2,QHE-16L,2019-10-10,Q. hemisphaerica,Red,FL,Living,Brevideciduous,Tfaily_QHE-16L-M_15Nov19_Gimli_Zorbax-1190_neg...,italic('Q. hemisphaerica')
3,QLE-17L,2019-10-10,Q. laevis,Red,FL,Living,Deciduous,Tfaily_QLE-17L-M_15Nov19_Gimli_Zorbax-1190_neg...,italic('Q. laevis')
4,QLA-3L,2019-10-10,Q. laurifolia,Red,FL,Living,Brevideciduous,Tfaily_QLA-3L-M_15Nov19_Gimli_Zorbax-1190_neg....,italic('Q. laurifolia')


# Feature correspondence

In [4]:
fcorr = pd.read_csv('../1_preprocessing/featureCorrespondence.csv')

# The feature IDs are in the unnamed first column: name it and use it as the index
fcorr.rename(columns={"Unnamed: 0":'Features'}, inplace=True)

fcorr.set_index('Features', inplace=True)

print(fcorr.shape)

fcorr.head()

(4683, 11)


Features,QAU-1L,QGE-2L,QHE-16L,QLE-17L,QLA-3L,QMI-5L,QNI-7L,QNI-8L,QVI-11L,QIN-27L,QIN-28L
FT0001,594.9366,3515096.0,169429.8673,330201.371648,24467.499469,0.0,727.65508,2603.742958,423296.909629,0.0,101516.9
FT0002,14007600.0,0.0,0.0,3436.279366,0.0,9121839.0,1289.048501,5741.175947,0.0,5361561.0,5582021.0
FT0003,43338.92,66360.2,1021.189074,110200.275672,40355.98696,0.0,0.0,0.0,17109.070227,59733.69,0.0
FT0004,7330.351,36814.96,14194.718842,14848.91819,30846.988332,33307.38,10513.109992,18135.355735,110271.279915,9003.466,12480.13
FT0005,60003.34,100214.1,350188.584769,403548.737315,5997.40253,11194.13,259328.108039,180676.105636,7601.37926,0.0,4694.614


# Get features that have MS2

In [5]:
ms2_annotation = pd.read_csv('../2_ms2_annotation/summary_ms2_annotation_energetics.csv')

ms2_annotation = ms2_annotation[['Features']]

ms2_annotation.head()

Unnamed: 0,Features
0,FT0158
1,FT0199
2,FT0216
3,FT0221
4,FT0227


# Merge

In [6]:
merged = ms2_annotation.merge(fcorr, on='Features', how='left')

print(merged.shape)

merged.head()

(921, 12)


Unnamed: 0,Features,QAU-1L,QGE-2L,QHE-16L,QLE-17L,QLA-3L,QMI-5L,QNI-7L,QNI-8L,QVI-11L,QIN-27L,QIN-28L
0,FT0158,392930.0,1918686.0,379764.5,710700.0,877269.6,492766.0,1500615.0,1394668.0,1502961.0,3232995.0,731361.2
1,FT0199,10402590.0,9950352.0,7427623.0,86896.0,16091910.0,5257846.0,12799320.0,7001698.0,0.0,7703962.0,5447417.0
2,FT0216,243768.7,229948.1,0.0,1152340.0,0.0,317418.9,2932.67,1516477.0,1682.982,4315103.0,193018.6
3,FT0221,39020.82,911067.0,137107.0,40935.93,327591.3,293441.9,135188.8,380203.1,250786.8,253555.3,278376.5
4,FT0227,73416.0,5215582.0,19706200.0,1637651.0,16561110.0,66278.89,8521567.0,6992220.0,11607980.0,34606.01,80129.61
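Because this is a left merge, an annotated feature that is absent from the intensity matrix would still appear in `merged`, with NaN intensities. A minimal sketch of how that could be checked, using toy tables with made-up values (`annotated` and `intensities` are hypothetical stand-ins for `ms2_annotation` and `fcorr`):

```python
import pandas as pd

# Toy stand-ins for the real tables (values are made up for illustration)
annotated = pd.DataFrame({"Features": ["FT0001", "FT0003", "FT0009"]})
intensities = pd.DataFrame(
    {"QAU-1L": [10.0, 0.0, 5.0], "QGE-2L": [0.0, 7.0, 2.0]},
    index=pd.Index(["FT0001", "FT0002", "FT0003"], name="Features"),
)

# indicator=True adds a `_merge` column recording where each row came from;
# validate="one_to_one" raises an error if a feature ID is duplicated on either side
checked = annotated.merge(
    intensities.reset_index(),
    on="Features",
    how="left",
    indicator=True,
    validate="one_to_one",
)

# Annotated features without an intensity row are flagged as 'left_only'
missing = checked.loc[checked["_merge"] == "left_only", "Features"].tolist()
print(missing)
```

In the toy data only `FT0009` lacks an intensity row, so it is the only feature reported as missing. An empty list on the real tables would mean every annotated feature was found in the matrix.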


# Filter 

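I am dropping features that are "rare", meaning present (non-zero) in fewer than 4 of the 11 samples (40% of the samples, rounded down), and features that don't have a lot of variability. The variance is computed per feature on the log10(x+1)-transformed intensities and truncated to an integer before the `> 1` test, so in practice a feature is kept only if its variance is at least 2.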

In [7]:
fcorr1 = merged.copy()

fcorr1 = fcorr1.set_index('Features')
print(fcorr1.shape)

fcorr1 = fcorr1.replace(0, np.nan)
nsamples = int(len(fcorr1.columns)*.4)
print(nsamples)

## Keep features with at least nsamples non-NA (i.e. non-zero) values
fcorr1 = fcorr1.dropna(axis=0, thresh = nsamples)
print(fcorr1.shape)

fcorr1 = fcorr1.replace(np.nan, 0)

## log transform
fcorr1 = np.log10(fcorr1+1)

## keep rows whose variance, truncated to an integer, is > 1 (i.e. variance >= 2);
## this also removes rows with variance == 0, which would cause issues in the PCA
fcorr1 = fcorr1[fcorr1.var(axis=1).astype(int)>1]

print(fcorr1.shape)

# SAVE!!
fcorr1.to_csv('featureCorrespondence_MS2.csv')

fcorr1.head()

(921, 11)
4
(921, 11)
(527, 11)


Features,QAU-1L,QGE-2L,QHE-16L,QLE-17L,QLA-3L,QMI-5L,QNI-7L,QNI-8L,QVI-11L,QIN-27L,QIN-28L
FT0199,7.017142,6.997838,6.87085,4.939005,7.206608,6.720808,7.107187,6.845203,0.0,6.886714,6.736191
FT0216,5.38698,5.361632,0.0,6.061581,0.0,5.501634,3.467411,6.180836,3.226337,6.634991,5.285601
FT0316,6.176266,7.09876,6.773129,6.88366,4.923794,0.0,6.865103,7.146187,6.56771,5.324416,1.537001
FT0317,6.527802,6.861784,6.977613,6.509012,6.791199,3.697731,3.306695,6.914295,6.650696,2.741705,4.862135
FT0328,3.408924,5.924112,6.529386,5.523251,6.692541,4.034415,4.912437,4.87226,6.623623,0.0,3.36542
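The filtering steps in this section can be sketched on a toy matrix to show what each one removes, and in particular what the `astype(int)` truncation does to the variance threshold. The data and row names below are made up for illustration; only the sequence of operations mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Toy intensity matrix: 4 features x 5 samples (values are made up)
toy = pd.DataFrame(
    {
        "S1": [0.0, 1e6, 1e5, 1e6],
        "S2": [0.0, 0.0, 1e5, 1e6],
        "S3": [5e4, 2e6, 2e5, 1e6],
        "S4": [0.0, 0.0, 1e5, 1e3],
        "S5": [0.0, 3e6, 1e5, 1e6],
    },
    index=pd.Index(["rare", "variable", "flat", "borderline"], name="Features"),
)

# Prevalence filter: treat zeros as missing and require at least
# 40% of the samples (rounded down) to be non-missing
present = toy.replace(0, np.nan)
min_present = int(present.shape[1] * 0.4)  # 5 samples -> 2
kept = present.dropna(axis=0, thresh=min_present).fillna(0)

# Log transform, then per-feature variance across samples
logged = np.log10(kept + 1)
row_var = logged.var(axis=1)

# Same test as in the cell above: truncating first makes `> 1` mean "variance >= 2"
truncated = logged[row_var.astype(int) > 1]
# For comparison: testing the untruncated variance against 1
plain = logged[row_var > 1]

print(list(truncated.index))  # ['variable']
print(list(plain.index))      # ['variable', 'borderline']
```

Here `rare` is removed by the prevalence filter and `flat` by either variance test. `borderline` has a variance of about 1.8, so it passes `row_var > 1` but not the truncated version.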
