# Theory
This section gives a short description of the data fields.

## Total ion current
The total ion current (TIC) chromatogram represents the summed intensity across the entire range of masses being detected at every point in the analysis. The range is typically several hundred mass-to-charge units or more. In complex samples, the TIC chromatogram often provides limited information as multiple analytes elute simultaneously, obscuring individual species.[https://en.wikipedia.org/wiki/Mass_chromatogram]

![TIC](img/1280px-Total_ion_current_chromatogram.png)
An example of a TIC chromatogram from an LC-MS analysis.

## Base peak intensity 
The base peak chromatogram is similar to the TIC chromatogram, but it monitors only the most intense peak in each spectrum. In other words, it represents the intensity of the most intense peak at every point in the analysis. Base peak chromatograms often look cleaner, and are therefore more informative, than TIC chromatograms because the background is reduced by focusing on a single analyte at every point.[https://en.wikipedia.org/wiki/Mass_chromatogram]

![BPI](img/1280px-Base_peak_chromatogram.png)
An example of a BPI chromatogram from an LC-MS analysis.

## Base peak MZ
Base peak: The most intense (tallest) peak in a mass spectrum, due to the ion with the greatest relative abundance (relative intensity; height of peak along the spectrum's y-axis). Not to be confused with molecular ion: base peaks are not always molecular ions, and molecular ions are not always base peaks.

![ m/z = 91](img/base_peak01.jpg)
The electron impact ionization mass spectrum of PhCH2Cl, in which the base peak is a fragment ion having m/z = 91.

![ m/z = 16](img/base_peak02.jpg)
A second mass spectrum, in which the base peak has m/z = 16.[http://www.chem.ucla.edu/~harding/IGOC/B/base_peak.html]
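
The three quantities described so far (TIC, base peak intensity and base peak m/z) can all be derived from a single spectrum. The following is a minimal sketch on made-up toy spectra; the `scans` data and the helper `summarize_scan` are hypothetical and only illustrate the definitions above, not how the instrument software computes these values.

```python
import numpy as np

# Hypothetical toy data: each scan is a pair (m/z values, intensities).
scans = [
    (np.array([70.07, 116.07, 117.07]), np.array([120.0, 900.0, 50.0])),
    (np.array([70.07, 116.07, 132.10]), np.array([300.0, 250.0, 40.0])),
]


def summarize_scan(mz, intensity):
    """Return (TIC, base peak intensity, base peak m/z) for one spectrum."""
    base = int(np.argmax(intensity))  # index of the most intense peak
    tic = float(intensity.sum())      # summed intensity over all masses
    return tic, float(intensity[base]), float(mz[base])


# One summary per scan; plotted against time, the first two values would
# give the TIC and base peak chromatograms.
summaries = [summarize_scan(mz, intensity) for mz, intensity in scans]
for tic, bp_i, bp_mz in summaries:
    print(f"TIC={tic:.1f}  bpI={bp_i:.1f}  bpMZ={bp_mz:.2f}")
```

In this toy example the base peak changes from one scan to the next, while the TIC always includes every peak in the spectrum.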

## Ion
The amino acid ion, given by its one-letter abbreviation (e.g. P for proline).

![ Ion list ](img/List-of-amino-acids-abbreviations.png)

## n
The number of atoms of a given element in the ion (e.g. for 13C, the proline ion in the data contains 5 carbons).

## AUC
Area under the curve: the area under the Gaussian curve that is fitted to the data points of a peak. In the data this is the column 'I'.
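
As an illustration of what "area under a Gaussian curve" means, the sketch below compares the closed-form area with a numerical integration. It assumes the Gaussian is parameterised by an amplitude, a centre `mu` and a width `sigma`; how the fit in our data is actually parameterised is not stated here, so the function names and the parameter values are purely illustrative.

```python
import numpy as np


def gaussian(x, amplitude, mu, sigma):
    """Gaussian peak with height `amplitude`, centre `mu` and width `sigma`."""
    return amplitude * np.exp(-0.5 * ((x - mu) / sigma) ** 2)


def gaussian_area(amplitude, sigma):
    """Closed-form area under the Gaussian above: A * sigma * sqrt(2 * pi)."""
    return amplitude * sigma * np.sqrt(2.0 * np.pi)


# Illustrative parameters only (not taken from a real fit).
amplitude, mu, sigma = 1000.0, 116.0707, 0.0003

# Numerical check: trapezoidal integration over +/- 8 sigma around the centre.
x = np.linspace(mu - 8 * sigma, mu + 8 * sigma, 2001)
y = gaussian(x, amplitude, mu, sigma)
numeric_area = float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))
closed_form_area = float(gaussian_area(amplitude, sigma))

print(numeric_area, closed_form_area)
```

Under this parameterisation the area scales with both the height and the width of the peak, so a tall narrow peak and a low broad peak can have the same area.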


# About the data

## How the data is recorded

The number of scans depends on how long the instrument is recording. In each scan we look at all the amino acids and their isotopes. If we run the experiment for 200 scans, then in a perfect world we would have 200 data points for each experiment, which we could use to define a statistic for the measurement.
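
In practice not every peak is found in every scan, so it can be useful to count how many data points we actually have. The sketch below does this on a small made-up table; the `toy` frame is hypothetical and only borrows the column names 'seqNum', 'ion' and 'peak' from the data, and grouping by ion and peak is our assumption about what one "data point" refers to.

```python
import pandas as pd

# Hypothetical toy data: one row per detected peak.
toy = pd.DataFrame({
    "seqNum": [1, 1, 1, 2, 2, 3, 3, 3],
    "ion":    ["P", "P", "L", "P", "L", "P", "P", "L"],
    "peak":   ["0", "13C", "0", "0", "0", "0", "13C", "0"],
})

# Total number of scans in the run.
n_scans = toy["seqNum"].nunique()

# Number of scans in which each (ion, peak) combination was detected.
points = toy.groupby(["ion", "peak"])["seqNum"].nunique()

# Fraction of scans with a data point; 1.0 would be the "perfect world".
coverage = points / n_scans
print(coverage)
```

Here the 13C peak of P is missing in one of the three scans, so its coverage is below 1.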


# Loading libraries

In [4]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# Load data

In [5]:
# Read in data.
df = pd.read_csv("../data/MS/20200306_AminoAcid_Pro_1_fit.csv")


# Data description
This section describes the loaded data.


## Shape

In [17]:
# Data details
df.shape


(5931, 16)

The data has 5931 rows and 16 columns.

## Head

In [10]:
df.head()


| | seqNum | peak | I | mu | sigma | R2 | isoshift | theorshift | masserror | isoratio | rt | tic | bpI | bpMZ | ion | n |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 178901.448535 | 116.07073 | 0.000269 | 0.996828 | 0.0 | 0.0 | 0.0 | 1.0 | 0.742637 | 1726367104 | 683092160 | 116.070732 | P | NaN |
| 1 | 1 | 13C | 9755.975781 | 117.074117 | NaN | 0.996654 | 1.003386 | 1.003355 | 3.2e-05 | 0.054533 | 0.742637 | 1726367104 | 683092160 | 116.070732 | P | 5.0 |
| 2 | 1 | 15N | 418.000876 | 117.067945 | NaN | 0.978855 | 0.997215 | 0.997035 | 0.00018 | 0.002336 | 0.742637 | 1726367104 | 683092160 | 116.070732 | P | 1.0 |
| 3 | 1 | 2H | 163.643807 | 117.076463 | NaN | 0.519899 | 1.005733 | 1.006277 | -0.000544 | 0.000915 | 0.742637 | 1726367104 | 683092160 | 116.070732 | P | 10.0 |
| 4 | 1 | 18O | 569.345768 | 118.075025 | NaN | 0.99512 | 2.004295 | 2.004246 | 4.9e-05 | 0.003182 | 0.742637 | 1726367104 | 683092160 | 116.070732 | P | 2.0 |


## Columns

In [19]:
df.columns

Index(['seqNum', 'peak', 'I', 'mu', 'sigma', 'R2', 'isoshift', 'theorshift',
       'masserror', 'isoratio', 'rt', 'tic', 'bpI', 'bpMZ', 'ion', 'n'],
      dtype='object')

__'seqNum'__ - The scan number (the scan ID).

__'peak'__ - The isotope peak we are looking at. 0 means the monoisotopic peak; the other values (13C, 15N, 2H, 18O) name the heavy isotope.

__'I'__ - The area under the Gaussian curve that is fitted to the data points.

__'mu'__ - The m/z (mass-to-charge ratio) of the peak we are looking at.

__'sigma'__ - The sigma parameter of the Gaussian fit.

__'R2'__ - Like sigma, this describes the Gaussian fit: it measures how well the fitted Gaussian matches the actual data points.

__'isoshift'__ - The observed m/z shift of the detected peak relative to the monoisotopic peak.

__'theorshift'__ - The theoretical shift expected for that isotope.

__'masserror'__ - The difference between isoshift and theorshift (isoshift minus theorshift).

__'isoratio'__ - The ratio of I for this peak to I for the monoisotopic peak (peak 0). The monoisotopic peak therefore always has an isoratio of 1, because its area under the curve is divided by itself. The ratio covers all atoms of the element in the ion, so to get a value for an individual atom we need to divide it by the number of atoms of that element (n).

__'rt'__ - Retention time. 

__'tic'__ - Total ion current: the summed abundance of all ions in this specific scan. Both rt and tic depend only on seqNum: every row of scan 1 has the same retention time and the same TIC, regardless of which peak or ion the row describes.

__'bpI'__ - Base peak intensity: the intensity of the base peak.

__'bpMZ'__ - Base peak m/z: the mass-to-charge ratio of the base peak, i.e. the most intense peak in the spectrum.

__'ion'__ - The name of the amino acid, as a one-letter abbreviation.

__'n'__ - The number of atoms of the element in question in the specific ion. For instance, for proline (ion P) the 13C peak has n = 5 because the ion contains 5 carbons, as in the head output above. For hydrogen n is 10 and for nitrogen it is 1. The value also depends on whether we are looking at the MS data or the MS/MS data, because the ions have different structures and therefore different numbers of atoms of each element.
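
The relationships between the columns can be checked on the first five rows from the head output (proline, scan 1). This is a minimal sketch: the `rows` frame re-types a subset of those values by hand, the column `ratio_per_atom` is a name introduced here for illustration, and small differences from the stored values are expected because the printed numbers are rounded.

```python
import pandas as pd

# Subset of the columns of the first five rows shown in the head output.
rows = pd.DataFrame({
    "peak": ["0", "13C", "15N", "2H", "18O"],
    "I": [178901.448535, 9755.975781, 418.000876, 163.643807, 569.345768],
    "mu": [116.07073, 117.074117, 117.067945, 117.076463, 118.075025],
    "theorshift": [0.0, 1.003355, 0.997035, 1.006277, 2.004246],
    "n": [float("nan"), 5.0, 1.0, 10.0, 2.0],
})

# The monoisotopic peak (peak 0) is the reference for shifts and ratios.
mono = rows.loc[rows["peak"] == "0"].iloc[0]

rows["isoshift"] = rows["mu"] - mono["mu"]                  # observed shift
rows["masserror"] = rows["isoshift"] - rows["theorshift"]   # observed - theory
rows["isoratio"] = rows["I"] / mono["I"]                    # area ratio
rows["ratio_per_atom"] = rows["isoratio"] / rows["n"]       # hypothetical name

print(rows.round(6))
```

For the 13C peak the per-atom ratio comes out at about 0.0109, which is close to the natural 13C/12C ratio of roughly 1.1%.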


## Unique


In [13]:
df.nunique(axis=0)



seqNum         276
peak             5
I             5931
mu            5931
sigma         2244
R2            5931
isoshift      3688
theorshift       5
masserror     3688
isoratio      3688
rt             276
tic            276
bpI            276
bpMZ            16
ion             21
n               13
dtype: int64
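
The output shows 276 unique values each for seqNum, rt, tic and bpI, which is consistent with these columns being recorded once per scan. A quick way to check this is to count the distinct values within each scan. The sketch below uses a small made-up frame; the `toy` data is hypothetical (the scan 2 values are invented) and stands in for `df`.

```python
import pandas as pd

# Hypothetical toy data: several peaks per scan, scan-level columns repeated.
toy = pd.DataFrame({
    "seqNum": [1, 1, 1, 2, 2],
    "peak": ["0", "13C", "15N", "0", "13C"],
    "rt": [0.742637, 0.742637, 0.742637, 11.6, 11.6],
    "tic": [1726367104, 1726367104, 1726367104, 1500000000, 1500000000],
    "bpI": [683092160, 683092160, 683092160, 600000000, 600000000],
})

# Number of distinct values of each scan-level column within each scan.
per_scan = toy.groupby("seqNum")[["rt", "tic", "bpI"]].nunique()

# True if every scan has exactly one rt, one tic and one bpI value.
is_scan_level = bool((per_scan == 1).all().all())

print(per_scan)
print("rt, tic and bpI constant within each scan:", is_scan_level)
```

The same check could be run on the real data by replacing `toy` with `df`.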

## Description

In [16]:
df.describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))


| | seqNum | I | mu | sigma | R2 | isoshift | theorshift | masserror | isoratio | rt | tic | bpI | bpMZ | n |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5931.0 | 5931.0 | 5931.0 | 2244.0 | 5931.0 | 5931.0 | 5931.0 | 5931.0 | 5931.0 | 5931.0 | 5931.0 | 5931.0 | 5931.0 | 3687.0 |
| mean | 269.688248 | 6724.797085 | 127.283779 | 0.000298 | 0.914391 | 0.780622 | 0.780662 | -4e-05 | 0.600256 | 1466.38748 | 1431175010.325409 | 597139691.470578 | 116.070722 | 4.522105 |
| std | 157.954613 | 31098.961164 | 24.066124 | 9.1e-05 | 0.114675 | 0.699252 | 0.699272 | 0.000308 | 0.516794 | 860.279989 | 238985034.45137 | 99086345.884718 | 2.1e-05 | 3.583132 |
| min | 1.0 | 0.50061 | 76.030155 | 9.9e-05 | 0.500942 | 0.0 | 0.0 | -0.001859 | 0.000352 | 0.742637 | 649587328.0 | 263874688.0 | 116.070633 | 1.0 |
| 25% | 135.0 | 5.624252 | 117.068017 | 0.000242 | 0.884396 | 0.0 | 0.0 | -0.000107 | 0.146326 | 733.00779 | 1291939456.0 | 544504960.0 | 116.070709 | 2.0 |
| 50% | 263.0 | 10.21492 | 130.059728 | 0.000296 | 0.968738 | 0.997324 | 0.997035 | 0.0 | 0.584788 | 1429.995268 | 1442396544.0 | 594617024.0 | 116.070717 | 3.0 |
| 75% | 405.0 | 46.421612 | 148.056956 | 0.000354 | 0.991954 | 1.005873 | 1.006277 | 6.5e-05 | 1.0 | 2203.306418 | 1584722432.0 | 654001728.0 | 116.070732 | 6.0 |
| max | 551.0 | 231537.236707 | 183.09098 | 0.000719 | 0.999578 | 2.005458 | 2.004246 | 0.001333 | 11.310396 | 2998.64796 | 2056997120.0 | 886031488.0 | 116.070786 | 15.0 |
