# Data Science Project 1 - Improve the mass accuracy of spectra measured by Orbitrap mass spectrometers (orbitrap)

**Client:** Atmospheric Physical Chemistry group, INAR, University of Helsinki

**Description:** Motivation: The mass measured by a mass spectrometer usually deviates from the true mass, so mass calibration is an important step before assigning chemical formulae to the measured masses. A good mass calibration can greatly reduce the effort of further analysis and increase the reliability of the results. Goals: Improve the mass calibration procedure for Orbitrap raw data, and possibly for data measured by other mass spectrometers, e.g., TOF-MS. Main tasks: 1) Test several fitting functions for mass correction and recommend the one or few that work best. 2) Test how different calibration parameters, e.g., the number of species (masses) used for calibration, affect the mass correction.
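
To make the first task concrete, here is a minimal sketch of one candidate fitting function: a polynomial that models the relative mass error (in ppm) as a function of measured m/z. It is an illustration under stated assumptions, not the client's or Orbitool's method. The function names, the polynomial form, and the calibrant masses and shift used below are all hypothetical and synthetic.

```python
import numpy as np


def ppm_error(measured_mz, true_mz):
    """Relative mass error in parts per million."""
    measured_mz = np.asarray(measured_mz, dtype=float)
    true_mz = np.asarray(true_mz, dtype=float)
    return (measured_mz - true_mz) / true_mz * 1e6


def fit_calibration(measured_mz, true_mz, degree=1):
    """Fit a polynomial that predicts the ppm error from the measured m/z."""
    measured_mz = np.asarray(measured_mz, dtype=float)
    coeffs = np.polyfit(measured_mz, ppm_error(measured_mz, true_mz), degree)
    return np.poly1d(coeffs)


def apply_calibration(model, measured_mz):
    """Correct measured m/z values using the fitted ppm-error model."""
    measured_mz = np.asarray(measured_mz, dtype=float)
    # measured = true * (1 + ppm * 1e-6)  =>  true = measured / (1 + ppm * 1e-6)
    return measured_mz / (1.0 + model(measured_mz) * 1e-6)


# Synthetic calibrants (assumption): true masses with an m/z-dependent shift
true_mz = np.linspace(50.0, 500.0, 10)
simulated_ppm = 2.0 + 0.004 * true_mz
measured_mz = true_mz * (1.0 + simulated_ppm * 1e-6)

model = fit_calibration(measured_mz, true_mz, degree=1)
corrected_mz = apply_calibration(model, measured_mz)

raw_ppm = ppm_error(measured_mz, true_mz)
residual_ppm = ppm_error(corrected_mz, true_mz)
print("max |ppm| before:", np.abs(raw_ppm).max())
print("max |ppm| after: ", np.abs(residual_ppm).max())
```

Changing `degree`, or fitting on only a subset of the calibrants and evaluating on the rest, gives a simple way to compare fitting functions and the number of calibration species as described in the main tasks.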

**Data and tools:** Data: raw spectrum data measured by an Orbitrap mass spectrometer. Tools: a) Orbitool, provided by the client, which is used to read the raw data and remove the noise, i.e., to prepare the data for this analysis; b) any programming language, used to investigate the mass calibration problem.

## Notebook practices

* Clear outputs before committing to git!

## Environment

The environment should have the dependencies to run Orbitool and this notebook.

```bash
# Create the environment from file
conda env create -f environment.yml

# Register the kernel with Jupyter
ipython kernel install --user --name=orbitool

# Export the environment to file
conda env export --no-builds > environment.yml
```

## Dependencies

In [None]:
import pandas as pd
from os import listdir
from os.path import isfile, join
import seaborn as sns
import matplotlib.pyplot as plt
import random

## Data exploration

### Read the spectrum data

In [None]:
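spectrum_data_directory = "test"
# Build the path with join() so it works on any OS and avoids a backslash escape in the string
peak_list_file = join("peak list", "peaklist_1e5_background.csv")
random_sample_size = 10 # Number of spectra to sample to speed things up. If None, all spectra are used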

spectrum_data_files = []
time_data = []
spectrum_data = []

# Get data files
for file in [f for f in listdir(spectrum_data_directory) if isfile(join(spectrum_data_directory, f))]:
    path_to_file = join(spectrum_data_directory, file)
    #print(path_to_file)
    if file.endswith(".csv"):
        spectrum_data_files.append(path_to_file)

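# Take a random sample of spectra to speed things up
if random_sample_size:
    # Use the configured sample size, capped at the number of available files
    spectrum_data_files = random.sample(spectrum_data_files, min(random_sample_size, len(spectrum_data_files)))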

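# Read the spectrum files: the first rows go to time_data, the spectrum table starts after 3 skipped rows
for file in spectrum_data_files:
    time_data.append(pd.read_csv(file, nrows=2))  # read only the leading rows instead of the whole file
    spectrum_data.append(pd.read_csv(file, skiprows=3))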

peak_list = pd.read_csv(peak_list_file)

# Preview the first spectrum
spectrum_data[0].head(5)

### Plot spectra

In [None]:
%matplotlib notebook
sns.set_theme(style="whitegrid")
df = pd.concat(spectrum_data, keys=range(len(spectrum_data_files)), names=["spectrum"])
fig, ax = plt.subplots(figsize=(8,6))
sns.lineplot(data=df, ax=ax, x="mz", y="intensity", hue="spectrum", palette="tab10")
eps = 0.0001
for index, row in peak_list.iterrows():
    mz = row["mz"]
    formula = row["formula"]
    max_intensity = max([x for x in df.loc[(df["mz"] > mz - eps) & (df["mz"] < mz + eps)].intensity.values] + [0.0])
    plt.plot((mz,mz), (0.0,max_intensity), linestyle="--")