# Exercise 1

Use your newly acquired Python knowledge to make your life a bit easier. Develop a small script that calculates the mass (m) needed for a 1 litre batch of 50 mM ammonium hydrogen carbonate. Print the mass at the end.

Molar mass (M): 79.056 g/mol   
Concentration (c): 0.05 mol/l   
Volume (V): 1 l   

For this you need the following formulas:
```
c = n / V
n = m / M
n = c * V
m = n * M
```
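
Substituting the third line into the fourth gives a single expression for the mass. This is only a restatement of the formulas above, with a unit check added:

```latex
m = n \cdot M = c \cdot V \cdot M
\qquad
\frac{\text{mol}}{\text{l}} \cdot \text{l} \cdot \frac{\text{g}}{\text{mol}} = \text{g}
```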

In [None]:
## Your code goes here

# It is tempting to use the same variable names as in the exercise description, but
# 1. they are not very descriptive, V could also be voltage or velocity
# 2. Solely uppercase variable names are usually reserved for constants (e.g. PI or G)

volume: float = 1.0
concentration: float = 0.05
g_mol: float = 79.056

mol: float = concentration * volume # mol == n from the exercise description (mol/l * l == mol)
mass: float = g_mol * mol

print(mass, "g")


# Exercise 2

Make your calculation reusable and more generic by wrapping it up in a function which returns the calculated mass.   
The argument list should be volume, concentration, molar mass.   
Validate if your function works correctly with the data from exercise 1.


In [None]:
# Your code goes here


def calc_m(volume: float, concentration: float, g_mol: float) -> float:
    mol = concentration * volume
    return g_mol * mol

print(calc_m(1.0, 0.05, 79.056), "g")
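
One way to make the validation explicit is to compare the returned value with the mass worked out by hand for exercise 1, instead of only printing it. The sketch below is an assumption about how such a check could look: `calc_mass` is an illustrative re-definition so that the block runs on its own, and `math.isclose` is used because floating-point results should not be compared with `==`.

```python
import math


def calc_mass(volume: float, concentration: float, molar_mass: float) -> float:
    """Return the mass in g for a volume in l, a concentration in mol/l and a molar mass in g/mol."""
    amount = concentration * volume  # amount of substance in mol
    return molar_mass * amount


# Expected value worked out by hand: 0.05 mol/l * 1 l * 79.056 g/mol
expected_mass = 3.9528
calculated_mass = calc_mass(1.0, 0.05, 79.056)

# Fails loudly if the function deviates from the hand calculation
assert math.isclose(calculated_mass, expected_mass), calculated_mass
print("validated:", calculated_mass, "g")
```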


# Exercise 3

In a proteomic search engine we need to find the m/z (mass-to-charge) value nearest to a given target thousands of times. Write a function that accepts a mass spectrum (a tuple of two float lists: first the m/z values, second the intensities) and a target value to search for. The function should return a tuple with the nearest m/z value, the corresponding intensity, the index within the list, and the difference to the searched value.

The mass spectrum is located in the file `mass_spec.json`.


Hint: You need the built-in module `json` and the function `open()` to parse the file into a dictionary. Also check out the functions `enumerate()` and `abs()`.


In [None]:
# std imports
import json
from typing import Tuple, List

# Define the mass spec tuple
mass_spec: Tuple[List[float], List[float]] = ([], [])

# `with` statement to open the file
# it is automatically closed at the end of the `with`-block
with open("mass_spec.json") as json_file:
    mass_json = json.load(json_file)
    mass_spec = (mass_json["mz"], mass_json["intensities"])


def find_nearest_peak(mass_spec: Tuple[List[float], List[float]], target_mz: float) -> Tuple[float, float, int, float]:
    # Set the min index to -1 for now
    min_idx = -1
    # Set the min difference to infinity
    min_diff = float("inf")
    # Iterate over the mz values in the mass spec
    for mz_index, mz_value in enumerate(mass_spec[0]):
        # Calculate the difference between the target mz and the current mz
        diff = abs(target_mz - mz_value)
        # If the difference is less than the min difference
        if diff < min_diff:
            # Set the min index to the current index
            min_idx = mz_index
            # Set the min difference to the current difference
            min_diff = diff
    # Return the requested tuple
    return (mass_spec[0][min_idx], mass_spec[1][min_idx], min_idx, min_diff)


print(find_nearest_peak(mass_spec, 638.5))
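
Because the lookup is repeated thousands of times, a binary search may be worth considering in place of the linear scan. The sketch below assumes the m/z list is sorted in ascending order, which the exercise does not state, so check your data before using it. The function name `find_nearest_peak_sorted` and the small `demo_spec` spectrum are made up for illustration.

```python
from bisect import bisect_left
from typing import List, Tuple


def find_nearest_peak_sorted(
    mass_spec: Tuple[List[float], List[float]], target_mz: float
) -> Tuple[float, float, int, float]:
    """Nearest-peak lookup via binary search; requires ascending m/z values."""
    mz_values, intensities = mass_spec
    if not mz_values:
        raise ValueError("The mass spectrum is empty")
    # Position where target_mz would be inserted to keep the list sorted
    pos = bisect_left(mz_values, target_mz)
    # The nearest value is either directly left or directly right of that position
    candidates = [i for i in (pos - 1, pos) if 0 <= i < len(mz_values)]
    best = min(candidates, key=lambda i: abs(mz_values[i] - target_mz))
    return (mz_values[best], intensities[best], best, abs(mz_values[best] - target_mz))


# Hypothetical, already sorted spectrum used only for this demonstration
demo_spec = ([100.0, 200.5, 638.2, 639.1, 900.0], [10.0, 20.0, 30.0, 40.0, 50.0])
nearest = find_nearest_peak_sorted(demo_spec, 638.5)
print(nearest)
```

Each lookup then needs roughly log2(n) comparisons rather than n. If two peaks are equally close, the lower index is returned, which matches the behaviour of the strict `<` comparison in the linear version.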


## Exercise 4: Data Analysis

Let's delve into data analysis in Python with pandas. You received a data set from your collaboration partners who work on hepatocellular carcinoma (HCC) and measured the proteomes of 19 healthy patients (C) and 19 patients who have HCC.

Load the dataset into pandas. What format is the file? 


In [None]:
import pandas as pd
df = pd.read_csv("./HCC_19_vs_19_miss.csv", delimiter=";")

### a) Take a look at the data
 
Get familiar with the dataset. What are the columns, what are the rows? What type are the values? Are there missing values?

In [None]:
# look at first 5 rows
print(df.head())
# print all columns
print(df.columns)
# get the type of values and non-nulls
print(df.info())
# see how many missing values there are per column
print(df.isnull().sum())
# see the shape of the dataframe
print(df.shape)

### b) Drop proteins with more than 20% missing values

We only want to keep proteins that were measured in at least 80% of the patients, to avoid a bias from too many missing values, since we don't know why they are missing. Note that it matters whether values are missing completely at random (e.g. due to machine errors), because of non-detectability, or because the patient did not produce that protein. For simplicity, we just apply the 20% threshold.

In [None]:
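# Proteins are columns (axis=1). `thresh` is the minimum number of NON-missing values
# a column needs in order to be kept, so 80% of the patients must have a value.
df_dropped = df.dropna(axis=1, thresh=int(0.8*df.shape[0]))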
print(df_dropped.shape)
print(df_dropped.head())
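
The `thresh` argument of `dropna` is easy to misread: it counts the values that are present, not the ones that are missing. The toy frame below is invented for illustration (5 patients, 3 hypothetical proteins `P1` to `P3`) and sketches how the 80% rule plays out.

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows are patients, columns are proteins
toy = pd.DataFrame({
    "P1": [1.0, 2.0, 3.0, 4.0, 5.0],           # 0% missing
    "P2": [1.0, np.nan, 3.0, 4.0, 5.0],        # 20% missing
    "P3": [np.nan, np.nan, 3.0, np.nan, 5.0],  # 60% missing
})

# A column is kept only if it has at least this many non-missing values
min_present = int(0.8 * toy.shape[0])
kept = toy.dropna(axis=1, thresh=min_present)
print(list(kept.columns))
```

Here `P3` is removed because only 2 of its 5 values are present, while `P2` sits exactly at the limit and is kept.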

### c) Impute the rest of the missing values
 
We want to impute the missing values for further analysis such as machine learning, since many of those algorithms cannot deal with missing data. There are advanced imputation methods, especially if the cause of the missingness is known, such as multiple imputation by chained equations (MICE), k-nearest neighbors (KNN), or random forest imputation. Sometimes, using the mean or median is absolutely sufficient. Impute the remaining missing values with the median!

In [None]:
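# The group labels are not numeric, so remove them before computing medians
df_dropped = df_dropped.drop("Patients", axis=1)
# Fill each remaining gap with the median of its own column
df_imputed = df_dropped.fillna(df_dropped.median())
print(df_imputed)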
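
To see what median imputation does to a single column, here is a minimal sketch on invented numbers (the column names `P1` and `P2` are hypothetical). `DataFrame.median()` skips missing values by default, and `fillna` then aligns the resulting per-column medians by column name.

```python
import numpy as np
import pandas as pd

# Hypothetical intensities with one gap in column P2
toy = pd.DataFrame({
    "P1": [1.0, 2.0, 3.0, 4.0],
    "P2": [10.0, np.nan, 30.0, 50.0],
})

# The median of P2 is taken over the three observed values (10, 30, 50)
toy_imputed = toy.fillna(toy.median())
print(toy_imputed)
```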


### d) Correlation

Strong correlation between features can interfere with further analysis (e.g. machine learning), but it can also help us understand interactions within our data or give us hints for possible biomarkers. Look at the correlation within our dataset and compare the different correlation methods!

In [None]:
Corr = df_imputed.corr(method="kendall")
Corr.style.background_gradient(cmap="coolwarm")


#C1 = Corr.abs().unstack()
#c1_sorted = C1.sort_values(ascending=True)


#print(c1_sorted)
#columns_above_80 = [(col1, col2)for col1, col2 in c1_sorted.index if c1_sorted[col1, col2]> 0.8 and col1!=col2] 
#print(columns_above_80)
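
The difference between the correlation methods is easiest to see on a relationship that is monotonic but not linear. The sketch below uses invented values (`y` is simply `x` cubed) and compares Pearson, which measures linear association, with Spearman, which works on ranks. Kendall's tau, used in the cell above, is also rank-based.

```python
import pandas as pd

# Hypothetical data: y grows with x, but not along a straight line
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
toy["y"] = toy["x"] ** 3

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

# Spearman reaches 1 because the ranks agree perfectly; Pearson stays below 1
print("pearson:", round(pearson, 3))
print("spearman:", round(spearman, 3))
```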

### e) Pandas Plots

Check out the other built-in visualization methods pandas has to offer!
Choose one protein and create a plot sorted by group (Control and Cancer):
1) histogram 
2) boxplot
3) pie plot

Hint: use df.hist() for the histogram, and df.plot.box() and df.plot.pie() for the other two. There are slight differences between the df.plot methods and the standalone ones such as df.hist() and df.boxplot()!

Advanced: do the same for multiple or all proteins!


1. Histogram

Hint: you will need a special argument for the function call to group

In [None]:
df.hist("Q14657", by="Patients", alpha=0.5)

2. Boxplot

Hint: you will need a special argument for the function call to group

In [None]:
df.boxplot(column=["Q14657"], by="Patients")

3. Pie Plot

Hint: You will need to group manually first, sum up all intensities of the chosen protein per group, and then pass the result to the pandas pie plot


In [None]:
df_pie = df.drop("Unnamed: 0", axis=1)
#df_pie = df_pie.set_index("Patients")

df_pie = df_pie.groupby(["Patients"]).sum()
df_pie.plot.pie(y="Q14657", autopct="%1.0f%%", title="Protein Q14657 in Control and Cancer", colors=["green", "purple"])

### f) Plotly Plots

Create the same plots that you've created with pandas in Plotly!

1. Correlation
2. Histogram
3. Boxplot
4. Pie Plot

In [None]:
# import the library
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "iframe" # Needed for JupyterLab on Galaxy. If you are using Visual Studio Code, remove this line

1. Correlation

Hint: You will need the correlation matrix from step d)

In [None]:
fig = px.imshow(Corr)
fig.show()

2. Histogram

Hint: you will need to pass an argument for the function call for grouping

In [None]:
fig = px.histogram(df, x="Q14657", color = "Patients")
fig.show()

3. Boxplot

Hint: you will need to pass an argument for the function call for grouping

In [None]:
fig = px.box(df, x="Patients", y="Q14657")
fig.show()

4. Pie Plot

Hint: you will need to pass an argument for the function call for grouping

In [None]:
fig = px.pie(df, values="Q14657", names="Patients", title="Control vs Cancer")
fig.show()

### g) Sweetviz

Sweetviz is one of a few libraries allowing for a broad inspection of the data without having to visualize anything manually. Try to use it on the original dataframe, whilst comparing the two groups!


In [None]:
import sweetviz

my_report = sweetviz.compare_intra(df, df["Patients"]=="C", ["Control", "Cancer"])

my_report.show_notebook(layout="widescreen")

my_report.show_html(filepath="./viz.html", open_browser=False, layout="widescreen")