<table style="width: 100%">
    <tr style="background: #ffffff">
        <td style="padding-top:25px;width: 180px"><img src="https://mci.edu/templates/mci/images/logo.svg" alt="Logo"></td>
        <td style="width: 100%">
            <div style="text-align:right; width: 100%; text-align:right"><font style="font-size:38px"><b>Visualization and Databases</b></font></div>
            <div style="padding-top:0px; width: 100%; text-align:right"><font size="4"><b>Biodata Science</b></font></div>
        </td>
    </tr>
</table>


---

# 8 Task Description: Final Challenge

![](https://mycocosm.jgi.doe.gov/public/Tubma1/TuberMagnatum.JPG;jsessionid=CA54F53CF0C5C9B5BB4CD16A473A38E4)

In this experiment, we will focus on classifying truffle species. Using data from [Liquid Chromatography-Mass Spectrometry (LC-MS)](https://en.wikipedia.org/wiki/Liquid_chromatography–mass_spectrometry) and supervised learning, we aim to classify different truffle species. The data was collected and provided by researchers, including [Klemens Losso](https://www.sciencedirect.com/science/article/pii/S0308814622023809).

The LC-MS data is provided in an Excel file, where the ID (`TM`, `TI`, `TA`, `TU`) represents the truffle species:
- **TM** = [Tuber magnatum](https://en.wikipedia.org/wiki/Tuber_magnatum)  
- **TI** = [Tuber indicum](https://en.wikipedia.org/wiki/Tuber_indicum)  
- **TA** = [Tuber aestivum](https://en.wikipedia.org/wiki/Tuber_aestivum)  
- **TU** = [Tuber uncinatum](https://en.wikipedia.org/wiki/Tuber_uncinatum)  

In particular, **Tuber indicum** and **Tuber uncinatum** are very similar and hard to distinguish, yet their quality and price differ significantly. Because **Tuber indicum** is much cheaper and can mimic the aroma of **Tuber uncinatum** well enough to deceive even experts, it is often sold as **Tuber uncinatum**. LC-MS provides a fast and cost-effective way to differentiate truffle species. Your task is to develop a model to classify truffle species based on LC-MS data.

For each species, 30 individual truffles were analyzed, each in triplicate (sample code: `<species><truffle number>_<sample number>`, e.g., `TM01_02`). The remaining columns represent chromatogram time points as variables.
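
The sample code format described above can be split into its parts with a regular expression. The following is a minimal sketch on a few hand-written codes; the column names `species`, `truffle`, `replicate`, and `truffle_id` are illustrative assumptions and do not appear in the dataset.

```python
import pandas as pd

# Hand-written example codes following <species><truffle number>_<sample number>
codes = pd.Series(["TM01_01", "TM01_02", "TI12_03", "TU30_02"], name="Probencode")

# Named groups become the column names of the resulting DataFrame
pattern = r"^(?P<species>[A-Z]{2})(?P<truffle>\d+)_(?P<replicate>\d+)$"
parts = codes.str.extract(pattern)
parts["truffle"] = parts["truffle"].astype(int)
parts["replicate"] = parts["replicate"].astype(int)

# Identifier shared by all replicates of one truffle (hypothetical helper column)
parts["truffle_id"] = codes.str.split("_").str[0]

print(parts)
```

A per-truffle identifier such as `truffle_id` can be useful later, for example to keep all replicates of one truffle together when splitting the data.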

## Your Tasks:

### A1: (20 Points) Create Suitable Plots to Visualize the Data
- The plots should make the spectra of the four truffle species comparable.
- First, create a plot containing exactly one truffle from each species.
- Then, create a plot grouping truffles by species, displaying all species in a single plot.
- Research LC-MS and describe what is shown in the plot.
- Ensure the plots meet **Academic Walkthrough** standards; colors may be used for clarity.

---

### A2: (20 Points) Unsupervised Learning - Clustering
- Perform clustering on the dataset. To do so, familiarize yourself with a clustering package, e.g., [sklearn.cluster.KMeans](https://scikit-learn.org/1.5/modules/generated/sklearn.cluster.KMeans.html).
- Scaling, feature selection, or other preprocessing steps might be necessary.
- Decide on the number of clusters based on the data.
- Visualize the clusters in a plot.
- Provide an interpretation of the distances between truffle species.

---

### A3: (40 Points) Supervised Learning - Classification
- Choose a model to classify truffle species based on LC-MS data.
- Develop at least three models and compare them regarding their performance on training and test data:
  - Examples of variation:
    - Different Models: Decision Tree, Random Forest, Support Vector Machine, Nearest Neighbors, Neural Network.
    - Use different hyperparameters.
    - Experiment with different feature selection, transformations, or clustering results.
- Validate the model using a test set (20% of the data).
- Calculate appropriate error metrics, visualize results, and assess model quality.
- Address the application context and discuss the results:
  - How could the model be further improved?
  - Which misclassifications are critical in the use case, how often do they occur, and how could they be mitigated?
- Ensure no data outside the LC-MS dataset is used as input to the model [(see data leakage)](https://en.wikipedia.org/wiki/Leakage_(machine_learning)).

---

### A4: (20 Points) Notebook Presentation and Code Readability
- Add suitable text, images, and comments to make the notebook understandable.
- Include a summary of the results.
- Format the notebook to ensure clarity and organization.


# **Preparation**

---



In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv(r'https://raw.githubusercontent.com/jhumci/BLT_BDS/main/data/truffles/Modellbau_ALL.csv')

In [10]:
df

Unnamed: 0,Probencode,ID,0.004,0.005,0.007,0.009,0.01,0.012,0.014,0.015,...,11.984,11.985,11.987,11.989,11.99,11.992,11.994,11.995,11.997,11.999
0,TM01_01,TM,1049536,1050560,1011648,814016,823232,868288,643008,597952,...,699328,663488,705472,1393600,1213376,1011648,936896,613312,1006528,666560
1,TM01_02,TM,859136,825344,975872,644096,532480,834560,620544,830464,...,997376,761856,1077248,896000,816128,749568,892928,693248,624640,784384
2,TM01_03,TM,846912,737344,754752,573504,806976,723008,883776,692288,...,852032,530496,789568,610368,1196096,1047616,1169472,1360960,811072,763968
3,TM02_01,TM,901120,1007616,736256,936960,807936,877568,762880,1033216,...,856064,608256,953344,977920,695296,934912,1400832,587776,662528,756736
4,TM02_02,TM,807936,571392,709632,700416,1188864,950272,827392,893952,...,605184,1547264,863232,771072,919552,937984,693248,923648,1113088,878592
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
355,TU29_02,TU,705472,571328,285616,609216,474032,451504,414640,532416,...,350128,504752,376752,376752,339888,354224,394160,544704,451504,402352
356,TU29_03,TU,385968,373680,638912,358320,561088,440240,569280,578496,...,539584,453552,367536,389040,510896,524208,400304,398256,647104,451504
357,TU30_01,TU,486336,509888,390080,499648,673728,308160,427968,474048,...,523200,554944,392128,437184,531392,451520,419776,430016,560064,450496
358,TU30_02,TU,418832,732160,356368,447504,382992,575488,419856,460816,...,371728,298000,578048,872448,486416,899072,657408,341008,429072,763904


## **Task A 1**

---


In [11]:
# Build a new table containing the first analyzed truffle of each species
new_df = df[df['Probencode'].isin(["TM01_01", "TA01_01", "TI01_01", "TU01_01"])].copy()
new_df = new_df.set_index('Probencode')
# Drop the ID column so that only intensity values remain, which simplifies plotting later
del new_df['ID']

# Transpose so that each sample becomes a column and each time point a row
transposed_new_df = new_df.transpose()
transposed_new_df


Probencode,TM01_01,TA01_01,TI01_01,TU01_01
0.004,1049536,896000,471968,810880
0.005,1050560,458752,372640,498560
0.007,1011648,912384,549760,682880
0.009,814016,631808,499616,441216
0.01,823232,417792,336800,489344
...,...,...,...,...
11.992,1011648,898048,471968,794496
11.994,936896,624640,803712,791424
11.995,613312,815104,357280,407424
11.997,1006528,792576,488352,622464
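
The two plots requested in Task A1 could be sketched as follows. This is only an illustration under stated assumptions: `demo_df` is a small random stand-in with the same column layout as `df` (its values carry no chemical meaning), and the mean line with a standard-deviation band is just one possible way to group truffles by species. The axis labels are placeholders, since the unit of the time points is not stated in the data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Random stand-in with the same layout as df: Probencode, ID, time-point columns
rng = np.random.default_rng(0)
time_cols = [str(t) for t in np.round(np.linspace(0.004, 11.999, 150), 3)]
codes, ids = [], []
for species in ["TM", "TI", "TA", "TU"]:
    for truffle in range(1, 4):      # 3 truffles per species here, 30 in the real data
        for rep in range(1, 4):      # triplicate measurements
            codes.append(f"{species}{truffle:02d}_{rep:02d}")
            ids.append(species)
values = rng.integers(300_000, 1_200_000, size=(len(codes), len(time_cols)))
demo_df = pd.DataFrame(values, columns=time_cols)
demo_df.insert(0, "ID", ids)
demo_df.insert(0, "Probencode", codes)

# Column labels read from a CSV are strings, so convert them for a numeric x-axis
times = np.array(time_cols, dtype=float)

fig, (ax_single, ax_group) = plt.subplots(2, 1, figsize=(10, 7), sharex=True)

# Plot 1: exactly one truffle per species
wanted = ["TM01_01", "TI01_01", "TA01_01", "TU01_01"]
first = demo_df[demo_df["Probencode"].isin(wanted)].set_index("Probencode")[time_cols]
for code, intensities in first.iterrows():
    ax_single.plot(times, intensities.to_numpy(dtype=float), linewidth=0.8, label=code)
ax_single.set_title("One truffle per species")
ax_single.set_ylabel("Signal intensity (a.u.)")
ax_single.legend(title="Sample")

# Plot 2: all measurements grouped by species (mean line with a +/- 1 std band)
grouped = demo_df.groupby("ID")[time_cols]
mean, std = grouped.mean(), grouped.std()
for species in mean.index:
    m = mean.loc[species].to_numpy()
    s = std.loc[species].to_numpy()
    (line,) = ax_group.plot(times, m, linewidth=1.0, label=species)
    ax_group.fill_between(times, m - s, m + s, color=line.get_color(), alpha=0.2)
ax_group.set_title("Mean chromatogram per species")
ax_group.set_xlabel("Chromatogram time point (unit as in the data file)")
ax_group.set_ylabel("Signal intensity (a.u.)")
ax_group.legend(title="Species")
fig.tight_layout()
```

When applying this to the real table, note that the index of `transposed_new_df` consists of strings such as `"0.004"`, so converting it with `astype(float)` before plotting gives a properly spaced time axis.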


## **Task A 2**

---
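
One possible starting point for the clustering task is sketched below. It runs on synthetic data rather than the truffle measurements, and the specific choices (standard scaling, PCA with 10 components, k-means, and the silhouette score for picking the number of clusters) are assumptions made for illustration, not requirements of the task.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 4 groups x 30 samples x 50 features (not the truffle data)
rng = np.random.default_rng(0)
centers = rng.normal(0.0, 5.0, size=(4, 50))
X = np.vstack([c + rng.normal(0.0, 1.0, size=(30, 50)) for c in centers])

# Scale each feature, then reduce dimensionality before clustering
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=10, random_state=0).fit_transform(X_scaled)

# Compare candidate cluster counts by their silhouette score (higher is better)
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_pca)
    scores[k] = silhouette_score(X_pca, labels)
best_k = max(scores, key=scores.get)

# Refit with the chosen k and compute pairwise distances between cluster centers
kmeans = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X_pca)
diff = kmeans.cluster_centers_[:, None, :] - kmeans.cluster_centers_[None, :, :]
center_dist = np.linalg.norm(diff, axis=2)

print("silhouette scores:", {k: round(float(s), 3) for k, s in scores.items()})
print("chosen k:", best_k)
print(np.round(center_dist, 2))
```

For the visualization, a scatter plot of `X_pca[:, 0]` against `X_pca[:, 1]` colored by `kmeans.labels_` is a common choice. The matrix `center_dist` gives a rough basis for discussing which groups lie close together; on the real data, comparing cluster labels with the `ID` column shows whether clusters correspond to species.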




## **Task A 3**

---
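
The comparison of at least three models on training and test data could be organized along the following lines. This is a hedged sketch on synthetic data: the model choices and hyperparameters are arbitrary examples, and splitting by truffle with `GroupShuffleSplit` is an assumption on my part, motivated by the triplicate measurements, so that replicates of one truffle do not end up in both the training and the test set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GroupShuffleSplit
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 4 species x 30 truffles x 3 replicates, 40 features
rng = np.random.default_rng(0)
species = ["TM", "TI", "TA", "TU"]
X_parts, y, groups = [], [], []
for s in species:
    species_mean = rng.normal(0.0, 1.0, size=40)
    for truffle in range(1, 31):
        truffle_mean = species_mean + rng.normal(0.0, 0.5, size=40)
        X_parts.append(truffle_mean + rng.normal(0.0, 0.2, size=(3, 40)))
        y += [s] * 3
        groups += [f"{s}{truffle:02d}"] * 3
X, y, groups = np.vstack(X_parts), np.array(y), np.array(groups)

# Hold out 20% of the truffles; all replicates of a truffle stay on one side
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups))

models = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "svm_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
    "knn_5": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

results = {}
for name, model in models.items():
    model.fit(X[train_idx], y[train_idx])
    test_pred = model.predict(X[test_idx])
    results[name] = {
        "train_acc": accuracy_score(y[train_idx], model.predict(X[train_idx])),
        "test_acc": accuracy_score(y[test_idx], test_pred),
        # Rows are true species, columns are predicted species (order: species list)
        "confusion": confusion_matrix(y[test_idx], test_pred, labels=species),
    }
    print(name, round(results[name]["train_acc"], 3), round(results[name]["test_acc"], 3))
```

In each confusion matrix, the entry in row `TI` and column `TU` counts *Tuber indicum* samples predicted as *Tuber uncinatum*, which is the misclassification highlighted as critical in the task description. A large gap between `train_acc` and `test_acc` would point to overfitting.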