## Exercise on Machine Learning 101 - Part 1
---
Instructions are given in <span style="color:blue">blue</span> color.

In this exercise, we will look at clinical data from patients who suffered from some type of cardiovascular disease (e.g., a heart attack), which may or may not have caused the patient's death (heart failure).

Our objective is to train a classifier that predicts under which circumstances a cardiovascular disease is most likely to be fatal.

* <div style="color:blue">The folder <code>data/</code>, next to this exercise, contains the file <code>Heart_Failure.csv</code>. Your first task is to read the data into a <code>DataFrame</code>. Make sure to import any necessary libraries, too.</div>

In [None]:
# Libraries:


The following is needed for **reproducibility** (see [here](https://www.mikulskibartosz.name/how-to-set-the-global-random_state-in-scikit-learn/), but also [here](https://scikit-learn.org/stable/faq.html#how-do-i-set-a-random-state-for-an-entire-execution)):

In [None]:
import numpy as np  # needed here in case it was not imported in the cell above

np.random.seed(42)
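
Besides the global seed, many `scikit-learn` functions and estimators accept an explicit `random_state` argument. The following minimal sketch uses a made-up array (`data` is a hypothetical stand-in, not the exercise data) to illustrate that two calls with the same `random_state` produce the same split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for real samples.
data = np.arange(20)

# The same random_state yields the same shuffle on every call.
_, test_a = train_test_split(data, test_size=0.25, random_state=42)
_, test_b = train_test_split(data, test_size=0.25, random_state=42)

print(np.array_equal(test_a, test_b))
```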

In [3]:
# Your solution goes here:
import pandas as pd

# Read the CSV file from the data folder next to this notebook.
df = pd.read_csv('data/Heart_Failure.csv')


* <div style="color:blue">Familiarize yourself with the data. Print out the first 5 rows of the <code>DataFrame</code>.</div>

In [4]:
# Your solution goes here:
print(df.head())



    age  anaemia  creatinine_phosphokinase  diabetes  ejection_fraction  \
0  75.0        0                       582         0                 20   
1  55.0        0                      7861         0                 38   
2  65.0        0                       146         0                 20   
3  50.0        1                       111         0                 20   
4  65.0        1                       160         1                 20   

   high_blood_pressure  platelets  serum_creatinine  serum_sodium  sex  \
0                    1  265000.00               1.9           130    1   
1                    0  263358.03               1.1           136    1   
2                    0  162000.00               1.3           129    1   
3                    0  210000.00               1.9           137    1   
4                    0  327000.00               2.7           116    0   

   smoking  time  DEATH_EVENT  
0        0     4            1  
1        0     6            1  
2       

* <div style="color:blue">Find out how many samples the <code>DataFrame</code> contains.</div>

In [5]:
# Your solution goes here:
print(f"The DataFrame contains {df.shape[0]} samples.")


The DataFrame contains 299 samples.


The features and the label have the following meanings:

* **`age`** - patient's age at the time of heart failure
* **`anaemia`** - decrease of red blood cells or hemoglobin (boolean)
* **`creatinine_phosphokinase`** - level of the CPK enzyme in the blood (mcg/L)
* **`diabetes`** - if the patient has diabetes (boolean)
* **`ejection_fraction`** - percentage of blood leaving the heart at each contraction
* **`high_blood_pressure`** - if the patient has hypertension (boolean)
* **`platelets`** - platelets in the blood (kiloplatelets/mL)
* **`serum_creatinine`** - level of serum creatinine in the blood (mg/dL)
* **`serum_sodium`** - level of serum sodium in the blood (mEq/L)
* **`sex`** - patient's sex (binary: '0' - female or '1' - male)
* **`smoking`** - if the patient smokes (boolean)
* **`time`** - follow-up period (days)
* **`DEATH_EVENT`** - if the patient died during the follow-up period (boolean)

With regard to cleaning the data, there is not much left for you to do, except for removing unneeded columns.

* <div style="color:blue">Remove the <code>time</code> column from the <code>DataFrame</code>.</div>
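
As a hint, a column can be removed with `DataFrame.drop`. The sketch below shows the call on a small made-up frame; `toy` and its column names are purely illustrative and are not part of the exercise data.

```python
import pandas as pd

# Hypothetical toy data, not the heart-failure dataset.
toy = pd.DataFrame({
    "feature_a": [1, 2, 3],
    "feature_b": [0.1, 0.2, 0.3],
    "unneeded": [10, 20, 30],
})

# drop() returns a new DataFrame by default; the original stays unchanged.
toy_clean = toy.drop(columns=["unneeded"])

print(list(toy_clean.columns))
```

Assigning the result to a new name (or back to the same name) is usually easier to follow than modifying the frame in place.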

In [None]:
# Your solution goes here:


* <div style="color:blue">Split your data into a training set (75% of the original data) and a test set (25% of the original data). Make sure that the distribution of the label values (<code>DEATH_EVENT</code>) is the same in both subsets.</div>
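
One way to keep the label distribution equal across subsets is the `stratify` argument of `train_test_split`. The following is a minimal sketch on synthetic data; `X_toy`, `y_toy`, and the 30% share of positive labels are assumptions made only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 200 samples, 3 features, 30% positive labels.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([1] * 60 + [0] * 140)

# stratify=y_toy preserves the class proportions in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy,
    y_toy,
    test_size=0.25,
    stratify=y_toy,
    random_state=42,
)

# Share of positive labels in each subset.
print(y_train.mean(), y_test.mean())
```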

In [None]:
# Your solution goes here:


* <div style="color:blue">Follow the suggestions of <b><a href="https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html">this</a></b> cheat sheet to select an estimator that fits the problem and the data.</div>
* <div style="color:blue">Remember that we are trying to find out what combination of feature observations will result in a <code>DEATH_EVENT</code>.</div>
* <div style="color:blue">Create an instance of the respective model with the help of <code>scikit-learn</code>.</div>

**Note**: If your model selection process proposes an estimator you are not yet familiar with, remember the general **fit-predict paradigm** used by `scikit-learn`. Also, keep in mind that the documentation provides plenty of information about each classifier, including a list of parameters and their default values.
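
The fit-predict paradigm can be sketched as three steps: instantiate, fit, predict. The example below uses synthetic data and `KNeighborsClassifier` as an arbitrary placeholder; it is not meant as the estimator the cheat sheet leads to, and the slicing into 90 training and 30 test samples is likewise only an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical synthetic data: 120 samples, 5 features, binary labels.
X_toy, y_toy = make_classification(n_samples=120, n_features=5, random_state=42)

# 1. Instantiate the estimator (placeholder choice).
model = KNeighborsClassifier(n_neighbors=5)

# 2. Fit on the first 90 samples.
model.fit(X_toy[:90], y_toy[:90])

# 3. Predict labels for the remaining 30 samples.
predictions = model.predict(X_toy[90:])

print(predictions.shape)
```

Other `scikit-learn` classifiers expose the same `fit` and `predict` methods, so swapping the estimator normally only changes the line that creates `model`.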

In [None]:
# Your solution goes here:


* <div style="color:blue">Train the model, using your training data.</div>

In [None]:
# Your solution goes here:


* <div style="color:blue">Use your trained model to make predictions on the <b>test</b> set.</div>

In [None]:
# Your solution goes here:


* <div style="color:blue">Derive the accuracy metric from your predictions.</div>

In [None]:
# Your solution goes here:


* <div style="color:blue">Plot a confusion matrix that shows the number of predicted labels in comparison to the actual labels.</div>

In [None]:
# Your solution goes here:


* <div style="color:blue">Use the confusion matrix to fill out the following table:</div>

||**True Positive**|**True Negative**|**False Positive**|**False Negative**|
|-|:-:|:-:|:-:|:-:|
|Number of Predictions|?|?|?|?|
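
For binary labels, `confusion_matrix` returns a 2x2 array with true labels as rows and predicted labels as columns, so flattening it gives the four counts in the order TN, FP, FN, TP. A minimal sketch with made-up label lists (`y_true` and `y_pred` are illustrative, not results from the exercise):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = positive class, 0 = negative class.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 0, 0]

# ravel() flattens [[TN, FP], [FN, TP]] row by row.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(tp, tn, fp, fn)  # 2 5 1 2
```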

* <div style="color:blue">Generate a classification report for your model.</div>

**Note**: Depending on your classifier, your chosen parameters, or due to random effects, a warning message might pop up that says: `Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.` Please ignore this message for now and possibly make use of it when evaluating your model.
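
The warning arises when a class is never predicted: precision is TP / (TP + FP), and the denominator is zero in that case. The sketch below reproduces the situation with made-up labels and uses the `zero_division` parameter of `classification_report` to set the undefined values explicitly; the label lists are assumptions for illustration only.

```python
from sklearn.metrics import classification_report

# Hypothetical case: the model never predicts class 1.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0]

# zero_division=0 reports undefined precision as 0.0 without a warning.
report = classification_report(
    y_true, y_pred, zero_division=0, output_dict=True
)

print(report["1"]["precision"], report["1"]["recall"])
```

A precision and recall of zero for a class is a strong hint that the model ignores that class entirely, which matters for the evaluation later on.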

In [None]:
# Your solution goes here:


* <div style="color:blue">Use the derived metrics (accuracy, confusion matrix, precision, recall, f1-score) to critically evaluate your model's quality. Is it well suited to fulfill its initial purpose?</div>

**Note**: Before you start working on your answer, have a look at [this](https://ml-cheatsheet.readthedocs.io/en/latest/glossary.html#) website and search for the term **Null Accuracy**. Use the definition to support your discussion.
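
Null accuracy is the accuracy obtained by always predicting the most frequent class. A minimal sketch of the computation, using a made-up label array (`y_toy` is hypothetical and unrelated to the exercise data):

```python
import numpy as np

# Hypothetical labels: seven negatives, three positives.
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

# Count how often each class occurs.
values, counts = np.unique(y_toy, return_counts=True)

# The majority class and the accuracy of always predicting it.
majority_class = values[np.argmax(counts)]
null_accuracy = counts.max() / counts.sum()

print(majority_class, null_accuracy)  # 0 0.7
```

A classifier whose accuracy is close to this baseline has learned little beyond the class frequencies.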

*Your solution goes here:*

**Final Remark:** We are trying to give you a realistic impression of what to expect when tackling data science tasks. You might now either be put off by this particular case or be ready to plough on, perhaps after working through this week's second exercise, and try another classifier that seems more promising (you might also want to follow the cheat sheet for algorithm selection a little further).

In [None]:
# Ok, I'm ploughing on and I am giving this another try ... (maybe)