# Practical Task "Fundamentals of Machine Learning"

In this task, you are asked to analyze and classify a data set using the methods from the course. It is a larger task than the ones on the exercise sheets and covers topics from many different lectures (which ones, you need to find out yourself). While working on the task, you will learn a lot about how the different topics interact and how to apply the introduced methods to an actual machine learning problem.

Please note the following points:

* At the beginning of the notebook, explicitly set the random seed of all relevant libraries to a constant number. Otherwise, different runs of your script may yield different results, so that your written analysis no longer matches the program output after a re-run.

* You may use the methods we developed in the course in miniML.py. However, please keep in mind that these are not intended for production use, so we recommend using some or all of the following libraries:
    * NumPy
    * SciPy
    * matplotlib
    * scikit-learn (essentially a more mature version of miniML with similar interfaces)
    * pandas
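
The seeding requirement above could be met along the following lines. This is only a minimal sketch: the constant `SEED` and its value are arbitrary choices for illustration, and which libraries count as "relevant" depends on what your notebook actually imports.

```python
import random

import numpy as np

SEED = 42  # arbitrary constant (assumption); any fixed integer works

random.seed(SEED)     # Python's built-in random module
np.random.seed(SEED)  # NumPy's legacy global random state

# Alternative to the global state: a local generator that you pass around.
rng = np.random.default_rng(SEED)
```

scikit-learn estimators and splitters that involve randomness accept a `random_state` argument; passing the same constant there makes each of them reproducible independently of the global state.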

# The Data

* The data is provided in two NPY files, which you can load directly with NumPy.
    * The file "samples_ss2022.npy" contains the data matrix (each row corresponds to one sample, each column corresponds to one feature).
    * The file "labels_ss2022.npy" contains one label per sample, in the same order as the rows of the data matrix.
* Assume that you do not have any prior knowledge about the data, for example regarding the meaning of the different features.
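
Loading the two files might look like the sketch below. The helper `load_dataset` is a hypothetical name introduced here for illustration, and the code assumes that both files are in the current working directory and that the first axis of each array indexes the samples.

```python
from pathlib import Path

import numpy as np


def load_dataset(samples_path, labels_path):
    """Load the data matrix and the labels and check that they line up."""
    X = np.load(samples_path)
    y = np.load(labels_path)
    if X.shape[0] != y.shape[0]:
        raise ValueError(f"{X.shape[0]} samples but {y.shape[0]} labels")
    return X, y


# Assumption: both files sit in the current working directory.
if Path("samples_ss2022.npy").exists() and Path("labels_ss2022.npy").exists():
    X, y = load_dataset("samples_ss2022.npy", "labels_ss2022.npy")
    print(X.shape, y.shape, X.dtype, y.dtype)
```

Printing shapes and dtypes right after loading is a cheap sanity check before any analysis.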

# Task 1: Data Analysis (35%)

First, you are asked to analyze the data. You should at least look at the following aspects:

* 1.1 Labels: Number of classes and frequency of classes (use a histogram). Do you observe a pattern in the label distribution? What do these findings imply for the evaluation of a classifier?

* 1.2 Features: Not all features in the data set have similar characteristics. Characterize your features, e.g., what is the range of possible values? (Hint: plt.boxplot)
* 1.3 Scatter plot: Use scatter plots to illustrate the relation between selected pairs of features as well as their relation to the ground-truth labels.
* 1.4 Relation: Visualize and quantify the relation (correlation) between all pairs of features. (Hint: np.corrcoef and plt.imshow)
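
The numbers behind aspects 1.1, 1.2, and 1.4 can be computed with NumPy alone, as in the following sketch. It runs on synthetic data generated purely for illustration (not the course data set), and `plot_overview` is a hypothetical helper that is defined but not called here; it requires matplotlib.

```python
import numpy as np

# Synthetic stand-in for the real data (assumption, not the course data set).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * np.array([1.0, 10.0, 0.1, 5.0])
X[:, 3] = 0.5 * X[:, 1] + rng.normal(size=200)  # make two features correlated
y = rng.choice([0, 1, 2], size=200, p=[0.7, 0.2, 0.1])  # imbalanced labels

# 1.1 Class frequencies and a-priori probabilities.
classes, counts = np.unique(y, return_counts=True)
priors = counts / counts.sum()

# 1.2 Per-feature range and quartiles (the numbers behind a box plot).
feature_min = X.min(axis=0)
feature_max = X.max(axis=0)
quartiles = np.percentile(X, [25, 50, 75], axis=0)

# 1.4 Correlation of all feature pairs; rowvar=False treats columns as variables.
corr = np.corrcoef(X, rowvar=False)


def plot_overview(X, y, corr):
    """Draw label frequencies, box plots, and a correlation heat map."""
    import matplotlib.pyplot as plt

    classes, counts = np.unique(y, return_counts=True)
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    axes[0].bar(classes, counts)
    axes[0].set_title("Class frequencies")
    axes[1].boxplot(X)  # one box per column (feature)
    axes[1].set_title("Feature ranges")
    image = axes[2].imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
    axes[2].set_title("Feature correlation")
    fig.colorbar(image, ax=axes[2])
    return fig
```

Fixing the color range of the heat map to [-1, 1] keeps positive and negative correlations visually comparable.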

# Task 2: Classification (55%)

Second, you should systematically develop a classifier on the provided data and ground-truth labels. This involves training and evaluating classifiers in different configurations to compare their prediction performance. Consider at least the following aspects (Hint: You can use the code from earlier tutorials):


* 2.1 Classifiers: Pick and use a classifier. When choosing, think about the following criteria: Is it a linear classifier? Is it discriminative or generative? Is it parametric or non-parametric? Which classifiers should you not use?
* 2.2 Preprocessing: Explore different methods of data preprocessing (e.g., normalization, feature selection, feature transformation, etc.)
* 2.3 Evaluation: Employ a sensible evaluation scheme, such as cross-validation.
* 2.4 Metric: Use an appropriate evaluation metric (or multiple ones) and show confusion matrices.
* 2.5 Optimization: Perform a systematic optimization of (some) hyperparameters. If you optimize parameters automatically, make sure that parameters are only chosen on the training data. Examples for hyperparameters: maximum depth for a decision tree, $k$ and metric for kNN, shrinkage factor for LDA, number of trees in a random forest, number of selected features, etc.
* 2.6 Comparison: Summarize your results. How do your approaches compare to the classification chance level (a-priori probabilities)?
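
One possible way to combine aspects 2.2 to 2.6 with scikit-learn is nested cross-validation, sketched below. Everything specific in it is an assumption made for illustration: the synthetic data, kNN as the classifier, the parameter grid, the fold counts, and balanced accuracy as the metric. It is not meant as the expected solution.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

SEED = 0  # arbitrary constant (assumption)

# Synthetic, imbalanced stand-in for the real data (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=SEED,
)

# Scaling lives inside the pipeline, so it is fitted on training folds only.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])
param_grid = {
    "knn__n_neighbors": [1, 3, 5, 9],
    "knn__metric": ["euclidean", "manhattan"],
}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=SEED)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)

# Inner loop: hyperparameter search on the training part of each outer fold.
# Outer loop: predictions on data the search has never seen.
search = GridSearchCV(pipeline, param_grid, cv=inner_cv, scoring="balanced_accuracy")
y_pred = cross_val_predict(search, X, y, cv=outer_cv)

# Chance-level reference: always predict the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent")
y_base = cross_val_predict(baseline, X, y, cv=outer_cv)

print("kNN balanced accuracy:     ", balanced_accuracy_score(y, y_pred))
print("Baseline balanced accuracy:", balanced_accuracy_score(y, y_base))
print(confusion_matrix(y, y_pred))
```

Because the search object is itself the estimator handed to the outer loop, hyperparameters are re-selected in every outer fold using training data only, which is what aspect 2.5 asks for.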


The main goal of this analysis is not the achieved classification performance but the well-reasoned and well-structured use and interpretation of methods from the course. However, these two aspects are usually correlated, so do not simply ignore classification performance. We encourage you to exchange results with your fellow students to put them into context.

# Task 3: Model Inspection (10%)

Finally, you should explicitly inspect and interpret what a classifier has learnt on the training data.

* 3.1 Evaluation: For this task, implement a single split into training and test data (in contrast to, e.g., the cross-validation from task 2).
* 3.2 Classifier Selection: Pick a classifier which can be easily interpreted and train and evaluate it. Justify your choice. You can either pick the classifier from task 2 or pick a new one.
* 3.3 Visual Inspection: Choose a visualization to demonstrate what the classifier has learnt after training and on what basis it will classify new samples (e.g., you can show the decision boundaries of the classifier together with the training samples or illustrate a tree structure). For some types of visualizations, you might need to pick a small number of features to arrive at a feature space which you can display in 2D or 3D.
* 3.4 Interpretation: You can make connections to your findings from tasks 1 and 2 and make a statement about feature importance, classification performance, and ability to generalize.
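
As an illustration of the steps above, the following sketch trains a shallow decision tree on a single stratified split and prints its structure and feature importances. The synthetic data, the 70/30 split ratio, the tree depth, and the names in `feature_names` are assumptions chosen for the example; a decision tree is only one of several interpretable options.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

SEED = 0  # arbitrary constant (assumption)

# Synthetic stand-in for the real data (assumption).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=1,
    n_classes=2, random_state=SEED,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

# 3.1 One stratified hold-out split instead of cross-validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=SEED
)

# 3.2 A shallow tree stays small enough to read.
tree = DecisionTreeClassifier(max_depth=3, random_state=SEED)
tree.fit(X_train, y_train)

# 3.3 Text rendering of the learnt decision rules.
print(export_text(tree, feature_names=feature_names))

# 3.4 Inputs for the interpretation: importances and the train/test gap.
print(dict(zip(feature_names, tree.feature_importances_.round(3))))
print("train accuracy:", tree.score(X_train, y_train))
print("test accuracy: ", tree.score(X_test, y_test))
```

A large gap between training and test accuracy would hint at overfitting, which feeds directly into the statement about the ability to generalize.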