# Open Questions

**Not all questions are required for the exam!**

# Preamble

In [1]:
# Common imports
import numpy as np # numpy is THE toolbox for scientific computing with python
import pandas as pd # pandas provides THE data structure and data analysis tools for data scientists 

# maximum number of rows and columns to display
pd.set_option("display.max_rows", 101)
pd.set_option("display.max_columns", 101)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

from warnings import filterwarnings
filterwarnings('ignore')

# Introduction to Machine Learning

Question 1: Describe the difference between supervised, semi-supervised, and unsupervised learning.

Answer 1: Supervised learning algorithms learn from a training set of labeled examples. The aim is to generalize the result to the set of possible future inputs. Examples are: logistic regression, support vector machines, decision trees, random forests.

Unsupervised learning algorithms learn from a training set of unlabeled examples. They are often used to explore data according to some similarity estimation. (answer from Igual & Segui)

In Semi-supervised learning...

Question 2: How does reinforcement learning work?

Question 3: Mention five typical performance metrics for classification tasks.

Answer 3: Accuracy score, F1 score, AUC, ...

Question 4: Explain the confusion matrix.

Question 5: Why is it in many cases better to use the MSE instead of the MAE?

Answer: The MSE
\begin{equation}
MSE = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2
\end{equation}
penalizes large deviations more strongly than the MAE
\begin{equation}
MAE = \frac{1}{N} \sum_{i=1}^{N} | y_i - \hat{y}_i |
\end{equation}
because each error is squared before averaging. It is therefore better suited for applications where large deviations from the true outcome should be avoided.
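The effect can be illustrated with a small numerical sketch. The data and the helper names `mse` and `mae` are made up for illustration: two sets of predictions have the same total absolute error, but in one of them the error is concentrated in a single large miss.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical targets and two hypothetical sets of predictions.
y_true = [10.0, 10.0, 10.0, 10.0]
many_small = [12.0, 8.0, 12.0, 8.0]    # four errors of size 2
one_large = [10.0, 10.0, 10.0, 18.0]   # a single error of size 8

print("MAE:", mae(y_true, many_small), mae(y_true, one_large))
print("MSE:", mse(y_true, many_small), mse(y_true, one_large))
```

Both prediction sets have the same MAE, while the MSE of the set with the single large miss is four times larger.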

Question 6: Look at the following picture of a linear regression error with regularization parameter C.
        
<div>
<img src="train_test_error.png" width="600"/>
</div> 

Which line most likely refers to the train error, the blue or red one? Explain your choice. 

Question: Name five different machine learning models suitable to tackle regression tasks. 

Question: Give an overview on traditional clustering algorithms.

Question: Describe the typical phases of a machine learning project.

Question: How would you design a machine learning project that should forecast the short-term revenue of an online business?

Question: Discuss main challenges when creating a machine learning model. 

# Clustering Algorithms

Question: How would you define clustering? Can you name a few clustering algorithms?

Question: What are some of the main applications of clustering algorithms?

Question: Explain the hyperparameters used in DBSCAN. 

Question: Explain how the k-means clustering objective function is optimized.

Question: What is the basic idea of a Gaussian mixture model? Explain how the standard deviation and covariance matrix are connected with each other. What tasks can you use a Gaussian mixture model for?

Answer: We compute $p(z_i = k \mid x_i, \theta)$, which represents the posterior probability that point
$i$ belongs to cluster $k$. This is known as the responsibility of cluster $k$ for point $i$...

Question: Explain the basic idea of the Expectation-Maximization algorithm.

*Answer: The expectation–maximization (EM) algorithm facilitates parameter estimation in probabilistic models
with incomplete data. EM is an iterative scheme that computes the MLE or MAP estimate of parameters in statistical
models in the presence of hidden or latent variables. The EM algorithm alternates between two steps.
The expectation (E) step creates a function that estimates the probability distribution
over possible completions of the missing (unobserved) data, using the current estimate of the parameters.
The maximization (M) step re-estimates the parameters, using the completions
computed during the E step. These parameter estimates are then used to estimate the distribution
of the hidden variables in the subsequent E step. In general, EM involves running an iterative algorithm
with the following attributes: (a) observed data, $X$; (b) latent (or missing) data, $Z$; (c) unknown parameter, $\theta$;
and (d) a likelihood function, $L(\theta; X, Z) = P(X, Z \mid \theta)$.*
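The alternation between the two steps can be sketched for a one-dimensional mixture of two Gaussians. This is a minimal illustration under stated assumptions, not a reference implementation: the function names `gaussian_pdf` and `em_gmm_1d`, the synthetic data, the starting values, and the fixed number of iterations are all chosen for the example. The latent variable is the component each point came from.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def em_gmm_1d(data, means, stds, weights, n_iter=50):
    """Run a fixed number of EM iterations for a 1-D Gaussian mixture."""
    means, stds, weights = list(means), list(stds), list(weights)
    k = len(means)
    for _ in range(n_iter):
        # E step: responsibility of each component for each point,
        # computed with the current parameter estimates.
        resp = []
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], stds[j]) for j in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M step: re-estimate the parameters from the responsibilities.
        for j in range(k):
            n_j = sum(r[j] for r in resp)
            weights[j] = n_j / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / n_j
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / n_j
            stds[j] = math.sqrt(var_j)
    return means, stds, weights, resp


# Synthetic data (assumption): two well-separated Gaussian clusters.
rng = random.Random(0)
data = [rng.gauss(-2.0, 0.5) for _ in range(200)] + [rng.gauss(3.0, 1.0) for _ in range(200)]

means, stds, weights, resp = em_gmm_1d(
    data, means=[-1.0, 1.0], stds=[1.0, 1.0], weights=[0.5, 0.5]
)
print("means:", means)
print("stds:", stds)
print("weights:", weights)
```

A practical implementation would stop when the log-likelihood no longer improves instead of using a fixed iteration count, and would guard against components whose variance collapses to zero.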

# Regression

Question: List five different real-world applications of regression tasks.

Question: Is it possible to reformulate a regression task as a classification task? How would you do it in practice?

Question: Look at the following picture generated with a Lasso regularized model. Can you explain the plot?
<div>
<img src="regularization_plot.png" width="600"/>
</div>

Question: How is linear regression connected to Gaussian distributions? 

Question: Explain the likelihood function and maximum likelihood estimation.

Question: Explain the objective function of logistic regression. Explain how the minimum of the logistic regression objective function can be found.

# Support Vector Machines

Question: The optimization problem of SVCs is to minimize the objective function $J(\omega) = \frac{1}{2} ||\omega||^2$ subject to the constraints $y_i \left(\omega^T x_i + b \right) \geq 1$ for all training examples $i$. How is the SVM optimization problem written in its dual form?

Question: Describe the difference between hard-margin and soft-margin classification. How does the slack variable $\xi$ enter the objective function? How does the regularization parameter $C$ enter the objective function?

Question: Look at the following plot:
    
<div>
<img src="efficient_learning_machines_SVM_C.png" width="300"/>
</div>

Describe how the regularization parameter $C$ changes the separating hyperplane in this dataset.

Question: Say you trained an SVM classifier with an RBF kernel. It seems to underfit the training set: should you increase or decrease $\gamma$? What about $C$?

Question: Let us consider a polynomial kernel $K(\vec{x}, \vec{x}') = \left( \vec{x} \cdot \vec{x}' + 1 \right)^d$. The kernel trick assumes that $K(\vec{x}, \vec{x}') = \phi^T(\vec{x}) \cdot \phi(\vec{x}')$. Calculate $\phi$ for a polynomial kernel with $d=2$ and two features $x_1$ and $x_2$.

Question: Why should we use the kernel trick for adding polynomial features to an SVM instead of just adding the polynomial features to the feature space?

Question: Describe the main strengths and weaknesses of using SVMs for classification tasks.

Question: How does a Support Vector Regression change with increasing $\epsilon$-parameter?

Question: Let us use a degree-two polynomial kernel with regularization parameter $C$ and $\epsilon = 0.1$. Which of the two fits uses the higher $C$ value?

<div>
<img src="support_vector_regression_C.png" width="800"/>
</div>

# Decision Trees

Question: List the main benefits and main disadvantages of using decision trees.

Question: Look at the following decision tree of splitting the Iris dataset with features petal length (0) and petal width (1). Calculate the entropy for all depth-1 nodes.

**under construction**

<div>
<img src="decision_tree.png" width="400"/>
</div>

Question: Look at the following data set:

<div>
<img src="data_set_decision_tree.png" width="400"/>
</div>

*    Compute the Gini index for the **Customer ID** and the **Gender** column.
*    Which attribute - **Gender**, **Car Type**, **Shirt Size** - is better suited for splitting?

Question: Look at the following data set:

<div>
<img src="data_set_2_decision_tree.png" width="400"/>
</div>

Which is the best split according to information gain?

Question: Describe the ID3 algorithm.

Question: How is the bias-variance tradeoff related to the maximum depth of a decision tree?

Question: Describe how pruning works.

Question: Describe the cost function and optimization of a decision tree regression algorithm.

# Ensemble Methods & Random Forests

Question: Compare the bias-variance tradeoff for decision trees and random forests.

Question: What is meant by Bagging?

Question: What is the difference between a Bagging and a Pasting Classifier?

Question: What are the most important parameters of random forests?

Answer: The important parameters to adjust are n_estimators, max_features, and possibly
pre-pruning options like max_depth. For n_estimators, larger is always better. Averaging
more trees will yield a more robust ensemble by reducing overfitting. However,
there are diminishing returns, and more trees need more memory and more time to
train. A common rule of thumb is to build “as many as you have time/memory for.”
The max_features parameter determines how random each tree is, and a
smaller max_features reduces overfitting. In general, it’s a good rule of thumb to use
the default values: max_features=sqrt(n_features) for classification and
max_features=n_features for regression. Setting max_features or max_leaf_nodes explicitly might
sometimes improve performance. It can also drastically reduce space and time
requirements for training and prediction.
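As a rough sketch of how these parameters are passed in scikit-learn, the following example fits a `RandomForestClassifier` on a synthetic dataset. The dataset, the split, and the specific values (200 trees, `max_features="sqrt"`) are assumptions for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification problem (assumption: 20 features, 5 informative).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,      # more trees: more robust, but slower and larger
    max_features="sqrt",   # features considered per split; smaller = more random trees
    max_depth=None,        # no pre-pruning; set an integer to limit tree depth
    random_state=0,
)
forest.fit(X_train, y_train)

test_accuracy = forest.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

Repeating the fit with different values of `n_estimators` and `max_features` is a simple way to observe the diminishing returns described above.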

Question: Looking at the following plot
    
<div>
<img src="decision_tree_vs_bagging.png" width="800"/>
</div>

which is most probably the decision tree model and which the random forest model?

Answer: The second plot is most probably the random forest model.

Question: Describe the difference between hard and soft voting in ensemble methods.

Answer: Soft voting refers to voting with respect to probabilities...

Question: Why should a model that combines several different models be more accurate than the individual models?

# Genetic Algorithm

**Question:** Suppose you were using a genetic algorithm and you have the following two individuals,
represented as strings of integers:

1324421 and 2751421

Which crossover techniques do you know and how would these techniques apply to those two strings?

**Question:** Name and describe the main features of Genetic Algorithms (GA).

Answer: Genetic Algorithms (GA) use principles of natural evolution. There are five important features of GA:

* **Encoding:** Possible solutions of a problem are considered as individuals in a population. If the solutions can be divided into a series of small steps (building blocks), then these steps are represented by genes, and a series of genes (a chromosome) encodes the whole solution. This way different solutions of a problem are represented in GA as chromosomes of individuals.
* **Fitness function:** Represents the main requirements of the desired solution of a problem (i.e. cheapest price, shortest route, most compact arrangement, etc). This function calculates and returns the fitness of an individual solution.
* **Selection operator:** Defines the way individuals in the current population are selected for reproduction. There are many strategies for that (e.g. roulette–wheel, ranked, tournament selection, etc), but usually the individuals which are more fit are selected.
* **Crossover operator:** Defines how chromosomes of parents are mixed in order to obtain genetic codes of their offspring (e.g. one–point, two–point, uniform crossover, etc). This operator implements the inheritance property (offspring inherit genes of their parents).
* **Mutation operator:** Creates random changes in genetic codes of the offspring. This operator is needed to bring some random diversity into the genetic code. In some cases GA cannot find the optimal solution without the mutation operator (local maximum problem).

**Question:** Consider the problem of finding the shortest route through several cities,
such that each city is visited only once and the route returns to the starting
city at the end (the Travelling Salesman problem). Suppose that in order to solve this
problem we use a genetic algorithm in which genes represent links between
pairs of cities. For example, a link between London and Paris is represented
by a single gene ‘LP’. Let us also assume that the direction in which we travel
is not important, so that LP = PL.

a) How many genes will be used in a chromosome of each individual if
the number of cities is 10?

b) How many genes will be in the alphabet of the algorithm?

Answer a) Each chromosome will consist of 10 genes, each gene representing the path between a pair of cities in the tour.

b) The alphabet will consist of 45 genes. Indeed, each of the 10 cities can be connected with the 9 remaining ones, so there are 10 × 9 = 90 ordered pairs of cities. However, because the direction is not important (i.e. London–Paris is the same as Paris–London), this number must be divided by 2. So we need 90/2 = 45 genes in order to encode all pairs. In general, the formula for $n$ cities is:

$$ \frac{n(n-1)}{2} $$
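As a quick check of both counts, the unordered city pairs can be enumerated directly. This is a small sketch; the single-letter city names and the helper `gene_alphabet` are hypothetical.

```python
from itertools import combinations


def gene_alphabet(cities):
    """All undirected links between two cities, one gene per unordered pair."""
    return [a + b for a, b in combinations(cities, 2)]


# Ten hypothetical cities, each named by a single letter.
cities = list("ABCDEFGHIJ")
alphabet = gene_alphabet(cities)

# One example tour A -> B -> ... -> J -> A, encoded as a chromosome of links.
# Sorting each pair makes 'JA' and 'AJ' the same gene (direction is ignored).
tour = cities + [cities[0]]
chromosome = ["".join(sorted(pair)) for pair in zip(tour, tour[1:])]

n = len(cities)
print("genes in the alphabet:", len(alphabet))
print("n * (n - 1) / 2:", n * (n - 1) // 2)
print("genes per chromosome:", len(chromosome))
```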

**Question:** Consider a genetic algorithm in which individuals are represented using a 5-bit string of the form 

$$b_1 b_2 b_3 b_4 b_5 \; .$$

An example of an individual is $00101$, for which $b_1=0$, $b_2=0$, $b_3=1$, $b_4=0$, $b_5=1$.
The fitness function is defined over these individuals as follows:

$$ f(b_1 b_2 b_3 b_4 b_5) = b_1 + b_2 + b_3 + b_4 + b_5 + \mathrm{AND}(b_1,b_2,b_3,b_4,b_5) $$
where $\mathrm{AND}(b_1,b_2,b_3,b_4,b_5)=1$ if $b_1=b_2=b_3=b_4=b_5=1$ and $\mathrm{AND}(b_1,b_2,b_3,b_4,b_5)=0$ otherwise.

a) Calculate the fitness of the following individuals:

$$ 00101, \; 11101, \; 00000, \; 10010, \; 11111 \; . $$

b) Suppose that a single crossover point will be used for crossover. This point has been chosen as the point between the 2nd and the 3rd bits (i.e. between $b_2$ and $b_3$). Show the two offspring that will result from crossing over the following two individuals:

     First parent:  00101       

     Second parent: 10111       
   
c) Explain how the standard mutation method is applied after selection and crossover to form the "next" generation of a population.

Answer a)

* 00101: $0+0+1+0+1+0 = 2$
* 11101: $1+1+1+0+1+0 = 4$
* 00000: $0+0+0+0+0+0 = 0$
* 10010: $1+0+0+1+0+0 = 2$
* 11111: $1+1+1+1+1+1 = 6$

b) First offspring: 00111, second offspring: 10101

c) For each individual, a bit of the individual is selected at random and flipped with a probability given by the "mutation rate".
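The fitness function, the single-point crossover, and the mutation step from this question can be sketched in a few lines. The function names and the use of Python strings for individuals are choices made for this illustration.

```python
import random


def fitness(bits):
    """Number of ones, plus a bonus of 1 if every bit is 1 (the AND term)."""
    ones = bits.count("1")
    return ones + (1 if ones == len(bits) else 0)


def single_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two parents after the given crossover point."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def mutate(bits, mutation_rate, rng):
    """With probability mutation_rate, flip one randomly selected bit."""
    if rng.random() < mutation_rate:
        i = rng.randrange(len(bits))
        flipped = "1" if bits[i] == "0" else "0"
        return bits[:i] + flipped + bits[i + 1:]
    return bits


individuals = ["00101", "11101", "00000", "10010", "11111"]
scores = [fitness(b) for b in individuals]
print(dict(zip(individuals, scores)))

# Crossover point between the 2nd and 3rd bit.
offspring = single_point_crossover("00101", "10111", point=2)
print(offspring)

# Mutation with a hypothetical mutation rate of 0.1.
rng = random.Random(0)
print([mutate(child, 0.1, rng) for child in offspring])
```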


**Question**:
Assume a population of strings. 

$$ 10111, 00111, 01001, 01010 $$

Assuming that each string represents a binary encoding of a number $n$, and that the fitness function is given by $F_{i}=\frac{100}{n}$, calculate the fitness of the above individuals.


Having done this, we randomly select mates and single crossover sites to generate a new population

$$ 00111, 01001, 00101, 01011$$

Calculate $F_i$ for each member of the new population. Is this an improvement? (E.g., has the average population fitness improved? How do the best-performing members compare?)

Answer:

String, $F_i$, Mating pool
* 10111: 4.35, 00111
* 00111: 14.29, 00111
* 01001: 11.11, 01001
* 01010: 10.0, 01001

The members of the new population have fitness values 14.29, 11.11, 20.0 and 9.09. Previous average population fitness = 9.94, new average population fitness = 13.62. Previous best performer had fitness 14.29, new best performer has fitness 20.0. Yes, this is an improvement.
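The fitness values and population statistics can be recomputed with a short script. This is a sketch; the helper names `fitness` and `summarize` are made up for the example.

```python
def fitness(bits):
    """Fitness F = 100 / n, where n is the integer encoded by the bit string.

    Note: n = 0 (the string '00000') would cause a division by zero;
    it does not occur in the populations used here.
    """
    n = int(bits, 2)
    return 100 / n


def summarize(population):
    """Return (average fitness, best fitness) of a population."""
    scores = [fitness(b) for b in population]
    return sum(scores) / len(scores), max(scores)


old_population = ["10111", "00111", "01001", "01010"]
new_population = ["00111", "01001", "00101", "01011"]

old_avg, old_best = summarize(old_population)
new_avg, new_best = summarize(new_population)

for name, pop in (("old", old_population), ("new", new_population)):
    print(name, [(b, round(fitness(b), 2)) for b in pop])
print("average:", round(old_avg, 2), "->", round(new_avg, 2))
print("best:", round(old_best, 2), "->", round(new_best, 2))
```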



**Question:** In the context of genetic algorithms, what is a fitness function? What role does it play in helping the system to learn?