# Open Questions

# Preamble

In [1]:
# Common imports
import numpy as np # numpy is THE toolbox for scientific computing with python
import pandas as pd # pandas provides THE data structure and data analysis tools for data scientists 

# maximum number of rows and columns to display
pd.set_option("display.max_rows", 101)
pd.set_option("display.max_columns", 101)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

from warnings import filterwarnings
filterwarnings('ignore')

# Clustering Algorithms

# Support Vector Machines

Question: The optimization problem of SVCs is to minimize the objective function $J(\omega) = \frac{1}{2} ||\omega||^2$ subject to the constraints $y_i \left(\omega^T x_i + b \right) \geq 1$ for all training samples $i$. How is the SVM optimization problem written in its dual form?

Question: Describe the difference between hard-margin and soft-margin classification. How does the slack variable $\xi$ enter the objective function? How does the regularization parameter $C$ enter the objective function?

Question: Look at the following plot:
    
<div>
<img src="efficient_learning_machines_SVM_C.png" width="300"/>
</div>

Describe how the regularization parameter $C$ changes the separating hyperplane in this dataset.

Question: Say you trained an SVM classifier with an RBF kernel. It seems to underfit the training set: should you increase or decrease $\gamma$? What about $C$?
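
To explore this question empirically, the following minimal sketch fits RBF-kernel classifiers with different values of $\gamma$ and prints the training accuracy. The toy dataset (`make_moons`), the chosen values of `gamma`, and the fixed `C=1.0` are assumptions made for illustration only; they are not part of the exercise.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Hypothetical toy data: two interleaving half circles with noise
X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

train_accuracy = {}
for gamma in (0.001, 1.0, 100.0):
    # Same C for every fit, so only the kernel width changes
    model = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    train_accuracy[gamma] = model.score(X, y)
    print(f"gamma={gamma}: training accuracy {train_accuracy[gamma]:.2f}")
```

Repeating the loop over `C` with a fixed `gamma` gives the corresponding picture for the regularization parameter.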

Question: Let us consider a polynomial kernel $K(\vec{x}, \vec{x}') = \left( \vec{x} \cdot \vec{x}' + 1 \right)^d$. The kernel trick assumes that $K(\vec{x}, \vec{x}') = \phi^T(\vec{x}) \cdot \phi(\vec{x}')$. Calculate $\phi$ for a polynomial kernel with $d=2$ and two features $x_1$ and $x_2$.
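
A derived feature map can be checked numerically by comparing the kernel value with the explicit inner product. The sketch below does this for one candidate map with six components; the function names `poly_kernel` and `phi` and the random test points are chosen here for illustration.

```python
import numpy as np

def poly_kernel(x, z, d=2):
    """Polynomial kernel K(x, z) = (x . z + 1)^d."""
    return (np.dot(x, z) + 1.0) ** d

def phi(x):
    """Candidate explicit feature map for d = 2 and two features (x1, x2)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1**2, x2**2, s * x1 * x2, s * x1, s * x2, 1.0])

# Two random test points (assumption: any pair of 2D points works)
rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)

kernel_value = poly_kernel(x, z)
explicit_value = phi(x) @ phi(z)
print(np.isclose(kernel_value, explicit_value))
```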

Question: Why should we use the kernel trick for adding polynomial features to an SVM instead of just adding the polynomial features to the feature space?

Question: Describe the main strengths and weaknesses of using SVMs for classification tasks.

Question: How does a Support Vector Regression change with increasing $\epsilon$-parameter?
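
One way to investigate this is to fit the same regressor with several values of $\epsilon$ and count the support vectors of each fit. The sketch below assumes a noisy sine curve as toy data and an RBF kernel with `C=1.0`; both are illustrative choices, not part of the exercise.

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical toy data: noisy sine curve
rng = np.random.default_rng(42)
X = np.sort(rng.uniform(0, 2 * np.pi, size=100)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

n_support = {}
for eps in (0.01, 0.1, 1.0):
    model = SVR(kernel="rbf", C=1.0, epsilon=eps).fit(X, y)
    # support_ holds the indices of the training points used as support vectors
    n_support[eps] = len(model.support_)
    print(f"epsilon={eps}: {n_support[eps]} support vectors")
```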

Question: Let us use a degree-two polynomial kernel with regularization parameter $C$ and $\epsilon = 0.1$. Which of the two fits uses the higher value of $C$?

<div>
<img src="support_vector_regression_C.png" width="800"/>
</div>

# Decision Trees

Question: List the main benefits and main disadvantages of using decision trees.

Question: Look at the following decision tree of splitting the Iris dataset with features petal length (0) and petal width (1). Calculate the entropy for all depth-1 nodes.

<div>
<img src="decision_tree.png" width="400"/>
</div>
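
The entropy of a node follows from its class counts via $H = -\sum_k p_k \log_2 p_k$. A minimal helper is sketched below; the class counts in the usage lines are hypothetical and must be replaced by the counts read off the tree.

```python
import numpy as np

def entropy(class_counts):
    """Entropy (in bits) of a node, given the number of samples per class."""
    counts = np.asarray(class_counts, dtype=float)
    # Empty classes contribute nothing (0 * log 0 is taken as 0)
    p = counts[counts > 0] / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

# Hypothetical class counts, for illustration only
pure_node = entropy([50, 0, 0])
mixed_node = entropy([0, 50, 50])
print(pure_node, mixed_node)
```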

Question: Look at the following data set:

<div>
<img src="data_set_decision_tree.png" width="400"/>
</div>

*    Compute the Gini index for the **Customer ID** and the **Gender** column.
*    Which attribute - **Gender**, **Car Type**, **Shirt Size** - is better suited for splitting?
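
The Gini index of a split is the size-weighted average of the Gini impurities of the resulting groups. The sketch below computes it with pandas; the table `toy`, its column names, and its values are hypothetical stand-ins, since the actual data set is only given as an image.

```python
import pandas as pd

def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a collection of class labels."""
    p = pd.Series(labels).value_counts(normalize=True)
    return float(1.0 - (p ** 2).sum())

def gini_of_split(df, attribute, target):
    """Size-weighted Gini impurity after splitting df on every value of attribute."""
    n = len(df)
    return float(sum(len(g) / n * gini(g[target]) for _, g in df.groupby(attribute)))

# Hypothetical toy table, for illustration only
toy = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "gender": ["M", "M", "F", "F"],
    "label": ["C0", "C0", "C0", "C1"],
})

for column in ("customer_id", "gender"):
    print(column, gini_of_split(toy, column, "label"))
```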

Question: Look at the following data set:

<div>
<img src="data_set_2_decision_tree.png" width="400"/>
</div>

Which is the best split according to information gain?

Question: Describe the ID3 algorithm.

Question: How is the bias-variance tradeoff related to the maximum depth of a decision tree?

Question: Describe how pruning works.

Question: Describe the cost function and optimization of a decision tree regression algorithm.

# Ensemble Methods & Random Forests

Question: Compare the bias-variance tradeoff for decision trees and random forests.

Question: What is meant by Bagging?

Question: What is the difference between a bagging and a pasting classifier?
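
Both variants train each ensemble member on a random subset of the training set and differ only in how that subset is drawn. The sketch below first draws indices with and without replacement using NumPy, then fits both variants with scikit-learn's `BaggingClassifier`, where the `bootstrap` flag selects the sampling scheme. The toy data, the ensemble size, and `max_samples=0.5` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Sampling step in isolation
rng = np.random.default_rng(0)
n_samples = 100
bagging_idx = rng.choice(n_samples, size=n_samples, replace=True)        # with replacement
pasting_idx = rng.choice(n_samples, size=n_samples // 2, replace=False)  # without replacement
print(len(np.unique(bagging_idx)), len(np.unique(pasting_idx)))

# Same idea inside scikit-learn (hypothetical toy data)
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
ensembles = {
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                                 max_samples=0.5, bootstrap=True, random_state=0),
    "pasting": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                                 max_samples=0.5, bootstrap=False, random_state=0),
}
train_scores = {name: model.fit(X, y).score(X, y) for name, model in ensembles.items()}
print(train_scores)
```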

Question: What are the most important parameters of random forests?

The important parameters to adjust are n_estimators, max_features, and possibly
pre-pruning options like max_depth. For n_estimators, larger is generally better.
Averaging more trees yields a more robust ensemble by reducing overfitting. However,
there are diminishing returns, and more trees need more memory and more time to
train. A common rule of thumb is to build “as many as you have time/memory for.”
max_features determines how random each tree is, and a smaller max_features
reduces overfitting. In general, it’s a good rule of thumb to use the default values:
max_features=sqrt(n_features) for classification and max_features=n_features for
regression. Adjusting max_features or setting max_leaf_nodes might sometimes
improve performance. It can also drastically reduce space and time requirements
for training and prediction.
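
The effect of n_estimators can be sketched as follows. The toy dataset, the train/test split, and the three ensemble sizes are assumptions made for illustration; the scores depend on the data and the random seed.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical toy data
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for n_estimators in (1, 10, 100):
    forest = RandomForestClassifier(
        n_estimators=n_estimators,
        max_features="sqrt",  # number of features considered at each split
        random_state=0,
    )
    forest.fit(X_train, y_train)
    scores[n_estimators] = forest.score(X_test, y_test)
    print(f"n_estimators={n_estimators}: test accuracy {scores[n_estimators]:.2f}")
```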

Question: Look at the following plot:
    
<div>
<img src="decision_tree_vs_bagging.png" width="800"/>
</div>

Which plot most probably shows the decision tree model, and which the random forest model?

The second plot most probably shows the random forest model.

Question: Describe the difference between hard and soft voting in ensemble methods.

Question: Why should a model that combines several different models be more accurate than the individual models?

In [None]:
Soft voting refers to voting with respect to probabilities...
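
The two voting schemes can disagree, as the following minimal sketch shows. The probability table `proba` is made up for illustration: three hypothetical classifiers predict class probabilities for a single sample. In scikit-learn, `VotingClassifier` exposes the same choice through its `voting` parameter (`"hard"` or `"soft"`).

```python
import numpy as np

# Rows: three hypothetical classifiers; columns: P(class 0), P(class 1)
proba = np.array([
    [0.45, 0.55],
    [0.45, 0.55],
    [0.90, 0.10],
])

# Hard voting: each classifier casts one vote for its most likely class
hard_votes = proba.argmax(axis=1)
hard_prediction = np.bincount(hard_votes, minlength=proba.shape[1]).argmax()

# Soft voting: average the probabilities, then pick the most likely class
soft_prediction = proba.mean(axis=0).argmax()

print(hard_prediction, soft_prediction)
```

Two classifiers lean slightly towards class 1, but the third is very confident about class 0, so the averaged probabilities favour class 0 while the majority vote favours class 1.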

# Genetic Algorithm