# Study Guide

1. **Why do we use Stratified K-Fold cross-validation, especially with a stratification of 10 folds, in this classification task?**
   
   Stratified K-Fold cross-validation ensures that each fold contains roughly the same proportion of each class label as the full dataset. In stratified 10-fold cross-validation, the dataset is split into 10 folds of roughly equal size, each preserving the overall class proportions; every fold serves once as the validation set while the other nine are used for training. This is particularly useful for imbalanced datasets, where an unstratified split could leave some folds with very few (or no) minority-class instances, and it gives a more reliable estimate of the model's performance.
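
The stratification idea can be sketched in plain Python. This is a simplified illustration, not the project's code: the helper `stratified_folds` and the 90/10 label split are hypothetical, and in practice one would normally rely on scikit-learn's `StratifiedKFold`, which also supports shuffling.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits):
    """Assign sample indices to folds, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        # Spreading each class evenly keeps the class ratio similar in every fold.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

# Hypothetical imbalanced labels: 90 negatives and 10 positives.
labels = [0] * 90 + [1] * 10
folds = stratified_folds(labels, n_splits=10)
fold_counts = [Counter(labels[i] for i in fold) for fold in folds]

for number, counts in enumerate(fold_counts, start=1):
    print(f"fold {number}: {dict(counts)}")
```

With these made-up labels every fold keeps the same 9:1 ratio as the full dataset, so each validation fold contains at least one minority-class sample.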

2. **Can you explain the entropy function defined in the code? How does adding a small constant (1e-10) affect the computation?**
   
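   The entropy function calculates the entropy of the labels, a measure of the impurity of the dataset. It is defined as the negative sum, over all classes, of each class probability multiplied by the base-2 logarithm of that probability: H = -sum(p_i * log2(p_i)). The small constant `1e-10` is added to the argument of the logarithm so that a probability of zero never produces log2(0), which is undefined. This is a numerical-stability guard rather than Laplace smoothing (which adds pseudo-counts to the class counts before probabilities are computed); because the constant is so small, its effect on the result is negligible.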
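
A minimal sketch of such an entropy function is shown here. The original code is not reproduced in this guide, so the name `entropy_from_probs` and the exact placement of the constant (inside the logarithm) are assumptions.

```python
import math

def entropy_from_probs(probs, eps=1e-10):
    """Shannon entropy in bits; eps keeps log2 from ever receiving zero."""
    return -sum(p * math.log2(p + eps) for p in probs)

balanced = entropy_from_probs([0.5, 0.5])        # maximally impure two-class node
pure = entropy_from_probs([1.0, 0.0])            # would raise without eps
four_way = entropy_from_probs([0.25] * 4)        # four equally likely classes

print(balanced, pure, four_way)
```

Without `eps`, the second call would fail with a math domain error on `log2(0)`; with it, the results differ from the exact values 1, 0, and 2 only by a tiny amount on the order of the constant itself.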

3. **What is the information gain, and how is it calculated in this context?**

   Information gain is a measure of the effectiveness of an attribute in classifying the training data. It is calculated by subtracting the weighted entropies of the child nodes from the entropy of the parent node. The code defines a function `information_gain` that takes the entire set's entropy and subtracts the sum of entropies of the two child sets, weighted by their relative sizes.
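
The calculation for a binary split can be sketched as follows. The function names mirror the description above, but the bodies are an assumed reconstruction rather than the project's actual code, and the labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical node with 4 "yes" and 4 "no" samples.
parent = ["yes"] * 4 + ["no"] * 4
perfect = information_gain(parent, ["yes"] * 4, ["no"] * 4)
useless = information_gain(parent, ["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"])

print(perfect, useless)
```

A split that separates the classes completely recovers the full parent entropy of 1 bit, while a split whose children have the same class mix as the parent gains nothing.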

4. **Discuss the difference between the ID3, C4.5, and CART decision trees as used in the code. Why might they perform differently on the same dataset?**
   
   ID3 uses information gain as its criterion for making splits, C4.5 is an extension of ID3 that uses the gain ratio to counteract the bias of information gain towards attributes with many outcomes, and CART uses the Gini index as its measure of impurity. They might perform differently due to these intrinsic differences in how they evaluate the best splits. Additionally, the presence of features with many outcomes, the handling of continuous attributes, and different handling of missing values can affect their performance.
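
To illustrate why C4.5's gain ratio counteracts the bias toward many-valued attributes, the sketch below compares a clean binary split with an ID-like attribute that puts every sample in its own partition. The helper `gain_and_gain_ratio` and the data are hypothetical, and the function assumes at least two non-empty partitions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_and_gain_ratio(parent, partitions):
    """Return (information gain, gain ratio) for a multiway split."""
    n = len(parent)
    weights = [len(part) / n for part in partitions]
    gain = entropy(parent) - sum(w * entropy(part) for w, part in zip(weights, partitions))
    # Split information grows with the number of partitions and penalizes them.
    split_info = -sum(w * math.log2(w) for w in weights)
    return gain, gain / split_info

parent = ["yes"] * 4 + ["no"] * 4
binary_split = [["yes"] * 4, ["no"] * 4]
id_like_split = [[label] for label in parent]   # one sample per partition

binary_gain, binary_ratio = gain_and_gain_ratio(parent, binary_split)
id_gain, id_ratio = gain_and_gain_ratio(parent, id_like_split)

print(binary_gain, binary_ratio)
print(id_gain, id_ratio)
```

Both splits have the same information gain, so ID3 cannot tell them apart, but the gain ratio of the ID-like attribute is only a third of the binary split's because its split information is log2(8) = 3.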

5. **What is the significance of setting `max_features='sqrt'` in the C4.5-like classifier, and how does it differ from the approach in ID3 and CART?**
   
   Setting `max_features='sqrt'` means that only a random subset of the features is considered when searching for the best split at each node; the subset size is the square root of the total number of features. The idea is borrowed from the random forest algorithm's feature subsampling (it is not part of the original C4.5 algorithm) and leads to more diversity in the splits made by the C4.5-like model. In contrast, the ID3 and CART models consider all features at each split.
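
The effect of a square-root rule can be sketched as below. The helper `sample_split_candidates` is hypothetical, and taking the integer part of the square root (with a minimum of one feature) is an assumption about how the subset size is rounded.

```python
import math
import random

def sample_split_candidates(n_features, rng):
    """Pick the feature indices examined at one node under a 'sqrt' rule."""
    k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
candidates_per_node = [sample_split_candidates(16, rng) for _ in range(3)]

for node, candidates in enumerate(candidates_per_node):
    print(f"node {node}: features considered = {candidates}")
```

With 16 features, each node examines only 4 of them, and different nodes generally draw different subsets.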

6. **How are sensitivity and specificity calculated from the confusion matrix, and what do they signify about the classifier's performance?**
   
   Sensitivity (also known as recall or true positive rate) measures the proportion of actual positives correctly identified by the classifier, calculated as the number of true positives divided by the sum of true positives and false negatives. Specificity measures the proportion of actual negatives correctly identified, calculated as the number of true negatives divided by the sum of true negatives and false positives. Together, they provide a more nuanced view of the classifier's performance across the different classes.
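
As a small worked example, both quantities can be computed directly from the four cells of a binary confusion matrix. The counts below are hypothetical.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from binary confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # share of actual positives that were found
    specificity = tn / (tn + fp)   # share of actual negatives that were found
    return sensitivity, specificity

# Hypothetical counts: 50 actual positives, 100 actual negatives.
sens, spec = sensitivity_specificity(tp=40, fn=10, tn=90, fp=10)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```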

7. **Why is the Matthews correlation coefficient (MCC) considered a balanced measure even on imbalanced datasets, and how is it interpreted in this context?**
   
   MCC takes into account true and false positives and negatives and is generally regarded as a balanced measure that can be used even when classes are of very different sizes. It returns a value between -1 and 1, where 1 represents a perfect prediction, 0 no better than random prediction, and -1 indicates total disagreement between prediction and observation.
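
For the binary case, MCC can be sketched from the confusion-matrix counts as follows. The function name `mcc` and the counts are hypothetical, and returning 0 when the denominator is zero is a common convention assumed here.

```python
import math

def mcc(tp, fn, tn, fp):
    """Matthews correlation coefficient for a binary confusion matrix."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: report 0 when any row or column of the matrix is empty.
    return numerator / denominator if denominator else 0.0

typical = mcc(tp=40, fn=10, tn=90, fp=10)
perfect = mcc(tp=50, fn=0, tn=100, fp=0)
inverted = mcc(tp=0, fn=50, tn=0, fp=100)

print(typical, perfect, inverted)
```

The three calls cover a reasonably good classifier, a perfect one, and one that gets every prediction wrong, spanning the range of the metric.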

8. **Why might the accuracy not be an adequate performance metric for this classification task, especially when compared to the F1 score or MCC?**
   
   Accuracy might not be suitable for imbalanced datasets because it can be skewed by the majority class; it doesn’t take into account the balance of measures such as precision and recall. The F1 score, which is the harmonic mean of precision and recall, and the MCC, which takes all four confusion matrix categories into account, can provide a more balanced performance evaluation.
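
The point can be made concrete with a hypothetical dataset of 95 negatives and 5 positives and a classifier that always predicts the negative class. The helper below is an illustrative sketch; treating undefined precision or recall as 0 is an assumed convention.

```python
def accuracy_and_f1(tp, fn, tn, fp):
    """Return (accuracy, F1) from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)

# The always-negative classifier never finds a positive sample.
accuracy, f1 = accuracy_and_f1(tp=0, fn=5, tn=95, fp=0)
print(f"accuracy={accuracy:.2f}, f1={f1:.2f}")
```

Accuracy looks excellent at 0.95 even though the classifier is useless for the minority class, which the F1 score of 0 makes obvious.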

9. **Considering that none of the classifiers achieved the target metrics of ACC over 90% or the others over 0.9, what steps could be taken to try and improve model performance?**
   
   Several strategies can be considered: tuning the hyperparameters of the classifiers, trying different algorithms, using ensemble methods, collecting or generating more data, using feature engineering to create more relevant features, balancing the dataset, or using more sophisticated methods to handle imbalanced data like SMOTE for oversampling the minority class.

10. **In the context of decision tree algorithms, what impact does the choice of 'entropy' vs. 'gini' have on the splitting criteria, and why might one be chosen over the other in practice?**

    Entropy and Gini impurity are two different metrics for quantifying the impurity of a node. Entropy comes from information theory and measures the expected number of bits needed to encode the class label, while Gini impurity is the probability of misclassifying a randomly chosen sample if it were labeled at random according to the node's class distribution. Despite the shared name, Gini impurity is not the same as the Gini coefficient used in economics to measure inequality. Both criteria tend to yield very similar trees; Gini is slightly cheaper to compute because it involves no logarithms, although the difference is rarely decisive in practice. One might prefer entropy when the information-theoretic interpretation of information gain is specifically wanted; since the choice rarely changes the outcome much, it is often simply treated as a hyperparameter to tune.

11. **How does the modification of the cost function in a decision tree affect its bias-variance trade-off, and how could this manifest in the resulting model’s performance?**

    Modifying the cost function in a decision tree affects its complexity. For example, a cost function that heavily penalizes complex trees might lead to simpler models with higher bias and lower variance (underfitting), whereas a cost function that allows more complex trees can lead to lower bias but higher variance (overfitting). Finding the right balance is critical for optimal model performance.

12. **When tuning a model using grid search with cross-validation, how does the choice of the parameter grid affect the likelihood of overfitting or underfitting?**

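    The parameter grid defines the hyperparameters and the candidate values to be evaluated. If the grid is too narrow or centered on the wrong values, the best configuration it contains may still underfit (or overfit), because the optimal hyperparameters are never tried. Conversely, a very large grid does not make an individual model overfit the training data, since each candidate is scored on held-out folds, but selecting the best of many configurations can overfit the validation folds themselves, making the winning cross-validation score optimistic. A separate held-out test set (or nested cross-validation) gives an unbiased estimate of the selected model's performance.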

13. **Discuss the role of the ‘min_samples_split’ parameter in decision trees. How does altering this value potentially change the structure and performance of the tree?**

    The 'min_samples_split' parameter determines the minimum number of samples required to split an internal node. A lower value increases the tree's complexity and can lead to overfitting as the tree grows deeper and starts to capture noise in the data. A higher value leads to simpler trees and may cause underfitting as the tree might not capture enough information about the data.

14. **Could you elaborate on how the 'bootstrap' parameter in the RandomForestClassifier influences the diversity of the trees in the forest, and what effect this might have on the bias-variance trade-off?**

    When 'bootstrap' is set to True, each tree in the forest is trained on a bootstrap sample: a sample drawn with replacement from the training data. Because each tree sees a different sample, the trees are more diverse and less correlated, which reduces the variance of the averaged prediction without increasing bias much. If set to False, every tree is built on the full dataset, so the only remaining source of diversity is the random feature subset considered at each split; the trees become more correlated and averaging them reduces variance less, which can increase the risk of overfitting.
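
What a bootstrap sample looks like can be sketched without any library. The helper `bootstrap_indices` is hypothetical and only mimics the sampling step, not the forest training itself.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
n_samples = 1000
sample_a = bootstrap_indices(n_samples, rng)
sample_b = bootstrap_indices(n_samples, rng)

unique_fraction = len(set(sample_a)) / n_samples
shared = len(set(sample_a) & set(sample_b))
print(f"unique rows in one sample: {unique_fraction:.1%}, rows shared by both: {shared}")
```

Each bootstrap sample typically contains roughly 63% of the distinct rows (about 1 - 1/e), so two trees rarely see exactly the same data, which is where the extra diversity comes from.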

15. **Explain the mathematical intuition behind the feature importance metric in tree-based models. What does it tell us about the predictive power of individual features, and how is it affected by correlated features?**

    Feature importance in tree-based models is usually calculated based on the decrease in node impurity weighted by the probability of reaching that node (which is approximated by the proportion of samples reaching that node). It provides a score that ranks the predictive power of each feature. However, in the presence of highly correlated features, this metric can become skewed as the importance is distributed among the correlated features, which might not truly reflect their individual predictive power.
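
The weighted impurity decrease can be sketched for a tiny hand-built tree. Everything here is hypothetical: the dictionary layout, the feature names, and the node statistics are invented for illustration, and the final normalization to a sum of 1 is an assumed convention.

```python
from collections import defaultdict

def feature_importances(splits, n_total):
    """Sum weighted impurity decreases per feature, then normalize to 1."""
    raw = defaultdict(float)
    for s in splits:
        n_node = s["n_left"] + s["n_right"]
        children = (s["n_left"] * s["imp_left"] + s["n_right"] * s["imp_right"]) / n_node
        # Weight the decrease by the share of all samples that reach this node.
        raw[s["feature"]] += (n_node / n_total) * (s["impurity"] - children)
    total = sum(raw.values())
    return {feature: value / total for feature, value in raw.items()}

# Hypothetical tree: the root splits on "age", its left child splits on "income".
splits = [
    {"feature": "age", "impurity": 0.48,
     "n_left": 50, "imp_left": 0.32, "n_right": 50, "imp_right": 0.0},
    {"feature": "income", "impurity": 0.32,
     "n_left": 40, "imp_left": 0.0, "n_right": 10, "imp_right": 0.0},
]
importances = feature_importances(splits, n_total=100)
print(importances)
```

The root split receives twice the importance of the deeper one, partly because it removes more impurity and partly because every sample passes through it.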

16. **In the context of the project, why might it be beneficial to prune the decision tree, and what are the theoretical underpinnings that justify this action?**

    Pruning a decision tree involves cutting back the parts of the tree that add little predictive power. The theoretical justification is that simplifying the model reduces overfitting: smaller trees tend to generalize better, and pruning removes branches that are supported by too few samples to be statistically reliable.

17. **Can you derive the Gini impurity formula and explain how it is used to compute node purity? What does a Gini impurity of 0 signify in the context of a decision tree?**

    Gini impurity is calculated as 1 - sum(p_i^2), where p_i is the proportion of samples that belong to class i at a given node. A Gini impurity of 0 signifies that the node is pure, meaning all the samples at the node belong to the same class.
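
The formula can be read as the probability of mislabeling a sample when labels are assigned at random from the node's class distribution: sum(p_i * (1 - p_i)), which simplifies to 1 - sum(p_i^2) because the p_i sum to 1. The sketch below checks that both forms agree; the function names and labels are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity as 1 - sum(p_i^2)."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def gini_as_mislabel_probability(labels):
    """Equivalent form: sum(p_i * (1 - p_i))."""
    total = len(labels)
    return sum((c / total) * (1 - c / total) for c in Counter(labels).values())

pure_node = gini_impurity(["a"] * 6)
even_binary = gini_impurity(["a"] * 3 + ["b"] * 3)
even_three_way = gini_impurity(["a", "b", "c"] * 2)

print(pure_node, even_binary, even_three_way)
```

A pure node scores 0, an evenly mixed two-class node scores 0.5, and in general the maximum for k equally likely classes is 1 - 1/k.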

18. **How do you calculate information gain when splitting a node, and why is it important to maximize this quantity when building decision trees?**

    Information gain is calculated as the difference in entropy before and after the split: IG = H(S) - sum(|S_v| / |S| * H(S_v)) for all values v in Values(A), where H(S) is the entropy of set S, Values(A) are the different values of attribute A by which to split, and H(S_v) is the entropy of the subset of S for each value v. Maximizing information gain results in splits that produce the most homogenous nodes, thereby increasing the predictive power of the tree.
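
As a worked instance of this formula, assume a hypothetical set S of 14 samples (9 positive, 5 negative) and a two-valued attribute A that splits it into S_weak (6 positive, 2 negative) and S_strong (3 positive, 3 negative). These counts are made up for illustration and do not come from the project's data:

$$
\begin{aligned}
H(S) &= -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940 \\
H(S_{\text{weak}}) &= -\tfrac{6}{8}\log_2\tfrac{6}{8} - \tfrac{2}{8}\log_2\tfrac{2}{8} \approx 0.811 \\
H(S_{\text{strong}}) &= -\tfrac{3}{6}\log_2\tfrac{3}{6} - \tfrac{3}{6}\log_2\tfrac{3}{6} = 1.000 \\
IG(S, A) &= 0.940 - \tfrac{8}{14}(0.811) - \tfrac{6}{14}(1.000) \approx 0.048
\end{aligned}
$$

The gain is small because neither child is much purer than the parent, so a tree builder would prefer another attribute if one offered a larger gain.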

19. **Can you describe the formula for entropy in the context of information theory and relate it to its use in decision trees?**

    Entropy in information theory is defined as H(X) = -sum(p(x) * log2(p(x))), where p(x) is the probability of occurrence of the state x of random variable X. In decision trees, entropy is used to measure the impurity of a node, with higher entropy indicating more disorder and lower entropy indicating more homogeneity.

20. **Explain how the cost complexity pruning formula R_α(T) = R(T) + α * |T~| balances tree size and fit to the data. What role does the parameter α play?**

    In this formula, R(T) is the misclassification rate of the tree T, |T~| is the number of terminal nodes in the tree, and α is a complexity parameter that controls the trade-off between tree size and its fit to the data. Larger values of α penalize complex trees more heavily, leading to simpler trees.
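
The trade-off can be sketched by scoring a few candidate subtrees at different values of α. The three subtrees and their error rates are hypothetical; a real pruning procedure would generate the candidates from the fitted tree.

```python
def cost_complexity(misclassification_rate, n_leaves, alpha):
    """R_alpha(T) = R(T) + alpha * (number of terminal nodes)."""
    return misclassification_rate + alpha * n_leaves

# Hypothetical candidates: (name, R(T) on the training data, number of leaves).
candidates = [("full", 0.02, 20), ("pruned", 0.05, 6), ("stump", 0.20, 2)]

def best_subtree(alpha):
    """Name of the candidate with the lowest penalized cost."""
    return min(candidates, key=lambda t: cost_complexity(t[1], t[2], alpha))[0]

for alpha in (0.0, 0.005, 0.05):
    print(f"alpha={alpha}: best subtree = {best_subtree(alpha)}")
```

With no penalty the largest tree wins on training error alone; as α grows, first the moderately pruned tree and then the two-leaf stump becomes the cheapest option.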

21. **What is the mathematical rationale for using the mean squared error (MSE) in regression problems, and how does minimizing MSE lead to the best unbiased estimator under the assumption of normally distributed errors?**

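    The MSE is given by MSE = (1/n) * sum((Y_i - Ŷ_i)^2), where Y_i are the observed values, Ŷ_i are the predicted values, and n is the number of observations. It is always non-negative, and values closer to zero are better. Under the assumption of independent, normally distributed errors with constant variance, minimizing the MSE is equivalent to maximizing the likelihood of the data, so the least-squares fit is also the maximum likelihood estimate. The Gauss-Markov theorem is a separate result that does not require normality: for a linear model whose errors have zero mean, constant variance, and no correlation, ordinary least squares is the best linear unbiased estimator (BLUE). With normal errors, it is additionally the minimum-variance estimator among all unbiased estimators, not only the linear ones.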
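
The link between MSE and likelihood can be written out explicitly. Assume, for this sketch, that Y_i = Ŷ_i + ε_i with independent errors ε_i ~ N(0, σ^2) and a fixed σ^2; then the log-likelihood of the observations is

$$
\log L = -\frac{n}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2
       = -\frac{n}{2}\log\left(2\pi\sigma^2\right) - \frac{n}{2\sigma^2}\,\mathrm{MSE}
$$

The first term does not depend on the predictions, and the second is a negative multiple of the MSE, so the predictions that maximize the likelihood are exactly those that minimize the MSE.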

22. **How does the variance of the bootstrap samples in a RandomForest influence the overall variance of the ensemble model, according to the central limit theorem?**

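    The central limit theorem suggests that, as the number of trees grows, the average of the trees' predictions is approximately normally distributed, but it is the variance formula for an average, not the CLT itself, that explains the variance reduction. If the trees were independent, each with prediction variance σ^2, the variance of their average would be σ^2 / B for B trees. Trees grown on bootstrap samples of the same data are positively correlated, however; with pairwise correlation ρ the variance of the average is ρ * σ^2 + (1 - ρ) * σ^2 / B. Adding trees shrinks only the second term, so the ensemble variance levels off at ρ * σ^2, which is why random forests also subsample features to reduce ρ.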
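
The role of correlation between trees can be illustrated numerically. The sketch assumes identically distributed trees with prediction variance σ^2 and a common pairwise correlation ρ, for which the variance of the average of B trees is ρ * σ^2 + (1 - ρ) * σ^2 / B; the function name and the numbers are hypothetical.

```python
def ensemble_variance(tree_variance, correlation, n_trees):
    """Variance of the mean of equally correlated, identically distributed trees."""
    return correlation * tree_variance + (1 - correlation) * tree_variance / n_trees

independent = ensemble_variance(tree_variance=1.0, correlation=0.0, n_trees=100)
correlated = ensemble_variance(tree_variance=1.0, correlation=0.3, n_trees=100)
many_trees = ensemble_variance(tree_variance=1.0, correlation=0.3, n_trees=10_000)

print(independent, correlated, many_trees)
```

Only in the uncorrelated case does the variance fall to σ^2 / B; with ρ = 0.3 it stays close to 0.3 however many trees are added.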

23. **In the context of logistic regression, could you detail how the logit function is derived from the odds ratio and its significance in transforming probabilities into log-odds?**

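    The logit function is the natural log of the odds: logit(p) = log(p / (1 - p)), where p is the probability of the positive class and p / (1 - p) is the odds (an odds ratio, strictly speaking, is the ratio of two odds and is what an exponentiated logistic regression coefficient represents). This transformation is significant because it converts probabilities, which are bounded between 0 and 1, into log-odds, which range from negative to positive infinity, so they can be modeled as a linear function of the features.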
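
A short sketch of the transformation and its inverse (the sigmoid) is given below, assuming probabilities strictly between 0 and 1.

```python
import math

def logit(p):
    """Log-odds of a probability p, valid for 0 < p < 1."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of logit: maps any real log-odds value back to a probability."""
    return 1 / (1 + math.exp(-z))

even_odds = logit(0.5)            # odds of 1:1
four_to_one = logit(0.8)          # odds of 4:1
round_trip = sigmoid(logit(0.2))  # recovers the original probability

print(even_odds, four_to_one, round_trip)
```

A probability of 0.5 maps to log-odds of 0, probabilities above 0.5 map to positive values, and applying the sigmoid to the log-odds returns the original probability.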