# Statistical Learning Exercise

## Flexible Approach vs Inflexible Approach

1. For each of parts (a) through (d), indicate whether we would generally
expect the performance of a flexible statistical learning method to be
better or worse than an inflexible method. Justify your answer.

(a) The sample size n is extremely large, and the number of predictors p is small.

(b) The number of predictors p is extremely large, and the number
of observations n is small.

(c) The relationship between the predictors and response is highly
non-linear.

(d) The variance of the error terms, i.e. σ² = Var(ϵ), is extremely
high.

> **Inflexible method**: inflexible methods comprise algorithms such as *a simple (shallow) decision tree, linear regression, logistic regression, Lasso regression, linear discriminant analysis, and naive Bayes*. They impose strong assumptions on the form of *f*, so the fitted model is easy to interpret and can often be worked through by hand.

> **Flexible method**: flexible methods can fit many different shapes of *f*, but the resulting model is complex, hard to compute or interpret by hand, and more prone to overfitting. Example algorithms are *support vector machines, K-nearest neighbours, deep decision trees, random forests with deep trees, and neural networks*.


### Answer

(a) A **flexible method** is generally better. With a very large sample and few predictors, an **inflexible algorithm** is likely to underfit: it cannot capture the relationships present in the records, so *the model learns the patterns in the data poorly.* A flexible method has enough observations to *learn those patterns* without overfitting, so it generalizes better at model evaluation.

(b) When the number of predictors is large and the number of observations is small, using a flexible method can easily lead to overfitting.
Because there are too few data points relative to the number of predictors, a flexible method can fit the training data almost perfectly, essentially memorizing the noise instead of learning the true relationship.
As a result, it performs poorly on unseen data.
Therefore, in this situation, an inflexible method is preferable, since it imposes more structure on the model and is less likely to overfit, allowing for better generalization despite limited data.
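The overfitting argument can be sketched numerically. The example below is an assumption-laden illustration, not part of the exercise: it uses n = 20 observations and p = 25 synthetic Gaussian predictors, of which only the first carries signal, takes unpenalized least squares as the flexible method, and takes ridge regression with an arbitrary, untuned penalty as the less flexible one.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, n_test, noise_sd = 20, 25, 500, 1.0
beta = np.zeros(p)
beta[0] = 2.0          # assumption: only the first predictor matters
ridge_penalty = 20.0   # illustrative value, not tuned

def mse(y, pred):
    return float(np.mean((y - pred) ** 2))

train_flex, test_flex, test_rigid = [], [], []
for _ in range(200):  # average over many simulated data sets
    X = rng.normal(size=(n, p))
    y = X @ beta + rng.normal(0.0, noise_sd, n)
    X_new = rng.normal(size=(n_test, p))
    y_new = X_new @ beta + rng.normal(0.0, noise_sd, n_test)

    # Flexible: unpenalized least squares. With p >= n it reproduces
    # the training responses exactly, noise included.
    b_flex = np.linalg.lstsq(X, y, rcond=None)[0]
    # Less flexible: ridge regression shrinks every coefficient toward zero.
    b_rigid = np.linalg.solve(X.T @ X + ridge_penalty * np.eye(p), X.T @ y)

    train_flex.append(mse(y, X @ b_flex))
    test_flex.append(mse(y_new, X_new @ b_flex))
    test_rigid.append(mse(y_new, X_new @ b_rigid))

print("flexible  train MSE:", np.mean(train_flex))
print("flexible  test  MSE:", np.mean(test_flex))
print("penalized test  MSE:", np.mean(test_rigid))
```

The flexible fit has essentially zero training error but a much larger test error, which is the "memorizing the noise" behaviour described above; the penalized fit gives up training accuracy in exchange for a lower average test error under these assumptions.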

(c) The relationship between the predictors and the dependent variable is highly non-linear. Therefore, a **flexible method** is better: it has enough freedom to follow the non-linear pattern between the independent and dependent variables, whereas an inflexible method such as linear regression would be strongly biased.
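As a hedged illustration, the sketch below compares a straight-line fit with a hand-written K-nearest-neighbours regressor on synthetic data. The true function `sin(3x)`, the noise level, and `k=10` are all assumptions made for the example, and `knn_predict` is a hypothetical helper, not a library function.

```python
import numpy as np

def knn_predict(x_train, y_train, x_new, k):
    """Predict by averaging the responses of the k nearest training points (1-D)."""
    dist = np.abs(x_new[:, None] - x_train[None, :])
    # After partitioning at position k - 1, the first k columns hold the k smallest distances.
    nearest = np.argpartition(dist, k - 1, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

def true_f(x):
    # Assumed highly non-linear relationship (hypothetical).
    return np.sin(3.0 * x)

rng = np.random.default_rng(4)
x_train = rng.uniform(-3.0, 3.0, 500)
y_train = true_f(x_train) + rng.normal(0.0, 0.2, 500)
x_test = rng.uniform(-3.0, 3.0, 1000)
y_test = true_f(x_test) + rng.normal(0.0, 0.2, 1000)

# Inflexible: straight line fitted by least squares.
slope, intercept = np.polyfit(x_train, y_train, 1)
linear_mse = float(np.mean((y_test - (slope * x_test + intercept)) ** 2))

# Flexible: K-nearest neighbours.
knn_mse = float(np.mean((y_test - knn_predict(x_train, y_train, x_test, k=10)) ** 2))

print(f"linear test MSE: {linear_mse:.3f}")
print(f"KNN    test MSE: {knn_mse:.3f}")
```

The line cannot bend to follow the oscillation, so its error is dominated by bias, while the KNN fit tracks the curve locally.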

(d) σ² = Var(ϵ) is the variance of the error term, i.e. the noise in the response that no model can explain (the irreducible error); it is not the spread of the data around its mean. When this variance is very high, a **flexible method** generally performs worse, because it follows the noise in the training sample and its fit changes a lot from one sample to the next. An **inflexible method** is better here because it averages over the noise and generalizes more stably. If part of the apparent spread actually comes from a **high number of outliers** or from a genuinely **complex relationship** between the independent and dependent variables, that is a separate issue: outliers should be investigated before fitting, and a complex relationship is the situation in part (c).
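One way to see how the noise level changes the comparison is to keep the data and the two methods fixed and vary only the error variance. The sketch below assumes a true function `sin(2x)`, 40 equally spaced training points, and polynomial fits of degree 1 (inflexible) and degree 8 (flexible); these are illustrative choices, not part of the exercise.

```python
import numpy as np

def true_f(x):
    # Assumed true relationship (hypothetical).
    return np.sin(2.0 * x)

def mean_test_mse(noise_sd, degrees, rng, n_train=40, n_test=1000, n_reps=200):
    """Average test MSE of polynomial fits over repeated noisy samples."""
    x_train = np.linspace(-2.0, 2.0, n_train)
    totals = {d: 0.0 for d in degrees}
    for _ in range(n_reps):
        y_train = true_f(x_train) + rng.normal(0.0, noise_sd, n_train)
        x_test = rng.uniform(-2.0, 2.0, n_test)
        y_test = true_f(x_test) + rng.normal(0.0, noise_sd, n_test)
        for d in degrees:
            coefs = np.polyfit(x_train, y_train, d)
            totals[d] += np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
    return {d: totals[d] / n_reps for d in degrees}

rng = np.random.default_rng(2)
results = {}
for noise_sd in (0.1, 3.0):  # low vs. high error variance
    for degree, value in mean_test_mse(noise_sd, (1, 8), rng).items():
        results[(noise_sd, degree)] = value
        print(f"noise sd={noise_sd}  degree={degree}  mean test MSE={value:.3f}")
```

With little noise the flexible fit wins because the straight line is biased. With a lot of noise the ranking flips: the extra coefficients of the flexible fit mostly chase noise, and that added variance outweighs the bias of the straight line.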

## Classification vs Regression (prediction or inference)

2. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

(a) We collect a set of data on the top 500 firms in the US. For each
firm we record profit, number of employees, industry and the
CEO salary. We are interested in understanding which factors
affect CEO salary.

(b) We are considering launching a new product and wish to know
whether it will be a success or a failure. We collect data on 20
similar products that were previously launched. For each product we have recorded whether it was a success or failure, price
charged for the product, marketing budget, competition price,
and ten other variables.

(c) We are interested in predicting the % change in the USD/Euro
exchange rate in relation to the weekly changes in the world
stock markets. Hence we collect weekly data for all of 2012. For
each week we record the % change in the USD/Euro, the %
change in the US market, the % change in the British market,
and the % change in the German market.

### Answer

(a) We are interested in which factors affect CEO salary, so this is an **inference problem**: we want to estimate *f* and find out which **predictors** have a significant effect on the **dependent variable**. It is a **regression** problem because the dependent variable (salary) is a *continuous value*. Here n = 500 (firms) and p = 3 (profit, number of employees, industry).
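To make "estimating *f* for inference" concrete, here is a minimal sketch of ordinary least squares with standard errors and t-statistics in plain NumPy. The data are synthetic, the variable names (`profit`, `employees`, `is_tech`) are hypothetical, industry is simplified to a single dummy variable, and the assumed data-generating process (salary depends on profit and industry but not on employees) is invented purely for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500  # number of firms, as in the exercise

# Synthetic, standardized predictors (hypothetical values).
profit = rng.normal(size=n)
employees = rng.normal(size=n)
is_tech = rng.integers(0, 2, size=n).astype(float)  # one industry dummy

# Assumed data-generating process: employees has no effect on salary.
salary = 1.0 + 0.8 * profit + 0.0 * employees + 0.5 * is_tech + rng.normal(0.0, 1.0, n)

X = np.column_stack([np.ones(n), profit, employees, is_tech])
names = ["intercept", "profit", "employees", "is_tech"]

# Least-squares coefficient estimates.
coef = np.linalg.lstsq(X, salary, rcond=None)[0]

# Classical OLS standard errors: sigma^2 * (X'X)^{-1}.
resid = salary - X @ coef
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stats = coef / se

for name, b, s, t in zip(names, coef, se, t_stats):
    print(f"{name:10s} coef={b: .3f}  se={s:.3f}  t={t: .2f}")
```

For inference, the interest is in the coefficients and their uncertainty rather than in the predictions: a large absolute t-statistic suggests the predictor is associated with salary after accounting for the others.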

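(b) We want to predict whether the new product will be a success or a failure, using data on 20 similar products that were launched previously. The response is categorical (success or failure), so this is a **classification problem**, and the goal is **prediction**. Here n = 20 (products) and p = 13 (price, marketing budget, competition price, and ten other variables).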

### Inference problem cheatsheet


| Method                               | Suitable For                                          | Why It’s Good for Inference                                                                                  |
| ------------------------------------ | ----------------------------------------------------- | ------------------------------------------------------------------------------------------------------------ |
| **Linear Regression (OLS)**          | Continuous dependent variable                         | Gives direct coefficient estimates → you can interpret how much *y* changes with a unit change in *x_i*.     |
| **Multiple Linear Regression**       | Continuous *y*, multiple predictors                   | Helps you see the *partial effect* of each predictor while holding others constant.                          |
| **Logistic Regression**              | Binary classification (e.g., success/failure)         | Estimates *odds ratios* — you can interpret how each predictor affects the log-odds of success.              |
| **Poisson Regression**               | Count data (e.g., number of sales, clicks)            | Models the relationship between predictors and count outcomes.                                               |
| **ANOVA (Analysis of Variance)**     | Categorical predictors, continuous dependent variable | Tests whether group means differ significantly.                                                              |
| **ANCOVA (Analysis of Covariance)**  | Continuous + categorical predictors                   | Tests effects of categorical predictors while controlling continuous ones.                                   |
| **Generalized Linear Models (GLMs)** | Various data types                                    | Extend linear regression to non-normal distributions (logistic, Poisson, etc.).                              |
| **Linear Mixed Models**              | Hierarchical or grouped data                          | Useful when data have random effects (e.g., firms nested within industries).                                 |
| **Stepwise Regression / LASSO**      | Variable selection                                    | Identifies which predictors matter most (good for inference with many predictors).                           |


### Prediction problem cheatsheet




Regression methods:

| Method                                    | Type            | Flexibility | Notes                                       |
| ----------------------------------------- | --------------- | ----------- | ------------------------------------------- |
| **Linear Regression**                     | Inflexible      | Low         | Good for simple, linear trends              |
| **Polynomial Regression**                 | Flexible        | Medium      | Captures non-linear patterns                |
| **Decision Tree Regressor**               | Flexible        | Medium      | Splits data into regions; interpretable     |
| **Random Forest Regressor**               | Very Flexible   | High        | Combines many trees; reduces overfitting    |
| **Gradient Boosting (XGBoost, LightGBM)** | Very Flexible   | High        | Excellent predictive accuracy               |
| **k-Nearest Neighbors (KNN)**             | Flexible        | Medium      | Based on local neighborhood patterns        |
| **Support Vector Regression (SVR)**       | Flexible        | High        | Works well for non-linear relationships     |
| **Neural Networks / Deep Learning**       | Highly Flexible | Very High   | For complex, high-dimensional relationships |


Classification methods:

| Method                                              | Type            | Flexibility | Notes                                             |
| --------------------------------------------------- | --------------- | ----------- | ------------------------------------------------- |
| **Logistic Regression**                             | Inflexible      | Low         | Good for simple binary classification             |
| **Naive Bayes**                                     | Inflexible      | Medium      | Based on probability & Bayes theorem              |
| **Decision Tree Classifier**                        | Flexible        | Medium      | Interpretable and visual                          |
| **Random Forest Classifier**                        | Flexible        | High        | Reduces overfitting, robust                       |
| **Gradient Boosting (XGBoost, CatBoost, LightGBM)** | Very Flexible   | High        | Among the best for structured data                |
| **k-Nearest Neighbors (KNN)**                       | Flexible        | Medium      | Simple but sensitive to noise                     |
| **Support Vector Machine (SVM)**                    | Flexible        | High        | Works well for complex classification boundaries  |
| **Neural Networks / Deep Learning**                 | Highly Flexible | Very High   | For large and complex datasets (e.g. image, text) |
