### This Jupyter notebook contains my solutions to exercises from the book "An Introduction to Statistical Learning" by Daniela Witten, Robert Tibshirani, Trevor Hastie, Gareth James and Jonathan Taylor.

# Table of Contents

* [Exercise 2.1](#2.1)
* [Exercise 2.2](#2.2)
* [Exercise 2.3](#2.3)
* [Exercise 2.4](#2.4)
* [Exercise 2.5](#2.5)
* [Exercise 2.6](#2.6)
* [Exercise 2.7](#2.7)

# Conceptual tasks

# 2.1 

### For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.


(a) – The sample size $n$ is extremely large, and the number of predictors $p$ is small. 

**Answer**: A flexible method should generally perform better here. With an extremely large $n$ and few predictors, there is enough data to estimate the true, unknown relationship $f$ closely without overfitting, so the flexible method can capture patterns that an inflexible method would miss.

---

(b) — The number of predictors $p$ is extremely large, and the number of observations $n$ is small.

**Answer**: An inflexible method should generally perform better here. With so few observations relative to the number of predictors, a flexible method is very likely to overfit. The caveat is that if $f$ is in fact highly non-linear, the inflexible method will also be inadequate, because it cannot represent that relationship.

---

(c) — The relationship between the predictors and response is highly non-linear.

**Answer**: Flexible methods capture highly non-linear relationships better, so a flexible method should generally perform better than an inflexible one here.

---

(d) — The variance of the error terms, i.e. $\sigma^{2} = Var({\epsilon})$, is extremely high:

**Answer**: With such a high error variance, neither method can be very accurate, because the irreducible error dominates. A flexible method, however, tends to fit the random noise in the training data, which leads to overfitting. An inflexible method is less sensitive to that noise, so it should generally perform better and is the more appropriate choice here.
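The reasoning in parts (c) and (d) can be checked with a small simulation. The sketch below is only an illustration under assumed settings, not part of the book's solution: the helper names (`fit_line`, `fit_1nn`, `compare`), the sine-shaped $f$, the noise levels and the sample sizes are all hypothetical choices. It uses a straight-line fit as the inflexible method and 1-nearest-neighbor regression as the flexible one, and compares their test mean squared errors on the same simulated data.

```python
import math
import random


def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1 * x (the inflexible method)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return lambda x: b0 + b1 * x


def fit_1nn(xs, ys):
    """1-nearest-neighbor regression (the flexible method)."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def compare(f, sigma, seed, n_train=200, n_test=500):
    """Return (test MSE of the line, test MSE of 1-NN) on the same data."""
    rng = random.Random(seed)

    def sample(n):
        xs = [rng.random() for _ in range(n)]
        ys = [f(x) + rng.gauss(0.0, sigma) for x in xs]
        return xs, ys

    x_tr, y_tr = sample(n_train)
    x_te, y_te = sample(n_test)

    def mse(model):
        return sum((y - model(x)) ** 2 for x, y in zip(x_te, y_te)) / n_test

    return mse(fit_line(x_tr, y_tr)), mse(fit_1nn(x_tr, y_tr))


# Scenario (c): highly non-linear f, low noise (assumed values).
lin_c, knn_c = compare(lambda x: math.sin(2 * math.pi * x), sigma=0.1, seed=0)
# Scenario (d): simple linear f, very noisy responses (assumed values).
lin_d, knn_d = compare(lambda x: 2.0 * x, sigma=1.0, seed=0)

print(f"(c) non-linear f, low noise : line={lin_c:.3f}  1-NN={knn_c:.3f}")
print(f"(d) linear f, high noise    : line={lin_d:.3f}  1-NN={knn_d:.3f}")
```

Under these assumptions the flexible method should have the lower test error in scenario (c) and the inflexible method the lower test error in scenario (d), which matches the answers above.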


# 2.2

### Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide $n$ and $p$.

(a) — We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

**Answer**: 

* $n$ is the number of samples; here that is the number of firms, so $n = 500$,
* $p$ is the number of features; here they are *profit*, *number of employees* and *industry*, so $p = 3$,
* *CEO salary* is the *target* (or *dependent variable*) that we want to investigate.

This is an *inference* problem, because we want to understand how the features $X \in \mathbb{R}^{n \times p}$ relate to and affect *CEO salary*.

Salary is a **quantitative** (or **numerical**) variable, so this is a **regression** problem.

---

(b) — We are considering launching a new product and wish to know whether it will be a *success* or a *failure*. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price and ten other variables.

**Answer**:

* $n$ is the number of samples; here that is the number of products, so $n = 20$,
* $p$ is the number of features; here they are *price*, *marketing budget*, *competition price* and *ten other variables*, so $p = 13$,
* the target is whether each product was a *success* or a *failure*.

This is a **classification** problem, because the *response* (in other words, the *dependent variable* or *target*) is binary (i.e. *success* or *failure*). We also do not care about the form of the true underlying relationship $f$; it can be treated as a *black box*, as long as the estimate $\hat{f}$ gives accurate **predictions**.

---

(c) — We are interested in predicting the $\%$ change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the $\%$ change in the USD/Euro, the $\%$ change in the US market, the $\%$ change in the British market, and the $\%$ change in the German market.

**Answer**:

* $n$ is the number of samples; here that is the number of weeks in 2012, [which is 52](https://www.epochconverter.com/weeks/2012#:~:text=There%20are%2052%20weeks%20in%202012.), so $n = 52$,
* $p$ is the number of features; here they are the *$\%$ change in the US market*, the *$\%$ change in the British market* and the *$\%$ change in the German market*, so $p = 3$,
* the *dependent variable* (or *target*) is the *$\%$ change in the USD/Euro exchange rate*; it is a **numerical** (or **quantitative**) value, so this is a **regression** problem.

The task description already states what we are interested in: **prediction**.

# 2.3

### We now revisit the bias-variance decomposition.

(a) Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.

**Answer**:

![image.png](attachment:e1deba9b-a289-4d77-945a-5190064353f4.png)

![image.png](attachment:181e28ae-64c3-459b-990a-3188e18250d6.png)

---

(b) Explain why each of the five curves has the shape displayed in
part (a).

**Answer**:

* as flexibility (*$x$-axis*) increases, the *variance* generally increases (the <span style="color:#45f542"> green line</span> in the plot), because a more flexible method follows each particular training set more closely, so its fit changes more from one training set to another,
* as flexibility (*$x$-axis*) increases, the *squared bias* decreases (the <span style="color:red">red line</span> in the plot), because a more flexible method can approximate the true $f$ more closely,
* as flexibility (*$x$-axis*) increases, the *training error* decreases steadily (the <span style="color:#2237d6">blue line</span> in the plot), because the method fits the training data ever more closely,
* as flexibility (*$x$-axis*) increases, the *test error* (the <span style="color:purple">purple line</span> in the plot) first decreases, as long as the drop in *bias* (the <span style="color:red">red line</span> in the plot) outweighs the rise in *variance* (the <span style="color:#45f542"> green line</span> in the plot). Once the *bias* flattens out (it no longer decreases, or decreases only very slowly) while the *variance* keeps increasing, the *test error* rises along with the variance. That is why the *test error* has a $\text{U-shape}$,
* the *irreducible error* is a flat line (the <span style="color:#224f15">dark green line</span> in the plot), because it does not depend on the method or its flexibility. It is a lower bound on the *test error* (the <span style="color:purple">purple line</span> in the plot): in simple terms, the expected *test error* cannot be less than the *irreducible error*.
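The shapes described above follow from the decomposition of the expected test error at a point $x_0$: $E\left[(y_0 - \hat{f}(x_0))^2\right] = \mathrm{Var}(\hat{f}(x_0)) + [\mathrm{Bias}(\hat{f}(x_0))]^2 + \mathrm{Var}(\epsilon)$. As a rough numerical check, the sketch below estimates each term by simulation. Everything specific in it is an assumption made for illustration: the sine-shaped true function `f`, the noise level `SIGMA`, the sample sizes, and the use of $K$-nearest-neighbor regression, where a smaller $K$ plays the role of higher flexibility.

```python
import math
import random


def f(x):
    return math.sin(2 * math.pi * x)  # assumed "true" regression function


SIGMA = 0.3          # assumed noise standard deviation
N_TRAIN = 50         # observations per simulated training set
REPS = 100           # number of simulated training sets
KS = [25, 10, 5, 1]  # from least flexible (K = 25) to most flexible (K = 1)
GRID = [(i + 0.5) / 20 for i in range(20)]  # fixed test points x0


def knn_predict(train, x0, k):
    """Average the responses of the k training points closest to x0."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


rng = random.Random(0)
preds = {k: [[] for _ in GRID] for k in KS}  # preds[k][j]: predictions at GRID[j]
train_mse = {k: 0.0 for k in KS}

for _ in range(REPS):
    xs = [rng.random() for _ in range(N_TRAIN)]
    train = [(x, f(x) + rng.gauss(0.0, SIGMA)) for x in xs]
    for k in KS:
        for j, x0 in enumerate(GRID):
            preds[k][j].append(knn_predict(train, x0, k))
        err = sum((y - knn_predict(train, x, k)) ** 2 for x, y in train) / N_TRAIN
        train_mse[k] += err / REPS

results = {}
for k in KS:
    bias2 = 0.0
    variance = 0.0
    for j, x0 in enumerate(GRID):
        p = preds[k][j]
        mean = sum(p) / REPS
        bias2 += (mean - f(x0)) ** 2 / len(GRID)
        variance += sum((v - mean) ** 2 for v in p) / REPS / len(GRID)
    results[k] = {
        "bias2": bias2,
        "variance": variance,
        "train_mse": train_mse[k],
        # expected test error = squared bias + variance + irreducible error
        "test_mse": bias2 + variance + SIGMA ** 2,
    }
    print(f"K={k:>2}  bias^2={bias2:.3f}  variance={variance:.3f}  "
          f"train={train_mse[k]:.3f}  test={results[k]['test_mse']:.3f}")
```

Reading the printed rows from top to bottom corresponds to moving from left to right along the $x$-axis of the sketch in part (a). With $K = 1$ every training point is its own nearest neighbor, so the training error is zero even though the test error stays above $\sigma^2$.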

# 2.4

### You will now think of some real-life applications for statistical learning.

(a) Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the
goal of each application inference or prediction? Explain your answer.

**Answer**:

- prediction of whether a mole is cancerous. The predictors could be, for example, the *diameter of the mole*, the *area of the mole*, *whether its boundaries are soft or sharp*, the *age of the patient* and the *location of the mole*. The response is binary (*True/False*): whether or not the patient has a cancerous mole. The goal is prediction,
- understanding what drives the poisoning of lakes. The response is whether a lake is poisoned (*True/False*), and the predictors are measurements such as the concentrations of phosphate, O2 and iron, wind speed, etc. The goal is inference, because we want to understand the impact of each factor,
- prediction of a fungus species. The predictors are the pixels of a photo of the fungus, and the response is its species. The goal is prediction.

---

(b) Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

**Answer**:

- prediction of temperature; the predictors would be *atmospheric pressure*, *wind speed* and *wind direction*, and the response would be the temperature,
- prediction of the price of an apartment; the predictors would be the *number of floors*, *whether it is located in the city center or on the outskirts* and the *area of the apartment*, and the response would be the price,
- prediction of the number of customers in a store on a given day; the predictors would be the *day of the week*, *promotions in the store*, *public holidays*, *season* and *weather*, and the response would be the number of customers on that day.

--- 

(c) Describe three real-life applications in which cluster analysis might be useful.

**Answer**:


- market segmentation: cluster analysis can be used to group clients with similar preferences in order to recommend products that they might like. The features could be consumer demographics (**age**, **gender**, **income**), consumer behaviors (**purchase history**, **product preferences**, **brand loyalty**), and psychographic information (**attitudes**, **values**, **lifestyles**),
- bioinformatics and genomics: cluster analysis is often used to group genes with similar functions or expression patterns. This can aid in the discovery of new gene functions and the understanding of disease processes. The features might include gene expression levels across different conditions or time points, **sequence similarity**, and **functional annotation data**,
- fraud detection: cluster analysis can help identify unusual patterns that could indicate fraudulent activity. For instance, transactions could be clustered based on features like **transaction amount**, **frequency**, **location**, and **time**. Clusters of transactions that deviate significantly from the norm may be flagged for further investigation as potential fraud. 

# 2.5

### What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

**Answer**:

Very flexible methods vs. less flexible ones:
- Advantages of very flexible methods:
    - they can capture much more sophisticated relationships than less flexible methods,
    - their bias tends to be lower, i.e. the error in estimating the underlying relationship is smaller,
- Advantages of less flexible methods:
    - they are easier to interpret,
    - they have lower variance, so $\hat{f}$ does not change much when the training data change,
- Disadvantages of very flexible methods:
    - they need a lot of data to estimate the underlying relationship $f$ well,
    - they lack interpretability (these models tend to be much more complicated),
    - they tend to overfit, because they capture random patterns in the training set that are not present in the test set, which is why the test error is much higher than the training error,
    - their variance is higher (a small change in the training set can result in a large change in the estimate $\hat{f}$).

A more flexible approach is preferred when we have a very large number of samples.
A less flexible approach can be preferred when the underlying relationship between $X$ and $y$ appears to be simple (for example, close to linear) or when we do not have a large number of samples.

---

# 2.6

### Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?

**Answer**:

In a parametric approach we first assume a form for the relationship between $X$ and $y$. Then we only need to estimate its coefficients. For example, if the relationship is assumed to be linear, we can use *linear regression* to estimate the $p + 1$ coefficients $\beta_0, \beta_1, \dots, \beta_p$ for predictors $X \in \mathbb{R}^{n \times p}$.

The disadvantage of a parametric method is that the true, unknown relationship between $X$ and $y$ is very often much more complicated than our assumption. As a result, such a method may give poor predictions.

The advantage is that the problem reduces to estimating those coefficients, instead of having to come up with an entirely new function that describes the relationship.

In a non-parametric approach we make no explicit assumption about the form of $f$. The algorithm tries to estimate the relationship as closely as possible and can fit a very wide range of forms of $f$, but it needs a large number of samples to do so accurately.
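To make the parametric idea concrete, here is a minimal sketch for the simplest case, $p = 1$, where the assumed linear form leaves only $p + 1 = 2$ coefficients to estimate. The data are simulated, and the "true" coefficients, the noise level and the names `TRUE_B0`, `TRUE_B1` and `predict` are assumptions made for this illustration only. The estimates use the standard least-squares formulas for simple linear regression.

```python
import random

rng = random.Random(1)

TRUE_B0, TRUE_B1 = 1.0, 2.0  # assumed true coefficients of the linear model
n = 200

# Simulate n observations from y = b0 + b1 * x + noise.
x = [rng.uniform(0.0, 10.0) for _ in range(n)]
y = [TRUE_B0 + TRUE_B1 * xi + rng.gauss(0.0, 0.5) for xi in x]

# Least-squares estimates of the two coefficients.
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar


def predict(x_new):
    """Prediction from the fitted parametric model: just two numbers are needed."""
    return b0 + b1 * x_new


print(f"estimated intercept: {b0:.3f}, estimated slope: {b1:.3f}")
print(f"prediction at x = 5: {predict(5.0):.3f}")
```

Once `b0` and `b1` are estimated, the training data are no longer needed to make predictions. A non-parametric method such as $K$-nearest neighbors, by contrast, has to keep the training observations around, and its estimate is only as good as the amount of data near the point of interest.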

--- 

# 2.7

### Suppose we wish to use this data set to make a prediction for $Y$ when $X_1 = X_2 = X_3 = 0$ using $K$-nearest neighbors.

| Obs. | $X_1$ | $X_2$ | $X_3$ | $Y$     |
|------|-----|-----|-----|-------|
| 1    | 0   | 3   | 0   | Red   |
| 2    | 2   | 0   | 0   | Red   |
| 3    | 0   | 1   | 3   | Red   |
| 4    | 0   | 1   | 2   | Green |
| 5    | -1  | 0   | 1   | Green |
| 6    | 1   | 1   | 1   | Red   |



(a) Compute the Euclidean distance between each observation and the test point, $X_1 = X_2 = X_3 = 0$

**Answer**:


The Euclidean distance between two points $a$ and $b$ with $p$ coordinates is:

$d(a, b) = \sqrt{\sum_{j=1}^{p}(a_j - b_j)^{2}}$

In our case the test point is $x_0 = (0, 0, 0)$, and $d_i$ denotes the distance from observation $i$ in the table above to $x_0$:

$d_1 = \sqrt{(0 - 0)^{2} + (3 - 0)^{2} + (0 - 0)^{2}} = \sqrt{9} = 3$

$d_2 = \sqrt{(2 - 0)^{2} + (0 - 0)^{2} + (0 - 0)^{2}} = \sqrt{4} = 2$

$d_3 = \sqrt{(0 - 0)^{2} + (1 - 0)^{2} + (3 - 0)^{2}} = \sqrt{10} \approx 3.16$

$d_4 = \sqrt{(0 - 0)^{2} + (1 - 0)^{2} + (2 - 0)^{2}} = \sqrt{5} \approx 2.24$

$d_5 = \sqrt{(-1 - 0)^{2} + (0 - 0)^{2} + (1 - 0)^{2}} = \sqrt{2} \approx 1.41$

$d_6 = \sqrt{(1 - 0)^{2} + (1 - 0)^{2} + (1 - 0)^{2}} = \sqrt{3} \approx 1.73$
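The hand calculations above can be double-checked in a few lines of Python. This is just a small verification sketch; the names `observations`, `x0`, `euclidean` and `distances` are chosen here for illustration, and the data are copied from the table.

```python
import math

# Predictor values (X1, X2, X3) from the table, keyed by observation number.
observations = {
    1: (0, 3, 0),
    2: (2, 0, 0),
    3: (0, 1, 3),
    4: (0, 1, 2),
    5: (-1, 0, 1),
    6: (1, 1, 1),
}
x0 = (0, 0, 0)  # the test point


def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


distances = {i: euclidean(x, x0) for i, x in observations.items()}

for i, d in distances.items():
    print(f"obs {i}: {d:.2f}")
```

The printed values should agree with the distances computed by hand.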

---

(b) What is our prediction with $K = 1$? Why?

**Answer**:

With $K = 1$ the prediction for the test point $X_1 = X_2 = X_3 = 0$ is $\text{Green}$. We consider only the single nearest neighbor, which is observation #5, $(-1, 0, 1)$, at a Euclidean distance of $\sqrt{2} \approx 1.41$, and its class is $\text{Green}$.

---

(c) What is our prediction for $K = 3$? Why?

**Answer**:

In this case we consider the three closest neighbors (i.e. the three smallest Euclidean distances).

These are the following observations (with their Euclidean distances to the test point): #5 at 1.41, #6 at 1.73 and #2 at 2.

Next, we count the classes to which those three neighbors belong:

* the class of #5 is $\text{Green}$,
* the class of #6 is $\text{Red}$,
* the class of #2 is $\text{Red}$.

The estimated conditional probability that the test point belongs to class $\text{Red}$ is $\frac{2}{3}$.

The estimated conditional probability that the test point belongs to class $\text{Green}$ is $\frac{1}{3}$.

Finally, we assign the test point to class $\text{Red}$, because $\frac{2}{3} \gt \frac{1}{3}$.
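The voting procedure used in parts (b) and (c) can be written as a short function. The sketch below is a hand-rolled illustration rather than a library implementation; the names `training` and `knn_classify` are chosen here for convenience, and ties between classes (which do not occur for $K = 1$ or $K = 3$ on this data) are not handled in any special way.

```python
import math
from collections import Counter

# Training data from the table: ((X1, X2, X3), class label).
training = [
    ((0, 3, 0), "Red"),
    ((2, 0, 0), "Red"),
    ((0, 1, 3), "Red"),
    ((0, 1, 2), "Green"),
    ((-1, 0, 1), "Green"),
    ((1, 1, 1), "Red"),
]


def knn_classify(data, x0, k):
    """Return the majority class among the k nearest neighbors of x0,
    together with the estimated class probabilities (vote fractions)."""
    by_distance = sorted(data, key=lambda obs: math.dist(obs[0], x0))
    votes = Counter(label for _, label in by_distance[:k])
    probabilities = {label: count / k for label, count in votes.items()}
    return votes.most_common(1)[0][0], probabilities


x0 = (0, 0, 0)
for k in (1, 3):
    label, probabilities = knn_classify(training, x0, k)
    print(f"K = {k}: predicted class {label}, vote fractions {probabilities}")
```

For $K = 1$ only observation #5 votes, and for $K = 3$ the votes come from observations #5, #6 and #2, as in the answers above.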

---

(d) If the Bayes decision boundary in this problem is highly non-linear, then would we expect the best value for $K$ to be large or small? Why?

**Answer**:

If the Bayes decision boundary is highly non-linear, we need more flexibility to capture it. In $K$-nearest neighbors, a smaller $K$ gives a more flexible decision boundary, so we should expect the best value of $K$ to be small. The reason is that the algorithm classifies a point by a vote among its $K$ nearest neighbors: with a small $K$ the vote depends only on points in a small local region, so the boundary can follow local structure, whereas a large $K$ averages over a wide region and produces a smoother, closer to linear boundary.