<font size='+1' color=red>**Attention:**</font> Data cleaning and the other preprocessing steps covered in the first assignment are not always necessary, but you may need some of them depending on the task at hand. We do not mention them explicitly each time; it is your job to figure out when to apply them.

<font size='+1' color=red>**Attention 2:**</font> In your implementations, always use `random_state=42` so that your results are reproducible.

## <font color='#D61E85' size='+3'>**Q1:**</font> <font size='+2'> **PCA for Classification** </font>

In this question we work with the Fashion-MNIST dataset. Fashion-MNIST is a dataset of $70,000$ grayscale images ($28 \times 28$ pixels) of fashion products from $10$ categories, with $7,000$ images per category. The training set has $60,000$ images and the test set has $10,000$ images. <br>
<font color=red>**Note:**</font> You can download it from any source you want. <br>
<font color=red>**Note:**</font> Take the first $60,000$ instances as the training set and the remaining $10,000$ instances as the test set.

Using the explained variance ratio and a threshold such as $95\%$, you probably already know how to choose the number of dimensions for PCA. However, when dimensionality reduction is a preprocessing step for a supervised learning task, it is important to consider how the number of dimensions affects the overall performance of the model. Consider the classification task on the dataset at hand and try to find the best number of PCA components with respect to that task. You should use the `RandomForestClassifier`, `KNeighborsClassifier`, `DecisionTreeClassifier`, and `AdaBoostClassifier`. Compare your results (number of dimensions, accuracy, precision, recall, f1-score, and confusion matrix) and explain why the best number of dimensions differs between models. Don't forget to analyze your results. [Hint: build a pipeline and tune the hyperparameters of PCA and your model jointly.]
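The threshold rule mentioned above can be sketched as follows. This is only an illustration under assumptions: it uses Scikit-learn's small bundled digits dataset as a stand-in for Fashion-MNIST so that it runs quickly, and the names `X_demo`, `d_95`, and `pca_95` are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data (8x8 digits, 64 features); an assumption so the sketch runs fast
X_demo, _ = load_digits(return_X_y=True)

# Fit PCA with all components and accumulate the explained variance ratio
pca_all = PCA(svd_solver='full', random_state=42).fit(X_demo)
cumulative = np.cumsum(pca_all.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%
d_95 = int(np.argmax(cumulative >= 0.95)) + 1

# Passing a float in (0, 1) asks PCA to apply the same rule internally
pca_95 = PCA(n_components=0.95, svd_solver='full', random_state=42).fit(X_demo)

print("components by hand:", d_95)
print("components chosen by PCA:", pca_95.n_components_)
```

Both routes should agree on the number of components; the float form is the one that is convenient to tune inside a pipeline.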

At the end, perform the hyperparameter tuning but this time without considering the PCA preprocessing step. Compare your results with previous ones.

<font color='#8FCF26' size='+2'>**A1:**</font>

### 1. Import Necessary Libraries


In [2]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
# from scipy.stats import randint

### 2. Load the Dataset

In [3]:
# Load the dataset
fashion_mnist = fetch_openml('Fashion-MNIST', version=1)
X = fashion_mnist.data
y = fashion_mnist.target

# Split the dataset
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]



### 3. Define the Pipelines and Parameters for Grid Search

In [4]:
pipelines = {
    'rf': Pipeline([('pca', PCA()), ('classifier', RandomForestClassifier(random_state=42))]),
    'knn': Pipeline([('pca', PCA()), ('classifier', KNeighborsClassifier())]),
    'dt': Pipeline([('pca', PCA()), ('classifier', DecisionTreeClassifier(random_state=42))]),
    'ada': Pipeline([('pca', PCA()), ('classifier', AdaBoostClassifier(random_state=42))])
}

param_grid = {
    'rf': {'pca__n_components': [0.85, 0.90, 0.95], 'classifier__n_estimators': [100, 200]},
    'knn': {'pca__n_components': [0.85, 0.90, 0.95], 'classifier__n_neighbors': [3, 5, 7]},
    'dt': {'pca__n_components': [0.85, 0.90, 0.95], 'classifier__max_depth': [10, 20, 30]},
    'ada': {'pca__n_components': [0.85, 0.90, 0.95], 'classifier__n_estimators': [50, 100]}
}


### 4. Train and Evaluate Each Model

In [None]:
results = {}
for name, pipeline in pipelines.items():
    # Randomized search over the joint PCA + classifier grid;
    # random_state is fixed so the sampled candidates are reproducible
    grid_search = RandomizedSearchCV(pipeline, param_grid[name], n_iter=5, cv=5, scoring='accuracy',
                                     random_state=42, verbose=1, n_jobs=-1)
    grid_search.fit(X_train, y_train)

    best_model = grid_search.best_estimator_
    predictions = best_model.predict(X_test)

    results[name] = {
        'Best Parameters': grid_search.best_params_,
        'Accuracy': accuracy_score(y_test, predictions),
        'Precision': precision_score(y_test, predictions, average='macro'),
        'Recall': recall_score(y_test, predictions, average='macro'),
        'F1 Score': f1_score(y_test, predictions, average='macro'),
        'Confusion Matrix': confusion_matrix(y_test, predictions)
    }

# Display results
for model, metrics in results.items():
    print(f"Model: {model}")
    for metric, value in metrics.items():
        print(f"{metric}: {value}")
    print("\n")


### 5. Analysis

#### RandomForestClassifier (rf)
- **PCA Components**: 0.9
- **Accuracy**: 0.8624
- **Precision**: 0.8609
- **Recall**: 0.8624
- **F1 Score**: 0.8608

**Analysis**: RandomForest performed well with a PCA component setting of 0.9. This indicates that capturing 90% of the variance in the data is sufficient for a complex ensemble method like RandomForest, which can handle a higher dimensional space effectively. The high accuracy and balanced precision-recall suggest that RandomForest is able to classify most classes accurately.

#### KNeighborsClassifier (knn)
- **PCA Components**: 0.95
- **Accuracy**: 0.8623
- **Precision**: 0.8631
- **Recall**: 0.8623
- **F1 Score**: 0.8616

**Analysis**: KNN needed slightly more features (95% variance) to achieve similar performance to RandomForest. This is understandable as KNN relies heavily on the feature space for making decisions based on the nearest neighbors. The slight increase in the number of dimensions may provide more distinctiveness between different classes for KNN.

#### DecisionTreeClassifier (dt)
- **PCA Components**: 0.85
- **Accuracy**: 0.7753
- **Precision**: 0.7770
- **Recall**: 0.7753
- **F1 Score**: 0.7760

**Analysis**: The DecisionTreeClassifier performed best with 85% variance, which is lower than the others. This might be because decision trees can overfit with too many features. A reduced feature set can sometimes help in preventing the model from fitting to noise in the data.

#### AdaBoostClassifier (ada)
- **PCA Components**: 0.95
- **Accuracy**: 0.5746
- **Precision**: 0.5769
- **Recall**: 0.5746
- **F1 Score**: 0.5626

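**Analysis**: AdaBoost with 95% variance components had the lowest performance among the models. AdaBoost is sensitive to noisy data and outliers, and the higher dimensional space might be introducing more complexity than the model can handle effectively. Another likely factor is that Scikit-learn's default base estimator for AdaBoost is a depth-1 decision tree (a stump), which is a very weak learner for a 10-class problem.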


#### **Overall Observation**
Different models require a different number of PCA components because each model has a unique way of handling and interpreting features. Models like RandomForest and KNN can benefit from a higher dimensional space as they can capture more complex patterns, while simpler models like DecisionTrees might perform better with fewer dimensions to prevent overfitting.

The varying performance across models also highlights the importance of considering both the model's nature and the feature space's dimensionality when performing tasks like classification. The results also demonstrate the utility of PCA in reducing dimensionality while preserving enough information for effective model training and prediction.
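The PCA values reported in the analysis above (0.85, 0.90, 0.95) are explained-variance thresholds, not dimension counts. A minimal sketch of how the resolved number of dimensions could be read from a fitted pipeline is shown below; it assumes the small bundled digits dataset as a stand-in, and `demo_pipeline` and `n_dims` are illustrative names.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Stand-in data so the example runs in a moment (assumption, not Fashion-MNIST)
X_demo, y_demo = load_digits(return_X_y=True)

demo_pipeline = Pipeline([
    ('pca', PCA(n_components=0.90, svd_solver='full', random_state=42)),
    ('classifier', KNeighborsClassifier(n_neighbors=5)),
])
demo_pipeline.fit(X_demo, y_demo)

# After fitting, n_components_ holds the integer number of kept dimensions
n_dims = int(demo_pipeline.named_steps['pca'].n_components_)
print(f"0.90 explained variance -> {n_dims} components")
```

The same attribute is available on `grid_search.best_estimator_.named_steps['pca']` after a search, which makes it possible to compare models by actual dimension count.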

### 6. Tuning Without the PCA Preprocessing Step

In [5]:
classifiers = {
    'rf': RandomForestClassifier(random_state=42),
    'knn': KNeighborsClassifier(),
    'dt': DecisionTreeClassifier(random_state=42),
    'ada': AdaBoostClassifier(random_state=42)
}

param_grid_no_pca = {
    'rf': {'n_estimators': [100, 200]},
    'knn': {'n_neighbors': [3, 5, 7]},
    'dt': {'max_depth': [10, 20, 30]},
    'ada': {'n_estimators': [50, 100]}
}

In [8]:
results_no_pca = {}
for name, classifier in classifiers.items():
    # random_state is fixed so the sampled candidates are reproducible
    grid_search_no_pca = RandomizedSearchCV(classifier, param_grid_no_pca[name], scoring='accuracy',
                                            n_iter=5, cv=5, random_state=42, verbose=1, n_jobs=-1)
    grid_search_no_pca.fit(X_train, y_train)

    best_model_no_pca = grid_search_no_pca.best_estimator_
    predictions_no_pca = best_model_no_pca.predict(X_test)

    results_no_pca[name] = {
        'Best Parameters': grid_search_no_pca.best_params_,
        'Accuracy': accuracy_score(y_test, predictions_no_pca),
        'Precision': precision_score(y_test, predictions_no_pca, average='macro'),
        'Recall': recall_score(y_test, predictions_no_pca, average='macro'),
        'F1 Score': f1_score(y_test, predictions_no_pca, average='macro'),
        'Confusion Matrix': confusion_matrix(y_test, predictions_no_pca)
    }

# Display results
for model, metrics in results_no_pca.items():
    print(f"Model: {model}")
    for metric, value in metrics.items():
        print(f"{metric}: {value}")
    print("\n")


Fitting 5 folds for each of 5 candidates, totalling 25 fits


KeyboardInterrupt: ignored

## <font color='#D61E85' size='+3'>**Q2:**</font> <font size='+2'> **Randomized PCA** </font>

In this question we want to examine the running time of finding an approximation of the first $d$ principal components, and to see how its performance compares to the original PCA. For this purpose there is a stochastic algorithm called *randomized PCA*, which finds the first $d$ principal components faster.

By default, the `svd_solver` parameter of PCA in Scikit-learn is set to `"auto"`, which means that it automatically decides whether to use `"full"` or `"randomized"` to find the principal components. Based on our textbook:

"Scikit-learn uses the randomized PCA algorithm if $max(m, n) > 500$ and `n_components` is an integer smaller than $80\%$ of $min(m,n)$, or else it uses the full SVD approach"
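A rough way to compare the two solvers is to time `fit_transform` for each. The sketch below is only illustrative: it uses random synthetic data rather than Fashion-MNIST, the helper `time_pca` is a hypothetical name, and measured times will vary from machine to machine.

```python
import time
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data (assumption): 2000 samples, 300 features
rng = np.random.RandomState(42)
X_demo = rng.normal(size=(2000, 300))

def time_pca(solver, d):
    """Fit PCA with the given solver and return (seconds, reduced data)."""
    pca = PCA(n_components=d, svd_solver=solver, random_state=42)
    start = time.perf_counter()
    X_reduced = pca.fit_transform(X_demo)
    return time.perf_counter() - start, X_reduced

timings = {}
for solver in ('full', 'randomized'):
    elapsed, X_reduced = time_pca(solver, d=10)
    timings[solver] = (elapsed, X_reduced.shape)
    print(f"{solver}: {elapsed:.3f} s, output shape {X_reduced.shape}")
```

For a fair comparison on the real data, it helps to repeat each measurement a few times and report the average.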

<font size='+1'>**(a)**</font> In the previous question you found $d$ components for each of the classifiers. This time, perform PCA on its own, without the classifier in the loop, by setting the number of components directly. For `svd_solver`, use both `"full"` and `"randomized"` separately and compare their results. Also compare the running times of `"full"` and `"randomized"` for each classifier. Explain your observations. Perform the following steps one by one for each classifier:

*   Perform PCA with both `"full"` and `"randomized"`. Report the time each one takes.

*   For both cases, fit the classifier on the new training data (after dimensionality reduction). Then report the results (Accuracy, F1-score, ...) on the test set.

<font size='+1'>**(b)**</font> This time set $d=10$ and compare the running times for both the `"full"` and `"randomized"` arguments. Explain your observations.

<font size='+1'>**(c)**</font> There is a variant called Incremental PCA (IPCA). Explain what it is and in what situations it is useful.

<font size='+1'>**(d)**</font> Consider the number of batches equal to $200$ and perform the IPCA.
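One possible shape for the batched fit is sketched below. It assumes synthetic data in place of Fashion-MNIST and an arbitrary choice of 10 components; `X_demo` and `n_batches` are illustrative names. Note that each batch must contain at least as many samples as the requested number of components.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Synthetic stand-in data (assumption): 4000 samples, 50 features
rng = np.random.RandomState(42)
X_demo = rng.normal(size=(4000, 50))

n_batches = 200
ipca = IncrementalPCA(n_components=10)

# Feed the data one mini-batch at a time (20 rows per batch here)
for X_batch in np.array_split(X_demo, n_batches):
    ipca.partial_fit(X_batch)

X_reduced = ipca.transform(X_demo)
print(X_reduced.shape, ipca.n_samples_seen_)
```

Because only one batch is in use during each `partial_fit` call, the same loop could read batches from disk when the full dataset does not fit in memory.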

<font color='#8FCF26' size='+2'>**A2:**</font> Your explanations

In [None]:
# Your code

## <font color='#D61E85' size='+3'>**Q3:**</font> <font size='+2'> **Locally Linear Embedding** </font>

Locally linear embedding (LLE) is a nonlinear dimensionality reduction algorithm that is categorized as a manifold learning technique.

<font size='+1'>**(a)**</font> At first, try to explain how it works by mentioning its optimization objectives.
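As a reference point for part (a), the usual two-step formulation of LLE can be written as below. The notation is an assumption of this sketch and is not fixed by the assignment: $x_i$ are the $m$ input points, $z_i$ their low-dimensional images, $w_{ij}$ the reconstruction weights, and $k$ the number of neighbors.

$$
\hat{W} = \arg\min_{W} \sum_{i=1}^{m} \Big\| x_i - \sum_{j=1}^{m} w_{ij} x_j \Big\|^2
\quad \text{subject to} \quad
w_{ij} = 0 \text{ if } x_j \text{ is not among the } k \text{ nearest neighbors of } x_i,
\qquad \sum_{j=1}^{m} w_{ij} = 1
$$

$$
\hat{Z} = \arg\min_{Z} \sum_{i=1}^{m} \Big\| z_i - \sum_{j=1}^{m} \hat{w}_{ij} z_j \Big\|^2
$$

The first problem keeps the points fixed and solves for the weights; the second keeps the weights fixed and solves for the low-dimensional coordinates, usually with the $z_i$ constrained to zero mean and unit covariance to rule out degenerate solutions.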

<font size='+1'>**(b)**</font> Now implement it and apply your implementation to a swiss roll to see what happens after unrolling. (Try to plot your results.)
The code below creates a swiss roll with $1000$ samples.

<font size='+1'>**(c)**</font> Finally use the LLE implementation provided by Scikit-learn to check the results of your implementation. (plot your results)
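For part (c), the Scikit-learn reference run might look like the sketch below. The choice of `n_neighbors=10` is an assumption for illustration, and `X_roll`, `t_roll`, and `X_unrolled` are example names.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# Same swiss roll settings as in the assignment cell
X_roll, t_roll = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

# n_neighbors=10 is an illustrative choice, not a prescribed value
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_unrolled = lle.fit_transform(X_roll)

print(X_unrolled.shape)
print("reconstruction error:", lle.reconstruction_error_)

# To plot, color the 2D points by their position along the roll, e.g.:
# plt.scatter(X_unrolled[:, 0], X_unrolled[:, 1], c=t_roll)
```

Coloring by `t_roll` makes it easy to check visually whether neighboring points on the roll stay neighbors after unrolling.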

<font color='#8FCF26' size='+2'>**A3:**</font> Your explanations

In [None]:
from sklearn.datasets import make_swiss_roll

X_swiss, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

In [None]:
# Your code

## <font color='#D61E85' size='+3'>**Q4:**</font> <font size='+2'> **t-SNE vs. UMAP 🔥** </font>

In this question we need the first $5000$ images of the Fashion-MNIST dataset. We want to reduce the dimension of these samples down to 2 so we can plot them. Here, we use t-SNE and UMAP to perform these reductions. You can use a scatterplot with 10 different colors to show the class of each instance. After visualization, analyze your results and compare them with each other. Is there any pattern in these visualizations?
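The t-SNE half of the comparison could be sketched as below. This is an illustration under assumptions: it uses 500 samples of the small bundled digits dataset instead of Fashion-MNIST, and `perplexity=30` and `init='pca'` are example settings. UMAP lives in the separate `umap-learn` package and is left out so that the block depends on Scikit-learn only.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Stand-in data (assumption): first 500 digits images, 64 features each
X_demo, y_demo = load_digits(return_X_y=True)
X_demo, y_demo = X_demo[:500], y_demo[:500]

# Reduce to 2 dimensions; settings here are illustrative, not tuned
tsne = TSNE(n_components=2, init='pca', perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X_demo)

print(X_2d.shape)

# To plot with one color per class, e.g.:
# plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_demo, cmap='tab10', s=5)
```

Using the same scatterplot call for both embeddings keeps the two figures directly comparable.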

<font color='#8FCF26' size='+2'>**A4:**</font> Your explanations

In [None]:
# Your code

## <font color='#D61E85' size='+3'>**Q5:**</font> <font size='+2'> **Iris** </font>

You will take a shortcut and load the Iris dataset from Scikit-learn’s datasets module. Furthermore, you will select only two features, sepal width and petal length, to make the classification task more challenging for illustration purposes.

In [None]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
import numpy as np

In [None]:
# Your Code

Split the Iris examples into 50 percent training and 50 percent test data:

In [None]:
# Your Code

Using the training dataset, you now will train three different classifiers:

- Logistic regression classifier

- Decision tree classifier

- k-nearest neighbors classifier

You will then evaluate the performance of each classifier via 10-fold cross-validation on the training dataset before combining them into an ensemble classifier:
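One way this workflow could look is sketched below. Several choices are assumptions made for illustration only: all three Iris classes are kept, the hyperparameters (`max_depth=3`, `n_neighbors=5`) are arbitrary, and hard majority voting through `VotingClassifier` is just one possible way to combine the models.

```python
from sklearn import datasets
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X_iris = iris.data[:, [1, 2]]   # sepal width, petal length
y_iris = iris.target

# 50/50 stratified split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.5, stratify=y_iris, random_state=42)

# Scale features for the distance- and gradient-based models
base_models = [
    ('lr', Pipeline([('sc', StandardScaler()),
                     ('clf', LogisticRegression(random_state=42))])),
    ('dt', DecisionTreeClassifier(max_depth=3, random_state=42)),
    ('knn', Pipeline([('sc', StandardScaler()),
                      ('clf', KNeighborsClassifier(n_neighbors=5))])),
]
all_models = base_models + [
    ('vote', VotingClassifier(estimators=base_models, voting='hard'))]

# 10-fold cross-validation on the training half only
cv_scores = {}
for name, model in all_models:
    scores = cross_val_score(model, X_tr, y_tr, cv=10, scoring='accuracy')
    cv_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The test half stays untouched during cross-validation, so it remains available for a final evaluation of whichever model is selected.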

In [None]:
# Your Code

## <font color='#D61E85' size='+3'>**Q6:**</font> <font size='+2'> **Carseats** </font>

#### Ensemble Learning

Ensemble learning is a machine learning technique that involves combining the predictions of multiple models to improve the overall performance and accuracy of a system. Instead of relying on a single model to make predictions, ensemble methods use a group of models and aggregate their predictions to achieve better results than any individual model could achieve on its own.

The basic idea behind ensemble learning is that by combining the strengths of different models, it is possible to mitigate the weaknesses of each individual model. Ensemble methods are often used to enhance predictive accuracy, reduce overfitting, and improve the robustness of the model.

There are several popular ensemble learning techniques, including:

**Bagging (Bootstrap Aggregating):** This method involves training multiple instances of the same learning algorithm on different subsets of the training data, typically created by random sampling with replacement. The predictions of these models are then averaged or voted upon to make the final prediction.

**Boosting:** Boosting focuses on training a sequence of weak learners, where each subsequent model corrects the errors of its predecessor. Popular boosting algorithms include AdaBoost (Adaptive Boosting) and Gradient Boosting.

**Random Forest:** Random Forest is an ensemble method based on bagging. It constructs multiple decision trees during training and combines their predictions through averaging or voting. Each tree is trained on a bootstrap sample of the training data, and only a random subset of the features is considered at each split.

**Stacking:** Stacking involves training multiple diverse models and using another model (meta-model or blender) to combine their predictions. The predictions of individual models serve as input features for the meta-model.

Ensemble learning is a powerful technique that is widely used in various machine learning applications. It is particularly effective when dealing with complex and diverse datasets, as well as when individual models may have different strengths and weaknesses.
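The four families described above each have a counterpart in Scikit-learn. The sketch below fits one regressor of each kind on synthetic data; the dataset (`make_friedman1`), the hyperparameters, and the choice of base models for stacking are all assumptions for illustration and are not the Carseats setup.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import (AdaBoostRegressor, BaggingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption, stands in for a real dataset)
X_syn, y_syn = make_friedman1(n_samples=400, n_features=10, noise=1.0,
                              random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=42)

ensembles = {
    # Bagging: same learner on bootstrap samples, predictions averaged
    'bagging': BaggingRegressor(DecisionTreeRegressor(random_state=42),
                                n_estimators=50, random_state=42),
    # Boosting: learners fitted in sequence, each focusing on earlier errors
    'boosting': AdaBoostRegressor(n_estimators=50, random_state=42),
    # Random forest: bagging plus a random feature subset at each split
    'random_forest': RandomForestRegressor(n_estimators=50, max_features=3,
                                           random_state=42),
    # Stacking: a meta-model combines the base models' predictions
    'stacking': StackingRegressor(
        estimators=[('tree', DecisionTreeRegressor(max_depth=4,
                                                   random_state=42)),
                    ('ridge', Ridge())],
        final_estimator=Ridge()),
}

test_mse = {}
for name, model in ensembles.items():
    model.fit(X_tr, y_tr)
    test_mse[name] = mean_squared_error(y_te, model.predict(X_te))
    print(f"{name}: test MSE = {test_mse[name]:.3f}")
```

Which family does best depends on the data, so the printed numbers here say nothing about what to expect on Carseats.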

We are going to work with **Carseats** dataset. We want to predict the sales using regression trees and related approaches, treating the response as a quantitative variable.

- Load Dataset

In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, export_graphviz

- Preprocess the data.
- Split the data set into a training set and a test set.

In [None]:
# Your Code:

- Fit a regression tree to the training set. Plot the tree, and interpret
the results. What test MSE do you obtain?

In [None]:
# Your code

- Use the bagging approach to analyze this data. What test MSE do you obtain? Which variables are most important? Visualize them.

In [None]:
# Your code

- Use random forests to analyze this data. What test MSE do you obtain? Which variables are most important? Describe the effect of $m$, the number of variables considered at each split, on the error rate obtained.
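In Scikit-learn, $m$ corresponds to the `max_features` parameter of `RandomForestRegressor`, and setting it to the total number of features reduces the forest to bagging. A minimal sketch of sweeping $m$ is shown below; it assumes synthetic data (`make_friedman1`) in place of Carseats, and the grid of $m$ values is an arbitrary example.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 10 features (assumption)
X_syn, y_syn = make_friedman1(n_samples=500, n_features=10, noise=1.0,
                              random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=42)

mse_by_m = {}
for m in (1, 3, 5, 10):   # m = 10 uses every feature, i.e. plain bagging
    forest = RandomForestRegressor(n_estimators=100, max_features=m,
                                   random_state=42)
    forest.fit(X_tr, y_tr)
    mse_by_m[m] = mean_squared_error(y_te, forest.predict(X_te))
    print(f"m = {m:2d}: test MSE = {mse_by_m[m]:.3f}")

# Importances of the last fitted forest (m = 10); they sum to 1
importances = forest.feature_importances_
print(importances.round(3))
```

The `feature_importances_` array can be passed to a bar chart to visualize which variables matter most.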

In [None]:
# Your code