<font size='+1' color=red>**Attention:**</font> Data cleaning and the other preprocessing steps covered in the first assignment are not always necessary, but you may need some of them depending on the task at hand. We do not mention them explicitly each time; it is your job to figure out when to apply them.

<font size='+1' color=red>**Attention 2:**</font> In your implementations, always use `random_state=42` so that your code is reproducible.

## <font color='#D61E85' size='+3'>**Q1:**</font> <font size='+2'> **PCA for Classification** </font>

In this question we work with the Fashion-MNIST dataset. Fashion-MNIST is a dataset of $28 \times 28$ grayscale images of $70,000$ fashion products from $10$ categories, with $7,000$ images per category. The training set has $60,000$ images and the test set has $10,000$ images. <br>
<font color=red>**Note:**</font> You can download it from any source you want. <br>
<font color=red>**Note:**</font> Take the first $60,000$ instances as the training set and the remaining $10,000$ instances as the test set.

Using the explained variance ratio and a threshold such as $95\%$, you probably know how to choose the number of dimensions for PCA. However, when dimensionality reduction is a preprocessing step for a supervised learning task, it is important to consider how the number of dimensions affects the overall performance of the model. Consider the classification task on the dataset at hand and try to find the best number of PCA components with respect to that task. You should use the `RandomForestClassifier`, `KNeighborsClassifier`, `DecisionTreeClassifier`, and `AdaBoostClassifier`. Compare your results (number of dimensions, accuracy, precision, recall, f1-score, and confusion matrix) and explain why the number of dimensions differs between models. Don't forget to analyze your results. [Hint: build a pipeline and tune the hyperparameters of PCA and your model jointly.]

At the end, perform the hyperparameter tuning again, this time without the PCA preprocessing step, and compare the results with the previous ones.
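
The variance-threshold rule mentioned in the question can be sketched as follows. This is only an illustration under stated assumptions: the helper name `components_for_variance` is hypothetical, and the data is a small synthetic stand-in rather than Fashion-MNIST.

```python
import numpy as np
from sklearn.decomposition import PCA


def components_for_variance(X, threshold=0.95):
    """Return the smallest number of principal components whose
    cumulative explained variance ratio reaches `threshold`."""
    pca = PCA().fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # First index at which the cumulative ratio is >= threshold
    index = int(np.searchsorted(cumulative, threshold))
    # Guard against floating-point round-off when threshold is close to 1
    return min(index + 1, len(cumulative))


# Synthetic stand-in data (an assumption for illustration only):
# 500 samples, 20 features with steadily decreasing variance.
rng = np.random.RandomState(42)
X_demo = rng.normal(size=(500, 20)) * np.linspace(10, 0.1, 20)

d = components_for_variance(X_demo, threshold=0.95)
print(d)
```

Passing a float such as `PCA(n_components=0.95)` asks Scikit-learn to perform this kind of selection internally, which is what the pipeline in the answer below relies on.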

<font color='#8FCF26' size='+2'>**A1:**</font>
### Approach Summary

1. **Data Loading and Splitting**:
   - The Fashion-MNIST dataset was loaded using `fetch_openml`.
   - The dataset was split into a training set (first 60,000 instances) and a test set (last 10,000 instances).

2. **Classifier Setup**:
   - Four classifiers were chosen for this task: `RandomForestClassifier`, `KNeighborsClassifier`, `DecisionTreeClassifier`, and `AdaBoostClassifier`.
   - Each classifier was used in conjunction with PCA in a pipeline.

3. **Hyperparameter Tuning**:
   - `RandomizedSearchCV` was employed to tune the hyperparameters of both PCA and the classifiers.
   - Different ranges and values for hyperparameters were specified for each classifier.

4. **Model Evaluation**:
   - After training, each model was evaluated on the test set.
   - Performance metrics included accuracy, precision, recall, f1-score, and confusion matrix.

5. **Results Compilation**:
   - The best parameters found, along with the performance metrics for each classifier, were compiled into a DataFrame and saved to a CSV file.


In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from scipy.stats import randint

# Load the dataset
fashion_mnist = fetch_openml('Fashion-MNIST', version=1)
X = fashion_mnist.data
y = fashion_mnist.target

# Split the dataset
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]



In [None]:
# Define classifiers and hyperparameter distributions
classifiers = {
    'RandomForest': RandomForestClassifier(random_state=42),
    'KNeighbors': KNeighborsClassifier(),
    'DecisionTree': DecisionTreeClassifier(random_state=42),
    'AdaBoost': AdaBoostClassifier(random_state=42)
}

param_distributions = {
    'RandomForest': {
        'clf__n_estimators': randint(50, 200),
        'clf__max_depth': randint(3, 10),
        'pca__n_components': [0.90, 0.95, 0.99]
    },
    'KNeighbors': {
        'clf__n_neighbors': randint(3, 10),
        'pca__n_components': [0.90, 0.95, 0.99]
    },
    'DecisionTree': {
        'clf__max_depth': randint(3, 10),
        'pca__n_components': [0.90, 0.95, 0.99]
    },
    'AdaBoost': {
        'clf__n_estimators': randint(50, 100),
        'pca__n_components': [0.90, 0.95, 0.99]
    }
}

In [2]:
# Collect one result row per classifier
results = []

# Loop over classifiers
for name, clf in classifiers.items():
    # PCA and the classifier are tuned jointly inside one pipeline
    pipeline = Pipeline([
        ('pca', PCA(random_state=42)),
        ('clf', clf)
    ])

    random_search = RandomizedSearchCV(pipeline, param_distributions[name],
                                       n_iter=5, cv=3, scoring='accuracy',
                                       random_state=42, n_jobs=-1)
    random_search.fit(X_train, y_train)

    best_params = random_search.best_params_
    # predict() uses the best pipeline refit on the full training set
    y_pred = random_search.predict(X_test)

    # Metrics (macro-averaged over the 10 classes)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='macro')
    recall = recall_score(y_test, y_pred, average='macro')
    f1 = f1_score(y_test, y_pred, average='macro')
    conf_matrix = confusion_matrix(y_test, y_pred)

    # Save results (DataFrame.append was removed in pandas 2.0,
    # so rows are collected in a list instead)
    results.append({
        'Classifier': name,
        'Best Params': best_params,
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1,
        'Confusion Matrix': conf_matrix
    })

# Build the results DataFrame and save it to CSV
results_df = pd.DataFrame(results)
results_df.to_csv('classifier_performance.csv', index=False)




In [5]:
results_df


Now, let's analyze the results from the `classifier_performance.csv` file to understand how the PCA affected different classifiers and derive insights based on accuracy, precision, recall, f1-score, and confusion matrices.

### Interpretation of Results

1. **Random Forest Classifier**
   - **Best Parameters**: Maximum depth of 9 and 142 estimators with a PCA component ratio of 0.95.
   - **Performance**:
     - Accuracy: 81.11%
     - Precision: 80.70%
     - Recall: 81.11%
     - F1 Score: 80.65%
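   - **Analysis**: The Random Forest Classifier performed well, balancing accuracy and complexity. It selected the 0.95 variance ratio, the middle of the three candidate values, which suggests that the components beyond the 0.90 level still carry information that helps it capture the complex decision boundaries in the data. Note that the selected depth of 9 is the largest value the search could draw from `randint(3, 10)`, so deeper trees might improve the score further.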

2. **K-Neighbors Classifier**
   - **Best Parameters**: 7 neighbors with a PCA component ratio of 0.90.
   - **Performance**:
     - Accuracy: 86.01%
     - Precision: 86.07%
     - Recall: 86.01%
     - F1 Score: 85.95%
   - **Analysis**: The K-Neighbors Classifier achieved the highest accuracy of the four models. It selected the lowest variance ratio (0.90), which suggests that it works well in a reduced feature space: distance computations tend to be more meaningful in fewer dimensions, and discarding low-variance components removes noise from them.

3. **Decision Tree Classifier**
   - **Best Parameters**: Maximum depth of 9 with a PCA component ratio of 0.90.
   - **Performance**:
     - Accuracy: 72.74%
     - Precision: 74.16%
     - Recall: 72.74%
     - F1 Score: 72.46%
   - **Analysis**: The Decision Tree Classifier showed moderate performance. A single tree may lose some effectiveness when the dimensionality is reduced, since it relies on splitting the feature space and those splits can be more informative in higher dimensions. The selected depth of 9 is also the upper end of the search range (`randint(3, 10)`), so the tree may be underfitting.

4. **AdaBoost Classifier**
   - **Best Parameters**: 57 estimators with a PCA component ratio of 0.95.
   - **Performance**:
     - Accuracy: 55.67%
     - Precision: 55.66%
     - Recall: 55.67%
     - F1 Score: 54.60%
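   - **Analysis**: AdaBoost had the lowest performance among the classifiers. This might indicate that the combination of PCA and AdaBoost is not as effective for this dataset, possibly because the way AdaBoost focuses on misclassified instances does not align well with the PCA-transformed feature space. Another likely factor is that Scikit-learn's default base learner for AdaBoost is a depth-1 decision tree, and the 50 to 99 such stumps allowed by the search may be too weak for a 10-class problem.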

### General Observations

- The selected PCA variance ratio varied across classifiers, indicating that different models differ in how well they handle the dimensionality of the feature space.
- K-Neighbors selected the lowest ratio (0.90), likely because it relies on distance calculations, which can become more meaningful in a lower-dimensional space.
- More complex models like Random Forest and AdaBoost selected the higher 0.95 ratio, possibly because they need more components to capture complex decision boundaries.
- Each search sampled only 5 configurations (`n_iter=5`) with 3-fold cross-validation, so some of these differences may reflect which configurations happened to be drawn rather than a true preference of the model.

### Conclusion

The experiment highlights the importance of considering the interaction between feature extraction techniques like PCA and the choice of classifier. The varying performance across different classifiers with different PCA settings underscores the need for careful hyperparameter tuning and model selection based on the characteristics of the dataset and the models' inherent properties.

## <font color='#D61E85' size='+3'>**Q2:**</font> <font size='+2'> **Randomized PCA** </font>

In this question we examine the time complexity of finding an approximation of the first $d$ principal components, and how its results compare with those of the original PCA. For this purpose there is a stochastic algorithm called *randomized PCA*, which finds the first $d$ principal components faster.

By default, the `svd_solver` parameter of PCA in Scikit-learn is set to `"auto"`, which means that it automatically decides whether to use `"full"` or `"randomized"` to find the principal components. According to our textbook:

"Scikit-learn uses the randomized PCA algorithm if $max(m, n) > 500$ and `n_components` is an integer smaller than $80\%$ of $min(m,n)$, or else it uses the full SVD approach"
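
The quoted rule can be written out as a small function to see which solver it would pick for a given problem size. This is a sketch of the textbook statement only: the function name `textbook_auto_solver` is hypothetical, the logic is not read from Scikit-learn itself, and newer releases may apply different rules.

```python
def textbook_auto_solver(m, n, n_components):
    """Solver choice according to the textbook rule quoted above.

    m, n: number of samples and features.
    n_components: the value that would be passed to PCA.
    """
    is_int = isinstance(n_components, int)
    if max(m, n) > 500 and is_int and n_components < 0.8 * min(m, n):
        return "randomized"
    return "full"


# Fashion-MNIST training set: m = 60000 samples, n = 784 features
print(textbook_auto_solver(60000, 784, 50))    # randomized
print(textbook_auto_solver(60000, 784, 0.95))  # full (not an integer)
print(textbook_auto_solver(60000, 784, 700))   # full (700 > 0.8 * 784)
```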

<font size='+1'>**(a)**</font> In the previous question you found $d$ components for each of the classifiers. This time, perform PCA without the classifier, specifying only the number of components. Set `svd_solver` to `"full"` and to `"randomized"` separately and compare the results. Also compare the running times of `"full"` and `"randomized"` for the $d$ found for each classifier. Explain your observations.

<font size='+1'>**(b)**</font> This time set $d=10$ and compare the running times for both the `"full"` and `"randomized"` arguments. Explain your observations.

<font size='+1'>**(c)**</font> There is a variant called Incremental PCA (IPCA). Explain what it is and in which situations it is useful.

<font size='+1'>**(d)**</font> Set the number of batches to $200$ and perform IPCA.
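
One possible shape for part (d) is sketched below, using Scikit-learn's `IncrementalPCA` with `partial_fit` on mini-batches. The helper name `fit_incremental_pca`, the number of components, and the synthetic data are assumptions for illustration; on the Fashion-MNIST training set, 200 batches would mean 300 samples per batch.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA


def fit_incremental_pca(X, n_components, n_batches=200):
    """Fit IncrementalPCA by feeding X in `n_batches` mini-batches."""
    ipca = IncrementalPCA(n_components=n_components)
    for batch in np.array_split(X, n_batches):
        # Each batch must contain at least n_components samples
        ipca.partial_fit(batch)
    return ipca


# Synthetic stand-in data (assumption): 4000 samples, 30 features,
# so each of the 200 batches holds 20 samples.
rng = np.random.RandomState(42)
X_demo = rng.normal(size=(4000, 30))

ipca = fit_incremental_pca(X_demo, n_components=10, n_batches=200)
X_reduced = ipca.transform(X_demo)
print(X_reduced.shape)  # (4000, 10)
```

Because only one batch is processed at a time, the same loop works when the batches are read from disk instead of being split from an in-memory array.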

<font color='#8FCF26' size='+2'>**A2:**</font> Your explanations

In [None]:
from sklearn.decomposition import PCA
import time

# X_train comes from the Q1 data-loading cell above

# Number of components
n_components = 50

# PCA with svd_solver='full'
start_time = time.perf_counter()
pca_full = PCA(n_components=n_components, svd_solver='full')
pca_full.fit(X_train)
full_time = time.perf_counter() - start_time
print(f"Time taken for 'full' PCA: {full_time:.2f} seconds")

# PCA with svd_solver='randomized' (random_state fixed for reproducibility)
start_time = time.perf_counter()
pca_randomized = PCA(n_components=n_components, svd_solver='randomized',
                     random_state=42)
pca_randomized.fit(X_train)
randomized_time = time.perf_counter() - start_time
print(f"Time taken for 'randomized' PCA: {randomized_time:.2f} seconds")


## <font color='#D61E85' size='+3'>**Q3:**</font> <font size='+2'> **Locally Linear Embedding** </font>

Locally linear embedding (LLE) is a nonlinear dimensionality reduction algorithm that is categorized as a manifold learning technique.

<font size='+1'>**(a)**</font> First, explain how it works by describing its optimization objectives.

<font size='+1'>**(b)**</font> Now implement it and apply your implementation to a Swiss roll to see what happens after unrolling. (Plot your results.)
The code below generates a Swiss roll with $1000$ samples.

<font size='+1'>**(c)**</font> Finally, use the LLE implementation provided by Scikit-learn to check the results of your implementation. (Plot your results.)

<font color='#8FCF26' size='+2'>**A3:**</font> Your explanations
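
As a starting point for part (a), the two optimization objectives of LLE in its standard formulation can be written as follows. The notation is an assumption made here: $\mathbf{x}_i$ are the $m$ input points, $\mathbf{z}_i$ their low-dimensional images, and $w_{ij}$ the reconstruction weights.

Step 1, with the inputs fixed, finds the weights that best reconstruct each point from its $k$ nearest neighbors:

$$
\hat{W} = \arg\min_{W} \sum_{i=1}^{m} \Big\| \mathbf{x}_i - \sum_{j=1}^{m} w_{ij}\, \mathbf{x}_j \Big\|^2
\quad \text{subject to} \quad
w_{ij} = 0 \ \text{if } \mathbf{x}_j \text{ is not one of the } k \text{ nearest neighbors of } \mathbf{x}_i,
\qquad \sum_{j=1}^{m} w_{ij} = 1 \ \text{for all } i
$$

Step 2, with the weights fixed, finds the low-dimensional points that preserve those local relationships:

$$
\hat{Z} = \arg\min_{Z} \sum_{i=1}^{m} \Big\| \mathbf{z}_i - \sum_{j=1}^{m} \hat{w}_{ij}\, \mathbf{z}_j \Big\|^2
\quad \text{subject to} \quad
\sum_{i=1}^{m} \mathbf{z}_i = \mathbf{0},
\qquad \frac{1}{m} \sum_{i=1}^{m} \mathbf{z}_i \mathbf{z}_i^{\top} = I
$$

The constraints in step 2 rule out the trivial solution in which every $\mathbf{z}_i$ collapses to the same point.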

In [None]:
from sklearn.datasets import make_swiss_roll

X_swiss, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

In [None]:
# Your code
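
For part (c), a minimal sketch with Scikit-learn's `LocallyLinearEmbedding` might look like the following. The value `n_neighbors=10` is an assumed setting, not one prescribed by the assignment.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# Same Swiss roll as in the cell above
X_swiss, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

# n_neighbors=10 is an assumed value chosen for illustration
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_unrolled = lle.fit_transform(X_swiss)

print(X_unrolled.shape)  # (1000, 2)
print(lle.reconstruction_error_)
```

Plotting `X_unrolled[:, 0]` against `X_unrolled[:, 1]`, colored by `t`, shows whether the roll has been flattened while keeping neighboring points together.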

## <font color='#D61E85' size='+3'>**Q4:**</font> <font size='+2'> **t-SNE vs. UMAP 🔥** </font>

In this question we need the first $5000$ images of the Fashion-MNIST dataset. We want to reduce the dimension of these samples to 2 so that we can plot them. Here, we use t-SNE and UMAP to perform these reductions. You can use a scatter plot with 10 different colors to show the class of each instance. After visualization, analyze your results and compare them with each other. Is there any pattern in these visualizations?

<font color='#8FCF26' size='+2'>**A4:**</font> Your explanations

In [None]:
# Your code
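
The t-SNE half of this question could be sketched as below. To keep the example quick and self-contained it uses 300 images from Scikit-learn's bundled digits dataset as a stand-in, which is an assumption; for the assignment, the first 5000 Fashion-MNIST images would be passed instead. The helper name `embed_tsne` is hypothetical.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE


def embed_tsne(X, random_state=42):
    """Reduce X to two dimensions with t-SNE (PCA initialization)."""
    tsne = TSNE(n_components=2, init="pca", random_state=random_state)
    return tsne.fit_transform(X)


# Stand-in data (assumption): 300 digit images with 64 pixels each
X_digits, y_digits = load_digits(return_X_y=True)
X_small, y_small = X_digits[:300], y_digits[:300]

X_2d = embed_tsne(X_small)
print(X_2d.shape)  # (300, 2)
```

UMAP is not part of Scikit-learn. Assuming the third-party `umap-learn` package is installed, its `umap.UMAP` estimator exposes the same `fit_transform` interface, so the two embeddings can be plotted side by side with the labels as colors.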

## <font color='#D61E85' size='+3'>**Q5:**</font> <font size='+2'> **Iris** </font>

You will take a shortcut and load the Iris dataset from Scikit-learn’s datasets module. Furthermore, you will select only two features, sepal width and petal length, to make the classification task more challenging for illustration purposes.

In [None]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
import numpy as np

In [None]:
# Your Code

Split the Iris examples into 50 percent training and 50 percent test data:

In [None]:
# Your Code
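
The loading and splitting steps described above could be sketched as follows. Keeping all three classes and stratifying the split are assumptions; the text only fixes the two features and the 50/50 ratio.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
# Columns 1 and 2 are sepal width and petal length
X, y = iris.data[:, [1, 2]], iris.target

# 50 percent training, 50 percent test; stratify keeps class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y)

print(X_train.shape, X_test.shape)  # (75, 2) (75, 2)
```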

Using the training dataset, you now will train three different classifiers:

- Logistic regression classifier

- Decision tree classifier

- k-nearest neighbors classifier

You will then evaluate the performance of each classifier via 10-fold cross-validation on the training dataset before combining them into an ensemble classifier:

In [None]:
# Your Code
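
A possible sketch of this step is shown below: the three classifiers are scored with 10-fold cross-validation and then combined by majority vote. The hyperparameters (`max_depth=3`, `n_neighbors=5`), the use of `VotingClassifier` with hard voting, and accuracy as the metric are all assumptions made for illustration.

```python
from sklearn import datasets
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Same data preparation as described above (two features, 50/50 split)
iris = datasets.load_iris()
X, y = iris.data[:, [1, 2]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y)

# Logistic regression and KNN are scale-sensitive, so they get a scaler
lr = Pipeline([('sc', StandardScaler()),
               ('clf', LogisticRegression(random_state=42))])
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
knn = Pipeline([('sc', StandardScaler()),
                ('clf', KNeighborsClassifier(n_neighbors=5))])

# Majority-vote ensemble of the three base classifiers
vote = VotingClassifier(
    estimators=[('lr', lr), ('tree', tree), ('knn', knn)], voting='hard')

models = {'lr': lr, 'tree': tree, 'knn': knn, 'vote': vote}
cv_scores = {}
for name, model in models.items():
    # 10-fold cross-validation on the training set only
    cv_scores[name] = cross_val_score(model, X_train, y_train,
                                      cv=10, scoring='accuracy')
    print(f"{name}: {cv_scores[name].mean():.3f} "
          f"+/- {cv_scores[name].std():.3f}")
```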

## <font color='#D61E85' size='+3'>**Q6:**</font> <font size='+2'> **Carseats** </font>

#### Ensemble Learning

Ensemble learning is a machine learning technique that involves combining the predictions of multiple models to improve the overall performance and accuracy of a system. Instead of relying on a single model to make predictions, ensemble methods use a group of models and aggregate their predictions to achieve better results than any individual model could achieve on its own.

The basic idea behind ensemble learning is that by combining the strengths of different models, it is possible to mitigate the weaknesses of each individual model. Ensemble methods are often used to enhance predictive accuracy, reduce overfitting, and improve the robustness of the model.

There are several popular ensemble learning techniques, including:

**Bagging (Bootstrap Aggregating):** This method involves training multiple instances of the same learning algorithm on different subsets of the training data, typically created by random sampling with replacement. The predictions of these models are then averaged or voted upon to make the final prediction.

**Boosting:** Boosting focuses on training a sequence of weak learners, where each subsequent model corrects the errors of its predecessor. Popular boosting algorithms include AdaBoost (Adaptive Boosting) and Gradient Boosting.

**Random Forest:** Random Forest is an ensemble method based on bagging. It constructs multiple decision trees during training and combines their predictions through averaging or voting. Each tree is trained on a bootstrap sample of the data, and each split considers only a random subset of the features.

**Stacking:** Stacking involves training multiple diverse models and using another model (meta-model or blender) to combine their predictions. The predictions of individual models serve as input features for the meta-model.
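
The four techniques above map onto Scikit-learn estimators roughly as in the sketch below. The synthetic regression data and all hyperparameters are assumptions chosen only to make the example run quickly; they are not tied to the Carseats data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (AdaBoostRegressor, BaggingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0,
                       random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=42)

ensembles = {
    # Bagging: the same tree learner on bootstrap samples, averaged
    'bagging': BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                                random_state=42),
    # Boosting: weak learners trained in sequence on reweighted data
    'boosting': AdaBoostRegressor(n_estimators=50, random_state=42),
    # Random forest: bagging plus a random feature subset at each split
    'random_forest': RandomForestRegressor(n_estimators=50, random_state=42),
    # Stacking: a meta-model combines the base models' predictions
    'stacking': StackingRegressor(
        estimators=[('tree', DecisionTreeRegressor(random_state=42)),
                    ('forest', RandomForestRegressor(n_estimators=50,
                                                     random_state=42))],
        final_estimator=LinearRegression()),
}

test_mse = {}
for name, model in ensembles.items():
    model.fit(X_tr, y_tr)
    test_mse[name] = mean_squared_error(y_te, model.predict(X_te))
    print(f"{name}: test MSE = {test_mse[name]:.1f}")
```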

Ensemble learning is a powerful technique that is widely used in various machine learning applications. It is particularly effective when dealing with complex and diverse datasets, as well as when individual models may have different strengths and weaknesses.

We are going to work with the **Carseats** dataset. We want to predict the sales using regression trees and related approaches, treating the response as a quantitative variable.

- Load Dataset

In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, export_graphviz

- Preprocess the data.
- Split the dataset into a training set and a test set.

In [None]:
# Your Code:

- Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?

In [None]:
# Your code

- Use the bagging approach to analyze this data. What test MSE do you obtain? Which variables are most important? Visualize them.

In [None]:
# Your code

- Use random forests to analyze this data. What test MSE do you obtain? Which variables are most important? Describe the effect of $m$, the number of variables considered at each split, on the error rate obtained.

In [None]:
# Your code
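
The effect of $m$ can be explored with a loop over `max_features`, as in the sketch below. Since the Carseats file is not bundled with Scikit-learn, the example uses synthetic data with 10 predictors (the number Carseats has before encoding) as a stand-in, which is an assumption. Setting $m$ equal to the number of predictors makes the random forest equivalent to bagging.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed Carseats predictors (assumption)
X, y = make_regression(n_samples=400, n_features=10, n_informative=5,
                       noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=42)

p = X.shape[1]
mse_by_m = {}
for m in range(1, p + 1):
    # max_features=m: number of variables considered at each split
    forest = RandomForestRegressor(n_estimators=100, max_features=m,
                                   random_state=42)
    forest.fit(X_tr, y_tr)
    mse_by_m[m] = mean_squared_error(y_te, forest.predict(X_te))
    print(f"m = {m:2d}: test MSE = {mse_by_m[m]:.1f}")

# The last fit used m == p, which corresponds to bagging
bagging_mse = mse_by_m[p]
importances = forest.feature_importances_
```

Sorting `importances` and drawing them as a bar chart gives the variable-importance visualization asked for in the bagging and random forest parts.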