## 2. Ensemble-based models to predict tissue
We are interested in predicting the tissue of origin using the latent spaces generated before.

Using samples projected into the different latent spaces, use at least two ensemble approaches to predict their tissue of origin:

- One approach must use a learner/estimator that tends to overfit the data.
- Another approach must use a weak learner that only slightly outperforms a random estimator.

Use a model evaluation strategy that lets you explore different hyperparameters and select the best model. Assess whether the models are good predictors across all tissues (see the `classification_report` function in sklearn) and whether they generalize well to unseen data.
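
One possible evaluation strategy is a cross-validated grid search on a training split, followed by a per-class report on a held-out test split. The sketch below is only an illustration: the synthetic data from `make_classification` stands in for the latent-space samples and tissue labels, and the random forest, grid values, and `f1_macro` scoring are assumptions, not choices made in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for latent-space samples and tissue labels (assumption).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8, n_classes=3, random_state=0
)

# Hold out a stratified test set that the hyperparameter search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Illustrative grid only; the values are not tuned for GTEx.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="f1_macro",  # macro-average weights every class equally
)
search.fit(X_train, y_train)

# Per-class precision, recall and F1 on the held-out samples.
y_pred = search.predict(X_test)
report = classification_report(y_test, y_pred, output_dict=True)
print(search.best_params_)
print(classification_report(y_test, y_pred))
```

Macro-averaged F1 is used here so that rare classes count as much as common ones during model selection; plain accuracy would be a reasonable alternative if the tissues are balanced.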

### Load data and packages

In [1]:
# import packages
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.spatial.distance import pdist
from sklearn import cluster
from sklearn import metrics
from sklearn.metrics import adjusted_rand_score as ari
from sklearn.metrics import adjusted_mutual_info_score as ami
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
import umap 

In [None]:
# import data from 01_preprocessing jupyter notebook
GTEx_data = pd.read_pickle("../GTEx_data_input.pkl")
GTEx_labels = pd.read_pickle("../GTEx_labels.pkl")

In [None]:
# set up PCA of GTEx data based on best parameters from 01_dimension_reduction
# reduce GTEx data to 50 principal components (n_components=50)
from sklearn.decomposition import PCA
pca = PCA(n_components=50)
pca_fit = pca.fit(GTEx_data)
pca_data = pca_fit.transform(GTEx_data)

In [None]:
# set up UMAP of GTEx data based on best parameters from 01_dimension_reduction
# reduce data with UMAP, set seed because stochastic
# use a separate name for the model so the imported umap module is not shadowed
umap_model = umap.UMAP(random_state=42, n_components=200, min_dist=0.1)
umap_fit = umap_model.fit(GTEx_data)
umap_data = umap_fit.transform(GTEx_data)

In [None]:
# set up smaller datasets with colon and brain tissue info removed

### Ensemble Modeling

We'll use the full dataset of the top ten tissues to assess different model hyperparameters. Then we'll remove the colon and brain samples from the dataset, train models with the best hyperparameters, and determine how well they work on unseen data.
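
Removing the colon and brain samples could be done with a boolean mask over the labels. This is a minimal sketch under stated assumptions: `latent` and `labels` are toy stand-ins, and both the tissue names and the idea that the labels are a pandas Series aligned with the samples are hypothetical, since the structure of `GTEx_labels` is not shown here.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (assumption): rows are samples, labels holds one tissue name per row.
rng = np.random.default_rng(0)
latent = pd.DataFrame(rng.normal(size=(8, 3)), columns=["c1", "c2", "c3"])
labels = pd.Series(
    ["Brain", "Colon", "Lung", "Skin", "Brain", "Lung", "Colon", "Heart"],
    index=latent.index,
)

# Hypothetical tissue names; the real label values may differ.
held_out_tissues = ["Colon", "Brain"]
is_held_out = labels.isin(held_out_tissues)

# Samples kept for training: every tissue except the held-out ones.
latent_seen, labels_seen = latent[~is_held_out], labels[~is_held_out]
# Samples from tissues the models never see during training.
latent_unseen, labels_unseen = latent[is_held_out], labels[is_held_out]

print(sorted(labels_seen.unique()), sorted(labels_unseen.unique()))
```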

#### Weak Learners: Boosting with Gradient Boosting
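
A depth-1 decision tree (a "stump") is a common choice of weak learner, and gradient boosting fits many of them in sequence. The sketch below compares a random baseline, a single stump, and boosted stumps by cross-validated accuracy. It is an illustration on synthetic data, not the notebook's result: the dataset, `n_estimators=50`, and `learning_rate=0.1` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for latent-space samples and tissue labels (assumption).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8, n_classes=3, random_state=0
)

models = {
    # Guesses a class uniformly at random.
    "random baseline": DummyClassifier(strategy="uniform", random_state=0),
    # One weak learner on its own.
    "single stump": DecisionTreeClassifier(max_depth=1, random_state=0),
    # Many weak learners fitted sequentially on the remaining errors.
    "boosted stumps": GradientBoostingClassifier(
        n_estimators=50, max_depth=1, learning_rate=0.1, random_state=0
    ),
}

# Mean 3-fold cross-validated accuracy for each model.
scores = {
    name: cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```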

#### Strong Learners: Bagging with Random Forest meta-estimators

The idea of bagging is to reduce model variance by averaging many models trained on random subsets of the data.

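```python
import numpy as np
from sklearn.base import clone
from sklearn.utils import resample

def fit_bagging(base_model, X, y, n_models=10, seed=0):
    ensemble = []
    for i in range(n_models):
        # draw a bootstrap sample (with replacement) and fit a fresh copy of the model
        X_i, y_i = resample(X, y, replace=True, random_state=seed + i)
        ensemble.append(clone(base_model).fit(X_i, y_i))
    return ensemble

def average_prediction(ensemble, X_test):
    # average the class probabilities over all models at test time
    # (assumes every bootstrap sample contained all classes)
    return np.mean([m.predict_proba(X_test) for m in ensemble], axis=0)
```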

- Bagging classifier: https://scikit-learn.org/1.5/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier

When each model is trained on a subset of the available samples, the generalization accuracy can be estimated from the out-of-bag samples by setting `oob_score=True`.
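
A minimal sketch of the out-of-bag estimate with `BaggingClassifier` follows. The synthetic data, the fully grown decision tree as the overfitting base learner, and the values of `n_estimators` and `max_samples` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for latent-space samples and tissue labels (assumption).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8, n_classes=3, random_state=0
)

# Fully grown trees (max_depth=None) tend to overfit; bagging averages them.
bagging = BaggingClassifier(
    DecisionTreeClassifier(max_depth=None, random_state=0),
    n_estimators=50,
    max_samples=0.8,   # each tree sees a bootstrap sample of 80% of the data
    bootstrap=True,    # required for out-of-bag scoring
    oob_score=True,    # score each sample with the trees that did not train on it
    random_state=0,
)
bagging.fit(X, y)

print(f"OOB accuracy estimate: {bagging.oob_score_:.2f}")
```

Because every sample is scored only by trees that never saw it, `oob_score_` gives a generalization estimate without setting aside a separate validation split.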