# Tutorial Overview
This set of five tutorials (installation, package setup, data setup, running, analyzing) explains the UncertaintyForest class. After following the steps below, you should be able to run the code on your own machine and interpret the results.

If you haven't seen them already, take a look at the first and second parts of this set of tutorials, `UncertaintyForest_Tutorials_1-Installation` and `UncertaintyForest_Tutorial_2-Package-Setup`.

# 3: Data Setup
## *Goal: Understand the data and the parameters that will be passed to the UncertaintyForest instance*

### First, we have to import some modules to have everything we need. 

The first two blocks import standard packages. The third block adds the `proglearn` directory to the Python path so that the final import can be found. The fourth block imports two more standard packages, and the final block imports the UncertaintyForest class itself.

In [13]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# adding the proglearn directory to the path since this tutorial is in a different place
import sys
sys.path.insert(1, '../../')

from tqdm.notebook import tqdm
from joblib import Parallel, delayed

from proglearn.forest import UncertaintyForest

### Now, we create the function that will make data that we'll train on.
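Here, we use randomly generated data drawn from two Gaussian classes: one centered at -1 and the other at +1 in every feature. Because we control how the data are generated, we know the true class structure the learner is supposed to recover.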

In [14]:
def generate_data(n, d, var):
    '''
    Generate points from two equally sized Gaussian classes.

    Parameters
    ----------
    n : int
        The number of data points to generate (split evenly between the two classes)
    d : int
        The number of features to generate for each data point
    var : float
        The variance of each feature

    Returns
    -------
    X : ndarray of shape (2 * (n // 2), d)
        The generated data points
    y : ndarray of shape (2 * (n // 2),)
        The class labels (0 or 1)
    '''
    # create one mean vector per class (all -1 for class 0, all +1 for class 1)
    means = [np.ones(d) * -1, np.ones(d)]

    # draw n // 2 points per class from a Gaussian with covariance var * I
    X = np.concatenate([np.random.multivariate_normal(mean, var * np.eye(len(mean)),
                                                 size=n // 2) for mean in means])

    # create the labels for the data (the index of each class mean)
    y = np.concatenate([np.ones(n // 2) * mean_idx for mean_idx in range(len(means))])

    return X, y
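
To get a feel for what `generate_data` returns, the following sketch calls it on a small example and inspects the result. The function body is repeated only so the snippet runs on its own. The seed and the small values of `n` and `d` are assumptions chosen for illustration and are not the values used in the tutorial.

```python
import numpy as np

def generate_data(n, d, var):
    # Same logic as the tutorial's generate_data, repeated so this snippet is self-contained.
    means = [np.ones(d) * -1, np.ones(d)]
    X = np.concatenate([
        np.random.multivariate_normal(mean, var * np.eye(d), size=n // 2)
        for mean in means
    ])
    y = np.concatenate([np.ones(n // 2) * idx for idx in range(len(means))])
    return X, y

np.random.seed(0)  # fixed seed: an assumption, used only to make the run repeatable
X, y = generate_data(n=200, d=3, var=0.25)

print(X.shape)                 # (200, 3): one row per data point, one column per feature
print(np.unique(y))            # [0. 1.]: the two class labels
print(X[y == 0].mean(axis=0))  # every entry should be close to -1
print(X[y == 1].mean(axis=0))  # every entry should be close to +1
```

The first half of the rows belongs to class 0 and the second half to class 1, so the data are not shuffled.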

### Lastly, we define the parameters for the data and the uncertainty forest.

In [15]:
# Real Params.
n_train = 50
n_test = 10000
d = 100
var = 0.25
num_trials = 10
n_estimators = 100

It will be important to understand each of these parameters, so we'll go into more depth on what they mean:
* `n_train` is the number of training samples that will be used to train the learner
* `n_test` is the number of test samples that will be used to assess how well the learner classifies
* `d` is the dimensionality of the input space (i.e. how many features the data has)
* `var` is the variance of each feature within a class
* `num_trials` is the number of times we'll generate data, train, and test to make sure our results are not outliers
* `n_estimators` is the number of trees in the forest
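
As a rough illustration of how these parameters fit together, the sketch below runs `num_trials` rounds of generating data, training, and testing. It makes two assumptions: `make_split` is a hypothetical, compact stand-in for `generate_data` (independent normal draws are equivalent to a covariance of `var * I`), and `nearest_mean_predict` is a placeholder learner that is not UncertaintyForest and uses no trees, which is why `n_estimators` does not appear.

```python
import numpy as np

# Values copied from the parameter cell above.
n_train, n_test, d, var, num_trials = 50, 10000, 100, 0.25, 10

def make_split(n, d, var, rng):
    """Hypothetical stand-in for generate_data: two Gaussian classes at -1 and +1."""
    half = n // 2
    X = np.concatenate([
        rng.normal(loc=-1.0, scale=np.sqrt(var), size=(half, d)),
        rng.normal(loc=1.0, scale=np.sqrt(var), size=(half, d)),
    ])
    y = np.concatenate([np.zeros(half), np.ones(half)])
    return X, y

def nearest_mean_predict(X_train, y_train, X_test):
    """Placeholder learner (not UncertaintyForest): pick the closer class mean."""
    mu0 = X_train[y_train == 0].mean(axis=0)
    mu1 = X_train[y_train == 1].mean(axis=0)
    dist0 = np.linalg.norm(X_test - mu0, axis=1)
    dist1 = np.linalg.norm(X_test - mu1, axis=1)
    return (dist1 < dist0).astype(float)

rng = np.random.default_rng(0)  # fixed seed: an assumption, for repeatability only
accuracies = []
for trial in range(num_trials):
    # fresh training and test sets for every trial
    X_train, y_train = make_split(n_train, d, var, rng)
    X_test, y_test = make_split(n_test, d, var, rng)
    y_pred = nearest_mean_predict(X_train, y_train, X_test)
    accuracies.append(np.mean(y_pred == y_test))

print(X_train.shape, X_test.shape)  # (50, 100) (10000, 100)
print(len(accuracies))              # 10, one accuracy per trial
```

Averaging `accuracies` over the trials gives a single score that is less sensitive to any one unlucky draw of the data.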

### You're done with part 3 of the tutorial!

### Move on to part 4 (called "UncertaintyForest_Tutorial_4-Running")