# TPOT

In [1]:
# run the 02-data-imputation.ipynb notebook (which in turn runs the loading and cleaning notebooks) to bring the pecarn_tbi, data, and data_imputed dataframes into the environment
%cd -q ../notebooks
%run ./02-data-imputation.ipynb
%cd -q -

START: 00-load-raw-data.ipynb
  PECARN TBI data read from c:\Jan\Capstone\data/TBI PUD 10-08-2013.csv into "pecarn_tbi" dataframe
START: 01-data-cleaning.ipynb
  Dropping records where GCS < 14
  Dropping GCS columns as they are now redundant
  Dropping AgeInMonth
  Renaming AgeinYears to Age
  Dropping EmplType
  Dropping AgeInMonth
  Dropping High_impact_InjSev
  Renaming InjuryMech to Injury_Mechanism
  Renaming ActNorm to Acting_Normal
  Setting Acting_Normal missing data to 1 (Yes)
  Dropping Findings## columns
  The cleaned dataset is now available in a dataframe named "data"
START: 02-data-imputation.ipynb
  Temporary Fix - dropping Age
  Removing 54459 NaN values from 41 columns
  The imputed dataset is now available in a dataframe named "data_imputed"


## One Hot Encoding
All of the input features are categorical, so we can one-hot encode them ahead of time.
Note: because the encoding is done here, we should configure a custom TPOT pipeline and remove the preprocessing options that no longer apply.

In [2]:
from sklearn import preprocessing

# don't one-hot-encode the class variable
data_inputs = data_imputed.drop(columns='PosIntFinal')

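# encode; the builtin int replaces the np.int alias, which was removed in NumPy 1.24
# (on scikit-learn 1.2 or later, pass sparse_output=False instead of sparse=False)
encoder = preprocessing.OneHotEncoder(sparse=False, dtype=int)
data_encoded = encoder.fit_transform(data_inputs)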
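
To make the effect of the encoding concrete, here is a minimal sketch on a hypothetical toy frame (the column values are invented for illustration and are not taken from the PECARN data). It relies on the encoder's default sparse output and densifies it with `.toarray()`, which sidesteps the `sparse` / `sparse_output` naming difference between scikit-learn versions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame standing in for data_inputs (values are made up)
toy = pd.DataFrame({
    "Injury_Mechanism": [1, 2, 2, 3],
    "Acting_Normal": [1, 0, 1, 1],
})

# Default output is a sparse matrix; convert it to a dense integer array
encoder = OneHotEncoder(dtype=int)
encoded = encoder.fit_transform(toy).toarray()

# One output column per (feature, category) pair: 3 + 2 = 5 columns here
print(encoded.shape)
print(encoder.categories_)
print(encoded)
```

Each row contains exactly one 1 per original column, so the row sums equal the number of input features.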

## Arrange Data
TPOT works with NumPy arrays, so we convert the target column to an array and cast the encoded inputs to the same integer type.

In [3]:
data_Y = data_imputed['PosIntFinal'].astype('int64').to_numpy()
data_X = data_encoded.astype('int64')

## Configure TPOT
Initial attempts to use the TPOT Light template took a long time to complete.

We therefore create a tpot_config based on the TPOT Light template, removing in particular the preprocessors that we aren't interested in (the scalers, Binarizer, Normalizer, and RBFSampler, which are left commented out in the config).

See https://github.com/EpistasisLab/tpot/blob/master/tpot/config/classifier_light.py

In [4]:
# https://github.com/EpistasisLab/tpot/blob/master/tpot/config/classifier_light.py
tpot_config = {

    # Classifiers
    'sklearn.naive_bayes.GaussianNB': {
    },

    'sklearn.naive_bayes.BernoulliNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    },

    'sklearn.naive_bayes.MultinomialNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    },

    'sklearn.tree.DecisionTreeClassifier': {
        'criterion': ["gini", "entropy"],
        'max_depth': range(1, 11),
        'min_samples_split': range(2, 21),
        'min_samples_leaf': range(1, 21)
    },


    'sklearn.neighbors.KNeighborsClassifier': {
        'n_neighbors': range(1, 101),
        'weights': ["uniform", "distance"],
        'p': [1, 2]
    },


    'sklearn.linear_model.LogisticRegression': {
        'penalty': ["l1", "l2"],
        'C': [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 1., 5., 10., 15., 20., 25.],
        'dual': [True, False]
    },

    # Preprocessors
    # 'sklearn.preprocessing.Binarizer': {
    #     'threshold': np.arange(0.0, 1.01, 0.05)
    # },

    'sklearn.cluster.FeatureAgglomeration': {
        'linkage': ['ward', 'complete', 'average'],
        'affinity': ['euclidean', 'l1', 'l2', 'manhattan', 'cosine']
    },

    # 'sklearn.preprocessing.MaxAbsScaler': {
    # },

    # 'sklearn.preprocessing.MinMaxScaler': {
    # },

    # 'sklearn.preprocessing.Normalizer': {
    #     'norm': ['l1', 'l2', 'max']
    # },

    'sklearn.decomposition.PCA': {
        'svd_solver': ['randomized'],
        'iterated_power': range(1, 11)
    },

    # 'sklearn.kernel_approximation.RBFSampler': {
    #     'gamma': np.arange(0.0, 1.01, 0.05)
    # },

    # 'sklearn.preprocessing.RobustScaler': {
    # },

    # 'sklearn.preprocessing.StandardScaler': {
    # },

    'tpot.builtins.ZeroCount': {
    },

    # Selectors
    'sklearn.feature_selection.SelectFwe': {
        'alpha': np.arange(0, 0.05, 0.001),
        'score_func': {
            'sklearn.feature_selection.f_classif': None
        }
    },

    'sklearn.feature_selection.SelectPercentile': {
        'percentile': range(1, 100),
        'score_func': {
            'sklearn.feature_selection.f_classif': None
        }
    },

    'sklearn.feature_selection.VarianceThreshold': {
        'threshold': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2]
    }

}
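
The config above was pruned by hand by commenting out entries. As an alternative, the same idea can be sketched programmatically. The `template` dict below is a small hypothetical stand-in, not the real TPOT Light template, and the prefix filter is an assumption about which operators to drop: it would not catch `sklearn.kernel_approximation.RBFSampler`, which the config above also removes.

```python
# Hypothetical stand-in for a TPOT config template (not the real Light template)
template = {
    'sklearn.naive_bayes.GaussianNB': {},
    'sklearn.preprocessing.MinMaxScaler': {},
    'sklearn.preprocessing.Normalizer': {'norm': ['l1', 'l2', 'max']},
    'sklearn.decomposition.PCA': {'svd_solver': ['randomized']},
}

# Operators whose dotted path starts with one of these prefixes are dropped
unwanted_prefixes = ('sklearn.preprocessing.',)

# Keep every operator that does not match an unwanted prefix
pruned = {
    name: params
    for name, params in template.items()
    if not name.startswith(unwanted_prefixes)
}

print(sorted(pruned))
```

Filtering a copy keeps the original template untouched, which makes it easy to compare runs with and without the removed operators.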

In [5]:
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data_X, data_Y, train_size=0.75, test_size=0.25)

tpot = TPOTClassifier(config_dict=tpot_config,
                      generations=3,
                      population_size=10, 
                      memory='auto',
                      n_jobs=-1,
                      #early_stop=5,
                      verbosity=3)
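
The split above passes neither `random_state` nor `stratify`, so each run draws a different partition and the class proportions in the two parts can drift. The following is a minimal sketch of a reproducible, stratified split on synthetic data. The arrays, the 90/10 class ratio, and the seed are all assumptions chosen for illustration and say nothing about the PECARN data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for data_X and data_Y (values are made up)
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))
y = np.array([0] * 180 + [1] * 20)  # assumed 90/10 class imbalance

# stratify=y preserves the class ratio in both parts;
# random_state makes the partition repeatable across runs
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fraction of positive labels in each part
print(y_tr.mean(), y_te.mean())
```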

## Call TPOT

In [6]:
tpot.fit(X_train, y_train)

Optimization Progress:   0%|          | 0/40 [00:00<?, ?pipeline/s]12 operators have been imported by TPOT.
Optimization Progress:  25%|██▌       | 10/40 [05:42<2:51:17, 342.60s/pipeline]Skipped pipeline #3 due to time out. Continuing to the next pipeline.
Skipped pipeline #7 due to time out. Continuing to the next pipeline.
Skipped pipeline #10 due to time out. Continuing to the next pipeline.
Optimization Progress:  68%|██████▊   | 27/40 [11:31<53:42, 247.87s/pipeline]Skipped pipeline #14 due to time out. Continuing to the next pipeline.
Skipped pipeline #21 due to time out. Continuing to the next pipeline.
Skipped pipeline #24 due to time out. Continuing to the next pipeline.
Skipped pipeline #26 due to time out. Continuing to the next pipeline.
Generation 1 - Current Pareto front scores:
-1	0.9999057295480321	DecisionTreeClassifier(input_matrix, DecisionTreeClassifier__criterion=entropy, DecisionTreeClassifier__max_depth=8, DecisionTreeClassifier__min_samples_leaf=20, DecisionTreeC

TPOTClassifier(config_dict={'sklearn.cluster.FeatureAgglomeration': {'affinity': ['euclidean',
                                                                                  'l1',
                                                                                  'l2',
                                                                                  'manhattan',
                                                                                  'cosine'],
                                                                     'linkage': ['ward',
                                                                                 'complete',
                                                                                 'average']},
                            'sklearn.decomposition.PCA': {'iterated_power': range(1, 11),
                                                          'svd_solver': ['randomized']},
                            'sklearn.feature_selection.SelectFwe': {'alpha': array([0.

## Score 

In [7]:
print(tpot.score(X_test, y_test))

0.9999057315233786
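
Since no `scoring` argument was passed to `TPOTClassifier`, the value above should be plain accuracy, which is TPOT's default for classifiers as far as we know. The sketch below uses invented labels (not the notebook's predictions) to show how accuracy and balanced accuracy can differ when one class is rare, which is one way to sanity-check a very high score.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    confusion_matrix,
)

# Invented labels: 95 negatives, 5 positives (an assumed imbalance)
y_true = np.array([0] * 95 + [1] * 5)

# A trivial model that always predicts the majority class
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)               # fraction of correct labels
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
cm = confusion_matrix(y_true, y_pred)              # rows: true, cols: predicted

print(acc)
print(bal_acc)
print(cm)
```

On real results, the same functions can be applied to `y_test` and `tpot.predict(X_test)`.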


## Export Pipeline

In [8]:
tpot.export('tpot_tbi_pipeline.py')