# Example of Tent Functions for ML

This notebook demonstrates how to use tent functions to featurize a set of persistence diagrams for machine learning.

In [1]:
import numpy as np
from scipy import stats

from teaspoon.ML.Base import ParameterBucket, build_G, TentParameters, train_test_split, ML_via_featurization, getPercentScore
from teaspoon.MakeData import PointCloud as gpc
import matplotlib.pyplot as plt

import pandas as pd

## Generate Diagrams from Manifold Test

Generate persistence diagrams from random point clouds sampled from a torus, an annulus, a cube, three clusters, three clusters of three clusters, and a sphere. See the details of the *testSetManifolds* function [here](http://elizabethmunch.com/code/teaspoon/namespaceteaspoon_1_1_make_data_1_1_point_cloud.html#a5d9c892f9f0a63f64437cbbde9048aeb). Select the dimension of the persistence diagrams to use; here we use dimension 1.

In [2]:
df = gpc.testSetManifolds(numDgms = 100, numPts = 100)

Generating torus clouds...
Generating annuli clouds...
Generating cube clouds...
Generating three cluster clouds...
Generating three clusters of three clusters clouds...
Generating sphere clouds...
Finished generating clouds and computing persistence.



## Long Version: Train Test Split & Set up Parameter Bucket
 
### Train Test Split:
 - Choose which column (or columns) you want to use for diagrams
 - Specify which column has the training labels
 
### Parameter Bucket
 - Create a *TentParameters* parameter bucket.
 - Set the parameter d, the mesh size in each direction.
 - Compute the adaptive partitions.
 - Set delta and epsilon for each partition.
 
 Note that all of these steps are handled by the function *getPercentScore*. For the shortened version using this function, scroll down.
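
To make the role of delta more concrete, here is a minimal sketch of a single tent function evaluated on a toy diagram. It assumes the common definition of a tent as a piecewise-linear bump in birth-lifetime coordinates that falls to zero at max-norm distance delta from its center. The helper `tent_value` and the toy diagram are hypothetical illustrations, not teaspoon's implementation, and the epsilon offset is left out.

```python
import numpy as np

def tent_value(points, center, delta):
    """Sum a single tent over all points of a diagram (birth-lifetime coordinates).

    Hypothetical helper for illustration only; not part of teaspoon.
    """
    points = np.asarray(points, dtype=float)
    a, b = center
    # Max-norm (Chebyshev) distance from each point to the tent center
    dist = np.maximum(np.abs(points[:, 0] - a), np.abs(points[:, 1] - b))
    # The tent is 1 at the center and decays linearly to 0 at distance delta
    return np.sum(np.maximum(0.0, 1.0 - dist / delta))

# Toy diagram given as (birth, death) pairs
dgm = np.array([[0.0, 1.0], [0.5, 2.0], [1.0, 1.2]])

# Convert to (birth, lifetime) coordinates, where lifetime = death - birth
bl = np.column_stack((dgm[:, 0], dgm[:, 1] - dgm[:, 0]))

value = tent_value(bl, center=(0.5, 1.0), delta=1.0)
print(value)
```

Each diagram point contributes according to how close it lies to the tent center, so one tent yields one scalar feature per diagram.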

In [3]:
dgm_col = 'Dgm1'
# Wrap a single column name in a list so one or several columns are handled alike
if isinstance(dgm_col, str):
    dgm_col = [dgm_col]
labels_col = 'trainingLabel'

# Set up the parameter bucket (the adaptive partition is built in a later cell)
params = TentParameters()

# Run train/test split using sklearn
D_train, D_test, L_train, L_test = train_test_split(df, df[labels_col], test_size=params.test_size, random_state = params.seed)

In [None]:
params.useAdaptivePart = True
params.d = [3,3]

# Concatenate training set into a pandas series:
allDgms = pd.concat((D_train[label] for label in dgm_col))

if params.useAdaptivePart:
    # Hand the series to the makeAdaptivePartition function
    params.makeAdaptivePartition(allDgms, meshingScheme = 'DV', numParts = 2)
else:
    # Just use the bounding box as the partition
    params.makeAdaptivePartition(allDgms, meshingScheme = 'None')
    
# Assign delta and epsilon for each partition.
# Without adaptive partitioning, they are assigned to the single partition covering the whole bounding box.
params.chooseDeltaEpsForPartitions()

## Plotting Training Set and Partitions

In [None]:
dgm_array = np.concatenate(list(allDgms))

# Plot partitions and overlay the data as (birth, lifetime) points, where lifetime = death - birth
plt.rcParams['figure.figsize'] = [10, 10]
params.partitions.plot()
plt.plot(dgm_array[:,0], dgm_array[:,1] - dgm_array[:,0], 'r*')

plt.show()

## Training

Use the teaspoon function *ML_via_featurization* to run ML with featurization on the persistence diagrams. It takes a data frame of persistence diagrams and the specified column labels, computes the G matrix using *build_G*, and trains a classifier using the labels from labels_col in the data frame. It returns the trained model.
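
As a rough picture of what a feature matrix like G contains, the sketch below builds one row per diagram and one column per tent. The function `toy_build_G`, the tent centers, and the toy diagrams are hypothetical and only mimic the idea; the actual *build_G* derives its tents from the parameter bucket and may differ in detail.

```python
import numpy as np

def toy_build_G(dgms, centers, delta):
    """Hypothetical stand-in for a feature matrix: rows are diagrams, columns are tents."""
    G = np.zeros((len(dgms), len(centers)))
    for i, dgm in enumerate(dgms):
        dgm = np.asarray(dgm, dtype=float)
        birth = dgm[:, 0]
        life = dgm[:, 1] - dgm[:, 0]  # lifetime = death - birth
        for j, (a, b) in enumerate(centers):
            # Max-norm distance of every point to the j-th tent center
            dist = np.maximum(np.abs(birth - a), np.abs(life - b))
            G[i, j] = np.maximum(0.0, 1.0 - dist / delta).sum()
    return G

# Two toy diagrams given as (birth, death) pairs
dgms = [np.array([[0.0, 1.0]]),
        np.array([[1.0, 2.0], [1.0, 3.0]])]

# Three tent centers in (birth, lifetime) coordinates
centers = [(0.0, 1.0), (1.0, 1.0), (1.0, 2.0)]

G = toy_build_G(dgms, centers, delta=1.0)
print(G.shape)  # (number of diagrams, number of tents)
```

A classifier is then trained on the rows of this matrix together with the training labels.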

In [None]:
print('Using ' + str(len(L_train)) + '/' + str(len(df)) + ' to train...')
clf = ML_via_featurization(D_train, labels_col = labels_col, dgm_col = dgm_col, params = params, verbose = True)

## Testing

Build the G matrix for the testing set, then use the model trained on the training data to predict the labels. Finally, score the predicted labels.
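
The two mechanical steps here can be sketched with plain NumPy: feature blocks from several diagram columns are joined side by side, and the score of a scikit-learn classifier is the fraction of correctly predicted labels. All arrays below are made-up placeholders, not output from teaspoon.

```python
import numpy as np

# Placeholder feature blocks for 4 test diagrams, e.g. from two diagram columns
G_block_a = np.ones((4, 3))
G_block_b = np.zeros((4, 2))

# Joining along axis=1 keeps one row per diagram and appends the feature columns
G = np.concatenate([G_block_a, G_block_b], axis=1)
print(G.shape)

# Placeholder true and predicted labels
labels_true = np.array([0, 1, 1, 2])
labels_pred = np.array([0, 1, 2, 2])

# Mean accuracy: the fraction of predictions that match the true labels
accuracy = np.mean(labels_true == labels_pred)
print(accuracy)
```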

In [None]:
print('Using ' + str(len(L_test)) + '/' + str(len(df)) + ' to test...')
listOfG = []
for dgmColLabel in dgm_col:
    G = build_G(D_test[dgmColLabel],params)
    listOfG.append(G)

G = np.concatenate(listOfG,axis = 1)

# Compute predictions and add them to the data frame (rows outside the test set get NaN)
L_predict = pd.Series(clf.predict(G),index = L_test.index)
df['Prediction'] = L_predict

# Compute score
score = clf.score(G,list(L_test))

print('Score on testing set: ' + str(score) +"...\n")

print('Finished with train/test experiment.')

output = {}
output['score'] = score
output['DgmsDF'] = df
output['clf'] = clf

## Short Version: Train Test Split & Set up Parameter Bucket

Use the function *getPercentScore* to set up the parameter bucket, perform the train/test split, and compute the accuracy.

In [6]:
params = TentParameters()
params.useAdaptivePart = True
params.d = [3,3]

num_runs = 1
yy = np.zeros((num_runs))
for i in np.arange(num_runs):
    xx, Dgms_train, Dgms_test = getPercentScore(df,
                    labels_col = 'trainingLabel',
                    dgm_col = 'Dgm1',
                    params = params,
                    verbose = True)
    yy[i] = xx['score']

print('\navg success rate = {}\nStdev = {}'.format(np.mean(yy), np.std(yy)))

---
Beginning experiment.
Variables in parameter bucket
---
feature_function : <function tent at 0x1a1bf63378>
useAdaptivePart : True
d : [3, 3]
delta : 1
epsilon : 0
clf_model : <class 'sklearn.linear_model.ridge.RidgeClassifierCV'>
seed : None
test_size : 0.33
maxPower : 1
---

Converting the data to ordinal...

Parameters d, delta and epsilon have all been assigned to each partition...

Using 402/600 to train...
Training estimator.
Making G...
Number of features used is 4848 ...
Checking score on training set...
Score on training set: 0.9900497512437811.

Using 198/600 to test...
Score on testing set: 0.9090909090909091...

Finished with train/test experiment.

avg success rate = 0.9090909090909091
Stdev = 0.0
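
The reported Stdev is 0.0 because only one run was performed, and the standard deviation of a single value is zero. With more runs, note that `np.std` uses the population formula (`ddof=0`) by default; the sample standard deviation requires `ddof=1`. The scores below are hypothetical and serve only to show the difference.

```python
import numpy as np

# Hypothetical test scores from three repeated runs
scores = np.array([0.90, 0.92, 0.88])

mean_score = np.mean(scores)
pop_std = np.std(scores)             # population standard deviation (ddof=0)
sample_std = np.std(scores, ddof=1)  # sample standard deviation

print('avg success rate = {}'.format(mean_score))
print('population stdev = {}, sample stdev = {}'.format(pop_std, sample_std))

# A single run always yields a standard deviation of zero
single_run_std = np.std(np.array([0.9090909090909091]))
```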
