## <span style="color:#0B3B2E;float:right;font-family:Calibri">Jordan Graesser</span>

# MpGlue
### Preparing data and testing model parameters

---
### Ranking and optimizing
---

In [1]:
import mpglue as gl

CL = gl.classification()

In [2]:
# Show the help documentation for `rank_feas`, which ranks
#   feature importance (Chi-square test by default).
help(CL.rank_feas)

Help on method rank_feas in module mpglue.classification.classification:

rank_feas(rank_text=None, rank_method='chi2', top_feas=1.0, be_quiet=False) method of mpglue.classification.classification.classification instance
    Ranks image features by importance.
    
    Args:
        rank_text (Optional[str]): A text file to write ranked features to. Default is None.
        rank_method (Optional[str]): The method to use for feature ranking. Default is 'chi2' (Chi^2). Choices are 
            ['chi2', 'rf'].
        top_feas (Optional[float or int]): The percentage or total number of features to reduce to. 
            Default is 1., or no reduction.
        be_quiet (Optional[bool]): Whether to be quiet and do not print to screen. Default is False.
    
    Returns:
        None, writes to ``rank_text`` if given and prints results to screen.
    
    Examples:
        >>> # rank image features
        >>> cl.split_samples('/samples.txt', scale_data=True)
        >>> cl.rank_feas(rank_t

In [2]:
samples = '../testing/data/08N_points_merged.txt'

# Sample the data
CL.split_samples(samples)

# Rank the explanatory variables with a chi-square test
CL.rank_feas(rank_method='chi2', top_feas=0.2)

21:11:05:INFO:6873:classification.rank_feas:**********************
*                    *
* Chi^2 Feature Rank *
*                    *
**********************

Rank      Variable      Value
----      --------      -----
21:11:05:INFO:6900:classification.rank_feas:1         37            13990426.650485039
21:11:05:INFO:6900:classification.rank_feas:2         38            7941858.708653258
21:11:05:INFO:6900:classification.rank_feas:3         32            305127.71373103437
21:11:05:INFO:6900:classification.rank_feas:4         11            218691.9687341912
21:11:05:INFO:6900:classification.rank_feas:5         18            183150.47969586635
21:11:05:INFO:6900:classification.rank_feas:6         4             176574.7245637728
21:11:05:INFO:6900:classification.rank_feas:7         25            138741.52341807316
21:11:05:INFO:6900:classification.rank_feas:8         39            30383.010275834953
21:11:05:INFO:6917:classification.rank_feas:  Mean score:  2873119.35
21:11:05:INFO:692
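The ranking above can be approximated in a few lines of plain Python. The sketch below assumes the Chi-square score is computed in the style of scikit-learn's `chi2` (observed versus expected per-class feature sums, on non-negative features) and that a float `top_feas` is a fraction that is truncated to a whole number of features. Both points are assumptions and are not confirmed by the MpGlue help text; `chi2_scores` and `top_features` are hypothetical helper names.

```python
from collections import defaultdict


def chi2_scores(X, y):
    """Chi-square score per feature for non-negative features X and labels y."""
    n_samples = len(X)
    n_features = len(X[0])

    # Per-class sample counts and per-class sums of each feature (observed).
    class_counts = defaultdict(int)
    observed = defaultdict(lambda: [0.0] * n_features)
    for row, label in zip(X, y):
        class_counts[label] += 1
        for j, value in enumerate(row):
            observed[label][j] += value

    feature_totals = [sum(row[j] for row in X) for j in range(n_features)]

    scores = []
    for j in range(n_features):
        score = 0.0
        for label, count in class_counts.items():
            # Expected sum if the feature were independent of the class.
            expected = feature_totals[j] * count / n_samples
            if expected > 0:
                score += (observed[label][j] - expected) ** 2 / expected
        scores.append(score)
    return scores


def top_features(scores, top_feas=1.0):
    """Return 0-based feature indices, best first, reduced by `top_feas`."""
    if isinstance(top_feas, float):
        n_keep = int(len(scores) * top_feas)  # assumed: fraction, truncated
    else:
        n_keep = top_feas                     # assumed: absolute count
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:n_keep]


# Toy data: feature 0 separates the classes, feature 1 is constant.
X = [[1, 5, 2], [1, 5, 3], [9, 5, 3], [9, 5, 4]]
y = [0, 0, 1, 1]
scores = chi2_scores(X, y)
print(top_features(scores, top_feas=2))
```

Under the truncation assumption, 41 variables with `top_feas=0.2` reduce to 8 features, which agrees with the eight rows printed above.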

In [6]:
# Rank the explanatory variables with a Random Forest 
#   model and compare results.

# First, construct a Random Forest model.
CL.construct_model(classifier_info={'classifier': 'rf'})

# Use the RF model to rank feature importance.
CL.rank_feas(rank_method='rf', top_feas=0.2)

11:37:01:INFO:4340:classification._train_model:  Training a RF model with 4,783 samples and 41 variables ...


---
### Optimizing parameters
---

In [6]:
help(CL.optimize_parameters)

Help on method optimize_parameters in module mpglue.classification.classification:

optimize_parameters(self, file_name, classifier_info={'classifier': 'RF'}, n_trees_list=[500, 1000, 1500, 2000], trials_list=[2, 5, 10], max_depth_list=[25, 30, 35, 40, 45, 50], min_samps_list=[2, 5, 10], criterion_list=['gini'], rand_vars_list=['sqrt'], cf_list=[0.25, 0.5, 0.75], committees_list=[1, 2, 5, 10], rules_list=[25, 50, 100, 500], extrapolation_list=[0, 1, 5, 10], class_weight_list=[None, 'balanced', 'balanced_subsample'], learn_rate_list=[0.1, 0.2, 0.4, 0.6, 0.8, 1.0], bool_list=[True, False], c_list=[1.0, 10.0, 20.0, 100.0], gamma_list=[0.001, 0.001, 0.01, 0.1, 1.0, 5.0], k_folds=3, perc_samp=0.5, ignore_feas=[], use_xy=False, classes2remove=[], method='overall', f1_class=0, stratified=False, spacing=1000.0, calibrate_proba=False, output_file=None) method of mpglue.classification.classification.classification instance
    Finds the optimal parameters for a classifier by training and testing

In [None]:
# Find the optimum parameters for a Random Forest,
#   using the default parameter list.
CL.optimize_parameters(samples, 
                       classifier_info={'classifier': 'rf'}, 
                       use_xy=True)


Finding the best paramaters for a RF model ...
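To get a feel for the cost of such a search, the sketch below enumerates every combination of a few parameter lists and multiplies by the number of folds. It is an illustration of a generic grid search with k-fold cross-validation, not MpGlue's implementation: the helper names and the dictionary keys are hypothetical, and which of the default lists actually apply to a Random Forest is an assumption.

```python
from itertools import product


def parameter_grid(**param_lists):
    """Yield one dict per combination of the supplied parameter lists."""
    names = list(param_lists)
    for values in product(*(param_lists[name] for name in names)):
        yield dict(zip(names, values))


def k_fold_indices(n_samples, k_folds):
    """Yield (train, test) index lists for k interleaved folds."""
    folds = [list(range(i, n_samples, k_folds)) for i in range(k_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Values copied from the defaults in the signature above; the key names
# are placeholders chosen for this sketch.
grid = list(parameter_grid(n_trees=[500, 1000, 1500, 2000],
                           max_depth=[25, 30, 35, 40, 45, 50],
                           min_samps=[2, 5, 10]))

k_folds = 3  # default in the signature above
n_fits = len(grid) * k_folds
print(len(grid), 'combinations,', n_fits, 'model fits')
```

Even this reduced grid requires a few hundred model fits, which is why a search over the full default lists can take a long time.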



---
### Testing
---

In [5]:
emat = gl.error_matrix()

help(emat)

Help on error_matrix in module mpglue.classification.error_matrix object:

class error_matrix(builtins.object)
 |  Computes accuracy statistics
 |  
 |  Args:
 |      po_text (str): Predicted and observed labels as a text file,
 |          where (predicted, observed) are the last two columns.
 |      po_array (ndarray): Predicted and observed labels as an array,
 |          where (predicted, observed) are the last two columns.
 |      header (Optional[bool]): Whether ``file`` or ``predicted_observed`` contains a header. Default is False.
 |      class_list (Optional[list])
 |      discrete (Optional[bool])
 |      e_matrix (Optional[ndarray])
 |  
 |  Attributes:
 |      n_classes (int): Number of unique classes.
 |      class_list (list): List of unique classes.
 |      e_matrix (ndarray): Error matrix.
 |      accuracy (float): Overall accuracy.
 |      report
 |      f_scores (float)
 |      f_beta (float)
 |      hamming (float)
 |      kappa_score (float)
 |      mae (float)
 |   

In [9]:
import numpy as np

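# Create some random (predicted, observed) labels. Casting
#   standard-normal floats to 'uint8' truncates most values to
#   0, 1 or 2, and negative values typically wrap to 254 or 255.
test_array = np.random.randn(100, 2).astype('uint8')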

emat.get_stats(po_array=test_array)
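For readers unfamiliar with error matrices, the following is a minimal sketch of how (predicted, observed) label pairs can be cross-tabulated. It assumes rows hold predicted labels and columns hold observed labels; `build_error_matrix` is a hypothetical helper, not part of MpGlue.

```python
def build_error_matrix(pairs):
    """Cross-tabulate (predicted, observed) label pairs.

    Returns the sorted class list and a square matrix with
    rows = predicted labels and columns = observed labels.
    """
    class_list = sorted({label for pair in pairs for label in pair})
    index = {label: i for i, label in enumerate(class_list)}
    matrix = [[0] * len(class_list) for _ in class_list]
    for predicted, observed in pairs:
        matrix[index[predicted]][index[observed]] += 1
    return class_list, matrix


pairs = [(0, 0), (0, 1), (1, 1), (255, 0), (0, 0)]
class_list, matrix = build_error_matrix(pairs)
print(class_list)
for row in matrix:
    print(row)
```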

In [7]:
print(dir(emat))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'error_matrix2xy', 'get_stats', 'kappa', 'merge_lists', 'producers_accuracy', 'sample_bias', 'time_stamp', 'users_accuracy', 'write_stats']


In [11]:
print(emat.accuracy)

49.0


In [12]:
print(emat.e_matrix)

[[48  7  1  0 19]
 [ 9  0  0  1  1]
 [ 0  1  0  0  1]
 [ 2  0  0  0  0]
 [ 7  1  1  0  1]]


In [13]:
print(emat.kappa_score)

-0.0793650793651
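The accuracy and kappa values above can be checked by hand from the printed error matrix. The sketch below uses the standard definitions of overall accuracy and Cohen's kappa; the function names are hypothetical and the code does not call MpGlue.

```python
def overall_accuracy(matrix):
    """Percentage of samples on the diagonal of the error matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 100.0 * correct / total


def cohen_kappa(matrix):
    """Cohen's kappa: agreement corrected for chance agreement."""
    size = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(size)) for j in range(size)]
    p_observed = sum(matrix[i][i] for i in range(size)) / n
    p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (p_observed - p_expected) / (1.0 - p_expected)


# Error matrix copied from the output above.
e_matrix = [[48, 7, 1, 0, 19],
            [9, 0, 0, 1, 1],
            [0, 1, 0, 0, 1],
            [2, 0, 0, 0, 0],
            [7, 1, 1, 0, 1]]

print(overall_accuracy(e_matrix))
print(cohen_kappa(e_matrix))
```

A negative kappa means the agreement is slightly worse than expected by chance, which is the expected outcome for random labels.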


In [14]:
print(emat.report)

             precision    recall  f1-score   support

          0       0.64      0.73      0.68        66
          1       0.00      0.00      0.00         9
          2       0.00      0.00      0.00         2
        254       0.00      0.00      0.00         1
        255       0.10      0.05      0.06        22

avg / total       0.44      0.49      0.46       100
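The per-class figures in the report follow directly from the error matrix. The sketch below assumes rows are predicted labels and columns are observed labels, an orientation under which the column sums equal the `support` values in the report. `class_report` is a hypothetical helper.

```python
def class_report(matrix):
    """Per-class (precision, recall, f1) with rows = predicted, cols = observed."""
    size = len(matrix)
    report = []
    for k in range(size):
        true_positive = matrix[k][k]
        n_predicted = sum(matrix[k])                          # row total
        n_observed = sum(matrix[i][k] for i in range(size))   # column total
        precision = true_positive / n_predicted if n_predicted else 0.0
        recall = true_positive / n_observed if n_observed else 0.0
        denominator = n_predicted + n_observed
        f1 = 2.0 * true_positive / denominator if denominator else 0.0
        report.append((precision, recall, f1))
    return report


# Error matrix copied from the output above.
e_matrix = [[48, 7, 1, 0, 19],
            [9, 0, 0, 1, 1],
            [0, 1, 0, 0, 1],
            [2, 0, 0, 0, 0],
            [7, 1, 1, 0, 1]]

for precision, recall, f1 in class_report(e_matrix):
    print('{:.2f}  {:.2f}  {:.2f}'.format(precision, recall, f1))
```

Precision corresponds to user's accuracy and recall to producer's accuracy, the terms used in the remote sensing literature.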



In [1]:
# Write the accuracy statistics to a text file.
emat.write_stats('datasets/my_report.txt')

---
### Models in MpGlue
---

In [3]:
# Check the available models.
print(CL.model_options())


        Supported models

        Parameter name -- Long name
                          *Module

        ab-dt       -- AdaBoost with CART (classification problems)
                        *Scikit-learn
        ab-ex-dt    -- AdaBoost with extremely random trees (classification problems)
                        *Scikit-learn
        ab-rf       -- AdaBoost with Random Forest (classification problems)
                        *Scikit-learn
        ab-ex-rf    -- AdaBoost with Extremely Random Forest (classification problems)
                        *Scikit-learn
        ab-dtr      -- AdaBoost with CART (regression problems)
                        *Scikit-learn
        ab-ex-dtr   -- AdaBoost with extremely random trees (regression problems)
                        *Scikit-learn
        bag-dt      -- Bagged Decision Trees (classification problems)
                        *Scikit-learn              
        bag-dtr     -- Bagged Decision Trees (regression problems)
                    

In [4]:
# Load the samples
CL.split_samples(samples)

In [5]:
# Construct a Random Forest model with 100 trees.
CL.construct_model(classifier_info={'classifier': 'rf',
                                    'n_estimators': 100})

# The model is stored in `model`.
print(CL.model)

21:11:39:INFO:5043:classification._train_model:  Training a rf model with 2,657 samples and 41 variables ...


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=25, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=5, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)


In [6]:
# Construct an extremely randomized Random Forest 
#   with 100 trees.
CL.construct_model(classifier_info={'classifier': 'ex-rf',
                                    'n_estimators': 100})

print(CL.model)

21:11:42:INFO:5043:classification._train_model:  Training an ex-rf model with 2,657 samples and 41 variables ...


ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=25, max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=5, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)


In [8]:
# Construct a boosted extremely randomized Random Forest with 
#   100 trees and 10 trials (boosts).
# CL.construct_model(classifier_info={'classifier': 'ab-ex-rf',
#                                     'n_estimators': 100,
#                                     'trials': 10})

# print(CL.model)

* Each example above trained a single model.
* MpGlue also supports ensemble models through Scikit-learn's `VotingClassifier` class.
* To train a voting classifier, pass a list of classifier names instead of a single string.
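As background on what a voting ensemble does, here is a minimal plain-Python sketch of the two common voting rules: hard voting (majority label) and soft voting (highest mean class probability). Which rule MpGlue applies is not stated here, so treat this as a conceptual illustration only; the function names are hypothetical.

```python
from collections import Counter


def hard_vote(predictions):
    """Majority label per sample; ties go to the smallest label.

    `predictions` holds one list of labels per model.
    """
    voted = []
    for sample_labels in zip(*predictions):
        counts = Counter(sample_labels)
        best = max(counts.values())
        voted.append(min(label for label, c in counts.items() if c == best))
    return voted


def soft_vote(probabilities):
    """Class index with the highest mean probability per sample.

    `probabilities` holds, per model, one probability list per sample.
    """
    n_models = len(probabilities)
    voted = []
    for sample_probs in zip(*probabilities):
        mean = [sum(p) / n_models for p in zip(*sample_probs)]
        voted.append(mean.index(max(mean)))
    return voted


# Three models, three samples.
labels = [[0, 1, 1], [0, 1, 2], [1, 1, 2]]
print(hard_vote(labels))

# Three models, two samples, two classes.
probas = [[[0.9, 0.1], [0.4, 0.6]],
          [[0.6, 0.4], [0.3, 0.7]],
          [[0.2, 0.8], [0.7, 0.3]]]
print(soft_vote(probas))
```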

In [14]:
# Construct an ensemble voting model and 
#   save the model to file.
CL.construct_model(classifier_info={'classifier': ['rf', 'ex-rf', 'bayes', 'qda', 'nn'],
                                    'n_estimators': 100,
                                    'trials': 5},
                   output_model='voting_model.model')

21:15:28:INFO:4466:classification._set_model:  Fitting a rf model ...
21:15:28:INFO:4466:classification._set_model:  Fitting a ex-rf model ...
21:15:28:INFO:4466:classification._set_model:  Fitting a bayes model ...
21:15:28:INFO:4466:classification._set_model:  Fitting a qda model ...
21:15:28:INFO:4466:classification._set_model:  Fitting a nn model ...
21:15:29:INFO:4466:classification._set_model:  Fitting a lightgbm model ...
21:15:29:INFO:5020:classification._train_model:  The model has already been trained as a voting model.
21:15:29:INFO:5184:classification._train_model:  Saving model to file ...
21:15:30:INFO:6659:classification.test_accuracy:  Getting test accuracy ...


---
### Making predictions on an image
---

In [11]:
# Load the voting model.
CL.construct_model(input_model='voting_model.model')

21:14:44:INFO:4084:classification._load_model:  Loading voting_model.model ...


In [13]:
# Apply the model to an image.
CL.predict('input_image.tif',
           'output_image.tif')

# Note: `predict` can also load the model itself via `input_model`,
#   so the separate `construct_model` call is not required.
CL.predict('input_image.tif',
           'output_image.tif',
           input_model='voting_model.model')

In [None]:
# Apply the model to an image, adjusting the block size.
CL.predict('input_image.tif',
           'output_image.tif',
           row_block_size=2048,
           col_block_size=2048)
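The block sizes control how the image is tiled during prediction. As a rough sketch of the idea, the generator below yields the windows a block-wise reader would visit, with edge blocks clipped to the image extent. The function name and the tuple layout are assumptions for illustration and do not reflect MpGlue internals.

```python
def block_windows(n_rows, n_cols, row_block_size, col_block_size):
    """Yield (row_start, col_start, n_block_rows, n_block_cols) per block."""
    for i in range(0, n_rows, row_block_size):
        # Clip the last row block to the image height.
        n_block_rows = min(row_block_size, n_rows - i)
        for j in range(0, n_cols, col_block_size):
            # Clip the last column block to the image width.
            n_block_cols = min(col_block_size, n_cols - j)
            yield i, j, n_block_rows, n_block_cols


# A hypothetical 5000 x 3000 image processed in 2048 x 2048 blocks.
windows = list(block_windows(5000, 3000, 2048, 2048))
print(len(windows), 'blocks; last block:', windows[-1])
```

Larger blocks mean fewer read and write calls but more memory per block, so the best size depends on the image and the machine.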

In [None]:
# Apply the model to an image, adjusting the number of parallel jobs.
CL.predict('input_image.tif',
           'output_image.tif',
           n_jobs=4,       # model parallel jobs
           n_jobs_vars=4)  # image band reading parallel jobs

In [None]:
# Apply the model to an image, and then apply
#   posterior probability label relaxation.
CL.predict('input_image.tif',
           'output_image.tif',
           relax_probabilities=True)

* By default, `predict` processes the image block by block, reading from one input image and writing to one output image.
* For very large input images, block writes to a single output file can be slow.
* In that case, `predict` can write each block to its own file instead.
* In the example below, individual blocks will be written as `/output_image_00001.tif`, `/output_image_00002.tif`, etc.

In [None]:
# Apply the model to an image, writing to individual blocks.
CL.predict('input_image.tif',
           'output_image.tif',
           write2blocks=True)
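The per-block file names follow a simple pattern: the output name plus a zero-padded block number. The helper below reproduces that pattern for planning downstream steps such as mosaicking. Numbering from 1 and five-digit padding are read from the file names quoted above; the handling of directories and the helper name itself are assumptions.

```python
from pathlib import PurePosixPath


def block_file_name(output_image, block_number):
    """Insert a zero-padded block number before the file extension."""
    path = PurePosixPath(output_image)
    name = '{}_{:05d}{}'.format(path.stem, block_number, path.suffix)
    return str(path.with_name(name))


# Names for the first three blocks of a hypothetical run.
names = [block_file_name('output_image.tif', n) for n in range(1, 4)]
print(names)
```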