## THEx Models
Welcome to THEx models. Models can be run from any Python interpreter, including this Jupyter Notebook, which walks you through how to run them. First, let's import all the currently implemented models.

In [1]:
from models.ktrees_model.ktrees_model import KTreesModel
from models.clus_hmc_ens_model.clus_hmc_ens_model import CLUSHMCENS
# So plots show up 
%matplotlib inline 

Next, let's run one of the models. We just need 2 things to run the models:
1. Have our repo properly set up (please view the README for extensive directions).
2. Have an idea of what features we'd like to filter on.

If you are unfamiliar with THEx, there is just 1 thing to keep in mind: our dataset is massively disparate. This means that no single row of data has values for every feature. Our dataset looks like this picture:

![title](figures/thexdataset.png)

No single row goes all the way across, so we need to select some columns to filter on. Our models allow 2 ways of filtering on columns:

- **cols** : Specific column names, provided as a list of strings. For example: ["NED_SDSS_u", "NED_SDSS_g", "NED_SDSS_r"]

- **col_matches** : Substrings to match column names on, provided as a list of strings. For example: ["AllWISE", "GALEX"] will filter on all columns whose names contain those strings, which turns out to be: AllWISE_W1mag, AllWISE_W2mag, AllWISE_W3mag, AllWISE_W4mag, AllWISE_Jmag, AllWISE_Hmag, AllWISE_Kmag, NED_GALEX_FUV, NED_GALEX_NUV
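To make the two filtering options concrete, here is a minimal sketch in plain Python. It is not the THEx implementation: the helpers `select_columns` and `complete_rows` are hypothetical, the sample values are made up, and the idea that rows with missing values in the selected columns are dropped is an assumption based on the description above.

```python
def select_columns(all_columns, cols=None, col_matches=None):
    """Return column names picked by exact names (cols) or substrings (col_matches)."""
    if cols is not None:
        return [c for c in all_columns if c in cols]
    if col_matches is not None:
        return [c for c in all_columns if any(m in c for m in col_matches)]
    return list(all_columns)


def complete_rows(rows, selected):
    """Keep only rows that have a value (not None) for every selected column.

    Assumption: this mimics 'filtering on' columns in a dataset with many gaps.
    """
    return [row for row in rows if all(row.get(c) is not None for c in selected)]


# Made-up column names and values, for illustration only
all_columns = ["AllWISE_W1mag", "AllWISE_W2mag", "NED_GALEX_FUV", "PS1_gmag"]
rows = [
    {"AllWISE_W1mag": 14.2, "AllWISE_W2mag": 14.0, "NED_GALEX_FUV": 19.5, "PS1_gmag": None},
    {"AllWISE_W1mag": 15.1, "AllWISE_W2mag": None, "NED_GALEX_FUV": 20.3, "PS1_gmag": 17.8},
]

selected = select_columns(all_columns, col_matches=["AllWISE", "GALEX"])
usable = complete_rows(rows, selected)
print(selected)     # columns whose names contain "AllWISE" or "GALEX"
print(len(usable))  # number of rows with values in all selected columns
```

In this toy data only the first row survives, because the second row has no value for one of the selected AllWISE columns. Selecting fewer columns generally leaves more usable rows.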


In [2]:
# Initialize model
mag_cols = ['GALEXAIS_FUV', 'GALEXAIS_NUV', # GALEX
            'AllWISE_W1mag', 'AllWISE_W2mag', 'AllWISE_W3mag',  'AllWISE_W4mag', # AllWISE
            'PS1_gmag', 'PS1_rmag', 'PS1_imag' , 'PS1_zmag', 'PS1_ymag' #Pan-STARRS
           ]
ktree = KTreesModel(cols = mag_cols)

In [None]:
# Run model
ktree.run_model()

And you're done! That's all you need to run our models. There are many optional parameters to be aware of though:
- **folds** (default = None) : Number of folds to use in k-fold cross-validation. If no number is passed in, the model will only run once. Recommended value: 3 to 6.
- **test_on_train** (default = False) : Boolean flag that, if True, will test on the training data. This helps to evaluate how well the model captures patterns in the training data.
- **incl_redshift** (default = False) : Boolean flag that, if True, will use redshift as a feature.
- **top_classes** (default = None) : Maximum number of classes to include, selected by popularity. For example, 10 keeps the 10 most popular classes.
- **subsample** (default = None) : Number of samples to randomly subsample over-represented classes down to.
- **transform_features** (default = False) : Enhance features by using differences between adjacent columns as features (for example: g - r).
- **one_all** (default = None) : List of classes to keep; all other classes (not passed in to this list) are placed into an 'Other' category. For example, pass in ["Ia", "II"] to have Ia, II, and Other as the only classes.
- **min_class_size** (default = 4) : Minimum number of samples in each class. If a class has fewer than this number of samples over the entire dataset, it is removed altogether.
- **data_split** (default = 0.3) : Fraction of data to use as the testing split. The remaining data will be used for training.
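A few of these options are easier to picture with a small example. The sketch below illustrates, in plain Python, what **transform_features**, **one_all**, and **min_class_size** describe. The helper names, the feature naming scheme, and the sample values are hypothetical and only follow the descriptions above; the actual THEx code may behave differently.

```python
from collections import Counter


def adjacent_differences(row, ordered_cols):
    """Build difference features from neighbouring columns, e.g. g - r, r - i."""
    return {
        f"{a}_minus_{b}": row[a] - row[b]
        for a, b in zip(ordered_cols, ordered_cols[1:])
    }


def collapse_to_other(labels, keep):
    """Relabel every class not listed in `keep` as 'Other' (the one_all idea)."""
    return [label if label in keep else "Other" for label in labels]


def drop_small_classes(labels, min_class_size):
    """Remove samples whose class has fewer than `min_class_size` members."""
    counts = Counter(labels)
    return [label for label in labels if counts[label] >= min_class_size]


# Made-up magnitudes and class labels, for illustration only
row = {"PS1_gmag": 18.5, "PS1_rmag": 18.0, "PS1_imag": 17.75}
features = adjacent_differences(row, ["PS1_gmag", "PS1_rmag", "PS1_imag"])

labels = ["Ia", "Ia", "II", "Ib", "Ic", "Ia", "II"]
grouped = collapse_to_other(labels, keep=["Ia", "II"])
kept = drop_small_classes(labels, min_class_size=2)

print(features)
print(grouped)
print(kept)
```

Here the three magnitudes yield two difference features (g - r and r - i), the Ib and Ic samples are relabelled as Other, and with a minimum class size of 2 the single-sample classes are dropped entirely.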

Let's change some of these flags around in the K-Trees Model to get an idea of their impact:

In [None]:
from models.ktrees_model.ktrees_model import KTreesModel
from datetime import datetime
import time

# Features to be used
mag_cols = ['GALEXAIS_FUV', 'GALEXAIS_NUV', # GALEX
            'AllWISE_W1mag', 'AllWISE_W2mag', 'AllWISE_W3mag',  'AllWISE_W4mag', # AllWISE
            'PS1_gmag', 'PS1_rmag', 'PS1_imag' , 'PS1_zmag', 'PS1_ymag' #Pan-STARRS
           ]
start_time = datetime.now() # for recording run-time

#Instantiate model
ktree = KTreesModel(
         cols = mag_cols,
         transform_features = True,
         incl_redshift = True,
         num_runs = 1,
         min_class_size = 6,
         folds = 3)
ktree.run_model()


end_time = datetime.now()
# Elapsed wall-clock time in minutes
mins = (end_time - start_time).total_seconds() / 60
print(ktree.name + " took " + str(round(mins, 2)) + " minutes to run")
