<img src="https://cybersecurity-excellence-awards.com/wp-content/uploads/2017/06/366812.png">


<h1><center>Darwin Unsupervised Model Building </center></h1>


# Prior to getting started:

First, if you have just received a new API key from support, you will need to register your key and create a new user (see Register user).

Second, in the Environment Variables cell: 
1. Set your username and password to ensure that you're able to log in successfully
2. Set the path to the location of your datasets if you are using your own data. The default path points to the example datasets.
  <br><b>NOTE:</b> There are two ways to analyze feature importance. One uses the entire dataset; the other analyzes a few samples to explain individual predictions. For the latter, use a small dataset (<= 500 rows) because processing individual samples takes a long time.

Here are a few things to be mindful of:
1. For every run, check the job status (i.e. requested, failed, running, completed) and wait for the job to complete before proceeding.
2. If you're not satisfied with your model and think that Darwin can benefit from extra training, use the resume function.
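
The first point above can be sketched as a small polling helper. This is a minimal sketch, not part of the SDK: the name `wait_until_done` and its parameters are hypothetical, and it assumes the lookup function returns an `(ok, info)` pair whose `info` dict has a `status` key, as `ds.lookup_job_status_name` does in this notebook. The exact spelling of the failure state (`'Failed'`) is also an assumption.

```python
import time


def wait_until_done(lookup, job_name, poll_seconds=5.0, timeout_seconds=3600.0):
    """Poll lookup(job_name) until the job completes, fails, or times out.

    `lookup` is expected to behave like ds.lookup_job_status_name: it returns
    an (ok, info) pair where info is a dict with a 'status' key.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        ok, info = lookup(job_name)
        if not ok:
            raise RuntimeError('Status lookup failed: {}'.format(info))
        state = str(info.get('status', '')).lower()
        if state == 'complete':
            return info
        if state == 'failed':  # assumed spelling of the failure state
            raise RuntimeError('Job failed: {}'.format(info.get('job_error')))
        if time.monotonic() >= deadline:
            raise TimeoutError('Job {} is still "{}" after the timeout'.format(job_name, state))
        time.sleep(poll_seconds)
```

With a logged-in `ds`, a call might look like `wait_until_done(ds.lookup_job_status_name, job_id['job_name'])`. The SDK's own `ds.wait_for_job`, used throughout this notebook, covers the common case; the sketch only adds an explicit timeout and raises on failure.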

## Set Darwin SDK

In [1]:
from amb_sdk.sdk import DarwinSdk
ds = DarwinSdk()
ds.set_url('https://amb-demo-api.sparkcognition.com/v1/')

(True, 'https://amb-demo-api.sparkcognition.com/v1/')

## Register user (if needed, read above)

In [2]:
# Use this cell only if you have a new API key and
# no registered users. Fill in the appropriate fields, then execute.

# Enter the API key and API key password provided by support to register/create new users
api_key = ''
api_key_pw = ''
status, msg = ds.auth_login(api_key_pw, api_key)
if not status:
    print(msg)

# Create a new user
status, msg = ds.auth_register_user('username', 'password','email@emailaddress.com')
if not status:
    print(msg)

401: UNAUTHORIZED - {"message": "Incorrect username or password"}

401: UNAUTHORIZED - {"msg":"Missing Authorization Header"}



## Environment Variables

In [3]:
# Set your username and password accordingly
USER='idunlap@rocketmail.com'
PW='5uVGHsTHrQ'

# Set path to datasets - The default below assumes Jupyter was started from amb-sdk/examples/Enterprise/
# Modify accordingly if you wish to use your own data
PATH_TO_DATASET = '../../sets/'
TRAIN_DATASET = 'pulsars.csv'
PREDICT_DATASET = 'pulsars_predict.csv'

# A timestamp is used to create a unique name in the event you execute the workflow multiple times or with 
# different datasets.  File names must be unique in Darwin.
import datetime
ts = '{:%Y%m%d%H%M%S}'.format(datetime.datetime.now())
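
The timestamp suffix can be wrapped in a small helper so that every job, artifact, and model name is built the same way. This is a sketch only; `unique_name` is a hypothetical helper, not an SDK function, and it uses the same format string as the cell above.

```python
import datetime


def unique_name(prefix, now=None):
    """Return '<prefix>-<YYYYmmddHHMMSS>' using the given or current time."""
    now = now or datetime.datetime.now()
    return '{}-{:%Y%m%d%H%M%S}'.format(prefix, now)


# A fixed time makes the result reproducible for illustration
example = unique_name('model', datetime.datetime(2019, 4, 17, 21, 59, 37))
print(example)  # model-20190417215937
```

Because the timestamp has one-second resolution, two names created with the same prefix within the same second would collide.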

## Import necessary libraries

In [4]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
from IPython.display import Image
from time import sleep
import os

# User Login

In [5]:
status, msg = ds.auth_login_user(USER,PW)
if not status:
    print(msg)
else:
    print('You are logged in.')

You are logged in.


# Data Upload

**Read dataset and view a file snippet**

In [6]:
# Preview dataset
df = pd.read_csv(os.path.join(PATH_TO_DATASET, TRAIN_DATASET))
df.head()

Unnamed: 0,mean_profile,std_profile,kurt_profile,skew_profile,mean_dmsnr,std_dmsnr,kurt_dmsnr,skew_dmsnr,class
0,111.09375,47.341089,0.435469,0.471339,2.386288,15.867173,9.327098,103.545876,0
1,105.0,49.203341,0.563215,0.38215,1.601171,14.657767,11.381829,148.33435,0
2,115.304688,43.653207,0.448319,0.614359,3.158027,21.378754,8.34743,76.310271,0
3,108.554688,52.559016,0.138068,-0.44234,1.787625,12.108555,11.262459,180.074252,0
4,136.429688,49.552164,-0.180418,0.370338,9.066054,37.284742,4.270014,17.700441,0


**Upload dataset to Darwin**

In [7]:
# Upload dataset; on failure, the second return value holds the error message
status, dataset = ds.upload_dataset(os.path.join(PATH_TO_DATASET, TRAIN_DATASET))
print(status)
print(dataset)

True
{'dataset_name': 'pulsars.csv'}


# Analyze Data
Before creating a model, analyze the data and then clean it.

In [8]:
status, analyze_id = ds.analyze_data(TRAIN_DATASET, 
                                     job_name = 'Darwin_analyze_data_job' + "-" + ts, 
                                     artifact_name = 'Darwin_analyze_data_artifact' + "-" + ts)
sleep(1)
if status:
    ds.wait_for_job('Darwin_analyze_data_job' + "-" + ts)
else:
    print(analyze_id)

{'status': 'Requested', 'starttime': '2019-04-17T21:59:36.469952', 'endtime': None, 'percent_complete': 0, 'job_type': 'AnalyzeData', 'loss': None, 'generations': None, 'dataset_names': ['pulsars.csv'], 'artifact_names': ['Darwin_analyze_data_artifact-20190417215937'], 'model_name': None, 'job_error': None}
{'status': 'Running', 'starttime': '2019-04-17T21:59:36.469952', 'endtime': None, 'percent_complete': 0, 'job_type': 'AnalyzeData', 'loss': None, 'generations': None, 'dataset_names': ['pulsars.csv'], 'artifact_names': ['Darwin_analyze_data_artifact-20190417215937'], 'model_name': None, 'job_error': None}
{'status': 'Running', 'starttime': '2019-04-17T21:59:36.469952', 'endtime': None, 'percent_complete': 10, 'job_type': 'AnalyzeData', 'loss': None, 'generations': None, 'dataset_names': ['pulsars.csv'], 'artifact_names': ['Darwin_analyze_data_artifact-20190417215937'], 'model_name': None, 'job_error': None}
{'status': 'Running', 'starttime': '2019-04-17T21:59:36.469952', 'endtime': 

In [9]:
ds.lookup_job_status_name(analyze_id['job_name'])

(True,
 {'status': 'Complete',
  'starttime': '2019-04-17T21:59:36.469952',
  'endtime': '2019-04-17T22:00:24.514',
  'percent_complete': 100,
  'job_type': 'AnalyzeData',
  'loss': None,
  'generations': None,
  'dataset_names': ['pulsars.csv'],
  'artifact_names': ['Darwin_analyze_data_artifact-20190417215937'],
  'model_name': None,
  'job_error': ''})

# Clean Data

Starting with version 1.6, the Darwin SDK offers a way to clean your data outside of model training. Every dataset needs to be cleaned before creating a model. You do not need to save the cleaned data and upload it.

In [10]:
# Clean dataset
status, job_id = ds.clean_data(dataset_name=TRAIN_DATASET)
if not status:
    print(job_id)
else:
    print('Data has been successfully cleaned!')

Data has been successfully cleaned!


# Create and Train Model 

To build unsupervised models, which cluster data and perform anomaly detection, Darwin goes through the following steps:
1. Determines an approximate number of clusters to start with using a single pass with a hierarchical method
2. Iterates on subsets of the data using a Spectral-Net algorithm to determine the ideal number of clusters
3. Proceeds to cluster the data using a Spectral-Net approach

In the cell below, specify the parameters used to create the model:
- model: the name of your model
- max_epochs: the number of epochs to train the model; one epoch is one pass over the entire dataset
- n_clusters: the number of clusters, either an integer or 'auto'; if left as 'auto', the unsupervised algorithm computes the number for you
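
The parameter rules above can be checked before a job is submitted. The following is a minimal sketch; `build_model_params` is a hypothetical helper, not part of the SDK, and the requirement that integers be at least 1 is an assumption. The keyword names match those passed to `ds.create_model` in this notebook.

```python
def build_model_params(model_name, max_epochs, n_clusters='auto'):
    """Validate the model parameters and return them as keyword arguments."""
    if not isinstance(model_name, str) or not model_name:
        raise ValueError('model_name must be a non-empty string')
    # bool is a subclass of int, so reject it explicitly
    if isinstance(max_epochs, bool) or not isinstance(max_epochs, int) or max_epochs < 1:
        raise ValueError('max_epochs must be a positive integer')
    if n_clusters != 'auto':
        if isinstance(n_clusters, bool) or not isinstance(n_clusters, int) or n_clusters < 1:
            raise ValueError("n_clusters must be a positive integer or 'auto'")
    return {
        'model_name': model_name,
        'max_epochs': max_epochs,
        'n_clusters': n_clusters,
    }
```

The result can be unpacked into the call, for example `ds.create_model(dataset_names=TRAIN_DATASET, **build_model_params(model, 20, 2))`.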

In [11]:
# Build model
model = "model" + "-" + ts
max_epochs = 20
n_clusters = 2
status, job_id = ds.create_model(dataset_names=TRAIN_DATASET,
                                 model_name=model,
                                 max_epochs=max_epochs,
                                 n_clusters=n_clusters)
sleep(1)
if status:
    ds.wait_for_job(job_id['job_name'])
else:
    print(job_id)

{'status': 'Running', 'starttime': '2019-04-17T22:00:40.617364', 'endtime': None, 'percent_complete': 0, 'job_type': 'TrainModel', 'loss': None, 'generations': 0, 'dataset_names': ['pulsars.csv'], 'artifact_names': None, 'model_name': 'model-20190417215937', 'job_error': ''}
{'status': 'Running', 'starttime': '2019-04-17T22:00:40.617364', 'endtime': None, 'percent_complete': 0, 'job_type': 'TrainModel', 'loss': None, 'generations': 0, 'dataset_names': ['pulsars.csv'], 'artifact_names': None, 'model_name': 'model-20190417215937', 'job_error': ''}
{'status': 'Running', 'starttime': '2019-04-17T22:00:40.617364', 'endtime': None, 'percent_complete': 0, 'job_type': 'TrainModel', 'loss': None, 'generations': 0, 'dataset_names': ['pulsars.csv'], 'artifact_names': None, 'model_name': 'model-20190417215937', 'job_error': ''}
{'status': 'Running', 'starttime': '2019-04-17T22:00:40.617364', 'endtime': None, 'percent_complete': 0, 'job_type': 'TrainModel', 'loss': None, 'generations': 0, 'dataset_

In [12]:
# look up job status
ds.lookup_job_status_name(job_id['job_name'])

(True,
 {'status': 'Complete',
  'starttime': '2019-04-17T22:00:40.617364',
  'endtime': '2019-04-17T22:01:39.942014',
  'percent_complete': 100,
  'job_type': 'TrainModel',
  'loss': None,
  'generations': 0,
  'dataset_names': ['pulsars.csv'],
  'artifact_names': None,
  'model_name': 'model-20190417215937',
  'job_error': ''})

In [13]:
# look up the model
ds.lookup_model_name(job_id['model_name'])

(True,
 {'type': 'Unsupervised',
  'updated_at': '2019-04-17T22:01:39.935142',
  'trained_on': ['pulsars.csv'],
  'loss': None,
  'generations': 0,
  'parameters': {'n_clusters': 2,
   'max_generation': 20,
   'train_time': '00:10',
   'recurrent': None,
   'max_unique_values': 50,
   'max_int_uniques': 15,
   'impute': 'mean',
   'big_data': False},
  'description': {'model': "UnsupervisedPipeline(anomaly=False, anomaly_prior=0.0015, auto_save_per=10,\n           clustering=True, clustermethod='GaussianMixture',\n           job_id='26374df0-6186-11e9-9ed5-6fd12eab83cc',\n           max_generation=20, max_time=600,\n           model_file='models/8f7f1eea-4fc6-11e9-ba76-eb31f920f59e_model-20190417215937',\n           n_clusters=2, preproc_anomaly=None, recurrent=None, verbose=2)",
   'genome_type': 'Unsupervised'},
  'train_time_seconds': 59,
  'algorithm': None,
  'running_job_id': None})

## Extra Training (Optional)
Run the following cell to train the model for additional epochs. It reuses the model name and `n_clusters` value defined above; only the number of extra epochs is new.

In [14]:
# Train some more
extra_epochs = 10
status, job_id = ds.resume_training_model(dataset_names=TRAIN_DATASET,
                                          model_name=model,
                                          max_epochs=extra_epochs,
                                          n_clusters=n_clusters)
sleep(1)
if status:
    ds.wait_for_job(job_id['job_name'])
else:
    print(job_id)

{'status': 'Running', 'starttime': '2019-04-17T22:01:43.108881', 'endtime': None, 'percent_complete': 0, 'job_type': 'UpdateModel', 'loss': None, 'generations': 0, 'dataset_names': ['pulsars.csv'], 'artifact_names': None, 'model_name': 'model-20190417215937', 'job_error': ''}
{'status': 'Complete', 'starttime': '2019-04-17T22:01:43.108881', 'endtime': '2019-04-17T22:01:56.967556', 'percent_complete': 100, 'job_type': 'UpdateModel', 'loss': None, 'generations': 0, 'dataset_names': ['pulsars.csv'], 'artifact_names': None, 'model_name': 'model-20190417215937', 'job_error': ''}


## Predict
Run the following cell to generate predictions.

In [15]:
# Test model
status, artifact = ds.run_model(TRAIN_DATASET, 
                                model, 
                                supervised=False)
sleep(1)
ds.wait_for_job(artifact['job_name'])

{'status': 'Running', 'starttime': '2019-04-17T22:02:00.347664', 'endtime': None, 'percent_complete': 0, 'job_type': 'RunModel', 'loss': None, 'generations': 0, 'dataset_names': ['pulsars.csv'], 'artifact_names': ['1fa52d74159b42718572a39918a2978d'], 'model_name': 'model-20190417215937', 'job_error': ''}
{'status': 'Complete', 'starttime': '2019-04-17T22:02:00.347664', 'endtime': '2019-04-17T22:02:03.003864', 'percent_complete': 100, 'job_type': 'RunModel', 'loss': None, 'generations': 0, 'dataset_names': ['pulsars.csv'], 'artifact_names': ['1fa52d74159b42718572a39918a2978d'], 'model_name': 'model-20190417215937', 'job_error': ''}


(True, 'Job completed')

In [16]:
# Get predictions
status, pred_file = ds.download_artifact(artifact['artifact_name'])

In [17]:
# View prediction
df = pd.read_csv(pred_file['filename'])
df.head()

Unnamed: 0,anomaly_score,predict_proba,prediction
0,-12.691042,"[0.9666828207784542, 0.033317179221545666]",0
1,-12.691042,"[0.9619552834389405, 0.03804471656105941]",0
2,-12.691042,"[0.9583237944289412, 0.041676205571058604]",0
3,-12.691042,"[0.9563806361669092, 0.04361936383309067]",0
4,-12.691042,"[0.9259344943050699, 0.07406550569493016]",0
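
In the output above, `predict_proba` appears as a bracketed list inside a single CSV field, so it will most likely be read back as a string rather than a list of floats. The sketch below shows one way to parse it using only the standard library; `parse_predictions` is a hypothetical helper, and the sample text copies the first two rows shown above.

```python
import csv
import io
import json

SAMPLE = '''Unnamed: 0,anomaly_score,predict_proba,prediction
0,-12.691042,"[0.9666828207784542, 0.033317179221545666]",0
1,-12.691042,"[0.9619552834389405, 0.03804471656105941]",0
'''


def parse_predictions(csv_text):
    """Turn each row into a dict with the cluster label and parsed probabilities."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        proba = json.loads(record['predict_proba'])  # "[a, b]" -> [a, b]
        rows.append({
            'prediction': int(record['prediction']),
            'proba': proba,
            'confidence': max(proba),  # probability of the most likely cluster
        })
    return rows


for row in parse_predictions(SAMPLE):
    print(row['prediction'], round(row['confidence'], 4))
```

With the pandas DataFrame from the cell above, `df['predict_proba'].apply(json.loads)` performs the same conversion column-wide.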


## Analyze Model
Analyze Model returns the features ranked by their importance to the model. It gives a general view of which features have the greatest impact on the model.

In [18]:
status, analyze_id = ds.analyze_model(job_id['model_name'], 
                                      job_name='Darwin_analyze_model_job-' + ts, 
                                      artifact_name='Darwin_analyze_model_artifact-' + ts)
sleep(1)
if status:
    ds.wait_for_job('Darwin_analyze_model_job-' + ts)
else:
    print(analyze_id)

{'status': 'Running', 'starttime': '2019-04-17T22:02:18.027824', 'endtime': None, 'percent_complete': 0, 'job_type': 'AnalyzeModel', 'loss': None, 'generations': 0, 'dataset_names': None, 'artifact_names': ['Darwin_analyze_model_artifact-20190417215937'], 'model_name': 'model-20190417215937', 'job_error': ''}
{'status': 'Complete', 'starttime': '2019-04-17T22:02:18.027824', 'endtime': '2019-04-17T22:02:21.042838', 'percent_complete': 100, 'job_type': 'AnalyzeModel', 'loss': None, 'generations': 0, 'dataset_names': None, 'artifact_names': ['Darwin_analyze_model_artifact-20190417215937'], 'model_name': 'model-20190417215937', 'job_error': ''}


In [19]:
ds.lookup_job_status_name('Darwin_analyze_model_job-' + ts)

(True,
 {'status': 'Complete',
  'starttime': '2019-04-17T22:02:18.027824',
  'endtime': '2019-04-17T22:02:21.042838',
  'percent_complete': 100,
  'job_type': 'AnalyzeModel',
  'loss': None,
  'generations': 0,
  'dataset_names': None,
  'artifact_names': ['Darwin_analyze_model_artifact-20190417215937'],
  'model_name': 'model-20190417215937',
  'job_error': ''})

Download and print the top 10 features

In [20]:
status, feature_importance = ds.download_artifact('Darwin_analyze_model_artifact-' + ts)
feature_importance

skew_profile    0.196476
kurt_profile    0.141127
skew_dmsnr      0.133543
std_profile     0.120845
kurt_dmsnr      0.096207
mean_profile    0.085166
mean_dmsnr      0.084242
std_dmsnr       0.071762
class = 1       0.070633
dtype: float64
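
To keep only the highest-ranked features, the importances can be sorted and truncated. This is a sketch using a plain dictionary populated with the values shown above; `top_features` is a hypothetical helper. It relies only on `.items()`, which a pandas Series also provides.

```python
IMPORTANCE = {
    'skew_profile': 0.196476,
    'kurt_profile': 0.141127,
    'skew_dmsnr': 0.133543,
    'std_profile': 0.120845,
    'kurt_dmsnr': 0.096207,
    'mean_profile': 0.085166,
    'mean_dmsnr': 0.084242,
    'std_dmsnr': 0.071762,
    'class = 1': 0.070633,
}


def top_features(importance, n=10):
    """Return the n (name, score) pairs with the highest importance."""
    ranked = sorted(importance.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


for name, score in top_features(IMPORTANCE, n=3):
    print('{:<14}{:.6f}'.format(name, score))
```

If `feature_importance` is a pandas Series, `feature_importance.nlargest(3)` gives the same selection directly.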

## Analyze Prediction
Unlike Analyze Model, Analyze Prediction reports feature importance for each individual data point. The output estimates how much each feature added to or subtracted from a known base value to produce the overall prediction.  <br>
**Set the path to a dataset that contains all the samples you want to analyze (max rows = 500).**
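
If your dataset is larger than the 500-row limit, one option is to write the first 500 data rows to a separate file and upload that file instead. The following is a minimal sketch using only the standard library; `write_sample` and both file paths are hypothetical, and taking the first rows rather than a random sample is a simplification.

```python
import csv
import itertools

MAX_ROWS = 500  # row limit stated for Analyze Prediction


def write_sample(src_path, dst_path, max_rows=MAX_ROWS):
    """Copy the header and at most max_rows data rows; return rows written."""
    with open(src_path, newline='') as src, open(dst_path, 'w', newline='') as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is None:  # empty input file
            return 0
        writer.writerow(header)
        count = 0
        for row in itertools.islice(reader, max_rows):
            writer.writerow(row)
            count += 1
    return count
```

For example, `write_sample('full.csv', 'sample.csv')` would produce a file with a header and at most 500 rows, which can then be passed to `ds.upload_dataset`.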

In [21]:
# Upload the samples whose feature importance you want to analyze (max: 500 rows)
status, dataset = ds.upload_dataset(os.path.join(PATH_TO_DATASET, PREDICT_DATASET))
if status:
    dataset_by_row = dataset['dataset_name']
else:
    print(dataset)
    print("Upload data failed!")

In [22]:
status, analyze_id = ds.analyze_predictions(job_id['model_name'], 
                                            PREDICT_DATASET, 
                                            job_name='Analyze_prediction_job-' + ts, 
                                            artifact_name='Analyze_prediction_artifact-' + ts)
sleep(1)
if status:
    ds.wait_for_job('Analyze_prediction_job-' + ts)
else:
    print(analyze_id)

{'status': 'Running', 'starttime': '2019-04-17T22:02:35.761435', 'endtime': None, 'percent_complete': 0, 'job_type': 'AnalyzePredictions', 'loss': None, 'generations': 0, 'dataset_names': None, 'artifact_names': ['Analyze_prediction_artifact-20190417215937'], 'model_name': 'model-20190417215937', 'job_error': ''}
{'status': 'Complete', 'starttime': '2019-04-17T22:02:35.761435', 'endtime': '2019-04-17T22:02:40.70888', 'percent_complete': 100, 'job_type': 'AnalyzePredictions', 'loss': None, 'generations': 0, 'dataset_names': None, 'artifact_names': ['Analyze_prediction_artifact-20190417215937'], 'model_name': 'model-20190417215937', 'job_error': ''}


In [23]:
ds.lookup_job_status_name('Analyze_prediction_job-' + ts)

(True,
 {'status': 'Complete',
  'starttime': '2019-04-17T22:02:35.761435',
  'endtime': '2019-04-17T22:02:40.70888',
  'percent_complete': 100,
  'job_type': 'AnalyzePredictions',
  'loss': None,
  'generations': 0,
  'dataset_names': None,
  'artifact_names': ['Analyze_prediction_artifact-20190417215937'],
  'model_name': 'model-20190417215937',
  'job_error': ''})

Download the artifact and view the per-sample feature contributions (first five rows)

In [24]:
status, feature_importance = ds.download_artifact('Analyze_prediction_artifact-' + ts)
feature_importance.head()

Unnamed: 0,mean_profile_shap,std_profile_shap,kurt_profile_shap,skew_profile_shap,mean_dmsnr_shap,std_dmsnr_shap,kurt_dmsnr_shap,skew_dmsnr_shap,class = 1_shap,base_value,predicted_proba,predicted_class
0,0.057928,0.081586,0.06133,0.06039,0.025644,0.0441,0.084346,0.082562,0.022604,0.443911,0.902422,0
1,0.068686,0.086406,0.071951,0.070056,0.027363,0.039938,0.063032,0.062591,0.030716,0.443911,0.950183,0
2,0.060826,0.093792,0.063636,0.06276,0.022066,0.04027,0.077618,0.078794,0.020414,0.443911,0.589138,0
3,0.067299,0.09111,0.068313,0.068326,0.034789,0.042636,0.063223,0.047388,0.037772,0.443911,0.963259,0
4,-0.022155,-0.078355,-0.04835,-0.071799,-0.067388,-0.046311,-0.064381,-0.057761,-0.061093,0.556089,0.993726,1
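
To see which features moved a single sample's prediction the most, the `_shap` columns of one row can be ranked by absolute value. This is a sketch on a plain dictionary that copies row 4 from the output above; `rank_contributions` is a hypothetical helper, not an SDK function.

```python
ROW_4 = {
    'mean_profile_shap': -0.022155,
    'std_profile_shap': -0.078355,
    'kurt_profile_shap': -0.04835,
    'skew_profile_shap': -0.071799,
    'mean_dmsnr_shap': -0.067388,
    'std_dmsnr_shap': -0.046311,
    'kurt_dmsnr_shap': -0.064381,
    'skew_dmsnr_shap': -0.057761,
    'class = 1_shap': -0.061093,
    'base_value': 0.556089,
    'predicted_proba': 0.993726,
    'predicted_class': 1,
}


def rank_contributions(row, suffix='_shap'):
    """Return (feature, contribution) pairs sorted by absolute contribution."""
    contributions = {
        key[:-len(suffix)]: float(value)
        for key, value in row.items()
        if key.endswith(suffix)
    }
    return sorted(contributions.items(), key=lambda pair: abs(pair[1]), reverse=True)


for feature, value in rank_contributions(ROW_4)[:3]:
    print('{:<14}{:+.6f}'.format(feature, value))
```

The sign shows the direction of each contribution relative to the base value, while the ranking uses magnitude only.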
