The goal of this notebook is to build an IoT detection model, that is, a model that determines whether a device is an IoT device.


The process of building the IoT detection model is divided into 5 main parts: installing the required resources, preparing the training data, creating the machine learning models, creating the endpoint, and testing the model.

# 1. Install Required Packages
* boto3 and sagemaker are packages that allow the notebook to interact with various AWS services

In [1]:
import boto3
import numpy as np
import pandas as pd
from sagemaker import get_execution_role

In [2]:
#set up the execution role, region and bucket
role = get_execution_role()
region = boto3.Session().region_name
bucket='sagemaker-model-ml'

# 2. Prepare Training Data
## 2.1 Read Training Data from S3

* Note: the AUGandOCT.csv file stored in the S3 bucket contains training data that has already been cleaned, but it still has some missing values. These must be filled in before building the prediction model.


In [3]:
#download dataset from S3
s3 = boto3.resource('s3')
s3.Bucket(bucket).download_file('AUGandOCT.csv','AUGandOCT.csv')

#read the training data and fill in the missing values with 0
IP = pd.read_csv('AUGandOCT.csv')
IP[['uri_num','query_num','CN','OU','O']] = \
IP[['uri_num','query_num','CN','OU','O']].fillna(0)

#define the Hamming distance to the baseline SAX value
def hamdist(str1, str2):
    diffs = 0
    for ch1, ch2 in zip(str1, str2):
        if ch1 != ch2:
            diffs += 1
    return diffs
IP['SAX_resp_ham']=IP.SAX_resp_h.apply(lambda x: hamdist(x,'bbbbbbbbbbbbbbbbbbbbbbbb'))
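
To make the `SAX_resp_ham` feature concrete, here is a minimal sketch of the same Hamming distance on a made-up SAX word. The sample word is hypothetical, and the baseline is assumed to be 24 `b` symbols long. Note that `zip` stops at the shorter string, so a SAX word shorter than the baseline is only compared over its own length.

```python
def hamdist(str1, str2):
    """Count the positions at which two strings differ (zip stops at the shorter one)."""
    return sum(ch1 != ch2 for ch1, ch2 in zip(str1, str2))

# Assumed baseline: 24 'b' symbols
baseline = "b" * 24

# Hypothetical SAX word that deviates from the baseline in two positions
sample = "b" * 8 + "a" + "b" * 9 + "c" + "b" * 5

distance = hamdist(sample, baseline)
print(distance)

# A shorter word is compared only over its own length
short_distance = hamdist("bbb", baseline)
print(short_distance)
```

A larger distance means the response pattern deviates more from the flat baseline.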

## 2.2 Training Data Preparation
* This process involves two steps:
    * select features based on the feature selection results
    * encode text values into a numerical representation
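
As a rough illustration of the encoding step, the sketch below fits a `LabelEncoder` once on the training labels and reuses it for the test labels, so both share the same text-to-number mapping. The label values `iot` and `non-iot` are hypothetical; the actual contents of the `IOT` column are not shown in this notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label values, for illustration only
train_labels = ["iot", "non-iot", "iot", "iot"]
test_labels = ["non-iot", "iot"]

le = LabelEncoder()
le.fit(train_labels)                      # learn the mapping once, from the training labels

train_encoded = le.transform(train_labels)
test_encoded = le.transform(test_labels)  # reuse the same mapping for the test labels

# classes_ is sorted, so the integer code is the index of each class
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)
print(list(train_encoded), list(test_encoded))
```

Fitting a fresh encoder on each dataset gives the same result only when both datasets contain exactly the same set of label values, so keeping one fitted encoder is the safer choice.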

In [4]:
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier

In [5]:
def encode(x):
    le=LabelEncoder()
    le.fit(x)
    x=le.transform(x)
    return x
#for training, we drop the MAC address and timestamp, then select the features we want
train = IP.drop(['mac','ts'],axis=1)
Feature = ['id_resp_h_num','proto_num','conn_state_num','history_num',
           'orig_bt_sum','resp_bt_sum','orig_bt_std','resp_bt_std',
           'uri_num','query_num','up_down','SAX_resp_ham','O','OU','IOT','Device']
train = train[Feature]

In [6]:
#we do not need to exclude any device, so ex = '' (empty). this chunk of code is left over from when we first built
#our model. for production purposes, just leave it as ex=''
ex=''
X = train.drop(['IOT','Device'],axis=1)[train.Device != ex]
yt = encode(train['IOT'][train.Device != ex])
X_train , X_test, yt_train, yt_test = train_test_split(X, yt, test_size = 0.2)
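
The split above is random and changes from run to run. If reproducibility or class balance matters, one option is to pass `random_state` and `stratify` to `train_test_split`. The sketch below uses synthetic data (a hypothetical 70/30 class mix), not the notebook's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 70/30 class mix (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 70 + [1] * 30)

# random_state makes the split repeatable; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())  # share of class 1 in each part
```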

# 3. Create and Save Models
## 3.1 Build and Find Best Models


In [8]:
# The line below installs the xgboost package in the notebook
# run it every time a new notebook instance is opened

# !conda install -y -c conda-forge xgboost

In [9]:
import xgboost as xgb
# sklearn.grid_search was removed from scikit-learn; GridSearchCV now lives in model_selection
from sklearn.model_selection import GridSearchCV



In [10]:
#grid search: tune the model's hyperparameters and find the combination with the best cross-validated AUC
#make sure you use the best parameters in your production model
xgb_model = xgb.XGBClassifier()
#these are the parameters to search. the task is binary classification, so the objective is binary:logistic
parameters = {'objective':['binary:logistic'],
              'learning_rate': [0.05,0.1,0.2], #the so-called `eta` value
              'max_depth': [6,7,8],#depth of each tree
              'silent': [1],
              'subsample': [0.5,0.8],#fraction of the training rows sampled for each tree
              'colsample_bytree': [0.7,0.8],#fraction of the columns (features) sampled for each tree
              'n_estimators': [1000]} #number of trees; use 1000 for better results
#n_jobs sets how many fits run in parallel; it changes the run time only, not how much the model overfits
#cv=3 matches the 3 folds shown in the output below
clf = GridSearchCV(xgb_model, parameters, n_jobs=1, cv=3,
                   scoring='roc_auc',
                   verbose=2, refit=True)
clf.fit(X_train, yt_train)
#trust your CV!
#best_params_ and best_score_ replace grid_scores_, which newer scikit-learn versions no longer provide
best_parameters, score = clf.best_params_, clf.best_score_
print('Raw AUC score:', score)
for param_name in sorted(best_parameters.keys()):
    print("%s: %r" % (param_name, best_parameters[param_name]))
test_probs = clf.predict_proba(X_test)[:,1]

Fitting 3 folds for each of 36 candidates, totalling 108 fits
[CV] colsample_bytree=0.7, learning_rate=0.05, max_depth=6, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.5 
[CV]  colsample_bytree=0.7, learning_rate=0.05, max_depth=6, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.5 -   0.2s
[CV] colsample_bytree=0.7, learning_rate=0.05, max_depth=6, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.5 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s


[CV]  colsample_bytree=0.7, learning_rate=0.05, max_depth=6, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.5 -   0.2s
[CV] colsample_bytree=0.7, learning_rate=0.05, max_depth=6, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.5 
[CV]  colsample_bytree=0.7, learning_rate=0.05, max_depth=6, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.5 -   0.2s
[CV] colsample_bytree=0.7, learning_rate=0.05, max_depth=6, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.8 
[CV]  colsample_bytree=0.7, learning_rate=0.05, max_depth=6, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.8 -   0.2s
[CV] colsample_bytree=0.7, learning_rate=0.05, max_depth=6, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.8 
[CV]  colsample_bytree=0.7, learning_rate=0.05, max_depth=6, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.8 -   0.2s
[CV] colsample_bytree=0.7, learning_rate=0.05, 

[CV]  colsample_bytree=0.7, learning_rate=0.1, max_depth=8, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.8 -   0.2s
[CV] colsample_bytree=0.7, learning_rate=0.1, max_depth=8, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.8 
[CV]  colsample_bytree=0.7, learning_rate=0.1, max_depth=8, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.8 -   0.2s
[CV] colsample_bytree=0.7, learning_rate=0.1, max_depth=8, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.8 
[CV]  colsample_bytree=0.7, learning_rate=0.1, max_depth=8, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.8 -   0.2s
[CV] colsample_bytree=0.7, learning_rate=0.2, max_depth=6, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.5 
[CV]  colsample_bytree=0.7, learning_rate=0.2, max_depth=6, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.5 -   0.1s
[CV] colsample_bytree=0.7, learning_rate=0.2, max_dept

[CV] colsample_bytree=0.8, learning_rate=0.05, max_depth=7, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.8 
[CV]  colsample_bytree=0.8, learning_rate=0.05, max_depth=7, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.8 -   0.2s
[CV] colsample_bytree=0.8, learning_rate=0.05, max_depth=7, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.8 
[CV]  colsample_bytree=0.8, learning_rate=0.05, max_depth=7, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.8 -   0.2s
[CV] colsample_bytree=0.8, learning_rate=0.05, max_depth=8, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.5 
[CV]  colsample_bytree=0.8, learning_rate=0.05, max_depth=8, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.5 -   0.2s
[CV] colsample_bytree=0.8, learning_rate=0.05, max_depth=8, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.5 
[CV]  colsample_bytree=0.8, learning_rate=0.05, max_dept

[CV]  colsample_bytree=0.8, learning_rate=0.2, max_depth=6, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.8 -   0.2s
[CV] colsample_bytree=0.8, learning_rate=0.2, max_depth=7, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.5 
[CV]  colsample_bytree=0.8, learning_rate=0.2, max_depth=7, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.5 -   0.1s
[CV] colsample_bytree=0.8, learning_rate=0.2, max_depth=7, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.5 
[CV]  colsample_bytree=0.8, learning_rate=0.2, max_depth=7, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.5 -   0.1s
[CV] colsample_bytree=0.8, learning_rate=0.2, max_depth=7, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.5 
[CV]  colsample_bytree=0.8, learning_rate=0.2, max_depth=7, n_estimators=1000, objective=binary:logistic, silent=1, subsample=0.5 -   0.2s
[CV] colsample_bytree=0.8, learning_rate=0.2, max_dept

[Parallel(n_jobs=1)]: Done 108 out of 108 | elapsed:   18.4s finished


Raw AUC score: 0.9997470374736552
colsample_bytree: 0.7
learning_rate: 0.1
max_depth: 6
n_estimators: 1000
objective: 'binary:logistic'
silent: 1
subsample: 0.8
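
The AUC reported above is the cross-validated score on the training split. The cell also computes `test_probs` for the held-out data but does not score them. A minimal sketch of how such probabilities could be scored with `roc_auc_score` follows; the labels and probabilities here are tiny made-up arrays, not the notebook's `yt_test` and `test_probs`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up stand-ins for yt_test and test_probs
y_true = np.array([0, 0, 1, 1])
probs = np.array([0.1, 0.4, 0.35, 0.8])

# AUC is the fraction of (positive, negative) pairs ranked in the right order:
# here 3 of the 4 pairs are ordered correctly
auc = roc_auc_score(y_true, probs)
print('Hold-out AUC:', auc)
```

In the notebook itself, the equivalent call would presumably be `roc_auc_score(yt_test, test_probs)`.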


In [None]:
#model trained with the best parameters found by the grid search
bt = xgb.XGBClassifier(max_depth=6,learning_rate=0.1,n_estimators=1000,colsample_bytree=0.7,silent=1,subsample=0.8)
bt.fit(X_train, yt_train, eval_set=[(X_test, yt_test)], verbose=False)

## 3.2 Save Best Model
You need this model file to create the IOTDetection endpoint.
It is the core component of your endpoint.

In [12]:
model_file_name = "IOTDetection"
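#get_booster() is the public accessor for the underlying Booster (instead of the private _Booster attribute)
bt.get_booster().save_model(model_file_name)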
!tar czvf Detection.tar.gz $model_file_name

IOTDetection
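
The `tar czvf` shell command above can also be expressed in plain Python with the standard `tarfile` module, which is handy outside a notebook. The sketch below works in a temporary directory and uses placeholder bytes in place of a real saved model.

```python
import os
import tarfile
import tempfile

# Work in a temporary directory; the model file here is a placeholder, not a real model
workdir = tempfile.mkdtemp()
model_path = os.path.join(workdir, "IOTDetection")
with open(model_path, "wb") as f:
    f.write(b"placeholder model bytes")

# Equivalent of `tar czvf Detection.tar.gz IOTDetection`
archive_path = os.path.join(workdir, "Detection.tar.gz")
with tarfile.open(archive_path, "w:gz") as tar:
    tar.add(model_path, arcname="IOTDetection")  # store under the bare file name

# Check what ended up in the archive
with tarfile.open(archive_path, "r:gz") as tar:
    members = tar.getnames()
print(members)
```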


In [14]:
import os
key= os.path.join(model_file_name, 'Detection.tar.gz')
#open the archive in a with-block so the file handle is closed after the upload
with open("Detection.tar.gz", 'rb') as fObj:
    boto3.Session().resource('s3').Bucket(bucket).Object(key).upload_fileobj(fObj)

# 4. Create Endpoint
## 4.1 Create Endpoint Configuration
These are the same parameters you would provide when creating the endpoint configuration from the SageMaker dashboard.


In [15]:
containers = {'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest',
              'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest',
              'us-east-2': '825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest',
              'eu-west-1': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:latest'}
container = containers[boto3.Session().region_name]
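
The lookup above raises a bare `KeyError` if the notebook runs in a region that is not in the dictionary. A small sketch of a more descriptive failure is shown below; `get_container` is a hypothetical helper name, and the image URIs are copied from the cell above.

```python
# Image URIs copied from the cell above
containers = {'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest',
              'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest',
              'us-east-2': '825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest',
              'eu-west-1': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:latest'}

def get_container(region_name):
    """Hypothetical helper: return the XGBoost image for a region or fail with a clear message."""
    if region_name not in containers:
        supported = ', '.join(sorted(containers))
        raise ValueError('No XGBoost image listed for region %r; supported regions: %s'
                         % (region_name, supported))
    return containers[region_name]

print(get_container('us-east-1'))
```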

In [16]:
%%time
from time import gmtime, strftime

model_name = model_file_name
model_url = 'https://s3-{}.amazonaws.com/{}/{}'.format(region,bucket,key)
sm_client = boto3.client('sagemaker')

print (model_url)

primary_container = {
    'Image': container,
    'ModelDataUrl': model_url,
}

create_model_response2 = sm_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = primary_container)

print(create_model_response2['ModelArn'])

https://s3-us-east-1.amazonaws.com/sagemaker-model-ml/IOTDetection/Detection.tar.gz
arn:aws:sagemaker:us-east-1:197066110901:model/iotdetection
CPU times: user 16 ms, sys: 4 ms, total: 20 ms
Wall time: 218 ms


In [17]:
from time import gmtime, strftime

endpoint_config_name = 'IOTDetectionEndpointConfig1'
print(endpoint_config_name)
create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType':'ml.m4.xlarge',
        'InitialInstanceCount':1,
        'InitialVariantWeight':1,
        'ModelName':model_name,
        'VariantName':'AllTraffic'}])

print("Endpoint Config Arn: " + create_endpoint_config_response['EndpointConfigArn'])

IOTDetectionEndpointConfig1
Endpoint Config Arn: arn:aws:sagemaker:us-east-1:197066110901:endpoint-config/iotdetectionendpointconfig1


## 4.2 Create Endpoint 

In [18]:
%%time
import time

endpoint_name = 'IOTDetectionEndpoint1'
print(endpoint_name)
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name)
print(create_endpoint_response['EndpointArn'])

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp['EndpointStatus']
print("Status: " + status)

while status=='Creating':
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp['EndpointStatus']
    print("Status: " + status)

print("Arn: " + resp['EndpointArn'])
print("Status: " + status)

IOTDetectionEndpoint1
arn:aws:sagemaker:us-east-1:197066110901:endpoint/iotdetectionendpoint1
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-east-1:197066110901:endpoint/iotdetectionendpoint1
Status: InService
CPU times: user 60 ms, sys: 8 ms, total: 68 ms
Wall time: 7min 1s
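
The polling loop above waits for as long as the status stays `Creating`. A possible variation adds a timeout so the cell cannot hang indefinitely. The sketch below is self-contained and uses a fake status source; `wait_for_status` and its parameters are hypothetical names, not part of boto3.

```python
import time

def wait_for_status(get_status, pending='Creating', poll_seconds=60,
                    timeout_seconds=1800, sleep=time.sleep):
    """Poll get_status() until it stops returning `pending`, or raise after the timeout."""
    waited = 0
    status = get_status()
    while status == pending:
        if waited >= timeout_seconds:
            raise TimeoutError('Still %r after %d seconds' % (pending, waited))
        sleep(poll_seconds)
        waited += poll_seconds
        status = get_status()
    return status

# Demo with a fake status source instead of a real SageMaker client
fake_statuses = iter(['Creating', 'Creating', 'InService'])
final_status = wait_for_status(lambda: next(fake_statuses),
                               poll_seconds=0, sleep=lambda seconds: None)
print('Status: ' + final_status)
```

With the real client, `get_status` could wrap the `describe_endpoint` call used above, for example `lambda: sm_client.describe_endpoint(EndpointName=endpoint_name)['EndpointStatus']`. The returned status should still be checked, since the loop also ends when creation fails.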


In [19]:
runtime_client = boto3.client('runtime.sagemaker')

# 5. Test Model

In [21]:
#test data: it must go through the same preprocessing code as the training data
test = pd.read_csv('testData_cleaned.csv')
test[['uri_num','query_num','CN','OU','O']] = \
test[['uri_num','query_num','CN','OU','O']].fillna(0)
test = test.drop(['mac','ts'],axis=1)
test = test[Feature]
#point_X holds the test features and point_y the test labels. point_X must have the same columns as the training data
point_X = test.drop(['IOT','Device'],axis=1)
point_y = encode(test['IOT'])
np.savetxt("test_point.csv", point_X, delimiter=",")

In [22]:
%%time
import json


file_name = 'test_point.csv' #customize to your test file; this is the file written in the previous cell
# in production, the same CSV rows could come from a database or S3, with a Lambda function feeding the CSV to the endpoint and returning the result

with open(file_name, 'r') as f:
    payload = f.read().strip()

response = runtime_client.invoke_endpoint(EndpointName=endpoint_name, 
                                   ContentType='text/csv', 
                                   Body=payload)
result = response['Body'].read().decode('ascii')
#print('Predicted Class Probabilities: {}.'.format(result))

CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 311 ms


In [23]:
predictedLabel = [1 if num >0.5 else 0 for num in np.array(result.split(',')).astype('float')]
confusion_matrix(point_y,predictedLabel)
#print('Predicted Class Label: {}.'.format(predictedLabel))
#print('Actual Class Label: {}.'.format(point_y))

array([[231,  67],
       [ 42, 366]])
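
To read the confusion matrix above: scikit-learn puts the true classes on the rows and the predicted classes on the columns. The sketch below derives accuracy, precision and recall from those four counts, assuming class 1 is the positive class. Which original label the encoder mapped to 1 is not shown in this notebook.

```python
import numpy as np

# Confusion matrix from the output above (rows: true class, columns: predicted class)
cm = np.array([[231, 67],
               [42, 366]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # of the points predicted as class 1, how many are class 1
recall = tp / (tp + fn)           # of the true class 1 points, how many were found

print('accuracy:  %.3f' % accuracy)
print('precision: %.3f' % precision)
print('recall:    %.3f' % recall)
```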

# WARNING: this line deletes the endpoint

In [None]:
sm_client.delete_endpoint(EndpointName=endpoint_name)