# Notebook for Splunk Machine Learning Toolkit Container for TensorFlow

This notebook contains an example workflow how to work on custom containerized code that seamlessly interfaces with the Splunk Machine Learning Toolkit (MLTK) Container for TensorFlow. This script contains an example of how to run an entity extraction algorithm over text using the spacy library.

## Stage 0 - import libraries
At stage 0 we define all imports necessary to run our subsequent code depending on various libraries.

In [1]:
# this definition exposes all python module imports that should be available in all subsequent commands
import json
import datetime
import numpy as np
import pandas as pd
import spacy
import sys

# global constants
MODEL_DIRECTORY = "/srv/app/model/data/"

In [2]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing purposes
print("numpy version: " + np.__version__)
print("pandas version: " + pd.__version__)
print("spacy version: " + spacy.__version__)

numpy version: 1.16.4
pandas version: 0.25.1
spacy version: 2.2.4


In [3]:
#import sys
#!{sys.executable} -m spacy download en_core_web_sm

## Stage 1 - get a data sample from Splunk
In Splunk run a search to pipe a prepared dataset into this environment.

| inputlookup fy21_basic | rename "本日の感想やご意見をお聞かせください" as comment, "本日のワークショップにご満足いただけましたか？" as satisfy, "ワークショップ実施日" as date| where isnotnull('comment') | eval workshop = if(date = "7/17/2020","AWS Security","Basic") | append [| inputlookup fy21_premium | rename "本日の感想や率直なご意見をお聞かせください" as comment, "本日のワークショップにご満足いただけましたか？" as satisfy, "ワークショップ実施日" as date , "本日のワークショップ種類" as workshop| where isnotnull('comment') ] | table workshop date satisfy comment 
| fit MLTKContainer mode=stage algo=spacy_ginza_token comment from date workshop satisfy into ws_spacy_token_stage

After you run this search your data set sample is available as a csv inside the container to develop your model. The name is taken from the into keyword ("spacy_entity_extraction_model in the example above) or set to "default" if no into keyword is present. This step is intended to work with a subset of your data to create your custom model.

In [4]:
# this cell is not executed from MLTK and should only be used for staging data into the notebook environment
def stage(name):
    with open("data/"+name+".csv", 'r') as f:
        df = pd.read_csv(f)
    with open("data/"+name+".json", 'r') as f:
        param = json.load(f)
    return df, param

In [5]:
# THIS CELL IS NOT EXPORTED - free notebook cell for testing purposes
df, param = stage("ws_spacy_token_stage")
print(df)
print(df.shape)
print(str(param))

          date  satisfy                                            comment  \
0    2/14/2020        4                             本格的に使うとなるとやはり難しそうではある。   
1    2/14/2020        5                  Splunkの基本動作の勉強になりました。ありがとうございました。   
2    2/14/2020        5  知らないコマンドを知れただけで有益でした。\nsplunkはもっとたくさん活用できる可能性が...   
3    2/14/2020        5                                分かりやすかったです。まず使ってみます   
4    2/14/2020        5  事前にsplunkを少しでも触っていると飲み込みやすい内容だった。splunkの特長を強調し...   
..         ...      ...                                                ...   
343  8/28/2020        4              セキュリティは現在の主業務ではありませんがログ分析の方法は参考になりました   
344  8/28/2020        4    途中参加でしたが、途中からでもログソースや各ログ内のフィールド、値がわかって参考になりました。   
345  8/28/2020        5             実践的な事案でのハンズオンで、大変勉強になりました。ありがとうございました。   
346  8/28/2020        4  自分にはPowerPoint内の課題の難易度が高かったですが、Office365分析における...   
347  8/28/2020        5  他社のSIEM製品を使用した運用を実施しているが、Splunkについては未経験の中スケジュー...   

           workshop  
0             Basic  
1             Basic

## Stage 2 - create and initialize a model

In [6]:
# initialize the model
# params: data and parameters
# returns the model object which will be used as a reference to call fit, apply and summary subsequently
def init(df,param):
    # Load English tokenizer, tagger, parser, NER and word vectors
    #import en_core_web_sm
    #model = en_core_web_sm.load()
    #model = spacy.load("en_core_web_sm")
    model = spacy.load("ja_ginza")
    return model

In [7]:
model = init(df,param)

## Stage 3 - fit the model

Note that for this algorithm the model is pre-trained (the en_core_web_sm library comes pre-packaged by spacy) and therefore this stage is a placeholder only

In [8]:
# returns a fit info json object
def fit(model,df,param):
    returns = {}
    
    return returns

## Stage 4 - apply the model

In [9]:
def apply(model,df,param):
    X = df['comment'].values.tolist()
    
    returns = list()
    
    for i in range(len(X)):
        doc = model(str(X[i]))
        
        
        entities = ''
        stop_words = ['[',']','、','。','.',',','\'','です']
        
        # Find named entities, phrases and concepts
        for entity in doc:
            if str(entities) in stop_words or len(entity) == 1:
                      continue
            elif entities == '':
                entities = entities + entity.text + ':' + entity.pos_
            else:
                entities = entities + '|' + entity.text + ':' + entity.pos_
        
        returns.append(entities)
    return returns

In [10]:
df['comment'].values.tolist()[:4]

['本格的に使うとなるとやはり難しそうではある。',
 'Splunkの基本動作の勉強になりました。ありがとうございました。',
 '知らないコマンドを知れただけで有益でした。\nsplunkはもっとたくさん活用できる可能性があると思いますが、セミナー等が少なく、販売代理店からはスキルを学べないため、より多くの情報を収集したいです。',
 '分かりやすかったです。まず使ってみます']

In [11]:
returns = apply(model,df,param)
returns[:5]

['本格的:NOUN|使う:VERB|なる:VERB|やはり:ADV|難し:ADJ|そう:ADJ|ある:AUX',
 'Splunk:NOUN|基本:NOUN|動作:NOUN|勉強:NOUN|なり:VERB|まし:AUX|ありがとう:INTJ|ござい:VERB|まし:AUX',
 '知ら:VERB|ない:AUX|コマンド:NOUN|知れ:VERB|だけ:ADP|有益:ADJ|でし:AUX|splunk:NOUN|もっと:ADV|たくさん:ADV|活用:VERB|できる:VERB|可能性:NOUN|ある:VERB|思い:VERB|ます:AUX|セミナー:NOUN|少なく:ADJ|販売:NOUN|代理店:NOUN|から:ADP|スキル:NOUN|学べ:VERB|ない:AUX|ため:NOUN|より:ADV|多く:NOUN|情報:NOUN|収集:VERB|たい:AUX|です:AUX',
 '分かり:VERB|やすかっ:NOUN|です:AUX|まず:ADV|使っ:VERB|ます:AUX',
 '事前:NOUN|splunk:NOUN|少し:ADV|触っ:VERB|いる:AUX|飲み込み:VERB|やすい:NOUN|内容:NOUN|だっ:AUX|splunk:NOUN|特長:NOUN|強調:VERB|喋っ:VERB|いただい:AUX|特長:NOUN|掴み:VERB|やすかっ:NOUN']

## Stage 5 - save the model

In [12]:
# save model to name in expected convention "<algo_name>_<model_name>.h5"
def save(model,name):
    # model will not be saved or reloaded as it is pre-built
    return model

## Stage 6 - load the model

In [13]:
# load model from name in expected convention "<algo_name>_<model_name>.h5"
def load(name):
    # model will not be saved or reloaded as it is pre-built
    return model

## Stage 7 - provide a summary of the model

In [14]:
# return model summary
def summary(model=None):
    returns = {"version": {"spacy": spacy.__version__} }
    if model is not None:
        # Save keras model summary to string:
        s = []
        returns["summary"] = ''.join(s)
    return returns

## End of Stages
All subsequent cells are not tagged and can be used for further freeform code