# Getting started with SplunkUtils

SplunkUtils is an API for the Splunk database. It can be used to query SIRI and GTFS data into a Pandas DataFrame.

## Step 1: Install splunk-sdk

SplunkUtils requires the splunk-sdk package.

You can install it with pip.

In [1]:
!pip install splunk-sdk



## Step 2: Set Splunk credentials

You need Splunk credentials (host, port, username, password) to open a connection to Splunk with SplunkUtils.

Note that the port is the Splunk management port (8089 by default), not the web port.

You can save your credentials in a JSON file, as in the following example, or simply hard-code them.

In [2]:
import json

with open('credentials.json', 'r') as f:
    credentials = json.load(f)
    
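# Note: this unpacking relies on the keys appearing in the file
# in the order host, port, username, password.
HOST, PORT, USERNAME, PASSWORD = credentials.values()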
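
Unpacking `credentials.values()` depends on the order of the keys in the JSON file. A minimal sketch of a more explicit alternative follows. It assumes the file uses the key names `host`, `port`, `username` and `password`; these names and the helper `load_credentials` are assumptions for illustration, not part of SplunkUtils.

```python
import json

def load_credentials(path):
    """Read Splunk credentials from a JSON file by key name.

    The key names below are assumed for illustration; adjust them
    to match your own credentials file.
    """
    with open(path, 'r') as f:
        credentials = json.load(f)
    return (credentials['host'],
            int(credentials['port']),  # read_splunk documents the port as an int
            credentials['username'],
            credentials['password'])

# Example usage (uncomment once credentials.json exists):
# HOST, PORT, USERNAME, PASSWORD = load_credentials('credentials.json')
```

Reading by key name means the file can list the entries in any order, and a missing entry raises a `KeyError` instead of silently shifting values.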

## Step 3: Download & import SplunkUtils

Download SplunkUtils.py and place it in the same folder as this notebook.

Then import **splunk_query_builder**, **read_splunk** and **loop_over_splunk** from SplunkUtils.

In [3]:
from SplunkUtils import splunk_query_builder, read_splunk, loop_over_splunk

## Step 4: read_splunk function

read_splunk runs an SPL query on Splunk and returns the results as a Pandas DataFrame.

Args:
   * query (str): SPL query.
   * username (str): Splunk username.
   * password (str): Splunk password.
   * host (str): Splunk host.
   * port (int): Splunk management port.
   * time_limit (int): time limit (minutes) for the query (default: 5).
   
Returns:
    DataFrame: query results.
    
**For example, let's query the GTFS data (route_stats) using Splunk's Search Processing Language (SPL).**

In [4]:
%%time

query = '''search index=route_stats earliest=-12d latest=-10d route_id=5189|
        fields agency_id, route_short_name, route_id, date, all_start_time | '''

GTFS_data = read_splunk(query,
          host =  HOST, port = PORT, username = USERNAME, password = PASSWORD)

start..

your query:
 search index=route_stats earliest=-12d latest=-10d route_id=5189|

        fields agency_id, route_short_name, route_id, date, all_start_time |
 

connection succeed

query status: 100.0%   13 scanned   2 matched   2 results

Done!

query succeed

read results succeed

job finished and canceled

finished! number of rows: 2

Wall time: 3.74 s


In [5]:
GTFS_data.head()

| | agency_id | route_short_name | route_id | date | all_start_time |
|---|---|---|---|---|---|
| 0 | 15 | 501 | 5189 | 2020-02-23 | 05:00:00;05:15:00;05:45:00;06:05:00;06:23:00;0... |
| 1 | 15 | 501 | 5189 | 2020-02-22 | 20:00:00;20:00:00;20:15:00;20:30:00;20:45:00;2... |
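
The `all_start_time` column above holds several start times in one semicolon-separated string per row. A minimal sketch of turning such a value into numbers follows. It uses a short made-up sample string rather than the real (truncated) values, and the helper name `start_times_to_seconds` is hypothetical.

```python
def start_times_to_seconds(value):
    """Convert 'HH:MM:SS;HH:MM:SS;...' into seconds since midnight."""
    seconds = []
    for part in value.split(';'):
        if not part:
            continue  # skip empty pieces, e.g. from a trailing ';'
        hours, minutes, secs = (int(x) for x in part.split(':'))
        seconds.append(hours * 3600 + minutes * 60 + secs)
    return seconds

# Sample value for illustration only (not taken from the query results).
sample = '05:00:00;05:15:00;05:45:00'
print(start_times_to_seconds(sample))
# [18000, 18900, 20700]
```

Working with plain integers avoids parsing errors if the data contains times past 24:00:00, which GTFS allows for trips that run after midnight of the service day.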


**Note that the read_splunk function runs a Splunk job. A very heavy query (or several of them) can cause the connection to fail.
You can stop and delete jobs using Splunk's Job Manager.**

## Step 5: splunk_query_builder function

To save you the time of learning SPL syntax, you can use this function, which creates an SPL query from a dictionary of query kwargs.
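
To make the idea concrete, here is a rough sketch of how keyword arguments could be mapped to an SPL string. The function `sketch_query_builder` is a hypothetical stand-in written for this guide; the real splunk_query_builder may format its output differently and supports options (such as 'max_columns') that are left out here.

```python
def sketch_query_builder(columns=None, **filters):
    """Build a simple SPL search string from keyword filters.

    Illustration only, not the SplunkUtils implementation.
    """
    # Each keyword becomes a field=value term of the search command.
    terms = ' '.join(f'{field}={value}' for field, value in filters.items())
    query = f'search {terms}'
    if columns:
        # Keep only the requested fields in the results.
        query += ' | fields ' + ', '.join(columns)
    return query

example_kwargs = {
    'index': 'siri',
    'earliest': '-10d',
    'route_id': 5189,
    'columns': ['timestamp', 'lat', 'lon'],
}
print(sketch_query_builder(**example_kwargs))
# search index=siri earliest=-10d route_id=5189 | fields timestamp, lat, lon
```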

**For example, let's query SIRI using the splunk_query_builder function.**

**First, we need to set our query_kwargs for filtering by column values.**

In [6]:
query_kwargs = {
    'index': 'siri',
    'earliest': '-10d',  # last 10 days
    # 'latest': '-8d',
    # 'agency_id': 3,
    'route_short_name': 501,
    'route_id': 5189,
    'planned_start_time': '22:00:00',
    # 'max_columns': 100000,
    'columns': ['timestamp', 'agency_id', 'route_id', 'route_short_name', 'service_id',
                'planned_start_date', 'planned_start_time',
                'bus_id', 'predicted_end_time', 'time_recorded', 'lat', 'lon'],
}

In [7]:
%%time

SIRI_data = read_splunk(splunk_query_builder(**query_kwargs),
          host =  HOST, port = PORT, username = USERNAME, password = PASSWORD)

start..

your query:
 search index=siri earliest=-10d route_short_name=501 route_id=5189 planned_start_time=22:00:00 |
 fields timestamp, agency_id, route_id, route_short_name, service_id, planned_start_date, planned_start_time, bus_id, predicted_end_time, time_recorded, lat, lon |


connection succeed

query status: 100.0%   2390 scanned   45 matched   45 results

Done!

query succeed

read results succeed

job finished and canceled

finished! number of rows: 45

Wall time: 4.4 s


In [8]:
SIRI_data.head()

| | timestamp | agency_id | route_id | route_short_name | service_id | planned_start_date | planned_start_time | bus_id | predicted_end_time | time_recorded | lat | lon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-02-29T22:56:09 | 15 | 5189 | 501 | 43319863 | 2020-02-29 | 22:00:00 | 86344401 | 23:00:00 | 22:46:23 | 32.073746 | 34.790249 |
| 1 | 2020-02-29T22:55:09 | 15 | 5189 | 501 | 43319863 | 2020-02-29 | 22:00:00 | 86344401 | 22:59:00 | 22:46:23 | 32.073746 | 34.790249 |
| 2 | 2020-02-29T22:54:09 | 15 | 5189 | 501 | 43319863 | 2020-02-29 | 22:00:00 | 86344401 | 22:58:00 | 22:46:23 | 32.073746 | 34.790249 |
| 3 | 2020-02-29T22:53:09 | 15 | 5189 | 501 | 43319863 | 2020-02-29 | 22:00:00 | 86344401 | 22:57:00 | 22:46:23 | 32.073746 | 34.790249 |
| 4 | 2020-02-29T22:52:09 | 15 | 5189 | 501 | 43319863 | 2020-02-29 | 22:00:00 | 86344401 | 22:56:00 | 22:46:23 | 32.073746 | 34.790249 |


**Some tips for splunk_query_builder function:**

* Splunk searches perform better when you filter on indexed columns. For now, the function accepts only one value per filter column (index), so you can't pass lists in query_kwargs.

* Use the 'earliest' and 'latest' time modifiers to customize the time range of your search. You can specify an exact time such as earliest="10/5/2019:20:00:00", or a relative time such as earliest=-1h or latest=@w6. To learn more about SPL time modifiers, see: https://docs.splunk.com/Documentation/Splunk/7.2.6/SearchReference/SearchTimeModifiers.

* It's recommended to declare the columns you want in query_kwargs ('columns').

* You can limit the number of results in query_kwargs ('max_columns').

* Note that an empty result may be caused by a typo in a column name or filter value.
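
The exact-time form of 'earliest' and 'latest' mentioned in the tips writes the date as month/day/year, followed by a colon and the time. A small sketch of formatting a Python `datetime` that way follows; the helper name `to_spl_time` is hypothetical and not part of SplunkUtils.

```python
from datetime import datetime

# Month/day/year, a colon, then the time, zero-padded,
# e.g. 11/01/2019:04:00:00.
SPL_TIME_FORMAT = '%m/%d/%Y:%H:%M:%S'

def to_spl_time(moment):
    """Return a datetime as a quoted SPL time string."""
    return '"' + moment.strftime(SPL_TIME_FORMAT) + '"'

print('earliest=' + to_spl_time(datetime(2019, 11, 1, 4, 0)))
# earliest="11/01/2019:04:00:00"
```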


## Step 6: loop_over_splunk function (using a loop to get more than 50,000 results)

Splunk API limits the number of results per query to 50,000.

For now, we created a function that uses a loop to overcome this limitation.

**First, we will need to set a base query_kwargs dict and a loop_kwargs list.**

The query_kwargs dict sets the default query kwargs for all sub-queries, and loop_kwargs is a list of query kwargs dicts in which each element defines a different sub-query.

For this example, let's query SIRI for two high-frequency bus routes in Jerusalem (lines 15 and 19 of Egged).

In [9]:
query_kwargs = {
    'index': 'siri',
    'agency_id': 3,
    'columns': ['timestamp', 'agency_id', 'route_id', 'route_short_name', 'service_id', 'planned_start_time',
                'planned_start_date', 'bus_id', 'predicted_end_time', 'time_recorded', 'lat', 'lon'],
}

In [10]:
loop_kwargs = [{"route_id": 12405},
                {"route_id": 23823}
              ]

**Second, let's define the time gaps (e.g. days, hours, minutes) for the loop.**

The time gap needs to be small enough that each run collects no more than 50,000 results, and large enough to keep the number of runs low for better performance.

The following example queries 8 days of data using time gaps of 5 days.

In [11]:
time_args = {"start_time": "11/01/2019 04:00",
             "end_time": "11/09/2019 04:00",
             "freq": "120h"}

In [12]:
SIRI_loop = loop_over_splunk(host=HOST, port=PORT, username=USERNAME, password=PASSWORD,
                                    query_kwargs=query_kwargs,
                                    time_args=time_args, 
                                    loop_kwargs=loop_kwargs)

start..

your query:
 search index=siri earliest="11/01/2019:04:00:00" latest="11/06/2019:04:00:00" agency_id=3 route_id=12405 |
 fields timestamp, agency_id, route_id, route_short_name, service_id, planned_start_time, planned_start_date, bus_id, predicted_end_time, time_recorded, lat, lon |


connection succeed

query status: 100.0%   30343 scanned   30343 matched   30343 results

Done!

query succeed

read results succeed

job finished and canceled

finished! number of rows: 30343

start..

your query:
 search index=siri earliest="11/06/2019:04:00:00" latest="11/09/2019:04:00:00" agency_id=3 route_id=12405 |
 fields timestamp, agency_id, route_id, route_short_name, service_id, planned_start_time, planned_start_date, bus_id, predicted_end_time, time_recorded, lat, lon |


connection succeed

query status: 100.0%   21463 scanned   21463 matched   21463 results

Done!

query succeed

read results succeed

job finished and canceled

finished! number of rows: 21463

start..

your query:
 

In [13]:
len(SIRI_loop)

55012

In this example we used both loop_kwargs and time_args, but you can also set only one of them when needed.
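
The way the loop combines its inputs can be sketched as follows. This is an illustration under stated assumptions, not the SplunkUtils implementation: the helpers `split_time_range` and `sketch_sub_queries` are hypothetical, the time gap is passed as a `timedelta` instead of a "120h" string, and the ordering (every time window for the first loop element, then the next element) is inferred from the two queries printed above.

```python
from datetime import datetime, timedelta

def split_time_range(start, end, step):
    """Yield consecutive (window_start, window_end) pairs covering start..end."""
    window_start = start
    while window_start < end:
        window_end = min(window_start + step, end)  # last window may be shorter
        yield window_start, window_end
        window_start = window_end

def sketch_sub_queries(query_kwargs, loop_kwargs, start, end, step):
    """Combine the base kwargs with every loop element and time window."""
    spl_format = '%m/%d/%Y:%H:%M:%S'
    sub_queries = []
    for extra in loop_kwargs:
        for window_start, window_end in split_time_range(start, end, step):
            kwargs = dict(query_kwargs)   # copy the shared defaults
            kwargs.update(extra)          # add this element's filters
            kwargs['earliest'] = window_start.strftime(spl_format)
            kwargs['latest'] = window_end.strftime(spl_format)
            sub_queries.append(kwargs)
    return sub_queries

sub_queries = sketch_sub_queries(
    query_kwargs={'index': 'siri', 'agency_id': 3},
    loop_kwargs=[{'route_id': 12405}, {'route_id': 23823}],
    start=datetime(2019, 11, 1, 4, 0),
    end=datetime(2019, 11, 9, 4, 0),
    step=timedelta(hours=120),
)
for kwargs in sub_queries:
    print(kwargs['route_id'], kwargs['earliest'], kwargs['latest'])
# 12405 11/01/2019:04:00:00 11/06/2019:04:00:00
# 12405 11/06/2019:04:00:00 11/09/2019:04:00:00
# 23823 11/01/2019:04:00:00 11/06/2019:04:00:00
# 23823 11/06/2019:04:00:00 11/09/2019:04:00:00
```

Each sub-query covers one route and one time window, so every run can stay under the 50,000-result limit as long as the window is small enough.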