# Infino - Rally-Tracks Example

Elasticsearch runs nightly benchmarks, and the data used for them is available. The GitHub repository is [here](https://github.com/elastic/rally-tracks).

In this example, we use Infino to index [this file](http://benchmarks.elasticsearch.org.s3.amazonaws.com/corpora/http_logs/documents-181998.json.bz2), which has about 2.7M documents.

In [1]:
import concurrent.futures
from infinopy import InfinoClient
import json
import sys
import time

## Step 1: Download the data



In [12]:
# Use curl to download the bz2-compressed data to /tmp
!curl -o /tmp/rally-tracks.json.bz2 http://benchmarks.elasticsearch.org.s3.amazonaws.com/corpora/http_logs/documents-181998.json.bz2

# Decompress the data and view a sample.
!bzip2 -d /tmp/rally-tracks.json.bz2
!ls -l /tmp/rally-tracks.json
!head -n 5 /tmp/rally-tracks.json

# Change the @timestamp key to date, as that is the key Infino's default config uses.
# The command below works on macOS (BSD sed). On Linux (GNU sed), drop the empty '' argument after -i.
!sed -i '' 's/@timestamp/date/g' /tmp/rally-tracks.json

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13.2M  100 13.2M    0     0  7616k      0  0:00:01  0:00:01 --:--:-- 7629k
bzip2: Output file /tmp/rally-tracks.json already exists.
-rw-r--r--@ 1 vinaykakade  wheel  347260278 Jun 12 12:47 /tmp/rally-tracks.json
{"date": 893964617, "clientip":"40.135.0.0", "request": "GET /images/hm_bg.jpg HTTP/1.0", "status": 200, "size": 24736}
{"date": 893964653, "clientip":"232.0.0.0", "request": "GET /images/hm_bg.jpg HTTP/1.0", "status": 200, "size": 24736}
{"date": 893964672, "clientip":"26.1.0.0", "request": "GET /images/hm_bg.jpg HTTP/1.0", "status": 200, "size": 24736}
{"date": 893964679, "clientip":"247.37.0.0", "request": "GET /french/splash_inet.html HTTP/1.0", "status": 200, "size": 3781}
{"date": 893964682, "clientip":"247.37.0.0", "request": "GET /images/hm_nbg.jpg HTTP/1.0", "status": 304, "size": 0}
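Since the `sed` invocation above is platform-specific, the same rename can be sketched in portable Python. This is a minimal sketch, not part of the original notebook: the helper names `rename_key` and `rename_key_in_file` are hypothetical. Unlike `sed`, it renames only the top-level key and leaves any `@timestamp` text inside values untouched.

```python
import json

def rename_key(line, old="@timestamp", new="date"):
    """Rename one top-level key in a JSON document string, keeping key order."""
    doc = json.loads(line)
    renamed = {(new if key == old else key): value for key, value in doc.items()}
    return json.dumps(renamed)

def rename_key_in_file(src_path, dst_path, old="@timestamp", new="date"):
    """Stream a newline-delimited JSON file and write the renamed documents."""
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:
            if line.strip():  # skip blank lines
                dst.write(rename_key(line, old, new) + "\n")

# Small in-memory demonstration on a document shaped like the ones above.
sample = '{"@timestamp": 893964617, "clientip": "40.135.0.0", "status": 200}'
print(rename_key(sample))
```

To process the real file, `rename_key_in_file` would be called with the downloaded path as the source and a different path as the destination, since it does not edit in place.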


## Step 2: Start Infino server

We start Infino using Docker.

In [13]:
!docker run --rm --detach --name infino-example -p 3000:3000 infinohq/infino:latest

e82cbb7970fd442e7307dd1241612b7b32f59f6684749b959738f14176e851d6
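Because the container starts in detached mode, the server may not be accepting connections the moment `docker run` returns. One way to guard against that is to poll the published port (3000, from the command above) before creating the client. The helper `wait_for_port` below is a hypothetical name, not part of infinopy, and a successful TCP connection only shows that something is listening, not that Infino has finished initializing.

```python
import socket
import time

def wait_for_port(host="localhost", port=3000, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

A call such as `wait_for_port("localhost", 3000)` placed before `InfinoClient()` would then return `False` instead of letting the first request fail if the port never opens.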


## Step 3: Publish the data to Infino

We'll treat the `date` field as the timestamp, the `request` field as the log message on which we want full-text search, and the rest of the fields as tags.
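To make that mapping concrete, the sketch below splits one sample document into its three roles. It is for illustration only: `split_document` is a hypothetical helper, not part of infinopy, and the notebook itself sends the documents to `append_log` unchanged.

```python
def split_document(doc, timestamp_key="date", message_key="request"):
    """Split a log document into (timestamp, message, tags)."""
    timestamp = doc[timestamp_key]
    message = doc[message_key]
    # Every remaining field is treated as a tag.
    tags = {k: v for k, v in doc.items() if k not in (timestamp_key, message_key)}
    return timestamp, message, tags

# First document from the sample shown in Step 1.
doc = {"date": 893964617, "clientip": "40.135.0.0",
       "request": "GET /images/hm_bg.jpg HTTP/1.0", "status": 200, "size": 24736}
timestamp, message, tags = split_document(doc)
print("timestamp:", timestamp)
print("message:  ", message)
print("tags:     ", tags)
```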

In [14]:
# Create infino client.
client = InfinoClient()

# Read the file downloaded above.
with open('/tmp/rally-tracks.json', 'r') as fh:
    lines = fh.readlines()

# Create groups of log messages to be inserted, with each group having 15000 messages.
inner_list_length = 15000
result = []
inner_list = []

for i, item in enumerate(lines):
    inner_list.append(json.loads(item))
    
    if (i + 1) % inner_list_length == 0 or i == len(lines) - 1:
        result.append(inner_list)
        inner_list = []

# Ingest the data in Infino.
print("Starting to ingest data in Infino...")

start = time.time()
num = 0

def process_inner_list(inner_list):
    response = client.append_log(inner_list)
    assert response.status_code == 200
    return len(inner_list)

# Call Infino's append_log using 10 parallel threads.
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(process_inner_list, inner_list) for inner_list in result]
    for future in concurrent.futures.as_completed(futures):
        num += future.result()
        print("Number of documents inserted so far %d" % num)

end = time.time()

print("Completed ingesting data in Infino")

duration = end-start

print("Number of documents indexed = %d" % num)
print("Time taken by infino = %.2f seconds" % duration)
throughput = num/duration
print("Infino indexing throughput = %.2f documents per second" % throughput)

Starting to ingest data in Infino...
Number of documents inserted so far 15000
Number of documents inserted so far 30000
Number of documents inserted so far 45000
Number of documents inserted so far 60000
Number of documents inserted so far 75000
Number of documents inserted so far 90000
Number of documents inserted so far 105000
Number of documents inserted so far 120000
Number of documents inserted so far 135000
Number of documents inserted so far 150000
Number of documents inserted so far 165000
Number of documents inserted so far 180000
Number of documents inserted so far 195000
Number of documents inserted so far 210000
Number of documents inserted so far 225000
Number of documents inserted so far 240000
Number of documents inserted so far 255000
Number of documents inserted so far 270000
Number of documents inserted so far 285000
Number of documents inserted so far 300000
Number of documents inserted so far 315000
Number of documents inserted so far 330000
Number of documents ins
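The cell above reads the whole file into memory and parses every line before grouping. For larger inputs, the grouping step could instead be written as a generator that parses and yields one batch at a time. This is a minimal sketch under that assumption: `iter_batches` is a hypothetical helper, and the demonstration uses a small in-memory file rather than the real data.

```python
import io
import json

def iter_batches(lines, batch_size=15000):
    """Yield lists of parsed JSON documents, at most batch_size per list."""
    batch = []
    for line in lines:
        if not line.strip():  # skip blank lines
            continue
        batch.append(json.loads(line))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter batch
        yield batch

# Demonstration with five small documents and a batch size of 2.
sample = io.StringIO(
    "\n".join(json.dumps({"date": i, "request": "GET /"}) for i in range(5)) + "\n"
)
sizes = [len(batch) for batch in iter_batches(sample, batch_size=2)]
print(sizes)  # [2, 2, 1]
```

With the real file, the open file handle would be passed as `lines`, and each yielded batch could be submitted to `process_inner_list` in place of the entries of `result`.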

## Step 4: Search and check the index size

We run a few example search queries and look at the index size.

In [15]:
# Search for a particular image.
response = client.search_log("/images/bord_stories.gif", 0, int(time.time()))
print(response)
result = json.loads(response.text)
print("Number of results =", len(result))
# Uncomment below to print the actual results.
# print("First 10 results =", result[0:10])

# Search for all GET requests.
response = client.search_log("get", 0, int(time.time()))
print(response)
result = json.loads(response.text)
print("Number of results =", len(result))
# Uncomment below to print the actual results.
# print("First 10 results =", result[0:10])

# Look at the Infino index size.
# Infino commits the index contents to disk periodically. So, sleep for some time
# so that the contents will be flushed to disk and we can get the correct index size.
time.sleep(30)
!docker exec -t -i infino-example du -h /opt/infino/index


<Response [200]>
Number of results = 6736
<Response [200]>
Number of results = 2699430
112M	/opt/infino/index/0
28K	/opt/infino/index/1
112M	/opt/infino/index
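For a rough sense of scale, the index size can be compared with the size of the uncompressed input reported by `ls -l` in Step 1. The arithmetic below is a sketch that assumes `du -h` reports `112M` in units of 1024 x 1024 bytes and that the index had been fully flushed to disk when it was measured.

```python
# Size of /tmp/rally-tracks.json in bytes, from the `ls -l` output in Step 1.
raw_bytes = 347260278

# Index size from `du -h` (112M); assumes M means 1024 * 1024 bytes.
index_bytes = 112 * 1024 * 1024

ratio = raw_bytes / index_bytes
print("Raw data is about %.2f times the size of the index" % ratio)
```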


## Step 5: Stop Infino server

In [16]:
!docker rm -f infino-example

infino-example
