In [1]:
from datetime import datetime
from elasticsearch import Elasticsearch, RequestsHttpConnection
import nbformat as nbf
import json
import warnings
import requests
warnings.filterwarnings("ignore")

The problem? I have many Jupyter notebooks and I need an easy way to search through them all. I keep remembering that I have some snippet of code <i>somewhere</i> in these notebooks, so I need an easy way to find it. 

Enter <b>Elasticsearch</b>

Elasticsearch "is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents". That means you can put lots of text documents into it, and it will index them all and make them easy to search.

The Elasticsearch folks have also built <b>Kibana</b>, which is "proprietary data visualization dashboard software for Elasticsearch". That means Kibana is a handy GUI tool you can use to quickly search your data, similar to using a Google search box.

In this notebook, I will demonstrate how to: 

1. Create Docker containers for Elasticsearch and Kibana on a shared Docker network
2. Use nbformat to read your .ipynb files into a list of strings
3. Upload that list of strings into an Elasticsearch index
4. Search through the notebooks using either the Python Elasticsearch library, or Kibana

<i>Prerequisites:</i><br/> That you know how to create a Jupyter Notebook and save it somewhere, and that you have Docker installed. 

<i>Caveat:</i><br/>This assumes you just want to search through your own notebooks as part of your workflow, so I am going to ignore some Elasticsearch security options, which is why I suppress warnings in this notebook. Don't use this approach for anything beyond a development setup. 

This will all take about 5 minutes.

<b>Set up and Hello World</b>

Let's start with some setup. Open a Terminal on your machine (bash on Linux, PowerShell on Windows, whatever). Then run the following commands. 
1. <code>docker network create elastic</code><br/>
This will tell Docker to create a Docker network called 'elastic'
2. <code>docker pull docker.elastic.co/elasticsearch/elasticsearch:7.13.1</code><br/>
This will pull down an image of Elasticsearch 7.13.1, the latest version at the time of writing
3. <code>docker run --name es01-test --net elastic -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.13.1</code><br/>
This will create a Docker container on the 'elastic' network and publish ports 9200 and 9300 so you can access it. If you go to localhost:9200 you will see a welcome message

So that's it for installing and getting Elasticsearch up and running. Now let's do the same for Kibana. Open a new shell and run the following commands:

4. <code>docker pull docker.elastic.co/kibana/kibana:7.13.1</code><br/>
This will pull down an image of Kibana
5. <code>docker run --name kib01-test --net elastic -p 5601:5601 -e "ELASTICSEARCH_HOSTS=http://es01-test:9200" docker.elastic.co/kibana/kibana:7.13.1</code><br/>
This will create a container for Kibana. It will see the Elasticsearch instance and be connected to it. You can go to localhost:5601 and see the Kibana homepage
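
Elasticsearch can take a little while to finish starting, so the first request against it may be refused. Here is a minimal sketch of a polling helper, assuming the default URL <code>http://localhost:9200</code>. The names <code>es_is_up</code> and <code>wait_for</code> are hypothetical and not part of any library:

```python
import time
import urllib.request


def es_is_up(url="http://localhost:9200", timeout=2.0):
    """Return True if an HTTP GET to `url` answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses
        return False


def wait_for(check, attempts=30, delay=1.0, sleep=time.sleep):
    """Call `check()` until it returns True or `attempts` runs out."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling <code>wait_for(es_is_up)</code> before the rest of the notebook runs would then block for up to roughly 30 seconds until the container answers.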

So it looks like it is working. Let's make sure our notebook can see it too: 

In [17]:
res = requests.get('http://host.docker.internal:9200')
print(res.content)

b'{\n  "name" : "829766e7847b",\n  "cluster_name" : "docker-cluster",\n  "cluster_uuid" : "mLWu9gbQQqqOy5xB3IONVg",\n  "version" : {\n    "number" : "7.13.1",\n    "build_flavor" : "default",\n    "build_type" : "docker",\n    "build_hash" : "9a7758028e4ea59bcab41c12004603c5a7dd84a9",\n    "build_date" : "2021-05-28T17:40:59.346932922Z",\n    "build_snapshot" : false,\n    "lucene_version" : "8.8.2",\n    "minimum_wire_compatibility_version" : "6.8.0",\n    "minimum_index_compatibility_version" : "6.0.0-beta1"\n  },\n  "tagline" : "You Know, for Search"\n}\n'


So this notebook can connect to the Elasticsearch instance. Note that I am using <code>host.docker.internal</code> in the URL of that GET request. This is because I run Jupyter in Docker as well (details at: https://jupyter-docker-stacks.readthedocs.io/en/latest/using/selecting.html). If you run Jupyter directly on your machine, for example from an Anaconda install, this URL would use <code>localhost</code> rather than <code>host.docker.internal</code>
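
To avoid editing the host name by hand in every cell, the choice can be made in one place. This is a small sketch, assuming an environment variable called <code>ES_HOST</code>. That variable name and the helper <code>es_base_url</code> are made up for this example:

```python
import os


def es_base_url(host=None, port=9200):
    """Build the Elasticsearch base URL.

    Host resolution order: explicit argument, then the ES_HOST environment
    variable (a hypothetical name chosen for this sketch), then "localhost".
    """
    resolved = host or os.environ.get("ES_HOST") or "localhost"
    return f"http://{resolved}:{port}"
```

With Jupyter running in Docker you would set <code>ES_HOST=host.docker.internal</code> and then call <code>requests.get(es_base_url())</code>.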

Now, using our Python elasticsearch library, let's create a connection to Elasticsearch

In [21]:
# Note "host.docker.internal" might be "localhost" if you are running an Anaconda version of Jupyter
es = Elasticsearch(hosts=[{"host": "host.docker.internal", "port": 9200}],
                   connection_class=RequestsHttpConnection, max_retries=30,
                   retry_on_timeout=True, request_timeout=30)

Let's create an index (think of this as a db) and put some data into it: 

In [22]:
#index some test data
es.index(index='testing-index', doc_type='test', id=1, body={'test': 'test'})

{'_index': 'testing-index',
 '_type': 'test',
 '_id': '1',
 '_version': 1,
 'result': 'created',
 '_shards': {'total': 2, 'successful': 1, 'failed': 0},
 '_seq_no': 2,
 '_primary_term': 1}

We can go to <code>http://localhost:9200/testing-index/_search?pretty=true&q=*:*</code> and see that the data now exists in Elasticsearch. Or we could just retrieve it using the Python elasticsearch library

In [24]:
res = es.get(index= "testing-index", id=1)
res

{'_index': 'testing-index',
 '_type': '_doc',
 '_id': '1',
 '_version': 1,
 '_seq_no': 2,
 '_primary_term': 1,
 'found': True,
 '_source': {'test': 'test'}}

So that works. Let's delete it now: 

In [25]:
es.delete(index='testing-index', doc_type='test', id=1)

{'_index': 'testing-index',
 '_type': 'test',
 '_id': '1',
 '_version': 2,
 'result': 'deleted',
 '_shards': {'total': 2, 'successful': 1, 'failed': 0},
 '_seq_no': 3,
 '_primary_term': 1}

<b>Extracting Text from Jupyter Notebooks</b>

Now let's do something a little more substantial. I have a folder full of Jupyter Notebooks and I always need code from one or another. So let's create a function to extract all the text from the notebooks:

In [26]:
NB_VERSION = 4

def extractTextFromNotebook(notebook_path):
    """Return the source text of every code and markdown cell in a notebook."""
    # nbf.read accepts a file path or an open file object
    formatted = nbf.read(notebook_path, as_version=NB_VERSION)
    text = []
    for cell in formatted.get('cells', []):
        if cell.get('cell_type') in ('code', 'markdown') and 'source' in cell:
            text.append(cell['source'])

    return text

listOfStringsInNotebook = extractTextFromNotebook("../work/HTMNotebooks/HTM_Overview_7.ipynb")

Let's check one of the strings in that list. It is definitely some text. 

In [27]:
listOfStringsInNotebook[5]

'df = pd.read_csv("./data/gymdata.csv", header=1)\ndf = df.rename(columns={"datetime": "date_time", "float": "power_consumption"})\ndf = df.iloc[1:]\ndf.head()'
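
An .ipynb file is plain JSON, so the same extraction can be sketched with only the standard library. This is an illustration rather than a replacement for nbformat. It assumes the version 4 layout (a top-level <code>cells</code> list), and the function name <code>extract_cell_sources</code> is hypothetical. In v4 files on disk, <code>source</code> may be stored either as one string or as a list of lines, so the sketch joins lists:

```python
import json


def extract_cell_sources(notebook_json):
    """Return the source text of every code and markdown cell."""
    notebook = json.loads(notebook_json)
    sources = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") not in ("code", "markdown"):
            continue
        source = cell.get("source", "")
        # On disk, `source` is either one string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        sources.append(source)
    return sources


# A tiny in-memory notebook, made up for this example
sample = json.dumps({
    "nbformat": 4,
    "cells": [
        {"cell_type": "markdown", "source": ["# Title\n", "Some notes"]},
        {"cell_type": "code", "source": "import pandas as pd"},
        {"cell_type": "raw", "source": "ignored"},
    ],
})
print(extract_cell_sources(sample))
```

The raw cell is skipped, which mirrors the cell-type filter in the nbformat version.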

Now let's iterate through those notebook cells and push them into Elasticsearch. Elasticsearch wants something JSON-like, so that is what we will give it. 

In [28]:
elasticDBName = "notebook-cell-search"

def writeTextCellsToElasticSearchDB(doc):
    # Index each cell as its own document, using the cell position as the id
    for i, cellText in enumerate(doc):
        cellDict = {
            # stored as a one-element list, matching the search output further down
            'text': [cellText],
            'timestamp': datetime.now(),
        }
        es.index(index=elasticDBName, doc_type='cell', body=cellDict, id=i)
    

In [29]:
writeTextCellsToElasticSearchDB(listOfStringsInNotebook)
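
The function above indexes a single notebook and uses the cell position as the document id, so indexing a second notebook into the same index would overwrite cells that share a position. Here is a sketch of one way around that, assuming ids of the form <code>notebook path:cell position</code> and an extra <code>notebook</code> field. Both are choices made for this example, as is the name <code>build_cell_actions</code>. It builds action dicts in the shape accepted by <code>elasticsearch.helpers.bulk</code> without contacting Elasticsearch itself:

```python
from datetime import datetime, timezone


def build_cell_actions(notebook_path, cells, index="notebook-cell-search"):
    """Build one bulk action per cell, with an id unique across notebooks."""
    timestamp = datetime.now(timezone.utc).isoformat()
    return [
        {
            "_index": index,
            "_id": f"{notebook_path}:{position}",
            "_source": {
                "notebook": notebook_path,
                "cell": position,
                "text": text,
                "timestamp": timestamp,
            },
        }
        for position, text in enumerate(cells)
    ]


actions = build_cell_actions("demo.ipynb", ["import json", "print('hi')"])
# With a live client `es`, these could be sent in one request:
#   from elasticsearch.helpers import bulk
#   bulk(es, actions)
```

Storing the notebook path alongside the text also means a search hit tells you which file the snippet came from.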

<b>Searching Notebooks</b>

So now all the data is in Elasticsearch. Now we want to search it. There are three options to do this: 

1. You can use the Python elasticsearch library to run queries<br/>
This can be quite handy; I cover some examples below
2. Use Kibana to search<br/>
I tend to use this. I will never remember a proprietary query language; I can barely remember SQL these days. So I want to be able to search quickly in a search box where I am given some help to do so. Kibana is perfect for that.
3. Pass query params in a URL to search, such as <code>http://localhost:9200/testing-index/_search?pretty=true&q=*:*</code><br/>
If you are into this kind of thing and love Postman or similar, it could be handy I guess. For our purposes I wouldn't do this, and won't cover it

<b>Option 1: Using Python</b>

This is my preferred way of doing it. Here are some handy getting started searches you can do to look through your data that has been put into Elasticsearch:

In [31]:
# Grab a particular record
es.get(index='notebook-cell-search', 
       doc_type="_doc", id = 44)

{'_index': 'notebook-cell-search',
 '_type': '_doc',
 '_id': '44',
 '_version': 2,
 '_seq_no': 91,
 '_primary_term': 1,
 'found': True,
 '_source': {'text': ['def getColumHistorySummary(columsHistory, columnChoice, inputCount):\n    activeColumnHistory = [columsHistory[i][columnChoice][\'activeColumns\'] for i in range(0, inputCount)]\n    history = []\n    for j in range(0, len(activeColumnHistory) - 1):\n        columnHistoryConnectionsAndDisconnections = {}\n\n        currentColumnSynapses, nextColumnSynapses, = set(activeColumnHistory[j]), set(activeColumnHistory[j + 1])\n        columnHistoryConnectionsAndDisconnections["newlyDisconnectedSynapses"] = list(currentColumnSynapses - nextColumnSynapses)\n        columnHistoryConnectionsAndDisconnections["newlyConnectedSynapses"] = list(nextColumnSynapses - currentColumnSynapses)\n        unchangedSynapses = np.sort(list(nextColumnSynapses - set(columnHistoryConnectionsAndDisconnections["newlyConnectedSynapses"])))\n\n        columnHist

It supports all kinds of queries to match text, partial matches, etc. Here is another example. Things can get a bit messy, so I would advise you to keep your query in a separate Python dict, and then just pass that dict to the search:

In [37]:
q = {
  "query": {
      "prefix": {
          "text": "enc"
      }
  }}

es.search(index="notebook-cell-search", 
          body = q)

{'took': 0,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 6, 'relation': 'eq'},
  'max_score': 1.0,
  'hits': [{'_index': 'notebook-cell-search',
    '_type': 'cell',
    '_id': '3',
    '_score': 1.0,
    '_source': {'text': ['ep7 = \'<iframe style="background:#99ddff; color:black; padding: 10px" width="400" height="315" src="https://www.youtube.com/embed/R5UoFNtv5AU" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>\'\nep8 = \'<iframe  style="background:#99ddff; color:black; padding: 10px" width="400" height="315" src="https://www.youtube.com/embed/rHvjykCIrZM" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>\'\ntable = \'<table style="width:100%"><tr><td>\' + ep7 + \'</td><td>\
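
The raw search response is deeply nested, which makes results hard to scan in a notebook. Here is a small sketch that flattens it into (id, score, snippet) tuples. It assumes the response shape shown above (<code>hits.hits[*]._source.text</code>), and the helper name <code>summarise_hits</code> is hypothetical:

```python
def summarise_hits(response, width=80):
    """Return (id, score, snippet) for each hit in a search response."""
    rows = []
    for hit in response.get("hits", {}).get("hits", []):
        text = hit.get("_source", {}).get("text", "")
        # The cells above were stored as a one-element list; accept a str too.
        if isinstance(text, (list, tuple)):
            text = "\n".join(text)
        first_line = text.strip().splitlines()[0] if text.strip() else ""
        rows.append((hit["_id"], hit.get("_score"), first_line[:width]))
    return rows


# A made-up response with the same nesting as a real one
fake_response = {
    "hits": {
        "hits": [
            {"_id": "3", "_score": 1.0,
             "_source": {"text": ["encoder = make_encoder()\nprint(encoder)"]}},
        ]
    }
}
print(summarise_hits(fake_response))
```

Against the live index this would be called as <code>summarise_hits(es.search(index="notebook-cell-search", body=q))</code>.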

I tend to avoid widgets in notebooks simply because I like to keep them super easy to deploy; however, this is an instance where I might just set one up.

<b>Option 2: Using Kibana</b>

Kibana's cool, but after a while I did get annoyed at the UI; it's kind of invasive. But if you are going to use it:

1. Go to <code>http://localhost:5601/app/home#/</code> which should be up and running
2. From the dropdown on the left, go to the "Stack Management" menu item. This will take you to <code>http://localhost:5601/app/management</code>
3. Choose the index pattern option. You will be taken to <code>http://localhost:5601/app/management/kibana/indexPatterns</code>
4. Go to <code>http://localhost:5601/app/management/kibana/indexPatterns/create</code>
5. Choose notebook-cell-search, follow the prompts to set this index
6. Then go back to <code>http://localhost:5601/app/home#/</code> and choose "Discover" from the left-hand menu

From there you will have a search box and some filters, and all kinds of cool things you can check. Have fun!