In [1]:
from elasticsearch import Elasticsearch, RequestsHttpConnection
import nbformat as nbf
import warnings
import requests
warnings.filterwarnings("ignore")
import glob

The problem? I have a ton of Jupyter Notebooks and I need an easy way to search through them all. I keep remembering that I have some snippet of code <i>somewhere</i> in these notebooks, so I need an easy way to find it.

Enter <b>Elasticsearch</b>

Elasticsearch "is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents". That means you can put lots of documents full of text into it, and it will index them all and make them easy to search.

The Elasticsearch folks have also built <b>Kibana</b>, which is "proprietary data visualization dashboard software for Elasticsearch". That means Kibana is a handy GUI tool you can use to quickly search your data, similar to using something like a Google search box.

In this notebook, I will demonstrate how to: 

1. Setup Docker containers for Elasticsearch and Kibana on a shared Docker network
2. Use nbformat to convert your .ipynb files to lists of strings
3. Upload those lists of strings into an Elasticsearch database
4. Search through the notebooks using either the Python Elasticsearch library, or Kibana

<i>Prerequisites:</i><br/> That you know how to create a Jupyter Notebook and save it somewhere, and that you have Docker installed. 

<i>Caveat:</i><br/>This assumes your use case is searching through your own notebooks as part of your workflow, so I am going to ignore some Elasticsearch security options, which is why this notebook has an ignore-warnings filter. Don't use this approach if you are planning something that is not dev.

This will all take about 5 minutes.

<b>Set up and Hello World</b>

Let's start with some setup. Open a Terminal on your machine (bash on Linux, PowerShell on Windows, whatever). Then run the following commands. 
1. <code>docker network create elastic</code><br/>
This will tell Docker to create a Docker network called 'elastic'
2. <code>docker pull docker.elastic.co/elasticsearch/elasticsearch:7.13.1</code><br/>
This will pull down an image of Elasticsearch 7.13.1, the latest version at the time of writing
3. <code>docker run --name es01-test --net elastic -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.13.1</code><br/>
This will create a Docker container on the 'elastic' network and expose some ports so you can access it. If you go to localhost:9200 you will see a welcome message

So that's it for installing and getting Elasticsearch up and running. Now let's do the same for Kibana. Open a new shell and run the following commands:

4. <code>docker pull docker.elastic.co/kibana/kibana:7.13.1</code><br/>
This will pull down an image of Kibana
5. <code>docker run --name kib01-test --net elastic -p 5601:5601 -e "ELASTICSEARCH_HOSTS=http://es01-test:9200" docker.elastic.co/kibana/kibana:7.13.1</code><br/>
This will create a container for Kibana. It will see the Elasticsearch instance and be connected to it. You can go to localhost:5601 and see the Kibana homepage
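
Elasticsearch can take a little while to start answering after <code>docker run</code>. As a rough sketch, the helper below polls a URL until it responds. The names <code>http_ok</code> and <code>wait_until_ready</code> are hypothetical, and the default of 30 attempts one second apart is an assumption, not something Elasticsearch prescribes.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx answer: not ready yet
        return False


def wait_until_ready(url, attempts=30, delay=1.0, probe=http_ok, sleep=time.sleep):
    """Poll `url` until `probe` succeeds; return False if every attempt fails."""
    for attempt in range(attempts):
        if probe(url):
            return True
        # No point sleeping after the final attempt
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

For example, <code>wait_until_ready("http://localhost:9200")</code> would return <code>True</code> once the welcome message is being served. The <code>probe</code> and <code>sleep</code> parameters are only there so the loop can be exercised without a running container.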

So it looks like it is working. Let's make sure our notebook can see it too: 

In [2]:
res = requests.get('http://host.docker.internal:9200')
print(res.content)

b'{\n  "name" : "829766e7847b",\n  "cluster_name" : "docker-cluster",\n  "cluster_uuid" : "mLWu9gbQQqqOy5xB3IONVg",\n  "version" : {\n    "number" : "7.13.1",\n    "build_flavor" : "default",\n    "build_type" : "docker",\n    "build_hash" : "9a7758028e4ea59bcab41c12004603c5a7dd84a9",\n    "build_date" : "2021-05-28T17:40:59.346932922Z",\n    "build_snapshot" : false,\n    "lucene_version" : "8.8.2",\n    "minimum_wire_compatibility_version" : "6.8.0",\n    "minimum_index_compatibility_version" : "6.0.0-beta1"\n  },\n  "tagline" : "You Know, for Search"\n}\n'


So this notebook can connect to the Elasticsearch instance. Note that I am using <code>host.docker.internal</code> in the URL of that GET request. This is because I have set up Jupyter in Docker as well (details at: https://jupyter-docker-stacks.readthedocs.io/en/latest/using/selecting.html). If you have installed an Anaconda instance or something similar, this URL would use <code>localhost</code> rather than <code>host.docker.internal</code>.
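
If you switch between the two setups, a small helper can choose the hostname for you. This is only a sketch: the function names are hypothetical, and checking for <code>/.dockerenv</code> is a common heuristic for "am I inside a Docker container", not an official API, so treat it as an assumption.

```python
import os


def running_in_docker(marker_path="/.dockerenv"):
    """Heuristic: Docker normally creates this marker file inside a container."""
    return os.path.exists(marker_path)


def elasticsearch_host(in_docker):
    """Pick the hostname the notebook should use to reach Elasticsearch."""
    return "host.docker.internal" if in_docker else "localhost"


def elasticsearch_url(in_docker, port=9200):
    """Build the base URL for the Elasticsearch HTTP interface."""
    return "http://{}:{}".format(elasticsearch_host(in_docker), port)


print(elasticsearch_url(running_in_docker()))
```

The same hostname can then be reused for both the <code>requests.get</code> check and the Elasticsearch client connection.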

Now, using our Python elasticsearch library, let's create a connection to Elasticsearch

In [3]:
# Note "host.docker.internal" might be "localhost" if you are running an Anaconda version of Jupyter
es = Elasticsearch(hosts=[{"host": "host.docker.internal", "port": 9200}],
                   connection_class=RequestsHttpConnection, max_retries=30,
                   retry_on_timeout=True, request_timeout=30)

Let's create an index (think of this as a db) and put some data into it: 

In [4]:
#index some test data
es.index(index='testing-index', doc_type='test', id=1, body={'test': 'test'})

{'_index': 'testing-index',
 '_type': 'test',
 '_id': '1',
 '_version': 3,
 'result': 'created',
 '_shards': {'total': 2, 'successful': 1, 'failed': 0},
 '_seq_no': 8,
 '_primary_term': 1}

We can go to <code>http://localhost:9200/testing-index/_search?pretty=true&q=*:*</code> and see that the data now exists in Elasticsearch. Or we could just retrieve it using the Python elasticsearch library:

In [5]:
res = es.get(index= "testing-index", id=1)
res

{'_index': 'testing-index',
 '_type': '_doc',
 '_id': '1',
 '_version': 3,
 '_seq_no': 8,
 '_primary_term': 1,
 'found': True,
 '_source': {'test': 'test'}}

So that works. Let's delete it now: 

In [6]:
es.delete(index='testing-index', doc_type='test', id=1)

{'_index': 'testing-index',
 '_type': 'test',
 '_id': '1',
 '_version': 4,
 'result': 'deleted',
 '_shards': {'total': 2, 'successful': 1, 'failed': 0},
 '_seq_no': 9,
 '_primary_term': 1}

<b>Extracting Text from Jupyter Notebooks</b>

Now let's do something a little more substantial. I have a folder full of Jupyter Notebooks and I always need code from one or another. So let's create a function to extract all the text from the notebooks. First I need a list of names of the Jupyter Notebooks from the directory in which they are located: 

In [18]:
pathToLocationJupyterNotebookFiles = "../work/HTMNotebooks/"
jupyterNotebooksFileNames = glob.glob(pathToLocationJupyterNotebookFiles + './*.ipynb')
jupyterNotebooksFileNames

['../work/HTMNotebooks/./HTMTest.ipynb',
 '../work/HTMNotebooks/./HTM_Encoders_0.ipynb',
 '../work/HTMNotebooks/./HTM_Overview_0.ipynb',
 '../work/HTMNotebooks/./HTM_Overview_1.ipynb',
 '../work/HTMNotebooks/./HTM_Overview_10.ipynb',
 '../work/HTMNotebooks/./HTM_Overview_11.ipynb',
 '../work/HTMNotebooks/./HTM_Overview_2.ipynb',
 '../work/HTMNotebooks/./HTM_Overview_3.ipynb',
 '../work/HTMNotebooks/./HTM_Overview_4.ipynb',
 '../work/HTMNotebooks/./HTM_Overview_5.ipynb',
 '../work/HTMNotebooks/./HTM_Overview_6.ipynb',
 '../work/HTMNotebooks/./HTM_Overview_7.ipynb',
 '../work/HTMNotebooks/./HTM_Overview_8.ipynb',
 '../work/HTMNotebooks/./HTM_Overview_9.ipynb']

Now, let's create a function to extract the text from a notebook, and then we will just iterate over each of the notebooks to extract the code:

In [8]:
NB_VERSION = 4

def extractTextFromNotebook(notebook_path):
    # Parse the .ipynb file into a notebook object (nbformat version 4)
    formatted = nbf.read(notebook_path, as_version=NB_VERSION)
    text = []
    for cell in formatted.get('cells', []):
        # Keep the source of code and markdown cells only
        if 'source' in cell and 'cell_type' in cell:
            if cell['cell_type'] in ('code', 'markdown'):
                text.append(cell['source'])

    return text


# One list of cell sources per notebook
textFromNotebooks = [extractTextFromNotebook(path) for path in jupyterNotebooksFileNames]


We get back a list of lists: one list of cell texts per notebook. Let's check we are on the right track by looking at, say, the sixth item (index 5) in the second notebook (index 1):

In [9]:
textFromNotebooks[1][5]

'from htm.bindings.sdr import SDR, Metrics\nfrom htm.encoders.rdse import RDSE, RDSE_Parameters\nfrom htm.encoders.date import DateEncoder\nfrom htm.bindings.algorithms import SpatialPooler\nfrom htm.bindings.algorithms import TemporalMemory\nfrom htm.algorithms.anomaly_likelihood import AnomalyLikelihood \nfrom htm.bindings.algorithms import Predictor'
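
To make it clearer what the extraction is working with: an .ipynb file is just JSON with a list of cells, each with a <code>cell_type</code> and a <code>source</code>. The sketch below does the same kind of extraction with only the standard <code>json</code> module instead of nbformat, on a tiny hand-written notebook. The names <code>extract_cell_sources</code> and <code>sample_notebook</code> are made up for this illustration.

```python
import json


def extract_cell_sources(notebook_json):
    """Return the source text of every code and markdown cell in a notebook JSON string."""
    notebook = json.loads(notebook_json)
    sources = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") in ("code", "markdown"):
            source = cell.get("source", "")
            # On disk, "source" may be one string or a list of line strings
            if isinstance(source, list):
                source = "".join(source)
            sources.append(source)
    return sources


# A tiny hand-written notebook, used only for illustration
sample_notebook = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title\n", "Some notes"]},
        {"cell_type": "code", "metadata": {}, "outputs": [], "execution_count": None,
         "source": "import glob"},
        {"cell_type": "raw", "metadata": {}, "source": "ignored"},
    ],
})

print(extract_cell_sources(sample_notebook))
# prints ['# Title\nSome notes', 'import glob']
```

The raw cell is skipped, which mirrors the code-or-markdown filter used in the function above.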

Now let's iterate through all those notebooks converted to text, and push them into Elasticsearch. Elasticsearch wants something JSON-like, so that is what we will give it. The cell below prints a list of <code>None</code> values because the function has no return value, but the documents are still indexed:

In [10]:
elasticDBName = "j-notebook-cell-search-index"

def writeTextCellsToElasticSearchDB(doc, notebookFilePath):
    for cellText in doc:
        # Each document holds the cell text (as a one-element list) and its notebook path
        cellDict = {'text': [cellText], 'noteBookFilePath': notebookFilePath}
        es.index(index=elasticDBName, doc_type='cell', body=cellDict)
    
[writeTextCellsToElasticSearchDB(textFromNotebooks[i], jupyterNotebooksFileNames[i]) for i in range(len(jupyterNotebooksFileNames))]

[None, None, None, None, None, None, None, None, None, None, None, None, None]
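
One thing to be aware of: indexing without an <code>id</code> creates new documents every time the cell is re-run, so re-running it duplicates every notebook cell in the index. A possible alternative, sketched below, gives each cell a stable ID derived from the notebook path and cell position, and sends everything through <code>elasticsearch.helpers.bulk</code>. The function names are hypothetical, hashing the path and position is just one way to build an ID, and here the cell text is stored as a plain string.

```python
import hashlib


def cell_id(notebook_path, cell_number):
    """Stable document ID built from the notebook path and the cell position."""
    key = "{}:{}".format(notebook_path, cell_number)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def build_actions(index_name, notebook_paths, notebooks_text):
    """Yield one action dict per cell, in the shape elasticsearch.helpers.bulk accepts."""
    for path, cells in zip(notebook_paths, notebooks_text):
        for number, text in enumerate(cells):
            yield {
                "_index": index_name,
                "_id": cell_id(path, number),
                "_source": {"text": text, "noteBookFilePath": path},
            }


def bulk_index(es_client, index_name, notebook_paths, notebooks_text):
    """Index all cells in batches (needs the elasticsearch package and a running cluster)."""
    # Imported lazily so the helpers above can be used without the package installed
    from elasticsearch import helpers
    return helpers.bulk(es_client, build_actions(index_name, notebook_paths, notebooks_text))
```

With the variables from this notebook, the call would look like <code>bulk_index(es, elasticDBName, jupyterNotebooksFileNames, textFromNotebooks)</code>. Re-running it overwrites the existing documents instead of adding copies, although cells that have moved position in a notebook will get new IDs.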

<b>Searching Notebooks</b>

So now all the data is in Elasticsearch. Now we want to search it. There are three options to do this: 

1. You can use the Python elasticsearch library to run queries<br/>
This can be quite handy, and I will cover some examples below
2. Use Kibana to search<br/>
This is fun to use, but I probably won't use it enough to remember a proprietary Kibana query language (I can barely remember SQL these days). But it does give you a GUI search box and filters and all that. 
3. I could pass query params in a url to search, such as <code>http://localhost:9200/testing-index/_search?pretty=true&q=*:*</code><br/>
If you are into this kind of thing, like if you love Postman or something it could be handy I guess. For our purposes I wouldn't do this, and won't cover it

<b>Option 1: Using Python</b>

This is my preferred way of doing it. Here are some handy getting started searches you can do to look through your data that has been put into Elasticsearch:

In [13]:
# Grab a particular record - note I just got the ID from http://localhost:9200/j-notebook-cell-search-index/_search?pretty=true&q=*:*
# now the elastic search index is up and running
es.get(index=elasticDBName, 
       doc_type="_doc", id = "VFtpKHoB3T1ThL6Sx1Yg")

{'_index': 'j-notebook-cell-search-index',
 '_type': '_doc',
 '_id': 'VFtpKHoB3T1ThL6Sx1Yg',
 '_version': 1,
 '_seq_no': 0,
 '_primary_term': 1,
 'found': True,
 '_source': {'text': ['import csv\nimport datetime\nimport os\nimport numpy as np\nimport random\nimport math\n\nfrom htm.bindings.sdr import SDR, Metrics\nfrom htm.encoders.rdse import RDSE, RDSE_Parameters\nfrom htm.encoders.date import DateEncoder\nfrom htm.bindings.algorithms import SpatialPooler\nfrom htm.bindings.algorithms import TemporalMemory\nfrom htm.algorithms.anomaly_likelihood import AnomalyLikelihood #FIXME use TM.anomaly instead, but it gives worse results than the py.AnomalyLikelihood now\nfrom htm.bindings.algorithms import Predictor'],
  'noteBookFilePath': '../work/HTMNotebooks/./HTMTest.ipynb'}}

It supports all kinds of queries to match text, partial matches, etc. Here is another example. The use case here is that I know I have a notebook where I have done some work on Baltimore Crime Data, but I can't remember where. So I will put in the prefix "crim" and let Elasticsearch do its thing. 

Note that things can get a bit messy, so I would advise you to keep your query in a separate Python dictionary, and then just pass that into the search:

In [14]:
q = {
  "query": {
      "prefix": {
          "text": "crim"
      }
  }}

es.search(index=elasticDBName, 
          body = q)

{'took': 1,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 3, 'relation': 'eq'},
  'max_score': 1.0,
  'hits': [{'_index': 'j-notebook-cell-search-index',
    '_type': 'cell',
    '_id': 'OltpKHoB3T1ThL6Szlec',
    '_score': 1.0,
    '_source': {'text': ['To explore this, let\'s use some data. There is some really interesting data that will turn up in episode\'s 7 and 8 in the context of the Spatial Pooler, that has some interesting info, but for now, let\'s use Baltimore Crime Data. This data has nice coverage across a number of data points, descriptive names, some categorical variables, the footprint isn\'t too big but it gives us a nice sample of 96k records\n\nInformation available <a href="https://data.baltimorecity.gov/datasets/baltimore::part1-crime-2015-to-2016/about">https://data.baltimorecity.gov/datasets/baltimore::part1-crime-2015-to-2016/about</a>\n'],
     'noteBookFilePath': '../work/HTMNotebooks/./
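
The raw response is hard to read, so it can help to boil each hit down to the notebook path, the score, and a short snippet. The sketch below assumes the response shape shown above (<code>hits</code> → <code>hits</code> → <code>_source</code>). <code>summarize_hits</code> is a hypothetical helper, and the sample response is hand-made with invented values.

```python
def summarize_hits(response, snippet_length=80):
    """Return (notebook path, score, snippet) for each hit in a search response."""
    summary = []
    for hit in response.get("hits", {}).get("hits", []):
        source = hit.get("_source", {})
        text = source.get("text", "")
        # In the output above, "text" comes back as a list, so join it if needed
        if isinstance(text, (list, tuple)):
            text = " ".join(text)
        # Collapse newlines and repeated spaces, then truncate
        snippet = " ".join(text.split())[:snippet_length]
        summary.append((source.get("noteBookFilePath"), hit.get("_score"), snippet))
    return summary


# Hand-made response with the same shape as a real one (values are invented)
sample_response = {
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "hits": [
            {"_id": "example-id", "_score": 1.0,
             "_source": {"text": ["let's use Baltimore Crime Data"],
                         "noteBookFilePath": "example.ipynb"}},
        ],
    }
}

for path, score, snippet in summarize_hits(sample_response):
    print(path, score, snippet)
```

In the notebook you would pass in the real result instead, for example <code>summarize_hits(es.search(index=elasticDBName, body=q))</code>.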

<b>Option 2: Using Kibana</b>

Kibana's cool, but after a while I did get annoyed at the UI. But if you are going to use it:

1. Go to <code>http://localhost:5601/app/home#/</code> which should be up and running
2. From the dropdown on the left, go to the "Stack Management" menu item. This will take you to <code>http://localhost:5601/app/management</code>
3. Choose the index pattern option. You will be taken to <code>http://localhost:5601/app/management/kibana/indexPatterns</code>
4. Go to <code>http://localhost:5601/app/management/kibana/indexPatterns/create</code>
5. Choose your index/database that is listed, and follow the prompts to set it up
6. Then go back to <code>http://localhost:5601/app/home#/</code> and choose "Discover" from the left-hand menu

From there you will have a search box and some filters, and all kinds of cool things you can check. 

Enjoy!