# Database Joins: Combining JIRA With GitHub

For this notebook, our research question is as follows:

<b style="color:crimson">Are there Liferay Portal Patches (LPP) tickets that are currently stuck in review?</b>

In order to answer this question, this notebook introduces database joins by creating a script that combines data retrieved from JIRA with data retrieved from GitHub.

## Prerequisites

The following cell attempts to use `conda` and `pip` to install the libraries that are used by this notebook. If the output indicates that additional items were installed, you will need to restart the kernel after the installation completes before you can run the later cells in the notebook.

In [None]:
!conda install -y numpy pandas pytz requests six ujson
!pip install dateparser

## Notebook Imports

In [None]:
from __future__ import print_function

from collections import defaultdict, namedtuple
import dateparser
from datetime import date, datetime
import numpy as np
import os
import pandas as pd
import pytz
import re
import requests
import six
import subprocess
import ujson as json

## Saving Raw Data

Before we start, we'll make sure that we establish one rule for this script and all future scripts: we will save all raw data.

In [None]:
!mkdir -p rawdata

This is important, because one of the more common things to do as a developer is to retrieve the data, extract only the information you need, and then discard the data you do not need. However, unless you have some terms of service agreement restricting what data you are allowed to retain, do not discard the raw data! Raw data is a starting point that speeds up the creation of many different application prototypes.

In [None]:
today = date.today()
now = datetime.now(pytz.utc)

def load_raw_data(base_name):
    file_name = 'rawdata/%s_%s.json' % (base_name, today.isoformat())

    if not os.path.isfile(file_name):
        return None

    with open(file_name) as infile:
        return json.load(infile)

def save_raw_data(base_name, json_value):
    file_name = 'rawdata/%s_%s.json' % (base_name, today.isoformat())

    with open(file_name, 'w') as outfile:
        json.dump(json_value, outfile)
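The later cells repeat the same "load the cache, otherwise fetch and save" steps by hand. As a minimal sketch of that pattern, the snippet below wraps it in a single function. The `load_or_fetch` name and the `fetch` callback are hypothetical, and the sketch uses the standard `json` module and a temporary folder (instead of `ujson` and `rawdata/`) so that it can run anywhere.

```python
import json
import os
import tempfile
from datetime import date


def load_or_fetch(folder, base_name, fetch):
    """Return today's cached value for base_name, calling fetch() only on a miss.

    Hypothetical helper that mirrors the load_raw_data / save_raw_data pair,
    but takes the folder and the fetch callback as parameters.
    """
    file_name = os.path.join(
        folder, '%s_%s.json' % (base_name, date.today().isoformat()))

    if os.path.isfile(file_name):
        with open(file_name) as infile:
            return json.load(infile)

    value = fetch()

    with open(file_name, 'w') as outfile:
        json.dump(value, outfile)

    return value


calls = []


def fake_fetch():
    # Stand-in for an expensive API call; records how often it runs.
    calls.append(1)
    return [{'key': 'LPP-1'}]


cache_folder = tempfile.mkdtemp()
first = load_or_fetch(cache_folder, 'jira_issues', fake_fetch)
second = load_or_fetch(cache_folder, 'jira_issues', fake_fetch)
```

The second call is served from the file written by the first, so the (fake) API is only hit once per day.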

## Load Data from JIRA

Before we can answer the question of whether anything is *stuck in review*, we will need to answer the broader question of what is *in review*. We'll start by looking at what's in review in JIRA.

### Login to JIRA

Because [issues.liferay.com](https://issues.liferay.com/) does not have OAuth support, we will need to find a different way to connect to our JIRA installation. The simplest way is to log in to JIRA with a username and password.

There are a lot of secure ways to specify your username and password, but for the sake of this script, we'll use the most insecure way possible: a plain text file. Namely, the `.gitconfig` in your user's home folder. If you want to use a different strategy that's less global (a plain text JSON file in the same folder as this script, for example) or more secure, just change the implementation of the two functions below.

In [None]:
def get_config(key):
    try:
        return subprocess.check_output(['git', 'config', key]).strip().decode('utf8')
    except (subprocess.CalledProcessError, OSError):
        # The key is not set, or git is not installed
        return None

def set_config(key, value):
    subprocess.call(['git', 'config', '--global', key, value])
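If you would rather not touch your global `.gitconfig`, a JSON-file variant of these two functions might look like the sketch below. The `CONFIG_FILE` location and the `_read_config` helper are assumptions made for illustration; a temporary folder is used here only so the sketch runs anywhere, and in practice you would point it at a file next to the script. Note that this is still plain text, so it is no more secure than the default.

```python
import json
import os
import tempfile

# Hypothetical location, chosen for the demo only.
CONFIG_FILE = os.path.join(tempfile.mkdtemp(), 'checklpp_config.json')


def _read_config():
    # A missing file is treated as an empty configuration.
    if not os.path.isfile(CONFIG_FILE):
        return {}

    with open(CONFIG_FILE) as infile:
        return json.load(infile)


def get_config(key):
    # Same contract as the git-based version: None when the key is missing.
    return _read_config().get(key)


def set_config(key, value):
    config = _read_config()
    config[key] = value

    with open(CONFIG_FILE, 'w') as outfile:
        json.dump(config, outfile, indent=2)


set_config('jira.session-username', 'JIRA_USERNAME')
stored_username = get_config('jira.session-username')
missing_value = get_config('jira.session-password')
```

Because the function names and return values match the defaults, the rest of the notebook would not need to change.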

Assuming you are using the default implementation, set your username and password inside of the `.gitconfig` located in your user's home folder by running the following two commands in a command line window.

``` .sh
git config --global jira.session-username JIRA_USERNAME
git config --global jira.session-password JIRA_PASSWORD
```

The following cell will read both values and confirm that they exist.

In [None]:
jira_username = get_config('jira.session-username')
jira_password = get_config('jira.session-password')

assert(jira_username is not None)
assert(jira_password is not None)

The following cell uses these credentials to log in to JIRA and confirm that the resulting session is valid, which in turn confirms that the credentials you saved are valid. It also saves the session cookie so that later cells can reuse it instead of logging in again every time.

In [None]:
jira_base_url = 'https://issues.liferay.com/rest'

def get_jira_cookie():
    jira_cookie = None

    jira_cookie_name = None
    jira_cookie_value = None

    try:
        jira_cookie_name = get_config('jira.session-cookie-name')
        jira_cookie_value = get_config('jira.session-cookie-value')
    except:
        pass
    
    if jira_cookie_name is not None and jira_cookie_value is not None:
        jira_cookie = {
            jira_cookie_name: jira_cookie_value
        }

        r = requests.get(jira_base_url + '/auth/1/session', cookies=jira_cookie)

        if r.status_code != 200:
            jira_cookie = None

    if jira_cookie is not None:
        return jira_cookie
        
    post_json = {
        'username': jira_username,
        'password': jira_password
    }

    r = requests.post(jira_base_url + '/auth/1/session', json=post_json)

    if r.status_code != 200:
        print('Invalid login')

        return None

    response_json = r.json()

    jira_cookie_name = response_json['session']['name']
    jira_cookie_value = response_json['session']['value']

    set_config('jira.session-cookie-name', jira_cookie_name)
    set_config('jira.session-cookie-value', jira_cookie_value)

    jira_cookie = {
        jira_cookie_name: jira_cookie_value
    }

    return jira_cookie

assert(get_jira_cookie() is not None)

### Retrieve JIRA Issues

Now that we have a valid login, our next step is to use the JIRA API to retrieve tickets. If you've interacted with JIRA before, you know that it has its own query language (JQL). It turns out there is a simple search API that allows you to submit the JQL and all the matching issues are returned as JSON. Since the API is fairly simple, we implement it here.

In [None]:
def get_jira_issues(jql):
    jira_cookie = get_jira_cookie()

    if jira_cookie is None:
        return []

    start_at = 0

    post_json = {
        'jql': jql,
        'startAt': start_at
    }

    r = requests.post(jira_base_url + '/api/2/search', cookies=jira_cookie, json=post_json)

    if r.status_code != 200:
        return []

    response_json = r.json()

    issues = response_json['issues']

    while start_at + response_json['maxResults'] < response_json['total']:
        start_at += response_json['maxResults']
        post_json['startAt'] = start_at

        r = requests.post(jira_base_url + '/api/2/search', cookies=jira_cookie, json=post_json)

        if r.status_code != 200:
            return issues

        response_json = r.json()

        issues.extend(response_json['issues'])

    return issues
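The paging logic is the part of this function that is easiest to get wrong, so here is a minimal sketch of the same loop that runs without a network connection. The `fake_search` function is a hypothetical stand-in for the search endpoint; it only mimics the `startAt`, `maxResults`, `total`, and `issues` fields that the loop relies on.

```python
def fake_search(start_at, all_issues, page_size=2):
    """Stand-in for the search API: returns one page in the same shape."""
    return {
        'startAt': start_at,
        'maxResults': page_size,
        'total': len(all_issues),
        'issues': all_issues[start_at:start_at + page_size],
    }


def collect_all(search, all_issues):
    start_at = 0
    response = search(start_at, all_issues)
    issues = list(response['issues'])

    # Keep requesting pages until the next offset would pass the reported total.
    while start_at + response['maxResults'] < response['total']:
        start_at += response['maxResults']
        response = search(start_at, all_issues)
        issues.extend(response['issues'])

    return issues


sample = ['LPP-1', 'LPP-2', 'LPP-3', 'LPP-4', 'LPP-5']
collected = collect_all(fake_search, sample)
```

With five issues and a page size of two, the loop requests offsets 0, 2, and 4 and then stops.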

Now that we have something that can retrieve JIRA issues, all we need is to actually create our JQL and then run the search. This is what that code looks like.

In [None]:
jira_issues = load_raw_data('jira_issues')

if jira_issues is None:
    print('Executing JIRA search')

    jql = 'project = LPP AND type not in ("SME Request", "SME Request SubTask") AND status = "In Review" order by key'
    jira_issues = get_jira_issues(jql)
    save_raw_data('jira_issues', jira_issues)
else:
    print('Loaded cached JIRA search')

From there, let's take a look at the data we've retrieved. Looking at JSON is a bit tedious, so we'll take a look at a subset of fields in a way that resembles the view you get when you run JQL via the web browser.

In [None]:
JIRAIssue = namedtuple('JIRAIssue', ['key', 'status', 'assignee', 'summary'])

pd.DataFrame([
    JIRAIssue(
        key=jira_issue['key'],
        status=jira_issue['fields']['status']['name'],
        # The assignee field is null for unassigned tickets
        assignee=(jira_issue['fields']['assignee'] or {}).get('displayName', 'Unassigned'),
        summary=jira_issue['fields']['summary']
    )
        for jira_issue in jira_issues
])

## Load Data from GitHub

As noted previously, before we can answer the question of whether anything is *stuck in review*, we will need to answer the broader question of what is *in review*. Now that we know what's in review in JIRA, we also want to know what's in review on GitHub.

### Login to GitHub

Luckily, [api.github.com](https://developer.github.com/v3/) does have OAuth support, so we'll want to request an OAuth token from GitHub and leverage it in our script.

* [Creating a personal access token for command line](https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/)

Assuming you are using the default implementation, set your OAuth token inside of the `.gitconfig` located in your user's home folder by running the following command in a command line window.

``` .sh
git config --global github.oauth-token GITHUB_OAUTH_TOKEN
```

The following cell will read it in and confirm that it exists.

In [None]:
github_oauth_token = get_config('github.oauth-token')

assert(github_oauth_token is not None)

However, it's not enough that it exists. We should confirm that the token works on the repositories that are related to Liferay. We'll use the `liferay/liferay-portal` and `liferay/liferay-portal-ee` repositories as a way to validate.

In [None]:
github_base_url = 'https://api.github.com'

def is_repository_accessible(reviewer_url):
    print('Validating OAuth token against %s' % reviewer_url)

    headers = {
        'user-agent': 'python checklpp.py',
        'authorization': 'token %s' % github_oauth_token
    }

    api_path = '/repos/%s' % reviewer_url
    
    r = requests.get(github_base_url + api_path, headers=headers)
    
    return r.status_code == 200

assert(is_repository_accessible('liferay/liferay-portal'))
assert(is_repository_accessible('liferay/liferay-portal-ee'))

### Retrieve GitHub Pull Requests, Part 1

Now that we have a valid login, our next step is to use the GitHub API to retrieve pull requests.

Unfortunately, given the number of unique users and repositories that are involved in reviewing Liferay pull requests, it's not practical to make an API call against every user and every repository. Instead, we'll want to do this on-demand based on which pull requests we know exist.

If you visit `/repos/USERNAME/REPOSITORY/pulls` without a pull ID, GitHub returns the currently open pull requests, one page at a time (30 per page by default). The approach below uses this API in an attempt to reduce the number of API requests to GitHub, because a single request fetches many open pull requests at once. Note that the function below only reads the first page of results.

Following our principle of keeping any data we retrieve, we save all of these pull requests, and then we request any additional pull requests that are closed and save those as well.

In [None]:
def retrieve_pull_requests(reviewer_url, pull_request_ids=()):
    print('Checking pull requests waiting on %s' % reviewer_url)

    headers = {
        'user-agent': 'python checklpp.py',
        'authorization': 'token %s' % github_oauth_token
    }

    api_path = '/repos/%s/pulls' % reviewer_url

    r = requests.get(github_base_url + api_path, headers=headers)

    if r.status_code != 200:
        return {}

    new_pull_requests = r.json()

    new_seen_pull_requests = { pull_request['html_url']: pull_request for pull_request in new_pull_requests }

    for pull_request_id in pull_request_ids:
        github_url = 'https://github.com/%s/pull/%s' % (reviewer_url, pull_request_id)

        if github_url in new_seen_pull_requests:
            continue

        api_path = '/repos/%s/pulls/%s' % (reviewer_url, pull_request_id)

        r = requests.get(github_base_url + api_path, headers=headers)

        if r.status_code != 200:
            continue

        new_seen_pull_requests[github_url] = r.json()

    return new_seen_pull_requests
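One caveat: GitHub list endpoints are paginated (30 items per page by default, up to 100 with the `per_page` parameter), and the function above reads only the first page. The next page is advertised in the `Link` response header, which `requests` exposes as `response.links`. A minimal sketch of following those links is shown below. The `fetch_all_pages` name, the `get` callable, and the `FakeResponse` class are hypothetical; the fake responses exist only so the sketch can run without a network connection.

```python
def fetch_all_pages(get, url):
    """Follow rel="next" links until the last page and return every item.

    `get` is any callable that takes a URL and returns an object with
    `.status_code`, `.json()` and `.links`, as `requests.get` does.
    """
    items = []

    while url is not None:
        response = get(url)

        if response.status_code != 200:
            break

        items.extend(response.json())
        url = response.links.get('next', {}).get('url')

    return items


class FakeResponse(object):
    """Hypothetical stand-in for requests.Response, for demonstration only."""

    def __init__(self, payload, next_url=None):
        self.status_code = 200
        self._payload = payload
        self.links = {'next': {'url': next_url}} if next_url else {}

    def json(self):
        return self._payload


fake_pages = {
    'page1': FakeResponse([{'number': 1}, {'number': 2}], next_url='page2'),
    'page2': FakeResponse([{'number': 3}]),
}

all_pulls = fetch_all_pages(fake_pages.get, 'page1')
```

Against the real API, `get` could be something like `lambda url: requests.get(url, headers=headers)`, starting from `github_base_url + api_path`.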

For now, we'll take a look at all the pull requests open against `liferay/liferay-portal-ee`.

In [None]:
open_backport_pulls = load_raw_data('open_backport_pulls')

if open_backport_pulls is None:
    open_backport_pulls = retrieve_pull_requests('liferay/liferay-portal-ee')
    
    save_raw_data('open_backport_pulls', open_backport_pulls)
else:
    print('Loaded cached open backports')

From there, let's take a look at the data we've retrieved. Again, looking at JSON is a bit tedious, so we'll take a look at a subset of fields as a table.

In [None]:
GHPullRequest = namedtuple('GHPullRequest', ['submitter', 'reviewer', 'repository', 'branch', 'github_url'])

pd.DataFrame([
    GHPullRequest(
        submitter=pull_request['user']['login'],
        reviewer=pull_request['base']['user']['login'],
        repository=pull_request['base']['repo']['name'],
        branch=pull_request['base']['ref'],
        github_url=github_url
    )
        for github_url, pull_request in open_backport_pulls.items()
])

## Detour: Join Strategies

Now we have two data sets, and our next step is to join them together in order to answer our original question.

<b style="color:crimson">Are there Liferay Portal Patches (LPP) tickets that are currently stuck in review?</b>

We've identified which LPP tickets are currently in review, and we've created a function that can retrieve the GitHub pull requests that are in review for any given user and repository. We now need to answer the following question: given a list of LPP tickets and a list of pull requests, how do we get the pull request metadata tied to each LPP ticket?

If you learned joins while working in the industry, you might have learned joins when you ran into something similar to this situation, where you needed to derive new information from two separate tables. These two tables would have had two columns that referred to the same abstract concept (or perhaps a third "mapping table" that describes that abstract concept).

* [Khan Academy: Joining Related Tables](https://www.khanacademy.org/computing/computer-programming/sql/relational-queries-in-sql/p/joining-related-tables)

We've identified all of the JIRA issues that are in review. Presumably, we want to then combine this with GitHub data. What algorithm will we implement to achieve this? It turns out there are three basic strategies for computing a join: a nested loop join, a hash join, and a sort-merge join.

* [Join methods and subqueries](http://www.orafaq.com/tuningguide/join%20methods.html)

I believe sort-merge will make more sense once we've finished the implementation, so we'll first look at how a nested loop join and a hash join relate to our problem.

### Nested Loop Join

A simple solution to this problem would be to iterate over each of our pull requests and then check each JIRA ticket to see if the JIRA ticket references the pull request. So for every GitHub pull request, you would check every JIRA ticket. This strategy is known as a **nested loop join**.

A nested loop join is a join where the query optimizer decides the best way to accomplish the join is to designate one table as the "outer table" (no connection to the notion of an outer join) and the other table as the "inner table", structured in much the same way as a for loop.

* [Nested loop join](https://en.wikipedia.org/wiki/Nested_loop_join)

How fast is a nested loop join? Let $m$ be the size of the JIRA table and $n$ be the size of the GitHub table. Regardless of which table you choose for the outer table, a nested loop join would have an expected runtime and a worst-case runtime of $O(m \cdot n)$.
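To make the $O(m \cdot n)$ cost concrete, here is a minimal sketch of a nested loop join over toy data. The ticket keys, URLs, and the `nested_loop_join` function are made up for illustration, and the substring test is a simplification of how a ticket "references" a pull request.

```python
# Toy data: the keys and URLs are invented for this example.
jira_rows = [
    {'key': 'LPP-1', 'text': 'See https://github.com/example/repo/pull/10'},
    {'key': 'LPP-2', 'text': 'No pull request yet'},
    {'key': 'LPP-3', 'text': 'See https://github.com/example/repo/pull/11'},
]

github_rows = [
    {'html_url': 'https://github.com/example/repo/pull/10', 'state': 'open'},
    {'html_url': 'https://github.com/example/repo/pull/11', 'state': 'closed'},
]


def nested_loop_join(outer_rows, inner_rows):
    joined = []
    comparisons = 0

    for pull in outer_rows:          # outer table
        for issue in inner_rows:     # inner table, rescanned for every outer row
            comparisons += 1

            if pull['html_url'] in issue['text']:
                joined.append((issue['key'], pull['html_url'], pull['state']))

    return joined, comparisons


joined, comparisons = nested_loop_join(github_rows, jira_rows)
```

Every pull request is compared against every ticket, so the number of comparisons is always the product of the two table sizes, no matter how many rows actually match.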

### Hash Join

Another solution to this problem would be to first load all the pull requests into a lookup data structure. From there, iterate over the JIRA tickets and look up each referenced pull request in that structure. So for every JIRA ticket, you would perform a number of lookups against our data structure that does not depend on either table's size (so, effectively a constant). This strategy is known as a **hash join**.

A hash join is a join where the query optimizer decides the best way to accomplish the join is to build a lookup table on the smaller table (because it's more likely that a smaller table can fit into memory), with a popular choice being a hash table, and then iterate over the larger table.

* [Hash join](https://en.wikipedia.org/wiki/Hash_join)

How fast is a hash join? Let $m$ be the size of the JIRA table and $n$ be the size of the GitHub table. Note that hash tables have an $O(1)$ expected access and insertion time, but an $O(n)$ worst case access and insertion time (due to hash collisions, table resizes). So in practice, we have an expected runtime that is $O(m) + O(n)$ and a worst-case runtime that is $O(m \cdot n) + O(n^2)$.
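As a minimal sketch of the two phases, the snippet below builds a dictionary on the pull requests and then probes it once per ticket reference. The data and the `hash_join` function are made up for illustration, and the sketch assumes the URLs have already been extracted from each ticket.

```python
# Toy data: (JIRA key, referenced pull request URL) pairs, invented for this example.
issue_urls = [
    ('LPP-1', 'https://github.com/example/repo/pull/10'),
    ('LPP-2', 'https://github.com/example/repo/pull/10'),
    ('LPP-3', 'https://github.com/example/repo/pull/11'),
    ('LPP-4', 'https://github.com/example/repo/pull/99'),
]

pulls = [
    {'html_url': 'https://github.com/example/repo/pull/10', 'state': 'open'},
    {'html_url': 'https://github.com/example/repo/pull/11', 'state': 'closed'},
]


def hash_join(issue_urls, pulls):
    # Build phase: one dictionary insertion per pull request, O(n) expected.
    pulls_by_url = {pull['html_url']: pull for pull in pulls}

    # Probe phase: one dictionary lookup per reference, O(m) expected.
    joined = []

    for issue_key, github_url in issue_urls:
        pull = pulls_by_url.get(github_url)

        if pull is not None:
            joined.append((issue_key, github_url, pull['state']))

    return joined


hash_joined = hash_join(issue_urls, pulls)
```

References to pull requests that are not in the lookup table (such as the one on `LPP-4`) simply drop out, which is the behavior of an inner join.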

## Combine JIRA Data with GitHub Data

Between a nested loop join and a hash join, the hash join has the better expected runtime, and we already have a lookup data structure containing all the pull requests. So, we'll proceed with implementing the hash join, and we'll perform additional optimizations as we proceed.

### Build a Mapping Table

Our first step is to iterate over each JIRA ticket and extract the pull requests contained in the ticket. Once we have that information, we'll be able to use our lookup data structure.

In [None]:
def extract_pull_requests_in_review(jira_issues):
    issues_by_request = defaultdict(set)
    requests_by_reviewer = defaultdict(set)

    for jira_issue in jira_issues:
        for value in jira_issue['fields'].values():
            if not isinstance(value, six.string_types):
                continue

            for github_url in re.findall(r'https://github.com/[^\s]*/pull/[\d]+', value):
                issues_by_request[github_url].add(jira_issue['key'])

                pos = github_url.find('/', github_url.find('/', 19) + 1)
                reviewer_url = github_url[19:pos]
                pull_request_id = github_url[github_url.rfind('/') + 1:]

                requests_by_reviewer[reviewer_url].add(pull_request_id)

    return issues_by_request, requests_by_reviewer
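The index arithmetic in this function is terse, so here is a minimal sketch that runs the same pattern and slicing against a single piece of text. The ticket text, user, repository, and pull request number are all made up for illustration.

```python
import re

# Hypothetical ticket text; the repository and pull request number are invented.
text = 'Backport sent to https://github.com/example-user/example-repo/pull/123 for review.'

PULL_URL_PATTERN = r'https://github.com/[^\s]*/pull/[\d]+'
PREFIX_LENGTH = len('https://github.com/')  # 19, the offset used by the function above

extracted = []

for github_url in re.findall(PULL_URL_PATTERN, text):
    # Skip past "user/" and stop at the slash that follows the repository name.
    pos = github_url.find('/', github_url.find('/', PREFIX_LENGTH) + 1)
    reviewer_url = github_url[PREFIX_LENGTH:pos]

    # Everything after the last slash is the pull request number (as a string).
    pull_request_id = github_url[github_url.rfind('/') + 1:]

    extracted.append((github_url, reviewer_url, pull_request_id))
```

Here `reviewer_url` comes out as `example-user/example-repo`, which is the form expected by `retrieve_pull_requests`, and `pull_request_id` is the string `123`.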

Now, let's pass the results of our previous JIRA search to this function.

In [None]:
issues_by_request = load_raw_data('issues_by_request')
requests_by_reviewer = load_raw_data('requests_by_reviewer')

if issues_by_request is None or requests_by_reviewer is None:
    issues_by_request, requests_by_reviewer = extract_pull_requests_in_review(jira_issues)

    save_raw_data('issues_by_request', issues_by_request)
    save_raw_data('requests_by_reviewer', requests_by_reviewer)
else:
    print('Loaded cached JIRA to GitHub mapping')

What if we were to represent this data as a table?

In [None]:
JIRAGitHubMapping = namedtuple('JIRAGitHubMapping', ['jiraKey', 'githubKey'])

pd.DataFrame([
    JIRAGitHubMapping(jiraKey=jiraKey, githubKey=githubKey)
        for githubKey, jiraKeys in issues_by_request.items()
            for jiraKey in jiraKeys
])

As you can see, it turns out that what we've done is equivalent to building a mapping table between the JIRA tickets (represented with their issue key) and the GitHub pull requests (represented with their URL).

* [Understanding Mapping Tables](https://stackoverflow.com/questions/6453462/mysql-understanding-mapping-tables)

### Retrieve GitHub Pull Requests, Part 2

Now, we'll want to fetch all the metadata associated with all those pull requests. We'll also want a quick way to separate the open/active pull requests from the closed/inactive pull requests so that we don't have to re-derive that metadata later on when we attempt to identify stuck pull requests.

In [None]:
def retrieve_active_pull_request_reviews(issues_by_request, requests_by_reviewer):
    active_reviews = []
    seen_pull_requests = {}

    for reviewer_url, pull_request_ids in sorted(requests_by_reviewer.items()):
        new_seen_pull_requests = retrieve_pull_requests(reviewer_url, pull_request_ids)
        new_active_reviews = [pull_request['html_url'] for pull_request in new_seen_pull_requests.values() if pull_request['html_url'] in issues_by_request and pull_request['state'] != 'closed']

        seen_pull_requests.update(new_seen_pull_requests)
        active_reviews.extend(new_active_reviews)

    return active_reviews, seen_pull_requests

Let's go ahead and use our mapping tables to populate our GitHub table.

In [None]:
active_reviews = load_raw_data('active_reviews')
seen_pull_requests = load_raw_data('seen_pull_requests')

if active_reviews is None or seen_pull_requests is None:
    active_reviews, seen_pull_requests = retrieve_active_pull_request_reviews(issues_by_request, requests_by_reviewer)

    save_raw_data('active_reviews', active_reviews)
    save_raw_data('seen_pull_requests', seen_pull_requests)
else:
    print('Loaded cached pull request metadata')

## Generate the Report

In [None]:
def get_daycount_string(time_delta):
    elapsed = float(time_delta.days) + float(time_delta.seconds) / (60 * 60 * 24)
    elapsed_string = '%0.1f days' % elapsed

    return elapsed_string

def report_active(outfile, jira_issues, issues_by_request, active_reviews, seen_pull_requests):
    jira_issues_by_key = { issue['key']: issue for issue in jira_issues }

    outfile.write('<h2>Active Pull Requests on %s</h2>' % today.isoformat())

    outfile.write('<table>')
    outfile.write('<tr>')

    for header in ['Submitter', 'Pull Request Link', 'Waiting Tickets', 'Open Time', 'Idle Time']:
        outfile.write('<th>%s</th>' % header)

    outfile.write('</tr>')

    for github_url in active_reviews:
        # Skip unknown pull requests before opening a row, so no <tr> is left unclosed
        if github_url not in seen_pull_requests:
            continue

        outfile.write('<tr>')

        pull_request = seen_pull_requests[github_url]

        # Submitter

        outfile.write('<td>%s</td>' % pull_request['user']['login'])

        # Pull Request Link

        outfile.write('<td><a href="%s">%s#%d</a></td>' % (github_url, pull_request['base']['user']['login'], pull_request['number']))

        # Waiting Tickets

        affected_issue_keys = [issue_key for issue_key in issues_by_request[github_url]]
        affected_issue_urls = ['https://issues.liferay.com/browse/%s' % issue_key for issue_key in affected_issue_keys]
        affected_issue_assignees = [(jira_issues_by_key[issue_key]['fields']['assignee'] or {}).get('displayName', 'Unassigned') for issue_key in affected_issue_keys]

        affected_issue_links = [
            '<a href="%s">%s</a> (%s)' % (issue_url, issue_key, issue_assignee)
                for issue_key, issue_url, issue_assignee in zip(affected_issue_keys, affected_issue_urls, affected_issue_assignees)
        ]

        outfile.write('<td>%s</td>' % '<br />'.join(affected_issue_links))

        # Open Time

        created_at = dateparser.parse(pull_request['created_at'])
        open_time = now - created_at

        outfile.write('<td>%s</td>' % get_daycount_string(open_time))

        # Idle Time

        updated_at = dateparser.parse(pull_request['updated_at'])
        idle_time = now - updated_at

        outfile.write('<td>%s</td>' % get_daycount_string(idle_time))

        outfile.write('</tr>')

    outfile.write('</table>')

def report_completed(outfile, jira_issues, issues_by_request, active_reviews, seen_pull_requests):
    requests_by_issue = defaultdict(set)

    for github_url, issue_keys in issues_by_request.items():
        for issue_key in issue_keys:
            requests_by_issue[issue_key].add(github_url)

    completed_review = []
    region_field_name = 'customfield_11523'

    for issue in jira_issues:
        issue_key = issue['key']
        github_urls = requests_by_issue[issue_key]
        all_pulls_closed = len([github_url for github_url in github_urls if github_url in active_reviews]) == 0

        if not all_pulls_closed:
            continue

        assignee = (issue['fields']['assignee'] or {}).get('displayName', 'Unassigned')

        regions = []

        if region_field_name in issue['fields']:
            regions = [region['value'] for region in issue['fields'][region_field_name]]

        region = ''

        if len(regions) > 0:
            region = regions[0]

        idle_time = None

        for github_url in github_urls:
            if github_url not in seen_pull_requests:
                continue
            
            pull_request = seen_pull_requests[github_url]

            if pull_request['closed_at'] is None:
                continue

            closed_at = dateparser.parse(pull_request['closed_at'])
            new_idle_time = now - closed_at

            if idle_time is None or new_idle_time < idle_time:
                idle_time = new_idle_time

        completed_review.append((region, issue_key, assignee, idle_time))

    completed_review.sort()

    if len(completed_review) > 0:
        outfile.write('<h2>Review Already Completed for %s</h2>' % today.isoformat())

        outfile.write('<table>')
        outfile.write('<tr>')

        for header in ['Region', 'Ticket', 'Idle Time']:
            outfile.write('<th>%s</th>' % header)

        outfile.write('</tr>')

        for region, issue_key, assignee, idle_time in completed_review:
            outfile.write('<tr>')

            # Region

            outfile.write('<td>%s</td>' % region)

            # Ticket

            outfile.write('<td><a href="https://issues.liferay.com/browse/%s">%s</a> (%s)</td>' % (issue_key, issue_key, assignee))

            # Idle Time

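            # idle_time is None when no closed pull request was found for the ticket
            outfile.write('<td>%s</td>' % ('' if idle_time is None else get_daycount_string(idle_time)))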

            outfile.write('</tr>')

        outfile.write('</table>')

In [None]:
report_file_name = 'report_%s.html' % today.isoformat()

with open(report_file_name, 'w') as outfile:
    report_active(outfile, jira_issues, issues_by_request, active_reviews, seen_pull_requests)
    report_completed(outfile, jira_issues, issues_by_request, active_reviews, seen_pull_requests)

### Sort-Merge Join

If you are concerned about worst-case runtimes, you could replace the hash table with a tree-like data structure that has $O(\log n)$ insertion and lookup time. As a result, the lookup-based join would have a worst-case runtime of $O(m \log n) + O(n \log n)$.

Imagine that for every pull request, we've extracted its pull request URL and created a sorted data structure that allows us to lookup pull request metadata by its URL. Imagine also that we've extracted all the pull requests from every LPP ticket and we have a sorted data structure that allows us to lookup every JIRA issue corresponding to each GitHub URL.

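A naive solution that uses these two data structures is to probe one of them once for every key in the other, as in a hash join. A smarter solution takes advantage of the fact that both are already sorted and walks them together, in the same way as the merge step of merge sort.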

* [Sorting visualizations](http://cs.stanford.edu/people/jcjohns/sorting.js/)

A sort-merge join is a join where the query optimizer decides the best way to accomplish the join is to sort the two tables on the specified join keys and then walk both tables in the same way as the merge step of merge sort.

* [Sort-merge join](https://en.wikipedia.org/wiki/Sort-merge_join)

Note that it doesn't necessarily have to use merge sort for this sort step, though merge sort is well-known to be one of the best external sorting algorithms (it's used by Hadoop during its shuffle step, for example).

* [External sorting](https://en.wikipedia.org/wiki/External_sorting)

From there, it will compare the sorted tables. It will place a cursor at the smallest key value for both tables (at the top of the sorted tables). At each step, it determines whether it should advance the cursor on one table or the other based on the values of the keys for the rows located at the current cursor position.
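The cursor movement described above can be sketched in a few lines of Python. The `sort_merge_join` function and the data are made up for illustration; rows are `(key, value)` pairs, and the sketch allows several left rows to share a key, as happens when multiple tickets reference the same pull request.

```python
def sort_merge_join(left_rows, right_rows):
    """Join two lists of (key, value) pairs on key using sort + merge."""
    left = sorted(left_rows)
    right = sorted(right_rows)

    joined = []
    i = j = 0  # one cursor per sorted table

    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]

        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Emit every right row that shares this key, then advance left.
            k = j

            while k < len(right) and right[k][0] == left_key:
                joined.append((left_key, left[i][1], right[k][1]))
                k += 1

            i += 1

    return joined


# Toy data: (pull request URL, JIRA key) and (pull request URL, state).
mapping_rows = [
    ('https://github.com/example/repo/pull/11', 'LPP-3'),
    ('https://github.com/example/repo/pull/10', 'LPP-1'),
    ('https://github.com/example/repo/pull/10', 'LPP-2'),
    ('https://github.com/example/repo/pull/12', 'LPP-4'),
]

pull_rows = [
    ('https://github.com/example/repo/pull/10', 'open'),
    ('https://github.com/example/repo/pull/11', 'closed'),
]

merged = sort_merge_join(mapping_rows, pull_rows)
```

After the sort step, the merge itself only moves each cursor forward, which is why the approach works well when the tables are already sorted or too large to fit into a hash table in memory.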

If you need a visualization in order to understand how the cursor advances in merge sort (possibly due to lack of familiarity with the merge sort algorithm), you're encouraged to consult this visualization.

* https://www.youtube.com/watch?v=kPRA0W1kECg#t=1m7

### Additional Notes

In practice, what happens from here is that you ask the database to explain how it plans to execute your query, you work some magic to make that explanation look better (by adding database indices or modifying the query), and you commit those changes to the codebase.

* [Query plan](https://en.wikipedia.org/wiki/Query_plan)

Our definition of "better" comes from interpreting certain aspects of the query plan as necessarily worse than what we expected. Our expectations often come from definitions that are provided by various database vendors on the ideal query plan.

* [MySQL explain plan](https://dev.mysql.com/doc/refman/5.6/en/explain-output.html)
* [Oracle explain plan](https://docs.oracle.com/database/121/TGSQL/tgsql_interp.htm)
* [PostgreSQL explain plan](http://www.postgresql.org/docs/9.4/static/using-explain.html)
* [SQL Server explain plan](https://technet.microsoft.com/en-us/library/ms178071%28v=sql.105%29.aspx)
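None of the databases above are needed to try this out. As a minimal sketch, the snippet below uses SQLite (an assumption made purely for convenience, since the `sqlite3` module ships with Python) to ask for a query plan before and after adding an index. The table, index name, and `explain` helper are made up for illustration.

```python
import sqlite3

connection = sqlite3.connect(':memory:')
connection.execute('CREATE TABLE mapping (jira_key TEXT, github_url TEXT)')
connection.executemany(
    'INSERT INTO mapping VALUES (?, ?)',
    [('LPP-%d' % i, 'https://github.com/example/repo/pull/%d' % i) for i in range(100)]
)

query = 'SELECT jira_key FROM mapping WHERE github_url = ?'
parameters = ('https://github.com/example/repo/pull/42',)


def explain(connection, query, parameters):
    # The last column of each EXPLAIN QUERY PLAN row is a human-readable description.
    rows = connection.execute('EXPLAIN QUERY PLAN ' + query, parameters).fetchall()
    return [row[-1] for row in rows]


plan_before = explain(connection, query, parameters)

connection.execute('CREATE INDEX idx_mapping_github_url ON mapping (github_url)')

plan_after = explain(connection, query, parameters)

print(plan_before)
print(plan_after)
```

The first plan typically describes a scan of the whole table, while the second mentions a search using `idx_mapping_github_url`. The exact wording varies between SQLite versions, and other databases present their plans quite differently.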

With that in mind, let's talk about implementation details (rather than relying solely on high level intuitions) in order to enhance your knowledge about why certain joins are better than others and why the query plan may choose certain approaches whenever it is confronted with your query.

## Convert Notebook to Script

The following cell will use `jupyter nbconvert` to build a `checklpp.py` which runs the script outlined in this notebook.

In [None]:
%%javascript
var script_file = 'checklpp.py';

var notebook_name = window.document.getElementById('notebook_name').innerHTML;
var nbconvert_command = 'jupyter nbconvert --stdout --to script ' + notebook_name;

var grep_command = "grep -v '^#' | grep -v -F get_ipython | sed '/^$/N;/^\\n$/D'";
var command = '!' + nbconvert_command + ' | ' + grep_command + ' > ' + script_file;

if (Jupyter.notebook.kernel) {
    Jupyter.notebook.kernel.execute(command);
}