## Requirements Relationships Chord Diagram


In [1]:
import re
from itertools import combinations
import psycopg2
import numpy as np
import pandas as pd
import plotly
import plotly.graph_objs as go
from IPython import display
from bs4 import BeautifulSoup as bs

from nbstyler import DATA_STYLE as DS

plotly.offline.init_notebook_mode(connected=True)

%matplotlib inline

### Objectives

The main objective of this recipe is to try out the Plotly implementation of the chord diagram and to provide a clean, reusable recipe for preparing such charts in the future.


### Data Preparation

In [2]:
data_querystr = """SELECT * FROM data_offers.do_full_offer_history"""
conn = psycopg2.connect('dbname=jobsbg')
data_df = pd.read_sql_query(data_querystr, conn, index_col='subm_date')
conn.close()

In [3]:
data_df.head(1)

subm_date,subm_type,job_id,company_id,norm_salary,job_title,company_name,text_salary,job_location,job_contents
2018-07-05,resubmission,4416332,179347,,Data Analyst,Технементалс Технолоджис (България) ЕАД,,София,"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 ..."


### Relationship Categories Definition

In order to build the necessary data we first need to define the scope of the relationships we aim to observe. The search strings we will compare should be of the same semantic category to provide meaningful insight.

To explore the data and prepare proper definitions, we took a deep dive into the text contents of the job offers. We used the `nltk` library to produce frequency distributions of the most common words, bigrams and trigrams, and then used those distributions to explore possible relationship categories. See the details in the [Requirements Deep Dive Notebook](./Data_Offers_Requirements_Deep_Dive.ipynb).

For the first chord diagram the chosen view is a breakdown by technology. Other possible relationships to explore are skill requirements breakdown, perks & benefits breakdown, job title breakdown, etc.

### Data Jobs Technology Requirement Relationships

A preliminary list of terms describing the most frequently sought technologies is presented below. We will use it to construct regex filters for the selected key technologies.

In [4]:
tech_terms_filters = [
    r'power bi', r'excel', r'ss[ir]s',
    r'sql server', r'postgresql', r'(mysql)|(mariadb)', r't.?sql', r'pl.?sql',
    r'pentaho', r'hadoop', r'spark', r'informatica',
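    # Caution: in the last pattern on the next line '|' splits the whole expression,
    # so it matches 'oracle bi' or any 'business intelligence'. Use
    # r'oracle (bi|business intelligence)' to require 'oracle' in both cases
    # (the counts shown further down come from the pattern as written).
    r'qlik', r'tableau', r'microstrategy', r'oracle (bi)|(business intelligence)',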
    r'python', r'vba', r'linux', r'aws']

tech_terms_labels = [
    'Power BI', 'Excel', 'SSIS/SSRS',
    'MS SQL Server', 'PostgreSQL', 'MySQL/MariaDB', 'T-SQL', 'PL/SQL',
    'Pentaho', 'Hadoop', 'Spark', 'Informatica',
    'Qlik', 'Tableau', 'Microstrategy', 'Oracle BI',
    'Python', 'VBA', 'Linux', 'AWS']

#### Preparing a square matrix with counts for matching filters

First we define two helper functions: `match_terms` returns `True` for a job offer in which both of the provided patterns are found, and `count_matches` counts such offers across a column. We also prepare a list of all possible pairs of filter patterns.

In [5]:
def match_terms(first_term, second_term, text):
    # True only when both patterns occur somewhere in the text (case-insensitive).
    return bool(re.search(first_term, text, re.IGNORECASE)
                and re.search(second_term, text, re.IGNORECASE))


def count_matches(first_term, second_term, col):
    # Number of entries in the column that match both patterns.
    return sum(match_terms(first_term, second_term, t) for t in col.values)

req_combinations = list(combinations(tech_terms_filters, 2))

The counts are stored in a dictionary keyed by tuples holding both search patterns.

In [6]:
%%time
match_results = [count_matches(*tup, data_df.job_contents) for tup in req_combinations]
match_dict = dict(zip(req_combinations, match_results))

CPU times: user 3min 15s, sys: 104 ms, total: 3min 15s
Wall time: 3min 17s
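The pairwise scan above evaluates each regular expression once for every pair it belongs to, which likely accounts for much of the runtime. A minimal sketch of an alternative, assuming a small toy corpus (`toy_texts`, `toy_patterns` and `cooccurrence_matrix` are illustrative names, not part of this notebook): match each pattern once per document, store the hits in an indicator matrix, and obtain all pair counts with a single matrix product.

```python
import re
import numpy as np

# Hypothetical toy inputs, used only for illustration.
toy_texts = [
    'We need Python and SQL Server skills',
    'Excel and python reporting',
    'Tableau, Excel and Python dashboards',
    'Linux administration',
]
toy_patterns = [r'python', r'excel', r'tableau', r'linux']


def cooccurrence_matrix(patterns, texts):
    """Count documents in which each pair of patterns both match."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    # hits[d, p] is 1 when pattern p is found in document d.
    hits = np.array([[1 if rx.search(t) else 0 for rx in compiled] for t in texts])
    counts = hits.T @ hits          # counts[p, q] = documents matching both p and q
    np.fill_diagonal(counts, 0)     # the notebook keeps zeros on the diagonal
    return counts


toy_counts = cooccurrence_matrix(toy_patterns, toy_texts)
print(toy_counts)
```

The result is already a symmetric square matrix, so under this approach the separate unpacking step would not be needed.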


Another helper function will unpack the combinations counts into a square matrix form. Finally, a DataFrame is created from the combinations counts matrix. This is our main data source for the chord diagram. 

In [7]:
def make_matrix(headers, counts):
    res = []
    for k1 in headers:
        row = []
        for k2 in headers:
            if k1 == k2:
                row.append(0)
            else:
                # Each pair is stored once, so look the key up in both orders.
                cell_value = counts.get((k1, k2), counts.get((k2, k1)))
                row.append(cell_value)
        res.append(row)
    return np.array(res, dtype=int)  

In [8]:
tech_terms_matrix = make_matrix(tech_terms_filters, match_dict)

In [9]:
tech_terms_df = pd.DataFrame(tech_terms_matrix, columns=tech_terms_labels, index=tech_terms_labels)
tech_terms_df

,Power BI,Excel,SSIS/SSRS,MS SQL Server,PostgreSQL,MySQL/MariaDB,T-SQL,PL/SQL,Pentaho,Hadoop,Spark,Informatica,Qlik,Tableau,Microstrategy,Oracle BI,Python,VBA,Linux,AWS
Power BI,0,226,131,128,0,5,47,11,9,22,29,18,188,190,29,178,81,39,8,6
Excel,226,0,617,270,76,128,184,60,68,86,93,120,422,440,41,642,363,268,77,57
SSIS/SSRS,131,617,0,275,59,84,131,45,36,34,41,74,91,165,34,279,166,89,18,31
MS SQL Server,128,270,275,0,93,100,174,12,48,29,29,64,66,109,20,180,115,39,55,10
PostgreSQL,0,76,59,93,0,106,39,2,49,4,13,2,14,6,0,47,30,0,47,6
MySQL/MariaDB,5,128,84,100,106,0,40,0,67,6,29,11,17,52,0,74,60,5,48,2
T-SQL,47,184,131,174,39,40,0,44,41,19,20,13,76,100,17,144,50,27,28,10
PL/SQL,11,60,45,12,2,0,44,0,8,3,0,32,20,22,26,86,3,7,1,2
Pentaho,9,68,36,48,49,67,41,8,0,0,8,36,33,41,4,40,38,1,37,15
Hadoop,22,86,34,29,4,6,19,3,0,0,143,5,9,38,3,36,156,7,99,54


We can finally move to Plotly. 

A chord diagram encodes information in two graphical objects:

- Ideograms, represented by distinctly colored circular arcs;
- Ribbons, which are planar shapes bounded by two quadratic Bézier curves and two circular arcs (each arc can degenerate to a point).


### Ideograms Preparation

For each of our predefined tech terms we can produce a total hits count by summing up all the entries on the row (or column for that matter). That total count determines the size of each ideogram of the chart.

We are going to need a couple of helper functions to process the data in order to get ideogram ends.


In [10]:
PI = np.pi

def moduloAB(x, a, b):
    if a >= b:
        raise ValueError('Incorrect interval ends')
    y = (x-a) % (b-a)
    return y+b if y < 0 else y+a


def test_2PI(x):
    return 0 <= x < 2*PI
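To make the behaviour of the angle-wrapping helper concrete, here is a small sketch. `wrap_to_interval` is a hypothetical stand-in written for this example; it relies on Python's `%` returning a non-negative result for a positive divisor, so no sign check is needed.

```python
import math


def wrap_to_interval(x, a, b):
    """Map x into the half-open interval [a, b) by shifting it a whole number of periods."""
    if a >= b:
        raise ValueError('Incorrect interval ends')
    return (x - a) % (b - a) + a


# An angle slightly past a full turn wraps back near zero.
wrapped_full = wrap_to_interval(2 * math.pi + 0.5, 0, 2 * math.pi)
# 11*pi/6 expressed in [-pi, pi) becomes -pi/6.
wrapped_neg = wrap_to_interval(11 * math.pi / 6, -math.pi, math.pi)
print(wrapped_full, wrapped_neg)
```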

And now use them to compute the row sums and the lengths of corresponding ideograms.

In [11]:
row_sum = [np.sum(tech_terms_matrix[k, :]) for k in range(len(tech_terms_filters))]

#set the gap between two consecutive ideograms
gap = 2*PI*0.005
ideogram_length = 2*PI*np.asarray(row_sum)/sum(row_sum)-gap*np.ones(len(tech_terms_filters))

The next function returns the list of end angular coordinates for each ideogram arc:


In [12]:
def get_ideogram_ends(ideogram_len, gap):
    ideo_ends = []
    left = 0
    for k in range(len(ideogram_len)):
        right = left+ideogram_len[k]
        ideo_ends.append([left, right])
        left = right+gap
    return ideo_ends

ideo_ends = get_ideogram_ends(ideogram_length, gap)
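As a quick sanity check on the geometry, the sketch below repeats the length and end computation on a made-up 3×3 matrix (`toy_matrix` and the other `toy_` names are hypothetical) using `np.cumsum`. Each arc plus its trailing gap should tile the circle exactly, so the last arc ends one gap short of 2π.

```python
import numpy as np

toy_matrix = np.array([[0, 2, 1],
                       [2, 0, 3],
                       [1, 3, 0]])
toy_gap = 2 * np.pi * 0.005

toy_row_sum = toy_matrix.sum(axis=1)
# Arc length is proportional to the row total, minus the gap that follows each arc.
toy_lengths = 2 * np.pi * toy_row_sum / toy_row_sum.sum() - toy_gap

# Start of arc k = total length of the previous arcs and gaps.
starts = np.concatenate(([0.0], np.cumsum(toy_lengths + toy_gap)[:-1]))
toy_ends = np.column_stack((starts, starts + toy_lengths))
print(toy_ends)
```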

The function `make_ideogram_arc` returns equally spaced points on an ideogram arc, expressed as complex numbers in polar form:

In [13]:
def make_ideogram_arc(R, phi, a=50):
    if not test_2PI(phi[0]) or not test_2PI(phi[1]):
        phi = [moduloAB(t, 0, 2*PI) for t in phi]
    length = (phi[1]-phi[0]) % (2*PI)
    nr = 5 if length <= PI/4 else int(a*length/PI)
    if phi[0] < phi[1]:
        theta = np.linspace(phi[0], phi[1], nr)
    else:
        phi = [moduloAB(t, -PI, PI) for t in phi]
        theta = np.linspace(phi[0], phi[1], nr)
    return R*np.exp(1j*theta)

In [14]:
z = make_ideogram_arc(1.3, [11*PI/6, PI/17])
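The polar form makes the arc easy to check: every point `R*exp(1j*theta)` has modulus `R` and argument `theta`. A minimal sketch with hypothetical values:

```python
import numpy as np

R = 1.1
theta = np.linspace(0.0, np.pi / 2, 5)   # five equally spaced angles on a quarter circle
arc = R * np.exp(1j * theta)

# The real and imaginary parts are the x and y coordinates passed to plotly.
x, y = arc.real, arc.imag
print(np.round(x, 3), np.round(y, 3))
```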

### Ribbons Preparation

The function `map_data` rescales every matrix entry to an angular width, so that the entries of row `k` share out the arc of ideogram `k` in proportion to their counts:

In [15]:
def map_data(data_matrix, row_value, ideogram_length):
    mapped = np.zeros(data_matrix.shape)
    for j in range(data_matrix.shape[1]):
        mapped[:, j] = ideogram_length*data_matrix[:, j]/row_value
    return mapped

mapped_data = map_data(tech_terms_matrix, row_sum, ideogram_length)
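Put differently, each entry is rescaled so that row `k` of the result sums to the length of ideogram `k`. The sketch below shows the same mapping written with broadcasting on toy inputs (`toy_matrix` and `toy_lengths` are made up for illustration).

```python
import numpy as np

toy_matrix = np.array([[0, 2, 1],
                       [2, 0, 3],
                       [1, 3, 0]])
toy_lengths = np.array([1.5, 2.5, 2.0])   # hypothetical arc lengths in radians

toy_row_sum = toy_matrix.sum(axis=1)
# Scale row k by lengths[k] / row_sum[k]; [:, None] aligns the factor with rows.
toy_mapped = toy_matrix * (toy_lengths / toy_row_sum)[:, None]
print(toy_mapped)
```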

The array `idx_sort`, defined below, holds in each row the indices that sort the corresponding row of `mapped_data`:


In [16]:
idx_sort = np.argsort(mapped_data, axis=1)

In [17]:
def make_ribbon_ends(mapped_data, ideo_ends, idx_sort):
    L = mapped_data.shape[0]
    ribbon_boundary = np.zeros((L, L+1))
    for k in range(L):
        start = ideo_ends[k][0]
        ribbon_boundary[k][0] = start
        for j in range(1, L+1):
            J = idx_sort[k][j-1]
            ribbon_boundary[k][j] = start+mapped_data[k][J]
            start = ribbon_boundary[k][j]
    return [[(ribbon_boundary[k][j], ribbon_boundary[k][j+1]) for j in range(L)] for k in range(L)]


ribbon_ends = make_ribbon_ends(mapped_data, ideo_ends, idx_sort)
print('ribbon ends starting from the ideogram[2]\n', ribbon_ends[2])

ribbon ends starting from the ideogram[2]
 [(1.286925804166983, 1.286925804166983), (1.286925804166983, 1.2908393275578358), (1.2908393275578358, 1.297579284508749), (1.297579284508749, 1.3049714953581377), (1.3049714953581377, 1.3123637062075264), (1.3123637062075264, 1.320190752989232), (1.320190752989232, 1.3291048896017301), (1.3291048896017301, 1.3388886980788621), (1.3388886980788621, 1.3517163580822131), (1.3517163580822131, 1.3678052875779414), (1.3678052875779414, 1.3860683967352545), (1.3860683967352545, 1.40541859572336), (1.40541859572336, 1.4252036306437827), (1.4252036306437827, 1.453685384210545), (1.453685384210545, 1.4821671377773074), (1.4821671377773074, 1.5180411021934583), (1.5180411021934583, 1.5541324845757678), (1.5541324845757678, 1.6139224252693527), (1.6139224252693527, 1.6745820378275715), (1.6745820378275715, 1.80872892294736)]
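The boundaries printed above can be read as a cumulative sum over each row, taken in sorted order. The first ribbon is degenerate because the diagonal entry is zero. A sketch for a single ideogram, with hypothetical values:

```python
import numpy as np

ideo_start = 1.0                               # hypothetical start angle of one ideogram
row = np.array([0.3, 0.0, 0.1, 0.2])            # hypothetical mapped ribbon widths
order = np.argsort(row)                         # narrowest ribbon first

# Running total of the sorted widths gives the right edge of each ribbon.
right = ideo_start + np.cumsum(row[order])
left = np.concatenate(([ideo_start], right[:-1]))
toy_ribbon_ends = list(zip(left, right))
print(toy_ribbon_ends)
```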


In [18]:
def control_pts(angle, radius):
    if len(angle) != 3:
        raise ValueError('angle must have len = 3')
    b_cplx = np.array([np.exp(1j*angle[k]) for k in range(3)])
    b_cplx[1] = radius*b_cplx[1]
    # Return a list so that callers can take len() of the control polygon.
    return list(zip(b_cplx.real, b_cplx.imag))

In [19]:
def ctrl_rib_chords(l, r, radius):
    if len(l) != 2 or len(r) != 2:
        raise ValueError('the arc ends must be elements in a list of len 2')
    return [control_pts([l[j], (l[j]+r[j])/2, r[j]], radius) for j in range(2)]

In [20]:
def make_q_bezier(b):
    if len(b) != 3:
        raise ValueError('control polygon must have 3 points')
    A, B, C = b
    return 'M ' + str(A[0]) + ',' + str(A[1]) + ' ' + 'Q ' + \
        str(B[0]) + ', ' + str(B[1]) + ' ' + \
        str(C[0]) + ', ' + str(C[1])


b = [(1,4), (-0.5, 2.35), (3.745, 1.47)]
make_q_bezier(b)

'M 1,4 Q -0.5, 2.35 3.745, 1.47'
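The `Q` command in the path string draws a quadratic Bézier curve: it starts at the first control point, ends at the third, and is pulled towards the second. A small sketch that evaluates the curve numerically for the same control points (`q_bezier_point` is a hypothetical helper, not part of this notebook):

```python
import numpy as np


def q_bezier_point(control, t):
    """Evaluate B(t) = (1-t)^2 A + 2(1-t)t B + t^2 C for a 3-point control polygon."""
    a, b, c = (np.asarray(p, dtype=float) for p in control)
    return (1 - t) ** 2 * a + 2 * (1 - t) * t * b + t ** 2 * c


control = [(1, 4), (-0.5, 2.35), (3.745, 1.47)]
start = q_bezier_point(control, 0.0)
middle = q_bezier_point(control, 0.5)
end = q_bezier_point(control, 1.0)
print(start, middle, end)
```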

In [21]:
def make_ribbon_arc(theta0, theta1):
    if test_2PI(theta0) and test_2PI(theta1):
        if theta0 < theta1:
            theta0 = moduloAB(theta0, -PI, PI)
            theta1 = moduloAB(theta1, -PI, PI)
            if theta0*theta1 > 0:
                raise ValueError('incorrect angle coordinates for ribbon')

        nr = int(40*(theta0-theta1)/PI)
        if nr <= 2: 
            nr = 3
        theta = np.linspace(theta0, theta1, nr)
        pts = np.exp(1j*theta)

        string_arc = ''
        for k in range(len(theta)):
            string_arc += 'L ' + str(pts.real[k]) + ', ' + str(pts.imag[k]) + ' '
        return string_arc
    else:
        raise ValueError('the angle coordinates for an arc side of a ribbon must be in [0, 2*pi)')

In [22]:
def make_ideo_shape(path, line_color, fill_color):
    return dict(
        line=dict(color=line_color, width=0.45),
        path=path,
        type='path',
        fillcolor=fill_color,
        layer='below',
    )

In [23]:
temp_colors = [*DS['colorramp']['acc1'], *DS['colorramp']['acc2']]
len_datalist = len(tech_terms_filters)
ideo_colors = temp_colors[:len_datalist]

In [24]:
def make_ribbon(l, r, line_color, fill_color, radius=0.2):
    polygon = ctrl_rib_chords(l, r, radius)
    b, c = polygon

    return dict(
        line=dict(
            color=line_color,
            width=0.5
            ),
        path=make_q_bezier(list(b))+make_ribbon_arc(r[0], r[1])+make_q_bezier(list(c)[::-1])+make_ribbon_arc(l[1], l[0]),
        type='path',
        fillcolor=fill_color,
        layer='below'
    )


def make_self_rel(l, line_color, fill_color, radius):
    b = control_pts([l[0], (l[0]+l[1])/2, l[1]], radius)
    return dict(
        line=dict(
            color=line_color,
            width=0.5
            ),
        path=make_q_bezier(b)+make_ribbon_arc(l[1], l[0]),
        type='path',
        fillcolor=fill_color,
        layer='below'
    )


def invPerm(perm):
    inv = [0] * len(perm)
    for i, s in enumerate(perm):
        inv[s] = i
    return inv
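`invPerm` answers the question "at which sorted position does column `j` end up?", which is what the ribbon lookup needs. For a permutation produced by `np.argsort`, applying `np.argsort` a second time gives the same inverse. A sketch with a made-up row:

```python
import numpy as np

row = np.array([0.3, 0.0, 0.1, 0.2])   # hypothetical mapped widths
sigma = np.argsort(row)                # sigma[i] = column placed at sorted position i
sigma_inv = np.argsort(sigma)          # sigma_inv[j] = sorted position of column j

# Explicit inversion, equivalent to the loop-based helper.
explicit = np.empty_like(sigma)
explicit[sigma] = np.arange(len(sigma))
print(sigma, sigma_inv)
```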

In [25]:
radii_sribb = [0.4, 0.30, 0.35, 0.39, 0.12]

In [26]:
shapes = []
ribbon_info = []

for k in range(len(tech_terms_filters)):
    sigma = idx_sort[k]
    sigma_inv = invPerm(sigma)
    for j in range(k, len(tech_terms_filters)):
        if tech_terms_matrix[k][j] == 0 and tech_terms_matrix[j][k] == 0:
            continue
        eta = idx_sort[j]
        eta_inv = invPerm(eta)
        l = ribbon_ends[k][sigma_inv[j]]

        if j == k:
            shapes.append(make_self_rel(l, DS['colors']['acc1'], ideo_colors[k], radius=radii_sribb[k]))
            z = 0.9*np.exp(1j*(l[0]+l[1])/2)
            text = tech_terms_labels[k]+' appears in ' + '{:d}'.format(tech_terms_matrix[k][k])
            ribbon_info.append(
                go.Scatter(
                    x=[z.real],
                    y=[z.imag],
                    mode='markers',
                    marker=dict(size=0.5, color=ideo_colors[k]),
                    text=text,
                    hoverinfo='text',
                ),
            )
        else:
            r = ribbon_ends[j][eta_inv[k]]
            zi = 0.9*np.exp(1j*(l[0]+l[1])/2)
            zf = 0.9*np.exp(1j*(r[0]+r[1])/2)
            texti = tech_terms_labels[k]+' appears with '+tech_terms_labels[j]+' {:d}'.format(tech_terms_matrix[k][j])+' times'
            textf = tech_terms_labels[j]+' appears with '+tech_terms_labels[k]+' {:d}'.format(tech_terms_matrix[j][k])+' times'

            ribbon_info.append(go.Scatter(
                x=[zi.real],
                y=[zi.imag],
                mode='markers',
                marker=dict(size=0.5, color='green'),
                text=texti,
                hoverinfo='text'
            ))
            ribbon_info.append(go.Scatter(
                x=[zf.real],
                y=[zf.imag],
                mode='markers',
                marker=dict(size=0.5, color='blue'),
                text=textf,
                hoverinfo='text'
            ))
            r = (r[1], r[0])
            shapes.append(make_ribbon(l, r, 'rgb(175,175,175)', ideo_colors[k]))

In [27]:
ideograms = []
for k in range(len(ideo_ends)):
    z = make_ideogram_arc(1.1, ideo_ends[k])
    zi = make_ideogram_arc(1.0, ideo_ends[k])
    m = len(z)
    n = len(zi)
    ideograms.append(go.Scatter(
        x=z.real,
        y=z.imag,
        mode='lines',
        line=dict(color=ideo_colors[k], shape='spline', width=0.25),
        text=tech_terms_labels[k]+'<br>'+'{:d}'.format(row_sum[k]),
        hoverinfo='text',))

    path = 'M '
    for s in range(m):
        path += str(z.real[s])+', '+str(z.imag[s])+' L '

    Zi = np.array(zi.tolist()[::-1])

    for s in range(m):
        path += str(Zi.real[s])+', '+str(Zi.imag[s])+' L '
    path += str(z.real[0])+' ,'+str(z.imag[0])

    shapes.append(make_ideo_shape(path, 'rgb(150,150,150)', ideo_colors[k]))

data = ideograms + ribbon_info

In [28]:
layout = go.Layout(
    paper_bgcolor=DS['colors']['bg1'],
    plot_bgcolor=DS['colors']['bg1'],
    title='Data Jobs ‒ Technologies in Demand and Relationships Between Them',
    titlefont=DS['chart_fonts']['title'],
    font=DS['chart_fonts']['text'],
    autosize=False,
    width=960,
    height=525,
    margin=dict(
        l=280,
        r=280,
        t=80,
        b=10,
    ),
    showlegend=False,
    hidesources=True,
    xaxis=dict(
        showline=False,
        zeroline=False,
        showgrid=False,
        showticklabels=False,
        title='',
    ),
    yaxis=dict(
        showline=False,
        zeroline=False,
        showgrid=False,
        showticklabels=False,
        title='',
    ),
    shapes=shapes,
    hovermode='closest',
    hoverdistance=40,
)

In [29]:
fig = go.Figure(data=data, layout=layout)

plotly.offline.iplot(fig, filename='data_offers_tech_requirements_chord.html')

In [30]:
# Export a standalone HTML version of the chart.
plotly.offline.plot(fig, filename='data_offers_tech_requirements_chord.html', show_link=False)

'file:///games/WORKSPACE/jpynb_Job_Market_Trends_Bulgaria/workbooks/data_offers_tech_requirements_chord.html'

### Resources:

- https://plot.ly/python/filled-chord-diagram/
- https://hci.stanford.edu/courses/cs448b/f11/lectures/CS448B-20111117-Text.pdf

In [31]:
from IPython.core.display import HTML
with open('../resources/styles/datum.css', 'r') as f:
    style = f.read()
HTML(style)