## Requirements Relationships Chord Diagram


In [1]:
import re
from itertools import combinations
import psycopg2
import numpy as np
import pandas as pd
import plotly
import plotly.plotly as py
import plotly.graph_objs as go
from IPython import display
from bs4 import BeautifulSoup as bs

from nbstyler import DATA_STYLE as s

plotly.offline.init_notebook_mode(connected=True) # run at the start of every ipython notebook to use plotly.offline

%matplotlib inline

### Data Preparation

In [2]:
data_querystr = """SELECT * FROM v_full_data_offers_history"""
conn = psycopg2.connect('dbname=jobsbg')
data_df = pd.read_sql_query(data_querystr, conn, index_col='subm_date')
conn.close()

In [3]:
data_df.head(1)

subm_date,subm_type,job_id,company_id,norm_salary,job_title,company_name,text_salary,job_contents
2017-09-27,submission,3994437,124912,,Data Analyst,ПрайсуотърхаусКупърс Одит ООД,,"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 ..."


### Relationship Categories Definition

Before building the data, we need to define the scope of the relationships we want to observe. The search terms being compared should belong to the same semantic category; otherwise their co-occurrence counts offer little insight.

To explore the data and prepare suitable definitions, we examined the text contents of the data job offers in depth. We used the `nltk` library to produce frequency distributions of the most common words, bigrams and trigrams, and then used those distributions to explore possible relationship categories. See the details in the [Requirements Deep Dive Notebook](./Data_Offers_Requirements_Deep_Dive.ipynb).
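As a rough illustration of what such a frequency distribution looks like, here is a minimal sketch that counts n-grams with the standard library's `collections.Counter`, used as a stand-in for `nltk`. The function name `ngram_counts`, the `\w+` tokenizer and the sample texts are assumptions made for this example; the actual processing lives in the deep dive notebook.

```python
import re
from collections import Counter

def ngram_counts(texts, n):
    """Count n-grams of lowercase word tokens across an iterable of texts."""
    counts = Counter()
    for text in texts:
        # Simplified tokenizer (assumption): runs of word characters
        tokens = re.findall(r"\w+", text.lower())
        # Zipping n shifted views of the token list yields consecutive n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

# Hypothetical sample texts, not taken from the dataset
sample_texts = [
    "Experience with SQL Server and Power BI",
    "Advanced Excel, SQL Server and Tableau",
]
bigrams = ngram_counts(sample_texts, 2)
print(bigrams.most_common(3))
```

Frequent bigrams such as `('sql', 'server')` are the kind of candidates that can later be turned into search filters.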

For the first chord diagram we chose a breakdown by technology. Other relationships worth exploring include breakdowns by skill requirements, perks and benefits, and job titles.

### Data Jobs Technology Requirement Relationships

Below is a preliminary list of the most frequently sought technology terms. We use it to construct regex filters for selected key technologies.

In [4]:
tech_terms_candidates = [
    'excel', 'tableau', 'access', 'qlik', 'hadoop', 'informatica', 'vmware', 'ssis', 'vba', 'python','powerpoint', 'mysql',
    'spark', 'microstrategy', 'deluge', 'ssrs', ('sql', 'server'), ('power', 'bi'), ('ms', 'office'), ('microsoft', 'office'), ]

In [5]:
tech_terms_filters = [
    r'excel', r'tableau', r'qlik', r'hadoop', r'ss[ir]s', r'sql server', r'power bi', 
    r'spark', r'postgresql', r'informatica', r'microstrategy', r'(mysql)|(mariadb)']

tech_terms_labels = [
    'Excel', 'Tableau', 'Qlik', 'Hadoop', 'SSRS/SSIS', 'SQL Server', 'Power BI', 
    'Spark', 'Postgresql', 'Informatica', 'Microstrategy', 'MySQL/MariaDB']
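One caveat with plain substring patterns: a filter such as `excel` also matches inside longer words like "excellent", which may inflate the counts. The sketch below shows how a word-boundary anchor (`\b`) changes the result. The sample sentence is hypothetical, and the stricter pattern is a possible variation, not the filter used in this notebook.

```python
import re

loose = re.compile(r"excel", re.IGNORECASE)       # pattern as used above
strict = re.compile(r"\bexcel\b", re.IGNORECASE)  # hypothetical stricter variant

sample = "We offer excellent working conditions"  # hypothetical text

print(bool(loose.search(sample)))   # True: matches inside "excellent"
print(bool(strict.search(sample)))  # False: "excel" is not a whole word here
```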

#### Preparing a square matrix with counts for matching filters

First we define two helper functions: `match_terms` returns `True` when both of the provided patterns are found in a job offer's contents, and `count_matches` counts the offers in a column for which that holds. We also prepare a list of all pairwise combinations of the filter patterns.

In [6]:
def match_terms(first_term, second_term, text):
    # True only when both patterns occur somewhere in the text (case-insensitive)
    return bool(re.search(first_term, text, re.IGNORECASE)
                and re.search(second_term, text, re.IGNORECASE))

def count_matches(first_term, second_term, col):
    # Number of texts in the column that match both patterns
    return sum(match_terms(first_term, second_term, t) for t in col.values)

req_combinations = list(combinations(tech_terms_filters, 2))
req_combinations[:5]

[('excel', 'tableau'),
 ('excel', 'qlik'),
 ('excel', 'hadoop'),
 ('excel', 'ss[ir]s'),
 ('excel', 'sql server')]

The counts are stored in a dictionary keyed by tuples of the two search terms.

In [7]:
%%time
match_results = [count_matches(*tup, data_df.job_contents) for tup in req_combinations]
match_dict = dict(zip(req_combinations, match_results))

CPU times: user 25.8 s, sys: 0 ns, total: 25.8 s
Wall time: 25.8 s
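If the run time becomes a concern, one possible alternative is to search each text once per pattern and derive all pairwise counts from a matrix product. With 12 filters that is 12 searches per text instead of up to 132 (two per pair, 66 pairs). The sketch below is an assumption-laden illustration, not the approach used in this notebook: `cooccurrence_matrix` and the sample inputs are hypothetical.

```python
import re
import numpy as np

def cooccurrence_matrix(patterns, texts):
    """Return a square matrix whose (i, j) cell counts the texts matching both patterns."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    # hits[t, p] is 1 when pattern p is found in text t, else 0
    hits = np.array(
        [[int(bool(rx.search(text))) for rx in compiled] for text in texts],
        dtype=int,
    )
    # (hits.T @ hits)[i, j] sums, over texts, the product of the two indicators
    cooc = hits.T @ hits
    # The diagonal holds single-pattern counts; zero it to mirror the pairwise-only matrix
    np.fill_diagonal(cooc, 0)
    return cooc

# Hypothetical sample inputs
sample_patterns = [r"excel", r"tableau", r"sql server"]
sample_texts = [
    "Excel and Tableau",
    "excel, SQL Server",
    "Tableau only",
    "Excel, Tableau, SQL Server",
]
cooc = cooccurrence_matrix(sample_patterns, sample_texts)
print(cooc)
```

The result is already a symmetric square matrix, so no separate unpacking step is needed with this variant.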


Another helper function unpacks the combination counts into a square matrix. Because co-occurrence is symmetric, each count fills both cells (i, j) and (j, i), and the diagonal is set to zero. Finally, a DataFrame is created from that matrix; it is the main data source for the chord diagram.

In [45]:
def make_matrix(headers, counts):
    res = []
    for k1 in headers:
        row = []
        for k2 in headers:
            if k1 == k2:
                # a term is never paired with itself
                row.append(0)
            else:
                # each pair is stored once, so look it up in either order
                cell_value = counts.get((k1, k2), counts.get((k2, k1)))
                row.append(cell_value)
        res.append(row)

    return np.array(res, dtype=int)

In [46]:
tech_terms_matrix = make_matrix(tech_terms_filters, match_dict)
tech_terms_matrix

array([[  0, 250, 237,  50, 300, 159, 136,  55,  25,  79,  23,  64],
       [250,   0, 195,  29,  81,  41,  89,  29,   5,  35,  23,  26],
       [237, 195,   0,   8,  42,  24,  89,  14,   6,  64,  17,  10],
       [ 50,  29,   8,   0,  26,  20,  14,  75,   3,   4,   3,   3],
       [300,  81,  42,  26,   0, 141,  78,  35,  11,  42,  12,  28],
       [159,  41,  24,  20, 141,   0,  64,  16,  26,  31,  10,  33],
       [136,  89,  89,  14,  78,  64,   0,  19,   0,   5,   9,   5],
       [ 55,  29,  14,  75,  35,  16,  19,   0,   5,   4,   3,  18],
       [ 25,   5,   6,   3,  11,  26,   0,   5,   0,   0,   0,  35],
       [ 79,  35,  64,   4,  42,  31,   5,   4,   0,   0,   6,   1],
       [ 23,  23,  17,   3,  12,  10,   9,   3,   0,   6,   0,   0],
       [ 64,  26,  10,   3,  28,  33,   5,  18,  35,   1,   0,   0]])

In [47]:
tech_terms_df = pd.DataFrame(tech_terms_matrix, columns=tech_terms_labels, index=tech_terms_labels)
tech_terms_df

,Excel,Tableau,Qlik,Hadoop,SSRS/SSIS,SQL Server,Power BI,Spark,Postgresql,Informatica,Microstrategy,MySQL/MariaDB
Excel,0,250,237,50,300,159,136,55,25,79,23,64
Tableau,250,0,195,29,81,41,89,29,5,35,23,26
Qlik,237,195,0,8,42,24,89,14,6,64,17,10
Hadoop,50,29,8,0,26,20,14,75,3,4,3,3
SSRS/SSIS,300,81,42,26,0,141,78,35,11,42,12,28
SQL Server,159,41,24,20,141,0,64,16,26,31,10,33
Power BI,136,89,89,14,78,64,0,19,0,5,9,5
Spark,55,29,14,75,35,16,19,0,5,4,3,18
Postgresql,25,5,6,3,11,26,0,5,0,0,0,35
Informatica,79,35,64,4,42,31,5,4,0,0,6,1
Microstrategy,23,23,17,3,12,10,9,3,0,6,0,0
MySQL/MariaDB,64,26,10,3,28,33,5,18,35,1,0,0


We can finally move to Plotly.

### Ideograms Preparation

A chord diagram encodes information in two kinds of graphical objects:

- Ideograms: distinctly colored arcs of a circle.
- Ribbons: planar shapes bounded by two quadratic Bézier curves and two arcs of a circle, which can degenerate to a point.
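To make the ideogram idea concrete, here is a minimal sketch of how arc positions could be derived from the co-occurrence matrix. It assumes a common convention, not something stated in this notebook: each ideogram's angular length is proportional to its row total, with a small fixed gap between neighbours. The function name `ideogram_ends`, the gap size and the toy matrix are all hypothetical.

```python
import math

def ideogram_ends(row_sums, gap=2 * math.pi * 0.005):
    """Return (start, end) angles in radians for each ideogram on the unit circle."""
    total = sum(row_sums)
    # Share the full circle in proportion to the row totals, reserving one gap per ideogram
    lengths = [2 * math.pi * s / total - gap for s in row_sums]
    ends, left = [], 0.0
    for length in lengths:
        right = left + length
        ends.append((left, right))
        left = right + gap  # the next arc starts after the gap
    return ends

# Hypothetical 3x3 symmetric matrix with a zero diagonal
toy_matrix = [[0, 2, 2], [2, 0, 1], [2, 1, 0]]
toy_row_sums = [sum(row) for row in toy_matrix]
toy_ends = ideogram_ends(toy_row_sums)
for start, end in toy_ends:
    print(round(start, 3), round(end, 3))
```

With the real data, the row totals of `tech_terms_matrix` would play the role of `toy_row_sums`, so technologies with more co-occurrences get longer arcs.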



In [15]:
from IPython.core.display import HTML
with open('../resources/styles/datum.css', 'r') as f:
    style = f.read()
HTML(style)

### Resources:

https://plot.ly/python/filled-chord-diagram/