<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Libraries" data-toc-modified-id="Libraries-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Libraries</a></span></li><li><span><a href="#Question-3a" data-toc-modified-id="Question-3a-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Question 3a</a></span></li><li><span><a href="#Question-3b" data-toc-modified-id="Question-3b-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Question 3b</a></span></li></ul></div>

# Libraries

Import required libraries:

In [1]:
import pandas as pd
import retinasdk
import numpy as np
from pathlib import Path
import re
from sklearn.cluster import DBSCAN, KMeans
from sklearn import manifold
from sklearn import metrics
from nltk import word_tokenize

# Question 3a

**Question**
![](./images/question3a.png)

**Solution**

Read the output of Q2:

In [2]:
q2_df = pd.read_csv('../datasets/output_from_Q2.csv')
q2_df.head(3)

Unnamed: 0,Job Title,Position Duration,Position Location,Job Description,Job Responsibilities,Required Qualifications,Remuneration,Application Deadline,About Company,Clean Job Responsibilities,Duration
0,Chief Financial Officer,,"Yerevan, Armenia",AMERIA Investment Consulting Company is seekin...,- Supervises financial management and administ...,"To perform this job successfully, an\r\r\nindi...",,26 January 2004,,- supervise financial management administrativ...,this is a custom message
1,Full-time Community Connections Intern (paid i...,3 months,"IREX Armenia Main Office; Yerevan, Armenia",IREX currently seeks to fill the position of a...,- Presenting the CC program to interested part...,- Bachelor's Degree; Master's is preferred;\r\...,Commensurate with experience.,12 January 2004,The International Research & Exchanges Board (...,- present cc program interest party ; - assist...,3 months
2,Country Coordinator,Renewable annual contract,"Yerevan, Armenia",Public outreach and strengthening of a growing...,- Working with the Country Director to provide...,"- Degree in environmentally related field, or ...",Salary commensurate with experience.,20 January 2004,The Caucasus Environmental NGO Network is a\r\...,- work country director provide environmental ...,Renewable annual contract\r\r\nPOSITION


Set up the [Cortical.io](https://www.cortical.io/) client so that its semantic fingerprints (sparse distributed representations, SDRs) can be used as the text embeddings:

In [3]:
api_key = 'e098e740-1fdd-11e7-b22d-93a4ae922ff1'
liteClient = retinasdk.LiteClient(api_key)

Build a Series containing the fields to be used as input:

In [4]:
feature_columns = ['Job Title', 'Job Description', 'Job Responsibilities']
input_text = q2_df[feature_columns] \
    .fillna('') \
    .apply(lambda columns: ' '.join(columns), axis=1)\
    .apply(lambda text: 'no text' if text.strip() == '' else text)\
    .apply(lambda text: re.sub('[^A-Za-z]', ' ', text))
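
To make the effect of each step concrete, here is a minimal sketch that runs the same chain on a toy frame. The two rows and the name `toy_df` are made up for illustration; only the column names and the chain itself come from the cell above.

```python
import re

import pandas as pd

# Toy stand-in for q2_df; the rows are invented for illustration.
toy_df = pd.DataFrame({
    'Job Title': ['C++ Developer', None],
    'Job Description': ['Builds back-end services.', None],
    'Job Responsibilities': [None, None],
})

feature_columns = ['Job Title', 'Job Description', 'Job Responsibilities']
toy_text = (
    toy_df[feature_columns]
    .fillna('')                                          # missing fields become empty strings
    .apply(lambda columns: ' '.join(columns), axis=1)    # one string per job ad
    .apply(lambda text: 'no text' if text.strip() == '' else text)
    .apply(lambda text: re.sub('[^A-Za-z]', ' ', text))  # keep ASCII letters only
)

for text in toy_text:
    print(repr(text))
```

Note that the last step replaces every non-letter with a space, so digits and symbols are lost ("C++" becomes "C"). The placeholder `'no text'` keeps fully empty ads from being sent to the API as blank strings.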

Get the semantic fingerprints and save the intermediate results:

In [5]:
%%time
fingerprints_file = '../datasets/output_from_Q2_with_fingerprints.csv'

if not Path(fingerprints_file).exists():
    q2_df['finger_print'] = input_text.apply(lambda text: liteClient.getFingerprint(text))
    q2_df.to_csv(fingerprints_file, index = False)

Wall time: 1.99 ms
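
The cell above only calls the API when the cache file is missing. A side effect of caching through CSV is that each fingerprint (a list) comes back as a string. The sketch below illustrates both points; `fake_fingerprint` is a hypothetical stand-in for `liteClient.getFingerprint`, and the temporary file path is an assumption made so the example runs anywhere.

```python
import ast
import tempfile
from pathlib import Path

import pandas as pd


def fake_fingerprint(text):
    """Hypothetical stand-in for the API call: returns a list of integers."""
    return sorted({len(word) for word in text.split()})


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = Path(tmp_dir) / 'fingerprints.csv'
    df = pd.DataFrame({'text': ['manage budget', 'write code daily']})

    # Compute and store the fingerprints only if no cached copy exists.
    if not cache_file.exists():
        df['finger_print'] = df['text'].apply(fake_fingerprint)
        df.to_csv(cache_file, index=False)

    cached = pd.read_csv(cache_file)

# After the CSV round trip the list is stored as text and must be parsed.
raw = cached['finger_print'].iloc[1]
parsed = ast.literal_eval(raw)
print(type(raw).__name__, type(parsed).__name__)
```

This is why the next step has to parse each `finger_print` value before turning it into a vector.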


Convert each semantic fingerprint into its vector representation and drop job ads with duplicate feature columns:

In [6]:
%%time
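import ast

def fingerprint_to_vector(fingerprint):
    # After the CSV round trip the fingerprint is the string form of a list
    # of active positions; parse it safely instead of calling eval().
    positions = ast.literal_eval(fingerprint)
    vec = np.zeros(16384)
    vec[positions] = 1

    return vec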

q2_df = pd.read_csv(fingerprints_file)

q2_df['finger_print'] = q2_df['finger_print'].apply(fingerprint_to_vector)
q2_df = q2_df.drop_duplicates(['Job Title', 'Job Description', 'Job Responsibilities'])

Wall time: 36.1 s
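
As a quick illustration of the conversion, the sketch below applies the same idea to a hand-written fingerprint string. The three positions are made up, and the `size` parameter is an addition for illustration; the length 16384 is taken from the cell above.

```python
import ast

import numpy as np


def fingerprint_to_vector(fingerprint, size=16384):
    """Turn the string form of a list of active positions into a 0/1 vector."""
    positions = ast.literal_eval(fingerprint)
    vec = np.zeros(size)
    vec[positions] = 1  # set every listed position to 1
    return vec


# Invented fingerprint with three active positions.
vec = fingerprint_to_vector('[0, 3, 16383]')
print(vec.shape, int(vec.sum()), np.flatnonzero(vec))
```

Each job ad therefore becomes a long, mostly zero binary vector, and two ads are similar to the extent that their active positions overlap.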


Drop job ads whose cleaned job responsibilities contain the same words:

In [7]:
q2_df['Extra Clean Job Responsibilities'] = q2_df['Clean Job Responsibilities']\
    .fillna('')\
    .apply(lambda text: [token for token in word_tokenize(text) if token.isalpha()])\
    .apply(lambda token_list: ' '.join(token_list))

q2_df = q2_df.drop_duplicates(['Extra Clean Job Responsibilities'])
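
The sketch below shows why this extra cleaning catches duplicates that differ only in punctuation. It is an approximation: it splits on whitespace with `str.split` instead of `word_tokenize` (an assumption made so the example needs no NLTK data), and the rows and the column name `Extra Clean` are invented.

```python
import pandas as pd

# Invented rows: the first two differ only in punctuation.
toy = pd.DataFrame({'Clean Job Responsibilities': [
    '- develop software ; - test software',
    'develop software , test software',
    '- manage budget',
    None,
]})

toy['Extra Clean'] = (
    toy['Clean Job Responsibilities']
    .fillna('')
    # Keep purely alphabetic tokens; whitespace splitting stands in for word_tokenize.
    .apply(lambda text: ' '.join(token for token in text.split() if token.isalpha()))
)

deduped = toy.drop_duplicates(['Extra Clean'])
print(deduped['Extra Clean'].tolist())
```

Rows 0 and 1 collapse to the same token string, so only the first of them is kept.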

Build the feature matrix:

In [8]:
feature_matrix = np.stack(q2_df['finger_print'].values)

Apply the K-Means clustering algorithm on the feature matrix:

In [9]:
%%time
db = KMeans(init='k-means++', n_clusters=2, n_init=10).fit(feature_matrix)

Wall time: 4min 42s


Assess the quality of the resulting clusters with the Calinski-Harabasz score (higher is better):

In [10]:
%%time
metrics.calinski_harabasz_score(feature_matrix, db.labels_)

Wall time: 8.7 s


475.60534390234324

Note: In my experiments, the Calinski-Harabasz score was highest when the number of clusters was 2.
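
The comparison across cluster counts is not shown in this notebook. A minimal sketch of how such a scan could look is given below; it runs on synthetic binary data rather than the real fingerprints, and the helper `make_group`, the range of `k`, and `random_state=0` are all assumptions.

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)


def make_group(active_columns, n_rows=100, n_columns=64):
    """Synthetic sparse binary rows whose 'active' columns fire often."""
    rows = (rng.random((n_rows, n_columns)) < 0.05).astype(float)
    rows[:, active_columns] = rng.random((n_rows, len(active_columns))) < 0.8
    return rows


# Two synthetic groups that are dense in different column blocks.
toy_matrix = np.vstack([
    make_group(np.arange(0, 16)),
    make_group(np.arange(16, 32)),
])

# Fit K-Means for several cluster counts and record the score for each.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, init='k-means++', n_init=10,
                    random_state=0).fit_predict(toy_matrix)
    scores[k] = metrics.calinski_harabasz_score(toy_matrix, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On the real feature matrix each fit took several minutes, so a scan like this would be slow; running it on a random subsample of rows is one option.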

Assign group labels to each job ad:

In [11]:
q2_df['label'] = db.labels_

# Question 3b

**Question**
![](./images/question3b.png)

**Solution**

Count the number of job ads in each group:

In [12]:
group_summary = q2_df\
 .query('label != -1')\
 .groupby('label')\
 .size()\
 .to_frame('n')\
 .sort_values('n', ascending=False)

group_summary

label,n
1,9938
0,3230
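
The two groups are clearly unequal in size. A small sketch that turns the counts above into shares follows; the frame is rebuilt by hand from the printed output, and the `share` column is an addition for illustration.

```python
import pandas as pd

# Counts copied from the output above.
group_summary = pd.DataFrame(
    {'n': [9938, 3230]},
    index=pd.Index([1, 0], name='label'),
)

# Fraction of all clustered job ads that falls into each group.
group_summary['share'] = group_summary['n'] / group_summary['n'].sum()
print(group_summary.round(3))
```

Roughly three quarters of the job ads fall into group 1.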


Display some members from each group:

In [13]:
q2_df\
 .query('label == @group_summary.index[0]')\
 .sample(5)

Unnamed: 0,Job Title,Position Duration,Position Location,Job Description,Job Responsibilities,Required Qualifications,Remuneration,Application Deadline,About Company,Clean Job Responsibilities,Duration,finger_print,Extra Clean Job Responsibilities,label
13790,Head of Financial Reporting Unit,"Long-term, with 3 months probation period","Yerevan, Armenia",Ucom LLC is seeking a successful candidate who...,- Develop and maintain timely and accurate fin...,- University degree in Finance or Accounting;\...,Competetive,30 May 2013,"""Ucom"" telecom company provides network and ot...",- develop maintain timely accurate financial s...,"Long-term, with 3 months probation period","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, ...",develop maintain timely accurate financial sta...,1
10447,Fashion Buyer,,"Yerevan, Armenia",SAS Group is seeking a Fashion Buyer to analyz...,"- Research current fashion trends, the industr...","- Master's degree in Retail, Buying, Marketing...",,06 November 2011,,"- research current fashion trend , industry , ...",this is a custom message,"[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, ...",research current fashion trend industry people...,1
7464,PR Specialist,Long term,"Yerevan, Armenia",,- Maintain and manage the image and reputation...,- At least 2 year experience in PR; \r\r\r\n- ...,,25 November 2009,,- maintain manage image reputation woman right...,Long term,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, ...",maintain manage image reputation woman right c...,1
18081,Program Manager for Data Initiative (PMDI),Long-term with 3 months of probation period.,"Yerevan, Armenia",CRRC-Armenia is seeking a Program Manager for ...,- Manage the CRRC Caucasus Barometer and other...,- Strong background in Social Sciences; MA in ...,,"03 August 2015, COB",The Caucasus Research Resource Center-Armenia ...,- manage crrc caucasus barometer primary data ...,Long-term with 3 months of probation period.,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, ...",manage crrc caucasus barometer primary data co...,1
2436,Online Marketing Administrator,Long term,"Yerevan, Armenia",,- Buy online media;\r\r\r\n- Organize creative...,"- College or university degree, preferably in ...",Attractive. Based on experience. Plus free lun...,10 June 2006,APG Enterprises is a Canadian IT company.\r\r\r\n,- buy online medium ; - organize creatives pay...,Long term,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ...",buy online medium organize creatives payment m...,1


In [14]:
q2_df\
 .query('label ==  @group_summary.index[1]')\
 .sample(5)

Unnamed: 0,Job Title,Position Duration,Position Location,Job Description,Job Responsibilities,Required Qualifications,Remuneration,Application Deadline,About Company,Clean Job Responsibilities,Duration,finger_print,Extra Clean Job Responsibilities,label
16079,Senior C/ C++ Developer,Long term,"Yerevan, Armenia",Zangi Livecom is looking for a Senior C/ C++ D...,- Responsible for development of different sol...,- At least 3 years of work experience in Devel...,Highly competitive and number of tempting\r\r\...,06 August 2014,Zangi Livecom is a new generation telecommunic...,- responsible development different solution m...,Long term,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ...",responsible development different solution mob...,0
18605,IT Assistant,Long-term,"Yerevan, Armenia","The IT Assistant will manage POS, Network and ...",- Responsible for the store support for infini...,Soft Skills:\r\r\r\n- Excellent analytical and...,Competitive depending on the previous experien...,05 November 2015,RGAM Retail Group Armenia is a member of the A...,- responsible store support infinity itx po ; ...,Long-term,"[0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, ...",responsible store support infinity itx po upda...,0
7468,Revenue Assurance Specialist,One year renewable with three month probation ...,"Yerevan, Armenia",The Revenue Assurance Specialist will be respo...,- Compare different data sources to ensure dat...,- BS or MA in computer and communication engin...,Competitive compensation including various\r\r...,25 November 2009,VivaCell-MTS is the leading mobile operator of...,- compare different data source ensure data co...,One year renewable with three month probation ...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, ...",compare different data source ensure data cons...,0
18719,Android Developer,Long-term,"Yerevan, Armenia",VOLO is looking for experienced result-oriente...,"- Design, develop, test, deploy, maintain and ...",- Excellent knowledge of Java and OOP concepts...,Competitive depending on the previous experien...,05 December 2015,VOLO LLC is an IT innovative solutions provide...,"- design , develop , test , deploy , maintain ...",Long-term,"[0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, ...",design develop test deploy maintain enhance so...,0
725,Database programmer,,"Yerevan, Armenia",,Database development:\r\r\r\n- Writing stored ...,- Candidate must have experience on working wi...,,25 November 2004,,database development : - write store procedure...,this is a custom message,"[1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, ...",database development write store procedure tri...,0


**Conclusion**:

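Judging from the samples above, job ads labeled 0 tend to be technical roles (software development, IT support, database work), while those labeled 1 lean toward business and communication roles (finance, marketing, PR, program management) that appear to rely more on soft skills.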

**Next steps**:

Visualize the semantic fingerprints with a dimensionality reduction technique (e.g. t-SNE or PCA) to check whether a different number of clusters would be qualitatively better.
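
This step has not been carried out here. As a rough sketch of what it might look like, the code below projects synthetic binary vectors to two dimensions with scikit-learn's `PCA` and `TSNE`. The toy data, `perplexity=10`, and `random_state=0` are assumptions; on the real data, `feature_matrix` would take the place of `toy_matrix`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix: 80 sparse binary rows,
# with two halves that are dense in different column blocks.
toy_matrix = (rng.random((80, 64)) < 0.05).astype(float)
toy_matrix[:40, :16] = rng.random((40, 16)) < 0.8
toy_matrix[40:, 16:32] = rng.random((40, 16)) < 0.8

# Linear projection onto the two directions of highest variance.
pca_coords = PCA(n_components=2).fit_transform(toy_matrix)

# Non-linear embedding; perplexity must be smaller than the number of rows.
tsne_coords = TSNE(n_components=2, perplexity=10, init='pca',
                   random_state=0).fit_transform(toy_matrix)

print(pca_coords.shape, tsne_coords.shape)
```

Either set of coordinates could then be drawn as a scatter plot colored by `db.labels_`. With 16384 columns, reducing the real matrix with PCA first and running t-SNE on the result may keep the run time manageable.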