# Data Science jobs

In this project, we apply **hierarchical clustering** to the [dataset](https://salaries.ai-jobs.net/download/) from the job-seeker website [ai-jobs](https://ai-jobs.net/). The purpose is to cluster job vacancies in the Data Science field by experience level.

## Libraries

In [3]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt, colors as mcolors
import seaborn as sns
import scipy.cluster.hierarchy as sch

%matplotlib notebook

## Reading and Cleaning data

In [4]:
# Read CSV
df = pd.read_csv("dataset.csv")

# Getting only the relevant columns
df = df[['job_title', 'experience_level']].sort_values('job_title')

# Count the vacancies at a given experience level
def count_exp_level(levels, level):
    counts = levels.value_counts()

    # Return 0 when the level never occurs for this job title
    return counts[level] if level in counts.index else 0

# Count the vacancies per job title at each experience level (EN, MI, SE)
df = df.groupby('job_title').agg(
    EN = ('experience_level', lambda x: count_exp_level(x, 'EN')),
    MI = ('experience_level', lambda x: count_exp_level(x, 'MI')),
    SE = ('experience_level', lambda x: count_exp_level(x, 'SE'))
)

# Select jobs with at least 10 vacancies
df = df.where(df.sum(axis = 1) >= 10).dropna()

df

job_title,EN,MI,SE
AI Scientist,7.0,3.0,1.0
Analytics Engineer,1.0,7.0,29.0
BI Data Analyst,5.0,6.0,0.0
Data Analyst,24.0,44.0,111.0
Data Architect,0.0,3.0,34.0
Data Engineer,18.0,81.0,199.0
Data Science Manager,0.0,4.0,21.0
Data Scientist,36.0,84.0,206.0
ETL Developer,0.0,2.0,8.0
ML Engineer,2.0,7.0,7.0
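
The counting and filtering steps above can also be sketched with `pd.crosstab`. This is only an illustration under stated assumptions: the `records` frame is a hypothetical stand-in for the real CSV, `EX` is a hypothetical extra level that gets dropped, and the threshold is lowered to 2 so that the toy data shows the filter at work.

```python
import pandas as pd

# Hypothetical toy records standing in for the real CSV
records = pd.DataFrame({
    'job_title': ['Data Analyst', 'Data Analyst', 'Data Analyst',
                  'ML Engineer', 'ML Engineer'],
    'experience_level': ['EN', 'SE', 'SE', 'MI', 'EX'],
})

# Cross-tabulate job titles against experience levels
counts = pd.crosstab(records['job_title'], records['experience_level'])

# Keep only the three levels of interest; levels that never occur become 0
counts = counts.reindex(columns=['EN', 'MI', 'SE'], fill_value=0)

# Keep job titles with enough vacancies (2 here, 10 in the notebook)
min_vacancies = 2
counts = counts[counts.sum(axis=1) >= min_vacancies]
print(counts)
```

Because no `NaN` values are introduced along the way, the counts stay integers in this variant.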


## Proximity Matrix

The **proximity matrix** is essentially a table of the distances between every pair of observations. In this context, distance is a measure of **dissimilarity**. The main properties of the proximity matrix are:

- Its diagonal elements are all zero, since every observation is at distance zero from itself;
- It is symmetric, that is, equal to its own transpose.
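
Both properties can be checked numerically. The sketch below takes three rows from the table above and, as one possible choice of dissimilarity, uses the sum of absolute differences between the counts; the variable names are illustrative only.

```python
import numpy as np

# Three observations from the table above: (EN, MI, SE) counts
X = np.array([
    [7, 3, 1],     # AI Scientist
    [1, 7, 29],    # Analytics Engineer
    [5, 6, 0],     # BI Data Analyst
])

# Pairwise dissimilarities via broadcasting:
# D[i, j] = sum over k of |X[i, k] - X[j, k]|
D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)

# Check the two properties of a proximity matrix
zero_diagonal = bool(np.all(np.diag(D) == 0))
symmetric = bool(np.array_equal(D, D.T))

print(D)
print(zero_diagonal, symmetric)
```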

Choosing a proper distance measure is the most fundamental task before building a proximity matrix. There are many possible distances, each with its own advantages and disadvantages (see [this article](https://towardsdatascience.com/9-distance-measures-in-data-science-918109d069fa) for an overview of some of the most relevant ones). The following features of the data must be taken into account when choosing a distance measure:

- All variables are discrete (non-negative integer counts);
- We are mainly concerned with the magnitude differences between observations;
- The variable space has low dimensionality (three variables).

Let $x = (x_1, x_2, x_3)$ and $y = (y_1, y_2, y_3)$ be two arbitrary observations, where $x_i$ and $y_i$ are their values for the $i$-th variable. Given the features described above, we chose to analyze the following distance measures:

1. **Euclidean distance**
$$ d(x, y) = \sqrt{\sum_{i = 1}^{3} (x_i - y_i)^{2}}; $$

2. **Manhattan distance**
$$ d(x, y) = \sum_{i = 1}^{3} |x_i - y_i|; $$

3. **Chebyshev distance**
$$ d(x, y) = \max_i |x_i - y_i|. $$
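
As a worked example, the three distances can be evaluated by hand for a single pair of rows from the table above, AI Scientist $(7, 3, 1)$ and Analytics Engineer $(1, 7, 29)$. The snippet is a minimal sketch with illustrative variable names.

```python
import numpy as np

x = np.array([7, 3, 1])     # AI Scientist: (EN, MI, SE)
y = np.array([1, 7, 29])    # Analytics Engineer: (EN, MI, SE)

diff = np.abs(x - y)        # absolute differences per variable: [6, 4, 28]

euclidean = np.sqrt((diff ** 2).sum())   # sqrt(36 + 16 + 784) = sqrt(836), about 28.91
manhattan = diff.sum()                   # 6 + 4 + 28 = 38
chebyshev = diff.max()                   # max(6, 4, 28) = 28

print(euclidean, manhattan, chebyshev)
```

The Manhattan value of 38 agrees with the corresponding entry of the Manhattan proximity matrix computed later in the notebook.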

### Proximity matrix function

In [9]:
def proximity_matrix(df, affinity = 'euclidian'):
    # Empty matrix, labeled by the observations on both axes
    matrix = pd.DataFrame(columns = df.index, index = df.index)
    
    #----- Distance Measure -----#
    
    if affinity == 'euclidian':
        distance = lambda x, y: np.sqrt(((x - y)**2).sum())
        
    elif affinity == 'manhattan':
        distance = lambda x, y: np.abs(x - y).sum()
        
    elif affinity == 'chebyshev':
        distance = lambda x, y: np.abs(x - y).max()
        
    elif affinity == 'canberra':
        distance = lambda x, y: (np.abs(x - y) / (np.abs(x) + np.abs(y))).sum()

    else:
        raise ValueError(f"Unknown affinity: {affinity!r}")

    #--- Fill proximity matrix --#
    
    for i in range(len(df.index)):
        # The distance from an observation to itself is zero
        matrix.iloc[i, i] = 0

        # Compute the lower triangle and mirror it, since the matrix is symmetric
        for j in range(i):
            matrix.iloc[i, j] = distance(df.iloc[i], df.iloc[j])
            matrix.iloc[j, i] = matrix.iloc[i, j]
            
    return matrix.astype('float64')
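
For larger tables, the double loop can become slow. A possible alternative, sketched below, delegates the pairwise computation to `scipy.spatial.distance.pdist` and `squareform`. The helper name `proximity_matrix_scipy` and the small `counts` frame are hypothetical; note also that SciPy calls the Manhattan distance `'cityblock'`.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

def proximity_matrix_scipy(df, metric='cityblock'):
    """Hypothetical helper: pairwise distances between the rows of df."""
    # pdist returns the condensed (upper-triangle) distances as a 1-D array
    condensed = pdist(df.to_numpy(dtype=float), metric=metric)
    # squareform expands them into a symmetric matrix with a zero diagonal
    return pd.DataFrame(squareform(condensed), index=df.index, columns=df.index)

# Three rows from the table above
counts = pd.DataFrame(
    {'EN': [7, 1, 5], 'MI': [3, 7, 6], 'SE': [1, 29, 0]},
    index=pd.Index(['AI Scientist', 'Analytics Engineer', 'BI Data Analyst'],
                   name='job_title'),
)

manhattan = proximity_matrix_scipy(counts, metric='cityblock')
chebyshev = proximity_matrix_scipy(counts, metric='chebyshev')
print(manhattan)
```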

### Plotting the proximity matrix for the three distance measures

In [10]:
#---- Proximity matrices ----#

distances_names = ['euclidian', 'manhattan', 'chebyshev']
proximity_matrices = np.array([proximity_matrix(df, dist) for dist in distances_names])
max_distance = np.max(proximity_matrices)

#----- Figure and Axes ------#

fig = plt.figure(figsize = (13,4), dpi = 80)
gs = fig.add_gridspec(nrows = 1, ncols = 22)

axs = [
    fig.add_subplot(gs[0, 0:7]),    # Euclidian distance
    fig.add_subplot(gs[0, 7:14]),   # Manhattan distance
    fig.add_subplot(gs[0, 14:21]),  # Chebyshev distance
    fig.add_subplot(gs[0, 21]),     # Colorbar
]

#---------- Style -----------#

# Note: Matplotlib 3.6+ renamed this style to 'seaborn-v0_8-ticks'
plt.style.use('seaborn-ticks')
plt.rcParams.update({
    'font.size': 14
})

#-- Plot proximity matrices --#

for i in range(3):
    # Heatmap
    sns.heatmap(
        proximity_matrices[i], 
        ax = axs[i],
        xticklabels = False,
        yticklabels = False,
        cmap = sns.color_palette("coolwarm", as_cmap=True),
        cbar_ax = axs[-1],
        vmin = 0,
        vmax = max_distance
    )

    axs[i].set_title(distances_names[i].capitalize() + ' Distances', pad = 10)
    axs[i].set_xlabel('')
    axs[i].set_ylabel('')
    
plt.tight_layout()

[Figure: heatmaps of the Euclidean, Manhattan, and Chebyshev proximity matrices, sharing a single color bar]

As the heatmaps above show, the Euclidean, Manhattan, and Chebyshev distances behave similarly. Indeed, all three are special or limiting cases of the same family (the Minkowski distance) and, for one-dimensional data, they coincide. For multi-dimensional data, the Euclidean and Chebyshev distances are less sensitive than the Manhattan distance to differences spread across more than one variable. Also, since all variables are integer counts, the Manhattan distance suits the data well, because it always yields integer distances. In conclusion, we use the **Manhattan distance** from now on.
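
For reference, the relationship to the Minkowski distance can be written out explicitly (the symbol $d_p$ is introduced here only for illustration). For an order $p \geq 1$,

$$ d_p(x, y) = \left( \sum_{i = 1}^{3} |x_i - y_i|^{p} \right)^{1/p}, $$

so that $p = 1$ gives the Manhattan distance, $p = 2$ gives the Euclidean distance, and the limit $p \to \infty$ gives the Chebyshev distance. With a single variable, every $d_p$ reduces to $|x_1 - y_1|$, and in general

$$ d_\infty(x, y) \leq d_2(x, y) \leq d_1(x, y). $$

For AI Scientist $(7, 3, 1)$ and Analytics Engineer $(1, 7, 29)$, this reads $28 \leq \sqrt{836} \approx 28.91 \leq 38$.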

In [12]:
proximity_matrix(df, affinity = 'manhattan').astype('int64')

job_title,AI Scientist,Analytics Engineer,BI Data Analyst,Data Analyst,Data Architect,Data Engineer,Data Science Manager,Data Scientist,ETL Developer,ML Engineer,Machine Learning Engineer,Machine Learning Scientist,Research Scientist
AI Scientist,0,38,6,168,40,287,28,315,15,15,71,14,14
Analytics Engineer,38,0,34,142,10,261,12,289,27,23,45,24,30
BI Data Analyst,6,34,0,168,42,287,28,315,17,11,71,12,10
Data Analyst,168,142,168,0,142,131,154,147,169,163,97,166,160
Data Architect,40,10,42,142,0,261,14,289,27,33,45,30,40
Data Engineer,287,261,287,131,261,0,273,28,288,282,216,285,279
Data Science Manager,28,12,28,154,14,273,0,301,15,19,57,16,26
Data Scientist,315,289,315,147,289,28,301,0,316,310,244,313,307
ETL Developer,15,27,17,169,27,288,15,316,0,8,72,5,15
ML Engineer,15,23,11,163,33,282,19,310,8,0,66,3,7


## Agglomerative Hierarchical Clustering

## K-Means

## Dendrogram (SciPy Cluster Library)

In [144]:
# Subplots
fig, axs = plt.subplots(2, 2, figsize = (12, 7), dpi = 80)
# Note: Matplotlib 3.6+ renamed this style to 'seaborn-v0_8-white'
plt.style.use('seaborn-white')

# Remove spines
for row_ax in axs:
    for ax in row_ax:
        ax.spines[:].set_visible(False)
        ax.set_yticks([])
        ax.set_xticks([])

# Note: sch.linkage computes Euclidean distances by default (metric = 'euclidean')

# Single linkage
axs[0,0].set_title('Single linkage', loc = 'center', fontsize = 14)
dendrogram = sch.dendrogram(
    sch.linkage(df, method = 'single', optimal_ordering = True), 
    orientation = 'right',
    labels = df.index,
    ax = axs[0,0]
)

# Complete linkage
axs[0,1].set_title('Complete linkage', loc = 'center', fontsize = 14)
dendrogram = sch.dendrogram(
    sch.linkage(df, method = 'complete', optimal_ordering = True), 
    orientation = 'left',
    labels = df.index,
    ax = axs[0,1]
)

# Average linkage
axs[1,0].set_title('Average linkage', loc = 'center', fontsize = 14)
dendrogram = sch.dendrogram(
    sch.linkage(df, method = 'average', optimal_ordering = True), 
    orientation = 'right',
    labels = df.index,
    ax = axs[1,0]
)

# Ward linkage
axs[1,1].set_title('Ward linkage', loc = 'center', fontsize = 14)
dendrogram = sch.dendrogram(
    sch.linkage(df, method = 'ward', optimal_ordering = True), 
    orientation = 'left',
    labels = df.index,
    ax = axs[1,1]
)

# Set layout
plt.tight_layout()
plt.savefig('dendrograms.png', dpi = 100)
plt.show()

[Figure: dendrograms for single, complete, average, and Ward linkage]
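
Since the Manhattan distance was chosen earlier, one way to carry that choice into SciPy is sketched below. It assumes average linkage (Ward linkage in SciPy is only well defined for Euclidean distances), uses five rows from the table above, and cuts the tree into two flat clusters, a number picked purely for illustration.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist, squareform

# Five rows from the table above: (EN, MI, SE) counts
labels = ['AI Scientist', 'BI Data Analyst', 'Data Analyst',
          'Data Engineer', 'Data Scientist']
X = np.array([
    [7, 3, 1],
    [5, 6, 0],
    [24, 44, 111],
    [18, 81, 199],
    [36, 84, 206],
], dtype=float)

# Option 1: let linkage compute Manhattan ('cityblock') distances itself
Z_metric = sch.linkage(X, method='average', metric='cityblock')

# Option 2: start from a square proximity matrix and pass its condensed form
D = squareform(pdist(X, metric='cityblock'))
Z_precomputed = sch.linkage(squareform(D), method='average')

# Cut the tree into two flat clusters (2 is an arbitrary, illustrative choice)
clusters = sch.fcluster(Z_metric, t=2, criterion='maxclust')
for name, cluster in zip(labels, clusters):
    print(name, cluster)
```

On this subset, the two low-volume titles end up in one cluster and the three high-volume titles in the other.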