# Data Science field analysis

In this project, we apply agglomerative hierarchical clustering to the [dataset](https://salaries.ai-jobs.net/download/) from the job-search website [ai-jobs](https://ai-jobs.net/). The goal is to cluster Data Science job titles, based on vacancies posted throughout 2022, by two features: the number of entry-level vacancies and the average salary.
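As a rough illustration of the method, the sketch below implements a naive agglomerative clustering in NumPy. The project does not state its linkage rule or distance metric at this point, so average linkage and Euclidean distance are assumptions. The function name `agglomerative` and the toy points are hypothetical.

```python
import numpy as np

def agglomerative(points, n_clusters):
    """Naive average-linkage agglomerative clustering (illustrative sketch).

    Returns one integer cluster label per row of `points`.
    """
    points = np.asarray(points, dtype=float)
    if not 1 <= n_clusters <= len(points):
        raise ValueError("n_clusters must be between 1 and the number of points")

    # Start with every observation in its own cluster
    clusters = [[i] for i in range(len(points))]

    # Pairwise Euclidean distances between all observations
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster pairs
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair of clusters
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = np.empty(len(points), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

# Hypothetical, already-scaled (entry-level count, average salary) pairs
toy = np.array([
    [0.0, 0.1], [0.1, 0.0], [0.2, 0.1],  # group near the origin
    [5.0, 5.1], [5.1, 4.9],              # group far from the origin
])
labels = agglomerative(toy, n_clusters=2)
print(labels)
```

On the toy data, the three points near the origin end up in one cluster and the two distant points in the other. The nested loop re-scans every pair of clusters at each merge, which is fine for a handful of job titles but would be slow on large datasets.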

## Libraries

In [1]:
import numpy as np
import pandas as pd

## Reading and Cleaning data

In [97]:
#------------ Read and Clean Dataset ------------ #

df = pd.read_csv("dataset.csv")
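# Salaries are kept as annual USD amounts ('salary_in_usd'). The local-currency
# 'salary' column is dropped below, so no monthly conversion is applied.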
df = df[['job_title', 'experience_level', 'salary_in_usd']].rename(columns = {
    'job_title': 'Job',
    'salary_in_usd': 'Salary'
})

# Count the entry-level ('EN') vacancies in a group of experience levels
def count_EN(levels):
    return int((levels == 'EN').sum())

# Aggregate over entry-level vacancies and average salary
df = (
    df.groupby('Job')
    .agg({
        'experience_level': count_EN,
        'Salary': 'mean'
    })
    .sort_values('experience_level', ascending = False)
    .rename(columns = {'experience_level': 'Entry-Level'})
)

# Keep only job titles with at least one entry-level vacancy
df = df[df['Entry-Level'] > 0].astype({'Entry-Level': 'int64', 'Salary': 'float64'})

df

| Job | Entry-Level | Salary |
|---|---|---|
| Data Scientist | 36 | 126016.404908 |
| Data Analyst | 24 | 95664.071823 |
| Data Engineer | 18 | 128168.777778 |
| Machine Learning Engineer | 14 | 128747.597561 |
| AI Scientist | 7 | 78971.583333 |
| Data Science Consultant | 5 | 82993.888889 |
| BI Data Analyst | 5 | 55826.75 |
| Research Scientist | 4 | 108902.2 |
| Machine Learning Developer | 3 | 89990.8 |
| Computer Vision Software Engineer | 3 | 83704.75 |
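
In the table above, Entry-Level ranges from 3 to 36 while Salary ranges from roughly 56,000 to 129,000, so a Euclidean distance computed on the raw columns would be driven almost entirely by salary. One common remedy is to z-score each column before clustering. Whether this project rescales the data is not stated here, so the step below is an assumption. The names `jobs` and `scaled` are hypothetical.

```python
import pandas as pd

# Values copied from the aggregated table above
jobs = pd.DataFrame(
    {
        "Entry-Level": [36, 24, 18, 14, 7, 5, 5, 4, 3, 3],
        "Salary": [
            126016.404908, 95664.071823, 128168.777778, 128747.597561,
            78971.583333, 82993.888889, 55826.75, 108902.2, 89990.8, 83704.75,
        ],
    },
    index=pd.Index(
        [
            "Data Scientist", "Data Analyst", "Data Engineer",
            "Machine Learning Engineer", "AI Scientist",
            "Data Science Consultant", "BI Data Analyst", "Research Scientist",
            "Machine Learning Developer", "Computer Vision Software Engineer",
        ],
        name="Job",
    ),
)

# z-score each column: subtract its mean, divide by its sample standard deviation
scaled = (jobs - jobs.mean()) / jobs.std()
print(scaled.round(2))
```

After this step both columns have mean 0 and unit standard deviation, so each feature contributes comparably to pairwise distances.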
