## 📖 Background
You work for an international HR consultancy helping companies attract and retain top talent in the competitive tech industry. As part of your services, you provide clients with insights into industry salary trends to ensure they remain competitive in hiring and compensation practices.

Your team wants to use a data-driven approach to analyse how various factors—such as job role, experience level, remote work, and company size—impact salaries globally. By understanding these trends, you can advise clients on offering competitive packages to attract the best talent.

In this competition, you’ll explore and visualise salary data from thousands of employees worldwide. If you're tackling the advanced level, you'll go a step further: building predictive models to uncover key salary drivers and providing insights on how to enhance future data collection.

# Executive Report

This notebook builds on the work done for Level 1.

### General Observations

- There are **57,194 records** in total.
- The dataset contains **no null values**.
- **52% of the records are duplicates**. We treat them as data points for different employees rather than as errors.
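
Figures like these can be checked with a few pandas calls. The sketch below uses a small made-up frame (`toy`) instead of `salaries.csv`, and it assumes that "duplicated" means a row identical to at least one other row (`keep=False`); with the default `keep='first'` the share would be lower.

```python
import pandas as pd

# Toy frame standing in for salaries.csv (values are made up for illustration)
toy = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Scientist", "Data Engineer", "Data Analyst"],
    "salary_in_usd": [120000, 120000, 110000, 90000],
})

n_records = len(toy)                           # total number of rows
n_nulls = int(toy.isna().sum().sum())          # missing values across all columns
dup_share = toy.duplicated(keep=False).mean()  # share of rows with an identical twin

print(f"{n_records} records, {n_nulls} nulls, {dup_share:.0%} duplicated")
```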

### Comments and Data Cleanup

| Column               | Description | Comments | Actions |
|----------------------|------------|----------|---------|
| **work_year**        | The year the salary was paid. | 2024 for the most part, fewer data points as we go back in time. <br> First observations from 2020 (**range 2020 to 2024**) | No action needed. |
| **experience_level** | Employee experience level:<br>**EN**: Entry-level / Junior<br>**MI**: Mid-level / Intermediate<br>**SE**: Senior / Expert<br>**EX**: Executive / Director | Highly imbalanced. <br> SE the most frequent value. | Convert to an ordered category. |
| **employment_type**  | Employment type:<br>**PT**: Part-time<br>**FT**: Full-time<br>**CT**: Contract<br>**FL**: Freelance | Highly imbalanced. <br> FT the most frequent. | Convert to a category. |
| **job_title**        | The job title during the year. | Too many unique values (253). | Convert to a category. <br>Possibly group titles or extract fields such as manager, engineer, BI, ML, AI. |
| **salary**          | Gross salary paid (in local currency). | Useless due to different currencies. | Drop this column. |
| **salary_currency** | Salary currency (ISO 4217 code). | Highly imbalanced.<br> USD the most frequent value.  | Convert to a category. |
| **salary_in_usd**   | Salary converted to USD using average yearly FX rate. | The outliers seem legit. | No action needed. |
| **employee_residence** | Employee's primary country of residence (ISO 3166 code). | Highly imbalanced.<br> US the most frequent value.  | Convert to a category. |
| **remote_ratio**     | Percentage of remote work:<br>**0**: No remote work (<20%)<br>**50**: Hybrid (50%)<br>**100**: Fully remote (>80%) | 0 (no remote work) for the most part. <br> Highly imbalanced. | Convert to a category. |
| **company_location** | Employer's main office location (ISO 3166 code). | Highly imbalanced.<br> US the most frequent value. | Convert to a category. |
| **company_size**     | Company size:<br>**S**: Small (<50 employees)<br>**M**: Medium (50–250 employees)<br>**L**: Large (>250 employees) | Highly imbalanced. <br> M the most frequent value | Convert to an ordered category. |
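
The actions listed in the table could look roughly like this in pandas. This is a minimal sketch on made-up rows; the category orders (junior to executive, small to large) are assumptions taken from the code descriptions above.

```python
import pandas as pd

# Toy rows using the column names and codes from the table (values are made up)
toy = pd.DataFrame({
    "experience_level": ["SE", "EN", "MI", "EX"],
    "company_size": ["M", "S", "L", "M"],
    "employment_type": ["FT", "FT", "CT", "PT"],
    "salary": [100000, 40000, 70000, 200000],
    "salary_in_usd": [100000, 45000, 80000, 200000],
})

# Ordered categories: the orders below are assumptions
exp_dtype = pd.CategoricalDtype(["EN", "MI", "SE", "EX"], ordered=True)
size_dtype = pd.CategoricalDtype(["S", "M", "L"], ordered=True)
toy["experience_level"] = toy["experience_level"].astype(exp_dtype)
toy["company_size"] = toy["company_size"].astype(size_dtype)

# Unordered category for a nominal column
toy["employment_type"] = toy["employment_type"].astype("category")

# Drop the local-currency salary, which is not comparable across currencies
toy = toy.drop(columns="salary")

print(toy.dtypes)
```

With ordered categories, comparisons such as `toy["experience_level"] > "MI"` and sorting follow the declared order rather than alphabetical order.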

### Recommendations

The dataset contains valuable information for proposing competitive salaries to candidates. The data is recent and covers many scenarios, including different company sizes, contract types, company and employee locations, and job titles.

We recommend digging deeper into this dataset, preparing it, and building a model to predict salaries.

# Loading Data

In [None]:
import pandas as pd
import warnings
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

sns.set_style("darkgrid")
warnings.simplefilter(action='ignore', category=FutureWarning)
df = pd.read_csv('salaries.csv')

# Level 1

In [None]:
# Plot the distribution of the categorical features with few unique values
fig, axes = plt.subplots(1, 5, figsize=(15, 5), sharey=True)

cols = ['work_year', 'experience_level', 'employment_type', 'company_size', 'remote_ratio']

for ax, col in zip(axes, cols):
    sns.countplot(x=col, data=df, ax=ax, width=0.6)
    ax.set_title(f'Distribution of {col}')
    ax.set_xlabel('') 
    ax.tick_params(axis='x', rotation=45)  # Rotate labels in case they are long

plt.tight_layout()
plt.show()


In [None]:
# Plot the distribution of the categorical features with many unique values
fig, axes = plt.subplots(1, 4, figsize=(20, 5), sharey=True)  # Share the Y axis

cols = ['job_title', 'salary_currency', 'employee_residence', 'company_location']

for ax, col in zip(axes, cols):
    top_categories = df[col].value_counts().nlargest(4).index  # 4 most common values
    data = df[col].apply(lambda x: x if x in top_categories else 'Other')  # Replace the rest with "Other"

    palette = {cat: "C0" for cat in top_categories} 
    palette['Other'] = "green"  
    
    sns.countplot(x=data, ax=ax, width=0.6, order=top_categories.tolist() + ['Other'], palette=palette)
    ax.set_title(f'Distribution of {col}')
    ax.set_xlabel('')
    ax.tick_params(axis='x', rotation=45)
    del(data)
    
plt.tight_layout()
plt.show()

- job_title is the only categorical feature whose data points are spread across several categories instead of being concentrated in a single dominant value.
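
Because job_title has so many unique values, one option is to derive a handful of boolean keyword flags from it. The sketch below is only an illustration: the example titles and the `keywords` patterns are hypothetical choices, not something derived from the dataset.

```python
import pandas as pd

# Example titles; the keyword patterns below are hypothetical
titles = pd.Series([
    "Machine Learning Engineer",
    "Data Science Manager",
    "BI Analyst",
    "AI Engineer",
    "Data Engineer",
])

keywords = {
    "is_manager": r"\bmanager\b",
    "is_engineer": r"\bengineer\b",
    "is_bi": r"\bbi\b|business intelligence",
    "is_ml_ai": r"\bml\b|\bai\b|machine learning",
}

# One boolean column per keyword group, matched case-insensitively
flags = pd.DataFrame({
    name: titles.str.contains(pattern, case=False, regex=True)
    for name, pattern in keywords.items()
})
flags.insert(0, "job_title", titles)
print(flags)
```

The word boundaries (`\b`) keep short tokens such as "AI" or "BI" from matching inside longer words.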

In [None]:
# Convert remote_ratio into a categorical feature with labels easier to interpret
df['remote_ratio'] = np.where(df['remote_ratio']==0, 'presential', np.where(df['remote_ratio']==50, 'partial_remote','full_remote'))
df['remote_ratio'] = df['remote_ratio'].astype('category')

In [None]:
# How is the salary feature distributed?
sns.displot(df.salary_in_usd, bins=50)
plt.title('Salary in USD')
plt.xticks(rotation=45)
plt.show()

# Level 2

Let's start by visualizing how other features impact salary.

In [None]:
# Salary distribution for the main job_titles
plt.figure(figsize=(12, 8))

top_20_titles = df['job_title'].value_counts().nlargest(20).index
df_top = df[df['job_title'].isin(top_20_titles)]
order = df_top.groupby('job_title')['salary_in_usd'].median().sort_values(ascending=False).index

sns.boxplot(data=df_top, y='job_title', x='salary_in_usd', order=order)
plt.title('Salary_in_usd distribution for the main job_titles')
plt.show()

- The distribution of salaries does differ by job title.
- The salary ranges for the different job titles are quite wide.
- Nearly every job_title has outliers at the upper end, indicating that some individuals earn far more than their peers.
- The job_title with the highest median is 'Machine Learning Engineer'.

In [None]:
# Trends in Salaries Over Time
plt.figure(figsize=(8, 6))  

sns.regplot(data=df, x='work_year', y='salary_in_usd')
plt.xticks(sorted(df['work_year'].unique()))

plt.title('Trends in Salaries Over Time: Annual Salary_in_usd Distribution (2020-2024)')
plt.show()

In [None]:
# Distribution of salaries based on categorical features
# plt.subplots creates the figure itself, so no separate plt.figure() call is needed
fig, axes = plt.subplots(1, 4, figsize=(20, 5))

cols = ['experience_level', 'employment_type', 'remote_ratio', 'company_size']

for ax, col in zip(axes, cols):
    order = df.groupby(col)['salary_in_usd'].median().sort_values(ascending=False).index
    sns.boxplot(data=df, x=col, y='salary_in_usd', order=order, ax=ax)
    
    ax.set_title(f'Boxplot of {col} - salary USD')
    ax.set_xlabel('')
    ax.tick_params(axis='x', rotation=45)
    
plt.tight_layout()
plt.show()

- salary_in_usd varies with experience_level, employment_type, remote_ratio and company_size.
- Higher experience levels are generally better paid.
- Full-time employees are the best paid.
- Partially remote workers are the worst paid.
- Employees of small companies are the worst paid.
- The number of outliers is still high.
In [None]:
# Distribution of salaries based on employee_residence
plt.figure(figsize=(12, 8))

top_20_res = df['employee_residence'].value_counts().nlargest(20).index
df_top = df[df['employee_residence'].isin(top_20_res)]
order = df_top.groupby('employee_residence')['salary_in_usd'].median().dropna().sort_values(ascending=False).index

sns.boxplot(data=df_top, y='employee_residence', x='salary_in_usd', order=order)
plt.title('Salary distributions by top 20 employee residence sorted by median salary_in_usd')
plt.show()

- US employees are the best paid.
- The number of outliers in the US is large.
- There is a relationship between employee residence and salary.

In [None]:
# Distribution of salaries based on company_location
plt.figure(figsize=(12, 8))

top_20_res = df['company_location'].value_counts().nlargest(20).index
df_top = df[df['company_location'].isin(top_20_res)]
order = df_top.groupby('company_location')['salary_in_usd'].median().dropna().sort_values(ascending=False).index

sns.boxplot(data=df_top, y='company_location', x='salary_in_usd', order=order)
plt.title('Salary distributions by top 20 company location sorted by median salary_in_usd')
plt.show()

- Employees of US-based companies are the best paid.
- The number of outliers in US-based companies is large.
- There is a relationship between company location and salary_in_usd.

## Questions
### Question 1: Create a bar chart displaying the top 5 job titles with the highest average salary (in USD).

In [None]:
biggest_salaries = df.groupby('job_title')['salary_in_usd'].mean().nlargest(5).round().sort_values()
biggest_salaries.plot(kind='barh')
plt.xlabel("Average Salary in USD")
plt.ylabel("Job Title")
plt.title("Top 5 Job Titles with Highest Average Salaries")
plt.xticks(rotation=45)
plt.show()

### Question 2: Compare the average salaries for employees working remotely 100%, 50%, and 0%. What patterns or trends do you observe?

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10, 6), sharey=True)

# Bar plot
remote_salaries = df.groupby('remote_ratio')['salary_in_usd'].mean().round()
remote_salaries.plot(kind='bar', ax=axes[0])
axes[0].set_ylabel("Average Salary in USD")
axes[0].set_xlabel("Remote Ratio")
axes[0].set_title("Remote Working Policies - Average Salaries in usd")
axes[0].tick_params(axis='x', rotation=45)

# Box plot
sns.boxplot(data=df, x='remote_ratio', y='salary_in_usd', ax=axes[1])
axes[1].set_xlabel("Remote Ratio")
axes[1].set_ylabel("Salary in USD")
axes[1].set_title("Remote Working Policies - Salary_in_usd Distribution")
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

- The mean and median salary_in_usd of the partial_remote category are lower than those of the other two.
- The distributions of salary_in_usd for full_remote and presential look similar in the box plots. Let's test whether the difference between these two distributions is statistically significant.

In [None]:
# The many outliers suggest the distributions are not normal; check with a Shapiro-Wilk test
# Note: SciPy warns that the p-value may be inaccurate for samples larger than 5000
for ratio in df['remote_ratio'].unique():
    statistic, pvalue = stats.shapiro(df[df['remote_ratio'] == ratio]['salary_in_usd'])
    print(f"Remote Ratio {ratio}: stat:{round(statistic, 5)}, p-val:{pvalue}")

Since the salary distributions are not normal (the p-values are too small to support the null hypothesis of normality), ANOVA's normality assumption does not hold. Let's use the Kruskal-Wallis test instead. Since the distribution of the partial_remote category is clearly different, we do not include it in the test.

In [None]:
# Check whether presential and full_remote follow similar distributions
kruskal_result = stats.kruskal(df[df['remote_ratio'] == 'presential']['salary_in_usd'], 
                         df[df['remote_ratio'] == 'full_remote']['salary_in_usd'])

print("Kruskal-Wallis result:", kruskal_result)

The p-value of the Kruskal-Wallis test leads us to reject the null hypothesis that the presential and full_remote salary distributions are the same. We can therefore say that presential and fully remote employees follow different salary distributions.

Partially remote employees are paid less than the other two groups. Presential (on-site) employees are the best paid of all three.
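
With tens of thousands of records, even a small shift between two groups can produce a very small p-value, so it may help to report an effect size next to the rank-based test. The sketch below estimates the probability that a randomly drawn value from one group exceeds a randomly drawn value from the other, computed from rank sums (the Mann-Whitney U statistic). The function name `prob_superiority` and the numbers are made up for illustration.

```python
import numpy as np
from scipy import stats

def prob_superiority(x, y):
    """Estimate P(X > Y) + 0.5 * P(X == Y) from the Mann-Whitney U statistic."""
    x, y = np.asarray(x), np.asarray(y)
    ranks = stats.rankdata(np.concatenate([x, y]))  # ties receive average ranks
    u_x = ranks[: len(x)].sum() - len(x) * (len(x) + 1) / 2
    return u_x / (len(x) * len(y))

# Made-up salaries, for illustration only
group_a = [150, 160, 170, 180]
group_b = [100, 110, 120, 165]

print(prob_superiority(group_a, group_b))  # 0.875
```

A value near 0.5 would mean neither group tends to be paid more, even if the test rejects the null hypothesis.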

### Question 3: Visualise the salary distribution (in USD) across company sizes (S, M, L). Which company size offers the highest average salary?

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10, 6), sharey=True)

# Bar plot
# Average salary per company size
size_salaries = df.groupby('company_size')['salary_in_usd'].mean().round()
size_salaries.plot(kind='bar', ax=axes[0])
axes[0].set_ylabel("Average Salary in USD")
axes[0].set_xlabel("company_size")
axes[0].set_title("company_size - Average Salaries")
axes[0].tick_params(axis='x', rotation=45)

# Box plot
sns.boxplot(data=df, x='company_size', y='salary_in_usd', ax=axes[1])
axes[1].set_xlabel("company_size")
axes[1].set_ylabel("Salary in USD")
axes[1].set_title("company_size - Salary Distribution")
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

Both the median and the mean salary are highest for **medium-sized companies**. Small companies offer the lowest salaries.
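
A small summary table can put numbers next to the plots. The sketch below computes mean, median and group size per company size on made-up data; on the real dataset the same `groupby(...).agg(...)` call would be applied to `df`.

```python
import pandas as pd

# Made-up salaries, for illustration only
toy = pd.DataFrame({
    "company_size": ["S", "S", "M", "M", "M", "L", "L"],
    "salary_in_usd": [60000, 80000, 150000, 160000, 200000, 120000, 180000],
})

summary = (
    toy.groupby("company_size")["salary_in_usd"]
    .agg(["mean", "median", "count"])
    .reindex(["S", "M", "L"])  # display order: small, medium, large
)
print(summary.round())
```

Including `count` makes it easier to judge how much weight each group's average deserves when the categories are highly imbalanced.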