## 📖 Background
You work for an international HR consultancy helping companies attract and retain top talent in the competitive tech industry. As part of your services, you provide clients with insights into industry salary trends to ensure they remain competitive in hiring and compensation practices.

Your team wants to use a data-driven approach to analyse how various factors—such as job role, experience level, remote work, and company size—impact salaries globally. By understanding these trends, you can advise clients on offering competitive packages to attract the best talent.

In this competition, you’ll explore and visualise salary data from thousands of employees worldwide. If you're tackling the advanced level, you'll go a step further, building predictive models to uncover key salary drivers and providing insights on how to enhance future data collection.

## 💾 The data

The data comes from a survey hosted by an HR consultancy, available in `'salaries.csv'`.

#### Each row represents a single employee's salary record for a given year:
- **`work_year`** - The year the salary was paid.  
- **`experience_level`** - Employee experience level:  
  - **`EN`**: Entry-level / Junior  
  - **`MI`**: Mid-level / Intermediate  
  - **`SE`**: Senior / Expert  
  - **`EX`**: Executive / Director  
- **`employment_type`** - Employment type:  
  - **`PT`**: Part-time  
  - **`FT`**: Full-time  
  - **`CT`**: Contract  
  - **`FL`**: Freelance  
- **`job_title`** - The job title during the year.  
- **`salary`** - Gross salary paid (in local currency).  
- **`salary_currency`** - Salary currency (ISO 4217 code).  
- **`salary_in_usd`** - Salary converted to USD using average yearly FX rate.  
- **`employee_residence`** - Employee's primary country of residence (ISO 3166 code).  
- **`remote_ratio`** - Percentage of remote work:  
  - **`0`**: No remote work (<20%)  
  - **`50`**: Hybrid (50%)  
  - **`100`**: Fully remote (>80%)  
- **`company_location`** - Employer's main office location (ISO 3166 code).  
- **`company_size`** - Company size:  
  - **`S`**: Small (<50 employees)  
  - **`M`**: Medium (50–250 employees)  
  - **`L`**: Large (>250 employees)  

## 💪 Competition challenge 1

In this first level, you’ll explore and summarise the dataset to understand its structure and key statistics. If you want to push yourself further, check out level two!
Create a report that answers the following:
- How many records are in the dataset, and what is the range of years covered?
- What is the average salary (in USD) for Data Scientists and Data Engineers? Which role earns more on average?
- How many full-time employees based in the US work 100% remotely?

## 💪 Competition challenge 2
In this second level, you’ll create visualisations to analyse the data and uncover trends. If you’re up for an even greater challenge, head to level three! Create a report that answers the following:

- Create a bar chart displaying the top 5 job titles with the highest average salary (in USD).
- Compare the average salaries for employees working remotely 100%, 50%, and 0%. What patterns or trends do you observe?
- Visualise the salary distribution (in USD) across company sizes (S, M, L). Which company size offers the highest average salary?

## 💪 Competition challenge 3

In this final level, you’ll develop predictive models and dive deeper into the dataset. If this feels overwhelming, consider completing the earlier levels first!
Create a report that answers the following:
- Analyse how factors such as country, experience level, and remote ratio impact salaries for Data Analysts, Data Scientists, and Machine Learning Engineers. In which conditions do professionals achieve the highest salaries?
- Develop a predictive model to estimate an employee’s salary (in USD) using experience level, company location, and remote ratio. Which features are the strongest predictors of salary?
- Expand your model by incorporating additional features, such as company size and employment type. Evaluate its performance: what improves, and what doesn’t? Finally, propose new features to make future salary predictions even more accurate.

## 🧑‍⚖️ Judging criteria

This is a community-based competition. Once the competition concludes, you'll have the opportunity to view and vote for the best submissions of others as the voting begins. The top 5 most upvoted entries will win. The winners will receive DataCamp merchandise.

## ✅ Checklist before publishing into the competition
- Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
- **Remove redundant cells** like the judging criteria, so the workbook is focused on your story.
- Make sure the workbook reads well and explains how you found your insights. 
- Try to include an **executive summary** of your recommendations at the beginning.
- Check that all the cells run without error.

## ⌛️ Time is ticking. Good luck!

Summarising what has been observed:

### Null Values
There are no null values in the dataset.

### Duplicates
About 52% of the records are duplicates of another record. We treat them as data points from different employees rather than as errors.

| Column               | Description | Comments | Actions |
|----------------------|------------|----------|---------|
| **work_year**        | The year the salary was paid. | 2024 for the most part, fewer data points as we go back in time. First observations from 2020. | No action needed. |
| **experience_level** | Employee experience level:<br>**EN**: Entry-level / Junior<br>**MI**: Mid-level / Intermediate<br>**SE**: Senior / Expert<br>**EX**: Executive / Director | Highly imbalanced. | Convert to an ordered category. |
| **employment_type**  | Employment type:<br>**PT**: Part-time<br>**FT**: Full-time<br>**CT**: Contract<br>**FL**: Freelance | Highly imbalanced. | Convert to a category. |
| **job_title**        | The job title during the year. | Too many unique values (253). | Perform grouping. Convert to a category. Extract fields like manager, engineer, BI, ML, AI... |
| **salary**          | Gross salary paid (in local currency). | Useless due to different currencies. | Drop this column. |
| **salary_currency** | Salary currency (ISO 4217 code). | Highly imbalanced. | Convert to a category. |
| **salary_in_usd**   | Salary converted to USD using average yearly FX rate. | No outliers. All values seem legit. | No action needed. |
| **employee_residence** | Employee's primary country of residence (ISO 3166 code). | Highly imbalanced. | Convert to a category. |
| **remote_ratio**     | Percentage of remote work:<br>**0**: No remote work (<20%)<br>**50**: Hybrid (50%)<br>**100**: Fully remote (>80%) | 0 (no remote work) for the most part. Highly imbalanced. | Convert to a category. |
| **company_location** | Employer's main office location (ISO 3166 code). | Highly imbalanced. | Convert to a category. |
| **company_size**     | Company size:<br>**S**: Small (<50 employees)<br>**M**: Medium (50–250 employees)<br>**L**: Large (>250 employees) | Highly imbalanced. | Convert to an ordered category. |
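
The grouping action proposed for `job_title` could be sketched as keyword flags. This is a minimal sketch, not the method used in this notebook: the helper `add_title_flags`, the `TITLE_KEYWORDS` mapping, and the keyword lists are all assumptions that should be adapted after inspecting the real titles.

```python
import pandas as pd

# Hypothetical keyword groups (assumption): adjust after inspecting the real job titles.
TITLE_KEYWORDS = {
    "is_manager": r"\b(?:manager|head|lead|director)\b",
    "is_engineer": r"\bengineer\b",
    "is_bi": r"\b(?:bi|business intelligence)\b",
    "is_ml_ai": r"\b(?:ml|ai|machine learning)\b",
}


def add_title_flags(frame: pd.DataFrame, column: str = "job_title") -> pd.DataFrame:
    """Return a copy of `frame` with one boolean column per keyword group."""
    out = frame.copy()
    titles = out[column].astype(str).str.lower()
    for flag, pattern in TITLE_KEYWORDS.items():
        out[flag] = titles.str.contains(pattern, regex=True)
    return out


# Toy rows standing in for salaries.csv
toy = pd.DataFrame(
    {
        "job_title": [
            "Data Engineer",
            "BI Analyst",
            "Head of Data",
            "AI Engineer",
            "Machine Learning Engineer",
        ]
    }
)
flags = add_title_flags(toy)
print(flags)
```

A title can set several flags at once (for example an engineering title in the ML/AI group), which keeps the encoding compact compared with one dummy per raw title.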

In [None]:
import pandas as pd
import warnings
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

sns.set_style("darkgrid")
warnings.simplefilter(action='ignore', category=FutureWarning)
df = pd.read_csv('salaries.csv')
df.head()

In [None]:
df.info()

In [None]:
df.describe(include='all')

In [None]:
duplicates = df[df.duplicated(keep='first')]
print(f'There is a {round(100*len(duplicates)/len(df),2)}% of duplicated records based on all the columns')


In [None]:
fig, axes = plt.subplots(1, 5, figsize=(15, 5), sharey=True)

cols = ['work_year', 'experience_level', 'employment_type', 'company_size', 'remote_ratio']

for ax, col in zip(axes, cols):
    sns.countplot(x=col, data=df, ax=ax, width=0.6)
    ax.set_title(f'Distribution of {col}')
    ax.set_xlabel('') 
    ax.tick_params(axis='x', rotation=45)  # Rotate labels if they are long

plt.tight_layout()
plt.show()


In [None]:
fig, axes = plt.subplots(1, 4, figsize=(20, 5), sharey=True)  # Share the Y axis

cols = ['job_title', 'salary_currency', 'employee_residence', 'company_location']

for ax, col in zip(axes, cols):
    top_categories = df[col].value_counts().nlargest(4).index  # 4 most common values
    data = df[col].apply(lambda x: x if x in top_categories else 'Other')  # Replace the rest with "Other"
    
    sns.countplot(x=data, ax=ax, width=0.6, order=top_categories.tolist() + ['Other'])
    ax.set_title(f'Distribution of {col}')
    ax.set_xlabel('')
    ax.tick_params(axis='x', rotation=45)
    del(data)
    
plt.tight_layout()
plt.show()

In [None]:
numeric_cols = df.select_dtypes(include=['number']).columns

In [None]:
sorted_experience_level = ['EN', 'MI', 'SE', 'EX']
sorted_company_size = ['S', 'M', 'L']

df['experience_level'] = pd.Categorical(df['experience_level'], categories=sorted_experience_level, ordered=True)
df['company_size'] = pd.Categorical(df['company_size'], categories=sorted_company_size, ordered=True)


In [None]:
df['remote_ratio'] = np.where(df['remote_ratio']==0, 'presential', np.where(df['remote_ratio']==50, 'partial_remote','full_remote'))
df['remote_ratio'] = df['remote_ratio'].astype('category')

In [None]:
sns.displot(df.salary_in_usd, bins=50)
plt.title('Salary in USD')
plt.xticks(rotation=45)
plt.show()

In [None]:
df.salary_in_usd.quantile(0.99)

In [None]:
(df.salary_in_usd<400000).mean()

- How many records are in the dataset, and what is the range of years covered?

In [None]:
print(f'There are {len(df)} records in the dataset.')
print(f"The dataset contains observations from {df['work_year'].min()} to {df['work_year'].max()}")

- What is the average salary (in USD) for Data Scientists and Data Engineers? Which role earns more on average?

In [None]:
df[df['job_title'].isin(['Data Scientist', 'Data Engineer'])].groupby('job_title').agg({'salary_in_usd': 'mean'}).round(2)

The average salary for Data Scientists is higher.

- How many full-time employees based in the US work 100% remotely?

In [None]:
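# Full-time (FT) employees resident in the US who work fully remotely
len(df[(df['employment_type']=='FT')&(df['remote_ratio']=='full_remote')&(df['employee_residence']=='US')])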

In [None]:
cat_cols = ['employment_type', 'salary_currency', 'employee_residence', 'company_location']
for col in cat_cols: 
    df[col] = df[col].astype('category')

In [None]:
df=df.drop('salary', axis=1)

In [None]:
plt.figure(figsize=(12, 8))

top_20_titles = df['job_title'].value_counts().nlargest(20).index
df_top = df[df['job_title'].isin(top_20_titles)]
order = df_top.groupby('job_title')['salary_in_usd'].median().sort_values(ascending=False).index

sns.boxplot(data=df_top, y='job_title', x='salary_in_usd', order=order)
plt.show()

In [None]:
plt.figure(figsize=(8, 6))  

sns.regplot(data=df, x='work_year', y='salary_in_usd')
plt.xticks(sorted(df['work_year'].unique()))

plt.show()

In [None]:
fig, axes = plt.subplots(1, 4, figsize=(20, 5))  # One panel per categorical feature

cols = ['experience_level', 'employment_type', 'remote_ratio', 'company_size']

for ax, col in zip(axes, cols):
    order = df.groupby(col)['salary_in_usd'].median().sort_values(ascending=False).index
    sns.boxplot(data=df, x=col, y='salary_in_usd', order=order, ax=ax)
    
    ax.set_title(f'Boxplot of {col} - salary USD')
    ax.set_xlabel('')
    ax.tick_params(axis='x', rotation=45)
    
plt.tight_layout()
plt.show()


In [None]:
plt.figure(figsize=(12, 8))

top_20_res = df['employee_residence'].value_counts().nlargest(20).index
df_top = df[df['employee_residence'].isin(top_20_res)]
order = df_top.groupby('employee_residence')['salary_in_usd'].median().dropna().sort_values(ascending=False).index

sns.boxplot(data=df_top, y='employee_residence', x='salary_in_usd', order=order)

In [None]:
plt.figure(figsize=(12, 8))

top_20_res = df['company_location'].value_counts().nlargest(20).index
df_top = df[df['company_location'].isin(top_20_res)]
order = df_top.groupby('company_location')['salary_in_usd'].median().dropna().sort_values(ascending=False).index

sns.boxplot(data=df_top, y='company_location', x='salary_in_usd', order=order)

## Bivariate Analysis

- Create a bar chart displaying the top 5 job titles with the highest average salary (in USD).

In [None]:
biggest_salaries = df.groupby('job_title')['salary_in_usd'].mean().nlargest(5).round()
biggest_salaries.plot(kind='barh')
plt.xlabel("Average Salary in USD")
plt.ylabel("Job Title")
plt.title("Top 5 Job Titles with Highest Average Salaries")
plt.xticks(rotation=45)
plt.show()

- Compare the average salaries for employees working remotely 100%, 50%, and 0%. What patterns or trends do you observe?

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10, 6), sharey=True)

# Bar plot
remote_salaries = df.groupby('remote_ratio')['salary_in_usd'].mean().round()
remote_salaries.plot(kind='bar', ax=axes[0])
axes[0].set_ylabel("Average Salary in USD")
axes[0].set_xlabel("Remote Ratio")
axes[0].set_title("Remote Working Policies - Average Salaries")
axes[0].tick_params(axis='x', rotation=45)

# Box plot
sns.boxplot(data=df, x='remote_ratio', y='salary_in_usd', ax=axes[1])
axes[1].set_xlabel("Remote Ratio")
axes[1].set_ylabel("Salary in USD")
axes[1].set_title("Remote Working Policies - Salary Distribution")
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

In [None]:
df['remote_ratio'].unique()

In [None]:
for ratio in df['remote_ratio'].unique():
    statistic, pvalue = stats.shapiro(df[df['remote_ratio'] == ratio]['salary_in_usd'])
    print(f"Remote Ratio {ratio}: stat:{round(statistic, 5)}, p-val:{pvalue}")

Since the salary distributions are not normal (the p-values are too small to support the null hypothesis of normality), we cannot rely on ANOVA, so we use the Kruskal-Wallis test instead. Because the distribution of the partial_remote category is clearly different, we leave it out of the test.

In [None]:
kruskal_result = stats.kruskal(df[df['remote_ratio'] == 'presential']['salary_in_usd'], 
                         df[df['remote_ratio'] == 'full_remote']['salary_in_usd'])

print("Kruskal-Wallis result:", kruskal_result)

The p-value of the Kruskal-Wallis test leads us to reject the null hypothesis that presential and full_remote salaries follow the same distribution, so the two groups are paid differently.

The partial_remote policy is paid less than the other two, and presential is the best paid of the three.
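
As a complement, the comparison could also cover all three groups at once, followed by pairwise tests. The sketch below is an assumption-laden illustration rather than the analysis run above: it uses synthetic salaries in place of `df`, and it swaps in pairwise Mann-Whitney U tests with a Bonferroni correction as the follow-up step.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic salaries (assumption): stand-ins for df['salary_in_usd'] per remote_ratio group
toy = pd.DataFrame(
    {
        "remote_ratio": ["presential"] * 300 + ["full_remote"] * 300 + ["partial_remote"] * 100,
        "salary_in_usd": np.concatenate(
            [
                rng.lognormal(np.log(150_000), 0.4, 300),
                rng.lognormal(np.log(140_000), 0.4, 300),
                rng.lognormal(np.log(80_000), 0.4, 100),
            ]
        ),
    }
)

# One array of salaries per remote-work group
groups = {name: g["salary_in_usd"].to_numpy() for name, g in toy.groupby("remote_ratio")}

# Omnibus test across all three groups
h_stat, p_overall = stats.kruskal(*groups.values())
print(f"Kruskal-Wallis: H={h_stat:.2f}, p={p_overall:.3g}")

# Pairwise follow-up with a Bonferroni correction (multiply by the number of pairs, cap at 1)
pairs = list(combinations(groups, 2))
pairwise = {}
for a, b in pairs:
    _, p = stats.mannwhitneyu(groups[a], groups[b], alternative="two-sided")
    pairwise[(a, b)] = min(p * len(pairs), 1.0)
    print(f"{a} vs {b}: adjusted p={pairwise[(a, b)]:.3g}")
```

With the real data, `toy` would be replaced by `df`; the adjusted p-values then indicate which pairs of policies differ, not only whether some difference exists.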

- Visualise the salary distribution (in USD) across company sizes (S, M, L). Which company size offers the highest average salary?

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10, 6), sharey=True)

# Bar plot
size_salaries = df.groupby('company_size')['salary_in_usd'].mean().round()
size_salaries.plot(kind='bar', ax=axes[0])
axes[0].set_ylabel("Average Salary in USD")
axes[0].set_xlabel("company_size")
axes[0].set_title("company_size - Average Salaries")
axes[0].tick_params(axis='x', rotation=45)

# Box plot
sns.boxplot(data=df, x='company_size', y='salary_in_usd', ax=axes[1])
axes[1].set_xlabel("company_size")
axes[1].set_ylabel("Salary in USD")
axes[1].set_title("company_size - Salary Distribution")
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

Both the median and the mean salary are highest for medium-sized companies.

- Analyse how factors such as country, experience level, and remote ratio impact salaries for Data Analysts, Data Scientists, and Machine Learning Engineers. In which conditions do professionals achieve the highest salaries?

In [None]:
jobs_in_order = [ 'Data Analyst', 'Data Scientist', 'Machine Learning Engineer']  
mini_df = df[df['job_title'].isin(jobs_in_order)]

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(data=mini_df, hue='experience_level', x='salary_in_usd', y='job_title', order=jobs_in_order)

plt.title("Salary Distribution by Experience Level and Job Title")
plt.ylabel("Job Title")  
plt.xlabel("Salary in USD")  
plt.legend(title="Experience Level") 
plt.grid(axis="y", linestyle="--", alpha=0.7) 
plt.xticks(rotation=45) 

plt.show()

- Salaries differ by job title and, in general, Data Analysts < Data Scientists < Machine Learning Engineers.
- The range within each job title and experience level combination suggests that other factors, such as country, remote policy or industry, also have an impact.
- There are outliers on the high end for most combinations.
- In general, salary grows with experience. There is one exception, though: expert Data Analysts.

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(data=mini_df, hue='remote_ratio', x='salary_in_usd', y='job_title', order=jobs_in_order)

plt.title("Salary Distribution by Remote Ratio and Job Title")
plt.xlabel("Salary in USD")  
plt.ylabel("Job Title")  
plt.legend(title="Remote Ratio") 
plt.grid(axis="y", linestyle="--", alpha=0.7) 
plt.xticks(rotation=45) 

plt.show()

- Partial_remote is the lowest-paid working arrangement.
- Presential is the best-paid arrangement for Machine Learning Engineers (MLE) and Data Scientists (DS). Full remote is the best paid for Data Analysts (DA).
- The wide ranges and the presence of outliers suggest that other factors also affect salary.

In [None]:
n_countries = 10
plt.figure(figsize=(10, n_countries))

top_countries = mini_df.groupby('employee_residence')['salary_in_usd'].agg('count').nlargest(n_countries).index.to_list()
salary_ordered_countries = mini_df.groupby('employee_residence')['salary_in_usd'].agg('median').sort_values(ascending=False).index.to_list()
salary_ordered_countries = [country for country in salary_ordered_countries if country in top_countries]

countries_df = mini_df[mini_df['employee_residence'].isin(top_countries)] 

sns.boxplot(data=countries_df, y='employee_residence', x='salary_in_usd', hue='job_title', order=salary_ordered_countries)
plt.title("Salary Distribution by Country and Job Title")
plt.xlabel("Salary in USD")  
plt.ylabel("Employee Residence")  
plt.grid(axis="y", linestyle="--", alpha=0.7) 
plt.xticks(rotation=45) 

plt.show()

- The distribution of salaries depends on the country and the job title.
- In general, the ranges are wide for the countries with the most records and there are many outliers, indicating that other factors may affect salary.
- MLE is in general the best-paid job title and Data Analyst the worst paid.
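
To answer "in which conditions" more directly, the factors can be combined in a single ranking. The following is a minimal sketch on toy rows, assuming the same column names as the dataset; the helper `top_conditions` and its `min_count` threshold are hypothetical and not part of the analysis above.

```python
import pandas as pd

# Toy rows (assumption) with the same column names as the dataset
toy = pd.DataFrame(
    {
        "job_title": ["Data Analyst"] * 4 + ["Data Scientist"] * 4,
        "experience_level": ["EN", "EN", "SE", "SE", "EN", "EN", "SE", "SE"],
        "remote_ratio": ["presential", "presential", "full_remote", "full_remote"] * 2,
        "employee_residence": ["US"] * 8,
        "salary_in_usd": [60_000, 70_000, 110_000, 130_000, 90_000, 100_000, 170_000, 190_000],
    }
)


def top_conditions(frame, keys, min_count=2, n=3):
    """Rank key combinations by median salary, ignoring combinations with few rows."""
    summary = (
        frame.groupby(keys, observed=True)["salary_in_usd"]
        .agg(median="median", count="count")
        .reset_index()
    )
    summary = summary[summary["count"] >= min_count]
    return summary.sort_values("median", ascending=False).head(n)


best = top_conditions(
    toy, ["job_title", "experience_level", "remote_ratio", "employee_residence"]
)
print(best)
```

On the real data this would be called with `mini_df`, and a larger `min_count` would keep rare combinations from dominating the ranking.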

- Develop a predictive model to estimate an employee’s salary (in USD) using experience level, company location, and remote ratio. Which features are the strongest predictors of salary?

In [None]:
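# Keep the three requested predictors, work_year (scaled in the next cell), and the target
model_cols = ['work_year', 'experience_level', 'company_location', 'remote_ratio', 'salary_in_usd']
df_dum = pd.get_dummies(df[model_cols])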

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

scaler = StandardScaler()
# Standardise work_year (note: the scaler is fitted on all rows, before the train/test split)
df_dum['work_year'] = scaler.fit_transform(df_dum[['work_year']]).ravel()
y = df_dum['salary_in_usd'] 
X = df_dum.drop('salary_in_usd', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hyperparameter tuning for Lasso (finding best alpha)
alphas = np.logspace(-3, 1, 50)  # Search from 0.001 to 10
lasso_cv = GridSearchCV(Lasso(max_iter=5000), {'alpha': alphas}, cv=5, scoring='neg_root_mean_squared_error', n_jobs=-1)
lasso_cv.fit(X_train, y_train)

# Best alpha
best_alpha = lasso_cv.best_params_['alpha']
print(f"Best alpha for Lasso: {best_alpha:.5f}")

# Train Lasso with the best alpha
lasso = Lasso(alpha=best_alpha, max_iter=5000)
lasso.fit(X_train, y_train)

# Cross-validation RMSE
cv_scores = -cross_val_score(lasso, X_train, y_train, cv=5, scoring='neg_root_mean_squared_error')
train_rmse = np.mean(cv_scores)

# Test RMSE
y_pred_test = lasso.predict(X_test)
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))

# Print results
print(f"Validation RMSE (5-fold CV mean on the training set): {train_rmse:.4f}")
print(f"Test RMSE: {test_rmse:.4f}")

# Feature importance (nonzero coefficients)
feature_importances = np.abs(lasso.coef_)
feature_names = np.array(X.columns)

# Select nonzero features
nonzero_mask = feature_importances > 0
nonzero_features = feature_names[nonzero_mask]
nonzero_importances = feature_importances[nonzero_mask]

# Select top 20 most important features
top_n = min(20, len(nonzero_importances))  # Avoid errors if <20 features survive
top_features_idx = np.argsort(nonzero_importances)[-top_n:]

# Plot feature importances
plt.figure(figsize=(10, 6))
plt.barh(range(len(top_features_idx)), nonzero_importances[top_features_idx], align='center')
plt.yticks(range(len(top_features_idx)), [nonzero_features[i] for i in top_features_idx])
plt.xlabel("Absolute Coefficient Value")
plt.ylabel("Feature Name")
plt.title("Top 20 Most Important Features in Lasso Model")
plt.show()
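
To say which original feature (rather than which single dummy column) predicts salary most strongly, the coefficients can be summarised per source column. This is a rough heuristic sketched under assumptions: the helper `group_coefficients` is hypothetical, the coefficient values below are invented for illustration, and the largest absolute coefficient tends to favour features with many levels.

```python
import pandas as pd


def group_coefficients(coefs: pd.Series, base_columns) -> pd.DataFrame:
    """Summarise dummy-variable coefficients per original column."""
    rows = []
    for base in base_columns:
        # A dummy column produced by pd.get_dummies is named "<base>_<value>"
        mask = (coefs.index == base) | coefs.index.str.startswith(base + "_")
        selected = coefs[mask].abs()
        rows.append(
            {
                "feature": base,
                "max_abs_coef": selected.max(),
                "n_nonzero": int((selected > 0).sum()),
            }
        )
    return (
        pd.DataFrame(rows)
        .sort_values("max_abs_coef", ascending=False)
        .reset_index(drop=True)
    )


# Invented coefficients (assumption), shaped like pd.Series(lasso.coef_, index=X.columns)
coefs = pd.Series(
    {
        "work_year": 3000.0,
        "experience_level_EN": -40000.0,
        "experience_level_EX": 50000.0,
        "company_location_US": 30000.0,
        "company_location_IN": -20000.0,
        "remote_ratio_full_remote": 0.0,
        "remote_ratio_partial_remote": -5000.0,
    }
)
summary = group_coefficients(
    coefs, ["work_year", "experience_level", "company_location", "remote_ratio"]
)
print(summary)
```

In this notebook the input would be `pd.Series(lasso.coef_, index=X.columns)`, so the table reflects the fitted model instead of the toy numbers.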

- Expand your model by incorporating additional features, such as company size and employment type. Evaluate its performance: what improves, and what doesn’t? Finally, propose new features to make future salary predictions even more accurate.

In [None]:
fig, axes = plt.subplots(1, 5, figsize=(15, 5), sharey=True)

cols = ['work_year', 'experience_level', 'employment_type', 'company_size', 'remote_ratio']

for ax, col in zip(axes, cols):
    sns.boxplot(data=df, x=col, y='salary_in_usd', ax=ax, width=0.6)
    ax.set_title(f'Salary in USD by {col}')
    ax.set_xlabel('') 
    ax.tick_params(axis='x', rotation=45)  # Rotate labels if they are long

plt.tight_layout()
plt.show()
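
One way to evaluate the expanded model is to score the baseline and the expanded feature sets with the same cross-validation. The sketch below is an assumption-laden illustration, not a result from this notebook: the data is synthetic, the helper `cv_rmse` is hypothetical, and the fixed `alpha=1.0` is a placeholder rather than a tuned value. The encoder sits inside the pipeline so that it is refitted on each training fold.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(42)
n = 400

# Synthetic stand-in for the dataset (assumption): salary depends on level and company size
toy = pd.DataFrame(
    {
        "experience_level": rng.choice(["EN", "MI", "SE", "EX"], n),
        "company_location": rng.choice(["US", "GB", "CA"], n),
        "remote_ratio": rng.choice(["presential", "partial_remote", "full_remote"], n),
        "company_size": rng.choice(["S", "M", "L"], n),
        "employment_type": rng.choice(["FT", "PT", "CT", "FL"], n),
    }
)
level_pay = {"EN": 70_000, "MI": 100_000, "SE": 140_000, "EX": 180_000}
size_pay = {"S": -30_000, "M": 20_000, "L": 0}
toy["salary_in_usd"] = (
    toy["experience_level"].map(level_pay)
    + toy["company_size"].map(size_pay)
    + rng.normal(0, 10_000, n)
)


def cv_rmse(frame, features, target="salary_in_usd"):
    """Mean 5-fold cross-validated RMSE of a one-hot + Lasso pipeline."""
    model = make_pipeline(
        ColumnTransformer([("onehot", OneHotEncoder(handle_unknown="ignore"), features)]),
        Lasso(alpha=1.0, max_iter=5000),
    )
    scores = cross_val_score(
        model, frame[features], frame[target], cv=5, scoring="neg_root_mean_squared_error"
    )
    return -scores.mean()


baseline = ["experience_level", "company_location", "remote_ratio"]
expanded = baseline + ["company_size", "employment_type"]

rmse_baseline = cv_rmse(toy, baseline)
rmse_expanded = cv_rmse(toy, expanded)
print(f"Baseline RMSE: {rmse_baseline:.0f}")
print(f"Expanded RMSE: {rmse_expanded:.0f}")
```

Running the same comparison on `df` would show whether company size and employment type reduce the error on the real salaries, which the synthetic example cannot tell.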
