# Executive Report

### General Observations

- There are **57194 records** in total.
- There are **no null values** in the dataset.
- **52% of the records are duplicates** of another record. We treat them as data points corresponding to different employees, not as errors.

### Comments and Data Cleanup

| Column               | Description | Comments | Actions |
|----------------------|------------|----------|---------|
| **work_year**        | The year the salary was paid. | 2024 for the most part, fewer data points as we go back in time. <br> First observations from 2020 (**range 2020 to 2024**) | No action needed. |
| **experience_level** | Employee experience level:<br>**EN**: Entry-level / Junior<br>**MI**: Mid-level / Intermediate<br>**SE**: Senior / Expert<br>**EX**: Executive / Director | Highly imbalanced. <br> SE the most frequent value. | Convert to an ordered category. |
| **employment_type**  | Employment type:<br>**PT**: Part-time<br>**FT**: Full-time<br>**CT**: Contract<br>**FL**: Freelance | Highly imbalanced. <br> FT the most frequent. | Convert to a category. |
| **job_title**        | The job title during the year. | Too many unique values (253). | Convert to a category. <br>Possibly group titles or extract keyword fields (manager, engineer, BI, ML, AI...)? |
| **salary**          | Gross salary paid (in local currency). | Not comparable across records because the currencies differ. | Drop this column. |
| **salary_currency** | Salary currency (ISO 4217 code). | Highly imbalanced.<br> USD the most frequent value.  | Convert to a category. |
| **salary_in_usd**   | Salary converted to USD using average yearly FX rate. | The outliers seem legit. | No action needed. |
| **employee_residence** | Employee's primary country of residence (ISO 3166 code). | Highly imbalanced.<br> US the most frequent value.  | Convert to a category. |
| **remote_ratio**     | Percentage of remote work:<br>**0**: No remote work (<20%)<br>**50**: Hybrid (50%)<br>**100**: Fully remote (>80%) | 0 (no remote work) for the most part. <br> Highly imbalanced. | Convert to a category. |
| **company_location** | Employer's main office location (ISO 3166 code). | Highly imbalanced.<br> US the most frequent value. | Convert to a category. |
| **company_size**     | Company size:<br>**S**: Small (<50 employees)<br>**M**: Medium (50–250 employees)<br>**L**: Large (>250 employees) | Highly imbalanced. <br> M the most frequent value | Convert to an ordered category. |
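
The actions in the table can be sketched as follows. This is a minimal illustration on made-up rows, not the notebook's actual cleanup code; the category orders are assumptions read off the data dictionary above (EN < MI < SE < EX and S < M < L).

```python
import pandas as pd

# Toy rows standing in for salaries.csv (values are made up for illustration).
toy = pd.DataFrame({
    "experience_level": ["SE", "EN", "MI", "EX", "SE"],
    "company_size": ["M", "S", "L", "M", "M"],
    "employment_type": ["FT", "FT", "CT", "PT", "FL"],
    "salary": [100000, 50000, 70000, 200000, 120000],
})

# Assumed orders, taken from the data dictionary in the table.
experience_order = ["EN", "MI", "SE", "EX"]
size_order = ["S", "M", "L"]

toy["experience_level"] = pd.Categorical(
    toy["experience_level"], categories=experience_order, ordered=True
)
toy["company_size"] = pd.Categorical(
    toy["company_size"], categories=size_order, ordered=True
)
toy["employment_type"] = toy["employment_type"].astype("category")  # unordered
toy = toy.drop(columns=["salary"])  # local-currency salary is not comparable

# Ordered categories compare by rank rather than alphabetically.
senior_or_above = toy[toy["experience_level"] >= "SE"]
print(senior_or_above)
```

With an ordered category, sorting and plotting follow the declared order instead of the alphabetical one.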

### Questions

- How many records are in the dataset, and what is the range of years covered?
  - **There are 57194 records in the dataset.**
  - **The dataset contains observations from 2020 to 2024**
<br><br>
- What is the average salary (in USD) for Data Scientists and Data Engineers? Which role earns more on average?
  - **The average salary for Data Scientists is higher (159397.07 vs 149315.00).**
<br><br>
- How many full-time employees based in the US work 100% remotely?
  - **The number of full-time employees based in the US who work 100% remotely is 11163.**
 

### Recommendations

The dataset contains valuable information for proposing competitive salaries to candidates. The data is quite recent and covers many different scenarios, such as different company sizes, contract types, company and employee locations, and job titles.

We recommend digging deeper into this dataset, preparing it and building a model to predict salaries.

# Loading Data

In [None]:
import pandas as pd
import warnings
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

sns.set_style("darkgrid")
warnings.simplefilter(action='ignore', category=FutureWarning)

# Load the data
df = pd.read_csv('salaries.csv')

# Exploration

In [None]:
# Check the number of null values and the type of each feature
df.info()

- There are no null values.
- Several string features could be converted into categories, some of them ordered, which would reduce the memory footprint and make plotting easier.
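
The memory claim can be checked with `memory_usage(deep=True)`. The sketch below uses a synthetic column with a few repeated country codes, not the real data, so the exact byte counts are illustrative only.

```python
import pandas as pd

# Synthetic column with few distinct string values (not the real data).
codes = pd.Series(["US", "GB", "CA", "US", "US"] * 2000, name="company_location")

bytes_as_strings = codes.memory_usage(deep=True)
bytes_as_category = codes.astype("category").memory_usage(deep=True)

print(f"strings: {bytes_as_strings} bytes, category: {bytes_as_category} bytes")
```

A category stores each distinct value once plus a small integer code per row, which is why low-cardinality string columns shrink the most.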

In [None]:
# See how the data is distributed across the different features
df.describe(include='all')

- The provided data dictionary matches the distinct values found in company_size, experience_level and employment_type.
- Work_year ranges from 2020 to 2024, with 2024 being the most common year.
- Job_title has many unique values.
- Company_size, company_location, remote_ratio, salary_currency, employment_type, experience_level and work_year are highly imbalanced.

## Duplicates

In [None]:
# Are there any duplicates?
duplicates = df[df.duplicated(keep='first')]
print(f'{round(100*len(duplicates)/len(df),2)}% of the records are duplicates of an earlier record, based on all columns')
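
How duplicates are counted depends on the `keep` argument. The toy frame below (made-up rows) is a small sketch of the difference between counting only the repeats and counting every row that belongs to a duplicated group.

```python
import pandas as pd

# Made-up rows: one record appears three times, another once.
toy = pd.DataFrame({
    "job_title": ["Data Engineer", "Data Engineer", "Data Engineer", "Data Scientist"],
    "salary_in_usd": [150000, 150000, 150000, 160000],
})

repeats = int(toy.duplicated(keep="first").sum())    # rows repeating an earlier row
in_dup_group = int(toy.duplicated(keep=False).sum())  # rows with at least one twin
group_sizes = toy.value_counts()                      # occurrences of each distinct row

print(repeats, in_dup_group)
print(group_sizes)
```

With `keep='first'`, the first occurrence of each group is not counted, so the reported percentage is the share of rows that would be removed by `drop_duplicates()`.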


## Data Distribution

In [None]:
# Plot the distribution of categorical features with few unique values
fig, axes = plt.subplots(1, 5, figsize=(15, 5), sharey=True)

cols = ['work_year', 'experience_level', 'employment_type', 'company_size', 'remote_ratio']

for ax, col in zip(axes, cols):
    sns.countplot(x=col, data=df, ax=ax, width=0.6)
    ax.set_title(f'Distribution of {col}')
    ax.set_xlabel('') 
    ax.tick_params(axis='x', rotation=45)  # Rotate labels in case they are long

plt.tight_layout()
plt.show()


In [None]:
# Plot the distribution of categorical features with many different values
fig, axes = plt.subplots(1, 4, figsize=(20, 5), sharey=True)  # Share the Y axis

cols = ['job_title', 'salary_currency', 'employee_residence', 'company_location']

for ax, col in zip(axes, cols):
    top_categories = df[col].value_counts().nlargest(4).index  # 4 most common values
    data = df[col].where(df[col].isin(top_categories), 'Other')  # Replace the rest with "Other"

    palette = {cat: "C0" for cat in top_categories} 
    palette['Other'] = "green"  
    
    sns.countplot(x=data, ax=ax, width=0.6, order=top_categories.tolist() + ['Other'], palette=palette)
    ax.set_title(f'Distribution of {col}')
    ax.set_xlabel('')
    ax.tick_params(axis='x', rotation=45)
    del data
    
plt.tight_layout()
plt.show()

- Job_title is the only categorical feature whose data points are spread across many categories instead of being concentrated in a single one.
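
One way to make job_title more manageable is to derive keyword flags from it, as suggested in the cleanup table. The sketch below is only an illustration: the sample titles and the keyword patterns are hypothetical, and a real list would come from inspecting the 253 distinct titles.

```python
import pandas as pd

# Sample titles for illustration only.
titles = pd.Series([
    "Data Engineer",
    "Machine Learning Engineer",
    "BI Analyst",
    "Data Science Manager",
    "AI Architect",
])

# Hypothetical keyword patterns (non-capturing groups, matched on word boundaries).
patterns = {
    "is_engineer": r"\bengineer\b",
    "is_manager": r"\b(?:manager|head|director|lead)\b",
    "is_ml_ai": r"\b(?:ml|ai|machine learning)\b",
    "is_bi": r"\b(?:bi|business intelligence)\b",
}

# One boolean column per keyword group, matched case-insensitively.
flags = pd.DataFrame({
    name: titles.str.contains(pattern, case=False, regex=True)
    for name, pattern in patterns.items()
})
print(pd.concat([titles.rename("job_title"), flags], axis=1))
```

Flags like these keep the original title intact while giving a model or a plot a handful of dense features to work with.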

In [None]:
# Convert remote_ratio into a categorical feature with labels that are easier to interpret.
# An explicit mapping leaves unexpected values as NaN instead of labelling them 'full_remote'.
remote_labels = {0: 'presential', 50: 'partial_remote', 100: 'full_remote'}
df['remote_ratio'] = df['remote_ratio'].map(remote_labels).astype('category')

In [None]:
# How is the salary feature distributed?
sns.displot(df.salary_in_usd, bins=50)
plt.title('Salary in USD')
plt.xticks(rotation=45)
plt.show()

- Salary_in_usd is right-skewed, with some extremely high values compared to the rest.
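
The skew can also be quantified rather than read off the plot. The sketch below uses synthetic log-normal values (an assumption chosen only to mimic a right-skewed shape, not the real salaries) and compares the sample skewness before and after a log transform.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, right-skewed values; not taken from salaries.csv.
salary = pd.Series(rng.lognormal(mean=11.8, sigma=0.45, size=5000), name="salary_in_usd")

skew_raw = salary.skew()          # > 0 indicates a longer right tail
skew_log = np.log(salary).skew()  # near 0 if the log scale is roughly symmetric

print(f"skewness: raw={skew_raw:.2f}, log={skew_log:.2f}")
```

If the real column behaves similarly, modelling the logarithm of the salary may be easier than modelling the raw value.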

In [None]:
salary=400000
print(f'{round(100*(df.salary_in_usd<salary).mean(), 2)}% of the observations are under {salary} USD')
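
The same check can be run in the other direction: instead of fixing a threshold and asking which share falls below it, fix a share and ask for the corresponding salary with `quantile`. A small sketch on made-up values:

```python
import pandas as pd

# Made-up salaries with one extreme value.
salary = pd.Series([60000, 90000, 120000, 150000, 180000, 210000, 750000])

share_under = (salary < 400000).mean()    # share of observations below a threshold
p50, p95 = salary.quantile([0.50, 0.95])  # salary at the 50th and 95th percentiles

print(f"{100 * share_under:.1f}% under 400000 USD; median={p50:.0f}, p95={p95:.0f}")
```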

# Questions
## Question 1: How many records are in the dataset, and what is the range of years covered?

In [None]:
print(f'There are {len(df)} records in the dataset.')
print(f"The dataset contains observations from {df['work_year'].min()} to {df['work_year'].max()}")

## Question 2: What is the average salary (in USD) for Data Scientists and Data Engineers? Which role earns more on average?

In [None]:
df[df['job_title'].isin(['Data Scientist', 'Data Engineer'])].groupby('job_title').agg({'salary_in_usd': 'mean'}).round(2)

The average salary for Data Scientists is higher.
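
A difference in means does not by itself show that the gap is more than noise. One possible follow-up, sketched here on synthetic samples (the group sizes, means and spreads are assumptions, not values from the dataset), is Welch's t-test from `scipy.stats`, which does not assume equal variances.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)
# Synthetic salary samples for two roles; not taken from salaries.csv.
group_a = rng.normal(loc=160000, scale=60000, size=400)
group_b = rng.normal(loc=150000, scale=60000, size=400)

# equal_var=False selects Welch's t-test.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t={t_stat:.2f}, p={p_value:.4f}")
```

On the real data, the two samples would be the `salary_in_usd` values of each job title; given the skew noted earlier, a test on log salaries or a non-parametric test may be worth considering as well.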

## Question 3: How many full-time employees based in the US work 100% remotely?

In [None]:
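# The question names three conditions: full-time, US-based and fully remote
mask = (df['employment_type']=='FT') & (df['remote_ratio']=='full_remote') & (df['employee_residence']=='US')
print(f"The number of full-time employees based in the US who work 100% remotely is {mask.sum()}")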
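
The question combines three conditions (full-time, US-based, fully remote), so the filter needs one term per condition. The toy frame below uses made-up rows to sketch how the count changes when the employment-type term is left out; "US-based" is interpreted here as `employee_residence == 'US'`, which is an assumption.

```python
import pandas as pd

# Made-up rows; remote_ratio uses the labels created earlier in the notebook.
toy = pd.DataFrame({
    "employment_type": ["FT", "FT", "CT", "FT", "PT"],
    "employee_residence": ["US", "US", "US", "GB", "US"],
    "remote_ratio": ["full_remote", "presential", "full_remote", "full_remote", "full_remote"],
})

us_remote = (toy["employee_residence"] == "US") & (toy["remote_ratio"] == "full_remote")
full_time = toy["employment_type"] == "FT"

count_all_types = int(us_remote.sum())              # ignores employment type
count_full_time = int((us_remote & full_time).sum())  # answers the question as asked

print(count_all_types, count_full_time)
```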