# Machine Learning 
ML is one of the most sought-after qualifications in this era of computer science, which is heavily influenced by the hype around, and the use of, AI and neural networks.

Machine Learning belongs to the vast field of Artificial Intelligence, commonly abbreviated these days as **AI**.

<div>
    <center>
    <img src="https://blog.oursky.com/wp-content/uploads/2020/05/ai-vs-ml-vs-dl-e1588741387626.png" width=300>
    </center>
</div>

As the era of neural networks has already begun and is on the rise, **"Machine Learning Engineers"** are often associated with lucrative paychecks, large corporations and ambitious start-ups. In this analysis we will go through a dataset of machine learning engineers' salaries and try to find which attribute has the greatest influence on paychecks.

# Dataset

It is a recently updated dataset containing information on ML engineers up to 2024. The attributes in the dataset are:

* **work_year**: The year in which the salary data was collected (e.g., 2024).
* **experience_level**: The level of experience of the employee (e.g., MI for Mid-Level).
* **employment_type**: The type of employment (e.g., FT for Full-Time).
* **job_title**: The title of the job (e.g., Data Scientist).
* **salary**: The salary amount.
* **salary_currency**: The currency in which the salary is denominated (e.g., USD for US Dollars).
* **salary_in_usd**: The salary amount converted to US Dollars.
* **employee_residence**: The country of residence of the employee (e.g., AU for Australia).
* **remote_ratio**: The ratio indicating the level of remote work (0 for no remote work).
* **company_location**: The location of the company (e.g., AU for Australia).
* **company_size**: The size of the company (e.g., S for Small).

We will analyze the dataset for relevant inferences with the main objective of finding what influences the salary the most.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('/kaggle/input/machine-learning-engineer-salary-in-2024/salaries.csv')

In [None]:
df.head()

In [None]:
df.info()

It is extremely important to check the dataset for null values, since missing entries would distort the group statistics computed later.

In [None]:
df.isnull().sum()
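
As a minimal sketch of what this check reports, the example below runs the same `isnull().sum()` call on a small made-up DataFrame (the `toy` frame and its values are assumptions for illustration, not rows from the real dataset) and also expresses the result as a percentage of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column.
toy = pd.DataFrame({
    "salary_in_usd": [100000, np.nan, 120000, 90000],
    "experience_level": ["SE", "MI", None, "EN"],
})

# Number of missing values per column (same call as used on df above).
null_counts = toy.isnull().sum()

# Share of missing values per column, as a percentage of all rows.
null_share = toy.isnull().mean() * 100

print(pd.DataFrame({"nulls": null_counts, "percent": null_share}))
```

If every count is zero, as one would hope for the real dataset, no imputation or row dropping is needed before grouping.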

We can check all the unique values for columns like `experience_level`, `employment_type` and `job_title`.

In [None]:
df["experience_level"].unique(), df["employment_type"].unique()

**Experience Levels**

* MI: Mid-Level Experience
* SE: Senior-Level Experience
* EN: Entry-Level Experience
* EX: Executive-Level Experience

**Employment Type**

* FT: Full-Time Employee
* CT: Contract-Based Employee
* PT: Part-Time Employee
* FL: Freelance Employee
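
`unique()` lists the categories but not how many rows fall into each one. A small sketch using `value_counts()` on made-up values (the `toy` series is an assumption for illustration, not the real column) shows how the counts and shares could be inspected:

```python
import pandas as pd

# Hypothetical experience levels; the real column would be df["experience_level"].
toy = pd.Series(["SE", "SE", "MI", "EN", "SE", "MI"], name="experience_level")

# Absolute number of rows per category, most frequent first.
counts = toy.value_counts()

# Same information as a fraction of all rows.
shares = toy.value_counts(normalize=True)

print(counts)
print(shares)
```

Categories with very few rows deserve caution later on, because their mean and median are based on little data.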

In [None]:
df['company_location'].unique(), len(df['company_location'].unique())

We have data from 77 different company locations.

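We use the `reset_index()` method to convert the grouped result back to a standard DataFrame.
The `first()` method gets the first occurrence of **salary_currency** for each **company_location**. Note that `salary` is in local currency, so these averages are not comparable across locations, and they are only meaningful within a location if all of its rows share the same currency.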

In [None]:
df_sal = df.groupby('company_location')['salary'].mean().reset_index()
currency_info = df.groupby('company_location')['salary_currency'].first().reset_index()

result = pd.merge(df_sal, currency_info, on='company_location', how='left')

print(result)
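
The same table can probably be built in a single step with pandas named aggregation, which avoids the separate `merge`. This is only a sketch on made-up data (the `toy` frame, its values and the output column names `mean_salary` and `first_currency` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset.
toy = pd.DataFrame({
    "company_location": ["US", "US", "IN", "IN", "DE"],
    "salary": [100000, 140000, 2000000, 3000000, 70000],
    "salary_currency": ["USD", "USD", "INR", "INR", "EUR"],
})

# One groupby, two aggregations: mean of salary and first currency seen.
summary = (
    toy.groupby("company_location")
       .agg(mean_salary=("salary", "mean"),
            first_currency=("salary_currency", "first"))
       .reset_index()
)

print(summary)
```

Each keyword passed to `agg` names an output column and pairs a source column with an aggregation, so the result already has one row per location.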

In [None]:
df["job_title"].unique()

In [None]:
len(df["job_title"].unique())

# 155 Unique ML job types

We don't have to go through them one by one; we will draw charts and construct tables for reference.

The dataset has 155 different job titles under the umbrella of Machine Learning Engineer. Identifying which one is the most in demand will be a bit of a challenge because of the sheer number.

We can calculate statistics such as the `mean` and `median` of **salaries** for each **experience_level**. We also do not need two salary columns (`salary` and `salary_in_usd`) plus a currency column (`salary_currency`). The column with all salaries in USD is enough, since our goal is to analyse the salaries of ML engineers as a whole and not to focus on Purchasing Power Parity.

In [None]:
df_dropped = df.drop(columns=["salary", "salary_currency"])

In [None]:
df_dropped.head()

In [None]:
exp_df = df.groupby('experience_level')['salary_in_usd']

In [None]:
exp_df.describe()

In [None]:
exp_df.mean()

In [None]:
exp_df.median()

We can clearly see that the higher the experience level, the higher the salary. The median is a better summary for our purposes, though, because the maximum for some MI engineers is exceptionally high, while the maximum for EX, the highest level of experience, is not that high.
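
To illustrate why the median is less affected by a few extreme salaries, here is a minimal sketch on made-up numbers (the values are assumptions for illustration, not taken from the dataset):

```python
import pandas as pd

# Hypothetical salaries without any extreme value.
salaries = pd.Series([90000, 95000, 100000, 105000, 110000])

# The same salaries plus one exceptionally high paycheck.
with_outlier = pd.concat([salaries, pd.Series([800000])], ignore_index=True)

# Mean and median agree when there is no outlier.
print(salaries.mean(), salaries.median())

# The mean more than doubles, while the median barely moves.
print(with_outlier.mean(), with_outlier.median())
```

A single outlier pulls the mean from 100000 to roughly 216667, whereas the median only shifts from 100000 to 102500.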

Even for EN (Entry-Level) engineers, the 25th percentile (**25%**) is **58000 USD**, which is really good.

We can calculate statistics such as the `mean` and `median` of **salary_in_usd** for each **employment_type**.

In [None]:
jobtype_df = df.groupby('employment_type')['salary_in_usd']

In [None]:
jobtype_df.describe()

In [None]:
jobtype_df.median()

In [None]:
jobtype_df.mean()

It can be observed from our mean and median calculations that **full-time** employees are usually paid the most, followed by **contract-based employees**.

# Top 20 Job Types 

We have already seen that our dataset divides ML engineers into a total of 155 different job titles. So which category has the highest average paycheck?

We can use the `groupby()` method to group **salary_in_usd** by **job_title**, then the `sort_values()` method to sort the values in descending order, and finally display the top 20 with `head(20)`.

In [None]:
jobtitle_df = df.groupby('job_title')['salary_in_usd']

The cells below produce a detailed report of the `mean` and `median` for all 155 job titles.

In [None]:
pd.set_option('display.max_rows', None)
jobtitle_df1 = jobtitle_df.mean().reset_index()

In [None]:
jobtitle_df2 = jobtitle_df.median().reset_index()

In [None]:
jobtitle_df1

In [None]:
jobtitle_df2

In [None]:
top_titles_mean = jobtitle_df1.sort_values(by='salary_in_usd', ascending=False).head(20)
top_titles_median = jobtitle_df2.sort_values(by='salary_in_usd', ascending=False).head(20)

Below are the top 20 ML engineering job titles by mean and by median.

In [None]:
top_titles_mean

In [None]:
top_titles_median
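
With so many job titles, some of them may appear in only a handful of rows, and a single high salary can then push a rare title to the top of the ranking. One possible refinement, sketched here on made-up data, is to compute the row count alongside the mean and median and rank only titles with enough rows (the `toy` frame and the `min_rows` threshold are assumptions for illustration, not part of the original analysis):

```python
import pandas as pd

# Hypothetical job titles and salaries.
toy = pd.DataFrame({
    "job_title": ["A", "A", "A", "B", "C", "C"],
    "salary_in_usd": [100, 120, 110, 500, 200, 220],
})

# Mean, median and number of rows per job title in one table.
stats = (
    toy.groupby("job_title")["salary_in_usd"]
       .agg(["mean", "median", "count"])
       .reset_index()
)

# Hypothetical threshold: ignore titles backed by fewer rows than this.
min_rows = 2

# Rank the remaining titles by median salary, highest first.
ranked = (
    stats[stats["count"] >= min_rows]
    .sort_values(by="median", ascending=False)
    .head(20)
)

print(ranked)
```

In this toy example title "B" has the highest salary but only one row, so it is left out of the ranking.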

Another important analysis is the impact of **remote work** vs. **non-remote work** on the salary engineers are paid.

In [None]:
location_sal_df = df.groupby('remote_ratio')['salary_in_usd']

In [None]:
location_sal_df.mean(), location_sal_df.median()

Interestingly, engineers who work remotely are paid almost the same as engineers with no remote work, with the mean and median for non-remote workers being slightly higher.

# Data Representation and Visualization

Graphs and charts are a great way to visualise our inferences, as they provide quick insights into our detailed analysis. We will be using `matplotlib` and `seaborn` for visualisation.

In [None]:
plt.figure(figsize=(7, 5))

sns.barplot(x = 'experience_level', y = 'salary_in_usd', data=df, palette = 'muted')

plt.xlabel("Experience level of employee")
plt.ylabel("Salary of Employee in USD")
plt.title("Salary in USD by Experience Level")

plt.show()

**Experience Levels**
* MI: Mid-Level Experience
* SE: Senior-Level Experience
* EN: Entry-Level Experience
* EX: Executive-Level Experience


In [None]:
mean_salary = jobtype_df.mean().reset_index()
median_salary = jobtype_df.median().reset_index()

# Creating a figure with two subplots
fig, axes = plt.subplots(1 , 2, figsize=(14, 5))

sns.barplot(ax=axes[0], x = 'employment_type', y = 'salary_in_usd', data=mean_salary, palette='flare')

axes[0].set_title("Mean Salary in USD by Employment Type")
axes[0].set_xlabel("Employment Type of Employee")
axes[0].set_ylabel("Mean Salary in USD")

# Second subplot: use the median table (the original passed mean_salary here by mistake)
sns.barplot(ax=axes[1], x='employment_type', y='salary_in_usd', data=median_salary, palette='flare')
axes[1].set_title("Median Salary in USD by Employment Type")
axes[1].set_xlabel("Employment Type of Employee")
axes[1].set_ylabel("Median Salary in USD")

plt.tight_layout()
plt.show()

**Employment Type**

* FT: Full-Time Employee
* CT: Contract-Based Employee
* PT: Part-Time Employee
* FL: Freelance Employee

In [None]:
plt.figure(figsize=(11, 6))

sns.barplot(x = 'remote_ratio', y = 'salary_in_usd', data=df, palette = 'flare')

plt.xlabel("Remote Ratio (0 for no remote work)")
plt.ylabel("Salary of Employee in USD")
plt.title("Salary in USD by Job Remote Ratio")

plt.show()

In [None]:
fig, axes = plt.subplots(2, 1, figsize=(14, 10))

# Plot the mean-based barplot
sns.barplot(ax=axes[0], x='job_title', y='salary_in_usd', data=top_titles_mean, palette='flare')
axes[0].set_title("Salary in USD (mean) by top 20 Job Titles")
axes[0].set_xlabel("Job Title")
axes[0].set_ylabel("Mean Salary in USD")
axes[0].tick_params(axis='x', rotation=80)

# Plot the median-based barplot
sns.barplot(ax=axes[1], x='job_title', y='salary_in_usd', data=top_titles_median, palette='flare')
axes[1].set_title("Salary in USD (median) by top 20 Job Titles")
axes[1].set_xlabel("Job Title")
axes[1].set_ylabel("Median Salary in USD")
axes[1].tick_params(axis='x', rotation=80)

plt.tight_layout()

plt.show()
