#**Exploratory Data Analysis**
This project involves a thorough analysis of the selected data to identify patterns, trends, and relevant insights.

**Description:** Choose a dataset of interest and explore its characteristics. Clean and prepare the data, then use visualization, descriptive statistics, and interactive exploration to find insights. Highlight noteworthy patterns, trends, or anomalies in the data.

#**About the Dataset**
The dataset for this project contains salary information for various positions in the field of Data Science. It provides a detailed view of remuneration across roles such as Data Scientist, Machine Learning Engineer, and Data Analyst.

**Dataset available at:** https://www.kaggle.com/datasets/arnabchaki/data-science-salaries-2023

#**Data Cleaning and Preparation**
I start by inspecting the original dataset to see which fields are worth exploring.

In [13]:
import pandas as pd

In [65]:
df = pd.read_csv('ds_salaries.csv')

# Display the DataFrame; the notebook abbreviates long output on its own
df

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2023,SE,FT,Principal Data Scientist,80000,EUR,85847,ES,100,ES,L
1,2023,MI,CT,ML Engineer,30000,USD,30000,US,100,US,S
2,2023,MI,CT,ML Engineer,25500,USD,25500,US,100,US,S
3,2023,SE,FT,Data Scientist,175000,USD,175000,CA,100,CA,M
4,2023,SE,FT,Data Scientist,120000,USD,120000,CA,100,CA,M
...,...,...,...,...,...,...,...,...,...,...,...
3750,2020,SE,FT,Data Scientist,412000,USD,412000,US,100,US,L
3751,2021,MI,FT,Principal Data Scientist,151000,USD,151000,US,100,US,L
3752,2020,EN,FT,Data Scientist,105000,USD,105000,US,100,US,S
3753,2020,EN,CT,Business Data Analyst,100000,USD,100000,US,100,US,L


In [64]:
columns = df[['work_year', 'experience_level', 'job_title', 'salary_in_usd', 'employee_residence', 'company_location', 'company_size']]

# Display the selected columns; the notebook abbreviates long output on its own
columns

Unnamed: 0,work_year,experience_level,job_title,salary_in_usd,employee_residence,company_location,company_size
0,2023,SE,Principal Data Scientist,85847,ES,ES,L
1,2023,MI,ML Engineer,30000,US,US,S
2,2023,MI,ML Engineer,25500,US,US,S
3,2023,SE,Data Scientist,175000,CA,CA,M
4,2023,SE,Data Scientist,120000,CA,CA,M
...,...,...,...,...,...,...,...
3750,2020,SE,Data Scientist,412000,US,US,L
3751,2021,MI,Principal Data Scientist,151000,US,US,L
3752,2020,EN,Data Scientist,105000,US,US,S
3753,2020,EN,Business Data Analyst,100000,US,US,L


I have selected a few columns that caught my attention.
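Before filtering further, it can help to check the selected columns for missing values and repeated rows. The snippet below is a minimal sketch of such checks on a small made-up frame (`sample` and its values are hypothetical, not taken from the dataset); the same calls should apply to `columns`.

```python
import pandas as pd

# Toy rows (made up for illustration) reusing some of the column names above.
sample = pd.DataFrame({
    "work_year": [2023, 2023, 2022, 2022],
    "experience_level": ["SE", "MI", "EN", "EN"],
    "job_title": ["Data Analyst", "ML Engineer", "Data Analyst", "Data Analyst"],
    "salary_in_usd": [120000, 30000, 50000, 50000],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# Number of rows that exactly repeat an earlier row
# (the last two toy rows are identical).
duplicate_rows = int(sample.duplicated().sum())

# Drop exact repeats only if they are known to be data-entry errors.
deduplicated = sample.drop_duplicates()

print(missing_per_column)
print(duplicate_rows)
```

In salary data, identical rows may be legitimate (two people with the same role and pay), so dropping duplicates is a judgment call rather than a default step.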

The dataset includes several positions within the field of data analysis. I have decided to focus on the 'Data Analyst' position.

In [63]:
data_analyst = columns[columns['job_title'] == 'Data Analyst']

data_analyst.shape

(612, 7)

The dataset contains 612 rows related to the 'Data Analyst' position.
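Note that the `==` comparison keeps only rows whose title is exactly 'Data Analyst', so related titles such as 'Business Data Analyst' are excluded. The sketch below illustrates the difference between an exact match and a substring match on a small hypothetical Series (`titles` is made up for illustration).

```python
import pandas as pd

# Hypothetical job titles, for illustration only.
titles = pd.Series(
    ["Data Analyst", "Business Data Analyst", "Data Scientist", "Data Analyst"],
    name="job_title",
)

# Exact match: only rows equal to the string.
exact = titles[titles == "Data Analyst"]

# Substring match: also keeps titles that merely contain the string.
related = titles[titles.str.contains("Data Analyst", regex=False)]

# Frequency of each distinct title, useful for choosing what to filter on.
counts = titles.value_counts()

print(len(exact), len(related))
print(counts)
```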

In [62]:
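# Group the Data Analyst rows by work year
years = data_analyst.groupby('work_year')

# Map each year to its sub-DataFrame (loop variable renamed so it no longer shadows `years`)
years_dict = {year: group for year, group in years}

# .get() returns None if a year is absent from the data
year_20 = years_dict.get(2020)
year_21 = years_dict.get(2021)
year_22 = years_dict.get(2022)
year_23 = years_dict.get(2023)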

I have created a dictionary that maps each work year in the dataset to a DataFrame containing that year's rows.
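To make the year-to-rows mapping concrete, here is a minimal sketch on a made-up frame (`toy` and its values are hypothetical). It also shows `get_group`, a built-in pandas alternative to building the dictionary by hand.

```python
import pandas as pd

# Toy data, for illustration only.
toy = pd.DataFrame({
    "work_year": [2020, 2021, 2021, 2023],
    "salary_in_usd": [40000, 55000, 60000, 70000],
})

grouped = toy.groupby("work_year")

# Iterating a GroupBy yields (key, sub-DataFrame) pairs.
by_year = {year: group for year, group in grouped}

rows_2021 = by_year.get(2021)   # the two 2021 rows
rows_2022 = by_year.get(2022)   # None: no 2022 rows in the toy data

# get_group returns the same rows without an intermediate dictionary,
# but raises KeyError for a missing year instead of returning None.
same_rows = grouped.get_group(2021)

print(rows_2021)
```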

After cleaning and preparing the dataset, the first objective of my analysis will be to determine the average salary for each experience level per year.

In [97]:
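# Map each label to the yearly DataFrame built above (avoids looking names up in globals())
years = {'year_20': year_20, 'year_21': year_21, 'year_22': year_22, 'year_23': year_23}

result = {}

for label, frame in years.items():
    # Keep entry-level (EN) rows and average their USD salary
    entry = frame[frame['experience_level'] == 'EN']
    result[label] = entry['salary_in_usd'].mean()

result_df = pd.DataFrame(result, index=[0])

result_df.head()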


Unnamed: 0,year_20,year_21,year_22,year_23
0,40376.2,55926.285714,49971.894737,69523.1875


The average annual salary for an entry-level data analyst is as follows:

1.   2020 - 40,376.20 in USD
2.   2021 - 55,926.29 in USD
3.   2022 - 49,971.89 in USD
4.   2023 - 69,523.19 in USD
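The two-decimal figures can be produced directly from the computed means with Python's format specification, which rounds to the nearest cent rather than truncating. A minimal sketch, using the entry-level values from the output above (the names `means` and `formatted` are illustrative):

```python
# Entry-level means copied from the result table above.
means = {2020: 40376.2, 2021: 55926.285714, 2022: 49971.894737, 2023: 69523.1875}

# ",.2f" adds thousands separators and rounds to two decimals.
formatted = {year: f"{value:,.2f}" for year, value in means.items()}

for year, text in formatted.items():
    print(f"{year} - {text} in USD")
```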

In [99]:
# Map each label to the yearly DataFrame built above (avoids looking names up in globals())
years = {'year_20': year_20, 'year_21': year_21, 'year_22': year_22, 'year_23': year_23}

result = {}

for label, frame in years.items():
    # Keep mid-level (MI) rows and average their USD salary
    entry = frame[frame['experience_level'] == 'MI']
    result[label] = entry['salary_in_usd'].mean()

result_df = pd.DataFrame(result, index=[0])

result_df.head()

Unnamed: 0,year_20,year_21,year_22,year_23
0,46586.333333,75427.875,105486.3875,102252.407895


The average annual salary for a mid-level data analyst is as follows:

1.   2020 - 46,586.33 in USD
2.   2021 - 75,427.88 in USD
3.   2022 - 105,486.39 in USD
4.   2023 - 102,252.41 in USD

In [100]:
# Map each label to the yearly DataFrame built above (avoids looking names up in globals())
years = {'year_20': year_20, 'year_21': year_21, 'year_22': year_22, 'year_23': year_23}

result = {}

for label, frame in years.items():
    # Keep senior-level (SE) rows and average their USD salary;
    # .mean() of an empty selection is NaN, with no division warning
    entry = frame[frame['experience_level'] == 'SE']
    result[label] = entry['salary_in_usd'].mean()

result_df = pd.DataFrame(result, index=[0])

result_df.head()


Unnamed: 0,year_20,year_21,year_22,year_23
0,,96769.5,114062.085714,125788.944724


The average annual salary for a senior-level data analyst is as follows:

1.   2020 - The dataset contains no senior-level Data Analyst rows for 2020, so the average is undefined (NaN).
2.   2021 - 96,769.50 in USD
3.   2022 - 114,062.09 in USD
4.   2023 - 125,788.94 in USD
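The three per-level calculations above can also be expressed as a single grouped aggregation. The sketch below is one possible way to do it, shown on a made-up frame (`toy` and its values are hypothetical); applied to `data_analyst`, the same chain should yield one row per experience level and one column per year.

```python
import pandas as pd

# Toy data, for illustration only; note there are no SE rows for 2020.
toy = pd.DataFrame({
    "work_year": [2020, 2020, 2021, 2021, 2021],
    "experience_level": ["EN", "EN", "EN", "SE", "SE"],
    "salary_in_usd": [40000, 42000, 55000, 90000, 100000],
})

# Mean salary per (experience level, year), then pivot years into columns.
mean_salary = (
    toy.groupby(["experience_level", "work_year"])["salary_in_usd"]
    .mean()
    .unstack("work_year")
)

print(mean_salary)
```

Combinations with no rows, such as senior level in 2020 here, appear as NaN in the resulting table, which matches the missing 2020 value in the senior-level output.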