<a href="https://www.kaggle.com/code/mohamedabidi97/eda-plotly-data-science-jobs?scriptVersionId=109620061" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<div style="text-align:center">  
    <img src="https://i.imgur.com/ljxGToi.png"/>
</div>

<div style="text-align:center">  
    <img src="https://i.imgur.com/IxlwwnX.png"/>
</div>



# 🪧 Context

### What is Data science?
Data science is the domain of study that deals with vast volumes of data using modern tools and techniques to find unseen patterns, derive meaningful information, and make business decisions. Data science uses complex machine learning algorithms to build predictive models.

The data used for analysis can come from many different sources and be presented in various formats.


### The Data Science Lifecycle

Now that you know what data science is, let us focus on the data science lifecycle. It consists of five distinct stages, each with its own tasks:

- Capture: Data Acquisition, Data Entry, Signal Reception, Data Extraction. This stage involves gathering raw structured and unstructured data.
- Maintain: Data Warehousing, Data Cleansing, Data Staging, Data Processing, Data Architecture. This stage covers taking the raw data and putting it in a form that can be used.
- Process: Data Mining, Clustering/Classification, Data Modeling, Data Summarization. Data scientists take the prepared data and examine its patterns, ranges, and biases to determine how useful it will be in predictive analysis.
- Analyze: Exploratory/Confirmatory, Predictive Analysis, Regression, Text Mining, Qualitative Analysis. Here is the real meat of the lifecycle. This stage involves performing the various analyses on the data.
- Communicate: Data Reporting, Data Visualization, Business Intelligence, Decision Making. In this final step, analysts prepare the analyses in easily readable forms such as charts, graphs, and reports.


<div class="alert alert-block alert-info" style="color: black; text-align: center">
    ℹ️ Information: Don't forget to hover your mouse 🖱️ to see values and interact with charts. Happy watching!
</div>

# <p style="background-color:#F3A71D;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:5px;padding:7px">Import Libraries 📦</p>


In [1]:
!pip install country_converter

Collecting country_converter
  Downloading country_converter-0.7.7.tar.gz (51 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: country_converter
  Building wheel for country_converter (setup.py) ... done
  Created wheel for country_converter: filename=country_converter-0.7.7-py3-none-any.whl size=53786 sha256=97aa2e5ad15fbb26820ad9bccb171ba1e9a9cbf04b91862f2579a1a7e3ec3b4d
  Stored in directory: /root/.cache/pip/wheels/e8/e6/60/61798a8a91462250002293d1c8cc8de90a130119a813277ccc
Successfully built country_converter
Installing collected packages: country_converter
Successfully installed country_converter-0.7.7

In [2]:
import pandas as pd
import numpy as np

import plotly.express as px
import plotly.graph_objects as go

import country_converter as coco
from IPython.display import HTML as html_print

# <p style="background-color:#F3A71D;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:5px;padding:7px">Functions ⚙️ </p>


In [3]:
def cprint(title, text, color='#00ADB7'):
    # Render a colored, bold title followed by the text as HTML in the notebook output.
    text = "<strong style=color:{}>{}:</strong><br>".format(color, title) + \
            "<text>{}</text>".format(text)
    return html_print(text)


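def hatching_plot(df, x, y, color, pattern, pattern_shape, title, description):
    # Bar chart with hatched bars: `pattern` is the column mapped to hatch patterns,
    # `pattern_shape` is the list of patterns, `description` is the note shown under the plot.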
    fig = px.bar(df, x=x, y=y, 
                       color=color, pattern_shape=pattern,
                       pattern_shape_sequence=pattern_shape)
    fig.update_yaxes(showgrid=False)
    fig.update_xaxes(categoryorder='total descending')
    fig.update_traces(hovertemplate=None, texttemplate="%{y}", textfont_color='white', marker=dict(line=dict(color='#000000', width=2)))
    fig.update_layout(margin=dict(t=80, b=100, l=90, r=60),
                            title_text=title,
                            xaxis_tickangle=360,
                            xaxis_title=' ', yaxis_title="Count",
                            plot_bgcolor='#f1deba', paper_bgcolor='#f1deba',
                            title_font=dict(size=25, color='black', family="Lato, sans-serif"),
                            font=dict(color='black'),
                            )
    fig.add_annotation(dict(font=dict(color='black',size=15),
                                            x=0,
                                            y=-0.2,
                                            showarrow=False,
                                            text=description,
                                            textangle=0,
                                            xanchor='left',
                                            xref="paper",
                                            yref="paper"))      
    fig.show()

# <p style="background-color:#F3A71D;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:5px;padding:7px">Dataset 🗄️ </p>


In [4]:
df = pd.read_csv("../input/data-science-job-salaries/ds_salaries.csv")

In [5]:
cprint("Size of the dataset", df.shape)

In [6]:
df.head()

Unnamed: 0.1,Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,0,2020,MI,FT,Data Scientist,70000,EUR,79833,DE,0,DE,L
1,1,2020,SE,FT,Machine Learning Scientist,260000,USD,260000,JP,0,JP,S
2,2,2020,SE,FT,Big Data Engineer,85000,GBP,109024,GB,50,GB,M
3,3,2020,MI,FT,Product Data Analyst,20000,USD,20000,HN,0,HN,S
4,4,2020,SE,FT,Machine Learning Engineer,150000,USD,150000,US,50,US,L


In [7]:
cprint("All columns of the dataset", df.columns)

In [8]:
df = df.drop("Unnamed: 0", axis=1)

In [9]:
df.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2020,MI,FT,Data Scientist,70000,EUR,79833,DE,0,DE,L
1,2020,SE,FT,Machine Learning Scientist,260000,USD,260000,JP,0,JP,S
2,2020,SE,FT,Big Data Engineer,85000,GBP,109024,GB,50,GB,M
3,2020,MI,FT,Product Data Analyst,20000,USD,20000,HN,0,HN,S
4,2020,SE,FT,Machine Learning Engineer,150000,USD,150000,US,50,US,L


In [10]:
df.dtypes

work_year              int64
experience_level      object
employment_type       object
job_title             object
salary                 int64
salary_currency       object
salary_in_usd          int64
employee_residence    object
remote_ratio           int64
company_location      object
company_size          object
dtype: object

# <p style="background-color:#F3A71D;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:5px;padding:7px">Data Understanding </p>

- work_year:
The year the salary was paid.
- experience_level:
The experience level in the job during the year, with the following possible values: EN (Entry-level / Junior), MI (Mid-level / Intermediate), SE (Senior-level / Expert), EX (Executive-level / Director).
- employment_type:
The type of employment for the role: PT (Part-time), FT (Full-time), CT (Contract), FL (Freelance).
- job_title:
The role worked in during the year.
- salary:
The total gross salary amount paid.
- salary_currency:
The currency of the salary paid, as an ISO 4217 currency code.
- salary_in_usd:
The salary in USD (FX rate divided by avg. USD rate for the respective year via fxdata.foorilla.com).
- employee_residence:
Employee's primary country of residence during the work year, as an ISO 3166 country code.
- remote_ratio:
The overall amount of work done remotely, with the following possible values: 0 (no remote work, less than 20%), 50 (partially remote), 100 (fully remote, more than 80%).
- company_location:
The country of the employer's main office or contracting branch, as an ISO 3166 country code.
- company_size:
The average number of people that worked for the company during the year: S (fewer than 50 employees, small), M (50 to 250 employees, medium), L (more than 250 employees, large).
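
The data dictionary above can also serve as a quick sanity check. The sketch below is not part of the original notebook: `ALLOWED_CODES` and `find_unexpected_codes` are hypothetical names, and the sample rows are toy values made up for illustration. It assumes the check runs on the raw codes, before any relabeling.

```python
import pandas as pd

# Codes listed in the data dictionary above.
ALLOWED_CODES = {
    "experience_level": {"EN", "MI", "SE", "EX"},
    "employment_type": {"PT", "FT", "CT", "FL"},
    "remote_ratio": {0, 50, 100},
    "company_size": {"S", "M", "L"},
}


def find_unexpected_codes(frame, allowed=ALLOWED_CODES):
    """Return {column: set of values that the data dictionary does not list}."""
    unexpected = {}
    for column, codes in allowed.items():
        extra = set(frame[column].unique().tolist()) - codes
        if extra:
            unexpected[column] = extra
    return unexpected


# Toy rows for illustration only (not taken from the dataset);
# "XX" is deliberately invalid.
sample = pd.DataFrame({
    "experience_level": ["MI", "SE", "XX"],
    "employment_type": ["FT", "FT", "CT"],
    "remote_ratio": [0, 50, 100],
    "company_size": ["L", "S", "M"],
})
print(find_unexpected_codes(sample))
```

On the real data, calling `find_unexpected_codes(df)` right after loading should return an empty dict if every value matches the documented codes.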

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607 entries, 0 to 606
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           607 non-null    int64 
 1   experience_level    607 non-null    object
 2   employment_type     607 non-null    object
 3   job_title           607 non-null    object
 4   salary              607 non-null    int64 
 5   salary_currency     607 non-null    object
 6   salary_in_usd       607 non-null    int64 
 7   employee_residence  607 non-null    object
 8   remote_ratio        607 non-null    int64 
 9   company_location    607 non-null    object
 10  company_size        607 non-null    object
dtypes: int64(4), object(7)
memory usage: 52.3+ KB


In [12]:
cprint("Unique values of work year column: \n",df.work_year.unique())

In [13]:
cprint("Unique values of company location column: \n",df.company_location.unique())

In [14]:
cprint("Unique values of experience level column: \n",df.experience_level.unique())

In [15]:
cprint("Unique values of job title column: \n",df.job_title.unique())

In [16]:
cprint("Unique values of company size column: \n",df.company_size.unique())

In [17]:
cprint("Unique values of salary currency column: \n",df.salary_currency.unique())

In [18]:
cprint("Unique values of employment type column: \n",df.employment_type.unique())

In [19]:
cprint("Unique values of employee residence column: \n",df.employee_residence.unique())

In [20]:
df.isna().sum()

work_year             0
experience_level      0
employment_type       0
job_title             0
salary                0
salary_currency       0
salary_in_usd         0
employee_residence    0
remote_ratio          0
company_location      0
company_size          0
dtype: int64

In [21]:
iso3_country = coco.convert(df['employee_residence'], to="ISO3")
df['employee_residence'] = iso3_country
iso3_country = coco.convert(df['company_location'], to="ISO3")
df['company_location'] = iso3_country

In [22]:
# Map the experience-level codes to readable labels.
df['experience_level'] = df['experience_level'].replace({
    'EN': 'Junior',
    'MI': 'Intermediate',
    'SE': 'Expert',
    'EX': 'Director',
})

In [23]:
# Map the employment-type codes to readable labels.
df['employment_type'] = df['employment_type'].replace({
    'FT': 'Full-Time',
    'CT': 'Contract',
    'PT': 'Part-Time',
    'FL': 'Freelance',
})

<div class="alert alert-block alert-info" style="color: black;">  
<h1 >Insights 💡</h1>
☞ I dropped an extra column.<br>
☞ No missing values in the dataset.<br>
☞ I renamed some column values to be more meaningful.<br>
</div>

# <p style="background-color:#02ADB7;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:5px;padding:7px">Questions 🤔 </p>
<ul style="font-size:17px; text-align: center">
    What is the most <b>common job</b> in the data science field ❓<br>
    Which countries have the <b>highest</b> salaries ❓<br>
    What is the <b>average</b> salary of data science jobs per <b>year</b> ❓<br>
</ul>
<h3 style="font-size:17px; text-align:center">The best answers will be provided by some lovely plots. Let's do it 🚀</h3>

# <p style="background-color:#F3A71D;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:5px;padding:7px">Data Visualization 📊 </p>

In [24]:
df.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2020,Intermediate,Full-Time,Data Scientist,70000,EUR,79833,DEU,0,DEU,L
1,2020,Expert,Full-Time,Machine Learning Scientist,260000,USD,260000,JPN,0,JPN,S
2,2020,Expert,Full-Time,Big Data Engineer,85000,GBP,109024,GBR,50,GBR,M
3,2020,Intermediate,Full-Time,Product Data Analyst,20000,USD,20000,HND,0,HND,S
4,2020,Expert,Full-Time,Machine Learning Engineer,150000,USD,150000,USA,50,USA,L


In [25]:
top10_job_title = df['job_title'].value_counts()[:10]
fig = go.Figure(data=[go.Pie(labels=top10_job_title.keys(), values=top10_job_title.values, pull=[0.2, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
fig.update_traces(textposition='inside', textinfo='percent+label', marker=dict(line=dict(color='#000000', width=2)))
fig.update_layout(title_text='Top 10 Job Titles', title_font=dict(size=25, color='black', family="Lato, sans-serif"),paper_bgcolor="#f1deba", title_font_color="black",)
# add annotation
fig.add_annotation(dict(font=dict(color='black',size=15),
                                        x=0,
                                        y=-0.08,
                                        showarrow=False,
                                        text="💡 <b>Data Scientist</b> is the most common job title",
                                        textangle=0,
                                        xanchor='left',
                                        xref="paper",
                                        yref="paper"))

fig.show()

In [26]:
result = (
    df.groupby(['work_year', 'company_location'], as_index=False)
      .agg(mean=('salary_in_usd', 'mean'))
)
result1 = result.sort_values(by=["work_year", "mean"], ascending=[False, False ])
n = 10
top_10_each_year = result1.groupby('work_year').apply(lambda group: group.head(n)).reset_index(drop = True)
top_10_each_year

Unnamed: 0,work_year,company_location,mean
0,2020,JPN,150844.5
1,2020,USA,143251.266667
2,2020,NZL,125000.0
3,2020,CAN,117104.0
4,2020,ARE,115000.0
5,2020,GBR,103225.25
6,2020,AUT,82683.5
7,2020,DEU,67157.0
8,2020,LUX,62726.0
9,2020,ESP,59304.5
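
The same "top N per year" table can be built without `apply`. The following is a sketch rather than the notebook's method: `top_n_per_year` is a hypothetical helper, and the toy frame uses made-up values and placeholder country codes. It relies on `groupby(...).head(n)` keeping the first `n` rows of each group in their current order.

```python
import pandas as pd


def top_n_per_year(frame, n=10):
    """Mean salary per (year, company location), keeping the n highest per year."""
    means = (
        frame.groupby(["work_year", "company_location"], as_index=False)
             .agg(mean=("salary_in_usd", "mean"))
    )
    return (
        means.sort_values(["work_year", "mean"], ascending=[True, False])
             .groupby("work_year")
             .head(n)  # first n rows of each year, in sorted order
             .reset_index(drop=True)
    )


# Toy values for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "work_year": [2020, 2020, 2020, 2021, 2021],
    "company_location": ["AAA", "AAA", "BBB", "AAA", "CCC"],
    "salary_in_usd": [100, 200, 50, 300, 400],
})
top1 = top_n_per_year(toy, n=1)
print(top1)
```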


In [27]:
fig_bar = px.bar(top_10_each_year, x="company_location", y="mean", color = top_10_each_year.index,
                 animation_frame="work_year", color_continuous_scale=px.colors.sequential.Aggrnyl,
                 range_y=[0,200000])
fig_bar.update_yaxes(showgrid=False)
fig_bar.update_xaxes(categoryorder='total descending')
fig_bar.update_traces(hovertemplate=None, texttemplate="%{y}", textfont_color='white', marker=dict(line=dict(color='#000000', width=2)))
fig_bar.update_layout(margin=dict(t=70, b=0, l=70, r=40),
                        coloraxis_showscale=False,
                        title_text='Average of Salary "$" Per Country 🏁',
                        hovermode="x unified",
                        xaxis_tickangle=360,
                        xaxis_title=' ', yaxis_title=" ",
                        plot_bgcolor='#f1deba', paper_bgcolor='#f1deba',
                        title_font=dict(size=25, color='black', family="Lato, sans-serif"),
                        font=dict(color='black'),
                        legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1)
                          )
fig_bar.add_annotation(dict(font=dict(color='black',size=15),
                                        x=0,
                                        y=-0.2,
                                        showarrow=False,
                                        text="💡 🇯🇵  🇷🇺  🇺🇸 have the <b>highest</b> average salaries.",
                                        textangle=0,
                                        xanchor='left',
                                        xref="paper",
                                        yref="paper"))

fig_bar.show()

In [28]:
avg_salary_year = df.groupby("work_year")["salary_in_usd"].mean()
fig = go.Figure(data=[go.Pie(labels=avg_salary_year.keys(), values=avg_salary_year.values, pull=[0, 0, 0.2])])
fig.update_traces(textposition='inside', textinfo='percent+label', marker=dict(line=dict(color='#000000', width=2)))
fig.update_layout(title_text='Average Of Salaries Per Year 💵', title_font=dict(size=25, color='black', family="Lato, sans-serif"),paper_bgcolor="#f1deba", title_font_color="black",)

# add annotation
fig.add_annotation(dict(font=dict(color='black',size=15),
                                        x=0,
                                        y=-0.08,
                                        showarrow=False,
                                        text="💡 2022 has the highest average salary, at about $124k 🤑",
                                        textangle=0,
                                        xanchor='left',
                                        xref="paper",
                                        yref="paper"))
fig.show()
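
A pie chart shows each year's average as a share of the sum of averages, which can make the year-to-year change hard to read. As a complementary view, the sketch below computes the percentage change between consecutive years. `yearly_salary_change` is a hypothetical helper and the toy numbers are made up for illustration.

```python
import pandas as pd


def yearly_salary_change(frame):
    """Mean salary per year plus the percentage change from the previous year."""
    avg = frame.groupby("work_year")["salary_in_usd"].mean()
    return pd.DataFrame({"mean": avg, "pct_change": avg.pct_change() * 100})


# Toy values for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "work_year": [2020, 2020, 2021, 2022],
    "salary_in_usd": [100, 100, 110, 121],
})
summary = yearly_salary_change(toy)
print(summary.round(1))
```

The first year has no previous year to compare against, so its `pct_change` is `NaN`.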

In [29]:
experience = df['experience_level'].value_counts().rename_axis('name').reset_index(name='counts')
fig_bar = px.bar(experience, x="name", y="counts", color = experience.counts,
                color_continuous_scale=px.colors.sequential.Aggrnyl)
fig_bar.update_yaxes(showgrid=False)
fig_bar.update_xaxes(categoryorder='total descending')
fig_bar.update_traces(hovertemplate=None, texttemplate="%{y}", textfont_color='white', marker=dict(line=dict(color='#000000', width=2)))
fig_bar.update_layout(margin=dict(t=70, b=90, l=90, r=40),
                        coloraxis_showscale=False,
                        title_text='Distribution of Experience Level',
                        hovermode="x unified",
                        xaxis_tickangle=360,
                        xaxis_title=' ', yaxis_title=" ",
                        plot_bgcolor='#f1deba', paper_bgcolor='#f1deba',
                        title_font=dict(size=25, color='black', family="Lato, sans-serif"),
                        font=dict(color='black'),
                        legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1)
                          )
fig_bar.add_annotation(dict(font=dict(color='black',size=15),
                                        x=0,
                                        y=-0.2,
                                        showarrow=False,
                                        text="💡 Most of the data is for Expert 👨‍🔬 positions",
                                        textangle=0,
                                        xanchor='left',
                                        xref="paper",
                                        yref="paper"))
fig_bar.show()

In [30]:
fig=px.treemap(df,path=[px.Constant('Job Roles'),'job_title','company_location','experience_level'],template='ggplot2',hover_name='job_title')
fig.update_layout(margin=dict(t=70, b=90, l=90, r=40),
                        coloraxis_showscale=False,
                        title_text='TreeMap of Different Roles in Data Science with Experience Level',
                        hovermode="x unified",
                        xaxis_tickangle=360,
                        xaxis_title=' ', yaxis_title=" ",
                        plot_bgcolor='#f1deba', paper_bgcolor='#f1deba',
                        title_font=dict(size=25, color='black', family="Lato, sans-serif"),
                     
                          )
fig.update_traces(root_color='lightgrey')


<div class="alert alert-block alert-info" style="color: black;">  
<h1>Insights 💡</h1>
☞ The top 3 data science positions are data scientist, data engineer, and data analyst. I think most of the others are related to those 3 positions.<br>
☞ Based on this dataset, most of the positions are at expert and intermediate levels.<br>
☞ Looking at average salary, the USA was in the top 3 in all three years; overall, 2022 had the highest average salary, at about USD 124k/year.<br>
</div>

# <p style="background-color:#02ADB7;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:5px;padding:7px">Questions 🤔 </p>
<ul style="font-size:17px; text-align: center">
    What does the distribution of <b>Company Size</b> look like ❓<br>
    How many companies work <b>Fully Remote</b> ❓<br>
    What are the <b>salaries</b> of the top <b>3</b> positions ❓<br>
</ul>
<h3 style="text-align:center">Let's Continue 🚀</h3>

In [31]:
count_size_experience = (
    df.groupby(['company_size', 'experience_level'], as_index=False)
      .agg(count=('company_size', 'count'))
)
description = "💡 Mid-size companies have the highest number of Expert-level employees"
title = 'Distribution of Company Size with Experience Level'
hatching_plot(count_size_experience,"company_size", "count", "company_size", "experience_level", ["", ".", "+", "x"], title, description)

In [32]:
count_size_remote = (
    df.groupby(['company_size', 'remote_ratio'], as_index=False)
      .agg(count=('company_size', 'count'))
)
# Map the numeric remote ratio to readable labels.
count_size_remote['remote_ratio'] = count_size_remote['remote_ratio'].replace({
    0: 'No Remote',
    50: 'Partially Remote',
    100: 'Fully Remote',
})

In [33]:
description = "💡 More than 50% of positions are Fully Remote positions"
title = "Distribution of Company Size with Remote Ratio"
hatching_plot(count_size_remote,"company_size", "count", "company_size", "remote_ratio", ["", ".", "+"], title, description)
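
The "more than 50%" statement in the description can be checked numerically rather than read off the bars. A minimal sketch, assuming `remote_ratio` still holds the numeric values 0, 50 and 100; `remote_share` is a hypothetical helper and the toy column is made up for illustration.

```python
import pandas as pd

REMOTE_LABELS = {0: "No Remote", 50: "Partially Remote", 100: "Fully Remote"}


def remote_share(frame):
    """Percentage of rows per remote-work category."""
    labels = frame["remote_ratio"].map(REMOTE_LABELS)
    return labels.value_counts(normalize=True).mul(100).round(1)


# Toy values for illustration only (not taken from the dataset).
toy = pd.DataFrame({"remote_ratio": [100, 100, 100, 50, 0]})
shares = remote_share(toy)
print(shares)
```

Calling `remote_share(df)` on the full dataset gives the share to quote in the chart description.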

In [34]:
job_salaries = df.groupby(['work_year', 'job_title'], as_index=False).agg(mean=('salary_in_usd', 'mean'))
top3_job_salaries= job_salaries[job_salaries['job_title'].isin(["Data Scientist", "Data Engineer", "Data Analyst"])]
top3_job_salaries

Unnamed: 0,work_year,job_title,mean
5,2020,Data Analyst,45547.285714
6,2020,Data Engineer,88162.0
10,2020,Data Scientist,85970.52381
34,2021,Data Analyst,79505.411765
38,2021,Data Engineer,83202.53125
43,2021,Data Scientist,70671.733333
72,2022,Data Analyst,100550.739726
77,2022,Data Engineer,126375.696629
80,2022,Data Scientist,136172.090909


In [35]:
fig = px.line(top3_job_salaries, x="work_year", y="mean", color='job_title', markers=True)
fig.update_xaxes(type="category")
fig.update_traces(marker_size=10)
fig.update_layout(margin=dict(t=70, b=90, l=90, r=40),
                        coloraxis_showscale=False,
                        title_text='Top 3 job title Salaries',
                        hovermode="x unified",
                        xaxis_tickangle=360,
                        xaxis_title=' ', yaxis_title=" ",
                        plot_bgcolor='#f1deba', paper_bgcolor='#f1deba',
                        title_font=dict(size=25, color='black', family="Lato, sans-serif"),
                        font=dict(color='black'),
                        legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1)
                          )
fig.show()

In [36]:
fig = px.box(df, x="work_year", y="salary_in_usd", points="all", color_discrete_sequence=[ "#01acb7"])
fig.update_layout(margin=dict(t=70, b=90, l=90, r=40),
                        coloraxis_showscale=False,
                        title_text='Distribution of Salaries',
                        hovermode="x unified",
                        xaxis_tickangle=360,
                        xaxis_title=' ', yaxis_title=" ",
                        plot_bgcolor='#f1deba', paper_bgcolor='#f1deba',
                        title_font=dict(size=25, color='black', family="Lato, sans-serif"),
                        font=dict(color='black'),
                        legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1)
                          )
fig.update_traces(hovertemplate=None, marker =dict(line=dict(color='#000000', width=1)))

fig.add_annotation(dict(font=dict(color='black',size=15),
                                        x=0,
                                        y=-0.2,
                                        showarrow=False,
                                        text="💡 Salaries clearly increased from 2020 to 2022",
                                        textangle=0,
                                        xanchor='left',
                                        xref="paper",
                                        yref="paper"))
fig.show()

<div class="alert alert-block alert-info" style="color: black;">  
<h1>Insights 💡</h1>
☞ The Covid pandemic changed a lot about how people work, which may be why more than 50% of positions are fully remote.<br>
☞ The number of entry-level jobs is about the same across company sizes 🤔.<br>
☞ Medium-sized companies have significantly more experts than small companies.<br>
☞ Salaries have clearly increased over those years.<br>
</div>

# <p style="background-color:#02ADB7;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:5px;padding:7px">Questions 🤔 </p>
<ul style="font-size:17px; text-align: center">
    How much do <b>salaries</b> differ from one company to another based on <b>size</b> ❓<br>
    What are the <b>salaries</b> for different <b>experience levels</b> ❓<br>
    What are the <b>salaries</b> for different <b>employment types</b> ❓<br>

</ul>
<h3 style="text-align:center">Let's Continue 🚀</h3>

In [37]:
mean_exp_salary = (
    df.groupby(['work_year', 'experience_level'], as_index=False)
      .agg(mean=('salary_in_usd', 'mean'))
)

mean_exp_salary

Unnamed: 0,work_year,experience_level,mean
0,2020,Director,202416.5
1,2020,Expert,137240.5
2,2020,Intermediate,85950.0625
3,2020,Junior,63648.6
4,2021,Director,223752.727273
5,2021,Expert,126596.188406
6,2021,Intermediate,85490.088889
7,2021,Junior,59101.021277
8,2022,Director,178313.846154
9,2022,Expert,143043.398964


In [38]:
fig = px.bar(mean_exp_salary, x="work_year", y="mean", 
             color="experience_level", barmode = 'group')
fig.update_yaxes(showgrid=False)
fig.update_traces(hovertemplate=None, texttemplate="%{y}", textfont_color='white', marker=dict(line=dict(color='#000000', width=2)))
fig.update_layout(margin=dict(t=70, b=90, l=90, r=40),
                        coloraxis_showscale=False,
                        title_text='Salary By Experience Level',
                        hovermode="x unified",
                        xaxis_tickangle=360,
                        xaxis_title=' ', yaxis_title=" ",
                        plot_bgcolor='#f1deba', paper_bgcolor='#f1deba',
                        title_font=dict(size=25, color='black', family="Lato, sans-serif"),
                        font=dict(color='black'),
                        legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1)
                          )
fig.show()

In [39]:
mean_size_salary = (
    df.groupby(['work_year', 'company_size'], as_index=False)
      .agg(mean=('salary_in_usd', 'mean'))
)


fig = px.bar(mean_size_salary, x="work_year", y="mean", 
             color="company_size", barmode = 'group')
fig.update_yaxes(showgrid=False)
fig.update_traces(hovertemplate=None, texttemplate="%{y}", textfont_color='white', marker=dict(line=dict(color='#000000', width=2)))
fig.update_layout(margin=dict(t=70, b=90, l=90, r=40),
                        coloraxis_showscale=False,
                        title_text='Salary by Company Size',
                        hovermode="x unified",
                        xaxis_tickangle=360,
                        xaxis_title=' ', yaxis_title=" ",
                        plot_bgcolor='#f1deba', paper_bgcolor='#f1deba',
                        title_font=dict(size=25, color='black', family="Lato, sans-serif"),
                        font=dict(color='black'),
                        legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1)
                          )
fig.show()

In [40]:
mean_empl_salary = (
    df.groupby(['company_size', 'employment_type'], as_index=False)
      .agg(mean=('salary_in_usd', 'mean'))
)


fig = px.bar(mean_empl_salary, x="company_size", y="mean", 
             color="employment_type", barmode = 'group')
fig.update_yaxes(showgrid=False)
fig.update_traces(hovertemplate=None, texttemplate="%{y}", textfont_color='white', marker=dict(line=dict(color='#000000', width=2)))
fig.update_layout(margin=dict(t=70, b=90, l=90, r=40),
                        coloraxis_showscale=False,
                        title_text='Salary By Company Size And Employment Type',
                        hovermode="x unified",
                        xaxis_tickangle=360,
                        xaxis_title=' ', yaxis_title=" ",
                        plot_bgcolor='#f1deba', paper_bgcolor='#f1deba',
                        title_font=dict(size=25, color='black', family="Lato, sans-serif"),
                        font=dict(color='black'),
                        legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1)
                          )
fig.show()

In [41]:
df[df["employment_type"]=="Contract"]

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
28,2020,Junior,Contract,Business Data Analyst,100000,USD,100000,USA,100,USA,L
78,2021,Intermediate,Contract,ML Engineer,270000,USD,270000,USA,100,USA,L
225,2021,Director,Contract,Principal Data Scientist,416000,USD,416000,USA,100,USA,S
283,2021,Expert,Contract,Staff Data Scientist,105000,USD,105000,USA,100,USA,M
489,2022,Junior,Contract,Applied Machine Learning Scientist,29000,EUR,31875,TUN,100,CZE,M


<div class="alert alert-block alert-warning" style="color: black; text-align: center">  
⚠️ Only five rows are Contract positions, and just one of them is at a small company. That's why the Contract averages look a bit odd.
</div>
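
One way to spot averages that rest on very few rows is to report the group size next to each mean. The sketch below uses the five Contract rows from the table above; `salary_summary` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def salary_summary(frame, keys):
    """Mean salary per group together with the number of rows behind it."""
    return (
        frame.groupby(keys, as_index=False)
             .agg(mean=("salary_in_usd", "mean"), n=("salary_in_usd", "count"))
    )


# The five Contract rows shown in the output of cell 41.
contract = pd.DataFrame({
    "company_size": ["L", "L", "S", "M", "M"],
    "employment_type": ["Contract"] * 5,
    "salary_in_usd": [100000, 270000, 416000, 105000, 31875],
})
summary = salary_summary(contract, ["company_size", "employment_type"])
print(summary)
```

Applied to the full `df`, the `n` column shows which bars in the chart are backed by only one or two salaries.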

<div class="alert alert-block alert-info" style="color: black;">  
<h1>Insights 💡</h1>
☞ Over these years, average salaries increased at every experience level except Director.<br>
☞ Salaries increased across all company sizes, possibly because data science jobs are getting more popular.<br>
☞ Large and mid-size companies pay more than small ones, but that doesn't mean big companies always offer better salaries.<br>
☞ Full-time employment has better salaries compared to other types.<br>
</div>

# <p style="background-color:#02ADB7;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:5px;padding:7px">Questions 🤔 </p>
<ul style="font-size:17px; text-align: center">
    Where do most employees <b>reside</b>❓<br>
    How many employees work in their <b>country of residence</b>❓<br>
</ul>
<h3 style="text-align:center">Let's Continue 🚀</h3>

In [42]:
count_employee_residence=df['employee_residence'].value_counts().head(10)
count_employee_residence

USA    332
GBR     44
IND     30
CAN     29
DEU     25
FRA     18
ESP     15
GRC     13
JPN      7
PRT      6
Name: employee_residence, dtype: int64

In [43]:
residence_company = df[['employee_residence', 'company_location']]
residence_company


Unnamed: 0,employee_residence,company_location
0,DEU,DEU
1,JPN,JPN
2,GBR,GBR
3,HND,HND
4,USA,USA
...,...,...
602,USA,USA
603,USA,USA
604,USA,USA
605,USA,USA


In [44]:
residence_company = df[['employee_residence', 'company_location']]

match = []
for (i, j) in zip(residence_company['employee_residence'], residence_company['company_location']):
    if i == j:
        match.append('Yes')
    else:
        match.append('No')

In [45]:
cprint("Same Country: \n",match.count("Yes"))


In [46]:
cprint("Different Country: \n",match.count("No"))


<div class="alert alert-block alert-info" style="color: black; text-align: center">  
ℹ️ A large share of the employees are from the US. One likely reason is that US wages are much higher than in other countries.<br>
ℹ️ Most employees work in their country of residence.
</div>
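
The same-country count can also be computed with a vectorized comparison instead of a Python loop. The sketch below uses the first five rows from `df.head()` plus row 489 from the Contract table; `count_same_country` is a hypothetical helper.

```python
import pandas as pd


def count_same_country(frame):
    """Return (rows where residence equals company location, rows where it differs)."""
    same = frame["employee_residence"] == frame["company_location"]
    return int(same.sum()), int((~same).sum())


# Six rows that appear in the outputs above (rows 0 to 4 and row 489).
rows = pd.DataFrame({
    "employee_residence": ["DEU", "JPN", "GBR", "HND", "USA", "TUN"],
    "company_location": ["DEU", "JPN", "GBR", "HND", "USA", "CZE"],
})
same_count, different_count = count_same_country(rows)
print(same_count, different_count)
```

Comparing the two columns yields a boolean Series, so `.sum()` counts the matches directly and no intermediate list is needed.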

# <p style="background-color:#F3A71D;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:5px;padding:7px">The End 🏁 </p>

<div style="text-align:center">  
    <img src="https://i.imgur.com/lFEO6C5.gif"/>
</div>


<div class="alert alert-block alert-info" style="color: black; font-size:20px; text-align: center">  
❤️  If you liked it, upvote it.<br>
❤️  If you have any questions, feel free to ask.<br>
</div>