# Data Science Fundamentals: Analyzing Data Science Salaries in 2023

## **Requirements:**

Your task is to clean and explore a dataset containing information on salaries in the data science field for 2023. This analysis is key to understanding salary trends and the factors that drive pay differences in this industry.

## **Setup**

In [1]:
import pandas as pd
import numpy as np
import time
import matplotlib.pyplot as plt
import seaborn as sns
import json
import re
import plotly.express as px

path = '../data/kaggle/ds_salaries/ds_salaries.csv'
df = pd.read_csv(filepath_or_buffer=path, sep= ',', header=0)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3755 entries, 0 to 3754
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           3755 non-null   int64 
 1   experience_level    3755 non-null   object
 2   employment_type     3755 non-null   object
 3   job_title           3755 non-null   object
 4   salary              3755 non-null   int64 
 5   salary_currency     3755 non-null   object
 6   salary_in_usd       3755 non-null   int64 
 7   employee_residence  3755 non-null   object
 8   remote_ratio        3755 non-null   int64 
 9   company_location    3755 non-null   object
 10  company_size        3755 non-null   object
dtypes: int64(4), object(7)
memory usage: 322.8+ KB


The Data Science Job Salaries dataset contains 11 columns:

* work_year: The year the salary was paid.
* experience_level: The experience level in the job during the year.
* employment_type: The type of employment for the role.
* job_title: The role worked in during the year.
* salary: The total gross salary amount paid.
* salary_currency: The currency of the salary paid, as an ISO 4217 currency code.
* salary_in_usd: The salary in USD.
* employee_residence: The employee's primary country of residence during the work year, as an ISO 3166 country code.
* remote_ratio: The overall amount of work done remotely.
* company_location: The country of the employer's main office or contracting branch.
* company_size: The median number of people that worked for the company during the year.

In [2]:
df.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2023,SE,FT,Principal Data Scientist,80000,EUR,85847,ES,100,ES,L
1,2023,MI,CT,ML Engineer,30000,USD,30000,US,100,US,S
2,2023,MI,CT,ML Engineer,25500,USD,25500,US,100,US,S
3,2023,SE,FT,Data Scientist,175000,USD,175000,CA,100,CA,M
4,2023,SE,FT,Data Scientist,120000,USD,120000,CA,100,CA,M


## Data cleaning with Python:

### **Detecting and removing duplicate values**

Make sure every record in the dataset is unique.

In [3]:
# Flag rows that exactly repeat an earlier row
duplicados = df.duplicated()
# Count the flagged rows (this cell only counts; it does not drop them)
num_duplicados = duplicados.sum()
print(f"Number of duplicate records: {num_duplicados}")
df.head()

Number of duplicate records: 1171


Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2023,SE,FT,Principal Data Scientist,80000,EUR,85847,ES,100,ES,L
1,2023,MI,CT,ML Engineer,30000,USD,30000,US,100,US,S
2,2023,MI,CT,ML Engineer,25500,USD,25500,US,100,US,S
3,2023,SE,FT,Data Scientist,175000,USD,175000,CA,100,CA,M
4,2023,SE,FT,Data Scientist,120000,USD,120000,CA,100,CA,M
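
The cell above counts duplicated rows but does not remove them, and every later output in this notebook still reports 3755 rows. If the duplicates are to be dropped, one possible approach is sketched below on a small made-up frame (the `toy` data is illustrative, not taken from the dataset). Whether identical rows are true duplicates or different respondents with identical answers is a judgment call, so this is an assumption rather than a required step.

```python
import pandas as pd

# Toy frame standing in for the salary data (values are illustrative only)
toy = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Scientist", "ML Engineer"],
    "salary_in_usd": [120000, 120000, 30000],
})

n_before = len(toy)
# keep="first" retains one copy of each fully repeated row
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)
print(f"Removed {n_before - len(deduped)} duplicate rows")
```

Applying the same call to `df` would change the row counts and summary statistics shown in the rest of the notebook.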


### **Checking and adjusting data types**

Make sure every column matches the data type given in the data dictionary.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3755 entries, 0 to 3754
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           3755 non-null   int64 
 1   experience_level    3755 non-null   object
 2   employment_type     3755 non-null   object
 3   job_title           3755 non-null   object
 4   salary              3755 non-null   int64 
 5   salary_currency     3755 non-null   object
 6   salary_in_usd       3755 non-null   int64 
 7   employee_residence  3755 non-null   object
 8   remote_ratio        3755 non-null   int64 
 9   company_location    3755 non-null   object
 10  company_size        3755 non-null   object
dtypes: int64(4), object(7)
memory usage: 322.8+ KB


In [5]:
df['work_year'] = pd.to_datetime(df['work_year'].astype(str) + '-01-01')
df['experience_level'] = df.experience_level.astype('category')
df['employment_type'] = df.employment_type.astype('category')
df['job_title'] = df.job_title.astype('category')
df['salary_currency'] = df.salary_currency.astype('category')
df['company_location'] = df.company_location.astype('category')
df['employee_residence'] = df.employee_residence.astype('category')
df['company_size'] = df.company_size.astype('category')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3755 entries, 0 to 3754
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   work_year           3755 non-null   datetime64[ns]
 1   experience_level    3755 non-null   category      
 2   employment_type     3755 non-null   category      
 3   job_title           3755 non-null   category      
 4   salary              3755 non-null   int64         
 5   salary_currency     3755 non-null   category      
 6   salary_in_usd       3755 non-null   int64         
 7   employee_residence  3755 non-null   category      
 8   remote_ratio        3755 non-null   int64         
 9   company_location    3755 non-null   category      
 10  company_size        3755 non-null   category      
dtypes: category(7), datetime64[ns](1), int64(3)
memory usage: 152.4 KB
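
A plain `category` dtype sorts its categories alphabetically, which is why `experience_level` shows up as `EN, EX, MI, SE` in later outputs. A minimal sketch of an ordered categorical follows, assuming the codes rank as entry < mid < senior < executive (this ordering is an assumption, it is not stated in the data dictionary above).

```python
import pandas as pd

# Assumed seniority order for the experience codes (hypothetical ranking)
levels = ["EN", "MI", "SE", "EX"]
level_dtype = pd.CategoricalDtype(categories=levels, ordered=True)

sample = pd.Series(["SE", "MI", "EN", "EX", "SE"]).astype(level_dtype)

# Sorting now follows the declared order instead of the alphabet
sorted_levels = sample.sort_values().tolist()
# Ordered categoricals also support comparisons against a category
n_senior_or_above = int((sample >= "SE").sum())
print(sorted_levels, n_senior_or_above)
```

With an ordered dtype, group-by tables and plot axes would list the levels in seniority order.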


### **Consistency of categorical values**

Identify and fix any inconsistencies in categorical values (for example, 'Junior', 'junior', 'JUNIOR').
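
As a minimal sketch of what such a fix could look like, the snippet below normalizes case and surrounding whitespace on a made-up series (the `raw` values are illustrative and do not come from the dataset).

```python
import pandas as pd

# Made-up labels with inconsistent case and stray whitespace
raw = pd.Series(["Junior", "junior ", "JUNIOR", "Senior", " senior"])

# Strip whitespace first, then lowercase, so variants collapse to one label
clean = raw.str.strip().str.lower()
counts = clean.value_counts()
print(counts.to_dict())
```

Comparing `raw.nunique()` with `clean.nunique()` is a quick way to see whether a column had any such variants at all.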


In [6]:
df.work_year.unique()

<DatetimeArray>
['2023-01-01 00:00:00', '2022-01-01 00:00:00', '2020-01-01 00:00:00',
 '2021-01-01 00:00:00']
Length: 4, dtype: datetime64[ns]

In [7]:
# Identify categorical variables
categorical_variables = df.select_dtypes(include=['category']).columns.tolist()
print("Categorical Variables:", categorical_variables)

Categorical Variables: ['experience_level', 'employment_type', 'job_title', 'salary_currency', 'employee_residence', 'company_location', 'company_size']


In [8]:
df.company_location.unique()

['ES', 'US', 'CA', 'DE', 'GB', ..., 'CN', 'NZ', 'CL', 'MD', 'MT']
Length: 72
Categories (72, object): ['AE', 'AL', 'AM', 'AR', ..., 'TR', 'UA', 'US', 'VN']

In [9]:
df.company_size.unique()

['L', 'S', 'M']
Categories (3, object): ['L', 'M', 'S']

In [10]:
list(df.employee_residence.unique())[0:10]

['ES', 'US', 'CA', 'DE', 'GB', 'NG', 'IN', 'HK', 'PT', 'NL']

In [11]:
list(df.salary_currency.unique())

['EUR',
 'USD',
 'INR',
 'HKD',
 'CHF',
 'GBP',
 'AUD',
 'SGD',
 'CAD',
 'ILS',
 'BRL',
 'THB',
 'PLN',
 'HUF',
 'CZK',
 'DKK',
 'JPY',
 'MXN',
 'TRY',
 'CLP']

In [12]:
df.experience_level.unique()

['SE', 'MI', 'EN', 'EX']
Categories (4, object): ['EN', 'EX', 'MI', 'SE']

In [13]:
df.employment_type.unique()

['FT', 'CT', 'FL', 'PT']
Categories (4, object): ['CT', 'FL', 'FT', 'PT']

In [14]:
list(df.job_title.unique())[0:10]

['Principal Data Scientist',
 'ML Engineer',
 'Data Scientist',
 'Applied Scientist',
 'Data Analyst',
 'Data Modeler',
 'Research Engineer',
 'Analytics Engineer',
 'Business Intelligence Engineer',
 'Machine Learning Engineer']

In [15]:
# Step 1: Ensure consistency by converting to lowercase and stripping whitespace
df['job_title'] = df['job_title'].str.lower().str.strip()
# Step 2: Simplify the categories
def simplify_job_title(title):
    if any(keyword in title for keyword in ['principal','director', 'lead', 'manager']):
        return 'management'
    elif 'research' in title:  # also matches 'researcher'
        return 'r&d'
    elif 'consultant' in title:
        return 'consulting'
    else:
        return 'engineers'
    
df['job_title_simplified'] = df['job_title'].apply(simplify_job_title)
# Display the unique simplified job titles
unique_simplified_titles = df['job_title_simplified'].unique()
print("Unique Simplified Job Titles:", unique_simplified_titles)
# Show a sample of the dataset with simplified job titles
print(df[['job_title', 'job_title_simplified']].tail(10))

Unique Simplified Job Titles: ['management' 'engineers' 'r&d' 'consulting']
                               job_title job_title_simplified
3745            director of data science           management
3746                      data scientist            engineers
3747  applied machine learning scientist            engineers
3748                       data engineer            engineers
3749                     data specialist            engineers
3750                      data scientist            engineers
3751            principal data scientist           management
3752                      data scientist            engineers
3753               business data analyst            engineers
3754                data science manager           management


### **Handling missing values: Identify and handle any missing values in the dataset. Fill missing values with a marker appropriate for the data type**

In [16]:
qsna=df.shape[0]-df.isnull().sum(axis=0)
qna=df.isnull().sum(axis=0)
ppna=round(100*(df.isnull().sum(axis=0)/df.shape[0]),2)
aux= {'datos sin NAs en q': qsna, 'Na en q': qna ,'Na en %': ppna}
na=pd.DataFrame(data=aux)
na.sort_values(by='Na en %',ascending=False)

Unnamed: 0,datos sin NAs en q,Na en q,Na en %
work_year,3755,0,0.0
experience_level,3755,0,0.0
employment_type,3755,0,0.0
job_title,3755,0,0.0
salary,3755,0,0.0
salary_currency,3755,0,0.0
salary_in_usd,3755,0,0.0
employee_residence,3755,0,0.0
remote_ratio,3755,0,0.0
company_location,3755,0,0.0
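
The table shows no missing values, so nothing needs to be filled here. For completeness, a hedged sketch of filling by data type on a made-up frame follows; the markers `-1` and `"unknown"` are arbitrary choices for illustration, not values prescribed by the assignment.

```python
import numpy as np
import pandas as pd

# Made-up frame with one gap per column (illustrative values only)
toy = pd.DataFrame({
    "company_size": ["L", None, "S"],
    "remote_ratio": [100.0, np.nan, 0.0],
})

filled = toy.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        # Numeric columns get a sentinel number (assumed marker)
        filled[col] = filled[col].fillna(-1)
    else:
        # Text columns get a sentinel label (assumed marker)
        filled[col] = filled[col].fillna("unknown")

print(filled)
```

For columns already converted to `category`, the marker has to be added with `cat.add_categories` before calling `fillna`, otherwise pandas raises an error.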


### **Detecting anomalous data: Identify and correct any inappropriate or unusual data points (for example, an annual salary of 1 million dollars for an entry-level position).**

In [17]:
fig = px.histogram(df, x='salary_in_usd', nbins=10, title='Salary histogram (USD)')
# Show the figure
fig.show()

In [18]:
print(df.salary_in_usd.describe())
# Create the boxplot of salaries (salary_in_usd on the value axis)
fig = px.box(df, y='salary_in_usd', title='Salary boxplot (USD)')
# Show the figure
fig.show()

count      3755.000000
mean     137570.389880
std       63055.625278
min        5132.000000
25%       95000.000000
50%      135000.000000
75%      175000.000000
max      450000.000000
Name: salary_in_usd, dtype: float64


In [19]:
print(df.groupby(['experience_level'])['salary_in_usd'].describe())
# Create the boxplot
fig = px.box(df, x='experience_level', y='salary_in_usd', title='Salary Boxplot by Experience Level')
# Show the figure
fig.show()

                   count           mean           std      min       25%  \
experience_level                                                           
EN                 320.0   78546.284375  52225.424309   5409.0   40000.0   
EX                 114.0  194930.929825  70661.929661  15000.0  145000.0   
MI                 805.0  104525.939130  54387.685128   5132.0   66837.0   
SE                2516.0  153051.071542  56896.263954   8000.0  115000.0   

                       50%        75%       max  
experience_level                                 
EN                 70000.0  110009.25  300000.0  
EX                196000.0  239000.00  416000.0  
MI                100000.0  135000.00  450000.0  
SE                146000.0  185900.00  423834.0  






In [20]:
print(df.groupby(['salary_currency'])['salary_in_usd'].describe())
# Create the boxplot
fig = px.box(df, x='salary_currency', y='salary_in_usd', title='Salary Boxplot by Payment Currency')
# Show the figure
fig.show()

                  count           mean           std       min        25%  \
salary_currency                                                             
AUD                 9.0   74198.444444  27741.015762   42028.0   53368.00   
BRL                 6.0   12448.000000   5687.275833    6270.0    8171.50   
CAD                25.0   96707.400000  40418.226399   40663.0   69133.00   
CHF                 4.0  100682.000000  30389.015899   56536.0   92656.75   
CLP                 1.0   40038.000000           NaN   40038.0   40038.00   
CZK                 1.0    5132.000000           NaN    5132.0    5132.00   
DKK                 3.0   31192.666667  13596.868475   19073.0   23841.00   
EUR               236.0   62281.733051  29468.700571    6304.0   42026.00   
GBP               161.0   83850.229814  40866.486320   33246.0   58331.00   
HKD                 1.0   65062.000000           NaN   65062.0   65062.00   
HUF                 3.0   29892.666667  10576.261170   17684.0   26709.50   





In [21]:
print(df.groupby(['job_title_simplified'])['salary_in_usd'].describe())
# Create the boxplot
fig = px.box(df, x='job_title_simplified', y='salary_in_usd', title='Salary Boxplot by Job Category')
# Show the figure
fig.show()

                       count           mean           std      min       25%  \
job_title_simplified                                                           
consulting              26.0   86587.769231  39586.574065   5707.0   65308.5   
engineers             3419.0  136317.250073  61818.661334   5132.0   95000.0   
management             176.0  159016.778409  72929.637572  17509.0  115166.5   
r&d                    134.0  151267.917910  73315.912602   5409.0  100000.0   

                           50%        75%       max  
job_title_simplified                                 
consulting             90000.0  122000.00  145000.0  
engineers             134000.0  175000.00  430967.0  
management            151500.0  192777.75  416000.0  
r&d                   149925.0  200000.00  450000.0  


## **Data exploration with Python**


### **Univariate exploratory visualizations**

Create two different types of univariate visualizations. Each visualization must include a brief interpretation within the code file.

In [22]:
# Filter dataset by each job category and create a histogram for each
categories = df['job_title_simplified'].unique()
# Loop through each category to create a histogram
for category in categories:
    filtered_df = df[df['job_title_simplified'] == category]
    fig = px.histogram(filtered_df, x='salary_in_usd', nbins=10, title=f'Salary histogram - {category.capitalize()}')
    fig.show()

In [23]:
# Step 1: Define function to categorize salary_currency values
def categorize_salary_currency(currency):
    if currency == 'USD':
        return 'USD'
    elif currency == 'EUR':
        return 'EU'
    elif currency in ['CHF', 'GBP', 'AUD', 'SGD', 'CAD']:
        return 'CHF-GBP-AUD-SGD-CAD'
    else:
        return 'others'
# Step 2: Apply the categorization function to create a new column
df['salary_currency_category'] = df['salary_currency'].apply(categorize_salary_currency)
# Display the unique categories
unique_currency_categories = df['salary_currency_category'].unique()
print("Unique Salary Currency Categories:", unique_currency_categories)

Unique Salary Currency Categories: ['EU' 'USD' 'others' 'CHF-GBP-AUD-SGD-CAD']


In [24]:
# Filter dataset by each salary currency category and create a histogram for each with consistent scales
currency_categories = df['salary_currency_category'].unique()
# Define the same range for all histograms to maintain consistency in scale
salary_min = df['salary_in_usd'].min()
salary_max = df['salary_in_usd'].max()
# Loop through each currency category to create a histogram with consistent x-axis range
for category in currency_categories:
    filtered_df = df[df['salary_currency_category'] == category]
    fig = px.histogram(
        filtered_df,
        x='salary_in_usd',
        nbins=10,
        title=f'Salary histogram - {category}',
        range_x=[salary_min, salary_max]
    )
    fig.show()

### **Multivariate exploratory visualizations**

Create two different types of multivariate visualizations. Each visualization must include a brief interpretation within the code file.

In [25]:
df.work_year.unique()

<DatetimeArray>
['2023-01-01 00:00:00', '2022-01-01 00:00:00', '2020-01-01 00:00:00',
 '2021-01-01 00:00:00']
Length: 4, dtype: datetime64[ns]

In [26]:
df.groupby(['work_year','experience_level'])['salary_in_usd'].describe()





work_year,experience_level,count,mean,std,min,25%,50%,75%,max
2020-01-01,EN,23.0,57511.608696,54702.473489,5707.0,18817.5,45896.0,71000.0,250000.0
2020-01-01,EX,3.0,139944.333333,163508.499156,15000.0,47416.5,79833.0,202416.5,325000.0
2020-01-01,MI,32.0,87564.71875,75496.425134,6072.0,46509.25,78395.5,107000.0,450000.0
2020-01-01,SE,18.0,137240.5,91121.23766,33511.0,74130.25,118552.0,178500.0,412000.0
2021-01-01,EN,55.0,54905.254545,40083.021451,5409.0,20000.0,55000.0,80000.0,225000.0
2021-01-01,EX,10.0,186128.0,101365.169627,69741.0,132981.0,151833.5,233750.0,416000.0
2021-01-01,MI,92.0,82116.934783,62733.647034,5409.0,39628.5,72606.0,110000.0,423000.0
2021-01-01,SE,73.0,126085.356164,62720.348153,18907.0,77684.0,120000.0,170000.0,276000.0
2022-01-01,EN,124.0,77006.024194,52902.097436,6270.0,39981.25,61252.0,111250.0,300000.0
2022-01-01,EX,41.0,188260.292683,61289.314424,76309.0,145000.0,187200.0,222640.0,324000.0


In [27]:
grouped = df.groupby(['work_year','experience_level'])['salary_in_usd'].describe().reset_index()
grouped 





Unnamed: 0,work_year,experience_level,count,mean,std,min,25%,50%,75%,max
0,2020-01-01,EN,23.0,57511.608696,54702.473489,5707.0,18817.5,45896.0,71000.0,250000.0
1,2020-01-01,EX,3.0,139944.333333,163508.499156,15000.0,47416.5,79833.0,202416.5,325000.0
2,2020-01-01,MI,32.0,87564.71875,75496.425134,6072.0,46509.25,78395.5,107000.0,450000.0
3,2020-01-01,SE,18.0,137240.5,91121.23766,33511.0,74130.25,118552.0,178500.0,412000.0
4,2021-01-01,EN,55.0,54905.254545,40083.021451,5409.0,20000.0,55000.0,80000.0,225000.0
5,2021-01-01,EX,10.0,186128.0,101365.169627,69741.0,132981.0,151833.5,233750.0,416000.0
6,2021-01-01,MI,92.0,82116.934783,62733.647034,5409.0,39628.5,72606.0,110000.0,423000.0
7,2021-01-01,SE,73.0,126085.356164,62720.348153,18907.0,77684.0,120000.0,170000.0,276000.0
8,2022-01-01,EN,124.0,77006.024194,52902.097436,6270.0,39981.25,61252.0,111250.0,300000.0
9,2022-01-01,EX,41.0,188260.292683,61289.314424,76309.0,145000.0,187200.0,222640.0,324000.0


In [28]:
# Create the grouped bar chart (the '50%' column is the median, not the mean)
fig = px.bar(grouped, x='work_year', y='50%', color='experience_level',
             title='Median Salary by Year and Experience Level',
             barmode='group')
fig.show()

In [29]:
grouped = df.groupby(['work_year','salary_currency_category'])['salary_in_usd'].describe().reset_index()
grouped 

Unnamed: 0,work_year,salary_currency_category,count,mean,std,min,25%,50%,75%,max
0,2020-01-01,CHF-GBP-AUD-SGD-CAD,4.0,103989.5,18320.635897,76958.0,101007.5,110948.0,113930.0,117104.0
1,2020-01-01,EU,24.0,59907.625,29242.271459,15966.0,44762.75,53031.5,71136.75,148261.0
2,2020-01-01,USD,38.0,129450.263158,98799.18154,8000.0,75250.0,105500.0,138262.5,450000.0
3,2020-01-01,others,10.0,24214.9,16633.035197,5707.0,7927.25,23502.0,39294.5,45896.0
4,2021-01-01,CHF-GBP-AUD-SGD-CAD,28.0,86362.035714,30746.816964,42028.0,65469.0,82528.0,103292.75,187442.0
5,2021-01-01,EU,44.0,69316.431818,34428.758494,10354.0,47163.75,63831.0,88654.0,173762.0
6,2021-01-01,USD,122.0,123831.106557,76013.64609,9272.0,73250.0,111887.5,165000.0,423000.0
7,2021-01-01,others,36.0,29572.305556,21110.179967,5409.0,16735.0,23476.5,37203.75,94665.0
8,2022-01-01,CHF-GBP-AUD-SGD-CAD,104.0,85608.615385,44882.242933,33246.0,61566.0,80036.0,99429.5,430967.0
9,2022-01-01,EU,115.0,59825.434783,26439.421769,6304.0,38874.5,55685.0,73546.0,172309.0


In [30]:
# Create the grouped bar chart (the '50%' column is the median, not the mean)
fig = px.bar(grouped, x='work_year', y='50%', color='salary_currency_category',
             title='Median Salary by Year and Payment Currency Category',
             barmode='group')
fig.show()

## **Additional analysis:**

### **Descriptive statistics**

Provide a statistical summary of the dataset, including measures of central tendency and dispersion for the numeric variables.

In [31]:
df.describe()

Unnamed: 0,work_year,salary,salary_in_usd,remote_ratio
count,3755,3755.0,3755.0,3755.0
mean,2022-05-17 08:33:29.480692736,190695.6,137570.38988,46.271638
min,2020-01-01 00:00:00,6000.0,5132.0,0.0
25%,2022-01-01 00:00:00,100000.0,95000.0,0.0
50%,2022-01-01 00:00:00,138000.0,135000.0,0.0
75%,2023-01-01 00:00:00,180000.0,175000.0,100.0
max,2023-01-01 00:00:00,30400000.0,450000.0,100.0
std,,671676.5,63055.625278,48.58905


In [32]:
grouped = df.groupby(['salary_currency_category'])['salary_in_usd'].describe().reset_index()
grouped 

Unnamed: 0,salary_currency_category,count,mean,std,min,25%,50%,75%,max
0,CHF-GBP-AUD-SGD-CAD,205.0,85261.980488,39797.093586,33246.0,61566.0,76958.0,103294.0,430967.0
1,EU,236.0,62281.733051,29468.700571,6304.0,42026.0,59020.0,75944.25,214618.0
2,USD,3224.0,149366.906638,58018.440261,7000.0,110000.0,144000.0,184000.0,450000.0
3,others,90.0,31563.466667,46330.459471,5132.0,13617.0,19890.0,35703.75,423834.0


In [33]:
grouped = df.groupby(['experience_level'])['salary_in_usd'].describe().reset_index()
grouped 





Unnamed: 0,experience_level,count,mean,std,min,25%,50%,75%,max
0,EN,320.0,78546.284375,52225.424309,5409.0,40000.0,70000.0,110009.25,300000.0
1,EX,114.0,194930.929825,70661.929661,15000.0,145000.0,196000.0,239000.0,416000.0
2,MI,805.0,104525.93913,54387.685128,5132.0,66837.0,100000.0,135000.0,450000.0
3,SE,2516.0,153051.071542,56896.263954,8000.0,115000.0,146000.0,185900.0,423834.0


In [34]:
grouped = df.groupby(['job_title_simplified'])['salary_in_usd'].describe().reset_index()
grouped 

Unnamed: 0,job_title_simplified,count,mean,std,min,25%,50%,75%,max
0,consulting,26.0,86587.769231,39586.574065,5707.0,65308.5,90000.0,122000.0,145000.0
1,engineers,3419.0,136317.250073,61818.661334,5132.0,95000.0,134000.0,175000.0,430967.0
2,management,176.0,159016.778409,72929.637572,17509.0,115166.5,151500.0,192777.75,416000.0
3,r&d,134.0,151267.91791,73315.912602,5409.0,100000.0,149925.0,200000.0,450000.0


### **Identifying trends**

Analyze and discuss any notable trends you observe in the data, supported by the visualizations and descriptive statistics.

In [35]:
# Group data by work_year and salary_currency_category, calculating the median salary
grouped = df.groupby(['work_year', 'salary_currency_category'])['salary_in_usd'].median().reset_index()
# Create line plot
fig = px.line(grouped, x='work_year', y='salary_in_usd', color='salary_currency_category',
              title='Median Salary Trend by Payment Currency Category',
              markers=True)
# Show the plot
fig.show()

In [36]:
# Group data by work_year and job_title_simplified, calculating the median salary
grouped = df.groupby(['work_year', 'job_title_simplified'])['salary_in_usd'].median().reset_index()
# Create line plot
fig = px.line(grouped, x='work_year', y='salary_in_usd', color='job_title_simplified',
              title='Median Salary Trend by Job Category',
              markers=True)
# Show the plot
fig.show()

In [37]:
# Group data by work_year and experience_level, calculating the median salary
grouped = df.groupby(['work_year', 'experience_level'])['salary_in_usd'].median().reset_index()
# Create line plot
fig = px.line(grouped, x='work_year', y='salary_in_usd', color='experience_level',
              title='Median Salary Trend by Experience Level',
              markers=True)
# Show the plot
fig.show()
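
To back a discussion of trends with numbers, the yearly medians can be turned into year-over-year percentage changes. The following is a minimal sketch on made-up data: the `toy` values and the `median_usd` and `yoy_pct` column names are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Made-up salaries for one experience level across three years
toy = pd.DataFrame({
    "work_year": [2020, 2020, 2021, 2021, 2022, 2022],
    "experience_level": ["SE"] * 6,
    "salary_in_usd": [100000, 120000, 110000, 132000, 130000, 134000],
})

# Median salary per level and year, sorted by year within each level
medians = (
    toy.groupby(["experience_level", "work_year"])["salary_in_usd"]
       .median()
       .reset_index(name="median_usd")
)

# Percentage change versus the previous year, computed within each level
medians["yoy_pct"] = (
    medians.groupby("experience_level")["median_usd"].pct_change() * 100
)
print(medians)
```

Groups with very few records in a given year produce unstable medians, so percentage changes for those groups should be read with caution.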



