# Data Science Fundamentals: Analyzing Data Science Salaries in 2023

## **Requirements:**

Your task is to clean and explore a dataset containing information on salaries in the data science field for 2023. This analysis is key to understanding salary trends and the factors that drive salary differences in this industry.

## **Setup**

In [1]:
import pandas as pd
import numpy as np
import time
import matplotlib.pyplot as plt
import seaborn as sns
import json
import re
import plotly.express as px

path = '../data/kaggle/ds_salaries/ds_salaries.csv'
df = pd.read_csv(filepath_or_buffer=path, sep= ',', header=0)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3755 entries, 0 to 3754
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           3755 non-null   int64 
 1   experience_level    3755 non-null   object
 2   employment_type     3755 non-null   object
 3   job_title           3755 non-null   object
 4   salary              3755 non-null   int64 
 5   salary_currency     3755 non-null   object
 6   salary_in_usd       3755 non-null   int64 
 7   employee_residence  3755 non-null   object
 8   remote_ratio        3755 non-null   int64 
 9   company_location    3755 non-null   object
 10  company_size        3755 non-null   object
dtypes: int64(4), object(7)
memory usage: 322.8+ KB


The Data Science Job Salaries dataset contains 11 columns:

* work_year: The year the salary was paid.
* experience_level: The experience level in the job during the year.
* employment_type: The type of employment for the role.
* job_title: The role worked in during the year.
* salary: The total gross salary amount paid.
* salary_currency: The currency of the salary paid, as an ISO 4217 currency code.
* salary_in_usd: The salary in USD.
* employee_residence: The employee's primary country of residence during the work year, as an ISO 3166 country code.
* remote_ratio: The overall amount of work done remotely.
* company_location: The country of the employer's main office or contracting branch.
* company_size: The median number of people that worked for the company during the year.

In [2]:
df.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2023,SE,FT,Principal Data Scientist,80000,EUR,85847,ES,100,ES,L
1,2023,MI,CT,ML Engineer,30000,USD,30000,US,100,US,S
2,2023,MI,CT,ML Engineer,25500,USD,25500,US,100,US,S
3,2023,SE,FT,Data Scientist,175000,USD,175000,CA,100,CA,M
4,2023,SE,FT,Data Scientist,120000,USD,120000,CA,100,CA,M


## Data cleaning with Python:

### **Detecting and removing duplicate values**

Make sure that every record in the dataset is unique.

In [3]:
# Identify duplicated rows
duplicados = df.duplicated()
# Count the duplicated rows
num_duplicados = duplicados.sum()
print(f"Number of duplicate records: {num_duplicados}")
df.head()

Number of duplicate records: 1171


Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2023,SE,FT,Principal Data Scientist,80000,EUR,85847,ES,100,ES,L
1,2023,MI,CT,ML Engineer,30000,USD,30000,US,100,US,S
2,2023,MI,CT,ML Engineer,25500,USD,25500,US,100,US,S
3,2023,SE,FT,Data Scientist,175000,USD,175000,CA,100,CA,M
4,2023,SE,FT,Data Scientist,120000,USD,120000,CA,100,CA,M
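
The cell above counts the duplicated rows but does not remove them. A minimal sketch of the removal step follows, using a small made-up DataFrame (`toy` and its values are hypothetical, not taken from the dataset). Keep in mind that this dataset has no respondent identifier, so two identical rows may describe two different people; whether to drop them is a judgment call.

```python
import pandas as pd

# Hypothetical toy data mirroring a few columns of the salary dataset.
toy = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Scientist", "ML Engineer", "Data Scientist"],
    "experience_level": ["SE", "SE", "MI", "SE"],
    "salary_in_usd": [175000, 175000, 30000, 120000],
})

# duplicated() flags each repeat of an earlier row; the first occurrence is not flagged.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence; reset_index() renumbers the rows.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(f"Duplicates found: {n_duplicates}, rows kept: {len(deduplicated)}")
```

On the real data the same two calls would be applied to `df`, ideally before any statistics are computed.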


### **Checking and adjusting data types**

Make sure that every column matches the data type stated in the data dictionary.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3755 entries, 0 to 3754
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           3755 non-null   int64 
 1   experience_level    3755 non-null   object
 2   employment_type     3755 non-null   object
 3   job_title           3755 non-null   object
 4   salary              3755 non-null   int64 
 5   salary_currency     3755 non-null   object
 6   salary_in_usd       3755 non-null   int64 
 7   employee_residence  3755 non-null   object
 8   remote_ratio        3755 non-null   int64 
 9   company_location    3755 non-null   object
 10  company_size        3755 non-null   object
dtypes: int64(4), object(7)
memory usage: 322.8+ KB


In [18]:
# Represent the year as a timestamp set to January 1 of that year
df['work_year'] = pd.to_datetime(df['work_year'].astype(str) + '-01-01')
# Low-cardinality text columns are stored as categories to save memory
df['experience_level'] = df.experience_level.astype('category')
df['employment_type'] = df.employment_type.astype('category')
df['job_title'] = df.job_title.astype('category')
df['company_location'] = df.company_location.astype('category')
df['employee_residence'] = df.employee_residence.astype('category')
df['company_size'] = df.company_size.astype('category')
# Note: salary_currency is not converted here and stays as object
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3755 entries, 0 to 3754
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   work_year           3755 non-null   datetime64[ns]
 1   experience_level    3755 non-null   category      
 2   employment_type     3755 non-null   category      
 3   job_title           3755 non-null   category      
 4   salary              3755 non-null   int64         
 5   salary_currency     3755 non-null   object        
 6   salary_in_usd       3755 non-null   int64         
 7   employee_residence  3755 non-null   category      
 8   remote_ratio        3755 non-null   int64         
 9   company_location    3755 non-null   category      
 10  company_size        3755 non-null   category      
dtypes: category(6), datetime64[ns](1), int64(3), object(1)
memory usage: 177.4+ KB
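
One way to confirm that the conversion matches a data dictionary is to check each column against an expected type. The sketch below does this on a small made-up DataFrame; the `expected` mapping is an assumption for illustration (including the choice to treat `salary_currency` as a category), so adjust it to the actual data dictionary.

```python
import pandas as pd
from pandas.api import types as ptypes


def is_category(series: pd.Series) -> bool:
    """Return True when the Series uses the pandas categorical dtype."""
    return isinstance(series.dtype, pd.CategoricalDtype)


# Hypothetical expectations: column name -> predicate that the column must satisfy.
expected = {
    "work_year": ptypes.is_datetime64_any_dtype,
    "experience_level": is_category,
    "salary_in_usd": ptypes.is_integer_dtype,
    "salary_currency": is_category,
}

# Toy data in which salary_currency was left as plain text.
toy = pd.DataFrame({
    "work_year": pd.to_datetime(["2023-01-01", "2022-01-01"]),
    "experience_level": pd.Series(["SE", "MI"], dtype="category"),
    "salary_in_usd": [85847, 30000],
    "salary_currency": ["EUR", "USD"],
})

# Collect the columns whose dtype does not satisfy the expected predicate.
mismatches = [col for col, check in expected.items() if not check(toy[col])]
print("Columns with unexpected types:", mismatches)
```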


### **Consistency of categorical values**

Identify and correct any inconsistencies in the categorical values (for example, ‘Junior’, ‘junior’, ‘JUNIOR’).


In [6]:
df.company_location.unique()

['ES', 'US', 'CA', 'DE', 'GB', ..., 'CN', 'NZ', 'CL', 'MD', 'MT']
Length: 72
Categories (72, object): ['AE', 'AL', 'AM', 'AR', ..., 'TR', 'UA', 'US', 'VN']

In [7]:
df.experience_level.unique()

['SE', 'MI', 'EN', 'EX']
Categories (4, object): ['EN', 'EX', 'MI', 'SE']

In [8]:
df.employment_type.unique()

['FT', 'CT', 'FL', 'PT']
Categories (4, object): ['CT', 'FL', 'FT', 'PT']
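
The columns inspected so far use consistent upper-case codes, so they need no correction. For a column that did contain case or whitespace variants, a minimal normalization sketch could look like this (the `raw` values are made up for illustration):

```python
import pandas as pd

# Hypothetical labels with inconsistent case and stray whitespace.
raw = pd.Series([" Junior", "junior", "JUNIOR ", "Senior", "senior"])

# Strip surrounding whitespace, lower-case the text, then store it as a category.
normalized = raw.str.strip().str.lower().astype("category")

print("Distinct values before:", raw.nunique())
print("Categories after:", list(normalized.cat.categories))
```

Normalizing before the `astype('category')` step avoids ending up with several categories that mean the same thing.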

In [9]:
list(df.job_title.unique())

['Principal Data Scientist',
 'ML Engineer',
 'Data Scientist',
 'Applied Scientist',
 'Data Analyst',
 'Data Modeler',
 'Research Engineer',
 'Analytics Engineer',
 'Business Intelligence Engineer',
 'Machine Learning Engineer',
 'Data Strategist',
 'Data Engineer',
 'Computer Vision Engineer',
 'Data Quality Analyst',
 'Compliance Data Analyst',
 'Data Architect',
 'Applied Machine Learning Engineer',
 'AI Developer',
 'Research Scientist',
 'Data Analytics Manager',
 'Business Data Analyst',
 'Applied Data Scientist',
 'Staff Data Analyst',
 'ETL Engineer',
 'Data DevOps Engineer',
 'Head of Data',
 'Data Science Manager',
 'Data Manager',
 'Machine Learning Researcher',
 'Big Data Engineer',
 'Data Specialist',
 'Lead Data Analyst',
 'BI Data Engineer',
 'Director of Data Science',
 'Machine Learning Scientist',
 'MLOps Engineer',
 'AI Scientist',
 'Autonomous Vehicle Technician',
 'Applied Machine Learning Scientist',
 'Lead Data Scientist',
 'Cloud Database Engineer',
 'Financial

### **Handling missing values**

Identify and handle any missing values in the dataset. Fill the missing values with a marker appropriate to the data type.

In [10]:
# Count missing values once, then derive the non-null count and the percentage
missing_count = df.isnull().sum()
non_null_count = len(df) - missing_count
missing_pct = (100 * missing_count / len(df)).round(2)
aux = {'non-null count': non_null_count, 'missing count': missing_count, 'missing %': missing_pct}
na = pd.DataFrame(data=aux)
na.sort_values(by='missing %', ascending=False)

Unnamed: 0,non-null count,missing count,missing %
work_year,3755,0,0.0
experience_level,3755,0,0.0
employment_type,3755,0,0.0
job_title,3755,0,0.0
salary,3755,0,0.0
salary_currency,3755,0,0.0
salary_in_usd,3755,0,0.0
employee_residence,3755,0,0.0
remote_ratio,3755,0,0.0
company_location,3755,0,0.0
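
The dataset has no missing values, so nothing needs to be filled here. For reference, a sketch of the filling step on made-up data follows. The choice of the median for numeric columns and the `"Unknown"` marker for text columns is an assumption, not a rule from the assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "job_title": ["Data Scientist", None, "ML Engineer"],
    "salary_in_usd": [175000.0, np.nan, 30000.0],
})

filled = toy.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        # Numeric columns: fill with the column median.
        filled[col] = filled[col].fillna(filled[col].median())
    else:
        # Text columns: fill with an explicit marker.
        # For a category column, add the marker with .cat.add_categories first.
        filled[col] = filled[col].fillna("Unknown")

print(filled)
```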


### **Anomaly detection**

Identify and correct any inappropriate or unusual data points (for example, an annual salary of 1 million dollars for an entry-level position).

In [11]:
df.salary_in_usd.describe()

count      3755.000000
mean     137570.389880
std       63055.625278
min        5132.000000
25%       95000.000000
50%      135000.000000
75%      175000.000000
max      450000.000000
Name: salary_in_usd, dtype: float64

In [12]:
fig = px.histogram(df, x='salary_in_usd', nbins=10, title='Histogram of Salaries')
# Show the figure
fig.show()

In [15]:
# Create the boxplot
fig = px.box(df, x='experience_level', y='salary_in_usd', title='Boxplot of Salaries by Experience Level')
# Show the figure
fig.show()
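
The boxplot marks outliers visually. To flag them in the data itself, one common option is the 1.5 × IQR rule applied within each experience level. The sketch below uses made-up values; the helper `flag_iqr_outliers` and the factor `k=1.5` are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy data: one implausibly high entry-level salary.
toy = pd.DataFrame({
    "experience_level": ["EN"] * 6 + ["SE"] * 6,
    "salary_in_usd": [40000, 45000, 50000, 55000, 60000, 1000000,
                      120000, 130000, 140000, 150000, 160000, 170000],
})


def flag_iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Mark values lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Apply the rule separately within each experience level.
toy["is_outlier"] = (
    toy.groupby("experience_level")["salary_in_usd"]
    .transform(flag_iqr_outliers)
    .astype(bool)
)

print(toy[toy["is_outlier"]])
```

Flagged rows are candidates for review rather than automatic deletion: a high salary can be legitimate.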

## **Data exploration with Python**


### **Univariate exploratory visualizations**

Create two different types of univariate visualizations. Each visualization must include a brief interpretation inside the code file.

In [16]:
df.work_year.unique()

array([2023, 2022, 2020, 2021])

In [26]:
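# Note: px.bar draws one bar per row of the raw DataFrame, so the bar heights
# here are not group averages; the aggregated cells below address this.
fig = px.bar(df, x='work_year', y='salary_in_usd', color='experience_level',
             title='Average Salaries Over Time by Experience Level',
             barmode='group')
fig.show()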

In [33]:
df.groupby(['work_year','experience_level'])['salary_in_usd'].describe()





work_year,experience_level,count,mean,std,min,25%,50%,75%,max
2020-01-01,EN,23.0,57511.608696,54702.473489,5707.0,18817.5,45896.0,71000.0,250000.0
2020-01-01,EX,3.0,139944.333333,163508.499156,15000.0,47416.5,79833.0,202416.5,325000.0
2020-01-01,MI,32.0,87564.71875,75496.425134,6072.0,46509.25,78395.5,107000.0,450000.0
2020-01-01,SE,18.0,137240.5,91121.23766,33511.0,74130.25,118552.0,178500.0,412000.0
2021-01-01,EN,55.0,54905.254545,40083.021451,5409.0,20000.0,55000.0,80000.0,225000.0
2021-01-01,EX,10.0,186128.0,101365.169627,69741.0,132981.0,151833.5,233750.0,416000.0
2021-01-01,MI,92.0,82116.934783,62733.647034,5409.0,39628.5,72606.0,110000.0,423000.0
2021-01-01,SE,73.0,126085.356164,62720.348153,18907.0,77684.0,120000.0,170000.0,276000.0
2022-01-01,EN,124.0,77006.024194,52902.097436,6270.0,39981.25,61252.0,111250.0,300000.0
2022-01-01,EX,41.0,188260.292683,61289.314424,76309.0,145000.0,187200.0,222640.0,324000.0


In [34]:
grouped = df.groupby(['work_year','experience_level'])['salary_in_usd'].describe().reset_index()
grouped 





Unnamed: 0,work_year,experience_level,count,mean,std,min,25%,50%,75%,max
0,2020-01-01,EN,23.0,57511.608696,54702.473489,5707.0,18817.5,45896.0,71000.0,250000.0
1,2020-01-01,EX,3.0,139944.333333,163508.499156,15000.0,47416.5,79833.0,202416.5,325000.0
2,2020-01-01,MI,32.0,87564.71875,75496.425134,6072.0,46509.25,78395.5,107000.0,450000.0
3,2020-01-01,SE,18.0,137240.5,91121.23766,33511.0,74130.25,118552.0,178500.0,412000.0
4,2021-01-01,EN,55.0,54905.254545,40083.021451,5409.0,20000.0,55000.0,80000.0,225000.0
5,2021-01-01,EX,10.0,186128.0,101365.169627,69741.0,132981.0,151833.5,233750.0,416000.0
6,2021-01-01,MI,92.0,82116.934783,62733.647034,5409.0,39628.5,72606.0,110000.0,423000.0
7,2021-01-01,SE,73.0,126085.356164,62720.348153,18907.0,77684.0,120000.0,170000.0,276000.0
8,2022-01-01,EN,124.0,77006.024194,52902.097436,6270.0,39981.25,61252.0,111250.0,300000.0
9,2022-01-01,EX,41.0,188260.292683,61289.314424,76309.0,145000.0,187200.0,222640.0,324000.0


In [36]:
# Create the bar chart of the median salary (the '50%' column of describe())
fig = px.bar(grouped, x='work_year', y='50%', color='experience_level',
             title='Median Salary by Year and Experience Level',
             barmode='group')
fig.show()

### **Multivariate exploratory visualizations**

Create two different types of multivariate visualizations. Each visualization must include a brief interpretation inside the code file.
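
One possible starting point for a multivariate view is a matrix of median salaries by experience level and company size, which could then be drawn as a heatmap (for example with `px.imshow`). The sketch below only builds the matrix, on made-up data; the choice of these two variables is an assumption.

```python
import pandas as pd

# Hypothetical toy data with two categorical variables and one numeric variable.
toy = pd.DataFrame({
    "experience_level": ["EN", "EN", "SE", "SE", "SE", "EN"],
    "company_size": ["S", "M", "S", "M", "M", "M"],
    "salary_in_usd": [50000, 60000, 120000, 150000, 170000, 80000],
})

# Rows: experience level; columns: company size; cells: median salary.
median_matrix = toy.pivot_table(
    index="experience_level",
    columns="company_size",
    values="salary_in_usd",
    aggfunc="median",
)

print(median_matrix)
```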

## **Additional analysis:**

### **Descriptive statistics**

Provide a statistical summary of the dataset, including measures of central tendency and dispersion for the numeric variables.
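
A sketch of such a summary on made-up data follows. It combines central tendency (mean, median) with dispersion (standard deviation, interquartile range, coefficient of variation); the selection of measures is an assumption about what the summary should contain.

```python
import pandas as pd

# Hypothetical toy data for two numeric columns.
toy = pd.DataFrame({
    "salary_in_usd": [30000, 85847, 120000, 175000, 25500],
    "remote_ratio": [100, 100, 0, 50, 100],
})

# One row per variable, one column per statistic.
summary = toy.agg(["mean", "median", "std", "min", "max"]).T

# Interquartile range: a dispersion measure that is robust to extreme values.
summary["iqr"] = toy.quantile(0.75) - toy.quantile(0.25)

# Coefficient of variation: standard deviation relative to the mean.
summary["cv"] = summary["std"] / summary["mean"]

print(summary.round(2))
```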

### **Identifying trends**

Analyze and discuss any notable trends you observe in the data, supported by the visualizations and the descriptive statistics.
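
One simple way to quantify a trend is the year-over-year change in the median salary. The sketch below uses made-up values. On the real data, keep in mind that some groups are very small (the grouped table above shows only 3 EX records for 2020), so changes in those groups are noisy.

```python
import pandas as pd

# Hypothetical toy data: two salaries per year.
toy = pd.DataFrame({
    "work_year": [2020, 2020, 2021, 2021, 2022, 2022],
    "salary_in_usd": [80000, 100000, 90000, 110000, 120000, 130000],
})

# Median salary per year; groupby sorts the years in ascending order.
median_by_year = toy.groupby("work_year")["salary_in_usd"].median()

# Percentage change relative to the previous year (the first year has no reference).
yoy_change_pct = median_by_year.pct_change() * 100

print(pd.DataFrame({"median": median_by_year, "yoy_change_pct": yoy_change_pct.round(1)}))
```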