# Data Science Fundamentals: Analyzing Data Science Salaries in 2023

## **Requirements:**

Your task is to clean and explore a dataset containing information on salaries in the data science field for 2023. This analysis is crucial for understanding salary trends and the factors that drive salary differences in this industry.

## **Setup**

In [1]:
import pandas as pd
import numpy as np
import time
import matplotlib.pyplot as plt
import seaborn as sns
import json
import re
import plotly.express as px

path = '../data/kaggle/hotel-booking/hotel_booking.csv'
df = pd.read_csv(filepath_or_buffer=path, sep= ',', header=0)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           119390 non-null  object 
 1   is_canceled                     119390 non-null  int64  
 2   lead_time                       119390 non-null  int64  
 3   arrival_date_year               119390 non-null  int64  
 4   arrival_date_month              119390 non-null  object 
 5   arrival_date_week_number        119390 non-null  int64  
 6   arrival_date_day_of_month       119390 non-null  int64  
 7   stays_in_weekend_nights         119390 non-null  int64  
 8   stays_in_week_nights            119390 non-null  int64  
 9   adults                          119390 non-null  int64  
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64  
 12  meal            

In [2]:
df.head()

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,...,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,...,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03


## Data cleaning with Python:

### **Checking and adjusting data types**

Make sure every column matches the data type specified in the data dictionary.
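One way to make this check explicit is to compare the actual dtypes against an expected mapping. The sketch below is only illustrative: the `expected_dtypes` entries and the `dtype_mismatches` helper are assumptions made for the example, not the official data dictionary, and the small `sample` frame stands in for the real `df`.

```python
import pandas as pd

# Hypothetical data dictionary: column name -> expected pandas dtype.
# These entries are assumptions for illustration only.
expected_dtypes = {
    "hotel": "category",
    "is_canceled": "int64",
    "children": "float64",
    "reservation_status_date": "datetime64[ns]",
}

def dtype_mismatches(frame: pd.DataFrame, expected: dict) -> pd.DataFrame:
    """Return one row per column that is missing or has an unexpected dtype."""
    rows = []
    for column, wanted in expected.items():
        if column not in frame.columns:
            rows.append({"column": column, "expected": wanted, "actual": "missing"})
        elif str(frame[column].dtype) != wanted:
            rows.append({"column": column, "expected": wanted,
                         "actual": str(frame[column].dtype)})
    return pd.DataFrame(rows, columns=["column", "expected", "actual"])

# Tiny stand-in frame so the sketch runs on its own
sample = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel"],
    "is_canceled": [0, 1],
    "children": [0.0, None],
    "reservation_status_date": ["2015-07-01", "2015-07-02"],
}).astype({"is_canceled": "int64"})

report = dtype_mismatches(sample, expected_dtypes)
print(report)
```

Here `hotel` and `reservation_status_date` are reported because they are still plain strings, which is the situation the conversions in the following cells address.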

In [3]:
# Convert 'reservation_status_date' to a datetime format
df['reservation_status_date'] = pd.to_datetime(df['reservation_status_date'], errors='coerce')

In [4]:
# Identify categorical variables
categorical_variables = df.select_dtypes(include=['object']).columns.tolist()
print("Categorical Variables:", categorical_variables)

Categorical Variables: ['hotel', 'arrival_date_month', 'meal', 'country', 'market_segment', 'distribution_channel', 'reserved_room_type', 'assigned_room_type', 'deposit_type', 'customer_type', 'reservation_status']


In [5]:
# Combine year, month, and day to create a full date
df['date'] = pd.to_datetime(
    df['arrival_date_year'].astype(str) + ' ' + 
    df['arrival_date_month'] + ' ' + 
    df['arrival_date_day_of_month'].astype(str), 
    format='%Y %B %d'
)

# Display the result
df[['arrival_date_month', 'arrival_date_year', 'arrival_date_day_of_month', 'date']].head()

Unnamed: 0,arrival_date_month,arrival_date_year,arrival_date_day_of_month,date
0,July,2015,1,2015-07-01
1,July,2015,1,2015-07-01
2,July,2015,1,2015-07-01
3,July,2015,1,2015-07-01
4,July,2015,1,2015-07-01


In [6]:
df[categorical_variables] = df[categorical_variables].astype('category')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 33 columns):
 #   Column                          Non-Null Count   Dtype         
---  ------                          --------------   -----         
 0   hotel                           119390 non-null  category      
 1   is_canceled                     119390 non-null  int64         
 2   lead_time                       119390 non-null  int64         
 3   arrival_date_year               119390 non-null  int64         
 4   arrival_date_month              119390 non-null  category      
 5   arrival_date_week_number        119390 non-null  int64         
 6   arrival_date_day_of_month       119390 non-null  int64         
 7   stays_in_weekend_nights         119390 non-null  int64         
 8   stays_in_week_nights            119390 non-null  int64         
 9   adults                          119390 non-null  int64         
 10  children                        119386 non-null  float64

### **Detecting and removing duplicate values**

Make sure every record in the dataset is unique.

In [7]:
# Flag duplicated rows
duplicados = df.duplicated()
# Count the duplicated rows
num_duplicados = duplicados.sum()
print(f"Number of duplicate records: {num_duplicados}")
df.head()

Number of duplicate records: 31994


Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date,date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,...,,,0,Transient,0.0,0,0,Check-Out,2015-07-01,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,...,,,0,Transient,0.0,0,0,Check-Out,2015-07-01,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,...,,,0,Transient,75.0,0,0,Check-Out,2015-07-02,2015-07-01
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,...,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02,2015-07-01
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03,2015-07-01
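The cell above only counts duplicates. A minimal sketch of the removal step follows, using a small stand-in frame with illustrative values. Whether identical rows in the booking data are true duplicates or separate bookings that happen to match is a judgment call, since the columns visible above include no booking identifier.

```python
import pandas as pd

# Stand-in frame with one exact duplicate row (values are illustrative)
bookings = pd.DataFrame({
    "hotel": ["Resort Hotel", "Resort Hotel", "City Hotel"],
    "lead_time": [342, 342, 7],
    "adr": [0.0, 0.0, 75.0],
})

n_before = len(bookings)
# keep="first" retains the first occurrence of each duplicated row
deduped = bookings.drop_duplicates(keep="first").reset_index(drop=True)
n_removed = n_before - len(deduped)
print(f"Removed {n_removed} duplicate rows; {len(deduped)} remain")
```

Applied to the notebook's data this would be `df = df.drop_duplicates().reset_index(drop=True)`.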


### **Consistency of categorical values**

Identify and correct any inconsistency in the categorical values (for example, ‘Junior’, ‘junior’, ‘JUNIOR’).


In [22]:
categorical_variables

['hotel',
 'arrival_date_month',
 'meal',
 'country',
 'market_segment',
 'distribution_channel',
 'reserved_room_type',
 'assigned_room_type',
 'deposit_type',
 'customer_type',
 'reservation_status']

In [9]:
df.hotel.unique()

['Resort Hotel', 'City Hotel']
Categories (2, object): ['City Hotel', 'Resort Hotel']

In [10]:
list(df.arrival_date_month.unique())

['July',
 'August',
 'September',
 'October',
 'November',
 'December',
 'January',
 'February',
 'March',
 'April',
 'May',
 'June']

In [11]:
df.meal.unique()

['BB', 'FB', 'HB', 'SC', 'Undefined']
Categories (5, object): ['BB', 'FB', 'HB', 'SC', 'Undefined']

In [12]:
df.country.unique()

['PRT', 'GBR', 'USA', 'ESP', 'IRL', ..., 'KIR', 'SDN', 'ATF', 'SLE', 'LAO']
Length: 178
Categories (177, object): ['ABW', 'AGO', 'AIA', 'ALB', ..., 'VNM', 'ZAF', 'ZMB', 'ZWE']

In [13]:
df.market_segment.unique()

['Direct', 'Corporate', 'Online TA', 'Offline TA/TO', 'Complementary', 'Groups', 'Undefined', 'Aviation']
Categories (8, object): ['Aviation', 'Complementary', 'Corporate', 'Direct', 'Groups', 'Offline TA/TO', 'Online TA', 'Undefined']

In [14]:
df.distribution_channel.unique()

['Direct', 'Corporate', 'TA/TO', 'Undefined', 'GDS']
Categories (5, object): ['Corporate', 'Direct', 'GDS', 'TA/TO', 'Undefined']

In [15]:
df.reserved_room_type.unique()

['C', 'A', 'D', 'E', 'G', 'F', 'H', 'L', 'P', 'B']
Categories (10, object): ['A', 'B', 'C', 'D', ..., 'G', 'H', 'L', 'P']

In [16]:
df.assigned_room_type.unique()

['C', 'A', 'D', 'E', 'G', ..., 'B', 'H', 'P', 'L', 'K']
Length: 12
Categories (12, object): ['A', 'B', 'C', 'D', ..., 'I', 'K', 'L', 'P']

In [17]:
df.deposit_type.unique()

['No Deposit', 'Refundable', 'Non Refund']
Categories (3, object): ['No Deposit', 'Non Refund', 'Refundable']

In [18]:
df.customer_type.unique()

['Transient', 'Contract', 'Transient-Party', 'Group']
Categories (4, object): ['Contract', 'Group', 'Transient', 'Transient-Party']

In [19]:
df.reservation_status.unique()

['Check-Out', 'Canceled', 'No-Show']
Categories (3, object): ['Canceled', 'Check-Out', 'No-Show']

In [24]:
# Strip whitespace and lowercase every categorical variable.
# The .str methods return object dtype, so re-cast to 'category' afterwards.
df[categorical_variables] = df[categorical_variables].apply(
    lambda x: x.str.strip().str.lower().astype('category')
)
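Before normalizing, it can help to see which raw spellings would actually collapse into the same label. The `colliding_labels` helper below is a hypothetical name introduced for this sketch, and the sample values reuse the ‘Junior’ example from the task description.

```python
import pandas as pd

def colliding_labels(values: pd.Series) -> dict:
    """Map each normalized label to the raw spellings that collapse into it,
    keeping only labels that have more than one raw spelling."""
    raw = values.dropna().astype(str)
    normalized = raw.str.strip().str.lower()
    groups = raw.groupby(normalized).unique()
    return {key: sorted(variants) for key, variants in groups.items()
            if len(variants) > 1}

levels = pd.Series(["Junior", "junior", "JUNIOR ", "Senior", None])
print(colliding_labels(levels))
```

An empty dictionary means normalization changes spelling only, not the number of distinct categories, which appears to be the case for the `unique()` outputs listed above.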

### **Handling missing values**

Identify and handle any missing values in the dataset. Fill missing values with a marker appropriate for the data type.

In [30]:
# Per-column counts of missing and non-missing values, plus the missing percentage
qna = df.isnull().sum(axis=0)
qsna = df.shape[0] - qna
ppna = round(100 * (qna / df.shape[0]), 2)
aux = {'non-null count': qsna, 'NA count': qna, 'NA %': ppna}
na = pd.DataFrame(data=aux)
na.sort_values(by='NA %', ascending=False)

Unnamed: 0,non-null count,NA count,NA %
agent,103050,16340,13.69
country,118902,488,0.41
hotel,119390,0,0.0
previous_cancellations,119390,0,0.0
reservation_status_date,119390,0,0.0
reservation_status,119390,0,0.0
total_of_special_requests,119390,0,0.0
required_car_parking_spaces,119390,0,0.0
adr,119390,0,0.0
customer_type,119390,0,0.0


In [29]:
# Drop the 'company' column from the DataFrame
df = df.drop(columns=['company'])
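The table above shows `agent` (13.69%) and `country` (0.41%) with missing values that are never filled. The following is a minimal sketch of type-appropriate markers on a stand-in frame. The specific markers (`"unknown"`, `0` for "no agent", the median for `children`) are assumptions chosen for illustration, not choices made in the notebook.

```python
import pandas as pd

# Stand-in frame mirroring three columns with gaps (values are illustrative)
frame = pd.DataFrame({
    "country": pd.Categorical(["PRT", None, "GBR"]),
    "agent": [304.0, None, 240.0],
    "children": [0.0, None, 2.0],
})

# Categorical column: the marker must be registered as a category before filling
frame["country"] = frame["country"].cat.add_categories("unknown").fillna("unknown")
# Identifier-like numeric column: a sentinel value (assumed here: 0 = no agent)
frame["agent"] = frame["agent"].fillna(0).astype("int64")
# Count-like numeric column: the median is one possible choice
frame["children"] = frame["children"].fillna(frame["children"].median())
print(frame)
```

If a column is stored as plain strings rather than `category`, `fillna("unknown")` works directly and the `add_categories` step is not needed.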

### **Anomaly detection**

Identify and correct any inappropriate or unusual data point (for example, an annual salary of 1 million dollars for an entry-level position).

In [28]:
# Identify numerical columns
numerical_columns = df.select_dtypes(include=[np.number]).columns

# Function to detect outliers using IQR
def detect_outliers_iqr(data):
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    # Define bounds
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Return True for outliers
    return (data < lower_bound) | (data > upper_bound)

# Create a summary DataFrame for outliers
outliers_summary = pd.DataFrame()

for column in numerical_columns:
    outliers = detect_outliers_iqr(df[column])
    outliers_summary[column] = {
        'Number of Outliers': outliers.sum(),
        'Percentage of Outliers': 100 * outliers.mean(),
        'Lower Bound': df[column][~outliers].min(),
        'Upper Bound': df[column][~outliers].max()
    }

# Display the outlier summary
outliers_summary

Unnamed: 0,is_canceled,lead_time,arrival_date_year,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,booking_changes,agent,company,days_in_waiting_list,adr,required_car_parking_spaces,total_of_special_requests
Number of Outliers,0.0,3005.0,0.0,0.0,0.0,265.0,3354.0,29710.0,8590.0,917.0,3810.0,6484.0,3620.0,18076.0,0.0,0.0,3698.0,3793.0,7416.0,2877.0
Percentage of Outliers,0.0,2.516961,0.0,0.0,0.0,0.221962,2.809281,24.884831,7.194907,0.768071,3.191222,5.430941,3.03208,15.140297,0.0,0.0,3.097412,3.176983,6.211576,2.40975
Lower Bound,0.0,0.0,2015.0,1.0,1.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,6.0,0.0,-6.38,0.0,0.0
Upper Bound,1.0,373.0,2017.0,53.0,31.0,5.0,6.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,535.0,543.0,0.0,211.03,0.0,2.0
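In the summary above, `adults` has about 24.9% of its rows flagged while both reported bounds equal 2.0. That pattern is what the IQR rule produces when the first and third quartiles coincide: the IQR is zero, so every value different from the quartile is flagged. Note also that the `Lower Bound` and `Upper Bound` rows hold the minimum and maximum of the non-flagged values, not the fences themselves. The sketch below computes the fences directly on a stand-in series; `iqr_fences` is a hypothetical helper name.

```python
import pandas as pd

def iqr_fences(data: pd.Series, k: float = 1.5) -> tuple:
    """Return the (lower, upper) fences of the IQR rule."""
    q1 = data.quantile(0.25)
    q3 = data.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Stand-in column where most values are identical, similar in shape to `adults`
adults = pd.Series([2, 2, 2, 2, 2, 2, 1, 3])
lower, upper = iqr_fences(adults)
flagged = (adults < lower) | (adults > upper)
print(lower, upper, int(flagged.sum()))
```

For columns like this, a domain threshold (for instance, an implausible number of adults per booking) may be a more useful criterion than the IQR rule.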


In [None]:
# Assumes df has a 'salary_in_usd' column; the hotel-booking columns listed above do not include one
fig = px.histogram(df, x='salary_in_usd', nbins=10, title='Salary histogram')
# Show the figure
fig.show()

In [None]:
print(df.salary_in_usd.describe())
# Create the boxplot of the salary distribution
fig = px.box(df, y='salary_in_usd', title='Salary boxplot')
# Show the figure
fig.show()

In [None]:
print(df.groupby(['experience_level'])['salary_in_usd'].describe())
# Create the boxplot
fig = px.box(df, x='experience_level', y='salary_in_usd', title='Salary boxplot by experience level')
# Show the figure
fig.show()

In [None]:
print(df.groupby(['salary_currency'])['salary_in_usd'].describe())
# Create the boxplot
fig = px.box(df, x='salary_currency', y='salary_in_usd', title='Salary boxplot by payment currency')
# Show the figure
fig.show()

In [None]:
print(df.groupby(['job_title_simplified'])['salary_in_usd'].describe())
# Create the boxplot
fig = px.box(df, x='job_title_simplified', y='salary_in_usd', title='Salary boxplot by job title')
# Show the figure
fig.show()

## **Data exploration with Python**


### **Univariate exploratory visualizations**

Create two different types of univariate visualizations. Each visualization must include a brief interpretation within the code file.

In [None]:
# Filter dataset by each job category and create a histogram for each
categories = df['job_title_simplified'].unique()
# Loop through each category to create a histogram
for category in categories:
    filtered_df = df[df['job_title_simplified'] == category]
    fig = px.histogram(filtered_df, x='salary_in_usd', nbins=10, title=f'Salary histogram - {category.capitalize()}')
    fig.show()

In [None]:
# Step 1: Define function to categorize salary_currency values
def categorize_salary_currency(currency):
    if currency == 'USD':
        return 'USD'
    elif currency == 'EUR':
        return 'EU'
    elif currency in ['CHF', 'GBP', 'AUD', 'SGD', 'CAD']:
        return 'CHF-GBP-AUD-SGD-CAD'
    else:
        return 'others'
# Step 2: Apply the categorization function to create a new column
df['salary_currency_category'] = df['salary_currency'].apply(categorize_salary_currency)
# Display the unique categories
unique_currency_categories = df['salary_currency_category'].unique()
print("Unique Salary Currency Categories:", unique_currency_categories)

In [None]:
# Filter dataset by each salary currency category and create a histogram for each with consistent scales
currency_categories = df['salary_currency_category'].unique()
# Define the same range for all histograms to maintain consistency in scale
salary_min = df['salary_in_usd'].min()
salary_max = df['salary_in_usd'].max()
# Loop through each currency category to create a histogram with consistent x-axis range
for category in currency_categories:
    filtered_df = df[df['salary_currency_category'] == category]
    fig = px.histogram(
        filtered_df,
        x='salary_in_usd',
        nbins=10,
        title=f'Salary histogram - {category}',
        range_x=[salary_min, salary_max]
    )
    fig.show()

### **Multivariate exploratory visualizations**

Create two different types of multivariate visualizations. Each visualization must include a brief interpretation within the code file.

In [None]:
df.work_year.unique()

In [None]:
df.groupby(['work_year','experience_level'])['salary_in_usd'].describe()

In [None]:
grouped = df.groupby(['work_year','experience_level'])['salary_in_usd'].describe().reset_index()
grouped 

In [None]:
# Create the bar chart; the '50%' column of describe() is the median
fig = px.bar(grouped, x='work_year', y='50%', color='experience_level',
             title='Median Salary by Year and Experience Level',
             barmode='group')
fig.show()

In [None]:
grouped = df.groupby(['work_year','salary_currency_category'])['salary_in_usd'].describe().reset_index()
grouped 

In [None]:
# Create the bar chart; the '50%' column of describe() is the median
fig = px.bar(grouped, x='work_year', y='50%', color='salary_currency_category',
             title='Median Salary by Year and Payment Currency',
             barmode='group')
fig.show()

## **Additional analysis:**

### **Descriptive statistics**

Provide a statistical summary of the dataset, including measures of central tendency and dispersion for the numerical variables.
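As a complement to `describe()`, central tendency and dispersion can be placed side by side in one table. The sketch below is a possible layout, using a stand-in `adr` column built from the five values shown in the `df.head()` output earlier. The frame name `values` is an assumption for the example.

```python
import pandas as pd

# Stand-in numeric frame: the first five `adr` values from df.head()
values = pd.DataFrame({"adr": [0.0, 0.0, 75.0, 75.0, 98.0]})
numeric = values.select_dtypes(include="number")

summary = pd.DataFrame({
    "mean": numeric.mean(),       # central tendency
    "median": numeric.median(),   # central tendency, robust to outliers
    "std": numeric.std(),         # dispersion (sample standard deviation)
    "iqr": numeric.quantile(0.75) - numeric.quantile(0.25),  # robust dispersion
})
print(summary)
```

Reporting the median and IQR next to the mean and standard deviation makes skewed columns easier to spot, since the two pairs diverge when a distribution has a long tail.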

In [None]:
df.describe()

In [None]:
grouped = df.groupby(['salary_currency_category'])['salary_in_usd'].describe().reset_index()
grouped 

In [None]:
grouped = df.groupby(['experience_level'])['salary_in_usd'].describe().reset_index()
grouped 

In [None]:
grouped = df.groupby(['job_title_simplified'])['salary_in_usd'].describe().reset_index()
grouped 

### **Identifying trends**

Analyze and discuss any notable trend you observe in the data, drawing on the visualizations and descriptive statistics.
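A trend can also be stated numerically, for example as the year-over-year percentage change of the median. The sketch below uses the column names that appear in the cells of this section (`work_year`, `salary_in_usd`), but the values are invented purely so the example runs; they are not results from any dataset.

```python
import pandas as pd

# Illustrative values only, not real salary data
records = pd.DataFrame({
    "work_year": [2021, 2021, 2022, 2022, 2023, 2023],
    "salary_in_usd": [80_000, 100_000, 95_000, 105_000, 110_000, 130_000],
})

# Median per year, then percentage change relative to the previous year
median_by_year = records.groupby("work_year")["salary_in_usd"].median()
yoy_change = median_by_year.pct_change() * 100
print(pd.DataFrame({"median": median_by_year, "yoy_change_pct": yoy_change}))
```

The first year has no previous value to compare against, so its change is `NaN`.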

In [None]:
# Group data by work_year and salary_currency_category, calculating the median salary
grouped = df.groupby(['work_year', 'salary_currency_category'])['salary_in_usd'].median().reset_index()
# Create line plot
fig = px.line(grouped, x='work_year', y='salary_in_usd', color='salary_currency_category',
              title='Median Salary Trend by Payment Currency',
              markers=True)
# Show the plot
fig.show()

In [None]:
# Group data by work_year and job_title_simplified, calculating the median salary
grouped = df.groupby(['work_year', 'job_title_simplified'])['salary_in_usd'].median().reset_index()
# Create line plot
fig = px.line(grouped, x='work_year', y='salary_in_usd', color='job_title_simplified',
              title='Median Salary Trend by Job Title',
              markers=True)
# Show the plot
fig.show()

In [None]:
# Group data by work_year and experience_level, calculating the median salary
grouped = df.groupby(['work_year', 'experience_level'])['salary_in_usd'].median().reset_index()
# Create line plot
fig = px.line(grouped, x='work_year', y='salary_in_usd', color='experience_level',
              title='Median Salary Trend by Experience Level',
              markers=True)
# Show the plot
fig.show()