# Final Project
## Model development

This example predicts the salary of a data professional from a few characteristics, such as experience level, job title, and company size.

The data comes from https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries (CC0: Public Domain). I modified the data slightly to reduce the number of options for certain features.

# 1. Import libraries

In [6]:
#import packages for data manipulation
import pandas as pd
import numpy as np

#import packages for machine learning
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.metrics import mean_squared_error, r2_score

#import packages for data management
import joblib

In [7]:
# Load the data
df = pd.read_csv('data/ds_salaries.csv')

In [8]:
df.head()

Unnamed: 0.1,Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,0,2020,MI,FT,Data Scientist,70000,EUR,79833,DE,0,DE,L
1,1,2020,SE,FT,Machine Learning Scientist,260000,USD,260000,JP,0,JP,S
2,2,2020,SE,FT,Big Data Engineer,85000,GBP,109024,GB,50,GB,M
3,3,2020,MI,FT,Product Data Analyst,20000,USD,20000,HN,0,HN,S
4,4,2020,SE,FT,Machine Learning Engineer,150000,USD,150000,US,50,US,L


In [9]:
# Keep only the columns that will be used
df = df[['experience_level', 'employment_type', 'job_title', 'company_size','salary_in_usd']]

In [10]:
df.head()

Unnamed: 0,experience_level,employment_type,job_title,company_size,salary_in_usd
0,MI,FT,Data Scientist,L,79833
1,SE,FT,Machine Learning Scientist,S,260000
2,SE,FT,Big Data Engineer,M,109024
3,MI,FT,Product Data Analyst,S,20000
4,SE,FT,Machine Learning Engineer,L,150000


Since all the features are categorical, they are encoded to convert the data to numeric form. Ordinal encoders are used for experience level and company size, because both represent a progression. `OrdinalEncoder` numbers the categories from zero in the order it is given (0 = entry level, 1 = mid level, etc.).

For job title and employment type, a dummy variable is created for each option (the first one is dropped to avoid multicollinearity).
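To make the two encodings concrete, here is a minimal sketch on a made-up four-row frame (the rows are invented for illustration and are not taken from the dataset). It shows the zero-based codes produced by `OrdinalEncoder` and which dummy column `drop_first=True` removes.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame (made-up rows) with one ordinal and one nominal feature
toy = pd.DataFrame({
    "experience_level": ["EN", "SE", "MI", "EX"],
    "employment_type": ["FT", "PT", "FT", "CT"],
})

# Explicit category order: EN < MI < SE < EX maps to 0, 1, 2, 3
encoder = OrdinalEncoder(categories=[["EN", "MI", "SE", "EX"]])
toy["experience_level_encoded"] = encoder.fit_transform(
    toy[["experience_level"]]
).ravel()

# One dummy column per employment type; drop_first removes the first
# category in sorted order ("CT"), which becomes the implicit baseline
toy = pd.get_dummies(toy, columns=["employment_type"], drop_first=True, dtype=int)

print(toy)
```

A row with zeros in every `employment_type_*` column therefore stands for the dropped baseline category.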

In [11]:
salary_data = df.copy()

#use ordinal encoder to encode experience level
encoder = OrdinalEncoder(categories=[['EN', 'MI', 'SE', 'EX']])
salary_data['experience_level_encoded'] = encoder.fit_transform(salary_data[['experience_level']])

#use ordinal encoder to encode company size
encoder = OrdinalEncoder(categories=[['S', 'M', 'L']])
salary_data['company_size_encoded'] = encoder.fit_transform(salary_data[['company_size']])

#encode employment type and job title using dummy columns
salary_data = pd.get_dummies(salary_data, columns = ['employment_type', 'job_title'], drop_first = True, dtype = int)

#drop original columns
salary_data = salary_data.drop(columns = ['experience_level', 'company_size'])

In [12]:
# Encoded DataFrame
salary_data.head()

Unnamed: 0,salary_in_usd,experience_level_encoded,company_size_encoded,employment_type_FL,employment_type_FT,employment_type_PT,job_title_AI Scientist,job_title_Analytics Engineer,job_title_Applied Data Scientist,job_title_Applied Machine Learning Scientist,...,job_title_Machine Learning Manager,job_title_Machine Learning Scientist,job_title_Marketing Data Analyst,job_title_NLP Engineer,job_title_Principal Data Analyst,job_title_Principal Data Engineer,job_title_Principal Data Scientist,job_title_Product Data Analyst,job_title_Research Scientist,job_title_Staff Data Scientist
0,79833,1.0,2.0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,260000,2.0,0.0,0,1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,109024,2.0,1.0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,20000,1.0,0.0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,150000,2.0,2.0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now that the model inputs have been transformed, the training and test sets are created. These features are first fed into an ordinary linear regression model to predict the employee's salary.
Regularized models are then tried in an attempt to increase the $R^2$.


In [13]:
# Linear regression model

#define independent and dependent features
X = salary_data.drop(columns = 'salary_in_usd')
y = salary_data['salary_in_usd']

#split between training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
  X, y, random_state = 104, test_size = 0.2, shuffle = True)

#fit linear regression model
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)

#make predictions
y_pred = regr.predict(X_test)

#print the coefficients
print("Coefficients: \n", regr.coef_)

#print the MSE
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

#print the R2 value (r2_score returns plain R2, not the adjusted version)
print("R2: %.2f" % r2_score(y_test, y_pred))

Coefficients: 
 [ 3.89891601e+04  7.56505500e+03 -1.22322974e+05 -8.50203748e+04
 -9.36493326e+04  6.91298217e+04  8.73283022e+04  1.50961907e+05
  1.26014074e+04  6.45029992e+04  4.66758822e+04  3.13607053e+04
  5.51927873e+04  9.11144622e+04  1.03895454e+04  1.22246852e+05
  6.12702968e+04  5.35009072e+04  1.17870513e-09  7.57553547e+04
  1.34969626e+05  7.85140631e+04  8.52152505e+04  7.20923965e+04
  3.07292322e+04  1.08017048e+05  8.04501655e+04  1.04407827e+05
  1.39407827e+05  9.01257321e+04  4.09190422e+04  1.30382719e+03
  4.28396987e+05  8.36463147e+04  7.60883159e+04 -2.05423329e+04
  5.37069072e+04  1.01410896e+05  7.26375839e+04  3.49048822e+04
  9.97540177e+04  1.32516257e+05  7.81576777e+04  7.15889706e+04
  5.65118272e+04  1.54160637e+05  2.80618272e+04 -1.45519152e-11
  6.85270972e+04  3.23695775e+05  1.37817232e+05 -1.00195776e+03
  1.04562786e+05 -3.30474926e+04]
Mean squared error: 6412074606.94
R2: 0.05
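`r2_score` returns the plain coefficient of determination. With as many predictors as this model has (54 coefficients are printed above), an adjusted value may be more informative. The sketch below uses the standard formula $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$; the helper name `adjusted_r2` and the sample numbers are assumptions for illustration, not part of the original notebook.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjust R^2 for the number of predictors (hypothetical helper)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Illustrative numbers only: an R^2 of 0.50 from 101 rows and 10 features
print(round(adjusted_r2(0.50, 101, 10), 4))
```

In the notebook this could be called as `adjusted_r2(r2_score(y_test, y_pred), len(y_test), X_test.shape[1])`, provided the test set has more rows than features plus one.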


In [14]:
# Lasso model

# define independent and dependent features
X = salary_data.drop(columns='salary_in_usd')
y = salary_data['salary_in_usd']

# split between training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=104, test_size=0.2, shuffle=True)

# fit Lasso regression model
# The alpha parameter can be tuned to control the regularization strength
lasso = linear_model.Lasso(alpha=0.1)
lasso.fit(X_train, y_train)

# make predictions
y_pred_lasso = lasso.predict(X_test)

# print the coefficients
print("Coefficients (Lasso): \n", lasso.coef_)

# print the MSE
print("Mean squared error (Lasso): %.2f" % mean_squared_error(y_test, y_pred_lasso))

# print the R2 value
print("R2 (Lasso): %.2f" % r2_score(y_test, y_pred_lasso))

Coefficients (Lasso): 
 [  38978.88977302    7564.19493959 -122192.6456627   -84908.37025416
  -94110.80706909   65621.89498533   83682.87697262  147319.24349664
    8991.94393978   60859.07314135   43001.21197281   27708.97719384
   51565.74461851   87459.04717929    6884.31991823  118587.04527653
   57646.91551636   49847.35225196       0.           72118.67643058
  131335.10760707   74889.80593071   81572.40726533   68447.33808116
   27086.79235417  104388.44122834   76825.18270976  100738.22997236
  135738.24159255   86502.70472554   37262.65760935   -2268.66548385
  424717.23594093   80015.3926503    72459.36285254  -24104.43912247
   50056.66880599   97776.88909693   68996.64853759   31234.76188355
   96226.36071844  128824.81608987   74528.11634488   67945.14552419
   52843.10066209  150524.97696657   24393.18210379       0.
   64846.47341326  320056.11954222  134177.28007037   -4608.89015401
  100927.82935997  -36507.73724074]
Mean squared error (Lasso): 6389565937.23
R2 (Lasso

  model = cd_fast.enet_coordinate_descent(


In [15]:
# Ridge model

# define independent and dependent features
X = salary_data.drop(columns='salary_in_usd')
y = salary_data['salary_in_usd']

# split between training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=104, test_size=0.2, shuffle=True)

# fit Ridge regression model
# The alpha parameter can be tuned to control the regularization strength
ridge = linear_model.Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

# make predictions
y_pred_ridge = ridge.predict(X_test)

# print the coefficients
print("Coefficients (Ridge): \n", ridge.coef_)

# print the MSE
print("Mean squared error (Ridge): %.2f" % mean_squared_error(y_test, y_pred_ridge))

# print the R2 value
print("R2 (Ridge): %.2f" % r2_score(y_test, y_pred_ridge))

Coefficients (Ridge): 
 [ 38625.60811134   8396.1281726  -38377.85654154 -19896.45387479
 -39500.52104113 -10146.36368984   2271.97626752  52677.55547201
 -26599.7402665  -16713.47557376 -18713.11577153 -44851.44622373
 -14215.31125019   4553.71500841 -56899.65967759  28569.65056818
 -22633.44469449 -21076.03710666      0.          -7009.83650349
  44210.41738451  -5974.87175497    536.19632739 -10962.942461
 -40536.08571576  21635.82066695  -4087.84140602   9737.32014217
  27237.32014217   4985.02558236 -29031.08228781 -41814.67985783
 171550.12419784   -624.35812575  -6099.40284446 -52555.9839135
 -20938.70377333  14013.24040007  -8897.05367261 -24598.61577153
  24161.10530825  24259.05642611  -6021.68081486  -9774.40164477
 -14210.67985783  59604.96954883 -28435.67985783      0.
  -7553.74762956 159573.26687665  39779.91428424 -56978.41562115
  18101.38213897 -26012.84270892]
Mean squared error (Ridge): 4964420572.74
R2 (Ridge): 0.27
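The value `alpha=1.0` above is fixed by hand. One possible way to choose it from the data is cross-validation, sketched here with `RidgeCV`. The synthetic data from `make_regression` and the candidate grid are assumptions standing in for the encoded salary features, so the numbers printed say nothing about the salary model itself.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded salary features (assumption)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, noise=25.0, random_state=104)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=104)

# Candidate regularization strengths on a log scale
alphas = np.logspace(-3, 3, 13)

# RidgeCV selects alpha by 5-fold cross-validation on the training data only
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X_tr, y_tr)

print("Selected alpha:", ridge_cv.alpha_)
print("Test R2: %.2f" % ridge_cv.score(X_te, y_te))
```

Keeping the test set out of the search means the final score is still an honest estimate.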


In [22]:
X.dtypes

experience_level_encoded                              float64
company_size_encoded                                  float64
employment_type_FL                                      int32
employment_type_FT                                      int32
employment_type_PT                                      int32
job_title_AI Scientist                                  int32
job_title_Analytics Engineer                            int32
job_title_Applied Data Scientist                        int32
job_title_Applied Machine Learning Scientist            int32
job_title_BI Data Analyst                               int32
job_title_Big Data Architect                            int32
job_title_Big Data Engineer                             int32
job_title_Business Data Analyst                         int32
job_title_Cloud Data Engineer                           int32
job_title_Computer Vision Engineer                      int32
job_title_Computer Vision Software Engineer             int32
job_titl

In [16]:
# Elastic Net model

# define independent and dependent features
X = salary_data.drop(columns='salary_in_usd')
y = salary_data['salary_in_usd']

# split between training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=104, test_size=0.2, shuffle=True)

# fit Elastic Net regression model
# The alpha and l1_ratio parameters can be tuned to control the regularization
elastic_net = linear_model.ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X_train, y_train)

# make predictions
y_pred_elastic_net = elastic_net.predict(X_test)

# print the coefficients
print("Coefficients (Elastic Net): \n", elastic_net.coef_)

# print the MSE
print("Mean squared error (Elastic Net): %.2f" % mean_squared_error(y_test, y_pred_elastic_net))

# print the R2 value
print("R2 (Elastic Net): %.2f" % r2_score(y_test, y_pred_elastic_net))

Coefficients (Elastic Net): 
 [ 37246.58277053   9298.22836297  -2727.70465707   3308.63022586
  -6505.26983353  -2217.52146768    401.70074397   9265.30819826
  -2256.57609833  -3444.10100327  -1434.25787781  -9466.25915089
  -2372.28243523    557.40227804 -10411.69332856   4136.67288106
 -17205.6350331   -2438.59622117      0.          -1135.61984062
  11436.97965979  -4903.59575505    124.7819355   -2981.14500537
  -5935.50167863   7925.91330289  -3542.25027355    781.59424087
   2167.73285471   1351.54279821  -3331.54737528  -3299.81170038
  13543.835142       41.6510507    -781.0532934   -4095.993197
  -2422.90126249   2679.35964442  -1258.13566652  -1900.43632108
   6009.62689778   1894.05452589  -3515.76578718  -1437.52715843
  -1113.35626846  13785.52131482  -2240.08894233      0.
   -569.23589299  18344.62491426   5852.22288926  -6525.52835573
   6480.5470116   -1093.44107969]
Mean squared error (Elastic Net): 4734337847.92
R2 (Elastic Net): 0.30
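The four cells above repeat the same split, fit, and scoring steps. As a sketch of a more compact layout, the models can be kept in a dictionary and evaluated in one loop on a single split. The synthetic data and the names `models` and `results` are assumptions for illustration; in the notebook the existing `X_train`, `X_test`, `y_train`, and `y_test` would take their place.

```python
from sklearn import linear_model
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded salary features (assumption)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, noise=25.0, random_state=104)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=104)

# Same estimators and hyperparameters as in the cells above
models = {
    "Linear": linear_model.LinearRegression(),
    "Lasso": linear_model.Lasso(alpha=0.1),
    "Ridge": linear_model.Ridge(alpha=1.0),
    "Elastic Net": linear_model.ElasticNet(alpha=0.1, l1_ratio=0.5),
}

# Fit each model on the same split and collect its test metrics
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    y_hat = model.predict(X_te)
    results[name] = {
        "mse": mean_squared_error(y_te, y_hat),
        "r2": r2_score(y_te, y_hat),
    }

for name, metrics in results.items():
    print("%-12s MSE: %.2f  R2: %.2f" % (name, metrics["mse"], metrics["r2"]))
```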


In [17]:
#save model using joblib
joblib.dump(elastic_net, 'models/elastic_net.pkl')

['models/elastic_net.pkl']
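A saved model can later be restored with `joblib.load`. The sketch below round-trips a small Elastic Net through a temporary file to show the pattern. The toy data and the temporary path are assumptions, so it does not touch `models/elastic_net.pkl`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import ElasticNet

# Toy regression data (assumption), unrelated to the salary dataset
rng = np.random.default_rng(104)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_demo, y_demo)

# Dump to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "elastic_net.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print("Predictions match after reload:", same)
```

Only the estimator is stored in the file. Any new input has to be encoded into the same columns, in the same order, as the training matrix `X` before calling `predict`.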