# Test for the data scientist position: **3. JobSatisfaction ML model.**
-   *Name:* Humberto Franco Osorio
-   *Date:* February 6
-   *Link:* https://github.com/ingHFrancoO/prueba_02_2023

#### Contents
1.  Introduction
1.  Objective


#### Introduction
This notebook analyzes the factors that influence how satisfied employees are with their jobs, using the information related to their employment status.

### Objective
How satisfied is an employee with their job?

## 1.   Import the required libraries

In [41]:
from category_encoders import TargetEncoder
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pandas_profiling import ProfileReport
import plotly.express as px
from sklearn import model_selection
from sklearn.preprocessing import LabelEncoder

## 2.   Load the dataset

In [42]:
X_train = pd.read_parquet('../data/interim/x_train.gzip')
# Drop the columns that are not used as features for this model.
X_train.drop(['BusinessTravel', 'Education', 'EnvironmentSatisfaction', 'Gender',
            'JobInvolvement', 'JobLevel', 'JobRole', 'MaritalStatus', 'RelationshipSatisfaction',
            'StockOptionLevel', 'TotalWorkingYears', 'PercentSalaryHike', 'YearsWithCurrManager', 'OverTime'], axis=1, inplace=True)

y_train = pd.read_csv('../data/interim/y_train.csv', index_col=0)
y_train = y_train.astype('category')
y_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1176 entries, 638 to 830
Data columns (total 1 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   JobSatisfaction  1176 non-null   category
dtypes: category(1)
memory usage: 10.5 KB


In [43]:
X_train.head(10)

Unnamed: 0,Age,Attrition,DailyRate,DistanceFromHome,EducationField,HourlyRate,MonthlyIncome,MonthlyRate,NumCompaniesWorked,PerformanceRating,TrainingTimesLastYear,WorkLifeBalance,YearsInCurrentRole,YearsSinceLastPromotion
638,25,0,583,4,Marketing,87,4256,18154,1,3,1,4,2,0
1356,41,0,337,8,Marketing,54,4393,26841,5,4,3,3,4,1
494,34,0,204,14,Technical Degree,31,2579,2912,1,3,3,3,2,0
1056,28,1,1496,1,Technical Degree,92,2909,15747,3,3,3,4,2,1
805,45,0,1050,9,Life Sciences,65,5593,17970,1,3,2,3,10,4
500,32,0,646,9,Life Sciences,92,6322,18089,1,3,2,2,4,0
1176,49,0,301,22,Other,72,16413,3498,3,3,2,3,2,1
614,26,1,887,5,Medical,88,2366,20898,1,3,2,3,7,1
1251,30,0,979,15,Marketing,94,7140,3088,2,3,2,3,7,1
1426,32,0,267,29,Life Sciences,49,2837,15919,1,3,3,3,2,4


In [44]:
y_train.head(10)

Unnamed: 0,JobSatisfaction
638,1
1356,2
494,3
1056,3
805,3
500,4
1176,2
614,3
1251,1
1426,2


## 3. Encode categorical columns

In [45]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1176 entries, 638 to 830
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Age                      1176 non-null   int64 
 1   Attrition                1176 non-null   int64 
 2   DailyRate                1176 non-null   int64 
 3   DistanceFromHome         1176 non-null   int64 
 4   EducationField           1176 non-null   object
 5   HourlyRate               1176 non-null   int64 
 6   MonthlyIncome            1176 non-null   int64 
 7   MonthlyRate              1176 non-null   int64 
 8   NumCompaniesWorked       1176 non-null   int64 
 9   PerformanceRating        1176 non-null   int64 
 10  TrainingTimesLastYear    1176 non-null   int64 
 11  WorkLifeBalance          1176 non-null   int64 
 12  YearsInCurrentRole       1176 non-null   int64 
 13  YearsSinceLastPromotion  1176 non-null   int64 
dtypes: int64(13), object(1)
memory usage: 1

In [46]:
cat_cols = ['Attrition', 'EducationField']
for col in cat_cols:
    X_train = pd.concat([X_train.drop(col, axis=1), pd.get_dummies(X_train[col], prefix=col, drop_first=True)], axis=1)
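
The loop above one-hot encodes each column in turn and drops the first level of each. As a minimal sketch on a toy frame (the `toy` data is made up for illustration and is not part of the dataset), the same result can be obtained in a single call through the `columns` argument of `pd.get_dummies`:

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the encoding step.
toy = pd.DataFrame({
    "Age": [25, 41, 34],
    "Attrition": [0, 1, 0],
    "EducationField": ["Marketing", "Medical", "Life Sciences"],
})
cat_cols = ["Attrition", "EducationField"]

# Column-by-column encoding, as in the cell above.
looped = toy.copy()
for col in cat_cols:
    dummies = pd.get_dummies(looped[col], prefix=col, drop_first=True)
    looped = pd.concat([looped.drop(col, axis=1), dummies], axis=1)

# Single-call equivalent: encode all listed columns at once.
single = pd.get_dummies(toy, columns=cat_cols, drop_first=True)

print(single.columns.tolist())
```

Dropping the first level avoids a redundant column per variable, since that level is implied when all the remaining dummies are zero.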

In [47]:
X_train.head(10)

Unnamed: 0,Age,DailyRate,DistanceFromHome,HourlyRate,MonthlyIncome,MonthlyRate,NumCompaniesWorked,PerformanceRating,TrainingTimesLastYear,WorkLifeBalance,YearsInCurrentRole,YearsSinceLastPromotion,Attrition_1,EducationField_Life Sciences,EducationField_Marketing,EducationField_Medical,EducationField_Other,EducationField_Technical Degree
638,25,583,4,87,4256,18154,1,3,1,4,2,0,0,0,1,0,0,0
1356,41,337,8,54,4393,26841,5,4,3,3,4,1,0,0,1,0,0,0
494,34,204,14,31,2579,2912,1,3,3,3,2,0,0,0,0,0,0,1
1056,28,1496,1,92,2909,15747,3,3,3,4,2,1,1,0,0,0,0,1
805,45,1050,9,65,5593,17970,1,3,2,3,10,4,0,1,0,0,0,0
500,32,646,9,92,6322,18089,1,3,2,2,4,0,0,1,0,0,0,0
1176,49,301,22,72,16413,3498,3,3,2,3,2,1,0,0,0,0,1,0
614,26,887,5,88,2366,20898,1,3,2,3,7,1,1,0,0,1,0,0
1251,30,979,15,94,7140,3088,2,3,2,3,7,1,0,0,1,0,0,0
1426,32,267,29,49,2837,15919,1,3,3,3,2,4,0,1,0,0,0,0


Check the dataset that will be used.

In [48]:
profile2 = ProfileReport(X_train, title="Data Profiling Report 2" )
profile2.to_notebook_iframe()

In [49]:
profile2.to_file("../reports/data_profiling_bivariate_report_2.html")

## 4. Model generation

First, load the metrics that will be used to evaluate the models.

In [50]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
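
Because `JobSatisfaction` takes more than two values, `precision_score`, `recall_score` and `f1_score` need an explicit `average` argument; their default (`average='binary'`) raises an error for multiclass targets. A minimal sketch with made-up labels (`y_true` and `y_pred` are hypothetical, and macro averaging is an assumed choice, not one stated in this notebook):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels on the 1-4 satisfaction scale, for illustration only.
y_true = [1, 2, 3, 4, 3, 2, 1, 4]
y_pred = [1, 2, 3, 3, 3, 1, 1, 4]

# Macro averaging computes each metric per class and takes the unweighted mean.
scores = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
    "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
    "f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Macro averaging weights every satisfaction level equally; `average='weighted'` would instead weight each class by its number of samples.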

Next, import the supervised learning algorithms that will be used.

In [51]:
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

Finally, build and evaluate the different models.

In [54]:
# Random seed; note that it is not passed to KFold or to the models below,
# so the scores change slightly from run to run.
seed = 2
models = []

# K-nearest neighbors
models.append(('KNN', KNeighborsClassifier(n_neighbors=4)))

# Decision tree classifier
models.append(('CART', DecisionTreeClassifier()))

# Naive Bayes
models.append(('NB', GaussianNB()))

# Random forest (ensemble of decision trees)
models.append(('RF', RandomForestClassifier()))

# Evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    # K-fold cross-validation for model selection
    kfold = model_selection.KFold(n_splits=10, shuffle=True)
    # Score the model on X_train / y_train across the 10 folds
    cv_results = model_selection.cross_val_score(model, X_train, y_train.values.ravel(), cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = f"- Accuracy of {name} : {cv_results.mean()} (+/- {cv_results.std()})"
    print(msg)

- Accuracy of KNN : 0.24319136607272202 (+/- 0.02496326134176262)
- Accuracy of CART : 0.24932637983485445 (+/- 0.054653315708258945)
- Accuracy of NB : 0.2780747501086484 (+/- 0.049336658084556645)
- Accuracy of RF : 0.2839924670433145 (+/- 0.03899868081175658)
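
With four satisfaction levels, an accuracy near 0.25 is close to what uniform random guessing would give, so it helps to put a trivial baseline next to the models. A minimal sketch on synthetic data (the random `X_demo` and `y_demo` below are stand-ins, not the notebook's data; `DummyClassifier` and `StratifiedKFold` are swapped in here for illustration and are not used in the original run):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

seed = 2
rng = np.random.default_rng(seed)

# Synthetic stand-in data: 200 rows, 5 features, labels 1-4 unrelated to the features.
X_demo = rng.normal(size=(200, 5))
y_demo = rng.integers(1, 5, size=200)

# Fixing random_state makes the folds identical across runs and across models.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)

candidates = {
    "Baseline": DummyClassifier(strategy="most_frequent"),
    "RF": RandomForestClassifier(n_estimators=50, random_state=seed),
}
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    for name, model in candidates.items()
}
for name, values in cv_scores.items():
    print(f"- Accuracy of {name}: {values.mean():.3f} (+/- {values.std():.3f})")
```

If the real models do not clearly beat the baseline row, that would suggest the selected features carry little signal for `JobSatisfaction`.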


In [55]:
result_df = pd.DataFrame(results, index=names).T
px.box(result_df,title = 'Algorithm Comparison')