Seminar Exercise: Using Pipelines in ML

In [1]:
# prompt: mount drive and create a path for sys and for the working directory

from google.colab import drive
drive.mount('/content/drive')

'''
import sys
import os

# Define the path to your working directory in Google Drive
path = "/content/drive/My Drive/cod/LEA3_Seminario"  # Replace with your actual directory


sys.path.append(path)  # lets you import your own functions file with import

os.chdir(path)  # makes file reads and writes start from this path by default
'''


Mounted at /content/drive



In [2]:
import pandas as pd

In [13]:
# URL of the training data (a CSV file hosted on GitHub)
data_url = 'https://raw.githubusercontent.com/mateotl/LEA3_Seminario/refs/heads/main/aug_train.csv'
df = pd.read_csv(data_url, sep=',')


In [14]:
df.head()

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0


In [15]:
# Making Dictionaries of ordinal features

relevent_experience_map = {
    'Has relevent experience':  1,
    'No relevent experience':    0
}

experience_map = {
    '<1'      :    0,
    '1'       :    1,
    '2'       :    2,
    '3'       :    3,
    '4'       :    4,
    '5'       :    5,
    '6'       :    6,
    '7'       :    7,
    '8'       :    8,
    '9'       :    9,
    '10'      :    10,
    '11'      :    11,
    '12'      :    12,
    '13'      :    13,
    '14'      :    14,
    '15'      :    15,
    '16'      :    16,
    '17'      :    17,
    '18'      :    18,
    '19'      :    19,
    '20'      :    20,
    '>20'     :    21
}

last_new_job_map = {
    'never'        :    0,
    '1'            :    1,
    '2'            :    2,
    '3'            :    3,
    '4'            :    4,
    '>4'           :    5
}

# Transform the ordinal categorical features into numerical features

def encode(df_pre):
    # Work on a copy so the caller's DataFrame is not modified in place
    df_pre = df_pre.copy()
    df_pre['relevent_experience'] = df_pre['relevent_experience'].map(relevent_experience_map)
    df_pre['last_new_job'] = df_pre['last_new_job'].map(last_new_job_map)
    df_pre['experience'] = df_pre['experience'].map(experience_map)

    return df_pre

df = encode(df)
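As a quick check of how the dictionary mapping behaves, here is a minimal sketch on a toy column. The DataFrame `toy` and the label `'unknown'` are made up for illustration and are not part of the dataset. It shows that `Series.map` returns `NaN` both for missing values and for labels that are absent from the dictionary, which is why an imputation step is still needed later.

```python
import pandas as pd

# Toy column; 'unknown' is a hypothetical label that is not in the mapping
toy = pd.DataFrame({'last_new_job': ['never', '1', '>4', None, 'unknown']})

toy_map = {'never': 0, '1': 1, '2': 2, '3': 3, '4': 4, '>4': 5}

# Series.map looks each value up in the dict; anything not found becomes NaN
encoded = toy['last_new_job'].map(toy_map)
print(encoded.tolist())
```

Because the result contains `NaN`, pandas stores the mapped column as floats rather than integers.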

In [16]:
# Numerical and categorical data should be transformed in different ways,
# so we define num_cols for the numerical columns and cat_cols for the categorical columns.

num_cols = ['city_development_index','relevent_experience', 'experience','last_new_job', 'training_hours']

cat_cols = ['gender', 'enrolled_university', 'education_level', 'major_discipline', 'company_size', 'company_type']

Create Pipelines for Numerical and Categorical Features

The syntax of the pipeline is:

`Pipeline(steps=[('step name', transform_function), ...])`
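To make the syntax concrete, here is a minimal sketch on a toy array (`toy_X` and `toy_pipeline` are illustrative names, not part of the exercise). Each step is a `(name, transformer)` tuple, and `fit_transform` runs the steps in order: the missing value is first replaced by the column mean, then the column is rescaled to the range [0, 1].

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

# One numeric column with a missing value
toy_X = np.array([[1.0], [np.nan], [3.0]])

toy_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),  # NaN -> mean of 1 and 3, i.e. 2
    ('scale', MinMaxScaler()),                   # rescale [1, 2, 3] to [0, 0.5, 1]
])

toy_out = toy_pipeline.fit_transform(toy_X)
print(toy_out.ravel())
```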

In [17]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline

num_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale',MinMaxScaler())
])
cat_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('one-hot',OneHotEncoder(handle_unknown='ignore'))
])
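The effect of `handle_unknown='ignore'` is easiest to see on a small example. The sketch below uses a made-up `gender` column and a hypothetical category `'Other'` that does not appear during fitting. It rebuilds a categorical pipeline with the same two steps and shows that an unseen category is encoded as a row of zeros instead of raising an error.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

# Toy training data with one missing value; dtype=object keeps plain Python strings
toy_train = pd.DataFrame({'gender': ['Male', 'Female', np.nan, 'Male']}, dtype=object)
# 'Other' is a hypothetical category never seen while fitting
toy_new = pd.DataFrame({'gender': ['Female', 'Other']}, dtype=object)

toy_cat_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),  # NaN -> 'Male'
    ('one-hot', OneHotEncoder(handle_unknown='ignore')),  # learned columns: Female, Male
])
toy_cat_pipeline.fit(toy_train)

# The encoder returns a sparse matrix by default, so convert it for printing
encoded_new = toy_cat_pipeline.transform(toy_new).toarray()
print(encoded_new)
```

The first row activates the `Female` column; the second row is all zeros because `'Other'` was not among the fitted categories.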

Create ColumnTransformer to Apply the Pipeline for Each Column Set

The syntax of the ColumnTransformer is:

`ColumnTransformer(transformers=[('step name', transform_function, cols), ...])`


In [18]:
from sklearn.compose import ColumnTransformer

col_trans = ColumnTransformer(transformers=[
    ('num_pipeline',num_pipeline,num_cols),
    ('cat_pipeline',cat_pipeline,cat_cols)
    ],
    remainder='drop',  # drop any column not listed in num_cols or cat_cols
    n_jobs=-1)

Add a Model to the Final Pipeline

In [19]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0)
clf_pipeline = Pipeline(steps=[
    ('col_trans', col_trans),
    ('model', clf)
])
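Naming the steps also makes them addressable afterwards. The sketch below uses a smaller stand-in pipeline (`toy_clf_pipeline` is an illustrative name, and the value `C=0.5` is arbitrary) to show two common idioms: looking a step up through `named_steps`, and changing a nested hyperparameter with the `<step name>__<parameter>` convention.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

toy_clf_pipeline = Pipeline(steps=[
    ('scale', MinMaxScaler()),
    ('model', LogisticRegression(random_state=0)),
])

# Retrieve a step by the name given in the steps list
toy_model = toy_clf_pipeline.named_steps['model']
print(type(toy_model).__name__)

# Nested parameters are addressed as '<step name>__<parameter>'
toy_clf_pipeline.set_params(model__C=0.5)
print(toy_clf_pipeline.get_params()['model__C'])
```

The same double-underscore names are what tools such as `GridSearchCV` expect in their parameter grids.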

Display the Pipeline

In [20]:
from sklearn import set_config

set_config(display='diagram')
display(clf_pipeline)

In [21]:
from sklearn.model_selection import train_test_split

X = df[num_cols+cat_cols]
y = df['target']

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
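The `stratify=y` argument keeps the class proportions the same in both splits. Here is a small sketch with synthetic labels (75% class 0, 25% class 1) to check that. The `random_state=0` argument is an addition for reproducibility and is not used in the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 75 zeros and 25 ones
toy_y = np.array([0] * 75 + [1] * 25)
toy_X = np.arange(100).reshape(-1, 1)

# random_state is an assumption added here so the split is repeatable
toy_X_tr, toy_X_te, toy_y_tr, toy_y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=0
)

# Share of class 1 in each split
print(toy_y_tr.mean(), toy_y_te.mean())
```

Without `stratify`, a small test set could by chance contain noticeably more or fewer positive examples than the training set.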


Pass Data through the Pipeline

In [24]:
# General pattern: pipeline_name.fit, pipeline_name.predict, pipeline_name.score

In [23]:
clf_pipeline.fit(X_train, y_train)
# preds = clf_pipeline.predict(X_test)
score = clf_pipeline.score(X_test, y_test)
print(f"Model score: {score}") # model accuracy

Model score: 0.7612212943632568
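For a classifier pipeline, `score` is the mean accuracy of `predict` on the given data. The sketch below checks this on synthetic data (the arrays and the name `toy_pipe` are made up, so the numbers are unrelated to the score above) and also shows `predict_proba`, which returns one probability per class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic two-feature data with a simple linear decision rule
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 2))
toy_y = (toy_X[:, 0] + toy_X[:, 1] > 0).astype(int)

toy_pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('model', LogisticRegression(random_state=0)),
])
toy_pipe.fit(toy_X[:150], toy_y[:150])

toy_preds = toy_pipe.predict(toy_X[150:])        # hard class labels
toy_probs = toy_pipe.predict_proba(toy_X[150:])  # one column per class

toy_acc = accuracy_score(toy_y[150:], toy_preds)
toy_score = toy_pipe.score(toy_X[150:], toy_y[150:])
print(toy_acc, toy_score)
```

When the classes are imbalanced, accuracy alone can be misleading, so it may be worth looking at the predicted probabilities or at per-class metrics as well.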
