These are the variables recorded for each individual:

    species (penguin species: Chinstrap, Adelie or Gentoo)
    island (island: Dream, Torgersen or Biscoe)
    bill_length_mm (bill length in mm)
    bill_depth_mm (bill depth in mm)
    flipper_length_mm (flipper length in mm)
    body_mass_g (body mass in grams)
    sex (sex: Male or Female)

Note: the column names in the Kaggle dataset are slightly different: bill_length_mm and bill_depth_mm are called culmen_length_mm and culmen_depth_mm there.
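
If you start from the Kaggle file instead, a rename along these lines should align its columns with the Seaborn names used in the rest of this notebook. This is only a sketch: the two-row DataFrame is made-up stand-in data, and the names kaggle_df and renamed_df are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle file; the values are illustrative only.
kaggle_df = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "culmen_length_mm": [39.1, 46.1],
    "culmen_depth_mm": [18.7, 13.2],
})

# Map the Kaggle column names onto the Seaborn ones.
renamed_df = kaggle_df.rename(columns={
    "culmen_length_mm": "bill_length_mm",
    "culmen_depth_mm": "bill_depth_mm",
})

print(list(renamed_df.columns))
```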


In [99]:
import seaborn as sns
import pandas as pd

# Load the penguins dataset with Seaborn; load_dataset already returns a pandas DataFrame
df = sns.load_dataset("penguins")

# Drop the rows with missing values
df = df.dropna()

# Show the first rows of the DataFrame
df.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male
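
The output above jumps from index 2 to index 4, presumably because dropna() removed row 3 and kept the original index labels. The sketch below shows that behaviour on made-up rows (not real penguin measurements) and also counts the missing values per column before dropping them; the names sample, missing_per_column and cleaned are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the penguins data; NaN and None mark missing values.
sample = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3],
    "body_mass_g": [3750.0, 3800.0, np.nan],
    "sex": ["Male", "Female", None],
})

# Count the missing values per column before dropping anything.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# dropna() keeps only the rows without missing values and preserves the
# original index labels, so gaps in the index are expected afterwards.
cleaned = sample.dropna()
print(cleaned.index.tolist())
```

If a contiguous index is preferred, calling reset_index(drop=True) on the result would renumber the rows.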


In [100]:
# Show the data type of each column of the DataFrame
df.dtypes

species               object
island                object
bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm    float64
body_mass_g          float64
sex                   object
dtype: object

In [101]:
from sklearn.model_selection import train_test_split

X = df.drop(columns='species')
Y = df['species']

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
print(f'Training set size: {len(X_train)}')
print(f'Test set size: {len(y_test)}')

Training set size: 266
Test set size: 67
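
The split above is purely random. If the species are imbalanced, one possible variation (not what the notebook does) is to pass stratify so that both parts keep the same class proportions. A minimal sketch on made-up labels; the names labels, features, X_tr, X_te, y_tr and y_te are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels standing in for df['species'].
labels = pd.Series(["Adelie"] * 60 + ["Gentoo"] * 30 + ["Chinstrap"] * 10)
features = pd.DataFrame({"dummy_feature": range(len(labels))})

# stratify=labels asks scikit-learn to keep the class proportions
# (60/30/10 here) the same in the training and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

print(y_te.value_counts().to_dict())
```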


In [102]:
# Show the first rows of the training set
X_train.head()


Unnamed: 0,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
230,Biscoe,40.9,13.7,214.0,4650.0,Female
84,Dream,37.3,17.8,191.0,3350.0,Female
303,Biscoe,50.0,15.9,224.0,5350.0,Male
22,Biscoe,35.9,19.2,189.0,3800.0,Female
29,Biscoe,40.5,18.9,180.0,3950.0,Male


In [103]:
# Show the first rows of the test set
X_test.head()

Unnamed: 0,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
30,Dream,39.5,16.7,178.0,3250.0,Female
317,Biscoe,46.9,14.6,222.0,4875.0,Female
79,Torgersen,42.1,19.1,195.0,4000.0,Male
201,Dream,49.8,17.3,198.0,3675.0,Female
63,Biscoe,41.1,18.2,192.0,4050.0,Male


In [104]:
import numpy as np

from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Separate the categorical and numerical columns
categorical_features = X_train.select_dtypes(include=["object"]).columns
numerical_features = X_train.select_dtypes(exclude=["object"]).columns

# Build lists of dictionaries (one per row) for the training and test sets
X_train_dict = X_train[categorical_features].to_dict(orient='records')
X_test_dict = X_test[categorical_features].to_dict(orient='records')

# Apply one-hot encoding with DictVectorizer (fitted on the training set only)
vectorizer = DictVectorizer(sparse=False)
X_train_categorical = vectorizer.fit_transform(X_train_dict)
X_test_categorical = vectorizer.transform(X_test_dict)

# Scale the numerical features (statistics computed on the training set only)
scaler = StandardScaler()
X_train_numerical = scaler.fit_transform(X_train[numerical_features])
X_test_numerical = scaler.transform(X_test[numerical_features])

# Combine numerical and categorical features
X_train_prepared = np.hstack((X_train_numerical, X_train_categorical))
X_test_prepared = np.hstack((X_test_numerical, X_test_categorical))

# Encode the target variable
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)


In [105]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
import joblib

# Define the models
models = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'svm': SVC(),
    'decision_tree': DecisionTreeClassifier(),
    'knn': KNeighborsClassifier()
}

# Train and save the models
for name, model in models.items():
    model.fit(X_train_prepared, y_train_encoded)
    joblib.dump(model, f'../models/{name}.pkl')
    print(f'Model {name} saved.')


Model logistic_regression saved.
Model svm saved.
Model decision_tree saved.
Model knn saved.


In [106]:
# Save the vectorizer and the scaler
joblib.dump(vectorizer, '../models/vectorizer.pkl')
joblib.dump(scaler, '../models/scaler.pkl')
print('Vectorizer and scaler saved.')


Vectorizer and scaler saved.
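
To use the saved files later, the preprocessing objects have to be reloaded and applied in the same order as during training. The sketch below walks through that round trip on made-up rows and a temporary directory standing in for ../models; every name prefixed with demo_ is hypothetical. Note one assumption that goes beyond the cells above: the sketch also saves the LabelEncoder, which the notebook does not do, because it is needed to turn the predicted integers back into species names.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Made-up training rows standing in for the penguins data.
demo_cat = [
    {"island": "Biscoe", "sex": "Female"},
    {"island": "Dream", "sex": "Male"},
    {"island": "Biscoe", "sex": "Male"},
    {"island": "Dream", "sex": "Female"},
]
demo_num = np.array([[46.0, 14.0], [49.0, 18.5], [47.5, 15.0], [50.0, 19.0]])
demo_species = ["Gentoo", "Chinstrap", "Gentoo", "Chinstrap"]

# Fit the preprocessing objects and one model, as in the cells above.
demo_vectorizer = DictVectorizer(sparse=False)
demo_scaler = StandardScaler()
demo_encoder = LabelEncoder()
demo_X = np.hstack((demo_scaler.fit_transform(demo_num),
                    demo_vectorizer.fit_transform(demo_cat)))
demo_model = LogisticRegression(max_iter=1000)
demo_model.fit(demo_X, demo_encoder.fit_transform(demo_species))

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)  # temporary stand-in for ../models
    artifacts = {
        "logistic_regression": demo_model,
        "vectorizer": demo_vectorizer,
        "scaler": demo_scaler,
        "label_encoder": demo_encoder,  # assumption: not saved in the notebook
    }
    for name, obj in artifacts.items():
        joblib.dump(obj, model_dir / f"{name}.pkl")

    # Reload everything from disk, keyed by file name without extension.
    loaded = {path.stem: joblib.load(path) for path in model_dir.glob("*.pkl")}

# Preprocess a new individual in the same order as during training:
# scaled numerical columns first, then the one-hot columns.
new_cat = [{"island": "Biscoe", "sex": "Female"}]
new_num = np.array([[46.5, 14.2]])
new_X = np.hstack((loaded["scaler"].transform(new_num),
                   loaded["vectorizer"].transform(new_cat)))

prediction = loaded["logistic_regression"].predict(new_X)
predicted_species = loaded["label_encoder"].inverse_transform(prediction)[0]
print(predicted_species)
```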
