### Pipelines
A pipeline is a sequence of transformations applied to the dataset, followed by a model that is trained on the training data and evaluated on the test data. The transformations run sequentially: the output of each transformation in the pipeline is passed as input to the next one.

In [1]:
import pandas as pd
import numpy as np
from sklearn import datasets

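# Load the Iris dataset: X holds the four features, Y the class labels
iris = datasets.load_iris()
X = iris.data
Y = iris.target

# Hold out 20% of the samples as the test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, train_size=0.8)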

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import Normalizer

# Preprocessing: rescale each feature to [0, 1], then project onto 2 principal components
preprocessing_subpipeline = Pipeline(steps=[
    ('scale_01', MinMaxScaler(feature_range=(0, 1))),
    ('PCA', PCA(n_components=2))
])

In [11]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='lbfgs')

In [12]:
pipeline = Pipeline(steps=[
    ('preprocessing', preprocessing_subpipeline),
    ('model', model)
])
pipeline

In [13]:
from sklearn.metrics import accuracy_score

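# Fit every step (preprocessing and model) on the training data only
pipeline.fit(X_train, Y_train)
# predict() applies the already-fitted transformations to X_test, then the model
predictions = pipeline.predict(X_test)
accuracy = accuracy_score(Y_test, predictions)
accuracy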

0.8666666666666667
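
To make the sequential behaviour concrete, the sketch below compares a pipeline with the same steps chained by hand. It is only an illustration under a few assumptions: the steps are written as one flat pipeline rather than a nested one, `random_state=0` is added so the split is reproducible (the cells above use an unseeded split), and all variable names are made up for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, Y = datasets.load_iris(return_X_y=True)
# random_state is an assumption added here for reproducibility
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

# Pipeline version: fit() calls fit_transform on each transformer in order,
# then fits the final estimator on the transformed training data
pipe = Pipeline(steps=[
    ('scale_01', MinMaxScaler(feature_range=(0, 1))),
    ('PCA', PCA(n_components=2)),
    ('model', LogisticRegression(solver='lbfgs')),
])
pipe.fit(X_train, Y_train)
pipe_predictions = pipe.predict(X_test)

# Manual version: the same steps chained by hand
scaler = MinMaxScaler(feature_range=(0, 1))
pca = PCA(n_components=2)
clf = LogisticRegression(solver='lbfgs')

X_train_scaled = scaler.fit_transform(X_train)       # output of step 1 ...
X_train_reduced = pca.fit_transform(X_train_scaled)  # ... is the input of step 2
clf.fit(X_train_reduced, Y_train)

# On test data the transformers are only applied (transform), never re-fitted
X_test_reduced = pca.transform(scaler.transform(X_test))
manual_predictions = clf.predict(X_test_reduced)

same = np.array_equal(pipe_predictions, manual_predictions)
print(same)
```

Besides being shorter, the pipeline version makes it harder to fit a transformer on the test data by mistake, because `predict` only ever calls `transform` on the fitted steps.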

### Column Transformer
We want to apply different transformations to different columns: some transformations will be applied to the categorical features, others to the numerical ones.
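
As a minimal sketch of the idea, the example below applies scikit-learn's `ColumnTransformer` to a tiny made-up DataFrame. The column names `Rooms`, `Distance` and `Type` are borrowed from the dataset loaded in the next cell, but the values, the choice of `MinMaxScaler` and `OneHotEncoder`, and the step names are assumptions for illustration only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data (made up): two numerical columns and one categorical column
toy = pd.DataFrame({
    'Rooms': [2, 3, 4],
    'Distance': [2.5, 10.0, 5.0],
    'Type': ['h', 'u', 'h'],
})

# Each entry is (name, transformer, columns the transformer is applied to)
column_transformer = ColumnTransformer(transformers=[
    ('numeric', MinMaxScaler(), ['Rooms', 'Distance']),
    ('categorical', OneHotEncoder(handle_unknown='ignore'), ['Type']),
])

transformed = column_transformer.fit_transform(toy)

# 2 scaled numerical columns + 2 one-hot columns (one per value of Type)
print(transformed.shape)  # → (3, 4)
```

Columns that are not listed in any transformer are dropped by default; passing `remainder='passthrough'` keeps them unchanged instead.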

In [17]:
dataset = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/2025/06_pipelines/data/melb_data.csv")
dataset.info()

# Separate the features from the target column (Price)
X = dataset.drop(['Price'], axis=1)
Y = dataset['Price']

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         13580 non-null  object 
 1   Address        13580 non-null  object 
 2   Rooms          13580 non-null  int64  
 3   Type           13580 non-null  object 
 4   Price          13580 non-null  float64
 5   Method         13580 non-null  object 
 6   SellerG        13580 non-null  object 
 7   Date           13580 non-null  object 
 8   Distance       13580 non-null  float64
 9   Postcode       13580 non-null  float64
 10  Bedroom2       13580 non-null  float64
 11  Bathroom       13580 non-null  float64
 12  Car            13518 non-null  float64
 13  Landsize       13580 non-null  float64
 14  BuildingArea   7130 non-null   float64
 15  YearBuilt      8205 non-null   float64
 16  CouncilArea    12211 non-null  object 
 17  Lattitude      13580 non-null  float64
 18  Longti

In [18]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.8, test_size=0.2)
