Step 1: Import libraries and load data

The first step is to import the necessary libraries and load the data. For this example, we will use the iris dataset from scikit-learn's datasets module, which contains 150 flower samples described by 4 numeric features and labeled with one of 3 species.

In [1]:
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
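
As a quick sanity check, the sketch below prints the shape of the feature matrix along with the feature and class names. It only reads attributes of the object returned by load_iris, and nothing in it is required for the later steps.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Feature matrix: one row per flower, one column per measurement
print(X.shape)                  # (150, 4)

# Names of the four measurements and of the three species
print(iris.feature_names)
print(list(iris.target_names))
```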

Step 2: Split the data into training and testing sets

Next, split the data into training and testing sets using scikit-learn's train_test_split function. We will use 70% of the data for training and 30% for testing; fixing random_state makes the split reproducible.

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
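
An optional variation, not used in the rest of this walkthrough: passing stratify=y asks train_test_split to keep the class proportions the same in both parts. The variable names with the _s suffix are hypothetical, chosen so that they do not overwrite the split above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same 70/30 split, but stratified on the class labels.
# Assumption: the walkthrough itself uses an unstratified split.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Number of test samples per class
test_counts = np.bincount(y_test_s)
print(X_train_s.shape, X_test_s.shape)
print(test_counts)
```

Stratifying mostly matters for small or imbalanced datasets, where a purely random split can leave one class underrepresented in the test set.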


Step 3: Define the pipeline

Now it's time to define the pipeline. In this example, we will create a simple pipeline that consists of two steps: scaling the data and fitting a decision tree classifier.

The pipeline is defined as a list of tuples, where each tuple contains the name of the step and the transformer or estimator object.

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', DecisionTreeClassifier())
])


In [7]:
pipeline
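
To make the list-of-tuples structure concrete, the sketch below rebuilds the same two-step pipeline under the hypothetical name pipe and looks its steps up by name. The names given in the tuples are the keys used here, and they are also the prefixes used later when addressing a step's parameters.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical name `pipe`, so the walkthrough's `pipeline` is left untouched
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', DecisionTreeClassifier()),
])

# The steps attribute holds the (name, object) tuples in order
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Individual steps can be retrieved by name
scaler_step = pipe.named_steps['scaler']
classifier_step = pipe['classifier']
print(type(scaler_step).__name__, type(classifier_step).__name__)
```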

Step 4: Fit the pipeline to the training data

Now that the pipeline is defined, we can fit it to the training data using the fit method.

In [8]:
pipeline.fit(X_train, y_train)
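
Roughly speaking, fitting this pipeline is equivalent to fitting the scaler on the training data, transforming the training data, and then fitting the classifier on the result. The sketch below spells out that manual version and checks that it predicts the same labels. The random_state=0 on the tree and the names pipe, scaler, and clf are assumptions added so that the comparison is reproducible.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pipeline version (random_state=0 is an assumption, for reproducibility)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', DecisionTreeClassifier(random_state=0)),
])
pipe.fit(X_train, y_train)

# Manual version: learn the scaling from the training data only,
# then fit the classifier on the scaled training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train_scaled, y_train)

# At prediction time the test data is only transformed, never re-fitted
manual_pred = clf.predict(scaler.transform(X_test))
same = np.array_equal(pipe.predict(X_test), manual_pred)
print(same)
```

The practical benefit of the pipeline is that the scaler never sees the test data during fitting, which is easy to get wrong when the steps are written by hand.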

Step 5: Make predictions on the testing data

Once the pipeline is trained, we can use it to make predictions on the testing data using the predict method.

In [None]:
y_pred = pipeline.predict(X_test)
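
A single-number summary of these predictions can be computed with scikit-learn's accuracy_score function, which returns the fraction of test samples whose predicted label matches the true label. The sketch below repeats the earlier steps so that it runs on its own; random_state=0 on the tree is an assumption added for reproducibility.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', DecisionTreeClassifier(random_state=0)),  # assumption
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# Fraction of test samples that were classified correctly
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.3f}")
```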

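Step 6: Tune hyperparameters with grid search

So far the decision tree uses its default settings. We can search for better ones with scikit-learn's GridSearchCV, which fits the pipeline for every combination in a parameter grid and scores each combination with cross-validation on the training data. Parameters of a pipeline step are addressed as <step name>__<parameter name>, so classifier__max_depth sets max_depth on the step named 'classifier'.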

In [10]:
from sklearn.model_selection import GridSearchCV

# Define the hyperparameters to tune
hyperparameters = {
    'classifier__max_depth': [2, 3, 4],
    'classifier__min_samples_leaf': [1, 2, 3, 4],
    'classifier__criterion': ['gini', 'entropy']
}

# Perform grid search with cross-validation
grid_search = GridSearchCV(pipeline, hyperparameters, cv=5)
grid_search.fit(X_train, y_train)

# Print the best hyperparameters and their mean cross-validated accuracy
print("Best hyperparameters: ", grid_search.best_params_)
print("Best accuracy: ", grid_search.best_score_)

Best hyperparameters:  {'classifier__criterion': 'gini', 'classifier__max_depth': 4, 'classifier__min_samples_leaf': 3}
Best accuracy:  0.9428571428571428
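
To get a feel for the cost of this search, the sketch below counts the candidates in the grid with scikit-learn's ParameterGrid. The names n_candidates, n_folds, and n_fits are hypothetical; the grid itself is the one defined above.

```python
from sklearn.model_selection import ParameterGrid

hyperparameters = {
    'classifier__max_depth': [2, 3, 4],
    'classifier__min_samples_leaf': [1, 2, 3, 4],
    'classifier__criterion': ['gini', 'entropy'],
}

# Every combination of the listed values is one candidate: 3 * 4 * 2
n_candidates = len(ParameterGrid(hyperparameters))

# Each candidate is fitted once per cross-validation fold (cv=5)
n_folds = 5
n_fits = n_candidates * n_folds

print(n_candidates)  # 24
print(n_fits)        # 120
```

With the default refit=True, GridSearchCV performs one more fit at the end, training the best candidate on the whole training set.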


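Step 7: Make predictions on the testing data

We can now call predict on the fitted grid search object to make predictions on the testing data with the best pipeline it found.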

In [11]:
y_pred = grid_search.predict(X_test)
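
With the default refit=True, the fitted search object exposes the winning pipeline as best_estimator_, and calling predict on the search object delegates to it. The sketch below repeats the earlier steps so that it runs on its own and checks that both calls give the same labels; random_state=0 on the tree is an assumption added for reproducibility.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', DecisionTreeClassifier(random_state=0)),  # assumption
])
hyperparameters = {
    'classifier__max_depth': [2, 3, 4],
    'classifier__min_samples_leaf': [1, 2, 3, 4],
    'classifier__criterion': ['gini', 'entropy'],
}

grid_search = GridSearchCV(pipeline, hyperparameters, cv=5)
grid_search.fit(X_train, y_train)

# The best pipeline, refitted on the full training set
best_pipeline = grid_search.best_estimator_

# Predicting through the search object or through the best pipeline
# directly yields the same labels
via_search = grid_search.predict(X_test)
via_best = best_pipeline.predict(X_test)
print(np.array_equal(via_search, via_best))
```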

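Step 8: Evaluate the performance of the model

Finally, we can evaluate the tuned pipeline on the testing data with scikit-learn's classification_report function, which reports precision, recall, and F1 score for each class along with the overall accuracy.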

In [12]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45
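
As a complement to the report, a confusion matrix shows which classes get mistaken for which: rows are true classes and columns are predicted classes. To stay short, the sketch below uses the untuned pipeline with random_state=0 on the tree, which is an assumption; passing the predictions from the grid search instead works the same way.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', DecisionTreeClassifier(random_state=0)),  # assumption
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# cm[i, j] counts test samples of true class i predicted as class j
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Row sums give the number of test samples per true class
print(cm.sum(axis=1))
```

Off-diagonal entries are the misclassifications, so a perfect classifier produces a matrix that is zero everywhere except on the diagonal.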



And that's it! This is a basic example of how to build, tune, and evaluate a simple scikit-learn pipeline. You can customize the pipeline by adding or removing steps, or by using different transformers and estimators as needed.