### Data Science and Machine Learning Toolkit
#### By: Sebastián Medina Jiménez 
https://www.linkedin.com/in/sebasmedina/
## 2. Pipelines in scikit-learn




In [1]:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# Load the Iris dataset
iris = load_iris()

# Access the data and target variables
X = iris.data  # Features
y = iris.target  # Target labels (class indices 0, 1, 2)

Build the pipeline with the following steps:
1. Standard scaling
2. Imputation of missing data
3. Random forest classifier

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


pipeline = Pipeline([
    ('scaler', StandardScaler()),      # Step 1: Standardization
    ('imputer', SimpleImputer()),      # Step 2: Impute missing data
    ('classifier', RandomForestClassifier())  # Step 3: Classification model
])
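
The Iris data has no missing values, so the imputer step above never has anything to fill in. The following is a minimal sketch of the same three-step pipeline on a copy of the data with artificially removed entries. The 10% missing rate, the seeds, and the names `X_missing` and `demo_pipeline` are assumptions made for illustration only. Scaling before imputing works here because `StandardScaler` ignores NaN values when fitting and passes them through when transforming.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)

# Assumption: blank out roughly 10% of the entries to simulate missing data.
rng = np.random.default_rng(0)
X_missing = X_demo.copy()
mask = rng.random(X_missing.shape) < 0.1
X_missing[mask] = np.nan

demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),       # NaNs are ignored in fit, kept in transform
    ('imputer', SimpleImputer()),       # default strategy: column mean
    ('classifier', RandomForestClassifier(random_state=0)),
])
demo_pipeline.fit(X_missing, y_demo)

# Slicing the pipeline keeps only the preprocessing steps (scaler + imputer).
X_prepared = demo_pipeline[:-1].transform(X_missing)

print("Missing values before preprocessing:", int(np.isnan(X_missing).sum()))
print("Missing values after preprocessing: ", int(np.isnan(X_prepared).sum()))
```

Because the features are standardized first, the column means that `SimpleImputer` fills in are close to zero.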

Fit the model and compute the metrics

In [3]:
# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# Make predictions on the test data
y_pred = pipeline.predict(X_test)

# Calculate the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Calculate precision
precision = precision_score(y_test, y_pred, average='weighted')

# Calculate recall
recall = recall_score(y_test, y_pred, average='weighted')

# Print the results
print("Confusion Matrix:\n", conf_matrix)
print("Accuracy: {:.3f}%".format(accuracy * 100))
print("Precision: {:.3f}".format(precision))
print("Recall: {:.3f}".format(recall))

Confusion Matrix:
 [[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
Accuracy: 100.000%
Precision: 1.000
Recall: 1.000


**Explanation of Classification Metrics**

- **Confusion Matrix:**
  - The confusion matrix is a table that shows how your classification model performed.
  - Here the matrix is 3x3 because the Iris dataset has three classes. Each row represents the actual class, and each column represents the predicted class.
  - The diagonal elements (from top-left to bottom-right) count the correct predictions, where the predicted class matches the actual class. In this matrix, all off-diagonal elements are zero, so every test sample was classified correctly.
  - This confusion matrix shows that the model made:
    - 10 correct predictions for class 0,
    - 9 correct predictions for class 1, and
    - 11 correct predictions for class 2.

- **Accuracy (100.00%):**
  - Accuracy is the fraction of all predictions that were correct.
  - An accuracy of 100% means that all 30 test predictions were correct. It is the best possible score, though on a test set this small it is only a rough estimate of how the model generalizes.

- **Precision (1.00):**
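  - Precision is computed per class: of all samples predicted as a given class, it is the fraction that truly belong to that class. With `average='weighted'`, the per-class values for classes 0, 1, and 2 are averaged, weighted by the number of actual samples in each class.
  - A precision of 1.00 means that every prediction for every class was correct. There were no false positives.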

- **Recall (1.00):**
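  - Recall (or sensitivity) is also computed per class: of all samples that actually belong to a given class, it is the fraction that the model predicted as that class. With `average='weighted'`, the per-class values are averaged, weighted by the number of actual samples in each class.
  - A recall of 1.00 means that every sample of every class was predicted correctly. There were no false negatives.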
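
A perfect score hides how these metrics differ from one another, so the sketch below recomputes them by hand from a confusion matrix that contains one error. The labels `y_true_demo` and `y_pred_demo` are hypothetical values chosen for illustration and are not taken from the Iris run above.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical labels: one class-1 sample is misclassified as class 2.
y_true_demo = np.array([0, 0, 1, 1, 2, 2])
y_pred_demo = np.array([0, 0, 1, 2, 2, 2])

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)

correct = np.diag(cm)        # correct predictions per class (diagonal)
support = cm.sum(axis=1)     # actual samples per class (row sums)
predicted = cm.sum(axis=0)   # predictions made per class (column sums)

accuracy_demo = correct.sum() / cm.sum()
precision_per_class = correct / predicted
recall_per_class = correct / support

# average='weighted' weights each per-class value by its support.
precision_weighted = np.average(precision_per_class, weights=support)
recall_weighted = np.average(recall_per_class, weights=support)

print(cm)
print("Accuracy: {:.3f}".format(accuracy_demo))
print("Precision per class:", precision_per_class.round(3))
print("Recall per class:", recall_per_class.round(3))

# The manual values agree with scikit-learn's own functions.
print(np.isclose(accuracy_demo, accuracy_score(y_true_demo, y_pred_demo)))
print(np.isclose(precision_weighted,
                 precision_score(y_true_demo, y_pred_demo, average='weighted')))
print(np.isclose(recall_weighted,
                 recall_score(y_true_demo, y_pred_demo, average='weighted')))
```

In this example the single error lowers recall for class 1 (the sample that was missed) and precision for class 2 (the class that received the wrong prediction). Weighted recall always equals accuracy, because weighting each class's recall by its support adds up the correct predictions and divides by the total number of samples.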
