# Logistic Regression Pipeline Example
This notebook shows how to preprocess mixed numerical and categorical data with scikit-learn's `ColumnTransformer` and `Pipeline`, and then train a `LogisticRegression` classifier on the result.

In [1]:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Sample dataset creation
We create a ten-row dataset with two numerical features (`age`, `income`), one categorical feature (`city`), and a binary target (`purchased`).

In [2]:
data = pd.DataFrame({
    "age": [25, 45, 35, 50, 23, 30, 55, 40, 29, 48],
    "income": [30000, 60000, 50000, 80000, 25000, 45000, 70000, 55000, 32000, 72000],
    "city": ["Warsaw", "Krakow", "Gdansk", "Warsaw", "Wroclaw",
             "Krakow", "Gdansk", "Warsaw", "Wroclaw", "Gdansk"],
    "purchased": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
})

## Splitting the data
We separate features (`X`) from labels (`y`) and split the data into training and testing sets.

In [3]:
X = data.drop(columns=["purchased"])
y = data["purchased"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
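
With only three rows in the test set, a purely random split does not guarantee that both classes appear in it. The sketch below shows the `stratify` argument of `train_test_split`, which keeps class proportions as close to the full data as the split sizes allow. The names `X_demo`, `y_demo`, and the `Xs_`/`ys_` variables are illustrative and not part of the notebook's own cells.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Rebuild the toy features and target so this cell runs on its own.
X_demo = pd.DataFrame({
    "age": [25, 45, 35, 50, 23, 30, 55, 40, 29, 48],
    "income": [30000, 60000, 50000, 80000, 25000, 45000, 70000, 55000, 32000, 72000],
})
y_demo = pd.Series([1, 0, 1, 0, 1, 0, 1, 0, 1, 0], name="purchased")

# stratify=y_demo asks for class proportions in each split to match the
# full data as closely as the split sizes allow.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print("train class counts:", ys_train.value_counts().to_dict())
print("test class counts:", ys_test.value_counts().to_dict())
```

Because three test rows cannot be divided evenly between two classes, the stratified test set still holds two rows of one class and one of the other.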

## Defining preprocessing
We scale numerical features and one-hot encode categorical features using `ColumnTransformer`.

In [4]:
num_features = ["age", "income"]
cat_features = ["city"]

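preprocessor = ColumnTransformer([
    # Standardize numeric columns to zero mean and unit variance.
    ("num", StandardScaler(), num_features),
    # One-hot encode the city; categories unseen during fit become all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_features)
])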

## Creating a pipeline
We build a pipeline that includes preprocessing and a logistic regression model.

In [5]:
pipeline = Pipeline([
    ("preprocessing", preprocessor),
    # max_iter=100 is scikit-learn's default, written out here for visibility.
    ("classifier", LogisticRegression(max_iter=100))
])
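
The step names given to `Pipeline` are more than labels: they are how individual steps and their parameters are addressed later. A small sketch of this, using a simplified pipeline with a plain `StandardScaler` in place of the `ColumnTransformer`. The value `C=0.5` is an arbitrary example, not a recommended setting.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Simplified stand-in for the notebook's pipeline, with the same step names.
demo_pipeline = Pipeline([
    ("preprocessing", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=100)),
])

# Steps can be looked up by the name given in the constructor.
clf = demo_pipeline.named_steps["classifier"]

# Nested parameters use the "<step name>__<parameter>" convention.
demo_pipeline.set_params(classifier__C=0.5)
print(clf.C)
```

The same `step__parameter` naming is what tools such as `GridSearchCV` expect in their parameter grids.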

## Training the model
We fit the pipeline on the training data.

In [6]:
pipeline.fit(X_train, y_train)

## Making predictions and evaluating the model
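We predict on the held-out test set and compute the accuracy. With `test_size=0.3` on ten rows the test set has only three samples, so the score can only be 0, 1/3, 2/3, or 1 and should be read as a rough indication at best.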

In [7]:
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Model accuracy on test set:", accuracy)

Model accuracy on test set: 0.3333333333333333
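
One common way to get a less split-dependent estimate on small data is cross-validation. The sketch below scores the same kind of pipeline with stratified 5-fold cross-validation. The fold count, the `random_state=0` seed, and the names `toy`, `cv_pipeline`, and `scores` are assumptions made for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "age": [25, 45, 35, 50, 23, 30, 55, 40, 29, 48],
    "income": [30000, 60000, 50000, 80000, 25000, 45000, 70000, 55000, 32000, 72000],
    "city": ["Warsaw", "Krakow", "Gdansk", "Warsaw", "Wroclaw",
             "Krakow", "Gdansk", "Warsaw", "Wroclaw", "Gdansk"],
    "purchased": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})
X_all = toy.drop(columns=["purchased"])
y_all = toy["purchased"]

cv_pipeline = Pipeline([
    ("preprocessing", ColumnTransformer([
        ("num", StandardScaler(), ["age", "income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ])),
    ("classifier", LogisticRegression(max_iter=100)),
])

# Five stratified folds: each held-out fold has one row from each class.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# The pipeline is refit inside every fold, so scaling and encoding
# are learned from that fold's training rows only.
scores = cross_val_score(cv_pipeline, X_all, y_all, cv=folds, scoring="accuracy")
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```

Passing the whole pipeline to `cross_val_score`, rather than pre-transformed data, keeps the preprocessing from seeing the held-out rows.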


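## Displaying predictions and test data
We print the predicted and actual classes next to the test rows. Here the model predicts class 1 for all three rows, and only the first prediction is correct.
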
In [8]:
print("\nPredicted classes:", y_pred)
print("\nActual classes:", y_test.values)
print("\nExample test data:")
print(X_test)


Predicted classes: [1 1 1]

Actual classes: [1 0 0]

Example test data:
   age  income     city
8   29   32000  Wroclaw
1   45   60000   Krakow
5   30   45000   Krakow
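
The printed arrays can be summarized in a confusion matrix, which separates the two kinds of error that a single accuracy number hides. This sketch copies the predicted and actual classes from the output above into plain lists. The names `y_true`, `y_hat`, and `cm` are illustrative.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Values copied from the notebook output above.
y_true = [1, 0, 0]   # actual classes
y_hat = [1, 1, 1]    # predicted classes

# Rows are actual classes (0, then 1); columns are predicted classes (0, then 1).
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])
print(cm)

acc = accuracy_score(y_true, y_hat)
print(acc)
```

The top row shows both class 0 rows predicted as class 1, and the bottom row shows the single class 1 row predicted correctly, which matches the reported accuracy of one third.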
