<a href="https://colab.research.google.com/github/marco10507/ml-portfolio/blob/main/logistic_regression_1_class.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [18]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.pipeline import make_pipeline
import plotly.express as px
import matplotlib
import matplotlib.pyplot as plt

I am generating synthetic data for a binary classification problem with 10 features and 10,000 samples, using a specified random state for reproducibility.

In [None]:
# Generate synthetic data
X, y = make_classification(n_samples=10000, n_features=10, n_classes=2, random_state=42)
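
A note on what this call produces: as far as I know, `make_classification` defaults to 2 informative and 2 redundant features, and fills the remaining columns with noise. The sketch below spells out those defaults explicitly, so it should be equivalent to the call above. The `_demo` names are illustrative and not part of the notebook.

```python
from sklearn.datasets import make_classification

# Same settings as the notebook call, with the relevant defaults written out.
# Assumption: these are the library defaults in the installed scikit-learn version.
X_demo, y_demo = make_classification(
    n_samples=10000,
    n_features=10,
    n_informative=2,         # features that actually carry class signal
    n_redundant=2,           # linear combinations of the informative features
    n_repeated=0,            # no duplicated columns
    n_classes=2,
    n_clusters_per_class=2,
    random_state=42,
)

# The remaining 10 - 2 - 2 - 0 = 6 features are pure noise.
print(X_demo.shape, y_demo.shape)
```

With only a few informative directions among ten features, it is plausible that a low-dimensional projection captures only part of the variance.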

The cell below counts the samples in each class. The two classes are nearly balanced (4,988 vs. 5,012 samples).

In [19]:
unique_values, counts = np.unique(y, return_counts=True)

for value, count in zip(unique_values, counts):
  print(f"{value}: {count} times")

0: 4988 times
1: 5012 times


The first three principal components together explain about 50.44% of the total variance in the standardized data. Roughly half of the variance therefore lies outside these components, so the 2D and 3D projections plotted afterwards give only a partial view of the dataset. PCA is used here for visualization only; the model is trained on all 10 scaled features.

In [25]:
x_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=3)
x_pca = pca.fit_transform(x_scaled)

# Percentage of variance explained by the first 3 principal components
explained_variance = pca.explained_variance_ratio_

# Total variance explained by the first 3 principal components
total_variance = np.sum(explained_variance)
print("Total variance explained by the first 3 principal components:", total_variance)

Total variance explained by the first 3 principal components: 0.504421886673858
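
To put the three-component figure in context, it can help to look at the cumulative explained variance over all ten components. The following is a minimal sketch under the same data settings as above. The names `X_demo`, `pca_full`, `cumulative` and `n_90`, and the 90% threshold, are illustrative choices of mine, not something the notebook uses.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Regenerate and standardize the data so the sketch is self-contained.
X_demo, _ = make_classification(n_samples=10000, n_features=10, n_classes=2, random_state=42)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Fit PCA with all components and accumulate the explained variance ratios.
pca_full = PCA().fit(X_demo_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

for k, fraction in enumerate(cumulative, start=1):
    print(f"{k} components: {fraction:.3f}")

# Smallest number of components whose cumulative ratio reaches 90%
# (the threshold is an arbitrary illustrative choice).
n_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print("Components needed for 90% of the variance:", n_90)
```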


In the projections plotted here, the two classes overlap, which suggests that a purely linear decision boundary may not separate them well. Adding polynomial features could potentially improve classification accuracy.

In [39]:
df = pd.DataFrame(data=x_pca, columns=["PC1", "PC2", "PC3"])
df["Target"] = y

fig_2d = px.scatter(df, x="PC1", y="PC2", color="Target", title="2 principal components")

fig_2d.update_layout(title_x=0.5)

fig_2d.show()

In [40]:
fig_3d = px.scatter_3d(df, x="PC1", y="PC2", z="PC3", color="Target", title="3 principal components")

fig_3d.update_layout(title_x=0.5)
fig_3d.update_traces(showlegend=False)

fig_3d.show()

I split the dataset into training and test sets using an 80:20 ratio, then build a logistic regression pipeline with polynomial features. A grid search explores combinations of polynomial degree, regularization strength (C), and penalty type, with the solver fixed to liblinear, and selects the configuration with the highest accuracy under 5-fold cross-validation on the training data.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(x_scaled, y, test_size=0.2, random_state=42)

model = make_pipeline(PolynomialFeatures(), LogisticRegression())
hyperparameter_candidates = {
    "polynomialfeatures__degree": [1, 2, 3],
    "logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100],
    "logisticregression__solver": ["liblinear"],
    "logisticregression__penalty": ["l1", "l2"]
}

grid_search = GridSearchCV(model, hyperparameter_candidates, cv=5, scoring="accuracy", return_train_score=True)

grid_search.fit(X_train, y_train)
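
The grid above contains 3 × 6 × 1 × 2 = 36 candidate configurations, so 5-fold cross-validation trains 180 models before the final refit. One point worth noting is that the scaler in this notebook is fitted on the full dataset before the split, so the validation folds and the test set influence the scaling statistics. A possible alternative, sketched below on a smaller dataset and a reduced grid, is to place `StandardScaler` inside the pipeline so it is refitted on each training fold. The names `leak_free_model`, `demo_grid` and `demo_search` are hypothetical, and this is not the configuration the notebook actually ran.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ParameterGrid, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Smaller dataset than the notebook's, to keep the sketch fast.
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)

# Split the raw (unscaled) data; scaling happens inside the pipeline.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

leak_free_model = make_pipeline(
    StandardScaler(),                      # fitted only on the training part of each fold
    PolynomialFeatures(),
    LogisticRegression(solver="liblinear"),
)

# Reduced illustrative grid: 2 * 2 * 2 = 8 candidates.
demo_grid = {
    "polynomialfeatures__degree": [1, 2],
    "logisticregression__C": [0.1, 1],
    "logisticregression__penalty": ["l1", "l2"],
}
n_candidates = len(ParameterGrid(demo_grid))
print("Candidates:", n_candidates, "-> fits with cv=5:", n_candidates * 5)

demo_search = GridSearchCV(leak_free_model, demo_grid, cv=5, scoring="accuracy")
demo_search.fit(X_tr, y_tr)
print(demo_search.best_params_)
```

With standardization the leakage effect is often small, but keeping every fitted step inside the pipeline makes the cross-validation estimate easier to trust.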

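The plot below shows, for each hyperparameter combination, the mean training accuracy and the mean accuracy on the cross-validation folds. Note that "test" in `mean_test_score` (and in the plot legend) refers to the validation folds within the training data, not to the held-out test set.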

In [13]:
results = grid_search.cv_results_

mean_test_scores = results['mean_test_score']
mean_train_scores = results['mean_train_score']

hyperparameter_strings = [
    ("_".join(str(value) for value in params.values()))
    for params in results["params"]
]


df = pd.DataFrame({"x": hyperparameter_strings, "mean_test_scores": mean_test_scores, "mean_train_scores": mean_train_scores})

fig = px.line(
    df,
    x="x", y=["mean_test_scores", "mean_train_scores"],
    labels={"x": "hyperparameters"},
    markers=True,
    title="Mean Test and Train Scores for Different Hyperparameters",
    color_discrete_sequence=["red", "blue"]
)

fig.update_yaxes(title_text="Accuracy")

fig.update_traces(
    name="Test Accuracy",
    selector={"name":"mean_test_scores"}
)

fig.update_traces(
    name="Train Accuracy",
    selector={"name":"mean_train_scores"}
)

fig.update_layout(title_x=0.5)


fig.show()
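
A table can complement the line plot when many configurations are close together. The sketch below, which assumes a small stand-in search rather than the notebook's `grid_search`, converts `cv_results_` into a DataFrame, adds the gap between training and validation accuracy, and sorts by rank. The names `demo_search`, `table` and `train_minus_test` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small stand-in search so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, n_classes=2, random_state=42)
demo_search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    {"C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="accuracy",
    return_train_score=True,   # required for the mean_train_score column
)
demo_search.fit(X_demo, y_demo)

# Keep the columns of interest and compute the train/validation gap.
columns = ["params", "mean_train_score", "mean_test_score", "rank_test_score"]
table = pd.DataFrame(demo_search.cv_results_)[columns].copy()
table["train_minus_test"] = table["mean_train_score"] - table["mean_test_score"]

# Best-ranked configuration first.
table = table.sort_values("rank_test_score").reset_index(drop=True)
print(table)
```

A large positive `train_minus_test` value is a common sign of overfitting for that configuration. The same steps should apply to `grid_search.cv_results_` in this notebook.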

The cell below prints the best hyperparameters found by the grid search, followed by the mean cross-validated accuracy of that configuration (`best_score_`, which is computed on the validation folds of the training set rather than on the training data itself). Finally, the refitted best model is evaluated on the held-out test set.

In [6]:
best_hyperparameters = grid_search.best_params_

print("Best hyperparameters:")
for param, value in best_hyperparameters.items():
    print(f"\t{param}: {value}")

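# best_score_ is the mean cross-validated accuracy of the best configuration,
# not the accuracy on the training data itself
print(f"\nMean cross-validated accuracy of the best model: {grid_search.best_score_ * 100:.2f}%")

# Accuracy of the refitted best model on the held-out test set
accuracy = grid_search.score(X_test, y_test)

print(f"\nModel accuracy on test data: {accuracy * 100:.2f}%")

Best hyperparameters:
	logisticregression__C: 0.1
	logisticregression__penalty: l1
	logisticregression__solver: liblinear
	polynomialfeatures__degree: 3

Mean cross-validated accuracy of the best model: 92.75%

Model accuracy on test data: 93.70%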
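
Accuracy alone does not show which kind of error the model makes. The notebook imports `precision_score`, `recall_score` and `confusion_matrix` but does not use them in the cells shown, so the following is only a sketch of how they might be applied. It uses a plain logistic regression on a smaller dataset as a stand-in, and the names `clf`, `y_pred` and `cm` are illustrative. Its numbers say nothing about the model tuned above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Stand-in data and model so the sketch is self-contained.
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

clf = LogisticRegression(solver="liblinear").fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_pred)
tn, fp, fn, tp = cm.ravel()

precision = precision_score(y_te, y_pred)   # tp / (tp + fp)
recall = recall_score(y_te, y_pred)         # tp / (tp + fn)

print(cm)
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
```

In this notebook, the same calls could be applied to `grid_search.predict(X_test)` and `y_test`, since `GridSearchCV` refits the best configuration on the full training set by default.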
