# Phase 2 – Project 2: Logistic Regression (Cancer Classification)

Ejaztech.AI – Supervised Machine Learning: Regression & Classification

Objective: Develop and evaluate a Logistic Regression model to classify cancer cases using a real-world dataset.

Dataset: https://www.kaggle.com/datasets/erdemtaha/cancer-data/data


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
    classification_report,
    roc_curve,
    roc_auc_score
)

# set_theme is seaborn's preferred interface; sns.set is an alias for it
sns.set_theme(style="whitegrid")


In [None]:

data_path = "cancer_data.csv"

df = pd.read_csv(data_path)
df.head()


In [None]:
df.info()
df.describe()


In [None]:
print(df['diagnosis'].value_counts())

sns.countplot(x='diagnosis', data=df)
plt.title("Class distribution")
plt.show()


In [None]:
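# Check missing values per column
print(df.isna().sum())

# Drop columns that are entirely empty before dropping rows; otherwise a single
# all-NaN column (e.g. a trailing "Unnamed: 32" column, if the CSV has one)
# would cause dropna() to remove every row
df = df.dropna(axis=1, how='all')

# Drop rows with missing values
df = df.dropna()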

# Drop non-informative ID column if present
if 'id' in df.columns:
    df = df.drop(columns=['id'])

print("Unique diagnosis values:", df['diagnosis'].unique())

# Encode the target: malignant (M) = 1 (positive class), benign (B) = 0
mapping = {'M': 1, 'B': 0}
df['diagnosis'] = df['diagnosis'].map(mapping)

df.head()


In [None]:
X = df.drop(columns=['diagnosis'])
y = df['diagnosis']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

X_train.shape, X_test.shape


In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled[:3]


In [None]:
log_reg = LogisticRegression(
    max_iter=1000,
    solver='lbfgs'
)

log_reg.fit(X_train_scaled, y_train)

# Coefficients on the standardized features, sorted from most positive to most negative
coefficients = pd.DataFrame({
    "feature": X.columns,
    "coefficient": log_reg.coef_[0]
}).sort_values(by="coefficient", ascending=False)

coefficients.head(10)


In [None]:
y_train_pred = log_reg.predict(X_train_scaled)
y_test_pred = log_reg.predict(X_test_scaled)
y_test_proba = log_reg.predict_proba(X_test_scaled)[:, 1]

def print_metrics(y_true, y_pred, dataset_name=""):
    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred)
    rec = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    print(f"{dataset_name} Accuracy : {acc:.4f}")
    print(f"{dataset_name} Precision: {prec:.4f}")
    print(f"{dataset_name} Recall   : {rec:.4f}")
    print(f"{dataset_name} F1-score : {f1:.4f}")
    print("-" * 40)

print_metrics(y_train, y_train_pred, "Train")
print_metrics(y_test, y_test_pred, "Test")


In [None]:
cm = confusion_matrix(y_test, y_test_pred)

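# Rows are actual classes and columns are predicted classes, in label order 0, 1
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Benign (0)", "Malignant (1)"],
            yticklabels=["Benign (0)", "Malignant (1)"])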
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix (Test Set)")
plt.show()

print("Classification report (Test):")
print(classification_report(y_test, y_test_pred))


In [None]:
fpr, tpr, thresholds = roc_curve(y_test, y_test_proba)
auc_score = roc_auc_score(y_test, y_test_proba)

plt.plot(fpr, tpr, label=f"AUC = {auc_score:.3f}")
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()

auc_score


## 10. Interpretation and insights

In this project, I used Logistic Regression to classify cancer cases as malignant (1) or benign (0) based on multiple medical measurements. The main evaluation metrics I looked at were accuracy, precision, recall, F1-score, and AUC.

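Accuracy tells me the overall proportion of correct predictions out of all samples. Precision is the share of cases predicted as positive (cancer) that are actually positive, which matters for avoiding too many false alarms. Recall is the share of actual positive cases that the model correctly catches, which is critical in healthcare because missing a real cancer case can be very dangerous. The F1-score is the harmonic mean of precision and recall, so it is high only when both are high. AUC summarizes how well the predicted probabilities rank positive cases above negative ones across all decision thresholds.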
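
To make these definitions concrete, here is a minimal sketch that computes the four threshold-based metrics directly from confusion-matrix counts. The helper name `metrics_from_counts` and the counts passed to it are hypothetical, chosen only for illustration; they are not results from this notebook.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when no positives are predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for illustration only (not results from this notebook)
example = metrics_from_counts(tp=40, fp=2, fn=3, tn=69)
for name, value in example.items():
    print(f"{name:9s}: {value:.4f}")
```

With real predictions, the four counts can be read from `confusion_matrix(y_test, y_test_pred).ravel()`, which returns them in the order `tn, fp, fn, tp` for binary labels 0 and 1.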

The confusion matrix helps me see the counts of true positives, true negatives, false positives, and false negatives. By inspecting it, I can understand whether the model is making more false positives (predicting cancer when it is not present) or more false negatives (predicting no cancer when it is actually present). In a cancer detection setting, false negatives are usually more serious, so a high recall is especially valuable even if it slightly reduces precision.
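
One way to trade precision for recall is to lower the probability cut-off used to call a case positive. The sketch below shows the effect on a small set of made-up labels and probabilities; the helper functions and the data are hypothetical and serve only to illustrate the mechanism.

```python
def predict_with_threshold(probabilities, threshold=0.5):
    """Label a case positive (1) when its predicted probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


def precision_recall(y_true, y_pred):
    """Return (precision, recall) for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels and predicted probabilities, for illustration only
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
proba_demo = [0.92, 0.71, 0.45, 0.35, 0.40, 0.20, 0.10, 0.25, 0.05, 0.15]

results = {}
for threshold in (0.5, 0.3):
    preds = predict_with_threshold(proba_demo, threshold)
    results[threshold] = precision_recall(y_true_demo, preds)
    precision, recall = results[threshold]
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

In this notebook the same idea could be applied with `(y_test_proba >= 0.3).astype(int)`, where 0.3 is an arbitrary example value. A cut-off is better chosen on a validation split or with cross-validation than on the test set, so that the test metrics remain an honest estimate.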

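Looking at the model coefficients, I can see which features contribute most positively or negatively to predicting the cancer class. Because the features were standardized before training, each coefficient is the change in the log-odds of the positive class for a one-standard-deviation increase in that feature, so magnitudes are comparable across features. Features with large positive coefficients increase the probability of the positive class when they increase, while features with large negative coefficients decrease that probability. This gives some interpretability and hints about which measurements are more influential for the model's decision, although coefficients of strongly correlated features should be read with caution because the model can split the effect between them.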
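
Coefficients are often easier to read as odds ratios, obtained by exponentiating them. The following sketch uses placeholder feature names and made-up coefficient values (assumptions for illustration, not the fitted values from this notebook).

```python
import math

# Hypothetical coefficients on standardized features (illustration only)
coefficients_demo = {"feature_a": 1.2, "feature_b": -0.7, "feature_c": 0.0}

# exp(coefficient) is the factor by which the odds of the positive class are
# multiplied when the feature rises by one standard deviation, with the other
# features held fixed
odds_ratios = {name: math.exp(coef) for name, coef in coefficients_demo.items()}

for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1], reverse=True):
    if ratio > 1:
        direction = "raises"
    elif ratio < 1:
        direction = "lowers"
    else:
        direction = "does not change"
    print(f"{name}: odds ratio {ratio:.2f} ({direction} the odds of the positive class)")
```

For the fitted model, the equivalent calculation would be `np.exp(log_reg.coef_[0])`, paired with `X.columns` as in the coefficient table.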


## 11. Conclusion

The goal of this project was to build a Logistic Regression model to classify cancer cases using the Ejaztech.AI Phase 2 dataset and to practice supervised machine learning for classification. I followed a complete workflow: data loading, cleaning and preprocessing, encoding the target variable, feature scaling, model training, and model evaluation using several metrics.

The model performed reasonably well on the test set in terms of accuracy, precision, recall, F1-score, and AUC, which suggests that Logistic Regression can be an effective baseline for this medical classification problem. By examining the confusion matrix and the balance between precision and recall, I gained insight into the trade-off between false positives and false negatives, which is especially important in cancer detection.

Overall, this project strengthened my understanding of how Logistic Regression works, how to interpret its coefficients, and how to evaluate a classification model in a healthcare context. Possible next steps to improve performance include trying other algorithms (such as Random Forests or Gradient Boosting), performing more advanced feature engineering, and tuning hyperparameters to better balance recall and precision for medical use cases.
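
As a rough sketch of the hyperparameter-tuning step, the example below wraps scaling and Logistic Regression in a `Pipeline` and uses `GridSearchCV` with `scoring="recall"` to favor catching positive cases. It runs on synthetic data from `make_classification` as a stand-in (an assumption made so the example is self-contained); the grid values are arbitrary starting points, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); in the notebook, X and y would be used
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scaling inside the pipeline means the scaler is refit on each training fold,
# so no information from the validation fold leaks into preprocessing
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Example grid: regularization strength and optional class re-weighting
param_grid = {
    "clf__C": [0.01, 0.1, 1.0, 10.0],
    "clf__class_weight": [None, "balanced"],
}

search = GridSearchCV(pipeline, param_grid, scoring="recall", cv=5)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print(f"Cross-validated recall: {search.best_score_:.4f}")
print(f"Held-out recall: {recall_score(y_te, search.predict(X_te)):.4f}")
```

Optimizing recall alone can push the model toward predicting the positive class too often, so `scoring="f1"` or a custom scorer may be a better fit when precision also matters.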
