
**Dataset Selected**

UCI Breast Cancer Wisconsin Dataset

(Source: UCI Machine Learning Repository)

Why this dataset fits

  - Type: binary classification
  - Target variable: Class (0 → Benign, 1 → Malignant)
  - Instances: 699
  - Features: 9 predictive features (after removing the Id column)
  - Medical dataset that is widely used in ML coursework
  - Well-structured tabular data
  - Suitable for testing multiple classification algorithms
  - No plagiarism risk if coded independently

STEP 2: Models & Metrics

✔ Models Implemented (all 6)

  - Logistic Regression
  - Decision Tree Classifier
  - K-Nearest Neighbors
  - Naive Bayes (Gaussian)
  - Random Forest (Ensemble)
  - XGBoost (Ensemble)

✔ Evaluation Metrics

  - Accuracy
  - AUC Score
  - Precision
  - Recall
  - F1 Score
  - Matthews Correlation Coefficient (MCC)

In [38]:
!pip install scikit-learn==1.8.0




In [39]:
import sklearn
print(sklearn.__version__)


1.8.0


In [40]:
%pip install numpy pandas scikit-learn matplotlib seaborn xgboost streamlit



In [41]:
import pandas as pd

url = "https://raw.githubusercontent.com/selva86/datasets/master/BreastCancer.csv"

df = pd.read_csv(url)

print(df.shape)
print(df.head())


(699, 11)
        Id  Cl.thickness  Cell.size  Cell.shape  Marg.adhesion  Epith.c.size  \
0  1000025             5          1           1              1             2   
1  1002945             5          4           4              5             7   
2  1015425             3          1           1              1             2   
3  1016277             6          8           8              1             3   
4  1017023             4          1           1              3             2   

   Bare.nuclei  Bl.cromatin  Normal.nucleoli  Mitoses  Class  
0          1.0            3                1        1      0  
1         10.0            3                2        1      0  
2          2.0            3                1        1      0  
3          4.0            3                7        1      0  
4          1.0            3                1        1      0  


Separate Features & Target

In [42]:
# Remove Id column first
df = df.drop("Id", axis=1)

# Separate features and target
X = df.drop("Class", axis=1)
y = df["Class"]



Train–Test Split and Missing-Value Imputation

In [43]:
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# Split first so that the test rows never influence the imputer's statistics
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the mean imputer on the training rows only, then reuse it on the test rows
imputer = SimpleImputer(strategy="mean")
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)


Feature Scaling: needed for the distance- and gradient-based models (KNN, Logistic Regression); the tree-based models are insensitive to it

In [44]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
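
The split-then-fit order used above (fit the imputer and scaler on the training rows only, then apply them to the test rows) can also be expressed as a scikit-learn `Pipeline`. The sketch below is illustrative only: it uses a small synthetic array with injected missing values rather than the notebook's `df`, and the names `X_demo`, `y_demo`, `pipe`, and `demo_pred` are assumptions introduced here.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 200 rows, 9 features (assumption)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 9))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
X_demo[rng.integers(0, 200, size=10), 5] = np.nan  # inject missing values

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Every step is fitted on the training rows only when pipe.fit() is called
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)

# predict() reuses the training-set means and scales on the test rows
demo_pred = pipe.predict(X_te)
print(demo_pred[:5])
```

A single pipeline object also means one file to save and load later, instead of separate imputer, scaler, and model files.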

Import evaluation metrics

In [45]:
from sklearn.metrics import (
    accuracy_score,
    roc_auc_score,
    precision_score,
    recall_score,
    f1_score,
    matthews_corrcoef,
    confusion_matrix
)


Train ALL 6 Models

In [46]:
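# 1. Logistic Regression

from sklearn.linear_model import LogisticRegression

# max_iter is raised from the default of 100 to give the solver room to converge
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

# Sanity check: predicted classes for the first five test rows
print(lr.predict(X_test[:5]))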



[0 0 0 1 0]


In [47]:
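# 2. Decision Tree

from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
# Note: y_pred and y_prob are overwritten by each later model cell
y_pred = dt.predict(X_test)
y_prob = dt.predict_proba(X_test)[:, 1]  # probability of class 1 (malignant)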


In [48]:
# 3. K-Nearest Neighbors

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
y_prob = knn.predict_proba(X_test)[:,1]


In [49]:
# 4. Naive Bayes (Gaussian)

from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)
y_prob = nb.predict_proba(X_test)[:,1]


# Metric functions were already imported in the cell above

nb_metrics = {
    "Accuracy": accuracy_score(y_test, y_pred),
    "AUC": roc_auc_score(y_test, y_prob),
    "Precision": precision_score(y_test, y_pred),
    "Recall": recall_score(y_test, y_pred),
    "F1": f1_score(y_test, y_pred),
    "MCC": matthews_corrcoef(y_test, y_pred)
}

print(nb_metrics)


{'Accuracy': 0.9642857142857143, 'AUC': 0.9915789473684211, 'Precision': 0.9166666666666666, 'Recall': 0.9777777777777777, 'F1': 0.946236559139785, 'MCC': 0.9206136277768457}
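
The printed Naive Bayes scores can be cross-checked by hand. The counts used below (TP = 44, FP = 4, FN = 1, TN = 91 on a 140-row test set) are not printed anywhere in the notebook; they are inferred from the precision, recall, and accuracy values above, so treat them as an assumption. A minimal standard-library sketch that recomputes the threshold-based metrics from those counts:

```python
import math

# Confusion-matrix counts inferred from the printed metrics (assumption)
tp, fp, fn, tn = 44, 4, 1, 91

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

for name, value in [("Accuracy", accuracy), ("Precision", precision),
                    ("Recall", recall), ("F1", f1), ("MCC", mcc)]:
    print(f"{name}: {value:.4f}")
```

AUC is left out because it depends on how the predicted probabilities rank the test rows, not on the thresholded counts.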


In [50]:
# 5. Random Forest (Ensemble)

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
y_prob = rf.predict_proba(X_test)[:,1]


In [51]:
# 6. XGBoost (Ensemble)

from xgboost import XGBClassifier

# use_label_encoder is omitted: the installed XGBoost reports it as unused
xgb = XGBClassifier(eval_metric="logloss")
xgb.fit(X_train, y_train)
y_pred = xgb.predict(X_test)
y_prob = xgb.predict_proba(X_test)[:, 1]


Calculate All Required Metrics

In [52]:
def evaluate_model(y_test, y_pred, y_prob):
    return {
        "Accuracy": accuracy_score(y_test, y_pred),
        "AUC": roc_auc_score(y_test, y_prob),
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "F1": f1_score(y_test, y_pred),
        "MCC": matthews_corrcoef(y_test, y_pred)
    }


Comparison Table (For README & PDF)
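
One way to fill in this table is to loop over the fitted models and collect the six metrics into a DataFrame. The sketch below is an assumption about how that could look, not the notebook's own code: it runs on synthetic data from `make_classification` so that it is self-contained, repeats the `evaluate_model` helper from the cell above, and leaves out XGBoost to avoid the extra dependency (it could be added to the `models` dict in the same way). The numbers it prints describe the synthetic data only, not the breast cancer dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def evaluate_model(y_true, y_pred, y_prob):
    """Return the six evaluation metrics as a dict."""
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "AUC": roc_auc_score(y_true, y_prob),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
        "MCC": matthews_corrcoef(y_true, y_pred),
    }


# Synthetic stand-in for the scaled train/test arrays (assumption)
X_syn, y_syn = make_classification(n_samples=300, n_features=9, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)
syn_scaler = StandardScaler()
X_tr = syn_scaler.fit_transform(X_tr)
X_te = syn_scaler.transform(X_te)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit each model and keep its metrics under its own name,
# so no model's predictions overwrite another's
rows = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rows[name] = evaluate_model(
        y_te, model.predict(X_te), model.predict_proba(X_te)[:, 1]
    )

comparison = pd.DataFrame.from_dict(rows, orient="index").round(4)
print(comparison)
```

In the notebook itself, the already fitted `lr`, `dt`, `knn`, `nb`, `rf`, and `xgb` objects could be placed in the dict and evaluated against `X_test` and `y_test` without refitting.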

Observations

Save Models (For Streamlit)

In [53]:
import os
import joblib

# Create the output folder, then persist the fitted Random Forest
os.makedirs("model", exist_ok=True)
joblib.dump(rf, "model/random_forest.pkl")



['model/random_forest.pkl']

Streamlit App (app.py)

In [54]:
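# Draft of app.py: Streamlit scripts are meant to be launched with
# `streamlit run app.py` rather than executed inside a notebook cell
import streamlit as st
import joblib
import pandas as pd

st.title("ML Classification App")

# Inputs: a CSV of test rows and the model to score them with
uploaded_file = st.file_uploader("Upload CSV")
model_choice = st.selectbox(
    "Select Model",
    ["Logistic Regression", "Decision Tree", "KNN", "Naive Bayes", "Random Forest", "XGBoost"]
)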




In [55]:
import os

# Root project folder
project_root = "ml-assignment-2"

# Create folders
os.makedirs(f"{project_root}/model", exist_ok=True)

print("Folder structure created!")



Folder structure created!


In [56]:
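# Create requirements.txt
# scikit-learn is pinned to the version used for training (1.8.0, see above),
# because pickled estimators are not guaranteed to load under other versions
requirements_content = """streamlit
numpy
pandas
scikit-learn==1.8.0
xgboost
matplotlib
seaborn
joblib
"""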

with open("ml-assignment-2/requirements.txt", "w") as f:
    f.write(requirements_content)

print("requirements.txt created!")


requirements.txt created!


In [57]:
# Create README.md

readme_content = """# ML Assignment 2 – Classification Models & Streamlit Deployment

## Problem Statement
Implement multiple machine learning classification models, evaluate their performance,
and deploy the models using a Streamlit web application.

## Dataset Description
The dataset used is the UCI Breast Cancer Wisconsin dataset.
It contains 699 samples and 9 predictive features after preprocessing.

## Models Used
- Logistic Regression
- Decision Tree Classifier
- K-Nearest Neighbors
- Naive Bayes
- Random Forest (Ensemble)
- XGBoost (Ensemble)

## Evaluation Metrics
Accuracy, AUC, Precision, Recall, F1 Score, Matthews Correlation Coefficient (MCC)

## Deployment
The application is deployed using Streamlit Community Cloud.
"""

with open("ml-assignment-2/README.md", "w") as f:
    f.write(readme_content)

print("README.md created!")


README.md created!


In [58]:
# Create app.py

app_content = """import streamlit as st

st.set_page_config(page_title="ML Classification App")

st.title("Machine Learning Classification App")
st.write("Upload a dataset and select a model to view predictions and metrics.")
"""

with open("ml-assignment-2/app.py", "w") as f:
    f.write(app_content)

print("app.py created!")


app.py created!


In [59]:
# Save models programmatically into model/

import joblib

joblib.dump(lr, "ml-assignment-2/model/logistic_regression.pkl")
joblib.dump(dt, "ml-assignment-2/model/decision_tree.pkl")
joblib.dump(knn, "ml-assignment-2/model/knn.pkl")
joblib.dump(nb, "ml-assignment-2/model/naive_bayes.pkl")
joblib.dump(rf, "ml-assignment-2/model/random_forest.pkl")
joblib.dump(xgb, "ml-assignment-2/model/xgboost.pkl")

joblib.dump(imputer, "ml-assignment-2/model/imputer.pkl")
joblib.dump(scaler, "ml-assignment-2/model/scaler.pkl")

print("All models saved successfully!")


All models saved successfully!
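
Because the imputer and scaler are saved alongside the models, the Streamlit app can reproduce the training-time preprocessing before predicting. The sketch below shows that round trip under stated assumptions: the helper name `predict_raw`, the temporary folder, and the tiny synthetic training set are all hypothetical stand-ins for the real artifacts in `ml-assignment-2/model/`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def predict_raw(raw_rows, model_dir, model_file):
    """Apply the saved imputer and scaler, then predict with the saved model."""
    imputer = joblib.load(os.path.join(model_dir, "imputer.pkl"))
    scaler = joblib.load(os.path.join(model_dir, "scaler.pkl"))
    model = joblib.load(os.path.join(model_dir, model_file))
    # Same order as in training: impute first, then scale
    prepared = scaler.transform(imputer.transform(raw_rows))
    return model.predict(prepared)


# Build and save small stand-in artifacts (synthetic data, assumption)
rng = np.random.default_rng(0)
X_fit = rng.integers(1, 11, size=(60, 9)).astype(float)
y_fit = (X_fit.mean(axis=1) > 5.5).astype(int)

fit_imputer = SimpleImputer(strategy="mean").fit(X_fit)
X_imputed = fit_imputer.transform(X_fit)
fit_scaler = StandardScaler().fit(X_imputed)
fit_model = LogisticRegression(max_iter=1000).fit(
    fit_scaler.transform(X_imputed), y_fit
)

demo_dir = tempfile.mkdtemp()
joblib.dump(fit_imputer, os.path.join(demo_dir, "imputer.pkl"))
joblib.dump(fit_scaler, os.path.join(demo_dir, "scaler.pkl"))
joblib.dump(fit_model, os.path.join(demo_dir, "logistic_regression.pkl"))

# New rows may contain missing values, which the saved imputer fills in
new_rows = np.array([
    [5, 1, 1, 1, 2, np.nan, 3, 1, 1],
    [8, 10, 10, 8, 7, 10, 9, 7, 1],
], dtype=float)
demo_preds = predict_raw(new_rows, demo_dir, "logistic_regression.pkl")
print(demo_preds)
```

For the real app, the uploaded CSV would need the same nine feature columns in the same order as the training data, with `Id` and `Class` removed before calling such a helper.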


In [60]:
# Verify

import os

print("Root files:", os.listdir("ml-assignment-2"))
print("Model files:", os.listdir("ml-assignment-2/model"))


Root files: ['README.md', 'app.py', 'requirements.txt', 'model']
Model files: ['knn.pkl', 'naive_bayes.pkl', 'xgboost.pkl', 'scaler.pkl', 'imputer.pkl', 'random_forest.pkl', 'logistic_regression.pkl', 'decision_tree.pkl']


In [61]:
!zip -r ml-assignment-2.zip ml-assignment-2


updating: ml-assignment-2/ (stored 0%)
updating: ml-assignment-2/README.md (deflated 40%)
updating: ml-assignment-2/app.py (deflated 32%)
updating: ml-assignment-2/requirements.txt (deflated 11%)
updating: ml-assignment-2/model/ (stored 0%)
updating: ml-assignment-2/model/knn.pkl (deflated 89%)
updating: ml-assignment-2/model/naive_bayes.pkl (deflated 25%)
updating: ml-assignment-2/model/xgboost.pkl (deflated 82%)
updating: ml-assignment-2/model/scaler.pkl (deflated 17%)
updating: ml-assignment-2/model/imputer.pkl (deflated 31%)
updating: ml-assignment-2/model/random_forest.pkl (deflated 84%)
updating: ml-assignment-2/model/logistic_regression.pkl (deflated 34%)
updating: ml-assignment-2/model/decision_tree.pkl (deflated 68%)
