# **Formative Assessment: Supervised Learning**

In [2]:
#Import necessary libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

## **Loading and Preprocessing**

In [3]:
#Load the dataset (Breast cancer dataset)
data = load_breast_cancer()

In [4]:
X = pd.DataFrame(data.data, columns = data.feature_names)
y = pd.Series(data.target, name = 'target')

In [5]:
#Basic information about the dataset
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 30 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

In [6]:
#Check for missing values
X.isnull().sum()

mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
dtype: int64

In [7]:
#Feature Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

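**First, we load the breast cancer dataset from `sklearn.datasets` and check it for missing values.**
**Since there are no missing values, no imputation is needed and we can proceed with the next steps.**
**We then apply `StandardScaler` so that every feature has zero mean and unit variance, which helps scale-sensitive models such as Logistic Regression, SVM, and k-NN.**
**Note that the scaler here is fitted on the full dataset before the train/test split, so the test rows contribute to the scaling statistics; fitting it on the training split only would avoid this leakage.**
**These preprocessing steps ensure that the dataset is clean, scaled, and ready for model training.**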
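
**Because `StandardScaler` above is fitted on all 569 rows, one alternative is to split first and wrap the scaler and the model in a pipeline, so the scaler only ever sees training rows.** **The sketch below illustrates that idea and is not the method used in this notebook; the names `X_raw`, `leak_free_model`, and `leak_free_acc` are introduced here for illustration, and the resulting accuracy may differ slightly from the figures reported later.**

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the raw (unscaled) features and target.
X_raw, y_raw = load_breast_cancer(return_X_y=True, as_frame=True)

# Split first, so the test rows play no part in fitting the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42
)

# The pipeline fits StandardScaler on the training rows only, then applies
# the same learned mean and standard deviation to the test rows.
leak_free_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
leak_free_model.fit(X_tr, y_tr)

leak_free_acc = leak_free_model.score(X_te, y_te)
print("Accuracy with scaler fitted on training data only:", leak_free_acc)
```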

# **Implementation of Five Classification Algorithms**

# **1. Logistic Regression**

**Logistic Regression is a linear classification algorithm used to predict binary outcomes.**
**It is appropriate when the log-odds of the target are approximately a linear function of the features.** **It is fast, interpretable, and works best when the features are not strongly correlated with one another.**
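
**To make the "linear" part concrete, the sketch below recomputes the predicted probabilities by hand: a weighted sum of the features gives the log-odds, and the sigmoid function turns that into a probability.** **It is a minimal illustration that fits on the whole scaled dataset; the names `demo_model`, `log_odds`, and `manual_proba` are introduced here and are not part of the notebook.**

```python
import numpy as np
from scipy.special import expit  # the sigmoid function
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_demo = StandardScaler().fit_transform(X_demo)

demo_model = LogisticRegression(max_iter=10000).fit(X_demo, y_demo)

# Linear score (log-odds) for each sample: w . x + b
log_odds = X_demo @ demo_model.coef_.ravel() + demo_model.intercept_[0]

# Passing the log-odds through the sigmoid gives P(class 1), i.e. benign.
manual_proba = expit(log_odds)
sklearn_proba = demo_model.predict_proba(X_demo)[:, 1]

print("Largest difference from predict_proba:",
      np.abs(manual_proba - sklearn_proba).max())
```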

In [8]:
#Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.2, random_state = 42)

In [9]:
#Logistic Regression model
log_reg = LogisticRegression(max_iter = 10000)
log_reg.fit(X_train, y_train)

In [10]:
#Prediction and Accuracy
y_pred_log_reg = log_reg.predict(X_test)
log_reg_acc = accuracy_score(y_test, y_pred_log_reg)
print("Logistic Regression Accuracy:", log_reg_acc)

Logistic Regression Accuracy: 0.9736842105263158


# **2. Decision Tree Classifier**

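**Decision Trees work well for datasets with complex relationships between features.** **They split the feature space with a sequence of threshold rules, so they can model non-linear patterns, which may be present in the breast cancer dataset.** **They are also interpretable and, in principle, handle both numerical and categorical features, although scikit-learn's implementation expects categorical features to be numerically encoded (all 30 features here are numerical).**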
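
**The interpretability claim can be illustrated by printing the learned rules of a small tree.** **The sketch below assumes a depth limit of 2 purely to keep the printout short (the notebook's tree is unrestricted); `shallow_tree` and `rules` are names introduced here for illustration.**

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

cancer = load_breast_cancer()

# A deliberately shallow tree, so the rule set stays readable.
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
shallow_tree.fit(cancer.data, cancer.target)

# export_text renders the tree as nested if/else threshold rules.
rules = export_text(shallow_tree, feature_names=list(cancer.feature_names))
print(rules)
```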

In [11]:
#Decision Tree Classifier
dt_classifier = DecisionTreeClassifier(random_state = 42)
dt_classifier.fit(X_train, y_train)

In [12]:
#Prediction and Accuracy
y_pred_dt = dt_classifier.predict(X_test)
dt_acc = accuracy_score(y_test, y_pred_dt)
print("Decision Tree Accuracy:", dt_acc)

Decision Tree Accuracy: 0.9473684210526315


# **3. Random Forest Classifier**

**Random Forest averages many decision trees, each trained on a bootstrap sample of the data, which makes it more robust and less prone to overfitting than a single tree.** **It works well with large feature sets and can capture complex relationships between features.** **Given the variability in breast cancer attributes, it is likely to provide strong performance.**
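
**A fitted forest also reports how much each feature contributed to its splits, which gives a rough view of which attributes it relies on.** **The sketch below fits on the full dataset for illustration only; `forest`, `importances`, and `top_features` are names introduced here, and impurity-based importances should be read as a heuristic rather than a definitive ranking.**

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

cancer = load_breast_cancer()

forest = RandomForestClassifier(random_state=42)
forest.fit(cancer.data, cancer.target)

# Impurity-based importances are normalised to sum to 1 across all features.
importances = pd.Series(forest.feature_importances_, index=cancer.feature_names)
top_features = importances.sort_values(ascending=False).head(5)
print(top_features)
```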

In [13]:
#Random Forest Classifier
rf_classifier = RandomForestClassifier(random_state = 42)
rf_classifier.fit(X_train, y_train)

In [14]:
#Predictions and accuracy
y_pred_rf = rf_classifier.predict(X_test)
rf_acc = accuracy_score(y_test, y_pred_rf)
print("Random Forest Accuracy:", rf_acc)

Random Forest Accuracy: 0.9649122807017544


# **4. Support Vector Machine (SVM)**

**SVM is well-suited for high-dimensional spaces and can model non-linear relationships using kernel functions.** **The breast cancer dataset, with its 30 attributes, can benefit from SVM's ability to find a maximum-margin boundary between the benign and malignant cases.** **Here we use a linear kernel (`kernel='linear'`), so the decision boundary is a hyperplane rather than a non-linear surface.**
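
**To see whether a non-linear kernel helps on this data, the linear kernel can be compared with the RBF kernel under cross-validation.** **This is a sketch under the assumption of 5-fold cross-validation and default `C` and `gamma` values; `kernel_scores` is a name introduced here, and no particular outcome is implied.**

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_svm, y_svm = load_breast_cancer(return_X_y=True)

kernel_scores = {}
for kernel in ("linear", "rbf"):
    # Scaling lives inside the pipeline, so it is refitted on each training fold.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, random_state=42))
    kernel_scores[kernel] = cross_val_score(model, X_svm, y_svm, cv=5).mean()
    print(f"{kernel} kernel mean CV accuracy: {kernel_scores[kernel]:.4f}")
```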

In [15]:
#Support Vector Machine Classifier
svm_classifier = SVC(kernel='linear', random_state=42)
svm_classifier.fit(X_train, y_train)

In [16]:
#Predictions and accuracy
y_pred_svm = svm_classifier.predict(X_test)
svm_acc = accuracy_score(y_test, y_pred_svm)
print("SVM Accuracy:", svm_acc)

SVM Accuracy: 0.956140350877193


# **5. k-Nearest Neighbors (k-NN)**

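**k-NN works well for small to medium-sized datasets and is simple to implement.** **It doesn't make strong assumptions about the underlying data distribution, making it a good candidate when the data has a non-linear structure.** **Because it classifies each sample by the distances to its neighbors, it depends on the feature scaling applied earlier.**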
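
**The number of neighbors is the main setting to tune for k-NN, and the notebook fixes it at 5.** **The sketch below tries a handful of values with 5-fold cross-validation; the candidate list and the names `k_scores` and `best_k` are assumptions made for illustration.**

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_knn, y_knn = load_breast_cancer(return_X_y=True)

k_scores = {}
for k in (1, 3, 5, 7, 9, 11):
    # k-NN is distance based, so scaling is part of the model pipeline.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    k_scores[k] = cross_val_score(model, X_knn, y_knn, cv=5).mean()
    print(f"k={k}: mean CV accuracy {k_scores[k]:.4f}")

# Pick the k with the highest mean cross-validated accuracy.
best_k = max(k_scores, key=k_scores.get)
print("Best k among the candidates:", best_k)
```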

In [17]:
#K-Nearest Neighbors Classifier
knn_classifier = KNeighborsClassifier(n_neighbors=5)
knn_classifier.fit(X_train, y_train)

In [18]:
#Predictions and accuracy
y_pred_knn = knn_classifier.predict(X_test)
knn_acc = accuracy_score(y_test, y_pred_knn)
print("k-NN Accuracy:", knn_acc)

k-NN Accuracy: 0.9473684210526315


In [19]:
print(f"Logistic Regression Accuracy: {log_reg_acc}")
print(f"Decision Tree Accuracy: {dt_acc}")
print(f"Random Forest Accuracy: {rf_acc}")
print(f"SVM Accuracy: {svm_acc}")
print(f"k-NN Accuracy: {knn_acc}")

Logistic Regression Accuracy: 0.9736842105263158
Decision Tree Accuracy: 0.9473684210526315
Random Forest Accuracy: 0.9649122807017544
SVM Accuracy: 0.956140350877193
k-NN Accuracy: 0.9473684210526315


# **Performance Analysis**

## **Best Performing Algorithm:**
**Logistic Regression achieved the highest accuracy, 97.37% (111 of the 114 test samples).** **This suggests that a linear decision boundary captures the patterns in this dataset well, making it a reliable choice for this classification task.**
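
**Accuracy alone does not show which kind of error a model makes, and in a cancer screening setting a missed malignant case matters more than a false alarm.** **The sketch below mirrors the notebook's preprocessing and split and then prints a confusion matrix and the recall for the malignant class (label 0); `cm` and `malignant_recall` are names introduced here for illustration.**

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()  # target 0 = malignant, 1 = benign

# Same steps as the notebook: scale all rows, then an 80/20 split.
X_all = StandardScaler().fit_transform(cancer.data)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, cancer.target, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=10000).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# Rows are true classes, columns are predicted classes, ordered [0, 1].
cm = confusion_matrix(y_te, y_hat, labels=[0, 1])
print(cm)

# Share of truly malignant samples that the model flags as malignant.
malignant_recall = recall_score(y_te, y_hat, pos_label=0)
print("Recall for the malignant class:", malignant_recall)
```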

## **Worst Performing Algorithm:**

**Both Decision Tree and k-Nearest Neighbors had the lowest accuracy, at 94.74% (108 of the 114 test samples).** **This is still a respectable accuracy, but it is three test samples behind the best model, so a ranking based on a single train/test split should be treated as indicative rather than conclusive.**

**Based on these outcomes, Logistic Regression is a robust choice for the breast cancer dataset.**
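
**Since the comparison above rests on one 114-sample test set, a cross-validated comparison gives a steadier picture of how the five models rank.** **The sketch below assumes 5-fold cross-validation with the same hyperparameters as the notebook; `candidates` and `cv_summary` are names introduced here, and the scores it prints may not match the single-split figures.**

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_cv, y_cv = load_breast_cancer(return_X_y=True)

# The same five classifiers and settings used in the notebook.
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=10000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(kernel="linear", random_state=42),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
}

cv_summary = {}
for name, estimator in candidates.items():
    # Scaling inside the pipeline is refitted on each training fold.
    model = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(model, X_cv, y_cv, cv=5, scoring="accuracy")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```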