<a href="https://colab.research.google.com/github/nsk20/CMPE257-Fall23-ShyamKumar-Nalluri/blob/take-home-exam/Task_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [20]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

SJSU ID - 016421756

In [21]:
# Load the dataset
dataset = pd.read_csv('breast_cancer_dataset_preprocessed.csv')

# Encode the 'y' column (Malignant: 1, Benign: 0)
label_encoder = LabelEncoder()
dataset['y'] = label_encoder.fit_transform(dataset['y'])

# Separate features (X) and target variable (y), selecting the target by name
# so the split does not depend on 'y' being the last column
X = dataset.drop(columns='y')
y = dataset['y']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=756)

In [22]:
# Initialize SVM models with different kernels
svm_models = {
    'Linear SVM': SVC(kernel='linear', random_state=756),
    'Poly SVM': SVC(kernel='poly', random_state=756),
    'RBF SVM': SVC(kernel='rbf', random_state=756),
}

# Combine all models
models = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'KNN': KNeighborsClassifier(),
    'Neural Network': MLPClassifier(random_state=756),
    **svm_models
}

In [23]:
# Train and evaluate models
results = {}
for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    results[model_name] = {'Accuracy': accuracy, 'Precision': precision, 'Recall': recall, 'F1 Score': f1}




In [24]:
# Find the best performing model based on accuracy
best_model = max(results, key=lambda k: results[k]['Accuracy'])

# Print results
print("Performance Metrics:")
print("{:<20} {:<10} {:<10} {:<10} {:<10}".format('Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score'))
for model_name, metrics in results.items():
    print("{:<20} {:<10.4f} {:<10.4f} {:<10.4f} {:<10.4f}".format(model_name, metrics['Accuracy'], metrics['Precision'], metrics['Recall'], metrics['F1 Score']))

print("\nBest Performing Model: {}".format(best_model))

Performance Metrics:
Model                Accuracy   Precision  Recall     F1 Score  
Logistic Regression  1.0000     1.0000     1.0000     1.0000    
Decision Tree        0.9091     0.8214     0.9200     0.8679    
Random Forest        0.9481     0.8621     1.0000     0.9259    
KNN                  0.9221     0.9130     0.8400     0.8750    
Neural Network       0.9740     0.9259     1.0000     0.9615    
Linear SVM           0.9610     0.8929     1.0000     0.9434    
Poly SVM             0.9091     1.0000     0.7200     0.8372    
RBF SVM              0.9481     0.9200     0.9200     0.9200    

Best Performing Model: Logistic Regression
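
As a worked example of how the four metrics in the table relate to each other, the sketch below recomputes them from confusion-matrix counts. The counts are hypothetical: they were chosen only because they are consistent with the Poly SVM row, and the notebook does not print the actual confusion matrix.

```python
# Hypothetical confusion-matrix counts (assumption): picked because they
# reproduce the Poly SVM row of the table, not taken from the notebook.
tp, fp, fn, tn = 18, 0, 7, 52

# Share of all predictions that are correct
accuracy = (tp + tn) / (tp + fp + fn + tn)
# Share of predicted positives that are truly positive
precision = tp / (tp + fp)
# Share of actual positives that the model found
recall = tp / (tp + fn)
# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
```

With these counts, a model can have perfect precision (no false positives) and still miss 7 of 25 positive cases, which is why accuracy alone does not tell the whole story.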


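The logistic regression model achieved a perfect score on every metric: accuracy, precision, recall, and F1 are all 1.0000. A perfect result on a held-out test set is unusual and should be treated with caution rather than taken at face value. It may reflect data leakage or an overly optimistic estimate from a single small train/test split. Class imbalance can also inflate accuracy, although it would not by itself explain perfect precision and recall. In addition, because logistic regression is a linear model, it may still fail to capture nonlinear relationships in new data.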
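
One way to check whether a perfect score on a single split holds up is stratified k-fold cross-validation. The following is a minimal sketch, not the notebook's method: it assumes scikit-learn's bundled breast cancer dataset as a stand-in for `breast_cancer_dataset_preprocessed.csv`, and the names `X_demo`, `y_demo`, and `clf` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): scikit-learn's bundled breast cancer dataset,
# used only so the sketch runs without the notebook's CSV file.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline so it is re-fit on each training fold,
# which keeps held-out fold information out of the preprocessing step.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class ratio roughly constant across splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=756)
scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores.round(4))
print(f"Mean accuracy: {scores.mean():.4f} (std {scores.std():.4f})")
```

If the fold accuracies vary noticeably or fall below the single-split result, the single-split score was probably optimistic. To apply this to the notebook's data, `X` and `y` from the loading cell could be passed in place of `X_demo` and `y_demo`.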

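Until the logistic regression result is verified, the Neural Network (accuracy 0.9740, F1 0.9615) and the Linear SVM (accuracy 0.9610, F1 0.9434) are the safer choices. Both reach a recall of 1.0000, so neither missed a positive (malignant) case in the test set, which is usually the priority in a cancer-screening task.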

The remaining models also performed reasonably well, with accuracies between 0.9091 and 0.9481, but apart from logistic regression the Neural Network had the highest accuracy and F1 score.
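
The selection reasoning above can be sketched as a small ranking step. This is one possible rule, not the notebook's own (which picks the maximum accuracy): it sets aside any model that is perfect on every metric for separate review, then ranks the rest by recall with F1 as a tie-breaker. The choice of recall as the primary criterion assumes class 1 is malignant, as the encoding comment states; `flagged` and `ranked` are illustrative names.

```python
# Metrics copied from the results table printed by the notebook.
results = {
    "Logistic Regression": {"Accuracy": 1.0000, "Precision": 1.0000, "Recall": 1.0000, "F1 Score": 1.0000},
    "Decision Tree":       {"Accuracy": 0.9091, "Precision": 0.8214, "Recall": 0.9200, "F1 Score": 0.8679},
    "Random Forest":       {"Accuracy": 0.9481, "Precision": 0.8621, "Recall": 1.0000, "F1 Score": 0.9259},
    "KNN":                 {"Accuracy": 0.9221, "Precision": 0.9130, "Recall": 0.8400, "F1 Score": 0.8750},
    "Neural Network":      {"Accuracy": 0.9740, "Precision": 0.9259, "Recall": 1.0000, "F1 Score": 0.9615},
    "Linear SVM":          {"Accuracy": 0.9610, "Precision": 0.8929, "Recall": 1.0000, "F1 Score": 0.9434},
    "Poly SVM":            {"Accuracy": 0.9091, "Precision": 1.0000, "Recall": 0.7200, "F1 Score": 0.8372},
    "RBF SVM":             {"Accuracy": 0.9481, "Precision": 0.9200, "Recall": 0.9200, "F1 Score": 0.9200},
}

# Set aside models that are perfect on every metric so they can be
# verified separately instead of winning the comparison outright.
flagged = [name for name, m in results.items() if all(v == 1.0 for v in m.values())]

# Rank the rest by recall first (assumption: a missed malignant case is the
# costlier error), using F1 as the tie-breaker.
ranked = sorted(
    (name for name in results if name not in flagged),
    key=lambda name: (results[name]["Recall"], results[name]["F1 Score"]),
    reverse=True,
)

print("Flagged for review:", flagged)
for position, name in enumerate(ranked, start=1):
    m = results[name]
    print(f"{position}. {name:<15} recall={m['Recall']:.4f} f1={m['F1 Score']:.4f}")
```

Under this rule the Neural Network ranks first and the Linear SVM second, matching the recommendation in the text.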