# Assignment 4: Support Vector Machine (SVM) and Model Ensembling {-}

This assignment aims to familiarize you with training and testing a Support Vector Machine (SVM) classification model, along with model ensembling methods. The dataset you will be working on is 'data-breast-cancer.csv'. It consists of attributes used to build a prediction model. You will complete the following tasks:

1.  **(5 points) Coding tasks:** The following questions involve writing code to complete specific tasks.  
    1.1 *(1 point)* Load the data-breast-cancer.csv dataset and perform basic data cleaning, analysis, and visualization to gain a solid understanding of the data. Identify and remove outliers if applicable.  
    1.2 *(1 point)* Train and evaluate an SVM model. Use GridSearchCV to find the best parameters: kernel, C, and gamma values, for the SVM model. Details of the SVM parameters can be found at https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html.  
    1.3 *(1 point)* Train and evaluate four other classifiers: Logistic Regression, Naive Bayes, Decision Tree and Random Forest. Compare their accuracy, precision, recall, and F1-score with the SVM's performance.  
    1.4 *(2 points)* Apply three ensemble learning techniques: Bagging, Boosting, and Stacking, to solve the problem. Compare their performance against each other as well as against individual models. Summarize your observations and draw conclusions based on the results.

2.  **(5 points) Open discussion questions:** These discussion questions ask you to analyze and argue your points.  Feel free to include relevant code examples to strengthen your arguments.  
    2.1 *(1 point)* How well did the SVM model perform compared to the other classifiers? Did it outperform them, or did another model work better?  
    2.2 *(1 point)* How did tuning the C and gamma parameters affect the model's performance? Did you observe any signs of overfitting or underfitting?  
    2.3 *(1 point)* How could this breast cancer classification model be used in real-world healthcare applications? What challenges might arise in deployment?  
    2.4 *(1 point)* Medical diagnosis models come with ethical responsibilities. What are potential risks of using an automated model for breast cancer detection?  
    2.5 *(1 point)* What was the most insightful part of this assignment? If you could improve your classification result, what would you do differently?  

### Submission {-}
The submission folder should be organized as follows:

- ./\<StudentID>-assignment4-notebook.ipynb: Jupyter notebook containing source code.

The submission folder is named ML4DS-\<StudentID>-Assignment4 (e.g., ML4DS-2012345-Assignment4) and is then compressed into an archive with the same name.
    
### Evaluation {-}
The assignment will be evaluated on how well you meet the requirements. Data exploration and modeling steps beyond the basic requirements are a plus. In addition, your code should follow a Python coding convention such as PEP 8.

### Deadline {-}
Please visit Canvas for details.

In [None]:
# Load the libraries
import pandas as pd
import numpy as np

In [None]:
# Load the dataset
df = pd.read_csv("data-breast-cancer.csv")

In [None]:
# Show some data samples
df.head()

| Unnamed: 0.1 | Unnamed: 0 | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | fractal_dimension_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 |
| 1 | 1 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 |
| 2 | 2 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 |
| 3 | 3 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 |
| 4 | 4 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 |


This dataset is used to detect whether a patient has breast cancer based on the following attributes:

- diagnosis: (label) the diagnosis of breast tissues (M = malignant, B = benign).
- radius: mean of the distances from the center to points on the perimeter.
- texture: standard deviation of gray-scale values.
- perimeter: perimeter of the tumor.
- area: area of the tumor.
- smoothness: local variation in radius lengths.
- compactness: is equal to (perimeter^2 / area - 1.0).
- concavity: severity of concave portions of the contour.
- concave points: number of concave portions of the contour.
- symmetry: symmetry of the tumor shape.
- fractal dimension: "coastline approximation" - 1.
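
The `df.head()` output above shows two columns named `Unnamed: 0.1` and `Unnamed: 0` next to the attributes listed here, and a string-valued label. The following is a minimal sketch of a first cleaning pass. It assumes those two columns are leftover row indexes rather than features, which should be checked against the full file. The frame is rebuilt from the five rows shown above (a subset of columns) so the sketch runs without the CSV, and the column name `target` is chosen here for illustration.

```python
import pandas as pd

# Five rows copied from the df.head() output above (subset of columns),
# so the sketch runs without data-breast-cancer.csv.
sample = pd.DataFrame(
    {
        "Unnamed: 0.1": [0, 1, 2, 3, 4],
        "Unnamed: 0": [0, 1, 2, 3, 4],
        "diagnosis": ["M", "M", "M", "M", "M"],
        "radius_mean": [17.99, 20.57, 19.69, 11.42, 20.29],
        "texture_mean": [10.38, 17.77, 21.25, 20.38, 14.34],
    }
)

# Assumption: columns named "Unnamed: ..." are leftover row indexes.
index_cols = [col for col in sample.columns if col.startswith("Unnamed")]
clean = sample.drop(columns=index_cols)

# Encode the label as integers: malignant -> 1, benign -> 0.
# "target" is a hypothetical column name, not part of the dataset.
clean["target"] = clean["diagnosis"].map({"M": 1, "B": 0})

# Quick sanity checks: missing values and duplicated rows.
print(clean.isna().sum())
print("duplicated rows:", clean.duplicated().sum())
```

On the real data, the same steps would be applied to `df` after `pd.read_csv`.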



## 1. Coding tasks

In [None]:
# Your code goes here. Please make sure to explain the reasons behind your data processing and modeling choices.
# 1.1
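
One common way to approach the outlier part of task 1.1 is the interquartile range (IQR) rule. The sketch below is one possible approach, not the required one. The helper `iqr_outlier_mask` and the multiplier `k=1.5` are illustrative choices, and the input uses the five `area_mean` values shown in the `df.head()` output so the code runs without the CSV.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# area_mean values of the five rows shown in the df.head() output.
area_mean = pd.Series([1001.0, 1326.0, 1203.0, 386.1, 1297.0], name="area_mean")

mask = iqr_outlier_mask(area_mean)

# Flagged rows are candidates for inspection, not automatic removal.
print(area_mean[mask])
```

Five rows are far too few to judge outliers; the sample only demonstrates the mechanics. On medical data, extreme measurements may be genuine cases, so it is worth inspecting flagged rows per class before dropping anything.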

In [None]:
# Your code goes here. Please make sure to explain the reasons behind your data processing and modeling choices.
# 1.2
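
As a starting point for task 1.2, here is a hedged sketch of a grid search over `kernel`, `C`, and `gamma`. It makes several assumptions: scikit-learn's bundled breast cancer data (first ten "mean" columns) stands in for `data-breast-cancer.csv`, the step names `scale` and `svc` are arbitrary, and the grid values are illustrative rather than recommended.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled dataset is a stand-in for data-breast-cancer.csv.
X, y = load_breast_cancer(return_X_y=True)
X = X[:, :10]  # keep only the ten "mean" features
y = 1 - y      # scikit-learn encodes malignant as 0; make it the positive class

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling sits inside the pipeline so each CV fold is scaled using only
# its own training portion.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# gamma is ignored by the linear kernel, so it is searched only for RBF.
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]},
]

search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

test_f1 = search.score(X_test, y_test)  # uses the same "f1" scorer
print("best parameters:", search.best_params_)
print(f"cross-validated F1: {search.best_score_:.3f}, held-out F1: {test_f1:.3f}")
```

Comparing `search.best_score_` with the held-out score, and inspecting `search.cv_results_`, gives a first view of how sensitive the model is to `C` and `gamma`.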

In [None]:
# Your code goes here. Please make sure to explain the reasons behind your data processing and modeling choices.
# 1.3
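
For task 1.3, the four classifiers can be fitted in a loop and their metrics collected in one table. This is a sketch under the same assumption as before: scikit-learn's bundled breast cancer data stands in for the assignment CSV. The hyperparameters are mostly library defaults, not tuned values.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled dataset is a stand-in for data-breast-cancer.csv.
X, y = load_breast_cancer(return_X_y=True)
X, y = X[:, :10], 1 - y  # ten "mean" features; malignant as positive class

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="binary"
    )
    rows.append(
        {
            "model": name,
            "accuracy": accuracy_score(y_test, y_pred),
            "precision": precision,
            "recall": recall,
            "f1": f1,
        }
    )

results = pd.DataFrame(rows).set_index("model")
print(results.round(3))
```

Adding the tuned SVM's scores as a fifth row of `results` makes the comparison requested in the task direct.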

In [None]:
# Your code goes here. Please make sure to explain the reasons behind your data processing and modeling choices.
# 1.4
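
For task 1.4, scikit-learn provides estimators for all three ensemble families. The sketch below picks one representative of each, and these are illustrative choices: bagged decision trees, gradient boosting as the boosting method, and a stack of an SVM and a decision tree with logistic regression as the final estimator. As above, the bundled breast cancer data is assumed as a stand-in for the assignment CSV.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (
    BaggingClassifier,
    GradientBoostingClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled dataset is a stand-in for data-breast-cancer.csv.
X, y = load_breast_cancer(return_X_y=True)
X, y = X[:, :10], 1 - y  # ten "mean" features; malignant as positive class

ensembles = {
    # Bagging: many trees trained on bootstrap samples, predictions combined.
    "Bagging": BaggingClassifier(
        DecisionTreeClassifier(random_state=42), n_estimators=50, random_state=42
    ),
    # Boosting: trees added sequentially, each correcting earlier errors.
    "Boosting": GradientBoostingClassifier(random_state=42),
    # Stacking: a final estimator learns from the base models' predictions.
    "Stacking": StackingClassifier(
        estimators=[
            ("svc", make_pipeline(StandardScaler(), SVC())),
            ("tree", DecisionTreeClassifier(random_state=42)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,
    ),
}

scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    for name, model in ensembles.items()
}
for name, score in scores.items():
    print(f"{name}: mean 5-fold F1 = {score:.3f}")
```

Cross-validation is used here instead of a single split so the three ensembles are compared on the same folds. The same loop can include the individual models for the comparison the task asks for.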

## 2. Open discussion questions

In [None]:
# Your argument goes here. Please include data visualization and analysis to back up your argument.
# 2.1

In [None]:
# Your argument goes here. Please include data visualization and analysis to back up your argument.
# 2.2

In [None]:
# Your argument goes here. Please include data visualization and analysis to back up your argument.
# 2.3

In [None]:
# Your argument goes here. Please include data visualization and analysis to back up your argument.
# 2.4

In [None]:
# Your argument goes here. Please include data visualization and analysis to back up your argument.
# 2.5