# Lab 6 - Classification

Welcome to this week's lab on Classification! This week, we will explore two classification methods: `Logistic Regression` and `KNN`.

## Part 1: Logistic Regression Walkthrough
Logistic Regression is a statistical method that models the probability of a dichotomous outcome (one with only two possible values) as a function of one or more independent variables. It is commonly used for binary classification tasks.
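As a rough illustration of the idea, the sketch below shows how a weighted sum of features can be passed through the sigmoid function to produce a probability, which is then thresholded into a class label. The helper names (`sigmoid`, `predict_label`), the weights, and the inputs are hypothetical and chosen only for illustration; in the lab, `LogisticRegression` learns the weights from the training data.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return 1 if the modeled probability reaches the threshold, else 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if sigmoid(score) >= threshold else 0

# Hypothetical weights and inputs, chosen only for illustration
print(sigmoid(0.0))  # a score of 0 corresponds to a probability of 0.5
print(predict_label([2.0, 1.0], weights=[1.5, -0.5], bias=-1.0))
```

A positive score pushes the probability above 0.5 and a negative score pushes it below, so with the default threshold the sign of the score decides the predicted class.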

In this part, we will use Logistic Regression to predict whether a breast tumor is malignant or benign based on diagnostic measurements. We will use the breast cancer dataset available through `sklearn.datasets`.

### Step 1: Import Necessary Libraries

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

### Step 2: Load & Split the Dataset

In [None]:
# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

1. Briefly explain the effects of changing the `test_size`.
2. Briefly explain the `random_state` parameter.
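One way to investigate both questions empirically is sketched below. It is only a starting point for your own explanation: the variable names (`split_sizes`, `same_split`) and the particular `test_size` values are arbitrary choices, not part of the lab.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Compare how many samples land in each split for a few test_size values
split_sizes = {}
for test_size in (0.1, 0.2, 0.5):
    X_tr, X_te, _, _ = train_test_split(X, y, test_size=test_size, random_state=42)
    split_sizes[test_size] = (len(X_tr), len(X_te))
    print(test_size, split_sizes[test_size])

# Repeat the same split twice with the same random_state and compare the test labels
_, _, _, y_te_a = train_test_split(X, y, test_size=0.2, random_state=42)
_, _, _, y_te_b = train_test_split(X, y, test_size=0.2, random_state=42)
same_split = bool((y_te_a == y_te_b).all())
print(same_split)
```

Try changing `random_state` in one of the two calls, or removing it, and observe whether the comparison still holds.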

### Step 3 (optional): EDA

In [None]:
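# Example EDA: visualize feature distributions and relationships
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# X is a NumPy array, so wrap it in a DataFrame to use .hist() and .corr()
X_df = pd.DataFrame(X, columns=data.feature_names)

# Plot a histogram for each feature
X_df.hist(bins=10, figsize=(20, 15))
plt.show()

# Correlation matrix
corr_matrix = X_df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt=".2f")
plt.show()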

Why do you think EDA is important, even though it is not always strictly necessary?

### Step 4: Model Training

In [None]:
# Train a Logistic Regression model
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

1. What is the `max_iter` parameter for?
2. Does `LogisticRegression` accept more parameters? If yes, list and briefly explain some of them.
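To get started on these questions, the sketch below lists a few of the estimator's parameters with `get_params()` and, after fitting, reads `n_iter_` to see how many solver iterations were actually used. The selection of parameter names printed here is an arbitrary sample, and fitting on the full dataset is done only to keep the example short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=10000)

# Inspect a few of the constructor parameters and their current values
params = model.get_params()
for name in ("C", "max_iter", "solver", "tol", "class_weight"):
    print(f"{name} = {params[name]!r}")

# After fitting, n_iter_ reports how many iterations the solver actually ran
model.fit(X, y)
iterations_used = int(model.n_iter_[0])
print(iterations_used)
```

Comparing `iterations_used` with `max_iter` is one way to check whether the solver stopped because it converged or because it ran out of iterations.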

### Step 5: Model Evaluation

In [None]:
# Make predictions
predictions = model.predict(X_test)

# Evaluate the model
print(f'Accuracy: {accuracy_score(y_test, predictions)}')
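# ROC AUC is normally computed from the predicted probability of the positive class, not from hard labels
probabilities = model.predict_proba(X_test)[:, 1]
print(f'ROC AUC Score: {roc_auc_score(y_test, probabilities)}')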
print(classification_report(y_test, predictions))

Briefly explain `ROC` and `AUC`.
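As a small aid for this question, the sketch below computes the points of a ROC curve and the corresponding AUC for four made-up samples. The labels and scores are illustrative values, not results from the breast cancer model.

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up true labels and predicted scores for four samples
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Each threshold yields one (false positive rate, true positive rate) point on the curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}, TPR={t:.2f}")

# AUC summarizes the whole curve as a single number
auc = roc_auc_score(y_true, y_score)
print(auc)
```

Plotting `tpr` against `fpr` with `matplotlib` gives the ROC curve itself.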

## Part 2: Implement a KNN Model

K-Nearest Neighbors (KNN) is an instance-based learning algorithm where the class of a sample is determined by the majority class among its K nearest neighbors.
For this part, use the Iris dataset and create a KNN model to classify Iris plants into three species based on the sizes of their petals and sepals.
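To make the majority-vote idea concrete, here is a minimal from-scratch sketch on a handful of made-up 2D points. The function `knn_predict` and the toy data are hypothetical and are not a substitute for Task 1, which asks for a full pipeline on the Iris dataset.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Pair each training label with its Euclidean distance to the query point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Two made-up clusters, labeled "a" and "b"
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.1, 1.0)))
print(knn_predict(points, labels, (4.0, 4.0)))
```

This sketch uses Euclidean distance and an unweighted vote; other distance metrics and weighting schemes are possible and worth discussing in Task 2.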

### Task 1: Implement the KNN Pipeline

In [None]:
# Your code goes here
# You can break it down into several code cells

### Task 2: Explain your implementation
Provide a detailed explanation and discussion of your implementation. Break it down into steps as relevant to your implementation.

## Submission
Submit a link to your completed Jupyter Notebook file hosted on your private GitHub repository through the submission link in Blackboard.