# Lab 4: Classifiers - Part 1

In this notebook, we will explore two fundamental machine learning algorithms: k-Nearest Neighbors (k-NN) and Logistic Regression. Both methods are widely used for classification tasks, each with its own strengths and applications.

## Part 1: k-Nearest Neighbors

k-Nearest Neighbors (k-NN) is a simple yet powerful algorithm used for classification tasks in machine learning. It operates on the principle that similar data points tend to belong to the same class.

When making a prediction for a new data point, k-NN looks at the 𝑘 nearest data points in the feature space (based on a distance metric, such as Euclidean distance) and assigns the most common class among those neighbors to the new point.
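
To make the voting idea concrete, here is a minimal from-scratch sketch of the prediction step. It is an illustration only, not how `sklearn` implements k-NN internally; the helper name `knn_predict` and the toy data are made up for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two well-separated clusters in 2-D
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_toy = np.array(['A', 'A', 'A', 'B', 'B', 'B'])

print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3))  # A
print(knn_predict(X_toy, y_toy, np.array([4.8, 5.0]), k=3))  # B
```

Each query point sits inside one cluster, so all three of its nearest neighbors share a label and the vote is unanimous.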


### Step 1: Import Libraries

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix
import seaborn as sns
```

##### <font color='red'>**TRY IT**</font> &#x1f9e0;: Import these libraries.

### Step 2: Prepare the Data
In this example, we’ll use the Iris dataset, which is included in Scikit-learn. We can import it with `load_iris()`. You also need to separate the predictors from the target and split the data into training and test sets.

**Our question is**: Can we predict the species of an iris flower based on information about its sepals and petals?

The **variables** in the dataset are:
- **Sepal Length (cm)**: The length of the sepal (numeric, continuous).
- **Sepal Width (cm)**: The width of the sepal (numeric, continuous).
- **Petal Length (cm)**: The length of the petal (numeric, continuous).
- **Petal Width (cm)**: The width of the petal (numeric, continuous).
- **Species**: The species of the iris flower: Iris Setosa, Iris Versicolor, or Iris Virginica (categorical). The code below keeps all three species, so this is a three-class problem rather than a binary one.

```python
# load data and create dataframe
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species'] = iris.target
iris_df['species'] = iris_df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})
iris_df['species'] = pd.Categorical(iris_df['species'])
```

##### <font color='red'>**TRY IT**</font> &#x1f9e0;: (1) Load the data using the function given above. (2) Parse the data into two new variables: y (for the single target) and X (for the predictors). (3) Split the data into `X_train`, `X_test`, `y_train`, `y_test`, using a `test_size` of .25 and a `random_state` of 22.
(*Refer back to our Lab 2 notebook if you need a refresher on how to do this!*)

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix
import seaborn as sns

# Load data and create dataframe
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species'] = iris.target
iris_df['species'] = iris_df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})
#iris_df = iris_df[iris_df['species'] != 'virginica']
iris_df['species'] = pd.Categorical(iris_df['species'])

print(iris_df)
X = iris_df.drop('species', axis=1)  # Features
y = iris_df['species']  # Target variable
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=22)

k = 4  # number of neighbors
knn = KNeighborsClassifier(n_neighbors=k)

# Fit the model
knn.fit(X_train, y_train)

# Make predictions
y_pred = knn.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

conf_matrix = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:\n', conf_matrix)

class_report = classification_report(y_test, y_pred)
print('Classification Report:\n', class_report)
```

```
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                  5.1               3.5                1.4               0.2
1                  4.9               3.0                1.4               0.2
2                  4.7               3.2                1.3               0.2
3                  4.6               3.1                1.5               0.2
4                  5.0               3.6                1.4               0.2
..                 ...               ...                ...               ...
145                6.7               3.0                5.2               2.3
146                6.3               2.5                5.0               1.9
147                6.5               3.0                5.2               2.0
148                6.2               3.4                5.4               2.3
149                5.9               3.0                5.1               1.8

       species
0       se
```

### Step 3: Create and Train the k-NN Model

Instantiating the k-NN classifier and fitting it to the training data is quite easy. Let's try it out with a 𝑘 of 4 to start.

```python
# Create the k-NN model
k = 4  # number of neighbors
knn = KNeighborsClassifier(n_neighbors=k)

# Fit the model
knn.fit(X_train, y_train)
```

**Make Predictions**

Use the trained model to make predictions on the test set.

```python
# Make predictions
y_pred = knn.predict(X_test)
```

##### <font color='red'>**TRY IT**</font> &#x1f9e0;: Fit the classifier and predict on your data!

### Step 4: Evaluate the Model
Evaluate the model's performance using accuracy, a confusion matrix, and a classification report.

**4.1. Accuracy Score**

Accuracy represents the proportion of correctly predicted instances out of the total instances. It’s a simple measure, but can be misleading if the classes are imbalanced.

```python
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
```
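
As a small illustration of why accuracy can mislead on imbalanced classes, the sketch below scores a "model" that always predicts the majority class. The labels are made up for this example and are not related to the Iris data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 95 negatives (0) and 5 positives (1)
y_true_imb = np.array([0] * 95 + [1] * 5)
# A "model" that always predicts the majority class
y_pred_imb = np.zeros(100, dtype=int)

acc_imb = accuracy_score(y_true_imb, y_pred_imb)
rec_imb = recall_score(y_true_imb, y_pred_imb)  # recall for the positive class (1)

print(f'Accuracy: {acc_imb:.2f}')  # 0.95
print(f'Recall:   {rec_imb:.2f}')  # 0.00
```

The accuracy looks strong even though the model never finds a single positive instance, which is why the other metrics in this step are worth checking too.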

**4.2. Confusion Matrix**

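The confusion matrix counts how often each actual class was predicted as each class. In `sklearn`, rows are the actual classes and columns are the predicted classes, so the diagonal holds the correct predictions and every off-diagonal cell is an error. For a binary problem, the four cells are the true negatives, false positives, false negatives, and true positives. It helps you understand where the model is making errors, such as confusing one class for another.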

```python
# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:\n', conf_matrix)
```
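
To see how the matrix is laid out, here is a tiny example with made-up labels (the names `y_true_cm` and `y_pred_cm` are just for illustration). Passing `labels` fixes the row and column order.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for a two-class toy problem
y_true_cm = ['a', 'a', 'b', 'b', 'b']
y_pred_cm = ['a', 'b', 'b', 'b', 'a']

# Rows are actual classes, columns are predicted classes, in the order given by `labels`
cm_toy = confusion_matrix(y_true_cm, y_pred_cm, labels=['a', 'b'])
print(cm_toy)
# [[1 1]
#  [1 2]]
```

The first row says that of the two actual `'a'` samples, one was predicted `'a'` and one was predicted `'b'`.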

**4.3. Classification Report**

The classification report includes precision, recall, and F1-score for each class:
- **Precision**: The ratio of true positives to the sum of true positives and false positives. It indicates how many of the predicted positive instances are actually positive.
- **Recall (Sensitivity)**: The ratio of true positives to the sum of true positives and false negatives. It shows how many actual positive instances were captured by the model.
- **F1-Score**: The harmonic mean of precision and recall. It provides a balance between the two, particularly useful when you need to consider both false positives and false negatives.

```python
# Classification report
class_report = classification_report(y_test, y_pred)
print('Classification Report:\n', class_report)
```
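
The per-class numbers in the report can be reproduced by hand from a confusion matrix. The sketch below uses a made-up 3-class matrix (not the Iris results) and assumes rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Made-up 3-class confusion matrix (rows = actual, columns = predicted)
cm_example = np.array([[10, 0, 0],
                       [0, 8, 2],
                       [0, 1, 9]])

tp = np.diag(cm_example)              # correct predictions per class
fp = cm_example.sum(axis=0) - tp      # predicted as the class, but actually another
fn = cm_example.sum(axis=1) - tp      # actually the class, but predicted as another

precision_pc = tp / (tp + fp)
recall_pc = tp / (tp + fn)
f1_pc = 2 * precision_pc * recall_pc / (precision_pc + recall_pc)

print('Precision:', np.round(precision_pc, 2))
print('Recall:   ', np.round(recall_pc, 2))
print('F1:       ', np.round(f1_pc, 2))
```

For a multi-class problem, each class takes a turn as the "positive" class, with all other classes treated as "negative".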

**Overall Scores**

We can also compute summary scores across all classes using macro or micro averaging:
- `average='macro'`: Unweighted mean (treats all classes equally).
- `average='micro'`: Total true positives, false positives, and false negatives are computed globally.

```python
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')

print(f'Precision (Macro): {precision:.2f}')
print(f'Recall (Macro): {recall:.2f}')
print(f'F1 Score (Macro): {f1:.2f}')
```
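
The difference between the two averaging modes shows up most clearly when classes are imbalanced. Here is a small sketch with made-up labels in which the rare class is only half recovered.

```python
from sklearn.metrics import recall_score

# Made-up imbalanced labels: class 0 is common, class 1 is rare
y_true_avg = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_avg = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# Macro: average of per-class recall (1.0 for class 0, 0.5 for class 1)
macro_r = recall_score(y_true_avg, y_pred_avg, average='macro')
# Micro: pooled over all samples (9 correct out of 10)
micro_r = recall_score(y_true_avg, y_pred_avg, average='micro')

print(f'Recall (Macro): {macro_r:.2f}')  # 0.75
print(f'Recall (Micro): {micro_r:.2f}')  # 0.90
```

Macro averaging gives the rare class the same weight as the common one, so its poor recall pulls the score down; micro averaging is dominated by the common class.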

##### <font color='red'>**TRY IT**</font> &#x1f9e0;: Check out Accuracy, the Confusion Matrix, the Classification Report, and the overall metrics for your model. What do you notice? Did it perform well on the test data?

### Step 5: Visualizing the Results
You can visualize the confusion matrix using a heatmap. With the `Blues` colormap, darker cells hold higher counts, so a well-performing model shows a dark diagonal (correct predictions) and pale off-diagonal cells (errors).

```python
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix')
plt.show()
```
##### <font color='red'>**TRY IT**</font> &#x1f9e0;: Create a heatmap to further understand the Confusion Matrix.

### Step 6: Repeat the process: Testing 𝑘

You’ve successfully implemented a 𝑘-NN classifier! But is our value for 𝑘 the best one possible? Let's find out.

##### <font color='red'>**TRY IT**</font> &#x1f9e0;: Try out a new k. Keep in mind the ways in which 𝑘 impacts the model (overfitting/underfitting). Additionally, consider challenging yourself and try *𝑘-fold cross-validation* for a more robust way of optimizing your choice of 𝑘. Did things get better or worse when you changed 𝑘?
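
If you want to try the cross-validation challenge, one possible approach is sketched below. It assumes 5 folds and candidate values of 𝑘 from 1 to 15, both of which are arbitrary choices for illustration, and it reloads the data under new variable names so that it runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_cv, y_cv = load_iris(return_X_y=True)
X_train_cv, X_test_cv, y_train_cv, y_test_cv = train_test_split(
    X_cv, y_cv, test_size=0.25, random_state=22)

cv_scores = {}
for k_candidate in range(1, 16):
    model = KNeighborsClassifier(n_neighbors=k_candidate)
    # 5-fold cross-validation on the training data only
    scores = cross_val_score(model, X_train_cv, y_train_cv, cv=5)
    cv_scores[k_candidate] = scores.mean()

# Pick the k with the highest mean cross-validated accuracy
best_k = max(cv_scores, key=cv_scores.get)
print(f'Best k: {best_k} (mean CV accuracy {cv_scores[best_k]:.3f})')
```

Choosing 𝑘 on the training folds keeps the test set untouched, so the final test accuracy remains an honest estimate.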

## Part 2: Logistic Regression

Unlike linear regression, which predicts continuous outcomes, logistic regression predicts the likelihood that a given input belongs to a specific category.

There are several forms of logistic regression (*logistic regression, multiple logistic regression, multinomial logistic regression*) but they are all implemented similarly in `sklearn`.

You are not required to specify any parameters, but the main **parameter** you will need to think about is `max_iter`.

- **Purpose**: Logistic regression is fit with an iterative optimizer (`sklearn` uses the `lbfgs` solver by default) that repeatedly updates the weights until they converge. `max_iter` is the maximum number of iterations the solver will run while searching for the best-fitting model. If the solver has not converged within that many iterations, it stops and you will get a `ConvergenceWarning`.
- **Value**: The default value for `max_iter` is 100, but this can be changed. You might want to increase `max_iter` if the model needs more iterations to converge or you receive a warning saying that the optimization did not converge.
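
To see the effect of `max_iter`, the sketch below deliberately sets it very low (5, an arbitrary value chosen for illustration) and records whether a `ConvergenceWarning` is raised. Whether the warning appears can depend on your `sklearn` version and data, so treat the output as something to inspect rather than a guaranteed result.

```python
import warnings
from sklearn.datasets import load_iris
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

X_it, y_it = load_iris(return_X_y=True)

# Capture warnings so we can check for a ConvergenceWarning afterwards
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    small_model = LogisticRegression(max_iter=5).fit(X_it, y_it)

hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
print('Iterations used:', small_model.n_iter_)
print('ConvergenceWarning raised:', hit_limit)
```

The fitted attribute `n_iter_` reports how many iterations the solver actually ran, which is a quick way to tell whether it stopped at the limit.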

**Change 1: Additional imports**

```python
from sklearn.linear_model import LogisticRegression
```

**Change 2: Initializing and fitting the model**

```python
# Initialize and fit the logistic regression model
log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train, y_train)
```

Prediction works the same way, with the `predict()` method, and the evaluation metrics are calculated with the same code as above.
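
Because logistic regression models class probabilities, you can also inspect them directly with `predict_proba()`. The sketch below refits a model under new variable names (chosen for illustration) so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_p, y_p = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_p, y_p, test_size=0.25, random_state=22)
clf = LogisticRegression(max_iter=200).fit(X_tr, y_tr)

# One row per sample, one column per class; each row sums to 1
proba = clf.predict_proba(X_te[:3])
print(np.round(proba, 3))

# predict() returns the class with the highest probability
print(clf.classes_[proba.argmax(axis=1)])
print(clf.predict(X_te[:3]))
```

Looking at the probabilities shows how confident the model is, which a bare class label hides.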

##### <font color='red'>**TRY IT**</font> &#x1f9e0;: Fit a logistic regression model to your data! Run all of the aforementioned evaluation metrics/methods. How does your logistic regression model compare to your best k-NN classifier? Why do you think one is better than the other?

**Evaluate Coefficients**

Now let's see how our predictors impacted our target.

```python
# Get the coefficients
coefficients = log_reg.coef_
intercept = log_reg.intercept_

# Create a DataFrame for better visualization
coef_df = pd.DataFrame(coefficients, columns=iris.feature_names, index=iris.target_names)
coef_df['intercept'] = intercept
print(coef_df)
```
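
One way to build intuition for the coefficients is to recompute a class score by hand: each class gets a score equal to the weighted sum of the features plus that class's intercept, and the highest score wins. The sketch below fits a model on the full Iris data under new variable names (chosen for illustration) and checks the manual scores against `decision_function()`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris_c = load_iris()
model_c = LogisticRegression(max_iter=200).fit(iris_c.data, iris_c.target)

# Score for each class = features · coefficients + intercept
x_one = iris_c.data[:1]  # first flower, kept 2-D
scores_manual = x_one @ model_c.coef_.T + model_c.intercept_
print(np.round(scores_manual, 3))

# The manual scores should match sklearn's decision_function
print(np.allclose(scores_manual, model_c.decision_function(x_one)))

# The predicted class is the one with the highest score
print(iris_c.target_names[scores_manual.argmax(axis=1)])
```

A positive coefficient raises a class's score as that feature increases, and a negative one lowers it, holding the other features fixed.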

##### <font color='red'>**TRY IT**</font> &#x1f9e0;: Check out the coefficients of your model. What does each coefficient mean for your model? Write it out!

### Repeat the process: Model Improvement

You’ve successfully implemented a Logistic Regression! But is our model the best fit we can get?

##### <font color='red'>**TRY IT**</font> &#x1f9e0;: Try altering something about the model (e.g., taking out a predictor, increasing/decreasing the number of iterations). Did things get better or worse when you made those changes?