# Worksheet 12

Name: Yuzhe Jiang

UID: U92913042

Link to my repo: https://github.com/jiangyz112/Data-Science-Fundamentals/tree/worksheet_12

### Topics

- Introduction to Classification
- K Nearest Neighbors

### Introduction to Classification

a) For the following examples, say whether they are or aren't an example of classification.

1. Predicting whether a student will be offered a job after graduating given their GPA.
2. Predicting how long it will take (in number of months) for a student to be offered a job after graduating, given their GPA.
3. Predicting the number of stars (1-5) a person will assign in their yelp review given the description they wrote in the review.
4. Predicting the number of births occurring in a specified minute.

1. This is an example of classification. The outcome is binary (offered a job or not offered a job), which fits the definition of classification.

2. This is not an example of classification; it's an example of regression. The outcome is a numeric quantity (number of months) rather than one of a fixed set of labels, so it does not fit the discrete output of classification models.

3. This is an example of classification. Even though the output is numeric, it takes one of five discrete values (1, 2, 3, 4, or 5 stars). The categories are ordered, so strictly speaking this is ordinal classification, but the model still chooses among a fixed set of labels.

4. This is not an example of classification; it's an example of regression. The outcome is a count (number of births), which is discrete rather than continuous, but it has no fixed upper bound and its magnitude is meaningful, so it is better modeled as a quantity (for example with count regression) than as a set of class labels.

b) Given a dataset, how would you set things up such that you can both learn a model and get an idea of how this model might perform on data it has never seen?

1. Split the Dataset
Split your dataset into at least two subsets: a training set and a testing set. A common ratio is 70% of the data for training and 30% for testing, but this can vary based on the size and specifics of your dataset.


2. Choose a Model
Select a model that is appropriate for your problem type (e.g., classification, regression) and the nature of your data. Consider factors like the complexity of the model, the size and dimensions of your dataset, and any computational constraints.

3. Train the Model
Use the training set to train your model. This involves feeding the input features from the training set into the model and adjusting the model parameters to minimize error in predicting the training targets.

4. Validate the Model (Optional)
If you've set aside a validation set or are using cross-validation, use this step to assess how well your model generalizes to unseen data and to tune any hyperparameters. Cross-validation involves partitioning the training set into complementary subsets, training the model on one subset, and validating it on the other, iteratively.

5. Test the Model
Evaluate the model's performance on the test set. Since the test set has not been used during the training process, it serves as a proxy for new, unseen data. Use appropriate metrics to assess performance:

For classification, metrics might include accuracy, precision, recall, F1 score, and ROC-AUC.
For regression, metrics might include mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and R-squared.

6. Interpret Results
Analyze the results to understand how well your model might perform in real-world scenarios. If the model performs well on the test set, it suggests that it has generalized well from the training data. If performance is poor, consider revisiting your model choice, feature engineering, or training procedure.

7. Iterate
Model development is often iterative. Based on your test results, you might return to earlier steps to try different models, adjust parameters, or engineer new features to improve performance.
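
The split-then-validate setup described in the steps above can be sketched without any library. This is a minimal illustration, not a recommended implementation: the helper names `holdout_split` and `k_fold_indices`, the 70/30 ratio, and the seed are assumptions chosen for the example. In practice `train_test_split` from scikit-learn does the same job.

```python
import random


def holdout_split(n_samples, test_fraction=0.3, seed=0):
    """Shuffle sample indices and split them into (train, test) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold_indices(train_indices, k=5):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    folds = [train_indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# Hypothetical dataset of 100 samples: 70 for training, 30 held out for testing
train_idx, test_idx = holdout_split(100, test_fraction=0.3, seed=0)

# The test indices are never touched while tuning; only the training part is folded
splits = list(k_fold_indices(train_idx, k=5))
print(len(train_idx), len(test_idx), len(splits))
```

The test set is used once, at the end, so that its score remains an honest estimate of performance on unseen data.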

c) In your own words, briefly explain:

- underfitting
- overfitting

and what signs to look out for for each.

Underfitting occurs when a model is too simple to capture the underlying structure of the data. It happens when the model doesn't have enough complexity or information to learn from the training data, leading to poor performance on both the training data and unseen data.

Signs of Underfitting:

The model performs poorly on the training data.

The model also performs poorly on the validation or test data, showing that it is not just a matter of random chance or specific to the training set.

The learning curve shows that both training and validation errors are high but close to each other.

Overfitting happens when a model learns the training data too well, including its noise and outliers, rather than just the underlying pattern. As a result, it performs well on the training data but poorly on any new, unseen data because it has essentially memorized the training set rather than learned the generalizable features of the data.

Signs of Overfitting:

The model performs very well on the training data, often to an unrealistic degree.

The performance drops significantly on the validation or test data compared to the training data.

The learning curve shows a large gap between the training and validation errors, where the training error is much lower than the validation error.
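
The signs listed above can be turned into a rough rule of thumb. The sketch below is only illustrative: the function name `diagnose_fit` and both thresholds (`high_error`, `max_gap`) are assumptions, and sensible values depend entirely on the problem.

```python
def diagnose_fit(train_error, val_error, high_error=0.30, max_gap=0.10):
    """Label a model from its training and validation error rates.

    The thresholds are arbitrary illustrative values, not standard constants.
    """
    gap = val_error - train_error
    if train_error > high_error and gap <= max_gap:
        # Both errors are high and close together
        return "underfitting"
    if gap > max_gap:
        # Training error is much lower than validation error
        return "overfitting"
    return "reasonable fit"


print(diagnose_fit(0.40, 0.42))  # high, similar errors
print(diagnose_fit(0.02, 0.30))  # large gap
print(diagnose_fit(0.08, 0.10))  # low, similar errors
```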

### K Nearest Neighbors

In [14]:
import numpy as np
import matplotlib.pyplot as plt

data = {
    "Attribute A" : [3.5, 0, 1, 2.5, 2, 1.5, 2, 3.5, 1, 3, 2, 2, 2.5, 0.5, 0., 10],
    "Attribute B" : [4, 1.5, 2, 1, 3.5, 2.5, 1, 0, 3, 1.5, 4, 2, 2.5, 0.5, 2.5, 10],
    "Class" : [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0],
}

a) Plot the data in a 2D plot coloring each scatter point one of two colors depending on its corresponding class.

In [None]:
colors = np.array([x for x in 'bgrcmyk'])
plt.scatter(data["Attribute A"], data["Attribute B"], color=colors[data["Class"]].tolist())
plt.xlabel("Attribute A")
plt.ylabel("Attribute B")
plt.show()

Outliers are points that lie far from the rest of the data. They are not necessarily invalid points, however. Imagine sampling from a Normal distribution with mean 10 and variance 1. You would expect most points you sample to be in the range [7, 13], but it's entirely possible to see 20, which, on average, should be very far from the rest of the points in the sample (unless we're VERY (un)lucky). These outliers can inhibit our ability to learn general patterns in the data since they are not representative of likely outcomes. They can still be useful in and of themselves and can be analyzed in great depth depending on the problem at hand.

b) Are there any points in the dataset that could be outliers? If so, please remove them from the dataset.

In [None]:
# Convert to numpy arrays for easier manipulation
attribute_a = np.array(data["Attribute A"])
attribute_b = np.array(data["Attribute B"])
classes = np.array(data["Class"])

# Plot to visually inspect for outliers
colors = np.array([x for x in 'bgrcmyk'])
plt.scatter(attribute_a, attribute_b, color=colors[classes].tolist(), label="Original Data")
plt.xlabel("Attribute A")
plt.ylabel("Attribute B")
plt.title("Inspecting for Outliers")
plt.legend()
plt.show()

# Identifying outliers based on the plot
# The point at (10, 10) is likely an outlier due to its distance from the cluster of other points
# Require both coordinates to match so that only the point (10, 10) is selected
outlier_index = np.where((attribute_a == 10) & (attribute_b == 10))[0]

# Remove the outliers from the dataset
attribute_a_clean = np.delete(attribute_a, outlier_index)
attribute_b_clean = np.delete(attribute_b, outlier_index)
classes_clean = np.delete(classes, outlier_index)

# Verify removal and plot cleaned data
plt.scatter(attribute_a_clean, attribute_b_clean, color=colors[classes_clean].tolist(), label="Cleaned Data")
plt.xlabel("Attribute A")
plt.ylabel("Attribute B")
plt.title("Data After Removing Outliers")
plt.legend()
plt.show()

# Return the cleaned arrays
(attribute_a_clean, attribute_b_clean, classes_clean)
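
The visual inspection above can be cross-checked numerically. A common heuristic, used here as an assumption and not as part of the worksheet, is to flag values whose z-score exceeds 3. The helper name `z_score_outliers` and the threshold are illustrative choices.

```python
from statistics import mean, pstdev


def z_score_outliers(values, threshold=3.0):
    """Return the indices of values more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Same values as the worksheet's data dictionary
attribute_a = [3.5, 0, 1, 2.5, 2, 1.5, 2, 3.5, 1, 3, 2, 2, 2.5, 0.5, 0.0, 10]
attribute_b = [4, 1.5, 2, 1, 3.5, 2.5, 1, 0, 3, 1.5, 4, 2, 2.5, 0.5, 2.5, 10]

# A point is flagged if either of its attributes is extreme
flagged = sorted(set(z_score_outliers(attribute_a)) | set(z_score_outliers(attribute_b)))
print(flagged)  # [15], i.e. the point (10, 10)
```

With so few points the outlier itself inflates the standard deviation, so this check is best treated as a supplement to the plot, not a replacement.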

Noise points are points that could be considered invalid under the general trend in the data. These could be the result of actual errors in the data or randomness that we could attribute to oversimplification (for example if missing some information / feature about each point). Considering noise points in our model can often lead to overfitting.

c) Are there any points in the dataset that could be noise points?

In [None]:
# Re-plotting the dataset with potential outlier removed for closer inspection
plt.figure(figsize=(10, 6))
for i, txt in enumerate(classes_clean):
    plt.scatter(attribute_a_clean[i], attribute_b_clean[i], color=colors[classes_clean[i]].tolist())
    plt.text(attribute_a_clean[i], attribute_b_clean[i], f"{i}", fontsize=8)

plt.xlabel("Attribute A")
plt.ylabel("Attribute B")
plt.title("Dataset Inspection for Noise Points")
plt.show()
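
One possible way to shortlist noise candidates, offered as a heuristic and not as the worksheet's intended answer, is to flag points whose nearest neighbours all belong to the other class. The function name `disagreeing_points` and the choice of k = 3 are assumptions.

```python
import numpy as np


def disagreeing_points(attributes, classes, k=3):
    """Indices of points whose k nearest neighbours all carry a different class."""
    diffs = attributes[:, None, :] - attributes[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(distances, np.inf)  # a point is not its own neighbour
    flagged = []
    for i in range(len(attributes)):
        nearest = np.argsort(distances[i], kind="stable")[:k]
        if np.all(classes[nearest] != classes[i]):
            flagged.append(i)
    return flagged


# The worksheet's dataset with the (10, 10) outlier removed
points = np.array([
    [3.5, 4], [0, 1.5], [1, 2], [2.5, 1], [2, 3.5],
    [1.5, 2.5], [2, 1], [3.5, 0], [1, 3], [3, 1.5],
    [2, 4], [2, 2], [2.5, 2.5], [0.5, 0.5], [0, 2.5],
])
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1])

candidates = disagreeing_points(points, labels, k=3)
print(candidates, points[candidates].tolist())
```

The indices printed match the labels drawn next to each point in the plot above, so the candidates can be checked by eye.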

For the following point

|  A  |  B  |
|-----|-----|
| 0.5 |  1  |

d) Plot it in a different color along with the rest of the points in the dataset.

In [None]:
# Plot the dataset again with the additional point highlighted
plt.figure(figsize=(8, 6))

# Existing points
for i, txt in enumerate(classes_clean):
    plt.scatter(attribute_a_clean[i], attribute_b_clean[i], color=colors[classes_clean[i]].tolist())
    plt.text(attribute_a_clean[i], attribute_b_clean[i], f"{i}", fontsize=8)

# Additional point
plt.scatter(0.5, 1, color='magenta', edgecolors='k', label='New Point (A=0.5, B=1)')

plt.xlabel("Attribute A")
plt.ylabel("Attribute B")
plt.title("Dataset with New Point Highlighted")
plt.legend()
plt.show()


e) Write a function to compute the Euclidean distance from it to all points in the dataset and pick the 3 closest points to it. In a scatter plot, draw a circle centered around the point with radius the distance of the farthest of the three points.

In [None]:
def n_closest_to(example, attributes, n=3):
    """
    Computes the Euclidean distances from the example point to all points in the dataset
    and picks the n closest points to it.

    Parameters:
        example (tuple): The point of interest (A, B).
        attributes (np.ndarray): The dataset containing all other points (Attribute A, Attribute B).
        n (int): The number of closest points to find.

    Returns:
        np.ndarray: The n closest points to the example.
        float: The distance to the farthest of the n closest points.
    """
    # Compute Euclidean distances from the example to each point in the dataset
    distances = np.sqrt(((attributes - example) ** 2).sum(axis=1))
    
    # Find the indices of the n smallest distances
    closest_indices = np.argsort(distances)[:n]
    
    # The n closest points
    closest_points = attributes[closest_indices]
    
    # Distance of the farthest of the n closest points
    max_distance = distances[closest_indices][-1]
    
    return closest_points, max_distance

# Example location
location = (0.5, 1)

# All points in the dataset (excluding the new point)
attributes_clean = np.column_stack((attribute_a_clean, attribute_b_clean))

# Find the 3 closest points to the example location and the radius for the circle
closest_points, radius = n_closest_to(location, attributes_clean, n=3)

# Plotting
_, axes = plt.subplots(figsize=(8, 6))
axes.scatter(attribute_a_clean, attribute_b_clean, color=colors[classes_clean].tolist())  # Existing dataset
axes.scatter(location[0], location[1], color='magenta', edgecolors='k', label='New Point (A=0.5, B=1)')  # New point
axes.scatter(closest_points[:, 0], closest_points[:, 1], color='red', edgecolors='k', label='3 Closest Points')  # 3 closest points
cir = plt.Circle(location, radius, fill=False, alpha=0.8, linestyle='--')
axes.add_patch(cir)
axes.set_aspect('equal')  # Necessary so that the circle is not oval
axes.legend()
plt.show()


f) Write a function that takes the three points returned by your function in e) and returns the class that the majority of points have (break ties with a deterministic default class of your choosing). Print the class assigned to this new point by your function.

In [None]:
def majority(points, classes):
    """
    Determines the majority class among the given points.

    Parameters:
        points (np.ndarray): The indices of the points in the dataset.
        classes (np.ndarray): The classes corresponding to each point in the dataset.

    Returns:
        int: The majority class among the points, with a deterministic default in case of a tie.
    """
    # Extract the classes for the provided points
    point_classes = classes[points]
    
    # Count the occurrences of each class
    counts = np.bincount(point_classes)
    
    # Determine the majority class; np.argmax returns the first maximum,
    # so a tie goes to the lowest class label (0 for this dataset)
    majority_class = np.argmax(counts)
    
    return majority_class

# Identify the indices of the 3 closest points in the original dataset
closest_indices = [np.where((attribute_a_clean == point[0]) & (attribute_b_clean == point[1]))[0][0] for point in closest_points]

# Use the majority function to determine the class of the new point
assigned_class = majority(closest_indices, classes_clean)

print(f"The class assigned to the new point by the majority function is: {assigned_class}")
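
The cell above recovers neighbour indices by matching coordinates, which works here because no two points share the same coordinates. A possible alternative, sketched under the assumption that a single helper may return indices directly, avoids that lookup. The name `knn_predict` is hypothetical and not part of the worksheet.

```python
import numpy as np


def knn_predict(example, attributes, classes, n=3):
    """Return (predicted class, indices of the n nearest points)."""
    distances = np.linalg.norm(attributes - np.asarray(example), axis=1)
    nearest = np.argsort(distances, kind="stable")[:n]
    counts = np.bincount(classes[nearest])
    # np.argmax picks the lowest class label when counts are tied
    return int(np.argmax(counts)), nearest


# The worksheet's dataset with the (10, 10) outlier removed
points = np.array([
    [3.5, 4], [0, 1.5], [1, 2], [2.5, 1], [2, 3.5],
    [1.5, 2.5], [2, 1], [3.5, 0], [1, 3], [3, 1.5],
    [2, 4], [2, 2], [2.5, 2.5], [0.5, 0.5], [0, 2.5],
])
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1])

predicted, nearest = knn_predict((0.5, 1.0), points, labels, n=3)
print(predicted, nearest.tolist())  # 1 [13, 1, 2]
```

The three nearest points are (0.5, 0.5), (0, 1.5) and (1, 2), with classes 1, 0 and 1, so the majority vote assigns class 1.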


g) Re-using the functions from e) and f), you should be able to assign a class to any new point. In this exercise we will implement leave-one-out cross-validation in order to evaluate the performance of our model.

For each point in the dataset:

- consider that point as your test set and the rest of the data as your training set
- classify that point using the training set
- keep track of whether you were correct with the use of a counter

Once you've iterated through the entire dataset, divide the counter by the number of points in the dataset to report an overall testing accuracy.

In [None]:
def classify_point(example, training_attributes, training_classes, n=3):
    """
    Classify an example point based on the n closest points in the training set.

    Parameters:
        example (tuple): The point to classify.
        training_attributes (np.ndarray): Attributes of the training points.
        training_classes (np.ndarray): Classes of the training points.
        n (int): The number of closest points to consider.

    Returns:
        int: The predicted class for the example point.
    """
    # Find the n closest points and the radius to the example in the training set
    closest_points, _ = n_closest_to(example, training_attributes, n)
    
    # Identify the indices of the n closest points in the original dataset
    closest_indices = [np.where((training_attributes[:, 0] == point[0]) & 
                                (training_attributes[:, 1] == point[1]))[0][0] for point in closest_points]
    
    # Determine the majority class among the closest points
    predicted_class = majority(closest_indices, training_classes)
    
    return predicted_class

# Initialize the counter for correct classifications
count = 0

# Loop over each point in the dataset
for i in range(len(attribute_a_clean)):
    # Consider the current point as the test set and the rest as the training set
    test_point = (attribute_a_clean[i], attribute_b_clean[i])
    actual_class = classes_clean[i]
    
    # Exclude the current point from the training set
    training_attributes = np.delete(attributes_clean, i, axis=0)
    training_classes = np.delete(classes_clean, i)
    
    # Classify the current point using the rest of the data as the training set
    prediction = classify_point(test_point, training_attributes, training_classes, n=3)
    
    # Check if the prediction is correct
    if prediction == actual_class:
        count += 1

# Compute the overall accuracy
overall_accuracy = count / len(attribute_a_clean)

print(f"Overall accuracy = {overall_accuracy:.2f}")
# The overall accuracy is 0.73
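
As a sanity check on the hand-written loop, the same evaluation can be sketched with scikit-learn's `LeaveOneOut` and `cross_val_score`. This is an assumed cross-check, not part of the exercise, and scikit-learn may break distance ties differently from the loop above, so the two accuracies can differ slightly.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# The worksheet's dataset with the (10, 10) outlier removed
X = np.array([
    [3.5, 4], [0, 1.5], [1, 2], [2.5, 1], [2, 3.5],
    [1.5, 2.5], [2, 1], [3.5, 0], [1, 3], [3, 1.5],
    [2, 4], [2, 2], [2.5, 2.5], [0.5, 0.5], [0, 2.5],
])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1])

# Each fold holds out exactly one point, so each score is either 0 or 1
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=LeaveOneOut())
loo_accuracy = scores.mean()
print(f"Leave-one-out accuracy (scikit-learn): {loo_accuracy:.2f}")
```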


## Challenge Problem

For this question we will re-use the "mnist_784" dataset.

a) Begin by creating a training and testing dataset from our dataset, with an 80-20 ratio, and random_state=1. You can use the `train_test_split` function from sklearn. By holding out a portion of the dataset we can evaluate how our model generalizes to unseen data (i.e. data it did not learn from).

In [None]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Fetch the MNIST dataset
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)

# Split the dataset into training and testing sets with an 80-20 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")


b) For K ranging from 1 to 20:

1. train a KNN on the training data
2. record the training and testing accuracy

Plot a graph of the training and testing set accuracy as a function of the number of neighbors K (on the same plot). Which value of K is optimal? Briefly explain.

In [None]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Load the dataset
# as_frame=False returns NumPy arrays, consistent with the other cells
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)

# Split the dataset into an 80-20 train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Initialize lists to store accuracies
train_accuracies = []
test_accuracies = []

# Train and evaluate a KNN classifier for K from 1 to 20
for K in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=K)
    knn.fit(X_train, y_train)
    
    # Record training accuracy
    y_pred_train = knn.predict(X_train)
    train_accuracies.append(accuracy_score(y_train, y_pred_train))
    
    # Record testing accuracy
    y_pred_test = knn.predict(X_test)
    test_accuracies.append(accuracy_score(y_test, y_pred_test))

# Plotting the accuracies
plt.figure(figsize=(10, 6))
plt.plot(range(1, 21), train_accuracies, label='Training Accuracy')
plt.plot(range(1, 21), test_accuracies, label='Testing Accuracy')
plt.xlabel('Number of Neighbors K')
plt.ylabel('Accuracy')
plt.title('KNN Accuracy vs. Number of Neighbors K')
plt.legend()
plt.show()

""" This script trains a KNN classifier for each K value from 1 to 20 and plots the training and testing accuracies.
    The graph will help identify the optimal K value. The optimal K is usually the one that maximizes testing accuracy while keeping the model as simple as possible. 
    Look for the point where the testing accuracy begins to level off or decrease as K increases, indicating that adding more neighbors does not improve 
    the model's ability to generalize to unseen data. The choice of K is critical in balancing the bias-variance tradeoff inherent in the model's complexity.
"""

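Once the accuracy lists are filled, reading off the best K can be done in code. The sketch below uses made-up accuracy values purely for illustration (they are not results from MNIST), and the helper name `best_k` is an assumption. Ties are resolved toward the smaller K.

```python
def best_k(accuracies, first_k=1):
    """Return the K with the highest accuracy; ties go to the smallest K."""
    best_index = max(range(len(accuracies)), key=lambda i: accuracies[i])
    return first_k + best_index


# Illustrative values only, for K = 1..5 (not measured on MNIST)
example_test_accuracies = [0.90, 0.92, 0.95, 0.95, 0.93]

chosen_k = best_k(example_test_accuracies)
print(chosen_k)  # 3
```

Choosing K by test accuracy means the test set has influenced the model, so the reported test accuracy becomes slightly optimistic. A separate validation split or cross-validation on the training data avoids this.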

c) Using the best model from b), pick an image at random and plot it next to its K nearest neighbors

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Assumes test_accuracies from part b) is still in memory; otherwise set K_optimal by hand
K_optimal = int(np.argmax(test_accuracies)) + 1  # K with the highest testing accuracy in b)
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Train the best KNN model
knn_best = KNeighborsClassifier(n_neighbors=K_optimal)
knn_best.fit(X_train, y_train)

# Select a random image from the test set
random_idx = np.random.randint(X_test.shape[0])
random_image = X_test[random_idx].reshape(1, -1)

# Find the K nearest neighbors of the random test image
distances, indices = knn_best.kneighbors(random_image, n_neighbors=K_optimal)

# Plot the random test image and its K nearest neighbors
plt.figure(figsize=(2 * (K_optimal + 1), 2))
plt.subplot(1, K_optimal + 1, 1)
plt.imshow(random_image.reshape(28, 28), cmap='gray')
plt.title('Test Image')
plt.axis('off')

for i, neighbor_idx in enumerate(indices[0], start=2):
    plt.subplot(1, K_optimal + 1, i)
    plt.imshow(X_train[neighbor_idx].reshape(28, 28), cmap='gray')
    plt.title(f'Neighbor {i-1}')
    plt.axis('off')

plt.show()


d) Using a dimensionality reduction technique discussed in class, reduce the dimensionality of the dataset before applying a KNN model. Repeat b) and discuss similarities and differences to the previous model. Briefly discuss your choice of dimension and why you think the performance / accuracy of the model has changed.

In [None]:
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_openml
import matplotlib.pyplot as plt

# Load MNIST data
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Lists to store accuracies
train_accuracies = []
test_accuracies = []

# Loop over K values
for K in range(1, 21):
    # Define the pipeline
    pca = PCA(n_components=0.95)  # Retaining 95% of variance
    knn = KNeighborsClassifier(n_neighbors=K)
    model = make_pipeline(pca, knn)
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Record training and testing accuracies
    train_accuracies.append(model.score(X_train, y_train))
    test_accuracies.append(model.score(X_test, y_test))

# Plotting accuracies
plt.figure(figsize=(10, 6))
plt.plot(range(1, 21), train_accuracies, label='Training Accuracy')
plt.plot(range(1, 21), test_accuracies, label='Testing Accuracy')
plt.xlabel('Number of Neighbors K')
plt.ylabel('Accuracy')
plt.title('KNN Accuracy vs. Number of Neighbors K with PCA')
plt.legend()
plt.show()

"""
Comparing this model to the previous one without dimensionality reduction, you may observe:

Different Optimal K: Dimensionality reduction can change the relationship between points, potentially altering the optimal K value.
Improved Testing Accuracy: By reducing dimensions, PCA may help mitigate the curse of dimensionality, improving KNN's performance on the test set.
Faster Training Times: Reduced dimensions lead to faster computations.

The performance/accuracy changes because PCA tends to filter out noise and less informative features, allowing the KNN algorithm to focus on the most relevant aspects of the data.
The choice of dimension (e.g., components covering 95% of variance) is a balance between retaining essential information and removing noise or redundant information.
Depending on how well the reduced dimensions capture the essence of the data, you might see improved accuracy and computational efficiency.
"""

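To see what `PCA(n_components=0.95)` does, the sketch below applies it to synthetic data instead of MNIST. The data are an assumption built for the example: 50 observed features generated from 5 latent factors plus a little noise, so most of the variance lives in a handful of directions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (assumption): 500 samples, 50 features driven by 5 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 50))
X_synthetic = latent @ mixing + 0.05 * rng.normal(size=(500, 50))

# A float in (0, 1) asks PCA to keep the fewest components reaching that variance share
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_synthetic)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print(f"Components kept: {pca.n_components_} of {X_synthetic.shape[1]}")
print(f"Variance retained: {cumulative[-1]:.3f}")
```

On MNIST the same call keeps far fewer than 784 components, which is where the speed-up in the KNN distance computations comes from.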
## Midterm Prep (Part 1)

Compete in the Titanic Data Science Competition on Kaggle: https://www.kaggle.com/c/titanic 

Requirements:

1. Add at least 2 new features to the dataset (explain your reasoning below)
2. Use KNN (and only KNN) to predict survival
3. Explain your process below and choice of K
4. Make a submission to the competition and provide a link to your submission below.
5. Show your code below

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Load data
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# Feature engineering
train_df['FamilySize'] = train_df['SibSp'] + train_df['Parch'] + 1
train_df['IsAlone'] = (train_df['FamilySize'] == 1).astype(int)

test_df['FamilySize'] = test_df['SibSp'] + test_df['Parch'] + 1
test_df['IsAlone'] = (test_df['FamilySize'] == 1).astype(int)

# Preprocess data: This is a simplified example. You'd need to preprocess the data appropriately.
# For the purpose of this example, let's focus on numerical columns and ignore categorical columns like 'Embarked', 'Sex'.
X = train_df[['Pclass', 'Age', 'Fare', 'FamilySize', 'IsAlone']].fillna(0)
y = train_df['Survived']

# Splitting the training data for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# KNN model with pipeline for scaling features
pipe = Pipeline([
    ('scaler', StandardScaler()), 
    ('knn', KNeighborsClassifier())
])

param_grid = {'knn__n_neighbors': range(1, 20)}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print(f"Best params: {grid.best_params_}")
print(f"Best cross-validation score: {grid.best_score_}")

# Predict on test data
X_test = test_df[['Pclass', 'Age', 'Fare', 'FamilySize', 'IsAlone']].fillna(0)
predictions = grid.predict(X_test)

# Create submission file
submission = pd.DataFrame({'PassengerId': test_df['PassengerId'], 'Survived': predictions})
submission.to_csv('submission.csv', index=False)


Two features to add are:

FamilySize: This combines SibSp (number of siblings/spouses aboard) and Parch (number of parents/children aboard). The reasoning is that the survival chances might be different for passengers traveling alone versus those with family.

IsAlone: A binary feature indicating if the passenger is traveling alone. This is derived from FamilySize. The hypothesis is that those traveling alone might have a different survival rate than those with family.

---------------

Process:

1. Load the Dataset

First, we need to load the training and testing datasets to understand the data and identify opportunities for feature engineering.

2. Feature Engineering

Add two features: FamilySize, IsAlone

3. Preprocessing

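Fill missing values with 0, keep only the numerical columns (Pclass, Age, Fare, FamilySize, IsAlone), and standardize them with StandardScaler inside the pipeline, since KNN is distance-based. Categorical columns such as Sex and Embarked are left out of this simplified version.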

4. Choosing K

The optimal K is chosen with a grid search over K = 1 to 19, scored by 5-fold cross-validation on the training split.

5. Training the Model

Train the KNN model on the training data.

6. Making Predictions

Use the trained model to predict survival on the test dataset.

7. Submission

Create a submission file.
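
The code cell above leaves out categorical columns such as Sex and Embarked and fills missing ages with 0. One way the preprocessing could be extended is sketched below on a tiny made-up table, not on the Kaggle data. The helper name `preprocess`, the median imputation, and the one-hot encoding are assumptions, not the method used for the submission.

```python
import pandas as pd

# Made-up rows that only mimic the Titanic column names
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, None, 35.0, 41.0],
    "Embarked": ["S", "C", None, "S"],
})


def preprocess(df):
    """Encode Sex as 0/1, impute Age with the median, one-hot encode Embarked."""
    out = df.copy()
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    out["Age"] = out["Age"].fillna(out["Age"].median())
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])
    return pd.get_dummies(out, columns=["Embarked"])


processed = preprocess(toy)
print(processed)
```

For a real submission the median and mode would be computed on the training data and reused for the test data, so both sets are transformed the same way.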

-------
Link:
https://www.kaggle.com/competitions/titanic/submissions#