## Assignment 6: K-Nearest Neighbors
### DTSC 680: Applied Machine Learning

### Name: Haneefudin Rasheed

### Preliminaries

In [1]:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from collections import Counter
import zipfile
import os


# 1. Importing data

In [2]:



zip_file_path = 'mushrooms.zip'
extracted_folder_path = 'mushrooms_extracted'

# Unzipping the file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extracted_folder_path)

# Define the paths to the extracted .names and .data files
names_file_path = os.path.join(extracted_folder_path, 'mushrooms/agaricus-lepiota.names')
data_file_path = os.path.join(extracted_folder_path, 'mushrooms/agaricus-lepiota.data')

# Function to extract column names from the .names file
def extract_column_names(file_path):
    column_names = []
    with open(file_path, 'r') as file:
        lines = file.readlines()

    attributes_section_started = False
    for line in lines:
        if 'Attribute Information:' in line:
            attributes_section_started = True
            continue  # Skip the title line of the section
        if attributes_section_started:
            if ':' in line:  # Check for the presence of ':' to identify attribute lines
                attribute_name = line.split(':')[0].strip()
                # Handle numeric prefixes in attribute names, if any
                attribute_name = ''.join([i for i in attribute_name if not i.isdigit()]).strip()
                # Format attribute name
                attribute_name = attribute_name.replace('-', '_').replace(' ', '_')
                column_names.append(attribute_name)
            elif line.strip() == '':  # An empty line might indicate the end of the section
                break

    # Ensure the first column is correctly identified as 'class'
    if 'class' not in column_names:
        column_names.insert(0, 'class')
    
    return column_names

# Extracting column names
column_names = extract_column_names(names_file_path)

# Loading the dataset with the extracted column names
data = pd.read_csv(data_file_path, header=None, names=column_names, sep=',')

# Display the first few rows of the DataFrame
data.head()


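| | class | ._cap_shape | ._cap_surface | ._cap_color | ._bruises? | ._odor | ._gill_attachment | ._gill_spacing | ._gill_size | ._gill_color | ... | ._stalk_surface_below_ring | ._stalk_color_above_ring | ._stalk_color_below_ring | ._veil_type | ._veil_color | ._ring_number | ._ring_type | ._spore_print_color | ._population | ._habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |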


In [3]:
data.shape

(8124, 23)

## Exploring data

In [4]:

for column in data.columns:
    print(f"Column '{column}':")
    print(data[column].value_counts().head())
    print("\n")

Column 'class':
class
e    4208
p    3916
Name: count, dtype: int64


Column '._cap_shape':
._cap_shape
x    3656
f    3152
k     828
b     452
s      32
Name: count, dtype: int64


Column '._cap_surface':
._cap_surface
y    3244
s    2556
f    2320
g       4
Name: count, dtype: int64


Column '._cap_color':
._cap_color
n    2284
g    1840
e    1500
y    1072
w    1040
Name: count, dtype: int64


Column '._bruises?':
._bruises?
f    4748
t    3376
Name: count, dtype: int64


Column '._odor':
._odor
n    3528
f    2160
y     576
s     576
a     400
Name: count, dtype: int64


Column '._gill_attachment':
._gill_attachment
f    7914
a     210
Name: count, dtype: int64


Column '._gill_spacing':
._gill_spacing
c    6812
w    1312
Name: count, dtype: int64


Column '._gill_size':
._gill_size
b    5612
n    2512
Name: count, dtype: int64


Column '._gill_color':
._gill_color
b    1728
p    1492
w    1202
n    1048
g     752
Name: count, dtype: int64


Column '._stalk_shape':
._stalk_shape
t 

It seems that some missing values are represented by `?`. Replacing `?` with NaN standardizes the missing-value representation, so that pandas can detect and count the missing entries.
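As an alternative sketch, the same standardization could happen at load time, since `pd.read_csv` accepts an `na_values` argument. The tiny in-memory CSV and the column names below are made up for illustration; they are not the mushroom data.

```python
import io

import pandas as pd

# Hypothetical three-row sample, not the real dataset
sample_csv = io.StringIO("p,x,e\ne,b,?\ne,x,c\n")

# Treat '?' as missing while reading, so no separate replace() step is needed
sample = pd.read_csv(
    sample_csv,
    header=None,
    names=["class", "cap_shape", "stalk_root"],
    na_values="?",
)

missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

Only the hypothetical `stalk_root` column contains a `?`, so it is the only column with a non-zero missing count.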

In [5]:
import pandas as pd
import numpy as np


# Replace '?' with NaN
data.replace('?', np.nan, inplace=True)


In [6]:
data.isnull().sum()

class                            0
._cap_shape                      0
._cap_surface                    0
._cap_color                      0
._bruises?                       0
._odor                           0
._gill_attachment                0
._gill_spacing                   0
._gill_size                      0
._gill_color                     0
._stalk_shape                    0
._stalk_root                  2480
._stalk_surface_above_ring       0
._stalk_surface_below_ring       0
._stalk_color_above_ring         0
._stalk_color_below_ring         0
._veil_type                      0
._veil_color                     0
._ring_number                    0
._ring_type                      0
._spore_print_color              0
._population                     0
._habitat                        0
dtype: int64

# Imputing missing values


In [7]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
import numpy as np



# Identify feature and response columns
features = data.drop(columns=['._stalk_root'])  # Exclude target column for imputation
response = data['._stalk_root'].fillna('missing')  # Fill missing values temporarily

# Encode features (one-hot encoding)
onehot_encoder = OneHotEncoder()
features_encoded = onehot_encoder.fit_transform(features).toarray()

# Encode response (label encoding)
label_encoder = LabelEncoder()
response_encoded = label_encoder.fit_transform(response)

# Isolate data with and without missing '_stalk_root' values
is_missing = data['._stalk_root'].isnull()
if is_missing.sum() > 0:  # Proceed only if there are missing values
    features_with_missing = features_encoded[is_missing]
    features_without_missing = features_encoded[~is_missing]
    response_without_missing = response_encoded[~is_missing]

    # Initialize and train KNeighborsClassifier
    knn = KNeighborsClassifier()
    knn.fit(features_without_missing, response_without_missing)

    # Predict missing '_stalk_root' values
    predicted_missing_encoded = knn.predict(features_with_missing)

    # Decode predictions to original labels
    predicted_missing = label_encoder.inverse_transform(predicted_missing_encoded)

    # Impute missing values back into the original dataframe
    data.loc[is_missing, '._stalk_root'] = predicted_missing

# Ensure no missing values are left
assert data['._stalk_root'].isnull().sum() == 0, "There are still missing values in '._stalk_root'."

# Optional: Print summary of imputation
if 'predicted_missing' in locals():  # Check if imputation was performed
    # Create the missing_values list with imputed values in their original form
    missing_values = data.loc[is_missing, '._stalk_root'].tolist()

    # Count unique values and their occurrences
    unique_values, counts = np.unique(missing_values, return_counts=True)
    missing_values_count = dict(zip(unique_values, counts))

    # Print the unique imputed values and their counts
    print("Imputed values and their counts:", missing_values_count)
else:
    print("No missing values were found to impute.")


Imputed values and their counts: {'b': 1611, 'c': 107, 'e': 762}


## Concept Question #1
Why don’t we one-hot encode the response data to train the KNN model instead?

KNN finds the k nearest neighbors of a sample using a distance computed on the features (for example Euclidean or Manhattan distance) and then assigns the class that wins a majority vote among those neighbors. The vote needs each neighbor to carry exactly one class label, which is what a label-encoded, one-dimensional response provides. If we one-hot encoded the response, each category would become its own binary column, and scikit-learn would treat the problem as a multi-output task with a separate vote per column. The concept of a single "majority vote" then becomes unclear: a prediction could have no column set to 1, or more than one.

Also, one-hot encoding the response expands the target from one column to one column per category without adding any information, which makes training and prediction more computationally demanding and harder to interpret. The mutual exclusivity of the categories, which is fundamental to the response variable, is no longer enforced in this expanded format, so the model may treat the binary columns as independent targets rather than as different states of a single variable.
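A minimal sketch of the difference, using a made-up three-sample dataset (the arrays and the query point are assumptions for illustration, not the mushroom data). With a label-encoded response, KNN returns one class index; with a one-hot response, it votes on each column separately.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy data: three samples, one per class
X_toy = np.array([[0.0], [1.0], [2.0]])
labels = np.array(["b", "c", "e"])
query = np.array([[1.0]])

# Label-encoded response: 1-D array, one integer per sample
y_label = LabelEncoder().fit_transform(labels)
knn_label = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_label)
pred_label = knn_label.predict(query)

# One-hot response: 2-D array, handled as three separate binary targets
y_onehot = OneHotEncoder().fit_transform(labels.reshape(-1, 1)).toarray().astype(int)
knn_onehot = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_onehot)
pred_onehot = knn_onehot.predict(query)

print("label-encoded prediction:", pred_label)   # a single class index
print("one-hot prediction:", pred_onehot)        # one vote per column
```

In this toy case each one-hot column receives only one "1" vote out of three neighbors, so every column is predicted as 0 and the output row does not correspond to any class.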




# Train a RandomForestClassifier as well as a LogisticRegression model to predict whether a mushroom is edible or poisonous

## Preparing the data

In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder



# Separate features and the target variable
features = data.drop('class', axis=1)  # Dropping the target column to isolate features
target = data['class']  # The target column indicating whether a mushroom is edible or poisonous

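# One-hot encode the feature data; .toarray() gives a dense array and works
# across scikit-learn versions (the `sparse` argument was removed in 1.4)
onehot_encoder = OneHotEncoder()
features_encoded = onehot_encoder.fit_transform(features).toarray()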

# Label encode the target data
label_encoder = LabelEncoder()
target_encoded = label_encoder.fit_transform(target)

# Split the encoded data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(features_encoded, target_encoded, test_size=0.2, random_state=42)






## Checking data prepared for modeling

In [9]:
# Check the shapes of the datasets
print("Shapes of the datasets:")
print(f"X_train: {X_train.shape}")
print(f"X_test: {X_test.shape}")
print(f"y_train: {y_train.shape}")
print(f"y_test: {y_test.shape}\n")

# Check the data types
print("Data types:")
print(f"X_train type: {X_train.dtype}")
print(f"y_train type: {y_train.dtype}\n")

# Check the balance of the target classes
print("Target class distribution in y_train:")
print(pd.Series(y_train).value_counts(normalize=True))
print("\nTarget class distribution in y_test:")
print(pd.Series(y_test).value_counts(normalize=True))


sparsity = lambda x: 1.0 - (np.count_nonzero(x) / float(x.size))
print(f"\nSparsity in X_train: {sparsity(X_train):.2f}")
print(f"Sparsity in X_test: {sparsity(X_test):.2f}")


Shapes of the datasets:
X_train: (6499, 116)
X_test: (1625, 116)
y_train: (6499,)
y_test: (1625,)

Data types:
X_train type: float64
y_train type: int32

Target class distribution in y_train:
0    0.517772
1    0.482228
Name: proportion, dtype: float64

Target class distribution in y_test:
0    0.518769
1    0.481231
Name: proportion, dtype: float64

Sparsity in X_train: 0.81
Sparsity in X_test: 0.81


# Training the LogisticRegression Model

In [10]:
%%time
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Initialize the LogisticRegression model
logistic_model = LogisticRegression(max_iter=1000)

# Train the model
logistic_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_logistic = logistic_model.predict(X_test)

# Evaluate the model
accuracy_logistic = accuracy_score(y_test, y_pred_logistic)
print(f"Accuracy of LogisticRegression model: {accuracy_logistic:.4f}")


Accuracy of LogisticRegression model: 0.9994
CPU times: total: 250 ms
Wall time: 301 ms


## Training the RandomForestClassifier Model

In [11]:
%%time
from sklearn.ensemble import RandomForestClassifier

# Initialize the RandomForestClassifier model
random_forest_model = RandomForestClassifier(n_estimators=100)

# Train the model
random_forest_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_rf = random_forest_model.predict(X_test)

# Evaluate the model
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Accuracy of RandomForestClassifier model: {accuracy_rf:.4f}")


Accuracy of RandomForestClassifier model: 1.0000
CPU times: total: 1 s
Wall time: 1.11 s


# Compute the accuracy, precision, and recall scores for a test set

## Evaluating the LogisticRegression Model

In [12]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Predictions were made previously as y_pred_logistic
# compute the metrics for the LogisticRegression model
accuracy_logistic = accuracy_score(y_test, y_pred_logistic)
precision_logistic = precision_score(y_test, y_pred_logistic)
recall_logistic = recall_score(y_test, y_pred_logistic)

print("LogisticRegression Model Evaluation:")
print(f"Accuracy: {accuracy_logistic:.4f}")
print(f"Precision: {precision_logistic:.4f}")
print(f"Recall: {recall_logistic:.4f}\n")


LogisticRegression Model Evaluation:
Accuracy: 0.9994
Precision: 0.9987
Recall: 1.0000



## Evaluating the RandomForestClassifier Model

In [13]:
# Predictions were made previously as y_pred_rf
# compute the metrics for the RandomForestClassifier model
accuracy_rf = accuracy_score(y_test, y_pred_rf)
precision_rf = precision_score(y_test, y_pred_rf)
recall_rf = recall_score(y_test, y_pred_rf)

print("RandomForestClassifier Model Evaluation:")
print(f"Accuracy: {accuracy_rf:.4f}")
print(f"Precision: {precision_rf:.4f}")
print(f"Recall: {recall_rf:.4f}")


RandomForestClassifier Model Evaluation:
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000


### Model performance discussion

The performance metrics for the Logistic Regression and Random Forest Classifier models show extremely high, if not perfect, accuracy, precision, and recall on the test set. At first glance both models are performing exceptionally well, but scores this high also raise the possibility of overfitting, especially in the case of the RandomForestClassifier, which has perfect scores across all evaluated metrics. A single held-out test score cannot settle this on its own; comparing training, cross-validated, and test scores would help.
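One way to probe the overfitting concern is to compare cross-validated, training, and test accuracy. The sketch below uses synthetic stand-in data from `make_classification` (an assumption, so that the block runs on its own); in the notebook, `X_train`, `y_train`, `X_test`, and `y_test` would take its place.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (assumption: not the mushroom dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=42)

# 5-fold cross-validation on the training split only
cv_scores = cross_val_score(model, X_tr, y_tr, cv=5, scoring="accuracy")

# Fit once on the full training split and score both splits
model.fit(X_tr, y_tr)
train_acc = model.score(X_tr, y_tr)
test_acc = model.score(X_te, y_te)

print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
print(f"Train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

A large gap between training accuracy and the cross-validated or test accuracy would point to overfitting; similar scores across all three would suggest the model generalizes.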

## Concept Question #2: Could we train these two models by one-hot encoding the response data instead, being careful to specify that the drop parameter of the OneHotEncoder class is set to ‘first’? Why or why not?

Whether to one-hot encode a binary target, such as whether a mushroom is edible or poisonous, comes down to dimensions, complexity, and computing time.

Technically we could. One-hot encoding the two classes produces a two-column matrix, and `drop='first'` removes one of those columns, leaving a single 0/1 column that carries the same information as the label-encoded target. However, this introduces unnecessary complexity. In binary classification the models predict the probability that a sample belongs to a single class, so expanding the target and then reducing it again complicates data preparation without any gain in performance or interpretability. The result is also a two-dimensional array of shape (n_samples, 1), which should be flattened to the one-dimensional shape that scikit-learn classifiers expect.

The temporary increase in dimensions also adds a little to the computational workload, since the pipeline has to handle more data transformations. For binary classification, label encoding achieves the desired format directly: it assigns a unique integer to each class without changing the dimensionality of the target data.

In short, one-hot encoding is essential for categorical features without a natural order, but applying it to a binary response variable and then dropping a column is not necessary. It complicates the preprocessing step and could slightly increase computing time, without offering benefits in model performance or clarity.
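The comparison between the two encodings can be sketched on a made-up five-element target (the array below is hypothetical and only reuses the dataset's `e`/`p` class codes).

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical binary target (e = edible, p = poisonous)
target_demo = np.array(["p", "e", "e", "p", "e"])

# Label encoding: 1-D integer array
y_label = LabelEncoder().fit_transform(target_demo)

# One-hot encoding with drop='first': a single remaining column, but 2-D
encoder = OneHotEncoder(drop="first")
y_onehot = encoder.fit_transform(target_demo.reshape(-1, 1)).toarray()

print(y_label.shape)    # (5,)
print(y_onehot.shape)   # (5, 1)

# Flattening the single remaining column recovers the label-encoded target
y_flat = y_onehot.ravel().astype(int)
print(np.array_equal(y_flat, y_label))
```

Both encoders sort the categories, so `e` maps to 0 and `p` to 1 in either case; the one-hot route only adds a reshape and a flatten.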

### Model performance discussion

The RandomForestClassifier model achieves perfect scores in accuracy, precision, and recall, making it the better model on these metrics. However, the LogisticRegression model's performance is also outstanding: its recall is perfect and its precision is only marginally lower (0.9987).

In practical terms, with poisonous encoded as the positive class (1), the high precision of both models means that they are very reliable when they predict a mushroom is poisonous: false positives (edible mushrooms incorrectly labeled as poisonous) are minimal. The perfect recall score, which is particularly important in this context, means that there are no false negatives (poisonous mushrooms incorrectly labeled as edible), which could be dangerous.

The high performance suggests that the dataset's features strongly predict whether a mushroom is edible or poisonous, allowing both models to learn effective decision boundaries.

To summarize, both models perform exceptionally well on this dataset, with RandomForestClassifier slightly outperforming LogisticRegression in precision.
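As a worked check, the reported Logistic Regression metrics are consistent with a single misclassified test sample. The counts below are inferred from the reported accuracy, precision, and recall together with the test-set size of 1625; they are an assumption, not an actual confusion matrix from the notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Counts inferred from the reported metrics (assumption); positive class = 1 (poisonous)
tp, fp, fn, tn = 782, 1, 0, 842

# Build label arrays that reproduce those counts
y_true_demo = np.array([1] * tp + [0] * fp + [1] * fn + [0] * tn)
y_pred_demo = np.array([1] * tp + [1] * fp + [0] * fn + [0] * tn)

precision_manual = tp / (tp + fp)   # share of "poisonous" predictions that are correct
recall_manual = tp / (tp + fn)      # share of poisonous mushrooms that are caught
accuracy_manual = (tp + tn) / (tp + fp + fn + tn)

print(f"Accuracy:  {accuracy_manual:.6f}")
print(f"Precision: {precision_manual:.6f}")
print(f"Recall:    {recall_manual:.6f}")

# The scikit-learn functions give the same values on the reconstructed arrays
print(accuracy_score(y_true_demo, y_pred_demo),
      precision_score(y_true_demo, y_pred_demo),
      recall_score(y_true_demo, y_pred_demo))
```

Under these assumed counts the single error is a false positive, that is, an edible mushroom flagged as poisonous, which is the safer kind of mistake in this setting.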

# Performing dimensionality reduction using PCA

In [14]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler



# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize PCA to keep 95% of the variance
pca = PCA(n_components=0.95)

# Fit PCA on the training data and transform both training and test data
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Check the new shape of the datasets and the number of components
print(f"New shape of X_train: {X_train_pca.shape}")
print(f"New shape of X_test: {X_test_pca.shape}")
print(f"Number of components retained to preserve 95% of variance: {pca.n_components_}")


New shape of X_train: (6499, 60)
New shape of X_test: (1625, 60)
Number of components retained to preserve 95% of variance: 60


## Results of PCA

In [15]:

initial_dimensions = X_train.shape[1]  

# Number of dimensions/features after PCA dimensionality reduction
reduced_dimensions = X_train_pca.shape[1]  

# Calculate the percentage reduction in dimensions
percentage_reduction = ((initial_dimensions - reduced_dimensions) / initial_dimensions) * 100

# Print the results
print(f"Initial number of features: {initial_dimensions}")
print(f"Number of features after PCA dimensionality reduction: {reduced_dimensions}")
print(f"Percentage reduction in dimensions: {percentage_reduction:.2f}%")


Initial number of features: 116
Number of features after PCA dimensionality reduction: 60
Percentage reduction in dimensions: 48.28%
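The `n_components=0.95` setting used above can be read as "keep the smallest number of components whose cumulative explained variance ratio reaches 95%". The sketch below illustrates that reading on synthetic correlated data (an assumption, used so the block runs on its own), by comparing a manual count against what `PCA` selects.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated data (assumption: stand-in for the encoded features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 10)) @ rng.normal(size=(10, 10))
X_demo = StandardScaler().fit_transform(X_demo)

# Fit all components and accumulate the explained variance ratio
pca_full = PCA(svd_solver="full").fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_needed = int(np.argmax(cumulative >= 0.95) + 1)

# A float n_components asks PCA to make the same choice internally
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_demo)

print(n_needed, pca_95.n_components_)
```

Plotting `cumulative` against the component index is a common way to see how quickly the variance accumulates and whether 95% is a sensible threshold.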


# Training new models on the reduced dataset

## Training the LogisticRegression Model on the Reduced Dataset

In [16]:
%%time
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Initialize the LogisticRegression model
logistic_model_reduced = LogisticRegression(max_iter=1000, random_state=42)

# Train the model on the reduced dataset
logistic_model_reduced.fit(X_train_pca, y_train)

# Make predictions on the test set and evaluate the model
y_pred_logistic_reduced = logistic_model_reduced.predict(X_test_pca)
accuracy_logistic_reduced = accuracy_score(y_test, y_pred_logistic_reduced)
print(f"Accuracy of LogisticRegression model on reduced dataset: {accuracy_logistic_reduced:.4f}")


Accuracy of LogisticRegression model on reduced dataset: 1.0000
CPU times: total: 266 ms
Wall time: 169 ms


## Training the RandomForestClassifier Model on the Reduced Dataset

In [17]:
%%time
from sklearn.ensemble import RandomForestClassifier

# Initialize the RandomForestClassifier model with a random state
random_forest_model_reduced = RandomForestClassifier(random_state=42)

# Train the model on the reduced dataset
random_forest_model_reduced.fit(X_train_pca, y_train)

#  Make predictions on the test set and evaluate the model
y_pred_rf_reduced = random_forest_model_reduced.predict(X_test_pca)
accuracy_rf_reduced = accuracy_score(y_test, y_pred_rf_reduced)
print(f"Accuracy of RandomForestClassifier model on reduced dataset: {accuracy_rf_reduced:.4f}")


Accuracy of RandomForestClassifier model on reduced dataset: 1.0000
CPU times: total: 7.91 s
Wall time: 8.37 s


# Computing the accuracy, precision, and recall scores for the model trained on the reduced data set

In [18]:
from sklearn.metrics import accuracy_score, precision_score, recall_score


# Calculate precision and recall for the Logistic Regression model on the reduced dataset
precision_logistic_reduced = precision_score(y_test, y_pred_logistic_reduced)
recall_logistic_reduced = recall_score(y_test, y_pred_logistic_reduced)

# Calculate precision and recall for the Random Forest model on the reduced dataset
precision_rf_reduced = precision_score(y_test, y_pred_rf_reduced)
recall_rf_reduced = recall_score(y_test, y_pred_rf_reduced)

# Compile all metrics into a DataFrame
metrics_df = pd.DataFrame({
    'Model': [
        'Logistic Regression', 'Logistic Regression',
        'Random Forest', 'Random Forest'
    ],
    'Dataset': [
        'Full Data', 'PCA Reduced',
        'Full Data', 'PCA Reduced'
    ],
    'Accuracy': [
        accuracy_logistic, accuracy_logistic_reduced,
        accuracy_rf, accuracy_rf_reduced
    ],
    'Precision': [
        precision_logistic, precision_logistic_reduced,
        precision_rf, precision_rf_reduced
    ],
    'Recall': [
        recall_logistic, recall_logistic_reduced,
        recall_rf, recall_rf_reduced
    ],
    'Time (s)': [
        # Hard-coded by hand from a separate run; they differ from the %%time output above
        0.450, 0.201,  # Times for Logistic Regression on full and reduced data
        1.75, 11.2     # Times for Random Forest on full and reduced data
    ]
})

# Display the DataFrame
metrics_df


| | Model | Dataset | Accuracy | Precision | Recall | Time (s) |
|---|---|---|---|---|---|---|
| 0 | Logistic Regression | Full Data | 0.999385 | 0.998723 | 1.0 | 0.45 |
| 1 | Logistic Regression | PCA Reduced | 1.0 | 1.0 | 1.0 | 0.201 |
| 2 | Random Forest | Full Data | 1.0 | 1.0 | 1.0 | 1.75 |
| 3 | Random Forest | PCA Reduced | 1.0 | 1.0 | 1.0 | 11.2 |


## Preparing tabular output for model comparison

In [19]:
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score



# Reuse the metrics computed in the earlier cells; y_test and the predictions are left untouched



 
metrics_data = {
    ('Logistic Regression', 'Accuracy'): [accuracy_logistic, accuracy_logistic_reduced],
    ('Logistic Regression', 'Precision'): [precision_logistic, precision_logistic_reduced],
    ('Logistic Regression', 'Recall'): [recall_logistic, recall_logistic_reduced],
    ('Logistic Regression', 'Time'): [0.450, 0.201],
    ('Random Forest', 'Accuracy'): [accuracy_rf, accuracy_rf_reduced],
    ('Random Forest', 'Precision'): [precision_rf, precision_rf_reduced],
    ('Random Forest', 'Recall'): [recall_rf, recall_rf_reduced],
    ('Random Forest', 'Time'): [1.750, 11.200]
}

# The metrics DataFrame with multi-index
multi_index = pd.MultiIndex.from_tuples(metrics_data.keys())
metrics_df = pd.DataFrame(list(metrics_data.values()), index=multi_index, columns=['Full Data', 'PCA Reduced'])

# Display the DataFrame
metrics_df


| | | Full Data | PCA Reduced |
|---|---|---|---|
| Logistic Regression | Accuracy | 0.999385 | 1.0 |
| Logistic Regression | Precision | 0.998723 | 1.0 |
| Logistic Regression | Recall | 1.0 | 1.0 |
| Logistic Regression | Time | 0.45 | 0.201 |
| Random Forest | Accuracy | 1.0 | 1.0 |
| Random Forest | Precision | 1.0 | 1.0 |
| Random Forest | Recall | 1.0 | 1.0 |
| Random Forest | Time | 1.75 | 11.2 |


## Model comparison discussion


Looking at the training times and performance metrics of Logistic Regression and Random Forest on both the full and PCA-reduced datasets, it is interesting to see how they vary. For Logistic Regression, the drop in training time on the PCA-reduced dataset is quite noticeable, which makes sense because reducing the number of features should speed things up. For Random Forest, however, the increase in training time on the PCA-reduced data is puzzling; it is not what I expected, given that fewer features typically lead to quicker training. One plausible explanation is that the one-hot features are binary, so each offers only one possible split point, whereas the PCA components are continuous and each tree has to evaluate many candidate thresholds per feature.

Performance-wise, both models score perfectly or nearly perfectly on precision and recall. This initially seems great, but such scores could also signal overfitting, especially if the test data are very similar to the training set. It is also possible that the data are simply well structured, allowing both models to perform very well. The slight improvement in Logistic Regression's accuracy on the reduced dataset could be meaningful, although it corresponds to a single test sample; it is the vastly different training times that really catch my attention and make me curious about the underlying reasons.

Finally, comparing the full and reduced models shows that dimensionality reduction through PCA has nuanced effects. For Logistic Regression, moving to the reduced dataset not only cuts training time but also nudges accuracy slightly upward, hinting that the model works efficiently with fewer dimensions. For Random Forest, the performance metrics are unchanged, which suggests robustness to the change in dimensionality, but at the cost of increased training time. The relationship between model complexity and data dimensionality is worth exploring further.
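Since the comparison leans heavily on training times, measuring them in code avoids copying numbers by hand. The sketch below uses a hypothetical helper, `timed_fit`, and synthetic stand-in data (both are assumptions, not part of the original notebook); the same helper could be applied to `X_train` and `X_train_pca`.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression


def timed_fit(model, X, y):
    """Hypothetical helper: fit the model and return (fitted_model, elapsed_seconds)."""
    start = time.perf_counter()
    model.fit(X, y)
    return model, time.perf_counter() - start


# Synthetic stand-in data (assumption: not the mushroom dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=42)

fit_times = {}
for name, model in [
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
    ("Random Forest", RandomForestClassifier(n_estimators=50, random_state=42)),
]:
    _, fit_times[name] = timed_fit(model, X_demo, y_demo)

for name, seconds in fit_times.items():
    print(f"{name}: {seconds:.3f} s")
```

Wall-clock timings vary from run to run, so repeating each fit a few times and reporting the median would give a steadier comparison than a single measurement.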