# Medical Appointment Attendance Prediction with KNN


Create K-Nearest Neighbors (KNN) models for the following dataset:
- Download the following dataset (from Kaggle), which contains 110,527 medical appointments and 14 associated variables (characteristics). Use it to create models that predict whether a patient will show up for a booked appointment.

- Here is the [link](https://www.kaggle.com/datasets/joniarroba/noshowappointments) to the dataset.


## Step 1: Import all necessary libraries

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## Step 2: Load the Dataset

- Read the `KaggleV2-May-2016.csv` file into pandas and assign it to the variable name `medicalAppointmentNoShows`
- Use the `DataFrame.info()` and `DataFrame.head()` methods to print information about the `medicalAppointmentNoShows` dataframe as well as the first few rows

In [None]:
medicalAppointmentNoShows = pd.read_csv("KaggleV2-May-2016.csv")
medicalAppointmentNoShows.info()
medicalAppointmentNoShows.head()

## Step 2 (continued): Preprocess the Dataset

Drop the columns that are not used as features (the identifiers, the two date columns, and the neighbourhood name), then drop any rows with missing values.

In [None]:
# Specify the list of unnecessary columns to drop
unnecessary_columns = ['PatientId', 'AppointmentID', 'ScheduledDay', 'AppointmentDay', 'Neighbourhood']

# Drop the unnecessary columns from the DataFrame
medicalAppointmentNoShows = medicalAppointmentNoShows.drop(unnecessary_columns, axis=1)

# Drop rows with missing values
medicalAppointmentNoShows = medicalAppointmentNoShows.dropna()
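
To see what these two calls do, here is a minimal sketch on a toy DataFrame. The values are made up for illustration; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset
toy = pd.DataFrame({
    "PatientId": [1, 2, 3],
    "Age": [34, None, 51],
    "Gender": ["F", "M", "F"],
})

# axis=1 tells drop() to remove columns rather than rows
toy = toy.drop(["PatientId"], axis=1)

# dropna() removes every row that contains at least one missing value
toy = toy.dropna()

print(toy)
```

The row with the missing `Age` is removed, so two rows and two columns remain.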


In [None]:
medicalAppointmentNoShows.head()

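### Encode the categorical variables
The categorical target variable `No-show` and the categorical feature `Gender` are encoded with `LabelEncoder` from the `sklearn.preprocessing` module, so both are represented as numeric values (`0` and `1`) instead of string labels. `LabelEncoder` assigns codes in sorted label order, so `No` becomes `0` and `Yes` becomes `1`. A `1` in `No-show` therefore means the patient missed the appointment.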

In [None]:
# Use a separate encoder per column so each keeps its own label mapping
target_encoder = LabelEncoder()
gender_encoder = LabelEncoder()
medicalAppointmentNoShows['No-show'] = target_encoder.fit_transform(medicalAppointmentNoShows['No-show'])
medicalAppointmentNoShows['Gender'] = gender_encoder.fit_transform(medicalAppointmentNoShows['Gender'])
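
To check which string maps to which number, you can inspect the encoder's `classes_` attribute. The sketch below uses a short hypothetical list of labels rather than the real column.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of raw target labels
raw_labels = ["No", "Yes", "No", "No", "Yes"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(raw_labels)

# classes_ holds the original labels in sorted order;
# a label's position in classes_ is its numeric code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'No': 0, 'Yes': 1}

# inverse_transform recovers the original strings from the codes
decoded = list(encoder.inverse_transform(encoded))
```

Keeping the fitted encoder around makes it possible to translate predictions back into `No` / `Yes` later.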

## Step 3: Split the Dataset
Split the dataset into features (`X`) and the target variable (`y`), then split both into training and test sets with `train_test_split`.

Random Seed `(random_state)`:

- The random seed is used to ensure reproducibility. Setting a specific random seed will result in the same train-test split every time you run the code.
- If you want consistent results, you can set the random seed to a fixed value, such as 42.
- If you don't require consistent results and want a different train-test split each time you run the code, you can omit setting the random seed.

Test Size `(test_size)`:

- The test size determines the proportion of the dataset allocated for testing. It is typically specified as a float between 0 and 1, representing the fraction of the dataset used for testing.
- The choice of test size depends on the size of your dataset and the desired balance between the training and testing sets.
- A common practice is to use a test size of around 0.2 to 0.3, meaning 20% to 30% of the data will be used for testing. This leaves the majority of the data for training the model.
- If you have a large dataset, you can afford to allocate a smaller test size. Conversely, if you have a small dataset, you might want to allocate a larger test size to ensure a representative evaluation.
- It's important to strike a balance between having enough data for training and having enough data for testing to obtain reliable performance metrics.


In [None]:
X = medicalAppointmentNoShows.drop('No-show', axis=1)
y = medicalAppointmentNoShows['No-show']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
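
The effect of both parameters can be checked on a small synthetic array. This is a sketch with made-up data (`X_demo` and `y_demo` are hypothetical names, not part of the notebook).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 2 features, binary target
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0, 1] * 50)

# Two calls with the same random_state produce the identical split
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

print(len(a_train), len(a_test))       # 70 30
print(np.array_equal(a_test, b_test))  # True
```

With `test_size=0.3`, 30 of the 100 rows go to the test set, and repeating the call with the same seed returns the same rows.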

## Step 4: Train the K-Nearest Neighbors Classifier

Create an instance of the KNeighborsClassifier class:
- `n_neighbors` is a hyperparameter that determines the number of neighbors to consider for classification. You can choose an appropriate value for `k` based on your problem and dataset. Higher values of `k` smooth out the decision boundaries, while lower values make the model more sensitive to individual data points. Our `k` is `5`.

Train the KNN classifier using the `fit()` method:

In [None]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
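
One common way to pick `k` is to train one model per candidate value and compare their scores on held-out data. The sketch below does this on synthetic data from `make_classification`. The candidate values `(1, 5, 15)` and all `*_demo` names are assumptions made for illustration, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the appointments dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Fit one classifier per candidate k and record its validation accuracy
scores = {}
for k in (1, 5, 15):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_tr, y_tr)
    scores[k] = model.score(X_val, y_val)

# Keep the k with the highest validation accuracy
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

For a fair final estimate, the data used to choose `k` should be kept separate from the test set used to report performance.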

## Step 5: Evaluate the K-Nearest Neighbors Classifier

In [None]:
# Make predictions on the test set
y_pred = knn.predict(X_test)

# Print the first few predictions
print("Predictions:")
print(y_pred[:10])

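# Evaluate the model with the metrics imported in Step 1
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print()

# Print evaluation metrics (class 1 means the patient missed the appointment)
print("Final test set accuracy:", accuracy)
print("Final test set precision:", precision)
print("Final test set recall:", recall)
print("Final test set F1 score:", f1)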
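
Accuracy alone can look good when one class dominates, which is why Step 1 also imports `precision_score`, `recall_score`, and `f1_score`. The toy example below uses hypothetical labels to show how the four metrics can differ on the same predictions.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels: 1 = missed the appointment, 0 = attended
y_true_demo = [0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1]

acc_demo = accuracy_score(y_true_demo, y_pred_demo)    # 5 of 6 correct
prec_demo = precision_score(y_true_demo, y_pred_demo)  # TP / (TP + FP) = 1 / 1
rec_demo = recall_score(y_true_demo, y_pred_demo)      # TP / (TP + FN) = 1 / 2
f1_demo = f1_score(y_true_demo, y_pred_demo)           # harmonic mean of the two

print(acc_demo, prec_demo, rec_demo, f1_demo)
```

Here the classifier is right five times out of six, yet it finds only half of the missed appointments, which the recall value makes visible.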


## Step 6: Visualize the Results

This code creates a `bar chart` with two bars: one for the model's accuracy and one for the predicted class of the first instance in the test set. The `accuracy` bar is `blue`, and the `prediction` bar is `green`. Because the prediction is a class label (`0` or `1`), the green bar has height 0 whenever the model predicts that the patient shows up. The chart gives a quick view of the model's accuracy alongside its prediction for one specific test instance.

In [None]:
# Plot the accuracy and predictions
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(['Accuracy'], [accuracy], color='blue')
ax.bar(['Prediction'], [y_pred[0]], color='green')
ax.set_ylabel('Value')
ax.set_title('Model Accuracy and First Prediction')
ax.legend(['Accuracy', 'Prediction'])

plt.show()