# KNN with scikit-learn 


This lesson uses scikit-learn's implementation of a KNN classifier on the classic Titanic dataset from Kaggle.

## Objectives


* Use KNN to make classification predictions on a real-world dataset
* Perform a parameter search for 'k' to optimize model performance
* Evaluate model performance and interpret results


Start by importing the dataset, stored in the `titanic.csv` file, and cleaning it.


In [12]:
import pandas as pd
import numpy as np
df = pd.read_csv('titanic.csv')

# Preprocessing the Data
- Remove unnecessary columns (PassengerId, Name, Ticket, and Cabin).
- Convert Sex to a binary encoding, where female is 0 and male is 1.
- Detect and deal with any null values in the dataset.
- For Age, replace null values with the median age for the dataset.
- For Embarked, drop the rows that contain null values.
- One-Hot Encode categorical columns such as Embarked.
- Store the target column, Survived, in a separate variable and remove it from the DataFrame.

In [13]:
df = df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis= 1)

In [14]:
df.Sex = (df.Sex =='male').astype(int)

In [15]:
df.Age = df.Age.fillna(df.Age.median())

In [16]:
df = df.dropna()

In [17]:
from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder(handle_unknown='ignore')
df_Embarked_ohe = onehotencoder.fit_transform(np.array([df['Embarked'].values]).T)


In [18]:
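# Reuse df's index and name the encoded columns so rows stay aligned with their passengers
ohe_cols = onehotencoder.get_feature_names_out(['Embarked'])
df_ohe = pd.concat([df[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']],
                    pd.DataFrame(df_Embarked_ohe.toarray(), index=df.index, columns=ohe_cols)], axis=1)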

In [19]:
df_ohe.dropna(inplace = True)
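The following is a minimal sketch of why the row index matters when concatenating an encoded array back onto a DataFrame that has already had rows dropped. The `toy` frame and its values are made up for illustration, and `pd.get_dummies` stands in for the encoder output; it is not the notebook's data.

```python
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up).
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 8.05, 53.10],
    "Embarked": ["S", None, "Q", "S"],
})
toy = toy.dropna()  # keeps index labels 0, 2, 3

# Stand-in for the dense one-hot array: one row per remaining passenger.
encoded = pd.get_dummies(toy["Embarked"]).to_numpy()

# A fresh DataFrame gets a default index of 0, 1, 2, which does not match
# toy's labels 0, 2, 3, so concat pads the gaps with NaN and shifts rows.
misaligned = pd.concat([toy[["Fare"]], pd.DataFrame(encoded)], axis=1)

# Reusing toy.index keeps every encoded row next to its own passenger.
aligned = pd.concat(
    [toy[["Fare"]], pd.DataFrame(encoded, index=toy.index, columns=["Q", "S"])],
    axis=1,
)

print(misaligned)
print(aligned)
```

In the misaligned result, a later `dropna()` removes the NaN rows but leaves some encoded values attached to the wrong passenger, which is easy to miss because no error is raised.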

In [20]:
y = df_ohe.Survived
X = df_ohe.drop(columns =['Survived'])

# Creating Training and Testing Sets

Split data into training and testing sets. 

* Import `train_test_split` from the `sklearn.model_selection` module
* Use `train_test_split` to split the data into training and testing sets.


In [21]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
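Because `train_test_split` shuffles the rows, each run of the call above produces a different split and therefore slightly different scores. As a hedged sketch on synthetic stand-in data (the arrays `X_demo` and `y_demo` are assumptions, not the Titanic data), passing `random_state` makes the split repeatable and `stratify` keeps the class ratio the same in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 60/40 binary labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 60 + [1] * 40)

# random_state fixes the shuffle; stratify preserves the 60/40 class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_te.mean())  # share of positive labels in the test part
```

When `test_size` is omitted, as in the cell above, scikit-learn holds out 25% of the rows by default.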

# Normalizing the Data
Normalize the data after splitting it into training and testing sets, so that no information "leaks" from the test set into the training set. Normalization (also sometimes called standardization or scaling) means making sure that all of the data is represented on the same scale. The most common way to do this is to convert all numerical values to z-scores.

Since KNN is a distance-based classifier, features measured on larger scales have a larger impact on the distance between points than features measured on smaller scales.
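To make the effect of scale concrete, here is a small sketch with three made-up points whose two columns play the role of Age and Fare. The values and the helper `dist` are illustrative assumptions only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: Age (years), Fare (currency units). Values are illustrative only.
points = np.array([
    [22.0, 7.0],
    [24.0, 500.0],
    [60.0, 8.0],
])

def dist(a, b):
    """Euclidean distance between two points."""
    return float(np.linalg.norm(a - b))

# On the raw scale, the Fare gap swamps the Age gap.
raw_01 = dist(points[0], points[1])  # similar age, very different fare
raw_02 = dist(points[0], points[2])  # very different age, similar fare

# After converting each column to z-scores, both features contribute comparably.
scaled = StandardScaler().fit_transform(points)
scaled_01 = dist(scaled[0], scaled[1])
scaled_02 = dist(scaled[0], scaled[2])

print(raw_01, raw_02)
print(scaled_01, scaled_02)
```

On the raw data the first distance is more than ten times the second, purely because Fare spans a wider numeric range; after scaling, the two distances are of similar size.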

- Import and instantiate a StandardScaler object.
- Use the scaler's .fit_transform() method to create a scaled version of our training dataset.
- Use the scaler's .transform() method to create a scaled version of our testing dataset.
- The results returned by the fit_transform and transform calls are numpy arrays, not pandas DataFrames. Create a new pandas DataFrame called scaled_df_train from the scaled training data. To set the column names back to their original state, set the columns parameter to X.columns.
- Print out the head of scaled_df_train to ensure everything worked correctly.


The scaler also scaled the binary/one-hot encoded columns! But each binary column still contains only 2 distinct values, meaning the overall information content of each column has not changed.
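A quick way to check this claim is to scale a single binary column and count its distinct values. The column below is a made-up stand-in for an encoded feature such as Sex.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A made-up binary column (one feature, five rows).
sex = np.array([[0], [1], [1], [0], [1]], dtype=float)

scaled_sex = StandardScaler().fit_transform(sex)

unique_before = np.unique(sex)
unique_after = np.unique(scaled_sex)

print(unique_before)  # the two original codes
print(unique_after)   # still two values, now centered around zero
```

The two codes are simply shifted and stretched, so any row that was distinguishable before scaling remains distinguishable afterwards.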


In [22]:
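from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit the scaler on the training data only, then reuse it on the test data
scaled_df_train = scaler.fit_transform(X_train)
scaled_df_test = scaler.transform(X_test)
# Wrap both arrays in DataFrames so they keep the original column names
scaled_df_train = pd.DataFrame(scaled_df_train, columns=X.columns)
scaled_df_test = pd.DataFrame(scaled_df_test, columns=X.columns)
scaled_df_train.head()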

# Fitting a KNN Model
Time to train a KNN classifier and validate its accuracy.


- Import KNeighborsClassifier from the sklearn.neighbors module.
- Instantiate a classifier. For now, you can just use the default parameters.
- Fit the classifier to the training data/labels.
- Use the classifier to generate predictions on the test data. Store these predictions inside the variable test_preds.


Import all the necessary evaluation metrics from sklearn.metrics and complete the print_metrics() function so that it prints out Precision, Recall, Accuracy, and F1-Score when given a set of labels (the true values) and preds (the model's predictions).

Finally, use print_metrics() to print out the evaluation metrics for the test predictions stored in test_preds, and the corresponding labels in y_test.
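As a worked example of what these four metrics measure, the sketch below counts true/false positives and negatives by hand for ten made-up labels and compares the results with scikit-learn's functions. The `labels` and `preds` lists are illustrative, not taken from the Titanic data.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Ten made-up true labels and predictions (1 = survived, 0 = did not).
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
preds = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(labels, preds))
tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # true negatives

precision = tp / (tp + fp)            # of predicted positives, how many are right
recall = tp / (tp + fn)               # of actual positives, how many were found
accuracy = (tp + tn) / len(labels)    # share of all predictions that are right
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, precision_score(labels, preds))
print(recall, recall_score(labels, preds))
print(accuracy, accuracy_score(labels, preds))
print(f1, f1_score(labels, preds))
```

Each hand-computed value should match the corresponding scikit-learn score, which is a useful sanity check when interpreting the output of print_metrics().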


In [23]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(scaled_df_train, y_train)
test_preds = knn.predict(scaled_df_test)


In [26]:
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score
def print_metrics(labels, preds):
    print(f"Precision Score: {precision_score(labels, preds)}")
    print(f"Recall Score: {recall_score(labels, preds)}")
    print(f"Accuracy Score: {accuracy_score(labels, preds)}")
    print(f"F1 Score: {f1_score(labels, preds)}")
                        
print_metrics(y_test, test_preds)

Precision Score: 0.782051282051282
Recall Score: 0.7011494252873564
Accuracy Score: 0.8063063063063063
F1 Score: 0.7393939393939394


# Improving Model Performance
Try to find the optimal number of neighbors to use for the classifier: Iterate over multiple values of k and find the value of k that returns the best overall performance.

The skeleton function takes in six parameters:

- X_train
- y_train
- X_test
- y_test
- min_k (default is 1)
- max_k (default is 25)

Create two variables, best_k and best_score, then iterate from min_k to max_k (inclusive) in steps of 2. With the default min_k of 1, this visits every odd number in that range.

For each iteration:
- Create a new KNN classifier, and set the n_neighbors parameter to the current value for k, as determined by the loop.
- Fit this classifier to the training data.
- Generate predictions for X_test using the fitted classifier.
- Calculate the F1-score for these predictions.
- Compare this F1-score to best_score. If better, update best_score and best_k.

Once all iterations are complete, print out the best value for k and the F1-score it achieved.

In [27]:
def find_best_k(X_train, y_train, X_test, y_test, min_k=1, max_k=25):
    best_k = 0
    best_score = 0.0
    for k in range(min_k, max_k+1, 2):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        preds = knn.predict(X_test)
        f1 = f1_score(y_test, preds)
        if f1 > best_score:
            best_k = k
            best_score = f1
    
    print(f"Best Value for k: {best_k}")
    print(f"F1-Score: {best_score}")

In [29]:
find_best_k(scaled_df_train, y_train, scaled_df_test, y_test)

Best Value for k: 11
F1-Score: 0.778443113772455


Finding a better value for k raised the F1-score from about 0.739 to about 0.778, an improvement of roughly 4 percentage points. For further tuning, use scikit-learn's built-in **Grid Search** (`GridSearchCV`) to perform a similar exhaustive check of hyperparameter combinations and fine-tune model performance. See the [sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).
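A grid search along these lines might look like the sketch below. It is only an illustration under stated assumptions: the data comes from `make_classification` rather than the Titanic file, the pipeline step names `scaler` and `knn` are arbitrary, and the `weights` options are an extra hyperparameter added to show a multi-parameter grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned feature matrix and labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

# Scaling lives inside the pipeline so it is refit on each cross-validation fold.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Odd k values from 1 to 25, crossed with two neighbor-weighting schemes.
param_grid = {
    "knn__n_neighbors": list(range(1, 26, 2)),
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.best_score_)         # mean cross-validated F1 on the training data
print(search.score(X_te, y_te))   # F1 of the refit best model on held-out data
```

One design difference from the manual loop is that the search selects k using cross-validation folds of the training data only, so the test set is left untouched until the final score.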
