# K Nearest Neighbors Exercise

## Introduction

We will be using customer churn data from the telecom industry. The data file is called `Orange_Telecom_Churn_Data.csv`. We will load this data, do some preprocessing, and use K-nearest neighbors to predict customer churn based on account characteristics.

In [10]:
from __future__ import print_function
import os
data_path = ['..', '..', 'data']


* Begin by importing the data. Examine the columns and data.

In [5]:
import pandas as pd

filepath = '/Users/ebaniez/Downloads/Orange_Telecom_Churn_Data.csv'

# Use Pandas to read the CSV file into a DataFrame
data = pd.read_csv(filepath)

data.head()

| | state | account_length | area_code | phone_number | intl_plan | voice_mail_plan | number_vmail_messages | total_day_minutes | total_day_calls | total_day_charge | ... | total_eve_calls | total_eve_charge | total_night_minutes | total_night_calls | total_night_charge | total_intl_minutes | total_intl_calls | total_intl_charge | number_customer_service_calls | churned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | ... | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | ... | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | ... | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.9 | ... | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | ... | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |


In [6]:
# Remove columns that will not be used as features
data.drop(['state', 'area_code', 'phone_number'], axis=1, inplace=True)

In [7]:
data.columns

Index(['account_length', 'intl_plan', 'voice_mail_plan',
       'number_vmail_messages', 'total_day_minutes', 'total_day_calls',
       'total_day_charge', 'total_eve_minutes', 'total_eve_calls',
       'total_eve_charge', 'total_night_minutes', 'total_night_calls',
       'total_night_charge', 'total_intl_minutes', 'total_intl_calls',
       'total_intl_charge', 'number_customer_service_calls', 'churned'],
      dtype='object')


* Notice that some of the columns are categorical and the rest are numeric (integers and floats). The categorical features will need to be numerically encoded using one of the methods from the lecture.
* Finally, remember from the lecture that K-nearest neighbors requires scaled data. Scale the data using one of the scaling methods discussed in the lecture.
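
To make the scaling step concrete, here is a minimal sketch of min-max scaling on a small toy table. The values are made up for illustration and are not taken from the churn data. It computes `(x - min) / (max - min)` by hand and compares the result with scikit-learn's `MinMaxScaler`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame (hypothetical values) with features on very different scales
toy = pd.DataFrame({
    'total_day_minutes': [100.0, 200.0, 300.0],
    'intl_plan': [0.0, 1.0, 0.0],
})

# Min-max scaling by hand: (x - min) / (max - min), column by column
manual = (toy - toy.min()) / (toy.max() - toy.min())

# The same transformation via scikit-learn
scaled = MinMaxScaler().fit_transform(toy)

print(manual)
print(np.allclose(manual.values, scaled))
```

After scaling, every column lies in the range 0 to 1, so no single feature dominates the distance calculation just because of its units.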

In [13]:
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()

for col in ['intl_plan', 'voice_mail_plan', 'churned']:
    data[col] = lb.fit_transform(data[col])

In [18]:
#mute the sklearn warning
import warnings
warnings.filterwarnings('ignore', module='sklearn')
from sklearn.preprocessing import MinMaxScaler
msc = MinMaxScaler()
data = pd.DataFrame(msc.fit_transform(data), columns=data.columns)


* Separate the feature columns (everything except `churned`) from the label (`churned`). This will create two tables.
* Fit a K-nearest neighbors model with a value of `k=3` to this data and predict the outcome on the same data.
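
As an illustration of the fit-then-predict-on-the-same-data pattern, the following sketch uses a tiny synthetic data set. The names `X_toy`, `y_toy`, and `knn_toy` are hypothetical; this is not the churn data or its solution.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two well-separated clusters of three points each
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# Fit with k=3, then predict on the very same rows used for fitting
knn_toy = KNeighborsClassifier(n_neighbors=3)
knn_toy.fit(X_toy, y_toy)
y_pred_toy = knn_toy.predict(X_toy)

print(y_pred_toy)
```

Because each point counts as one of its own neighbors, accuracy measured this way tends to be optimistic compared with accuracy on held-out data.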

In [55]:
x_cols = [x for x in data.columns if x != 'churned']

X_data = data[x_cols]
y_data = data['churned']

In [54]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Reload the raw data under a separate name so the preprocessed `data` is not overwritten
raw_data = pd.read_csv('/Users/ebaniez/Downloads/Orange_Telecom_Churn_Data.csv')

# Separate features (X) and label (y)
X = raw_data.drop(columns=['churned'])
y = raw_data['churned']

# One-hot encode the categorical columns (this includes `state` and `phone_number`)
X = pd.get_dummies(X, drop_first=True)

# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features (important for KNN); fit the scaler on the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create and fit the KNN classifier with k=3
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)

# Predict on the test data
y_pred = knn.predict(X_test_scaled)

# Evaluate the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 0.868



* Write a function to calculate accuracy using the actual and predicted labels.
* Using the function, calculate the accuracy of this K-nearest neighbors model on the data.
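
Accuracy is the fraction of predictions that match the true labels. A minimal sketch that computes it directly with NumPy, without relying on scikit-learn, might look like this. The helper name `manual_accuracy` is hypothetical and the labels are made-up sample values.

```python
import numpy as np

def manual_accuracy(actual_labels, predicted_labels):
    """Return the fraction of positions where the two label sequences agree."""
    actual = np.asarray(actual_labels)
    predicted = np.asarray(predicted_labels)
    if actual.shape != predicted.shape:
        raise ValueError("actual and predicted labels must have the same shape")
    # Element-wise comparison gives booleans; their mean is the share of matches
    return float(np.mean(actual == predicted))

# Sample labels (made up): three of the five predictions are correct
sample_score = manual_accuracy([1, 0, 1, 0, 1], [1, 0, 1, 1, 0])
print(f"Accuracy: {sample_score:.2f}")
```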

In [79]:
# Import necessary libraries
from sklearn.metrics import accuracy_score

# Function to calculate accuracy
def calculate_accuracy(actual_labels, predicted_labels):
    """
    Calculate the accuracy of a classification model.

    Args:
    actual_labels (list or array): The true labels.
    predicted_labels (list or array): The predicted labels.

    Returns:
    accuracy (float): The accuracy of the model, ranging from 0 to 1.
    """
    accuracy = accuracy_score(actual_labels, predicted_labels)
    return accuracy

# Sample actual and predicted labels (replace with your actual data)
actual_labels = [1, 0, 1, 0, 1]
predicted_labels = [1, 0, 1, 1, 0]

# Calculate accuracy using the function
accuracy = calculate_accuracy(actual_labels, predicted_labels)

# Print the accuracy
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.60



* Fit the K-nearest neighbors model again with `n_neighbors=3`, but this time use distance-based weights (`weights='distance'`). Calculate the accuracy using the function you created above.
* Fit another K-nearest neighbors model. This time use uniform weights, but set the power parameter for the Minkowski distance metric to 1 (`p=1`), i.e. Manhattan distance.
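
To show what the power parameter `p` changes, here is a small sketch of the Minkowski distance computed by hand. The helper `minkowski` is a hypothetical illustration, not a scikit-learn function: with `p=1` it reduces to the Manhattan distance and with `p=2` to the Euclidean distance.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance: (sum_i |a_i - b_i|^p)^(1/p)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

# Distance between the origin and the point (3, 4)
manhattan = minkowski([0, 0], [3, 4], p=1)  # |3| + |4|
euclidean = minkowski([0, 0], [3, 4], p=2)  # sqrt(3^2 + 4^2)

print(manhattan, euclidean)
```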

In [80]:
# X and y are the one-hot encoded features and labels from the k=3 cell above

# Create KNN model with n_neighbors=3 and 'distance' weights
knn_distance = KNeighborsClassifier(n_neighbors=3, weights='distance')
knn_distance.fit(X, y)

# Predict on the same data the model was fit on
y_pred_distance = knn_distance.predict(X)

# Calculate accuracy with the function defined above.
# With distance weights, each training point is its own zero-distance neighbor,
# so scoring on the training data is expected to give a perfect result.
accuracy_distance = calculate_accuracy(y, y_pred_distance)
print("Accuracy with distance weights:", accuracy_distance)

Accuracy with distance weights: 1.0


In [81]:
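knn_manhattan = KNeighborsClassifier(n_neighbors=3, weights='uniform', p=1)
knn_manhattan.fit(X, y)
y_pred_manhattan = knn_manhattan.predict(X)
print("Accuracy with Manhattan distance:", calculate_accuracy(y, y_pred_manhattan))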


* Fit a K-nearest neighbors model using values of `k` (`n_neighbors`) ranging from 1 to 20. Use uniform weights (the default). The power parameter for the Minkowski distance (`p`) can be set to either 1 or 2; just be consistent. Store the accuracy and the value of `k` used from each of these fits in a list or dictionary.
* Plot (or view the table of) `accuracy` vs. `k`.

In [None]:
score_list = list()
for k in range(1, 21):
    # Uniform weights and p=2 (Euclidean distance) are the defaults
    knn = KNeighborsClassifier(n_neighbors=k)
    knn = knn.fit(X_data, y_data)

    # Score each fit on the same data it was trained on
    y_pred = knn.predict(X_data)
    score = calculate_accuracy(y_data, y_pred)
    score_list.append((k, score))

score_df = pd.DataFrame(score_list, columns=['k', 'accuracy'])

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
sns.set_context('talk')
sns.set_style('ticks')
sns.set_palette('dark')
ax = score_df.set_index('k').plot()
ax.set(xlabel='k', ylabel='accuracy')
ax.set_xticks(range(1, 21));