# KNN with SMOTE (Resampling) Strategy
#### Approach
A KNN approach is used to implement the recommendation system.
However, the dataset has a highly imbalanced class distribution. Supervised ML techniques such as KNN, decision trees, and logistic regression are biased towards the majority class and tend to ignore the minority class, so they end up predicting mostly the majority class.
Two approaches were followed to address this challenge: one at the data level and one at the algorithm level.
1. At the data level, resampling techniques (undersampling, SMOTE, and NearMiss) were applied to obtain a balanced distribution. SMOTE (Synthetic Minority Oversampling Technique), which generates synthetic training records for the minority class, yielded the best results with KNN and is the technique presented in this notebook.
2. At the algorithm level, hyperparameter selection and tuning were done along with threshold adjustments.
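
To make the data-level step more concrete, the sketch below illustrates the core idea behind SMOTE: a synthetic record is placed at a random point on the line segment between a minority record and one of its nearest minority neighbours. This is a simplified illustration on toy data, not the `imblearn` implementation; the function name `smote_like_samples` and its parameters are hypothetical.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    # Pairwise Euclidean distances between minority rows
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)                 # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # k nearest minority rows

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))          # pick a minority row
        other = neighbours[base, rng.integers(k)]   # pick one of its neighbours
        gap = rng.random()                          # position along the segment, in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[other] - minority[base])
    return synthetic

# Toy minority class: four points on the corners of a unit square (assumed data)
toy_minority = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
new_rows = smote_like_samples(toy_minority, n_new=6)
print(new_rows.shape)
```

Because each synthetic row is an interpolation between two existing minority rows, it always lies inside the region spanned by the minority class.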

#### Result
After running several iterations of models and resampling strategies, we obtained the best results with KNN using SMOTE resampling and the cosine metric.   
**The best F1 score we achieved is 0.0536257482092042 (with a threshold of 0.8).**

The code for the model follows:

## Part 1 Training

In [1]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore') 

from sklearn import neighbors, metrics

# For Validation & Scaling
from sklearn.preprocessing import StandardScaler # For scaling/standardizing dataset
from sklearn.model_selection import train_test_split # Dataset Splitting
from sklearn.model_selection import GridSearchCV # for Cross validation
from sklearn.pipeline import Pipeline

# For evaluation
from sklearn.metrics import f1_score # evaluation metrics

import time
from sklearn.metrics import confusion_matrix 

In [None]:
# install imbalanced-learn
#conda install -c conda-forge imbalanced-learn

### Use pre-processed Dataset for training

In [2]:
Train = pd.read_csv("AkeedTrain.csv")
Train.shape

(5950300, 25)

In [3]:
# Drop one reference level per categorical variable:
# 'cust_location_type_Other', 'gender_male', 'vendor_category_en_Sweets & Bakes', 'device_type_3', and the open-hour 'na' time group
X_Train = Train[['cust_location_type_Home', 'cust_location_type_Work', # dropped other
               'gender_female', # dropped male
               'delivery_charge','serving_distance', 'is_open', 'prepration_time','discount_percentage',
               'ven_status', 'ven_verified','vendor_rating', 'Ven_created_days', 
               'TotOpenDuration',
               'vendor_category_en_Restaurants', # dropped Sweets & Bakes
               'device_type_1', # dropped device_type_3 
               'timegroup_early(5-10.59am)','timegroup_normal(11-16.59pm)','timegroup_evening(17-22.59pm)','timegroup_latenight(23-4.59am)']] # dropped na
y_Train = Train['Target']

### Import Library for Smote

In [4]:
import imblearn
# print(imblearn.__version__)
from imblearn.over_sampling import SMOTE
from collections import Counter

In [5]:
# find classes
counter = Counter(y_Train)
print(counter)

Counter({0: 5870158, 1: 80142})


In [6]:
# Balance classes
oversample = SMOTE()
x_trSM, y_trSM = oversample.fit_resample(X_Train, y_Train)
counter = Counter(y_trSM)
print(counter)

Counter({0: 5870158, 1: 5870158})
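
The two printed counters can be turned into a quick summary of how imbalanced the data is and how many records SMOTE added. The sketch below only does arithmetic on the counts shown above; the variable names are illustrative.

```python
from collections import Counter

before = Counter({0: 5870158, 1: 80142})     # class counts printed before SMOTE
after = Counter({0: 5870158, 1: 5870158})    # class counts printed after SMOTE

minority_share = before[1] / sum(before.values())
imbalance_ratio = before[0] / before[1]
synthetic_rows = after[1] - before[1]

print(f"minority share before SMOTE: {minority_share:.2%}")
print(f"majority:minority ratio:     {imbalance_ratio:.1f}:1")
print(f"synthetic minority rows:     {synthetic_rows}")
```

The positive class makes up roughly 1.35% of the original rows, so the large majority of positive records in the resampled data are synthetic.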


### Train-Test Split
Create a holdout set from the training dataset

In [7]:
# 1. Split the SMOTE-resampled data: test_size=0.95 keeps 5% of the rows for training and holds out 95%
X_training, X_testing, y_training, y_testing = train_test_split(x_trSM, y_trSM, test_size=0.95, random_state = 10)
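
Because the split here is applied to the resampled data, the holdout set is balanced and contains synthetic records, unlike the original data. A common alternative is to hold out rows first and resample only the training portion. The sketch below shows that ordering on toy data; it uses plain random duplication as a stand-in for SMOTE, and all names and sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(10)

# Toy imbalanced data: 1000 rows, 2 features, about 5% positives (assumed)
X = rng.normal(size=(1000, 2))
y = (rng.random(1000) < 0.05).astype(int)

# 1. Hold out 20% of the ORIGINAL rows before any resampling
order = rng.permutation(len(y))
n_holdout = len(y) // 5
holdout_idx, train_idx = order[:n_holdout], order[n_holdout:]
X_hold, y_hold = X[holdout_idx], y[holdout_idx]
X_tr, y_tr = X[train_idx], y[train_idx]

# 2. Balance only the training part (random duplication stands in for SMOTE)
pos = np.flatnonzero(y_tr == 1)
neg = np.flatnonzero(y_tr == 0)
extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
keep = np.concatenate([neg, pos, extra])
X_bal, y_bal = X_tr[keep], y_tr[keep]

print("training class counts:", np.bincount(y_bal))
print("holdout positive share:", y_hold.mean())
```

With this ordering the holdout set keeps the original class distribution, so an F1 score measured on it is closer to what can be expected on unseen, imbalanced data.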


### Create Pipeline to run KNN with Hyperparameter tuning and 5 fold Cross Validation

In [8]:
#    ******** Method: KNN ******** 
start = time.time() # record start time
# 1. Set up the model pipeline
estimator = Pipeline(steps = [('scale', StandardScaler()), # Scale the data
                     ('knn_clf', neighbors.KNeighborsClassifier(metric = 'cosine', algorithm = 'brute')) ]) # fit the scaled data using KNN

# 2. Set up the parameters for each item in pipeline
parameters = { 'knn_clf__n_neighbors':range(4,10),'knn_clf__weights' : ['uniform','distance']}

# 3. Instantiate gridsearch cross validation for the model in pipeline
knn = GridSearchCV(estimator = estimator, param_grid = parameters, cv = 5, 
                   scoring = 'f1', n_jobs = -2) # Instantiate the gridsearch

# 4. fit the model on train data
knn.fit(X_training, y_training)
end = time.time() # record end time
print('Tuning time in mins:',(end-start)/60)

print('Best hyper parameter set :',knn.best_params_)   # The best parameter from CV

Tuning time in mins: 644.6616194923719
Best hyper parameter set : {'knn_clf__n_neighbors': 9, 'knn_clf__weights': 'distance'}
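
The selected configuration combines the cosine metric with `weights='distance'`. As a rough illustration of what that means for a single prediction, the sketch below ranks training rows by cosine distance to a query and lets closer neighbours count more in the vote. It is a simplified stand-in on toy data, not scikit-learn's implementation (for example, exact zero distances are only approximated here); `cosine_knn_proba` is a hypothetical helper.

```python
import numpy as np

def cosine_knn_proba(X_train, y_train, x_query, k=3):
    """Distance-weighted share of class 1 among the k nearest rows by cosine distance."""
    X_train = np.asarray(X_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    # Cosine distance = 1 - cosine similarity
    sims = X_train @ x_query / (np.linalg.norm(X_train, axis=1) * np.linalg.norm(x_query))
    dists = 1.0 - sims
    nearest = np.argsort(dists)[:k]
    # Closer neighbours get larger weights (1 / distance), guarded against division by zero
    weights = 1.0 / np.maximum(dists[nearest], 1e-12)
    return weights[np.asarray(y_train)[nearest] == 1].sum() / weights.sum()

# Toy data: class 1 points along the x-axis direction, class 0 along the y-axis direction
X_toy = np.array([[1.0, 0.1], [1.0, 0.3], [0.2, 1.0], [0.1, 1.0]])
y_toy = np.array([1, 1, 0, 0])
p = cosine_knn_proba(X_toy, y_toy, x_query=[1.0, 0.2], k=3)
print(round(p, 3))
```

Cosine distance depends only on the direction of the feature vectors, not their length, and the weighted share returned here plays the role of the positive-class probability that is later compared against a threshold.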


### Evaluation on Holdout Set

In [9]:
# 5. evaluation on test set and f1 score with 0.5 threshold
start = time.time()
print('f1 score of KNN on test set:',f1_score(y_testing,knn.predict(X_testing)))
end = time.time() # record end time
print('predict time in mins:',(end-start)/60)

f1 score of KNN on test set: 0.5941698847682907
predict time in mins: 2063.160816880067


### Best Threshold maximizing F1 Score

In [None]:
# 6. find best threshold on test set 
thresholds = np.linspace(0, 1, 20) # Set up threshold values
proba = knn.predict_proba(X_testing)[:,1] # predict once; prediction is the expensive step
f1 = []
for a in thresholds:
    yhat = np.where(proba > a, 1,0)
    conf_matrix = confusion_matrix(y_testing, yhat, labels=[1,0])
    TP = conf_matrix[0,0]
    FN = conf_matrix[0,1]
    FP = conf_matrix[1,0]
    TN = conf_matrix[1,1]

    # F1 = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall;
    # taken as 0 when there are no true positives (avoids division by zero)
    f1_a = 2*TP/(2*TP+FP+FN) if TP > 0 else 0.0
    f1.append(f1_a) # separate name so the imported f1_score is not shadowed

# Find the threshold with the best f1
best_a = thresholds[np.argmax(f1)]
print(best_a)
print(f1[np.argmax(f1)])

# Evaluate on the test set
yhat = np.where(proba > best_a, 1,0)
conf_matrix1 = confusion_matrix(y_testing, yhat, labels=[1,0])
print(conf_matrix1)
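
The threshold search can be tried out on a small example without a fitted model. The sketch below scores a grid of thresholds against a fixed vector of probabilities that is computed once and reused; the labels, probabilities, and the helper `f1_at_threshold` are made up for illustration.

```python
import numpy as np

def f1_at_threshold(y_true, proba, threshold):
    """F1 of the positive class when predicting 1 for proba > threshold."""
    y_pred = proba > threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    # F1 = 2TP / (2TP + FP + FN); taken as 0 when nothing is correctly flagged
    return 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0

# Toy labels and positive-class probabilities (assumed values)
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
proba = np.array([0.12, 0.22, 0.31, 0.43, 0.52, 0.61, 0.72, 0.91])

thresholds = np.linspace(0, 1, 21)   # 0.00, 0.05, ..., 1.00
scores = [f1_at_threshold(y_true, proba, t) for t in thresholds]
best = int(np.argmax(scores))
print(thresholds[best], scores[best])
```

Note that `np.argmax` returns the first maximum, so when several thresholds tie, the lowest one is reported.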



## Part-2 Test Set

In [10]:
# Load the pre-processed test dataset in the required format
# (forward slash avoids the invalid "\c" escape sequence and works on all platforms)
Test = pd.read_csv("test/cust_ven_test_p2.csv")
Test.shape

(1672000, 97)

In [11]:
# Drop Reference column for KNN
X_Test = Test[['cust_location_type_Home', 'cust_location_type_Work', # dropped other
               'gender_female', # dropped male
               'delivery_charge','serving_distance', 'is_open', 'prepration_time','discount_percentage',
               'ven_status', 'ven_verified','vendor_rating', 'Ven_created_days', 
               'TotOpenDuration',
               'vendor_category_en_Restaurants', # dropped Sweets & Bakes
               'device_type_1', # dropped device_type_3 
               'timegroup_early(5-10.59am)','timegroup_normal(11-16.59pm)','timegroup_evening(17-22.59pm)','timegroup_latenight(23-4.59am)']] # dropped na

### Predict on test set with multiple threshold values

In [12]:
start = time.time()
y_hat_Test_def = np.where(knn.predict_proba(X_Test)[:,1] > 0.5, 1, 0)
print(y_hat_Test_def)
end = time.time() # record end time
print('predict time in mins:',(end-start)/60)

[1 0 0 ... 0 0 0]
predict time in mins: 304.6312898437182


In [17]:

start = time.time()
y_hat_Test_def4 = np.where(knn.predict_proba(X_Test)[:,1] > 0.4, 1, 0)
print(y_hat_Test_def4)
end = time.time() # record end time
print('predict time in mins:',(end-start)/60)


[1 1 0 ... 0 0 0]
predict time in mins: 310.82710293928784


In [21]:

start = time.time()
y_hat_Test_def6 = np.where(knn.predict_proba(X_Test)[:,1] > 0.6, 1, 0)
print(y_hat_Test_def6)
end = time.time() # record end time
print('predict time in mins:',(end-start)/60)


[1 0 0 ... 0 0 0]
predict time in mins: 300.16979573965074


In [23]:

start = time.time()
y_hat_Test_def7 = np.where(knn.predict_proba(X_Test)[:,1] > 0.7, 1, 0)
print(y_hat_Test_def7)
end = time.time() # record end time
print('predict time in mins:',(end-start)/60)

[0 0 0 ... 0 0 0]
predict time in mins: 304.1332718451818


In [25]:

start = time.time()
y_hat_Test_def8 = np.where(knn.predict_proba(X_Test)[:,1] > 0.8, 1, 0)
print(y_hat_Test_def8)
end = time.time() # record end time
print('predict time in mins:',(end-start)/60)

[0 0 0 ... 0 0 0]
predict time in mins: 308.41722063620887


In [13]:
df_Test_SS = pd.DataFrame({'test_CID X LOC_NUM X VENDOR':Test['test_CID X LOC_NUM X VENDOR'],'yhat':pd.Series(y_hat_Test_def)})
df_Test_SS

| | test_CID X LOC_NUM X VENDOR | yhat |
|---|---|---|
| 0 | 000IPH5 X 0 X 4 | 1 |
| 1 | 000IPH5 X 0 X 13 | 0 |
| 2 | 000IPH5 X 0 X 20 | 0 |
| 3 | 000IPH5 X 0 X 23 | 1 |
| 4 | 000IPH5 X 0 X 28 | 1 |
| ... | ... | ... |
| 1671995 | G49BIQ7 X 0 X 849 | 1 |
| 1671996 | G49BIQ7 X 0 X 855 | 0 |
| 1671997 | G49BIQ7 X 0 X 856 | 0 |
| 1671998 | G49BIQ7 X 0 X 858 | 0 |


In [None]:
samplesubmission = pd.read_csv('SampleSubmission.csv')

In [None]:
compare_df = pd.merge(samplesubmission,df_Test_SS,
                   left_on='CID X LOC_NUM X VENDOR', right_on='test_CID X LOC_NUM X VENDOR',how='left')
print(compare_df.shape)
compare_df

In [14]:

df_Test_SS.columns = ['CID X LOC_NUM X VENDOR','target']

### Export the csv in the requested format

In [15]:
df_Test_SS.to_csv('KNN cosine SMOTE.csv', index=False)

In [19]:
df_Test_SS1 = pd.DataFrame({'test_CID X LOC_NUM X VENDOR':Test['test_CID X LOC_NUM X VENDOR'],'yhat':pd.Series(y_hat_Test_def4)})
df_Test_SS1.columns = ['CID X LOC_NUM X VENDOR','target']
df_Test_SS1.to_csv('KNN cosine SMOTE 4.csv', index=False)

In [22]:
df_Test_SS2 = pd.DataFrame({'test_CID X LOC_NUM X VENDOR':Test['test_CID X LOC_NUM X VENDOR'],'yhat':pd.Series(y_hat_Test_def6)})
df_Test_SS2.columns = ['CID X LOC_NUM X VENDOR','target']
df_Test_SS2.to_csv('KNN cosine SMOTE 6.csv', index=False)

In [24]:
df_Test_SS3 = pd.DataFrame({'test_CID X LOC_NUM X VENDOR':Test['test_CID X LOC_NUM X VENDOR'],'yhat':pd.Series(y_hat_Test_def7)})
df_Test_SS3.columns = ['CID X LOC_NUM X VENDOR','target']
df_Test_SS3.to_csv('KNN cosine SMOTE 7.csv', index=False)

In [26]:
df_Test_SS4 = pd.DataFrame({'test_CID X LOC_NUM X VENDOR':Test['test_CID X LOC_NUM X VENDOR'],'yhat':pd.Series(y_hat_Test_def8)})
df_Test_SS4.columns = ['CID X LOC_NUM X VENDOR','target']
df_Test_SS4.to_csv('KNN cosine SMOTE 8.csv', index=False)
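
The prediction and export cells above repeat the same pattern for each threshold, and each one calls `predict_proba` again. A possible consolidation is to compute the probabilities once and loop over the thresholds. The sketch below shows the idea with the standard-library `csv` module and toy stand-ins for the IDs and probabilities; the helper `export_predictions` and the file naming are hypothetical.

```python
import csv
import tempfile
from pathlib import Path

def export_predictions(ids, proba, thresholds, out_dir):
    """Write one submission-style CSV per threshold and return the written paths."""
    out_dir = Path(out_dir)
    paths = {}
    for t in thresholds:
        path = out_dir / f"knn_threshold_{t}.csv"   # hypothetical file naming
        with open(path, "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(["CID X LOC_NUM X VENDOR", "target"])
            for row_id, p in zip(ids, proba):
                writer.writerow([row_id, int(p > t)])
        paths[t] = path
    return paths

# Toy stand-ins for the test IDs and the positive-class probabilities
ids = ["A X 0 X 4", "A X 0 X 13", "B X 0 X 20"]
proba = [0.95, 0.55, 0.10]   # computed once, reused for every threshold

with tempfile.TemporaryDirectory() as tmp:
    written = export_predictions(ids, proba, [0.4, 0.5, 0.6, 0.7, 0.8], tmp)
    lines = written[0.6].read_text().splitlines()
    targets_at_06 = [line.split(",")[-1] for line in lines[1:]]
    print(targets_at_06)
```

In the notebook, the same loop could take `knn.predict_proba(X_Test)[:,1]` as `proba`, so the expensive prediction step runs once instead of once per threshold.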