# K-Nearest Neighbors (K-NN)

### 參考課程實作並在datasets_483_982_spam.csv的資料集中獲得90% 以上的 accuracy (testset)

## Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os
import glob
import codecs
import re

## Importing the dataset

In [2]:
dataset = pd.read_csv(r'../data/datasets_483_982_spam.csv', encoding = 'latin-1')
dataset['is_spam'] = dataset['v1'].map({'ham': 0, 'spam': 1})
all_data = dataset[['v2', 'is_spam']].values
dataset.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4,is_spam
0,ham,"Go until jurong point, crazy.. Available only ...",,,,0
1,ham,Ok lar... Joking wif u oni...,,,,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,,1
3,ham,U dun say so early hor... U c already then say...,,,,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,,0


### 取出訓練內文與標註

In [3]:
X = all_data[:,0]
Y = all_data[:,1].astype(np.uint8)

In [4]:
print('Training Data Examples : \n{}'.format(X[:5]))

Training Data Examples : 
['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'
 'Ok lar... Joking wif u oni...'
 "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
 'U dun say so early hor... U c already then say...'
 "Nah I don't think he goes to usf, he lives around here though"]


In [5]:
print('Labeling Data Examples : \n{}'.format(Y[:5]))

Labeling Data Examples : 
[0 0 1 0 0]


### 文字預處理

In [6]:
from sklearn.metrics import confusion_matrix
from nltk.corpus import stopwords

import nltk

nltk.download('stopwords')

# Lemmatize with POS Tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer 
## 創建Lemmatizer(詞型還原), cars->car, ate->eat
lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(word):
    """將pos_tag結果mapping到lemmatizer中pos的格式"""
    tag = nltk.pos_tag([word])[0][1][0].upper
    tag_dict = {
        'J': wordnet.ADJ,
        'N': wordnet.NOUN,
        'V': wordnet.VERB,
        'R': wordnet.ADV
    }
    return tag_dict.get(tag, wordnet.NOUN)

def clean_content(x):
    stop_words = set(stopwords.words('english'))

    X_output = []
    for x in X:
        clean = re.sub('[^a-zA-z]', ' ', x).lower()
        tokens = nltk.word_tokenize(clean)
        clean_tokens = []
        for token in tokens:
            if token not in stop_words:
                word = lemmatizer.lemmatize(token, get_wordnet_pos(token))
                clean_tokens.append(word)
        X_output.append(' '.join(clean_tokens))
    return X_output

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\alway\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [7]:
X = clean_content(X)

### Bag of words

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
#max_features是要建造幾個column，會按造字出現的高低去篩選 
cv=CountVectorizer(max_features=1000)
X=cv.fit_transform(X).toarray()
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [9]:
X.shape

(5572, 1000)

## Splitting the dataset into the Training set and Test set

In [10]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

## Training the K-NN model on the Training set

In [12]:
from sklearn.neighbors import KNeighborsClassifier
# http://www.taroballz.com/2018/07/08/ML_KNeighbors_Classifier/
'''
．n_neighbors：為int類型，可選，預設值為5，選擇查詢的鄰居數
．algorithm：有'auto','ball_tree','kd_tree','brute'幾種選擇，可選用於計算最近鄰居的算法，'auto'將嘗試根據傳遞給fit()方法的值來決定最合適的算法
．metric：預設為"minkowski"(明可夫斯基距離)
    ．其為歐式距離及曼哈頓距離兩種計算距離的延伸
    ．其實例化KNN算法時參數p預設為2
        ．p為2時所使用的是曼哈頓距離：兩點絕對值距離
        ．p為1時所使用的是歐式距離
．n_jobs：n_jobs 是并行计算的线程数量，默认是1，输入-1 则设为CPU 的内核数。
'''
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2, n_jobs=-1)
classifier.fit(X_train, y_train)

KNeighborsClassifier(n_jobs=-1)

## Predicting a new result

In [13]:
print('Trainset Accuracy: {}'.format(classifier.score(X_train, y_train)))

Trainset Accuracy: 0.9448059232667714


In [14]:
print('Testset Accuracy: {}'.format(classifier.score(X_test, y_test)))

Testset Accuracy: 0.9192825112107623


## Predicting the Test set results

In [16]:
y_pred = classifier.predict(X_test)
y_pred

array([0, 0, 0, ..., 0, 0, 0], dtype=uint8)

## Making the Confusion Matrix

In [17]:
from sklearn.metrics import confusion_matrix, accuracy_score
# https://blog.csdn.net/m0_38061927/article/details/77198990
# confusion_matrix混淆矩阵, 多个类别是否有混淆（也就是一个class被预测成另一个class）
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[949   0]
 [ 90  76]]


0.9192825112107623