# K-Nearest Neighbors (K-NN)

### 參考課程實作並在 `datasets_483_982_spam.csv` 的資料集中獲得 90% 以上的 accuracy (testset)

## Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os
import glob
import codecs
import re

## Importing the dataset

In [22]:
dataset = pd.read_csv(r'datasets_483_982_spam.csv', encoding='latin-1')

In [23]:
dataset.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [24]:
dataset = dataset.loc[:, ['v1', 'v2']]
dataset.columns = ['spam_ham', 'text']

In [36]:
all_data = []
for data in dataset.itertuples():
    if data[1] == 'spam':
        is_spam = 1
    else:
        is_spam = 0
    text = data[2]
    all_data.append([text, is_spam])

In [45]:
all_data = np.array(all_data)

In [46]:
print(type(all_data))

<class 'numpy.ndarray'>


In [47]:
all_data[:, 0]

array(['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
       'Ok lar... Joking wif u oni...',
       "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
       ..., 'Pity, * was in mood for that. So...any other suggestions?',
       "The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free",
       'Rofl. Its true to its name'], dtype='<U910')

In [48]:
all_data[:, 1]

array(['0', '0', '1', ..., '0', '0', '0'], dtype='<U910')

### 取出訓練內文與標註

In [125]:
X = all_data[:, 0]
Y = all_data[:, 1].astype(np.uint8)

In [50]:
print('Training Data Examples : \n{}'.format(X[:5]))

Training Data Examples : 
['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'
 'Ok lar... Joking wif u oni...'
 "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
 'U dun say so early hor... U c already then say...'
 "Nah I don't think he goes to usf, he lives around here though"]


In [51]:
print('Labeling Data Examples : \n{}'.format(Y[:5]))

Labeling Data Examples : 
[0 0 1 0 0]


### 文字預處理

In [52]:
from sklearn.metrics import confusion_matrix
from nltk.corpus import stopwords

import nltk

nltk.download('stopwords')

# Lemmatize with POS Tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer 

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jiaping/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [58]:
lemmatizer = WordNetLemmatizer()

In [73]:
t = all_data[0]
t

array(['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
       '0'], dtype='<U910')

In [93]:
t_clean = [re.sub('[^a-zA-Z]', ' ', w) for w in t]
print(t_clean)

t_token = [nltk.word_tokenize(w) for w in t_clean]
print(t_token)

['Go until jurong point  crazy   Available only in bugis n great world la e buffet    Cine there got amore wat   ', ' ']
[['Go', 'until', 'jurong', 'point', 'crazy', 'Available', 'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', 'Cine', 'there', 'got', 'amore', 'wat'], []]


In [106]:
def pos_mapping(word):
    tag_dict = {
        'J': wordnet.ADJ,
        'N': wordnet.NOUN,
        'V': wordnet.VERB,
        'R': wordnet.ADV
    }
    # 最後一個 [0] 代表只看第一個字母，例如 VB 則只抓 V 來判斷詞性
    tag = nltk.pos_tag([word])[0][1][0]
    return tag_dict.get(tag, wordnet.NOUN)  # 尋找 key=tag，若無則返回 wordnet.NOUN

In [117]:
def clean_content(x):
    x_clean = [re.sub('[^a-zA-Z]', ' ', w) for w in x]
    x_tokenized = [nltk.word_tokenize(w) for w in x_clean]
    stop_words = set(stopwords.words('english'))
    x_lemmatized = []
    for content in x_tokenized:
        content_lemmatized = []
        for word in content:
            if word not in stop_words:
                word = lemmatizer.lemmatize(word, pos_mapping(word))
                content_lemmatized.append(word)
        x_lemmatized.append(content)
    x_lemmatized_line = [' '.join(w) for w in x_lemmatized]
    return x_lemmatized_line

In [126]:
X = clean_content(X)

In [127]:
X[:10]

['Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat',
 'Ok lar Joking wif u oni',
 'Free entry in a wkly comp to win FA Cup final tkts st May Text FA to to receive entry question std txt rate T C s apply over s',
 'U dun say so early hor U c already then say',
 'Nah I don t think he goes to usf he lives around here though',
 'FreeMsg Hey there darling it s been week s now and no word back I d like some fun you up for it still Tb ok XxX std chgs to send to rcv',
 'Even my brother is not like to speak with me They treat me like aids patent',
 'As per your request Melle Melle Oru Minnaminunginte Nurungu Vettam has been set as your callertune for all Callers Press to copy your friends Callertune',
 'WINNER As a valued network customer you have been selected to receivea prize reward To claim call Claim code KL Valid hours only',
 'Had your mobile months or more U R entitled to Update to the latest colour mobiles with camera for Free Call 

### Bag of words

In [128]:
from sklearn.feature_extraction.text import CountVectorizer
#max_features是要建造幾個column，會按照字出現頻率的高低去篩選 
cv = CountVectorizer(max_features=2000)
X = cv.fit_transform(X).toarray()

In [123]:
X.shape 
# output: (rows, max_features)

(5572, 1000)

In [129]:
X.shape

(5572, 2000)

## Splitting the dataset into the Training set and Test set

In [134]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

In [136]:
len(X_train), len(X_test)

(4457, 1115)

In [137]:
len(y_train), len(y_test)

(4457, 1115)

## Training the K-NN model on the Training set

In [139]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(X_train, y_train)

KNeighborsClassifier()

## Predicting a new result

In [140]:
print('Trainset Accuracy: {}'.format(classifier.score(X_train, y_train)))

Trainset Accuracy: 0.9418891631142023


In [141]:
print('Testset Accuracy: {}'.format(classifier.score(X_test, y_test)))

Testset Accuracy: 0.9201793721973094


## Predicting the Test set results

In [142]:
y_pred = classifier.predict(X_test)

## Making the Confusion Matrix

In [143]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[949   0]
 [ 89  77]]


0.9201793721973094

---

In [145]:
k_nums = [3, 5, 10, 30, 50, 100, 200]

In [146]:
for k in k_nums:
    classifier = KNeighborsClassifier(n_neighbors=k, metric='minkowski', p=2)
    classifier.fit(X_train, y_train)
    print('%s_neighbors accuracy:' % str(k), classifier.score(X_test, y_test))
    print('-------------------------')

3_neighbors accuracy: 0.9363228699551569
-------------------------
5_neighbors accuracy: 0.9201793721973094
-------------------------
10_neighbors accuracy: 0.895067264573991
-------------------------
30_neighbors accuracy: 0.8565022421524664
-------------------------
50_neighbors accuracy: 0.8511210762331839
-------------------------
100_neighbors accuracy: 0.8511210762331839
-------------------------
200_neighbors accuracy: 0.8511210762331839
-------------------------
