# Project 4: Super k-NN

The goal of the project is to build a new ensemble k-NN classifier and to compare its quality, training time, and response time with a standard SVM classifier.
Requirements for the maximum grade are given in parentheses.

DoD.
A project report must be prepared.

1. Datasets: TNG (about 18,000 samples, 20 classes) and MNIST (70,000 samples, 10 classes). For TNG we use ready-made data that represents blog text as vectors (provided by the instructor). We decorrelate the data with the PCA transform.

2. From a single dataset we create several data subsets (>= 5, <= 10) on different feature sets (the masks may be drawn at random, but they should not be dense). Masks may have different lengths. The probability that a feature appears in a subset may be proportional to its importance (e.g. measured by the magnitude of the eigenvalues after the PCA transform). However, no feature may be left out of every subset.

3. For each data subset we compute the average membership of each sample in a given class using a k-NN classifier. We then fuse the classification results (with which fusion rule?) for each sample across the data subsets.

4. How does the quality of the classifier change with k?

5. Classifier quality should be assessed with cross-validation (accuracy, loss/error, ROC curve, precision-recall curve, (areas under the curves), F1).
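
Item 2 can be illustrated with a short sketch. It assumes that feature importances are available as a non-negative vector (for instance the eigenvalues exposed by a fitted PCA as `explained_variance_`); the helper name `sample_feature_masks` and its parameters `min_frac` and `max_frac` are hypothetical and only serve the illustration. Masks are drawn with importance-weighted probabilities, may differ in length, and any feature that was never drawn is appended afterwards so that every feature lands in at least one mask.

```python
import numpy as np


def sample_feature_masks(importances, n_masks=5, min_frac=0.2, max_frac=0.5, seed=0):
    """Draw sparse feature masks with importance-weighted probabilities.

    Hypothetical helper: `importances` is any non-negative vector with
    non-zero entries, e.g. the eigenvalues of a fitted PCA.
    """
    rng = np.random.default_rng(seed)
    importances = np.asarray(importances, dtype=float)
    n_features = importances.size
    p = importances / importances.sum()

    low = max(1, int(min_frac * n_features))
    high = max(low, int(max_frac * n_features))

    masks = []
    for _ in range(n_masks):
        # Masks may differ in length, so the size is drawn per mask.
        size = rng.integers(low, high + 1)
        masks.append(rng.choice(n_features, size=size, replace=False, p=p))

    # Coverage guarantee: spread the never-drawn features over the masks.
    unused = np.setdiff1d(np.arange(n_features), np.concatenate(masks))
    for j, feature in enumerate(unused):
        target = j % n_masks
        masks[target] = np.append(masks[target], feature)

    return [np.sort(mask) for mask in masks]


# Toy importances standing in for PCA eigenvalues (a decaying spectrum).
eigenvalues = 1.0 / np.arange(1, 31)
masks = sample_feature_masks(eigenvalues, n_masks=5)
covered = np.unique(np.concatenate(masks))
print(len(masks), covered.size)
```

Spreading the leftover features round-robin keeps any single mask from becoming dense, which is one possible reading of the "not dense" requirement.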

## Utils

In [41]:
from sklearn.datasets import fetch_openml
import random
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.colors as mcolors

In [42]:
from sklearn.ensemble import ExtraTreesClassifier

rows_size = 0.6

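def divide_dataset(X: np.ndarray,
                   y: np.ndarray,
                   sub_num: int = 10,
                   rows_size: float = 0.6,
                   cols_size: float = 0.7):
    """Split (X, y) into sub_num sub-datasets on different feature subsets.

    Rows are bootstrap-sampled (rows_size is the fraction of rows drawn);
    columns are drawn without replacement (cols_size is the fraction of
    columns) with probability proportional to their ExtraTrees importance.
    The last sub-dataset also receives every column not drawn before.
    Returns (subparts, col_masks): a list of (X_part, y_part) pairs and the
    column indices used for each pair.
    """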
    # get feature importances
    tree = ExtraTreesClassifier()
    tree.fit(X, y)
    feature_importances = tree.feature_importances_
    feature_importances /= feature_importances.sum()

    # prepare indices
    row_indices = list(range(X.shape[0]))
    col_indices = list(range(X.shape[1]))
    num_rows = round(X.shape[0] * rows_size)
    num_cols = round(X.shape[1] * cols_size)

    used_cols = set()
    subparts = []
    col_masks = []
    
    for i in range(sub_num):
        # randomly, uniformly sample rows
        rows = np.random.choice(row_indices,
                                size=num_rows,
                                replace=True)

        # randomly sample X columns with probability distribution relative to
        # the features' importances
        cols = np.random.choice(col_indices,
                                size=num_cols,
                                replace=False,
                                p=feature_importances)
        if i == sub_num - 1:
            # force usage of columns not used before
            used_cols |= set(cols)
            not_used_cols = set(col_indices) - used_cols
            not_used_cols = np.fromiter(not_used_cols,
                                        int,
                                        len(not_used_cols))
            cols = np.sort(np.concatenate((cols, not_used_cols)))

        
        X_part = X[rows, :]
        X_part = X_part[:, cols]
        
        y_part = y[rows]

        used_cols |= set(cols)
        subparts.append((X_part, y_part))
        col_masks.append(cols)
        
    return subparts, col_masks

In [43]:
from sklearn.metrics import accuracy_score, f1_score

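def get_classifiers(dataset, n_neighbors):  # TODO: item 4
    """Fit one k-NN classifier per (X_part, y_part) sub-dataset."""
    classifiers = []
    for x_part, y_part in dataset:
        neigh = KNeighborsClassifier(n_neighbors=n_neighbors)
        neigh.fit(x_part, y_part)
        classifiers.append(neigh)

    return classifiers


def super_k_nn(x, rows_size, divided_train_dataset, classes_count, n_neighbors=3):
    """Classify the rows of x by majority vote over the sub-classifiers.

    Labels are assumed to be castable to int in range(classes_count);
    predictions are returned as strings. rows_size is unused and kept only
    so that existing calls keep working.
    """
    subparts, masks = divided_train_dataset
    classifiers = get_classifiers(subparts, n_neighbors)
    x = np.asarray(x)
    votes = np.zeros((x.shape[0], classes_count), dtype=int)

    for classifier, mask in zip(classifiers, masks):
        # select the columns this classifier was trained on
        predictions = classifier.predict(x[:, mask]).astype(int)
        votes[np.arange(x.shape[0]), predictions] += 1

    # ties are resolved in favour of the lowest class index
    return [str(label) for label in votes.argmax(axis=1)]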
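
Item 3 of the task list asks for the average class membership of each sample, whereas `super_k_nn` counts hard votes. The sketch below shows one possible soft-voting fusion: it averages `predict_proba` over the sub-classifiers and takes the arg-max. It is an illustration under stated assumptions, not the method used for the results in this notebook: `soft_vote_predict` is a hypothetical helper, `load_digits` stands in for MNIST to keep the example small, and the masks are drawn uniformly rather than by feature importance.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def soft_vote_predict(classifiers, masks, X):
    """Average predict_proba over the sub-classifiers and take the arg-max."""
    classes = classifiers[0].classes_
    proba = np.zeros((X.shape[0], len(classes)))
    for clf, mask in zip(classifiers, masks):
        # Assumes every sub-classifier saw all classes during fit, so the
        # predict_proba columns line up across classifiers.
        proba += clf.predict_proba(X[:, mask])
    proba /= len(classifiers)
    return classes[proba.argmax(axis=1)], proba


# Small stand-in dataset (8x8 digit images, 10 classes).
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Five uniform random feature masks of 32 out of 64 features each.
rng = np.random.default_rng(0)
masks = [rng.choice(X.shape[1], size=32, replace=False) for _ in range(5)]
classifiers = [KNeighborsClassifier(n_neighbors=5).fit(X_tr[:, m], y_tr)
               for m in masks]

y_pred, proba = soft_vote_predict(classifiers, masks, X_te)
print("soft-vote accuracy:", (y_pred == y_te).mean())
```

A side benefit of soft voting is that the averaged probabilities can be fed directly into ROC and precision-recall computations, which hard votes do not allow.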

## MNIST

In [44]:
mnist = fetch_openml("mnist_784", data_home="data/mnist_784", cache=True)

In [45]:
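# draw a random subset of 7000 samples; np.asarray also covers the case
# where fetch_openml returns pandas objects instead of NumPy arrays
indices = np.random.choice(len(mnist.target), size=7000, replace=False)
x = np.asarray(mnist.data)[indices]
y = np.asarray(mnist.target)[indices]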

In [46]:
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x) 

pca = PCA(n_components=30)
x_pca = pca.fit_transform(x_scaled)
x_train, x_test, y_train, y_test = train_test_split(x_pca, y, train_size=0.2, test_size=0.02, random_state=1)

In [47]:
mnist_divided_dataset = divide_dataset(x_train, y_train, rows_size=rows_size)
result = super_k_nn(x_test, rows_size, mnist_divided_dataset, 10)
accuracy_score(y_true=y_test, y_pred=result), f1_score(y_true=y_test, y_pred=result, average='macro')

(0.85, 0.8459255372786286)

In [52]:
resultss = []
for i in [1, 3, 5, 10, 20]:
    result = super_k_nn(x_test, rows_size, mnist_divided_dataset, 10, i)
    resultss.append((accuracy_score(y_true=y_test, y_pred=result), f1_score(y_true=y_test, y_pred=result, average='macro')))
    
resultss

[(0.8428571428571429, 0.8310467069412419),
 (0.85, 0.8459255372786286),
 (0.8642857142857143, 0.8563689571298265),
 (0.8428571428571429, 0.8355143011690028),
 (0.8071428571428572, 0.7918472447739331)]
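
The sweep above uses a single train/test split, while item 5 asks for cross-validation and for areas under the ROC and precision-recall curves. The following sketch shows how these could be computed from out-of-fold probabilities. It is only an assumption-laden illustration: it uses a plain `KNeighborsClassifier` on `load_digits` rather than the ensemble and MNIST, and it reports average precision as the summary of the precision-recall curve.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import label_binarize

X, y = load_digits(return_X_y=True)
classes = np.unique(y)
y_bin = label_binarize(y, classes=classes)  # one-vs-rest indicator matrix
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = {}
for k in [1, 3, 5, 10, 20]:
    knn = KNeighborsClassifier(n_neighbors=k)
    # Out-of-fold class probabilities: each sample is scored by a model
    # that did not see it during fitting.
    proba = cross_val_predict(knn, X, y, cv=cv, method="predict_proba")
    y_pred = classes[proba.argmax(axis=1)]
    scores[k] = {
        "accuracy": accuracy_score(y, y_pred),
        "f1_macro": f1_score(y, y_pred, average="macro"),
        "roc_auc_ovr": roc_auc_score(y, proba, multi_class="ovr",
                                     average="macro"),
        "pr_auc_macro": average_precision_score(y_bin, proba,
                                                average="macro"),
    }

for k, row in scores.items():
    print(k, {name: round(value, 4) for name, value in row.items()})
```

The same loop could be applied to the ensemble if it exposed averaged class probabilities instead of hard votes.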

### vs SVM

In [53]:
import warnings

import pandas as pd
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

warnings.filterwarnings('ignore')
%matplotlib inline

In [54]:
from sklearn.neighbors import KNeighborsClassifier

def get_df_row(report):
    return pd.DataFrame(report, columns=['name',
                                         'accuracy (cross-val)',
                                         'f1-score',
                                         'precision',
                                         'recall',
                                        ], index=[0])

def evaluate_svm(X_train, X_test, y_train, y_test, kernel='linear', C=1):
    classifier = SVC(kernel=kernel, C=C)
    classifier.fit(X_train, y_train)
    y_predicted = classifier.predict(X_test)
    # support-weighted averages of precision, recall and F1 on the test set
    report = classification_report(y_test, y_predicted, output_dict=True)['weighted avg']
    report['name'] = 'SVM, C = {}, kernel: {}'.format(C, kernel)
    # mean 5-fold cross-validated accuracy on the training set
    report['accuracy (cross-val)'] = np.mean(cross_val_score(classifier, X_train, y_train, cv=5, scoring='accuracy'))
    return report

def evaluate_all(X_train, X_test, y_train, y_test):
    base_df = pd.DataFrame(columns=['name',
                                    'accuracy (cross-val)',
                                    'f1-score',
                                    'precision',
                                    'recall',
                                   ])
    for C in [1, 5]:
        for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
            report = evaluate_svm(X_train, X_test, y_train, y_test, C=C, kernel=kernel)
            base_df = pd.concat([base_df, get_df_row(report)], ignore_index=True)
    return base_df.sort_values(by='f1-score', ascending=False)

In [55]:
mnist = fetch_openml("mnist_784", data_home="./mnist_784", cache=True)
X_train, X_test, y_train, y_test = train_test_split(mnist.data, mnist.target, train_size=0.008, test_size=0.002, random_state=1)

In [56]:
evaluate_all(X_train, X_test, y_train, y_test)

| | name | accuracy (cross-val) | f1-score | precision | recall |
|---|---|---|---|---|---|
| 6 | SVM, C = 5, kernel: rbf | 0.898214 | 0.864221 | 0.874909 | 0.864286 |
| 2 | SVM, C = 1, kernel: rbf | 0.885714 | 0.847333 | 0.863135 | 0.85 |
| 5 | SVM, C = 5, kernel: poly | 0.835714 | 0.830633 | 0.848176 | 0.828571 |
| 0 | SVM, C = 1, kernel: linear | 0.844643 | 0.828827 | 0.840569 | 0.828571 |
| 4 | SVM, C = 5, kernel: linear | 0.844643 | 0.828827 | 0.840569 | 0.828571 |
| 3 | SVM, C = 1, kernel: sigmoid | 0.826786 | 0.783211 | 0.810759 | 0.785714 |
| 7 | SVM, C = 5, kernel: sigmoid | 0.8125 | 0.774343 | 0.790616 | 0.778571 |
| 1 | SVM, C = 1, kernel: poly | 0.823214 | 0.761675 | 0.795958 | 0.764286 |


In [57]:
from sklearn.preprocessing import StandardScaler

scaler_mnist = StandardScaler().fit(X_train)
x_train = scaler_mnist.transform(X_train)
x_test = scaler_mnist.transform(X_test)

In [58]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score

clf = SVC(C=0.1, kernel='linear', )
clf.fit(x_train,y_train)
y_pred = clf.predict(x_test)
score = accuracy_score(y_test, y_pred)
f1 = f1_score(y_true=y_test, y_pred=y_pred, average='macro')
print('score=%f, f1=%f' %(score, f1))

score=0.842857, f1=0.834900


## TNG

In [59]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

twenty_train = fetch_20newsgroups(shuffle=True, subset='train', random_state=42, categories=categories, remove=('headers', 'footers', 'quotes'))
twenty_test = fetch_20newsgroups(shuffle=True, subset='test', random_state=42, categories=categories, remove=('headers', 'footers', 'quotes'))

In [60]:
x_train = twenty_train.data[:3000]
y_train = twenty_train.target[:3000]
x_test = twenty_test.data[:600]
y_test = twenty_test.target[:600]
print(len(x_train))

# convert text to vectors
vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(x_train)
x_test = vectorizer.transform(x_test)
print(x_train.toarray()[0:2]) 

2257
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [61]:
x_train.shape

(2257, 28865)

In [62]:
x_test.shape

(600, 28865)

In [63]:
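from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only and reuse it for the test data,
# so both sets are scaled with the same statistics. with_mean=False keeps
# the TF-IDF matrices sparse.
scaler_tng = StandardScaler(with_mean=False).fit(x_train)
x_train = scaler_tng.transform(x_train)
x_test = scaler_tng.transform(x_test)

svd = TruncatedSVD(n_components=70)
tng_x_train_scaled = svd.fit_transform(x_train)
x_test_scaled = svd.transform(x_test)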


In [64]:
rows_size=0.8
tng_divided_dataset = divide_dataset(tng_x_train_scaled, y_train, rows_size=rows_size)

resultss = []
for i in [1, 3, 5, 10]:
    result = super_k_nn(x_test_scaled, rows_size, tng_divided_dataset, 4, i)
    result_int = [int(x) for x in result]
    resultss.append((accuracy_score(y_true=y_test, y_pred=result_int), f1_score(y_true=y_test, y_pred=result_int, average='macro')))
    
resultss

[(0.625, 0.6226038638976357),
 (0.6283333333333333, 0.6252852630118809),
 (0.6183333333333333, 0.613907684309133),
 (0.6116666666666667, 0.604921477532103)]

### vs SVM

In [65]:
evaluate_all(tng_x_train_scaled, x_test_scaled, y_train, y_test)

| | name | accuracy (cross-val) | f1-score | precision | recall |
|---|---|---|---|---|---|
| 0 | SVM, C = 1, kernel: linear | 0.754528 | 0.717857 | 0.733966 | 0.718333 |
| 4 | SVM, C = 5, kernel: linear | 0.745225 | 0.704639 | 0.717084 | 0.705 |
| 6 | SVM, C = 5, kernel: rbf | 0.563121 | 0.534212 | 0.604306 | 0.556667 |
| 7 | SVM, C = 5, kernel: sigmoid | 0.519711 | 0.448909 | 0.517712 | 0.473333 |
| 2 | SVM, C = 1, kernel: rbf | 0.459886 | 0.374703 | 0.412271 | 0.433333 |
| 3 | SVM, C = 1, kernel: sigmoid | 0.437746 | 0.323623 | 0.378556 | 0.383333 |
| 1 | SVM, C = 1, kernel: poly | 0.269383 | 0.121204 | 0.077469 | 0.278333 |
| 5 | SVM, C = 5, kernel: poly | 0.272928 | 0.121204 | 0.077469 | 0.278333 |
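
The project goal also calls for comparing training and response times, which the cells in this notebook do not measure. A minimal way to do that is sketched below, assuming wall-clock timing with `time.perf_counter` is sufficient; `time_fit_predict` is a hypothetical helper, and `load_digits` with a single k-NN model stands in for the real data and the ensemble.

```python
import time

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


def time_fit_predict(model, X_train, y_train, X_test):
    """Return (fit_seconds, predict_seconds) measured with a wall clock."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    model.predict(X_test)
    predict_seconds = time.perf_counter() - start
    return fit_seconds, predict_seconds


X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, _ = train_test_split(X, y, test_size=0.2, random_state=0)

timings = {
    "k-NN (k=5)": time_fit_predict(
        KNeighborsClassifier(n_neighbors=5), X_tr, y_tr, X_te),
    "SVM (rbf, C=5)": time_fit_predict(
        SVC(kernel="rbf", C=5), X_tr, y_tr, X_te),
}
for name, (fit_s, predict_s) in timings.items():
    print(f"{name}: fit {fit_s:.4f} s, predict {predict_s:.4f} s")
```

For the ensemble, the fit time would be the sum over all sub-classifiers, and the measurement is best repeated several times because single runs are noisy.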
