Why do the linear kernel and the RBF kernel select support vectors at such different rates? (Why does only the linear kernel end up with a sparse set of support vectors?)

More specifically, what kind of classification boundary does each kernel, linear and RBF, actually learn?
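
For reference, the two decision functions can be written in standard soft-margin SVM notation. This is textbook notation, not something derived from this notebook's outputs; $SV$ denotes the set of support vectors and $\gamma$ the RBF width parameter.

```latex
\begin{align}
f(x) &= \sum_{i \in SV} \alpha_i y_i \, K(x_i, x) + b, \qquad 0 < \alpha_i \le C \\
K_{\text{linear}}(x_i, x) &= x_i^\top x
  \;\Rightarrow\; f(x) = w^\top x + b, \quad w = \sum_{i \in SV} \alpha_i y_i x_i \\
K_{\text{rbf}}(x_i, x) &= \exp\!\left(-\gamma \, \lVert x_i - x \rVert^2\right)
\end{align}
```

With the linear kernel the sum collapses into a single weight vector $w$, so the boundary is one hyperplane. With the RBF kernel the boundary is a weighted sum of Gaussian bumps centered on the support vectors, which is one plausible reason why many more training points may be needed to describe it.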

Let's look at just the l30_r15 dataset.

In [36]:
from sklearn.svm import SVC
from sklearn.preprocessing import normalize
from py.utils import load_data
import pickle
import numpy as np

import warnings
warnings.filterwarnings('ignore')

directory = '../data/'
head = 'l30_r15'

x, y, x_words, vocabs = load_data(head, directory)

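from scipy.sparse import csr_matrix
# Binary (presence/absence) copy of x, used to count each feature's document frequency.
# Passing shape explicitly keeps B aligned with x even if trailing rows/columns are empty.
rows, cols = x.nonzero()
B = csr_matrix((np.ones(len(rows), dtype=int), (rows, cols)), shape=x.shape)
df_vocabs = B.sum(axis=0).tolist()[0]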

x = normalize(x)

x shape = (15106, 2770)
y shape = (15106,)
# features = 2770
# L words = 15106


In [37]:
svm_linear = SVC(C=1.0, kernel='linear')
svm_linear.fit(x, y)
# Prints the fraction (not the percentage) of training rows kept as support vectors
print('%s support vector %.3f' % ('%', svm_linear.n_support_.sum()/x.shape[0]))

% support vector 0.027


In [38]:
svm_rbf = SVC(C=1.0, kernel='rbf')
svm_rbf.fit(x, y)
# Prints the fraction (not the percentage) of training rows kept as support vectors
print('%s support vector %.3f' % ('%', svm_rbf.n_support_.sum()/x.shape[0]))

% support vector 0.332
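
One detail that may matter when comparing the two fits: `normalize(x)` uses the L2 norm by default, so every nonzero row has unit length. For unit vectors, `||u - v||^2 = 2 - 2 * (u . v)`, which means the RBF kernel is just a monotone function of the same dot product that the linear kernel uses. The sketch below checks this identity on two toy vectors; the vectors and the `gamma` value are made up for illustration and are not taken from the l30_r15 data.

```python
import math

def l2_normalize(v):
    """Scale a nonzero vector to unit L2 length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def rbf_from_distance(u, v, gamma):
    """RBF kernel computed from the squared Euclidean distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def rbf_from_dot(u, v, gamma):
    """Same kernel, rewritten for unit vectors using only the dot product."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.exp(-gamma * (2.0 - 2.0 * dot))

# Toy vectors and a hypothetical gamma, chosen only for illustration
u = l2_normalize([3.0, 0.0, 4.0])
v = l2_normalize([1.0, 2.0, 2.0])
gamma = 0.5

k_dist = rbf_from_distance(u, v, gamma)
k_dot = rbf_from_dot(u, v, gamma)
print(k_dist, k_dot)
```

The two printed values should agree up to floating-point error. So on this data the kernels differ in how they transform the dot product, not in which similarity they measure.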


## A closer look at the linear kernel

In [39]:
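# support_ lists support-vector row indices grouped by class (in classes_ order);
# n_support_[0] is the count for the first class, treated here as the negative class.
neg = svm_linear.support_[:svm_linear.n_support_[0]].tolist()
pos = svm_linear.support_[svm_linear.n_support_[0]:].tolist()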

neg_words = [x_words[i] for i in neg]
pos_words = [x_words[i] for i in pos]

In [40]:
B_pos = B[pos]
B_pos.data = np.asarray([1]*len(B_pos.data))
used_df_pos = B_pos.sum(axis=0).tolist()[0]
used_pos_features = {j for j, df in enumerate(used_df_pos) if df > 0}
n_used_pos_features = len(used_pos_features)
average_pos_feature_df = sum([i for i in used_df_pos if i > 0]) / n_used_pos_features

B_neg = B[neg]
B_neg.data = np.asarray([1]*len(B_neg.data))
used_df_neg = B_neg.sum(axis=0).tolist()[0]
used_neg_features = {j for j, df in enumerate(used_df_neg) if df > 0}
n_used_neg_features = len(used_neg_features)
average_neg_feature_df = sum([i for i in used_df_neg if i > 0]) / n_used_neg_features

# Features that appear in at least one support vector of either class
used_features = used_pos_features | used_neg_features

In [41]:
print('n used pos features= %d, average df = %.2f' % (n_used_pos_features, average_pos_feature_df))
print('n used neg features= %d, average df = %.2f' % (n_used_neg_features, average_neg_feature_df))
print('n used features= %d' % len(used_features))

n used pos features= 622, average df = 4.29
n used neg features= 1030, average df = 3.71
n used features= 1442


In [42]:
n_nouse_in_top_features = 0
# Top 500 features by document frequency: count those that no support vector uses
for j, vocab in sorted(enumerate(vocabs), key=lambda jv: df_vocabs[jv[0]], reverse=True)[:500]:
    if not (j in used_features):
        n_nouse_in_top_features += 1
#         print('%s (df= %d, used= %r)' % (vocab, df_vocabs[j], j in used_features))

# print('\n%s\n' % ('-'*80))
n_use_in_bottom_features = 0
# Bottom 1000 features by document frequency: count those that some support vector uses
for j, vocab in sorted(enumerate(vocabs), key=lambda jv: df_vocabs[jv[0]], reverse=False)[:1000]:
    if (j in used_features):
        n_use_in_bottom_features += 1
#         print('%s (df= %d, used= %r)' % (vocab, df_vocabs[j], j in used_features))
        
print('not used in top 500 features= %d' % n_nouse_in_top_features)
print('used in bottom 1000 features= %d' % n_use_in_bottom_features)

not used in top 500 features= 73
used in bottom 1000 features= 368


In [43]:
n_used_order_by_df = 0
# Walk features from highest to lowest document frequency and track how many are used
for n, (j, vocab) in enumerate(sorted(enumerate(vocabs), key=lambda jv: df_vocabs[jv[0]], reverse=True)):
    if (j in used_features):
        n_used_order_by_df += 1
    if (n + 1) % 100 == 0:
        print('top %d, used= %d (%.3f)' % (n+1, n_used_order_by_df, 100*n_used_order_by_df/(n+1)))

top 100, used= 99 (99.000)
top 200, used= 194 (97.000)
top 300, used= 274 (91.333)
top 400, used= 353 (88.250)
top 500, used= 427 (85.400)
top 600, used= 495 (82.500)
top 700, used= 561 (80.143)
top 800, used= 621 (77.625)
top 900, used= 679 (75.444)
top 1000, used= 730 (73.000)
top 1100, used= 775 (70.455)
top 1200, used= 822 (68.500)
top 1300, used= 876 (67.385)
top 1400, used= 928 (66.286)
top 1500, used= 965 (64.333)
top 1600, used= 1008 (63.000)
top 1700, used= 1046 (61.529)
top 1800, used= 1080 (60.000)
top 1900, used= 1117 (58.789)
top 2000, used= 1157 (57.850)
top 2100, used= 1204 (57.333)
top 2200, used= 1244 (56.545)
top 2300, used= 1280 (55.652)
top 2400, used= 1317 (54.875)
top 2500, used= 1359 (54.360)
top 2600, used= 1392 (53.538)
top 2700, used= 1425 (52.778)


In [44]:
from collections import Counter
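# dual_coef_ holds y_i * alpha_i for each support vector (a sparse matrix when x is sparse)
alpha_count = Counter(svm_linear.dual_coef_.data)
# len(alpha_count) is the number of distinct coefficient values, not the number of support vectors
print('number of alphas = %d' % len(alpha_count))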
print('number of support vectors = %d\n' % len(svm_linear.dual_coef_.data))

for alpha, count in sorted(alpha_count.items(), key=lambda x:(x[1], abs(x[0])), reverse=True)[:50]:
    if count == 1: 
        print('... count=1')
        break
    print('alpha= %.3f, count=%d (%.3f)' % (alpha, count, count/svm_linear.n_support_.sum()))

number of alphas = 130
number of support vectors = 405

alpha= 1.000, count=143 (0.353)
alpha= -1.000, count=134 (0.331)
... count=1
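
Since scikit-learn stores `y_i * alpha_i` in `dual_coef_`, a value of ±1.000 with `C=1.0` corresponds to `alpha_i = C`. These are bounded support vectors: points on the wrong side of the margin, or misclassified. In the output above they account for 143 + 134 = 277 of the 405 support vectors. The sketch below separates bounded from free support vectors; the helper name `split_bounded_free`, the tolerance, and the toy coefficients are assumptions made for illustration.

```python
import math

def split_bounded_free(dual_coefs, C=1.0, tol=1e-9):
    """Split dual coefficients (y_i * alpha_i) into those at the bound
    |coef| == C and those strictly inside (0 < |coef| < C)."""
    bounded, free = [], []
    for a in dual_coefs:
        if math.isclose(abs(a), C, abs_tol=tol):
            bounded.append(a)
        else:
            free.append(a)
    return bounded, free

# Toy coefficients for illustration only
toy_coefs = [1.0, -1.0, 0.37, -0.82, 1.0, -0.05]
bounded, free = split_bounded_free(toy_coefs, C=1.0)
print('bounded= %d, free= %d' % (len(bounded), len(free)))
```

In the notebook this would presumably be called as `split_bounded_free(svm_linear.dual_coef_.data, C=1.0)`. The free support vectors lie exactly on the margin and determine the hyperplane's position, while the bounded ones indicate where the data is not cleanly separable.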
