# Part 3: SVM with extended Gaussian kernel

## Summary

   In Part 2, we computed the Chamfer distance $D(X,Y)$ between bags.
   
   We now convert the distance $D(X,Y)$ to a kernel $K(X,Y)$ defined as $$K(X,Y)=\exp(-\gamma D(X,Y)),$$ where $\gamma$ is chosen by cross-validation.
   
   scikit-learn can train an SVM with a precomputed kernel when it is given the Gram matrix. The Gram matrix $K$ is defined by $K_{ij}=K(X^{(i)},X^{(j)})$, where $X^{(i)}$ is the $i$-th training example. At prediction time, the classifier instead needs the kernel values between each test example and every training example.
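
As a minimal sketch of this conversion, assume a small made-up symmetric distance matrix in place of the Chamfer distances from Part 2. The names `dist`, `train_idx`, and `test_idx` and the value of `gamma` are illustrative and do not come from the notebook.

```python
import numpy as np

# Toy symmetric distance matrix between 4 bags (made-up values, zero diagonal).
dist = np.array([
    [0.0, 2.0, 4.0, 1.0],
    [2.0, 0.0, 3.0, 5.0],
    [4.0, 3.0, 0.0, 2.0],
    [1.0, 5.0, 2.0, 0.0],
])
gamma = 0.5  # illustrative value; the notebook selects gamma by cross-validation

train_idx = np.array([0, 1, 2])  # hypothetical train/test split
test_idx = np.array([3])

# Training Gram matrix: K_ij = exp(-gamma * D(X_i, X_j)), shape (n_train, n_train).
gram_train = np.exp(-gamma * dist[np.ix_(train_idx, train_idx)])

# Kernel block for prediction: rows are test bags, columns are training bags,
# shape (n_test, n_train).
kernel_test = np.exp(-gamma * dist[np.ix_(test_idx, train_idx)])

print(gram_train.shape)
print(kernel_test.shape)
```

With an `SVC(kernel='precomputed')`, `gram_train` would be passed to `fit` and `kernel_test` to `predict`.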
   
   References: 
   * sklearn: http://scikit-learn.org/stable/modules/svm.html#using-the-gram-matrix
   
   * Page 14 in <a href="http://158.109.8.37/files/Amo2013.pdf#Page=14">Amores' Survey paper</a>   
   

## Load Data (Business labels and distances between bags)

In [1]:
import numpy as np
import pandas as pd 

data_root = '/home/ncchen/Kaggle-Yelp/input/'

train_labels = pd.read_csv(data_root+'train.csv').dropna()
train_labels['labels'] = train_labels['labels'].apply(lambda x: list(sorted(int(t) for t in x.split())))
train_labels.set_index('business_id', inplace=True)
trainbiz_ids = train_labels.index.unique()
y_train = train_labels['labels'].values

print "Number of train business: ", len(trainbiz_ids) ,   "(4 business with missing labels are dropped)\n"
print train_labels[0:5]

print "\nDistance between bags: \n"
bag_df = pd.read_csv(data_root+'train_bag_distance_ResFeatures.csv',index_col=0)
bag_df.columns = bag_df.columns.astype(int)
bag_df.head(5)

Number of train business:  1996 (4 business with missing labels are dropped)

                            labels
business_id                       
1000         [1, 2, 3, 4, 5, 6, 7]
1001                  [0, 1, 6, 8]
100             [1, 2, 4, 5, 6, 7]
1006               [1, 2, 4, 5, 6]
1010                     [0, 6, 8]

Distance between bags: 



| | 1000 | 1001 | 100 | 1006 | 1010 | 101 | 1011 | 1012 | 1014 | 1015 | ... | 982 | 985 | 988 | 989 | 99 | 991 | 993 | 997 | 998 | 999 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000 | 0.0 | 43.63504 | 38.786771 | 38.158427 | 37.521892 | 37.011404 | 37.398529 | 39.298789 | 40.692697 | 35.943411 | ... | 38.927058 | 40.030389 | 42.076054 | 37.021177 | 36.237642 | 37.591032 | 38.638852 | 40.529373 | 36.848072 | 37.515185 |
| 1001 | 43.63504 | 0.0 | 42.286685 | 39.962211 | 40.82572 | 40.987559 | 38.978621 | 40.486661 | 43.100583 | 37.711336 | ... | 40.198127 | 41.402758 | 43.731888 | 40.226369 | 40.726832 | 40.020654 | 37.763465 | 40.388742 | 38.344995 | 40.087026 |
| 100 | 38.786771 | 42.286685 | 0.0 | 36.551154 | 39.436967 | 35.324064 | 36.677628 | 36.679129 | 38.341046 | 35.85399 | ... | 38.517299 | 37.307436 | 41.143644 | 36.841235 | 35.952301 | 35.364403 | 38.466211 | 40.33053 | 36.566883 | 37.420975 |
| 1006 | 38.158427 | 39.962211 | 36.551154 | 0.0 | 37.87289 | 35.991093 | 36.137871 | 37.870416 | 39.362878 | 35.786854 | ... | 37.866887 | 36.044087 | 42.473602 | 35.640873 | 35.492958 | 35.804553 | 37.23142 | 39.978678 | 36.36635 | 35.606697 |
| 1010 | 37.521892 | 40.82572 | 39.436967 | 37.87289 | 0.0 | 37.24315 | 35.625177 | 37.493988 | 40.497968 | 36.318672 | ... | 39.948268 | 38.990243 | 41.353538 | 35.670863 | 36.258009 | 34.848222 | 37.947193 | 40.10945 | 35.955997 | 38.382367 |


## Train SVMs with precomputed kernel

Train one classifier for each attribute separately.

Use 5-fold cross-validation to select the parameters of each classifier.

Remarks:
* Three parameters are tuned: `gamma`, `C`, and `weight`. `gamma` is the $\gamma$ in the definition of the kernel $K(X,Y)$. `C` is the penalty on misclassified training examples: a higher `C` pushes the classifier toward classifying all training examples correctly. `weight` decides whether to balance the class weights (`None` or `"balanced"`). From the <a href="http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html">scikit-learn documentation of class_weight</a>: the “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data.

* `StratifiedKFold` is used so that each fold has approximately the same percentage of positive examples.
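
To make the "balanced" option concrete, here is a small sketch of the weighting rule given in the scikit-learn documentation, `n_samples / (n_classes * np.bincount(y))`, applied to a made-up binary label vector. The helper name `balanced_class_weights` is hypothetical and is not part of scikit-learn.

```python
import numpy as np

def balanced_class_weights(y):
    """Weights inversely proportional to class frequencies.

    Assumes integer labels 0..k-1, each appearing at least once.
    """
    y = np.asarray(y)
    counts = np.bincount(y)
    return len(y) / (len(counts) * counts.astype(float))

# Made-up labels: 8 negative examples and 2 positive examples.
y = np.array([0] * 8 + [1] * 2)
weights = balanced_class_weights(y)
print(weights)
```

In this toy case the common class gets weight 0.625 and the rare class gets weight 2.5, so errors on the rare class cost four times as much.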

In [2]:
from sklearn import svm
# Note: sklearn.cross_validation was removed in scikit-learn 0.20;
# newer releases provide StratifiedKFold in sklearn.model_selection (with a different call signature).
from sklearn.cross_validation import StratifiedKFold
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score

import time
t=time.time()

mlb = MultiLabelBinarizer()
y_train_binary= mlb.fit_transform(y_train) 

best_gamma={i:0 for i in range(9)}
best_C={i:1 for i in range(9)}
best_f1score={i:0 for i in range(9)}
best_weight={i:None for i in range(9)}
predicts={}


for i in range(9):
    skf=StratifiedKFold(y_train_binary[:,i], n_folds=5, shuffle=True, random_state=2001)
    for gamma in np.delete(np.linspace(0, 0.1,20),0):
        for weight in [None, "balanced"]:
            for c in np.linspace(0.5,1.5,11):
                f1score=[]                
                for train_index, test_index in skf:
                    
                    #Get the label for attribute i
                    y_train_attr = y_train_binary[train_index,i]  
                    y_test_attr = y_train_binary[test_index,i]
                  
                    # train_index is a list of indices in [0,1995]
                    # need to convert it to the index used in bag_df 
                    df_train_index =  bag_df.index[train_index]
                    df_test_index = bag_df.index[test_index]
                    
                    ## Get sub-matrices representing mutual distances between bags
                    dist_train = bag_df.loc[df_train_index, df_train_index]
                    ## rows: held-out bags, columns: training bags
                    dist_test = bag_df.loc[df_test_index, df_train_index]

                    clf = svm.SVC(kernel='precomputed',C=c,verbose=False,class_weight=weight)

                    gram = np.exp(-gamma * dist_train.values) ## Gram matrices
                    clf.fit(gram, y_train_attr) 
                    predict=clf.predict(np.exp(-gamma * dist_test.values))
                    f1score.append(f1_score(y_test_attr, predict))
                avg_f1score = np.mean(f1score)  ## Take average over 5 folds.

                if avg_f1score>best_f1score[i]:
                    best_gamma[i]=gamma
                    best_f1score[i]=avg_f1score
                    best_weight[i]=weight
                    best_C[i] = c
                    predicts[i]=predict

print "Time passed: ", "{0:.0f}".format(time.time()-t), "sec"
print "Best Gamma: ", best_gamma
print "Best f1score: ",best_f1score, "     (not indicative)"
print "Best_c: ", best_C
print "Best_weight: ", best_weight
    



Time passed:  2168 sec
Best Gamma:  {0: 0.04736842105263158, 1: 0.036842105263157891, 2: 0.005263157894736842, 3: 0.057894736842105263, 4: 0.026315789473684209, 5: 0.057894736842105263, 6: 0.042105263157894736, 7: 0.010526315789473684, 8: 0.026315789473684209}
Best f1score:  {0: 0.73760583196360052, 1: 0.85023393001437664, 2: 0.88840288477686968, 3: 0.70149063633248987, 4: 0.81205637098201433, 5: 0.89713473965102608, 6: 0.94180186805218313, 7: 0.78840088828161081, 8: 0.89380907967357159}      (not indicative)
Best_c:  {0: 0.90000000000000002, 1: 1.5, 2: 0.69999999999999996, 3: 1.3999999999999999, 4: 1.3, 5: 1.3999999999999999, 6: 1.1000000000000001, 7: 1.3, 8: 1.3}
Best_weight:  {0: 'balanced', 1: 'balanced', 2: 'balanced', 3: None, 4: 'balanced', 5: 'balanced', 6: 'balanced', 7: 'balanced', 8: None}
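
The search above relies on `sklearn.cross_validation`, which newer scikit-learn releases no longer ship. As a rough sketch of the same evaluation step with `sklearn.model_selection`, the example below scores one parameter setting on synthetic data. The helper `cv_f1`, the two-cluster point set, and the Euclidean distance standing in for $D(X,Y)$ are all assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

rng = np.random.RandomState(0)

# Synthetic stand-in for the bag distances: two clusters of 2-D points,
# with the Euclidean distance playing the role of D(X, Y).
points = np.vstack([rng.normal(0.0, 1.0, size=(30, 2)),
                    rng.normal(4.0, 1.0, size=(30, 2))])
y = np.array([0] * 30 + [1] * 30)
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)

def cv_f1(dist, y, gamma, C, class_weight=None, n_splits=5, seed=2001):
    """Mean F1 over stratified folds for one (gamma, C, class_weight) setting."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(np.zeros(len(y)), y):
        # Gram matrix on the training fold, kernel block (test x train) for prediction.
        gram_train = np.exp(-gamma * dist[np.ix_(train_idx, train_idx)])
        kernel_test = np.exp(-gamma * dist[np.ix_(test_idx, train_idx)])
        clf = SVC(kernel='precomputed', C=C, class_weight=class_weight)
        clf.fit(gram_train, y[train_idx])
        scores.append(f1_score(y[test_idx], clf.predict(kernel_test)))
    return float(np.mean(scores))

score = cv_f1(dist, y, gamma=0.5, C=1.0, class_weight='balanced')
print(score)
```

A grid search would call `cv_f1` once per combination of `gamma`, `C`, and `class_weight` and keep the best-scoring combination, as the loop in the cell above does.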


In [3]:
# Save the best parameters found
#best_gamma={0: 0.04736842105263158, 1: 0.036842105263157891, 2: 0.005263157894736842, 3: 0.057894736842105263, 4: 0.026315789473684209, 5: 0.057894736842105263, 6: 0.042105263157894736, 7: 0.010526315789473684, 8: 0.026315789473684209}
#best_C={0: 0.90000000000000002, 1: 1.5, 2: 0.69999999999999996, 3: 1.3999999999999999, 4: 1.3, 5: 1.3999999999999999, 6: 1.1000000000000001, 7: 1.3, 8: 1.3}
#best_weight={0: 'balanced', 1: 'balanced', 2: 'balanced', 3: None, 4: 'balanced', 5: 'balanced', 6: 'balanced', 7: 'balanced', 8: None}

## Tune the decision threshold

With `gamma`, `C`, and `weight` fixed for each attribute, refit the classifiers with `probability=True` and search for a single probability threshold, shared by all nine attributes, that maximizes the micro-averaged F1 score over 5 folds.

In [5]:
from sklearn import svm
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.cross_validation import KFold
from sklearn.metrics import f1_score
import time

kf = KFold(len(bag_df.index), n_folds=5, random_state=2001, shuffle=True)  
mlb = MultiLabelBinarizer()
y_train_binary= mlb.fit_transform(y_train)

t=time.time()

best_f1score=0
best_threshold=0.5


for threshold in np.linspace(0.4,0.6,21):
    f1score=[]
    for train_index, test_index in kf:
        predicts=np.zeros((len(test_index),9))
        for i in range(9):
            #Get the label for attribute i
            y_train_attr = y_train_binary[train_index,i]  
            y_test_attr = y_train_binary[test_index,i]

            # train_index is a list of indices in [0,1995]
            # need to convert the index to the index used in bag_df 
            df_train_index =  bag_df.index[train_index]
            df_test_index = bag_df.index[test_index]

            ## Get sub-matrices representing mutual distances between bags
            dist_train = bag_df.loc[df_train_index, df_train_index]
            ## rows: held-out bags, columns: training bags
            dist_test = bag_df.loc[df_test_index, df_train_index]

            gamma=best_gamma[i]
            class_weight=best_weight[i]
            clf = svm.SVC(kernel='precomputed',C=best_C[i],verbose=False,class_weight=class_weight,probability=True)
            gram = np.exp(-gamma * dist_train.values) ## Gram matrices
            clf.fit(gram, y_train_attr) 
            predict=clf.predict_proba(np.exp(-gamma * dist_test.values))
            predict=(predict[:,1]>threshold)
            predicts[:,i] = predict
        f1score.append(f1_score(y_train_binary[test_index,:], predicts, average='micro'))         
    avg_f1score = np.mean(f1score)
    if avg_f1score>best_f1score:
        best_f1score = avg_f1score
        best_threshold=threshold
        best_model=clf

print "best f1 score: ", best_f1score
print "best threshold: ", best_threshold
print "time passed: ", time.time()-t

best f1 score:  0.847378231217
best threshold:  0.43
time passed:  231.255146027
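
The threshold search above scores each candidate with micro-averaged F1, which pools true positives, false positives, and false negatives over all businesses and attributes before computing one F1 value. The sketch below illustrates that computation on made-up probabilities. The helper `micro_f1` is a hand-written stand-in for `f1_score(..., average='micro')`, and the numbers are invented.

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, FN over all samples and labels."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    denom = 2 * tp + fp + fn
    return 2.0 * tp / denom if denom > 0 else 0.0

# Made-up positive-class probabilities for 3 businesses and 3 attributes.
proba = np.array([[0.45, 0.90, 0.10],
                  [0.42, 0.30, 0.80],
                  [0.70, 0.55, 0.41]])
y_true = np.array([[1, 1, 0],
                   [1, 0, 1],
                   [1, 1, 0]])

for threshold in (0.4, 0.5):
    y_pred = proba > threshold
    print(threshold, micro_f1(y_true, y_pred))
```

In this toy case the threshold 0.4 scores 12/13 (about 0.923) and the threshold 0.5 scores 0.8: lowering the threshold recovers two missed positives at the cost of one false positive.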


## Predict the test set

In [12]:
import numpy as np
import pandas as pd 

data_root = '/home/ncchen/Kaggle-Yelp/input/'

train_labels = pd.read_csv(data_root+'train.csv').dropna()
train_labels['labels'] = train_labels['labels'].apply(lambda x: list(sorted(int(t) for t in x.split())))
train_labels.set_index('business_id', inplace=True)
trainbiz_ids = train_labels.index.unique()


train_bag_df = pd.read_csv(data_root+'train_bag_distance_ResFeatures.csv',index_col=0)
train_bag_df.columns = train_bag_df.columns.astype(int)
test_bag_df = pd.read_csv(data_root+'test_bag_distance_ResFeatures.csv',index_col=0)

X_train = train_bag_df
X_test = test_bag_df
y_train = train_labels['labels'].values

In [13]:
from sklearn import svm
from sklearn.preprocessing import MultiLabelBinarizer


mlb = MultiLabelBinarizer()
y_train_binary= mlb.fit_transform(y_train)  #Convert list of labels to binary matrix
y_matrix=np.zeros((len(X_test),9))  #The predicted labels, saved as a binary matrix
for i in range(9):
    y_train_attr = y_train_binary[:,i]
    gamma=best_gamma[i]
    c=best_C[i]
    class_weight=best_weight[i]
    clf = svm.SVC(kernel='precomputed',C=c, verbose=False,class_weight=class_weight,probability=True)
    gram = np.exp(-gamma * X_train.values)  # Gram matrix of the full training set
    clf.fit(gram, y_train_attr) 
    # Kernel between test bags (rows) and training bags (columns)
    predict = clf.predict_proba(np.exp(-gamma * X_test.values))
    y_matrix[:,i]=(predict[:,1]>best_threshold)

y_label = mlb.inverse_transform(y_matrix)   # Convert binary matrix back to labels. 

In [14]:
statistics = pd.DataFrame(columns=[ "attribute "+str(i) for i in range(9)]+['num_biz'], index = ["biz count", "biz ratio"])
statistics.loc["biz count"] = np.append(np.sum(y_matrix, axis=0).astype(int), len(y_matrix))

statistics.loc["biz ratio"] = statistics.loc["biz count"]*100/len(y_matrix) 
pd.options.display.float_format = '{:.0f}%'.format
statistics

| | attribute 0 | attribute 1 | attribute 2 | attribute 3 | attribute 4 | attribute 5 | attribute 6 | attribute 7 | attribute 8 | num_biz |
|---|---|---|---|---|---|---|---|---|---|---|
| biz count | 862 | 7451 | 8549 | 4869 | 1400 | 9032 | 9294 | 1121 | 7725 | 10000 |
| biz ratio | 9% | 75% | 85% | 49% | 14% | 90% | 93% | 11% | 77% | 100% |


In [15]:
test_data_frame  = pd.read_csv(data_root+"sample_submission.csv")

# Join each predicted label tuple into a space-separated string, e.g. (1, 2, 5) -> "1 2 5".
# As before, row i of y_label is assumed to correspond to row i of sample_submission.csv.
labels = [" ".join(str(l) for l in label) for label in y_label]
df = pd.DataFrame({'business_id': test_data_frame['business_id'].astype(str), 'labels': labels},
                  columns=['business_id','labels'])

df.to_csv(data_root+"BagDistance_submission.csv", index=False)
    