## Preprocessing the Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Read the train and the test data 
train_users = pd.read_csv('train_users_2.csv')
test_users = pd.read_csv('test_users.csv')


# Extracting labels from the train data
train_users_labels = train_users.loc[:,'country_destination']
print (train_users_labels.head(n=5))

# Extracting attributes from the train data
train_users_attrs = train_users.iloc[:,0:15]
print(train_users_attrs.head(n=5))

train_users = train_users_attrs

0      NDF
1      NDF
2       US
3    other
4       US
Name: country_destination, dtype: object
           id date_account_created  timestamp_first_active date_first_booking  \
0  gxn3p5htnn           2010-06-28          20090319043255                NaN   
1  820tgsjxq7           2011-05-25          20090523174809                NaN   
2  4ft3gnwmtx           2010-09-28          20090609231247         2010-08-02   
3  bjjt8pjhuk           2011-12-05          20091031060129         2012-09-08   
4  87mebub9p4           2010-09-14          20091208061105         2010-02-18   

      gender  age signup_method  signup_flow language affiliate_channel  \
0  -unknown-  NaN      facebook            0       en            direct   
1       MALE   38      facebook            0       en               seo   
2     FEMALE   56         basic            3       en            direct   
3     FEMALE   42      facebook            0       en            direct   
4  -unknown-   41         basic           

In [3]:
train_users = train_users.drop(['date_first_booking'], axis=1)
test_users = test_users.drop(['date_first_booking'], axis=1)

In [4]:
# Date is split into 3 parts as year, month and day in both test and train. These are added as
# new features in both test and train

date_acc_created = np.vstack(train_users.date_account_created.astype(str).apply(
        lambda x: list(map(int, x.split('-')))).values)
train_users['created_year'] = date_acc_created[:,0]
train_users['created_month'] = date_acc_created[:,1]
train_users['created_day'] = date_acc_created[:,2]
train_users = train_users.drop(['date_account_created'], axis=1)

date_acc_created_test = np.vstack(test_users.date_account_created.astype(str).apply(
        lambda x: list(map(int, x.split('-')))).values)
test_users['created_year'] = date_acc_created_test[:,0]
test_users['created_month'] = date_acc_created_test[:,1]
test_users['created_day'] = date_acc_created_test[:,2]
test_users = test_users.drop(['date_account_created'], axis=1)
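The date split above can also be written with pandas' datetime accessor instead of string splitting. The following is a minimal sketch, assuming Python 3 and a recent pandas; the toy frame `users` and the helper `split_account_date` are hypothetical names used only for illustration and are not part of the notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for train_users / test_users.
users = pd.DataFrame(
    {"date_account_created": ["2010-06-28", "2011-05-25", "2010-09-28"]}
)

def split_account_date(df):
    """Return a copy of df with created_year/month/day replacing the raw date."""
    out = df.copy()
    created = pd.to_datetime(out["date_account_created"], format="%Y-%m-%d")
    out["created_year"] = created.dt.year
    out["created_month"] = created.dt.month
    out["created_day"] = created.dt.day
    return out.drop(columns=["date_account_created"])

users_split = split_account_date(users)
```

Applying one helper to both frames keeps the train and test transformations identical by construction.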

In [5]:
# Replacing '-unknown-' and null gender values with -1 in both train and test
train_users.loc[ train_users['gender'] == '-unknown-', 'gender'] = -1
train_users.loc[ train_users['gender'].isnull(), 'gender' ] = -1
test_users.loc[ test_users['gender'] == '-unknown-', 'gender'] = -1
test_users.loc[ test_users['gender'].isnull(), 'gender'] = -1

In [6]:
# Encoding Female with 0, Male with 1 and Other with 2 in both test and train data
gender_translation = {'FEMALE' : 0,
                     'MALE' : 1,
                     'OTHER' : 2,
                     -1 : -1 }
for data in [train_users, test_users]:
    data['gender'] = data['gender'].apply(lambda x: gender_translation[x])

In [7]:
# Counting the missing (-1) and the valid gender values
nan_gender_count = len(train_users.loc[train_users['gender'] == -1, 'gender'])
valid_gender_count = len(train_users.gender.values) - nan_gender_count

# Creating a map with the gender distribution
count_map = pd.value_counts(train_users['gender'].values)
print ("Existing gender value distribution")
for k, v in count_map.iteritems():
    if k == -1:
        continue
    print (k, ":", float(v)/float(valid_gender_count))

Existing gender value distribution
(0, ':', 0.5353209412124351)
(1, ':', 0.46228441870536585)
(2, ':', 0.002394640082198993)


In [8]:
# Imputing missing gender values so that they follow the observed gender distribution
for k, v in count_map.iteritems():
    if k == -1:
        continue
    c = int ( nan_gender_count * float(v)/float(valid_gender_count) )
    for i in range(len(train_users.gender.values)):
        if train_users.gender.values[i] == -1:
            train_users.gender.values[i] = k
            c -= 1
        if c == 0:
            break
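# Row 213450 is left unfilled by the loop above (the per-class counts are
# truncated to integers); assign it to 0 (FEMALE)
train_users.gender.values[213450] = 0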
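The loop above fills the missing rows in file order, so the first block of missing rows all receive one value, the next block another, and so on. A possible alternative, shown here only as a sketch, is to draw each missing value at random from the observed frequencies. This is a different technique from the notebook's: it matches the distribution in expectation rather than exactly. The function name `impute_by_distribution`, the `seed` parameter, and the toy `gender` series are assumptions for illustration (Python 3, recent pandas and NumPy).

```python
import numpy as np
import pandas as pd

def impute_by_distribution(values, missing=-1, seed=0):
    """Fill `missing` entries by sampling from the observed value frequencies."""
    values = pd.Series(values).copy()
    is_missing = values == missing
    # Relative frequency of each observed (non-missing) value.
    freqs = values[~is_missing].value_counts(normalize=True)
    rng = np.random.RandomState(seed)
    draws = rng.choice(
        freqs.index.to_numpy(), size=int(is_missing.sum()), p=freqs.to_numpy()
    )
    values[is_missing] = draws
    return values

# Toy data: 0 = FEMALE, 1 = MALE, 2 = OTHER, -1 = missing.
gender = pd.Series([0, 1, -1, 0, -1, 1, 0, -1, 2, 0])
imputed = impute_by_distribution(gender)
```

Because the draws are random, the imputed value no longer depends on a row's position in the file, and no leftover row needs to be patched by index.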

In [9]:
train_users.gender.describe()

count    213451.000000
mean          0.467072
std           0.503691
min           0.000000
25%           0.000000
50%           0.000000
75%           1.000000
max           2.000000
Name: gender, dtype: float64

In [10]:
nan_gender_count = len(test_users.loc[test_users['gender'] == -1, 'gender'])
valid_gender_count = len(test_users.gender.values) - nan_gender_count
count_map = pd.value_counts(test_users['gender'].values)
print ("Existing gender value distribution")
for k, v in count_map.iteritems():
    if k == -1:
        continue
    print (k, ":", float(v)/float(valid_gender_count))

for k, v in count_map.iteritems():
    if k == -1:
        continue
    c = int ( nan_gender_count * float(v)/float(valid_gender_count) )
    for i in range(len(test_users.gender.values)):
        if test_users.gender.values[i] == -1:
            test_users.gender.values[i] = k
            c -= 1
        if c == 0:
            break
test_users.gender.values[62094] = 0

Existing gender value distribution
(0, ':', 0.5116944601469757)
(1, ':', 0.486468343697004)
(2, ':', 0.0018371961560203504)


In [11]:
train_users['age'].describe()

count    125461.000000
mean         49.668335
std         155.666612
min           1.000000
25%          28.000000
50%          34.000000
75%          43.000000
max        2014.000000
Name: age, dtype: float64

In [12]:
# Replacing invalid age with NaN in test and train

train_users.loc[train_users['age'] > 95, 'age'] = np.nan
train_users.loc[train_users['age'] < 16, 'age'] = np.nan
test_users.loc[test_users['age'] > 95, 'age'] = np.nan
test_users.loc[test_users['age'] < 16, 'age'] = np.nan

In [13]:
# Replace missing age with median
print (train_users.age.median())
print (test_users.age.median())
train_users.loc[ train_users['age'].isnull(), 'age' ] = train_users.age.median()
test_users.loc[ test_users['age'].isnull(), 'age' ] = test_users.age.median()

34.0
31.0
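The two age-cleaning cells can be condensed into one helper. The sketch below assumes Python 3 and a recent pandas; `clean_age` and its `fill` parameter are hypothetical. Unlike the notebook, which fills the test set with the test median, the usage lines reuse the train median for the test data. This is a design choice of the sketch, not something the notebook does.

```python
import numpy as np
import pandas as pd

def clean_age(age, low=16, high=95, fill=None):
    """Set ages outside [low, high] to NaN, then fill NaN with a median."""
    age = pd.Series(age, dtype="float64")
    # between() is inclusive on both ends; NaN fails the test and stays NaN.
    valid = age.where(age.between(low, high))
    fill_value = valid.median() if fill is None else fill
    return valid.fillna(fill_value), fill_value

# Toy values, including a birth year entered as an age (2014).
train_age, train_median = clean_age([38, 56, np.nan, 2014, 5, 42])
test_age, _ = clean_age([np.nan, 105, 30], fill=train_median)
```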


In [14]:
# Encoding the signup method in both train and test
signup_translation = {'facebook' : 0,
                     'google' : 1,
                     'basic' : 2,
                     'weibo' : 3}
for data in [train_users, test_users]:
    data['signup_method'] = data['signup_method'].apply(lambda x: signup_translation[x])

In [15]:
# Replacing '-unknown-' language in test with 'en' before encoding
test_users.loc[ test_users['language'] == '-unknown-', 'language'] = "en"

In [16]:
language_encoding = {'en'      :       1       ,
'zh'      :       2       ,
'fr'      :       3       ,
'es'      :       4       ,
'ko'      :       5       ,
'de'      :       6       ,
'it'      :       7       ,
'ru'      :       8       ,
'pt'      :       9       ,
'ja'      :       10      ,
'sv'      :       11      ,
'nl'      :       12      ,
'tr'      :       13      ,
'da'      :       14      ,
'pl'      :       15      ,
'cs'      :       16      ,
'no'      :       17      ,
'el'      :       18      ,
'th'      :       19      ,
'id'      :       20      ,
'hu'      :       21      ,
'fi'      :       22      ,
'ca'      :       23      ,
'is'      :       24      ,
'hr'      :       25}

for data in [train_users, test_users]:
    data['language'] = data['language'].apply(lambda x: language_encoding[x])


In [17]:
# Encoding for affiliate_channel
affiliate_channel_encoding = {'direct' : 1,
                             'sem-brand' : 2,
                             'sem-non-brand' : 3,
                             'other' : 4,
                             'api' : 5,
                             'seo' : 6,
                             'content' : 7,
                             'remarketing' : 8}
for data in [train_users, test_users]:
    data['affiliate_channel'] = data['affiliate_channel'].apply(lambda x: affiliate_channel_encoding[x])

In [18]:
# Encoding for affiliate_provider
affiliate_provider_encoding = {'direct':1,
'google':2,
'other':3,
'craigslist':4,
'bing':5,
'facebook':6,
'vast':7,
'padmapper':8,
'facebook-open-graph':9,
'yahoo':10,
'gsp':11,
'meetup':12,
'email-marketing':13,
'naver':14,
'baidu':15,
'yandex':16,
'wayn':17,
'daum':18}

for data in [train_users, test_users]:
    data['affiliate_provider'] = data['affiliate_provider'].apply(lambda x: affiliate_provider_encoding[x])


In [19]:
# Encoding for first_affiliate_tracked
train_users.loc[ train_users['first_affiliate_tracked'].isnull(), 'first_affiliate_tracked'] = "untracked"
test_users.loc[ test_users['first_affiliate_tracked'].isnull(), 'first_affiliate_tracked'] = "untracked"
first_affiliate_tracked_encoding = {'untracked' : 1,
                                   'linked' : 2,
                                   'omg' : 3,
                                   'tracked-other' : 4,
                                   'product' : 5,
                                   'marketing' : 6,
                                   'local ops' : 7}
for data in [train_users, test_users]:
    data['first_affiliate_tracked'] = data['first_affiliate_tracked'].apply(lambda x: first_affiliate_tracked_encoding[x])


In [20]:
# Encoding for signup_app
signup_app_encoding = {'Web' : 1,
                      'iOS' : 2,
                      'Android' : 3,
                      'Moweb' : 4}
for data in [train_users, test_users]:
    data['signup_app'] = data['signup_app'].apply(lambda x: signup_app_encoding[x])


In [21]:
# Encoding for first_device_type
first_device_type_encoding = { 'Mac Desktop' : 1,
                             'iPhone' : 2,
                             'Windows Desktop' : 3,
                             'Android Phone' : 4,
                             'iPad' : 5,
                             'Android Tablet' : 6,
                             'Other/Unknown' : 7,
                             'Desktop (Other)' : 8,
                             'SmartPhone (Other)' : 9}
for data in [train_users, test_users]:
    data['first_device_type'] = data['first_device_type'].apply(lambda x: first_device_type_encoding[x])

In [22]:
# Encoding for first_browser
first_browser_encoding = {'Chrome':1,
'Safari':2,
'Firefox':3,
'-unknown-':4,
'IE':5,
'Mobile Safari':6,
'Chrome Mobile':7,
'Android Browser':8,
'AOL Explorer':9,
'Opera':10,
'Silk':11,
'Chromium':12,
'BlackBerry Browser':13,
'Maxthon':14,
'IE Mobile':15,
'Apple Mail':16,
'Sogou Explorer':17,
'Mobile Firefox':18,
'RockMelt':19,
'SiteKiosk':20,
'Iron':21,
'IceWeasel':22,
'Pale Moon':23,
'SeaMonkey':24,
'Yandex.Browser':25,
'CometBird':26,
'Camino':27,
'TenFourFox':28,
'wOSBrowser':29,
'CoolNovo':30,
'Avant Browser':31,
'Opera Mini':32,
'Mozilla':33,
'Comodo Dragon':34,
'TheWorld Browser':35,
'Crazy Browser':36,
'Flock':37,
'OmniWeb':38,
'SlimBrowser':39,
'Opera Mobile':40,
'Conkeror':41,
'Outlook 2007':42,
'Palm Pre web browser':43,
'Stainless':44,
'NetNewsWire':45,
'Kindle Browser':46,
'Epic':47,
'Googlebot':48,
'Arora':49,
'Google Earth':50,
'IceDragon':51,
'PS Vita browser':52,
# code 53 is unused: a duplicate 'IBrowse' key here was overridden by the entry below
'UC Browser' : 54,
'IBrowse': 55,
'Nintendo Browser' : 56}


for data in [train_users, test_users]:
    data['first_browser'] = data['first_browser'].apply(lambda x: first_browser_encoding[x])
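The hand-written dictionaries above raise a `KeyError` as soon as a value appears that is not listed. As a sketch of an alternative, the mapping can be derived from the data, and unseen values can be sent to a reserved code. The helpers `build_encoding` and `encode`, the `unknown=0` convention, and the toy series are assumptions for illustration (Python 3, recent pandas). The codes generated this way follow order of first appearance and will generally differ from the hand-assigned ones.

```python
import pandas as pd

def build_encoding(*columns, start=1):
    """Map each distinct value, in order of first appearance, to an integer."""
    seen = pd.unique(pd.concat(list(columns), ignore_index=True))
    return {value: code for code, value in enumerate(seen, start=start)}

def encode(column, encoding, unknown=0):
    """Apply the mapping; values missing from it get the `unknown` code."""
    return column.map(encoding).fillna(unknown).astype(int)

# Toy columns standing in for train_users['signup_app'] / test_users['signup_app'].
train_app = pd.Series(["Web", "iOS", "Web", "Android"])
test_app = pd.Series(["Moweb", "Web", "iOS"])

signup_app_codes = build_encoding(train_app, test_app)
train_codes = encode(train_app, signup_app_codes)
new_codes = encode(pd.Series(["Web", "Tizen"]), signup_app_codes)
```

Building the mapping from both frames together guarantees that train and test share one encoding.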

In [23]:
# Reading sessions data
sessions = pd.read_csv('sessions.csv')

In [24]:
# frequency of each user_id in sessions data
df = sessions['user_id'].value_counts()
print (df.shape)
print (df)


(135483,)
mxqbh3ykxl    2722
0hjoc5q8nf    2644
mjbl6rrj52    2476
l5lgm3w5pc    2424
wg9413iaux    2362
ht8alhs4lt    2335
wyv1imf8qw    2323
monrpvx2md    2264
9z4gim1s4l    2264
h0cjxc177k    2246
a0uhiojrra    2137
vcmr2jh5ix    2085
1m6xnhstmb    2019
p1183hxzc4    1938
e8h4qghxlg    1923
gey51ednme    1919
5vpuk5mssg    1876
j2cvctvqve    1861
yu5bdalz2b    1811
ejpe95pcyo    1797
r541x78s24    1792
qkbkunyzq7    1780
n4s6g3grzf    1779
bfiueza7rt    1753
b1io359wpg    1752
8ikl7vnfa3    1732
e81qfos71y    1701
s5ez13snz0    1685
93dulcecw0    1614
r0rgjqbsvp    1612
              ... 
vlji8fg52x       1
4s2v2hmngj       1
n2rrpf1t3h       1
ua4bebdziw       1
gks02el96u       1
e7l7yocdtk       1
ztvrwgyxm2       1
w5sn4qqiav       1
9o5gi1x2i4       1
kl81vani0y       1
1uaksuktr5       1
c9vanbl9nh       1
n6tcyc7thd       1
cgdsmvs4sw       1
f9ohif5u6w       1
wiru94r12h       1
l28osl4y6x       1
t9o5rwmg1k       1
hjhljq8k89       1
ah2mvtfp74       1
q7xk33e009       1
d8

In [25]:
# Updating session_count for users present in the train data

train_users['session_count'] = 0

for key,val in df.iteritems():
    train_users.loc[train_users[ 'id' ] == key, 'session_count'] = val
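The loop above scans the whole training frame once per session user, which is slow for 135,483 users. The same column can be filled in one vectorized step by mapping each `id` through the counts. A minimal sketch on toy data, assuming Python 3 and a recent pandas; the toy frames are hypothetical.

```python
import pandas as pd

# Toy stand-ins for sessions.csv and train_users.
sessions = pd.DataFrame({"user_id": ["a", "b", "a", "c", "a", None]})
users = pd.DataFrame({"id": ["a", "b", "d"]})

# value_counts() ignores missing user_id values by default.
counts = sessions["user_id"].value_counts()

# Look up each id in the counts; users with no sessions get 0.
users["session_count"] = users["id"].map(counts).fillna(0).astype(int)
```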

   

In [65]:
print (train_users['session_count'].max())


2644


In [39]:
# Encoding for country_destination
country_destination_encoding = {'NDF': 0,
'US' : 1,
'other' : 2,
'FR' : 3,
'IT' : 4,
'GB' : 5,
'ES' : 6,
'CA' : 7,
'DE' : 8,
'NL' : 9,
'AU' : 10,
'PT' : 11}

# Convert series to frame
labels_df = train_users_labels.to_frame()

for data in [labels_df]:
    data['country_destination'] = data['country_destination'].apply(lambda x: country_destination_encoding[x])

    
    

In [40]:
#print train_users_merge.head()
"""
from sklearn import preprocessing

stdscaler = preprocessing.StandardScaler()
train_users_scaled = stdscaler.fit_transform(train_users.values);
train_users = pd.DataFrame(train_users_scaled, columns = train_users.columns)

train_users_merge_scaled = stdscaler.fit_transform(train_users_merge.values);
train_users_merge = pd.DataFrame(train_users_merge_scaled, columns = train_users_merge.columns)

#test_users_scaled = stdscaler.fit_transform(test_users.values);
#test_users = pd.DataFrame(test_users_scaled, columns = test_users.columns)

"""


In [41]:
#train_users['country_destination'] = labels_df
#print(train_users.head())

In [42]:
#train_users_merge['country_destination'] = labels_df
#print(train_users_merge.head())

In [43]:
#train_users.to_csv('train_users_wo_merge_scale.csv',index=False)
#train_users_merge.to_csv('train_users_merge_scale.csv',index=False)

In [44]:
#train_users_merge_wo_scale['country_destination'] = labels_df
#train_users_merge_wo_scale.to_csv('train_users_merge_wo_scale.csv',index=False)
#test_users_merge_wo_scale.to_csv('test_users_merge_wo_scale.csv',index=False)


In [45]:
#train_user_wo_merge_wo_scale['country_destination'] = labels_df
#train_user_wo_merge_wo_scale.to_csv('train_users_wo_merge_wo_scale.csv',index=False)
#test_user_wo_merge_wo_scale.to_csv('test_users_wo_merge_wo_scale.csv',index=False)

In [46]:
def folds_to_split(data,targets,train,test):
    data_tr = pd.DataFrame(data).iloc[train]
    data_te = pd.DataFrame(data).iloc[test]
    labels_tr = pd.DataFrame(targets).iloc[train]
    labels_te = pd.DataFrame(targets).iloc[test]
    return [data_tr, data_te, labels_tr, labels_te]

## Creating the NDCG Scorer

In [48]:
# Reference Kaggle

from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import make_scorer

def dcg_score(y_true, y_score, k=5):
    order = np.argsort(y_score)[::-1]
    y_true = np.take(y_true, order[:k])

    gain = 2 ** y_true - 1

    discounts = np.log2(np.arange(len(y_true)) + 2)
    return np.sum(gain / discounts)


#def ndcg_score(ground_truth, predictions, k=5):
def ndcg_score(te_labels, predict, k):
    
    lb = LabelBinarizer()
    # One column per class: predict has shape (n_samples, n_classes)
    lb.fit(range(np.shape(predict)[1]))
    T = lb.transform(te_labels)

    scores = []

    # Iterate over each y_true and compute the DCG score
    for y_true, y_score in zip(T, predict):
        actual = dcg_score(y_true, y_score, k)
        best = dcg_score(y_true, y_true, k)
        if best == 0:
            best = 0.000000001
        score = float(actual) / float(best)
        scores.append(score)
    return np.mean(scores)


# NDCG Scorer function
ndcg_scorer = make_scorer(ndcg_score, needs_proba=True, k=5)
print ndcg_scorer

make_scorer(ndcg_score, needs_proba=True, k=5)
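Since each user has exactly one true destination, the relevance vector is one-hot and the ideal DCG is 1. The NDCG for a single user therefore reduces to 1 / log2(rank + 1), where rank is the 1-based position of the true class among the top k predictions, or 0 if the true class is not in the top k. The worked example below is a sketch in plain Python 3; `ndcg_single` is a hypothetical helper, and its tie-breaking may differ from the `np.argsort`-based ordering used above.

```python
import math

def ndcg_single(true_class, scores, k=5):
    """NDCG@k for one sample that has exactly one relevant class."""
    # Class indices sorted by descending score, truncated to the top k.
    ranking = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)[:k]
    if true_class not in ranking:
        return 0.0
    position = ranking.index(true_class)  # 0-based rank of the true class
    return 1.0 / math.log2(position + 2)

probs = [0.60, 0.25, 0.10, 0.05]
first = ndcg_single(0, probs)        # true class ranked 1st -> 1.0
second = ndcg_single(1, probs)       # true class ranked 2nd -> 1 / log2(3)
missed = ndcg_single(3, probs, k=3)  # true class outside the top 3 -> 0.0
```

A prediction that ranks the true class second still earns about 0.63, which is why the NDCG scores in this notebook are well above the corresponding accuracy scores.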


In [85]:
#print train_users.head()
train_users=train_users.drop(['id'], axis=1)
#print train_users.head()

## Modeling

## Part 1

In [50]:
from sklearn.naive_bayes import GaussianNB
from sklearn import preprocessing, cross_validation
from sklearn.ensemble import AdaBoostClassifier

## 1. Linear Discriminant Analysis

In [52]:
# LDA using test-train split
from sklearn.lda import LDA


[tr_data, te_data, tr_labels, te_labels] = cross_validation.train_test_split(train_users, labels_df, test_size=0.33,
                                                                             random_state=20160302)
lda_clf = LDA()
lda_clf.fit(tr_data, tr_labels.values.ravel())
ground_truth = te_labels.as_matrix()
prob_lda = lda_clf.predict_proba(te_data)
score_lda = ndcg_score(ground_truth, prob_lda, k=5)
    

In [54]:
print "NDCG Score for LDA:"
print score_lda

NDCG Score for LDA:
0.809272396586


In [55]:
print "Accuracy Score for LDA:"
print lda_clf.score(te_data,te_labels)

Accuracy Score for LDA:
0.588565993271


## 2. Naive Bayes

In [61]:
# Naive Bayes using test-train split
from sklearn.naive_bayes import GaussianNB
from sklearn import preprocessing, cross_validation

gnb = GaussianNB()
[tr_data, te_data, 
 tr_labels, te_labels] = cross_validation.train_test_split(train_users, labels_df, test_size=0.33,random_state=20160302)
gnb.fit(tr_data, tr_labels.values.ravel())   

GaussianNB()

In [62]:
prob_arr = gnb.predict_proba(te_data)

In [63]:
ground_truth = te_labels.as_matrix()
predictions = prob_arr
score = ndcg_score(ground_truth, predictions, k=5)
print score

0.806278567816


In [64]:
print "NDCG Score for Naive Bayes:"
print score

NDCG Score for Naive Bayes:
0.806278567816


In [65]:
print "Accuracy Score for Naive Bayes:"
print gnb.score(te_data,te_labels)

Accuracy Score for Naive Bayes:
0.581013359077


## 3. AdaBoost with Naive Bayes

In [66]:
# Adaboost with Naive Bayes

from sklearn.ensemble import AdaBoostClassifier

clf1 = AdaBoostClassifier(base_estimator=gnb, n_estimators=1000)
clf1.fit(tr_data, tr_labels.values.ravel())
prob_ada_nb = clf1.predict_proba(te_data)
score = ndcg_score(ground_truth, prob_ada_nb, k=5)
print "NDCG Score for Adaboost with Naive Bayes:"
print score

NDCG Score for Adaboost with Naive Bayes:
0.807958234839


In [67]:
print "Accuracy Score for Adaboost with Naive Bayes:"
print clf1.score(te_data,te_labels)

Accuracy Score for Adaboost with Naive Bayes:
0.585442723491


## 4. Quadratic Discriminant Analysis

In [68]:
# QDA with test-train split

from sklearn.qda import QDA


[tr_data, te_data, tr_labels, te_labels] = cross_validation.train_test_split(train_users, labels_df, test_size=0.33,
                                                                             random_state=20160302)
qda_clf = QDA()
qda_clf.fit(tr_data, tr_labels.values.ravel())
ground_truth = te_labels.as_matrix()
prob_qda = qda_clf.predict_proba(te_data)
score_qda = ndcg_score(ground_truth, prob_qda, k=5)
print score_qda




0.356243751631


In [69]:
print "NDCG Score for QDA:"
print score_qda

NDCG Score for QDA:
0.356243751631


In [70]:
print "Accuracy Score for QDA:"
print qda_clf.score(te_data,te_labels)

Accuracy Score for QDA:
0.0951745481906


## 5. Gradient Boosting Classifier

In [72]:
# GradientBoostingClassifier with test-train split, max_depth=3, n_estimators=100

from sklearn.ensemble import GradientBoostingClassifier

gb_clf = GradientBoostingClassifier(max_depth=3, n_estimators=100, random_state=20160302)
gb_clf.fit(tr_data, tr_labels.values.ravel())

# NDCG of GradientBoostingClassifier
prob_gb = gb_clf.predict_proba(te_data)
score_gb = ndcg_score(ground_truth, prob_gb, k=5)
print "NDCG Score for Gradient Boosting Classifier:"
print score_gb


NDCG Score for Gradient Boosting Classifier:
0.824479634631


In [73]:
print "Accuracy Score for Gradient Boosting Classifier:"
print gb_clf.score(te_data,te_labels)

Accuracy Score for Gradient Boosting Classifier:
0.630346824912


In [74]:
# GradientBoostingClassifier with max_depth=3, n_estimators=500

from sklearn.ensemble import GradientBoostingClassifier

gb_clf2 = GradientBoostingClassifier(max_depth=3, n_estimators=500, random_state=20160302)
gb_clf2.fit(tr_data, tr_labels.values.ravel())

# NDCG of GradientBoostingClassifier
prob_gb2 = gb_clf2.predict_proba(te_data)
score_gb2 = ndcg_score(ground_truth, prob_gb2, k=5)
print "NDCG Score 2 for Gradient Boosting Classifier:"
print score_gb2

NDCG Score 2 for Gradient Boosting Classifier:
0.82377984146


In [75]:
print "Accuracy Score 2 for Gradient Boosting Classifier:"
print gb_clf2.score(te_data,te_labels)

Accuracy Score 2 for Gradient Boosting Classifier:
0.630517185082


## Part 2

## Using 10-fold cross validation for analysis

LDA

In [78]:
# LDA with 10-fold cross-validation

foldnum = 0
fold_results = pd.DataFrame()
for train, test in cross_validation.KFold(len(train_users), n_folds=10, random_state=20160302, shuffle=True):
    foldnum+=1
    [tr_data, te_data, tr_labels, te_labels] = folds_to_split(train_users,labels_df,train,test)
    lda = LDA()
    lda.fit(tr_data, tr_labels.values.ravel())
    prob_arr_lda = lda.predict_proba(te_data)
    #ground_truth = te_labels.as_matrix()
    #fold_results.loc[foldnum, 'Accuracy'] = gnb.score(te_data,te_labels)
    score_lda = ndcg_score(te_labels.as_matrix(), prob_arr_lda, k=5)
    fold_results.loc[foldnum, 'Ndcg_LDA'] = score_lda
    
print fold_results.mean()   

Ndcg_LDA    0.807742
dtype: float64




Naive Bayes

In [79]:
# Naive Bayes with 10-fold cross-validation
foldnum = 0
fold_results = pd.DataFrame()
for train, test in cross_validation.KFold(len(train_users), n_folds=10, random_state=20160302, shuffle=True):
    foldnum+=1
    [tr_data, te_data, tr_labels, te_labels] = folds_to_split(train_users,labels_df,train,test)
    gnb1 = GaussianNB()
    gnb1.fit(tr_data, tr_labels.values.ravel())
    prob_arr_gnb1 = gnb1.predict_proba(te_data)
    #ground_truth = te_labels.as_matrix()
    #fold_results.loc[foldnum, 'Accuracy'] = gnb.score(te_data,te_labels)
    score_gnb1 = ndcg_score(te_labels.as_matrix(), prob_arr_gnb1, k=5)
    fold_results.loc[foldnum, 'Ndcg_Gnb'] = score_gnb1
    
print fold_results.mean()  

Ndcg_Gnb    0.805354
dtype: float64


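Using 10-fold cross-validation produces results similar to the single test-train split for both classifiers: the mean NDCG is 0.8077 for LDA (0.8093 with the split) and 0.8054 for Naive Bayes (0.8063 with the split).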
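For reference, shuffled k-fold splitting amounts to permuting the row indices, cutting them into k nearly equal parts, and using each part once as the test set. The sketch below uses NumPy only (Python 3); `kfold_indices` is a hypothetical helper, and its fold assignment is not guaranteed to match the one produced by scikit-learn's `KFold` for the same seed.

```python
import numpy as np

def kfold_indices(n_samples, n_folds=10, seed=20160302):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, n_folds)  # fold sizes differ by at most 1
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train, test

splits = list(kfold_indices(25, n_folds=10))
```

Every row appears in exactly one test fold, so the mean over folds uses each labelled example once for evaluation.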

## Part 3

## Using scaled data for analysis

In [80]:
from sklearn import preprocessing
train_users_scaled = pd.DataFrame(preprocessing.StandardScaler().fit_transform(train_users))
print train_users_scaled.head(n=5)

         0         1         2         3         4         5         6   \
0 -4.380020 -0.927300 -0.163283 -1.596552 -0.427798 -0.141579 -0.582242   
1 -4.357961  1.058047  0.287705 -1.596552 -0.427798 -0.141579  2.556797   
2 -4.348661 -0.927300  2.317149  0.628333 -0.035009 -0.141579 -0.582242   
3 -4.303076 -0.927300  0.738692 -1.596552 -0.427798 -0.141579 -0.582242   
4 -4.283949 -0.927300  0.625945  0.628333 -0.427798 -0.141579 -0.582242   

         7         8         9         10        11        12        13  \
0 -0.468760 -0.798954 -0.359375 -0.876174 -0.971889 -3.222044 -0.006939   
1  0.251719 -0.798954 -0.359375 -0.876174 -0.971889 -2.156499 -0.315897   
2 -0.468760 -0.798954 -0.359375  0.324910  1.091035 -3.222044  0.919936   
3 -0.468760 -0.798954 -0.359375 -0.876174  0.059573 -2.156499  1.846811   
4 -0.468760 -0.798954 -0.359375 -0.876174 -0.971889 -3.222044  0.919936   

         14        15  
0  1.387946 -0.345061  
1  1.044700 -0.345061  
2  1.387946 -0.345061  
3 

LDA with 10-fold cross-validation using scaled data

In [81]:
# LDA with 10-fold cross-validation using scaled data

foldnum = 0
fold_results = pd.DataFrame()
for train, test in cross_validation.KFold(len(train_users_scaled), n_folds=10, random_state=20160302, shuffle=True):
    foldnum+=1
    [tr_data, te_data, tr_labels, te_labels] = folds_to_split(train_users_scaled,labels_df,train,test)
    lda2 = LDA()
    lda2.fit(tr_data, tr_labels.values.ravel())
    prob_arr_lda2 = lda2.predict_proba(te_data)
    #ground_truth = te_labels.as_matrix()
    #fold_results.loc[foldnum, 'Accuracy'] = gnb.score(te_data,te_labels)
    score_lda2 = ndcg_score(te_labels.as_matrix(), prob_arr_lda2, k=5)
    fold_results.loc[foldnum, 'Ndcg_LDA'] = score_lda2
    
print fold_results.mean() 

Ndcg_LDA    0.807742
dtype: float64


Naive Bayes with 10-fold cross-validation using scaled data

In [82]:
# Naive Bayes with 10-fold cross-validation using scaled data
foldnum = 0
fold_results = pd.DataFrame()
for train, test in cross_validation.KFold(len(train_users_scaled), n_folds=10, random_state=20160302, shuffle=True):
    foldnum+=1
    [tr_data, te_data, tr_labels, te_labels] = folds_to_split(train_users_scaled,labels_df,train,test)
    gnb2 = GaussianNB()
    gnb2.fit(tr_data, tr_labels.values.ravel())
    prob_arr_gnb2 = gnb2.predict_proba(te_data)
    #ground_truth = te_labels.as_matrix()
    #fold_results.loc[foldnum, 'Accuracy'] = gnb.score(te_data,te_labels)
    score_gnb2 = ndcg_score(te_labels.as_matrix(), prob_arr_gnb2, k=5)
    fold_results.loc[foldnum, 'Ndcg_Gnb'] = score_gnb2
    
print fold_results.mean() 

Ndcg_Gnb    0.753768
dtype: float64


Scaling does not improve the score for either classifier: the mean NDCG for LDA is unchanged at 0.8077, while the score for Naive Bayes drops from 0.8054 to 0.7538.
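Standard scaling subtracts each column's mean and divides by its standard deviation, so every feature ends up with mean 0 and unit variance. The sketch below reproduces that computation in NumPy (Python 3) on a toy matrix; `standardize` is a hypothetical helper, and the handling of constant columns is an assumption of this sketch.

```python
import numpy as np

def standardize(X):
    """Column-wise (x - mean) / std, using the population standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)       # ddof=0, i.e. population standard deviation
    std[std == 0] = 1.0       # leave constant columns unscaled
    return (X - mean) / std

# Toy matrix: a year-like column and a small integer code column.
X = np.array([[2010.0, 0.0], [2012.0, 3.0], [2014.0, 0.0]])
Z = standardize(X)
```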

## Observations

Naive Bayes Classification

1. Naive Bayes with test-train split : The base accuracy of the Naive Bayes Classifier was found to be 0.581. The probabilities generated by Naive Bayes were used as predictions for the NDCG scorer and the score was 0.806.

2. Naive Bayes with 10-fold cross-validation: We used 10-fold cross-validation to split our data into test and train sets and fit the Naive Bayes classifier on each fold. This gave a mean NDCG score of 0.805, close to the 0.806 obtained with the test-train split.

3. Naive Bayes with 10-fold cross-validation using scaled data: We used StandardScaler to scale the data. Scaling was not useful here: the NDCG score decreased to 0.75.

4. AdaBoost with Naive Bayes: We performed AdaBoost multi-class classification with the Naive Bayes classifier as the base estimator. This was repeated for n_estimators = 500 and n_estimators = 1000, and both experiments gave similar results. The accuracy was 0.585 and the NDCG score increased marginally to 0.808.



Linear Discriminant Analysis

1. LDA with test-train split : The base accuracy of the LDA Classifier was found to be 0.59. The probabilities generated by LDA were used as predictions for the NDCG scorer and the score was 0.81.

2. LDA with 10-fold cross-validation: We used 10-fold cross-validation to split our data into test and train sets and fit the LDA classifier on the train folds. This gave a mean NDCG score of 0.807, slightly lower than the 0.809 obtained with the test-train split.

3. LDA with 10-fold cross-validation using scaled data: We used StandardScaler to scale the data. The NDCG score was 0.807, the same as with 10-fold cross-validation on unscaled data.



Quadratic Discriminant Analysis

QDA was run with both a test-train split and 10-fold cross-validation, but neither setup gave a high NDCG score. The maximum score achieved was 0.356.

Gradient Boosting Classification

In our experiments, we set max_depth to 3 and 5 and n_estimators to 100 and 500. The base estimator was left at its default, init = loss.init_estimator. With n_estimators = 100 and max_depth = 3, the accuracy was 0.63 and the NDCG score was 0.8245.