<center>
<img src="https://habrastorage.org/files/fd4/502/43d/fd450243dd604b81b9713213a247aa20.jpg" />
</center> 
     

## <center> Kaggle inclass competition from [mlcourse.ai](https://mlcourse.ai/)
    
# <center> [**Catch me if you can**](https://inclass.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2)

### <center> Session: Fall 2019

#### <div style="text-align: right"> Author: [Vladimir Kulyashov](https://github.com/koolvn)


<div style="text-align: right"> creation date: 15 October 2019 </div>

**Assumptions supported by EDA:**
1. Alice lives in France
2. Data was collected at the university during working hours
3. Alice used the PC mostly for watching videos and for social networks
4. Alice doesn't use Gmail, Google+, or Bing


**Goal:**
Beat the A3 strong baseline (0.95965) with as few features as possible

In [25]:
# Import libraries and set desired options
import os
import pickle
import numpy as np
import pandas as pd
from scipy.sparse import hstack
# !pip install eli5
import eli5
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import TimeSeriesSplit, cross_val_score, GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression, SGDClassifier
from matplotlib import pyplot as plt
import seaborn as sns
from IPython.display import display_html

In [26]:
PATH_TO_DATA = '../data/alice/'
filename = 'submission_1.csv'
pred_path = './predictions/alice/'
SEED = 17
time_split = TimeSeriesSplit(n_splits=10)
logit = LogisticRegression(C=1, random_state=SEED, solver='liblinear')
sgdc = SGDClassifier(random_state=SEED)
BEST_LOGIT_C = 5.0118  # alternative value: 2.5118864315095824
BEST_LOGIT_TOL = 0.045

In [27]:
def prepare_sparse_features(path_to_train, path_to_test, path_to_site_dict,
                           vectorizer_params):
    """Build sparse TF-IDF site features and auxiliary frames from the input files.

    Reads the train and test CSV files and the pickled site dictionary, then
    vectorizes the sessions with a TfidfVectorizer configured by `vectorizer_params`.

    Returns: X_train, X_test, y_train, vectorizer, train_times, test_times,
    train_sites, test_sites, top_alice_sites
    """
    
    times = ['time%s' % i for i in range(1, 11)]
    train_df = pd.read_csv(path_to_train,
                       index_col='session_id', parse_dates=times)
    test_df = pd.read_csv(path_to_test,
                      index_col='session_id', parse_dates=times)

    # Sort the data by time
    train_df = train_df.sort_values(by='time1')
    
    # read site -> id mapping provided by competition organizers 
    with open(path_to_site_dict, 'rb') as f:
        site2id = pickle.load(f)
    # create an inverse id -> site mapping
    id2site = {v:k for (k, v) in site2id.items()}
    # we treat site with id 0 as "unknown"
    id2site[0] = 'unknown'
    
    # Transform data into format which can be fed into TfidfVectorizer
    # This time we represent sessions with site names, not site ids.
    # It's less efficient, but it makes the model weights easier to interpret.
    sites = ['site%s' % i for i in range(1, 11)]
    train_sessions = train_df[sites].fillna(0).astype('int').apply(lambda row: 
                                                     ' '.join([id2site[i] for i in row]), axis=1).tolist()
    test_sessions = test_df[sites].fillna(0).astype('int').apply(lambda row: 
                                                     ' '.join([id2site[i] for i in row]), axis=1).tolist()
    
    sites_dict = pd.DataFrame(list(site2id.keys()),
                              index=list(site2id.values()),
                              columns=['site'])
    
    top_alice_sites = pd.Series(train_df[train_df['target'] == 1][sites].fillna(0).astype('int').values.flatten()
                               ).value_counts().sort_values(ascending=False).head(5)
    # we'll tell TfidfVectorizer that we'd like to split data by whitespaces only 
    # so that it doesn't split by dots (we wouldn't like to have 'mail.google.com' 
    # to be split into 'mail', 'google' and 'com')
    vectorizer = TfidfVectorizer(**vectorizer_params)
    X_train = vectorizer.fit_transform(train_sessions)
    X_test = vectorizer.transform(test_sessions)
    y_train = train_df['target'].astype('int').values
    
    # we'll need site visit times for further feature engineering
    train_times, test_times = train_df[times], test_df[times]
    
    # sites_df
    train_sites, test_sites = train_df[sites].fillna(0).astype('int'), test_df[sites].fillna(0).astype('int')
    
    full_df = pd.concat([train_df.drop('target', axis=1), test_df])
    
    return X_train, X_test, y_train, vectorizer, train_times, test_times, train_sites, test_sites, top_alice_sites

In [28]:
%%time
X_train_sites, X_test_sites, y_train, vectorizer, train_times, test_times, train_sites, test_sites, top_alice_sites = prepare_sparse_features(
    path_to_train=os.path.join(PATH_TO_DATA, 'train_sessions.csv'),
    path_to_test=os.path.join(PATH_TO_DATA, 'test_sessions.csv'),
    path_to_site_dict=os.path.join(PATH_TO_DATA, 'site_dic.pkl'),
    vectorizer_params={'ngram_range': (1, 5), 
                       'max_features': 50000,
                       'tokenizer': lambda s: s.split()}
)


Wall time: 46.8 s
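
The `tokenizer` and `ngram_range` settings used above can be illustrated without scikit-learn. The sketch below is plain Python, not the vectorizer itself: it splits a session string on whitespace only and lists its word n-grams of length 1 to 3 (the cell above goes up to 5). The helper name `session_ngrams` and the toy session are made up for illustration.

```python
def session_ngrams(session, min_n=1, max_n=3):
    """Return whitespace-token n-grams of a session string, shortest first."""
    tokens = session.split()  # whitespace only, so dots stay inside site names
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[start:start + n]))
    return ngrams


# Hypothetical three-site session
toy_session = 'mail.google.com vk.com youwatch.org'
grams = session_ngrams(toy_session)
print(grams)
```

Each site name survives as a single token, and longer n-grams capture short sequences of visits, which is what lets the model pick up on typical browsing paths.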


In [29]:
# A helper function that writes predictions to a CSV file and records the
# features, parameters and score behind them in a companion .txt file
def write_to_submission_file(predicted_labels, out_file, new_feature_names=None, best_params=None, best_score=None,
                             target='target', index_label="session_id"):
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)
    
    # Record the features, parameters and score alongside the submission
    with open(out_file + '.txt', 'w') as text_file:
        text_file.write(f'Features:\n{str(new_feature_names)}\nParams: {best_params}\nBest Score: {best_score}')

# I'm lazy, so here is a function that names files for you    
def get_next_filename(filename='submission_1.csv', path=pred_path, file_exists=False, i=1):
    if filename in os.listdir(path):
        file_exists = True
        while file_exists:
            i += 1
            next_ = list(filename.split('.')[0].split('_')[0])
            next_.append('_')
            next_.append(str(i))
            next_ = ''.join(next_) + '.csv'
            next_, file_exists = get_next_filename(next_, path, False, i)
            
        return next_, file_exists  
    else:
        file_exists = False
        return filename, file_exists
    
def train_and_predict(model, X_train, y_train, X_test, site_feature_names=vectorizer.get_feature_names(), 
                      new_feature_names=None, cv=time_split, scoring='roc_auc',
                      top_n_features_to_show=30, submission_file_name='submission.csv', best_params=None):
    
    
    cv_scores = cross_val_score(model, X_train, y_train, cv=cv, 
                            scoring=scoring, n_jobs=4)
    print('CV scores', cv_scores)
    print('\nCV mean: {}, CV std: {}'.format(cv_scores.mean(), cv_scores.std()))
    model.fit(X_train, y_train)
    
    if new_feature_names:
        all_feature_names = site_feature_names + new_feature_names 
    else: 
        all_feature_names = site_feature_names
    
    display_html(eli5.show_weights(estimator=model, 
                  feature_names=all_feature_names, top=top_n_features_to_show))
    
    if new_feature_names:
        print('New feature weights:')
    
        print(pd.DataFrame({'feature': new_feature_names, 
                        'coef': model.coef_.flatten()[-len(new_feature_names):]}).sort_values(by='coef'))
    
    test_pred = model.predict_proba(X_test)[:, 1]
    write_to_submission_file(test_pred, submission_file_name, new_feature_names, best_params,
                             best_score=f'\nCV max: {cv_scores.max()} CV mean: {cv_scores.mean()}, CV std: {cv_scores.std()}') 
    
    return cv_scores
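
The file-naming idea in `get_next_filename` can also be written as a simple loop. This is only a sketch of an alternative, not the helper used in this notebook: `next_free_name` is a hypothetical name, and it takes the existing file names as an argument so it can be tried without touching the disk.

```python
from itertools import count


def next_free_name(existing, stem='submission', ext='.csv'):
    """Return the first '<stem>_<i><ext>' (i = 1, 2, ...) not present in `existing`."""
    taken = set(existing)
    for i in count(1):
        candidate = f'{stem}_{i}{ext}'
        if candidate not in taken:
            return candidate


# Hypothetical directory listing
name = next_free_name(['submission_1.csv', 'submission_2.csv', 'base_submission_1.csv'])
print(name)  # submission_3.csv
```

In the notebook it could be called as `next_free_name(os.listdir(pred_path))`, assuming the same `submission_<i>.csv` naming scheme.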

### Base score with no features added

In [30]:
%%time
cv_scores_base = train_and_predict(model=logit,
                               X_train=X_train_sites,
                               y_train=y_train,
                               X_test=X_test_sites,
                               site_feature_names=vectorizer.get_feature_names(),
                               top_n_features_to_show=50,
                               cv=time_split, submission_file_name=pred_path+'base_'+filename)

CV scores [0.83124023 0.65993466 0.85673565 0.92824237 0.84779639 0.88954524
 0.88829128 0.8771044  0.92023038 0.92624225]

CV mean: 0.8625362859611094, CV std: 0.07455679334182559


Weight?,Feature
+5.880,youwatch.org
+5.380,cid-ed6c3e6a5c6608a4.users.storage.live.com
+5.222,fr.glee.wikia.com
+5.114,vk.com
+4.875,www.info-jeunes.net
+4.499,www.banque-chalus.fr
+4.220,www.express.co.uk
+4.147,www.audienceinsights.net
+4.089,www.melty.fr
+4.003,glee.hypnoweb.net


Wall time: 17 s
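
Because the sessions are sorted by `time1` and scored with `cv=time_split`, every fold trains on the past and validates on the block that follows. The plain-Python sketch below mirrors how `TimeSeriesSplit` with default settings is understood to lay out its folds; it is an illustration, not the library code, and `expanding_window_splits` is a made-up name.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs with a growing training window."""
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train = list(range(test_start))  # everything before the test block
        test = list(range(test_start, test_start + test_size))
        yield train, test


folds = list(expanding_window_splits(n_samples=12, n_splits=3))
for train, test in folds:
    print('train', train[0], '..', train[-1], '| test', test[0], '..', test[-1])
```

The earliest folds train on very little data, which is one plausible reason why the first CV scores are lower and noisier than the later ones.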


-----------------------------------------

### Adding features

### Feature engineering fields

In [31]:
# with open(PATH_TO_DATA + 'site_dic.pkl', "rb") as input_file:
#         site_dict = pickle.load(input_file)
        
# facebook_ids = []
# youtube_ids = []
# google_ids = []
# france_ids = []
# sowftware_ids = []
# wiki_ids = []
# unknown_ids = []

# for key in list(site_dict.keys()):
#     if 'facebook' in key:
#         facebook_ids.append(site_dict[key])
#     if 'youtube' in key or 'youwatch' in key:
#         youtube_ids.append(site_dict[key])
#     if 'google' in key:
#         google_ids.append(site_dict[key])
#     if '.fr' in key:
#         france_ids.append(site_dict[key])
#     if 'browser-update.org' in key:
#         sowftware_ids.append(site_dict[key])
#     if 'wiki' in key:
#         wiki_ids.append(site_dict[key])
#     if 'unknown' in key:
#         unknown_ids.append(site_dict[key])
            
# sites_dict = pd.DataFrame(list(site_dict.keys()),
#                           index=list(site_dict.values()),
#                           columns=['site'])

# scaler = StandardScaler()
# sites_dict.loc[top_alice_sites.index]

In [32]:
# times = ['time%s' % i for i in range(1, 11)]
# sites = ['site%s' % i for i in range(1, 11)]

# train_df = pd.read_csv(PATH_TO_DATA + 'train_sessions.csv',
#                        index_col='session_id', parse_dates=times)

# top_alice_sites = pd.Series(train_df[train_df['target'] == 1][sites].fillna(0).astype('int').values.flatten()
#                                ).value_counts().sort_values(ascending=False).head(3)

In [33]:
# train_df[train_df['target'] == 1][sites].fillna(0).astype('int').values.flatten()

In [34]:
# hour = train_times['time1'].dt.hour
# hour.isin([12,13,16,17,18]).astype('int').value_counts()

In [35]:
# ((train_times['time1'].dt.month == 10) | (train_times['time1'].dt.month == 5)).astype('int').value_counts()

In [36]:
# train_times['time1'].dt.weekday

The reason for choosing these features is simple: they are an attempt to "overfit" to Alice.

In [53]:
def add_features(times, sites, X_sparse, top_alice_sites):
    
    scaler = StandardScaler()
#     scaler = MinMaxScaler()
    
    with open(PATH_TO_DATA + 'site_dic.pkl', "rb") as input_file:
        site_dict = pickle.load(input_file)
        
        
    sites_dict = pd.DataFrame(list(site_dict.keys()),
                              index=list(site_dict.values()),
                              columns=['site'])
        
    # time features
    hour = times['time1'].dt.hour
    morning = ((hour >= 7) & (hour <= 11)).astype('int').values.reshape(-1, 1)
    day = ((hour >= 12) & (hour <= 18)).astype('int').values.reshape(-1, 1)
    evening = ((hour >= 19) & (hour <= 23)).astype('int').values.reshape(-1, 1)
    night = ((hour >= 0) & (hour <=6)).astype('int').values.reshape(-1, 1)
    alice_hours = hour.isin([12,13,16,17,18]).astype('int').values.reshape(-1, 1)
    not_alice_hours = hour.isin([7,8,11,14,19,20,21,22,23]).astype('int').values.reshape(-1, 1)
    alice_days = times['time1'].dt.weekday.isin([0, 1, 3, 4]).astype('int').values.reshape(-1, 1)
    not_alice_days = times['time1'].dt.weekday.isin([2, 5, 6]).astype('int').values.reshape(-1, 1)
    alice_months = times['time1'].dt.month.isin([1,2,3,4,9,11,12]).astype('int').values.reshape(-1, 1)
    not_alice_months = times['time1'].dt.month.isin([5,6,7,8,10]).astype('int').values.reshape(-1, 1)
    
    durations = (times.max(axis=1) - times.min(axis=1)).astype('timedelta64[ms]').astype('int').values.reshape(-1, 1)
    durations = scaler.fit_transform(durations)
    
    week = times['time1'].dt.week.values.reshape(-1, 1)
    week = scaler.fit_transform(week)

    winter = times['time1'].dt.month.isin([12, 1, 2]).astype('int').values.reshape(-1, 1)
    spring = times['time1'].dt.month.isin([3, 4, 5]).astype('int').values.reshape(-1, 1)
    summer = times['time1'].dt.month.isin([6, 7, 8]).astype('int').values.reshape(-1, 1)
    autumn = times['time1'].dt.month.isin([9, 10, 11]).astype('int').values.reshape(-1, 1)
    
    day_of_week = times['time1'].dt.weekday.astype('int').values.reshape(-1, 1)
    day_of_week = scaler.fit_transform(day_of_week)
    
    month = times['time1'].dt.month.astype('int').values.reshape(-1, 1)
    year_month = times['time1'].apply(lambda ts: 100 * ts.year + ts.month).astype('int').values.reshape(-1, 1)
    year_month = scaler.fit_transform(year_month)
    
    sunday = (times['time1'].dt.weekday == 6).astype('int').values.reshape(-1, 1)
    monday = (times['time1'].dt.weekday == 0).astype('int').values.reshape(-1, 1)
    
    may = (times['time1'].dt.month == 5).astype('int').values.reshape(-1, 1)
    october = (times['time1'].dt.month == 10).astype('int').values.reshape(-1, 1)
    
    # site features
    facebook_ids = []
    youtube_ids = []
    google_ids = []
    france_ids = []
    vk_ids = []
    msft_ids = []
    bing_ids = []
    unknown_ids = []
    search_ids = []

    for key in list(site_dict.keys()):
        if 'facebook' in key:
            facebook_ids.append(site_dict[key])
        if 'youtube' in key or 'video' in key or 'youwatch' in key:  # or 'ytimg.com' in key:
            youtube_ids.append(site_dict[key])
        if 'mail.google.com' in key or 'plus.google.com' in key: # or 'wwww.bing.com' in key:
            google_ids.append(site_dict[key])
        if '.fr' in key or 'fr.' in key:
            france_ids.append(site_dict[key])
        if 'vk.com' in key or 'vk.ru' in key:
            vk_ids.append(site_dict[key])
        if 'storage.live' in key or '.live.com' in key or 'fr.msn.com' in key:
            msft_ids.append(site_dict[key])
        if 'www.bing.com' in key:
            bing_ids.append(site_dict[key])
        if 'www.google.fr' in key:
            search_ids.append(site_dict[key])
        if 'unknown' in key:
            unknown_ids.append(site_dict[key])

    top_alice_ids = []

    for key in top_alice_sites.index:
        top_alice_ids.append(key)
    
    first_3 = sites[['site1', 'site2', 'site3']]

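    # NOTE: DataFrame.isin() aligns a Series argument on its index, so passing
    # top_alice_sites (index = site ids, values = counts) matches almost nothing;
    # top_alice_ids was probably intended. The zero weight of in_alice_top in the
    # outputs below is consistent with this.
    in_alice_top = first_3.isin(top_alice_sites).astype('int').sum(axis=1).astype('bool').astype('int').values.reshape(-1, 1)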
    
    
    # NOTE: unlike the other start_* features, this checks the first three sites, not only site1
    start_google = first_3.isin(google_ids).astype('int').sum(axis=1).astype('bool').astype('int').values.reshape(-1, 1)
    has_google = sites.isin(google_ids).astype('int').sum(axis=1).astype('bool').astype('int').values.reshape(-1, 1)
    first_3_has_google = first_3.isin(google_ids).astype('int').sum(axis=1).astype('bool').astype('int').values.reshape(-1, 1)
    
    start_youtube = sites['site1'].isin(youtube_ids).astype('int').values.reshape(-1, 1)
    has_youtube = sites.isin(youtube_ids).astype('int').sum(axis=1).astype('bool').astype('int').values.reshape(-1, 1)
    first_3_has_youtube = first_3.isin(youtube_ids).astype('int').sum(axis=1).astype('bool').astype('int').values.reshape(-1, 1)
    
    start_facebook = sites['site1'].isin(facebook_ids).astype('int').values.reshape(-1, 1)
    has_facebook = sites.isin(facebook_ids).astype('int').sum(axis=1).astype('bool').astype('int').values.reshape(-1, 1)
    first_3_has_facebook = first_3.isin(facebook_ids).astype('int').sum(axis=1).astype('bool').astype('int').values.reshape(-1, 1)
    
    start_vk = sites['site1'].isin(vk_ids).astype('int').values.reshape(-1, 1)
    has_vk = sites.isin(vk_ids).astype('int').sum(axis=1).astype('bool').astype('int').values.reshape(-1, 1)
    first_3_has_vk = first_3.isin(vk_ids).astype('int').sum(axis=1).astype('bool').astype('int').values.reshape(-1, 1)
    
    fr_domains = sites.isin(france_ids).astype('int').sum(axis=1).values.reshape(-1, 1)
    fr_domains = scaler.fit_transform(fr_domains)
#     fr_domains = sites.isin(france_ids).astype('int').sum(axis=1).astype('bool').astype('int').values.reshape(-1, 1)
    
    msft_usage = sites.isin(msft_ids).astype('int').sum(axis=1).astype('bool').astype('int').values.reshape(-1, 1)
    
    has_bing = first_3.isin(bing_ids).astype('int').sum(axis=1).astype('bool').astype('int').values.reshape(-1, 1)
    
    search = first_3.isin(search_ids).astype('int').sum(axis=1).astype('bool').astype('int').values.reshape(-1, 1)
    
    unknown = sites.isin(unknown_ids).astype('int').sum(axis=1).astype('bool').astype('int').values.reshape(-1, 1)
    
    # stacking matrix
    objects_to_hstack = [X_sparse,
                         morning, day, evening, night,
                         durations,
                         day_of_week,
                         year_month,
                         sunday,
                         start_google,
                         start_youtube,
                         fr_domains,
                         start_vk,
                         msft_usage,
                         summer,
#                          week,
                         has_google,
                         has_youtube,
                         in_alice_top,
                         has_vk,
                         has_facebook,
                         may,
                         october,
#                          has_bing,
                         first_3_has_vk,
                         first_3_has_youtube,
                         first_3_has_facebook,
#                          search,
                         alice_hours,
                         not_alice_hours,
#                          alice_days,
#                          not_alice_days,
#                          first_3_has_google,
#                          monday,
#                          unknown,
                         alice_months,
#                          not_alice_months,
                        ]
    
    feature_names = ['morning', 'day','evening', 'night',
                     'durations',
                     'day_of_week',
                     'year_month',
                     'sunday',
                     'start_google',
                     'start_youtube',
                     'fr_domains',
                     'start_vk',
                     'msft_usage', 
                     'summer',
#                      'week',
                     'has_google',
                     'has_youtube',
                     'in_alice_top',
                     'has_vk',
                     'has_facebook',
                     'may',
                     'october',
#                      'has_bing',
                     'first_3_has_vk',
                     'first_3_has_youtube',
                     'first_3_has_facebook',
#                      'search',
                     'alice_hours',
                     'not_alice_hours',
#                      'alice_days',
#                      'not_alice_days',
#                      'first_3_has_google',
#                      'monday',
#                      'unknown',
                     'alice_months',
#                      'not_alice_months',
                    ]
        
    X = hstack(objects_to_hstack)
    return X, feature_names
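
The time features in `add_features` boil down to a few comparisons on the hour of the first visit plus the span between the earliest and latest visit. The sketch below shows the same idea for a single session using only the standard library; `time_features` and the toy timestamps are made up for illustration, and `None` stands in for a missing visit.

```python
from datetime import datetime


def time_features(timestamps):
    """Compute hour-bucket flags and the duration (ms) of one session."""
    visits = [t for t in timestamps if t is not None]
    hour = visits[0].hour  # like the notebook, key the hour features on the first visit
    return {
        'morning': int(7 <= hour <= 11),
        'day': int(12 <= hour <= 18),
        'evening': int(19 <= hour <= 23),
        'night': int(0 <= hour <= 6),
        'duration_ms': int((max(visits) - min(visits)).total_seconds() * 1000),
    }


# Hypothetical session: two visits 30 seconds apart, third slot empty
toy = [datetime(2014, 2, 20, 10, 2, 45), datetime(2014, 2, 20, 10, 3, 15), None]
features = time_features(toy)
print(features)
```

The four hour buckets are mutually exclusive and cover all 24 hours, so exactly one of them is set for every session.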

### Training model with new features

In [54]:
%%time
X_train, new_feat_names = add_features(train_times, train_sites, X_train_sites, top_alice_sites)
X_test, _ = add_features(test_times, test_sites, X_test_sites, top_alice_sites)

Wall time: 5.77 s


In [55]:
new_feat_names

['morning',
 'day',
 'evening',
 'night',
 'durations',
 'day_of_week',
 'year_month',
 'sunday',
 'start_google',
 'start_youtube',
 'fr_domains',
 'start_vk',
 'msft_usage',
 'summer',
 'has_google',
 'has_youtube',
 'in_alice_top',
 'has_vk',
 'has_facebook',
 'may',
 'october',
 'first_3_has_vk',
 'first_3_has_youtube',
 'first_3_has_facebook',
 'alice_hours',
 'not_alice_hours',
 'alice_months']

In [56]:
%%time
cv_scores_engineered = train_and_predict(model=logit, X_train=X_train, y_train=y_train,
                                         X_test=X_test, 
                                         site_feature_names=vectorizer.get_feature_names(),
                                         new_feature_names=new_feat_names,
                                         cv=time_split,
                                         submission_file_name=pred_path + 'engineered_' + filename)

CV scores [0.88954529 0.90348207 0.96212667 0.94884503 0.95138494 0.9666015
 0.90366413 0.95883664 0.96431865 0.97489431]

CV mean: 0.9423699237315551, CV std: 0.02951656537520633


Weight?,Feature
+4.986,cid-ed6c3e6a5c6608a4.users.storage.live.com
+4.935,www.express.co.uk
+4.686,www.melty.fr
+4.514,www.audienceinsights.net
+4.475,www.info-jeunes.net
+3.885,fr.glee.wikia.com
+3.755,www.banque-chalus.fr
+3.689,api.bing.com
+3.685,youwatch.org
+3.599,dub119.mail.live.com


New feature weights:
                 feature      coef
14            has_google -3.791375
0                morning -2.843488
13                summer -2.706858
25       not_alice_hours -2.433422
7                 sunday -1.836478
20               october -1.806297
1                    day -1.562613
2                evening -1.150321
8           start_google -0.948850
19                   may -0.928471
5            day_of_week -0.579017
6             year_month -0.468383
12            msft_usage -0.284234
4              durations -0.232303
11              start_vk -0.186773
26          alice_months -0.114796
16          in_alice_top  0.000000
3                  night  0.000000
23  first_3_has_facebook  0.038828
9          start_youtube  0.073390
10            fr_domains  0.175686
22   first_3_has_youtube  0.363887
18          has_facebook  0.436903
21        first_3_has_vk  0.490332
15           has_youtube  0.522773
17                has_vk  1.616642
24           alice_hours  2.454341

In [57]:
cv_scores_base < cv_scores_engineered

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True])

Leaderboard score = 0.96215

### Parameter tuning

In [58]:
%%time
logit = LogisticRegression(random_state=SEED, solver='liblinear')

# params = {'C': np.logspace(0, 1.05, 10), 'tol': np.linspace(0.1, 0.0001, 10)}

params = {'C': [BEST_LOGIT_C],#, 2.2387211385683394, 2.9286445646252366],
          'tol': [BEST_LOGIT_TOL]}#, 0.0556, 0.001]}


logit_grid_searcher = GridSearchCV(estimator=logit, param_grid=params,
                              scoring='roc_auc', n_jobs=4, cv=time_split, verbose=1)

logit_grid_searcher.fit(X_train, y_train)

print('Tuned score', logit_grid_searcher.best_score_)

Fitting 10 folds for each of 1 candidates, totalling 10 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  10 out of  10 | elapsed:   25.9s finished


Tuned score 0.9464087444508743
Wall time: 36.8 s
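
The commented-out grid in the cell above, `np.logspace(0, 1.05, 10)`, spaces the exponents of `C` evenly between 0 and 1.05. The plain-Python sketch below reproduces that grid to show where the fixed values come from; `logspace` here is a small stand-in written for this example. Reading `BEST_LOGIT_C` as the seventh grid point (roughly 10 to the power 0.7) is an inference from the numbers, not something stated in the notebook.

```python
def logspace(start, stop, num):
    """Return `num` values 10**e with exponents evenly spaced from start to stop."""
    step = (stop - start) / (num - 1)
    return [10 ** (start + i * step) for i in range(num)]


c_grid = logspace(0, 1.05, 10)
for c in c_grid:
    print(round(c, 4))
```

A logarithmic grid is the usual choice for a regularization strength, because the useful values of `C` tend to span orders of magnitude rather than a linear range.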


In [59]:
logit_grid_searcher.best_estimator_

LogisticRegression(C=5.0118, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=17, solver='liblinear', tol=0.045, verbose=0,
                   warm_start=False)

### Training the best model and writing the output files:
submission.csv + text file with features and best parameters

In [60]:
%%time
cv_scores_tuned = train_and_predict(model=logit_grid_searcher.best_estimator_, X_train=X_train, y_train=y_train,
                                         X_test=X_test, 
                                         site_feature_names=vectorizer.get_feature_names(),
                                         new_feature_names=new_feat_names,
                                         cv=time_split,
                                         submission_file_name=pred_path + get_next_filename(filename)[0], best_params=logit_grid_searcher.best_params_)

CV scores [0.91474955 0.88763954 0.96029797 0.95048413 0.95815735 0.96567903
 0.92522617 0.95715403 0.97108434 0.97361534]

CV mean: 0.9464087444508744, CV std: 0.026623791797209157


Weight?,Feature
+11.193,www.express.co.uk
+11.023,cid-ed6c3e6a5c6608a4.users.storage.live.com
+8.313,tru.am
+6.935,browser-update.org
+6.547,www.banque-chalus.fr
+5.982,www.melty.fr
+5.898,api.bing.com
+5.830,glee.hypnoweb.net
+5.825,fr.glee.wikia.com
+5.688,s.radio-canada.ca


New feature weights:
                 feature      coef
14            has_google -4.959255
13                summer -3.504300
0                morning -3.072226
25       not_alice_hours -2.538626
20               october -2.478031
7                 sunday -2.260269
2                evening -2.051348
1                    day -1.732976
19                   may -1.261258
8           start_google -1.116538
5            day_of_week -0.603906
6             year_month -0.568956
12            msft_usage -0.445534
11              start_vk -0.283876
4              durations -0.205393
23  first_3_has_facebook -0.051352
3                  night  0.000000
16          in_alice_top  0.000000
9          start_youtube  0.015027
10            fr_domains  0.132972
22   first_3_has_youtube  0.310091
15           has_youtube  0.336838
18          has_facebook  0.350042
26          alice_months  0.387040
21        first_3_has_vk  0.705389
17                has_vk  1.451027
24           alice_hours  2.645824

In [61]:
cv_scores_engineered < cv_scores_tuned

array([ True, False, False,  True,  True, False,  True, False,  True,
       False])
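
The element-wise comparison above only says which folds improved, not by how much. A small sketch using the fold scores printed in the two outputs above (engineered features versus the tuned model) puts numbers on it:

```python
# Fold scores copied from the printed CV outputs above
cv_engineered = [0.88954529, 0.90348207, 0.96212667, 0.94884503, 0.95138494,
                 0.9666015, 0.90366413, 0.95883664, 0.96431865, 0.97489431]
cv_tuned = [0.91474955, 0.88763954, 0.96029797, 0.95048413, 0.95815735,
            0.96567903, 0.92522617, 0.95715403, 0.97108434, 0.97361534]

deltas = [t - e for e, t in zip(cv_engineered, cv_tuned)]
improved = sum(d > 0 for d in deltas)
mean_delta = sum(deltas) / len(deltas)

print(f'folds improved: {improved} of {len(deltas)}')
print(f'mean change in ROC AUC: {mean_delta:+.5f}')
```

Only half of the folds improve, and the mean gain of about 0.004 is small next to the CV standard deviation of roughly 0.027, so the tuned parameters are a modest improvement rather than a clear win.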

In [22]:
# working-hours feature + mean of the latest 3 submissions + weekend feature + mail feature

Leaderboard score: **0.96247**

Leaderboard (LB) scores of other variants tried:

1. with wednesday: 0.95832
2. with software: 0.95771
3. with has_bing: 0.96226
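
The cell comment above mentions taking the mean of the latest three submissions, but the code for that step is not shown in this notebook. The following is a minimal sketch of one way to do it, assuming every submission uses the `session_id,target` layout written by `write_to_submission_file` and covers the same sessions. The function name `average_submissions` and the toy data are made up for illustration.

```python
import csv
import io


def average_submissions(csv_texts):
    """Average the 'target' column of several submissions, matched by session_id."""
    totals = {}
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            sid = int(row['session_id'])
            totals[sid] = totals.get(sid, 0.0) + float(row['target'])
    return {sid: total / len(csv_texts) for sid, total in sorted(totals.items())}


# Toy submissions, made up for illustration
sub_a = 'session_id,target\n1,0.10\n2,0.80\n'
sub_b = 'session_id,target\n1,0.20\n2,0.60\n'
sub_c = 'session_id,target\n1,0.30\n2,0.70\n'
blend = average_submissions([sub_a, sub_b, sub_c])
print(blend)
```

With real files, each text could come from reading a submission CSV in `pred_path`; the averaged values would then be written back out in the same two-column format.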

In [23]:
logit_grid_searcher.best_params_

{'C': 5.0118, 'tol': 0.045}

In [24]:
a = list('order')
print(a)
a.index('d')

['o', 'r', 'd', 'e', 'r']


2