# Homework for Lesson 5 (SkillFactory)

Slack: @Lek <br/>
telegram: @AlexLekov <br/>

### Contents:
* [1. Loading the data](#1)
* [2. Cleaning](#2)
* [3. TfidfVectorizer](#3)
* [4. Word2Vec](#4)
* [5. Submit](#5)


======================================================================================================================

#### Task description

We own a niche job site and have been given a large dataset of vacancies. Some vacancies are of interest to us because of their topic, others are not (target 1 and 0, respectively). Some of the vacancies were labeled manually by people.
  
Your task is to train a classifier that, based on the labeled sample, can identify the vacancies that are of interest for our site.

* Quality metric: ROC_AUC.
* USING EXTERNAL DATA FROM JOB sites IS FORBIDDEN
* USING other EXTERNAL DATA is allowed only with the organizer's permission (see Discussion)
* A result counts only if it comes with code that reproduces it
* Participation is in teams
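
To make the metric concrete: ROC AUC can be read as the probability that a randomly chosen interesting vacancy (target 1) gets a higher score than a randomly chosen uninteresting one (target 0), with ties counted as half. Below is a minimal pure-Python sketch of that pairwise definition. The function name `pairwise_roc_auc` and the toy labels and scores are illustrative assumptions, not part of the competition code; on real data `sklearn.metrics.roc_auc_score` is the practical choice.

```python
def pairwise_roc_auc(y_true, y_score):
    """ROC AUC via the pairwise definition (O(n_pos * n_neg), for illustration only)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: two uninteresting and two interesting vacancies.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_roc_auc(toy_labels, toy_scores)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```

Because only the ordering of the scores matters, the metric does not change under any monotonic rescaling of the predicted probabilities.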

#### PS:
You can discuss the lesson and the homework in our Slack chat

# Let's go!

![title](img/start.jpg)

In [4]:
import pandas as pd
import numpy as np

import re # regular expressions

from sklearn.model_selection import cross_val_score, StratifiedKFold, train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer
from scipy.sparse import csr_matrix, hstack
from sklearn.pipeline import make_pipeline

from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgb

import matplotlib.pyplot as plt
import seaborn as sns
import time

%matplotlib inline

plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (12,5)

In [5]:
# CV check that iterates not only over folds but also over random_state:
seeds=[1, 7, 42, 123, 2019]

def cv_print (lr, X_train, y_train, seeds):
    ''' Print the mean and std of the cross-validation ROC AUC pooled over several random_state values '''
    cv_final=np.array([])
    for random_state in seeds:
        skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
        cv = cross_val_score(lr, X_train, y_train, scoring='roc_auc', cv = skf, )
        cv_final = np.append(cv_final, cv)
    print('Mean: ', cv_final.mean())
    print('Std: ', cv_final.std())
    print('='*40)
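
The helper above re-runs stratified 5-fold CV once per seed and pools the fold scores. scikit-learn's `RepeatedStratifiedKFold` performs the same kind of repetition inside a single splitter. The sketch below is an alternative under stated assumptions, not the helper used in this notebook. The name `cv_scores` and the synthetic data from `make_classification` are hypothetical, and because the splitter derives its per-repeat shuffles from one `random_state`, its folds will not match those produced by an explicit list of seeds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

def cv_scores(model, X, y, n_repeats=3, random_state=42):
    """Return all fold scores (ROC AUC) from repeated stratified 5-fold CV."""
    rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=n_repeats,
                                   random_state=random_state)
    return cross_val_score(model, X, y, scoring='roc_auc', cv=rskf)

# Synthetic stand-in data, only to make the sketch runnable.
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)
scores = cv_scores(LogisticRegression(max_iter=1000), X_toy, y_toy)
print('Mean: ', scores.mean())
print('Std: ', scores.std())
```

Returning the array of scores instead of printing it makes it easier to compare two models fold by fold.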

<a class="anchor" id="1"></a>
# 1. Loading the data

In [6]:
train_df = pd.read_csv('data/train.csv', sep='\t', low_memory=False)
test_df = pd.read_csv('data/test.csv', sep='\t', low_memory=False)
#data_df = pd.read_csv('data/other.csv', sep='\t', low_memory=False)

train_df['train'] = 1
train_df['test'] = 0
train_df.drop('id', inplace = True, axis = 1)

test_df['train'] = 0
test_df['test'] = 1
test_df['target'] = 0
test_df.drop('id', inplace = True, axis = 1)

#data_df['target'] = 0
#data_df['train'] = 0
#data_df['test'] = 0

df = test_df.append(train_df, sort=False).reset_index(drop=True)
#df = df.append(data_df, sort=False).reset_index(drop=True)

In [7]:
df.head(10)

Unnamed: 0,name,description,train,test,target
0,Дизайнер-консультант мебели,<p><strong>Обязанности:</strong></p> <ul> <li>...,0,1,0
1,Продавец-консультант (ТЦ на Пушкина),<p><strong>Обязанности</strong>:</p> <p>∙ конс...,0,1,0
2,Менеджер по продажам,<p>Торговый Дом «Форт» это ведущая компания Пе...,0,1,0
3,Продавец-консультант в магазин одежды (ТЦ Волн...,<p><strong>Требуются продавцы консультанты в м...,0,1,0
4,Специалист по охране труда,<strong>Обязанности:</strong> <ul> <li> <p>осу...,0,1,0
5,Эксперт по обеспечению качества при сооружении...,<p><strong>Обязанности:</strong></p> <ul> <li>...,0,1,0
6,Торговый представитель (Арзамас),<p><strong>Обязанности:</strong></p> <ul> <li>...,0,1,0
7,Заместитель генерального директора по производ...,<p><strong>Обязанности:</strong></p> <ul> <li>...,0,1,0
8,Backend Rust developer,<p><strong>Storiqa </strong>- это площадка для...,0,1,0
9,Дизайнер-конструктор 3D,<p><strong>Обязанности:</strong></p> <ul> <li>...,0,1,0


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 370179 entries, 0 to 370178
Data columns (total 5 columns):
name           370179 non-null object
description    370179 non-null object
train          370179 non-null int64
test           370179 non-null int64
target         370179 non-null int64
dtypes: int64(3), object(2)
memory usage: 14.1+ MB


In [9]:
# look at the number of unique values in each column
for colum in df.columns:
    print(len(df[colum].value_counts()), colum)

113447 name
249280 description
2 train
2 test
2 target


In [10]:
df['name'].value_counts().head(10)

Менеджер по продажам              11643
Продавец-консультант               9644
Торговый представитель             9591
Менеджер по работе с клиентами     6437
Продавец-кассир                    2846
Системный администратор            2792
Мерчендайзер                       2762
Маркетолог                         2108
Программист 1С                     1897
Менеджер по оптовым продажам       1868
Name: name, dtype: int64

In [11]:
train_df['target'].value_counts()

0    106436
1     93564
Name: target, dtype: int64

<a class="anchor" id="2"></a>
# 2. Очистка

#### Remove HTML tags.
We will use the re library.  
The full documentation for this library is <a href="https://docs.python.org/2/library/re.html">here</a>. A good description of regular expressions is <a href="https://ru.wikipedia.org/wiki/%D0%A0%D0%B5%D0%B3%D1%83%D0%BB%D1%8F%D1%80%D0%BD%D1%8B%D0%B5_%D0%B2%D1%8B%D1%80%D0%B0%D0%B6%D0%B5%D0%BD%D0%B8%D1%8F">here</a>.

In [9]:
def clean_html(data):
    clean_re = re.compile('<.*?>')
    cleantext = re.sub(clean_re, '', data)
    return cleantext

df['description'] = df['description'].apply(lambda x: clean_html(x))
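
One caveat of replacing tags with an empty string is that words separated only by tags get glued together (`<li>sales</li><li>support</li>` becomes `salessupport`). A hedged variant, assuming this matters for the vectorizers used later, substitutes a space for each tag and then collapses whitespace; it also decodes HTML entities with the standard-library `html.unescape`. The function name `clean_html_spaced` and the sample string are illustrative, not taken from the dataset.

```python
import html
import re

TAG_RE = re.compile(r'<.*?>')
SPACE_RE = re.compile(r'\s+')

def clean_html_spaced(text):
    """Replace each tag with a space, decode entities, collapse whitespace."""
    no_tags = TAG_RE.sub(' ', text)
    decoded = html.unescape(no_tags)
    return SPACE_RE.sub(' ', decoded).strip()

# Made-up sample in the style of the description column.
sample = '<p><strong>Duties:</strong></p> <ul> <li>sales</li><li>support &amp; service</li> </ul>'
print(re.sub(TAG_RE, '', sample))   # tags dropped without a separator
print(clean_html_spaced(sample))    # tags replaced by a single space
```

Whether the extra tokens recovered this way change the score is something to check with the cross-validation helper rather than assume.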

In [10]:
df.name = df.name.str.lower()
df.description = df.description.str.lower()

In [11]:
df.head(10)

Unnamed: 0,name,description,train,test,target
0,дизайнер-консультант мебели,"обязанности: работа с клиентом в салоне,выезд...",0,1,0
1,продавец-консультант (тц на пушкина),обязанности: ∙ консультирование покупателей по...,0,1,0
2,менеджер по продажам,торговый дом «форт» это ведущая компания петер...,0,1,0
3,продавец-консультант в магазин одежды (тц волн...,требуются продавцы консультанты в магазин женс...,0,1,0
4,специалист по охране труда,обязанности: осуществление контроля по соблю...,0,1,0
5,эксперт по обеспечению качества при сооружении...,"обязанности: управление несоответствиями, в...",0,1,0
6,торговый представитель (арзамас),обязанности: ведение и развитие существующей ...,0,1,0
7,заместитель генерального директора по производ...,"обязанности: доработка качества, в первую оче...",0,1,0
8,backend rust developer,storiqa - это площадка для торговли физическим...,0,1,0
9,дизайнер-конструктор 3d,обязанности: адаптация дизайнов визуализация ...,0,1,0


<a class="anchor" id="3"></a>
# 3. TfidfVectorizer

In [12]:
df_train_preproc = df.query('train == 1').drop(['train','test'], axis=1)
#df_test_preproc = df_preproc.query('test == 1').drop(['train','test'], axis=1)

y = df_train_preproc.target.values
X = df_train_preproc.drop(['target'], axis=1)

In [13]:
%%time
tfv_name = TfidfVectorizer(ngram_range=(1, 5), 
                            sublinear_tf=True,
                           #stop_words = stopWords,
                          max_features=45000 # can be reduced if memory is limited
                          )
tfv_desc = TfidfVectorizer(ngram_range=(1, 2),
                            sublinear_tf=True,
                           #stop_words = stopWords,
                          max_features=80000
                          )
# this preprocessing was covered in lesson 5 - SkillFactory_20181211

tfv_name_fit = tfv_name.fit(df.name)
tfv_desc_fit = tfv_desc.fit(df.description)

X_name = tfv_name_fit.transform(X.name)
X_desc = tfv_desc_fit.transform(X.description)

X_tf = hstack((X_name, X_desc)) # do you know a more memory-efficient way/library for merging matrices?

Wall time: 2min 46s
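
On the question in the comment above about a more memory-efficient merge: `scipy.sparse.hstack` picks the output format itself unless told otherwise (the choice depends on the inputs and the SciPy version, and may be COO), and estimators then typically convert the result to CSR. A small sketch, assuming the extra conversion and the 8-byte values are the concern, requests CSR directly through the `format` argument and stores values as `float32`. Whether the reduced precision is acceptable for these models is an assumption to verify. The toy matrices stand in for `X_name` and `X_desc`.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy stand-ins for the two TF-IDF blocks (same number of rows).
A = csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]]))
B = csr_matrix(np.array([[0.0, 3.0, 0.0], [4.0, 0.0, 0.0], [0.0, 0.0, 5.0]]))

# Request CSR output and a narrower dtype in one call.
X_small = hstack((A, B), format='csr', dtype=np.float32)

print(X_small.format, X_small.dtype, X_small.shape)
```

`TfidfVectorizer` also accepts a `dtype` argument, so the blocks themselves can be produced as `float32` before stacking.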


## ML

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X_tf, y, test_size=0.25, random_state=42)

### LogisticRegression

In [15]:
%%time
lr = LogisticRegression(C=5, solver='sag', random_state=42)
cv_print(lr, X_train, y_train, [42,1,12])

Mean:  0.993614657690429
Std:  0.00023150312275365136
Wall time: 2min 49s


In [16]:
lr = LogisticRegression(C=5, solver='sag', random_state=42) 
lr.fit(X_train, y_train)
lr_1_predict_proba = lr.predict_proba(X_test)
print(roc_auc_score(y_test, lr_1_predict_proba[:,1]))

0.9938239421740739


### RandomForestClassifier

In [21]:
%%time
rf = RandomForestClassifier(n_estimators=1000, random_state=42, verbose=1, n_jobs=7)
rf.fit(X_train, y_train)
rf_predict_proba = rf.predict_proba(X_test)
print(roc_auc_score(y_test, rf_predict_proba[:,1]))

[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:  1.5min
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:  7.3min
[Parallel(n_jobs=7)]: Done 436 tasks      | elapsed: 16.8min
[Parallel(n_jobs=7)]: Done 786 tasks      | elapsed: 30.3min
[Parallel(n_jobs=7)]: Done 1000 out of 1000 | elapsed: 38.5min finished
[Parallel(n_jobs=7)]: Using backend ThreadingBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    0.7s
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:    3.1s
[Parallel(n_jobs=7)]: Done 436 tasks      | elapsed:    7.1s
[Parallel(n_jobs=7)]: Done 786 tasks      | elapsed:   12.6s


[Parallel(n_jobs=7)]: Done 1000 out of 1000 | elapsed:   16.0s finished

0.9927562500228956
Wall time: 38min 48s


### Lightgbm

In [24]:
import lightgbm as lgb

lgb_train = lgb.Dataset(X_train, y_train)
lgb_valid = lgb.Dataset(X_test, y_test, reference=lgb_train)

params = {
'objective':'binary',
'num_leaves': 100,
#'learning_rate': 0.1,
'metric': 'auc',
'bagging_fraction': 0.75,
'bagging_freq': 10,
#'feature_fraction':0.75,
#'lambda_l1':5,
#'lambda_l2':5,
#'min_data_in_leaf': 500
#'is_unbalance':True
'seed':42,
#'num_leaves':50, 
#'objective':'regression_l2', 
'learning_rate':0.01,
'early_stopping_round':100,
'max_bin':400,
#'boosting':'dart'
}

In [None]:
%%time
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=2000,
                valid_sets = lgb_valid,
                verbose_eval = 10,)



Training until validation scores don't improve for 100 rounds.
[10]	valid_0's auc: 0.971752
[20]	valid_0's auc: 0.977022
[30]	valid_0's auc: 0.979891
[40]	valid_0's auc: 0.981215
[50]	valid_0's auc: 0.982281
[60]	valid_0's auc: 0.982822
[70]	valid_0's auc: 0.983084
[80]	valid_0's auc: 0.983481
[90]	valid_0's auc: 0.983726
[100]	valid_0's auc: 0.984335
[110]	valid_0's auc: 0.984556


In [37]:
lgb_predict_proba = gbm.predict(X_test)
print(roc_auc_score(y_test, lgb_predict_proba))

0.9937994505620824


In [None]:
params_2 = {
'objective':'binary',
'num_leaves': 25,
#'learning_rate': 0.1,
'metric': 'auc',
#'bagging_fraction': 0.75,
#'bagging_freq': 10,
#'feature_fraction':0.75,
#'lambda_l1':5,
#'lambda_l2':5,
#'min_data_in_leaf': 500
#'is_unbalance':True
'seed':42,
#'num_leaves':50, 
#'objective':'regression_l2', 
'learning_rate':0.01,
'early_stopping_round':100,
#'max_bin':400,
#'boosting':'dart'
}

In [None]:
%%time
gbm_2 = lgb.train(params_2,
                lgb_train,
                num_boost_round=2000,
                valid_sets = lgb_valid,
                verbose_eval = 10,)

In [38]:
lgb_2_predict_proba = gbm_2.predict(X_test)
print(roc_auc_score(y_test, lgb_2_predict_proba))

0.9935654681401638


## Mean

In [39]:
mean_predict=np.zeros(len(lgb_predict_proba))
for i in range(len(lgb_predict_proba)):
    mean_predict[i]=(lr_1_predict_proba[i,1]+lgb_predict_proba[i]+rf_predict_proba[i,1])/3
    
print('Mean_predict roc_auc:', '\t', roc_auc_score(y_test, mean_predict))

Mean_predict roc_auc: 	 0.9945612954777295


In [40]:
mean_predict=np.zeros(len(lgb_predict_proba))
for i in range(len(lgb_predict_proba)):
    mean_predict[i]=(lr_1_predict_proba[i,1]+lgb_predict_proba[i]+lgb_2_predict_proba[i]+rf_predict_proba[i,1])/4
    
print('Mean_predict roc_auc:', '\t', roc_auc_score(y_test, mean_predict))

Mean_predict roc_auc: 	 0.9946070169007974
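
The element-wise loops above compute an equal-weight average of the models' positive-class probabilities. The same blend can be written without a Python loop by stacking the prediction vectors as columns and averaging across them. A minimal sketch with made-up numbers: the arrays `p_lr`, `p_lgb` and `p_rf` are hypothetical stand-ins for `lr_1_predict_proba[:,1]`, `lgb_predict_proba` and `rf_predict_proba[:,1]`.

```python
import numpy as np

p_lr = np.array([0.90, 0.20, 0.60])
p_lgb = np.array([0.80, 0.10, 0.70])
p_rf = np.array([1.00, 0.30, 0.50])

# Columns are models, rows are samples; average across models.
mean_predict = np.column_stack([p_lr, p_lgb, p_rf]).mean(axis=1)

# The explicit loop, kept only to show that both give the same result.
loop_predict = np.zeros(len(p_lr))
for i in range(len(p_lr)):
    loop_predict[i] = (p_lr[i] + p_lgb[i] + p_rf[i]) / 3

print(mean_predict)
print(np.allclose(mean_predict, loop_predict))
```

Adding another model then only means adding one more array to the list passed to `np.column_stack`.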


## fit for submit:

In [None]:
lr = LogisticRegression(C=5, solver='sag', random_state=42) 
lr.fit(X_tf, y)

In [30]:
%%time
rf = RandomForestClassifier(n_estimators=1000, random_state=42, verbose=1, n_jobs=7)
rf.fit(X_tf, y)

[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:  1.9min
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:  8.6min
[Parallel(n_jobs=7)]: Done 436 tasks      | elapsed: 20.0min
[Parallel(n_jobs=7)]: Done 786 tasks      | elapsed: 35.6min


[Parallel(n_jobs=7)]: Done 1000 out of 1000 | elapsed: 45.2min finished

Wall time: 45min 15s


In [31]:
params = {
'objective':'binary',
'num_leaves': 100,
'bagging_fraction': 0.75,
'bagging_freq': 10,
'seed':42,
'learning_rate':0.01,
'max_bin':400,
}

In [32]:
%%time
lgb_train = lgb.Dataset(X_tf, y)
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=2500,)

Wall time: 2h 13min 12s


In [33]:
params_2 = {
'objective':'binary',
'num_leaves': 25,
'seed':42,
'learning_rate':0.01,
}

In [34]:
%%time
gbm_2 = lgb.train(params_2,
                lgb_train,
                num_boost_round=2500,)

Wall time: 38min 43s


## Train

In [61]:
lr_1_predict_proba = lr.predict_proba(X_tf)
rf_predict_proba = rf.predict_proba(X_tf)
lgb_predict_proba = gbm.predict(X_tf)
lgb_2_predict_proba = gbm_2.predict(X_tf)

[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    1.7s
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:    8.2s
[Parallel(n_jobs=7)]: Done 436 tasks      | elapsed:   19.0s
[Parallel(n_jobs=7)]: Done 786 tasks      | elapsed:   35.4s
[Parallel(n_jobs=7)]: Done 1000 out of 1000 | elapsed:   45.1s finished


In [62]:
test_predict_models4 = \
    pd.DataFrame([lr_1_predict_proba[:,1],lgb_predict_proba,lgb_2_predict_proba,rf_predict_proba[:,1]]).T

In [63]:
test_predict_models4.to_csv('tfidf_train_models4.csv', index=False)

## Test

In [35]:
df_test_preproc = df.query('test == 1').drop(['train','test'], axis=1)
#df_test_preproc = df_preproc.query('test == 1').drop(['train','test'], axis=1)

y_k = df_test_preproc.target.values
X_k = df_test_preproc.drop(['target'], axis=1)

In [36]:
X_test_name = tfv_name_fit.transform(X_k.name)
X_test_desc = tfv_desc_fit.transform(X_k.description)

In [37]:
X_kf_tf = hstack((X_test_name, X_test_desc))

In [38]:
lr_1_predict_proba = lr.predict_proba(X_kf_tf)
rf_predict_proba = rf.predict_proba(X_kf_tf)
lgb_predict_proba = gbm.predict(X_kf_tf)
lgb_2_predict_proba = gbm_2.predict(X_kf_tf)

[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed:    1.4s
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed:    7.8s
[Parallel(n_jobs=7)]: Done 436 tasks      | elapsed:   18.1s
[Parallel(n_jobs=7)]: Done 786 tasks      | elapsed:   33.1s
[Parallel(n_jobs=7)]: Done 1000 out of 1000 | elapsed:   41.9s finished


In [58]:
test_predict_models4 = \
    pd.DataFrame([lr_1_predict_proba[:,1],lgb_predict_proba,lgb_2_predict_proba,rf_predict_proba[:,1]]).T

In [60]:
test_predict_models4.to_csv('tfidf_test_models4.csv', index=False)

<a class="anchor" id="4"></a>
# 4. Word2Vec
developed by Anna  
Slack: @Anna

In [1]:
from gensim.models.word2vec import Word2Vec
import gensim.sklearn_api



In [21]:
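def transform_w2v_mean(word2vec, X):
    # fall back to a zero vector of the embedding size when no word is known
    dim = len(next(iter(word2vec.values())))
    return np.array([np.mean([word2vec[w] for w in words if w in word2vec]
                    or [np.zeros(dim)], axis=0) for words in X])

def transform_w2v_sum(word2vec, X):
    # sum (not average) the vectors of the known words
    dim = len(next(iter(word2vec.values())))
    return np.array([np.sum([word2vec[w] for w in words if w in word2vec]
                    or [np.zeros(dim)], axis=0) for words in X])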
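
To illustrate what the mean transform produces: each document (a list of tokens) becomes the average of the vectors of its known tokens, and a document with no known tokens falls back to an all-zero vector of the embedding's length. A toy sketch, assuming a plain dict of NumPy vectors like the `w2v_name` and `w2v_desc` dicts built later in this section. The 3-dimensional vectors and the name `mean_pool` are made up for illustration.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings.
toy_w2v = {
    'sales':   np.array([1.0, 0.0, 2.0]),
    'manager': np.array([3.0, 2.0, 0.0]),
}

def mean_pool(word2vec, docs):
    """Average the vectors of known tokens; zero vector if none are known."""
    dim = len(next(iter(word2vec.values())))
    rows = []
    for words in docs:
        vecs = [word2vec[w] for w in words if w in word2vec]
        rows.append(np.mean(vecs, axis=0) if vecs else np.zeros(dim))
    return np.array(rows)

docs = [['sales', 'manager'], ['manager', 'unknown'], []]
pooled = mean_pool(toy_w2v, docs)
print(pooled)
```

Averaging discards word order and document length, which is one reason these features are blended with the TF-IDF models rather than used alone.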

In [12]:
# remove punctuation
def clean_p(data):
    clean_re = re.compile(r'[^\w\s]')
    cleantext = re.sub(clean_re, ' ', data)
    cleantext = re.sub(r'\s+', ' ', cleantext)  # collapse runs of whitespace
    return cleantext

### Clean words

In [None]:
df['description'] = df['description'].apply(lambda x: clean_p(x))
df['name'] = df['name'].apply(lambda x: clean_p(x))

In [13]:
df['description'] = df['description'].str.split()
df['name'] = df['name'].str.split()

In [14]:
df['name'].values

array([list(['Дизайнер', 'консультант', 'мебели']),
       list(['Продавец', 'консультант', 'ТЦ', 'на', 'Пушкина']),
       list(['Менеджер', 'по', 'продажам']), ...,
       list(['Менеджер', 'по', 'продажам']),
       list(['Торговый', 'представитель', 'в', 'Алматы']), list(['Швея'])],
      dtype=object)

### Training the vocabulary

In [15]:
%%time
model = Word2Vec(size=200, min_count=1) # min_count: train only on words that occur at least n times; 1 is better, but 2 is faster
model.build_vocab(df.description.values)
model.train(df.description.values, total_examples=model.corpus_count, epochs=model.iter)
w2v_desc = dict(zip(model.wv.index2word, model.wv.syn0))

Wall time: 3min 36s


In [16]:
%%time
model = Word2Vec(size=100, min_count=1) # min_count: train only on words that occur at least n times; 1 is better, but 2 is faster
model.build_vocab(df.name.values)
model.train(df.name.values, total_examples=model.corpus_count, epochs=model.iter)
w2v_name = dict(zip(model.wv.index2word, model.wv.syn0))

Wall time: 4.39 s


In [17]:
w2v_desc['торговый']

array([-1.1092808 ,  3.1371803 ,  3.1112075 , -1.634942  ,  2.5531266 ,
        1.2145935 ,  2.0634282 ,  2.384946  , -1.1940956 ,  0.21138524,
       -1.3777215 , -0.7580045 , -0.5489473 ,  2.9192393 ,  0.29353264,
        0.2148605 ,  0.22488508, -1.8911971 ,  1.7263371 , -4.0578938 ,
       -2.670518  , -1.476629  , -0.72457963,  0.7752355 ,  1.7286243 ,
       -0.0321142 , -0.51697206,  1.3388697 ,  0.3819138 , -1.2054482 ,
        0.10867018, -1.4429821 ,  0.5105837 , -0.06046249, -0.7314942 ,
       -0.22062474, -1.3847734 , -0.83774763,  1.2438725 ,  1.3739913 ,
        2.6923735 , -0.5587024 ,  0.8099814 ,  0.9791361 , -1.3813494 ,
        0.4592724 , -0.7144494 ,  0.23488039,  1.5531833 , -1.5729442 ,
        2.4677796 , -1.4110694 , -3.7599382 ,  0.00666782,  1.0684805 ,
        0.774604  , -2.1721253 , -0.67960525, -1.2658603 ,  0.5761809 ,
        0.43903655,  0.29532656, -3.3248506 , -0.92592263, -0.03689454,
       -0.5025313 , -0.754285  ,  2.0862854 ,  3.8509436 , -0.56

In [18]:
df_train_preproc = df.query('train == 1').drop(['train','test'], axis=1)
#df_test_preproc = df_preproc.query('test == 1').drop(['train','test'], axis=1)

y = df_train_preproc.target.values
X = df_train_preproc.drop(['target'], axis=1)

In [22]:
tf_desc_mean = transform_w2v_mean(w2v_desc, X.description)
#tf_desc_sum = transform_w2v_sum(w2v_desc, X.description)

tf_nam_mean = transform_w2v_mean(w2v_name, X.name)
#tf_nam_sum = transform_w2v_sum(w2v_name, X.name)

In [23]:
tf_desc_mean.shape, tf_nam_mean.shape

((200000, 200), (200000, 100))

In [24]:
X_tf = np.hstack((tf_nam_mean,tf_desc_mean))

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X_tf, y, test_size = 0.25 ,random_state = 42)

In [26]:
import warnings
warnings.filterwarnings('ignore')

In [27]:
lr = LogisticRegression(random_state=42)
cv_print(lr, X_train, y_train, [42,])

Mean:  0.9872178442907057
Std:  0.0005390743086190204


In [28]:
import lightgbm as lgb

lgb_train = lgb.Dataset(X_train, y_train)
lgb_valid = lgb.Dataset(X_test, y_test, reference=lgb_train)

params = {
'objective':'binary',
'num_leaves': 25,
#'learning_rate': 0.1,
'metric': 'auc',
'bagging_fraction': 0.75,
'bagging_freq': 10,
#'feature_fraction':0.75,
#'lambda_l1':5,
#'lambda_l2':5,
#'min_data_in_leaf': 500
#'is_unbalance':True
'seed':42,
#'num_leaves':50, 
#'objective':'regression_l2', 
'learning_rate':0.02,
'early_stopping_round':500,
'max_bin':500,
#'boosting':'dart'
}

In [29]:
%%time
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=7500,
                valid_sets = lgb_valid,
                verbose_eval = 50,)

Training until validation scores don't improve for 500 rounds.
[50]	valid_0's auc: 0.978418
[100]	valid_0's auc: 0.983214
[150]	valid_0's auc: 0.985796
[200]	valid_0's auc: 0.987377
[250]	valid_0's auc: 0.988452
[300]	valid_0's auc: 0.989251
[350]	valid_0's auc: 0.989807
[400]	valid_0's auc: 0.990244
[450]	valid_0's auc: 0.990596
[500]	valid_0's auc: 0.990884
[550]	valid_0's auc: 0.991144
[600]	valid_0's auc: 0.991342
[650]	valid_0's auc: 0.991493
[700]	valid_0's auc: 0.991601
[750]	valid_0's auc: 0.991714
[800]	valid_0's auc: 0.991844
[850]	valid_0's auc: 0.991938
[900]	valid_0's auc: 0.992029
[950]	valid_0's auc: 0.992107
[1000]	valid_0's auc: 0.992184
[1050]	valid_0's auc: 0.992246
[1100]	valid_0's auc: 0.99231
[1150]	valid_0's auc: 0.992366
[1200]	valid_0's auc: 0.992401
[1250]	valid_0's auc: 0.992428
[1300]	valid_0's auc: 0.992478
[1350]	valid_0's auc: 0.992507
[1400]	valid_0's auc: 0.992556
[1450]	valid_0's auc: 0.992606
[1500]	valid_0's auc: 0.99265
[1550]	valid_0's auc: 0.99268

## fit for submit:

In [30]:
%%time
lgb_train = lgb.Dataset(X_tf, y)
params = {
'objective':'binary',
'num_leaves': 25,
#'learning_rate': 0.1,
#'metric': 'auc',
'bagging_fraction': 0.75,
'bagging_freq': 10,
#'feature_fraction':0.75,
#'lambda_l1':5,
#'lambda_l2':5,
#'min_data_in_leaf': 500
#'is_unbalance':True
'seed':42,
#'num_leaves':50, 
#'objective':'regression_l2', 
'learning_rate':0.02,
#'early_stopping_round':500,
'max_bin':500,
#'boosting':'dart'
}

gbm = lgb.train(params,
                lgb_train,
                num_boost_round=7000,)

Wall time: 11min 28s


In [31]:
lgb_predict_proba = gbm.predict(X_tf)

In [32]:
print(roc_auc_score(y, lgb_predict_proba))

0.9999945880324963


In [36]:
df_test_preproc = df.query('test == 1').drop(['train','test'], axis=1)
#df_test_preproc = df_preproc.query('test == 1').drop(['train','test'], axis=1)

y_k = df_test_preproc.target.values
X_k = df_test_preproc.drop(['target'], axis=1)

In [37]:
tf_desc_mean = transform_w2v_mean(w2v_desc, X_k.description)
#tf_desc_sum = transform_w2v_sum(w2v_desc, X.description)

tf_nam_mean = transform_w2v_mean(w2v_name, X_k.name)
#tf_nam_sum = transform_w2v_sum(w2v_name, X.name)

X_test_tf = np.hstack((tf_nam_mean,tf_desc_mean))

In [39]:
lgb_predict_proba = gbm.predict(X_test_tf)
pd.DataFrame(lgb_predict_proba).to_csv('w2v_test_models1.csv', index=False)

<a class="anchor" id="5"></a>
# 5. Combining the solutions and making the Submit

In [40]:
pred_model_1_test = pd.read_csv('tfidf_test_models4.csv',)

In [42]:
pred_model_1_test['w2v_lgb'] = pd.read_csv('w2v_test_models1.csv',)

In [43]:
pred_model_1_test.head(10)

Unnamed: 0,0,1,2,3,w2v_lgb
0,0.982608,0.997645,0.987516,0.8745,0.983551
1,0.999489,0.999703,0.99878,0.997,0.999908
2,0.989155,0.989016,0.980836,0.959,0.986869
3,0.999718,0.999841,0.998535,0.992,0.999907
4,0.000623,0.00084,0.005019,0.036,8.7e-05
5,0.016907,0.005448,0.026112,0.001,0.003744
6,0.998063,0.999001,0.996551,1.0,0.999985
7,0.016659,0.430139,0.6832,0.537,0.007741
8,0.000588,0.001213,0.004341,0.068,5.2e-05
9,0.001176,0.004677,0.023571,0.139,0.001404


In [44]:
mean_5_model = pred_model_1_test.iloc[:, [0,1,2,3,4]].mean(axis=1)
mean_5_model[0:10]

0    0.965164
1    0.998976
2    0.980975
3    0.998000
4    0.008514
5    0.010642
6    0.998720
7    0.334948
8    0.014839
9    0.033966
dtype: float64
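
The blend here is a plain row-wise mean over the five prediction columns. As a quick sanity check, the sketch below rebuilds rows 0 and 7 from the table printed above (values as displayed, so already rounded) and confirms that their means match the output. The column names `m0` to `m3` are placeholders for the unnamed TF-IDF model columns.

```python
import pandas as pd

rows = pd.DataFrame(
    [[0.982608, 0.997645, 0.987516, 0.8745, 0.983551],
     [0.016659, 0.430139, 0.683200, 0.5370, 0.007741]],
    index=[0, 7],
    columns=['m0', 'm1', 'm2', 'm3', 'w2v_lgb'],
)

# Equal-weight average across the five models, rounded like the display.
row_means = rows.mean(axis=1).round(6)
print(row_means)
```

Row 7 shows why blending helps: the individual models disagree strongly (from about 0.008 to 0.68), and the average lands at an intermediate score.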

In [45]:
test_df = pd.read_csv('data/test.csv', sep='\t', low_memory=False)

In [46]:
submission = pd.DataFrame(columns=['id', 'target'],)
submission['id'], submission['target'] = test_df.id.values, mean_5_model

In [47]:
submission.head(5)

Unnamed: 0,id,target
0,200000,0.965164
1,200001,0.998976
2,200002,0.980975
3,200003,0.998
4,200004,0.008514


In [48]:
test_df.shape, submission.shape

((170179, 3), (170179, 2))

In [49]:
submission.to_csv('submission_v9_tfidf-w2v_mean5.csv', index=False)


Prepared by: <b>Lek</b> <br/>
Slack: @Lek <br/>
telegram: @AlexLekov <br/>