# Model: NN - TF-IDF (unigrams) text vectorization (merged features)

Our NN models differ in how we preprocess the text features (the most important features) before feeding them into the NN.
This choice also affects the architecture of the NN.

In this notebook we:
1. Merge all text features (title, description, params) into one feature.
2. Vectorize this feature with TF-IDF over unigrams (weighted word counts).
3. Do not use dropout during training.

In [1]:
%load_ext autoreload

In [2]:
%run stephan_modules.ipynb


Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.

Using TensorFlow backend.


In [3]:
DATA_PATH = '/home/u14303/Avito'
HELPER_DATA_PATH = '/home/u14303/Avito/helper_data'

## Feature enrichment

In [4]:
print('loading data...')
train, test = load_data(DATA_PATH)
train, test = basic_enrichment(train, test, helper_data_path=HELPER_DATA_PATH)
train, test = load_image_features(train, test, helper_data_path=HELPER_DATA_PATH)
train, test = load_text_features(train, test, helper_data_path=HELPER_DATA_PATH)
train, test = add_aggregated_features(train, test, helper_data_path=HELPER_DATA_PATH)
train, test = numeric_features_cleaning(train, test, helper_data_path=HELPER_DATA_PATH)
train, test = complete_image_top_1(train, test, helper_data_path=HELPER_DATA_PATH)
train, test = complete_price(train, test, helper_data_path=HELPER_DATA_PATH)

loading data...
Adding basic features...
Done adding basic features.
Adding image features...




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy






Done loading image features.
Loading text features...
Done loading text features.
Loading aggregated features...
Done loading aggregated features.
Loading aggregated features...
Done loading aggregated features.
Cleaning and completing numeric features...







Done cleaning numeric features.
Completing image_top_1 features...
Done loading image_top_1 completions.
Completing price...
Done loading log_price_regression.


In [5]:
list(train.columns)

['item_id',
 'user_id',
 'region',
 'city',
 'parent_category_name',
 'category_name',
 'param_1',
 'param_2',
 'param_3',
 'title',
 'description',
 'price',
 'item_seq_number',
 'activation_date',
 'user_type',
 'image',
 'image_top_1',
 'deal_probability',
 'has_description',
 'has_price',
 'has_params',
 'has_image',
 'month',
 'day',
 'weekday',
 'user_ads_count',
 'title_description_params',
 'img_size',
 'img_sharpness',
 'img_luminance',
 'img_colorfulness',
 'img_confidence',
 'img_keypoints',
 'log_img_sharpness',
 'log_img_keypoints',
 'title_word_count',
 'description_non_regular_chars_ratio',
 'description_word_count',
 'merged_params_word_count',
 'description_sentence_count',
 'description_words/sentence_ratio',
 'title_capital_letters_ratio',
 'description_capital_letters_ratio',
 'title_non_regular_chars_ratio',
 'title_num_of_newrow_char',
 'description_num_of_newrow_char',
 'title_num_adj',
 'title_num_nouns',
 'title_adj_to_len_ratio',
 'title_noun_to_len_ratio',
 '

## Vectorize features as input to the NN

Pick the features to feed into the NN that will be used to predict the deal probability.

We use the most significant features, as indicated by the LGBM model. The text features are merged into a single text feature.

In [6]:
# text_features = ['title', 'description', 'param123']
text_feature = 'title_description_params'
cat_features = ['user_type',
                'region', 'city',
                'parent_category_name', 'category_name', 'param_1', 'param_2', 'param_3',
                'image_top_1_class', 'image_top_1_rounded_regression',
                'month', 'day', 'weekday',
                'has_price', 'has_description', 'has_params', 'has_image']
cont_ord_features = ['image_top_1_regression',
                     'log_price_regression',
                     'avg_days_up_user', 'avg_times_up_user', 'n_user_items', 'user_ads_count',
                     'log_item_seq_number',
                     'img_size', 'img_luminance', 'img_colorfulness', 'img_confidence', 'log_img_sharpness', 'log_img_keypoints',
                     'title_word_count', 'description_word_count', 'merged_params_word_count',
                     'description_non_regular_chars_ratio', 'title_capital_letters_ratio', 'description_capital_letters_ratio',
                     'title_non_regular_chars_ratio', 'title_adj_to_len_ratio', 'title_noun_to_len_ratio',
                     'title_sentiment']

train_y_prob = train['deal_probability']

#### Text features

Vectorize the merged text with TF-IDF.
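
To make the TF-IDF step concrete, here is a minimal sketch on a toy corpus. The three documents and the tiny `max_features=4` cap are assumptions chosen for illustration; the notebook itself uses the merged Avito text and a much larger vocabulary.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the merged title/description/params text (hypothetical data).
docs = [
    "red bike for sale",
    "blue bike almost new",
    "new phone for sale",
]

# Unigrams only, vocabulary capped at the 4 most frequent terms across the corpus.
vec = TfidfVectorizer(ngram_range=(1, 1), max_features=4, dtype=np.float32)
X = vec.fit_transform(docs)  # scipy sparse matrix, one row per document

vocab = sorted(vec.vocabulary_)
print(vocab)    # terms that survived the max_features cut
print(X.shape)  # (number of documents, vocabulary size)
```

Each row is a sparse, L2-normalized vector of TF-IDF weights, which is why the text input of the network can stay sparse.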

In [7]:
import codecs

# Load the stop-word list (one word per line, cp1251-encoded) into a set.
stopwords = set()
with codecs.open('stopwords_ru.txt', encoding='cp1251') as ins:
    for w in ins:
        word = w.strip("\r\n")  # drop the trailing line break
        stopwords.add(word.lower())

In [8]:
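import gc

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

text_tfidf_dim = 7500  # vocabulary size, and the width of the NN text input

train_x_text = train[text_feature].astype('str')
test_x_text = test[text_feature].astype('str')
# Fit the vocabulary on train and test text together.
all_text = np.hstack([train_x_text, test_x_text])

# Unigrams only; drop stop words and terms that occur in more than half of the documents.
tfidf_enc = TfidfVectorizer(ngram_range=(1, 1), max_df=0.5, stop_words=stopwords,
                            dtype=np.float32, max_features=text_tfidf_dim)
print('Fitting TF-IDF on text...')
tfidf_enc.fit(all_text)
print('Done.')

# Free the concatenated copy of the text.
del all_text
gc.collect()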

Fitting TF-IDF on text...
Done.


14

In [9]:
print('TF-IDF: applying encoder on text...')

train_x_text = tfidf_enc.transform(train_x_text)
test_x_text = tfidf_enc.transform(test_x_text)

print('Done.')

TF-IDF: applying encoder on text...
Done.


#### Categorical features

Vectorize all loaded categorical features: label-encode each column to integers, then one-hot encode the result.

See: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
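
The two-stage encoding can be sketched on toy data as follows. The column values are hypothetical; the point is that fitting the encoders on train and test together gives both sets the same one-hot columns, even for categories that appear in only one of them.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy frames; 'east' and 'company' occur only in the test set.
train_cat = pd.DataFrame({"region": ["north", "south", "north"],
                          "user_type": ["private", "shop", "private"]})
test_cat = pd.DataFrame({"region": ["east", "south"],
                         "user_type": ["shop", "company"]})

# Stage 1: map each category to an integer, fitting on train and test together.
all_cat = pd.concat([train_cat, test_cat], axis=0)
for col in all_cat.columns:
    enc = LabelEncoder().fit(all_cat[col])
    train_cat[col] = enc.transform(train_cat[col])
    test_cat[col] = enc.transform(test_cat[col])

# Stage 2: one-hot encode the integer codes into a sparse matrix.
onehot = OneHotEncoder().fit(pd.concat([train_cat, test_cat], axis=0))
train_onehot = onehot.transform(train_cat)
test_onehot = onehot.transform(test_cat)

print(train_onehot.shape, test_onehot.shape)
```

Every row of the result has exactly one active column per original feature.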

In [10]:
# Work on copies so the assignments below do not trigger SettingWithCopyWarning.
train_x_cat = train[cat_features].copy()
test_x_cat = test[cat_features].copy()
for col in cat_features:
    train_x_cat[col] = train_x_cat[col].astype('category')
    test_x_cat[col] = test_x_cat[col].astype('category')

# Encode to integers.
# For vectorization (encoding) we concatenate train and test into one frame.
all_cat = pd.concat([train_x_cat, test_x_cat], axis=0)
for col in cat_features:
    enc = preprocessing.LabelEncoder().fit(all_cat[col])
    train_x_cat[col] = enc.transform(train_x_cat[col])
    test_x_cat[col] = enc.transform(test_x_cat[col])

# One-hot encode:
enc = OneHotEncoder()
enc.fit(pd.concat([train_x_cat, test_x_cat], axis=0))
train_x_cat = enc.transform(train_x_cat)
test_x_cat = enc.transform(test_x_cat)

del all_cat
gc.collect()






63

#### Numerical (continuous/ordinal) features

Normalize all loaded numeric features.
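
Assuming `normalize` here is `sklearn.preprocessing.normalize`, calling it with `axis=0` rescales each feature column to unit L2 norm. The sketch below uses hypothetical toy arrays to show one consequence: if train and test are normalized separately, the same raw row can end up with different values in each set, whereas reusing the train column norms keeps both on one scale.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Hypothetical toy data: two features, the test row equals the first train row.
train_num = np.array([[3.0, 10.0],
                      [4.0, 0.0]])
test_num = np.array([[3.0, 10.0]])

# axis=0 rescales each column (feature) to unit L2 norm.
train_scaled = normalize(train_num, axis=0)

# Normalizing the test set on its own uses the test column norms.
test_independent = normalize(test_num, axis=0)

# Alternative: divide the test set by the column norms computed on train.
train_norms = np.linalg.norm(train_num, axis=0)
test_consistent = test_num / train_norms

print(train_scaled)
print(test_independent)
print(test_consistent)
```

With the shared norms, the test row matches the scaled version of the identical train row.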

In [11]:
# Fill missing values with 0 and cast to float64; fillna() returns a new frame,
# so the source DataFrame slices are never modified in place.
train_x_numerical = train[cont_ord_features].fillna(0).astype('float64')
test_x_numerical = test[cont_ord_features].fillna(0).astype('float64')

# Normalize features column-wise (axis=0).
# Note: train and test are normalized independently of each other here.
train_x_numerical = normalize(train_x_numerical, axis=0)
test_x_numerical = normalize(test_x_numerical, axis=0)






# Learning - Neural Net

### Define network

In [12]:
text_hidden_dim = 256
cat_hidden_dim = 128
merged_hidden_dim = 256
out_dim = 1

In [13]:
# Text
l_text_input = Input(shape=(text_tfidf_dim,), sparse=True, name="text_tfidf_input")
l_hidden_text = Dense(text_hidden_dim, activation='relu',
                      kernel_regularizer=regularizers.l2(1e-6), name='l_hidden_text')(l_text_input)

# Categoricals
l_cat_input = Input(shape=(train_x_cat.shape[1],), sparse=True, name="cat_input")
l_hidden_cat = Dense(cat_hidden_dim, activation='relu',
                     kernel_regularizer=regularizers.l2(1e-6), name='l_hidden_cat')(l_cat_input)

# Numerical
l_numerical_input = Input(shape=(train_x_numerical.shape[1],), name="numerical_input")

# Concatenate the two hidden representations with the raw numerical input.
l_aggregative = concatenate([l_hidden_text, l_hidden_cat, l_numerical_input])

l_merged_hidden = Dense(merged_hidden_dim, activation='relu',
                        kernel_regularizer=regularizers.l2(1e-6), name='l_merged_hidden')(l_aggregative)

# Sigmoid output keeps the predicted deal probability between 0 and 1.
output = Dense(out_dim, activation='sigmoid', name='output')(l_merged_hidden)

adam_opt = Adam(lr=0.001)

def rmse_err(y_true, y_pred):
    """Root mean squared error, used as the training loss."""
    return K.sqrt(K.mean(K.square(y_true - y_pred)))

model = Model(inputs=[l_text_input, l_cat_input, l_numerical_input], outputs=[output])
model.compile(optimizer=adam_opt,
              loss=[rmse_err])
model.summary()

# Optionally resume from previously saved weights.
load = False
fname = 'NN-TFIDF-UNI-nodropout-model-weights.h5'
if load:
    model.load_weights(fname)

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
text_tfidf_input (InputLayer)   (None, 7500)         0                                            
__________________________________________________________________________________________________
cat_input (InputLayer)          (None, 9950)         0                                            
__________________________________________________________________________________________________
l_hidden_text (Dense)           (None, 256)          1920256     text_tfidf_input[0][0]           
__________________________________________________________________________________________________
l_hidden_cat (Dense)            (None, 128)          1273728     cat_input[0][0]                  
__________________________________________________________________________________________________
numerical_
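
The `Param #` column can be checked by hand: a `Dense` layer has one weight per (input, unit) pair plus one bias per unit. A minimal sketch (the helper name `dense_param_count` is an assumption for illustration, not part of Keras):

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

text_params = dense_param_count(7500, 256)  # l_hidden_text
cat_params = dense_param_count(9950, 128)   # l_hidden_cat

print(text_params)  # 1920256
print(cat_params)   # 1273728
```

Both values agree with the summary rows for `l_hidden_text` and `l_hidden_cat`.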

In [14]:
# reduce_lr_cb = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=1, mode='min', min_delta=0.01,
#                                   verbose=1, min_lr=0.0001)
# Save the model weights after every epoch.
save_weights_cd = ModelCheckpoint(fname, save_weights_only=True, mode='auto', period=1)

nbatch_size = 512

# First pass: hold out 30% of the training data for validation.
print('Curr batch size: {}'.format(nbatch_size))
model.fit({'text_tfidf_input': train_x_text, 'cat_input': train_x_cat, 'numerical_input': train_x_numerical},
          [train_y_prob],
          validation_split=0.3,
          epochs=1,
          batch_size=nbatch_size,
          callbacks=[save_weights_cd])

nbatch_size = 1024

# Second pass: train on all data (no validation split).
print('Current batch size: {}'.format(nbatch_size))
model.fit({'text_tfidf_input': train_x_text, 'cat_input': train_x_cat, 'numerical_input': train_x_numerical},
          [train_y_prob],
          epochs=1,
          batch_size=nbatch_size)

Curr batch size: 512
Train on 1052396 samples, validate on 451028 samples
Epoch 1/1
Current batch size: 1024
Epoch 1/1


<keras.callbacks.History at 0x14fff24fd390>

Batch aggregator to load data from sparse matrices.
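
No batch-loading code appears at this point in the notebook. As a hedged sketch of what such a helper might look like, the generator below slices row ranges out of sparse matrices and converts only the current slice to a dense array. The function name `sparse_batch_generator` and the toy data are assumptions for illustration, not the author's code; the input names match the `Input` layers defined earlier.

```python
import numpy as np
from scipy import sparse

def sparse_batch_generator(x_text, x_cat, x_num, y, batch_size):
    """Yield (inputs, targets) batches, densifying one row slice at a time.

    x_text and x_cat are expected to be CSR matrices (cheap row slicing);
    x_num and y are dense arrays with the same number of rows.
    """
    n_samples = x_text.shape[0]
    for start in range(0, n_samples, batch_size):
        end = min(start + batch_size, n_samples)
        inputs = {
            'text_tfidf_input': x_text[start:end].toarray(),
            'cat_input': x_cat[start:end].toarray(),
            'numerical_input': x_num[start:end],
        }
        yield inputs, y[start:end]

# Toy data (hypothetical): 5 samples, 4 text columns, 3 categorical columns.
x_text = sparse.random(5, 4, density=0.5, format='csr', random_state=0)
x_cat = sparse.random(5, 3, density=0.5, format='csr', random_state=1)
x_num = np.ones((5, 2))
y = np.linspace(0.0, 1.0, 5)

batches = list(sparse_batch_generator(x_text, x_cat, x_num, y, batch_size=2))
print(len(batches))
```

Densifying per batch keeps peak memory proportional to the batch size rather than to the full matrix.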

### Save predictions for further ensembles

In [15]:
y_pred_train = model.predict({'text_tfidf_input' : train_x_text, 'cat_input' : train_x_cat, 'numerical_input' : train_x_numerical})
y_pred_test = model.predict({'text_tfidf_input': test_x_text, 'cat_input': test_x_cat, 'numerical_input': test_x_numerical})

In [16]:
res_train_df = pd.DataFrame(train['item_id'])
res_train_df['deal_probability'] = y_pred_train.ravel()  # flatten (n, 1) predictions to 1-D

In [17]:
res_test_df = pd.DataFrame(test['item_id'])
res_test_df['deal_probability'] = y_pred_test.ravel()  # flatten (n, 1) predictions to 1-D

In [18]:
res_df = pd.concat([res_train_df, res_test_df], axis = 0)

In [19]:
res_df.to_csv('/home/u14303/Avito/Predictions/NN-TFIDF-UNI-MERGED-nodropout.csv.gz', index=False, compression='gzip')