# "Price" Completion by learning

Because the price feature carries so much weight in LGBM's ratings, we want to complete it in a "smart" way (not just filling with 0, the mode/median, or NA). In this notebook we learn prices from the other features of our data by training a regression NN.

In [1]:
%load_ext autoreload

In [2]:
%run stephan_modules.ipynb


Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.

Using TensorFlow backend.


In [3]:
DATA_PATH = './data'
HELPER_DATA_PATH = './helper_data'

## Feature enrichment

Load all engineered features (including image_top_1).

In [4]:
print('loading data...')
train, test = load_data(DATA_PATH)
train, test = basic_enrichment(train, test, helper_data_path=HELPER_DATA_PATH)
train, test = load_image_features(train, test, helper_data_path=HELPER_DATA_PATH)
train, test = load_text_features(train, test, helper_data_path=HELPER_DATA_PATH)
train, test = add_aggregated_features(train, test, helper_data_path=HELPER_DATA_PATH)
train, test = numeric_features_cleaning(train, test, helper_data_path=HELPER_DATA_PATH)
train, test = complete_image_top_1(train, test, helper_data_path=HELPER_DATA_PATH)

loading data...
Adding basic features...
Done adding basic features.
Adding image features...




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy






Done loading image features.
Loading text features...
Done loading text features.
Loading aggregated features...
Done loading aggregated features.
Loading aggregated features...
Done loading aggregated features.
Cleaning and completing numeric features...







Done cleaning numeric features.
Completing image_top_1 features...
Done loading image_top_1 completions.


In [5]:
list(train.columns)

['item_id',
 'user_id',
 'region',
 'city',
 'parent_category_name',
 'category_name',
 'param_1',
 'param_2',
 'param_3',
 'title',
 'description',
 'price',
 'item_seq_number',
 'activation_date',
 'user_type',
 'image',
 'image_top_1',
 'deal_probability',
 'has_description',
 'has_price',
 'has_params',
 'has_image',
 'month',
 'day',
 'weekday',
 'user_ads_count',
 'title_description_params',
 'img_size',
 'img_sharpness',
 'img_luminance',
 'img_colorfulness',
 'img_confidence',
 'img_keypoints',
 'log_img_sharpness',
 'log_img_keypoints',
 'title_word_count',
 'description_non_regular_chars_ratio',
 'description_word_count',
 'merged_params_word_count',
 'description_sentence_count',
 'description_words/sentence_ratio',
 'title_capital_letters_ratio',
 'description_capital_letters_ratio',
 'title_non_regular_chars_ratio',
 'title_num_of_newrow_char',
 'description_num_of_newrow_char',
 'title_num_adj',
 'title_num_nouns',
 'title_adj_to_len_ratio',
 'title_noun_to_len_ratio',
 '

Keep real item (ad) IDs for later merge and save.

In [6]:
r_train_ids = pd.DataFrame(train['item_id'])
r_test_ids = pd.DataFrame(test['item_id'])

Rebuild the train dataframe so that it contains every sample with a known price (from both the original train and test dataframes). All samples with a missing price go into the "test" dataframe.

In [7]:
train = pd.concat([train, test], axis = 0)


Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.





In this notebook we will "smartly" complete the NA prices (to some extent, better than "median").

In [8]:
test = train[train['price'].isna()]

In [9]:
train = train[train['price'].notna()]
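
The pooling and splitting done in the cells above can be illustrated on a toy frame. This is only a minimal sketch: the frames and values are made up, and `ignore_index=True` and `sort=False` are optional choices that the notebook itself does not use.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real train/test frames (values are made up).
toy_train = pd.DataFrame({'item_id': ['a', 'b'], 'price': [100.0, np.nan]})
toy_test = pd.DataFrame({'item_id': ['c', 'd'], 'price': [np.nan, 250.0]})

# Pool both frames; ignore_index avoids duplicate row labels after concat.
pooled = pd.concat([toy_train, toy_test], axis=0, ignore_index=True, sort=False)

# Rows with a known price become the regression training set,
# rows without one are the rows whose price we want to predict.
known = pooled[pooled['price'].notna()]
missing = pooled[pooled['price'].isna()]

print(known['item_id'].tolist())    # ['a', 'd']
print(missing['item_id'].tolist())  # ['b', 'c']
```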

We predict log_price instead of the raw price: its values are far less spread out and more concentrated, which makes it a more stable target for an NN.

In [10]:
y_target = train['log_price']
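
For reference, here is a small sketch of how a log target maps to prices and back. It assumes log_price is defined as log(1 + price); that definition lives in the helper notebook and is not shown here, although the sample values printed further down (for example 5.993961, which is log(401)) are consistent with it. The prices below are made up.

```python
import numpy as np

prices = np.array([0.0, 400.0, 3000.0, 40000.0])  # made-up prices

# Assumption: log_price is log(1 + price); the notebook's helper may differ.
log_price = np.log1p(prices)

# A prediction in log space maps back to a price with the inverse transform.
recovered = np.expm1(log_price)

print(np.round(log_price, 3))
print(np.allclose(recovered, prices))
```

Under the same assumption, the regression output would need `np.expm1` applied before it could be read as a price.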

## Vectorize features as input to an NN

Pick features to feed into the NN that will be used to learn the price.

In [11]:
text_feature = 'title_description_params'
cat_features = ['user_type',
                'region', 'city',
                'parent_category_name', 'category_name', 'param_1', 'param_2', 'param_3',
                'image_top_1_class', 'image_top_1_rounded_regression',
                'month', 'day', 'weekday',
                'has_description', 'has_params', 'has_image']
cont_ord_features = ['image_top_1_regression',
                     'avg_days_up_user', 'avg_times_up_user', 'n_user_items', 'user_ads_count',
                     'log_item_seq_number',
                     'img_size', 'img_luminance', 'img_colorfulness', 'img_confidence', 'log_img_sharpness', 'log_img_keypoints',
                     'title_word_count', 'description_word_count', 'merged_params_word_count',
                     'description_non_regular_chars_ratio', 'title_capital_letters_ratio', 'description_capital_letters_ratio',
                     'title_non_regular_chars_ratio', 'title_adj_to_len_ratio', 'title_noun_to_len_ratio',
                     'title_sentiment']

#### Text features

TF-IDF vectorize the merged texts.

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

text_tfidf_dim = 7500

train_x_text = train[text_feature].astype('str')
test_x_text = test[text_feature].astype('str')
all_text = np.hstack([train_x_text, test_x_text])

tfidf_enc = TfidfVectorizer(max_features=text_tfidf_dim, ngram_range=(1, 1), dtype=np.float32)
print('Fitting tf-idf on text...')
tfidf_enc.fit(all_text)
print('Done.')

del all_text
gc.collect()

Fitting tf-idf on text...
Done.


28

In [13]:
print('TF-IDF: applying encoder on text...')

train_x_text = tfidf_enc.transform(train_x_text)
test_x_text = tfidf_enc.transform(test_x_text)

print('Done.')

TF-IDF: applying encoder on text...
Done.
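
To make the effect of `max_features` concrete, here is a toy run of the same encoder settings on three made-up texts. It is only an illustration of the scikit-learn call, not of the real data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['red bike for sale', 'blue bike', 'red sofa almost new']  # made-up texts

# max_features keeps only the most frequent terms across the corpus.
toy_enc = TfidfVectorizer(max_features=4, ngram_range=(1, 1), dtype=np.float32)
X = toy_enc.fit_transform(docs)

print(X.shape)                      # (3, 4): one row per text, one column per kept term
print(sorted(toy_enc.vocabulary_))  # the 4 terms that survived the cut
```

The result is a SciPy sparse matrix, which is why the text input of the network is declared with `sparse=True`.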


#### Categorical features

Vectorize all loaded categorical features.


See: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
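
As a small illustration of one-hot encoding over the union of train and test categories, consider the sketch below. The columns and values are made up, and it assumes scikit-learn 0.20 or later, where `OneHotEncoder` accepts string columns directly (the cell below first maps categories to integers with `LabelEncoder`).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical columns standing in for region / user_type.
toy_train = pd.DataFrame({'region': ['north', 'south', 'north'],
                          'user_type': ['Private', 'Shop', 'Private']})
toy_test = pd.DataFrame({'region': ['south', 'east'],
                         'user_type': ['Company', 'Private']})

# Fit on train and test together so every category gets its own column.
toy_enc = OneHotEncoder()
toy_enc.fit(pd.concat([toy_train, toy_test], axis=0))

train_oh = toy_enc.transform(toy_train)  # SciPy sparse matrix
test_oh = toy_enc.transform(toy_test)

print(train_oh.shape)  # (3, 6): 3 regions + 3 user types
print(test_oh.toarray())
```

Fitting on the union matters here because "east" and "Company" appear only in the toy test frame; an encoder fitted on the toy train frame alone would raise on them by default.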

In [14]:
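# Copy the column subsets so the assignments below do not act on slices of train/test.
train_x_cat = train[cat_features].copy()
test_x_cat = test[cat_features].copy()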
for col in cat_features:
    train_x_cat[col] = train_x_cat[col].astype('category')
    test_x_cat[col] = test_x_cat[col].astype('category')

# Encode to integers.
# For vectorization (encoding) we concat both train and test into one
all_cat = pd.concat([train_x_cat, test_x_cat], axis = 0)
for col in cat_features:
    enc = preprocessing.LabelEncoder().fit(all_cat[col])
    train_x_cat[col] = enc.transform(train_x_cat[col])
    test_x_cat[col] = enc.transform(test_x_cat[col])

# One-hot encode:
enc = OneHotEncoder()
enc.fit(pd.concat([train_x_cat, test_x_cat], axis = 0))
train_x_cat = enc.transform(train_x_cat)
test_x_cat = enc.transform(test_x_cat)

del all_cat
gc.collect()






70

#### Numerical (Continuous/Ordinal) features
Normalize all loaded numeric features.
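
One caveat is worth illustrating. If `normalize` in the cell below is `sklearn.preprocessing.normalize` (an assumption, since it is imported by the helper notebook), then `axis=0` rescales each column to unit L2 norm, and a column's norm grows with the number of rows. Normalizing train and test separately can therefore put the same raw value on different scales. A possible alternative, sketched here with made-up numbers, is to fit a `StandardScaler` on the training rows and reuse it for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

train_num = np.array([[1.0], [2.0], [3.0], [4.0]])  # made-up feature column
test_num = np.array([[2.0], [4.0]])

# Separate column-wise L2 normalization: the raw value 2.0 maps to different numbers.
sep_train = normalize(train_num, axis=0)
sep_test = normalize(test_num, axis=0)
print(sep_train[1, 0], sep_test[0, 0])

# Alternative: learn mean/std on train only, then apply them to both sets.
scaler = StandardScaler().fit(train_num)
train_scaled = scaler.transform(train_num)
test_scaled = scaler.transform(test_num)

# Now the same raw value gets the same scaled value in both sets.
print(train_scaled[1, 0], test_scaled[0, 0])
```

This is a suggestion only; the results reported in this notebook were produced with the separate normalization shown in the next cell.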

In [15]:
# fillna returns new frames, so the edits below do not act on slices of train/test.
train_x_numerical = train[cont_ord_features].fillna(0)
test_x_numerical = test[cont_ord_features].fillna(0)
for col in cont_ord_features:
    train_x_numerical[col] = train_x_numerical[col].astype('float64')
    test_x_numerical[col] = test_x_numerical[col].astype('float64')

# Normalize features:
train_x_numerical = normalize(train_x_numerical, axis=0)
test_x_numerical = normalize(test_x_numerical, axis=0)






## Learning - Neural Net

### Define network

In [16]:
text_hidden_dim = 256
cat_hidden_dim = 128
merged_hidden_dim = 256
out_dim = 1

In [17]:
# Text
l_text_input = Input(shape=(text_tfidf_dim,), sparse=True, name="text_tfidf_input")
l_hidden_text = Dense(text_hidden_dim, activation='relu', 
                      kernel_regularizer=regularizers.l2(1e-6), name='l_hidden_text')(l_text_input)

# Categoricals
l_cat_input = Input(shape=(train_x_cat.shape[1],), sparse=True, name="cat_input")
l_hidden_cat = Dense(cat_hidden_dim, activation='relu',
                     kernel_regularizer=regularizers.l2(1e-6), name='l_hidden_cat')(l_cat_input)

# Numerical
l_numerical_input = Input(shape=(train_x_numerical.shape[1],), name="numerical_input")

# Aggregate all inputs into one hidden layer.
l_aggregative = concatenate([l_hidden_text, l_hidden_cat, l_numerical_input])

l_merged_hidden = Dense(merged_hidden_dim, activation='relu',
                            kernel_regularizer=regularizers.l2(1e-6), name='l_merged_hidden')(l_aggregative)
output = Dense(out_dim, name='output')(l_merged_hidden)

adam_opt = Adam(lr=0.001)

model = Model(inputs=[l_text_input, l_cat_input, l_numerical_input], outputs=[output])
model.compile(optimizer=adam_opt, loss='mse')
model.summary()  # summary() prints the table itself and returns None

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
text_tfidf_input (InputLayer)   (None, 7500)         0                                            
__________________________________________________________________________________________________
cat_input (InputLayer)          (None, 9948)         0                                            
__________________________________________________________________________________________________
l_hidden_text (Dense)           (None, 256)          1920256     text_tfidf_input[0][0]           
__________________________________________________________________________________________________
l_hidden_cat (Dense)            (None, 128)          1273472     cat_input[0][0]                  
__________________________________________________________________________________________________
numerical_
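
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The short sketch below reproduces the two counts visible above. The counts for the merged and output layers are derived from the network definition (22 numerical features are listed in `cont_ord_features`), not read from the truncated summary.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

text_params = dense_params(7500, 256)   # l_hidden_text
cat_params = dense_params(9948, 128)    # l_hidden_cat

# Derived, not shown in the truncated summary: the 22 numerical features are
# concatenated with the two hidden layers before the merged layer.
merged_params = dense_params(256 + 128 + 22, 256)
output_params = dense_params(256, 1)

print(text_params)   # 1920256, matches the summary
print(cat_params)    # 1273472, matches the summary
print(merged_params, output_params)
```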

In [18]:
reduce_lr_cb = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=1, mode='min', min_delta=0.01,
                                  verbose=1, min_lr=0.0001)
nbatch_size = 512

# Train for 2 epochs with a 10% validation split, doubling the batch size after each epoch.
nepochs = 2
for i in range(nepochs):
    print('Curr batch size: {}'.format(nbatch_size))
    model.fit({'text_tfidf_input' : train_x_text, 'cat_input' : train_x_cat, 'numerical_input' : train_x_numerical},
              [y_target], 
              validation_split = 0.10, 
              epochs=1, 
              batch_size=nbatch_size,
              callbacks=[reduce_lr_cb])
    nbatch_size *= 2

# Final epoch on all data (no validation split).
print('Current batch size: {}'.format(nbatch_size))
model.fit({'text_tfidf_input' : train_x_text, 'cat_input' : train_x_cat, 'numerical_input' : train_x_numerical},
          [y_target], 
          epochs=1, 
          batch_size=nbatch_size)

Curr batch size: 512
Train on 1706323 samples, validate on 189592 samples
Epoch 1/1
Curr batch size: 1024
Train on 1706323 samples, validate on 189592 samples
Epoch 1/1
Current batch size: 2048
Epoch 1/1


<keras.callbacks.History at 0x150c0debe978>
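
To get a feel for what doubling the batch size does, the sketch below computes the number of gradient steps per epoch from the sample counts printed in the log above. It is plain arithmetic and does not call Keras.

```python
import math

n_fit, n_val = 1706323, 189592   # sample counts from the training log above

# Two epochs with a 10% validation split, doubling the batch size each time.
batch_size = 512
for epoch in range(2):
    steps = math.ceil(n_fit / batch_size)
    print(batch_size, steps)
    batch_size *= 2

# The final epoch uses every row (training plus former validation rows).
final_steps = math.ceil((n_fit + n_val) / batch_size)
print(batch_size, final_steps)
```

Each doubling roughly halves the number of weight updates per epoch, so later epochs take fewer updates, each averaged over a larger batch.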

### Complete missing prices and save

In [19]:
test_predictions = model.predict({'text_tfidf_input': test_x_text, 'cat_input': test_x_cat, 'numerical_input': test_x_numerical})

In [20]:
res_test_df = pd.DataFrame(test['item_id'])
res_test_df['log_price_regression'] = test_predictions.ravel()  # flatten (n, 1) predictions to 1-D

In [21]:
res_test_df.head()

| | item_id | log_price_regression |
|---|---|---|
| 24 | 065a4daba35f | 5.492645 |
| 30 | af9d5b0eadc9 | 6.803106 |
| 60 | fc21a86b9bb0 | 7.790543 |
| 69 | a297b829b758 | 5.240705 |
| 74 | 1298ab2ff7f3 | 4.479311 |


In [22]:
res_train_df = pd.DataFrame(train['item_id'])
# For rows with a known price, store the true log price under the same column name.
res_train_df['log_price_regression'] = y_target

In [23]:
res_train_df.head()

| | item_id | log_price_regression |
|---|---|---|
| 0 | b912c3c6a6ad | 5.993961 |
| 1 | 2dac0150717d | 8.006701 |
| 2 | ba83aefab5dc | 8.2943 |
| 3 | 02996f1dd2ea | 7.696667 |
| 4 | 7c90be56d2ab | 10.59666 |


In [24]:
res_df = pd.concat([res_train_df, res_test_df], axis = 0)

In [25]:
r_train_ids = r_train_ids.merge(res_df, on='item_id', how='left')
r_test_ids = r_test_ids.merge(res_df, on='item_id', how='left')

In [26]:
r_train_ids.shape

(1503424, 2)
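
The left merge used above keeps the rows and row order of the ID frame, which is why the shape still matches the original train set. A toy sketch with made-up IDs and values follows. The `validate='one_to_one'` argument is an optional safety check that the notebook does not use; it raises if `item_id` is duplicated on either side.

```python
import pandas as pd

ids = pd.DataFrame({'item_id': ['a', 'b', 'c']})  # original row order
res = pd.DataFrame({'item_id': ['c', 'a', 'b', 'z'],
                    'log_price_regression': [7.1, 5.9, 6.4, 8.0]})  # made-up values

# how='left' keeps exactly the rows (and order) of `ids`;
# validate raises if item_id is duplicated on either side.
merged = ids.merge(res, on='item_id', how='left', validate='one_to_one')

print(merged.shape)                                  # (3, 2)
print(merged['log_price_regression'].isna().sum())   # 0
```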

In [27]:
r_train_ids.to_csv("/home/u14303/Avito/completed_train_price.csv.gz", compression='gzip', index=False)
r_test_ids.to_csv("/home/u14303/Avito/completed_test_price.csv.gz", compression='gzip', index=False)