# Model: NN - Just LSTM

Our NN models differ in how we preprocess the text features (the most important features) before feeding them into the network.
This choice also affects the architecture of the NN.

In this notebook we run separate LSTMs on "title" and "description" for all ads. We learn "deal_probability" (the target feature) directly from the LSTMs' output vectors (their final hidden states) *without* combining them with any other feature.

In [1]:
%load_ext autoreload

In [2]:
%run stephan_modules.ipynb


Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.

Using TensorFlow backend.


In [3]:
DATA_PATH = '/home/u14303/Avito'
HELPER_DATA_PATH = '/home/u14303/Avito/helper_data'

## Feature enrichment

In [4]:
print('loading data...')
train, test = load_data(DATA_PATH)
train, test = basic_enrichment(train, test, helper_data_path=HELPER_DATA_PATH)
train, test = load_image_features(train, test, helper_data_path=HELPER_DATA_PATH)
train, test = load_text_features(train, test, helper_data_path=HELPER_DATA_PATH)
train, test = add_aggregated_features(train, test, helper_data_path=HELPER_DATA_PATH)
train, test = numeric_features_cleaning(train, test, helper_data_path=HELPER_DATA_PATH)
train, test = complete_image_top_1(train, test, helper_data_path=HELPER_DATA_PATH)
train, test = complete_price(train, test, helper_data_path=HELPER_DATA_PATH)

loading data...
Adding basic features...
Done adding basic features.
Adding image features...




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



Done loading image features.
Loading text features...
Done loading text features.
Loading aggregated features...
Done loading aggregated features.
Loading aggregated features...
Done loading aggregated features.
Cleaning and completing numeric features...




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



Done cleaning numeric features.
Completing image_top_1 features...
Done loading image_top_1 completions.
Completing price...
Done loading log_price_regression.


In [5]:
list(train.columns)

['item_id',
 'user_id',
 'region',
 'city',
 'parent_category_name',
 'category_name',
 'param_1',
 'param_2',
 'param_3',
 'title',
 'description',
 'price',
 'item_seq_number',
 'activation_date',
 'user_type',
 'image',
 'image_top_1',
 'deal_probability',
 'has_description',
 'has_price',
 'has_params',
 'has_image',
 'month',
 'day',
 'weekday',
 'user_ads_count',
 'title_description_params',
 'img_size',
 'img_sharpness',
 'img_luminance',
 'img_colorfulness',
 'img_confidence',
 'img_keypoints',
 'log_img_sharpness',
 'log_img_keypoints',
 'title_word_count',
 'description_non_regular_chars_ratio',
 'description_word_count',
 'merged_params_word_count',
 'description_sentence_count',
 'description_words/sentence_ratio',
 'title_capital_letters_ratio',
 'description_capital_letters_ratio',
 'title_non_regular_chars_ratio',
 'title_num_of_newrow_char',
 'description_num_of_newrow_char',
 'title_num_adj',
 'title_num_nouns',
 'title_adj_to_len_ratio',
 'title_noun_to_len_ratio',
 '

## Preprocess text for input to an LSTM

In [6]:
text_feature = 'title_description_params'

train_y_prob = train['deal_probability']

Import stopwords.

In [7]:
import codecs

# Load the stopword list (one word per line, cp1251-encoded).
stopwords = set()
with codecs.open('stopwords_ru.txt', encoding='cp1251') as ins:
    for w in ins:
        word = w.strip("\r\n")  # removes both \r and \n line endings
        stopwords.add(word.lower())

Tokenize and vectorize the text features: each word is mapped to an integer index, which is Keras's compact encoding of a one-hot vector (one integer per word instead of a full vector).

See: https://keras.io/preprocessing/text/#one_hot
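
As a minimal, Keras-free sketch of what this integer encoding amounts to: every distinct word receives an integer ID, and 0 is left free for padding. The helper `build_word_index` and the sample ads are hypothetical, and the frequency-ranked numbering only approximates what Keras's `Tokenizer` does.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer ID, most frequent word first (IDs start at 1)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() sorts by descending count.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

sample_titles = ["red bike", "bike for sale", "red sofa for sale"]  # hypothetical ads
word_index = build_word_index(sample_titles)

# Each title becomes a list of integer IDs, one per word.
encoded = [[word_index[w] for w in t.lower().split()] for t in sample_titles]
print(word_index)
print(encoded)
```

The rarest word ("sofa") ends up with the largest ID, and each encoded title has as many integers as it has words.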

In [8]:
import gc

import numpy as np
from keras.preprocessing.text import one_hot, Tokenizer, text_to_word_sequence

# These constants also determine the input shape of the NN.
max_words_in_desc_title_param = 150 # See the text analysis notebook: 95% of the texts are shorter than 110 words.
word_embed_dim = 300

train_x_title = train['title'].str.lower()
test_x_title = test['title'].str.lower()
# NOTE: num_words caps the vocabulary used by texts_to_sequences (it is not a sequence length).
# It does not affect tokenizer.word_index, which is what the cells below rely on.
tokenizer = Tokenizer(num_words = max_words_in_desc_title_param)
all_titles = np.hstack([train_x_title, test_x_title])

print('Tokenizing titles...')
tokenizer.fit_on_texts(all_titles)
print('Done tokenizing.')

del all_titles
gc.collect()

Tokenizing titles...
Done tokenizing.


14

In [9]:
# By default text_to_word_sequence does 3 things:
#   Splits words by space (split=" "),
#   Filters out punctuation (filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'),
#   Converts text to lowercase (lower=True).
# We add stopword filtering to the process.

def my_text_to_word_sequence(text):
    result = []
    for word in text_to_word_sequence(text):
        if word not in stopwords:
            result.append(word)
    return result

In [10]:
from keras.preprocessing.sequence import pad_sequences

print('Applying tokenizer on titles...')

# Series.apply passes each title string directly and takes no axis argument.
train_x_title = train_x_title.apply(
    lambda text:
    [tokenizer.word_index[word] for word in my_text_to_word_sequence(text)])
test_x_title = test_x_title.apply(
    lambda text:
    [tokenizer.word_index[word] for word in my_text_to_word_sequence(text)])

# Pad (or truncate) every index sequence to a fixed length.
train_x_title = pad_sequences(train_x_title.tolist(), maxlen=max_words_in_desc_title_param)
test_x_title = pad_sequences(test_x_title.tolist(), maxlen=max_words_in_desc_title_param)

print('Done.')
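
The padding step can be sketched without Keras as follows. This is only an illustration: `pad_left` is a hypothetical helper that follows the default behaviour of `pad_sequences` (zeros added at the front, overlong sequences cut from the front), and the stopword set and word index are made up.

```python
def pad_left(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) integer sequences to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

toy_stopwords = {"for", "the"}                  # hypothetical stopword list
toy_index = {"red": 1, "bike": 2, "sale": 3}    # hypothetical word index

def encode(text):
    """Lowercase, split on whitespace, drop stopwords, map words to integer IDs."""
    return [toy_index[w] for w in text.lower().split() if w not in toy_stopwords]

rows = pad_left([encode("Red bike for sale"), encode("bike")], maxlen=4)
print(rows)
```

Every row has the same length afterwards, which is what lets the sequences be stacked into one 2D array for the network input.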

In [None]:
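# Missing descriptions are replaced with empty strings so they can be tokenized.
train_x_description = train['description'].fillna('').str.lower()
test_x_description = test['description'].fillna('').str.lower()
# NOTE: this rebinds `tokenizer`, replacing the one fitted on titles. Anything built from
# `tokenizer.word_index` after this point follows the description vocabulary.
tokenizer = Tokenizer(num_words = max_words_in_desc_title_param)
all_descriptions = np.hstack([train_x_description, test_x_description])

print('Tokenizing descriptions...')
tokenizer.fit_on_texts(all_descriptions)
print('Done tokenizing.')

del all_descriptions
gc.collect()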

In [None]:
print('Applying tokenizer on descriptions...')

# Series.apply passes each description string directly and takes no axis argument.
train_x_description = train_x_description.apply(
    lambda text:
    [tokenizer.word_index[word] for word in my_text_to_word_sequence(text)])
test_x_description = test_x_description.apply(
    lambda text:
    [tokenizer.word_index[word] for word in my_text_to_word_sequence(text)])

# Pad (or truncate) every index sequence to a fixed length.
train_x_description = pad_sequences(train_x_description.tolist(), maxlen=max_words_in_desc_title_param)
test_x_description = pad_sequences(test_x_description.tolist(), maxlen=max_words_in_desc_title_param)

print('Done.')

Load Facebook's FastText Russian word embeddings (`cc.ru.300.vec`, trained on Common Crawl and Wikipedia).

In [None]:
import codecs
import os

def get_fasttext_embedding_matrix(data_path, tokenizer, embedding_dim):
    print('loading embeddings...')

    EMBEDDING_FILE_PATH = os.path.join(data_path, 'cc.ru.300.vec')
    embeddings_index = {}
    with codecs.open(EMBEDDING_FILE_PATH, encoding='utf-8') as f:
        for line in f:
            values = line.rstrip().split(' ')
            # Skip the "<count> <dim>" header line and any malformed row.
            if len(values) != embedding_dim + 1:
                continue
            embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

    # Keras word indices start at 1 (0 is used for padding), hence the extra rows.
    vocab_size = len(tokenizer.word_index) + 2
    # Words without a pretrained vector keep their random initialization.
    embedding_matrix = np.random.rand(vocab_size, embedding_dim)

    print('creating embedding matrix...')
    embedding_exists = 0
    no_embeddings = 0
    for word, i in tokenizer.word_index.items():
        if word in embeddings_index:
            embedding_matrix[i] = embeddings_index[word]
            embedding_exists += 1
        else:
            no_embeddings += 1

    total_words = embedding_exists + no_embeddings
    print("There are a total of {} words in our corpus.".format(total_words))
    print("There are {} embeddings in FastText.".format(len(embeddings_index)))
    print("We have embeddings for {} words ({}% existing embeddings).".format(
        embedding_exists, 100 * embedding_exists / total_words))
    print("Embedding is missing for {} words.".format(no_embeddings))

    del embeddings_index
    gc.collect()

    print('done loading embeddings.')
    return embedding_matrix, vocab_size

In [None]:
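# NOTE: `tokenizer` is the most recently fitted one (on descriptions), so the matrix rows follow its word_index.
embedding_matrix, vocab_size = get_fasttext_embedding_matrix(data_path=DATA_PATH,
                                                             tokenizer=tokenizer, embedding_dim=word_embed_dim)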
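
To make the lookup concrete, here is a small stdlib-only sketch of the same idea on an in-memory file in `.vec` text format. The function `build_embedding_matrix`, the toy vectors, and the toy word index are all hypothetical. Plain lists stand in for the NumPy matrix, and the sketch allocates exactly one row per word ID plus the padding row.

```python
import io
import random

def build_embedding_matrix(vec_file, word_index, dim, seed=0):
    """Return one row per word ID; rows without a pretrained vector stay random."""
    rng = random.Random(seed)
    pretrained = {}
    for line in vec_file:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:  # skips the "<count> <dim>" header line
            continue
        pretrained[parts[0]] = [float(x) for x in parts[1:]]

    # Row 0 is reserved for padding, so the matrix has len(word_index) + 1 rows.
    matrix = [[rng.random() for _ in range(dim)] for _ in range(len(word_index) + 1)]
    found = 0
    for word, i in word_index.items():
        if word in pretrained:
            matrix[i] = pretrained[word]  # copy the pretrained vector into row i
            found += 1
    return matrix, found

toy_vec = io.StringIO("2 3\nbike 0.1 0.2 0.3\nsofa 0.4 0.5 0.6\n")  # made-up vectors
toy_word_index = {"bike": 1, "lamp": 2}                              # made-up vocabulary
matrix, found = build_embedding_matrix(toy_vec, toy_word_index, dim=3)
print(found, matrix[1])
```

Here "bike" receives its pretrained vector, while "lamp" has no entry in the toy file and keeps a random row, mirroring the "embedding is missing" count printed by the notebook's function.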

# Learning - Neural Net

### Define network

In [None]:
lstm_out_dim = 256
out_dim = 1

In [None]:
from keras import backend as K
from keras.layers import Input, Embedding, LSTM, Dense, concatenate
from keras.models import Model
from keras.optimizers import RMSprop

# Title LSTM:
l_title_input = Input(shape=(max_words_in_desc_title_param,), name="title_lstm_input")
# Load FastText's weights:
l_title_embedding = Embedding(input_dim=vocab_size, output_dim=word_embed_dim, weights=[embedding_matrix],
                              input_length=max_words_in_desc_title_param, name='title_embeddings')(l_title_input)
# Layer names must be unique within a model, so each LSTM gets its own name.
title_lstm_out = LSTM(lstm_out_dim, name='title_LSTM', dropout=0.2, recurrent_dropout=0.2)(l_title_embedding)

# Description LSTM:
l_description_input = Input(shape=(max_words_in_desc_title_param,), name="description_lstm_input")
# Load FastText's weights:
l_description_embedding = Embedding(input_dim=vocab_size, output_dim=word_embed_dim, weights=[embedding_matrix],
                                    input_length=max_words_in_desc_title_param, name='description_embeddings')(l_description_input)
description_lstm_out = LSTM(lstm_out_dim, name='description_LSTM', dropout=0.2, recurrent_dropout=0.2)(l_description_embedding)

# Concatenate the two LSTM output vectors into one final vector.
l_aggregative = concatenate([title_lstm_out, description_lstm_out])

# Main output: a single sigmoid unit (logistic regression on the concatenated vector).
output = Dense(out_dim, activation='sigmoid', name='output')(l_aggregative)

rmsprop_opt = RMSprop(lr=0.00001) # RMSprop is a common choice for training RNNs.
def rmse_err(y_true, y_pred):
    return K.sqrt(K.mean(K.square(y_true - y_pred)))

# The model takes both padded sequences as inputs: [titles, descriptions].
model = Model([l_title_input, l_description_input], output)
model.compile(optimizer=rmsprop_opt, loss=rmse_err)
model.summary()

load = False
fname = 'NN-JUST-LSTM-SEP-model-weights.h5'
if load:
    model.load_weights(fname)
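
For reference, the `rmse_err` loss above is the root mean squared error, sqrt(mean((y_true - y_pred)^2)). A plain-Python sketch with made-up numbers (the function `rmse` is illustrative and not part of the notebook):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

y_true = [0.0, 0.5, 1.0]  # made-up deal probabilities
y_pred = [0.0, 0.5, 0.4]  # made-up predictions
print(rmse(y_true, y_pred))
```

Because the square root is taken per batch, the value Keras reports for an epoch is an average of per-batch RMSEs, which in general differs slightly from the RMSE computed over the whole dataset at once.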

In [None]:
from keras.callbacks import ReduceLROnPlateau, ModelCheckpoint

# NOTE: min_lr (1e-4) is above the initial learning rate (1e-5), so as configured
# this callback cannot lower the learning rate.
reduce_lr_cd = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=2, min_lr=0.0001)
save_weights_cd = ModelCheckpoint(fname, save_weights_only=True, mode='auto', period=1)

nbatch_size = 512

# Double the batch size after every epoch.
nepochs = 5
for i in range(nepochs):
    print('Curr batch size: {}'.format(nbatch_size))
    model.fit([train_x_title, train_x_description], train_y_prob,
              validation_split=0.1,
              epochs=1,
              batch_size=nbatch_size,
              callbacks=[reduce_lr_cd, save_weights_cd])
    nbatch_size *= 2
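
As a quick check of the schedule this loop produces, the batch size used in each epoch can be listed directly. The helper `batch_size_schedule` is hypothetical and not part of the notebook.

```python
def batch_size_schedule(initial, epochs, factor=2):
    """Batch size used in each epoch when it is multiplied by factor after every epoch."""
    return [initial * factor ** epoch for epoch in range(epochs)]

print(batch_size_schedule(512, 5))  # [512, 1024, 2048, 4096, 8192]
```

Starting from 512 and doubling once per epoch, the fifth and last epoch trains with batches of 8192 samples.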

### Save predictions for further ensembles

In [None]:
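# The model has two inputs (titles, descriptions); ravel() flattens the (n, 1) output to shape (n,).
y_pred_train = model.predict([train_x_title, train_x_description]).ravel()
y_pred_test = model.predict([test_x_title, test_x_description]).ravel()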

In [None]:
res_train_df = pd.DataFrame(train['item_id'])
res_train_df['deal_probability'] = y_pred_train

In [None]:
res_test_df = pd.DataFrame(test['item_id'])
res_test_df['deal_probability'] = y_pred_test

In [None]:
res_df = pd.concat([res_train_df, res_test_df], axis = 0)

In [None]:
res_df.to_csv('/home/u14303/Avito/Predictions/NN-JUST-LSTM-SEP.csv.gz', index=False, compression='gzip')
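
A later ensembling step needs to read this file back and look predictions up by `item_id`. The following stdlib-only sketch shows such a round trip on a toy file. The helpers `write_predictions` and `read_predictions`, the file name, and the values are all hypothetical, and an actual ensemble notebook may well use pandas instead.

```python
import csv
import gzip
import os
import tempfile

def write_predictions(path, item_ids, probabilities):
    """Write item_id / deal_probability pairs to a gzip-compressed CSV."""
    with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["item_id", "deal_probability"])
        writer.writerows(zip(item_ids, probabilities))

def read_predictions(path):
    """Read the file back into a dict keyed by item_id."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
        return {row["item_id"]: float(row["deal_probability"]) for row in csv.DictReader(f)}

path = os.path.join(tempfile.mkdtemp(), "toy-predictions.csv.gz")  # hypothetical file
write_predictions(path, ["a1", "b2", "c3"], [0.10, 0.75, 0.00])    # made-up values
predictions = read_predictions(path)
print(predictions["b2"])
```

Keying by `item_id` matters because the saved file stacks train and test rows together, so a consumer has to join on the ID rather than rely on row order.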