## Stacking

This notebook demonstrates the stacking technique using the data from the <a href=https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries>Two Sigma Connect: Rental Listing Inquiries</a> <a href=https://www.kaggle.com/competitions>Kaggle competition</a>. The details are left to the reader, but to summarise:

- The data is from <a href=https://www.renthop.com/>RentHop</a>, which is a web and mobile-based search engine that allows users to search for apartments in major cities.
- Based on the data for each listing, the goal is to predict its interest level.

### Load the data

The data from Kaggle comes as a training set and a test set; the latter has no labels. For the purpose of demonstration, we use only the training data, although the test data is loaded as well.

Below we print the columns of the dataset:

In [2]:
import pandas as pd
import json
import time

with open('./data/train.json') as data_file:    
    data = json.load(data_file)

raw_train = pd.DataFrame(data)

with open('./data/test.json') as data_file:    
    data = json.load(data_file)

raw_test = pd.DataFrame(data)

for item in raw_train.columns:
    print(item)

bathrooms
bedrooms
building_id
created
description
display_address
features
interest_level
latitude
listing_id
longitude
manager_id
photos
price
street_address


The id is used to identify each listing in the competition. For simplicity, we keep a copy of it and reindex the data with consecutive integers.

In [2]:
_id = raw_train.index
raw_train=raw_train.reset_index()

We are going to train a popular model: <a href=https://github.com/dmlc/xgboost>xgboost</a>. We will start with only the numerical features:

In [16]:
col = ['bathrooms', 'bedrooms', 'latitude', 'longitude', 'price']
X = raw_train[col]
y = raw_train['interest_level'].apply(lambda x: 0 if x=='low' else 1 if x=='medium' else 2)

## Folding 

To be able to evaluate our model, we use scikit-learn's `model_selection` module to split the indices into:

- Training data (used to train the models)
- Validation data (used to evaluate the models; we avoid the term "test data" to distinguish this from the Kaggle test data)

For later use in stacking, we also keep the training data as two separate folds.

In [17]:
import numpy as np
import sklearn
from sklearn import model_selection

skf = model_selection.StratifiedKFold(n_splits=3)
folds = skf.split(X, y)
_, fold1 = next(folds)
_, fold2 = next(folds)
_, validation_idx = next(folds)

train_idx = np.concatenate([fold1, fold2])
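
As a small illustration of the splitting step above, the sketch below runs `StratifiedKFold` on made-up labels and checks that the three held-out index sets are disjoint and keep the class ratio. The arrays `X_toy` and `y_toy` are hypothetical and are not the rental data. The built-in `next(...)` is used because it works on both Python 2 and Python 3.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 6 samples of class 0 and 3 samples of class 1.
X_toy = np.arange(18).reshape(9, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])

splits = StratifiedKFold(n_splits=3).split(X_toy, y_toy)

# Keep only the held-out indices of each split, as the notebook does.
_, part1 = next(splits)
_, part2 = next(splits)
_, part3 = next(splits)
held_out = [part1, part2, part3]

# Every sample should land in exactly one held-out set.
all_idx = np.sort(np.concatenate(held_out))

# Class counts inside each held-out set.
class_counts = [np.bincount(y_toy[idx], minlength=2).tolist() for idx in held_out]
print(class_counts)
```

Because the split is stratified, each held-out set receives the same share of every class, which matters here since the interest levels are not equally frequent.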

Below we initialize the parameters for xgboost:

In [18]:
import xgboost as xgb

param = {}
param['objective'] = 'multi:softprob'
param['eta'] = 0.02
param['max_depth'] = 6
param['silent'] = 1
param['num_class'] = 3
param['eval_metric'] = "mlogloss"
param['min_child_weight'] = 3
param['subsample'] = 0.7
param['colsample_bytree'] = 0.7
param['seed'] = 321
num_rounds = 1000

Below we train and evaluate the xgboost model. We use the logloss as the metric.

In [19]:
start_ = time.time()
X_train = np.array(X.iloc[train_idx])
y_train = np.array(y.iloc[train_idx])
xgtrain = xgb.DMatrix(X_train, label= y_train)

clf = xgb.train(param, xgtrain, num_rounds)

X_validation = np.array(X.iloc[validation_idx])
y_validation = np.array(y.iloc[validation_idx])
xgvalidation = xgb.DMatrix(X_validation)
y_prob = clf.predict(xgvalidation)

print('The log loss is: %.3f' % sklearn.metrics.log_loss(y_validation, y_prob))
print('Time elapsed: %.2f seconds' % (time.time() - start_))

The log loss is: 0.656
Time elapsed: 36.94 seconds
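
To put a log loss such as 0.656 in context, the sketch below computes the multiclass log loss by hand on hypothetical probabilities. The helper name `multiclass_log_loss` and the toy arrays are assumptions made for illustration, not part of the notebook. For well-formed probability rows this should be the same quantity that `sklearn.metrics.log_loss` reports.

```python
import numpy as np

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    y_true = np.asarray(y_true)
    # Clip to avoid log(0) for over-confident predictions.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0)
    # Pick, for every row, the probability of its true class.
    picked = y_prob[np.arange(len(y_true)), y_true]
    return float(-np.mean(np.log(picked)))

# Hypothetical labels (0 = low, 1 = medium, 2 = high) and predictions.
y_demo = [0, 1, 2]
uniform = np.full((3, 3), 1.0 / 3.0)
confident = np.array([[0.8, 0.1, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.1, 0.1, 0.8]])

uniform_loss = multiclass_log_loss(y_demo, uniform)      # equals ln(3)
confident_loss = multiclass_log_loss(y_demo, confident)  # equals -ln(0.8)
print('uniform: %.3f, confident: %.3f' % (uniform_loss, confident_loss))
```

A model that always predicts 1/3 for each of the three classes scores ln 3, roughly 1.099, so any useful model should land well below that value.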


## Adding the text data

### Clean the text

The code below cleans the "description" column in the dataset.

In [3]:
import re
import nltk
from nltk.tag import pos_tag
from nltk import word_tokenize
from nltk.corpus import stopwords
stop = stopwords.words('english')

def cleaning_text(sentence):
    # Replace non-ASCII characters, then decode back to text so the regexes work on both Python 2 and 3.
    sentence = sentence.encode('ascii', errors='replace').decode('ascii')
    sentence = sentence.lower()
    sentence = re.sub(r'[^\w\s]', ' ', sentence)  # removes punctuation
    sentence = re.sub(r'_', ' ', sentence)        # removes underscores, which \w treats as word characters
    sentence = re.sub(r'\d+', ' ', sentence)      # removes digits
    # removes English stopwords
    cleaned = ' '.join([w for w in sentence.split() if w not in stop])
    # keeps only nouns and adjectives
    cleaned = ' '.join([w for w, pos in pos_tag(cleaned.split()) if pos in ('NN', 'JJ', 'JJR', 'JJS')])
    # removes words of one or two characters
    cleaned = ' '.join([w for w in cleaned.split() if len(w) > 2])
    cleaned = cleaned.strip()
    return cleaned

start_ = time.time()
raw_train['cleaned_txt'] = raw_train['description'].apply(cleaning_text)
print('Time elapsed: %.2f' % (time.time()-start_))

Time elapsed: 441.20
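
Most of the cleaning steps are plain regular-expression substitutions. The sketch below reproduces only those steps on one made-up sentence, so that the effect of each substitution is easy to follow. It deliberately skips the part-of-speech filter, and `TOY_STOP` is a hypothetical, tiny stop-word set standing in for NLTK's English list.

```python
import re

# Hypothetical stop-word set for illustration; the notebook uses NLTK's list.
TOY_STOP = {'the', 'a', 'is', 'in', 'with', 'and'}

def clean_without_tagging(sentence):
    sentence = sentence.lower()
    sentence = re.sub(r'[^\w\s]', ' ', sentence)  # punctuation -> space
    sentence = re.sub(r'_', ' ', sentence)        # underscore is a \w character, so handle it separately
    sentence = re.sub(r'\d+', ' ', sentence)      # digits -> space
    words = [w for w in sentence.split() if w not in TOY_STOP]
    words = [w for w in words if len(w) > 2]      # drop words of one or two characters
    return ' '.join(words)

example = 'Sunny 2BR in the East_Village, with a renovated kitchen!'
cleaned_example = clean_without_tagging(example)
print(cleaned_example)
```

Note that removing the digits turns "2BR" into "br", which the length filter then drops, so information such as the bedroom count is lost from the text and has to come from the numerical columns.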


### TF-IDF

To take the text data into account, we need to encode the categorical features as numerical features. Dummification (or one-hot encoding) is a popular approach for general categorical features. For text data, we use a much more efficient method: <a href=https://en.wikipedia.org/wiki/Tf%E2%80%93idf>tf-idf</a>. 

In [41]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

n_features = 500
n_topics = 10
n_top_words = 20

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=n_features,
                                stop_words='english')

text_train = np.array(raw_train.loc[train_idx, 'cleaned_txt'].replace(np.nan, ''))
tf = tf_vectorizer.fit_transform(text_train)
TF = TfidfTransformer()
tf_idf_train = TF.fit_transform(tf)
tf_idf_train = tf_idf_train.toarray()
X_train_des = np.concatenate([X_train, tf_idf_train], axis=1)


text_validation = np.array(raw_train.loc[validation_idx, 'cleaned_txt'].replace(np.nan, ''))
tf = tf_vectorizer.transform(text_validation)
tf_idf_validation = TF.transform(tf)
tf_idf_validation = tf_idf_validation.toarray()
X_validation_des = np.concatenate([X_validation, tf_idf_validation], axis=1)

text_fold1 = np.array(raw_train.loc[fold1, 'cleaned_txt'].replace(np.nan, ''))
text_fold2 = np.array(raw_train.loc[fold2, 'cleaned_txt'].replace(np.nan, ''))
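
The important pattern in the cell above is that the vocabulary and the IDF weights are fitted on the training text only and then reused on the validation text. The sketch below shows the same pattern on a hypothetical three-document corpus; `train_docs` and `new_docs` are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical mini-corpus standing in for the cleaned descriptions.
train_docs = ['sunny apartment kitchen', 'renovated kitchen', 'sunny sunny balcony']
new_docs = ['sunny kitchen doorman']  # 'doorman' never appears in the training text

counter = CountVectorizer()
weigher = TfidfTransformer()

# Fit the vocabulary and the IDF weights on the training text only.
train_tfidf = weigher.fit_transform(counter.fit_transform(train_docs))
# Reuse the fitted objects on new text; words outside the vocabulary are ignored.
new_tfidf = weigher.transform(counter.transform(new_docs))

vocab = sorted(counter.vocabulary_)
row_norms = np.linalg.norm(train_tfidf.toarray(), axis=1)
print(vocab)
print(train_tfidf.shape, new_tfidf.shape)
```

With the default settings each row is scaled to unit Euclidean length, and both outputs are sparse matrices until `toarray()` is called.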

Below we train the model with both the numerical features used before and the new text features.

Unsurprisingly, the performance improves, since we take more features into account: the log loss drops from 0.656 to 0.628. However, the improvement is expensive: training takes about 2,165 seconds instead of 37.

In [43]:
start_ = time.time()
xgtrain_des = xgb.DMatrix(X_train_des, label= y_train)
clf = xgb.train(param, xgtrain_des, num_rounds)
xgvalidation_des = xgb.DMatrix(X_validation_des)
y_prob_des = clf.predict(xgvalidation_des)
print('Time elapsed: %.2f seconds' % (time.time() - start_))
print('The log loss is: %.3f' % sklearn.metrics.log_loss(y_validation, y_prob_des))

Time elapsed: 2165.19 seconds
The log loss is: 0.628


## Stacking models with resampling

In the demo above we saw a drawback of tree-based models: they are inefficient when a categorical feature has high cardinality.

One thing we can do is to first train some simpler models on the text data, and then stack their predictions with the numerical predictors.

### Using sparse matrix

The text data contains a lot of zeros. One simple way to gain efficiency is to keep the TF-IDF output as sparse matrices instead of converting it with `toarray()`.

In [30]:
tf = tf_vectorizer.transform(text_train)
tf_idf_train = TF.transform(tf)

tf = tf_vectorizer.transform(text_fold1)
tf_idf_fold1 = TF.transform(tf)

tf = tf_vectorizer.transform(text_fold2)
tf_idf_fold2 = TF.transform(tf)

tf = tf_vectorizer.transform(text_validation)
tf_idf_validation = TF.transform(tf)
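
To give a rough sense of the saving, the sketch below compares the memory footprint of a dense array with its CSR (compressed sparse row) counterpart. The matrix is hypothetical: 1000 rows and 500 columns (the same number of columns as `n_features`), with an assumed 5 non-zero entries per row.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for a TF-IDF matrix: 1000 rows, 500 columns,
# with only 5 non-zero entries per row.
rng = np.random.RandomState(0)
dense = np.zeros((1000, 500))
for row in range(1000):
    cols = rng.choice(500, size=5, replace=False)
    dense[row, cols] = rng.rand(5) + 0.1  # strictly positive weights

csr = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
# CSR stores the non-zero values plus two index arrays.
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
print('dense: %d bytes, sparse: %d bytes' % (dense_bytes, sparse_bytes))
```

The actual ratio for the rental descriptions depends on how many of the 500 terms appear in a typical listing, so the numbers printed here are only indicative.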

### Creating the new feature

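The function below creates the new predictors with 2 folds, as described in the slides. A model fitted on one fold predicts the other, so no training row's new feature comes from a model that saw its label; the validation features are the average of the two models' predictions.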

In [31]:
y_fold1 = np.array(y.iloc[fold1])
y_fold2 = np.array(y.iloc[fold2])

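def get_2fold_stack(model):
    # Fit on fold 1, then predict fold 2 and the validation set.
    model.fit(tf_idf_fold1, y_fold1)
    new_fold2 = model.predict_proba(tf_idf_fold2)[:,:2]
    v1 = model.predict_proba(tf_idf_validation)[:,:2]  # validation prediction from the fold-1 model
    # Refit on fold 2, then predict fold 1 and the validation set.
    model.fit(tf_idf_fold2, y_fold2)
    new_fold1 = model.predict_proba(tf_idf_fold1)[:,:2]
    v2 = model.predict_proba(tf_idf_validation)[:,:2]  # validation prediction from the fold-2 model

    # Only two of the three class probabilities are kept, since the third is redundant.
    # Rows are ordered fold 1 then fold 2 to match train_idx.
    return np.concatenate([new_fold1, new_fold2], axis=0), (v1+v2)/2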
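
The same idea extends to any number of folds. The sketch below is one possible generalization, not the notebook's method: the function name `out_of_fold_features`, its arguments, and the synthetic data are all assumptions made for illustration. It writes predictions back by index, so it does not depend on the row order of the folds.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def out_of_fold_features(model, X_tr, y_tr, X_val, n_splits=2):
    """Return out-of-fold class probabilities for X_tr and averaged ones for X_val."""
    n_classes = len(np.unique(y_tr))
    oof = np.zeros((X_tr.shape[0], n_classes))
    val = np.zeros((X_val.shape[0], n_classes))
    for fit_idx, hold_idx in StratifiedKFold(n_splits=n_splits).split(X_tr, y_tr):
        model.fit(X_tr[fit_idx], y_tr[fit_idx])
        # Each training row is predicted by a model that never saw its label.
        oof[hold_idx] = model.predict_proba(X_tr[hold_idx])
        # The validation rows are predicted by every fold model and averaged.
        val += model.predict_proba(X_val) / n_splits
    # Drop the last column: the probabilities sum to 1, so it is redundant.
    return oof[:, :-1], val[:, :-1]

# Hypothetical synthetic data: 3 classes, 4 features.
rng = np.random.RandomState(0)
y_syn = np.repeat([0, 1, 2], 40)
X_syn = rng.randn(120, 4) + y_syn[:, None]
X_new = rng.randn(30, 4) + 1.0

oof_syn, val_syn = out_of_fold_features(LogisticRegression(), X_syn, y_syn, X_new)
print(oof_syn.shape, val_syn.shape)
```

More folds mean each base model is trained on a larger share of the data, at the cost of fitting the model more times.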

Below we add the predictor created by logistic regression. The performance is better than training with only the numerical predictors (log loss 0.641 versus 0.656), while the time required is only modestly longer (about 52 seconds versus 37).

In [36]:
start_ = time.time()
from sklearn.linear_model import LogisticRegression

logit = LogisticRegression()

f_logit_train, f_logit_validation = get_2fold_stack(logit)

X_train_stack = np.concatenate([X_train, f_logit_train], axis=1)
xgtrain_stack = xgb.DMatrix(X_train_stack, label= y_train)
X_validation_stack = np.concatenate([X_validation, f_logit_validation], axis=1)
xgvalidation_stack = xgb.DMatrix(X_validation_stack)

clf = xgb.train(param, xgtrain_stack, num_rounds)
y_prob_stack = clf.predict(xgvalidation_stack)
print('The log loss is: %.3f' % sklearn.metrics.log_loss(y_validation, y_prob_stack))
print('Time elapsed: %.2f seconds' % (time.time() - start_))

The log loss is: 0.641
Time elapsed: 52.21 seconds


## Stacking multiple models

We may of course stack more than just one model. Below we stack:

- logistic regression (with different penalty constants)
- knn (with different k)
- latent Dirichlet allocation (folding is not required since this is unsupervised learning)
- naive Bayes and quadratic discriminant analysis (the latter requires dense matrices, so the resampling function is rewritten for it).

We see that the stacked model takes less than a third of the time (about 606 seconds versus 2,165) to reach the same level of performance (log loss 0.626 versus 0.628).

In [38]:
start_ = time.time()


from sklearn.linear_model import LogisticRegression

logit1000 = LogisticRegression(C=1000)
f_logit1000_train, f_logit1000_validation = get_2fold_stack(logit1000)


from sklearn.neighbors import KNeighborsClassifier
knn10 = KNeighborsClassifier(n_neighbors=10)
f_knn10_train, f_knn10_validation = get_2fold_stack(knn10)

knn100 = KNeighborsClassifier(n_neighbors=100)
f_knn100_train, f_knn100_validation = get_2fold_stack(knn100)


knn10000 = KNeighborsClassifier(n_neighbors=10000)
f_knn10000_train, f_knn10000_validation = get_2fold_stack(knn10000)


from sklearn.decomposition import LatentDirichletAllocation
topic10 = LatentDirichletAllocation(n_topics=10)
f_topics10_train = topic10.fit_transform(tf_idf_train)
f_topics10_validation = topic10.transform(tf_idf_validation)


def get_2fold_stack_dense(model):
    # Same as get_2fold_stack, but on dense copies of the sparse TF-IDF matrices.
    _f1 = tf_idf_fold1.toarray()
    _f2 = tf_idf_fold2.toarray()
    _v = tf_idf_validation.toarray()
    model.fit(_f1, y_fold1)
    new_fold2 = model.predict_proba(_f2)[:,:2]
    v1 = model.predict_proba(_v)[:,:2]  # validation prediction from the fold-1 model
    model.fit(_f2, y_fold2)
    new_fold1 = model.predict_proba(_f1)[:,:2]
    v2 = model.predict_proba(_v)[:,:2]  # validation prediction from the fold-2 model

    return np.concatenate([new_fold1, new_fold2], axis=0), (v1+v2)/2

from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB(binarize=0.00001)
f_bnb_train, f_bnb_validation = get_2fold_stack(bnb)

from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
f_mnb_train, f_mnb_validation = get_2fold_stack(mnb)

from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
qda = QuadraticDiscriminantAnalysis()
f_qda_train, f_qda_validation = get_2fold_stack_dense(qda)


X_train_stack = np.concatenate([X_train, 
                                f_logit_train,
                                f_logit1000_train,
                                f_knn10_train,
                                f_knn100_train,  
                                f_knn10000_train,
                                f_mnb_train, 
                                f_bnb_train,
                                f_qda_train,
                                f_topics10_train], axis=1)
xgtrain_stack = xgb.DMatrix(X_train_stack, label= y_train)


X_validation_stack = np.concatenate([X_validation, 
                                     f_logit_validation,
                                     f_logit1000_validation,
                                     f_knn10_validation,
                                     f_knn100_validation, 
                                     f_knn10000_validation, 
                                     f_mnb_validation,
                                     f_bnb_validation,
                                     f_qda_validation,
                                     f_topics10_validation], axis=1)
xgvalidation_stack = xgb.DMatrix(X_validation_stack)

clf = xgb.train(param, xgtrain_stack, num_rounds)
y_prob_stack = clf.predict(xgvalidation_stack)
print('The log loss is: %.3f' % sklearn.metrics.log_loss(y_validation, y_prob_stack))
print('Time elapsed: %.2f seconds' % (time.time() - start_))



The log loss is: 0.626
Time elapsed: 605.73 seconds
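
The topic features need no folding because the topic model never sees the labels. The sketch below illustrates this on a hypothetical word-count matrix. It assumes a recent scikit-learn release, where the number of topics is passed as `n_components`; the notebook's `n_topics` argument is the older name for the same setting.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical word-count matrix: 6 documents, 4 vocabulary terms.
counts = np.array([[3, 1, 0, 0],
                   [2, 2, 0, 0],
                   [4, 0, 1, 0],
                   [0, 0, 3, 2],
                   [0, 1, 2, 3],
                   [0, 0, 1, 4]])

lda = LatentDirichletAllocation(n_components=2, random_state=0)
# No labels are passed to fit, so there is nothing to leak into the features.
topic_train = lda.fit_transform(counts[:4])
topic_new = lda.transform(counts[4:])
print(topic_train.shape, topic_new.shape)
```

Each output row is a distribution over topics, so these columns can be concatenated with the other stacked features in the same way as the class probabilities.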
