# Why this kernel

The purpose of this kernel is to provide you with a hackable baseline for this competition.

The unofficial secondary purpose (shhh!) is to get a shiny Kaggle medal. If you find it useful, please upvote!


## Introduction
So far, none of the kernels I have seen do feature engineering in a maintainable and scalable way:
* Feature engineering has to be done twice: once for training, once for testing.
* It is contamination-prone for cross-validation and GridSearch. (For example, you compute the mean of the whole dataset and use it as a feature even though each cross-validation fold trains on only 80% of the data.)
* Feature testing and scaling are scattered all over the code.
* You have to leave Pandas at some point and use NumPy arrays, which means you lose the context and labels of your data.
* You can't find useful features in an easy, automated way, especially after OneHotEncoding or LabelBinarizer.

## Learning outcomes
You will learn:
* How to scale feature engineering with a Pipeline
* How to easily debug any step in your feature engineering Pipeline
* How to structure your code to enable/disable features in a single place (aka Command Center) and preprocess them properly (StandardScaler, LabelBinarizer, OneHotEncoder ...)
* How to extract the most useful features from a feature set, even after OneHotEncoding or Binarization

**The end goal is very clean code that lets you test features very rapidly.**

What you will not learn:
* Data exploration and visualization
* Stacking in two lines (use mlxtend for that: https://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/)
* Imputing missing values with advanced techniques (beyond mean/median/mode): check my Titanic kernels in [Python](https://www.kaggle.com/mratsim/titanic/titanic-end-to-end-pipeline-stacking-gridsearch) and [Julia](https://www.kaggle.com/mratsim/titanic/titanic-julia-end-to-end-pipelining) for examples.


## Notes
This is a port of [Li Li's kernel](https://www.kaggle.com/aikinogard/two-sigma-connect-rental-listing-inquiries/random-forest-starter-with-numerical-features) to Scikit's Pipeline. Thank you, Li Li, for the clean, to-the-point code.

Unfortunately, this kernel does not run completely in a Kaggle kernel because the sklearn-pandas library, which lets you use Pandas DataFrames with scikit-learn, is not available there.

The baseline score is 0.63.

# Import libraries
* Numerical libraries
* ScikitLearn Tools
* Classifier: LightGBM (the `lightgbm` package, imported below as `lgb`)
* time: to name the output files

sklearn-pandas is imported in its own cell because it is not available on Kaggle anyway.


In [1]:
import numpy as np
import pandas as pd

In [2]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import LabelBinarizer, RobustScaler, Binarizer, StandardScaler, OneHotEncoder

In [3]:
# Import sklearn_pandas, which provides Pandas DataFrame compatibility with Scikit's classifiers and Pipeline
from sklearn_pandas import DataFrameMapper

In [4]:
import lightgbm as lgb

In [5]:
import time

# Import and display data

In [6]:
df_train = pd.read_json(open("./data/train.json", "r"))
df_train.head()

Unnamed: 0,bathrooms,bedrooms,building_id,created,description,display_address,features,interest_level,latitude,listing_id,longitude,manager_id,photos,price,street_address
10,1.5,3,53a5b119ba8f7b61d4e010512e0dfc85,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,[],medium,40.7145,7211212,-73.9425,5ba989232d0489da1b5f2c45f6688adc,[https://photos.renthop.com/2/7211212_1ed4542e...,3000,792 Metropolitan Avenue
10000,1.0,2,c5c8a357cba207596b04d1afd1e4f130,2016-06-12 12:19:27,,Columbus Avenue,"[Doorman, Elevator, Fitness Center, Cats Allow...",low,40.7947,7150865,-73.9667,7533621a882f71e25173b27e3139d83d,[https://photos.renthop.com/2/7150865_be3306c5...,5465,808 Columbus Avenue
100004,1.0,1,c3ba40552e2120b0acfc3cb5730bb2aa,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,"[Laundry In Building, Dishwasher, Hardwood Flo...",high,40.7388,6887163,-74.0018,d9039c43983f6e564b1482b273bd7b01,[https://photos.renthop.com/2/6887163_de85c427...,2850,241 W 13 Street
100007,1.0,1,28d9ad350afeaab8027513a3e52ac8d5,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,"[Hardwood Floors, No Fee]",low,40.7539,6888711,-73.9677,1067e078446a7897d2da493d2f741316,[https://photos.renthop.com/2/6888711_6e660cee...,3275,333 East 49th Street
100013,1.0,4,0,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,[Pre-War],low,40.8241,6934781,-73.9493,98e13ad4b495b9613cef886d79a6291f,[https://photos.renthop.com/2/6934781_1fa4b41a...,3350,500 West 143rd Street


In [7]:
df_test = pd.read_json(open("./data/test.json", "r"))
df_test.head()

Unnamed: 0,bathrooms,bedrooms,building_id,created,description,display_address,features,latitude,listing_id,longitude,manager_id,photos,price,street_address
0,1.0,1,79780be1514f645d7e6be99a3de696c5,2016-06-11 05:29:41,Large with awesome terrace--accessible via bed...,Suffolk Street,"[Elevator, Laundry in Building, Laundry in Uni...",40.7185,7142618,-73.9865,b1b1852c416d78d7765d746cb1b8921f,[https://photos.renthop.com/2/7142618_1c45a2c8...,2950,99 Suffolk Street
1,1.0,2,0,2016-06-24 06:36:34,Prime Soho - between Bleecker and Houston - Ne...,Thompson Street,"[Pre-War, Dogs Allowed, Cats Allowed]",40.7278,7210040,-74.0,d0b5648017832b2427eeb9956d966a14,[https://photos.renthop.com/2/7210040_d824cc71...,2850,176 Thompson Street
100,1.0,1,3dbbb69fd52e0d25131aa1cd459c87eb,2016-06-03 04:29:40,New York chic has reached a new level ...,101 East 10th Street,"[Doorman, Elevator, No Fee]",40.7306,7103890,-73.989,9ca6f3baa475c37a3b3521a394d65467,[https://photos.renthop.com/2/7103890_85b33077...,3758,101 East 10th Street
1000,1.0,2,783d21d013a7e655bddc4ed0d461cc5e,2016-06-11 06:17:35,Step into this fantastic new Construction in t...,South Third Street\r,"[Roof Deck, Balcony, Elevator, Laundry in Buil...",40.7109,7143442,-73.9571,0b9d5db96db8472d7aeb67c67338c4d2,[https://photos.renthop.com/2/7143442_0879e9e0...,3300,251 South Third Street\r
100000,2.0,2,6134e7c4dd1a98d9aee36623c9872b49,2016-04-12 05:24:17,"~Take a stroll in Central Park, enjoy the ente...","Midtown West, 8th Ave","[Common Outdoor Space, Cats Allowed, Dogs Allo...",40.765,6860601,-73.9845,b5eda0eb31b042ce2124fd9e9fcfce2f,[https://photos.renthop.com/2/6860601_c96164d8...,4900,260 West 54th Street


# Transformers

This is the main lesson of this code. Transformers make your code easy to maintain by wrapping each preprocessing step in an easily testable class:
* You will always be sure that transformations are applied to both train and test data
* You won't have to track matrix sizes or column names
* You can store temporary data in the training phase to reuse in the testing phase in a clean way, for example the number of occurrences of a manager id.

Transformers have 4 methods:

**\_\_init\_\_**

Usually just `pass`, but it is useful:
* for temporary variables, like a dictionary used for some mapping
* if you need to store data from the training phase to reuse during the predict phase. Check my [Titanic Kernel](https://www.kaggle.com/mratsim/titanic/titanic-end-to-end-pipeline-stacking-gridsearch)

**fit**

If the transformer doesn't learn from the input data (same transformation in the training and testing phases), it should just return self.

If the transformer "learns" from the data, i.e. there are one or more internal properties declared in the \_\_init\_\_ method, update those properties and then return self. They can then be used in the transform method.

**transform**

How the transformer transforms the data.

**predict**

This is used if you build [your custom classifier](http://scikit-learn.org/dev/developers/contributing.html#rolling-your-own-estimator)
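
To make the fit/transform contract concrete, here is a minimal sketch of a class-based transformer that counts manager id occurrences. The class name `ManagerCount`, the `ManagerCount` output column and the toy data are hypothetical, and storing the learned counts in `fit` as `counts_` (rather than declaring them in \_\_init\_\_) follows the usual scikit-learn convention; treat it as an illustration, not as the code used later in this kernel.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ManagerCount(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: adds how often each manager id was seen during fit."""

    def __init__(self, column="manager_id"):
        # Only configuration lives here; nothing is learned yet.
        self.column = column

    def fit(self, X, y=None):
        # Learn the counts on the training data only, then return self.
        self.counts_ = X[self.column].value_counts()
        return self

    def transform(self, X):
        # Managers never seen during fit get a count of 0.
        counts = X[self.column].map(self.counts_).fillna(0).astype(int)
        return X.assign(ManagerCount=counts)


# Toy data, made up for the example
train = pd.DataFrame({"manager_id": ["a", "a", "b"]})
test = pd.DataFrame({"manager_id": ["a", "c"]})

counter = ManagerCount().fit(train)
print(counter.transform(test)["ManagerCount"].tolist())  # [2, 0]
```

Because the counts are learned in `fit` and only looked up in `transform`, the test set never influences the feature.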


In [8]:
# This transformer extracts the number of photos
def transformer_numphot(train, test):
    def _trans(df):
        return df.assign(NumPhotos = df['photos'].str.len())
    return _trans(train), _trans(test)
    
# This transformer extracts the number of features
def transformer_numfeat(train, test):
    def _trans(df):
        return df.assign(NumFeat = df['features'].str.len())
    return _trans(train), _trans(test)
    
# This transformer extracts the number of words in the description
def transformer_numdescwords(train, test):
    def _trans(df):
        return df.assign(
            NumDescWords = df["description"].apply(lambda x: len(x.split(" ")))
            )
    return _trans(train), _trans(test)
    
# This transformer extracts the date/month/year and timestamp in a neat package
def transformer_datetime(train,test):
    def _trans(df):
        df = df.assign(
            Created_TS = pd.to_datetime(df["created"])
        )
        return df.assign(
            Created_Year = df["Created_TS"].dt.year,
            Created_Month = df["Created_TS"].dt.month,
            Created_Day = df["Created_TS"].dt.day
            )
    return _trans(train), _trans(test)

# Bucket the number of bedrooms and bathrooms
# Hour of the day
# Day of the week
# Remove the street number
# Impute streets without geolocation
# Neighborhood (nearest center)
# Distance to the center
# Cluster latitude/longitude
# Features: categorize
# manager skill (2*high + medium)

####### Debug Transformer ###########
# Use this transformer anywhere in your Pipeline to dump your dataframe to CSV
def transformer_debug(train,test):
    # Dump both dataframes as they are at this point of the chain
    train.to_csv('./debug_train.csv')
    test.to_csv('./debug_test.csv')
    return train,test

# Helper functions for functional programming

In [9]:
# inspired by: https://joshbohde.com/blog/functional-python
# Transformations do not take extra arguments so no need for partial or starmap

from functools import reduce

#compose list of functions (chained composition)
def compose(*funcs):
    def _compose(f, g):
        # functions are expecting X,y not (X,y) so must unpack with *g
        return lambda *args, **kwargs: f(*g(*args, **kwargs))
    return reduce(_compose, funcs)

# pipe function, reverse the order so that it's usual FIFO function order
def pipe(*funcs):
    return compose(*reversed(funcs))
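
To see the ordering difference between `compose` and `pipe`, here is a small sketch that repeats the two helpers and applies them to toy steps. The steps `add_one` and `double` are hypothetical and work on plain numbers instead of dataframes, but they follow the same "take (train, test), return (train, test)" convention as the transformers above.

```python
from functools import reduce


def compose(*funcs):
    def _compose(f, g):
        # g returns a (train, test) tuple, which is unpacked into f
        return lambda *args, **kwargs: f(*g(*args, **kwargs))
    return reduce(_compose, funcs)


def pipe(*funcs):
    return compose(*reversed(funcs))


# Hypothetical toy steps
def add_one(train, test):
    return train + 1, test + 1


def double(train, test):
    return train * 2, test * 2


# pipe applies the steps left to right: add_one first, then double
piped = pipe(add_one, double)(1, 2)
print(piped)  # (4, 6)

# compose applies them right to left: double first, then add_one
composed = compose(add_one, double)(1, 2)
print(composed)  # (3, 5)
```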

# Command Center - enabled features

This is where you:
* enable/disable features
* specify scaling, onehotencoding, labelbinarizing


Note: LabelBinarizer uses the syntax ("feature", LabelBinarizer()) instead of (["feature"], LabelBinarizer()).

Note 2: Copy-pasting your DataFrameMapper config when you submit your results makes for a superb description of your model.
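
The reason behind the two syntaxes can be sketched with plain scikit-learn, assuming (as the sklearn-pandas documentation describes) that a string selector hands the transformer a 1-D column while a list selector hands it a 2-D block. The toy dataframe below is made up for the illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer, StandardScaler

# Toy data, made up for the example
df = pd.DataFrame({
    "interest_level": ["low", "medium", "high", "low"],
    "price": [3000.0, 5465.0, 2850.0, 3275.0],
})

# "feature" selector -> 1-D input, which LabelBinarizer expects
binarized = LabelBinarizer().fit_transform(df["interest_level"])
print(binarized.shape)  # (4, 3): one column per class

# ["feature"] selector -> 2-D input, which scalers expect
scaled = StandardScaler().fit_transform(df[["price"]])
print(scaled.shape)  # (4, 1)
```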

In [10]:
mapper = DataFrameMapper([
    (["bathrooms"],RobustScaler()), # Some bathroom counts are 1.5; some outliers are 112 or 20
    (["bedrooms"],OneHotEncoder()),
    (["latitude"],None),
    (["longitude"],None),
    (["price"],RobustScaler()),
    # (["NumDescWords"],None),
    (["NumFeat"],StandardScaler()),
    (["Created_Year"],None),
    (["Created_Month"],None),
    (["Created_Day"],None)
]
)

# Command Center - feature engineering pipeline + classifier

This is your transformation pipeline in the format:
("arbitrary_name", Transformer())

You can comment/uncomment to remove/add steps.
The last step should be your classifier.

Note: if you need to configure a specific step "step", use the format step\_\_parameter = value.

For example, I wanted to set the eval_metric of xgboost (named xgb) during the fit step, so I used:

pipe.fit(X_train, y_train, **xgb\_\_eval\_metric**='mlogloss')
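
As a minimal sketch of the step\_\_parameter convention, the example below builds a small scikit-learn Pipeline on random data. The step names `scale` and `clf`, the LogisticRegression classifier and the data are assumptions made for the illustration; they are not the pipeline of this kernel.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Random toy data, made up for the example
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(40, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

model_pipeline = Pipeline([
    ("scale", StandardScaler()),   # ("arbitrary_name", Transformer())
    ("clf", LogisticRegression()), # the last step is the classifier
])

# Constructor parameters are addressed as <step name>__<parameter>
model_pipeline.set_params(clf__C=0.5)

# Fit-time arguments use the same prefix and are forwarded to that step's fit
model_pipeline.fit(X_demo, y_demo, clf__sample_weight=np.ones(len(y_demo)))

print(model_pipeline.named_steps["clf"].C)  # 0.5
```

The same prefix is what GridSearchCV expects in its parameter grid, which is why naming the steps clearly pays off.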

In [11]:
featurize = pipe(
    transformer_numphot,
    transformer_numfeat,
    transformer_numdescwords,
    transformer_datetime
)

# Helper functions

## Cross Validation
5 folds, with results rounded to 3 decimals.

## Get the features that contribute most to the score
This function gives sensible names in case of OneHotEncoding or LabelBinarizer.
This is only possible with classifiers that implement **feature\_importances\_**, like RandomForest, ExtraTrees or XGBoost.

It outputs a top_features.csv file.

## Predict and format the output

In [12]:
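##### Cross Validation #######
# Note: `pipe` here must be a scikit-learn Pipeline ending in a classifier,
# not the functional `pipe` helper defined earlier.
def crossval(X,y,n):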
    cv = cross_val_score(pipe, X, y, cv=n)
    print("Cross Validation Scores are: ", cv.round(3))
    print("Mean CrossVal score is: ", round(cv.mean(),3))
    print("Std Dev CrossVal score is: ", round(cv.std(),3))

In [13]:
####### Get top features and noise #######
def top_feat():
    dummy, model = pipe.steps[-1]

    feature_list = []
    for feature in pipe.named_steps['featurize'].features:
        if isinstance(feature[1], OneHotEncoder):
            for feature_value in feature[1].active_features_:
                feature_list.append(feature[0][0]+'_'+str(feature_value))
        else:
            try:
                for feature_value in feature[1].classes_:
                    feature_list.append(feature[0]+'_'+feature_value)
            except:
                feature_list.append(feature[0])


    top_features = pd.DataFrame({'feature':feature_list,'importance':np.round(model.feature_importances_,3)})
    top_features = top_features.sort_values('importance',ascending=False).set_index('feature')
    top_features.to_csv('./top_features.csv')
    top_features.plot.bar()

In [14]:
####### Predict and format output #######
def output():
    predictions = pipe.predict_proba(df_test,
                                     lgbm__num_iteration=pipe.named_steps['lgbm'].best_iteration)
    
    #debug
    print(pipe.classes_)
    print(predictions)
    
    result = pd.DataFrame({
        'listing_id': df_test['listing_id'],
        pipe.classes_[0]: [row[0] for row in predictions], 
        pipe.classes_[1]: [row[1] for row in predictions],
        pipe.classes_[2]: [row[2] for row in predictions]
        })
    result.to_csv(time.strftime("%Y-%m-%d_%H%M-")+'-002-feat-eng.csv', index=False)

In [15]:
################ Model Selection ################################
df_train, df_test = featurize(df_train,df_test)


X = df_train
y = df_train['interest_level']



In [16]:
y

10        medium
10000        low
100004      high
100007       low
100013       low
100014    medium
100016       low
100020       low
100026    medium
100027       low
100030       low
10004        low
100044      high
100048       low
10005        low
100051    medium
100052       low
100053       low
100055       low
100058       low
100062       low
100063    medium
100065       low
100066       low
10007     medium
100071       low
100075    medium
100076       low
100079      high
100081       low
           ...  
99915        low
99917        low
99919     medium
99921     medium
99923        low
99924        low
99931        low
99933        low
99935        low
99937        low
9994         low
99953        low
99956        low
99960     medium
99961        low
99964     medium
99965        low
99966        low
99979        low
99980        low
99982       high
99984        low
99986        low
99987        low
99988     medium
9999      medium
99991        low
99992        l

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [18]:
y_test

59100        low
108970       low
102253      high
77900        low
17279        low
19745        low
60258        low
88633     medium
98546        low
30563        low
72559        low
43996        low
70034        low
12875     medium
64731        low
98805       high
88366        low
57920     medium
68970        low
60395        low
21395        low
4796      medium
22938        low
17905        low
77099     medium
107451       low
27354     medium
46827        low
2870      medium
66860        low
           ...  
101350       low
22319        low
50130        low
21203        low
81583        low
4607         low
93516        low
60813        low
71894     medium
66622        low
45875        low
100854       low
30892        low
123373      high
18773        low
15843     medium
96283        low
16692     medium
9606      medium
82926     medium
101051    medium
70516        low
67922     medium
61286       high
103270       low
122671       low
115295       low
18048        l

In [19]:
X_train = mapper.fit_transform(X_train)

X_test = mapper.transform(X_test)

df_test = mapper.transform(df_test)



In [28]:
# .values replaces as_matrix(), which was removed in pandas 1.0
y_train = y_train.values
y_test = y_test.values

In [22]:
# Metric Multiclass log loss
def multiclass_log_loss(y_true, y_pred, eps=1e-15):
    """Multi class version of Logarithmic Loss metric.
    https://www.kaggle.com/wiki/MultiClassLogLoss
    Parameters
    ----------
    y_true : array, shape = [n_samples]
            true class, integers in [0, n_classes - 1]
    y_pred : array, shape = [n_samples, n_classes]
    Returns
    -------
    loss : float
    """
    predictions = np.clip(y_pred, eps, 1 - eps)

    # normalize row sums to 1
    predictions /= predictions.sum(axis=1)[:, np.newaxis]

    actual = np.zeros(y_pred.shape)
    n_samples = actual.shape[0]
    actual[np.arange(n_samples), y_true.astype(int)] = 1
    vectsum = np.sum(actual * np.log(predictions))
    loss = -1.0 / n_samples * vectsum
    return loss
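
The metric boils down to loss = -(1/N) * sum over rows of log(probability given to the true class). As a sanity check, the sketch below computes that quantity by hand on made-up predictions and compares it with scikit-learn's `log_loss`; the toy arrays are assumptions for the illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up integer labels and predicted probabilities (each row sums to 1)
y_true = np.array([0, 1, 2, 1])
y_pred = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.3, 0.4, 0.3],
])

# Pick the probability assigned to the true class of each row
true_class_prob = y_pred[np.arange(len(y_true)), y_true]
manual = -np.mean(np.log(true_class_prob))

reference = log_loss(y_true, y_pred, labels=[0, 1, 2])
print(np.isclose(manual, reference))  # True
```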

In [29]:
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

In [30]:
y_train

array(['low', 'low', 'low', ..., 'low', 'medium', 'low'], dtype=object)

In [31]:
# specify your configurations as a dict
params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'multiclass',
    'num_class': 3,
    'metric': {'multi_logloss'},
    'learning_rate': 0.1,
    #'feature_fraction': 0.9,
    #'bagging_fraction': 0.8,
    #'bagging_freq': 5,
    'verbose': 1
}

print('Start training...')
# train
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=999,
                valid_sets=lgb_eval,
                early_stopping_rounds=50,
                feature_name='auto',
                categorical_feature='auto')



Start training...


ValueError: could not convert string to float: 'low'
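
The ValueError above comes from the string labels: as far as I can tell, LightGBM's multiclass objective expects integer labels in [0, num_class). A minimal sketch of one possible fix is to map the labels to integers before building the `lgb.Dataset`. The mapping `label_to_int` and the toy series below are hypothetical; any consistent order works as long as the same mapping is used to read the predicted probability columns.

```python
import pandas as pd

# Hypothetical mapping from interest level to class index
label_to_int = {"low": 0, "medium": 1, "high": 2}

# Toy labels, made up for the example
y_demo = pd.Series(["low", "medium", "high", "low"])

y_encoded = y_demo.map(label_to_int).values
print(y_encoded)  # [0 1 2 0]
```

With this mapping, column 0 of the predicted probabilities corresponds to low, column 1 to medium and column 2 to high.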

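In [None]:
# multiclass_log_loss expects integer class indices, so collapse the one-hot
# matrix with argmax. get_dummies orders its columns alphabetically
# (high, low, medium); the labels used for training must follow the same order.
t = pd.get_dummies(y_test).values.argmax(axis=1)

In [None]:
print('Start predicting...')
# predict
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
# eval
print('The mlogloss of prediction is:', multiclass_log_loss(t, y_pred))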

In [None]:
# NumPy arrays have no .save method; write the predictions as CSV instead
np.savetxt('dump.csv', y_pred, delimiter=',')

In [None]:
gbm.predict(df_test, num_iteration=gbm.best_iteration)

In [None]:
X_train, X_test = featurize(X_train,X_test)

In [None]:
################ Cross Validation ################################
crossval(X,y,5)

In [None]:
######### Most influential features ########
top_feat()

In [None]:
######## Predict ########
output()

# The End

I hope you enjoyed the kernel and that it will help you iterate and test faster in your feature engineering quest.

Thank you for your attention, and feel free to post a comment and upvote.

You can check advanced transformer usage on my Titanic kernels in [Python](https://www.kaggle.com/mratsim/titanic/titanic-end-to-end-pipeline-stacking-gridsearch) and [Julia](https://www.kaggle.com/mratsim/titanic/titanic-julia-end-to-end-pipelining).