# Decision Tree Regression with Simple Features

The extracted features are deliberately simple because of limited computing power. Even though I am using a cloud platform's virtual machine with 32 GB of memory, it easily runs out of memory because the text-based features of about 2 million items amount to roughly 500 megabytes.

In [1]:
import math

import pandas as pd

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_predict, GridSearchCV
from sklearn.metrics import make_scorer
from nltk import download
from nltk.corpus import stopwords

In [2]:
df_test = pd.read_csv('../data/test.tsv', sep='\t')
test_count = len(df_test)
print("Number of items in test set: {}".format(test_count))

df_train = pd.read_csv('../data/train.tsv', sep='\t')
train_count = len(df_train)
print("Number of items in training set: {}".format(train_count))

Number of items in test set: 693359
Number of items in training set: 1482535


## Preparing Data

In [3]:
y_train = df_train["price"]
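# DataFrame.append was removed in pandas 2.0; pd.concat with sort=True keeps the sorted column order shown below
df_combined = pd.concat([df_train, df_test], sort=True).drop("price", axis=1)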
df_combined.head()

Unnamed: 0,brand_name,category_name,item_condition_id,item_description,name,shipping,test_id,train_id
0,,Men/Tops/T-shirts,3,No description yet,MLB Cincinnati Reds T Shirt Size XL,1,,0.0
1,Razer,Electronics/Computers & Tablets/Components & P...,3,This keyboard is in great condition and works ...,Razer BlackWidow Chroma Keyboard,0,,1.0
2,Target,Women/Tops & Blouses/Blouse,1,Adorable top with a hint of lace and a key hol...,AVA-VIV Blouse,1,,2.0
3,,Home/Home Décor/Home Décor Accents,1,New with tags. Leather horses. Retail for [rm]...,Leather Horse Statues,1,,3.0
4,,Women/Jewelry/Necklaces,1,Complete with certificate of authenticity,24K GOLD plated rose,0,,4.0


In [4]:
df_combined.isnull().any()

brand_name            True
category_name         True
item_condition_id    False
item_description      True
name                 False
shipping             False
test_id               True
train_id              True
dtype: bool

Fill null values in the string-typed attributes to avoid errors during preprocessing.

In [5]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
for column in ["brand_name", "category_name", "name", "item_description"]:
    df_combined[column] = df_combined[column].fillna("none")

## Feature extraction

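Separate two levels of the category path to obtain new features (for example "1category" and "2category" instead of "1category/2category/3category/4category"). Most items do not have third and fourth levels of categories, so the number of extracted levels is limited to two. Note that level_cat indexes the split path from zero, so the calls below with level=1 and level=2 store the second and third parts of the path in "1_level_cat" and "2_level_cat", and items without that level get None.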

In [6]:
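def level_cat(x, level=1):
    # The split list is zero-indexed: level=1 returns the second part of the path.
    try:
        return x.split("/")[level]
    except (IndexError, AttributeError):
        # The path has no such level, or the value is not a string (e.g. NaN).
        return None

for i in range(2):
    column = "{}_level_cat".format(i + 1)
    df_combined[column] = df_combined["category_name"].apply(lambda x: level_cat(x, level=i + 1))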
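
To make the indexing explicit, here is a minimal sketch of the same split on a single category string. The helper name category_level is hypothetical (it is not part of the notebook) and it takes a zero-based level, so the difference between "first level" and "index 1" is visible.

```python
def category_level(category_name, level):
    """Return the zero-based `level` of a slash-separated path, or None.

    Hypothetical helper for illustration; the notebook uses level_cat.
    """
    parts = category_name.split("/")
    return parts[level] if level < len(parts) else None

example = "Men/Tops/T-shirts"
print(category_level(example, 0))  # Men
print(category_level(example, 1))  # Tops
print(category_level(example, 2))  # T-shirts
print(category_level("none", 1))   # None (the filled placeholder has one part)
```

With this convention, levels 0 and 1 would be the first two levels of the path, while levels 1 and 2 are the ones the cell above extracts.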

Download the stopwords, then exclude them when counting the words in the description and name of each item. The only features derived from these texts are word counts. I tried to use the text feature extraction classes of scikit-learn (TF-IDF) but ran out of memory.

In [7]:
download("stopwords")
stop_words = stopwords.words('english')

[nltk_data] Downloading package stopwords to /home/ml/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
stop_word_set = set(stop_words)  # set lookups are faster than scanning the list

def word_count(text):
    return len([word for word in text.split(" ") if word not in stop_word_set])

df_combined["name_len"] = df_combined["name"].apply(word_count)
df_combined["description_len"] = df_combined["item_description"].apply(word_count)
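
As a rough illustration of what this feature measures, the sketch below counts non-stopword tokens with a tiny hard-coded stopword set, so it runs without NLTK. DEMO_STOP_WORDS and count_non_stop_words are assumed names for this example only; the notebook uses the full NLTK English list.

```python
# Tiny stand-in for the NLTK English stopword list (assumption for the demo).
DEMO_STOP_WORDS = frozenset({"a", "and", "in", "is", "of", "the", "this", "with"})

def count_non_stop_words(text, stop_words=DEMO_STOP_WORDS):
    """Count space-separated tokens that are not in `stop_words`."""
    return sum(1 for word in text.split(" ") if word not in stop_words)

sentence = "This keyboard is in great condition"
print(count_non_stop_words(sentence))          # 4
print(count_non_stop_words(sentence.lower()))  # 3
```

The comparison is case-sensitive, as in word_count above, so the capitalized "This" is counted in the first call. Lowercasing the text first would be one possible refinement.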

Encode the labels of "1_level_cat", "2_level_cat", "category_name" and "brand_name" as integer codes (the same thing LabelEncoder does). One-hot encoding cannot be used, since it produces thousands of features and raises a MemoryError.

In [9]:
df_combined = df_combined.drop(["name", "item_description"], axis=1)
categories = ["1_level_cat", "2_level_cat", "category_name", "brand_name"]
for category in categories:
    df_combined[category] = df_combined[category].astype('category')

categorical_columns = df_combined.select_dtypes(['category']).columns
df_combined[categorical_columns] = df_combined[categorical_columns].apply(lambda x: x.cat.codes)
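
The following toy example sketches how the category codes come out. The series contents are made up for illustration; only astype("category") and cat.codes are taken from the cell above.

```python
import pandas as pd

# Made-up values, including a missing one like those level_cat can return.
brands = pd.Series(["Razer", "none", "Target", "none", None])

as_category = brands.astype("category")
print(list(as_category.cat.categories))  # ['Razer', 'Target', 'none']

codes = as_category.cat.codes
print(codes.tolist())  # [0, 2, 1, 2, -1]
```

Codes follow the sorted order of the distinct values, and a missing value is encoded as -1. In the notebook this means that items lacking a category level end up with the code -1 in the level columns.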

Split combined data for training and prediction.

In [10]:
id_columns = ["test_id", "train_id"]

X_train_id, X_test_id = df_combined[:-test_count], df_combined[train_count:]
X_train = X_train_id.drop(id_columns, axis=1)
X_test = X_test_id.drop(id_columns, axis=1)

print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("----------------------------")
print("Last version of X_train:")
X_train.head()

X_train shape: (1482535, 8)
y_train shape: (1482535,)
X_test shape: (693359, 8)
----------------------------
Last version of X_train:


Unnamed: 0,brand_name,category_name,item_condition_id,shipping,1_level_cat,2_level_cat,name_len,description_len
0,5268,829,3,1,102,773,7,3
1,3889,86,3,0,30,215,4,21
2,4588,1277,1,1,103,97,2,16
3,5268,503,1,1,55,410,3,22
4,5268,1204,1,0,58,542,4,3


## Train and Fine-tune Decision Tree Regressor

In [11]:
def gridsearchcv_report(predictor):
    # Use the argument rather than the global `regr`
    print("Best parameters: {}".format(predictor.best_params_))
    print("Best score: {}".format(predictor.best_score_))

In [12]:
def rmsle(y_true, y_pred):
    """
    Modified version of the function at https://www.kaggle.com/marknagelberg/rmsle-function
    Reason for modification: the original did not work with GridSearchCV.
    """
    terms_to_sum = [(math.log(yp + 1) - math.log(yt + 1)) ** 2.0 for yp, yt in zip(y_pred, y_true)]
    return (sum(terms_to_sum) * (1.0/len(y_true))) ** 0.5
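
For reference, the same metric can be sketched with NumPy arrays instead of a Python loop, which is likely to be faster on a training set of this size. The name rmsle_vectorized and the sample prices are assumptions for this example, not part of the notebook.

```python
import numpy as np

def rmsle_vectorized(y_true, y_pred):
    """RMSLE computed on whole arrays (illustrative alternative to rmsle)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) equals log(x + 1), matching the math.log(x + 1) terms above.
    squared_log_errors = (np.log1p(y_pred) - np.log1p(y_true)) ** 2
    return float(np.sqrt(squared_log_errors.mean()))

# Made-up prices for demonstration only.
prices_true = [10.0, 52.0, 10.0]
prices_pred = [12.0, 40.0, 10.0]
print(rmsle_vectorized(prices_true, prices_pred))
```

Like the loop version, this assumes every value is greater than -1, so that the logarithm is defined.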

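Use GridSearchCV to find which regularization parameters fit best. 5-fold cross-validation is used to keep training time down. The scoring function is root mean squared logarithmic error (RMSLE), since it is the evaluation metric for the competition. Because the scorer is created with greater_is_better=False, scikit-learn negates it, so the best score reported below is the negative of the RMSLE.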

In [13]:
rmsle_scorer = make_scorer(rmsle, greater_is_better=False)

In [14]:
hyperparameters = {"max_depth": [None, 5, 15, 45],
                   "min_samples_split": [1., 3, 9],
                   "min_samples_leaf": [6, 18, 34],
                   "max_leaf_nodes": [None, 5, 15, 45]}

regr = GridSearchCV(DecisionTreeRegressor(random_state=42), hyperparameters, cv=5, scoring=rmsle_scorer)
regr.fit(X_train, y_train)
gridsearchcv_report(regr)

Best parameters: {'max_depth': None, 'min_samples_split': 3, 'max_leaf_nodes': None, 'min_samples_leaf': 34}
Best score: -0.6089151877023106


Try regularization parameters close to the best values from the previous run: 20% and 40% above and below each value. If a value is so small that these offsets round to zero, add and subtract 1 instead. If None was the best value for a parameter, do not search that parameter again.

In [15]:
bp = regr.best_params_
get_close_hyp = lambda x: [x-round(x*0.4), x-round(x*0.2), x, x+round(x*0.2), x+round(x*0.4)]

hyperparameters = {}
for p in bp:
    if not bp[p]:
        continue
    close_hyp = get_close_hyp(bp[p])
    if not close_hyp[1:] == close_hyp[:-1]:
        hyperparameters[p] = close_hyp
    else:
        val = close_hyp[0]
        hyperparameters[p] = [val-1, val, val+1]

regr = GridSearchCV(DecisionTreeRegressor(random_state=42), hyperparameters, cv=5, scoring=rmsle_scorer)
regr.fit(X_train, y_train)
gridsearchcv_report(regr)

Best parameters: {'min_samples_split': 2, 'min_samples_leaf': 34}
Best score: -0.6089151877023106
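
The refinement rule can be checked in isolation with a small sketch. The function name neighbourhood is hypothetical, and unlike the cell above this variant also removes duplicate candidates, which is an assumption of the example rather than what the notebook does.

```python
def neighbourhood(x):
    """Candidate values 20% and 40% below and above x, deduplicated.

    Falls back to [x - 1, x, x + 1] when every offset rounds to zero.
    Illustrative variant of get_close_hyp, not the notebook's exact code.
    """
    offsets = [round(x * 0.4), round(x * 0.2)]
    candidates = {x - offsets[0], x - offsets[1], x, x + offsets[1], x + offsets[0]}
    if len(candidates) == 1:
        return [x - 1, x, x + 1]
    return sorted(candidates)

print(neighbourhood(34))  # [20, 27, 34, 41, 48]
print(neighbourhood(3))   # [2, 3, 4]
print(neighbourhood(1))   # [0, 1, 2]
```

For the best values found in the first search, this gives five candidates around min_samples_leaf=34 and three distinct candidates around min_samples_split=3. The fallback can produce values such as 0 that are not valid for every parameter, so the candidates may still need filtering.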


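Calculate RMSLE from cross-validated predictions. No cv argument is passed to cross_val_predict below, so the number of folds is scikit-learn's default (3 before version 0.22, 5 from 0.22 onward) rather than 10.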

In [16]:
y_pred = cross_val_predict(regr.best_estimator_, X_train, y_train)
rmsle(y_train, y_pred)

0.6120178163429827
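
To control the number of folds explicitly instead of relying on the library default, cv can be passed to cross_val_predict. The sketch below uses synthetic data and an arbitrary min_samples_leaf; X_demo, y_demo and y_oof are assumed names for illustration only.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the real features and prices.
rng = np.random.RandomState(42)
X_demo = rng.rand(200, 3)
y_demo = X_demo[:, 0] * 10 + 1  # strictly positive toy targets

model = DecisionTreeRegressor(min_samples_leaf=5, random_state=42)

# cv=10 requests 10-fold cross-validation: each prediction comes from a
# model that did not see that row during fitting.
y_oof = cross_val_predict(model, X_demo, y_demo, cv=10)
print(y_oof.shape)  # (200,)
```

The resulting out-of-fold predictions can then be passed to the rmsle function together with the true targets.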

Predict prices for the test data and prepare the predictions for submission to the Kaggle competition.

In [17]:
regr.best_estimator_.fit(X_train, y_train)
y_pred = regr.best_estimator_.predict(X_test)

In [18]:
# test_id is the row position; `columns` fixes the output order to test_id, price
submission = pd.DataFrame({"test_id": range(len(y_pred)), "price": y_pred},
                          columns=["test_id", "price"])
submission.to_csv("../data/submisson.csv", index=False)

Obtained an RMSLE of 0.60621 and placed 1133/1411 on the final try. Next time I will try to improve feature extraction and use a more powerful ML algorithm (high-scoring users in the competition mostly use neural nets).