In this Kernel, we want to go through the data in order to develop intuitions while performing a few sanity checks.

In [None]:
import pandas as pd
from IPython.display import HTML, display

In [None]:
train_df = pd.read_csv('../input/train.tsv', sep='\t')
test_df = pd.read_csv('../input/test.tsv', sep='\t')

We start by looking into the data types, the number of unique elements for each column, and their missing values. We notice immediately that a lot of values are missing in **brand_name**. For now we'll replace them with the placeholder '$not available$'. Generally speaking, in the context of this dataset, missing values in **brand_name** and **category_name** should be treated as actual values. The rationale behind this is the following: if a listing does not contain a **brand_name** or a **category_name**, it's less likely to be found by customers, which in turn means that the price paid is more likely to be below market value.

In [None]:
print('DataFrame shape:')
print(train_df.shape)
print('\nData types:')
print(train_df.dtypes)
print('\nUnique Count:')
print(train_df.nunique())
display(train_df.head())

import missingno as msno
msno.matrix(train_df)

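# Assign the filled column back instead of calling fillna(inplace=True) on a
# column selection, which newer pandas versions warn about.
train_df['brand_name'] = train_df['brand_name'].fillna('$not available$')
train_df['category_name'] = train_df['category_name'].fillna('$not available$')

test_df['brand_name'] = test_df['brand_name'].fillna('$not available$')
test_df['category_name'] = test_df['category_name'].fillna('$not available$')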
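As a rough illustration of the point about missing values, the following sketch quantifies the share of missing entries per column and then fills them with an explicit placeholder. The toy frame and its values are made up for this example and are not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listing data (hypothetical values).
toy = pd.DataFrame({
    'brand_name': ['Nike', np.nan, np.nan, 'Apple'],
    'category_name': ['Men/Shoes/Athletic', np.nan,
                      'Women/Tops/Blouse', 'Electronics/Phones/Cases'],
})

# Fraction of missing entries per column.
missing_share = toy.isnull().mean()
print(missing_share)

# Treat "missing" as its own category by filling with an explicit placeholder.
placeholder = '$not available$'
filled = toy.fillna(placeholder)
print(filled['brand_name'].value_counts())
```

Keeping the placeholder as a regular value means that later group-bys and encoders see "no brand given" as just another brand.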



We can learn a couple of things here:
1. **train_id**: id of sample (drop during training)
2. **name**: literally the name of the product
3. **item_condition_id**: Encodes item condition on 5 levels
4. **category_name**: The name of the category. Notice that it comes in multiple levels delimited by '/'
5. **brand_name**: Literally the brand name (surprisingly few, considering the total number of rows)
6. **price**: Our target
7. **shipping**: a binary feature probably indicating if shipping costs are included or not
8. **item_description**: the text part of the listing. There are almost as many unique entries here as there are data points


Combining these observations with common sense, we can perform a few sanity checks:
1. There should be some correlation between **item_condition_id** and **price**
2. **price** is unlikely to assume values below 0, so the values are likely to be log-normal or something similar
3. Common sense: **brand_name** and **category_name** will show strong biases towards a few select values
4. Common sense: Listings that include the shipping price are likely to be more expensive

While these points are not mandatory for subsequent analysis, they will show us a few things: check 1 should reveal in which order **item_condition_id** is defined, and check 4 will tell us whether **shipping**=1 means that shipping is included or charged separately.

Sanity check 1: There should be some correlation between **item_condition_id** and **price**

We start with a heatmap of the correlations. As the heatmap alone does not reveal anything, we will directly plot **price** against **item_condition_id**. This shows us that there are generally higher prices for lower **item_condition_id**. This lets us assume that 1 = pristine condition.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

df_samp = train_df.sample(10000)
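# Restrict to numeric columns; newer pandas versions raise on text columns.
c = train_df.select_dtypes('number').corr()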
sns.heatmap(c)
plt.figure()
plt.scatter(df_samp['price'],df_samp['item_condition_id'])
plt.ylabel('item_condition_id')
plt.xlabel('price')
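Because **item_condition_id** is an ordinal code with only five levels, a per-level summary is often easier to read than a scatter plot. The sketch below uses a small made-up frame (the numbers are assumptions, not results from the data) and computes a rank correlation as the Pearson correlation of the ranks.

```python
import pandas as pd

# Hypothetical toy data: three condition levels with two listings each.
toy = pd.DataFrame({
    'item_condition_id': [1, 1, 2, 2, 3, 3],
    'price': [30.0, 50.0, 20.0, 30.0, 10.0, 14.0],
})

# Median, mean and number of listings per condition level.
summary = toy.groupby('item_condition_id')['price'].agg(['median', 'mean', 'count'])
print(summary)

# Spearman rank correlation, computed as Pearson correlation of the ranks.
rho = toy['item_condition_id'].rank().corr(toy['price'].rank())
print('rank correlation:', round(rho, 3))
```

A clearly negative rank correlation would support the reading that lower ids correspond to better condition.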

Sanity check 2: **price** is unlikely to assume values below 0, so the values are likely to be log-normal or something similar

Let's confirm our assumptions about the distribution of the numerical values. Since there are some extreme values, we'll also take a look at what happens when we trim the bottom and top 1%. The plots confirm that our assumption of log-normality of **price** is decent, albeit not perfect. We'll keep a log-price column as it may come in handy if we later want to use methods that assume normality.

In [None]:
import numpy as np
plt.figure()
sns.distplot(train_df['price'])
plt.title('All raw prices')

plt.figure()
s = train_df['price']
s = s[s > np.percentile(s,1)]
s = s[s < np.percentile(s,99)]
sns.distplot(s)
plt.title('Clipped raw prices')

plt.figure()
s = train_df['price']
#s = s[s > 0]
sns.distplot(np.log(s+1))
plt.title('log prices')

train_df['log_price'] = np.log(train_df['price']+1)
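One way to put a number on "decent albeit not perfect" is to compare the skewness before and after the transform. The following sketch uses synthetic log-normal prices (an assumption made purely for illustration) and also shows that `np.expm1` inverts `np.log1p`, which matters when predictions on the log scale have to be mapped back to prices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, log-normally distributed "prices" (not the competition data).
prices = pd.Series(rng.lognormal(mean=3.0, sigma=0.8, size=10_000))

# Skewness close to 0 indicates a roughly symmetric distribution.
raw_skew = prices.skew()
log_skew = np.log1p(prices).skew()
print(f'skewness raw: {raw_skew:.2f}, after log1p: {log_skew:.2f}')

# np.expm1 undoes np.log1p (equivalent to exp(x) - 1).
roundtrip = np.expm1(np.log1p(prices))
print('round trip ok:', np.allclose(roundtrip, prices))
```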

Sanity check 3: Common sense: **brand_name** and **category_name** will show strong biases towards a few select values

We start by taking a look at the distribution of the counts for each **brand_name** and **category_name**. Since we are already assuming that the distributions are dominated by a few values, we'll opt for pie charts. As expected, the majority of data points fall into a small number of top **brand_name**s and **category_name**s. We keep this in mind, as grouping the 'lesser' brands into one unified 'other' brand might be helpful. A potentially useful detail to keep in mind is that **brand_name** is actually missing for almost half the data.

In [None]:
plt.figure()
plt.title('brand name')
grp_brand = train_df.groupby('brand_name')
grp_brand.count()['train_id'].sort_values().plot(kind='pie',labels=None)

plt.figure()
plt.title('category_name')
grp_cat = train_df.groupby('category_name')
grp_cat.count()['train_id'].sort_values().plot(kind='pie',labels=None)

print('Top 10 brands:')
print(grp_brand.count()['train_id'].sort_values(ascending=False).head(10))

print('\nTop 10 categories:')
print(grp_cat.count()['train_id'].sort_values(ascending=False).head(10))
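The idea of grouping 'lesser' brands into one 'other' brand can be sketched as follows. The helper name `lump_rare`, its parameters, and the toy brand list are hypothetical and only meant to illustrate the approach.

```python
import pandas as pd

def lump_rare(values: pd.Series, top_n: int, other: str = 'other') -> pd.Series:
    """Keep the top_n most frequent values and replace all others with `other`."""
    keep = values.value_counts().nlargest(top_n).index
    return values.where(values.isin(keep), other)

# Hypothetical brand column.
brands = pd.Series(['Nike', 'Nike', 'Nike', 'PINK', 'PINK', 'Acme', 'Zeta'])
lumped = lump_rare(brands, top_n=2)
print(lumped.value_counts())
```

The cut-off `top_n` is a tuning choice; a threshold on the minimum count per brand would work just as well.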

Sanity check 4: Common sense: Listings that include the shipping price are likely to be more expensive

Both the mean and the median price are higher for **shipping**=0. Taken at face value, this suggests that **shipping**=0 marks listings whose price includes the shipping costs. This inference should be checked against the competition's data description, which defines **shipping** as 1 if the shipping fee is paid by the seller and 0 if it is paid by the buyer. If the reading holds, the difference in the median/mean lets us roughly estimate that the average shipping cost lies somewhere around 3-5 currency units (CU, whatever they may be). While this point may seem trivial, in a market where most items cost around 20 CU, the question of included or excluded shipping costs has a profound impact on price fairness.

In [None]:
grp_ship = train_df.groupby('shipping')

plt.figure()
plt.title('Median price grouped by shipping')
grp_ship['price'].median().plot.bar()

plt.figure()
plt.title('Mean price grouped by shipping')
grp_ship['price'].mean().plot.bar()
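The gap between the two groups can also be computed directly instead of being read off the bar charts. This is a minimal sketch on made-up numbers; note that such a naive comparison ignores that the two groups may contain different kinds of items.

```python
import pandas as pd

# Hypothetical toy listings.
toy = pd.DataFrame({
    'shipping': [0, 0, 0, 1, 1, 1],
    'price':    [20.0, 24.0, 31.0, 16.0, 20.0, 27.0],
})

stats = toy.groupby('shipping')['price'].agg(['median', 'mean'])
# Difference between the shipping=0 and the shipping=1 group.
gap = stats.loc[0] - stats.loc[1]
print(gap)
```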

What's next? Revisiting **category_name**

Earlier, we noticed that **category_name** comes with different levels. Essentially this means that **category_name**s represent the leaf nodes of a tree. However, since these inputs are provided by human users, it is possible that some people go to different depths for the same product, e.g. men/apparel/shoes (3 levels) might be as valid as men/apparel/shoes/winter (4 levels). The current **category_name** column has over 1000 values, but the data may be better represented by multiple columns with (hopefully) 10-20 categories each (where each column represents a certain tree depth). We start by figuring out the depth.

Interestingly, it turns out that almost all **category_name**s have a depth of 3 (2 delimiters = 3 levels), while the maximum depth is 5.

In [None]:
# Note: this counts the '/' delimiters, so the number of levels is category_depth + 1.
train_df['category_depth'] = train_df['category_name'].map(lambda s: s.count('/'))

grp_depth = train_df.groupby('category_depth')
grp_depth['train_id'].count()


Next, we determine how many unique values we have in each level. Unfortunately, it turns out that we still have a rather large number of unique values, especially in the 2nd level. Nevertheless, we use the groupings to get some intuitions about the overall price distribution. Unsurprisingly, homemade products fetch the lowest prices while electronics come out at the top.

In [None]:
train_df.category_name[0].split('/')

def get_level_value(cn, target_depth):
    """Return the category component at target_depth, or NaN if cn is shallower."""
    components = cn.split('/')
    if target_depth < len(components):
        return components[target_depth]
    return np.nan


print('Number of unique values')
for cat_depth in range(0, train_df['category_depth'].max()+1):
    col_name = 'cat_lvl' + str(cat_depth)
    train_df[col_name] = train_df['category_name'].map(lambda s: get_level_value(s,cat_depth))
    test_df[col_name] = test_df['category_name'].map(lambda s: get_level_value(s,cat_depth))
    print(col_name + ': '+ str(train_df[col_name].nunique()))

train_df.iloc[:,[3, 10, 11, 12, 13, 14]].head()
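As an alternative to mapping a helper over every row, pandas can split all levels in one vectorized call. This is a sketch on a few hypothetical category strings, not a replacement the kernel depends on.

```python
import pandas as pd

# Hypothetical category strings of different depths.
cats = pd.Series([
    'Men/Tops/T-shirts',
    'Electronics/Computers & Tablets/iPad/Tablet/eBook Readers',
    '$not available$',
])

# expand=True returns one column per level; shallower rows are padded with missing values.
levels = cats.str.split('/', expand=True)
levels.columns = ['cat_lvl' + str(i) for i in levels.columns]
print(levels)
print(levels.nunique())
```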

Before we proceed, we will consolidate **cat_lvl2**, **cat_lvl3** and **cat_lvl4** into one level (**cat_lvl2+**), as the last 2 columns have very few values anyway. We expect that the increase in the number of unique values in **cat_lvl2+** will be marginal over **cat_lvl2**. Actually, it turns out that there is no increase at all. We'll also label-encode **category_name** and the **cat_lvls**.

In [None]:
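# Join levels 2-4 into one column; missing deeper levels become empty strings.
# (The third term previously repeated cat_lvl3 instead of using cat_lvl4.)
train_df['cat_lvl2+'] = (train_df['cat_lvl2']
                         + '/' + train_df['cat_lvl3'].fillna('')
                         + '/' + train_df['cat_lvl4'].fillna(''))

test_df['cat_lvl2+'] = (test_df['cat_lvl2']
                        + '/' + test_df['cat_lvl3'].fillna('')
                        + '/' + test_df['cat_lvl4'].fillna(''))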

display(train_df.loc[train_df['cat_lvl4'].notnull(),['category_name', 'cat_lvl2','cat_lvl2+']].head())
print('#unique values in cat_lvl2: ' +  str(len(train_df['cat_lvl2'].unique())))
print('#unique values in cat_lvl2+: ' +  str(len(train_df['cat_lvl2+'].unique())))
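A similar consolidation can be obtained directly from the raw string by limiting the number of splits, so that everything after the second delimiter stays in one column. This is a sketch with hypothetical inputs; unlike the concatenation above, it does not append trailing '/' characters for shallow categories.

```python
import pandas as pd

# Hypothetical category strings of depth 3, 5 and 1.
cats = pd.Series(['Men/Tops/T-shirts', 'A/B/C/D/E', 'Solo'])

# n=2 performs at most two splits, so the third column keeps the remainder.
parts = cats.str.split('/', n=2, expand=True)
parts.columns = ['cat_lvl0', 'cat_lvl1', 'cat_lvl2+']
print(parts)
```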

In [None]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
to_encode = ['category_name', 'cat_lvl0', 'cat_lvl1', 'cat_lvl2', 'cat_lvl2+', 'cat_lvl3', 'cat_lvl4']


for col_name in to_encode:
    le.fit(list(train_df[col_name].fillna('')) + list(test_df[col_name].fillna('')))
    train_df[col_name + '_ENC'] = le.transform(train_df[col_name].fillna(''))
    test_df[col_name + '_ENC'] = le.transform(test_df[col_name].fillna(''))
    #train_df[[col_name, col_name + '_ENC']].head()
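The reason for fitting the encoder on train and test together is that both splits must share one vocabulary, otherwise the same category could receive different codes. The sketch below shows the same idea with a pandas categorical dtype instead of `LabelEncoder` (a swapped-in technique for illustration); the column values are made up.

```python
import pandas as pd

# Hypothetical category columns; test contains a value unseen in train.
train_col = pd.Series(['Men', 'Women', None, 'Kids'])
test_col = pd.Series(['Women', 'Beauty', 'Men'])

# Build one shared, sorted vocabulary from both splits.
vocab = pd.concat([train_col, test_col]).fillna('').unique()
shared_dtype = pd.CategoricalDtype(categories=sorted(vocab))

# Identical strings now map to identical integer codes in both splits.
train_codes = train_col.fillna('').astype(shared_dtype).cat.codes
test_codes = test_col.fillna('').astype(shared_dtype).cat.codes
print(train_codes.tolist(), test_codes.tolist())
```

The integer codes carry an arbitrary order. Tree-based models usually cope with that, whereas linear models would typically need dummy variables instead.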

    


It is somewhat questionable whether our last exercise has added any value. A t-SNE embedding (commented out below) could help to check this.

In [None]:
# from sklearn.manifold import TSNE
# from sklearn.preprocessing import StandardScaler

# n_samples = 10000

# def perform_TSNE_and_map(df):
#     df_no_price = df.drop(['price','log_price'],axis=1)
#     scaled_data = StandardScaler().fit_transform(df_no_price)
#     embedded_data = TSNE(n_components = 2).fit_transform(scaled_data)

#     plt.figure(figsize = [15, 5])
#     plt.subplot(1,3,1)
#     plt.scatter(embedded_data[:,0],embedded_data[:,1],c = df['shipping'], s= 1)
#     plt.title('shipping')

#     plt.subplot(1,3,2)
#     plt.scatter(embedded_data[:,0],embedded_data[:,1],c = df['item_condition_id'], s= 1)
#     plt.title('item_condition_id')

#     plt.subplot(1,3,3)
#     plt.scatter(embedded_data[:,0],embedded_data[:,1],c = df['log_price'], s=1)
#     plt.title('log_price')

# # df_sample = train_df[['item_condition_id', 'price', 'log_price', 'shipping', 'category_name_ENC']].sample(n_samples)
# # perform_TSNE_and_map(df_sample)



In [None]:
# n_samples = 10000
# df_sample = train_df[['item_condition_id', 'price', 'log_price', 'shipping', 
#                       'cat_lvl0_ENC','cat_lvl1_ENC']].sample(n_samples)
# perform_TSNE_and_map(df_sample)

In [None]:
train_df.head()

dummy encoding brands

In [None]:
plt.figure()
grp_brand.count()['price'].sort_values().cumsum().plot()

plt.figure()
# Select the price column before aggregating; mean() over text columns fails in newer pandas.
grp_brand['price'].mean().sort_values().cumsum().plot()


more engineering

In [None]:
# nlp stuff
train_df['item_description'] = train_df['item_description'].fillna('')
train_df['description_length'] = train_df['item_description'].map(len)

test_df['item_description'] = test_df['item_description'].fillna('')
test_df['description_length'] = test_df['item_description'].map(len)

In [None]:
# testing dummy encoding
train_df = pd.get_dummies(train_df,columns=['cat_lvl0_ENC'])
test_df = pd.get_dummies(test_df,columns=['cat_lvl0_ENC'])
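When dummies are created separately for train and test, the two frames can end up with different columns if a category is absent from one split. A minimal sketch of one way to guard against this, using made-up frames, is to reindex the test frame to the training columns.

```python
import pandas as pd

# Hypothetical frames: category 1 never occurs in the test split.
train_toy = pd.DataFrame({'cat': [0, 1, 2], 'x': [1.0, 2.0, 3.0]})
test_toy = pd.DataFrame({'cat': [0, 2, 2], 'x': [4.0, 5.0, 6.0]})

train_d = pd.get_dummies(train_toy, columns=['cat'])
test_d = pd.get_dummies(test_toy, columns=['cat'])

# Align test to the training columns; dummies missing in test are filled with 0.
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_d.columns))
```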

In [None]:
#brand stuff
train_df['is_brand'] = train_df['brand_name'] != '$not available$'
test_df['is_brand'] = test_df['brand_name'] != '$not available$'

params for xgb taken from https://www.kaggle.com/maheshdadhich/i-will-sell-everything-for-free-0-55

In [None]:
train_df.head()

In [None]:
features_to_drop = ['train_id', 'name', 'category_name', 'brand_name',
       'price', 'item_description', 'log_price', 'category_depth',
       'cat_lvl0', 'cat_lvl1', 'cat_lvl2', 'cat_lvl3', 'cat_lvl4', 'cat_lvl2+',
       'category_name_ENC', 'cat_lvl2_ENC', 
       'cat_lvl3_ENC', 'cat_lvl4_ENC']

feature_names = np.setdiff1d(train_df.columns,features_to_drop)
feature_names


In [None]:


train=train_df.copy()
test=test_df.copy()

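# Note: the MinMax-scaled target is computed but not used; y is the plain log price.
mm_scaler = preprocessing.MinMaxScaler(feature_range=(-1,1))
y_scaled = mm_scaler.fit_transform(train['log_price'].values.reshape(-1,1))
y = train['log_price']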

# XGBoost
import xgboost as xgb

xgb_par = {'min_child_weight': 20, 'eta': 0.05, 'colsample_bytree': 0.5, 'max_depth': 15,
            'subsample': 0.9, 'lambda': 2.0, 'nthread': -1, 'booster' : 'gbtree', 'silent': 1,
            'eval_metric': 'rmse', 'objective': 'reg:linear'}


# LightGBM (adapted from 'https://www.kaggle.com/infinitewing/lightgbm-example')
import lightgbm as lgb

lgb_par = {'n_estimators': 900, 'learning_rate': 0.15, 'max_depth': 9, 'application': 'regression',
               'num_leaves': 256, 'subsample': 0.9, 'colsample_bytree': 0.8,
               'min_child_samples': 50, 'n_jobs': 3, 'metric':'RMSE'}

# hold-out validation split (80/20), not a full cross validation
from sklearn.model_selection import train_test_split
Xtr, Xv, ytr, yv = train_test_split(train[feature_names].values, y, test_size=0.2, random_state=520)

dtrain_xgb = xgb.DMatrix(Xtr, label=ytr)
dvalid_xgb = xgb.DMatrix(Xv, label=yv)
dtest_xgb = xgb.DMatrix(test[feature_names].values)
watchlist = [(dtrain_xgb, 'train'), (dvalid_xgb, 'valid')]
model_xgb = xgb.train(xgb_par, dtrain_xgb, 80, watchlist, early_stopping_rounds=30, maximize=False, verbose_eval=20)
# print('Modeling RMSLE by XGB %.5f' % model_xgb.best_score)

# lightGBM
dtrain_lgb = lgb.Dataset(Xtr, label=ytr, max_bin=8192)
dvalid_lgb = lgb.Dataset(Xv, label=yv, max_bin=8192)
dtest_lgb = lgb.Dataset(test[feature_names].values)
watchlist = [dtrain_lgb, dvalid_lgb]
model_lgb = lgb.train(lgb_par, train_set=dtrain_lgb, num_boost_round=200, valid_sets=watchlist, \
    early_stopping_rounds=30, verbose_eval=20) 

In [None]:
# feature_names = ['item_condition_id', 'shipping', 
#                  'cat_lvl0_ENC','cat_lvl1_ENC','cat_lvl2+_ENC']

# train=train_df.copy()
# test=test_df.copy()
# y = train['log_price']

# from sklearn.model_selection import train_test_split
# import xgboost as xgb
# Xtr, Xv, ytr, yv = train_test_split(train[feature_names].values, y, test_size=0.2, random_state=520)
# dtrain = xgb.DMatrix(Xtr, label=ytr)
# dvalid = xgb.DMatrix(Xv, label=yv)
# dtest = xgb.DMatrix(test[feature_names].values)
# watchlist = [(dtrain, 'train'), (dvalid, 'valid')]

# xgb_par = {'min_child_weight': 20, 'eta': 0.05, 'colsample_bytree': 0.5, 'max_depth': 15,
#             'subsample': 0.9, 'lambda': 2.0, 'nthread': -1, 'booster' : 'gbtree', 'silent': 1,
#             'eval_metric': 'rmse', 'objective': 'reg:linear'}

# model_1 = xgb.train(xgb_par, dtrain, 80, watchlist, early_stopping_rounds=20, maximize=False, verbose_eval=20)
# print('Modeling RMSLE %.5f' % model_1.best_score)

# feature_names = ['item_condition_id', 'shipping', 
#                  'category_name_ENC']

# train=train_df.copy()
# test=test_df.copy()
# y = train['log_price']

# from sklearn.model_selection import train_test_split
# import xgboost as xgb
# Xtr, Xv, ytr, yv = train_test_split(train[feature_names].values, y, test_size=0.2, random_state=520)
# dtrain = xgb.DMatrix(Xtr, label=ytr)
# dvalid = xgb.DMatrix(Xv, label=yv)
# dtest = xgb.DMatrix(test[feature_names].values)
# watchlist = [(dtrain, 'train'), (dvalid, 'valid')]

# xgb_par = {'min_child_weight': 20, 'eta': 0.05, 'colsample_bytree': 0.5, 'max_depth': 15,
#             'subsample': 0.9, 'lambda': 2.0, 'nthread': -1, 'booster' : 'gbtree', 'silent': 1,
#             'eval_metric': 'rmse', 'objective': 'reg:linear'}

# model_1 = xgb.train(xgb_par, dtrain, 80, watchlist, early_stopping_rounds=20, maximize=False, verbose_eval=20)
# print('Modeling RMSLE %.5f' % model_1.best_score)


In [None]:
# prediction
yvalid = 0.6*model_xgb.predict(dvalid_xgb) + 0.4*model_lgb.predict(Xv)
ytest = 0.6*model_xgb.predict(dtest_xgb) + 0.4*model_lgb.predict(test[feature_names].values)
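Since the target is log(price + 1), the RMSE on this scale corresponds to the RMSLE on the original prices. The blend weights can therefore be compared on the validation split with a plain RMSE. The sketch below uses made-up predictions as stand-ins for the two models; the helper `rmse` is defined here and is not part of either library.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

# Hypothetical validation targets on the log(price + 1) scale.
y_true = np.log1p(np.array([10.0, 20.0, 35.0, 8.0]))
pred_a = y_true + np.array([0.2, -0.1, 0.3, -0.2])   # stand-in for model A
pred_b = y_true + np.array([-0.2, 0.1, -0.1, 0.4])   # stand-in for model B

# Compare a few blend weights on the validation data.
for w in (0.0, 0.4, 0.6, 1.0):
    blend = w * pred_a + (1 - w) * pred_b
    print(f'weight on A = {w:.1f}: RMSE = {rmse(y_true, blend):.4f}')
```

A blend helps most when the errors of the two models are only weakly correlated, as in this toy case.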

In [None]:
fig, ax = plt.subplots(nrows=2, sharex=True, sharey=True)
sns.distplot(yvalid, ax=ax[0], color='blue', label='Validation')
sns.distplot(ytest, ax=ax[1], color='green', label='Test')
ax[0].legend(loc=0)
ax[1].legend(loc=0)
plt.show()

In [None]:
# submission
if test.shape[0] == ytest.shape[0]:
    print('Test shape OK.') 
test['price'] = np.exp(ytest) - 1
test[['test_id', 'price']].to_csv('mercari_prototype2.csv', index=False)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
text_data = train_df.item_description

tfidf = TfidfVectorizer()
tfidf.fit(text_data) #TODO add test data
transformed = tfidf.transform(text_data)

In [None]:
print(transformed[1:3,:])
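The TODO above (adding the test data) could be addressed by fitting the vectorizer on the descriptions of both splits, so that words occurring only in the test set are part of the vocabulary. This is a sketch on made-up descriptions. Whether unlabelled test text should influence the vocabulary and IDF weights is a design choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical item descriptions.
train_text = ['brand new with tags', 'gently used, no box']
test_text = ['new in box', 'vintage leather bag']

# Fit once on both splits, then transform each split separately.
tfidf = TfidfVectorizer()
tfidf.fit(train_text + test_text)

train_tfidf = tfidf.transform(train_text)
test_tfidf = tfidf.transform(test_text)
print(train_tfidf.shape, test_tfidf.shape)
```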