# Mercari Price Suggestion
This project builds an algorithm that suggests product prices for an e-commerce company, based on the information that sellers provide.

## Benchmark Model
* Convert each item description to a term-document matrix;
* Apply linear regression to fit the independent variables, including the term-document matrix, to the dependent variable, the price.

Because the independent variables have more than 1,000 dimensions, the plain linear regression model converges very slowly. I use the Lasso linear regression model instead, which is trained with an L1 prior as the regularizer.
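
The two benchmark steps can be sketched on a toy example. This is only an illustration: the descriptions, prices, and the names `descriptions`, `X_toy`, and `toy_model` are made up here and do not come from `train.tsv`, and `alpha=0.01` is an arbitrary choice.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Lasso

# Hypothetical toy data, not taken from train.tsv.
descriptions = [
    "new leather wallet",
    "used leather wallet",
    "new gaming keyboard",
    "used gaming keyboard",
]
prices = np.array([20.0, 10.0, 60.0, 40.0])

# Term-document matrix: one row per description, one column per term,
# entries are term counts (stored as a sparse matrix).
vectorizer = CountVectorizer()
X_toy = vectorizer.fit_transform(descriptions)

# Lasso is linear regression with an L1 penalty of strength alpha;
# the penalty can shrink some coefficients to exactly zero.
toy_model = Lasso(alpha=0.01, max_iter=10000)
toy_model.fit(X_toy, np.log1p(prices))

n_terms = X_toy.shape[1]
n_used = int(np.count_nonzero(toy_model.coef_))
print(f"{n_used} of {n_terms} terms have a non-zero coefficient")
```

With thousands of text features, this sparsity is what makes Lasso a convenient substitute for unregularized linear regression.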

In [16]:
import numpy as np
import pandas as pd
import math
%matplotlib inline

A function to calculate the Root Mean Squared Logarithmic Error (RMSLE)

In [17]:
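def rmsle(y, y_pred):
    # Root Mean Squared Logarithmic Error between true values y and predictions y_pred
    assert len(y) == len(y_pred)
    # Squared difference of log(1 + value) for each prediction/target pair
    to_sum = [(math.log(y_pred[i] + 1) - math.log(y[i] + 1)) ** 2.0 for i in range(len(y_pred))]
    return (sum(to_sum) * (1.0/len(y))) ** 0.5
#Source: https://www.kaggle.com/marknagelberg/rmsle-function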
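
For large arrays, the same quantity, the square root of the mean of (log(1 + y_pred) - log(1 + y))^2, can also be computed without a Python loop. The helper below is a sketch; `rmsle_vectorized` is a hypothetical name and is not used elsewhere in this notebook.

```python
import numpy as np

def rmsle_vectorized(y, y_pred):
    """Vectorized RMSLE (hypothetical helper, not part of the original notebook)."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    assert y.shape == y_pred.shape
    # log1p(x) computes log(1 + x) element-wise
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y)) ** 2)))

# Identical predictions and targets give an error of zero.
print(rmsle_vectorized([10.0, 52.0], [10.0, 52.0]))
```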

Load the dataset

In [18]:
data = pd.read_table("train.tsv")
display(data.head(n=2))
print(data.shape)

Unnamed: 0,train_id,name,item_condition_id,category_name,brand_name,price,shipping,item_description
0,0,MLB Cincinnati Reds T Shirt Size XL,3,Men/Tops/T-shirts,,10.0,1,No description yet
1,1,Razer BlackWidow Chroma Keyboard,3,Electronics/Computers & Tablets/Components & P...,Razer,52.0,0,This keyboard is in great condition and works ...


(100001, 8)


Log-transform the price.

In [19]:
y = np.log1p(data["price"])
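
The transform matters for the metric: the ordinary root mean squared error measured on `log1p(price)` is the same number as the RMSLE on the raw price, and `np.expm1` undoes the transform. A small check with illustrative values (the arrays below are made up, not taken from the dataset):

```python
import numpy as np

prices = np.array([0.0, 10.0, 52.0])      # illustrative true prices
predicted = np.array([1.0, 12.0, 40.0])   # illustrative predictions

log_prices = np.log1p(prices)             # log(1 + price), defined at price = 0
recovered = np.expm1(log_prices)          # exp(x) - 1, the inverse transform

# RMSE in log space ...
rmse_log_space = np.sqrt(np.mean((np.log1p(predicted) - log_prices) ** 2))
# ... equals RMSLE computed from the raw values.
rmsle_raw = np.sqrt(np.mean((np.log(predicted + 1) - np.log(prices + 1)) ** 2))

print(np.allclose(recovered, prices), np.isclose(rmse_log_space, rmsle_raw))
```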

- Handle missing data;
- Limit the brand names and categories to the most frequent ones;
- Convert category_name, brand_name, and item_condition_id to categorical data.

In [20]:
NUM_BRANDS = 1000
NUM_CATEGORIES = 500

def handle_missing_inplace(dataset):
    # Assign the filled columns back; this still modifies `dataset` in place
    dataset['category_name'] = dataset['category_name'].fillna('missing')
    dataset['brand_name'] = dataset['brand_name'].fillna('missing')
    dataset['item_description'] = dataset['item_description'].fillna('missing')

def cutting(dataset):
    # Keep the NUM_BRANDS most frequent brands; map all others to 'missing'
    pop_brand = dataset['brand_name'].value_counts().loc[lambda x: x.index != 'missing'].index[:NUM_BRANDS]
    dataset.loc[~dataset['brand_name'].isin(pop_brand), 'brand_name'] = 'missing'
    # Keep the NUM_CATEGORIES most frequent categories; map all others to 'missing'
    pop_category = dataset['category_name'].value_counts().loc[lambda x: x.index != 'missing'].index[:NUM_CATEGORIES]
    dataset.loc[~dataset['category_name'].isin(pop_category), 'category_name'] = 'missing'


def to_categorical(dataset):
    dataset['category_name'] = dataset['category_name'].astype('category')
    dataset['brand_name'] = dataset['brand_name'].astype('category')
    dataset['item_condition_id'] = dataset['item_condition_id'].astype('category')

handle_missing_inplace(data)
print('Finished to handle missing')

cutting(data)
print('Finished to cut')

to_categorical(data)
print('Finished to convert categorical')

Finished to handle missing
Finished to cut
Finished to convert categorical
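
To make the cutting step concrete, here is the same top-N logic on a tiny made-up frame. `toy` and `TOP_N` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Hypothetical toy data: Nike appears 3 times, Razer 2, Apple 1.
toy = pd.DataFrame(
    {"brand_name": ["Nike", "Nike", "Razer", "Apple", "Nike", "Razer", "missing"]}
)
TOP_N = 2  # keep only the two most frequent real brands

# value_counts() sorts by frequency; drop the 'missing' placeholder before ranking.
top_brands = (
    toy["brand_name"].value_counts().loc[lambda s: s.index != "missing"].index[:TOP_N]
)
# Every brand outside the top N is replaced by 'missing'.
toy.loc[~toy["brand_name"].isin(top_brands), "brand_name"] = "missing"

print(toy["brand_name"].tolist())
# ['Nike', 'Nike', 'Razer', 'missing', 'Nike', 'Razer', 'missing']
```

Rare brands and categories therefore share a single level, which keeps the number of one-hot columns bounded.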


Process data

In [21]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
from scipy.sparse import csr_matrix, hstack

NAME_MIN_DF = 10
MAX_FEATURES_ITEM_DESCRIPTION = 20000

cv = CountVectorizer(min_df=NAME_MIN_DF)
X_name = cv.fit_transform(data['name'])
print('Finished count vectorize `name`')
 
cv = CountVectorizer()
X_category = cv.fit_transform(data['category_name'])
print('Finished count vectorize `category_name`')
 
tv = TfidfVectorizer(max_features=MAX_FEATURES_ITEM_DESCRIPTION,
                         ngram_range=(1, 1),
                         stop_words='english')
X_description = tv.fit_transform(data['item_description'])
print('Finished TFIDF vectorize `item_description`')
 
lb = LabelBinarizer(sparse_output=True)
X_brand = lb.fit_transform(data['brand_name'])
print('Finished label binarize `brand_name`')
 
X_dummies = csr_matrix(pd.get_dummies(data[['item_condition_id', 'shipping']],
                                          sparse=True).values)
print('Finished to get dummies on `item_condition_id` and `shipping`')
 
sparse_merge = hstack((X_dummies, X_description, X_brand, X_category, X_name)).tocsr()
print('Finished to create sparse merge')

Finished count vectorize `name`
Finished count vectorize `category_name`
Finished TFIDF vectorize `item_description`
Finished label binarize `brand_name`
Finished to get dummies on `item_condition_id` and `shipping`
Finished to create sparse merge
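
The merge step stacks the per-column feature matrices side by side, so the final width is the sum of the individual widths. A minimal sketch with made-up inputs (`names`, `brands`, and the `*_toy` variables are hypothetical):

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelBinarizer

# Hypothetical toy columns, three items each.
names = ["red shirt", "blue shirt", "red keyboard"]
brands = ["Nike", "missing", "Razer"]

# Bag-of-words counts: vocabulary is blue, keyboard, red, shirt (4 columns).
X_name_toy = CountVectorizer().fit_transform(names)
# One-hot encoding of three distinct brand labels (3 columns).
X_brand_toy = LabelBinarizer(sparse_output=True).fit_transform(brands)

# hstack concatenates columns; tocsr() gives a row-sliceable sparse format.
merged = hstack((X_name_toy, X_brand_toy)).tocsr()
print(merged.shape)  # (3, 7)
```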


Split the data into training and test sets.

In [22]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(sparse_merge, y, test_size=0.1, random_state=99)
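
`train_test_split` accepts sparse matrices directly and keeps rows and targets aligned. A small sketch on random data (the matrix and target below are synthetic stand-ins, not the real features):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for sparse_merge and y: 100 rows, 5 features.
X_toy = sparse_random(100, 5, density=0.2, format="csr", random_state=0)
y_toy = np.arange(100)

# test_size=0.1 holds out 10% of the rows; random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=99
)
print(X_tr.shape, X_te.shape)  # (90, 5) (10, 5)
```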


## Lasso Regression

In [23]:
from sklearn.linear_model import Lasso
model = Lasso(alpha=0.1, fit_intercept=True, max_iter=10000,
   precompute=False, random_state=666,
   selection='random', tol=0.00001)
model.fit(X_train, y_train)
print('Finished to train Lasso')

# Predict in log space, then invert the log1p transform before scoring
y_pred = model.predict(X=X_test)
print('Finished to predict Lasso')
y_pred = np.expm1(y_pred)
y_true = np.expm1(y_test)
v_rmsle = rmsle(y_true.values, y_pred)
print("Lasso- RMSLE error on test dataset: "+str(v_rmsle))

Finished to train Lasso
Finished to predict Lasso
Lasso- RMSLE error on test dataset: 0.7549523139317528
