# Mercari Price Suggestion Challenge (Model development)
This competition is hosted by [Mercari](https://www.mercari.com/), Japan’s biggest community-powered shopping app. It provides a hassle-free and secure way for anyone to buy and sell items straight from a mobile device.

In this competition, we are asked to build an algorithm to predict the sale price of a product based on information a user provides for this product. The schema of the data is as follows:
 * train_id or test_id - the id of the listing
 * name - the title of the listing. Note that we have cleaned the data to remove text that looks like prices (e.g. \$20) to avoid leakage. These removed prices are represented as [rm]
 * item_condition_id - the condition of the item as provided by the seller
 * category_name - category of the listing
 * brand_name
 * price - the price that the item was sold for, in USD. This is the target variable that you will predict, so this column doesn't exist in test.tsv.
 * shipping - 1 if shipping fee is paid by seller and 0 by buyer
 * item_description - the full description of the item. Note that we have cleaned the data to remove text that looks like prices (e.g. \$20) to avoid leakage. These removed prices are represented as [rm]

In this notebook, we develop a model to predict the sale price.

In [1]:
import pandas as pd
import numpy as np
import re
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score, mean_squared_log_error
from sklearn.preprocessing import PolynomialFeatures, FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction import text

import matplotlib.pyplot as plt
%matplotlib inline

### Load the training data into a Pandas dataframe

In [2]:
df = pd.read_csv('../input/train.tsv', sep='\t')
df = df.drop(['train_id'], axis=1)
df.head(2)

| | name | item_condition_id | category_name | brand_name | price | shipping | item_description |
|---|---|---|---|---|---|---|---|
| 0 | MLB Cincinnati Reds T Shirt Size XL | 3 | Men/Tops/T-shirts | NaN | 10.0 | 1 | No description yet |
| 1 | Razer BlackWidow Chroma Keyboard | 3 | Electronics/Computers & Tablets/Components & P... | Razer | 52.0 | 0 | This keyboard is in great condition and works ... |


In [3]:
# We will work with a smaller subset for code development and testing
df = df.sample(frac=0.05)
print (len(df))
df.head(2)

74127


| | name | item_condition_id | category_name | brand_name | price | shipping | item_description |
|---|---|---|---|---|---|---|---|
| 969957 | J.CREW Drawstring dress in pebble dot | 1 | Women/Dresses/Above Knee, Mini | J. Crew | 20.0 | 1 | J.CREW Drawstring dress in pebble dot SIZE : M... |
| 41192 | Reserved | 1 | Beauty/Tools & Accessories/Makeup Brushes & Tools | NaN | 27.0 | 0 | By Sigma 2 brushes Deluxe sigma mat and shampoo |


### Check for missing values

In [4]:
print ('Total number of rows in training set: {:d}'.format(len(df)))

print ('\nNumber of missing values')
df.isnull().sum(axis=0)

Total number of rows in training set: 74127

Number of missing values


name                     0
item_condition_id        0
category_name          291
brand_name           31623
price                    0
shipping                 0
item_description         0
dtype: int64

### Fill missing values
 * A missing category name becomes 'Other'
 * A missing brand name becomes 'Unknown'
 * A missing item description becomes 'No description yet'

In [5]:
class MissingValuesHandler(BaseEstimator, TransformerMixin):
    # Fills missing values in the given columns with the given replacement values
    
    def __init__(self, col_name_replacevalue_tuples):
        self.col_name_replacevalue_tuples = col_name_replacevalue_tuples
    
    def fit(self, df, y=None):
        return self
    
    def transform(self, df):
        # Work on a copy so that the caller's dataframe (possibly a slice) is not modified
        df = df.copy()
        for (col, val) in self.col_name_replacevalue_tuples:
            df[col] = df[col].fillna(val)
        return df

In [6]:
missing_values_handler = MissingValuesHandler([('category_name', 'Other'), 
                                               ('brand_name', 'Unknown'), 
                                               ('item_description', 'No description yet')]
                                             )
df = missing_values_handler.fit_transform(df)
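
The fill rules listed above can also be tried out on a small frame. The following is a minimal sketch, assuming a made-up three-column toy frame; it uses `DataFrame.fillna` with a dict instead of the custom transformer, which gives the same replacements.

```python
import numpy as np
import pandas as pd

# Toy listings (made-up values) with gaps in the three text columns.
toy = pd.DataFrame({
    'category_name': ['Men/Tops/T-shirts', np.nan],
    'brand_name': [np.nan, 'Razer'],
    'item_description': ['No description yet', np.nan],
})

# Same replacement values as in the list above, expressed as a dict.
fill_values = {
    'category_name': 'Other',
    'brand_name': 'Unknown',
    'item_description': 'No description yet',
}

# fillna returns a new frame, so the toy input is left untouched.
filled = toy.fillna(fill_values)
print(filled)
```

Because `fillna` returns a new frame, the input keeps its missing values, which makes it easy to compare before and after.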

### ColumnSelectTransformer

In [7]:
class ColumnSelectTransformer(BaseEstimator, TransformerMixin):
    # Extracts the given columns from the input dataframe and returns them as a numpy array
    
    def __init__(self, col_names):
        self.col_names = col_names
    
    def fit(self, df, y=None):
        return self
    
    def transform(self, df):
        return df[self.col_names].values
    
    def get_feature_names(self):
        return self.col_names
    

#ColumnSelectTransformer(['item_condition_id', 'brand_name', 'shipping']).fit_transform(df.iloc[:2])

### Categories
Each item is hierarchically classified into 3 subcategories, for example 'Men/Sweats & Hoodies/Hoodie'. We flatten the category path into a cleaned, lowercase string of tokens and use the token counts as features in our predictive model. The main category is stored in a separate column called 'cat_1'. There are 11 unique main categories.

In [8]:
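# Matches runs of characters that are neither whitespace nor alphanumeric (underscores included)
pattern_alphanumeric = re.compile(r'([^\s\w]|_)+')

class CategoriesProcessor(BaseEstimator, TransformerMixin):
    # Cleans the category column in place and adds the main category as 'cat_1'.
    # Later cells rely on 'cat_1' being added to the input dataframe.
    
    def __init__(self, cat_col):
        self.cat_col = cat_col
    
    def fit(self, df, y=None):
        return self
    
    def transform(self, df):
        df[self.cat_col] = df[self.cat_col].apply(self.parse_categories_line)
        # The first token of the cleaned string identifies the main category (e.g. 'men')
        df['cat_1'] = df[self.cat_col].apply(lambda x: x.split()[0] if x.split() else '')
        return df
        
    def parse_categories_line(self, line):
        try:
            # Keep at most the first three category levels
            cats = ' '.join(line.split('/')[:3])
            # Drop hyphens so that e.g. 't-shirts' becomes 'tshirts'
            cats = re.sub('-', '', cats)
            # Convert to lowercase
            cats = cats.lower()
            # Replace everything except alphanumeric characters and whitespace
            cats = pattern_alphanumeric.sub(' ', cats)
            return cats
        except AttributeError:
            # Missing (non-string) category
            return 'other'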

#categories_processor = CategoriesProcessor('category_name').fit_transform(df).iloc[:10]

In [9]:
cat_pipe = Pipeline([('categories_processor', CategoriesProcessor('category_name')),
                     ('col_selctor', ColumnSelectTransformer('category_name')),                     
                     ('CountVectorizer', CountVectorizer())
                    ])
cat_features = cat_pipe.fit_transform(df)
cat_features

<74127x834 sparse matrix of type '<class 'numpy.int64'>'
	with 292290 stored elements in Compressed Sparse Row format>

In [10]:
#cat_pipe.named_steps['CountVectorizer'].get_feature_names()
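
To make the category encoding concrete, here is a minimal sketch on three made-up category strings. The helper `clean_category` is a hypothetical standalone version of the cleaning steps (first three levels, hyphens dropped, lowercase, punctuation replaced by spaces); it is not the class used in the pipeline.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def clean_category(line):
    # Hypothetical helper mirroring the cleaning steps described above.
    cats = ' '.join(line.split('/')[:3])      # keep the first three levels
    cats = cats.replace('-', '').lower()      # 'T-shirts' -> 'tshirts'
    return re.sub(r'([^\s\w]|_)+', ' ', cats)  # punctuation -> space

raw = ['Men/Tops/T-shirts', 'Women/Dresses/Above Knee, Mini', 'Other']
cleaned = [clean_category(c) for c in raw]

# Each distinct token becomes one column of the sparse count matrix.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(cleaned)
vocabulary = sorted(vectorizer.vocabulary_)

print(cleaned[0])
print(vocabulary)
print(counts.shape)
```

Note that multi-word levels such as 'Above Knee, Mini' are split into separate tokens, so the encoding is a bag of category words rather than one column per full category path.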

### Item description tf-idf
Now we generate tf-idf features using 'item_description' column.

In [11]:
pattern_alphanumeric = re.compile(r'([^\s\w]|_)+')
pattern_numeric = re.compile(r'[0-9]+')
pattern_rem_multi_spaces = re.compile(r'\s\s+')


class TextCleaner(BaseEstimator, TransformerMixin):
    # Concatenates the given text columns and returns the cleaned strings as a numpy array
    
    def __init__(self, columns):
        self.columns = columns
        
    def fit(self, df, y=None):
        return self
    
    def transform(self, df):
        # Build a new Series so that the input dataframe is not modified
        result = df[self.columns[0]].astype(str)
        for col in self.columns[1:]:
            result = result + " " + df[col].astype(str)
        return result.apply(self.string_cleanup).values
        
    def string_cleanup(self, string):
        # Remove leading and trailing whitespace
        string = string.strip()
        # Convert to lowercase
        string = string.lower()
        # Replace everything except alphanumeric characters and whitespace
        string = pattern_alphanumeric.sub(' ', string)
        # Remove numbers
        string = pattern_numeric.sub('', string).strip()
        # Replace multiple whitespace characters by a single space
        string = pattern_rem_multi_spaces.sub(' ', string)
        return string
    
    
#TextCleaner(['item_description', 'name']).fit_transform(df.iloc[:2])
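
As an illustration of the cleanup steps, the sketch below applies the same regular expressions to one made-up description. The function name `clean` and the sample string are assumptions for this example only.

```python
import re

PUNCT = re.compile(r'([^\s\w]|_)+')   # punctuation and underscores
DIGITS = re.compile(r'[0-9]+')        # runs of digits
SPACES = re.compile(r'\s\s+')         # two or more whitespace characters

def clean(text):
    # Hypothetical standalone version of the cleanup steps.
    text = text.strip().lower()
    text = PUNCT.sub(' ', text)
    text = DIGITS.sub('', text).strip()
    return SPACES.sub(' ', text)

sample = '  Brand NEW iPhone 6s case, only [rm]!! '
cleaned = clean(sample)
print(cleaned)
```

Two side effects are worth noting: the price placeholder '[rm]' survives as the token 'rm', and removing digits can leave short fragments behind (here '6s' becomes 's').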

In [12]:
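# '[rm]' placeholders become 'rm' after cleaning, so treat 'rm' as a stop word.
# A list is accepted across scikit-learn versions.
stop_words = list(text.ENGLISH_STOP_WORDS.union(["rm"]))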

tfidf_vectorizer = TfidfVectorizer(max_features = 25000, 
                                   ngram_range = (1,3),
                                   stop_words = stop_words)

description_pipe = Pipeline([('text_cleaner', TextCleaner(['item_description'])),
                             ('vectorizer', tfidf_vectorizer)
                            ])

#description_features = description_pipe.fit_transform(df, y=None)
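
The effect of the extra stop word and of word n-grams can be checked on a few made-up descriptions. This is a minimal sketch, assuming three toy documents and a smaller `ngram_range` than the pipeline uses.

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# English stop words plus the price placeholder left over after cleaning.
stop_words = list(text.ENGLISH_STOP_WORDS.union(['rm']))

docs = ['new leather wallet rm', 'leather wallet used', 'new phone case']

# Unigrams and bigrams only, to keep the toy vocabulary small.
vec = TfidfVectorizer(ngram_range=(1, 2), stop_words=stop_words)
X = vec.fit_transform(docs)
vocab = vec.vocabulary_

print(sorted(vocab))
print(X.shape)
```

Stop words are removed before n-grams are built, so 'rm' does not appear on its own or inside any bigram.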

### Name tf-idf
Now we generate tf-idf features using 'name' column.

In [13]:
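# '[rm]' placeholders become 'rm' after cleaning, so treat 'rm' as a stop word.
# A list is accepted across scikit-learn versions.
stop_words = list(text.ENGLISH_STOP_WORDS.union(["rm"]))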

tfidf_vectorizer = TfidfVectorizer(max_features = 25000, 
                                   ngram_range = (1,3),
                                   stop_words = stop_words)

name_pipe = Pipeline([('text_cleaner', TextCleaner(['name'])),
                      ('vectorizer', tfidf_vectorizer)
                     ])

#name_features = name_pipe.fit_transform(df, y=None)
#name_features

### Brand
Encode the 'brand_name' column as a matrix of token counts.

In [14]:
print("Brand encoder")
brand_pipe = Pipeline([('col_selctor', ColumnSelectTransformer('brand_name')),
                       ('CountVectorizer', CountVectorizer(max_features=2500))    
                      ])
#brand_features = brand_pipe.fit_transform(df)
#brand_features

Brand encoder


### Item_condition_id
One-hot-encode 'item_condition_id' column. 

In [15]:
itemCond_pipe = Pipeline([('col_selctor', ColumnSelectTransformer(['item_condition_id'])),
                          ('vectorizer', OneHotEncoder())    
                         ])
#itemCond_features = itemCond_pipe.fit_transform(df)
#itemCond_features
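
A small sketch of what one-hot encoding does to the condition ids, using made-up values. The `handle_unknown='ignore'` setting is an assumption added here to show how an id not seen during fitting could be handled; the pipeline above uses the encoder's defaults.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One column of made-up condition ids.
conditions = np.array([[1], [3], [2], [3]])

# handle_unknown='ignore' encodes ids unseen during fit as all zeros.
encoder = OneHotEncoder(handle_unknown='ignore')
dense = encoder.fit_transform(conditions).toarray()

# An id that was not present when fitting.
unseen = encoder.transform(np.array([[5]])).toarray()

print(dense)
print(unseen)
```

Each distinct id gets its own column, so the model does not treat the ids as an ordered numeric scale.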

### Score brands by price
We calculate a score for each brand as follows. For each category-brand pair, we take the ratio of the median price of that brand's items in the category to the median price of all items in the category. The brand score is the average of these ratios over all categories in which the brand appears. Brands with fewer than cutoff_count listings, and pairs with a median price of 0, get a neutral ratio of 1.0.

In [16]:
class BrandScoreCalculator(BaseEstimator, TransformerMixin):
    # Scores each brand by comparing its median price with the median price of
    # the categories it is sold in. A score of 1.0 means "priced like the category".
    
    def __init__(self, brand_col, cat_col, price_col, cutoff_count=10):
        self.brand_col = brand_col
        self.cat_col = cat_col
        self.price_col = price_col 
        self.cutoff_count = cutoff_count
    
    def fit(self, df, y=None):
        # Median price of all items in each category
        self.median_cat_price = df.groupby(self.cat_col)[self.price_col].median().to_dict()
        # Number of listings per brand
        self.brand_counts = df[self.brand_col].value_counts().to_dict()
        
        self.brand_score = {k: 0 for k in df[self.brand_col].unique()}
        # Group by the column name (not a one-element list) so that keys are plain brand names
        for brand, df_brand in df.groupby(self.brand_col):
            # Median price of this brand within each category it appears in
            brand_cat_median = df_brand.groupby(self.cat_col)[self.price_col].median()
            ratios = [self.cat_price_ratio(cat, brand, price) for cat, price in brand_cat_median.items()]
            self.brand_score[brand] = np.mean(ratios)

        self.df_brand_score = pd.DataFrame(list(self.brand_score.items()), columns=['brand', 'brand_score'])
        return self
    
    def transform(self, df):
        return df[self.brand_col].apply(self.get_price_ratio)
        
    def cat_price_ratio(self, cat, brand, price):
        # Fall back to a neutral ratio for free items and for rarely seen brands
        if price == 0 or self.brand_counts[brand] < self.cutoff_count:
            return 1.0
        return price / self.median_cat_price[cat]

    def get_price_ratio(self, brand):
        # Brands not seen during fit get the neutral score
        return self.brand_score.get(brand, 1.0)
        
    def get_brand_scores(self):
        return self.df_brand_score.sort_values(by='brand_score', ascending=False)
    
    
#brand_score_calculator = BrandScoreCalculator('brand_name', 'cat_1', 'price', cutoff_count = 10)
#brand_score_calculator.fit(df, None)
#brand_score_calculator.transform(df.iloc[:10])
#brand_score_calculator.get_brand_scores()
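
The brand score can be worked through by hand on a tiny example. The sketch below uses six made-up listings and plain pandas group-bys; it leaves out the cutoff_count rule, so it only illustrates the ratio-and-average step.

```python
import pandas as pd

# Six made-up listings: two categories ('a', 'b') and two brands ('x', 'y').
toy = pd.DataFrame({
    'cat_1': ['a', 'a', 'a', 'a', 'b', 'b'],
    'brand_name': ['x', 'x', 'y', 'y', 'x', 'y'],
    'price': [20.0, 40.0, 10.0, 10.0, 30.0, 10.0],
})

# Median price per category: a -> 15.0, b -> 20.0
cat_median = toy.groupby('cat_1')['price'].median()

# Median price per (brand, category) pair.
pair_median = toy.groupby(['brand_name', 'cat_1'])['price'].median().reset_index()

# Ratio of the pair median to the category median.
pair_median['ratio'] = pair_median['price'] / pair_median['cat_1'].map(cat_median)

# Brand score: average ratio over the categories the brand appears in.
brand_score = pair_median.groupby('brand_name')['ratio'].mean()
print(brand_score)
```

Brand 'x' sells at 30/15 = 2.0 times the median in category 'a' and 30/20 = 1.5 times in 'b', so its score is 1.75. Brand 'y' ends up below 1.0, which marks it as cheaper than its categories.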

In [17]:
# Convert 1D numpy array to 2D numpy array with one column.
reshaper = FunctionTransformer(lambda X: X.values.reshape(-1,1),validate=False)

In [18]:
brand_score_pipe = Pipeline([('brand_score_calculator', BrandScoreCalculator('brand_name', 'cat_1', 'price', cutoff_count = 10)),
                            ('reshaper', reshaper),
                            #('poly', PolynomialFeatures(2))
                           ])
brand_score_features = brand_score_pipe.fit_transform(df)
#brand_score_features

### Feature union
Combine all features.

In [19]:
feature_union = FeatureUnion([
    ('itemCond_features', itemCond_pipe),
    ('shipping_features', ColumnSelectTransformer(['shipping'])),
    ('cat_features', cat_pipe),
    ('brand_features', brand_pipe),
    ('brand_score_features', brand_score_pipe),
    ('description_features', description_pipe),
    ('name_features', name_pipe)
])

#features = feature_union.fit_transform(df)
#print (features.shape)
#features
#print ('All features done.')
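
For readers new to FeatureUnion, the following minimal sketch shows what "combine" means here: each transformer sees the same input, and the outputs are stacked side by side as columns. The two toy transformers are assumptions chosen only to keep the result easy to read.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

X = np.array([[1.0, 2.0], [3.0, 4.0]])

union = FeatureUnion([
    # func=None makes FunctionTransformer pass the input through unchanged.
    ('identity', FunctionTransformer(validate=False)),
    # Element-wise squares of the same input.
    ('squares', FunctionTransformer(np.square, validate=False)),
])

# The columns of both outputs are concatenated: 2 + 2 = 4 columns.
combined = union.fit_transform(X)
print(combined)
```

In the notebook the individual outputs are mostly sparse matrices, in which case the union is stacked as a sparse matrix as well.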

### Train regression model

In [20]:
print ('Training regression model.')

Training regression model.


In [21]:
def df_train_test_split(df, y_col_name, test_fraction=0.2):
    # Splits input dataframe into train and test dataframes.
    # Response variables y_train and y_test are returned as numpy array
    
    msk = np.random.rand(len(df)) < 1-test_fraction
    # Copy the slices so that later in-place edits do not trigger SettingWithCopyWarning
    df_train = df[msk].copy()
    df_test = df[~msk].copy()
    y_train = df_train[y_col_name].values
    y_test = df_test[y_col_name].values
    return df_train, df_test, y_train, y_test

df_train, df_test, y_train, y_test \
    = df_train_test_split(df, 'price', test_fraction=0.2)
print (len(df_train), len(df_test))

59079 15048
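
The random mask above gives a different split on every run. As an alternative, the sketch below uses scikit-learn's `train_test_split` with a fixed `random_state` on a made-up frame; the seed value 42 is an arbitrary choice for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up listings.
toy = pd.DataFrame({'name': list('abcdefghij'), 'price': range(10, 20)})

# A fixed random_state makes the split repeatable across runs.
train_df, test_df = train_test_split(toy, test_size=0.2, random_state=42)
y_train_toy = train_df['price'].values
y_test_toy = test_df['price'].values

# Splitting again with the same seed selects the same rows.
train_again, test_again = train_test_split(toy, test_size=0.2, random_state=42)
print(len(train_df), len(test_df))
```

A repeatable split makes it easier to compare model variants, because score differences are then not caused by different held-out rows.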


In [22]:
model = Ridge(alpha=1.0)

reg_pipe = Pipeline([
    ('missing_values_handler', missing_values_handler),
    ('features', feature_union),
    ('reg', model)
])

reg_pipe.fit(df_train, np.log(y_train+1))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Pipeline(memory=None,
     steps=[('missing_values_handler', MissingValuesHandler(col_name_replacevalue_tuples=[('category_name', 'Other'), ('brand_name', 'Unknown'), ('item_description', 'No description yet')])), ('features', FeatureUnion(n_jobs=1,
       transformer_list=[('itemCond_features', Pipeline(memory=None,
     ste...it_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001))])

### Evaluate model

In [23]:
y_train_pred = np.exp(reg_pipe.predict(df_train)) - 1
print ('Train mean squared error = {:.2f}, train r2_score = {:.2f}'.format(mean_squared_error(y_train, y_train_pred), r2_score(y_train, y_train_pred)))
print ('Train mean squared log error = {:.2f}'.format(mean_squared_log_error(y_train, y_train_pred)))


Train mean squared error = 508.22, train r2_score = 0.64
Train mean squared log error = 0.13


In [24]:
y_test_pred = np.exp(reg_pipe.predict(df_test)) - 1
print ('Test mean squared error = {:.2f}, test r2_score = {:.2f}'.format(mean_squared_error(y_test, y_test_pred), r2_score(y_test, y_test_pred)))
print ('Test mean squared log error = {:.2f}'.format(mean_squared_log_error(y_test, y_test_pred)))


Test mean squared error = 791.15, test r2_score = 0.40
Test mean squared log error = 0.27
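
The competition is scored on root mean squared logarithmic error (RMSLE), which is the square root of the mean squared log error printed above. The sketch below, using made-up prices and made-up predictions in log(1 + price) space, shows how the log-space predictions map back to prices and how RMSLE relates to scikit-learn's `mean_squared_log_error`. The clipping step is an assumption added here, since that function rejects negative values.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Made-up true prices and made-up model outputs in log(1 + price) space.
y_true = np.array([10.0, 52.0, 20.0, 27.0])
log_pred = np.array([2.5, 3.9, 2.9, 3.4])

# Invert the log(1 + y) target transform; clip so no price is negative.
y_pred = np.clip(np.expm1(log_pred), 0, None)

# RMSLE via scikit-learn and via the definition.
msle = mean_squared_log_error(y_true, y_pred)
rmsle = np.sqrt(msle)
manual = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

print(rmsle, manual)
```

Because the model is trained on log(1 + price), minimizing squared error in that space lines up with the RMSLE metric directly.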


### Kaggle test set predictions

In [None]:
df_kaggle_test = pd.read_csv('../input/test.tsv', sep='\t')
df_kaggle_test = df_kaggle_test.drop(['test_id'], axis=1)

In [None]:
df_kaggle_test.head(2)

In [None]:
y_kaggle_test_pred = np.exp(reg_pipe.predict(df_kaggle_test)) - 1

In [None]:
submit_df = pd.DataFrame(y_kaggle_test_pred, columns=['price'])
submit_df.index.name = 'test_id'
submit_df.to_csv('mercari_submission_7.csv')
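
The submission above relies on the row index matching test_id. A more explicit variant keeps the id column and writes it alongside the predictions. This is a minimal sketch on a made-up frame with stand-in predictions; it writes to an in-memory buffer instead of a file.

```python
import io
import numpy as np
import pandas as pd

# Made-up test listings and stand-in predictions.
toy_test = pd.DataFrame({'test_id': [0, 1, 2], 'name': ['a', 'b', 'c']})
toy_pred = np.array([12.5, 30.0, 8.0])

# Pair each prediction with its id explicitly instead of relying on the index.
submit = pd.DataFrame({'test_id': toy_test['test_id'], 'price': toy_pred})

# Write without the index so the file has exactly two columns.
buffer = io.StringIO()
submit.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Keeping test_id as a column guards against mismatches if the test frame is ever filtered or reordered before prediction.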