<img src="https://www.mercari.com/assets/img/help_center/us/ogp.png"/>

# Mercari Price Suggestion Challenge - PART 2
***
### Can you automatically suggest product prices to online sellers?

**Product pricing gets even harder at scale**, considering just how many products are sold online. Clothing has strong seasonal pricing trends and is heavily influenced by brand names, while electronics have fluctuating prices based on product specs.

**Mercari**, Japan’s biggest community-powered shopping app, knows this problem deeply. They’d like to offer pricing suggestions to sellers, but this is tough because their sellers are enabled to put just about anything, or any bundle of things, on Mercari's marketplace.

In this competition, Mercari’s challenging you to **build an algorithm that automatically suggests the right product prices**. You’ll be provided user-inputted text descriptions of their products, including details like product category name, brand name, and item condition.

### Dataset Features

- **ID**: the id of the listing
- **Name:** the title of the listing
- **Item Condition:** the condition of the items provided by the seller
- **Category Name:** category of the listing
- **Brand Name:** brand of the listing
- **Shipping:** whether or not shipping cost was provided
- **Item Description:** the full description of the item
- **Price:** the price that the item was sold for. This is the target variable that you will predict. The unit is USD.

### Evaluation metric:
- **RMSLE**
- It puts more penalty on **lower errors**.
- This is used when you want to penalize **under estimates** more than **over estimates**.

**Source:** https://www.kaggle.com/c/mercari-price-suggestion-challenge

## Review
- In **Mercari Price Suggestion Challenge - PART 1**, I performed feature pre-processing, transformations, and derivations to the text data because these will be features for my model.
- I will now experiment with stop-words, n-grams, and various other methods to come up with a good representation for the model.

In [1]:
__author__ = "Mrunal Salvi"
__email__ = "mrunalsalvi94@gmail.com"

In [2]:
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer

from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import KFold, cross_val_score, train_test_split

from sklearn.linear_model import Ridge

# Training Data
- Since in Part 1 we saw that the price distribution is **POSITIVELY SKEWED**, we take a log transformation so that the data is **NORMALLY DISTRIBUTED**
- Also, we convert our target - 'PRICE' into log value so that we can use RMSE on already converted Target value

In [3]:
# Get 10% of the Training Data
train = pd.read_csv('C:\\Users\\Mrunal\\Documents\\NLP Project\\train.tsv', sep = '\t')
reduced_X_train = train.sample(frac=0.1).reset_index(drop=True)
reduced_y_train = np.log1p(reduced_X_train['price'])

## Fast Data Cleaning

In [4]:
# Fast Cleaning of Data
reduced_X_train['category_name'] = reduced_X_train['category_name'].fillna('other').astype(str)
reduced_X_train['brand_name'] = reduced_X_train['brand_name'].fillna('unknown').astype(str)
reduced_X_train['shipping'] = reduced_X_train['shipping'].astype(str)
reduced_X_train['item_condition_id'] = reduced_X_train['item_condition_id'].astype(str)
reduced_X_train['item_description'] = reduced_X_train['item_description'].fillna('none')

In [5]:
reduced_X_train.head()

Unnamed: 0,train_id,name,item_condition_id,category_name,brand_name,price,shipping,item_description
0,602443,Book Bundle,2,Other/Books/Literature & Fiction,unknown,160.0,1,"ACOTAR, TDM series, TTSAO, Signs Point to Yes,..."
1,286644,Reserved,1,Women/Dresses/Knee-Length,LuLaRoe,71.0,0,Carly bundle. (65.00 w/ 15% discount off 39.00...
2,1291682,NWT vs coffee tumbler,1,Home/Kitchen & Dining/Coffee & Tea Accessories,unknown,15.0,0,NWT in box Lowest price No offers please Angel...
3,1324500,NWT Nike Pro Shorts,1,Women/Athletic Apparel/Shorts,Nike,20.0,1,"Brand new with tags red Nike Pro shorts. 3"" in..."
4,345966,PINK Collegiate Collection,1,Women/Sweaters/Collared,unknown,24.0,0,PINK Collegiate collection Sweater LSU BRAND N...


# Topic Modeling:
- Instead of assigning vectors, we can create topics which contain words with similar meaning, which help in understanding a document.
- There are 2 methods: **LSA** and **LDA**

- **LSA**: It helps discover underlying words and their combinations which are not visible during prelimnary text analysis.
- **LDA**: It gives the probability of the word belonging to a particular topic.

## Latent Dirichlet Allocation (LDA):

- We focus on **'item_description'** and check different topics produced using LDA.

In [6]:
import string
import nltk
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from gensim.models.ldamodel import LdaModel
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from string import punctuation
from nltk.tokenize import word_tokenize
from collections import Counter
import operator
nltk.download('wordnet')
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Mrunal\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [7]:
reduced_X_train.shape

(148254, 8)

In [8]:
reduced_y_train.shape

(148254,)

In [9]:
stop = set(stopwords.words('english'))
stop.remove('no')                       #So that 'no' is not removed from 'item_description'
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

In [10]:
#1 - Create all text list
item_description_list = reduced_X_train['item_description'].tolist()

#2 - call defined function
clean_item_description = [clean(doc).split() for doc in item_description_list]

#3 - Create dictionary
dictionary =  gensim.corpora.Dictionary(clean_item_description)

#4 - Corpus - Doc-term matrix
doc_term_matrix = [dictionary.doc2bow(doc) for doc in clean_item_description]

In [11]:
#5 - build LDA model
lda_model = LdaModel(doc_term_matrix, num_topics=10, id2word = dictionary, random_state=42)

#6 extract topics for headlines
topics = lda_model.print_topics(num_topics=10, num_words=15)

In [12]:
from pprint import pprint

pprint(lda_model.print_topics())

[(0,
  '0.111*"no" + 0.050*"condition" + 0.048*"description" + 0.045*"yet" + '
  '0.036*"free" + 0.032*"home" + 0.026*"smoke" + 0.020*"good" + 0.020*"size" + '
  '0.020*"legging"'),
 (1,
  '0.121*"new" + 0.075*"brand" + 0.044*"never" + 0.031*"used" + 0.029*"box" + '
  '0.018*"authentic" + 0.016*"color" + 0.015*"tag" + 0.013*"lip" + '
  '0.010*"shade"'),
 (2,
  '0.036*"condition" + 0.030*"used" + 0.024*"great" + 0.023*"good" + '
  '0.015*"one" + 0.015*"time" + 0.015*"no" + 0.013*"picture" + 0.012*"wear" + '
  '0.012*"come"'),
 (3,
  '0.019*"oz" + 0.017*"body" + 0.016*"hair" + 0.015*"skin" + 0.010*"1" + '
  '0.009*"full" + 0.009*"bottle" + 0.009*"oil" + 0.009*"dry" + 0.008*"bath"'),
 (4,
  '0.045*"size" + 0.039*"worn" + 0.023*"fit" + 0.021*"small" + 0.019*"cute" + '
  '0.015*"super" + 0.014*"black" + 0.014*"medium" + 0.014*"condition" + '
  '0.014*"never"'),
 (5,
  '0.020*"please" + 0.011*"color" + 0.011*"day" + 0.009*"question" + '
  '0.009*"sticker" + 0.009*"item" + 0.007*"make" + 0.00

### Effectiveness of our LDA model - Coherence

In [13]:
#Coherence check of LDA

coherence_model_lda = CoherenceModel(model=lda_model, texts=clean_item_description, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print(coherence_lda)

0.6206063804400851


# Eli5 library:
- It's a library that allows you to see what your model has learned from the text features.
- Looking at features helps to **understand how your classifier works**. 
- Looking at features helps to understand how classifier works and it also helps to notice preprocessing bugs.

### How does Eli5 Work?
It shows you the correlation of each feature/text with the target variable. We can inspect features and weights because we’re using a bag-of-words vectorizer and a linear classifier (so there is a direct mapping between individual words and classifier coefficients).

- https://eli5.readthedocs.io/en/latest/
- https://eli5.readthedocs.io/en/latest/overview.html

In [14]:
import eli5



In [15]:
# Definte RMSLE Cross Validation Function
def rmsle_cv(model):
    kf = KFold(shuffle=True, random_state=42).get_n_splits(reduced_X_train['item_description'])
    rmse= np.sqrt(-cross_val_score(model, reduced_X_train['item_description'], reduced_y_train, scoring="neg_mean_squared_error", cv = kf))
    return(rmse.mean())

## Baseline Model with CountVectorizer

In [16]:
from sklearn.linear_model import Ridge

vec = CountVectorizer()
clf = Ridge(random_state=42)
pipe = make_pipeline(vec, clf)
pipe.fit(reduced_X_train['item_description'], reduced_y_train)

cv_rmsle = rmsle_cv(pipe)

print("The Validation Score is: " + str(cv_rmsle))

The Validation Score is: 0.6853173736299532


In [17]:
import eli5
eli5.show_weights(pipe, vec=vec, top=100, feature_filter=lambda x: x != '<BIAS>')

Weight?,Feature
+2.100,timebomb
+1.786,montsouris
+1.681,hello123
+1.674,1600
+1.547,giftcards
+1.503,linda804
+1.501,1tb
+1.481,anya
+1.455,médium
+1.383,10218184


In [18]:
eli5.show_prediction(clf, doc=reduced_X_train['item_description'][1297], vec=vec)

Contribution?,Feature
2.817,<BIAS>
0.125,all
0.111,watch
0.086,scratches
0.076,condition
0.075,minimal
0.062,rewards
0.051,applied
0.05,flawless
0.047,excellent


## Baseline Model with CountVectorizer and Stop Words

In [19]:
vec = CountVectorizer(stop_words='english')
clf = Ridge(random_state=42)
pipe = make_pipeline(vec, clf)
pipe.fit(reduced_X_train['item_description'], reduced_y_train)

cv_sw_rmsle = rmsle_cv(pipe)

print("The Validation Score is: " + str(cv_sw_rmsle))

The Validation Score is: 0.6859066731878126


In [20]:
eli5.show_prediction(clf, doc=reduced_X_train['item_description'][1297], vec=vec)

Contribution?,Feature
2.82,<BIAS>
0.111,watch
0.103,scratches
0.08,condition
0.073,minimal
0.052,included
0.051,flawless
0.049,applied
0.045,excellent
0.044,rewards


## Baseline Model with TF-IDF

In [21]:
vec = TfidfVectorizer()
clf = Ridge(random_state=42)
pipe = make_pipeline(vec, clf)
pipe.fit(reduced_X_train['item_description'], reduced_y_train)

tfidf_rmsle = rmsle_cv(pipe)

print("The Validation Score is: " + str(tfidf_rmsle))

The Validation Score is: 0.6190213018923119


In [22]:
eli5.show_prediction(clf, doc=reduced_X_train['item_description'][1297], vec=vec)

Contribution?,Feature
2.719,<BIAS>
0.094,watch
0.069,all
0.053,codes
0.051,minimal
0.039,flawless
0.033,scratches
0.029,check
0.023,included
0.021,no


## Baseline Model with TF-IDF and Stop Words

In [23]:
vec = TfidfVectorizer(stop_words='english')
clf = Ridge(random_state=42)
pipe = make_pipeline(vec, clf)
pipe.fit(reduced_X_train['item_description'], reduced_y_train)

tfidf_sw_rmsle = rmsle_cv(pipe)

print("The Validation Score is: " + str(tfidf_sw_rmsle))

The Validation Score is: 0.6202699999545592


In [24]:
eli5.show_prediction(clf, doc=reduced_X_train['item_description'][1297], vec=vec)

Contribution?,Feature
2.727,<BIAS>
0.114,watch
0.055,minimal
0.048,flawless
0.047,scratches
0.046,codes
0.033,included
0.027,condition
0.019,excellent
0.019,listing


## Baseline Model with TF-IDF, Stop Words, and N-Grams

In [25]:
vec = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
clf = Ridge(random_state=42)
pipe = make_pipeline(vec, clf)
pipe.fit(reduced_X_train['item_description'], reduced_y_train)

tfidf_sw_ng_rmsle = rmsle_cv(pipe)

print("The Validation Score is: " + str(tfidf_sw_ng_rmsle))

The Validation Score is: 0.6062479529607294


In [26]:
eli5.show_prediction(clf, doc=reduced_X_train['item_description'][1297], vec=vec)

Contribution?,Feature
2.755,<BIAS>
0.116,watch
0.05,included
0.049,minimal
0.034,good excellent
0.027,included watch
0.026,codes
0.025,play flawless
0.024,scratches
0.023,condition


## RMSLE Summary

TF-IDF + Stop Words + N-Grams works best

In [27]:
print ("RMSLE Score: " + str(cv_rmsle) + " , CountVectorizer")
print ("RMSLE Score: " + str(cv_sw_rmsle) + " , CountVectorizer with Stop Words")
print ("RMSLE Score: " + str(tfidf_rmsle) + " , TF-IDF")
print ("RMSLE Score: " + str(tfidf_sw_rmsle) + " , TF-IDF with Stop Words")
print ("RMSLE Score: " + str(tfidf_sw_ng_rmsle) + " , TF-IDF with Stop Words and N-Grams")

RMSLE Score: 0.6853173736299532 , CountVectorizer
RMSLE Score: 0.6859066731878126 , CountVectorizer with Stop Words
RMSLE Score: 0.6190213018923119 , TF-IDF
RMSLE Score: 0.6202699999545592 , TF-IDF with Stop Words
RMSLE Score: 0.6062479529607294 , TF-IDF with Stop Words and N-Grams


### Tfidf vectorizer with stopwords and n-grams gives lowest rmsle score

# Modeling

- Ridge Regression
- LASSO Regression

### Creating Transformed Training Set (SPARSE - created in Part1 - similar steps)

In [28]:
reduced_X_train['item_description'] = reduced_X_train['item_description'].apply(clean)

In [29]:
# Applying LabelBinarizer to "brand_name"
lb = LabelBinarizer(sparse_output=True)
X_brand = lb.fit_transform(reduced_X_train['brand_name'])

# Onehotencoding 'item_condition_id' and 'shipping'
X_dummies = csr_matrix(pd.get_dummies(reduced_X_train[['item_condition_id', 'shipping']], sparse=True).values)

In [30]:
# Count vectorizing 'name' and 'category_name'
cv = CountVectorizer(min_df=10)
X_name = cv.fit_transform(reduced_X_train['name'])
X_category_name = cv.fit_transform(reduced_X_train['category_name'])

In [31]:
# Tfidf vectorizing 'item_description'
tv = TfidfVectorizer(max_features=55000, ngram_range =(1, 2), stop_words='english')
X_description = tv.fit_transform(reduced_X_train['item_description'])

In [32]:
reduced_Xt_train = hstack((X_dummies, X_description, X_brand, X_name, X_category_name)).tocsr()

In [33]:
reduced_Xt_train

<148254x63272 sparse matrix of type '<class 'numpy.float64'>'
	with 4683601 stored elements in Compressed Sparse Row format>

### Define RMSLE Function

- It puts more penalty on **lower errors**
- This is used when you want to penalize **under estimates** more than **over estimates**.

In [34]:
def get_rmsle(y, pred): 
    return np.sqrt(mean_squared_error(y, pred))

### Ridge Cross Validation

In [35]:
%%time

# Creating 3-Fold CV
cv = KFold(n_splits=3, shuffle=True, random_state=42)

for train_ids, valid_ids in cv.split(reduced_Xt_train):
    
    model_ridge = Ridge(solver = "auto", fit_intercept=True, random_state=42)
    model_ridge.fit(reduced_Xt_train[train_ids], reduced_y_train[train_ids])
    
    # Predict & Evaluate Training Score
    y_pred_train = model_ridge.predict(reduced_Xt_train[train_ids])
    rmsle_train = get_rmsle(y_pred_train, reduced_y_train[train_ids])
    
    # Predict & Evaluate Validation Score
    y_pred_valid = model_ridge.predict(reduced_Xt_train[valid_ids])
    rmsle_valid = get_rmsle(y_pred_valid, reduced_y_train[valid_ids])
    
    print(f'Ridge Training RMSLE: {rmsle_train:.5f}')
    print(f'Ridge Validation RMSLE: {rmsle_valid:.5f}')


Ridge Training RMSLE: 0.38809
Ridge Validation RMSLE: 0.50945
Ridge Training RMSLE: 0.38614
Ridge Validation RMSLE: 0.51250
Ridge Training RMSLE: 0.38577
Ridge Validation RMSLE: 0.51368
Wall time: 15.4 s


## LASSO Cross Validation

Why did LASSO Perform way worse than Ridge?
- Ridge RMSLE: 0.53 
- LASSO RMSLE: 0.74

One reason why could be because since LASSO performs automatic feature selection. So keep in mind majority of our features are just words. It'll remove some of our text features. And this may not generalize well with new data. Because our dataset is suppose to capture and use all our words as features. 

In [36]:
%%time
from sklearn.linear_model import Lasso

# Creating 3-Fold CV
cv = KFold(n_splits=3, shuffle=True, random_state=42)

for train_ids, valid_ids in cv.split(reduced_Xt_train):
    
    model_LASSO = Lasso(fit_intercept=True, random_state=42)
    model_LASSO.fit(reduced_Xt_train[train_ids], reduced_y_train[train_ids])
    
    # Predict & Evaluate Training Score
    y_pred_train = model_LASSO.predict(reduced_Xt_train[train_ids])
    rmsle_train = get_rmsle(y_pred_train, reduced_y_train[train_ids])
    
    # Predict & Evaluate Validation Score
    y_pred_valid = model_LASSO.predict(reduced_Xt_train[valid_ids])
    rmsle_valid = get_rmsle(y_pred_valid, reduced_y_train[valid_ids])
    
    print(f'LASSO Training RMSLE: {rmsle_train:.5f}')
    print(f'LASSO Validation RMSLE: {rmsle_valid:.5f}')


LASSO Training RMSLE: 0.74786
LASSO Validation RMSLE: 0.74870
LASSO Training RMSLE: 0.74896
LASSO Validation RMSLE: 0.74649
LASSO Training RMSLE: 0.74760
LASSO Validation RMSLE: 0.74922
Wall time: 52.3 s


In [37]:
train_X, test_X, train_y, test_y = train_test_split(reduced_Xt_train, reduced_y_train, test_size=0.2, random_state=144)

## Price - target prediction using Ridge

In [38]:
model_ridge = Ridge(solver = "auto", fit_intercept=True, random_state=42)
model_ridge.fit(train_X, train_y)
    
ridge_y_pred = model_ridge.predict(test_X)

### RIDGE - Price prediction WITHOUT exponentiating back our previous log of target

In [39]:
ridge_y_pred[:20]

array([2.65310302, 3.08731146, 2.58511956, 4.63463426, 2.19409618,
       2.31016467, 3.14645611, 2.89743448, 2.49294055, 2.9758532 ,
       3.72234847, 1.98459212, 2.52073504, 2.92565183, 2.51279412,
       3.23115041, 2.75873652, 2.99942625, 2.77826493, 3.20889777])

### RIDGE - Price prediction WITH exponentiating back our previous log of target

In [40]:
ridge_y = np.expm1(ridge_y_pred)
ridge_y[200:220]

array([24.9348245 , 37.49232764, 10.58031245,  7.67218705, 36.95587499,
       16.17446696,  7.12070009, 17.36141348,  8.91013554, 20.82005848,
       24.52306359,  7.53941399, 14.85721212, 13.83451139, 19.436138  ,
        3.2705994 ,  5.28388163, 17.72993511, 30.58454786,  6.25114214])

## TEST prediction - PRICE

In [41]:
np.expm1(test_y[200:220])

27389     30.0
15132     86.0
48407      3.0
109582    13.0
21767     19.0
106154    15.0
4959       6.0
101231    10.0
102148    14.0
19708     17.0
127741    19.0
48785     10.0
133352    20.0
72243     20.0
28311     25.0
6026       3.0
21722      4.0
125264    30.0
52726     26.0
39941      6.0
Name: price, dtype: float64