<img src="https://www.mercari.com/assets/img/help_center/us/ogp.png"/>

# Mercari Price Suggestion Challenge
***
### Can you automatically suggest product prices to online sellers?

**Product pricing gets even harder at scale**, considering just how many products are sold online. Clothing has strong seasonal pricing trends and is heavily influenced by brand names, while electronics have fluctuating prices based on product specs.

**Mercari**, Japan’s biggest community-powered shopping app, knows this problem deeply. They’d like to offer pricing suggestions to sellers, but this is tough because their sellers are enabled to put just about anything, or any bundle of things, on Mercari's marketplace.

In this competition, Mercari’s challenging you to **build an algorithm that automatically suggests the right product prices**. You’ll be provided user-inputted text descriptions of their products, including details like product category name, brand name, and item condition.

### Dataset Features

- **ID**: the id of the listing
- **Name:** the title of the listing
- **Item Condition:** the condition of the items provided by the seller
- **Category Name:** category of the listing
- **Brand Name:** brand of the listing
- **Shipping:** 1 if the shipping fee is paid by the seller, 0 if it is paid by the buyer
- **Item Description:** the full description of the item
- **Price:** the price that the item was sold for. This is the target variable that you will predict. The unit is USD.

**Work on supply and demand**

**Source:** https://www.kaggle.com/c/mercari-price-suggestion-challenge

<img src = "https://cdn.dribbble.com/users/56196/screenshots/2281553/mobile-dribbble.gif"/>

# Representing and Mining Text
***
Text is the most **unstructured** form of available data: it contains many kinds of noise and cannot be analyzed without pre-processing. The process of cleaning and standardizing text so that it is noise-free and ready for analysis is known as **text pre-processing**.

### Fundamental Concepts 

The importance of constructing mining-friendly data representations; Representation of text for data mining. 

### Important Terminologies
- **Document**: One piece of text. It could be a single sentence, a paragraph, or even a full page report. 
- **Tokens**: Also known as terms. A token is usually a single word, and a document is made up of many tokens. 
- **Corpus**: A collection of documents. 
- **Term Frequency (TF)**: Measures how often a term occurs in a single document
- **Inverse Document Frequency (IDF)**: Measures how rare a term is across the corpus; terms that appear in few documents get a higher value

### Pre-Processing Techniques
- **Stop Word Removal:** stop words are terms that carry little to no meaning in a given text. Think of them as the "noise" in the data. Examples include "the", "a", "an", and "to".
- **Bag of Words Representation:** treats each word as a feature of the document, ignoring word order

- **TFIDF**: a common way to weight terms. It boosts words that occur rarely across the corpus. For example, if the word "play" is common, it gets little to no boost, but if the word "mercari" is rare, it gets more weight. 

- **N-grams**: Sequences of adjacent words treated as terms. A word by itself may carry little information, but analyzing two words as a pair can add meaning. For example, "iPhone" vs. "iPhone Charger".

- **Stemming and Lemmatization**: Reduce each word to its root form

- **Topic Models**: A type of model that discovers a set of topics in a collection of documents.
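
To make the terminology concrete, here is a minimal sketch in plain Python of how a document becomes tokens, n-grams, and a bag of words. The helper names (`tokenize`, `ngrams`) and the sample listing title are made up for illustration; later sections of this notebook rely on scikit-learn for the same steps.

```python
from collections import Counter

def tokenize(document):
    """Lowercase a document and split it into word tokens on whitespace."""
    return document.lower().split()

def ngrams(tokens, n):
    """Return every sequence of n adjacent tokens, joined by a space."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

document = "New iPhone charger and iPhone case"  # hypothetical listing title
tokens = tokenize(document)

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

# Bag of words: term -> count, word order is discarded
bag_of_words = Counter(unigrams)

print(bigrams)
print(bag_of_words['iphone'])
```

The bigram "iphone charger" is a different feature from "iphone case", which a bag of single words cannot express.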

# Table of Contents
***
### Define the Problem:

- [What's the Business Goal?](#map_of_newyork)

- [How will the solution be used?](#asian_white_geomap)

- [How to frame the problem?](#black_hispanic_geomap)

- [What metric are we optimizing?](#black_hispanic_geomap)

### Descriptive Statistics:
- [Observe Training Statistics](#correlation)

- [Simple Data Inspection](#correlation)

- [Missing Value Treatment](#correlation)


### Exploratory Data Analysis:
- [Price Distribution & Log Transformation](#correlation)

- [Shipping Type Distribution](#race_economic)

- [Category Distribution & Feature Engineering](#school_attendance)

- [Brand Analysis](#student_performance)

- [Length of Item Description VS Price](#math_test)


### Text Processing:
- [Normalizing Words (Stemming, Lowercase, Punctuation, Stop Words)](#community_vs_noncommunity)

### Feature Extraction with Text:
- [Bag of Words Model](#community_vs_noncommunity)

- [Word Tokens](#economic_need)

- [Word Frequency Weights](#avg_school_income_comparison)

- [CountVectorizer, TF-IDF, LabelBinarizer](#avg_school_income_comparison)

- [Encoding Categorical Variables](#avg_school_income_comparison)


# Define the Problem

**A. Define the objective in business terms:** The objective is to come up with a pricing algorithm that we can use to recommend prices to users. 

**B. How will your solution be used?:** Showing users a suggested price before they buy or sell will hopefully lead to more transactions on Mercari's marketplace. 

**C. How should you frame this problem?:** This problem can be solved with a supervised learning approach, and possibly some unsupervised learning methods for clustering analysis. 

**D. How should performance be measured?:** Since it's a regression problem, the usual evaluation metric would be RMSE (Root Mean Squared Error). The competition, however, uses RMSLE (Root Mean Squared Logarithmic Error), which compares prices on a log scale. It penalizes relative rather than absolute errors, so a large dollar error on an expensive item does not dominate the score and errors on low-priced items still count (our price distribution is centered at around $10).

**E. Are there any other data sets that you could use?:** To better understand and predict prices, we could gather more data about the user, such as user location and user gender, as well as seasonality.
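
As a rough illustration of the metric discussed in point D, the sketch below implements RMSLE with NumPy. The function name `rmsle` and the sample prices are assumptions for this example, not part of the competition code.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error between two arrays of prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Hypothetical prices: the same $10 miss on a cheap and on an expensive item
cheap_item = rmsle([10], [20])
expensive_item = rmsle([1000], [1010])

print(cheap_item, expensive_item)
```

The $10 miss on the cheap item produces a much larger error than the same miss on the expensive item, which is the behavior we want when most listings are low-priced.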

# Import Packages

In [6]:
__author__ = "Data Science Dream Job"
__copyright__ = "Copyright 2018, Data Science Dream Job LLC"
__email__ = "info@datasciencedreamjob.com"

In [7]:
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer

## Training Set

The training set has about 600,000 observations

In [9]:
# Observe the training set
train = pd.read_csv('C:/Users/Randy/Desktop/training/train.tsv', sep = '\t')
train.head()

In [10]:
print("The size of the training data is: " + str(train.shape))
print(train.dtypes)

## Summary Statistics:
- Most items are priced at around 10 dollars
- There are about 33k items with no description
- There are 3751 unique brands
- Women's items make up the largest share of the listings

In [12]:
train.astype('object').describe().transpose()

## Test Set
The testing set has about 700,000 observations

In [14]:
# Observe testing set
test = pd.read_csv('C:/Users/Randy/Desktop/training/test.tsv', sep = '\t',engine = 'python')
test.head()

In [15]:
test.shape

# Reduce Training Size
Sample only 30% of the training set for now to save time (`frac=0.3` below).

In [17]:
reduced_train = train.sample(frac=0.3).reset_index(drop=True)
reduced_train.shape

# Fast Data Cleaning

In [19]:
# How many missing values do we have in our training set
# Why is brand name missing? Do all items need a brand? 
train.isnull().sum()

In [20]:
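# Create a function to impute missing values
def fill_missing_value(data):
    # Assign the filled columns back instead of calling fillna(inplace=True) on a
    # single column, which newer pandas versions warn about
    data['category_name'] = data['category_name'].fillna('Other')
    data['brand_name'] = data['brand_name'].fillna('unknown')
    data['item_description'] = data['item_description'].fillna('No description yet')

    return data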

In [21]:
train = fill_missing_value(train)
train.isnull().sum()

# Exploratory Data Analysis

### Examine Target Value (Price)

In [24]:
train.price.describe()

# Price Distribution
**Summary:**
- The mean price in the dataset is **26 Dollars**
- The median price in the dataset is **17 Dollars**
- The max price in the dataset is **2000 Dollars**
- Most item prices are at about **10 Dollars**

**Why take log(price)?** 

Regression tasks are usually evaluated with the Root Mean Squared Error (RMSE). Because price follows a long-tailed distribution (most products are priced at about $10, while a few sell for up to $2,000), the competition is evaluated with the Root Mean Squared Logarithmic Error (RMSLE), which makes errors on low-priced products count as much as errors on high-priced ones. I therefore apply a log transformation to the price target, so that minimizing the squared error during training corresponds to minimizing the RMSLE.


**Example:**

**Step 1 Log Transformation:** `np.log(train['price']+1)`

**Step 2 Predict with Log Transformation:** `test_pred = model.predict(X_test)`

**Step 3 Convert back to original value by Exponential Transformation:** `Y_test = np.expm1(test_pred)`
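
The three steps can be checked on a toy example. This is a sketch under stated assumptions: the prices are made up, and the "predictions" are simply the true log-prices plus hand-picked offsets standing in for a fitted model.

```python
import numpy as np

prices = np.array([3.0, 10.0, 17.0, 26.0, 2000.0])  # hypothetical prices

# Step 1: transform the target (np.log1p(x) is the same as np.log(x + 1))
log_prices = np.log1p(prices)

# Step 2 (stand-in): pretend a model predicted these log-prices
log_pred = log_prices + np.array([0.1, -0.2, 0.0, 0.3, -0.1])

# Step 3: convert the predictions back to dollars
pred_prices = np.expm1(log_pred)

# RMSE measured in log space equals the RMSLE of the dollar predictions
rmse_log_space = np.sqrt(np.mean((log_pred - log_prices) ** 2))
rmsle_dollars = np.sqrt(np.mean((np.log1p(pred_prices) - np.log1p(prices)) ** 2))

print(rmse_log_space, rmsle_dollars)
```

Because the two numbers agree, a model trained to minimize squared error on `log(price + 1)` is directly optimizing the competition metric.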

In [27]:
# Plot Price Distribution
plt.subplot(1, 2, 1)
(train['price']).plot.hist(bins=50, figsize=(15, 6), edgecolor = 'white', range = [0, 250])
plt.xlabel('price', fontsize=12)
plt.title('Price Distribution', fontsize=12)

#Plot Log Price Distribution
plt.subplot(1, 2, 2)
np.log(train['price']+1).plot.hist(bins=50, figsize=(15,6), edgecolor='white')
plt.xlabel('log(price+1)', fontsize=12)
plt.title('Log Price Distribution', fontsize=12)

plt.show()

## Remove Items with 0 Price

In [29]:
train[train.price==0]

In [30]:
# We have 311 items with price of $0. Let's take them out because it looks like an error on their part. 
train[train.price==0].shape

In [31]:
# Remove items with price of $0 from our training set
train = train[train.price != 0]
train.shape

## Shipping Distribution

In [33]:
train['shipping'].value_counts() / len(train)

## Price Distribution by Shipping Type

Seems about right. Shipping does increase the price, which confirms our intuition.

In [36]:
shipping_fee_by_buyer = train.loc[train['shipping'] == 0, 'price']
shipping_fee_by_seller = train.loc[train['shipping'] == 1, 'price']

fig, ax = plt.subplots(figsize=(18,8))

ax.hist(shipping_fee_by_seller, color='blue', alpha=1.0, bins=50, range = [0, 100],label='Price when Seller pays Shipping')
ax.hist(shipping_fee_by_buyer, color='red', alpha=0.7, bins=50, range = [0, 100],label='Price when Buyer pays Shipping')

plt.xlabel('price', fontsize=12)
plt.ylabel('frequency', fontsize=12)
plt.title('Price Distribution by Shipping Type', fontsize=15)
plt.tick_params(labelsize=12)
plt.legend()
plt.show()

# Average Price for Shipping Type

In [38]:
print('The average price is {}'.format(round(shipping_fee_by_seller.mean(), 2)), 'if seller pays shipping');
print('The average price is {}'.format(round(shipping_fee_by_buyer.mean(), 2)), 'if buyer pays shipping')

## Top 10 Categories

In [40]:
top_10_category = train['category_name'].value_counts()[:10].reset_index()
top_10_category

# Top 10 Categories Distribution

In [42]:
# Name the columns explicitly so the code works across pandas versions
top_10_category = train['category_name'].value_counts()[:10].rename_axis('category').reset_index(name='count')

sns.set(style="whitegrid")

# Initialize the matplotlib figure
f, ax = plt.subplots(figsize=(15, 10))

# Plot the total items per category
sns.set_color_codes("pastel")
sns.barplot(x="count", y="category", data=top_10_category,
            label="# Items", color="g")

# Add an informative title and axis labels
ax.set(ylabel="Item Category", title='# Items Per Category',
       xlabel="# of Items")

plt.show()

## Create new Feature by Splitting Category

The category feature follows the format **Root Category/Category/Subcategory**, so we split it and save each level in a separate column.

In [44]:
# Feature Engineering (1): Creating new Category Features
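def transform_category_name(category_name):
    try:
        main, sub1, sub2 = category_name.split('/')
        return main, sub1, sub2
    except (ValueError, AttributeError):
        # Not exactly three levels (ValueError) or not a string (AttributeError)
        return 'Other', 'Other', 'Other'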

In [45]:
train['category_main'], train['category_sub1'], train['category_sub2'] = zip(*train['category_name'].apply(transform_category_name))

cat_train = train[['category_main','category_sub1','category_sub2', 'price']]

cat_train.head()
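
One caveat of the function above is that any category with more than three levels falls back to `Other`. As an alternative sketch (not what this notebook does), pandas can cap the number of splits so the extra levels stay inside the last column. The sample category strings are illustrative.

```python
import pandas as pd

categories = pd.Series([
    'Women/Tops & Blouses/Blouse',
    'Electronics/Computers & Tablets/iPad/Tablet/eBook Readers',  # more than three levels
    'Other',                                                      # placeholder for a missing category
])

# n=2 allows at most two splits, so everything after the second "/" stays together
parts = categories.str.split('/', n=2, expand=True).fillna('Other')
parts.columns = ['category_main', 'category_sub1', 'category_sub2']

print(parts)
```

Whether keeping the deeper levels helps the model is an open question; it mainly avoids throwing away the main category of those listings.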

# Top 10 Main Category
Women's and beauty products make up most of the items. This is interesting because it tells us who our users are, which we could then use for some sort of targeted marketing...

In [47]:
plt.figure(figsize=(17,10))
# Passing the column as y already draws horizontal bars, so no orient argument is needed
sns.countplot(y = train['category_main'], order = train['category_main'].value_counts().index)
plt.title('Top 10 Main Categories', fontsize = 25)
plt.ylabel('Main Category', fontsize = 20)
plt.xlabel('Number of Items')
plt.show()

# Ratio of Main Category 
The Women category accounts for about 45 percent of the items

In [49]:
# Look at the ratio of category for items
train['category_main'].value_counts()/len(train)

# Brand Analysis

There are about 3750 unique brands

In [52]:
# Amount of unique brand names
train['brand_name'].nunique()

In [53]:
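# Top 20 Brand Distribution
# Skip the first (most frequent) entry and take the next 20 brands;
# naming the columns explicitly keeps this working across pandas versions
b20 = train['brand_name'].value_counts()[1:21].rename_axis('brand_name').reset_index(name='count')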
ax = sns.barplot(x="brand_name", y="count", data=b20)
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
ax.set_title('Top 20 Brand Distribution', fontsize=15)
plt.show()

## Top 20 Expensive Brands

In [55]:
# Rank brands by mean price (the standard deviation would measure price spread, not how expensive a brand is)
top_20_exp_brand = train.groupby('brand_name')['price'].mean().sort_values(ascending=False)[0:20].reset_index()
ax = sns.barplot(x="brand_name", y="price", data=top_20_exp_brand)
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
ax.set_title('Top 20 Expensive Brand Distribution', fontsize=15)
plt.show()

# Length of Description VS Price

Does the length of the description have some effect on pricing?

In [58]:
train.item_description = train.item_description.astype(str)

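# Copy the slice so that adding a column does not trigger a SettingWithCopyWarning
descr = train[['name','item_description', 'price']].copy()
# Number of characters in each description
descr['count'] = descr['item_description'].apply(lambda x : len(str(x)))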
descr.head()

In [59]:
df = descr.groupby('count')['price'].mean().reset_index()
sns.regplot(x=df["count"], y=(df["price"]))
plt.xlabel("description length (characters)")
plt.show()

# Text Processing
***
Let's normalize the words by:
- Removing Punctuations
- Removing Stop Words
- Lowercasing the Words
- Stemming the Words

### List of Punctuations

In [62]:
from string import punctuation
punctuation

In [63]:
# Create a list of punctuation replacements
punctuation_symbols = []
for symbol in punctuation:
    punctuation_symbols.append((symbol, ''))
    
punctuation_symbols

### List of Stop Words

In [65]:
# Examine list of stop words
from nltk.corpus import stopwords
stop = stopwords.words('english')
stop

### Create Functions to Normalize the Words

In [67]:
import string

# Create a function to remove punctuations
def remove_punctuation(sentence: str) -> str:
    return sentence.translate(str.maketrans('', '', string.punctuation))

# Create a function to remove stop words
def remove_stop_words(x):
    x = ' '.join([i for i in x.lower().split(' ') if i not in stop])
    return x

# Create a function to lowercase the words
def to_lower(x):
    return x.lower()
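
To see what these helpers do to a single description, here is a self-contained sketch. It repeats simplified versions of the functions and swaps nltk's stop word list for a tiny hand-written set, so the `stop` set and the sample sentence are assumptions for illustration only.

```python
import string

# Tiny stand-in stop list; the notebook uses nltk's stopwords.words('english')
stop = {'the', 'a', 'an', 'to', 'is', 'in', 'and'}

def remove_punctuation(sentence):
    """Delete every character listed in string.punctuation."""
    return sentence.translate(str.maketrans('', '', string.punctuation))

def remove_stop_words(x):
    """Lowercase the text and drop tokens found in the stop list."""
    return ' '.join(word for word in x.lower().split() if word not in stop)

sample = "Brand NEW! The charger is in the box, never used."
normalized = remove_stop_words(remove_punctuation(sample))

print(normalized)
```

Note that the stop word filter already lowercases its input, so a separate lowercasing pass afterwards does not change the result.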


### Apply Normalizing Functions

In [69]:
# Stem the Words
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()

# PorterStemmer.stem expects a single word, so stem each token separately
def stem_words(x):
    return ' '.join(porter.stem(word) for word in x.split())

train['item_description'] = train['item_description'].apply(remove_punctuation)
train['item_description'] = train['item_description'].apply(remove_stop_words)
train['item_description'] = train['item_description'].apply(to_lower)
# Stem last, once punctuation and stop words are gone
train['item_description'] = train['item_description'].apply(stem_words)

train['name'] = train['name'].apply(remove_punctuation)
train['name'] = train['name'].apply(remove_stop_words)
train['name'] = train['name'].apply(to_lower)

# Feature Extraction with Text
***

**Feature Pre-Processing:**
Sometimes you can't just fit a dataset into your model and expect good results. Each type of feature has its own way of being preprocessed, and the choice of preprocessing method also depends on the model we are trying to use.

Since we're working with Text Features, we're going to do a lot of vectorization:
- Tokenization: split each text into words (bag of words model)
- Stemming: removing word inflections (getting the root word)
- Vectorization: reducing text into a vector with different types of frequencies for each word (Count Values or TF-IDF Values)

## Bag of Words
When we vectorize these words, we create a feature for each word. This is known as **Bag of Words**. **We lose word ordering.**

**Solution**: To preserve some ordering, we can introduce **n-grams** into our vectorization of words (problem: too many features)
- one way to reduce the dimensions of n-grams is to remove stop words (a, the, is)
- **stop words**: we can remove these words because they are just there for grammatical structure and carry little to no meaning
- **n-grams** that occur less frequently can highlight and capture important parts of a document/text. Using n-grams **preserves local ordering** and **can improve model performance**.
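
The trade-off between preserving order and growing the feature space can be seen on a toy corpus. This sketch uses scikit-learn's `CountVectorizer`; the three sample listings are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["iphone charger", "iphone case", "charger for iphone"]  # hypothetical listings

unigram_cv = CountVectorizer(ngram_range=(1, 1))  # single words only
bigram_cv = CountVectorizer(ngram_range=(1, 2))   # single words plus adjacent pairs

X_uni = unigram_cv.fit_transform(docs)
X_bi = bigram_cv.fit_transform(docs)

uni_vocab = sorted(unigram_cv.vocabulary_)
bi_vocab = sorted(bigram_cv.vocabulary_)

print(X_uni.shape, X_bi.shape)
print(bi_vocab)
```

Adding bigrams doubles the number of columns even on three short documents, which is why the text above suggests removing stop words (and why limits such as `min_df` or `max_features` are useful on the full dataset).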

In [73]:
# Examine the normalized item description
train['item_description'][115:125]

In [74]:
#import nltk
#nltk.download('punkt')
from nltk.tokenize import word_tokenize

# Use a positional lookup; index label 120 may no longer exist after rows were filtered out
text1 = train['item_description'].iloc[120]
tokens = word_tokenize(text1)
print(tokens)

## Word Frequency Weights
Each word in our feature space can have different frequency weights
- Frequency Weight
- TF-IDF Weight
- Binary


**CountVectorizer**: Returns an encoded vector with integer count for each word

**TF-IDF(min_df, max_df, n-gram)**: Returns an encoded vector with a weighted count for each word. The weight relates how often the word occurs in the document to how common it is across the whole corpus; in short, it puts more emphasis on rare words. This is good because we want to find terms that are frequent in a document but not so frequent within the whole corpus.

**LabelBinarizer**: Takes every distinct value and assigns it its own column. 1 means the value is present and 0 means it is not (example with brand names)

**Why are we doing this again?**: Because most Machine Learning models can't work with raw text. You have to convert it into numbers.
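
A small example may help show what `LabelBinarizer` produces. The brand values below are made up; the point is only the shape and meaning of the output.

```python
from sklearn.preprocessing import LabelBinarizer

brands = ['Nike', 'Apple', 'unknown', 'Nike']  # hypothetical brand column

lb = LabelBinarizer()
encoded = lb.fit_transform(brands)

print(lb.classes_)  # one column per distinct brand, in sorted order
print(encoded)      # one row per listing, 1 marks the brand of that listing
```

Each row contains exactly one 1, in the column of that listing's brand. On the real data the number of columns equals the number of distinct brands, which is why the notebook asks for sparse output.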

# Feature Engineering
***

**BONUS** Extra Feature Engineering Ideas
- Character Count
- Word Count
- Number of Unique Words
- Average Post Length of Main Category
- If brand yes/no feature
- Etc..
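
A few of these ideas can be sketched with pandas string methods. The column names match the training set, but the rows, the new feature names, and the `MISSING_BRAND` placeholder are assumptions; use whatever value was filled in for missing brands.

```python
import pandas as pd

MISSING_BRAND = 'unknown'  # assumed placeholder used when imputing brand_name

# Hypothetical rows with the same column names as the training set
df = pd.DataFrame({
    'item_description': ['brand new never used', 'No description yet', 'red red dress'],
    'brand_name': ['Nike', MISSING_BRAND, MISSING_BRAND],
})

# Character count and word count of the description
df['desc_char_count'] = df['item_description'].str.len()
df['desc_word_count'] = df['item_description'].str.split().str.len()

# Number of unique (lowercased) words
df['desc_unique_words'] = df['item_description'].apply(lambda text: len(set(text.lower().split())))

# 1 if the listing has a real brand, 0 if it only has the placeholder
df['has_brand'] = (df['brand_name'] != MISSING_BRAND).astype(int)

print(df)
```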

In [78]:
# Look at our features
train.columns

### Categorical Variables (Need to do Encoding):
How should we encode these features?
- name
- brand_name
- category_main, category_sub1, category_sub2
- item_description
- shipping
- item_condition_id

# CountVectorizer

In [81]:
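# CountVectorizer - name & categories
# One vectorizer per column, so each fitted vocabulary can later be reused on the test set
cv_name = CountVectorizer(min_df=10)
cv_main = CountVectorizer(min_df=10)
cv_sub1 = CountVectorizer(min_df=10)
cv_sub2 = CountVectorizer(min_df=10)
X_name = cv_name.fit_transform(train['name'])
X_category_main = cv_main.fit_transform(train['category_main'])
X_category_sub1 = cv_sub1.fit_transform(train['category_sub1'])
X_category_sub2 = cv_sub2.fit_transform(train['category_sub2'])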

In [82]:
print("Item Name Shape: " + str(X_name.shape))
print("Category Main Shape: " + str(X_category_main.shape))
print("Category Sub1 Shape: " + str(X_category_sub1.shape))
print("Category Sub2 Shape: " + str(X_category_sub2.shape))

# LabelBinarizer

In [84]:
# Apply LabelBinarizer to "brand_name"
lb = LabelBinarizer(sparse_output=True)
X_brand = lb.fit_transform(train['brand_name'])

In [85]:
print("Item Brand Shape: " + str(X_brand.shape))

# Get_Dummies

In [87]:
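# Apply get_dummies to "item_condition_id" and "shipping" and then convert into a CSR Matrix
# Both columns are numeric, so they must be listed in `columns` to be one-hot encoded
X_dummies = csr_matrix(pd.get_dummies(train[['item_condition_id', 'shipping']], columns=['item_condition_id', 'shipping']).values.astype(np.int8))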

# TFIDF
**Main Goal:** Measure how important a word or phrase is within a collection of documents. It essentially **weighs down** terms that appear frequently and **scales up** rare terms.

**TF Term Frequency:** how often a term occurs in a document

**IDF Inverse Document Frequency:** how rare, and therefore how informative, a term is across the corpus
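
To make the IDF idea concrete, the sketch below computes it by hand for a three-document toy corpus, using the smoothed formula that scikit-learn's `TfidfVectorizer` applies by default, `ln((1 + n) / (1 + df)) + 1`. The corpus and helper names are made up for illustration.

```python
import math

docs = [
    "play with toy",
    "play outside",
    "mercari toy",
]  # hypothetical corpus of three documents

n_docs = len(docs)
tokenized = [doc.split() for doc in docs]

def document_frequency(term):
    """Number of documents that contain the term at least once."""
    return sum(term in tokens for tokens in tokenized)

def smoothed_idf(term):
    """ln((1 + n) / (1 + df)) + 1, the smoothed IDF variant."""
    return math.log((1 + n_docs) / (1 + document_frequency(term))) + 1

idf_play = smoothed_idf('play')        # appears in 2 of 3 documents
idf_mercari = smoothed_idf('mercari')  # appears in 1 of 3 documents

print(idf_play, idf_mercari)
```

The rarer term "mercari" receives the larger weight, matching the "play" versus "mercari" example given earlier in this notebook.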

### Important Parameters
1. ngram_range 
2. stop_words 
3. lowercase 
4. max_df - ignores terms whose document frequency is higher than this threshold
5. min_df - ignores terms whose document frequency is lower than this threshold
6. max_features - keeps only the given number of terms, ranked by term frequency across the corpus
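
The effect of the three filtering parameters is easiest to see on a toy corpus. This is a minimal sketch with made-up listings; the helper `vocabulary` exists only for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["red dress", "red shoes", "red hat", "blue hat"]  # hypothetical listings

def vocabulary(**params):
    """Fit a TfidfVectorizer with the given parameters and return its sorted vocabulary."""
    vectorizer = TfidfVectorizer(**params)
    vectorizer.fit(docs)
    return sorted(vectorizer.vocabulary_)

all_terms = vocabulary()                 # no filtering
without_common = vocabulary(max_df=0.7)  # drops "red", found in 3 of 4 documents (0.75)
without_rare = vocabulary(min_df=2)      # keeps terms found in at least 2 documents
top_two = vocabulary(max_features=2)     # keeps the 2 most frequent terms

print(all_terms, without_common, without_rare, top_two)
```

A float threshold is read as a share of documents and an integer as an absolute document count, so `max_df=0.7` and `min_df=2` above use the two conventions side by side.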

In [89]:
# Perform a TFIDF Transformation of the item description with the top 55000 features and has an n-gram range of 1-2
tv = TfidfVectorizer(max_features=55000, ngram_range=(1, 2), stop_words='english')
X_description = tv.fit_transform(train['item_description'])

In [90]:
print("Item Description Shape: " + str(X_description.shape))

### Observing the TFIDF Weights

In [92]:
# Create a DataFrame mapping the tokens to their IDF weights
# (get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2)
tfidf = dict(zip(tv.get_feature_names_out(), tv.idf_))
tfidf = pd.DataFrame.from_dict(tfidf, orient='index', columns=['tfidf'])

# Lowest TFIDF Scores
tfidf.sort_values(by=['tfidf'], ascending=True).head(10)

In [93]:
# Highest TFIDF Scores
tfidf.sort_values(by=['tfidf'], ascending=False).head(10)

# Combine All Features Into One Merge

In [95]:
# Combine everything together
sparse_merge = hstack((X_dummies, X_description, X_brand, X_name, X_category_main, X_category_sub1, X_category_sub2)).tocsr()
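
When the same features are later built for the test set, each vectorizer should only `transform` the test data so the columns line up with the training matrix. The sketch below shows the idea on toy data; the variable names and sample listings are assumptions, not part of this notebook's pipeline.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

train_names = ["nike shoes", "apple iphone", "nike shirt"]  # hypothetical training names
test_names = ["nike iphone case"]                           # hypothetical test name

cv_name = CountVectorizer()
X_train_name = cv_name.fit_transform(train_names)  # learns the vocabulary
X_test_name = cv_name.transform(test_names)        # reuses it; unseen words are dropped

# A single numeric column standing in for the shipping flag
train_shipping = csr_matrix([[1], [0], [1]])
test_shipping = csr_matrix([[0]])

# Stack the blocks side by side; both matrices end up with the same columns
X_train = hstack((train_shipping, X_train_name)).tocsr()
X_test = hstack((test_shipping, X_test_name)).tocsr()

print(X_train.shape, X_test.shape)
```

Calling `fit_transform` again on the test set would learn a different vocabulary and produce a matrix whose columns no longer match what the model was trained on.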

### To be continued...