<img src="https://www.mercari.com/assets/img/help_center/us/ogp.png"/>

# Mercari Price Suggestion Challenge
***
### Can you automatically suggest product prices to online sellers?

**Product pricing gets even harder at scale**, considering just how many products are sold online. Clothing has strong seasonal pricing trends and is heavily influenced by brand names, while electronics have fluctuating prices based on product specs.

**Mercari**, Japan’s biggest community-powered shopping app, knows this problem deeply. They’d like to offer pricing suggestions to sellers, but this is tough because their sellers are enabled to put just about anything, or any bundle of things, on Mercari's marketplace.

In this competition, Mercari’s challenging you to **build an algorithm that automatically suggests the right product prices**. You’ll be provided user-inputted text descriptions of their products, including details like product category name, brand name, and item condition.

### Dataset Features

- **ID**: the id of the listing
- **Name:** the title of the listing
- **Item Condition:** the condition of the items provided by the seller
- **Category Name:** category of the listing
- **Brand Name:** brand of the listing
- **Shipping:** whether or not shipping cost was provided
- **Item Description:** the full description of the item
- **Price:** the price that the item was sold for. This is the target variable that you will predict. The unit is USD.

### Key Words
- Pricing Recommendation
- Product Features
- NLP
- C2C & B2C

**Source:** https://www.kaggle.com/c/mercari-price-suggestion-challenge

<img src = "https://cdn.dribbble.com/users/56196/screenshots/2281553/mobile-dribbble.gif"/>

# Representing and Mining Text
***
Since, text is the most **unstructured** form of all the available data, various types of noise are present in it and the data is not readily analyzable without any pre-processing. The entire process of cleaning and standardization of text, making it noise-free and ready for analysis is known as **text pre-processing**.

### Fundamental Concepts 

The importance of constructing mining-friendly data representations; Representation of text for data mining. 

### Important Terminologies
- **Document**: One piece of text. It could be a single sentence, a paragraph, or even a full page report. 
- **Tokens**: Also known as terms. It is simply just a word. So many tokens form a document. 
- **Corpus**: A collection of documents. 
- **Term Frequency (TF)**: Measures how often a term is in a single document
- **Inverse Document Frequency (IDF)**: distribution of a term over a corpus

### Pre-Processing Techniques
- **Stop Word Removal:** stop words are terms that have little no meaning in a given text. Think of it as the "noise" of data. Such terms include the words, "the", "a", "an", "to", and etc...
- **Bag of Words Representation: ** treats each word as a feature of the document

- **TFIDF**: a common value representation of terms. It boosts or weighs words that have low occurences. For example, if the word "play" is common, then there is little to no boost. But if the word "mercari" is rare, then it has more boosts/weight. 

- **N-grams**: Sequences of adjacent words as terms. For example, since a word by itself may have little to no value, but if you were to put two words together and analyze it as a pair, then it might add more meaning. For example, "iPhone" VS "iPhone Charger"

- **Stemming and Lemmatization**: Get the root meaning of the word

- **Topic Models**: A type of model that represents a set of topics from a sequence of words. 

# Table of Contents

### Review
- [Core NLP Concepts](#correlation)

- [Feature Pre-Processing](#race_economic)

- [Feature Engineering](#school_attendance)

### Latent Dirichlet Allocation (LDA)
- [Topic Modeling](#definition)


### Model Selection:
- [Ridge, LASSO, LGBM](#correlation)

- [Cross Validation](#race_economic)

- [RMSLE Definition](#school_attendance)

- [Tree Based VS Non-Tree Based](#student_performance)

- [Ensemble Blend](#math_test)

### Eli5 
- [Evaluating Text Features](#math_test)

### Conclusion
- [Future Work](#correlation)


<img src='https://wrm5sysfkg-flywheel.netdna-ssl.com/wp-content/uploads/2018/07/Chilmark-Report_NLP-Technology-e1531398420983.png'>

***
# Review
Remember everything that we did from feature pre-processing, to transformations, and to derivations are important for text data because these will be our features for our model. You're going to have to experiment with stop-words, n-grams, and various other methods to come up with a good representation for your model.

## Text Feature Pre-Processing
- Normalizing Text
- Stemming Words
- Lemmitizing Words
- Remove Stop Words
- N-Grams

## Text Feature Transformations
- Bag of Words Model
- Log Transformation of Price
- Dummy (One Hot Encoding) of Categorical Variables
- Sparse Matrices to compress the data

## Text Feature Derivations
Generate corpus & Make TF-IDF Weights

### CountVectorizer 
Replaces tokens(words in this case) by their counts in the document.

["dog", "dog", "cat"] - > ["dog": 2, "cat": 1]

### TFIDF
Replaces tokens by the relative importance of the token in the document. 

### LabelBinarizer 
Converts tokens to binary vectors.

["tree", "human", "alien", "tree"] - > ["tree": [1, 0, 0], "human": [0, 1, 0], "alien": [0, 0, 1]]

***
# Training Set

In [None]:
__author__ = "Data Science Dream Job"
__copyright__ = "Copyright 2018, Data Science Dream Job LLC"
__email__ = "info@datasciencedreamjob.com"

In [1]:
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer

from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import KFold, cross_val_score, train_test_split

from sklearn.linear_model import Ridge

## Get Training Data

In [2]:
# Get 10% of the Training Data
train = pd.read_csv('C:/Users/Randy/Desktop/training/train.tsv', sep = '\t')
reduced_X_train = train.sample(frac=0.1).reset_index(drop=True)
reduced_y_train = np.log1p(reduced_X_train['price'])

<img src='http://i63.tinypic.com/qrg9bb.png'/>

## Fast Data Cleaning

In [3]:
# Fast Cleaning of Data
reduced_X_train['category_name'] = reduced_X_train['category_name'].fillna('Other').astype(str)
reduced_X_train['brand_name'] = reduced_X_train['brand_name'].fillna('missing').astype(str)
reduced_X_train['shipping'] = reduced_X_train['shipping'].astype(str)
reduced_X_train['item_condition_id'] = reduced_X_train['item_condition_id'].astype(str)
reduced_X_train['item_description'] = reduced_X_train['item_description'].fillna('None')

<img src='https://mith.umd.edu/wp-content/uploads/2015/10/header_topic-modeling.png'>

# Topic Modeling
***

### LDA – Latent Dirichlet Allocation 
Its foundations are Probabilistic Graphical Models


### Why is Topic Modeling Useful?
There are several scenarios when topic modeling can prove useful. Here are some of them:

1. Text classification – Topic modeling can improve classification by grouping similar words together in topics rather than using each word as a feature
2. Recommender Systems – Using a similarity measure we can build recommender systems. If our system would recommend articles for readers, it will recommend articles with a topic structure similar to the articles the user has already read.
3. Uncovering Themes in Texts – Useful for detecting trends in online publications for example

### How does LDA Work?
LDA is an iterative algorithm. Here are the two main steps:

1. In the initialization stage, each word is assigned to a random topic.
2. Iteratively, the algorithm goes through each word and reassigns the word to a topic taking into consideration: What’s the probability of the word belonging to a topic and What’s the probability of the document to be generated by a topic

Each topic in a document are percentages that all add up to 1.

Uses CV instead of TFIDF as weights


In [4]:
%%time
from sklearn.decomposition import LatentDirichletAllocation

# Initialize CountVectorizer
cvectorizer = CountVectorizer(max_features=20000,
                              stop_words='english', 
                              lowercase=True)

# Fit it to our dataset
cvz = cvectorizer.fit_transform(reduced_X_train['item_description'])

# Initialize LDA Model with 10 Topics
lda_model = LatentDirichletAllocation(n_topics=10,
                                      random_state=42)

# Fit it to our CountVectorizer Transformation
X_topics = lda_model.fit_transform(cvz)

# Define variables
n_top_words = 10
topic_summaries = []

# Get the topic words
topic_word = lda_model.components_
# Get the vocabulary from the text features
vocab = cvectorizer.get_feature_names()

# Display the Topic Models
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))
    print('Topic {}: {}'.format(i, ' | '.join(topic_words)))



Topic 0: rm | color | colors | use | phone | look | great | adjustable | light | different
Topic 1: size | new | small | black | worn | pink | tags | brand | medium | white
Topic 2: description | included | silver | ring | looks | kit | disney | bracelet | jewelry | tall
Topic 3: free | shipping | new | price | rm | bundle | firm | brand | items | home
Topic 4: iphone | case | gold | plus | new | shoes | quality | high | 6s | charger
Topic 5: used | just | super | box | questions | need | don | cute | comes | feel
Topic 6: condition | size | great | good | worn | used | perfect | excellent | large | wear
Topic 7: pair | set | print | length | waist | cream | super | includes | warm | pack
Topic 8: bag | original | leather | color | shown | pockets | pants | picture | pocket | perfect
Topic 9: new | brand | oz | box | authentic | brush | works | makeup | skin | matte
Wall time: 2min 13s


<img src='https://pbs.twimg.com/profile_images/879279669031383040/e32kPbRC_400x400.jpg'>

***
# Eli5 


### Eli5  – Explain it Like I'm 5
- It's a library that allows you to see what your model has learned from the text features.
- Looking at features helps to **understand how your classifier works**. 


### Why is Eli5 Useful?
Looking at features helps to understand how classifier works. Maybe even more importantly, it helps to notice preprocessing bugs, data leaks, issues with task specification - all these nasty problems you get in a real world.

### How does Eli5 Work?
It shows you the correlation of each feature/text with the target variable. We can inspect features and weights because we’re using a bag-of-words vectorizer and a linear classifier (so there is a direct mapping between individual words and classifier coefficients). 

### Debugging Best Practices
- classifier assigns high weights to seemingly unrelated words like ‘do’ or ‘my’ -> Remove Stop Words

I think it might be more important to see what the model finds important and try to normalize maybe the top 10-30 tokens your particular model sees as important. Focusing on some of the top features and finding normalizations for that also mixed with some extra feature engineering has helped me push my score a little bit farther.

## Analyzing Item Description with Eli5

In [5]:
# Definte RMSLE Cross Validation Function
def rmsle_cv(model):
    kf = KFold(shuffle=True, random_state=42).get_n_splits(reduced_X_train['item_description'])
    rmse= np.sqrt(-cross_val_score(model, reduced_X_train['item_description'], reduced_y_train, scoring="neg_mean_squared_error", cv = kf))
    return(rmse.mean())

## Baseline Model with CountVectorizer

In [17]:
from sklearn.linear_model import Ridge

vec = CountVectorizer()
clf = Ridge(random_state=42)
pipe = make_pipeline(vec, clf)
pipe.fit(reduced_X_train['item_description'], reduced_y_train)

cv_rmsle = rmsle_cv(pipe)

print("The Validation Score is: " + str(cv_rmsle))

The Validation Score is: 0.695663755122


In [119]:
import eli5
eli5.show_weights(pipe, vec=vec, top=100, feature_filter=lambda x: x != '<BIAS>')

Weight?,Feature
+2.943,unlocked
+2.411,16gb
+2.281,imei
+1.971,dustbag
+1.961,gb
+1.944,14k
+1.917,receipt
+1.855,lululemon
+1.855,carly
+1.841,ugg


In [120]:
eli5.show_prediction(clf, doc=reduced_X_train['item_description'][1297], vec=vec)

Contribution?,Feature
2.772,<BIAS>
0.117,Highlighted in text (sum)


## Baseline Model with CountVectorizer and Stop Words

In [121]:
vec = CountVectorizer(stop_words='english')
clf = Ridge(random_state=42)
pipe = make_pipeline(vec, clf)
pipe.fit(reduced_X_train['item_description'], reduced_y_train)

cv_sw_rmsle = rmsle_cv(pipe)

print("The Validation Score is: " + str(cv_sw_rmsle))

The Validation Score is: 0.693154381376


In [122]:
eli5.show_prediction(clf, doc=reduced_X_train['item_description'][1297], vec=vec)

Contribution?,Feature
2.819,<BIAS>
-0.076,Highlighted in text (sum)


## Baseline Model with TF-IDF

In [123]:
vec = TfidfVectorizer()
clf = Ridge(random_state=42)
pipe = make_pipeline(vec, clf)
pipe.fit(reduced_X_train['item_description'], reduced_y_train)

tfidf_rmsle = rmsle_cv(pipe)

print("The Validation Score is: " + str(tfidf_rmsle))

The Validation Score is: 0.632846048742


In [124]:
eli5.show_prediction(clf, doc=reduced_X_train['item_description'][1297], vec=vec)

Contribution?,Feature
2.721,<BIAS>
0.107,Highlighted in text (sum)


## Baseline Model with TF-IDF and Stop Words

In [125]:
vec = TfidfVectorizer(stop_words='english')
clf = Ridge(random_state=42)
pipe = make_pipeline(vec, clf)
pipe.fit(reduced_X_train['item_description'], reduced_y_train)

tfidf_sw_rmsle = rmsle_cv(pipe)

print("The Validation Score is: " + str(tfidf_sw_rmsle))

The Validation Score is: 0.634721498961


In [126]:
eli5.show_prediction(clf, doc=reduced_X_train['item_description'][1297], vec=vec)

Contribution?,Feature
2.731,<BIAS>
0.077,Highlighted in text (sum)


## Baseline Model with TF-IDF, Stop Words, and N-Grams

In [127]:
vec = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
clf = Ridge(random_state=42)
pipe = make_pipeline(vec, clf)
pipe.fit(reduced_X_train['item_description'], reduced_y_train)

tfidf_sw_ng_rmsle = rmsle_cv(pipe)

print("The Validation Score is: " + str(tfidf_sw_ng_rmsle))

The Validation Score is: 0.626062273763


In [128]:
eli5.show_prediction(clf, doc=reduced_X_train['item_description'][1297], vec=vec)

Contribution?,Feature
2.772,<BIAS>
0.117,Highlighted in text (sum)


## RMSLE Summary

TF-IDF + Stop Words + N-Grams works best

In [27]:
print ("RMSLE Score: " + str(cv_rmsle) + " | CountVectorizer")
print ("RMSLE Score: " + str(cv_sw_rmsle) + " | CountVectorizer | Stop Words")
print ("RMSLE Score: " + str(tfidf_rmsle) + " | TF-IDF")
print ("RMSLE Score: " + str(tfidf_sw_rmsle) + " | TF-IDF | Stop Words")
print ("RMSLE Score: " + str(tfidf_sw_ng_rmsle) + " | TF-IDF | Stop Words | N-Grams")

RMSLE Score: 0.695663755122 | CountVectorizer
RMSLE Score: 0.693154381376 | CountVectorizer | Stop Words
RMSLE Score: 0.632846048742 | TF-IDF
RMSLE Score: 0.634721498961 | TF-IDF | Stop Words
RMSLE Score: 0.626062273763 | TF-IDF | Stop Words | N-Grams


***
# Feature Pre-Processing / Transformation

It's super modular. So you have an estimator or a transformer, then you have a pipeline, then you connect more than one transformations together

In [32]:
from sklearn.pipeline import FeatureUnion

default_preprocessor = CountVectorizer().build_preprocessor()

def build_preprocessor(field):
    field_idx = list(reduced_X_train.columns).index(field)
    return lambda x: default_preprocessor(x[field_idx])

vectorizer = FeatureUnion([
    ('name', CountVectorizer(
        ngram_range=(1, 2),
        max_features=50000,
        preprocessor=build_preprocessor('name'))),
    ('category_name', CountVectorizer(
        token_pattern='.+',
        preprocessor=build_preprocessor('category_name'))),
    ('brand_name', CountVectorizer(
        token_pattern='.+',
        preprocessor=build_preprocessor('brand_name'))),
    ('shipping', CountVectorizer(
        token_pattern='\d+',
        preprocessor=build_preprocessor('shipping'))),
    ('item_condition_id', CountVectorizer(
        token_pattern='\d+',
        preprocessor=build_preprocessor('item_condition_id'))),
    ('item_description', TfidfVectorizer(
        ngram_range=(1, 2),
        max_features=55000,
        stop_words='english',
        preprocessor=build_preprocessor('item_description'))),
])

# Modeling

- Ridge Regression
- LASSO Regression
- Light GBM

### Create Transformed Training Set

In [33]:
# Create Transformed Train Set
reduced_Xt_train = vectorizer.fit_transform(reduced_X_train.values)
reduced_Xt_train

<59338x107580 sparse matrix of type '<class 'numpy.float64'>'
	with 1875069 stored elements in Compressed Sparse Row format>

### Define RMSLE Function

**Why RMSLE?**
- It puts more penalty on **lower errors**
- This is used when you want to penalize **under estimates** more than **over estimates**.

Lets have a look at the below example

Case a) : Pi = 600, Ai = 1000 RMSE = 400, RMSLE = 0.5108

Case b) : Pi = 1400, Ai = 1000 RMSE = 400, RMSLE = 0.3365

As it is evident, the differences are same between actual and predicted in both the cases. RMSE treated them equally however RMSLE penalized the under estimate more than over estimate.

In [34]:
def get_rmsle(y, pred): return np.sqrt(mean_squared_error(y, pred))

## Ridge Cross Validation

In [35]:
%%time

# Create 3-Fold CV
cv = KFold(n_splits=3, shuffle=True, random_state=42)
for train_ids, valid_ids in cv.split(reduced_Xt_train):
    # Define LGBM Model
    model_ridge = Ridge(solver = "lsqr", fit_intercept=True, random_state=42)
    
    # Fit LGBM Model
    model_ridge.fit(reduced_Xt_train[train_ids], reduced_y_train[train_ids])
    
    # Predict & Evaluate Training Score
    y_pred_train = model_ridge.predict(reduced_Xt_train[train_ids])
    rmsle_train = get_rmsle(y_pred_train, reduced_y_train[train_ids])
    
    # Predict & Evaluate Validation Score
    y_pred_valid = model_ridge.predict(reduced_Xt_train[valid_ids])
    rmsle_valid = get_rmsle(y_pred_valid, reduced_y_train[valid_ids])
    
    print(f'LGBM Training RMSLE: {rmsle_train:.5f}')
    print(f'LGBM Validation RMSLE: {rmsle_valid:.5f}')




LGBM Training RMSLE: 0.22295
LGBM Validation RMSLE: 0.53216
LGBM Training RMSLE: 0.22198
LGBM Validation RMSLE: 0.53228
LGBM Training RMSLE: 0.22118
LGBM Validation RMSLE: 0.53346
Wall time: 9.14 s


## LASSO Cross Validation

Why did LASSO Perform way worse than Ridge?
- Ridge RMSLE: 0.53 
- LASSO RMSLE: 0.74

One reason why could be because since LASSO performs automatic feature selection. So keep in mind majority of our features are just words. It'll remove some of our text features. And this may not generalize well with new data. Because our dataset is suppose to capture and use all our words as features. 

In [36]:
%%time
from sklearn.linear_model import Lasso

# Create 3-Fold CV
cv = KFold(n_splits=3, shuffle=True, random_state=42)
for train_ids, valid_ids in cv.split(reduced_Xt_train):
    # Define LGBM Model
    model_LASSO = Lasso(fit_intercept=True, random_state=42)
    
    # Fit LGBM Model
    model_LASSO.fit(reduced_Xt_train[train_ids], reduced_y_train[train_ids])
    
    # Predict & Evaluate Training Score
    y_pred_train = model_LASSO.predict(reduced_Xt_train[train_ids])
    rmsle_train = get_rmsle(y_pred_train, reduced_y_train[train_ids])
    
    # Predict & Evaluate Validation Score
    y_pred_valid = model_LASSO.predict(reduced_Xt_train[valid_ids])
    rmsle_valid = get_rmsle(y_pred_valid, reduced_y_train[valid_ids])
    
    print(f'LASSO Training RMSLE: {rmsle_train:.5f}')
    print(f'LASSO Validation RMSLE: {rmsle_valid:.5f}')


LASSO Training RMSLE: 0.74948
LASSO Validation RMSLE: 0.74714
LASSO Training RMSLE: 0.74836
LASSO Validation RMSLE: 0.74938
LASSO Training RMSLE: 0.74826
LASSO Validation RMSLE: 0.74959
Wall time: 25.6 s


## LGBM Cross Validation

**Reference:** https://medium.com/@pushkarmandot/https-medium-com-pushkarmandot-what-is-lightgbm-how-to-implement-it-how-to-fine-tune-the-parameters-60347819b7fc

Why LightGBM?
- ‘Light’ because of its high speed. 
- Can handle the large size of data
- Can take lower memory to run
- Focuses on accuracy of results

Why not LightGBM?
- When you have small data
- Prone to overfitting
- Needs a lot of tuning

How Gradient Boosting Trees work:
- Learn a regression predictor
- Compute the error residual
- Learn to predict the residual

Tree Based Models Pros:
- Automatically approximates linear transformations
- Automaically approximates subtle interactions between features (such as ratios/sums/differences of features)
- Gracefully handles missing values
- Deals with outliers well

In [37]:
%%time
import lightgbm as lgb

# Create 3-Fold CV
cv = KFold(n_splits=3, shuffle=True, random_state=42)
for train_ids, valid_ids in cv.split(reduced_Xt_train):
    # Define LGBM Model
    model_lgb = lgb.LGBMRegressor(num_leaves=31, n_jobs=-1, learning_rate=0.1, n_estimators=500, random_state=42)
    
    # Fit LGBM Model
    model_lgb.fit(reduced_Xt_train[train_ids], reduced_y_train[train_ids])
    
    # Predict & Evaluate Training Score
    y_pred_train = model_lgb.predict(reduced_Xt_train[train_ids])
    rmsle_train = get_rmsle(y_pred_train, reduced_y_train[train_ids])
    
    # Predict & Evaluate Validation Score
    y_pred_valid = model_lgb.predict(reduced_Xt_train[valid_ids])
    rmsle_valid = get_rmsle(y_pred_valid, reduced_y_train[valid_ids])
    
    print(f'LGBM Training RMSLE: {rmsle_train:.5f}')
    print(f'LGBM Validation RMSLE: {rmsle_valid:.5f}')
    

LGBM Training RMSLE: 0.42046
LGBM Validation RMSLE: 0.52510
LGBM Training RMSLE: 0.42012
LGBM Validation RMSLE: 0.52503
LGBM Training RMSLE: 0.41779
LGBM Validation RMSLE: 0.52963
Wall time: 1min 55s


<img src='https://image.slidesharecdn.com/ipbimprovingthemodelspredictivepowerwithensembleapproaches-121203224610-phpapp02/95/improving-the-models-predictive-power-with-ensemble-approaches-10-638.jpg?cb=1354575467'>

# Ensemble (Ridge + LGBM)

LGBM:
- Takes a while
- Starts with a weak learner and it is built upon each other to create more complex learners.

Ridge 
- Has a nice convex objective to maximise which is also easy to evaluate etc
- Unlike lasso or even elastic net it does not force parameters to be zero if their effect size is too small 
- Standard CountVectorizer features act mostly as dummies, and provided interactions between them are slight linear model should work quite well, as all you are mostly measuring is if the feature is on or off. 
- That's why better tokenisation / n-grams may improve the result, rather then attempting to capture "iPhone case" as interaction between iPhone and case keywords, it is better to capture "iPhone case" keyword directly.

Regularization:
- You can think of regularization as a technique that makes you fit your data less well so it can generalize it better on new data.
- Lasso does a variable selection, while Ridge does not
- When you have **highly-correlated variables**, **Ridge** shrinks the two coefficients. **Lasso** goes a step beyond that by setting them to zero and generally picks one over the other, and depending on the context you may not know which variable gets picked. **Elastic Net** does both: it shrinks and does sparse selection.

### Create Train/Test Split

In [38]:
# Train and Test Split
train_X, test_X, train_y, test_y = train_test_split(reduced_Xt_train, reduced_y_train, test_size=0.2, random_state=144)

### Define LGBM Model

In [39]:
# Define LGBM Model
model_lgb = lgb.LGBMRegressor(num_leaves=31, n_jobs=-1, learning_rate=0.1, n_estimators=500, random_state=42)

# Fit LGBM Model
model_lgb.fit(train_X, train_y)

# Predict with LGBM Model
lgbm_y_pred = model_lgb.predict(test_X)

### Define Ridge Model

In [40]:
# Define Ridge Model
model_ridge = Ridge(solver = "lsqr", fit_intercept=True, random_state=42)
    
# Fit Ridge Model
model_ridge.fit(train_X, train_y)
    
# Evaluate Training Score
ridge_y_pred = model_ridge.predict(test_X)



### Define Ensemble Model

**Definition::** Essentially an ensemble is where you get a bunch of indepdent models and you combine them toget one prediction. This is how the **Netflix Competition** was won. 
Tips for Ensemble:
- Start by guessing random weights. Start exploiting with a certain chance (say, 20% of the time) by nudging a weight up or down. Gradually increase exploitation rate.
- This should not be the case. Let's say you have a perfect and a lousy model, and you are trying to average them by finding optimal weights.  A weight-averaged model should always be at least as good as your best model, and in most cases it is better.
- Using different preprocessing methods like stemming/not stemming, different vectorizations, different tokenization can add diversity to the model, and may improve the quality of the ensemble. 

In [41]:
ensemble_y_pred = (lgbm_y_pred+ridge_y_pred)/2

ensemble_rmsle = get_rmsle(ensemble_y_pred, test_y)

print(f'Ensemble RMSLE: {ensemble_rmsle:.5f}')

Ensemble RMSLE: 0.49624


***
# Predictions

### Ensemble Predictions without Inverse Log Transformation

In [42]:
ensemble_y_pred[0:20]

array([ 2.69347021,  2.63638384,  3.99386934,  2.77175101,  2.27600616,
        2.85199986,  3.24552176,  3.92270868,  2.44183053,  3.17645482,
        3.70630063,  3.14630236,  3.64366471,  3.64104892,  3.10760259,
        2.391616  ,  2.57408646,  2.44544235,  2.72218835,  3.24296557])

### Ensemble Predictions (Inverse Log - Exponential)

In [45]:
# Esenmble Predictions
ensemble_y = (np.expm1(lgbm_y_pred)+np.expm1(ridge_y_pred))/2
ensemble_y[200:220]

array([  20.47281458,    8.09170275,   15.90818501,   71.21181587,
         11.37496636,   22.4514398 ,   23.99851141,   15.80124167,
         36.74616081,  110.36621236,    7.73942427,   29.54603   ,
         10.1671548 ,   21.37622152,   11.60641601,   11.43833577,
          8.12514104,   10.85690302,   18.18292294,   43.96665327])

### Test Predictions (Inverse Log - Exponential)

In [46]:
# Test Predictions 
np.expm1(test_y[200:220])

29801     11.0
22100      9.0
54120      9.0
12289     70.0
3446       9.0
47819     45.0
5742      21.0
53126     25.0
2541      43.0
22832    100.0
3670       7.0
38425     18.0
17062     12.0
4853      20.0
22573      4.0
12909     10.0
23700      8.0
56893     10.0
26935     20.0
54656     44.0
Name: price, dtype: float64

<img src='https://smist08.files.wordpress.com/2017/03/kaggle.jpg'/>

# Conclusion

### 1. Kaggle is a great way to learn Data Science and apply a lot of cool techniques
### 2. You learn the most by applying 
### 3. Join a competition, not to hopefully win, but to ultimately learn
### 4. Share your work and learn from others through discussions
### 5. Text data is dirty