# Avito Demand Prediction Challenge - LightGBM Model

## Introduction
### The challenge
When selling used goods online, a combination of tiny, nuanced details in a product description can make a big difference in drumming up interest.

Avito, Russia’s largest classified advertisements website, is deeply familiar with this problem. Sellers on their platform sometimes feel frustrated by either too little demand (indicating something is wrong with the product or the product listing) or too much demand (indicating a hot item with a good description was underpriced).

In [their fourth Kaggle competition](https://kaggle.com/c/avito-demand-prediction), Avito is challenging participants to predict demand for an online advertisement based on its full description (title, description, images, etc.), its context (geographically where it was posted, similar ads already posted) and historical demand for similar ads in similar contexts. With this information, Avito can inform sellers on how to best optimize their listing and provide some indication of how much interest they should realistically expect to receive.

The data files, as described on the [data page](https://www.kaggle.com/c/avito-demand-prediction/data), are:

* train.csv - Train data.
* test.csv - Test data. Same schema as the train data, minus deal_probability.
* train_active.csv - Supplemental data from ads that were displayed during the same period as train.csv. Same schema as the train data, minus deal_probability.
* test_active.csv - Supplemental data from ads that were displayed during the same period as test.csv. Same schema as the train data, minus deal_probability.
* periods_train.csv - Supplemental data showing the dates when the ads from train_active.csv were activated and when they were displayed.
* periods_test.csv - Supplemental data showing the dates when the ads from test_active.csv were activated and when they were displayed. Same schema as periods_train.csv, except that the item ids map to an ad in test_active.csv.
* train_jpg.zip - Images from the ads in train.csv.
* test_jpg.zip - Images from the ads in test.csv.
* sample_submission.csv - A sample submission in the correct format.

### LightGBM model

In this notebook, we will train a [LightGBM](https://github.com/Microsoft/LightGBM) model to predict deal probability. LightGBM is a fast, distributed, high performance gradient boosting framework based on decision tree algorithms. It is under the umbrella of the [DMTK](http://github.com/microsoft/dmtk) project of Microsoft.

We break this notebook down into five steps.

- [Step 1](#step1): Read in the CSV files
- [Step 2](#step2): Engineer features: text, categorical and numerical features
- [Step 3](#step3): Perform Ridge Regression and k-fold cross validation
- [Step 4](#step4): Train and validate the model
- [Step 5](#step5): Predict deal probabilities

First, let's import some libraries.

In [2]:
import numpy as np
import pandas as pd

import gc
import random
random.seed(2018)

from sklearn import metrics
from sklearn.metrics import mean_squared_error
from sklearn import feature_selection
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.linear_model import Ridge
# sklearn.cross_validation was removed in scikit-learn 0.20; KFold now lives in model_selection
from sklearn.model_selection import KFold
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion
from scipy.sparse import hstack, csr_matrix

from nltk.corpus import stopwords 

import lightgbm as lgb

import matplotlib.pyplot as plt

import string

from utils import *

%matplotlib inline
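
Older scikit-learn releases exposed `KFold` under `sklearn.cross_validation`, a module that was removed in version 0.20. On current releases the class lives in `sklearn.model_selection` and its interface differs: the number of folds is passed as `n_splits`, and the data is passed to `split()` rather than to the constructor. The following is a minimal sketch of that newer interface on a toy array; the array and the fold count are illustrative assumptions, not values taken from this notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy feature matrix with 10 rows (illustrative only).
X = np.arange(20).reshape(10, 2)

# n_splits replaces the old n_folds argument; the data goes to split().
kf = KFold(n_splits=5, shuffle=True, random_state=2018)

fold_sizes = []
for train_idx, valid_idx in kf.split(X):
    # Each iteration yields integer row positions for the two partitions.
    fold_sizes.append((len(train_idx), len(valid_idx)))

print(fold_sizes)
```

Code written against the old constructor, which took the number of samples as its first argument, would need to be adapted accordingly.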

<a id='step1'></a>
## Load data

We load the training and test sets first, then a pre-computed table of aggregated features keyed by `user_id`.

In [3]:
train = pd.read_csv('./csv/train.csv', index_col = "item_id", parse_dates = ["activation_date"])
test = pd.read_csv('./csv/test.csv', index_col = "item_id", parse_dates = ["activation_date"])
train.head()

| item_id | user_id | region | city | parent_category_name | category_name | param_1 | param_2 | param_3 | title | description | price | item_seq_number | activation_date | user_type | image | image_top_1 | deal_probability |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| b912c3c6a6ad | e00f8ff2eaf9 | Свердловская область | Екатеринбург | Личные вещи | Товары для детей и игрушки | Постельные принадлежности | | | Кокоби(кокон для сна) | Кокон для сна малыша,пользовались меньше месяц... | 400.0 | 2 | 2017-03-28 | Private | d10c7e016e03247a3bf2d13348fe959fe6f436c1caf64c... | 1008.0 | 0.12789 |
| 2dac0150717d | 39aeb48f0017 | Самарская область | Самара | Для дома и дачи | Мебель и интерьер | Другое | | | Стойка для Одежды | Стойка для одежды, под вешалки. С бутика. | 3000.0 | 19 | 2017-03-26 | Private | 79c9392cc51a9c81c6eb91eceb8e552171db39d7142700... | 692.0 | 0.0 |
| ba83aefab5dc | 91e2f88dd6e3 | Ростовская область | Ростов-на-Дону | Бытовая электроника | Аудио и видео | Видео, DVD и Blu-ray плееры | | | Philips bluray | В хорошем состоянии, домашний кинотеатр с blu ... | 4000.0 | 9 | 2017-03-20 | Private | b7f250ee3f39e1fedd77c141f273703f4a9be59db4b48a... | 3032.0 | 0.43177 |
| 02996f1dd2ea | bf5cccea572d | Татарстан | Набережные Челны | Личные вещи | Товары для детей и игрушки | Автомобильные кресла | | | Автокресло | Продам кресло от0-25кг | 2200.0 | 286 | 2017-03-25 | Company | e6ef97e0725637ea84e3d203e82dadb43ed3cc0a1c8413... | 796.0 | 0.80323 |
| 7c90be56d2ab | ef50846afc0b | Волгоградская область | Волгоград | Транспорт | Автомобили | С пробегом | ВАЗ (LADA) | 2110.0 | ВАЗ 2110, 2003 | Все вопросы по телефону. | 40000.0 | 3 | 2017-03-16 | Private | 54a687a3a0fc1d68aed99bdaaf551c5c70b761b16fd0a2... | 2264.0 | 0.20797 |
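
To make the two `read_csv` options used above concrete, here is a small sketch on an in-memory CSV. The three rows are a hypothetical sample that only borrows a few column names from train.csv; it illustrates that `index_col` turns `item_id` into the row index and that `parse_dates` yields a datetime column.

```python
import io
import pandas as pd

# Hypothetical three-row sample; only the column names match train.csv.
sample_csv = io.StringIO(
    "item_id,user_id,price,activation_date,deal_probability\n"
    "a1,u1,400.0,2017-03-28,0.12789\n"
    "a2,u2,3000.0,2017-03-26,0.0\n"
    "a3,u1,,2017-03-20,0.43177\n"
)

sample = pd.read_csv(sample_csv, index_col="item_id", parse_dates=["activation_date"])

# item_id is now the row index rather than a regular column.
print(sample.index.name)

# activation_date is a datetime column, so the .dt accessors are available
# (dayofweek counts Monday as 0).
weekdays = sample["activation_date"].dt.dayofweek.tolist()
print(weekdays)

# An empty price field is read as NaN.
n_missing_price = int(sample["price"].isna().sum())
print(n_missing_price)
```

Having the date parsed up front makes calendar features such as the day of the week a one-line operation later on.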


In [5]:
# Load and merge aggregated features into training and testing dataframes
af = pd.read_csv('csv/aggregated_features.csv', index_col=False)
train = train.merge(af, on='user_id', how='left')
test = test.merge(af, on='user_id', how='left')

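# Names of the aggregated feature columns: everything after the first
# column, which is assumed here to be user_id
agg_cols = list(af.columns)[1:]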

del af
gc.collect()

57
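
One pandas detail is worth keeping in mind with a merge like the one above: when frames are joined on a column, the index of the left frame is not carried over, and the result gets a fresh integer index. The sketch below shows this on two miniature frames and one possible way to keep `item_id` if it is needed later. The frames and the `avg_days_up_user` column name are hypothetical stand-ins, not the contents of aggregated_features.csv.

```python
import pandas as pd

# Hypothetical miniature frames; the aggregate column name is made up.
ads = pd.DataFrame(
    {"user_id": ["u1", "u2", "u3"], "price": [400.0, 3000.0, 4000.0]},
    index=pd.Index(["a1", "a2", "a3"], name="item_id"),
)
agg = pd.DataFrame({"user_id": ["u1", "u2"], "avg_days_up_user": [12.5, 3.0]})

# A column-on-column merge ignores the left index: the result has a RangeIndex.
merged = ads.merge(agg, on="user_id", how="left")
print(merged.index.tolist())

# Moving item_id into a column first, then restoring it, keeps the ids.
merged_keep = (
    ads.reset_index()
       .merge(agg, on="user_id", how="left")
       .set_index("item_id")
)
print(merged_keep.index.tolist())

# Users absent from the aggregate table get NaN after a left merge.
n_unmatched = int(merged_keep["avg_days_up_user"].isna().sum())
print(n_unmatched)
```

A left merge keeps every ad, so the row count is unchanged as long as `user_id` is unique in the aggregate table.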

We combine the training and test data so that each feature is engineered once, and consistently, over all rows.

In [6]:
# Make a copy of the target column before merging
y_train = train.deal_probability.copy()
train.drop("deal_probability", axis=1, inplace=True)

print('Train shape:', train.shape)
print('Test shape:', test.shape)

# Combine training and testing data
df = pd.concat([train, test], axis=0)
del train, test
gc.collect()
print('All data shape:', df.shape)

Train shape: (1503424, 19)
Test shape: (508438, 19)
All data shape: (2011862, 19)
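
Because `pd.concat` stacks the frames in the order given, the combined frame can be split back into its training and test parts by position once the features are built. The following is a minimal sketch on toy data; `n_train`, the toy frames, and the `price_rank` feature are illustrative assumptions rather than names used elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-ins for the training and test frames.
train_part = pd.DataFrame({"price": [400.0, 3000.0, 4000.0]})
test_part = pd.DataFrame({"price": [2200.0, 40000.0]})

# Remember how many rows belong to the training set before combining.
n_train = train_part.shape[0]
combined = pd.concat([train_part, test_part], axis=0)

# Example of a feature computed once over all rows.
combined["price_rank"] = combined["price"].rank(pct=True)

# Positional slicing recovers the two parts because concat preserves row order.
train_feats = combined.iloc[:n_train]
test_feats = combined.iloc[n_train:]
print(train_feats.shape, test_feats.shape)
```

In the notebook itself, the length of `y_train` could serve the same purpose as `n_train`, since the target was copied before the two frames were concatenated.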
