# Criteo Sponsored Search Conversion Log Dataset 
## Exploratory Data Analysis
This dataset contains logs obtained from Criteo Predictive Search (CPS). Each row represents an action (i.e., a click) performed by a user on a product-related advertisement. The advertisement was shown to the user after the user expressed an intent via an online search engine. Each row contains information about the product characteristics (age, brand, gender, price), the time of the click (subject to a uniform shift), user characteristics, and device information. The logs also record whether the click eventually led to a conversion (the product was bought) within a 30-day window, and the time between the click and the conversion.

This dataset represents a sample of 90 days of Criteo live traffic data. Each line corresponds to one click on a product-related advertisement that was displayed to a user. For each advertisement, detailed information is provided about the product. Each line also records whether the click led to a conversion, the amount of the conversion, and the time between the click and the conversion.

**Delimiter**: \t (tab-separated)

**Missing Value Indicator**: -1 (for click_timestamp, the missing value indicator is 0)

**Outcome/Labels**
- Sale: 1 if a conversion occurred and 0 if not.
- SalesAmountInEuro: The revenue obtained when a conversion took place. This may differ from product_price due to attribution issues. It is -1 when no conversion took place.
- Time_delay_for_conversion: The time between the click and the conversion. It is -1 when no conversion took place.
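
The relationship between the three label columns can be checked with a few lines of pandas. The sketch below uses a handful of made-up rows (not taken from the dataset) and assumes the column names listed above. It verifies that non-converted clicks carry the -1 marker and computes a conversion rate. Note that a converted row could in principle also hold -1 if its amount is missing, so the check only runs in one direction.

```python
import pandas as pd

# Toy rows fabricated for illustration only; they are not real dataset values.
labels = pd.DataFrame({
    "Sale": [1, 0, 0, 1],
    "SalesAmountInEuro": [49.9, -1.0, -1.0, 12.0],
    "Time_delay_for_conversion": [7200, -1, -1, 86400],
})

converted = labels["Sale"] == 1

# Clicks without a conversion should carry -1 in both remaining label columns.
consistent = bool(
    (labels.loc[~converted, "SalesAmountInEuro"] == -1).all()
    and (labels.loc[~converted, "Time_delay_for_conversion"] == -1).all()
)

# Share of clicks that converted, and mean revenue over converted clicks only.
conversion_rate = converted.mean()
mean_sales_amount = labels.loc[converted, "SalesAmountInEuro"].mean()
```

Restricting the revenue average to converted rows keeps the -1 markers from dragging the mean down.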

**Features**
- click_timestamp: Timestamp of the click. The dataset is sorted according to timestamp.
- nb_clicks_1week: Number of clicks the product-related advertisement has received in the last week.
- product_price: Price of the product shown in the advertisement.
- product_age_group: The age group of the users the product is intended for.
- device_type: Indicates whether the user is a returning or a new user on mobile, tablet, or desktop.
- audience_id: We do not disclose the meaning of this feature.
- product_gender: The gender of the users the product is intended for.
- product_brand: Categorical feature about the brand of the product.
- product_category(1-7): Categorical features associated with the product. We do not disclose the meaning of these features.
- product_country: Country in which the product is sold.
- product_id: Unique identifier associated with every product.
- product_title: Hashed title of the product.
- partner_id: Unique identifier associated with the seller of the product.
- user_id: Unique identifier associated with every user.

Note: All categorical features have been hashed.
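
The delimiter and missing-value conventions described above can be sketched in pandas as follows. This is a minimal illustration on an in-memory sample, not the loading code used later in this notebook. The header row, the subset of columns, and the row values are all assumptions made for the example.

```python
import io

import numpy as np
import pandas as pd

# In-memory stand-in for the raw file. The header row and the values are
# assumptions for illustration; the real file layout may differ.
raw = io.StringIO(
    "Sale\tSalesAmountInEuro\tTime_delay_for_conversion\tclick_timestamp\tproduct_price\n"
    "1\t25.5\t3600\t1598891820\t25.5\n"
    "0\t-1\t-1\t0\t-1\n"
)
df = pd.read_csv(raw, sep="\t")

# -1 marks a missing value everywhere except click_timestamp, which uses 0.
for col in df.columns:
    marker = 0 if col == "click_timestamp" else -1
    df[col] = df[col].mask(df[col] == marker, np.nan)

missing_per_column = df.isna().sum()
```

Converting the markers to `NaN` up front lets pandas summary statistics skip them automatically instead of treating -1 as a real price or delay.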

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from IPython.display import display, HTML
pd.options.display.float_format = '{:,}'.format
#-----------------------------------------
from cvr.data.processing import Data


## 1. Acquire Data
### 1.0 Parameters

In [None]:
# Raw string so the backslashes are not interpreted as escape sequences.
filepath = r"data\external\criteo\criteo.txt"
dev = True
sample_size = 0.01
sep = "\t"
random_state = 55

### 1.1. Load Data
`Data` is the class responsible for manipulating the data. We pass it the filepath and some information about the file structure, and it loads the raw data. Calling the `stage` method stages the data for analysis.

In [None]:
# Pass sample_size (defined in the parameters cell); `sample` was undefined.
cd = Data(filepath, dev, sep, sample_size, random_state)
cd.stage()
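
For intuition about the `sample_size` and `random_state` parameters, here is a hedged sketch of how a reproducible development sample could be drawn with plain pandas. It assumes `sample_size` is a fraction of rows and `random_state` seeds the draw. The internals of the `Data` class are not shown here, so this is not a description of its implementation, and the `full` frame and its `click_id` column are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd

sample_size = 0.01   # assumed to mean "fraction of rows to keep"
random_state = 55    # seed that makes the draw repeatable

# Hypothetical stand-in for the full dataset: 1,000 rows with a dummy id.
full = pd.DataFrame({"click_id": np.arange(1000)})

# Draw 1% of the rows; the same seed yields the same rows on every call.
dev_sample = full.sample(frac=sample_size, random_state=random_state)
repeat_sample = full.sample(frac=sample_size, random_state=random_state)
```

Fixing the seed keeps exploratory results stable between notebook runs while still working on a small slice of the data.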

## 2. Explore the Data
CriteoEDA class supports the exploration and analysis of the data. Let's instantiate the EDA class with 