### Data balance - RPC = 0
I also found many data points with RPC = 0. This class imbalance was a problem for my model. I interpret these points as bad products: products that people will not buy. Let's try to build a classifier that finds those bad products.

# 1. Helper definitions

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
def split_train_test(data):
    # Date cutoff chosen manually so that about 20% of the rows land in the test set.
    data_train = data[data["Date"] <= "2015-03-19"]
    data_test = data[data["Date"] > "2015-03-19"]

    # integer percentage of rows that ended up in the training set
    split_percentage = len(data_train) * 100 // (len(data_train) + len(data_test))
    print("INFO - percentage of the data in training set: " + str(split_percentage) + "%")
    
    return data_train, data_test


def assign_group(rpc):
    if rpc == 0:
        return 0
    else:
        return 1

    
def rpc_feature(data):
    return data["Revenue"].apply(float) / data["Clicks"]
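One edge case in these helpers is worth seeing on toy data: a row with zero clicks gives an RPC of NaN or inf, and since neither equals 0 the labeling rule puts it in group 1. The sketch below uses made-up rows (only the column names come from this notebook) and inlines the same expressions as `rpc_feature` and `assign_group`.

```python
import pandas as pd

# Made-up rows for illustration; only the column names come from the notebook.
toy = pd.DataFrame({
    "Revenue": ["12.5", "0", "0", "3"],
    "Clicks": [50, 40, 0, 0],
})

# Same expression as rpc_feature: revenue per click.
rpc = toy["Revenue"].apply(float) / toy["Clicks"]

# Same rule as assign_group: 0 when RPC is exactly 0, otherwise 1.
labels = rpc.apply(lambda value: 0 if value == 0 else 1)

# Rows with Clicks == 0 yield NaN (0/0) or inf (x/0) and are labeled 1.
print(pd.DataFrame({"RPC": rpc, "product_category": labels}))
```

In this notebook the later `Clicks >= 30` filter removes such rows before training, so the model is not affected, but the `product_category` column of the unfiltered data still contains them.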

# 2. Modeling

In [3]:
# load dataset (10% random sample of the rows)
data = pd.read_csv("./sem-database.csv").sample(frac=0.1)

# feature creation
data["RPC"] = rpc_feature(data)
data["product_category"] = data["RPC"].apply(assign_group)

# split train / test data
data_train, data_test = split_train_test(data)

# we fit on data points with enough clicks (>= 30), and generalize afterward.
data_train = data_train.loc[(data_train["Clicks"] >= 30)]
data_test = data_test.loc[(data_test["Clicks"] >= 30)]

# we don't have many features, so I tried all possible combinations and kept the best one.
features = ["Account_ID", "Device_ID", "Match_type_ID"]

# build feature matrices and targets (IDs are cast to strings)
X_train = data_train[features].astype(str)
y_train = data_train["product_category"]
X_test = data_test[features].astype(str)
y_test = data_test["product_category"]

# train model
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train, y_train)

# get predictions
predictions = model.predict(X_test)

INFO - percentage of the data in training set: 80%
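A note on the features: `MultinomialNB` models feature values as counts, while `Account_ID`, `Device_ID` and `Match_type_ID` are category labels. Depending on the scikit-learn version, stringified IDs may be coerced to numbers or rejected. One possible alternative, which is not what was run above, is to one-hot encode the IDs first. A minimal sketch on made-up rows (the toy values and the name `onehot_model` are assumptions for illustration):

```python
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows standing in for sem-database.csv; only the column names are real.
toy = pd.DataFrame({
    "Account_ID": [1, 1, 2, 2, 3, 3],
    "Device_ID": [10, 20, 10, 20, 10, 20],
    "Match_type_ID": [5, 5, 6, 6, 5, 6],
    "product_category": [0, 0, 1, 1, 0, 1],
})
features = ["Account_ID", "Device_ID", "Match_type_ID"]

# One-hot encode each ID so the model sees 0/1 indicators, not ID magnitudes.
# handle_unknown="ignore" encodes IDs unseen during training as all zeros.
onehot_model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    MultinomialNB(),
)
onehot_model.fit(toy[features].astype(str), toy["product_category"])

# Second row contains an account ID (99) that was never seen in training.
new_rows = pd.DataFrame({
    "Account_ID": [2, 99],
    "Device_ID": [10, 20],
    "Match_type_ID": [6, 5],
})
toy_predictions = onehot_model.predict(new_rows[features].astype(str))
print(toy_predictions)
```

With this encoding the scores reported in the evaluation below would change, so the two variants should be compared on the same train/test split before choosing one.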


# 3. Evaluation

In [4]:
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# scores use scikit-learn's default positive class, 1 (RPC != 0)
print("Precision: " + str(precision_score(y_test, predictions) * 100))
print("Recall: " + str(recall_score(y_test, predictions) * 100))
print("F1: " + str(f1_score(y_test, predictions) * 100))
print("")
print(confusion_matrix(y_test, predictions))

Precision: 27.9069767442
Recall: 77.0053475936
F1: 40.9672830725

[[258 372]
 [ 43 144]]
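The scores above treat class 1 (RPC != 0) as the positive class, while the goal stated at the top is to find the bad products, which are class 0. Both views can be recomputed directly from the confusion matrix, whose rows are true classes and whose columns are predicted classes. The helper `class_metrics` below is a hypothetical name introduced for this sketch:

```python
# Confusion matrix copied from the output above:
# rows are true classes, columns are predicted classes.
cm = [[258, 372],
      [43, 144]]


def class_metrics(cm, label):
    """Precision, recall and F1 when `label` (0 or 1) is the positive class."""
    other = 1 - label
    tp = cm[label][label]   # true `label`, predicted `label`
    fp = cm[other][label]   # true other class, predicted `label`
    fn = cm[label][other]   # true `label`, predicted other class
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for label in (0, 1):
    p, r, f = class_metrics(cm, label)
    print("class %d: precision=%.1f%% recall=%.1f%% f1=%.1f%%"
          % (label, p * 100, r * 100, f * 100))
# class 0: precision=85.7% recall=41.0% f1=55.4%
# class 1: precision=27.9% recall=77.0% f1=41.0%
```

Read this way, the model flags bad products with high precision (258 of 301 predicted class 0 are correct) but misses most of them (258 of 630 found). With scikit-learn, the class 0 figures should also be obtainable by passing `pos_label=0` to `precision_score`, `recall_score` and `f1_score`.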
