### Feature RPC: modeling customer behaviour on the landing page
Let's define:
- customer: a user who clicks on an advertisement and arrives on the landing page of the e-commerce platform.

Let's then consider the following variables:
- X = the amount of FCUR (fantasy currency) a customer is going to buy; the X of different customers are assumed to be independent
- Revenue = the sum of X over all customers

According to the [Central Limit Theorem](https://en.wikipedia.org/wiki/Central_limit_theorem), the larger the population, the closer the distribution of the revenue (a sum of independent X) gets to a normal distribution. What is the minimum population needed to make this approximation? I am aware that many statisticians would argue about it, but according to Google's deep learning course on Udacity, a probability professor, and my personal experience, a population of 30 is the minimum for this approximation.

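Concretely, in our case, we are going to train the model of customer behaviour on data points with at least 30 clicks, and we will then predict on the whole population. Note that the evaluation in this notebook is also restricted to data points with at least 30 clicks.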
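
To get a feeling for the normal approximation described above, here is a minimal simulation sketch. It is not based on the real data: the purchase distribution (exponential with a mean of 10 FCUR), the function name `simulate_revenue_skewness` and the population sizes are all assumptions chosen for illustration. It measures the skewness of the simulated revenue, which is 0 for a perfectly normal distribution.

```python
import numpy as np

def simulate_revenue_skewness(n_customers, n_trials=20000, seed=0):
    """Return the sample skewness of simulated revenue for n_customers customers."""
    rng = np.random.RandomState(seed)
    # Hypothetical purchase distribution: exponential, i.e. strongly right-skewed.
    purchases = rng.exponential(scale=10.0, size=(n_trials, n_customers))
    # Revenue of one trial = sum of the purchases of its customers.
    revenue = purchases.sum(axis=1)
    centered = revenue - revenue.mean()
    # Skewness = third central moment / (second central moment ** 1.5).
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

skewness_by_size = {n: simulate_revenue_skewness(n) for n in (1, 5, 30, 200)}
for n in sorted(skewness_by_size):
    print("n = %3d  skewness = %.2f" % (n, skewness_by_size[n]))
```

For exponential purchases the theoretical skewness of the sum is 2/sqrt(n), so at n = 30 it is small but not zero. This is one reason why the threshold of 30 is a rule of thumb rather than a guarantee.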

# 1. Helper definitions

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
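def split_train_test(data):
    # Split on a cutoff date (around 20% of the population, found manually).
    data_train = data[data["Date"] <= "2015-03-19"]
    data_test = data[data["Date"] > "2015-03-19"]

    # Keep only data points with at least 30 clicks (normal approximation).
    # Each mask is built from the frame it filters, not from the full data set.
    data_train = data_train.loc[data_train["Clicks"] >= 30]
    data_test = data_test.loc[data_test["Clicks"] >= 30]

    # 100.0 forces a float division on both Python 2 and Python 3.
    split_percentage = len(data_train) * 100.0 / (len(data_train) + len(data_test))
    print("INFO - percentage of the data in training set: %.1f%%" % split_percentage)

    return data_train, data_test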

# 2. Modeling

In [None]:
# load data
data = pd.read_csv("./sem-database.csv")

# feature extraction: RPC = revenue per click
data["RPC"] = data["Revenue"].astype(float) / data["Clicks"]
data = data.loc[data["RPC"] > 0]  # We will build something special for RPC = 0

# split train/test
data_train, data_test = split_train_test(data)

# delete outliers. Not cheating: deleting them only on data_train ;-)
# Note: the 99% border itself is computed on all rows with RPC > 0 (train and test).
outliers_border = data.RPC.quantile(.99)
data_train = data_train.loc[data_train.RPC < outliers_border]
print("INFO - Outlier border: " + str(int(outliers_border)))

# create model - categorical data are encoded to "dummies" (one 0/1 column per campaign)
campaign_dummies = pd.get_dummies(data["Campaign_ID"].astype(str))
X_train = campaign_dummies.loc[data_train.index]
y_train = data_train["RPC"]
X_test = campaign_dummies.loc[data_test.index]
y_test = data_test["RPC"]

# Train model
# (older scikit-learn releases spelled these arguments loss="squared_loss", n_iter=50)
from sklearn.linear_model import SGDRegressor
model = SGDRegressor(loss="squared_error", max_iter=50, tol=None, shuffle=True, random_state=2029)
model.fit(X_train, y_train)

# Get predictions
predictions = model.predict(X_test)
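
The dummy encoding used above turns each campaign ID into its own 0/1 column, so the linear model learns roughly one weight per campaign. A toy illustration follows; the campaign IDs and the names `toy` and `toy_dummies` are made up and do not come from `sem-database.csv`.

```python
import pandas as pd

# Hypothetical campaign IDs, for illustration only.
toy = pd.DataFrame({"Campaign_ID": [101, 102, 101, 103]})

# Cast to str first so the IDs are treated as categories, not as numbers.
toy_dummies = pd.get_dummies(toy["Campaign_ID"].astype(str))

# One column per distinct campaign; each row has exactly one active column.
print(toy_dummies.astype(int))
```

Because the encoding is computed once on the full `data` frame in the cell above, the training and test matrices share the same columns, even for campaigns that appear in only one of the two sets.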

# 3. Numerical evaluation

In [None]:
from sklearn.metrics import mean_squared_error

d = mean_squared_error(y_test, predictions)
print("Mean squared error: " + str(d))
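
A mean squared error is hard to judge on its own. One common sanity check, which is not part of the original notebook, is to compare it with a constant baseline that always predicts the mean RPC of the training set. The sketch below uses synthetic arrays as stand-ins for `y_train`, `y_test` and `predictions`; every `toy_*` name and every number in it is an assumption for illustration. The MSE is computed with NumPy, which for 1-D arrays gives the same value as `mean_squared_error`.

```python
import numpy as np

def mse(expected, predicted):
    """Mean squared error between two 1-D arrays."""
    expected = np.asarray(expected, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean((expected - predicted) ** 2)

# Synthetic stand-ins for y_train, y_test and predictions (illustration only).
rng = np.random.RandomState(0)
toy_y_train = rng.gamma(shape=2.0, scale=50.0, size=200)
toy_y_test = rng.gamma(shape=2.0, scale=50.0, size=50)
toy_predictions = toy_y_test + rng.normal(scale=20.0, size=50)

# Baseline: always predict the mean RPC seen during training.
toy_baseline = np.full(len(toy_y_test), toy_y_train.mean())

model_mse = mse(toy_y_test, toy_predictions)
baseline_mse = mse(toy_y_test, toy_baseline)
print("Model MSE:    %.1f (RMSE %.1f)" % (model_mse, np.sqrt(model_mse)))
print("Baseline MSE: %.1f (RMSE %.1f)" % (baseline_mse, np.sqrt(baseline_mse)))
```

The RMSE is expressed in the same unit as RPC, which makes it easier to read than the MSE. If the model does not clearly beat the baseline, the campaign dummies carry little information.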

# 4. Graphical evaluation
Visualisation of the discrepancy between expected and predicted data

In [None]:
dark_blue = "#3333FF"
light_blue = "#AAAAFF"

distance = pd.DataFrame()
distance["expected"] = y_test
distance["predicted"] = predictions

sample_size = 30
visualized_distance = distance.sample(n=sample_size)
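# The labels are required for plt.legend() to have entries to display.
plt.scatter(range(sample_size), visualized_distance["expected"], color=dark_blue, label="expected")
plt.scatter(range(sample_size), visualized_distance["predicted"], color=light_blue, label="predicted")

# Graph formatting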
plt.xlabel("Sample (ID)")
plt.ylabel("RPC")
plt.title("Digging into model: difference estimated vs expected")
plt.legend()
plt.show()

<img src="https://user-images.githubusercontent.com/1684807/28826438-757b1caa-76ca-11e7-899d-98890dec49de.png">