# Teads challenge

In [1]:
import pandas
df = pandas.read_csv("aft100k.csv")
print(df.head(10))

   creative_id user_operating_system       user_device  \
0       113521               Android             Phone   
1       115340               Windows  PersonalComputer   
2       113582               Android             Phone   
3        97385               Windows  PersonalComputer   
4       114821               Windows  PersonalComputer   
5       113065               Android            Tablet   
6       111414                   iOS             Phone   
7       111414                   iOS             Phone   
8       112705               Windows  PersonalComputer   
9       113176                 macOS  PersonalComputer   

   average_seconds_played      cost  revenue  
0                     NaN  0.010128      0.0  
1                0.000000  0.005937      0.0  
2                7.142857  0.004398      0.0  
3                     NaN  0.006157      0.0  
4                     NaN  0.001994      0.0  
5                     NaN  0.003781      0.0  
6               17.000000  0.002

- creative_id: A unique identifier of the video that has been displayed to the user
- user_operating_system: The user Operating System (OS)
- user_device: The user device type
- average_seconds_played: The average number of seconds the user usually watches our videos (only if we already know the user, based on the user history)
- cost: The cost we had to pay to display the video
- revenue: The revenue generated by this video when it has been watched


## Preliminary questions

####  1) The margin being defined as (revenue - cost) / revenue, what is our global margin based on this log?

In [2]:
print("Are there null revenues? {}".format(df["revenue"].isnull().values.any()))
print("Are there null costs? {}".format(df["cost"].isnull().values.any()))


Are there null revenues? False
Are there null costs? False


In [3]:
revs = df["revenue"].sum()
costs = df["cost"].sum()
margin = (revs-costs)/revs
print("Global margin is {0:.4f}, or {1:.2f}%".format(margin, margin*100))

Global margin is 0.2719, or 27.19%


#### 2) Can you think of other interesting metrics to optimize?

Not in the csv: Click-through rate (CTR), Conversion rate.

Metrics to optimize, from the csv:

1) Profits - No explanation needed.

2) Profits per device - We want to make sure all devices are profitable.

3) Profits per OS - We want to make sure all OSs are profitable.

4) Profits per video - We want to check that all ads are profitable.

I would also like to understand why some videos take more seconds than others to bring in money: some of them don't make us any money in 30 seconds, while others do in 10 seconds. It is quite possible that we are losing money on some types of ads. Maybe we should only display them on platforms where people are willing to spend more time watching ads, and this is a metric we can get from our csv: the number of seconds played per device.
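The per-group metrics listed above could be computed with a `groupby`. The following is only a sketch on a small made-up DataFrame (the name `toy` and all of its values are assumptions for illustration, not rows from the real log):

```python
import pandas as pd

# Toy stand-in for the log; the values are made up for illustration only.
toy = pd.DataFrame({
    "creative_id": [1, 1, 2, 2],
    "user_device": ["Phone", "Tablet", "Phone", "PersonalComputer"],
    "average_seconds_played": [10.0, None, 20.0, 5.0],
    "cost": [0.01, 0.02, 0.01, 0.03],
    "revenue": [0.05, 0.0, 0.0, 0.04],
})
toy["profit"] = toy["revenue"] - toy["cost"]

# Total profit per device type and per video.
profit_per_device = toy.groupby("user_device")["profit"].sum()
profit_per_video = toy.groupby("creative_id")["profit"].sum()

# Mean of average_seconds_played per device; NaNs are skipped by default.
seconds_per_device = toy.groupby("user_device")["average_seconds_played"].mean()

print(profit_per_device)
print(profit_per_video)
print(seconds_per_device)
```

On the real data, the same calls would be applied to `df` instead of `toy`.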



#### 3) What are the most profitable Operating Systems?

In [4]:
df_agg = df[['user_operating_system','cost', 'revenue']].groupby(
    ['user_operating_system']).sum()
# Profit per OS = total revenue - total cost
series_profits = df_agg["revenue"]-df_agg["cost"]
# Keep only the OS with the highest profit
print(series_profits[series_profits==series_profits.max()])

user_operating_system
iOS    115.521775
dtype: float64


## Machine learning questions

#### How would you use this historical data to predict the event 'we benefit from a revenue' (ie revenue > 0) in a dataset where the revenue is not known yet?

#### Compute the prediction accuracy of a well chosen algorithm and comment the results. Do not hesitate to describe your methodology.

The problem at hand is a supervised learning problem and, more specifically, a regression problem. We need to build a regressor from our dataset, using:

- Five features: *creative_id, user_operating_system, user_device, average_seconds_played, cost*.
- A label: *revenue*.
- A regression algorithm: We have a dataset with 100 000 examples, which is enough to use some of the more complex algorithms such as SVMs with kernels or ensemble methods (random forests, AdaBoost). Neural networks could also be considered but, given the scope of this work, they would take too long to train and optimize.


To validate our results, we will split the dataset in two: a training set (80%) and a test set (20%). Cross-validation with several folds would be a better technique, but it would take longer.

Before doing this, and just in case our data has some hidden time sequentiality, let's shuffle the data.

In [5]:
from sklearn.utils import shuffle

df = shuffle(df)

df_train = df[:len(df)*4//5]
df_test = df[len(df)*4//5:]
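For comparison, the 80/20 hold-out split and the cross-validation alternative can both be written with scikit-learn helpers. This is a sketch on synthetic arrays (`X_demo` and `y_demo` are made-up stand-ins, not the challenge data):

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Synthetic data standing in for the real features and label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# 80/20 hold-out split; train_test_split shuffles by default.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# 5-fold cross-validation: every example lands in a test fold exactly once.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [(len(tr), len(te)) for tr, te in kfold.split(X_demo)]

print(len(X_tr), len(X_te))
print(fold_sizes)
```

Fixing `random_state` makes the split reproducible, which helps when comparing models across runs.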


##### Data cleaning:

The first problem we face is that the features user_device and average_seconds_played are not present in some examples of the data set - or, in pandas' jargon, they are NaNs.

In [6]:
feature_cols = ['creative_id', 'user_operating_system', 'user_device', 
                'average_seconds_played', 'cost']
X_train = df_train.loc[:, feature_cols]

X_train.isnull().sum()

creative_id                   0
user_operating_system         0
user_device                   6
average_seconds_played    49453
cost                          0
dtype: int64

In the case of *user_device*, it is not a big problem: only 8 examples out of 100 000 are affected (6 of them in the training set), so we can discard them without a second thought. Substituting the NaNs with the average or something similar is also tricky, since this is a categorical feature (i.e., a string). We could use the most frequent value but, again, there is no need.


In [7]:
# Keep only the rows where user_device is known (~ negates the boolean mask)
X_train = X_train[~X_train["user_device"].isnull()]
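For reference, the two ways of handling a missing categorical value (dropping the row, or filling it with the most frequent value) might look as follows. This is a sketch on a made-up Series named `devices`, not the real column:

```python
import numpy as np
import pandas as pd

# Made-up device column with one missing entry.
devices = pd.Series(["Phone", "Phone", np.nan, "Tablet"], name="user_device")

# Option 1: drop the rows with a missing device.
dropped = devices.dropna()

# Option 2: fill missing devices with the most frequent value (the mode).
most_frequent = devices.mode()[0]
filled = devices.fillna(most_frequent)

print(dropped.tolist())
print(filled.tolist())
```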

In the case of *average_seconds_played*, the problem is different: more than half of the examples have no value for this feature. However, this is a numerical feature, which gives us two options:

1) Removal: We would end up with around 40% of the dataset. The impact of this could be huge, but at least we are not "tampering" with the data.

2) Substitution: We replace the missing values with the mean, mode or median of the feature. This looks more promising, and it will be our default approach.

It could be interesting to compare the performance of both options but, for the scope of this work, let's limit ourselves to the second option.

In [13]:
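import numpy as np
from sklearn.impute import SimpleImputer

# Replace missing values with the mean of the column
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit_transform returns an array of shape (n, 1); flatten it to fill the column
X_train["average_seconds_played"] = imp.fit_transform(
    X_train[["average_seconds_played"]]).ravel()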
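The two options for *average_seconds_played* (removal and substitution) can be put side by side in a small sketch. The frames `train` and `test` are made-up examples, and the median is used here only to show an alternative to the mean. One design point worth noting: the imputation statistic is learned on the training set and reused on the test set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up training and test columns with missing values.
train = pd.DataFrame({"average_seconds_played": [10.0, np.nan, 20.0, np.nan]})
test = pd.DataFrame({"average_seconds_played": [np.nan, 30.0]})

# Option 1: removal keeps only the rows where the value is known.
train_removed = train.dropna(subset=["average_seconds_played"])

# Option 2: imputation. The median is learned on the training set only
# and then reused on the test set, so no test information leaks in.
imputer = SimpleImputer(strategy="median")
train_imputed = imputer.fit_transform(train)
test_imputed = imputer.transform(test)

print(len(train_removed))
print(train_imputed.ravel())
print(test_imputed.ravel())
```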


##### Feature Generation and Selection

Feature generation and feature selection are among the most important and time-consuming processes in data science, and they rely heavily on visualization and on trial and error.

Because of this, I will just name some techniques that could be applied:

- Interactions: interaction between the videos and the operating systems
- Binning
- Statistical Transformations
- Log Transform
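As a rough illustration of three of these techniques (interaction, binning and log transform), here is a sketch on a made-up frame. The new column names (`creative_x_os`, `seconds_bin`, `log_cost`) and the bin edges are assumptions chosen for the example, not tuned values:

```python
import numpy as np
import pandas as pd

# Made-up rows using the same column names as the log.
toy = pd.DataFrame({
    "creative_id": [113521, 115340, 113582],
    "user_operating_system": ["Android", "Windows", "Android"],
    "average_seconds_played": [0.0, 7.0, 17.0],
    "cost": [0.010, 0.006, 0.004],
})

# Interaction: one categorical value per (video, OS) pair.
toy["creative_x_os"] = (toy["creative_id"].astype(str) + "_"
                        + toy["user_operating_system"])

# Binning: turn seconds played into coarse, ordered buckets.
toy["seconds_bin"] = pd.cut(toy["average_seconds_played"],
                            bins=[-np.inf, 5, 15, np.inf],
                            labels=["short", "medium", "long"])

# Log transform: log1p compresses the right tail and is defined at 0.
toy["log_cost"] = np.log1p(toy["cost"])

print(toy)
```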


Features can be of two major types based on the dataset. Inherent raw features are obtained directly from the dataset with no extra data manipulation or engineering. Derived features are usually obtained from feature engineering, where we extract features from existing data attributes. A simple example would be creating a new feature “Age” from an employee dataset containing “Birthdate” by just subtracting their birth date from the current date.

df[['average_seconds_played', 'cost', 'revenue']].describe()

Feature scaling
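Kernel SVMs are sensitive to the scale of their inputs, so numerical features are usually standardized before training. A minimal sketch with scikit-learn's `StandardScaler`, on made-up values for seconds played and cost (the array `X_num` is an assumption for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: average_seconds_played, cost (made-up values).
X_num = np.array([[0.0, 0.010],
                  [7.0, 0.006],
                  [17.0, 0.004]])

# Standardize each column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_num)

print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

As with imputation, the scaler would be fitted on the training set and then applied unchanged to the test set.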

In [None]:
# Columns of the dataset other than creative_id
['user_operating_system', 'user_device',
 'average_seconds_played', 'cost', 'revenue']

In [None]:
from sklearn import svm

# Toy example: fit a support vector regressor on two points
X = [[0, 0], [2, 2]]
y = [0.5, 2.5]
clf = svm.SVR()
clf.fit(X, y)

# Predict the target for a new point
clf.predict([[1, 1]])



Since we don't have another dataset, let's use the given dataset as a proxy for the new dataset - although, of course, we will use the revenue not as a feature but as a label. 

Use an SVR and ensemble regressors 
Feature selection? Evidently, no creative id

Cross Validation
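
One possible way to tie these notes together is a pipeline that one-hot encodes the categorical columns, fits an ensemble regressor and scores it with cross-validation. The sketch below runs on synthetic data: the frame `toy`, the rule used to generate its revenue and the hyperparameters are all assumptions for illustration, and `creative_id` is left out as suggested above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the log (values are randomly generated).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "user_operating_system": rng.choice(["Android", "iOS", "Windows"], size=n),
    "user_device": rng.choice(["Phone", "Tablet", "PersonalComputer"], size=n),
    "average_seconds_played": rng.uniform(0, 30, size=n),
    "cost": rng.uniform(0.001, 0.01, size=n),
})
# Made-up label: revenue grows with seconds played, plus noise, clipped at 0.
noise = rng.normal(0, 0.002, size=n)
toy["revenue"] = np.clip(
    0.001 * toy["average_seconds_played"] - 0.01 + noise, 0, None)

categorical = ["user_operating_system", "user_device"]
numerical = ["average_seconds_played", "cost"]

# One-hot encode the categorical columns; pass numerical ones through.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough")
model = make_pipeline(
    preprocess, RandomForestRegressor(n_estimators=50, random_state=0))

# 5-fold cross-validated R^2 of the regressor.
scores = cross_val_score(
    model, toy[categorical + numerical], toy["revenue"], cv=5)
print(scores.mean())
```

Wrapping the preprocessing in the pipeline means the encoder is refitted inside each fold, so the validation folds never influence the transformation.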