# Description

A leading European affiliate network wants to use machine learning to improve (optimise) its conversion rates and, eventually, its top line. Its network spans multiple European countries, including Portugal, Germany, France, Austria, and Switzerland.

An affiliate network is an online marketing channel in which an intermediary promotes products or services and earns a commission based on conversions (a click or a sign-up). The benefit for companies is that affiliate channels reach audiences that lie outside their own marketing reach.

The company wants to improve its CPC (cost per click) performance. Advance insight into how an ad will perform gives it enough of a head start to adjust upcoming CPC campaigns where necessary.

In this challenge, you have to predict the probability that an ad will be clicked.

    Variable      Description
    ID            unique ID
    datetime      timestamp
    siteid        website ID
    offerid       offer ID (commission-based offers)
    category      offer category
    merchant      seller ID
    countrycode   country where the affiliate's reach is present
    browserid     browser used
    devid         device used
    click         target variable

## -----------------------------------------Data preprocessing--------------------------------------------------------

In [27]:
graphlab.canvas.set_target('ipynb')

In [10]:
# drop rows with a missing browserid (column 7) or devid (column 8)

import pandas as pd
from pandas import read_csv
# header=None keeps the header line as an ordinary row, so columns are addressed by position
dataset = read_csv('train.csv', header=None)
dataset = dataset.dropna(axis=0, subset=[7, 8])

In [22]:
#export the file without null values
dataset.to_csv('train_wnull.csv', index = False, header = False)
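
The filtering step above addresses columns by position (7 and 8). A minimal sketch of the same idea with named columns, using only the standard library, is shown below. The three sample rows and the helper name `drop_rows_missing` are invented for illustration; they are not taken from `train.csv`.

```python
import csv
import io

# Invented stand-in for train.csv (the real file is not reproduced here).
SAMPLE = """ID,datetime,siteid,offerid,category,merchant,countrycode,browserid,devid,click
row1,2017-01-18 17:50:53,1,10,100,1000,b,Firefox,Desktop,0
row2,2017-01-17 10:18:43,2,20,200,2000,c,,Mobile,0
row3,2017-01-14 16:02:33,3,30,300,3000,d,Firefox,,1
"""


def drop_rows_missing(rows, required):
    """Keep only rows in which every column named in `required` is non-empty."""
    return [row for row in rows if all(row[col].strip() for col in required)]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
kept = drop_rows_missing(rows, required=["browserid", "devid"])
print(len(rows), "rows read,", len(kept), "rows kept")
```

Naming the columns makes the intent visible and keeps the filter correct if the column order ever changes.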

In [23]:
#read the file without null values in sframe
import graphlab
dataset_train = graphlab.SFrame('train_wnull.csv')

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,long,long,long,long,str,str,str,long]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [25]:
dataset_train.head(5)

ID,datetime,siteid,offerid,category,merchant,countrycode,browserid,devid,click
IDmMSxHur,2017-01-18 17:50:53,5189467,178235,21407,9434818,b,Mozilla Firefox,Desktop,0
ID32T6wwQ,2017-01-17 10:18:43,8896401,390352,40339,72089744,c,Firefox,Mobile,0
IDqUShzMg,2017-01-14 16:02:33,5635120,472937,12052,39507200,d,Mozilla Firefox,Desktop,0
IDjO9XQ1Z,2017-01-14 12:08:49,2729292,961176,33638,47079934,e,Google Chrome,Mobile,0
IDFnmhUgG,2017-01-13 05:58:13,7295565,144797,33638,23981625,b,Firefox,Mobile,0


In [28]:
dataset_train['click'].show(view = 'Categorical')

In [29]:
# map clicks to +1 and non-clicks to -1
dataset_train["click_val"] = dataset_train["click"].apply(lambda click: +1 if click == 1 else -1)
dataset_train = dataset_train.remove_column('click')
# One way to combat class imbalance is to undersample the larger class
# until the class distribution is approximately half and half.
# Here we undersample the larger class (no clicks).

# use the ratio of the class sizes as the sampling fraction for the no-click rows
clicks_raw = dataset_train[dataset_train["click_val"]==+1]
no_clicks_raw = dataset_train[dataset_train["click_val"]==-1]
percentage = len(clicks_raw)/float(len(no_clicks_raw))
clicks = clicks_raw
no_clicks = no_clicks_raw.sample(percentage)

ads_data_sampled = clicks.append(no_clicks)

In [30]:
# check the class balance of the new undersampled set
ads_data_sampled['click_val'].show(view = 'Categorical')

print "Fraction of clicks                   :", len(clicks) / float(len(ads_data_sampled))
print "Fraction of no clicks                :", len(no_clicks) / float(len(ads_data_sampled))
print "Total number of rows in the new dataset :", len(ads_data_sampled)

Fraction of clicks                   : 0.49998370022
Fraction of no clicks                : 0.50001629978
Total number of rows in the new dataset : 705531
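
The two classes above end up close to, but not exactly, half and half, which is consistent with sampling by a fraction rather than by an exact count. A minimal standard-library sketch of undersampling to an exactly balanced set with a fixed seed follows. The helper name `undersample` and the toy rows are assumptions made for illustration, not part of the notebook or of GraphLab.

```python
import random


def undersample(rows, label_key, seed=0):
    """Randomly drop majority-class rows until both classes have equal size."""
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == -1]
    minority, majority = sorted([positives, negatives], key=len)
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)  # avoid leaving the rows grouped by class
    return balanced


# Toy data: 100 clicks (+1) and 900 non-clicks (-1).
rows = [{"id": i, "click_val": 1 if i % 10 == 0 else -1} for i in range(1000)]
balanced = undersample(rows, "click_val")
n_pos = sum(1 for r in balanced if r["click_val"] == 1)
print(len(balanced), n_pos)
```

Sampling an exact count makes the class sizes identical, and the seed lets the same subset be regenerated later.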


In [31]:
ads_data_sampled.head(5)

ID,datetime,siteid,offerid,category,merchant,countrycode,browserid,devid,click_val
IDU8ukCsv,2017-01-10 15:28:24,469603,385750,93286,7122654,a,Google Chrome,Mobile,1
IDfEQxvi3,2017-01-14 19:01:18,5369414,513860,27655,92826840,f,Google Chrome,Mobile,1
IDbF2aKjs,2017-01-11 19:21:58,6005717,956737,1678,60293830,f,InternetExplorer,Desktop,1
IDEgpeME8,2017-01-16 14:29:36,7979331,501647,68947,58321067,f,Mozilla,Desktop,1
ID0SqwVVC,2017-01-17 00:04:53,2092870,400635,43897,46512126,c,Firefox,Mobile,1


In [32]:
# export the file after undersampling
ads_data_sampled.export_csv('train_preprocessed.csv')

## -------------------------------------------------Training------------------------------------------------------------------

In [33]:
ads_training_data = graphlab.SFrame('train_preprocessed.csv')

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,long,long,long,long,str,str,str,long]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [34]:
ads_training_data.head(5)

ID,datetime,siteid,offerid,category,merchant,countrycode,browserid,devid,click_val
IDU8ukCsv,2017-01-10 15:28:24,469603,385750,93286,7122654,a,Google Chrome,Mobile,1
IDfEQxvi3,2017-01-14 19:01:18,5369414,513860,27655,92826840,f,Google Chrome,Mobile,1
IDbF2aKjs,2017-01-11 19:21:58,6005717,956737,1678,60293830,f,InternetExplorer,Desktop,1
IDEgpeME8,2017-01-16 14:29:36,7979331,501647,68947,58321067,f,Mozilla,Desktop,1
ID0SqwVVC,2017-01-17 00:04:53,2092870,400635,43897,46512126,c,Firefox,Mobile,1


In [35]:
ads_training_data['click_val'].show(view = 'Categorical')

In [36]:
ads_training_data['browserid'].show(view = 'Categorical')

In [37]:
# map browser name variants to one canonical name
def transform_browser(browser):
    if browser == 'InternetExplorer':
        return 'Internet Explorer'
    elif browser == 'IE':
        return 'Internet Explorer'
    elif browser == 'Mozilla':
        return 'Firefox'
    elif browser == 'Mozilla Firefox':
        return 'Firefox'
    elif browser == 'Chrome':
        return 'Google Chrome'
    else:
        return browser

In [38]:
# raw browser labels observed in the browserid column, kept for reference
browser_list = ['InternetExplorer', 'IE', 'Mozilla', 'Mozilla Firefox', 'Chrome','Google Chrome', 'Firefox', 'Edge',
                'Internet Explorer','Opera', 'Safari']

# normalize every value in the column
ads_training_data['browserid'] = ads_training_data['browserid'].apply(transform_browser)

In [39]:
ads_training_data['browserid'].show(view = 'Categorical')
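
The renaming step can also be written as a dictionary lookup, which keeps all aliases in one place and makes it easy to check how many distinct labels remain. This is a sketch only: the names `BROWSER_ALIASES` and `normalize_browser` are hypothetical, and the list stands in for the real column.

```python
from collections import Counter

# Alias -> canonical browser name; labels not listed here are kept unchanged.
BROWSER_ALIASES = {
    "InternetExplorer": "Internet Explorer",
    "IE": "Internet Explorer",
    "Mozilla": "Firefox",
    "Mozilla Firefox": "Firefox",
    "Chrome": "Google Chrome",
}


def normalize_browser(browser):
    """Return the canonical name for a browser label."""
    return BROWSER_ALIASES.get(browser, browser)


raw_labels = ["InternetExplorer", "IE", "Mozilla", "Mozilla Firefox", "Chrome",
              "Google Chrome", "Firefox", "Edge", "Internet Explorer", "Opera", "Safari"]
counts = Counter(normalize_browser(label) for label in raw_labels)
print(sorted(counts))
```

Eleven raw labels collapse to six canonical ones, so the categorical view should show six bars after the transformation.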

In [40]:
len(ads_training_data)

705531

In [41]:
ads_training_data['devid'].show(view = 'Categorical')

In [42]:
ads_training_data

ID,datetime,siteid,offerid,category,merchant,countrycode,browserid,devid,click_val
IDU8ukCsv,2017-01-10 15:28:24,469603,385750,93286,7122654,a,Google Chrome,Mobile,1
IDfEQxvi3,2017-01-14 19:01:18,5369414,513860,27655,92826840,f,Google Chrome,Mobile,1
IDbF2aKjs,2017-01-11 19:21:58,6005717,956737,1678,60293830,f,Internet Explorer,Desktop,1
IDEgpeME8,2017-01-16 14:29:36,7979331,501647,68947,58321067,f,Firefox,Desktop,1
ID0SqwVVC,2017-01-17 00:04:53,2092870,400635,43897,46512126,c,Firefox,Mobile,1
IDN439rpK,2017-01-20 12:49:37,94250,527842,23576,16150900,c,Safari,Tablet,1
IDd9dRKG1,2017-01-15 01:00:05,7885100,913591,15912,7181598,d,Firefox,Mobile,1
IDLM0u0QM,2017-01-16 02:56:27,5571806,103249,40339,43452411,a,Internet Explorer,Desktop,1
IDY7aCYRs,2017-01-13 13:25:22,8476528,945081,41706,4000296,c,Firefox,Desktop,1
IDLBtW0hS,2017-01-13 20:05:13,2148269,995962,80554,7122654,c,Internet Explorer,Desktop,1


In [43]:
ads_training_data['datetime'].show(view = 'Categorical')
# Note: try rounding the timestamp to the nearest hour and check the distribution again
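
The note about rounding the timestamp to the nearest hour could be sketched as follows, assuming the `datetime` column keeps the `YYYY-MM-DD HH:MM:SS` format shown in the table output. The function name `round_to_hour` is hypothetical.

```python
from datetime import datetime, timedelta


def round_to_hour(timestamp):
    """Round a 'YYYY-MM-DD HH:MM:SS' string to the nearest whole hour."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    # Adding 30 minutes and then truncating rounds half-hours upward.
    shifted = parsed + timedelta(minutes=30)
    return shifted.replace(minute=0, second=0, microsecond=0)


print(round_to_hour("2017-01-18 17:50:53"))  # 2017-01-18 18:00:00
print(round_to_hour("2017-01-17 10:18:43"))  # 2017-01-17 10:00:00
```

Rounding reduces the column to at most 24 distinct values per day, which makes an hour-of-day feature (`.hour`) practical for a tree model.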


In [44]:
ads_training_data['siteid'].show(view = 'Categorical')

In [45]:
ads_training_data['offerid'].show(view = 'Categorical')

In [155]:
ads_training_data['category'].show(view = 'Categorical')

In [156]:
ads_training_data['merchant'].show(view = 'Categorical')

In [157]:
ads_training_data['countrycode'].show(view = 'Categorical')


In [46]:
# split the data into training and validation sets (75% / 25%)
train_data, validation_data = ads_training_data.random_split(.75)
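
The split above is random, so rerunning the cell gives different training and validation sets unless a seed is fixed. A minimal standard-library sketch of a seeded random split is shown below; it is an illustration of the idea, not GraphLab's implementation, and the function signature is an assumption.

```python
import random


def random_split(rows, fraction, seed):
    """Send each row to the first list with probability `fraction`, else to the second."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second


rows = list(range(10000))
train_a, valid_a = random_split(rows, 0.75, seed=42)
train_b, valid_b = random_split(rows, 0.75, seed=42)
print(len(train_a), len(valid_a), train_a == train_b)
```

With the same seed the two calls return identical partitions, so metrics computed on the validation set stay comparable between runs.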

## --------------------------------------------------Model-------------------------------------------------------------------

In [47]:
features = ['countrycode', 'merchant', 'category', 'browserid', 'devid']
target = 'click_val'

decision_tree_model_1 = graphlab.decision_tree_classifier.create(train_data,
                                                                 validation_set=None,
                                                                 target = target,
                                                                 features = features)

In [1]:
# Open question: use GraphLab's built-in classifier or write a custom one?

In [None]:
#function to pick the best feature to split



In [14]:
#use decision tree to build a classifier 


## --------------------------------------------------Testing-------------------------------------------------------------------

In [48]:
# decision_tree_model_1 AUC on the training data
print "decision_tree_model_1 on the train data: ", decision_tree_model_1.evaluate(train_data, 'auc')

decision_tree_model_1 on the train data:  {'auc': 0.9592555025722249}


In [None]:
# predict on the test data for the submission file



## ------------------------------------------------Validation----------------------------------------------------------------

In [49]:
# decision_tree_model_1 AUC on the validation data
print "decision_tree_model_1 on the validation data: ", decision_tree_model_1.evaluate(validation_data, 'auc')

decision_tree_model_1 on the validation data:  {'auc': 0.9594192578066286}
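
The metric reported above is AUC, not accuracy. As a reminder of what it measures, here is a small sketch that computes AUC directly from its definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The labels and scores are invented, and the pairwise loop is quadratic, so it is meant for illustration rather than for the full dataset.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label != 1]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


labels = [1, 1, -1, -1, 1, -1]
scores = [0.9, 0.6, 0.7, 0.2, 0.8, 0.1]
print(roc_auc(labels, scores))
```

In this toy example 8 of the 9 positive/negative pairs are ordered correctly, so the AUC is 8/9. Because AUC depends only on the ranking of the scores, it is unchanged by any monotonic rescaling of the predicted probabilities.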


In [None]:
#evaluate the accuracy of the model



## ------------------------------------------------Prediction----------------------------------------------------------------

In [50]:
# load test data 1
ads_test_data = graphlab.SFrame('test.csv')

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,long,long,long,long,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [51]:
ads_test_data.head(1)

ID,datetime,siteid,offerid,category,merchant,countrycode,browserid,devid
IDFDJVI,2017-01-22 09:55:48,755610,808980,17714,26391770,b,Mozilla Firefox,Desktop


In [52]:
# normalize browser names in the test data with the same mapping used for training
ads_test_data['browserid'] = ads_test_data['browserid'].apply(transform_browser)

In [53]:
# make predictions for test data 1
predict_1 = decision_tree_model_1.predict(ads_test_data, output_type='probability')

In [55]:
#export predictions for test data 1
sub = pd.DataFrame({'ID':ads_test_data['ID'],'click':predict_1})
sub.to_csv('prediction_file_1.csv',index=False)

## -------------------------------------------------Explore----------------------------------------------------------------

In [None]:
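from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
#scikit learn doesn't work with missing data

# Assumption: X_train and y_train are prepared elsewhere; they are not defined in this notebook.
# Assumption: a random forest stands in for the LinearRegression() call originally here, because
# the cells below read model.feature_importances_, which linear models do not provide.
model = RandomForestClassifier()
model.fit(X_train, y_train)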


def describe_categorical(X):
    # display summary statistics for the categorical (object-dtype) columns of X
    from IPython.display import display, HTML
    display(HTML(X[X.columns[X.dtypes == "object"]].describe().to_html()))

In [None]:
# gives descriptive statistics, such as how many unique values each categorical column has
describe_categorical(X)

In [None]:
# keep only the first letter of each value, or "None" when the value is missing
def clean(x):
    try:
        return x[0]
    except TypeError:
        return "None"
X["browserid"] = X.browserid.apply(clean)

In [None]:
categorical_variables = ['browserid', 'merchant']
for variable in categorical_variables:
    # fill missing data with the word "Missing"
    X[variable] = X[variable].fillna("Missing")
    # create an array of dummy (one-hot) columns
    dummies = pd.get_dummies(X[variable], prefix=variable)
    # update X to include the dummies and drop the original variable
    X = pd.concat([X, dummies], axis=1)
    X.drop([variable], axis=1, inplace=True)
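
To make the dummy-variable step concrete, the sketch below one-hot encodes a single column of plain dictionaries without pandas. The helper name `one_hot` and the three toy rows are assumptions for illustration; the column names follow the dataset description.

```python
def one_hot(rows, column, missing="Missing"):
    """Replace `column` in each row with 0/1 indicator columns named '<column>_<value>'."""
    values = [missing if row.get(column) in (None, "") else row[column] for row in rows]
    categories = sorted(set(values))
    encoded = []
    for row, value in zip(rows, values):
        new_row = {key: val for key, val in row.items() if key != column}
        for category in categories:
            new_row["%s_%s" % (column, category)] = 1 if value == category else 0
        encoded.append(new_row)
    return encoded


rows = [
    {"devid": "Mobile", "browserid": "Firefox"},
    {"devid": "Desktop", "browserid": None},
    {"devid": "Mobile", "browserid": "Safari"},
]
encoded = one_hot(rows, "browserid")
print(sorted(encoded[1]))
```

Each row gets exactly one indicator set to 1, and a missing value becomes its own `Missing` category rather than being dropped. A column with many distinct values, such as a merchant ID, would produce one indicator per value, which can make the table very wide.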

In [None]:
# simple plot of every feature's importance in isolation
feature_importances = pd.Series(model.feature_importances_, index = X.columns)
feature_importances.sort_values(inplace=True)
feature_importances.plot(kind = 'barh', figsize=(7,6))

In [None]:
# Complex version that shows the summary view

def graph_feature_importances(model, feature_names, autoscale=True, headroom=0.05, width=10, summarized_columns=None):
    """
    By Mike Bernico
    
    Graphs the feature importances of a random decision forest using a horizontal bar chart. 
    Probably works but untested on other sklearn.ensembles.
    
    Parameters
    ----------
    model = The fitted ensemble whose feature importances you would like graphed.
    feature_names = A list of the names of those features, displayed on the Y axis.
    autoscale = True (automatically adjust the X axis size to the largest feature + headroom) / False = scale from 0 to 1
    headroom = used with autoscale, .05 default
    width = figure width in inches
    summarized_columns = a list of column prefixes to summarize on, for dummy variables
    (e.g. ["day_"] would summarize all day_ vars)
    """
    
    if autoscale:
        x_scale = model.feature_importances_.max()+ headroom
    else:
        x_scale = 1
    
    feature_dict=dict(zip(feature_names, model.feature_importances_))
    
    if summarized_columns: 
        #some dummy columns need to be summarized
        for col_name in summarized_columns: 
            #sum all the features that contain col_name, store in temp sum_value
            sum_value = sum(x for i, x in feature_dict.items() if col_name in i )  
            
            #now remove all keys that are part of col_name
            keys_to_remove = [i for i in feature_dict.keys() if col_name in i ]
            for i in keys_to_remove:
                feature_dict.pop(i)
            #lastly, re-add the summarized field
            feature_dict[col_name] = sum_value
        
    results = pd.Series(feature_dict)
    results.sort_values(inplace=True)
    results.plot(kind="barh", figsize=(width,len(results)/4), xlim=(0,x_scale))
    
graph_feature_importances(model, X.columns, summarized_columns=categorical_variables)

In [1]:
#------------------How to handle missing data-----------------------------------------------------
# See which columns have missing data.
# Do this after balancing the data sample.
# Plots and summary statistics help identify missing or corrupt data.
# Load the dataset as a pandas DataFrame and print summary statistics for each attribute.

import numpy
from pandas import read_csv
dataset = read_csv('pima-indians-diabetes.csv', header=None)
print(dataset.describe())

# Then count the number of zeros in each column.
# In Python, specifically pandas, NumPy and scikit-learn, missing values are marked as NaN.
# NaN values are ignored by operations like sum, count, etc.
# Mark zero values as missing (NaN).

dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.nan)
# count the number of NaN values in each column
print(dataset.isnull().sum())

# pandas provides the dropna() function, which can drop either columns or rows with missing data.
# We can use dropna() to remove all rows with missing data.
# Open question: if rows with missing data are dropped from training,
# what happens when a row in the test set has missing values?
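
One possible answer to the question about test rows with missing values is to fill the gaps with a placeholder category instead of dropping rows, so that every test row still receives a prediction. The sketch below counts and fills missing values in plain dictionaries; the helper names and sample rows are hypothetical, and whether a placeholder helps the model is something to verify on the validation set.

```python
def count_missing(rows, columns):
    """Count empty or None values per column."""
    return {col: sum(1 for r in rows if r.get(col) in (None, "")) for col in columns}


def fill_missing(rows, columns, placeholder="Missing"):
    """Return copies of the rows with empty values replaced by `placeholder`."""
    filled = []
    for row in rows:
        copy = dict(row)
        for col in columns:
            if copy.get(col) in (None, ""):
                copy[col] = placeholder
        filled.append(copy)
    return filled


test_rows = [
    {"ID": "t1", "browserid": "Firefox", "devid": "Desktop"},
    {"ID": "t2", "browserid": "", "devid": "Mobile"},
    {"ID": "t3", "browserid": None, "devid": None},
]
columns = ["browserid", "devid"]
print(count_missing(test_rows, columns))
filled = fill_missing(test_rows, columns)
print(count_missing(filled, columns))
```

The row count is unchanged after filling, which matters for a submission file that must contain one prediction per test ID.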


In [None]:
# change the zero clicks to -1 and 1 click to +1

#--------------------Experimental perspectives on learning from Imbalanced data----------------
# data sampling
# check the ratio of +1 rows to -1 rows and to the total
# seven sampling techniques:
#   1. random undersampling: majority-class samples are discarded; performs well with random forests
#   2. random oversampling: minority-class samples are duplicated; performs well with logistic regression
#   3. one-sided selection: removes majority-class samples that are redundant or noisy
#      (maybe I can remove the ones with missing data)
#   4. cluster-based sampling
#   5. Wilson's editing: think of having a high precision probability for the majority class
#   6. SMOTE
#   7. borderline SMOTE

# take the dataset, separate it into 50% -1 and 50% +1, and export the data to a CSV; then do the same
# for the three training sets and combine those into one; that may keep it under 25 MB
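
Of the techniques listed, random oversampling (item 2) is the mirror image of the undersampling used earlier in the notebook. A minimal standard-library sketch is given below; the helper name `oversample` and the toy rows are assumptions for illustration.

```python
import random


def oversample(rows, label_key, seed=0):
    """Duplicate random minority-class rows until both classes have equal size."""
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == -1]
    minority, majority = sorted([positives, negatives], key=len)
    rng = random.Random(seed)
    # Draw with replacement from the minority class to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


# Toy data: 20 clicks (+1) and 80 non-clicks (-1).
rows = [{"id": i, "click_val": 1 if i % 5 == 0 else -1} for i in range(100)]
balanced = oversample(rows, "click_val")
n_pos = sum(1 for r in balanced if r["click_val"] == 1)
print(len(balanced), n_pos)
```

Because duplicated rows are exact copies, oversampling should be applied only to the training portion after the train/validation split; otherwise copies of the same row can land on both sides and inflate the validation score.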


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# skip row 1 so pandas can parse the data properly.
loans_2007 = pd.read_csv('data/lending_club_loans.csv', skiprows=1, low_memory=False) 
half_count = len(loans_2007) / 2
# Drop any column with more than 50% missing values
loans_2007 = loans_2007.dropna(thresh=half_count,axis=1)

#to explore different values in a column
loans_2007["loan_status"].value_counts()

# First, use the pandas DataFrame method isnull() to return a DataFrame of Boolean values:
# True if the original value is null
# False if the original value isn't null
# Then, use the pandas DataFrame method sum() to count the null values in each column.

# Note: filtered_loans is not defined in this notebook; it is assumed to be a cleaned copy of loans_2007.
null_counts = filtered_loans.isnull().sum()
print("Number of null values in each column:\n{}".format(null_counts))

for name in ['purpose','title']:
    print("Unique Values in column: {}\n".format(name))
    print(filtered_loans[name].value_counts(),'\n')

# Nominal features are converted into numerical features by encoding them as dummy variables.
# Remember to store each of these filtered files separately.
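
The rule used in this cell, dropping any column in which more than half of the values are missing, can be sketched without pandas as follows. The helper name `drop_sparse_columns` and the sample rows are invented for illustration, and the sketch assumes every row has the same keys.

```python
def drop_sparse_columns(rows, min_fraction_present=0.5):
    """Keep only columns whose share of non-missing values is at least the given fraction."""
    columns = list(rows[0].keys())
    threshold = len(rows) * min_fraction_present
    keep = [
        col for col in columns
        if sum(1 for r in rows if r[col] not in (None, "")) >= threshold
    ]
    return [{col: r[col] for col in keep} for r in rows]


rows = [
    {"a": 1, "b": 2, "c": None},
    {"a": 1, "b": None, "c": None},
    {"a": 1, "b": 2, "c": 3},
    {"a": 1, "b": None, "c": None},
]
cleaned = drop_sparse_columns(rows)
print(sorted(cleaned[0]))
```

Column `b` is exactly half present and is kept, while column `c` is only a quarter present and is dropped.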

In [None]:
# Time Series Forecast Study with Python
from matplotlib import pyplot

# make plots to see if you can identify any pattern over time
# making a seasonal plot
# Note: `groups` is not defined in this notebook; it is assumed to be a pandas groupby over time periods.
n_groups = len(groups)
i = 1  # subplot index, starting at 1
for name, group in groups:
    pyplot.subplot((n_groups*100) + 10 + i)
    i += 1
    pyplot.plot(group)
pyplot.show()

# converting specific columns to int types
df.iloc[:,2:7] = df.iloc[:,2:7].astype(int)

# converting to date time
pd.to_datetime('January 2012')

