# FINAL PROJECT SUBMISSION

**Group number:** 1 <br>
**Student IDs:** 48725, 48483, 48481, 49036 <br>
**Project name:** Ad Clicks

### Load packages and data

In [1]:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

In [2]:
#EDA and More
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#Graphic Parameter Display
from sklearn import set_config
set_config(display="diagram")
#Splitting Dataset
from sklearn.model_selection import train_test_split
#Pipeline and GridSearch
from imblearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
#Balancing
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
#Preprocessing
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import FeatureUnion
#Feature Selection
from sklearn.feature_selection import SelectPercentile, f_classif, mutual_info_classif
from sklearn.feature_selection import SelectFromModel
#Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
#Saving Models outside notebook
import pickle

In [3]:
#load data into dataframe
df = pd.read_csv("ad_clicks_100k.csv")

#Check dataset
pd.set_option('display.max_columns', None)
df.head()

Unnamed: 0,id,click,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,app_category,device_id,device_ip,device_model,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21
0,6.448465e+18,0,14102806,1005,0,d6137915,bb1ef334,f028772b,ecad2386,7801e8d9,07d7df22,a99f214a,a506b0a5,9efa421a,1,0,19771,320,50,2227,0,935,-1,48
1,1.342805e+19,0,14102307,1002,0,85f751fd,c4e18dd6,50e219e0,9a08a110,7801e8d9,07d7df22,0fb3da37,73075152,02d14ecc,0,0,21676,320,50,2495,2,167,-1,23
2,1.048699e+19,0,14102310,1005,0,9a28a858,64778742,f028772b,ecad2386,7801e8d9,07d7df22,a99f214a,1847b3fb,ecb851b2,1,0,21837,300,250,2523,3,39,-1,221
3,8.833733e+18,0,14102307,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,a99f214a,9ac6509a,779d90c2,1,0,15706,320,50,1722,0,35,-1,79
4,1.035453e+19,0,14102811,1005,0,85f751fd,c4e18dd6,50e219e0,a5184c22,b8d325c3,0f2161f8,a99f214a,cfeed5cf,dc15c87e,1,2,23224,320,50,2676,0,35,100176,221


### Pre-Preparation of Data

In [None]:
y=df['click']
X=df.drop(columns='click')

In [None]:
#Transform hour column
#First transform the column into datetime format (raw values look like YYMMDDHH)
X['datetime'] = pd.to_datetime(X['hour'].astype(str), format='%y%m%d%H')
#check first and last date
print(X['datetime'].min(), X['datetime'].max())
# All records are from October 2014
# -> we decide to include only the hour of day and the weekday in our model
X["weekday"] = X['datetime'].dt.day_of_week
X["hour"] = X['datetime'].dt.hour #overwrites the raw hour column with the hour of day (0-23)
X.drop(columns="datetime", inplace=True)
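
The transformation above can be checked on two raw `hour` values taken from the preview of the dataframe. This is a minimal sketch; the variable names are illustrative and not part of the notebook.

```python
import pandas as pd

# Two raw `hour` values copied from the dataframe preview (format YYMMDDHH).
raw_hours = pd.Series([14102806, 14102307])

# Parse the integer codes as timestamps: 2014-10-28 06:00 and 2014-10-23 07:00.
parsed = pd.to_datetime(raw_hours.astype(str), format="%y%m%d%H")

hour_of_day = parsed.dt.hour      # 0..23
weekday = parsed.dt.day_of_week   # Monday=0 ... Sunday=6

print(hour_of_day.tolist())  # [6, 7]
print(weekday.tolist())      # [1, 3]  (Tuesday, Thursday)
```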

In [None]:
#Drop ID column
X.drop(columns="id",inplace=True)

## Splitting Data into Train and Test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42, test_size=0.5)

print("Records in training dataset:",X_train.shape[0],"(",(X_train.shape[0]/X.shape[0])*100,"%)")
print("Records in test dataset:",X_test.shape[0],"(",(X_test.shape[0]/X.shape[0])*100,"%)")

## The Pipeline Steps:


#### Balancing
1. Balance Data
    1. OverSampler
    2. UnderSampler
    3. No Balancing ("passthrough")

#### Feature Engineering
##### Categorical Features:
1. Select only Categorical Features (FeatureSelector(categorical_features))
2. Reduction (Due to Processing Power)
    1. Aggregate Categories (CategoryAggregator)
    2. Exclude Features with many categories (BigFeatureExcluder)
    3. No Reduction ("passthrough")
3. Feature Encoding:
    1. OneHot Encoding
    2. Label Encoding
    
##### Ordinal Features:
1. Select only ordinal Features (FeatureSelector(ordinal_features))
2. Encode (OrdinalEncoder)
3. Scale the encoded data (StandardScaler)

Combine Both Feature sets (FeatureUnion)

#### Feature Selection / Dimensionality Reduction

1. Reduce Dimensionality/Select Features
    1. Univariate Filtering
    2. Model based using Log Reg
    3. PCA
    4. No Reduction ("passthrough") 

#### Models:

1. Use Model to predict:
    1. Logistic Regression
    2. Decision Tree
    3. Random Forest
    4. XGBoost
    5. CatBoost


_Definition of Column Types for the Pipeline:_

In [None]:
categorical_features = ['C1', 'banner_pos', 'site_id', 'site_domain', 'site_category',
       'app_id', 'app_domain', 'app_category', 'device_id', 'device_ip',
       'device_model', 'device_type', 'device_conn_type', 'C14', 'C15', 'C16',
       'C17', 'C18', 'C19', 'C20', 'C21']

ordinal_features = ['hour', 'weekday']

##### FeatureSelector takes a pandas dataframe and columns and returns dataframe with the given columns:

In [None]:
#FeatureSelector keeps only the given columns (max_cat is stored but not used in this class)
class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_names, max_cat=50):
        self.feature_names = feature_names
        self.max_cat = max_cat
    def fit( self, X, y = None ):
        return self
    def transform(self, X, y=None):
        return X.loc[:, self.feature_names].copy(deep=True)
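
A short usage sketch of the selector on a made-up three-column frame. The class is repeated so that the snippet runs on its own, and the toy values are assumptions for illustration only.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureSelector(BaseEstimator, TransformerMixin):
    """Same column selector as above, repeated so the snippet is self-contained."""
    def __init__(self, feature_names, max_cat=50):
        self.feature_names = feature_names
        self.max_cat = max_cat
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return X.loc[:, self.feature_names].copy(deep=True)

# Toy frame (made-up values) with one categorical and two ordinal columns.
toy = pd.DataFrame({"site_id": ["a", "b"], "hour": [6, 7], "weekday": [1, 3]})

selected = FeatureSelector(["hour", "weekday"]).fit_transform(toy)
print(list(selected.columns))  # ['hour', 'weekday']

# The result is a deep copy, so editing it leaves the input frame untouched.
selected.loc[0, "hour"] = 23
print(toy.loc[0, "hour"])  # 6
```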

##### CategoryAggregator reduces features with many categories (more distinct values than agg_threshold): it keeps the most frequent categories while their cumulative share stays below agg_percentil and merges all remaining categories into one category "other":

In [None]:
class CategoryAggregator(BaseEstimator, TransformerMixin):
    def __init__(self, agg_percentil=0.7, agg_threshold=10, cat_limit=True, max_cat=50):
        self.top_categories = {}
        self.feature_names = []
        self.agg_percentil = agg_percentil
        self.agg_threshold = agg_threshold
        self.max_cat = max_cat
        self.cat_limit = cat_limit

    def fit(self, X, y = None):
        #Reset learned state so that repeated calls to fit start from scratch
        self.top_categories = {}
        self.feature_names = []
        for col in X.columns:
            if X[col].nunique() <= self.agg_threshold:
                #Few categories: keep all of them
                categories = list(X[col].unique())
            else:
                #Keep the most frequent categories while their cumulative share stays below agg_percentil
                cumulative = X[col].value_counts(normalize=True).cumsum()
                categories = list(cumulative[cumulative < self.agg_percentil].index)
                if self.cat_limit and len(categories) > self.max_cat:
                    #Still too many categories after aggregation: exclude the feature
                    print("Feature",col,"has after aggregation","(with agg_percentil=",self.agg_percentil,")", len(categories), "categories and is excluded.")
                    continue
            self.feature_names.append(col)
            self.top_categories[col] = categories
        return self
    def transform(self, X, y=None):
        X_output = X.loc[:, self.feature_names].copy(deep=True)
        for col, categories in self.top_categories.items():
            #Values outside the kept categories (including values unseen during fit) fall into one bucket
            if (X_output[col].dtype == 'int64'):
                X_output.loc[~X[col].isin(categories), col] = -10
            else:
                X_output.loc[~X[col].isin(categories), col] = "other"
        return X_output
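
The cumulative-share rule can be illustrated on a single made-up column. This sketch only reproduces the selection logic with plain pandas and is not a replacement for the class. The data and variable names are assumptions.

```python
import pandas as pd

# Made-up column with 20 rows: shares are a=40%, b=20%, c=15%, d=10%, e=10%, f=5%.
col = pd.Series(["a"] * 8 + ["b"] * 4 + ["c"] * 3 + ["d"] * 2 + ["e"] * 2 + ["f"])

agg_percentil = 0.7  # same value as the CategoryAggregator default

# value_counts sorts by frequency, so cumsum gives the share covered so far.
cumulative = col.value_counts(normalize=True).cumsum()

# Keep categories while the cumulative share is still below the threshold.
top = list(cumulative[cumulative < agg_percentil].index)
print(top)  # ['a', 'b']

# Everything else is merged into a single "other" category.
aggregated = col.where(col.isin(top), "other")
print(int((aggregated == "other").sum()))  # 8
```

Because the comparison is strict, the kept categories here cover 60% of the rows, which is below the 70% threshold. Adding `c` would push the cumulative share to 75%.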
    

##### BigFeatureExcluder excludes categorical features with more categories than defined (max_cat):
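
No class definition follows this heading in the notebook. Below is a minimal sketch of what such a transformer could look like, assuming it drops every column whose number of distinct values in the training data exceeds `max_cat`. The attribute name `kept_features_` and the toy data are hypothetical and not taken from the original implementation.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class BigFeatureExcluder(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: drop columns with more than `max_cat` distinct values."""
    def __init__(self, max_cat=50):
        self.max_cat = max_cat

    def fit(self, X, y=None):
        # Decide on the training data which columns are small enough to keep.
        self.kept_features_ = [c for c in X.columns if X[c].nunique() <= self.max_cat]
        return self

    def transform(self, X, y=None):
        # Apply the same column choice to any later data (validation, test).
        return X.loc[:, self.kept_features_].copy(deep=True)

# Toy frame (made-up values): device_ip has 4 distinct values, banner_pos has 2.
toy = pd.DataFrame({"device_ip": ["ip1", "ip2", "ip3", "ip4"],
                    "banner_pos": [0, 1, 0, 1]})

reduced = BigFeatureExcluder(max_cat=2).fit_transform(toy)
print(list(reduced.columns))  # ['banner_pos']
```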

_The Pipeline Main Construct:_


In [None]:
# Pipeline - WORKING

# Preprocessing
## Feature Engineering
#categorical pipeline: select columns, aggregate rare categories, then one-hot encode
#Note: from scikit-learn 1.2 on, the `sparse` argument is named `sparse_output`
categorical_pipeline = Pipeline(steps = [ 
    ("column_selector", FeatureSelector(categorical_features)),
    ("reduction", CategoryAggregator(agg_percentil=0.7, agg_threshold=10, cat_limit=True, max_cat=30)),
    ("encoding", OneHotEncoder(drop="first", handle_unknown='ignore', sparse=False)) 
])


ordinal_pipeline = Pipeline(steps = [ 
    ("column_selector", FeatureSelector(ordinal_features)),
    ("encoding", OrdinalEncoder()),
    ("std_scaler", StandardScaler()) 
])


feature_engine_pipe = FeatureUnion(transformer_list=[("ordinal_pipe", ordinal_pipeline),
                                                     ("categorical_pipe", categorical_pipeline)],
                                   n_jobs=-1)
#Main Pipeline
model_pipeline = Pipeline(steps=[
    ('balancing', RandomUnderSampler()),
    ("feature_engineering", feature_engine_pipe),
    ("feature_selection", SelectPercentile(f_classif,  percentile=50)),
    ("model", LogisticRegression())
])


### PARAM GRID

#### Logistic Regression
_Hyperparameters tuned: C_

In [None]:
param1a = {"balancing": [RandomUnderSampler(),RandomOverSampler()],
          "feature_selection": [SelectPercentile(mutual_info_classif,  percentile=50)],
          "feature_selection__percentile": [50,65,80],
          "model":[LogisticRegression()],
          "model__C":[0.1, 1.0, 10]}

param1b = {"balancing": [RandomUnderSampler(),RandomOverSampler(), 'passthrough'],
          "feature_selection": [SelectFromModel(estimator=LogisticRegression(solver='liblinear', penalty='l1', random_state=42))],
          "feature_selection__estimator__C": [0.1, 1.0, 10],
          "model":[LogisticRegression()],
          "model__C":[0.1, 1.0, 10]}


__Possible Parameter Combinations:__ Balancing x Feature Engineering x Feature Selection x Model

Param1a: 2 x 1 x 3 x 3 = 18

Param1b: 3 x 1 x 3 x 3 = 27

__Total: 45__
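
The size of each grid is the product of the lengths of its option lists, so the counts can be verified in a few lines. In this sketch the estimator objects are replaced by placeholder strings, because only the list lengths matter. The helper `grid_size` and the `*_shape` names are illustrative and not part of the notebook.

```python
from math import prod

def grid_size(param_dict):
    """Number of combinations in one grid: product of the option-list lengths."""
    return prod(len(options) for options in param_dict.values())

# Stand-ins for the two logistic regression grids defined above.
param1a_shape = {
    "balancing": ["under", "over"],
    "feature_selection": ["select_percentile"],
    "feature_selection__percentile": [50, 65, 80],
    "model": ["logreg"],
    "model__C": [0.1, 1.0, 10],
}
param1b_shape = {
    "balancing": ["under", "over", "passthrough"],
    "feature_selection": ["select_from_model"],
    "feature_selection__estimator__C": [0.1, 1.0, 10],
    "model": ["logreg"],
    "model__C": [0.1, 1.0, 10],
}

sizes = [grid_size(param1a_shape), grid_size(param1b_shape)]
print(sizes, sum(sizes))  # [18, 27] 45
```

With the real grids, scikit-learn's `ParameterGrid` gives the same numbers via `len(ParameterGrid(param1a))`.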

#### Decision Tree
_Hyperparameters tuned: max depth of tree, min samples leaf and criterion_

In [None]:
param2a = {"balancing": [RandomUnderSampler(),RandomOverSampler()],
          "feature_selection": [SelectPercentile(mutual_info_classif,  percentile=50)],
          "feature_selection__percentile": [50,65,80],
          'model': [DecisionTreeClassifier()],
          'model__max_depth': [3, 5, 7, 9],
          'model__min_samples_leaf': [5, 6, 7, 8, 9],
          'model__criterion': ['gini', 'entropy']}


param2b = {"balancing": [RandomUnderSampler(),RandomOverSampler()],
          "feature_selection": [SelectFromModel(estimator=LogisticRegression(solver='liblinear', penalty='l1', random_state=42))],
          "feature_selection__estimator__C": [0.1, 1.0, 10],
          'model': [DecisionTreeClassifier()],
          'model__max_depth': [3, 5, 7, 9],
          'model__min_samples_leaf': [5, 6, 7, 8, 9],
          'model__criterion': ['gini', 'entropy']}

param2c = {"balancing": [RandomUnderSampler(),RandomOverSampler()],
          "feature_selection": ["passthrough"],
          'model': [DecisionTreeClassifier()],
          'model__max_depth': [3, 5, 7, 9],
          'model__min_samples_leaf': [5, 6, 7, 8, 9],
          'model__criterion': ['gini', 'entropy']}


__Parameter Combinations:__ Balancing x Feature Engineering x Feature Selection x Model

Param2a: 2 x 1 x 3 x (4x5x2=40) = 240

Param2b: 2 x 1 x 3 x (4x5x2=40) = 240

Param2c: 2 x 1 x 1 x (4x5x2=40) = 80

__Total: 560__

#### Random forest
_Hyperparameters tuned: max depth, criterion and number of estimators_

In [None]:
param3a = {"balancing": [RandomUnderSampler(),RandomOverSampler()],
          "feature_selection": [SelectPercentile(mutual_info_classif,  percentile=50)],
          "feature_selection__percentile": [50,65,80],
          'model': [RandomForestClassifier()],
          'model__max_depth': [3, 5, 7, 9],
          'model__criterion': ['gini', 'entropy'],
          'model__n_estimators': list(np.arange(100,600,100))}


param3b = {"balancing": [RandomUnderSampler(),RandomOverSampler()],
          "feature_selection": [SelectFromModel(estimator=LogisticRegression(solver='liblinear', penalty='l1', random_state=42))],
          "feature_selection__estimator__C": [0.1, 1.0, 10],          
          'model': [RandomForestClassifier()],
          'model__max_depth': [3, 5, 7, 9],
          'model__criterion': ['gini', 'entropy'],
          'model__n_estimators': list(np.arange(100,600,100))}


param3c = {"balancing": [RandomUnderSampler(),RandomOverSampler()],
          "feature_selection": ["passthrough"],
          'model': [RandomForestClassifier()],
          'model__max_depth': [3, 5, 7, 9],
          'model__criterion': ['gini', 'entropy'],
          'model__n_estimators': list(np.arange(100,600,100))}
        

__Parameter Combinations:__ Balancing x Feature Engineering x Feature Selection x Model

Param3a: 2 x 1 x 3 x (4x2x5=40) = 240

Param3b: 2 x 1 x 3 x (4x2x5=40) = 240

Param3c: 2 x 1 x 1 x (4x2x5=40) = 80

__Total: 560__

#### XGBoost
_Hyperparameters tuned: learning rate, max tree depth and min child weight_

In [None]:
param4a = {"balancing": [RandomUnderSampler(),RandomOverSampler()],
          "feature_selection": [SelectPercentile(mutual_info_classif,  percentile=50)],
          "feature_selection__percentile": [50,65,80],
          'model': [XGBClassifier()],
          'model__learning_rate':[0.001,0.01,0.1],
          'model__max_depth': [3, 5, 7, 9],
          'model__min_child_weight': [0.5, 1.0, 3.0, 5.0]}


param4b = {"balancing": [RandomUnderSampler(),RandomOverSampler()],
          "feature_selection": [SelectFromModel(estimator=LogisticRegression(solver='liblinear', penalty='l1', random_state=42))],
          "feature_selection__estimator__C": [0.1, 1.0, 10],
          'model': [XGBClassifier()],
          'model__learning_rate': [0.001,0.01,0.1],
          'model__max_depth': [3, 5, 7, 9],
          'model__min_child_weight': [0.5, 1.0, 3.0, 5.0]}

param4c = {"balancing": [RandomUnderSampler(),RandomOverSampler()],
          "feature_selection": ["passthrough"],
          'model': [XGBClassifier()],
          'model__learning_rate':[0.001,0.01,0.1],
          'model__max_depth': [3, 5, 7, 9],
          'model__min_child_weight': [0.5, 1.0, 3.0, 5.0]}


__Parameter Combinations:__ Balancing x Feature Engineering x Feature Selection x Model

Param4a: 2 x 1 x 3 x (3x4x4=48) = 288

Param4b: 2 x 1 x 3 x (3x4x4=48) = 288

Param4c: 2 x 1 x 1 x (3x4x4=48) = 96

__Total: 672__

#### CatBoost

In [None]:
param5 = {"balancing": [RandomUnderSampler(),RandomOverSampler()],
          "feature_engineering__categorical_pipe__reduction": ["passthrough"],
          "feature_selection": ['passthrough'],
          "model":[CatBoostClassifier(one_hot_max_size=50, logging_level='Silent')],
          'model__learning_rate': [0.1,0.05,0.03,0.01],
          'model__depth': [3, 5,7],
          'model__l2_leaf_reg': [1, 3, 5, 7, 9]}


__Parameter Combinations:__ Balancing x Feature Engineering x Feature Selection x Model

Param5: 2 x 1 x 1 x (4x3x5=60) = 120

__Total: 120__

##### Combination of all parameter constellations for grid search:

In [None]:
param_grid = [param1a, param1b, 
              param2a, param2b, param2c, 
              param3a, param3b, param3c, 
              param4a, param4b, param4c, 
              param5]

__Total Combinations:__ 45 + 560 + 560 + 672 + 120 = __1957__

### To keep the computation feasible without evaluating every combination, we use Random Search with 500 iterations (about 26% of the 1957 combinations)

### RandomSearch

Definition of the random search with 5-fold cross-validation, meaning each sampled configuration is trained and validated five times:

In [None]:
#use the F1 score for evaluation of best model
random_search = RandomizedSearchCV(model_pipeline, param_grid, cv=5, scoring='f1', n_iter=500, n_jobs=-1)
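
How a random search draws candidates from a list of grids can be seen on a much smaller example. This sketch uses synthetic data from `make_classification` and a two-step scikit-learn pipeline as stand-ins. Every name prefixed with `demo_` and all parameter values are assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the ad-click data (not the real dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.8, 0.2], random_state=42)

demo_pipeline = Pipeline(steps=[("scaler", StandardScaler()),
                                ("model", LogisticRegression())])

# Two grids made only of lists: 3 + 4 = 7 candidate configurations in total.
demo_grid = [
    {"model": [LogisticRegression(max_iter=1000)],
     "model__C": [0.1, 1.0, 10]},
    {"model": [DecisionTreeClassifier(random_state=42)],
     "model__max_depth": [3, 5],
     "model__criterion": ["gini", "entropy"]},
]

# Draw 4 of the 7 configurations and score each with 5-fold CV on F1.
demo_search = RandomizedSearchCV(demo_pipeline, demo_grid, n_iter=4, cv=5,
                                 scoring="f1", random_state=42)
demo_search.fit(X_demo, y_demo)

print(len(demo_search.cv_results_["params"]))  # 4
print(demo_search.best_params_)
```

Passing `random_state` makes the sampled configurations reproducible between runs. When every parameter is given as a list, scikit-learn samples without replacement, so the candidates are distinct.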


_Model fit with training data:_

In [None]:
random_search.fit(X_train,y_train)

### The Parameters for the best Estimator:

Best Model:

In [None]:
display(random_search.best_estimator_)

In [None]:
print('Best cross-validation score ', random_search.best_score_)
print('Test-set score:  ', random_search.score(X_test, y_test))
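
Because the search is configured with `scoring='f1'`, both numbers printed above are F1 scores for the positive class (`click = 1`). As a reminder of how that metric is built, here is a worked example on made-up counts. The helper name and the counts are illustrative only.

```python
def f1_from_counts(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)   # share of predicted clicks that were real clicks
    recall = tp / (tp + fn)      # share of real clicks that were predicted
    return 2 * precision * recall / (precision + recall)

# Made-up confusion counts: 30 true positives, 20 false positives, 30 false negatives.
f1 = f1_from_counts(tp=30, fp=20, fn=30)
print(round(f1, 3))  # 0.545  (precision 0.6, recall 0.5)
```

True negatives do not enter the formula, which is why F1 is less flattered by the many non-click records than plain accuracy.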

_Top 10 Models:_

In [None]:
results = pd.DataFrame(random_search.cv_results_)
print("Tried",results.shape[0],"different Model constellations")
results.sort_values(by=["rank_test_score"], inplace=True)

In [None]:
results.head(10)

### Saving Important Outputs
Saved so that the results can be reused after the notebook has been run and closed.

In [None]:
# save the fitted search object to disk
filename = 'tuned_random_search.sav'
with open(filename, 'wb') as f:
    pickle.dump(random_search, f)
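
Loading the file back works the same way in reverse. The sketch below round-trips a small stand-in dictionary through a temporary file so that it runs without the fitted search. The stand-in object and the temporary path are assumptions for illustration.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted `random_search`.
search_stand_in = {"best_score": 0.5, "best_params": {"model__C": 1.0}}

# Write to a temporary directory so the example leaves the working folder clean.
path = os.path.join(tempfile.mkdtemp(), "tuned_random_search.sav")

with open(path, "wb") as f:
    pickle.dump(search_stand_in, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == search_stand_in)  # True
```

When the real search object is loaded in a new session, the custom classes (`FeatureSelector`, `CategoryAggregator`) have to be defined or imported first, since pickle stores only a reference to a class and not its code.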