### Trees: Ensemble Methods - Bagging

Bagging: training a collection of individual models independently of each other, so they can be trained in parallel. Each model is trained on a random subset of the data. (Summary!)

BAGGing stands for Bootstrapping (sampling with replacement) and AGGregating (averaging predictions).
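The two steps can be sketched by hand. This is a minimal illustration, not how scikit-learn implements bagging internally: the number of models (`n_models = 25`), the seeds, and the use of a majority vote for the class labels are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rng = np.random.default_rng(0)
n_models = 25  # illustrative choice, not a recommendation

trees = []
for _ in range(n_models):
    # Bootstrapping: draw as many row indices as there are training rows, with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx]))

# Aggregating: majority vote over the predictions of the individual trees
all_preds = np.stack([t.predict(X_test) for t in trees])  # shape (n_models, n_test)
bagged_pred = np.array([np.bincount(col, minlength=3).argmax() for col in all_preds.T])

accuracy = (bagged_pred == y_test).mean()
print(f"bagged accuracy: {accuracy:.3f}")
```

For regression, the vote would be replaced by a plain mean of the predictions.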

### <strong> Random Forest </strong>

In addition to training each tree on a random subset of the data, Random Forest also considers only a random selection of features, rather than all features, when growing the trees. Many such randomized trees together are called a Random Forest.

With Random Forest, our goal is to reduce the variance of a single decision tree. We end up with an ensemble of different models, and the average of the predictions from all trees is used, which is more robust than the prediction of a single decision tree.

- Forests are built from high-variance, low-bias base learners
- Bagging decreases the model’s variance

<img src="./images/boostrap_aggregating.png" width="500" height="500" />
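One way to look at this is to compare a single tree with a forest under cross-validation. The sketch below is only an illustration: the breast cancer dataset, the 5 folds, and `n_estimators=100` are assumptions chosen for the example, and the exact scores depend on the scikit-learn version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A single, fully grown tree: low bias, high variance
tree = DecisionTreeClassifier(random_state=0)
# Many such trees, each on a bootstrap sample with random feature subsets
forest = RandomForestClassifier(n_estimators=100, random_state=0)

tree_scores = cross_val_score(tree, X, y, cv=5)
forest_scores = cross_val_score(forest, X, y, cv=5)

print(f"single tree:   mean={tree_scores.mean():.3f}, std={tree_scores.std():.3f}")
print(f"random forest: mean={forest_scores.mean():.3f}, std={forest_scores.std():.3f}")
```

The spread of the fold scores is only a rough proxy for the variance of the model, but it usually makes the difference visible.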

### <strong> Extremely Randomized Trees </strong>

Extremely Randomized Trees, abbreviated as ExtraTrees in scikit-learn, add one more step of randomization to the random forest algorithm.

Random forests will

1. compute the optimal split for each feature within the randomly selected subset and then choose the best feature to split on.
2. build multiple trees with `bootstrap=True` (the default), which means each tree is trained on a sample drawn with replacement.

ExtraTrees, in contrast to Random Forests, will instead choose a random split for each feature within that random subset and then pick the best feature to split on by comparing those randomly chosen splits. (Nodes are split on random splits, not best splits.)

Like random forest, the Extra Trees algorithm will randomly sample the features at each split point of a decision tree. Unlike random forest, which uses a greedy algorithm to select an optimal split point, the Extra Trees algorithm selects a split point at random.

In terms of computational cost, and therefore execution time, the Extra Trees algorithm is faster. This algorithm saves time because the whole procedure is the same, but it randomly chooses the split point and does not calculate the optimal one.

Extremely randomized trees are generally cheaper to train than random forests, and their predictive performance is often comparable. In some cases, they may even perform better!
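A rough way to check both claims is to fit the two estimators on the same data and compare accuracy and fit time. This is a sketch under assumptions: the synthetic dataset from `make_classification` and its parameters are made up for the example, and the timings will vary by machine. Note that in scikit-learn `ExtraTreesClassifier` defaults to `bootstrap=False`, while `RandomForestClassifier` defaults to `bootstrap=True`.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, only for illustration
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for name, Model in [("RandomForest", RandomForestClassifier), ("ExtraTrees", ExtraTreesClassifier)]:
    model = Model(n_estimators=100, random_state=0)
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    # Store the bootstrap default, the test accuracy, and the fit time
    results[name] = (model.bootstrap, model.score(X_test, y_test), elapsed)
    print(f"{name}: bootstrap={model.bootstrap}, "
          f"accuracy={results[name][1]:.3f}, fit time={elapsed:.2f}s")
```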

![Bagging](./images/rf_extra.png)

Link to Paper: https://link.springer.com/content/pdf/10.1007/s10994-006-6226-1.pdf

In [1]:
#import libraries
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score

In [2]:
#load dataset

X,y = load_iris(return_X_y=True)

#train,test split

X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=42)

#random forest with gini
rf = RandomForestClassifier(criterion='gini',n_estimators=150,max_depth=4,n_jobs=-1)

rf.fit(X_train,y_train)  #fit on the data

rf_predict = rf.predict(X_test)

f1_score(y_test, rf_predict, average=None)

array([1., 1., 1.])

In [3]:
#random forest with entropy
rf = RandomForestClassifier(criterion='entropy',n_estimators=200,max_depth=4,n_jobs=-1)

rf.fit(X_train,y_train)

rf_predict = rf.predict(X_test)

f1_score(y_test, rf_predict, average=None)

array([1., 1., 1.])

Exercise: Using the data in scout_data, build a model to predict the product tier (classification) and a model to predict the number of detail views (regression).

In [15]:
data = pd.read_csv("data/scout_data/Case_Study_Data.csv",sep=';')

print(data.shape)

data = data.dropna()

#print(data.dropna().shape)
print(data.shape)
print(data.head())

(78321, 12)
(78297, 12)
   article_id product_tier      make_name  price  first_zip_digit  \
0   350625839        Basic     Mitsubishi  16750                5   
1   354412280        Basic  Mercedes-Benz  35950                4   
2   349572992        Basic  Mercedes-Benz  11950                3   
3   350266763        Basic           Ford   1750                6   
4   355688985        Basic  Mercedes-Benz  26500                3   

   first_registration_year created_date deleted_date  search_views  \
0                     2013     24.07.18     24.08.18        3091.0   
1                     2015     16.08.18     07.10.18        3283.0   
2                     1998     16.07.18     05.09.18        3247.0   
3                     2003     20.07.18     29.10.18        1856.0   
4                     2014     28.08.18     08.09.18         490.0   

   detail_views  stock_days                   ctr  
0         123.0          30   0.03780329990294403  
1         223.0          52   0.0679

In [18]:
print(data.shape)
print(data.describe())

(78297, 12)
         article_id          price  first_zip_digit  first_registration_year  \
count  7.829700e+04   78297.000000     78297.000000             78297.000000   
mean   3.574864e+08   15069.670358         4.631876              2011.090336   
std    5.076809e+06   16375.598837         2.354368                 6.538638   
min    3.472324e+08      50.000000         1.000000              1924.000000   
25%    3.536387e+08    5750.000000         3.000000              2008.000000   
50%    3.585479e+08   10909.000000         5.000000              2013.000000   
75%    3.614817e+08   18890.000000         7.000000              2015.000000   
max    3.647040e+08  249888.000000         9.000000              2106.000000   

       search_views  detail_views    stock_days  
count   78297.00000  78297.000000  78297.000000  
mean     2297.91333     93.486583     35.995070  
std      6339.52668    228.042547     32.213083  
min         1.00000      0.000000     -3.000000  
25%       368.000

In [22]:
print(data['product_tier'].unique())

['Basic' 'Premium' 'Plus']


First model: Build a classifier for the "product tier" category.
 - at first: not worrying too much about weird data. ;-)

In [39]:
#split into features (X) and target (y)
X,y = data.drop(['product_tier'], axis=1), data['product_tier']

print(X.shape, '\n', y.shape)

(78297, 11) 
 (78297,)


OK, right: we'll need to encode the categorical data. 

In [40]:
print(X.head(), '\n')
print(X.dtypes)

   article_id      make_name  price  first_zip_digit  first_registration_year  \
0   350625839     Mitsubishi  16750                5                     2013   
1   354412280  Mercedes-Benz  35950                4                     2015   
2   349572992  Mercedes-Benz  11950                3                     1998   
3   350266763           Ford   1750                6                     2003   
4   355688985  Mercedes-Benz  26500                3                     2014   

  created_date deleted_date  search_views  detail_views  stock_days  \
0     24.07.18     24.08.18        3091.0         123.0          30   
1     16.08.18     07.10.18        3283.0         223.0          52   
2     16.07.18     05.09.18        3247.0         265.0          51   
3     20.07.18     29.10.18        1856.0          26.0         101   
4     28.08.18     08.09.18         490.0          20.0          12   

                    ctr  
0   0.03780329990294403  
1   0.06792567773378008  
2    0.0

OK ... I'll just drop created_date and deleted_date. But what's ctr?

"ctr;Click through rate calculated as the quotient of detail_views over search_views"

Why is this not a float? Let's try to make it one.
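One possible approach, sketched on a toy Series rather than the real column: `pd.to_numeric` with `errors="coerce"` converts what it can and marks the rest as NaN, which makes the problematic entries easy to inspect. The malformed value below is a hypothetical example; what actually breaks the real `ctr` column is not known at this point. Since the description defines ctr as detail_views over search_views, recomputing it is another option.

```python
import pandas as pd

# Toy stand-in for the ctr column; the malformed entry is a hypothetical example
ctr_raw = pd.Series(["0.0378", "0.0679", "not-a-number", "0.0"], name="ctr")

# errors="coerce" turns anything unparseable into NaN instead of raising
ctr_num = pd.to_numeric(ctr_raw, errors="coerce")

# Inspect the entries that could not be parsed
bad_rows = ctr_raw[ctr_num.isna()]
print(ctr_num.dtype)
print(bad_rows)

# Alternative: recompute ctr from its definition (detail_views / search_views)
views = pd.DataFrame({"detail_views": [123.0, 223.0], "search_views": [3091.0, 3283.0]})
views["ctr"] = views["detail_views"] / views["search_views"]
print(views)
```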

In [42]:
print(X.columns)

X = X.drop(['created_date', 'deleted_date'], axis=1)
# Note: this can only be run once. ;-)
print(X.shape)


Index(['article_id', 'make_name', 'price', 'first_zip_digit',
       'first_registration_year', 'created_date', 'deleted_date',
       'search_views', 'detail_views', 'stock_days', 'ctr'],
      dtype='object')
(78297, 9)


In [45]:
#X['ctr'] = X['ctr'].astype('float64')
X['ctr'].describe()

count     78297
unique    47246
top         0.0
freq       1244
Name: ctr, dtype: object

OK, this column does strange things. For now: ignore it!

Todo: find a better way to handle it!?

In [46]:
# Drop this thing: 
X = X.drop(['ctr'], axis=1)


Next: Encode!

In [55]:
#one hot encoding for make_name column
import category_encoders as ce

# create an object of the OneHotEncoder
ce_one = ce.OneHotEncoder(cols=['make_name'])

print(X)

Xone = ce_one.fit_transform(X)
# What is actually happening here?

print(Xone)

       article_id      make_name  price  first_zip_digit  \
0       350625839     Mitsubishi  16750                5   
1       354412280  Mercedes-Benz  35950                4   
2       349572992  Mercedes-Benz  11950                3   
3       350266763           Ford   1750                6   
4       355688985  Mercedes-Benz  26500                3   
...           ...            ...    ...              ...   
78316   348704581          Lexus  15740                8   
78317   359231940        Hyundai   2950                6   
78318   362425932     Volkswagen   7850                8   
78319   357164227         Toyota  13945                5   
78320   353639932     Volkswagen  38800                7   

       first_registration_year  search_views  detail_views  stock_days  
0                         2013        3091.0         123.0          30  
1                         2015        3283.0         223.0          52  
2                         1998        3247.0         265.0  

OK, shit. Why do I have 91 make names?
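A quick way to see how many columns one-hot encoding would create is to count the distinct values first. The sketch below uses a made-up toy column, not the scout data; the threshold `min_count` and the bucket name `Other` are illustrative assumptions, and lumping rare makes together is only one possible way to reduce the column count.

```python
import pandas as pd

# Hypothetical toy column standing in for X['make_name']
makes = pd.Series(["Ford", "Ford", "BMW", "Skoda", "Ford", "BMW", "TVR"], name="make_name")

n_unique = makes.nunique()      # number of one-hot columns this would create
counts = makes.value_counts()   # how often each make occurs

# Optional: lump makes seen fewer than min_count times into one bucket
min_count = 2  # illustrative threshold
rare = counts[counts < min_count].index
makes_grouped = makes.where(~makes.isin(rare), "Other")

print(n_unique, makes_grouped.nunique())
```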

In [53]:
X['make_name'].unique()

array(['Mitsubishi', 'Mercedes-Benz', 'Ford', 'Volkswagen', 'Fiat',
       'Renault', 'Mazda', 'Peugeot', 'Opel', 'Toyota', 'Jaguar', 'Volvo',
       'Dacia', 'MINI', 'Porsche', 'Nissan', 'BMW', 'Land Rover', 'Audi',
       'Citroen', 'Hyundai', 'Suzuki', 'Alfa Romeo', 'Chevrolet',
       'Daewoo', 'Kia', 'Maserati', 'Skoda', 'Caravans-Wohnm', 'SEAT',
       'Honda', 'Daihatsu', 'Chrysler', 'smart', 'Saab', 'Jeep',
       'Others ', 'Lexus', 'Aixam', 'Ligier', 'Lancia', 'Oldtimer',
       'Chatenet', 'Subaru', 'Triumph', 'Ferrari', 'Rolls-Royce', 'Dodge',
       'MG', 'Cadillac', 'DS Automobiles', 'Iveco', 'Bentley',
       'SsangYong', 'Tesla', 'Trucks-Lkw', 'TVR', 'Aston Martin',
       'Abarth', 'HUMMER', 'Lincoln', 'Isuzu', 'Microcar', 'Buick', 'AC',
       'Alpina', 'Corvette', 'McLaren', 'Rover', 'Austin', 'De Tomaso',
       'FISKER', 'Infiniti', 'Lotus', 'Morgan', 'GMC', 'Oldsmobile',
       'Donkervoort', 'Alpine', 'Daimler', 'Lamborghini', 'Grecav',
       'Casalini', 'Pontia

Hmm, apparently I did something wrong there earlier. :-)

Ah, or had I only looked at the target?

In [60]:
#one hot encoding for the product_tier target
import category_encoders as ce

# create an object of the OneHotEncoder
ce_one = ce.OneHotEncoder(cols=['product_tier'])


y = ce_one.fit_transform(y)
# What is actually happening here?

print(y)

       product_tier_1  product_tier_2  product_tier_3
0                   1               0               0
1                   1               0               0
2                   1               0               0
3                   1               0               0
4                   1               0               0
...               ...             ...             ...
78316               1               0               0
78317               1               0               0
78318               1               0               0
78319               1               0               0
78320               1               0               0

[78297 rows x 3 columns]


In [57]:
#Target encoding
ce_te = ce.TargetEncoder(cols=['make_name'])

#column to perform encoding
Xmake = X['make_name']
#y = data['color']
print(y)
#fit the TargetEncoder (still commented out)
#ce_te.fit(Xmake,y)

# TODO: this is where I left off ...

#ce_te.transform(Xmake).head()

0        Basic
1        Basic
2        Basic
3        Basic
4        Basic
         ...  
78316    Basic
78317    Basic
78318    Basic
78319    Basic
78320    Basic
Name: product_tier, Length: 78297, dtype: object


In [47]:
from sklearn.metrics import f1_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn import tree


#train_test_split
X_train,X_test, y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)

#initialize the decisiontreeclassifier
dtc = tree.DecisionTreeClassifier(max_depth=5,random_state=42,criterion='gini')

#fit and return f1_score
dtc.fit(X_train,y_train)
f1_score(y_test,dtc.predict(X_test),average=None)

ValueError: could not convert string to float: 'Skoda'
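The error appears to come from `X` still containing the raw `make_name` strings: the encoded frame was never passed to the classifier. One way around this, sketched here on a tiny made-up frame rather than the scout data, is to put the encoding and the tree into a single scikit-learn pipeline. The toy values and the choice of `OneHotEncoder` inside a `ColumnTransformer` are assumptions for illustration. The string target can stay as it is, since scikit-learn classifiers accept string class labels.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Tiny hypothetical stand-in for the scout data (values are made up)
toy = pd.DataFrame({
    "make_name": ["Skoda", "Ford", "Skoda", "BMW", "Ford", "BMW"],
    "price": [9000, 1750, 12000, 38000, 2500, 41000],
    "product_tier": ["Basic", "Basic", "Basic", "Premium", "Basic", "Premium"],
})
X_toy, y_toy = toy.drop(columns="product_tier"), toy["product_tier"]

# One-hot encode make_name, pass the numeric columns through unchanged
preprocess = ColumnTransformer(
    [("make", OneHotEncoder(handle_unknown="ignore"), ["make_name"])],
    remainder="passthrough",
)
clf = make_pipeline(preprocess, DecisionTreeClassifier(max_depth=5, random_state=42))

clf.fit(X_toy, y_toy)
print(clf.predict(X_toy))
```

With `handle_unknown="ignore"`, a make that only shows up in the test split is encoded as all zeros instead of raising an error.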