# Why R Conference? Introduction to mlr

## Spam E-mail Database  
--------------------


### Description

A data set collected at Hewlett-Packard Labs that classifies 4601 e-mails as spam or non-spam. In addition to this class label, there are 57 variables indicating the frequency of certain words and characters in the e-mail.

### Format

A data frame with 4601 observations and 58 variables.

The first 48 variables contain the frequency of the variable name (e.g., business) in the e-mail. If the variable name starts with num (e.g., num650), then it indicates the frequency of the corresponding number (e.g., 650). Variables 49-54 indicate the frequency of the characters `;`, `(`, `[`, `!`, `$`, and `#`. Variables 55-57 contain the average, longest, and total run-length of capital letters. Variable 58 indicates the type of the mail and is either `"nonspam"` or `"spam"`, i.e. unsolicited commercial e-mail.

### Details

The data set contains 2788 e-mails classified as `"nonspam"` and 1813 classified as `"spam"`.
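As a quick check on these counts, the class shares follow directly from the two numbers above. The sketch below uses only base R and hard-codes the counts from the text; the variable names are illustrative.

```r
# Class counts as stated in the text above
counts = c(nonspam = 2788, spam = 1813)

# Total number of e-mails; should match the 4601 observations
n.total = sum(counts)

# Share of each class, rounded to three decimals
shares = round(counts / n.total, 3)

print(n.total)
print(shares)
```

Roughly 39% of the e-mails are spam, which is the ratio a stratified train/test split would need to preserve.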

The “spam” concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography... This collection of spam e-mails came from the collectors' postmaster and individuals who had filed spam. The collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.

### Source

*   Creators: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt at Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304
    
*   Donor: George Forman (gforman at nospam hpl.hp.com) 650-857-7835
    

These data have been taken from the UCI Repository Of Machine Learning Databases at [http://www.ics.uci.edu/~mlearn/MLRepository.html](http://www.ics.uci.edu/~mlearn/MLRepository.html)

### References

T. Hastie, R. Tibshirani, J.H. Friedman. _The Elements of Statistical Learning._ Springer, 2001.


## Exercise
---------------

a) Create a binary classification task from the spam data.

In [7]:
library(mlr)
data(spam, package = "kernlab")
spam.task = makeClassifTask(id = "spam", data = spam, target = "type", positive = "spam")
spam.task

Supervised task: spam
Type: classif
Target: type
Observations: 4601
Features:
   numerics     factors     ordered functionals 
         57           0           0           0 
Missings: FALSE
Has weights: FALSE
Has blocking: FALSE
Has coordinates: FALSE
Classes: 2
nonspam    spam 
   2788    1813 
Positive class: spam

b) List all learners that could be trained on `spam.task`

In [8]:
listLearners(spam.task, warn.missing.packages = FALSE)

class,name,short.name,package,note,type,installed,numerics,factors,ordered,⋯,multiclass,class.weights,featimp,oobpreds,functionals,single.functional,se,lcens,rcens,icens
classif.binomial,Binomial Regression,binomial,stats,Delegates to `glm` with freely choosable binomial link function via learner parameter `link`. We set 'model' to FALSE by default to save memory.,classif,True,True,True,False,⋯,False,False,False,False,False,False,False,False,False,False
classif.cforest,Random forest based on conditional inference trees,cforest,party,See `?ctree_control` for possible breakage for nominal features with missingness.,classif,True,True,True,True,⋯,True,False,True,False,False,False,False,False,False,False
classif.ctree,Conditional Inference Trees,ctree,party,See `?ctree_control` for possible breakage for nominal features with missingness.,classif,True,True,True,True,⋯,True,False,False,False,False,False,False,False,False,False
classif.featureless,Featureless classifier,featureless,mlr,,classif,True,True,True,True,⋯,True,False,False,False,True,False,False,False,False,False
classif.fnn,Fast k-Nearest Neighbour,fnn,FNN,,classif,True,True,False,False,⋯,True,False,False,False,False,False,False,False,False,False
classif.gausspr,Gaussian Processes,gausspr,kernlab,Kernel parameters have to be passed directly and not by using the `kpar` list in `gausspr`.  Note that `fit` has been set to `FALSE` by default for speed.,classif,True,True,True,False,⋯,True,False,False,False,False,False,False,False,False,False
classif.h2o.deeplearning,h2o.deeplearning,h2o.dl,h2o,"The default value of `missing_values_handling` is `""MeanImputation""`, so missing values are automatically mean-imputed.",classif,True,True,True,False,⋯,True,False,False,False,False,False,False,False,False,False
classif.h2o.gbm,h2o.gbm,h2o.gbm,h2o,'distribution' is set automatically to 'gaussian'.,classif,True,True,True,False,⋯,True,False,False,False,False,False,False,False,False,False
classif.h2o.glm,h2o.glm,h2o.glm,h2o,"`family` is always set to `""binomial""` to get a binary classifier. The default value of `missing_values_handling` is `""MeanImputation""`, so missing values are automatically mean-imputed.",classif,True,True,True,False,⋯,False,False,False,False,False,False,False,False,False,False
classif.h2o.randomForest,h2o.randomForest,h2o.rf,h2o,,classif,True,True,True,False,⋯,True,False,False,False,False,False,False,False,False,False


c) Select a learner you like and create it. If you want to, you can change its hyperparameters.

In [21]:
lrn = makeLearner("classif.rpart", predict.type = "prob")
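Hyperparameters can be set when the learner is created or changed afterwards. The following is a minimal sketch, assuming the `rpart` package is installed. The values chosen for `maxdepth` and `cp` are arbitrary illustrations, not tuned recommendations, and the names `lrn.base`, `lrn.small` and `lrn.deep` are made up for this example.

```r
library(mlr)

# Create a learner with default settings and list its hyperparameters
lrn.base = makeLearner("classif.rpart", predict.type = "prob")
getParamSet(lrn.base)

# Option 1: pass hyperparameters at construction time
lrn.small = makeLearner("classif.rpart", predict.type = "prob",
  par.vals = list(maxdepth = 5, cp = 0.01))

# Option 2: modify an existing learner (returns a modified copy)
lrn.deep = setHyperPars(lrn.base, maxdepth = 10, cp = 0.001)

# Inspect the values that are now set
getHyperPars(lrn.small)
getHyperPars(lrn.deep)
```

Because `setHyperPars` returns a new learner object, `lrn.base` keeps its defaults and can be reused for comparison.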

d) Create index sets for training and testing. The test set should have 1000 observations.

d*) Ensure that the ratio of `"spam"` to `"nonspam"` in the training and test sets is the same as in the full dataset.

In [11]:
n = getTaskSize(spam.task)
test.inds = sample(1:n, size = 1000)
train.inds = setdiff(1:n, test.inds) 
head(test.inds)
head(train.inds)
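The code above draws the test set uniformly at random, so the class ratio matches the full data only approximately. One way to approach d*) is to sample within each class. The sketch below uses base R on a stand-in label vector built from the class counts in the Details section, so it runs without the data; with the real task, `labels` would be `getTaskTargets(spam.task)`. The helper name `stratifiedTestInds`, the `.strat` variable names and the seed are assumptions made for illustration.

```r
# Stand-in labels with the class counts reported for the spam data
labels = factor(rep(c("nonspam", "spam"), times = c(2788, 1813)))

# Hypothetical helper: draw test indices separately within each class
stratifiedTestInds = function(labels, test.size) {
  n = length(labels)
  inds.by.class = split(seq_len(n), labels)
  picked = lapply(inds.by.class, function(inds) {
    # Number of test observations proportional to the class size
    k = round(test.size * length(inds) / n)
    inds[sample.int(length(inds), k)]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(1)  # arbitrary seed, only for reproducibility
test.inds.strat = stratifiedTestInds(labels, test.size = 1000)
train.inds.strat = setdiff(seq_along(labels), test.inds.strat)

# Compare class shares in the full data, test set and training set
prop.table(table(labels))
prop.table(table(labels[test.inds.strat]))
prop.table(table(labels[train.inds.strat]))
```

Rounding the per-class sizes happens to give exactly 1000 test observations here (606 non-spam and 394 spam); for other class counts the total could be off by one and would need a small correction.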

e) Train your model on the train subset of the spam data and predict on the test subset.

In [22]:
mod = train(lrn, spam.task, subset = train.inds)
preds = predict(mod, spam.task, subset = test.inds)
print(mod)

print(preds)

Model for learner.id=classif.rpart; learner.class=classif.rpart
Trained on: task.id = spam; obs = 3601; features = 57
Hyperparameters: xval=0
Prediction: 1000 observations
predict.type: prob
threshold: nonspam=0.50,spam=0.50
time: 0.01
       id   truth prob.nonspam prob.spam response
1849 1849 nonspam      0.91453   0.08547  nonspam
3260 3260 nonspam      0.91453   0.08547  nonspam
808   808    spam      0.81266   0.18734  nonspam
1473 1473    spam      0.05836   0.94164     spam
4069 4069 nonspam      0.91453   0.08547  nonspam
4061 4061 nonspam      0.81266   0.18734  nonspam
... (#rows: 1000, #cols: 5)


f) Evaluate the performance of your model using accuracy and the area under the ROC curve (AUC).

In [24]:
perf = performance(preds, measures = list(acc, auc))
perf
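To make the two measures concrete, they can be recomputed by hand on the six prediction rows printed in e). This is only an illustration of the definitions in base R: six rows say nothing about performance on the full test set, and the pairwise formula below is the standard rank-based definition of AUC, not necessarily how `mlr` computes it internally.

```r
# The six rows shown by print(preds) in e)
truth = c("nonspam", "nonspam", "spam", "spam", "nonspam", "nonspam")
response = c("nonspam", "nonspam", "nonspam", "spam", "nonspam", "nonspam")
prob.spam = c(0.08547, 0.08547, 0.18734, 0.94164, 0.08547, 0.18734)

# Accuracy: share of predictions that match the true label
acc.manual = mean(truth == response)

# AUC: probability that a random spam e-mail receives a higher spam
# score than a random non-spam e-mail, counting ties as one half
pos = prob.spam[truth == "spam"]
neg = prob.spam[truth == "nonspam"]
cmp = outer(pos, neg, function(p, q) (p > q) + 0.5 * (p == q))
auc.manual = mean(cmp)

print(acc.manual)
print(auc.manual)
```

On these six rows the accuracy is 5/6 and the AUC is 0.9375. The accuracy depends on the 0.5 threshold, while the AUC only depends on how the predicted probabilities rank the e-mails.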

g) Try to find a model with an AUC of at least 98%. 
- Try different models
- Change hyperparameters 
- Have a closer look at the features and try to find transformations or combinations of features that improve your model's performance

In [26]:
lrn2 = makeLearner("classif.randomForest", predict.type = "prob")
mod = train(lrn2, spam.task, subset = train.inds)
preds = predict(mod, spam.task, subset = test.inds)
performance(preds, measures = auc)