## Topics
- Vowpal Wabbit
- Locality Sensitive Hashing (LSH)
- Apache Storm
- Apache Hadoop and HDFS
- AdaBoost classifier
- Gradient Boost Decision Tree (GBDT)
- Cluster analysis and clustering algorithms

## Abaca Spam Classifier Improvement
### Initial notes
- noisy feature data may not work well with ML models
- need to add weights to features
- 1% of senders send over 50% of all emails, so a few high-volume senders can dominate the training data and skew the model
- sender volume appears to follow a power-law distribution
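
A minimal sketch for checking how concentrated sender volume is in a sample, assuming the data is available as a flat list of sender addresses (one entry per message). The function name and the sample addresses are hypothetical and not part of Abaca.

```python
from collections import Counter


def top_sender_share(senders, top_fraction=0.01):
    """Return the share of all messages sent by the top `top_fraction` of senders."""
    counts = sorted(Counter(senders).values(), reverse=True)
    if not counts:
        return 0.0
    # Always include at least one sender so tiny samples still work.
    n_top = max(1, int(len(counts) * top_fraction))
    return sum(counts[:n_top]) / sum(counts)


# Hypothetical sample: one bulk sender plus 99 senders with one message each.
sample = ["bulk@example.com"] * 150 + [f"user{i}@example.com" for i in range(99)]
share = top_sender_share(sample)
print(f"top 1% of senders account for {share:.0%} of messages")
```

A share well above `top_fraction` suggests that per-sender weighting or down-sampling may be worth trying before training.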

### Project plan

Naive Bayes --> Linear Regression --> incorporate retro data --> Logistic Regression --> Random Forest

(current accuracy: ~93%)

### Todo
- [ ] learn vowpal_wabbit
- [ ] evaluate which tools to use (Python/Java/C++)
- [ ] design model
    - [ ] design input features
    - [ ] build model with vw
    - [ ] add weight to different features
    - [ ] run model with different learning rates and compare
- [ ] design input features into model
- [ ] grab data with a Pig script, parse it, and feed it into the model script

### Useful git repos and links
- AbacaInbound [link](https://git.ouroath.com/asd/AbacaInbound)
- AbacaDecoder documentation [link](https://git.ouroath.com/asd/AbacaInbound/blob/master/rotla/doc/abaca_decoder.html)
- AbacaInboundMessage
    - AbacaInboundResponse.java??

## Random ML notes
### Model evaluation methods
#### Residuals
The model is fit using all the data points and the prediction for each data point is compared with its actual output. The absolute value of each error is taken and the mean of those values is computed to arrive at the mean absolute residual error. Models with lower values of this measure are deemed to be better.

![](https://www.cs.cmu.edu/~schneide/tut5/img90.gif)
Approximating a one-dimensional data set with A90:9, L90:9, L10:9 metacodes. The residual error for each data point is the distance along a vertical line between it and the fitted line. The result is very large, large, and zero residual error, respectively.

Drawback: the residual error gives no indication of how well the learner will do when it is asked to make predictions for data it has not already seen.
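
The mean absolute residual error can be sketched as follows. This is an illustration only: it assumes a simple least-squares line as the model, and the sample data is made up.

```python
def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, xs, ys):
    """Mean of the absolute vertical distances between the data and the line."""
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)) / len(xs)


# Made-up data: the model is fit on ALL points and scored on the same points.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1]
model = fit_line(xs, ys)
residual_error = mean_abs_error(model, xs, ys)
print(f"mean absolute residual error: {residual_error:.3f}")
```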

#### Cross Validation
Better than the residual method. Not all of the dataset is used for fitting: a subset is reserved to test the performance of the model after training.

The **holdout** method is the simplest kind of cross validation. The data set is separated into two sets, called the training set and the testing set. The function approximator fits a function using the training set only. Then the function approximator is asked to predict the output values for the data in the testing set (it has never seen these output values before). The errors it makes are accumulated as before to give the mean absolute test set error, which is used to evaluate the model. The advantage of this method is that it is usually preferable to the residual method and takes no longer to compute. However, its evaluation can have a high variance. The evaluation may depend heavily on which data points end up in the training set and which end up in the test set, and thus the evaluation may be significantly different depending on how the division is made.
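
A minimal sketch of the holdout method, again assuming a least-squares line as the function approximator. The 30% test fraction, the seeds, and the data are arbitrary choices for illustration.

```python
import random


def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def mean_abs_error(model, xs, ys):
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)) / len(xs)


def holdout_error(xs, ys, test_fraction=0.3, seed=0):
    """Fit on the training split only, then score on the unseen test split."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    model = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
    return mean_abs_error(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx])


# Made-up noisy line; a different seed means a different train/test division.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 + (0.3 if i % 2 else -0.3) for i, x in enumerate(xs)]
errors_by_seed = [holdout_error(xs, ys, seed=s) for s in range(5)]
print(errors_by_seed)
```

Comparing the values in `errors_by_seed` gives a rough feel for how much the estimate depends on the division.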

**K-fold cross validation** is one way to improve over the holdout method. The data set is divided into k subsets, and the holdout method is repeated k times. Each time, one of the k subsets is used as the test set and the other k-1 subsets are put together to form a training set. Then the average error across all k trials is computed. The advantage of this method is that it matters less how the data gets divided. Every data point gets to be in a test set exactly once, and gets to be in a training set k-1 times. The variance of the resulting estimate is reduced as k is increased. The disadvantage of this method is that the training algorithm has to be rerun from scratch k times, which means it takes k times as much computation to make an evaluation. A variant of this method is to randomly divide the data into a test and training set k different times. The advantage of doing this is that you can independently choose how large each test set is and how many trials you average over.
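
K-fold cross validation can be sketched in the same style. The least-squares line, the shuffling seed, and the sample data are assumptions made for illustration, not part of the original notes.

```python
import random


def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def mean_abs_error(model, xs, ys):
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)) / len(xs)


def k_fold_error(xs, ys, k, seed=0):
    """Average test error over k folds; each point is tested exactly once."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal subsets
    fold_errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        # The model is refit from scratch for every fold.
        model = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        fold_errors.append(
            mean_abs_error(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx])
        )
    return sum(fold_errors) / k


xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 + (0.3 if i % 2 else -0.3) for i, x in enumerate(xs)]
cv_error = k_fold_error(xs, ys, k=5)
print(f"5-fold mean absolute error: {cv_error:.3f}")
```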

**Leave-one-out cross validation** is K-fold cross validation taken to its logical extreme, with K equal to N, the number of data points in the set. That means that N separate times, the function approximator is trained on all the data except for one point and a prediction is made for that point. As before, the average error is computed and used to evaluate the model. The evaluation given by leave-one-out cross validation error (LOO-XVE) is good, but at first pass it seems very expensive to compute. Fortunately, locally weighted learners can make LOO predictions just as easily as they make regular predictions. That means computing the LOO-XVE takes no more time than computing the residual error, and it is a much better way to evaluate models. Vizier relies heavily on LOO-XVE to choose its metacodes.
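
A rough sketch of why LOO is cheap for a locally weighted learner. It assumes a Gaussian-kernel weighted average as the learner, which is a simple stand-in and not Vizier's actual implementation; the bandwidth and data are made up. A LOO prediction only has to skip the query point's own contribution, so no retraining is needed.

```python
import math


def kernel_predict(query, xs, ys, bandwidth=1.0, skip=None):
    """Gaussian-kernel weighted average of ys around `query`.

    `skip` is the index of a training point to ignore, which is all a
    leave-one-out prediction needs for this kind of learner.
    """
    num = 0.0
    den = 0.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        if i == skip:
            continue
        w = math.exp(-((x - query) ** 2) / (2 * bandwidth ** 2))
        num += w * y
        den += w
    return num / den


def residual_error(xs, ys, bandwidth=1.0):
    """Mean absolute error when every point is predicted using all the data."""
    errs = [abs(y - kernel_predict(x, xs, ys, bandwidth)) for x, y in zip(xs, ys)]
    return sum(errs) / len(errs)


def loo_error(xs, ys, bandwidth=1.0):
    """Mean absolute error when each point is predicted without itself."""
    errs = [
        abs(ys[i] - kernel_predict(xs[i], xs, ys, bandwidth, skip=i))
        for i in range(len(xs))
    ]
    return sum(errs) / len(errs)


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1]
resid = residual_error(xs, ys)
loo = loo_error(xs, ys)
print(f"residual error: {resid:.3f}, LOO-XVE: {loo:.3f}")
```

For this learner the LOO error is never smaller than the residual error, because the full prediction is pulled toward each point's own output.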

#### Stochastic gradient descent

## Cluster Analysis 
[*A tutorial on Clustering Algorithms*](https://home.deib.polimi.it/matteucc/Clustering/tutorial_html/)

**Clustering** is the grouping of a particular set of objects based on their characteristics, aggregating them according to their similarities. An important problem in *unsupervised learning*, it deals with finding a structure in a collection of unlabeled data.

> A cluster is therefore a collection of objects which are “similar” between them and are “dissimilar” to the objects belonging to other clusters.

For email, a clustering algorithm aims to recognize messages that deal with similar topics and aggregate them into one cluster. The resulting clusters can then be reviewed to identify which ones should be marked as spam.

## Clustering Algorithms
Clustering algorithms may be classified as listed below:
- Exclusive Clustering: a datum can only be in one cluster
- Overlapping Clustering: a datum may belong to two or more clusters with different degrees of membership
- Hierarchical Clustering: repeatedly merges the two nearest clusters; it starts with every datum as its own cluster and stops after enough iterations to reach the desired number of clusters
- Probabilistic Clustering: based on probability distributions
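
The hierarchical case above can be sketched in a few lines. This is a toy version on 1-D points that assumes single linkage (cluster distance is the distance between the closest pair of members), which is only one possible reading of "nearest". The data is made up.

```python
def agglomerative(points, n_clusters):
    """Single-linkage agglomerative clustering of 1-D points."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[p] for p in points]  # start: every datum is its own cluster
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # union of the two nearest clusters
        del clusters[j]
    return clusters


points = [1.0, 1.2, 0.9, 5.0, 5.3, 9.8, 10.1]
result = sorted(sorted(c) for c in agglomerative(points, 3))
print(result)
```

The nested loops make this cubic or worse in the number of points, so it is only meant to show the merge logic.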

4 Common Algorithms:
- K-means
- Fuzzy C-means
- Hierarchical clustering
- Mixture of Gaussians




## Locality Sensitive Hashing (LSH)
[LSH (Andoni, MIT)](http://www.mit.edu/~andoni/LSH/)

