# Notes on the book "Machine Learning for Subsurface Characterization"

* __Author__: _Lei Fu_
* __Date__: 04/20/2020

- <a>0. Information</a>
- <a>1. Unsupervised outlier detection techniques for well logs and geophysical data </a>
    - <a>1.1. Introduction</a>
    - <a>1.2 Outlier detection techniques</a>
    - <a>1.3 Unsupervised outlier detection techniques</a>
    - <a>1.4 Comparative study of unsupervised outlier detection methods on well logs</a>
    - <a>1.5 Performance of unsupervised ODTs on the four validation datasets</a>
    - <a>1.6 Conclusions</a>
 

## <a>0. Information </a>

    https://books.google.com/books?hl=en&lr=&id=WdO1DwAAQBAJ&oi=fnd&pg=PA1&dq=Unsupervised+Outlier+Detection+Techniques+for+Well+Logs+and+Geophysical+Data&ots=bqx1qdGVIz&sig=6sUAVkNO4yxcD3NhszLhicyMMrc#v=onepage&q&f=false
    
## <a>1. Unsupervised outlier detection techniques for well logs and geophysical data </a>

### <a>1.1. Introduction</a>

For example, washed-out zones in the wellbore and borehole rugosity significantly affect the readings of shallow-sensing logs, such as density, sonic, and photoelectric factor (PEF) logs, resulting in outlier responses. Along with wellbore conditions, uncommon beds and sudden changes in physical/chemical properties at a certain depth in the formation also result in outlier behavior of the subsurface measurements.

#### 1.1.1 Basic terminologies in machine learning and data-driven models

#### 1.1.2 Types of machine learning techniques

#### 1.1.3 Types of outliers

In the context of this work, outliers can be broadly categorized into three types:
- point/global outliers
- contextual outliers
- collective outliers

### <a>1.2 Outlier detection techniques</a>

Simple methods for outlier detection use statistical tools, such as boxplot and Z-score.
$$ Z\text{-score} = \frac{x_i - \bar{x}}{\sigma} $$
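
As a rough illustration of the two statistical checks named above, the sketch below applies a Z-score threshold and the boxplot (interquartile range) rule to a single log. The synthetic density values, the injected bad reading, and the cutoffs (`threshold=3.0`, `k=1.5`) are assumptions chosen for the example, not values taken from the book.

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    """Flag samples whose absolute Z-score exceeds `threshold` (assumed cutoff)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()  # (x_i - mean) / standard deviation
    return np.abs(z) > threshold

def iqr_outliers(x, k=1.5):
    """Flag samples outside the boxplot whiskers, Q1 - k*IQR and Q3 + k*IQR."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

# Synthetic bulk-density log (g/cc); all values are made up for illustration.
rng = np.random.default_rng(0)
density = rng.normal(loc=2.45, scale=0.05, size=200)
density[50] = 1.6  # hypothetical reading from a washed-out zone

z_mask = zscore_outliers(density)
iqr_mask = iqr_outliers(density)
print("Z-score flags:", np.flatnonzero(z_mask))
print("Boxplot flags:", np.flatnonzero(iqr_mask))
```

Both checks look at one feature at a time, so neither can see a sample that is unusual only in how its features combine.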

Outlier detection based on simple statistical tools generally assumes that the features are normally distributed and neglects the correlation between features in a multivariate dataset. Advanced outlier detection methods based on machine learning can handle correlated multivariate datasets, detect abnormalities within them, and do not assume a normal distribution of the features.

Unsupervised outlier detection techniques (ODTs) generally assume the following:
- (1) The number of outliers is much smaller than the number of normal samples.
- (2) Outliers do not follow the overall "trend" in the dataset.

The primary motivation of our study is to identify the best-performing unsupervised ODT methods that need minimal hyperparameter tuning and manual intervention.

### <a>1.3 Unsupervised outlier detection techniques</a>

Unsupervised ODTs are based on distance, density, decision boundary, or affinity, which are used to quantify the relationships among the features governing the inlier and outlier behavior of samples.

The chapter compares four unsupervised ODTs:
- isolation forest (IF)
- one-class SVM (OCSVM)
- local outlier factor (LOF)
- density-based spatial clustering of applications with noise (DBSCAN)

#### 1.3.1 Isolation forest

Isolation forest (IF) assumes that outliers tend to lie in sparse regions of the feature space and have more empty space around them than the densely clustered normal/inlier data. IF uses a forest of randomly partitioned trees to isolate outlier samples in terminating nodes. It recursively partitions the feature space by randomly selecting a feature and then a random threshold value for that feature. The path length, averaged over a forest of such random trees, is a measure of the normality of a sample: anomalies/outliers have noticeably shorter path lengths, because they can be isolated with only a few partitions of the feature space. A decision function categorizes each observation as an inlier or outlier based on the path length of the observation compared with the average path length of all observations. Unlike most other unsupervised ODTs, which use distance or density as the measure for outlier detection, IF uses isolation.
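
To make the path-length idea concrete, here is a minimal NumPy sketch of random partitioning. It is a simplification rather than the full algorithm: the query point is appended to each random subsample and the code counts how many random splits it takes to leave that point alone. The function names, the subsample size of 64, and the synthetic 2-D data are all assumptions made for illustration.

```python
import numpy as np

def isolation_path_length(X, point, rng, max_depth=50):
    """Count the random splits needed until `point` is alone in its partition.

    `point` is assumed to be one of the rows of X.
    """
    depth = 0
    while len(X) > 1 and depth < max_depth:
        feature = rng.integers(X.shape[1])      # pick a random feature
        lo, hi = X[:, feature].min(), X[:, feature].max()
        if lo == hi:
            break                               # degenerate feature; stop to keep the sketch simple
        split = rng.uniform(lo, hi)             # pick a random threshold on that feature
        # Keep only the side of the split that contains the query point.
        if point[feature] < split:
            X = X[X[:, feature] < split]
        else:
            X = X[X[:, feature] >= split]
        depth += 1
    return depth

def mean_path_length(X, point, n_trees=100, subsample=64, seed=0):
    """Average the path length of `point` over many random partitionings."""
    rng = np.random.default_rng(seed)
    lengths = []
    for _ in range(n_trees):
        idx = rng.choice(len(X), size=min(subsample, len(X)), replace=False)
        sample = np.vstack([X[idx], point])     # simplification: add the query point to the subsample
        lengths.append(isolation_path_length(sample, point, rng))
    return float(np.mean(lengths))

# Synthetic 2-D data: a dense cluster plus two hypothetical query points.
rng = np.random.default_rng(42)
inliers = rng.normal(0.0, 1.0, size=(500, 2))
typical_point = np.array([0.0, 0.0])    # sits inside the dense cluster
outlying_point = np.array([6.0, 6.0])   # sits in a sparse region

typical_len = mean_path_length(inliers, typical_point)
outlier_len = mean_path_length(inliers, outlying_point)
print(f"typical point: {typical_len:.1f} splits, outlying point: {outlier_len:.1f} splits")
```

The point in the sparse region should need far fewer splits on average, which is the property IF turns into an outlier score.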

IF has low computational requirements and time complexity, is fast to deploy, and can be parallelized for faster computation. IF does not require feature scaling or dimensionality reduction. Nonetheless, users need to tune the following hyperparameters: the amount of contamination in the dataset, the number of trees/estimators, the maximum number of samples used in each tree, and the maximum number of subsampled features used in each tree.
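
The four hyperparameters listed above map onto the arguments of scikit-learn's `IsolationForest`, which is one commonly used implementation (these notes do not say which library the book relies on). The sketch below is only an illustration: the two synthetic features, the injected outliers, and every hyperparameter value are assumptions, not tuned settings from the book.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic two-feature log data: bulk density (g/cc) and sonic travel time (us/ft).
# All values are made up for illustration.
rng = np.random.default_rng(0)
inliers = rng.normal(loc=[2.45, 80.0], scale=[0.05, 5.0], size=(300, 2))
outliers = rng.normal(loc=[1.8, 140.0], scale=[0.1, 10.0], size=(6, 2))
X = np.vstack([inliers, outliers])  # the last 6 rows are the injected outliers

model = IsolationForest(
    n_estimators=100,    # number of trees/estimators
    max_samples=256,     # maximum number of samples drawn for each tree
    max_features=1.0,    # fraction of features drawn for each tree
    contamination=0.02,  # assumed fraction of outliers in the dataset
    n_jobs=-1,           # build trees in parallel
    random_state=0,
)

labels = model.fit_predict(X)    # +1 for inliers, -1 for outliers
scores = model.score_samples(X)  # lower score = more abnormal

print("flagged rows:", np.flatnonzero(labels == -1))
```

No feature scaling is applied before fitting, in line with the note above that IF does not require it.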

#### 1.3.2 One-class SVM

One-class support vector machine (OCSVM) is a parametric unsupervised ODT suitable when the data points are mostly "normal" data with very few outliers (minimally contaminated). 