In [2]:
file = '../chapter-06/pyspark-machine-learning.ipynb'

# Read the whole notebook as text.
with open(file, 'r') as f:
    filedata = f.read()

newdata = (
    filedata
    # .replace("###", "####")
    # .replace("# 1.", "## 1.")
    # .replace("# 2.", "## 2.")
    # .replace("# 3.", "## 3.")
    # .replace("# 4.", "## 4.")
    # .replace("# 5.", "## 5.")
    .replace("<code style='font-size:13px'>", "`")
    .replace("<code style='font-size:13px;'>", "`")
    .replace("<code style='font-size:13px; color:firebrick'>", "`")
    .replace("<code style='font-size:13px; color:firebrick;'>", "`")
    .replace("</code>", "`")
)

# Write the cleaned text back to the same notebook.
with open(file, 'w') as f:
    f.write(newdata)

In [1]:
import glob
import os

In [2]:
glob.glob('../*/*.ipynb')

['../03-data-manipulation/numpy-arrays.ipynb',
 '../03-data-manipulation/pandas-data-transformation.ipynb',
 '../03-data-manipulation/pandas-data-exploratory.ipynb',
 '../03-data-manipulation/pandas-data-cleaning.ipynb',
 '../03-data-manipulation/janitor-pandas-extensions.ipynb',
 '../util/refactor.ipynb',
 '../01-python-programing/python-data-types.ipynb',
 '../01-python-programing/python-algorithms.ipynb',
 '../01-python-programing/python-data-containers.ipynb',
 '../01-python-programing/selenium-web-scraping.ipynb',
 '../01-python-programing/python-external-sources.ipynb',
 '../01-python-programing/python-basic-concepts.ipynb',
 '../01-python-programing/python-functions-objects.ipynb',
 '../05-data-visualization/plotly-interactive-visualization.ipynb',
 '../05-data-visualization/matplotlib-graph-construction.ipynb',
 '../05-data-visualization/seaborn-statistical-visualization.ipynb',
 '../09-unsupervised-learning/sklearn-clustering.ipynb',
 '../09-unsupervised-learning/pyod-anomaly-

In [10]:
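for file in glob.glob('../*/*.ipynb'):
    # Read the notebook as text.
    with open(file, 'r') as f:
        filedata = f.read()

    # Rename the "References" heading in every notebook.
    newdata = (
        filedata
        .replace("## References", "## Resources")
    )

    # Overwrite the notebook with the updated text.
    with open(file, 'w') as f:
        f.write(newdata)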

#### Local Outlier Factor
[LOF](https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.lof) is widely used in anomaly detection, especially for local outliers. It computes the local density deviation of a given data point with respect to its neighbors. A point is considered an outlier when its density is significantly lower than that of its neighbors. In other words, LOF compares the local density of a point to the local density of its k-nearest neighbors and outputs a score. The disadvantage of LOF, and of proximity-based algorithms in general, is that computing pairwise distances becomes very slow on large datasets.

The LOF procedure follows these steps:
- Determine the distance from data point $p_i$ to its $k$-th nearest neighbor (pyod supports all distance metrics from sklearn and scipy). This distance, the largest among the $k$ nearest neighbors, is called the *K-distance*. Because of ties in distance, the number of neighbors of $p_i$ within the K-distance can be greater than or equal to $k$; denote it $|N_p|$.
- Compute the *reachability distance (RD)* from each $p_i$ to each neighbor $p_j$. RD is defined as the maximum of the K-distance of $p_j$ and the distance between $p_i$ and $p_j$:

$$\text{RD}(p_i,p_j) = \max(\text{K-distance}_{p_j}, d(p_i,p_j))$$

:::{image} ../image/local_outlier_factor_2.png
:height: 250px
:align: center
:::
<br>

- Compute the *local reachability density (LRD)*. LRD is the inverse of the average RD from $p_i$ to its neighbors. A larger average RD gives a smaller LRD, meaning the density around $p_i$ is low:

$$\text{LRD}_{p_i}= \frac{1}{\sum_{p_j \in N_p}\frac{\text{RD}(p_i,p_j)}{|N_p|} } $$

- Calculate the *LOF score* for each $p_i$: the ratio of the average LRD of the neighbors of $p_i$ to the LRD of $p_i$. If a point is an inlier, its LRD is approximately equal to that of its neighbors, so its LOF is close to 1. If the point is an outlier, its LRD is lower than the average LRD of its neighbors, so its LOF is well above 1:

$$\text{LOF}_{p_i} = \frac{\sum_{p_j \in N_p} \text{LRD}_{p_j}}{|N_p|} \cdot \frac{1}{\text{LRD}_{p_i}}$$
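The steps above can be sketched in plain Python. This is a minimal brute-force illustration, not pyod's implementation: it assumes Euclidean distance, no duplicate points (a zero average RD would divide by zero), and takes the K-distance in RD at the neighbor $p_j$, as in the standard LOF definition. The function name `lof_scores` and the sample points are hypothetical.

```python
import math

def lof_scores(points, k):
    """Return the LOF score of every point (brute force, Euclidean distance)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_distance = []  # distance from each point to its k-th nearest neighbor
    neighbors = []   # indices within the K-distance; ties can give more than k
    for i in range(n):
        others = sorted(dist[i][j] for j in range(n) if j != i)
        kd = others[k - 1]
        k_distance.append(kd)
        neighbors.append([j for j in range(n) if j != i and dist[i][j] <= kd])

    def reach_dist(i, j):
        # RD(p_i, p_j) = max(K-distance of p_j, d(p_i, p_j))
        return max(k_distance[j], dist[i][j])

    # LRD: inverse of the average reachability distance to the neighbors.
    lrd = []
    for i in range(n):
        mean_rd = sum(reach_dist(i, j) for j in neighbors[i]) / len(neighbors[i])
        lrd.append(1.0 / mean_rd)

    # LOF: average LRD of the neighbors divided by the point's own LRD.
    return [
        sum(lrd[j] for j in neighbors[i]) / len(neighbors[i]) / lrd[i]
        for i in range(n)
    ]

# Four corners of a unit square plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, k=2)
print(scores)
```

The four corners have the same density as their neighbors, so their score is 1, while the isolated point at `(5, 5)` scores roughly 6. The nested distance matrix also makes the quadratic cost of proximity-based methods visible.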

#### Connectivity-based Outlier Factor
[COF](https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.cof) (Connectivity-based Outlier Factor) is a variant of LOF. LOF implicitly assumes that data points are distributed spherically around the instance; when the data points instead follow a linear relationship, the density estimate in LOF is no longer accurate. COF calculates the anomaly score from the average *chaining distance* between a point and its neighbors. COF is therefore suitable for local and dependency outliers but, like LOF, it is slow on large datasets.

Like LOF, COF first finds the $k$ nearest neighbors of point $p$, then arranges them in order of increasing distance to $p$. Let $e_i$ denote the *edge distance*, the distance between each pair of consecutive points in this chain; for example, $e_2$ is the distance between $P_2$ and $P_3$. The average chaining distance (ACD) of each instance is then:

$$\text{ACD}_p = \sum_{i=1}^k \frac{2(k+1-i)}{k(k+1)} e_i$$

:::{image} ../image/connectivity_based_outlier_factor.png
:height: 180px
:align: center
:::
<br>
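The ACD formula is a weighted average of the edge distances in which earlier edges (those closest to $p$) receive larger weights, and the weights sum to 1. A small sketch with hypothetical edge distances makes this concrete; `average_chain_distance` is an illustrative name, not a pyod function.

```python
def average_chain_distance(edges):
    """ACD = sum over i = 1..k of 2(k+1-i) / (k(k+1)) * e_i."""
    k = len(edges)
    weights = [2 * (k + 1 - i) / (k * (k + 1)) for i in range(1, k + 1)]
    return sum(w * e for w, e in zip(weights, edges))

# Hypothetical edge distances e_1, e_2, e_3 (k = 3).
# The weights are 1/2, 1/3 and 1/6.
edges = [1.0, 2.0, 4.0]
print(average_chain_distance(edges))  # 1/2*1 + 1/3*2 + 1/6*4 = 11/6
```

When all edges have the same length, the ACD equals that length, which is why points along an evenly spaced line end up with identical chaining distances.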

Finally, the anomaly score is the ratio of the average chaining distance of the instance to the mean of the average chaining distances of its $k$ nearest neighbors, written $N_k(p)$. The higher the score, the more likely the point is an outlier.

$$\text{COF}_p = \frac{\text{ACD}_p}{\frac{1}{k}\sum_{o \in N_k(p)} \text{ACD}_o}$$
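The whole COF computation can be sketched as follows. This is an illustration under stated assumptions rather than pyod's code: distances are Euclidean, and the chain is built greedily by repeatedly linking the closest not-yet-connected neighbor to any point already in the chain (one common construction, assumed here). The helper names `chain_edges`, `acd` and `cof_scores` are hypothetical.

```python
import math

def chain_edges(points, start, neighbor_idx):
    """Edge distances e_1..e_k of a chain grown greedily from `start`."""
    chain = [start]
    remaining = list(neighbor_idx)
    edges = []
    while remaining:
        # Closest remaining neighbor to any point already in the chain.
        d, nxt = min(
            (min(math.dist(points[c], points[r]) for c in chain), r)
            for r in remaining
        )
        edges.append(d)
        chain.append(nxt)
        remaining.remove(nxt)
    return edges

def acd(edges):
    """Average chaining distance: earlier edges get larger weights."""
    k = len(edges)
    return sum(2 * (k + 1 - i) / (k * (k + 1)) * e
               for i, e in enumerate(edges, start=1))

def cof_scores(points, k):
    n = len(points)
    # k nearest neighbors of every point (brute force).
    knn = [
        sorted((j for j in range(n) if j != i),
               key=lambda j: math.dist(points[i], points[j]))[:k]
        for i in range(n)
    ]
    acds = [acd(chain_edges(points, i, knn[i])) for i in range(n)]
    # Ratio of a point's ACD to the mean ACD of its k nearest neighbors.
    return [acds[i] / (sum(acds[j] for j in knn[i]) / k) for i in range(n)]

# Five evenly spaced points on a line plus one point off the line.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (2, 2)]
scores = cof_scores(points, k=2)
print(scores)
```

In this toy dataset the points on the line all score 1, while the off-line point `(2, 2)` scores above 1 because its first chain edge is longer than those of its neighbors.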

#### Isolation Forest
[Isolation Forest](https://pyod.readthedocs.io/en/latest/pyod.models.html#pyod.models.iforest.IForest) detects outliers with an ensemble of binary decision trees that isolate outliers from the rest of the data. Relying on the fact that outliers are few and different, IForest builds each tree on a sub-sample of the dataset, splitting each node on a randomly selected feature at a random threshold. Splitting continues until every instance has been isolated, the tree reaches its maximum height, or all data points in a node share the same value. Outliers have shorter paths from the root than other points, and the evidence is strongest when the trees in the forest agree. The anomaly score depends on the average path length from the root $\overline{h}_p$, the number of instances $n$ in the sub-sample, and the average path length of an unsuccessful search in a binary search tree $c(n)$:

$$\begin{aligned}
\text{iForest}_p &= 2^{-\frac{\overline{h}_p}{c(n)}} \\
c(n) &= 2 \cdot (\ln(n-1)+0.577) - \frac{2(n-1)}{n}
\end{aligned}$$
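The score formula can be evaluated directly. The sketch below assumes the logarithm is the natural logarithm and that 0.577 stands for the Euler-Mascheroni constant; returning 0 for `n <= 1` is an assumption made here only to avoid `log(0)`. The function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649  # the 0.577 constant in c(n)

def c(n):
    """Average path length of an unsuccessful BST search over n instances."""
    if n <= 1:
        return 0.0  # assumption: nothing to search in a single-instance node
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """iForest score: 2 ** (-mean path length / c(n))."""
    return 2.0 ** (-mean_path_length / c(n))

n = 256  # hypothetical sub-sample size
print(c(n))
print(anomaly_score(2.0, n))    # short average path: score closer to 1
print(anomaly_score(c(n), n))   # average path equal to c(n): score of 0.5
print(anomaly_score(20.0, n))   # long average path: score closer to 0
```

Scores near 1 therefore indicate likely outliers, scores well below 0.5 indicate inliers, and a dataset whose scores all sit around 0.5 has no distinct anomalies.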

IForest requires only two parameters, the number of trees and the sub-sample size. It works well with small sample sizes and high-dimensional data, it is fast, and it can be applied to all 4 types of anomaly. Its disadvantage is that each node in an iTree splits on a threshold of a single feature, so the data is always divided by horizontal and vertical (axis-parallel) cuts, which can cause some outliers to be missed.
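To show how the two parameters and the axis-parallel cuts interact, here is a simplified sketch of the isolation procedure. It is not pyod's or sklearn's implementation: instead of storing trees it follows only the branch that contains the query point, it caps the height at `ceil(log2(sample_size))` (an assumption), and it omits the $c(\cdot)$ adjustment usually added for nodes that still hold several points. All names and the toy data are hypothetical.

```python
import math
import random

def path_length(x, sample, rng, max_depth, depth=0):
    """Depth at which x is isolated by random axis-parallel splits."""
    if len(sample) <= 1 or depth >= max_depth:
        return depth
    # Only features whose values differ inside this node can be split.
    splittable = [f for f in range(len(x))
                  if min(p[f] for p in sample) < max(p[f] for p in sample)]
    if not splittable:
        return depth  # all points in the node share the same values
    feature = rng.choice(splittable)
    lo = min(p[feature] for p in sample)
    hi = max(p[feature] for p in sample)
    threshold = rng.uniform(lo, hi)
    # Keep only the points that fall on the same side of the cut as x.
    goes_left = x[feature] < threshold
    same_side = [p for p in sample if (p[feature] < threshold) == goes_left]
    return path_length(x, same_side, rng, max_depth, depth + 1)

def mean_path_length(x, data, n_trees=100, sample_size=64, seed=0):
    """Average isolation depth of x over n_trees random sub-samples."""
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(sample_size))
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(data, sample_size)
        total += path_length(x, sample, rng, max_depth)
    return total / n_trees

# A 10 x 10 grid of inliers plus one far-away point.
data = [(i, j) for i in range(10) for j in range(10)] + [(50, 50)]
print(mean_path_length((50, 50), data))  # outlier: short average path
print(mean_path_length((5, 5), data))    # inlier: longer average path
```

Because every cut is perpendicular to one axis, a point that is anomalous only along a diagonal direction may need as many cuts as an inlier, which is the weakness noted above.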