<a href="https://colab.research.google.com/github/okalenskyy/DS_Boilerplate/blob/main/01_DataAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#1.1 Detect anomalies in a time-series dataset.

##Outlier Detection

###Z-Score

The Z-score method is a statistical approach to outlier detection.
1. Compute the standard score, or Z-score, for each data point: the number of standard deviations by which the point deviates from the mean of the dataset.
2. Set a threshold for the Z-score; data points whose absolute Z-score exceeds it are considered outliers.

An important assumption made by the Z-score method is that the data is **normally distributed**, making it especially useful for datasets with symmetrical patterns around the mean.
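
The steps above can be sketched by hand on a small array. The values below are a hypothetical toy series (nine typical readings and one extreme value), and the threshold of 2.5 is only an assumption for illustration.

```python
import numpy as np

# Hypothetical toy series: nine typical readings and one extreme value
values = np.array([10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 11.0, 9.0, 10.0, 30.0])

mean = values.mean()
std = values.std()  # population standard deviation (ddof=0), also the default in scipy.stats.zscore

# Step 1: number of standard deviations each point lies from the mean
z_scores = (values - mean) / std

# Step 2: flag points whose absolute Z-score exceeds the threshold
threshold = 2.5  # assumed value for this sketch
outlier_mask = np.abs(z_scores) > threshold
print(values[outlier_mask])
```

Only the extreme reading is flagged here. With very small samples the largest possible Z-score is bounded, so a high threshold may never be reached.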

In [None]:
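import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer

threshold = 2.5  # Flag values more than 2.5 standard deviations from the column mean
df = load_breast_cancer(as_frame=True).data  # Feature table as a DataFrame
z_scores = np.abs(stats.zscore(df.to_numpy(), axis=0))  # Per-column absolute Z-scores
outliers = df[(z_scores > threshold).any(axis=1)]  # Rows with at least one extreme value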

***Pros:***
* Ease of implementation.
* Relies on the assumption of normally distributed data, which is a reasonable approximation for many real-world measurements.
* Offers a numerical assessment of the extremeness of each outlier based on standard deviations.

***Cons:***

* If the data is not normally distributed, Z-score will not be effective for detecting outliers. **The distribution check must be done before the application!**
* The mean and standard deviation are themselves influenced by other outliers in the dataset, which can mask extreme points.

**Note:** Choose the **threshold value** carefully, based on the dataset and its context. *This is the place to experiment.*
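
A minimal sketch of the distribution check and the threshold experiment mentioned above. It assumes the D'Agostino-Pearson test from `scipy.stats.normaltest` as the normality check and a significance level of 0.05; both are illustrative choices, and the candidate thresholds are arbitrary.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).data

# Normality test per column: a small p-value suggests the column is not normally distributed
_, p_values = stats.normaltest(df.to_numpy(), axis=0)
alpha = 0.05  # assumed significance level
non_normal = df.columns[p_values < alpha]
print(f"{len(non_normal)} of {df.shape[1]} columns reject normality at alpha={alpha}")

# Count rows flagged in at least one column for a few candidate thresholds
z_abs = np.abs(stats.zscore(df.to_numpy(), axis=0))
flagged = {t: int((z_abs > t).any(axis=1).sum()) for t in (2.0, 2.5, 3.0, 3.5)}
print(flagged)
```

Columns that reject normality are candidates for a transformation or for one of the distribution-free methods below.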


###LOF - Local Outlier Factor

The Local Outlier Factor algorithm calculates a data point’s local density deviation in relation to its neighbors. LOF assigns an anomaly score to each data point, indicating how likely it is to be an outlier. Outliers are points that have a high anomaly score.

1. Calculate the LOF score for each data point by comparing the local density of each data point to the local densities of its neighbors.
2. An outlier is a data point whose local density is significantly lower than that of its neighbors.

LOF is useful for datasets with a range of densities or clusters, because it considers the concept of local density.

In **scikit-learn**, we can convert LOF scores to predictions by using the *fit_predict* method (or *predict*, which is only available when the estimator is created with *novelty=True*). It assigns one of two values:

*    **1** - point is not an outlier
*   **-1** - point is likely to be an outlier

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import LocalOutlierFactor

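data = load_breast_cancer(as_frame=True).data
# n_neighbors sets the neighborhood size; contamination is the expected share of outliers
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
outliers = lof.fit_predict(data)  # 1 = inlier, -1 = outlier
data["LOF"] = outliers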

***Pros:***

* Effective in identifying outliers in datasets with varying densities or clusters.
* Doesn’t require assumptions about the underlying distribution of the data.
* Provides anomaly scores that can be used to rank the outliers.

***Cons:***

* Sensitivity to the choice of parameters such as the number of neighbors (n_neighbors) and the contamination rate (contamination).
* Can be computationally expensive for large datasets.
* May require careful interpretation and adjustment of the anomaly scores threshold for outlier detection.
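
A short sketch of ranking points by their LOF score and of the parameter sensitivity noted above. It relies on the fitted estimator's `negative_outlier_factor_` attribute; the `n_neighbors` values compared are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import LocalOutlierFactor

X = load_breast_cancer(as_frame=True).data

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
labels = lof.fit_predict(X)

# negative_outlier_factor_ is the negated LOF: values near -1 are inliers,
# much lower values indicate stronger outliers
scores = pd.Series(lof.negative_outlier_factor_, index=X.index, name="neg_lof")
ranked = scores.sort_values()  # most anomalous first
print(ranked.head(5))

# Compare the flagged set against other neighborhood sizes (assumed candidates)
baseline = set(X.index[labels == -1])
for k in (5, 20, 50):
    other = LocalOutlierFactor(n_neighbors=k, contamination=0.1).fit_predict(X)
    flagged = set(X.index[other == -1])
    print(f"n_neighbors={k}: {len(flagged)} flagged, {len(flagged & baseline)} shared with n_neighbors=20")
```

Because contamination fixes the share of flagged points, the count stays roughly constant while the membership of the flagged set changes with n_neighbors.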

###Isolation Forest

The Isolation Forest algorithm is an effective and efficient **unsupervised** outlier detection tool. It operates by isolating data points in an ensemble of random trees. Unlike typical decision trees, which choose splits to separate classes or reduce error, each isolation tree **randomly selects a feature** and a split value, and keeps splitting until data points are isolated in individual leaves.

The approach takes advantage of the fact that outliers are few and different from the rest of the data, so they need fewer random splits to isolate and are expected to have shorter average path lengths across the trees.

1. Assign an anomaly score to each data point.
2. In scikit-learn, use the *predict* or *fit_predict* method, which assigns one of two values:
*  **1** - data point is unlikely to be an outlier
* **-1** - data point is likely an outlier

**Lower scores indicate a higher risk of being an outlier.**

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import IsolationForest

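data = load_breast_cancer(as_frame=True).data
# contamination is the expected share of outliers; random_state makes the random splits reproducible
iso = IsolationForest(contamination=0.1, random_state=42)
outliers = iso.fit_predict(data)  # 1 = inlier, -1 = outlier
data["ISO"] = outliers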

***Pros:***

* Effective in identifying outliers **in high-dimensional** datasets.
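* Can handle datasets with **mixed variable types** (numeric and categorical), provided categorical features are encoded numerically first, since the scikit-learn implementation expects numeric input.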
* Efficient for processing large datasets due to its random partitioning strategy.

***Cons:***
* Sensitivity to the choice of parameters, especially the contamination rate.
* May require tuning of hyperparameters, such as the number of trees in the forest.
* Interpretation of anomaly scores can be challenging!
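
To make the anomaly scores easier to interpret, the sketch below looks at the two score methods of the scikit-learn estimator next to the predicted labels. The `n_estimators` and `random_state` values are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import IsolationForest

X = load_breast_cancer(as_frame=True).data

iso = IsolationForest(n_estimators=100, contamination=0.1, random_state=0)
labels = iso.fit_predict(X)

raw_scores = iso.score_samples(X)    # lower = more abnormal
decision = iso.decision_function(X)  # raw score minus offset_; negative = outlier

# The predicted label follows the sign of the decision function
print("flagged:", int((labels == -1).sum()), "negative decision values:", int((decision < 0).sum()))

# Indices of the five most anomalous rows
print(np.argsort(raw_scores)[:5])
```

The offset is derived from the contamination rate, so changing contamination moves the cut-off but leaves the ranking given by score_samples unchanged.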


###DBSCAN

DBSCAN is a density-based clustering technique that can also detect outliers. It groups data points that are close to each other according to a distance criterion. **Outliers** are data points that **are far removed from any cluster**.

DBSCAN defines three types of data points:
* **Core Points** Data points that have at least a minimum number of other data points within a specified neighborhood.
* **Border Points** Data points that lie within the specified neighborhood of a core point but do not have enough neighboring points to be considered core themselves.
* **Noise Points (Outliers)** Data points that are neither core nor border points.

DBSCAN does not require the number of clusters to be specified, and it identifies outliers based on their separation from dense data regions. The *fit_predict* method of the DBSCAN estimator fits the model to the data and returns the cluster label assigned to each data point; the same labels are stored in the *labels_* attribute. **Outliers** are identified as data points **labeled as -1**.
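
The three point types described above can be recovered from a fitted estimator. This is a minimal sketch that uses the `core_sample_indices_` and `labels_` attributes; the `eps` and `min_samples` values are simply the scikit-learn defaults written out.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer(as_frame=True).data[["mean radius"]]

dbscan = DBSCAN(eps=0.5, min_samples=5).fit(X)  # scikit-learn defaults, written out
labels = dbscan.labels_

# Core points are listed explicitly by the estimator
is_core = np.zeros(len(X), dtype=bool)
is_core[dbscan.core_sample_indices_] = True

# Noise points carry the label -1; everything else that is not core is a border point
is_noise = labels == -1
is_border = ~is_core & ~is_noise

print("core:", int(is_core.sum()), "border:", int(is_border.sum()), "noise:", int(is_noise.sum()))
```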


In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import DBSCAN

data = load_breast_cancer(as_frame=True).data
dbscan = DBSCAN()  # scikit-learn defaults: eps=0.5, min_samples=5
outliers = dbscan.fit_predict(data[['mean radius']])  # Cluster on a single feature; -1 marks noise
data["DBSCAN"] = outliers

**Pros:**

* Doesn’t require specifying the number of clusters in advance.
* Ideal for datasets with an unknown number of clusters.
* Effective in detecting outliers in datasets with irregularly shaped clusters, although a single eps value can struggle when cluster densities vary widely.
* Robust to noise and able to handle datasets with complex structures.

**Cons:**

* Sensitivity to the choice of parameters, especially the eps and min_samples values.
* Performance can degrade for high-dimensional datasets.
* In higher dimensions, distances between points become less informative, so DBSCAN can end up marking almost every point as an outlier.
* Difficulty in determining optimal parameter values for different datasets.
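
One common heuristic for picking eps is to look at the sorted distance from each point to its k-th nearest neighbor. The sketch below assumes standardized features, `min_samples=5`, and a few arbitrary quantiles of that distance as eps candidates; none of these values come from the notes above.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer(as_frame=True).data
X_scaled = StandardScaler().fit_transform(X)  # put all features on a comparable scale

min_samples = 5  # assumed value (the scikit-learn default)

# Distance from each point to its (min_samples - 1)-th nearest other point.
# Calling kneighbors() without arguments excludes the query point itself.
nn = NearestNeighbors(n_neighbors=min_samples - 1).fit(X_scaled)
distances, _ = nn.kneighbors()
k_dist = np.sort(distances[:, -1])

# Try eps at a few quantiles of the sorted k-distances (assumed candidates)
noise_counts = {}
for q in (0.5, 0.9, 0.99):
    eps = float(np.quantile(k_dist, q))
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_scaled)
    noise_counts[q] = int((labels == -1).sum())
    print(f"quantile={q:.2f} eps={eps:.2f} noise points={noise_counts[q]}")
```

Plotting `k_dist` and looking for the bend in the curve is the usual visual version of the same idea; the number of noise points shrinks as eps grows.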

#1.2 Conduct time-series analysis.

#1.3 Create and analyze graph data using something like cuGraph.


#1.4 Identify how much data is big data (or when to use which acceleration method).


#1.5 Perform exploratory data analysis (EDA).


#1.6 Visualize time-series data.
