# Anomaly Detection & Time Series | Assignment DA-AG-018

## Question 1: What is Anomaly Detection? Explain its types (point, contextual, and collective anomalies) with examples.
---
**Answer:**

**Anomaly Detection** (also known as outlier detection) is the process of identifying data points, events, or observations that deviate significantly from the majority of the data and do not conform to an expected pattern. These non-conforming instances are referred to as anomalies, outliers, exceptions, or novelties.

Anomalies can be broadly categorized into three types:

#### 1. Point Anomalies
A point anomaly is a single instance of data that is anomalous with respect to the rest of the data. It is the simplest type of anomaly and the primary focus of most research.
- **Example: Credit Card Fraud** 💳
  A user typically makes purchases under \$100 in their home city. A new transaction of \$5,000 from a different country would be a point anomaly and a strong indicator of fraud.

#### 2. Contextual Anomalies (Conditional Anomalies)
A contextual anomaly is a data instance that is considered anomalous within a specific context, but not otherwise. The algorithm must consider contextual information (e.g., time, location) to identify it.
- **Example: Seasonal Sales** 🧥
  A high volume of winter coat sales is normal in December (context: winter). However, the exact same sales volume in July (context: summer) would be a contextual anomaly, perhaps indicating a data error or a highly unusual marketing success.

#### 3. Collective Anomalies
A collective anomaly is a collection of related data instances that is anomalous with respect to the entire dataset. The individual data points within the collection may not be anomalies by themselves, but their occurrence together as a collection is.
- **Example: Electrocardiogram (ECG)** ❤️
  In an ECG trace, any single low-amplitude reading can fall within the normal range. A prolonged flat stretch, however, is a run of individually unremarkable values that together signals a serious event such as cardiac arrest.
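
The three categories can be made concrete on toy data. The following is a minimal sketch, assuming NumPy; every number, threshold, and helper name in it (`is_contextual_anomaly`, `longest_flat_run`) is made up for illustration and is not a recommended detection rule.

```python
import numpy as np

# Point anomaly: one purchase far from the user's usual spending (hypothetical amounts).
purchases = np.array([42.0, 55.0, 38.0, 61.0, 47.0, 5000.0, 52.0])
median = np.median(purchases)
mad = np.median(np.abs(purchases - median))  # median absolute deviation, robust to the outlier
point_flags = np.abs(purchases - median) > 10 * mad

# Contextual anomaly: the same coat-sales volume judged against its own month.
typical_sales = {"Dec": 900.0, "Jul": 60.0}   # hypothetical monthly baselines
tolerance = {"Dec": 200.0, "Jul": 40.0}       # hypothetical acceptable deviations

def is_contextual_anomaly(month, units):
    return abs(units - typical_sales[month]) > tolerance[month]

# Collective anomaly: each zero is within the signal's usual range,
# but a long run of them is not.
signal = np.array([0.9, -0.8, 1.1, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0, 0.7])

def longest_flat_run(x, eps=0.05):
    longest = current = 0
    for value in x:
        current = current + 1 if abs(value) < eps else 0
        longest = max(longest, current)
    return longest

collective_flag = longest_flat_run(signal) >= 5
```

Only the \$5,000 purchase is flagged as a point anomaly, 900 coats is anomalous in July but not in December, and the run of five flat readings trips the collective check.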

## Question 2: Compare Isolation Forest, DBSCAN, and Local Outlier Factor in terms of their approach and suitable use cases.
---
**Answer:**

| Feature         | Isolation Forest                                                                                                       | DBSCAN (Density-Based Spatial Clustering)                                                              | Local Outlier Factor (LOF)                                                                                             |
| :-------------- | :--------------------------------------------------------------------------------------------------------------------- | :----------------------------------------------------------------------------------------------------- | :--------------------------------------------------------------------------------------------------------------------- |
| **Approach** | **Isolation-based**. It builds an ensemble of decision trees. Anomalies are easier to "isolate" (require fewer splits to separate from other points) and will thus have shorter average path lengths in the trees. | **Density-based**. It groups together points that are closely packed. Points in low-density regions that don't belong to any cluster are identified as noise/anomalies. | **Local Density-based**. It measures the local density deviation of a data point with respect to its neighbors. An object is an outlier if its local density is significantly lower than that of its neighbors. |
| **Key Idea** | Anomalies are "few and different."                                                                                   | Anomalies are in sparse regions.                                                                       | Anomalies are in regions of lower local density compared to their surroundings.                                        |
| **Parameters** | `n_estimators`, `contamination` (expected proportion of outliers).                                                     | `eps` (neighborhood radius), `min_samples` (minimum points in a neighborhood).                         | `n_neighbors` (number of neighbors to consider).                                                                       |
| **Strengths** | - Very fast and efficient, especially for large datasets.<br>- Works well in high-dimensional spaces.<br>- Requires few parameters. | - Can find arbitrarily shaped clusters and anomalies.<br>- Does not assume a specific distribution.<br>- Robust to noise. | - Effective at detecting anomalies in datasets with varying densities.<br>- Does not require a global density parameter. |
| **Weaknesses** | - May struggle if anomalies are clustered together.<br>- Performance can degrade if normal and anomalous data are not clearly separable. | - Struggles with datasets of varying densities.<br>- Sensitive to the choice of `eps` and `min_samples`.<br>- Can be slow on large datasets. | - Computationally expensive (up to $O(n^2)$ with brute-force neighbor search), so it scales poorly to very large datasets.<br>- Can be sensitive to the `n_neighbors` parameter. |
| **Use Case** | **Fraud detection in finance or network security**, where fast processing of large, high-dimensional datasets is required. | **Geospatial analysis** to find noisy GPS points or **image processing** to identify defects in manufacturing. | **Intrusion detection in networks** or **detecting abnormal gene expressions**, where the definition of "normal" can vary across different regions of the data. |
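
To see the three approaches side by side, the sketch below runs each detector on the same toy data, assuming scikit-learn is installed. The data, the three injected far-away points, and all parameter values are illustrative assumptions, not tuned settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=0.5, size=(200, 2))       # one dense cluster
far_points = np.array([[6.0, 6.0], [-6.0, 5.0], [5.0, -6.0]])  # injected anomalies
X = np.vstack([inliers, far_points])

# Isolation Forest and LOF label outliers as -1 and inliers as 1.
iso_labels = IsolationForest(contamination=0.02, random_state=0).fit_predict(X)
lof_labels = LocalOutlierFactor(n_neighbors=20, contamination=0.02).fit_predict(X)

# DBSCAN returns cluster ids; points that belong to no cluster get the label -1.
db_labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(X)

for name, labels in [("IsolationForest", iso_labels), ("LOF", lof_labels), ("DBSCAN", db_labels)]:
    print(f"{name}: {(labels == -1).sum()} points flagged")
```

All three flag the injected points here because they are far from everything else; the methods start to disagree on harder cases such as clusters of differing density.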

## Question 3: What are the key components of a Time Series? Explain each with one example.
---
**Answer:**

A time series is a sequence of data points collected over time. To better understand and model a time series, it is often decomposed into four key components:

#### 1. Trend (T)
The trend represents the long-term, underlying direction of the data. It shows whether the series is generally increasing, decreasing, or remaining constant over an extended period.
- **Example:** The steady increase in global average temperatures recorded over the last century. 

#### 2. Seasonality (S)
Seasonality refers to regular, predictable patterns or fluctuations that repeat over a fixed and known period of time. This period can be daily, weekly, monthly, or yearly.
- **Example:** Retail sales of ice cream consistently peaking every summer and dropping every winter. 

#### 3. Cyclical Component (C)
The cyclical component consists of patterns that are not of a fixed period, unlike seasonality. These fluctuations are often related to longer-term economic or business cycles and their duration is usually at least 2 years.
- **Example:** The boom-and-bust cycles in the housing market, which may repeat every 5-10 years, influenced by broad economic conditions.

#### 4. Irregular / Residual Component (I or R)
This component, also known as noise or random variation, is what remains after the trend, seasonality, and cyclical components have been removed from the time series. It represents the unpredictable and unsystematic fluctuations in the data.
- **Example:** A sudden, sharp drop in the stock price of a company due to unexpected negative news.
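
These components are usually combined in one of two ways. As a brief sketch, writing $Y_t$ for the observed value at time $t$ and using the component letters from the headings above:

$$Y_t = T_t + S_t + C_t + I_t \quad \text{(additive)}$$

$$Y_t = T_t \times S_t \times C_t \times I_t \quad \text{(multiplicative)}$$

The additive form suits series whose seasonal swings stay roughly constant in size, while the multiplicative form suits series whose swings grow with the level. Many decomposition routines do not estimate the cyclical component separately and fold it into the trend.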

## Question 4: Define Stationarity in time series. How can you test and transform a non-stationary series into a stationary one?
---
**Answer:**

#### Definition of Stationarity
A time series is said to be **stationary** if its statistical properties—specifically its mean, variance, and autocorrelation—are all constant over time. This means that a stationary series does not have a trend or seasonal effects, and its statistical characteristics are independent of the time at which they are observed. This property is a fundamental assumption for many time series forecasting models like ARIMA.

#### How to Test for Stationarity
You can test for stationarity using both visual methods and statistical tests:
1.  **Visual Inspection**: Plot the time series and look for obvious trends or seasonal patterns. Plotting the rolling mean and rolling standard deviation can also help; if they are not constant, the series is likely non-stationary.
2.  **Statistical Tests**: The most common test is the **Augmented Dickey-Fuller (ADF) Test**.
    - **Null Hypothesis ($H_0$)**: The time series is non-stationary (it has a unit root).
    - **Alternative Hypothesis ($H_1$)**: The time series is stationary.
    To conclude that the series is stationary, we need to reject the null hypothesis, which we do when the test's **p-value** is below a chosen significance level (e.g., 0.05). A p-value above that level means the test found no evidence against a unit root, not proof that one exists.

#### How to Transform a Non-Stationary Series into a Stationary One
If a time series is found to be non-stationary, it must be transformed before applying forecasting models. Common techniques include:
1.  **Differencing**: This is the most common method. It involves computing the difference between consecutive observations. First-order differencing is calculated as:
    $$Y'_t = Y_t - Y_{t-1}$$
    If the series is still not stationary, you can apply second-order differencing (i.e., difference the differenced series). This helps to remove trends.

2.  **Transformation**: To stabilize a non-constant variance, you can apply mathematical transformations like taking the **logarithm**, **square root**, or **Box-Cox transformation** of the series.

Often, a combination of transformation and differencing (e.g., taking the log and then differencing) is required to make a series fully stationary.
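
The log-then-difference combination can be sketched with pandas. The series below is a hypothetical one with 2% growth per step, chosen so the effect of each transformation is easy to see.

```python
import numpy as np
import pandas as pd

# Hypothetical series with exponential growth: the level rises ever faster.
t = np.arange(48)
y = pd.Series(100.0 * 1.02 ** t,
              index=pd.date_range("2020-01-01", periods=48, freq="MS"))

log_y = np.log(y)                    # turns multiplicative growth into a straight line
diff_log_y = log_y.diff().dropna()   # Y'_t = log(Y_t) - log(Y_{t-1})

# Second-order differencing of the raw series: difference the differenced series.
second_diff = y.diff().diff().dropna()

print(diff_log_y.head())
```

After the log transform, first differencing leaves a constant equal to `log(1.02)`, which is the growth rate. Each round of differencing also drops one observation from the start of the series.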

## Question 5: Differentiate between AR, MA, ARIMA, SARIMA, and SARIMAX models in terms of structure and application.
---
**Answer:**

| Model     | Full Name                           | Structure & Key Idea                                                                                                     | Application / Use Case                                                                                      |
| :-------- | :---------------------------------- | :----------------------------------------------------------------------------------------------------------------------- | :---------------------------------------------------------------------------------------------------------- |
| **AR** | AutoRegressive                      | `AR(p)`: The current value ($Y_t$) is a linear combination of its own `p` past values. **Depends on past values.** | Modeling stationary time series where the next value has a strong correlation with recent past values.    |
| **MA** | Moving Average                      | `MA(q)`: The current value ($Y_t$) is a linear combination of the `q` past forecast errors. **Depends on past errors.** | Modeling stationary time series where shocks or random spikes affect the output for a fixed duration.     |
| **ARIMA** | AutoRegressive Integrated Moving Average | `ARIMA(p,d,q)`: Combines AR and MA with `d` levels of **differencing (Integration)** to handle non-stationary data.         | Forecasting non-stationary data that has a clear trend but no seasonality, such as annual GDP or stock prices. |
| **SARIMA**| Seasonal ARIMA                      | `SARIMA(p,d,q)(P,D,Q)m`: Extends ARIMA by adding **seasonal components** (`P,D,Q`) where `m` is the length of the season (e.g., 12 for monthly data). | Forecasting data with both trend and clear seasonality, such as monthly airline passenger numbers or quarterly retail sales. |
| **SARIMAX**| SARIMA with eXogenous Variables     | `SARIMAX`: Extends SARIMA to include **external predictor variables (exogenous variables)** that can influence the target series. | Forecasting electricity demand (`Y_t`) using external predictors such as temperature forecasts and holiday indicators. Informative predictors can improve forecast accuracy. |
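
The structures in the table can also be written as equations. This is a sketch in standard notation; the symbols $c$, $\mu$, $\phi_i$, $\theta_j$, $\varepsilon_t$, $\beta$, $x_t$, and the backshift operator $B$ (defined by $B Y_t = Y_{t-1}$) are introduced here and do not appear in the table.

$$\text{AR}(p):\quad Y_t = c + \sum_{i=1}^{p} \phi_i Y_{t-i} + \varepsilon_t$$

$$\text{MA}(q):\quad Y_t = \mu + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}$$

$$\text{ARIMA}(p,d,q):\quad \phi(B)\,(1 - B)^d\, Y_t = c + \theta(B)\,\varepsilon_t$$

$$\text{SARIMA}(p,d,q)(P,D,Q)_m:\quad \phi(B)\,\Phi(B^m)\,(1 - B)^d (1 - B^m)^D\, Y_t = \theta(B)\,\Theta(B^m)\,\varepsilon_t$$

Here $\phi(B) = 1 - \phi_1 B - \dots - \phi_p B^p$ and $\theta(B) = 1 + \theta_1 B + \dots + \theta_q B^q$, with $\Phi$ and $\Theta$ the corresponding seasonal polynomials in $B^m$. One common formulation of SARIMAX is a regression with SARIMA errors: $Y_t = \beta^\top x_t + u_t$, where $x_t$ holds the exogenous variables and $u_t$ follows the SARIMA equation above.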


## Question 6: Load a time series dataset (e.g., AirPassengers), plot the original series, and decompose it into trend, seasonality, and residual components.
---
**Answer:**

In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.datasets import get_rdataset

# 1. Load the AirPassengers dataset
# get_rdataset downloads the data from the Rdatasets repository (requires internet access)
air_passengers = get_rdataset("AirPassengers").data

# Build a proper monthly datetime index.
# The R 'time' column stores fractional years (e.g., 1949.0833 for February), and
# truncating (x - int(x)) * 12 can map two months to the same date through float
# rounding, so we generate the index directly: the series starts in January 1949.
air_passengers.index = pd.date_range(start='1949-01-01', periods=len(air_passengers), freq='MS')
series = air_passengers['value']

# 2. Plot the original series
plt.figure(figsize=(12, 6))
plt.plot(series)
plt.title('Original AirPassengers Time Series')
plt.xlabel('Year')
plt.ylabel('Number of Passengers')
plt.grid(True)
plt.show()

# 3. Decompose the time series
# We use a multiplicative model for decomposition as the seasonality grows with the trend
decomposition = seasonal_decompose(series, model='multiplicative')

# 4. Plot the decomposed components
fig = decomposition.plot()
fig.set_size_inches(12, 8)
plt.suptitle('Time Series Decomposition of AirPassengers Data', y=1.02)
plt.tight_layout()
plt.show()

**Output Explanation:**

The code first plots the original time series of airline passengers, which clearly shows an upward trend and a repeating yearly seasonal pattern.

Next, it performs a seasonal decomposition and plots four panels:
1.  **Observed**: The original time series data.
2.  **Trend**: A smooth line showing the long-term upward movement in the number of passengers.
3.  **Seasonal**: A repeating wave pattern that captures the yearly fluctuations (peaks in summer, troughs in winter).
4.  **Resid**: The irregular variation remaining in the data after removing the trend and seasonal components. Because the model is multiplicative, the seasonal and residual values are ratios centered near 1 rather than passenger counts.

## Question 7: Apply Isolation Forest on a numerical dataset (e.g., NYC Taxi Fare) to detect anomalies. Visualize the anomalies on a 2D scatter plot.
---
**Answer:**

For this example, we'll create a synthetic dataset that mimics NYC Taxi Fare data, with features for `trip_distance` and `fare_amount`. We will then introduce some anomalies.

In [None]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

# 1. Generate synthetic taxi fare data
np.random.seed(42)
# Generate 500 normal trips
n_samples = 500
trip_distance = np.random.uniform(1, 15, n_samples)  # Miles
fare_amount = 2.50 + (trip_distance * 2.75) + np.random.normal(0, 3, n_samples) # Base fare + per mile + noise
normal_data = np.column_stack((trip_distance, fare_amount))

# Generate 20 anomaly trips
n_anomalies = 20
anomaly_dist = np.random.uniform(0.1, 25, n_anomalies)
anomaly_fare = np.random.uniform(5, 200, n_anomalies)
anomalies = np.column_stack((anomaly_dist, anomaly_fare))

# Combine normal data and anomalies
X = np.vstack((normal_data, anomalies))

# 2. Apply Isolation Forest
# Set contamination to the known proportion of anomalies
iso_forest = IsolationForest(contamination=float(n_anomalies) / (n_samples + n_anomalies), random_state=42)
y_pred = iso_forest.fit_predict(X)

# y_pred will be 1 for inliers and -1 for outliers (anomalies)
outlier_mask = y_pred == -1

# 3. Visualize the anomalies
plt.figure(figsize=(10, 7))

# Plot the inliers (normal data)
plt.scatter(X[~outlier_mask, 0], X[~outlier_mask, 1], c='blue', label='Normal Trips', alpha=0.6)

# Plot the outliers (anomalies)
plt.scatter(X[outlier_mask, 0], X[outlier_mask, 1], c='red', marker='x', s=100, label='Anomalies')

plt.title('Anomaly Detection in Taxi Fare Data using Isolation Forest')
plt.xlabel('Trip Distance (miles)')
plt.ylabel('Fare Amount ($)')
plt.legend()
plt.grid(True)
plt.show()

**Output Explanation:**

The scatter plot visualizes the synthetic taxi trip data. Most data points (blue dots) follow a clear linear relationship: as trip distance increases, so does the fare. The Isolation Forest flags the points that are easiest to isolate and marks them as red 'x's. Because `contamination` is set to the true anomaly share, it flags roughly 20 points. Most are trips that are unusually expensive for a short distance or unusually cheap for a long one, but injected anomalies that happen to land near the fare line can be missed, and a few extreme normal trips can be flagged instead.
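
Beyond the binary labels, it can help to look at the underlying anomaly scores. The sketch below uses scikit-learn's `score_samples` and `decision_function` on a smaller synthetic dataset with one injected overpriced trip; the data and the injected point are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
distance = rng.uniform(1, 15, 300)
fare = 2.50 + 2.75 * distance + rng.normal(0, 3, 300)
X = np.column_stack([distance, fare])
X = np.vstack([X, [[2.0, 180.0]]])   # hypothetical overpriced short trip (row 300)

model = IsolationForest(random_state=42).fit(X)

scores = model.score_samples(X)          # lower means more anomalous
decision = model.decision_function(X)    # score_samples minus offset_; negative means outlier

# Rank trips from most to least anomalous and inspect the top few.
most_anomalous = np.argsort(scores)[:5]
for i in most_anomalous:
    print(f"row {i}: distance={X[i, 0]:.1f}, fare={X[i, 1]:.1f}, score={scores[i]:.3f}")
```

Ranking by score makes it possible to review the most suspicious trips first instead of relying on a single `contamination` cutoff.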

## Question 8: Train a SARIMA model on the monthly airline passengers dataset. Forecast the next 12 months and visualize the results.
---
**Answer:**

In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.datasets import get_rdataset

# 1. Load and prepare the data
air_passengers = get_rdataset("AirPassengers").data
# Build the monthly index directly (the series starts in January 1949); parsing the
# fractional-year 'time' column by truncation can assign two months the same date
air_passengers.index = pd.date_range(start='1949-01-01', periods=len(air_passengers), freq='MS')
series = air_passengers['value']

# 2. Train a SARIMA model
# A commonly used order for this dataset is (1,1,1)x(1,1,1,12)
# (p,d,q) = (1,1,1) for non-seasonal components
# (P,D,Q,m) = (1,1,1,12) for seasonal components with a monthly period (m=12)
model = SARIMAX(series, 
                order=(1, 1, 1), 
                seasonal_order=(1, 1, 1, 12),
                enforce_stationarity=False,
                enforce_invertibility=False)

results = model.fit(disp=False)

# 3. Forecast the next 12 months
forecast_steps = 12
forecast = results.get_forecast(steps=forecast_steps)

# Get forecast values and confidence intervals
pred_values = forecast.predicted_mean
pred_ci = forecast.conf_int()

# 4. Visualize the results
plt.figure(figsize=(12, 6))
plt.plot(series, label='Observed')
plt.plot(pred_values, label='Forecast', color='red')
plt.fill_between(pred_ci.index, 
                 pred_ci.iloc[:, 0], 
                 pred_ci.iloc[:, 1], color='pink', alpha=0.5, label='95% Confidence Interval')

plt.title('Air Passengers Forecast using SARIMA')
plt.xlabel('Year')
plt.ylabel('Number of Passengers')
plt.legend()
plt.grid(True)
plt.show()

**Output Explanation:**

The plot shows the original Air Passengers data (blue line) up to the end of 1960. The SARIMA model's forecast for the next 12 months is shown as a red line, which continues the upward trend and seasonal pattern observed in the historical data. The pink shaded area represents the 95% confidence interval, indicating the range within which the true future values are likely to fall.

## Question 9: Apply Local Outlier Factor (LOF) on any numerical dataset to detect anomalies and visualize them using matplotlib.
---
**Answer:**

We will generate a synthetic dataset with clusters of varying densities to demonstrate the strength of LOF.

In [None]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
from sklearn.datasets import make_blobs

# 1. Generate synthetic data
n_samples = 300
blobs_params = dict(random_state=42, n_samples=n_samples, n_features=2)
X, _ = make_blobs(centers=[[0, 0], [5, 5]], cluster_std=[0.5, 1.5], **blobs_params)

# Add some random noise points (outliers)
np.random.seed(42)
outliers = np.random.uniform(low=-4, high=8, size=(20, 2))
X = np.vstack([X, outliers])

# 2. Apply Local Outlier Factor (LOF)
lof = LocalOutlierFactor(n_neighbors=20, contamination='auto')
y_pred = lof.fit_predict(X)

# LOF returns 1 for inliers and -1 for outliers
outlier_mask = y_pred == -1

# 3. Visualize the results
plt.figure(figsize=(10, 7))

# Plot the inliers
plt.scatter(X[~outlier_mask, 0], X[~outlier_mask, 1], c='blue', label='Inliers', alpha=0.7)

# Plot the outliers
plt.scatter(X[outlier_mask, 0], X[outlier_mask, 1], c='red', marker='x', s=100, label='Outliers')

plt.title('Anomaly Detection using Local Outlier Factor (LOF)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True)
plt.show()

**Output Explanation:**

The plot shows two main clusters of data (blue dots) with different densities. The Local Outlier Factor algorithm flags points (red 'x's) whose local density is much lower than that of their neighbors. Most of the scattered noise points are flagged; noise points that land inside a cluster look like inliers, and a few points on the fringe of the sparser cluster may be flagged as well. This illustrates LOF's ability to find anomalies even when the data has regions of varying density.
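
The scores behind these labels are available through the fitted estimator's `negative_outlier_factor_` attribute. The sketch below uses a separate toy dataset with one hypothetical point placed just outside a dense cluster; the data and cluster parameters are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
dense = rng.normal(0.0, 0.3, size=(100, 2))    # tight cluster
sparse = rng.normal(8.0, 2.0, size=(100, 2))   # loose cluster
near_dense = np.array([[2.0, 2.0]])            # hypothetical point just outside the tight cluster
X = np.vstack([dense, sparse, near_dense])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)

# scikit-learn stores the negated LOF; flip the sign so larger means more anomalous.
lof_scores = -lof.negative_outlier_factor_

print(f"LOF of the injected point: {lof_scores[-1]:.2f}")
print(f"median LOF overall: {np.median(lof_scores):.2f}")
```

Scores near 1 indicate a point about as dense as its neighbors. The injected point sits closer to the data than many members of the loose cluster do to each other, yet it scores high because its nearest neighbors are far more tightly packed than it is.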

## Question 10: You are working as a data scientist for a power grid monitoring company... Explain your real-time data science workflow.
---
**Answer:**

As a data scientist for a power grid company, my goal is to build a robust system for real-time anomaly detection and short-term energy demand forecasting. Here is my proposed workflow:

#### 1. Anomaly Detection in Streaming Data

Given the real-time, high-frequency nature of the data (every 15 minutes), the choice of algorithm must prioritize speed and efficiency.

* **Chosen Algorithm**: **Isolation Forest**.
* **Why?**: Isolation Forest is computationally efficient and scales well with large datasets, making it a good match for streaming applications. Its approach of isolating anomalies rather than profiling normal data works well for real-time detection. While LOF is powerful, its neighbor searches cost up to $O(n^2)$, which makes it a poor fit for high-velocity streams. DBSCAN would require continuous re-clustering and is sensitive to density parameters, which might shift over time.
* **Implementation**: I would train an initial Isolation Forest model on a historical dataset of normal energy consumption. For each incoming data point, I would use the trained model to calculate an anomaly score. If the score crosses a predefined threshold (in scikit-learn, lower scores mean more anomalous), an alert would be triggered for investigation.
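
The train-once, score-on-arrival pattern can be sketched as follows. The synthetic history, the hour-of-day feature, the 1st-percentile alert rule, and the helper `check_reading` are all assumptions for illustration, not a production design.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical history: 30 days of 15-minute readings (96 per day) with a daily cycle.
steps = np.arange(96 * 30)
hour = (steps % 96) / 4.0
usage = 50 + 20 * np.sin(2 * np.pi * hour / 24) + rng.normal(0, 2, steps.size)
history = np.column_stack([hour, usage])   # hour of day gives the model some context

detector = IsolationForest(n_estimators=100, random_state=0).fit(history)
hist_scores = detector.score_samples(history)

# Assumed alert rule: flag anything scoring below the 1st percentile of historical scores.
threshold = np.quantile(hist_scores, 0.01)

def check_reading(hour_of_day, kwh):
    """Score one incoming reading; True means raise an alert."""
    score = detector.score_samples([[hour_of_day, kwh]])[0]
    return bool(score < threshold)

print(check_reading(3.0, 64.0))    # a reading close to the historical pattern
print(check_reading(3.0, 150.0))   # a reading far above anything in the history
```

Scoring one reading only walks the already-built trees, so it is cheap. Where to set the threshold is a business decision that trades missed anomalies against false alerts.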

#### 2. Time Series Model for Short-Term Forecasting

The forecasting model needs to be sophisticated enough to handle multiple seasonalities and external factors.

* **Chosen Model**: **SARIMAX (Seasonal AutoRegressive Integrated Moving Average with eXogenous variables)**.
* **Why?**: Energy consumption has multiple layers of seasonality (daily, weekly, yearly) and is heavily influenced by external factors. SARIMAX is a good fit, with one caveat: it models only a single seasonal period directly.
    * The **SARIMA** component can model the trend and the dominant seasonal pattern, such as the daily cycle (`m = 96` for 15-minute data).
    * The **X (eXogenous)** component allows me to incorporate external predictors like **weather conditions** (temperature, humidity), **day of the week**, and **public holidays**. These calendar variables also carry the weekly and yearly patterns that the single seasonal period cannot capture, and informative predictors can significantly improve forecast accuracy.

#### 3. Validation and Performance Monitoring

A model's performance can degrade over time (concept drift), so continuous validation is essential.

* **Initial Validation**: I would use a **time-series cross-validation** technique on historical data. A hold-out set (e.g., the most recent month of data) would be used to evaluate the final model's performance using metrics like **Mean Absolute Percentage Error (MAPE)** and **Root Mean Squared Error (RMSE)**.
* **Ongoing Monitoring**: Once deployed, the system would continuously monitor the model's accuracy by comparing its forecasts against the actual energy usage. I would set up a dashboard to track MAPE over time. If the error rate consistently exceeds a certain threshold, an alert would be triggered to notify the data science team. The model would be **retrained periodically** (e.g., weekly or monthly) on fresh data to ensure it adapts to new patterns.
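
The validation loop can be sketched with scikit-learn's `TimeSeriesSplit`, which always tests on data that comes after the training window. To keep the example fast, a seasonal-naive forecast (repeat the last observed day) stands in for the SARIMAX model; the synthetic data and the helpers `mape` and `rmse` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent."""
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)

def rmse(actual, predicted):
    """Root Mean Squared Error, in the units of the series."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

rng = np.random.default_rng(0)
period = 96                                   # 15-minute readings per day
steps = np.arange(period * 28)                # four weeks of hypothetical data
usage = 50 + 20 * np.sin(2 * np.pi * steps / period) + rng.normal(0, 2, steps.size)

fold_scores = []
splitter = TimeSeriesSplit(n_splits=4, test_size=period)
for train_idx, test_idx in splitter.split(usage):
    train, test = usage[train_idx], usage[test_idx]
    forecast = train[-period:]                # seasonal-naive stand-in for the real model
    fold_scores.append((mape(test, forecast), rmse(test, forecast)))

for i, (m, r) in enumerate(fold_scores, start=1):
    print(f"fold {i}: MAPE={m:.2f}%  RMSE={r:.2f}")
```

The same per-fold metrics can feed the monitoring dashboard after deployment. Note that MAPE is undefined when actual values are zero, so RMSE is worth tracking alongside it.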

#### 4. Business Impact and Operational Benefits

This dual anomaly detection and forecasting solution would provide significant value:

* **Operational Efficiency**: Accurate short-term forecasts allow grid operators to perform **load balancing** more effectively, ensuring a stable power supply and preventing costly blackouts. It helps in deciding when to bring additional power plants online or purchase energy from other grids.
* **Cost Reduction**: By predicting demand, the company can optimize energy generation and procurement, purchasing power when prices are low and avoiding expensive last-minute acquisitions.
* **Proactive Maintenance**: The real-time anomaly detection system can identify equipment malfunctions or potential energy theft (e.g., a sudden, unexplainable drop in consumption at a location), allowing for rapid response and minimizing losses.
* **Strategic Planning**: Long-term analysis of the forecast data can help in planning for future infrastructure upgrades and investments.