Anomaly Detection & Time Series | Assignment


Question 1: What is Anomaly Detection? Explain its types (point, contextual, and collective anomalies) with examples.

ANSWER -
Anomaly Detection (also called outlier detection) is the process of finding patterns in data that do not conform to expected behavior.

Anomalies are unusual data points that differ significantly from the majority of the data.

It is widely used in fraud detection, cybersecurity, fault detection, medical diagnosis, etc.

Types of Anomalies

Point Anomaly

A data instance is considered anomalous if it is far away from the rest of the data points.

Most common type of anomaly.

Example:

In banking transactions: If a person usually spends ₹500–₹2000 daily, but suddenly spends ₹1,00,000 in one transaction, that is a point anomaly.

In temperature records: A sudden reading of 80°C in winter when all others are around 5–10°C.

Contextual Anomaly (Conditional Anomaly)

A data point is anomalous in a specific context, but may be normal in another.

Contextual anomalies are common in time-series data or spatial data.

Example:

Temperature: 30°C in summer is normal, but 30°C in winter is an anomaly.

Website traffic: A sudden spike in visitors at midnight might be abnormal, but the same spike during a festival season may be normal.
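
The temperature example can be sketched in a few lines. The snippet compares a z-score computed against all readings with one computed against winter readings only; the readings and the cut-off of 3 are illustrative assumptions, not data from the assignment.

```python
import numpy as np

# Hypothetical readings in °C; the final winter value is the suspect one.
winter = np.array([5.0, 6.0, 7.0, 8.0, 9.0, 6.0, 7.0, 8.0, 30.0])
summer = np.array([28.0, 29.0, 30.0, 31.0, 32.0, 29.0, 30.0, 31.0, 30.0])


def z_score(value, reference):
    """Standardized distance of `value` from the mean of `reference`."""
    return (value - reference.mean()) / reference.std()


suspect = winter[-1]

# Without context: compare against every reading, summer included.
global_z = z_score(suspect, np.concatenate([winter, summer]))

# With context: compare against the other winter readings only.
winter_z = z_score(suspect, winter[:-1])

THRESHOLD = 3.0  # assumed cut-off for "unusual"
print(f"global z = {global_z:.2f}, flagged: {abs(global_z) > THRESHOLD}")
print(f"winter z = {winter_z:.2f}, flagged: {abs(winter_z) > THRESHOLD}")
```

The same 30°C reading passes the global check but stands out once the comparison is restricted to its season, which is what makes it contextual rather than a point anomaly.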

Collective Anomaly

A group of data points together is anomalous, even though individual points may look normal.

Common in sequential or time-series data.

Example:

Credit card usage: A customer makes 10 small transactions within 5 minutes. Each transaction looks normal, but together they form a collective anomaly (possible fraud).

Network security: Multiple failed login attempts from different IP addresses within a short time.
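
A minimal sketch of the credit card example, assuming a simple sliding-window rule: flag the event that completes 10 or more transactions within 5 minutes. The function name `flag_bursts`, the timestamps, and the thresholds are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta


def flag_bursts(timestamps, window=timedelta(minutes=5), min_events=10):
    """Return indices of events that complete a burst of >= min_events within `window`.

    `timestamps` must be sorted in ascending order.
    """
    flagged = []
    start = 0
    for end, ts in enumerate(timestamps):
        # Move the left edge forward until the window spans at most `window`.
        while ts - timestamps[start] > window:
            start += 1
        if end - start + 1 >= min_events:
            flagged.append(end)
    return flagged


base = datetime(2025, 1, 1, 12, 0)
# Three widely spaced purchases, then ten purchases 30 seconds apart.
spaced = [base + timedelta(hours=h) for h in (0, 3, 6)]
burst = [base + timedelta(hours=8, seconds=30 * i) for i in range(10)]
events = spaced + burst

flagged = flag_bursts(events)
print(flagged)
```

No single transaction is examined for its amount here; only the density of events in time triggers the flag, which is the defining property of a collective anomaly.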

Question 2: Compare Isolation Forest, DBSCAN, and Local Outlier Factor in terms of their approach and suitable use cases.

ANSWER - 1. Isolation Forest (iForest)

Approach:

Based on the principle that anomalies are easier to isolate than normal points.

Builds random decision trees and checks how quickly a point gets isolated.

Fewer splits → more likely an anomaly.

Suitable Use Cases:

Works well with high-dimensional data.

Scalable to large datasets.

Good for fraud detection, intrusion detection, medical diagnosis.

Limitations:

Struggles if anomalies are dense clusters rather than isolated points.

2. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

Approach:

A clustering algorithm that groups dense regions of data.

Points in low-density regions (not belonging to any cluster) are considered anomalies.

Suitable Use Cases:

Works well with spatial data (geographical data, GPS points).

Good for cases where anomalies are scattered away from dense clusters.

Detecting unusual movement patterns, geospatial fraud detection.

Limitations:

Struggles with high-dimensional data.

Sensitive to parameter selection (epsilon, minPts).

3. Local Outlier Factor (LOF)

Approach:

Measures the local density deviation of a point compared to its neighbors.

A point is anomalous if it has a much lower density than its neighbors.

Suitable Use Cases:

Good for datasets where anomalies are relative to local neighborhoods.

Works well when anomalies are not globally isolated but unusual compared to nearby points.

Example: detecting unusual behavior in small communities of users or IoT sensors.

Limitations:

Computationally expensive for large datasets.

Struggles in very high-dimensional data.
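
Isolation Forest and LOF are demonstrated in later questions, so the sketch below covers only DBSCAN. It assumes two synthetic clusters plus two hand-placed outliers; `eps=0.5` and `min_samples=5` are illustrative values that would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)

# Two dense synthetic clusters and two isolated points (indices 100 and 101).
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
cluster_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
outliers = np.array([[2.5, 2.5], [10.0, -3.0]])
X = np.vstack([cluster_a, cluster_b, outliers])

# DBSCAN assigns the label -1 to points that belong to no dense region.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
noise_idx = np.where(labels == -1)[0]

print("Noise point indices:", noise_idx)
```

Unlike Isolation Forest and LOF, DBSCAN produces no anomaly score; a point is either in a cluster or labeled as noise.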

Question 3: What are the key components of a Time Series? Explain each with one example.
ANSWER - Key Components of a Time Series

A time series is a sequence of data points recorded over time (daily sales, monthly temperature, stock prices, etc.).
Its behavior can usually be broken into four main components:

1. Trend (T)

Meaning: The long-term increase or decrease in the data over a period of time.

Shows the overall direction (upward, downward, or constant).

Can be linear or nonlinear.

Example:

Company’s annual revenue steadily increasing every year → upward trend.

Population growth of a city over decades.

2. Seasonality (S)

Meaning: Regular and predictable patterns that repeat over a fixed period (daily, weekly, monthly, yearly).

Caused by seasonal factors like weather, holidays, festivals.

Example:

Ice cream sales peak in summer and drop in winter.

E-commerce sales increase every year during Diwali or Christmas season.

3. Cyclic Component (C)

Meaning: Fluctuations that occur over longer, irregular periods (more than a year).

Unlike seasonality, cycles are not fixed in length; often linked to economic/business cycles.

Example:

Stock market showing boom and recession phases over 5–10 years.

Real estate prices rising and falling with economic cycles.

4. Irregular / Residual / Noise (I)

Meaning: Random or unpredictable variations in data that cannot be explained by trend, seasonality, or cycles.

Caused by unexpected events like natural disasters, strikes, pandemics, etc.

Example:

Sudden drop in tourism during COVID-19 lockdown (unexpected shock).

Spike in vegetable prices due to a flood damaging crops.
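
As a rough illustration, an additive series can be assembled from these components with NumPy. The slope, amplitude, and noise level are arbitrary assumptions, and the cyclic component is left out to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(48)                              # 4 years of monthly observations

trend = 100 + 2.0 * t                          # T: steady upward movement
seasonal = 10 * np.sin(2 * np.pi * t / 12)     # S: pattern repeating every 12 months
noise = rng.normal(0, 1, size=t.size)          # I: random, unexplained variation

series = trend + seasonal + noise              # additive model: Y = T + S + I

# Averaging over whole years cancels the seasonal term, exposing the trend.
yearly_means = series.reshape(4, 12).mean(axis=1)
print(np.round(yearly_means, 1))
```

A multiplicative model (Y = T × S × I) is the usual alternative when the seasonal swings grow with the level of the series.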

Question 4: Define Stationary in time series. How can you test and transform a non-stationary series into a stationary one?

ANSWER -
A time series is stationary if its statistical properties do not change over time.
That means:

Mean is constant

Variance is constant

Autocovariance (relationship between values at different lags) depends only on the lag, not on time

👉 Stationarity is important because many forecasting models (like ARIMA) assume the data is stationary.

Example:

Daily temperature with seasonal pattern ❌ (non-stationary).

Daily stock returns (percentage change, not prices) ✔️ (stationary).


How to Test for Stationarity

a) Visual Inspection

Plot the time series.

If mean/variance change with time (like upward trend or seasonality), it’s non-stationary.

b) Summary Statistics

Split data into two halves and compare mean & variance.

If they differ → non-stationary.

c) Statistical Tests

Augmented Dickey-Fuller (ADF) Test

Null Hypothesis (H0): Series is non-stationary.

If p-value < 0.05 → reject H0 → series is stationary.

KPSS Test (Kwiatkowski–Phillips–Schmidt–Shin)

Null Hypothesis: Series is stationary.

If p-value < 0.05 → reject H0 → series is non-stationary.


How to Transform a Non-Stationary Series

Differencing

Subtract the previous value from the current value:

$$Y'_t = Y_t - Y_{t-1}$$


Removes trend and makes mean constant.

Example: Stock prices → take day-to-day price changes (differencing log prices gives daily returns).

Transformation (Stabilize Variance)

Apply mathematical transformations like log, square root, Box-Cox.

Useful when variance grows with time.

Example: Log of sales data when variance increases with sales volume.

De-trending

Fit a regression line (linear or polynomial) to the data and subtract the trend.

Example: GDP growth → remove long-term trend.

De-seasonalizing

Remove seasonal effects using moving averages or seasonal decomposition.

Example: Divide monthly sales by seasonal index.
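
A small sketch of differencing combined with a log transform, assuming a made-up sales series that grows 5% per period. Plain differencing leaves steps that keep growing, while differencing the logs gives a constant.

```python
import numpy as np

t = np.arange(24)
# Hypothetical sales growing 5% per period: level and step size both increase.
sales = 100.0 * 1.05 ** t

# Plain differencing, Y'_t = Y_t - Y_{t-1}: the steps still grow over time.
first_diff = np.diff(sales)

# Log transform first, then difference: constant for steady percentage growth.
log_diff = np.diff(np.log(sales))

print("first differences:", np.round(first_diff[[0, -1]], 2))
print("log differences:  ", np.round(log_diff[[0, -1]], 4))
```

This is why a variance-stabilizing transform is often applied before differencing when fluctuations scale with the level of the series.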

Question 5: Differentiate between AR, MA, ARIMA, SARIMA, and SARIMAX models in terms of structure and application.

ANSWER - 1. AR (Auto-Regressive Model)

Structure: Current value depends on its past values.

$$Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \epsilon_t$$


Application:

Works well when there is correlation between past values.

Example: Predicting tomorrow’s stock price using last 5 days’ prices.
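
To make the AR structure concrete, the sketch below simulates an AR(1) process and recovers c and φ₁ by ordinary least squares on the lagged values. The true values (c = 2.0, φ₁ = 0.7) and the sample size are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
phi, c, n = 0.7, 2.0, 2000

# Simulate Y_t = c + phi * Y_{t-1} + eps_t, starting at the long-run mean.
eps = rng.normal(size=n)
y = np.empty(n)
y[0] = c / (1 - phi)
for t in range(1, n):
    y[t] = c + phi * y[t - 1] + eps[t]

# Regress Y_t on [1, Y_{t-1}] to estimate the intercept and AR coefficient.
X = np.column_stack([np.ones(n - 1), y[:-1]])
c_hat, phi_hat = np.linalg.lstsq(X, y[1:], rcond=None)[0]

print(f"estimated c = {c_hat:.2f}, estimated phi = {phi_hat:.2f}")
```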

2. MA (Moving Average Model)

Structure: Current value depends on past forecast errors (shocks).

$$Y_t = c + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \dots + \epsilon_t$$


Application:

Captures random shocks/noise that affect the series.

Example: Predicting demand where sudden fluctuations (holiday rush) matter.
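
One way to see the MA structure is through its autocorrelation: for an MA(1) process the lag-1 autocorrelation is θ₁ / (1 + θ₁²) and higher lags are zero. The sketch below checks this on simulated data, with θ₁ = 0.6 and c = 0 as assumed values and `sample_acf` as a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(7)
theta, n = 0.6, 5000

# Simulate Y_t = eps_t + theta * eps_{t-1} (c = 0 for simplicity).
eps = rng.normal(size=n + 1)
y = eps[1:] + theta * eps[:-1]


def sample_acf(x, lag):
    """Sample autocorrelation of x at a positive lag."""
    x = x - x.mean()
    return np.dot(x[lag:], x[:-lag]) / np.dot(x, x)


acf1 = sample_acf(y, 1)
acf2 = sample_acf(y, 2)
theoretical_acf1 = theta / (1 + theta**2)

print(f"lag 1: sample={acf1:.3f}, theory={theoretical_acf1:.3f}")
print(f"lag 2: sample={acf2:.3f}, theory=0.000")
```

This cut-off after lag q is what an ACF plot is typically used for when choosing the MA order.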

3. ARIMA (Auto-Regressive Integrated Moving Average)

Structure: Combination of AR + differencing (I) + MA.

AR (p): Past values

I (d): Differencing (to remove trend/make stationary)

MA (q): Past errors

Application:

General-purpose forecasting model for non-stationary time series.

Example: Sales forecasting, stock market prediction.

4. SARIMA (Seasonal ARIMA)

Structure: Extends ARIMA by adding seasonal terms (P, D, Q, s).

Handles both trend + seasonality.

Application:

Suitable for time series with strong seasonal patterns.

Example: Monthly electricity demand (high in summer/winter, low in spring/fall).
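
The seasonal differencing term (D = 1 with s = 12) can be illustrated without fitting a model. The sketch assumes a noise-free series made of a linear trend plus a fixed 12-month pattern; both are invented for the example.

```python
import numpy as np

t = np.arange(60)  # 5 years of monthly data
# Hypothetical fixed monthly pattern, repeated every year.
seasonal_pattern = np.array([0, 2, 5, 9, 12, 15, 18, 16, 11, 6, 2, -1], dtype=float)
y = 50.0 + 1.5 * t + np.tile(seasonal_pattern, 5)

# Seasonal differencing (D = 1, s = 12): Y_t - Y_{t-12} removes the repeating pattern
# and leaves only the trend's contribution over 12 steps.
seasonal_diff = y[12:] - y[:-12]

# Regular differencing on top (d = 1) removes what is left of the trend.
fully_differenced = np.diff(seasonal_diff)

print(seasonal_diff[:3])
print(fully_differenced[:3])
```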

5. SARIMAX (Seasonal ARIMA with Exogenous Variables)

Structure: SARIMA + extra explanatory variables (X).

Includes external factors that influence the series.

Application:

When outside variables affect the target series.

Example: Forecasting ice cream sales using SARIMAX with temperature as an external regressor.

Question 6: Load a time series dataset (e.g., AirPassengers), plot the original series, and decompose it into trend, seasonality, and residual components.

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# AirPassengers dataset (monthly airline passenger numbers 1949-1960)
data = [112,118,132,129,121,135,148,148,136,119,104,118,
        115,126,141,135,125,149,170,170,158,133,114,140,
        145,150,178,163,172,178,199,199,184,162,146,166,
        171,180,193,181,183,218,230,242,209,191,172,194,
        196,196,236,235,229,243,264,272,237,211,180,201,
        204,188,235,227,234,264,302,293,259,229,203,229,
        242,233,267,269,270,315,364,347,312,274,237,278,
        284,277,317,313,318,374,413,405,355,306,271,306,
        315,301,356,348,355,422,465,467,404,347,305,336,
        340,318,362,348,363,435,491,505,404,359,310,337,
        360,342,406,396,420,472,548,559,463,407,362,405,
        417,391,419,461,472,535,622,606,508,461,390,432]

# Create DataFrame with datetime index
date_rng = pd.date_range(start='1949-01', periods=len(data), freq='M')
df = pd.DataFrame(data, index=date_rng, columns=['Passengers'])

# Plot original series
plt.figure(figsize=(10,4))
plt.plot(df['Passengers'], label="AirPassengers Data")
plt.title("AirPassengers Time Series (1949-1960)")
plt.xlabel("Year")
plt.ylabel("Number of Passengers")
plt.legend()
plt.show()

# Decompose the series
decomposition = seasonal_decompose(df['Passengers'], model='multiplicative')

# Plot decomposition
fig = decomposition.plot()
fig.set_size_inches(10, 8)
plt.show()

OUTPUT - Original Series: Shows passenger growth from 1949 to 1960 with upward trend + seasonality.

Trend: Clear upward growth in passengers over years.

Seasonality: Peaks in mid-year (summer travel) and drops at start/end of year.

Residuals: Random fluctuations not explained by trend/seasonality.

Question 7: Apply Isolation Forest on a numerical dataset (e.g., NYC Taxi Fare) to detect anomalies. Visualize the anomalies on a 2D scatter plot.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

# ----------------------------
# 1. Create synthetic dataset
# ----------------------------
np.random.seed(42)

# Normal data (mimicking taxi fares and distances)
trip_distance = np.random.normal(5, 2, 200)   # mean=5 miles
fare_amount = trip_distance * 3 + np.random.normal(0, 2, 200)  # ~ $3 per mile + noise

# Add anomalies (unrealistic fares)
trip_distance = np.concatenate([trip_distance, [20, 25, 30, 1, 2]])
fare_amount = np.concatenate([fare_amount, [200, 250, 300, 50, 0]])

# DataFrame
df = pd.DataFrame({'trip_distance': trip_distance, 'fare_amount': fare_amount})

# ----------------------------
# 2. Apply Isolation Forest
# ----------------------------
clf = IsolationForest(contamination=0.05, random_state=42)
df['anomaly'] = clf.fit_predict(df[['trip_distance', 'fare_amount']])

# -1 means anomaly, 1 means normal
anomalies = df[df['anomaly'] == -1]
normal = df[df['anomaly'] == 1]

# ----------------------------
# 3. Visualization
# ----------------------------
plt.figure(figsize=(8,6))
plt.scatter(normal['trip_distance'], normal['fare_amount'],
            c='blue', label='Normal', alpha=0.6)
plt.scatter(anomalies['trip_distance'], anomalies['fare_amount'],
            c='red', label='Anomaly', marker='x', s=100)
plt.title("Isolation Forest - Taxi Fare Anomaly Detection")
plt.xlabel("Trip Distance (miles)")
plt.ylabel("Fare Amount ($)")
plt.legend()
plt.show()

OUTPUT - Blue dots → normal trips (fare proportional to distance).

Red X’s → anomalies, like:

Very high fares for long trips.

Unrealistically low fares (near 0).

Mismatch between distance and fare.

Question 8: Train a SARIMA model on the monthly airline passengers dataset. Forecast the next 12 months and visualize the results.

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.statespace.sarimax import SARIMAX

# ----------------------------
# Load AirPassengers dataset
# ----------------------------
data = [112,118,132,129,121,135,148,148,136,119,104,118,
        115,126,141,135,125,149,170,170,158,133,114,140,
        145,150,178,163,172,178,199,199,184,162,146,166,
        171,180,193,181,183,218,230,242,209,191,172,194,
        196,196,236,235,229,243,264,272,237,211,180,201,
        204,188,235,227,234,264,302,293,259,229,203,229,
        242,233,267,269,270,315,364,347,312,274,237,278,
        284,277,317,313,318,374,413,405,355,306,271,306,
        315,301,356,348,355,422,465,467,404,347,305,336,
        340,318,362,348,363,435,491,505,404,359,310,337,
        360,342,406,396,420,472,548,559,463,407,362,405,
        417,391,419,461,472,535,622,606,508,461,390,432]

# Create time series DataFrame
date_rng = pd.date_range(start='1949-01', periods=len(data), freq='M')
df = pd.DataFrame(data, index=date_rng, columns=['Passengers'])

# ----------------------------
# Train SARIMA model
# ----------------------------
model = SARIMAX(df['Passengers'],
                order=(1,1,1),
                seasonal_order=(1,1,1,12),
                enforce_stationarity=False,
                enforce_invertibility=False)
results = model.fit(disp=False)

# ----------------------------
# Forecast next 12 months
# ----------------------------
forecast = results.get_forecast(steps=12)
forecast_index = pd.date_range(df.index[-1] + pd.offsets.MonthEnd(1), periods=12, freq='M')
forecast_values = forecast.predicted_mean
conf_int = forecast.conf_int()
conf_int.index = forecast_index  # align index

# ----------------------------
# Visualization
# ----------------------------
plt.figure(figsize=(10,6))
plt.plot(df.index, df['Passengers'], label="Observed")
plt.plot(forecast_index, forecast_values.values, label="Forecast", color="red")
plt.fill_between(forecast_index,
                 conf_int.iloc[:, 0].values,
                 conf_int.iloc[:, 1].values,
                 color="pink", alpha=0.3)
plt.title("SARIMA Forecast of Airline Passengers")
plt.xlabel("Year")
plt.ylabel("Passengers")
plt.legend()
plt.show()

# ----------------------------
# Print forecasted values
# ----------------------------
print("Forecasted Passenger Numbers (Next 12 Months):")
print(forecast_values)

OUTPUT - Forecasted Passenger Numbers (Next 12 Months):
1961-01-31    443.28
1961-02-28    432.15
1961-03-31    493.74
1961-04-30    511.20
1961-05-31    530.45
1961-06-30    622.84
1961-07-31    678.92
1961-08-31    660.25
1961-09-30    556.41
1961-10-31    489.65
1961-11-30    438.17
1961-12-31    481.52
Freq: M, Name: predicted_mean, dtype: float64

Question 9: Apply Local Outlier Factor (LOF) on any numerical dataset to detect anomalies and visualize them using matplotlib.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor

# ----------------------
# 1. Create sample dataset
# ----------------------
np.random.seed(42)
X_inliers = 0.3 * np.random.randn(100, 2)
X_inliers = np.r_[X_inliers + 2, X_inliers - 2]

# Add some outliers
X_outliers = np.random.uniform(low=-6, high=6, size=(20, 2))
X = np.r_[X_inliers, X_outliers]

# ----------------------
# 2. Apply Local Outlier Factor
# ----------------------
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
y_pred = lof.fit_predict(X)
scores = -lof.negative_outlier_factor_

# ----------------------
# 3. Visualization
# ----------------------
plt.figure(figsize=(8, 6))

# Inliers
plt.scatter(X[y_pred == 1, 0], X[y_pred == 1, 1], c='blue', s=40, label="Inliers")

# Outliers
plt.scatter(X[y_pred == -1, 0], X[y_pred == -1, 1], c='red', s=60, label="Outliers")

# Circle sizes grow with the LOF score (larger circle = more anomalous)
radius = (scores - scores.min()) / (scores.max() - scores.min())
plt.scatter(X[:, 0], X[:, 1], s=1000 * radius, edgecolor='r', facecolors='none', alpha=0.3)

plt.title("Local Outlier Factor (LOF) Anomaly Detection")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()

# ----------------------
# 4. Print Sample Output
# ----------------------
print("First 10 Predictions and Scores:")
for i in range(10):
    print(f"Point {i}: Prediction={y_pred[i]}, LOF Score={scores[i]:.4f}")

OUTPUT (first five points shown) -
Point 0: Prediction=1, LOF Score=0.9999
Point 1: Prediction=1, LOF Score=1.1374
Point 2: Prediction=1, LOF Score=0.9721
Point 3: Prediction=1, LOF Score=1.2899
Point 4: Prediction=1, LOF Score=0.9697

Question 10: You are working as a data scientist for a power grid monitoring company. Your goal is to forecast energy demand and also detect abnormal spikes or drops in real-time consumption data collected every 15 minutes. The dataset includes features like timestamp, region, weather conditions, and energy usage.

Explain your real-time data science workflow:
● How would you detect anomalies in this streaming data (Isolation Forest / LOF / DBSCAN)?
● Which time series model would you use for short-term forecasting (ARIMA / SARIMA / SARIMAX)?
● How would you validate and monitor the performance over time?
● How would this solution help business decisions or operations?

ANSWER - 1. Detecting Anomalies in Streaming Data

Streaming data arrives every 15 minutes, so the method must be fast and adaptive.

Options:

Isolation Forest → Best for real-time, scalable, works on high-dimensional data.

LOF (Local Outlier Factor) → Detects local density anomalies but slower for large streams.

DBSCAN → Good for clustering anomalies, but less suited for continuous streaming.

✅ Best Choice: Isolation Forest, trained on recent rolling-window data (e.g., last 7 days).

It will flag sudden spikes or drops in consumption compared to historical patterns.
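
The rolling-window idea could look roughly like the sketch below. It assumes a single region, uses usage as the only feature, and generates the "last 7 days" synthetically; a real pipeline would likely add context such as hour of day so that normal daily peaks are not flagged.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

INTERVALS_PER_DAY = 96             # 24 hours x 4 readings per hour
window = 7 * INTERVALS_PER_DAY     # last 7 days of 15-minute readings

# Hypothetical recent usage in MW for one region.
recent_usage = rng.normal(1200.0, 50.0, size=window)

# Refit on the rolling window; in production this would be repeated on a schedule.
detector = IsolationForest(n_estimators=100, random_state=42)
detector.fit(recent_usage.reshape(-1, 1))

# Score incoming readings as they arrive.
new_readings = np.array([[1200.0], [2500.0]])
labels = detector.predict(new_readings)          # 1 = normal, -1 = anomaly
scores = detector.score_samples(new_readings)    # lower = more anomalous

for usage, label, score in zip(new_readings.ravel(), labels, scores):
    print(f"{usage:.0f} MW -> label={label}, score={score:.3f}")
```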

2. Short-Term Forecasting Model

We want to forecast next few hours (short horizon).

Options:

ARIMA → Works on a single variable; differencing handles trend, but it has no seasonal or external terms.

SARIMA → Handles daily/weekly seasonality.

SARIMAX → Can also include external features like temperature, weather, region.

✅ Best Choice: SARIMAX

Because energy usage depends on time + weather conditions + region.

It captures both seasonality (e.g., peak evening demand) and external regressors (temperature, humidity, holidays).

3. Validation and Monitoring

Validation:

Use time-series cross-validation (rolling forecast origin) instead of random split.

Metrics: RMSE, MAE, MAPE for forecasting.

For anomalies: Precision, Recall, F1-score.

Monitoring:

Continuously track forecast error.

If RMSE exceeds threshold → retrain the model with latest data.

Maintain a dashboard showing anomaly frequency, forecast accuracy, and alert logs.
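
The validation and monitoring steps can be sketched with two small helpers. The function names, the toy numbers, and the retraining threshold of 100 MW are assumptions for illustration; the split logic is an expanding-window variant of rolling forecast origin.

```python
import numpy as np


def rolling_origin_splits(n_obs, initial_train, horizon):
    """Return (train_end, test_end) index pairs for expanding-window validation.

    Each fold trains on observations [0, train_end) and tests on [train_end, test_end).
    """
    splits = []
    train_end = initial_train
    while train_end + horizon <= n_obs:
        splits.append((train_end, train_end + horizon))
        train_end += horizon
    return splits


def forecast_errors(actual, predicted):
    """Compute MAE, RMSE, and MAPE (in percent) for one forecast window."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = actual - predicted
    return {
        "MAE": np.mean(np.abs(err)),
        "RMSE": np.sqrt(np.mean(err**2)),
        "MAPE": np.mean(np.abs(err / actual)) * 100,
    }


splits = rolling_origin_splits(n_obs=10, initial_train=6, horizon=2)
print(splits)

# Toy actual vs. forecast values in MW.
metrics = forecast_errors([100, 110, 120], [90, 110, 130])
print({k: round(v, 2) for k, v in metrics.items()})

RMSE_THRESHOLD = 100.0  # assumed retraining trigger in MW
needs_retraining = metrics["RMSE"] > RMSE_THRESHOLD
print("retrain:", needs_retraining)
```

Every fold tests on data that comes strictly after its training data, which is the property a random split would violate.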

4. Business Value & Operations

Prevent Grid Failures: Early anomaly detection helps avoid blackouts and overloads.

Optimize Resource Allocation: Forecasting ensures enough energy is generated or purchased.

Cost Efficiency: Reduces waste from overproduction or penalties from underproduction.

Customer Satisfaction: Ensures stable electricity supply during peak hours.

Regulatory Compliance: Provides transparent anomaly logs and demand forecasts.

Sample output (illustrative values) -
Timestamp            Region   Usage(MW)   Anomaly
2025-09-15 10:00     North    1200        Normal
2025-09-15 10:15     North    2500        Spike detected (-1)
2025-09-15 10:30     North    1180        Normal

 Forecast for Region = North:
2025-09-15 10:45 → 1225 MW
2025-09-15 11:00 → 1240 MW
2025-09-15 11:15 → 1238 MW

Last 24h RMSE = 72 MW
Anomaly detection precision = 92%
Retraining trigger: RMSE > 100 MW