
Question 1: What is Dimensionality Reduction? Why is it important in machine learning?

Answer: Dimensionality reduction is the process of transforming data from a high-dimensional feature space to a lower-dimensional space while retaining as much of the important structure (signal/variance or class-separating information) as possible.

It matters because high-dimensional data increases computational cost, exacerbates the curse of dimensionality (sparser data, unreliable distance measures), and heightens overfitting risk. Reducing dimensions can improve model generalization, speed, storage, and interpretability, and it enables 2D/3D visualization for exploration. In practice, it helps remove redundant and noisy features and stabilize distance-based methods.

Question 2: Name and briefly describe three common dimensionality reduction techniques.

Answer: Principal Component Analysis (PCA): A linear projection method that finds orthogonal directions (principal components) capturing maximal variance; used for compression, denoising, and preprocessing before learning.

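t-SNE (t-Distributed Stochastic Neighbor Embedding): A nonlinear manifold technique optimized for 2D/3D visualization that preserves local neighborhood structure to reveal clusters. It is generally not suited as a preprocessing step for downstream models, because distances between far-apart points in the embedding are not reliable and the standard algorithm does not learn a mapping that can be applied to new points.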

UMAP (Uniform Manifold Approximation and Projection): A fast nonlinear manifold method preserving both local and more global structure, scalable to large datasets, useful for visualization and sometimes as a feature transform.
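
As a minimal sketch of the PCA description above, the snippet below projects a small dataset onto its first two principal components. It assumes scikit-learn is installed; the Iris dataset and the variable names are illustrative choices, not part of the original answer.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative dataset: 150 samples with 4 numeric features
X = load_iris().data

# PCA is variance-based, so features are standardized first
X_scaled = StandardScaler().fit_transform(X)

# Keep the two orthogonal directions with the largest variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print("Original shape:", X.shape)
print("Reduced shape:", X_2d.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

The explained variance ratio indicates how much of the original spread each retained component carries, which is one common way to decide how many components to keep.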

Question 3: What is clustering in unsupervised learning? Mention three popular clustering algorithms.

Answer: Clustering is an unsupervised learning task that groups samples so that items within a cluster are more similar to each other than to items in other clusters, without using labels; similarity is defined by a distance or density notion appropriate to the data and objective. It is used for structure discovery, segmentation, compression, anomaly pre-filtering, and exploratory data analysis across domains such as marketing (customer segments), biology (cell types), and information retrieval (document/topic groups).

K-Means: partitions data into K clusters by minimizing within-cluster variance; efficient and widely used, but assumes roughly spherical clusters and requires K chosen in advance (Elbow/Silhouette aids selection).

DBSCAN: density-based clusters discovered from core points and their neighborhoods; can find arbitrarily shaped clusters and marks noise explicitly, but needs density parameters (eps, minPts) tuned to data scale.

Agglomerative hierarchical clustering: merges clusters bottom-up using a linkage criterion (single, complete, average, Ward) to produce a dendrogram; this supports multi-scale cluster selection, and K need not be chosen until the tree is cut.
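
The two algorithms not covered by the K-means exercise can be sketched as follows. This is an illustrative example only: the two-moons dataset, eps=0.3, min_samples=5 (scikit-learn's name for minPts), and Ward linkage are assumptions chosen for the demo and would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Two interleaving half-circles: a non-spherical shape
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
X_moons = StandardScaler().fit_transform(X_moons)

# DBSCAN: no K required; points labeled -1 are treated as noise
db = DBSCAN(eps=0.3, min_samples=5)
db_labels = db.fit_predict(X_moons)
n_db_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)

# Agglomerative clustering: bottom-up merges, tree cut at 2 clusters
agg = AgglomerativeClustering(n_clusters=2, linkage="ward")
agg_labels = agg.fit_predict(X_moons)

print("DBSCAN clusters found:", n_db_clusters)
print("DBSCAN noise points:", int(np.sum(db_labels == -1)))
print("Agglomerative cluster sizes:", np.bincount(agg_labels))
```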

Question 4: Explain the concept of anomaly detection and its significance.

Answer:
Anomaly detection is the identification of observations, events, or patterns that deviate significantly from the normal behavior of a system or dataset, often signaling fraud, faults, intrusions, safety risks, or data quality problems.

Concept:

Definition and goal: detect instances that do not conform to an established notion of normality; this may be called outlier, novelty, or deviation detection depending on whether anomalies were present in training data.

Anomaly types: point anomalies (an individual observation is extreme), contextual anomalies (normal generally but abnormal in a specific context like time/season), and collective anomalies (a sequence or group is anomalous even if members aren't individually extreme).

Significance:

Fraud and risk: credit card fraud, account takeover, money laundering, insurance abuse; rapid anomaly detection reduces direct financial loss and chargebacks.

Cybersecurity and IT/observability: intrusion and malware detection, DDoS spikes, service latency/SLA breaches, error-rate anomalies in logs/metrics that indicate incidents to triage quickly.

Industrial/IoT and healthcare: sensor drifts, vibration anomalies indicating equipment failure; physiological or lab value outliers flagging clinical deterioration or data-entry errors.

Question 5: List and briefly describe three types of anomaly detection techniques.

Answer:
Statistical methods: Model normal data distribution (e.g., Gaussian, robust z-scores) and flag points with low likelihood or extreme deviation, effective when distributional assumptions hold.

Distance/Density-based methods: Use distances or neighborhood density to flag isolated points (e.g., kNN distance, Local Outlier Factor), well-suited when anomalies are far from dense regions.

Model-based/Representation methods: Learn a model of normality and flag high reconstruction error or low predicted probability (e.g., PCA reconstruction, one‑class SVM, autoencoders), flexible for complex data manifolds.
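
The first two families can be sketched on synthetic data as below. The data, the injected outliers, the 3.5 cutoff for the robust z-score, and the LOF settings (n_neighbors=20, contamination=0.02) are all assumptions made for illustration, not recommended defaults.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# 200 "normal" points plus 3 hand-placed outliers (appended last)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.5], [10.0, -8.0]])
X_anom = np.vstack([normal, outliers])

# Statistical: robust z-score per feature from the median and MAD
median = np.median(X_anom, axis=0)
mad = np.median(np.abs(X_anom - median), axis=0)
robust_z = 0.6745 * (X_anom - median) / mad
stat_flags = (np.abs(robust_z) > 3.5).any(axis=1)

# Density-based: Local Outlier Factor labels outliers as -1
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
lof_flags = lof.fit_predict(X_anom) == -1

print("Flagged by robust z-score:", np.flatnonzero(stat_flags))
print("Flagged by LOF:", np.flatnonzero(lof_flags))
```

The robust z-score relies on each feature being roughly symmetric around its median, while LOF only compares local densities, which is why the two can disagree on borderline points.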

Question 6: What is time series analysis? Mention two key components of time series data.

Answer:
Time series analysis is the statistical and analytical study of data points collected or recorded at sequential and usually equally spaced time intervals. Its aim is to uncover underlying patterns, forecast future values, detect anomalies, and understand how temporal structures such as trends and recurring cycles affect behavior over time.

Two key components:

Trend: The long-term movement in the time series, which could be upward, downward, or flat, and reflects persistent change over periods longer than seasonal cycles (e.g., increasing sales year over year).

Seasonality: Regular, fixed-frequency patterns linked to calendar intervals, such as daily temperature changes, weekly website visits, or annual sales cycles.
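
To make the two components concrete, the sketch below builds a synthetic monthly series and separates trend from seasonality with a classical additive decomposition (centered moving average, then per-month averages). The series, the period of 12, and all variable names are assumptions for illustration; only NumPy is used.

```python
import numpy as np

rng = np.random.default_rng(42)
period = 12                      # assumed seasonal period (monthly data)
n = 10 * period                  # ten full years
t = np.arange(n)

# Synthetic series = linear trend + sinusoidal seasonality + noise
trend_true = 0.5 * t
seasonal_true = 10.0 * np.sin(2 * np.pi * t / period)
series = trend_true + seasonal_true + rng.normal(scale=1.0, size=n)

# Trend: centered moving average spanning one full period
# (half weights at both ends because the period is even)
weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
trend_est = np.convolve(series, weights, mode="valid")
offset = period // 2             # index of the first centered estimate

# Seasonality: average the detrended values at each position in the cycle
detrended = series[offset:offset + len(trend_est)] - trend_est
positions = t[offset:offset + len(trend_est)] % period
seasonal_est = np.array([detrended[positions == p].mean() for p in range(period)])
seasonal_est -= seasonal_est.mean()   # center the seasonal effects on zero

print("Estimated trend (first 3):", np.round(trend_est[:3], 2))
print("Estimated seasonal effects:", np.round(seasonal_est, 2))
```

Averaging over exactly one period cancels the seasonal swing, so what remains tracks the trend; subtracting it back out leaves the repeating seasonal profile plus noise.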


Question 7: Describe the difference between seasonality and cyclic behavior in time series.

Answer:
Seasonality and cyclic behavior are both recurring patterns in time series data, but they differ fundamentally in their regularity, cause, and predictability:

Seasonality
Definition: Seasonality is a pattern of fluctuations that repeats at a fixed, known interval due to calendar-related or external periodic influences (e.g., months, quarters, days of the week).

Key Characteristics:

Always has a constant and predictable period (e.g., every 12 months, once a week).

Driven by factors like holidays, weather, or business schedules tied to the calendar.

Examples include higher ice cream sales every summer or increased shopping activity every December.

The timing and frequency of peaks and troughs do not change; seasonal effects are strictly regular and associated with the calendar.

Cyclic Behavior
Definition: Cyclic behavior also involves up-and-down movements in a time series, but these cycles do not have a fixed, known period. Instead, the length and amplitude of each cycle can change depending on broader, often economic or systemic, forces.

Key Characteristics:

Cycles are not defined by the calendar; their durations and intensities vary.

Often arise from business, economic, or system-driven factors (e.g., business booms and recessions).

Peaks and troughs are less predictable, and the average duration of one cycle is generally much longer and more variable than seasonal intervals.

Examples include economic expansions and contractions, with cycles lasting several years and both their timing and magnitude subject to change.



#Question 8: Write Python code to perform K-means clustering on a sample dataset.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Generate synthetic sample data with 3 clusters
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=42)

# Scale for better clustering
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit K-means model
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

print("Cluster centers:\n", kmeans.cluster_centers_)
print("Cluster size distribution:", np.bincount(labels))
print("Inertia (sum of squared distances):", kmeans.inertia_)


Cluster centers:
 [[-0.21814896  1.14246759]
 [-1.04793262 -1.24307863]
 [ 1.26608158  0.10061104]]
Cluster size distribution: [100 100 100]
Inertia (sum of squared distances): 39.06864466333185
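
Because K-means needs K up front, one hedged way to check the choice of 3 is to compare silhouette scores across several values of K. The sketch below regenerates the same synthetic data so it runs on its own; the candidate range 2 to 6 is an arbitrary assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Same synthetic data and scaling as in the K-means example
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Silhouette score for each candidate K (higher means better separation)
scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    scores[k] = silhouette_score(X_scaled, km.fit_predict(X_scaled))

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"K={k}: silhouette={s:.3f}")
print("Best K by silhouette:", best_k)
```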


#Question 9: What is inheritance in OOP? Provide a simple example in Python.

# Answer:
# Inheritance is an object-oriented programming principle in which a new class
# (child/subclass) can use, extend, or override the properties and methods of
# another class (parent/superclass), promoting code reuse and clarity by
# modeling "is-a" relationships.

# Example

class Animal:
    def __init__(self, name):
        self.name = name
    def speak(self):
        return "Some generic sound"

class Dog(Animal):
    def speak(self):
        return f"{self.name} says Woof!"

class Cat(Animal):
    def speak(self):
        return f"{self.name} says Meow!"

d = Dog("Bruno")
c = Cat("Luna")
print(d.speak()) # Output: Bruno says Woof!
print(c.speak()) # Output: Luna says Meow!

Bruno says Woof!
Luna says Meow!
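
The example above shows overriding. As a complementary sketch, a subclass can also extend its parent by calling super(), reusing the parent's logic and adding to it. The Bird class, the describe method, and the can_fly attribute are hypothetical names introduced only for this illustration; Animal is redefined so the snippet runs on its own.

```python
class Animal:
    def __init__(self, name):
        self.name = name

    def describe(self):
        return f"{self.name} is an animal"


class Bird(Animal):
    def __init__(self, name, can_fly=True):
        # Reuse the parent's initializer, then add a new attribute
        super().__init__(name)
        self.can_fly = can_fly

    def describe(self):
        # Extend (rather than replace) the parent's behavior
        base = super().describe()
        ability = "can fly" if self.can_fly else "cannot fly"
        return f"{base} that {ability}"


b = Bird("Pingu", can_fly=False)
print(b.describe())            # Pingu is an animal that cannot fly
print(isinstance(b, Animal))   # True: a Bird "is-a" Animal
```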


Question 10: How can time series analysis be used for anomaly detection?

Answer:
Time series analysis enables anomaly detection by modeling expected temporal patterns (trend, seasonality, cycles) and flagging data points or intervals that significantly diverge from these learned structures. Approaches include:

Decomposition: Break time series into components (trend, seasonality, residual/noise) and detect outliers in the residuals after accounting for predictable effects.

Forecasting models: Use ARIMA, SARIMA, Prophet, or machine learning models to generate predictions and flag observations that fall outside the prediction intervals.

Window-based learning: Apply Isolation Forest, one-class SVM, autoencoders, or deep anomaly detectors on rolling windows or features engineered from time series.

This process is critical for early detection of operational failures, fraud, system errors, or unexpected events, powering alerts, interventions, and proactive business decisions in domains like IT, IoT, healthcare, energy, and manufacturing.
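
The decomposition approach can be sketched as follows: remove an estimated trend and seasonal profile, then flag residuals with a large robust z-score. The synthetic hourly series, the injected anomalies, the linear-trend assumption, and the cutoff of 5 are illustrative choices rather than a definitive recipe.

```python
import numpy as np

rng = np.random.default_rng(7)
period = 24                       # assumed daily cycle in hourly data
n = 14 * period                   # two weeks of observations
t = np.arange(n)

# Synthetic series: level + slow trend + daily seasonality + noise
series = 50 + 0.02 * t + 5 * np.sin(2 * np.pi * t / period)
series = series + rng.normal(scale=0.5, size=n)

# Inject three point anomalies at known positions
anomaly_idx = np.array([100, 200, 300])
series[anomaly_idx] += np.array([6.0, -7.0, 8.0])

# Trend: least-squares line (assumes the trend is roughly linear)
slope, intercept = np.polyfit(t, series, deg=1)
detrended = series - (slope * t + intercept)

# Seasonality: median of detrended values at each hour of the cycle
profile = np.array([np.median(detrended[t % period == p]) for p in range(period)])
residual = detrended - profile[t % period]

# Flag residuals that are extreme relative to a robust scale (MAD)
med = np.median(residual)
mad = np.median(np.abs(residual - med))
robust_z = 0.6745 * (residual - med) / mad
flagged = np.flatnonzero(np.abs(robust_z) > 5.0)

print("Injected anomalies at:", anomaly_idx)
print("Flagged indices:", flagged)
```

Medians are used for both the seasonal profile and the residual scale so that the anomalies themselves have little influence on the estimate of normal behavior.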