**Naive Approach**

1. What is the Naive Approach in machine learning?

The Naive Approach, also known as the Naive Bayes algorithm, is a simple and widely used probabilistic machine learning method based on Bayes' theorem. It is primarily used for classification tasks, where the goal is to assign a class label to a given input data point.


2. Explain the assumptions of feature independence in the Naive Approach.

Assumptions of feature independence in the Naive Approach:

The Naive Approach assumes that all features describing an input data point are conditionally independent of each other given the class label. This means that, once the class is known, the value of one feature carries no additional information about the value of any other feature.

In mathematical terms, this assumption can be stated as: P(x_i | C, x_j) = P(x_i | C), where x_i and x_j are two different features, and C is the class label. This assumption significantly simplifies the calculation of probabilities required for classification.
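
As a sketch of why this helps, the assumption lets the class posterior be written as a product of per-feature terms. Here n denotes the number of features and \hat{C} the predicted class; both symbols are introduced only for this illustration.

$$
P(C \mid x_1, \dots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C),
\qquad
\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i \mid C)
$$

Each factor P(x_i | C) can be estimated separately from the training data, instead of estimating one joint distribution over all features.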

3. How does the Naive Approach handle missing values in the data?

When dealing with missing values in the data, the Naive Approach typically handles them in one of two ways:

1.   Missing values are ignored during the probability estimation step. In other words, the algorithm only considers the observed features when making predictions.
2.   If missing values are prevalent, imputation techniques like mean, median, or mode imputation can be applied to replace them before using the Naive Approach.



4. What are the advantages and disadvantages of the Naive Approach?

Advantages of the Naive Approach:

*   Simple and easy to implement.
*   Efficient and computationally inexpensive, making it suitable for large datasets.
*   Can work well with high-dimensional data and is particularly effective in text classification tasks.


Disadvantages of the Naive Approach:

* The assumption of feature independence may not hold true for all datasets, leading to suboptimal performance.
* It is susceptible to the "zero-frequency" or "zero-count" problem, where it assigns zero probability to unseen features, making predictions unreliable.
* The Naive Approach may struggle with data that has strong dependencies between features.

5. Can the Naive Approach be used for regression problems? If yes, how?

The Naive Approach is designed for classification, but it can be adapted for regression problems through a variation sometimes called Naive Bayes regression. In this approach, the conditional probability distribution of the target variable given the features is assumed to follow a specific distribution, such as a Gaussian (Normal) distribution. Such adaptations are uncommon in practice compared with dedicated regression methods.

6. How do you handle categorical features in the Naive Approach?

Handling categorical features in the Naive Approach:

*  For categorical features, the Naive Approach uses the concept of likelihood and prior probabilities to calculate the conditional probabilities required for classification.
*  The likelihood is estimated by counting the occurrences of each category within each class, while the prior probability is estimated by counting the occurrences of each class label in the training data.
*  Laplace smoothing (additive smoothing) is often applied to handle categories that are not observed in the training data and to prevent zero probabilities.

7. What is Laplace smoothing and why is it used in the Naive Approach?

Laplace smoothing, also known as additive smoothing, is used to address the zero-frequency problem that may arise when a category is not observed in the training data for a particular class.

During Laplace smoothing, a small constant (usually 1) is added to all the category counts for each class before calculating the probabilities.
This ensures that no category has a probability of zero, preventing the algorithm from assigning zero probability to unseen features.
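
A minimal sketch of the smoothed estimate described above, assuming the feature values of one class are given as a list. The function name `smoothed_likelihoods`, the parameter `alpha`, and the toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter

def smoothed_likelihoods(values, categories, alpha=1.0):
    """Estimate P(category | class) from the feature values seen in one class.

    alpha is the smoothing constant; alpha=1 is classic Laplace smoothing.
    """
    counts = Counter(values)
    # Every category receives alpha pseudo-counts, so the denominator grows too.
    total = len(values) + alpha * len(categories)
    return {c: (counts[c] + alpha) / total for c in categories}

# Hypothetical toy data: the "color" feature for the training rows of one class.
colors_in_class = ["red", "red", "blue"]
probs = smoothed_likelihoods(colors_in_class, ["red", "blue", "green"])
print(probs)  # "green" was never observed but still gets a non-zero probability
```

With `alpha=0` the estimate for "green" would be exactly zero, which would zero out the whole product of likelihoods for this class.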

8. How do you choose the appropriate probability threshold in the Naive Approach?

Choosing the appropriate probability threshold in the Naive Approach:

The Naive Approach outputs class probabilities for each data point, and a threshold is used to determine the final predicted class.
The appropriate threshold depends on the specific problem and the desired trade-off between precision and recall.
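
The trade-off can be made concrete with a small sketch that evaluates precision and recall at several thresholds. The labels and predicted probabilities below are made up for illustration, and `precision_recall` is a hypothetical helper, not part of any library.

```python
def precision_recall(y_true, probs, threshold):
    """Precision and recall of the positive class at a probability threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical true labels and predicted positive-class probabilities.
y_true = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.6, 0.4, 0.7, 0.2, 0.1]

for t in (0.3, 0.5, 0.8):
    p, r = precision_recall(y_true, probs, t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, raising the threshold increases precision and lowers recall, which is the trade-off the threshold choice has to balance.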

9. Give an example scenario where the Naive Approach can be applied.

**Text Classification:** The Naive Approach is commonly used for tasks like sentiment analysis, spam detection, and document categorization. In this scenario, each document is represented by the frequencies of words (features), and the algorithm can predict the class (e.g., positive/negative sentiment) based on these features' probabilities.

**KNN:**

10. What is the K-Nearest Neighbors (KNN) algorithm?

The K-Nearest Neighbors (KNN) algorithm is a simple and non-parametric machine learning algorithm used for both classification and regression tasks. It makes predictions based on the similarity of the new data point to its k nearest neighbors in the training dataset.


11. How does the KNN algorithm work?

**For classification:** To predict the class label of a new data point, KNN finds the k closest data points to the new point (nearest neighbors) in the feature space, based on a chosen distance metric (e.g., Euclidean distance). The class label of the new data point is then determined by a majority vote of the classes among its k nearest neighbors.

**For regression:** To predict the target value of a new data point, KNN calculates the average (or weighted average) of the target values of its k nearest neighbors.
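
Both prediction modes can be sketched in a few lines of plain Python. This is a brute-force illustration under the assumption of Euclidean distance and an unweighted vote or average; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def k_nearest(train_X, query, k):
    """Indices of the k training points closest to `query` (Euclidean)."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], query))
    return order[:k]

def knn_classify(train_X, train_y, query, k=3):
    """Majority vote among the k nearest neighbors."""
    votes = Counter(train_y[i] for i in k_nearest(train_X, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    """Unweighted average of the neighbors' target values."""
    idx = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in idx) / len(idx)

# Hypothetical 2-D toy data: two well-separated groups.
X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]
targets = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]

print(knn_classify(X, labels, (1.1, 1.0)))
print(knn_regress(X, targets, (5.0, 5.0)))
```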

12. How do you choose the value of K in KNN?

The choice of K is critical in KNN, as it affects the algorithm's performance. A small K can lead to a noisy decision boundary, making the model sensitive to outliers and overfitting, while a large K can result in oversmoothing and loss of local patterns.
The value of K is typically chosen based on cross-validation, where different values of K are tested on a validation set, and the one that gives the best performance (e.g., highest accuracy or lowest mean squared error) is selected.
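
As an illustration of this selection procedure, the sketch below scores several values of K with leave-one-out cross-validation (one simple form of cross-validation, chosen here because the toy dataset is tiny). All names and data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k nearest training points (Euclidean)."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], query))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]

def loo_accuracy(X, y, k):
    """Leave-one-out accuracy: each point is predicted from all the others."""
    hits = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, X[i], k) == y[i]
    return hits / len(X)

# Hypothetical toy data: four points per class.
X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (1.1, 1.3),
     (5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.3)]
y = ["a", "a", "a", "a", "b", "b", "b", "b"]

scores = {k: loo_accuracy(X, y, k) for k in (1, 3, 5, 7)}
best_k = max(scores, key=scores.get)  # ties resolve to the smallest K tested
print(scores, best_k)
```

With K=7 every held-out point is outvoted by the four points of the other class, a small-scale picture of the oversmoothing that a large K can cause.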

13. What are the advantages and disadvantages of the KNN algorithm?

**Advantages of the KNN algorithm:**

*   Simple to understand and implement.
*   Can be used for both classification and regression tasks.
*   Its non-parametric nature makes it flexible and able to handle complex decision boundaries.
*   Has no explicit training phase (it simply stores the training data), so it can adapt to new data easily.

**Disadvantages of the KNN algorithm:**

*   Can be computationally expensive at prediction time, especially for large datasets, as it requires calculating distances to all training data points.
*   Sensitive to the choice of distance metric and feature scaling.
*   Struggles with high-dimensional data, as the curse of dimensionality makes distances less informative.
*   Can be biased towards classes with higher frequencies in the dataset, making it perform poorly on imbalanced datasets.

14. How does the choice of distance metric affect the performance of KNN?

The choice of distance metric in KNN:
The distance metric measures the similarity between data points. The commonly used distance metrics are Euclidean distance (for continuous features), Manhattan distance, and Minkowski distance.
The choice of distance metric can significantly affect the performance of KNN, as some distance metrics might be more suitable for certain types of data and problems.
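
The three metrics named above are related: Manhattan and Euclidean distance are the p=1 and p=2 cases of Minkowski distance. A minimal sketch, where `minkowski` is a hypothetical helper:

```python
def minkowski(a, b, p=2):
    """Minkowski distance; p=1 is Manhattan distance, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (0.0, 0.0), (3.0, 4.0)
print(minkowski(a, b, p=1))  # Manhattan: |3| + |4|
print(minkowski(a, b, p=2))  # Euclidean: sqrt(3^2 + 4^2)
```

Because every coordinate difference enters the sum directly, a feature measured on a much larger scale dominates the distance, which is why feature scaling matters for KNN.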

15. Can KNN handle imbalanced datasets? If yes, how?

KNN can handle imbalanced datasets, but its predictions may be biased towards the majority class due to the voting mechanism.
Techniques like oversampling the minority class, undersampling the majority class, or using different weights for different classes can be applied to balance the dataset and improve the model's performance on the minority class.

16. How do you handle categorical features in KNN?

Categorical features in KNN can be handled by using appropriate distance metrics for categorical data, such as Hamming distance or Jaccard distance.

17. What are some techniques for improving the efficiency of KNN?

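*   Using efficient data structures like KD-trees or Ball-trees to speed up the neighbor search process.
*   Applying dimensionality reduction techniques like Principal Component Analysis (PCA) to reduce the number of features and the cost of each distance computation. t-distributed Stochastic Neighbor Embedding (t-SNE) is sometimes mentioned here as well, but it is mainly a visualization technique and many implementations cannot project new query points, so it is rarely practical for this purpose.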

18. Give an example scenario where KNN can be applied.

Handwritten Digit Recognition: Given a dataset of images containing handwritten digits and their corresponding labels, KNN can be used to classify new unseen images of digits. The algorithm would find the k-nearest neighbors based on pixel intensities and assign the most frequent class among the neighbors as the predicted digit label for the new image.

**Clustering:**

19. What is clustering in machine learning?

Clustering in machine learning is a type of unsupervised learning technique used to group similar data points into clusters based on their inherent patterns or similarities.


20. Explain the difference between hierarchical clustering and k-means clustering.

**Hierarchical clustering** builds a tree-like structure of clusters, known as a dendrogram, by iteratively merging or dividing clusters based on similarity.
It can be agglomerative, where each data point initially forms its own cluster and is then merged together, or divisive, where all data points start in a single cluster and are recursively divided.

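**K-means clustering** aims to partition the data into a pre-specified number of clusters, denoted as 'k.'
It starts by randomly initializing 'k' cluster centroids and assigns each data point to the nearest centroid, creating 'k' clusters. Each centroid is then recomputed as the mean of its assigned points, and the assignment and update steps are repeated until the assignments stop changing.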
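
A minimal sketch of k-means (Lloyd's algorithm) in plain Python, assuming Euclidean distance and initial centroids drawn from the data points. The function signature, the `seed` parameter, and the toy data are hypothetical; the result depends on the random initialization, so no particular labeling is guaranteed.

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Alternate assignment and centroid update until assignments are stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct data points
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return labels, centroids

pts = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = kmeans(pts, k=2)
print(labels, centroids)
```

Unlike hierarchical clustering, this procedure needs k up front and returns a single flat partition rather than a dendrogram.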

21. How do you determine the optimal number of clusters in k-means clustering?

**Elbow method:**  Plot the sum of squared distances (inertia) between data points and their cluster centroids for different values of 'k.' The optimal value of 'k' is usually where the inertia starts to level off, creating an elbow-like shape in the plot.

**Silhouette score:** Calculate the silhouette score for different values of 'k.' The silhouette score measures how similar an object is to its own cluster compared to other clusters. The value ranges from -1 to 1, and a higher silhouette score indicates better-defined clusters.

**Gap statistic:** Compare the within-cluster dispersion of the data for different values of 'k' with a reference distribution to find the optimal 'k' that maximizes the gap.

**Domain knowledge:** Sometimes, prior knowledge about the data or the problem can guide the choice of 'k.'

22. What are some common distance metrics used in clustering?

Euclidean distance: The straight-line distance between two points in space.

Manhattan distance: The sum of the absolute differences between the coordinates of two points.

Cosine similarity: Measures the cosine of the angle between two vectors and determines their similarity regardless of their magnitudes.

Jaccard similarity: Used for binary data, it measures the ratio of the size of the intersection to the size of the union of two sets.

Hamming distance: Calculates the number of positions at which two strings of equal length differ.

23. How do you handle categorical features in clustering?

One-Hot Encoding: Create binary columns for each category in the categorical feature. If a data point belongs to a particular category, the corresponding binary column is set to 1; otherwise, it is set to 0.

Label Encoding: Assign a unique integer label to each category in the categorical feature. However, this method can introduce unintended ordinal relationships between categories, which may not be suitable for some clustering algorithms.

24. What are the advantages and disadvantages of hierarchical clustering?

**Advantages of hierarchical clustering:**

*   Produces a hierarchical representation of data through dendrograms, allowing visualization at different levels of granularity.
*   Does not require the number of clusters to be specified beforehand.
*   Can work with various distance metrics and linkage methods to handle different types of data and relationships.

**Disadvantages of hierarchical clustering:**

*   Computationally more expensive, especially for large datasets.
*   Once a decision is made to merge or divide clusters, it cannot be undone, potentially leading to suboptimal results.
*   The choice of linkage method (e.g., single, complete, average) can significantly affect the clustering outcome.

25. Explain the concept of silhouette score and its interpretation in clustering.

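The silhouette score is a metric used to evaluate the quality of clustering results. It measures how similar a data point is to its own cluster compared to other clusters. For each point, let a be the mean distance to the other points in its own cluster and b the mean distance to the points in the nearest other cluster; the point's silhouette coefficient is (b - a) / max(a, b), and the overall score is the average over all points.

Values close to 1 indicate that points are well matched to their own cluster and far from neighboring clusters, values near 0 indicate overlapping clusters, and negative values suggest that points may have been assigned to the wrong cluster.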
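
A minimal sketch of the computation, assuming Euclidean distance and at least two clusters. For each point, a is the mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The function name and toy data are hypothetical.

```python
import math

def silhouette_score(points, labels):
    """Mean of s = (b - a) / max(a, b) over all points (needs >= 2 clusters)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so it does not affect the sum)
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

pts = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
good = silhouette_score(pts, [0, 0, 0, 1, 1, 1])  # matches the two groups
bad = silhouette_score(pts, [0, 1, 0, 1, 0, 1])   # mixes the two groups
print(round(good, 3), round(bad, 3))
```

The labeling that matches the two visible groups scores close to 1, while the mixed labeling scores near or below 0.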

26. Give an example scenario where clustering can be applied.

Retail Customer Segmentation: In a retail business, clustering can be used to group customers based on their purchasing behavior, demographics, and preferences. This can help identify different customer segments, such as high-value customers, budget shoppers, occasional buyers, etc. Understanding these segments can aid in targeted marketing strategies, personalized recommendations, and tailoring product offerings to better meet the needs of each customer group.

**Anomaly Detection:**


27. What is anomaly detection in machine learning?

Anomaly detection in machine learning refers to the process of identifying rare or unusual observations or patterns in a dataset that deviate significantly from the norm or expected behavior. These unusual instances are known as anomalies or outliers.


28. Explain the difference between supervised and unsupervised anomaly detection.

Supervised Anomaly Detection: In this approach, the algorithm is trained on a labeled dataset, which contains both normal instances and labeled anomalies. The model learns to distinguish the two classes and can then predict whether new data points are normal or anomalous.

Unsupervised Anomaly Detection: Here, the algorithm is trained on an unlabeled dataset, usually under the assumption that the vast majority of instances are normal. The model's goal is to learn the distribution of the data and identify points that deviate from it as anomalies. (Training on data known to contain only normal instances is often called semi-supervised anomaly detection or novelty detection.) Unsupervised methods are particularly useful when labeled anomaly data is scarce or unavailable.

29. What are some common techniques used for anomaly detection?

**Statistical Methods:** These methods involve using statistical techniques to identify data points that significantly deviate from the expected distribution. Examples include z-score, mean-shift, and Gaussian mixture models.

**Machine Learning Algorithms:** Various machine learning techniques can be employed for anomaly detection, such as k-nearest neighbors, isolation forests, one-class SVM, autoencoders, and density-based clustering algorithms like DBSCAN.

**Time Series Analysis:** Anomaly detection in time series data involves analyzing the patterns and trends over time to identify abnormal behavior. Techniques like Seasonal Hybrid ESD (Extreme Studentized Deviate) and Prophet are commonly used in this context.

30. How does the One-Class SVM algorithm work for anomaly detection?

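The One-Class SVM (Support Vector Machine) algorithm is a popular method for anomaly detection. It works by learning a boundary that encapsulates the normal instances in the feature space. New data points that fall outside this boundary are flagged as anomalies. A kernel (commonly the RBF kernel) allows the boundary to take non-linear shapes, and a parameter usually called nu bounds the fraction of training points that may fall outside it.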

31. How do you choose the appropriate threshold for anomaly detection?

Domain Expertise: In some cases, domain experts may have a clear understanding of what constitutes an anomaly and can help set a reasonable threshold.

Quantile of Scores: The threshold can be set based on the quantiles of the anomaly scores produced by the algorithm. For example, the threshold could be set at the 95th percentile of the scores, marking the top 5% of the instances as anomalies.

Cost-Sensitive Approach: An approach that considers the costs of false positives and false negatives can be used to find the threshold that minimizes the overall cost of misclassification.

Grid Search: In cases where labeled anomaly data is available, a grid search over candidate thresholds can be performed to find the one that optimizes an evaluation metric such as the F1-score, or that gives an acceptable operating point on the precision-recall or receiver operating characteristic (ROC) curve.
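
The quantile-based approach can be sketched as follows, using the absolute z-score as a simple anomaly score. The data are made up, the helper names are hypothetical, and the 90th percentile is used instead of the 95th only because the toy sample has ten values.

```python
import math
import statistics

def abs_zscores(values):
    """Anomaly score per value: distance from the mean in standard deviations."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [abs(v - mean) / std for v in values]

def percentile_threshold(scores, percent):
    """Nearest-rank percentile of the scores (percent between 0 and 100)."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]

values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]  # hypothetical readings
scores = abs_zscores(values)
threshold = percentile_threshold(scores, 90)
# Flag only the instances whose score exceeds the chosen percentile.
anomalies = [v for v, s in zip(values, scores) if s > threshold]
print(anomalies)
```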

32. How do you handle imbalanced datasets in anomaly detection?

**Resampling:** One can oversample the minority class (anomalies) or undersample the majority class (normal instances) to balance the dataset.

**Cost-sensitive learning:** Modify the learning algorithm to assign different misclassification costs for anomalies and normal instances, giving more weight to the minority class.

**Ensemble methods:** Utilize ensemble techniques like bagging or boosting to improve the model's ability to detect anomalies in imbalanced datasets.

**Anomaly generation**: Augment the dataset by generating synthetic anomalies to balance the class distribution.

33. Give an example scenario where anomaly detection can be applied.

In a network security setting, anomaly detection can be used to identify potential cyber threats and attacks. The goal is to distinguish normal network traffic patterns from suspicious or malicious activities. Instead of relying on predefined signatures of known attacks, anomaly detection techniques can identify novel attacks and zero-day exploits.

**Dimension Reduction:**


34. What is dimension reduction in machine learning?

Dimension reduction in machine learning is the process of reducing the number of features or variables in a dataset while preserving the essential information. The goal is to simplify the data representation, making it more manageable and efficient for analysis and modeling.


35. Explain the difference between feature selection and feature extraction.

Feature selection involves selecting a subset of the original features from the dataset based on certain criteria. These criteria can be statistical measures like correlation, information gain, or significance tests.

Feature extraction, on the other hand, transforms the original features into a new set of features by applying mathematical transformations or techniques. The new features are a combination of the original ones and are created to represent the data in a more compact and meaningful way. Principal Component Analysis (PCA) is an example of a feature extraction technique.

36. How does Principal Component Analysis (PCA) work for dimension reduction?

PCA is a widely used dimension reduction technique that aims to transform the original features into a new set of uncorrelated variables called principal components. These principal components are linear combinations of the original features and are sorted in descending order of variance.
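
A minimal sketch of the idea, limited to the first principal component and computed by power iteration on the covariance matrix. This is an illustrative stand-in for a full eigendecomposition, not how libraries typically implement PCA; it assumes the data has non-zero variance, and the function name and toy data are hypothetical.

```python
import math

def pca_first_component(rows, n_iter=200):
    """First principal component via power iteration on the covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d  # starting vector; assumed not orthogonal to the answer
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient: the variance captured along v (the top eigenvalue).
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                   for i in range(d))
    # Project the centered data onto the component to get the reduced feature.
    projected = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, variance, projected

# Hypothetical data lying exactly on the line y = 2x.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, variance, projected = pca_first_component(rows)
print(direction, variance)
```

Because the toy points lie on a single line, one component captures all of the variance, and the two original features reduce to one without loss.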

37. How do you choose the number of components in PCA?

Scree Plot: Plot the eigenvalues from the PCA and look for an "elbow" point, where the eigenvalues start to level off. This point indicates a good number of components to retain.

Explained Variance: Calculate the cumulative explained variance ratio from the eigenvalues. Retain enough components to explain a significant portion (e.g., 95%) of the total variance.

Cross-validation: Utilize cross-validation techniques to assess the performance of the model with different numbers of components and choose the value that leads to the best performance.
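
The explained-variance rule can be sketched as a short function over the eigenvalues. The eigenvalues below are invented for illustration, and `components_for_variance` is a hypothetical helper.

```python
def components_for_variance(eigenvalues, target=0.95):
    """Smallest number of leading components whose cumulative explained
    variance ratio reaches `target`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value / total  # explained variance ratio of this component
        if cumulative >= target:
            return count
    return len(ordered)

eigenvalues = [5.0, 3.0, 1.0, 0.6, 0.4]  # hypothetical PCA eigenvalues
print(components_for_variance(eigenvalues, target=0.95))
print(components_for_variance(eigenvalues, target=0.75))
```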

38. What are some other dimension reduction techniques besides PCA?

t-distributed Stochastic Neighbor Embedding (t-SNE): Primarily used for visualization, t-SNE reduces the dimensionality while preserving the local structure of the data points, making it effective for clustering and manifold learning.

Linear Discriminant Analysis (LDA): LDA is a supervised dimension reduction technique that maximizes the separability between classes while projecting the data into a lower-dimensional space.

Autoencoders: Autoencoders are neural networks used for unsupervised feature learning. They aim to reconstruct the input data from a reduced-dimensional representation, effectively learning a compressed representation of the data.

Non-negative Matrix Factorization (NMF): NMF factorizes the original data matrix into non-negative matrices, effectively discovering parts-based representations of the data.

39. Give an example scenario where dimension reduction can be applied.

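In marketing, businesses often deal with vast amounts of customer data, including demographics, purchase history, online behavior, and more. Analyzing this data in its original high-dimensional form can be computationally expensive and challenging to interpret. Dimension reduction techniques such as PCA can compress these variables into a small number of components that retain most of the variation, which makes tasks like customer segmentation and visualization more tractable.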

**Feature Selection:**

40. What is feature selection in machine learning?

Feature selection in machine learning is the process of selecting a subset of the most relevant and informative features from the original set of variables in a dataset. The goal is to improve model performance, reduce overfitting, and speed up computation by eliminating irrelevant or redundant features that may not contribute significantly to the target variable prediction.


41. Explain the difference between filter, wrapper, and embedded methods of feature selection.

**Filter Methods:** Filter methods rank features based on statistical measures or scores and select the top-ranked features for the model. These methods are independent of the chosen machine learning algorithm. Examples of filter methods include correlation-based feature selection, mutual information, and chi-squared tests.

**Wrapper Methods:** Wrapper methods evaluate the performance of a machine learning algorithm using different subsets of features. They select features based on the model's performance, considering the specific learning algorithm as a black box. Examples of wrapper methods include recursive feature elimination (RFE) and forward/backward selection.

**Embedded Methods:** Embedded methods incorporate feature selection as part of the model building process. These methods learn which features to use while training the model. Common examples of embedded methods are LASSO (Least Absolute Shrinkage and Selection Operator) and regularization techniques in linear models.

42. How does correlation-based feature selection work?

Correlation-based feature selection is a filter method that selects features based on their correlation with the target variable.

Compute Correlation: Calculate the correlation coefficient between each feature and the target variable.

Rank Features: Rank the features based on their correlation values with the target variable. Higher absolute correlation values indicate higher relevance.

Select Features: Choose the top 'k' features with the highest correlation scores. The value of 'k' can be predetermined or determined using cross-validation.
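
The three steps above can be sketched as follows, assuming numeric features and the Pearson correlation coefficient. The feature names and values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def top_k_by_correlation(features, target, k):
    """Rank features by |r| with the target and keep the top k names."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Hypothetical feature columns and target values.
features = {
    "income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "debt": [5.0, 4.1, 2.9, 2.0, 1.1],
    "noise": [2.0, -1.0, 0.0, -1.0, 2.0],
}
target = [1.1, 1.9, 3.2, 3.9, 5.0]

selected, scores = top_k_by_correlation(features, target, k=2)
print(selected, scores)
```

Using the absolute value keeps strongly negative relationships such as "debt" in this toy data. Pearson correlation only captures linear relationships, so a feature with a strong non-linear effect can still receive a low score.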

43. How do you handle multicollinearity in feature selection?

Multicollinearity occurs when two or more features in the dataset are highly correlated with each other. It can be handled using the following approaches:

Correlation Threshold: Set a correlation threshold, and if the absolute correlation coefficient between two features exceeds the threshold, remove one of the features from the dataset.

Variance Inflation Factor (VIF): Calculate the VIF for each feature. VIF quantifies how much the variance of a feature is inflated due to multicollinearity. Features with high VIF values (typically VIF > 5 or 10) are likely to be highly correlated, and one of them can be removed.

Principal Component Analysis (PCA): PCA can be used to transform the original features into a set of uncorrelated principal components. This can help mitigate multicollinearity and create a more compact representation of the data.
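
As an illustration of the correlation-threshold approach, the sketch below greedily keeps a feature only if it is not too strongly correlated with any feature already kept. Which member of a correlated pair survives depends on column order here, which is a simplification; the names and data are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def drop_correlated(features, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature <= threshold."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[other])) <= threshold
               for other in kept):
            kept.append(name)
    return kept

# Hypothetical columns: the same height recorded in two different units.
features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [30.0, 22.0, 45.0, 28.0, 39.0],
}
print(drop_correlated(features))
```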

44. What are some common feature selection metrics?

Information Gain / Mutual Information: Measures the reduction in uncertainty about the target variable given the knowledge of a feature.

Chi-squared Test: Assesses the statistical dependence between two categorical variables.

Correlation Coefficient: Measures the linear relationship between two continuous variables.

p-value: Indicates the statistical significance of the relationship between a feature and the target variable.

Recursive Feature Elimination (RFE) Score: A wrapper method that recursively removes features and assesses the impact on model performance.

Regularization Coefficients (e.g., LASSO): Embedded methods that use penalty terms to shrink or eliminate the coefficients of less important features.

45. Give an example scenario where feature selection can be applied.

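In credit risk assessment, a bank or financial institution aims to predict the likelihood of a loan applicant defaulting on their payments. They gather data on various factors such as income, credit score, debt-to-income ratio, employment history, and more. Feature selection can then identify the subset of factors that are most predictive of default, yielding a simpler and more interpretable model while removing redundant or irrelevant inputs.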

**Data Drift Detection:**

46. What is data drift in machine learning?

Data drift in machine learning refers to the phenomenon where the statistical properties of the data a model receives during deployment change over time relative to the data it was trained on. In other words, the data distribution in the production environment deviates from the data distribution used to train the model.


47. Why is data drift detection important?

Data drift detection is essential because machine learning models assume that the data they encounter during deployment will be similar to the data used during training. If the data distribution changes significantly, it can lead to a degradation in model performance and accuracy. Detecting data drift allows organizations to monitor the model's effectiveness and take appropriate actions to maintain its performance over time.

48. Explain the difference between concept drift and feature drift.

Concept Drift: Concept drift refers to the situation where the underlying relationship between the input features and the target variable changes over time. In other words, the target variable's behavior evolves, leading to different patterns and correlations between features and the target.

Feature Drift: Feature drift, on the other hand, occurs when the statistical properties of the input features change while the target variable remains the same. The relationship between features and the target may remain constant, but the distribution of feature values shifts over time.

49. What are some techniques used for detecting data drift?

Monitoring Metrics: Continuously monitor key metrics, such as accuracy, precision, recall, and F1-score, to observe changes in model performance over time.

Statistical Tests: Utilize statistical tests, such as the Kolmogorov-Smirnov test, Chi-squared test, or Mann-Whitney U test, to compare the distributions of incoming data with the training data.

Drift Detection Algorithms: Implement drift detection algorithms like Drift Detection Method (DDM), Page-Hinkley Test, or Cumulative Sum (CUSUM) to detect significant changes in data distributions.

Data Comparison: Keep a separate validation dataset or set up an A/B test to compare the model's predictions on different data batches.
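
As one concrete instance of the statistical-test approach, the sketch below computes the two-sample Kolmogorov-Smirnov statistic for a single numeric feature. It returns only the statistic (the largest gap between the two empirical CDFs), not a p-value; a statistics library such as SciPy would normally be used for the full test. The sample values are hypothetical.

```python
def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    max_gap = 0.0
    for x in sorted(set(a) | set(b)):
        # Advance each pointer past all values <= x to get the ECDF at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        max_gap = max(max_gap, abs(i / len(a) - j / len(b)))
    return max_gap

train = [1.0, 2.0, 3.0, 4.0, 5.0]            # feature values seen in training
similar = [1.5, 2.5, 3.5, 4.5, 5.5]          # incoming batch, similar range
shifted = [11.0, 12.0, 13.0, 14.0, 15.0]     # incoming batch after a shift

print(ks_statistic(train, similar))
print(ks_statistic(train, shifted))
```

A statistic near 0 means the two distributions look alike, while a value near 1 means they barely overlap. What counts as significant depends on the sample sizes.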

50. How can you handle data drift in a machine learning model?

Re-training the Model: Periodically retrain the model using updated data to ensure it adapts to the changing data distribution.

Monitoring and Logging: Implement robust monitoring and logging mechanisms to track model performance and detect data drift in real-time.

Ensemble Methods: Utilize ensemble techniques like model stacking, where multiple models are combined to make predictions, to reduce the impact of data drift on individual models.

Adaptive Learning: Implement adaptive learning techniques that allow the model to adjust its parameters continuously as new data arrives, ensuring it remains relevant in dynamic environments.

Re-calibration: Re-calibrate the model's probabilities or decision thresholds based on the current data distribution to better align with the changing context.

Transfer Learning: Transfer knowledge from the old model to a new model by initializing the new model with the weights and parameters of the old model. Fine-tuning can then be performed on the new data.

Retraining Window: Limit the time window for training data, focusing only on recent data, to capture recent trends and reduce the impact of older data.

**Data Leakage:**


51. What is data leakage in machine learning?

Data leakage in machine learning refers to the situation where information from the target variable or future data is unintentionally leaked into the training process, leading to overly optimistic performance metrics.


52. Why is data leakage a concern?

Data leakage is a significant concern because it can lead to the development of misleadingly accurate models. These models may perform well on the training and validation data, but they will likely fail to generalize to new, unseen data.

53. Explain the difference between target leakage and train-test contamination.

**Target Leakage:** Target leakage occurs when information that is only available after the target variable is determined is used during model training.

**Train-Test Contamination:** Train-test contamination occurs when data from the test (or validation) set influences the training process.

54. How can you identify and prevent data leakage in a machine learning pipeline?

Holdout Method: Use proper data splitting techniques like train-test split or cross-validation to ensure data used for training is distinct from data used for testing.

Time-Based Validation: In time series data, use a rolling window approach, where the training data comes before the test data, to avoid future information leaking into the training process.

Feature Engineering: Be cautious when engineering features to ensure no information from the target variable or the test set is used.

Data Transformation Order: Be mindful of the order in which data preprocessing and feature engineering steps are applied to prevent contamination.

Feature Selection: Perform feature selection based only on the training data to avoid using information from the test set.

Target Encoding: When using target encoding (e.g., mean encoding), compute the encodings from the training data only, ideally within each cross-validation fold. Computing them from the target values of the entire dataset leads to target leakage.
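
The point about preprocessing order can be illustrated with standardization: the scaling parameters are learned from the training split only and then reused on the test split. The helper names and values are hypothetical.

```python
import statistics

def fit_scaler(train_values):
    """Learn scaling parameters (mean, std) from the training split only."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)

def apply_scaler(values, params):
    """Standardize values with previously learned parameters."""
    mean, std = params
    return [(v - mean) / std for v in values]

values = [1.0, 2.0, 3.0, 4.0, 100.0, 120.0]  # hypothetical feature column
train, test = values[:4], values[4:]         # split BEFORE any fitting

params = fit_scaler(train)                   # no test rows are seen here
train_scaled = apply_scaler(train, params)
test_scaled = apply_scaler(test, params)

# Anti-pattern shown for contrast: statistics computed on all rows, so the
# test set influences how the training data is transformed.
leaky_params = fit_scaler(values)
print(params, leaky_params)
```

The same fit-on-train, apply-to-test pattern applies to imputation, feature selection, and target encoding.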

55. What are some common sources of data leakage?

Temporal Leakage: When working with time series data, using future information in training, such as using future events to predict past events.

Data Preprocessing: Applying transformations or imputations that use information from the entire dataset, including the test set.

Feature Engineering: Creating features based on the target variable or using future information to create features.

Target Encoding: Encoding categorical variables based on the target variable, which can introduce information leakage.

Human Error: Manually including data in the training set that should only be available during testing.

56. Give an example scenario where data leakage can occur.

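Data Leakage Scenario: Consider building a fraud detection model. If the dataset includes a feature indicating whether a transaction was already marked as fraudulent by a previous version of the fraud detection model, this information could be used to train a new model. Because such a flag is closely tied to the target and may not be available at prediction time, the new model can appear highly accurate during validation yet perform poorly on new transactions.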

**Cross Validation:**


57. What is cross-validation in machine learning?

Cross-validation in machine learning is a technique used to assess the performance of a predictive model by partitioning the available data into subsets or folds. The model is trained on a portion of the data (training set) and then evaluated on the remaining unseen data (validation set).


58. Why is cross-validation important?

Cross-validation is important for several reasons:

a. Reliable performance estimation: It provides a more robust and less biased estimate of the model's performance by reducing the risk of overfitting to a specific dataset.

b. Efficient use of data: Cross-validation allows the utilization of the available data for both training and testing, which is especially crucial when data is limited.

c. Hyperparameter tuning: It aids in optimizing the model's hyperparameters by allowing the evaluation of different parameter settings on multiple validation sets.

d. Model selection: Cross-validation helps compare and select the best-performing model among different algorithms or configurations.

59. Explain the difference between k-fold cross-validation and stratified k-fold cross-validation.

**a. K-fold cross-validation:** In k-fold cross-validation, the data is randomly divided into k subsets (or folds) of approximately equal size. Each fold is used as the validation set once while the remaining k-1 folds are used for training. This process is repeated k times, and the performance metrics are averaged.

**b. Stratified k-fold cross-validation:** Stratified k-fold cross-validation is used when the target variable has imbalanced class distributions. It ensures that each fold has a representative distribution of the target classes. In other words, the class proportions in each fold are maintained close to the overall class proportions in the entire dataset. This is particularly useful when one class is significantly smaller than the others, preventing bias in the performance evaluation.

In summary, k-fold cross-validation is random in its partitioning, while stratified k-fold cross-validation is mindful of class imbalances and ensures more consistent performance evaluation in such scenarios.
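
The difference between the two partitioning schemes can be sketched with index generators. This is a simplified illustration (round-robin dealing after a shuffle) rather than any particular library's implementation; the function names and labels are hypothetical.

```python
import random
from collections import defaultdict

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def stratified_k_fold_indices(labels, k, seed=0):
    """Deal each class's indices round-robin so every fold keeps the class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        for pos, i in enumerate(members):
            folds[pos % k].append(i)
    return folds

labels = [0] * 8 + [1] * 4  # hypothetical imbalanced labels (2:1)
folds = stratified_k_fold_indices(labels, k=4)
for fold in folds:
    # Each fold serves once as the validation set; the rest is training data.
    print("validation:", sorted(fold), "positives:", sum(labels[i] for i in fold))
```

With plain `k_fold_indices`, a fold could by chance contain no positive examples at all; the stratified version gives every fold exactly one positive out of three rows on this toy data.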

60. How do you interpret the cross-validation results?

Interpreting cross-validation results involves analyzing the model's performance metrics obtained during the cross-validation process. The main metrics typically used for evaluation depend on the nature of the problem:

a. Regression problems: Common metrics include Mean Squared Error (MSE), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), etc. Lower values indicate better performance.

b. Classification problems: Common metrics include Accuracy, Precision, Recall, F1-score, and Area Under the Receiver Operating Characteristic curve (AUC-ROC), among others. Higher accuracy and F1-score, and higher AUC-ROC values (closer to 1), indicate better performance.
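
Whatever the metric, the per-fold values are usually summarized by their average, and looking at their spread as well gives a sense of how stable the estimate is. A minimal sketch with invented per-fold accuracies:

```python
import statistics

fold_accuracy = [0.81, 0.79, 0.84, 0.80, 0.82]  # hypothetical per-fold scores

mean_score = statistics.fmean(fold_accuracy)
spread = statistics.stdev(fold_accuracy)  # sample standard deviation across folds
print(f"accuracy: {mean_score:.3f} +/- {spread:.3f}")
```

A large spread relative to the mean suggests that performance depends heavily on which data the model happened to see, so small differences between models should be read with caution.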