# **GENERAL LINEAR MODEL:**

1)

The purpose of the General Linear Model (GLM) is to analyze the relationship between a dependent variable and one or more independent variables. It is a flexible statistical framework that allows for the modeling of various types of data and can be used for prediction, inference, and hypothesis testing.

2)

The key assumptions of the General Linear Model include:

Linearity: The relationship between the dependent variable and the independent variables is assumed to be linear.

Independence: The observations are assumed to be independent of each other.

Homoscedasticity: The variance of the errors (residuals) is assumed to be constant across all levels of the independent variables.

Normality: The errors are assumed to follow a normal distribution. The assumption applies to the residuals, not to the raw dependent variable.

3)

The interpretation of coefficients in a GLM depends on the type of model and the coding scheme used for the predictors. In general, the coefficient represents the change in the mean of the dependent variable associated with a one-unit change in the corresponding independent variable, while holding other variables constant. The sign of the coefficient indicates the direction of the relationship. The magnitude indicates how much the mean response changes per unit of the predictor, so it depends on the predictor's scale; comparing the strength of different predictors requires standardized coefficients.

4)

A univariate GLM analyzes a single dependent variable with one or more independent variables, estimating the effect of each independent variable while adjusting for the others. In contrast, a multivariate GLM analyzes multiple dependent variables simultaneously using the same set of independent variables. It accounts for the correlations among the dependent variables while examining their relationships with the independent variables.

5)

Interaction effects in a GLM occur when the effect of one independent variable on the dependent variable depends on the level or presence of another independent variable. In other words, the relationship between the dependent variable and one independent variable is not constant across different levels of another independent variable. Interaction effects are represented by the interaction terms between the relevant independent variables in the GLM equation.

6)

Categorical predictors in a GLM can be handled through a process called "dummy coding" or "effect coding." Dummy coding involves creating binary variables, also known as dummy variables, to represent the different levels or categories of the categorical predictor. Each level/category is compared to a reference category, and the resulting dummy variables are included in the GLM as independent variables. Effect coding is another coding scheme that allows for comparing each level/category to the overall mean.
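Dummy coding as described above can be sketched in a few lines of plain Python. The function name `dummy_code`, the `reference` argument, and the group labels are illustrative assumptions, not part of any particular library.

```python
def dummy_code(values, reference):
    """Return (column_names, rows) of 0/1 indicators, omitting the reference level."""
    # One indicator column per non-reference level, in sorted order.
    levels = sorted(set(values) - {reference})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Hypothetical three-level predictor with "control" as the reference category.
group = ["control", "drug_a", "drug_b", "drug_a"]
names, X = dummy_code(group, reference="control")
print(names)  # ['drug_a', 'drug_b']
print(X)      # [[0, 0], [1, 0], [0, 1], [1, 0]]
```

The reference category is the row of all zeros, so each coefficient is read as a difference from that category. Effect coding would differ only in that reference rows receive -1 in every column.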

7)

The design matrix in a GLM holds the values of the predictors for every observation: each row corresponds to one observation and each column to one predictor. The columns typically include a column of ones for the intercept, the dummy-coded categorical predictors, and any interaction terms or transformations included in the model.
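As a rough illustration, a design matrix for two predictors and their interaction might be assembled as follows. The leading column of ones for the intercept is a common convention assumed here, and the names `x1`, `x2`, and `design_matrix` are hypothetical.

```python
def design_matrix(x1, x2):
    """One row per observation: [intercept, x1, x2, x1 * x2]."""
    return [[1.0, a, b, a * b] for a, b in zip(x1, x2)]


# Hypothetical data: x1 is continuous, x2 is a 0/1 dummy variable.
x1 = [2.0, 3.0, 5.0]
x2 = [0.0, 1.0, 1.0]
X = design_matrix(x1, x2)
for row in X:
    print(row)
```

The last column is the interaction term: it is zero whenever the dummy is zero, so its coefficient describes how the slope of `x1` changes when `x2` equals one.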

8)

The significance of predictors in a GLM can be tested using hypothesis tests, such as the t-test or F-test. The t-test is used to test the significance of individual coefficients/predictors, while the F-test is used to test the overall significance of a group of predictors or the model as a whole. The p-value associated with a test is the probability of obtaining results at least as extreme as those observed if the null hypothesis (no effect of the predictor) is true. If the p-value is below a pre-specified significance level (e.g., 0.05), the predictor is considered statistically significant.

9)

Type I, Type II, and Type III sums of squares are different methods for partitioning the sum of squares in a GLM when there are multiple predictors or terms in the model.

Type I (sequential) sums of squares assess the contribution of each term adjusted only for the terms entered before it. The order in which the predictors are entered into the model therefore affects the Type I sums of squares.

Type II sums of squares assess the contribution of each term after adjusting for all other terms that do not contain it, so a main effect is adjusted for the other main effects but not for interactions involving it. They are not affected by the order of predictor entry and are appropriate when interactions are absent or negligible.

Type III sums of squares assess the contribution of each term after adjusting for all other terms in the model, including interactions that contain it. They are not affected by the order of predictor entry and are commonly used for unbalanced designs in which interactions are present.

10)

Deviance is a goodness-of-fit measure from the generalized linear model, the extension of the general linear model to non-normal responses. It is defined as twice the difference between the log-likelihood of the saturated model (which fits every observation exactly) and that of the fitted model. It represents the lack of fit of the model to the data, and for a normal response with an identity link it reduces to the residual sum of squares (up to the scale factor). Lower deviance indicates a better fit of the model to the data. Deviance is used in the estimation of model parameters, hypothesis testing, comparison of nested models, and goodness-of-fit assessment.

# **REGRESSION:**

11)

Regression analysis is a statistical technique used to model the relationship between a dependent variable and one or more independent variables. It aims to find the best-fitting mathematical function or line that describes the relationship between the variables. The purpose of regression analysis is to understand and predict the values of the dependent variable based on the values of the independent variables.

12)

Simple linear regression involves a single independent variable and a dependent variable. It seeks to establish a linear relationship between the two variables. Multiple linear regression, on the other hand, involves two or more independent variables and a dependent variable. It aims to capture the combined effect of multiple independent variables on the dependent variable.
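For the single-predictor case, the least-squares line has a closed form. The sketch below uses it on made-up data; the function name `fit_simple_ols` and the data are illustrative assumptions.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data lying exactly on y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
print(fit_simple_ols(x, y))  # (1.0, 2.0)
```

Multiple linear regression generalizes this to several predictors and is usually solved with matrix methods rather than by hand.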

13)

The R-squared value, also known as the coefficient of determination, measures the proportion of the variance in the dependent variable that can be explained by the independent variables in a regression model. For a least-squares model with an intercept it ranges between 0 and 1, where 0 indicates that the independent variables do not explain any of the variability in the dependent variable, and 1 indicates that the independent variables explain all the variability. In general, a higher R-squared value indicates a better fit of the regression model to the data. Because R-squared never decreases when predictors are added, adjusted R-squared is preferred when comparing models with different numbers of predictors.
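R-squared can be computed directly from the residual and total sums of squares. A minimal sketch, with the function name `r_squared` and the data chosen only for illustration:

```python
def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot: share of the variance in y_true explained by y_pred."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y_true, y_true))                # perfect predictions
print(r_squared(y_true, [2.5, 2.5, 2.5, 2.5]))  # predicting the mean everywhere
print(r_squared(y_true, [1.5, 2.0, 3.0, 3.5]))  # approximately 0.9
```

Predicting the mean for every observation gives an R-squared of zero, which is why the mean serves as the baseline.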

14)

Correlation and regression are related but distinct concepts. Correlation measures the strength and direction of the linear relationship between two variables, while regression aims to model and predict the values of one variable based on the values of other variables. Correlation does not imply causation, and neither does regression by itself: regression coefficients support causal conclusions only under strong assumptions, such as a correctly specified model and no unmeasured confounding.

15)

In regression analysis, coefficients represent the estimated change in the dependent variable associated with a one-unit change in the corresponding independent variable, holding all other variables constant. These coefficients indicate the direction and magnitude of the relationship between the independent variables and the dependent variable. The intercept, also known as the constant term, represents the value of the dependent variable when all independent variables are zero.

16)

Outliers are extreme values that differ significantly from the overall pattern of the data. They can strongly influence regression analysis, especially when they also have high leverage, meaning unusual values of the independent variables. Dealing with outliers depends on the specific situation. Sometimes outliers are data errors and can be corrected or removed. Other times, the data can be transformed to make the fit less affected by outliers. There are also robust regression techniques, such as those based on Huber loss, that are less sensitive to outliers.

17)

Ridge regression and ordinary least squares (OLS) regression are both regression techniques, but they differ in how they estimate the regression coefficients. OLS regression chooses the coefficients that minimize the sum of squared differences between predicted and actual values. Ridge regression minimizes the same sum plus a penalty proportional to the sum of squared coefficients, which shrinks the coefficients toward zero and makes them more stable. Ridge regression is especially useful when dealing with multicollinearity, where variables are highly correlated.
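With a single centered predictor, the ridge estimate has a simple closed form that makes the shrinkage visible. The sketch below assumes the intercept is not penalized; the name `ridge_slope`, the penalty strength `lam`, and the data are illustrative.

```python
def ridge_slope(x, y, lam):
    """Slope minimizing sum((y_c - b * x_c)^2) + lam * b^2 on centered data.

    lam = 0 reproduces the OLS slope.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / (sxx + lam)


x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
for lam in (0.0, 5.0, 50.0):
    print(lam, ridge_slope(x, y, lam))
```

As `lam` grows, the slope moves from the OLS value toward zero without ever reaching it.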

18)

Heteroscedasticity refers to a situation where the variability of the residuals (the differences between predicted and actual values) is not constant across all levels of the independent variables. This violates an assumption of traditional linear regression, which assumes constant variance. Under heteroscedasticity the OLS coefficient estimates remain unbiased, but the usual standard errors, and therefore the tests and confidence intervals built on them, become unreliable. To address it, one can use heteroscedasticity-robust standard errors, weighted least squares, or a transformation of the variables (for example, taking the logarithm of the dependent variable) to achieve more consistent variability.

19)

Multicollinearity occurs when independent variables in a regression model are highly correlated with each other. It poses a problem because it becomes challenging to determine the individual effects of the correlated variables on the dependent variable. To handle multicollinearity, one can remove one of the correlated variables, combine them into a single variable, or use techniques like principal component analysis (PCA) to reduce the number of variables.

20)

Polynomial regression is a type of regression analysis that allows for nonlinear relationships between variables. It involves using polynomial functions to model the relationship between the independent and dependent variables. Unlike simple linear regression, polynomial regression can capture curved or nonlinear patterns in the data. The model is still linear in its coefficients, so it can be fitted with the same least-squares methods as multiple linear regression by treating powers of the independent variable as additional predictors. It is useful when the relationship between the variables cannot be adequately described by a straight line.

# **LOSS FUNCTION:**

21)

A loss function measures the discrepancy between predicted and actual values in machine learning. Its purpose is to evaluate model performance and guide parameter adjustments during learning.



22)

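A convex loss function is one where the line segment between any two points on its graph lies on or above the graph, so every local minimum is also a global minimum. A non-convex loss function can have multiple local minima and saddle points, which makes optimization harder.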

23)

Mean squared error (MSE) is a commonly used loss function that measures the average squared difference between the predicted and actual values. It is calculated by taking the average of the squared differences between each predicted and actual value.

24)

Mean absolute error (MAE) is a loss function that measures the average absolute difference between the predicted and actual values. It is calculated by taking the average of the absolute differences between each predicted and actual value.
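MSE and MAE are both simple averages and can be written in a few lines. The data below is made up so that one prediction is far off, which shows how differently the two losses react.

```python
def mse(y_true, y_pred):
    """Mean of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean of absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 19.0]  # the last prediction misses by 10
print(mse(y_true, y_pred))  # 25.5
print(mae(y_true, y_pred))  # 3.0
```

The single large error contributes 100 of the 102 units of squared error but only 10 of the 12 units of absolute error.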

25)

Log loss, also known as cross-entropy loss (binary cross-entropy in the two-class case), is a loss function commonly used in classification tasks. It quantifies the difference between the predicted class probabilities and the true class labels. For each example it is the negative logarithm of the predicted probability of the true class, and the loss is averaged over all examples.
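A minimal sketch of the binary case, assuming labels coded as 0 or 1 and predictions given as the probability of class 1. The clipping constant `eps` is an implementation detail added here to avoid taking the logarithm of zero.

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-probability assigned to the true class."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0]
print(binary_log_loss(labels, [0.5, 0.5]))  # uninformative guess: ln(2)
print(binary_log_loss(labels, [0.9, 0.1]))  # confident and right: small loss
print(binary_log_loss(labels, [0.1, 0.9]))  # confident and wrong: large loss
```

Confident wrong predictions are penalized much more heavily than hesitant ones, which is the property that pushes a classifier toward calibrated probabilities.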

26)

The choice of an appropriate loss function depends on the specific problem and the nature of the data. For example, MSE is often used in regression tasks where the goal is to minimize the average squared difference between predicted and actual values. For classification tasks, cross-entropy loss is commonly used to optimize the predicted probabilities of class labels.

27)

Regularization is a technique used to prevent overfitting in machine learning models. It involves adding a penalty term to the loss function that encourages the model to have simpler or more regular parameter values. This penalty helps to control the complexity of the model and prevent it from fitting the training data too closely.

28)

Huber loss combines squared loss and absolute loss, handling outliers by treating differences below a threshold as squared errors and above the threshold as absolute errors.
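The piecewise definition can be sketched as follows, using the common form that is quadratic up to a threshold `delta` and linear beyond it. The default `delta=1.0` is an arbitrary choice for illustration.

```python
def huber(residual, delta=1.0):
    """Huber loss of one residual: quadratic near zero, linear in the tails."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual ** 2
    # The linear branch is offset so the two pieces meet at |residual| = delta.
    return delta * (a - 0.5 * delta)


print(huber(0.5))   # 0.125 (squared-error regime)
print(huber(3.0))   # 2.5   (absolute-error regime)
print(huber(-3.0))  # 2.5   (symmetric in the sign of the residual)
```

At the threshold both branches give `0.5 * delta ** 2`, so the loss is continuous.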

29)

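Quantile loss, also called pinball loss, is used to predict a specific quantile (percentile) of the target variable rather than its mean. For a target quantile q, it weights under-predictions by q and over-predictions by 1 - q, so minimizing it yields an estimate of the q-th quantile.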
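Quantile loss is commonly written as the pinball loss. A small sketch for a single prediction, where `q` is the target quantile between 0 and 1 (the function name is illustrative):

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball loss: under-predictions cost q per unit, over-predictions 1 - q."""
    diff = y_true - y_pred
    return q * diff if diff >= 0 else (q - 1.0) * diff


# Targeting the 90th percentile: predicting too low is penalized more.
print(pinball_loss(10.0, 8.0, q=0.9))   # under-prediction by 2
print(pinball_loss(10.0, 12.0, q=0.9))  # over-prediction by 2
```

With `q = 0.5` the two penalties are equal and the loss is half the absolute error, which is why the median minimizes MAE.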

30)

Squared loss (MSE) squares the differences between predicted and actual values, so large errors dominate and the loss is sensitive to outliers. Absolute loss (MAE) penalizes errors in proportion to their size, which makes it more robust to outliers.

# **Optimizer (GD):**

31)

In machine learning, an optimizer refers to an algorithm or method used to adjust the parameters of a model in order to minimize the error or loss function. The purpose of an optimizer is to find the optimal set of parameters that can make the model perform better on the given task or problem.

32)

Gradient Descent (GD) is an optimization algorithm that iteratively adjusts the parameters of a model to minimize a loss function. It works by calculating the gradients of the loss function with respect to the parameters and updating the parameters in the direction of steepest descent.
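The update rule can be shown on a one-parameter problem. This is a sketch on a toy quadratic loss, not a general-purpose optimizer; the function names and the loss are chosen for illustration.

```python
def gradient_descent(grad, start, learning_rate, steps):
    """Repeatedly step against the gradient: w <- w - learning_rate * grad(w)."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w


# Toy loss (w - 3)^2 with gradient 2 * (w - 3); its minimum is at w = 3.
def toy_grad(w):
    return 2.0 * (w - 3.0)


w_min = gradient_descent(toy_grad, start=0.0, learning_rate=0.1, steps=100)
print(w_min)  # very close to 3
```

Each step here multiplies the distance to the minimum by 0.8, so the iterate converges geometrically.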

33)

Variations of Gradient Descent include Batch Gradient Descent (GD), Stochastic Gradient Descent (SGD), and Mini-batch Gradient Descent. Each variation differs in the amount of training data used to compute the gradients and update the parameters.

34)

The learning rate in GD determines the step size taken in each parameter update. Choosing an appropriate value involves experimentation and balancing between convergence speed and stability. Techniques like manual tuning, learning rate schedules, and adaptive methods can be used to select the learning rate.

35)

GD may struggle with local optima. Strategies to address this include exploring different initial parameter values, using adaptive methods, and trying different architectures to improve the chances of finding a global minimum.

36)

Stochastic Gradient Descent (SGD) is a variation of GD where the parameters are updated using the gradients computed for individual training examples. It differs from GD, which uses the gradients computed on the entire dataset, by being more computationally efficient but introducing higher variance.

37)

Batch size in GD refers to the number of training examples used to compute the gradients and update the parameters in each iteration. The choice of batch size impacts training in terms of computational efficiency, convergence speed, and generalization. Larger batch sizes can result in smoother convergence but require more memory.

38)

Momentum in optimization algorithms, including GD, introduces a parameter that accumulates a fraction of the previous parameter updates. It helps accelerate convergence, especially in the presence of high curvature, plateaus, or noisy gradients.
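One common formulation (the classical "heavy ball" update) keeps a velocity that blends the previous update with the current gradient step. The sketch below assumes that formulation; `beta` is the momentum coefficient, and setting it to zero recovers plain gradient descent.

```python
def gd_momentum(grad, start, learning_rate, beta, steps):
    """Gradient descent with momentum; beta = 0 gives plain gradient descent."""
    w, velocity = start, 0.0
    for _ in range(steps):
        # Carry over a fraction of the last update, then add the new gradient step.
        velocity = beta * velocity - learning_rate * grad(w)
        w = w + velocity
    return w


# Toy loss (w - 3)^2 with a deliberately small learning rate.
def toy_grad(w):
    return 2.0 * (w - 3.0)


plain = gd_momentum(toy_grad, 0.0, learning_rate=0.01, beta=0.0, steps=100)
with_momentum = gd_momentum(toy_grad, 0.0, learning_rate=0.01, beta=0.9, steps=100)
print(abs(plain - 3.0), abs(with_momentum - 3.0))
```

With the same small learning rate and the same number of steps, the momentum run ends much closer to the minimum on this toy problem.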

39)

Batch GD uses the entire training dataset to compute gradients and update parameters in each iteration. Mini-batch GD uses a subset (mini-batch) of the data, striking a balance between accuracy and efficiency. SGD updates parameters using gradients computed for individual training examples, leading to faster but more noisy convergence.
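The three variants differ only in how many examples feed each update, which a single batching helper can express. The helper name `iterate_minibatches` and the fixed seed are assumptions made for this sketch.

```python
import random


def iterate_minibatches(data, batch_size, seed=0):
    """Yield shuffled batches; the last batch may be smaller."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]


data = list(range(10))
print(len(list(iterate_minibatches(data, batch_size=10))))  # 1 update per epoch (batch GD)
print(len(list(iterate_minibatches(data, batch_size=4))))   # 3 updates per epoch (mini-batch)
print(len(list(iterate_minibatches(data, batch_size=1))))   # 10 updates per epoch (SGD)
```

A training loop would compute the gradient on each yielded batch and apply one parameter update per batch.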

40)

The learning rate affects the convergence of GD. A larger learning rate may cause instability, overshooting the minimum, while a smaller learning rate may result in slower convergence. Selecting an appropriate learning rate is essential to balance convergence speed and stability during training.
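The effect is easy to reproduce on a toy quadratic. The loss, the three learning rates, and the step count below are arbitrary choices made for illustration.

```python
def run_gd(learning_rate, steps=20, start=0.0):
    """Minimize (w - 3)^2 by gradient descent and return the final w."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2.0 * (w - 3.0)
    return w


for lr in (0.01, 0.1, 1.1):
    print(lr, abs(run_gd(lr) - 3.0))  # distance from the minimum after 20 steps
```

The smallest rate is still far from the minimum after 20 steps, the middle rate is close, and the largest rate overshoots by more each step and diverges.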

# **Regularization:**

41)

Regularization is a technique used in machine learning to prevent overfitting and improve the generalization ability of a model. It involves adding a penalty term to the loss function during training, which encourages the model to learn simpler and more robust representations.

42)

L1 and L2 regularization are two common types of regularization techniques. L1 regularization, also known as Lasso regularization, adds the absolute value of the parameter coefficients to the loss function. L2 regularization, also known as Ridge regularization, adds the squared magnitude of the parameter coefficients to the loss function. L1 regularization encourages sparsity by driving some coefficients to exactly zero, while L2 regularization encourages small weights but does not force them to be exactly zero.
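The difference in behaviour shows up already in one dimension. The sketch below solves two small penalized problems in closed form for an unpenalized estimate `z`; the function names and the penalty strength `lam` are illustrative assumptions.

```python
def l1_shrink(z, lam):
    """Minimizer of 0.5 * (w - z)^2 + lam * |w| (soft-thresholding)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small coefficients are set exactly to zero


def l2_shrink(z, lam):
    """Minimizer of 0.5 * (w - z)^2 + lam * w^2 (proportional shrinkage)."""
    return z / (1.0 + 2.0 * lam)


for z in (0.3, 2.0, -2.0):
    print(z, l1_shrink(z, 0.5), l2_shrink(z, 0.5))
```

The L1 solution maps the small coefficient 0.3 to exactly zero, while the L2 solution only scales it down.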

43)

Ridge regression is a linear regression technique that uses L2 regularization. It adds the squared magnitude of the parameter coefficients multiplied by a regularization parameter to the loss function. Ridge regression shrinks the parameter estimates towards zero, reducing their magnitude and making them less sensitive to changes in the input data. It helps prevent overfitting by controlling the complexity of the model.

44)

Elastic net regularization is a combination of L1 and L2 regularization. It adds both the L1 and L2 penalty terms to the loss function with different weightings controlled by a hyperparameter. Elastic net regularization is useful when there are many correlated features and it encourages both feature selection (by driving some coefficients to zero) and grouping of correlated features (by maintaining groups of nonzero coefficients).

45)

Regularization helps prevent overfitting by adding a penalty for complexity to the loss function. By encouraging simpler models with smaller parameter values, regularization reduces the model's ability to fit noise or outliers in the training data. It helps the model generalize better to unseen data and reduces the likelihood of overfitting, where the model performs well on the training data but poorly on new data.

46)

Early stopping is a technique related to regularization that helps prevent overfitting. Instead of training the model for a fixed number of iterations or epochs, early stopping monitors the model's performance on a validation set during training. Training is stopped when the validation loss starts to increase, indicating that the model's generalization ability is deteriorating.
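In practice the stopping rule usually tolerates a few non-improving epochs before halting. The sketch below assumes such a `patience` parameter, which the description above does not mention, and works on a precomputed list of validation losses instead of a real training loop.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the best epoch seen before training is halted."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch


# Hypothetical validation losses: improvement stalls after epoch 2.
losses = [0.9, 0.7, 0.6, 0.65, 0.7, 0.5]
print(early_stopping_epoch(losses, patience=2))  # 2
```

Training halts at epoch 4, so the later loss of 0.5 is never observed. That is the price of a short patience, and the model saved at the best epoch is typically the one restored.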

47)

Dropout regularization is a technique commonly used in neural networks. During training, dropout randomly sets a fraction of the neuron outputs to zero at each update, effectively "dropping out" those neurons. This helps prevent co-adaptation of neurons and reduces overfitting by making the network more robust and less reliant on specific neurons.
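A minimal sketch of the widely used "inverted" variant, in which surviving activations are scaled up during training so that no rescaling is needed at inference time. The function signature is an assumption made for illustration and does not follow any particular framework.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate`; scale survivors by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return list(activations)  # dropout is disabled at inference time
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
out = dropout([1.0] * 1000, rate=0.5, rng=rng)
print(out.count(0.0))        # roughly half of the units are dropped
print(sum(out) / len(out))   # the mean activation stays close to 1.0
```

The rescaling keeps the expected activation unchanged, so downstream layers see inputs of similar magnitude in training and inference.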

48)

The regularization parameter determines the strength of the regularization penalty applied during training. It controls the trade-off between fitting the training data and keeping the model's parameters small. The value of the regularization parameter is typically chosen through techniques like cross-validation or grid search, where multiple values are tested, and the one that yields the best performance on a validation set is selected.

49)

Feature selection and regularization are related but distinct concepts. Feature selection refers to the process of choosing a subset of relevant features from the available set of features. It aims to improve model performance by reducing the dimensionality and removing irrelevant or redundant features. Regularization, on the other hand, adds a penalty term to the loss function to control the complexity of the model and prevent overfitting. While feature selection can be achieved through regularization methods like L1 regularization (Lasso), regularization itself focuses on controlling the parameter values rather than explicitly selecting features.

50)

Regularized models trade bias against variance. Adding a regularization penalty to the loss function reduces the model's effective complexity, which increases bias (the model can no longer fit the training data exactly) but reduces variance (the fitted model changes less from one training sample to another). The regularization strength is chosen to balance the two, leading to better generalization performance on unseen data.


# **SVM:**

51)

Support Vector Machines (SVM) is a supervised machine learning algorithm used for classification and regression tasks. It works by finding an optimal hyperplane that separates the data points of different classes or predicts the value of a target variable.

52)

The kernel trick is a technique used in SVM to work implicitly in a higher-dimensional feature space. A kernel function computes the inner product between two data points as if they had been mapped into that space, without explicitly calculating the transformed feature vectors. This allows SVM to learn non-linear decision boundaries in the original input space, because a linear boundary in the feature space corresponds to a non-linear one in the input space.
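The idea can be checked numerically for a simple case. For two-dimensional inputs, the degree-2 polynomial kernel `(x . z)^2` equals an ordinary dot product after an explicit three-dimensional feature map. The function names below are illustrative.

```python
import math


def poly_kernel(x, z):
    """k(x, z) = (x . z)^2, computed entirely in the 2-D input space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def feature_map(x):
    """Explicit 3-D map whose dot product reproduces poly_kernel."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


x, z = (1.0, 2.0), (3.0, 4.0)
print(poly_kernel(x, z))                    # kernel value from the input space
print(dot(feature_map(x), feature_map(z)))  # same value via the explicit map
```

Both lines give 121 up to floating-point rounding. For kernels such as the RBF kernel the corresponding feature space is infinite-dimensional, so the implicit route is the only practical one.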

53)

Support vectors in SVM are the data points that lie closest to the decision boundary. They are the critical elements that determine the location and orientation of the decision boundary. These points support the construction of the boundary and hence are called support vectors. They are important because the decision boundary is completely determined by them, and changing their positions could affect the boundary and classification results.

54)

The margin in SVM refers to the separation between the decision boundary and the nearest data points from each class. SVM aims to maximize this margin because a larger margin generally leads to better generalization and robustness of the model. By maximizing the margin, SVM can achieve better classification performance and improved resistance to noise or outliers.

55)

Handling unbalanced datasets in SVM can be done by adjusting the class weights or using techniques such as undersampling or oversampling. One approach is to assign higher weights to the minority class to increase its importance during training. Another approach is to balance the dataset by undersampling the majority class or oversampling the minority class to create an equal representation of classes.

56)

Linear SVM finds a linear decision boundary in the input space, which works well when the data is linearly separable or nearly so. Non-linear SVM, on the other hand, uses the kernel trick to work implicitly in a higher-dimensional space where a linear decision boundary can be found. By using different kernel functions, non-linear SVM can model complex decision boundaries that a linear boundary in the original input space cannot represent.

57)

The C-parameter in SVM is a regularization parameter that controls the trade-off between achieving a larger margin and allowing misclassifications. A smaller value of C creates a wider margin but allows more misclassifications, leading to a more tolerant model. A larger value of C tries to minimize the number of misclassifications, potentially resulting in a smaller margin and a more strict model.

58)

Slack variables in SVM are introduced to handle cases where the data points are not linearly separable. They allow some training examples to be on the wrong side of the margin or misclassified. The slack variables represent the degree of misclassification or violation of the margin. By allowing some slack, SVM can find a compromise between maximizing the margin and minimizing misclassifications.
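In the standard soft-margin formulation, the slack of a training point at the optimum equals its hinge loss. A small sketch for a linear SVM with labels coded as -1 or +1; the weights `w`, bias `b`, and points are made up.

```python
def slack(w, b, x, y):
    """Slack xi = max(0, 1 - y * (w . x + b)) for a label y in {-1, +1}."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)


w, b = (1.0, 0.0), 0.0
print(slack(w, b, (2.0, 0.0), +1))   # 0.0: outside the margin, no violation
print(slack(w, b, (0.5, 0.0), +1))   # 0.5: correct side but inside the margin
print(slack(w, b, (-1.0, 0.0), +1))  # 2.0: misclassified
```

A slack between 0 and 1 marks a margin violation without misclassification, and a slack above 1 marks a misclassified point.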

59)

In SVM, hard margin refers to the case where no misclassifications are allowed. It assumes that the data is perfectly separable by a hyperplane. Soft margin, on the other hand, allows some misclassifications by introducing slack variables. Soft margin SVM is more flexible and can handle cases where the data points are not linearly separable or when there are outliers.

60)

An SVM model has two kinds of coefficients. In a linear SVM, the weights (one per feature) define the orientation of the decision boundary, and the magnitude of a weight reflects the influence of the corresponding feature. In the dual (kernel) form, the dual coefficients (one per support vector) indicate the contribution of each support vector to the decision function: their sign corresponds to the class label of the support vector, and their magnitude reflects its influence. Training examples that are not support vectors have a dual coefficient of zero.

# **Decision Trees:**

61)

A decision tree is a supervised machine learning algorithm that makes predictions by recursively partitioning the data based on feature values. It works by creating a tree-like structure of decision nodes and leaf nodes, where each decision node tests a feature (for example, whether its value is below a threshold) and each leaf node represents a prediction or a class label.

62)

Splits in a decision tree are made by selecting a feature and a threshold value to divide the data into subsets. The goal is to find the splits that maximize the separation of classes or minimize the impurity within each subset.

63)

Impurity measures, such as the Gini index and entropy, quantify the disorder or uncertainty of class labels in a subset. They are used in decision trees to evaluate the quality of splits and determine the optimal feature and threshold value for partitioning the data.
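Both measures depend only on the class proportions in a subset. A short sketch, with entropy measured in bits; the labels are hypothetical.

```python
import math
from collections import Counter


def gini(labels):
    """Gini index: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


mixed = ["yes", "yes", "no", "no"]
pure = ["yes", "yes", "yes", "yes"]
print(gini(mixed), entropy(mixed))  # maximally impure for two classes
print(gini(pure), entropy(pure))    # zero impurity
```

For two classes the Gini index peaks at 0.5 and the entropy at 1 bit, both when the classes are evenly mixed.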

64)

Information gain measures the reduction in entropy or impurity achieved by splitting the data based on a particular feature. It quantifies how much information is gained by considering that feature for splitting. The feature with the highest information gain is chosen as the splitting criterion.
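Information gain is the parent's entropy minus the size-weighted entropy of the children. A minimal sketch for a binary split, with hypothetical labels:

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


parent = ["yes", "yes", "no", "no"]
print(information_gain(parent, ["yes", "yes"], ["no", "no"]))  # perfect split
print(information_gain(parent, ["yes", "no"], ["yes", "no"]))  # uninformative split
```

The perfect split recovers the full 1 bit of parent entropy, while the uninformative split gains nothing.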

65)

Missing values in decision trees can be handled by several strategies: imputing the missing value from statistical measures such as the mean, median, or mode; treating missing values as a separate category during splitting; using surrogate splits on correlated features; or sending instances with a missing value down the branch that most training instances follow.

66)

Pruning is a technique in decision trees that reduces the complexity of the tree by removing unnecessary branches. It helps prevent overfitting and improves the model's generalization by simplifying the decision boundaries and reducing noise or outliers' impact on the tree.

67)

A classification tree is used for predicting categorical or discrete class labels, while a regression tree is used for predicting continuous numerical values. Classification trees predict the majority class of the training instances in a leaf, while regression trees predict the average of the target values in a leaf.

68)

Decision boundaries in a decision tree are interpreted as the feature conditions that determine the path to take from the root to a specific leaf node. These conditions represent the rules for making predictions or classifying instances based on the values of the input features.

69)

Feature importance in decision trees measures the significance of each feature in the model's predictive power. It quantifies the relative contribution of each feature in splitting the data and making accurate predictions. It helps identify the most influential features in the decision-making process.

70)

Ensemble techniques combine multiple decision trees to improve predictive performance. Random Forest and Gradient Boosting are popular ensemble methods that utilize decision trees as base learners. They leverage the diversity and aggregation of multiple trees to achieve better accuracy and robustness in predictions.

# **Ensemble Techniques:**

71)

Ensemble techniques in machine learning combine multiple models to make predictions or solve a problem collectively. They leverage the diversity and aggregation of multiple models to improve accuracy and robustness.


72)

Bagging (Bootstrap Aggregating) is an ensemble technique where multiple models are trained on different bootstrap samples of the training data. Each model is trained independently, and their predictions are combined by averaging or voting to make the final prediction.

73)

Bootstrapping in bagging involves randomly sampling the training data with replacement to create multiple bootstrap samples. These samples are used to train individual models in the ensemble, allowing each model to have slightly different training data.
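Drawing a bootstrap sample takes one line with the standard library. The sketch below uses a seeded generator for repeatability; the function name is illustrative.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]


rng = random.Random(0)
data = list(range(100))
sample = bootstrap_sample(data, rng)
print(len(sample))       # same size as the original data
print(len(set(sample)))  # fewer distinct items, because some repeat
```

On average about 63% of the original items appear in a bootstrap sample. The remaining "out-of-bag" items are often used to estimate the error of the model trained on that sample.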

74)

Boosting is an ensemble technique that iteratively trains weak models to improve their performance. Models are trained sequentially, and each subsequent model focuses on correcting the mistakes made by the previous models. The final prediction is made by aggregating the predictions of all the models.

75)

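AdaBoost (Adaptive Boosting) and Gradient Boosting are both boosting algorithms. AdaBoost reweights the training instances after each round, increasing the weights of instances that the previous models misclassified so that the next weak model focuses on them. Gradient Boosting instead fits each new weak model to the negative gradient of a loss function with respect to the current predictions (the pseudo-residuals), which amounts to gradient descent in function space.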
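One round of instance reweighting in the classic binary AdaBoost formulation can be sketched as follows. It assumes labels and predictions coded as -1 or +1 and a weighted error strictly between 0 and 1; the function name is illustrative.

```python
import math


def adaboost_reweight(weights, y_true, y_pred):
    """Return (alpha, new_weights) after one round of binary AdaBoost."""
    # Weighted error of the current weak learner (assumed to satisfy 0 < err < 1).
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1.0 - err) / err)  # vote strength of this learner
    # t * p is +1 for correct predictions and -1 for mistakes.
    updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


weights = [0.25, 0.25, 0.25, 0.25]
y_true = [1, 1, -1, -1]
y_pred = [1, 1, -1, 1]  # only the last instance is misclassified
alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
print(alpha)
print(new_weights)
```

After the update the misclassified instance carries half of the total weight, so the next weak learner cannot do well without getting it right.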

76)

Random forests are an ensemble technique that combines multiple decision trees. They introduce randomness in two ways: by training each tree on a bootstrap sample of the training data (sampled with replacement) and by considering only a random subset of the features at each split. Averaging many such decorrelated trees reduces overfitting and provides robust predictions.

77)

Random forests calculate feature importance based on how much each feature reduces the impurity or error in the model. By measuring the average reduction in impurity over all trees, they determine the importance of each feature in making accurate predictions.

78)

Stacking in ensemble learning involves training multiple models on the training data and using their predictions as inputs for a meta-model. The meta-model is trained to make the final prediction by learning from the predictions of the base models. To avoid leakage, it is usually trained on out-of-fold predictions, that is, predictions the base models made for data they were not trained on. Stacking combines the strengths of different models and learns to weigh their predictions effectively.


79)

The advantages of ensemble techniques include improved predictive accuracy, robustness to noise and outliers, and the ability to handle complex relationships. However, they can be computationally expensive, prone to overfitting if not properly tuned, and challenging to interpret compared to individual models.

80)

The optimal number of models in an ensemble depends on the dataset, the complexity of the problem, and the trade-off between computational resources and performance. It can be determined through techniques like cross-validation, monitoring performance on a validation set, or using early stopping criteria based on the performance plateau.