# General Linear Model

**Q1. What is the purpose of the General Linear Model (GLM)?**

**Ans** : The General Linear Model (GLM) is a flexible statistical framework used for analyzing the relationships between variables. It serves the purpose of modeling and understanding the relationships between one or more independent variables and a dependent variable. The GLM can handle a wide range of data types and distributions, making it applicable to various types of data analysis scenarios.

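Strictly speaking, the general linear model extends linear regression to cover multiple regression, ANOVA (Analysis of Variance), and ANCOVA, all with a continuous outcome and normally distributed errors. Techniques such as logistic regression and Poisson regression belong to the generalized linear model, a further extension (also abbreviated GLM) that adds a link function and non-normal error distributions so that binary, count, and categorical outcomes can be modeled as well.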

**Q2. What are the key assumptions of the General Linear Model?**

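**Ans** : The key assumptions of the General Linear Model are:

- Linearity: the dependent variable is a linear function of the predictors (linear in the parameters).
- Independence: the errors are independent of one another.
- Homoscedasticity: the errors have constant variance across all levels of the predictors.
- Normality: the errors are normally distributed.
- No perfect multicollinearity: no predictor is an exact linear combination of the others.

It is important to check these assumptions before interpreting the results of the GLM. Violations of these assumptions may lead to biased estimates, invalid inferences, or incorrect interpretations. Various diagnostic techniques and statistical tests can be employed to assess the assumptions and address any violations if necessary.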

**Q3. How do you interpret the coefficients in a GLM?**

**Ans** : In a General Linear Model (GLM), the coefficients represent the estimated effect or impact of the independent variables on the dependent variable. The interpretation of the coefficients depends on the type of GLM and the scale of the variables involved. Here are some common guidelines for interpreting coefficients in a GLM:

Continuous Independent Variables:

- Positive Coefficient: A positive coefficient indicates that an increase in the independent variable is associated with an increase in the dependent variable, holding other variables constant. The magnitude of the coefficient represents the amount of change in the dependent variable for a one-unit increase in the independent variable.
- Negative Coefficient: A negative coefficient indicates that an increase in the independent variable is associated with a decrease in the dependent variable, holding other variables constant.

Binary Independent Variables:

- Sign and magnitude: For a binary independent variable, such as a 0/1 dummy variable representing two groups, the coefficient is the estimated difference in the dependent variable between the group coded 1 and the reference group, holding other variables constant. A positive coefficient means the coded group has a higher expected value than the reference group; a negative coefficient means a lower one.
- Coefficient near 0: A coefficient close to 0 suggests little difference in the dependent variable between the two groups. Whether a difference is statistically significant depends on the coefficient's standard error and p-value, not on its magnitude alone.

Categorical Independent Variables:

- Coefficient per Category: When using categorical variables with more than two categories, the coefficients represent the difference in the dependent variable for each category compared to a reference category. The coefficient for each category represents the average change in the dependent variable when moving from the reference category to that specific category, holding other variables constant.

**Q4. What is the difference between a univariate and multivariate GLM?**

Ans : The main distinction between univariate and multivariate GLMs is that the former analyzes a single dependent variable, while the latter examines multiple dependent variables simultaneously. The choice between a univariate and multivariate approach depends on the research objectives and the nature of the data being analyzed.

**Q5. Explain the concept of interaction effects in a GLM.**

Ans : In a General Linear Model (GLM), interaction effects refer to the combined effect of two or more independent variables on the dependent variable that is different from the sum of their individual effects. In other words, an interaction effect occurs when the relationship between an independent variable and the dependent variable changes based on the level or presence of another independent variable.

The presence of an interaction effect suggests that the effect of one independent variable on the dependent variable depends on the level or condition of another independent variable. This indicates that the relationship between the independent variables and the dependent variable is not simply additive, but rather there is a synergistic or modifying effect between the variables.

To better understand interaction effects, let's consider an example with two independent variables, $X_1$ and $X_2$, and a dependent variable, $Y$. Suppose $X_1$ represents the level of education (e.g., high school vs. college), and $X_2$ represents the level of work experience (e.g., low vs. high). The presence of an interaction effect implies that the effect of education ($X_1$) on the dependent variable ($Y$) differs depending on the level of work experience ($X_2$), or vice versa.

Interpreting interaction effects involves considering the main effects (independent variables' effects individually) and the interaction term (product of the independent variables).

Understanding and interpreting interaction effects is crucial as they can reveal more nuanced relationships and help identify conditional effects. Graphical visualization, such as interaction plots or slope plots, can be helpful in illustrating and interpreting these effects, allowing for a deeper understanding of the relationship between the variables in a GLM.
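As a minimal sketch of an interaction term, the example below fits a model with two main effects and their product using NumPy least squares. The data are hypothetical and noise-free, the outcome is called `salary` purely for illustration, and the coefficient values (30, 10, 5, 8) are assumptions chosen so the fit can be checked by eye.

```python
import numpy as np

# Hypothetical data: education (0 = high school, 1 = college) and
# experience (0 = low, 1 = high), two observations per combination.
education = np.array([0, 0, 1, 1, 0, 0, 1, 1], dtype=float)
experience = np.array([0, 1, 0, 1, 0, 1, 0, 1], dtype=float)

# Assumed true model: Y = 30 + 10*edu + 5*exp + 8*(edu*exp), no noise.
salary = 30 + 10 * education + 5 * experience + 8 * education * experience

# Design matrix: intercept, two main effects, and the product (interaction) term.
X = np.column_stack([
    np.ones_like(education),
    education,
    experience,
    education * experience,
])

coef, *_ = np.linalg.lstsq(X, salary, rcond=None)
b0, b_edu, b_exp, b_inter = coef

# The effect of education depends on the level of experience.
effect_low_exp = b_edu             # education effect when experience = 0
effect_high_exp = b_edu + b_inter  # education effect when experience = 1

print(coef)
print(effect_low_exp, effect_high_exp)
```

Because the interaction coefficient is non-zero, the education effect is larger in the high-experience group than in the low-experience group, which is exactly the non-additive pattern described above.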

**Q6. How do you handle categorical predictors in a GLM?**

Ans : Handling categorical predictors in a General Linear Model (GLM) requires converting the categorical variables into a suitable format that can be used in the model. The approach for handling categorical predictors depends on the type of categorical variable (nominal or ordinal) and the software or library being used for the analysis. Here are two common approaches:

Dummy Coding:

- Dummy coding is used for nominal categorical variables where there is no inherent order or ranking among the categories.
- In this approach, each category of the categorical variable is represented by a binary (0/1) dummy variable.
- One category is chosen as the reference or baseline category, and the remaining categories are encoded as separate binary variables.
- The reference category is typically omitted from the model to avoid multicollinearity.
- For example, if the categorical variable is "Color" with categories "Red," "Green," and "Blue," the dummy coding would create two dummy variables: "IsGreen" and "IsBlue." A value of 1 in "IsGreen" indicates the presence of the "Green" category, while a value of 0 indicates its absence.

Ordinal Encoding:

- Ordinal encoding is used for ordinal categorical variables where there is a specific order or ranking among the categories.
- In this approach, the categories are assigned numerical codes based on their order or rank.
- The numerical codes reflect the relative magnitude or position of the categories.
- For example, if the categorical variable is "Education" with categories "High School," "College," and "Graduate School," they could be encoded as 1, 2, and 3, respectively.

After encoding the categorical predictors, they can be included in the GLM along with the continuous predictors. The regression coefficients associated with the categorical predictors represent the differences in the dependent variable between the respective categories, compared to the reference or baseline category.

It's important to note that the choice of encoding scheme and the reference category can affect the interpretation of the coefficients and the statistical results. Additionally, software packages or libraries may have built-in functions or methods to handle categorical predictors automatically. Therefore, it is recommended to consult the documentation or user guide specific to the software or library being used for GLM analysis.
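The two encodings can be sketched in plain Python as follows. The helper names `dummy_code` and `ordinal_encode` are hypothetical and written only for this illustration; statistical libraries usually provide their own equivalents, which may order columns or pick the reference category differently.

```python
def dummy_code(values, reference):
    """Return (column_names, rows) of 0/1 indicators, omitting the reference level."""
    levels = sorted(set(values) - {reference})
    names = [f"Is{level}" for level in levels]
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return names, rows


def ordinal_encode(values, ordered_levels):
    """Map each category to its 1-based rank in ordered_levels."""
    rank = {level: position + 1 for position, level in enumerate(ordered_levels)}
    return [rank[value] for value in values]


# Nominal variable: "Red" is the reference category and gets no column.
colors = ["Red", "Green", "Blue", "Green"]
names, rows = dummy_code(colors, reference="Red")
print(names)  # ['IsBlue', 'IsGreen']
print(rows)   # [[0, 0], [0, 1], [1, 0], [0, 1]]

# Ordinal variable: codes follow the stated order of the categories.
education = ["College", "High School", "Graduate School"]
codes = ordinal_encode(education, ["High School", "College", "Graduate School"])
print(codes)  # [2, 1, 3]
```

A row of all zeros in the dummy columns identifies the reference category, which is why no separate "IsRed" column is needed.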

**Q7. What is the purpose of the design matrix in a GLM?**

Ans : The design matrix, also known as the model matrix or the predictor matrix, plays a crucial role in a General Linear Model (GLM). It serves the purpose of organizing and representing the independent variables or predictors in a structured format that can be used for statistical analysis.

The design matrix is a rectangular matrix where each row corresponds to an observation or data point, and each column represents a specific predictor variable, including both continuous and categorical variables. The design matrix allows the GLM to model and analyze the relationships between the predictors and the dependent variable.

By representing the predictors in the design matrix, the GLM can estimate the regression coefficients or parameters associated with each predictor. The design matrix is used to formulate the mathematical equations and perform the model estimation and statistical inference in the GLM. It enables the GLM to analyze the relationships between the predictors and the dependent variable, determine the significance of the predictors, and make predictions or inference based on the fitted model.

The design matrix is a fundamental component of the GLM and serves as the basis for conducting various statistical analyses, such as hypothesis testing, parameter estimation, and model evaluation.
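To make the idea concrete, here is a small sketch that builds a design matrix with NumPy and estimates the coefficients by least squares. The data, the variable names, and the coefficient values (2.0, 0.5, 3.0) are assumptions made for illustration, and the outcome is generated without noise so the estimates can be verified exactly.

```python
import numpy as np

# Hypothetical data: one continuous predictor and one 0/1 group indicator.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
group = np.array([0, 0, 0, 1, 1, 1], dtype=float)
y = 2.0 + 0.5 * x + 3.0 * group  # noise-free, so the fit is exact

# One row per observation; columns are the intercept, x, and the group dummy.
X = np.column_stack([np.ones(len(x)), x, group])

# Least-squares estimate of the coefficient vector (minimizes ||y - X b||^2).
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta  # fitted values

print(X.shape)
print(beta)
```

Each column of `X` corresponds to one entry of `beta`, which is why the design matrix determines both what can be estimated and how the estimates are interpreted.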

**Q8. How do you test the significance of predictors in a GLM?**

Ans : To test the significance of predictors in a General Linear Model (GLM), you can use hypothesis testing, specifically by examining the p-values associated with each predictor's coefficient. The p-value represents the probability of observing a coefficient at least as extreme as the estimated value, assuming the null hypothesis (that the true coefficient is zero) is true.

Here's the general procedure for testing the significance of predictors in a GLM: for each predictor, take the null hypothesis that its coefficient is zero, compute the test statistic (the estimated coefficient divided by its standard error), and obtain the corresponding p-value.

Compare p-values to the significance level: If the p-value is less than the chosen significance level ($\alpha$, commonly 0.05), reject the null hypothesis and conclude that there is a statistically significant relationship between the predictor and the dependent variable. If the p-value is greater than or equal to $\alpha$, fail to reject the null hypothesis; the data do not provide sufficient evidence of a relationship, which is not the same as showing that none exists.

It's important to note that statistical significance depends on both the size of the coefficient and the precision with which it is estimated. A non-zero coefficient with a small p-value provides strong evidence of a relationship between the predictor and the dependent variable.

It's also worth considering other factors such as the effect size, confidence intervals, and the specific goals of the analysis to fully interpret and understand the significance of predictors in a GLM.
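The procedure can be sketched with NumPy on simulated data. Everything about the data is an assumption made for illustration: `x1` is given a real effect, `x2` is given none, and the p-values use a normal approximation to the t distribution (reasonable here because the sample is large) so that only the standard library is needed. Statistical packages typically report exact t-based p-values instead.

```python
import math

import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)  # assumed to have a real effect on y
x2 = rng.normal(size=n)  # assumed to have no effect on y
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta
dof = n - X.shape[1]                        # residual degrees of freedom
sigma2 = residuals @ residuals / dof        # estimate of the error variance
cov_beta = sigma2 * np.linalg.inv(X.T @ X)  # covariance of the estimates
std_err = np.sqrt(np.diag(cov_beta))
t_stats = beta / std_err                    # coefficient / standard error

# Two-sided p-values, normal approximation: P(|Z| >= |t|) = erfc(|t| / sqrt(2)).
p_values = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])

for name, b, p in zip(["intercept", "x1", "x2"], beta, p_values):
    print(f"{name}: coef={b:.3f}, p={p:.4f}")
```

With this setup the coefficient of `x1` should come out clearly significant, while the p-value for `x2` varies from sample to sample because its true effect is zero.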

**Q9. What is the difference between Type I, Type II, and Type III sums of squares in a GLM?**

Ans : In a General Linear Model (GLM), the Type I, Type II, and Type III sums of squares are methods for partitioning the variation in the dependent variable (total sum of squares) into components associated with different predictor variables or sets of predictor variables. These methods differ in the order in which the predictors are entered into the model and the effects they consider when estimating the sums of squares.

Type I Sum of Squares:

- Type I sums of squares, also known as sequential or hierarchical sums of squares, assess the contribution of each predictor variable after accounting only for the predictors entered before it.
- Because they depend on the order in which the predictors are entered, different orderings can lead to different conclusions when the predictors are correlated or the design is unbalanced.
- This method is suitable for situations where there is a clear hierarchical or sequential relationship among the predictors.

Type II Sum of Squares:

- Type II sums of squares assess the contribution of each main effect after adjusting for all other main effects, but not for the interactions that contain it.
- This method is appropriate when interactions are absent or negligible.
- Type II sums of squares are not affected by the order in which the predictors are entered.

Type III Sum of Squares:

- Type III sums of squares assess the contribution of each term after adjusting for all other terms in the model, including interactions.
- This method is appropriate when there are interactions among the predictors and you want to estimate the individual effects while considering the other predictors and interactions.
- Type III sums of squares are not affected by the order in which the predictors are entered.

When the design is balanced, so that the predictors are uncorrelated, all three types give the same result.

It's important to note that the choice between Type I, Type II, and Type III sums of squares depends on the research question, the nature of the predictors, and the specific hypotheses being tested. The method chosen can affect the interpretation of the effects and the conclusions drawn from the analysis. Consulting statistical software or referencing statistical textbooks or resources can provide further guidance on the appropriate choice of sums of squares in different situations.

**Q10. Explain the concept of deviance in a GLM.**

Ans : In a generalized linear model (GLM), deviance is a measure of the goodness of fit of the model and is used in assessing the adequacy of the model to the data. It quantifies the discrepancy between the observed data and the values predicted by the fitted model; for a linear model with normal errors it reduces to the residual sum of squares.

Deviance can be thought of as a measure of the lack of fit or discrepancy between the observed data and the expected values under the fitted model. It quantifies how well the model accounts for the observed variability in the data.

In a GLM, the deviance is calculated by comparing the log-likelihood of the model with the log-likelihood of a saturated model, which is a hypothetical model that perfectly fits the observed data. The saturated model has a separate parameter for each data point, resulting in a perfect fit.

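The deviance is calculated as twice the difference between the log-likelihood of the saturated model and the log-likelihood of the fitted model. Mathematically, it can be expressed as:

$$D = 2\left(\ell_{\text{saturated}} - \ell_{\text{fitted}}\right)$$

where $\ell$ denotes a maximized log-likelihood.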

The deviance can be used to compare different models or assess the improvement of a model compared to a null model (intercept-only model). Lower deviance values indicate a better fit of the model to the data.

Deviance is commonly used in GLMs, such as logistic regression, Poisson regression, and negative binomial regression. In these models, the deviance is often used to compare nested models or to perform likelihood ratio tests to evaluate the significance of predictors or to compare different models.

It's important to note that, for nested models, the difference in deviance is asymptotically chi-square distributed with degrees of freedom equal to the difference in the number of parameters, which allows for statistical inference and hypothesis testing based on the deviance. Additionally, the concept of deviance is closely related to other measures such as residual deviance, null deviance, and AIC (Akaike Information Criterion), which provide further information on model fit and model comparison in GLMs.
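As a small worked sketch, the deviance of a model for binary (0/1) outcomes can be computed directly with NumPy. The outcomes and the fitted probabilities in `p_model` are hypothetical values chosen for illustration, not the output of a real fit.

```python
import numpy as np


def bernoulli_log_likelihood(y, p):
    """Log-likelihood of 0/1 outcomes y under predicted probabilities p."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))


def bernoulli_deviance(y, p):
    # For 0/1 outcomes the saturated model predicts every y exactly, so its
    # log-likelihood is 0 and the deviance is -2 times the fitted log-likelihood.
    return -2.0 * bernoulli_log_likelihood(y, p)


y = [1, 0, 1, 1, 0]
p_null = [0.6] * 5                   # intercept-only model: the mean of y
p_model = [0.9, 0.2, 0.8, 0.7, 0.1]  # hypothetical fitted probabilities

null_deviance = bernoulli_deviance(y, p_null)
residual_deviance = bernoulli_deviance(y, p_model)

print(null_deviance, residual_deviance)
```

The residual deviance is lower than the null deviance here, which is the pattern expected when the predictors improve on an intercept-only model.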

# Regression

**11. What is regression analysis and what is its purpose?**

~ Regression analysis is a statistical method for modeling the relationship between a dependent variable and one or more explanatory variables. Its purpose is to estimate the effect of the explanatory variables on the dependent variable and to predict the dependent variable from them.

**12. What is the difference between simple linear regression and multiple linear regression?**

~ Simple linear regression has one independent variable and one dependent variable, whereas multiple linear regression has two or more independent variables and one dependent variable.

**13 How do you interpret the R-squared value in regression?**

~ R-squared is the proportion of the variance in the dependent variable that is explained by the regression model, so it indicates how well the model explains the observed data. A high R-squared value means that the data points lie close to the fitted regression line. A low R-squared value means that the regression line does not fit the data well.
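~ A minimal sketch of the calculation, using the standard definition of R-squared as one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the sample values are illustrative only.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return float(1.0 - ss_res / ss_tot)


y = [3.0, 5.0, 7.0, 9.0]
perfect = r_squared(y, y)            # predictions match exactly
mean_only = r_squared(y, [6.0] * 4)  # always predicting the mean of y

print(perfect)    # 1.0
print(mean_only)  # 0.0
```

~ A model that only ever predicts the mean scores 0, and a perfect fit scores 1, which gives the two reference points for reading intermediate values.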

**14 What is the difference between correlation and regression?**

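~ Correlation measures the strength and direction of the linear relationship between two variables and treats them symmetrically, whereas regression describes how a dependent variable changes with one or more independent variables using an equation that can also be used for prediction.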

**15 What is the difference between the coefficients and the intercept in regression?**

~ In regression, a coefficient is the slope: the change in y for a one-unit change in x. The intercept is the point where the regression line meets the y-axis, that is, the predicted value of y when x is zero.

**16 How do you handle outliers in regression analysis?**

~ Outliers in regression analysis are handled by removing them from the observations (when they are errors), treating them (for example by transforming or capping the values), or using methods that are robust to such values on their own, such as regression with a Huber loss.

**17 What is the difference between ridge regression and ordinary least squares regression?**

~ Ordinary least squares (OLS) regression chooses the coefficients that minimize the sum of squared residuals alone. Ridge regression is an extension of linear regression where the loss function is modified to limit the complexity of the model: it adds a penalty term proportional to the sum of the squared coefficients, which shrinks the coefficients towards zero.
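~ The difference can be sketched with the closed-form solutions of both methods in NumPy. This is a simplified illustration on simulated data: there is no intercept, every coefficient is penalized, and the penalty strength `lam=10.0` is an arbitrary assumption.

```python
import numpy as np


def ols_fit(X, y):
    """OLS: solve (X^T X) w = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)


def ridge_fit(X, y, lam):
    """Ridge: solve (X^T X + lam * I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

w_ols = ols_fit(X, y)
w_ridge = ridge_fit(X, y, lam=10.0)

print(w_ols)
print(w_ridge)
print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

~ The only change from OLS is the `lam * I` term added to the matrix being inverted; setting `lam` to 0 recovers the OLS solution, and larger values shrink the coefficient vector further.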

**18 What is heteroscedasticity in regression and how does it affect the model?**

~ Heteroscedasticity refers to a situation where the variance of the residuals is unequal over the range of measured values. If heteroscedasticity exists, the coefficient estimates remain unbiased, but their standard errors are unreliable, so significance tests and confidence intervals may be invalid.

**19 How do you handle multicollinearity in regression analysis?**


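~ Multicollinearity in regression analysis can be handled by feature selection (for example, dropping one of a pair of highly correlated predictors) and feature extraction (for example, combining correlated predictors with principal component analysis).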

**20 What is polynomial regression and when is it used?**

~ Polynomial regression is a kind of linear regression in which the relationship between the dependent and independent variables is modeled as an nth-degree polynomial. The model is still linear in its coefficients. It is used when a straight-line model does not adequately capture the curvature of the relationship.

# Loss Function

**21 What is a loss function and what is its purpose in machine learning?**

~ A loss function is a method of evaluating how well your algorithm models your dataset. It measures how far the model's predictions are from the expected outcomes, and training aims to minimize it.

**22 What is the difference between a convex and non-convex loss function?**

~ For a convex loss function, every local minimum is also a global minimum, so gradient-based optimization cannot get stuck at a suboptimal point. A non-convex loss function can have several local minima in addition to its global minimum.

**23 What is mean squared error (MSE) and how is it calculated?**

~ Mean Squared Error measures how close a regression line is to a set of data points. It is a risk function corresponding to the expected value of the squared error loss. It is calculated by squaring the difference between each actual and predicted value and taking the mean of those squared errors: $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$.

**24 What is mean absolute error (MAE) and how is it calculated?**

~ Mean absolute error is an arithmetic average of the absolute errors. MAE is calculated as the sum of absolute errors divided by the sample size.
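~ Both metrics can be sketched in a few lines of plain Python. The function names and the sample values are illustrative only.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 4.0]  # errors: 1, 0, -2, 3

mse = mean_squared_error(y_true, y_pred)   # (1 + 0 + 4 + 9) / 4
mae = mean_absolute_error(y_true, y_pred)  # (1 + 0 + 2 + 3) / 4

print(mse)  # 3.5
print(mae)  # 1.5
```

~ The largest error (3) contributes 9 of the 14 squared-error units but only 3 of the 6 absolute-error units, which shows why MSE reacts more strongly to large errors than MAE.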

**25 What is log loss (cross-entropy loss) and how is it calculated?**

~ Log loss is a common evaluation metric for binary classification models. It measures performance by quantifying the difference between predicted probabilities and the actual labels (0 or 1). For each example with true label $y$ and predicted probability $p$ of class 1, the loss is $-[y\log p + (1-y)\log(1-p)]$, and the log loss is the average over all examples. Confident predictions that turn out to be wrong are penalized most heavily.
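~ A minimal sketch of binary log loss in plain Python. The clipping constant `eps` is an assumption added here to avoid taking the logarithm of zero; it is a common safeguard rather than part of the definition.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


labels = [1, 0]
confident_right = log_loss(labels, [0.9, 0.1])  # high probability on the true class
confident_wrong = log_loss(labels, [0.1, 0.9])  # high probability on the wrong class

print(confident_right)
print(confident_wrong)
```

~ The second call returns a much larger loss than the first, illustrating how log loss punishes confident mistakes.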

**26 How do you choose the appropriate loss function for a given problem?**

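~ Most machine learning algorithms use some sort of loss function in the process of optimization, or finding the best parameters for your data. The choice depends mainly on the type of problem and on how errors should be penalized: regression tasks typically use MSE, MAE, or Huber loss (the latter two when outliers are a concern), while classification tasks typically use log loss (cross-entropy). In a neural network the loss is also paired with the activation function of the output layer, for example sigmoid or softmax outputs with cross-entropy.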

**27 Explain the concept of regularization in the context of loss functions.**

~ Regularization in the context of loss functions refers to adding a penalty term (for example, on the size of the weights) to the loss, so that the model minimizes the adjusted loss rather than the training error alone. This discourages overly complex models and helps prevent overfitting.

**28 What is Huber loss and how does it handle outliers?**

~ Huber loss is a loss function used in robust regression that is less sensitive to outliers in data than the squared error loss. It is quadratic for residuals smaller than a threshold $\delta$ and linear for larger residuals, so outliers contribute less to the total loss than they would under squared error.
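~ As a sketch, the standard Huber loss for a single residual is quadratic up to a threshold and linear beyond it. The threshold `delta=1.0` used below is an arbitrary choice for illustration.

```python
def huber_loss(residual, delta=1.0):
    """Quadratic for |residual| <= delta, linear beyond it."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a ** 2
    return delta * (a - 0.5 * delta)


def squared_loss(residual):
    return 0.5 * residual ** 2


small = huber_loss(0.5)         # inside the quadratic region
large = huber_loss(10.0)        # an outlier, inside the linear region
large_sq = squared_loss(10.0)   # the same outlier under squared error

print(small)     # 0.125
print(large)     # 9.5
print(large_sq)  # 50.0
```

~ For the small residual the two losses agree, while for the outlier the Huber loss grows only linearly, so a single extreme point pulls the fit far less.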

**29 What is quantile loss and when is it used?**

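~ Quantile loss (also called pinball loss) is used to predict quantiles. A quantile is the value below which a given fraction of observations in a group falls. The loss penalizes over-prediction and under-prediction asymmetrically according to the target quantile, and it is used when a model should predict a particular quantile of a variable (for example, the median or the 90th percentile) rather than its mean.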
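~ A minimal sketch of the quantile (pinball) loss in plain Python, where `q` is the target quantile between 0 and 1. The sample values are illustrative.

```python
def quantile_loss(y_true, y_pred, q):
    """Mean pinball loss: under-prediction costs q, over-prediction costs 1 - q."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        diff = t - p
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)


# Target the 90th percentile: predicting too low is penalized more heavily.
under = quantile_loss([10.0], [8.0], q=0.9)   # prediction 2 units too low
over = quantile_loss([10.0], [12.0], q=0.9)   # prediction 2 units too high

print(under)
print(over)
```

~ With `q=0.9`, an under-prediction costs nine times as much as an over-prediction of the same size, which pushes the fitted value up towards the 90th percentile. With `q=0.5` the loss is symmetric and proportional to the absolute error.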

**30 What is the difference between squared loss and absolute loss?**

~ Squared loss penalizes each error by its square, so large errors dominate and the loss is sensitive to outliers; it is differentiable everywhere and is minimized by the mean. Absolute loss penalizes each error by its absolute value, so errors are weighted in proportion to their size; it is more robust to outliers, is not differentiable at zero, and is minimized by the median.



# Optimizer (GD)

**31 What is an optimizer and what is its purpose in machine learning?**

~ An optimizer is an algorithm that adjusts a model's attributes during training, such as the weights of a neural network and, in adaptive methods, the learning rate. Its purpose is to reduce the total loss and thereby improve accuracy.

**32 What is Gradient Descent (GD) and how does it work?**

~ Gradient Descent is one of the most commonly used optimization algorithms for training machine learning models by minimizing the error between actual and expected results. At each step it computes the partial derivatives (the gradient) of the loss with respect to the parameters and moves the parameters a small step in the opposite direction, which is the direction of steepest descent. Repeating this converges towards a local minimum.
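~ A minimal sketch of gradient descent on a one-dimensional function. The function f(x) = (x - 3)^2, the starting point, and the learning rates are assumptions chosen so the behaviour is easy to follow.

```python
def gradient_descent(grad, start, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    x = start
    for _ in range(n_steps):
        x = x - learning_rate * grad(x)
    return x


def grad_f(x):
    # Derivative of f(x) = (x - 3)^2, whose minimum is at x = 3.
    return 2 * (x - 3)


minimum = gradient_descent(grad_f, start=0.0)

# The same problem with a learning rate that is too large: each step
# overshoots the minimum by more than the last, so the iterates diverge.
diverged = gradient_descent(grad_f, start=0.0, learning_rate=1.1, n_steps=20)

print(minimum)
print(diverged)
```

~ With a learning rate of 0.1 the result is very close to 3, while with 1.1 the iterates move further from the minimum at every step. The learning rate is therefore the main setting to tune in practice.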

**33 What are the different variations of Gradient Descent?**

~ Different variations of Gradient Descent are batch gradient descent, stochastic gradient descent and mini-batch gradient descent.

**34 What is the learning rate in GD and how do you choose an appropriate value?**

~ The learning rate determines how fast or slow we move towards the optimal weights. In order for Gradient Descent to work, we must set the learning rate to an appropriate value. Common starting values are 0.1, 0.01, or 0.001, and the final value is usually chosen by trying several and comparing how the loss decreases. The step should not be too big, as it can overshoot the minimum point and the optimisation can fail; if it is too small, training is very slow.

**35 How does GD handle local optima in optimization problems?**

~ Plain GD has no built-in way to escape a local optimum. Two common remedies are momentum, which adds a fraction of the past weight update to the current weight update and so helps carry the parameters through shallow local minima, and Stochastic Gradient Descent (SGD), a variant of GD whose noisy updates can help in the same way.

**36 What is Stochastic Gradient Descent (SGD) and how does it differ from GD?**

~ Stochastic Gradient Descent (SGD) is a variant of the Gradient Descent algorithm that is used for optimizing machine learning models. It addresses the computational inefficiency of traditional Gradient Descent methods when dealing with large datasets in machine learning projects. In SGD, instead of using the entire dataset for each iteration, only a single random training example or a small batch is selected to calculate the gradient and update the model parameters.

**37 Explain the concept of batch size in GD and its impact on training.**

~ The batch size is a hyperparameter of gradient descent that controls the number of training samples to work through before the model's internal parameters are updated. Batch size is important because it affects both the training time and the generalization of the model. A smaller batch size allows the model to learn from each individual example but takes longer to train. A larger batch size trains faster but may result in the model not capturing the nuances in the data.

**38 What is the role of momentum in optimization algorithms?**

~ Momentum is a strategy for accelerating the convergence of the optimization process by including a momentum element in the update rule. This momentum factor assists the optimizer in continuing to go in the same direction even if the gradient changes direction or becomes zero.
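~ The update rule can be sketched as follows, using the classical momentum formulation in which a velocity term accumulates a decaying sum of past gradients. The test function f(x) = (x - 3)^2 and the values `learning_rate=0.1` and `beta=0.9` are assumptions for illustration.

```python
def momentum_descent(grad, start, learning_rate=0.1, beta=0.9, n_steps=300):
    """Gradient descent with classical momentum."""
    x = start
    velocity = 0.0
    for _ in range(n_steps):
        # Keep a fraction `beta` of the previous update and add the new gradient step.
        velocity = beta * velocity - learning_rate * grad(x)
        x = x + velocity
    return x


def grad_f(x):
    # Derivative of f(x) = (x - 3)^2, whose minimum is at x = 3.
    return 2 * (x - 3)


result = momentum_descent(grad_f, start=0.0)
print(result)
```

~ Because the velocity carries over between steps, the parameter keeps moving even where the current gradient is small, which is the behaviour described above.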

**39 What is the difference between batch GD, mini-batch GD, and SGD?**

~ In batch Gradient Descent, the whole training set is used to compute each update of the network's parameters.

~ In Stochastic Gradient Descent, we update the parameters after every single observation; each such weight update is known as an iteration.

~ In Mini-batch Gradient Descent, we take a subset of data and update the parameters based on every subset.

**40 How does the learning rate affect the convergence of GD?**

~ The learning rate determines how big the step is on each iteration. If the learning rate is very small, GD takes a long time to converge and becomes computationally expensive. If the learning rate is too large, it may overshoot the minimum and fail to converge.



# Regularization

**41 What is regularization and why is it used in machine learning?**

~ Regularization refers to techniques that constrain or penalize a machine learning model so that it minimizes an adjusted loss function and avoids overfitting. With regularization, a model fitted on the training set generalizes better, which reduces its error on unseen (test) data.

**42 What is the difference between L1 and L2 regularization?**

~ L1 Regularization adds the “absolute value of magnitude” of the coefficient as a penalty term to the loss function.

~ L2 Regularization adds the “squared magnitude” of the coefficient as the penalty term to the loss function.
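~ The two penalty terms can be sketched in plain Python. The weight values and the penalty strength `lam` are arbitrary illustrative choices.

```python
def l1_penalty(weights, lam):
    """L1 (lasso) penalty: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 (ridge) penalty: lam times the sum of squared weights."""
    return lam * sum(w ** 2 for w in weights)


weights = [0.5, -2.0, 0.0]
lam = 0.1

l1 = l1_penalty(weights, lam)  # 0.1 * (0.5 + 2.0 + 0.0)
l2 = l2_penalty(weights, lam)  # 0.1 * (0.25 + 4.0 + 0.0)

print(l1)
print(l2)
```

~ Either penalty is added to the data loss before optimization. The L2 term grows quadratically, so it concentrates on the largest weights, while the L1 term treats every unit of weight equally, which is what allows it to drive some weights exactly to zero.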

**43 Explain the concept of ridge regression and its role in regularization.**

~ Ridge regression is a model tuning method that is used to analyse any data that suffers from multicollinearity. This method performs L2 regularization. It is used in regularization for reducing overfitting.

**44 What is the elastic net regularization and how does it combine L1 and L2 penalties?**

~ Elastic net regression is a hybrid of lasso and ridge regression that uses a combination of the L1 and L2 norms as the penalty term. This allows it to balance between feature selection and feature preservation, and to deal with situations where lasso and ridge regression may fail. It combines the L1 and L2 penalties linearly, as a weighted sum.

**45 How does regularization help prevent overfitting in machine learning models?**

~ Regularization modifies the loss function by adding a penalty term that discourages large coefficients and prevents them from fluctuating excessively, thereby reducing the chances of overfitting.

**46 What is early stopping and how does it relate to regularization?**

~ Early stopping is a form of regularization used to avoid overfitting when training a learner with an iterative method, such as gradient descent. Such methods update the learner so that it fits the training data better with each iteration; beyond a certain point this comes at the expense of performance on unseen data. Early stopping monitors the error on a validation set and halts training when that error stops improving.

**47 Explain the concept of dropout regularization in neural networks.**

~ Dropout is a regularization method that approximates training many neural networks with different architectures at once. During training, some layer outputs are ignored or dropped at random. This makes the layer look like, and be treated as, a layer with a different number of nodes and different connectivity to the preceding layer.
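~ A minimal NumPy sketch of dropout applied to one layer's outputs. It uses the "inverted dropout" variant, which is an assumption made here rather than something stated above: surviving activations are scaled up during training so that no rescaling is needed at prediction time.

```python
import numpy as np


def dropout(activations, drop_prob, rng, training=True):
    """Randomly zero a fraction `drop_prob` of activations during training."""
    if not training or drop_prob == 0.0:
        return activations  # dropout is disabled at prediction time
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob  # True where a unit is kept
    # Scale the kept units so the expected activation is unchanged.
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
layer_output = np.ones((1000, 10))

dropped = dropout(layer_output, drop_prob=0.5, rng=rng)
unchanged = dropout(layer_output, drop_prob=0.5, rng=rng, training=False)

print(np.unique(dropped))
print(dropped.mean())
```

~ Each call draws a new random mask, so every training step effectively uses a different thinned version of the layer.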

**48 How do you choose the regularization parameter in a model?**

~ We fit several Ridge, Lasso, or Elastic Net regressions on the training set, each with a different value of the regularization parameter. We then evaluate them on a validation set (or by cross-validation) and choose the value of the regularization parameter that gives the lowest validation error, such as MSE.

**49 What is the difference between feature selection and regularization?**

~ Feature selection removes the dimensions e.g. columns from the input data and results in a reduced data set for model inference. Regularization is where we are constraining the solution space while doing optimization.

**50 What is the trade-off between bias and variance in regularized models?**

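~ Bias and variance are inversely connected: decreasing one tends to increase the other. In regularized models, a stronger penalty makes the model simpler, which increases bias and decreases variance; a weaker penalty does the opposite. The regularization parameter is chosen to balance the two.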

# SVM

**51 What is Support Vector Machines (SVM) and how does it work?**

~ Support Vector Machine (SVM) is a powerful machine learning algorithm used for linear or nonlinear classification, regression, and even outlier detection tasks. Its objective is to find the optimal hyperplane in an N-dimensional feature space that separates the data points of different classes. The optimal hyperplane is the one that maximizes the margin, that is, the distance to the closest points of each class.

**52 How does the kernel trick work in SVM?**

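~ The kernel trick lets an SVM learn a non-linear decision surface by implicitly mapping the training data into a higher-dimensional space where the classes become linearly separable. A kernel function computes inner products in that space directly from the original inputs, so the mapping never has to be computed explicitly.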
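~ The idea can be checked numerically with a small NumPy sketch. It uses the homogeneous degree-2 polynomial kernel on 2-D inputs as an example; the function names and the input vectors are illustrative choices.

```python
import numpy as np


def poly_kernel(x, z):
    """Degree-2 polynomial kernel k(x, z) = (x . z)^2, computed in the input space."""
    return float(np.dot(x, z) ** 2)


def explicit_features(x):
    """Explicit 3-D feature map whose inner product equals poly_kernel for 2-D inputs."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])


x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

kernel_value = poly_kernel(x, z)  # uses only the 2-D inputs
feature_value = float(explicit_features(x) @ explicit_features(z))  # uses 3-D features

print(kernel_value, feature_value)
```

~ Both computations give the same number, but the kernel version never builds the higher-dimensional vectors. For kernels such as the RBF kernel the implicit feature space is infinite-dimensional, so the shortcut is essential rather than merely convenient.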

**53 What are support vectors in SVM and why are they important?**

~ Support vectors are the data points closest to the hyperplane, and they determine its position and orientation. Support vectors are important because the margin of the classifier is maximized with respect to them. Deleting a support vector will change the position of the hyperplane, whereas deleting other points will not. These are the points that help us build our SVM.

**54 Explain the concept of the margin in SVM and its impact on model performance.**

~ The margin in SVM is the distance between the hyperplane and the closest data points of each class. A larger margin means the classes are more clearly separated and the model usually generalizes better; a smaller margin makes the model more sensitive to small changes in the data, which tends to reduce prediction accuracy on unseen data.

**55 How do you handle unbalanced datasets in SVM?**

~ A popular approach for handling unbalanced datasets is the penalized (class-weighted) SVM. In scikit-learn, for example, we can pass the argument class_weight='balanced' during training to penalize mistakes on the minority class by an amount proportional to how under-represented it is.

**56 What is the difference between linear SVM and non-linear SVM?**

~ When the data points are linearly separable into two classes, the data is called linearly-separable data. We use the linear SVM classifier to classify such data.

~ When the data is not linearly separable, we use the non-linear SVM classifier to separate the data points.

**57 What is the role of C-parameter in SVM and how does it affect the decision boundary?**

~ The C-parameter tells the SVM optimization how much you want to avoid misclassifying each training example. For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. For small values of C, it will look for a larger-margin hyperplane, even if that hyperplane misclassifies more points.

**58 Explain the concept of slack variables in SVM.**

~ Slack variables are introduced to allow certain constraints to be violated. That is, certain training points will be allowed to be within the margin. We want the number of points within the margin to be as small as possible, and we want their penetration of the margin to be as small as possible.
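~ As a sketch, the slack of each training point under a given linear classifier can be computed with NumPy from the standard soft-margin formulation, where labels are -1 or +1. The data points, the weight vector `w`, and the bias `b` are hypothetical values chosen for illustration, not the result of training.

```python
import numpy as np


def slack_variables(X, y, w, b):
    """xi_i = max(0, 1 - y_i * (w . x_i + b)) for labels y_i in {-1, +1}."""
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins)


def soft_margin_objective(X, y, w, b, C):
    """0.5 * ||w||^2 + C * (sum of slack variables)."""
    return 0.5 * float(w @ w) + C * float(slack_variables(X, y, w, b).sum())


X = np.array([[2.0, 0.0], [0.5, 0.0], [-0.5, 0.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, 1.0, -1.0])  # the third point lies on the wrong side
w = np.array([1.0, 0.0])
b = 0.0

slack = slack_variables(X, y, w, b)
objective = soft_margin_objective(X, y, w, b, C=1.0)

print(slack)      # [0.  0.5 1.5 0. ]
print(objective)  # 2.5
```

~ A slack of 0 means the point is on or outside the margin, a value between 0 and 1 means it is inside the margin but correctly classified, and a value above 1 means it is misclassified. The parameter C sets how heavily the total slack is weighed against a wide margin.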

**59 What is the difference between hard margin and soft margin in SVM?**

~ A hard margin SVM allows no training points inside the margin or on the wrong side of the hyperplane, so it requires linearly separable data. A soft margin SVM allows some error points to violate the margin, at a cost controlled by the C parameter.

**60 How do you interpret the coefficients in an SVM model?**

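~ In a linear SVM, the learned coefficients are the weight vector w and the intercept b of the separating hyperplane w·x + b = 0. The sign of each weight shows which class an increase in that feature pushes the prediction toward and, when the features are on comparable scales, a larger absolute weight means the feature has more influence on the decision. With non-linear kernels the hyperplane lives in the implicit feature space, so there are no per-feature coefficients to read off directly. The C parameter is not a coefficient but the penalty parameter of the error term: for greater values of C the optimizer works harder to avoid misclassifying training points, although misclassification is still possible when the data is not separable.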



# Decision Trees

**61 What is a decision tree and how does it work?**

~ A decision tree is a non-parametric supervised learning algorithm that is used for both classification and regression tasks. It has a hierarchical tree structure consisting of a root node, branches, internal nodes and leaf nodes, and it models decisions and their possible consequences. The algorithm works by recursively splitting the data into subsets based on the most significant feature at each node of the tree.

**62 How do you make splits in a decision tree?**

~ For each split, calculate the entropy of each child node independently.

~ Calculate the entropy of each split using the weighted average entropy of child nodes.

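~ Choose the split with the lowest weighted entropy, which is the split with the greatest information gain.

~ Repeat these steps on each child node until the nodes are homogeneous or a stopping criterion (such as a maximum depth) is reached.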
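The steps above can be sketched for a single numeric feature as follows. The function names and the toy data are assumptions for illustration. A real implementation would search over all features and recurse on the resulting child nodes.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, information_gain) for the best binary split 'x <= t'."""
    parent = entropy(labels)
    best_t, best_gain = None, 0.0
    # Skip the largest value: splitting there would leave the right child empty.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        # Weighted average entropy of the two child nodes.
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = parent - weighted
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Toy data (made up): age of a customer and whether they bought a product.
ages = [22, 25, 30, 35, 40, 45]
bought = ["no", "no", "no", "yes", "yes", "yes"]
threshold, gain = best_threshold(ages, bought)
print(threshold, gain)
```

Here the split at age 30 produces two pure child nodes, so it removes all of the parent node's entropy.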

**63 What are impurity measures (e.g., Gini index, entropy) and how are they used in decision trees?**

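~ In the context of decision trees, impurity measures quantify how mixed the classes are within a node. Entropy is a measure of disorder in a node, and the Gini index is the probability of misclassifying a randomly chosen sample from the node if it were labeled according to the node's class distribution. Both are zero for a pure node. They are used to compute the reduction in impurity (information gain) achieved by a candidate split, and this gain is used to decide the best split, so that impurity decreases from the root node toward the leaf nodes.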
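As a quick illustration, both measures can be computed from the class proportions in a node using the standard definitions: Gini = 1 - Σ p_k² and entropy = -Σ p_k log2 p_k. The helper names and the toy nodes are assumptions for this sketch.

```python
import math
from collections import Counter

def class_probabilities(labels):
    """Proportion of each class among the labels in a node."""
    n = len(labels)
    return [c / n for c in Counter(labels).values()]

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2)."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))

def entropy(labels):
    """Entropy: -sum(p_k * log2(p_k))."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))

pure_node = ["a"] * 4               # a single class: no impurity
mixed_node = ["a", "a", "b", "b"]   # a 50/50 mix: maximum impurity for two classes

print(gini(pure_node), gini(mixed_node))
print(entropy(pure_node), entropy(mixed_node))
```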

**64 Explain the concept of information gain in decision trees.**

~ Information gain is the basic criterion for deciding whether a feature should be used to split a node. It is the entropy of the parent node minus the weighted average entropy of the child nodes produced by the split. The feature whose split gives the highest information gain at a node is used to split that node.

**65 How do you handle missing values in decision trees?**

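~ Some decision tree implementations can handle missing values automatically, for example by using surrogate splits or by sending samples with a missing value down a default branch. With implementations that do not support this, the missing values need to be imputed (for example with the mean, median, or mode) or the affected rows dropped before training. Decision trees are also usually robust to outliers, because splits depend on the ordering of the values rather than on their magnitude.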

**66 What is pruning in decision trees and why is it important?**

~ Pruning is a technique that reduces the size of a decision tree by removing sections of the tree that are irrelevant or unnecessary, meaning branches that contribute little to predictive power. Pruning is important because it reduces the complexity of the tree, avoids unnecessary branching, and therefore lowers the risk of overfitting the training data.

**67 What is the difference between a classification tree and a regression tree?**

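~ Classification trees are used when the response variable is categorical: the dataset is split into groups that belong to its classes, and each leaf predicts a class. Regression trees, on the other hand, are used when the response variable is continuous, and each leaf predicts a numeric value (typically the mean of the training samples in that leaf).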

**68 How do you interpret the decision boundaries in a decision tree?**

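~ The first node of the tree, called the root node, contains all the training instances of every class. Each internal node tests a single feature against a threshold, so every split is a boundary perpendicular to that feature's axis. Following the tests from the root down to a leaf divides the feature space into axis-aligned rectangular regions called decision regions, and every instance that falls in a region receives that leaf's prediction. The decision boundary is formed by the edges between regions with different predicted classes.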

**69 What is the role of feature importance in decision trees?**

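~ Feature importance refers to techniques that assign a score to each feature based on how significant it is in predicting the target variable. In a decision tree, the score of a feature is typically the total reduction in impurity (for example Gini) achieved by the nodes that split on it, weighted by the share of samples reaching each node. The scores help with interpreting the model and with feature selection. In scikit-learn, an easy way to obtain them is the feature_importances_ attribute of the trained decision tree model.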

**70 What are ensemble techniques and how are they related to decision trees?**

~ Ensemble techniques are methods that create multiple models and then combine them to produce improved results. Ensemble methods in machine learning usually produce more accurate solutions than a single model would. They are related to decision trees because many popular ensembles, such as random forests and gradient-boosted trees, use decision trees as their base models and combine the predictions of many trees.



# Ensemble Techniques

**71 What are ensemble techniques in machine learning?**

~ Ensemble methods are techniques that create multiple models and then combine them to produce improved results. Ensemble methods in machine learning usually produce more accurate solutions than a single model would.

**72 What is bagging and how is it used in ensemble learning?**

~ Bagging (bootstrap aggregating) trains several homogeneous weak learners independently of each other, in parallel, on different bootstrap samples of the data, and combines their outputs by averaging (for regression) or voting (for classification). It is used in ensemble learning to reduce variance.

~ Steps to use Bagging:

Step 1: Multiple subsets of equal size are created from the original data set by selecting observations with replacement.

Step 2: A base model is created on each of these subsets.

Step 3: The models are trained in parallel, each on its own subset and independently of the others.

Step 4: The final predictions are determined by combining the predictions from all the models.
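The four steps can be sketched as follows. The one-dimensional threshold "stump" used as the base model, the function names, and the toy dataset are assumptions chosen to keep the example short. The loop trains the models one after another, but because they do not depend on each other they could equally be trained in parallel.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Hypothetical base model: the threshold t with the fewest errors for 'x > t -> 1'."""
    best_t, best_err = None, None
    for t, _ in sample:
        err = sum((1 if x > t else 0) != y for x, y in sample)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t

def bagging_fit(data, n_models, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Step 1: bootstrap sample of the same size, drawn with replacement.
        sample = [rng.choice(data) for _ in data]
        # Steps 2-3: fit one base model per sample, independently of the others.
        models.append(fit_stump(sample))
    return models

def bagging_predict(models, x):
    # Step 4: combine the individual predictions by majority vote.
    votes = Counter(1 if x > t else 0 for t in models)
    return votes.most_common(1)[0][0]

# Toy 1-D dataset (made up): the label is 1 when x >= 5.
data = [(x, 1 if x >= 5 else 0) for x in range(10)]
models = bagging_fit(data, n_models=25)
print(bagging_predict(models, 0), bagging_predict(models, 9))
```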

**73 Explain the concept of bootstrapping in bagging.**

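~ Bootstrapping is the resampling step of bagging: each training set is created by drawing observations from the original data at random with replacement, so some observations appear several times and others not at all. Bootstrap aggregating (bagging) is the ensemble meta-algorithm built on top of it. It trains one model per bootstrap sample and aggregates their predictions, which improves the stability and accuracy of algorithms used in classification and regression. It decreases the variance and helps to avoid overfitting.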
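A minimal sketch of drawing one bootstrap sample is shown below. The seed and the data size are arbitrary choices for illustration. Observations that are never drawn are often called "out-of-bag", and for large datasets they make up roughly 1/e (about 37%) of the data.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
data = list(range(1000))
sample = bootstrap_sample(data, rng)

# Items that were never drawn are "out-of-bag" for this sample.
out_of_bag = set(data) - set(sample)
oob_fraction = len(out_of_bag) / len(data)
print(len(sample), round(oob_fraction, 3))
```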

**74 What is boosting and how does it work?**

~ Boosting is an ensemble modeling technique that attempts to build a strong classifier from a number of weak classifiers. It is done by building the weak models in series. First, a model is built from the training data. Then a second model is built that tries to correct the errors made by the first model. This procedure is continued, and models are added until either the complete training data set is predicted correctly or the maximum number of models is reached.
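One common way to "correct the errors" of earlier models is to fit each new weak model to the residuals of the current ensemble, which is what gradient boosting does for squared error. The sketch below uses that variant with one-split regression stumps. The function names, the learning rate, and the toy data are assumptions for illustration.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump that predicts the mean residual on each side."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_rounds, learning_rate=0.5):
    """Add stumps in series; each one is fit to the current residual errors."""
    stumps = []
    preds = [0.0] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return stumps, preds

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Toy regression data (made up).
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 3.1, 2.9, 5.2, 5.0]
_, preds_1 = boost(xs, ys, n_rounds=1)
_, preds_20 = boost(xs, ys, n_rounds=20)
print(mse(ys, preds_1), mse(ys, preds_20))  # the training error shrinks as models are added
```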

**75 What is the difference between AdaBoost and Gradient Boosting?**

~ AdaBoost works with a specific loss function (the exponential loss): after each iteration it increases the weights of the misclassified samples so that the next weak learner focuses on them. Gradient boosting instead fits each new learner to the negative gradient (for squared error, the residuals) of a loss function, and it works with any differentiable loss. Gradient boosting is therefore more flexible than AdaBoost, which is tied to its fixed loss function.
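The reweighting step of AdaBoost can be sketched for a single round as follows, using the classic binary formulation with labels in {-1, +1}. The weak learner's predictions are made-up values and the function name is an assumption for this example.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost round for labels in {-1, +1}; returns (alpha, new_weights).

    Assumes the weights sum to 1 and the weighted error is strictly between 0 and 1.
    """
    # Weighted error of the current weak learner.
    err = sum(w for w, yt, yp in zip(weights, y_true, y_pred) if yt != yp)
    alpha = 0.5 * math.log((1 - err) / err)  # the learner's say in the final vote
    # Increase the weights of misclassified points, decrease the rest, then normalize.
    raw = [w * math.exp(-alpha * yt * yp) for w, yt, yp in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return alpha, [r / total for r in raw]

y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]   # hypothetical weak learner, wrong on the last point
weights = [0.2] * 5           # uniform starting weights
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(alpha, new_weights)
```

After the round, the misclassified point carries half of the total weight, so the next weak learner is pushed to classify it correctly.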

**76 What is the purpose of random forests in ensemble learning?**

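~ Random forest is a supervised ensemble learning technique that combines numerous decision trees to enhance a model's performance. Each tree is trained on a bootstrap sample of the data and considers a random subset of features at each split, and the trees' predictions are combined by voting or averaging, which reduces variance and overfitting compared with a single tree. In scikit-learn, a random forest builds one hundred decision trees by default, which can be changed through the n_estimators parameter.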

**77 How do random forests handle feature importance?**

~ The more a feature decreases the impurity, the more important the feature is. The final feature importance, at the random forest level, is its average over all the trees: the feature's importance values on each tree are summed and divided by the total number of trees.
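The averaging step can be sketched as below. The per-tree importance values are invented for illustration and are not output from a trained forest.

```python
def forest_importance(per_tree_importances):
    """Average each feature's importance over all trees."""
    n_trees = len(per_tree_importances)
    n_features = len(per_tree_importances[0])
    return [
        sum(tree[j] for tree in per_tree_importances) / n_trees
        for j in range(n_features)
    ]

# Hypothetical normalized importances from three trees over three features.
per_tree = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.7, 0.2, 0.1],
]
importance = forest_importance(per_tree)
print(importance)
```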

**78 What is stacking in ensemble learning and how does it work?**

~ Stacking is one of the most popular ensemble machine learning techniques. It combines the predictions of multiple base models to build a new model with improved performance. The predictions of the base models, for example kNN, decision trees, or SVM, are used as input features for a final model (the meta-model), which learns how best to combine them. This final model is used for making predictions on the test dataset.

**79 What are the advantages and disadvantages of ensemble techniques?**

~ Advantages : Ensemble methods offer several advantages over single models, such as improved accuracy and performance, especially for complex and noisy problems. They can also reduce the risk of overfitting and underfitting by balancing the trade-off between bias and variance, and by using different subsets and features of the data.

~ Disadvantages : Ensembling is less interpretable, the output of the ensembled model is hard to predict and explain. The art of ensembling is hard to learn and any wrong selection can lead to lower predictive accuracy than an individual model. Ensembling is expensive in terms of both time and space.

**80 How do you choose the optimal number of models in an ensemble?**

~ One procedure for choosing the models in an ensemble:

Step 1: Find the KS (Kolmogorov-Smirnov statistic) of the individual models.

Step 2: Index all the models for easy access.

Step 3: Choose the first two models as the initial selection and set a correlation limit.

Step 4: Iteratively add every model that is not highly correlated with any already chosen model.

Step 5: Check the performance of each sequential combination of the chosen models.

Step 6: Choose the combination of models where the performance peaks.
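A simplified variant of this procedure is sketched below. It is only an illustration under several assumptions: it starts from the single best model rather than the first two, it uses a generic validation score in place of KS, it measures performance as the accuracy of the averaged predictions, and all model names and numbers are made up.

```python
def pearson(a, b):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def select_models(preds, scores, corr_limit):
    """Visit models from best to worst score; keep those not too correlated with kept ones."""
    order = sorted(preds, key=lambda name: scores[name], reverse=True)
    chosen = []
    for name in order:
        if all(abs(pearson(preds[name], preds[c])) < corr_limit for c in chosen):
            chosen.append(name)
    return chosen

def accuracy(avg_scores, y):
    return sum((s >= 0.5) == bool(t) for s, t in zip(avg_scores, y)) / len(y)

def best_prefix(chosen, preds, y):
    """Score the average of the first k chosen models for each k; return the best (k, accuracy)."""
    results = []
    for k in range(1, len(chosen) + 1):
        avg = [sum(preds[m][i] for m in chosen[:k]) / k for i in range(len(y))]
        results.append((k, accuracy(avg, y)))
    # Highest accuracy wins; ties go to the smaller ensemble.
    return max(results, key=lambda r: (r[1], -r[0]))

# Hypothetical validation labels, predicted probabilities, and per-model scores.
y = [1, 0, 1, 0, 1, 0]
preds = {
    "m1": [0.9, 0.2, 0.8, 0.4, 0.4, 0.1],
    "m2": [0.8, 0.3, 0.9, 0.3, 0.45, 0.2],  # nearly a copy of m1
    "m3": [0.6, 0.4, 0.3, 0.2, 0.9, 0.6],
}
scores = {"m1": 0.80, "m2": 0.75, "m3": 0.60}

chosen = select_models(preds, scores, corr_limit=0.8)
best_k, best_acc = best_prefix(chosen, preds, y)
print(chosen, best_k, best_acc)
```

In this toy example the second model is dropped because it is highly correlated with the first, while the weaker but less correlated third model improves the combined accuracy.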