## Linear Regression

1. **What is linear regression ?**
   - An algorithm that uses a **linear combination of features to predict a continuous target**
   - **Each feature is multiplied by a weight** before the terms are summed
   - The weights are adjusted to reduce the error between the linear combination and the actual target value

   <br>
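
The linear combination described above can be sketched in a few lines of plain Python. The function name `predict`, the weights, and the data point are made up for illustration only.

```python
def predict(weights, bias, features):
    """Return the linear combination bias + sum(w_j * x_j)."""
    return bias + sum(w * x for w, x in zip(weights, features))


# Hypothetical weights for a model with two features.
weights = [2.0, -1.0]
bias = 0.5

# One made-up data point and its actual target value.
features = [3.0, 4.0]
target = 3.0

prediction = predict(weights, bias, features)  # 0.5 + 2*3 - 1*4 = 2.5
error = prediction - target                    # the quantity the weights are tuned to reduce
```

Fitting the model means choosing `weights` and `bias` so that errors like this one are small across the whole training set.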

2. **What are the assumptions of a linear regression model ?**
   - **Constant variance** of the residuals across the range of fitted values (homoscedasticity)
   - **Normal distribution** of the residuals of the model
   - **Independence** between data points (lack of autocorrelation in the residuals)
   - **Linearity** of the relationship between the features and the target

   <br>

3. **What are the methods to test the assumptions ?**
   - **Heteroscedasticity** if the residual plot (residuals against fitted values) shows a fanning pattern
   - **Non-normality** if the residual quantiles do not line up with the normal quantiles in the Q-Q plot
   - **Dependence or non-linearity** if clear trends are observed in a sequence of points in the residual plot

   <br>
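
The fanning pattern is normally judged by eye on a residual plot. As a rough numerical stand-in (not a formal test), the sketch below fits a one-feature model and compares the residual spread in the lower and upper halves of the fitted values. The data, the helper names, and the idea of splitting in halves are all assumptions made for this illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (intercept, slope)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def mean_square(values):
    """Average squared value, used here as a simple measure of spread."""
    return sum(v * v for v in values) / len(values)


# Made-up data whose noise grows with x (alternating sign, no randomness).
xs = [float(i) for i in range(1, 21)]
ys = [2.0 * x + 0.1 * x * (-1) ** i for i, x in enumerate(xs, start=1)]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# xs is sorted and the slope is positive, so the fitted values are sorted too.
half = len(xs) // 2
spread_ratio = mean_square(residuals[half:]) / mean_square(residuals[:half])
```

A `spread_ratio` well above 1 corresponds to residuals that fan out as the fitted value grows; a value near 1 is what homoscedastic residuals would give.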
   
4. **What are the steps one can take to correct for the violated assumptions ?**
   - **Heteroscedasticity** can often be reduced by taking the log of the features / target variable
   - **Non-normality** of residuals can often be reduced by taking the log of the features / target variable
   - **Dependence** can be taken into account by including lag terms or using a more complex time series model
   - **Non-linearity** can be taken into account by including squared or interaction terms
   
   <br>

5. **What is multicollinearity, and why is it a problem ?**
   
   - Multicollinearity is when **two or more features are linearly correlated with each other**
   - The **coefficients** of the regression model become **unstable** and may not reflect the true coefficients
   - The influence of collinear features on the target is **split arbitrarily among the coefficients of those features**
   - Removing or **adding a data point** might **change the coefficients** significantly
   
   <br>
   
6. **How do we detect multicollinearity ? Under what circumstances can we overlook multicollinearity ?**

   - Compute the **Variance Inflation Factor (VIF), only for linear regression**
     - Loop through the features, regressing each feature $i$ on all the other features
     - The VIF of feature $i$ is $\frac{1}{1-R_i^2}$, where $R_i^2$ is the $R^2$ of that regression
   - **VIF > 10 is generally regarded as a sign of collinearity**
   - We can **overlook multicollinearity** if we are only **concerned with predictions**
   - But the test data must have the same collinearity structure as the training data

   <br>
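
A minimal sketch of the VIF formula, restricted to a model with exactly two features. In that special case, regressing one feature on the other gives an $R^2$ equal to their squared correlation, so no matrix algebra is needed; with more features each $R_i^2$ would come from a multiple regression instead. The data and function names are made up for illustration.

```python
def r_squared(target, predictor):
    """R^2 of a one-predictor least-squares fit (the squared correlation)."""
    n = len(target)
    t_mean = sum(target) / n
    p_mean = sum(predictor) / n
    stt = sum((t - t_mean) ** 2 for t in target)
    spp = sum((p - p_mean) ** 2 for p in predictor)
    stp = sum((t - t_mean) * (p - p_mean) for t, p in zip(target, predictor))
    return stp ** 2 / (stt * spp)


def vif(feature, other_feature):
    """VIF = 1 / (1 - R_i^2) for a model with exactly two features."""
    return 1.0 / (1.0 - r_squared(feature, other_feature))


# Made-up features: x2 is nearly a multiple of x1, x3 is almost unrelated to x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
x3 = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0]

vif_collinear = vif(x1, x2)   # far above the rule-of-thumb threshold of 10
vif_unrelated = vif(x1, x3)   # close to 1, the minimum possible value
```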
   
7. **What is the interpretation of $\beta$ ? What does $\beta_0$ (the bias) mean ?**

   - A one-unit **increase in a feature** is associated with a **change of $\beta$ in the target** value (an increase if $\beta > 0$, a decrease if $\beta < 0$), holding all the other features constant
   - $\beta_0$ represents the **mean value of the target when all the features have a value of zero**

   <br>

8. **How does the interpretation of $\beta$ change if log is taken on the features / target variables ?**
   - **Logging the target:** 
   
     A one-unit increase in a feature is associated with approximately a (100 x $\beta$) % increase / decrease in the target (for small $\beta$)
     
     <br>
     
   - **Logging a feature:** 
   
     A 1 % increase in a feature is associated with approximately a ($\beta$ / 100) unit increase / decrease in the target
   
     <br>

   - **Logging a feature and the target:** 
   
     A 1 % increase in a feature is associated with approximately a $\beta$ % increase / decrease in the target
   
     <br>
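
The three rules of thumb above are approximations that work well when $\beta$ is small. The arithmetic below compares each rule with the exact change implied by the model, using a hypothetical coefficient of 0.05 and natural logs.

```python
import math

beta = 0.05  # hypothetical coefficient, chosen only for illustration

# Log target: log(y) = b0 + beta * x. Raising x by 1 multiplies y by exp(beta).
exact_pct_change = (math.exp(beta) - 1.0) * 100.0   # a little over 5
approx_pct_change = 100.0 * beta                    # the rule of thumb

# Log feature: y = b0 + beta * log(x). Raising x by 1 % adds beta * log(1.01).
exact_unit_change = beta * math.log(1.01)
approx_unit_change = beta / 100.0

# Log-log: log(y) = b0 + beta * log(x). Raising x by 1 % multiplies y by 1.01 ** beta.
exact_loglog_pct = (1.01 ** beta - 1.0) * 100.0
approx_loglog_pct = beta
```

For larger coefficients the gap between the exact and approximate values widens, so the exact form `exp(beta) - 1` is the safer one to report for a logged target.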

9. **What is the interpretation of the p-value of $\beta$ ? What is the null and alternative hypothesis ?**

   - **Null:** $\beta =$ 0 
   - **Alternative:** $\beta \neq$  0
   - If the p-value is less than the significance level, we can reject the null of $\beta =$ 0 (under which the feature has no linear association with the target, given the other features) and conclude that the coefficient is significantly different from zero

   <br>

10. **Can you compare beta coefficients in multiple linear regression ? If not, what will allow us to ?**

   - Not directly, because each coefficient depends on the scale (units) of its feature
   - **Normalizing** (subtracting the mean and dividing by the standard deviation, column-wise) **the features and target** puts the coefficients on a comparable scale
   - Since the variables are normalized, we do not need to fit a bias ($\beta_0$) term

   <br>
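
To illustrate why raw coefficients are not comparable, the sketch below fits the same made-up target against one quantity expressed in two units. It uses a single feature to stay short (the question concerns multiple regression, where the same scaling argument applies per feature). All names and numbers are illustrative assumptions.

```python
import math


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (intercept, slope)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


# The same made-up measurements in metres and in centimetres.
metres = [1.0, 2.0, 3.0, 4.0, 5.0]
centimetres = [100.0 * m for m in metres]
target = [2.0, 4.1, 5.9, 8.2, 9.8]

# Raw slopes differ by a factor of 100 purely because of the units.
_, raw_slope_m = fit_line(metres, target)
_, raw_slope_cm = fit_line(centimetres, target)

# After standardizing, the slope no longer depends on the units
# and the fitted intercept is zero (up to rounding).
b0_m, std_slope_m = fit_line(standardize(metres), standardize(target))
b0_cm, std_slope_cm = fit_line(standardize(centimetres), standardize(target))
```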

11. **What are some causes for under-fitting ? How to avoid them ? What disadvantages do they bring ?**

   - **Causes:**
     - Overly simplistic model, not enough features
     - Not enough polynomial terms / transformations of features
 
   - **Consequences:**
     - High bias, low variance
     - Inaccurate predictions due to under-learning from the data
   
   - **Solution:**
     - Add more features or use a more complex (non-linear) model

   <br>
   
12. **What are some causes for over-fitting ? How to avoid them ?**

   - **Causes:**
     - Overly complex model, too many features
     - Too many polynomial terms / transformations of features
 
   - **Consequences:**
     - Low bias, high variance
     - Inaccurate predictions due to over-learning from noise and edge cases in the data
   
   - **Solution:**
     - Regularization
     - Feature selection
     - Use K-fold cross validation to decide complexity of model
     
   <br>

13. **How to spot an outlier in multiple linear regression ?**

   - **Univariate scatter plots / Boxplots**
   - **Normalized residual plot:**
     - Points more than 2 standard deviations from the mean are considered outliers
   - A high residual in addition to high leverage makes an influential outlier

   <br>
   
14. **What does it mean in multiple regression when a point has high leverage ? What constitutes an influential point ? Why are they important ?**

   - **High leverage** means some feature(s) of a data point deviate far from the mean value of the feature(s) across data points
   - **Influence = Leverage x Residual**
   - High leverage and high residual data points are influential, i.e. they affect the model ($\beta$s) a lot when absent / present
   - **Important because you do not want a few data points controlling your model, and if they do, you should know why**
    
    <br>
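
A small sketch of leverage and influence for the one-feature case, where leverage has the simple form $h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}$. Influence is shown here in the "absent / present" sense, by refitting without the suspect point. The data are made up, and the single-feature setting is a simplifying assumption.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (intercept, slope)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def leverages(xs):
    """Leverage of each point in a one-feature regression with an intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return [1.0 / n + (x - x_mean) ** 2 / sxx for x in xs]


# Made-up data: the last point is far from the other x values and off the trend.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 5.0]

h = leverages(xs)

# Compare the fitted slope with and without the high-leverage point.
_, slope_with = fit_line(xs, ys)
_, slope_without = fit_line(xs[:-1], ys[:-1])
slope_shift = abs(slope_with - slope_without)
```

Because a high-leverage point pulls the fitted line towards itself, its raw residual can look unremarkable; refitting without it is a more direct way to see how much it controls the coefficients.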

15. **How to determine how many features to select for your model ? What difference does it make if you are fitting on a large (does not fit into memory) dataset ?**

   - **Forward selection / Backward elimination** (Not feasible with big dataset, computationally intensive)
   - **Lasso** (L1) regularization (shrink unimportant features towards zero)
   - **Random forest feature importance**
   - **Adjusted $R^2$ with K-fold cross validation** (F1 score with K-fold if logistic regression)
   - Product knowledge
   - Univariate correlation / chi-square (if categorical) with target

    <br>
  
16. **What is the cost function for linear regression ?**

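   - **Mean Squared Error (MSE)**, $\frac{1}{n}\sum_{i=1}^n (\hat{y}_i - y_i)^2$ (the constant factor $\frac{1}{n}$ does not change the minimizer and is often dropped)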

    <br>
    
17. **Using the cost function, describe two algorithms to fit a linear regression model.**
  
   **Analytical Approach**
   - Take the partial derivative of the cost function with respect to each $\beta_i$. Together these form the gradient of the cost with respect to $\beta$

     $$\frac{\partial}{\partial{\beta_i}} \sum_{d=1}^n (\beta^T x_d - y_d)^2$$

   - Set the gradient to 0 and solve for $\beta$
   
   <br>
     
   **Gradient Descent**
   - Used when regularization / transformation / polynomial terms are involved (no closed-form solution, or one that is hard to compute)
   - Randomly initialize the parameters and compute the **gradient (as shown above)** and **cost**
   - Subtract the product of the learning rate and the gradient from the parameters when minimizing a cost (add it when maximizing, e.g. a likelihood)
   - Repeat until the cost converges to its minimum / maximum (depending on the cost function)
   - In practice, second-order methods (which use the Hessian, the gradient of the gradient) or quasi-Newton methods that approximate it without forming it explicitly ([LBFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS)) are often used, as they usually converge in fewer iterations

   <br>
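
The two approaches can be compared on a tiny one-feature problem. This is a sketch under stated assumptions: the data are made up, the learning rate and step count were picked by hand for this data, and the parameters start at zero rather than at random values so the run is repeatable.

```python
def cost(b0, b1, xs, ys):
    """Mean squared error of the line b0 + b1 * x."""
    return sum((b0 + b1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_analytic(xs, ys):
    """Closed-form solution obtained by setting the gradient of the cost to zero."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1


def fit_gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Repeatedly step against the gradient of the mean squared error."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [b0 + b1 * x - y for x, y in zip(xs, ys)]
        grad_b0 = 2.0 * sum(errors) / n
        grad_b1 = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        # Minimizing, so move in the direction opposite to the gradient.
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.1, 4.9, 7.2, 8.8]

an_b0, an_b1 = fit_analytic(xs, ys)
gd_b0, gd_b1 = fit_gradient_descent(xs, ys)
```

On this data both routes arrive at essentially the same line. A learning rate that is too large for the data would make the gradient descent loop diverge, which is one reason the value has to be tuned.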
  
18. **What are the differences between running linear regression on a large (does not fit into memory) vs a small dataset ?**

   **Optimization**

   - Cannot run the optimization on the whole dataset at once
   - Use stochastic / mini-batch gradient descent instead
   - Compute the gradient and update the parameters $\beta$ on one / a few data points at a time
   - Repeat until the cost converges
   
   <br>
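
A minimal mini-batch sketch of the optimization steps above. The generator `stream_batches` is a hypothetical stand-in for reading a large file chunk by chunk; it synthesizes rows from $y = 1 + 2x$ plus noise so the example is self-contained. The batch size, learning rate, and single pass over the data are assumptions; a real run would keep going until the cost stops improving.

```python
import random


def stream_batches(n_rows, batch_size, rng):
    """Yield (xs, ys) batches one at a time, never holding all rows in memory."""
    for start in range(0, n_rows, batch_size):
        size = min(batch_size, n_rows - start)
        xs = [rng.uniform(-1.0, 1.0) for _ in range(size)]
        ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.1) for x in xs]
        yield xs, ys


def sgd_fit(batches, learning_rate=0.1):
    """Update the parameters once per mini-batch using that batch's gradient."""
    b0, b1 = 0.0, 0.0
    for xs, ys in batches:
        n = len(xs)
        errors = [b0 + b1 * x - y for x, y in zip(xs, ys)]
        grad_b0 = 2.0 * sum(errors) / n
        grad_b1 = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1


rng = random.Random(0)  # fixed seed so the run is repeatable
b0, b1 = sgd_fit(stream_batches(n_rows=20000, batch_size=32, rng=rng))
```

The estimates hover around the generating values rather than landing on them exactly, because each update sees only a noisy sample of the gradient.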
   
   **Interpretation**
   
   - p-values of the $\beta$s are less informative: even very small effects become statistically significant given enough data
   - The normality assumption matters less, since the sampling distribution of the coefficients is approximately normal for large samples
   - Usually cannot examine the residual plot for diagnostics (unless data points are sampled)

<br>

## Logistic Regression

**Questions for Linear Regression are applicable to Logistic Regression, except for the ones listed below.**  

<br>

1. **What is the difference between logistic regression and linear regression ?**

   - Logistic regression applies the logistic (sigmoid) function, the inverse of the logit, to a linear combination of features to model a **discrete, binary target**
   - Logistic regression models the **probability of obtaining a positive class** (`y=1`)
   - A **linear relationship would not suffice in modeling the probability**: a straight line is unbounded and can leave the $[0, 1]$ range, whereas the probability should change quickly with the features near the decision boundary between the positive and negative class and flatten out towards 0 and 1
   - $p(\text{Y=1 | X; }\theta) = \frac{1}{1 + e^{-\theta^TX}}$
   
   <br>
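
The formula above can be sketched directly. The parameter values are hypothetical, and `predict_proba` is just an illustrative name. The example also shows that the same change in a feature moves the probability much more near the decision boundary than far from it.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(theta, features):
    """P(y = 1 | x). theta[0] is the intercept; features excludes the constant 1."""
    z = theta[0] + sum(t * x for t, x in zip(theta[1:], features))
    return sigmoid(z)


theta = [-1.0, 2.0]  # hypothetical intercept and one weight

p_boundary = predict_proba(theta, [0.5])  # z = 0, so the probability is 0.5
p_near = predict_proba(theta, [0.6])      # a step of 0.1 near the boundary
p_far_a = predict_proba(theta, [3.0])
p_far_b = predict_proba(theta, [3.1])     # the same step far from the boundary

change_near = p_near - p_boundary
change_far = p_far_b - p_far_a
```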

2. **How do you use logistic regression to tackle a classification problem with more than two classes ?**

   - **Multiple one-vs-rest logistic regression models**, or **softmax (multinomial logistic) regression**, which models all $k$ classes jointly
   - $p(\text{Y=i | X; }\theta) = \frac{e^{\theta_i^TX}}{\sum_{j=1}^k e^{\theta_j^TX}}$

   <br>
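
A minimal sketch of the softmax probabilities for three classes. The per-class parameter vectors are made up, and subtracting the largest score before exponentiating is a common numerical safeguard that leaves the probabilities unchanged.

```python
import math


def softmax(scores):
    """Turn one score per class into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift guards against overflow
    total = sum(exps)
    return [e / total for e in exps]


def class_probabilities(thetas, features):
    """One linear score theta_i . x per class, passed through softmax."""
    scores = [sum(t * x for t, x in zip(theta, features)) for theta in thetas]
    return softmax(scores)


# Hypothetical parameters for k = 3 classes.
thetas = [[0.0, 1.0], [0.5, -1.0], [-0.5, 0.2]]
features = [1.0, 2.0]  # the leading 1.0 multiplies the intercept

probs = class_probabilities(thetas, features)
predicted_class = probs.index(max(probs))
```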

3. **Interpretation of logistic regression model coefficients**

   - Say $\beta_1 = -0.0621$, then a one-unit increase in feature 1 multiplies the odds of observing target = 1 by $e^{-0.0621} = 0.94$
   - Say $\beta_1 = 0.0621$, then a one-unit increase in feature 1 multiplies the odds of observing target = 1 by $e^{0.0621} = 1.06$
   - Say $\beta_1 = 0.0621$ and feature 1 is $\log_2$ transformed, then doubling feature 1 multiplies the odds of observing target = 1 by $e^{0.0621} = 1.06$

   <br>
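
The odds interpretation can be checked numerically: raising the feature by one unit multiplies the odds by $e^{\beta_1}$ regardless of the starting point. The intercept and the starting value of the feature below are made up; only $\beta_1 = -0.0621$ comes from the example above.

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)


beta_0 = 0.3       # hypothetical intercept
beta_1 = -0.0621   # coefficient from the example
x = 10.0           # arbitrary starting value of feature 1

p_before = sigmoid(beta_0 + beta_1 * x)
p_after = sigmoid(beta_0 + beta_1 * (x + 1.0))

# Ratio of the odds after and before a one-unit increase in the feature.
odds_ratio = odds(p_after) / odds(p_before)
```

Note that the probability itself does not change by a fixed factor; only the odds do.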
   
4. **How to find the decision boundary in the feature space for a trained logistic regression model ? (Work through an example when regressed with 1 feature)**

   - **Say $p=0.5$ for the boundary between positive and negative classes**, and we have $\theta$ since the model is trained 
   
     $$0.5 = \frac{1}{1 + e^{-\theta^TX}}$$
     
     <br>
     
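   - Rearrange the equation above to get X on the left hand side and plot the line. The equation reduces to $\theta^TX = 0$, so with one feature and an intercept the boundary is the single point $x = -\theta_0 / \theta_1$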
   
   <br>
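
A worked one-feature example of the rearrangement, assuming hypothetical trained parameters $\theta_0 = -3$ and $\theta_1 = 1.5$. Solving $p = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x)}}$ for $x$ gives $x = \frac{\log\frac{p}{1-p} - \theta_0}{\theta_1}$, which also covers thresholds other than 0.5.

```python
import math


def boundary(theta_0, theta_1, p=0.5):
    """Feature value at which the predicted probability equals p."""
    log_odds = math.log(p / (1.0 - p))  # 0 when p = 0.5
    return (log_odds - theta_0) / theta_1


theta_0, theta_1 = -3.0, 1.5  # hypothetical trained parameters

x_star = boundary(theta_0, theta_1)  # -(-3.0) / 1.5 = 2.0

# Plugging the boundary back into the model recovers the threshold.
p_at_boundary = 1.0 / (1.0 + math.exp(-(theta_0 + theta_1 * x_star)))

# A stricter threshold moves the boundary to a larger x when theta_1 > 0.
x_strict = boundary(theta_0, theta_1, p=0.9)
```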

5. **What is the cost function for logistic regression ?**

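   - **Likelihood** (with $\sigma(z) = \frac{1}{1 + e^{-z}}$ the logistic function and $(x_i, y_i)$ the $i$-th data point):

      $$\prod_{i=1}^n \sigma(\theta^Tx_i)^{y_i} \times (1 - \sigma(\theta^Tx_i))^{(1-y_i)}$$

      <br>

   - **Log Likelihood:**

      $$\sum_{i=1}^n \left[ y_i \cdot \log(\sigma(\theta^Tx_i)) + (1-y_i) \cdot \log(1 - \sigma(\theta^Tx_i)) \right]$$

      <br>

   - **Negative Log Likelihood** (the cost to minimize):

      $$- \sum_{i=1}^n \left[ y_i \cdot \log(\sigma(\theta^Tx_i)) + (1-y_i) \cdot \log(1 - \sigma(\theta^Tx_i)) \right]$$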

   <br>
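
The negative log likelihood can be evaluated directly for any candidate parameters. The four data points and the three parameter vectors below are made up to show that parameters pointing the right way score lower than uninformative or wrong ones.

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def negative_log_likelihood(theta, rows, labels):
    """Sum over data points of -[y log(p) + (1 - y) log(1 - p)]."""
    total = 0.0
    for features, y in zip(rows, labels):
        p = sigmoid(sum(t * x for t, x in zip(theta, features)))
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total


# Each row starts with a constant 1.0 for the intercept.
rows = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
labels = [0, 0, 1, 1]

nll_zero = negative_log_likelihood([0.0, 0.0], rows, labels)  # 4 * log(2)
nll_good = negative_log_likelihood([0.0, 2.0], rows, labels)
nll_bad = negative_log_likelihood([0.0, -2.0], rows, labels)
```

This sketch takes `log(p)` directly, so it would fail if a prediction reached exactly 0 or 1; production code usually clips the probabilities or works with the log-odds instead.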
   
6. **The coefficient of a feature (in logistic regression) has the opposite sign of what you expect. What might be the problem ?**

   **Check for:**
   - Data validity 
   - Outliers 
   - Multicollinearity (Solution: PCA, regularization)
   - Confounders

<br>

## Regularization

1. **Why is regularization necessary ?**

   - To reduce the variance of a model that over-fits
   - To make the model fit the noise in the data less and the underlying phenomenon / signal more

   <br>
   
2. **What are some ways to regularize a logistic regression model ?**

   - Lasso (L1) 
   - Ridge (L2)
   - PCA
   - Select fewer features / Better feature selection (See Q15 above)

   <br>
   
3. **Briefly describe the principles behind (Lasso and Ridge) regularization ? What are the differences between Lasso and Ridge regularization ?**
   
   - Lasso / Ridge regularization can be thought of as a kind of Bayesian (maximum a posteriori) estimate of the beta coefficients
   - Lasso (L1) corresponds to a Laplace prior on the coefficients
   - Ridge (L2) corresponds to a Gaussian prior on the coefficients
   - As seen in the illustration below, the Laplace distribution favors values at zero and tolerates large values, while the Gaussian favors small values
     ![](images/linear_logistic_regression/l1_l2.png)
   - Given two features that are highly correlated, L1 tends to assign a zero coefficient to one feature and a large coefficient to the other, while L2 tends to assign small coefficients to both features
   - L1 performs feature selection, resulting in a model with fewer features, but L2 often performs better in prediction
   - Elastic Net (a combination of L1 and L2) is (in theory) better than L1 / L2 alone
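   - **Cost functions for L1 and L2** (the penalty sums over the non-bias coefficients $\beta_1, \dots, \beta_{n-1}$; the bias $\beta_0$ is not penalized):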
     - **L1:** 
     
       $$\text{Cost Function (linear / logistic)} + \lambda \sum_{i=1}^{n-1} |\beta_i|$$
       
       <br>
       
     - **L2:**
     
       $$\text{Cost Function (linear / logistic)} + \lambda \sum_{i=1}^{n-1} \beta_i^2$$
       
     - $\lambda$ should be selected based on prediction performance on K-fold cross validation

   - [Reference 1](https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization), 
     [Reference 2](http://statweb.stanford.edu/~jtaylo/courses/stats203/notes/penalized.pdf)
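
The different behaviour of the two penalties is easiest to see with a single centred feature and no intercept, where both penalized costs have closed-form minimizers. This simplified setting is an assumption made for the sketch: ridge minimizes $\sum (y - \beta x)^2 + \lambda \beta^2$, lasso minimizes $\sum (y - \beta x)^2 + \lambda |\beta|$, and the data are made up.

```python
def ridge_coefficient(xs, ys, lam):
    """Minimizer of sum((y - b*x)^2) + lam * b^2: shrinks but never reaches zero."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + lam)


def lasso_coefficient(xs, ys, lam):
    """Minimizer of sum((y - b*x)^2) + lam * |b|: soft-thresholding."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    if sxy > lam / 2.0:
        return (sxy - lam / 2.0) / sxx
    if sxy < -lam / 2.0:
        return (sxy + lam / 2.0) / sxx
    return 0.0  # the penalty outweighs the fit improvement


# Made-up centred data with a weak relationship between x and y.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-0.3, 0.1, 0.0, -0.1, 0.5]

ols = ridge_coefficient(xs, ys, 0.0)      # no penalty
ridge_at_4 = ridge_coefficient(xs, ys, 4.0)
lasso_at_1 = lasso_coefficient(xs, ys, 1.0)
lasso_at_4 = lasso_coefficient(xs, ys, 4.0)
```

With the larger penalty the lasso coefficient is exactly zero (the feature is dropped), while the ridge coefficient is only pulled towards zero.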

<br>

## Evaluation

1. **Explain what precision, recall and specificity are.**

   <br>

2. **What are the differences between a precision-recall curve and a ROC curve ?**

   <br>
   
3. **You have built a logistic regression model to predict fraud, how do you decide the threshold for deciding fraud or not fraud ?**

<br>

## Validation

1. **How do you test a model to ensure it is robust ?**

   <br>

2. **Explain K-fold cross validation.** 

   <br>

3. **Given you have trained and tested a model with cross validation and you deploy the model and it performs poorly. Provide some reasonable explanations.**

