## Linear Regression

1. **What is linear regression ?**
   - An algorithm that uses a linear combination of features to predict a continuous target
   - Each feature is multiplied by its own weight before the terms are summed
   - The weights are adjusted to reduce the error between the linear combination and the actual target value
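
In symbols, a minimal way to write the model described above, assuming $p$ features $x_1, \dots, x_p$ with weights $\beta_1, \dots, \beta_p$ and a bias $\beta_0$ (notation borrowed from the later questions):

$$\hat{y} = \beta_0 + \sum_{j=1}^{p} \beta_j x_j$$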

   <br>

2. **What are the assumptions of a linear regression model ?**
   - **Constant variance** of the residuals across fitted values (homoscedasticity)
   - **Normal distribution** of the residuals of the model
   - **Independence** of the residuals across data points (lack of autocorrelation)
   - **Linearity** of the relationship between the target and the linear combination of features

   <br>

3. **What are the methods to test the assumptions ?**
   - **Heteroscedasticity** if the residual plot (residuals vs. fitted values) shows a fanning pattern
   - **Non-normality** if the residual quantiles do not line up with the normal quantiles in the Q-Q plot
   - **Dependence or non-linearity** if clear trends are observed in a sequence of points in the residual plot
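
As a numeric complement to eyeballing the residual plot for dependence, one option is the Durbin-Watson statistic. It is not mentioned in the notes above; it is swapped in here as one common summary of lag-1 autocorrelation. Values near 2 suggest little autocorrelation, values near 0 suggest positive autocorrelation and values near 4 suggest negative autocorrelation. The function name `durbin_watson` and the toy residual lists are illustrative assumptions.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(r ** 2 for r in residuals)
    return num / den

# Toy residuals (made up for illustration), listed in time order.
trending = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]     # slow drift: positive autocorrelation
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # sign flips: negative autocorrelation

dw_trending = durbin_watson(trending)
dw_alternating = durbin_watson(alternating)
print(dw_trending, dw_alternating)
```

The drifting residuals give a value well below 2, while the alternating ones give a value well above 2.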

   <br>
   
4. **What are the steps one can take to correct for the violated assumptions ?**
   - **Heteroscedasticity** can often be reduced by taking the log of the features / target variable
   - **Non-normality** of residuals can often be reduced by taking the log of the features / target variable
   - **Dependence** can be accounted for by including lag terms or using a more complex time series model
   - **Non-linearity** can be accounted for by including squared or interaction terms
   
   <br>

5. **What is multicollinearity, and why is it a problem ?**
   
   - Multicollinearity is when **two or more features are linearly correlated with each other**
   - The **coefficients** of the regression model become **unstable** and do not reflect the true coefficients
   - The influence of collinear features on the target is **arbitrarily split among the coefficients of those features**
   - Removing or **adding a data point** might **change coefficients** significantly
   
   <br>
   
6. **How do we detect multicollinearity ? Under what circumstances can we overlook multicollinearity ?**

   - Compute the **Variance Inflation Factor (VIF), only for linear regression**
     - Loop through the features, regressing each feature $i$ on all the other features
     - The VIF of feature $i$ is $\frac{1}{1-R_i^2}$, where $R_i^2$ is the $R^2$ of that regression
   - **VIF > 10 in general is regarded as a sign of collinearity**
   - We can overlook multicollinearity if we are only concerned about getting accurate predictions
   - But the test data must have the same collinearity structure as the training data
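
The VIF loop described above can be sketched with NumPy alone. This is a minimal illustration, not a reference implementation: the helper name `vif`, the synthetic features `x1` to `x4` and the random seed are all assumptions made for the example. Here `x3` is built as a near-exact sum of `x1` and `x2`, while `x4` is unrelated to the rest.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (shape: n_samples x n_features)."""
    n, p = X.shape
    out = np.empty(p)
    for i in range(p):
        target = X[:, i]                      # feature i plays the role of the target
        others = np.delete(X, i, axis=1)      # all remaining features
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[i] = 1.0 / (1.0 - r2)
    return out

# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + 0.1 * rng.normal(size=200)    # almost a linear combination of x1 and x2
x4 = rng.normal(size=200)                     # independent of the others
X = np.column_stack([x1, x2, x3, x4])

vifs = vif(X)
print(np.round(vifs, 1))
```

The collinear columns should come out far above the rule-of-thumb threshold of 10, while the independent column stays close to 1.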

   <br>
   
7. **What is the interpretation of $\beta$ ? What does $\beta_0$ (the bias) mean ?**

   - A one-unit increase in a feature is associated with a change of $\beta$ in the target value (an increase or decrease depending on the sign), holding all the other features constant
   - $\beta_0$ represents the mean value of the target when all the features have a value of zero

   <br>

8. **How does the interpretation of $\beta$ change if log is taken on the features / target variables ?**
   - **Logging the target:** 
   
     A one-unit increase in a feature is associated with an approximately (100 x $\beta$) % increase / decrease in the target
     
     <br>
     
   - **Logging a feature:** 
   
     A 1 % increase in a feature is associated with an approximately ($\beta$ / 100) unit increase / decrease in the target
   
     <br>

   - **Logging a feature and the target:** 
   
     A 1 % increase in a feature is associated with an approximately $\beta$ % increase / decrease in the target
   
     <br>
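
The percentage readings above are approximations that hold for small changes. A short derivation, assuming a single feature $x$ and natural logs:

$$\log y = \beta_0 + \beta x \;\Rightarrow\; \frac{y(x+1)}{y(x)} = e^{\beta} \;\Rightarrow\; \text{change in } y = 100\,(e^{\beta} - 1)\,\% \approx 100\,\beta\,\%$$

$$y = \beta_0 + \beta \log x \;\Rightarrow\; y(1.01\,x) - y(x) = \beta \log(1.01) \approx \frac{\beta}{100}$$

$$\log y = \beta_0 + \beta \log x \;\Rightarrow\; \frac{y(1.01\,x)}{y(x)} = 1.01^{\beta} \approx 1 + \frac{\beta}{100}$$

For example, $\beta = 0.05$ on a logged target gives $100\,(e^{0.05} - 1) \approx 5.13\,\%$, close to the approximate $5\,\%$.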

9. **What is the interpretation of the p-value of $\beta$ ? What is the null and alternative hypothesis ?**

   - **Null:** $\beta =$ 0 
   - **Alternative:** $\beta \neq$  0
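   - The p-value is the probability of observing an estimate at least as extreme as the fitted $\beta$ if the null were true
   - If the p-value is less than the significance level, we reject the null of $\beta =$ 0, i.e. there is evidence that the feature is associated with changes in the target variable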
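
A rough sketch of how such a test can be computed by hand. It uses the usual standard-error formula from $\hat{\sigma}^2 (A^T A)^{-1}$, but, to stay within NumPy and the standard library, it approximates the t distribution with a standard normal, which is only reasonable for large samples. The helper name `coef_p_values`, the synthetic data and the seed are assumptions for illustration: `x` truly drives `y`, while `z` is pure noise.

```python
import math
import numpy as np

def coef_p_values(X, y):
    """Return (beta, t statistics, two-sided p-values) for OLS with an intercept."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])           # prepend the bias column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = n - A.shape[1]                           # residual degrees of freedom
    sigma2 = resid @ resid / dof                   # estimated error variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    t_stats = beta / se                            # test statistic for H0: beta = 0
    # Large-sample normal approximation: p = P(|Z| >= |t|) = erfc(|t| / sqrt(2)).
    p = np.array([math.erfc(abs(t) / math.sqrt(2.0)) for t in t_stats])
    return beta, t_stats, p

# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
z = rng.normal(size=n)                             # unrelated to y
y = 1.0 + 2.0 * x + rng.normal(size=n)

beta, t_stats, p_values = coef_p_values(np.column_stack([x, z]), y)
print(np.round(beta, 2), np.round(p_values, 4))
```

The p-value for `x` should be essentially zero, so the null of $\beta = 0$ is rejected for that feature. For exact small-sample p-values a t distribution with `dof` degrees of freedom would be used instead of the normal.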

   <br>

10. **Can you compare beta coefficients in multiple linear regression ? If not, what will allow us to ?**

   - Not directly, because each coefficient depends on the scale (units) of its feature
   - Standardizing (subtracting the mean and dividing by the standard deviation, column-wise) the features and target makes the coefficients comparable
   - Since the variables are centered, we do not need to fit a bias ($\beta_0$) term
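
A small sketch of the effect of standardizing. The helpers `fit_ols` and `standardize`, the two synthetic features and their scales are assumptions chosen so that both features contribute equally to the target even though their raw coefficients differ by a factor of about 100.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns [beta_0, beta_1, ...]."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def standardize(a):
    """Subtract the mean and divide by the standard deviation, column-wise."""
    return (a - a.mean(axis=0)) / a.std(axis=0)

# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(scale=1.0, size=n)      # small-scale feature
x2 = rng.normal(scale=100.0, size=n)    # same kind of signal on a 100x larger scale
y = 3.0 * x1 + 0.03 * x2 + rng.normal(scale=0.5, size=n)
X = np.column_stack([x1, x2])

raw = fit_ols(X, y)                              # coefficients in original units
std = fit_ols(standardize(X), standardize(y))    # standardized coefficients

print("raw:         ", np.round(raw, 3))
print("standardized:", np.round(std, 3))
```

The raw coefficients suggest `x1` matters far more than `x2`, while the standardized ones are roughly equal. The fitted intercept on the standardized data is zero up to floating-point error, which is why the bias term can be dropped.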

   <br>

11. **What are some causes for under-fitting ? How to avoid them ? What disadvantages do they bring ?**

   - **Causes:**
     - Overly simplistic model, not enough features
     - Not enough polynomial terms / transformations of features
 
   - **Consequences:**
     - High bias, low variance
     - Inaccurate predictions due to under-learning from the data
   
   - **Solution:**
     - More features and a more complex (non-linear) model

   <br>
   
12. **What are some causes for over-fitting ? How to avoid them ?**

   - **Causes:**
     - Overly complex model, too many features
     - Too many polynomial terms / transformations of features
 
   - **Consequences:**
     - Low bias, high variance
     - Inaccurate predictions due to over-learning from noise and edge cases in the data
   
   - **Solution:**
     - Regularization
     - Feature selection
     - Use K-fold cross validation to decide complexity of model
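
Using K-fold cross validation to decide model complexity can be sketched as follows, with polynomial degree standing in for complexity. The helper `kfold_mse`, the cubic toy data, the choice of 5 folds and the seeds are all assumptions for illustration.

```python
import numpy as np

def kfold_mse(x, y, degree, k=5, seed=0):
    """Average validation MSE of a polynomial of the given degree over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)   # shuffled index folds
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coef = np.polyfit(x[train], y[train], degree)    # fit on k-1 folds
        pred = np.polyval(coef, x[val])                  # predict the held-out fold
        errors.append(np.mean((pred - y[val]) ** 2))
    return float(np.mean(errors))

# Synthetic cubic data (assumed for illustration only).
rng = np.random.default_rng(1)
x = np.linspace(-2.0, 2.0, 100)
y = x ** 3 - x + rng.normal(scale=0.1, size=x.size)

cv_errors = {d: kfold_mse(x, y, d) for d in range(1, 7)}
best_degree = min(cv_errors, key=cv_errors.get)
for d, err in cv_errors.items():
    print(d, round(err, 4))
print("best degree:", best_degree)
```

Degrees 1 and 2 under-fit the cubic signal and show a much larger validation error than degree 3. Beyond that the validation error stops improving, which is the signal to stop adding complexity.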
     
   <br>

13. **How to spot an outlier in multiple linear regression ?**

   - **Univariate scatter plots / Boxplots**
   - **Normalized residual plot:**
     - A residual more than 2 standard deviations from the mean is considered an outlier
   - A high residual value in addition to high leverage makes an influential outlier

   <br>
   
14. **What does it mean in multiple regression when a point has high leverage ? What constitutes an influential point ? Why are they important ?**

   - **High leverage** means some feature(s) of a data point deviate far from the mean value of the feature(s) across data points
   - **Influence = Leverage x Residual**
   - High leverage and high residual data points are influential, i.e. they affect the model ($\beta$s) a lot when absent / present
   - **Important because you do not want a few data points controlling your model, and if so, you should know why**
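
One way to put numbers on leverage and influence is the diagonal of the hat matrix together with Cook's distance. Cook's distance is not named in the notes above; it is used here as one standard measure that combines residual size and leverage. The helper `leverage_and_cooks` and the toy data (ten points on the line $y = 2x$ plus one far-away point) are assumptions for illustration.

```python
import numpy as np

def leverage_and_cooks(X, y):
    """Return (leverage, residuals, Cook's distance) for OLS with an intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    p = A.shape[1]                               # number of fitted parameters
    H = A @ np.linalg.inv(A.T @ A) @ A.T         # hat matrix: y_hat = H @ y
    h = np.diag(H)                               # leverage of each point
    resid = y - H @ y
    s2 = resid @ resid / (n - p)                 # residual variance estimate
    cooks = (resid ** 2 / (p * s2)) * h / (1.0 - h) ** 2
    return h, resid, cooks

# Ten points exactly on y = 2x, plus one point far out in x that is off the line.
x = np.append(np.arange(10.0), 30.0)
y = np.append(2.0 * np.arange(10.0), 0.0)

leverage, resid, cooks = leverage_and_cooks(x, y)
print("leverage:", np.round(leverage, 3))
print("cooks:   ", np.round(cooks, 2))
```

The last point has by far the largest leverage and Cook's distance, even though its raw residual is not the largest: it pulls the fitted line towards itself, which is exactly why such points deserve a closer look.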
    
    <br>

15. **How to determine how many features to include for your model ? What difference does it make if you are fitting on a large (does not fit into memory) dataset ?**
    
   - Forward selection / Backward elimination (not feasible with a big dataset, computationally intensive)
   - Lasso (L1) regularization (shrinks the coefficients of unimportant features to zero)
   - Random forest feature importance
   - Adjusted $R^2$ with K-fold cross validation (F1 score with K-fold if logistic regression)
   - Product knowledge

    <br>
  
16. **What is the cost function for linear regression ?**

   - **Mean Squared Error (MSE)**, $\frac{1}{n}\sum_{i=1}^n (\hat{y}_i - y_i)^2$

    <br>
    
17. **Using the cost function, describe two algorithms to fit a linear regression model.**
  
   **Analytical Approach**
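   - Take the partial derivative of the cost function with respect to each $\beta_i$, set it to zero and solve for $\beta$
     
     $$\frac{\partial}{\partial{\beta_i}} \sum_{d=1}^n (\beta^T x_d - y_d)^2 = 0$$
   - Solving the resulting system gives the normal equation $\beta = (X^T X)^{-1} X^T y$, where row $d$ of $X$ is $x_d^T$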
   
   **Gradient Descent**
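   - Used when there is no closed-form solution (e.g. with L1 regularization) or when the closed-form solution is too expensive to compute (many features / data points)
   - Repeatedly update $\beta \leftarrow \beta - \alpha \nabla_\beta J(\beta)$, where $J$ is the cost function and $\alpha$ is the learning rate, until convergence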
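
Both approaches can be sketched side by side on the same data, using the mean squared error as the cost. The function names, the learning rate of 0.1, the 2000 iterations and the synthetic data are assumptions for illustration; in practice the learning rate and stopping rule need tuning.

```python
import numpy as np

def fit_normal_equation(A, y):
    """Analytical fit: solve (A^T A) beta = A^T y."""
    return np.linalg.solve(A.T @ A, A.T @ y)

def fit_gradient_descent(A, y, lr=0.1, n_iters=2000):
    """Iterative fit: step against the gradient of the mean squared error."""
    n = len(y)
    beta = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = (2.0 / n) * A.T @ (A @ beta - y)   # gradient of mean((A beta - y)^2)
        beta -= lr * grad
    return beta

# Synthetic data (assumed for illustration only): y = 1 + 2*x1 - 3*x2 + noise.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=n)
A = np.column_stack([np.ones(n), X])              # first column fits the bias

beta_ne = fit_normal_equation(A, y)
beta_gd = fit_gradient_descent(A, y)
print(np.round(beta_ne, 3), np.round(beta_gd, 3))
```

On a small, well-conditioned problem like this one the two estimates agree closely. The analytical solve is a single step but needs the full $A^T A$ matrix, while gradient descent only needs repeated matrix-vector products.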

   <br>
  
18. **What are the differences between running linear regression on a large (does not fit into memory) vs a small dataset ?**

    <br>
   
19. **What is the difference between stochastic gradient methods and batched gradient methods ?**


## Logistic Regression

**Questions for Linear Regression are applicable to Logistic Regression, except for the ones listed below.**  

<br>

1. **Interpretation of logistic regression model coefficients**

   <br>
   
2. **How to find the decision boundary for a trained logistic regression model ? (Work through an example when regressed with 1 feature)**

   <br>

3. **What is the cost function for logistic regression ?**

   <br>
   
4. **The coefficient of a feature has the opposite sign of what you expect. What would you do ?**

   <br>

## Regularization



## Evaluation


## Validation

