# Modelling:

#### 1. What is the bias-variance trade-off?

* The bias-variance trade-off is a central problem in supervised machine learning. Ideally, one wants a model that accurately captures the regularities in its training data and also generalizes well to unseen data. Unfortunately, these goals conflict, and it is usually impossible to achieve both at once. 
* __Bias__ is the error caused by incorrect assumptions in the learning algorithm, so that the model cannot represent the true relationship between the predictors and the response variable. 
* __Variance__ is the error caused by sensitivity to fluctuations in the training set. High variance can cause an algorithm to model the noise in the training data rather than the intended outputs. 
* __Discussion:__ Low-bias models are usually more complex (e.g. higher-order regression polynomials), which lets them represent the training set more accurately. However, they may also fit the noise in the training set, making their predictions on new data less accurate. High-bias models, on the other hand, are simpler (e.g. linear regression), but their predictions may have lower variance when applied beyond the training set. 
* __Sources:__ [wiki](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff)
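
The trade-off can be illustrated with a small simulation. This is a minimal sketch, assuming a hypothetical ground truth `sin(2πx)` with Gaussian noise; the function, the noise level and the polynomial degrees (1 and 7) are illustrative choices, not part of the notes above. It refits each model on many freshly drawn training sets and estimates the squared bias and the variance of its predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Assumed ground truth (hypothetical, chosen only for illustration)
    return np.sin(2 * np.pi * x)

def bias_variance(degree, n_sets=200, n_points=40, noise_sd=0.3):
    """Estimate bias^2 and variance of a polynomial fit of the given degree."""
    x_test = np.linspace(0.1, 0.9, 50)
    preds = np.empty((n_sets, x_test.size))
    for s in range(n_sets):
        # Draw a fresh noisy training set and refit the model
        x = rng.uniform(0, 1, n_points)
        y = true_f(x) + rng.normal(0, noise_sd, n_points)
        preds[s] = np.polyval(np.polyfit(x, y, degree), x_test)
    # Bias: gap between the average prediction and the truth
    bias_sq = np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2)
    # Variance: spread of the predictions across training sets
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

results = {degree: bias_variance(degree) for degree in (1, 7)}
for degree, (bias_sq, variance) in results.items():
    print(f"degree={degree}: bias^2={bias_sq:.4f}, variance={variance:.4f}")
```

Under these assumptions the linear model should show the larger squared bias and the degree-7 model the larger variance.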


#### 2. Derive the bias-variance trade-off


* __Derivation of Variance:__  
$Var(X) = E[(X-E[X])^2]$  
$Var(X) = E[(X^2 + E[X]^2 - 2 * X E[X])]$  
$Var(X) = E[X^2] + E[X]^2 - 2 * E[X] * E[X]$  
$Var(X) = E[X^2] - E[X]^2$  

* __Variance in $y$__ (assuming $y = E[y] + e$, where the noise $e$ has $E[e] = 0$)  
$Var(y) = E[ (y - E[y])^2]$  
$Var(y) = E[ (E[y] + e - E[y])^2]$  
$Var(y) = E[e^2]$  
$Var(y) = Var(e) + E[e]^2$; But: $E[e] = 0$  
$Var(y) = Var(e)$  

* __Sum of Squares Error:__  
$SS = E[(y - \hat{y})^2]$  
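$SS = E[(y^2 + \hat{y}^2 - 2 * y * \hat{y})]$  
$SS = E[y^2] + E[\hat{y}^2] - 2 E[y*\hat{y}]$  
$SS = Var(y) + E[y]^2  + Var(\hat{y}) + E[\hat{y}]^2  - 2 E[y*\hat{y}]$  
$SS = Var(y) + Var(\hat{y}) + E[y]^2 + E[\hat{y}]^2 - 2 E[y] E[\hat{y}]$; since the noise in a new observation $y$ is independent of $\hat{y}$, $E[y*\hat{y}] = E[y] E[\hat{y}]$  
$SS = Var(e) + Var(\hat{y}) + \left( E[y] - E[\hat{y}]\right)^2$  
$SS = irreducible\space error + Variance + Bias^2$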
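
The decomposition above can be checked numerically. This is a sketch under assumed values: the response is `mu` plus Gaussian noise, and the estimator is a deliberately biased one (a shrunken sample mean, `c * mean`). The names `mu`, `sigma`, `n` and `c` are hypothetical and chosen so that each term has a known closed form.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, c = 2.0, 1.0, 10, 0.8   # hypothetical values
trials = 200_000

# Each trial: one training sample of size n and one independent new observation y
train = rng.normal(mu, sigma, size=(trials, n))
y_hat = c * train.mean(axis=1)                 # deliberately biased estimator
y = mu + rng.normal(0.0, sigma, size=trials)   # y = E[y] + e

mse = np.mean((y - y_hat) ** 2)                # Monte Carlo estimate of E[(y - y_hat)^2]
noise = sigma ** 2                             # Var(e), the irreducible error
variance = c ** 2 * sigma ** 2 / n             # Var(y_hat)
bias_sq = ((1 - c) * mu) ** 2                  # (E[y] - E[y_hat])^2

print(f"simulated: {mse:.3f}, noise + variance + bias^2: {noise + variance + bias_sq:.3f}")
```

The two printed numbers should agree up to Monte Carlo error.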


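#### 3. Linear regression algorithm with stochastic gradient descent
$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left(\hat{y_i} - y_i \right)^2$  
$\space \space \theta_j = \theta_j - \alpha * \left(\hat{y_i} - y_i  \right) * x_{ij}$ (update for a single sample $i$, with $x_{i0} = 1$ for the intercept)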

In [5]:
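def coefficients_sgd(train, learning_rate, n_iter, pred):
    """Estimate linear regression coefficients with stochastic gradient descent.

    Args:
        train (numpy.ndarray): input data, one row per sample, target in the last column
        learning_rate (float): step size of each update
        n_iter (int): no. of passes (epochs) over the training data
        pred (callable): pred(row, coef) returns the prediction for one row
    Returns:
        list, float: coefficients (intercept first) and the sum of squared
        errors accumulated during the last epoch
    """
    coef = [0.0] * len(train[0])
    sum_err = 0.0
    for _ in range(n_iter):
        sum_err = 0.0
        for row in train:
            yhat = pred(row, coef)
            err = yhat - row[-1]
            sum_err += err ** 2
            # Intercept update: its input is implicitly 1
            coef[0] = coef[0] - learning_rate * err
            # Weight updates: the gradient of the squared error is err * x_j
            for j in range(len(row) - 1):
                coef[j + 1] = coef[j + 1] - learning_rate * err * row[j]
    return coef, sum_err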
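
A usage sketch for the routine above. The `predict` helper is hypothetical (the notes pass it in as `pred` without defining it), and the SGD loop is restated in compact form only so that the block runs on its own. The toy data is noise-free and generated from `y = 1 + 2x`.

```python
def predict(row, coef):
    """Hypothetical linear predictor: intercept plus weighted features."""
    return coef[0] + sum(c * x for c, x in zip(coef[1:], row[:-1]))

def coefficients_sgd(train, learning_rate, n_iter, pred):
    """Compact restatement of the SGD loop so the example is self-contained."""
    coef = [0.0] * len(train[0])
    for _ in range(n_iter):
        for row in train:
            err = pred(row, coef) - row[-1]
            coef[0] -= learning_rate * err                    # intercept
            for j in range(len(row) - 1):
                coef[j + 1] -= learning_rate * err * row[j]   # feature weights
    return coef

# Noise-free toy data from y = 1 + 2x; the target is the last column of each row
train = [[x / 4, 1 + 2 * (x / 4)] for x in range(5)]
coef = coefficients_sgd(train, learning_rate=0.1, n_iter=500, pred=predict)
print(coef)
```

With these settings the coefficients should approach an intercept of 1 and a slope of 2; a larger learning rate or unscaled features may make the updates diverge.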

#### 4. Simple linear regression derivation 

$y = b_0 + b_1 x$  
$RSS = \sum_{i=1}^{n} \left( y_i - \hat{y_i} \right)^2$  
$\frac{\partial{RSS}}{\partial{b_0}} = \sum_{i=1}^{n} -2 \left[ y_i - b_0 - b_1x_i \right] = 0$  
$\frac{\partial{RSS}}{\partial{b_0}} = 2 \left[ nb_0 + b_1 \sum_{i=1}^{n}  x_i - \sum_{i=1}^{n} y_i \right] = 0$  
$b_0 = \frac{\sum_{i=1}^{n} y_i}{n} - b_1 \frac{\sum_{i=1}^{n} x_i}{n}$  
$b_0 = \bar{y} - b_1 \bar{x}$ 
  
$\frac{\partial{RSS}}{\partial{b_1}} = \sum_{i=1}^{n} -2 x_i \left[ y_i - b_0 - b_1x_i \right] = 0$  
$\frac{\partial{RSS}}{\partial{b_1}} = \sum_{i=1}^{n} -2 x_i \left[ y_i - \bar{y} + b_1 \bar{x} - b_1x_i \right] = 0$  
  
$\frac{\partial{RSS}}{\partial{b_1}} = -2 \sum_{i=1}^{n} \left[ x_iy_i - \bar{y} x_i + b_1\bar{x}x_i - b_1x_i^2 \right] = 0$  
$\sum_{i=1}^{n} \left[ x_iy_i - \bar{y} x_i \right] - b_1 \sum_{i=1}^{n} \left[ x_i^2 - \bar{x}x_i \right] = 0$  
$b_1 = \frac{\sum_{i=1}^{n} \left( x_iy_i - \bar{y} x_i \right)}{\sum_{i=1}^{n} \left( x_i^2 - \bar{x}x_i \right)}$  
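
The closed-form result can be evaluated directly. This is a minimal sketch on made-up data; the function name `fit_simple_linear` and the sample values are illustrative only.

```python
def fit_simple_linear(xs, ys):
    """Return (b0, b1) from the closed-form least-squares solution."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # b1 = sum(x_i*y_i - y_bar*x_i) / sum(x_i^2 - x_bar*x_i)
    numerator = sum(x * y - y_bar * x for x, y in zip(xs, ys))
    denominator = sum(x * x - x_bar * x for x in xs)
    b1 = numerator / denominator
    # b0 = y_bar - b1 * x_bar
    b0 = y_bar - b1 * x_bar
    return b0, b1

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
b0, b1 = fit_simple_linear(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(b0, b1, sum(residuals))
```

The residuals sum to (numerically) zero, which is exactly the first normal equation $\frac{\partial{RSS}}{\partial{b_0}} = 0$.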

#### 5. Gini Impurity

$gini\_impurity = 1 - \sum_{i=1}^{C} p_i^2$, where $p_i$ is the share of samples in class $i$ and $C$ is the number of classes

$gini\_index$ of a split:
```
for each branch in split:
    calculate the share of samples the branch represents (used for weighting)
    calculate the branch's gini_impurity
weight each branch's gini_impurity by the share of samples it represents
sum the weighted impurities to get the gini_index of the split
```

$cross\_entropy  = \sum_{i=1}^{C} - p_i \log(p_i)$
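
The three quantities above can be sketched in a few lines. The function names are illustrative, branches are assumed to be plain lists of class labels, and the logarithm is taken as the natural log (the notes do not fix a base).

```python
import math
from collections import Counter

def class_proportions(labels):
    """Share of samples in each class that is present in labels."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini_impurity(labels):
    return 1.0 - sum(p * p for p in class_proportions(labels))

def cross_entropy(labels):
    # Proportions come from observed classes, so p > 0 and log(p) is defined
    return sum(-p * math.log(p) for p in class_proportions(labels))

def gini_index(branches):
    """Impurity of each branch, weighted by the share of samples it holds."""
    total = sum(len(branch) for branch in branches)
    return sum(len(branch) / total * gini_impurity(branch)
               for branch in branches if branch)

left = ["a", "a", "a", "b"]
right = ["b", "b"]
print(gini_impurity(left))        # 1 - (0.75^2 + 0.25^2) = 0.375
print(gini_index([left, right]))  # 4/6 * 0.375 + 2/6 * 0 = 0.25
print(cross_entropy(left))
```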

#### 5. CART


In [9]:
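# Assumed helpers (not shown in this section):
#   make_split(data, col, val)       -> (left, right) DataFrames split on column col at val
#   calc_gini_index(groups, classes) -> weighted gini index of the groups
#   to_terminal(group)               -> leaf value, e.g. the most common class in group

def build_decision_tree(train, max_depth, min_leaf_size):
    root = get_split(train)
    split(root, max_depth, min_leaf_size, 1)
    return root

def split(node, max_depth, min_leaf_size, level):
    left, right = node.pop('groups')

    # One side is empty: no useful split, so both children share one leaf
    if len(left) == 0 or len(right) == 0:
        node['left'] = node['right'] = to_terminal(left if len(left) else right)
        return

    # Maximum depth reached: turn both children into leaves
    if level >= max_depth:
        node['left'], node['right'] = to_terminal(left), to_terminal(right)
        return

    # Process the left and right children
    for side, group in (('left', left), ('right', right)):
        if len(group) <= min_leaf_size:
            node[side] = to_terminal(group)
        else:
            node[side] = get_split(group)
            split(node[side], max_depth, min_leaf_size, level + 1)

def get_split(data):
    classes = data['class'].unique()
    gini_lowest = float('inf')
    split_groups = (None, None)
    split_index, split_val = None, None

    # Try every observed value of every feature as a split point
    for col in data.columns.drop('class'):
        for val in data[col].unique():
            a, b = make_split(data, col, val)
            gini = calc_gini_index([a, b], classes)
            if gini < gini_lowest:
                gini_lowest = gini
                split_groups = (a, b)
                split_index = col
                split_val = val
    return {'groups': split_groups, 'index': split_index, 'val': split_val}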

#### 6. Bagged Tree Algorithm

In [2]:
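def bagging(train, test, max_depth, min_size, sample_size, n_trees):
    """Fit n_trees trees on re-sampled training data and predict the test set.

    Relies on helpers not shown here: subsample (assumed to draw a sample of
    the training data with replacement), build_tree (fits one decision tree)
    and bagging_predict (combines the trees' predictions for one row).
    """
    trees = []
    for _ in range(n_trees):
        sample = subsample(train, sample_size)
        tree = build_tree(sample, max_depth, min_size)
        trees.append(tree)
    predictions = [bagging_predict(trees, row) for row in test.values]

    return predictions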
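
The two helpers the routine above relies on are not defined in these notes. The following is one possible sketch, assuming `subsample` draws rows with replacement and `bagging_predict` takes a majority vote. The `fit_stump` function is a hypothetical stand-in for a real tree builder, and each "tree" is represented as a plain callable.

```python
import random
from collections import Counter

def subsample(rows, sample_size, rng):
    """Bootstrap sample: draw sample_size rows with replacement."""
    return [rng.choice(rows) for _ in range(sample_size)]

def fit_stump(sample):
    """Hypothetical stand-in for a tree: threshold feature 0 at the midpoint
    between the two class means (assumes class 1 has the larger values)."""
    zeros = [row[0] for row in sample if row[-1] == 0]
    ones = [row[0] for row in sample if row[-1] == 1]
    if not zeros or not ones:
        # Degenerate bootstrap sample with a single class: always predict it
        majority = 1 if ones else 0
        return lambda row: majority
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda row: int(row[0] > threshold)

def bagging_predict(models, row):
    """Majority vote over the predictions of all models."""
    votes = [model(row) for model in models]
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(0)
train = [[1.0, 0], [2.0, 0], [3.0, 0], [4.0, 0],
         [7.0, 1], [8.0, 1], [9.0, 1], [10.0, 1]]
models = [fit_stump(subsample(train, len(train), rng)) for _ in range(25)]
predictions = [bagging_predict(models, row) for row in ([2.5], [8.5])]
print(predictions)
```

Because every model sees a different bootstrap sample, the individual thresholds differ, and the vote averages out that variation.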

#### 7. Random Forest Algorithm

In [3]:
import random


def random_forest(train, test, max_depth, min_size, sample_ratio, tree_ct,
                  feature_ct):
    """Build a random forest from the training dataset and predict outcomes for
    the test dataset

    Args:
        train (pandas.DataFrame): training dataset
        test (pandas.DataFrame): test dataset
        max_depth (int): max depth of any tree
        min_size (int): min size of the dataset at the terminal node
        sample_ratio (float): ratio of re-sampling
        tree_ct (int): no. of trees in the forest
        feature_ct (int): no. of features considered at each split

    Returns:
        (list): list of predictions
    """
    trees = []

    for _ in range(tree_ct):
        sample = subsample(train, sample_ratio)
        tree = build_forest(sample, max_depth, min_size, feature_ct)
        # print_tree(tree)
        trees.append(tree)

    predictions = [bagging_predict(trees, row) for row in test.values]
    return predictions


def build_forest(train, max_depth, min_size, n_features):
    """Build a tree

    Args:
        train (pandas.DataFrame): training dataset
        max_depth (int): max depth of the tree
        min_size (int): min size of samples in terminal node
        n_features (int): no. of features for each tree

    Returns:
        dict: tree with left and right sub-trees
    """
    root = get_split(train.values, n_features)
    _build_forest(root, max_depth, min_size, n_features, 1)
    return root

def get_split(dataset, n_features):
    """This method randomly picks n_features features and returns the best
    split across the sampled features

    Args:
        dataset(numpy.ndarray): input dataset
        n_features (int): no. of features to be considered for splitting

    Returns:
        dict: of split parameters
    """
    class_values = list(set(row[-1] for row in dataset))
    b_idx, b_val, b_score, b_groups = None, None, float("inf"), None

    # Sample n_features distinct feature indices (the last column is the class)
    features = random.sample(range(len(dataset[0]) - 1), n_features)
    # print("features:", features)
    for feature in features:
        for row in dataset:
            groups = split_data(dataset, feature, row[feature])
            gini = gini_index(groups, class_values)

            if gini < b_score:
                b_idx, b_val, b_score, b_groups = feature, row[feature], gini, groups

    return {"index": b_idx, "value": b_val, "groups": b_groups}


#### 8. Light GBM

#### 9. XGBoost

#### 10. Maximum Likelihood Estimation

$\log\left( \frac{p}{1-p}\right) = \beta^T X$  

Model:   
$y_i \sim Bernoulli(p_i) \space \forall \space i = 1, 2, ...m; m = no. of samples$  
  
$Pr[Y_1 = y_1, Y_2 = y_2... Y_m = y_m | p_1, p_2, p_3...p_m] = \prod_{i=1}^{m} p_i^{y_i} (1-p_i)^{1-y_i} = \mathcal{L} $  
  
$\log(\mathcal{L} ) = \sum_{i=1}^{m} \left[ y_i \log(p_i) + (1-y_i) \log(1-p_i) \right]$  
  
$\log(\mathcal{L} ) = \sum_{i=1}^{m} \left[ y_i \log \left( \frac{p_i}{1-p_i}\right) + \log(1-p_i) \right]$  
  
$\log(\mathcal{L} ) = \sum_{i=1}^{m} \left[ y_i \beta^T x_i + \log\left(1-\frac{\exp(\beta^T x_i)}{1+ \exp(\beta^T x_i)}\right) \right]$  
  
$\log(\mathcal{L} ) = \sum_{i=1}^{m} \left[ y_i \beta^T x_i - \log\left(1+\exp(\beta^T x_i)\right) \right]$  
  
$\frac{\partial{\log(\mathcal{L})}}{\partial{\beta}} = 0 = \sum_{i=1}^{m} \left[y_i x_i - x_i \frac{\exp(\beta^Tx_i)}{1+\exp(\beta^Tx_i)}  \right]$  
  
$\frac{\partial{\log(\mathcal{L})}}{\partial{\beta}} = 0 = \sum_{i=1}^{m} \left[x_i \left(y_i - p(x_i; \beta) \right) \right]$  
  
$\frac{\partial^2{\log(\mathcal{L})}}{\partial{\beta} \partial{\beta^T}} = - \sum_{i=1}^{m} x_i x_i^T \frac{\exp(\beta^Tx_i)}{(1+ \exp(\beta^Tx_i)) (1+ \exp(\beta^Tx_i))}$  
$\frac{\partial^2{\log(\mathcal{L})}}{\partial{\beta} \partial{\beta^T}} = - \sum_{i=1}^{m} x_i x_i^T p_i (1-p_i)$  
  
$\beta_{iter + 1} = \beta_{iter} - \left( \frac{\partial^2{\log(\mathcal{L})}}{\partial{\beta} \partial{\beta^T}} \right)^{-1} \frac{\partial{\log(\mathcal{L})}}{\partial{\beta}}$  : Newton method
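
The Newton iteration can be sketched with NumPy. This is a minimal illustration on simulated data; the function names, the sample size and `true_beta` are assumptions made for the example, and `X` is assumed to carry a leading column of ones for the intercept.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_newton(X, y, n_iter=25, tol=1e-10):
    """Maximize the log-likelihood with Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        gradient = X.T @ (y - p)                        # sum_i x_i (y_i - p_i)
        hessian = -(X * (p * (1 - p))[:, None]).T @ X   # -sum_i x_i x_i^T p_i (1 - p_i)
        step = np.linalg.solve(hessian, gradient)
        beta = beta - step                              # beta - H^{-1} gradient
        if np.max(np.abs(step)) < tol:
            break
    return beta

rng = np.random.default_rng(0)
m = 500
X = np.column_stack([np.ones(m), rng.normal(size=m)])
true_beta = np.array([-0.5, 1.5])                       # hypothetical coefficients
y = (rng.uniform(size=m) < sigmoid(X @ true_beta)).astype(float)

beta_hat = fit_logistic_newton(X, y)
score = X.T @ (y - sigmoid(X @ beta_hat))               # should be ~0 at the MLE
print(beta_hat, score)
```

At the solution the gradient vanishes, so with an intercept the average fitted probability matches the observed share of ones. The sketch has no safeguard for perfectly separable data, where the MLE does not exist.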


#### 9. Derive NN - backpropagation

#### 11. SVM

#### 12. ARIMA

#### 13. Impact of a given feature on response variable

#### 14. Types of error calculation

* __MSE (Mean Squared Error)__: $\frac{\sum_{i=1}^{m} (y_i - \hat{y_i})^2}{m}$ 
    * Includes both the variance of the estimator (how widely the estimates spread from one data sample to another) and its bias (how far the average estimate is from the truth) 
    * For an unbiased estimator, RMSE (the square root of MSE) equals the standard error of the estimator
    * Inflates large errors (or outliers)
    * Squaring is smooth, unlike the absolute value, and it leads to a definition of variance with convenient mathematical properties, e.g. the variance of a sum of independent variables is the sum of their variances. The central limit theorem gives a further reason to prefer the standard deviation over the mean absolute error: when a quantity is assumed to be normally distributed (e.g. heights in a population), its mean and standard deviation completely specify the distribution, so they can be used to make predictions about the whole distribution.
* __MAD (Mean Absolute Deviation)__: $\frac{\sum_{i=1}^{m} \mid y_i - \hat{y_i} \mid}{m}$ 
    * More robust to outliers than MSE
* __MAPE (Mean Absolute Percentage Error)__: $\frac{\sum_{i=1}^{m} \mid \frac{(y_i - \hat{y_i}) }{y_i} \mid}{m}$ 
    * Works well when the response variable is symmetric and normally distributed
    * Does not work well for skewed distributions
    * Undefined if any observed $y_i$ is 0
    * Direct business interpretation (e.g. % error in a sales forecast)
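
The three measures can be compared on a small example. The sketch below uses made-up values with one large miss (the last forecast) to show how each metric reacts; the function names are illustrative.

```python
def mse(y, y_hat):
    return sum((actual - pred) ** 2 for actual, pred in zip(y, y_hat)) / len(y)

def mad(y, y_hat):
    return sum(abs(actual - pred) for actual, pred in zip(y, y_hat)) / len(y)

def mape(y, y_hat):
    if any(actual == 0 for actual in y):
        raise ValueError("MAPE is undefined when an observed value is 0")
    return sum(abs((actual - pred) / actual) for actual, pred in zip(y, y_hat)) / len(y)

y = [100.0, 200.0, 300.0, 400.0]
y_hat = [110.0, 190.0, 300.0, 500.0]   # the last forecast misses by 100

print(mse(y, y_hat))    # (100 + 100 + 0 + 10000) / 4
print(mad(y, y_hat))    # (10 + 10 + 0 + 100) / 4
print(mape(y, y_hat))   # (0.10 + 0.05 + 0 + 0.25) / 4
```

The single large miss contributes 10000 of the 10200 total squared error, but only 100 of the 120 total absolute error, which is the sense in which MSE inflates outliers.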


#### 15. How to evaluate a binary classifier?

* __Basic Metrics__

|Metric|Formula|Interpretation|
|-------|-----|----------------|
|Accuracy|$\frac{TP +TN}{TP + TN + FP + FN}$|Share of all samples classified correctly|
|Precision|$\frac{TP}{TP + FP}$|Share of predicted positives that are truly positive|
|Recall (aka Sensitivity) |$\frac{TP}{TP + FN}$|Share of actual positives that are detected|
|Specificity |$\frac{TN}{TN + FP}$|Share of actual negatives that are detected|
|F1 Score |$\frac{2}{\frac{1}{Precision} + \frac{1}{Recall}}$|Harmonic mean of precision and recall|

* __ROC (Receiver Operating Characteristic)__: Plot of TPR (y-axis) against FPR (x-axis) as the classification threshold varies
* __AUC (Area Under the Curve)__: Area under the ROC curve. A perfect predictor reaches TPR = 1.0 at FPR = 0.0 (AUC = 1.0). A random predictor has TPR = FPR (AUC = 0.5).   
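
These metrics can be computed from labels and scores in plain Python. The sketch below uses made-up data and a 0.5 threshold; the AUC is computed through its rank interpretation (the probability that a randomly chosen positive scores higher than a randomly chosen negative), which equals the area under the ROC curve. It does not guard against empty classes or zero denominators.

```python
def confusion_counts(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def basic_metrics(tp, tn, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 / (1 / precision + 1 / recall),
    }

def auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.6]
y_pred = [int(s >= 0.5) for s in scores]

counts = confusion_counts(y_true, y_pred)
metrics = basic_metrics(*counts)
print(counts, metrics, auc(y_true, scores))
```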

#### 18. How to deal with an imbalanced binary response variable?

* __Re-sample data__: When the training set is re-sampled to balance the classes, a logistic regression model is affected only in its intercept estimate (although this skews all the predicted probabilities, which in turn compromises the predictions). Fortunately the intercept correction is straightforward: provided you know, or can guess, the true proportion of 0s and 1s and you know the proportions in the training set, you can apply a rare-events correction to the intercept. Details are in King and Zeng (2001).
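
The intercept correction described above can be sketched as follows, using the prior-correction form from King and Zeng (2001): subtract $\ln\left[\frac{1-\tau}{\tau} \cdot \frac{\bar{y}}{1-\bar{y}}\right]$ from the fitted intercept, where $\tau$ is the true share of 1s and $\bar{y}$ is the share of 1s in the training sample. The function and variable names are illustrative.

```python
import math

def corrected_intercept(b0, sample_pos_rate, true_pos_rate):
    """Prior correction of a logistic regression intercept.

    b0: intercept fitted on the re-sampled training data
    sample_pos_rate: share of 1s in the training sample (y_bar)
    true_pos_rate: known or guessed share of 1s in the population (tau)
    """
    y_bar, tau = sample_pos_rate, true_pos_rate
    return b0 - math.log(((1 - tau) / tau) * (y_bar / (1 - y_bar)))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Intercept-only model fitted on a balanced sample: b0 = 0, i.e. p = 0.5.
# If only 1% of the population is positive, the corrected model predicts 1%.
corrected = corrected_intercept(0.0, sample_pos_rate=0.5, true_pos_rate=0.01)
print(corrected, sigmoid(corrected))
```

The slope coefficients are left unchanged; only the intercept is shifted.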


19. What models can be used to predict a binary response variable? What are the differences between these?
21. Why might it be better to include fewer features compared to many?
22. Given training data on tweets and their retweets, how would you predict the number of retweets a given tweet will have after 7 days, after observing only 2 days' worth of data?
23. How would you construct a feed to show relevant content for a site that involved user interactions with items?
24. How would you design the "People You May Know" feature on LinkedIn or FB?
25. How would you predict who someone may want to send a Snapchat or Gmail to?
26. How would you suggest to a franchise where to open a new store?
27. How would you design a query auto-complete solution for a search engine?
28. Given a database of all previous alumni donations to your university, how would you predict which recent alumni are more likely to donate?
29. You're Uber and you want to design a heatmap to recommend to drivers where to wait for a passenger. How would you approach this?
30. How would you build a model to predict a March Madness bracket?
31. You want to run a regression to predict the probability of a flight delay, but there are flights with delays up to 12 hours that are really messing up your model. How will you address this?
32. Derive MLE estimation from likelihood function?
33. Program a Naive Bayes algorithm
34. Program a k-NN algorithm (a) iterative and (b) vectorized
35. What are differences between generative and discriminative models? Examples.
36. Derive likelihood function for BG NBD
37. Typical loss functions used: 
    a. Least Squared Error
    b. Logistic Loss
    c. Hinge Loss
    d. Cross-entropy
38. What is the cost function in gradient descent?
39. What is the general form of gradient descent? What are the typical parameters to control gradient descent?
40. What is Newton's algorithm? What is the Newton-Raphson method?
41. Derive the linear regression coefficients using the normal equation.
42. What is the mathematical form of Least Mean Square Algorithm?
43. What is locally weighted regression? 
44. What is the sigmoid function?
45. What is the general form of logistic regression?
46. What is the general form of softmax regression?
47. What are generalized linear models (GLM)? How can the Bernoulli, Gaussian, Poisson and Geometric distributions be used within GLM?
48. What are the assumptions in GLM?
49. What is SVM? How does it work?
50. How does Gaussian Discriminant analysis work?
51. Learning Theory: What is Union Bound?
52. Learning Theory: What is Hoeffding inequality?
53. Learning Theory: What is Training Error?
54. Learning Theory: What is Probably Approximately Correct (PAC)?
55. Learning Theory: What is Shattering?
56. Learning Theory: What is Upper Bound Theorem?
57. Learning Theory: What is VC dimension?
58. Learning Theory: What is Vapnik Theorem?
59. What is the EM algorithm? How is it used in discovering latent variables?
60. What is k-means clustering? How does the algorithm work?
61. How to optimize k-means? How to find the optimal k?
62. What are the assumptions of the k-means algorithm?
63. What is the k-prototypes algorithm? When is it typically used?
64. What is Hierarchical clustering? What are the different linkage functions that can be used?
65. What are the metrics that can be used to assess clusters? Silhouette, Calinski-Harabasz?
66. What is PCA? How does it work? Code it up
67. What is ICA? How does it work?
68. What is NN? What is the general form?
69. What are the typical activation functions used? Describe Sigmoid, Tanh, ReLU, Leaky ReLU.
70. How does the backpropagation algorithm work?
    a. Cross-entropy loss
    b. Learning Rate
    c. Backpropagation
    d. Updating weights
    e. Dropout
71. What is CNN?
    a. Convolutional Layer Requirement
    b. Batch normalization
72. What is RNN?
    a. Types of gates? Input, Forget, Gate, Output gate
    b. LSTM
73. Reinforcement Learning and Control
    a. What is a policy?
    b. What is a Markov decision process?
    c. What is a value function?
    d. What are the Bellman equations?
    e. Value iteration algorithm
    f. MLE
    g. Q-learning
74. Classification evaluation metrics:
    a. Accuracy, Precision, Recall (Sensitivity), Specificity, F1 Score
    b. ROC (TPR aka Recall aka Sensitivity, FPR aka 1- specificity)
    c. AUC



 

#### 75. Regression Metrics:
* __Total Sum of Squares (TSS):__ Measures the variation in the observed data  
$\sum_{i=1}^{m} (y_i - \bar{y})^2$   

* __Residual Sum of Squares (RSS):__ Measures the variation in the modelling errors  
$\sum_{i=1}^{m} (y_i - \hat{y_i})^2$  
* __Explained Sum of Squares (ESS):__  Measures the variation in the modelled values  
$\sum_{i=1}^{m} (\hat{y_i} - \bar{y})^2$   

* $R^2$ or coefficient of determination: the share of the variation in the observed data that the model explains  
$(1- \frac{RSS}{TSS})$  
  
* Model performance: Mallows's $C_p$, AIC, BIC, Adjusted $R^2$
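
A small worked example of these quantities, assuming a simple least-squares line fitted with an intercept to made-up data. In that setting TSS = ESS + RSS, so $R^2$ can equally be read as ESS / TSS.

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Least-squares line y_hat = b0 + b1 * x
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * x for x in xs]

tss = sum((y - y_bar) ** 2 for y in ys)                  # variation in the observed data
rss = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))     # variation left in the errors
ess = sum((yh - y_bar) ** 2 for yh in y_hat)             # variation in the modelled values
r2 = 1 - rss / tss

print(tss, rss, ess, r2)
```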


#### 76. Regularization:
* __Definition:__ Regularization is a technique used to reduce the complexity of the model. The objective is to avoid overfitting the training sample by reducing variance, at the cost of a small increase in bias. It is usually done by adding a regularization (penalty) term to the cost function of the machine learning model. 
* __LASSO aka L1 Regularization:__ Lasso shrinks coefficients towards 0. When $\lambda$ is sufficiently large, lasso shrinks some of the coefficients exactly to 0, so it also performs variable selection. If there is a group of highly correlated variables, lasso tends to select one from the group and ignore the rest. $$J(\theta) = \sum_{i = 1}^{m}{\left( y_i - \theta_0 - \sum_{j=1}^{n} \theta_j x_{ij}\right) ^2} + \lambda \sum_{j=1}^{n}{\mid \theta_j \mid}$$
* __Ridge aka L2 Regularization:__ Ridge regularization shrinks coefficients towards 0 but, due to the nature of the penalty term, generally not exactly to 0, so ridge yields models that contain all $n$ predictors. $$J(\theta) = \sum_{i = 1}^{m}{\left( y_i - \theta_0 - \sum_{j=1}^{n} \theta_j x_{ij}\right) ^2} + \lambda \sum_{j=1}^{n}{ \theta_j}^2$$
* __Elastic Net:__ Lasso tends to keep only one variable from a group of highly correlated variables. Elastic net combines the penalty terms of both L1 and L2 to allow for more flexibility than L1 alone. 
$$J(\theta) = \sum_{i = 1}^{m}{\left( y_i - \theta_0 - \sum_{j=1}^{n} \theta_j x_{ij}\right) ^2} + \lambda_1 \sum_{j=1}^{n}{\mid \theta_j \mid} + \lambda_2 \sum_{j=1}^{n}{ \theta_j}^2$$
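
The different behaviour of the two penalties can be seen on simulated data. This is a sketch under stated assumptions: the data is made up (the second feature has a true coefficient of 0), features and response are centred so the unpenalized intercept $\theta_0$ drops out, ridge uses its closed-form solution, and lasso uses plain coordinate descent with soft-thresholding. Function names and the value of `lam` are illustrative.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed form for sum of squares + lam * sum(theta_j^2)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def soft_threshold(rho, t):
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso(X, y, lam, n_sweeps=200):
    """Coordinate descent for sum of squares + lam * sum(|theta_j|)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            # Residual with feature j left out of the current fit
            residual = y - X @ theta + X[:, j] * theta[j]
            rho = X[:, j] @ residual
            # Minimizer of z*t^2 - 2*rho*t + lam*|t| with z = x_j . x_j
            theta[j] = soft_threshold(rho, lam / 2) / (X[:, j] @ X[:, j])
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(0, 0.5, 100)  # feature 1 is irrelevant
X = X - X.mean(axis=0)   # centring removes the need for an intercept
y = y - y.mean()

lam = 80.0
ols_theta = ridge(X, y, 0.0)
ridge_theta = ridge(X, y, lam)
lasso_theta = lasso(X, y, lam)
print("OLS:  ", ols_theta)
print("ridge:", ridge_theta)
print("lasso:", lasso_theta)
```

Under these assumptions ridge shrinks every coefficient but keeps all of them non-zero, while lasso sets the coefficient of the irrelevant feature exactly to 0.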


#### 77. Diagnostics:
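* Discovering overfitting and underfitting through training and cross-validation (CV) error: high training error together with high CV error suggests underfitting (high bias), while low training error together with a much higher CV error suggests overfitting (high variance)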
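
This diagnostic can be sketched with k-fold cross-validation over models of increasing complexity. The data (a noisy sine curve), the polynomial degrees and the fold count are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, 60)
y = np.sin(np.pi * x) + rng.normal(0, 0.2, 60)   # hypothetical data

def train_and_cv_error(degree, k=5):
    """Mean training and validation MSE of a polynomial fit over k folds."""
    folds = np.array_split(rng.permutation(len(x)), k)
    train_err, cv_err = [], []
    for fold in folds:
        mask = np.ones(len(x), dtype=bool)
        mask[fold] = False                       # hold this fold out
        coefs = np.polyfit(x[mask], y[mask], degree)
        train_err.append(np.mean((np.polyval(coefs, x[mask]) - y[mask]) ** 2))
        cv_err.append(np.mean((np.polyval(coefs, x[fold]) - y[fold]) ** 2))
    return np.mean(train_err), np.mean(cv_err)

errors = {degree: train_and_cv_error(degree) for degree in (1, 4, 10)}
for degree, (train_mse, cv_mse) in errors.items():
    print(f"degree={degree}: train MSE={train_mse:.3f}, CV MSE={cv_mse:.3f}")
```

The straight line should show high error on both sets (underfitting). Training error keeps falling as the degree grows, so a widening gap between training and CV error at high degree is the sign of overfitting.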
