# Machine Learning

<h2>End-to-End ML Workflow</h2>

<ol>

<li>Clarify the Problem and Constraints</li>
    <ul>
        <li>Dependent Variable</li>
        <li>Current State, Baseline Performance</li>
        <li>Is ML Needed?</li>
        <li>Legal/Ethical/Regulatory Concerns</li>
        <li>Impact of Incorrect Predictions</li>
        <li>Technical Requirements</li>
        <li>Latency, throughput, deployment</li>
    </ul>

<li>Establish Metrics</li>
    <ul>
        <li>Constraints on recall, etc.</li>
    </ul>

<li>Understand Data Sources</li>

<li>Explore the Data</li>

<li>Clean the Data</li>

<li>Feature Engineering</li>
    <ul>			
        <li>Filtering</li>
        <li>Transformations</li>
        <li>Binning</li>
        <li>Dimension Reduction</li>
        <li>One-Hot Encoding</li>
        <li>Hashing</li>
        <li>Stemming, Lemmatization, BOW, N-Grams, Word Embeddings</li>
    </ul>

<li>Model Selection</li>
<li>Model Training and Evaluation</li>
<li>Deployment</li>
<li>Iterate</li>
</ol>

# Problems

<h4>Question 1</h4>

Say you are building a binary classifier for an unbalanced dataset (99% vs. 1%). How should you handle this situation?

<i>Answer:</i>

- Check whether you can get more data
- Look at appropriate performance metrics: precision, recall, F1 score, ROC curve
- Resample either by oversampling the rare samples or undersampling the abundant ones (bootstrapping)
- Consider generating synthetic samples such as with SMOTE
- Consider adjusting the probability threshold to something other than $0.5$
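
The effect of moving the threshold can be sketched in a few lines. The labels and scores below are made up for illustration, and `precision_recall` is a hypothetical helper rather than a library function:

```python
def precision_recall(y_true, scores, threshold):
    """Return (precision, recall) when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical data: 2 positives among 10 examples
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.02, 0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.45, 0.35, 0.80]

for t in (0.5, 0.3):
    p, r = precision_recall(y_true, scores, t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, lowering the threshold from $0.5$ to $0.3$ recovers the second positive at the cost of two false positives, which is the usual precision/recall exchange on rare classes.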

<h4>Question 2</h4>

What differences would you expect in a model that minimizes squared error vs. a model that minimizes absolute error?

<i>Answer:</i>

Errors are squared before being averaged in MSE, meaning there is a relatively high weight given to large errors. Outliers affect MSE more than MAE; MAE is more robust to outliers.

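The gradient is also more straightforward to work with for MSE: it is smooth everywhere, and ordinary least squares has a closed-form solution. The absolute error is not differentiable at zero, so MAE needs subgradient methods or a reformulation (least absolute deviations regression can be posed as a linear program).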

Therefore, if the model needs to be computationally easier to train or doesn't need to be robust to outliers, then MSE should be used. Otherwise, MAE is a better option.
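
One way to see the difference is to fit the simplest possible model, a single constant. The constant that minimizes squared error is the mean, and the constant that minimizes absolute error is the median. A minimal sketch with made-up values containing one outlier:

```python
from statistics import mean, median

def mse(values, c):
    """Mean squared error of predicting the constant c."""
    return sum((v - c) ** 2 for v in values) / len(values)

def mae(values, c):
    """Mean absolute error of predicting the constant c."""
    return sum(abs(v - c) for v in values) / len(values)

values = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical data with one outlier

best_for_mse = mean(values)    # pulled toward the outlier
best_for_mae = median(values)  # stays with the bulk of the data
print(best_for_mse, best_for_mae)
```

The squared-error fit lands at $22$, far from four of the five points, while the absolute-error fit stays at $3$.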

<h4>Question 3</h4>

When performing K-means clustering, how do you choose k?

<i>Answer:</i>

The elbow method is one option. The first few clusters usually explain much of the variation in the data, but past a certain point each additional cluster adds little. On a plot of the within-cluster sum of squared errors against the number of clusters, there should be a sharp bend (the "elbow") at some value of $k$, and that value is a reasonable choice.

The silhouette score is another method; it measures how similar a point is to its own cluster compared with other clusters. For each point it computes:

$\frac{x-y}{\max(x,y)}$

where $x$ is the mean distance to the examples of the nearest other cluster, and $y$ is the mean distance to the other examples in the same cluster. The coefficient varies between $-1$ and $1$ for any given data point. A value near $1$ implies that the point is well matched to its cluster, and averaging the coefficient over all points lets us compare different values of $k$.
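
A minimal sketch of the per-point silhouette coefficient, using one-dimensional points and absolute difference as the distance (both are simplifying assumptions, and the function names are illustrative):

```python
def mean_distance(point, others):
    """Mean absolute distance from a 1-D point to a list of points."""
    return sum(abs(point - o) for o in others) / len(others)

def silhouette(point, own_cluster, other_clusters):
    """Silhouette coefficient (x - y) / max(x, y) for a single point.

    own_cluster excludes the point itself; other_clusters is a list of clusters.
    """
    y = mean_distance(point, own_cluster)                      # cohesion
    x = min(mean_distance(point, c) for c in other_clusters)   # separation
    return (x - y) / max(x, y)

# A point sitting with its neighbours, and one assigned to a distant cluster
s_good = silhouette(1.0, own_cluster=[1.2, 0.8], other_clusters=[[5.0, 5.5], [9.0, 10.0]])
s_bad = silhouette(5.0, own_cluster=[1.2, 0.8], other_clusters=[[5.5, 4.8]])
print(round(s_good, 3), round(s_bad, 3))
```

The well-placed point scores close to $1$ and the misassigned point scores close to $-1$.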

Another alternative is to rely on business intuition.

<h4>Question 4</h4>

How can you make your models more robust to outliers?

<i>Answer:</i>

- Add regularization
- Try different models (e.g., tree-based models are less affected than linear models)
- "Winsorize" the data: cap the data at chosen percentile thresholds. For example, at $90\%$ winsorization, we take the top and bottom $5\%$ of values and set them to the values of the $95^{th}$ and $5^{th}$ percentiles, respectively.
- Transform the data: e.g., do a log transformation when the response variable follows an exponential or right-skewed distribution
- Change the error metric to be more robust to outliers (like MAE or Huber loss)
- Remove outliers, if you are sure they are not worth incorporating into the model
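
Winsorization is short enough to sketch directly. The version below uses only the standard library; the `winsorize` helper and the sample values are illustrative, and the percentile interpolation follows `statistics.quantiles` with `method="inclusive"`:

```python
from statistics import quantiles

def winsorize(values, lower_pct=5, upper_pct=95):
    """Cap values below/above the given percentiles (90% winsorization by default)."""
    cuts = quantiles(values, n=100, method="inclusive")  # cuts[i] is the (i+1)-th percentile
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, lo), hi) for v in values]

values = list(range(1, 20)) + [1000]  # hypothetical data with one extreme value
capped = winsorize(values)
print(max(values), max(capped))
```

Values between the two cut points are left untouched; only the tails are pulled in.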

<h4>Question 5</h4>

In a multiple linear regression analysis, you have reason to believe that several of the predictors are correlated. How would the results of the regression be affected, and how would you deal with this problem?

<i>Answer:</i>

There will be two primary problems:

- The coefficient estimates and signs will vary dramatically, depending on which variables are included in the model. The confidence intervals may include zero, making the direction of an effect uncertain.
- The p-values will be misleading, because the standard errors of the correlated coefficients are inflated.

This can be dealt with by removing or combining correlated predictors. It is possible to use interaction terms. Additionally, you could 1) center the data and 2) try to obtain a larger sample (thereby giving narrower confidence intervals). Lastly, you could apply regularization.

<h4>Question 6</h4>

Describe the motivation behind random forests. What are two ways they improve upon individual decision trees?

<i>Answer:</i>

Random forests are used because individual decision trees are prone to overfitting. There are a few ways in which they allow for stronger out-of-sample prediction than individual decision trees:

- As in other ensemble models, bootstrap aggregating (bagging) trains each tree on a different bootstrap sample, and averaging over the diverse trees reduces variance and yields more consistent results.
- Using $m \lt p$ features (as the algorithm does) at each split helps to de-correlate the decision trees.
- Beyond predictive accuracy, they are easy to implement and fast to run
- They also produce interpretable feature importance values

<h4>Question 7</h4>

<p>We want to predict the likelihood of transactions being fraudulent; however, the data has many rows with missing values in various columns. How would you deal with this?</p>

<i>Answer:</i>

<ol>
<li>Step 1: Clarify the missing data</li>
    <ul>
        <li>Is the number of missing values uniform by feature?</li>
        <li>Are the missing values numerical or categorical?</li>
        <li>How many features are missing data?</li>
        <li>Is there a pattern to what's missing?</li>
    </ul>

<li>Step 2: Establish a baseline</li>
    <ul>
        <li>A good answer considers that the missing data may not be a problem. e.g., if it's for transactions that were almost never fraudulent, or features that are unimportant like name.</li>
        <li>Can a baseline model be built that meets the business goals, without having to deal with the missing data?</li>
    </ul>

<li>Step 3: Impute missing data</li>
    <ul>
        <li>Mean, median, reference category</li>
        <li>Nearest neighbors methods</li>
    </ul>

<li>Step 4: Check performance with imputed data</li>
    <ul>
        <li>Does performance increase by including the imputed data, as measured by cross-validation?</li>
    </ul>

<li>Step 5: Other Approaches</li>
    <ul>
        <li>e.g., using a third-party dataset to fill in gaps</li>
    </ul>
</ol>
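
Step 3 can be sketched for a single numeric column as follows. The column values and the `impute_median` helper are hypothetical, and keeping a "was missing" flag alongside the imputed value is an added assumption, motivated by the question in Step 1 about whether missingness follows a pattern:

```python
from statistics import median

def impute_median(column):
    """Replace None with the median of the observed values.

    Also returns a 0/1 flag per row so a model can still use the
    fact that the value was missing.
    """
    observed = [v for v in column if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in column]
    was_missing = [int(v is None) for v in column]
    return imputed, was_missing

amounts = [12.0, None, 250.0, 40.0, None, 18.0]  # hypothetical transaction amounts
imputed, was_missing = impute_median(amounts)
print(imputed, was_missing)
```

When this is combined with the cross-validation in Step 4, the fill value should be computed from the training folds only, so that the validation data does not leak into the imputation.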

<h4>Question 8</h4>

Say you are running a simple logistic regression to solve a problem but find the results to be unsatisfactory. What are some ways you might improve your model, or what models might you look into instead?

<i>Answer:</i>

- Normalize features: such that particular weights do not dominate within the model
- Add additional features
- Address outliers
- Select features (reducing noise)
- Tune hyperparameters (e.g., regularization) with cross-validation
- Try nonlinear models (logistic regression produces linear decision boundaries)

<h4>Question 9</h4>

Say you are running a linear regression for a dataset but you accidentally duplicated every data point. What happens to your $\beta$ coefficient?

<i>Answer:</i>

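It remains unchanged. The least-squares estimate is $\hat{\beta} = (X^TX)^{-1}X^Ty$, and duplicating every row doubles both $X^TX$ and $X^Ty$, so the factor of $2$ cancels. The reported standard errors shrink, however, because the sample size appears to be twice as large.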
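
This is easy to check numerically for simple (one-predictor) regression. The data points and the `ols_slope_intercept` helper below are illustrative:

```python
def ols_slope_intercept(xs, ys):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

original = ols_slope_intercept(xs, ys)
duplicated = ols_slope_intercept(xs * 2, ys * 2)  # every point appears twice
print(original, duplicated)
```

Both the numerator and the denominator of the slope double, so the two fits agree up to floating-point rounding.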

<h4>Question 10</h4>

Compare and contrast gradient boosting and random forests.

<i>Answer:</i>

Both use ensembles of decision trees, and are flexible models that don't need much data preprocessing. There are two main differences:

- In gradient boosting, trees are built one at a time, such that successive weak learners learn from the mistakes of preceding weak learners. In random forests, the trees are built independently at the same time.

- The output of gradient boosting combines the results of the weak learners with each successive iteration, whereas in random forests, the trees are combined at the end (through averaging or majority vote).

Because of these differences, gradient boosting is often more prone to overfitting than random forests, due to its focus on mistakes over training iterations and the lack of independence in tree building. Gradient boosting's hyperparameters are harder to tune than those of random forests, and the model may take longer to train because the trees are built sequentially.

<h4>Question 11</h4>

<p>Say DoorDash is launching in Singapore. For this new market, you want to predict the ETA for a delivery to reach a customer after an order has been placed on the app. From an earlier beta test, there were $10,000$ deliveries made. Do you have enough training data to create an accurate ETA model?</p>

<i>Answer:</i>
<ol>
<li>Step 1: Clarify what "good" [ETA] means</li>
    <ul>
        <li>How accurate does the ETA model need to be? The accuracy needed might be higher for the order-driver matching algorithm than what DoorDash needs to display to the customer. Also, it might be acceptable to under-promise and over-deliver.</li>
    </ul>
<br>
<li>Step 2: Assess baseline [ETA] performance</li>
    <ul>
        <li>A baseline model can be something as simple as the estimated driving time plus average preparation time</li>
    </ul>
<br>
<li>Step 3: Determine how more data improves accuracy</li>
    <ul>
        <li>A learning curve depicts how the accuracy changes when we train a model on a progressively larger percentage of the data. The point at which a metric like $R^2$ stops improving meaningfully as more data is added is a signal to start re-evaluating features rather than adding data.</li>
    </ul>
<br>
<li>Step 4: In case performance isn't good enough</li>
    <ul>
        <li>Are there too few features? Maybe add market supply and demand indicators, traffic patterns, etc.</li>
        <li>Are there too many features, causing the model to overfit? e.g., are there as many as there are data points? If so, consider dimension reduction or feature selection techniques.</li>
        <li>Will different models better handle the smaller data?</li>
    </ul>
</ol>

<h4>Question 12</h4>

Say we are running a binary classification loan model, and rejected applicants must be supplied with a reason for rejection. Without digging into the weights of the features, how would you supply these reasons?

<i>Answer:</i>

We could look at partial dependence plots (a.k.a. response curves) to assess how any one feature affects the model's decision. A PDP shows the marginal effect of a feature on the predicted target of a machine learning model.

Suppose we have the following situation:
- $100K$ income, $10K$ debt, $2$ credit cards, FICO score $700$
- $100K$ income, $10K$ debt, $2$ credit cards, FICO score $720$
- $100K$ income, $10K$ debt, $2$ credit cards, FICO score $600$
- $100K$ income, $10K$ debt, $2$ credit cards, FICO score $650$

If the third and fourth instances were rejected but instances one and two were accepted, we can conclude that (since all else is equal) the rejection is attributable to the FICO score.

<h4>Question 13</h4>

Say you have a large corpus of words. How would you identify synonyms?

<i>Answer:</i>

We can learn word embeddings (e.g., using Word2Vec), which produce vectors based on words' contexts. Once we have these, we can run an algorithm like k-means clustering or KNN to find likely synonyms for a given word.

<h4>Question 14</h4>

What is the bias-variance trade-off? How can it be expressed using an equation?

<i>Answer:</i>

The bias-variance trade-off can be expressed as:

$\text{Total Model Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$

Flexible models tend to have low bias and high variance, and more rigid models do the opposite.

The bias term comes from the error that occurs when a model underfits data (is too simple) and the variance term represents the error that occurs when a model overfits data, i.e., is too susceptible to changes in training data.

When creating a machine learning model, we want both bias and variance to be low, but reducing one typically increases the other, which is why it is a trade-off.
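
For squared-error loss the decomposition can be written out explicitly. A sketch, assuming data generated as $y = f(x) + \epsilon$ with zero-mean noise of variance $\sigma^2$ and a fitted model $\hat{f}$ (this notation is introduced here for illustration), with expectations taken over training sets and noise:

$E[(y - \hat{f}(x))^2] = (E[\hat{f}(x)] - f(x))^2 + E[(\hat{f}(x) - E[\hat{f}(x)])^2] + \sigma^2$

The three terms are the squared bias, the variance, and the irreducible error, respectively.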

<h4>Question 15</h4>

Describe cross-validation and the motivation behind it.

<i>Answer:</i>

Cross-validation assesses the performance of an algorithm in several resamples of the training data. It evaluates model performance on the portion of data not present in the subsample. The process is to:

1. Randomly shuffle the data and split it into $k$ equally sized folds

2. For each fold $i = 1, \ldots, k$, train the model on all the data except fold $i$, and evaluate the validation error on fold $i$.

3. Average the $k$ validation errors from step 2 to get an estimate of the true error.

This accomplishes the following:
- Avoids training and testing on the same subset of points
- Avoids using a dedicated validation set, with which no training can be done
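
The three steps above can be sketched without any libraries. The helper names, the toy data, and the stand-in "model" (which simply predicts the training mean) are all assumptions chosen to keep the example short:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Step 1: shuffle indices 0..n-1 and split them into k near-equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(xs, ys, k, fit, error):
    """Steps 2 and 3: train on k-1 folds, validate on the held-out fold, average."""
    folds = k_fold_indices(len(xs), k)
    fold_errors = []
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        fold_errors.append(error(model, [xs[j] for j in val_idx], [ys[j] for j in val_idx]))
    return sum(fold_errors) / k

def fit_mean(xs, ys):
    """Stand-in model: ignore x and predict the mean of the training targets."""
    return sum(ys) / len(ys)

def mse(model, xs, ys):
    """Validation error of the constant prediction."""
    return sum((y - model) ** 2 for y in ys) / len(ys)

xs = list(range(10))
ys = [2.0 * x for x in xs]
cv_error = cross_validate(xs, ys, k=5, fit=fit_mean, error=mse)
print(cv_error)
```

Any model can be plugged in through `fit` and `error`; every point is used for validation exactly once and for training $k-1$ times.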

<h4>Question 16</h4>

How would you build a lead scoring algorithm to predict whether a prospective company is likely to convert into being an enterprise customer?

<i>Answer:</i>

Step 1: Clarify lead scoring requirements

- Are we building this for our own company's sales leads, or an extensible version of our product?
- Business requirements (e.g., does it need to be easy to explain internally and externally)
- Are we running the algorithm only on our sales database, or a larger landscape of all companies?

Step 2: Explain the features you would use

Elements that should influence whether a prospective company turns into a customer:
- Firmographic data: what type of company, industry, amount of revenue, or employee count?
- Marketing data: have they interacted with marketing materials, like clicking on email-campaign links?
- Sales activity: has the prospective company interacted with sales? How many meetings took place, and how recent was the last one?
- Deal details: what products/licenses/etc. are being bought? What's the size of the deal/contract?
- Then feature engineering upon these features

Step 3: Explain models you would use

- Lead scoring can be done with a binary classifier that predicts the probability of a lead converting. Logistic regression has an easily interpretable result: the log-odds is a linear function of the features, and passing it through the sigmoid gives the probability of conversion. But it cannot capture complex interactions and can be numerically unstable under conditions like correlated covariates.

- A compromise between logistic regression and less interpretable models like neural networks or SVMs would be tree-based models, in which feature importance can readily be measured.

<h4>Question 19</h4>

Explain what information gain and entropy are in the context of a decision tree.

<i>Answer:</i>

Information gain is based on entropy, so we'll start with that:

$H(Y) = -\sum_{k=1}^{K} P(Y=k) \log P(Y=k)$

The amount of entropy shows how homogeneous a sample is. A completely homogeneous sample will have an entropy of $0$.

Information gain is based on the decrease in entropy ($H$) after splitting on an attribute.

$IG(X_j, Y) = H(Y) - H(Y|X_j)$
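
Both quantities are short to compute. A sketch on a four-row toy dataset, using base-2 logarithms (the base is an assumption; the formulas above leave it unspecified):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y): entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(X, Y) = H(Y) - H(Y | X) for a categorical feature X."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

labels = ["yes", "yes", "no", "no"]
perfect = ["a", "a", "b", "b"]   # splits the labels cleanly
useless = ["a", "b", "a", "b"]   # tells us nothing about the label

print(entropy(labels), information_gain(perfect, labels), information_gain(useless, labels))
```

A decision tree would split on `perfect`, since it removes all of the entropy, while `useless` removes none.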

<h4>Question 20</h4>

What is $L1$ and $L2$ regularization, and the difference between them?

<i>Answer:</i>

Both methods prevent overfitting by coercing the coefficients of a regression model toward zero.

$L1$, or Lasso, adds the absolute value of the coefficients as a penalty term, whereas 'ridge' $L2$ regularization adds the squared magnitude of the coefficients as the penalty term.

$Loss(L1) = L + \lambda \sum_j |w_j|$

$Loss(L2) = L + \lambda \sum_j w_j^2$

where $L = \sum_{i=1}^n (y_i - f(x_i))^2$

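$L1$ pushes every weight toward zero at the same rate, irrespective of its magnitude, so small weights can be driven to exactly zero (giving sparse models). With $L2$, the rate at which a weight approaches zero becomes slower as the weight approaches zero, so weights shrink but rarely become exactly zero.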
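
The contrast shows up even with a single weight. Suppose the unpenalized loss is $(w - a)^2$ for some target $a$ (a simplification chosen for illustration); then both penalized problems have closed-form minimizers:

```python
def l1_solution(a, lam):
    """argmin_w (w - a)^2 + lam * |w|: soft-thresholding at lam / 2."""
    shrunk = abs(a) - lam / 2
    if shrunk <= 0:
        return 0.0
    return shrunk if a > 0 else -shrunk

def l2_solution(a, lam):
    """argmin_w (w - a)^2 + lam * w^2: proportional shrinkage."""
    return a / (1 + lam)

for a in (0.3, 2.0):
    print(a, l1_solution(a, lam=1.0), l2_solution(a, lam=1.0))
```

With $\lambda = 1$, the small weight $0.3$ is set exactly to zero under $L1$ but only halved under $L2$, while the larger weight $2.0$ is reduced by a fixed amount under $L1$ and again halved under $L2$.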

<h4>Question 21</h4>

Describe gradient descent and the motivations behind SGD.

<i>Answer:</i>

The gradient descent algorithm takes small steps in the direction of steepest descent to optimize a particular objective function. The size of each step is proportional to the magnitude of the gradient of the function at the current value of the parameter being sought, and the step is taken in the direction of the negative gradient.

The stochastic version of the algorithm, SGD, uses an approximation of the gradient by using only 1 randomly selected sample at each step to calculate the derivative of the function, making this version of the algorithm faster and more attractive for situations involving lots of data.

The gradient descent algorithm will update $x$ as follows until it reaches convergence.

$x^{t+1} = x^t - \alpha_t \nabla f(x^t)$

i.e., we calculate the negative of the gradient of $f$ and scale that by some constant and move in that direction at the end of each iteration.
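
Both variants can be sketched on a one-dimensional problem. The objective, the constant step size, and the function names below are illustrative assumptions; the stochastic version minimizes the average of $(x - d)^2$ over a small dataset using one randomly chosen $d$ per step:

```python
import random

def gradient_descent(grad, x0, alpha=0.1, steps=100):
    """Repeat x <- x - alpha * grad(x) with a constant step size."""
    x = x0
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x

def sgd_mean(data, x0, alpha=0.01, steps=5000, seed=0):
    """Minimize the average of (x - d)^2 using one random sample per step."""
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        d = rng.choice(data)           # a single sample stands in for the full gradient
        x = x - alpha * 2 * (x - d)
    return x

# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3
x_gd = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
x_sgd = sgd_mean([1.0, 2.0, 3.0, 4.0, 5.0], x0=0.0)
print(x_gd, x_sgd)
```

Full gradient descent converges smoothly to the minimizer, whereas the SGD iterate hovers around it with some noise; each SGD step, however, touches only one data point.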

<h4>Question 22</h4>

Assume we have a classifier that produces a score between 0 and 1 for the probability of a particular loan being fraudulent. How would the ROC curve change if we took the square root of that score? If it doesn't change, what kind of functions would change the curve?

<i>Answer:</i>

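ROC curves plot the TPR vs. the FPR as the classification threshold varies, so the curve depends only on the rank ordering of the scores. The square root is monotonically increasing on $[0, 1]$, so the ordering is preserved: every threshold $t$ on the original scores corresponds to the threshold $\sqrt{t}$ on the transformed scores, giving the same TPR and FPR. The ROC curve therefore does not change.

In contrast, a function that is not strictly monotonically increasing can change the ROC curve, since the relative ordering of the scores is not maintained.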

e.g., $f(x)=-x$, $f(x)=-x^2$, or a stepwise function.
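
The invariance can be checked directly by computing the ROC points before and after a transformation. The labels, scores, and `roc_points` helper are made up for illustration:

```python
from math import sqrt

def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by thresholding at every distinct score."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

baseline = roc_points(y_true, scores)
same_after_sqrt = baseline == roc_points(y_true, [sqrt(s) for s in scores])
same_after_negation = baseline == roc_points(y_true, [-s for s in scores])
print(same_after_sqrt, same_after_negation)
```

The square root leaves every (FPR, TPR) pair in place, while negating the scores reverses the ranking and produces a different curve.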

<h4>Question 25</h4>

Compare and contrast Gaussian Naive Bayes and logistic regression. When would you use one over the other?

<i>Answer:</i>

Advantages:
- GNB requires only a small number of observations to be adequately trained. It is also easy to use and reasonably fast to implement.
- Logistic regression has a simple interpretation in terms of class probabilities, and it allows inferences to be made about features and identification of feature importance.

Disadvantages:
- GNB assumes the variables are conditionally independent given the class
- Logistic regression is not highly flexible. It may fail to capture interactions between features and so it may lose predictive power. The lack of flexibility can also lead to underfitting.

Differences:
- Since logistic regression directly learns P(Y|X), it is a discriminative classifier, whereas GNB directly estimates $P(Y)$ and $P(X|Y)$ and so it is a generative classifier.

Similarities:
- Both methods produce linear decision boundaries learned from training data (for GNB, when the classes share the same per-feature variances).
- GNB's implied $P(Y|X)$ is the same as that of logistic regression (but with particular parameters).

Logistic regression would be preferable assuming data size is not an issue, since the assumption of conditional independence breaks down if features are correlated. However, in cases where training data are limited or the data-generating process includes strong priors, using GNB may be preferable.

<h4>Question 27</h4>

Describe the kernel trick in SVMs and give a simple example. How do you decide which kernel to use?

The idea behind the kernel trick is that data that cannot be separated by a hyperplane in its current dimensionality may become separable after mapping it into a higher-dimensional space. The kernel computes the inner product in that space directly from the original inputs:

$k(x,y) = \Phi(x)^T \Phi(y)$

Say we have two examples $x$ and $y$, each with two features, and want to map them to a quadratic space. For $x = (x_1, x_2)$ we have the following:

$\Phi(x_1,x_2) = \begin{bmatrix} 1 \\ \sqrt{2} x_1 \\ \sqrt{2} x_2 \\ x_1^2 \\ x_2^2 \\ \sqrt{2} x_1 x_2 \\ \end{bmatrix}$

and we can use the following:

$k(x,y) = (1 + x^Ty)^2 = \Phi(x)^T \Phi(y)$

If we now change $n=2$ (quadratic) to arbitrary $n$, we can have arbitrarily complex $\Phi$. As long as we perform computations in the original feature space (without feature transformation), then we avoid long compute time while still mapping to a higher dimension.
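
The identity $k(x,y) = (1 + x^Ty)^2 = \Phi(x)^T \Phi(y)$ can be verified numerically. The sketch below uses the explicit quadratic feature map with $\sqrt{2}$ scaling on the linear and cross terms, which is what makes the identity exact; the input vectors are arbitrary:

```python
from math import sqrt, isclose

def phi(x):
    """Explicit quadratic feature map for a 2-D input."""
    x1, x2 = x
    return [1.0, sqrt(2) * x1, sqrt(2) * x2, x1 ** 2, x2 ** 2, sqrt(2) * x1 * x2]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, y):
    """Degree-2 polynomial kernel, computed in the original 2-D space."""
    return (1.0 + dot(x, y)) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(y))   # six multiplications in feature space
implicit = poly_kernel(x, y)     # one 2-D dot product and a square
print(explicit, implicit)
```

Both routes give $4$ here; the kernel route never builds the six-dimensional vectors, and the saving grows quickly with the degree and the input dimension.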

For linear problems, we can try a linear kernel. For nonlinear problems, we can try the polynomial or radial basis function (RBF, also known as Gaussian) kernels. We could also try many kernels along with a grid search of parameters.

<h4>Question 32</h4>

Describe the idea behind PCA and describe its formulation in matrix form. Go through the procedural description.
- PCA aims to represent data in a lower-dimensional space, and so it creates a small number of linear combinations of a vector $x$ (of dimensionality $p$) that explain most of the variance within $x$. We want to find the weight vectors $w_i$ such that we can define the following linear combinations.

$y_i = w_i^Tx = \sum_{j=1}^p w_{ij} x_j$

subject to the constraints that the $w_i$ are orthonormal, that $y_i$ is uncorrelated with $y_j$ for $i \neq j$, and that $Var(y_i)$ is maximized.

We perform the following procedure: we first find $y_1 = w_1^Tx$ with maximal variance, meaning that the projected values are obtained by orthogonally projecting the data onto the first principal direction, $w_1$. We then find $y_2 = w_2^Tx$ that is uncorrelated with $y_1$ and has maximal variance, and we continue this procedure iteratively until ending with the $k^{th}$ dimension such that $y_1, \ldots, y_k$ explain the majority of variance, $k \lt p$.

To solve, note that we have the following for the variance of each $y_i$, utilizing the covariance matrix $\Sigma$ of $x$.

$var(y_i) = w_i^T var(x) w_i = w_i^T \Sigma w_i$

Without any constraints, we could choose arbitrarily large weights to maximize this variance, and hence we normalize by assuming orthonormality of $w$, which guarantees that $w_i^T w_i = 1$.
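
The constrained maximization can be completed with a Lagrange multiplier. A short derivation, with $\Sigma$ denoting the covariance matrix of $x$ and $\lambda_i$ introduced here as the multiplier for the unit-norm constraint:

$\mathcal{L}(w_i, \lambda_i) = w_i^T \Sigma w_i - \lambda_i (w_i^T w_i - 1)$

$\frac{\partial \mathcal{L}}{\partial w_i} = 2 \Sigma w_i - 2 \lambda_i w_i = 0 \implies \Sigma w_i = \lambda_i w_i$

$var(y_i) = w_i^T \Sigma w_i = \lambda_i w_i^T w_i = \lambda_i$

So each $w_i$ is an eigenvector of $\Sigma$ and the variance it explains is the corresponding eigenvalue. The first principal direction is the eigenvector with the largest eigenvalue, the second is the eigenvector with the next largest, and so on.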

<h4>Question 34</h4>

How would you approach creating a music recommendation algorithm for Discover Weekly (a 30-song weekly playlist personalized to an individual user)?

Step 1: Clarify Details of Discover Weekly
- What is the goal of the algorithm?
- Do we recommend just songs, or include podcasts?
- Is the goal to recommend new music?
- Does the playlist need to change week-to-week if the user doesn't listen to it?

Step 2: Describe What Data Features You Would Use
- User-song interactions
- Metadata about the song
- Audio features
- Demographic features, e.g., region

Step 3: Explain Collaborative Filtering Model Setup
- Collaborative filtering uses data from feedback users have provided. A user-song matrix constitutes the dataset, and since explicit song ratings are lacking, we can proxy liking a song by using the number of times a user streamed it.
- Similarity measurement, KNN...
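
A minimal item-item sketch of the setup in Step 3, using stream counts as implicit feedback and cosine similarity between song columns. The song names, play counts, and helper functions are hypothetical, and a production system would work with a large sparse matrix rather than nested lists:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical user-song matrix: rows are users, columns are songs, entries are stream counts
songs = ["song_a", "song_b", "song_c"]
plays = [
    [10, 8, 0],
    [5, 6, 0],
    [0, 1, 9],
]

def song_vector(k):
    """Column k of the matrix: how often each user streamed song k."""
    return [row[k] for row in plays]

def most_similar(song):
    """Return the other song whose listening pattern is closest to `song`."""
    j = songs.index(song)
    scored = [(cosine(song_vector(j), song_vector(k)), songs[k])
              for k in range(len(songs)) if k != j]
    return max(scored)[1]

print(most_similar("song_a"))
```

Users who stream `song_a` heavily also stream `song_b`, so `song_b` is the nearest neighbor and a natural candidate to recommend to listeners of `song_a` who have not heard it yet.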

Step 4: Additional Considerations
- Cold start problem
- Scale
- Retraining 
- How to measure/track system over time