# Project Based Questions

### Likely To Lapse Models

#### Objective
* Improve retention by intervening with players who are likely to lapse
* Identify players who are likely to lapse
* Identify key features that are predictive of likelihood to lapse

#### What was the definition of lapse?
* For the purposes of the model, the target feature was modelled as a user who logs in during the next 7 days and does not log in during the 7 days after next week
* A key business metric used at Zynga is CURR (current user retention rate), which counts a player as retained if they logged in during the past 7 days and also during the 7 days prior to last week
* A player is considered lapsed if they logged in during the past 7 days and do not log in on any of the next 7 days
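
The lapse definition in the last bullet can be sketched as a small labelling function. This is only an illustration: the function name `is_lapsed`, the `as_of` scoring date and the choice to return `None` for players outside the eligible population are assumptions, not part of the original pipeline.

```python
from datetime import date, timedelta

def is_lapsed(login_dates, as_of, window_days=7):
    """Label a player as lapsed (True) or retained (False).

    Returns None when the player did not log in during the past window,
    because such a player is outside the population being scored.
    """
    past_start = as_of - timedelta(days=window_days)
    future_end = as_of + timedelta(days=window_days)
    # active in the past 7 days: (as_of - 7 days, as_of]
    active_past = any(past_start < d <= as_of for d in login_dates)
    if not active_past:
        return None
    # active in the next 7 days: (as_of, as_of + 7 days]
    active_next = any(as_of < d <= future_end for d in login_dates)
    return not active_next

as_of = date(2020, 1, 14)  # hypothetical scoring date
lapsed = is_lapsed({date(2020, 1, 10)}, as_of)
retained = is_lapsed({date(2020, 1, 10), date(2020, 1, 16)}, as_of)
not_eligible = is_lapsed({date(2019, 12, 1)}, as_of)
```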

#### What were the features used?
* Time Windowing: -1, -2, -3, -4 week
* LB Spins, MLB Spins, 
* Cash Hands, Tourn Hands, 
* Challenges completed, tickets redeemed
* Chat used, chat count
* Player level, changes
* Login days, hands played days, purch days
* Chips purchased, gold purchased
* txn ct, chips txn ct, gold txn ct
* Engg: momentum features (-3/-4, -2/-3, -1/-2)
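
A minimal sketch of the momentum features, assuming that "-3/-4" means the ratio of the week -3 value to the week -4 value (the notes do not spell this out). The sample data and the rule of returning 0 when the earlier week is 0 are also assumptions.

```python
import numpy as np

def momentum(recent, previous):
    """Ratio of a more recent week's value to the week before it; 0 where the earlier week is 0."""
    recent = np.asarray(recent, dtype=float)
    previous = np.asarray(previous, dtype=float)
    out = np.zeros_like(recent)
    np.divide(recent, previous, out=out, where=previous != 0)
    return out

# hypothetical weekly login-day counts for three players; columns are weeks -4, -3, -2, -1
login_days = np.array([
    [7, 6, 4, 2],
    [0, 0, 3, 5],
    [5, 5, 5, 5],
])

features = np.column_stack([
    momentum(login_days[:, 1], login_days[:, 0]),  # -3 / -4
    momentum(login_days[:, 2], login_days[:, 1]),  # -2 / -3
    momentum(login_days[:, 3], login_days[:, 2]),  # -1 / -2
])
```

A ratio below 1 in the last column (as for the first player) indicates declining activity going into the scoring week.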

#### What was the process adopted to treat missing values?
* Tourn Hands was null for most players; nulls were replaced with 0.
* In some rows the chat used flag was 1 while the chat count was 0; this inconsistency was corrected.
* 

#### What was the process used to deal with correlated features?
* Many features were correlated, e.g. chat used and chat count. The typical decision was to keep the more informative variable, e.g. chat count.
* For some features, such as gold txn count and total txn count, the ingredient features (gold txn ct and chips txn ct) were used in place of the added-up count. During iterations, either all the ingredient features were used or only the aggregated feature was used, never both.
* Part of the goal was to derive feature importance and check the partial dependence for each feature, hence we used the ingredient features a lot more than the aggregated features.
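
To illustrate why the ingredient features and their sum are not used together, here is a small sketch on simulated data (the counts below are made up). When the aggregated count is exactly the sum of its ingredients, adding it to the feature matrix contributes no new information.

```python
import numpy as np

rng = np.random.default_rng(0)
# simulated transaction counts, for illustration only
chips_txn_ct = rng.poisson(3, size=200)
gold_txn_ct = rng.poisson(1, size=200)
total_txn_ct = chips_txn_ct + gold_txn_ct  # aggregated feature

all_three = np.column_stack([chips_txn_ct, gold_txn_ct, total_txn_ct]).astype(float)
ingredients_only = all_three[:, :2]

# the third column is a linear combination of the first two, so the rank stays at 2
rank_all = np.linalg.matrix_rank(all_three)
rank_ingredients = np.linalg.matrix_rank(ingredients_only)
corr_with_total = np.corrcoef(chips_txn_ct, total_txn_ct)[0, 1]
```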

#### What was the process used for any variable transformations?
* Count variables such as hands played and login days are all right-skewed, with a long right tail
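
The notes do not record which transformation was applied. One common option for right-skewed counts is `log1p`, sketched here on simulated data; both the choice of transformation and the data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# simulated right-skewed count feature (e.g. hands played per week)
hands_played = rng.poisson(lam=rng.gamma(shape=0.5, scale=40.0, size=10_000))

def skewness(x):
    """Sample skewness: third central moment divided by the variance to the power 1.5."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# log1p(x) = log(1 + x) is defined at 0, which matters for count data
log_hands = np.log1p(hands_played)
print(skewness(hands_played), skewness(log_hands))
```

Tree-based models are insensitive to monotonic transformations of a feature, so this step matters mainly for logistic regression.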

#### What were the techniques considered for prediction?
* Logistic Regression, Random Forest, XGBoost, GBT

#### What are the weaknesses of each technique?
* Logistic Regression: 
    * __Pros:__ Interpretability
    * __Con:__ High bias, less accurate, assumes no multicollinearity among features, missing value treatment needs to be meaningful 
* Random Forest:
    * __Pros:__ Low Bias, robust to correlated features, can handle thousands of features, sensitive to training sample
    * __Con:__ Can overfit, interpretability is low, slow
    

#### Which was the technique used and why?
* Random Forest (payer) and XGBoost (Non-payer)
    * It was chosen based on performance and tuning
    * We tried to maximize Recall because the purpose of the project was to intervene for all players who are likely to lapse. We kept precision above 0.7 to avoid too many false positives. 
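
A minimal sketch of how a decision threshold could be picked to maximize recall while keeping precision at or above 0.7. The helper `pick_threshold` and the toy scores are hypothetical, and the sketch assumes the labels contain at least one positive.

```python
import numpy as np

def pick_threshold(y_true, scores, min_precision=0.7):
    """Return (threshold, precision, recall) with the highest recall whose precision >= min_precision."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    best = None
    for thr in np.unique(scores):  # candidate thresholds in ascending order
        pred = scores >= thr
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (float(thr), float(precision), float(recall))
    return best

# toy labels (1 = lapsed) and model scores, for illustration only
y_true = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
scores = [0.05, 0.2, 0.3, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9]
best = pick_threshold(y_true, scores)
```

In this toy case the lowest threshold that satisfies the precision floor is 0.3, which still flags every lapsing player.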
    
#### How were the hyper parameters tuned?
* Random Forest
    * __criterion__: gini, cross entropy
        * gini $\sum_{i=1}^{n_{class}} p_i (1-p_i)$
        * cross entropy $\sum_{i=1}^{n_{class}}-p_i \log{p_i}$
    * __n_estimators__: No. of trees. Averaging over more trees reduces variance. 
    * __min samples leaf count__: min. no. of samples required in a leaf. Increasing it reduces the height of the trees, which also reduces overfitting. 
    * __min samples split__: min. no. of samples required to split an internal node. 
    * __max depth__: max depth of each tree in the forest. If a tree is too deep it overfits. We saw that a depth of about 3-5 had the impact we needed. 
    * __max features__: no. of features considered at each split, typically sqrt(feature_ct) or log2(feature_ct)
    * __max leaf nodes__: Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
    * __min impurity decrease__:A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
    * __bootstrap or sub sample__: Whether bootstrap samples are used when building trees.
    * __oob_score__: Whether to use out-of-bag samples to estimate generalization performance. Each observation is scored using only the trees that did not see it during training, and the majority vote of those trees is compared with the true label.
    * __class_weight__: 'balanced' or 'balanced_subsample' or None
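
The two `criterion` formulas above can be evaluated directly for a node's class proportions. This is a small numeric sketch; the function names are illustrative and not a library API.

```python
import numpy as np

def gini(p):
    """Gini impurity: sum_i p_i * (1 - p_i)."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p)))

def cross_entropy_impurity(p):
    """Cross entropy: sum_i -p_i * log(p_i), with 0 * log(0) taken as 0."""
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]
    return float(-np.sum(nonzero * np.log(nonzero)))

mixed_node = [0.5, 0.5]  # evenly split node, the most impure case for two classes
pure_node = [1.0, 0.0]   # all samples in one class

print(gini(mixed_node), cross_entropy_impurity(mixed_node))
print(gini(pure_node), cross_entropy_impurity(pure_node))
```

Both measures are 0 for a pure node and largest for an evenly split node, which is why they usually lead to similar trees.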

#### What problems does each of the hyperparameters solve?

#### How does RandomForest work?
* 
#### How do Gradient Boosted Trees work?

#### How does XGBoost work?

#### What insights were derived from the model?

#### How would you assess the output of logistic regression?

#### Derive the estimation procedure for logistic regression?



#### How does maximum likelihood work?
In this discussion, we will lay down the foundational principles for estimating a model's parameters using maximum likelihood estimation and gradient descent. Using logistic regression, we will first walk through the mathematical solution, and subsequently implement it in code.

__Maximum Likelihood__
The logistic model uses the sigmoid function (denoted by $\sigma$) to estimate the probability that a given sample y belongs to class 1 given inputs X and weights W,

 $$P(y=1 \mid x)=\sigma(W^T X)$$
where the sigmoid of the activation $a_i$ for a given sample $i$ is:

$$y_i=\sigma(a_i)=\frac{1}{1+e^{-a_i}}$$
How well the model predictions $y_i$ match the targets $t_i$ can be captured by the objective function L (the likelihood), which we are trying to maximize.

$$L=\prod_{i=1}^M y_i^{t_i}(1-y_i)^{1-t_i}$$
If we take the log of the above function, we obtain the log-likelihood function, whose form enables easier calculation of partial derivatives. Taking the log and maximizing it is acceptable because the logarithm is monotonically increasing, and therefore it yields the same maximizer as our original objective function.

$$\log L=\sum_{i=1}^M \left[ t_i \log{y_i} + (1-t_i)\log(1-y_i) \right]$$
In our example, we will actually convert the objective function (which we would try to maximize) into a cost function (which we are trying to minimize) by taking the negative log-likelihood:
$$J=-\sum_{i=1}^M \left[ t_i \log{y_i} + (1-t_i)\log(1-y_i) \right]$$

__Gradient Descent__
Once we have an objective function, we can generally take its derivative with respect to the parameters (weights), set it equal to zero, and solve for the parameters to obtain the ideal solution. However, in the case of logistic regression (and many other complex or otherwise non-linear systems), this analytical method doesn't work because the resulting equations have no closed-form solution. Instead, we resort to a method known as gradient descent, whereby we randomly initialize the weights and then incrementally update them using the gradient of the cost function. We continue updating the weights until the gradient gets as close to zero as possible. We can show this mathematically:
$$w = w + \Delta w$$

where the second term on the right is defined as the negative of the learning rate times the derivative of the cost function with respect to the weights (which is our gradient):

 $$\Delta w=-\eta \nabla J(w)$$
Thus, we want to take the derivative of the cost function with respect to the weight, which, using the chain rule, gives us:

$$\frac{\partial J }{\partial w_j} = \sum_{i=1}^{M} \frac{\partial J}{\partial y_i} \frac{\partial y_i}{\partial a_i} \frac{\partial a_i}{\partial w_j}$$

Thus, we are looking to obtain three different derivatives. Let us start by solving for the derivative of the cost function with respect to $y_i$:
$$\frac{\partial J}{\partial y_i} = -\left[ t_i \frac{1}{y_i} + (1-t_i) \frac{-1}{1-y_i} \right] = -\left( \frac{t_i}{y_i}-\frac{1-t_i}{1-y_i} \right)$$

Next, let us solve for the derivative of y with respect to our activation function:

$$y_i=\sigma(a_i)=\frac{1}{1+e^{-a_i}}$$
$$\frac{\partial y_i}{\partial a_i} = y_i (1-y_i)$$

And lastly, we solve for the derivative of the activation function with respect to the weights:
$$a_i = W^T X_i$$
$$\frac{\partial a_i}{\partial w_j} = x_{ij}$$

Now we can put it all together and simplify.

$$\frac{\partial J}{\partial w_j} = - \sum_{i=1}^M \left[ \frac{t_i}{y_i} y_i (1-y_i) x_{ij} - \frac{1-t_i}{1-y_i} y_i (1-y_i) x_{ij} \right] = - \sum_{i=1}^M \left[ t_i (1-y_i) - (1-t_i) y_i \right] x_{ij} = \sum_{i=1}^M (y_i - t_i) x_{ij}$$
We can get rid of the summation above by applying the principle that a dot product between two vectors is a sum over the shared index. That is:
$$a^T b = \sum_{i=1}^M a_i b_i$$

Therefore, the gradient with respect to w is:

$$\frac{\partial J}{\partial w} = X^T (Y-T)$$
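
As a sanity check on the derivation, the analytic gradient $X^T (Y-T)$ can be compared against a finite-difference approximation of the negative log-likelihood. The sketch below uses random data, and the helper names are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll(w, X, T):
    """Negative log-likelihood J(w) for logistic regression."""
    Y = sigmoid(X.dot(w))
    return -np.sum(T * np.log(Y) + (1 - T) * np.log(1 - Y))

def analytic_grad(w, X, T):
    """Gradient from the derivation: X^T (Y - T)."""
    return X.T.dot(sigmoid(X.dot(w)) - T)

def numeric_grad(w, X, T, eps=1e-6):
    """Central finite differences, one weight at a time."""
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (nll(w + step, X, T) - nll(w - step, X, T)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
T = rng.integers(0, 2, size=20)
w = rng.normal(size=3)

max_diff = np.max(np.abs(analytic_grad(w, X, T) - numeric_grad(w, X, T)))
print(max_diff)
```

If the two gradients agree to within numerical error, the sign and form of the update rule are consistent with the cost function.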



If you are asking yourself where the bias term of our equation (w0) went, we calculate it the same way, except our x becomes 1. We will demonstrate how this is dealt with practically in the subsequent section.

$$\frac{\partial J}{\partial w_0} = \sum_{i=1}^{M} (y_i - t_i) x_{i0} = \sum_{i=1}^{M} (y_i - t_i)$$
__Coded Example__
We shall now use a practical example to demonstrate the application of our mathematical findings. We will create a basic logistic regression model with 100 samples and two inputs. Our inputs will be random normal variables, and we will center the first 50 inputs around (-2, -2) and the second 50 inputs around (2, 2). These two clusters will represent our targets (0 for the first 50 and 1 for the second 50), and because their centers are far apart, they will be (almost surely) linearly separable.
```python
import numpy as np
import matplotlib.pyplot as plt

N = 100
D = 2

X = np.random.randn(N,D)

# center the first 50 points at (-2,-2)
X[:50,:] = X[:50,:] - 2*np.ones((50,D))

# center the last 50 points at (2, 2)
X[50:,:] = X[50:,:] + 2*np.ones((50,D))

# labels: first 50 are 0, last 50 are 1
T = np.array([0]*50 + [1]*50)

# In order to easily deal with the bias term, we will simply add another N-by-1 vector of ones to our input matrix.
# add a column of ones
# ones = np.array([[1]*N]).T # old
ones = np.ones((N, 1))
Xb = np.concatenate((ones, X), axis=1)

# randomly initialize the weights
w = np.random.randn(D + 1)

# calculate the model output
z = Xb.dot(w)

def sigmoid(z):
    return 1/(1 + np.exp(-z))

Y = sigmoid(z)

# calculate the cross-entropy error (the negative log-likelihood J)
def cross_entropy(T, Y):
    E = 0
    for i in range(len(T)):
        if T[i] == 1:
            E -= np.log(Y[i])
        else:
            E -= np.log(1 - Y[i])
    return E

# let's do gradient descent 100 times
learning_rate = 0.1
for i in range(100):
    if i % 10 == 0:
        print(cross_entropy(T, Y))

    # gradient descent weight update: w <- w - learning_rate * Xb^T (Y - T)
    w += learning_rate * Xb.T.dot(T - Y)

    # recalculate Y
    Y = sigmoid(Xb.dot(w))

print("Final w:", w)

```



#### What is the difference between MLE and gradient descent?
__Maximum likelihood estimation__ is a general approach to estimating parameters in statistical models by maximizing the likelihood function defined as

$$L(\theta|X)=f(X|\theta)$$
that is, the probability of obtaining data $X$ given some value of the parameter $\theta$. Knowing the likelihood function for a given problem, you can look for the $\theta$ that maximizes the probability of obtaining the data you have. Sometimes there are known closed-form estimators, e.g. the arithmetic mean is the MLE for the $\mu$ parameter of a normal distribution, but in other cases you need other methods, including optimization algorithms. The ML approach does not tell you how to find the optimal value of $\theta$; it only tells you how to compare whether one value of $\theta$ is "more likely" than another, so you could even take guesses and use the likelihood to compare them. In gradient descent, the negative (log-)likelihood plays the role of the cost function.

__Gradient descent__ is an optimization algorithm. You can use it to find the minimum (or the maximum, in which case it is called gradient ascent) of many different functions. The algorithm does not care what function it minimizes; it just does what it was asked to do. So to use an optimization algorithm you need some way to tell whether one value of the parameter of interest is "better" than another: you provide a function to minimize, and the algorithm deals with finding its minimum.

You can obtain maximum likelihood estimates using different methods, and using an optimization algorithm is one of them. On the other hand, gradient descent can also be used to optimize functions other than the likelihood function.  
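
The distinction can be seen on the normal-mean example mentioned above: the MLE of $\mu$ has a closed form (the arithmetic mean), and gradient descent on the negative log-likelihood is just one way of arriving at the same number. The data, learning rate and iteration count below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=500)  # simulated sample

def avg_nll_grad(mu, x):
    """Gradient w.r.t. mu of the average normal NLL, with sigma fixed at 1.

    Fixing sigma does not change which mu minimizes the NLL.
    """
    return np.mean(mu - x)

# route 1: gradient descent on the negative log-likelihood
mu = 0.0
learning_rate = 0.5
for _ in range(200):
    mu -= learning_rate * avg_nll_grad(mu, data)

# route 2: the known closed-form MLE
closed_form = data.mean()
print(mu, closed_form)
```

MLE defines what is being optimized; gradient descent is only the procedure used to find the optimum when no closed form is available.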

Ref: [link](https://stats.stackexchange.com/questions/183871/what-is-the-difference-between-maximum-likelihood-estimation-gradient-descent)
#### What are the differences between GD, BGD, MBGD, SGD?

#### What are alternative models to estimating lapse behavior?



### Txn Prediction and LTV

#### What was the objective of the project?
* Poker wanted to target users with very low LTV or expected low txn with interstitials 
* Poker wanted to estimate how much residual revenue was present in the current payer base
* Poker wanted to detect which users are likely to have high LTV

#### What were the outputs generated?

#### What was the technique adopted and why?

#### How does BG / NBD work?

#### How does Gamma / Gamma model work?

#### What were the alternatives considered?

#### What was the validation and how was it performed?

#### What is the likelihood function for BG / NBD?

#### What is the likelihood function for Gamma / Gamma?

#### What were the insights gathered from the model?


#### What were the experiments conducted to achieve the stated objectives?



### Automated Feature Engg

#### What was the objective of the project?

#### What was done?

#### What were the outputs created?



### Bot or not model

#### What was the objective of the model?

#### What were the bot signatures calculated?

#### How was the source of truth established?

#### How much of the fraud was detected?

#### What was the impact of the detection?

#### How was the model scaled to millions of users?

#### What were the techniques used to do the prediction?

#### How were the hyperparameters tuned?

#### What was the scoring frequency?



### Bid Recommendation Service

#### What was the objective of the service?

#### What was the optimization technique used?


#### How was local optimization avoided?

#### Which partners were optimized?


#### How was validation done?


### Forecasting System for KPIs
#### What were the objectives?
#### What were the techniques used?
#### What were the assumptions made about the data?
#### How were the assumptions tested?
#### How was the validation done?
#### How were the results tracked and monitored?
#### What were the issues?

#### What were the alternate models considered?
#### How could the system be improved?



### Bayesian AB Testing library
#### What was the purpose of the library?
#### What were the underlying assumptions?
#### What is the theory behind Bayesian AB testing?
#### How were different types of variables treated for AB testing: proportion, count, continuous?

#### What is the meaning of conjugate prior?
#### Why is Beta a prior to Binomial distribution?

#### What is Thompson sampling?
#### How was this model used to make decisions?

#### What were the alternatives considered?

#### How does this approach compare to frequentist?

#### How is Type I error handled in Bayesian statistics?

#### How is Type II error handled in Bayesian statistics?

#### How does this compare with using Chi-Square test or Anova?

#### What are the shortcomings of the Bayesian approach?



### Flight Delay Prediction

### Media Mix Modelling

### Forecasting sales

### Marketing Budget Allocation

### Product Upsell Recommendation