Data Science Interview Prep

# Data Science Principles

## What are the differences between over-fitting and under-fitting?

In machine learning, one of the most common tasks is to fit a model to a set of training data, so as to be able to make reliable predictions on new unseen data.

In **overfitting**, a model learns the specific examples instead of discovering general ideas about them. It describes random error or noise instead of the underlying relationship. Overfitting occurs when a model is excessively complex, such as having too many features relative to the number of observations. An overfitted model has poor predictive performance because it overreacts to minor fluctuations in the training data. It fails to generalise well.

**Underfitting** occurs when a model cannot capture the underlying trend of the data. It has made too many simplifying assumptions. Underfitting would occur, for example, when fitting a linear model to non-linear data. Such a model would also have poor predictive performance.

The balance between overfitting and underfitting is related to the **bias/variance** trade-off: high **bias** is associated with underfitting and high **variance** with overfitting.

There are techniques to combat overfitting, such as **regularization**.

![image.png](attachment:image.png)

## What is the bias/variance trade-off?

The expected prediction error of a model at a point $x$ can be broken down mathematically into three components.

$$ Err(x) = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$

**Bias** is error introduced into the model by oversimplification, much like a prejudice: the model makes simplifying assumptions about the data. High bias leads to **underfitting**.

**Variance** is error introduced into the model due to complexity. The model learns noise from the training set and struggles to generalise well. High variance leads to **overfitting**.
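
For squared-error loss, the terms in the decomposition have standard definitions. As a sketch, write $f$ for the true function, $\hat{f}$ for the fitted model (random through the training set it was fitted on) and $\sigma^2$ for the noise variance; these symbols are introduced here for illustration only:

$$ \text{Bias}[\hat{f}(x)] = E[\hat{f}(x)] - f(x), \qquad \text{Variance}[\hat{f}(x)] = E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big], \qquad \text{Irreducible Error} = \sigma^2 $$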

Normally, as complexity increases, we would see a reduction in error due to lower bias in the model. However, this only happens up to a certain point; if we keep increasing complexity beyond it, we end up overfitting the model and have high variance.

The goal of any supervised machine learning algorithm is to have low bias and low variance to achieve good prediction performance. However, decreasing one typically increases the other, hence the idea of a trade-off.

![image.png](attachment:image.png)

Consider the **K-Nearest Neighbors Algorithm**. For small K, there is high variance because each prediction depends only on a few very close neighbors. Increasing K, which increases the number of neighbors that contribute to the prediction, reduces variance and increases bias.
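
The effect of K on variance can be checked with a small simulation. This is a minimal sketch, assuming a hypothetical one-dimensional data set ($y = x^2$ plus Gaussian noise) and a hand-written KNN regressor; it measures how much the prediction at one fixed point changes when the training set is redrawn.

```python
import random
import statistics

def knn_predict(train, x0, k):
    """Average the targets of the k training points closest to x0."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

def prediction_spread(k, n_trials=200, n_points=50, seed=0):
    """Variance of the prediction at x0 = 0.5 across freshly sampled training sets."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_trials):
        # Hypothetical data: y = x^2 plus Gaussian noise
        train = [(x, x * x + rng.gauss(0, 0.3))
                 for x in (rng.random() for _ in range(n_points))]
        preds.append(knn_predict(train, 0.5, k))
    return statistics.pvariance(preds)

var_k1 = prediction_spread(k=1)
var_k15 = prediction_spread(k=15)
print(f"k=1  prediction variance: {var_k1:.4f}")
print(f"k=15 prediction variance: {var_k15:.4f}")
```

With K = 1 the prediction inherits the noise of a single neighbor, while averaging 15 neighbors smooths most of it out, at the cost of pulling in points that are further from the query.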

![image.png](attachment:image.png)

## Explain over-fitting to a non-technical audience

**Example 1: Suit design**

Suppose you’re designing a suit to sell off the rack, but you have to base its measurements off only your own body.

If you want this suit to fit a lot of paying customers, which is the better strategy?

1. Measure every fold and wrinkle in your skin, every oddly angled joint, every misshapen muscle.
2. Take a reasonable number of standard measurements, but avoid using the idiosyncrasies of your own body shape.

The first one is going to fit *you* very well. It’s exactly tailored to fit all the weird idiosyncrasies of your body. But it’s not going to sell well off the rack, because those weird idiosyncrasies are just that—weird idiosyncrasies. Very few other people will have all those same idiosyncrasies, and so there are very few people for whom this suit will fit well.

In contrast, while the second strategy will not fit you quite as well, it’ll use the standard measurements that a lot of people have. It’ll fit a lot of people’s bodies reasonably well.

The first strategy leads to **overfitting**. You fitted a model—ahem, a suit—on a sample in a way that was very good for the sample you fitted it on, and it ended up being a very poor fit for other samples. When we describe overfitting as “fitting noise”, the “noise” is all the weird quirks of your own body, random deviations from a population-wide average that shouldn’t be learned—ahem, fitted—by a suit.

The second strategy strikes a balance between how well it fits you and how well it will generalize to the population at large. Because your body shape is in many ways similar to many people’s body shapes—there are many people with your height and leg length—you want the suit to fit you reasonably well.

This second strategy should lead to less overfitting than the first, and is generally the one you want in almost all supervised learning scenarios.

**Example 2: Studying for an exam**

Imagine you’re taking a math class that your friend took last semester and he gives you his old exam.

You have three options:

1. Do the right thing and say thanks but no thanks
2. Memorize the answer to each question
3. Treat the exam as a practice test and study different versions of the questions

For explanatory purposes, let’s assume you choose against option 1.

Suppose you go with 2. and commit every answer to memory. There is only one scenario where this ends well — it’s the exact same exam. For obvious reasons, this is quite risky. Even if the questions are similar but not exact, you won’t be able to answer them correctly.

Suppose you go with 3. and understand how to obtain the correct answers as opposed to simply memorizing them. This is the safer scenario as you’re much more likely to obtain correct answers on new questions that you haven’t seen before.

Machine learning suffers from the same pitfall as cheating on an exam. If you simply memorize the answers to every question then you will not be able to generalize to new questions. If instead you understand how the answer is obtained then you are better off.

Before you go and take an exam, take a practice test with questions you haven’t seen before. This is the point of validation data. I’m sure it will teach you that 2. isn’t such a great choice.

## Explain what regularization is and why it is useful

Background (two scenarios where we might want to explore regularization):
- Often, a model fails to generalize to unseen data. This can happen when the model tries to accommodate every variation in the training data, both the actual pattern and the noise. The model then becomes overly complex and has high variance due to overfitting, which hurts its performance on unseen data. The goal is then to **reduce the variance** while making sure that the model does not become biased (underfitting).
- Alternatively, a large number of features and their coefficients results in computationally intensive models. The goal is then to perform **feature selection** to reduce the number of features whilst maintaining performance.

Another advantage would be to **counter multicollinearity**.

Regularization is a term used for constraining your machine learning model. 

For linear models, regularization is typically achieved by constraining the coefficients of the model. It works by adding a **penalty term** to the cost function.

There are two main types of regularization techniques for linear models:
- Ridge regression (L2 norm)
- Lasso regression (L1 norm)

In **Ridge Regression**, a penalty term proportional to the sum of the squared coefficients is added to the loss function. This forces the learning algorithm not only to fit the data but also to keep the model coefficients as small as possible.

**Lasso Regression** adds a penalty term proportional to the sum of the **absolute values** of the coefficients. Lasso (Least Absolute Shrinkage and Selection Operator) performs both estimation and selection, as some of the coefficients can be shrunk to exactly zero.

For both, the amount of penalisation is controlled by a **hyperparameter** lambda. When lambda is 0, we recover linear regression without regularization. As lambda tends to infinity, all coefficients are shrunk towards zero. To find the optimal value of **lambda**, we can use cross-validation.
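
The different behaviour of the two penalties is easiest to see with a single feature, where both have closed-form solutions. The sketch below assumes a model $y \approx w x$ with no intercept and a small made-up data set; the function names are illustrative and the objective each one minimises is stated in its docstring.

```python
def ols_slope(xs, ys):
    """Least-squares slope for a one-feature model with no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def ridge_slope(xs, ys, lam):
    """Minimises sum((y - w*x)^2) + lam * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def lasso_slope(xs, ys, lam):
    """Minimises sum((y - w*x)^2) + lam * |w| via soft-thresholding."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam / 2, 0.0)  # pulled to exactly zero once lam is large enough
    return (1 if sxy >= 0 else -1) * shrunk / sxx

# Hypothetical data, roughly y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

for lam in (0, 10, 200):
    print(lam, round(ridge_slope(xs, ys, lam), 4), round(lasso_slope(xs, ys, lam), 4))
```

At lambda = 0 both match ordinary least squares. For a large lambda the ridge coefficient is small but non-zero, whereas the lasso coefficient is exactly zero, which is why lasso can act as a feature selector.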

## When is Ridge regression favorable over Lasso regression?

ISLR’s authors (James, Witten, Hastie and Tibshirani) asserted that, *"in presence of few variables with medium / large sized effect, use lasso regression. In presence of many variables with small / medium sized effect, use ridge regression."*

Conceptually, lasso regression (L1) does both variable selection and parameter shrinkage, whereas ridge regression only does parameter shrinkage and ends up keeping all the coefficients in the model. In the presence of correlated variables, ridge regression might be the preferred choice. Ridge regression also works best in situations where the least squares estimates have high variance. The choice therefore depends on our modelling objective.

## What are the benefits of Feature Selection?

1. Improve performance. In regression cases, look at adjusted $R^2$, which only increases if the newly added predictor improves the model by more than would be expected by chance.
2. Decrease complexity and data storage needs.
3. Improve interpretation and understanding of how features relate to each other.
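
The adjusted $R^2$ mentioned in the first point is commonly computed as $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, with $n$ observations and $p$ predictors. A minimal sketch with made-up numbers, showing a case where adding a weak predictor raises $R^2$ slightly but lowers adjusted $R^2$:

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R^2: penalises R^2 for the number of predictors used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Hypothetical fits on 50 observations
before = adjusted_r2(r2=0.900, n_samples=50, n_predictors=5)
after = adjusted_r2(r2=0.901, n_samples=50, n_predictors=6)  # one weak predictor added
print(round(before, 4), round(after, 4))
```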

## How do you perform Feature Selection?

1. Domain Knowledge
2. Wrapper Methods. These are computationally intensive. E.g. Recursive Feature Elimination or Stepwise Selection
3. Filter Methods. These are part of pre-processing and take place before modelling. E.g. Removing features with high correlation using VIF.
4. Embedded Methods. This is when feature selection is included in the actual Machine Learning algorithm. E.g. Lasso regression

## How do you deal with class imbalance?

Consider the scenario where only 2 in 1,000 cases are positive. Then even a classifier that always predicts "no" would be 99.8% accurate. Such imbalanced data is common in the medical field or in credit card fraud detection.

Accuracy is therefore not a good metric. To deal with the class imbalance we can do the following.

1. Assign weights to classes such that the minority classes get larger weights. We could use weights that are inversely proportional to class frequency.

2. Oversampling, where we pick more samples from the minority class. However, this could lead to overfitting as we have exact duplicates.

3. Undersampling, where we pick fewer samples from the majority class. However, this does not make use of a large amount of the available data.

4. SMOTE - synthetic minority oversampling. This creates new sample data artificially.

All of these reduce the effect of the imbalance, and we would then proceed with the modelling stage.

Additionally, in the case of disease detection in the medical field, we may want to alter the prediction threshold using the ROC curve to place greater emphasis on Recall, i.e. on reducing False Negatives. We would want a lower threshold as the impact of a False Negative would be devastating.
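
The first two techniques can be sketched in a few lines. The weighting rule below, `n_samples / (n_classes * class_count)`, is one common way to make weights inversely proportional to class frequency; the function names and the toy data (2 positives in 1,000 rows, as in the scenario above) are illustrative assumptions.

```python
import random
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    new_rows, new_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [row for row, label in zip(rows, labels) if label == cls]
        extra = rng.choices(pool, k=target - count)  # sampling with replacement
        new_rows.extend(extra)
        new_labels.extend([cls] * len(extra))
    return new_rows, new_labels

labels = [0] * 998 + [1] * 2
rows = [[float(i)] for i in range(1000)]  # placeholder features

weights = balanced_class_weights(labels)
rows_bal, labels_bal = random_oversample(rows, labels)
print(weights)
print(Counter(labels_bal))
```

Resampling should be applied to the training split only, so that the validation and test sets keep the real class distribution.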

## What is a hyperparameter?

In machine learning, a hyperparameter is a parameter whose value is set before the learning process begins. 

Different model training algorithms require different hyperparameters. Some simple algorithms (such as ordinary least squares regression) require none.

Given hyperparameters, the training algorithm learns the parameters from the data. 

For instance, Lasso is an algorithm that adds a regularization hyperparameter to ordinary least squares regression, which has to be set before estimating the parameters through the training algorithm.

# ML Algorithms

## Logistic Regression

### What is logistic regression?

Logistic Regression, despite having the name 'regression', is a classification algorithm. It is used to predict 'classes', most commonly a binary outcome.

For example, it can be used to predict:
1. whether an email is spam (class 1) or not (class 0)
2. whether a tumor is malignant (class 1) or not (class 0)
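
Under the hood, logistic regression passes a linear combination of the features through the sigmoid function to obtain a probability, and then applies a threshold to turn it into a class. A minimal sketch, assuming weights that have already been learned (the numbers below are made up):

```python
import math

def predict_proba(features, weights, bias):
    """Probability of class 1: sigmoid of the linear score."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(features, weights, bias, threshold=0.5):
    """Assign class 1 when the predicted probability reaches the threshold."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical learned parameters for two features
weights, bias = [1.2, -0.5], -1.0

print(predict_proba([3.0, 1.0], weights, bias))
print(predict_class([3.0, 1.0], weights, bias))
print(predict_class([0.2, 2.0], weights, bias))
```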

## K-Nearest Neighbours

### What is the KNN algorithm and how does it work?

KNN is a supervised learning algorithm used for both classification and regression tasks. It is a **distance-based** method, meaning that it assumes a smaller distance between two points means that the points are more similar. Each feature (column) acts as a dimension.

Step 1: Choose a point

Step 2: Find the K-nearest points. K is a pre-defined constant such as 1, 3, 5, 11

Step 3: Predict a label for the current point. In classification, take the most common class among the neighbours, whereas in regression, take the average of their target values.

We can use **weighted averages** based on the distance of the neighbours.

Note that KNN **requires scaling** so that all distances are on the same scale.

We choose **odd values of K** to prevent ties in the class majority 'voting' for binary classification.

It is also known as a lazy learner because it involves minimal training of the model. It does not build a general model from the training data in advance: nothing happens in the fit step beyond storing the data, and distances are only calculated in the predict step.
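
The steps above can be sketched directly in plain Python. This is an illustrative implementation with made-up points, using Euclidean distance and unweighted majority voting; it assumes the features are already on a comparable scale.

```python
import math
from collections import Counter

def knn_classify(train_X, train_y, query, k=3):
    """Return the most common label among the k training points nearest to query."""
    # All the work happens at prediction time: compute every distance, then vote
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: two small clusters
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_classify(train_X, train_y, (1.1, 1.0)))  # a
print(knn_classify(train_X, train_y, (4.0, 4.0)))  # b
```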

### We use euclidean distance to calculate the distance between nearest neighbors. Why not manhattan distance?

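Manhattan distance can in fact be used. It is defined in any number of dimensions as the sum of the absolute differences along each axis, so it has no dimension restrictions. It measures distance as if movement were restricted to axis-parallel steps (like streets on a city grid), whereas Euclidean distance measures the straight-line distance between two points. Euclidean distance is the usual default for continuous, scaled features, but Manhattan distance is a reasonable alternative and is sometimes preferred for high-dimensional data. The choice of metric can be treated as a hyperparameter.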
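
For reference, both metrics can be computed for points with any number of coordinates. A short sketch with two made-up points in three dimensions:

```python
import math

def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block distance: sum of the absolute differences along each axis."""
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = (0, 0, 0), (1, 2, 2)
print(euclidean(p, q))  # 3.0
print(manhattan(p, q))  # 5
```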

### How to choose K?

When K is too small, any given prediction only takes into account a small number of points, giving a risk of overfitting.

When K is too large, the model begins to underfit the data.

A practical approach is therefore to fit KNN for different values of K, generate predictions and compare performance metrics.

K should be odd to prevent ties.
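
One way to compare values of K is leave-one-out validation: predict each point from all the others and record the share predicted correctly. The sketch below uses hypothetical data (two overlapping Gaussian clusters) and an arbitrary list of candidate values.

```python
import math
import random
from collections import Counter

def knn_classify(train, query, k):
    """Majority vote among the k training points nearest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def leave_one_out_accuracy(data, k):
    """Predict each point from all the others and report the share predicted correctly."""
    hits = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_classify(rest, point, k) == label
    return hits / len(data)

# Hypothetical data: two overlapping Gaussian clusters in two dimensions
rng = random.Random(42)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(40)]
data += [((rng.gauss(2, 1), rng.gauss(2, 1)), 1) for _ in range(40)]

candidate_ks = [1, 3, 5, 7, 11, 15]
scores = {k: leave_one_out_accuracy(data, k) for k in candidate_ks}
best_k = max(scores, key=scores.get)
print(scores)
print("best K:", best_k)
```

On a real problem the same loop would normally use k-fold cross-validation and a metric suited to the task rather than plain accuracy.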

![image.png](attachment:image.png)

## Naive Bayes

### What is Naive Bayes Algorithm and why is it Naive?

The Naive Bayes Algorithm is based on Bayes' Theorem, which describes the probability of an event, based on prior knowledge of conditions that might be related to the event.

Naive Bayes is called ‘naive’ because it assumes that all of the features in a data set are equally important and **independent** of each other given the class. These assumptions are rarely true in real-world scenarios.

### Explain prior probability, likelihood and marginal likelihood in context of Naive Bayes algorithm?

**Prior probability** is the proportion of each class of the dependent variable in the data set. It is the best guess you can make about a class without any further information. For example, suppose the dependent variable is binary (1 and 0), the proportion of 1 (spam) is 70% and the proportion of 0 (not spam) is 30%. Then, before looking at its contents, we would estimate a 70% chance that any new email is spam.

**Likelihood** is the probability of observing a given feature value within a given class. For example, the probability that the word ‘FREE’ appears in a message known to be spam is a likelihood. **Marginal likelihood** is the probability that the word ‘FREE’ appears in any message, regardless of class.

$$ P(c | x) = \frac{P(x | c) * P(c)}{P(x)} $$
where $P(c | x)$ is the posterior probability, $P(x | c)$ is the likelihood, $P(c)$ is the class prior probability and $P(x)$ is the predictor prior probability (the marginal likelihood).
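
As a worked example of the formula, take the 70% spam prior from above and assume, purely for illustration, that ‘FREE’ appears in 40% of spam messages and 5% of non-spam messages. The marginal likelihood $P(x)$ is then obtained by summing over both classes.

```python
def posterior(prior, likelihood, likelihood_other):
    """P(c | x) via Bayes' theorem for a binary class."""
    evidence = likelihood * prior + likelihood_other * (1 - prior)  # P(x)
    return likelihood * prior / evidence

# Prior from the text; both likelihoods are assumed values
p_spam_given_free = posterior(prior=0.7, likelihood=0.4, likelihood_other=0.05)
print(round(p_spam_given_free, 3))  # about 0.949
```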

### Can you give an example which uses Naive Bayes?

Naive Bayes can be used for document classification, for example when we are predicting whether a message is SPAM. 

 $$ P(\text{Spam | Word}) = \dfrac{P(\text{Word | Spam})P(\text{Spam})}{P(\text{Word})}$$  

Using the bag of words representation, you can then define $P(\text{Word | Spam})$ as

 $$P(\text{Word | Spam}) = \dfrac{\text{Word Frequency in Document}}{\text{Word Frequency Across All Spam Documents}}$$  
 
However, this formulation has a problem: what if you encounter a word in the test set that was not present in the training set? This new word would have a frequency of zero! This would commit two grave sins. First, there would be a division by zero error. Secondly, the numerator would also be zero; if you were to simply modify the denominator, having a term with zero probability would cause the probability for the entire document to also be zero when you subsequently multiplied the conditional probabilities in Multinomial Bayes. To effectively counteract these issues, **Laplacian smoothing** is often used giving:   

 $$P(\text{Word | Spam}) = \dfrac{\text{Word Frequency in Document} + 1}{\text{Word Frequency Across All Spam Documents + Number of Words in Corpus Vocabulary}}$$  
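
A sketch of add-one smoothing in code. Note the assumption: it uses the common multinomial formulation, in which the count of a word across all spam documents is divided by the total number of words in spam documents, with 1 added to each count and the vocabulary size added to the denominator. The documents and vocabulary are made up.

```python
from collections import Counter

def smoothed_word_probs(spam_documents, vocabulary, alpha=1):
    """Estimate P(word | spam) with add-alpha (Laplace) smoothing."""
    counts = Counter(word for doc in spam_documents for word in doc)
    total = sum(counts[word] for word in vocabulary)
    denominator = total + alpha * len(vocabulary)
    # Counter returns 0 for unseen words, so they still get a small non-zero probability
    return {word: (counts[word] + alpha) / denominator for word in vocabulary}

spam_documents = [["free", "money", "free"], ["win", "money"]]
vocabulary = {"free", "money", "win", "meeting"}  # "meeting" never appears in spam

probs = smoothed_word_probs(spam_documents, vocabulary)
print(probs["free"], probs["meeting"])
```

The unseen word ‘meeting’ ends up with probability 1/9 instead of 0, so multiplying the conditional probabilities of a document no longer collapses to zero.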

### What are the pros and cons of Naive Bayes?

Pros
* easy and fast, also good for multi-class problems
* outperforms other algorithms if independence assumption holds
* good for categorical inputs

Cons
* assumes predictors are independent
* for numerical inputs, assumes a normal distribution
* if a category appears in the test set but was not observed in training, the whole probability will be zero. Hence the need for Laplace smoothing.

## Decision Trees

### What are Decision Trees?

A decision tree is a supervised machine learning algorithm used for both **Regression and Classification** tasks.

It breaks down a data set into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. Decision trees can handle both categorical and numerical data.

It is a **directed acyclic graph** (collection of nodes and edges that can only be traversed in a specific direction and with no loops).

We can think of each internal node being a choice and each leaf node representing a classification. When generating a prediction, a test instance is routed down the tree according to the values of the attributes in the successive nodes. When the instance reaches a leaf, it is classified according to the label assigned to the corresponding leaf.

Decision tree learning is a **greedy** algorithm: at each node it picks the split that looks best at that moment, without revisiting earlier choices, and then repeats the process on each branch.

### What is the Gini index?

### What are Entropy and Information Gain in the Decision tree algorithm?

A core algorithm for building decision trees is ID3 (Iterative Dichotomiser 3). ID3 uses Entropy and Information Gain to construct a decision tree.

**Entropy** is a measure of disorder or uncertainty, named after Shannon's entropy in *Information Theory*. We can think of it as an indicator of how "messy" data is. Higher entropy means less predictive power. If a sample is completely homogeneous then its entropy is zero, and if a sample is equally divided between two classes it has an entropy of one. At each step of the tree, we want to decrease entropy.
![image.png](attachment:image.png)

**Information Gain** 
The Information Gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attributes that return the highest information gain.

$$ \text{Information Gain} = \text{Entropy}_{\text{Parent}} - \sum_{\text{children}} \frac{n_{\text{child}}}{n_{\text{parent}}} \text{Entropy}_{\text{Child}} $$
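
Both quantities are short to compute. The sketch below uses base-2 logarithms, so a 50/50 binary sample has entropy 1, and compares a perfect split with a nearly useless one on a made-up set of labels.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5
perfect_split = [["yes"] * 5, ["no"] * 5]
poor_split = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]

print(entropy(parent))                          # 1.0
print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, poor_split))
```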

### How do you prune a decision tree?

Pruning is the process of removing sub-nodes of a decision node. It is the opposite of splitting.

We need to constrain the decision tree algorithm because, if we continue to grow the tree, there is a high risk of overfitting it to the training set.

Some hyperparameters to consider are:
* `max_depth`: too deep = overfit, too shallow = underfit
* `min_samples_split`: min number of samples required to split an internal node
* `min_samples_leaf`: min number of samples that a leaf node can contain. If the value is too big, this leads to underfitting as the leaves still contain too much uncertainty.

### What are the pros and cons of decision trees?

## Ensemble Methods

### What is XGBoost?

XGBoost is an open source library providing a high-performance implementation of gradient boosted decision trees.

**Boosting**

With a regular machine learning model, like a decision tree, we’d simply train a single model on our dataset and use that for prediction. We might play around with the parameters for a bit or augment the data, but in the end we are still using a single model. Even if we build an ensemble, all of the models are trained and applied to our data separately.

*Boosting*, on the other hand, takes a more iterative approach. It’s still technically an ensemble technique in that many models are combined to form the final one, but it combines them in a more clever way.
Rather than training all of the models in isolation of one another, boosting trains models *in succession*, with each new model being trained to correct the errors made by the previous ones. Models are added sequentially until no further improvements can be made.

The advantage of this iterative approach is that the new models being added are focused on correcting the mistakes which were caused by other models. In a standard ensemble method where models are trained in isolation, all of the models might simply end up making the same mistakes!

**Gradient Boosting** specifically is an approach where new models are trained to predict the residuals (i.e. errors) of prior models.
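
The residual-fitting idea can be illustrated without any library. This is a toy sketch for squared-error regression on one feature, using one-split trees (stumps) as the weak learners; it is not how XGBoost is implemented, and the learning rate, number of rounds and data are arbitrary choices.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (a stump) to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keeps both sides of the split non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    preds = [sum(ys) / len(ys)] * len(ys)
    mse_history = [sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)]
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        # Add a damped version of the new model's correction to the ensemble
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return preds, mse_history

# Hypothetical data: y = x^2 on ten points
xs = list(range(10))
ys = [x ** 2 for x in xs]
preds, mse_history = gradient_boost(xs, ys)
print(round(mse_history[0], 2), round(mse_history[-1], 2))
```

Each round lowers the training error a little because the new stump targets exactly what the current ensemble still gets wrong.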

![image.png](attachment:image.png)

# Evaluation Metrics

## What is MLE?

Let's start with an example. Suppose we have data points representing the weight (in kg) of students in a class. The data points are shown in the figure below.
![image.png](attachment:image.png)

This appears to follow a normal distribution. But how do we get the mean and standard deviation for this distribution? One way is to directly compute the mean and sd of the given data, which come out to be 49.8 kg and 11.37 kg respectively. These values are a good representation of the given data but may not best describe the population.

MLE gives a principled way to estimate such parameters. It can be defined as **a method for estimating population parameters (such as the mean and variance for Normal, rate (lambda) for Poisson, etc.) from sample data such that the probability (likelihood) of obtaining the observed data is maximized**.

In MLE, we can assume that we have a likelihood function $L(\theta,x)$, where $\theta$ is the distribution parameter vector and $x$ is the set of observations. We are interested in finding the value of $\theta$ that maximizes the likelihood with given observations (values of $x$).

We also assume that the observations $x_i$ are independent and identically distributed random variables drawn from a probability distribution.

It is usually easier to take the logarithm of the likelihood and maximise the log-likelihood instead, since the product over independent observations becomes a sum and the maximising value of $\theta$ is unchanged.
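
For a Normal distribution the maximisation has a closed form: the MLE of the mean is the sample mean, and the MLE of the variance is the average squared deviation (dividing by $n$ rather than $n - 1$). The sketch below checks this numerically on made-up weights by comparing the log-likelihood at the MLE with nearby parameter values.

```python
import math

def normal_log_likelihood(data, mu, sigma):
    """Log-likelihood of i.i.d. Normal(mu, sigma^2) observations."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - squared_error / (2 * sigma ** 2)

# Hypothetical sample of student weights in kg
weights_kg = [38.0, 45.5, 49.0, 52.5, 61.0, 47.0, 55.5]
n = len(weights_kg)

mu_hat = sum(weights_kg) / n
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in weights_kg) / n)

best = normal_log_likelihood(weights_kg, mu_hat, sigma_hat)
shifted = normal_log_likelihood(weights_kg, mu_hat + 1.0, sigma_hat)
wider = normal_log_likelihood(weights_kg, mu_hat, sigma_hat * 1.2)
print(round(best, 3), round(shifted, 3), round(wider, 3))
```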

## What is AIC?

AIC is a measure of fit which penalizes a model for the number of coefficients. It stands for Akaike information criterion.

$$ \text{AIC} = -2 \ln(\hat{L}) + 2 k $$

where *k* is the number of estimated parameters in the model (which grows with the number of features) and $\hat{L}$ is the maximum value of the likelihood.

We always prefer the model with the **minimum AIC value**.

The goal is to find a good balance between **good fit** (high log likelihood) and **complexity** (complex models, i.e. models with more features are penalized more). We want a model with few features but which fits the data well.

BIC (Bayesian information criterion) is similar, but its penalty also takes the number of observations into account.
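
A small numeric illustration of both criteria, using the AIC formula above and the standard BIC definition $k \ln(n) - 2 \ln(\hat{L})$. The two models and their log-likelihoods are made up: model B fits slightly better but uses twice as many parameters.

```python
import math

def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_likelihood

# Hypothetical fitted models on n = 100 observations
model_a = {"log_likelihood": -120.0, "k": 3}
model_b = {"log_likelihood": -118.0, "k": 6}

aic_a, aic_b = aic(**model_a), aic(**model_b)
bic_a, bic_b = bic(n=100, **model_a), bic(n=100, **model_b)
print(aic_a, aic_b)  # 246.0 248.0
print(round(bic_a, 2), round(bic_b, 2))
```

Both criteria prefer model A here, and BIC does so by a wider margin because its penalty per parameter, $\ln(100) \approx 4.6$, is larger than AIC's 2.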

## What is a confusion matrix?

A confusion matrix is used to evaluate a classification model.

Mathematically, as per sklearn convention, it is a matrix $C$ such that $C_{ij}$ is equal to the number of observations known to be in group $i$ and predicted to be in group $j$, i.e. rows for observations and columns for predictions.
![image.png](attachment:image.png)

**TN** = True Negative. Both prediction and observation are negative.

**TP** = True Positive. Both prediction and observation are positive.

**FP** = False positive or type 1 error. Predicted positive but actually negative.

**FN** = False negative or type 2 error. Predicted negative but actually positive.
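
A minimal sketch of building the matrix by hand, following the convention described above (rows for observations, columns for predictions). The labels are made up and the function is an illustration, not sklearn's implementation.

```python
def confusion_matrix(y_true, y_pred, n_classes=2):
    """matrix[i][j] counts observations in class i that were predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]

matrix = confusion_matrix(y_true, y_pred)
(tn, fp), (fn, tp) = matrix
print(matrix)  # [[4, 2], [1, 3]]
```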

## What are Precision, Recall and Accuracy?

**Precision**: out of all positive predictions, how many are actually positive.
$$ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} $$

**Recall**: out of all positive observations, how many were correctly predicted.
$$ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} $$

There is typically an inverse relationship between the two: as the prediction threshold changes, one goes up while the other goes down.

**High precision** means lower FP, higher FN and a **High Threshold** for determining class 1 / predicting positive.

**High recall** means lower FN, higher FP and a **Low Threshold** for determining class 1 / predicting positive.

**Accuracy**: out of all predictions, how many were correct?
$$ \text{Accuracy} = \frac{\text{TP + TN}}{\text{Number of Observations}} $$
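
The effect of the threshold on precision and recall can be seen on a handful of made-up predicted probabilities. In this sketch a case is predicted positive when its score is at least the threshold.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall when predicting positive for scores >= threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical model scores and true labels
scores = [0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for threshold in (0.8, 0.5, 0.25):
    print(threshold, precision_recall(labels, scores, threshold))
```

The high threshold (0.8) gives perfect precision but only half the recall, while the low threshold (0.25) finds every positive at the cost of more false positives.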

## Precision vs. Recall

**Example where precision is more important**

We are trying to predict whether a certain day is a good day to launch a satellite into space based on the weather.

In this case a False Positive would occur when the model suggests a day is good based on the weather but actually it wasn't. As a result the satellite would be launched and potentially damaged, costing millions.

On the other hand, a False Negative would occur when the model suggests a day is not a good day based on the weather but actually it was alright for launch. As a result, we missed a good day but the consequences aren't that bad.

In this case a False Positive is a lot worse than a False Negative, so we'd like to minimize False Positives and thus have **High Precision**. This means a high threshold for predicting positive. 

**Another example where precision is more important**

We are building a model to detect whether an email received is spam.

In this case a False Positive would occur when the model suggests an email is spam and sends it to the spam folder when in fact it wasn't spam. This can be disastrous as someone could miss a time sensitive important email.

On the other hand, a False Negative would occur when the model treats a junk email as not spam and leaves it in the user's inbox when in fact it is spam. This is mildly annoying for the user.

In this case a False Positive is a lot worse than a False Negative, so we'd like to minimize False Positives and thus have **High Precision**. This means a high threshold for predicting positive. 

**Example where recall is more important**

We are building a model to screen for cancer.

In this case a False Positive would occur when the model suspects cancer but actually the person is healthy. This would cause a scare but further testing would rule out the cancer.

On the other hand, a False Negative would occur when the model suggests the person is healthy when in fact they are sick. This is disastrous as the person would miss out on follow-up tests and treatment. 

In this case, a False Negative is a lot worse than a False Positive, so we'd like to minimize False Negatives and thus have **High Recall**. This means a low threshold for predicting positive.

**Example of a call center for insurance claims**

We are building a model to detect if an insurance claim is fraudulent.

In this case, a False Positive would occur if the model suggests the claim is fraudulent when in fact it is genuine. This would be mildly inconvenient for the customer but through further checks and evidence it would be determined to be truthful and the claim would be paid.

On the other hand, a False Negative would occur if the model suggested the claim was genuine when in fact it was fraudulent. This would result in the firm paying the claim and thus total loss.

In this case, a False Negative is a lot worse than a False Positive, so we'd like to minimize False Negatives and thus have **High Recall**. This means a low threshold for predicting positive.

**Can you cite some examples where both false positive and false negatives are equally important?**

Consider the case of a model for a bank trying to predict if a loan should be approved. 

A False Positive would occur if the model thought the business would be able to repay, and thus the loan was approved when in fact they couldn't, resulting in a loss to the bank.

A False Negative would occur if the model thought the business would not be able to repay the loan and thus the loan was denied when it shouldn't have been. This results in a loss of potential customer to the bank.

Banks don’t want to lose good customers and at the same point in time, they don’t want to acquire bad customers. In this scenario, both the false positives and false negatives become very important to measure.

## What are Sensitivity and Specificity?

**Sensitivity** (also called the true positive rate or recall): out of positive observations, how many were correctly predicted to be positive? i.e. How many relevant items are selected?

**Specificity** (also called the true negative rate): out of negative observations, how many were correctly identified as such.
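
In terms of the confusion-matrix counts, these two rates can be written as:

$$ \text{Sensitivity} = \frac{\text{TP}}{\text{TP} + \text{FN}}, \qquad \text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}} $$

The false positive rate is therefore $1 - \text{Specificity}$.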

![image.png](attachment:image.png)

## What is the F1 score?

The $F_1$ score is a measure of a model's performance. It is a weighted average of Precision and Recall of a model, or more precisely it is the harmonic mean of the two.

It reaches its best value at 1, with perfect precision and recall. $F_1$ penalizes a model which skews too hard towards one of Precision or Recall. To have a high $F_1$, both Precision and Recall need to be high.

$$ F_1 = 2 * \frac{\text{Precision * Recall}}{\text{Precision + Recall}} $$

Note that this is a specific case of the $F_\beta$ score, which is given by

$$ F_\beta = (1+\beta^2) * \frac{\text{Precision * Recall}}{(\beta^2 * \text{Precision}) + \text{Recall}} $$

Two commonly used values for $\beta$ are those corresponding to the $F_2$, which weighs recall higher than precision (by placing more emphasis on false negatives), and the $F_{0.5}$ measure, which weighs recall lower than precision (by placing more emphasis on false positives).
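
A short sketch of the score in code, using the standard definition with the factor $(1 + \beta^2)$. The precision and recall values are made up, with recall higher than precision, so $F_2$ comes out above $F_1$ and $F_{0.5}$ below it.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid dividing by zero when the model finds no true positives
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 0.6, 0.75  # hypothetical values

f1 = f_beta(precision, recall)
f2 = f_beta(precision, recall, beta=2)
f_half = f_beta(precision, recall, beta=0.5)
print(round(f1, 4), round(f2, 4), round(f_half, 4))
```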

## What is an ROC curve and AUC?

A Receiver Operating Characteristic (ROC) graph is a technique for visualizing and selecting classifiers based on their performance. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.

TPR: out of positive observations, how many were predicted to be positive
$$ \text{TPR} = \frac{\text{TP}}{\text{TP + FN}} $$

FPR: out of negative observations, how many were predicted to be positive
$$ \text{FPR} = \frac{\text{FP}}{\text{FP + TN}} $$

Moving along the curve varies the threshold for assigning a prediction to a class. In general, we're looking for the threshold which gives a high TPR and a low FPR, i.e. a point close to the top left corner. However, the best choice depends on the business case. If recall is more important, i.e. we want to minimize False Negatives, then we need to be high on the y-axis, and accept a position further right on the x-axis.

From the ROC curve, we can get the Area Under the Curve (AUC). It takes values between 0 and 1, with 0.5 corresponding to random guessing and 1 to a perfect classifier; a value below 0.5 means the model ranks cases worse than chance. This is useful for comparing different models.
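
The curve and its area can be computed by hand on a small example. This sketch lowers the threshold one score at a time and integrates with the trapezoidal rule; it assumes there are no tied scores and uses made-up labels and scores.

```python
def roc_points(labels, scores):
    """(FPR, TPR) points obtained by lowering the threshold one score at a time."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the curve using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical model scores and true labels (no ties)
scores = [0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

points = roc_points(labels, scores)
print(auc(points))  # 0.8125
```

The result equals the share of (positive, negative) pairs in which the positive case receives the higher score, here 13 out of 16.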

![image.png](attachment:image.png)

## How would you evaluate a Logistic Regression model?

We can use the following methods:

1. Since logistic regression is used to predict probabilities, we can use the ROC curve and its AUC along with the confusion matrix to determine its performance.
2. AIC plays a role in logistic regression analogous to that of adjusted R² in linear regression. AIC is a measure of fit which penalizes a model for its number of coefficients. Therefore, we always prefer the model with the minimum AIC value.

# Product

## Netflix Acquisition

**Let's say at Netflix we offer a subscription where customers can enroll for a 30 day free trial. After 30 days, customers will be automatically charged based on the package selected.
Let's say we want to measure the success of acquiring new users through the free trial.
How can we measure acquisition success and what metrics can we use to measure the success of the free trial?** 

The goal of this strategy is customer acquisition.

User journey: the prospective customer enrolls for the 30 day free trial after providing their payment details.

There can be 3 outcomes:
1. Once the free-trial period is over, the payment is deducted from their account and the prospective customer becomes a customer.
2. The prospective customer uses the free trial for 30 days and then opts out on the last day.
3. The prospective customer opts out somewhere in the middle of the trial period.

The metrics which need to be measured to determine acquisition success are as follows:

1.	Number of prospective customers that converted to customers after 30 days.
2. Number of customers who stayed with Netflix for at least 1 month after the trial period ended.
3. Customer churn rate: the number of customers who cancelled as soon as the trial period was over plus the number of customers who opted out within a month after the trial period.

The metrics which need to be measured to determine success of free trial:

4. Customer Acquisition Cost/Life Time Value ratio: for a successful business like Netflix it should be at least 1:3.
5. Percentage of new customers who have been onboarded through this campaign daily, weekly and monthly. This determines the efficacy of the campaign.

# Scenarios

## Cancer Detection/ Class Imbalance

**You are given a data set on cancer detection. You’ve built a classification model and achieved an accuracy of 96%. Why shouldn’t you be happy with your model performance? What can you do about it?**

Accuracy should not be used as the measure of performance here, because the 96% (as given) might come from predicting only the majority class correctly, whereas our class of interest is the minority class (4%), the people who were actually diagnosed with cancer.

If only 4 in 100 cases are positive, then even a classifier that always predicts no would have an accuracy of 96%.
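
This baseline is easy to verify. The sketch below builds the 4-in-100 scenario with made-up labels and scores a classifier that always predicts the negative class.

```python
# 4 positive cases out of 100, and a "model" that always predicts negative
y_true = [1] * 4 + [0] * 96
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
recall = tp / (tp + fn)

print(accuracy)  # 0.96
print(recall)    # 0.0
```

The accuracy looks strong, yet recall is zero: not a single cancer case is detected.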

We can take the following steps:
1. Use undersampling, oversampling or SMOTE to make the data balanced
2. Alter the prediction threshold value and find the optimal threshold using the ROC curve/ F-score
3. Assign weights to classes such that the minority class gets a larger weight.
4. Use anomaly detection