## Type I error, type II error, and power of test

- Type I error (false positive): rejecting a `true` null hypothesis;

- Type II error (false negative): failing to reject a `false` null hypothesis;

- Power of a test: the probability of rejecting the null hypothesis when it is false, i.e. one minus the type II error rate;
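
To see how the two error rates and power relate numerically, here is a minimal sketch for a one-sided z-test with known standard deviation. The function name `z_test_power` and its parameters are assumptions made for illustration, not part of the notes above.

```python
from statistics import NormalDist


def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z-test of H0: mu = mu0 against H1: mu = mu0 + effect.

    alpha is the type I error rate; 1 - power is the type II error rate.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)   # rejection threshold under H0
    shift = effect * n ** 0.5 / sigma        # mean of the z statistic under H1
    return 1 - std_normal.cdf(z_crit - shift)


power = z_test_power(effect=0.5, sigma=1.0, n=30)
print(f"power = {power:.3f}, type II error rate = {1 - power:.3f}")
```

With `effect=0` the power collapses to `alpha`, and it grows as the sample size `n` grows.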

## Over-fitting vs under-fitting

- Over-fitting usually happens when there is a small amount of data and a large number of variables (the model is over-complicated for the data): the trained model ends up modeling not only the information but also the noise;

- Conversely, if the model does not even capture the information properly, we call it under-fitting;

- The ideal case is a model that captures the information and leaves out the noise;

- Over-fitting can be mitigated by using cross-validation (such as k-fold) to select models and by regularization techniques;

## When do you use the classification technique over the regression technique?

- Classification techniques are used when the output is a categorical (discrete) variable, whereas regression techniques are used when the output variable is continuous;

## What is the importance of data cleansing?

- As the name suggests, data cleansing is the process of removing or updating information that is incorrect, incomplete, duplicated, irrelevant, or improperly formatted. It is important because it improves the quality of the data, which eventually leads to a more reliable and accurate model;

## What are the important steps of data cleansing?

- Data correctness

- Removing duplicated and irrelevant data

- Structural errors

- Outliers

- Treatment for missing data

## How is K-NN different from K-means clustering?

- K-nearest neighbors (K-NN) is a classification/regression ML algorithm, which belongs to supervised learning;

- K-means is a clustering ML algorithm, which belongs to unsupervised learning;

- In K-NN, `k` is the number of nearest neighbors used to classify a test sample (or to predict its value, in the case of a continuous variable/regression), whereas in K-means `k` is the number of clusters the algorithm tries to learn from the data;
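
The supervised side of this comparison can be sketched as a tiny K-NN classifier. It is a minimal illustration that assumes Euclidean distance and majority voting; `knn_predict` and the toy data are made up for this example.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_points)),
                   key=lambda i: dist(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, labels, (1.1, 1.0), k=3))
```

Note that the labels are required here, which is exactly what makes K-NN supervised.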

## What is p-value?

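- The p-value helps you determine the strength of your results when you perform a hypothesis test: it is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually observed, so a small p-value is evidence against the null hypothesis;

- [A brief discussion of the p-value (what is the p-value)](https://www.jianshu.com/p/4c9b49878f3d)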

- [Chi-square distribution](https://en.wikipedia.org/wiki/Chi-square_distribution#:~:text=In%20probability%20theory%20and%20statistics,independent%20standard%20normal%20random%20variables.)
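
One common way to make the p-value concrete is a z-test, where the p-value is the probability, under the null hypothesis, of a test statistic at least as extreme as the observed one. The sketch below assumes a two-sided test with known standard deviation; the function name and the numbers are illustrative only.

```python
from statistics import NormalDist


def two_sided_p_value(sample_mean, mu0, sigma, n):
    """p-value of a two-sided z-test of H0: mu = mu0 with known sigma."""
    z = (sample_mean - mu0) / (sigma / n ** 0.5)
    # Probability, under H0, of a z statistic at least as extreme as observed.
    return 2 * (1 - NormalDist().cdf(abs(z)))


p = two_sided_p_value(sample_mean=10.4, mu0=10.0, sigma=1.0, n=25)
print(f"p = {p:.4f}")
```

Here the z statistic is 2.0, so the p-value falls just below the conventional 0.05 threshold.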

## What is the use of statistics in data science?

- Statistics provides tools and methodologies to identify patterns and structures in data, allowing humans to gain deeper insight into the data;

## Supervised learning vs unsupervised learning

- Supervised learning requires labeled data for training, while unsupervised learning doesn't;

- Supervised learning finds application in classification and regression tasks;

- Unsupervised learning finds application in clustering and association rules mining;

## Explain normal distribution

- Normal distribution is also called the Gaussian Distribution...

## Mention some drawbacks of the linear model

- I don't really agree with this question: different types of models do not have drawbacks as such, it is just a matter of whether a model suits the problem or situation at hand;

- A linear model trains fast and can easily be applied to large-scale data, but for some problems the data simply has no linear relationship; that still cannot be called a drawback of the linear model...

## What is Naive Bayes and what does the naive mean there?

- Naive Bayes applies Bayes' theorem to compute the posterior probability of a class given the observed features, starting from what is known as prior knowledge (the prior probabilities).

- A naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature, given the class variable.

- Basically, it is `naive` because this independence assumption may or may not turn out to be correct, and in practice it often does not hold.
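
The independence assumption is easiest to see in code: the posterior score of a class is its prior multiplied by one likelihood per feature. This is a minimal sketch for categorical features; the function names, the add-one smoothing, and the toy weather data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count what is needed for the prior and the per-feature likelihoods."""
    class_counts = Counter(labels)
    # value_counts[label][j][value]: how often feature j equals value in that class
    value_counts = defaultdict(lambda: defaultdict(Counter))
    # distinct_values[j]: the set of values feature j takes anywhere in the data
    distinct_values = defaultdict(set)
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[label][j][value] += 1
            distinct_values[j].add(value)
    return class_counts, value_counts, distinct_values


def predict(row, model):
    """Return the class with the largest unnormalized posterior."""
    class_counts, value_counts, distinct_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total                      # prior P(class)
        for j, value in enumerate(row):
            # Naive step: treat features as independent given the class and
            # multiply their likelihoods. Add-one smoothing is an assumption
            # of this sketch to avoid zero probabilities.
            seen = value_counts[label][j][value]
            score *= (seen + 1) / (count + len(distinct_values[j]))
        scores[label] = score
    return max(scores, key=scores.get)


rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("sunny", "hot")]
labels = ["no", "no", "yes", "yes", "no"]
model = train_naive_bayes(rows, labels)
print(predict(("rainy", "cool"), model), predict(("sunny", "hot"), model))
```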

## How can you select k for k-means?

- Elbow method

- Silhouette score method (the silhouette score is the most prevalent choice for determining the optimal value of k)
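
The elbow method can be sketched with a bare-bones k-means: compute the within-cluster sum of squares (inertia) for several values of k and look for the point where it stops dropping sharply. This is only a sketch, assuming Euclidean distance and a naive "first k points" initialization; `kmeans_inertia` and the toy points are made up for illustration, and the silhouette score is not covered here.

```python
from math import dist


def kmeans_inertia(points, k, iterations=20):
    """Run a basic k-means and return the within-cluster sum of squares."""
    # Assumption for this sketch: initialize centroids with the first k points.
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[c]
            for c, cluster in enumerate(clusters)
        ]
    return sum(min(dist(p, c) for c in centroids) ** 2 for p in points)


points = [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0), (1.2, 0.9), (5.1, 4.8), (9.2, 1.1),
          (0.9, 1.2), (4.9, 5.2), (8.8, 0.9)]
for k in range(1, 6):
    print(k, round(kmeans_inertia(points, k), 3))
```

For this toy data the inertia falls steeply up to k = 3 and barely changes afterwards, which is the "elbow".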

## How to measure a model?

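- Accuracy: number of correctly predicted samples / total number of samples

- Error rate: number of incorrectly predicted samples / total number of samples

- Sensitivity: number of correctly predicted positive samples / total number of positive samples

- Specificity: number of correctly predicted negative samples / total number of negative samples

- Precision: number of correctly predicted positive samples / total number of samples predicted as positive

- Recall: also known as the true positive rate (the same quantity as sensitivity)

- F-Measure: the harmonic mean of precision and recall; for the F1 score, $2 \cdot precision \cdot recall / (precision + recall)$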

---

- Inference time

- Robustness

- Explainability

- Scalability

---

- ROC (Receiver operating characteristic) curve

- PR (precision recall) curve

## What are lambda functions?

- A lambda function is a small anonymous function.

- A lambda function can take any number of arguments, but can only have one expression.
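
A few Python lambdas, as a quick illustration of the two points above (the variable names are arbitrary):

```python
# A lambda with one argument, bound to a name only for demonstration.
square = lambda x: x * x

# A lambda with two arguments; the body is still a single expression.
add = lambda a, b: a + b

# Typical use: a short throwaway key function.
words = ["banana", "kiwi", "cherry"]
by_length = sorted(words, key=lambda w: len(w))

print(square(4), add(2, 3), by_length)
```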

## What is reinforcement learning?

- Reinforcement learning is a machine learning paradigm distinct from both supervised and unsupervised learning: an agent learns from reward signals rather than from labeled data.

- It is a state-based learning technique. The environment has rules for state changes that move the system from one state to another as the agent takes actions during the training phase.

## What is entropy and information gain in decision tree algorithm?

- Entropy is used to check the homogeneity of a sample.

- If the value of entropy is `0`, the sample is completely homogeneous. On the other hand, if entropy has a value of `1` (with two classes), the sample is equally divided. Entropy controls how a Decision Tree decides to split the data. It actually affects how a Decision Tree draws its boundaries.

- The information gain is the decrease in entropy after the dataset is split on an attribute. Constructing a decision tree is always about finding the attributes that return the highest information gain.
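
Both quantities are short to compute. The sketch below assumes Shannon entropy in bits; the function names and the toy labels are illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(labels, groups):
    """Entropy of `labels` minus the weighted entropy of its split `groups`."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder


parent = ["yes", "yes", "no", "no"]
print(entropy(parent))
print(information_gain(parent, [["yes", "yes"], ["no", "no"]]))
print(information_gain(parent, [["yes", "no"], ["yes", "no"]]))
```

A split that separates the classes perfectly recovers all of the parent's entropy as gain, while a split that leaves each group as mixed as the parent gains nothing.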

## What is cross-validation?

- Training dataset + testing dataset

- LOOCV (leave-one-out cross-validation): like the simple training dataset + testing dataset partition, but in each iteration only one sample is held out as the test data, and this is repeated n times (once per sample);

- K-fold cross-validation: like LOOCV, but instead of a single sample, one of the k folds is used as the test data in each iteration, and this is repeated k times;

- Training dataset + validation dataset + testing dataset;
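
The k-fold partition can be sketched as an index generator; LOOCV is the special case `k = n_samples`. This is a minimal sketch without shuffling, and `k_fold_indices` is a name invented for this example.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(6, 3):
    print("train", train, "test", test)
```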

## What is bias-variance trade off?

- Bias and variance are concepts tied to the generalization problem in ML training. Usually we define a loss function and try to minimize it during training, but minimizing the training loss does not guarantee that the model performs well on more general, real-world data; a model with low training loss can even be completely unusable on such data. The expected error on unseen data is called the generalization error, and it typically shows up as a gap between the loss on the training dataset and the loss on the validation/testing dataset;

- Generalization error has three components: random error (noise), bias, and variance;

- $Error = bias^2 + variance + noise$

- Random error is caused by noise in the data, outliers, etc., and can hardly be avoided;

- `Bias` is the gap between the expected value of the trained model's predictions and the real value; basically it tells the __accuracy__ of the model;

- `Variance` measures how far a set of predictions spreads out around the trained model's expected value; basically it tells the __stability__ of the model;

- Take a target shooting game as an example: after training, a boy with his favorite gun is expected to score 9 per shot (10 is the full score), but on one shot he scores 7. Here the bias is 10 - 9 = 1, and that shot deviates from the expected value by 9 - 7 = 2; the variance is the average of such squared deviations over many shots;

- Thus, if your model is over-simplified for the dataset, it usually leads to high bias + low variance; conversely, if your model is over-complicated for the dataset, it usually leads to low bias + high variance. Neither is what we want. Tuning the model to balance bias against variance so that it best solves the problem is the bias-variance trade-off.

- [On the Bias-Variance Tradeoff](https://liam.page/2017/03/25/bias-variance-tradeoff/)
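
The shooting example can be worked through numerically. The scores below are made up for illustration (chosen so that the expected score is 9 against a target of 10); the sketch checks that the mean squared error around the target splits into bias squared plus variance when there is no separate noise term.

```python
from statistics import fmean, pvariance

target = 10                      # the bullseye score
shots = [9, 7, 10, 9, 10]        # made-up scores for illustration only

expected = fmean(shots)          # average outcome of the shooter
bias = target - expected         # systematic offset from the target
variance = pvariance(shots)      # average squared spread around `expected`
mse = fmean([(target - s) ** 2 for s in shots])  # average squared miss

print(f"expected = {expected}, bias = {bias}, variance = {variance}")
print(f"mse = {mse}, bias^2 + variance = {bias ** 2 + variance}")
```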

## The types of biases that occur during sampling

- Self-selection bias

- Under-coverage bias 

- Survivorship bias

- [Sampling bias: What is it and why does it matter?](https://www.scribbr.com/methodology/sampling-bias/)

## What is the Confusion Matrix?

- A binary classifier predicts all data instances of a test dataset as either positive or negative. This produces four outcomes:
    * True positive(TP) — Correct positive prediction
    * False-positive(FP, type 1 error) — Incorrect positive prediction
    * True negative(TN) — Correct negative prediction
    * False-negative(FN, type 2 error) — Incorrect negative prediction
    
- It helps in calculating various measures, where $P = TP + FN$ is the number of actual positives and $N = TN + FP$ is the number of actual negatives:
    * error rate: $(FP + FN)/(P + N)$
    * specificity: $(TN / N)$
    * accuracy: $(TP + TN)/(P + N)$
    * sensitivity: $(TP / P)$
    * precision: $(TP / (TP + FP))$
    
- A confusion matrix is essentially used to evaluate the performance of a machine learning model;
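
The measures above follow directly from the four counts. A minimal sketch, with `confusion_metrics` and the example counts invented for illustration:

```python
def confusion_metrics(tp, fp, tn, fn):
    """Derive the measures listed above from the four confusion-matrix counts."""
    p = tp + fn          # all actual positives
    n = tn + fp          # all actual negatives
    return {
        "error_rate": (fp + fn) / (p + n),
        "specificity": tn / n,
        "accuracy": (tp + tn) / (p + n),
        "sensitivity": tp / p,
        "precision": tp / (tp + fp),
    }


metrics = confusion_metrics(tp=40, fp=10, tn=45, fn=5)
print(metrics)
```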

## What are exploding/vanishing gradients?

- Exploding/vanishing gradients are the problematic scenarios in which error gradients accumulate into very large or very small values during training, producing very large or negligible updates to the weights of the neural network. As a result the model becomes unstable or is unable to learn from the training data;

- They can be mitigated by changing the weight initialization, using different activation functions, or normalizing the data;
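
A toy calculation shows where both problems come from: backpropagation multiplies one factor per layer, so factors below 1 shrink the gradient geometrically and factors above 1 blow it up. The sketch assumes a deliberately simplified chain of identical one-unit sigmoid layers; the function names are illustrative.

```python
from math import exp


def sigmoid_derivative(x):
    """Derivative of the logistic sigmoid; its maximum is 0.25 at x = 0."""
    s = 1 / (1 + exp(-x))
    return s * (1 - s)


def chained_gradient(layers, weight, pre_activation=0.0):
    """Magnitude of a gradient backpropagated through `layers` identical
    one-unit sigmoid layers (a deliberately simplified toy network)."""
    return abs(weight * sigmoid_derivative(pre_activation)) ** layers


print(chained_gradient(layers=20, weight=1.0))   # shrinks towards zero
print(chained_gradient(layers=20, weight=8.0))   # grows very large
```

This is why the choice of activation function and initial weight scale matters: both change the per-layer factor.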

## What is Law of Large Numbers?

- The `Law of Large Numbers` states that if an experiment is repeated independently a large number of times, the average of the individual results gets close to the expected value. Likewise, the sample variance and the sample standard deviation converge towards the population variance and standard deviation.
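
A quick simulation illustrates the statement with a fair six-sided die, whose expected value is 3.5. The function name and the fixed seed are choices made for this sketch.

```python
import random


def mean_of_die_rolls(n_rolls, seed=0):
    """Average of `n_rolls` simulated fair six-sided die rolls."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    return sum(rng.randint(1, 6) for _ in range(n_rolls)) / n_rolls


for n in (10, 1_000, 100_000):
    print(n, mean_of_die_rolls(n))
```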

## What is the importance of A/B testing?

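- It allows a fair comparison between variants (for example, versions A and B of a page or feature): users are randomly assigned to each variant and a statistical hypothesis test decides whether the observed difference is significant, so the best variant can be picked on evidence;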

## What are eigenvectors and eigenvalues?

- Eigenvectors are the directions along which a linear transformation acts purely by compressing, flipping, or stretching. They are used to understand linear transformations and are generally calculated for a correlation or covariance matrix.

- The eigenvalue is the strength of the transformation in the direction of the eigenvector, i.e. the factor by which the eigenvector is scaled.

- An eigenvector’s direction remains unchanged (or is exactly reversed, for a negative eigenvalue) when the linear transformation is applied to it.

- [How to understand the eigenvalues and eigenvectors of a matrix?](https://www.matongxue.com/madocs/228/)
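
The defining property, $Av = \lambda v$, can be checked by hand for a small matrix. The sketch below uses the closed form for a symmetric 2x2 matrix and assumes the off-diagonal entry is non-zero; the function names are illustrative.

```python
from math import sqrt


def eigen_2x2_symmetric(a, b, d):
    """Eigenpairs of the symmetric matrix [[a, b], [b, d]], assuming b != 0."""
    mean = (a + d) / 2
    radius = sqrt(((a - d) / 2) ** 2 + b ** 2)
    # For each eigenvalue lam, v = (b, lam - a) solves (A - lam * I) v = 0.
    return [(lam, (b, lam - a)) for lam in (mean + radius, mean - radius)]


def mat_vec(a, b, d, v):
    """Multiply [[a, b], [b, d]] by the vector v."""
    return (a * v[0] + b * v[1], b * v[0] + d * v[1])


for lam, v in eigen_2x2_symmetric(2.0, 1.0, 2.0):
    print("eigenvalue", lam, "eigenvector", v, "A v =", mat_vec(2.0, 1.0, 2.0, v))
```

Applying the matrix to each eigenvector only rescales it by its eigenvalue, which is the property described above.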

## What is systematic sampling and cluster sampling

- Systematic sampling is a type of probability sampling method. The sample members are selected from a larger population with a random starting point but a fixed periodic interval. This interval is known as the sampling interval. The sampling interval is calculated by dividing the population size by the desired sample size.

- Cluster sampling involves dividing the sample population into separate groups, called clusters. Then, a simple random sample of clusters is selected from the population.
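
Both schemes are easy to sketch. The code below assumes the population size is at least the desired sample size and, for cluster sampling, that every member of a chosen cluster is kept; the function names and the toy data are made up for illustration.

```python
import random


def systematic_sample(population, sample_size, rng):
    """Pick every `interval`-th member after a random starting point."""
    interval = len(population) // sample_size      # the sampling interval
    start = rng.randrange(interval)
    return [population[start + i * interval] for i in range(sample_size)]


def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters and keep every member of each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]


rng = random.Random(42)
population = list(range(100))
print(systematic_sample(population, 10, rng))

clusters = {"north": [1, 2, 3], "south": [4, 5], "east": [6, 7, 8], "west": [9]}
print(cluster_sample(clusters, 2, rng))
```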

## What are AutoEncoders?

- An AutoEncoder is a kind of artificial neural network. It is used to learn efficient data codings in an unsupervised manner. It is utilized for learning a representation (encoding) for a set of data, mostly for dimensionality reduction, by training the network to ignore signal noise.

- You can think of it as an advanced version of PCA (because of its non-linear transformation units);

- Nowadays the AutoEncoder concept also covers generating, from the reduced encoding, a representation as close as possible to the original input.

- Sparse AutoEncoder

- Denoising AutoEncoder

- Variational AutoEncoder: a generative variant that learns a distribution over the encoding, so that new samples can be generated from it; a well-known demonstration is generating handwritten digits that resemble those of the MNIST dataset;

- [When we talk about Deep Learning: AutoEncoder and its related models](https://zhuanlan.zhihu.com/p/27865705)

## How do you avoid the over-fitting during the training?

- Keeping the model simple

- Using cross validation

- Using regularization to penalize the model parameters that are more likely to cause the over-fitting
    * L1 regularization adds the penalty $\lambda\sum\limits_{j=1}^m|w_j|$ to the loss, where $w_j$ are the model parameters and $\lambda$ controls the strength of the penalty
    * L2 regularization adds the penalty $\lambda\sum\limits_{j=1}^m w_j^2$ to the loss
    
- Why L2 is usually more popular than L1: the related calculation (taking derivatives) is more convenient

- [What are the characteristics of L1 and L2 regularization, and what are the advantages of each?](https://www.zhihu.com/question/26485586/answer/616029832)

    * In a nutshell, L2 regularization is more popular than L1 because of the convenience of computing derivatives: the absolute value terms in L1 are not differentiable at zero!
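
As a sketch of the two penalty terms in their usual weight-based form: `weights` stands for the model parameters and `lam` for the regularization strength (both names are assumptions of this example). The L2 gradient is included to show the "convenient derivative" point.

```python
def l1_penalty(weights, lam):
    """Lasso-style penalty: lam * sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """Ridge-style penalty: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)


def l2_penalty_gradient(weights, lam):
    """The L2 penalty is differentiable everywhere: d/dw (lam * w^2) = 2 * lam * w."""
    return [2 * lam * w for w in weights]


weights = [0.5, -1.5, 0.0]
print(l1_penalty(weights, lam=0.1), l2_penalty(weights, lam=0.1))
print(l2_penalty_gradient(weights, lam=0.1))
```

The L1 penalty has no single well-defined gradient at a weight of exactly zero, which is where the extra care in optimization comes from.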

## What are the differences among standardization, normalization and regularization?

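- Standardization: $x_{new} = \frac{x - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation; whatever data we are working with, after standardizing it the mean equals 0 and the standard deviation equals 1. Batch normalization applies this kind of rescaling per mini-batch, which is why the formula is also loosely called normalization; refer here: [Batch Normalization Tensorflow Keras Example](https://towardsdatascience.com/backpropagation-and-batch-normalization-in-feedforward-neural-networks-explained-901fd6e5393e)

- Normalization: usually means rescaling to a fixed range, e.g. min-max scaling $x_{new} = \frac{x - x_{min}}{x_{max} - x_{min}}$, which maps the data into $[0, 1]$;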
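
The rescaling $x_{new} = \frac{x - \mu}{\sigma}$ given above can be sketched next to min-max rescaling, a common alternative that maps values into $[0, 1]$. The function names are assumptions of this example, and the population standard deviation is used for $\sigma$.

```python
from statistics import fmean, pstdev


def z_score(values):
    """Rescale to mean 0 and standard deviation 1: (x - mu) / sigma."""
    mu, sigma = fmean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]


def min_max(values):
    """Rescale linearly into the range [0, 1]."""
    low, high = min(values), max(values)
    return [(x - low) / (high - low) for x in values]


data = [2.0, 4.0, 6.0, 8.0]
print(z_score(data))
print(min_max(data))
```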

- Regularization: penalty regularization items (L1, L2)

## What is dimensionality reduction? What are its benefits?

- Dimensionality reduction is the process of converting a data set with many dimensions into data with fewer dimensions, in order to convey similar information concisely.

- This method is mainly beneficial for compressing data and reducing storage space. It also reduces computation time, since there are fewer dimensions. Finally, it helps remove redundant features; for instance, storing the same value in two different units (meters and inches) is avoided.

## Some feature selection methods used to select the right variables

- The methods for feature selection can be broadly classified into two types:
    * Filter Methods, these methods involve:
        - Linear discriminant analysis
        - ANOVA
        - Chi-Square
    * Wrapper Methods, these methods involve:
        - Forward Selection: start with no features and add one feature at a time, keeping it if it improves the fit;
        - Backward Selection: start with all features and remove them one at a time to see what works better;
        - Recursive Feature Elimination: the model is fitted repeatedly and the least important features are removed in each round until the desired number remains.
    * Others are Forward Elimination, Backward Elimination for Regression, Cosine Similarity-Based Feature Selection for Clustering tasks, Correlation-based eliminations etc.
    
## What are the different types of clustering algorithms?

- K-means

- Hierarchical clustering

- Fuzzy clustering

## MAE vs MSE vs RMSE

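- MAE: mean absolute error, $\frac{1}{n}\sum\limits_{i=1}^n|y_i - f(x_i)|$

- MSE: mean square error, $\frac{1}{n}\sum\limits_{i=1}^n\big(y_i - f(x_i)\big)^2$

- RMSE: root mean square error, $\sqrt{MSE}$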
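
The three metrics side by side, as a minimal sketch (the function names and sample values are illustrative):

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean square error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error, in the same unit as the target."""
    return sqrt(mse(y_true, y_pred))


y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]
print(mae(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))
```

Because MSE squares each error, the single miss of 2 contributes four times as much as the miss of 1, so MSE and RMSE are more sensitive to large errors than MAE.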

## How can outlier values be treated?

- Replace the value with the mean or the mode (__mode is a statistical term that refers to the most frequently occurring number found in a set of numbers__)

- Treat it as a missing value;

- Remove the entire record;

## What is skewed distribution and uniform distribution

- A skewed distribution is an asymmetric distribution in which the majority of the data points lie to the left or right of the center, with a longer tail on the other side (a left-skewed distribution has its tail on the left, a right-skewed one on the right); unlike the normal distribution, it is not symmetric;

- A uniform distribution is a probability distribution in which all outcomes are equally likely.

## What is a Box-Cox Transformation?

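- Box-Cox transformation is a power transformation that makes a strictly positive variable more normally distributed: $y(\lambda) = \frac{y^\lambda - 1}{\lambda}$ for $\lambda \neq 0$ and $y(\lambda) = \ln y$ for $\lambda = 0$.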
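
A minimal sketch of the transform for a given $\lambda$, using the standard definition $(y^\lambda - 1)/\lambda$ for $\lambda \neq 0$ and $\ln y$ for $\lambda = 0$. Choosing the best $\lambda$ (usually done by maximum likelihood) is not covered here, and `box_cox` is a name invented for this example.

```python
from math import log


def box_cox(values, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


print(box_cox([1.0, 4.0, 9.0], lam=0.5))
print(box_cox([1.0, 10.0, 100.0], lam=0))
```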

## What is the hyperbolic tree?

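- Also known as a hyper-tree;

- Displaying hierarchical data as an ordinary tree suffers from visual clutter, as the number of nodes per level can grow exponentially; a hyperbolic tree lays the hierarchy out in hyperbolic space, which has more room for the outer levels, and keeps the focused node large at the center.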

## How to deal with imbalanced dataset?

An imbalanced dataset should first be identified as one of the two cases below:

1. The minority class still covers reality well

2. The minority class cannot cover reality well

If `1` is the case, under-sampling plus proper evaluation metrics (confusion matrix, precision, and F1) should already solve the problem;

If `2` is the case, either rework the dataset or rework the problem itself (design a different solution):

- ~~Undersampling consists in sampling from the majority class in order to keep only a part of these points~~

- Collecting more data

- Oversampling consists in replicating some points from the minority class in order to increase its cardinality

- Generating synthetic data consists in creating new synthetic points from the minority class (see SMOTE method for example) to increase its cardinality
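
Random oversampling, the simplest of the options above, can be sketched as follows. It only replicates existing minority rows (it is not SMOTE, which creates new synthetic points); the function name, the seed, and the toy data are assumptions of this example.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Replicate minority-class rows at random until all classes are the same size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels


rows = [[0], [1], [2], [3], [9]]
labels = [0, 0, 0, 0, 1]
new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

Oversampling should be applied to the training split only, otherwise replicated rows leak into the test data.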

[Handling imbalanced datasets in machine learning](https://towardsdatascience.com/handling-imbalanced-datasets-in-machine-learning-7a0e84220f28)

[SMOTE](https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/)

## What’s the difference between probability and likelihood?

- [What is the difference between “likelihood” and “probability”?](https://stats.stackexchange.com/questions/2641/what-is-the-difference-between-likelihood-and-probability#2647)

## What cross-validation technique would you use on a time series dataset?

- [Using k-fold cross-validation for time-series model selection](https://stats.stackexchange.com/a/14109/229537)

- Instead of using standard k-folds cross-validation, you have to pay attention to the fact that a time series is not randomly distributed data — it is inherently ordered by chronological order. If a pattern emerges in later time periods for example, your model may still pick up on it even if that effect doesn’t hold in earlier years!

- You’ll want to do something like forward chaining where you’ll be able to model on past data then look at forward-facing data.
    * fold 1 : training [1], test [2]
    * fold 2 : training [1 2], test [3]
    * fold 3 : training [1 2 3], test [4]
    * fold 4 : training [1 2 3 4], test [5]
    * fold 5 : training [1 2 3 4 5], test [6]
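
The forward-chaining scheme listed above can be generated programmatically. This is a minimal sketch in which the data is assumed to be already cut into chronologically ordered chunks numbered from 1; `forward_chaining_splits` is a name invented here.

```python
def forward_chaining_splits(n_chunks):
    """Yield (training_chunks, test_chunk) pairs over chronologically ordered chunks."""
    for i in range(1, n_chunks):
        # Train on everything up to chunk i, test on the chunk that follows.
        yield list(range(1, i + 1)), i + 1


for fold, (train, test) in enumerate(forward_chaining_splits(6), start=1):
    print(f"fold {fold} : training {train}, test [{test}]")
```

The test chunk is always later than every training chunk, so no future information leaks into training.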
    
## How is a decision tree pruned? 

- [Decision tree pruning](https://en.wikipedia.org/wiki/Decision_tree_pruning)

## What’s the F1 score? How would you use it?

- The F1 score is a measure of a model’s performance. It is the harmonic mean of the precision and recall of a model, $F1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$, with results tending to 1 being the best and those tending to 0 being the worst. You would use it in classification tests where true negatives don’t matter much.
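
The F1 score is computed as the harmonic mean of precision and recall, $2 \cdot precision \cdot recall / (precision + recall)$. A minimal sketch, where returning 0 when both inputs are 0 is a convention chosen for this example:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(f1_score(0.8, 0.5))
print(f1_score(1.0, 0.0))
```

Unlike a plain average, the harmonic mean stays low when either precision or recall is low, so a model cannot score well by maximizing only one of them.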
