# Data Science Interview Questions

# Machine Learning

### What is the difference between supervised and unsupervised learning?

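- Supervised learning - the model is trained on a labelled dataset, i.e. inputs paired with known outcomes (e.g. classification, regression)
- Unsupervised learning - the input data doesn't have labelled outputs, so the model looks for structure in the inputs alone (e.g. clustering)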

### What is the bias-variance trade off?

Start by defining:
- **Bias**: error introduced into the model by **oversimplification** in the ML algorithm. The model makes simplifying assumptions so that the target function is easier to learn, which can lead to underfitting.
- **Variance**: error introduced into the model by an overly **complex** ML algo. The model ends up learning noise from the training dataset and performs badly on test data. This leads to high *sensitivity* to small changes in the training data (not to be confused with the true positive rate) and to overfitting.

As you increase the complexity of the model, the error falls at first because bias falls. Beyond a particular point, the extra complexity makes the model overfit, so variance rises and the error on test data goes up again.

The goal of (supervised) ML is low bias and low variance to enable accurate prediction performance.
- **kNN** with a small *k* has high variance and low bias. By increasing *k* you increase the number of neighbours contributing to each prediction, therefore increasing model bias and decreasing variance.
- **SVM** has low bias and high variance. When C is defined as a budget for margin violations, increasing C allows more violations, therefore increasing bias and decreasing variance. Many libraries (e.g. scikit-learn's `SVC`) instead define C as a penalty on violations, in which case it is *decreasing* C that has this effect.

Therefore:
> **Increasing BIAS reduces VARIANCE and vice versa**

- Low bias ML algo: Decision Trees, kNN, SVM
- High bias ML algo: linear regression, logistic regression
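
The effect of *k* on kNN can be sketched with a small, self-contained experiment. This is only an illustration under assumptions that are not in the notes above: the data (`y = x^2` plus noise), the sample sizes and the helper names `knn_predict`, `mse` and `noisy_sample` are all made up for the example.

```python
import random

def knn_predict(train_x, train_y, x, k):
    """Predict y at x as the mean target of the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def mse(train_x, train_y, xs, ys, k):
    """Mean squared error of k-NN predictions on the points (xs, ys)."""
    errors = [(knn_predict(train_x, train_y, x, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

rng = random.Random(0)  # fixed seed so the run is repeatable

def noisy_sample(n):
    """Hypothetical data: y = x^2 plus Gaussian noise."""
    xs = [rng.uniform(-2, 2) for _ in range(n)]
    ys = [x ** 2 + rng.gauss(0, 0.5) for x in xs]
    return xs, ys

train_x, train_y = noisy_sample(60)
test_x, test_y = noisy_sample(60)

# Compare training and test error for a flexible, a moderate and a rigid model.
for k in (1, 5, 60):
    train_error = mse(train_x, train_y, train_x, train_y, k)
    test_error = mse(train_x, train_y, test_x, test_y, k)
    print(f"k={k:2d}  train MSE={train_error:.3f}  test MSE={test_error:.3f}")
```

With `k=1` every training point is its own nearest neighbour, so the training error is zero while the predictions track the noise (high variance). With `k=60` the model predicts the training mean everywhere (high bias).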

### What is a confusion matrix?

![confusionMatrix.jpg](attachment:confusionMatrix.jpg)

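A confusion matrix tabulates predicted classes against actual classes, giving the counts of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN).

Therefore:
- **sensitivity** is the *true positive rate*: TP / (TP + FN)
- **precision** is the *positive predictive value*: TP / (TP + FP)
- **specificity** is the *true negative rate*: TN / (TN + FP)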
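
As a minimal sketch, the three rates can be computed from the four cells of a binary confusion matrix: true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). The labels below and the helper names `confusion_counts` and `rates` are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def rates(tp, fp, fn, tn):
    """Sensitivity, precision and specificity from the four counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "precision": tp / (tp + fp),    # positive predictive value
        "specificity": tn / (tn + fp),  # true negative rate
    }

# Hypothetical ground truth and predictions for ten observations.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
metrics = rates(tp, fp, fn, tn)
print(tp, fp, fn, tn, metrics)
```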

### What is an ROC curve?

This is a plot of the true positive rate against the false positive rate at various classification thresholds.
It's often used to show the trade-off between sensitivity (true positive rate) and the false positive rate (1 - specificity).
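
A rough sketch of how the points on the curve are obtained: sweep a threshold over the model's scores and record the false positive rate and true positive rate at each step. The scores, labels and the function name `roc_points` are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for threshold in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is predicted positive.
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical true labels (1 = positive) and model scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold moves along the curve from (0, 0) towards (1, 1): more true positives are caught, at the cost of more false positives.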

### What is SVM?

A **support vector machine** (SVM) is an ML algo used for both *regression* and *classification*.
If you have *n* features in a training set, SVM plots each datapoint in n-dimensional space, with the value of each feature being a particular coordinate.
SVM then finds the hyperplane that separates the different classes with the largest margin; the provided kernel function determines the space in which that hyperplane is found.

The **support vectors** are the datapoints closest to the classifier (separating hyperplane); their distance to it defines the margin.

![svm.png](attachment:svm.png)
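
The geometry can be sketched without training anything. The example below assumes a hyperplane has already been fitted (the values of `w` and `b` and the four points are hypothetical) and finds the margin and the closest datapoints, which are the ones that determine the margin.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

# Hypothetical fitted hyperplane: x1 + x2 - 3 = 0.
w, b = (1.0, 1.0), -3.0

# Hypothetical datapoints with class labels -1 / +1.
points = [((0.0, 0.0), -1), ((1.0, 1.0), -1), ((2.0, 2.0), 1), ((4.0, 3.0), 1)]

distances = [abs(signed_distance(w, b, x)) for x, _ in points]
margin = min(distances)  # distance from the hyperplane to the closest point(s)
closest = [x for (x, _), d in zip(points, distances) if math.isclose(d, margin)]

# The side of the hyperplane a point falls on gives its predicted class.
predictions = [1 if signed_distance(w, b, x) >= 0 else -1 for x, _ in points]

print(margin, closest, predictions)
```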

### Explain the Decision Tree algorithm

This is a supervised learning algorithm used for **regression** and **classification**.
It breaks the dataset down into smaller and smaller subsets while incrementally building the decision tree.
The final result is a tree with decision and leaf nodes. Both numerical and categorical data can be handled.

Removing sub-nodes of a decision tree is called **pruning**.
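
A single splitting step can be sketched as follows. The notes above do not say which criterion is used to choose a split; this example assumes Gini impurity, one common choice, and the data and the helper names `gini` and `best_split` are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure subset, larger for a more mixed one."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Find the threshold on one numeric feature with the lowest weighted Gini impurity."""
    best_threshold, best_impurity = None, float("inf")
    # Skip the largest value: splitting there would leave the right subset empty.
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity

# Hypothetical feature (age) and class label (did the customer buy?).
ages = [22, 25, 30, 35, 40, 52]
buys = ["no", "no", "no", "yes", "yes", "yes"]

threshold, impurity = best_split(ages, buys)
print(threshold, impurity)
```

A full tree repeats this step recursively on each resulting subset until the subsets are pure or a stopping rule is reached.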

### Explain the Random Forest algorithm

This is an ML algo for **regression** and **classification**. 

It can also be used for dimensionality reduction, and for treating missing values and outliers. It is an ensemble learning method: weak models combine to form a powerful model.

In a random forest, you grow multiple trees as opposed to a single tree. To classify a new object based on its attributes, each tree gives a classification (a vote).

For classification, the forest chooses the class with the most votes across all trees in the forest; for regression, it takes the average of the outputs of the different trees.
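
The aggregation step can be sketched on its own. The per-tree predictions below are hypothetical stand-ins for the outputs of already-trained trees, and `forest_classify` and `forest_regress` are illustrative names.

```python
from collections import Counter
from statistics import mean

def forest_classify(tree_votes):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_votes).most_common(1)[0][0]

def forest_regress(tree_outputs):
    """Return the average of the trees' numeric predictions."""
    return mean(tree_outputs)

# Hypothetical predictions from five classification trees for one object.
votes = ["spam", "ham", "spam", "spam", "ham"]

# Hypothetical predictions from three regression trees for one object.
outputs = [2.0, 3.0, 4.0]

print(forest_classify(votes), forest_regress(outputs))
```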



# Statistics

### Explain what a normal distribution is

A normal distribution is a continuous probability distribution that is symmetric around a central value (the mean), without any bias towards the left or right, and with its peak at the centre. For data, this shows up in a histogram (the proportion of observations in each bucket) that is centred on that value.
The values of a normally distributed random variable follow a symmetrical bell-shaped curve.
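
The symmetry and the peak at the centre can be checked numerically with the standard library's `statistics.NormalDist`. The choice of a standard normal (mean 0, standard deviation 1) is just for illustration.

```python
from statistics import NormalDist

standard = NormalDist(mu=0.0, sigma=1.0)

# Symmetry: the density is the same at equal distances either side of the mean.
left_density = standard.pdf(-1.3)
right_density = standard.pdf(1.3)

# The peak is at the centre: the density at the mean exceeds the density elsewhere.
peak_density = standard.pdf(0.0)

# Probability mass within one and two standard deviations of the mean.
within_one_sd = standard.cdf(1.0) - standard.cdf(-1.0)
within_two_sd = standard.cdf(2.0) - standard.cdf(-2.0)

print(f"{within_one_sd:.4f} {within_two_sd:.4f}")
```

Roughly 68% of the probability lies within one standard deviation of the mean and roughly 95% within two.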

### Explain what a p-value is
**As a simple definition: the probability of an observed statistic (or a more extreme one) occurring due to chance alone, given the sampling distribution under the null hypothesis**

*Also:
The probability of obtaining results at least as extreme as the observed results of a test, under the assumption that the null hypothesis is correct. It is not the probability that the null hypothesis is true.*

A smaller p-value signifies stronger evidence against the null hypothesis, and therefore in favour of an alternative hypothesis.
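
As one concrete case, here is a sketch of a two-sided p-value for a z-test of a sample mean. It assumes a known population standard deviation and a normal sampling distribution; the numbers and the function name `two_sided_p_value` are made up for the example.

```python
import math
from statistics import NormalDist

def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z, i.e. computed under the null hypothesis."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Hypothetical test: null hypothesis says the population mean is 100.
null_mean = 100.0
population_sd = 15.0  # assumed known
sample_size = 100
sample_mean = 103.0

# Standardise the observed mean using the standard error of the mean.
standard_error = population_sd / math.sqrt(sample_size)
z = (sample_mean - null_mean) / standard_error

p_value = two_sided_p_value(z)
print(f"z={z:.2f}  p={p_value:.4f}")
```

Here a sample mean at least this far from 100 would occur in roughly 4.6% of samples if the null hypothesis were true.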

### What is a confidence interval?

This is a type of interval estimate: a range of plausible values for the unknown population parameter, computed from a sample.

**Confidence level** represents the long-run frequency with which confidence intervals constructed this way contain the true value of the population parameter.
If confidence intervals are constructed at a given confidence level from a very large number of independent samples, the proportion of those intervals containing the true value of the parameter will approach the confidence level.
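
This repeated-sampling interpretation can be checked with a small simulation. The sketch below assumes normally distributed data with a known standard deviation (so a z-interval applies); the population values, sample size, number of trials and the name `z_interval` are all arbitrary choices for illustration.

```python
import random
from statistics import NormalDist, mean

def z_interval(sample, sigma, confidence):
    """Confidence interval for the mean when the population sigma is known."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * sigma / len(sample) ** 0.5
    centre = mean(sample)
    return centre - half_width, centre + half_width

rng = random.Random(42)  # fixed seed so the run is repeatable
true_mean, sigma, confidence = 10.0, 2.0, 0.95
trials = 2000

# Draw many samples and count how often the interval covers the true mean.
hits = 0
for _ in range(trials):
    sample = [rng.gauss(true_mean, sigma) for _ in range(30)]
    low, high = z_interval(sample, sigma, confidence)
    if low <= true_mean <= high:
        hits += 1

coverage = hits / trials
print(f"coverage={coverage:.3f}")
```

The printed coverage should land close to the chosen confidence level of 0.95, and gets closer as the number of trials grows.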

### What is a Box-Cox transformation?

Sometimes the **target variable** for a regression might not satisfy the criteria for a least squares regression: the residuals could curve as the prediction increases, or follow a skewed distribution.

In these scenarios, it is necessary to apply a transformation to the response variable/outcome so the data meets the assumptions.

A **Box-Cox transformation** is a statistical technique that transforms a non-normal dependent variable into an approximately normal shape, using a power parameter λ (λ = 0 corresponds to the log transform). It requires strictly positive values. Since many tests assume normality, this widens the range of tests that can be applied.
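
A minimal sketch of the transformation itself, for a given power parameter (called `lam` here): `(y^lam - 1) / lam` when `lam` is non-zero and `log(y)` when it is zero. The sample values and the function name `box_cox` are illustrative; in practice the parameter is usually estimated from the data (for example by maximum likelihood) rather than fixed by hand.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of one strictly positive value y with power parameter lam."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(y)  # the limit of (y**lam - 1) / lam as lam tends to 0
    return (y ** lam - 1.0) / lam

# Hypothetical right-skewed values spanning several orders of magnitude.
skewed = [1.0, 10.0, 100.0, 1000.0]

# With lam = 0 (the log transform) the values become evenly spaced.
transformed = [box_cox(y, 0) for y in skewed]
print(transformed)
```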

### What is selection bias?

**Selection bias** is when the sample obtained is not representative of the population intended to be analysed.
