# Machine Learning Quiz

### What are the top machine learning frameworks and libraries in Python?


Could look at Google Trends or Stack Overflow Tags, but here is my general sense:
In Python:
- tensorflow
- scikit-learn
- Spark MLlib
- nltk
- keras
- pytorch

Other libraries are often *necessary* for machine learning projects, although I'm not sure I would call them "machine learning libraries":
- numpy
- pandas
- matplotlib/seaborn
- scipy
- statsmodels


### What is Machine Learning? ...vs. AI? ..vs. Deep Learning?

**Machine learning** refers to a family of algorithms that teach a computer to recognize patterns in data by example, rather than programming it with specific rules, and then use those patterns to make predictions.

**Deep learning** - A specific class of machine learning algorithms that use deep neural networks. Neural networks are a type of computing architecture loosely based on the human brain, and deep neural networks use multiple layers of neurons to hierarchically define features. It is typically used for complex problems like image classification and speech recognition.

**AI** - Artificial intelligence. Broadly speaking, AI refers to the ability of machines to simulate intelligent human decision making. It encompasses machine learning, deep learning, and much else. Strong AI, which you often see in science fiction, has the capacity to understand or learn any intellectual task that a human being can, or has the ability to perform "general intelligent action". Right now, we only have "weak AI", which can perform some specific human tasks as well as or better than humans.

**Data Science** - The intersection of computer science and statistics. I would say it incorporates all of the above, along with business acumen and/or domain knowledge, the rigorous application of the scientific method, and an empirical approach to problems.

### Supervised vs. Unsupervised Learning problems

**Supervised learning**: we have target values we are trying to predict, either continuous (regression problem) or categorical (classification problem)
 - Regression: Can we predict sale price of a home? The Boston House Prices dataset is a famous example of a data set geared toward a regression problem where the inputs are variables that describe a neighborhood and the output is a house price in dollars.
   
 - Classification: Can we predict handwritten numbers? The MNIST handwritten digits dataset is a famous example of a classification problem: the inputs are images of handwritten digits (pixel data) and the output is a class label for what digit the image represents (numbers 0 to 9).


**Unsupervised learning**: we don't have target values we are trying to predict. These are often clustering problems or dimensionality-reduction problems.
 - Customer segmentation: Which groups of customers are most similar to each other?
 - Data Compression: Can we compress this data while preserving some of its meaning (such that we could uncompress it and not lose too much information)?
 - Cocktail party problem - Can we separate a single audio stream with two or more different speakers into separate audio streams, each with only one speaker?
 - Style transfer?

### Classification vs. Regression Problems

- Classification problems predict classes (e.g., `hot dog` or `not hot dog`; `Survived` or `Didn't Survive` the Titanic Disaster)
- Regression problems predict a continuous variable (e.g., sale price at auction for heavy machinery, barrels of oil produced by an oil well)

### Classification Techniques
Classification algorithms are machine learning techniques for predicting which category the input data belongs to.

Example use cases:
* Predicting a clinical diagnosis based on symptoms, laboratory results, and historical diagnoses.
* Predicting whether a healthcare claim is fraudulent using data such as claim amount, drug predisposition, disease, and provider.

### Curse of Dimensionality

*When the number of features is large, there tends to be a deterioration in the performance of _local approaches_ (e.g., KNN) that make predictions using only observations that are near the test observation for which a prediction must be made.*

This decrease in performance results from the fact that in higher dimensions, there is effectively a reduction in sample size. There are no neighbors nearby in the high-dimensional space.

Consider the hypercube.
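
One way to make the hypercube argument concrete, as a minimal sketch: assume the observations are spread uniformly over a unit hypercube with one dimension per feature. A sub-cube that holds a given fraction of the observations must then have an edge length equal to that fraction raised to the power `1 / n_dims`. The function name `edge_length_for_fraction` is hypothetical and used only for illustration.

```python
def edge_length_for_fraction(fraction, n_dims):
    """Edge length of a sub-cube holding `fraction` of a unit hypercube's volume."""
    return fraction ** (1.0 / n_dims)


# How wide must a "neighborhood" be to contain 10% of uniformly spread data?
for n_dims in (1, 2, 10, 100):
    edge = edge_length_for_fraction(0.10, n_dims)
    print(f"{n_dims:>3} dims: edge length = {edge:.3f}")
```

With 10 features, the neighborhood has to span about 79% of each feature's range to capture 10% of the data, so it is no longer "local" in any useful sense.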

### Bias/Variance Tradeoff

*Models with a lower bias in parameter estimation have a higher variance of the parameter estimates across samples, and vice versa.*

- We don't know the true parameters, β, so we have to estimate them from the sample.
- In the Ordinary Least Squares (OLS) approach, we estimate them as β̂ in such a way that the sum of squared residuals is as small as possible.

- **Bias** is the difference between the true population parameter and the expected value of the estimator. It measures the accuracy of the estimates.
- **Variance**, on the other hand, measures the spread, or uncertainty, in these estimates. It depends on the unknown error variance σ², which can be estimated from the residuals.

There are three sources of error coming out of a model:
1. Bias error (underfitting) - algo is not capturing the relevant relationships between X and y
2. Variance error (overfitting) - algo is memorizing random noise in the training data
3. Irreducible error - resulting from noise in the problem itself

The OLS estimator has the desired property of being unbiased. However, it can have a huge variance. Specifically, this happens when:
- The predictor variables are highly correlated with each other;
- There are many predictors. As the number of features approaches the number of observations, the variance approaches infinity.

The general solution to this is: *reduce variance at the cost of introducing some bias*. This approach is called regularization and is often beneficial for the predictive performance of the model.

Although the OLS solution provides unbiased regression estimates, the lower-variance solutions produced by regularization techniques can provide superior MSE performance.

Dimensionality reduction and feature selection can decrease variance by simplifying models. Similarly, a larger training set tends to decrease variance. Adding features (predictors) tends to decrease bias, at the expense of introducing additional variance. Learning algorithms typically have some tunable parameters that control bias and variance.

### Regularized Regression

Regularization is the process of adding information to a model's objective function in order to solve an ill-posed problem or to prevent overfitting.

The underlying motivation is often to improve the generalizability of a learned model.

#### Ridge regression
- Ridge regression is one form of a regularized linear regression model.
- We not only minimize the sum of squared residuals but also penalize the size of the parameter estimates.
- It penalizes the *sum of squared coefficients* (the so-called L2 penalty), scaled by a penalty factor (λ).
- Setting λ to 0 is the same as using OLS; the larger its value, the more strongly the size of the coefficients is penalized.
- Ridge shrinks the β coefficients toward zero, but it *does not force them to be exactly zero*. That is, it will not get rid of irrelevant features but rather minimizes their impact on the trained model.
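
The shrinkage effect can be sketched for the simplest possible case. Assuming a single feature and no intercept (a simplification chosen here for illustration), the slope that minimizes the sum of squared residuals plus `lam` times the squared slope has a closed form. The function name `ridge_slope` and the toy data are hypothetical.

```python
def ridge_slope(x, y, lam):
    """Slope minimizing sum((y - b*x)^2) + lam * b^2 for one feature, no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

# lam = 0 reproduces OLS; larger values shrink the slope toward zero.
for lam in (0.0, 1.0, 10.0, 100.0):
    print(f"lambda = {lam:>5}: slope = {ridge_slope(x, y, lam):.3f}")
```

With λ = 0 the function returns the OLS slope; larger λ pulls the slope toward zero without ever reaching it.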

#### LASSO method
Least Absolute Shrinkage and Selection Operator
- Similar conceptually to ridge regression: it also adds a penalty on the coefficients.
- The L1 penalty penalizes the sum of the *absolute values* of the coefficients.
- Can zero out coefficients: for high values of λ, many coefficients are set exactly to zero under lasso, which is generally not the case in ridge regression.
  - Coefficients of features that are not relevant are actually set to zero. Therefore, you might end up with fewer features included in the model than you started with.


Lasso is another extension built on regularized linear regression, but with a small twist. The loss function of Lasso is in the form:
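
For reference, the standard textbook form of the lasso objective is the residual sum of squares plus the L1 penalty. It is stated here from the general literature rather than from these notes, and the symbols n (observations), p (features), x_ij, and β₀ (intercept) are notation assumed for illustration:

$$
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}
$$

Replacing |β_j| with β_j² in the penalty term gives the ridge objective.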


### Gradient Descent

* ANSWER HERE

### Entropy and information gain

### List at least 5 feature engineering and feature selection techniques

#### Feature Selection Techniques:
##### Filter Methods:
- **Univariate** (`SelectKBest`): consider each feature individually and, using a test statistic (such as chi-squared for a categorical target, or Pearson's r if the target is continuous), pick the features most closely associated with the target.
- **Variance thresholding**: eliminate features with low levels of variance, because these cannot provide much signal to a model.
- **Correlated Feature Elimination**: begin with the pairwise correlation coefficients of all features, sorted by the absolute value of their correlation coefficients. For the most correlated pair of features, remove the one that is less correlated on average with all other features. Recalculate average correlations and continue until no two features are correlated above some threshold (e.g., no two features have a correlation coefficient with an absolute value above .9).

##### Wrapper Methods:
- **Forward Selection**: Start with an empty set of features, add features one-by-one to reach an optimal model. **(flesh this out more)**
- **Backward Elimination**: Start with the full set of features, remove features one-by-one to reach an optimal model. **(flesh this out more)**
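- **Recursive Feature Elimination**: Perform a greedy search for the best-performing feature subset: repeatedly fit a model, rank the features by importance, and drop the weakest until the desired number remains. Because it is greedy, RFE fits on the order of N models for a dataset with N features, rather than evaluating all 2^N possible feature subsets as an exhaustive search would.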
***
#### Feature Engineering Techniques:
- Create bins of continuous features
- Perform group-by-aggregate operations
- Create arithmetic feature interactions (e.g., new_feature = feature_A * feature_B)
- Use a clustering algorithm to assign clusters of observations (being sure to exclude the target from this step)
- For datetime, extract date parts (e.g., year, month, and day-of-week), or calculate ranges between dates.
- If data are spatial, you can do a lot with basic geometry, such as using the physical distance to a nearest-neighbor
- (There are endless possibilities, and domain knowledge should definitely play a role in suggesting new possible features.)


### What is a decision tree? How does a decision tree decide to make a split?


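The splitting algorithm is greedy: it considers each of the features in turn, and for each feature finds the value of that feature that minimizes the impurity (e.g., Gini impurity or entropy) of the resulting bins. It then splits on the feature and value with the lowest impurity overall.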
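
A minimal sketch of that split search, assuming numeric features, binary splits of the form `value <= threshold`, and Gini impurity as the impurity measure (entropy would work the same way). The helper names `gini` and `best_split` are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_impurity) of the best binary split."""
    best = None
    n = len(rows)
    for j in range(len(rows[0])):
        # Try every observed value of feature j as a candidate threshold.
        for threshold in sorted({row[j] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[j] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[j] > threshold]
            if not left or not right:
                continue  # a split must send rows to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (j, threshold, score)
    return best


rows = [[2.0, 7.0], [3.0, 1.0], [6.0, 8.0], [7.0, 2.0]]
labels = ["a", "a", "b", "b"]
print(best_split(rows, labels))
```

In this toy data the first feature separates the two classes perfectly at a threshold of 3.0, so the search returns that split with a weighted impurity of 0.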

### How would you go about building a bayesian model?
### How does a random forest work?
### How does gradient boosting work?

### Ensemble models

- Combining the results of multiple models to create more stable and accurate predictions.
- Two examples of decision tree ensembles are random forests and gradient-boosted trees.

### Bagging and boosting


- Both Bagging and Boosting are *ensemble techniques*.
- Bagging is *parallel*, boosting is *sequential*.

#### Bagging
- Short for "bootstrap aggregation"
- Draw multiple bootstrap samples (random samples with replacement) from the training set, train a model separately on each sample, and then average the predictions from all models (or take a majority vote for classification) to make a final prediction.
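
A minimal sketch of bootstrap aggregation, assuming the simplest possible "model" (one that predicts the mean of its training targets) so the resampling and averaging steps stay visible. The names `bootstrap_sample` and `bagged_prediction` are hypothetical.

```python
import random
from statistics import mean


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def bagged_prediction(data, fit_predict, n_models=25, seed=0):
    """Average the predictions of models trained on separate bootstrap samples."""
    rng = random.Random(seed)
    predictions = [fit_predict(bootstrap_sample(data, rng)) for _ in range(n_models)]
    return mean(predictions)


# Toy "model": predict the mean of the training targets.
targets = [3.0, 5.0, 4.0, 6.0, 100.0, 5.0, 4.0]
print(bagged_prediction(targets, fit_predict=mean))
```

Because each model is trained independently of the others, the loop over bootstrap samples could run in parallel.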

#### Boosting
- A model is first trained on all of the training data and predictions are made on the same data. 
- Predictions that the model got (the most) wrong are then "up-weighted", and then the model is trained again so it "focuses" on trying to improve the predictions it got wrong in the previous iteration. 
- The process is repeated until some stopping condition is reached. 

###  PCA

*Plain English*: 
- Principal component analysis (PCA) is a method, taken from linear algebra, that provides a lower-dimensional representation of a dataset.

*More technical*: 
- PCA takes a dataset (matrix)
- performs an *orthogonal linear transformation that projects the matrix onto a new coordinate system such that the greatest variance is captured by the first coordinate* (the first principal component), the second-greatest variance is captured by the second coordinate, etc.
- You can retain as many principal components as you want to reduce the dimensions arbitrarily.
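
For two-dimensional data the transformation can be worked out by hand, which makes a compact sketch possible without a linear algebra library. This assumes exactly two features and uses the closed-form eigen-decomposition of a symmetric 2x2 covariance matrix; the function name `pca_2d` is hypothetical, and real projects would normally use a library implementation.

```python
import math


def pca_2d(points):
    """First principal axis and explained-variance ratios for 2-D points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Entries of the sample covariance matrix [[a, b], [b, c]].
    a = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_gap = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    largest = (a + c) / 2 + half_gap
    smallest = (a + c) / 2 - half_gap
    # Direction of the eigenvector that belongs to the largest eigenvalue.
    angle = 0.5 * math.atan2(2 * b, a - c)
    first_axis = (math.cos(angle), math.sin(angle))
    total = largest + smallest
    return first_axis, (largest / total, smallest / total)


points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
axis, ratios = pca_2d(points)
print("first principal axis:", axis)
print("explained variance ratios:", ratios)
```

These points all lie on the line y = 2x, so the first principal axis is proportional to (1, 2) and captures essentially all of the variance. Keeping only that one coordinate reduces the data from two dimensions to one with no meaningful loss.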

Uses: 
- *Reveal the latent structure* of a dataset in a way that best explains the variance within it. 
- *Decorrelate* the features within a dataset to mitigate problem of multicollinearity (while sacrificing some interpretability of the final model built from such a transformed dataset)
- *Visualize* high-dimensional datasets, because the human eye can only see in three dimensions (and really only two--think of how difficult it is to read a 3D scatterplot, even if you've done it many times and can manipulate it in space).



### Cross validation

A technique for estimating the generalizability of a predictive model and detecting the problem of overfitting. 
- The classic approach is *k-folds cross validation*: 
  - Rather than simply randomly partitioning a dataset into a TRAINING set and a TEST set once, we do this multiple (k) times. 
  - In practice, k is usually set to 5 or 10, but this can vary.
  - LOOCV (leave one out cross validation) is a special case of k-fold CV in which k is set to equal n.
  - There is a bias-variance trade-off associated with the choice of k in k-fold cross-validation. Typically, given these considerations, one performs k-fold cross-validation using k = 5 or k = 10, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.
  
  Example:
  - 5-fold cross validation: A set of n observations is randomly split into five non-overlapping groups. Each of these fifths acts in turn as a validation set, with the remainder as a training set. The test error is estimated by averaging the five resulting MSE estimates.
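
The fold bookkeeping in that example can be sketched as follows. This only generates the index splits; fitting a model and computing MSE on each validation fold is left out. The function name `k_fold_indices` is hypothetical.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k non-overlapping groups
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


for train, validation in k_fold_indices(n=10, k=5):
    print(sorted(validation), "held out;", len(train), "used for training")
```

Every observation lands in exactly one validation fold, so each one is used for validation once and for training k - 1 times.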

## Classification Algorithms

### Confusion matrix

- A Confusion Matrix is a specific type of *contingency table for evaluating the performance of a supervised classification algorithm*, so named because it shows if a model is confusing two classes (i.e., commonly mislabeling one as another). 
- The rows of the table show counts of the predicted class while the columns represent the instances of the actual class (or vice versa).
- It gives a more fine-grained view than simple accuracy (the proportion of "right answers" out of all guesses) by showing *how* a model failed. 
- In the simple case with only two classes, positive and negative, the confusion matrix contains four quadrants, which represent counts of:
  - True Positives
  - False Positives
  - True Negatives
  - False Negatives

#### Precision
- TP / (TP + FP)
- The fraction of records that were positive from the group that the classifier predicted to be positive
- Also called positive predictive value

#### Recall
- TP / (TP + FN)
- The fraction of positive examples the classifier got right
- Also called sensitivity, hit rate, or true positive rate (used in ROC curve, see below)
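
A short sketch that computes the four confusion-matrix counts and then precision and recall from them, assuming a binary problem with labels coded as 1 (positive) and 0 (negative). The function name `confusion_counts` and the toy labels are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(actual, predicted)
precision = tp / (tp + fp)  # of everything predicted positive, how much was right
recall = tp / (tp + fn)     # of everything actually positive, how much was found

print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```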


### ROC-AUC

**Receiver Operating Characteristic - Area Under Curve** 
- Metric used in machine learning for evaluating and comparing supervised classification algorithms.  
- A receiver operating characteristic curve is a graphical plot that shows the performance of a binary classifier as its discrimination threshold is varied. 
- The ROC curve is the **sensitivity** (True Positive Rate, or *recall*-- see above) as a function of **fall-out** (False Positive Rate, or 1 - specificity)
- In theory, the area under the ROC curve varies between 0 and 1, with an uninformative classifier yielding 0.5, no better than random guessing. 
- A measurement of zero would represent the "perverse" case of the model always giving the wrong response, which would be a kind of 'inversely accurate prediction'.
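
The area under the ROC curve has an equivalent pairwise reading: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes AUC that way rather than by tracing the curve, which is simple but quadratic in the number of examples. It assumes labels coded as 1 and 0; the function name `roc_auc` is hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(roc_auc(labels, scores))                   # 3 of 4 pairs ranked correctly
print(roc_auc(labels, [1 - s for s in scores]))  # inverted scores rank 1 of 4 correctly
```

Inverting the scores turns an AUC of 0.75 into 0.25, which illustrates the "inversely accurate" case: a model with AUC below 0.5 can be made useful by flipping its predictions.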


### Unbalanced dataset

- Classification datasets can be "unbalanced" when there are many observations of one class and few of another. 
- Accuracy-driven models will over-predict the majority class.
  - If a dataset of credit card transactions is 99.9% NOT FRAUD and 0.1% FRAUD, then a model that makes a prediction of NOT FRAUD *every time* will be 99.9% accurate while missing every single case of fraud.
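
The fraud example can be checked with a few lines of arithmetic. This sketch assumes 10,000 transactions with labels coded as 1 for FRAUD and 0 for NOT FRAUD.

```python
# 10,000 transactions, 0.1% of which are fraud (1 = FRAUD, 0 = NOT FRAUD).
actual = [1] * 10 + [0] * 9990
predicted = [0] * len(actual)  # a "model" that always predicts NOT FRAUD

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
frauds_caught = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
recall = frauds_caught / sum(actual)

print(f"accuracy = {accuracy:.1%}, recall on fraud = {recall:.1%}")
```

Accuracy looks excellent while recall on the fraud class is zero, which is why metrics such as recall, precision, or ROC-AUC are more informative than accuracy on unbalanced data.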

#### Possible solutions:
- **Downsample/Upsample**
  - undersampling consists of sampling from the majority class in order to keep only a part of these points
  - oversampling consists of replicating some points from the minority class in order to increase its cardinality
- **Generate Synthetic data**: create new synthetic points from the minority class (see SMOTE method for example) to increase its cardinality
- **Cost-sensitive learning**: 
  - Incorporate the monetary costs of a "miss" into the objective function. 
  - For instance, in the context of fraud detection, there are relatively few fraud events, making an unbalanced dataset. 
  - We know however that the costs of a false negative are often much higher than the costs of a false positive
    - i.e., failing to detect an instance of fraud can be more costly than incorrectly flagging a legitimate transaction.


### HPO

- *Hyperparameter optimization*: choosing a set of optimal hyperparameters for a machine learning algorithm. 
- A hyperparameter is a variable that controls the learning process (e.g., the learning rate or the number of hidden layers) rather than the patterns actually learned (i.e., parameters, such as neuron weights). 
- HPO is guided by some performance metric, typically measured by cross-validation on the training set. 
- One approach to HPO is *grid search*, which is an exhaustive search through all possible combinations of manually specified values for different hyperparameters.  
  - It can result in combinatorial explosion, because it trains and evaluates a model for every combination of suggested values of the hyperparameters. 
  - However, it can usually be easily parallelized, because each model using a specific combination of hyperparameter values is typically independent of all other models built using different combinations.
- An alternative approach is Bayesian Optimization, which operates sequentially, meaning that the hyperparameters chosen at the next step will be influenced by the performance of all the previous steps. Bayesian Optimization makes principled decisions about how to balance exploring new regions of the parameter space vs exploiting regions that are known to perform well. It is often more efficient than alternatives like Grid Search and Random Search, especially when each model evaluation is expensive.
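
A minimal sketch of grid search, assuming a scoring function that returns higher values for better hyperparameters. Here `toy_score` is a hypothetical stand-in for a cross-validated metric, and `grid_search` is an illustrative helper, not a library API.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate score_fn on every combination in param_grid; return the best."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical stand-in for a cross-validated score; it peaks at
# learning_rate=0.1 and max_depth=4.
def toy_score(learning_rate, max_depth):
    return -((learning_rate - 0.1) ** 2) - 0.01 * (max_depth - 4) ** 2


grid = {"learning_rate": [0.01, 0.1, 1.0], "max_depth": [2, 4, 8]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params)
```

Three values for each of two hyperparameters already means nine model evaluations; each added hyperparameter multiplies that count, which is the combinatorial explosion mentioned above. Each evaluation is independent, so the loop is straightforward to parallelize.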


### Pipeline

In general, a data pipeline is an object that takes data as input and produces data as output, with a series of specific operations in between. There are *data processing* pipelines that perform Extract, Transform and Load (ETL) operations, and there are also *machine learning* pipelines that take raw features as input and produce predictions as output. A machine learning pipeline may also include steps such as preprocessing, feature extraction, dimensionality reduction, etc.
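
As a rough illustration of the idea, a pipeline can be sketched as an object that holds an ordered list of named steps and feeds each step's output into the next. The class `ToyPipeline` and the three step functions are hypothetical, and the "model" is just a fixed threshold.

```python
class ToyPipeline:
    """Apply a fixed sequence of named steps, feeding each output to the next step."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, function) pairs

    def run(self, data):
        for _name, step in self.steps:
            data = step(data)
        return data


def drop_missing(rows):
    """Preprocessing: remove missing readings."""
    return [r for r in rows if r is not None]


def scale_to_unit(rows):
    """Feature scaling: divide by the largest value."""
    largest = max(rows)
    return [r / largest for r in rows]


def predict_high(rows):
    """Toy "model": flag scaled values above a fixed threshold."""
    return [r > 0.5 for r in rows]


pipeline = ToyPipeline([
    ("clean", drop_missing),
    ("scale", scale_to_unit),
    ("predict", predict_high),
])
result = pipeline.run([2.0, None, 8.0, 5.0])
print(result)
```

Bundling the steps into one object means the same preprocessing is applied identically at training time and at prediction time.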

### Indexing

Informally, an index refers to the order in which data are organized for easy reference or access.

In SQL-speak:
Indexing means creating an index--a pointer to data in a table, a special lookup table that the database search engine can use to speed up data retrieval.

In pandas-speak: 
Indexing, or reindexing, a Series (and by extension a DataFrame) means assigning or altering the index, or axis labels, which are technically immutable n-dimensional arrays implementing an ordered, sliceable set. Rows and columns can be thought of as hash-like structures, where the keys are the index and the values are arrays. So indexing is specifying the values to use as the keys, which we interact with as row labels and column labels. Indexes can be hierarchically structured using multi-indexes.


### What is EDA? ...or What do you do when you first get a dataset? ...or Walk me through your EDA process.

EDA stands for Exploratory Data Analysis  
Fundamental components (in my opinion):
- missingness
- dispersion
- outliers/anomalies - detection
- correlation of features with each other
- correlation of features with the target.

My approach involves the following steps:
1. Examine head and tail of dataframe, datatypes of columns, numbers of unique values per column
1. Missingness:
   1. Counts of missing (null/NaN) values
   1. Create a dataframe of Boolean missingness flags
   1. Examine pairwise correlations of the missingness flags for all columns
1. Split dataframe into continuous and categorical features
   1. Possibly also deal with datetime and free-text, which are topics unto themselves.
1. Continuous features:
   1. Simple descriptive statistics (mean, median, min, max, quartiles, skewness) for each column
   1. Correlation matrix (optionally visualized as a heatmap, ordered according to a hierarchical clustering algorithm)
   1. Outlier detection, using robust methods:
      1. MAD - median absolute deviation
      1. Mahalanobis distance
   1. Visualize continuous distributions, using univariate and bivariate plots:
      1. Scatter matrix of subsets of continuous features (look at it before and after dropping outliers)
      1. Strip plots and/or boxplots of single features of interest
      1. Scatterplots of each continuous feature with the target
1. Categorical features:
   1. Bar charts representing value-counts of categorical features.
   1. Look at frequency counts - consider ways of rolling up
   1. Look at correlation with the target
1. Maps, line plots with time, if you have a spatial or temporal problem.
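
The MAD-based outlier check from the list above can be sketched as follows. The constant 0.6745 and the cutoff of 3.5 are a commonly used convention for the "modified z-score", taken here as assumptions rather than anything these notes specify; the function name `mad_outliers` is hypothetical.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (based on the MAD) exceeds threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [False] * len(values)  # no spread to measure against
    # 0.6745 rescales the MAD to be comparable to a standard deviation
    # when the data are roughly normal.
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


readings = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0, 10.2]
flags = mad_outliers(readings)
print(flags)
```

Because the median and the MAD are barely affected by the extreme reading, only the value 25.0 is flagged. A mean and standard deviation computed on the same data would be pulled toward the outlier.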

### How do you handle Missing Values?


The problem of missing values is both *common and insidious*, because it *introduces bias* into everything it touches.   

**Examples** include:
- Non-observed population segments
- Participant dropout in longitudinal studies
- Malfunctioning sensors
- Network problems in data transfer

Data can be missing in three different ways (called different **missingness regimes**):
1. MCAR - missing completely at random, a truly stochastic process (e.g., your thermometer randomly fails to record one out of every hundred temperature readings)
1. MAR - missing at random, a deterministic but noisy process that removes data based on other data (this is the single most common regime) (e.g., your thermometer fails when windspeeds are above 15 mph)
1. MNAR - missing not at random, a deterministic but noisy process that removes data based on the data itself (e.g., your thermometer fails at temps above 75 degrees)
 
What you *can't* do:
1. Listwise deletion (aka `df.dropna()` in pandas). You can only do this if the data is missing completely at random, and this is impossible to prove theoretically and very unlikely in practice.
1. Gather more data. You won’t be saved by big data. As a dataset with missing values gets larger and larger we get closer and closer to the wrong value for our coefficient estimates and this leads to false confidence.
1. Just impute the mean. It doesn't matter how the data is missing; it will lead to larger bias in parameter estimates and larger model error.
 
What you *can* do: 
1. Establish your missingness regime
1. Use a modern multiple imputation technique like MICE (multiple imputation by chained equations)
1. Use auxiliary features, i.e., features you're not expecting to use in the prediction model but will use in imputation, because they are known to be strongly correlated with the missing data
1. Anticipate extra compute time
1. Report all the things as a good scientist and good citizen:
  - How many rows are missing one or more values
  - the fraction of missing values
  - the pattern of missing values
  - your imputation strategy
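
The reporting step can be sketched with plain Python. This assumes the data are a list of dict records that all share the same keys, with `None` marking a missing value; the function name `missingness_report` and the toy records are hypothetical.

```python
def missingness_report(rows):
    """Summarize missing values (None) in a list of dict records."""
    columns = list(rows[0])
    n_cells = len(rows) * len(columns)
    rows_with_missing = sum(any(r[c] is None for c in columns) for r in rows)
    missing_per_column = {c: sum(r[c] is None for r in rows) for c in columns}
    return {
        "rows_with_missing": rows_with_missing,
        "fraction_missing": sum(missing_per_column.values()) / n_cells,
        "missing_per_column": missing_per_column,
    }


records = [
    {"temp": 71.0, "wind": 5.0},
    {"temp": None, "wind": 18.0},
    {"temp": 68.0, "wind": None},
    {"temp": None, "wind": 20.0},
]
report = missingness_report(records)
print(report)
```

In this toy data the temperature is missing only on the windier rows, which is the kind of pattern that points toward a MAR regime rather than MCAR.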


### Encoding

- Encoding means creating an abstract (mathematical) representation of some entity or phenomenon.
- For example, we could encode a digital image as a matrix of RGB values for each pixel or sounds based on their waveform properties
- A very common problem in machine learning is how to encode categorical data. Most models only allow numeric values as input.

#### Encoding Categorical (Nominal or Ordinal) data
- Three approaches:
  1. One-hot encoding
  1. Feature Hashing
  1. Binary Encoding
  
##### One-hot encoding
 - One-hot encoding (OHE, or dummy variables) is the classic approach
 - A single column with a finite number of categories (string values) is translated into several columns, one for each possible category, with binary Boolean values (0 for False or 1 for True)  
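
A minimal sketch of one-hot encoding in plain Python, assuming the categories are learned from the data being encoded. The function name `one_hot_encode` is hypothetical.

```python
def one_hot_encode(values):
    """Return (categories, rows) where each row holds one 0/1 flag per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)

print(categories)
for color, row in zip(colors, encoded):
    print(color, row)
```

Each row contains exactly one 1, and the number of new columns equals the number of distinct categories, which is why high-cardinality features become expensive.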

- **Pros**
  - Easy to implement
  - May not have a lot of negative impacts if cardinality is low.  

- **Cons**:
  - OHE produces very high dimensionality when cardinality is high, which increases the model's training and serving time and memory consumption.
  - OHE can easily cause a model to overfit the data.
  - OHE can't handle categories that weren't in the training data (like new URLs, new device types, etc.), which can be problematic in domains that change all the time.
    - Can be handled with a catch-all "other" category
  - One-hot-encoded data can also be difficult for decision-tree-based algorithms, especially if the categorical feature has high cardinality:
     - Independence: from the splitting algorithm's point of view, the binary variables are treated as if they are all independent, which they are not.
     - Creates a sparse matrix, where each feature has very few "hot" rows, and thus not much signal to provide.
     - A binary variable can only be split in one place, whereas a categorical variable with q levels can be split in `2^(q-1) - 1` ways.


##### Feature Hashing
- implements the hashing trick
- similar to one-hot encoding but with fewer new dimensions and some info loss due to collisions. 
- collisions do not significantly affect performance unless there is a great deal of overlap.
- **Pros** 
  - It is low dimensional thus it is very efficient in processing time and memory
  - it can be computed with online learning because we don’t need to go over all the data and build a dictionary of all possible categories 
  - mapping is not affected by new kinds of categories.
  
- **Cons**
  - Hash functions sometimes collide, so if H(New York) = H(Tehran), then the model can't know which city was in the data. 
   - There are some sophisticated hash functions that try to reduce the number of collisions
   - In any case, studies have shown that collisions usually don’t substantially reduce a model's performance. 
  - Hashed features are not interpretable so doing things like feature importance and model debugging is difficult
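
A minimal sketch of the hashing trick, assuming each category is mapped to one of a fixed number of buckets. It uses `hashlib` rather than Python's built-in `hash()`, because string hashes from the built-in are randomized between interpreter runs by default. The function names and the bucket count of 8 are hypothetical choices for illustration.

```python
import hashlib


def hash_bucket(category, n_buckets):
    """Map a category string to a stable bucket index in [0, n_buckets)."""
    digest = hashlib.sha256(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def hash_encode(category, n_buckets=8):
    """One-hot vector over hash buckets instead of over known categories."""
    vector = [0] * n_buckets
    vector[hash_bucket(category, n_buckets)] = 1
    return vector


for city in ["New York", "Tehran", "a city never seen in training"]:
    print(city, hash_encode(city))
```

No dictionary of categories is built, so an unseen category is encoded just like any other. The trade-off is that two different categories can land in the same bucket, and a bucket index cannot be mapped back to a category name.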


##### Binary Encoding
- Can be thought of as a hybrid of one-hot and hashing encoders. 
- Creates fewer features than one-hot, while preserving some uniqueness of values in the column.
- Can work well with higher dimensionality ordinal data.

```
- The categories are encoded by OrdinalEncoder if they aren’t already in numeric form.
- Then those integers are converted into binary code, so for example 5 becomes 101 and 10 becomes 1010
- Then the digits from that binary string are split into separate columns. So if there are 4–7 values in an ordinal column then 3 new columns are created: one for the first bit, one for the second, and one for the third.
- Each observation is encoded across the columns in its binary form.
```
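
The steps above can be sketched in plain Python. This assumes the ordinal codes start at 1 and are assigned in sorted (alphabetical) order, which is a simplification: for truly ordinal data you would supply the category order yourself. The function name `binary_encode` is hypothetical.

```python
def binary_encode(values):
    """Ordinal-encode categories (starting at 1), then spread the bits across columns."""
    categories = sorted(set(values))
    codes = {c: i for i, c in enumerate(categories, start=1)}
    n_bits = max(codes.values()).bit_length()
    # format(5, "03b") gives "101"; each character becomes one column.
    return [[int(bit) for bit in format(codes[v], f"0{n_bits}b")] for v in values]


sizes = ["S", "M", "L", "XL", "XXL"]
encoded_sizes = binary_encode(sizes)
for size, bits in zip(sizes, encoded_sizes):
    print(size, bits)
```

Five categories need only three columns here, compared with five columns under one-hot encoding.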