# Machine Learning Quiz

### What are the top machine learning frameworks and libraries in Python?


One could look at Google Trends or Stack Overflow tags, but here is my general sense.
In Python:
- TensorFlow
- scikit-learn
- Spark MLlib
- NLTK
- Keras
- PyTorch

Other libraries are often *necessary* for machine learning projects, although I'm not sure I would call them "machine learning libraries":
- NumPy
- pandas
- Matplotlib/seaborn
- SciPy
- statsmodels


### Supervised vs. Unsupervised Learning problems

* ANSWER HERE

### Classification vs. Regression Problems

* ANSWER HERE

### Curse of Dimensionality

* ANSWER HERE

### Bias/Variance Tradeoff

* ANSWER HERE

### Regularized Regression

* ANSWER HERE

### Gradient Descent

* ANSWER HERE

### entropy and information gain

### List at least 5 feature engineering and feature selection techniques

#### Feature Selection Techniques:
##### Filter Methods:
- **Univariate** (`SelectKBest`): consider each feature individually and, using a test statistic (such as chi-squared for a categorical target, or Pearson's r if the target is continuous), pick the features most closely associated with the target.
- **Variance thresholding**: eliminate features with low variance, because a feature that barely varies cannot provide much signal to a model.
- **Correlated Feature Elimination**: begin with the pairwise correlation coefficients of all features, sorted by the absolute value of their correlation coefficients. For the most correlated pair of features, remove the one that is more correlated on average with all other features, since it is the more redundant of the two. Recalculate average correlations and continue until no two features are correlated above some threshold (e.g., no two features have a correlation coefficient with an absolute value above 0.9).
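
As a rough illustration of the first two filter methods, here is a minimal sketch using only the Python standard library. The helper names (`pearson_r`, `variance_threshold`, `select_k_best`) and the toy data are made up for this example; in practice one would more likely reach for scikit-learn's selectors.

```python
from statistics import mean, pvariance

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def variance_threshold(features, threshold=0.0):
    """Keep the names of features whose population variance exceeds the threshold."""
    return [name for name, col in features.items() if pvariance(col) > threshold]

def select_k_best(features, target, k):
    """Rank features by |Pearson r| with a continuous target and keep the top k."""
    ranked = sorted(features, key=lambda n: abs(pearson_r(features[n], target)), reverse=True)
    return ranked[:k]

# Toy data (invented for illustration): one list per feature column.
features = {
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0],
    "signal": [1.0, 2.0, 3.0, 4.0, 5.0],
    "noise": [2.0, -1.0, 3.0, 0.0, 1.0],
}
target = [2.1, 3.9, 6.2, 8.0, 9.8]

kept = variance_threshold(features)                                  # drops "constant"
best = select_k_best({n: features[n] for n in kept}, target, k=1)    # keeps "signal"
```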

##### Wrapper Methods:
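- **Forward Selection**: Start with an empty set of features and add features one by one. At each step, add the feature that most improves model performance (e.g., a cross-validated score), and stop when no remaining feature improves it.
- **Backward Elimination**: Start with the full set of features and remove features one by one. At each step, remove the feature whose removal hurts model performance the least, and stop when any further removal makes the model noticeably worse.
- **Recursive Feature Elimination (RFE)**: Fit a model on the current feature set, rank the features by importance (e.g., by coefficients or feature importances), drop the least important feature(s), and repeat until the desired number of features remains. This is a greedy search: with N features and one feature removed per round, it fits at most N models, whereas an exhaustive search over all feature subsets would have to evaluate 2^N combinations.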
***
#### Feature Engineering Techniques:
- Create bins of continuous features
- Perform group-by-aggregate operations
- Create arithmetic feature interactions (e.g., new_feature = feature_A * feature_B)
- Use a clustering algorithm to assign clusters of observations (being sure to exclude the target from this step)
- For datetime, extract date parts (e.g., year, month, and day-of-week), or calculate ranges between dates.
- If data are spatial, you can do a lot with basic geometry, such as using the physical distance to a nearest-neighbor
- (There are endless possibilities, and domain knowledge should definitely play a role in suggesting new possible features.)
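
A few of these techniques can be sketched in plain Python. The record layout, field names, and bin edges below are hypothetical and chosen only to illustrate binning, an arithmetic interaction, date parts, and a range between dates.

```python
from datetime import date

def bin_value(value, edges):
    """Return the index of the bin that value falls into, given sorted bin edges."""
    return sum(value >= edge for edge in edges)

def engineer(record):
    """Derive a few new features from one raw record (a dict)."""
    signup, purchase = record["signup_date"], record["purchase_date"]
    return {
        "age_bin": bin_value(record["age"], edges=[18, 35, 65]),  # binned continuous feature
        "income_x_age": record["income"] * record["age"],         # arithmetic interaction
        "purchase_dow": purchase.weekday(),                       # day of week, 0 = Monday
        "purchase_month": purchase.month,                         # date part
        "days_since_signup": (purchase - signup).days,            # range between dates
    }

row = {
    "age": 42,
    "income": 50_000,
    "signup_date": date(2020, 1, 1),
    "purchase_date": date(2020, 3, 1),
}
new_features = engineer(row)
```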


### What is a decision tree? How does a decision tree decide to make a split?


### How would you go about building a bayesian model?
### How does a random forest work?
### How does gradient boosting work?

### Ensemble models

- Combining the results of multiple models to create more stable and accurate predictions.
- Two examples of decision tree ensembles are random forests and gradient-boosted trees.

### Bagging and boosting


- Both Bagging and Boosting are *ensemble techniques*.
- Bagging is *parallel*, boosting is *sequential*.

#### Bagging
- Short for "bootstrap aggregation"
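- Draw multiple bootstrap samples from the training set (random samples taken with replacement, each typically the same size as the original), train a model separately on each sample, and then average the predictions from all models (or take a majority vote, for classification) to make a final prediction.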
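
A minimal sketch of bagging, assuming bootstrap samples drawn with replacement and a deliberately simple base learner (a no-intercept least-squares slope). The function names and the toy data are invented for illustration.

```python
import random

def fit_slope(xs, ys):
    """Least-squares slope for the no-intercept model y = w * x."""
    denom = sum(x * x for x in xs)
    return sum(x * y for x, y in zip(xs, ys)) / denom if denom else 0.0

def bagged_slopes(xs, ys, n_models=25, seed=0):
    """Train one base model per bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of row indices
        slopes.append(fit_slope([xs[i] for i in idx], [ys[i] for i in idx]))
    return slopes

def bagged_predict(slopes, x):
    """Aggregate by averaging the predictions of all base models."""
    return sum(w * x for w in slopes) / len(slopes)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.2, 3.9, 6.1, 8.3, 9.8, 12.1]  # roughly y = 2x; values are made up
slopes = bagged_slopes(xs, ys)
prediction = bagged_predict(slopes, 10.0)
```

Because each base model is trained independently, the loop over bootstrap samples could be run in parallel.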

#### Boosting
- A model is first trained on all of the training data and predictions are made on the same data.
- Predictions that the model got (the most) wrong are then "up-weighted", and a new model is trained so that it "focuses" on improving the predictions that were wrong in the previous iteration.
- The process is repeated until some stopping condition is reached.

### PCA

*Plain English*: 
- Principal component analysis (PCA) is a method, taken from linear algebra, that provides a lower-dimensional representation of a dataset.

*More technical*: 
- PCA takes a dataset (matrix)
- performs an *orthogonal linear transformation that projects the matrix onto a new coordinate system such that the greatest variance is captured by the first coordinate* (the first principal component), the second-greatest variance is captured by the second coordinate, etc.
- You can retain as many principal components as you want to reduce the dimensions arbitrarily.

Uses: 
- *Reveal the latent structure* of a dataset in a way that best explains the variance within it. 
- *Decorrelate* the features within a dataset to mitigate problem of multicollinearity (while sacrificing some interpretability of the final model built from such a transformed dataset)
- *Visualize* high-dimensional datasets, because the human eye can only see in three dimensions (and really only two--think of how difficult it is to read a 3D scatterplot, even if you've done it many times and can manipulate it in space).
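
To make the idea concrete, here is a small sketch of PCA restricted to two-dimensional data, where the eigen-decomposition of the 2x2 covariance matrix has a closed form. The function name `pca_2d` and the sample points are assumptions for this example, and it assumes the points are not all identical; real projects would normally use a linear algebra library instead.

```python
import math

def pca_2d(points):
    """PCA for 2-D points: returns (first component direction, explained variance ratio)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Entries of the 2x2 sample covariance matrix [[a, b], [b, c]].
    a = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (a + c) / 2
    spread = math.hypot((a - c) / 2, b)
    lam1, lam2 = half_trace + spread, half_trace - spread
    # Angle of the eigenvector that belongs to the largest eigenvalue.
    theta = 0.5 * math.atan2(2 * b, a - c)
    direction = (math.cos(theta), math.sin(theta))
    return direction, lam1 / (lam1 + lam2)

# Made-up points lying close to the line y = x.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
direction, explained = pca_2d(points)
```

Here the first principal component points roughly along the diagonal and captures almost all of the variance, so the data could be reduced to one dimension with little loss.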



### Cross validation

A technique for estimating the generalizability of a predictive model and detecting the problem of overfitting. 
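- The classic approach is *k-fold cross validation*:
  - Rather than randomly partitioning a dataset into a single TRAINING set and TEST set, we split it into k roughly equal-sized folds.
  - Each fold serves once as the TEST set while the model is trained on the other k - 1 folds, and the k scores are averaged.
  - In practice, k is usually set to 5 or 10, but this can vary.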
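
The splitting step can be sketched as follows. This is an illustrative generator written for these notes (the name `k_fold_indices` is made up), not the implementation used by any particular library.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal-sized folds
    for i, test_fold in enumerate(folds):
        # Every fold except the i-th one goes into the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold

splits = list(k_fold_indices(10, k=5))
```

Each observation appears in exactly one test fold, so every row is used for evaluation once and for training k - 1 times.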

## Classification Algorithms

### Confusion matrix

- A Confusion Matrix is a specific type of *contingency table for evaluating the performance of a supervised classification algorithm*, so named because it shows if a model is confusing two classes (i.e., commonly mislabeling one as another). 
- The rows of the table show counts of the predicted class while the columns represent the instances of the actual class (or vice versa).
- It gives a more fine-grained view than simple accuracy (the proportion of "right answers" out of all guesses) by showing *how* a model failed. 
- In the simple case with only two classes, positive and negative, the confusion matrix contains four quadrants, which represent counts of:
  - True Positives
  - False Positives
  - True Negatives
  - False Negatives

#### Precision
- TP / (TP + FP)
- The fraction of records that were positive from the group that the classifier predicted to be positive
- Also called positive predictive value

#### Recall
- TP / (TP + FN)
- The fraction of positive examples the classifier got right
- Also called sensitivity, hit rate, or true positive rate (used in ROC curve, see below)
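
As a small worked example, the four confusion-matrix counts and the two metrics above can be computed directly from label lists. The function name and the labels are invented for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for a binary classification problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)  # TP / (TP + FP)
recall = tp / (tp + fn)     # TP / (TP + FN)
```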


### ROC-AUC

**Receiver Operating Characteristic - Area Under Curve**
- Metric used in machine learning for evaluating and comparing supervised classification algorithms.
- A receiver operating characteristic curve is a graphical plot that shows the performance of a binary classifier as its discrimination threshold is varied.
- The ROC curve is the **sensitivity** (True Positive Rate, or *recall*-- see above) as a function of **fall-out** (False Positive Rate, or 1 - specificity)
- In theory, the area under the ROC curve varies between 0 and 1, with an uninformative classifier yielding 0.5, no better than random guessing. 
- A measurement of zero would represent the "perverse" case of the model always giving the wrong response, which would be a kind of 'inversely accurate prediction'.
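
One way to compute the area without drawing the curve relies on a known equivalence: the AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses that pairwise definition; it is quadratic in the number of examples and meant only as an illustration, and the function name is made up.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])    # 3 of 4 pairs ranked correctly
perfect = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
inverted = roc_auc([0, 0, 1, 1], [0.9, 0.8, 0.2, 0.1])
uninformative = roc_auc([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5])
```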


### Unbalanced dataset

- Classification datasets can be "unbalanced" when there are many observations of one class and few of another.
- Accuracy-driven models will over-predict the majority class.
  - If a dataset of credit card transactions is 99.9% NOT FRAUD and 0.1% FRAUD, then a model that makes a prediction of NOT FRAUD *every time* will be 99.9% accurate while missing every single case of fraud.

#### Possible solutions:
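- **Downsample/Upsample**: randomly drop observations from the majority class, or resample observations from the minority class with replacement, until the classes are more balanced.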
- **Cost-sensitive learning**: 
  - Incorporate the monetary costs of a "miss" into the objective function. 
  - For instance, in the context of fraud detection, there are relatively few fraud events, making an unbalanced dataset. 
  - We know however that the costs of a false negative are often much higher than the costs of a false positive
    - i.e., failing to detect an instance of fraud can be more costly than incorrectly flagging a legitimate transaction.
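
The fraud example can be worked through numerically. In the sketch below, the dollar costs of a miss and of a false alarm, and the behavior of the hypothetical detector, are invented purely for illustration.

```python
def evaluate(y_true, y_pred, fn_cost, fp_cost):
    """Return (accuracy, recall, total cost) for binary labels where 1 = FRAUD."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, recall, fn * fn_cost + fp * fp_cost

# 10,000 transactions, 0.1% of them fraudulent.
y_true = [1] * 10 + [0] * 9990
always_not_fraud = [0] * 10_000
# Hypothetical detector: catches 8 of 10 frauds, wrongly flags 100 legitimate transactions.
detector = [1] * 8 + [0] * 2 + [1] * 100 + [0] * 9890

# Assumed costs: a missed fraud costs 500, a false alarm costs 5.
baseline = evaluate(y_true, always_not_fraud, fn_cost=500, fp_cost=5)
flagging = evaluate(y_true, detector, fn_cost=500, fp_cost=5)
```

Under these assumed costs, the detector is less accurate than the always-NOT-FRAUD baseline yet considerably cheaper, which is the point of putting costs into the objective.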


#### Logistic regression
- Take our features and multiply each by a weight and then sum them up
- Feed this weighted sum into a sigmoid function, which will return a number between zero and one
  - We can think of this as a probability estimate: the probability that the observation belongs to the positive class (class 1)
- Anything above 0.5 we'll classify as 1 and anything below 0.5, we'll classify as 0.
- Now we need to know what are the best weights or regression coefficients to use, and how to find them.
  - This is an optimization problem which can be solved in many ways, including by stochastic gradient descent. (see below)
  
##### Stochastic gradient descent
- Set all weights equal to 1
- For each piece of data in the dataset:
  - Calculate the gradient for that one piece of data
  - Update the weights vector by alpha * the gradient, where alpha is the learning rate
- Return the weights vector
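
A minimal sketch of these steps for logistic regression, in plain Python. It assumes the first feature of every row is a constant 1.0 acting as the intercept, makes several shuffled passes over the data rather than one, and uses made-up toy data; `alpha` and `epochs` are illustrative defaults, not recommendations.

```python
import math
import random

def sigmoid(z):
    """Squash a weighted sum into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic(X, y, alpha=0.1, epochs=200, seed=0):
    """Fit logistic regression weights with stochastic gradient updates."""
    rng = random.Random(seed)
    weights = [1.0] * len(X[0])  # set all weights equal to 1
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)       # visit the rows in a random order on each pass
        for i in order:
            p = sigmoid(sum(w * x for w, x in zip(weights, X[i])))
            error = y[i] - p     # log-likelihood gradient for one row is error * x
            weights = [w + alpha * error * x for w, x in zip(weights, X[i])]
    return weights

# Toy data: a constant 1.0 (intercept) plus one feature.
X = [[1.0, -2.0], [1.0, -1.5], [1.0, -1.0], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0]]
y = [0, 0, 0, 1, 1, 1]

weights = sgd_logistic(X, y)
predictions = [1 if sigmoid(sum(w * x for w, x in zip(weights, row))) > 0.5 else 0 for row in X]
```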

**Pros**:
- Computationally inexpensive
- Easy to implement
- Easy to interpret (we have coefficients, i.e., the weights)

**Cons**:
- Prone to underfitting; may have low accuracy

#### SVM Support Vector Machine


The support vector machine is a maximal margin classifier that seeks to construct a hyperplane that linearly separates training observations of one class from the other class(es).
 
**Preprocessing data for SVM**:

Intuitively, it makes sense that SVMs might need scaling. Since SVMs are sensitive to the distance of points relative to a hyperplane, if one dimension had units in the thousands, the distance along that dimension would overwhelm another dimension with values in [0,1], and the model would focus disproportionately on the larger dimension. Scaling overcomes this.
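
The effect of scale on distances can be seen in a small sketch. The two columns below are invented (one in the tens of thousands, one in [0, 1]), and standardization to zero mean and unit variance is used as one common scaling choice among several.

```python
import math
from statistics import mean, pstdev

def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma if sigma else 0.0 for v in column]

income = [30_000.0, 45_000.0, 60_000.0, 120_000.0]  # large units
ratio = [0.10, 0.40, 0.35, 0.90]                    # values in [0, 1]

# Distance between the first two observations before scaling: dominated by income.
raw_distance = math.dist((income[0], ratio[0]), (income[1], ratio[1]))

scaled_income, scaled_ratio = standardize(income), standardize(ratio)
# After scaling, both features contribute on a comparable footing.
scaled_distance = math.dist(
    (scaled_income[0], scaled_ratio[0]), (scaled_income[1], scaled_ratio[1])
)
```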


#### KNN (K-nearest neighbors)
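The algorithm can be summarized as:
- Compute the distance (e.g., Euclidean) from the new observation to every observation in the training set
- Select the k training observations closest to it
- Predict the majority class among those k neighbors (or the average of their target values, for regression)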
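
As a minimal sketch, k-nearest-neighbors classification can be written in a few lines of plain Python, assuming Euclidean distance and a simple majority vote. The function name and the toy points are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Distance from the query to every training observation, nearest first.
    distances = sorted((math.dist(x, query), label) for x, label in zip(train_X, train_y))
    top_k = [label for _, label in distances[:k]]
    return Counter(top_k).most_common(1)[0][0]

train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_y = ["a", "a", "a", "b", "b", "b"]

near_a = knn_predict(train_X, train_y, (2, 2))
near_b = knn_predict(train_X, train_y, (6, 5))
```

Because every prediction scans the whole training set, this brute-force version is slow on large datasets.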


### HPO

- *Hyperparameter optimization*: choosing a set of optimal hyperparameters for a machine learning algorithm. 
- A hyperparameter is a variable that controls the learning process (e.g., the learning rate or the number of hidden layers) rather than a value learned from the data (i.e., a parameter, such as a neuron weight).
- HPO is guided by some performance metric, typically measured by cross-validation on the training set. 
- One approach to HPO is *grid search*, which is an exhaustive search through all possible combinations of manually specified values for different hyperparameters.  
  - It can result in combinatorial explosion, because it trains and evaluates a model for every combination of suggested values of the hyperparameters. 
  - However, it can usually be easily parallelized, because each model using a specific combination of hyperparameter values is typically independent of all other models built using different combinations.
- An alternative approach is *Bayesian optimization*, which operates sequentially, meaning that the hyperparameters chosen at the next step are influenced by the performance observed at all previous steps. Bayesian optimization makes principled decisions about how to balance exploring new regions of the hyperparameter space against exploiting regions that are known to perform well. It often needs far fewer model evaluations than grid search or random search, which matters most when each model is expensive to train.
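
Grid search itself is simple enough to sketch with `itertools.product`. In the example below, `toy_score` is a stand-in for "train a model with these hyperparameters and return its cross-validated score"; it and the grid values are hypothetical.

```python
from itertools import product

def grid_search(score_fn, grid):
    """Evaluate every combination in the grid and return (best params, best score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    n_evaluated = 0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)  # each evaluation is independent of the others
        n_evaluated += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_evaluated

def toy_score(learning_rate, depth):
    """Stand-in for a cross-validated model score; peaks at (0.1, 4)."""
    return -((learning_rate - 0.1) ** 2) - (depth - 4) ** 2

grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}
best_params, best_score, n_evaluated = grid_search(toy_score, grid)
```

With three values for each of two hyperparameters the search evaluates nine models; adding a third hyperparameter with three values would make it 27, which is the combinatorial explosion mentioned above.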


### Pipeline

In general, a data pipeline is an object that takes data as input and produces data as output, with a series of specific operations in between. There are *data processing* pipelines that perform Extract, Transform and Load (ETL) operations, and there are also *machine learning* pipelines that take raw features as input and produce predictions as output. A machine learning pipeline may also include steps such as preprocessing, feature extraction, dimensionality reduction, etc.
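
The idea can be sketched as a list of functions applied in order. The three steps below (dropping missing values, min-max scaling, and a threshold "model") are toy stand-ins invented for illustration, not a real library's pipeline API.

```python
def drop_missing(values):
    """Preprocessing step: remove missing entries."""
    return [v for v in values if v is not None]

def min_max_scale(values):
    """Preprocessing step: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def threshold_model(values, cutoff=0.5):
    """Stand-in for a trained model: predict 1 when the scaled value reaches the cutoff."""
    return [1 if v >= cutoff else 0 for v in values]

def run_pipeline(steps, data):
    """Feed the output of each step into the next one."""
    for step in steps:
        data = step(data)
    return data

raw = [3.0, None, 1.0, 5.0]
predictions = run_pipeline([drop_missing, min_max_scale, threshold_model], raw)
```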

### Indexing

Informally, an index refers to the order in which data are organized for easy reference or access.

In SQL-speak:
Indexing means creating an index--a pointer to data in a table, a special lookup table that the database search engine can use to speed up data retrieval.

In pandas-speak: 
Indexing, or reindexing, a Series (and by extension a DataFrame) means assigning or altering the index, or axis labels. Index objects are technically immutable arrays implementing an ordered, sliceable set. Rows and columns can be thought of as hash-like structures, where the keys are the index and the values are arrays. So indexing is specifying the values to use as the keys, which we interact with as row labels and column labels. Indexes can be hierarchically structured using multi-indexes.


### What is EDA? or What do you do when you first get a dataset? ...or Walk me through your EDA process.

EDA stands for Exploratory Data Analysis.
In broad terms, I think the fundamentals can be boiled down to:
- missingness
- dispersion
- outlier/anomaly detection
- correlation of features among each other
- correlation of features with the target

My approach involves the following steps:
1. Examine the head and tail of the dataframe, the datatypes of columns, and the number of unique values per column
1. Missingness:
   1. Counts of missing (null/NaN) values
   1. Create a dataframe of Boolean missingness flags
   1. Examine pairwise correlations of the missingness flags for all columns
1. Split the dataframe into continuous and categorical features
   1. Possibly also deal with datetime and free-text features, which are topics unto themselves.
1. Continuous features:
   1. Simple descriptive statistics (mean, median, min, max, quartiles, skewness) for each column
   1. Correlation matrix (optionally visualized as a heatmap, ordered according to a hierarchical clustering algorithm)
   1. Outlier detection, using robust methods:
      1. MAD (median absolute deviation)
      1. Mahalanobis distance
   1. Visualize continuous distributions, using univariate and bivariate plots:
      1. Scatter matrix of subsets of continuous features (look at it before and after dropping outliers)
      1. Strip plots and/or boxplots of single features of interest
      1. Scatterplots of each continuous feature with the target
1. Categorical features:
   1. Bar charts representing value counts of categorical features
   1. Look at frequency counts and consider ways of rolling up categories
   1. Look at correlation w
1. Maps and line plots over time, if you have a spatial or temporal problem.
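
For the MAD-based outlier step, one common recipe is the modified z-score, 0.6745 * (x - median) / MAD, with values beyond about 3.5 flagged as outliers. The sketch below follows that recipe; the cutoff is a conventional default rather than something taken from these notes, and the sample values are made up.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])  # median absolute deviation
    if mad == 0:
        return []  # no spread around the median, so nothing can be scored
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]

values = [10, 12, 11, 13, 12, 11, 95]
outliers = mad_outliers(values)
```

Unlike a mean and standard deviation, the median and MAD are barely moved by the extreme value itself, which is what makes the method robust.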

### Encoding

Encoding means creating an abstract (mathematical) representation of some entity or phenomenon.

For example:
- You could encode a digital image as a matrix of RGB values for each pixel
- You could encode categorical data using the one-hot encoding method, in which a single column with a finite number of categories (string values) is translated into several columns, one representing each possible category, with binary values (0 for False or 1 for True).
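
One-hot encoding is short enough to sketch by hand. The function below is an illustration written for these notes (libraries such as pandas and scikit-learn provide their own encoders), and the color values are made up.

```python
def one_hot(values):
    """Return (sorted category names, one row of 0/1 indicators per input value)."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories] for value in values]
    return categories, rows

colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot(colors)
# categories: ["blue", "green", "red"]
# encoded:    [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```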