# Research

## Models:
These are the models that every data scientist should be familiar with.

Favorite Classification Models:
- LinearSVC
- k-NN
- Support Vector Machine Algorithm
- XGBoost
- Random Forest

### Hyperparameters to tune for each model type
Below are the most commonly tuned hyperparameters for each model family.

**k-NN** 
- **n_neighbors**: decreasing K decreases bias and increases variance, which leads to a more complex model
- **leaf_size**: sets the number of points at which the BallTree or KDTree algorithms stop splitting and fall back to a brute-force search within a leaf. The default is 30. It affects the speed of tree construction and queries and the memory needed to store the tree, not the predictions themselves, so tune it for run time rather than accuracy. You can pass in a range of integers, as with n_neighbors, and compare fit and query times. Smaller leaves mean more leaves, which is why values below 30 are rarely tried.
- **weights**: the function that weights the neighbors when making a prediction. “uniform” weights every neighbor equally, while “distance” weights the points by the inverse of their distance (i.e., location matters!). Using “distance” gives closer data points a more significant influence on the classification.
- **metric**: can be set to various distance metrics such as Manhattan, Euclidean, Minkowski, or weighted Minkowski (the default is “minkowski” with p=2, which is the Euclidean distance). Which metric you choose depends heavily on the question you are trying to answer.
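
As a rough illustration of tuning the k-NN hyperparameters above, the sketch below runs a small grid search with scikit-learn. The synthetic dataset, the grid values, and the choice of recall as the scoring metric are assumptions made for the example, not settings taken from this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption); replace with the project's features and labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Illustrative grid over the hyperparameters discussed above.
param_grid = {
    "n_neighbors": [3, 5, 11, 21],
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan"],
}

# 5-fold cross-validation, scored on recall of the positive class.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="recall")
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

The best combination depends entirely on the data, so the grid should be widened or narrowed once a first run shows where the scores level off.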

**Random Forest, Decision Trees**
- **n_estimators (random forest only)**: the number of decision trees in the forest (default = 100). Generally speaking, the more uncorrelated trees in the forest, the closer their individual errors get to averaging out. However, more is not always better: training and prediction cost grows roughly linearly with the number of trees, and the accuracy gain shows diminishing returns after a certain point. Bias-Variance Tradeoff: because each tree is trained on a bootstrap sample and the predictions are averaged (bagging), adding trees mainly reduces variance and does not by itself make the forest overfit.
- **max_depth**: an integer that sets the maximum depth of each tree. The default is None, which means the nodes are expanded until all the leaves are pure (i.e., all the data in a leaf belongs to a single class) or until all leaves contain fewer than min_samples_split samples, which is defined next. Bias-Variance Tradeoff: increasing max_depth lowers bias and raises variance, which can lead to overfitting.
- **min_samples_split**: the minimum number of samples required to split an internal node. The default is 2. Bias-Variance Tradeoff: the higher the minimum, the more “clustered” the decision will be, which could lead to underfitting (high bias).
- **min_samples_leaf**: the minimum number of samples needed at each leaf. The default is 1. Bias-Variance Tradeoff: similar to min_samples_split, if you do not allow the model to split (say, because min_samples_leaf is set too high), your model could overgeneralize from the training data (high bias).
- **criterion**: measures the quality of a split and accepts either “gini”, for Gini impurity (default), or “entropy”, for information gain. Gini impurity is the probability of incorrectly classifying a randomly chosen data point if it were labeled according to the class distribution of the node. Entropy is a measure of disorder in the data. If a split results in lower entropy, then you have gained information (i.e., your data has become more useful for making decisions) and the split is worth the additional computational cost.
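
To make the bias-variance discussion above a little more concrete, here is a minimal sketch that fits a random forest at a few `max_depth` values and compares training accuracy with held-out accuracy. The synthetic data (with some label noise) and the specific depths are assumptions chosen for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 10% label noise (assumption).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scores = {}
for depth in (2, 5, None):  # None grows each tree until its leaves are pure
    forest = RandomForestClassifier(
        n_estimators=100,
        max_depth=depth,
        min_samples_split=2,
        min_samples_leaf=1,
        criterion="gini",
        random_state=0,
    )
    forest.fit(X_train, y_train)
    # Store (training accuracy, held-out accuracy) for this depth.
    scores[depth] = (forest.score(X_train, y_train), forest.score(X_test, y_test))
    print(depth, scores[depth])
```

A growing gap between the two numbers as the depth increases is the usual sign that variance is starting to dominate.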

**AdaBoost and Gradient Boosting**
- **n_estimators**: the maximum number of estimators at which boosting is terminated. For AdaBoost, the algorithm stops early if a perfect fit is reached. The default is 50 for AdaBoost and 100 for Gradient Boosting. Bias-Variance Tradeoff: the higher the number of estimators in your model, the lower the bias, at the cost of a higher risk of overfitting.
- **learning_rate**: scales the contribution of each new estimator to the ensemble. In layman’s terms: the lower the learning_rate, the slower we travel along the slope of the loss function. Important note: there is a trade-off between learning_rate and n_estimators. A lower learning_rate needs more estimators to reach the same fit, and a tiny learning_rate with a large n_estimators will not necessarily improve results enough to justify the large computational cost.
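- **base_estimator (AdaBoost) / loss (Gradient Boosting)**: for AdaBoost, base_estimator is the estimator from which the boosted ensemble is built. The default value is None, which equates to a Decision Tree Classifier with a max depth of 1 (a stump). Recent scikit-learn releases rename this parameter to estimator. For Gradient Boosting, loss is the loss function to optimize. The default value, deviance (named log_loss in recent scikit-learn releases), is the logistic loss, the same loss that Logistic Regression minimizes. If “exponential” is passed, Gradient Boosting recovers the AdaBoost algorithm.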
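
The trade-off between `learning_rate` and `n_estimators` described above can be explored with a sketch like the following, which cross-validates a default AdaBoost model and two Gradient Boosting configurations. The synthetic data, the configuration names, and the particular values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Hypothetical configurations: few fast steps versus many slow steps.
candidates = {
    "adaboost_default": AdaBoostClassifier(
        n_estimators=50, learning_rate=1.0, random_state=0
    ),
    "gb_few_fast_steps": GradientBoostingClassifier(
        n_estimators=50, learning_rate=0.5, random_state=0
    ),
    "gb_many_slow_steps": GradientBoostingClassifier(
        n_estimators=200, learning_rate=0.05, random_state=0
    ),
}

# Mean 3-fold cross-validated accuracy per configuration.
results = {
    name: cross_val_score(model, X, y, cv=3).mean()
    for name, model in candidates.items()
}
for name, score in results.items():
    print(f"{name}: {score:.3f}")
```

The slow configuration fits four times as many trees, so it is worth checking whether any gain in score justifies the extra training time.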


**Support Vector Machines (SVM)**
- **C**: is the regularization parameter. As the documentation notes, the strength of regularization is inversely proportional to C. Basically, this parameter tells the model how much you want to avoid being wrong. You can think of the inverse of C as your total error budget (summed across all training points), with a lower C value allowing for more error than a higher value of C. Bias-Variance Tradeoff: as previously mentioned, a lower C value allows for more error, which translates to higher bias.
- **gamma**: determines how far the influence of a single training point reaches. A low gamma value allows points far away from the hyperplane to be considered in its calculation, whereas a high gamma value prioritizes proximity. Bias-Variance Tradeoff: think of gamma as inversely related to K in k-NN: the higher the gamma, the tighter the fit (low bias, high variance).
- **kernel**: specifies which kernel should be used. Some of the acceptable strings are “linear”, “poly”, and “rbf”. Linear solves for the hyperplane directly in the original feature space, while poly uses a polynomial to solve for the hyperplane in a higher dimension (see Kernel Trick). RBF, or the radial basis function kernel, uses the distance between the input and some fixed point (either the origin or some other fixed point c) to make a classification assumption.
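
As a small, hedged illustration of how `gamma` changes the tightness of an RBF fit, the sketch below trains an SVM at three gamma values on a synthetic two-moons dataset and records training and held-out accuracy. The dataset, the scaling step, and the gamma values are assumptions made for the example.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, non-linearly separable stand-in data (assumption).
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}
for gamma in (0.01, 1.0, 100.0):
    # SVMs are sensitive to feature scale, so standardize before fitting.
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma=gamma))
    model.fit(X_train, y_train)
    # Store (training accuracy, held-out accuracy) for this gamma.
    results[gamma] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"gamma={gamma}: train={results[gamma][0]:.3f}, test={results[gamma][1]:.3f}")
```

The same loop can be repeated over `C` to see how the error budget interacts with gamma.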

## Evaluation Metrics

### Precision – What percent of your positive predictions were correct?
Precision is the ability of a classifier not to label an instance positive that is actually negative. For each class it is defined as the ratio of true positives to the sum of true and false positives.

- TP – True Positives
- FP – False Positives

Precision – Accuracy of positive predictions.

Precision = TP / (TP + FP)

### Recall – What percent of the positive cases did you catch? 
Recall is the ability of a classifier to find all positive instances. For each class it is defined as the ratio of true positives to the sum of true positives and false negatives.

- FN – False Negatives

Recall – Fraction of positives that were correctly identified.

Recall = TP / (TP + FN)

### F1 score – How well does the model balance precision and recall?
The F1 score is the harmonic mean of precision and recall, such that the best score is 1.0 and the worst is 0.0. Generally speaking, F1 scores are lower than accuracy measures, as they embed precision and recall into their computation. **As a rule of thumb, the weighted average of F1 should be used to compare classifier models, not global accuracy**.

F1 Score = 2*(Recall * Precision) / (Recall + Precision)
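
The three formulas above can be worked through in a few lines of plain Python. The counts used here are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts."""
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts: 40 true positives, 10 false positives, 20 false negatives.
precision, recall, f1 = precision_recall_f1(tp=40, fp=10, fn=20)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.800 recall=0.667 f1=0.727
```

Because it is a harmonic mean, F1 sits closer to the lower of the two values, so a model cannot score well by maximizing only one of them.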

### Most important
Recall will be the metric to focus on, because saying a well will fail when it is still fine (a false positive) is no biggie.

Saying a well is working when it's actually broken (a false negative) is a biggie, and recall is the metric that penalizes false negatives.

## Visualization Metrics


### Precision-Recall Curve
Precision-Recall curves should be used when there is a moderate to large class imbalance.

Our dataset has a very large class imbalance, so we chose to use the precision-recall curve.

[https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-imbalanced-classification](https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-imbalanced-classification/)
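
A minimal sketch of computing the points of a precision-recall curve with scikit-learn is shown below. The imbalanced synthetic dataset (about 5% positives) and the logistic regression model are stand-ins assumed for the example, not the data or model used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Imbalanced synthetic stand-in data: roughly 95% negatives, 5% positives (assumption).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0
)

# Any classifier that outputs scores or probabilities would work here.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# One (precision, recall) pair per candidate decision threshold.
precision, recall, thresholds = precision_recall_curve(y_test, scores)
average_precision = average_precision_score(y_test, scores)
print(f"average precision: {average_precision:.3f}")
```

Plotting `recall` on the x-axis against `precision` on the y-axis with any plotting library gives the curve, and the `thresholds` array can be scanned to pick an operating point with the desired recall.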

### Confusion Matrix
A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm.

[https://www.geeksforgeeks.org/confusion-matrix-machine-learning/](https://www.geeksforgeeks.org/confusion-matrix-machine-learning/)
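
To show what the table contains, here is a small plain-Python sketch that builds a binary confusion matrix by counting the four outcomes. The labels are hypothetical, and the layout (rows are true classes, columns are predicted classes, negative class first) is one common convention assumed for the example.

```python
def confusion_matrix_binary(y_true, y_pred, positive=1):
    """Return [[TN, FP], [FN, TP]] for a binary classification problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # predicted positive, actually positive
        elif pred == positive:
            fp += 1  # predicted positive, actually negative
        elif truth == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return [[tn, fp], [fn, tp]]


# Hypothetical labels: 1 = positive class, 0 = negative class.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

matrix = confusion_matrix_binary(y_true, y_pred)
print(matrix)  # [[4, 1], [1, 2]]
```

The bottom-left cell (false negatives) is the one that recall penalizes.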

## Methodology
This project was built using the ROSEMED methodology.
- **'R'**: Research the domain and relevant data science tools
- **'O'**: Obtain the data
- **'S'**: Scrub the data and remove any NaNs, missing values, duplicates, or outliers
- **'E'**: Explore the data and look for correlations and insights
- **'M'**: Model the data using the most relevant classifiers for the data
- **'E'**: Evaluate the models and choose the model that is most suitable for the data
- **'D'**: Deploy the models