### Data Transformations
In data: an **instance** is a row; a **feature** is an attribute or column; and the **target** is the label or class
<br>
<br>
**Binary**: data that takes one of two values (e.g., 0 or 1)
<br>
**Numeric**: data are numbers
<br>
**Ordinal**: categories with a meaningful order (e.g., low < medium < high); often encoded as integers
<br>
**Nominal**: data without order; usually categories 

Feature engineering with pandas:
<br>
Representing data to fit the prediction task, such as **one-hot encoding** (in pandas, `get_dummies`)
- **normalization of data**, such as **logs** and **z-scores**: 
<p style="text-align: center;">$\text{z-score} = \frac{x - \mu}{\sigma}$, where $x$ is the value, $\mu$ is the mean, and $\sigma$ is the standard deviation</p>
<br>
- **standard deviation**: found by averaging the squared deviations from the mean and then taking the square root of that average, i.e. $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$
<br>
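
As a rough sketch of the two transformations above, the snippet below one-hot encodes a nominal column with `pd.get_dummies` and computes z-scores for a numeric column. The DataFrame, its column names (`color`, `height_cm`), and its values are made up for illustration. Note that `Series.std` defaults to the sample standard deviation (`ddof=1`), so `ddof=0` is passed to match the definition given here.

```python
import pandas as pd

# Hypothetical toy data: "color" is nominal, "height_cm" is numeric.
df = pd.DataFrame({
    "color": ["red", "green", "red", "blue"],
    "height_cm": [150.0, 160.0, 170.0, 180.0],
})

# One-hot encode the nominal column: one indicator column per category.
encoded = pd.get_dummies(df, columns=["color"])

# z-score using the population standard deviation (ddof=0), i.e. the
# square root of the average squared deviation from the mean.
mu = df["height_cm"].mean()
sigma = df["height_cm"].std(ddof=0)
encoded["height_z"] = (df["height_cm"] - mu) / sigma

print(encoded)
```

After the transformation, `height_z` has mean 0 and standard deviation 1, so it is on a comparable scale with other z-scored features.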

### Clustering (k-means)
**Classification** involves grouping points with respect to a target or label.
<br>
**Clustering** involves grouping data with respect to a similarity metric, without using labels; k-means does this by finding k cluster centers for k cluster groups

**Within cluster variance**, or the sum of squared errors (SSE): the sum of the squared distances between data points and their cluster's center; a lower sum indicates better clustering quality
<p style="text-align: center;"> $$\sum_{i=1}^{k}\sum_{p \in C_i}^{} dist(p, c_i)^{2}$$ </p>
Here $C_i$ is the $i$-th cluster and $c_i$ is its center. The first sum runs over all $k$ clusters, and the second sum runs over all points $p$ in cluster $C_i$.
<br>
<br>
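
A minimal sketch of the double sum above, assuming `dist` is Euclidean distance (the notes do not fix a particular metric). The clusters and centers below are hypothetical values chosen so the result is easy to check by hand.

```python
def squared_distance(p, c):
    """Squared Euclidean distance between two equal-length points."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))


def within_cluster_sse(clusters, centers):
    """Sum over clusters, then over each cluster's points, of dist(p, c_i)^2."""
    return sum(
        squared_distance(p, center)
        for points, center in zip(clusters, centers)
        for p in points
    )


# Hypothetical example: two clusters of 2-D points with their centers.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 10.0), (10.0, 12.0)]]
centers = [(1.0, 0.0), (10.0, 11.0)]
sse = within_cluster_sse(clusters, centers)
print(sse)  # every point is 1 unit from its center, so 1 + 1 + 1 + 1 = 4.0
```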

**k-means algorithm:** inputs are k, the number of clusters, and D, the data points
1. Randomly select k data points to be the initial cluster centers.
2. Until there is no change in the assignment of clusters or the max number of iterations is reached:
<br>
    i. assign each data point to the nearest cluster center by distance
<br>
    ii. update each cluster centroid to the mean of the points assigned to it
3. Return the cluster label for all data points
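
The steps above can be sketched in NumPy as follows. This is an illustrative version, not a reference implementation: it assumes Euclidean distance, and the function name `k_means`, the `seed` parameter, and the choice to keep an old center when a cluster becomes empty are all assumptions made for this example.

```python
import numpy as np


def k_means(D, k, max_iters=100, seed=0):
    """Basic k-means on an (n, d) array D; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centers.
    centers = D[rng.choice(len(D), size=k, replace=False)].astype(float)
    labels = np.full(len(D), -1)
    for _ in range(max_iters):
        # Step 2i: assign each point to its nearest center (Euclidean).
        dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        # Step 2ii: move each center to the mean of its assigned points.
        for j in range(k):
            members = D[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    # Step 3: return the cluster label of every point.
    return labels, centers


# Hypothetical data: two well-separated groups of 2-D points.
D = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centers = k_means(D, k=2)
print(labels)
```

Because the initial centers are random, the numeric label given to each group can differ between seeds; only the grouping itself is meaningful.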

**Ways of choosing k:**
<br>
**Elbow Method:**
1. Choose a set of k to run k-means for. 
2. Run k-means for each k and find the sum of squared errors for each
3. Find the change in slope between each consecutive pair of sums
4. The turning point, or elbow, is the k where the largest change in slope is calculated.
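
One way to read steps 3 and 4 is as a second difference of the SSE curve. The sketch below follows that reading; the helper name `elbow_k` is hypothetical, and the SSE values are made up for illustration rather than taken from a real k-means run.

```python
def elbow_k(ks, sse):
    """Return the k at which the slope of the SSE curve changes the most."""
    # Slope of the SSE curve between each pair of consecutive k values.
    slopes = [(sse[i + 1] - sse[i]) / (ks[i + 1] - ks[i])
              for i in range(len(ks) - 1)]
    # Change in slope at each interior k (a second difference).
    changes = [slopes[i + 1] - slopes[i] for i in range(len(slopes) - 1)]
    # changes[i] belongs to ks[i + 1], the point between the two slopes.
    best = max(range(len(changes)), key=lambda i: changes[i])
    return ks[best + 1]


# Made-up SSE values for illustration only (not from a real run).
ks = [1, 2, 3, 4, 5, 6]
sse = [1000.0, 600.0, 200.0, 170.0, 150.0, 140.0]
print(elbow_k(ks, sse))  # the curve flattens sharply after k = 3
```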

**Silhouette Score:** ranges from -1 to 1; higher is better
<br>
For each data point $o$:
- $a(o)$: calculate the average distance between the data point and the other members of its own cluster; gives an idea of how tight the cluster is
- $b(o)$: for every other cluster, calculate the average distance between the data point and the points in that cluster, then take the minimum; measures how close the closest neighboring cluster is
- $s(o)$: is the silhouette coefficient, which is:
<p style="text-align: center;"> $s(o) = \frac{b(o) - a(o)}{\max[a(o), b(o)]}$ </p>
- Average of silhouette coefficients for every data point is the **Silhouette score**
<br>
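
A small sketch of the silhouette computation described above, assuming Euclidean distance and at least two clusters. The function name and the convention of scoring a single-point cluster as 0 are assumptions for this example; the points and labels are hypothetical.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    def mean_dist(o, others):
        return sum(math.dist(o, q) for q in others) / len(others)

    coefficients = []
    for i, o in enumerate(points):
        own = [q for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            # Assumed convention: a single-point cluster scores 0.
            coefficients.append(0.0)
            continue
        a = mean_dist(o, own)  # tightness of o's own cluster
        # b: smallest mean distance from o to the points of any other cluster.
        b = min(
            mean_dist(o, [q for j, q in enumerate(points) if labels[j] == c])
            for c in set(labels) if c != labels[i]
        )
        coefficients.append((b - a) / max(a, b))
    return sum(coefficients) / len(coefficients)


# Hypothetical data: two tight clusters that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
score = silhouette_score(points, labels)
print(round(score, 3))
```

With tight, well-separated clusters like these, $a(o)$ is small relative to $b(o)$, so the score lands close to 1.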


**Pitfalls**:
<br>
The **Silhouette Coefficient** does not try to make the sizes of clusters the same. It is vulnerable to placing outlier data points in their own cluster to maximize the coefficient.

With the **Elbow Method**, the sum of squared errors approaches zero as k approaches the number of data points, so a lower error alone does not mean a better k.

There are **normalization errors**, such as having features on different scales. Normalize all features so the distance calculations aren't dominated by any particular feature.

### Classification (decision trees)
#### Terms:
**Impurity**: how mixed the labels in a set of instances are; uncertainty
<br>
**Information (entropy):** one way to measure impurity
<br>
**Instance:** a row or tuple
<br>
**Class:** also known as a label or target
<br>
**Gini index:** another way to measure impurity
<br>
**Node**: represents a condition; the children of a node represent the possible outcomes of that condition
<br> **Leaf nodes** represent labels

**Characterizing purity (Gini/Info)**

**Information** of a dataset: 
<p style="text-align: center;"> $$-\sum_{i=1}^{m} p_i\log_2(p_i)$$ </p>
where $m$ is the total number of labels, and $p_i$ is the proportion of elements with label i.


**Gini Index** of a dataset:
<p style="text-align: center;"> $$1 - \sum_{i=1}^{m} p_i^2$$ </p>
where $m$ is the total number of labels, and $p_i$ is the proportion of elements with label i. Gini is used in CART, which makes binary decision splits.
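
The two impurity formulas above translate fairly directly into code. This is a minimal sketch; the helper names `proportions`, `info`, and `gini` are chosen for this example, and the label list is hypothetical.

```python
import math
from collections import Counter


def proportions(labels):
    """p_i for each distinct label in the dataset."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def info(labels):
    """Info(D) = -sum(p_i * log2(p_i))."""
    return -sum(p * math.log2(p) for p in proportions(labels))


def gini(labels):
    """Gini(D) = 1 - sum(p_i^2)."""
    return 1.0 - sum(p ** 2 for p in proportions(labels))


# Hypothetical dataset with an even two-way split of labels.
labels = ["yes", "yes", "no", "no"]
print(info(labels), gini(labels))  # 1.0 0.5
```

An even two-class split is the most impure case for two labels (information of 1 bit, Gini of 0.5), while a set with a single label scores 0 on both measures.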

**Gain** or **reduction of impurity**:
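Say we are splitting on attribute $A$, which partitions $D$ into subsets $D_1, \dots, D_v$; then
<p style="text-align: center;"> $Gain(A) = Info(D) - Info_{A}(D)$, </p>
which is the information before the split minus the information after splitting on $A$, where $Info_{A}(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} Info(D_j)$ is the size-weighted average information of the subsets.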

**Using Gini Index**:
<p style="text-align: center;"> $$Gini(D) = 1 - \sum_{i=1}^{m} p_i^2$$ </p>
After a binary split on attribute $A$ into $D_1$ and $D_2$, the Gini index can be calculated by:
<p style="text-align: center;"> $Gini_A(D) = \frac{|D_1|}{|D|}Gini(D_1) + \frac{|D_2|}{|D|}Gini(D_2)$ </p>
Then the change in impurity is:
<p style="text-align: center;"> $\Delta Gini_A(D) = Gini(D) - Gini_A(D)$ </p>
The goal is to choose attributes and splits that maximize the change in impurity.
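
As a worked example of the change in impurity, the sketch below evaluates one binary split. The function name `gini_gain` and the label lists `D`, `D1`, and `D2` are hypothetical; `D1` and `D2` stand for the two sides of a split on some attribute.

```python
from collections import Counter


def gini(labels):
    """Gini(D) = 1 - sum(p_i^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_gain(parent, left, right):
    """Delta Gini for a binary split of parent into left and right."""
    n = len(parent)
    # Size-weighted Gini of the two subsets after the split.
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


# Hypothetical split of 6 labels into D1 and D2.
D = ["yes", "yes", "yes", "no", "no", "no"]
D1 = ["yes", "yes", "yes", "no"]
D2 = ["no", "no"]
gain = gini_gain(D, D1, D2)
print(gain)
```

Here $Gini(D) = 0.5$, $Gini(D_1) = 0.375$, and $Gini(D_2) = 0$, so the weighted Gini after the split is $\frac{4}{6}(0.375) = 0.25$ and the change in impurity is $0.25$.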

#### Decision Trees Parameters:
**Max Depth:** how deep the tree is allowed to grow
<br>
**Minimum leaf size**: the minimum number of data points allowed in a leaf
<br>
**Post-pruning:** after the tree is grown, cut back branches using a validation set; aims to minimize the number of rules used while also minimizing the increase in error from pruning; can prevent overfitting if the tree is too large.

### Neural Networks

#### Activation Functions
Activation functions in neural networks turn a node's weighted inputs into its output; they are differentiable (almost everywhere, in the case of ReLU), which is a big deal because it allows us to backpropagate the model's error when training to optimize the weights
- sigmoid: maps an input to a number between 0 and 1
<p style="text-align: center;"> $\sigma(x) = \frac{1}{1 + e^{-x}}$ </p>
- softmax function: similar to the sigmoid except it can be used for multiclass classification; it maps a vector of scores to probabilities that sum to 1
- tanh or hyperbolic tangent: range is from -1 to 1, and the shape is similar to the shape of the sigmoid function; advantage is that negative inputs will be mapped strongly negative, zero inputs mapped near zero, and positive inputs mapped strongly positive
- ReLU: rectified linear unit; range is 0 to $\infty$; disadvantage is that any negative input gets mapped to 0, where the gradient is also 0
<p style="text-align: center;"> $R(x) = \max[0, x]$ </p>
- softplus: $\ln(1 + e^x)$, a smooth approximation of ReLU; an alternative to ReLU because it is differentiable everywhere
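
The functions in this list can be sketched with the standard library alone. These are plain illustrative versions rather than numerically hardened ones; for example, this `sigmoid` can overflow for very large negative inputs.

```python
import math


def sigmoid(x):
    """1 / (1 + e^-x): squashes x into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(xs)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def relu(x):
    """max(0, x): negative inputs are mapped to 0."""
    return max(0.0, x)


def softplus(x):
    """ln(1 + e^x): a smooth approximation of ReLU."""
    return math.log1p(math.exp(x))


# tanh is available directly as math.tanh.
print(sigmoid(0.0), relu(-3.0), math.tanh(0.0))  # 0.5 0.0 0.0
print(softmax([1.0, 2.0, 3.0]))
```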


Feed forward neural network: a network with no cycles in the connections between the nodes; everything travels in one direction (forward), from the inputs to the outputs

#### Perceptron aka Node
Built from weights and possibly a bias term; a squashing function (activation function) completes the perceptron by turning the weighted sum of the inputs into the output.
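
A minimal sketch of a single node, assuming a sigmoid as the squashing function (other activations would work the same way). The inputs and weights are hypothetical and picked so the weighted sum is 0.

```python
import math


def perceptron(inputs, weights, bias=0.0):
    """Weighted sum of the inputs plus a bias, squashed by a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights and inputs, chosen only to show the call.
output = perceptron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(output)  # weighted sum is 0.5 - 0.5 = 0, and sigmoid(0) = 0.5
```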

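#### Backpropagation
Way to improve the prediction of the model by updating the model's weights based on the prediction error. It computes the gradient of the loss with respect to each weight using the chain rule, and is typically paired with Stochastic Gradient Descent, which uses those gradients to make the updates.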
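
To make the weight update concrete, here is a sketch for the smallest possible case: a single sigmoid node with one input, trained on one example with a squared-error loss. The loss function, the learning rate `alpha=0.5`, and the training example are assumptions for illustration; a real network repeats the same chain-rule step layer by layer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_step(w, b, x, y, alpha):
    """One gradient-descent update on loss = 0.5 * (y_hat - y)^2."""
    y_hat = sigmoid(w * x + b)                 # forward pass
    # Backward pass: chain rule through the loss and the sigmoid.
    d_z = (y_hat - y) * y_hat * (1.0 - y_hat)  # dLoss/dz
    w -= alpha * d_z * x                       # dLoss/dw = d_z * x
    b -= alpha * d_z                           # dLoss/db = d_z
    return w, b


w, b = 0.0, 0.0
x, y = 1.0, 1.0  # one hypothetical training example
losses = []
for _ in range(100):
    losses.append(0.5 * (sigmoid(w * x + b) - y) ** 2)
    w, b = train_step(w, b, x, y, alpha=0.5)

print(losses[0], losses[-1])  # the loss shrinks as the weights are updated
```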

#### Additional details
**Batch size**: number of samples processed before making a change to the model; must be at least one and at most the number of samples in the training set
<br>
**Epoch:** one full pass through the entire training set
<br>
**Stopping Criteria:** such as a max number of **iterations** or the loss function no longer decreasing by a meaningful amount; **alpha** (learning rate) sets the size of each update and so affects how quickly these criteria are met
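
The relationship between batch size, epochs, and the number of weight updates can be sketched with a bare training loop. The `batches` helper and the sizes used (10 samples, batch size 4, 3 epochs) are hypothetical; the actual weight update is left as a counter.

```python
def batches(samples, batch_size):
    """Yield consecutive batches; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


samples = list(range(10))  # stand-in for a training set of 10 rows
updates = 0
for epoch in range(3):     # 3 epochs = 3 full passes over the data
    for batch in batches(samples, batch_size=4):
        updates += 1       # a real loop would update the weights here

print(updates)  # 3 batches per epoch (sizes 4, 4, 2) over 3 epochs = 9
```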



### Error Metrics
**Accuracy, Precision, Recall**
<br>
**Confusion Matrix** contains number of TP, FP, TN, FN, which can be used to calculate accuracy, precision, recall
<br>
Accuracy: (TP + TN) / (P + N), where P = TP + FN is the number of actual positives and N = TN + FP is the number of actual negatives
<br>
Error: (FP + FN) / (P + N)
<br>
Recall, Sensitivity, True Positive Rate: TP / P
<br>
Specificity, True Negative Rate: TN / N
<br>
Precision: TP / (TP + FP)
<br>
F-score, F1, harmonic mean of precision and recall: 2 * precision * recall / (precision + recall)
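
The formulas above can be computed straight from the four confusion-matrix counts. This is a small sketch with a hypothetical function name and made-up counts; it does not guard against division by zero when a count is missing.

```python
def classification_metrics(tp, fp, tn, fn):
    """Metrics derived from the four confusion-matrix counts."""
    p, n = tp + fn, tn + fp  # actual positives and actual negatives
    precision = tp / (tp + fp)
    recall = tp / p
    return {
        "accuracy": (tp + tn) / (p + n),
        "error": (fp + fn) / (p + n),
        "recall": recall,
        "specificity": tn / n,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for illustration.
m = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is 85 / 100 = 0.85 and precision is 40 / 50 = 0.8, while recall is 40 / 45, which shows how the three can differ on the same predictions.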

**Mean Absolute Error:** average of absolute value of all errors

**Root Mean Squared Error:** square root of average of square of all errors
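
Both error measures can be written in a few lines; the actual and predicted values below are hypothetical.

```python
import math


def mae(actual, predicted):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: sqrt of the average squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))


# Hypothetical values: the errors are 1, 0, -2, 0.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 7.0]
print(mae(actual, predicted))   # (1 + 0 + 2 + 0) / 4 = 0.75
print(rmse(actual, predicted))  # sqrt((1 + 0 + 4 + 0) / 4)
```

Because the errors are squared before averaging, RMSE penalizes the single large error more heavily than MAE does.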

**Area under the ROC curve (AUC):** the probability that a randomly chosen positive instance receives a higher predicted score than a randomly chosen negative instance; an AUC of 0.5 means the model is no better than random chance; the ROC plot has FPR on the x-axis and TPR on the y-axis
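
The probabilistic reading of AUC suggests a direct, if slow, way to compute it: compare every positive with every negative. The sketch below does that, counting ties as half a win (a common convention, assumed here); the labels and scores are hypothetical.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A correctly ordered pair counts 1, a tie 0.5, a wrong order 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = positive) and predicted scores.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auc(labels, scores))  # 3 of the 4 pairs are ordered correctly: 0.75
```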

#### Goodness of Model
Use techniques like cross-validation and bootstrapping

**Cross-Validation:**
<br>
k-folds: data is randomly partitioned into k mutually exclusive sets (folds); each fold is used once as the test set while the remaining folds are used to train the model; the overall error is the average of the errors on the k test sets.
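
A rough sketch of how the folds might be generated, working with row indices only. The function name `k_fold_indices` and its `seed` parameter are hypothetical; training and scoring a model on each split is left out.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k mutually exclusive folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)       # random partition
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(n=10, k=5))
for train_idx, test_idx in splits:
    print(sorted(test_idx), "held out;", len(train_idx), "rows used to train")
```

Every row lands in exactly one test fold, so each row is used for testing once and for training k - 1 times.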


### Ensemble
**Simple combiners:** combining predictions by averaging or using other functions, such as min or max
<br>
**Bagging:** bootstrap the training set and combine the resulting predictions; each model in the ensemble is trained on a bootstrapped sample of the data (sampled with replacement), which diversifies the models. Average the predictions for regression, or use majority vote to decide on a final prediction for classification. Each model must use the same algorithm for prediction/classification
<br>
**Random Forests:** bagging of many decision trees; the model is determined by these parameters: number of trees; % of features sampled; % of instances used; max depth; min leaf size
<br>
**Boosting:** models are trained in sequence, with the data reweighted or resampled based on classification error; tuples are weighted so that each classifier focuses on the rows that were misclassified by prior models; the final prediction is a weighted combination of all predictions, where each model's weight reflects its accuracy
<br>
**Stacking:** uses predictions from multiple models to build a new model, which is used to make predictions on the test set; uses cross-validation
<br>
**Blending:** Similar to stacking except it uses a holdout (validation) set from the train set to make predictions; then the holdout set and predictions are used to build a model which is run on the test set 
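
The two mechanical pieces of bagging, bootstrap sampling and majority voting, can be sketched without committing to a particular base model. The helper names, the stand-in training rows, and the three model predictions below are all hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Sample len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Most common label among the models' predictions for one instance."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
rows = list(range(8))  # stand-in for the training rows
# One bootstrapped training set per model in the ensemble.
samples = [bootstrap_sample(rows, rng) for _ in range(3)]

# Hypothetical predictions from three models for a single instance.
final = majority_vote(["spam", "ham", "spam"])
print(final)  # spam
```

Each bootstrapped sample has the same size as the original data but usually repeats some rows and omits others, which is what makes the trained models differ from one another.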


Answers for the sample midterm questions from class: false, true, false