# **Classification**

## **Logistic Regression**

- Linear regression can also be tried for classification problems, but outliers pull the best fit line towards them, which shifts the decision boundary and causes misclassification.
- Hence, we go for logistic regression.
- Here, the positive class is assigned the label +1 and the negative class is assigned the label -1.
- For example: yes: +1 and no: -1.
- When we take a point from the positive class (yes) and calculate the distance between that point and the plane (y = mx + c), the distance should be positive because the point lies on the positive side of the plane. We use the signed distance here, not the absolute distance.
- Similarly, when we take a point from the negative class (no), the distance will be negative.
- To see if a point has been correctly classified, we multiply its label (+1 or -1) by its signed distance from the best fit line.
- If the resulting value > 0, then it is correctly classified.
- If the resulting value < 0, then it is incorrectly classified.
- The cost function, i.e. the sum over all points of the label multiplied by the signed distance, is maximized to get the best fit line.
- When outliers are present, a single far-away point contributes a very large distance and can skew this sum.
- To avoid that, we bring in the sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$
- Here z is the value given by the equation of the best fit line.
- The sigmoid function squashes the "resulting value" into the range 0 to 1 (it equals 0.5 when z = 0).
- Hence, the effect of outliers is minimized.
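
As a rough illustration of the points above, the following sketch checks classification with the label-times-score product and then applies the sigmoid. The parameters `m` and `c`, the sample points and the helper names are made up for the example. It also uses the raw score `z = m*x + c`, which has the same sign as the signed distance but is not normalized.

```python
import math

def score(x, m, c):
    """Raw score z = m*x + c: positive on one side of the boundary, negative on the other."""
    return m * x + c

def sigmoid(z):
    """Squash any real number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-z))

def is_correct(label, z):
    """label is +1 or -1; label * z is positive when the point is on its own side."""
    return label * z > 0

# Hypothetical line parameters and (x, label) points, chosen only for illustration.
m, c = 2.0, -1.0
points = [(3.0, +1), (-2.0, -1), (0.2, +1), (50.0, +1)]

for x, label in points:
    z = score(x, m, c)
    print(f"x={x:5.1f} label={label:+d} z={z:6.1f} "
          f"correct={is_correct(label, z)} sigmoid={sigmoid(z):.4f}")
```

The point at x = 50 plays the role of an outlier: its raw score is far larger than the others, while its sigmoid value cannot exceed 1.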

### **Multiclass logistic regression**

- Whenever we have more than two classes (say A, B and C), the model first treats class A as positive and classes B and C as negative. It repeats this with each class in turn as the positive class, resulting in three binary models M1, M2 and M3.
- Now, whenever a new data point is given to the model, each of the three models returns the probability of that point belonging to its positive class.
- The class with the highest probability is the predicted class for that point.
- This one-vs-rest approach can be implemented by using the "ovr" value in the multiclass attribute.
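
A minimal sketch of the one-vs-rest idea described above, using only the standard library. The weights and biases are hand-picked stand-ins for trained models M1, M2 and M3, and `ovr_predict` is a hypothetical helper name, not a library function.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical (weights, bias) per class, standing in for the trained
# binary models M1 (A vs rest), M2 (B vs rest) and M3 (C vs rest).
models = {
    "A": ([1.5, -0.5], 0.0),
    "B": ([-1.0, 1.0], 0.2),
    "C": ([-0.5, -1.0], -0.1),
}

def ovr_predict(x, models):
    """Score x with every binary model and return the class with the highest score."""
    scores = {}
    for label, (weights, bias) in models.items():
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        scores[label] = sigmoid(z)
    best = max(scores, key=scores.get)
    return best, scores

label, scores = ovr_predict([2.0, 0.0], models)
print(label, {k: round(v, 3) for k, v in scores.items()})
```

Because each model is trained independently, the three scores do not have to add up to 1; only their ordering matters for the prediction.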

## **Decision Tree**

- A decision tree is another important classifier that lets us make decisions based on criteria.
- The tree is built by splitting the data on features.
- To determine which feature gives the best split, we need to understand what a good split is.
- Ideally, after splitting on a feature, each resulting node should contain values of only one class (for example, one node holds all the yes values and none of the no values).
- That is called a pure split.

### **Entropy**
- It is a measure of the degree of randomness (impurity) in a node
- For two classes (with log base 2), the value is between 0 and 1
- 0 means that the randomness is reduced to none and hence it is a good and pure split
- 1 means that there is maximum randomness and hence it is an impure split, where after the split there is an equal number of values in both categories (3 yes, 3 no)
- The aim is to reduce the entropy
- $H(S) = -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-}$, where $p_{+}$ and $p_{-}$ are the proportions of positive and negative values in the node
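
The entropy formula can be sketched in a few lines. The function name `entropy` and the yes/no counts are illustrative, and the code assumes log base 2 and the usual convention that a class with zero values contributes nothing.

```python
import math

def entropy(pos, neg):
    """Binary entropy (base 2) of a node holding `pos` yes values and `neg` no values."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:  # an empty class contributes 0 to the sum
            p = count / total
            h -= p * math.log2(p)
    return h

print(entropy(3, 3))  # equal mix of yes and no: maximum randomness
print(entropy(6, 0))  # only yes values: pure node
print(round(entropy(5, 1), 3))
```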

### **Information Gain**
- $Gain(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)$
- $H(S)$: entropy of the node being split (the root node for the first split)
- $S_v$: the values (yes and no) that fall into child node v after splitting on feature A
- $S$: all the values (yes and no) in the node being split
- $H(S_v)$: entropy of child node v
- Information gain measures how much a split on feature A reduces the entropy. It is calculated for every candidate split and the values are compared.
- The split with the highest information gain is the best one for the decision tree.
- $H(S)$ is never smaller than the weighted average entropy of the child nodes, so the information gain is never negative.
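
To make the gain formula concrete, here is a small sketch that takes the (yes, no) counts of a parent node and of its child nodes. The function names and the counts are assumptions made for the example.

```python
import math

def entropy(counts):
    """Entropy (base 2) of a node given its per-class counts, e.g. (yes, no)."""
    total = sum(counts)
    h = 0.0
    for count in counts:
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def information_gain(parent, children):
    """H(parent) minus the size-weighted average entropy of the child nodes."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Parent node with 9 yes / 5 no, split into children with 6/2 and 3/3.
print(round(information_gain((9, 5), [(6, 2), (3, 3)]), 4))
# A perfectly pure split removes all of the parent's entropy.
print(information_gain((3, 3), [(3, 0), (0, 3)]))
```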

### **Gini Impurity Index**
- $G = 1 - \sum_i p_i^2 = 1 - (p_{+}^2 + p_{-}^2)$
- Entropy is more expensive to compute because calculating the log values takes more time; the Gini index avoids the logarithm.
- In the Gini index, the maximum value is 0.5, reached when a node holds equal numbers of values from the two categories; it decreases as the node becomes purer.
- Entropy, however, goes up to 1 for the same equal mix and then decreases.
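
The comparison between the two impurity measures can be checked numerically. This is only a sketch for the two-class case, with made-up counts and helper names.

```python
import math

def gini(pos, neg):
    """Gini impurity of a node holding `pos` yes values and `neg` no values."""
    total = pos + neg
    return 1.0 - ((pos / total) ** 2 + (neg / total) ** 2)

def entropy(pos, neg):
    """Binary entropy (base 2), shown next to Gini for comparison."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

# From a pure node to an equal mix: Gini tops out at 0.5, entropy at 1.
for pos, neg in [(6, 0), (5, 1), (4, 2), (3, 3)]:
    print(pos, neg, round(gini(pos, neg), 3), round(entropy(pos, neg), 3))
```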

### **Numerical Variables Split**
- When a feature has numerical values, the algorithm first sorts those values
- Then, it takes candidate threshold values, each of which splits the data into the values below and above it
- It evaluates the split for each threshold value, finds the one with the greatest information gain and chooses it.
- This is why building a decision tree can be computationally expensive, especially when a numerical feature has many distinct values.
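
The threshold search can be sketched as follows. Using the midpoint between neighbouring sorted values as the candidate threshold is an assumption of this example (the notes do not say how thresholds are chosen), and `best_threshold` is a hypothetical helper name.

```python
import math

def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    total = len(labels)
    h = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        h -= p * math.log2(p)
    return h

def best_threshold(values, labels):
    """Try a threshold between each pair of neighbouring sorted values
    and return the (threshold, information gain) with the highest gain."""
    pairs = sorted(zip(values, labels))
    parent = entropy([lab for _, lab in pairs])
    best_t, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal values cannot be separated by a threshold
        t = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint: an assumption
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = parent - weighted
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Made-up numerical feature with yes/no labels.
values = [25, 18, 40, 35, 52, 61]
labels = ["no", "no", "yes", "no", "yes", "yes"]
print(best_threshold(values, labels))
```

Every candidate threshold requires a fresh entropy calculation over both sides, which is where the cost of numerical features comes from.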

## **Performance metrics for classification**

- As a rule of thumb, a dataset is considered imbalanced when the class ratio is more skewed than 70-30
- In balanced datasets, we can use the accuracy score
- In imbalanced datasets, we may use the precision, recall or F1 score, depending on which measure is important to us

### **Confusion Matrix**

- The confusion matrix shows the Type 1 errors (false positives) and the Type 2 errors (false negatives)
- Our goal is to minimize both of these errors for a better model


| Predicted ↓ / Actual → | Actual Positive (P) | Actual Negative (N) |
|------------------------|---------------------|---------------------|
| Predicted Positive (P) | True Positive (TP)  | False Positive (FP) |
| Predicted Negative (N) | False Negative (FN) | True Negative (TN)  |
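
The four cells of the table can be counted directly from actual and predicted labels. This is a minimal sketch; the function name, the label strings and the sample data are assumptions for illustration.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Return (TP, FP, FN, TN), treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1  # predicted positive, actually negative (Type 1 error)
        elif a == positive:
            fn += 1  # predicted negative, actually positive (Type 2 error)
        else:
            tn += 1
    return tp, fp, fn, tn

actual    = ["yes", "yes", "no", "no",  "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]
print(confusion_counts(actual, predicted))
```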



### **Accuracy**
- $\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$

### **Precision (Positive predictive value)**

- Answers the question: Among all the values you predicted as positive, how many are actually positive?
- $\text{Precision} = \frac{TP}{TP + FP}$

### **Recall (Sensitivity or True positive rate)**

- Answers the question: Among all the positive values in the dataset, how many could you predict as positive?
- $\text{Recall} = \frac{TP}{TP + FN}$
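
The accuracy, precision and recall formulas can be compared on a single set of counts. The counts below are invented to mimic an imbalanced dataset, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    # Convention for this sketch: 0.0 when nothing was predicted positive.
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # Convention for this sketch: 0.0 when there are no actual positives.
    return tp / (tp + fn) if (tp + fn) else 0.0

# Made-up imbalanced case: 10 actual positives among 1000 values.
tp, fp, fn, tn = 5, 10, 5, 980
print("accuracy ", round(accuracy(tp, fp, fn, tn), 3))
print("precision", round(precision(tp, fp), 3))
print("recall   ", round(recall(tp, fn), 3))
```

With these counts the accuracy is very high even though only half of the actual positives are found, which is why accuracy alone is a poor guide on imbalanced data.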

### **F-Beta Score (Weighted harmonic mean of precision and recall)**

- It is designed to give more weight to either the precision or the recall score, depending on which is more important
- $F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \times \text{Recall}}{(\beta^2 \times \text{Precision}) + \text{Recall}}$
- The Beta value is 1, and the score is called the F1 score, if we give equal importance to both precision and recall
- The Beta value should be **greater than** 1, if we want to give more importance to **recall**
- The Beta value should be **less than** 1, if we want to give more importance to **precision**
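
A short sketch of the F-Beta calculation, with the squared Beta in both places where it appears. The precision and recall values are made up, and returning 0.0 for a zero denominator is a convention of this example.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0  # convention for this sketch
    return (1 + b2) * precision * recall / denominator

p, r = 0.5, 0.8  # hypothetical scores where recall is the stronger one
print("F0.5:", round(f_beta(p, r, beta=0.5), 4))  # leans towards precision
print("F1:  ", round(f_beta(p, r, beta=1.0), 4))
print("F2:  ", round(f_beta(p, r, beta=2.0), 4))  # leans towards recall
```

Since recall is higher than precision in this example, the score rises as Beta grows.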

### **ROC Curve (Receiver operating characteristic)**

- The ROC curve is mostly used as a metric for binary classification problems.
- The ROC curve is a graph in which the true positive rate (TPR) is plotted against the false positive rate (FPR), with one point for each classification threshold.
- The closer the curve gets to the top-left corner (TPR near 1 on the y-axis while FPR stays near 0), the better the model has performed.
- After generating the graph, we ask a domain expert: "Which value (TPR or FPR) should I give more weight to, and by how much?"
- Based on that advice, we can then choose the threshold value which we actually want and go on to build the model.
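
To show how the curve is built from thresholds, the sketch below computes one (FPR, TPR) point per distinct score. The labels (1 for positive, 0 for negative) and the scores are invented, and the code assumes both classes are present.

```python
def roc_points(labels, scores):
    """Return a list of (FPR, TPR) points, one per threshold, starting at (0, 0).

    A value is predicted positive when its score is >= the threshold.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
for fpr, tpr in roc_points(labels, scores):
    print(round(fpr, 2), round(tpr, 2))
```

Lowering the threshold moves along the curve from (0, 0) towards (1, 1), trading a higher TPR for a higher FPR.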

### **AUC (Area under the curve)**

- The area under the ROC curve can be read as the probability that the model gives a randomly chosen positive value a higher score than a randomly chosen negative value.
- The area under this curve should be more than 0.5; a value of 0.5 is no better than random guessing (with two classes, a random guess is right about half the time).
- As the AUC value increases towards 1, the model performs better.
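
One way to compute the AUC without drawing the curve is to count how often a positive value is scored above a negative one, with ties counting as half. This pairwise count equals the area under the ROC curve. The function name and data are illustrative, and the code assumes both classes are present.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 1, 0, 0]
print(round(auc(labels, [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]), 4))  # mostly well ranked
print(auc(labels, [0.5] * 6))  # identical scores: same as random guessing
```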

### **Multiclass classification metrics**

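- The diagonal values of the confusion matrix are the true positives of each class; every off-diagonal value is a misclassification, counted as a false positive for the predicted class and a false negative for the actual class.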
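
A small sketch of reading per-class precision and recall off a multiclass confusion matrix. It assumes rows are predicted classes and columns are actual classes, the same orientation as the binary table in these notes; the matrix values and the helper name are made up.

```python
def per_class_metrics(matrix, classes):
    """Return {class: (precision, recall)} for a square confusion matrix
    whose rows are predicted classes and whose columns are actual classes."""
    result = {}
    for k, name in enumerate(classes):
        tp = matrix[k][k]                        # diagonal cell
        fp = sum(matrix[k]) - tp                 # rest of row k: predicted k, actually another class
        fn = sum(row[k] for row in matrix) - tp  # rest of column k: actually k, predicted another class
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        result[name] = (precision, recall)
    return result

# Hypothetical 3-class matrix (rows: predicted A, B, C; columns: actual A, B, C).
matrix = [
    [5, 1, 0],
    [2, 6, 1],
    [0, 1, 4],
]
for name, (p, r) in per_class_metrics(matrix, ["A", "B", "C"]).items():
    print(name, round(p, 3), round(r, 3))
```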




