# Credit Card fraud detection

### Steps
- ##### Load data/Data understanding
    > Here, you need to load the data and understand the features present in it. This would help you choose the features that you will need for your final model.
- ##### Select features
    > Select features other than the PCA components if necessary, and create a subset.
- ##### EDA
    - Since all the features are Gaussian, there is no need to do z-scaling
    > Usually, in this step, you need to perform univariate and bivariate analyses of the data, followed by feature transformations, if necessary. For the current data set, because Gaussian variables are used, you do not need to perform Z-scaling. However, you can check if there is any skewness in the data and try to mitigate it, as it might cause problems during the model building phase.
    
    - check for skewness 
        - mitigate skewness if present

    > Can you think of the reason why skewness can be an issue while modelling? Well, some of the data points in a skewed distribution towards the tail may act as outliers for the machine learning models that are sensitive to outliers; hence, this may cause a problem. Also, if the values of any independent feature are skewed, depending on the model, __skewness__ may affect model assumptions or may impair the interpretation of feature importance.
- ##### Train - Test split
    - K - fold cross validation
        - The k value needs to be chosen intelligently because the target variable is highly imbalanced: only about 500 of the total records are fraud cases. With __k=10__, each test fold would hold barely 50 fraud entries. A smaller k puts more fraud cases in each test fold but leaves less data for training, so the accuracy might not be as good.
    > Now, you are familiar with the train/test split that you can perform to check the performance of your models with unseen data. Here, for validation, you can use the k-fold cross-validation method. You need to choose an appropriate k value so that the minority class is correctly represented in the test folds.
- ##### Model building/Hyper params tuning
    > This is the final step at which you can try different models and fine-tune their hyperparameters until you get the desired level of performance.
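
The skewness check from the EDA step can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's actual code: the `amount` array is a hypothetical stand-in for a skewed feature, and the Yeo-Johnson `PowerTransformer` is only one of several possible ways to mitigate skewness.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Hypothetical stand-in for a right-skewed feature such as a transaction amount.
amount = rng.lognormal(mean=3.0, sigma=1.0, size=5000).reshape(-1, 1)

skew_before = skew(amount.ravel())

# Yeo-Johnson also works for zero and negative values; the transformer
# standardizes its output by default.
pt = PowerTransformer(method="yeo-johnson")
amount_transformed = pt.fit_transform(amount)
skew_after = skew(amount_transformed.ravel())

print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

In a real pipeline the transformer would be fitted on the training folds only and then applied to the test fold, so that no information leaks from the test data.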

### Class imbalance

* Undersampling
    * Using only 500 0s and 500 1s to balance the classes
* Randomly/uniformly oversampling the minority class
    * Repeating some of the minority class entries, but this only exaggerates the existing data without adding new information
* SMOTE: Synthetic Minority Over-Sampling Technique
    * Generate new data points that lie vectorially between two data points that belong to the minority class. These data points are randomly chosen and then assigned to the minority class. This method uses the K-nearest neighbours to create random synthetic samples
* ADAptive SYNthetic (ADASYN)
    *  This is similar to SMOTE, with a minor change in the generation of synthetic sample points for minority data points. For a particular data point, the number of synthetic samples that it will add will have a density distribution, whereas for SMOTE, the distribution will be uniform. The aim here is to create synthetic data for minority examples that are harder to learn rather than easier ones. 

- To sum it up, ADASYN offers the following advantages:

    - It lowers the bias introduced by the class imbalance.
    - It adaptively shifts the classification decision boundary towards difficult examples.
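
To make the SMOTE idea of "points that lie between two minority points" concrete, here is a simplified sketch of the interpolation step. It is not the `imblearn` implementation; the function name `smote_like_oversample` and the toy data are assumptions made for illustration only.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_oversample(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because a point is normally its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbour_idx = nn.kneighbors(X_minority)

    base = rng.integers(0, len(X_minority), size=n_new)  # randomly chosen minority rows
    column = rng.integers(1, k + 1, size=n_new)          # skip column 0 (the row itself)
    neighbour = neighbour_idx[base, column]
    gap = rng.random((n_new, 1))                         # position on the segment, in [0, 1)
    return X_minority[base] + gap * (X_minority[neighbour] - X_minority[base])


rng = np.random.default_rng(42)
X_min = rng.normal(size=(20, 3))  # hypothetical minority-class rows
X_syn = smote_like_oversample(X_min, n_new=80)
print(X_syn.shape)  # (80, 3)
```

Each synthetic row is a convex combination of two real minority rows, so it never leaves the region spanned by the minority class. ADASYN differs mainly in how many synthetic rows it generates per minority point, which this sketch does not model.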

In [2]:
"""
Example:

from imblearn.over_sampling import SMOTE, ADASYN
X_smote, y_smote = SMOTE().fit_resample(X, y)
X_ada, y_ada = ADASYN().fit_resample(X, y)
"""



### KNN

* K-nearest neighbour is a simple, supervised machine learning algorithm that is used for both classification and regression tasks. It performs these tasks by identifying the neighbours that are the nearest to a data point. For classification tasks, it takes the majority vote and for regression tasks it takes the average value from the neighbours. 

* The k in KNN specifies the number of neighbours that the algorithm should focus on. For example, if k = 3 then for a particular test data, the algorithm observes the three nearest neighbours and takes the majority vote from them. Depending on the majority of the classes from the three nearby points, the algorithm classifies the test data.

* For binary classification, the k value should be an odd number so that the majority vote among the nearest neighbours cannot end in a tie.


* If k is very low (e.g. k = 1) -> overfitting
* If k is very high (e.g. k = 101) -> underfitting
* Good range: k = 3 to 11
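
The effect of k can be checked empirically. The sketch below compares several k values with cross-validation on a synthetic data set (the data, the scaling step, and the ROC AUC scoring are assumptions for illustration, not part of the original notebook).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; replace with the real features and labels.
X, y = make_classification(n_samples=600, n_features=8, n_informative=4, random_state=0)

scores = {}
for k in (1, 3, 5, 7, 9, 11, 101):
    # K-NN is distance based, so the features are scaled inside the pipeline.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores[k] = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

for k, auc in scores.items():
    print(f"k={k:>3}: mean ROC AUC = {auc:.3f}")
```

The k with the best cross-validated score depends on the data, so the 3 to 11 range is a starting point rather than a rule.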

<a href="https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html">K-NN doc reference</a>

### XGBoost

* eXtreme Gradient Boosting 
    * It is a decision tree-based ensemble ML algorithm that uses a gradient boosting framework. It is a highly effective and widely used machine learning method, particularly for structured (tabular) data

### Model selection

* Logistic Regression
    * Pros:
        - Highly interpretable
    * Cons:
        - Works well only when the classes are (close to) linearly separable
        - Performs poorly when the classes overlap heavily
* K-NN
    * Pros:
        - Very interpretable
    * Cons:
        - Computation cost is very high for large data: to find even one neighbour of a test point, we need to calculate the distance from all the points in the dataset
* Decision Trees:
    * Pros:
        - Can be interpreted in terms of flow charts
    * Cons:
        - Trees tend to overfit
        - Working with large data becomes challenging because training time grows quickly with the size of the data set

- If the data is structured -> use Random Forests / XGBoost
- If the data is unstructured -> use Neural Networks / LSTM

Use stratified k-fold cross-validation so that each fold keeps the same class ratio as the full data set.
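
A minimal sketch of stratified splitting, assuming a hypothetical label vector with 50 fraud cases in 10,000 rows (these counts are illustrative, not the real data set):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 10,000 transactions, 50 of them fraud (0.5%).
y = np.zeros(10_000, dtype=int)
y[:50] = 1
X = np.zeros((len(y), 1))  # the features do not matter for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fraud_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
print(fraud_per_fold)  # [10, 10, 10, 10, 10]
```

Because the split is stratified, the 50 fraud cases are spread evenly over the five test folds instead of landing in folds by chance.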


### Hyper Parameter tuning

Use grid search (`GridSearchCV`) and randomized search (`RandomizedSearchCV`) for hyperparameter tuning.
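
Both search strategies can be sketched with scikit-learn as below. The random forest, the parameter ranges, and the synthetic data are assumptions chosen to keep the example small; they are not tuned recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedKFold

# Synthetic imbalanced data (about 10% positives) as a stand-in.
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Grid search: tries every combination in the grid (3 x 2 = 6 candidates).
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": [3, 5, None], "min_samples_leaf": [1, 5]},
    scoring="roc_auc",
    cv=cv,
)
grid.fit(X, y)

# Randomized search: samples a fixed number of candidates from distributions.
rand = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_distributions={"max_depth": randint(2, 10), "min_samples_leaf": randint(1, 10)},
    n_iter=5,
    scoring="roc_auc",
    cv=cv,
    random_state=0,
)
rand.fit(X, y)

print("grid:", grid.best_params_, round(grid.best_score_, 3))
print("random:", rand.best_params_, round(rand.best_score_, 3))
```

Grid search is exhaustive and grows quickly with the number of parameters, while randomized search caps the cost at `n_iter` candidates, which tends to matter once the search space gets large.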

### Model evaluation

* Accuracy is not always a good metric, especially when the classes are imbalanced
* The confusion matrix is not always the best metric because it depends on the classification threshold
* The same goes for precision, recall, and F1 score
* AUC-ROC curve -> TPR vs FPR
    - The better the ROC curve, the better the model
    - The larger the area under the curve, the better the model
    - The best threshold is the one at which the TPR is high and the FPR is low, i.e., misclassifications are low.
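
The ROC points above can be sketched as follows. The data and the logistic regression model are placeholders, and picking the threshold that maximizes TPR minus FPR (Youden's J statistic) is one possible reading of "TPR high and FPR low", assumed here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 5% positives) as a stand-in for the fraud data.
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # predicted probability of the positive class

auc = roc_auc_score(y_test, proba)
fpr, tpr, thresholds = roc_curve(y_test, proba)

# Youden's J: the threshold where the gap between TPR and FPR is largest.
best = np.argmax(tpr - fpr)
print(f"AUC = {auc:.3f}")
print(f"threshold = {thresholds[best]:.3f}, TPR = {tpr[best]:.3f}, FPR = {fpr[best]:.3f}")
```

In fraud detection the cost of a missed fraud usually differs from the cost of a false alarm, so the final threshold may reasonably be shifted away from this point.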