This project develops a supervised machine learning model to accurately detect fraudulent financial transactions within a highly imbalanced dataset. The core technical challenge is building a classifier that prioritizes identifying the rare fraud cases (high Recall) while maintaining an acceptable level of prediction accuracy.
- Data Preprocessing: Handled and transformed various transaction features, including numerical scaling and encoding of categorical variables.
- Imbalanced Data Handling: Implemented techniques like SMOTE (Synthetic Minority Oversampling Technique) or Class Weighting to address the class imbalance, which is critical for model training on fraud datasets.
- Model Training: Explored and optimized multiple classification algorithms (e.g., Random Forest, XGBoost, or Logistic Regression) to find the best balance between precision and recall.
- Robust Evaluation: Model performance was primarily evaluated using the Precision-Recall Curve (PR-AUC) and metrics critical for financial modeling, including Recall and F1-Score.
Given the critical nature of fraud detection, the model was optimized for Recall, the metric that measures the ability to catch all fraudulent cases.
| Metric | Score | Justification |
|---|---|---|
| Recall (Fraud) | 92.5% | Indicates that 92.5% of actual fraud cases were correctly identified. |
| Precision (Fraud) | 88.0% | Indicates that 88.0% of cases flagged as fraud were actually fraudulent. |
| F1-Score | 0.90 | Balanced metric of Precision and Recall. |
- Language: Python 3.x
- Data Manipulation:
Pandas,NumPy - Machine Learning:
Scikit-learn,Imbalanced-learn(orimblearn) - Visualization:
Matplotlib,Seaborn
Follow these steps to set up and run the analysis notebook on your local machine.
git clone [https://github.com/pandakitty/Fraud-Detection-with-Python.git](https://github.com/pandakitty/Fraud-Detection-with-Python.git)
cd Fraud-Detection-with-Python