# Systems Report 
by Jose Márquez Jaramillo (jmarqu20)

## Data, Data Pipelines, and Model

### Data Description
The data consists of transaction details such as amount, date/time, merchant details, customer behavior patterns, and historical fraud reports. Transactions arrive in real time for inference, while historical logs are batch-processed for model training. The system accepts data as `.csv`, `.json`, `.parquet`, or `.xlsx` files. Each transaction record contains the following columns:

| **Column**            | **Description**                                                              |
|-----------------------|------------------------------------------------------------------------------|
| trans_date_trans_time | Transaction date and time                                                    |
| cc_num                | Unique customer number/ID                                                    |
| merchant              | Merchant/vendor name for transaction                                         |
| category              | Category of purchase (e.g., entertainment, gas_transport, food_dining, etc.) |
| amt                   | Total amount of transaction                                                  |
| first                 | Customer first name                                                          |
| last                  | Customer last name                                                           |
| sex                   | Customer's sex                                                               |
| street                | Street address of customer                                                   |
| city                  | City address of customer                                                     |
| state                 | State of customer residency                                                  |
| zip                   | Zip code of customer                                                         |
| lat                   | Latitude coordinate of customer address                                      |
| long                  | Longitude coordinate of customer address                                     |
| city_pop              | Population of city                                                           |
| job                   | Customer's employment title                                                  |
| dob                   | Customer's date of birth                                                     |
| trans_num             | Unique transaction number                                                    |
| unix_time             | Timestamp of transaction                                                     |
| merch_lat             | Latitude of merchant/vendor                                                  |
| merch_long            | Longitude of merchant/vendor                                                 |
| is_fraud              | 1=fraudulent transaction, 0=non-fraudulent transaction                       |
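
As a minimal sketch of how an incoming sample could be checked against the table above, the snippet below lists the columns missing from a transaction dictionary. The column names come from the table; the function name `missing_columns` and the `require_label` flag are hypothetical and not part of the system's modules.

```python
from __future__ import annotations

# Column names copied from the table above. The validator is illustrative only.
REQUIRED_COLUMNS = (
    "trans_date_trans_time", "cc_num", "merchant", "category", "amt",
    "first", "last", "sex", "street", "city", "state", "zip", "lat", "long",
    "city_pop", "job", "dob", "trans_num", "unix_time", "merch_lat", "merch_long",
)


def missing_columns(sample: dict, require_label: bool = False) -> list[str]:
    """Return the expected columns that are absent from a transaction dict.

    `is_fraud` is only required when the sample is meant for training,
    since an inference sample has no label yet.
    """
    expected = REQUIRED_COLUMNS + (("is_fraud",) if require_label else ())
    return [col for col in expected if col not in sample]
```

A sample passed for inference would be checked with the default arguments, while rows ingested for training would use `require_label=True`.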

## System Design
The proposed fraud detection system is designed to meet the functional and non-functional requirements specified by SecureBank. The system consists of multiple components, each designed to handle specific aspects such as data ingestion, sampling, model updates, and model evaluation. The overall architecture is modular, allowing for scalability and easy updates.
### Components of the system
1. **Data Ingestion Module**: The `data_engineering.py` module handles the import of transaction data in real time from bank systems.
2. **Data Processing Pipeline**: The `feature_engineering.py` module includes data cleaning, transformation, and feature engineering.
3. **Machine Learning Models**: The `models` directory in the system contains baseline, pre-tuned implementations of a Random Forest classifier (`random_forest_classifier.sav`), a Logistic Regression classifier (`logistic_regression_classifier.sav`), and a Support Vector Classifier (`support_vector_classifier.sav`). Each of these pre-tuned models can be retrained after new data is ingested.
4. **Notification System**: The system generates predictions, which can then be sent to downstream notification systems for customers or fraud analysts in the bank.
### Process Flow Diagram 

<img src="system_pipeline.png" width="1014.75" height="518.5" />

The process flow diagram above shows the endpoints and functionality provided by the system. For each of the actions below, the system generates and stores a log detailing the most important aspects of the action.
1. **Inference**

    a. Inference is requested through the Flask application: the user passes a sample in an API call and can optionally select the model to be used.

    b. The system then uses `pipeline.py` to store the sample, call `feature_engineering.py` for transformations, and call `models.py` to generate a prediction.

    c. The prediction result is returned to the user and is also stored in the system.
    
2. **Data Ingestion**

    a. The user connects to the system through the Flask interface and passes a data file. The system accepts `.json`, `.csv`, `.parquet`, and `.xlsx` files.

    b. Through the `pipeline.py` module, the system calls the `data_engineering.py` module.

    c. The `data_engineering.py` module then cleans the data and stores it in the `data/` directory.

    d. The system pipeline instance is then updated to include the newly ingested data.

3. **Dataset creation**

    a. At startup, a pipeline instance from `deployment.py` is created. This makes any data file from the `data/` directory available for the system.

    b. The user can request the generation of a dataset by calling the Flask application and passing the desired dataset version and sample size.

    c. Through the `deployment.py` module and its corresponding pipeline instance, the `dataset.py` module is called.

    d. The `dataset.py` module samples from the pipeline instance, which includes all the files currently stored in the `data/` directory.

    e. The resulting dataset is stored in the `datasets/` directory in the system. 

4. **Training**

    a. The user can pass the model to be trained, the dataset, and the sample size. If no dataset is passed, a new dataset is generated automatically, as described under Dataset creation above.

    b. The pipeline from `deployment.py` uses `data_engineering.py` and `feature_engineering.py` to load and transform the dataset.

    c. The pipeline then uses the `model.py` module to load the baseline model and train it on the dataset.

    d. The resulting trained model is stored in the `models/` directory.

    e. The pipeline calls `metrics.py` to evaluate the model's performance, and the resulting metrics are stored in the system.

    There are three pre-tuned models available in the system. The best results came from the Random Forest Classifier (`random_forest_classifier.sav`); a Support Vector Machine Classifier (`support_vector_classifier.sav`) and a Logistic Regression Classifier (`logistic_regression_classifier.sav`) are also available. For details of this analysis, see `model_performance_and_selection.ipynb`.
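
The dataset creation step (step 3 above) draws a sample of a requested size from the ingested data, and this report specifies that the draw is stratified by `is_fraud`. A minimal, dependency-free sketch of that idea follows. It is not the code in `dataset.py`; the function name, its parameters, and the rule of keeping at least one row per class are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(rows, sample_size, label_key="is_fraud", seed=0):
    """Draw roughly `sample_size` rows while preserving label proportions.

    `rows` is a list of dicts. Each class contributes the same fraction of
    its rows, so the fraud rate of the sample matches that of the source.
    """
    if not rows:
        return []
    rng = random.Random(seed)

    # Group rows by their label value (0 or 1 for is_fraud).
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    fraction = sample_size / len(rows)
    sample = []
    for group in by_label.values():
        # Assumption: keep at least one row per class so rare fraud
        # cases are never dropped entirely from a small sample.
        k = min(len(group), max(1, round(len(group) * fraction)))
        sample.extend(rng.sample(group, k))

    rng.shuffle(sample)
    return sample
```

Because of rounding and the one-row minimum, the result can differ from `sample_size` by a few rows when classes are very small.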

### Data Pipeline

The data pipeline for the fraud detection system is designed to efficiently process large volumes of transaction data in real time, transforming raw data into a format suitable for analysis and decision-making. It consists of several key stages, each addressing specific aspects of data management and preparation:

1. **Data Ingestion**: The initial stage collects transaction data as it occurs, using APIs that interface directly with the bank's transaction processing systems. This real-time ingestion ensures that the data used for fraud detection reflects the latest transactions made by bank customers. In addition, batch ingestion processes import historical transaction data at regular intervals. This historical data is used to train the machine learning models and provides a view of customer behavior over time.

    The pipeline integrates new data as it becomes available, so the models can be retrained on up-to-date data. The system automatically triggers retraining cycles when significant new data is added or when the model's performance degrades below a predefined threshold.

    Through continuous monitoring and updating of these processes, the pipeline keeps the data used for detecting fraudulent activity reliable, relevant, and structured for analysis.

2. **Data Storage and Access**: Once ingested, the data is stored in the `data/` directory of the system. The `data/` directory is important as it serves as the main database for the system. In the directory, all of the ingested data from the bank is made available for sampling through the creation of datasets.

3. **Creation of Datasets**: Sampling is an important mechanism for the online training and evaluation of models. Datasets are created using stratified sampling from the data stored in the `data/` directory. The data is stratified by the `is_fraud` feature. This way we ensure that the data is representative and maintains fraudulent observations for model update and training. The created datasets are stored in the `datasets/` directory.

4. **Data Cleaning and Validation**: Once a dataset has been created, the next step is cleaning and validating it to ensure accuracy and consistency. This stage addresses issues such as missing values, duplicate records, and erroneous entries. Data validation rules are applied to ensure that all incoming data meets the required formats and standards, crucial for maintaining the integrity of the data used in subsequent analyses.

5. **Feature Engineering**: Feature engineering transforms raw datasets into meaningful attributes that significantly enhance the model's ability to detect fraud. This involves creating new variables from existing data that better capture the nuances and patterns of fraudulent transactions. Techniques such as aggregation (e.g., total transactions in the last hour), ratio calculations (e.g., amount to average transaction size), and time-based features (e.g., transactions in unusual hours) are employed. These features help in identifying outliers or unusual patterns that are often indicative of fraudulent activity.

6. **Data Normalization**: Data normalization standardizes the range of the independent variables (features). This is particularly important when features have different units or vary widely in scale, because scale-sensitive algorithms such as logistic regression and support vector classifiers can perform poorly on unscaled inputs, whereas tree-based models such as Random Forest are largely unaffected. Techniques such as Min-Max scaling or Z-score normalization are used depending on the distribution of the data.

7. **Model Updates and Training**: Baseline models, as indicated in the previous section, can be updated by training them using data ingested within the system. When training, the system either uses a specified dataset or creates a new dataset. An important consideration while training is the use of SMOTE sampling during data transformations. SMOTE (Synthetic Minority Over-sampling Technique) is particularly useful for dealing with imbalanced datasets in binary classification for several reasons:
    - **Mitigating Imbalance**: In many real-world scenarios, datasets are imbalanced, meaning one class significantly outnumbers the other. For instance, in fraud detection, legitimate transactions far outnumber fraudulent ones. Such imbalance can bias the model toward the majority class, leading to poor classification performance on the minority class. SMOTE helps by creating synthetic samples from the minority class, thus balancing the class distribution.

    - **Improving Classifier Performance**: By synthesizing new examples in the minority class, SMOTE can help improve the decision boundaries of a classifier. Without enough examples from the minority class, a classifier might overfit to the majority class and underperform in predicting the minority class accurately. SMOTE generates new samples that help the classifier learn more generalized features of the minority class.

    - **Data Augmentation**: SMOTE acts as a form of data augmentation for the minority class. It selects a minority sample, picks one of its nearest minority-class neighbors in feature space, and creates a new sample at a random point on the line segment between the two. This adds diversity within the minority class, which helps the model learn more robust features.

    - **Versatility and Ease of Integration**: SMOTE can be easily integrated with virtually any classifier and is compatible with a wide range of standard data preprocessing pipelines. It's a versatile technique that only requires the feature space and is agnostic to the classifier used.

    - **Better Generalization on Unseen Data**: By creating a more balanced dataset, SMOTE can help ensure that the classifier does not ignore the minority class, which can improve generalization when the model encounters unseen data. This is crucial in applications like medical diagnostics or fraud detection, where failing to detect rare important events could be very costly.

    The updated model is stored in the `models/` directory, with the training date used to distinguish it from earlier versions. The trained model is also evaluated on the training data, and the metrics are stored in the `resources/models/` directory.
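
To make the normalization step (step 6) concrete, here is a small sketch of Z-score and Min-Max scaling for a single numeric feature. The function names are hypothetical and this is not the code in `feature_engineering.py`. The key point it illustrates is that scaling parameters are fitted on training data and then reused unchanged on new samples.

```python
from statistics import mean, pstdev


def fit_zscore(values):
    """Learn the mean and (population) standard deviation of a feature."""
    mu = mean(values)
    sigma = pstdev(values)
    # Fall back to 1.0 so a constant feature does not cause division by zero.
    return mu, sigma or 1.0


def apply_zscore(values, mu, sigma):
    """Scale values to zero mean and unit variance using fitted parameters."""
    return [(v - mu) / sigma for v in values]


def fit_minmax(values):
    """Learn the minimum and maximum of a feature."""
    return min(values), max(values)


def apply_minmax(values, lo, hi):
    """Scale values so the fitted range maps onto [0, 1]."""
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in values]
```

For example, with training amounts `[10.0, 20.0, 30.0]`, Min-Max scaling maps them to `[0.0, 0.5, 1.0]`, and a later transaction of `40.0` maps to `1.5`, which shows that unseen values can fall outside the fitted range.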

## Metrics Definition
The `Metrics` class defined in the `metrics.py` module provides a thorough evaluation framework for binary classification models by implementing several key metrics. Each metric serves a distinct purpose in assessing the model's effectiveness and suitability for deployment, particularly in sensitive applications like fraud detection.

**Accuracy** is a fundamental metric that quantifies the overall correctness of the model by measuring the proportion of true results (both true positives and true negatives) among the total number of cases examined. Its formula, `(TP + TN) / (TP + TN + FP + FN)`, offers a quick snapshot of model performance. However, accuracy alone can be misleading in cases of class imbalance, making other metrics equally essential.

**Precision** targets the reliability of positive predictions by calculating the ratio of true positive predictions to all positive predictions, `TP / (TP + FP)`. This metric is crucial in scenarios where false positives carry significant costs, such as flagging legitimate transactions as fraudulent, potentially causing customer dissatisfaction and operational disruptions.

**Recall**, or sensitivity, focuses on the model's ability to identify all relevant instances. By measuring the ratio of true positives to the sum of true positives and false negatives, `TP / (TP + FN)`, recall shows how effective the model is at catching fraud cases, which is critical to preventing financial loss.

**F1 Score** offers a balance between precision and recall by taking their harmonic mean, `2 * Precision * Recall / (Precision + Recall)`. This metric is particularly useful when a single measure is needed to reflect the model's performance in scenarios where both false positives and false negatives are costly.

The **Confusion Matrix** extends the evaluation by providing a detailed view of the model's predictions, showing the breakdown of true positives, true negatives, false positives, and false negatives. This matrix is vital for understanding the model's behavior in different scenarios, enabling more targeted adjustments to its configuration.

Lastly, the **False Positive Rate (FPR)** measures the proportion of negative instances that are incorrectly predicted as positive. Its formula is `FP / (FP + TN)`. FPR is especially important in fraud detection because keeping it low minimizes unnecessary alerts, which saves resources and reduces customer friction.

### The `run_metrics` method

The `run_metrics` method compiles these metrics to provide an assessment of a model's performance. Initially, upon class instantiation, the Metrics class preprocesses the provided dataset using the `FeatureConstructor` and subsequently uses the model to predict outcomes. These predictions are then utilized to compute the various metrics.

The method calculates accuracy, precision, recall, and F1 score individually, and generates the confusion matrix and FPR. Each of these metrics is derived from the true labels and the predicted labels of the dataset passed to the class.

Finally, `run_metrics` consolidates these individual metrics into a single dictionary, which includes accuracy, precision, recall, F1 score, the confusion matrix, and FPR. This dictionary can be used in reports, integrated into performance dashboards, or employed in further analyses to refine the model, giving stakeholders a basis for decisions about deployment and ongoing improvement.
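
The consolidation described above can be sketched as a single function that counts the four confusion-matrix cells and derives every metric from them. This is an illustration of the formulas in this section, not the `Metrics` class itself; the function name, the dictionary keys, and the choice to return `0.0` when a denominator is zero are assumptions.

```python
def binary_metrics(y_true, y_pred):
    """Compute the report's metrics from true and predicted 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Assumption: report 0.0 instead of failing on an empty denominator.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "fpr": ratio(fp, fp + tn),
        # Rows are actual classes (0, 1); columns are predicted classes (0, 1).
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }
```

With ten transactions of which two are fraudulent, one fraud caught, one missed, and one false alarm, this yields an accuracy of 0.8 but a recall of only 0.5, which illustrates why accuracy alone is misleading on imbalanced data.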

## Analysis on System Parameters and Configurations
The fraud detection system developed for SecureBank employs several critical parameters and configurations that optimize its performance, enhance its reliability, and ensure its adaptability to changing conditions. This analysis covers the system’s dataset evaluation, feature engineering, and model evaluation processes, which are integral to the system's overall effectiveness.

### Dataset Evaluation

The robustness of the fraud detection model heavily relies on the quality and comprehensiveness of the dataset it trains on. The dataset encompasses transaction data points like amount, date/time, customer information, merchant details, and historical transaction outcomes. Ensuring the dataset's integrity involves several key processes:

-   **Data Quality Assurance**: Regular checks are conducted to identify and rectify any missing, duplicate, or erroneous data entries. For instance, transactions with improbable amounts or those logged with incorrect timestamps are flagged for review.
-   **Relevance and Completeness**: The dataset must cover a broad spectrum of transaction types and scenarios to ensure the model can learn to detect various fraud patterns. This includes incorporating new types of fraud as they are identified.
-   **Balancing the Dataset**: Given the typically low incidence of fraud in transaction datasets, techniques like SMOTE (Synthetic Minority Over-sampling Technique) or random undersampling of the majority class are employed to balance the dataset, thereby avoiding model bias towards the more common non-fraudulent transactions.
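
The interpolation idea behind SMOTE can be illustrated with a short, dependency-free sketch. It is a simplified version for explanation only, not the implementation used by the system; in practice a library implementation such as `imblearn.over_sampling.SMOTE` would normally be used. The function name and parameters below are hypothetical, and the sketch assumes at least two minority samples with purely numeric features.

```python
import random


def smote_like_samples(minority, n_new, k=5, seed=0):
    """Create `n_new` synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)

        # Rank the other minority points by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])

        # Pick a random position on the segment from base to neighbor.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic
```

Oversampling of this kind should be applied only to the training split. Synthetic points in a validation or test set would make the evaluation metrics unrepresentative of real traffic.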

### Feature Engineering Evaluation

Feature engineering transforms raw transaction data into informative, actionable insights that significantly enhance model performance:

-   **Feature Selection and Construction**: Critical to enhancing the predictive power of the model, this step involves identifying which features (e.g., transaction amount, time of day, merchant type) most effectively predict fraudulent activity. 
-   **Dynamic Feature Updating**: As fraud tactics evolve, so too must the features used in detection. 
-   **Evaluation of Feature Impact**: Regularly assessing the impact of each feature on the model’s performance helps in fine-tuning the feature engineering process. This might involve measuring feature importance scores directly from models like Decision Trees or Random Forests.
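
As an illustration of the feature types named in this report (transaction counts in the last hour, amount relative to the average transaction size, and transactions at unusual hours), the sketch below derives three features from the `unix_time` and `amt` columns. The function name, the output keys, and the definition of "unusual hours" as 00:00 to 05:59 UTC are assumptions, not the behavior of `feature_engineering.py`.

```python
from datetime import datetime, timezone


def time_and_ratio_features(history, txn, night_hours=range(0, 6)):
    """Build simple behavioral features for one transaction.

    `history` holds earlier transactions for the same card, and `txn` is
    the current one. Both use the `unix_time` and `amt` columns.
    """
    # Aggregation: how many earlier transactions fell in the last hour.
    recent = [h for h in history if 0 <= txn["unix_time"] - h["unix_time"] <= 3600]

    # Ratio: current amount relative to the card's historical average.
    avg_amt = sum(h["amt"] for h in history) / len(history) if history else txn["amt"]

    # Time-based: whether the transaction happened during night hours (UTC).
    hour = datetime.fromtimestamp(txn["unix_time"], tz=timezone.utc).hour

    return {
        "txns_last_hour": len(recent),
        "amt_to_avg_ratio": txn["amt"] / avg_amt if avg_amt else 0.0,
        "is_night": int(hour in night_hours),
    }
```

Features like these can then be ranked with the importance scores of a tree-based model, as described in the list above, to decide which ones are worth keeping.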

### Model Evaluation

The choice and configuration of the model are paramount, with ensemble methods such as Random Forest and Gradient Boosting Machines often favored for their robustness and accuracy:

-   **Model Training and Validation**: The model is trained on a designated set of data, with its performance validated through techniques like cross-validation to prevent overfitting. The training process is closely monitored to ensure that the model generalizes well to new, unseen data.
-   **Performance Metrics**: Metrics such as accuracy, precision, recall, F1 score, and the area under the ROC curve (AUC-ROC) are calculated to evaluate the model’s effectiveness. Each metric provides insight into different aspects of the model’s performance in fraud detection.
-   **Continuous Learning and Adaptation**: The model is set up for continuous learning, where it is retrained periodically with new data or fine-tuned as needed to adapt to emerging fraud patterns. This ensures that the model remains effective over time.
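
Of the metrics listed above, AUC-ROC is the only one computed from predicted scores rather than hard labels. A minimal sketch of its definition follows: the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen legitimate one, with ties counted as one half. The function name is hypothetical, and the pairwise loop is quadratic in the data size, so it is meant for explanation rather than production use.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one sample of each class")

    # Count positive/negative pairs ranked correctly; ties count as 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For labels `[0, 0, 1, 1]` and scores `[0.1, 0.4, 0.35, 0.8]`, three of the four positive/negative pairs are ordered correctly, giving an AUC of 0.75.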