#### Summer of Reproducibility - noWorkflow base experiment - Notebook 4

This Jupyter Notebook is dedicated to guiding you through the applications of noWorkflow in Data Science and Machine Learning. It is the result of our work during the Summer of Reproducibility at OSPO UCSC 2023, utilizing [noWorkflow](https://github.com/gems-uff/noworkflow).

This Notebook serves as a use case based on the problem of Fraud Detection. We have partially replicated the work titled "The Effect of Feature Extraction and Data Sampling on Credit Card Fraud Detection." Interested readers are encouraged to refer to the original work [here](https://link.springer.com/article/10.1186/s40537-023-00684-w).

For the sake of clarity, we have divided this experiment into different notebooks:

1. Covers the steps from reading the dataset to Random Forest training, configuring a single trial.
2. Repeats all previous steps but with changes in the experimental setup, such as modified hyperparameters.
3. Utilizes noWorkflow to summarize the results from previous trials.
4. Repeats the experiment, changing the model and the order of operations.
5. Compares the modifications and differences between the last and first experiments.

**Please remember to select the noWorkflow kernel before running these Notebooks.**

In [1]:
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
import pandas as pd
import xgboost as xgb

from noworkflow.now.tagging.var_tagging import backward_deps, \
    global_backward_deps, store_operations, resume_trials, trial_diff, \
    trial_intersection_diff, var_tag_plot, var_tag_values

#### Reading the dataset

In [2]:
df = pd.read_csv('dataset/creditcard.csv', encoding='utf-8')

### Feature engineering stage

Separate the features from the target variable. This is the first step in feature treatment.

In [3]:
X = df.drop('Class', axis=1)
y = df['Class']

#### Feature engineering: Apply random undersampling over the extracted features

In this experiment, we invert the order of RandomUnderSampler and PCA: undersampling is applied first, and PCA is then computed on the resampled data.

In [4]:
random_seed = now_variable('random_seed', 42)
rus = RandomUnderSampler(random_state=random_seed)
X_resampled, y_resampled = rus.fit_resample(X, y)

Evaluation(id=33, checkpoint=23.406392426, code_component_id=710, activation_id=30, repr=42)
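
To make the undersampling step concrete, here is a minimal pure-Python sketch of the idea: every class is sampled down, without replacement, to the size of the smallest class. This is only an illustration under our own assumptions, not imblearn's implementation. The helper `random_undersample` and the toy data are hypothetical and are not part of the experiment.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed):
    """Sample every class down to the size of the smallest class (hypothetical helper)."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # The smallest class defines how many rows each class keeps.
    target = min(len(idxs) for idxs in by_class.values())

    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, target))  # sampling without replacement
    kept.sort()  # keep the original row order

    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy, made-up data: 8 legitimate rows (class 0) and 2 fraudulent rows (class 1).
toy_X = [[float(i)] for i in range(10)]
toy_y = [0] * 8 + [1] * 2

X_bal, y_bal = random_undersample(toy_X, toy_y, seed=42)
print(Counter(y_bal))
```

Fixing the seed, as the cell above does with *random_seed*, makes the sampled subset identical across runs, so differences between trials can be attributed to the changes under study rather than to sampling noise.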


#### Feature engineering: Apply PCA for feature extraction.

Here we define the *pca_components* tag to record the n_components argument passed to PCA.

In [5]:
pca_components = now_variable('pca_components', 3)
pca = PCA(n_components=pca_components)  # Adjust the number of components as needed
X_pca = pca.fit_transform(X_resampled)

Evaluation(id=53, checkpoint=23.658996901, code_component_id=748, activation_id=50, repr=3)


#### Feature engineering: Splitting dataset into train and test

Here we have two hyperparameter assignments: the test set proportion (test_size) and the random_state, which reuses the *random_seed* tagged earlier.

In [6]:
test_dim = now_variable('test_dim', 0.2)
X_train, X_test, y_train, y_test = train_test_split(X_pca, y_resampled, test_size=test_dim, random_state=random_seed)

Evaluation(id=70, checkpoint=23.799512257, code_component_id=779, activation_id=67, repr=0.2)


#### Scoring: model training and transforming features into predictions
##### XGBoost

Instantiate and train an XGBoost classifier. Here we tag the model object itself under the name *model*.

In [7]:
xgb_model = now_variable('model', xgb.XGBClassifier())
xgb_model.fit(X_train, y_train)

Evaluation(id=90, checkpoint=23.897578456999998, code_component_id=813, activation_id=85, repr=XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, gamma=None,
              gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, n_estimators=100, n_jobs=None,
              num_parallel_tree=None, predictor=None, random_state=None,
              reg_alpha=None, reg_lambda=None, ...))


XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
              missing=nan, monotone_constraints='()', n_estimators=100,
              n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)

#### Evaluating: evaluating the performance of models
##### XGBoost

Compute the performance metrics. Two control variables are tagged here: *roc_metric* stores the ROC AUC score, a classical classification metric, and *f1_metric* stores the F1 score.

In [8]:
y_pred = xgb_model.predict(X_test)

roc_metric = now_variable('roc_metric', roc_auc_score(y_test, y_pred))
f1_metric = now_variable('f1_metric', f1_score(y_test, y_pred))

print("XGBoost - ROC = %f, F1 = %f" % (roc_metric, f1_metric))

Evaluation(id=111, checkpoint=24.08765099, code_component_id=851, activation_id=100, repr=0.8780663780663781)
Evaluation(id=120, checkpoint=24.090723284, code_component_id=867, activation_id=100, repr=0.875)
XGBoost - ROC = 0.878066, F1 = 0.875000
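
As a sanity check on what these two numbers measure, the sketch below recomputes both metrics from confusion counts in plain Python. Because the cell above passes hard class labels (not probabilities) to `roc_auc_score`, the ROC curve has a single operating point and its area reduces to the mean of the true positive rate and the true negative rate. The helper names and the toy labels are hypothetical. They are illustrative only and are not results of this experiment.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def f1_from_labels(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN)."""
    tp, _, fp, fn = confusion_counts(y_true, y_pred)
    return 2 * tp / (2 * tp + fp + fn)


def roc_auc_from_hard_labels(y_true, y_pred):
    """Area under a one-point ROC curve: (TPR + TNR) / 2."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return (tpr + tnr) / 2


# Toy, made-up labels: one false negative and one false positive.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 1]

print(f1_from_labels(toy_true, toy_pred))
print(roc_auc_from_hard_labels(toy_true, toy_pred))
```

If a threshold-independent AUC is preferred, one option is to pass the positive-class scores from `xgb_model.predict_proba(X_test)[:, 1]` to `roc_auc_score` instead of `y_pred`. We keep hard labels here to stay consistent with the tagged values above.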


### Experiment dependencies from roc_metric variable

In [9]:
dict_ops = backward_deps('roc_metric', False)
dict_ops

{25: ('y_test', 'complex data type'),
 24: ("now_variable('model', xgb.XGBClassifier())", 'complex data type'),
 23: ('xgb_model', 'complex data type'),
 22: ("now_variable('pca_components', 3)", '3'),
 21: ('pca_components', '3'),
 20: ('PCA(n_components=pca_components)', 'PCA(n_components=3)'),
 19: ('pca', 'PCA(n_components=3)'),
 18: ('X_resampled', 'complex data type'),
 17: ('X_pca', 'complex data type'),
 16: ('RandomUnderSampler(random_state=random_seed)', 'complex data type'),
 15: ('rus', 'complex data type'),
 14: ('X', 'complex data type'),
 13: ('df', 'complex data type'),
 12: ("df['Class']", 'complex data type'),
 11: ('y', 'complex data type'),
 10: ('y_resampled', 'complex data type'),
 9: ("now_variable('test_dim', 0.2)", '0.2'),
 8: ('test_dim', '0.2'),
 7: ("now_variable('random_seed', 42)", '42'),
 6: ('random_seed', '42'),
 5: ('train_test_split(X_pca, y_resampled, test_size=test_dim, random_state=random_seed)',
  'complex data type'),
 4: ('X_test', 'complex data
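
The dictionary returned by `backward_deps` maps a step number to a pair of (code, value), so it can be post-processed with ordinary Python. The sketch below pulls out only the entries created by `now_variable` calls, using a few entries copied from the output above. The helper `tagged_values` and the regular expression are our own assumptions for illustration and are not part of the noWorkflow API.

```python
import re

# A few entries copied from the output above; the real dictionary is longer.
sample_ops = {
    24: ("now_variable('model', xgb.XGBClassifier())", 'complex data type'),
    22: ("now_variable('pca_components', 3)", '3'),
    21: ('pca_components', '3'),
    20: ('PCA(n_components=pca_components)', 'PCA(n_components=3)'),
    9: ("now_variable('test_dim', 0.2)", '0.2'),
    7: ("now_variable('random_seed', 42)", '42'),
}

# Matches code strings of the form now_variable('<tag name>', ...).
TAG_PATTERN = re.compile(r"^now_variable\('([^']+)',")


def tagged_values(ops):
    """Map each tag name to its recorded value (hypothetical helper)."""
    tags = {}
    for step in sorted(ops):
        code, value = ops[step]
        match = TAG_PATTERN.match(code)
        if match:
            tags[match.group(1)] = value
    return tags


print(tagged_values(sample_ops))
```

A summary like this gives a compact view of the hyperparameters that influenced *roc_metric* in this trial.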

### Experiment dependencies from roc_metric
Save the operations dictionary in a shelve object, using this trial's trial_id as the key.

The steps are:
1. Call store_operations() to store the dict in a shelve object under this trial_id key.
2. Use resume_trials() to check the list of stored trials available for comparison.

In [10]:
trial_id = __noworkflow__.trial_id
store_operations(trial_id, dict_ops)

Dictionary stored in shelve.


In [11]:
resume_trials()

['edb94455-f97b-46f0-b30e-ed01eaf81081',
 'b86773c3-a3b7-40d0-a3ac-5ab4278826c2',
 'c33177a6-88be-4f78-ae96-ede68a5ab142']
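
For readers unfamiliar with shelve, the two steps above can be sketched with Python's standard `shelve` module: a persistent, dictionary-like store keyed by strings. This is only an illustration of the storage pattern, not noWorkflow's implementation. The helpers `store_ops` and `list_trials`, the file name `ops_demo`, and the trial ids `trial-a` and `trial-b` are hypothetical placeholders.

```python
import os
import shelve
import tempfile


def store_ops(db_path, trial_id, ops):
    """Persist one operations dictionary under its trial id (hypothetical helper)."""
    with shelve.open(db_path) as db:
        db[trial_id] = ops


def list_trials(db_path):
    """Return the ids of all trials stored so far (hypothetical helper)."""
    with shelve.open(db_path) as db:
        return sorted(db.keys())


# One entry copied from the backward_deps output, reused for both placeholder trials.
demo_ops = {9: ("now_variable('test_dim', 0.2)", '0.2')}

# A temporary directory keeps this demo from leaving files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    db_path = os.path.join(tmp_dir, "ops_demo")

    store_ops(db_path, "trial-a", demo_ops)
    store_ops(db_path, "trial-b", demo_ops)

    stored_trials = list_trials(db_path)
    with shelve.open(db_path) as db:
        restored = db["trial-a"]

print(stored_trials)
print(restored == demo_ops)
```

Keying each entry by trial id is what allows a later notebook to load two stored trials and compare their operations.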

### Next steps
The final [Notebook](./now_usecase_part_5.ipynb) will use the noWorkflow features to compare this experiment and the first one.