#### Summer of Reproducibility - noWorkflow base experiment

This notebook implements a base experimental setup that models a credit card fraud detection problem.

In [17]:
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
import pandas as pd
import xgboost as xgb
import lightgbm as lgb
import catboost as cat

from noworkflow.now.collection.prov_execution.execution import *

#### Reading the dataset

In [2]:
now_tag('dataset_reading')
df = pd.read_csv('dataset/creditcard.csv', encoding='utf-8')

### Feature engineering stage

Separate the features from the target variable. This is the first step of feature treatment.

In [3]:
#now_tag('feature_eng')
X = df.drop('Class', axis=1)
y = df['Class']

#### Feature engineering: Apply PCA for feature extraction.

Here we define a hyperparam_def tag, because the n_components argument of PCA has to be set.

In [4]:
pca_components = now_variable('pca_components', 15)
pca = PCA(n_components=pca_components)  # Adjust the number of components as needed
X_pca = pca.fit_transform(X)

#### Feature engineering: Apply random undersampling over the extracted features

This is another feature engineering operation with a hyperparameter definition, in this case the random_state value for RandomUnderSampler.


In [5]:
random_seed = now_variable('random_seed', 321654)
rus = RandomUnderSampler(random_state=random_seed)
X_resampled, y_resampled = rus.fit_resample(X_pca, y)
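
To make the effect of this step concrete, here is a minimal pure-Python sketch of what random undersampling does conceptually: every class is sampled down, without replacement, to the size of the rarest class. The helper name `undersample_indices` and the toy labels are hypothetical, and this is not imblearn's implementation; it only illustrates why fixing the seed makes the selected subset reproducible.

```python
import random
from collections import defaultdict


def undersample_indices(labels, seed):
    """Return sorted row indices forming a class-balanced subset.

    Hypothetical helper for illustration only; not imblearn's code.
    """
    rng = random.Random(seed)  # local RNG so the global state is untouched
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Every class is reduced to the size of the rarest one.
    n_min = min(len(idxs) for idxs in by_class.values())

    kept = []
    for label in sorted(by_class):
        kept.extend(rng.sample(by_class[label], n_min))
    return sorted(kept)


# Toy labels: 8 legitimate transactions (0) and 2 frauds (1).
toy_labels = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
picked = undersample_indices(toy_labels, seed=321654)
balanced = [toy_labels[i] for i in picked]
print(picked, balanced)
```

Calling the helper twice with the same seed returns the same indices, which is the property the `random_seed` variable is meant to capture.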

#### Feature engineering: Splitting the dataset into train and test

Here we have two hyperparameter assignments: the test-set proportion (test_size) and the random_state. One possible approach would be to implement logic that collects all scalar values defined in hyperparam_def cells. It is not yet clear whether there are corner cases where a hyperparameter is a vector or an object.

In [6]:
now_tag('feature_eng')
test_dim = now_variable('test_dim', 0.3)
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=test_dim, random_state=random_seed)
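
The guess about scalar values could be sketched as follows: given a mapping of names to values captured in a cell, keep the scalar ones as hyperparameter candidates and set the rest aside for manual review. The function name `split_scalar_hyperparams`, the example mapping, and the choice of which types count as scalar are assumptions for illustration; nothing here is part of noWorkflow.

```python
from numbers import Number


def split_scalar_hyperparams(candidates):
    """Split a name -> value mapping into scalar and non-scalar entries."""
    scalars, others = {}, {}
    for name, value in candidates.items():
        # bool counts as a Number; str and None are treated as scalars too.
        if value is None or isinstance(value, (Number, str)):
            scalars[name] = value
        else:
            others[name] = value
    return scalars, others


# Hypothetical captured values; the last two are non-scalar corner cases.
cell_values = {
    "pca_components": 15,
    "random_seed": 321654,
    "test_dim": 0.3,
    "class_weight": {0: 1.0, 1: 10.0},   # object-valued
    "hidden_layer_sizes": (64, 32),      # vector-valued
}
scalars, others = split_scalar_hyperparams(cell_values)
print(sorted(scalars))
print(sorted(others))
```

Both corner cases do occur in scikit-learn: `class_weight` accepts a dict and `MLPClassifier`'s `hidden_layer_sizes` accepts a tuple, so a scalar-only rule would miss them.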

#### Scoring: model training and transforming features into predictions
##### RandomForest

Train the Random Forest classifier; its evaluation follows in the next section. It is not yet clear whether a separate model_training tag is redundant here. At first sight, the scoring tag is enough.

In [7]:
#now_tag('scoring')
now_tag('model_training')
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

RandomForestClassifier()

#### Evaluating: evaluating the performance of models
##### RandomForest
Compute the performance metrics (ROC AUC and F1) on the test set.

In [8]:
now_tag('evaluating')
y_pred_rf = rf.predict(X_test)

roc_rf = now_variable('roc_rf', roc_auc_score(y_test, y_pred_rf))
#roc_rf = roc_auc_score(y_test, y_pred_rf)
f1_rf = now_variable('f1_rf', f1_score(y_test, y_pred_rf))
#f1_rf = f1_score(y_test, y_pred_rf)

print("Random Forest - ROC = %f, F1 = %f" % (roc_rf, f1_rf))

Random Forest - ROC = 0.944368, F1 = 0.944625
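
Note that `roc_auc_score` receives hard class labels from `rf.predict` here rather than scores or probabilities. For binary hard labels the ROC curve has a single interior point, so the area reduces to the mean of the true positive rate and the true negative rate. The pure-Python sketch below, with hypothetical helper names and toy labels, works through that arithmetic alongside the F1 formula; it is an illustration, not a replacement for the scikit-learn functions.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def auc_from_hard_labels(y_true, y_pred):
    """Area under the one-point ROC curve: (TPR + TNR) / 2."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def f1_from_labels(y_true, y_pred):
    """F1 = 2 * TP / (2 * TP + FP + FN)."""
    tp, fp, _, fn = confusion_counts(y_true, y_pred)
    return 2 * tp / (2 * tp + fp + fn)


# Toy labels, not taken from the experiment.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 1]
toy_auc = auc_from_hard_labels(toy_true, toy_pred)
toy_f1 = f1_from_labels(toy_true, toy_pred)
print(toy_auc, toy_f1)
```

Passing `rf.predict_proba(X_test)[:, 1]` to `roc_auc_score` instead would rank the test samples by predicted probability and would generally yield a different value.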


In [9]:
ops_dict = get_pre('roc_rf')

In [10]:
__noworkflow__.trial_id

'c48bfd78-3cbd-416f-b88c-bbb916c1447d'

In [11]:
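def store_operations(trial, ops_dict):
    """Persist the operations dictionary of a trial in the 'ops' shelve file."""
    import shelve

    # Store the dictionary under the trial id; the shelf is closed on exit
    with shelve.open('ops') as shelf:
        shelf[trial] = ops_dict
        print("Dictionary stored in shelve.")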

In [12]:
id_1 = __noworkflow__.trial_id
store_operations(id_1, ops_dict)

Dictionary stored in shelve.
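
As a counterpart to `store_operations`, the following sketch reads a stored entry back. The function name `load_operations`, the `ops_demo` file, and the demo payload are hypothetical; the only assumption about `ops_dict` is that it can be pickled, which `shelve` requires of every stored value.

```python
import os
import shelve
import tempfile


def load_operations(trial, path='ops'):
    """Return the operations stored for a trial id, or None if absent."""
    with shelve.open(path) as shelf:  # closes the shelf on exit
        return shelf.get(trial)


# Round trip in a temporary directory so the real 'ops' shelf is untouched.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'ops_demo')
    with shelve.open(demo_path) as shelf:
        # Hypothetical payload; the real ops_dict layout comes from noWorkflow.
        shelf['demo-trial'] = {'0': ('now_variable', 'pca_components', 15)}
    restored = load_operations('demo-trial', demo_path)
    missing = load_operations('unknown-trial', demo_path)

print(restored, missing)
```

Returning `None` for an unknown trial id avoids a `KeyError` when a comparison is requested for a trial that was never stored.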


In [13]:
import shelve

# Open the shelf in a context manager so it is closed after reading the keys
with shelve.open('ops') as shelf:
    list_id = list(shelf.keys())

In [14]:
list_id



['7349f530-1ee3-4b37-a800-3852144c14e9',
 'adb44336-9e4d-4db5-886b-9e3ac4072b42',
 '636a8483-5c2a-472d-a64c-d309b17088bf',
 'c48bfd78-3cbd-416f-b88c-bbb916c1447d']

In [21]:
list_id[3]

'c48bfd78-3cbd-416f-b88c-bbb916c1447d'

In [24]:
exp_compare(list_id[3], list_id[0])

Pipelines A and B differ in length
Key '0': Values at indices [2, 3] are equal.
Key '1': Values at indices [2, 3] are equal.
Key '2': Values at indices [2, 3] are equal.
Key '3': Values at indices [2, 3] are equal.
Key '4': Values at indices [2, 3] are equal.
Key '5': Values at indices [2, 3] are equal.
Key '6': Values at indices [2, 3] are different.
Key '7': Values at indices [2, 3] are different.
Key '8': Values at indices [2, 3] are different.
Key '9': Values at indices [2, 3] are different.
Key '10': Values at indices [2, 3] are different.
Key '11': Values at indices [2, 3] are different.
Key '12': Values at indices [2, 3] are equal.
Key '13': Values at indices [2, 3] are equal.
Key '14': Values at indices [2, 3] are equal.
Key '15': Values at indices [2, 3] are equal.
Key '16': Values at indices [2, 3] are equal.
Key '17': Values at indices [2, 3] are equal.
Key '18': Values at indices [2, 3] are equal.
Key '19': Values at indices [2, 3] are equal.
Key '20': Values at indices [2,