#### Summer of Reproducibility - noWorkflow base experiment

This notebook implements an experimental setup that models a credit card fraud problem.

In [24]:
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
import pandas as pd
import xgboost as xgb
import lightgbm as lgb
import catboost as cat

from noworkflow.now.collection.prov_execution.execution import *

#### Reading the dataset

In [25]:
now_tag('dataset_reading')
df = pd.read_csv('dataset/creditcard.csv', encoding='utf-8')

### Feature engineering stage

Separate the features from the target variable. This is the first step of feature treatment.

In [26]:
#now_tag('feature_eng')
X = df.drop('Class', axis=1)
y = df['Class']

#### Feature engineering: Apply PCA for feature extraction.

Here we define a hyperparam_def tag, because the n_components argument passed to PCA is a hyperparameter of this step.

In [27]:
pca_components = now_variable('pca_components', 15)
pca = PCA(n_components=pca_components)  # Adjust the number of components as needed
X_pca = pca.fit_transform(X)
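
In the cell above, the tagging call returns the value it receives, so the result can be passed straight to the estimator. A small stand-in illustrates that pattern. This is a hypothetical sketch, not noWorkflow's implementation: `tag_variable` and `TAGGED` are names invented here, and the real `now_variable` records provenance in noWorkflow's own store.

```python
# Hypothetical stand-in for a tagging helper; NOT noWorkflow's now_variable.
TAGGED = {}  # maps tag name -> last value recorded under that name


def tag_variable(name, value):
    """Record value under name and return it unchanged for inline use."""
    TAGGED[name] = value
    return value


# The returned value can be used exactly like the untagged literal.
pca_components = tag_variable('pca_components', 15)
print(pca_components, TAGGED)
```

Returning the value unchanged is what keeps the tagged line a drop-in replacement for the plain assignment.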

#### Feature engineering: Apply random undersampling over the extracted features

This is another feature engineering operation with a hyperparameter definition: here, the random_state value for RandomUnderSampler.


In [28]:
random_seed = now_variable('random_seed', 321654)
rus = RandomUnderSampler(random_state=random_seed)
X_resampled, y_resampled = rus.fit_resample(X_pca, y)
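
To make the role of the seed concrete, here is a minimal pure-Python sketch of random undersampling on toy data. It is an illustration under simplifying assumptions, not imblearn's implementation: `undersample`, `X_toy`, and `y_toy` are hypothetical names, and every class is simply cut down to the size of the rarest one.

```python
import random
from collections import Counter


def undersample(X, y, seed):
    """Randomly drop rows so every class keeps as many rows as the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    n_min = min(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label in sorted(by_class):
        rows = by_class[label]
        # Classes already at the minimum size are kept whole.
        kept = rows if len(rows) == n_min else rng.sample(rows, n_min)
        X_out.extend(kept)
        y_out.extend([label] * n_min)
    return X_out, y_out


# Toy imbalanced data: eight rows of class 0, two rows of class 1.
X_toy = [[i] for i in range(10)]
y_toy = [0] * 8 + [1] * 2
X_a, y_a = undersample(X_toy, y_toy, seed=321654)
X_b, y_b = undersample(X_toy, y_toy, seed=321654)
print(Counter(y_a))
print(X_a == X_b)  # same seed, same retained rows
```

Because the seed decides which majority-class rows survive, it affects everything downstream, which is why it is worth tracking as a hyperparameter.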

#### Feature engineering: Splitting the dataset into train and test

Here we have two hyperparameter assignments: the test_size proportion and the random_state. One possible approach would be to implement logic that collects every scalar value tagged as hyperparam_def in a cell. It is not yet clear whether there are corner cases in which a hyperparameter is a vector or an object.

In [29]:
now_tag('feature_eng')
test_dim = now_variable('test_dim', 0.3)
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=test_dim, random_state=random_seed)
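
Non-scalar hyperparameters do occur in practice: in scikit-learn, for example, `hidden_layer_sizes` of `MLPClassifier` is a tuple, and `class_weight` can be a dict. The sketch below shows one possible way to tell the two cases apart and still obtain a comparable text form. The helpers `is_scalar_hyperparam` and `describe_hyperparam` are hypothetical and are not part of noWorkflow.

```python
import numbers


def is_scalar_hyperparam(value):
    """Return True for plain scalars: numbers, strings, booleans, or None."""
    return value is None or isinstance(value, (numbers.Number, str, bool))


def describe_hyperparam(name, value):
    """Return (name, kind, repr) so non-scalar values can still be compared as text."""
    kind = 'scalar' if is_scalar_hyperparam(value) else type(value).__name__
    return name, kind, repr(value)


# Example values: the first two come from this notebook, the last two are illustrative.
examples = {
    'test_dim': 0.3,
    'random_seed': 321654,
    'hidden_layer_sizes': (100, 50),
    'class_weight': {0: 1, 1: 10},
}
for name, value in examples.items():
    print(describe_hyperparam(name, value))
```

Comparing `repr` strings is a simplification: it works for tuples and small dicts, but arbitrary objects would need a more careful, stable serialization.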

#### Scoring: model training and transforming features into predictions
##### RandomForest

Train the Random Forest classifier (evaluation follows in the next section). It is not yet clear whether a separate model_training tag is redundant here; at first sight, the scoring tag alone would be enough.

In [30]:
#now_tag('scoring')
now_tag('model_training')
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

RandomForestClassifier()

#### Evaluating: evaluating the performance of models
##### RandomForest
Compute the ROC AUC and F1 score on the test split.

In [31]:
now_tag('evaluating')
y_pred_rf = rf.predict(X_test)

roc_rf = now_variable('roc_rf', roc_auc_score(y_test, y_pred_rf))
#roc_rf = roc_auc_score(y_test, y_pred_rf)
f1_rf = now_variable('f1_rf', f1_score(y_test, y_pred_rf))
#f1_rf = f1_score(y_test, y_pred_rf)

print("Random Forest - ROC = %f, F1 = %f" % (roc_rf, f1_rf))

Random Forest - ROC = 0.928545, F1 = 0.927152
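
In the cell above, `roc_auc_score` receives the hard labels returned by `rf.predict`. The function also accepts continuous scores, such as the positive-class column of `rf.predict_proba(X_test)`, and the two inputs can give different values. The pure-Python sketch below illustrates why, using made-up toy data and a hypothetical helper `pairwise_auc`; it is not scikit-learn's implementation.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.45]            # toy positive-class scores, invented for illustration
labels = [int(s >= 0.5) for s in scores]  # hard labels at a 0.5 threshold

auc_scores = pairwise_auc(y_true, scores)
auc_labels = pairwise_auc(y_true, labels)
print(auc_scores, auc_labels)
```

With scores, the ranking information is preserved; with thresholded labels, every pair that falls on the same side of the threshold becomes a tie.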


### Experiment comparison

The steps are:
1. Call get_pre for a given tagged variable and keep the returned operations dictionary.
2. Call store_operations() to store the dictionary in a shelve object under the current trial_id key.
3. Load the shelve object to retrieve the other stored experiments as well as the current one.
4. Call exp_compare with two trial ids as arguments to compare them.
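
The store, load, and compare steps can be sketched with the standard library alone. The code below is a simplified illustration, not noWorkflow's `store_operations` or `exp_compare`: the helper names `store_ops` and `compare_ops`, the shelf file name, and the toy operations dictionaries are all assumptions made for this example.

```python
import os
import shelve
import tempfile


def store_ops(shelf_path, trial_id, ops):
    """Save one trial's operations dictionary under its trial id."""
    with shelve.open(shelf_path) as shelf:
        shelf[str(trial_id)] = ops  # shelve keys must be strings


def compare_ops(shelf_path, id_a, id_b):
    """Return a list of (key, value_a, value_b) for every key whose values differ."""
    with shelve.open(shelf_path) as shelf:
        ops_a, ops_b = shelf[str(id_a)], shelf[str(id_b)]
    diffs = []
    for key in sorted(set(ops_a) | set(ops_b)):
        a, b = ops_a.get(key), ops_b.get(key)
        if a != b:
            diffs.append((key, a, b))
    return diffs


# Toy operations dictionaries, invented for illustration.
shelf_path = os.path.join(tempfile.mkdtemp(), 'ops_demo')
store_ops(shelf_path, 'trial_a', {0: ('pca_components', '15'), 1: ('test_dim', '0.2')})
store_ops(shelf_path, 'trial_b', {0: ('pca_components', '15'), 1: ('test_dim', '0.3')})
diffs = compare_ops(shelf_path, 'trial_a', 'trial_b')
print(diffs)
```

Iterating over the union of keys means a step present in only one pipeline shows up as a difference against `None` instead of being silently skipped.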



In [32]:
ops_dict = get_pre('roc_rf')

In [34]:
id_1 = __noworkflow__.trial_id
store_operations(id_1, ops_dict)

Dictionary stored in shelve.


In [35]:
import shelve
shelf = shelve.open('ops')
list_id = list(shelf.keys())
list_id

In [40]:
exp_compare(list_id[-1], list_id[0])

Pipelines A and B differ in lenght
Key '0': Values are equal
Key '1': Values are equal
Key '2': Values are equal
Key '3': Values are equal
Key '4': Values are equal
Key '5': Values are equal
Key '6': Values are different
->>> ('random_seed', '42') ('random_seed', '321654')
Key '7': Values are different
->>> ('random_seed', '42') ('random_seed', '321654')
Key '8': Values are different
->>> ("now_variable('random_seed', 42)", '42') ("now_variable('random_seed', 321654)", '321654')
Key '9': Values are different
->>> ('test_dim', '0.2') ('test_dim', '0.3')
Key '10': Values are different
->>> ('test_dim', '0.2') ('test_dim', '0.3')
Key '11': Values are different
->>> ("now_variable('test_dim', 0.2)", '0.2') ("now_variable('test_dim', 0.3)", '0.3')
Key '12': Values are equal
Key '13': Values are equal
Key '14': Values are equal
Key '15': Values are equal
Key '16': Values are equal
Key '17': Values are equal
Key '18': Values are equal
Key '19': Values are equal
Key '20': Values are equal
Key 