#### Summer of Reproducibility - noWorkflow base experiment

This notebook implements an experimental setup that models a credit card fraud detection problem.

In [1]:
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
import pandas as pd
import xgboost as xgb
import lightgbm as lgb
import catboost as cat

from noworkflow.now.collection.prov_execution.execution import *

#### Reading the dataset

In [2]:
now_tag('dataset_reading')
df = pd.read_csv('dataset/creditcard.csv', encoding='utf-8')

### Feature engineering stage

Separate the features from the target variable. This is the first step in feature treatment.

In [3]:
#now_tag('feature_eng')
X = df.drop('Class', axis=1)
y = df['Class']

#### Feature engineering: Apply PCA for feature extraction.

Here we define a hyperparam_def tag, given that the n_components argument of PCA needs a value.

In [4]:
pca_components = now_variable('pca_components', 15)
pca = PCA(n_components=pca_components)  # Adjust the number of components as needed
X_pca = pca.fit_transform(X)

Evaluation(id=36, checkpoint=33.171132572, code_component_id=1219, activation_id=33, repr=15)


#### Feature engineering: Apply random undersampling over the extracted features

Another feature engineering operation with a hyperparameter definition. Here it is the random_state value for RandomUnderSampler.
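
To illustrate why the seed is worth tracking as a hyperparameter, here is a conceptual sketch of random undersampling using only the standard library. It is not imblearn's implementation: the `undersample` helper and the toy labels are hypothetical, and the only point is that a fixed seed makes the selected subset reproducible.

```python
import random
from collections import defaultdict


def undersample(labels, seed):
    """Return sorted indices keeping an equal number of samples per class.

    Conceptual sketch only; not imblearn's implementation.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    # Every class is reduced to the size of the smallest one.
    n_keep = min(len(indices) for indices in by_class.values())
    rng = random.Random(seed)  # local generator: the seed fully determines the draw
    kept = []
    for label in sorted(by_class):
        kept.extend(rng.sample(by_class[label], n_keep))
    return sorted(kept)


labels = [0] * 95 + [1] * 5  # synthetic, imbalanced toy labels
first = undersample(labels, seed=321654)
second = undersample(labels, seed=321654)
print(len(first), first == second)
```

Running the helper twice with the same seed selects the same indices, which is the property that lets two trials be compared step by step.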


In [5]:
random_seed = now_variable('random_seed', 321654)
rus = RandomUnderSampler(random_state=random_seed)
X_resampled, y_resampled = rus.fit_resample(X_pca, y)

Evaluation(id=54, checkpoint=34.805158993, code_component_id=1252, activation_id=51, repr=321654)


#### Feature engineering: Splitting dataset into train and test

Here we have two hyperparameter assignments: the test set proportion (test_size) and the random_state. One option would be to implement some logic that collects all scalar values tagged as hyperparam_def in the cells. It is not clear at the moment whether there are corner cases where a hyperparameter could be a vector or an object.
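
A minimal sketch of the scalar-filtering idea, assuming the candidate hyperparameters are available as a plain name-to-value dict. The helper `scalar_hyperparams` is hypothetical and not part of noWorkflow, and the `class_weight` entry is a made-up example of a non-scalar value; such entries are returned separately so the corner cases stay visible instead of being silently dropped.

```python
from numbers import Number


def scalar_hyperparams(candidates):
    """Split a name -> value dict into scalar and non-scalar entries.

    Hypothetical helper, not part of noWorkflow.
    """
    scalars, others = {}, {}
    for name, value in candidates.items():
        # bool is a subclass of int, so it is already covered by Number.
        if isinstance(value, (Number, str)) or value is None:
            scalars[name] = value
        else:
            others[name] = value
    return scalars, others


candidates = {
    "pca_components": 15,
    "random_seed": 321654,
    "test_dim": 0.3,
    "class_weight": {0: 1.0, 1: 10.0},  # made-up non-scalar example
}
scalars, others = scalar_hyperparams(candidates)
print(sorted(scalars))
print(sorted(others))
```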

In [6]:
now_tag('feature_eng')
test_dim = now_variable('test_dim', 0.3)
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=test_dim, random_state=random_seed)

Evaluation(id=76, checkpoint=35.042643641999994, code_component_id=1294, activation_id=70, repr=0.3)


#### Scoring: model training and transforming features into predictions
##### RandomForest

Train the Random Forest classifier. It is unclear for now whether adding a model_training tag would be redundant here; at first sight, the scoring tag is enough.

In [7]:
#now_tag('scoring')
now_tag('model_training')
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

RandomForestClassifier()

#### Evaluating: evaluating the performance of models
##### RandomForest
Compute the performance metrics.

In [8]:
now_tag('evaluating')
y_pred_rf = rf.predict(X_test)

roc_rf = now_variable('roc_rf', roc_auc_score(y_test, y_pred_rf))
#roc_rf = roc_auc_score(y_test, y_pred_rf)
f1_rf = now_variable('f1_rf', f1_score(y_test, y_pred_rf))
#f1_rf = f1_score(y_test, y_pred_rf)

print("Random Forest - ROC = %f, F1 = %f" % (roc_rf, f1_rf))

Evaluation(id=120, checkpoint=35.546215386, code_component_id=1371, activation_id=106, repr=0.9412034489084572)
Evaluation(id=129, checkpoint=35.548957195999996, code_component_id=1387, activation_id=106, repr=0.9411764705882353)
Random Forest - ROC = 0.941203, F1 = 0.941176


### Experiment comparison

The steps are:
1. call get_pre for a given tagged variable and keep its operations_dictionary output
2. call store_operations() to store the dict in a shelve object under the current trial_id key
3. load the shelve object to retrieve the other stored experiments as well as the current one
4. call exp_compare, passing two trial ids as arguments, to make a comparison
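
The steps above can be sketched with the standard library alone. This is an illustration under assumptions, not noWorkflow's implementation: `store_ops_sketch` and `compare_ops_sketch` are hypothetical stand-ins for `store_operations` and `exp_compare`, the operations dictionary is assumed to map a step index to a `(name, repr)` pair, and the trial ids and values are made up.

```python
import os
import shelve
import tempfile


def store_ops_sketch(path, trial_id, ops):
    """Persist one trial's operations dict under its trial id."""
    with shelve.open(path) as shelf:
        shelf[trial_id] = ops


def compare_ops_sketch(path, id_a, id_b):
    """Return the keys whose stored values differ between two trials."""
    with shelve.open(path) as shelf:
        ops_a, ops_b = shelf[id_a], shelf[id_b]
    if len(ops_a) != len(ops_b):
        raise ValueError("pipelines have different lengths")
    return [key for key in ops_a if ops_a[key] != ops_b.get(key)]


# A temporary directory keeps the sketch from touching a real 'ops' shelf.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ops")
    store_ops_sketch(path, "trial-a", {0: ("X", "15"), 1: ("roc_rf", "0.94")})
    store_ops_sketch(path, "trial-b", {0: ("X", "15"), 1: ("roc_rf", "0.95")})
    differing = compare_ops_sketch(path, "trial-a", "trial-b")
    print(differing)
```

Opening the shelf in a `with` block closes it after each operation, so pending writes are flushed before another cell or process reads the file.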



In [9]:
ops_dict = get_pre('roc_rf')

In [10]:
id_1 = __noworkflow__.trial_id
store_operations(id_1, ops_dict)

Dictionary stored in shelve.


In [11]:
import shelve
shelf = shelve.open('ops')
list_id = list(shelf.keys())
list_id

['851cd6dc-ac92-4bde-9a08-582aa9489e86',
 'c5fd4172-1dd8-4778-99e2-42354e1b3963']

In [12]:
a = shelf[list_id[-1]]

In [24]:
b = shelf[list_id[0]]

In [25]:
stra = a[3][3]
strb = b[3][3]

In [None]:
# Another problem when comparing string reprs: they come truncated.

In [33]:
stra

'array([[-7.25241258e+04,  2.64778899e+02,  1.76054784e+00, ...,\n         3.06280762e+00,  4.95579346e+00,  2.83924603e+00],\n       [ 3.04548590e+04, -1.17494600e+01,  1.81956757e+00, ...,\n        -4.92914351e-01, -1.93680798e+00,  2.70069284e-01],\n       [ 1.20708557e+04, -6.99047635e+01, -1.09835651e+00, ...,\n        -1.20786608e+00,  1.88465257e-01, -5.04876146e-01],\n       ...,\n       [ 9.44078545e+04, -9.36205649e+01,  2.30013300e+00, ...,\n         2.21554905e+00,  4.97641343e+00,  1.97810945e+00],\n       [-4.12931267e+04,  2.46958467e+02, -1.93587076e+00, ...,\n        -1.20422242e+00,  5.18693233e-01,  5.99122943e-01],\n       [ 9.37448573e+04, -4.35812376e+01,  1.43479031e-01, ...,\n         3.62091130e-01,  1.10921740e+00, -1.16838798e-01]])'
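
The `...` in the repr above is NumPy's default summarization, applied when an array has more elements than the `threshold` print option (1000 by default). Elements hidden by the summary never reach the string, so comparing reprs can miss real differences. A minimal sketch of comparing the arrays themselves, or a digest of their bytes, instead of their reprs; the arrays are synthetic and the helper `array_digest` is hypothetical, not part of noWorkflow.

```python
import hashlib

import numpy as np


def array_digest(arr):
    """Hash shape, dtype and raw bytes so the digest covers every element."""
    arr = np.ascontiguousarray(arr)
    header = f"{arr.shape}|{arr.dtype}|".encode()
    return hashlib.sha256(header + arr.tobytes()).hexdigest()


a = np.arange(2000, dtype=float).reshape(200, 10)
b = a.copy()
b[100, 5] += 1e-3  # change one element in a row hidden by the summarized repr

same_repr = repr(a) == repr(b)            # compares only the summarized strings
same_values = np.array_equal(a, b)        # compares every element
same_digest = array_digest(a) == array_digest(b)
print(same_repr, same_values, same_digest)
```

Storing a digest next to (or instead of) the repr would keep the shelve entries small while still detecting any change. Alternatively, `np.printoptions(threshold=sys.maxsize)` makes `repr` emit every element, at the cost of very long strings.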

In [13]:
exp_compare(list_id[-1], list_id[0])

Pipelines have same lenght
Key '0': Values are equal
Key '1': Values are equal
Key '2': Values are equal
Key '3': Values are different
->>> ('X_test', 'array([[-7.25241258e+04,  2.64778899e+02,  1.76056672e+00, ...,\n         3.07040829e+00,  4.95513161e+00,  2.84047488e+00],\n       [ 3.04548590e+04, -1.17494600e+01,  1.81955477e+00, ...,\n        -5.10259399e-01, -1.92893639e+00,  2.83266278e-01],\n       [ 1.20708557e+04, -6.99047635e+01, -1.09833399e+00, ...,\n        -1.20006228e+00,  1.88603204e-01, -5.03428941e-01],\n       ...,\n       [ 9.44078545e+04, -9.36205649e+01,  2.30012388e+00, ...,\n         2.21301678e+00,  4.97701850e+00,  1.97873632e+00],\n       [-4.12931267e+04,  2.46958467e+02, -1.93586010e+00, ...,\n        -1.20695883e+00,  5.22192216e-01,  6.10366124e-01],\n       [ 9.37448573e+04, -4.35812376e+01,  1.43481517e-01, ...,\n         3.56927546e-01,  1.11234960e+00, -1.05529129e-01]])') ('X_test', 'array([[-7.25241258e+04,  2.64778899e+02,  1.76054784e+00, ...,\n