#### Summer of Reproducibility - noWorkflow base experiment

This notebook implements an experimental setup modeling a Credit Fraud problem.

In [1]:
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
import pandas as pd
import xgboost as xgb
import lightgbm as lgb
import catboost as cat
import numpy as np
#np.set_printoptions(threshold=np.inf)
np.set_printoptions(precision=2)


from noworkflow.now.tagging.var_tagging import *

#### Reading the dataset

In [2]:
now_tag('dataset_reading')
df = pd.read_csv('dataset/creditcard.csv', encoding='utf-8')

### Feature engineering stage

Separate the features from the target variable. This is the first step in feature treatment.

In [3]:
#now_tag('feature_eng')
X = df.drop('Class', axis=1)
y = df['Class']

#### Feature engineering: Apply PCA for feature extraction.

Here we tag the number of components as a hyperparameter definition (hyperparam_def): the value passed to the n_components argument of PCA is wrapped in now_variable.

In [4]:
pca_components = now_variable('pca_components', 6)
pca = PCA(n_components=pca_components)  # Adjust the number of components as needed
X_pca = pca.fit_transform(X)

Evaluation(id=40, checkpoint=33.998055152000006, code_component_id=1229, activation_id=37, repr=6)


#### Feature engineering: Apply random undersampling over the extracted features

Another feature engineering operation with a hyperparameter definition. Here the tagged value is the random_state passed to RandomUnderSampler.


In [5]:
random_seed = now_variable('random_seed', 546)
rus = RandomUnderSampler(random_state=random_seed)
X_resampled, y_resampled = rus.fit_resample(X_pca, y)

Evaluation(id=58, checkpoint=35.013627238000005, code_component_id=1262, activation_id=55, repr=546)
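To make the role of random_state concrete, here is a minimal pure-Python sketch of what random undersampling does conceptually: every class is sampled down to the size of the rarest class, and the seed fixes which rows are kept. The helper `undersample` and the toy labels are hypothetical; this is not the imbalanced-learn implementation.

```python
import random
from collections import defaultdict


def undersample(labels, seed):
    """Return sorted row indices so every class keeps as many rows as the rarest class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    n_min = min(len(rows) for rows in by_class.values())
    rng = random.Random(seed)  # the seed plays the role of random_state
    kept = []
    for rows in by_class.values():
        kept.extend(rng.sample(rows, n_min))
    return sorted(kept)


labels = [0] * 8 + [1] * 2  # toy target: 8 legitimate rows, 2 fraudulent rows
kept_a = undersample(labels, seed=546)
kept_b = undersample(labels, seed=546)  # same seed, same selection
balanced = [labels[i] for i in kept_a]
```

Running the helper twice with the same seed selects the same rows, which is why the seed is worth recording as a hyperparameter of the experiment.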


#### Feature engineering: Splitting dataset into train and test

Here we have two hyperparameter assignments: the test_size proportion and the random_state. One option would be to implement logic that collects every scalar value tagged as hyperparam_def in the cells. It is not yet clear whether there are corner cases where a hyperparameter is a vector or an object.

In [6]:
now_tag('feature_eng')
test_dim = now_variable('test_dim', 0.4)
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=test_dim, random_state=random_seed)

Evaluation(id=80, checkpoint=35.247921432000005, code_component_id=1304, activation_id=74, repr=0.4)
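The scalar-versus-object question raised for this cell can be explored with a small sketch. The helper `split_hyperparams` and the `tagged` dictionary are hypothetical and are not part of noWorkflow; the `class_weight` entry is an assumed example of a non-scalar hyperparameter (scikit-learn classifiers such as RandomForestClassifier accept a dict for that argument).

```python
import numbers


def split_hyperparams(tagged):
    """Split tagged values into scalar hyperparameters and everything else."""
    scalars, others = {}, {}
    for name, value in tagged.items():
        # Numbers (int, float, bool) and strings count as scalars here.
        if isinstance(value, (numbers.Number, str)):
            scalars[name] = value
        else:
            others[name] = value
    return scalars, others


tagged = {
    "pca_components": 6,
    "random_seed": 546,
    "test_dim": 0.4,
    "class_weight": {0: 1.0, 1: 10.0},  # hypothetical non-scalar hyperparameter
}
scalars, others = split_hyperparams(tagged)
```

A filter of this kind would capture the three values tagged so far, while dict- or tuple-valued hyperparameters would need separate handling.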


#### Scoring: model training and transforming features into predictions
##### RandomForest

Train the Random Forest classifier. It is unclear at this point whether a separate model_training tag is redundant here; a single scoring tag may be enough.

In [7]:
#now_tag('scoring')
now_tag('model_training')
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

RandomForestClassifier()

#### Evaluating: evaluating the performance of models
##### RandomForest
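Compute ROC AUC and F1 on the test set. Both metrics are computed from the hard class predictions returned by rf.predict, not from predicted probabilities.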

In [8]:
now_tag('evaluating')
y_pred_rf = rf.predict(X_test)

roc_rf = now_variable('roc_rf', roc_auc_score(y_test, y_pred_rf))
#roc_rf = roc_auc_score(y_test, y_pred_rf)
f1_rf = now_variable('f1_rf', f1_score(y_test, y_pred_rf))
#f1_rf = f1_score(y_test, y_pred_rf)

print("Random Forest - ROC = %f, F1 = %f" % (roc_rf, f1_rf))

Evaluation(id=124, checkpoint=35.752115589000006, code_component_id=1381, activation_id=110, repr=0.9464630846540395)
Evaluation(id=133, checkpoint=35.755205419000006, code_component_id=1397, activation_id=110, repr=0.9448818897637796)
Random Forest - ROC = 0.946463, F1 = 0.944882
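Because roc_auc_score receives hard class labels in this cell, the ROC curve has a single operating point and the area reduces to the mean of the true positive rate and the true negative rate. The sketch below, on made-up labels, computes that value and F1 directly from the confusion counts; the helper name `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


# Made-up labels for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
tpr = tp / (tp + fn)  # recall on the positive class
tnr = tn / (tn + fp)  # recall on the negative class
auc_from_labels = (tpr + tnr) / 2  # area under the one-point ROC curve
f1 = 2 * tp / (2 * tp + fp + fn)
```

Passing rf.predict_proba(X_test)[:, 1] to roc_auc_score instead would score the ranking of the predicted probabilities, which generally gives a different value.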


### Experiment comparison

The steps are:
1. Call get_pre for a given tagged variable and keep the returned operations dictionary.
2. Call store_operations() to store the dictionary in a shelve object under the current trial_id key.
3. Load the shelve object to retrieve the other stored experiments as well as the current one.
4. Call exp_compare with two trial ids as arguments to compare them.
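The store, load, and compare steps listed here can be sketched with the standard library alone. The helpers `store_ops` and `compare_ops`, the trial ids, and the stored values are all hypothetical; they only illustrate the idea and do not reproduce noWorkflow's store_operations or exp_compare.

```python
import os
import shelve
import tempfile


def store_ops(path, trial_id, ops):
    """Persist one trial's operations dict under its trial id."""
    with shelve.open(path) as shelf:
        shelf[trial_id] = ops


def compare_ops(path, id_a, id_b):
    """Return {step: (value_a, value_b)} for every step whose entries differ."""
    with shelve.open(path) as shelf:
        ops_a, ops_b = shelf[id_a], shelf[id_b]
    diffs = {}
    for key in sorted(set(ops_a) | set(ops_b)):
        if ops_a.get(key) != ops_b.get(key):
            diffs[key] = (ops_a.get(key), ops_b.get(key))
    return diffs


# Made-up trials stored in a temporary shelve file
path = os.path.join(tempfile.mkdtemp(), "ops_demo")
store_ops(path, "trial-a", {0: ("roc_auc_score(y_test, y_pred_rf)", "0.95"),
                            1: ("rf", "RandomForestClassifier()")})
store_ops(path, "trial-b", {0: ("roc_auc_score(y_test, y_pred_rf)", "0.93"),
                            1: ("rf", "RandomForestClassifier()")})
diffs = compare_ops(path, "trial-a", "trial-b")
```

Opening the shelf in a with block closes it after each operation, so the stored trials are flushed to disk before the next read.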



In [9]:
ops_dict = get_pre('roc_rf')

In [11]:
id_1 = __noworkflow__.trial_id
store_operations(id_1, ops_dict)

Dictionary stored in shelve.


In [12]:
import shelve
shelf = shelve.open('ops')
list_id = list(shelf.keys())
list_id
#exp_compare(list_id[-1], list_id[0])

['82fec392-b391-4185-abba-9aab60db6223',
 'cceac061-2ade-48cc-b96e-281123a08732']

In [13]:
exp_compare(list_id[-1], list_id[0])

Pipelines A and B differ in lenght
Key '0': Values are different
->>> ('roc_auc_score(y_test, y_pred_rf)', '0.9464630846540395') ('roc_auc_score(y_test, y_pred_rf)', '0.9292803970223326')
Key '1': Values are different
->>> ('y_pred_rf', 'array([1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0,\n       1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1,\n       0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1,\n       0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0,\n       1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1,\n       0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0,\n       1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1,\n       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0,\n       1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1,\n       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0,\n       0, 1, 1, 1, 0, 

### Testing difflib 

In [14]:
dict1 = shelf[list_id[-1]]
dict2 = shelf[list_id[0]]

In [15]:
dict1

{0: ('35.752115589000006',
  '124',
  'roc_auc_score(y_test, y_pred_rf)',
  '0.9464630846540395'),
 1: ('35.750042397',
  '123',
  'y_pred_rf',
  'array([1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0,\n       1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1,\n       0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1,\n       0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0,\n       1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1,\n       0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0,\n       1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1,\n       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0,\n       1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1,\n       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0,\n       0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1,\n       0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0,

In [16]:
%pip install diff_match_patch

Note: you may need to restart the kernel to use updated packages.


In [17]:
from diff_match_patch import diff_match_patch

# Flatten each operations dict into text, one "<code>: <value>" line per step.
# Each stored tuple is (checkpoint, evaluation id, code, value).
text1 = "\n".join(f"{values[2]}: {values[3]}" for values in dict1.values())
text2 = "\n".join(f"{values[2]}: {values[3]}" for values in dict2.values())

# Create a diff_match_patch object
dmp = diff_match_patch()

# Generate the differences
diffs = dmp.diff_main(text1, text2)
dmp.diff_cleanupSemantic(diffs)

# Convert differences to HTML
html_diff = dmp.diff_prettyHtml(diffs)

# Write the HTML diff to a file
with open("list_diff_report.html", "w") as f:
    f.write(html_diff)

In [None]:
nd1 = {}
for key, values in dict1.items():
    nd1[key] = [values[2], values[3]]

nd2 = {}
for key, values in dict2.items():
    nd2[key] = [values[2], values[3]]

In [None]:
nd2

In [None]:
list1 = [str(value) for value in dict1.values()]
list2 = [str(value) for value in dict2.values()]

In [None]:
import difflib

# Create an HtmlDiff object
html_diff = difflib.HtmlDiff()

# Generate the HTML diff report
# (make_file returns a complete HTML document; make_table returns only the table)
diff_report = html_diff.make_file(list1, list2, "List 1", "List 2")

# Write the HTML report to a file
with open("list_diff_report.html", "w") as f:
    f.write(diff_report)

In [None]:
import difflib

In [None]:
diff = difflib.unified_diff(list1, list2, lineterm='')

In [None]:
print('\n'.join(diff))
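The stored values include multi-line array representations, so diffing each value as one long string reports the whole array as changed. A possible refinement, sketched below on made-up data, splits every value on its embedded newlines before calling unified_diff. The helper `to_lines` and the two-field (code, value) tuples are assumptions for illustration.

```python
import difflib


def to_lines(ops):
    """Flatten an operations dict into text lines, splitting multi-line values."""
    lines = []
    for key in sorted(ops):
        code, value = ops[key]
        lines.append(f"[{key}] {code}")
        lines.extend(value.splitlines())
    return lines


# Made-up operations with a two-line array repr that differs in its second line
ops_a = {0: ("y_pred_rf", "array([1, 0,\n       1, 1])")}
ops_b = {0: ("y_pred_rf", "array([1, 0,\n       0, 1])")}

report = list(difflib.unified_diff(to_lines(ops_a), to_lines(ops_b),
                                   "trial A", "trial B", lineterm=""))
print("\n".join(report))
```

With this layout only the array row that actually changed is marked with - and +, while unchanged rows appear as context.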

### Dictdiffer

In [None]:
from dictdiffer import diff

In [None]:
differences = list(diff(dict1, dict2))

In [None]:
for change in differences:
    print(change)

In [None]:
### test

In [None]:
from noworkflow.now.collection.prov_execution.collector import Collector

In [None]:
col = Collector()

In [None]:
import numpy as np
np.diag([1, 2, 3])

In [None]:
a