#### Summer of Reproducibility - noWorkflow base experiment

This notebook implements a base experiment that models a credit card fraud detection problem.

In [20]:
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
import pandas as pd
import xgboost as xgb
import lightgbm as lgb
import catboost as cat

#### Reading the dataset

In [21]:
now_tag('dataset_reading')
df = pd.read_csv('dataset/creditcard.csv', encoding='utf-8')

### Feature engineering stage

Separate the features from the target variable. This is the first step of feature treatment.

In [22]:
#now_tag('feature_eng')
X = df.drop('Class', axis=1)
y = df['Class']

#### Feature engineering: Apply PCA for feature extraction.

This cell carries a hyperparam_def tag because the n_components argument of PCA has to be set.

In [23]:
pca_components = now_variable('pca_components', 15)
pca = PCA(n_components=pca_components)  # Adjust the number of components as needed
X_pca = pca.fit_transform(X)

#### Feature engineering: Apply random undersampling over the extracted features

Another feature engineering operation with a hyperparameter definition. Here it is the random_state value for RandomUnderSampler.

In [24]:
random_seed = now_variable('random_seed', 321654)
rus = RandomUnderSampler(random_state=random_seed)
X_resampled, y_resampled = rus.fit_resample(X_pca, y)

#### Feature engineering: Splitting the dataset into train and test

This cell assigns two hyperparameters: the test_size proportion and the random_state. One option would be to implement logic that collects every scalar value defined under hyperparam_def in a cell. It is not yet clear whether there are corner cases where a hyperparameter could be a vector or an object.

In [25]:
now_tag('feature_eng')
test_dim = now_variable('test_dim', 0.3)
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=test_dim, random_state=random_seed)
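The idea of collecting only scalar hyperparameters can be sketched as a simple type filter. The function name `scalar_hyperparams` and the `class_weight` entry are hypothetical: the latter only illustrates the kind of non-scalar value (a dictionary) that such a filter would skip, and it is not used in this experiment.

```python
import numbers

def scalar_hyperparams(candidates):
    """Keep only the entries whose value is a plain scalar."""
    scalar_types = (numbers.Number, str, type(None))
    return {name: value for name, value in candidates.items()
            if isinstance(value, scalar_types)}

candidates = {
    'pca_components': 15,
    'random_seed': 321654,
    'test_dim': 0.3,
    'class_weight': {0: 1.0, 1: 5.0},  # hypothetical non-scalar hyperparameter
}
scalars = scalar_hyperparams(candidates)
```

A filter like this would silently drop dictionary, tuple, or object-valued hyperparameters, which is exactly the corner case raised above.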

#### Scoring: model training and transforming features into predictions
##### RandomForest

Train the Random Forest classifier. It is not yet clear whether a separate model_training tag is redundant here; at first sight, the scoring tag would be enough.

In [26]:
#now_tag('scoring')
now_tag('model_training')
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

RandomForestClassifier()

#### Evaluating: evaluating the performance of models
##### RandomForest
Compute ROC AUC and F1 on the test set from the predicted class labels.

In [27]:
now_tag('evaluating')
y_pred_rf = rf.predict(X_test)

roc_rf = now_variable('roc_rf', roc_auc_score(y_test, y_pred_rf))
#roc_rf = roc_auc_score(y_test, y_pred_rf)
f1_rf = now_variable('f1_rf', f1_score(y_test, y_pred_rf))
#f1_rf = f1_score(y_test, y_pred_rf)

print("Random Forest - ROC = %f, F1 = %f" % (roc_rf, f1_rf))

Random Forest - ROC = 0.941662, F1 = 0.940789


In [28]:
ops_dict = get_pre('roc_rf')

In [29]:
__noworkflow__.trial_id

'6d78aede-c53c-41ff-8471-942312f26562'

In [11]:
def store_operations(trial, ops_dict):
    import shelve

    # Store the dictionary in a shelve file
    with shelve.open('ops') as shelf:
        shelf[trial] = ops_dict
        print("Dictionary stored in shelve.")

In [30]:
id_1 = __noworkflow__.trial_id
store_operations(id_1, ops_dict)

Dictionary stored in shelve.
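The store step can be checked with a round trip: write a dictionary under a trial id, then read it back. The sketch below uses a temporary directory instead of the notebook's 'ops' file; the `path` parameter, the `load_operations` helper, and the sample entry are all assumptions made for illustration.

```python
import os
import shelve
import tempfile

def store_operations(trial, ops_dict, path):
    """Store ops_dict under the trial id in the shelve file at path."""
    with shelve.open(path) as shelf:
        shelf[trial] = ops_dict

def load_operations(trial, path):
    """Return the stored dictionary, or None if the trial is unknown."""
    with shelve.open(path) as shelf:
        return shelf.get(trial)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'ops')
    # Made-up entry, shaped like the tuples stored by the notebook
    sample = {0: ('0.0', '1', "now_variable('test_dim', 0.3)", '0.3')}
    store_operations('trial-a', sample, path)
    loaded = load_operations('trial-a', path)
    missing = load_operations('trial-b', path)
```

Returning None for an unknown trial avoids the KeyError that direct indexing into the shelf would raise.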


In [31]:
id_1

'6d78aede-c53c-41ff-8471-942312f26562'

In [17]:
def exp_compare(trial_a, trial_b):
    import shelve
    from operator import itemgetter

    # Retrieve the operation dictionaries of both trials from the shelve file
    with shelve.open('ops') as shelf:
        dict_a = shelf[trial_a]
        dict_b = shelf[trial_b]

    # Positions (0-based) of the items to compare in each entry
    item_indices = (2, 3)
    pick = itemgetter(*item_indices)

    for key, values_a in dict_a.items():
        values_b = dict_b.get(key)

        # A key that is missing from trial_b counts as a difference
        if values_b is not None and pick(values_a) == pick(values_b):
            print(f"Key '{key}' has the same values for items {item_indices} in both dictionaries.")
        else:
            print(f"Key '{key}' has different values for items {item_indices} in the dictionaries.")

    # Return only after every key has been compared
    return dict_a, dict_b

In [43]:
id_1 = '636a8483-5c2a-472d-a64c-d309b17088bf'
id_2 = '6d78aede-c53c-41ff-8471-942312f26562'

In [32]:
dict1, dict2 = exp_compare('636a8483-5c2a-472d-a64c-d309b17088bf', '6d78aede-c53c-41ff-8471-942312f26562')

Key '0' has the same values for items (2, 3) in both dictionaries.
Key '1' has the same values for items (2, 3) in both dictionaries.
Key '2' has the same values for items (2, 3) in both dictionaries.
Key '3' has the same values for items (2, 3) in both dictionaries.
Key '4' has the same values for items (2, 3) in both dictionaries.
Key '5' has the same values for items (2, 3) in both dictionaries.
Key '6' has the same values for items (2, 3) in both dictionaries.
Key '7' has the same values for items (2, 3) in both dictionaries.
Key '8' has the same values for items (2, 3) in both dictionaries.
Key '9' has the same values for items (2, 3) in both dictionaries.
Key '10' has the same values for items (2, 3) in both dictionaries.
Key '11' has the same values for items (2, 3) in both dictionaries.
Key '12' has the same values for items (2, 3) in both dictionaries.
Key '13' has the same values for items (2, 3) in both dictionaries.
Key '14' has the same values for items (2, 3) in both dict
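To see what the comparison on items (2, 3) does, the sketch below applies it to two small in-memory dictionaries. It assumes each entry is a 4-tuple like the one printed by the last cell of this notebook, where positions 2 and 3 appear to hold the code expression and its recorded value; the notebook does not state what positions 0 and 1 mean. All values are made up.

```python
from operator import itemgetter

# Select the items at positions 2 and 3 of an entry
pick = itemgetter(2, 3)

# Made-up entries for two hypothetical trials
trial_a = {
    0: ('746.13', '1613', "now_variable('random_seed', 321654)", '321654'),
    1: ('750.02', '1620', "now_variable('test_dim', 0.3)", '0.3'),
}
trial_b = {
    0: ('912.40', '1613', "now_variable('random_seed', 321654)", '321654'),
    1: ('915.77', '1620', "now_variable('test_dim', 0.2)", '0.2'),
}

# Keys whose selected items differ, or that are missing from trial_b
changed = [key for key in trial_a
           if key not in trial_b or pick(trial_a[key]) != pick(trial_b[key])]
```

Here only key 1 ends up in `changed`, because position 0 differs between the trials but is not part of the comparison.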

In [47]:
import shelve

shelf = shelve.open('ops')
shelf[id_2][45]

('746.133485483', '1613', "now_variable('random_seed', 321654)", '321654')