# Final Project
by Kris Johnson

Dataset: [UCI Bitcoin Heist Ransomware](https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset#). Brief description: the entire Bitcoin transaction graph from January 2009 to December 2018. Daily transactions on the network were extracted using a 24-hour time interval to form the Bitcoin graph. Network edges that transfer less than B0.3 were filtered out, as ransom amounts are rarely below this threshold.

Since this is fraud data, we're interested in catching the frauds, with less concern for flagging normal transactions as fraudulent. False alarms matter even less here because some of the addresses labeled normal could actually be fraudulent: "Note that although we are certain about ransomware labels, we do not know if all white addresses are in fact not related to ransomware."

Issues:
- Massive class imbalance
- Don't care which type of ransomware
    - can predict binary fraud or not, with a sub-classifier for the type of fraud
    - could have a classifier that identifies the type of fraud, but probably not useful from a business case

In [33]:
# Imports
import imblearn
from   imblearn.pipeline          import make_pipeline # scikit-learn's Pipeline does not accept imblearn samplers
import numpy as np
import pandas as pd

from   sklearn.ensemble           import RandomForestClassifier
from   sklearn.linear_model       import LogisticRegression
from   sklearn.preprocessing      import StandardScaler
from   sklearn.decomposition      import PCA
from   sklearn.metrics            import balanced_accuracy_score, recall_score, precision_score
from   sklearn.metrics            import make_scorer, fbeta_score

from   sklearn.model_selection    import RandomizedSearchCV
from   sklearn.model_selection    import cross_val_score

from   sklearn.model_selection    import train_test_split

In [19]:
random_seed = 42  # set to None to turn off seeding; used to reproduce results

# Load Preprocessed Data, Split into Train and Test Sets
EDA was performed separately. It also produced extra, possibly predictive columns based on cumulative sums, account (wallet address) age, and time since last activity.

In [3]:
remote_location = 'https://raw.githubusercontent.com/krisrjohnson/ml_lab_final_data/master/Bitcoin_Heist_min_df.csv'
df = pd.read_csv(remote_location)

In [4]:
X = df[df.columns[3:-1]]  # dropping 'address','year', and 'day'
X = X.drop('Date', axis=1)  # remove the date column
X = X.drop('label', axis=1)  # remove target variable

y = df['label']
y = np.ravel(y)
y = np.not_equal(y, 'white').astype(int)  # binary target: 1 = any ransomware label, 0 = white

In [7]:
# hold out a final test set, using random_seed so results are reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=random_seed)
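
With classes this imbalanced, `stratify=y` matters: it keeps the share of ransomware addresses the same in the train and test sets. The sketch below is a simplified, standard-library illustration of that idea, not scikit-learn's actual implementation. The function `stratified_split` and the toy label counts are made up for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=42):
    """Return (train_idx, test_idx), splitting every class at the same rate.

    Hypothetical helper for illustration only; 0.25 mirrors the default
    test size of train_test_split.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random order within the class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (assumed counts): 1,000 white addresses (0), 20 ransomware (1).
toy_labels = [0] * 1000 + [1] * 20
train_idx, test_idx = stratified_split(toy_labels)

train_positives = sum(toy_labels[i] for i in train_idx)
test_positives = sum(toy_labels[i] for i in test_idx)
print(len(train_idx), train_positives, len(test_idx), test_positives)
```

Both halves end up with the same positive rate, so the test metrics are not distorted by a split that happened to receive very few ransomware rows.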

### Randomized Cross Validation Search to get best params
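Using the F2 score, which weighs recall much more heavily than precision (with beta=2, recall is treated as twice as important as precision). The custom scorer is built with `make_scorer`, following the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html).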
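
To make the effect of beta concrete, here is a minimal sketch of the F-beta formula, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), computed from confusion counts. The helper `f_beta` and the counts are invented for illustration, and the helper assumes at least one true positive so that no denominator is zero.

```python
def f_beta(tp, fp, fn, beta):
    """F-beta from confusion counts (hypothetical helper, assumes tp > 0)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Made-up model A: catches most fraud but raises many false alarms.
# precision = 0.4, recall = 0.8
f1_a = f_beta(tp=80, fp=120, fn=20, beta=1)
f2_a = f_beta(tp=80, fp=120, fn=20, beta=2)

# Made-up model B: the mirror image, precise but misses most fraud.
# precision = 0.8, recall = 0.4
f1_b = f_beta(tp=80, fp=20, fn=120, beta=1)
f2_b = f_beta(tp=80, fp=20, fn=120, beta=2)

print(f"A: F1={f1_a:.4f} F2={f2_a:.4f}")
print(f"B: F1={f1_b:.4f} F2={f2_b:.4f}")
```

F1 cannot tell the two models apart, while F2 clearly prefers the high-recall model A, which is the behavior wanted for catching ransomware.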

In [26]:
log_pipe = make_pipeline(
                         imblearn.over_sampling.SMOTE(n_jobs=-1,
                                                      random_state=random_seed),  # seed the synthetic sampling too
                         StandardScaler(),
                         PCA(),
                         LogisticRegression(solver='liblinear',
                                            class_weight='balanced',
                                            random_state=random_seed)  # so results are reproducible
                    )

log_params = dict(smote__sampling_strategy     = np.linspace(0.1, 0.9, 9),  # target ratio of minority to majority class after oversampling; must be > 0
                  smote__k_neighbors           = [1,3,5,7,10],  # number of neighbors to use to construct synthetic data
                  pca__n_components            = [2,3,5],  # how many principal components ("directions") to keep
                  logisticregression__C        = np.logspace(0, 4, 10) )  # 1/C is regularization strength, larger C means less regularization
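
As a rough illustration of what a float `sampling_strategy` means for an over-sampler: the value is the desired minority-to-majority ratio after resampling, so the number of synthetic rows follows from the class counts. The sketch below is an assumption-labeled approximation of that arithmetic, not imblearn's code. `smote_synthetic_count` and the toy counts are hypothetical.

```python
def smote_synthetic_count(n_majority, n_minority, sampling_strategy):
    """Synthetic minority rows needed to reach the requested ratio.

    Hypothetical helper: sampling_strategy is read as
    n_minority_after / n_majority, as for a float in an over-sampler.
    """
    if not 0 < sampling_strategy <= 1:
        raise ValueError("a float sampling_strategy must be in (0, 1]")
    target_minority = int(n_majority * sampling_strategy)
    n_new = target_minority - n_minority
    if n_new < 0:
        # The requested ratio is below the ratio already in the data.
        raise ValueError("ratio would require removing minority samples")
    return n_new


# Toy counts (assumed): 1,000 white rows and 20 ransomware rows.
new_at_01 = smote_synthetic_count(1000, 20, 0.1)
new_at_09 = smote_synthetic_count(1000, 20, 0.9)
print(new_at_01, new_at_09)
```

At a ratio of 0.9 almost all minority rows in the resampled training fold are synthetic, which is worth keeping in mind when comparing cross-validation scores against the untouched test set.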

In [27]:
rf_pipe = make_pipeline(
                        imblearn.under_sampling.RandomUnderSampler(random_state=random_seed),  # seed the undersampling too
                        RandomForestClassifier(n_jobs=-1,
                                               class_weight='balanced',
                                               random_state=random_seed)  # so results are reproducible
                       )

rf_params = dict(randomundersampler__sampling_strategy  = [.5, .75],  # ratio of minority to majority class after undersampling: .5 is 1:2, .75 is 3:4
                 randomforestclassifier__n_estimators   = [100, 200, 400],  # number of trees to fit, 100 is default
                 randomforestclassifier__max_depth      = [d**2 for d in range(3,8)]  # max depth of fitted trees: 9, 16, 25, 36, 49
                )
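
For the under-sampler, a float `sampling_strategy` is again the minority-to-majority ratio after resampling, but here the majority class is cut down instead of the minority class being grown. A minimal sketch of that arithmetic, with a hypothetical helper name and toy counts:

```python
def undersampled_majority_count(n_minority, sampling_strategy):
    """Majority rows kept so that n_minority / n_majority_kept ~ ratio.

    Hypothetical helper for illustration; not imblearn's implementation.
    """
    if not 0 < sampling_strategy <= 1:
        raise ValueError("a float sampling_strategy must be in (0, 1]")
    return int(n_minority / sampling_strategy)


# Toy count (assumed): 20 ransomware rows in the training fold.
kept_at_050 = undersampled_majority_count(20, 0.5)
kept_at_075 = undersampled_majority_count(20, 0.75)
kept_at_100 = undersampled_majority_count(20, 1.0)
print(kept_at_050, kept_at_075, kept_at_100)
```

Only a ratio of 1.0 gives an even split. At 0.5 the white class is still twice the size of the ransomware class, which is one reason `class_weight='balanced'` can still have an effect after undersampling.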

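## Do the RandomizedSearch
The classifiers in the pipelines above are seeded with `random_seed`. The samplers and the search itself are also random, so they need the same seed for the results to be fully reproducible.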
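
A randomized search does not try every combination; it draws `n_iter` candidates from the grid and cross-validates each one. The sketch below mimics that for the random forest grid using only the standard library. The short dictionary keys and the name `rf_grid` are illustrative stand-ins for the real pipeline parameter names, and the drawing logic is a simplification of what scikit-learn does when every parameter is given as a list.

```python
import itertools
import random

# Same values as the random forest grid above, with shortened, made-up keys.
rf_grid = {
    "sampling_strategy": [0.5, 0.75],
    "n_estimators": [100, 200, 400],
    "max_depth": [d ** 2 for d in range(3, 8)],
}

# Every possible combination: 2 * 3 * 5 = 30 candidates.
all_candidates = [
    dict(zip(rf_grid, values))
    for values in itertools.product(*rf_grid.values())
]

n_iter, n_folds = 20, 3
rng = random.Random(42)  # seeding the draw makes the chosen candidates repeatable
sampled = rng.sample(all_candidates, k=n_iter)  # without replacement

n_fits = len(sampled) * n_folds
print(len(all_candidates), len(sampled), n_fits)
```

With 20 candidates and 3 folds the search performs 60 fits per pipeline, covering two thirds of the random forest grid.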

In [28]:
ftwo_scorer = make_scorer(fbeta_score, beta=2)

In [34]:
best_score, best_est = 0, None
best_ests_list, best_scores_list = [], []
algos = [(log_pipe, log_params), (rf_pipe, rf_params)]

for algo, params in algos:

    clf_rand_cv = RandomizedSearchCV(estimator=algo,
                                     param_distributions=params,
                                     n_iter=20,
                                     cv=3,
                                     n_jobs=-1,
                                     scoring=ftwo_scorer,
                                     random_state=random_seed,  # seed the candidate sampling
                                     verbose=True)
    clf_rand_cv.fit(X_train, y_train)
    
    print(clf_rand_cv.best_score_, end='\n\n')
    if clf_rand_cv.best_score_ > best_score:
        best_score, best_est = clf_rand_cv.best_score_, clf_rand_cv.best_estimator_
    
    best_scores_list.append(clf_rand_cv.best_score_)
    best_ests_list.append(clf_rand_cv.best_estimator_)

Fitting 3 folds for each of 20 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.1min
Exception in thread Thread-5:
Traceback (most recent call last):
  File "/Users/krisjohnson/anaconda3/envs/ml/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/Users/krisjohnson/anaconda3/envs/ml/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 567, in run
    self.flag_executor_shutting_down()
  File "/Users/krisjohnson/anaconda3/envs/ml/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 756, in flag_executor_shutting_down
    self.kill_workers()
  File "/Users/krisjohnson/anaconda3/envs/ml/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 766, in kill_workers
    recursive_terminate(p)
  File "/Users/krisjohnson/anaconda3/envs/ml/lib/python3.8/site-packages/joblib/externals/loky/backend/utils.py", line 28, in recursive_terminate
    _recursive_terminate(process.pid)
  File "/Users/krisjohnson/anaconda3/envs/ml/lib/python3.8/site-packages/joblib/externals/loky/backend/utils.py", line 92, in _recursive_terminate
    children_pids = subprocess.check_output(
  File "/Users/krisjohnson/anaconda3/envs/ml/lib/python3.8/subprocess.py", line 411, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/Users/krisjohnson/anaconda3/envs/ml/lib/python3.8/subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['pgrep', '-P', '1119']' died with <Signals.SIGINT: 2>.

KeyboardInterrupt:


# Best Models

In [15]:
log_pipe = make_pipeline(
                         imblearn.over_sampling.SMOTE(
                             sampling_strategy=.9, # target ratio of minority to majority class after oversampling
                             k_neighbors=3,  # number of neighbors to use to construct synthetic data
                             n_jobs=-1       # use all available workers
                         ),
                         StandardScaler(),
                         PCA(n_components=5),  # how many principal components ("directions") to keep
                         LogisticRegression(solver='saga',           # faster for large datasets
                                           class_weight='balanced',  # weights the log loss to account for class imbalance
                                           C=21.54,                  # 1/C is regularization strength, larger C means less regularization
                                           n_jobs=-1))               # use all available workers
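
For reference, `class_weight='balanced'` weights each class by `n_samples / (n_classes * class_count)`, so errors on the rare class cost more in the loss. A small standard-library sketch of that formula, using a hypothetical helper and toy label counts:

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights per class: n_samples / (n_classes * class_count).

    Hypothetical helper reproducing the 'balanced' heuristic by hand.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels (assumed counts): 1,000 white rows (0) and 20 ransomware rows (1).
toy_labels = [0] * 1000 + [1] * 20
weights = balanced_class_weights(toy_labels)
print(weights)
```

In this toy case a misclassified ransomware row weighs 50 times as much as a misclassified white row. Inside the pipeline the weights are computed on the data after resampling, so with SMOTE at a ratio of 0.9 the two weights end up close to each other.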

In [16]:
log_pipe.get_params()

{'memory': None,
 'steps': [('smote', SMOTE(k_neighbors=3, n_jobs=-1, sampling_strategy=0.9)),
  ('standardscaler', StandardScaler()),
  ('pca', PCA(n_components=5)),
  ('logisticregression',
   LogisticRegression(C=21.54, class_weight='balanced', n_jobs=-1, solver='saga'))],
 'verbose': False,
 'smote': SMOTE(k_neighbors=3, n_jobs=-1, sampling_strategy=0.9),
 'standardscaler': StandardScaler(),
 'pca': PCA(n_components=5),
 'logisticregression': LogisticRegression(C=21.54, class_weight='balanced', n_jobs=-1, solver='saga'),
 'smote__k_neighbors': 3,
 'smote__n_jobs': -1,
 'smote__random_state': None,
 'smote__sampling_strategy': 0.9,
 'standardscaler__copy': True,
 'standardscaler__with_mean': True,
 'standardscaler__with_std': True,
 'pca__copy': True,
 'pca__iterated_power': 'auto',
 'pca__n_components': 5,
 'pca__random_state': None,
 'pca__svd_solver': 'auto',
 'pca__tol': 0.0,
 'pca__whiten': False,
 'logisticregression__C': 21.54,
 'logisticregression__class_weight': 'balanced',

In [17]:
rf_pipe = make_pipeline(imblearn.under_sampling.RandomUnderSampler(   # the data is so large that we undersample the non-ransomware labels
                                              sampling_strategy=0.5), # ratio of minority to majority class after undersampling: .5 keeps two white rows per ransomware row
                        RandomForestClassifier(class_weight='balanced', # class weights computed automatically, inversely proportional to class frequency
                                               max_depth=25,          # max tree depth: deeper trees tend to overfit, shallow trees can't fully capture the relationship
                                               n_estimators=400,      # number of trees to fit: more trees give more stable predictions, with diminishing returns
                                               n_jobs=-1))            # use all workers available, speeds up training

In [18]:
rf_pipe.get_params()

{'memory': None,
 'steps': [('randomundersampler', RandomUnderSampler(sampling_strategy=0.5)),
  ('randomforestclassifier',
   RandomForestClassifier(class_weight='balanced', max_depth=25, n_estimators=400,
                          n_jobs=-1))],
 'verbose': False,
 'randomundersampler': RandomUnderSampler(sampling_strategy=0.5),
 'randomforestclassifier': RandomForestClassifier(class_weight='balanced', max_depth=25, n_estimators=400,
                        n_jobs=-1),
 'randomundersampler__random_state': None,
 'randomundersampler__replacement': False,
 'randomundersampler__sampling_strategy': 0.5,
 'randomforestclassifier__bootstrap': True,
 'randomforestclassifier__ccp_alpha': 0.0,
 'randomforestclassifier__class_weight': 'balanced',
 'randomforestclassifier__criterion': 'gini',
 'randomforestclassifier__max_depth': 25,
 'randomforestclassifier__max_features': 'auto',
 'randomforestclassifier__max_leaf_nodes': None,
 'randomforestclassifier__max_samples': None,
 'randomforestclassi

# Best Model Results on Test set

In [None]:
# using rf_pipe, defined right above
pipe = rf_pipe

In [None]:
pipe.fit(X_train, y_train)  # retrain best model on all training data
test_predictions = pipe.predict(X_test)  # create predictions

In [35]:
# Get evaluation metrics
precision = precision_score(y_test, test_predictions)
recall = recall_score(y_test, test_predictions)
f2 = fbeta_score(y_test, test_predictions, beta=2)

print('Best Recall: {:.4f}, Precision: {:.4f}, and F2 score: {:.4f}'.format(recall, precision, f2))

NameError: name 'test_predictions' is not defined