# Final Project
by Kris Johnson

# Research Goal
---

Can we accurately classify Bitcoin transactions into either Ransomware or non-ransomware categories?


Dataset from [UCI Bitcoin Heist Ransomware](https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset#). Brief Description from UCI: 

    Entire Bitcoin transaction graph from 2009 January to 2018 December. Using a time interval of 24 hours, daily transactions on the network to form the Bitcoin graph. Filtered out network edges that transfer less than B0.3, as ransom amounts are rarely below this threshold.

Since this is equivalent to searching for rare fraud events, catching the fraud matters more than avoiding false alarms on normal transactions. That holds even more here because some of the data labeled normal could actually be fraudulent: "Note that although we are certain about ransomware labels, we do not know if all white addresses are in fact not related to ransomware."

Issues:
- Massive Class imbalance
    - original data was 1.4% ransomWare vs 98.6% normal
- Don't care which type of ransomware
    - predict Binary Fraud or not
    - could classify type of ransomWare attack, but not useful from a business case

In [61]:
# Imports - Colab has issues when importing only imblearn
import warnings # for dealing with non-convergence for default logistic regression
import imblearn
from   imblearn                   import under_sampling, over_sampling
from   imblearn.pipeline          import make_pipeline # scikit-learn Pipeline does not work with imblearn

import numpy as np
import pandas as pd

from   sklearn.ensemble           import RandomForestClassifier
from   sklearn.linear_model       import LogisticRegression
from   sklearn.preprocessing      import StandardScaler
from   sklearn.decomposition      import PCA
from   sklearn.metrics            import balanced_accuracy_score, recall_score, precision_score
from   sklearn.metrics            import make_scorer, fbeta_score

from   sklearn.model_selection    import RandomizedSearchCV
from   sklearn.model_selection    import cross_val_score

from   sklearn.model_selection    import train_test_split

In [19]:
random_seed = 42  # set to None to turn off seeding. Using to replicate results

# Load Preprocessed Data, Split into Train and Test Sets
EDA was performed separately, along with creating extra, possibly predictive columns from cumulative sums, account (wallet address) age, and time since last activity.

The data has been downsampled from the original 2.9 million records to 558,964 records to satisfy size requirements. Only the majority class (non-ransomware) was downsampled, to preserve the precious few minority labels.

That leaves 41,413 ransomware labels against 517,551 normal transaction labels.
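
As a quick check on those counts, the arithmetic below shows how imbalanced the downsampled data still is. It is a minimal sketch that uses only the two counts quoted above. The variable names are illustrative and do not appear in the notebook.

```python
# Class counts quoted in the text above (after downsampling the majority class).
n_ransomware = 41_413
n_normal = 517_551

total = n_ransomware + n_normal
minority_share = n_ransomware / total           # fraction of all records
minority_to_majority = n_ransomware / n_normal  # the ratio that resampling adjusts

print(f"total records:        {total:,}")
print(f"ransomware share:     {minority_share:.2%}")
print(f"minority : majority = {minority_to_majority:.3f}")
```

The minority-to-majority ratio of roughly 0.08 is the starting point that the over- and under-sampling steps later in the notebook try to raise.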

In [3]:
remote_location = 'https://raw.githubusercontent.com/krisrjohnson/ml_lab_final_data/master/Bitcoin_Heist_min_df.csv'
df = pd.read_csv(remote_location)

In [4]:
X = df[df.columns[3:-1]]  # dropping 'address','year', and 'day'
X = X.drop('Date', axis=1)  # remove the date column
X = X.drop('label', axis=1)  # remove target variable

y = df['label']
y = np.ravel(y)
y = np.not_equal(y, 'white').astype(int)

In [7]:
# hold out a final test set, using random_state so results are reproducible
X_train, X_test, y_train, y_test = train_test_split(X,y, stratify=y, random_state=random_seed)

# Randomized Cross Validation Search to get best params


Because we're trying to predict ransomware, a very rare event, we want very high recall. However, optimizing for recall alone would lead to predicting everything as ransomware, since that yields a recall of 1. Recall measures what fraction of the ransomware records in the data we found, with no penalty for false alarms. That penalty is captured by precision, which measures what fraction of our ransomware predictions were correct. The F-score combines precision and recall using the harmonic mean, so to fully contextualize our results we need all three metrics. The unweighted harmonic mean is typically called the F1 score. We'll weight it to favor recall over precision by setting the beta parameter to 2, which gives what is typically called the F2 score.
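
To make the trade-off concrete, here is a minimal sketch of the F-beta formula in plain Python. The confusion-matrix counts are hypothetical and chosen only for illustration. The helper names `precision_recall` and `f_beta` are not part of the notebook or of scikit-learn.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from confusion-matrix counts."""
    return tp / (tp + fp), tp / (tp + fn)

def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall; beta > 1 favors recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts: 100 true ransomware records among 1,000 transactions.
flag_everything = precision_recall(tp=100, fp=900, fn=0)   # recall 1.0, precision 0.1
selective       = precision_recall(tp=80, fp=120, fn=20)   # recall 0.8, precision 0.4

for name, (p, r) in [("flag everything", flag_everything), ("selective", selective)]:
    print(f"{name:16s} precision={p:.2f} recall={r:.2f} "
          f"F1={f_beta(p, r, 1):.3f} F2={f_beta(p, r, 2):.3f}")
```

Even with beta set to 2, the "flag everything" strategy scores well below the selective one, which is why F2 is a safer tuning target than recall alone.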

Code for making your own scorer directly from the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html)

### Pipes and Defaults case

Pipes are very useful to manage a lot of the machine learning overhead. Here I create pipes to enable Cross Validation with randomized parameter searching. 

The Logistic Regression pipe first uses SMOTE to create synthetic data from the minority class, as a form of over-sampling the minority. We only have numeric columns, which is a requirement for SMOTE. The sampling strategy argument is the target ratio of minority to majority labels, which determines how many minority records get synthesized. The k_neighbors argument sets how many nearest minority neighbors are considered when synthesizing a new point in between existing ones.
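
The interpolation idea behind SMOTE can be sketched in a few lines of plain Python. This is an illustration of the core step only, under simplifying assumptions (tiny hypothetical data, brute-force neighbor search), and not imblearn's implementation. The function names are made up for this sketch.

```python
import math
import random

def nearest_neighbors(points, index, k):
    """Return indices of the k minority points closest to points[index]."""
    base = points[index]
    others = [i for i in range(len(points)) if i != index]
    others.sort(key=lambda i: math.dist(base, points[i]))
    return others[:k]

def synthesize(points, k, rng):
    """Create one synthetic minority point by interpolating toward a neighbor."""
    i = rng.randrange(len(points))
    j = rng.choice(nearest_neighbors(points, i, k))
    gap = rng.random()  # position along the segment, in [0, 1)
    new_point = tuple(a + gap * (b - a) for a, b in zip(points[i], points[j]))
    return new_point, points[i], points[j]

# Hypothetical minority-class points with two numeric features each.
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (5.0, 5.0)]
rng = random.Random(42)
new_point, base, neighbor = synthesize(minority, k=2, rng=rng)
print(new_point, "lies between", base, "and", neighbor)
```

Because every synthetic point is a blend of two real minority points, the features have to be numeric, which matches the requirement mentioned above.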

Then for logistic regression I use a StandardScaler to normalize the features and keep track of the normalization parameters. This ensures the same normalization gets applied to any validation or test set predicted from this pipeline later on, which is a main benefit of using pipes. Otherwise we'd have to maintain these values manually. PCA is used to compress the data further to just its principal components, or "directions" of most variability. Again the pipeline tracks how to project onto those components, so I don't have to when predicting on the validation and test sets. During cross validation I search over the number of components to feed to logistic regression. Finally, logistic regression performs the binary classification, with class weighting set to balanced so that errors on the ransomware class count in proportion to how rare it is. Balanced means the log loss function penalizes errors on the minority class more heavily, since those records are seen much less often due to the data imbalance. The 'saga' solver is better suited to larger datasets.
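
For a sense of scale, the sketch below applies the formula scikit-learn documents for `class_weight='balanced'`, which is `n_samples / (n_classes * count_per_class)`. It uses the full-dataset counts from the text purely as an illustration. Inside the pipeline the classifier sees the data after SMOTE has resampled it, so the weights it actually computes will differ. The helper name `balanced_class_weights` is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights as n_samples / (n_classes * count_per_class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Full-dataset counts from the text: 0 = normal, 1 = ransomware.
labels = [0] * 517_551 + [1] * 41_413
weights = balanced_class_weights(labels)
print({cls: round(w, 3) for cls, w in weights.items()})
```

On these counts an error on a ransomware record weighs roughly 12.5 times as much as an error on a normal record.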

The RandomForest pipeline undersamples the majority class to bring the minority to majority class ratio closer to the sampling strategy argument. Because the dataset is large relative to what a single machine can process, undersampling makes more training runs feasible, even though it discards more information on top of the original downsampling.
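
The effect of the sampling strategy on the majority class can be worked out by hand. The sketch below assumes the float is read as the target ratio of minority to majority records after resampling, and uses the full-dataset ransomware count for illustration (the training split is smaller). The helper `majority_kept` is hypothetical, and the exact rounding inside imblearn may differ.

```python
def majority_kept(n_minority, sampling_strategy):
    """Majority rows left when the target ratio is n_minority / n_majority."""
    return int(n_minority / sampling_strategy)

n_minority = 41_413  # full-dataset ransomware count from the text
for ratio in (0.5, 0.75, 1.0):
    n_majority = majority_kept(n_minority, ratio)
    print(f"sampling_strategy={ratio}: {n_minority:,} minority vs {n_majority:,} majority")
```

Under that reading, a ratio of 0.5 keeps two normal records per ransomware record, and only a ratio of 1.0 gives an even split.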

The RandomForest classifier is set up to experiment with the number of trees to fit, where more trees make it more likely to approach the asymptotic best fit at the cost of training time. The max depth parameter determines how deep each decision tree is allowed to grow: deeper trees tend to overfit, while shallow trees are unlikely to capture the relationship between our input variables and our target.

I run a cross_val_score without supplying hyperparameters to get the default scores.

In [51]:
log_pipe = make_pipeline(
                         imblearn.over_sampling.SMOTE(n_jobs=-1),
                         StandardScaler(),
                         PCA(),
                         LogisticRegression(solver='saga',  # better for larger datasets
                                            class_weight='balanced',
                                            random_state=random_seed)  # so results are replicatable
                    )

log_params = dict(smote__sampling_strategy     = np.linspace(0.1, 0.9, 9),  # target ratio of minority class to majority; 0 is not a valid ratio
                  smote__k_neighbors           = [1,3,5,7,10],  # number of neighbors to use to construct synthetic data
                  pca__n_components            = [2,3,5],  # How many main PCA "directions" to use
                  logisticregression__C        = np.logspace(0, 4, 10) )  # 1/C is regularization strength, larger C means less regularization

In [52]:
rf_pipe = make_pipeline(
                        imblearn.under_sampling.RandomUnderSampler(),
                        RandomForestClassifier(n_jobs=-1, 
                                               class_weight='balanced',
                                               random_state=random_seed)  # so results are replicatable
                       )

rf_params = dict(randomundersampler__sampling_strategy  = [.5, .75],  # new ratio of minority class to majority: .5 is 1:2, .75 is 3:4 (1.0 would be an even split)
                 randomforestclassifier__n_estimators   = [100, 200, 400],  # number of Trees to fit, 100 is default
                 randomforestclassifier__max_depth      = [_**2 for _ in range(3,8)]  # max depth of fitted trees
                )
                 #  smote__k_neighbors                   = [4],

In [53]:
# using f-score with beta=2 to over-emphasize Recall
ftwo_scorer = make_scorer(fbeta_score, beta=2)

In [62]:
warnings.filterwarnings("ignore")

algos = [(log_pipe, None), (rf_pipe, None)]

for algo, params in algos:
    algo.fit(X_train, y_train)

    recall = cross_val_score(algo, X_train, y_train, cv=3, scoring='recall')
    print('Recall', np.mean(recall))
    
    precision = cross_val_score(algo, X_train, y_train, cv=3, scoring='precision')
    print('Precision', np.mean(precision))
    
    f2 = cross_val_score(algo, X_train, y_train, cv=3, scoring=ftwo_scorer)
    print('F2', np.mean(f2), end='\n\n')
    

Recall 0.49397932215010876
Precision 0.31259532482194136
F2 0.4421033736061286

Recall 0.8609464283094449
Precision 0.3115339487941265
F2 0.6393018634013248



So for the defaults at least, RandomForest looks more promising, with a recall of .86, though its precision is still really low.

RandomForests:

    Recall 0.8632324533161622
    Precision 0.31588642191599725
    F2 0.6410036453072456

## Do the RandomizedSearch 
The random seed is set on the classifiers in the pipelines above so results should be reproducible, although the samplers themselves are not seeded.

Both pipelines are fit using the custom F2 score described above.
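
To show what a randomized search covers, the sketch below enumerates the random-forest search space from the notebook and draws 20 candidates from it. It is an illustration in plain Python, not scikit-learn's own sampler. It relies on the documented behavior that when every parameter is given as a list, candidates are sampled without replacement.

```python
import itertools
import random

# The random-forest search space from the notebook, written as plain lists.
rf_grid = {
    "randomundersampler__sampling_strategy": [0.5, 0.75],
    "randomforestclassifier__n_estimators": [100, 200, 400],
    "randomforestclassifier__max_depth": [d ** 2 for d in range(3, 8)],
}

names = list(rf_grid)
all_combos = [dict(zip(names, values)) for values in itertools.product(*rf_grid.values())]

rng = random.Random(42)
candidates = rng.sample(all_combos, 20)  # n_iter=20, drawn without replacement

print(len(all_combos), "possible combinations,", len(candidates), "tried")
print("fits with cv=3:", len(candidates) * 3)
```

With 20 of 30 combinations tried, the random-forest search covers two thirds of its grid. The logistic regression grid is far larger, so the same 20 draws sample it much more sparsely.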

In [None]:
ftwo_scorer = make_scorer(fbeta_score, beta=2)

In [34]:
best_score, best_est = 0, None
best_ests_list, best_scores_list = [], []
algos = [(log_pipe, log_params), (rf_pipe, rf_params)]

for algo, params in algos:

    clf_rand_cv = RandomizedSearchCV(estimator=algo, 
                               param_distributions=params,
                               n_iter = 20,
                               cv=3,
                               n_jobs=-1,
                               scoring=ftwo_scorer,
                               verbose=True,
                               random_state=random_seed)
    clf_rand_cv.fit(X_train, y_train)
    
    print(clf_rand_cv.best_score_, end='\n\n')
    if clf_rand_cv.best_score_ > best_score:
        best_score, best_est = clf_rand_cv.best_score_, clf_rand_cv.best_estimator_
    
    best_scores_list.append(clf_rand_cv.best_score_)
    best_ests_list.append(clf_rand_cv.best_estimator_)

Fitting 3 folds for each of 20 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.1min
Exception in thread Thread-5:
Traceback (most recent call last):
  File "/Users/krisjohnson/anaconda3/envs/ml/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/Users/krisjohnson/anaconda3/envs/ml/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 567, in run
    self.flag_executor_shutting_down()
  File "/Users/krisjohnson/anaconda3/envs/ml/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 756, in flag_executor_shutting_down
    self.kill_workers()
  File "/Users/krisjohnson/anaconda3/envs/ml/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 766, in kill_workers
    recursive_terminate(p)
  File "/Users/krisjohnson/anaconda3/envs/ml/lib/python3.8/site-packages/joblib/externals/loky/backend/utils.py", line 28, in recursive_terminate
    _re

KeyboardInterrupt: 

    _recursive_terminate(process.pid)
  File "/Users/krisjohnson/anaconda3/envs/ml/lib/python3.8/site-packages/joblib/externals/loky/backend/utils.py", line 92, in _recursive_terminate
    children_pids = subprocess.check_output(
  File "/Users/krisjohnson/anaconda3/envs/ml/lib/python3.8/subprocess.py", line 411, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/Users/krisjohnson/anaconda3/envs/ml/lib/python3.8/subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['pgrep', '-P', '1119']' died with <Signals.SIGINT: 2>.


In [64]:
clf_rand_cv.best_estimator_

AttributeError: 'RandomizedSearchCV' object has no attribute 'best_estimator_'

# Best Logistic Regression Model

In [None]:
log_pipe = make_pipeline(
                         imblearn.over_sampling.SMOTE(
                             sampling_strategy=.9, # what new ratio of minority class to majority should be
                             k_neighbors=3,  # number of neighbors to use to construct synthetic data
                             n_jobs=-1       # use all available workers
                         ),
                         StandardScaler(),
                         PCA(n_components=5),  # How many main PCA "directions" to use
                         LogisticRegression(solver='saga',           # faster for large datasets
                                           class_weight='balanced',  # Weights the log_loss function to account for class imbalances
                                           C=21.54,                  # 1/C is regularization strength, larger C means less regularization
                                           n_jobs=-1))               # use all available workers

In [None]:
log_pipe.get_params()

# Best Overall Model - RandomForests

I'm choosing the RandomForest tuned to a max tree depth of 25 and 400 decision trees as the best model, even though its recall, the metric we're most interested in, is slightly worse than the default's. Its precision is significantly higher, though still below .5, meaning it will significantly reduce the number of false alarms. Precision that is too low makes a model unusable.

In [43]:
pipe = make_pipeline(imblearn.under_sampling.RandomUnderSampler(        # Because data is so large, undersampling non-ransomware labels
                                              sampling_strategy=0.5),   # new ratio of minority class to majority, .5 is 1 minority : 2 majority
                        RandomForestClassifier(class_weight='balanced', # class weights calculated automatically to help with imbalanced data
                                               max_depth=25,            # max tree depth, deeper trees will overfit, shallow trees can't fully capture relationship
                                               n_estimators=400,        # number of Trees to fit, more trees means more likely to hit the asymptotic limit of fitting
                                               n_jobs=-1))              # use all workers available, speeds up training

In [44]:
pipe

Pipeline(steps=[('randomundersampler',
                 RandomUnderSampler(sampling_strategy=0.5)),
                ('randomforestclassifier',
                 RandomForestClassifier(class_weight='balanced', max_depth=25,
                                        n_estimators=400, n_jobs=-1))])

In [45]:
pipe.get_params()

{'memory': None,
 'steps': [('randomundersampler', RandomUnderSampler(sampling_strategy=0.5)),
  ('randomforestclassifier',
   RandomForestClassifier(class_weight='balanced', max_depth=25, n_estimators=400,
                          n_jobs=-1))],
 'verbose': False,
 'randomundersampler': RandomUnderSampler(sampling_strategy=0.5),
 'randomforestclassifier': RandomForestClassifier(class_weight='balanced', max_depth=25, n_estimators=400,
                        n_jobs=-1),
 'randomundersampler__random_state': None,
 'randomundersampler__replacement': False,
 'randomundersampler__sampling_strategy': 0.5,
 'randomforestclassifier__bootstrap': True,
 'randomforestclassifier__ccp_alpha': 0.0,
 'randomforestclassifier__class_weight': 'balanced',
 'randomforestclassifier__criterion': 'gini',
 'randomforestclassifier__max_depth': 25,
 'randomforestclassifier__max_features': 'auto',
 'randomforestclassifier__max_leaf_nodes': None,
 'randomforestclassifier__max_samples': None,
 'randomforestclassi

# Best Model Results on Test set

Now that we have our best model, we can see how it performs on the held-out test set as a final evaluation. We'll use the F1 score here since it's more commonly used and therefore more interpretable.

In [None]:
# using pipe, defined right above

In [46]:
pipe.fit(X_train, y_train)  # retrain best model on all training data
test_predictions = pipe.predict(X_test)  # create predictions

In [60]:
# Get evaluation metrics - primarily interested in Recall 
# need to know Precision as well, for context
# and f1 score is helpful to describe their joint relationship and to a broader audience
recall = recall_score(y_test, test_predictions)
precision = precision_score(y_test, test_predictions)
f1 = fbeta_score(y_test, test_predictions, beta=1)

print('Best Recall: {:.4f}, Precision: {:.4f}, and F1 score: {:.4f}'.format(recall, precision, f1))

Best Recall: 0.8204, Precision: 0.4281, and F1 score: 0.5626


So our final Recall score is .82, meaning we'll identify actual ransomWare 82% of the time and miss it the other 18%. So about one in five actual ransomWare transactions will be missed and labeled as normal transactions.

The downside is the Precision metric of .43 means that even though we'll catch 82% of ransomWare records, it's at the cost of our ransomWare guesses being wrong more than half the time. In other words, with a precision of .43, for every 100 times we predict ransomWare, we'll be right 43 times and wrong 57! That means a lot of wasted time looking at actual normal transaction records to try and verify if they're ransomWare or not.
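
The numbers above can be turned into a per-100 breakdown with a few lines of arithmetic. This sketch uses only the recall and precision reported on the test set and recomputes F1 from them as a consistency check. The variable names are illustrative.

```python
recall, precision = 0.8204, 0.4281  # test-set values reported above

f1 = 2 * precision * recall / (precision + recall)
false_alarms_per_hit = (1 - precision) / precision

print(f"F1 from precision and recall: {f1:.4f}")
print(f"Of 100 flagged transactions:  {precision * 100:.0f} ransomware, "
      f"{(1 - precision) * 100:.0f} false alarms")
print(f"Of 100 ransomware records:    {recall * 100:.0f} caught, "
      f"{(1 - recall) * 100:.0f} missed")
print(f"False alarms per true catch:  {false_alarms_per_hit:.2f}")
```

The recomputed F1 agrees with the reported score, and the last line puts the review cost in simple terms: a bit more than one false alarm for every real ransomware record caught.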

In the end, our results certainly aren't as robust as we'd like. With such low precision we're going to be wading through a lot of false alarms if we put this into production to catch ransomware in real time.

# Next steps

The initial data description did point out that the 'white' labeled, or normal, transactions weren't all vetted as free of ransomware. So for next steps we could turn this into a semi-supervised problem by taking the highest-confidence false positives, meaning ransomware predictions on records labeled as normal, and digging into those further to see if they are in fact ransomware. Then we'd retrain our model on the updated data, look at the new highest-confidence false positives to see if they're wrongly labeled, and do it all again. In this way we might uncover more ransomware attacks that aren't being flagged, which would be a great win in further uncovering and identifying fraud in the blockchain. Exposing and stopping nefarious activity will ultimately lead to more trust in the blockchain.
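
One way the review step of that loop might look is sketched below. It ranks the 'white' records the model flags by predicted probability so the most confident ones get checked first. The labels and probabilities are hypothetical, and `top_false_positives` is a made-up helper. In practice the probabilities could come from something like `pipe.predict_proba(X)[:, 1]`, which is an assumption about how this would be wired up rather than something the notebook does.

```python
def top_false_positives(y_true, ransomware_proba, threshold=0.5, n=3):
    """Indices of 'white' rows the model flags, most confident first."""
    flagged = [
        (p, i) for i, (label, p) in enumerate(zip(y_true, ransomware_proba))
        if label == 0 and p >= threshold
    ]
    flagged.sort(reverse=True)
    return [i for _, i in flagged[:n]]

# Hypothetical labels (1 = ransomware, 0 = white) and predicted probabilities.
y_true = [0, 1, 0, 0, 1, 0, 0]
proba  = [0.10, 0.95, 0.91, 0.55, 0.40, 0.97, 0.30]

print(top_false_positives(y_true, proba))
```

Records confirmed as ransomware after review would be relabeled before the next round of training.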