### Goal of this project
Given a Bitcoin address and some metadata about that address,
we want to predict whether the address has been used to receive ransom payments in the past.

### Notebook Workflow
- Setup Libraries
- Data-set
- Preprocessing
- Apply Classification Models
- Apply Voting Methods
- Apply Stacking Classifier
- The ROC curve
- Confusion Matrix
- Store The Models

# 

### Setup

- Data Manipulation Libraries:

In [1]:
import pandas as pd
import numpy  as np

- Visualization Libraries:

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')
%matplotlib inline

- Storage Libraries:

In [3]:
import pickle

- Pre-processing Libraries:

In [4]:
from sklearn.preprocessing    import StandardScaler
from sklearn.model_selection  import train_test_split

- Classifier Libraries:

In [5]:
from sklearn.neighbors     import KNeighborsClassifier
from sklearn.linear_model  import LogisticRegression 
from sklearn.ensemble      import RandomForestClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.naive_bayes   import BernoulliNB, MultinomialNB, GaussianNB
from mlxtend.classifier    import StackingClassifier

- Metrics Libraries:

In [6]:
from sklearn.metrics       import recall_score, precision_score, f1_score, accuracy_score, log_loss, confusion_matrix, fbeta_score

# 

### Data-set

In [7]:
file_path = 'under_sample_Data.csv'

data = pd.read_csv(file_path)

# 

### Preprocessing

- Split the data into (Features, Target)

In [8]:
X = data.drop('label',axis=1)
y = data.label

- Drop the unneeded column

In [9]:
to_be_dropped = ['address']
X = X.drop(to_be_dropped, axis=1, errors='ignore')

- Data Scaling

In [10]:
std_scale = StandardScaler()
X_sc = std_scale.fit_transform(X) 

- Split the data into (Train, Test) datasets

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X_sc, y, test_size=0.2, random_state=42)
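
The cells above fit `StandardScaler` on the full data set before splitting, so the test rows contribute to the mean and standard deviation. A common alternative is to estimate the scaling statistics on the training rows only and reuse them for the test rows. The sketch below illustrates the idea with the standard library on made-up numbers; `fit_standardizer` and `standardize` are hypothetical helper names, not part of scikit-learn.

```python
from statistics import fmean, pstdev

def fit_standardizer(values):
    """Return (mean, std) estimated from the given values only."""
    mean = fmean(values)
    std = pstdev(values)  # population std, the convention StandardScaler uses
    # Guard against constant columns so we never divide by zero
    return mean, (std if std > 0 else 1.0)

def standardize(values, mean, std):
    """Apply previously estimated statistics to any set of values."""
    return [(v - mean) / std for v in values]

# Made-up single feature column, already split into train and test
train_col = [2.0, 4.0, 6.0, 8.0]
test_col = [5.0, 11.0]

mean, std = fit_standardizer(train_col)              # statistics from train only
train_scaled = standardize(train_col, mean, std)
test_scaled = standardize(test_col, mean, std)       # reuse the train statistics
```

With scikit-learn this corresponds to splitting first, then calling `fit_transform` on the training features and `transform` on the test features. Doing so in this notebook may shift the scores reported in the later cells.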

# 

### Apply Classification Models

- Create the model instances

In [12]:
lr_model  = LogisticRegression(solver="lbfgs", random_state=1)
knn_model = KNeighborsClassifier(n_neighbors=13, weights='uniform')
rf_model  = RandomForestClassifier(n_estimators=100, random_state=1)
et_model  = ExtraTreesClassifier(n_estimators=100, random_state=1)
Br_model  = BernoulliNB()
Gn_model  = GaussianNB()

models = ["lr_model", "knn_model", "rf_model", "et_model","Br_model","Gn_model"]

- Training the models

In [13]:
for model_name in models:
    
    curr_model = eval(model_name)
    
    curr_model.fit(X_train, y_train)

- Quick peek at each model's performance (accuracy on the test set)

In [14]:
for model_name in models:
    curr_model = eval(model_name)
    print(f'{model_name} score: {curr_model.score(X_test, y_test)}')

lr_model score: 0.7520222141736086
knn_model score: 0.8883254859350477
rf_model score: 0.9553905589762164
et_model score: 0.9489315465411083
Br_model score: 0.7598696124592539
Gn_model score: 0.6070264396957624


In [15]:
model_vars = [eval(n) for n in models]
model_list = list(zip(models, model_vars))

# 

#### Get the prediction scores for each model.

- Prediction variables 

In [16]:
Lr_y_predict  = lr_model.predict(X_test)  # LogisticRegression
knn_y_predict = knn_model.predict(X_test) # KNeighborsClassifier
Rf_y_predict  = rf_model.predict(X_test)  # RandomForestClassifier
Et_y_predict  = et_model.predict(X_test)  # ExtraTreesClassifier
Br_y_predict  = Br_model.predict(X_test)  # BernoulliNB
Gn_y_predict  = Gn_model.predict(X_test)  # GaussianNB

- Precision vs Recall 

In [17]:
print(" Logistic Regression scores in Precision: {:6.4f}, Recall: {:6.4f}".format(precision_score(y_test, Lr_y_predict), 
                                                                                      recall_score(y_test, Lr_y_predict)))
print("______________________________________________________________________")

print(" KNeighbors scores in Precision: {:6.4f}, Recall: {:6.4f}".format(precision_score(y_test, knn_y_predict), 
                                                                                      recall_score(y_test, knn_y_predict)))
print("______________________________________________________________________")

print(" Random Forest scores in Precision: {:6.4f}, Recall: {:6.4f}".format(precision_score(y_test, Rf_y_predict), 
                                                                                       recall_score(y_test, Rf_y_predict)))
print("______________________________________________________________________")

print(" Extra Trees scores in Precision: {:6.4f}, Recall: {:6.4f}".format(precision_score(y_test, Et_y_predict), 
                                                                                       recall_score(y_test, Et_y_predict)))
print("______________________________________________________________________")

print(" BernoulliNB scores in Precision: {:6.4f}, Recall: {:6.4f}".format(precision_score(y_test, Br_y_predict), 
                                                                                       recall_score(y_test, Br_y_predict)))
print("______________________________________________________________________")

print(" GaussianNB  scores in Precision: {:6.4f}, Recall: {:6.4f}".format(precision_score(y_test, Gn_y_predict), 
                                                                                       recall_score(y_test, Gn_y_predict)))

 Logistic Regression scores in Precision: 0.8028, Recall: 0.6678
______________________________________________________________________
 KNeighbors scores in Precision: 0.8529, Recall: 0.9384
______________________________________________________________________
 Random Forest scores in Precision: 0.9584, Recall: 0.9520
______________________________________________________________________
 Extra Trees scores in Precision: 0.9486, Recall: 0.9493
______________________________________________________________________
 BernoulliNB scores in Precision: 0.8451, Recall: 0.6361
______________________________________________________________________
 GaussianNB  scores in Precision: 0.5615, Recall: 0.9755


# 

- F scores using the default beta (beta = 1) and a recall-weighted beta (beta = 2)

In [18]:
f_score_inputs = [
    ("Logistic Regression", Lr_y_predict),
    ("KNeighbors", knn_y_predict),
    ("Random Forest", Rf_y_predict),
    ("Extra Trees", Et_y_predict),
    ("BernoulliNB", Br_y_predict),
    ("GaussianNB", Gn_y_predict),
]

# beta is keyword-only in recent scikit-learn releases
for name, y_hat in f_score_inputs:
    print(" {} scores in F1: {:6.4f}, F2: {:6.4f}".format(name,
                                                          f1_score(y_test, y_hat),
                                                          fbeta_score(y_test, y_hat, beta=2)))
    print("______________________________________________________________________")


 Logistic Regression scores in F1: 0.7291, F2: 0.6911
______________________________________________________________________
 KNeighbors scores in F1: 0.8936, F2: 0.9200
______________________________________________________________________
 Random Forest scores in F1: 0.9552, F2: 0.9533
______________________________________________________________________
 Extra Trees scores in F1: 0.9489, F2: 0.9491
______________________________________________________________________
 BernoulliNB scores in F1: 0.7258, F2: 0.6692
______________________________________________________________________
 GaussianNB scores in F1: 0.7127, F2: 0.8501
______________________________________________________________________
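
For reference, the F-beta score combines precision `P` and recall `R` as `(1 + beta^2) * P * R / (beta^2 * P + R)`, so beta = 2 weighs recall more heavily than precision. The sketch below recomputes the Logistic Regression row from the rounded precision and recall printed earlier. `f_beta` is a hypothetical helper, and because the inputs are rounded the last digit may differ from the notebook output.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta from precision and recall; returns 0.0 when both are zero."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0

# Rounded values from the Precision vs Recall output for Logistic Regression
lr_precision, lr_recall = 0.8028, 0.6678

lr_f1 = f_beta(lr_precision, lr_recall, beta=1)
lr_f2 = f_beta(lr_precision, lr_recall, beta=2)
print(f"F1: {lr_f1:.4f}, F2: {lr_f2:.4f}")
```

Because Logistic Regression has lower recall than precision, its F2 falls below its F1. GaussianNB shows the opposite pattern: its recall is much higher than its precision, so its F2 is the larger of the two.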




# 

### Apply Voting Methods

- **Max Voting**

In [19]:
# create voting classifier
voting_classifer = VotingClassifier(estimators= model_list,
                                    voting='hard', # <-- sklearn calls this hard voting
                                    n_jobs=-1)

In [20]:
voting_classifer.fit(X_train, y_train)

VotingClassifier(estimators=[('lr_model', LogisticRegression(random_state=1)),
                             ('knn_model',
                              KNeighborsClassifier(n_neighbors=13)),
                             ('rf_model',
                              RandomForestClassifier(random_state=1)),
                             ('et_model', ExtraTreesClassifier(random_state=1)),
                             ('Br_model', BernoulliNB()),
                             ('Gn_model', GaussianNB())],
                 n_jobs=-1)

In [21]:
y_pred = voting_classifer.predict(X_test)

In [22]:
print("Accuracy Score:", accuracy_score(y_test, y_pred) )
print("_____________________________________________________________________________")

print("Precision: {:6.4f},   Recall: {:6.4f}".format(precision_score(y_test, y_pred), 
                                                     recall_score(y_test, y_pred)))
print("_____________________________________________________________________________")

print("Scores of F1: {:6.4f}, F2: {:6.4f}".format(f1_score(y_test, y_pred), fbeta_score(y_test, y_pred, beta=2)))

Accuracy Score: 0.9312447180973077
_____________________________________________________________________________
Precision: 0.9415,   Recall: 0.9196
_____________________________________________________________________________
Scores of F1: 0.9304, F2: 0.9239
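
Hard voting takes each fitted model's predicted label and returns the most common one. With six estimators and two classes a 3-3 tie is possible; the scikit-learn documentation states that ties are resolved by the ascending sort order of the class labels. A minimal sketch with made-up predictions, where `hard_vote` is a hypothetical helper rather than a scikit-learn function:

```python
from collections import Counter

def hard_vote(labels):
    """Majority label; ties go to the smallest label (ascending order)."""
    counts = Counter(labels)
    best = max(counts.values())
    return min(label for label, n in counts.items() if n == best)

# One row per sample, one column per model (six hypothetical models)
predictions = [
    [1, 1, 1, 0, 1, 0],  # clear majority for class 1
    [0, 1, 0, 0, 1, 0],  # clear majority for class 0
    [1, 0, 1, 0, 1, 0],  # 3-3 tie
]

votes = [hard_vote(row) for row in predictions]
```

Under this tie rule the third sample is assigned class 0, which is one reason an odd number of estimators is often preferred for hard voting.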




# 

- **Average Voting**

In [23]:
# create voting classifier
voting_classifer = VotingClassifier(estimators=model_list,
                                    voting='soft', #<-- sklearn calls this soft voting
                                    n_jobs=-1)

In [24]:
voting_classifer.fit(X_train, y_train)

VotingClassifier(estimators=[('lr_model', LogisticRegression(random_state=1)),
                             ('knn_model',
                              KNeighborsClassifier(n_neighbors=13)),
                             ('rf_model',
                              RandomForestClassifier(random_state=1)),
                             ('et_model', ExtraTreesClassifier(random_state=1)),
                             ('Br_model', BernoulliNB()),
                             ('Gn_model', GaussianNB())],
                 n_jobs=-1, voting='soft')

In [25]:
y_pred = voting_classifer.predict(X_test)

In [26]:
print("Accuracy Score:", accuracy_score(y_test, y_pred) )
print("_____________________________________________________________________________")

print("Precision: {:6.4f},   Recall: {:6.4f}".format(precision_score(y_test, y_pred), 
                                                     recall_score(y_test, y_pred)))
print("_____________________________________________________________________________")

print("Scores of F1: {:6.4f}, F2: {:6.4f}".format(f1_score(y_test, y_pred), fbeta_score(y_test, y_pred, beta=2)))

Accuracy Score: 0.9311843534951104
_____________________________________________________________________________
Precision: 0.9130,   Recall: 0.9531
_____________________________________________________________________________
Scores of F1: 0.9326, F2: 0.9448




# 

- **Weighted Voting**

In [27]:
# create voting classifier
weights = [1.5,3.8,4.2,2.2,3.1,2.4]
voting_model = VotingClassifier(estimators=model_list,
                                    voting='soft', 
                                    weights = weights,  #include weights
                                    n_jobs=-1)

In [28]:
voting_model.fit(X_train, y_train)

VotingClassifier(estimators=[('lr_model', LogisticRegression(random_state=1)),
                             ('knn_model',
                              KNeighborsClassifier(n_neighbors=13)),
                             ('rf_model',
                              RandomForestClassifier(random_state=1)),
                             ('et_model', ExtraTreesClassifier(random_state=1)),
                             ('Br_model', BernoulliNB()),
                             ('Gn_model', GaussianNB())],
                 n_jobs=-1, voting='soft',
                 weights=[1.5, 3.8, 4.2, 2.2, 3.1, 2.4])

In [29]:
y_pred = voting_model.predict(X_test)

In [30]:
print("Accuracy Score:", accuracy_score(y_test, y_pred) )
print("_____________________________________________________________________________")

print("Precision: {:6.4f},   Recall: {:6.4f}".format(precision_score(y_test, y_pred), 
                                                     recall_score(y_test, y_pred)))
print("_____________________________________________________________________________")

print("Scores of F1: {:6.4f}, F2: {:6.4f}".format(f1_score(y_test, y_pred), fbeta_score(y_test, y_pred, beta=2)))

Accuracy Score: 0.937100084510443
_____________________________________________________________________________
Precision: 0.9223,   Recall: 0.9546
_____________________________________________________________________________
Scores of F1: 0.9382, F2: 0.9479
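
Soft voting averages each model's predicted class probabilities and picks the class with the highest average; the `weights` argument turns this into a weighted average. The sketch below applies the six weights from the cell above to made-up class-1 probabilities for a single sample. `soft_vote` is a hypothetical helper, and the probabilities are illustrative, not model output.

```python
def soft_vote(probs_class1, weights=None, threshold=0.5):
    """Return (weighted mean probability of class 1, predicted label)."""
    if weights is None:
        weights = [1.0] * len(probs_class1)
    avg = sum(w * p for w, p in zip(weights, probs_class1)) / sum(weights)
    return avg, int(avg > threshold)

# Order: lr, knn, rf, et, Br, Gn (same order as the estimators list)
weights = [1.5, 3.8, 4.2, 2.2, 3.1, 2.4]
probs = [0.90, 0.30, 0.20, 0.45, 0.60, 0.90]  # hypothetical probabilities

plain_avg, plain_label = soft_vote(probs)               # average voting
weighted_avg, weighted_label = soft_vote(probs, weights)  # weighted voting
```

In this made-up case the unweighted average is above 0.5, while the weighted average drops below 0.5 because the two most heavily weighted models (knn and rf) assign low probabilities, so the predicted label changes.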




# 

### Apply Stacking Classifier

In [31]:
stacked = StackingClassifier(
    classifiers=model_vars, meta_classifier=RandomForestClassifier(), use_probas=False)

In [32]:
stacked.fit(X_train, y_train)

StackingClassifier(classifiers=[LogisticRegression(random_state=1),
                                KNeighborsClassifier(n_neighbors=13),
                                RandomForestClassifier(random_state=1),
                                ExtraTreesClassifier(random_state=1),
                                BernoulliNB(), GaussianNB()],
                   meta_classifier=RandomForestClassifier())

In [33]:
y_pred = stacked.predict(X_test)

In [34]:
print("Accuracy Score:", accuracy_score(y_test, y_pred) )
print("_____________________________________________________________________________")

print("Precision: {:6.4f},   Recall: {:6.4f}".format(precision_score(y_test, y_pred), 
                                                     recall_score(y_test, y_pred)))
print("_____________________________________________________________________________")

print("Scores of F1: {:6.4f}, F2: {:6.4f}".format(f1_score(y_test, y_pred), fbeta_score(y_test, y_pred, beta=2)))

Accuracy Score: 0.9491126403477002
_____________________________________________________________________________
Precision: 0.9470,   Recall: 0.9514
_____________________________________________________________________________
Scores of F1: 0.9492, F2: 0.9505
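
With `use_probas=False`, the stacking classifier passes the base classifiers' predicted labels to the meta-classifier as its input features. The sketch below shows only that data flow. Made-up threshold rules stand in for the fitted base models, and the meta-classifier is replaced by a lookup of the most common training label per prediction pattern; every name in it is hypothetical.

```python
from collections import Counter, defaultdict

# Stand-ins for fitted base models: each maps a feature row to a label
base_models = [
    lambda row: int(row[0] > 0.5),
    lambda row: int(row[1] > 0.5),
    lambda row: int(row[0] + row[1] > 1.2),
]

def meta_features(rows):
    """One column per base model, holding that model's predicted label."""
    return [tuple(m(row) for m in base_models) for row in rows]

X_tr = [[0.9, 0.8], [0.7, 0.2], [0.1, 0.9], [0.2, 0.1], [0.6, 0.7]]
y_tr = [1, 0, 0, 0, 1]

# "Fit" the meta level: most common label seen for each prediction pattern
seen = defaultdict(Counter)
for pattern, label in zip(meta_features(X_tr), y_tr):
    seen[pattern][label] += 1
meta_model = {p: c.most_common(1)[0][0] for p, c in seen.items()}

def stacked_predict(rows, default=0):
    """Base predictions first, then the meta-level lookup."""
    return [meta_model.get(p, default) for p in meta_features(rows)]

stacked_out = stacked_predict([[0.8, 0.9], [0.9, 0.1]])
```

In the notebook the meta level is a `RandomForestClassifier`, which can learn less trivial combinations of the six base predictions than this lookup table.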




# 

#### The ROC curve

In [35]:
from sklearn.metrics import roc_auc_score, roc_curve

In [36]:
fpr, tpr, thresholds = roc_curve(y_test, stacked.predict_proba(X_test)[:,1])

In [None]:
plt.plot(fpr, tpr,lw=2)
plt.plot([0,1],[0,1],c='violet',ls='--')
plt.xlim([-0.05,1.05])
plt.ylim([-0.05,1.05])


plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve for Ransomware problem');
print("ROC AUC score = ", roc_auc_score(y_test, stacked.predict_proba(X_test)[:,1]))
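
The ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way on made-up scores. `pairwise_auc` is a hypothetical helper, and its quadratic loop is only meant for small illustrations.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A correctly ordered pair counts 1, a tie counts 0.5
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc_labels = [0, 0, 1, 1, 0, 1]
auc_scores = [0.10, 0.40, 0.35, 0.80, 0.35, 0.90]  # hypothetical scores

auc_demo = pairwise_auc(auc_labels, auc_scores)
```

A value of 0.5 corresponds to the dashed diagonal in the plot, and 1.0 means every positive is scored above every negative.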

# 

#### Confusion Matrix

In [None]:
def make_confusion_matrix(model, threshold=0.5):
    # Predict class 1 if probability of being in class 1 is greater than threshold
    # (model.predict(X_test) does this automatically with a threshold of 0.5)
    y_predict = (model.predict_proba(X_test)[:, 1] >= threshold)
    fraud_confusion = confusion_matrix(y_test, y_predict)
    plt.figure(dpi=80)
    sns.heatmap(fraud_confusion, cmap=plt.cm.Blues, annot=True, square=True, fmt='d',
           xticklabels=['White', 'Ransomware'],
           yticklabels=['White', 'Ransomware']);
    plt.xlabel('prediction')
    plt.ylabel('actual')

In [None]:
make_confusion_matrix(stacked)
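
Lowering the threshold passed to `make_confusion_matrix` trades false negatives for false positives. The sketch below counts the four confusion-matrix cells for two thresholds on made-up probabilities. `confusion_counts` is a hypothetical helper that mirrors the `>= threshold` rule used above.

```python
def confusion_counts(y_true, probs, threshold=0.5):
    """Return (tn, fp, fn, tp) for the rule: predict 1 when prob >= threshold."""
    tn = fp = fn = tp = 0
    for y, p in zip(y_true, probs):
        pred = int(p >= threshold)
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

cm_labels = [0, 0, 0, 1, 1, 1]
cm_probs = [0.10, 0.35, 0.60, 0.30, 0.55, 0.90]  # hypothetical probabilities

at_half = confusion_counts(cm_labels, cm_probs, threshold=0.5)
at_low = confusion_counts(cm_labels, cm_probs, threshold=0.3)
```

With the lower threshold the made-up example catches every positive but flags one more negative, which is the trade-off to weigh when missing a ransomware address is costlier than a false alarm.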

# 

#### Store the models in Pickle files

In [None]:
for model_name, model in model_list:
    # One file per fitted model, named after the model variable
    with open(f'{model_name}.pkl', 'wb') as f:
        pickle.dump(model, f)
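
To use a stored model later, it has to be read back with `pickle.load`. A minimal round-trip sketch follows. It uses a plain dictionary as a stand-in for a fitted model and a temporary directory so that it runs without the data set; the file name `rf_model.pkl` is an assumption. Unpickling can execute arbitrary code, so only load files from a trusted source.

```python
import pickle
import tempfile
from pathlib import Path

# Placeholder object standing in for a fitted estimator
stand_in_model = {"name": "rf_model", "n_estimators": 100}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "rf_model.pkl"

    # Store the object
    with open(path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Load it back into a new object
    with open(path, "rb") as f:
        restored = pickle.load(f)
```

For real estimators, the loaded object should be used with the same scikit-learn version that produced the file, since pickles are not guaranteed to be compatible across versions.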

# 

- **End of This Notebook**