---

_You are currently looking at **version 1.1** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-machine-learning/resources/bANLa) course resource._

---

# Assignment 3 - Evaluation

In this assignment you will train several models and evaluate how effectively they predict instances of fraud using data based on [this dataset from Kaggle](https://www.kaggle.com/dalpozz/creditcardfraud).
 
Each row in `fraud_data.csv` corresponds to a credit card transaction. Features include confidential variables `V1` through `V28` as well as `Amount` which is the amount of the transaction. 
 
The target is stored in the `Class` column, where a value of 1 marks a fraudulent transaction and 0 marks a legitimate one.

In [1]:
import numpy as np
import pandas as pd

### Question 1
Import the data from `fraud_data.csv`. What fraction of the observations in the dataset are instances of fraud?

*This function should return a float between 0 and 1.* 

In [2]:
def answer_one():
    fraud = pd.read_csv('fraud_data.csv')

    # 'Class' is 1 for fraud and 0 otherwise, so its mean is the fraud fraction
    return fraud['Class'].mean()
#answer_one()
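
The fraction computed above can be checked on a small example. The sketch below uses a hypothetical toy frame (the values are made up and are not taken from `fraud_data.csv`) to show that the mean of a 0/1 column agrees with the normalized value counts.

```python
import pandas as pd

# Hypothetical toy data standing in for fraud_data.csv (values are invented)
toy = pd.DataFrame({
    'Amount': [12.5, 310.0, 7.9, 45.0, 99.9, 5.0, 18.2, 250.0],
    'Class':  [0, 1, 0, 0, 0, 0, 0, 1],
})

# Mean of a 0/1 column is the share of rows equal to 1
fraud_fraction = toy['Class'].mean()

# Same information, broken down per class
class_shares = toy['Class'].value_counts(normalize=True)

print(fraud_fraction)
print(class_shares)
```

With two positives out of eight rows, both approaches report a fraud share of one quarter.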

In [3]:
# Use X_train, X_test, y_train, y_test for all of the following questions
from sklearn.model_selection import train_test_split

df = pd.read_csv('fraud_data.csv')

X = df.iloc[:,:-1]
y = df.iloc[:,-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

### Question 2

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train a dummy classifier that classifies everything as the majority class of the training data. What is the accuracy of this classifier? What is the recall?

*This function should return a tuple with two floats, i.e. `(accuracy score, recall score)`.*

In [4]:
def answer_two():
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import accuracy_score, recall_score

    # Always predicts the most frequent class seen in y_train
    dummy_majority = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
    y_dummy = dummy_majority.predict(X_test)

    return (accuracy_score(y_test, y_dummy), recall_score(y_test, y_dummy))
#answer_two()
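
To see why accuracy alone can mislead on imbalanced data, here is a minimal sketch on synthetic data (the arrays `X_toy` and `y_toy` are assumptions made for illustration, not the assignment data). A majority-class dummy reaches high accuracy while catching none of the positives.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic, illustrative data: 200 rows, 5% positives
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.zeros(200, dtype=int)
y_toy[:10] = 1

# The dummy ignores the features and always predicts the majority class (0)
dummy = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
y_pred = dummy.predict(X_toy)

acc = accuracy_score(y_toy, y_pred)  # share of negatives: 190 / 200
rec = recall_score(y_toy, y_pred)    # no positive is ever predicted

print(acc, rec)
```

Any real classifier should be compared against this kind of baseline before its accuracy is taken at face value.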

### Question 3

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train an SVC classifier using the default parameters. What are the accuracy, recall, and precision of this classifier?

*This function should return a tuple with three floats, i.e. `(accuracy score, recall score, precision score)`.*

In [10]:
def answer_three():
    from sklearn.metrics import accuracy_score, recall_score, precision_score
    from sklearn.svm import SVC

    svm = SVC(kernel='rbf').fit(X_train, y_train)
    
    y_svm = svm.predict(X_test)
    
    return (
        accuracy_score(y_test,y_svm),
        recall_score(y_test,y_svm),
        precision_score(y_test,y_svm)
        ) # Return your answer
#answer_three()

(0.99078171091445433, 0.375, 1.0)

### Question 4

Using the SVC classifier with parameters `{'C': 1e9, 'gamma': 1e-07}`, what is the confusion matrix when using a threshold of -220 on the decision function? Use `X_test` and `y_test`.

*This function should return a confusion matrix, a 2x2 numpy array with 4 integers.*

In [19]:
def answer_four():
    from sklearn.metrics import confusion_matrix
    from sklearn.svm import SVC

    svm = SVC(kernel='rbf', C=1e9, gamma=1e-07).fit(X_train, y_train)

    # Decision-function score for each test point (larger values favor class 1)
    y_scores_svm = svm.decision_function(X_test)

    # Predict fraud (1) when the score is at or above the -220 threshold
    y_pred = (y_scores_svm >= -220).astype(int)

    return confusion_matrix(y_test, y_pred)
#answer_four()

array([[5320,   24],
       [  14,   66]])
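
As a rough illustration of the two steps involved, the sketch below first applies a custom threshold to a handful of made-up scores (`scores` and `y_true` are hypothetical), then derives accuracy, recall, and precision from the confusion matrix printed above. It assumes scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Step 1: thresholding made-up decision scores at -220
scores = np.array([-300.0, -220.0, -150.0, 40.0, -500.0])  # hypothetical
y_true = np.array([0, 1, 0, 1, 0])                         # hypothetical
y_pred = (scores >= -220).astype(int)
toy_cm = confusion_matrix(y_true, y_pred)

# Step 2: metrics from the matrix reported for Question 4
reported = np.array([[5320, 24],
                     [14, 66]])
tn, fp, fn, tp = reported.ravel()

recall = tp / (tp + fn)               # 66 / 80
precision = tp / (tp + fp)            # 66 / 90
accuracy = (tn + tp) / reported.sum() # 5386 / 5424

print(toy_cm)
print(recall, precision, accuracy)
```

Lowering the threshold well below zero trades precision for recall: more transactions are flagged, so more fraud is caught at the cost of extra false alarms.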

### Question 5

Train a logistic regression classifier with default parameters using `X_train` and `y_train`.

For the logistic regression classifier, create a precision-recall curve and a ROC curve using `y_test` and the probability estimates for `X_test` (the probability that a transaction is fraud).

Looking at the precision recall curve, what is the recall when the precision is `0.75`?

Looking at the roc curve, what is the true positive rate when the false positive rate is `0.16`?

*This function should return a tuple with two floats, i.e. `(recall, true positive rate)`.*

In [51]:
def answer_five():
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_recall_curve, roc_curve, auc
    import matplotlib.pyplot as plt
    import pandas as pd
    
    lr = LogisticRegression().fit(X_train, y_train)

    # Probability of the positive (fraud) class for each test transaction
    y_scores_lr = lr.predict_proba(X_test)[:, 1]

    precision, recall, thresholds = precision_recall_curve(y_test, y_scores_lr)
    #compare=pd.DataFrame({'precision':precision, 'recall':recall}) #, 'thresholds':thresholds})
    #print(compare[(compare['precision'] > 0.74) & (compare['precision'] < 0.76)])
    
    #plt.figure(figsize=(15,15))
    #plt.xlim([0.0, 1.01])
    #plt.ylim([0.0, 1.01])
    #plt.plot(precision, recall, label='Precision-Recall Curve')
    #plt.xlabel('Precision', fontsize=16)
    #plt.ylabel('Recall', fontsize=16)
    #plt.axes().set_aspect('equal')
    #plt.xticks(np.arange(0, 1, 0.05))
    #plt.yticks(np.arange(0, 1, 0.05))
    #plt.show()
    
    # False positive rate and true positive rate at each score threshold
    fpr_lr, tpr_lr, _ = roc_curve(y_test, y_scores_lr)
    
    #compare2=pd.DataFrame({'fpr_lr':fpr_lr, 'tpr_lr':tpr_lr})
    #print(compare2[(compare2['fpr_lr'] > 0.15) & (compare2['fpr_lr'] < 0.21)])
    
    #plt.figure(figsize=(15,15))
    #plt.xlim([-0.01, 1.00])
    #plt.ylim([-0.01, 1.01])
    #plt.plot(fpr_lr, tpr_lr, lw=3)
    #plt.plot([0, 1], [0, 1], color='navy', lw=3, linestyle='--')
    #plt.axes().set_aspect('equal')
    #plt.xticks(np.arange(0, 1, 0.05))
    #plt.yticks(np.arange(0, 1, 0.05))
    #plt.show()

    # Values read off the curves: recall at precision 0.75, TPR at FPR 0.16
    return (0.825, 0.950)
#answer_five()
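
Reading values off a plot is what the question asks for, but the same lookup can be approximated in code. The sketch below is one possible approach on synthetic data, not the assignment's required method. The helper names `recall_at_precision` and `tpr_at_fpr` and all the `*_syn` / `Xs_*` variables are assumptions introduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_curve
from sklearn.model_selection import train_test_split

def recall_at_precision(precision, recall, target):
    """Recall at the curve point whose precision is closest to target."""
    idx = np.argmin(np.abs(np.asarray(precision) - target))
    return np.asarray(recall)[idx]

def tpr_at_fpr(fpr, tpr, target):
    """TPR linearly interpolated at the given FPR (fpr is non-decreasing)."""
    return np.interp(target, fpr, tpr)

# Synthetic imbalanced data standing in for the fraud dataset
X_syn, y_syn = make_classification(n_samples=2000, weights=[0.9, 0.1],
                                   random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(X_syn, y_syn,
                                                        random_state=0)

model = LogisticRegression(max_iter=1000).fit(Xs_train, ys_train)
proba = model.predict_proba(Xs_test)[:, 1]

precision, recall, _ = precision_recall_curve(ys_test, proba)
fpr, tpr, _ = roc_curve(ys_test, proba)

rec_075 = recall_at_precision(precision, recall, 0.75)
tpr_016 = tpr_at_fpr(fpr, tpr, 0.16)
print(rec_075, tpr_016)
```

Precision is not monotonic along the curve, so the nearest-point lookup may pick one of several candidates. Plotting the curve remains a useful sanity check.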

### Question 6

Perform a grid search over the parameters listed below for a Logistic Regression classifier, using recall for scoring and 3-fold cross validation (the default when this notebook was written; scikit-learn 0.22 and later default to 5-fold).

`'penalty': ['l1', 'l2']`

`'C':[0.01, 0.1, 1, 10, 100]`

From `.cv_results_`, create an array of the mean test scores of each parameter combination, i.e.

|      	| `l1` 	| `l2` 	|
|:----:	|----	|----	|
| **`0.01`** 	|    ?	|   ? 	|
| **`0.1`**  	|    ?	|   ? 	|
| **`1`**    	|    ?	|   ? 	|
| **`10`**   	|    ?	|   ? 	|
| **`100`**   	|    ?	|   ? 	|

<br>

*This function should return a 5 by 2 numpy array with 10 floats.* 

*Note: do not return a DataFrame, just the values denoted by '?' above in a numpy array.*

In [56]:
def answer_six():
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression

    # liblinear supports both 'l1' and 'l2'; the newer default solver (lbfgs) rejects 'l1'
    clf = LogisticRegression(solver='liblinear')
    grid_values = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]}

    # cv=3 pins the 3-fold split the question asks for
    grid_clf_rec = GridSearchCV(clf, param_grid=grid_values, scoring='recall', cv=3)
    grid_clf_rec.fit(X_train, y_train)

    # Scores are ordered with 'penalty' varying fastest, so each row is one value of C
    res = grid_clf_rec.cv_results_['mean_test_score']

    return res.reshape(5, 2)
#answer_six()

array([[ 0.66666667,  0.76086957],
       [ 0.80072464,  0.80434783],
       [ 0.8115942 ,  0.8115942 ],
       [ 0.80797101,  0.8115942 ],
       [ 0.80797101,  0.80797101]])
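
The `reshape(5, 2)` call relies on the order in which the grid enumerates parameter combinations. The following sketch, run on synthetic data rather than the fraud dataset, builds the same table by pivoting on the parameter values instead, which does not depend on that order. The names `X_syn`, `y_syn`, `table`, and `mean_recall` are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data, used only to make the example runnable
X_syn, y_syn = make_classification(n_samples=600, weights=[0.85, 0.15],
                                   random_state=0)

grid = GridSearchCV(
    LogisticRegression(solver='liblinear'),  # supports both l1 and l2
    param_grid={'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]},
    scoring='recall',
    cv=3,
).fit(X_syn, y_syn)

results = grid.cv_results_

# One row per parameter combination, then pivot: rows = C, columns = penalty
table = (
    pd.DataFrame(list(results['params']))
    .assign(mean_recall=results['mean_test_score'])
    .pivot(index='C', columns='penalty', values='mean_recall')
)
scores = table.to_numpy()

print(table)
```

For this grid the pivoted values match `mean_test_score.reshape(5, 2)`, because the parameter keys are iterated in sorted order with `penalty` varying fastest.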

In [55]:
# Use the following function to help visualize results from the grid search
def GridSearch_Heatmap(scores):
    %matplotlib notebook
    import seaborn as sns
    import matplotlib.pyplot as plt
    plt.figure()
    sns.heatmap(scores.reshape(5,2), xticklabels=['l1','l2'], yticklabels=[0.01, 0.1, 1, 10, 100])
    plt.yticks(rotation=0);

#GridSearch_Heatmap(answer_six())