---

_You are currently looking at **version 1.0** of this notebook._

---

# SI 670 Assignment 3 - Linear classifiers, evaluation methods, and grid search (75 points total)

In this assignment you will train several linear classifier models and evaluate how effectively they predict instances of fraud using data based on [this dataset from Kaggle](https://www.kaggle.com/dalpozz/creditcardfraud). Then you'll perform a grid search to find optimal parameters. 
 
Each row in `fraud_data.csv` corresponds to a credit card transaction. Features include confidential variables `V1` through `V28` as well as `Amount` which is the amount of the transaction. 
 
The target is stored in the `Class` column, where a value of 1 corresponds to an instance of fraud and 0 corresponds to an instance of not fraud.

In [1]:
import numpy as np
import pandas as pd

### Question 1 (5 points)
Import the data from `fraud_data.csv`. What fraction of the observations in the dataset are instances of fraud?

*This function should return a float between 0 and 1.* 
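
For a 0/1 target, the share of positive rows equals the mean of the column. The sketch below illustrates this on a small made-up frame; the values are invented for illustration and are not taken from `fraud_data.csv`.

```python
import pandas as pd

# Toy stand-in for the real data: 1 marks fraud, 0 marks not fraud.
toy = pd.DataFrame({"Class": [0, 0, 0, 1, 0, 0, 0, 1, 0, 0]})

# Mean of a 0/1 column = (number of ones) / (number of rows).
fraud_fraction = toy["Class"].mean()
print(fraud_fraction)  # 0.2
```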

In [17]:
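def answer_one():
    # YOUR CODE HERE
    df = pd.read_csv("fraud_data.csv")
    counts = df.groupby('Class').size().tolist()
    # counts[0] is the number of non-fraud rows (Class 0), counts[1] the number of fraud rows (Class 1)
    return counts[1]/(counts[0]+counts[1])
answer_one()

0.016410823768035772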

In [19]:
# Use X_train, X_test, y_train, y_test for all of the following questions
from sklearn.model_selection import train_test_split

df = pd.read_csv('fraud_data.csv')

X = df.iloc[:,:-1]
y = df.iloc[:,-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

### Question 2 (10 points)

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train a dummy classifier that classifies everything as the majority class of the training data. What is the accuracy of this classifier? What is the recall?

*This function should return a tuple with two floats, i.e. `(accuracy score, recall score)`.*

In [23]:
def answer_two():
    # YOUR CODE HERE
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.metrics import recall_score
    
    dummy_majority = DummyClassifier(strategy='most_frequent').fit(X_train, y_train) 
    y_pred = dummy_majority.predict(X_test)
    # scikit-learn metrics take (y_true, y_pred) in that order
    accScore = accuracy_score(y_test, y_pred)
    recallScore = recall_score(y_test, y_pred)
    print("Accuracy score: ", accScore)
    print("Recall score: ", recallScore)
    
    return accScore, recallScore

answer_two()

Accuracy score:  0.9852507374631269
Recall score:  0.0


(0.9852507374631269, 0.0)

### Question 3 (10 points)

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train a linear support vector classifier using the default parameters. What are the accuracy, recall, and precision of this classifier?

*This function should return a tuple with three floats, i.e. `(accuracy score, recall score, precision score)`.*

In [None]:
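def answer_three():
    # YOUR CODE HERE
    from sklearn.svm import LinearSVC
    from sklearn.metrics import accuracy_score, recall_score, precision_score

    # Linear support vector classifier with default parameters, as the question asks
    svm = LinearSVC().fit(X_train, y_train)
    y_pred = svm.predict(X_test)

    return (accuracy_score(y_test, y_pred),
            recall_score(y_test, y_pred),
            precision_score(y_test, y_pred))
answer_three()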

### Question 4 (10 points)

Using the linear support vector classifier with parameters `{'C': 1e9, 'gamma': 2e-07}`, what is the confusion matrix when using a threshold of -230 on the decision function? Use X_test and y_test.

*This function should return a confusion matrix, a 2x2 numpy array with 4 integers.*
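
A decision-function threshold replaces the classifier's default cut-off of 0: a row is labeled positive when its score exceeds the threshold. The sketch below shows the mechanics only. The synthetic data from `make_classification`, the default `SVC()` settings, the `Xd_*`/`yd_*` names, and the threshold of `-0.5` are all illustrative assumptions, not the assignment's values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the fraud data (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     weights=[0.9, 0.1], random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0)

clf = SVC().fit(Xd_train, yd_train)

# One score per test row; for a binary SVC, predict() labels a row 1 when its score is > 0.
scores = clf.decision_function(Xd_test)

# Lowering the cut-off below 0 can only flag the same rows or more rows as positive.
threshold = -0.5  # hypothetical value
y_pred_low = (scores > threshold).astype(int)

# Rows are true classes, columns are predicted classes.
cm_default = confusion_matrix(yd_test, clf.predict(Xd_test))
cm_low = confusion_matrix(yd_test, y_pred_low)
print(cm_default)
print(cm_low)
```

Comparing the second column of the two matrices shows how many more rows the lower threshold flags as positive, and at what cost in false positives.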

In [None]:
def answer_four():
    # YOUR CODE HERE

#answer_four()

### Question 5 (20 points)

Train a logistic regression classifier with default parameters using X_train and y_train.

For the logistic regression classifier, create (1) a precision-recall curve and (2) an ROC curve using `y_test` and the probability estimates for `X_test` (the probability that a transaction is fraud).

Looking at the precision recall curve, what is the recall when the precision is `0.75`?

Looking at the ROC curve, what is the true positive rate when the false positive rate is `0.16`?

*This function should return a tuple with two floats, i.e. `(recall, true positive rate)`.*
*You should also include code to generate the precision-recall and ROC curves above.*
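
As a rough sketch of the workflow described above, the block below computes both curves from positive-class probabilities and reads a value off each one. It runs on synthetic data from `make_classification`, so its numbers say nothing about the fraud data. The `value_nearest` and `plot_curves` helpers and the `Xd_*`/`yd_*` names are hypothetical, and taking the nearest curve point is just one simple way to read a value; a precision-recall curve is not monotonic, so several points can lie near the same precision.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the fraud data (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     weights=[0.9, 0.1], random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0)

lr = LogisticRegression().fit(Xd_train, yd_train)
proba_pos = lr.predict_proba(Xd_test)[:, 1]  # column 1 = probability of class 1

precision, recall, _ = precision_recall_curve(yd_test, proba_pos)
fpr, tpr, _ = roc_curve(yd_test, proba_pos)

def value_nearest(x, y, target):
    """Return y at the curve point whose x is closest to target (hypothetical helper)."""
    return float(y[np.argmin(np.abs(x - target))])

recall_at_p75 = value_nearest(precision, recall, 0.75)
tpr_at_fpr16 = value_nearest(fpr, tpr, 0.16)
print(recall_at_p75, tpr_at_fpr16)

def plot_curves():
    """Draw both curves side by side; call this in a notebook cell."""
    import matplotlib.pyplot as plt
    fig, (ax_pr, ax_roc) = plt.subplots(1, 2, figsize=(10, 4))
    ax_pr.plot(precision, recall)
    ax_pr.set_xlabel("Precision")
    ax_pr.set_ylabel("Recall")
    ax_roc.plot(fpr, tpr)
    ax_roc.set_xlabel("False positive rate")
    ax_roc.set_ylabel("True positive rate")
    plt.show()
```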

In [None]:
# add code here to train the classifier and plot the P/R and ROC curves

### Question 6 (20 points)

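(a) (13 points) Perform a grid search over the parameters listed below for a Logistic Regression classifier, using recall for scoring and 3-fold cross validation. (3-fold was the default in older scikit-learn releases; from version 0.22 the default is 5-fold, so pass `cv=3` explicitly.)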

`'penalty': ['l1', 'l2']`

`'C':[0.01, 0.1, 1, 10]`

From `.cv_results_`, create an array of the mean test scores of each parameter combination. i.e.

|      	| `l1` 	| `l2` 	|
|:----:	|----	|----	|
| **`0.01`** 	|    ?	|   ? 	|
| **`0.1`**  	|    ?	|   ? 	|
| **`1`**    	|    ?	|   ? 	|
| **`10`**   	|    ?	|   ? 	|

<br>

*This function should return a 4 by 2 numpy array with 8 floats.* 

*Note: do not return a DataFrame, just the values denoted by `?` in a numpy array.*
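
One possible way to lay the mean test scores out as a `C` by `penalty` array is sketched below on synthetic data. The data and the `solver="liblinear"` choice are assumptions: `liblinear` is used because it supports both `l1` and `l2` penalties, which the default solver in recent scikit-learn releases does not. Each score is placed by looking up its parameters, so the layout does not depend on the order in which the grid is enumerated.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the fraud data (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     weights=[0.8, 0.2], random_state=0)

param_grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10]}

# liblinear handles both l1 and l2 penalties.
grid = GridSearchCV(LogisticRegression(solver="liblinear"),
                    param_grid, scoring="recall", cv=3)
grid.fit(X_demo, y_demo)

# Fill a (number of C values) x (number of penalties) array, one cell per combination.
results = grid.cv_results_
scores = np.empty((len(param_grid["C"]), len(param_grid["penalty"])))
for params, mean_score in zip(results["params"], results["mean_test_score"]):
    i = param_grid["C"].index(params["C"])
    j = param_grid["penalty"].index(params["penalty"])
    scores[i, j] = mean_score

print(scores)
print(grid.best_params_)
```

Here rows follow the order of `C` values and columns follow `['l1', 'l2']`, matching the table above.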

In [None]:
def answer_six():    
    # YOUR CODE HERE

answer_six()

(b) (2 points) *What is the optimal combination of penalty setting and C value?*  (You can use the following helper function to visualize the results from your grid search.)

In [None]:
# Use the following function to help visualize results from the grid search
def GridSearch_Heatmap(scores):
    %matplotlib inline
    import seaborn as sns
    import matplotlib.pyplot as plt
    plt.figure()
    sns.heatmap(scores.reshape(4,2), xticklabels=['l1','l2'], yticklabels=[0.01, 0.1, 1, 10])
    plt.yticks(rotation=0);

GridSearch_Heatmap(answer_six())

(c) (5 points) Your fraud data was provided by a client who represents a large banking company. What is your final recommendation for the type of classifier and settings they should use to detect fraud in their system -- and why?