# Case Description

This document describes a data analysis focused on fraud detection using several classification algorithms. The main goal is to correctly identify the fraudulent instances in a dataset and to evaluate how well different classification models perform at that task. The document works through a sequence of tasks, from a basic understanding of the dataset to model tuning through hyperparameter search.

### Objectives

1. **Importing and Initial Analysis of the Dataset:**
   - Import the data from a CSV file and determine the percentage of observations that are instances of fraud.

2. **Training Classification Models:**
   - Train a dummy classifier to establish a baseline for comparison in terms of accuracy and recall.
   - Evaluate a Support Vector Classifier (SVC) and compute its accuracy, recall, and precision.

3. **Confusion Matrix Analysis:**
   - Use an SVC classifier with specific parameters to obtain a confusion matrix at a given decision threshold.

4. **Precision-Recall and ROC Curves:**
   - Train a logistic regression classifier and generate precision-recall and ROC curves to evaluate performance across thresholds.
   - Extract the recall at a given precision and the true positive rate at a given false positive rate.

5. **Model Optimization:**
   - Run a hyperparameter search for a logistic regression classifier using cross-validation, with recall as the scoring metric.



In [13]:
import numpy as np
import pandas as pd

### Question 1
Import the data from `assets/fraud_data.csv`. What percentage of the observations in the dataset are instances of fraud?

*This function should return a float between 0 and 1.* 

In [16]:
import pandas as pd

def answer_one():
    df = pd.read_csv('assets/fraud_data.csv')
    # Class is a 0/1 label, so its mean is the fraction of fraud instances
    fraud_percentage = df['Class'].mean()
    return fraud_percentage

# Call the function to see the result
print(answer_one())

0.016410823768035772
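
Because `Class` is used here as a 0/1 indicator, its mean equals the share of fraudulent rows, so the value above is a fraction (roughly 1.64%) rather than a number on a 0 to 100 scale. A minimal sketch of the same idea, using a made-up `labels` list that is illustrative only and not taken from `fraud_data.csv`:

```python
# Hypothetical 0/1 labels: 1 marks a fraudulent observation.
labels = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

# The mean of a 0/1 indicator is the count of ones divided by the total.
fraud_fraction = sum(labels) / len(labels)

# Convert the fraction to a percentage only for display.
print(f"fraction = {fraud_fraction}, percentage = {fraud_fraction * 100:.1f}%")
```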


In [18]:
# Use X_train, X_test, y_train, y_test for all of the following questions
from sklearn.model_selection import train_test_split

df = pd.read_csv('assets/fraud_data.csv')

X = df.iloc[:,:-1]
y = df.iloc[:,-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

### Question 2

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train a dummy classifier that classifies everything as the majority class of the training data. What is the accuracy of this classifier? What is the recall?

*This function should return a tuple with two floats, i.e. `(accuracy score, recall score)`.*

In [21]:
def answer_two():
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import recall_score

    # The negative class (0) is the most frequent, so the 'most_frequent'
    # strategy always predicts class 0
    dummy_majority = DummyClassifier(strategy='most_frequent')
    dummy_majority.fit(X_train, y_train)

    y_dummy_predictions = dummy_majority.predict(X_test)

    # Recall for the positive (fraud) class
    recall = recall_score(y_test, y_dummy_predictions)

    # score() returns the mean accuracy on the test set
    accuracy = dummy_majority.score(X_test, y_test)

    return (accuracy, recall)

result = answer_two()
print(result)

(0.9852507374631269, 0.0)
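
The result above shows why accuracy alone is misleading on imbalanced data: a model that never flags fraud is right almost all the time, yet catches no fraud at all. The sketch below reproduces that effect in plain Python. The labels are invented for illustration, and the baseline is "fit" and scored on the same list for brevity instead of on a train/test split.

```python
from collections import Counter

# Hypothetical, heavily imbalanced labels: 98 legitimate (0), 2 fraud (1).
y_true = [0] * 98 + [1] * 2

# A majority-class baseline predicts the most common label for every row.
majority_label = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_label] * len(y_true)

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall: share of actual fraud cases that were predicted as fraud.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(t == 1 for t in y_true)

print(accuracy, recall)
```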


### Question 3

Using X_train, X_test, y_train, y_test (as defined above), train an SVC classifier using the default parameters. What are the accuracy, recall, and precision of this classifier?

*This function should return a tuple with three floats, i.e. `(accuracy score, recall score, precision score)`.*

In [46]:
def answer_three():
    from sklearn.metrics import recall_score, precision_score
    from sklearn.svm import SVC

    # Train an SVC with default parameters and predict on the test set
    clf = SVC()
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)

    # Use distinct names so the imported metric functions are not shadowed
    accuracy = clf.score(X_test, y_test)
    recall = recall_score(y_test, predictions)
    precision = precision_score(y_test, predictions)

    return (accuracy, recall, precision)


answer_three()

(0.9901659496004918, 0.3673469387755102, 0.9473684210526315)

### Question 4

Using the SVC classifier with parameters `{'C': 1e9, 'gamma': 1e-07}`, what is the confusion matrix when using a threshold of -220 on the decision function? Use X_test and y_test.

*This function should return a confusion matrix, a 2x2 numpy array with 4 integers.*

In [48]:
def answer_four():
    from sklearn.metrics import confusion_matrix
    from sklearn.svm import SVC

    svc_classifier = SVC(C=1e9, gamma=1e-07)
    svc_classifier.fit(X_train, y_train)
    decision_scores = svc_classifier.decision_function(X_test)
    
    threshold = -220
    y_pred_thresholded = (decision_scores > threshold).astype(int)
    
    conf_matrix = confusion_matrix(y_test, y_pred_thresholded)
    
    return conf_matrix

result = answer_four()
print(result)

[[6380   30]
 [  14   84]]
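
To read the matrix above, recall that scikit-learn's `confusion_matrix` puts true classes in rows and predicted classes in columns, so a binary result is laid out as `[[TN, FP], [FN, TP]]`. The following sketch derives accuracy, precision, and recall from the four reported counts. It is a worked calculation on the printed matrix only and does not retrain the model.

```python
# Confusion matrix reported above, in scikit-learn's layout:
# rows are true classes, columns are predicted classes.
conf_matrix = [[6380, 30],
               [14, 84]]

(tn, fp), (fn, tp) = conf_matrix

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the flagged transactions, how many are fraud
recall = tp / (tp + fn)      # of the fraud cases, how many were flagged

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

Lowering the decision threshold can only keep recall the same or raise it, because more observations are flagged as fraud; precision typically drops in exchange.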


### Question 5

Train a logistic regression classifier with default parameters using X_train and y_train. This classifier should use the parameter solver='liblinear'.

For the logistic regression classifier, compute the scores using decision_function() or predict_proba(), then create a precision-recall curve and an ROC curve using y_test and the probability estimates for X_test (the probability that an observation is fraud).

Looking at the precision recall curve, what is the recall when the precision is `0.75`?

Looking at the ROC curve, what is the true positive rate when the false positive rate is `0.16`?

Note: When getting the ROC curve and finding the records where the FPR entry is closest to 0.16, take the corresponding TPRs. As there are two such records where the FPR is close to 0.16, take the higher TPR of these two records.

*This function should return a tuple with two floats, i.e. `(recall, true positive rate)`.*
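
Both curves come from the same idea: sweep a threshold over the scores and record the resulting rates at each step. The sketch below does this by hand on invented scores and labels. The helper `rates_at`, the "score >= threshold means fraud" rule, and the convention that precision is 1.0 when nothing is flagged are assumptions made for illustration.

```python
# Hypothetical scores (higher means "more likely fraud") and true labels.
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

def rates_at(threshold):
    """Return (precision, recall, fpr) when predicting 1 for score >= threshold."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn)   # recall is also the true positive rate
    fpr = fp / (fp + tn)
    return precision, recall, fpr

# Each distinct score is a candidate threshold, i.e. one point on each curve.
for threshold in sorted(set(scores), reverse=True):
    precision, recall, fpr = rates_at(threshold)
    print(f"t={threshold:.2f} precision={precision:.2f} recall={recall:.2f} fpr={fpr:.2f}")
```

The (recall, precision) pairs trace the precision-recall curve, and the (fpr, recall) pairs trace the ROC curve.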

In [50]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_curve, auc
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

def answer_five():
    # Load the data and separate features from labels
    df = pd.read_csv('assets/fraud_data.csv')
    X = df.iloc[:, :-1]
    y = df.iloc[:, -1]

    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Train the classifier
    lr = LogisticRegression(solver='liblinear').fit(X_train, y_train)

    # Probability estimates for the positive (fraud) class
    y_scores_lr = lr.predict_proba(X_test)[:, 1]

    # Compute the precision-recall curve
    precision, recall, thresholds = precision_recall_curve(y_test, y_scores_lr)

    # Find the index whose precision is closest to the desired value
    desired_precision = 0.75
    closest_precision_index = np.argmin(np.abs(precision - desired_precision))
    approximate_recall = recall[closest_precision_index]

    # Compute the ROC curve
    fpr_lr, tpr_lr, _ = roc_curve(y_test, y_scores_lr)

    # Find the indices whose FPR is closest to 0.16; if several tie, keep the highest TPR
    desired_fpr = 0.16
    fpr_distance = np.abs(fpr_lr - desired_fpr)
    closest_fpr_indices = np.where(fpr_distance == fpr_distance.min())[0]
    true_positive_rate = np.max(tpr_lr[closest_fpr_indices])

    # Return the result as a tuple
    return approximate_recall, true_positive_rate

# Call the function
result = answer_five()
print(result)

(0.825, 0.95)
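
The tie-break rule from the note (take the higher TPR when two FPR entries are equally close to 0.16) can be sketched on its own. The ROC points below are invented, not the ones from the fraud dataset. Unlike the exact-equality comparison in the cell above, this version rounds the distances first, on the assumption that floating-point noise could otherwise hide a tie.

```python
# Hypothetical ROC points; these are not the values from the fraud dataset.
fpr = [0.00, 0.10, 0.15, 0.17, 0.40, 1.00]
tpr = [0.00, 0.60, 0.80, 0.90, 0.95, 1.00]

target_fpr = 0.16

# Distance from every FPR entry to the target, rounded so that 0.15 and 0.17
# count as equally close despite floating-point noise.
distances = [round(abs(f - target_fpr), 9) for f in fpr]
smallest = min(distances)

# Keep every index that ties for the smallest distance, then take the highest TPR.
tied = [i for i, d in enumerate(distances) if d == smallest]
best_tpr = max(tpr[i] for i in tied)

print(tied, best_tpr)
```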


### Question 6

Perform a grid search over the parameters listed below for a Logistic Regression classifier, using recall for scoring and the default 3-fold cross validation. (Using `solver='liblinear'` is suggested; more explanation [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html))

`'penalty': ['l1', 'l2']`

`'C':[0.01, 0.1, 1, 10]`

From `.cv_results_`, create an array of the mean test scores of each parameter combination. i.e.

|      	| `l1` 	| `l2` 	|
|:----:	|----	|----	|
| **`0.01`** 	|    ?	|   ? 	|
| **`0.1`**  	|    ?	|   ? 	|
| **`1`**    	|    ?	|   ? 	|
| **`10`**   	|    ?	|   ? 	|

<br>

*This function should return a 4 by 2 numpy array with 8 floats.* 

*Note: do not return a DataFrame, just the values denoted by `?` in a numpy array.*

In [52]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

# Load the data
df = pd.read_csv('assets/fraud_data.csv')

# Split into features (X) and labels (y)
X = df.drop('Class', axis=1)
y = df['Class']

# Split into training and test sets (note: a 30% test split here, not the default 25% used earlier)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

def answer_six():
    # Set up the classifier and the parameter grid
    clf = LogisticRegression(solver='liblinear')
    grid_values = {'C': [0.01, 0.1, 1, 10], 'penalty': ['l1', 'l2']}

    # Run the grid search with recall as the scoring metric
    grid_search = GridSearchCV(clf, param_grid=grid_values, scoring='recall', cv=3)
    grid_search.fit(X_train, y_train)

    # Extract the mean test score of each parameter combination
    cv_result = grid_search.cv_results_
    mean_test_score = cv_result['mean_test_score']

    # Reshape to 4 rows (C values) by 2 columns (penalties)
    result = np.array(mean_test_score).reshape(4, 2)

    return result

# Call the function
results_array = answer_six()
print(results_array)

[[0.64728682 0.75581395]
 [0.79844961 0.81007752]
 [0.80620155 0.81007752]
 [0.80620155 0.81007752]]
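
The `reshape(4, 2)` step relies on the order in which the grid search enumerates combinations. As far as I know, scikit-learn sorts the parameter names and varies the last one fastest, so `C` is the outer loop and `penalty` the inner one. The sketch below mimics that ordering with `itertools.product` and pairs it with the scores printed above. It is an illustration of the layout and does not call scikit-learn.

```python
from itertools import product

# Grid from Question 6; combinations are enumerated with the parameter
# names sorted, so 'C' is the outer loop and 'penalty' the inner one.
grid = {'C': [0.01, 0.1, 1, 10], 'penalty': ['l1', 'l2']}
names = sorted(grid)
combos = list(product(*(grid[name] for name in names)))

# Mean test scores copied from the output above, in cv_results_ order.
mean_scores = [0.64728682, 0.75581395, 0.79844961, 0.81007752,
               0.80620155, 0.81007752, 0.80620155, 0.81007752]

# Group the flat list into rows of two: one row per C, one column per penalty.
n_cols = len(grid['penalty'])
table = [mean_scores[i:i + n_cols] for i in range(0, len(mean_scores), n_cols)]

for (c, _), row in zip(combos[::n_cols], table):
    print(f"C={c}: l1={row[0]:.4f}, l2={row[1]:.4f}")
```

If the grid keys sorted differently (for example, a parameter name that sorts before `C`), the rows and columns of the reshaped array would swap roles, so it is worth checking `cv_results_['params']` before reshaping.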


In [54]:
def GridSearch_Heatmap(scores):
    get_ipython().run_line_magic('matplotlib', 'notebook')
    import seaborn as sns
    import matplotlib.pyplot as plt
    plt.figure()
    # Rows are the four C values and columns the two penalties, matching answer_six()
    sns.heatmap(scores.reshape(4, 2), xticklabels=['l1', 'l2'], yticklabels=[0.01, 0.1, 1, 10])
    plt.yticks(rotation=0)