# Support Vector Machine Classifier (SVC) - Fradulent Card Data

## 0. Import Necessary Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedShuffleSplit, StratifiedKFold, GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, precision_recall_curve, auc
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectFromModel
from sklearn.inspection import permutation_importance
from sklearn.utils import resample
import math
import joblib
import random

ModuleNotFoundError: No module named 'sklearn'

## 1. Framing the Problem

- What is the __business or research__ objective?

[Insert Answer Here]
- How will they (the business or research facility) benrfit from this chosen machine learning model

[Insert Answer Here]

- What performance metrics should be used for the chosen problem?

    - [Performance Metrics by Neptune AI](https://neptune.ai/blog/performance-metrics-in-machine-learning-complete-guide)
    - [Evaluation Metrics by National Library of Medicine](https://pmc.ncbi.nlm.nih.gov/articles/PMC10937649/)
    - [Evaluation Metrics by Amir Masoud](https://iamirmasoud.com/2022/06/19/understanding-micro-macro-and-weighted-averages-for-scikit-learn-metrics-in-multi-class-classification-with-example/)

*It is alright to skip step 1 for now and import and familiarize youself with the data first (**Steps** __2__ & __3__)*

## 2. Get the Data

Method for Retrieving Data from UCI ML Repo
- [UCI ML Repo Github Repository view README.md for Instructions](https://github.com/uci-ml-repo/ucimlrepo)
- [Read CSV Method from Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv)

In [28]:
# Insert Method Here to Import Data

## 3. Explore the Data (Exploratory Data Analysis - EDA)

- DataFrame.head()
- DataFrame.info()
- DataFrame.describe()
- DataFrame.value_counts()

See [Pandas Documentation - Data Frame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) for more methods and information on methods already used in class.

Check out [Matplotlib's Documentation](https://matplotlib.org/stable/plot_types/index.html) for the different types of graphs that can be made to explore the data. **I highly suggest spending more time with EDA** then I do with in-class with prior examples. It is good to familiarize yourself with the data before deciding what to do with it and if it can even work for a machine learning model.

It may be useful to have Python's list slicing method to access certain columns of a dataframe, explore the dataset, and prepare the data for trainning by knowing about list slicing in Python (data frame in the context of using Dataframes via Pandas) check out this article [here](https://www.pythonmorsels.com/slicing/)

*Note: Instead of a number followed by a colon (:) (i.e. [0:4]) dataframes use the name of the data frame column to perform splicing

df - dataframe

df["Column 1"] - where "Column 1" is the name of the column to be spliced

In [16]:
# Insert Code to Explore the Data (Perform EDA Here)

## 4. Prepare the Data and Model for Machine Learning

### Data Cleaning

#### Drop (or Fill) All NaN Values from Dataset (Clean the Data)

- DataFrame.fillna()
- DataFrame.drop()
- DataFrame.dropna()


See [Pandas Documentation - Data Frame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) for more methods and information on methods already used in class. (**Same link as above for Pandas**)

*Note: Sometimes it will be necessary depending on the dataset to also use __Numpy's methods__ documentation on this can be found [here](https://numpy.org/doc/stable/index.html) ( i.e. np.unique() )* 

In [17]:
# Insert Code Here

- Convert all __floats__ for true or false (binary) columns to __ints__

In [18]:
# Insert Code Here

**Hint**: In the end-end ML project example there's one column that gets converted to int, now we need to do it with all columns that represent binary values (1's and 0's)

### Balance Out the Classes (To prevent class imbalance)
* Be sure to change 'df' to whatever is the name of your dataframe

In [23]:
# Seperate the majority and minority classes
majority_class = df[df["fraud"] == 0]
minority_class = df[df["fraud"] == 1]

# Downsample the majority class to match the minority class
downsampled_majority_class = resample(majority_class, 
                                      replace=False, 
                                      n_samples=len(minority_class), 
                                      random_state=42)

# Combine the two dataframes together
df = pd.concat([downsampled_majority_class, minority_class])

NameError: name 'df' is not defined

### Perform Stratified Sampling and Reduce the Amount of Data to Prevent Overfitting

- Checkout [Scikit Learn's Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html) to see how to create a StratifiedShuffleSplit
- Or... [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) with stratified sampling

In [24]:
# Create Train and Test Split Here (Remember to Perform Cross Fold Validation Later if Not Using a Validation Set)

### Build the Data Pipeline

In [26]:
# Build the Data Pipeline
features = X_train.select_dtypes(include=["int64", "float64"]).columns
pipeline = Pipeline([
  
])

preprocessing = ColumnTransformer([

])

### Create a Baseline Model

In [None]:
# Code to  Be Completed Here
baseline = Pipeline([

])

# Code to Be Completed Here
baseline_scores = cross_val_score(

)
baseline_scores.mean()

## 5. Select and Train the Model

### Create the Initial Support Vector Machine

In [3]:
# Code to Be Completed Here

### Run Cross-Validation to Get the Score of the Initial ML Model

In [None]:
# Code to Be Completed Here

## 6. Fine Tune the Model

## Use Grid Search (or RandomSearch) CV For Hyper-Parameter Tuning 

In [2]:
param_grid = {
    '''
        Insert Hyperparameters for a Support Vector Machine Here
    '''
}

# Initialize the GridSearchCV (or RandomSearchCV if chosen) object
grid_search = GridSearchCV(estimator='''input name of initial classifier here''', param_grid=param_grid, cv=5, scoring='average_precision', n_jobs=-1, verbose=3)

# Fit the grid(random) search to the data
grid_search.fit(X_train, y_train)

# Print the best parameters and the best score achieved during the grid search
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)

SyntaxError: incomplete input (1503326985.py, line 15)

### Build Standardized PCA Pipe-line to Visualize the Multi-Dimensional Data (To Visualize SVM)

In [127]:
pipeline=Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2))
])
X_train_scaled_pca = pipeline.fit_transform(X_train)

In [128]:
pca_pipeline_step = pipeline.named_steps['pca']
loadings = pca_pipeline_step.components_
loadings_df = pd.DataFrame(data=loadings, columns=X_train.columns, index=[f'PC{i+1}' for i in range(pca_pipeline_step.n_components)])
display(loadings_df)

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order
PC1,0.649231,0.005839,-0.15363,0.531136,-0.388894,-0.150952,0.314217
PC2,-0.113151,-0.062503,0.66976,-0.082057,0.056144,-0.486259,0.537008


### Plotting the Initial Scalble Vectore Machine Classifier (SVC)

In [129]:
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_train_scaled_pca[:,0], X_train_scaled_pca[:,1], c=y_train, cmap="viridis", edgecolor='k', s=150)
plt.xlabel("Principle Component 1")
plt.ylabel("Principle Component 2")
plt.title("Scaled PCA Projection of Credit Card Fraud Detection (SVC)")
plt.colorbar(scatter)
plt.show()

### Re-fine through Feature Selection

#### Evaluating feature importance


In [7]:
results = permutation_importance(fraud_check_classifier, X_test, y_test, n_repeats=10, random_state=42)

NameError: name 'permutation_importance' is not defined

#### Displaying importance


In [None]:
importances = results.importances_mean
print(importances)

#### Select Features that are Relevant Based on Importances

In [None]:
threshold = np.median(importances)
indices = np.where(importances > threshold)[0]
# new x_train and y_train_data
X_train_new_feature = X_train.iloc[:, indices]
X_test_new_feature = X_test.iloc[:, indices]

### Make Finanlized Classifier with Newly Hypertuned Parameters and Refined Feature Selection

In [141]:
# Code to Be Completed Here



### Evaluate the Final Model on the Test Set

In [None]:
#Code to Be Completed Here

### Build Standardized PCA Pipe-line to Visualize the Multi-Dimensional Data

In [142]:
pipeline=Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2))
])
X_final_train_scaled_pca = pipeline.fit_transform(X_final_training_set)
X_final_test_scaled_pca = pipeline.fit_transform(X_test_new_feature)

In [143]:
pca_pipeline_step = pipeline.named_steps['pca']
loadings = pca_pipeline_step.components_
columns = ["distance_from_home","ratio_to_median_purchase_price","online_order"]
loadings_df = pd.DataFrame(data=loadings, columns=columns, index=[f'PC{i+1}' for i in range(pca_pipeline_step.n_components)])
display(loadings_df)

Unnamed: 0,distance_from_home,ratio_to_median_purchase_price,online_order
PC1,-0.489517,-0.445961,0.749328
PC2,-0.685672,0.72776,-0.014807


### Final Scalable Vectore Machine Classifier (SVC)

In [144]:
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_final_test_scaled_pca[:,0], X_final_test_scaled_pca[:,1], c=y_test_pred, cmap="viridis", edgecolor='k', s=150)
plt.xlabel("Principle Component 1")
plt.ylabel("Principle Component 2")
plt.title("Final Scaled PCA Projection of Credit Card Fraud Detection (SVC)")
plt.colorbar(scatter)
plt.show(bbox_inches="tight")

In [145]:
conf_matrix = confusion_matrix(y_test, y_test_pred)
class_report = classification_report(y_test, y_test_pred, target_names=class_names)
print('Confusion Matrix:\n', conf_matrix)
print('Classification Report:\n', class_report)

Confusion Matrix:
 [[1165  129]
 [  45 1249]]
Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.90      0.93      1294
           1       0.91      0.97      0.93      1294

    accuracy                           0.93      2588
   macro avg       0.93      0.93      0.93      2588
weighted avg       0.93      0.93      0.93      2588



### Plot the Final Confusion matrix

In [5]:
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='.2f', cmap='Blues', xticklabels=class_names, yticklabels=class_names)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show(bbox_inches="tight")

NameError: name 'plt' is not defined

### Plot the Final Classification Report 

In [6]:
class_report = classification_report(y_test, y_test_pred, target_names=class_names, output_dict=True)
df_final_report = pd.DataFrame(class_report).transpose()
plt.figure(figsize=(10, 5))
sns.heatmap(df_final_report.iloc[:-1, :].drop(columns=['support']), annot=True, cmap='Blues', cbar=False)
plt.title('Classification Report')
plt.show(bbox_inches="tight")

NameError: name 'classification_report' is not defined

### Plot and Get the Final Percision-Recall Score

In [4]:
# Code to Be Completed Here

### Saving the Created Model for 'Live' Testing

In [149]:
joblib.dump('''Final SVM Classifier Name Here''', "model_name.pkl")

['./models/fraud_checker_bal_Non-Linear_Scalable_Vector_Classifier.pkl']