# Project 2 - Finance

### DESCRIPTION

#### Problem Statement
- The finance industry is one of the biggest consumers of data scientists. It faces constant attacks from fraudsters who try to trick the system. Correctly identifying fraudulent transactions is often compared to finding a needle in a haystack because of the low event rate.
- It is important that credit card companies can recognize fraudulent credit card transactions so that customers are not charged for items they did not purchase.
- You are required to try various techniques, such as supervised models with oversampling, unsupervised anomaly detection, and heuristics, to get good accuracy at fraud detection.

#### Dataset Snapshot

The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
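To see why plain accuracy is a poor guide at this event rate, the sketch below uses only the two counts quoted above. The "label everything as legitimate" baseline is a hypothetical reference point for illustration, not a model used in this project.

```python
# Counts taken from the dataset description above.
n_total = 284_807
n_fraud = 492

fraud_rate = n_fraud / n_total

# Hypothetical baseline: label every transaction as legitimate (class 0).
baseline_accuracy = (n_total - n_fraud) / n_total
baseline_recall = 0 / n_fraud  # the baseline never flags a fraud

print(f"fraud rate:        {fraud_rate:.4%}")
print(f"baseline accuracy: {baseline_accuracy:.4%}")
print(f"baseline recall:   {baseline_recall:.0%}")
```

A classifier that never detects fraud still scores very high accuracy here, which is why recall, precision, and F1 on the fraud class are more informative.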

### Project Task: Week 1

#### Exploratory Data Analysis (EDA):
- Perform an EDA on the dataset.
    - Check all the latent features and parameters with their mean and standard deviation. Values are centered close to 0 (mean) with unit standard deviation.
    - Find if there is any connection between Time, Amount, and the transaction being fraudulent.
- Check the class count for each class. It’s a class imbalance problem.
- Use techniques like undersampling or oversampling before running Naïve Bayes, Logistic Regression or SVM.
    - Oversampling or undersampling can be used to tackle the class imbalance problem.
    - Oversampling increases the prior probability of the minority class; with other classifiers, errors on that class are multiplied because each minority sample is replicated multiple times.
- The following metrics are used to evaluate model performance: Precision, Recall, F1-Score, and the AUC-ROC curve. Use F1-Score as the evaluation criterion for this project.
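As a reminder of how the evaluation metrics listed above relate to each other, here is a minimal sketch that derives precision, recall, and F1 for the fraud class from confusion-matrix counts. The function name `precision_recall_f1` and the counts are made up for illustration; they are not results from this project.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for the positive (fraud) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 80 frauds caught (TP), 20 frauds missed (FN),
# 40 legitimate transactions flagged by mistake (FP).
precision, recall, f1 = precision_recall_f1(tp=80, fp=40, fn=20)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

True negatives do not appear in any of the three formulas, so the very large legitimate class cannot inflate these scores the way it inflates accuracy.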

In [1]:
# !pip install matplotlib
# !python -m pip install seaborn
# !pip install -U imbalanced-learn
# !pip install delayed

In [2]:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

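from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, RocCurveDisplay, auc, f1_score, precision_score, recall_score
# plot_roc_curve was removed in scikit-learn 1.2; alias its replacement so existing calls keep working
plot_roc_curve = RocCurveDisplay.from_estimator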
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold, RandomizedSearchCV
from sklearn.pipeline import Pipeline

from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB, MultinomialNB

In [3]:
# read data
df_train_original = pd.read_csv('./dataset/train_data.csv')
df_dev_original = pd.read_csv('./dataset/test_data_hidden.csv')
df_test_original = pd.read_csv('./dataset/test_data.csv')

In [None]:
print(df_train_original.shape)
print(df_dev_original.shape)
print(df_test_original.shape)

In [None]:
# df_train_original.dtypes

In [None]:
# df_dev_original.dtypes

In [None]:
# df_test_original.dtypes

In [None]:
# Rows in the train data that duplicate an earlier row
duplicate_rows_df_train_original = df_train_original[df_train_original.duplicated()]
print("number of duplicate train rows: ", duplicate_rows_df_train_original.shape[0])

In [None]:
# Rows in the hidden test (dev) data that duplicate an earlier row
duplicate_rows_df_dev_original = df_dev_original[df_dev_original.duplicated()]
print("number of duplicate dev rows: ", duplicate_rows_df_dev_original.shape[0])

In [None]:
# Rows in the test data that duplicate an earlier row
duplicate_rows_df_test_original = df_test_original[df_test_original.duplicated()]
print("number of duplicate test rows: ", duplicate_rows_df_test_original.shape[0])

In [None]:
# Remove duplicate entries
df_train_original = df_train_original.drop_duplicates()
df_dev_original = df_dev_original.drop_duplicates()
df_test_original = df_test_original.drop_duplicates()

In [None]:
print(df_train_original.shape)
print(df_dev_original.shape)
print(df_test_original.shape)

#### 1.a Check all the latent features and parameters with their mean and standard deviation. Values are centered close to 0 (mean)

In [None]:
df_train_original.describe().T

#### 1.b Find if there is any connection between Time, Amount, and the transaction being fraudulent.

In [None]:
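# visualization of time
plt.figure(figsize=(10,8))
plt.title("Time distribution")
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(df_train_original.Time, kde=True, stat="density")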

In [None]:
# Convert time from seconds to hours
df_train_original['Time'] = df_train_original['Time'] / 3600
# df_train_original['Time'] = df_train_original['Time'] / 24
print(df_train_original['Time'].min())
print(df_train_original['Time'].max())

In [None]:
plt.figure(figsize=(12,6), dpi=80)
sns.histplot(df_train_original.Time, bins=48)
plt.xlim([0,48])
plt.xticks(np.arange(0,54,6))
plt.xlabel('Time (hours)')
plt.ylabel('Count')
plt.title('Transaction Times')

In [None]:
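# visualization of amount
plt.figure(figsize=(10,8))
plt.title("Amount distribution")
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(df_train_original.Amount, kde=True, stat="density")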

In [None]:
sns.relplot(x="Time", y="Amount", hue="Class", data=df_train_original);

In [None]:
sns.relplot(x="Time", y="Amount", col="Class", data=df_train_original);

#### 2. Check the class count for each class. It’s a class imbalance problem.

In [None]:
df_train_original['Class'].value_counts()

In [None]:
sns.countplot(x='Class', data=df_train_original)

#### 3. Use techniques like undersampling or oversampling before running Naïve Bayes, Logistic Regression or SVM.

In [None]:
# Remove the Time and Amount columns
df_train_original.drop(['Time','Amount'], axis=1, inplace=True)
df_dev_original.drop(['Time','Amount'], axis=1, inplace=True)
df_test_original.drop(['Time','Amount'], axis=1, inplace=True)

In [None]:
Y_train = df_train_original.pop('Class')
X_train = df_train_original

In [None]:
print(Y_train.shape)
print(X_train.shape)

In [None]:
Y_dev = df_dev_original.pop('Class')
X_dev = df_dev_original
print(Y_dev.shape)
print(X_dev.shape)

In [None]:
# oversample the train data
oversample = RandomOverSampler(sampling_strategy='minority')

# fit and apply the transform
X_train_over, y_train_over = oversample.fit_resample(X_train, Y_train)

print(X_train_over.shape)
print(y_train_over.shape)

In [None]:
# undersample the train data
undersample = RandomUnderSampler(sampling_strategy='majority')

# fit and apply the transform
X_train_under, y_train_under = undersample.fit_resample(X_train, Y_train)

print(X_train_under.shape)
print(y_train_under.shape)
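For intuition about what random oversampling does to the class counts, here is a minimal standard-library sketch. It is a simplified stand-in and not imbalanced-learn's implementation; the helper `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [row for row, lab in zip(X, y) if lab == label]
        # Sample with replacement to fill the gap to the majority count.
        for _ in range(target - count):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


# Toy data (made up): 8 legitimate rows and 2 fraud rows.
X_toy = [[i] for i in range(10)]
y_toy = [0] * 8 + [1] * 2

X_res, y_res = random_oversample(X_toy, y_toy)
print(Counter(y_toy), "->", Counter(y_res))
```

Because the added rows are exact copies, oversampling should be applied to the training split only; copies leaking into the dev or test data would make the scores look better than they are.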

In [None]:
def modal_perfomance(model, X_train, Y_train, X_test, Y_test):
    """Fit the model, print macro-averaged metrics and plot the ROC curve."""
    model.fit(X_train, Y_train)
    predicted = model.predict(X_test)
    report = classification_report(Y_test, predicted)
    # average="macro" is the unweighted mean of the per-class scores
    print("f1_score       : ", f1_score(Y_test, predicted, average="macro"))
    print("precision_score: ", precision_score(Y_test, predicted, average="macro"))
    print("recall_score   : ", recall_score(Y_test, predicted, average="macro"))
    print("\nAccuracy Score :", accuracy_score(Y_test, predicted))
    print(report)
    plot_roc_curve(model, X_test, Y_test)
    plt.show()
    # roc_curve_plt(model, X_test, Y_test, predicted)

In [None]:
lr = LogisticRegression(class_weight='balanced')
modal_perfomance(lr, X_train_over, y_train_over, X_dev, Y_dev)

### Observations: Week 1
- The training set contains two days of transaction details; the transaction rate is low at night.
- The transaction amount distribution is highly skewed; only a small number of high-amount transactions are present.
- There is no apparent connection between Time, Amount, and the transaction being fraudulent.
- All the fraudulent transactions are of low amount.
- This is a highly imbalanced classification problem; fraudulent transactions are very rare (less than 0.17%).

### Project Task: Week 2

#### Modeling Techniques:
- Try out models like Naive Bayes, Logistic Regression or SVM. Find out which one performs the best.
- Use different tree-based classifiers like Random Forest and XGBoost.
    - Remember that tree-based classifiers work on one of two ideologies: bagging or boosting.
    - Tree-based classifiers such as Random Forest and XGBoost have tuning parameters that take care of the imbalanced class.
- Compare the results of 1 with 2 and check if there is any incremental gain.
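The imbalance-related tuning parameters mentioned above are typically `class_weight` in scikit-learn's `RandomForestClassifier` and `scale_pos_weight` in XGBoost. As a rough sketch, the values they would take can be computed by hand; the counts below come from the Dataset Snapshot and are an assumption here, since the actual training split will have different counts.

```python
# Counts from the dataset description; the training split will differ.
n_total = 284_807
n_fraud = 492
n_legit = n_total - n_fraud
n_classes = 2

# scikit-learn's class_weight="balanced" heuristic:
# n_samples / (n_classes * count_of_that_class)
balanced_weights = {
    0: n_total / (n_classes * n_legit),
    1: n_total / (n_classes * n_fraud),
}

# Common starting value for XGBoost's scale_pos_weight: negatives / positives
scale_pos_weight = n_legit / n_fraud

print(balanced_weights)
print(round(scale_pos_weight, 1))
```

These weights are an alternative to resampling. Applying them on top of already oversampled training data would likely correct for the imbalance twice, so it is usually worth comparing the two approaches separately.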

In [None]:
# Create logistic regression pipeline
lr_pipeline = Pipeline(
    steps=[
        ('scaler', StandardScaler()), 
        ('lr', LogisticRegression())
    ]
)

In [None]:
# Create SVC pipeline
svm_pipeline = Pipeline(
    steps=[
        ('scaler', StandardScaler()), 
        ('svm', SVC())
    ]
)

In [None]:
# Create Gaussian NB pipeline
gnb_pipeline = Pipeline(
    steps=[
        ('scaler', StandardScaler()), 
        ('gnb', GaussianNB())
    ]
)

In [None]:
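# Create Multinomial NB pipeline
# MultinomialNB rejects negative feature values when fitting,
# so scale features to [0, 1] instead of standardizing them
from sklearn.preprocessing import MinMaxScaler

mnb_pipeline = Pipeline(
    steps=[
        ('scaler', MinMaxScaler()),
        ('mnb', MultinomialNB())
    ]
)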

In [None]:
pipelines = [
    lr_pipeline, 
    svm_pipeline, 
    gnb_pipeline, 
    mnb_pipeline
]

In [None]:
pipeline_dict = {
    '0' : 'LogisticRegression',
    '1' : 'SVM',
    '2' : 'Gaussian NB',
    '3' : 'Multinomial NB'
}

for pipe in pipelines:
    pipe.fit(X_train_over, y_train_over)