Task: Data Classification in Financial Domain
=======


This notebook addresses a classification problem in the financial domain. We use a dataset (available in the current directory as `moro14_synth2.csv`) that is synthesized from the [bank telemarketing dataset](https://www.researchgate.net/publication/260805594_A_Data-Driven_Approach_to_Predict_the_Success_of_Bank_Telemarketing). A detailed description of the included variables (columns) can be found in the [UCI repository](https://archive.ics.uci.edu/ml/datasets/bank+marketing).

The following cells contain code snippets to build a classification pipeline using the above dataset. The main task is to predict the variable `y`, based on the data at hand.

We ask the candidate to:

- investigate and justify pre-processing steps to be performed on the data;
- correct the implementation (which contains various deliberately problematic aspects) into a proper cross-validation procedure;
- choose and justify evaluation strategies for the given problem.


## Task 1: Data Analysis & Pre-processing

It may be wise to pre-process the data. Please conduct an analysis to investigate what pre-processing you might want to do.

First, load the data using the following code cell:

In [None]:
import pandas as pd
import numpy as np

# for reproducibility
np.random.seed(2021)

# load data
df = pd.read_csv('moro14_synth2.csv', index_col=0)

Once the provided code has loaded the data, use a few code cells and Markdown blocks to explore it. Report and discuss any interesting findings. If there are any concerns that need to be addressed, discuss them as well.

[NOTE]: Please assume you only have the current dataset at hand, so do not assume prior distributional knowledge from the original bank telemarketing dataset.
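
A first pass over the data could start from a small summary helper such as the one below. This is only a sketch: the helper name `summarize_frame` and the toy frame are assumptions made for illustration, and the toy columns are not claimed to match the contents of `moro14_synth2.csv`.

```python
import numpy as np
import pandas as pd


def summarize_frame(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column overview: dtype, number of missing values, and cardinality."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(dropna=True),
    })


# Toy stand-in for `df` (hypothetical columns and values, NOT the real dataset).
toy = pd.DataFrame({
    "age": [25, 40, np.nan, 58],
    "job": ["admin.", "unknown", "services", "admin."],
    "y": ["no", "no", "yes", "no"],
})

summary = summarize_frame(toy)
print(summary)

# Relative class frequencies of the target, to spot possible class imbalance.
class_share = toy["y"].value_counts(normalize=True)
print(class_share)
```

Applied to `df`, the same two calls would give a quick view of missing values, high-cardinality columns, and the balance of `y`, which can then guide which plots are worth making.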

In [1]:
# Explore the dataset `df` and elaborate on your exploration.
# Feel free to use as many Markdown comment blocks and code blocks as you want, and feel free to add visualizations.

In the next code cell, an empty function stub is given for pre-processing the data before classification. Please complete the function with any pre-processing steps you would like to take, and explain the details.
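
To make the expected output shapes concrete, here is a minimal sketch of the purely mechanical part of such a conversion. It is not a reference solution: the helper name `to_arrays`, the toy columns, and the assumption that the positive class of `y` is spelled `"yes"` are all illustrative and should be checked against the actual data.

```python
from typing import Tuple

import numpy as np
import pandas as pd


def to_arrays(frame: pd.DataFrame, target: str = "y") -> Tuple[np.ndarray, np.ndarray]:
    """Mechanical conversion only: one-hot encode categoricals, binarize the label."""
    features = frame.drop(columns=[target])
    # One-hot encode non-numeric columns; numeric columns pass through unchanged.
    encoded = pd.get_dummies(features, dtype=float)
    # ASSUMPTION: the positive class is spelled "yes" in the target column.
    labels = (frame[target] == "yes").astype(int)
    return encoded.to_numpy(dtype=float), labels.to_numpy()


# Toy stand-in for `df` (hypothetical columns and values, NOT the real dataset).
toy = pd.DataFrame({
    "age": [25, 40, 58],
    "job": ["admin.", "services", "admin."],
    "y": ["no", "yes", "no"],
})

X_toy, y_toy = to_arrays(toy)
print(X_toy.shape, y_toy.shape)
```

Steps that learn parameters from the data, such as scaling or imputation, are deliberately left out of this sketch, since fitting them on the full dataset before splitting may leak information into the evaluation.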

In [None]:
from typing import Tuple

def pre_processing(df_: pd.DataFrame) -> Tuple[np.ndarray, np.ndarray]:
    """Pre-processing step for the classification task.

    This function takes the data frame loaded above and returns a tuple of
    two NumPy arrays. The first element holds the independent variables
    (feature vectors) with shape (N, d), where N is the number of
    observations and d is the number of variables (columns). The second
    element is the vector of the dependent variable (label) with shape (N,).

    Complete this function so that it pre-processes the data frame to fit the
    output spec. Beyond the mechanical conversion between input and output
    data types, apply any content-wise pre-processing that is necessary.
    """
    pass

We now call this function to pre-process the data.

In [None]:
# simple ``execution`` lines
X, y = pre_processing(df)

## Task 2: Classification & Model Selection

We now proceed to the training and model selection steps of the classification task. In particular, we consider a range of classification models:

- Gaussian Naive Bayes Classifier
- Logistic Regression
- Quadratic Discriminant Analysis
- Decision Tree Classifier

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

clfs = {
    'LR': LogisticRegression(),
    'QDA': QuadraticDiscriminantAnalysis(),
    'GNB': GaussianNB(),
    'DT': DecisionTreeClassifier()
}

We are interested in selecting the best model from these candidates through a cross-validation procedure.

We provide some basic code below, but this code is (deliberately) problematic. Please improve the procedure; feel free to fully modify the cells, and to include any further intermediate processing steps you find necessary.
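
For orientation, one common shape of such a procedure is sketched below: a stratified hold-out test set, with stratified k-fold cross-validation on the remaining data and all learned pre-processing kept inside a pipeline. This is a sketch under stated assumptions, not the expected answer: the data comes from `make_classification` rather than the notebook's `X` and `y`, and the split sizes, the choice of `StandardScaler`, and the `balanced_accuracy` scorer are illustrative.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for (X, y); NOT the notebook's data.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=2021
)

# Hold out a stratified test set that is used only once, after model selection.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2021
)

candidates = {
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=2021),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2021)

cv_scores = {}
for name, clf in candidates.items():
    # Scaling lives inside the pipeline, so it is re-fit on each training fold.
    pipe = make_pipeline(StandardScaler(), clone(clf))
    cv_scores[name] = cross_val_score(
        pipe, X_tr, y_tr, cv=cv, scoring="balanced_accuracy"
    )
    print(f"{name}: {cv_scores[name].mean():.3f} +/- {cv_scores[name].std():.3f}")
```

Keeping the scaler inside the pipeline means its statistics are estimated from the training folds only, and cloning each classifier avoids reusing a previously fitted estimator.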

Furthermore, we have not said anything about evaluation. Please propose and implement a proper evaluation procedure, such that a model can be selected.
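
When comparing evaluation strategies, it can help to report several metrics side by side and to include a trivial baseline. The sketch below does this with `cross_validate` on synthetic data; the data, the chosen metrics, and the majority-class baseline are assumptions for illustration, and which metric fits the task still has to be argued from the actual problem.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in for (X, y); NOT the notebook's data.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=2021
)

scoring = {
    "accuracy": "accuracy",
    "balanced_accuracy": "balanced_accuracy",
    # zero_division=0 keeps F1 defined when a model predicts no positives.
    "f1": make_scorer(f1_score, zero_division=0),
    "roc_auc": "roc_auc",
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2021)

models = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "GNB": GaussianNB(),
}

mean_scores = {}
for name, model in models.items():
    res = cross_validate(model, X_demo, y_demo, cv=cv, scoring=scoring)
    mean_scores[name] = {m: res[f"test_{m}"].mean() for m in scoring}
    print(name, {m: round(v, 3) for m, v in mean_scores[name].items()})
```

On imbalanced data the majority-class baseline tends to score high on plain accuracy while staying at chance level on balanced accuracy and ROC AUC, which is one reason to look beyond a single number.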

In [None]:
n_samples, n_features = X.shape

# split the dataset into train / test
rnd_idx = np.random.permutation(n_samples)
bound = int(n_samples / 2)

x_train = X[rnd_idx[:bound]]
y_train = y[rnd_idx[:bound]]
x_test = X[rnd_idx[bound:]]
y_test = y[rnd_idx[bound:]]

In [None]:
# here all classifiers are trained with the train dataset split
# [note] if there is a warning or an error, try to fix it or discuss it
for name, clf in clfs.items():
    clf.fit(x_train, y_train)

In [None]:
# how would you evaluate the models, and justify a selection?

### Discussion of evaluation results

Could you find any irregular, unexpected, or interesting behavior? Write a cell to evaluate it and discuss your findings.
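
One way to make irregular behavior during training visible is to record the warnings a classifier emits while fitting, instead of letting them scroll by. The sketch below uses hypothetical unscaled features and a deliberately small `max_iter`, both chosen only to provoke solver warnings; whether and which warnings appear may depend on the installed scikit-learn version.

```python
import warnings

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2021)

# Hypothetical features on very different scales; NOT the notebook's data.
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 1e3, 1e-3])
y_demo = (X_demo[:, 0] + X_demo[:, 1] / 1e3 > 0).astype(int)

# Record every warning raised while fitting, rather than printing it once.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    LogisticRegression(max_iter=5).fit(X_demo, y_demo)

# Keep the warning category and the first line of each message.
messages = [
    f"{w.category.__name__}: {str(w.message).splitlines()[0]}" for w in caught
]
for msg in messages:
    print(msg)
```

Collected this way, the warnings can be reported per classifier and discussed alongside the scores, for example by checking whether they disappear after scaling the features.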

In [None]:
# answers go here