# Part 1. Problem and Reference Answer

## 1.1 The Problem

Banks are looking to your expertise in artificial intelligence to predict credit card approvals. They provide you with a credit card database, and your task is to determine whether each credit card application should be approved based on the variables within it.

To maintain confidentiality, all variable names and values have been anonymized and replaced with symbols.

| Variable name   | Role    | Type        | Values                                       |
| --------------- | ------- | ----------- | -------------------------------------------- |
| Feature 1       | Feature | Continuous  |                                              |
| Feature 2       | Feature | Continuous  |                                              |
| Feature 3       | Feature | Continuous  |                                              |
| Feature 4       | Feature | Continuous  |                                              |
| Feature 5       | Feature | Continuous  |                                              |
| Feature 6       | Feature | Continuous  |                                              |
| Feature 7       | Feature | Categorical | 0,1                                          |
| Feature 8       | Feature | Categorical | 0,1                                          |
| Feature 9       | Feature | Categorical | 0,1                                          |
| Feature 10      | Feature | Categorical | 0,1                                          |
| Feature 11      | Feature | Categorical | 1,2,3                                        |
| Feature 12      | Feature | Categorical | 1,2,3                                        |
| Feature 13      | Feature | Categorical | 1,2,3,4,5,6,7,8,9                            |
| Feature 14      | Feature | Categorical | 1,2,3,4,5,6,7,8,9,10,11,12,13,14             |
| Label           | Label   | Categorical | 0,1 (0 for non-approval and 1 for approval)  |

The bank requires the use of decision tree-based models, emphasizing the importance of interpretability in the decision-making process. Try using a decision tree-based model for the best results! You will train and validate a decision tree-based model using the 590 samples in the public dataset. Your predictions will then be evaluated on the 100 samples in the private dataset, and your performance on the private dataset will be used as the final score.

Hint 1: Cross-validation is important.

Hint 2: Consider preprocessing and feature engineering if it benefits your model.

Hint 3: Optimize hyperparameters for improved performance.

Hint 4: Other techniques like pre-pruning and post-pruning can be applied.
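
Hint 4 can be made concrete with scikit-learn. The sketch below is an illustration under stated assumptions, not part of the assignment template: it uses synthetic data from `make_classification` as a stand-in for the public CSV, a single hold-out split instead of cross-validation, and `max_leaf_nodes=10` as an arbitrary pre-pruning limit. Pre-pruning restricts the tree while it grows; post-pruning grows the full tree and then cuts it back using cost-complexity pruning (`ccp_alpha`).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the public dataset (assumption: 590 samples, 14 features).
X, y = make_classification(n_samples=590, n_features=14, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Unpruned baseline: the tree keeps splitting until no further split is possible.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pre-pruning: cap the number of leaves while the tree is being grown.
pre_pruned = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0).fit(X_train, y_train)

# Post-pruning: compute the cost-complexity path of the full tree, refit with
# each candidate alpha, and keep the alpha that scores best on validation data.
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    candidate = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)
    score = candidate.score(X_val, y_val)
    if score >= best_score:  # ties go to the larger alpha, i.e. the smaller tree
        best_alpha, best_score = alpha, score

post_pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X_train, y_train)

print('leaves (full / pre-pruned / post-pruned):',
      full_tree.get_n_leaves(), pre_pruned.get_n_leaves(), post_pruned.get_n_leaves())
```

On the real data, the candidate values of `ccp_alpha` could be scored with the same five-fold procedure used for the other hyperparameters rather than a single hold-out split.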

Note:
* Please download the public dataset from https://drive.google.com/file/d/1gQ_hA5DLMYQHcqGmkSaq2w-Sm-OLAKfZ/view?usp=sharing
* Please download the private dataset from https://drive.google.com/file/d/1s1xhpfWKWeACf3SiPAmEhCs8dfwmSVEP/view?usp=sharing
* Please download the ground-truth for private dataset from https://drive.google.com/file/d/1yS5Qi81CJwxdhunUDNoz-2_D9QWzVyLN/view?usp=share_link
* Please use the following Python template for submission. (You can copy the code below from lab6-Exercise.ipynb)
* Your results will be evaluated on the 100 samples in the private dataset (the labels will be released in Lab 7).

```python
import os
import numpy as np
import pandas as pd

def read_data_from_csv(path):
    """Load datasets from CSV files.
    Args:
        path (str): Path to the CSV file.
    Returns:
        X (np.ndarray): Features of samples.
        y (np.ndarray): Labels of samples, only provided in the public datasets.
    """
    assert os.path.exists(path), f'File not found: {path}!'
    assert os.path.splitext(path)[
        -1] == '.csv', f'Unsupported file type {os.path.splitext(path)[-1]}!'

    data = pd.read_csv(path)
    column_list = data.columns.values.tolist()

    if 'Label' in column_list:
        # for the public dataset, label column is provided.
        column_list.remove('Label')
        X = data[column_list].values
        y = data['Label'].astype('int').values
        return X, y
    else:
        # for the private dataset, label column is not provided.
        X = data[column_list].values
        return X

X_public, y_public = read_data_from_csv('assignment_3_public.csv')
print('Shape of X_public:', X_public.shape)  # n_sample, m_feature (590, 14)
print('Shape of y_public:', y_public.shape)  # n_sample (590,)

'''
CODE HERE!
'''

X_private = read_data_from_csv('assignment_3_private.csv')
print('Shape of X_private:', X_private.shape)  # k_sample, m_feature (100, 14)

# remove and make your own predictions.
preds = np.full(len(X_private), -1, dtype=int)
'''
CODE HERE!
e.g.,
preds = np.full(len(X_private), -1, dtype=int)
'''

submission = pd.DataFrame({'Label': preds})
submission.to_csv('assignment_3.csv', index=True, index_label='Id')

```

## 1.2 Reference Code

You may need to revise the generated code. Here is what I finally got. Some useful and simple techniques that may be included are:

- Utilizing five-fold cross-validation to select the hyperparameters with the best average performance, such as max_leaf_nodes
- Ensembling the models trained during five-fold cross-validation
- Handling missing values, such as using the most frequent value
- Processing categorical variables, such as one-hot encoding
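
To show what the last two techniques do to a single column, here is a minimal NumPy sketch. The helper names `impute_most_frequent` and `one_hot` and the toy values are assumptions made for illustration; the reference code below relies on scikit-learn's `SimpleImputer` and `OneHotEncoder` instead.

```python
import numpy as np

def impute_most_frequent(column):
    """Replace NaN entries with the most frequent observed value."""
    column = np.asarray(column, dtype=float)
    observed = column[~np.isnan(column)]
    values, counts = np.unique(observed, return_counts=True)
    filled = column.copy()
    filled[np.isnan(column)] = values[np.argmax(counts)]
    return filled

def one_hot(column):
    """Encode a categorical column as one 0/1 indicator column per category."""
    column = np.asarray(column)
    categories = np.unique(column)
    return categories, (column[:, None] == categories[None, :]).astype(int)

feature_11 = [1, 3, np.nan, 3, 2]  # toy values, not taken from the dataset
filled = impute_most_frequent(feature_11)
categories, encoded = one_hot(filled)
print(filled)      # the NaN is replaced by 3, the most frequent value
print(categories)  # the distinct categories, in sorted order
print(encoded)     # one row per sample, one column per category
```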

In [None]:
import os
import numpy as np
import pandas as pd

from sklearn.tree import DecisionTreeClassifier


def read_data_from_csv(path):
    """Load datasets from CSV files.
    Args:
        path (str): Path to the CSV file.
    Returns:
        X (np.ndarray): Features of samples.
        y (np.ndarray): Labels of samples, only provided in the public datasets.
    """
    assert os.path.exists(path), f'File not found: {path}!'
    assert os.path.splitext(path)[
        -1] == '.csv', f'Unsupported file type {os.path.splitext(path)[-1]}!'

    data = pd.read_csv(path)
    column_list = data.columns.values.tolist()

    if 'Label' in column_list:
        # for the public dataset, label column is provided.
        column_list.remove('Label')
        X = data[column_list].values
        y = data['Label'].astype('int').values
        return X, y
    else:
        # for the private dataset, label column is not provided.
        X = data[column_list].values
        return X


X_public, y_public = read_data_from_csv('assignment_3_public.csv')
print('Shape of X_public:', X_public.shape)  # n_sample, m_feature (590, 14)
print('Shape of y_public:', y_public.shape)  # n_sample (590,)

import random
import itertools

from tqdm import tqdm
from sklearn.model_selection import KFold
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, IterativeImputer

def seed_everything(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)


seed_everything(seed=42)

import warnings

warnings.filterwarnings("ignore")

def make_grid(pars_dict):
    keys = pars_dict.keys()
    combinations = itertools.product(*pars_dict.values())
    outputs = [dict(zip(keys, combination)) for combination in combinations]
    return outputs


# hyperparameter and preprocessing options to search over.
param_grid = {
    'max_leaf_nodes': [None, 5, 10, 15],
    'imputer': [None, 'median', 'most_frequent'],
    'category': [None, 'onehot']
}

param_list = make_grid(param_grid)

test_acc = 0.0
best_param = None
best_models = None
best_imputer = None
best_category = None

pbar = tqdm(total=len(param_list))
for param in param_list:
    processed_X_public = X_public.copy()

    if param['imputer'] == 'median':
        imputer = SimpleImputer(strategy='median')
        imputer.fit(processed_X_public)
        processed_X_public = imputer.transform(processed_X_public)
    elif param['imputer'] == 'most_frequent':
        imputer = SimpleImputer(strategy='most_frequent')
        imputer.fit(processed_X_public)
        processed_X_public = imputer.transform(processed_X_public)
    else:
        # you can do more imputation methods here.
        imputer = None

    if param['category'] == 'onehot':
        # the 7th column to the last column are categorical features.
        from sklearn.preprocessing import OneHotEncoder
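        # `sparse` was renamed to `sparse_output` in scikit-learn 1.2; use sparse=False on older versions.
        est = OneHotEncoder(sparse_output=False, handle_unknown='ignore')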
        est.fit(processed_X_public[:, 6:])
        processed_X_public = np.concatenate([processed_X_public[:, :6], est.transform(processed_X_public[:, 6:])], axis=1)
        category = est
    else:
        category = None

    k_fold = KFold(n_splits=5, shuffle=True, random_state=42)

    preds = []
    gts = []
    models = []

    for fold_idx, (train_idx, test_idx) in enumerate(k_fold.split(processed_X_public)):
        X_train, X_test = processed_X_public[train_idx], processed_X_public[test_idx]
        y_train, y_test = y_public[train_idx], y_public[test_idx]

        # filter keys that are not parameters of DecisionTreeClassifier.
        clf_param = {k: v for k, v in param.items() if k in DecisionTreeClassifier().get_params().keys()}
        model = DecisionTreeClassifier(**clf_param)
        model.fit(X_train, y_train)
        pred = model.predict(X_test)

        preds.append(pred.tolist())
        gts.append(y_test.tolist())

        models.append(model)

    cur_acc = np.sum(np.array(preds) == np.array(gts)) / len(y_public)
    pbar.update(1)
    pbar.set_description(f'acc: {cur_acc}')

    if cur_acc > test_acc:
        test_acc = cur_acc
        best_param = param
        best_models = models
        best_imputer = imputer
        best_category = category

pbar.close()
print(best_param)

Shape of X_public: (590, 14)
Shape of y_public: (590,)


acc: 0.8593220338983051: 100%|██████████| 24/24 [00:00<00:00, 43.91it/s]

{'max_leaf_nodes': 10, 'imputer': 'most_frequent', 'category': 'onehot'}
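
One detail of the loop above: the accuracy is computed by stacking the per-fold prediction lists into a 2-D array, which works here because 590 samples split into five folds of exactly 118. A sketch of a variant that also handles unequal fold sizes is shown below; `pooled_accuracy` is a hypothetical helper name and the toy folds are made up for illustration.

```python
import numpy as np

def pooled_accuracy(fold_preds, fold_labels):
    """Accuracy over all held-out samples, pooled across folds of any size."""
    preds = np.concatenate([np.asarray(p) for p in fold_preds])
    labels = np.concatenate([np.asarray(t) for t in fold_labels])
    return float(np.mean(preds == labels))

# Toy folds of unequal size (hypothetical values).
fold_preds = [[1, 0, 1], [0, 0], [1, 1]]
fold_labels = [[1, 0, 0], [0, 1], [1, 1]]
print(pooled_accuracy(fold_preds, fold_labels))  # 5 of 7 predictions are correct
```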





# Part 2. Demo of Interactions with ChatGPT

You can use either of the following approaches to interact with ChatGPT:
- Access https://genai.polyu.edu.hk/ to interact with ChatGPT.
- Use OpenAI API to interact with ChatGPT via code.

## 2.1 Preparations

In [None]:
%%capture

import numpy as np
import networkx as nx

from typing import List

# Set the seed
def seed_everything(seed=0):
    np.random.seed(seed)
seed_everything()

# Install OpenAI package
!pip install openai

# Import OpenAI and set the API key
import openai
openai.api_key = 'OpenAI_API_Key' # Replace with your own OpenAI API Key

# Define a function that gets a response from ChatGPT
_messages = []
def get_completion(prompt, model="gpt-3.5-turbo-0613"):
    _messages.append({"role": "user", "content": prompt})
    response = openai.ChatCompletion.create(
        model=model,
        messages=_messages,
        temperature=0.0,  # this is the degree of randomness of the model's output
    )
    content = response.choices[0].message["content"]
    _messages.append({"role": "assistant", "content": content})
    return content


# Set display format
import html
from IPython.display import display, HTML
css_content = """.cs-message{box-sizing:border-box;font-size:1em;font-family:Helvetica Neue,Segoe UI,Helvetica,Arial,sans-serif;color:#000000de;display:flex;flex-direction:row;padding:0;background-color:transparent;overflow:hidden;border-radius:0}.cs-message:not(:only-child){margin:.2em 0 0}.cs-message__avatar{box-sizing:border-box;margin:0 8px 0 0;display:flex;flex-direction:column;justify-content:flex-end;width:42px}.cs-message__content-wrapper{box-sizing:border-box;display:flex;flex-direction:column}.cs-message__content{box-sizing:border-box;color:#000000de;background-color:#c6e3fa;margin-top:0;padding:.6em .9em;border-radius:.7em;white-space:pre-wrap;overflow-wrap:anywhere;word-break:break-word;font-family:Helvetica Neue,Segoe UI,Helvetica,Arial,sans-serif;font-weight:400;font-size:.91em;font-variant:normal}.cs-message--incoming{color:#000000de;background-color:transparent;margin-right:auto}.cs-message--incoming .cs-message__avatar{margin:0 8px 0 0}.cs-message--incoming .cs-message__content{color:#000000de;background-color:#c6e3fa;border-radius:0 .7em .7em 0}.cs-message--outgoing{color:#000000de;background-color:transparent;margin-left:auto;justify-content:flex-end}.cs-message--outgoing .cs-message__avatar{order:1;margin:0 0 0 8px}.cs-message--outgoing .cs-message__content{color:#000000de;background-color:#6ea9d7;border-radius:.7em 0 0 .7em}.cs-message.cs-message--incoming.cs-message--single{border-radius:0}.cs-message.cs-message--incoming.cs-message--single:not(:first-child){margin-top:.4em}.cs-message.cs-message--incoming.cs-message--single .cs-message__content{border-radius:0 .7em .7em}.cs-message.cs-message--outgoing.cs-message--single{border-radius:0}.cs-message.cs-message--outgoing.cs-message--single .cs-message__content{border-radius:.7em .7em 
0}.cs-avatar{position:relative;width:42px;height:42px;border-radius:50%;box-sizing:border-box}.cs-avatar>img{box-sizing:border-box;width:100%;height:100%;border-radius:50%}.cs-avatar.cs-avatar--md{width:42px;height:42px;min-width:42px;min-height:42px
"""
html_content = """
    <section aria-label="User" class="cs-message cs-message--outgoing cs-message--single" data-cs-message="">
        <div class="cs-message__avatar">
            <div class="cs-avatar cs-avatar--md"><img src="https://genai.polyu.edu.hk/assets/user-8c5a41ea.png" alt="PolyU"> </div>
        </div>
        <div class="cs-message__content-wrapper">
            <div class="cs-message__content">
                <div class="cs-message__custom-content">hello</div>
            </div>
        </div>
    </section>
    <section aria-label="ChatGPT" class="cs-message cs-message--incoming cs-message--single" data-cs-message="">
        <div class="cs-message__avatar">
            <div class="cs-avatar cs-avatar--md"><img src="https://genai.polyu.edu.hk/assets/chatbot-08b96590.png" alt="PolyU"> </div>
        </div>
        <div class="cs-message__content-wrapper">
            <div class="cs-message__content">
                <div class="cs-message__custom-content">Hello! How can I assist you today?</div>
            </div>
        </div>
    </section>
"""
def generate_html(messages, n=2):
    if n is not None:
        messages = messages[-n:]

    html_parts = []
    user_template = """
    <section aria-label="User" class="cs-message cs-message--outgoing cs-message--single" data-cs-message="">
        <div class="cs-message__avatar">
            <div class="cs-avatar cs-avatar--md"><img src="https://genai.polyu.edu.hk/assets/user-8c5a41ea.png" alt="PolyU"> </div>
        </div>
        <div class="cs-message__content-wrapper">
            <div class="cs-message__content">
                <div class="cs-message__custom-content">{content}</div>
            </div>
        </div>
    </section>
    """

    assistant_template = """
    <section aria-label="ChatGPT" class="cs-message cs-message--incoming cs-message--single" data-cs-message="">
        <div class="cs-message__avatar">
            <div class="cs-avatar cs-avatar--md"><img src="https://genai.polyu.edu.hk/assets/chatbot-08b96590.png" alt="PolyU"> </div>
        </div>
        <div class="cs-message__content-wrapper">
            <div class="cs-message__content">
                <div class="cs-message__custom-content">{content}</div>
            </div>
        </div>
    </section>
    """

    for message in messages:
        sanitized_content = html.escape(message["content"])
        if message["role"] == "user":
            html_parts.append(user_template.format(content=sanitized_content))
        elif message["role"] == "assistant":
            html_parts.append(assistant_template.format(content=sanitized_content))

    return "".join(html_parts) + f"<style>{css_content}</style>"
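
The `openai.ChatCompletion.create` call above belongs to the pre-1.0 `openai` package. If a newer version is installed, the same helper might look like the sketch below. It assumes a client object exposing `chat.completions.create(...)`, as `openai.OpenAI(api_key=...)` does in `openai>=1.0`; the name `get_completion_v1` and the fake client are hypothetical and exist only so that the example runs offline without an API key.

```python
from types import SimpleNamespace

def get_completion_v1(client, prompt, history, model="gpt-3.5-turbo-0613"):
    """Send `prompt` together with the running `history` and record the reply.

    `client` is assumed to expose `chat.completions.create(...)`.
    """
    history.append({"role": "user", "content": prompt})
    response = client.chat.completions.create(
        model=model,
        messages=history,
        temperature=0.0,  # as deterministic as the API allows
    )
    content = response.choices[0].message.content  # attribute access, not ["content"]
    history.append({"role": "assistant", "content": content})
    return content

# Offline stand-in that mimics the shape of the response object.
def _fake_create(model, messages, temperature):
    reply = SimpleNamespace(content=f"echo: {messages[-1]['content']}")
    return SimpleNamespace(choices=[SimpleNamespace(message=reply)])

fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create)))

history = []
print(get_completion_v1(fake_client, "hello", history))
```

With the real library, `fake_client` would be replaced by `OpenAI(api_key=...)` imported from `openai`, and `history` plays the role of `_messages` above.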

## 2.2 Interact with ChatGPT

In [None]:
problem = '''
As an expert in artificial intelligence, banks are looking to your expertise to predict credit card approvals. They provide you with a credit card database and your task is to determine whether the credit card application should be approved based on the variables within it.

To maintain confidentiality, all variable names and values have been anonymized and replaced with symbols.

| Variable  names | Role    | Type        | Value                                        |
| --------------- | ------- | ----------- | -------------------------------------------- |
| Feature 1       | Feature | Continuous  |                                              |
| Feature 2       | Feature | Continuous  |                                              |
| Feature 3       | Feature | Continuous  |                                              |
| Feature 4       | Feature | Continuous  |                                              |
| Feature 5       | Feature | Continuous  |                                              |
| Feature 6       | Feature | Continuous  |                                              |
| Feature 7       | Feature | Categorical | 0,1                                          |
| Feature 8       | Feature | Categorical | 0,1                                          |
| Feature 9       | Feature | Categorical | 0,1                                          |
| Feature 10      | Feature | Categorical | 0,1                                          |
| Feature 11      | Feature | Categorical | 1,2,3                                        |
| Feature 12      | Feature | Categorical | 1,2,3                                        |
| Feature 13      | Feature | Categorical | 1,2,3,4,5,6,7,8,9                            |
| Feature 14      | Feature | Categorical | 1,2,3,4,5,6,7,8,9,10,11,12,13,14             |
| Label           | Label   | Categorical | 0,1 (0 for non-approval and 1 for  approval) |

The bank required the use of decision tree-based models, emphasizing the importance of interpretability in the decision-making process. Try using a decision tree-based model for the best results! You will train and validate a decision tree-based model using 590 samples in the public dataset. Your results will then be on 100 samples in the private dataset, and your performance on the private dataset will be used as the final score.
'''

prompt = '''
{problem}
Let’s think step by step.
'''

message = prompt.format(problem=problem)
response = get_completion(message)

html_content = generate_html(_messages)
display(HTML(html_content))

In [None]:
# you can ask ChatGPT many times to get your own idea
idea = '''
Step 1: Data Preprocessing
- Load the credit card database and examine the data.
- Split the data into training and validation sets. I will employ a 5-fold cross-validation technique.

Step 2: Model Training
- Choose a decision tree-based model, such as a traditional decision tree, random forest, or gradient boosting.
- Train the model using the training set.
- Optimize the model's hyperparameters using techniques like grid search or random search. I will optimize the decision
tree's hyperparameters, specifically 'max_leaf_nodes', with values [None, 5, 10, 15].

Step 3: Model Prediction
- Combine the predictions from all folds using an ensemble method, such as majority voting or averaging, to obtain the final prediction.
'''
prompt = '''
I will use the following idea. What do you think?
{idea}
'''

message = prompt.format(idea=idea)
response = get_completion(message)

html_content = generate_html(_messages)
display(HTML(html_content))
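
The five-fold split mentioned in Step 1 of the idea above can be pictured with a short NumPy sketch. `k_fold_indices` is a hypothetical helper written for illustration; it does not reproduce the exact shuffling of scikit-learn's `KFold`, only the principle that every sample is held out exactly once.

```python
import numpy as np

def k_fold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train_idx, val_idx) pairs that hold out every sample exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_splits)
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

# 590 public samples split into 5 folds: 472 for training, 118 held out each time.
for fold, (train_idx, val_idx) in enumerate(k_fold_indices(590)):
    print(fold, len(train_idx), len(val_idx))
```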

In [None]:
from IPython.core.display import HTML

# you may need to give ChatGPT some examples to start
code_snippet_train = """
def make_grid(pars_dict):
    keys = pars_dict.keys()
    combinations = itertools.product(*pars_dict.values())
    outputs = [dict(zip(keys, combination)) for combination in combinations]
    return outputs

param_grid = {
    'max_leaf_nodes': [None, 5, 10, 15]
}

param_list = make_grid(param_grid)

best_models = []
for param in param_list:
  '''
  CODE HERE!
  - evaluate under 5 fold cross validation
  - save the model list (of five folds) with the best average performance
  - ...
  '''
"""

code_snippet_infer = """
for model in best_models:
  '''
  CODE HERE!
  - use the model list for prediction
  - use majority voting for ensemble
  - ...
  '''
"""

template = '''
import os
import numpy as np
import pandas as pd

def read_data_from_csv(path):
    """Load datasets from CSV files.
    Args:
        path (str): Path to the CSV file.
    Returns:
        X (np.ndarray): Features of samples.
        y (np.ndarray): Labels of samples, only provided in the public datasets.
    """
    assert os.path.exists(path), f'File not found: {{path}}!'
    assert os.path.splitext(path)[
        -1] == '.csv', f'Unsupported file type {{os.path.splitext(path)[-1]}}!'

    data = pd.read_csv(path)
    column_list = data.columns.values.tolist()

    if 'Label' in column_list:
        # for the public dataset, label column is provided.
        column_list.remove('Label')
        X = data[column_list].values
        y = data['Label'].astype('int').values
        return X, y
    else:
        # for the private dataset, label column is not provided.
        X = data[column_list].values
        return X

X_public, y_public = read_data_from_csv('assignment_3_public.csv')
print('Shape of X_public:', X_public.shape)  # n_sample, m_feature (590, 14)
print('Shape of y_public:', y_public.shape)  # n_sample (590,)

{code_snippet_train}

X_private = read_data_from_csv('assignment_3_private.csv')
print('Shape of X_private:', X_private.shape)  # k_sample, m_feature (100, 14)

{code_snippet_infer}

submission = pd.DataFrame({{'Label': preds}})
submission.to_csv('assignment_3.csv', index=True, index_label='Id')
'''

prompt = '''
Use the given code snippet to revise the code.
{template}
'''

message = prompt.format(template=template.format(code_snippet_train=code_snippet_train, code_snippet_infer=code_snippet_infer))
response = get_completion(message)

html_content = generate_html(_messages)
display(HTML(html_content))
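
A note on the doubled braces in `template` above: because the string is passed through `str.format`, literal braces must be written as `{{` and `}}`, while single braces mark placeholders. Values that are substituted in are not scanned again, so the braces inside the code snippets need no escaping. The shortened template below is an illustrative stand-in for the full one.

```python
template = '''
assert os.path.exists(path), f'File not found: {{path}}!'
{code_snippet_train}
submission = pd.DataFrame({{'Label': preds}})
'''

# The substituted value contains braces of its own; they are inserted verbatim.
snippet = "param_grid = {'max_leaf_nodes': [None, 5]}"
filled = template.format(code_snippet_train=snippet)
print(filled)
```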

In [None]:
from IPython.core.display import HTML

ideas = '''
- I have observed that there are missing values in the dataset and am considering
implementing various strategies for imputation, such as median imputation and
most_frequent imputation.

- Moreover, I noticed that the dataset contains both categorical and continuous
variables, and hence, I am contemplating utilizing different strategies for
their representation. For instance, one-hot encoding can be employed for
categorical variables.

- These designs should be explored to determine their utility, and as such,
the code can be modified accordingly:

param_grid = {
    'max_leaf_nodes': [None, 5, 10, 15],
    'imputer': [None, 'median', 'most_frequent'],
    'category': [None, 'onehot'],
    'continuous': [None, 'binning']
}
'''

prompt = '''
Consider modifying the code according to the ideas below to enhance performance.
{ideas}
'''
message = prompt.format(ideas=ideas)
response = get_completion(message)

html_content = generate_html(_messages)
display(HTML(html_content))
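
The grid proposed in `ideas` grows multiplicatively with each option. The sketch below reuses `make_grid` from the reference code to count the combinations. Note that the `'continuous': 'binning'` option is not handled by the reference code in Section 1.2, whose run reports 24 combinations, so any binning step is left to the reader.

```python
import itertools

def make_grid(pars_dict):
    """Expand a dict of option lists into a list of one dict per combination."""
    keys = pars_dict.keys()
    combinations = itertools.product(*pars_dict.values())
    return [dict(zip(keys, combination)) for combination in combinations]

param_grid = {
    'max_leaf_nodes': [None, 5, 10, 15],
    'imputer': [None, 'median', 'most_frequent'],
    'category': [None, 'onehot'],
    'continuous': [None, 'binning'],
}
param_list = make_grid(param_grid)
print(len(param_list))  # 4 * 3 * 2 * 2 = 48
print(param_list[0])
print(param_list[-1])
```

Each of the 48 settings is evaluated with five-fold cross-validation, so the search trains 240 trees in total.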

## Private test cases

In [None]:
X_private = read_data_from_csv('assignment_3_private.csv')
print('Shape of X_private:', X_private.shape)  # k_sample, m_feature (100, 14)

if best_imputer is not None:
    X_private = best_imputer.transform(X_private)

if best_category is not None:
    X_private = np.concatenate([X_private[:, :6], best_category.transform(X_private[:, 6:])], axis=1)

# use best models for ensemble.
preds = []
for model in best_models:
    pred = model.predict(X_private)
    preds.append(pred.tolist())

# use majority voting for ensemble.
preds = np.array(preds)
preds = np.sum(preds, axis=0)
preds = np.where(preds > len(best_models) / 2, 1, 0)

# save your predictions in a CSV file.
submission = pd.DataFrame({'Label': preds})
submission.to_csv('assignment_3.csv', index=True, index_label='Id')

Shape of X_private: (100, 14)
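
The majority vote in the cell above can be traced on a toy example. The prediction matrix below is made up for illustration, with one row per fold model and one column per sample.

```python
import numpy as np

# Rows: the five fold models; columns: four samples (toy 0/1 predictions).
fold_preds = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
])

votes = fold_preds.sum(axis=0)  # number of models voting for class 1
majority = np.where(votes > len(fold_preds) / 2, 1, 0)
print(votes)     # [4 2 3 1]
print(majority)  # [1 0 1 0]
```

With five models the threshold is 2.5, so ties cannot occur. With an even number of models, the strict `>` would send a tied vote to class 0.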


In [None]:
# read pred from csv file.

preds = pd.read_csv('assignment_3.csv')['Label'].values

# check the accuracy of your predictions.

X_private, y_private = read_data_from_csv('assignment_3_private_gt.csv')
print('Accuracy:', np.sum(preds == y_private) / len(y_private))

Accuracy: 0.86
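
Because the private set has only 100 samples, the reported accuracy carries noticeable sampling uncertainty. The rough estimate below uses the normal approximation to the binomial, which is an assumption added here for illustration rather than part of the assignment.

```python
import math

accuracy, n = 0.86, 100

# Standard error of a proportion estimated from n independent samples.
std_err = math.sqrt(accuracy * (1 - accuracy) / n)
low, high = accuracy - 1.96 * std_err, accuracy + 1.96 * std_err

print(f'standard error: {std_err:.3f}')
print(f'approx. 95% interval: [{low:.2f}, {high:.2f}]')
```

Differences of one or two percentage points between submissions on this dataset are therefore well within the noise.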
