# Part 1. Problem and Reference Answer

## 1.1 The Problem

Predicting potential defaults in advance is an important problem for credit card companies, because it lets them assess payment risk before losses occur. A credit card company is seeking your assistance, as an expert in artificial intelligence, to predict default payments based on historical data.

The dataset you are provided contains a set of 23 features. The goal is to forecast the default of payment (yes or no). In the public dataset, you can train and validate your model on 20,000 samples. Then, you need to predict the labels for 5,000 samples in the private dataset, and the area under the Receiver Operating Characteristic curve (AUC-ROC) on the private dataset will determine your final score.

| Variable Name  | Role    | Type        | Description                                                  |
| -------------- | ------- | ----------- | ------------------------------------------------------------ |
| Feature 1      | Feature | Continuous  | Amount of the given credit                                   |
| Feature 2      | Feature | Categorical | Gender (1 = male, 2 = female)                                |
| Feature 3      | Feature | Categorical | Education (1 = graduate school, 2 = university, 3 = high school,  4 = others) |
| Feature 4      | Feature | Categorical | Marital status (1 = married, 2 = single, 3 = others)         |
| Feature 5      | Feature | Continuous  | Age (year)                                                   |
| Feature 6      | Feature | Categorical | Repayment status in September (-1 = pay duly, 1 = payment delay  for one month; . . ., 8 = payment delay for eight months, 9 = payment delay  for nine months and above) |
| Feature 7      | Feature | Categorical | Repayment status in August (-1 = pay duly, 1 = payment delay for  one month; . . ., 8 = payment delay for eight months, 9 = payment delay for  nine months and above) |
| Feature 8      | Feature | Categorical | Repayment status in July (-1 = pay duly, 1 = payment delay for  one month; . . ., 8 = payment delay for eight months, 9 = payment delay for  nine months and above) |
| Feature 9      | Feature | Categorical | Repayment status in June (-1 = pay duly, 1 = payment delay for  one month; . . ., 8 = payment delay for eight months, 9 = payment delay for  nine months and above) |
| Feature 10     | Feature | Categorical | Repayment status in May (-1 = pay duly, 1 = payment delay for one  month; . . ., 8 = payment delay for eight months, 9 = payment delay for nine  months and above) |
| Feature 11     | Feature | Categorical | Repayment status in April (-1 = pay duly, 1 = payment delay for  one month; . . ., 8 = payment delay for eight months, 9 = payment delay for  nine months and above) |
| Feature 12     | Feature | Continuous  | Amount of bill statement in September                        |
| Feature 13     | Feature | Continuous  | Amount of bill statement in August                           |
| Feature 14     | Feature | Continuous  | Amount of bill statement in July                             |
| Feature 15     | Feature | Continuous  | Amount of bill statement in June                             |
| Feature 16     | Feature | Continuous  | Amount of bill statement in May                              |
| Feature 17     | Feature | Continuous  | Amount of bill statement in April                            |
| Feature 18     | Feature | Continuous  | Amount paid in September                                     |
| Feature 19     | Feature | Continuous  | Amount paid in August                                        |
| Feature 20     | Feature | Continuous  | Amount paid in July                                          |
| Feature 21     | Feature | Continuous  | Amount paid in June                                          |
| Feature 22     | Feature | Continuous  | Amount paid in May                                           |
| Feature 23     | Feature | Continuous  | Amount paid in April                                         |
| Label          | Label   | Categorical | Default payment (1=yes, 0=no)                                |
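
Since the final score is the AUC-ROC, it helps to see what the metric measures. The sketch below is a plain-Python illustration, not part of the assignment template: it computes AUC-ROC as the fraction of (positive, negative) pairs in which the positive sample gets the higher score. The function name `roc_auc_pairwise` and the toy data are made up for this example. In practice `sklearn.metrics.roc_auc_score` is the better choice.

```python
def roc_auc_pairwise(labels, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    Runs in O(n_pos * n_neg), so it is only meant for small illustrative inputs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative sample")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy example (made-up labels and scores, not taken from the dataset).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc_pairwise(toy_labels, toy_scores)
print(toy_auc)  # 3 of the 4 positive/negative pairs are ordered correctly
```

Because only the ordering of the scores matters, the metric rewards well-ranked probabilities rather than hard 0/1 decisions.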

Hint 1: Consider preprocessing and feature engineering if it benefits your model.

Hint 2: A well-designed network architecture is crucial.

Hint 3: Optimize hyperparameters, including learning rate and weight decay, to enhance model performance.

Hint 4: Advanced neural network-related techniques, e.g., LeakyReLU, Dropout, and Batch Normalization, can help to improve the performance. You can use ChatGPT to implement them.

Hint 5: Loss functions and optimization algorithms, e.g., Adam and SGD for model optimization, also play an important role. You can use ChatGPT to implement them.

Hint 6: [OPTIONAL] Class imbalance is a common problem in machine learning. Can you tackle this bottleneck?

Note:

* External data and pre-trained models are not allowed.

* Please download the public dataset from https://drive.google.com/file/d/1rVtGUBpsWWd0z2YWk806xyf5Ho0BwJwa/view?usp=sharing

* Please download the private dataset from https://drive.google.com/file/d/1pGJCLPlPHkPJX2xmnyEX-2aZhsbpi6Or/view?usp=sharing

* Please download the label of the private dataset from https://drive.google.com/file/d/1qB82SYItMoDb8oVFWMNWI_EpoqNF1VOA/view?usp=sharing

* Please use the following Python template for submission. (You can copy the code below from lab8-Exercise.ipynb)

* Your results will be evaluated on the 5,000 samples in the private dataset using AUC-ROC. (The labels will be released in Lab 9.)

```python
import os
import numpy as np
import pandas as pd

def read_data_from_csv(path):
    """Load datasets from CSV files.
    Args:
        path (str): Path to the CSV file.
    Returns:
        X (np.ndarray): Features of samples.
        y (np.ndarray): Labels of samples, only provided in the public datasets.
    """
    assert os.path.exists(path), f'File not found: {path}!'
    assert os.path.splitext(path)[
        -1] == '.csv', f'Unsupported file type {os.path.splitext(path)[-1]}!'


    data = pd.read_csv(path)
    column_list = data.columns.values.tolist()


    if 'Label' in column_list:
        # for the public dataset, label column is provided.
        column_list.remove('Label')
        X = data[column_list].values
        y = data['Label'].astype('int').values
        return X, y
    else:
        # for the private dataset, label column is not provided.
        X = data[column_list].values
        return X

X_public, y_public = read_data_from_csv('assignment_6_public.csv')
print('Shape of X_public:', X_public.shape)  # n_sample, m_feature (20000, 23)
print('Shape of y_public:', y_public.shape)  # n_sample (20000,)

'''
CODE HERE!
'''

X_private = read_data_from_csv('assignment_6_private.csv')
print('Shape of X_private:', X_private.shape)  # k_sample, m_feature (5000, 23)

# remove and make your own predictions.
preds = np.full(len(X_private), -1,
                dtype=int)
'''
CODE HERE!
e.g.,
preds = np.full(len(X_private), -1, dtype=int)
'''

submission = pd.DataFrame({'Label': preds})
submission.to_csv('assignment_6.csv', index=True, index_label='Id')

```

## 1.2 An Example Algorithm

You may need to revise the code that ChatGPT generates to make it concise and correct. The final design used here is:

* Preprocessing: We apply one-hot encoding for discrete data and normalization for continuous data.
* Network Structure: A multilayer perceptron (MLP) with two hidden layers. The first hidden layer contains 16 neurons, while the second has 8.
* Techniques: We utilize LeakyReLU and Dropout.
* Hyperparameters: The learning rate and weight decay are tuned based on experimental results of 5-fold cross-validation.
* Loss Function and Data Imbalance: Weighted cross-entropy loss is used to address data imbalance, balancing the majority and minority categories to an extent.
* Optimizer: The Adam optimizer is employed.
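
As a rough illustration of the weighted-loss idea above, the sketch below derives per-class weights from label frequencies using the common "balanced" heuristic, `n_samples / (n_classes * class_count)`. This is an assumption for illustration only: the reference code uses the fixed weights `[0.4, 0.6]`, which is a milder reweighting. The helper name `balanced_class_weights` and the synthetic labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * count)} for every class present."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in sorted(counts.items())}


# Synthetic labels with roughly 22% positives (made up for this example).
toy_labels = [1] * 22 + [0] * 78
weights = balanced_class_weights(toy_labels)
print(weights)  # the minority class (1) receives the larger weight
```

The resulting values could be passed, in class order, as the `weight` tensor of `nn.CrossEntropyLoss`. Whether they beat hand-tuned weights is something to check with cross-validation.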

## 1.3 Reference Code

In [4]:
import os
import pandas as pd

def read_data_from_csv(path):
    """Load datasets from CSV files.
    Args:
        path (str): Path to the CSV file.
    Returns:
        X (np.ndarray): Features of samples.
        y (np.ndarray): Labels of samples, only provided in the public datasets.
    """
    assert os.path.exists(path), f'File not found: {path}!'
    assert os.path.splitext(path)[
        -1] == '.csv', f'Unsupported file type {os.path.splitext(path)[-1]}!'

    data = pd.read_csv(path)
    column_list = data.columns.values.tolist()

    if 'Label' in column_list:
        # for the public dataset, label column is provided.
        column_list.remove('Label')
        X = data[column_list].values
        y = data['Label'].astype('int').values
        return X, y
    else:
        # for the private dataset, label column is not provided.
        X = data[column_list].values
        return X

In [5]:
import random
import numpy as np
import torch

def seed_everything(seed):
    """Set seed for reproducibility.
    Args:
        seed (int): Seed number.
    """
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # torch.cuda.manual_seed(seed)

seed_everything(42)

In [6]:
X_public, y_public = read_data_from_csv('assignment_6_public.csv')
print('Shape of X_public:', X_public.shape)  # n_sample, m_feature (20000, 23)
print('Shape of y_public:', y_public.shape)  # n_sample (20000,)

Shape of X_public: (20000, 23)
Shape of y_public: (20000,)


In [7]:
# roughly 78% negative, 22% positive
y_public.mean()

0.22135

In [8]:
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# modified from the answer of assignment 5

X_df = pd.DataFrame(X_public)
cat_cols = list(range(1, 4)) + list(range(5, 11))

one_hot_encoder = OneHotEncoder()
one_hot_encoded = one_hot_encoder.fit_transform(X_df[cat_cols])

one_hot_encoded_df = pd.DataFrame(one_hot_encoded.toarray())
X_df = X_df.drop(cat_cols, axis=1)

scaler = RobustScaler()
X_df = pd.DataFrame(scaler.fit_transform(X_df), columns=X_df.columns)
X_df = pd.concat([X_df, one_hot_encoded_df], axis=1)

X_public = X_df.values
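
To make the scaling step in the cell above more concrete, here is a minimal plain-Python sketch of what robust scaling does to one continuous column: center on the median and divide by the interquartile range (IQR). It approximates the default behaviour of `RobustScaler` and is not a replacement for it. The helper name `robust_scale` and the toy values are assumptions for this example.

```python
import statistics


def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    med = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        iqr = 1.0  # constant column: avoid division by zero
    return [(v - med) / iqr for v in values]


# Toy column with one extreme value (made-up numbers).
toy_amounts = [10, 20, 30, 40, 1000]
scaled = robust_scale(toy_amounts)
print(scaled)
```

The outlier stays large after scaling, but it does not distort the median or the IQR, so the typical values keep a sensible spread. That property is useful for heavy-tailed columns such as bill and payment amounts.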

In [9]:
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_channels, hid_channels, num_classes):
        super().__init__()
        self.in_channels = in_channels
        self.hid_channels = hid_channels
        self.num_classes = num_classes

        modules = []
        for i in range(len(hid_channels)):
            if i == 0:
                modules.append(nn.Linear(in_channels, hid_channels[i]))
            else:
                modules.append(nn.Linear(hid_channels[i-1], hid_channels[i]))
            modules.append(nn.LeakyReLU())
            modules.append(nn.Dropout(0.2))
        modules.append(nn.Linear(hid_channels[-1], num_classes))
        self.model = nn.Sequential(*modules)

    def forward(self, x):
        x = self.model(x)
        return x

In [10]:
RECEIVED_PARAMS = {
    'n_splits': 5,
    'hid_channels': [16, 8],
    'num_epochs': 1000,
    'lr': 0.001,
    'weight_decay': 0.0001
}
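
The learning rate and weight decay above are described as tuned via 5-fold cross-validation. One simple way to organize such tuning is a grid search, sketched below. The scoring function `fake_cv_auc` is a hypothetical stand-in that produces no real results; in a real run it would be replaced by the cross-validation loop that returns the ROC-AUC. The grid values are also assumptions.

```python
from itertools import product


def grid_search(base_params, grid, score_fn):
    """Evaluate every combination in `grid` and return (best_params, best_score)."""
    best_params, best_score = None, float("-inf")
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        params = {**base_params, **dict(zip(keys, values))}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def fake_cv_auc(params):
    """Hypothetical placeholder for 'run 5-fold CV and return the ROC-AUC'."""
    return 0.75 - abs(params["lr"] - 1e-3) * 10 - abs(params["weight_decay"] - 1e-4) * 10


base = {"n_splits": 5, "hid_channels": [16, 8], "num_epochs": 1000}
grid = {"lr": [1e-2, 1e-3, 1e-4], "weight_decay": [1e-3, 1e-4, 0.0]}
best_params, best_score = grid_search(base, grid, fake_cv_auc)
print(best_params)
```

A full grid grows multiplicatively with each added hyperparameter, so for larger search spaces a random search may be more practical.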

In [11]:
import torch.optim as optim
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

k_fold = KFold(n_splits=RECEIVED_PARAMS['n_splits'], shuffle=True, random_state=0)

k_preds = []
k_labels = []

for fold_idx, (train_idx, test_idx) in enumerate(k_fold.split(X_public)):
    X_train, X_test = X_public[train_idx], X_public[test_idx]
    y_train, y_test = y_public[train_idx], y_public[test_idx]

    # Define model
    model = MLP(X_train.shape[-1], RECEIVED_PARAMS['hid_channels'], 2)#.cuda()

    # Define loss function and optimizer
    criterion = nn.CrossEntropyLoss(weight=torch.Tensor([0.4, 0.6]))#.cuda()
    # criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=RECEIVED_PARAMS['lr'], weight_decay=RECEIVED_PARAMS['weight_decay'])

    # Convert numpy arrays to PyTorch tensors
    X_train = torch.from_numpy(X_train).float()#.cuda()
    y_train = torch.from_numpy(y_train).long()#.cuda()

    # Train the model
    model.train()
    for epoch in range(RECEIVED_PARAMS['num_epochs']):
        optimizer.zero_grad()
        output = model(X_train)
        loss = criterion(output, y_train)
        loss.backward()
        optimizer.step()

    # Convert numpy arrays to PyTorch tensors
    X_test = torch.from_numpy(X_test).float()#.cuda()
    y_test = torch.from_numpy(y_test).long()#.cuda()

    torch.save(model.state_dict(), f'model_{fold_idx}.pth')

    # Test the model
    model.eval()
    with torch.no_grad():
        output = model(X_test)
        loss = criterion(output, y_test)
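        # Note: this is the raw class-1 logit, not a probability; cell In [14] applies
        # torch.softmax(output, dim=1)[:, 1] instead, which can rank samples differently.
        pred = output[:, 1]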

        k_preds += pred.cpu().numpy().tolist()
        k_labels += y_test.cpu().numpy().tolist()

roc_auc = roc_auc_score(k_labels, k_preds)
roc_auc

0.7623666533989291

In [12]:
X_private = read_data_from_csv('assignment_6_private.csv')
print('Shape of X_private:', X_private.shape)  # k_sample, m_feature (5000, 23)

Shape of X_private: (5000, 23)


In [13]:
# modified from the answer of assignment 5

X_df = pd.DataFrame(X_private)
one_hot_encoded = one_hot_encoder.transform(X_df[cat_cols])

one_hot_encoded_df = pd.DataFrame(one_hot_encoded.toarray())
X_df = X_df.drop(cat_cols, axis=1)
X_df = pd.DataFrame(scaler.transform(X_df), columns=X_df.columns)

X_df = pd.concat([X_df, one_hot_encoded_df], axis=1)
X_private = X_df.values

In [14]:
# Predict the private dataset

k_preds = []
for fold_idx in range(k_fold.n_splits):
    model = MLP(X_public.shape[-1], RECEIVED_PARAMS['hid_channels'], 2)#.cuda()
    model.load_state_dict(torch.load(f'model_{fold_idx}.pth'))
    model.eval()
    with torch.no_grad():
        output = model(torch.from_numpy(X_private).float())#.cuda()
        preds = torch.softmax(output, dim=1)[:, 1].cpu()
    k_preds.append(preds.numpy().tolist())

preds = np.mean(k_preds, axis=0)

submission = pd.DataFrame({'Label': preds})
submission.to_csv('assignment_6.csv', index=True, index_label='Id')
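
The cell above averages the class-1 probabilities of the five fold models, which amounts to a simple ensemble. The plain-Python sketch below shows the same averaging on made-up numbers. The helper name `average_fold_predictions` and the 0.5 threshold are assumptions for this example.

```python
def average_fold_predictions(fold_probs):
    """Element-wise mean over folds; `fold_probs` is a list of equal-length lists."""
    n_folds = len(fold_probs)
    return [sum(col) / n_folds for col in zip(*fold_probs)]


# Two hypothetical fold models scoring three samples (made-up probabilities).
toy_fold_probs = [
    [0.2, 0.9, 0.45],
    [0.4, 0.7, 0.65],
]
averaged = average_fold_predictions(toy_fold_probs)

# Thresholding is only needed if hard 0/1 labels are required.
hard_labels = [int(p >= 0.5) for p in averaged]
print(averaged, hard_labels)
```

Since the score is AUC-ROC, submitting the averaged probabilities, as the reference code does, preserves the ranking information that thresholding would discard.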

## Private test cases

In [15]:
# read pred from csv file.

preds = pd.read_csv('assignment_6.csv')['Label'].values

# check the ROC-AUC of your predictions.

X_private, y_private = read_data_from_csv('assignment_6_private_gt.csv')
print('ROC-AUC:', roc_auc_score(y_private, preds))

ROC-AUC: 0.7801431752998839


# Part 2. Demo of Interactions with ChatGPT

You can interact with ChatGPT in either of two ways:
- Access https://genai.polyu.edu.hk/ to interact with ChatGPT.
- Use OpenAI API to interact with ChatGPT via code.

## 2.1 Preparations

In [7]:
%%capture

import numpy as np

# Set the seed
def seed_everything(seed=0):
    np.random.seed(seed)
seed_everything()

# Install OpenAI package
!pip install openai==0.27.0

# Import OpenAI and set the API key
import openai
openai.api_key = 'OpenAI_API_Key' # Replace with your own OpenAI API Key

# Define a helper that sends a prompt to ChatGPT and returns the response
_messages = []
def get_completion(prompt, model="gpt-3.5-turbo-0613"):
    _messages.append({"role": "user", "content": prompt})
    response = openai.ChatCompletion.create(
        model=model,
        messages=_messages,
        temperature=0.0,  # this is the degree of randomness of the model's output
    )
    content = response.choices[0].message["content"]
    _messages.append({"role": "assistant", "content": content})
    return content


# Set display format
import html
from IPython.display import display, HTML
css_content = """.cs-message{box-sizing:border-box;font-size:1em;font-family:Helvetica Neue,Segoe UI,Helvetica,Arial,sans-serif;color:#000000de;display:flex;flex-direction:row;padding:0;background-color:transparent;overflow:hidden;border-radius:0}.cs-message:not(:only-child){margin:.2em 0 0}.cs-message__avatar{box-sizing:border-box;margin:0 8px 0 0;display:flex;flex-direction:column;justify-content:flex-end;width:42px}.cs-message__content-wrapper{box-sizing:border-box;display:flex;flex-direction:column}.cs-message__content{box-sizing:border-box;color:#000000de;background-color:#c6e3fa;margin-top:0;padding:.6em .9em;border-radius:.7em;white-space:pre-wrap;overflow-wrap:anywhere;word-break:break-word;font-family:Helvetica Neue,Segoe UI,Helvetica,Arial,sans-serif;font-weight:400;font-size:.91em;font-variant:normal}.cs-message--incoming{color:#000000de;background-color:transparent;margin-right:auto}.cs-message--incoming .cs-message__avatar{margin:0 8px 0 0}.cs-message--incoming .cs-message__content{color:#000000de;background-color:#c6e3fa;border-radius:0 .7em .7em 0}.cs-message--outgoing{color:#000000de;background-color:transparent;margin-left:auto;justify-content:flex-end}.cs-message--outgoing .cs-message__avatar{order:1;margin:0 0 0 8px}.cs-message--outgoing .cs-message__content{color:#000000de;background-color:#6ea9d7;border-radius:.7em 0 0 .7em}.cs-message.cs-message--incoming.cs-message--single{border-radius:0}.cs-message.cs-message--incoming.cs-message--single:not(:first-child){margin-top:.4em}.cs-message.cs-message--incoming.cs-message--single .cs-message__content{border-radius:0 .7em .7em}.cs-message.cs-message--outgoing.cs-message--single{border-radius:0}.cs-message.cs-message--outgoing.cs-message--single .cs-message__content{border-radius:.7em .7em 
0}.cs-avatar{position:relative;width:42px;height:42px;border-radius:50%;box-sizing:border-box}.cs-avatar>img{box-sizing:border-box;width:100%;height:100%;border-radius:50%}.cs-avatar.cs-avatar--md{width:42px;height:42px;min-width:42px;min-height:42px}
"""
html_content = """
    <section aria-label="User" class="cs-message cs-message--outgoing cs-message--single" data-cs-message="">
        <div class="cs-message__avatar">
            <div class="cs-avatar cs-avatar--md"><img src="https://genai.polyu.edu.hk/assets/user-8c5a41ea.png" alt="PolyU"> </div>
        </div>
        <div class="cs-message__content-wrapper">
            <div class="cs-message__content">
                <div class="cs-message__custom-content">hello</div>
            </div>
        </div>
    </section>
    <section aria-label="ChatGPT" class="cs-message cs-message--incoming cs-message--single" data-cs-message="">
        <div class="cs-message__avatar">
            <div class="cs-avatar cs-avatar--md"><img src="https://genai.polyu.edu.hk/assets/chatbot-08b96590.png" alt="PolyU"> </div>
        </div>
        <div class="cs-message__content-wrapper">
            <div class="cs-message__content">
                <div class="cs-message__custom-content">Hello! How can I assist you today?</div>
            </div>
        </div>
    </section>
"""
def generate_html(messages, n=2):
    if n is not None:
        messages = messages[-n:]

    html_parts = []
    user_template = """
    <section aria-label="User" class="cs-message cs-message--outgoing cs-message--single" data-cs-message="">
        <div class="cs-message__avatar">
            <div class="cs-avatar cs-avatar--md"><img src="https://genai.polyu.edu.hk/assets/user-8c5a41ea.png" alt="PolyU"> </div>
        </div>
        <div class="cs-message__content-wrapper">
            <div class="cs-message__content">
                <div class="cs-message__custom-content">{content}</div>
            </div>
        </div>
    </section>
    """

    assistant_template = """
    <section aria-label="ChatGPT" class="cs-message cs-message--incoming cs-message--single" data-cs-message="">
        <div class="cs-message__avatar">
            <div class="cs-avatar cs-avatar--md"><img src="https://genai.polyu.edu.hk/assets/chatbot-08b96590.png" alt="PolyU"> </div>
        </div>
        <div class="cs-message__content-wrapper">
            <div class="cs-message__content">
                <div class="cs-message__custom-content">{content}</div>
            </div>
        </div>
    </section>
    """

    for message in messages:
        sanitized_content = html.escape(message["content"])
        if message["role"] == "user":
            html_parts.append(user_template.format(content=sanitized_content))
        elif message["role"] == "assistant":
            html_parts.append(assistant_template.format(content=sanitized_content))

    return "".join(html_parts) + f"<style>{css_content}</style>"
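
Note that `get_completion` above appends every prompt and reply to the module-level `_messages` list, so the whole history is resent on each call and re-running a cell adds duplicate turns. The sketch below is one possible way to make that state explicit and resettable. The `Conversation` class and the offline `echo_backend` are hypothetical and only illustrate the bookkeeping; a real backend would wrap the API call.

```python
class Conversation:
    """Keeps the chat history and delegates the actual model call to `backend`."""

    def __init__(self, backend):
        self.backend = backend  # callable: list of message dicts -> reply string
        self.messages = []

    def ask(self, prompt):
        self.messages.append({"role": "user", "content": prompt})
        reply = self.backend(self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply

    def reset(self):
        """Start a fresh conversation, e.g. before re-running a notebook cell."""
        self.messages.clear()


def echo_backend(messages):
    """Hypothetical offline backend used only for illustration."""
    return f"(turn {len(messages) // 2 + 1}) you said: {messages[-1]['content']}"


chat = Conversation(echo_backend)
first = chat.ask("hello")
second = chat.ask("suggest a model")
print(first)
print(second)
```

Calling `chat.reset()` before starting a new line of questioning keeps earlier, unrelated turns out of the context.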

## 2.2 Interact with ChatGPT

In [8]:
problem = '''
It is an important problem for credit card companies to predict potential defaults to assess the risk probability of payments in advance. As an expert in artificial intelligence, a credit card company is seeking your assistance to predict default payments based on historical data.

The dataset you are provided contains a set of 23 features. The goal is to forecast the default of payment (yes or no). In the public dataset, you can train and validate your model on 20,000 samples. Then, you need to predict the labels for 5,000 samples in the private dataset, and the area under the Receiver Operating Characteristic curve (AUC-ROC) on the private dataset will determine your final score.

| Variable  Name | Role    | Type        | Description                                                  |
| -------------- | ------- | ----------- | ------------------------------------------------------------ |
| Feature 1      | Feature | Continuous  | Amount of the given credit                                   |
| Feature 2      | Feature | Categorical | Gender (1 = male, 2 = female)                                |
| Feature 3      | Feature | Categorical | Education (1 = graduate school, 2 = university, 3 = high school,  4 = others) |
| Feature 4      | Feature | Categorical | Marital status (1 = married, 2 = single, 3 = others)         |
| Feature 5      | Feature | Continuous  | Age (year)                                                   |
| Feature 6      | Feature | Categorical | Repayment status in September (-1 = pay duly, 1 = payment delay  for one month; . . ., 8 = payment delay for eight months, 9 = payment delay  for nine months and above) |
| Feature 7      | Feature | Categorical | Repayment status in August (-1 = pay duly, 1 = payment delay for  one month; . . ., 8 = payment delay for eight months, 9 = payment delay for  nine months and above) |
| Feature 8      | Feature | Categorical | Repayment status in July (-1 = pay duly, 1 = payment delay for  one month; . . ., 8 = payment delay for eight months, 9 = payment delay for  nine months and above) |
| Feature 9      | Feature | Categorical | Repayment status in June (-1 = pay duly, 1 = payment delay for  one month; . . ., 8 = payment delay for eight months, 9 = payment delay for  nine months and above) |
| Feature 10     | Feature | Categorical | Repayment status in May (-1 = pay duly, 1 = payment delay for one  month; . . ., 8 = payment delay for eight months, 9 = payment delay for nine  months and above) |
| Feature 11     | Feature | Categorical | Repayment status in April (-1 = pay duly, 1 = payment delay for  one month; . . ., 8 = payment delay for eight months, 9 = payment delay for  nine months and above) |
| Feature 12     | Feature | Continuous  | Amount of bill statement in September                        |
| Feature 13     | Feature | Continuous  | Amount of bill statement in August                           |
| Feature 14     | Feature | Continuous  | Amount of bill statement in July                             |
| Feature 15     | Feature | Continuous  | Amount of bill statement in June                             |
| Feature 16     | Feature | Continuous  | Amount of bill statement in May                              |
| Feature 17     | Feature | Continuous  | Amount of bill statement in April                            |
| Feature 18     | Feature | Continuous  | Amount paid in September                                     |
| Feature 19     | Feature | Continuous  | Amount paid in August                                        |
| Feature 20     | Feature | Continuous  | Amount paid in July                                          |
| Feature 21     | Feature | Continuous  | Amount paid in June                                          |
| Feature 22     | Feature | Continuous  | Amount paid in May                                           |
| Feature 23     | Feature | Continuous  | Amount paid in April                                         |
| Label          | Label   | Categorical | Default payment (1=yes, 0=no)                                |

Hint 1: Consider preprocessing and feature engineering if it benefits your model.

Hint 2: A well-designed network architecture is crucial.

Hint 3: Optimize hyperparameters, including learning rate and weight decay, to enhance model performance.

Hint 4: Advanced neural network-related techniques, e.g., LeakyReLU, Dropout, and Batch Normalization, can help to improve the performance. You can use ChatGPT to implement them.

Hint 5: Loss functions and optimization algorithms, e.g., Adam and SGD for model optimization, also play an important role. You can use ChatGPT to implement them.

Hint 6: [OPTIONAL] Class imbalance is a common problem in machine learning. Can you tackle this bottleneck?
'''

prompt = '''
{problem}
Let’s think step by step.
'''

message = prompt.format(problem=problem)
response = get_completion(message)

html_content = generate_html(_messages)
display(HTML(html_content))

In [9]:
# you can ask ChatGPT many times to get your own idea
idea = '''
* Preprocessing: We apply one-hot encoding for discrete data and normalization for continuous data.
* Network Structure: The multilayer perceptron with two hidden layers. The first hidden layer contains 16 neurons, while the second has 8.
* Techniques: We utilize LeakyReLU and Dropout.
* Hyperparameters: The learning rate and weight decay are tuned based on experimental results of 5-fold cross-validation.
* Loss Function and Data Imbalance: Weighted cross-entropy loss is used to address data imbalance, balancing the majority and minority categories to an extent.
* Optimizer: The Adam optimizer is employed.
'''
prompt = '''
I will use the following idea. What do you think?
{idea}
'''

message = prompt.format(idea=idea)
response = get_completion(message)

html_content = generate_html(_messages)
display(HTML(html_content))

In [10]:
from IPython.display import HTML


template = '''
import os
import numpy as np
import pandas as pd

def read_data_from_csv(path):
    """Load datasets from CSV files.
    Args:
        path (str): Path to the CSV file.
    Returns:
        X (np.ndarray): Features of samples.
        y (np.ndarray): Labels of samples, only provided in the public datasets.
    """
    assert os.path.exists(path), f'File not found: {path}!'
    assert os.path.splitext(path)[
        -1] == '.csv', f'Unsupported file type {os.path.splitext(path)[-1]}!'


    data = pd.read_csv(path)
    column_list = data.columns.values.tolist()


    if 'Label' in column_list:
        # for the public dataset, label column is provided.
        column_list.remove('Label')
        X = data[column_list].values
        y = data['Label'].astype('int').values
        return X, y
    else:
        # for the private dataset, label column is not provided.
        X = data[column_list].values
        return X

X_public, y_public = read_data_from_csv('assignment_6_public.csv')
print('Shape of X_public:', X_public.shape)  # n_sample, m_feature (20000, 23)
print('Shape of y_public:', y_public.shape)  # n_sample (20000,)

\'\'\'
CODE HERE!
\'\'\'

X_private = read_data_from_csv('assignment_6_private.csv')
print('Shape of X_private:', X_private.shape)  # k_sample, m_feature (5000, 23)

# remove and make your own predictions.
preds = np.full(len(X_private), -1,
                dtype=int)
\'\'\'
CODE HERE!
e.g.,
preds = np.full(len(X_private), -1, dtype=int)
\'\'\'

submission = pd.DataFrame({'Label': preds})
submission.to_csv('assignment_6.csv', index=True, index_label='Id')
'''

prompt = '''
Use the given code snippet to revise the code.
{template}
'''

message = prompt.format(template=template)
response = get_completion(message)

html_content = generate_html(_messages)
display(HTML(html_content))

Many of the techniques from Assignment 5 can be reused in this assignment. Rather than rebuilding everything from scratch, you may prefer to have ChatGPT fill in the gaps in existing code. The prompt and the `# TODO:` comments below guide ChatGPT through this code-completion task.

In [6]:
from IPython.display import HTML

template = '''
import torch.optim as optim
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

k_fold = KFold(n_splits=RECEIVED_PARAMS['n_splits'], shuffle=True, random_state=0)

k_preds = []
k_labels = []

for fold_idx, (train_idx, test_idx) in enumerate(k_fold.split(X_public)):
    X_train, X_test = X_public[train_idx], X_public[test_idx]
    y_train, y_test = y_public[train_idx], y_public[test_idx]

    # Define model
    model = ... # TODO:

    # Define loss function and optimizer
    criterion = ... # TODO:
    optimizer = ... # TODO:

    # Convert numpy arrays to PyTorch tensors
    X_train = ... # TODO:
    y_train = ... # TODO:

    # Train the model
    ... # TODO:

    # Convert numpy arrays to PyTorch tensors
    X_test = ... # TODO:
    y_test = ... # TODO:

    # Test the model
    ... # TODO:

roc_auc = roc_auc_score(k_labels, k_preds)
roc_auc
'''

prompt = '''
Complete code based on comments:
{template}
'''

message = prompt.format(template=template)
response = get_completion(message)

html_content = generate_html(_messages)
display(HTML(html_content))
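
ChatGPT usually returns the completed code inside fenced blocks mixed with explanatory text. A small helper like the one sketched below can pull out just the code for review before it is pasted into the notebook. The function name `extract_code_blocks` and the sample reply are made up for this example; real replies may be formatted differently, so treat this as a starting point.

```python
import re

# Built programmatically so this example can itself sit inside a fenced block.
FENCE = "`" * 3


def extract_code_blocks(text):
    """Return the bodies of all fenced code blocks in `text`, in order."""
    pattern = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)
    return [m.group(1).rstrip("\n") for m in pattern.finditer(text)]


# Made-up reply used only to exercise the helper.
sample_reply = (
    "Here is the model definition:\n"
    + FENCE + "python\n"
    + "model = MLP(in_channels, [16, 8], 2)\n"
    + FENCE + "\n"
    + "And the optimizer:\n"
    + FENCE + "\n"
    + "optimizer = optim.Adam(model.parameters())\n"
    + FENCE + "\n"
)
blocks = extract_code_blocks(sample_reply)
for block in blocks:
    print(block)
```

Extracted code should still be read and tested before use, since generated code may need revision to be concise and correct.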