# Part 1. Problem and Reference Answer

## 1.1 The Problem

Online news is crucial in providing people with diverse, multifaceted perspectives on political and public issues. An online news website seeks your assistance, as an expert in artificial intelligence, in predicting the popularity of online news based on its features (as shown in the table below).

| Variable  Name | Role    | Type        | Description                                            |
| -------------- | ------- | ----------- | ------------------------------------------------------ |
| Feature 1      | Feature | Continuous  | Number of words in the title                           |
| Feature 2      | Feature | Continuous  | Number of words in the content                         |
| Feature 3      | Feature | Continuous  | Rate of unique words in the content                    |
| Feature 4      | Feature | Continuous  | Rate of non-stop words in the content                  |
| Feature 5      | Feature | Continuous  | Rate of unique non-stop words in the content           |
| Feature 6      | Feature | Continuous  | Number of links                                        |
| Feature 7      | Feature | Continuous  | Number of links to other articles                      |
| Feature 8      | Feature | Continuous  | Number of images                                       |
| Feature 9      | Feature | Continuous  | Number of videos                                       |
| Feature 10     | Feature | Continuous  | Average length of the words in the content             |
| Feature 11     | Feature | Continuous  | Number of keywords in the metadata                     |
| Feature 12     | Feature | Categorical | Is the article from the Lifestyle topic?               |
| Feature 13     | Feature | Categorical | Is the article from the Entertainment topic?           |
| Feature 14     | Feature | Categorical | Is the article from the Business topic?                |
| Feature 15     | Feature | Categorical | Is the article from the Social Media topic?            |
| Feature 16     | Feature | Categorical | Is the article from the Tech topic?                    |
| Feature 17     | Feature | Categorical | Is the article from the World topic?                   |
| Feature 18     | Feature | Continuous  | Min. shares of worst keyword                           |
| Feature 19     | Feature | Continuous  | Max. shares of worst keyword                           |
| Feature 20     | Feature | Continuous  | Avg. shares of worst keyword                           |
| Feature 21     | Feature | Continuous  | Min. shares of best keyword                            |
| Feature 22     | Feature | Continuous  | Max. shares of best keyword                            |
| Feature 23     | Feature | Continuous  | Avg. shares of best keyword                            |
| Feature 24     | Feature | Continuous  | Min. shares of avg. keyword                            |
| Feature 25     | Feature | Continuous  | Max. shares of avg. keyword                            |
| Feature 26     | Feature | Continuous  | Avg. shares of avg. keyword                            |
| Feature 27     | Feature | Continuous  | Min. shares of referenced articles                     |
| Feature 28     | Feature | Continuous  | Max. shares of referenced articles                     |
| Feature 29     | Feature | Continuous  | Avg. shares of referenced articles                     |
| Feature 30     | Feature | Categorical | Was the article published on a Monday?                 |
| Feature 31     | Feature | Categorical | Was the article published on a Tuesday?                |
| Feature 32     | Feature | Categorical | Was the article published on a Wednesday?              |
| Feature 33     | Feature | Categorical | Was the article published on a Thursday?               |
| Feature 34     | Feature | Categorical | Was the article published on a Friday?                 |
| Feature 35     | Feature | Categorical | Was the article published on a Saturday?               |
| Feature 36     | Feature | Categorical | Was the article published on a Sunday?                 |
| Feature 37     | Feature | Categorical | Was the article published on the weekend?              |
| Feature 38     | Feature | Continuous  | Closeness to Latent Dirichlet Allocation (LDA) topic 0 |
| Feature 39     | Feature | Continuous  | Closeness to Latent Dirichlet Allocation (LDA) topic 1 |
| Feature 40     | Feature | Continuous  | Closeness to Latent Dirichlet Allocation (LDA) topic 2 |
| Feature 41     | Feature | Continuous  | Closeness to Latent Dirichlet Allocation (LDA) topic 3 |
| Feature 42     | Feature | Continuous  | Closeness to Latent Dirichlet Allocation (LDA) topic 4 |
| Feature 43     | Feature | Continuous  | Text subjectivity                                      |
| Feature 44     | Feature | Continuous  | Text sentiment polarity                                |
| Feature 45     | Feature | Continuous  | Rate of positive words in the content                  |
| Feature 46     | Feature | Continuous  | Rate of negative words in the content                  |
| Feature 47     | Feature | Continuous  | Rate of positive words among non-neutral tokens        |
| Feature 48     | Feature | Continuous  | Rate of negative words among non-neutral tokens        |
| Feature 49     | Feature | Continuous  | Avg. polarity of positive words                        |
| Feature 50     | Feature | Continuous  | Min. polarity of positive words                        |
| Feature 51     | Feature | Continuous  | Max. polarity of positive words                        |
| Feature 52     | Feature | Continuous  | Avg. polarity of negative words                        |
| Feature 53     | Feature | Continuous  | Min. polarity of negative words                        |
| Feature 54     | Feature | Continuous  | Max. polarity of negative words                        |
| Feature 55     | Feature | Continuous  | Subjectivity of the title                              |
| Feature 56     | Feature | Continuous  | Sentiment polarity of the title                        |
| Feature 57     | Feature | Continuous  | Absolute level of subjectivity in the title            |
| Feature 58     | Feature | Continuous  | Absolute level of sentiment polarity in the title      |
| Label          | Label   | Categorical | 0,1 (0 for not popular and 1 for popular)              |

The dataset contains a set of features describing published online news articles. The goal is to forecast their popularity on social networks. Articles with more than 1400 shares are considered popular. Your task is to predict whether a piece of online news will be popular.

In the public dataset, you can train and validate your model on 30,000 samples. Then, you need to predict the labels for 5,000 samples in the private dataset, and your performance on the private dataset will determine your final score.

Hint 1: Cross-validation is important.

Hint 2: Consider preprocessing and feature engineering if it benefits your model.

Hint 3: Optimize hyperparameters for improved performance.

Hint 4: Utilize any algorithms you have learned, including the decision tree, K-nearest neighbor, support vector machine, etc. You may ensemble their predictions to achieve better performance.
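
As a minimal sketch of Hint 1, the snippet below estimates accuracy with stratified 5-fold cross-validation. It runs on synthetic data from `make_classification` (an assumption standing in for the public dataset), and the decision tree with `max_depth=5` is only an illustrative choice, not a recommended setting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the public dataset (assumption: not the real data).
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

# Stratified folds keep the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = DecisionTreeClassifier(max_depth=5, random_state=0)

# One accuracy value per held-out fold.
cv_scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring='accuracy')
print('Fold accuracies:', np.round(cv_scores, 3))
print('Mean accuracy:', cv_scores.mean())
```

The spread of the fold accuracies gives a rough sense of how much a single train/validation split could mislead you.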

Note:
* Please download the public dataset from https://drive.google.com/file/d/1FoXCtnlw_0DFI3VzUVXLAoSwUHYlzhZ5/view?usp=sharing

* Please download the private dataset from https://drive.google.com/file/d/1SxEYOYIdSPbAlzCcGp-jGjAkXzgC6FZ6/view?usp=sharing

* Please download the label of the private dataset from https://drive.google.com/file/d/1u0Gk_UBNug13WM-XlYAhyzY65TnRDtjm/view?usp=sharing

* Your results will be evaluated on 5000 samples in the private dataset, using classification accuracy.

* Please use the following Python template for submission.

```
import os
import numpy as np
import pandas as pd

def read_data_from_csv(path):
    """Load datasets from CSV files.
    Args:
        path (str): Path to the CSV file.
    Returns:
        X (np.ndarray): Features of samples.
        y (np.ndarray): Labels of samples, only provided in the public datasets.
    """
    assert os.path.exists(path), f'File not found: {path}!'
    assert os.path.splitext(path)[
        -1] == '.csv', f'Unsupported file type {os.path.splitext(path)[-1]}!'

    data = pd.read_csv(path)
    column_list = data.columns.values.tolist()

    if 'Label' in column_list:
        # for the public dataset, label column is provided.
        column_list.remove('Label')
        X = data[column_list].values
        y = data['Label'].astype('int').values
        return X, y
    else:
        # for the private dataset, label column is not provided.
        X = data[column_list].values
        return X

X_public, y_public = read_data_from_csv('assignment_5_public.csv')
print('Shape of X_public:', X_public.shape)  # n_sample, m_feature (30000, 58)
print('Shape of y_public:', y_public.shape)  # n_sample (30000,)

'''
CODE HERE!
'''

X_private = read_data_from_csv('assignment_5_private.csv')
print('Shape of X_private:', X_private.shape)  # k_sample, m_feature (5000, 58)

import numpy as np

# remove and make your own predictions.
preds = np.full(len(X_private), -1,
                dtype=int)
'''
CODE HERE!
e.g.,
preds = np.full(len(X_private), -1, dtype=int)
'''

submission = pd.DataFrame({'Label': preds})
submission.to_csv('assignment_5.csv', index=True, index_label='Id')

```

## 1.2 An Example Algorithm

You may need to revise the generated code to make it concise and correct. Here is what I finally got.

* Preprocessing: For discrete data, we employ one-hot encoding, while for continuous data, normalization is utilized.
* Feature Selection: We use the 'Select K Best' method to choose 57 features, with the ANOVA F-value serving as the evaluation metric.
* Model: We use a variety of models including Random Forest (an ensemble of multiple decision trees), k Nearest Neighbor, and Support Vector Classifier.
* Hyperparameters: Grid search is used to find the optimal combination of hyperparameters.
* Ensemble: We conduct weighted voting based on the performance from cross-validation. Models that perform better (in cross-validation) are assigned higher weights, while the other models are given lower weights.
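
The steps above can also be sketched as a single scikit-learn pipeline. This is an illustrative variant under stated assumptions, not the reference code: the data is synthetic, the column indices (`num_cols`, `cat_cols_demo`) and `k=8` are hypothetical, and `VotingClassifier` replaces the manual vote over per-fold models. Wrapping the scaler, encoder, and selector in a `Pipeline` means they are refit on each training fold during cross-validation.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in: 6 continuous columns followed by 2 binary columns.
rng = np.random.default_rng(0)
n = 400
X_num = rng.normal(size=(n, 6))
X_cat = rng.integers(0, 2, size=(n, 2)).astype(float)
X_demo = np.hstack([X_num, X_cat])
y_demo = (X_num[:, 0] + 0.5 * X_cat[:, 0]
          + rng.normal(scale=0.5, size=n) > 0).astype(int)

num_cols = list(range(6))   # hypothetical continuous column indices
cat_cols_demo = [6, 7]      # hypothetical categorical column indices

# Preprocessing: robust scaling for continuous, one-hot for categorical.
preprocess = ColumnTransformer([
    ('num', RobustScaler(), num_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols_demo),
])

# Weighted hard voting over the three model families.
ensemble = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
        ('knn', KNeighborsClassifier(n_neighbors=15)),
        ('svm', LinearSVC(C=1.0, max_iter=3000)),
    ],
    voting='hard',
    weights=[0.7, 0.2, 0.1],
)

pipeline = Pipeline([
    ('preprocess', preprocess),
    ('select', SelectKBest(f_classif, k=8)),  # ANOVA F-value ranking
    ('ensemble', ensemble),
])

pipeline_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5,
                                  scoring='accuracy')
print('Mean CV accuracy:', pipeline_scores.mean())
```

Note that with one model per family and weights of 0.7, 0.2, and 0.1, the random forest's vote always decides the hard vote; more balanced weights are needed if the other two models should be able to overrule it.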

## 1.3 Reference Code

In [1]:
import os
import numpy as np
import pandas as pd

def read_data_from_csv(path):
    """Load datasets from CSV files.
    Args:
        path (str): Path to the CSV file.
    Returns:
        X (np.ndarray): Features of samples.
        y (np.ndarray): Labels of samples, only provided in the public datasets.
    """
    assert os.path.exists(path), f'File not found: {path}!'
    assert os.path.splitext(path)[
        -1] == '.csv', f'Unsupported file type {os.path.splitext(path)[-1]}!'

    data = pd.read_csv(path)
    column_list = data.columns.values.tolist()

    if 'Label' in column_list:
        # for the public dataset, label column is provided.
        column_list.remove('Label')
        X = data[column_list].values
        y = data['Label'].astype('int').values
        return X, y
    else:
        # for the private dataset, label column is not provided.
        X = data[column_list].values
        return X

X_public, y_public = read_data_from_csv('assignment_5_public.csv')
print('Shape of X_public:', X_public.shape)
print('Shape of y_public:', y_public.shape)

import warnings

warnings.filterwarnings('ignore')

from sklearn.preprocessing import OneHotEncoder, RobustScaler

X_df = pd.DataFrame(X_public)
cat_cols = list(range(11, 17)) + list(range(29, 37))

one_hot_encoder = OneHotEncoder()
one_hot_encoded = one_hot_encoder.fit_transform(X_df[cat_cols])

one_hot_encoded_df = pd.DataFrame(one_hot_encoded.toarray())
X_df = X_df.drop(cat_cols, axis=1)

scaler = RobustScaler()
X_df = pd.DataFrame(scaler.fit_transform(X_df), columns=X_df.columns)
X_df = pd.concat([X_df, one_hot_encoded_df], axis=1)

X_public = X_df.values

Shape of X_public: (30000, 58)
Shape of y_public: (30000,)


In [2]:
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(f_classif, k=57)
selector.fit(X_public, y_public)
X_public = selector.transform(X_public)
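
To see what `SelectKBest` actually discards, you can inspect the fitted selector. The sketch below uses synthetic data and a hypothetical `k=9` out of 10 columns; with the real data, the same attributes (`get_support()` and `scores_`) are available on the fitted `selector`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the preprocessed feature matrix (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=4, random_state=0)

demo_selector = SelectKBest(f_classif, k=9).fit(X_demo, y_demo)

# Boolean mask over the input columns: True means the column is kept.
kept_mask = demo_selector.get_support()
dropped = np.flatnonzero(~kept_mask)

# Column indices ordered from highest to lowest ANOVA F-value.
ranking = np.argsort(demo_selector.scores_)[::-1]

print('Dropped column indices:', dropped)
print('Columns ranked by F-value:', ranking)
```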

In [3]:
import itertools

from tqdm import tqdm
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier

def make_grid(pars_dict):
    keys = pars_dict.keys()
    combinations = itertools.product(*pars_dict.values())
    outputs = [dict(zip(keys, combination)) for combination in combinations]
    return outputs


param_grid = {
    'n_estimators': [10, 50, 100],
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5, 10],
}
param_list = make_grid(param_grid)

k_fold = KFold(n_splits=5, shuffle=True, random_state=42)

test_acc = 0.0
best_param = None
best_decision_trees = None

pbar = tqdm(total=len(param_list))
for param in param_list:
    k_preds = []
    k_labels = []
    decision_trees = []

    for fold_idx, (train_idx, test_idx) in enumerate(k_fold.split(X_public)):
        X_train, X_test = X_public[train_idx], X_public[test_idx]
        y_train, y_test = y_public[train_idx], y_public[test_idx]

        model = RandomForestClassifier(**param)
        model.fit(X_train, y_train)
        pred = model.predict(X_test)

        k_preds.append(pred.tolist())
        k_labels.append(y_test.tolist())
        decision_trees.append(model)

    cur_acc = np.sum(np.array(k_preds) == np.array(k_labels)) / len(y_public)

    if cur_acc > test_acc:
        test_acc = cur_acc
        best_param = param
        best_decision_trees = decision_trees

    pbar.update(1)
    pbar.set_description(f'acc: {cur_acc}')

pbar.close()

acc: 0.6649: 100%|██████████| 18/18 [07:16<00:00, 24.23s/it]
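
For comparison, the manual loop above resembles what scikit-learn's `GridSearchCV` does. The sketch below is an alternative under stated assumptions, not the reference method: it runs on synthetic data with a reduced, hypothetical grid so that it finishes quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for the preprocessed public data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=12,
                                     random_state=0)

# Reduced grid for illustration only.
demo_grid = {
    'n_estimators': [10, 50],
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=demo_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

print('Best parameters:', search.best_params_)
print('Best mean CV accuracy:', search.best_score_)
```

One difference from the manual loop: by default `GridSearchCV` refits a single model on all of the data with the best parameters (`search.best_estimator_`), whereas the loop keeps the five per-fold models for later voting.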


In [4]:
from sklearn.neighbors import KNeighborsClassifier

param_grid = {
    'n_neighbors': np.arange(3, 33, 2),
    'metric': ['euclidean', 'manhattan', 'chebyshev', 'minkowski'],
}
param_list = make_grid(param_grid)

test_acc = 0.0
best_param = None
best_knns = None

pbar = tqdm(total=len(param_list))
for param in param_list:
    k_preds = []
    k_labels = []
    knns = []

    for fold_idx, (train_idx, test_idx) in enumerate(k_fold.split(X_public)):
        X_train, X_test = X_public[train_idx], X_public[test_idx]
        y_train, y_test = y_public[train_idx], y_public[test_idx]

        model = KNeighborsClassifier(**param)
        model.fit(X_train, y_train)
        pred = model.predict(X_test)

        k_preds.append(pred.tolist())
        k_labels.append(y_test.tolist())
        knns.append(model)

    cur_acc = np.sum(np.array(k_preds) == np.array(k_labels)) / len(y_public)

    if cur_acc > test_acc:
        test_acc = cur_acc
        best_param = param
        best_knns = knns

    pbar.update(1)
    pbar.set_description(f'acc: {cur_acc}')

pbar.close()

acc: 0.6338333333333334: 100%|██████████| 60/60 [28:08<00:00, 28.14s/it]


In [5]:
from sklearn.svm import LinearSVC

param_grid = {
    'C': [0.1, 1.0, 10.0],
    'loss': ['hinge', 'squared_hinge'],
    'max_iter': [3000],
}
param_list = make_grid(param_grid)

test_acc = 0.0
best_param = None
best_svms = None

pbar = tqdm(total=len(param_list))
for param in param_list:
    k_preds = []
    k_labels = []
    svms = []

    for fold_idx, (train_idx, test_idx) in enumerate(k_fold.split(X_public)):
        X_train, X_test = X_public[train_idx], X_public[test_idx]
        y_train, y_test = y_public[train_idx], y_public[test_idx]

        model = LinearSVC(**param)
        model.fit(X_train, y_train)
        pred = model.predict(X_test)

        k_preds.append(pred.tolist())
        k_labels.append(y_test.tolist())
        svms.append(model)

    cur_acc = np.sum(np.array(k_preds) == np.array(k_labels)) / len(y_public)

    if cur_acc > test_acc:
        test_acc = cur_acc
        best_param = param
        best_svms = svms

    pbar.update(1)
    pbar.set_description(f'acc: {cur_acc}')

pbar.close()

acc: 0.6104666666666667: 100%|██████████| 6/6 [11:24<00:00, 114.12s/it]


In [6]:
X_private = read_data_from_csv('assignment_5_private.csv')
print('Shape of X_private:', X_private.shape)

X_df = pd.DataFrame(X_private)
one_hot_encoded = one_hot_encoder.transform(X_df[cat_cols])

one_hot_encoded_df = pd.DataFrame(one_hot_encoded.toarray())
X_df = X_df.drop(cat_cols, axis=1)
X_df = pd.DataFrame(scaler.transform(X_df), columns=X_df.columns)

X_df = pd.concat([X_df, one_hot_encoded_df], axis=1)
X_private = X_df.values

X_private = selector.transform(X_private)

preds = np.zeros(len(X_private))
for model in best_decision_trees:
    pred = model.predict(X_private)
    preds += 0.7 * pred / len(best_decision_trees)

for model in best_knns:
    pred = model.predict(X_private)
    preds += 0.2 * pred / len(best_knns)

for model in best_svms:
    pred = model.predict(X_private)
    preds += 0.1 * pred / len(best_svms)

preds = (preds >= 0.5).astype(int)

submission = pd.DataFrame({'Label': preds})
submission.to_csv('assignment_5.csv', index=True, index_label='Id')

Shape of X_private: (5000, 58)
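
The ensemble weights above (0.7, 0.2, 0.1) are hard-coded. One possible way to derive weights from cross-validation accuracy is sketched below. The helper names `weights_from_cv` and `weighted_vote` are hypothetical, the accuracies passed in are illustrative values rather than results from this notebook, and subtracting a chance level of 0.5 assumes a roughly balanced binary label.

```python
import numpy as np

def weights_from_cv(cv_accuracies, chance=0.5):
    """Turn CV accuracies into weights that sum to 1.

    Each weight is proportional to how far the accuracy exceeds `chance`.
    """
    margins = np.clip(np.asarray(cv_accuracies, dtype=float) - chance, 0.0, None)
    if margins.sum() == 0:
        # No model beats chance: fall back to equal weights.
        return np.full(len(margins), 1.0 / len(margins))
    return margins / margins.sum()

def weighted_vote(binary_preds, weights, threshold=0.5):
    """Combine 0/1 predictions of shape (n_models, n_samples)."""
    scores = np.asarray(weights) @ np.asarray(binary_preds)
    return (scores >= threshold).astype(int)

# Illustrative accuracies for three models (not measured values).
demo_weights = weights_from_cv([0.66, 0.63, 0.61])

# Rows are models, columns are samples.
demo_preds = np.array([[1, 0, 1, 0],
                       [0, 0, 1, 1],
                       [0, 1, 1, 1]])

print('Weights:', demo_weights)
print('Ensemble prediction:', weighted_vote(demo_preds, demo_weights))
```

With these inputs the weights come out to about 0.4, 0.325, and 0.275, so no single model can decide a vote alone and the ensemble predicts `[0, 0, 1, 1]`.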


## Private test cases

In [7]:
# read pred from csv file.

preds = pd.read_csv('assignment_5.csv')['Label'].values

# check the accuracy of your predictions.

X_private, y_private = read_data_from_csv('assignment_5_private_gt.csv')
print('Accuracy:', np.sum(preds == y_private) / len(y_private))

Accuracy: 0.6754


# Part 2. Demo of Interactions with ChatGPT

You can use either of the following approaches to interact with ChatGPT:
- Access https://genai.polyu.edu.hk/ to interact with ChatGPT.
- Use OpenAI API to interact with ChatGPT via code.

## 2.1 Preparations

In [1]:
%%capture

import numpy as np
import networkx as nx

from typing import List

# Set the seed
def seed_everything(seed=0):
    np.random.seed(seed)
seed_everything()

# Install OpenAI package
!pip install openai==0.27.0

# Import OpenAI and set the API key
import openai
openai.api_key = 'OpenAI_API_Key' # Replace with your own OpenAI API Key

# Define a function that sends a prompt to ChatGPT and returns the response
_messages = []
def get_completion(prompt, model="gpt-3.5-turbo-0613"):
    _messages.append({"role": "user", "content": prompt})
    response = openai.ChatCompletion.create(
        model=model,
        messages=_messages,
        temperature=0.0,  # this is the degree of randomness of the model's output
    )
    content = response.choices[0].message["content"]
    _messages.append({"role": "assistant", "content": content})
    return content


# Set display format
import html
from IPython.display import display, HTML
css_content = """.cs-message{box-sizing:border-box;font-size:1em;font-family:Helvetica Neue,Segoe UI,Helvetica,Arial,sans-serif;color:#000000de;display:flex;flex-direction:row;padding:0;background-color:transparent;overflow:hidden;border-radius:0}.cs-message:not(:only-child){margin:.2em 0 0}.cs-message__avatar{box-sizing:border-box;margin:0 8px 0 0;display:flex;flex-direction:column;justify-content:flex-end;width:42px}.cs-message__content-wrapper{box-sizing:border-box;display:flex;flex-direction:column}.cs-message__content{box-sizing:border-box;color:#000000de;background-color:#c6e3fa;margin-top:0;padding:.6em .9em;border-radius:.7em;white-space:pre-wrap;overflow-wrap:anywhere;word-break:break-word;font-family:Helvetica Neue,Segoe UI,Helvetica,Arial,sans-serif;font-weight:400;font-size:.91em;font-variant:normal}.cs-message--incoming{color:#000000de;background-color:transparent;margin-right:auto}.cs-message--incoming .cs-message__avatar{margin:0 8px 0 0}.cs-message--incoming .cs-message__content{color:#000000de;background-color:#c6e3fa;border-radius:0 .7em .7em 0}.cs-message--outgoing{color:#000000de;background-color:transparent;margin-left:auto;justify-content:flex-end}.cs-message--outgoing .cs-message__avatar{order:1;margin:0 0 0 8px}.cs-message--outgoing .cs-message__content{color:#000000de;background-color:#6ea9d7;border-radius:.7em 0 0 .7em}.cs-message.cs-message--incoming.cs-message--single{border-radius:0}.cs-message.cs-message--incoming.cs-message--single:not(:first-child){margin-top:.4em}.cs-message.cs-message--incoming.cs-message--single .cs-message__content{border-radius:0 .7em .7em}.cs-message.cs-message--outgoing.cs-message--single{border-radius:0}.cs-message.cs-message--outgoing.cs-message--single .cs-message__content{border-radius:.7em .7em 
0}.cs-avatar{position:relative;width:42px;height:42px;border-radius:50%;box-sizing:border-box}.cs-avatar>img{box-sizing:border-box;width:100%;height:100%;border-radius:50%}.cs-avatar.cs-avatar--md{width:42px;height:42px;min-width:42px;min-height:42px}
"""
html_content = """
    <section aria-label="User" class="cs-message cs-message--outgoing cs-message--single" data-cs-message="">
        <div class="cs-message__avatar">
            <div class="cs-avatar cs-avatar--md"><img src="https://genai.polyu.edu.hk/assets/user-8c5a41ea.png" alt="PolyU"> </div>
        </div>
        <div class="cs-message__content-wrapper">
            <div class="cs-message__content">
                <div class="cs-message__custom-content">hello</div>
            </div>
        </div>
    </section>
    <section aria-label="ChatGPT" class="cs-message cs-message--incoming cs-message--single" data-cs-message="">
        <div class="cs-message__avatar">
            <div class="cs-avatar cs-avatar--md"><img src="https://genai.polyu.edu.hk/assets/chatbot-08b96590.png" alt="PolyU"> </div>
        </div>
        <div class="cs-message__content-wrapper">
            <div class="cs-message__content">
                <div class="cs-message__custom-content">Hello! How can I assist you today?</div>
            </div>
        </div>
    </section>
"""
def generate_html(messages, n=2):
    if n is not None:
        messages = messages[-n:]

    html_parts = []
    user_template = """
    <section aria-label="User" class="cs-message cs-message--outgoing cs-message--single" data-cs-message="">
        <div class="cs-message__avatar">
            <div class="cs-avatar cs-avatar--md"><img src="https://genai.polyu.edu.hk/assets/user-8c5a41ea.png" alt="PolyU"> </div>
        </div>
        <div class="cs-message__content-wrapper">
            <div class="cs-message__content">
                <div class="cs-message__custom-content">{content}</div>
            </div>
        </div>
    </section>
    """

    assistant_template = """
    <section aria-label="ChatGPT" class="cs-message cs-message--incoming cs-message--single" data-cs-message="">
        <div class="cs-message__avatar">
            <div class="cs-avatar cs-avatar--md"><img src="https://genai.polyu.edu.hk/assets/chatbot-08b96590.png" alt="PolyU"> </div>
        </div>
        <div class="cs-message__content-wrapper">
            <div class="cs-message__content">
                <div class="cs-message__custom-content">{content}</div>
            </div>
        </div>
    </section>
    """

    for message in messages:
        sanitized_content = html.escape(message["content"])
        if message["role"] == "user":
            html_parts.append(user_template.format(content=sanitized_content))
        elif message["role"] == "assistant":
            html_parts.append(assistant_template.format(content=sanitized_content))

    return "".join(html_parts) + f"<style>{css_content}</style>"

## 2.2 Interact with ChatGPT

In [2]:
problem = '''
Online news is crucial in providing people with diverse, multifaceted perspectives on political and public issues. An online news website seeks your assistance, as an expert in artificial intelligence, in predicting the popularity of online news based on its features (as shown in the table below).

| Variable  Name | Role    | Type        | Description                                            |
| -------------- | ------- | ----------- | ------------------------------------------------------ |
| Feature 1      | Feature | Continuous  | Number of words in the title                           |
| Feature 2      | Feature | Continuous  | Number of words in the content                         |
| Feature 3      | Feature | Continuous  | Rate of unique words in the content                    |
| Feature 4      | Feature | Continuous  | Rate of non-stop words in the content                  |
| Feature 5      | Feature | Continuous  | Rate of unique non-stop words in the content           |
| Feature 6      | Feature | Continuous  | Number of links                                        |
| Feature 7      | Feature | Continuous  | Number of links to other articles                      |
| Feature 8      | Feature | Continuous  | Number of images                                       |
| Feature 9      | Feature | Continuous  | Number of videos                                       |
| Feature 10     | Feature | Continuous  | Average length of the words in the content             |
| Feature 11     | Feature | Continuous  | Number of keywords in the metadata                     |
| Feature 12     | Feature | Categorical | Is the article from the Lifestyle topic?               |
| Feature 13     | Feature | Categorical | Is the article from the Entertainment topic?           |
| Feature 14     | Feature | Categorical | Is the article from the Business topic?                |
| Feature 15     | Feature | Categorical | Is the article from the Social Media topic?            |
| Feature 16     | Feature | Categorical | Is the article from the Tech topic?                    |
| Feature 17     | Feature | Categorical | Is the article from the World topic?                   |
| Feature 18     | Feature | Continuous  | Min. shares of worst keyword                           |
| Feature 19     | Feature | Continuous  | Max. shares of worst keyword                           |
| Feature 20     | Feature | Continuous  | Avg. shares of worst keyword                           |
| Feature 21     | Feature | Continuous  | Min. shares of best keyword                            |
| Feature 22     | Feature | Continuous  | Max. shares of best keyword                            |
| Feature 23     | Feature | Continuous  | Avg. shares of best keyword                            |
| Feature 24     | Feature | Continuous  | Min. shares of avg. keyword                            |
| Feature 25     | Feature | Continuous  | Max. shares of avg. keyword                            |
| Feature 26     | Feature | Continuous  | Avg. shares of avg. keyword                            |
| Feature 27     | Feature | Continuous  | Min. shares of referenced articles                     |
| Feature 28     | Feature | Continuous  | Max. shares of referenced articles                     |
| Feature 29     | Feature | Continuous  | Avg. shares of referenced articles                     |
| Feature 30     | Feature | Categorical | Was the article published on a Monday?                 |
| Feature 31     | Feature | Categorical | Was the article published on a Tuesday?                |
| Feature 32     | Feature | Categorical | Was the article published on a Wednesday?              |
| Feature 33     | Feature | Categorical | Was the article published on a Thursday?               |
| Feature 34     | Feature | Categorical | Was the article published on a Friday?                 |
| Feature 35     | Feature | Categorical | Was the article published on a Saturday?               |
| Feature 36     | Feature | Categorical | Was the article published on a Sunday?                 |
| Feature 37     | Feature | Categorical | Was the article published on the weekend?              |
| Feature 38     | Feature | Continuous  | Closeness to Latent Dirichlet Allocation (LDA) topic 0 |
| Feature 39     | Feature | Continuous  | Closeness to Latent Dirichlet Allocation (LDA) topic 1 |
| Feature 40     | Feature | Continuous  | Closeness to Latent Dirichlet Allocation (LDA) topic 2 |
| Feature 41     | Feature | Continuous  | Closeness to Latent Dirichlet Allocation (LDA) topic 3 |
| Feature 42     | Feature | Continuous  | Closeness to Latent Dirichlet Allocation (LDA) topic 4 |
| Feature 43     | Feature | Continuous  | Text subjectivity                                      |
| Feature 44     | Feature | Continuous  | Text sentiment polarity                                |
| Feature 45     | Feature | Continuous  | Rate of positive words in the content                  |
| Feature 46     | Feature | Continuous  | Rate of negative words in the content                  |
| Feature 47     | Feature | Continuous  | Rate of positive words among non-neutral tokens        |
| Feature 48     | Feature | Continuous  | Rate of negative words among non-neutral tokens        |
| Feature 49     | Feature | Continuous  | Avg. polarity of positive words                        |
| Feature 50     | Feature | Continuous  | Min. polarity of positive words                        |
| Feature 51     | Feature | Continuous  | Max. polarity of positive words                        |
| Feature 52     | Feature | Continuous  | Avg. polarity of negative words                        |
| Feature 53     | Feature | Continuous  | Min. polarity of negative words                        |
| Feature 54     | Feature | Continuous  | Max. polarity of negative words                        |
| Feature 55     | Feature | Continuous  | Subjectivity of the title                              |
| Feature 56     | Feature | Continuous  | Sentiment polarity of the title                        |
| Feature 57     | Feature | Continuous  | Absolute level of subjectivity in the title            |
| Feature 58     | Feature | Continuous  | Absolute level of sentiment polarity in the title      |
| Label          | Label   | Categorical | 0,1 (0 for not popular and 1 for popular)              |

The dataset contains a set of features describing published online news articles. The goal is to forecast their popularity on social networks. Articles with more than 1400 shares are considered popular. Your task is to predict whether a piece of online news will be popular.

In the public dataset, you can train and validate your model on 30,000 samples. Then, you need to predict the labels for 5,000 samples in the private dataset, and your performance on the private dataset will determine your final score.

Hint 1: Cross-validation is important.

Hint 2: Consider preprocessing and feature engineering if it benefits your model.

Hint 3: Optimize hyperparameters for improved performance.

Hint 4: Utilize any algorithms you have learned, including the decision tree, K-nearest neighbor, support vector machine, etc. You may ensemble their predictions to achieve better performance.
'''

prompt = '''
{problem}
Let’s think step by step.
'''

message = prompt.format(problem=problem)
response = get_completion(message)

html_content = generate_html(_messages)
display(HTML(html_content))

In [3]:
# you can ask ChatGPT many times to get your own idea
idea = '''
* Preprocessing: For discrete data, we employ one-hot encoding, while for continuous data, normalization is utilized.
* Feature Selection: We use the 'Select K Best' method to choose 57 features, with the ANOVA F-value serving as the evaluation metric.
* Model: We use a variety of models including Random Forest (an ensemble of multiple decision trees), k Nearest Neighbor, and Support Vector Classifier.
* Hyperparameters: Grid search is used to find the optimal combination of hyperparameters.
* Ensemble: We conduct weighted voting based on the performance from cross-validation. Models that perform better (in cross-validation) are assigned higher weights, while the other models are given lower weights.
'''
prompt = '''
I will use the following idea. What do you think?
{idea}
'''

message = prompt.format(idea=idea)
response = get_completion(message)

html_content = generate_html(_messages)
display(HTML(html_content))

In [4]:
from IPython.display import HTML


template = '''
import os
import numpy as np
import pandas as pd

def read_data_from_csv(path):
    """Load datasets from CSV files.
    Args:
        path (str): Path to the CSV file.
    Returns:
        X (np.ndarray): Features of samples.
        y (np.ndarray): Labels of samples, only provided in the public datasets.
    """
    assert os.path.exists(path), f'File not found: {path}!'
    assert os.path.splitext(path)[
        -1] == '.csv', f'Unsupported file type {os.path.splitext(path)[-1]}!'

    data = pd.read_csv(path)
    column_list = data.columns.values.tolist()

    if 'Label' in column_list:
        # for the public dataset, label column is provided.
        column_list.remove('Label')
        X = data[column_list].values
        y = data['Label'].astype('int').values
        return X, y
    else:
        # for the private dataset, label column is not provided.
        X = data[column_list].values
        return X

X_public, y_public = read_data_from_csv('assignment_5_public.csv')
print('Shape of X_public:', X_public.shape)  # n_sample, m_feature (30000, 58)
print('Shape of y_public:', y_public.shape)  # n_sample (30000,)

\'\'\'
CODE HERE!
\'\'\'

X_private = read_data_from_csv('assignment_5_private.csv')
print('Shape of X_private:', X_private.shape)  # k_sample, m_feature (5000, 58)

# Placeholder predictions; remove this line and make your own predictions.
preds = np.full(len(X_private), -1, dtype=int)
\'\'\'
CODE HERE!
e.g.,
preds = np.full(len(X_private), -1, dtype=int)
\'\'\'

submission = pd.DataFrame({'Label': preds})
submission.to_csv('assignment_5.csv', index=True, index_label='Id')
'''

prompt = '''
Use the given code snippet to revise the code.
{template}
'''

message = prompt.format(template=template)
response = get_completion(message)

html_content = generate_html(_messages)
display(HTML(html_content))
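The submission format that the template writes can be checked on toy predictions before running a real model. The sketch below is only an illustration: `toy_preds` is a hypothetical stand-in for a fitted model's output, and the CSV is written to an in-memory buffer instead of `assignment_5.csv`.

```python
import io

import numpy as np
import pandas as pd

# Toy predictions standing in for a fitted model's output (assumption).
toy_preds = np.array([1, 0, 1, 1], dtype=int)

# Same DataFrame layout and to_csv arguments as in the template.
submission = pd.DataFrame({'Label': toy_preds})
buffer = io.StringIO()  # in-memory target instead of 'assignment_5.csv'
submission.to_csv(buffer, index=True, index_label='Id')

csv_lines = buffer.getvalue().splitlines()
print(csv_lines[0])  # header row with the two column names
```

The row index becomes the `Id` column, so the predictions must stay in the same order as the rows of the private dataset.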

As tasks grow more complex, split the generated code into blocks and place each block in its own cell for debugging and experimentation. You can then specify modification requirements for each block separately, which makes a broader range of experiments possible:

In [7]:
from IPython.display import HTML, display

_messages = []

original_code = '''
import itertools

import numpy as np
from tqdm import tqdm
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier

def make_grid(pars_dict):
    keys = pars_dict.keys()
    combinations = itertools.product(*pars_dict.values())
    outputs = [dict(zip(keys, combination)) for combination in combinations]
    return outputs


param_grid = {
    'n_estimators': [10, 50, 100],
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5, 10],
}
param_list = make_grid(param_grid)

k_fold = KFold(n_splits=5, shuffle=True, random_state=42)

test_acc = 0.0
best_param = None
best_decision_trees = None

pbar = tqdm(total=len(param_list))
for param in param_list:
    k_preds = []
    k_labels = []
    decision_trees = []

    for fold_idx, (train_idx, test_idx) in enumerate(k_fold.split(X_public)):
        X_train, X_test = X_public[train_idx], X_public[test_idx]
        y_train, y_test = y_public[train_idx], y_public[test_idx]

        model = RandomForestClassifier(**param)
        model.fit(X_train, y_train)
        pred = model.predict(X_test)

        k_preds.append(pred.tolist())
        k_labels.append(y_test.tolist())
        decision_trees.append(model)

    # Concatenate per-fold results so that unequal fold sizes are handled too.
    cur_acc = np.mean(np.concatenate(k_preds) == np.concatenate(k_labels))

    if cur_acc > test_acc:
        test_acc = cur_acc
        best_param = param
        best_decision_trees = decision_trees

    pbar.update(1)
    pbar.set_description(f'acc: {cur_acc}')

pbar.close()
'''

requirement = '''
1. Use the K-nearest neighbor algorithm as the classifier.
2. Apply grid search to determine the optimal n_neighbors value and metric.
'''

prompt = '''
{original_code}
Revise the code to meet the following requirements:
{requirement}
'''

message = prompt.format(original_code=original_code, requirement=requirement)
response = get_completion(message)

html_content = generate_html(_messages)
display(HTML(html_content))
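For comparison with whatever ChatGPT returns, a revision that meets the two requirements might look like the sketch below. It is an assumption-laden illustration, not the model's actual answer: it uses synthetic data from `make_classification` in place of `X_public` and `y_public`, and the candidate values for `n_neighbors` and `metric` are arbitrary demo choices.

```python
import itertools

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X_public / y_public (assumption, not the real data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, random_state=0)


def make_grid(pars_dict):
    """Expand a dict of candidate lists into a list of parameter dicts."""
    keys = list(pars_dict.keys())
    return [dict(zip(keys, combo))
            for combo in itertools.product(*pars_dict.values())]


# Hypothetical search space for the two requested hyperparameters.
param_list = make_grid({
    'n_neighbors': [3, 5, 11],
    'metric': ['euclidean', 'manhattan'],
})

k_fold = KFold(n_splits=5, shuffle=True, random_state=42)
best_acc, best_param = 0.0, None

for param in param_list:
    correct = 0
    for train_idx, test_idx in k_fold.split(X_demo):
        model = KNeighborsClassifier(**param)
        model.fit(X_demo[train_idx], y_demo[train_idx])
        pred = model.predict(X_demo[test_idx])
        correct += int(np.sum(pred == y_demo[test_idx]))

    # Every sample is held out exactly once, so this is the overall CV accuracy.
    cur_acc = correct / len(y_demo)
    if cur_acc > best_acc:
        best_acc, best_param = cur_acc, param

print(best_param, best_acc)
```

K-nearest neighbor is distance-based, so on the real data it would likely benefit from normalizing the features first; that step is left out here to keep the sketch close to the original loop.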