# Put your Google Colab link here:
*your link here*

#### This notebook demonstrates the process of training and evaluating a Differentiable Logic Network (DLN) for classification using the Heart Disease Kaggle dataset. It covers the steps of data preparation, including preprocessing, scaling, and feature reordering. It then shows how to train a DLN model, evaluate its performance, and visualize the learned network.

### Step 0. Prepare Python Environment

In [None]:
# Install/Download packages

# Clone the DLN repo
!git clone https://github.com/chang-yue/dln.git

# cd to DLN folder
%cd dln/quickstart

### Step 1. Prepare Dataset (14 points)

#### The processed datasets and related information will be saved in the `data/datasets/NAME/seed_{SEED}/data` directory:
- `train.csv` and `test.csv` (store features and the target class).
- `data_info.json` (stores dataset information such as feature data types and scaling).

#### The columns of the datasets should follow these standards:
- Features should be ordered as categorical features, then continuous features, then the target.
- Features should be scaled between 0 and 1.
- The target column should be named “Target” and labeled from 0 up to (num_classes – 1).
- Try to avoid using characters other than letters or underscores in feature names.
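
The standards above can be checked mechanically before saving. The following is a minimal sketch on a toy DataFrame, not part of the DLN repo: the helper name `check_dln_layout` and the toy columns are hypothetical and only illustrate the ordering, range, and label rules.

```python
import pandas as pd

def check_dln_layout(df, categorical, continuous, target="Target"):
    """Hypothetical helper: return a list of problems (empty list = layout looks fine)."""
    problems = []
    expected = list(categorical) + list(continuous) + [target]
    # Rule 1: columns ordered as categorical, then continuous, then the target
    if list(df.columns) != expected:
        problems.append("columns are not ordered as [categorical, continuous, target]")
    # Rule 2: every feature value lies in [0, 1]
    features = df[[c for c in expected[:-1] if c in df.columns]]
    if ((features < 0) | (features > 1)).any().any():
        problems.append("some feature values fall outside [0, 1]")
    # Rule 3: target labels run from 0 to num_classes - 1
    if target not in df.columns:
        problems.append("target column is missing")
    else:
        labels = sorted(df[target].unique())
        if labels != list(range(len(labels))):
            problems.append("target labels are not 0 .. num_classes - 1")
    return problems

# Toy data, invented for illustration only
toy = pd.DataFrame({"sex": [0, 1, 1], "cp_1": [1, 0, 0],
                    "age": [0.2, 0.5, 1.0], "Target": [0, 1, 0]})
print(check_dln_layout(toy, ["sex", "cp_1"], ["age"]))
```

Running a check like this on the scaled training and test frames just before saving can catch ordering or scaling mistakes early.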

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import os
import sys
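# '__file__' is a plain string here (notebooks do not define __file__), so this
# resolves to the parent of the current working directory, i.e. the dln repo root
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath('__file__'))))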
from data.data_utils import *
pd.options.mode.chained_assignment = None
pd.set_option('display.max_columns', None)

#### 1.1 Download dataset (4 points)
* Use Linux shell commands to complete this step. In Google Colab, prefix a command with `!` so that it runs as a shell command instead of Python code. For example: `!pip install numpy`
* We will use the Heart Disease Kaggle dataset.

In [None]:
# Download the dataset ZIP to example/data_raw/Heart/
# Create the directory example/data_raw/Heart
"""TO DO"""

# download the dataset from https://www.kaggle.com/api/v1/datasets/download/cherngs/heart-disease-cleveland-uci
# save it as example/data_raw/Heart/heart-disease-cleveland-uci.zip
"""TO DO"""

# Unzip the downloaded file into example/data_raw/Heart/, then delete the ZIP file.
"""TO DO"""

# Read the .csv file into a pandas DataFrame
df = """TO DO"""

print(df.head())

#### 1.2 Check missing values, analyze class distribution, and split data (6 points)
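
As background for the one-hot step in this part, here is a minimal sketch of k-1 dummy encoding with `pd.get_dummies(..., drop_first=True)`. The toy frame and its column names (`color`, `size`) are invented for illustration and are not taken from the Heart dataset.

```python
import pandas as pd

# Toy data: "color" is an integer-coded categorical feature with 3 levels
toy = pd.DataFrame({"color": [0, 1, 2, 1], "size": [3.0, 1.5, 2.0, 2.5]})
cat_cols = ["color"]

# Cast to object so the integer codes are treated as categories, not numbers
toy = toy.astype({c: object for c in cat_cols})

# drop_first=True keeps k-1 dummies per feature by removing the first level
dummies = pd.get_dummies(toy[cat_cols], drop_first=True, dtype=int)

# Replace the original categorical columns with the dummy columns
encoded = toy.drop(columns=cat_cols).join(dummies)
print(encoded)
```

Dropping the first level avoids a redundant column: a row with all dummies equal to 0 already identifies the removed level.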

In [None]:
# Check for missing values

print(df.isnull().sum().sum())

In [None]:
# Preprocessing columns
# Make categorical features one-hot

oh_list = ["cp", "restecg", "slope", "thal", "ca"]
# check features in oh_list
for _f in oh_list:
  print(f"{_f}: {np.unique(df[_f], return_counts=True)}")

# drop rows with restecg == 1, since that level contains only 4 samples
df.drop(df[df["restecg"]==1].index, inplace=True)

# change columns in oh_list to object type
"""TO DO"""

# create one-hot encoding for columns in oh_list
# use the function pd.get_dummies(); get k-1 dummies out of k categorical levels by removing the first level.
"""TO DO"""

# drop the original columns in oh_list
"""TO DO"""

# join the one-hot encoded columns
"""TO DO"""


# reset index
df.reset_index(inplace=True, drop=True)

# Assign the column name of the target feature as "Target"
df.rename(columns={"condition":"Target"}, inplace=True)

print('\ndata shape: ', df.shape, sep='')
print('\nclass distribution:\n', df.Target.value_counts(), sep='')
# print('\ncolumn types:\n', df.dtypes, sep='')

# visualize the data
# df should contain multiple columns for each categorical feature, and the original columns should be removed
# for example, the column "cp" should be removed and replaced by "cp_1", "cp_2", "cp_3"
print(df.head())

In [None]:
# Sort features into the [categorical, continuous, target] order

continuous_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']

# other features are categorical
categorical_features = """TO DO"""

print('continuous_features:', continuous_features)
print('\ncategorical_features:', categorical_features)

# Reindex columns to [cat, con, label]
df = """TO DO"""
print(df.head())

dtype_dict = df.dtypes.to_dict()

In [None]:
# Shuffle and split data into train/(val)/test
seed = 0

train_fraction = 0.75
df_train, df_test = shuffle_split_data(df, train_fraction, seed=seed)

print('train:', df_train.shape)
print(np.unique(df_train.Target, return_counts=True))
print('\ntest:', df_test.shape)
print(np.unique(df_test.Target, return_counts=True))

#### 1.3 Visualize training data

In [None]:
# Plot histograms of the training data

ncol, nrow = 2, int(np.ceil(len(df_train.columns)/2))
figsize = (16,3*nrow)

plot_hist(df_train, figsize, nrow, ncol)

#### 1.4 Clip outliers, scale features, and then save the processed data and info (4 points)
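
To illustrate the idea behind this step, the sketch below clips a synthetic feature to its 0.5th and 99.5th percentiles and then rescales it to [0, 1]. It makes two assumptions that are not stated in this notebook: the thresholds and the scaling range are computed from the training data only, and plain arithmetic stands in for the repo's `scale_features` helper.

```python
import numpy as np
import pandas as pd

# Synthetic data, invented for illustration only
train = pd.DataFrame({"x": np.arange(1000, dtype=float)})
test = pd.DataFrame({"x": [-50.0, 500.0, 5000.0]})

# Thresholds come from the training data (assumption) and are applied to both sets
lo, hi = np.percentile(train["x"], [0.5, 99.5])
train["x"] = train["x"].clip(lower=lo, upper=hi)
test["x"] = test["x"].clip(lower=lo, upper=hi)

# Min-max scaling with training statistics; clip keeps test values inside [0, 1]
x_min, x_max = train["x"].min(), train["x"].max()
train_scaled = (train["x"] - x_min) / (x_max - x_min)
test_scaled = ((test["x"] - x_min) / (x_max - x_min)).clip(0, 1)

print(lo, hi)
print(test_scaled.tolist())
```

Clipping before scaling keeps a few extreme values from compressing the rest of the feature into a narrow part of the [0, 1] range.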

In [None]:
# Feature outlier clipping and [0, 1] scaling

for feature in continuous_features:
    # clip outliers to 0.5th and 99.5th percentiles
    # get 0.5th percentile and 99.5th percentile of current feature
    """TO DO"""

    # set values below 0.5th percentile to the 0.5th percentile, and set values above 99.5th percentile to the 99.5th percentile
    # do it for both training and testing data
    """TO DO"""


scaler_list = [MinMaxScaler(clip=True), MinMaxScaler(clip=True)]
feature_list = [continuous_features, categorical_features]
df_train_scaled, df_test_scaled, scaler_params = scale_features(df_train, df_test, feature_list, scaler_list)

In [None]:
# Plot the scaled training data

plot_hist(df_train_scaled, figsize, nrow, ncol)

In [None]:
# Save the processed data and feature information

# Save data into the data/datasets/Heart/seed_0/data directory
# scaler_params and dtype_dict are used for network visualization
parent_dir = os.path.abspath(os.path.join(os.getcwd(), '..'))
folderpath = f'{parent_dir}/data/datasets/Heart/seed_{seed}/data'
save_data(folderpath, continuous_features, categorical_features, scaler_params, dtype_dict, df_train_scaled, df_test_scaled)

### Step 2. Training and Evaluation (3 points)

#### Let's use the dataset we just prepared. We present a general use case here. For more advanced functions such as pruning, freezing, and the unified phase, see the descriptions in `experiments/main.py`.

#### For training, use the ```--train_model``` flag. For evaluation, use the ```--evaluate_model``` flag, which loads the model and evaluates its balanced-class accuracy. It then attempts to simplify the model using SymPy and reports the model’s high-level OPs, basic gate-level OPs, number of parameters, and disk space usage. If simplification succeeds, these metrics are computed on the simplified model.
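
For readers unfamiliar with the metric, the sketch below assumes that "balanced-class accuracy" means the mean of per-class recall, which is the usual definition. The repo's own implementation is not shown in this notebook, and the function `balanced_accuracy` and the labels below are illustrative only.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (assumed definition of balanced-class accuracy)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Recall of class c: fraction of samples with true label c that were predicted as c
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Invented labels: class 0 has 4 samples, class 1 has 2
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

print("plain accuracy:   ", float(np.mean(np.array(y_true) == np.array(y_pred))))
print("balanced accuracy:", balanced_accuracy(y_true, y_pred))
```

Unlike plain accuracy, this metric weights every class equally, so it is less flattering when a model does well only on the majority class.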

#### See the [DLN repo](https://github.com/chang-yue/dln) for how to run these commands. Make sure you use the same parameters as specified in the DLN repo README.

In [None]:
# cd to the dln directory
import os
"""TO DO"""

In [None]:
"""TO DO"""

# Training:
# last_hidden_layer_size = first_hidden_layer_size x last_hl_size_wrt_first
# The middle hidden layers will have sizes in a geometric progression from the first to the last layer
# Will save the model with the best mean train + val balanced-class accuracy

In [None]:
# Read the eval results

import json
from experiments.utils import *

results_path = get_results_path(dataset='Heart', seed=0)
with open(f"{results_path}/eval_results.json", 'r') as f:
    data = json.load(f)
print(json.dumps(data, indent=4))

### Step 3. Visualization (3 points)

#### We use Graphviz to render DLNs generated from SymPy code.
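
To give a feel for the kind of Boolean expression SymPy can represent and simplify, here is a toy sketch. The expression is invented for illustration; the actual contents and format of `sympy_code.py` are not shown in this notebook and may differ.

```python
from sympy import symbols
from sympy.logic.boolalg import simplify_logic

# Three hypothetical Boolean inputs
a, b, c = symbols("a b c")

# A redundant expression: every term requires a, and the first two cover both values of b
expr = (a & b) | (a & ~b) | (a & c)

simplified = simplify_logic(expr)
print("original:  ", expr)
print("simplified:", simplified)

# Both forms agree on a sample assignment
assignment = {a: True, b: False, c: False}
print(expr.subs(assignment), simplified.subs(assignment))
```

A smaller expression generally means fewer logic operations to evaluate and a simpler graph to draw.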

In [None]:
!python experiments/DLN_viz.py \
results/Heart/seed_0/sympy_code.py \
quickstart/example/viz

# A file named viz.png will be created

In [None]:
from IPython.display import Image
Image(filename='quickstart/example/viz.png')

#### How many continuous features and categorical features are there in the dataset? How many of them does the DLN use? (3 points)

*your answer here*