<p style="text-align: center"><img src="https://gitlab.aicrowd.com/aicrowd/assets/-/raw/master/challenges/clock-decomposition/notebook-banner.jpg?inline=false" alt="Drawing" style="height: 400px;"/></p>

# Simple EDA and baseline models

The challenge is to use the features extracted from the Clock Drawing Test to build an automated algorithm that predicts which of three phases each participant is in:

1)    Pre-Alzheimer’s (Early Warning)
2)    Post-Alzheimer’s (Detection)
3)    Normal (Not an Alzheimer’s patient)

In machine learning terms: this is a 3-class classification task.
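
To make the task format concrete, the sketch below maps the three class labels used later in this notebook to integer ids and scores one probability row per participant with multi-class log loss. Treat the metric as an assumption: it mirrors the `multi_logloss` value LightGBM reports during training further down, and is not a statement about the official leaderboard metric. The probability values are invented for illustration.

```python
import math

# Label order matches `target_values` defined later in this notebook.
target_values = ["normal", "post_alzheimer", "pre_alzheimer"]
label_to_id = dict(zip(target_values, range(len(target_values))))

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip to avoid log(0) when a model is fully confident and wrong.
        p = min(max(row[label_to_id[label]], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(y_true)

# Invented values: one row of class probabilities per participant.
y_true = ["normal", "pre_alzheimer"]
probs = [[0.8, 0.1, 0.1],
         [0.5, 0.2, 0.3]]

loss = multiclass_log_loss(y_true, probs)
print(label_to_id)
print(round(loss, 4))
```

A lower value means the model put more probability on the correct class for each participant.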

# How to use this notebook? 📝

<p style="text-align: center"><img src="https://gitlab.aicrowd.com/aicrowd/assets/-/raw/master/notebook/aicrowd_notebook_submission_flow.png?inline=false" alt="notebook overview" style="width: 650px;"/></p>

- **Update the config parameters**. Define the common variables listed in the table below.

Variable | Description
--- | ---
`AICROWD_DATASET_PATH` | Path to the file containing test data. The data is available at `/ds_shared_drive/` on the Aridhia workspace. This should be an absolute path.
`AICROWD_PREDICTIONS_PATH` | Path to write the output to.
`AICROWD_ASSETS_DIR` | If your notebook needs additional files (such as model weights), add them to a directory and specify its path here (please use a relative path). The contents of this directory will be sent to AIcrowd for evaluation.
`AICROWD_API_KEY` | In order to submit your code to AIcrowd, you need to provide your account's API key. This key is available at https://www.aicrowd.com/participants/me

- **Installing packages**. Please use the [Install packages 🗃](#install-packages-) section to install the packages
- **Training your models**. All the code within the [Training phase ⚙️](#training-phase-) section will be skipped during evaluation. **Please make sure to save your model weights in the assets directory and load them in the predictions phase section** 

# Setup AIcrowd Utilities 🛠

We use this to bundle the files for submission and create a submission on AIcrowd. Do not edit this block.

In [1]:
!pip install -q -U aicrowd-cli

In [2]:
%load_ext aicrowd.magic

# AIcrowd Runtime Configuration 🧷

Define configuration parameters. Please include any files needed for the notebook to run under `ASSETS_DIR`. We will copy the contents of this directory to your final submission file 🙂

The dataset is available under `/ds_shared_drive` on the workspace.

In [3]:
import os

# Please use an absolute path for the location of the dataset.
# Or you can use a relative path with `os.path.join(os.getcwd(), "test_data/validation.csv")`
AICROWD_DATASET_PATH = os.getenv("DATASET_PATH", "/ds_shared_drive/validation.csv")
AICROWD_PREDICTIONS_PATH = os.getenv("PREDICTIONS_PATH", "predictions.csv")
AICROWD_ASSETS_DIR = "assets"
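
The `os.getenv` calls above read a path from the environment and fall back to the given default otherwise. A minimal sketch of that behaviour follows; the override path `/tmp/hidden_test.csv` is a made-up value standing in for whatever the evaluator sets.

```python
import os

DEFAULT_DATASET = "/ds_shared_drive/validation.csv"

# No override: the default from the notebook is used.
os.environ.pop("DATASET_PATH", None)
local_path = os.getenv("DATASET_PATH", DEFAULT_DATASET)

# With an override (hypothetical path), the environment value wins.
os.environ["DATASET_PATH"] = "/tmp/hidden_test.csv"
override_path = os.getenv("DATASET_PATH", DEFAULT_DATASET)

print(local_path)
print(override_path)

# Remove the illustrative override again.
os.environ.pop("DATASET_PATH", None)
```

The submit cell at the end of the notebook sets `DATASET_PATH` in the same way before calling the CLI.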


# Install packages 🗃

Please add all package installations in this section.

In [4]:
!pip install numpy pandas
!pip install seaborn lightgbm scikit-learn optuna



# Define preprocessing code 💻

The code that is common between the training and the prediction sections should be defined here. During evaluation, we completely skip the training section. Please make sure to add any common logic between the training and prediction sections here.
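
One way to keep the two phases in sync is to put the feature engineering in a single function and align the prediction-time columns to the ones seen in training. The sketch below is illustrative only: `add_features`, the toy frames and the category values are hypothetical, and only the column name `intersection_pos_rel_centre` and the `c_i_` prefix are taken from this notebook.

```python
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical shared step: one-hot encode the categorical column."""
    dummies = pd.get_dummies(df["intersection_pos_rel_centre"], prefix="c_i", dtype=int)
    return pd.concat([df.drop(columns="intersection_pos_rel_centre"), dummies], axis=1)

# Toy data: the test frame lacks the "TR" category seen in training.
train = pd.DataFrame({"row_id": [1, 2, 3],
                      "intersection_pos_rel_centre": ["TL", "TR", "BL"]})
test = pd.DataFrame({"row_id": [4, 5],
                     "intersection_pos_rel_centre": ["TL", "BL"]})

train_fe = add_features(train)
model_features = [c for c in train_fe.columns if c != "row_id"]

# reindex adds any dummy column missing at prediction time (filled with 0)
# and puts the columns in the training order.
test_fe = add_features(test).reindex(columns=["row_id"] + model_features, fill_value=0)

print(model_features)
print(test_fe)
```

Storing `model_features` in the assets directory, as this notebook does with its meta file, lets the prediction phase apply the same alignment without access to the training data.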

### Import common packages

Please import packages that are common for training and prediction phases here.

In [5]:
import numpy as np
import pandas as pd

In [6]:
# some preprocessing code

In [7]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.metrics import accuracy_score, log_loss, f1_score
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold

import joblib

import warnings
warnings.filterwarnings("ignore")

# Training phase ⚙️

You can define your training code here. This section will be skipped during evaluation.

In [8]:
# model = define_your_model

## Load training data

In [9]:
# load your data

In [10]:
AICROWD_DATASET_PATH

'/ds_shared_drive/validation.csv'

In [11]:
target_col = "diagnosis"
key_col = "row_id"
cat_cols = ['intersection_pos_rel_centre']
seed = 2021

target_values = ["normal", "post_alzheimer", "pre_alzheimer"]

train = pd.read_csv('/home/desktop0/Desktop/ds_shared_drive/train.csv')
test_true=pd.read_csv('/home/desktop0/Desktop/ds_shared_drive/validation_ground_truth.csv')
test_data=pd.read_csv('/home/desktop0/Desktop/ds_shared_drive/validation.csv')
train = train[train[target_col].isin(target_values)].copy().reset_index(drop=True)


print(train.shape)
features = train.columns[1:-1].to_list()

numeric_features = [c for c in features if c not in cat_cols]
for c in numeric_features:
    train[c] = train[c].astype(float)

#train.tail(3)

(32777, 122)


### Balance the dataset and see the distribution again

In [12]:
df_pos = train[train[target_col].isin(target_values[1:])]
print(target_values[1:])
nb_pos = df_pos.shape[0]
print(nb_pos)
nb_neg = nb_pos*2
df_neg = train[train[target_col] == "normal"].sample(n=nb_neg, random_state=seed)
print(train[train[target_col] == "normal"].shape)
# Shuffle with a fixed seed so the row order is reproducible
df_samples = pd.concat([df_pos, df_neg]).sample(frac=1, random_state=seed).reset_index(drop=True)

['post_alzheimer', 'pre_alzheimer']
1569
(31208, 122)
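
The counts printed above determine the size and class mix of the balanced sample. The arithmetic below is a small check that uses only those printed numbers; the link to LightGBM's `Start training from score` line in the training log is an observation about this run, not something the notebook states.

```python
import math

nb_pos = 1569          # post_alzheimer + pre_alzheimer rows (printed above)
nb_neg = nb_pos * 2    # "normal" rows kept by the downsampling step
nb_total = nb_pos + nb_neg

normal_share = nb_neg / nb_total

print(nb_total)                          # 4707 rows in df_samples
print(round(normal_share, 4))            # 0.6667
print(round(math.log(normal_share), 6))  # -0.405465
```

The last value matches the first `Start training from score -0.405465` entry in the training log, which is consistent with LightGBM starting each class from the log of its share in the training fold.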


## Train your model

In [13]:
# model.fit(train_data)
#print(df_samples['final_rotation_angle'])

### Simple FE

In [14]:
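#print(cat_cols)
# NOTE: drop() returns a new DataFrame. The two results below are not assigned back,
# so both columns stay in df_samples.
df_samples.drop(['between_digits_angle_ccw_sum'],axis=1)
df_samples.drop(['single_hand_length'],axis=1)   
# NOTE: mode() returns a Series, so fillna() aligns on the index and can only fill rows whose
# label appears in that Series (label 0 for a single mode). Remaining NaNs become -1 below.
df_samples['final_rotation_angle'].fillna(df_samples['final_rotation_angle'].mode(),inplace=True)
df_samples["number_of_digits"].fillna(df_samples["number_of_digits"].mode(),inplace=True)
df_samples.fillna(-1, inplace=True)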

df_dummies = pd.get_dummies(df_samples['intersection_pos_rel_centre'], columns='intersection_pos_rel_centre',
                          dummy_na=False).add_prefix('c_i_')
df_samples = df_samples.drop('intersection_pos_rel_centre', axis=1)
df_samples = pd.concat([df_samples, df_dummies], axis=1)


#create more feature
df_dummies = pd.get_dummies(df_samples['hand_count_dummy'], columns='hand_count_dummy',
                          dummy_na=False).add_prefix('c_h_')
df_samples = df_samples.drop('hand_count_dummy', axis=1)
df_samples = pd.concat([df_samples, df_dummies], axis=1)

feat_col = df_samples['final_rotation_angle']
df_samples['rotation_angle_180'] = (feat_col <= 180).astype('int')    #we will also include NaN in this column
df_samples['rotation_angle_360'] = (feat_col > 180).astype('int') 
df_samples = df_samples.drop('final_rotation_angle', axis=1)

df_samples['more than 12'] = [1 if x > 12 else 0 for x in df_samples['number_of_digits'] ]
new_cols = ["missing_digit_", "euc_dist__digit_", "area_digit_", 
           "height_digit_", "width_digit_","dist from "]
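# NOTE: the assignments in this loop and the next one go through df_samples[mask][...],
# i.e. a filtered copy, so no "mean"/"std" columns are added to df_samples itself.
for new_col in new_cols: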
    digit_columns = df_samples.columns[df_samples.columns.str.contains(new_col)]
    df_samples[df_samples['diagnosis']=='normal'][new_col + "mean"] = df_samples[df_samples['diagnosis']=='normal'][digit_columns].mean(axis=1)
    df_samples[df_samples['diagnosis']=='pre_alzheimer'][new_col + "mean"] = df_samples[df_samples['diagnosis']=='normal'][digit_columns].mean(axis=1)
    df_samples[df_samples['diagnosis']=='post_alzheimer'][new_col + "mean"] = df_samples[df_samples['diagnosis']=='normal'][digit_columns].mean(axis=1)
    df_samples[df_samples['diagnosis']=='post_alzheimer'][new_col + "std"] = df_samples[df_samples['diagnosis']=='post_alzheimer'][digit_columns].std(axis=1)
    df_samples[df_samples['diagnosis']=='pre_alzheimer'][new_col + "std"] = df_samples[df_samples['diagnosis']=='pre_alzheimer'][digit_columns].std(axis=1)
    df_samples[df_samples['diagnosis']=='normal'][new_col + "std"] = df_samples[df_samples['diagnosis']=='normal'][digit_columns].std(axis=1)
#    data[new_col + "skew"] = data[digit_columns].mean(axis=1)
#    data[new_col + "kurtosis"] = data[digit_columns].std(axis=1)
#,'sequence_flag_ccw','number_of_digits','hor_count'
cols=["minute_proximity_from_2","pred_tremor","double_minor","horizontal_dist",'angle_between_hands']
for new_col in cols:
    df_samples[df_samples['diagnosis']=='normal'][new_col + "mean"] = pd.Series(np.ones((1,df_samples[df_samples['diagnosis']=='normal'][new_col].shape[0]))[0])*df_samples[df_samples['diagnosis']=='normal'][new_col].mean(axis=0)
    df_samples[df_samples['diagnosis']=='pre_alzheimer'][new_col + "mean"] = pd.Series(np.ones((1,df_samples[df_samples['diagnosis']=='pre_alzheimer'][new_col].shape[0]))[0])*df_samples[df_samples['diagnosis']=='pre_alzheimer'][new_col].mean(axis=0)
    df_samples[df_samples['diagnosis']=='post_alzheimer'][new_col + "mean"] = pd.Series(np.ones((1,df_samples[df_samples['diagnosis']=='post_alzheimer'][new_col].shape[0]))[0])*df_samples[df_samples['diagnosis']=='post_alzheimer'][new_col].mean(axis=0)
    #df_samples[df_samples['diagnosis']=='post_alzheimer'][new_col + "std"] = pd.Series(np.ones((1,df_samples[df_samples['diagnosis']=='post_alzheimer'][new_col].shape[0]))[0])*df_samples[df_samples['diagnosis']=='post_alzheimer'][new_col].std(axis=0)
    #df_samples[df_samples['diagnosis']=='pre_alzheimer'][new_col + "std"] = pd.Series(np.ones((1,df_samples[df_samples['diagnosis']=='pre_alzheimer'][new_col].shape[0]))[0])*df_samples[df_samples['diagnosis']=='pre_alzheimer'][new_col].std(axis=0)
    #df_samples[df_samples['diagnosis']=='normal'][new_col + "std"] = pd.Series(np.ones((1,df_samples[df_samples['diagnosis']=='normal'][new_col].shape[0]))[0])*df_samples[df_samples['diagnosis']=='normal'][new_col].std(axis=0)
# Row-wise sum of the missing_digit_1..12 columns (NaNs were already replaced by -1 above)
missing_cols = [f"missing_digit_{i}" for i in range(1, 13)]
df_samples['missing_digit'] = df_samples[missing_cols].sum(axis=1)


#df_samples.head(3)

In [15]:
model_features = df_samples.columns.to_list()
print(len(model_features))
model_features = [c for c in model_features if c not in [key_col, target_col] ]

unique_value_cols = []
for c in model_features:
    if df_samples[c].unique().shape[0] == 1:
        unique_value_cols.append(c)
        
print(unique_value_cols)
model_features = [c for c in model_features if c not in unique_value_cols]
print(len(model_features))
print(cat_cols)
print(len(df_samples.columns))

132
['actual_hour_digit', 'actual_minute_digit']
128
['intersection_pos_rel_centre']
132
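
The column count of 132 printed above is worth a second look, since the feature-engineering cell calls `drop` without assigning the result and assigns new columns through a boolean-filtered frame. In pandas, neither pattern modifies the original DataFrame. The toy example below (hypothetical column names and values) shows both effects next to the forms that do take effect.

```python
import warnings
import pandas as pd

# Silence the chained-assignment warning, as the notebook does globally.
warnings.filterwarnings("ignore")

df = pd.DataFrame({"diagnosis": ["normal", "pre_alzheimer"],
                   "a": [1.0, 2.0],
                   "b": [3.0, 5.0]})

# 1) drop() returns a new frame; without assignment df keeps column "a".
df.drop(["a"], axis=1)
cols_after_unassigned_drop = list(df.columns)

# 2) df[mask] is a copy, so the new column lands on the copy and is discarded.
df[df["diagnosis"] == "normal"]["ab_mean"] = 0.0
has_mean_after_chained = "ab_mean" in df.columns

# Forms that do modify df: assign the row-wise mean directly, and assign drop() back.
df["ab_mean"] = df[["a", "b"]].mean(axis=1)
df = df.drop(columns=["a"])

print(cols_after_unassigned_drop)
print(has_mean_after_chained)
print(df)
```

If the extra aggregate columns are wanted as model inputs, they would need to be created identically in the prediction phase, and the feature counts in the cells below would change accordingly.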


In [16]:
from sklearn.feature_selection import SelectKBest,chi2
X_train = df_samples[model_features]
X_train[X_train._get_numeric_data()<0]=0
y_train = df_samples[target_col].map(dict(zip(target_values, list(range(len(target_values))))))
print(X_train.shape)
X_new=SelectKBest(chi2,k=128).fit(X_train,y_train)


(4707, 128)


### Train models with 5 folds

In [17]:

from sklearn.feature_selection import SelectKBest,chi2
from sklearn.metrics import log_loss
X_train = df_samples[model_features].copy()
# chi2 requires non-negative inputs, so negative values (including the -1 fill) are set to 0
X_train[X_train._get_numeric_data()<0]=0
y_train = df_samples[target_col].map(dict(zip(target_values, list(range(len(target_values))))))
print(X_train.shape)
# k equals the number of columns here, so the selector keeps every feature
X_new=SelectKBest(chi2,k=128).fit_transform(X_train,y_train)
X_train = pd.DataFrame(X_new)

skf = StratifiedKFold(n_splits=5, random_state=2021, shuffle=True)

params={"objective" : "multiclass",
          "num_class" : len(target_values),
          "bagging_seed" : 2021,
          "verbosity" : 1,}

preds = 0.0
clfs = []
log=[]
log1=[]
for fold, (itrain, ivalid) in enumerate(skf.split(X_train, y_train)):
    print("-"*40)
    print(f"Running for fold {fold}")
    lgb_train = lgb.Dataset(X_train.iloc[itrain], y_train.iloc[itrain])
    lgb_eval  = lgb.Dataset(X_train.iloc[ivalid], y_train.iloc[ivalid], reference = lgb_train)
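    # Early stopping and logging via callbacks (available since LightGBM 3.3); the
    # early_stopping_rounds/verbose_eval keyword arguments were removed from lgb.train in 4.0
    clf = lgb.train(params, lgb_train, 1000, valid_sets=[lgb_eval], 
                callbacks=[lgb.early_stopping(stopping_rounds=100),
                           lgb.log_evaluation(period=200)])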
        
    clfs.append(clf)


(4707, 128)
----------------------------------------
Running for fold 0
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 16620
[LightGBM] [Info] Number of data points in the train set: 3765, number of used features: 110
[LightGBM] [Info] Start training from score -0.405465
[LightGBM] [Info] Start training from score -1.410217
[LightGBM] [Info] Start training from score -2.416392
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[26]	valid_0's multi_logloss: 0.628773
----------------------------------------
Running for fold 1
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 16626
[LightGBM] [Info] Number of data points in the train set: 3765, number of used features: 110
[LightGBM] [Info] Start training from score -0.405465
[LightGBM] [Info] Start training from score -1.410217
[LightGBM] [Info] Start training from score -2.416392
Training until validation scores don't improve for 100 rounds
[remaining output truncated]
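
The log line `Early stopping, best iteration is: [26]` follows from the patience of 100 rounds passed to `lgb.train`. As a rough mental model (a simplified sketch, not LightGBM's implementation), early stopping keeps the round with the lowest validation loss and stops once that many further rounds bring no improvement. The loss values below are invented.

```python
def best_round_with_early_stopping(valid_losses, patience):
    """Return (best_round, stopped_round), both 1-based.

    Stops once `patience` rounds have passed without improving on the best loss.
    """
    best_loss = float("inf")
    best_round = 0
    for round_no, loss in enumerate(valid_losses, start=1):
        if loss < best_loss:
            best_loss, best_round = loss, round_no
        elif round_no - best_round >= patience:
            return best_round, round_no
    # The losses ran out before the patience was exhausted.
    return best_round, len(valid_losses)

# Invented losses: improvement until round 3, then a slow drift upwards.
losses = [0.90, 0.70, 0.63, 0.64, 0.65, 0.66, 0.67]
best, stopped = best_round_with_early_stopping(losses, patience=3)
print(best, stopped)
```

Under this model, a best iteration of 26 with a patience of 100 would mean fold 0 trained to roughly round 126 before stopping.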

## Save your trained model

In [18]:
# model.save()

In [19]:
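# Make sure the assets directory exists before writing the fold models into it
os.makedirs(AICROWD_ASSETS_DIR, exist_ok=True)
for i, clf in enumerate(clfs):
    model_filename = f'{AICROWD_ASSETS_DIR}/model_lgb_fold_{i}.pkl'
    joblib.dump(clf, model_filename)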

In [20]:
meta = {
    "numeric_features": numeric_features,
    "cat_cols": cat_cols,
    "model_features": model_features
}
meta_filename = f'{AICROWD_ASSETS_DIR}/model_lgb_meta.pkl'
joblib.dump(meta, meta_filename)

['assets/model_lgb_meta.pkl']

# Prediction phase 🔎

Please make sure to save the weights from the training section in your assets directory and load them in this section

In [21]:
# model = load_model_from_assets_dir(AIcrowdConfig.ASSETS_DIR)

In [22]:
nb_folds = 5 # skf.n_splits
clfs = []
for fold in range(nb_folds):
    print("-"*40)
    print(f"Running for fold {fold}")
    model_filename = f'{AICROWD_ASSETS_DIR}/model_lgb_fold_{fold}.pkl'
    
    clf = joblib.load(model_filename)
    clfs.append(clf)
    
print("-"*40)
meta_filename = f'{AICROWD_ASSETS_DIR}/model_lgb_meta.pkl'
meta = joblib.load(meta_filename)
print(meta.keys())

numeric_features = meta['numeric_features']
cat_cols = meta['cat_cols']
model_features = meta['model_features']

----------------------------------------
Running for fold 0
----------------------------------------
Running for fold 1
----------------------------------------
Running for fold 2
----------------------------------------
Running for fold 3
----------------------------------------
Running for fold 4
----------------------------------------
dict_keys(['numeric_features', 'cat_cols', 'model_features'])


## Load test data

In [23]:
test = pd.read_csv(AICROWD_DATASET_PATH)

## Generate predictions

In [24]:
test_val = test.copy()

In [25]:
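# NOTE: as in the training phase, these drop() results are not assigned back,
# so both columns stay in test_val and remain available to the model.
test_val.drop(['between_digits_angle_ccw_sum'],axis=1)
test_val.drop(['single_hand_length'],axis=1)   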
test_val['final_rotation_angle'].fillna(test_val['final_rotation_angle'].mode(),inplace=True)
test_val["number_of_digits"].fillna(test_val["number_of_digits"].mode(),inplace=True)
test_val.fillna(-1, inplace=True)
    
df_dummies = pd.get_dummies(test_val['intersection_pos_rel_centre'], columns='intersection_pos_rel_centre',
                          dummy_na=False).add_prefix('c_i_')
test_val = test_val.drop('intersection_pos_rel_centre', axis=1)
test_val = pd.concat([test_val, df_dummies], axis=1)

#create more features
df_dummies= pd.get_dummies(test_val['hand_count_dummy'], columns='hand_count_dummy',
                          dummy_na=False).add_prefix('c_h_')

test_val = test_val.drop(['hand_count_dummy'], axis=1)
test_val = pd.concat([test_val, df_dummies], axis=1)

feat_col = test_val['final_rotation_angle']
test_val['rotation_angle_180'] = (feat_col <= 180).astype('int')    #we will also include NaN in this column
test_val['rotation_angle_360'] = (feat_col > 180).astype('int') 
test_val = test_val.drop('final_rotation_angle', axis=1)

test_val['more than 12'] = [1 if x > 12 else 0 for x in test_val['number_of_digits'] ]
new_cols = ["missing_digit_", "euc_dist__digit_", "area_digit_", 
           "height_digit_", "width_digit_","dist from "]
for new_col in new_cols:
    digit_columns = test_val.columns[test_val.columns.str.contains(new_col)]
    test_val[new_col + "mean"] = test_val[digit_columns].mean(axis=1)
#    test_val[new_col + "std"] = test_val[digit_columns].std(axis=1)

cols=["minute_proximity_from_2","pred_tremor","double_minor","horizontal_dist",'angle_between_hands']
for new_col in cols:
    test_val[new_col + "mean"] = test_val[new_col]
    test_val[new_col + "std"] = test_val[new_col].std()

# Row-wise sum of the missing_digit_1..12 columns (NaNs were already replaced by -1 above)
missing_cols = [f"missing_digit_{i}" for i in range(1, 13)]
test_val['missing_digit'] = test_val[missing_cols].sum(axis=1)


print("Missing columns:", [c for c in model_features if c not in test_val.columns])
test_val.head(3)

X_test = test_val[model_features]

# Average the class probabilities predicted by the fold models
preds = 0.0
nb_folds = 5 # skf.n_splits
for fold, clf in enumerate(clfs):
    print("-"*40)
    print(f"Running for fold {fold}")
    pred = clf.predict(X_test)
    preds += pred/nb_folds
print(preds.shape)


Missing columns: []
----------------------------------------
Running for fold 0
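
The loop above averages the class probabilities predicted by the five fold models. The same ensembling step can be sketched with NumPy on made-up arrays; `fold_preds` stands in for the per-fold outputs of `clf.predict(X_test)`, each assumed to have shape `(n_rows, 3)`.

```python
import numpy as np

# Made-up per-fold probabilities for two rows and three classes.
fold_preds = [
    np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]),
    np.array([[0.5, 0.3, 0.2], [0.3, 0.4, 0.3]]),
]

# Stack to shape (n_folds, n_rows, n_classes) and average over the fold axis.
preds = np.mean(np.stack(fold_preds), axis=0)

# Averaging rows that each sum to 1 gives rows that still sum to 1.
row_sums = preds.sum(axis=1)
predicted_class = preds.argmax(axis=1)

print(preds)
print(row_sums)
print(predicted_class)
```

Because the averaged rows remain valid probability distributions, they can be written to the three probability columns of the submission without renormalising.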


In [None]:
predictions = {
    "row_id": test_val["row_id"].values,
    "normal_diagnosis_probability": preds[:,0],
    "post_alzheimer_diagnosis_probability": preds[:,1],
    "pre_alzheimer_diagnosis_probability": preds[:,2]
}

predictions_df = pd.DataFrame.from_dict(predictions)

## Save predictions 📨

In [None]:
predictions_df.to_csv(AICROWD_PREDICTIONS_PATH, index=False)
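
Before submitting, it can help to sanity-check the saved file. The sketch below uses only the standard library and an in-memory CSV with invented rows. The column names are the ones written by the prediction cell above; the requirement that each row sums to 1 is an assumption about what a valid probability row looks like, not a documented rule of the challenge. `check_predictions` is a hypothetical helper.

```python
import csv
import io

EXPECTED_COLUMNS = [
    "row_id",
    "normal_diagnosis_probability",
    "post_alzheimer_diagnosis_probability",
    "pre_alzheimer_diagnosis_probability",
]

def check_predictions(handle, tol=1e-6):
    """Return the number of rows after checking the header and the row sums."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    n_rows = 0
    for row in reader:
        total = sum(float(row[c]) for c in EXPECTED_COLUMNS[1:])
        if abs(total - 1.0) > tol:
            raise ValueError(f"row {row['row_id']} sums to {total}")
        n_rows += 1
    return n_rows

# Invented example standing in for the file at AICROWD_PREDICTIONS_PATH.
sample = io.StringIO(
    "row_id,normal_diagnosis_probability,post_alzheimer_diagnosis_probability,"
    "pre_alzheimer_diagnosis_probability\n"
    "A1,0.6,0.25,0.15\n"
    "B2,0.2,0.5,0.3\n"
)
n_checked = check_predictions(sample)
print(n_checked)
```

With the real file, pass `open(AICROWD_PREDICTIONS_PATH, newline="")` instead of the in-memory sample.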

# Submit to AIcrowd 🚀

**NOTE: PLEASE SAVE THE NOTEBOOK BEFORE SUBMITTING IT (Ctrl + S)**

In [None]:
!DATASET_PATH=$AICROWD_DATASET_PATH \
aicrowd notebook submit \
    --assets-dir $AICROWD_ASSETS_DIR \
    --challenge addi-alzheimers-detection-challenge