# Revision

- v1: initial
- v2: clean up code

# Todos

# To-Do List for Achieving Golden Score

## 1. **Review and Clean Up the Notebook**
   - [x] **Remove Deprecated Code:**
     - Identify and remove commented-out code that is no longer in use.
     - Ensure that any relevant documentation or explanations are retained.
   - [ ] **Refactor Code:**
     - Simplify any complex logic where possible.
     - Ensure the code is clean, concise, and well-commented.

## 2. **Enhance Data Preprocessing**
   - [ ] **Feature Engineering:**
     - Explore additional feature engineering techniques to extract more valuable information.
     - Consider domain-specific features that may enhance model performance.
   - [ ] **Data Augmentation:**
     - Implement data augmentation techniques to increase the diversity of the training set, particularly for positive samples.
     - Evaluate the impact of augmentation on model performance.

## 3. **Model Tuning and Selection**
   - [ ] **Hyperparameter Optimization:**
     - Revisit hyperparameter tuning using Optuna or other optimization frameworks.
     - Consider increasing the number of trials to explore a broader search space.
   - [ ] **Model Ensemble:**
     - Experiment with different ensemble techniques, such as stacking or blending, to combine the strengths of multiple models.
     - Optimize the weights for the model fusion to maximize performance.
   - [ ] **Algorithm Exploration:**
     - Explore additional algorithms (e.g., Neural Networks) that might outperform current models.
     - Evaluate these models in the same GroupKFold cross-validation setup.
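
The blending item above can be sketched on synthetic out-of-fold predictions. This is only an illustration under stated assumptions: `oof_a`, `oof_b`, and the weight grid are hypothetical, and plain ROC AUC stands in for the competition metric.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)
# Two hypothetical models: noisy views of the same signal.
oof_a = y + rng.normal(0, 1.0, size=y.size)
oof_b = y + rng.normal(0, 1.5, size=y.size)

# Grid-search a single blend weight on the out-of-fold predictions.
best_w, best_auc = None, -np.inf
for w in np.linspace(0, 1, 21):
    auc = roc_auc_score(y, w * oof_a + (1 - w) * oof_b)
    if auc > best_auc:
        best_w, best_auc = w, auc
print(f"best weight for model A: {best_w:.2f}, AUC: {best_auc:.4f}")
```

Because the grid includes the weights 0 and 1, the blended score on these predictions can never be worse than the better single model; whether the gain carries over to unseen data still has to be checked separately.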

## 4. **Secondary Metrics Optimization**
   - [ ] **Top-15 Retrieval Sensitivity:**
     - Implement and test the secondary prize metric as described.
     - Optimize the model specifically for the Top-15 retrieval sensitivity metric to increase chances of winning the secondary prize.
   - [ ] **Model Efficiency:**
     - Evaluate and optimize the efficiency of the model by balancing runtime and accuracy.
     - Explore potential model simplifications or adjustments to improve runtime without significantly sacrificing performance.

## 5. **Feature Selection and Reduction**
   - [ ] **Correlation Analysis:**
     - Perform a thorough correlation analysis to identify and remove highly correlated features.
     - Consider techniques like PCA or other dimensionality reduction methods if feature space is large.
   - [ ] **Feature Importance:**
     - Use feature importance scores from models like XGBoost or LGBM to select the most impactful features.
     - Remove low-importance features to reduce overfitting and improve generalization.
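
One way to approach the correlation item is to scan the upper triangle of the absolute correlation matrix and mark the later column of each highly correlated pair. The helper name `find_correlated_columns` and the toy frame are assumptions made for this sketch; the 0.99 threshold is illustrative.

```python
import numpy as np
import pandas as pd

def find_correlated_columns(df, threshold=0.99):
    """Return columns whose |corr| with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

rng = np.random.default_rng(0)
demo = pd.DataFrame({"a": rng.normal(size=500)})
demo["b"] = demo["a"] * 2 + 1e-6 * rng.normal(size=500)  # near-duplicate of "a"
demo["c"] = rng.normal(size=500)                         # independent column
to_drop = find_correlated_columns(demo)
print(to_drop)
```

The vectorized triangle mask avoids a nested Python loop over every column pair, which matters once the feature count reaches the thousands.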

## 6. **Cross-Validation Strategy**
   - [ ] **GroupKFold Tuning:**
     - Fine-tune the GroupKFold strategy to ensure that patient data is properly stratified and leakage is minimized.
     - Explore alternative cross-validation techniques if necessary.

## 7. **Final Model Evaluation**
   - [ ] **Out-of-Fold Predictions:**
     - Use out-of-fold predictions to evaluate the final model's performance on unseen data.
   - [ ] **Private Leaderboard Testing:**
     - Run the model on a holdout set or using a simulated private leaderboard setup to validate the final performance before submission.
   - [ ] **Submission Preparation:**
     - Ensure that the final submission meets all competition guidelines and requirements.

## 8. **Documentation and Reporting**
   - [ ] **Notebook Documentation:**
     - Update the notebook with clear explanations of each step taken.
     - Include reasoning for choices made during the modeling process.
   - [ ] **Result Analysis:**
     - Document the results of each model iteration, including any improvements or declines in performance.
   - [ ] **Submission Commentary:**
     - Prepare a submission report that explains the model’s methodology, strengths, and any unique approaches taken.

## 9. **Explore External Data**
   - [ ] **External Dataset Integration:**
     - Research and incorporate external datasets if allowed by the competition rules.
     - Ensure that the external data is preprocessed consistently with the internal dataset.

## 10. **Community Engagement**
   - [ ] **Kaggle Discussions:**
     - Engage with the Kaggle community to share insights and gain feedback on approaches.
   - [ ] **Review Competitor Notebooks:**
     - Analyze and learn from top competitors' public notebooks to gain new ideas and insights.

## 11. **Final Submission and Backup Plan**
   - [ ] **Backup Model:**
     - Prepare a secondary model or approach as a backup in case the primary model underperforms on the private leaderboard.
   - [ ] **Final Submission:**
     - Ensure that the final model is thoroughly tested, validated, and ready for submission.
     - Double-check submission format and file integrity.

# Steps


## Libraries and Config

In [None]:
import pandas as pd
import numpy as np
from lightgbm import LGBMClassifier, log_evaluation
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from tqdm import tqdm
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold
import warnings

warnings.filterwarnings('ignore')

class Config():
    seed=2024
    num_folds=10
    TARGET_NAME ='target'
    
import random
    
def seed_everything(seed):
    np.random.seed(seed)
    random.seed(seed)
seed_everything(Config.seed)

## Read Dataset

The value counts below show that the classes are heavily imbalanced: roughly 1000 negatives for every positive.

In [None]:
train=pd.read_csv("/kaggle/input/isic-2024-challenge/train-metadata.csv")
print(f"len(train):{len(train)}")
test=pd.read_csv("/kaggle/input/isic-2024-challenge/test-metadata.csv")
print(f"len(test):{len(test)}")

# Saving GPU time offline
# if len(test) == 3:
#     train=train[:50000]

train.head()
train[Config.TARGET_NAME].value_counts()
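
A small sketch of how the imbalance ratio can be read off `value_counts()`. The labels here are synthetic and chosen to give exactly 1000:1; they are not the competition data.

```python
import pandas as pd

# Synthetic labels: 1 positive per 1000 negatives (illustrative only).
target = pd.Series([0] * 5000 + [1] * 5, name="target")
counts = target.value_counts()
ratio = counts[0] / counts[1]
print(f"negatives per positive: {ratio:.0f}")
```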

## EDA Analysis Results for Tabular Data

- **`isic_id`**: Unique identifier.
- **`patient_id`**: There are 1,042 unique patients.
- **`target`**: This is the label (binary outcome).

### Demographics
- **`sex`**: Males have a slightly higher positive rate than females, approximately 5:4.
- **`age_approx`**: No cases were found below the age of 18. The `age_approx` variable has a high Pearson correlation coefficient with `target.mean()`.

### Anatomical Sites
- **`anatom_site_general`**: The `Head/Neck` region has 64 cases, while other categories have around 8 cases each.
- **`clin_size_long_diam_mm`**: The count shows a long-tailed distribution. In some cases, `target.mean()` is 0, which might be due to small sample size; subjective conclusions should not be drawn.

### Image and Lesion Types
- **`image_type`**: Only one category is present, making it not useful.
- **`tbp_tile_type`**: For `3D:XP` and `3D:white`, `target.mean()` is 6 and 17, respectively.

### Lesion Volume Features
- **`tbp_lv_A`, `tbp_lv_Aext`, `tbp_lv_B`, `tbp_lv_Bext`, `tbp_lv_C`, `tbp_lv_Cext`, `tbp_lv_H`, `tbp_lv_Hext`, `tbp_lv_L`, `tbp_lv_Lext`**: These continuous variables follow a normal distribution.
- **`tbp_lv_areaMM2`**: The data shows a long-tailed distribution.
- **`tbp_lv_area_perim_ratio`**: The data shows a long-tailed distribution.
- **`tbp_lv_color_std_mean`**: Extremely long-tailed distribution; consider categorizing into two classes based on whether the value is 0 or not.
- **`tbp_lv_deltaA`, `tbp_lv_deltaB`, `tbp_lv_deltaL`**: Continuous variables, normally distributed.
- **`tbp_lv_deltaLB`**: Continuous variable, normally distributed. Values ≥ 25 might be outliers.
- **`tbp_lv_deltaLBnorm`**: Continuous variable, normally distributed. Values ≥ 20 might be outliers.
- **`tbp_lv_eccentricity`**: Continuous variable, normally distributed.
- **`tbp_lv_location`**: The `Head & Neck` region seems to have the highest incidence. Some other categories have `target.mean()` equal to 0; it is unclear if this is due to insufficient data or if it can be used as a conclusion.
- **`tbp_lv_location_simple`**: Categories might overlap with those in `tbp_lv_location`.
- **`tbp_lv_minorAxisMM`**: Shows a long-tailed distribution.
- **`tbp_lv_nevi_confidence`**: Right-skewed long-tailed distribution.
- **`tbp_lv_norm_border`**: Right-skewed long-tailed distribution.
- **`tbp_lv_norm_color`**: Significant clustering at values 10 and 0; consider categorizing into three classes: 0, 10, and others.
- **`tbp_lv_perimeterMM`**: Long-tailed distribution.
- **`tbp_lv_radial_color_std_max`**: Most values are 0, with a long-tailed distribution.
- **`tbp_lv_stdL`, `tbp_lv_stdLExt`**: Somewhat long-tailed distributions.
- **`tbp_lv_symm_2axis`**: Continuous variable, with particularly high counts at a few points.
- **`tbp_lv_symm_2axis_angle`**: Categorical variable.
- **`tbp_lv_x`, `tbp_lv_y`, `tbp_lv_z`**: Normally distributed continuous variables, with `tbp_lv_y` being left-skewed.
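
Two of the suggestions in the list above (three classes for `tbp_lv_norm_color`, a zero/non-zero split for `tbp_lv_color_std_mean`) could look like the following sketch. The toy values and the new column names `tbp_lv_norm_color_bucket` and `color_std_is_zero` are hypothetical.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "tbp_lv_norm_color": [0.0, 10.0, 3.2, 0.0, 7.5],
    "tbp_lv_color_std_mean": [0.0, 1.4, 0.0, 0.3, 2.1],
})
# Three classes for tbp_lv_norm_color: 0 -> 0, 10 -> 1, anything else -> 2.
demo["tbp_lv_norm_color_bucket"] = np.select(
    [demo["tbp_lv_norm_color"] == 0, demo["tbp_lv_norm_color"] == 10],
    [0, 1],
    default=2,
)
# Binary flag: is tbp_lv_color_std_mean exactly zero?
demo["color_std_is_zero"] = (demo["tbp_lv_color_std_mean"] == 0).astype(np.int8)
print(demo)
```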

### Metadata
- **`attribution`**: Categorical variable.
- **`copyright_license`**: The mean value for `CC-BY-NC` is particularly low.

### Lesion and Diagnosis Data
- **`lesion_id`**: There are 20,000 lesion IDs with 400,000 training data points.
- **`iddx_full`**: Over 390,000 belong to one category (`Benign`). Consider whether the remaining categories should be further subdivided.
- **`iddx_1`**: A categorical variable with three categories.
- **`iddx_2`, `iddx_3`, `iddx_4`**: Variables with missing and non-missing values; consider combining them into one category.
- **`iddx_5`**: Only one value is present; the rest are missing, so it should be dropped.
- **`mel_mitotic_index`**: A few values are present; it is a categorical variable with an ordinal relationship.
- **`mel_thick_mm`**: Combine cases with and without values into one category, as only a few have values.
- **`tbp_lv_dnn_lesion_confidence`**: Continuous variable with a right-skewed long-tailed distribution.


## Feature Engineering

Some of the features are my own; the rest come from <a href="https://www.kaggle.com/code/abdmental01/multimodel-isic/notebook">multimodel-isic</a>.

In [None]:
# Categorical variables in the training data
print("tbp_lv_dnn_lesion_confidence feature")
cates=['age_approx', 'sex', 'anatom_site_general', 'tbp_tile_type', 'tbp_lv_location_simple', 'attribution', 'copyright_license']
# Build groupby features on 'tbp_lv_dnn_lesion_confidence'
cates2mean={}
for c in cates:
    base=train.groupby(c)['tbp_lv_dnn_lesion_confidence'].mean().reset_index().rename(columns={'tbp_lv_dnn_lesion_confidence':f'mean_{c}_tbp_lv_dnn_lesion_confidence'})
    
    min_tmp=train.groupby(c)['tbp_lv_dnn_lesion_confidence'].min().reset_index().rename(columns={'tbp_lv_dnn_lesion_confidence':f'min_{c}_tbp_lv_dnn_lesion_confidence'})
    base=base.merge(min_tmp,on=c,how='left')
    
    max_tmp=train.groupby(c)['tbp_lv_dnn_lesion_confidence'].max().reset_index().rename(columns={'tbp_lv_dnn_lesion_confidence':f'max_{c}_tbp_lv_dnn_lesion_confidence'})
    base=base.merge(max_tmp,on=c,how='left')
    
    median_tmp=train.groupby(c)['tbp_lv_dnn_lesion_confidence'].median().reset_index().rename(columns={'tbp_lv_dnn_lesion_confidence':f'median_{c}_tbp_lv_dnn_lesion_confidence'})
    base=base.merge(median_tmp,on=c,how='left')
    
    std_tmp=train.groupby(c)['tbp_lv_dnn_lesion_confidence'].std().reset_index().rename(columns={'tbp_lv_dnn_lesion_confidence':f'std_{c}_tbp_lv_dnn_lesion_confidence'})
    base=base.merge(std_tmp,on=c,how='left')
    
    skew_tmp=train.groupby(c)['tbp_lv_dnn_lesion_confidence'].skew().reset_index().rename(columns={'tbp_lv_dnn_lesion_confidence':f'skew_{c}_tbp_lv_dnn_lesion_confidence'})
    base=base.merge(skew_tmp,on=c,how='left')
    
    cates2mean[c]=base

def FE(df):
    # Feature engineering source: https://www.kaggle.com/code/abdmental01/multimodel-isic
    # These features may require some domain knowledge; I did not dig into them here.
    df["lesion_size_ratio"]=df["tbp_lv_minorAxisMM"]/df["clin_size_long_diam_mm"]
    df["lesion_shape_index"]=df["tbp_lv_areaMM2"]/(df["tbp_lv_perimeterMM"]**2)
    df["hue_contrast"]= (df["tbp_lv_H"]-df["tbp_lv_Hext"]).abs()
    df["luminance_contrast"]= (df["tbp_lv_L"]-df["tbp_lv_Lext"]).abs()
    df["lesion_color_difference"]=np.sqrt(df["tbp_lv_deltaA"]**2+df["tbp_lv_deltaB"]**2+df["tbp_lv_deltaL"]**2)
    df["border_complexity"]=df["tbp_lv_norm_border"]+df["tbp_lv_symm_2axis"]
    df["3d_position_distance"]=np.sqrt(df["tbp_lv_x"]**2+df["tbp_lv_y"]**2+df["tbp_lv_z"]**2)
    df["perimeter_to_area_ratio"]=df["tbp_lv_perimeterMM"]/df["tbp_lv_areaMM2"]
    df["lesion_visibility_score"]=df["tbp_lv_deltaLBnorm"]+df["tbp_lv_norm_color"]
    df["combined_anatomical_site"]=df["anatom_site_general"]+"_"+df["tbp_lv_location"]
    df["symmetry_border_consistency"]=df["tbp_lv_symm_2axis"]*df["tbp_lv_norm_border"]
    df["color_consistency"]=df["tbp_lv_stdL"]/df["tbp_lv_Lext"]
    df["size_age_interaction"]=df["clin_size_long_diam_mm"]*df["age_approx"]
    df["hue_color_std_interaction"]=df["tbp_lv_H"]*df["tbp_lv_color_std_mean"]
    df["lesion_severity_index"]=(df["tbp_lv_norm_border"]+df["tbp_lv_norm_color"]+df["tbp_lv_eccentricity"])/3
    df["shape_complexity_index"]=df["border_complexity"]+df["lesion_shape_index"]
    df["color_contrast_index"]=df["tbp_lv_deltaA"]+df["tbp_lv_deltaB"]+df["tbp_lv_deltaL"]+df["tbp_lv_deltaLBnorm"]
    df["normalized_lesion_size"]=df["clin_size_long_diam_mm"]/df["age_approx"]
    df["mean_hue_difference"]=(df["tbp_lv_H"]+df["tbp_lv_Hext"])/2
    df["std_dev_contrast"]=np.sqrt((df["tbp_lv_deltaA"]**2+df["tbp_lv_deltaB"]**2+df["tbp_lv_deltaL"]**2)/3)
    df["color_shape_composite_index"]=(df["tbp_lv_color_std_mean"]+df["tbp_lv_area_perim_ratio"]+df["tbp_lv_symm_2axis"])/3
    df["3d_lesion_orientation"]=np.arctan2(df["tbp_lv_y"],df["tbp_lv_x"])
    df["overall_color_difference"]=(df["tbp_lv_deltaA"]+df["tbp_lv_deltaB"]+df["tbp_lv_deltaL"])/3
    df["symmetry_perimeter_interaction"]=df["tbp_lv_symm_2axis"]*df["tbp_lv_perimeterMM"]
    df["comprehensive_lesion_index"]=(df["tbp_lv_area_perim_ratio"]+df["tbp_lv_eccentricity"]+df["tbp_lv_norm_color"]+df["tbp_lv_symm_2axis"])/4

    
    print("drop_cols")
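    drop_cols=['lesion_id',# in train but not in test; 94.5% missing; each id appears once, so drop
     'iddx_2', # in train but not in test; 99.7% missing, so drop
     'iddx_3', # in train but not in test; 99.7% missing, so drop
     'iddx_4',# in train but not in test; 99.8% missing, so drop
     'iddx_5',# in train but not in test; 99.999% missing, so drop
     'mel_mitotic_index',# in train but not in test; 99.98% missing, so drop
     'mel_thick_mm',# in train but not in test; 99.98% missing, so drop
     'image_type',# nunique=1 in the training data
     #'isic_id',# just an ordinary id with no meaning
     # The next two columns probably exist only because the target was already known
     'iddx_full',# in train but not in test; maps one-to-one to target; each category's target.mean() is either 0 or 1
     'iddx_1',# in train but not in test; maps one-to-one to target; each category's target.mean() is either 0 or 1
    ]
    # Columns that are absent (as in the test data) are simply ignored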
    df.drop(drop_cols,axis=1,inplace=True,errors='ignore')
    print("tbp_lv_dnn_lesion_confidence feature")
    for c in cates:
        df=df.merge(cates2mean[c],on=c,how='left')
        
    print("age_approx feature")
    # Ages of 15 or below are set to 15
    df.loc[df['age_approx']<=15,'age_approx']=15
    # Fill missing values (NaN != NaN) with the most frequent age, 55
    df.loc[(df['age_approx']!=df['age_approx']),'age_approx']=55
    value_counts={55.0: 58123,
                 65.0: 54946,
                 60.0: 54109,
                 50.0: 47924,
                 70.0: 39775,
                 40.0: 31297,
                 75.0: 30801,
                 45.0: 23580,
                 80.0: 21096,
                 35.0: 11543,
                 30.0: 10400,
                 85.0: 8847,
                 25.0: 3433,
                 20.0: 1742,
                 15.0: 645}
    df['age_approx_count']=df['age_approx'].apply(lambda x:value_counts.get(x,645))
    
    
    print("sex feature")
    #sex
    df['sex_male']=(df['sex']=='male').astype(np.int8)
    df['sex_female']=(df['sex']=='female').astype(np.int8)
    value_counts={'male':265546,'female':123996}
    # NaN falls back to the count of missing values (11517)
    df['sex']=df['sex'].apply(lambda x:value_counts.get(x,11517))
    
    #'anatom_site_general'  one-hot
    print("anatom_site_general feature")
    cols=['posterior torso','lower extremity','anterior torso','upper extremity','head/neck']
    for col in cols:
        df[f'anatom_site_general_{col}']=(df['anatom_site_general']==col).astype(np.int8)
    #value_counts
    value_counts={'posterior torso': 121902,
     'lower extremity': 103028,
     'anterior torso': 87770,
     'upper extremity': 70557,
     'head/neck': 12046}
    df['anatom_site_general']=df['anatom_site_general'].apply(lambda x:value_counts.get(x,12046))
    
    print("clin_size_long_diam_mm feature")
    # The log transform only partly corrects this long-tailed distribution
    df['clin_size_long_diam_mm']=np.log1p(df['clin_size_long_diam_mm'])
    
    print("tbp_tile_type feature")
    df['tbp_tile_type']=(df['tbp_tile_type']=='3D: XP').astype(np.int8)
    
    print("tbp_lv_XX, tbp_lv_XXext feature")
    # Brute-force feature construction without knowing what the columns mean
    for c in ['A','B','C','H','L']:
        col1,col2=f'tbp_lv_{c}',f'tbp_lv_{c}ext'
        df[f'{col1}+{col2}']=df[col1]+df[col2]
        df[f'{col1}-{col2}']=df[col1]-df[col2]
        df[f'{col1}*{col2}']=df[col1]*df[col2]
        df[f'{col1}/{col2}']=df[col1]/(df[col2]+1e-20)  
        
    print("tbp_lv_areaMM2 feature")
    df['tbp_lv_areaMM2']=np.log1p(df['tbp_lv_areaMM2'])
    
    print("tbp_lv_area_perim_ratio feature")
    # tbp_lv_area_perim_ratio has a long-tailed distribution
    df['tbp_lv_area_perim_ratio']=np.log1p(df['tbp_lv_area_perim_ratio'])
    # Replace outliers (values >= 4) with the mean
    df.loc[df['tbp_lv_area_perim_ratio']>=4,'tbp_lv_area_perim_ratio']=2.9
    
    print("tbp_lv_symm_2axis feature")
    # tbp_lv_symm_2axis_angle should be an angle, so add its sin and cos
    df['sin_tbp_lv_symm_2axis_angle']=np.sin(2*np.pi*df['tbp_lv_symm_2axis_angle']/180)
    df['cos_tbp_lv_symm_2axis_angle']=np.cos(2*np.pi*df['tbp_lv_symm_2axis_angle']/180)
    df['tbp_lv_symm_2axis*sin_tbp_lv_symm_2axis_angle']=df['tbp_lv_symm_2axis']*df['sin_tbp_lv_symm_2axis_angle']
    df['tbp_lv_symm_2axis*cos_tbp_lv_symm_2axis_angle']=df['tbp_lv_symm_2axis']*df['cos_tbp_lv_symm_2axis_angle']
    df['tbp_lv_symm_2axis/sin_tbp_lv_symm_2axis_angle']=df['tbp_lv_symm_2axis']/df['sin_tbp_lv_symm_2axis_angle']
    df['tbp_lv_symm_2axis/cos_tbp_lv_symm_2axis_angle']=df['tbp_lv_symm_2axis']/df['cos_tbp_lv_symm_2axis_angle']
    
    print("tbp_lv_x, tbp_lv_y, tbp_lv_z feature")
    # x, y, z might be the length, width, and height of a cuboid? Try volume and surface-area features
    df['V_tbp_lv']=abs(df['tbp_lv_x']*df['tbp_lv_y']*df['tbp_lv_z'])
    df['S_tbp_lv']=2*(abs(df['tbp_lv_x']*df['tbp_lv_y'])+abs(df['tbp_lv_x']*df['tbp_lv_z'])+abs(df['tbp_lv_y']*df['tbp_lv_z']))
    
    print("copyright feature")
    cols=['CC-BY','CC-BY-NC','CC-0']
    for col in cols:
        df[f'copyright_license_{col}']=(df['copyright_license']==col).astype(np.int8)
    value_counts={'CC-BY': 188812, 'CC-BY-NC': 183582, 'CC-0': 28665}
    df['copyright_license']=df['copyright_license'].apply(lambda x:value_counts.get(x,28665))
    
    print("attribution feature")
    value_counts={'Memorial Sloan Kettering Cancer Center': 129068,
         'Department of Dermatology, Hospital Clínic de Barcelona': 105724,
         'University Hospital of Basel': 65218,
         'Frazer Institute, The University of Queensland, Dermatology Research Centre': 51768,
         'ACEMID MIA': 28665,
         'ViDIR Group, Department of Dermatology, Medical University of Vienna': 12640,
         'Department of Dermatology, University of Athens, Andreas Syggros Hospital of Skin and Venereal Diseases, Alexander Stratigos, Konstantinos Liopyris': 7976}
    for key,value in value_counts.items():
        df[f"attribution_{key}"]=(df['attribution']==key).astype(np.int8)
    df['attribution']=df['attribution'].apply(lambda x:value_counts.get(x,7976))
    
    print("tbp_lv_location feature")
    value_counts={'Torso Back Top Third': 71112,
     'Torso Front Top Half': 63350,
     'Torso Back Middle Third': 46185,
     'Left Leg - Lower': 27428,
     'Right Leg - Lower': 25208,
     'Torso Front Bottom Half': 24360,
     'Left Leg - Upper': 23673,
     'Right Leg - Upper': 23034,
     'Right Arm - Upper': 22972,
     'Left Arm - Upper': 22816,
     'Head & Neck': 12046,
     'Left Arm - Lower': 11939,
     'Right Arm - Lower': 10636,
     'Unknown': 5756,
     'Torso Back Bottom Third': 4596,
     'Left Leg': 1974,
     'Right Leg': 1711,
     'Left Arm': 1593,
     'Right Arm': 601,
     'Torso Front': 60,
     'Torso Back': 9}
    for key,value in value_counts.items():
        df[f"tbp_lv_location_{key}"]=(df['tbp_lv_location']==key).astype(np.int8)
    # Keys not seen in the training set are assigned the smallest count from the training set
    df['tbp_lv_location']=df['tbp_lv_location'].apply(lambda x:value_counts.get(x,9))
    
    value_counts={'Torso Back': 121902,
     'Torso Front': 87770,
     'Left Leg': 53075,
     'Right Leg': 49953,
     'Left Arm': 36348,
     'Right Arm': 34209,
     'Head & Neck': 12046,
     'Unknown': 5756}    
    for key,value in value_counts.items():
        df[f"tbp_lv_location_simple_{key}"]=(df['tbp_lv_location_simple']==key).astype(np.int8)
    # Keys not seen in the training set are assigned the smallest count from the training set
    df['tbp_lv_location_simple']=df['tbp_lv_location_simple'].apply(lambda x:value_counts.get(x,5756))
    
    print("group feature")
    #tbp_lv_dnn_lesion_confidence
    float_cols=['clin_size_long_diam_mm', 'tbp_lv_A', 'tbp_lv_Aext', 'tbp_lv_B', 'tbp_lv_Bext', 'tbp_lv_C', 'tbp_lv_Cext', 'tbp_lv_H', 'tbp_lv_Hext', 'tbp_lv_L', 'tbp_lv_Lext', 'tbp_lv_areaMM2', 'tbp_lv_area_perim_ratio', 'tbp_lv_color_std_mean', 'tbp_lv_deltaA', 'tbp_lv_deltaB', 'tbp_lv_deltaL', 'tbp_lv_deltaLB', 'tbp_lv_deltaLBnorm', 'tbp_lv_eccentricity', 'tbp_lv_minorAxisMM', 'tbp_lv_nevi_confidence', 'tbp_lv_norm_border', 'tbp_lv_norm_color', 'tbp_lv_perimeterMM', 'tbp_lv_radial_color_std_max', 'tbp_lv_stdL', 'tbp_lv_stdLExt', 'tbp_lv_symm_2axis', 'tbp_lv_x', 'tbp_lv_y', 'tbp_lv_z', 'tbp_lv_A+tbp_lv_Aext', 'tbp_lv_A-tbp_lv_Aext', 'tbp_lv_A*tbp_lv_Aext', 'tbp_lv_A/tbp_lv_Aext', 'tbp_lv_B+tbp_lv_Bext', 'tbp_lv_B-tbp_lv_Bext', 'tbp_lv_B*tbp_lv_Bext', 'tbp_lv_B/tbp_lv_Bext', 'tbp_lv_C+tbp_lv_Cext', 'tbp_lv_C-tbp_lv_Cext', 'tbp_lv_C*tbp_lv_Cext', 'tbp_lv_C/tbp_lv_Cext', 'tbp_lv_H+tbp_lv_Hext', 'tbp_lv_H-tbp_lv_Hext', 'tbp_lv_H*tbp_lv_Hext', 'tbp_lv_H/tbp_lv_Hext', 'tbp_lv_L+tbp_lv_Lext', 'tbp_lv_L-tbp_lv_Lext', 'tbp_lv_L*tbp_lv_Lext', 'tbp_lv_L/tbp_lv_Lext', 'sin_tbp_lv_symm_2axis_angle', 'cos_tbp_lv_symm_2axis_angle', 'tbp_lv_symm_2axis*sin_tbp_lv_symm_2axis_angle', 'tbp_lv_symm_2axis*cos_tbp_lv_symm_2axis_angle', 'tbp_lv_symm_2axis/sin_tbp_lv_symm_2axis_angle', 'tbp_lv_symm_2axis/cos_tbp_lv_symm_2axis_angle', 'V_tbp_lv', 'S_tbp_lv']
    for col in float_cols:
        df[f"mean_patient_id_{col}"]=df.groupby('patient_id')[col].transform('mean')
        df[f"min_patient_id_{col}"]=df.groupby('patient_id')[col].transform('min')
        df[f"max_patient_id_{col}"]=df.groupby('patient_id')[col].transform('max')
        df[f"std_patient_id_{col}"]=df.groupby('patient_id')[col].transform('std')
        df[f"median_patient_id_{col}"]=df.groupby('patient_id')[col].transform('median')
        df[f"skew_patient_id_{col}"]=df.groupby('patient_id')[col].transform('skew')

        df[f"{col}-mean_patient_id_{col}/std_patient_id_{col}"]=(df[col]-df[f"mean_patient_id_{col}"])/(df[f'std_patient_id_{col}']+1e-15)
        df[f"{col}/mean_patient_id_{col}"]=df[col]/(df[f"mean_patient_id_{col}"]+1e-15)
        df[f'ptp_patient_id_{col}']=df[f"max_patient_id_{col}"]-df[f"min_patient_id_{col}"]
        
    # Count feature per patient_id
    tmp=df.groupby('patient_id')['tbp_lv_B'].count().reset_index().rename(columns={"tbp_lv_B":"tbp_lv_B_count"})
    df=df.merge(tmp,on='patient_id',how='left')
    
    print("-"*30)
    return df
train=FE(train)
test=FE(test)

print("group age_approx feature")
float_cols=['clin_size_long_diam_mm', 'tbp_lv_A', 'tbp_lv_Aext', 'tbp_lv_B', 'tbp_lv_Bext', 'tbp_lv_C', 'tbp_lv_Cext', 'tbp_lv_H', 'tbp_lv_Hext', 'tbp_lv_L', 'tbp_lv_Lext', 'tbp_lv_areaMM2', 'tbp_lv_area_perim_ratio', 'tbp_lv_color_std_mean', 'tbp_lv_deltaA', 'tbp_lv_deltaB', 'tbp_lv_deltaL', 'tbp_lv_deltaLB', 'tbp_lv_deltaLBnorm', 'tbp_lv_eccentricity', 'tbp_lv_minorAxisMM', 'tbp_lv_nevi_confidence', 'tbp_lv_norm_border', 'tbp_lv_norm_color', 'tbp_lv_perimeterMM', 'tbp_lv_radial_color_std_max', 'tbp_lv_stdL', 'tbp_lv_stdLExt', 'tbp_lv_symm_2axis', 'tbp_lv_x', 'tbp_lv_y', 'tbp_lv_z', 'tbp_lv_A+tbp_lv_Aext', 'tbp_lv_A-tbp_lv_Aext', 'tbp_lv_A*tbp_lv_Aext', 'tbp_lv_A/tbp_lv_Aext', 'tbp_lv_B+tbp_lv_Bext', 'tbp_lv_B-tbp_lv_Bext', 'tbp_lv_B*tbp_lv_Bext', 'tbp_lv_B/tbp_lv_Bext', 'tbp_lv_C+tbp_lv_Cext', 'tbp_lv_C-tbp_lv_Cext', 'tbp_lv_C*tbp_lv_Cext', 'tbp_lv_C/tbp_lv_Cext', 'tbp_lv_H+tbp_lv_Hext', 'tbp_lv_H-tbp_lv_Hext', 'tbp_lv_H*tbp_lv_Hext', 'tbp_lv_H/tbp_lv_Hext', 'tbp_lv_L+tbp_lv_Lext', 'tbp_lv_L-tbp_lv_Lext', 'tbp_lv_L*tbp_lv_Lext', 'tbp_lv_L/tbp_lv_Lext', 'sin_tbp_lv_symm_2axis_angle', 'cos_tbp_lv_symm_2axis_angle', 'tbp_lv_symm_2axis*sin_tbp_lv_symm_2axis_angle', 'tbp_lv_symm_2axis*cos_tbp_lv_symm_2axis_angle', 'tbp_lv_symm_2axis/sin_tbp_lv_symm_2axis_angle', 'tbp_lv_symm_2axis/cos_tbp_lv_symm_2axis_angle', 'V_tbp_lv', 'S_tbp_lv']
for col in tqdm(float_cols):
    
    tmp=train.groupby(['age_approx'])[col].mean().reset_index().rename(columns={col:f"mean_age_approx_{col}"})
    train=train.merge(tmp,on='age_approx',how='left')
    test=test.merge(tmp,on='age_approx',how='left')
    
    tmp=train.groupby(['age_approx'])[col].min().reset_index().rename(columns={col:f"min_age_approx_{col}"})
    train=train.merge(tmp,on='age_approx',how='left')
    test=test.merge(tmp,on='age_approx',how='left')
    
    tmp=train.groupby(['age_approx'])[col].max().reset_index().rename(columns={col:f"max_age_approx_{col}"})
    train=train.merge(tmp,on='age_approx',how='left')
    test=test.merge(tmp,on='age_approx',how='left')
    
    tmp=train.groupby(['age_approx'])[col].std().reset_index().rename(columns={col:f"std_age_approx_{col}"})
    train=train.merge(tmp,on='age_approx',how='left')
    test=test.merge(tmp,on='age_approx',how='left')
    
    tmp=train.groupby(['age_approx'])[col].skew().reset_index().rename(columns={col:f"skew_age_approx_{col}"})
    train=train.merge(tmp,on='age_approx',how='left')
    test=test.merge(tmp,on='age_approx',how='left')
    
    tmp=train.groupby(['age_approx'])[col].median().reset_index().rename(columns={col:f"median_age_approx_{col}"})
    train=train.merge(tmp,on='age_approx',how='left')
    test=test.merge(tmp,on='age_approx',how='left') 
    
#     train[f"{col}-mean_age_approx_{col}/std_age_approx_{col}"]=(train[col]-train[f"mean_age_approx_{col}"])/(train[f"std_age_approx_{col}"]+1e-15)
#     test[f"{col}-mean_age_approx_{col}/std_age_approx_{col}"]=(test[col]-test[f"mean_age_approx_{col}"])/(test[f"std_age_approx_{col}"]+1e-15)
#     train[f"{col}/mean_age_approx_{col}"]=train[col]/(train[f"mean_age_approx_{col}"]+1e-15)
#     test[f"{col}/mean_age_approx_{col}"]=test[col]/(train[f"mean_age_approx_{col}"]+1e-15)
#     train[f'ptp_age_approx_{col}']=train[f"max_age_approx_{col}"]-train[f"min_age_approx_{col}"]
#     test[f'ptp_age_approx_{col}']=test[f"max_age_approx_{col}"]-test[f"min_age_approx_{col}"]
        
print("shift feature")
train=train.sort_values(['patient_id','age_approx'])
test=test.sort_values(['patient_id','age_approx'])
for gap in [1]:
    for col in tqdm(float_cols):
        train[f"{col}_diff_{gap}"]=train.groupby(['patient_id'])[col].diff(gap)
        train[f"{col}_diff_{gap}"]=train[f"{col}_diff_{gap}"].fillna(0)
        test[f"{col}_diff_{gap}"]=test.groupby(['patient_id'])[col].diff(gap)
        test[f"{col}_diff_{gap}"]=test[f"{col}_diff_{gap}"].fillna(0)
        
        train[f"{col}_groupanatom_site_general_diff_{gap}"]=train.groupby(['patient_id','anatom_site_general'])[col].diff(gap)
        train[f"{col}_groupanatom_site_general_diff_{gap}"]=train[f"{col}_groupanatom_site_general_diff_{gap}"].fillna(0)
        test[f"{col}_groupanatom_site_general_diff_{gap}"]=test.groupby(['patient_id','anatom_site_general'])[col].diff(gap)
        test[f"{col}_groupanatom_site_general_diff_{gap}"]=test[f"{col}_groupanatom_site_general_diff_{gap}"].fillna(0)
        

train.replace([np.inf, -np.inf], np.nan, inplace=True)    
test.replace([np.inf, -np.inf], np.nan, inplace=True)

for col in test.drop(['isic_id','patient_id','combined_anatomical_site'],axis=1).columns:
    skew=train[col].skew()
    if abs(skew)>0.5:
        min_value=train[col].min()
        train[col]=train[col]-min_value
        test[col]=test[col]-min_value
        
        #print(f"col:{col},skew:{skew}")
        train[col]=np.log1p(train[col])
        test[col]=np.log1p(test[col])

train.head()
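
The last loop of the cell above shifts each skewed column so that its training minimum becomes zero and then applies `log1p`. A minimal sketch of that idea on synthetic log-normal data follows. The clipping of test values is an addition made in this sketch, not part of the original cell; it guards against test values below the training minimum, since `log1p` returns NaN for inputs below -1.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
train_col = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))
test_col = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

skew_before = train_col.skew()
if abs(skew_before) > 0.5:
    min_value = train_col.min()  # statistic taken from the training data only
    train_col = np.log1p(train_col - min_value)
    # Clip at 0 so test values below the train minimum do not go negative
    # (an addition in this sketch, not in the original cell).
    test_col = np.log1p((test_col - min_value).clip(lower=0))
skew_after = train_col.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```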

Although the dataset is not large and should fit in memory, `reduce_mem_usage` is still applied to downcast the numeric columns.

In [None]:
# Iterate over all columns of df and downcast dtypes to reduce memory usage
def reduce_mem_usage(df, float16_as32=True):
    # memory_usage() gives the per-column memory usage; sum adds them up; B -> KB -> MB
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:  # iterate over the column names
        col_type = df[col].dtype  # dtype of the column
        if col_type != object and str(col_type)!='category':  # not object or category, i.e. a numeric column
            c_min,c_max = df[col].min(),df[col].max()  # minimum and maximum of the column
            if str(col_type)[:3] == 'int':  # any integer type: int8, int16, int32, or int64
                # If the values fit in the int8 range (-128 to 127), convert to int8
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                # If the values fit in the int16 range (-32,768 to 32,767), convert to int16
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                # If the values fit in the int32 range (-2,147,483,648 to 2,147,483,647), convert to int32
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                # If the values fit in the int64 range (-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807), convert to int64
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:  # floating-point column
                # The values fit in the float16 range; use float32 instead if more precision is needed
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    if float16_as32:  # keep float32 when the data needs higher precision
                        df[col] = df[col].astype(np.float32)
                    else:
                        df[col] = df[col].astype(np.float16)  
                # If the values fit in the float32 range, convert to float32
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                # Otherwise keep float64
                else:
                    df[col] = df[col].astype(np.float64)
    # Memory usage after the conversion
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    # Percentage reduction relative to the starting memory
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df
train=reduce_mem_usage(train, float16_as32=True)
test=reduce_mem_usage(test, float16_as32=True)
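
To see what the downcasting buys, here is a small sketch on a toy frame (the column names `small_int` and `measure` are made up for the illustration). It also prints one value in both float16 and float32 to show why `float16_as32=True` is the cautious choice.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "small_int": np.arange(1000, dtype=np.int64) % 100,
    "measure": np.linspace(0.0, 50.0, 1000),  # float64 by default
})
before = demo.memory_usage(index=False).sum()
demo["small_int"] = demo["small_int"].astype(np.int8)  # values fit in [-128, 127]
demo["measure"] = demo["measure"].astype(np.float32)
after = demo.memory_usage(index=False).sum()
print(f"{before} bytes -> {after} bytes")

# float16 keeps only about 3 significant decimal digits, so the same
# number is stored far less precisely than in float32.
print(np.float16(1234.5678), np.float32(1234.5678))
```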

## Metric

- <a href="https://www.kaggle.com/code/yunsuxiaozi/isic-2024-metric-pauc">isic-2024-metric-pauc</a>

In [None]:
# Using the official evaluation metric: pAUC
def pauc_above_tpr(y_true, y_pred):
    min_tpr = 0.8  # Minimum True Positive Rate (TPR)
    v_gt = abs(np.asarray(y_true) - 1)  # Flip the labels (0 <-> 1) so the TPR constraint becomes an FPR constraint
    v_pred = np.array([1.0 - x for x in y_pred])  # Flip the predictions to match the flipped labels
    max_fpr = abs(1 - min_tpr)  # Calculate maximum False Positive Rate (FPR)
    
    # Calculate the scaled partial AUC
    partial_auc_scaled = roc_auc_score(v_gt, v_pred, max_fpr=max_fpr)
    
    # Calculate the final partial AUC
    partial_auc = 0.5 * max_fpr**2 + (max_fpr - 0.5 * max_fpr**2) / (1.0 - 0.5) * (partial_auc_scaled - 0.5)
    
    return 'pauc', partial_auc, True
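
A self-contained sanity check of the metric, restating the computation as a plain function that returns only the score so the block runs on its own. The toy labels and predictions are made up. A perfect ranking should score approximately 0.2, the full area of the band above 80% TPR, and any other ranking should land between 0 and 0.2.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def pauc_above_tpr(y_true, y_pred, min_tpr=0.8):
    # Flip labels and scores so "TPR >= min_tpr" becomes "FPR <= 1 - min_tpr".
    v_gt = np.abs(np.asarray(y_true) - 1)
    v_pred = 1.0 - np.asarray(y_pred, dtype=float)
    max_fpr = 1.0 - min_tpr
    scaled = roc_auc_score(v_gt, v_pred, max_fpr=max_fpr)
    # Undo scikit-learn's standardization to recover the raw partial area.
    return 0.5 * max_fpr**2 + (max_fpr - 0.5 * max_fpr**2) / 0.5 * (scaled - 0.5)

# Perfect ranking: every positive scores above every negative.
y_true = np.array([0, 0, 0, 0, 1, 1])
perfect = np.array([0.1, 0.2, 0.3, 0.4, 0.8, 0.9])
score_perfect = pauc_above_tpr(y_true, perfect)
print(score_perfect)

# Random scores for comparison.
rng = np.random.default_rng(0)
y_rand = rng.integers(0, 2, size=1000)
score_random = pauc_above_tpr(y_rand, rng.random(1000))
print(score_random)
```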

## Model Training

- We use 'GroupKFold' here to prevent data leakage between patients. The training and test sets contain different patients, and the label distribution varies from patient to patient, so 'StratifiedGroupKFold' is not used.

- Because the proportion of positive samples is so small, 'early_stop' is not used here.

- Without undersampling, the classes are heavily imbalanced; undersampling too aggressively leaves a small total sample size and a poorly performing model. As a compromise, we sample 100000 negative samples and weight the positive samples.
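
The leakage argument can be checked on toy data. The sketch below uses made-up patient IDs and labels (not the competition data) and verifies that no patient appears on both sides of any 'GroupKFold' split:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 12 lesions from 4 patients (3 lesions each).
patient_id = np.repeat(["p1", "p2", "p3", "p4"], 3)
X = np.arange(12).reshape(-1, 1)
y = np.array([0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0])

gkf = GroupKFold(n_splits=4)
overlaps = []
for fold, (train_idx, valid_idx) in enumerate(gkf.split(X, y, groups=patient_id)):
    train_patients = set(patient_id[train_idx])
    valid_patients = set(patient_id[valid_idx])
    # Every patient lands entirely in either the training or the validation part.
    overlaps.append(train_patients & valid_patients)
    print(fold, sorted(valid_patients), "positives in valid:", int(y[valid_idx].sum()))
```

The number of positives per validation fold differs from fold to fold, which is the per-patient label variation mentioned above.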

In [None]:
# Select columns for modeling that are not of object type
choose_cols = [col for col in test.columns if train[col].dtype != object]

# Dropping columns with high correlation
# drop_cols = []
# metric = train[choose_cols].corr().values
# for i in range(len(metric)):
#     for j in range(i + 1, len(metric)):
#         if abs(metric[i][j]) > 0.99:
#             drop_cols += [choose_cols[j]]
#     print(f"i:{i}")
# print(f"drop_cols={drop_cols}")

# Columns to drop, so that only one column from each highly correlated group is kept
drop_cols = ['mean_sex_tbp_lv_dnn_lesion_confidence', 'min_sex_tbp_lv_dnn_lesion_confidence', 
             'median_sex_tbp_lv_dnn_lesion_confidence', 'std_sex_tbp_lv_dnn_lesion_confidence', 
             'skew_sex_tbp_lv_dnn_lesion_confidence', 'mean_tbp_tile_type_tbp_lv_dnn_lesion_confidence', 
             'min_tbp_tile_type_tbp_lv_dnn_lesion_confidence', 'median_tbp_tile_type_tbp_lv_dnn_lesion_confidence', 
             'std_tbp_tile_type_tbp_lv_dnn_lesion_confidence', 'skew_tbp_tile_type_tbp_lv_dnn_lesion_confidence', 
             'tbp_lv_L+tbp_lv_Lext', 'lesion_shape_index', 'tbp_lv_A-tbp_lv_Aext', 'tbp_lv_B-tbp_lv_Bext', 
             'tbp_lv_L-tbp_lv_Lext', 'border_complexity', 'shape_complexity_index', 
             'min_copyright_license_tbp_lv_dnn_lesion_confidence', 'skew_copyright_license_tbp_lv_dnn_lesion_confidence', 
             'copyright_license_CC-0', 'attribution_ACEMID MIA', 'std_dev_contrast', 'shape_complexity_index', 
             'tbp_lv_H+tbp_lv_Hext', 'tbp_lv_H*tbp_lv_Hext', 'min_sex_tbp_lv_dnn_lesion_confidence', 
             'median_sex_tbp_lv_dnn_lesion_confidence', 'std_sex_tbp_lv_dnn_lesion_confidence', 
             'skew_sex_tbp_lv_dnn_lesion_confidence', 'sex_male', 'sex_female', 'median_sex_tbp_lv_dnn_lesion_confidence', 
             'std_sex_tbp_lv_dnn_lesion_confidence', 'skew_sex_tbp_lv_dnn_lesion_confidence', 'sex_male', 'sex_female', 
             'std_sex_tbp_lv_dnn_lesion_confidence', 'skew_sex_tbp_lv_dnn_lesion_confidence', 'sex_male', 'sex_female', 
             'skew_sex_tbp_lv_dnn_lesion_confidence', 'sex_male', 'sex_female', 'sex_male', 'sex_female', 
             'mean_tbp_lv_location_simple_tbp_lv_dnn_lesion_confidence', 'anatom_site_general_upper extremity', 
             'median_tbp_lv_location_simple_tbp_lv_dnn_lesion_confidence', 'std_tbp_lv_location_simple_tbp_lv_dnn_lesion_confidence', 
             'skew_tbp_lv_location_simple_tbp_lv_dnn_lesion_confidence', 'min_tbp_tile_type_tbp_lv_dnn_lesion_confidence', 
             'median_tbp_tile_type_tbp_lv_dnn_lesion_confidence', 'std_tbp_tile_type_tbp_lv_dnn_lesion_confidence', 
             'skew_tbp_tile_type_tbp_lv_dnn_lesion_confidence', 'median_tbp_tile_type_tbp_lv_dnn_lesion_confidence', 
             'std_tbp_tile_type_tbp_lv_dnn_lesion_confidence', 'skew_tbp_tile_type_tbp_lv_dnn_lesion_confidence', 
             'std_tbp_tile_type_tbp_lv_dnn_lesion_confidence', 'skew_tbp_tile_type_tbp_lv_dnn_lesion_confidence', 
             'skew_tbp_tile_type_tbp_lv_dnn_lesion_confidence', 'tbp_lv_location_Unknown', 
             'tbp_lv_location_simple_Unknown', 'attribution_ViDIR Group, Department of Dermatology, Medical University of Vienna', 
             'std_copyright_license_tbp_lv_dnn_lesion_confidence', 'skew_copyright_license_tbp_lv_dnn_lesion_confidence', 
             'copyright_license_CC-0', 'attribution_ACEMID MIA', 'copyright_license_CC-BY', 'copyright_license_CC-0', 
             'attribution_ACEMID MIA', 'tbp_lv_location_simple_Torso Back', 'tbp_lv_location_simple_Torso Front', 
             'tbp_lv_location_Head & Neck', 'tbp_lv_location_simple_Head & Neck', 'tbp_lv_B*tbp_lv_Bext', 'tbp_lv_C*tbp_lv_Cext', 
             'tbp_lv_H*tbp_lv_Hext', 'sin_tbp_lv_symm_2axis_angle-mean_patient_id_sin_tbp_lv_symm_2axis_angle/std_patient_id_sin_tbp_lv_symm_2axis_angle', 
             'cos_tbp_lv_symm_2axis_angle-mean_patient_id_cos_tbp_lv_symm_2axis_angle/std_patient_id_cos_tbp_lv_symm_2axis_angle', 
             'attribution_ACEMID MIA', 'tbp_lv_location_simple_Head & Neck', 'tbp_lv_location_simple_Unknown', 
             'clin_size_long_diam_mm/mean_patient_id_clin_size_long_diam_mm', 'median_patient_id_tbp_lv_A', 
             'median_patient_id_tbp_lv_Aext', 'median_patient_id_tbp_lv_B', 'median_patient_id_tbp_lv_Bext', 
             'median_patient_id_tbp_lv_C', 'median_patient_id_tbp_lv_Cext', 'median_patient_id_tbp_lv_Hext', 
             'median_patient_id_tbp_lv_L', 'mean_patient_id_tbp_lv_L+tbp_lv_Lext', 'median_patient_id_tbp_lv_L+tbp_lv_Lext', 
             'max_patient_id_tbp_lv_L+tbp_lv_Lext', 'std_patient_id_tbp_lv_L+tbp_lv_Lext', 'mean_patient_id_tbp_lv_L+tbp_lv_Lext', 
             'median_patient_id_tbp_lv_L+tbp_lv_Lext', 'median_patient_id_tbp_lv_Lext', 'mean_patient_id_tbp_lv_L+tbp_lv_Lext', 
             'median_patient_id_tbp_lv_L+tbp_lv_Lext', 'std_patient_id_tbp_lv_L+tbp_lv_Lext', 'mean_patient_id_tbp_lv_L+tbp_lv_Lext', 
             'median_patient_id_tbp_lv_L+tbp_lv_Lext', 'tbp_lv_areaMM2/mean_patient_id_tbp_lv_areaMM2', 
             'median_patient_id_tbp_lv_area_perim_ratio', 'mean_patient_id_tbp_lv_norm_color', 
             'ptp_patient_id_tbp_lv_color_std_mean', 'mean_patient_id_tbp_lv_A-tbp_lv_Aext', 'min_patient_id_tbp_lv_A-tbp_lv_Aext', 
             'max_patient_id_tbp_lv_A-tbp_lv_Aext', 'std_patient_id_tbp_lv_A-tbp_lv_Aext', 'median_patient_id_tbp_lv_A-tbp_lv_Aext', 
             'skew_patient_id_tbp_lv_A-tbp_lv_Aext', 'tbp_lv_A-tbp_lv_Aext-mean_patient_id_tbp_lv_A-tbp_lv_Aext/std_patient_id_tbp_lv_A-tbp_lv_Aext', 
             'tbp_lv_A-tbp_lv_Aext/mean_patient_id_tbp_lv_A-tbp_lv_Aext', 'ptp_patient_id_tbp_lv_A-tbp_lv_Aext', 
             'mean_patient_id_tbp_lv_B-tbp_lv_Bext', 'min_patient_id_tbp_lv_B-tbp_lv_Bext', 'max_patient_id_tbp_lv_B-tbp_lv_Bext', 
             'std_patient_id_tbp_lv_B-tbp_lv_Bext', 'median_patient_id_tbp_lv_B-tbp_lv_Bext', 'skew_patient_id_tbp_lv_B-tbp_lv_Bext', 
             'tbp_lv_B-tbp_lv_Bext-mean_patient_id_tbp_lv_B-tbp_lv_Bext/std_patient_id_tbp_lv_B-tbp_lv_Bext', 
             'tbp_lv_B-tbp_lv_Bext/mean_patient_id_tbp_lv_B-tbp_lv_Bext', 'ptp_patient_id_tbp_lv_B-tbp_lv_Bext', 
             'mean_patient_id_tbp_lv_deltaLB', 'mean_patient_id_tbp_lv_L-tbp_lv_Lext', 'min_patient_id_tbp_lv_L-tbp_lv_Lext', 
             'max_patient_id_tbp_lv_L-tbp_lv_Lext', 'std_patient_id_tbp_lv_deltaLB', 'std_patient_id_tbp_lv_L-tbp_lv_Lext', 
             'median_patient_id_tbp_lv_L-tbp_lv_Lext', 'skew_patient_id_tbp_lv_L-tbp_lv_Lext', 'tbp_lv_L-tbp_lv_Lext-mean_patient_id_tbp_lv_L-tbp_lv_Lext/std_patient_id_tbp_lv_L-tbp_lv_Lext', 
             'tbp_lv_L-tbp_lv_Lext/mean_patient_id_tbp_lv_L-tbp_lv_Lext', 'ptp_patient_id_tbp_lv_L-tbp_lv_Lext', 
             'mean_patient_id_tbp_lv_L-tbp_lv_Lext', 'std_patient_id_tbp_lv_L-tbp_lv_Lext', 'tbp_lv_eccentricity/mean_patient_id_tbp_lv_eccentricity', 
             'ptp_patient_id_tbp_lv_minorAxisMM', 'median_patient_id_tbp_lv_norm_border', 'mean_patient_id_tbp_lv_symm_2axis', 
             'tbp_lv_norm_border/mean_patient_id_tbp_lv_norm_border', 'mean_patient_id_tbp_lv_radial_color_std_max', 
             'ptp_patient_id_tbp_lv_perimeterMM', 'ptp_patient_id_tbp_lv_radial_color_std_max', 'ptp_patient_id_tbp_lv_stdL', 
             'ptp_patient_id_tbp_lv_stdLExt', 'median_patient_id_tbp_lv_symm_2axis', 'tbp_lv_symm_2axis/mean_patient_id_tbp_lv_symm_2axis', 
             'median_patient_id_tbp_lv_A+tbp_lv_Aext', 'tbp_lv_A*tbp_lv_Aext-mean_patient_id_tbp_lv_A*tbp_lv_Aext/std_patient_id_tbp_lv_A*tbp_lv_Aext', 
             'tbp_lv_A*tbp_lv_Aext/mean_patient_id_tbp_lv_A*tbp_lv_Aext', 'median_patient_id_tbp_lv_A*tbp_lv_Aext', 
             'median_patient_id_tbp_lv_B+tbp_lv_Bext', 'max_patient_id_tbp_lv_B*tbp_lv_Bext', 'tbp_lv_B*tbp_lv_Bext-mean_patient_id_tbp_lv_B*tbp_lv_Bext/std_patient_id_tbp_lv_B*tbp_lv_Bext', 
             'tbp_lv_B*tbp_lv_Bext/mean_patient_id_tbp_lv_B*tbp_lv_Bext', 'median_patient_id_tbp_lv_B*tbp_lv_Bext', 
             'median_patient_id_tbp_lv_C+tbp_lv_Cext', 'max_patient_id_tbp_lv_C*tbp_lv_Cext', 
             'tbp_lv_C*tbp_lv_Cext-mean_patient_id_tbp_lv_C*tbp_lv_Cext/std_patient_id_tbp_lv_C*tbp_lv_Cext', 
             'tbp_lv_C*tbp_lv_Cext/mean_patient_id_tbp_lv_C*tbp_lv_Cext', 'median_patient_id_tbp_lv_C-tbp_lv_Cext', 
             'median_patient_id_tbp_lv_C*tbp_lv_Cext', 'median_patient_id_tbp_lv_C/tbp_lv_Cext', 'median_patient_id_tbp_lv_H+tbp_lv_Hext', 
             'mean_patient_id_tbp_lv_H*tbp_lv_Hext', 'median_patient_id_tbp_lv_H*tbp_lv_Hext', 'min_patient_id_tbp_lv_H*tbp_lv_Hext', 
             'mean_patient_id_tbp_lv_H*tbp_lv_Hext', 'median_patient_id_tbp_lv_H*tbp_lv_Hext', 'tbp_lv_H*tbp_lv_Hext-mean_patient_id_tbp_lv_H*tbp_lv_Hext/std_patient_id_tbp_lv_H*tbp_lv_Hext', 
             'tbp_lv_H*tbp_lv_Hext/mean_patient_id_tbp_lv_H*tbp_lv_Hext', 'median_patient_id_tbp_lv_H*tbp_lv_Hext', 
             'median_patient_id_tbp_lv_L+tbp_lv_Lext', 'max_patient_id_tbp_lv_L*tbp_lv_Lext', 'tbp_lv_L*tbp_lv_Lext-mean_patient_id_tbp_lv_L*tbp_lv_Lext/std_patient_id_tbp_lv_L*tbp_lv_Lext', 
             'tbp_lv_L*tbp_lv_Lext/mean_patient_id_tbp_lv_L*tbp_lv_Lext', 'median_patient_id_tbp_lv_L*tbp_lv_Lext', 
             'std_patient_id_tbp_lv_symm_2axis/sin_tbp_lv_symm_2axis_angle', 'ptp_patient_id_tbp_lv_symm_2axis/sin_tbp_lv_symm_2axis_angle', 
             'ptp_patient_id_V_tbp_lv', 'ptp_patient_id_S_tbp_lv', 'median_age_approx_tbp_lv_Bext', 'median_age_approx_tbp_lv_C', 
             'median_age_approx_tbp_lv_Hext', 'skew_age_approx_tbp_lv_L+tbp_lv_Lext', 'median_age_approx_tbp_lv_L+tbp_lv_Lext', 
             'median_age_approx_tbp_lv_minorAxisMM', 'skew_age_approx_tbp_lv_area_perim_ratio', 'median_age_approx_tbp_lv_area_perim_ratio', 
             'mean_age_approx_tbp_lv_nevi_confidence', 'skew_age_approx_tbp_lv_nevi_confidence', 'mean_age_approx_tbp_lv_norm_border', 
             'median_age_approx_tbp_lv_norm_border', 'mean_age_approx_tbp_lv_symm_2axis', 'median_age_approx_tbp_lv_symm_2axis', 
             'std_age_approx_tbp_lv_symm_2axis*sin_tbp_lv_symm_2axis_angle', 'mean_age_approx_tbp_lv_norm_border', 
             'skew_age_approx_tbp_lv_norm_border', 'skew_age_approx_tbp_lv_symm_2axis', 'mean_age_approx_tbp_lv_nevi_confidence', 
             'skew_age_approx_tbp_lv_nevi_confidence', 'mean_age_approx_tbp_lv_norm_border', 'median_age_approx_tbp_lv_norm_border', 
             'median_age_approx_tbp_lv_symm_2axis', 'mean_age_approx_tbp_lv_norm_color', 'median_age_approx_tbp_lv_radial_color_std_max', 
             'median_age_approx_tbp_lv_norm_color', 'median_age_approx_tbp_lv_radial_color_std_max', 'median_age_approx_tbp_lv_deltaA', 
             'mean_age_approx_tbp_lv_A-tbp_lv_Aext', 'median_age_approx_tbp_lv_A-tbp_lv_Aext', 'min_age_approx_tbp_lv_A-tbp_lv_Aext', 
             'max_age_approx_tbp_lv_A-tbp_lv_Aext', 'std_age_approx_tbp_lv_A-tbp_lv_Aext', 'skew_age_approx_tbp_lv_A-tbp_lv_Aext', 
             'mean_age_approx_tbp_lv_A-tbp_lv_Aext', 'median_age_approx_tbp_lv_A-tbp_lv_Aext', 'mean_age_approx_tbp_lv_B-tbp_lv_Bext', 
             'median_age_approx_tbp_lv_B/tbp_lv_Bext', 'min_age_approx_tbp_lv_B-tbp_lv_Bext', 'max_age_approx_tbp_lv_B-tbp_lv_Bext', 
             'std_age_approx_tbp_lv_B-tbp_lv_Bext', 'skew_age_approx_tbp_lv_B-tbp_lv_Bext', 'median_age_approx_tbp_lv_B-tbp_lv_Bext', 
             'median_age_approx_tbp_lv_deltaL', 'mean_age_approx_tbp_lv_L-tbp_lv_Lext', 'median_age_approx_tbp_lv_L-tbp_lv_Lext', 
             'min_age_approx_tbp_lv_L-tbp_lv_Lext', 'max_age_approx_tbp_lv_L-tbp_lv_Lext', 'std_age_approx_tbp_lv_deltaLB', 
             'std_age_approx_tbp_lv_L-tbp_lv_Lext', 'skew_age_approx_tbp_lv_deltaLB', 'skew_age_approx_tbp_lv_L-tbp_lv_Lext', 
             'mean_age_approx_tbp_lv_L-tbp_lv_Lext', 'median_age_approx_tbp_lv_L-tbp_lv_Lext', 'median_age_approx_tbp_lv_deltaLB', 
             'std_age_approx_tbp_lv_L-tbp_lv_Lext', 'skew_age_approx_tbp_lv_deltaLBnorm', 'skew_age_approx_tbp_lv_L-tbp_lv_Lext', 
             'median_age_approx_tbp_lv_deltaLBnorm', 'mean_age_approx_tbp_lv_stdL', 'median_age_approx_tbp_lv_stdL', 
             'mean_age_approx_tbp_lv_stdL', 'median_age_approx_tbp_lv_stdL', 'skew_age_approx_tbp_lv_eccentricity', 
             'median_age_approx_tbp_lv_eccentricity', 'median_age_approx_tbp_lv_eccentricity', 'skew_age_approx_tbp_lv_nevi_confidence', 
             'mean_age_approx_tbp_lv_norm_border', 'skew_age_approx_tbp_lv_norm_border', 'median_age_approx_tbp_lv_norm_border', 
             'mean_age_approx_tbp_lv_symm_2axis', 'skew_age_approx_tbp_lv_symm_2axis', 'median_age_approx_tbp_lv_symm_2axis', 
             'std_age_approx_tbp_lv_symm_2axis*sin_tbp_lv_symm_2axis_angle', 'std_age_approx_tbp_lv_symm_2axis*cos_tbp_lv_symm_2axis_angle', 
             'median_age_approx_sin_tbp_lv_symm_2axis_angle', 'median_age_approx_cos_tbp_lv_symm_2axis_angle', 
             'median_age_approx_tbp_lv_symm_2axis*sin_tbp_lv_symm_2axis_angle', 'median_age_approx_tbp_lv_symm_2axis*cos_tbp_lv_symm_2axis_angle', 
             'mean_age_approx_tbp_lv_symm_2axis/cos_tbp_lv_symm_2axis_angle', 'mean_age_approx_tbp_lv_norm_border', 
             'median_age_approx_tbp_lv_norm_border', 'mean_age_approx_tbp_lv_symm_2axis', 'median_age_approx_tbp_lv_symm_2axis', 
             'skew_age_approx_tbp_lv_norm_border', 'median_age_approx_tbp_lv_norm_border', 'mean_age_approx_tbp_lv_symm_2axis', 
             'skew_age_approx_tbp_lv_symm_2axis', 'median_age_approx_tbp_lv_symm_2axis', 'std_age_approx_tbp_lv_symm_2axis*sin_tbp_lv_symm_2axis_angle', 
             'std_age_approx_tbp_lv_symm_2axis*cos_tbp_lv_symm_2axis_angle', 'mean_age_approx_tbp_lv_symm_2axis', 
             'skew_age_approx_tbp_lv_symm_2axis', 'std_age_approx_tbp_lv_symm_2axis*sin_tbp_lv_symm_2axis_angle', 
             'mean_age_approx_tbp_lv_symm_2axis', 'median_age_approx_tbp_lv_symm_2axis', 'std_age_approx_tbp_lv_symm_2axis*sin_tbp_lv_symm_2axis_angle', 
             'median_age_approx_tbp_lv_norm_color', 'median_age_approx_tbp_lv_radial_color_std_max', 'median_age_approx_tbp_lv_radial_color_std_max', 
             'median_age_approx_tbp_lv_stdL', 'skew_age_approx_tbp_lv_symm_2axis', 'median_age_approx_tbp_lv_symm_2axis', 
             'std_age_approx_tbp_lv_symm_2axis*sin_tbp_lv_symm_2axis_angle', 'std_age_approx_tbp_lv_symm_2axis*cos_tbp_lv_symm_2axis_angle', 
             'std_age_approx_tbp_lv_symm_2axis*sin_tbp_lv_symm_2axis_angle', 'std_age_approx_tbp_lv_symm_2axis*cos_tbp_lv_symm_2axis_angle', 
             'std_age_approx_tbp_lv_symm_2axis*sin_tbp_lv_symm_2axis_angle', 'std_age_approx_tbp_lv_symm_2axis*cos_tbp_lv_symm_2axis_angle', 
             'mean_age_approx_tbp_lv_A*tbp_lv_Aext', 'max_age_approx_tbp_lv_A*tbp_lv_Aext', 'median_age_approx_tbp_lv_A*tbp_lv_Aext', 
             'median_age_approx_tbp_lv_A-tbp_lv_Aext', 'median_age_approx_tbp_lv_A/tbp_lv_Aext', 'mean_age_approx_tbp_lv_B*tbp_lv_Bext', 
             'median_age_approx_tbp_lv_B/tbp_lv_Bext', 'median_age_approx_tbp_lv_C+tbp_lv_Cext', 'mean_age_approx_tbp_lv_C*tbp_lv_Cext', 
             'skew_age_approx_tbp_lv_C*tbp_lv_Cext', 'median_age_approx_tbp_lv_H+tbp_lv_Hext', 'median_age_approx_tbp_lv_H*tbp_lv_Hext', 
             'median_age_approx_tbp_lv_H*tbp_lv_Hext', 'median_age_approx_tbp_lv_H-tbp_lv_Hext', 'skew_age_approx_tbp_lv_H/tbp_lv_Hext', 
             'median_age_approx_tbp_lv_H/tbp_lv_Hext', 'max_age_approx_tbp_lv_L*tbp_lv_Lext', 'median_age_approx_tbp_lv_L-tbp_lv_Lext', 
             'std_age_approx_cos_tbp_lv_symm_2axis_angle', 'median_age_approx_cos_tbp_lv_symm_2axis_angle', 
             'median_age_approx_tbp_lv_symm_2axis*sin_tbp_lv_symm_2axis_angle', 'median_age_approx_tbp_lv_symm_2axis*cos_tbp_lv_symm_2axis_angle', 
             'mean_age_approx_tbp_lv_symm_2axis/cos_tbp_lv_symm_2axis_angle', 'median_age_approx_tbp_lv_symm_2axis*sin_tbp_lv_symm_2axis_angle', 
             'median_age_approx_tbp_lv_symm_2axis*cos_tbp_lv_symm_2axis_angle', 'mean_age_approx_tbp_lv_symm_2axis/cos_tbp_lv_symm_2axis_angle', 
             'std_age_approx_tbp_lv_symm_2axis*cos_tbp_lv_symm_2axis_angle', 'median_age_approx_tbp_lv_symm_2axis*cos_tbp_lv_symm_2axis_angle', 
             'mean_age_approx_tbp_lv_symm_2axis/cos_tbp_lv_symm_2axis_angle', 'mean_age_approx_tbp_lv_symm_2axis/cos_tbp_lv_symm_2axis_angle', 
             'max_age_approx_tbp_lv_symm_2axis/cos_tbp_lv_symm_2axis_angle', 'median_age_approx_V_tbp_lv', 'mean_age_approx_S_tbp_lv', 
             'mean_age_approx_S_tbp_lv', 'tbp_lv_A-tbp_lv_Aext_diff_1', 'tbp_lv_B-tbp_lv_Bext_diff_1', 'tbp_lv_deltaLB_diff_1', 
             'tbp_lv_L-tbp_lv_Lext_diff_1', 'tbp_lv_L-tbp_lv_Lext_diff_1', 'tbp_lv_B*tbp_lv_Bext_diff_1', 'tbp_lv_C*tbp_lv_Cext_diff_1', 
             'tbp_lv_H*tbp_lv_Hext_diff_1']

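drop_set = set(drop_cols)  # The list above contains duplicates; a set removes them and speeds up lookups
choose_cols = [col for col in choose_cols if col not in drop_set]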
print(f"len(choose_cols): {len(choose_cols)}")

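# Rename features to plain aliases (cols_0, cols_1, ...) so that special
# characters in the engineered names cannot trip up the model libraries
cols_name = {col: f"cols_{i}" for i, col in enumerate(choose_cols)}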

def fit_and_predict(train_feats=train, test_feats=test, model=None, num_folds=10, name='lgb'):
    X = train_feats[choose_cols].copy().rename(columns=cols_name)
    y = train_feats['target'].copy()
    patient_id = train_feats['patient_id'].copy()
    oof_pred = np.zeros((len(X)))
    test_X = test_feats[choose_cols].copy().rename(columns=cols_name)
    test_pred_pro = np.zeros((num_folds, len(test_X)))
     
    # K-fold cross-validation
    gkf = GroupKFold(n_splits=num_folds) #,shuffle=True
    for fold, (train_index, valid_index) in enumerate(gkf.split(X, y, patient_id)):
        print(f"name {name}, fold: {fold}")

        X_train, X_valid = X.iloc[train_index].reset_index(drop=True), X.iloc[valid_index].reset_index(drop=True)
        y_train, y_valid = y.iloc[train_index].reset_index(drop=True), y.iloc[valid_index].reset_index(drop=True)
        
        # Keep all positives plus a random subset of negatives
        # (this slice keeps 10,000; the notes above mention 100,000)
        zero_index = np.where(y_train == 0)[0]
        one_index = np.where(y_train == 1)[0]
        np.random.shuffle(zero_index)
        total_index = list(zero_index[:10000]) + list(one_index)
        X_train = X_train.iloc[total_index]
        y_train = y_train.iloc[total_index]
        
        if 'lgb' in name:
            model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)],
                      callbacks=[log_evaluation(100)],
                      eval_metric=pauc_above_tpr
                     )
        if 'xgb' in name:
            model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)],
                      verbose=100,
                     )
        if 'cat' in name:
            model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)],
                      verbose=100
                     )
        
        oof_pred[valid_index] = model.predict_proba(X_valid)[:, 1]
        test_pred_pro[fold] = model.predict_proba(test_X)[:, 1]

    test_pred_pro = test_pred_pro.mean(axis=0)
    print(f"name: {name}, pauc: {pauc_above_tpr(y.values.astype(np.int8), oof_pred)[1]}")
    
    return oof_pred, test_pred_pro
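
The negative undersampling inside `fit_and_predict` relies on the global NumPy random state, so the sampled rows can change between runs. A minimal sketch of a seeded variant follows; `undersample_negatives` and its parameters are hypothetical names for illustration, not part of the notebook:

```python
import numpy as np

def undersample_negatives(y, n_negatives, seed=2024):
    """Return sorted positions of all positives plus a random subset of negatives."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)          # local generator, reproducible
    neg_idx = np.flatnonzero(y == 0)
    pos_idx = np.flatnonzero(y == 1)
    n_keep = min(n_negatives, len(neg_idx))    # never ask for more than exist
    kept_neg = rng.choice(neg_idx, size=n_keep, replace=False)
    return np.sort(np.concatenate([kept_neg, pos_idx]))

# Toy labels: 1,000 negatives and 5 positives.
y_toy = np.array([0] * 1000 + [1] * 5)
idx_a = undersample_negatives(y_toy, n_negatives=100)
idx_b = undersample_negatives(y_toy, n_negatives=100)
print(len(idx_a), int(y_toy[idx_a].sum()))
```

The returned positions could be passed to `X_train.iloc[...]` and `y_train.iloc[...]` in place of `total_index`; varying the seed per fold would keep the folds from all using the same draw pattern.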

### Optuna: Find the Best Parameters

- Use Optuna to find the best lgb_params
- Use Optuna to find the best xgb_params
- Use Optuna to find the best cat_params

### Sample of Optimization

In [None]:
# LightGBM Model - Best Parameters Found by Optuna (Trial 21)
lgb_params = {
    "boosting_type": "gbdt",
    "objective": "binary",
    "metric": "auc",
    "random_state": 2024,
    "n_estimators": 1000,
    "reg_alpha": 0.3459645276017008,
    "reg_lambda": 0.18711879709643683,
    "colsample_bytree": 0.62892422100233,
    "subsample": 0.6208054621175016,
    "learning_rate": 0.007309181549082522,
    "num_leaves": 38,
    "min_child_samples": 53,
    "scale_pos_weight": 2.5,
    "verbose": -1,
    'device': 'gpu', 'gpu_use_dp': True,  # GPU settings; comment this line out when running on CPU.
}

# Fit and predict using the LightGBM model with the best parameters
lgb_oof_pred_pro, lgb_test_pro = fit_and_predict(model=LGBMClassifier(**lgb_params), num_folds=Config.num_folds, name='lgb')
print(f"lgb_test_pro[:10]: {lgb_test_pro[:10]}")

# CatBoost Model - Best Parameters Found by Optuna (Trial 1)
cat_params = {
    'iterations': 999,
    'od_wait': 589,
    'task_type': "GPU",
    'leaf_estimation_method': 'Newton',
    'bootstrap_type': 'Bernoulli',
    'learning_rate': 0.06565361652314616,
    'reg_lambda': 92.76585571631034,
    'subsample': 0.8310010342463381,
    'random_strength': 27.409704119980354,
    'depth': 7,
    'min_data_in_leaf': 24,
    'leaf_estimation_iterations': 13
}

# Fit and predict using the CatBoost model with the best parameters
cat_model = CatBoostClassifier(**cat_params)
cat_oof_pred_pro, cat_test_pro = fit_and_predict(model=cat_model, num_folds=Config.num_folds, name='cat')
print(f"cat_test_pro[:10]: {cat_test_pro[:10]}")

# XGBoost Model - Best Parameters Found by Optuna (Trial 35)
xgb_params = {
    'random_state': 2024,
    'n_estimators': 760,
    'learning_rate': 0.009826644028525231,
    'max_depth': 10,
    'reg_alpha': 0.08277318651348423,
    'reg_lambda': 0.7719612355688399,
    'subsample': 0.9579828266704034,
    'colsample_bytree': 0.6228502853913586,
    'min_child_weight': 3,
    'scale_pos_weight': 2.5,
    'tree_method': 'gpu_hist',  # GPU training; XGBoost >= 2.0 prefers tree_method='hist' with device='cuda'
    'objective': 'binary:logistic',
}

# Fit and predict using the XGBoost model with the best parameters
xgb_model = XGBClassifier(**xgb_params)
xgb_oof_pred_pro, xgb_test_pro = fit_and_predict(model=xgb_model, num_folds=Config.num_folds, name='xgb')
print(f"xgb_test_pro[:10]: {xgb_test_pro[:10]}")

## Best Fusion Weight

In [None]:
# Fusion weight selection (weights are integers out of `steps`)
steps = 100
best_w1 = 1
best_w2 = 1

# Initial predictions based on default weights
init_pros = (best_w1 * lgb_oof_pred_pro + best_w2 * cat_oof_pred_pro + (steps - best_w1 - best_w2) * xgb_oof_pred_pro) / steps
best_pauc = pauc_above_tpr(train[Config.TARGET_NAME].values, init_pros)[1]
print(f"Initial best_pauc: {best_pauc}")

# Skip the weight search and use a near-equal average (33/33/34) as the final prediction
best_w1, best_w2 = 33, 33

# # Uncomment the following code to search for the best weights
# for w1 in range(1, steps - 1):  # Weight for LightGBM
#     for w2 in range(1, steps - w1):  # Weight for CatBoost; leaves w3 >= 1
#         w3 = steps - w1 - w2  # Weight for XGBoost
#         # Current prediction results
#         cur_pros = (w1 * lgb_oof_pred_pro + w2 * cat_oof_pred_pro + w3 * xgb_oof_pred_pro) / steps
#         cur_pauc = pauc_above_tpr(train[Config.TARGET_NAME].values, cur_pros)[1]
#         if cur_pauc > best_pauc:
#             best_w1 = w1
#             best_w2 = w2
#             best_pauc = cur_pauc
#             print(f"Updated best_w1: {best_w1}, best_w2: {best_w2}, best_pauc: {best_pauc}")

# Final predictions based on the selected best weights
best_pros = (best_w1 * lgb_oof_pred_pro + best_w2 * cat_oof_pred_pro + (steps - best_w1 - best_w2) * xgb_oof_pred_pro) / steps
best_pauc = pauc_above_tpr(train[Config.TARGET_NAME].values, best_pros)[1]
print(f"Final best_pauc: {best_pauc}")

# Apply the best weights to the test set predictions
test_pros = (best_w1 * lgb_test_pro + best_w2 * cat_test_pro + (steps - best_w1 - best_w2) * xgb_test_pro) / steps
test['target'] = test_pros
print(test_pros[:10])
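
The weight search can also be written as a small reusable function. The sketch below is an illustration under stated assumptions, not the notebook's code: `search_blend_weights` and `neg_mse` are hypothetical names, the data is synthetic, and the toy score is negative mean squared error rather than pAUC. The inner loop bound keeps the third weight from going negative.

```python
import numpy as np

def search_blend_weights(pred_a, pred_b, pred_c, score_fn, steps=10):
    """Grid-search integer weights (w1, w2, w3) with w1 + w2 + w3 == steps."""
    best_weights, best_score = None, -np.inf
    for w1 in range(steps + 1):
        for w2 in range(steps + 1 - w1):       # keeps w3 non-negative
            w3 = steps - w1 - w2
            blend = (w1 * pred_a + w2 * pred_b + w3 * pred_c) / steps
            score = score_fn(blend)
            if score > best_score:
                best_weights, best_score = (w1, w2, w3), score
    return best_weights, best_score

# Toy check: the target is an equal mix of the first two "models".
rng = np.random.default_rng(0)
a, b, c = rng.random(200), rng.random(200), rng.random(200)
target = 0.5 * a + 0.5 * b

def neg_mse(blend):
    return -np.mean((blend - target) ** 2)     # higher is better, like pAUC

weights, score = search_blend_weights(a, b, c, neg_mse, steps=10)
print(weights, score)
```

In the notebook, `score_fn` would wrap `pauc_above_tpr` on the out-of-fold predictions. Weights tuned this way are fitted to the out-of-fold data, so a simple average may generalize just as well.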

## Submission

In [None]:
submission = test[['isic_id', 'target']]
submission.to_csv("submission.csv", index=False)
submission.head()