## A First Simple Baseline - Submit Mean Probability for Each Label

As a starting point in the competition, we compute the mean of each label we have to predict, using the labels in **train.csv**. Because every label is either 0 or 1, each mean is the fraction of training patients with that label, which we use as a predicted probability.

For every patient in the test set, we simply submit these mean values, without looking at the patient's images.


In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os

We read in the **train.csv** file.

In [2]:
BASE_DIR = '/kaggle/input/rsna-2023-abdominal-trauma-detection/'
patient_labels = pd.read_csv(BASE_DIR + 'train.csv')
patient_labels.head()

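|   | patient_id | bowel_healthy | bowel_injury | extravasation_healthy | extravasation_injury | kidney_healthy | kidney_low | kidney_high | liver_healthy | liver_low | liver_high | spleen_healthy | spleen_low | spleen_high | any_injury |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10004 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 1 | 10005 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 10007 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 10026 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 10051 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |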
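The label columns do not have to be typed by hand: they can be read off the columns printed above. The following is a minimal sketch, assuming the target labels are every column except `patient_id` and `any_injury` (which matches the submission columns used later in this notebook). The names `sample` and `derived_labels` are introduced here for illustration, and the frame is rebuilt from the five rows shown above so the example runs on its own.

```python
import pandas as pd

columns = ['patient_id', 'bowel_healthy', 'bowel_injury',
           'extravasation_healthy', 'extravasation_injury',
           'kidney_healthy', 'kidney_low', 'kidney_high',
           'liver_healthy', 'liver_low', 'liver_high',
           'spleen_healthy', 'spleen_low', 'spleen_high', 'any_injury']

# The five rows printed by patient_labels.head() above.
rows = [
    [10004, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1],
    [10005, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    [10007, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    [10026, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    [10051, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1],
]
sample = pd.DataFrame(rows, columns=columns)

# Assumption: everything except the id and the 'any_injury' column is a target.
derived_labels = sample.columns.drop(['patient_id', 'any_injury']).tolist()
print(derived_labels)
```

On the real data, the same `drop` call would be applied to `patient_labels.columns`.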


We compute the mean values of the labels we are interested in.

In [3]:
# All columns of train.csv except 'patient_id' and 'any_injury'.
# (They could also be taken from patient_labels.columns instead of typed out.)
target_labels = ['bowel_healthy', 'bowel_injury',
                 'extravasation_healthy', 'extravasation_injury',
                 'kidney_healthy', 'kidney_low', 'kidney_high',
                 'liver_healthy', 'liver_low', 'liver_high',
                 'spleen_healthy', 'spleen_low', 'spleen_high']
mean_prob = patient_labels[target_labels].mean()
mean_prob

bowel_healthy            0.979663
bowel_injury             0.020337
extravasation_healthy    0.936447
extravasation_injury     0.063553
kidney_healthy           0.942167
kidney_low               0.036543
kidney_high              0.021290
liver_healthy            0.897998
liver_low                0.082301
liver_high               0.019701
spleen_healthy           0.887512
spleen_low               0.063235
spleen_high              0.049253
dtype: float64
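As a quick sanity check on these numbers, the means within each label group (bowel, extravasation, kidney, liver, spleen) should add up to 1, since the rows shown earlier suggest that each patient has exactly one active label per group. The sketch below re-enters the rounded means printed above so that it is self-contained; `group_sums` is a name introduced here for illustration.

```python
import pandas as pd

# Rounded means copied from the output above.
mean_prob = pd.Series({
    'bowel_healthy': 0.979663, 'bowel_injury': 0.020337,
    'extravasation_healthy': 0.936447, 'extravasation_injury': 0.063553,
    'kidney_healthy': 0.942167, 'kidney_low': 0.036543, 'kidney_high': 0.021290,
    'liver_healthy': 0.897998, 'liver_low': 0.082301, 'liver_high': 0.019701,
    'spleen_healthy': 0.887512, 'spleen_low': 0.063235, 'spleen_high': 0.049253,
})

# Group labels by the part of the name before the first underscore.
group_sums = mean_prob.groupby(lambda name: name.split('_')[0]).sum()
print(group_sums.round(4))
```

Each group sums to 1 up to rounding, so the mean values form a valid probability distribution within every group.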

We get a list of all the patients for whom we need to make predictions.

In [4]:
test_patient_ids = os.listdir(BASE_DIR + 'test_images')
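Note that `os.listdir` returns the folder names as strings and in arbitrary order. If a deterministic order or integer ids are preferred, something like the following could be used. This is only a sketch: a temporary directory stands in for `test_images`, and `sorted_ids` is a name introduced here.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for BASE_DIR + 'test_images': one sub-folder per patient.
    for pid in ['63706', '50046', '48843']:
        os.mkdir(os.path.join(tmp, pid))

    # Convert the folder names to integers and sort them.
    sorted_ids = sorted(int(name) for name in os.listdir(tmp))

print(sorted_ids)
```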

We create a new dataframe with the column names required by the Evaluation section of the competition.

In [5]:
submission = pd.DataFrame(test_patient_ids, columns = ['patient_id'])
for label in target_labels:
    submission[label] = mean_prob[label]
submission

|   | patient_id | bowel_healthy | bowel_injury | extravasation_healthy | extravasation_injury | kidney_healthy | kidney_low | kidney_high | liver_healthy | liver_low | liver_high | spleen_healthy | spleen_low | spleen_high |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63706 | 0.979663 | 0.020337 | 0.936447 | 0.063553 | 0.942167 | 0.036543 | 0.02129 | 0.897998 | 0.082301 | 0.019701 | 0.887512 | 0.063235 | 0.049253 |
| 1 | 50046 | 0.979663 | 0.020337 | 0.936447 | 0.063553 | 0.942167 | 0.036543 | 0.02129 | 0.897998 | 0.082301 | 0.019701 | 0.887512 | 0.063235 | 0.049253 |
| 2 | 48843 | 0.979663 | 0.020337 | 0.936447 | 0.063553 | 0.942167 | 0.036543 | 0.02129 | 0.897998 | 0.082301 | 0.019701 | 0.887512 | 0.063235 | 0.049253 |
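The same table can also be built without an explicit loop, because pandas broadcasts a scalar to every row when it is assigned as a column. A minimal sketch of that alternative, using only two of the thirteen labels to keep it short:

```python
import pandas as pd

# Subset of the means computed above, for brevity.
mean_prob = pd.Series({'bowel_healthy': 0.979663, 'bowel_injury': 0.020337})
test_patient_ids = ['63706', '50046', '48843']

# assign() adds one column per keyword; each scalar is repeated for every row.
submission = (
    pd.DataFrame({'patient_id': test_patient_ids})
    .assign(**mean_prob.to_dict())
)
print(submission)
```

Either version gives every patient an identical row of probabilities, which is exactly what this baseline intends.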


We write our predictions to the submission CSV file.

In [6]:
submission.to_csv('submission.csv', header = True, index = False)
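Before submitting, it can be worth reading the file back to confirm that it has the expected columns and no missing values. The sketch below does this on a small stand-in frame written to a temporary directory; the two-label frame and the name `reloaded` are illustrative only.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the submission dataframe built above.
submission = pd.DataFrame({
    'patient_id': ['63706', '50046', '48843'],
    'bowel_healthy': 0.979663,
    'bowel_injury': 0.020337,
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'submission.csv')
    submission.to_csv(path, header=True, index=False)
    reloaded = pd.read_csv(path)

# The index must not appear as a column, and no value may be missing.
print(list(reloaded.columns))
print(reloaded.isna().any().any())
```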

As expected, this solution is not very good: the submission places 418th out of 476 teams, near the bottom of the table. We're now going to iterate on our submission.