<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/google-research-datasets/scin/blob/main/scin_demo.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/google-research-datasets/scin/blob/main/scin_demo.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

# SCIN Dataset Demo

This notebook demonstrates how to access and use the SCIN dataset.

The SCIN dataset is stored in a public Google Cloud Storage (GCS) bucket. You can access it using the `gcsfs` library.

## Installation

```python
!pip install gcsfs pandas
```

## Accessing the Data

The dataset consists of several CSV files and an `images/` directory containing the image files. The main metadata is in `scin_cases.csv` and `scin_labels.csv`.

```python
import gcsfs
import pandas as pd

# Define the GCS bucket path
BUCKET_PATH = 'gs://dx-scin-public-data/dataset/'

# Initialize a GCS filesystem client; anonymous access is enough for a public bucket
fs = gcsfs.GCSFileSystem(token='anon')

# List contents of the bucket
print(fs.ls(BUCKET_PATH))

# Read scin_cases.csv
with fs.open(BUCKET_PATH + 'scin_cases.csv') as f:
    cases_df = pd.read_csv(f)
print('scin_cases.csv head:')
print(cases_df.head())

# Read scin_labels.csv
with fs.open(BUCKET_PATH + 'scin_labels.csv') as f:
    labels_df = pd.read_csv(f)
print('\nscin_labels.csv head:')
print(labels_df.head())

# Merge the two dataframes on 'case_id'
merged_df = pd.merge(cases_df, labels_df, on='case_id', how='left')
print('\nMerged DataFrame head:')
print(merged_df.head())

# Accessing images
# The image paths are relative to the 'images/' directory within the bucket.
# You can construct the full GCS path for an image using the 'image_file_name' column from merged_df.
# Example: Get the GCS path for the first image
first_image_file = merged_df['image_file_name'].iloc[0]
first_image_gcs_path = BUCKET_PATH + 'images/' + first_image_file
print(f'\nFirst image GCS path: {first_image_gcs_path}')

# To open an image (e.g., using PIL)
# from PIL import Image
# with fs.open(first_image_gcs_path) as f:
#     img = Image.open(f)
#     img.show() # This will likely not work in a headless environment

# You can also download images locally
# fs.get(first_image_gcs_path, 'local_image.jpg')
# print('Downloaded first image to local_image.jpg')

# Example: Filter data and access specific columns
# Filter for cases with 'Acne' condition
acne_cases = merged_df[merged_df['dermatologist_condition_label'] == 'Acne']
print('\nAcne cases head:')
print(acne_cases[['case_id', 'dermatologist_condition_label', 'estimated_fitzpatrick_skin_type', 'estimated_monk_skin_tone']].head())

# Example: Get image paths for acne cases
acne_image_paths = [BUCKET_PATH + 'images/' + f for f in acne_cases['image_file_name']]
print(f'\nFirst 5 acne image paths: {acne_image_paths[:5]}')

# Example: Analyze distribution of Fitzpatrick skin types
print('\nDistribution of Estimated Fitzpatrick Skin Type:')
print(merged_df['estimated_fitzpatrick_skin_type'].value_counts())

# Example: Analyze distribution of Monk Skin Tones
print('\nDistribution of Estimated Monk Skin Tone:')
print(merged_df['estimated_monk_skin_tone'].value_counts())

# Example: Analyze distribution of dermatologist condition labels
print('\nDistribution of Dermatologist Condition Labels:')
print(merged_df['dermatologist_condition_label'].value_counts())

# Example: Accessing scin_app_questions.csv and scin_label_questions.csv
# These files contain the questions asked to contributors and labelers, respectively.
# They can be useful for understanding the context of the self-reported and labeled data.
with fs.open(BUCKET_PATH + 'scin_app_questions.csv') as f:
    app_questions_df = pd.read_csv(f)
print('\nscin_app_questions.csv head:')
print(app_questions_df.head())

with fs.open(BUCKET_PATH + 'scin_label_questions.csv') as f:
    label_questions_df = pd.read_csv(f)
print('\nscin_label_questions.csv head:')
print(label_questions_df.head())
```
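
Because the merge above uses `how='left'`, any case without a matching row in `scin_labels.csv` is kept, with its label columns left empty. The sketch below shows one way to count such rows using the `indicator` option of `pd.merge`. It runs on small made-up DataFrames rather than the real SCIN files; the toy `case_id` values and the `label` column are hypothetical.

```python
import pandas as pd

# Toy stand-ins for cases_df and labels_df (made-up values, not SCIN data).
toy_cases = pd.DataFrame({'case_id': [1, 2, 3]})
toy_labels = pd.DataFrame({'case_id': [1, 3], 'label': ['a', 'b']})

# indicator=True adds a '_merge' column; for a left join its values are
# 'both' (a label row was found) or 'left_only' (no label row).
toy_merged = pd.merge(
    toy_cases, toy_labels, on='case_id', how='left', indicator=True
)

# Rows that exist in the cases table but have no matching label row.
unlabeled = toy_merged[toy_merged['_merge'] == 'left_only']
print(f'{len(unlabeled)} of {len(toy_merged)} cases have no label row')
print(unlabeled['case_id'].tolist())
```

In the notebook, the same pattern could be applied by adding `indicator=True` to the `cases_df` and `labels_df` merge and inspecting the `_merge` column.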

## Dataset Documentation

For a detailed overview of the dataset schema and fields, refer to the [Dataset Documentation](https://github.com/google-research-datasets/scin/blob/main/dataset_schema.md).
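
As a complement to the schema document, it can help to check which columns a loaded DataFrame actually contains before filtering on them. The helper below is a hypothetical sketch (`summarize_columns` is not part of the SCIN repository) and is shown on made-up data.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, non-null count, and unique count."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'non_null': df.notna().sum(),
        'n_unique': df.nunique(),  # missing values are not counted as a value
    })


# Made-up example data, not taken from the SCIN dataset.
toy_df = pd.DataFrame({
    'case_id': [1, 2, 3],
    'label': ['a', None, 'a'],
})
summary = summarize_columns(toy_df)
print(summary)
```

In the notebook, calling `summarize_columns(merged_df)` would give a quick view of the available fields to compare against the schema document.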

## License

The SCIN Dataset is released under the [SCIN Data Use License](https://github.com/google-research-datasets/scin/blob/main/LICENSE).
