# Data Exploration
Data exploration is an early, essential step in data analysis: examining the dataset to understand its structure, contents, and the relationships among its variables. It helps reveal patterns, detect anomalies, and form hypotheses for further analysis. Key activities in data exploration include:

- **Descriptive Statistics**: Summarizing the main features of the dataset using measures such as mean, median, mode, standard deviation, and variance.
- **Data Visualization**: Creating visual representations of the data, such as histograms, scatter plots, and box plots, to identify trends and patterns.
- **Missing Data Analysis**: Identifying and handling missing values in the dataset.
- **Correlation Analysis**: Examining the relationships between different variables in the dataset.

Exploring the data thoroughly at this stage informs the choices made in the subsequent data processing and modeling steps.
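
As a minimal sketch of three of the activities listed above (descriptive statistics, missing data analysis, and correlation analysis), the example below runs them on a small made-up table. The `toy` DataFrame and its column names are hypothetical and only stand in for a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (values are made up for illustration only).
toy = pd.DataFrame(
    {
        "gene_a": [1.0, 2.0, 3.0, 4.0, 5.0],
        "gene_b": [2.0, 4.0, 6.0, 8.0, 10.0],
        "gene_c": [5.0, np.nan, 1.0, 3.0, 2.0],
    }
)

# Descriptive statistics: count, mean, std, min, quartiles, and max per column.
summary = toy.describe()

# Missing data analysis: number of NaN values in each column.
missing_per_column = toy.isna().sum()

# Correlation analysis: pairwise Pearson correlation between columns.
# Rows with NaN are dropped pair by pair, not for the whole table.
correlations = toy.corr()

print(summary)
print(missing_per_column)
print(correlations)
```

On a table with many thousands of columns, the same calls work, but it is usually more practical to inspect a subset of columns or aggregate the results than to print them in full.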

In [1]:
# Import all necessary libraries
import pandas as pd

In [2]:
# Load the data
data_path = '../data/raw/'
data_train = pd.read_csv(data_path + 'train.csv')
data_test = pd.read_csv(data_path + 'test.csv')
data_train_targets = pd.read_csv(data_path + 'train_targets.csv')

In [7]:
# Explore the data
print("--- Training data shape: ", data_train.shape)
print(data_train.head())

--- Training data shape:  (742, 19921)
  Unnamed: 0      A1BG       A1CF       A2M     A2ML1   A3GALT2     A4GALT  \
0        CL1  1.672481  45.412546  9.377504  0.860362  0.156075   0.358733   
1        CL2  0.545643  15.886006  0.126553  0.731387  0.000000   3.006263   
2        CL3  1.652956   0.464895  0.353668  0.196430  0.000000  11.393572   
3        CL4  0.795200   0.182806  0.534622  0.239157  0.027417  20.203002   
4        CL5  9.983922   0.222700  0.451019  0.152793  0.233698   1.174855   

      A4GNT       AAAS        AACS  ...       ZW10      ZWILCH       ZWINT  \
0  0.013006  90.484463  119.760414  ...  38.069286  118.897181  201.401740   
1  0.015819  43.455131   37.971081  ...  40.892433   72.780020   95.990439   
2  0.408079  86.349518   35.893872  ...  14.024315   33.830939   59.865191   
3  0.342707  74.806003   56.297983  ...  18.122326   56.826586  120.221485   
4  0.068735  53.228255   36.978543  ...  47.234577   63.179324   89.121585   

        ZXDA       ZXDB
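
The `Unnamed: 0` column in the output above suggests that the first column of the CSV has an empty header and holds the cell line identifiers. One possible way to handle this, sketched below on a hypothetical miniature CSV (the text in `csv_text` is made up), is to pass `index_col=0` to `pd.read_csv` so the identifiers become the row index instead of a feature column.

```python
import io

import pandas as pd

# Hypothetical miniature CSV mimicking the layout above:
# the first header field is empty, the first column holds sample IDs.
csv_text = ",A1BG,A1CF\nCL1,1.67,45.41\nCL2,0.55,15.89\n"

# Default read: the ID column is kept as a regular column named "Unnamed: 0".
default_read = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the IDs become the row index and only features remain as columns.
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_read.columns.tolist())
print(indexed_read.index.tolist())
print(indexed_read.shape)
```

Note that reading the real files this way would reduce the reported column count by one, since the identifier column would no longer be counted as a feature.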

In [8]:
print("--- Training targets shape: ", data_train_targets.shape)
print(data_train_targets.head())

--- Training targets shape:  (742, 3)
  sample       AAC                 tissue
0    CL1  0.050705               Prostate
1    CL2  0.163113      Esophagus/Stomach
2    CL3  0.236655  Bladder/Urinary Tract
3    CL4  0.270218  Bladder/Urinary Tract
4    CL5  0.071619              CNS/Brain
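
Both the feature table (`Unnamed: 0`) and the target table (`sample`) appear to carry cell line identifiers such as `CL1`, `CL2`. Before modeling, it can be worth confirming that the two tables list the same samples in the same order, and looking at how `AAC` varies by `tissue`. The sketch below does this on small made-up frames; `features` and `targets` are hypothetical stand-ins for `data_train` and `data_train_targets`.

```python
import pandas as pd

# Hypothetical stand-ins for data_train and data_train_targets (values are made up).
features = pd.DataFrame(
    {"Unnamed: 0": ["CL1", "CL2", "CL3"], "A1BG": [1.67, 0.55, 1.65]}
)
targets = pd.DataFrame(
    {
        "sample": ["CL1", "CL2", "CL3"],
        "AAC": [0.05, 0.16, 0.24],
        "tissue": ["Prostate", "Esophagus/Stomach", "Bladder/Urinary Tract"],
    }
)

# Check that both tables list the same sample IDs in the same order.
same_order = features["Unnamed: 0"].tolist() == targets["sample"].tolist()
print("Sample IDs aligned:", same_order)

# Summarize the target per tissue: number of samples and mean AAC.
aac_by_tissue = targets.groupby("tissue")["AAC"].agg(["count", "mean"])
print(aac_by_tissue)
```

If the identifiers were not aligned, a merge on the identifier columns would be a safer way to pair features with targets than relying on row position.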


In [9]:
print("--- Test data shape: ", data_test.shape)
print(data_test.head())

--- Test data shape:  (304, 19921)
  Unnamed: 0      A1BG       A1CF       A2M     A2ML1   A3GALT2     A4GALT  \
0        CL1  1.446057  75.700445  0.031116  0.471972  0.014860   3.129831   
1        CL2  7.201269   0.354029  0.074407  0.453273  0.108228  11.557571   
2        CL3  6.835308   0.121081  0.000000  0.211975  0.020079  12.372666   
3        CL4  6.384288   0.132942  0.018201  0.207608  0.007283   4.364313   
4        CL5  4.538481   1.354367  0.007254  0.237628  0.021695   3.881963   

      A4GNT       AAAS        AACS  ...       ZW10     ZWILCH      ZWINT  \
0  0.000000  54.031837  130.140472  ...  27.105080  39.252909  27.238822   
1  0.344976  43.575264   45.155525  ...  24.459509  20.147702   4.051783   
2  0.066928  34.581955   37.123022  ...  14.771119  27.387161  21.477355   
3  0.203914  49.908053   36.949086  ...  26.690940  18.067079  27.841600   
4  0.267576  39.319195   55.127152  ...  33.244498  49.214787  61.607545   

       ZXDA       ZXDB        ZXDC     
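
The training and test tables both report 19921 columns, but an equal count does not by itself guarantee that the columns are the same or in the same order. A minimal sketch of such a check follows, using two hypothetical empty frames (`train` and `test`) in place of `data_train` and `data_test`.

```python
import pandas as pd

# Hypothetical stand-ins for data_train and data_test: same columns, different order.
train = pd.DataFrame(columns=["Unnamed: 0", "A1BG", "A1CF", "A2M"])
test = pd.DataFrame(columns=["Unnamed: 0", "A1BG", "A2M", "A1CF"])

# Same set of column names, regardless of order?
same_set = set(train.columns) == set(test.columns)

# Same names in the same order?
same_order = train.columns.equals(test.columns)

# Columns present in train but absent from test.
missing_in_test = train.columns.difference(test.columns)

# Reorder the test columns to match the training layout.
test_aligned = test[train.columns]

print(same_set, same_order, list(missing_in_test))
print(test_aligned.columns.tolist())
```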

In [10]:
# Check for missing (NaN) values
print("--- Training data nan values: ", data_train.isna().sum().sum())
print("--- Training targets nan values: ", data_train_targets.isna().sum().sum())
print("--- Test data nan values: ", data_test.isna().sum().sum())

--- Training data nan values:  0
--- Training targets nan values:  0
--- Test data nan values:  0


## What is AAC (Area Above the Curve)?

**AAC (Area Above the Curve)** is a pharmacology metric that summarizes how strongly a cancer cell line or tumor model responds to a drug. It is the area above the dose-response curve, that is, above the curve of cell viability plotted against drug concentration, so it aggregates the response over the whole range of tested concentrations into a single number.

In drug testing, various concentrations of a drug (like Erlotinib, in this case) are applied to cancer cell lines, and the response is measured. AAC captures the drug's ability to inhibit cell growth:

- **Higher AAC values**: A stronger response. The drug inhibits cell growth more effectively across the tested concentrations, typically already at lower doses (a more sensitive cell line).
- **Lower AAC values**: A weaker response. Higher drug concentrations are needed to inhibit growth, or little inhibition occurs within the tested range (a more resistant cell line).
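
To make the definition concrete, the sketch below computes a normalized AAC from made-up dose-response curves. It assumes that viability is given as a fraction between 0 and 1, that the curve is integrated over log10 concentration with the trapezoidal rule, and that the area is divided by the width of the concentration range so the result lies between 0 and 1. These conventions are assumptions for illustration; how the `AAC` values in `train_targets.csv` were actually computed is not stated here. The function name `area_above_curve` is hypothetical.

```python
import numpy as np


def area_above_curve(concentrations, viability):
    """Normalized area above a viability curve (illustrative convention only)."""
    log_conc = np.log10(np.asarray(concentrations, dtype=float))
    viab = np.clip(np.asarray(viability, dtype=float), 0.0, 1.0)

    # Trapezoidal rule: area under the viability curve on the log10 dose axis.
    widths = np.diff(log_conc)
    area_under = np.sum((viab[1:] + viab[:-1]) / 2.0 * widths)

    # Normalize by the dose range, then take the complement to get the area above.
    total_width = log_conc[-1] - log_conc[0]
    return 1.0 - area_under / total_width


# Made-up curves over the same four concentrations (arbitrary units).
concentrations = [0.01, 0.1, 1.0, 10.0]
resistant = [1.0, 1.0, 1.0, 1.0]  # viability never drops
sensitive = [1.0, 0.5, 0.1, 0.0]  # viability drops quickly

print("Resistant AAC:", area_above_curve(concentrations, resistant))
print("Sensitive AAC:", area_above_curve(concentrations, sensitive))
```

Under these assumptions the curve that never drops has an AAC of 0, while the curve that drops quickly has a clearly higher AAC, matching the interpretation above.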
