# Exploratory Data Analysis (EDA)

## Table of Contents
1. [Dataset Overview](#dataset-overview)
2. [Handling Missing Values](#handling-missing-values)
3. [Feature Distributions](#feature-distributions)
4. [Possible Biases](#possible-biases)
5. [Correlations](#correlations)




In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os


## Dataset Overview

[Provide a high-level overview of the dataset. This should include the source of the dataset, the number of samples, the number of features, and an example showing the structure of the dataset.]


**Variables**:\
\
    **CLINICAL DATA** (one line per patient):\
        ID = unique identifier per patient\
        CENTER = clinical center\
        BM_BLAST = Bone marrow blasts in % (blasts are abnormal blood cells)\
        WBC = White Blood Cell count in Giga/L\
        ANC = Absolute Neutrophil count in Giga/L\
        MONOCYTES = Monocyte count in Giga/L\
        HB = Hemoglobin in g/dL\
        PLT = Platelet count in Giga/L\
        CYTOGENETICS = A description of the karyotype observed in the blood cells of the patient, as determined by a cytogeneticist. Cytogenetics is the study of chromosomes. The karyotype is obtained from the tumoral blood cells. The notation convention is ISCN (https://en.wikipedia.org/wiki/International_System_for_Human_Cytogenomic_Nomenclature); see also https://en.wikipedia.org/wiki/Cytogenetic_notation. Note that a karyotype can be normal or abnormal. The notation 46,XX denotes a normal karyotype in females (23 pairs of chromosomes, including 2 X chromosomes) and 46,XY in males (23 pairs of chromosomes, including 1 X chromosome and 1 Y chromosome). A common abnormality in cancerous blood cells is, for example, the loss of chromosome 7 (monosomy 7, or -7), which is typically associated with higher-risk disease.\
        \
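The karyotype notation described above can be turned into simple flags. The following is a minimal sketch, not a full ISCN parser: the helper name `karyotype_flags` and the example string `45,xy,-7[20]` are hypothetical, and the rules (a clone is normal only if it is exactly `46,xx` or `46,xy`; monosomy 7 is a `-7` item) are simplifying assumptions.

```python
import re

def karyotype_flags(karyotype):
    """Return coarse flags for a karyotype string (illustrative heuristic only)."""
    # Missing values (NaN or empty strings) yield unknown flags.
    if not isinstance(karyotype, str) or not karyotype.strip():
        return {"is_normal": None, "monosomy_7": None}
    text = karyotype.lower().replace(" ", "")
    # Clones are separated by "/"; a trailing "[n]" gives the number of cells observed.
    clones = [re.sub(r"\[\d+\]$", "", clone) for clone in text.split("/")]
    # Assumption: a karyotype is normal only if every clone is exactly 46,xx or 46,xy.
    is_normal = all(clone in ("46,xx", "46,xy") for clone in clones)
    # Assumption: monosomy 7 appears as a standalone "-7" item within a clone.
    monosomy_7 = any("-7" in clone.split(",") for clone in clones)
    return {"is_normal": is_normal, "monosomy_7": monosomy_7}

# The first two strings appear in the data preview; the third is a made-up example.
examples = ["46,xx", "46,xy,del(20)(q12)[2]/46,xy[18]", "45,xy,-7[20]"]
flags = [karyotype_flags(k) for k in examples]
for k, f in zip(examples, flags):
    print(k, f)
```

Real karyotype strings are far more varied, so a heuristic like this would need checking against the actual CYTOGENETICS column before use.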

    **GENE MOLECULAR DATA** (one line per patient per somatic mutation). Mutations are detected from the sequencing of the blood tumoral cells. We call somatic (= acquired) mutations the mutations that are found in the tumoral cells but not in other cells of the body.

        ID = unique identifier per patient\
        CHR START END = position of the mutation on the human genome\
        REF ALT = reference and alternate (=mutant) nucleotide\
        GENE = the affected gene\
        PROTEIN_CHANGE = the consequence of the mutation on the protein that is expressed by a given gene\
        EFFECT = a broad categorization of the mutation consequences on a given gene\
        VAF = Variant Allele Fraction, which represents the proportion of cells carrying the deleterious mutation\
\
**OUTCOME**:\

    OS_YEARS = Overall survival time in years\
    OS_STATUS = 1 (death), 0 (alive at the last follow-up)\
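
Because the molecular table has one row per patient per mutation while the clinical and outcome tables have one row per patient, the molecular data usually needs to be collapsed before the tables can be joined. The sketch below shows one possible way to do this on toy frames; the values, the gene names, and the aggregate columns `n_mutations` and `max_vaf` are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Toy stand-ins for clinical_df and molecular_df (all values are made up).
clinical_toy = pd.DataFrame({"ID": ["P1", "P2", "P3"], "HB": [7.6, 11.6, 14.2]})
molecular_toy = pd.DataFrame({
    "ID": ["P1", "P1", "P2"],
    "GENE": ["TP53", "ASXL1", "TET2"],
    "VAF": [0.40, 0.10, 0.25],
})

# Collapse to one row per patient: number of mutations and highest VAF.
per_patient = (
    molecular_toy.groupby("ID")
    .agg(n_mutations=("GENE", "size"), max_vaf=("VAF", "max"))
    .reset_index()
)

# A left join keeps patients that have no recorded mutation.
merged = clinical_toy.merge(per_patient, on="ID", how="left")
# Patients absent from the molecular table have zero mutations; max_vaf stays NaN.
merged["n_mutations"] = merged["n_mutations"].fillna(0).astype(int)
print(merged)
```

The left join matters here: an inner join would silently drop patients without any detected mutation.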


In [13]:
import pandas as pd

# Load the data
print(os.getcwd())

# Clinical Data
clinical_df = pd.read_csv("/workspaces/PredictLeukemiaSurvival/ChallengeData/X_train/clinical_train.csv") # clinical_train
clinical_df_eval = pd.read_csv("/workspaces/PredictLeukemiaSurvival/ChallengeData/X_test/clinical_test.csv") # clinical_test

# Molecular Data
molecular_df = pd.read_csv("/workspaces/PredictLeukemiaSurvival/ChallengeData/X_train/molecular_train.csv") # molecular_train
molecular_eval = pd.read_csv("/workspaces/PredictLeukemiaSurvival/ChallengeData/X_test/molecular_test.csv") # molecular_test

# outcome (survival) data 
outcome_df = pd.read_csv("/workspaces/PredictLeukemiaSurvival/ChallengeData/target_train.csv") # target_train
outcome_df_test = pd.read_csv("/workspaces/PredictLeukemiaSurvival/ChallengeData/random_submission_FRacdcw_v9kP4pP.csv") # target_test

# Preview the data
clinical_df.head()

# Number of samples
num_samples_clinical = clinical_df.shape[0]
num_samples_molecular = molecular_df.shape[0]

# Number of features
num_features_clinical = clinical_df.shape[1]
num_features_molecular = molecular_df.shape[1]

# Display these dataset characteristics
print(f"Number of samples in the clinical data: {num_samples_clinical}")
print(f"Number of features in the clinical data: {num_features_clinical}")

# Display these dataset characteristics
print(f"Number of samples in the molecular data: {num_samples_molecular}")
print(f"Number of features in the molecular data: {num_features_molecular}")

# Display the first few rows of the dataframe to show the structure
print("\nExample data (clinical):")
print(clinical_df.head())

print("\nExample data (molecular):")
print(molecular_df.head())

print("\nExample data (outcome_df)")
print(outcome_df.head())



/workspaces/PredictLeukemiaSurvival/1_DatasetCharacteristics
Number of samples in the clinical data: 3323
Number of features in the clinical data: 9
Number of samples in the molecular data: 10935
Number of features in the molecular data: 11

Example data (clinical):
        ID CENTER  BM_BLAST    WBC  ANC  MONOCYTES    HB    PLT  \
0  P132697    MSK      14.0    2.8  0.2        0.7   7.6  119.0   
1  P132698    MSK       1.0    7.4  2.4        0.1  11.6   42.0   
2  P116889    MSK      15.0    3.7  2.1        0.1  14.2   81.0   
3  P132699    MSK       1.0    3.9  1.9        0.1   8.9   77.0   
4  P132700    MSK       6.0  128.0  9.7        0.9  11.1  195.0   

                          CYTOGENETICS  
0      46,xy,del(20)(q12)[2]/46,xy[18]  
1                                46,xx  
2   46,xy,t(3;3)(q25;q27)[8]/46,xy[12]  
3    46,xy,del(3)(q26q27)[15]/46,xy[5]  
4  46,xx,t(3;9)(p13;q22)[10]/46,xx[10]  

Example data (molecular):
        ID CHR        START          END                R

## Handling Missing Values

[Identify any missing values in the dataset, and describe your approach to handle them if there are any. If there are no missing values simply indicate that there are none.]


In [17]:
# Check for missing values
# clinical
missing_values_clinical = clinical_df.isnull().sum()

# molecular
missing_values_molecular = molecular_df.isnull().sum()

# outcome
missing_values_outcome = outcome_df.isnull().sum()

print(f'Missing values in the clinical data: \n{missing_values_clinical}\n')
print(f'Missing values in the molecular data: \n{missing_values_molecular}\n')
print(f'Missing values in the outcome data: \n{missing_values_outcome}\n')

Missing values in the clinical data: 
ID                0
CENTER            0
BM_BLAST        109
WBC             272
ANC             193
MONOCYTES       601
HB              110
PLT             124
CYTOGENETICS    387
dtype: int64

Missing values in the molecular data: 
ID                  0
CHR               114
START             114
END               114
REF               114
ALT               114
GENE                0
PROTEIN_CHANGE     12
EFFECT              0
VAF                89
DEPTH             114
dtype: int64

Missing values in the outcome data: 
ID             0
OS_YEARS     150
OS_STATUS    150
dtype: int64



In [None]:
# Handling missing values
# Example: Replacing NaN values with the mean value of each numeric column
# df.fillna(df.mean(numeric_only=True), inplace=True)

# Your code for handling missing values goes here
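
One possible approach to the missing values reported above, sketched on toy data: drop rows without a survival target, and impute numeric clinical columns with the median while keeping an indicator of what was missing. The toy values and the `*_missing` column names are assumptions for illustration; whether median imputation suits this dataset is a judgment call and has not been validated here.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for clinical_df and outcome_df (all values are made up).
clinical_toy = pd.DataFrame({
    "ID": ["P1", "P2", "P3", "P4"],
    "HB": [7.6, np.nan, 14.2, 8.9],
    "PLT": [119.0, 42.0, np.nan, 77.0],
})
outcome_toy = pd.DataFrame({
    "ID": ["P1", "P2", "P3", "P4"],
    "OS_YEARS": [1.2, np.nan, 4.9, 0.3],
    "OS_STATUS": [1.0, np.nan, 0.0, 1.0],
})

# Rows without a survival target cannot be used for supervised training.
outcome_clean = outcome_toy.dropna(subset=["OS_YEARS", "OS_STATUS"])

numeric_cols = ["HB", "PLT"]
imputed = clinical_toy.copy()
# Record which values were missing before filling them in.
for col in numeric_cols:
    imputed[f"{col}_missing"] = imputed[col].isna().astype(int)
# Fill each numeric column with its own median.
medians = imputed[numeric_cols].median()
imputed[numeric_cols] = imputed[numeric_cols].fillna(medians)
print(imputed)
```

When applying this to the evaluation set, the medians would normally be computed on the training data only and reused, to avoid leaking information from the test set.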


## Feature Distributions

[Plot the distribution of various features and target variables. Comment on the skewness, outliers, or any other observations.]


In [None]:
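# Example: Plotting histograms of all numerical clinical features
# (DataFrame.hist only plots numeric columns, so ID, CENTER and CYTOGENETICS are skipped)
clinical_df.hist(figsize=(12, 12))
plt.show()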
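
Skewness can also be quantified numerically alongside the histograms. The sketch below uses the five WBC and HB values from the data preview above, so the numbers are illustrative only; applying `np.log1p` to a right-skewed count is one common option, assumed here rather than taken from the notebook.

```python
import numpy as np
import pandas as pd

# The five preview rows shown earlier; far too few for real conclusions.
toy = pd.DataFrame({
    "WBC": [2.8, 7.4, 3.7, 3.9, 128.0],
    "HB": [7.6, 11.6, 14.2, 8.9, 11.1],
})

# Sample skewness per column; large positive values indicate a long right tail.
skewness = toy.skew()

# log1p compresses large values and typically reduces right skew.
log_skewness = np.log1p(toy["WBC"]).skew()

print(skewness)
print("WBC skewness after log1p:", log_skewness)
```

On the full `clinical_df`, the same calls would show which columns are candidates for a transformation before modelling.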


## Possible Biases

[Investigate the dataset for any biases that could affect the model’s performance and fairness (e.g., class imbalance, historical biases).]


In [None]:
# Example: Checking for class imbalance in a classification problem
# sns.countplot(x='target_variable', data=df)

# Your code to investigate possible biases goes here
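
For this dataset, two checks seem like natural starting points: the share of censored patients (OS_STATUS = 0) and how patients are distributed across clinical centers. The sketch below runs both on toy data; apart from `MSK`, which appears in the preview, the center name `CENTER_B` and all values are made up.

```python
import pandas as pd

# Toy stand-ins for outcome_df and clinical_df (values are made up).
outcome_toy = pd.DataFrame({
    "ID": ["P1", "P2", "P3", "P4", "P5"],
    "OS_STATUS": [1, 0, 0, 1, 0],
})
clinical_toy = pd.DataFrame({
    "ID": ["P1", "P2", "P3", "P4", "P5"],
    "CENTER": ["MSK", "MSK", "MSK", "CENTER_B", "CENTER_B"],
})

# Share of patients still alive at last follow-up (right-censored observations).
censored_fraction = (outcome_toy["OS_STATUS"] == 0).mean()

# Share of patients contributed by each center.
center_share = clinical_toy["CENTER"].value_counts(normalize=True)

# Observed death rate per center; large gaps may hint at center-specific effects.
merged = clinical_toy.merge(outcome_toy, on="ID", how="inner")
event_rate_by_center = merged.groupby("CENTER")["OS_STATUS"].mean()

print(f"Censored fraction: {censored_fraction:.2f}")
print(center_share)
print(event_rate_by_center)
```

A raw death rate ignores follow-up time, so differences between centers found this way are only a prompt for closer inspection.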


## Correlations

[Explore correlations between features and the target variable, as well as among features themselves.]


In [None]:
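# Example: Plotting a heatmap to show correlations between numerical clinical features
# (numeric_only=True skips string columns such as ID, CENTER and CYTOGENETICS)
correlation_matrix = clinical_df.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True)
plt.show()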
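
Since several blood counts are heavily skewed, a rank-based (Spearman) correlation may be more robust than the default Pearson coefficient. The sketch below uses the five preview rows shown earlier, so the result only illustrates the mechanics; choosing Spearman is a suggestion, not something the notebook prescribes.

```python
import pandas as pd

# The five preview rows shown earlier (numeric columns plus CENTER).
toy = pd.DataFrame({
    "CENTER": ["MSK"] * 5,
    "BM_BLAST": [14.0, 1.0, 15.0, 1.0, 6.0],
    "WBC": [2.8, 7.4, 3.7, 3.9, 128.0],
    "ANC": [0.2, 2.4, 2.1, 1.9, 9.7],
    "HB": [7.6, 11.6, 14.2, 8.9, 11.1],
})

# Keep numeric columns only; string columns cannot be correlated directly.
numeric = toy.select_dtypes(include="number")

# Spearman works on ranks, so a single extreme WBC value has limited influence.
spearman = numeric.corr(method="spearman")
print(spearman.round(2))
```

Correlating features with OS_YEARS directly would ignore censoring, so feature-to-target relationships are better examined with survival-aware methods.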
