# Data Preprocessing for AIDS Clinical Trials

This notebook prepares the AIDS clinical trials dataset for analysis. It covers data loading, basic data cleaning, and outlier handling. The preprocessed data is then saved to a separate file.

---------

## Data Loading and Overview

The raw dataset is loaded into a Pandas DataFrame. Its structure, data types, non-null counts, and summary statistics are then reviewed.

In [79]:
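# Import packages
import pandas as pd
from IPython.display import Markdown, display


def printmd(text):
    """Render a Markdown string in the cell output (assumed to match a helper defined in an earlier cell)."""
    display(Markdown(text))


# Load the dataset
csv_file = '../data/raw/aids_clinical_trials_raw.csv'
aids_trials = pd.read_csv(csv_file)

# Review the dataset
printmd('**Initial Dataset Head:**')
display(aids_trials.head())

# info() prints its report directly and returns None, so display() also shows None
printmd('**Initial Dataset Information:**')
display(aids_trials.info())

printmd('**Initial Dataset Summary Statistics:**')
display(aids_trials.describe())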



**Initial Dataset Head:**

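|   | time | trt | age | wtkg | hemo | homo | drugs | karnof | oprior | z30 | ... | str2 | strat | symptom | treat | offtrt | cd40 | cd420 | cd80 | cd820 | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 948 | 2 | 48 | 89.8128 | 0 | 0 | 0 | 100 | 0 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 422 | 477 | 566 | 324 | 0 |
| 1 | 1002 | 3 | 61 | 49.4424 | 0 | 0 | 0 | 90 | 0 | 1 | ... | 1 | 3 | 0 | 1 | 0 | 162 | 218 | 392 | 564 | 1 |
| 2 | 961 | 3 | 45 | 88.452 | 0 | 1 | 1 | 90 | 0 | 1 | ... | 1 | 3 | 0 | 1 | 1 | 326 | 274 | 2063 | 1893 | 0 |
| 3 | 1166 | 3 | 47 | 85.2768 | 0 | 1 | 0 | 100 | 0 | 1 | ... | 1 | 3 | 0 | 1 | 0 | 287 | 394 | 1590 | 966 | 0 |
| 4 | 1090 | 0 | 43 | 66.6792 | 0 | 1 | 0 | 100 | 0 | 1 | ... | 1 | 3 | 0 | 0 | 0 | 504 | 353 | 870 | 782 | 0 |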


**Initial Dataset Information:**

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2139 entries, 0 to 2138
Data columns (total 24 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   time     2139 non-null   int64  
 1   trt      2139 non-null   int64  
 2   age      2139 non-null   int64  
 3   wtkg     2139 non-null   float64
 4   hemo     2139 non-null   int64  
 5   homo     2139 non-null   int64  
 6   drugs    2139 non-null   int64  
 7   karnof   2139 non-null   int64  
 8   oprior   2139 non-null   int64  
 9   z30      2139 non-null   int64  
 10  zprior   2139 non-null   int64  
 11  preanti  2139 non-null   int64  
 12  race     2139 non-null   int64  
 13  gender   2139 non-null   int64  
 14  str2     2139 non-null   int64  
 15  strat    2139 non-null   int64  
 16  symptom  2139 non-null   int64  
 17  treat    2139 non-null   int64  
 18  offtrt   2139 non-null   int64  
 19  cd40     2139 non-null   int64  
 20  cd420    2139 non-null   int64  
 21  cd80     2139 

None

**Initial Dataset Summary Statistics:**

|   | time | trt | age | wtkg | hemo | homo | drugs | karnof | oprior | z30 | ... | str2 | strat | symptom | treat | offtrt | cd40 | cd420 | cd80 | cd820 | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2139.0 | 2139.0 | 2139.0 | 2139.0 | 2139.0 | 2139.0 | 2139.0 | 2139.0 | 2139.0 | 2139.0 | ... | 2139.0 | 2139.0 | 2139.0 | 2139.0 | 2139.0 | 2139.0 | 2139.0 | 2139.0 | 2139.0 | 2139.0 |
| mean | 879.098177 | 1.520804 | 35.248247 | 75.125311 | 0.084151 | 0.661057 | 0.13137 | 95.44647 | 0.021973 | 0.550257 | ... | 0.585788 | 1.979897 | 0.172978 | 0.751286 | 0.362786 | 350.501169 | 371.307153 | 986.627396 | 935.369799 | 0.243572 |
| std | 292.274324 | 1.12789 | 8.709026 | 13.263164 | 0.27768 | 0.473461 | 0.337883 | 5.900985 | 0.146629 | 0.497584 | ... | 0.492701 | 0.899053 | 0.378317 | 0.432369 | 0.480916 | 118.573863 | 144.634909 | 480.19775 | 444.976051 | 0.429338 |
| min | 14.0 | 0.0 | 12.0 | 31.0 | 0.0 | 0.0 | 0.0 | 70.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 49.0 | 40.0 | 124.0 | 0.0 |
| 25% | 727.0 | 1.0 | 29.0 | 66.6792 | 0.0 | 0.0 | 0.0 | 90.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 263.5 | 269.0 | 654.0 | 631.5 | 0.0 |
| 50% | 997.0 | 2.0 | 34.0 | 74.3904 | 0.0 | 1.0 | 0.0 | 100.0 | 0.0 | 1.0 | ... | 1.0 | 2.0 | 0.0 | 1.0 | 0.0 | 340.0 | 353.0 | 893.0 | 865.0 | 0.0 |
| 75% | 1091.0 | 3.0 | 40.0 | 82.5552 | 0.0 | 1.0 | 0.0 | 100.0 | 0.0 | 1.0 | ... | 1.0 | 3.0 | 0.0 | 1.0 | 1.0 | 423.0 | 460.0 | 1207.0 | 1146.5 | 0.0 |
| max | 1231.0 | 3.0 | 70.0 | 159.93936 | 1.0 | 1.0 | 1.0 | 100.0 | 1.0 | 1.0 | ... | 1.0 | 3.0 | 1.0 | 1.0 | 1.0 | 1199.0 | 1119.0 | 5011.0 | 6035.0 | 1.0 |
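
The review above could be complemented with explicit counts of missing values and duplicate rows, which `info()` and `describe()` do not report directly. The following is a minimal sketch, not part of the original notebook: the helper name `summarize_quality` is hypothetical, and the `demo` frame contains made-up rows rather than trial data.

```python
import pandas as pd


def summarize_quality(df: pd.DataFrame) -> dict:
    """Return basic data-quality counts for a DataFrame (hypothetical helper)."""
    return {
        "n_rows": len(df),
        "n_columns": df.shape[1],
        "missing_per_column": df.isna().sum().to_dict(),
        "n_duplicate_rows": int(df.duplicated().sum()),
    }


# Made-up rows for illustration only; not taken from the trial data
demo = pd.DataFrame({
    "age": [48, 61, 61, None],
    "wtkg": [89.8, 49.4, 49.4, 70.0],
})

quality = summarize_quality(demo)
print(quality)
```

Applied to the loaded dataset, the same call would be `summarize_quality(aids_trials)`.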


-----

## Basic Data Cleaning

The data is subset to the columns relevant to the current and potential future analyses. Columns are renamed for clarity, and categorical variables are mapped to descriptive labels. Duplicate rows and rows with missing values are then dropped.

In [81]:
# Subset relevant columns for current and potential future analysis
relevant_columns = ['age', 'wtkg', 'gender', 'cd40', 'trt', 'time', 'label']
aids_trials = aids_trials[relevant_columns].copy()

# Rename columns for clarity
aids_trials = aids_trials.rename(columns={
    'wtkg': 'weight_kg',
    'cd40': 'baseline_cd4_count',
    'trt': 'treatment_type',
    'label': 'event_status',
    'time': 'time_to_event'
})

# Map categorical variables to descriptive labels
aids_trials['gender'] = aids_trials['gender'].map({0: 'Female', 1: 'Male'})
aids_trials['treatment_type'] = aids_trials['treatment_type'].map({
    0: 'ZDV', 
    1: 'ZDV+ddI', 
    2: 'ZDV+ddC', 
    3: 'ddI'
})

# Drop duplicates
aids_trials = aids_trials.drop_duplicates()

# Drop missing values
aids_trials = aids_trials.dropna()
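
One detail worth checking in this step: `Series.map` with a dictionary returns `NaN` for any value that is not a key in the dictionary, so an unexpected code would be removed silently by the later `dropna()`. Below is a minimal sketch of a guard against that. It is an illustration rather than part of the notebook: `map_codes` is a hypothetical helper, and the `demo` frame holds made-up codes.

```python
import pandas as pd


def map_codes(series: pd.Series, mapping: dict) -> pd.Series:
    """Map codes to labels, raising if any code has no label (hypothetical helper)."""
    unmapped = set(series.dropna().unique().tolist()) - set(mapping)
    if unmapped:
        raise ValueError(f"Unmapped codes in '{series.name}': {sorted(unmapped)}")
    return series.map(mapping)


# Made-up codes for illustration; 9 is deliberately absent from the mapping
demo = pd.DataFrame({"gender": [0, 1, 1, 9]})
gender_labels = {0: "Female", 1: "Male"}

# Plain map() turns the unknown code into NaN without complaint
silent = demo["gender"].map(gender_labels)
print(silent.isna().sum())

# The guarded version surfaces the problem instead
try:
    map_codes(demo["gender"], gender_labels)
    raised = False
except ValueError as err:
    raised = True
    print(err)
```

Comparing the row count before and after `dropna()` is a simpler alternative that would reveal the same problem.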

------

## Outlier Handling

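Outliers in the numerical columns are capped using the interquartile range (IQR) method: values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are clipped to those bounds.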

In [83]:
# Cap outliers in numerical columns 
num_columns = ['age', 'weight_kg', 'baseline_cd4_count', 'time_to_event']

for col in num_columns:
    Q1 = aids_trials[col].quantile(0.25)
    Q3 = aids_trials[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    aids_trials[col] = aids_trials[col].clip(lower=lower_bound, upper=upper_bound)

    print(f"Outliers capped for column '{col}'")

Outliers capped for column 'age'
Outliers capped for column 'weight_kg'
Outliers capped for column 'baseline_cd4_count'
Outliers capped for column 'time_to_event'
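
To make the capping rule concrete, here is a small worked sketch. It assumes a made-up series whose quartiles match those reported for `age` in the summary statistics (Q1 = 29, Q3 = 40); the helper name `iqr_bounds` is hypothetical and not part of the notebook.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up values chosen so that Q1 = 29 and Q3 = 40, as in the age summary
ages = pd.Series([12, 29, 29, 34, 40, 40, 70], name="age")

lower, upper = iqr_bounds(ages)
capped = ages.clip(lower=lower, upper=upper)

print(lower, upper)     # 12.5 56.5
print(capped.tolist())  # [12.5, 29.0, 29.0, 34.0, 40.0, 40.0, 56.5]
```

With IQR = 40 - 29 = 11, the fences are 29 - 16.5 = 12.5 and 40 + 16.5 = 56.5. Clipping an integer column to non-integer fences yields floats, which is consistent with `age` being stored as `float64` after this step.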


--------

## Final Dataset Review

After preprocessing, the dataset is ready for analysis. The cleaned dataset is saved for future use.

In [85]:
# Review the dataset
printmd('**Final Dataset Head:**')
display(aids_trials.head())

printmd('**Final Dataset Information:**')
display(aids_trials.info())

printmd('**Final Dataset Summary Statistics:**')
display(aids_trials.describe())

# Save the cleaned dataset
cleaned_file_path = '../data/processed/aids_clinical_trials_cleaned.csv'
aids_trials.to_csv(cleaned_file_path, index=False)

**Final Dataset Head:**

|   | age | weight_kg | gender | baseline_cd4_count | treatment_type | time_to_event | event_status |
|---|---|---|---|---|---|---|---|
| 0 | 48.0 | 89.8128 | Female | 422.0 | ZDV+ddC | 948 | 0 |
| 1 | 56.5 | 49.4424 | Female | 162.0 | ddI | 1002 | 1 |
| 2 | 45.0 | 88.452 | Male | 326.0 | ddI | 961 | 0 |
| 3 | 47.0 | 85.2768 | Male | 287.0 | ddI | 1166 | 0 |
| 4 | 43.0 | 66.6792 | Male | 504.0 | ZDV | 1090 | 0 |


**Final Dataset Information:**

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2139 entries, 0 to 2138
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   age                 2139 non-null   float64
 1   weight_kg           2139 non-null   float64
 2   gender              2139 non-null   object 
 3   baseline_cd4_count  2139 non-null   float64
 4   treatment_type      2139 non-null   object 
 5   time_to_event       2139 non-null   int64  
 6   event_status        2139 non-null   int64  
dtypes: float64(3), int64(2), object(2)
memory usage: 117.1+ KB


None

**Final Dataset Summary Statistics:**

|   | age | weight_kg | baseline_cd4_count | time_to_event | event_status |
|---|---|---|---|---|---|
| count | 2139.0 | 2139.0 | 2139.0 | 2139.0 | 2139.0 |
| mean | 35.131837 | 74.905023 | 349.38885 | 880.30201 | 0.243572 |
| std | 8.360611 | 12.441725 | 114.323661 | 289.209229 | 0.429338 |
| min | 12.5 | 42.8652 | 24.25 | 181.0 | 0.0 |
| 25% | 29.0 | 66.6792 | 263.5 | 727.0 | 0.0 |
| 50% | 34.0 | 74.3904 | 340.0 | 997.0 | 0.0 |
| 75% | 40.0 | 82.5552 | 423.0 | 1091.0 | 0.0 |
| max | 56.5 | 106.3692 | 662.25 | 1231.0 | 1.0 |
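
After saving, a quick round-trip check can confirm that the file reloads with the same columns and shape, and that `index=False` kept the row index out of the CSV. The sketch below is an illustration under stated assumptions: it writes made-up rows to a temporary directory instead of the notebook's output path, and the file name `demo_cleaned.csv` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up rows in the cleaned layout; not taken from the trial data
demo = pd.DataFrame({
    "age": [48.0, 56.5],
    "gender": ["Female", "Male"],
    "treatment_type": ["ZDV+ddC", "ddI"],
    "time_to_event": [948, 1002],
    "event_status": [0, 1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Temporary path used here in place of the notebook's output path
    out_path = Path(tmp_dir) / "demo_cleaned.csv"
    demo.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Same columns and shape means no extra index column was written
same_columns = list(reloaded.columns) == list(demo.columns)
same_shape = reloaded.shape == demo.shape
print(same_columns, same_shape)
```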
