# 00 Data Preprocessing

## Importing Required Libraries
We begin by importing the fundamental Python libraries required for data handling and numerical computations:  

- **pandas** → used for data manipulation and analysis (loading `.csv` files, handling tables).  
- **numpy** → provides array data structures and fast numerical operations.  


In [1]:
import pandas as pd 
import numpy as np

## Importing Visualization Library

To visualize our dataset and model outputs, we use **Matplotlib**.  
Specifically, the `pyplot` submodule provides functions for creating plots such as line charts, scatter plots, and heatmaps.  

This will later help us in visualizing **t-SNE plots, confusion matrices, and ROC curves**.  


In [2]:
import matplotlib.pyplot as plt  # pyplot is matplotlib's plotting submodule

## Loading the Dataset

We load the processed dataset (`datafile.csv`) into a Pandas DataFrame.  
The dataset originates from the **GEO database** (accession **GSE81089**) and contains **gene expression profiles** for non-small cell lung cancer (NSCLC) samples.  
The original `.tsv` file was converted to `.csv` for easier handling in Python.

- The variable `df` will store our dataset in tabular form.  
- Each row corresponds to a sample, and each column represents a gene feature or phenotype label.


In [3]:
df = pd.read_csv("../data/processed/datafile.csv")
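The `.tsv` to `.csv` conversion mentioned above is not part of this notebook. A minimal sketch of how such a conversion might look with pandas is shown below; the tiny in-memory table and its column names are hypothetical stand-ins, not the real GSE81089 file.

```python
import io

import pandas as pd

# Hypothetical stand-in for a tab-separated expression table (not the real GSE81089 file).
tsv_text = "ENSG_A\tENSG_B\tLabels\n1.5\t0.0\tTumor\n2.5\t3.0\tNormal\n"

# sep="\t" tells pandas that the input is tab-delimited.
expr = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Write comma-separated text; index=False avoids adding an extra index column.
csv_buffer = io.StringIO()
expr.to_csv(csv_buffer, index=False)
csv_text = csv_buffer.getvalue()

# Round-trip check: reading the CSV back gives a table of the same shape.
roundtrip = pd.read_csv(io.StringIO(csv_text))
print(roundtrip.shape)
```

With real files, the `io.StringIO` objects would be replaced by file paths.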

## Exploring the Dataset

After loading the dataset, it is important to **inspect its structure** to understand:

- The type of object (`DataFrame`) we are working with.  
- Number of rows and columns.  
- Total number of elements.  
- Column names, data types, and presence of missing values.  
- A preview of the first few samples.

These steps help ensure that the dataset is loaded correctly and ready for preprocessing.


In [4]:
# Check the type of the object
type(df)

# Summary info: column names, non-null counts, data types
df.info()

# Total number of elements in the DataFrame
df.size

# Number of rows and columns
df.shape

# Preview first 5 rows
df.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 218 entries, 0 to 217
Columns: 18986 entries, ENSG00000000003 to Labels
dtypes: float64(18209), int64(776), object(1)
memory usage: 31.6+ MB


| | ENSG00000000003 | ENSG00000000419 | ENSG00000000457 | ENSG00000000460 | ENSG00000000938 | ENSG00000000971 | ENSG00000001036 | ENSG00000001084 | ENSG00000001167 | ENSG00000001460 | ... | ENSG00000272537 | ENSG00000272538 | ENSG00000272539 | ENSG00000272540 | ENSG00000272541 | ENSG00000272542 | ENSG00000272543 | ENSG00000272544 | ENSG00000272545 | Labels |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52.195 | 43.8616 | 14.7101 | 4.81335 | 7.40831 | 112.426 | 43.9196 | 12.1289 | 13.3027 | 6.53823 | ... | 0.0 | 0 | 0 | 1.38909 | 0.312571 | 0.086774 | 0 | 0.0 | 0.0 | Tumor |
| 1 | 37.8891 | 47.0457 | 7.81233 | 5.92073 | 9.83188 | 39.7146 | 60.4056 | 9.20525 | 19.6343 | 2.65372 | ... | 0.0 | 0 | 0 | 1.15011 | 0.050841 | 0.0 | 0 | 0.0 | 0.0 | Tumor |
| 2 | 23.191 | 38.1292 | 12.3117 | 8.21385 | 9.68575 | 25.9596 | 49.0519 | 23.9222 | 20.166 | 9.99002 | ... | 0.0 | 0 | 0 | 1.11998 | 0.551958 | 0.036071 | 0 | 0.0 | 0.0 | Tumor |
| 3 | 25.0324 | 54.303 | 8.41631 | 6.71221 | 10.9263 | 80.2073 | 40.47 | 46.9369 | 20.1807 | 5.55931 | ... | 0.0 | 0 | 0 | 4.3457 | 0.319958 | 0.0 | 0 | 0.0 | 0.086684 | Tumor |
| 4 | 41.9686 | 51.2969 | 8.84999 | 4.79088 | 8.36149 | 38.4429 | 58.1048 | 15.6082 | 23.3442 | 5.53239 | ... | 0.0 | 0 | 0 | 2.61839 | 0.415423 | 0.041381 | 0 | 0.0 | 0.09524 | Tumor |
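A notebook cell displays only the value of its last bare expression, which is why the output above contains the `df.info()` summary and the `df.head()` preview but not the results of `type(df)`, `df.size`, or `df.shape`. A minimal sketch of printing each quantity explicitly, together with a count of missing values, could look like this. The `toy` DataFrame and its values are hypothetical, not the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the expression table (hypothetical values, not the real data).
toy = pd.DataFrame({
    "gene_a": [1.0, np.nan, 3.0],
    "gene_b": [0, 2, 4],
    "Labels": ["Tumor", "Normal", "Tumor"],
})

n_rows, n_cols = toy.shape                # (rows, columns)
n_elements = toy.size                     # rows * columns
n_missing = int(toy.isna().sum().sum())   # total number of NaN cells

print(f"type: {type(toy).__name__}")
print(f"shape: {n_rows} x {n_cols}, elements: {n_elements}")
print(f"missing values: {n_missing}")
print(toy.dtypes.value_counts())          # number of columns per dtype
```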


## Encoding Labels and Preparing Features

Machine learning models require **numerical inputs**. Therefore, we need to:

1. Encode the **phenotype labels** (`Normal` and `Tumor`) into numerical form using `LabelEncoder` from `scikit-learn`, which assigns integer codes to the classes in sorted (alphabetical) order:  
   - `Normal` → 0  
   - `Tumor` → 1  

2. Verify the unique classes and their counts to ensure correct encoding.

3. Separate the dataset into:  
   - `X` → feature matrix (all gene expression columns)  
   - `y` → target labels (encoded NSCLC phenotypes)


In [5]:
from sklearn import preprocessing

# Encode the string labels as integers; classes are sorted alphabetically (Normal -> 0, Tumor -> 1)
label_encoder = preprocessing.LabelEncoder()
df['Labels'] = label_encoder.fit_transform(df['Labels'])

# Count the number of samples in each class
print(df["Labels"].value_counts())

# X holds the gene expression columns (every column except the last); y holds the encoded labels
X = df.iloc[:, :-1]
y = df["Labels"]

Labels
1    199
0     19
Name: count, dtype: int64
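The counts above show a strong class imbalance (199 samples of class 1 versus 19 of class 0), so it is worth confirming which phenotype each integer stands for. A minimal sketch of that check, using a short hypothetical label array rather than the real `Labels` column:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for the "Labels" column.
labels = np.array(["Tumor", "Normal", "Tumor", "Tumor"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so index 0 is "Normal" and index 1 is "Tumor".
mapping = {str(name): int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)

# Samples per encoded class, useful for spotting class imbalance.
counts = np.bincount(encoded)
print(counts)

# inverse_transform recovers the original strings from the integer codes.
decoded = encoder.inverse_transform(encoded)
```

In the notebook itself, the same information should be available from `label_encoder.classes_` after the cell above has run.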


## Data Splitting, Scaling, and Normalization

Before training machine learning models, we need to prepare the data:

1. **Handling Missing Values**:  
   - Fill any missing values in `X` with the **column mean** to avoid errors during training.  
   - This is done before splitting, so that neither the training set nor the test set contains missing values.

2. **Train-Test Split**:  
   - Divide the dataset into training and testing sets using `train_test_split`  
   - `70%` for training and `30%` for testing  
   - `random_state=42` ensures reproducibility

3. **Standardization**:  
   - Features are scaled to have **zero mean** and **unit variance** using `StandardScaler`, fitted on the training set only and then applied to the test set  
   - This helps gradient-based optimizers converge faster and keeps features with large ranges from dominating scale-sensitive models such as SVM.

4. **Optional Full-Dataset Scaling**:  
   - A second `StandardScaler` is fitted on all of `X` to produce `X_scaled`, which keeps large expression values on a comparable scale and improves numerical stability.  
   - Because this scaler sees every sample, `X_scaled` should not be used to evaluate models on the held-out test set.


In [6]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Fill missing values with the column mean before splitting
X = X.fillna(X.mean())

# 70/30 train-test split
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.30, random_state=42)

# Fit the scaler on the training set only, then apply it to the test set
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Standardized copy of the full feature matrix (scaler fitted on all samples)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
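The cell above computes the imputation means from all samples and splits without stratification. With 19 samples in one class and 199 in the other, an unstratified split can leave very few minority-class samples in one of the subsets. One possible alternative, sketched below under stated assumptions, is to stratify the split and learn both the imputation means and the scaling statistics from the training set only. This is an illustration on synthetic data, not the notebook's method; `X_demo`, `y_demo`, and the `prep` pipeline are hypothetical names.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 40 samples, 5 features, 10% minority class (hypothetical data).
X_demo = rng.normal(loc=50.0, scale=10.0, size=(40, 5))
X_demo[3, 2] = np.nan                      # inject one missing value
y_demo = np.array([0] * 4 + [1] * 36)

# stratify=y_demo keeps the class ratio roughly equal in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

# Imputation means and scaling statistics are learned from the training split only.
prep = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
X_tr_prepared = prep.fit_transform(X_tr)
X_te_prepared = prep.transform(X_te)

print(X_tr_prepared.shape, X_te_prepared.shape)
print(np.bincount(y_tr), np.bincount(y_te))
```

Wrapping both steps in one pipeline means the test set is transformed with exactly the statistics learned during training, which avoids leaking test-set information into preprocessing.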

## Train-Test Split and Feature Scaling (50-50 Split)

For experimentation, we also create a **50-50 train-test split** to evaluate model performance on a larger test set.

1. **Train-Test Split**:  
   - `50%` of data is used for training, `50%` for testing  
   - `random_state=42` ensures reproducibility  

2. **Feature Scaling**:  
   - Standardize features using `StandardScaler` to have **zero mean** and **unit variance**  
   - Essential for models like SVM and Logistic Regression, which are sensitive to feature scales.


In [7]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 50/50 split; note that this overwrites the 70/30 split variables from the previous cell
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.50, random_state=42)

# Fit the scaler on the training set only, then apply it to the test set
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
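Because this cell reuses the names `X_train`, `X_test`, `Y_train`, and `Y_test`, the 70-30 split from the previous cell is no longer available afterwards. If both splits are needed side by side, one option is to wrap the split-and-scale steps in a small helper and store the results under separate keys. The sketch below uses synthetic data; `split_and_scale` and `splits` are hypothetical names, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def split_and_scale(X, y, test_size, seed=42):
    """Split the data, then standardize with training-set statistics only (hypothetical helper)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    scaler = StandardScaler().fit(X_tr)
    return scaler.transform(X_tr), scaler.transform(X_te), y_tr, y_te


rng = np.random.default_rng(1)
X_demo = rng.normal(size=(20, 3))          # synthetic features
y_demo = np.array([0, 1] * 10)             # synthetic labels

# Keep each split under its own key instead of overwriting shared variables.
splits = {
    "70-30": split_and_scale(X_demo, y_demo, test_size=0.30),
    "50-50": split_and_scale(X_demo, y_demo, test_size=0.50),
}

for name, (X_tr, X_te, _, _) in splits.items():
    print(name, X_tr.shape, X_te.shape)
```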