### Data Normalization

#### To compare features accurately, we first need to normalize the data so that all features are on a common scale.

There are two ways we can normalize our data: 
- **Min-Max Scaling:**
    - Transforms each feature to a specific range, usually [0, 1].
    - Sensitive to outliers, because the minimum and maximum define the range.
    - Suitable for algorithms that assume features are on a similar scale, such as neural networks and algorithms using distance-based metrics.
- **Z-Score Normalization:**
    - Centers each feature around 0 with a standard deviation of 1.
    - Less sensitive to outliers than Min-Max Scaling, although the mean and standard deviation are still affected by extreme values.
    - Suitable for algorithms that assume normally distributed data or require features to have zero mean and unit variance, such as linear regression, support vector machines, and k-means clustering.
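
To make the two definitions concrete, here is a minimal sketch on a made-up single-feature array (the values are illustrative and not taken from the dataset). It computes both transforms by hand and checks them against scikit-learn's `MinMaxScaler` and `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature with one outlier (values are illustrative only)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-max: (x - min) / (max - min)
min_max_manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
min_max_sklearn = MinMaxScaler().fit_transform(x)

# Z-score: (x - mean) / std, using the population std (ddof=0) as StandardScaler does
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)
z_sklearn = StandardScaler().fit_transform(x)

print(min_max_sklearn.ravel())
print(z_sklearn.ravel())
```

In the min-max output, the single outlier maps to 1 and the four remaining values are squeezed into a narrow band near 0, which illustrates the outlier sensitivity noted above.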

I am going to run both normalization algorithms to explore them a little further.
- Packages needed:
    - `from sklearn.preprocessing import MinMaxScaler`
    - `from sklearn.preprocessing import StandardScaler`

In [None]:
# Import packages

# Initial Packages
import pandas as pd
from functions import *
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
from collections import Counter

# Categorical to Numeric
from sklearn.preprocessing import LabelEncoder

# SMOTE for imbalanced data
from imblearn.over_sampling import SMOTE
from sklearn.utils import shuffle


# Normalization
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

## Min-Max Scaling Normalization

In [None]:
# Min-Max Scaling 

# Separate the float64 columns from the remaining columns; only the float64 columns are scaled
mm_non_numeric_cols = updated_df.select_dtypes(exclude=['float64']).columns
mm_numeric_cols = updated_df.select_dtypes(include=['float64']).columns
mm_numeric_data = updated_df[mm_numeric_cols]

# Create a MinMaxScaler object
scaler = MinMaxScaler()

# Fit the scaler and transform the data (returns a NumPy array)
min_max_array = scaler.fit_transform(mm_numeric_data)

# Convert back to a dataframe, keeping the original index so that concat aligns rows correctly
min_max_df = pd.DataFrame(min_max_array, columns=mm_numeric_cols, index=updated_df.index)

# Recombine the untouched columns with the scaled columns
norm_min_max_df = pd.concat([updated_df[mm_non_numeric_cols], min_max_df], axis=1)
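
`pd.concat(..., axis=1)` matches rows by index label, not by position. The sketch below uses a small made-up frame (the names `toy_df` and `gene_a` are illustrative) to show what can happen when the source frame does not have a default 0..n-1 index, for example after shuffling, and how passing `index=` when rebuilding the scaled frame avoids it. Whether `updated_df` has such an index is not shown in this notebook, so treat this as a precaution rather than a confirmed problem.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

toy_df = pd.DataFrame(
    {"Class": ["A", "B", "C"], "gene_a": [1.0, 2.0, 3.0]},
    index=[2, 0, 1],  # non-default index, as might result from a shuffle
)

scaled = MinMaxScaler().fit_transform(toy_df[["gene_a"]])

# Without index=, the new frame gets a fresh 0..n-1 index and rows are matched by label
misaligned = pd.concat(
    [toy_df[["Class"]], pd.DataFrame(scaled, columns=["gene_a"])], axis=1
)

# With index=, the scaled rows keep their original labels and line up correctly
aligned = pd.concat(
    [toy_df[["Class"]], pd.DataFrame(scaled, columns=["gene_a"], index=toy_df.index)],
    axis=1,
)

print(misaligned)
print(aligned)
```

In `misaligned`, the sample labelled `A` ends up paired with the scaled value of a different sample, without any error being raised.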


### Check correlation and create subset dataset for training the models

In [None]:
# Grab only the numeric columns
correlation_columns = norm_min_max_df.iloc[:, 2:].select_dtypes(include=['float64', 'int64']).columns

# Compute the correlation of each numeric column with the encoded target variable
correlations = norm_min_max_df[correlation_columns].corrwith(norm_min_max_df['class_encoded'])

# Rank the columns by the absolute value of their correlation, strongest first
important_m_corr = correlations.abs().sort_values(ascending=False)

In [None]:
# Convert Series into a Dataframe
mm_corr_df = pd.DataFrame(important_m_corr)
mm_corr_df.reset_index(inplace=True)
mm_corr_df.columns = ['Gene', 'Correlation']

# Save dataframe to a csv file
subfolder_path = "/Users/kim/Desktop/repos/RNA-Seq_GeneExpression_Model/Datasets"
csv_filename = "m_correlation_results.csv"
save_dataframe_to_csv(mm_corr_df, subfolder_path, csv_filename)

Now that I have my correlation data, I will create subsets of my normalized dataframe that contain only the genes whose correlation with the target meets a given threshold.

Here are the summary statistics for my absolute correlation values:
- Min: 0.000020
- Max: 0.850083

Percentiles:
- 25%: 0.064557
- 50%: 0.147424
- 75%: 0.262105
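
The notebook does not show how these figures were produced. One plausible way, sketched here on a made-up Series (`toy_corr` stands in for `mm_corr_df['Correlation']`), is `describe()` together with `quantile()`:

```python
import pandas as pd

# Made-up absolute correlations; the real values live in mm_corr_df['Correlation']
toy_corr = pd.Series([0.02, 0.10, 0.15, 0.30, 0.85], name="Correlation")

summary = toy_corr.describe()            # count, mean, std, min, quartiles, max
threshold_75 = toy_corr.quantile(0.75)   # 75th percentile, usable as a cutoff

print(summary[["min", "25%", "50%", "75%", "max"]])
print(threshold_75)
```

Reading the cutoff from `quantile(0.75)` rather than typing it by hand keeps the threshold in sync if the correlation table is regenerated.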

The three correlation thresholds I will work with are:
- 0.262105
- 0.50
- 0.75

**All three thresholds are at or above the 75th percentile (0.262105) of the correlation values.**

In [None]:
# Correlation Thresholds
corr_threshold_1 = 0.262105  # 75th percentile
corr_threshold_2 = 0.50
corr_threshold_3 = 0.75

# Return subset dataframes based on correlation threshold
min_max_subset_df_1 = df_corr_subset(norm_min_max_df, mm_corr_df, corr_threshold_1)
min_max_subset_df_2 = df_corr_subset(norm_min_max_df, mm_corr_df, corr_threshold_2)
min_max_subset_df_3 = df_corr_subset(norm_min_max_df, mm_corr_df, corr_threshold_3)

# Save dataframes to a csv file in our dataset subfolder
subfolder_path = "/Users/kim/Desktop/repos/RNA-Seq_GeneExpression_Model/Datasets"
csv_filename_1 = "min-max_threshold_df_0.26.csv"
csv_filename_2 = "min-max_threshold_df_0.50.csv"
csv_filename_3 = "min-max_threshold_df_0.75.csv"

save_dataframe_to_csv(min_max_subset_df_1, subfolder_path, csv_filename_1)
save_dataframe_to_csv(min_max_subset_df_2, subfolder_path, csv_filename_2)
save_dataframe_to_csv(min_max_subset_df_3, subfolder_path, csv_filename_3)

- Subset_df_1 contains 5069 columns 
- Subset_df_2 contains 698 columns 
- Subset_df_3 contains 13 columns
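
`df_corr_subset` comes from the local `functions` module and its implementation is not shown here. As a rough idea of what such a helper might do, the sketch below defines a hypothetical stand-in, `subset_by_correlation`, that keeps the label column plus every gene whose absolute correlation is at or above the threshold. The name, the `keep_cols` parameter, and the `>=` comparison are all assumptions, not the author's code.

```python
import pandas as pd

def subset_by_correlation(df, corr_df, threshold, keep_cols=("Class",)):
    """Hypothetical stand-in for df_corr_subset: keep label columns plus
    every gene whose absolute correlation is at or above the threshold."""
    selected = corr_df.loc[corr_df["Correlation"] >= threshold, "Gene"]
    # Only keep columns that actually exist in df, preserving the ranked order
    gene_cols = [g for g in selected if g in df.columns]
    label_cols = [c for c in keep_cols if c in df.columns]
    return df[label_cols + gene_cols]

# Made-up data for illustration
toy_df = pd.DataFrame(
    {"Class": ["A", "B"], "gene_a": [0.1, 0.9], "gene_b": [0.5, 0.4], "gene_c": [0.2, 0.3]}
)
toy_corr = pd.DataFrame(
    {"Gene": ["gene_a", "gene_b", "gene_c"], "Correlation": [0.80, 0.55, 0.10]}
)

subset = subset_by_correlation(toy_df, toy_corr, 0.50)
print(subset.columns.tolist())
```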

## Z-Score Normalization (Standardization)

In [None]:
# Z-Score Normalization

# Separate the float64 columns from the remaining columns; only the float64 columns are scaled
z_non_numeric_cols = updated_df.select_dtypes(exclude=['float64']).columns
z_numeric_cols = updated_df.select_dtypes(include=['float64']).columns
z_numeric_data = updated_df[z_numeric_cols]

# Create a StandardScaler object
scaler = StandardScaler()

# Fit the scaler and transform the data (returns a NumPy array)
z_normalized_array = scaler.fit_transform(z_numeric_data)

# Convert back to a dataframe, keeping the original index so that concat aligns rows correctly
z_df = pd.DataFrame(z_normalized_array, columns=z_numeric_cols, index=updated_df.index)

# Recombine the untouched columns with the scaled columns
norm_z_df = pd.concat([updated_df[z_non_numeric_cols], z_df], axis=1)
norm_z_df

In [None]:
# Encoder
LE = LabelEncoder()

# Fit the encoder and transform the data
norm_z_df['class_encoded'] = LE.fit_transform(norm_z_df['Class'])

# Print to check 
print(norm_z_df['class_encoded'])
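
`LabelEncoder` assigns integer codes according to the sorted order of the labels. The sketch below uses made-up class names to show that mapping and how to reverse it. If `Class` has more than two categories (not shown in this notebook), the Pearson correlation against the encoded target depends on this arbitrary ordering, which is worth keeping in mind when reading the rankings.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up class labels for illustration
labels = pd.Series(["tumor_b", "tumor_a", "tumor_c", "tumor_a"])

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ is sorted, so each label's code is its position in that array
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
recovered = encoder.inverse_transform(codes)
print(list(recovered))
```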

In [None]:
# Grab only the numeric columns
correlation_columns = norm_z_df.iloc[:, 2:].select_dtypes(include=['float64', 'int64']).columns

# Compute the correlation of each numeric column with the encoded target variable
correlations = norm_z_df[correlation_columns].corrwith(norm_z_df['class_encoded'])

# Rank the columns by the absolute value of their correlation, strongest first
important_z_corr = correlations.abs().sort_values(ascending=False)

# Convert Series into a Dataframe
z_corr_df = pd.DataFrame(important_z_corr)
z_corr_df.reset_index(inplace=True)
z_corr_df.columns = ['Gene', 'Correlation']

# Save dataframe to a csv file
subfolder_path = "/Users/kim/Desktop/repos/RNA-Seq_GeneExpression_Model/Datasets"
csv_filename = "z_correlation_results.csv"
save_dataframe_to_csv(z_corr_df, subfolder_path, csv_filename)

In [None]:
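# Compare the correlation dataframes produced by the two normalization methods

# Outer-merge on gene name; the _merge indicator records which dataframe(s) each gene appears in
merged_corr_df = pd.merge(mm_corr_df, z_corr_df, on='Gene', suffixes=('_mm_corr_df', '_z_corr_df'), how='outer', indicator=True)

# Keep genes that appear in only one of the two dataframes
# (this checks gene membership, not whether the correlation values differ)
differing_rows = merged_corr_df[merged_corr_df['_merge'] != 'both']

# Print the genes found in only one dataframe, then display the full merge
print(differing_rows)
merged_corr_df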
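
The `_merge` indicator only reports whether a gene is present in both tables. To compare the correlation values themselves, one option is a tolerance-based check with `np.isclose`. The sketch below runs on randomly generated toy data, and the helper `abs_corr_table` and all column names are illustrative. Because both scalers apply a linear rescaling to each feature, the Pearson correlations should agree up to floating-point error.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
genes = ["gene_a", "gene_b", "gene_c"]
raw = pd.DataFrame(rng.normal(size=(50, 3)), columns=genes)
target = pd.Series(rng.integers(0, 2, size=50), name="class_encoded")

def abs_corr_table(scaler):
    """Scale the toy features and return |Pearson r| with the target per gene."""
    scaled = pd.DataFrame(scaler.fit_transform(raw), columns=genes)
    corr = scaled.corrwith(target).abs()
    return corr.rename_axis("Gene").reset_index(name="Correlation")

merged = pd.merge(
    abs_corr_table(MinMaxScaler()),
    abs_corr_table(StandardScaler()),
    on="Gene",
    suffixes=("_mm", "_z"),
)

# Flag genes whose correlation values differ beyond floating-point tolerance
merged["differs"] = ~np.isclose(merged["Correlation_mm"], merged["Correlation_z"])
print(merged)
```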

Data Normalization Summary: 

I applied both Min-Max Scaling and Z-Score Normalization to my dataset, then correlated the normalized data with my target variable.

The two algorithms produced different normalized values, as expected given their different calculations.

To assess consistency, I compared the genes most strongly correlated with the target variable under Min-Max Scaling with those identified under Z-Score Normalization. Because both methods apply a linear rescaling to each feature independently, the Pearson correlation of each gene with the target is expected to be the same under both, up to floating-point error.