# Data Preprocessing & Outlier Removal

### This notebook covers the first step of our project: data preprocessing and outlier removal. This step reduces your dataset to only the metadata relevant to our subsequent analyses, imputes missing values, and removes outliers for each feature contained in your dataset.

#### Below are the Python modules needed to preprocess your data and perform outlier detection

In [1]:
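import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import warnings
import sys
# Make the project's src/ directory importable so the helpers below can be found
sys.path.append('../src/')
from preprocessing import drop_columns, replace_NA, replace_outliers_with_sd, outlier_detection
# Suppress all warnings to keep the notebook output readable
warnings.filterwarnings('ignore')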

## Step 1. Data Ingestion

#### Data must first be ingested into our pipeline before preprocessing can begin. Because the datasets produced by CellProfiler are quite large, these files should be loaded into this notebook from your local computer. In this walkthrough, we preprocess the object cell file from the PAM194 keratinocytes experiment. You may replace this file with any experiment file from your local computer.

#### We'll read in the PAM194 Object Cell file below

In [2]:
# obj_cell = '/Users/apple/Desktop/DATA_590/BRI_Capstone/PAM194_Keratino_CytoPanel_1/pam194ObjCell.csv'
obj_cell = '/home/logan/MSDS/Capstone/data/PAM194_Keratino_CytoPanel_1/pam194ObjCell.csv'

cell_data = pd.read_csv(obj_cell, sep = ',')

#### Printing out the first 5 rows of the file to ensure that it was properly parsed into a dataframe

In [3]:
cell_data.head()

Unnamed: 0,ImageNumber,ObjectNumber,Metadata_Date,Metadata_FileLocation,Metadata_Frame,Metadata_Metadata_Cytokine,Metadata_Metadata_Dose,Metadata_Plate,Metadata_Run,Metadata_Series,...,Texture_Contrast_CorrMito_3_02_256,Texture_Contrast_CorrMito_3_03_256,Texture_Contrast_CorrNileRed_3_00_256,Texture_Contrast_CorrNileRed_3_01_256,Texture_Contrast_CorrNileRed_3_02_256,Texture_Contrast_CorrNileRed_3_03_256,Texture_Contrast_CorrWGA_3_00_256,Texture_Contrast_CorrWGA_3_01_256,Texture_Contrast_CorrWGA_3_02_256,Texture_Contrast_CorrWGA_3_03_256
0,1,1,,,0,IFNg,33,Plate 1,,0,...,809.526316,2213.948905,63.3125,62.456274,43.161184,64.916058,456.894737,558.148289,571.819079,835.682482
1,1,2,,,0,IFNg,33,Plate 1,,0,...,140.515945,297.913007,41.003843,44.32035,41.900836,55.248416,423.985791,488.225973,537.927107,690.420186
2,1,3,,,0,IFNg,33,Plate 1,,0,...,416.107143,637.340426,131.801587,141.045184,123.582589,184.291962,1244.634921,1873.365042,1386.792411,1969.450355
3,1,4,,,0,IFNg,33,Plate 1,,0,...,369.692492,442.252926,58.827859,72.721717,56.393994,68.36736,1357.431873,2371.83225,1386.081081,1289.561769
4,1,5,,,0,IFNg,33,Plate 1,,0,...,422.836423,480.79878,44.629263,90.982022,79.689204,90.452439,645.309131,1582.348315,1484.908397,2007.782927


## Step 2. Dropping Unnecessary Columns

#### After reading the experiment file into a dataframe, the next step is to clean the dataframe so that only the columns needed for dimensionality reduction and statistical analyses remain. Columns that contain metadata on the name of the file the dataframe was derived from (i.e. 'FileName_'), the pathname of the file ('PathName_'), the frame, location, date, and series will be dropped from the dataframe. The function 'drop_columns' handles this part of the data cleaning process below, modifying the dataframe in place.

In [4]:
drop_columns(cell_data)

#### Printing out the first 5 rows to confirm that the columns deemed unnecessary have been removed

In [5]:
cell_data.head()

Unnamed: 0,ImageNumber,ObjectNumber,Metadata_Metadata_Cytokine,Metadata_Metadata_Dose,Metadata_Plate,Metadata_Well,AreaShape_Area,AreaShape_Orientation,Granularity_1_CorrActin,Granularity_1_CorrDNA2,...,Texture_Contrast_CorrMito_3_02_256,Texture_Contrast_CorrMito_3_03_256,Texture_Contrast_CorrNileRed_3_00_256,Texture_Contrast_CorrNileRed_3_01_256,Texture_Contrast_CorrNileRed_3_02_256,Texture_Contrast_CorrNileRed_3_03_256,Texture_Contrast_CorrWGA_3_00_256,Texture_Contrast_CorrWGA_3_01_256,Texture_Contrast_CorrWGA_3_02_256,Texture_Contrast_CorrWGA_3_03_256
0,1,1,IFNg,33,Plate 1,B10,370.0,-60.810939,32.776566,53.650428,...,809.526316,2213.948905,63.3125,62.456274,43.161184,64.916058,456.894737,558.148289,571.819079,835.682482
1,1,2,IFNg,33,Plate 1,B10,3152.0,49.398792,43.897883,73.654626,...,140.515945,297.913007,41.003843,44.32035,41.900836,55.248416,423.985791,488.225973,537.927107,690.420186
2,1,3,IFNg,33,Plate 1,B10,1033.0,25.981245,42.339322,47.96022,...,416.107143,637.340426,131.801587,141.045184,123.582589,184.291962,1244.634921,1873.365042,1386.792411,1969.450355
3,1,4,IFNg,33,Plate 1,B10,1978.0,61.618333,30.14131,30.521827,...,369.692492,442.252926,58.827859,72.721717,56.393994,68.36736,1357.431873,2371.83225,1386.081081,1289.561769
4,1,5,IFNg,33,Plate 1,B10,1090.0,71.432709,39.343563,39.508533,...,422.836423,480.79878,44.629263,90.982022,79.689204,90.452439,645.309131,1582.348315,1484.908397,2007.782927
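
#### For readers who do not have the source of 'drop_columns' at hand, the following is a minimal sketch of how such a helper might work. It is not the project's implementation: the function name 'drop_columns_sketch', the list of name patterns, and the toy column 'PathName_OrigDNA' are assumptions made for illustration, based only on the description in this step.

```python
import pandas as pd

# Hypothetical stand-in for drop_columns from ../src/preprocessing.
# The patterns are assumptions taken from the prose description above.
DROP_PATTERNS = (
    "FileName_", "PathName_", "Metadata_Frame",
    "Metadata_FileLocation", "Metadata_Date", "Metadata_Series",
)

def drop_columns_sketch(df: pd.DataFrame, patterns=DROP_PATTERNS) -> None:
    """Drop, in place, every column whose name contains one of the patterns."""
    to_drop = [col for col in df.columns if any(p in col for p in patterns)]
    df.drop(columns=to_drop, inplace=True)

# Toy dataframe with made-up values, only to show the effect.
demo = pd.DataFrame({
    "ImageNumber": [1, 1],
    "Metadata_Date": [None, None],
    "Metadata_Frame": [0, 0],
    "PathName_OrigDNA": ["/a", "/b"],
    "AreaShape_Area": [370.0, 3152.0],
})
drop_columns_sketch(demo)
print(list(demo.columns))  # ['ImageNumber', 'AreaShape_Area']
```

#### Matching on name patterns rather than exact column names keeps the sketch usable across experiment files whose metadata columns differ slightly.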


## Step 3. Imputing Missing Values

#### The next step is to handle missing values that may occur in the imaging dataset. We impute missing values for each feature using the mean value of that feature.

#### First, let's see how many missing values are within the dataframe

In [6]:
print("The sum of NAs in this dataset is:", cell_data.isna().sum().sum())

The sum of NAs in this dataset is: 276


#### We've defined a function, 'replace_NA', that imputes missing values with the feature mean, as stated above. It performs this imputation on the dataset in place.

#### Once this imputation is performed, we should no longer see any missing values in the dataset

In [7]:
replace_NA(cell_data)
print("The sum of NAs in this dataset is:", cell_data.isna().sum().sum())

The sum of NAs in this dataset is: 0
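
#### As an illustration of mean imputation, here is a minimal sketch of what a helper like 'replace_NA' might do. The name 'replace_na_sketch' is hypothetical, and restricting the imputation to numeric columns is an assumption; the project's own function may behave differently.

```python
import numpy as np
import pandas as pd

def replace_na_sketch(df: pd.DataFrame) -> None:
    """Fill NaNs in numeric columns with that column's mean, in place."""
    # Assumption: only numeric columns are imputed; text metadata is left alone.
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Toy dataframe with made-up values and one missing measurement.
demo = pd.DataFrame({
    "Metadata_Well": ["B10", "B10", "B10"],
    "AreaShape_Area": [370.0, np.nan, 1030.0],
})
replace_na_sketch(demo)
print(demo["AreaShape_Area"].tolist())  # [370.0, 700.0, 1030.0]
```

#### The missing value is replaced by 700.0, the mean of the two observed values, so the column mean is unchanged by the imputation.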


## Step 4. Outlier Detection & Removal

#### The last step we need to perform prior to principal component analysis (PCA) is to detect & remove the outliers in our datasets. This is an important step, as we don't want the outliers to bias our subsequent PCA and statistical analyses.

#### To perform this outlier detection & removal, we use the standard deviation (SD): for each feature, an object image's value is flagged as an outlier if it lies more than 5 SD from that feature's mean. If the dataset contains a small number of features (i.e., 5 or fewer), we remove any object image that has an outlier in any of its features. If the dataset contains more than 5 features, we remove an object image only if it has outliers in more than 20% of the features in the dataset.

#### Below, we have two functions that help us perform outlier detection & removal. Please note that this process does take a while to run

#### The following function call performs outlier detection and removal at 5 SD with a threshold of 20% of features. If you would like to change the threshold, simply change the third argument in the function call below

#### e.g. outlier_detection(data, 5, 0.3)

In [8]:
outliers, clean_data = outlier_detection(cell_data, 5, 0.2) 

In [9]:
# Count the total number of object images in the dataset, the outliers, and the dataset without outliers
print(len(cell_data))
print(len(outliers))
print(len(clean_data))

94370
125
94245
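
#### The SD rule described in this step can be sketched as follows. This is an illustrative approximation, not the code in 'outlier_detection': the name 'outlier_detection_sketch' is hypothetical, and treating every numeric column as a feature is an assumption (the real function may exclude columns such as ImageNumber or ObjectNumber).

```python
import numpy as np
import pandas as pd

def outlier_detection_sketch(df, n_sd=5, threshold=0.2):
    """Split df into (outliers, clean) rows using a per-feature SD rule."""
    # Assumption: every numeric column counts as a feature.
    features = df.select_dtypes(include="number")
    # Flag values lying more than n_sd standard deviations from the column mean.
    z = (features - features.mean()) / features.std()
    flags = z.abs() > n_sd
    if features.shape[1] <= 5:
        # Few features: a single flagged value marks the row as an outlier.
        is_outlier = flags.any(axis=1)
    else:
        # Many features: the row must be flagged in more than `threshold` of them.
        is_outlier = flags.mean(axis=1) > threshold
    return df[is_outlier], df[~is_outlier]

# Toy data: 200 rows, 6 features alternating between 0 and 1.
base = np.tile([0.0, 1.0], 100)
demo = pd.DataFrame({f"feat_{i}": base.copy() for i in range(6)})
demo.iloc[0, :] = 1000.0        # extreme in all 6 features -> removed
demo.loc[1, "feat_0"] = 1000.0  # extreme in 1 of 6 features (about 17%) -> kept

outliers, clean = outlier_detection_sketch(demo, 5, 0.2)
print(len(demo), len(outliers), len(clean))  # 200 1 199
```

#### As in the run above, where 125 + 94245 = 94370, the outlier and clean subsets partition the original rows, so their lengths always add up to the length of the input.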


#### The final step is to write the dataset to a local file that will undergo subsequent PCA and statistical analyses

In [10]:
#clean_data.to_csv('C:/Users/mdbla/Documents/UW_VM_Capstone_2024/HTI/Preprocessed_Data/PAM194_ObjCell_clean.csv')
# clean_data.to_pickle('/home/logan/MSDS/Capstone/data/PAM194_Keratino_CytoPanel_1/cleaned/PAM194_ObjCell_clean.pkl')
clean_data.to_csv('/home/logan/MSDS/Capstone/data/PAM194_Keratino_CytoPanel_1/cleaned/PAM194_ObjCell_clean.csv', index=False)