# Data Preprocessing (Data Scientists)
This notebook simulates a data scientist consuming data from a data pipeline with PySpark.

In this notebook, I will:
  - Conduct EDA on the data produced by the Medallion data pipeline (missing values, distributions, feature relationships, etc.)
  - Deal with multicollinearity
  - Perform feature selection and transformation, standardisation, and dimensionality reduction
  - Deal with dataset imbalance

## Import Libraries

In [0]:
from pyspark.sql.functions import (
    col, when, count, desc, isnan, isnull, lit, length, trim, lower, upper, to_date, concat_ws, regexp_extract
)

from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, IntegerType, DateType, NumericType
)



## 1. EDA: Summary Statistics


In this section, I act as a data scientist (credit risk modeling) pulling data from the Gold Delta layer of the Medallion architecture. I mainly review summary statistics, spot and fix issues (e.g. missing values), and examine the distribution of the features.

In [0]:
df = spark.read.table('gold.medallion_cleaned_lc_data')
df.limit(10).display()

In [0]:
df.summary().display() 

## 2. EDA: Dealing with Missing Values 

I will now handle missing values, which were not dealt with in the Medallion architecture.

### Find Null Value % Per Column 

### Dropping Irrelevant Columns

### Impute Missing Values (Categorical & Numerical)

### Confirm Imputation Works 

### Save Cleaned Data for Modeling 

## 3. EDA: Examining Distributions and Feature Relationships

### Univariate Analysis 

### Bivariate Analysis 

### Multicollinearity Handling 

## 4. Feature Selection & Engineering 

### Standardisation 

### Dimensionality Reduction 

## 5. Handling Dataset Imbalance 