## 1. Data Processing & Cleaning
- Load and process the CSV file using PySpark
- Handle missing values
- Convert data types where needed (especially 'Sleep Duration' to numeric)
- Remove any inconsistent values
- Output the data quality metrics (nulls, value counts, basic statistics)

##### 1.a Load and process the CSV file using PySpark

In [None]:
from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.getOrCreate()
df = sparkSession.read.csv('./data/Student Depression Dataset.csv', header=True, inferSchema=True)
df.show()

##### 1.b Handle missing values

First, I want to display the rows that contain missing values, and then remove them from the DataFrame.

In [None]:
from pyspark.sql import functions as F

filter_expr = F.exists(F.array(*df.columns), lambda x: x.isNull())
df.filter(filter_expr).show()

In [29]:
# Remove rows where 'Financial Stress' is null
df = df.na.drop(subset=['Financial Stress'])

# Alternatively, the nulls could be filled with 0 instead
# df = df.na.fill(value=0, subset=['Financial Stress'])

##### 1.c Convert data types where needed (especially 'Sleep Duration' to numeric)

First, I want to print the data type of each column to check that everything looks OK. Then I want to convert the columns that don't have an appropriate type.

In [None]:
df.printSchema()

First, I want to group by the 'Sleep Duration' column to check the data we are working with.

In [None]:
df.groupBy('Sleep Duration').count().show()

In [None]:
from pyspark.sql.functions import when, col

# Drop rows where the value is 'Others'
df = df.filter(col("Sleep Duration") != "Others")

# Map each range label to a representative number of hours.
# Labels not listed here become null because there is no otherwise().
df = df.withColumn(
    "Sleep Duration",
    when(col("Sleep Duration") == "More than 8 hours", 9)
    .when(col("Sleep Duration") == "7-8 hours", 7.5)
    .when(col("Sleep Duration") == "5-6 hours", 5.5)
    .when(col("Sleep Duration") == "Less than 5 hours", 4)
)

df = df.withColumn("Sleep Duration", col("Sleep Duration").cast("float"))
df.show()

In [None]:
# verify schema
df.printSchema()

It looks like the columns 'Have you ever had suicidal thoughts ?', 'Family History of Mental Illness' and 'Depression' can be converted to boolean, but first let's double-check the data.

In [None]:
df.groupBy('Have you ever had suicidal thoughts ?').count().show()
df.groupBy('Family History of Mental Illness').count().show()
df.groupBy('Depression').count().show()

In [None]:
df = df.withColumn("Have you ever had suicidal thoughts ?", col("Have you ever had suicidal thoughts ?").cast("boolean"))
df = df.withColumn("Family History of Mental Illness", col("Family History of Mental Illness").cast("boolean"))
df = df.withColumn("Depression", col("Depression").cast("boolean"))

df.printSchema()

##### 1.d Remove any inconsistent values

I could not find any inconsistent values except 'Other' in 'Dietary Habits', but I am unsure what to do with it.


##### 1.e Output the data quality metrics (nulls, value counts, basic statistics)

In [None]:
# Summary statistics for numerical columns
df.describe().show()

Save the processed data to a Parquet file

In [None]:
from utils import output_to_parquet_file

output_to_parquet_file(df, 'processed_data.parquet')