## Data Cleaning
In this notebook, I load the survey data from a CSV file and clean it so that the other notebooks can use it for analysis.

The dataset is taken from Kaggle (https://www.kaggle.com/datasets/abdullahashfaqvirk/student-mental-health-survey). The survey targets university students enrolled in IT-related courses.

In [1]:
# Import libraries
import pandas as pd
import numpy as np

In [2]:
# Import dataset
df = pd.read_csv("../data/MentalHealthSurvey.csv")

#### Checking

In [3]:
# Quick overview of data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87 entries, 0 to 86
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   gender                    87 non-null     object
 1   age                       87 non-null     int64 
 2   university                87 non-null     object
 3   degree_level              87 non-null     object
 4   degree_major              87 non-null     object
 5   academic_year             87 non-null     object
 6   cgpa                      87 non-null     object
 7   residential_status        87 non-null     object
 8   campus_discrimination     87 non-null     object
 9   sports_engagement         87 non-null     object
 10  average_sleep             87 non-null     object
 11  study_satisfaction        87 non-null     int64 
 12  academic_workload         87 non-null     int64 
 13  academic_pressure         87 non-null     int64 
 14  financial_concerns        87

In [4]:
df.head()

| | gender | age | university | degree_level | degree_major | academic_year | cgpa | residential_status | campus_discrimination | sports_engagement | ... | study_satisfaction | academic_workload | academic_pressure | financial_concerns | social_relationships | depression | anxiety | isolation | future_insecurity | stress_relief_activities |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | 20 | PU | Undergraduate | Data Science | 2nd year | 3.0-3.5 | Off-Campus | No | No Sports | ... | 5 | 4 | 5 | 4 | 3 | 2 | 1 | 1 | 2 | Religious Activities, Social Connections, Onli... |
| 1 | Male | 20 | UET | Postgraduate | Computer Science | 3rd year | 3.0-3.5 | Off-Campus | No | 1-3 times | ... | 5 | 4 | 4 | 1 | 3 | 3 | 3 | 3 | 4 | Online Entertainment |
| 2 | Male | 20 | FAST | Undergraduate | Computer Science | 3rd year | 2.5-3.0 | Off-Campus | No | 1-3 times | ... | 5 | 5 | 5 | 3 | 4 | 2 | 3 | 3 | 1 | Religious Activities, Sports and Fitness, Onli... |
| 3 | Male | 20 | UET | Undergraduate | Computer Science | 3rd year | 2.5-3.0 | On-Campus | No | No Sports | ... | 3 | 5 | 4 | 4 | 1 | 5 | 5 | 5 | 3 | Online Entertainment |
| 4 | Female | 20 | UET | Undergraduate | Computer Science | 3rd year | 3.0-3.5 | Off-Campus | Yes | No Sports | ... | 3 | 5 | 5 | 2 | 3 | 5 | 5 | 4 | 4 | Online Entertainment |


In [5]:
df.describe()

| | age | study_satisfaction | academic_workload | academic_pressure | financial_concerns | social_relationships | depression | anxiety | isolation | future_insecurity |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 87.0 | 87.0 | 87.0 | 87.0 | 87.0 | 87.0 | 87.0 | 87.0 | 87.0 | 87.0 |
| mean | 19.942529 | 3.931034 | 3.885057 | 3.781609 | 3.390805 | 2.781609 | 3.218391 | 3.218391 | 3.241379 | 3.011494 |
| std | 1.623636 | 1.043174 | 0.85488 | 1.125035 | 1.400634 | 1.175578 | 1.367609 | 1.297809 | 1.405682 | 1.385089 |
| min | 17.0 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 19.0 | 3.0 | 3.0 | 3.0 | 2.5 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 |
| 50% | 20.0 | 4.0 | 4.0 | 4.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 |
| 75% | 21.0 | 5.0 | 4.5 | 5.0 | 5.0 | 4.0 | 4.0 | 4.0 | 4.5 | 4.0 |
| max | 26.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 |


These three outputs let me check the dataset for four kinds of flaws:

1. Columns with the wrong data type
2. Columns that contain null values
3. Columns that are not needed for analysis and can be removed
4. Outliers

With these checks done, I can move on to the next step: cleaning the data.
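The checks listed above can also be run explicitly rather than read off the printed outputs. The following is a minimal sketch, assuming a small hypothetical `sample` frame that mirrors a few of the survey columns (it is not the real data) and the common 1.5 × IQR rule for outliers, which is an assumption and not a rule used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical sample mirroring a few survey columns (not the real data)
sample = pd.DataFrame({
    "age": [20, 20, 21, 19, 26],
    "cgpa": ["3.0-3.5", "3.0-3.5", "2.5-3.0", "2.5-3.0", "3.0-3.5"],
    "depression": [2, 3, 2, 5, 5],
})

# Check 1: list the non-numeric columns, which may need a type conversion
non_numeric = sample.select_dtypes(exclude="number").columns.tolist()

# Check 2: count null values in each column
null_counts = sample.isna().sum()

# Check 4: flag ages outside 1.5 * IQR of the quartiles (assumed rule)
q1, q3 = sample["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = sample[(sample["age"] < q1 - 1.5 * iqr) | (sample["age"] > q3 + 1.5 * iqr)]

print(non_numeric)
print(null_counts)
print(outliers)
```

Check 3 (columns that are not needed) depends on the analysis questions, so it stays a manual decision here.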

#### Cleaning


In [7]:
# Handling missing values
df = df.dropna()

The outputs above show that there is not much to clean. I still drop any rows with null values to make sure the cleaned dataset contains none.
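To see how much `dropna()` actually changes, the row count can be compared before and after the call. This is a minimal sketch on a hypothetical `sample` frame with one deliberately missing value; the names `rows_before`, `cleaned`, and `rows_removed` are illustrative and do not appear in the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing age (not the real survey data)
sample = pd.DataFrame({
    "age": [20, 21, np.nan],
    "gender": ["Male", "Female", "Male"],
})

rows_before = len(sample)

# Drop every row that contains at least one null value
cleaned = sample.dropna()

rows_removed = rows_before - len(cleaned)
print(f"Rows removed: {rows_removed}")
```

If the same comparison on the survey data reports zero rows removed, the `dropna()` step was only a safeguard.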

#### Saving cleaned dataset

In [8]:
df.to_csv("../data/MentalHealthSurvey_Cleaned.csv", index=False)
print("Data cleaning completed!")

Data cleaning completed!
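A quick way to confirm that a save worked is to read the file back and compare it with the frame in memory. The sketch below assumes a hypothetical `sample` frame and writes to a temporary directory instead of `../data/`, so it does not touch the real files.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical frame standing in for the cleaned survey data
sample = pd.DataFrame({
    "gender": ["Male", "Female"],
    "age": [20, 21],
    "cgpa": ["3.0-3.5", "2.5-3.0"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "sample_cleaned.csv"
    # index=False keeps the row index out of the file, as in the cell above
    sample.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# The reloaded frame should have the same shape and column names
same_shape = reloaded.shape == sample.shape
same_columns = list(reloaded.columns) == list(sample.columns)
print(same_shape, same_columns)
```

Saving without `index=False` would add an extra unnamed index column when the file is read back, and the shape check would then fail.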
