## Import Libraries

In [1]:
import pandas as pd

## Load the dataset

In [None]:
# Load testing data
test_data = pd.read_csv('https://raw.githubusercontent.com/ktxdev/mind-matters/refs/heads/master/data/raw/test.csv')
# Load training data
train_data = pd.read_csv('https://raw.githubusercontent.com/ktxdev/mind-matters/refs/heads/master/data/raw/train.csv')
# Concatenate the two datasets into a single frame with a fresh index
data = pd.concat([test_data, train_data], ignore_index=True)

## Handling Missing Values
The `Job Satisfaction` and `Study Satisfaction` columns both contain missing values. Each individual is either a student or a working professional, so only one of the two columns holds a value for any given row while the other is empty. Merging them into a single `Job/Study Satisfaction` column simplifies the dataset and eliminates these gaps, keeping the relevant rating for each individual on a consistent 1 to 5 scale. The same is done for `Academic Pressure` and `Work Pressure`, which are combined into an `Academic/Work Pressure` column.

For `Profession`, the column already contains a `Student` value, so missing entries are filled based on the `Working Professional or Student` column: students get the `Student` profession, and working professionals get the `Working Professional` category.

`CGPA` is missing for around 80% of the records. Most of these gaps belong to working professionals rather than students, so imputing the value would be unreliable, and we drop the feature.

In [None]:
# Handling missing values for Job and Study satisfaction
data['Job/Study Satisfaction'] = data['Study Satisfaction'].fillna(data['Job Satisfaction'])
# Dropping the original satisfaction columns
data.drop(['Study Satisfaction', 'Job Satisfaction'], axis=1, inplace=True)

# Handling missing values for Academic and Work pressure
data['Academic/Work Pressure'] = data['Academic Pressure'].fillna(data['Work Pressure'])
# Dropping the original pressure columns
data.drop(['Academic Pressure', 'Work Pressure'], axis=1, inplace=True)

# Fill missing values for profession
data.loc[(data['Working Professional or Student'] == 'Student') & (data['Profession'].isnull()), 'Profession'] = 'Student'
data.loc[(data['Working Professional or Student'] == 'Working Professional') & (data['Profession'].isnull()), 'Profession'] = 'Working Professional'

# Dropping CGPA feature
data.drop(columns=['CGPA'], inplace=True)
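
The merge relies on the two source columns being mutually exclusive. A minimal sketch on made-up toy data (the values below are illustrative, not taken from the dataset) shows how `fillna` coalesces the columns and how the exclusivity assumption could be checked first:

```python
import pandas as pd

# Toy data (made up for illustration): students rate study satisfaction,
# working professionals rate job satisfaction.
toy = pd.DataFrame({
    "Working Professional or Student": ["Student", "Working Professional", "Student"],
    "Study Satisfaction": [4.0, None, 2.0],
    "Job Satisfaction": [None, 3.0, None],
})

# Sanity check: count rows where both columns are filled.
both_filled = (toy["Study Satisfaction"].notna() & toy["Job Satisfaction"].notna()).sum()

# Coalesce: take the study rating, fall back to the job rating where it is missing.
toy["Job/Study Satisfaction"] = toy["Study Satisfaction"].fillna(toy["Job Satisfaction"])
merged = toy["Job/Study Satisfaction"].tolist()
print(both_filled, merged)
```

If `both_filled` were non-zero on the real data, `fillna` would keep the study rating and silently discard the job rating for those rows, so it may be worth running the same check there before dropping the original columns.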

#### Re-checking Missing Values

In [2]:
data.isnull().sum()

Since the number of records with missing values is no longer significant, I will drop those records.

In [None]:
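# Drop the target column `Depression` first so dropna() ignores it,
# then drop every row that still has a missing value
cleaned_data = data.drop(columns=['Depression']).dropna()
# Print the shape of the data after dropping records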
print(cleaned_data.shape)
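
Note that the cell above also removes `Depression` from `cleaned_data`. If the target is needed downstream, one possible alternative (an assumption on my part, not what the notebook does) is to restrict the missing-value check to the feature columns with `dropna(subset=...)`. A small sketch on made-up toy data:

```python
import pandas as pd

# Toy data (made up): the last row stands in for a record without a label.
toy = pd.DataFrame({
    "Age": [21.0, None, 35.0],
    "Profession": ["Student", "Teacher", "Chef"],
    "Depression": [1.0, 0.0, None],
})

# Check for missing values in every column except the target.
feature_cols = [c for c in toy.columns if c != "Depression"]
kept = toy.dropna(subset=feature_cols)
print(kept.shape)
```

Here the row with a missing `Age` is dropped, while the row with a missing `Depression` value survives and the column itself is kept.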