# IT326 Project - Phase1

### Depression 

##### "Depression (also known as major depression, major depressive disorder, or clinical depression) is a common but serious mood disorder. It causes severe symptoms that affect how a person feels, thinks, and handles daily activities, such as sleeping, eating, or working".

### The goal

##### The goal of this project is to use data mining techniques to explore how various aspects of life influence depression and how these factors interact with each other. We will focus on factors such as age, gender, marital status, and socioeconomic status, among others. By applying classification and clustering methods, we aim to reveal the connections between these factors and to predict depression, ultimately gaining insights that can help support affected individuals.


### General information about the dataset:

##### The dataset contains survey data for analyzing depression and its related factors. It includes demographic, household, and economic attributes (23 attributes across 1429 respondents) that can be used to study the depression status of individuals.


In [9]:
import pandas as pd

In [10]:
df = pd.read_csv('b_depressed.csv')

In [11]:
num_attributes = len(df.columns)
attribute_types = df.dtypes.to_frame().rename(columns={0: 'Data Types'})
num_objects = len(df)
class_name = df.columns[-1]  

text3= "Number of attributes:"
bold_text3 = "\033[1m" + text3 + "\033[0m"
print(bold_text3, num_attributes)
print("\n")

text4= "Attribute types:"
bold_text4 = "\033[1m" + text4 + "\033[0m"
print(bold_text4)
print(attribute_types)
print("\n")

text5= "Number of objects:"
bold_text5 = "\033[1m" + text5 + "\033[0m"
print(bold_text5, num_objects)
print("\n")

Number of attributes: 23


Attribute types:
                      Data Types
Survey_id                  int64
Ville_id                   int64
sex                        int64
Age                        int64
Married                    int64
Number_children            int64
education_level            int64
total_members              int64
gained_asset               int64
durable_asset              int64
save_asset                 int64
living_expenses            int64
other_expenses             int64
incoming_salary            int64
incoming_own_farm          int64
incoming_business          int64
incoming_no_business       int64
incoming_agricultural      int64
farm_expenses              int64
labor_primary              int64
lasting_investment         int64
no_lasting_investmen     float64
depressed                  int64


Number of objects: 1429

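The single float64 column in the listing above is worth a second look. When pandas reads a CSV, an integer column that contains missing values is stored as float64, so this dtype is often a hint that a column has gaps. A minimal sketch on a made-up three-row CSV (the values are illustrative and not taken from b_depressed.csv):

```python
import io
import pandas as pd

# Toy CSV standing in for b_depressed.csv; the values are made up.
toy_csv = io.StringIO(
    "lasting_investment,no_lasting_investmen,depressed\n"
    "100,200,0\n"
    "150,,1\n"
    "300,250,0\n"
)
toy = pd.read_csv(toy_csv)

# Count missing values per column, then show the resulting dtypes.
missing_per_column = toy.isna().sum()
print(missing_per_column)
print(toy.dtypes)
```

Running the same two calls on the real frame would show whether no_lasting_investmen is the only column with missing entries.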



Attributes: The dataset contains the following 23 attributes:
1. Survey_id: A unique identifier for each respondent in the dataset.
2. Ville_id: Refers to the ID of the respondent's village or town.
3. sex: The gender of the respondent as 0 (male) or 1 (female).
4. Age: The age of the respondent in years.
5. Married: Indicates the marital status of the respondent, often coded as 0 (unmarried) or 1 (married).
6. Number_children: The total number of children the respondent has.
7. education_level: The highest level of education attained by the respondent, which could be categorized as primary, secondary, or higher education.
8. total_members: The number of members in the respondent's household, indicating family size.
9. gained_asset: The assets gained by the respondent or household, possibly indicating economic progress.
10. durable_asset: The ownership of durable assets like cars, appliances, etc.
11. save_asset: The amount or presence of savings or financial assets held by the respondent.
12. living_expenses: The monthly or yearly expenses for basic living needs (e.g., food, housing).
13. other_expenses: Expenses beyond basic needs, such as entertainment or leisure activities.
14. incoming_salary: The salary or wage income the respondent receives from employment.
15. incoming_own_farm: Income derived from the respondent’s own farming activities.
16. incoming_business: Income generated from the respondent’s business operations.
17. incoming_no_business: Income from non-business sources (could include pensions, government aid, etc.).
18. incoming_agricultural: Income from agricultural activities, possibly farming or livestock.
19. farm_expenses: Expenses related to maintaining or operating a farm.
20. labor_primary: The labor status of the respondent, whether they are the primary laborer in their household or community.
21. lasting_investment: Investments made by the respondent in long-term assets or financial ventures.
22. no_lasting_investmen: Lack or absence of long-term investments. (The column name is spelled this way, without the final "t", in the dataset itself.)
23. depressed: The target variable indicating whether the respondent is depressed, labeled as 0 for "Not Depressed" and 1 for "Depressed". This binary attribute is the class label of the dataset, used for classification tasks.

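Data type: All 23 attributes are stored as numbers (22 as int64 and no_lasting_investmen as float64). Some are quantitative (e.g., Age, Number_children), while others are qualitative attributes encoded as integers (e.g., sex, Married, depressed).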
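Because every column is stored as a number, the qualitative attributes cannot be identified by dtype alone. One rough heuristic is to count distinct values per column: two values suggest a binary flag, a handful suggest a coded category. The sketch below applies that idea to a small made-up frame; the helper name `split_by_cardinality` and its threshold are illustrative assumptions, not part of the project.

```python
import pandas as pd

def split_by_cardinality(frame, max_categories=10):
    """Group column names by number of distinct non-null values (a heuristic)."""
    groups = {"binary": [], "categorical": [], "continuous": []}
    for column in frame.columns:
        n_unique = frame[column].nunique(dropna=True)
        if n_unique <= 2:
            groups["binary"].append(column)
        elif n_unique <= max_categories:
            groups["categorical"].append(column)
        else:
            groups["continuous"].append(column)
    return groups

# Made-up rows that mimic a few columns of the dataset.
toy = pd.DataFrame({
    "sex": [0, 1, 1, 0, 1, 0],
    "education_level": [1, 8, 10, 8, 1, 12],
    "Married": [1, 1, 0, 0, 1, 0],
    "Age": [23, 31, 45, 52, 67, 38],
})
column_groups = split_by_cardinality(toy, max_categories=4)
print(column_groups)
```

The grouping is only a starting point: a low-cardinality count such as Number_children is still quantitative, so the result should be checked against the attribute descriptions.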

Number of records (objects): The dataset consists of 1429 data entries.

Class labels: The dataset is focused on detecting whether a person is depressed. The class label, depressed, is binary: 1 for "Depressed" and 0 for "Not Depressed".
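Before training a classifier it helps to know how balanced the two classes are, since accuracy alone can be misleading when one class dominates. A minimal sketch using `value_counts` on a made-up label column (the counts are illustrative, not the real distribution of depressed):

```python
import pandas as pd

# Made-up labels: 0 = "Not Depressed", 1 = "Depressed".
toy_labels = pd.Series([0, 0, 0, 0, 1, 0, 1, 0], name="depressed")

# Absolute counts and relative shares per class, ordered by label.
class_counts = toy_labels.value_counts().sort_index()
class_share = toy_labels.value_counts(normalize=True).sort_index()
print(class_counts)
print(class_share.round(3))
```

On the real data, the same calls on `df['depressed']` would give the class distribution.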

### Source of dataset

###### https://www.kaggle.com/datasets/diegobabativa/depression?resource=download 

In [18]:

from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# NOTE: this cell expects 'total_assets' and 'total_expenses' to exist already;
# they are created by the aggregation cell (In [19]), so run that cell first.
# The cell also drops 'education_level' at the end, so re-running it without
# reloading the CSV raises a KeyError.

# Step 1: Fill missing values in 'no_lasting_investmen' with the column mean
imputer = SimpleImputer(strategy='mean')
df['no_lasting_investmen'] = imputer.fit_transform(df[['no_lasting_investmen']]).ravel()

# Step 2: Discretize 'Age' into 'Youth' (0-24], 'Adult' (24-59], 'Senior' (59+)
age_bins = [0, 24, 59, float('inf')]
age_labels = ['Youth', 'Adult', 'Senior']
df['Age_group'] = pd.cut(df['Age'], bins=age_bins, labels=age_labels, right=True)

# Step 3: Scale the numerical features to the [0, 1] range with MinMaxScaler
numerical_features = ['incoming_salary', 'incoming_own_farm',
                      'incoming_business', 'incoming_no_business', 'incoming_agricultural',
                      'labor_primary', 'lasting_investment', 'no_lasting_investmen',
                      'total_assets', 'total_expenses']

scaler = MinMaxScaler()
df[numerical_features] = scaler.fit_transform(df[numerical_features])

# Step 4: One-hot encode categorical features, including the new 'Age_group' column
# (drop='first' removes one level per feature to avoid redundant columns)
categorical_features = ['education_level', 'Age_group']
encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded_columns = encoder.fit_transform(df[categorical_features])

# Build a DataFrame for the encoded columns, aligned on df's index, and merge it back
encoded_df = pd.DataFrame(encoded_columns,
                          columns=encoder.get_feature_names_out(categorical_features),
                          index=df.index)
df = pd.concat([df, encoded_df], axis=1)

# Drop the original categorical columns after encoding
df.drop(columns=categorical_features, inplace=True)

# Test the transformations
print("Scaled numerical features:")
print(df[numerical_features].head())

print("\nOne-hot encoded categorical features:")
print(encoded_df.head())


<class 'KeyError'>: "['education_level'] not in index"
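The KeyError above is consistent with the cell having been run a second time: the first run drops education_level, so a second run can no longer select it. One way to avoid this class of error is to wrap the steps in a function that returns a new frame and never modifies its input. The sketch below is an illustration on made-up rows, not the project's pipeline: `preprocess` is a hypothetical helper, and it swaps in `pd.get_dummies` and a hand-written min-max formula in place of scikit-learn's `OneHotEncoder` and `MinMaxScaler` to stay short.

```python
import pandas as pd

def preprocess(raw):
    """Return a preprocessed copy of `raw`; the input frame is left untouched."""
    out = raw.copy()
    # Mean-impute the column that has gaps.
    out["no_lasting_investmen"] = out["no_lasting_investmen"].fillna(
        out["no_lasting_investmen"].mean()
    )
    # Bin age into the same three groups used in the cell above.
    out["Age_group"] = pd.cut(
        out["Age"], bins=[0, 24, 59, float("inf")],
        labels=["Youth", "Adult", "Senior"],
    )
    # Min-max scale: (x - min) / (max - min).
    col = out["no_lasting_investmen"]
    out["no_lasting_investmen"] = (col - col.min()) / (col.max() - col.min())
    # One-hot encode, dropping the first level of each feature.
    return pd.get_dummies(out, columns=["education_level", "Age_group"], drop_first=True)

# Made-up rows; the values are illustrative only.
raw = pd.DataFrame({
    "Age": [22, 40, 65, 30],
    "education_level": [1, 2, 2, 3],
    "no_lasting_investmen": [100.0, None, 300.0, 200.0],
})
first = preprocess(raw)
second = preprocess(raw)  # safe to call again: `raw` still has every column
print(first.columns.tolist())
```

Because `raw` is never modified, the function can be called any number of times and gives the same result each time.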

In [19]:
# NOTE: this cell drops its source columns at the end, so re-running it
# without reloading the CSV raises a KeyError like the one shown below.

# Aggregate the asset-related columns into a new 'total_assets' column
df['total_assets'] = df[['gained_asset', 'durable_asset', 'save_asset']].sum(axis=1)

# Aggregate the expense-related columns into a new 'total_expenses' column
df['total_expenses'] = df[['living_expenses', 'other_expenses', 'farm_expenses']].sum(axis=1)

# Drop the original asset and expense columns now that they are aggregated
df.drop(columns=['gained_asset', 'durable_asset', 'save_asset', 'living_expenses', 'other_expenses', 'farm_expenses'], inplace=True)

# Test the new columns
print(df[['total_assets', 'total_expenses']].head())


<class 'KeyError'>: "None of [Index(['gained_asset', 'durable_asset', 'save_asset'], dtype='object')] are in the [columns]"
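This error is consistent with the aggregation cell having been re-run after the asset columns were already dropped. A minimal sketch of an aggregation step that tolerates repeated calls follows; `add_totals` is a hypothetical helper and the rows are made up for illustration.

```python
import pandas as pd

ASSET_COLS = ["gained_asset", "durable_asset", "save_asset"]
EXPENSE_COLS = ["living_expenses", "other_expenses", "farm_expenses"]

def add_totals(frame):
    """Return a copy with total_assets/total_expenses; skip groups already aggregated."""
    out = frame.copy()
    for total_name, parts in [("total_assets", ASSET_COLS),
                              ("total_expenses", EXPENSE_COLS)]:
        # Aggregate only if every source column is still present.
        if all(col in out.columns for col in parts):
            out[total_name] = out[parts].sum(axis=1)
            out = out.drop(columns=parts)
    return out

# Made-up rows; the values are illustrative only.
toy = pd.DataFrame({
    "gained_asset": [10, 20], "durable_asset": [1, 2], "save_asset": [5, 5],
    "living_expenses": [7, 8], "other_expenses": [1, 1], "farm_expenses": [2, 0],
})
once = add_totals(toy)
twice = add_totals(once)  # second call finds nothing left to aggregate
print(twice)
```

The trade-off is that a silently skipped step can hide a genuinely missing column, so a stricter version might raise an error when neither the source columns nor the total are present.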