## Models Part 1 - For Df 1 and Df 2



To prepare our datasets for analysis, we took the following steps:


*   **Loading Datasets:** We loaded Dataset 1 and Dataset 2 from CSV files into pandas DataFrames so they could be cleaned and prepared for modeling.
*   **Cleaning Dataset 1:** We renamed columns for clarity, encoded binary "Yes"/"No" responses as 1 and 0 for compatibility with machine learning models, and min-max scaled numerical features such as age and academic pressure to a common 0 to 1 range.
*   **Cleaning Dataset 2:** We converted CGPA ranges and sleep durations to numeric values by taking the midpoint of each range, scaled the numerical columns, and dropped rows with missing values so the regression models receive complete data.


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Load datasets
dataset1_url = "https://raw.githubusercontent.com/hbedros/data622-assignment4/main/data/dataset1.csv"
dataset2_url = "https://raw.githubusercontent.com/hbedros/data622-assignment4/main/data/dataset2.csv"
df1 = pd.read_csv(dataset1_url)
df2 = pd.read_csv(dataset2_url)

# Clean Dataset 1
df1.columns = df1.columns.str.strip()
df1 = df1.rename(columns={'Have you ever had suicidal thoughts ?': 'suicidal_thoughts'})
df1['suicidal_thoughts'] = df1['suicidal_thoughts'].map({'Yes': 1, 'No': 0})
df1['Depression'] = df1['Depression'].map({'Yes': 1, 'No': 0})

# Scale numerical columns
numeric_cols_df1 = ['Age', 'Academic Pressure', 'Study Satisfaction', 'Study Hours', 'Financial Stress']
scaler = MinMaxScaler()
df1[numeric_cols_df1] = scaler.fit_transform(df1[numeric_cols_df1])

# Clean Dataset 2
df2.columns = df2.columns.str.strip()

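# Parse CGPA ranges written as "low-high" into their midpoint
def parse_range(val):
    if not isinstance(val, str):
        # Already numeric (or missing): pass through as float, NaN stays NaN
        return float(val)
    if "-" in val:
        lower, upper = map(float, val.split("-"))
        return (lower + upper) / 2
    return float(val)

df2['cgpa'] = df2['cgpa'].apply(parse_range)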

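# Clean 'average_sleep' column: strip the " hrs" suffix and use range midpoints
def parse_sleep(val):
    if isinstance(val, str):
        val = val.replace(" hrs", "").strip()
        if "-" in val:
            lower, upper = map(float, val.split("-"))
            return (lower + upper) / 2
        elif val.isdigit():
            # isdigit() matches whole numbers only; other strings fall through to NaN
            return float(val)
    # Non-string or unparseable values are treated as missing
    return np.nan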

df2['average_sleep'] = df2['average_sleep'].apply(parse_sleep)

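# Feature scaling for numerical columns (a fresh scaler, separate from Dataset 1's)
# Note: 'cgpa' is the regression target, so MAE/RMSE later on are in scaled (0-1) units
numeric_cols_df2 = ['cgpa', 'study_satisfaction', 'average_sleep']
scaler_df2 = MinMaxScaler()
df2[numeric_cols_df2] = scaler_df2.fit_transform(df2[numeric_cols_df2])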

# Drop rows with missing values
df2 = df2.dropna()
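
To make the range handling concrete, here is a minimal sketch of the midpoint conversion applied to a few made-up values. The sample strings and the helper name `range_midpoint` are illustrative assumptions, not values or code taken from the datasets. Unlike `parse_sleep` above, this variant also accepts decimal strings such as "7.5", so it would not necessarily reproduce the results reported in this notebook.

```python
import math

def range_midpoint(val):
    """Return the midpoint of an 'a-b' range, the number itself, or NaN."""
    if not isinstance(val, str):
        # Pass through real numbers; treat anything else (e.g. None) as missing
        return float(val) if isinstance(val, (int, float)) else math.nan
    text = val.replace("hrs", "").strip()
    try:
        if "-" in text:
            lower, upper = (float(part) for part in text.split("-"))
            return (lower + upper) / 2
        return float(text)
    except ValueError:
        # Covers non-numeric text and malformed ranges
        return math.nan

# Hypothetical inputs, for illustration only
samples = ["3.00-3.49", "7-8 hrs", "7.5", "unknown", None]
midpoints = [range_midpoint(s) for s in samples]
print(midpoints)
```

Returning NaN for anything unparseable keeps the later `dropna()` step as the single place where incomplete rows are removed.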


### Analysis on Dataset 1 (Classification)




We prepared Dataset 1 for modeling by scaling numerical features and one-hot encoding categorical variables. The data was split 80/20 into training and testing sets to predict the binary outcome (Depression). Because the target is binary, we chose three classifiers: Decision Tree, Random Forest, and Logistic Regression. Model performance was evaluated by accuracy on the test set.

**Why These Models Were Used for Dataset 1:**



*   **Decision Tree:** Selected as a simple starting model. It produces explicit, interpretable decision rules, which helps explore how features like age, academic pressure, and sleep duration relate to Depression.
*   **Random Forest:** Builds on the Decision Tree by aggregating the predictions of many trees. This ensemble approach reduces overfitting and can capture more complex feature interactions.
*   **Logistic Regression:** Chosen as a linear baseline for comparison. It handles binary classification efficiently and performs well when the features relate to the log-odds of the target roughly linearly.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Dataset 1 - Preprocessing
X1 = df1.drop(columns=['Depression'])
y1 = df1['Depression']

# Apply OneHotEncoding for categorical variables and scaling for numeric variables
categorical_cols = X1.select_dtypes(include=['object']).columns
numerical_cols = X1.select_dtypes(include=['float64', 'int64']).columns

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(), categorical_cols)
    ]
)

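# Note: the preprocessor is fit on all rows before the split, so test rows
# influence the scaling statistics and the encoder categories
X1_preprocessed = preprocessor.fit_transform(X1)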

# Train-test split
X1_train, X1_test, y1_train, y1_test = train_test_split(X1_preprocessed, y1, test_size=0.2, random_state=42)

# Decision Tree Classifier
dt_clf = DecisionTreeClassifier(random_state=42)
dt_clf.fit(X1_train, y1_train)
y1_pred_dt = dt_clf.predict(X1_test)
dt_accuracy = accuracy_score(y1_test, y1_pred_dt)

# Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X1_train, y1_train)
y1_pred_rf = rf_clf.predict(X1_test)
rf_accuracy = accuracy_score(y1_test, y1_pred_rf)

# Logistic Regression
log_reg = LogisticRegression(max_iter=1000, random_state=42)
log_reg.fit(X1_train, y1_train)
y1_pred_lr = log_reg.predict(X1_test)
lr_accuracy = accuracy_score(y1_test, y1_pred_lr)

# Tabulate Results for Dataset 1
results_df1 = pd.DataFrame({
    "Data Table": ["DF 1", "DF 1", "DF 1"],
    "Model": ["Decision Tree", "Random Forest", "Logistic Regression"],
    "Accuracy": [dt_accuracy, rf_accuracy, lr_accuracy],
    "MAE": [None, None, None],
    "RMSE": [None, None, None]
})

print(results_df1)



  Data Table                Model  Accuracy   MAE  RMSE
0       DF 1        Decision Tree  0.881188  None  None
1       DF 1        Random Forest  0.950495  None  None
2       DF 1  Logistic Regression  0.960396  None  None


**Results Explained for Dataset 1:**



*   **Decision Tree:** Achieved an accuracy of **88.12%**, a reasonable starting point. Its decision paths are easy to interpret, but the lower accuracy suggests that a single unpruned tree generalizes less well here than the other two models.
*   **Random Forest:** Performed substantially better, with an accuracy of **95.05%**. Averaging the outputs of many decision trees reduces overfitting, which improves predictions on the held-out data.
*   **Logistic Regression:** Delivered the highest accuracy at **96.04%**, about one percentage point above Random Forest. This suggests a linear model fits the data well, with features like age, academic pressure, and financial stress contributing roughly additively to the prediction. The margin is small and comes from a single train/test split, so the ranking of the top two models should be read with caution.
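
The accuracies above come from one 80/20 split, with the preprocessor fit on the full dataset before splitting. One way to get a steadier estimate is to wrap preprocessing and the model in a single scikit-learn `Pipeline` and score it with cross-validation. The sketch below uses synthetic stand-in data; the column names, sample size, and target rule are assumptions made for illustration and are not taken from Dataset 1.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical columns, NOT the real Dataset 1)
rng = np.random.default_rng(42)
n = 200
X_demo = pd.DataFrame({
    "pressure": rng.uniform(1, 5, n),
    "stress": rng.uniform(1, 5, n),
    "gender": rng.choice(["Male", "Female"], n),
})
y_demo = (X_demo["pressure"] + X_demo["stress"] + rng.normal(0, 1, n) > 6).astype(int)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["pressure", "stress"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
])

# With preprocessing inside the Pipeline, the scaler and encoder are refit
# on the training folds only, within each cross-validation split
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
print(f"Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean and spread across folds makes it easier to judge whether a one-point gap between two models is larger than the split-to-split variation.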





### Analysis on Dataset 2 (Regression Analysis)

We prepared Dataset 2 to predict CGPA (Cumulative Grade Point Average), a measure of a student’s academic performance, using factors like study satisfaction, workload, and sleep patterns. Numerical features were scaled and categorical variables one-hot encoded. We then applied four regression models: Linear Regression, Decision Tree Regressor, Random Forest, and Gradient Boosting. The data was split 80/20 into training and testing sets, and performance was evaluated using RMSE and MAE. Because CGPA itself was min-max scaled during cleaning, both metrics are expressed on the scaled 0 to 1 range rather than in grade points.

*   **Linear Regression:** Selected as a baseline regression model to predict CGPA. Its simplicity and suitability for linear relationships between features and the target make it a logical starting point.
*   **Decision Tree:** Chosen for its ability to capture non-linear relationships, which helps reveal how factors like academic workload and sleep duration affect CGPA.
*   **Random Forest:** Builds on the Decision Tree by averaging the results of many trees, which reduces overfitting and captures complex feature interactions.
*   **Gradient Boosting:** Selected because it builds trees sequentially, each one correcting the errors of those before it. It handles non-linear relationships and feature interactions well, making it a natural candidate for this dataset.
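
For reference, the two error metrics used to compare these models have the standard definitions below. The symbols are introduced here for illustration: $y_i$ is the observed CGPA for test observation $i$, $\hat{y}_i$ is the model's prediction, and $n$ is the number of test observations.

```latex
\[
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert,
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}
\]
```

Because RMSE squares each error before averaging, it penalizes large misses more heavily than MAE, and it is never smaller than MAE on the same predictions.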

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Define target variable and features
X2 = df2.drop(columns=["cgpa"])
y2 = df2["cgpa"]

# Preprocessing
categorical_cols = X2.select_dtypes(include=["object"]).columns
numerical_cols = X2.select_dtypes(include=["float64", "int64"]).columns

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numerical_cols),
        ("cat", OneHotEncoder(), categorical_cols)
    ]
)

# Note: as with Dataset 1, the preprocessor is fit on all rows before the split
X2_preprocessed = preprocessor.fit_transform(X2)

# Train-test split
X2_train, X2_test, y2_train, y2_test = train_test_split(X2_preprocessed, y2, test_size=0.2, random_state=42)

# Linear Regression
lr = LinearRegression()
lr.fit(X2_train, y2_train)
y2_pred_lr = lr.predict(X2_test)
lr_rmse = np.sqrt(mean_squared_error(y2_test, y2_pred_lr))
lr_mae = mean_absolute_error(y2_test, y2_pred_lr)

# Decision Tree Regressor
dt = DecisionTreeRegressor(random_state=42)
dt.fit(X2_train, y2_train)
y2_pred_dt = dt.predict(X2_test)
dt_rmse = np.sqrt(mean_squared_error(y2_test, y2_pred_dt))
dt_mae = mean_absolute_error(y2_test, y2_pred_dt)

# Random Forest Regressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X2_train, y2_train)
y2_pred_rf = rf.predict(X2_test)
rf_rmse = np.sqrt(mean_squared_error(y2_test, y2_pred_rf))
rf_mae = mean_absolute_error(y2_test, y2_pred_rf)

# Gradient Boosting Regressor
gb = GradientBoostingRegressor(random_state=42)
gb.fit(X2_train, y2_train)
y2_pred_gb = gb.predict(X2_test)
gb_rmse = np.sqrt(mean_squared_error(y2_test, y2_pred_gb))
gb_mae = mean_absolute_error(y2_test, y2_pred_gb)

# Tabulate Results for Dataset 2
results_df2 = pd.DataFrame({
    "Data Table": ["DF 2", "DF 2", "DF 2", "DF 2"],
    "Model": ["Linear Regression", "Decision Tree", "Random Forest", "Gradient Boosting"],
    "Accuracy": [None, None, None, None],
    "MAE": [lr_mae, dt_mae, rf_mae, gb_mae],
    "RMSE": [lr_rmse, dt_rmse, rf_rmse, gb_rmse]
})

print(results_df2)


  Data Table              Model Accuracy       MAE      RMSE
0       DF 2  Linear Regression     None  0.345027  0.447927
1       DF 2      Decision Tree     None  0.214815  0.352066
2       DF 2      Random Forest     None  0.158963  0.230364
3       DF 2  Gradient Boosting     None  0.186680  0.255632


**Results Explained for Dataset 2:**

*   **Linear Regression:** Achieved an RMSE of 0.4479 and MAE of 0.3450, the highest errors of the four models. It picks up some linear trends but struggles to capture the complexity of how factors like study satisfaction and sleep patterns influence CGPA.
*   **Decision Tree:** Improved to an RMSE of 0.3521 and MAE of 0.2148. Its ability to handle non-linear relationships makes it better at identifying the effects of features like academic workload on CGPA.
*   **Random Forest:** Delivered the best performance, with an RMSE of 0.2304 and MAE of 0.1590. Averaging many trees lets it capture complex interactions while keeping variance low.
*   **Gradient Boosting:** Achieved an RMSE of 0.2556 and MAE of 0.1867, the second-lowest errors. Its iterative refinement makes it a strong model, though slightly less accurate than Random Forest on this dataset.
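
Since `cgpa` was passed through `MinMaxScaler` during cleaning, the errors above are on the scaled 0 to 1 range rather than in grade points. A rough sketch of converting them back follows. The `cgpa_min` and `cgpa_max` values and the helper name `to_original_units` are hypothetical placeholders, not the actual range of Dataset 2; the real minimum and maximum would come from the fitted scaler.

```python
# Min-max scaling maps x to (x - min) / (max - min), so a difference in
# scaled units equals the original difference divided by (max - min).
# Multiplying a scaled MAE or RMSE by (max - min) restores the raw units.

def to_original_units(scaled_error, data_min, data_max):
    """Convert an error measured on min-max scaled data back to raw units."""
    return scaled_error * (data_max - data_min)

# Hypothetical CGPA range, for illustration only
cgpa_min, cgpa_max = 2.0, 4.0

scaled_rmse = 0.230364  # Random Forest RMSE reported in the results table
rmse_in_cgpa_points = to_original_units(scaled_rmse, cgpa_min, cgpa_max)
print(round(rmse_in_cgpa_points, 4))
```

The same multiplication applies to MAE, because both metrics scale linearly with the target.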


**Key Takeaway:**

Random Forest produced the lowest RMSE and MAE when predicting CGPA, which is consistent with its ability to model complex relationships between academic and personal factors.
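
To see which inputs a Random Forest relies on, one option is to inspect its impurity-based `feature_importances_`. The sketch below runs on synthetic data: the feature names are borrowed from this notebook for readability, but the values, the `noise` column, and the target formula are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in features (NOT the real Dataset 2)
rng = np.random.default_rng(0)
n = 300
feature_names = ["study_satisfaction", "average_sleep", "noise"]
X_demo = rng.uniform(0, 1, size=(n, 3))

# Target depends on the first two columns only, including an interaction term
y_demo = X_demo[:, 0] * X_demo[:, 1] + 0.5 * X_demo[:, 0] + rng.normal(0, 0.02, n)

forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_demo, y_demo)

# Impurity-based importances are normalized to sum to 1 across features
importances = dict(zip(feature_names, forest.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

Impurity-based importances can favor high-cardinality features, so they are best treated as a first look rather than a definitive ranking.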

## Conclusion

- The analysis of both datasets highlights the importance of selecting models that align with the data and objectives.  

- For **Dataset 1**, focused on predicting Depression:  
  - **Logistic Regression** achieved the highest accuracy, effectively handling the binary classification task.  
  - This suggests that features like age, financial stress, and academic pressure relate to Depression in a way that a linear model captures well, although its margin over Random Forest was only about one percentage point.  

- For **Dataset 2**, which aimed to predict CGPA:  
  - **Random Forest** delivered the best performance, excelling at capturing complex interactions between features such as academic workload, sleep duration, and social relationships.  
  - Its ensemble approach minimized errors and provided the most accurate predictions.  

- Overall:  
  - The methodologies (classification for Dataset 1 and regression for Dataset 2) were well-matched to the nature of each dataset.  
  - **Random Forest** stood out for its versatility: it was the most accurate model on Dataset 2 and came within about one percentage point of the best model on Dataset 1.  