
# **Import Libraries and Load Data**
*Imports the libraries needed for data preprocessing, model training, and evaluation, then loads the Kaggle dataset from Google Drive: [dataset.csv](https://docs.google.com/spreadsheets/d/15CsEhLuLZEKHGW2YCpQXikyJntFxV7U9rgNJl4asFWM/edit?usp=sharing)*





In [57]:
import pandas as pd
from joblib import dump, load
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


# Load Kaggle dataset from Google Drive
kaggle_data_link = 'https://drive.google.com/uc?id=1joIqVz1UYhiIJX590sjAR7cE-Sk6-G_Y'
kaggle_data = pd.read_csv(kaggle_data_link)


# Data Preprocessing
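*Fills missing values (column mean for numeric columns, mode for categorical columns) and one-hot encodes the categorical data. After encoding, the dataset contains the indicator columns `Target_Dropout`, `Target_Enrolled`, and `Target_Graduate`; `Target_Dropout` is used as the binary target, and all three indicator columns are dropped from the features.*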




In [37]:
# Fill missing values in numeric columns with mean
numeric_columns = kaggle_data.select_dtypes(include=['float64', 'int64']).columns
kaggle_data[numeric_columns] = kaggle_data[numeric_columns].fillna(kaggle_data[numeric_columns].mean())

# Fill missing values in categorical columns with mode
categorical_columns = kaggle_data.select_dtypes(include=['object']).columns
for column in categorical_columns:
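    # Assign the result back; calling fillna(inplace=True) on a selected column may not update the frame
    kaggle_data[column] = kaggle_data[column].fillna(kaggle_data[column].mode()[0])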

# Convert categorical data to numerical using one-hot encoding
kaggle_data = pd.get_dummies(kaggle_data)

# Set 'Target_Dropout' as the target column
target_column = 'Target_Dropout'

# Separate features and target
X = kaggle_data.drop(columns=['Target_Dropout', 'Target_Enrolled', 'Target_Graduate'])
y = kaggle_data[target_column]

# Check the distribution of the target variable
print("Target Variable Distribution:\n", y.value_counts())


Target Variable Distribution:
 Target_Dropout
False    3003
True     1421
Name: count, dtype: int64
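
The preprocessing steps above can be illustrated on a tiny frame. This is a minimal sketch, not the notebook's data: the `toy` frame and its `Age` and `Target` columns are hypothetical stand-ins chosen to show how mean/mode imputation and `pd.get_dummies` produce a `Target_Dropout` indicator column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Kaggle data.
toy = pd.DataFrame({
    "Age": [18.0, np.nan, 22.0, 20.0],
    "Target": ["Dropout", "Graduate", None, "Graduate"],
})

# Numeric columns: replace missing values with the column mean.
num_cols = toy.select_dtypes(include="number").columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].mean())

# Non-numeric columns: replace missing values with the most frequent value.
cat_cols = toy.select_dtypes(exclude="number").columns
for col in cat_cols:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

# One-hot encoding creates one indicator column per category of 'Target'.
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```

Because every category becomes its own column, the sibling indicators of the target have to be dropped from the features, as the cell above does; otherwise they would reveal the label.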


# Data Scaling and Splitting
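*Standardizes the features with `StandardScaler`, then splits the dataset into training (80%) and testing (20%) sets for model training and evaluation. Note that the scaler is fitted on the full dataset before the split, so test-set statistics influence the scaling.*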



In [38]:
# Scale features using StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
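
A common alternative is to split first and fit the scaler on the training rows only, so that no test-set statistics leak into training. The sketch below shows that ordering on synthetic data; the names `X_demo`, `y_demo`, and `demo_scaler` are illustrative and not part of the notebook, and `stratify` is an optional addition that keeps the class ratio similar in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 5 features, imbalanced binary target.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 5))
y_demo = rng.random(200) < 0.3

# Split first; stratify keeps the class ratio similar in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit the scaler on the training rows only, then reuse it for the test rows.
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

# Training features are centered at (approximately) zero after scaling.
print(X_tr_scaled.mean(axis=0).round(6))
```

Applying this ordering to the notebook would likely shift the reported metrics slightly, since the models would then be trained on differently scaled inputs.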


# Training and Evaluating Models
*Trains three models on the training data prepared in the previous cell: Logistic Regression, a Decision Tree classifier, and a neural network (`MLPClassifier`). The fitted models are then saved to disk with joblib and loaded back.*

In [39]:
# Train Logistic Regression model with regularization (L2 by default)
print("Training Logistic Regression model...")
lr = LogisticRegression(C=1.0)  # Set C parameter for regularization strength (adjust as needed)
lr.fit(X_train, y_train)


# Train Decision Tree model
print("Training Decision Tree model...")
dt = DecisionTreeClassifier(max_depth=None, min_samples_split=2)  # Set max_depth and min_samples_split (adjust as needed)
dt.fit(X_train, y_train)


# Train Neural Network model with optimizations
print("Training Neural Network model...")
nn = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, alpha=0.0001, learning_rate_init=0.001,
                  early_stopping=True, validation_fraction=0.1)
nn.fit(X_train, y_train)

# Save models
dump(lr, 'logistic_regression_model.joblib')
dump(dt, 'decision_tree_model.joblib')
dump(nn, 'neural_network_model.joblib')

# Load models
lr_saved = load('logistic_regression_model.joblib')
dt_saved = load('decision_tree_model.joblib')
nn_saved = load('neural_network_model.joblib')

Training Logistic Regression model...
Training Decision Tree model...
Training Neural Network model...
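
The Decision Tree and the neural network in the cell above are created without a `random_state`, so retraining can give different results from run to run. The following sketch, on synthetic data with illustrative names (`tree_a`, `tree_b`, `tree_loaded`), shows two things under that assumption: fixing the seed makes training repeatable, and a model reloaded with joblib predicts exactly like the one that was saved.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: first 200 rows for training, last 100 held out.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_fit, y_fit, X_holdout = X_demo[:200], y_demo[:200], X_demo[200:]

# Two trees trained with the same seed make identical predictions.
tree_a = DecisionTreeClassifier(random_state=42).fit(X_fit, y_fit)
tree_b = DecisionTreeClassifier(random_state=42).fit(X_fit, y_fit)
same_predictions = np.array_equal(tree_a.predict(X_holdout), tree_b.predict(X_holdout))

# A model saved with joblib predicts the same as the in-memory one.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tree_demo.joblib")
    dump(tree_a, path)
    tree_loaded = load(path)
round_trip_ok = np.array_equal(tree_a.predict(X_holdout), tree_loaded.predict(X_holdout))

print(same_predictions, round_trip_ok)
```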


# Evaluating Models and Analysis
*Evaluates each trained model on the test set. For every model it reports the specificity, the confusion matrix, and the F1 score for predicting the target variable ('Target_Dropout').*

In [52]:
# Evaluate Logistic Regression
print("\nEvaluating Logistic Regression:")
print("--------------------------------")
y_pred_lr = lr.predict(X_test)

# Calculate Specificity for Logistic Regression
tn_lr, fp_lr, fn_lr, tp_lr = confusion_matrix(y_test, y_pred_lr).ravel()
specificity_lr = tn_lr / (tn_lr + fp_lr)
print(f"\nSpecificity (Logistic Regression): {specificity_lr:.4f}")

# Confusion Matrix for Logistic Regression
print("\nConfusion Matrix (Logistic Regression):")
print(confusion_matrix(y_test, y_pred_lr))

# F1 Score for Logistic Regression
f1_lr = f1_score(y_test, y_pred_lr)
print(f"\nF1 Score (Logistic Regression): {f1_lr:.4f}")



Evaluating Logistic Regression:
--------------------------------

Specificity (Logistic Regression): 0.9332

Confusion Matrix (Logistic Regression):
[[531  38]
 [ 82 234]]

F1 Score (Logistic Regression): 0.7959


In [54]:
# Evaluate Decision Tree
print("\nEvaluating Decision Tree:")
print("--------------------------")
y_pred_dt = dt.predict(X_test)

# Calculate Specificity for Decision Tree
tn_dt, fp_dt, fn_dt, tp_dt = confusion_matrix(y_test, y_pred_dt).ravel()
specificity_dt = tn_dt / (tn_dt + fp_dt)
print(f"\nSpecificity (Decision Tree): {specificity_dt:.4f}")

# Confusion Matrix for Decision Tree
print("\nConfusion Matrix (Decision Tree):")
print(confusion_matrix(y_test, y_pred_dt))

# F1 Score for Decision Tree
f1_dt = f1_score(y_test, y_pred_dt)
print(f"\nF1 Score (Decision Tree): {f1_dt:.4f}")



Evaluating Decision Tree:
--------------------------

Specificity (Decision Tree): 0.8366

Confusion Matrix (Decision Tree):
[[476  93]
 [103 213]]

F1 Score (Decision Tree): 0.6849


In [53]:
# Evaluate Neural Network
print("\nEvaluating Neural Network:")
print("---------------------------")
y_pred_nn = nn.predict(X_test)

# Calculate Specificity for Neural Network
tn_nn, fp_nn, fn_nn, tp_nn = confusion_matrix(y_test, y_pred_nn).ravel()
specificity_nn = tn_nn / (tn_nn + fp_nn)
print(f"\nSpecificity (Neural Network): {specificity_nn:.4f}")

# Confusion Matrix for Neural Network
print("\nConfusion Matrix (Neural Network):")
print(confusion_matrix(y_test, y_pred_nn))

# F1 Score for Neural Network
f1_nn = f1_score(y_test, y_pred_nn)
print(f"\nF1 Score (Neural Network): {f1_nn:.4f}")



Evaluating Neural Network:
---------------------------

Specificity (Neural Network): 0.9402

Confusion Matrix (Neural Network):
[[535  34]
 [ 90 226]]

F1 Score (Neural Network): 0.7847
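
All of the reported metrics follow from the four cells of the confusion matrix. As a worked check, the sketch below recomputes them from the Logistic Regression matrix printed above; `metrics_from_confusion` is a hypothetical helper written for this illustration, and it assumes scikit-learn's `[[TN, FP], [FN, TP]]` layout.

```python
import numpy as np

def metrics_from_confusion(cm):
    """Derive common binary metrics from a 2x2 confusion matrix.

    Assumes scikit-learn's layout: [[TN, FP], [FN, TP]].
    """
    tn, fp, fn, tp = np.asarray(cm).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "specificity": tn / (tn + fp),
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

# Logistic Regression confusion matrix copied from the output above.
lr_metrics = metrics_from_confusion([[531, 38], [82, 234]])
for name, value in lr_metrics.items():
    print(f"{name}: {value:.4f}")
```

The specificity (531 / 569 ≈ 0.9332) and F1 score (468 / 588 ≈ 0.7959) match the values printed by the notebook. The recall of about 0.74 means that roughly a quarter of actual dropouts are missed, which specificity alone does not show.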


# Feature Importance and Coefficients

*Analyzes and prints the feature importance for the Decision Tree model and the coefficients for the Logistic Regression model. Understanding feature importance helps in identifying which features have the most significant impact on predicting the target variable.*

In [19]:
# Feature importance for Decision Tree
print("\nFeature Importance (Decision Tree):")
print("------------------------------------")
feature_importance = dt.feature_importances_
features = X.columns
importance_df = pd.DataFrame({'Feature': features, 'Importance': feature_importance})
importance_df_sorted = importance_df.sort_values(by='Importance', ascending=False).reset_index(drop=True)
print(importance_df_sorted.to_string(index=False))

# Coefficients for Logistic Regression
print("\nCoefficients (Logistic Regression):")
print("-------------------------------------")
coefficients = lr.coef_[0]
coef_df = pd.DataFrame({'Feature': features, 'Coefficient': coefficients})
coef_df_sorted = coef_df.reindex(coef_df['Coefficient'].abs().sort_values(ascending=False).index).reset_index(drop=True)
print(coef_df_sorted.to_string(index=False))



Feature Importance (Decision Tree):
------------------------------------
                                       Feature  Importance
           Curricular units 2nd sem (approved)    0.426885
                       Tuition fees up to date    0.071133
              Curricular units 2nd sem (grade)    0.046259
           Curricular units 1st sem (enrolled)    0.040469
                             Age at enrollment    0.038139
                                        Course    0.029537
                           Mother's occupation    0.027680
              Curricular units 1st sem (grade)    0.026808
                           Father's occupation    0.025044
                                           GDP    0.024760
                        Mother's qualification    0.021745
                             Unemployment rate    0.020213
        Curricular units 2nd sem (evaluations)    0.019461
                                Inflation rate    0.018335
                        Father's qualific
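
With many one-hot encoded columns the full listing gets long, so it can help to look only at the top few features per model. The sketch below does this on synthetic data; the feature names and the `ranking` frame are illustrative assumptions. Logistic Regression coefficients are ranked by absolute value, which is a reasonable comparison only because the features were standardized first.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with six features, three of them informative.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
names = [f"feature_{i}" for i in range(X_demo.shape[1])]
X_demo = StandardScaler().fit_transform(X_demo)

tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
logreg = LogisticRegression().fit(X_demo, y_demo)

# Collect both views of feature relevance in one frame.
ranking = pd.DataFrame({
    "Feature": names,
    "Importance": tree.feature_importances_,
    "Coefficient": logreg.coef_[0],
})
ranking["AbsCoefficient"] = ranking["Coefficient"].abs()

# Keep only the three strongest features for each model.
top_tree = ranking.nlargest(3, "Importance")
top_logreg = ranking.nlargest(3, "AbsCoefficient")
print(top_tree[["Feature", "Importance"]].to_string(index=False))
print(top_logreg[["Feature", "Coefficient"]].to_string(index=False))
```

Tree importances sum to 1 and are always non-negative, while the sign of a coefficient indicates whether a feature pushes the prediction toward or away from the positive class.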

# Early Intervention System
*Defines an early intervention system function that identifies at-risk students based on predicted probabilities from each model. It demonstrates how to apply the function to the predictions of each model (Logistic Regression, Decision Tree, Neural Network).*

In [20]:
# Early intervention system
def early_intervention(predictions, threshold=0.5):
    """Identifies at-risk students based on predicted probabilities."""
    return predictions > threshold

# Apply the function to the predicted dropout probabilities of each model
at_risk_lr = early_intervention(lr.predict_proba(X_test)[:, 1])
at_risk_dt = early_intervention(dt.predict_proba(X_test)[:, 1])
at_risk_nn = early_intervention(nn.predict_proba(X_test)[:, 1])

# Create a DataFrame to organize results
results_df = pd.DataFrame({
    'Logistic Regression': at_risk_lr.astype(int),
    'Decision Tree': at_risk_dt.astype(int),
    'Neural Network': at_risk_nn.astype(int)
})

# Print formatted results
print("\nImplementing Early Intervention System:")
print("========================================\n")
print("--------------")
print(results_df)



Implementing Early Intervention System:

--------------
     Logistic Regression  Decision Tree  Neural Network
0                      1              1               1
1                      0              0               0
2                      0              0               0
3                      0              0               0
4                      1              1               1
..                   ...            ...             ...
880                    0              0               0
881                    1              1               1
882                    0              0               0
883                    0              0               0
884                    0              0               0

[885 rows x 3 columns]
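
At the default threshold of 0.5 the flags essentially reproduce each model's own class predictions; the `threshold` parameter becomes useful when missing an at-risk student is costlier than a false alarm. The sketch below uses made-up probabilities (the `probs` array is hypothetical) to show how lowering the threshold flags more students.

```python
import numpy as np

def early_intervention(predictions, threshold=0.5):
    """Flag students whose predicted dropout probability exceeds the threshold."""
    return np.asarray(predictions) > threshold

# Hypothetical predicted dropout probabilities for six students.
probs = np.array([0.05, 0.35, 0.48, 0.52, 0.75, 0.95])

for threshold in (0.3, 0.5, 0.7):
    flagged = early_intervention(probs, threshold)
    print(f"threshold={threshold}: {int(flagged.sum())} students flagged")
```

With these values, thresholds of 0.3, 0.5, and 0.7 flag 5, 3, and 2 students respectively. A lower threshold trades specificity for recall.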


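# Error Analysis Metrics Implementation
*Recomputes the specificity, confusion matrix, and F1 score of the three models side by side. The Logistic Regression figures are identical to those in the per-model cells above, while the Decision Tree and Neural Network figures differ. The cell numbers suggest these outputs come from a different execution, and neither model was created with a fixed `random_state`, so retraining most likely changed their predictions.*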

In [43]:
# Calculate Specificity for Logistic Regression
tn_lr, fp_lr, fn_lr, tp_lr = confusion_matrix(y_test, y_pred_lr).ravel()
specificity_lr = tn_lr / (tn_lr + fp_lr)
print(f"Specificity (Logistic Regression): {specificity_lr:.4f}")

# Calculate Specificity for Decision Tree
tn_dt, fp_dt, fn_dt, tp_dt = confusion_matrix(y_test, y_pred_dt).ravel()
specificity_dt = tn_dt / (tn_dt + fp_dt)
print(f"Specificity (Decision Tree): {specificity_dt:.4f}")

# Calculate Specificity for Neural Network
tn_nn, fp_nn, fn_nn, tp_nn = confusion_matrix(y_test, y_pred_nn).ravel()
specificity_nn = tn_nn / (tn_nn + fp_nn)
print(f"Specificity (Neural Network): {specificity_nn:.4f}")


Specificity (Logistic Regression): 0.9332
Specificity (Decision Tree): 0.8260
Specificity (Neural Network): 0.8946


In [44]:
# Confusion Matrix for Logistic Regression
print("\nConfusion Matrix (Logistic Regression):")
print(confusion_matrix(y_test, y_pred_lr))

# Confusion Matrix for Decision Tree
print("\nConfusion Matrix (Decision Tree):")
print(confusion_matrix(y_test, y_pred_dt))

# Confusion Matrix for Neural Network
print("\nConfusion Matrix (Neural Network):")
print(confusion_matrix(y_test, y_pred_nn))



Confusion Matrix (Logistic Regression):
[[531  38]
 [ 82 234]]

Confusion Matrix (Decision Tree):
[[470  99]
 [108 208]]

Confusion Matrix (Neural Network):
[[509  60]
 [ 86 230]]


In [48]:
# F1 Score for Logistic Regression
f1_lr = f1_score(y_test, y_pred_lr)
print(f"\nF1 Score (Logistic Regression): {f1_lr:.4f}")

# F1 Score for Decision Tree
f1_dt = f1_score(y_test, y_pred_dt)
print(f"F1 Score (Decision Tree): {f1_dt:.4f}")

# F1 Score for Neural Network
f1_nn = f1_score(y_test, y_pred_nn)
print(f"F1 Score (Neural Network): {f1_nn:.4f}")



F1 Score (Logistic Regression): 0.7959
F1 Score (Decision Tree): 0.6677
F1 Score (Neural Network): 0.7591


In [60]:
# Evaluate Logistic Regression (Simple Model)
print("\nEvaluating Logistic Regression (Simple Model):")
print("----------------------------------------------")
y_pred_lr = lr.predict(X_test)
simple_lr_accuracy = accuracy_score(y_test, y_pred_lr)
print(f"Logistic Regression Accuracy: {simple_lr_accuracy:.4f}")

# Train and Evaluate Optimized Decision Tree Model
print("\nTraining and Evaluating Optimized Decision Tree Model:")
print("------------------------------------------------------")
params_dt = {
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_search_dt = GridSearchCV(DecisionTreeClassifier(), params_dt, cv=5)
grid_search_dt.fit(X_train, y_train)
optimized_dt = grid_search_dt.best_estimator_

# Evaluate Optimized Decision Tree Model
y_pred_dt_optimized = optimized_dt.predict(X_test)
optimized_dt_accuracy = accuracy_score(y_test, y_pred_dt_optimized)
print(f"Optimized Decision Tree Accuracy: {optimized_dt_accuracy:.4f}")

# Accuracy difference relative to Logistic Regression (both measured on the test set, not a validation set)
accuracy_difference_dt = optimized_dt_accuracy - simple_lr_accuracy
print(f"\nDecision Tree Validation Accuracy Difference: {accuracy_difference_dt:.4f}")

# Train and Evaluate Optimized Neural Network Model
print("\nTraining and Evaluating Optimized Neural Network Model:")
print("--------------------------------------------------------")
nn = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, solver='adam', learning_rate_init=0.001, alpha=0.0001, early_stopping=True, random_state=42)
nn.fit(X_train, y_train)

# Evaluate Neural Network Model
y_pred_nn = nn.predict(X_test)
nn_accuracy = accuracy_score(y_test, y_pred_nn)
print(f"Neural Network Accuracy: {nn_accuracy:.4f}")

# Accuracy difference relative to Logistic Regression (both measured on the test set, not a validation set)
accuracy_difference_nn = nn_accuracy - simple_lr_accuracy
print(f"\nNeural Network Validation Accuracy Difference: {accuracy_difference_nn:.4f}")

# Check Consistency and Performance
all_above_threshold = min(optimized_dt_accuracy, simple_lr_accuracy, nn_accuracy) > 0.8
optimized_beat_simple = optimized_dt_accuracy > simple_lr_accuracy and nn_accuracy > simple_lr_accuracy
if all_above_threshold and optimized_beat_simple:
    print("\nAll models show consistent results above 80%, with the optimized models outperforming the simple models.")
else:
    print("\nModels do not meet the specified criteria.")



Evaluating Logistic Regression (Simple Model):
----------------------------------------------
Logistic Regression Accuracy: 0.8644

Training and Evaluating Optimized Decision Tree Model:
------------------------------------------------------
Optimized Decision Tree Accuracy: 0.8463

Decision Tree Validation Accuracy Difference: -0.0181

Training and Evaluating Optimized Neural Network Model:
--------------------------------------------------------
Neural Network Accuracy: 0.8644

Neural Network Validation Accuracy Difference: 0.0000

Models do not meet the specified criteria.
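
The accuracies above are all measured on the test set. `GridSearchCV` also exposes the cross-validated score of the winning configuration, which is the figure that properly counts as validation accuracy. The sketch below, on synthetic data with illustrative names (`search`, `param_grid`), shows how to read the selected hyperparameters and both scores.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, split the same way as in the notebook.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Small grid; a fixed random_state makes the search repeatable.
param_grid = {"max_depth": [3, 5, 7], "min_samples_leaf": [1, 2, 4]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_tr, y_tr)

# best_score_ is the mean cross-validated accuracy of the best configuration.
cv_accuracy = search.best_score_
# search.predict uses the best estimator refitted on the full training set.
test_accuracy = accuracy_score(y_te, search.predict(X_te))

print("Best parameters:", search.best_params_)
print(f"Cross-validated accuracy: {cv_accuracy:.4f}")
print(f"Held-out test accuracy:   {test_accuracy:.4f}")
```

Reporting both numbers makes it easier to tell whether a tuned model generalizes or merely fits the folds it was selected on.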
