<a href="https://colab.research.google.com/github/ngabo-dev/water-model-peer-group-4/blob/main/Report.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Nhial Majok Riak - Optimization Parameters.

I implemented a classification model to predict unsafe water, focusing on ensuring a balanced recall and precision despite imbalanced data. My approach included:

Regularizer: L2(0.001) to penalize large weights and reduce overfitting.

Dropout Rate: 0.3 applied after each Dense layer.

Optimizer: Adam with a learning rate of 0.0004 for fine gradient updates.

Early Stopping: Based on val_loss with patience=6 and restore_best_weights=True to prevent overfitting and recover the best weights.
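The settings listed above could be wired together in Keras roughly as follows. This is a minimal sketch, not the exact notebook code: the hidden-layer sizes (64 and 32), the ReLU activations, and the helper name `build_model` are assumptions made for illustration, and only the regularizer, dropout rate, optimizer, and early-stopping values come from the list above.

```python
import tensorflow as tf


def build_model(n_features: int) -> tf.keras.Model:
    """Binary classifier using the hyperparameters listed above.

    Layer widths and activations are illustrative assumptions.
    """
    l2 = tf.keras.regularizers.l2(0.001)  # penalizes large weights
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(64, activation="relu", kernel_regularizer=l2),
        tf.keras.layers.Dropout(0.3),  # dropout after each hidden Dense layer
        tf.keras.layers.Dense(32, activation="relu", kernel_regularizer=l2),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # single binary output
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.0004),
        loss="binary_crossentropy",
        metrics=["accuracy", tf.keras.metrics.Precision(), tf.keras.metrics.Recall()],
    )
    return model


# Stop when val_loss has not improved for 6 epochs and roll back to the best weights.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=6, restore_best_weights=True
)
```

The callback would then be passed to training as `model.fit(..., callbacks=[early_stop])`, together with a validation split so that `val_loss` is available.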

Model Evaluation & Comparison

My Model Results

| Metric    | Value  |
|-----------|--------|
| Accuracy  | 0.6545 |
| Precision | 0.5573 |
| Recall    | 0.5573 |
| F1 Score  | 0.5573 |

While the accuracy is moderate, all key metrics are balanced, showing that the model does not favor the majority class. This is an improvement from earlier models that had high accuracy but recall and precision at 0.0, meaning they predicted only the majority class.
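The failure mode described above, high accuracy with precision and recall at 0.0, can be reproduced with a few lines of plain Python. The class split below (61 negatives, 39 positives) is a made-up example, not the project's actual data, and `binary_metrics` is a hypothetical helper written for this illustration.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical imbalanced labels: 61 negatives, 39 positives.
y_true = [0] * 61 + [1] * 39
# A degenerate model that always predicts the majority class.
always_majority = [0] * len(y_true)

metrics = binary_metrics(y_true, always_majority)
print(metrics)
```

Accuracy here equals the share of the majority class, while precision, recall and F1 are all zero because no positive is ever predicted.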

Comparison with Benitha Uwituze’s Model

| Metric    | Benitha | Nhial  |
|-----------|---------|--------|
| Accuracy  | 0.9720  | 0.6545 |
| Precision | 0.9664  | 0.5573 |
| Recall    | 0.8464  | 0.5573 |
| F1 Score  | 0.9024  | 0.5573 |

Interpretation:

Benitha's model outperformed mine in every metric, with strong recall (84.64%) and precision (96.64%).

Why my model underperformed:


Likely underfitting due to conservative regularization and dropout settings.

Smaller learning rate (0.0004) may have slowed convergence.

I used a basic L2 regularizer, while Benitha may have used more effective techniques or additional data preprocessing.

What I improved over time:


Earlier versions of my model had 0% recall and precision despite over 60% accuracy.

Now, balanced metrics indicate the model predicts both classes, not just the majority one.


 Insights and Challenges

Insights

Overfitting was initially addressed using Dropout and L2 regularization.

Early stopping helped prevent continued training on noise.

Precision and recall moved from 0.0 to about 0.56, showing progress in learning from both classes.

Challenges

Initial models were biased toward the majority class, failing to identify unsafe water.

Tuning learning rate and dropout required several tests to avoid underfitting.

Model performance plateaued, suggesting possible need for feature engineering or resampling techniques.



Final Summary Table

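| Train Instance | Engineer Name | Regularizer | Optimizer (LR) | Early Stopping Criteria | Dropout Rate | Accuracy | F1 Score | Recall | Precision |
|----------------|---------------|-------------|----------------|-------------------------|--------------|----------|----------|--------|-----------|
| 1 | Nhial Majok Riak Maketh | L2 (0.001) | Adam (lr=0.0004) | val_loss, patience=6, restore_best_weights=True | 0.3 | 0.6545 | 0.5573 | 0.5573 | 0.5573 |
| 2 | Benitha Uwituze | Unknown (Likely L2) | Adam (lr unknown) | Early stopping used | Unknown | 0.9720 | 0.9024 | 0.8464 | 0.9664 |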



**Jean Pierre**

My model compared with Nhial's:

Accuracy: On the test set, my model achieved an accuracy of 0.747, which is higher than Nhial’s model accuracy of 0.6768.

Precision: My model attained a test precision of 0.747, outperforming Nhial’s model, which had a precision of 0.5943.

Recall: My model recorded a recall of 0.747, which is higher than Nhial’s model recall of 0.5417.

F1 Score: My model achieved a better F1 score of 0.715, compared to Nhial’s F1 score of 0.5668. This suggests that my model had better balance between precision and recall.

Regularization: Nhial used L2 regularization (l2=0.0005) to prevent overfitting, while I applied a combination of L1 and L2 regularization (l1=0.001, l2=0.001). This combination helped in both generalization and sparsity in the learned weights.
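To make the difference between the two regularization setups concrete, the penalty added to the loss can be computed by hand. The sketch below follows the usual definition (an L1 term on absolute weights plus an L2 term on squared weights); the weight vector is a made-up example and `l1_l2_penalty` is a hypothetical helper, not code from either notebook.

```python
def l1_l2_penalty(weights, l1=0.0, l2=0.0):
    """Regularization term added to the loss: l1 * sum(|w|) + l2 * sum(w^2)."""
    return l1 * sum(abs(w) for w in weights) + l2 * sum(w * w for w in weights)


# Hypothetical weights of one layer.
weights = [0.5, -0.2, 0.0, 1.0]

# L2 only, as described for Nhial's model (l2=0.0005).
penalty_l2_only = l1_l2_penalty(weights, l2=0.0005)

# Combined L1 + L2, as described for my model (l1=0.001, l2=0.001).
penalty_l1_l2 = l1_l2_penalty(weights, l1=0.001, l2=0.001)

print(penalty_l2_only, penalty_l1_l2)
```

The L1 term grows linearly with each weight's magnitude, so it keeps pushing small weights toward exactly zero, which is where the sparsity mentioned above comes from. The L2 term mostly shrinks large weights.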

Activation Functions: Nhial used LeakyReLU activations to allow small gradients during negative inputs, potentially preventing dead neurons. I used standard ReLU which is simpler and works well in many scenarios.

Dropout Rate: Nhial used a higher dropout rate of 0.35, while I used 0.25. This lower dropout rate in my model may have allowed more neurons to be active, helping the model learn better representations without overfitting.

Learning Rate: Nhial trained his model with a conservative learning rate of 0.0003 to ensure stable convergence. I used a higher learning rate of 0.001, which may have sped up learning while still maintaining generalization thanks to regularization and early stopping.

Threshold Tuning: Nhial manually adjusted the classification threshold to 0.4 to boost recall, indicating an attempt to address class imbalance. I kept the default softmax-argmax method suitable for multiclass classification.
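The effect of lowering the decision threshold can be illustrated with a small example. The probabilities and labels below are invented for the illustration, and both helper functions are hypothetical. Only the threshold values 0.5 and 0.4 relate to the text above.

```python
def predict_with_threshold(probs, threshold=0.5):
    """Label a sample positive when its predicted probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probs]


def recall(y_true, y_pred):
    """Fraction of actual positives that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if tp + fn else 0.0


# Hypothetical sigmoid outputs and true labels.
y_true = [0, 1, 1, 1, 1, 0]
probs = [0.15, 0.42, 0.48, 0.55, 0.81, 0.30]

recall_default = recall(y_true, predict_with_threshold(probs, 0.5))
recall_lowered = recall(y_true, predict_with_threshold(probs, 0.4))
print(recall_default, recall_lowered)
```

In this toy data two positives sit just below 0.5, so moving the threshold to 0.4 recovers them. On real data a lower threshold usually trades some precision for the extra recall.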

Loss Function: Nhial’s use of binary_crossentropy aligns with his binary target, while I used categorical_crossentropy for one-hot encoded outputs, as my setup was designed for multiclass classification.

Output Layer: Nhial used a sigmoid output neuron for binary classification. I used a softmax layer, outputting a probability distribution over multiple classes, consistent with the categorical_crossentropy loss.

Epochs and Early Stopping: Nhial trained up to 150 epochs with early stopping patience set to 8, while I allowed 200 epochs but set the patience to 6 with a min_delta of 0.002, so that only improvements larger than 0.002 in the monitored metric counted as progress and reset the patience counter.
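How patience and min_delta interact can be seen in a small simulation of the stopping rule. This is a simplified sketch of the logic for a metric that should decrease (such as validation loss), not the library implementation. The loss values are invented and `early_stopping_epoch` is a hypothetical helper.

```python
def early_stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None if it never stops.

    An epoch counts as an improvement only if the loss drops by more than
    min_delta below the best value seen so far.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            wait = 0  # real improvement: reset the patience counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses that keep improving, but only by tiny amounts.
val_losses = [0.700, 0.650, 0.6495, 0.6492, 0.6490, 0.6488, 0.6486, 0.6484, 0.6483]

stop_with_min_delta = early_stopping_epoch(val_losses, patience=6, min_delta=0.002)
stop_without_min_delta = early_stopping_epoch(val_losses, patience=6, min_delta=0.0)
print(stop_with_min_delta, stop_without_min_delta)
```

With min_delta=0.002 the tiny gains after epoch 2 are ignored, so training stops once six such epochs have passed. With min_delta=0 every small gain resets the counter and training continues.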

**Insights from My Experiment:**

I found that combining L1 and L2 regularization made the model more robust and reduced overfitting.

Using dropout with moderation (0.25) and early stopping helped my model generalize better on unseen data.

I noticed that precision, recall, and F1 scores improved when I used a balanced structure of hidden layers (128 → 64) along with appropriate regularization and dropout.

**Challenges I Faced:**

One of the main challenges was choosing the right regularization technique. Initially, L2 alone wasn't giving good generalization, so I had to switch to L1L2.

Ensuring the correct use of one-hot encoded labels with categorical_crossentropy required modifying my target arrays.

Finding the right balance in dropout and learning rate took several iterations to avoid underfitting or overfitting.





 **Placide Imanzi Kabisa**

**Analysis 1 – Comparison With Prince Rurangwa's Model**

I compared my model with Prince Rurangwa’s. We used different hyperparameters, which led to different results in performance. Here’s what I noticed:

**Precision**

Mine: 0.704

Prince: 0.653

My model is clearly better here. It made fewer mistakes when predicting unsafe water. I think the stronger regularization (L1 + L2) and higher dropout (0.4) made it more careful and reduced overconfident predictions. Prince used only L1 and a lower dropout (0.3), which probably gave his model more flexibility but made it a bit less accurate in those unsafe predictions.

**Recall**

Mine: 0.269

Prince: 0.333

This is where Prince’s model did better. His model caught more of the unsafe water cases, which is more important in a real-life setting. I suspect that my model was too strict because of the strong regularization and the Adagrad optimizer with a high learning rate. His use of RMSprop and lighter regularization helped his model be more open and detect more unsafe samples.

**F1 Score**

Mine: 0.389

Prince: 0.441

Because his model had better recall while still keeping decent precision, the F1 score came out better overall. Mine had high precision but recall was too low, which dragged the F1 score down. So Prince's model was more balanced.
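
The F1 values quoted above follow directly from the precision and recall figures, since F1 is their harmonic mean. A short check, using only the numbers reported in this comparison (`f1_score` here is a small helper written for the illustration, not a library call):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1_mine = f1_score(0.704, 0.269)
f1_prince = f1_score(0.653, 0.333)
print(round(f1_mine, 3), round(f1_prince, 3))
```

Because the harmonic mean is pulled toward the smaller of the two inputs, a recall of 0.269 caps the F1 score well below the 0.704 precision.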

**Accuracy**

Mine: 0.681

Prince: 0.681

We both had the same accuracy, but I don’t think this tells the full story. It’s easy to get decent accuracy if the model just predicts most samples as safe. What really matters here is whether the model correctly catches unsafe water, and Prince’s model did better on that.

**Conclusion**

Best model: Prince’s

He got the better balance overall and did a better job at finding unsafe water, which is the main goal.

His use of RMSprop and milder regularization helped his model learn better without overfitting.

Weaker model: Mine

Even though I had better precision, the recall was too low. That means my model misses too many unsafe samples, which is not great for this kind of task.

I probably over-regularized it, and Adagrad might not have been the best choice here.





**Analysis 2 – Comparison With Uwituze Benitha’s Model**

Now I’m comparing my model with Benitha’s. She used a different optimizer (Adam instead of Adagrad), only L2 regularization (not L1 or both), and a smaller learning rate. Here's how we compare:

**Precision**

Mine: 0.704

Benitha: 0.6591

I did better here again. My model was more precise when predicting unsafe water. That means it gave fewer false alarms. I think this comes from using both L1 and L2 regularization and a higher dropout (0.4), which made my model more careful. Benitha only used L2 and a dropout of 0.3, so her model was more flexible, which probably helped with recall but slightly lowered precision.

**Recall**

Mine: 0.2688

Benitha: 0.3118

Her model performed better here. It caught more of the unsafe water cases. I believe her use of Adam optimizer with a small learning rate (0.0005) helped the model learn slower and more steadily, which probably allowed it to find more real positives. My Adagrad optimizer with a learning rate of 0.01 might have caused my model to miss important patterns, especially for the positive class.

**F1 Score**

Mine: 0.3891

Benitha: 0.4234

F1 score favors Benitha’s model again. It shows that she found a better balance between precision and recall. I had higher precision, but my low recall pulled my F1 down. Benitha’s model, while not perfect, didn’t sacrifice recall as much as mine did, and that made her F1 stronger.

**Accuracy**

Mine: 0.6809

Benitha: 0.6789

Very close. We’re almost tied here. But again, accuracy isn’t that meaningful in this case. A model can have high accuracy even if it misses unsafe samples, so I focused more on recall and F1 when judging the performance.

**Conclusion**

Best model: Benitha’s

Her F1 score and recall were both better, meaning she found more unsafe water while still keeping good precision.

Her use of Adam with a very low learning rate helped the model learn more gradually and carefully.

She balanced generalization well with L2 regularization and a decent dropout.

Weaker model: Mine

Precision was higher, but recall was too low, so overall F1 dropped.

The combination of Adagrad, high learning rate, strong regularization, and high dropout probably made the model too conservative, so it missed too many unsafe samples.


**Insights from Experiments & Challenges Faced**

From running my experiments, I learned that using a combination of L1 and L2 regularization together with a high dropout rate (0.4) made my model too conservative. It became less flexible and struggled to learn the patterns properly. This is probably why my recall was low: the model was being too cautious and missed a lot of unsafe water samples.

I also realized that Adagrad wasn’t the best optimizer for this task. It accumulates squared gradients over the whole run, so its effective learning rate only ever shrinks and updates can become very small late in training, which might explain why my model didn’t improve much after a certain point.

One of the biggest challenges I faced was choosing the right optimizer and learning rate. Small changes made a big difference in results, and it was hard to know what combination would work best without trying a lot of options.

Another challenge was balancing precision and recall. Sometimes when I improved one, the other would drop. It took a lot of trial and error to understand how different settings affected both sides of the model's performance.
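
The shrinking step size of Adagrad mentioned above can be shown with a textbook version of its update rule. This is a simplified sketch under stated assumptions: the accumulator starts at zero (library implementations may use a small non-zero initial value), the gradient is held constant at 1.0 purely for illustration, and `adagrad_effective_lrs` is a hypothetical helper.

```python
import math


def adagrad_effective_lrs(gradients, lr=0.01, eps=1e-7):
    """Effective per-step learning rate of textbook Adagrad for one parameter.

    The squared gradients are accumulated and never decay, so the effective
    rate lr / (sqrt(accumulator) + eps) can only decrease over time.
    """
    accumulator = 0.0
    rates = []
    for g in gradients:
        accumulator += g * g
        rates.append(lr / (math.sqrt(accumulator) + eps))
    return rates


# Constant gradient of 1.0 for 100 steps, base learning rate 0.01 as in my setup.
rates = adagrad_effective_lrs([1.0] * 100, lr=0.01)
print(rates[0], rates[-1])
```

Under these assumptions the effective rate falls by a factor of ten over 100 steps. Optimizers such as RMSprop and Adam avoid this by using a decaying average of squared gradients instead of a running sum.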



## Benitha Uwituze's Evaluation

### Model Evaluation: Benitha vs Placide

#### Performance Metrics Comparison

| Metric           | Benitha | Placide|
|------------------|---------|--------|
| Test Accuracy    | 0.6789  | 0.6809 |
| Test Loss        | 0.6387  |        |
| F1 Score         | 0.4234  | 0.3891 |
| Precision        | 0.6591  | 0.7042 |
| Recall           | 0.3118  | 0.2688 |

#### Metric Interpretation

**F1 Score (0.42 vs 0.39)**: Both scores are poor (<0.5), indicating sub-optimal performance for binary classification. My model performs slightly better with F1=0.42, showing marginally better balance between precision and recall.

**Precision (0.66 vs 0.70)**: While both of our precision metrics are moderate and acceptable, Placide's model has higher precision (70% vs 66%), meaning it makes fewer false positive errors when predicting water as potable.

**Recall (0.31 vs 0.27)**: My model has higher recall (31% vs 27%), meaning it catches more cases of non-potable water. Both values are critically low for water safety applications where missing contaminated water is dangerous.

**Test Loss (0.64)**: My loss of 0.64 is high, indicating poor model fit. Values should ideally be <0.5 for good binary classification performance.
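
One way to put the test loss in context is to compare it with the loss of a model that knows nothing. The sketch below assumes the reported loss is binary cross-entropy, which is not stated explicitly in this evaluation. The labels are invented and `binary_cross_entropy` is a hypothetical helper.

```python
import math


def binary_cross_entropy(y_true, probs, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# A model that always outputs 0.5 scores ln(2) regardless of the labels.
baseline = binary_cross_entropy([0, 1, 1, 0], [0.5, 0.5, 0.5, 0.5])
print(round(baseline, 4))
```

Under that assumption, the always-0.5 baseline sits at about 0.693, so a test loss of 0.64 is only a modest improvement over an uninformed predictor.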

#### Why My Model Performs Better

1. **Better F1 Score (0.42 vs 0.39)**: My model achieves better overall performance balance. The higher F1 score directly results from my model's better recall (0.31 vs 0.27), which is critical for water safety since missing contaminated water has severe consequences.

2. **Better Recall Performance**: My model's recall of 0.31 vs Placide's 0.27 means I detect about 16% more cases of non-potable water in relative terms. This is crucial for the water quality task where false negatives (missing contaminated water) pose health risks.

#### Why Placide's Model Performs Better

1. **Higher Precision (0.70 vs 0.66)**: Placide's model makes fewer false alarms, with 70% precision when predicting water as potable compared to my 66%. This reduces unnecessary water treatment costs and user confusion.

2. **Slightly Better Test Accuracy (0.6809 vs 0.6789)**: Placide has marginally higher overall classification accuracy (a difference of 0.002), indicating very slightly better general performance across both potable and non-potable classifications.

#### Conclusion

Both models show poor performance with low F1 scores and critically low recall rates. However, my model is better overall due to superior F1 score and recall, which are more important for water safety applications where missing contaminated water poses health risks.



### Model Evaluation: Benitha vs Jean Pierre

#### Performance Metrics Comparison

| Metric        | Benitha | Jean Pierre |
|---------------|---------|-------------|
| Test Accuracy | 0.6789  | 0.659       |
| Test Loss     | 0.6387  | 0.665       |
| F1 Score      | 0.4234  | 0.634       |
| Precision     | 0.6591  | 0.659       |
| Recall        | 0.3118  | 0.659       |

#### Metric Interpretation

**F1 Score (0.42 vs 0.63)**: Jean Pierre's model significantly outperforms mine with F1=0.63 vs 0.42. His score of 0.63 indicates moderate performance, while my 0.42 shows poor balance between precision and recall for binary classification.

**Precision (0.66 vs 0.66)**: Both models achieve virtually identical precision (~66%), meaning equal accuracy when predicting water as potable. Both of our values are moderate and acceptable.

**Recall (0.31 vs 0.66)**: Jean Pierre's model dramatically outperforms mine with recall of 66% vs 31%. His model catches more than twice as many cases of non-potable water, which is critical for water safety applications.

**Test Loss (0.64 vs 0.67)**: My model has slightly lower loss (0.64 vs 0.67), indicating a better model fit, but both values are high and sub-optimal for binary classification.

#### Why Jean Pierre's Model Performs Better

1. **Better F1 Score (0.63 vs 0.42)**: Jean Pierre's model has better F1 performance, indicating significantly better overall classification balance. This directly stems from his much higher recall while maintaining similar precision.

2. **Critical Recall Advantage (0.66 vs 0.31)**: Jean Pierre's model detects roughly 111% more cases of non-potable water in relative terms (0.659 vs 0.3118 recall), which is more than twice as many. For water safety applications, this dramatic improvement in catching contaminated water is essential for preventing health risks.

#### Why My Model Performs Better

1. **Lower Test Loss (0.64 vs 0.67)**: My model has about 4% lower test loss, indicating a slightly better fit on the held-out data.

2. **Marginally Higher Test Accuracy (0.68 vs 0.66)**: My model achieves about 2 percentage points (3% in relative terms) better overall test classification accuracy, showing slightly better general performance across both water quality classes.

#### Conclusion

Overall, Jean Pierre's model significantly outperforms mine. His superior F1 score (0.63 vs 0.42) and dramatically better recall (0.66 vs 0.31) make it more suitable for water safety applications where detecting contaminated water is critical. Despite my model's slight advantages in loss and accuracy, Jean Pierre's balanced performance makes it a better choice for this task.
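
The relative differences quoted in the two comparisons can be recomputed from the unrounded table values. A small sketch, using only numbers from the tables above (`relative_change` is a hypothetical helper). Results can shift by about a point depending on whether the rounded or unrounded inputs are used.

```python
def relative_change(value: float, reference: float) -> float:
    """Change of value relative to reference, as a fraction (0.10 means +10%)."""
    return (value - reference) / reference


# Benitha vs Placide: recall 0.3118 vs 0.2688
recall_vs_placide = relative_change(0.3118, 0.2688)

# Jean Pierre vs Benitha: recall 0.659 vs 0.3118
recall_jp_vs_benitha = relative_change(0.659, 0.3118)

# Benitha vs Jean Pierre: test loss 0.6387 vs 0.665, accuracy 0.6789 vs 0.659
loss_vs_jp = relative_change(0.6387, 0.665)
accuracy_vs_jp = relative_change(0.6789, 0.659)

for name, change in [
    ("recall, Benitha vs Placide", recall_vs_placide),
    ("recall, Jean Pierre vs Benitha", recall_jp_vs_benitha),
    ("loss, Benitha vs Jean Pierre", loss_vs_jp),
    ("accuracy, Benitha vs Jean Pierre", accuracy_vs_jp),
]:
    print(f"{name}: {change * 100:+.1f}%")
```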


## Insights and Challenges

#### Insights:

**Learning Rate Impact**: Lower learning rate (0.001) prevented weight overshooting and made training more stable, though slower to converge.

**L2 Regularization**: 0.01 coefficient worked well - prevented overfitting without killing the model's ability to learn water quality patterns.

**Dropout Rate Balance**: 0.3 dropout was the right middle ground - not too harsh like 0.5 but still effective at preventing overfitting.

**Effect of a Single Hyperparameter**: While searching over hyperparameter combinations, I changed the learning rate from 0.005 to 0.001, which improved accuracy by about 2 points (from 0.68 to roughly 0.70). The smaller steps likely made training more stable, consistent with the learning rate insight above, at the cost of needing more epochs to converge.

#### Challenges:

**Missing Data Problem**: Dataset had lots of missing values. Mean imputation was quick but might have messed with data patterns.

**Feature Scaling Issues**: Different scales across features caused issues during training. Had to figure out StandardScaler was needed through trial and error.
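
The two preprocessing steps named above, mean imputation and standardization, amount to the following arithmetic. This is a plain-Python sketch on one invented column rather than the notebook's code. In practice scikit-learn's `SimpleImputer(strategy="mean")` and `StandardScaler` do the same job, and their statistics should be fitted on the training split only.

```python
import math
from statistics import fmean, pstdev


def is_missing(x):
    """Treat None and NaN as missing values."""
    return x is None or math.isnan(x)


def impute_mean(column):
    """Replace missing entries with the mean of the observed entries."""
    mean = fmean(x for x in column if not is_missing(x))
    return [mean if is_missing(x) else x for x in column]


def standardize(column):
    """Rescale to zero mean and unit variance (population standard deviation)."""
    mean = fmean(column)
    std = pstdev(column)
    if std == 0:
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


# Hypothetical feature column with one missing value.
ph = [7.0, None, 6.0, 8.0]

filled = impute_mean(ph)
scaled = standardize(filled)
print(filled, scaled)
```

Mean imputation leaves the column mean unchanged but shrinks its variance, because every filled-in value sits exactly at the mean. That is one way it can distort the data patterns noted above.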