<a href="https://colab.research.google.com/github/lcbjrrr/ML315/blob/main/ML315_4_Naive.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Naive Bayes



This dataset records risk levels for several types of natural disaster, together with whether significant material loss occurred. The columns most likely mean the following:

* **quake_lvl:** The level or severity of earthquake risk. Higher numbers likely indicate a greater risk.
* **flood_lvl:** The level or severity of flood risk. As with quake_lvl, higher numbers mean higher risk.
* **hurricane_lvl:** The level or severity of hurricane risk. Again, higher numbers suggest a greater risk.
* **wildfire_lvl:** The level or severity of wildfire risk. Higher values indicate higher risk. Unlike the other columns, this one takes decimal values, which suggests a finer-grained assessment of wildfire risk.
* **sig_material_loss:** A binary variable (0 or 1) indicating whether significant material loss occurred. 1 likely represents significant loss, while 0 represents no significant loss.

In [None]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/lcbjrrr/ML315/refs/heads/main/datasets/weather_loss.csv')
df.head(3)


Unnamed: 0,quake_lvl,flood_lvl,hurricane_lvl,wildfire_lvl,sig_material_loss
0,8,0,0,1.0,1
1,0,0,4,0.0,1
2,8,0,4,0.7,0


In [None]:
%%html
<blockquote class="twitter-tweet" data-media-max-width="560"><p lang="zxx" dir="ltr"><a href="https://t.co/JuOCz2MSsx">pic.twitter.com/JuOCz2MSsx</a></p>&mdash; luiz barboza (@luiz_barboza) <a href="https://twitter.com/luiz_barboza/status/1870323913072267465?ref_src=twsrc%5Etfw">December 21, 2024</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

Naive Bayes is a simple way to classify things using probabilities. It's like guessing what fruit someone has by asking questions ("Is it red?", "Is it round?"). It uses:

- Prior probabilities: Initial guesses (e.g., apples are more common than mangoes).
- Likelihoods: How likely a feature is for a certain type of fruit (e.g., apples are often red).

It combines these with Bayes' theorem to make a better guess. The "naive" part is the assumption that the answers to all questions are independent of one another once the type of fruit is known (which isn't always true). Despite this, it's useful for spam filtering, document classification, and more.

![](https://miro.medium.com/v2/resize:fit:1400/1*LB-G6WBuswEfpg20FMighA.png)
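To make the combination of priors and likelihoods concrete, here is a minimal sketch of the fruit example. All probabilities below are hypothetical values chosen for illustration, and `posterior` is a helper name introduced here, not part of any library.

```python
# Hypothetical priors: how common each fruit is before asking any question.
priors = {"apple": 0.7, "mango": 0.3}

# Hypothetical likelihoods: P(feature is present | fruit).
likelihoods = {
    "apple": {"red": 0.8, "round": 0.9},
    "mango": {"red": 0.1, "round": 0.3},
}

def posterior(observed_features):
    """Return P(fruit | features), treating features as independent given the fruit."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for feature in observed_features:
            # The "naive" step: multiply the per-feature likelihoods together.
            score *= likelihoods[label][feature]
        scores[label] = score
    total = sum(scores.values())
    # Normalize so the probabilities sum to 1.
    return {label: score / total for label, score in scores.items()}

result = posterior(["red", "round"])
print(result)
```

With these numbers, a red and round fruit is overwhelmingly likely to be an apple, because both the prior and the likelihoods favor it.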

A train-test split divides data into two parts: a training set (used to train the model) and a testing set (used to evaluate its performance on unseen data). The split does not prevent overfitting by itself, but it exposes it: a model that scores well on the training set and poorly on the testing set has overfit, so test performance is the more realistic measure of how well the model generalizes. Important considerations include splitting at random, keeping the class distribution similar in both parts (stratified sampling helps with imbalanced data), and possibly averaging performance over multiple splits.
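As a sketch of the stratified sampling mentioned above, the snippet below uses synthetic data (the arrays `X_demo` and `y_demo` are made up for illustration, not the disaster dataset) and the `stratify` and `random_state` parameters of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data: 85 positive rows and 15 negative rows.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 10, size=(100, 4))
y_demo = np.array([1] * 85 + [0] * 15)

# stratify=y_demo keeps the class ratio the same in both parts;
# random_state makes the split reproducible between runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=42
)

# Share of class 1 in the training and testing parts.
print(y_tr.mean(), y_te.mean())
```

Both shares come out at 0.85, matching the full data. Without `stratify`, a small test set can end up with very few rows of the minority class purely by chance.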


In [None]:
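from sklearn.model_selection import train_test_split

features = ['quake_lvl', 'flood_lvl', 'hurricane_lvl', 'wildfire_lvl']
# Hold out 20% of the rows for testing; no random_state is set, so the split differs between runs
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df['sig_material_loss'], test_size=0.20
)
X_train.head(3)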

Unnamed: 0,quake_lvl,flood_lvl,hurricane_lvl,wildfire_lvl
1,0,0,4,0.0
61,8,0,4,0.7
80,5,0,4,0.7


The training step fits the Naive Bayes model to the training data (X_train, y_train). nb.score() then returns the accuracy, the fraction of correct predictions, here measured on the training data and stored in acuracia_treino ("training accuracy"); multiplying by 100 turns it into a percentage. Training accuracy alone can be optimistic, so it is important to also check accuracy on the test data for a realistic measure of performance.



In [None]:
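from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb.fit(X_train, y_train)  # estimate class priors and per-feature likelihoods
acuracia_treino = nb.score(X_train, y_train)  # fraction of correct predictions on the training set
acuracia_treino * 100  # expressed as a percentage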


73.75
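To show what `score()` measures, the sketch below checks that it matches `accuracy_score` applied to the model's own predictions. It runs on synthetic data (`X_demo`, `y_demo`, and `demo_nb` are illustrative names, not the notebook's variables), so the number it prints is unrelated to the 73.75 above.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Synthetic data; MultinomialNB expects non-negative feature values.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 10, size=(60, 4))
y_demo = rng.integers(0, 2, size=60)

demo_nb = MultinomialNB().fit(X_demo, y_demo)

# score() returns a fraction between 0 and 1 ...
score_fraction = demo_nb.score(X_demo, y_demo)
# ... equal to the accuracy of predict() against the true labels.
manual_fraction = accuracy_score(y_demo, demo_nb.predict(X_demo))

print(score_fraction, manual_fraction)
```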

The trained Naive Bayes model (nb) is now used to predict the class labels for the held-out data (X_test). The predicted labels are stored in the variable preds.


In [None]:
preds = nb.predict(X_test)
preds

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

Let's calculate the main metrics for the testing dataset:
- Accuracy: Overall correctness (right predictions / total predictions). Most informative when the classes are balanced.
- Confusion Matrix: Table showing True/False Positives/Negatives. Gives a detailed view of prediction errors.
- Precision: Of all "yes" predictions, how many were correct? (TP / (TP + FP)). High precision means few false positives.
- Recall: Of all actual "yes" cases, how many were found? (TP / (TP + FN)). High recall means few false negatives.
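The formulas above can be sketched directly from the four confusion-matrix counts. The helper `binary_metrics` and the counts passed to it are hypothetical, chosen only to illustrate the arithmetic.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Fall back to 0.0 when the denominator is zero, e.g. no "yes" predictions at all.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical counts, not taken from the dataset above.
accuracy, precision, recall = binary_metrics(tp=40, fp=10, fn=5, tn=45)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# prints: accuracy=0.85 precision=0.80 recall=0.89
```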


In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
test_accuracy = accuracy_score(y_test,preds)*100
print(test_accuracy)
cm = confusion_matrix(y_test,preds)
print(cm)
print(classification_report(y_test,preds))

85.0
[[ 0  3]
 [ 0 17]]
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         3
           1       0.85      1.00      0.92        17

    accuracy                           0.85        20
   macro avg       0.42      0.50      0.46        20
weighted avg       0.72      0.85      0.78        20
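The report above deserves a closer look: the model predicted class 1 for every test row, so the 85% accuracy is exactly what always guessing the majority class would achieve. The sketch below redoes that arithmetic from the confusion matrix shown above, assuming scikit-learn's layout (rows are actual classes, columns are predicted classes).

```python
# Confusion matrix copied from the output above.
cm = [[0, 3],
      [0, 17]]

tn, fp = cm[0]  # actual class 0: predicted 0, predicted 1
fn, tp = cm[1]  # actual class 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
# Accuracy of a "model" that always predicts the most frequent actual class.
majority_baseline = max(tn + fp, fn + tp) / total
# Share of actual class-0 rows that the model identified.
recall_class0 = tn / (tn + fp)

print(accuracy, majority_baseline, recall_class0)
# prints: 0.85 0.85 0.0
```

Because class 0 is never predicted, its precision is an undefined 0/0, which scikit-learn reports as 0.00. On a test set this small and imbalanced, the per-class recall says more about the model than the accuracy does.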



