### Dependencies

In [None]:
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score

%matplotlib inline
# Suppress warnings
warnings.filterwarnings("ignore")

### Auxiliary functions

In [None]:
def cross_validate(estimator, train, validation):
    # Unpack the (features, labels) pairs.
    X_train, Y_train = train
    X_val, Y_val = validation

    # Score the estimator that was passed in, not a global model.
    # sklearn metrics expect the true labels first, then the predictions.
    train_predictions = estimator.predict(X_train)
    train_accuracy = accuracy_score(Y_train, train_predictions)
    train_recall = recall_score(Y_train, train_predictions)
    train_precision = precision_score(Y_train, train_predictions)

    val_predictions = estimator.predict(X_val)
    val_accuracy = accuracy_score(Y_val, val_predictions)
    val_recall = recall_score(Y_val, val_predictions)
    val_precision = precision_score(Y_val, val_predictions)

    print('Model metrics')
    print('Accuracy  Train: %.2f, Validation: %.2f' % (train_accuracy, val_accuracy))
    print('Recall    Train: %.2f, Validation: %.2f' % (train_recall, val_recall))
    print('Precision Train: %.2f, Validation: %.2f' % (train_precision, val_precision))

    return train_accuracy, train_recall, train_precision, val_accuracy, val_recall, val_precision
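
A note on the metric calls: scikit-learn's scoring functions take the true labels first and the predictions second. The toy labels below are made up for illustration. They show that swapping the two arguments leaves accuracy unchanged but exchanges precision and recall.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels, chosen only so that precision and recall differ.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1]

# Documented order: (y_true, y_pred).
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# Swapped order: accuracy is symmetric, but precision and recall trade places.
swapped_accuracy = accuracy_score(y_pred, y_true)
swapped_precision = precision_score(y_pred, y_true)
swapped_recall = recall_score(y_pred, y_true)

print('accuracy=%.2f precision=%.2f recall=%.2f' % (accuracy, precision, recall))
print('swapped: accuracy=%.2f precision=%.2f recall=%.2f'
      % (swapped_accuracy, swapped_precision, swapped_recall))
```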

### Load data

In [None]:
train_raw = pd.read_csv('/content/Taitanic.csv')
test_raw = pd.read_csv('/content/Taitanic.csv')
test_ids = test_raw['PassengerId'].values

# Join data to analyse and process the set as one.
train_raw['train'] = 1
test_raw['train'] = 0
data = pd.concat([train_raw, test_raw], ignore_index=True)

### Overview the data

In [None]:
data.head()

In [None]:
data.describe()

One advantage of Bayesian models is that they work reasonably well with small datasets. More data would give more accurate probability estimates, but these models are not as data hungry as something like deep learning.

### Pre-process
* feature selection, data cleaning, feature engineering and data imputation

In [None]:
features = ['Age', 'Embarked', 'Fare', 'Parch', 'Pclass', 'Sex', 'SibSp']
target = 'Survived'

data = data[features + [target] + ['train']]
# Categorical values need to be transformed into numeric.
data['Sex'] = data['Sex'].replace(["female", "male"], [0, 1])
data['Embarked'] = data['Embarked'].replace(['S', 'C', 'Q'], [1, 2, 3])
# Bin 'Age' into up to 10 quantile-based buckets (integer codes); missing ages stay NaN.
data['Age'] = pd.qcut(data['Age'], 10, labels=False, duplicates='drop')
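
To make the binning step concrete, here is a small sketch of what `pd.qcut` with `labels=False` returns. The `ages` series is invented for illustration and is not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical ages: the integers 1..20 plus one missing value.
ages = pd.Series(list(range(1, 21)) + [np.nan])

# Four quantile-based bins, returned as integer codes (0 = youngest quarter).
age_bins = pd.qcut(ages, 4, labels=False, duplicates='drop')

# Each bin holds the same number of rows here; the missing age stays NaN.
bin_counts = age_bins.value_counts().sort_index()
print(bin_counts)
```

Because the bins are quantile based, each code covers roughly the same number of passengers rather than the same number of years.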

In [None]:
# Split data into train and test.
train = data.query('train == 1')
test = data.query('train == 0')

# Drop missing values from the train set.
train.dropna(axis=0, inplace=True)
labels = train[target].values

Our processed train set

In [None]:
train.head()

### Correlation study
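* As we saw, Naive Bayes models assume that the features are independent of one another (given the class), so let's compute the Pearson correlation coefficient for each pair of features to get a hint of how independent they are. Keep in mind that Pearson correlation only captures linear relationships.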

In [None]:
columns = train[features + [target]].columns.tolist()
nColumns = len(columns)
result = pd.DataFrame(np.zeros((nColumns, nColumns)), columns=columns)

# Apply Pearson correlation on each pair of features.
for col_a in range(nColumns):
    for col_b in range(nColumns):
        result.iloc[[col_a], [col_b]] = pearsonr(train.loc[:, columns[col_a]], train.loc[:,  columns[col_b]])[0]

fig, ax = plt.subplots(figsize=(10,10))
ax = sns.heatmap(result, yticklabels=columns, vmin=-1, vmax=1, annot=True, fmt='.2f', linewidths=.2)
ax.set_title('PCC - Pearson correlation coefficient')
plt.show()

Looking at the correlations between the features, "Fare" and "Pclass" seem to be strongly related, so I'll remove "Pclass". Features like "Sex", "Pclass" and "Fare" should also be good predictors.
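
The pairwise loop above can also be expressed with pandas' built-in `DataFrame.corr`, which computes the Pearson correlation by default. A minimal sketch on made-up data follows; the column values are hypothetical, constructed so that `Fare` falls as `Pclass` rises, and the 0.8 threshold is an arbitrary choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical passengers: a higher class number means a cheaper ticket, plus noise.
pclass = rng.integers(1, 4, size=200)
fare = 100 - 30 * pclass + rng.normal(0, 5, size=200)
sibsp = rng.integers(0, 4, size=200)  # unrelated to the other two by construction
toy = pd.DataFrame({'Pclass': pclass, 'Fare': fare, 'SibSp': sibsp})

# Pearson correlation for every pair of columns in one call.
corr = toy.corr(method='pearson')
print(corr.round(2))

# Flag pairs whose absolute correlation exceeds the chosen threshold.
threshold = 0.8
high_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```

Listing the flagged pairs makes the decision to drop one feature of each pair easier to audit than reading values off a heatmap.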

### Distribution study
* The model also expects each feature to follow a Gaussian (normal) distribution, so let's check that as well.

In [None]:
continuous_numeric_features = ['Age', 'Fare', 'Parch', 'SibSp']
for feature in continuous_numeric_features:
    # Histogram with a kernel density estimate overlaid.
    sns.histplot(train[feature], kde=True)
    plt.show()

Looking at our continuous numeric features, we can see that "Fare", "Parch" and "SibSp" have distributions that are somewhat close to normal but right-skewed (most of the mass is on the left, with a long tail to the right). "Age" has a distribution a bit different from the others, but maybe it's close enough to Gaussian.
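
Eyeballing histograms can be complemented with a number. One option, not used in the original analysis, is the sample skewness from `scipy.stats.skew`: values near 0 suggest a symmetric distribution, while clearly positive values indicate a long right tail. The two samples below are synthetic.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Synthetic samples: one symmetric, one with a long right tail (like a fare).
symmetric_sample = rng.normal(loc=0.0, scale=1.0, size=1000)
right_skewed_sample = rng.exponential(scale=30.0, size=1000)

symmetric_skew = skew(symmetric_sample)
right_skew = skew(right_skewed_sample)

print('Skewness (normal sample):      %.2f' % symmetric_skew)
print('Skewness (exponential sample): %.2f' % right_skew)
```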

In [None]:
train.drop(['train', target, 'Pclass'], axis=1, inplace=True)
test.drop(['train', target, 'Pclass'], axis=1, inplace=True)

### Split data in train and validation (80% ~ 20%)

In [None]:
X_train, X_val, Y_train, Y_val = train_test_split(train, labels, test_size=0.2, random_state=1)

In [None]:
X_train.head()

### Split train data into two parts

In [None]:
X_train1, X_train2, Y_train1, Y_train2 = train_test_split(X_train, Y_train, test_size=0.3, random_state=12)

In [None]:
classifier = GaussianNB()

#### Fit the first part
* Fitting data here is really fast.

In [None]:
classifier.fit(X_train2, Y_train2)

In [None]:
print('Metrics with only 30% of train data')
train_acc_30, train_rec_30, train_prec_30, val_acc_30, val_rec_30, val_prec_30 = cross_validate(classifier, (X_train, Y_train), (X_val, Y_val))

#### Update the model with the second part
* A nice thing about this kind of model is that you can update it incrementally: `partial_fit` folds new data into the existing model without refitting from scratch.

In [None]:
classifier.partial_fit(X_train1, Y_train1)

In [None]:
print('Metrics with the remaining 70% of train data')
train_acc_100, train_rec_100, train_prec_100, val_acc_100, val_rec_100, val_prec_100 = cross_validate(classifier, (X_train, Y_train), (X_val, Y_val))

As you can see, our results improved after we updated the model with the remaining data.
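
Incremental updating works because Gaussian Naive Bayes only stores per-class counts, means and variances, which can be merged batch by batch. The sketch below uses synthetic data (not the Titanic set) to suggest that fitting in two steps with `partial_fit` gives essentially the same per-class means as fitting on everything at once.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic two-class data: class 1 is shifted relative to class 0.
X = np.vstack([rng.normal(0, 1, size=(100, 3)), rng.normal(2, 1, size=(100, 3))])
y = np.array([0] * 100 + [1] * 100)

# Shuffle so that both batches contain both classes.
order = rng.permutation(len(y))
X, y = X[order], y[order]

# Model A: a single fit on all the data.
full_model = GaussianNB().fit(X, y)

# Model B: two incremental updates. The first partial_fit call must be told
# every class that can ever appear.
incremental_model = GaussianNB()
incremental_model.partial_fit(X[:60], y[:60], classes=np.unique(y))
incremental_model.partial_fit(X[60:], y[60:])

means_match = np.allclose(full_model.theta_, incremental_model.theta_)
print('Per-class means match:', means_match)
```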

The sklearn model also exposes some interesting fitted attributes about the target classes.

In [None]:
print('Probability of each class')
print('Survive = 0: %.2f' % classifier.class_prior_[0])
print('Survive = 1: %.2f' % classifier.class_prior_[1])

In [None]:
print('Mean of each feature per class')
print('               Age         Embarked   Fare         Parch       Sex         SibSp')
print('Survive = 0: %s' % classifier.theta_[0])
print('Survive = 1: %s' % classifier.theta_[1])

In [None]:
print('Variance of each feature per class')
print('Survive = 0: %s' % classifier.var_[0])
print('Survive = 1: %s' % classifier.var_[1])

### Apply the model on the test data and create submission

In [None]:
# sklearn's naive Bayes implementation cannot currently make predictions with missing values, so we need to impute them first.
test.fillna(test.mean(), inplace=True)
test_predictions = classifier.predict(test)
submission = pd.DataFrame({'PassengerId': test_ids})
submission['Survived'] = test_predictions.astype('int')
submission.to_csv('submission.csv', index=False)
submission.head(10)

In [None]:
metrics = ['Accuracy', 'Recall', 'Precision']

# Data for 30% train
train_30_data = [train_acc_30, train_rec_30, train_prec_30]
val_30_data = [val_acc_30, val_rec_30, val_prec_30]

# Data for 100% train
train_100_data = [train_acc_100, train_rec_100, train_prec_100]
val_100_data = [val_acc_100, val_rec_100, val_prec_100]

width = 0.2
r1 = np.arange(len(metrics))
r2 = [x + width for x in r1]
r3 = [x + width for x in r2]
r4 = [x + width for x in r3]

fig, ax = plt.subplots(figsize=(12, 7))

ax.bar(r1, train_30_data, width, label='Train (30% data)', color='skyblue')
ax.bar(r2, val_30_data, width, label='Validation (30% data)', color='steelblue')
ax.bar(r3, train_100_data, width, label='Train (100% data)', color='lightcoral')
ax.bar(r4, val_100_data, width, label='Validation (100% data)', color='indianred')

ax.set_xlabel('Metric', fontweight='bold')
ax.set_ylabel('Score', fontweight='bold')
ax.set_title('Model Performance: 30% vs 100% Training Data', fontweight='bold')
ax.set_xticks([r + 1.5 * width for r in range(len(metrics))])
ax.set_xticklabels(metrics)
ax.legend()
ax.set_ylim(0, 1)

plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

### What are the Pros and Cons of Naive Bayes?
#### Pros:
* Humans are not good at reasoning in systems with limited or conflicting information. It is handy to have something that manages all this limited or conflicting information.
* It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
* When the independence assumption holds, a Naive Bayes classifier can perform better than other models such as logistic regression, and it needs less training data.
* It performs well with categorical input variables compared to numerical ones. For numerical variables, a normal distribution is assumed (a bell curve, which is a strong assumption).

#### Cons:
* Probably the most notable weakness of Bayesian networks (BNs) is the design methodology. There is no standard way of building BNs.
* If a categorical variable has a category in the test data set that was not observed in the training data set, the model will assign it a probability of 0 (zero) and will be unable to make a prediction. This is often known as the “Zero Frequency” problem. To solve it, we can use a smoothing technique. One of the simplest is called Laplace estimation.
* On the other hand, although naive Bayes is a decent classifier, it is known to be a bad probability estimator, so the outputs of `predict_proba` are not to be taken too seriously.
* Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors that are completely independent.
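
To illustrate the zero-frequency problem and Laplace (add-one) smoothing mentioned above, here is a small hand-rolled sketch. The function name `smoothed_probabilities` and the counts are hypothetical, chosen for illustration; scikit-learn's categorical and multinomial Naive Bayes classifiers expose the same idea through their `alpha` parameter.

```python
import numpy as np

def smoothed_probabilities(counts, alpha=1.0):
    """Estimate category probabilities with additive (Laplace) smoothing."""
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

# Hypothetical counts of Embarked = S, C, Q within one class; 'C' was never seen.
counts = [3, 0, 1]

unsmoothed = smoothed_probabilities(counts, alpha=0.0)
smoothed = smoothed_probabilities(counts, alpha=1.0)

print(unsmoothed)  # the unseen category gets probability 0
print(smoothed)    # every category now has a non-zero probability
```

Since Naive Bayes multiplies the per-feature probabilities, a single zero wipes out the whole product for that class, which is why even a small `alpha` matters.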

Designing a BN can take a considerable amount of effort in complex systems, and it relies on the knowledge of the expert(s) who designed it. From another point of view, though, this disadvantage can be a benefit, since BNs can be easily inspected by their designers and the design guarantees that domain-specific information is being used.

### Tips to improve the power of a Naive Bayes model
Here are some tips for improving the power of a Naive Bayes model:

* If continuous features are not normally distributed, apply a transformation (or another method) to bring them closer to a normal distribution.
* If the test data set has the zero-frequency issue, apply a smoothing technique such as the “Laplace Correction” when predicting the class of the test data set.
* Remove correlated features: highly correlated features are effectively counted twice in the model, which can over-inflate their importance.
* Naive Bayes classifiers have limited options for parameter tuning, such as `alpha=1` for smoothing and `fit_prior=[True|False]` to learn class prior probabilities or not, plus a few other options. In scikit-learn these two belong to the multinomial-style classifiers; `GaussianNB`, used here, exposes `priors` and `var_smoothing` instead. I would recommend focusing on your data pre-processing and feature selection.
* You might think of applying a classifier combination technique such as ensembling, bagging or boosting, but these methods tend to help little. Their main purpose is to reduce variance, and Naive Bayes is a high-bias, low-variance model with little variance left to reduce.
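
As a sketch of the first tip, a right-skewed, non-negative feature such as a fare can often be brought closer to a bell shape with a log transform. This is one common choice, not something the analysis above applied, and the data below is synthetic.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Synthetic right-skewed, non-negative feature standing in for a fare.
fare_like = rng.lognormal(mean=3.0, sigma=1.0, size=1000)

# log1p(x) = log(1 + x) is defined at x = 0, which plain log is not.
fare_log = np.log1p(fare_like)

skew_before = skew(fare_like)
skew_after = skew(fare_log)
print('Skewness before: %.2f, after: %.2f' % (skew_before, skew_after))
```

Any transform fitted or chosen on the training data would need to be applied identically to the validation and test data before calling `predict`.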


### References
* [Introduction to Bayesian Networks with Jhonatan de Souza Oliveira - Machine Learning Mastery](https://machinelearningmastery.com/introduction-to-bayesian-networks-with-jhonatan-de-souza-oliveira/)
* [6 Easy Steps to Learn Naive Bayes Algorithm (with codes in Python and R) - Analytics Vidhya](https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/)
* [Better Naive Bayes: 12 Tips To Get The Most From The Naive Bayes Algorithm - Machine Learning Mastery](https://machinelearningmastery.com/better-naive-bayes/)

**Reasoning**:
I will define a Python function `predict_survival` that takes the required features as input, creates a DataFrame, and then uses the pre-trained `classifier` to predict the survival class and its probabilities.



In [None]:
def predict_survival(Age, Embarked, Fare, Parch, Sex, SibSp):
    # Create a DataFrame from the input parameters
    input_data = pd.DataFrame({
        'Age': [Age],
        'Embarked': [Embarked],
        'Fare': [Fare],
        'Parch': [Parch],
        'Sex': [Sex],
        'SibSp': [SibSp]
    })

    # Get the survival prediction
    prediction = classifier.predict(input_data)[0]

    # Get the probabilities of not surviving (0) and surviving (1)
    probabilities = classifier.predict_proba(input_data)[0]

    return prediction, probabilities

print("Prediction function 'predict_survival' defined successfully.")

**Reasoning**:
I need to import `ipywidgets` and `IPython.display`, then create individual widgets for 'Age', 'Embarked', 'Fare', 'Parch', 'Sex', and 'SibSp' as specified, using `IntSlider`, `FloatSlider`, and `Dropdown` types with appropriate ranges, default values, and labels.



In [None]:
import ipywidgets as widgets
from IPython.display import display

# 2. Create an IntSlider widget for 'Age'
age_widget = widgets.IntSlider(
    min=0,
    max=9, # 'Age' was binned into 10 categories (0-9)
    step=1,
    value=4, # Default value
    description='Age (Binned):'
)

# 3. Create a Dropdown widget for 'Embarked'
embarked_widget = widgets.Dropdown(
    options={'S': 1, 'C': 2, 'Q': 3},
    value=1, # Default to 'S'
    description='Embarked:'
)

# 4. Create a FloatSlider widget for 'Fare'
fare_widget = widgets.FloatSlider(
    min=0.0,
    max=512.33,
    step=0.1,
    value=30.0, # Reasonable default value
    description='Fare:'
)

# 5. Create an IntSlider widget for 'Parch'
parch_widget = widgets.IntSlider(
    min=0,
    max=6, # Based on data description
    step=1,
    value=0, # Default value
    description='Parch:'
)

# 6. Create a Dropdown widget for 'Sex'
sex_widget = widgets.Dropdown(
    options={'Female': 0, 'Male': 1},
    value=1, # Default to 'Male'
    description='Sex:'
)

# 7. Create an IntSlider widget for 'SibSp'
sibsp_widget = widgets.IntSlider(
    min=0,
    max=8, # Based on data description
    step=1,
    value=0, # Default value
    description='SibSp:'
)

print("Widgets for 'Age', 'Embarked', 'Fare', 'Parch', 'Sex', and 'SibSp' created successfully.")

## Build Interactive Dashboard

### Subtask:
Combine the input widgets and the prediction function into an interactive dashboard. This will include a 'Predict Survival' button to trigger the prediction and display the results dynamically.


**Reasoning**:
I need to import the necessary `ipywidgets` classes, create the output widget and button, define the button's click handler, and then arrange and display all the widgets to build the interactive dashboard.



In [None]:
from IPython.display import display, clear_output

# 1. Create an Output widget to display the prediction results
output_widget = widgets.Output()

# 2. Create a Button widget
predict_button = widgets.Button(
    description='Predict Survival',
    button_style='success', # 'success', 'info', 'warning', 'danger' or ''
    tooltip='Click to predict survival'
)

# 3. Define the function to be executed when the button is clicked
def on_predict_button_clicked(button):
    with output_widget:
        clear_output()
        # Get current values from widgets
        age_val = age_widget.value
        embarked_val = embarked_widget.value
        fare_val = fare_widget.value
        parch_val = parch_widget.value
        sex_val = sex_widget.value
        sibsp_val = sibsp_widget.value

        # Call the prediction function
        prediction, probabilities = predict_survival(
            age_val, embarked_val, fare_val, parch_val, sex_val, sibsp_val
        )

        # Display the results
        print("--- Prediction Result ---")
        if prediction == 1:
            print("Survival Prediction: SURVIVED (1)")
        else:
            print("Survival Prediction: DID NOT SURVIVE (0)")
        print(f"Probability of Not Surviving (0): {probabilities[0]:.2f}")
        print(f"Probability of Surviving (1): {probabilities[1]:.2f}")
        print("-------------------------")

# 4. Register the on_predict_button_clicked function as the callback for the button's on_click event
predict_button.on_click(on_predict_button_clicked)

# 5. Stack the input widgets vertically with VBox, then display them with the button and output area
input_widgets_layout = widgets.VBox([
    age_widget,
    embarked_widget,
    fare_widget,
    parch_widget,
    sex_widget,
    sibsp_widget
])

display(input_widgets_layout, predict_button, output_widget)

print("Interactive dashboard created successfully. Use the widgets to input values and click 'Predict Survival' to see the results.")

## Final Task

### Subtask:
Summarize the interactive front-end created and how to use it to get predictions from the Naive Bayes model.


## Summary:

### Q&A

The interactive front-end is an `ipywidgets` dashboard designed for predicting Titanic survival using a Naive Bayes model.

To use it:
1.  **Input Features**: Adjust the sliders and dropdown menus for the following passenger features: 'Age (Binned)', 'Embarked', 'Fare', 'Parch' (number of parents/children aboard), 'Sex', and 'SibSp' (number of siblings/spouses aboard).
    *   'Age' is represented as a binned integer from 0 to 9.
    *   'Embarked' uses a dropdown with 'S', 'C', 'Q' options.
    *   'Sex' uses a dropdown with 'Female' and 'Male' options.
2.  **Trigger Prediction**: Click the 'Predict Survival' button.
3.  **View Results**: The prediction (SURVIVED or DID NOT SURVIVE) and the associated probabilities for both outcomes will be displayed directly below the button.

### Data Analysis Key Findings

*   A `predict_survival` function was successfully implemented, taking six passenger features as input and returning the model's survival prediction (0 or 1) and the probabilities for each outcome.
*   Interactive `ipywidgets` were created for all six input features: 'Age' (IntSlider, 0-9), 'Embarked' (Dropdown, S:1, C:2, Q:3), 'Fare' (FloatSlider, 0.0-512.33), 'Parch' (IntSlider, 0-6), 'Sex' (Dropdown, Female:0, Male:1), and 'SibSp' (IntSlider, 0-8).
*   An interactive dashboard was successfully assembled, combining the input widgets with a 'Predict Survival' button and an output area.
*   The dashboard dynamically displays the survival prediction (e.g., "SURVIVED (1)") and the probabilities for both "Not Surviving (0)" and "Surviving (1)" when the button is clicked.

### Insights or Next Steps

*   The interactive dashboard provides a user-friendly interface for exploring the Naive Bayes model's predictions, allowing for immediate feedback on how different passenger attributes might influence survival.
*   Next steps could involve integrating this dashboard into a larger application or evaluating the model's performance more rigorously with new data through A/B testing or user studies.
