# Plan

1. Read data spreadsheet
2. Prepare features
3. Train a classification model
4. Compute model performance

# We will use the `sklearn` library as our main source of ML tools and algorithms

In [None]:
import pandas as pd
from sklearn.naive_bayes import MultinomialNB

## Predict if passenger `Survived`

In [None]:
df = pd.read_csv('../data/titanic_data.csv')

In [None]:
df.head(3)

## For our example we will use 2 features: `Passenger Class` and `Sex`

In [None]:
df[['Sex', 'Passenger Class', 'Survived']]

In [None]:
clf = MultinomialNB()

# Expected to fail: `Sex` and `Passenger Class` still hold strings, not numbers
clf.fit(df[['Sex', 'Passenger Class']], df['Survived'])

## We need to convert categorical features to numerical ones (`sklearn` models generally do not accept string categories out of the box)

In [None]:
df['Sex'].value_counts()

In [None]:
df['Passenger Class'].value_counts()

In [None]:
X = pd.DataFrame()

X['Sex'] = df['Sex'].map({
    'Male':0,
    'Female':1
})
X['Passenger Class'] = df['Passenger Class'].map({
    'First': 1,
    'Second': 2,
    'Third': 3
})

X['Survived'] = df['Survived'].map({
    'Yes': 1,
    'No': 0
})

y = X['Survived']
X = X.drop(['Survived'], axis=1)

## You could also use `sklearn.preprocessing.LabelEncoder` and `sklearn.preprocessing.OrdinalEncoder`
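As a minimal sketch of the encoder route, assuming the same category names as in the manual mapping above, `OrdinalEncoder` can encode both feature columns in one call. The tiny `toy` frame is made up for illustration. Note that the encoder numbers categories from 0, so `Passenger Class` becomes 0, 1, 2 rather than 1, 2, 3.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy rows, only here to keep the example self-contained
toy = pd.DataFrame({
    'Sex': ['Male', 'Female', 'Female'],
    'Passenger Class': ['Third', 'First', 'Second'],
})

# Listing the categories explicitly fixes which integer each value gets
encoder = OrdinalEncoder(categories=[
    ['Male', 'Female'],
    ['First', 'Second', 'Third'],
])

# Returns a float array with one encoded column per input column
X_toy = encoder.fit_transform(toy[['Sex', 'Passenger Class']])
print(X_toy)
```

`LabelEncoder` works on a single 1-D column and is intended for the target (`Survived` here), while `OrdinalEncoder` is the one meant for feature columns.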

In [None]:
clf = MultinomialNB()

clf.fit(X, y)

In [None]:
y_pred = clf.predict(X)

## Let's compute the performance of our model by building a confusion matrix

In [None]:
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score

In [None]:
tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()

In [None]:
print(tp)

In [None]:
print(tn)

In [None]:
print(fp)

In [None]:
print(fn)

In [None]:
print(precision_score(y, y_pred))

In [None]:
print(recall_score(y, y_pred))

In [None]:
print(accuracy_score(y, y_pred))
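The three scores above follow directly from the four confusion-matrix counts. A small sketch on made-up labels (`y_true_toy` and `y_pred_toy` are illustrative, not Titanic data) puts the formulas next to the `sklearn` functions:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical labels and predictions, for illustration only
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

# Precision: share of predicted positives that are truly positive
precision = tp / (tp + fp)
# Recall: share of true positives that the model found
recall = tp / (tp + fn)
# Accuracy: share of all predictions that are correct
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(precision, precision_score(y_true_toy, y_pred_toy))
print(recall, recall_score(y_true_toy, y_pred_toy))
print(accuracy, accuracy_score(y_true_toy, y_pred_toy))
```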

## Is it a good idea to use the same exact data for model training and model performance assessment? Why?

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.66, random_state=5)

In [None]:
clf = MultinomialNB()

clf.fit(X_train, y_train)

In [None]:
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)

In [None]:
print(f'Model performance on train data: {accuracy_score(y_train, y_train_pred)}')

In [None]:
print(f'Model performance on test data: {accuracy_score(y_test, y_test_pred)}')

## Typically you want to split your annotated data into 3 parts:

- `Train`: this part is used to fit the model (learn `model parameters`, e.g. probability distribution in Naive Bayes, weights in linear regression, thresholds in decision trees)
- `Validation`: this part is used to choose `model hyperparameters`, e.g. number of neighbours in KNN, regularization strength in linear models, tree depth in decision trees, etc.
- `Test`: this part is used to assess model performance and to compare it with other existing models. This part of the data should never be shown to the model other than in the "predict" phase.
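One possible way to get such a three-way split is to call `train_test_split` twice. This is a sketch on synthetic arrays; the 60/20/20 proportions and the names `X_tr`, `X_val`, `X_te` are assumptions chosen for the example, not a recommendation from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 2 features, binary target
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0, 1] * 50)

# Step 1: hold out 20% of the data as the test part
X_rest, X_te, y_rest, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=5)

# Step 2: split the remaining 80% into train and validation;
# 0.25 of the remainder equals 20% of the full data
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=5)

print(len(X_tr), len(X_val), len(X_te))
```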


The split strategy might vary depending on the task specifics, for example:
- if you work with time-based data (e.g. predicting a stock price), the split should be time-based (you do not want to predict the past from the future)
- if you have multiple measurements from a single person, e.g. tumor size, you want to split by unique subject ID and not by individual measurement
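Both strategies can be sketched on synthetic data. The arrays `timestamps` and `subject_id`, the cutoff value and the 25% test share below are all made up for illustration. `GroupShuffleSplit` is one of several group-aware splitters in `sklearn.model_selection`.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Time-based split: train on everything before a cutoff, test on the rest
timestamps = np.arange(10)   # hypothetical, one time step per sample
cutoff = 7                   # hypothetical cutoff
train_idx_time = np.where(timestamps < cutoff)[0]
test_idx_time = np.where(timestamps >= cutoff)[0]

# Group-based split: all measurements of one subject go to the same part
subject_id = np.array([1, 1, 1, 2, 2, 3, 3, 3, 4, 4])  # hypothetical IDs
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=5)
train_idx, test_idx = next(splitter.split(X_toy, y_toy, groups=subject_id))

print('train subjects:', set(subject_id[train_idx]))
print('test subjects: ', set(subject_id[test_idx]))
```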

# Questions

1. How to use Naive Bayes for continuous (numerical) features?
2. How to use Naive Bayes if we want to predict a continuous (numerical) target? (regression instead of classification)
3. Do we want to get probabilities instead of class labels? Yes/No and why?
4. How to interpret a Naive Bayes model?

In [None]:
y_pred_proba = clf.predict_proba(X_test)

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

In [None]:
y_pred_proba[:5]

In [None]:
cm = confusion_matrix(y_test, y_pred_proba[:, 1]>0.5)
ConfusionMatrixDisplay(cm).plot();

## If I want to reduce the number of False Positives, do I need to increase or decrease the probability threshold?

In [None]:
cm = confusion_matrix(y_test, y_pred_proba[:, 1]>0.7)
ConfusionMatrixDisplay(cm).plot();
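To see the trend across several thresholds at once, here is a small sketch on made-up probabilities (`y_true_toy` and `proba_toy` are illustrative and not produced by the model above). Raising the threshold can only keep or lower the number of false positives, and it can only keep or raise the number of false negatives.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predicted probabilities of class 1
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])
proba_toy = np.array([0.1, 0.4, 0.6, 0.8, 0.2, 0.55, 0.75, 0.9])

fps, fns = [], []
for threshold in [0.3, 0.5, 0.7]:
    # Predict class 1 only when the probability exceeds the threshold
    y_hat = (proba_toy > threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true_toy, y_hat, labels=[0, 1]).ravel()
    fps.append(int(fp))
    fns.append(int(fn))
    print(f'threshold={threshold}: FP={fp}, FN={fn}')
```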