# Naive Bayes 

- In this notebook, we build a Naive Bayes classifier that predicts whether a flight will be delayed, based on the day of the week, scheduled departure time, origin, destination, and carrier.

- We use scikit-learn to train and evaluate the model and to make predictions, pandas for data manipulation, and matplotlib for plotting. The model itself is scikit-learn's `MultinomialNB` class.

## 1. Naive Bayes on “Flight Delays” dataset

### (1) Prepare the data

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
import matplotlib.pyplot as plt

In [None]:
url = "https://raw.github.com/niharikabalachandra/Logistic-Regression/master/FlightDelays.csv"
delays_df = pd.read_csv(url)

In [None]:
delays_df.head()

### (2) Preprocessing

- Convert certain columns to categorical variables.
- Round the departure times to create hourly bins.

In [None]:
# convert day of week and the outcome to categorical
delays_df['dayweek'] = delays_df['dayweek'].astype('category')
delays_df['delay'] = delays_df['delay'].astype('category')

# bin the scheduled departure time: divide by 100 and round to get an hour-level bin
delays_df['schedtime'] = [round(t / 100) for t in delays_df['schedtime']]
delays_df['schedtime'] = delays_df['schedtime'].astype('category')

predictors = ['dayweek', 'schedtime', 'origin', 'dest', 'carrier']
outcome = 'delay'

# one-hot encode the categorical predictors
X = pd.get_dummies(delays_df[predictors])
y = delays_df[outcome]

# class labels, in category order
classes = list(y.cat.categories)

In [None]:
X.head(2)
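
As a rough illustration of what `pd.get_dummies` does in the preprocessing step, the sketch below uses a small made-up frame (the `toy` data is hypothetical, not taken from the flight file). It suggests why `dayweek` is converted to a categorical first: an integer column passes through unchanged, while a categorical column is expanded into one indicator column per level.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({'dayweek': [1, 7, 1], 'carrier': ['DL', 'US', 'DL']})

# The integer column is left as-is; only the string column is expanded
cols_numeric = pd.get_dummies(toy).columns.tolist()

# After converting to 'category', dayweek is expanded as well
toy['dayweek'] = toy['dayweek'].astype('category')
cols_categorical = pd.get_dummies(toy).columns.tolist()

print(cols_numeric)
print(cols_categorical)
```

The indicator names follow the `column_level` pattern, which is why later cells can refer to columns such as `carrier_DL` or `dayweek_7`.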

### (3) Split the data into training and test sets

In [None]:
# split into training and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=1)

### (4) Define and train a Naive Bayes model

In [None]:
# fit a multinomial naive Bayes model; alpha is the additive smoothing parameter
clf = MultinomialNB(alpha=0.01)
clf.fit(X_train, y_train)
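
To make the role of `alpha` a little more concrete, here is a minimal sketch on made-up one-hot data (`X_toy`, `y_toy`, and `toy_clf` are hypothetical names, not part of the flight model). It checks that the fitted class priors are the class frequencies and that the fitted feature probabilities match the additive-smoothing formula (count + alpha) / (class total + alpha * number of features).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical one-hot data: two indicator columns, four flights
X_toy = np.array([[1, 0], [1, 0], [0, 1], [1, 0]])
y_toy = np.array(['delayed', 'ontime', 'ontime', 'ontime'])

alpha = 0.01
toy_clf = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

# Class priors are the class frequencies in the training data
priors = np.exp(toy_clf.class_log_prior_)

# Smoothed feature probabilities: (count + alpha) / (class total + alpha * n_features)
counts = toy_clf.feature_count_
manual = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * counts.shape[1])

print(dict(zip(toy_clf.classes_, priors)))
print(np.allclose(manual, np.exp(toy_clf.feature_log_prob_)))
```

A small `alpha` such as 0.01 keeps the estimates close to the raw frequencies while still avoiding zero probabilities for combinations that never occur in the training data.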

### (5) Predict probabilities and class membership

In [None]:
# predict probabilities
predProb_train = clf.predict_proba(X_train)
predProb_test = clf.predict_proba(X_test)

# predict class membership
y_test_pred = clf.predict(X_test)
y_train_pred = clf.predict(X_train)

In [None]:
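# show actual vs. predicted class and the predicted probabilities
# for the test flights that match one specific profile
df = pd.concat([pd.DataFrame({'actual': y_test, 'predicted': y_test_pred}),
                pd.DataFrame(predProb_test, index=y_test.index, columns=clf.classes_)], axis=1)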
mask = ((X_test.carrier_DL == 1) & (X_test.dayweek_7 == 1) & (X_test.schedtime_10 == 1) & 
        (X_test.dest_LGA == 1) & (X_test.origin_DCA == 1))

print(df[mask])

### (6) Generate probability frequency tables

In [None]:
# split the original data frame into training and test sets using the same random_state
train_df, test_df = train_test_split(delays_df, test_size=0.4, random_state=1)

pd.set_option('display.precision', 4)
# probability of flight status
print(train_df['delay'].value_counts() / len(train_df))
print()

for predictor in predictors:
    # construct the frequency table
    df = train_df[['delay', predictor]]
    freqTable = df.pivot_table(index='delay', columns=predictor, aggfunc=len)

    # divide each row by the sum of the row to get conditional probabilities
    propTable = freqTable.apply(lambda x: x / sum(x), axis=1)
    print(propTable)
    print()
pd.reset_option('display.precision')
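
A possible alternative to the pivot-table approach is `pd.crosstab` with `normalize='index'`, which divides each row by its total in one step. The sketch below uses hypothetical toy data (`toy_train` is not the flight training set) to show the idea.

```python
import pandas as pd

# Hypothetical toy training data
toy_train = pd.DataFrame({
    'delay': ['delayed', 'ontime', 'ontime', 'ontime', 'delayed', 'ontime'],
    'carrier': ['DL', 'DL', 'US', 'DL', 'US', 'DL'],
})

# normalize='index' divides each row by its total, giving P(carrier | delay)
cond_prob = pd.crosstab(toy_train['delay'], toy_train['carrier'], normalize='index')
print(cond_prob)
```

Each row of `cond_prob` sums to 1, so it can be read as the conditional distribution of the predictor within that class.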

### (7) Calculate the posterior probabilities for specific cases

In [None]:
# unnormalized score for "delayed": the conditional probabilities of
# Carrier = DL, Day_Week = 7, Dep_Time = 10, Dest = LGA, Origin = DCA given "delayed",
# multiplied by the prior P(delayed) (the last factor)
P_hat_delayed = 0.0958 * 0.1609 * 0.0307 * 0.4215 * 0.5211 * 0.1977
# the same product for "ontime"; dividing by the sum of both scores gives the posteriors
P_hat_ontime = 0.2040 * 0.1048 * 0.0519 * 0.5779 * 0.6478 * 0.8023
print('P_hat_delayed ~ ', P_hat_delayed)
print('P_hat_ontime ~ ', P_hat_ontime)

print('P(delayed|...) = ', P_hat_delayed / (P_hat_delayed + P_hat_ontime))
print('P(ontime|...) = ', P_hat_ontime / (P_hat_delayed + P_hat_ontime))
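
The same calculation can be wrapped in a small helper. This is only a sketch: `naive_bayes_posterior` is a hypothetical function name, and it assumes the last factor in each product above is the class prior (0.1977 and 0.8023 sum to 1) while the other five are the conditional probabilities.

```python
from math import prod

def naive_bayes_posterior(priors, conditionals):
    """Normalize prior * product(conditionals) across classes."""
    scores = {c: priors[c] * prod(conditionals[c]) for c in priors}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Same numbers as in the cell above, split into priors and conditionals
priors = {'delayed': 0.1977, 'ontime': 0.8023}
conditionals = {
    'delayed': [0.0958, 0.1609, 0.0307, 0.4215, 0.5211],
    'ontime': [0.2040, 0.1048, 0.0519, 0.5779, 0.6478],
}
posterior = naive_bayes_posterior(priors, conditionals)
print(posterior)
```

Because both scores are divided by the same total, the two posteriors sum to 1, which is what the manual division in the previous cell does.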

### (8) Evaluate the model using a confusion matrix and ROC curve

In [None]:
# mean accuracy on the test set
print(clf.score(X_test, y_test))

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_test_pred)

print("Confusion Matrix:")
print(cm)

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
disp.plot()

plt.show()
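
Per-class precision and recall can be read from `classification_report`, which the cell above imports but does not call. The sketch below uses hypothetical labels (`y_true` and `y_hat` are made up for illustration) to show how the report relates to the confusion matrix.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels for illustration only
y_true = ['delayed', 'ontime', 'ontime', 'delayed', 'ontime']
y_hat = ['ontime', 'ontime', 'ontime', 'delayed', 'ontime']

# Rows are actual classes, columns are predicted classes, in the order given by labels
toy_cm = confusion_matrix(y_true, y_hat, labels=['delayed', 'ontime'])

# output_dict=True returns the per-class metrics as a nested dict
report = classification_report(y_true, y_hat, output_dict=True)
print(toy_cm)
print(report['delayed'])
```

In this toy case, one of the two delayed flights is missed, so recall for `delayed` is 0.5 even though four of the five predictions are correct.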

In [None]:
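from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import LabelBinarizer

# encode the labels as 0/1; LabelBinarizer sorts the classes, so the second
# class (the one in predProb_test[:, 1]) is treated as the positive class
lb = LabelBinarizer()
y_test_binary = lb.fit_transform(y_test).ravel()

fpr, tpr, thresholds = roc_curve(y_test_binary, predProb_test[:, 1])
roc_auc = auc(fpr, tpr)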

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')  # diagonal (random model)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

print(f'AUC: {roc_auc:.2f}')