## Naive Bayes Classification

In [1]:
# Imports
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay, f1_score, log_loss
from sklearn.model_selection import train_test_split
import pandas as pd

## Explanation:
- Naive Bayes is a probabilistic classification algorithm based on Bayes' theorem, with the "naive" assumption that features are independent of one another given the class. It is commonly used for text classification tasks such as spam detection, sentiment analysis, and document categorization.
- How it works:

#### 1. Bayes' Theorem:
- Bayes' theorem describes the probability of a hypothesis given the evidence: P(H|E) = (P(E|H) * P(H)) / P(E), where:
    - P(H|E) is the posterior probability of hypothesis H given the evidence E.
    - P(E|H) is the likelihood: the probability of evidence E given the hypothesis H.
    - P(H) is the prior probability of hypothesis H.
    - P(E) is the marginal probability of the evidence E, which acts as a normalizing constant.
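
As a quick numeric illustration of the formula above, the sketch below computes a posterior for a made-up spam example. The probabilities and the helper name `bayes_posterior` are illustrative assumptions, not values taken from any dataset.

```python
def bayes_posterior(prior_h, likelihood_e_given_h, likelihood_e_given_not_h):
    """Return P(H|E) for a binary hypothesis using Bayes' theorem."""
    # P(E) via the law of total probability over H and not-H
    evidence = (likelihood_e_given_h * prior_h
                + likelihood_e_given_not_h * (1 - prior_h))
    return likelihood_e_given_h * prior_h / evidence

# Hypothetical numbers: 20% of emails are spam; the word "offer" appears
# in 50% of spam emails and in 5% of non-spam emails.
posterior = bayes_posterior(prior_h=0.2,
                            likelihood_e_given_h=0.5,
                            likelihood_e_given_not_h=0.05)
print(f"P(spam | 'offer') = {posterior:.3f}")  # 0.1 / 0.14, about 0.714
```

Even though only half of the spam emails contain the word, the posterior is high because the word is rare in non-spam emails.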

#### 2. Naive Assumption of Independence:
- The Naive Bayes algorithm assumes that the features used to describe instances are conditionally independent given the class label. In other words, once the class is known, the value of one feature carries no information about the value of any other feature.

#### 3. Training Phase:
- In the training phase, the algorithm calculates the prior probabilities of each class and the conditional probabilities of each feature given each class using the training dataset.
- The prior probability of each class is simply the proportion of instances belonging to that class in the training dataset.
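- The conditional probability of each feature given each class is estimated from the training instances of that class: from feature frequencies for discrete features, or, in Gaussian Naive Bayes, from the per-class mean and variance of each continuous feature.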
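
For continuous features modeled with a Gaussian, the training phase amounts to computing class proportions and per-class feature statistics. The following is a minimal sketch on a tiny made-up dataset; the names `X_toy`, `y_toy`, `priors`, `means`, and `variances` are assumptions for illustration, and scikit-learn's `GaussianNB` additionally applies a small variance smoothing that is omitted here.

```python
import numpy as np

# Tiny made-up dataset: 2 continuous features, 2 classes (illustrative only)
X_toy = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 0.5], [3.2, 0.7]])
y_toy = np.array([0, 0, 0, 1, 1])

classes = np.unique(y_toy)

# Prior P(class): fraction of training rows that belong to each class
priors = np.array([(y_toy == c).mean() for c in classes])

# Gaussian likelihood parameters: per-class, per-feature mean and variance
means = np.array([X_toy[y_toy == c].mean(axis=0) for c in classes])
variances = np.array([X_toy[y_toy == c].var(axis=0) for c in classes])

print("priors:", priors)
print("means:\n", means)
print("variances:\n", variances)
```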

#### 4. Prediction Phase:
- In the prediction phase, given a new instance with a set of features, the algorithm calculates the posterior probability of each class given the features using Bayes' theorem.
- The class with the highest posterior probability is assigned as the predicted class for the new instance.
- Since the Naive Bayes algorithm assumes feature independence, it calculates the likelihood of each feature independently given the class and multiplies them together to obtain the joint likelihood.
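
The prediction steps above can be sketched in a few lines of plain Python. This is a minimal illustration for Gaussian likelihoods, not scikit-learn's implementation: the function names and the "fitted" parameters (`toy_priors`, `toy_means`, `toy_variances`) are hypothetical. It works in log space so that the product of per-feature likelihoods becomes a sum, and it skips the division by P(E) because that term is the same for every class and does not change which class wins.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log of the univariate normal density N(x; mean, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict_one(x, priors, means, variances):
    """Return (best_class, log_scores) for a single instance x."""
    log_scores = {}
    for c in priors:
        # log P(c) + sum_i log P(x_i | c): the independence assumption
        # turns the joint likelihood into a product (a sum in log space)
        log_scores[c] = math.log(priors[c]) + sum(
            gaussian_log_pdf(xi, m, v)
            for xi, m, v in zip(x, means[c], variances[c])
        )
    # The class with the highest (unnormalized) log posterior is predicted
    return max(log_scores, key=log_scores.get), log_scores

# Hypothetical fitted parameters for two classes and two features
toy_priors = {0: 0.6, 1: 0.4}
toy_means = {0: [1.0, 2.0], 1: [3.1, 0.6]}
toy_variances = {0: [0.03, 0.03], 1: [0.01, 0.01]}

label, scores = predict_one([1.1, 1.9], toy_priors, toy_means, toy_variances)
print(label)  # the new point lies close to the class-0 means
```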

#### 5. Types of Naive Bayes Classifiers:
- There are different variants of the Naive Bayes algorithm, including:
    - Gaussian Naive Bayes: Assumes that continuous features follow a Gaussian distribution.
    - Multinomial Naive Bayes: Suitable for features with discrete counts, such as word counts in text classification.
    - Bernoulli Naive Bayes: Suitable for binary features, where features are assumed to be present or absent.
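
In scikit-learn these three variants are available as `GaussianNB`, `MultinomialNB`, and `BernoulliNB` in `sklearn.naive_bayes`. The sketch below fits each one on synthetic data of the matching type; the data-generating choices and the names `X_cont`, `X_counts`, `X_binary`, `y_demo`, and `scores` are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=100)  # random binary class labels

# Continuous features whose mean depends on the class
X_cont = rng.normal(loc=y_demo[:, None], scale=1.0, size=(100, 3))
# Non-negative count features (e.g. word counts) whose rate depends on the class
X_counts = rng.poisson(lam=1 + 2 * y_demo[:, None], size=(100, 3))
# Binary present/absent features derived from the counts
X_binary = (X_counts > 1).astype(int)

scores = {}
for model, X_demo in [(GaussianNB(), X_cont),
                      (MultinomialNB(), X_counts),
                      (BernoulliNB(), X_binary)]:
    model.fit(X_demo, y_demo)
    # Training accuracy only, to show that each variant accepts its data type
    scores[type(model).__name__] = model.score(X_demo, y_demo)

print(scores)
```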

#### 6. Evaluation:
- The performance of the Naive Bayes algorithm can be evaluated using various metrics such as accuracy, precision, recall, F1 score, or area under the ROC curve (AUC), depending on the task.
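
As a small illustration of these metrics, the sketch below evaluates a hand-made set of binary predictions with `sklearn.metrics`. The labels, predictions, and probabilities (`y_true`, `y_hat`, `y_prob`) are invented for the example.

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]              # actual classes (made up)
y_hat = [0, 1, 1, 1, 0, 0]               # hard predictions (made up)
y_prob = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2]  # predicted P(class = 1) (made up)

# 2 true positives, 1 false positive, 1 false negative
precision = precision_score(y_true, y_hat)  # TP / (TP + FP) = 2/3
recall = recall_score(y_true, y_hat)        # TP / (TP + FN) = 2/3
f1 = f1_score(y_true, y_hat)                # harmonic mean of the two = 2/3
# AUC uses the scores, not the hard predictions: 8 of the 9
# positive/negative pairs are ranked correctly
auc = roc_auc_score(y_true, y_prob)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} auc={auc:.3f}")
```

Accuracy alone can be misleading when classes are imbalanced, which is why precision, recall, and AUC are often reported alongside it.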

Source: ChatGPT prompt.



### Generating data

In [None]:
# Synthetic 3-class dataset: 800 samples, 6 features, 2 of them informative
X, y = make_classification(n_samples=800, n_features=6, random_state=1, n_classes=3, n_informative=2, n_clusters_per_class=1)


In [None]:
# Visualize the first two of the six features, colored by class label
plt.scatter(X[:, 0], X[:, 1], c=y, marker='*')
plt.xlabel('Feature 0')
plt.ylabel('Feature 1')
plt.show()

### Train Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=125)

### Model Building and Training

In [None]:
gaussian_model = GaussianNB()
gaussian_model.fit(X_train, y_train)
y_pred = gaussian_model.predict(X_test)


### Model evaluation

In [None]:
accuracy = accuracy_score(y_test, y_pred)
# Weighted F1: per-class F1 scores averaged with weights equal to class support
f1 = f1_score(y_test, y_pred, average='weighted')

print(f'Accuracy: {accuracy}')
print(f'F1 score: {f1}')


In [None]:
# Visualize the confusion matrix
labels = [0, 1, 2]
cm = confusion_matrix(y_test, y_pred, labels=labels)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot()
plt.show()

## Naive Bayes Classifier with Loan Dataset

### Load data and check properties

In [None]:
loan_df = pd.read_csv('../data/loan_data.csv')
loan_df.head()

In [None]:
loan_df.shape

### Data preprocessing

In [None]:
# One-hot encode categorical columns; drop_first drops one dummy column per category
loan_df = pd.get_dummies(loan_df, drop_first=True)
loan_df.head()

In [None]:
loan_df.shape

### Prepare features and label

In [None]:

# Features are every column except the target 'not.fully.paid'
X = loan_df.drop('not.fully.paid', axis=1)
y = loan_df['not.fully.paid']



### Train test split and model training

In [None]:
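# Hold out 33% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=125)

# Fit Gaussian Naive Bayes on the training split and predict on the test split
gaussian_model = GaussianNB()
gaussian_model.fit(X_train, y_train)
y_pred = gaussian_model.predict(X_test)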

### Model testing

In [None]:
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')
print(f'Accuracy: {accuracy}')
print(f'F1 score: {f1}')

In [None]:
# confusion_matrix orders classes by sorted value, so the first label maps to
# not.fully.paid == 0 and the second to not.fully.paid == 1
labels = ['Fully Paid', 'Not Paid']
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot()