## INFS3081 Predictive Analytics

### Practical Activity 3.5: Naive Bayes

This notebook demonstrates how to develop a Naive Bayes classifier for a loan approval prediction problem.

#### Learning Objectives:
- Understand the Naive Bayes algorithm.
- Learn how to preprocess and split a dataset.
- Train and evaluate a Naive Bayes model.
- Handle mixed numeric and categorical features with `MixedNB`.

Install the `mixed-naive-bayes` package with **pip install mixed-naive-bayes** before running this notebook.

#### Predict Loan Eligibility for Dream Housing Finance company
Source: https://courses.analyticsvidhya.com/courses/take/loan-prediction-practice-problem-using-python/texts/6119358-problem-statement

The dataset is in CSV format and has the following attributes:

| Variable          | Description                                     |
| ----------------- | ----------------------------------------------- |
| Loan_ID           | Unique Loan ID                                  |
| Gender            | Male / Female                                   |
| Married           | Applicant married (Y / N)                       |
| Dependents        | Number of dependents                            |
| Education         | Applicant Education (Graduate / Not Graduate)   |
| Self_Employed     | Self employed (Y / N)                           |
| ApplicantIncome   | Applicant income                                |
| CoapplicantIncome | Coapplicant income                              |
| LoanAmount        | Loan amount in thousands                        |
| Loan_Amount_Term  | Term of loan in months                          |
| Credit_History    | Credit history meets guidelines                 |
| Property_Area     | Urban / Semi Urban / Rural                      |
| Loan_Status       | (Target) Loan approved (Y / N)                  |

Our goal is to create a Naive Bayes model that can learn from the training samples, so that we can predict the outcome of a loan application.


In [6]:
# Import the required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

#### Step 1: Preprocessing
First, we load the "loan_prediction" dataset from the CSV file into a pandas DataFrame and inspect the first five instances.

In [2]:
df = pd.read_csv('loan_prediction.csv')
df.head()

,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
1,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
2,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
3,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
4,LP001011,Male,Yes,2,Graduate,Yes,5417,4196.0,267.0,360.0,1.0,Urban,Y


In [3]:
# Inspect the values in the categorical features
df['Gender'].value_counts()

Gender
Male      394
Female     86
Name: count, dtype: int64

In [4]:
# shape
df.shape

(480, 13)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480 entries, 0 to 479
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            480 non-null    object 
 1   Gender             480 non-null    object 
 2   Married            480 non-null    object 
 3   Dependents         480 non-null    object 
 4   Education          480 non-null    object 
 5   Self_Employed      480 non-null    object 
 6   ApplicantIncome    480 non-null    int64  
 7   CoapplicantIncome  480 non-null    float64
 8   LoanAmount         480 non-null    float64
 9   Loan_Amount_Term   480 non-null    float64
 10  Credit_History     480 non-null    float64
 11  Property_Area      480 non-null    object 
 12  Loan_Status        480 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 48.9+ KB


We get three important pieces of information from shape and info:
1. There are 480 instances and 13 columns: the identifier Loan_ID, 11 candidate features, and the target Loan_Status.
2. The Non-Null Count column shows that there are no missing values. Missing values would have to be handled (for example, imputed or dropped) before training.
3. The dtypes line at the bottom of the info summary tells us there are 4 floating-point attributes, 1 integer attribute, and 8 object (string-valued) attributes.
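
The missing-value check in point 2 can also be done directly rather than by reading the `info()` summary. The following is a minimal sketch on a small toy frame; the values and the name `toy` are made up for illustration and are not taken from the loan dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing LoanAmount
toy = pd.DataFrame({
    "ApplicantIncome": [4583, 3000, 2583],
    "LoanAmount": [128.0, np.nan, 120.0],
})

# Number of missing values per column; all zeros means nothing is missing
missing_counts = toy.isna().sum()
print(missing_counts)

# True only if the whole frame is free of missing values
has_no_missing = not toy.isna().any().any()
print(has_no_missing)
```

Running the same two calls on `df` should report zero missing values in every column, in line with the Non-Null Count shown above.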

The object-typed categorical attributes must be encoded as integers before training.

In [7]:
le = preprocessing.LabelEncoder()

df['en_gender']        = le.fit_transform(df['Gender'])
df['en_married']       = le.fit_transform(df['Married'])
df['en_dependents']    = le.fit_transform(df['Dependents'])
df['en_education']     = le.fit_transform(df['Education'])
df['en_self_employed'] = le.fit_transform(df['Self_Employed'])
df['en_parea']         = le.fit_transform(df['Property_Area'])

# Encoding the target variable
df['target'] = le.fit_transform(df['Loan_Status'])
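
It is worth knowing which integer each label receives, because the encoded target is what the model predicts later. `LabelEncoder` sorts the labels and numbers them from 0. The sketch below uses a toy target column and a separate, illustrative encoder named `le_target`; neither is part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy target column using the two Loan_Status values from the dataset
loan_status = pd.Series(["N", "Y", "Y", "N", "Y"])

le_target = LabelEncoder()
encoded = le_target.fit_transform(loan_status)

# classes_ is sorted, so position i is the label encoded as integer i
mapping = {str(label): code for code, label in enumerate(le_target.classes_)}
print(mapping)            # {'N': 0, 'Y': 1}
print(encoded.tolist())   # [0, 1, 1, 0, 1]

# inverse_transform recovers the original labels
print(le_target.inverse_transform([1, 0]).tolist())  # ['Y', 'N']
```

Because each call to `fit_transform` refits the encoder, reusing one `le` object for every column keeps only the classes of the last column fitted. A dedicated encoder per column is one way to keep the mappings available for decoding.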

In [8]:
# list the features
features = list(df.columns)
features

['Loan_ID',
 'Gender',
 'Married',
 'Dependents',
 'Education',
 'Self_Employed',
 'ApplicantIncome',
 'CoapplicantIncome',
 'LoanAmount',
 'Loan_Amount_Term',
 'Credit_History',
 'Property_Area',
 'Loan_Status',
 'en_gender',
 'en_married',
 'en_dependents',
 'en_education',
 'en_self_employed',
 'en_parea',
 'target']

We select the encoded features for our model. Loan_ID is an identifier, i.e., it is different for each sample and carries no predictive information, so we exclude it from model training.

In [9]:
# remove loan id and target from features
features.remove('Loan_ID')
features.remove('Loan_Status')
features.remove('target')

# remove the non encoded features from the features list
features.remove('Gender')
features.remove('Married')
features.remove('Dependents')
features.remove('Education')
features.remove('Self_Employed')
features.remove('Property_Area')

features

['ApplicantIncome',
 'CoapplicantIncome',
 'LoanAmount',
 'Loan_Amount_Term',
 'Credit_History',
 'en_gender',
 'en_married',
 'en_dependents',
 'en_education',
 'en_self_employed',
 'en_parea']

In [10]:
# making sure we have 11 features in the list
len(features)

11

Now, we check whether we have an equal number of samples for each target class. A dataset whose samples are evenly distributed across the target classes is called a balanced dataset. With an imbalanced dataset, a model can favour the majority class, and accuracy alone can be misleading.

In [11]:
df['Loan_Status'].value_counts()

Loan_Status
Y    332
N    148
Name: count, dtype: int64

We observe that our data is not balanced: there are 332 samples of class Y and only 148 of class N. Next, we divide the dataset into training and test sets using a 70/30 split. Because the dataset is imbalanced, we use stratified sampling so that both sets preserve the class proportions of the full dataset.

In [13]:
# stratified sampling
X_train, X_test, y_train, y_test = train_test_split(df[features], df['target'], test_size=0.3, stratify=df['target'], random_state=42)

In [14]:
# check the class distribution in the test set
y_test.value_counts()

target
1    100
0     44
Name: count, dtype: int64
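
To see what stratification does, the class proportions of the split can be compared with those of the full data. This is a minimal sketch on synthetic labels that only reuse the class counts reported above (332 and 148); the placeholder feature `X` is an assumption made so the example runs without the CSV file.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the dataset: 332 Y (1), 148 N (0)
y = np.array([1] * 332 + [0] * 148)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, values are irrelevant here

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Size and share of class 1 in the full data, the training set and the test set
for name, labels in [("full", y), ("train", y_tr), ("test", y_te)]:
    print(name, len(labels), round(labels.mean(), 3))
```

The share of class 1 should be close to 0.69 in all three rows. Without `stratify`, the shares in the two subsets can drift away from the full-data value.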

#### Step 2: Learning the Naive Bayes model
To build a Naive Bayes classifier, we can use the implementations in the sklearn package. It provides five variants, which differ in how they estimate the feature likelihoods:
1. GaussianNB - for continuous numeric attributes.
2. MultinomialNB - for discrete count attributes (e.g., word counts).
3. ComplementNB - an adaptation of MultinomialNB that is better suited to imbalanced datasets.
4. BernoulliNB - for binary attributes.
5. CategoricalNB - for nominal or categorical attributes.

If our dataset has only one type of feature, i.e., the features are all numeric or all nominal, then we can use one of the above implementations to build a predictive model. However, the features in the loan_prediction dataset are of mixed types, and no single sklearn Naive Bayes class handles numeric and categorical features together.
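
As an aside, the independence assumption means the numeric and categorical likelihoods simply multiply, so a mixed model can in principle be assembled by hand from two sklearn estimators. The sketch below is an illustration on synthetic data, not the approach used in this practical: it adds the log-posteriors of a `GaussianNB` and a `CategoricalNB` and subtracts the log-prior once, because each estimator has already included it. All variable names and the generated data are assumptions.

```python
import numpy as np
from scipy.special import logsumexp
from sklearn.naive_bayes import CategoricalNB, GaussianNB

# Synthetic data: two numeric and two integer-coded categorical features
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
X_num = rng.normal(loc=y[:, None] * 2.0, scale=1.0, size=(n, 2))
X_cat = np.column_stack([
    y + rng.integers(0, 2, size=n),   # categories 0, 1, 2
    rng.integers(0, 2, size=n),       # categories 0, 1
])

# One estimator per feature type, fitted on the same labels
gnb = GaussianNB().fit(X_num, y)
cnb = CategoricalNB().fit(X_cat, y)

# Each log-posterior contains the log-prior, so it is subtracted once
joint = (
    gnb.predict_log_proba(X_num)
    + cnb.predict_log_proba(X_cat)
    - cnb.class_log_prior_
)

# Renormalise so that each row sums to 1
proba = np.exp(joint - logsumexp(joint, axis=1, keepdims=True))
pred = gnb.classes_[np.argmax(proba, axis=1)]
print(proba[:3])
print((pred == y).mean())
```

A dedicated mixed-type implementation avoids this bookkeeping, which is the motivation for the library introduced next.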

The mixed-naive-bayes 0.0.1 package (https://pypi.org/project/mixed-naive-bayes/) provides an implementation of Naive Bayes for mixed attributes. We will use this Python library for this practical.

Note: The module expects that we have label encoded the categorical features.

In [15]:
# import the MixedNB classifier
from mixed_naive_bayes import MixedNB

# categorical_features lists the column indices of the categorical
# features in the features list (the six en_* columns)
clfNB = MixedNB(categorical_features=[5,6,7,8,9,10])

# fit the model with the training set
clfNB.fit(X_train, y_train)

MixedNB(alpha=0.5, var_smoothing=1e-09)
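
The hard-coded indices `[5,6,7,8,9,10]` depend on the order of the `features` list. One way to avoid counting by hand is to derive them from the column names. This sketch assumes, as in this notebook, that every label-encoded column carries the `en_` prefix; the name `categorical_idx` is illustrative.

```python
# Feature list as built earlier in this notebook
features = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
            'Loan_Amount_Term', 'Credit_History', 'en_gender', 'en_married',
            'en_dependents', 'en_education', 'en_self_employed', 'en_parea']

# Positions of the label-encoded columns, identified by their en_ prefix
categorical_idx = [i for i, name in enumerate(features) if name.startswith('en_')]
print(categorical_idx)  # [5, 6, 7, 8, 9, 10]
```

The resulting list could then be passed as `categorical_features`, and it stays correct if the feature order changes.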

#### Step 3: Evaluation
We will now evaluate the performance of our model on the test set. That is, we will apply the model to the test set X_test and compare the predictions of the model with y_test.

In [16]:
# predictions
y_pred = clfNB.predict(X_test)
y_pred

array([1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1])

In [17]:
# here are our true labels
y_test

64     1
295    1
37     1
453    1
259    1
      ..
260    1
418    1
330    1
413    1
293    0
Name: target, Length: 144, dtype: int64

In [18]:
# let us take the first test instance for example
X_test[0:1]

,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,en_gender,en_married,en_dependents,en_education,en_self_employed,en_parea
64,3846,0.0,111.0,360.0,1.0,0,0,0,0,0,1


In [23]:
clfNB.predict_proba(X_test[0:1])

array([[0.05735267, 0.94264733]])

#### Explanation
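Given an instance x' with features = (3846, 0.0, 111.0, 360.0, 1.0, 0, 0, 0, 0, 0, 1), the model predicts:
P(N|x') = 0.057
P(Y|x') = 0.943

The columns of predict_proba follow the encoded class order, and the label encoder mapped N to 0 and Y to 1. The model therefore predicted that the application is eligible for a loan, which agrees with the first entry of y_pred (1).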
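
When reading a `predict_proba` row, it helps to pair each column with its class label explicitly. The sketch below reuses the probability row printed above and assumes that the columns follow the encoded class order 0, 1, with `LabelEncoder` having assigned N to 0 and Y to 1 (alphabetical order). The names `class_labels` and `by_label` are illustrative.

```python
import numpy as np

# Probability row printed above for the first test instance
proba = np.array([[0.05735267, 0.94264733]])

# Assumed class order: LabelEncoder sorts labels, so 0 -> 'N' and 1 -> 'Y'
class_labels = ['N', 'Y']

row = proba[0]
by_label = {label: round(float(p), 3) for label, p in zip(class_labels, row)}
predicted = class_labels[int(np.argmax(row))]

print(by_label)   # {'N': 0.057, 'Y': 0.943}
print(predicted)  # Y
```

The column with the largest probability gives the predicted class, which matches the first value of `y_pred` (1).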

In [20]:
# measure accuracy
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.8333333333333334

In [21]:
# we can also use the score method of MixedNB to observe the accuracy
clfNB.score(X_test, y_test)

np.float64(0.8333333333333334)

Our Naive Bayes model has 83.33% accuracy on the test set of the loan_prediction dataset.
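
Since the classes are imbalanced, it is useful to put this accuracy next to a simple baseline: always predicting the most frequent class. The following arithmetic sketch uses only numbers already shown in this notebook (the test-set class counts and the reported accuracy); the variable names are illustrative.

```python
# Class counts in the test set, taken from y_test.value_counts() above
test_counts = {1: 100, 0: 44}
n_test = sum(test_counts.values())

# Accuracy of always predicting the most frequent class
baseline_accuracy = max(test_counts.values()) / n_test

# Reported accuracy of the Naive Bayes model on the same test set
model_accuracy = 0.8333333333333334
n_correct = round(model_accuracy * n_test)

print(n_test)                       # 144
print(round(baseline_accuracy, 4))  # 0.6944
print(n_correct)                    # 120
```

The model classifies 120 of the 144 test instances correctly, compared with 100 for the majority-class baseline.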

In this practical, we worked through all the steps required to develop a Naive Bayes model: preprocessing, a stratified 70/30 split into training and test sets, model fitting, and evaluation.