# **Financial Inclusion in Africa - Zindi Competition**

## **1. Problem Statement**
Financial inclusion remains a major challenge to economic and human development in Africa. In Kenya, Rwanda, Tanzania, and Uganda, only **14% of adults** (9.1 million) have access to or use a commercial bank account.

Despite the rise of mobile money and fintech solutions, **banks still play a crucial role** in financial inclusion. Access to bank accounts allows individuals to:
- Save and make payments  
- Build creditworthiness  
- Access loans, insurance, and related financial services  

### **Objective**
The goal of this competition is to **develop a machine learning model** that predicts whether an individual is likely to have a bank account. The insights gained from the model can:
- Measure financial inclusion in the target countries  
- Identify key factors affecting financial security  

---

## **2. Dataset Overview**
The dataset consists of survey responses from Kenya, Rwanda, Tanzania, and Uganda. Each row represents an individual's response to the survey.  

### **Variable Definitions**
| Feature | Description |
|---------|------------|
| `country` | Country where the interviewee is located |
| `year` | Year the survey was conducted |
| `uniqueid` | Unique identifier for each respondent |
| `location_type` | Type of location: Rural or Urban |
| `cellphone_access` | Does the respondent have access to a cellphone? (Yes/No) |
| `household_size` | Number of people living in the respondent's household |
| `age_of_respondent` | Age of the respondent |
| `gender_of_respondent` | Gender: Male or Female |
| `relationship_with_head` | Relationship to head of household (e.g., Head, Spouse, Child, etc.) |
| `marital_status` | Marital status (Married, Single, Divorced, etc.) |
| `education_level` | Highest education level attained |
| `job_type` | Employment category (e.g., Farming, Self-employed, Government job, etc.) |
| `bank_account` *(Target Variable)* | Does the respondent have a bank account? (Yes = 1, No = 0) |

---

## Load and Inspect the Data

In [1]:
import pandas as pd

# Load datasets
train = pd.read_csv("data/Train.csv")
test = pd.read_csv("data/Test.csv")

# Basic info
print(train.shape, test.shape)


(23524, 13) (10086, 12)


## Feature Engineering & Preprocessing

In [2]:
# Import preprocessing modules
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler

# Convert the target label to numerical data
le = LabelEncoder()
train['bank_account'] = le.fit_transform(train['bank_account'])

# Separate training features from target
X_train = train.drop(['bank_account'], axis=1)
y_train = train['bank_account']

print(y_train)

0        1
1        0
2        1
3        0
4        0
        ..
23519    0
23520    0
23521    0
23522    0
23523    0
Name: bank_account, Length: 23524, dtype: int64
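To make the encoding of the target explicit, here is a minimal sketch of how `LabelEncoder` assigns integer codes. The label list is a toy stand-in for the `bank_account` column, and the name `encoder` is only illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the `bank_account` column (illustrative values)
labels = ["Yes", "No", "Yes", "No", "No"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted alphabetically, so index 0 is "No" and index 1 is "Yes"
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)           # {'No': 0, 'Yes': 1}
print(encoded.tolist())  # [1, 0, 1, 0, 0]
```

Because the codes follow the sorted class order, `1` corresponds to "Yes" (has a bank account), which matches the target definition in the variable table.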


In [3]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler



In [4]:

# Function to preprocess the data before training models
def preprocessing_data(data, feature_selection=False):
    # Work on a copy so the caller's DataFrame is not modified
    data = data.copy()

    # Convert the numerical columns from integer to float
    num_cols = ["household_size", "age_of_respondent", "year"]
    data[num_cols] = data[num_cols].astype(float)

    # Categorical features to be converted with one-hot encoding
    categ = ["relationship_with_head",
             "marital_status",
             "education_level",
             "job_type",
             "country"]

    # One-hot encoding conversion
    data = pd.get_dummies(data, prefix_sep="_", columns=categ)

    # Label-encode the binary columns with a local encoder so that the
    # global `le` fitted on the target keeps its classes
    binary_encoder = LabelEncoder()
    for col in ["location_type", "cellphone_access", "gender_of_respondent"]:
        data[col] = binary_encoder.fit_transform(data[col])

    # Drop the uniqueid column
    data = data.drop(["uniqueid"], axis=1)
    if feature_selection:
        data = data.drop(["household_size"], axis=1)

    # Scale the features. RobustScaler centers on the median and scales by the
    # interquartile range, so values are not confined to [0, 1].
    # Note: the scaler is refit on every call, so train and test are scaled independently.
    # scaler = MinMaxScaler(feature_range=(0, 1))
    # scaler = StandardScaler()
    scaler = RobustScaler()
    data = scaler.fit_transform(data)

    return data

In [5]:
# Preprocess the train and test data
processed_train = preprocessing_data(X_train)
processed_test = preprocessing_data(test)
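Calling the preprocessing function separately on the train and test sets refits the encoders and the scaler each time. A common alternative, sketched here on toy data, is to fit on the training set only and reuse the fitted objects for the test set. The frames `toy_train` and `toy_test` and their values are hypothetical. This is not the approach used in the notebook, only an illustration of the idea.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Toy frames standing in for X_train and test (values are illustrative)
toy_train = pd.DataFrame({
    "age_of_respondent": [24, 70, 26, 34],
    "job_type": ["Self employed", "Farming and Fishing",
                 "Self employed", "Informally employed"],
})
toy_test = pd.DataFrame({
    "age_of_respondent": [30, 51],
    "job_type": ["Self employed", "Farming and Fishing"],
})

# One-hot encode each frame, then force the test columns to match the training columns
train_enc = pd.get_dummies(toy_train, columns=["job_type"])
test_enc = pd.get_dummies(toy_test, columns=["job_type"])
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

# Fit the scaler on the training data only and reuse it for the test data
scaler = RobustScaler()
train_scaled = scaler.fit_transform(train_enc.astype(float))
test_scaled = scaler.transform(test_enc.astype(float))

print(train_scaled.shape, test_scaled.shape)  # (4, 4) (2, 4)
```

The `reindex` step guarantees that a category missing from the test set still gets a (zero-filled) column, so both arrays have the same width and column order.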

In [6]:
# shape of the processed train set
print(processed_train.shape)

(23524, 37)


## Split data into training and validation sets

In [7]:
# Split train_data
from sklearn.model_selection import train_test_split
X_Train, X_Val, y_Train, y_Val = train_test_split(processed_train, y_train, stratify = y_train, 
                                                  test_size = 0.1, random_state=42)
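The `stratify` argument keeps the share of positive labels roughly equal in both splits, which matters when the target is imbalanced. A small sketch on a synthetic target (the 14% positive rate is only illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 140 positives out of 1000 samples (illustrative)
y = np.array([1] * 140 + [0] * 860)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, stratify=y, test_size=0.1, random_state=42
)

# The positive rate should be close to the overall rate in both splits
print(f"overall={y.mean():.3f} train={y_tr.mean():.3f} val={y_va.mean():.3f}")
```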

## Build a Deep Learning Model (MLP)

In [8]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam

ModuleNotFoundError: No module named 'tensorflow'

In [24]:
# Define the MLP model
model = Sequential([
    Dense(128, activation='relu', input_shape=(X_Train.shape[1],)),
    Dropout(0.3),  # Prevent overfitting
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')  # Sigmoid for binary classification
])

# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])

# Model Summary
model.summary()
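As a cross-check on what `model.summary()` would report, the number of trainable parameters can be worked out by hand. The sketch below assumes 37 input features, taken from the processed training shape printed earlier. `dense_param_count` is a hypothetical helper, not a Keras function.

```python
# Layer widths of the MLP: 37 input features, then 128, 64, 32 and 1 units
layer_sizes = [37, 128, 64, 32, 1]

def dense_param_count(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

# Dropout layers have no trainable parameters, so only Dense layers are counted
counts = [dense_param_count(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts)       # [4864, 8256, 2080, 33]
print(sum(counts))  # 15233
```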


## Train the Model

In [26]:
from tensorflow.keras.callbacks import EarlyStopping

# Early stopping to prevent overfitting
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Train the model
history = model.fit(X_Train, y_Train, 
                    validation_data=(X_Val, y_Val), 
                    epochs=50, batch_size=32, 
                    callbacks=[early_stopping], verbose=1)
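The patience rule behind early stopping can be sketched in plain Python. This is a simplified illustration of the idea, not the Keras implementation. The function name `early_stopping_epoch` and the loss values are made up for the example.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience never ran out: training used every epoch
    return len(val_losses) - 1, best_epoch

# Illustrative validation losses (not results from this notebook)
losses = [0.40, 0.35, 0.33, 0.34, 0.335, 0.34, 0.36, 0.35, 0.37]
print(early_stopping_epoch(losses, patience=5))  # (7, 2)
```

With `restore_best_weights=True`, the model would be rolled back to the weights from the best epoch (epoch 2 in this toy run) rather than keeping those from the epoch where training stopped.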


## Evaluate Model Performance

In [28]:
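import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report

# Get class predictions by thresholding the predicted probabilities at 0.5
y_pred = (model.predict(X_Val) > 0.5).astype(int).ravel()

# Confusion matrix
conf_matrix = confusion_matrix(y_Val, y_pred)

# Display the confusion matrix. Tick labels are written out explicitly rather than
# read from `le.classes_`, so they do not depend on what `le` was last fitted on.
class_names = ["No", "Yes"]  # the target was encoded as "No" -> 0, "Yes" -> 1
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=class_names, yticklabels=class_names)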
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

# Classification Report
print("Classification Report:\n", classification_report(y_Val, y_pred))


Initial error rate of Logistic Regression classifier:  0.12324691882702932
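An error rate is the complement of accuracy, and both can be read directly off a confusion matrix, along with precision and recall. The counts below are illustrative only and are not results from this notebook.

```python
# Illustrative confusion-matrix counts (not results from this notebook).
# Rows are true labels, columns are predicted labels, ordered [No, Yes].
conf = [[1950, 70],
        [220, 113]]

tn, fp = conf[0]
fn, tp = conf[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
error_rate = (fp + fn) / total   # equals 1 - accuracy
precision = tp / (tp + fp)       # share of predicted "Yes" that are correct
recall = tp / (tp + fn)          # share of true "Yes" that are found

print(f"accuracy={accuracy:.3f} error_rate={error_rate:.3f}")
print(f"precision={precision:.3f} recall={recall:.3f}")
```

On an imbalanced target, a low error rate can coexist with poor recall on the minority class, which is why the classification report is worth reading alongside accuracy.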


In [None]:
from sklearn.metrics import roc_auc_score, roc_curve

# Predict probabilities (flattened to a 1-D array for scikit-learn)
y_prob = model.predict(X_Val).ravel()

# Compute ROC Curve
fpr, tpr, _ = roc_curve(y_Val, y_prob)
roc_auc = roc_auc_score(y_Val, y_prob)

# Plot ROC Curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='b', label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.title("Receiver Operating Characteristic (ROC) Curve")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend(loc="lower right")
plt.show()

print(f"AUC Score: {roc_auc:.2f}")
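One way to interpret the AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by comparing all positive/negative pairs on a tiny made-up example. `pairwise_auc` is a hypothetical helper written for illustration.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Tiny made-up example: 3 of the 4 positive/negative pairs are ordered correctly
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, scores))  # 0.75
```

This quadratic-time version is only meant to show the definition. `roc_auc_score` should give the same value on this example and is the practical choice for real data.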
