# Getting Started

**Name - Rajesh Kumar Mishra**

**Github Link - https://github.com/mishra2022/Module_2_Assignment**

**Overview**
In the modern banking sector, the ability to efficiently process, analyze, and draw insights from vast volumes of data is crucial. Banks and financial institutions generate and collect extensive data, including customer demographics, transaction histories, market trends, and more. This data, when effectively analyzed, can lead to improved customer service, risk management, marketing strategies, and overall operational efficiency.

**Project Background**
The banking industry faces challenges in managing and utilizing large datasets due to the volume, variety, and velocity of data. Traditional data processing methods often fall short in providing timely insights and handling real-time data streams. With the advent of distributed computing and machine learning technologies, banks now have the opportunity to harness these large datasets to make informed decisions, predict market trends, and enhance customer experiences.

**Dataset Overview**
- age: Age of the individual (integer).
- job: Job type (object/string).
- marital: Marital status (object/string).
- education: Education level (object/string).
- default: Indicates if the individual has credit in default (object/string).
- balance: Account balance (integer).
- housing: Indicates if the individual has a housing loan (object/string).
- loan: Indicates if the individual has a personal loan (object/string).
- contact: Type of communication contact (object/string).
- day: Last contact day of the month (integer).
- month: Last contact month of the year (object/string).
- duration: Last contact duration, in seconds (integer).
- campaign: Number of contacts performed during this campaign for this client (integer).
- pdays: Number of days that passed by after the client was last contacted from a previous campaign (integer, '-1' means client was not previously contacted).
- previous: Number of contacts performed before this campaign and for this client (integer).
- poutcome: Outcome of the previous marketing campaign (object/string).
- y: Indicates if the client has subscribed to a term deposit (object/string).

**Project Goal**
The primary goal of this project is to demonstrate how distributed machine learning can transform banking data into actionable insights. Using the "bank.csv" dataset, students will explore various aspects of distributed computing, from data storage and querying to predictive analytics and real-time data processing. The project aims to simulate a real-world banking data environment, offering insights into customer behavior, identifying key trends, and facilitating data-driven decision-making.

**Specific Objectives**
1. Data Analysis and Management: Utilize Hadoop and Hive to store and query large volumes of banking data efficiently. This step simulates how banks manage and access their vast data repositories.

2. Exploratory Data Analysis (EDA) with Spark: Perform EDA on the "bank.csv" dataset using Apache Spark to uncover trends, patterns, and anomalies in the banking data. This mirrors the initial steps banks take in understanding their customer base and market conditions.

3. Predictive Modeling for Banking Trends: Develop machine learning models using Spark ML to predict customer behavior, loan default probabilities, or other trends. This reflects how banks leverage predictive analytics for risk assessment and strategic planning.

4. Real-Time Transaction Analysis: Implement a Spark Streaming system to process simulated real-time banking transactions. Banks rely on this kind of component to monitor transactions in real time for fraud detection, customer service, and immediate business insights.

5. Efficient Data Handling through Data Parallelism: Explore and apply data parallelism techniques to enhance the processing efficiency of large-scale banking data. This aspect is essential for banks dealing with continuously growing data volumes and needing scalable solutions.

In [None]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

import tensorflow as tf

from sklearn.metrics import classification_report, confusion_matrix

In [None]:
tf.random.set_seed(100)

In [None]:
data = pd.read_csv('../input/bank-marketing-campaigns-dataset/bank-additional-full.csv', delimiter=';')

In [None]:
data

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,no,yes,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes
41184,46,blue-collar,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41185,56,retired,married,university.degree,no,yes,no,cellular,nov,fri,...,2,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41186,44,technician,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null 

# Encoding Labels

In [None]:
data['y'] = data['y'].apply(lambda y: 1 if y == 'yes' else 0)
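
The `apply` call above turns every value other than `'yes'` into 0, so an unexpected label would pass unnoticed. As a minimal sketch on a small made-up Series (not the real `data['y']` column), an explicit mapping can surface such values instead:

```python
import pandas as pd

# Toy labels for illustration only; the real column is data['y'].
labels = pd.Series(['no', 'yes', 'no', 'yes', 'no'])

# An explicit mapping leaves any label outside the dict as NaN
# instead of silently turning it into 0.
encoded = labels.map({'yes': 1, 'no': 0})

# Fail loudly if a label outside the mapping was present.
assert encoded.notna().all(), "unexpected label value"

encoded = encoded.astype(int)
print(encoded.tolist())  # [0, 1, 0, 1, 0]
```

For this dataset the two approaches should give the same result, since `y` only holds `'yes'` and `'no'`; the mapping simply makes that assumption checkable.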

# Encoding Categorical Features

In [None]:
data.select_dtypes('object')

Unnamed: 0,job,marital,education,default,housing,loan,contact,month,day_of_week,poutcome
0,housemaid,married,basic.4y,no,no,no,telephone,may,mon,nonexistent
1,services,married,high.school,unknown,no,no,telephone,may,mon,nonexistent
2,services,married,high.school,no,yes,no,telephone,may,mon,nonexistent
3,admin.,married,basic.6y,no,no,no,telephone,may,mon,nonexistent
4,services,married,high.school,no,no,yes,telephone,may,mon,nonexistent
...,...,...,...,...,...,...,...,...,...,...
41183,retired,married,professional.course,no,yes,no,cellular,nov,fri,nonexistent
41184,blue-collar,married,professional.course,no,no,no,cellular,nov,fri,nonexistent
41185,retired,married,university.degree,no,yes,no,cellular,nov,fri,nonexistent
41186,technician,married,professional.course,no,no,no,cellular,nov,fri,nonexistent


In [None]:
{column: len(data[column].unique()) for column in data.select_dtypes('object').columns}

{'job': 12,
 'marital': 4,
 'education': 8,
 'default': 3,
 'housing': 3,
 'loan': 3,
 'contact': 2,
 'month': 10,
 'day_of_week': 5,
 'poutcome': 3}

In [None]:
{column: list(data[column].unique()) for column in data.select_dtypes('object').columns}

{'job': ['housemaid',
  'services',
  'admin.',
  'blue-collar',
  'technician',
  'retired',
  'management',
  'unemployed',
  'self-employed',
  'unknown',
  'entrepreneur',
  'student'],
 'marital': ['married', 'single', 'divorced', 'unknown'],
 'education': ['basic.4y',
  'high.school',
  'basic.6y',
  'basic.9y',
  'professional.course',
  'unknown',
  'university.degree',
  'illiterate'],
 'default': ['no', 'unknown', 'yes'],
 'housing': ['no', 'yes', 'unknown'],
 'loan': ['no', 'yes', 'unknown'],
 'contact': ['telephone', 'cellular'],
 'month': ['may',
  'jun',
  'jul',
  'aug',
  'oct',
  'nov',
  'dec',
  'mar',
  'apr',
  'sep'],
 'day_of_week': ['mon', 'tue', 'wed', 'thu', 'fri'],
 'poutcome': ['nonexistent', 'failure', 'success']}

In [None]:
data = data.replace('unknown', np.nan)

In [None]:
data.isna().sum()

age                  0
job                330
marital             80
education         1731
default           8597
housing            990
loan               990
contact              0
month                0
day_of_week          0
duration             0
campaign             0
pdays                0
previous             0
poutcome             0
emp.var.rate         0
cons.price.idx       0
cons.conf.idx        0
euribor3m            0
nr.employed          0
y                    0
dtype: int64
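
The raw counts above are easier to judge as shares of the 41188 rows; `default`, for example, is missing in 8597 rows, roughly 20.9%. A minimal sketch of computing such shares, using a toy frame whose values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; column names mirror the dataset.
toy = pd.DataFrame({
    'default': ['no', np.nan, 'no', np.nan],
    'housing': ['yes', 'no', np.nan, 'no'],
    'contact': ['cellular', 'telephone', 'cellular', 'cellular'],
})

# isna().mean() gives the fraction of missing entries per column.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the real frame, the same expression would be `data.isna().mean()`.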

In [None]:
def onehot_encode(df, columns, prefixes):
    df = df.copy()
    for column, prefix in zip(columns, prefixes):
        dummies = pd.get_dummies(df[column], prefix=prefix)
        df = pd.concat([df, dummies], axis=1)
        df = df.drop(column, axis=1)
    return df

def ordinal_encode(df, columns, orderings):
    df = df.copy()
    for column, ordering in zip(columns, orderings):
        df[column] = df[column].apply(lambda x: ordering.index(x))
    return df

def binary_encode(df, columns, positive_values):
    df = df.copy()
    for column, positive_value in zip(columns, positive_values):
        # Map the positive value to 1 and every other non-missing value to 0.
        # Missing values stay NaN so they can be imputed later.
        df[column] = df[column].apply(
            lambda x: x if pd.isna(x) else (1 if x == positive_value else 0)
        )
    return df

In [None]:
nominal_features = [
    'job',
    'marital',
    'education',
    'day_of_week',
    'poutcome'
]

ordinal_features = [
    'month'
]

binary_features = [
    'default',
    'housing',
    'loan',
    'contact'
]

In [None]:
prefixes = ['J', 'M', 'E', 'D', 'P']

orderings = [
    ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
]

positive_values = [
    'yes',
    'yes',
    'yes',
    'cellular'
]

In [None]:
data = onehot_encode(
    data,
    columns=nominal_features,
    prefixes=prefixes
)

data = ordinal_encode(
    data,
    columns=ordinal_features,
    orderings=orderings
)

data = binary_encode(
    data,
    columns=binary_features,
    positive_values=positive_values
)
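
The three encoding steps above can also be expressed with pandas built-ins. The following is a sketch on a toy frame with made-up values, not a drop-in replacement: it assumes each binary column holds only two categories plus missing values, and the ordered `Categorical` would return -1 for a month that is not in the list.

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    'job': ['admin.', 'services', 'admin.'],
    'month': ['may', 'nov', 'mar'],
    'housing': ['yes', 'no', None],
})

months = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
          'jul', 'aug', 'sep', 'oct', 'nov', 'dec']

# One-hot: get_dummies can encode and drop the source column in one call.
toy = pd.get_dummies(toy, columns=['job'], prefix=['J'])

# Ordinal: an ordered Categorical exposes the position in `months` as a code.
toy['month'] = pd.Categorical(toy['month'], categories=months, ordered=True).codes

# Binary: map() sends values outside the dict (including missing ones) to NaN.
toy['housing'] = toy['housing'].map({'yes': 1, 'no': 0})

print(toy)
```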

In [None]:
data

Unnamed: 0,age,default,housing,loan,contact,month,duration,campaign,pdays,previous,...,E_professional.course,E_university.degree,D_fri,D_mon,D_thu,D_tue,D_wed,P_failure,P_nonexistent,P_success
0,56,0.0,0.0,0.0,0,4,261,1,999,0,...,0,0,0,1,0,0,0,0,1,0
1,57,,0.0,0.0,0,4,149,1,999,0,...,0,0,0,1,0,0,0,0,1,0
2,37,0.0,0.0,0.0,0,4,226,1,999,0,...,0,0,0,1,0,0,0,0,1,0
3,40,0.0,0.0,0.0,0,4,151,1,999,0,...,0,0,0,1,0,0,0,0,1,0
4,56,0.0,0.0,0.0,0,4,307,1,999,0,...,0,0,0,1,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,0.0,0.0,0.0,0,10,334,1,999,0,...,1,0,1,0,0,0,0,0,1,0
41184,46,0.0,0.0,0.0,0,10,383,1,999,0,...,1,0,1,0,0,0,0,0,1,0
41185,56,0.0,0.0,0.0,0,10,189,2,999,0,...,0,1,1,0,0,0,0,0,1,0
41186,44,0.0,0.0,0.0,0,10,442,1,999,0,...,1,0,1,0,0,0,0,0,1,0


# Filling Missing Values

In [None]:
for column in ['default', 'housing', 'loan']:
    data[column] = data[column].fillna(data[column].mean())

In [None]:
print("Remaining missing values:", data.isna().sum().sum())

Remaining missing values: 0
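
Because `default`, `housing`, and `loan` are 0/1 columns at this point, filling with the mean inserts the share of positive answers rather than a valid category. A small sketch on a made-up column contrasts this with filling by the most frequent value, which is one possible alternative and not what the notebook does:

```python
import numpy as np
import pandas as pd

# Toy 0/1 column with two missing entries, for illustration only.
housing = pd.Series([1.0, 0.0, 1.0, 1.0, np.nan, np.nan])

# Mean fill: missing rows receive the share of 1s among observed rows (0.75).
mean_filled = housing.fillna(housing.mean())

# Mode fill: missing rows receive the most frequent observed value (1.0).
mode_filled = housing.fillna(housing.mode().iloc[0])

print(mean_filled.tolist())  # [1.0, 0.0, 1.0, 1.0, 0.75, 0.75]
print(mode_filled.tolist())  # [1.0, 0.0, 1.0, 1.0, 1.0, 1.0]
```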


In [None]:
print("Remaining non-numeric columns:", len(data.select_dtypes('object').columns))

Remaining non-numeric columns: 0


# Splitting/Scaling

In [None]:
data

Unnamed: 0,age,default,housing,loan,contact,month,duration,campaign,pdays,previous,...,E_professional.course,E_university.degree,D_fri,D_mon,D_thu,D_tue,D_wed,P_failure,P_nonexistent,P_success
0,56,0.0,0.0,0.0,0,4,261,1,999,0,...,0,0,0,1,0,0,0,0,1,0
1,57,0.0,0.0,0.0,0,4,149,1,999,0,...,0,0,0,1,0,0,0,0,1,0
2,37,0.0,0.0,0.0,0,4,226,1,999,0,...,0,0,0,1,0,0,0,0,1,0
3,40,0.0,0.0,0.0,0,4,151,1,999,0,...,0,0,0,1,0,0,0,0,1,0
4,56,0.0,0.0,0.0,0,4,307,1,999,0,...,0,0,0,1,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,0.0,0.0,0.0,0,10,334,1,999,0,...,1,0,1,0,0,0,0,0,1,0
41184,46,0.0,0.0,0.0,0,10,383,1,999,0,...,1,0,1,0,0,0,0,0,1,0
41185,56,0.0,0.0,0.0,0,10,189,2,999,0,...,0,1,1,0,0,0,0,0,1,0
41186,44,0.0,0.0,0.0,0,10,442,1,999,0,...,1,0,1,0,0,0,0,0,1,0


In [None]:
y = data['y'].copy()
X = data.drop('y', axis=1).copy()

In [None]:
scaler = StandardScaler()

X = scaler.fit_transform(X)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=100)
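
The cells above fit the scaler on all rows before splitting, so the test rows influence the scaling statistics. A common alternative is to split first and fit the scaler on the training part only. The sketch below uses synthetic data and hypothetical names (`X_demo`, `y_demo`, `scaler_demo`); the `stratify` argument is an addition that keeps the class ratio similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and labels (illustration only).
rng = np.random.default_rng(100)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
y_demo = (rng.random(1000) < 0.11).astype(int)

# Split first; stratify keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=100, stratify=y_demo
)

# Fit the scaler on the training part only, then reuse it for the test part.
scaler_demo = StandardScaler()
X_tr = scaler_demo.fit_transform(X_tr)
X_te = scaler_demo.transform(X_te)

print(X_tr.mean(axis=0).round(3))  # approximately zero by construction
print(X_te.mean(axis=0).round(3))  # close to zero, but not exactly
```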

# Modeling/Training

In [None]:
print("Positive examples: {}".format(y.sum()))
print("Negative examples: {}".format(len(y) - y.sum()))

print("\nClass Distribution: {:.1f}% / {:.1f}%".format(y.mean() * 100, (1 - y.mean()) * 100))

Positive examples: 4640
Negative examples: 36548

Class Distribution: 11.3% / 88.7%
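
With only about 11% positive examples, one option is to weight the classes inversely to their frequency. This sketch applies the common "balanced" heuristic to the counts printed above; it is an optional technique, not something the notebook uses.

```python
import numpy as np

# Class counts taken from the output above: index 0 = negative, 1 = positive.
counts = np.array([36548, 4640])

# "Balanced" heuristic: n_samples / (n_classes * count_per_class).
weights = counts.sum() / (len(counts) * counts)
class_weight = {label: float(w) for label, w in enumerate(weights)}

print(class_weight)  # roughly {0: 0.56, 1: 4.44}
```

A dictionary of this shape can be passed to Keras through the `class_weight` argument of `model.fit`, which scales each example's contribution to the loss by the weight of its class.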


In [None]:
inputs = tf.keras.Input(shape=(X.shape[1],))
x = tf.keras.layers.Dense(64, activation='relu')(inputs)
x = tf.keras.layers.Dense(64, activation='relu')(x)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.Model(inputs, outputs)


model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=[
        'accuracy',
        tf.keras.metrics.AUC(name='auc')
    ]
)

batch_size = 32
epochs = 100

history = model.fit(
    X_train,
    y_train,
    validation_split=0.2,
    batch_size=batch_size,
    epochs=epochs,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(
            monitor='val_loss',
            patience=3,
            restore_best_weights=True
        )
    ]
)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100


# Results

In [None]:
model.evaluate(X_test, y_test)



[0.19635070860385895, 0.9074208736419678, 0.9286737442016602]

In [None]:
y_true = np.array(y_test)
y_pred = np.squeeze(np.array(model.predict(X_test) >= 0.5, dtype=int))

In [None]:
print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))

Confusion Matrix:
 [[10618   357]
 [  787   595]]
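
The headline metrics can be derived directly from the four counts in this matrix. The sketch below does the arithmetic and also computes the accuracy of always predicting the majority class, which puts the 0.907 test accuracy in context.

```python
# Counts copied from the confusion matrix above (rows = true, columns = predicted).
tn, fp = 10618, 357
fn, tp = 787, 595

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of always predicting the majority (negative) class.
baseline = (tn + fp) / total

print(f"accuracy={accuracy:.3f} baseline={baseline:.3f}")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The model improves on the majority-class baseline by about two percentage points of accuracy, while recall on the positive class stays at about 0.43.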


In [None]:
print("Classification Report:\n\n", classification_report(y_true, y_pred))

Classification Report:

               precision    recall  f1-score   support

           0       0.93      0.97      0.95     10975
           1       0.62      0.43      0.51      1382

    accuracy                           0.91     12357
   macro avg       0.78      0.70      0.73     12357
weighted avg       0.90      0.91      0.90     12357
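
The report above reflects the fixed 0.5 decision threshold. Lowering the threshold usually trades precision for recall on the positive class. The sketch below illustrates that trade-off with made-up labels and probabilities; the numbers are not taken from the model, and `precision_recall_at` is a hypothetical helper.

```python
import numpy as np

# Made-up labels and predicted probabilities, for illustration only.
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob_demo = np.array([0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.30, 0.40, 0.55, 0.90])

def precision_recall_at(threshold):
    """Return (precision, recall) when predicting 1 for prob >= threshold."""
    pred = (y_prob_demo >= threshold).astype(int)
    tp = int(((pred == 1) & (y_true_demo == 1)).sum())
    fp = int(((pred == 1) & (y_true_demo == 0)).sum())
    fn = int(((pred == 0) & (y_true_demo == 1)).sum())
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for threshold in (0.5, 0.3):
    p, r = precision_recall_at(threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

In this toy example, moving the threshold from 0.5 to 0.3 raises recall from 0.50 to 1.00 and lowers precision from 0.67 to 0.57.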

