## **Bank Analysis**
---

- Task : Classification
- Objective : Predict which bank clients will subscribe to a term deposit

<br>

<center>
<img src="https://keralagbank.com/public/images/inner/personal/term-deposit.png">
</center>

### **Data description:**

**Bank Client Data**:

- `age` (numeric)
- `job` : type of job (categorical: "admin.", "unknown", "unemployed", "management", "housemaid", "entrepreneur", "student", "blue-collar", "self-employed", "retired", "technician", "services")
- `marital` : marital status (categorical: "married", "divorced", "single"; note: "divorced" means divorced or widowed)
- `education` (categorical: "unknown", "secondary", "primary", "tertiary")
- `default`: has credit in default? (binary: "yes", "no")
- `balance`: average yearly balance, in euros (numeric)
- `housing`: has housing loan? (binary: "yes", "no")
- `loan`: has personal loan? (binary: "yes", "no")

<br>

**Last contact of the current campaign**
- `contact`: contact communication type (categorical: "unknown", "telephone", "cellular")
- `day`: last contact day of the month (numeric)
- `month`: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
- `duration`: last contact duration, in seconds (numeric)

<br>

**Other attributes**
- `campaign`: number of contacts performed during this campaign and for this client (numeric, includes last contact)
- `pdays`: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
- `previous`: number of contacts performed before this campaign and for this client (numeric)
- `poutcome`: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

<br>

**Output variable (desired target)**
- `y`: has the client subscribed to a term deposit? (binary: "yes", "no")

In [40]:
#import necessary library
import pandas as pd
import numpy as np

#machine learning library
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

In [3]:
#import data
bank_data = pd.read_csv("C:\\Users\\ASUS\\Documents\\Python\\ML\\week 1\\bank-data.csv")

#review data
bank_data.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58.0,management,married,tertiary,no,2143.0,yes,no,unknown,,,261.0,1.0,-1.0,0.0,unknown,no
1,,technician,single,secondary,no,29.0,yes,no,unknown,5.0,may,151.0,1.0,-1.0,0.0,unknown,no
2,,entrepreneur,married,secondary,no,2.0,yes,yes,unknown,5.0,may,76.0,1.0,-1.0,0.0,unknown,no
3,,blue-collar,married,unknown,no,1506.0,yes,no,unknown,5.0,may,92.0,1.0,-1.0,0.0,unknown,no
4,33.0,unknown,single,unknown,no,1.0,no,no,,5.0,may,198.0,1.0,-1.0,0.0,unknown,no


In [4]:
bank_data.shape

(45211, 17)

In [5]:
#check duplicated data
bank_data.duplicated().sum()

0

In [6]:
#Input-Output Split
output_data = bank_data["y"]
input_data = bank_data.drop("y", axis=1)


In [7]:
#put in a function
def extractInputOutput(data,
                       output_column_name):
    """
    Split the data into input (features) and output (target).
    :param data: <pandas dataframe> all samples
    :param output_column_name: <string> name of the output column
    :return input_data: <pandas dataframe> input data
    :return output_data: <pandas series> output data
    """
    output_data = data[output_column_name]
    input_data = data.drop(output_column_name, axis=1)

    return input_data, output_data

In [8]:
X, y = extractInputOutput(data=bank_data, output_column_name="y")

In [9]:
X.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
0,58.0,management,married,tertiary,no,2143.0,yes,no,unknown,,,261.0,1.0,-1.0,0.0,unknown
1,,technician,single,secondary,no,29.0,yes,no,unknown,5.0,may,151.0,1.0,-1.0,0.0,unknown
2,,entrepreneur,married,secondary,no,2.0,yes,yes,unknown,5.0,may,76.0,1.0,-1.0,0.0,unknown
3,,blue-collar,married,unknown,no,1506.0,yes,no,unknown,5.0,may,92.0,1.0,-1.0,0.0,unknown
4,33.0,unknown,single,unknown,no,1.0,no,no,,5.0,may,198.0,1.0,-1.0,0.0,unknown


In [10]:
y.head()

0    no
1    no
2    no
3    no
4    no
Name: y, dtype: object

In [11]:
#splitting data to train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.25,
                                                    random_state=12)

In [12]:
X_test.shape[0]/X.shape[0]

0.25000552962774547
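The target is imbalanced (most clients answer "no"), so a plain random split may leave slightly different class proportions in the train and test sets. The sketch below uses made-up toy data, not the bank dataset, to show the `stratify` argument of `train_test_split`, which keeps the class proportions equal in both parts. Whether stratification is wanted here is a judgment call; the notebook itself uses an unstratified split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 1000 rows, 12% of them labelled "yes"
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"feature": rng.normal(size=1000)})
y_toy = pd.Series(np.where(np.arange(1000) < 120, "yes", "no"), name="y")

# stratify=y_toy keeps the share of "yes" the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=12, stratify=y_toy
)

train_rate = (y_tr == "yes").mean()
test_rate = (y_te == "yes").mean()
print(f"share of 'yes' in train: {train_rate:.3f}, in test: {test_rate:.3f}")
```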

Data Imputation

In [13]:
X_train.isnull().sum()

age          2626
job          2650
marital      2650
education    2542
default      2689
balance      2574
housing      2660
loan         2668
contact      2695
day          2617
month        2602
duration     2701
campaign     2614
pdays        2634
previous     2638
poutcome     2629
dtype: int64

In [16]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 33908 entries, 37156 to 14155
Data columns (total 16 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   age        31282 non-null  float64
 1   job        31258 non-null  object 
 2   marital    31258 non-null  object 
 3   education  31366 non-null  object 
 4   default    31219 non-null  object 
 5   balance    31334 non-null  float64
 6   housing    31248 non-null  object 
 7   loan       31240 non-null  object 
 8   contact    31213 non-null  object 
 9   day        31291 non-null  float64
 10  month      31306 non-null  object 
 11  duration   31207 non-null  float64
 12  campaign   31294 non-null  float64
 13  pdays      31274 non-null  float64
 14  previous   31270 non-null  float64
 15  poutcome   31279 non-null  object 
dtypes: float64(7), object(9)
memory usage: 4.4+ MB


Some columns contain NaN values, and some data types do not match the data description (for example, `age` and `day` are stored as floats).

Categorical data:
- job
- marital
- education
- default
- housing
- loan
- contact
- month
- poutcome

The remaining columns are numerical.
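The split into categorical and numerical columns can also be derived from the dtypes instead of being typed by hand. A minimal sketch on a made-up toy frame (the column values are illustrative only), using `DataFrame.select_dtypes`:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical) with the same kind of mixed dtypes as X_train
toy = pd.DataFrame({
    "age": [58.0, np.nan, 33.0],
    "job": ["management", "technician", None],
    "balance": [2143.0, 29.0, 1.0],
    "housing": ["yes", "yes", "no"],
})

# Numeric dtypes go to one list, everything else to the other
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numerical_cols)
print(categorical_cols)
```

This only works when the dtypes reflect the meaning of the columns; a category stored as integer codes would still need to be listed manually.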

In [17]:
#separate numerical columns
numerical_column = ["age", "balance", "day", "duration", "campaign", "pdays", "previous"]
X_train_numerical = X_train[numerical_column]
X_train_numerical.head()

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous
37156,35.0,2749.0,13.0,127.0,1.0,-1.0,0.0
20494,30.0,443.0,12.0,80.0,2.0,-1.0,0.0
35272,39.0,4239.0,7.0,40.0,1.0,-1.0,0.0
22260,49.0,400.0,21.0,151.0,3.0,-1.0,0.0
2728,28.0,468.0,13.0,152.0,3.0,-1.0,0.0


In [18]:
X_train_numerical.info()

<class 'pandas.core.frame.DataFrame'>
Index: 33908 entries, 37156 to 14155
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       31282 non-null  float64
 1   balance   31334 non-null  float64
 2   day       31291 non-null  float64
 3   duration  31207 non-null  float64
 4   campaign  31294 non-null  float64
 5   pdays     31274 non-null  float64
 6   previous  31270 non-null  float64
dtypes: float64(7)
memory usage: 2.1 MB


Numerical Imputation

In [20]:
#Function for numerical imputation using sklearn's SimpleImputer
def numericalImputation(data, numerical_column):
    """
    Impute missing values in the numerical columns with the column median.
    :param data: <pandas dataframe> input sample data
    :param numerical_column: <list> names of the numerical columns
    :return numerical_data_imputed: <pandas dataframe> imputed numerical data
    :return imputer_numerical: fitted SimpleImputer, to be reused on the test data
    """
    #Filter numerical data
    numerical_data = data[numerical_column]

    #Fit the imputer (the medians are learned from this data only)
    imputer_numerical = SimpleImputer(missing_values=np.nan,
                                      strategy="median")
    imputer_numerical.fit(numerical_data)

    #Transform
    imputed_data = imputer_numerical.transform(numerical_data)
    numerical_data_imputed = pd.DataFrame(imputed_data)

    numerical_data_imputed.columns = numerical_column
    numerical_data_imputed.index = numerical_data.index

    return numerical_data_imputed, imputer_numerical

In [21]:
X_train_numerical, imputer_numerical = numericalImputation(data=X_train,
                                                           numerical_column=numerical_column)

In [22]:
X_train_numerical.isnull().any()

age         False
balance     False
day         False
duration    False
campaign    False
pdays       False
previous    False
dtype: bool
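To make the behaviour of the median imputer concrete, here is a small sketch on made-up numbers (not the bank data). It shows that the medians are learned in `fit` and that the same fitted imputer then fills gaps in unseen rows, which is why the function above returns the imputer as well.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy train and test frames (hypothetical values)
train = pd.DataFrame({"age": [20.0, 30.0, np.nan, 50.0],
                      "balance": [100.0, np.nan, 300.0, 500.0]})
test = pd.DataFrame({"age": [np.nan], "balance": [np.nan]})

imputer = SimpleImputer(missing_values=np.nan, strategy="median")
imputer.fit(train)  # learns one median per column, ignoring NaN

train_imputed = pd.DataFrame(imputer.transform(train),
                             columns=train.columns, index=train.index)
# The test rows are filled with the medians learned from the training rows
test_imputed = pd.DataFrame(imputer.transform(test),
                            columns=test.columns, index=test.index)

print(imputer.statistics_)
print(test_imputed)
```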

Categorical Imputation

In [28]:
X_train_columns = list(X_train.columns)
categorical_column = list(set(X_train_columns).difference(set(numerical_column)))

In [29]:
X_train[categorical_column].isna().sum()

education    2542
housing      2660
poutcome     2629
month        2602
marital      2650
default      2689
contact      2695
job          2650
loan         2668
dtype: int64
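Note that the column order above differs from the order in `X_train`: a `set` has no defined order, and for strings the iteration order can change between Python sessions. Within one session this is harmless, but the one-hot column order may then differ from run to run. A minimal sketch of an order-preserving alternative, using hypothetical column lists:

```python
# Hypothetical column lists, for illustration only
numerical_cols = ["age", "balance"]
all_cols = ["age", "job", "marital", "balance", "housing"]

# Keep every column that is not numerical, in the original column order
categorical_cols = [col for col in all_cols if col not in numerical_cols]

print(categorical_cols)  # ['job', 'marital', 'housing']
```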

In [30]:
#Function for categorical imputation using pandas fillna
def categoricalImputation(data, categorical_column):
    """
    Impute missing values in the categorical columns with a placeholder category.
    :param data: <pandas dataframe> input sample data
    :param categorical_column: <list> names of the categorical columns
    :return categorical_data: <pandas dataframe> imputed categorical data
    """
    #data selection
    categorical_data = data[categorical_column]

    #Imputation: "KOSONG" (Indonesian for "empty") marks a missing value
    categorical_data = categorical_data.fillna(value="KOSONG")

    return categorical_data

In [34]:
X_train_categorical = categoricalImputation(X_train, 
                                            categorical_column=categorical_column)

X_train_categorical.isna().sum()

education    0
housing      0
poutcome     0
month        0
marital      0
default      0
contact      0
job          0
loan         0
dtype: int64

One Hot Encoding for categorical data

In [35]:
categorical_ohe = pd.get_dummies(X_train_categorical)
categorical_ohe.head()

Unnamed: 0,education_KOSONG,education_primary,education_secondary,education_tertiary,education_unknown,housing_KOSONG,housing_no,housing_yes,poutcome_KOSONG,poutcome_failure,...,job_retired,job_self-employed,job_services,job_student,job_technician,job_unemployed,job_unknown,loan_KOSONG,loan_no,loan_yes
37156,False,False,False,True,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,True,False
20494,True,False,False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,True,False,False
35272,False,False,False,True,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,True,False
22260,True,False,False,False,False,False,True,False,False,False,...,False,False,True,False,False,False,False,False,True,False
2728,False,False,True,False,False,False,False,True,False,False,...,False,False,False,False,True,False,False,False,True,False


In [36]:
#save OHE columns
OHE_columns = categorical_ohe.columns

In [37]:
OHE_columns

Index(['education_KOSONG', 'education_primary', 'education_secondary',
       'education_tertiary', 'education_unknown', 'housing_KOSONG',
       'housing_no', 'housing_yes', 'poutcome_KOSONG', 'poutcome_failure',
       'poutcome_other', 'poutcome_success', 'poutcome_unknown',
       'month_KOSONG', 'month_apr', 'month_aug', 'month_dec', 'month_feb',
       'month_jan', 'month_jul', 'month_jun', 'month_mar', 'month_may',
       'month_nov', 'month_oct', 'month_sep', 'marital_KOSONG',
       'marital_divorced', 'marital_married', 'marital_single',
       'default_KOSONG', 'default_no', 'default_yes', 'contact_KOSONG',
       'contact_cellular', 'contact_telephone', 'contact_unknown',
       'job_KOSONG', 'job_admin.', 'job_blue-collar', 'job_entrepreneur',
       'job_housemaid', 'job_management', 'job_retired', 'job_self-employed',
       'job_services', 'job_student', 'job_technician', 'job_unemployed',
       'job_unknown', 'loan_KOSONG', 'loan_no', 'loan_yes'],
      dtype='object')

Join numerical and categorical data

In [38]:
X_train_concat = pd.concat([X_train_numerical, categorical_ohe], axis=1)
X_train_concat.head()

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,education_KOSONG,education_primary,education_secondary,...,job_retired,job_self-employed,job_services,job_student,job_technician,job_unemployed,job_unknown,loan_KOSONG,loan_no,loan_yes
37156,35.0,2749.0,13.0,127.0,1.0,-1.0,0.0,False,False,False,...,False,False,False,False,False,False,False,False,True,False
20494,30.0,443.0,12.0,80.0,2.0,-1.0,0.0,True,False,False,...,False,False,False,False,False,False,False,True,False,False
35272,39.0,4239.0,7.0,40.0,1.0,-1.0,0.0,False,False,False,...,False,False,False,False,False,False,False,False,True,False
22260,49.0,400.0,21.0,151.0,3.0,-1.0,0.0,True,False,False,...,False,False,True,False,False,False,False,False,True,False
2728,28.0,468.0,13.0,152.0,3.0,-1.0,0.0,False,False,True,...,False,False,False,False,True,False,False,False,True,False


In [39]:
X_train_concat.isna().sum()

age                    0
balance                0
day                    0
duration               0
campaign               0
pdays                  0
previous               0
education_KOSONG       0
education_primary      0
education_secondary    0
education_tertiary     0
education_unknown      0
housing_KOSONG         0
housing_no             0
housing_yes            0
poutcome_KOSONG        0
poutcome_failure       0
poutcome_other         0
poutcome_success       0
poutcome_unknown       0
month_KOSONG           0
month_apr              0
month_aug              0
month_dec              0
month_feb              0
month_jan              0
month_jul              0
month_jun              0
month_mar              0
month_may              0
month_nov              0
month_oct              0
month_sep              0
marital_KOSONG         0
marital_divorced       0
marital_married        0
marital_single         0
default_KOSONG         0
default_no             0
default_yes            0


Standardized Variables

In [41]:
def standardizedData(data):
    """
    Standardize the data (zero mean and unit variance per column).
    :param data: <pandas dataframe> sample data
    :return standardized_data: <pandas dataframe> standardized sample data
    :return standardizer: fitted StandardScaler, to be reused on the test data
    """

    data_columns = data.columns
    data_index = data.index

    standardizer = StandardScaler()
    standardizer.fit(data)

    #Transform data
    standardized_data_raw = standardizer.transform(data)
    standardized_data = pd.DataFrame(standardized_data_raw)
    standardized_data.columns = data_columns
    standardized_data.index = data_index

    return standardized_data, standardizer

In [42]:
X_train_clean, standardizer = standardizedData(data=X_train_concat)

In [43]:
X_train_clean.head()

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,education_KOSONG,education_primary,education_secondary,...,job_retired,job_self-employed,job_services,job_student,job_technician,job_unemployed,job_unknown,loan_KOSONG,loan_no,loan_yes
37156,-0.568886,0.502047,-0.354434,-0.504073,-0.569761,-0.390255,-0.292138,-0.284681,-0.402775,-0.949599,...,-0.221001,-0.182418,-0.303041,-0.137592,-0.425647,-0.165313,-0.07566,-0.292238,0.543266,-0.418762
20494,-1.058043,-0.288093,-0.47921,-0.693995,-0.236736,-0.390255,-0.292138,3.512706,-0.402775,-0.949599,...,-0.221001,-0.182418,-0.303041,-0.137592,-0.425647,-0.165313,-0.07566,3.421863,-1.84072,-0.418762
35272,-0.177561,1.012588,-1.103089,-0.855631,-0.569761,-0.390255,-0.292138,-0.284681,-0.402775,-0.949599,...,-0.221001,-0.182418,-0.303041,-0.137592,-0.425647,-0.165313,-0.07566,-0.292238,0.543266,-0.418762
22260,0.800752,-0.302827,0.643772,-0.407091,0.096289,-0.390255,-0.292138,3.512706,-0.402775,-0.949599,...,-0.221001,-0.182418,3.299879,-0.137592,-0.425647,-0.165313,-0.07566,-0.292238,0.543266,-0.418762
2728,-1.253705,-0.279527,-0.354434,-0.40305,0.096289,-0.390255,-0.292138,-0.284681,-0.402775,1.053076,...,-0.221001,-0.182418,-0.303041,-0.137592,2.349365,-0.165313,-0.07566,-0.292238,0.543266,-0.418762
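Each value above is a z-score: the column mean is subtracted and the result is divided by the column standard deviation, both learned from the training data. The sketch below checks this on made-up numbers by comparing `StandardScaler` with the manual formula; note that the scaler uses the population standard deviation (`ddof=0`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy column (hypothetical values)
train = pd.DataFrame({"duration": [100.0, 200.0, 300.0]})

scaler = StandardScaler().fit(train)
scaled = scaler.transform(train)

# Manual z-score: (x - mean) / std, with the population standard deviation
manual = (train - train.mean()) / train.std(ddof=0)

# A new row is scaled with the mean and std learned from the training rows
new_row = pd.DataFrame({"duration": [200.0]})
new_scaled = scaler.transform(new_row)

print(np.allclose(scaled, manual.to_numpy()))
```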


Machine Learning

In [45]:
#baseline
y_train.value_counts(normalize=True)

y
no     0.882624
yes    0.117376
Name: proportion, dtype: float64
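These proportions define the baseline: a model that always predicts "no" would already be right about 88% of the time. The same baseline can be expressed as a model with scikit-learn's `DummyClassifier`, sketched here on made-up labels with a similar imbalance (not the bank data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels (hypothetical): 88 "no" and 12 "yes"
y_toy = np.array(["no"] * 88 + ["yes"] * 12)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this baseline

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_accuracy = baseline.score(X_toy, y_toy)
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```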

In [46]:
#import machine learning model
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [47]:
#KNN Model
knn = KNeighborsClassifier()
knn.fit(X_train_clean, y_train)

In [48]:
#Logistic Regression Model
logreg = LogisticRegression(random_state=123)
logreg.fit(X_train_clean, y_train)

In [50]:
#Random Forest Model
random_forest = RandomForestClassifier(random_state=123, n_estimators=500)
random_forest.fit(X_train_clean, y_train)

In [55]:
#prediction using logistic regression (on the training data)
predicted_logreg = pd.DataFrame(logreg.predict(X_train_clean))
predicted_logreg.head()

Unnamed: 0,0
0,no
1,no
2,no
3,no
4,no


In [56]:
#prediction using knn (on the training data)
predicted_knn = pd.DataFrame(knn.predict(X_train_clean))
predicted_knn.head()

Unnamed: 0,0
0,no
1,no
2,no
3,no
4,no


In [57]:
#prediction using random forest (on the training data)
predicted_rf = pd.DataFrame(random_forest.predict(X_train_clean))
predicted_rf.head()

Unnamed: 0,0
0,no
1,no
2,no
3,no
4,no


Check Model Performance

In [59]:
#benchmark: share of the majority class ("no")
benchmark = y_train.value_counts(normalize=True).iloc[0]
benchmark

0.8826235696590775

In [63]:
knn.score(X_train_clean, y_train)

0.9098737761000354

In [64]:
logreg.score(X_train_clean, y_train)

0.900554441429751

In [65]:
random_forest.score(X_train_clean, y_train)

1.0
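These scores are computed on the same data the models were fitted on, so they are optimistic; a training accuracy of 1.0 for the random forest mostly shows that the trees memorized the training rows. One common way to get a more realistic estimate without touching the test set is k-fold cross-validation. The sketch below is an illustration on synthetic data from `make_classification`, with a smaller forest than the notebook's to keep it fast; it is not a rerun of the models above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced toy data (hypothetical, about 12% positives)
X_toy, y_toy = make_classification(n_samples=500, n_features=10,
                                   weights=[0.88, 0.12], random_state=123)

model = RandomForestClassifier(n_estimators=50, random_state=123)

# 5-fold cross-validation: each fold is scored by a model that never saw it
cv_scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy")

# Accuracy on the training data itself, for comparison
train_score = model.fit(X_toy, y_toy).score(X_toy, y_toy)

print(f"CV accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
print(f"Train accuracy: {train_score:.4f}")
```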

Test Prediction

In [78]:
def extractTest(data,
                numerical_column, categorical_column, ohe_column,
                imputer_numerical, standardizer):
    """
    Extract and clean the test data with the transformers fitted on the training data.
    :param data: <pandas dataframe> test sample data
    :param numerical_column: <list> numerical columns
    :param categorical_column: <list> categorical columns
    :param ohe_column: <list> one-hot-encoded columns of the categorical data
    :param imputer_numerical: <sklearn method> fitted imputer for numerical data
    :param standardizer: <sklearn method> fitted data standardizer
    :return cleaned_data: <pandas dataframe> final data
    """

    #filter data
    numerical_data = data[numerical_column]
    categorical_data = data[categorical_column]

    #numerical data
    numerical_data = pd.DataFrame(imputer_numerical.transform(numerical_data))
    numerical_data.columns = numerical_column
    numerical_data.index = data.index

    #categorical data
    categorical_data = categorical_data.fillna(value="KOSONG")
    categorical_data.index = data.index
    categorical_data = pd.get_dummies(categorical_data)
    #reindex returns a new dataframe, so the result must be assigned;
    #OHE columns that are missing from the test data are filled with False
    categorical_data = categorical_data.reindex(index=categorical_data.index,
                                                columns=ohe_column,
                                                fill_value=False)

    #concat
    concat_data = pd.concat([numerical_data, categorical_data],
                            axis=1)
    cleaned_data = pd.DataFrame(standardizer.transform(concat_data))
    cleaned_data.columns = concat_data.columns
    cleaned_data.index = concat_data.index

    return cleaned_data
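The `reindex` step matters whenever the test data lacks a category that was present in the training data: `pd.get_dummies` would then produce fewer columns than the model expects. A minimal sketch of the alignment on made-up data (the frames below are illustrative only):

```python
import pandas as pd

# Toy data (hypothetical): the test frame only contains "may"
train_cat = pd.DataFrame({"month": ["may", "jun", "KOSONG"]})
test_cat = pd.DataFrame({"month": ["may", "may"]})

train_ohe = pd.get_dummies(train_cat)
test_ohe = pd.get_dummies(test_cat)  # has a single column, month_may

# reindex returns a new frame; missing columns are created and filled with False
test_aligned = test_ohe.reindex(columns=train_ohe.columns, fill_value=False)

print(list(test_ohe.columns))
print(list(test_aligned.columns))
```

Calling `reindex` without assigning its result leaves the frame unchanged, which is easy to miss because no error is raised.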


In [73]:
def testPrediction(X_test, y_test, classifier, compute_score):
    """
    Get predictions from a fitted model.
    :param X_test: <pandas dataframe> input
    :param y_test: <pandas series> output/target
    :param classifier: <sklearn method> classification model
    :param compute_score: <bool> True: compute and print the score, False: skip it
    :return test_predict: <numpy array> predictions for the input data
    :return score: <float> model accuracy (None if compute_score is False)
    """
    #default, so that score is defined even when compute_score is False
    score = None
    if compute_score:
        score = classifier.score(X_test, y_test)
        print(f"Accuracy: {score:.4f}")

    test_predict = classifier.predict(X_test)

    return test_predict, score

In [79]:
X_test_clean = extractTest(data = X_test,
                           numerical_column = numerical_column,
                           categorical_column = categorical_column,
                           ohe_column = OHE_columns,
                           imputer_numerical = imputer_numerical,
                           standardizer = standardizer)
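The manual steps used in this notebook (impute, encode, standardize, then apply the same fitted objects to the test data) can also be bundled into a single scikit-learn `Pipeline` with a `ColumnTransformer`. The sketch below is an alternative on made-up toy data, not the notebook's method. One difference is an assumption of this sketch: only the numerical columns are scaled, whereas the notebook also standardizes the one-hot columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical), with missing values in both column types
X_toy = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 51.0, 33.0, 62.0],
    "contact": pd.Series(["cellular", "telephone", np.nan,
                          "cellular", "telephone", "cellular"], dtype=object),
})
y_toy = pd.Series(["no", "no", "yes", "yes", "no", "yes"])

# Numerical branch: median imputation, then standardization
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical branch: placeholder category, then one-hot encoding
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="KOSONG")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["age"]),
    ("cat", categorical_steps, ["contact"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(random_state=123)),
])

# fit learns every preprocessing step and the classifier in one call
model.fit(X_toy, y_toy)
predictions = model.predict(X_toy)
print(predictions)
```

With this layout, `model.predict(X_test)` would apply the fitted preprocessing to raw test rows, so a separate function such as `extractTest` is no longer needed.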

In [80]:
# Logistic Regression Performance
logreg_test_predict, score = testPrediction(X_test = X_test_clean,
                                            y_test = y_test,
                                            classifier = logreg,
                                            compute_score = True)

Accuracy: 0.9010


In [81]:
# K nearest neighbor Performance
knn_test_predict, score = testPrediction(X_test = X_test_clean,
                                         y_test = y_test,
                                         classifier = knn,
                                         compute_score = True)

Accuracy: 0.8898


In [82]:
# Random Forest Performance
rf_test_predict, score = testPrediction(X_test = X_test_clean,
                                        y_test = y_test,
                                        classifier = random_forest,
                                        compute_score = True)

Accuracy: 0.9022
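All three test accuracies (about 0.89 to 0.90) are only slightly above the majority-class share of about 0.88, so accuracy alone says little about how well the "yes" clients are detected. Per-class metrics help here. The sketch below uses ten made-up labels, not the notebook's predictions, to show a case where accuracy is high while half of the "yes" clients are missed.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Toy labels (hypothetical): 8 "no" and 2 "yes"; one "yes" is missed
y_true = ["no"] * 8 + ["yes", "yes"]
y_pred = ["no"] * 8 + ["no", "yes"]

accuracy = accuracy_score(y_true, y_pred)
# Rows are the true classes, columns the predicted classes, in the order given
matrix = confusion_matrix(y_true, y_pred, labels=["no", "yes"])
# Recall: share of actual "yes" clients that the model found
recall_yes = recall_score(y_true, y_pred, pos_label="yes")
# Precision: share of predicted "yes" that were actually "yes"
precision_yes = precision_score(y_true, y_pred, pos_label="yes")

print(f"accuracy={accuracy:.2f}, recall={recall_yes:.2f}, precision={precision_yes:.2f}")
print(matrix)
```

On the real data, the same functions could be applied to `y_test` and, for example, `logreg_test_predict`.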
