# Case Study - Term Deposit

**Author:** Priya Sharma

**Email:** priyasharma1908@gmail.com

## Objective

The objective of this report is to present the analysis and results from modeling client data to predict whether a client will subscribe to a term deposit.

We have followed the **scientific method** for this analysis and have broken down the steps as follows:

* **Objective**

* **Research**

* **Hypothesis**

* **Analysis**

* **Conclusion**

In [1]:
# Loading required packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import itertools
import shap

from dython.nominal import theils_u, correlation_ratio
from pandas.api.types import is_numeric_dtype
from scipy import stats
from sklearn import tree

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

from IPython.display import display, HTML

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [2]:
# Load dataset
term_deposit_data = pd.read_csv("./bank-additional-full.csv", sep=";")
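
In this dataset, missing categorical values are recorded as the literal string `unknown` rather than as `NaN`, and the utility functions later in this report rely on that convention. As a minimal sketch, assuming a small hypothetical frame `toy` in place of the real file, the number of such placeholders per column could be checked like this:

```python
import pandas as pd

# Hypothetical toy frame standing in for the real CSV (values are made up)
toy = pd.DataFrame({
    "age": [30, 41, 56],
    "job": ["admin.", "unknown", "technician"],
    "housing": ["yes", "unknown", "unknown"],
})

# Count how often the 'unknown' placeholder appears in each column
unknown_counts = toy.eq("unknown").sum()

# Show only the columns that actually contain the placeholder
print(unknown_counts[unknown_counts > 0])
```

Running the same check on `term_deposit_data` would show which variables need missing-value treatment before modeling.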

## Research

### Term Deposit

* A term deposit is a type of deposit account held at a financial institution where money is locked up for some set period of time.

* Term deposits are usually short-term deposits with maturities ranging from one month to a few years.

* Typically, term deposits offer higher interest rates than traditional liquid savings accounts, from which customers can withdraw their money at any time.

## User-defined functions

In [3]:
# Utility functions
def remove_rows_with_unknowns(df:'pd.DataFrame')->'pd.DataFrame':
    """
    Function to drop rows having 'unknown' values for any variable.
    
    Args:
        df: Dataframe
    
    Returns:
        Dataframe without any 'unknown' values
    """
    return(df[~df.replace('unknown', np.nan).isna().sum(axis=1).astype('bool')])

def min_max_scaling(train_df:'pd.DataFrame', test_df:'pd.DataFrame')->'tuple':
    """
    Function to scale numeric columns of train and test dataframes using MinMaxScaler.
    Note: both dataframes are modified in place.
    
    Args:
        train_df: Training dataframe. Each scaler is fitted on this
        test_df: Test dataframe. Scalers fitted on train are applied to test
        
    Returns:
        Tuple with first element as the scaled training dataframe and second element as the scaled test dataframe
    """
    for column in train_df.columns:
        if(is_numeric_dtype(train_df[column])):
            scaler = MinMaxScaler()
            train_df[column] = scaler.fit_transform(train_df[[column]])
            test_df[column] = scaler.transform(test_df[[column]])
    return(train_df, test_df)

def compute_correlation(df:'pd.DataFrame')->'pd.DataFrame':
    """
    Function to compute correlation values as follows:
    - Pearson correlation (numeric-numeric)
    - Correlation Ratio (numeric-categorical)
    - Theil's U (categorical-categorical)
    
    Args:
        df: Dataset for which to compute correlation matrix
        
    Returns:
        Correlation matrix as dataframe
    """
    # Get list of columns
    list_of_columns = df.columns
    
    # Initialize empty dataframe for correlation matrix
    corr_df = pd.DataFrame(index=list_of_columns, columns=list_of_columns)
    
    # Iterate over each column
    for i in range(len(list_of_columns)):
        # For each column, iterate over list of columns again to get pair-wise columns
        # Note: We are iterating over (i, j) and (j, i) separately as Theil's U is not symmetric
        for j in range(len(list_of_columns)):
            if is_numeric_dtype(df[list_of_columns[i]]):
                if is_numeric_dtype(df[list_of_columns[j]]):
                    # Case 1: Both are numeric
                    corr_value = np.corrcoef(df[list_of_columns[i]], df[list_of_columns[j]])
                    corr_df.loc[list_of_columns[i], list_of_columns[j]] = corr_value[0][1]
                else:
                    # Case 2: One is categorical
                    corr_value = correlation_ratio(df[list_of_columns[j]], df[list_of_columns[i]])
                    corr_df.loc[list_of_columns[i], list_of_columns[j]] = corr_value
            elif is_numeric_dtype(df[list_of_columns[j]]):
                # Case 2: Column i is categorical and column j is numeric
                # (column i cannot be numeric in this branch, so no numeric-numeric check is needed)
                corr_value = correlation_ratio(df[list_of_columns[i]], df[list_of_columns[j]])
                corr_df.loc[list_of_columns[i], list_of_columns[j]] = corr_value
            else:
                # Case 3: Both are categorical
                corr_value = theils_u(df[list_of_columns[i]], df[list_of_columns[j]])
                corr_df.loc[list_of_columns[i], list_of_columns[j]] = corr_value
    return(pd.DataFrame(corr_df.astype('float64').round(2)))

def get_model_summary(model:'sklearn.linear_model.LogisticRegression', X:'pd.DataFrame', y:'pd.Series')->'pd.DataFrame':
    """
    Function to compute a model summary for an sklearn Logistic Regression model, as sklearn does not
    provide an implementation of the same.
    
    Args:
        model: sklearn.linear_model.LogisticRegression model object
        X: DataFrame with independent variables from training set
        y: Series with dependent variable from training set
        
    Returns:
        DataFrame containing model summary
    """
    # Extract model parameters
    params = np.append(model.intercept_, model.coef_)
    
    # Make predictions for training set (Fitted values)
    predictions = model.predict(X)
    
    # Original dataset with column 'Constant' for intercept
    newX = pd.DataFrame({
        "Constant": np.ones(len(X))
    }).join(pd.DataFrame(X).reset_index(drop=True))
    
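    # Compute t stat and p value
    # Note: these use the linear regression (OLS) formulas, which are not the usual
    # Wald estimates for a logistic regression, so treat the values as rough indicators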
    MSE = (sum((y-predictions)**2))/(len(newX)-len(newX.columns))
    
    var_b = MSE*(np.linalg.inv(np.dot(newX.T,newX)).diagonal())
    sd_b = np.sqrt(var_b)
    ts_b = params/ sd_b
    
    p_values =[2*(1-stats.t.cdf(np.abs(i),(len(newX)-len(newX.columns)))) for i in ts_b]
    
    sd_b = np.round(sd_b,3)
    ts_b = np.round(ts_b,3)
    p_values = np.round(p_values,3)
    params = np.round(params,4)
    
    # Return model summary as DataFrame
    myDF3 = pd.DataFrame(index = newX.columns)
    myDF3["Coefficients"],myDF3["Standard Errors"],myDF3["t values"],myDF3["Probabilities"] = [params,sd_b,ts_b,p_values]
    
    return(myDF3)

def missing_value_mode_treatment(train_df:'pd.DataFrame', test_df:'pd.DataFrame')->'tuple':
    """
    Function to impute missing values using mode of the variable.
    
    Args:
        train_df: Training dataset
        test_df: Test dataset
        
    Returns:
        Tuple with first element as the treated training dataset and second element as treated test dataset
    """
    # Get columns with unknown values
    unknown_count = train_df.replace('unknown', np.nan).isna().sum(axis=0)
    unknown_columns = list(unknown_count[unknown_count > 0].index)
    
    # For each column
    for column in unknown_columns:
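        # Get mode from the training set, ignoring 'unknown' so the placeholder itself is never chosen
        mode_value = train_df[column].replace("unknown", np.nan).mode()[0]
        
        # Update unknown with mode in training set
        train_df[column] = train_df[column].replace("unknown", np.nan).fillna(mode_value)
        
        # Update unknown with mode in test set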
        test_df[column] = test_df[column].replace("unknown", np.nan).fillna(mode_value)
    return(train_df, test_df)

def remove_outliers_iqr(train_x:'pd.DataFrame', test_x:'pd.DataFrame')->'tuple':
    """
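    Function to treat outliers using the interquartile range (IQR) method. Values beyond the limits
    are capped at the limits (rows are not removed); the limits are computed from the training data only.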
    
    Args:
        train_x: Training dataset
        test_x: Test dataset
        
    Returns:
        Tuple with first element as the treated training dataset and second element as treated test dataset
    """
    # Compute Q1 and Q3
    q1 = np.quantile(train_x, 0.25)
    q3 = np.quantile(train_x, 0.75)
    
    # Compute IQR
    iqr = q3 - q1
    
    # Get upper and lower limits
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    
    # Cap outliers to upper and lower limits
    train_x[train_x < lower_limit] = lower_limit
    train_x[train_x > upper_limit] = upper_limit
    test_x[test_x < lower_limit] = lower_limit
    test_x[test_x > upper_limit] = upper_limit
    return(train_x, test_x)
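
`get_model_summary` above derives its standard errors from the ordinary least squares formula, `MSE * inv(X'X)`. For a logistic regression, the usual large-sample alternative is the Wald approach, where the covariance of the coefficients is estimated as `inv(X'WX)` with `W = diag(p * (1 - p))` and `p` the fitted probabilities. The sketch below illustrates that calculation; it is not the method used in this report. The helper name `wald_summary` and the synthetic data are assumptions made for illustration, and the result is only approximate when the model is fitted with regularization (scikit-learn's default), which is why the demo uses a very large `C`.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import LogisticRegression

def wald_summary(model, X):
    """Hypothetical helper: Wald z-tests for a fitted binary LogisticRegression."""
    X = pd.DataFrame(X).reset_index(drop=True)
    # Design matrix with a leading column of ones for the intercept
    design = np.column_stack([np.ones(len(X)), X.to_numpy(dtype=float)])
    params = np.append(model.intercept_, model.coef_)
    # Fitted probabilities of the positive class
    p = model.predict_proba(X)[:, 1]
    # Fisher information X' W X with W = diag(p * (1 - p))
    weights = p * (1 - p)
    fisher = design.T @ (design * weights[:, None])
    std_err = np.sqrt(np.diag(np.linalg.inv(fisher)))
    # Wald z statistics and two-sided p values from the standard normal
    z = params / std_err
    p_values = 2 * stats.norm.sf(np.abs(z))
    return pd.DataFrame(
        {"Coefficients": params, "Standard Errors": std_err,
         "z values": z, "p values": p_values},
        index=["Constant", *X.columns],
    )

# Synthetic data (assumption): only x1 influences the outcome
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"x1": rng.normal(size=500), "x2": rng.normal(size=500)})
logits = 0.5 + 2.0 * X_demo["x1"]
y_demo = (rng.random(500) < 1 / (1 + np.exp(-logits))).astype(int)

# A very large C makes the default L2 penalty negligible
model_demo = LogisticRegression(C=1e6, max_iter=1000).fit(X_demo, y_demo)
summary = wald_summary(model_demo, X_demo)
print(summary.round(3))
```

The returned frame has the same layout as the output of `get_model_summary`, so the two can be compared side by side on the same fitted model.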

## Hypothesis

The hypothesis for this analysis is as follows:

**A combination of CLIENT DATA, LAST CONTACT DATA and ADDITIONAL ATTRIBUTES can be used to predict whether a client will subscribe to a term deposit.**