# Executive Summary:


***

This is an analysis of the Portuguese Bank Telemarketing dataset. In the following sections, I use Python to describe trends within the data and to identify variables that influence the outcome (the 'y' variable).
<br>

**Process:**
* I first identified and explored the client-specific variables and a selection of campaign variables. I chose campaign variables with few categories (compared with the other campaign variables) so as not to overwhelm the stepwise selection function.
* I created a logistic regression model based on the results of a stepwise feature selection function.
    * I partitioned the data and plotted the AUC of the variables against the training and testing data, and then selected the best variables for the model. 
    * Ultimately, I selected 'contact_telephone', 'poutcome_success', 'job_blue-collar', 'age_41-50', and 'age_31-40' for the final iteration of the model. 
    
**Findings:**
* The model has 89% overall accuracy and is strong at predicting unsuccessful outcomes (where the client responded "no" to a deposit subscription).
* While the model is not as accurate at predicting successful outcomes, it could help identify which groups will not be effective targets in future campaigns.

**Rationale:**
* I chose logistic regression as the method for developing a model for this dataset based on the ease of interpretation of the model. I've worked through several iterations of the model, and ended up narrowing down the dataset into groups of variables. I was initially interested in the economic variables, but then decided to use client-specific variables, as that might be a better fit in terms of usability. I figured that the bank might be more interested in potential targeting opportunities that could be employed in future campaigns, and adjusted my question and model accordingly. 


# Section 1: Importing Packages and Dataset
***

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

#additional packages
import seaborn as sns
import scikitplot as skplt
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import statsmodels.api as sm

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Section 2: Inspecting the Initial Dataset
***

### 2.1: Inspecting the columns and shape of the dataset

Before any analysis, we must first inspect the data and make sure that it's complete.

In [None]:
#loading the csv file into the pandas dataframe
df = pd.read_csv('/kaggle/input/dataset-for-bta-419-2023/BTA_419_2023_Data.csv')

#checking the count of non-null values and data types
print(df.info())

#inspecting the shape of the dataframe
print("The shape of the dataset is: ",df.shape)

All columns have the same count of non-null values. The datatypes are float64, int64, and object. The dataset has 41,188 rows and 21 columns.

In [None]:
#checking for null values
df.isna().sum()

None of the columns show any null values. 

In [None]:
#looking at the variables and head of the dataset
df.head(5)

### 2.2: Variables definition
The variable of interest is located within the 'y' column of the dataset, where a value of 'yes' indicates that the client subscribed to a term deposit.
There's a mix of categorical and quantitative variables in the dataset. A detailed breakdown of the variables is as follows:
#### i. Input variables:
##### <u>Bank client data:</u>
1. age (numeric)
2. job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3. marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4. education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5. default: has credit in default? (categorical: 'no','yes','unknown')
6. housing: has housing loan? (categorical: 'no','yes','unknown')
7. loan: has personal loan? (categorical: 'no','yes','unknown')

##### <u>Related with the last contact of the current campaign:</u>
8. contact: contact communication type (categorical: 'cellular','telephone')
9. month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10. day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

##### <u>Other attributes:</u>
12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14. previous: number of contacts performed before this campaign and for this client (numeric)
15. poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

##### <u>Social and economic context attributes</u>
16. emp.var.rate: employment variation rate - quarterly indicator (numeric)
17. cons.price.idx: consumer price index - monthly indicator (numeric)
18. cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19. euribor3m: euribor 3 month rate - daily indicator (numeric)
20. nr.employed: number of employees - quarterly indicator (numeric)

#### ii. Output variable (desired target):
21. y: has the client subscribed to a term deposit? (binary: 'yes','no')

# Section 3: Business Question Definition + Exploring the Dataset
***

### 3.1: Defining the Business Question
    
The Portuguese bank would likely be interested in finding out which variables are predictive of the outcome variable, 'y'. A question related to this dataset could be: ***which demographic or campaign variables are likely to influence an outcome?*** This question is relevant to the bank's future marketing efforts, and answering it could result in savings or increased revenue in future campaigns. The demographic and campaign variables are likely to be known by the bank before a telemarketing campaign is conducted, and thus are of interest when developing a model for the bank.
   
### 3.2: Exploring the Dataset

In order to develop an answer to the business question, we must first identify variables that might be candidates for a model. For this analysis, candidate variables for the model are ones that could be interpreted as having some sort of influence on the distribution of the outcome variable, 'y'.

***
#### i. The outcome variable 'y'
The first variable to look at is the outcome variable, 'y'. 



In [None]:
#visualizing the 'y' variable outcomes
sns.countplot(x='y',
             data=df)
plt.title("Y distribution")
plt.show()
print(df['y'].value_counts())

#calculate the rate of subscription to term deposits
success_rate = round(sum(df['y'] == 'yes')/len(df['y'])*100,2)
print("The success rate is: ",success_rate, "%")
failure_rate = round(sum(df['y'] == 'no')/len(df['y'])*100,2)
print("The failure rate is: ", failure_rate, "%")

The rate of successful outcomes for the entire dataset is 11.27% and the rate of failed outcomes is 88.73%. 
***

#### ii. Client specific variables
Next, we're looking at the client-specific variables, including: age, job, education, marital, loan, default, and housing.

In [None]:
#looking at the distribution of the demographic variables
fig, axes = plt.subplots(1,2, figsize=(10,5))
fig.suptitle("Age variable")
sns.violinplot(y='age',
            x='y',
           data=df,
           ax=axes[0])
axes[0].set_title("Age")
sns.histplot(x='age',
            hue='y',
            data=df,
            ax=axes[1])
axes[1].set_title("Age and y distribution")
plt.tight_layout()
plt.show()

#looking at the mean age of outcomes
print("Summary of mean ages by outcome: ")
print(df.groupby('y').age.mean())

**Observations:**
* The average age of successful outcomes is a little bit higher than failed outcomes
* Successful outcomes tended to be distributed more towards the higher end of the age range, where failed outcomes were grouped closer to the lower end
* The age variable could be a candidate variable for the logistic regression model

In [None]:
#visualizing job and education
fig, axes = plt.subplots(1,2, figsize=(10,5))
fig.suptitle("Demographic variables")
sns.countplot(x='job',
            data=df,
            hue='y',
            ax=axes[0])
axes[0].tick_params(labelrotation=90)
axes[0].set_title("Job")
sns.countplot(x='education',
             data=df,
              hue='y',
             ax=axes[1])
axes[1].tick_params(labelrotation=90)
axes[1].set_title("Education")
plt.tight_layout()
plt.show()

In [None]:
#getting the ratios of outcomes
print("Job summary by y:")
print(df.groupby('job').y.value_counts(normalize=True))

In [None]:
#getting the ratios of outcomes
print("Education summary by y:")
print(df.groupby('education').y.value_counts(normalize=True))

**Observations:**
* The most common job types are admin and blue-collar, and the least common are unknown and housemaids; the job types with the highest rates of subscription are students and retirees, with 31% and 22% relating to successful outcomes, respectively
* The most common education type is university degree and the least common is illiterate; the education type with the highest ratio of successful outcome is illiterate with 22%
* Both job and education could be good candidate variables for the model
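
The per-category ratios quoted above can also be read as a single success rate per category. The following is a minimal sketch on made-up rows (the `toy` frame and its values are hypothetical, not taken from the dataset); it assumes 'y' still holds the strings 'yes' and 'no', as it does at this point in the notebook.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the dataset.
toy = pd.DataFrame({
    'job': ['student', 'student', 'admin.', 'admin.', 'admin.', 'admin.'],
    'y':   ['yes',     'no',      'no',     'no',     'yes',    'no'],
})

# Share of 'yes' outcomes within each job category.
toy_success_rate = toy['y'].eq('yes').groupby(toy['job']).mean()
print(toy_success_rate)
```

Comparing each category's rate with the overall rate is one simple way to judge whether a variable separates the outcomes at all.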

In [None]:
#visualizing the variable
sns.countplot(x='marital',
             data=df,
              hue='y')
plt.title("Marital")
plt.show()

In [None]:
#showing the percentages of outcomes by marital variable
print("Marital summary by y: ")
print(df.groupby('marital').y.value_counts(normalize=True))

**Observations:**
* The most common marital type is married, the least common is unknown; the highest rate of successful outcome is unknown with 15% and then single with 14%
* This could be a good candidate variable due to the variations of successful outcome rates


In [None]:
fig, axes = plt.subplots(1,3, figsize=(10,5))
fig.suptitle("Loan variables")
sns.countplot(x='default',
           data=df,
            hue='y',
           ax=axes[0])
axes[0].set_title("Default")
sns.countplot(x='housing',
            data=df,
            hue='y',
            ax=axes[1])
axes[1].set_title("Housing")
sns.countplot(x='loan',
            data=df,
            hue='y',
            ax=axes[2])
axes[2].set_title("Loan")
plt.tight_layout()
plt.show()

In [None]:
#showing the percentages of outcomes by variable
print("Default summary by y: ")
print(df.groupby('default').y.value_counts(normalize=True))

In [None]:
#showing the percentages of outcomes by variable
print("Housing summary by y: ")
print(df.groupby('housing').y.value_counts(normalize=True))

In [None]:
#showing the percentages of outcomes by variable
print("Loan summary by y: ")
print(df.groupby('loan').y.value_counts(normalize=True))

**Observations:**
* The distribution of the outcome variable for all three of the loan variables follows the overall distribution of the 'y' variable for the dataset
* The three loan variables are not candidate variables for the model

***
#### iii. Campaign variables
Next, we'll look at the campaign variables, including: contact and poutcome. (Note: for the sake of the model, we will only look at two of the campaign variables.)


In [None]:
#visualizing contact
fig, axes = plt.subplots(1,2, figsize=(10,5))
fig.suptitle("Campaign variables")
sns.countplot(x='contact',
           data=df,
            hue='y',
             ax=axes[0])
axes[0].set_title("Contact")
sns.countplot(x='poutcome',
            hue='y',
           data=df,
             ax=axes[1])
axes[1].set_title("Poutcome")
plt.tight_layout()
plt.show()

In [None]:
#showing the percentages of outcomes by variable
print("Contact summary by y: ")
print(df.groupby('contact').y.value_counts(normalize=True))

In [None]:
print("Summary of poutcome by outcome: ")
print(df.groupby('poutcome').y.value_counts(normalize=True))

**Observations:**
* The most common method of contact is via cellular
* The distribution of outcomes for the contact variable differs from the overall distribution of the dataset: 14% of clients reached via cellular had a successful outcome, compared to only 5% of clients reached via telephone
* The distribution of outcomes for the poutcome variable is more varied than the average for the dataset
* Both contact and poutcome are good candidate variables for the model

***
### 3.3 Identifying the candidate variables
After the exploratory analysis, I've narrowed down the candidate variables to: age, job, education, marital, contact, and poutcome. Next, I will prepare the dataset for analysis with the logistic regression model.

# Section 4: Manipulating the dataset to prepare for analyses
***

### 4.1: Changing column datatypes for analysis
First, I need to convert the 'y' column to binary values and group the 'age' column into bins so that both can be used in the logistic regression model.

In [None]:
#changing the target column value to 0,1 ('yes' -> 1, 'no' -> 0)
df['y'] = df['y'].map({'yes': 1, 'no': 0})
df['y'].value_counts()

In [None]:
#creating age bins for the age variable
df['age']=pd.cut(x=df['age'], bins=[0,20,30,40,50,60,70,99],labels=['0-20','21-30','31-40','41-50','51-60','61-70','71-99'])
df['age'].value_counts()
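
To make the bin edges concrete, here is a small sketch that applies the same edges and labels to a handful of hypothetical ages (`toy_ages` is made up for illustration). It relies on `pd.cut` using right-closed intervals by default, so an age of exactly 20 falls in '0-20' and 21 falls in '21-30'.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges.
toy_ages = pd.Series([17, 20, 21, 30, 31, 98])

# Same edges and labels as the cell above; pd.cut uses right-closed
# intervals by default, i.e. (0, 20], (20, 30], ..., (70, 99].
age_bins = [0, 20, 30, 40, 50, 60, 70, 99]
age_labels = ['0-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-99']
toy_binned = pd.cut(x=toy_ages, bins=age_bins, labels=age_labels)

print(toy_binned.astype(str).tolist())
```

Any age outside the range (0, 99] would come back as NaN, so it is worth confirming the minimum and maximum age before binning.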

### 4.2: Creating dummy columns
Next, I need to create dummy columns with binary values for the categorical candidate variables.

In [None]:
#creating dummy columns for all categorical candidate variables
cat_cols= ['age','job','education','marital', 'contact','poutcome']
dummies = pd.get_dummies(df, prefix=cat_cols, columns=cat_cols)
print(dummies.columns)
dummies.shape

In [None]:
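#dropping the non-candidate columns, which occupy the first 14 positions (get_dummies places the new dummy columns after the untouched ones)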
df = dummies.drop(dummies.columns[:14], axis=1)
print(df.columns)
df.shape
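
The positional drop above depends on how `pd.get_dummies` orders its output. The sketch below shows that ordering on a made-up frame (`toy` and its 'duration' column are stand-ins, not the real data) and then selects columns by name, which may be a more robust alternative if the column order ever changes.

```python
import pandas as pd

# Made-up rows; 'duration' stands in for any non-candidate column.
toy = pd.DataFrame({
    'duration': [120, 300, 45],
    'contact': ['cellular', 'telephone', 'cellular'],
    'poutcome': ['nonexistent', 'success', 'failure'],
    'y': [0, 1, 0],
})

toy_cols = ['contact', 'poutcome']
toy_dummies = pd.get_dummies(toy, prefix=toy_cols, columns=toy_cols, dtype=int)

# Untouched columns come first, followed by one block of dummies per variable.
print(list(toy_dummies.columns))

# Selecting by name instead of by position: keep 'y' plus every dummy column.
dummy_prefixes = tuple(c + '_' for c in toy_cols)
keep = ['y'] + [c for c in toy_dummies.columns if c.startswith(dummy_prefixes)]
toy_model_df = toy_dummies[keep]
print(list(toy_model_df.columns))
```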

# Section 5: Developing a Model using Logistic Regression
***

After the inspection, initial analysis and manipulation of the dataset, I can now begin creating the logistic regression model using the candidate variables identified previously. 


### 5.1 Split the dataset into a training and testing sample
First, I'll begin with splitting the dataset into a training and testing sample. I'll be using a test size of 30% with a random state of 16.

In [None]:
#splitting the dataset into train and test sets
X = df.drop('y', axis=1)
y = df['y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=16)
train = pd.concat([X_train, y_train], axis=1)
test = pd.concat([X_test, y_test], axis=1)
print("The shape of the training dataset is: ", train.shape)
print("The shape of the testing dataset is: ", test.shape)
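
As a quick sanity check on the printed shapes, the split sizes can be worked out by hand. This sketch assumes that scikit-learn rounds a fractional test size up and assigns the remaining rows to the training set.

```python
import math

n_rows = 41188      # number of rows reported in Section 2.1
test_size = 0.3     # value passed to train_test_split above

# Assumption: for a fractional test_size, scikit-learn rounds the test
# count up and gives the remaining rows to the training set.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test
print(n_train, n_test)
```

The resulting test count agrees with the total of the four confusion-matrix cells reported in Section 6.2 (10,786 + 257 + 1,166 + 148 = 12,357).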


### 5.2 Defining base functions for variable selection
Next, I'll define the functions used to select the variables for the logistic regression model through a stepwise (forward) selection process. The functions are auc(), which fits a model on a given set of variables and returns its AUC, and next_best(), which picks the candidate variable that yields the highest AUC when added to the current set.

In [None]:
#creating a function to evaluate the auc of a model fitted on the given variables
def auc(variables, target, basetable):
    X = basetable[variables]
    y = basetable[target]
    logreg = linear_model.LogisticRegression()
    logreg.fit(X, y.values.ravel())
    #predicted probability of the positive class
    predictions = logreg.predict_proba(X)[:, 1]
    return roc_auc_score(y, predictions)
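
For intuition about the number that `roc_auc_score` returns: the AUC equals the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half. The helper `pairwise_auc` below is a hypothetical name used only for this sketch, and the labels and scores are made up.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why AUC is a reasonable yardstick on an imbalanced target like 'y'.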

In [None]:
#creating a function to evaluate the next best variable based on auc
def next_best(current_variables, candidate_variables, target, basetable):
    best_auc = -1
    best_variable = None
    for v in candidate_variables:
        auc_v = auc(current_variables + [v], target, basetable)
        if auc_v >= best_auc:
            best_auc=auc_v
            best_variable=v
    return best_variable

### 5.3 Entering the candidate variables into the stepwise selection functions
Now, I'll use the previously defined functions to evaluate the candidate variables through a stepwise selection process. The output of the for loop will be a list called 'current_variables' that will contain the variables selected by the process.

In [None]:
#initializing the candidate_variables and current_variables
candidate_variables = ['age_0-20', 'age_21-30', 'age_31-40', 'age_41-50', 'age_51-60', 'age_61-70', 'age_71-99', 'job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid', 'job_management', 'job_retired', 'job_self-employed', 'job_services', 'job_student', 'job_technician', 'job_unemployed', 'job_unknown', 'education_basic.4y', 'education_basic.6y', 'education_basic.9y', 'education_high.school', 'education_illiterate', 'education_professional.course', 'education_university.degree', 'education_unknown', 'marital_divorced', 'marital_married', 'marital_single', 'marital_unknown', 'contact_cellular', 'contact_telephone', 'poutcome_failure', 'poutcome_nonexistent', 'poutcome_success']
current_variables = []

#looping the variables through the functions in stepwise selection
number_iterations = 10
for i in range(0, number_iterations):
    next_variable = next_best(current_variables, candidate_variables, ['y'], df)
    current_variables = current_variables + [next_variable]
    candidate_variables.remove(next_variable)
    print("Variable added in step " + str(i+1)  + " is " + next_variable)

print('\n', current_variables)

After entering the candidate variables into the stepwise selection loop, we can see the variables that have been selected based on their AUC calculations. 
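
To see the greedy selection in isolation, here is a compact sketch of the same idea on synthetic data. Everything in it is made up for illustration: the column names 'signal', 'noise_a', and 'noise_b', the sample size, and the helper `toy_auc`. It is not the notebook's data or result.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic data: 'signal' tracks the target, the two noise columns do not.
rng = np.random.default_rng(0)
toy_y = rng.integers(0, 2, size=400)
toy = pd.DataFrame({
    'noise_a': rng.normal(size=400),
    'signal': toy_y + rng.normal(scale=0.5, size=400),
    'noise_b': rng.normal(size=400),
    'y': toy_y,
})

def toy_auc(variables):
    """In-sample AUC of a logistic regression on the given columns."""
    model = LogisticRegression().fit(toy[variables], toy['y'])
    return roc_auc_score(toy['y'], model.predict_proba(toy[variables])[:, 1])

selected, remaining = [], ['noise_a', 'signal', 'noise_b']
while remaining:
    # Greedy step: add whichever remaining column gives the highest AUC.
    best = max(remaining, key=lambda v: toy_auc(selected + [v]))
    selected.append(best)
    remaining.remove(best)
    print(best, round(toy_auc(selected), 3))
```

The informative column is picked first, and the later additions barely move the AUC, which is the pattern the train/test comparison in the next sections is meant to expose.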

### 5.4 Defining functions for visualization of the AUC curve
Now that the variables have been selected, I'll create a function to visualize the AUC scores on the train and test samples and thus determine the cut-off point. I'll first create a function auc_train_test() that will create the train and testing lists of variables and the logistic regression model object. Then I'll fit the model based on the training values, and then calculate the predictions and AUC values on both the training and testing data.

In [None]:
#creating a function to create the training/testing variables
def auc_train_test(variables, target, train, test):
    X_train = train[variables]
    X_test = test[variables]
    y_train = train[target]
    y_test = test[target]
    logreg = linear_model.LogisticRegression()

    #fitting the model on train data
    logreg.fit(X_train, y_train.values.ravel())

    #calculating the predictions both on train and test data
    predictions_train = logreg.predict_proba(X_train)[:,1]
    predictions_test = logreg.predict_proba(X_test)[:,1]

    #calculating the AUC both on train and test data
    auc_train = roc_auc_score(y_train, predictions_train)
    auc_test = roc_auc_score(y_test, predictions_test)
    return(auc_train, auc_test)

### 5.5 Entering the current variables into the AUC training and testing functions
I'll next enter the current variables into the functions created to evaluate the AUC values partitioned by training and testing data.

In [None]:
# initializing lists of values
auc_values_train = []
auc_values_test = []
variables_evaluate = []

#looping over the variables in current_variables
for v in current_variables:
    variables_evaluate.append(v)
    auc_train, auc_test = auc_train_test(variables_evaluate, ['y'], train, test)
    auc_values_train.append(auc_train)
    auc_values_test.append(auc_test)

### 5.6 Plotting the AUC curve
Next, I'll visualize the AUC values of the training and testing data to determine the cut-off point for the model.

In [None]:
#creating the AUC curve
res = pd.DataFrame(dict(variables=current_variables, auc=auc_values_test))
x = np.array(range(0,len(auc_values_train)))
#using separate names so the y_train/y_test target series are not overwritten
auc_curve_train = np.array(auc_values_train)
auc_curve_test = np.array(auc_values_test)
plt.xticks(x, current_variables, rotation = 90)
plt.plot(x, auc_curve_train, label='Train')
plt.plot(x, auc_curve_test, label='Test')
plt.ylim((0.6, 0.8))
plt.title(f'Best AUC = {round(res.auc.max(),3)}')
plt.legend()
plt.show()


As we can see in the plot, the AUC values on both the training and testing data level off after the 'age_31-40' variable. This will be the cut-off point, so as not to add too much complexity to the model. The AUC values do increase after this point, but the gains are marginal compared to those from the first five variables. The best variables for the model are thus: 'contact_telephone', 'poutcome_success', 'job_blue-collar', 'age_41-50', and 'age_31-40'.
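
The "levelling off" judgement can also be expressed as a simple rule: stop at the first variable whose added AUC falls below a chosen threshold. The sketch below uses made-up AUC values and a hypothetical threshold `min_gain`; neither comes from the notebook's output, and the cut-off in this analysis was chosen by eye from the plot.

```python
# Made-up test AUC values, one per added variable (NOT the notebook's output).
toy_auc_by_step = [0.61, 0.70, 0.72, 0.735, 0.745, 0.747, 0.748]
min_gain = 0.005  # hypothetical threshold for a "worthwhile" improvement

# Improvement contributed by each variable after the first.
gains = [b - a for a, b in zip(toy_auc_by_step, toy_auc_by_step[1:])]

# Keep variables up to the first step whose gain falls below the threshold.
n_keep = len(toy_auc_by_step)
for step, gain in enumerate(gains, start=1):
    if gain < min_gain:
        n_keep = step
        break
print(n_keep)
```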

# Section 6: Evaluating the Model
***
Now that I've created the model and identified the best variables according to their AUC values, we can evaluate its performance.

### 6.1 Plotting the cumulative gains curve
First, I will enter the list of best variables determined by the AUC curve in the previous section and initialize the training and testing sets with these variables. Then I will create and fit a model based on these sets, and finally plot the cumulative gains curve.

In [None]:
# initializing the best variables
best_vars=['contact_telephone','poutcome_success', 'job_blue-collar', 'age_41-50','age_31-40']

#initializing train and test sets with best variables
X_train = train[best_vars]
X_test = test[best_vars]
y_train = train['y']
y_test = test['y']

#creating logistic regression object
logreg = linear_model.LogisticRegression()
    
#fitting the model on the training data
logreg.fit(X_train, y_train)
    
#running predictions on testing data
predictions_test = logreg.predict_proba(X_test)

#plotting the cumulative gains graph
skplt.metrics.plot_cumulative_gain(y_test, predictions_test)

ax = plt.gca()
line = ax.lines[0]
xd = line.get_xdata()
yd = line.get_ydata()
plt.show()

We can see that the model performs well at the start of the cumulative gains curve: the top 5% of customers with the highest predicted probabilities contain about 20% of the 'yes' respondents.
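
For readers unfamiliar with cumulative gains, each point on the curve answers the question "if we contact only the top x% of customers by predicted probability, what share of all 'yes' respondents do we reach?". Here is a minimal sketch with made-up scores and outcomes; `cumulative_gain` is a hypothetical helper written for this illustration, not part of scikit-plot.

```python
def cumulative_gain(labels, scores, fraction):
    """Share of all positives found in the top `fraction` of rows by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(fraction * len(ranked)))
    found = sum(label for _, label in ranked[:n_top])
    return found / sum(labels)

# Made-up scores and outcomes for ten customers (1 = subscribed).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]

gain_top_20 = cumulative_gain(toy_labels, toy_scores, 0.2)
gain_top_50 = cumulative_gain(toy_labels, toy_scores, 0.5)
print(gain_top_20, gain_top_50)
```

Contacting customers at random would reach roughly x% of the 'yes' respondents in the top x%, so the distance above that diagonal is what the model adds.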

### 6.2 Creating the confusion matrix
I'll create the confusion matrix from the test-set predictions and use it to evaluate the model.

In [None]:
#defining the y prediction data
y_pred = logreg.predict(X_test)

#creating the confusion matrix
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix

Next, we'll visualize the confusion matrix.

In [None]:
#plotting the confusion matrix
class_names=[0,1]
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)

# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()

So, as we can see, 10,786 + 257 = 11,043 predictions are correct and 1,166 + 148 = 1,314 are incorrect. This suggests that the model performs well, at least at predicting unsuccessful outcomes.

### 6.3 Evaluating the model using classification_report
Now we'll evaluate the accuracy, precision, and recall of the model.

In [None]:
#running the classification report on the model
target_names = ['unsuccessful outcome', 'successful outcome']
print(classification_report(y_test, y_pred, target_names=target_names))

We can see that the model has an overall accuracy of 89%. Its precision is 90% for unsuccessful outcomes and 63% for successful outcomes; that is, the share of each kind of prediction that turns out to be correct. Its recall is 99% for unsuccessful outcomes and 18% for successful outcomes: the model detects nearly all actual unsuccessful outcomes but only 18% of actual successful ones.
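
The figures in the classification report can be reproduced by hand from the confusion matrix in Section 6.2. The sketch below assumes scikit-learn's layout (rows are actual classes, columns are predicted classes), which would make 148 the false positives and 1,166 the false negatives; that assignment is an inference from the reported percentages rather than something stated explicitly above.

```python
# Counts as reported in Section 6.2. Assumption: scikit-learn's layout, with
# rows = actual class and columns = predicted class, so 148 are false
# positives and 1,166 are false negatives.
tn, fp, fn, tp = 10786, 148, 1166, 257

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_yes = tp / (tp + fp)   # of predicted 'yes', how many were right
recall_yes = tp / (tp + fn)      # of actual 'yes', how many were found
precision_no = tn / (tn + fn)    # of predicted 'no', how many were right
recall_no = tn / (tn + fp)       # of actual 'no', how many were found

for name, value in [('accuracy', accuracy),
                    ('precision (yes)', precision_yes),
                    ('recall (yes)', recall_yes),
                    ('precision (no)', precision_no),
                    ('recall (no)', recall_no)]:
    print(f'{name}: {value:.2f}')
```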