### Project - Thera Bank Personal Loan Campaign

The dataset contains data on 5000 customers. The data include customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.
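
The 9.6% figure is the baseline that any targeting model has to beat. As a quick check of the arithmetic, using only the two counts quoted above (the variable names are illustrative):

```python
# Counts quoted in the dataset description above
total_customers = 5000
accepted = 480

baseline_rate = accepted / total_customers
print(f"Baseline acceptance rate: {baseline_rate:.1%}")  # Baseline acceptance rate: 9.6%

# A model that always predicts "no loan" is right for every non-acceptor,
# so accuracy alone can look high on data this imbalanced
majority_accuracy = 1 - baseline_rate
print(f"Accuracy of always predicting 'no loan': {majority_accuracy:.1%}")
```

Because of this imbalance, recall and precision are worth tracking alongside accuracy later in the notebook.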

### Domain:

Banking

### Context:

This case is about a bank (Thera Bank) whose management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors). A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise better-targeted campaigns that increase the success ratio with a minimal budget.

#### 1. Import the necessary packages

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings('ignore')
from sklearn.metrics import confusion_matrix, recall_score, precision_score, accuracy_score

#### Read the dataset into a dataframe

In [None]:
df = pd.read_csv('Bank_Personal_Loan_Modelling.csv')
df.head()

#### Check data types

In [None]:
df.info()

#### Statistical summary

In [None]:
df.describe().transpose() 

#### Observations
* No null values
* The minimum value of Experience is -3, which has no meaning. All negative values of Experience should be replaced with 0
* Education should be categorical
* Securities Account, CD Account, Online and CreditCard are categorical variables



In [None]:
# Replace negative Experience values with 0.
# Note: chained indexing such as df[df['Experience'] < 0]['Experience'] = 0
# assigns to a temporary copy and leaves df unchanged.
df['Experience'] = df['Experience'].clip(lower=0)
df.describe().transpose()

#### EDA
##### Number of unique values in each column

In [None]:
df.nunique()

##### Number of people with zero mortgage

In [None]:
df[df['Mortgage'] == 0]['Mortgage'].count()

##### Number of people with zero credit card spending per month

In [None]:
df[df['CCAvg'] == 0]['CCAvg'].count()

#### Value counts of all categorical columns.

In [None]:
for i in ['Family', 'Education', 'Securities Account', 'CD Account', 'Online', 'CreditCard']: 
    print(df[i].value_counts(normalize=False))
    print()

## Univariate Analysis

In [None]:
# Drop ID column
df.drop(columns=['ID'], inplace=True)

In [None]:
# Histogram with KDE for continuous columns
# (sns.distplot is deprecated; histplot with kde=True is its replacement)
for i in ['Age','Experience','Income', 'ZIP Code', 'CCAvg', 'Mortgage']:
    sns.histplot(df[i], kde=True)
    plt.show()

In [None]:
# Convert all relevant columns to categorical type 
df['Family'] = df['Family'].astype('category')
df['Education'] = df['Education'].astype('category')
df['Securities Account'] = df['Securities Account'].astype('category')
df['CD Account'] = df['CD Account'].astype('category')
df['Online'] = df['Online'].astype('category')
df['CreditCard'] = df['CreditCard'].astype('category')
#df['Personal Loan'] = df['Personal Loan'].astype('category')
df.info()

In [None]:
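# Countplot for categorical variables
for i in ['Family','Education','Securities Account', 'CD Account','Online','CreditCard','Personal Loan']:
    # Pass the column as x explicitly; newer seaborn versions treat the first positional argument as data
    sns.countplot(x=df[i])
    plt.show()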

In [None]:
for i in ['Family','Education','Securities Account', 'CD Account','Online','CreditCard','Personal Loan']:
    print(df[i].value_counts(normalize=True))
    print()

## Bivariate Analysis

In [None]:
pd.crosstab(df['Family'],df['Personal Loan'],normalize='index')

In [None]:
pd.crosstab(df['Education'],df['Personal Loan'],normalize='index')

In [None]:
pd.crosstab(df['Securities Account'],df['Personal Loan'],normalize='index')

In [None]:
pd.crosstab(df['CD Account'],df['Personal Loan'],normalize='index')

In [None]:
pd.crosstab(df['Online'],df['Personal Loan'],normalize='index')

In [None]:
pd.crosstab(df['CreditCard'],df['Personal Loan'],normalize='index')

#### Conclusions from Bivariate analysis
* Family: There is a relationship between the number of family members and personal loan acceptance
* Education: Very clear dependency on education
* Securities Account: Customers with a Securities Account take more personal loans
* CD Account: The data is highly skewed, but customers with a CD Account are very likely to take a personal loan. This is a very strong relationship, with 46% success
* Online: Not much dependency. This column can be dropped
* CreditCard: Not much dependency on personal loan. This column can be dropped
* Mortgage and ZIP Code data are very highly skewed. These columns can be dropped based on the univariate analysis
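
One way to put the CD Account result in context is to compare it with the overall acceptance rate. A minimal sketch, using only the 46% figure above and the 9.6% baseline from the dataset description; the `lift` helper is a hypothetical name, not part of the notebook:

```python
def lift(segment_rate, baseline_rate):
    """Ratio of a segment's acceptance rate to the overall acceptance rate."""
    return segment_rate / baseline_rate

baseline_rate = 480 / 5000   # overall acceptance rate (9.6%)
cd_account_rate = 0.46       # acceptance rate among CD Account holders, as reported above

cd_lift = lift(cd_account_rate, baseline_rate)
print(f"CD Account lift: {cd_lift:.1f}x")  # CD Account lift: 4.8x
```

A lift close to 1 would indicate little dependency, which is the pattern described above for Online and CreditCard.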


In [None]:
df.drop(columns=['Online', 'CreditCard', 'Mortgage', 'ZIP Code'], inplace=True)
df.head()

### Define X and Y variables

In [None]:
X = df.drop('Personal Loan', axis=1)
Y = df['Personal Loan']  # 1-D target; avoids scikit-learn's column-vector warning
# Convert categorical variables to dummy variables
#X = pd.get_dummies(X, drop_first=True)

### Split the data into training and test set in the ratio of 70:30 respectively

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.30, random_state = 0)

## Logistic Regression

In [None]:
reg = LogisticRegression()
reg.fit(X_train, y_train) 

In [None]:
# predict y for the test set
y_pred = reg.predict(X_test) 

# show original y and predicted y side by side
z = X_test.copy()
z['Observed Personal Loan'] = y_test
z['Predicted Personal Loan'] = y_pred
z.head()

## Model Performance

#### Print confusion Matrix

In [None]:
# print confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

# print confusion matrix in graphic format
def draw_cm(actual, predicted):
    cm = confusion_matrix(actual, predicted)
    # fmt='d' shows the cell counts as integers
    sns.heatmap(cm, annot=True, fmt='d', xticklabels=[0, 1], yticklabels=[0, 1])
    plt.ylabel('Observed')
    plt.xlabel('Predicted')
    plt.show()

draw_cm(y_test, y_pred)

#### Accuracy Score

In [None]:
accuracy_score(y_test, y_pred)

In [None]:
print("Training accuracy",reg.score(X_train,y_train))
print()
print("Testing accuracy",reg.score(X_test, y_test))
print()
print("Recall:",recall_score(y_test,y_pred))
print()
print("Precision:",precision_score(y_test,y_pred))
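
The three scores printed above all derive from the confusion matrix. The sketch below shows the relationships using made-up counts (the four numbers are hypothetical, chosen only for illustration, and are not this model's results):

```python
# Hypothetical confusion-matrix counts, laid out as scikit-learn does:
# rows are observed classes, columns are predicted classes
tn, fp = 1300, 50    # observed 0: predicted 0, predicted 1
fn, tp = 90, 60      # observed 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
recall = tp / (tp + fn)                      # share of actual acceptors the model finds
precision = tp / (tp + fp)                   # share of predicted acceptors who really accepted

print(f"accuracy={accuracy:.3f} recall={recall:.3f} precision={precision:.3f}")
```

With an imbalanced target, accuracy can stay high while recall is low, so recall and precision are the more informative numbers for the campaign.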

## Parameters of Logistic Regression

In [None]:
reg.get_params()

### Solver

In [None]:
train_score=[]
test_score=[]
solver = ['newton-cg','lbfgs','liblinear','sag','saga']
for i in solver:
    model = LogisticRegression(solver=i)  
    model.fit(X_train, y_train) 
    y_predict = model.predict(X_test)     
    train_score.append(round(model.score(X_train, y_train),3))
    test_score.append(round(model.score(X_test, y_test),3))
    
print(solver)
print(train_score)
print(test_score)

#### Solver conclusion
* The liblinear solver gives the best result on both the training and test sets

### max_iter

In [None]:
train_score=[]
test_score=[]
max_iter = [50, 100, 200, 500, 1000]
for i in max_iter:
    model = LogisticRegression(solver='liblinear',max_iter=i)
    model.fit(X_train, y_train) 
    y_predict = model.predict(X_test)     
    train_score.append(round(model.score(X_train, y_train),3))
    test_score.append(round(model.score(X_test, y_test),3))
    
print(max_iter)
print(train_score)
print(test_score)

#### max_iter conclusion
* Increasing max_iter beyond the default of 100 has no impact on the test score. A higher number of iterations only fits the training data more closely, without any improvement on the test set

### tol

In [None]:
train_score=[]
test_score=[]
tol = [0.00005, 0.0001, 0.0002, 0.0005, 0.001]
for i in tol:
    model = LogisticRegression(solver='liblinear',tol=i)
    model.fit(X_train, y_train) 
    y_predict = model.predict(X_test)     
    train_score.append(round(model.score(X_train, y_train),3))
    test_score.append(round(model.score(X_test, y_test),3))
    
print(tol)
print(train_score)
print(test_score)

#### tol conclusion
* 0.0005 gives the best result for tol

### C

In [None]:
train_score=[]
test_score=[]
C = [0.01,0.1,0.25,0.5,0.75,1,2, 3]
for i in C:
    model = LogisticRegression(solver='liblinear',tol=0.0005,C=i)
    model.fit(X_train, y_train) 
    y_predict = model.predict(X_test)     
    train_score.append(round(model.score(X_train, y_train),3))
    test_score.append(round(model.score(X_test, y_test),3))
    
print(C)
print(train_score)
print(test_score)

#### C conclusion
* Best value for C is 0.5

### penalty

In [None]:
train_score=[]
test_score=[]
penalty = ['l1', 'l2']
for i in penalty:
    model = LogisticRegression(solver='liblinear',tol=0.0005,C=0.5,penalty=i)
    model.fit(X_train, y_train) 
    y_predict = model.predict(X_test)     
    train_score.append(round(model.score(X_train, y_train),3))
    test_score.append(round(model.score(X_test, y_test),3))
    
print(penalty)
print(train_score)
print(test_score)

#### penalty conclusion
* Default l2 is better for penalty
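
The sections above tune one parameter at a time and compare scores on the test set, which risks picking values that happen to suit that particular split. A possible alternative, sketched here on synthetic data from `make_classification` rather than the bank dataset, is to search the same parameters jointly with cross-validation on the training set only. The grid values mirror those tried above; the data, `scoring='recall'`, and `cv=5` are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the bank data (roughly 10% positives)
X_demo, y_demo = make_classification(n_samples=1000, n_features=8,
                                     weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30,
                                          stratify=y_demo, random_state=0)

param_grid = {
    'C': [0.01, 0.1, 0.25, 0.5, 0.75, 1, 2, 3],
    'tol': [0.00005, 0.0001, 0.0002, 0.0005, 0.001],
    'penalty': ['l1', 'l2'],
}

# liblinear supports both l1 and l2 penalties
search = GridSearchCV(LogisticRegression(solver='liblinear'),
                      param_grid, scoring='recall', cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print("Held-out test recall:", search.score(X_te, y_te))
```

The test set is touched only once, at the end, so the reported score is a fairer estimate of how the chosen parameters would perform on new customers.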

# Final LogisticRegression Model

In [None]:
model = LogisticRegression(solver='liblinear',tol=0.0005,C=0.5)
model.fit(X_train, y_train) 
y_predict = model.predict(X_test) 
draw_cm(y_test,y_predict)
print("accuracy_score",accuracy_score(y_test, y_predict))
print("Training accuracy",model.score(X_train,y_train))
print()
print("Testing accuracy",model.score(X_test, y_test))
print()
print("Recall:",recall_score(y_test,y_predict))
print()
print("Precision:",precision_score(y_test,y_predict))


# Business conclusion of the model
* Only 9.6% of the customers (480 of 5000) accepted the personal loan in the original data set. The model misses 57% of these customers: in 57% of the cases where a customer actually accepted the loan, it predicted the wrong outcome.
* The model correctly identifies 43% of the customers who accepted the personal loan (its recall). From a business perspective, this is quite useful: the marketing team can concentrate on the customers the model flags instead of contacting everyone. The share of flagged customers who actually accept is the model's precision, which should be compared against the 9.6% baseline acceptance rate. Relying on the model in this way helps the business reduce campaign cost
* CD account has a very strong correlation with personal loan. People with CD account are taking more personal loans. The model also identified a high coefficient for CD account (2.21) compared to other features
* Education also has a strong correlation with personal loan. More educated people took the personal loan (coefficient of 1.69)
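
A logistic regression coefficient is a change in log-odds, so exponentiating it gives an odds ratio, which is often easier to explain to a business audience. A small sketch using the two coefficients quoted above (the dictionary and its keys are illustrative names):

```python
import math

# Coefficients quoted in the conclusions above
coefficients = {'CD Account': 2.21, 'Education': 1.69}

for feature, coef in coefficients.items():
    odds_ratio = math.exp(coef)
    # Odds of accepting the loan are multiplied by this factor for a
    # one-unit increase in the feature, holding the other features fixed
    print(f"{feature}: coefficient={coef:.2f}, odds ratio={odds_ratio:.1f}")
```

These ratios are only directly comparable across features that share a scale; Income and Age are measured in much larger units than the 0/1 account flags.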

In [None]:
print(model.coef_)
print(X_train.columns)