This is an outline notebook: the sections are suggested steps, but you can follow more or fewer steps to reach your end goal.

# Classification problem - predicting the take-up of a credit card offer

The bank wants to better understand the demographics and other characteristics of the customers who accept a credit card offer and of those who do not. You have been challenged to predict whether a customer will or won't accept the offer using a machine learning model.

**Our main question is: <br/>**
*Will the customer accept the credit card offer? Y/N*


The **definition of the features** is the following:
- **Customer Number:** A sequential number assigned to the customers (this column is hidden and excluded – this unique identifier will not be used directly).
- **Offer Accepted:** Did the customer accept (Yes) or reject (No) the offer?
- **Reward Type:** The type of reward program offered for the card (Air Miles, Cash Back or Points).
- **Mailer Type:** Letter or postcard.
- **Income Level:** Low, Medium or High.
- **#Bank Accounts Open:** How many non-credit-card accounts are held by the customer.
- **Overdraft Protection:** Does the customer have overdraft protection on their checking account(s) (Yes or No).
- **Credit Rating:** Low, Medium or High.
- **#Credit Cards Held:** The number of credit cards held at the bank.
- **#Homes Owned:** The number of homes owned by the customer.
- **Household Size:** Number of individuals in the family.
- **Own Your Home:** Does the customer own their home? (Yes or No).
- **Average Balance:** Average account balance (across all accounts over time).
- **Q1, Q2, Q3 and Q4 Balance:** Average balance for each quarter in the last year.


Be careful: the data is imbalanced towards customers who say No to the offer. This imbalance has to be managed, for example with sampling methods.

## Import Libraries

In [88]:
import pandas as pd 
import numpy as np 
import seaborn as sns 
import matplotlib.pyplot as plt 

## Read data as a pandas data frame, preview top 10 rows

In [89]:
df=pd.read_csv('creditcardmarketing.csv')

In [90]:
df.head(10)

Unnamed: 0,Customer_number,Offer_Accepted,Reward_Type,Mailer_Type,Income,No_open_bank_accounts,Overdraft_protection,Credit_rating,No_credit_cards,Homes_owned,Household_size,Own_your_home?,Average_Balance,Q1_balance,Q2_balance,Q3_balance,Q4_balance
0,1,No,Air Miles,Letter,High,1,No,High,2,1,4,No,1160.75,1669.0,877.0,1095.0,1002.0
1,2,No,Air Miles,Letter,Medium,1,No,Medium,2,2,5,Yes,147.25,39.0,106.0,78.0,366.0
2,3,No,Air Miles,Postcard,High,2,No,Medium,2,1,2,Yes,276.5,367.0,352.0,145.0,242.0
3,4,No,Air Miles,Letter,Medium,2,No,High,1,1,4,No,1219.0,1578.0,1760.0,1119.0,419.0
4,5,No,Air Miles,Letter,Medium,1,No,Medium,2,1,6,Yes,1211.0,2140.0,1357.0,982.0,365.0
5,6,No,Air Miles,Letter,Medium,1,No,High,3,1,4,No,1114.75,1847.0,1365.0,750.0,497.0
6,7,No,Air Miles,Letter,Medium,1,No,Medium,2,1,3,No,283.75,468.0,188.0,347.0,132.0
7,8,No,Cash Back,Postcard,Low,1,No,Medium,4,1,4,Yes,278.5,132.0,391.0,285.0,306.0
8,9,No,Air Miles,Postcard,Medium,1,No,Low,2,1,4,Yes,1005.0,894.0,891.0,882.0,1353.0
9,10,No,Air Miles,Letter,High,2,No,Low,3,2,4,Yes,974.25,1814.0,1454.0,514.0,115.0


In [91]:
df['Offer_Accepted'].value_counts()

No     16977
Yes     1023
Name: Offer_Accepted, dtype: int64

In [92]:
# sampling techniques for a better balance: random oversampling, SMOTE, Tomek links
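Of the techniques listed above, random oversampling is the simplest to try first. The sketch below is only an illustration: `toy` is a made-up ten-row frame standing in for the real data, and `sklearn.utils.resample` is one possible way to duplicate minority rows.

```python
import pandas as pd
from sklearn.utils import resample

# Toy stand-in for the real data: 8 "No" rows and 2 "Yes" rows (illustrative values).
toy = pd.DataFrame({
    "Average_Balance": [1160.75, 147.25, 276.5, 1219.0, 1211.0,
                        1114.75, 283.75, 278.5, 1005.0, 974.25],
    "Offer_Accepted": ["No"] * 8 + ["Yes"] * 2,
})

majority = toy[toy["Offer_Accepted"] == "No"]
minority = toy[toy["Offer_Accepted"] == "Yes"]

# Draw minority rows with replacement until both classes have the same size.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

balanced = pd.concat([majority, minority_upsampled])
print(balanced["Offer_Accepted"].value_counts())
```

If you use this for modelling, it is usually safer to resample only the training split, so that duplicated rows do not end up in the test set.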


# Exploratory Data Analysis

In this part we want to familiarize ourselves with the data set. We are going to look at the following steps:
- assess dataframe 
- shape, dtypes, summary statistics
- null values, white spaces, duplicates, number of unique values per column, values that mean the same but are written differently (male ≠ Male), typos/inconsistent capitalisation, irrelevant columns
- missing data
- outliers 
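The checks in the list above can be sketched with a few pandas calls. The frame `toy` below is invented for illustration (it contains one null, one duplicated row and one inconsistently written category); on the real data you would run the same calls on `df`.

```python
import pandas as pd

# Illustrative frame with deliberate problems: a null, a duplicate row, "medium " vs "Medium".
toy = pd.DataFrame({
    "Income": ["High", "Medium", "medium ", "Low", "High"],
    "Average_Balance": [1160.75, 147.25, None, 1219.0, 1160.75],
})

print(toy.shape)                       # rows, columns
print(toy.dtypes)                      # data type per column

null_counts = toy.isna().sum()         # nulls per column
n_duplicates = toy.duplicated().sum()  # fully duplicated rows
unique_income = toy["Income"].unique() # spot typos and inconsistent capitalisation

# Strip white space and unify capitalisation before counting categories again.
income_clean = toy["Income"].str.strip().str.title()
print(income_clean.value_counts())
```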

In [93]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18000 entries, 0 to 17999
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Customer_number        18000 non-null  int64  
 1   Offer_Accepted         18000 non-null  object 
 2   Reward_Type            18000 non-null  object 
 3   Mailer_Type            18000 non-null  object 
 4   Income                 18000 non-null  object 
 5   No_open_bank_accounts  18000 non-null  int64  
 6   Overdraft_protection   18000 non-null  object 
 7   Credit_rating          18000 non-null  object 
 8   No_credit_cards        18000 non-null  int64  
 9   Homes_owned            18000 non-null  int64  
 10  Household_size         18000 non-null  int64  
 11  Own_your_home?         18000 non-null  object 
 12  Average_Balance        17976 non-null  float64
 13  Q1_balance             17976 non-null  float64
 14  Q2_balance             17976 non-null  float64
 15  Q3

In [94]:
# nulls: assume a balance of 0, or remove the rows?
# or check whether the nulls create a problem for the chosen model


In [95]:
df=df.dropna()

In [96]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17976 entries, 0 to 17999
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Customer_number        17976 non-null  int64  
 1   Offer_Accepted         17976 non-null  object 
 2   Reward_Type            17976 non-null  object 
 3   Mailer_Type            17976 non-null  object 
 4   Income                 17976 non-null  object 
 5   No_open_bank_accounts  17976 non-null  int64  
 6   Overdraft_protection   17976 non-null  object 
 7   Credit_rating          17976 non-null  object 
 8   No_credit_cards        17976 non-null  int64  
 9   Homes_owned            17976 non-null  int64  
 10  Household_size         17976 non-null  int64  
 11  Own_your_home?         17976 non-null  object 
 12  Average_Balance        17976 non-null  float64
 13  Q1_balance             17976 non-null  float64
 14  Q2_balance             17976 non-null  float64
 15  Q3

## Visualisations

We want to visualise the relationships between the different features in the data.

In [97]:
# how many accepted the offer vs who didn't
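One possible way to answer this is a bar chart of the class counts. To keep the sketch self-contained, the target column is rebuilt here from the `value_counts()` output shown earlier (16977 No, 1023 Yes); in the notebook you would use `df['Offer_Accepted']` directly.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Rebuild the target column from the class counts printed earlier in this notebook.
offer = pd.Series(["No"] * 16977 + ["Yes"] * 1023, name="Offer_Accepted")

counts = offer.value_counts()                # absolute numbers
shares = offer.value_counts(normalize=True)  # proportions
print(shares.round(3))

fig, ax = plt.subplots()
counts.plot(kind="bar", ax=ax)
ax.set_xlabel("Offer accepted")
ax.set_ylabel("Number of customers")
ax.set_title("Offer take-up")
```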


In [98]:
# what's the avg balance of customers who accepted the offer vs who didn't


In [99]:
# how do the different income levels of who accepted the offer compare with those who didn't
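A cross-tabulation is one way to compare income levels. The sketch below uses a small invented frame `toy`; `normalize="index"` turns each income row into proportions, which is easier to read than raw counts when the classes are imbalanced.

```python
import pandas as pd

# Invented example rows, only to show the shape of the result.
toy = pd.DataFrame({
    "Income": ["Low", "Low", "Low", "Medium", "Medium", "Medium", "High", "High"],
    "Offer_Accepted": ["Yes", "No", "No", "No", "No", "Yes", "No", "No"],
})

# Share of Yes/No within each income level (every row sums to 1).
rates = pd.crosstab(toy["Income"], toy["Offer_Accepted"], normalize="index")
print(rates)
```

Calling `rates.plot(kind="bar", stacked=True)` on the result would turn the table into a stacked bar chart.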


In [100]:
# how does yes/no look with the number of homes owned


Let's see the relationship between the quarters and offer accepted.

In [101]:
# what's the q1 balance of customers who accepted the offer vs who didn't


In [102]:
# what's the q2 balance of customers who accepted the offer vs who didn't


In [103]:
# what's the q3 balance of customers who accepted the offer vs who didn't


In [104]:
# what's the q4 balance of customers who accepted the offer vs who didn't
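Instead of four separate cells, the quarterly comparison can also be done in one step with `groupby`. The four rows in `toy` are illustrative (the balances are taken from the preview above, the Yes/No labels are made up).

```python
import pandas as pd

quarters = ["Q1_balance", "Q2_balance", "Q3_balance", "Q4_balance"]

# Illustrative rows; the labels are invented so that both classes are present.
toy = pd.DataFrame({
    "Offer_Accepted": ["No", "No", "Yes", "Yes"],
    "Q1_balance": [1669.0, 39.0, 367.0, 1578.0],
    "Q2_balance": [877.0, 106.0, 352.0, 1760.0],
    "Q3_balance": [1095.0, 78.0, 145.0, 1119.0],
    "Q4_balance": [1002.0, 366.0, 242.0, 419.0],
})

# Mean balance per quarter, split by whether the offer was accepted.
quarterly_means = toy.groupby("Offer_Accepted")[quarters].mean()
print(quarterly_means)
```

`quarterly_means.T.plot(kind="bar")` would show the four quarters side by side for both groups.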



# Cleaning & Wrangling

**Tasks**
- drop 'customer_number' column
- drop null values
- convert float columns to int

In [105]:
# before cleaning, create a copy of the dataframe


In [106]:
# drop customer_number column


In [107]:
# drop rows with missing values


In [108]:
# converting columns from float to int


In [109]:
# test using info()
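The cleaning tasks above could look roughly like this. `raw` is a small invented frame; rounding `Average_Balance` before the conversion is a choice made for this sketch, because `astype(int)` on its own simply cuts off the decimals.

```python
import pandas as pd

# Invented stand-in for the raw data, with one row of missing balances.
raw = pd.DataFrame({
    "Customer_number": [1, 2, 3, 4],
    "Offer_Accepted": ["No", "No", "Yes", "No"],
    "Average_Balance": [1160.75, 147.25, None, 1219.0],
    "Q1_balance": [1669.0, 39.0, None, 1578.0],
})

clean = raw.copy()                               # keep the original untouched
clean = clean.drop(columns=["Customer_number"])  # identifier carries no signal
clean = clean.dropna()                           # drop rows with missing values

# Convert float columns to int (rounding first where decimals exist).
clean["Q1_balance"] = clean["Q1_balance"].astype(int)
clean["Average_Balance"] = clean["Average_Balance"].round().astype(int)

clean.info()
```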


# Preprocessing

**Tasks:**
- num vs cat data (split)
- multicollinearity
- imbalance
- distribution plots (normalising, scaling, outlier detection)
- normalizer
- encoding into dummies

In [110]:
# split numerical and categorical data into two dataframes

data_num=df.select_dtypes(include=['number'])

In [111]:
data_cat=df.select_dtypes(include=['object'])

In [112]:
# correlation matrix for numerical columns: are there any highly correlated pairs we should drop?
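A minimal sketch of the multicollinearity check, assuming a cut-off of 0.9 (the threshold is an assumption, not something this outline prescribes). The data in `toy_num` is synthetic and built so that two columns are strongly correlated.

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
q1 = rng.normal(1000, 300, size=200)

# Synthetic numerical columns; Average_Balance is constructed from Q1_balance on purpose.
toy_num = pd.DataFrame({
    "Q1_balance": q1,
    "Average_Balance": q1 * 0.9 + rng.normal(0, 20, size=200),
    "Household_size": rng.integers(1, 7, size=200),
})

corr = toy_num.corr()
threshold = 0.9  # assumed cut-off for "highly correlated"

# Check every pair of columns once and keep those above the threshold.
high_pairs = [
    (a, b, corr.loc[a, b])
    for a, b in combinations(corr.columns, 2)
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```

`sns.heatmap(corr, annot=True)` would display the same matrix as a heatmap.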


In [113]:
#scaling numerical columns with normalizer if needed 
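A sketch of the scaling step with scikit-learn's `Normalizer`, using the first three rows of balance data from the preview above. Note that `Normalizer` rescales each row to unit length; if you want each column rescaled instead, `MinMaxScaler` or `StandardScaler` would be the alternatives to consider.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import Normalizer

# First three rows of balance columns from the preview.
toy_num = pd.DataFrame({
    "Average_Balance": [1160.75, 147.25, 276.5],
    "Q1_balance": [1669.0, 39.0, 367.0],
    "Q2_balance": [877.0, 106.0, 352.0],
})

transformer = Normalizer()  # default norm is "l2"
x_normalized = pd.DataFrame(
    transformer.fit_transform(toy_num),
    columns=toy_num.columns,
    index=toy_num.index,  # keep the index so a later concat lines up
)

# Every row now has Euclidean length 1.
row_norms = np.linalg.norm(x_normalized.to_numpy(), axis=1)
print(row_norms)
```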


In [114]:
#encoding categorical features if needed 
data_cat.head()

Unnamed: 0,Offer_Accepted,Reward_Type,Mailer_Type,Income,Overdraft_protection,Credit_rating,Own_your_home?
0,No,Air Miles,Letter,High,No,High,No
1,No,Air Miles,Letter,Medium,No,Medium,Yes
2,No,Air Miles,Postcard,High,No,Medium,Yes
3,No,Air Miles,Letter,Medium,No,High,No
4,No,Air Miles,Letter,Medium,No,Medium,Yes


In [115]:
X_cat=pd.get_dummies(data_cat, drop_first=True)

In [116]:
X_cat.head()

Unnamed: 0,Offer_Accepted_Yes,Reward_Type_Cash Back,Reward_Type_Points,Mailer_Type_Postcard,Income_Low,Income_Medium,Overdraft_protection_Yes,Credit_rating_Low,Credit_rating_Medium,Own_your_home?_Yes
0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,1,0,0,1,1
2,0,0,0,1,0,0,0,0,1,1
3,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,1,0,0,1,1


In [117]:
data_num

Unnamed: 0,Customer_number,No_open_bank_accounts,No_credit_cards,Homes_owned,Household_size,Average_Balance,Q1_balance,Q2_balance,Q3_balance,Q4_balance
0,1,1,2,1,4,1160.75,1669.0,877.0,1095.0,1002.0
1,2,1,2,2,5,147.25,39.0,106.0,78.0,366.0
2,3,2,2,1,2,276.50,367.0,352.0,145.0,242.0
3,4,2,1,1,4,1219.00,1578.0,1760.0,1119.0,419.0
4,5,1,2,1,6,1211.00,2140.0,1357.0,982.0,365.0
...,...,...,...,...,...,...,...,...,...,...
17995,17996,1,1,1,5,167.50,136.0,65.0,71.0,398.0
17996,17997,1,3,1,3,850.50,984.0,940.0,943.0,535.0
17997,17998,1,2,1,4,1087.25,918.0,767.0,1170.0,1494.0
17998,17999,1,4,2,2,1022.25,626.0,983.0,865.0,1615.0


In [118]:
data_num.shape

(17976, 10)

In [119]:
X_cat.shape

(17976, 10)

In [120]:
# bring numbers and cat back together into one frame 

df=pd.concat((data_num, X_cat),axis=1)

In [121]:
df.shape

(17976, 20)

# Split off the dependent class (label)


In [122]:
# drop target class from X 

X=df.drop(columns=['Offer_Accepted_Yes'])

In [123]:
X.head()

Unnamed: 0,Customer_number,No_open_bank_accounts,No_credit_cards,Homes_owned,Household_size,Average_Balance,Q1_balance,Q2_balance,Q3_balance,Q4_balance,Reward_Type_Cash Back,Reward_Type_Points,Mailer_Type_Postcard,Income_Low,Income_Medium,Overdraft_protection_Yes,Credit_rating_Low,Credit_rating_Medium,Own_your_home?_Yes
0,1,1,2,1,4,1160.75,1669.0,877.0,1095.0,1002.0,0,0,0,0,0,0,0,0,0
1,2,1,2,2,5,147.25,39.0,106.0,78.0,366.0,0,0,0,0,1,0,0,1,1
2,3,2,2,1,2,276.5,367.0,352.0,145.0,242.0,0,0,1,0,0,0,0,1,1
3,4,2,1,1,4,1219.0,1578.0,1760.0,1119.0,419.0,0,0,0,0,1,0,0,0,0
4,5,1,2,1,6,1211.0,2140.0,1357.0,982.0,365.0,0,0,0,0,1,0,0,1,1


In [124]:
# define the target y
y=df['Offer_Accepted_Yes']

In [125]:
#checking the len of x_normalized & cat_clean before merging back together in X


In [126]:
# bring the scaled numerical data together with the categorical data using concat



# Modelling

**- iteration 1 (X)**

In our first iteration we only use preprocessing and encoding; this serves as a benchmark to compare the next iterations against.

**- iteration 2 (X_i2)**

- SMOTE sampling to reduce the imbalance of the target
- drop some selected columns

**- iteration 3 (X_i3)**

For example:
- drop the quarterly balance columns to reduce noise
- encode numerical features as categories where appropriate
- implement KNN or a decision tree



## Modeling (X)

In [127]:
#import model

from sklearn.neighbors import KNeighborsClassifier

### Test & Train

In [128]:
#train test split - splitting X and y each into 2 data sets(train data and test data)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=40)

In [129]:
#model development
#create a k-nearest neighbours classifier object using KNeighborsClassifier()
#fit the model on the train set using fit()

neigh = KNeighborsClassifier(n_neighbors=5)

neigh.fit(X_train, y_train)

# KNN requires all features to be numerical, with no nulls

KNeighborsClassifier()

In [28]:
#perform prediction on the test set using predict()


In [29]:
#check the predictions array


### Accuracy metrics and visuals

In [30]:
#calculating the accuracy score
from sklearn.metrics import accuracy_score
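Accuracy needs a reference point on imbalanced data: with 16977 No and 1023 Yes, always predicting No would already be right about 94% of the time. The sketch below shows the predict and score steps on synthetic data from `make_classification`; all `*_toy` and `yt_*` names are placeholders, so the notebook's own `X_train`, `y_test` and `neigh` are not overwritten.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for X and y (roughly 94% negatives).
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=8, weights=[0.94, 0.06], random_state=40
)
Xt_train, Xt_test, yt_train, yt_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=40, stratify=y_toy
)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(Xt_train, yt_train)
yt_pred = knn.predict(Xt_test)  # array of predicted labels (0 or 1)

accuracy = accuracy_score(yt_test, yt_pred)
# Accuracy of a model that always predicts the majority class.
baseline = max((yt_test == 0).mean(), (yt_test == 1).mean())
print(f"accuracy: {accuracy:.3f}, majority-class baseline: {baseline:.3f}")
```

If the model's accuracy is not clearly above the baseline, recall on the Yes class is probably the more informative number.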
    

#### Confusion matrix
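A small worked example of reading a confusion matrix, using hand-written labels (1 = Yes, 0 = No) rather than real predictions. For binary labels, scikit-learn orders the matrix as `[[TN, FP], [FN, TP]]`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-written labels for illustration: 6 actual "No" (0) and 4 actual "Yes" (1).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()
print(cm)

# Recall: share of actual Yes customers the model found.
recall = recall_score(y_true, y_hat)
# Precision: share of predicted Yes customers that really said Yes.
precision = precision_score(y_true, y_hat)
print(f"recall: {recall:.2f}, precision: {precision:.2f}")
```

Here the model finds only 1 of the 4 customers who accepted, which an accuracy of 60% would not reveal on its own.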

#### ROC/AUC

**ROC Curve**
The Receiver Operating Characteristic (ROC) curve is a plot of the true positive rate against the false positive rate. It shows the tradeoff between sensitivity and specificity.

- the closer the curve is to the top-left corner, the better
- it should not fall below the diagonal 'red' line, which corresponds to random guessing (AUC = 0.5)

**AUC** (area under the curve): the bigger the area under the curve, the better the model.
1 represents a perfect classifier, and 0.5 represents a worthless classifier.<br/>
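To make the definitions concrete, here is a tiny example with four hand-written scores. For a fitted classifier such as KNN, the scores would typically come from `predict_proba(X_test)[:, 1]`; the values below are made up.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# One (fpr, tpr) point per threshold; plotting tpr against fpr gives the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)
print(f"AUC: {auc:.2f}")
```

Three of the four positive/negative pairs are ranked correctly (only 0.35 vs 0.4 is the wrong way round), which is why the AUC comes out at 0.75.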



## Modeling (X_i2)

In [31]:
#drop selected columns 



In [32]:
# import needed libraries


In [33]:
#define smote variable


In [34]:
#recreate X and y applying smote


### Test & Train

In [35]:
#redo train test split it2


In [36]:
#apply model it2 


### Accuracy metrics and visuals

#### Confusion matrix

#### ROC

**Comparison of accuracy and recall: it1 and it2**

accuracy it1 = 
accuracy it2 = 



## Modeling (X_i3)

### Test and Train 

### Accuracy metrics and visuals

#### Confusion matrix

**Comparison of confusion matrices: it1, it2 and it3**




#### ROC

**Comparison of ROC & AUC: it1, it2 and it3**



# Findings and Conclusion


