## Case Study: Classification

The bank wants to understand the demographics and other characteristics of the customers who accept a credit card offer and of those who do not.

**Will the customer accept the credit card offer? Y/N**

#### Glossary
EDA = exploratory data analysis<br/>
num = numerical <br/>
cat = categorical<br/>
avg = average<br/>

### 0. Preparation
#### Importing needed libraries

In [1]:
import pandas as pd
import numpy as np

#### Loading the data set

In [2]:
data = pd.read_csv('creditcardmarketing_jupyter.csv')

In [None]:
data.to_csv('creditcardmarketing_jupyter.csv', header=['Customer Number','Offer Accepted','Reward','Mailer Type', 'Income Level','# Bank Accounts Open','Overdraft Protection','Credit Rating','# Credit Cards Held','# Homes Owned','Household Size','Own Your Home','Average Balance','Q1 Balance','Q2 Balance','Q3 Balance','Q4 Balance'], index=False)

### 1. EDA

In the EDA we want to familiarize ourselves with the data set. We are going to work through the following steps:

- check shape
- data types (correct type for model, same units?)
- null values, white spaces, duplicates, number of unique values per column and whether classes are written consistently, mislabeled classes (male ≠ Male), typos/inconsistent capitalisation, irrelevant columns
- missing data
- imbalance
- num vs cat data (split)
- metrics (min/max difference features)
- distribution plots (normalising, scaling, outlier detection)
- multicollinearity
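One item on the checklist, mislabeled classes such as male ≠ Male, can be checked by comparing the number of classes before and after normalising the strings. The following is only a minimal sketch on a hypothetical toy column (`toy` and its values are not part of the bank data set):

```python
import pandas as pd

# Toy data (hypothetical): the same class written in three different ways.
toy = pd.DataFrame({"gender": ["male", "Male", " male", "female"]})

# Number of classes as they are stored.
raw_classes = toy["gender"].nunique()

# Normalise: strip surrounding whitespace and lowercase before counting again.
normalised = toy["gender"].str.strip().str.lower()
clean_classes = normalised.nunique()

# A drop in the class count hints at typos, stray spaces or inconsistent capitalisation.
print(raw_classes, clean_classes)
```

If both counts are equal for a column, its classes are at least consistently written.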

We take a first look at the data set using head() and shape.

In [3]:
data.shape

(18000, 18)

In [4]:
data.head()

Unnamed: 0.1,Unnamed: 0,Customer Number,Offer Accepted,Reward,Mailer Type,Income Level,# Bank Accounts Open,Overdraft Protection,Credit Rating,# Credit Cards Held,# Homes Owned,Household Size,Own Your Home,Average Balance,Q1 Balance,Q2 Balance,Q3 Balance,Q4 Balance
0,0,1,No,Air Miles,Letter,High,1,No,High,2,1,4,No,1160.75,1669.0,877.0,1095.0,1002.0
1,1,2,No,Air Miles,Letter,Medium,1,No,Medium,2,2,5,Yes,147.25,39.0,106.0,78.0,366.0
2,2,3,No,Air Miles,Postcard,High,2,No,Medium,2,1,2,Yes,276.5,367.0,352.0,145.0,242.0
3,3,4,No,Air Miles,Letter,Medium,2,No,High,1,1,4,No,1219.0,1578.0,1760.0,1119.0,419.0
4,4,5,No,Air Miles,Letter,Medium,1,No,Medium,2,1,6,Yes,1211.0,2140.0,1357.0,982.0,365.0


We see that we have 18000 rows and 18 columns: the 17 features described below plus an 'Unnamed: 0' column.
From the head() we can see that there is a mix of numerical (num) and categorical (cat) data.
The **definition of the features** is the following:
- **Customer Number:** A sequential number assigned to the customers (this column is hidden and excluded – this unique identifier will not be used directly).
- **Offer Accepted:** Did the customer accept (Yes) or reject (No) the offer.
- **Reward:** The type of reward program offered for the card.
- **Mailer Type:** Letter or postcard.
- **Income Level:** Low, Medium or High.
- **#Bank Accounts Open:** How many non-credit-card accounts are held by the customer.
- **Overdraft Protection:** Does the customer have overdraft protection on their checking account(s) (Yes or No).
- **Credit Rating:** Low, Medium or High.
- **#Credit Cards Held:** The number of credit cards held at the bank.
- **#Homes Owned:** The number of homes owned by the customer.
- **Household Size:** Number of individuals in the family.
- **Own Your Home:** Does the customer own their home? (Yes or No).
- **Average Balance:** Average account balance (across all accounts over time).
- **Q1, Q2, Q3 and Q4 Balance:** Average balance for each quarter in the last year.


Next we look at the **data types** the features are stored as, using dtypes.

In [5]:
data.dtypes

Unnamed: 0                int64
Customer Number           int64
Offer Accepted           object
Reward                   object
Mailer Type              object
Income Level             object
# Bank Accounts Open      int64
Overdraft Protection     object
Credit Rating            object
# Credit Cards Held       int64
# Homes Owned             int64
Household Size            int64
Own Your Home            object
Average Balance         float64
Q1 Balance              float64
Q2 Balance              float64
Q3 Balance              float64
Q4 Balance              float64
dtype: object

We can see that there is a column 'Unnamed: 0', which was introduced by mistake when the data set was loaded. We can already put it on our list of columns to drop in the data cleaning phase.<br/>
The same goes for 'Customer Number': it is only a running index, which the DataFrame already has, so we don't need it twice.<br/><br/>
The remaining data types match the feature descriptions.
<br/><br/>
As we can see from the dtypes output, the **feature names are not standardized** yet. So we will make them all lowercase and replace the whitespace with '_'.

In [None]:
#code to make them lowercase and replace whitespaces
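The cell above is still a placeholder. One possible way to do the renaming is sketched below on a hypothetical empty frame `toy` that only carries a few of the original column names. Note that the later cells in this notebook still use the original names, so the renaming would have to be applied after them or those cells updated accordingly.

```python
import pandas as pd

# Toy frame (hypothetical) with a few of the original column names.
toy = pd.DataFrame(columns=["Offer Accepted", "# Bank Accounts Open", "Q1 Balance"])

# Lowercase every name and replace each space with an underscore.
toy.columns = toy.columns.str.lower().str.replace(" ", "_", regex=False)

print(list(toy.columns))
```

The '#' characters are kept by this sketch; whether to strip them as well is a separate naming decision.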

Next we will check whether there are any **null or NaN values** in our data set.

In [6]:
data.isna().sum()

Unnamed: 0               0
Customer Number          0
Offer Accepted           0
Reward                   0
Mailer Type              0
Income Level             0
# Bank Accounts Open     0
Overdraft Protection     0
Credit Rating            0
# Credit Cards Held      0
# Homes Owned            0
Household Size           0
Own Your Home            0
Average Balance         24
Q1 Balance              24
Q2 Balance              24
Q3 Balance              24
Q4 Balance              24
dtype: int64

We can see that avg balance and the Q1-Q4 balances each have 24 null values.<br/>
We suspect that these 24 nulls sit in the same rows for all 5 features.<br/>
24 out of 18,000 rows is a fairly small amount (about 0.13%), which would justify either dropping the rows or replacing the null values with the mean.<br/>
We will do so in the data cleaning phase.
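The suspicion that the nulls share the same rows can be tested by counting nulls per row across the five balance columns. The sketch below uses a hypothetical three-row frame `toy` with made-up values, and also shows the two handling options mentioned above (drop or fill with the mean):

```python
import numpy as np
import pandas as pd

balance_cols = ["Average Balance", "Q1 Balance", "Q2 Balance", "Q3 Balance", "Q4 Balance"]

# Toy data (hypothetical values): the middle row is missing all five balances.
toy = pd.DataFrame(
    [[100.0, 90.0, 110.0, 95.0, 105.0],
     [np.nan] * 5,
     [200.0, 210.0, 190.0, 205.0, 195.0]],
    columns=balance_cols,
)

# Count how many of the balance columns are null in each row.
nulls_per_row = toy[balance_cols].isna().sum(axis=1)

# If the nulls always occur together, every row has either 0 or all 5 of them.
same_rows = bool(nulls_per_row.isin([0, len(balance_cols)]).all())
rows_with_nulls = int((nulls_per_row > 0).sum())

# Option 1: drop the affected rows.
dropped = toy.dropna(subset=balance_cols)

# Option 2: replace each null with the mean of its column.
filled = toy.fillna(toy[balance_cols].mean())

print(same_rows, rows_with_nulls, len(dropped))
```

On the real data, `same_rows` being True together with `rows_with_nulls` equal to 24 would confirm the suspicion.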

Let's now also check for **duplicates**:

In [11]:
data.duplicated(subset=None, keep='first').unique()

array([False])

It looks like we don't have any duplicates: unique() returns only False, which means no row is a duplicate. A True would indicate a duplicated row.
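One caveat is that duplicated() compares whole rows, including identifier columns such as 'Customer Number', which are unique by construction. A rough sketch of a check that ignores the identifier, using a hypothetical toy frame:

```python
import pandas as pd

# Toy data (hypothetical): rows 0 and 1 differ only in the identifier.
toy = pd.DataFrame({
    "Customer Number": [1, 2, 3],
    "Reward": ["Points", "Points", "Cash Back"],
    "Household Size": [4, 4, 2],
})

# Whole-row comparison: the unique identifier hides the repeated record.
full_row_duplicates = int(toy.duplicated().sum())

# Compare only the feature columns, leaving the identifier out.
feature_cols = list(toy.columns.drop("Customer Number"))
feature_duplicates = int(toy.duplicated(subset=feature_cols).sum())

print(full_row_duplicates, feature_duplicates)
```

Summing the boolean result gives a count directly, which is easier to read than the output of unique(). Whether customers with identical feature values count as true duplicates is a judgment call for this data set.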

Let's already **split the num and cat features** now so we can look at each group separately.

In [22]:
data_num = data.select_dtypes(include=['number'])
data_num.head()

Unnamed: 0.1,Unnamed: 0,Customer Number,# Bank Accounts Open,# Credit Cards Held,# Homes Owned,Household Size,Average Balance,Q1 Balance,Q2 Balance,Q3 Balance,Q4 Balance
0,0,1,1,2,1,4,1160.75,1669.0,877.0,1095.0,1002.0
1,1,2,1,2,2,5,147.25,39.0,106.0,78.0,366.0
2,2,3,2,2,1,2,276.5,367.0,352.0,145.0,242.0
3,3,4,2,1,1,4,1219.0,1578.0,1760.0,1119.0,419.0
4,4,5,1,2,1,6,1211.0,2140.0,1357.0,982.0,365.0


In [21]:
data_cat = data.select_dtypes(include=['object'])
data_cat.head()

Unnamed: 0,Offer Accepted,Reward,Mailer Type,Income Level,Overdraft Protection,Credit Rating,Own Your Home
0,No,Air Miles,Letter,High,No,High,No
1,No,Air Miles,Letter,Medium,No,Medium,Yes
2,No,Air Miles,Postcard,High,No,Medium,Yes
3,No,Air Miles,Letter,Medium,No,High,No
4,No,Air Miles,Letter,Medium,No,Medium,Yes


Next step is to look at the **unique values in the cat features**:

In [17]:
data['Offer Accepted'].unique()

array(['No', 'Yes'], dtype=object)

In [23]:
data['Reward'].unique()

array(['Air Miles', 'Cash Back', 'Points'], dtype=object)

In [24]:
data['Mailer Type'].unique()

array(['Letter', 'Postcard'], dtype=object)

In [25]:
data['Income Level'].unique()

array(['High', 'Medium', 'Low'], dtype=object)

In [26]:
data['Overdraft Protection'].unique()

array(['No', 'Yes'], dtype=object)

In [27]:
data['Credit Rating'].unique()

array(['High', 'Medium', 'Low'], dtype=object)

In [28]:
data['Own Your Home'].unique()

array(['No', 'Yes'], dtype=object)

In [18]:
def unique_values_features(x):
    return data[x].unique()

Looks good so far: we have clear classes.
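The per-column cells above could also be collapsed into one loop, and the same pass can cover the imbalance item from the EDA checklist. This is only a sketch on a hypothetical four-row frame `toy`; the class shares it prints say nothing about the real data:

```python
import pandas as pd

# Toy data (hypothetical values) with two of the categorical columns.
toy = pd.DataFrame({
    "Offer Accepted": ["No", "No", "No", "Yes"],
    "Mailer Type": ["Letter", "Postcard", "Letter", "Letter"],
})

# Collect the unique classes of every column in one pass.
unique_per_col = {col: sorted(toy[col].unique()) for col in toy.columns}
for col, classes in unique_per_col.items():
    print(col, classes)

# Relative class frequencies of the target show how imbalanced it is.
target_share = toy["Offer Accepted"].value_counts(normalize=True)
print(target_share)
```

On the real data set, `data_cat.columns` could take the place of `toy.columns`.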

In [None]:
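import matplotlib.pyplot as plt
import seaborn as sns

cols_cat = list(data.select_dtypes(include=['object']).columns)

for col in cols_cat:
    print("Frequency analysis of column:", col)
    # Count each class; explicit column names keep this independent of the pandas version
    my_data = data[col].value_counts().rename_axis('value').reset_index(name='count')
    # Open a new figure first so every column gets its own plot
    plt.figure()
    sns.barplot(x='count', y='value', data=my_data).set_title(col.upper())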

In [None]:
# data_num has 11 numerical columns, so the grid needs at least 11 subplots (3x4)
data_num.hist(bins=15, figsize=(15, 9), layout=(3, 4));

### 2. Data cleaning

- drop column 'Unnamed: 0'
- drop column 'Customer Number'
- drop rows with null values
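The cleaning steps could look roughly like the sketch below. It runs on a hypothetical three-row frame `toy` and also drops 'Customer Number', which the EDA flagged as a redundant index. On the real data the same chain would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with the two index-like columns and one balance column.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Customer Number": [1, 2, 3],
    "Average Balance": [1160.75, np.nan, 276.5],
})

cleaned = (
    toy.drop(columns=["Unnamed: 0", "Customer Number"])  # remove the redundant index columns
       .dropna()                                         # drop rows that contain null values
       .reset_index(drop=True)                           # renumber the remaining rows from 0
)

print(cleaned)
```

Dropping is chosen here because it matches the list above. Filling with the column mean, the alternative raised in the EDA, would keep all rows instead.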