# Task Description

The data set, related to customer credit risk, and the description of the attributes are uploaded in the Files. 

There are three tasks for the semestral project:
1) Develop a prediction model to classify the customers as good or bad

2) Cluster the customers into various groups

3) Provide some ideas on how frequent pattern mining could be utilized to uncover some patterns in the data and/or to enhance the classification

# Import data
Import the dataset with the pandas `pd.read_csv` function. The fields are separated by semicolons, so the `sep` parameter has to be set.

In [7]:
import pandas as pd
import numpy as np
import sklearn as sk
from sklearn.model_selection import train_test_split
from apyori import apriori

dataset = pd.read_csv('../data/project_data.csv',sep=';')

# Data description

Description of the attributes (columns) in the data and their values:

Attribute X01: Status of existing checking account of the customer
A11: X01 < 0 EUR
A12: 0 <= X01 < 200 EUR
A13: X01 >= 200 EUR
A14: no checking account

Attribute X02: Duration of the credit (requested by the customer from the bank) in months

Attribute X03: Credit history of the customer
A30: no credits taken or all credits paid back duly
A31: all credits at this bank paid back duly
A32: existing credits paid back duly till now
A33: delay in paying off in the past
A34: critical account or other credits existing (not at this bank)

Attribute X04: Purpose of the credit
A40: car (new)
A41: car (used)
A42: furniture/equipment
A43: radio/television
A44: domestic appliances
A45: repairs
A46: education
A47: vacation
A48: retraining
A49: business
A410: others

Attribute X05: Credit amount in EUR

Attribute X06: Savings account/bonds of the customer
A61: X06 < 100 EUR
A62: 100 <= X06 < 500 EUR
A63: 500 <= X06 < 1000 EUR
A64: X06 >= 1000 EUR
A65: unknown/no savings account

Attribute X07: Present employment of the customer since
A71: unemployed
A72: X07 < 1 year
A73: 1 <= X07 < 4 years  
A74: 4 <= X07 < 7 years
A75: X07 >= 7 years

Attribute X08: Installment rate in percentage of disposable income

Attribute X09: Personal status and sex of the customer
A91: male - divorced/separated
A92: female - divorced/separated/married
A93: male - single
A94: male - married/widowed
A95: female - single

Attribute X10: Other debtors or guarantors for the credit
A101: none
A102: co-applicant
A103: guarantor

Attribute X11: Present residence of the customer since (in years)

Attribute X12: Property owned by the customer
A121: real estate
A122: if not A121 - building society savings agreement/life insurance
A123: if not A121/A122 - car or other, not in attribute X06
A124: unknown/no property

Attribute X13: Age of the customer in years

Attribute X14: Other installment plans of the customer
A141: bank
A142: stores
A143: none

Attribute X15: Housing situation of the customer
A151: renting
A152: owning
A153: accommodation (i.e. living) for free

Attribute X16: Number of existing credits of the customer at this bank

Attribute X17: Job situation of the customer
A171: unemployed/unskilled  - non-resident
A172: unskilled - resident
A173: skilled employee/official
A174: management/self-employed/highly qualified employee/officer

Attribute X18: Number of people the customer is liable to provide maintenance for

Attribute X19: Telephone of the customer (Note: the data are from 1994, when having a phone was less common)
A191: none
A192: yes, registered under the customer's name

Attribute X20: if the customer is a foreign worker
A201: yes
A202: no

Attribute Y: label
1 = good customer, i.e. paid back the requested credit (see the attributes X02, X04, X05)
2 = bad customer, i.e. did not pay back the requested credit (see the attributes X02, X04, X05)

Please note that it is five times worse to classify customers as good when they are bad than to classify customers as bad when they are good!
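
The 5:1 cost ratio can be turned into a decision rule. The sketch below is only an illustration and not part of the original notebook: it assumes a classifier that outputs a probability `p_bad` that a customer is bad, and the names `COST_FALSE_GOOD`, `COST_FALSE_BAD`, `expected_cost`, `decide` and `total_cost` are hypothetical. Predicting "good" has an expected cost of 5 * p_bad, predicting "bad" has an expected cost of 1 * (1 - p_bad), so "bad" is the cheaper prediction as soon as p_bad exceeds 1/6.

```python
COST_FALSE_GOOD = 5  # a bad customer (Y = 2) classified as good
COST_FALSE_BAD = 1   # a good customer (Y = 1) classified as bad

# Probability of "bad" above which predicting "bad" is cheaper: 1 / (1 + 5)
THRESHOLD = COST_FALSE_BAD / (COST_FALSE_BAD + COST_FALSE_GOOD)


def expected_cost(p_bad, predict_good):
    """Expected misclassification cost of one prediction."""
    if predict_good:
        return COST_FALSE_GOOD * p_bad
    return COST_FALSE_BAD * (1.0 - p_bad)


def decide(p_bad):
    """Return the label (1 = good, 2 = bad) with the lower expected cost."""
    if expected_cost(p_bad, True) < expected_cost(p_bad, False):
        return 1
    return 2


def total_cost(y_true, y_pred):
    """Sum the asymmetric costs over paired true and predicted labels."""
    cost = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 2 and pred == 1:
            cost += COST_FALSE_GOOD
        elif truth == 1 and pred == 2:
            cost += COST_FALSE_BAD
    return cost


print(decide(0.1), decide(0.2))                    # 1 2
print(total_cost([1, 2, 2, 1], [1, 1, 2, 2]))      # 6
```

A model evaluated with plain accuracy would treat both error types equally; summing the costs as in `total_cost` keeps the evaluation aligned with the stated 5:1 ratio.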

Display the first 5 rows of the data.

In [3]:
dataset.head()

Unnamed: 0,X01,X02,X03,X04,X05,X06,X07,X08,X09,X10,...,X12,X13,X14,X15,X16,X17,X18,X19,X20,Y
0,A11,6,A34,A43,1169,A65,A75,4,A93,A101,...,A121,67,A143,A152,2,A173,1,A192,A201,1
1,A12,48,A32,A43,5951,A61,A73,2,A92,A101,...,A121,22,A143,A152,1,A173,1,A191,A201,2
2,A14,12,A34,A46,2096,A61,A74,2,A93,A101,...,A121,49,A143,A152,1,A172,2,A191,A201,1
3,A11,42,A32,A42,7882,A61,A74,2,A93,A103,...,A122,45,A143,A153,1,A173,2,A191,A201,1
4,A11,24,A33,A40,4870,A61,A73,3,A93,A101,...,A124,53,A143,A153,2,A173,2,A191,A201,2


In [4]:
# Any missing values?
dataset.isnull().values.any()

False

In [5]:
dataset.describe()

Unnamed: 0,X02,X05,X08,X11,X13,X16,X18,Y
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,20.903,3271.258,2.973,2.845,35.546,1.407,1.155,1.3
std,12.058814,2822.736876,1.118715,1.103718,11.375469,0.577654,0.362086,0.458487
min,4.0,250.0,1.0,1.0,19.0,1.0,1.0,1.0
25%,12.0,1365.5,2.0,2.0,27.0,1.0,1.0,1.0
50%,18.0,2319.5,3.0,3.0,33.0,1.0,1.0,1.0
75%,24.0,3972.25,4.0,4.0,42.0,2.0,1.0,2.0
max,72.0,18424.0,4.0,4.0,75.0,4.0,2.0,2.0
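
The mean of Y in this summary gives the class balance directly: Y only takes the values 1 and 2, so its mean equals 1 plus the share of bad customers. The sketch below works through that arithmetic with the numbers copied from the `describe()` output; the variable names are illustrative.

```python
import math

n_customers = 1000   # "count" row of describe()
mean_y = 1.3         # "mean" of column Y

# Y is 1 (good) or 2 (bad), so mean(Y) = 1 + share of bad customers
share_bad = mean_y - 1.0
n_bad = round(share_bad * n_customers)
n_good = n_customers - n_bad
print(n_good, n_bad)   # 700 300

# Consistency check against the "std" row: the sample standard deviation
# of a two-valued column is sqrt(p * (1 - p) * n / (n - 1))
std_y = math.sqrt(share_bad * (1.0 - share_bad) * n_customers / (n_customers - 1))
print(round(std_y, 6))
```

With roughly 70% good and 30% bad customers the classes are imbalanced, which matters together with the asymmetric misclassification cost.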



In [10]:
# One transaction per customer: every cell value converted to a string
records = dataset.astype(str).values.tolist()
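
The records built above turn every value into a bare string, so numeric items such as '1', '2' or '4' cannot be traced back to a column (X08, X11 and X16 can all take the value 4, and Y, X16 and X18 can all take the value 1). One possible remedy, sketched here as an assumption rather than as the notebook's method, is to prefix each value with its column name. The helper `to_transactions` is hypothetical; the two sample rows are abbreviated from the `dataset.head()` output.

```python
def to_transactions(columns, rows):
    """Turn each row into a list of 'column=value' items."""
    return [[f"{col}={val}" for col, val in zip(columns, row)] for row in rows]


# First two customers from dataset.head(), restricted to four columns
columns = ["X01", "X08", "X16", "Y"]
rows = [
    ["A11", 4, 2, 1],
    ["A12", 2, 1, 2],
]

transactions = to_transactions(columns, rows)
print(transactions[0])   # ['X01=A11', 'X08=4', 'X16=2', 'Y=1']
```

With the notebook's DataFrame the call would be `to_transactions(dataset.columns, dataset.values)`. Continuous columns such as X02, X05 and X13 would probably need to be binned first, since exact amounts rarely repeat often enough to reach the minimum support.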

In [42]:
# apyori's apriori() takes min_support, min_confidence, min_lift and max_length as keyword arguments
association_rules = apriori(records, min_support=0.05, min_confidence=0.6, min_lift=2)
association_results = list(association_rules)
print(association_results)

[RelationRecord(items=frozenset({'A153', 'A124'}), support=0.104, ordered_statistics=[OrderedStatistic(items_base=frozenset({'A124'}), items_add=frozenset({'A153'}), confidence=0.6753246753246753, lift=6.253006253006253), OrderedStatistic(items_base=frozenset({'A153'}), items_add=frozenset({'A124'}), confidence=0.9629629629629629, lift=6.2530062530062525)]), RelationRecord(items=frozenset({'A192', 'A174'}), support=0.127, ordered_statistics=[OrderedStatistic(items_base=frozenset({'A174'}), items_add=frozenset({'A192'}), confidence=0.8581081081081081, lift=2.124029970564624)]), RelationRecord(items=frozenset({'A153', 'A124', '1'}), support=0.096, ordered_statistics=[OrderedStatistic(items_base=frozenset({'A124'}), items_add=frozenset({'A153', '1'}), confidence=0.6233766233766234, lift=6.296733569460842), OrderedStatistic(items_base=frozenset({'A153'}), items_add=frozenset({'A124', '1'}), confidence=0.888888888888889, lift=6.216006216006217), OrderedStatistic(items_base=frozenset({'A124'

In [43]:
print(len(association_results))

402
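
The three measures in each record are related by simple ratios: confidence(A -> B) = support(A and B) / support(A), and lift(A -> B) = support(A and B) / (support(A) * support(B)). The sketch below checks this against the first printed record. The individual supports of A124 and A153 are not printed by the notebook; the values 0.154 and 0.108 are inferred from the two printed confidences (0.104 / 0.6753 and 0.104 / 0.9630), so treat them as assumptions.

```python
def confidence(support_both, support_base):
    """Share of transactions containing the base that also contain the added items."""
    return support_both / support_base


def lift(support_both, support_base, support_add):
    """How much more often both sides co-occur than expected under independence."""
    return support_both / (support_base * support_add)


support_both = 0.104   # printed support of {A124, A153}
support_a124 = 0.154   # inferred from the printed confidence, not printed itself
support_a153 = 0.108   # inferred from the printed confidence, not printed itself

conf_a124_a153 = confidence(support_both, support_a124)
conf_a153_a124 = confidence(support_both, support_a153)
lift_rule = lift(support_both, support_a124, support_a153)

print(round(conf_a124_a153, 4), round(conf_a153_a124, 4), round(lift_rule, 4))
```

Lift is symmetric in the two sides of a rule, which is why both ordered statistics of the first record report the same lift while their confidences differ.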


In [47]:
def inspect(association_results):
    '''
    Collect the rules as a list of tuples that pandas can turn into a DataFrame.
    Only the first OrderedStatistic of each RelationRecord is used, so further
    rules derived from the same itemset are not listed.
    '''
    rows = []
    for result in association_results:
        first_rule = result.ordered_statistics[0]
        rows.append((tuple(first_rule.items_base),   # antecedent (left-hand side)
                     tuple(first_rule.items_add),    # consequent (right-hand side)
                     result.support,
                     first_rule.confidence,
                     first_rule.lift))
    return rows

# 'Item #1' holds the antecedent and 'Item #2' the consequent of each rule
resultsinDataFrame = pd.DataFrame(inspect(association_results), columns = ['Item #1', 'Item #2', 'Support', 'Confidence', 'Lift'])
resultsinDataFrame

Unnamed: 0,Item #1,Item #2,Support,Confidence,Lift
0,"(A124,)","(A153,)",0.104,0.675325,6.253006
1,"(A174,)","(A192,)",0.127,0.858108,2.124030
2,"(A124,)","(A153, 1)",0.096,0.623377,6.296734
3,"(A174,)","(1, A192)",0.125,0.844595,2.111486
4,"(A153,)","(2, A124)",0.082,0.759259,6.123059
...,...,...,...,...,...
397,"(A75, 1, A191, A173)","(A201, A93, 4, A143)",0.053,0.609195,2.107943
398,"(A101, A152, A14, A75)","(A201, A93, 4, A143)",0.052,0.619048,2.142033
399,"(A101, A75, A191, A173)","(A201, A93, 4, A143)",0.054,0.635294,2.198250
400,"(A101, A152, A14, A75)","(A201, 1, 4, A143, A93)",0.051,0.607143,2.176139
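
The `inspect` function keeps only the first ordered statistic of each record, although a record can carry several rules (the first printed record holds both A124 -> A153 and A153 -> A124). The sketch below shows one way to list every rule with an explicit direction. To stay self-contained it uses stand-in namedtuples with the field names visible in the printed output; `flatten_rules` and `format_rule` are hypothetical helpers, not part of apyori.

```python
from collections import namedtuple

# Stand-ins mirroring the field names shown in the printed apriori results
RelationRecord = namedtuple("RelationRecord", ["items", "support", "ordered_statistics"])
OrderedStatistic = namedtuple("OrderedStatistic", ["items_base", "items_add", "confidence", "lift"])


def flatten_rules(results):
    """Return one dict per ordered statistic, i.e. one row per rule."""
    rows = []
    for record in results:
        for stat in record.ordered_statistics:
            rows.append({
                "antecedent": tuple(sorted(stat.items_base)),
                "consequent": tuple(sorted(stat.items_add)),
                "support": record.support,
                "confidence": stat.confidence,
                "lift": stat.lift,
            })
    return rows


def format_rule(row):
    """Render a rule as 'antecedent => consequent'."""
    return "{} => {}".format(", ".join(row["antecedent"]), ", ".join(row["consequent"]))


# First record copied from the printed output above
first_record = RelationRecord(
    items=frozenset({"A153", "A124"}),
    support=0.104,
    ordered_statistics=[
        OrderedStatistic(frozenset({"A124"}), frozenset({"A153"}),
                         0.6753246753246753, 6.253006253006253),
        OrderedStatistic(frozenset({"A153"}), frozenset({"A124"}),
                         0.9629629629629629, 6.2530062530062525),
    ],
)

rules = flatten_rules([first_record])
for row in rules:
    print(format_rule(row), round(row["confidence"], 3))
# A124 => A153 0.675
# A153 => A124 0.963
```

In the notebook, `pd.DataFrame(flatten_rules(association_results))` would give one row per rule rather than one row per itemset.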


In [48]:
resultsinDataFrame.nlargest(n=10, columns='Lift')

Unnamed: 0,Item #1,Item #2,Support,Confidence,Lift
336,"(A153, A173)","(A201, A101, A124, A93)",0.05,0.793651,7.631258
376,"(2, A153)","(A101, A201, A124, 4, A93)",0.057,0.662791,7.618284
300,"(2, A153)","(A101, A124, 4, A93)",0.057,0.662791,7.531712
163,"(2, A153)","(A101, A124, A61)",0.053,0.616279,7.515598
306,"(2, A153)","(A201, A101, A124, A61)",0.053,0.616279,7.515598
207,"(A153, A173)","(A101, A124, A93)",0.05,0.793651,7.487272
388,"(2, A153, 1)","(A101, A201, A124, 4, A93)",0.05,0.649351,7.463801
315,"(A153,)","(A101, A201, A124, 4, A93)",0.07,0.648148,7.449979
377,"(A153, A143)","(A101, A201, A124, 4, A93)",0.053,0.646341,7.429212
342,"(2, A153, 1)","(A101, A124, 4, A93)",0.05,0.649351,7.378985


In [45]:
for item in association_results:
    # item[0] is the complete itemset of the record (antecedent and consequent together).
    # The arrows printed below only separate the items; they do not show the direction of the rule.
    items = [x for x in item[0]]
    rule_string = "Rule: "
    for x in items:
        rule_string = rule_string + x + " -> "
    print(rule_string)

    # item[1] is the support of the itemset
    print("Support: " + str(item[1]))

    # item[2][0] is the first ordered statistic of the record;
    # its fields at index 2 and 3 are the confidence and the lift
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("=====================================")
    

Rule: A153 -> A124 -> 
Support: 0.104
Confidence: 0.6753246753246753
Lift: 6.253006253006253
Rule: A192 -> A174 -> 
Support: 0.127
Confidence: 0.8581081081081081
Lift: 2.124029970564624
Rule: A153 -> A124 -> 1 -> 
Support: 0.096
Confidence: 0.6233766233766234
Lift: 6.296733569460842
Rule: 1 -> A192 -> A174 -> 
Support: 0.125
Confidence: 0.8445945945945946
Lift: 2.1114864864864864
Rule: 2 -> A153 -> A124 -> 
Support: 0.082
Confidence: 0.7592592592592593
Lift: 6.123058542413381
Rule: 2 -> A192 -> A174 -> 
Support: 0.097
Confidence: 0.6554054054054055
Lift: 2.087278361163712
Rule: A153 -> A124 -> 4 -> 
Support: 0.094
Confidence: 0.6103896103896104
Lift: 6.2926763957691785
Rule: 4 -> A192 -> A174 -> 
Support: 0.094
Confidence: 0.6351351351351352
Lift: 2.1901211556383973
Rule: A101 -> A124 -> A153 -> 
Support: 0.1
Confidence: 0.6493506493506493
Lift: 6.304375236414072
Rule: A101 -> A192 -> A174 -> 
Support: 0.119
Confidence: 0.8040540540540541
Lift: 2.138441633122484
Rule: A153 -> A124 -> A

Rule: A101 -> A201 -> 4 -> A192 -> A174 -> A143 -> 
Support: 0.069
Confidence: 0.6448598130841122
Lift: 2.1352973943182523
Rule: A101 -> A72 -> A201 -> 4 -> A92 -> A143 -> 
Support: 0.051
Confidence: 0.621951219512195
Lift: 2.0062942564909516
Rule: A101 -> 4 -> A93 -> A75 -> A143 -> A34 -> 
Support: 0.06
Confidence: 0.631578947368421
Lift: 2.098268928134289
Rule: A101 -> A201 -> 4 -> A92 -> A151 -> A32 -> 
Support: 0.051
Confidence: 0.6071428571428571
Lift: 2.1839671120246655
Rule: A101 -> A201 -> 4 -> A192 -> A174 -> A152 -> 
Support: 0.05
Confidence: 0.8064516129032259
Lift: 2.1737240239979134
Rule: A101 -> A201 -> 4 -> A61 -> A192 -> A174 -> 
Support: 0.056
Confidence: 0.8235294117647058
Lift: 2.2197558268590454
Rule: A101 -> A201 -> 4 -> A192 -> A174 -> A93 -> 
Support: 0.058
Confidence: 0.6170212765957447
Lift: 2.3109411108454854
Rule: A201 -> A124 -> 4 -> A143 -> A153 -> A93 -> 
Support: 0.054
Confidence: 0.6585365853658536
Lift: 7.005708354955889
Rule: 4 -> A14 -> A75 -> A143 ->