# Machine Learning Pipeline Applied to Credit Data

The `pipeline_library` module provides a function for each stage of a machine learning pipeline. The pipeline involves the following steps:

1. Read/load data
2. Explore the data
3. Pre-process and clean the data
4. Generate features/predictors for the ML model
5. Build a machine learning classifier 
6. Evaluate the classifier

We will use `pipeline_library` to analyze credit data and predict which individuals will experience financial distress in the next two years. The credit data is a modified version of the data from https://www.kaggle.com/c/GiveMeSomeCredit.

In [1]:
import pipeline_library as pl



### 1. Read the data

In [2]:
df = pl.load_csv_data("data/credit-data.csv")

### 2. Explore the data
Run `show_data_summary()` to see the first five rows of the dataframe, the column names and types, and summary statistics for each column/variable.

In [3]:
pl.show_data_summary(df)

   PersonID  SeriousDlqin2yrs  RevolvingUtilizationOfUnsecuredLines  age  \
0     98976                 0                              1.000000   55   
1     98991                 0                              0.547745   71   
2     99012                 0                              0.044280   51   
3     99023                 0                              0.914249   55   
4     99027                 0                              0.026599   45   

   zipcode  NumberOfTime30-59DaysPastDueNotWorse   DebtRatio  MonthlyIncome  \
0    60601                                     0  505.000000            0.0   
1    60601                                     0    0.459565        15666.0   
2    60601                                     0    0.014520         4200.0   
3    60601                                     4    0.794875         9052.0   
4    60601                                     0    0.049966        10406.0   

   NumberOfOpenCreditLinesAndLoans  NumberOfTimes90DaysLate  \
0    
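
The three views described above (first rows, column types, summary statistics) map onto standard pandas calls. The following is a minimal sketch of that idea, not the actual body of `show_data_summary()`; the function name `summarize_sketch` and the toy dataframe are assumptions made for illustration.

```python
import pandas as pd


def summarize_sketch(df: pd.DataFrame, n_rows: int = 5) -> dict:
    """Hypothetical helper: gather the head, dtypes, and summary statistics."""
    return {
        "head": df.head(n_rows),    # first n_rows rows
        "dtypes": df.dtypes,        # column names and types
        "stats": df.describe(),     # count, mean, std, min, quartiles, max
    }


# Small made-up frame, not the credit data.
toy = pd.DataFrame({"age": [55, 71, 51], "DebtRatio": [505.0, 0.459565, 0.014520]})
summary = summarize_sketch(toy)
for name, view in summary.items():
    print(name)
    print(view)
    print()
```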

The label, or outcome variable, that we will be using is `SeriousDlqin2yrs`. We can take a look at this label more closely with `show_label_detail()`.

In [4]:
label = "SeriousDlqin2yrs"
pl.show_label_detail(df, label)

Label value counts
0    34396
1     6620
Name: SeriousDlqin2yrs, dtype: int64

Proportion of 'yes' observations
0.1614004291008387


We can visualize a variable we're interested in with a simple histogram. We'll run `show_histogram()` to see a plot of the column `NumberOfDependents`.

In [5]:
pl.show_histogram(df, "NumberOfDependents")

AxesSubplot(0.125,0.11;0.775x0.77)


We can check whether two variables are correlated with `show_correlation()`. Running it without specifying columns prints a dataframe showing the correlation between every pair of columns; passing the names of two columns returns just the correlation between them. Let's see if there's a correlation between the label and a couple of different variables (features).

In [6]:
pl.show_correlation(df, cols=[label, "NumberOfOpenCreditLinesAndLoans"])

-0.03989766280591296

In [7]:
pl.show_correlation(df, cols=[label, "age"])

-0.1737278443559057
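
Both modes described above can be expressed with pandas' built-in Pearson correlation. This is a sketch under the assumption that the library uses Pearson correlation; `correlation_sketch` and the toy columns are made up for illustration.

```python
import pandas as pd


def correlation_sketch(df: pd.DataFrame, cols=None):
    """Hypothetical helper: full correlation matrix, or one pairwise value."""
    if cols is None:
        # Restrict to numeric columns so non-numeric ones do not raise.
        return df.select_dtypes("number").corr()
    first, second = cols
    return df[first].corr(df[second])


toy = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]})
matrix = correlation_sketch(toy)                 # every pair of columns
pair = correlation_sketch(toy, cols=["x", "z"])  # a single pair
print(matrix)
print(pair)
```

In the toy data `y` rises with `x` and `z` falls with `x`, so the two correlations sit at the extremes of the range; the values in the notebook output are much closer to zero.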

### 3. Pre-process the data
Find columns with null values and fill those null values with the mean value of the column.

In [8]:
pl.preprocess_data(df)

Replacing 7974 nulls in column MonthlyIncome with mean value 6578.995732703832
Replacing 1037 nulls in column NumberOfDependents with mean value 0.7732309462467796
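
Mean imputation of this kind can be sketched in a few lines of pandas. The helper below is an assumption about the general approach, not the source of `preprocess_data()`; it modifies the dataframe in place, which matches how the notebook calls the function without assigning its result, and it only touches numeric columns.

```python
import numpy as np
import pandas as pd


def fill_nulls_with_mean_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: replace nulls in numeric columns with the column mean."""
    for col in df.select_dtypes("number").columns:
        n_null = int(df[col].isnull().sum())
        if n_null == 0:
            continue
        mean_value = df[col].mean()  # nulls are skipped when computing the mean
        print(f"Replacing {n_null} nulls in column {col} with mean value {mean_value}")
        df[col] = df[col].fillna(mean_value)
    return df


# Toy frame with one missing income value.
toy = pd.DataFrame({
    "MonthlyIncome": [1000.0, np.nan, 3000.0],
    "NumberOfDependents": [0.0, 2.0, 1.0],
})
fill_nulls_with_mean_sketch(toy)
print(toy)
```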


### 4. Generate features
Prepare features for the model by transforming appropriate continuous variables into discrete variables and transforming appropriate discrete variables into dummy variables.

Here, we'll transform age into a discrete variable by creating age bins/categories. Age 21-31 will be "Young Adult", 32-64 will be "Adult", and 65-110 will be "Senior".

In [9]:
pl.make_discrete(df, "age", [21, 32, 65, 111], ["Young Adult", "Adult", "Senior"])
df["age"].value_counts()

Adult          29197
Senior          8041
Young Adult     3778
Name: age, dtype: int64
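
The bin edges `[21, 32, 65, 111]` produce the ranges described above only if each bin includes its left edge and excludes its right edge. With `pandas.cut` that corresponds to `right=False`; whether `make_discrete()` is built on `pandas.cut` is an assumption here, so treat this as a sketch of the binning logic only.

```python
import pandas as pd

bins = [21, 32, 65, 111]
labels = ["Young Adult", "Adult", "Senior"]

# Ages chosen to sit on the boundaries of each bin.
ages = pd.Series([21, 31, 32, 64, 65, 110])

# right=False makes the bins [21, 32), [32, 65), [65, 111).
binned = pd.cut(ages, bins=bins, labels=labels, right=False)
print(binned.tolist())
# ['Young Adult', 'Young Adult', 'Adult', 'Adult', 'Senior', 'Senior']
```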

Make a dummy variable from the column `NumberOfTimes90DaysLate` with `make_dummy()`. A value of 0 will represent an individual who has never been 90 days late and a value of 1 will represent an individual who has been 90 days late at least once.

In [10]:
pl.make_dummy(df, "NumberOfTimes90DaysLate")
df["NumberOfTimes90DaysLate_dummy"].value_counts()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df[new_col][df[col] >= 1] = 1


0    37586
1     3430
Name: NumberOfTimes90DaysLate_dummy, dtype: int64
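
The warning in the output above is raised by pandas when a value is assigned through chained indexing, as in the line `df[new_col][df[col] >= 1] = 1` that it quotes. One way to build the same 0/1 column with a single assignment is sketched below; `make_dummy_sketch` is a hypothetical name, and the rest of the library's `make_dummy()` is not shown in this notebook.

```python
import pandas as pd


def make_dummy_sketch(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Hypothetical helper: add `<col>_dummy`, 1 where col >= 1, else 0."""
    new_col = col + "_dummy"
    # A single column assignment; no chained indexing involved.
    df[new_col] = (df[col] >= 1).astype(int)
    return df


toy = pd.DataFrame({"NumberOfTimes90DaysLate": [0, 2, 0, 1]})
make_dummy_sketch(toy, "NumberOfTimes90DaysLate")
print(toy["NumberOfTimes90DaysLate_dummy"].tolist())
# [0, 1, 0, 1]
```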

### 5. Build decision tree classifier
Now we will define the columns we want to use as features, split the data, then use the training X and y data to fit the decision tree.

In [11]:
selected_features = [
    "RevolvingUtilizationOfUnsecuredLines",
    #"age",
    "zipcode",
    "NumberOfTime30-59DaysPastDueNotWorse",
    "DebtRatio",
    "MonthlyIncome",
    "NumberOfOpenCreditLinesAndLoans",
    "NumberOfTimes90DaysLate_dummy",
    "NumberRealEstateLoansOrLines",
    "NumberOfTime60-89DaysPastDueNotWorse",
    "NumberOfDependents"
]

test_size = 0.3
x_train, x_test, y_train, y_test = pl.split_data(df, selected_features, label, test_size)
dec_tree = pl.fit_decision_tree(x_train, y_train)
type(dec_tree)

sklearn.tree.tree.DecisionTreeClassifier
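
Since the fitted object is a scikit-learn `DecisionTreeClassifier`, the split-and-fit step can be sketched with scikit-learn directly. The two helper names, the `seed` parameter, and the toy data below are assumptions for illustration; the hyperparameters that `fit_decision_tree()` actually uses are not shown in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def split_data_sketch(df, features, label, test_size, seed=0):
    """Hypothetical helper: hold out a fraction of rows for testing."""
    return train_test_split(
        df[features], df[label], test_size=test_size, random_state=seed
    )


def fit_decision_tree_sketch(x_train, y_train, seed=0):
    """Hypothetical helper: fit a tree with scikit-learn's default settings."""
    tree = DecisionTreeClassifier(random_state=seed)
    tree.fit(x_train, y_train)
    return tree


# Toy data, not the credit data.
toy = pd.DataFrame({
    "DebtRatio": [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.15, 0.85, 0.25, 0.75],
    "MonthlyIncome": [5000, 1200, 4800, 1500, 5200, 1100, 4700, 1300, 5100, 1400],
    "SeriousDlqin2yrs": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})
x_train, x_test, y_train, y_test = split_data_sketch(
    toy, ["DebtRatio", "MonthlyIncome"], "SeriousDlqin2yrs", test_size=0.3
)
tree = fit_decision_tree_sketch(x_train, y_train)
print(type(tree).__name__, len(x_train), len(x_test))
```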

### 6. Evaluate classifier
Using the trained decision tree model and the held-out test X and y data, we will evaluate the accuracy of the model at a threshold of 0.4.

In [12]:
threshold = 0.4
pl.evaluate_model(dec_tree, x_test, y_test, threshold)

0.8132466477041853
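
One common reading of a threshold in this setting is: predict 1 whenever the model's estimated probability of the positive class is at least the threshold, then compare the predictions with the true labels. The sketch below assumes that reading, including the `>=` comparison; `accuracy_at_threshold` is a hypothetical name and `evaluate_model()` may work differently. With a scikit-learn classifier, the scores would typically come from `predict_proba(x_test)[:, 1]`.

```python
import numpy as np


def accuracy_at_threshold(positive_scores, y_true, threshold):
    """Hypothetical helper: accuracy after thresholding positive-class scores."""
    predicted = (np.asarray(positive_scores) >= threshold).astype(int)
    return float((predicted == np.asarray(y_true)).mean())


# Made-up scores and labels for four observations.
scores = np.array([0.10, 0.45, 0.35, 0.80])
truth = np.array([0, 1, 1, 0])

acc_04 = accuracy_at_threshold(scores, truth, 0.4)
acc_03 = accuracy_at_threshold(scores, truth, 0.3)
print(acc_04)  # 0.5
print(acc_03)  # 0.75
```

Lowering the threshold flips the third observation to a positive prediction, which is why the accuracy changes between the two calls.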