In [1]:
import os

from week2.hw2.src.neighbors import Neighbors



In [2]:
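# '__file__' is quoted on purpose: notebooks do not define __file__,
# so this resolves to the current working directory instead.
dirname = os.path.dirname(os.path.realpath('__file__'))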
csv = os.path.join(dirname, '../data/credit-data.csv')

# Data Ingestion
The first stage of the pipeline uses pandas to read the CSV data into a DataFrame; the last five rows are shown below. The ingest step may change as the rest of the pipeline evolves.

In [3]:
neighbors = Neighbors()
data = neighbors.ingest(csv)

data.tail()

| Unnamed: 0 | PersonID | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | zipcode | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 41011 | 123722 | 0 | 0.31136 | 48 | 60644 | 0 | 0.382311 | 4872.0 | 11 | 0 | 2 | 0 | 3.0 |
| 41012 | 123729 | 0 | 0.03881 | 45 | 60644 | 0 | 0.15613 | 6500.0 | 13 | 0 | 1 | 0 | 3.0 |
| 41013 | 123730 | 0 | 0.007576 | 74 | 60644 | 0 | 14.0 | NaN | 9 | 0 | 0 | 0 | 0.0 |
| 41014 | 123739 | 0 | 0.052153 | 72 | 60644 | 0 | 382.0 | NaN | 8 | 0 | 0 | 0 | 0.0 |
| 41015 | 123753 | 0 | 1.368872 | 60 | 60644 | 0 | 0.039417 | 3500.0 | 5 | 3 | 0 | 1 | 0.0 |
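
The `Unnamed: 0` column above is what pandas typically produces when a CSV was saved together with its index and is then read back without `index_col`. The following is a minimal sketch of an ingest step, assuming `Neighbors.ingest` wraps `pandas.read_csv` (its actual implementation is not shown in this notebook). The `ingest` helper and the inline sample CSV are hypothetical and only illustrate the behavior.

```python
import io

import pandas as pd

# Hypothetical stand-in for the credit data file: a tiny CSV whose first,
# unnamed column is a saved index. The values are illustrative only.
SAMPLE_CSV = """\
,PersonID,SeriousDlqin2yrs,DebtRatio,MonthlyIncome
0,1,0,0.38,4872.0
1,2,0,14.0,
2,3,1,0.04,3500.0
"""


def ingest(source, index_col=None):
    """Read CSV data into a DataFrame (assumed behavior of an ingest step)."""
    return pd.read_csv(source, index_col=index_col)


raw = ingest(io.StringIO(SAMPLE_CSV))
clean = ingest(io.StringIO(SAMPLE_CSV), index_col=0)

print(list(raw.columns))    # the saved index shows up as 'Unnamed: 0'
print(list(clean.columns))  # the saved index is used as the row index instead
print(clean["MonthlyIncome"].isna().sum())  # empty fields are read as NaN
```

Reading with `index_col=0` keeps the saved index out of the feature columns, which matters once columns are selected programmatically.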


# Discretization and Creation of Dummies
In the next phase of the pipeline, we set up the hypothesis that debt ratio ("DebtRatio" in the data) is the best indicator of financial distress ("SeriousDlqin2yrs").

Below, we discretize the debt ratio into four buckets and store the bucket label in a new column, "DebtClassification". We then create one dummy variable per bucket from that column.

In [4]:
data = neighbors.preprocess(data)
data['DebtClassification'] = neighbors.discretize(data, "DebtRatio", labels=['High Debt', 'Above Average Debt', 'Below Average Debt', 'Low Debt'])
data = neighbors.dummify(data, 'DebtClassification')
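
The two steps above can be sketched with plain pandas. This assumes that `neighbors.discretize` behaves like quantile binning with `pd.qcut` and that `neighbors.dummify` behaves like `pd.get_dummies`; neither implementation is shown in this notebook, and the sample values are illustrative. Note that `pd.qcut` assigns labels to bins in ascending order, so this sketch lists the labels from lowest to highest debt ratio. If `neighbors.discretize` works the same way, the order of the labels passed in the cell above determines which bucket each name refers to.

```python
import pandas as pd

# Illustrative debt ratios; not taken from the credit data set.
df = pd.DataFrame({"DebtRatio": [0.04, 0.16, 0.38, 0.9, 2.5, 14.0, 120.0, 382.0]})

# Labels ordered from the lowest-value bin to the highest-value bin,
# because pd.qcut assigns labels to bins in ascending order.
labels = ["Low Debt", "Below Average Debt", "Above Average Debt", "High Debt"]

# Quantile-based discretization into four equally populated buckets.
df["DebtClassification"] = pd.qcut(df["DebtRatio"], q=4, labels=labels)

# One indicator column per bucket, appended to the original frame.
dummies = pd.get_dummies(df["DebtClassification"])
dummies.columns = dummies.columns.astype(str)  # plain string column names
df = pd.concat([df, dummies], axis=1)

print(df)
```

Each row ends up with exactly one indicator set, which is the shape the classification step expects when the bucket names are used as feature columns.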

# Classification and Prediction
Using the four debt ratio buckets created above as the only features, we fit a K-Nearest-Neighbors model with 10 neighbors and predict on the held-out test features.

In [5]:
features = ['High Debt', 'Above Average Debt', 'Below Average Debt', 'Low Debt']
target = 'SeriousDlqin2yrs'
kwargs = {"n_neighbors": 10}
classifier, test_features, test_target = neighbors.classify(data, features, target, **kwargs)
prediction = neighbors.predict(classifier, test_features)
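
A minimal sketch of what a classify-and-predict step could look like with scikit-learn, assuming `neighbors.classify` splits the data into training and test sets and fits a `KNeighborsClassifier`. The split size, the random seed, and the synthetic data are assumptions for illustration, not the notebook's actual settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data: four one-hot bucket columns and a binary target
# whose positive rate rises with the bucket number.
buckets = rng.integers(0, 4, size=400)
X = np.eye(4, dtype=int)[buckets]
y = (rng.random(400) < 0.1 + 0.1 * buckets).astype(int)

# Hold out a test set (the 25% share and the seed are assumptions).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit on the training rows only, then predict the held-out rows.
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print(y_pred[:10])
```

With one-hot bucket features there are only four distinct points in feature space, so a test row's nearest neighbors are training rows from the same bucket, all at distance zero. The model can therefore produce at most one prediction pattern per bucket.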

# Evaluation
Finally, let's evaluate the model's predictions against the held-out test targets returned by the classification step.

In [7]:
evaluation = neighbors.evaluate_classifier(prediction, test_target)
print("Accuracy score for {} neighbors: {}".format(kwargs["n_neighbors"], evaluation))

Accuracy score for 10 neighbors: 0.8391028766455387
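
To put an accuracy score in context, it helps to compare it with the accuracy of always predicting the most common label. The sketch below assumes `evaluate_classifier` returns plain accuracy, as the printed message suggests. The `accuracy` and `majority_baseline` helpers and the labels are hypothetical and are not taken from the credit data.

```python
from collections import Counter


def accuracy(predicted, actual):
    """Share of positions where the prediction equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def majority_baseline(actual):
    """Accuracy obtained by always predicting the most common label."""
    return Counter(actual).most_common(1)[0][1] / len(actual)


# Illustrative labels only.
actual = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
predicted = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print(accuracy(predicted, actual))
print(majority_baseline(actual))
```

In this toy example both numbers are the same, so the predictions add nothing over the baseline even though most of them are correct.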


# Results and Conclusion
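This simple model reaches 83.9% accuracy on the test data. On its face, this suggests that debt ratio carries some signal about future financial distress, but accuracy alone cannot confirm it: the figure has to be compared against the share of the majority class in the target, since a model that always predicts the most common label would reach that accuracy without using any feature.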

## Future Work
* Cross-validation to find the best number of neighbors for the model
* Implementation of other models to test against KNN
* Use of other combinations of features to test against the current model
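
As a starting point for the first item, the number of neighbors could be chosen by cross-validation. This is a minimal sketch with scikit-learn's `GridSearchCV`; the synthetic data, the candidate grid, and the five folds are assumptions for illustration rather than recommended settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data with two numeric features and a binary target.
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# Candidate neighbor counts (an arbitrary illustrative grid).
param_grid = {"n_neighbors": [1, 3, 5, 10, 20]}

# 5-fold cross-validation over the grid, scored by accuracy.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The selected value should still be checked once on a test set that played no part in the search.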