
Tartan Data Science Cup III



Abstract

Tartan Data Science Cup III

  • Host : Capital One
  • Prize : AppleTV + USD 50 Amazon Card
  • Problem : Binary Classification
  • Evaluation : Brier Score
  • Period : Feb 4 2017 9AM ~ 6PM (9 hours)

Capital One hosted a one-day data science hackathon with the CMU Department of Statistics to challenge undergraduates and potential employees with a real-world machine learning problem. Loan data from Lending Club was provided, containing numeric and categorical variables describing the borrowers and the loans themselves. Using the train dataset, we were challenged to predict whether the 'loan_status' of each loan falls into one of the following groups (a minimal label-mapping sketch follows the list below):

  • Group 0: "Good" -- The loan is fully paid off ("Fully Paid"), in its grace period ("In Grace Period"), or currently being paid back ("Current").

  • Group 1: "Bad" -- The loan is in default ("Default"), payment on the loan is late ("Late (16-30 days)" or "Late (31-120 days)"), or the loan has been charged off ("Charged Off").
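
For concreteness, this grouping can be expressed in a few lines of pandas. The column name 'loan_status' matches the problem statement, but the target column name and the input/train.csv path below are assumptions for illustration, not necessarily the exact names used in the notebook:

```python
import pandas as pd

# Statuses belonging to each group, as defined above.
GOOD = {"Fully Paid", "In Grace Period", "Current"}                           # Group 0
BAD = {"Default", "Late (16-30 days)", "Late (31-120 days)", "Charged Off"}   # Group 1

train = pd.read_csv("input/train.csv")  # assumed path inside the input directory
# Binary target: 0 for "Good" statuses, 1 for "Bad" statuses.
train["target"] = train["loan_status"].isin(BAD).astype(int)
```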

The competition was co-hosted by Capital One and CMU. Four Capital One employees actively communicated with and assisted participants throughout the day, and they were very friendly. Staff from the CMU Department of Statistics were well organized and encouraging throughout the competition.

My team, 'Anything Random', approached this problem by engineering appropriate feature representations on top of a fine-tuned XGBoost model. More than 100 participants on 40+ teams competed in this one-day data science competition. 15 finalists were selected by prediction accuracy to move on to the second stage, where each team submitted a 2-3 page technical report describing the problem, methods, and results of its machine learning pipeline. 8 finalists were then selected to give a 5-minute presentation on their work.
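
The exact features and hyperparameters are in the notebook; the sketch below only illustrates the general shape of such a pipeline (one-hot encoding of categorical columns on top of the numeric ones, then an XGBoost classifier). Column names and parameter values here are placeholders, not the tuned settings from the competition:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Continuing from the label-mapping sketch above: `train` already carries a
# binary `target` column. The column lists below are hypothetical examples.
numeric_cols = ["loan_amnt", "int_rate", "annual_inc"]
categorical_cols = ["grade", "home_ownership", "purpose"]

X = pd.concat([train[numeric_cols], pd.get_dummies(train[categorical_cols])], axis=1)
y = train["target"]
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Placeholder hyperparameters, not the tuned values behind the reported scores.
model = XGBClassifier(n_estimators=500, max_depth=5, learning_rate=0.05,
                      subsample=0.8, colsample_bytree=0.8)
model.fit(X_tr, y_tr)
val_prob = model.predict_proba(X_val)[:, 1]  # predicted probability of the "Bad" class
```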

My team finished in 2nd place overall, and it was nice to hear that we placed 1st in prediction accuracy among all participants.

Cross-Validation Result

| Brier Score        | Numeric Only (9) | + Categorical (26) | + New Features (29) | + Outlier Clipping |
|--------------------|------------------|--------------------|---------------------|--------------------|
| RandomForest       | 0.0898           | 0.0857             | 0.0855              | 0.0854             |
| LogisticRegression | -                | -                  | -                   | 0.0861             |
| ExtraTrees         | -                | -                  | -                   | 0.0854             |
| XGBoost            | -                | -                  | -                   | 0.0844             |
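
The Brier score is simply the mean squared difference between the predicted probability and the actual 0/1 outcome, so lower is better. A minimal sketch of how a cross-validated comparison like the table above could be run (using `X` and `y` from the pipeline sketch earlier; the model settings here are plain defaults, not the configurations behind the reported numbers):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

models = {
    "RandomForest": RandomForestClassifier(n_estimators=300, random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "ExtraTrees": ExtraTreesClassifier(n_estimators=300, random_state=0),
    "XGBoost": XGBClassifier(),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in models.items():
    scores = []
    for tr_idx, va_idx in cv.split(X, y):
        model.fit(X.iloc[tr_idx], y.iloc[tr_idx])
        prob = model.predict_proba(X.iloc[va_idx])[:, 1]
        # Brier score: mean((prob - outcome)^2) on the held-out fold.
        scores.append(brier_score_loss(y.iloc[va_idx], prob))
    print(name, round(np.mean(scores), 4))
```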

How to Run

[Data]

Place the data in the input directory.

[Code]

The above results can be reproduced by running all cells in the anything_random.ipynb notebook.

Make sure you are on Python 3.5.2 with the library versions specified in requirements.txt.

Overall Result

  • 1st place in Prediction Accuracy
  • 2nd place in Overall [Prediction Accuracy + Report Write-up + Oral Presentation]